E. Ehrlich · A. Jasra · N. Kantas
DOI 10.1007/s11009-013-9357-4
E. Ehrlich · N. Kantas
Department of Mathematics, Imperial College London, London, SW7 2AZ, UK
E. Ehrlich
e-mail: elena.ehrlich05@ic.ac.uk
A. Jasra (B)
Department of Statistics & Applied Probability, National University of Singapore,
Singapore, 117546, Singapore
e-mail: staja@nus.edu.sg
N. Kantas
Department of Statistical Science, University College London, London, WC1E 6BT, UK
e-mail: n.kantas@ucl.ac.uk, n.kantas@imperial.ac.uk
Methodol Comput Appl Probab
1 Introduction
with B ∈ B(Y) and gθ (yt |xt ) being the conditional likelihood density. The HMM is
given by Eqs. 1 and 2 and is often referred to in the literature also as a general
state-space model. Here θ is treated as an unknown and static model parameter, which is to be estimated using maximum likelihood estimation (MLE). This is
an important problem with many applications ranging from financial modeling to
numerical weather prediction.
Statistical inference for the class of HMMs described above is typically non-trivial.
In most scenarios of practical interest one cannot calculate the marginal likelihood
of n given observations,

$$ p_\theta(y_{1:n}) = \prod_{t=1}^{n} \int_{\mathsf X} g_\theta(y_t \mid x_t)\, p_\theta(x_t \mid y_{1:t-1})\, \mathrm{d}x_t, $$

where y_{1:n} := (y_1, . . . , y_n) are considered fixed and p_θ(x_t|y_{1:t−1}) is the predictor density at time t. Hence, as the likelihood is not analytically tractable, one must
resort to numerical methods both to compute and to maximize pθ (y1:n ) w.r.t. θ. When
θ is known, a popular collection of techniques for both estimating the likelihood as
well as performing filtering or smoothing are sequential Monte Carlo (SMC) methods
(Doucet et al. 2000; Cappé et al. 2005). SMC techniques simulate a collection of N
samples (known as particles) in parallel, sequentially in time and combine importance
sampling and resampling to approximate a sequence of probability distributions
of increasing state-space dimension, known point-wise up to a multiplicative constant. These
techniques provide a natural estimate of the likelihood pθ (y1:n ). The estimate is
quite well understood and is known to be unbiased (Del Moral 2004, Chapter 9).
In addition, the relative variance of this quantity is known to increase linearly with
the number of data-points, n, (Cérou et al. 2011; Whiteley et al. 2012). When θ is
unknown, as is the case here, estimation of θ is further complicated, because of the
path-degeneracy caused to the population of the samples by the resampling step of
SMC. This issue has been well documented in the literature (Andrieu et al. 2005;
Kantas et al. 2011). However, there are still many specialized SMC techniques which
can successfully be used for parameter estimation of HMMs in a wide variety of
contexts; see Kantas et al. (2011) for a comprehensive overview. In particular, for
MLE a variety of SMC methods have been proposed in the literature (Cappé 2009;
Del Moral et al. 2009; Poyiadjis et al. 2011). Note that the techniques in these papers
require the evaluation of gθ (y|x) and potentially gradient vectors as well.
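To fix ideas, the standard SMC likelihood estimate referred to above can be sketched as a bootstrap particle filter. The example below is ours, for illustration only: it assumes a simple linear Gaussian model, and its weights evaluate the density gθ(y|x) explicitly, which is exactly the step this paper's setting rules out.

```python
import math
import random

def bootstrap_likelihood(y, phi=0.9, sigma_v=0.2, sigma_w=0.3, N=500, seed=0):
    """Bootstrap particle filter for the linear Gaussian HMM
        X_t = phi * X_{t-1} + sigma_v * V_t,   Y_t = X_t + sigma_w * W_t,
    returning the log of the standard SMC estimate of p_theta(y_{1:n}).
    Note that the weights evaluate g_theta(y|x) explicitly."""
    rng = random.Random(seed)
    x = [0.0] * N  # particles started at the origin for simplicity
    log_like = 0.0
    for yt in y:
        # propagate through the transition density f_theta
        x = [phi * xi + sigma_v * rng.gauss(0.0, 1.0) for xi in x]
        # weight by the observation density g_theta(y_t | x_t)
        w = [math.exp(-0.5 * ((yt - xi) / sigma_w) ** 2) /
             (sigma_w * math.sqrt(2.0 * math.pi)) for xi in x]
        log_like += math.log(sum(w) / N)  # log p^N(y_t | y_{1:t-1})
        x = rng.choices(x, weights=w, k=N)  # multinomial resampling
    return log_like
```

The product of the per-step averaged weights is the unbiased likelihood estimate discussed above; the log is accumulated instead for numerical stability.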
In this article, we consider the scenario where gθ (y|x) is intractable. By this
we mean that one cannot calculate it for given y or x either because the density
does not exist or because it is computationally too expensive, e.g. due to the high dimensionality of x. In addition, we will assume that an unbiased estimator for gθ (y|x)
is also not available. Instead we will assume that one can sample from gθ (·|x) for
any value of x. In this case, one cannot use the standard or the more advanced
SMC methods that are mentioned above (or indeed many other simulation based
approximations). Hence the problem of parameter estimation is very difficult. One
approach which is designed to deal with this problem is Approximate Bayesian
Computation (ABC). ABC is an approach that uses simulated samples from the
likelihood to deal with the restriction of not being able to evaluate its density. Although there is nothing inherently Bayesian about this, the approach owes its name to its early
success in Bayesian inference; see Marin et al. (2012) and the references therein for
more details. Although here we will focus only upon ABC ideas, we note that there
are possible alternatives, such as Gauchi and Vila (2013), and refer the interested
reader to Gauchi and Vila (2013), Jasra et al. (2012) for a discussion of the relative
merits of ABC.
In the context of HMMs when the model parameters θ are known, the use of ABC
approximations has appeared in Jasra et al. (2012), McKinley et al. (2009) as well as
associated computational methods for filtering and smoothing in Jasra et al. (2012),
Martin et al. (2012), Calvet and Czellar (2012). When the parameter is unknown, the
statistical properties of ML estimators for θ based on ABC approximations have been
studied in detail in Dean et al. (2010), Dean and Singh (2011). ABC approximations lead to a bias, which can be controlled to arbitrary precision via a parameter ε > 0. This bias typically goes to zero as ε → 0. In this article we aim to:
1. Investigate the bias in the log-likelihood and the gradient of the log-likelihood
that is induced by the ABC approximation for a fixed data set,
2. Develop a gradient based approach based on SMC with computational cost
O(N) that allows one to estimate the model parameters in either a batch or on-
line fashion.
In order to implement such an approach one must obtain numerical estimates of the
log-marginal likelihood as well as its gradient. Thus, it is important to understand
what happens to the bias of the ABC approximation of these latter quantities, as
the time parameter (or equivalently number of data-points, n) grows. We establish,
under some assumptions, that this ABC bias, for both quantities is no worse than
O(n). This result is closely associated to the theoretical work in Dean et al. (2010),
Dean and Singh (2011). These former results indicate that the ABC approximation is
amenable to numerical implementation and parameter estimation will not necessar-
ily be dominated by the bias. We will discuss why this is the case later in Remarks 2.1
and 2.2. For the numerical implementation of MLE we will introduce a gradient-
free approach based on using finite differences with Simultaneous Perturbation
Stochastic Approximation (SPSA) (Spall 1992, 2003). This extends the work in Poyiadjis et al. (2006) to the case when the likelihood is intractable and ABC
approximations are used.
This paper is structured as follows. In Section 2 we discuss the estimation pro-
cedure using ABC approximations. Our bias result is also given. In Section 3 our
computational strategy is outlined. In Section 4 the method is investigated from a
numerical perspective. In Section 5 the article is concluded with some discussion of
future work. The proofs of our results can be found in the appendices.
where we recall that θ ∈ ⊂ Rdθ is the vector of model parameters, xt ∈ X are the
hidden states and yt ∈ Y the observations. The joint filtering density can be computed
recursively using the well known Bayesian filtering recursions:
$$ \pi_\theta(x_{0:t} \mid y_{1:t-1}) = \pi_\theta(x_{0:t-1} \mid y_{1:t-1})\, f_\theta(x_t \mid x_{t-1}) \qquad (3) $$
Note that this is a batch or off-line procedure, which means that one needs to first collect the complete data-set and then compute the ML estimate. In this paper
we will focus on computing ML estimates based on gradient methods. In this case one
may use iteratively for k ≥ 0
$$ \lim_{n\to\infty} \frac{1}{n}\, l_\theta(y_{1:n}) = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \log p_\theta(y_t \mid y_{1:t-1}). $$
Under appropriate regularity and ergodicity conditions for the augmented Markov
chain (Xt , Yt , pθ (xt |y1:t−1 ))t≥0 (Le Gland and Mevel 1997; Tadic and Doucet 2005)
the average log-likelihood is an ergodic average and this leads to a gradient update
scheme based on Stochastic Approximation (Benveniste et al. 1990). For a similar
step-size sequence (a_t)_{t≥1} one may update θ_t using the gradient of the predictive log-likelihood, which can be written here as

$$ \nabla \log p_{\theta_{0:t}}(y_t \mid y_{1:t-1}) = \nabla \log p_{\theta_{0:t}}(y_{1:t}) - \nabla \log p_{\theta_{0:t-1}}(y_{1:t-1}), $$
where the subscript θ0:t in the notation for ∇ log pθ0:t (y1:t ) indicates that at each time
t the quantities in Eqs. 3–5 are computed using the current parameter estimate θt . The
asymptotic properties of RML have been studied in Arapostathis and Marcus (1990),
Le Gland and Mevel (1995, 1997, 2000) for finite state-space HMMs and Tadic and
Doucet (2005), Tadic (2009) in more general cases. It is shown that under regularity
conditions this algorithm converges towards a local maximum of the average log-
likelihood, whose maximum lies at the ‘true’ parameter value.
In this article, we would like to implement approximate versions of RML and off-
line ML schemes when both the following cases hold:
• We can sample from the conditional distribution of Y|x, for any fixed θ and x.
• We cannot or do not want to evaluate the conditional density of Y|x, gθ (y|x) and
do not have access to an unbiased estimate of it.
Apart from likelihoods which do not admit computable densities, such as some stable distributions, this setting is also relevant when one is interested in using SMC methods but cannot afford to evaluate gθ (y|x) when dx is large. SMC methods
for filtering do not always scale well with the dimension of the hidden state dx , often
requiring a computational cost O(κ dx ), with κ > 1 (Beskos et al. 2011; Bickel et al.
2008). A more detailed discussion on the difficulties of using SMC methods in high
dimensions is beyond the scope of this article, but we remark that the ideas in this paper can be relevant in this context.
To facilitate ML estimation when the bullet points above hold we will resort to
ABC approximations of the ideal MLE procedures above. We will present a short
overview here and refer the reader to Dean et al. (2010), Yildirim et al. (2013b) for
more details.
First, we consider an ABC approximation of the joint smoothing density as in
Jasra et al. (2012), McKinley et al. (2009):
$$ \pi_{\theta,\epsilon}(u_{1:n}, x_{0:n} \mid y_{1:n}) = \frac{\mu_\theta(x_0) \prod_{t=1}^{n} K_\epsilon(y_t, u_t)\, g_\theta(u_t \mid x_t)\, f_\theta(x_t \mid x_{t-1})}{p_{\theta,\epsilon}(y_{1:n})} \qquad (6) $$

with the ABC marginal likelihood being

$$ p_{\theta,\epsilon}(y_{1:n}) = \int_{\mathsf X^{n+1} \times \mathsf Y^{n}} \mu_\theta(x_0) \prod_{t=1}^{n} K_\epsilon(y_t, u_t)\, g_\theta(u_t \mid x_t)\, f_\theta(x_t \mid x_{t-1})\, \mathrm{d}u_{1:n}\, \mathrm{d}x_{0:n} \qquad (7) $$
can be viewed as the likelihood of an alternative "perturbed" HMM that uses the same transition density but has gθ,ε as the likelihood. It can be easily shown that this HMM will admit a marginal likelihood of (1/Z_ε^n) pθ,ε(y1:n), which is proportional to the one written above in Eq. 7, but the proportionality constant does not depend on θ. Note that a critical condition for this to hold is that we choose Kε(yt, ut) such that the normalizing constant Zε = ∫ Kε(yt, ut) dut of Eq. 9 does not depend upon xt or θ.
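For intuition, the perturbed likelihood gθ,ε(y|x), i.e. the Kε-smoothed version of gθ(·|x), can be estimated using only draws from gθ(·|x), without ever evaluating the density. A minimal sketch with a uniform kernel on the real line follows; the function name, defaults, and the `sample_g` interface are ours, for illustration:

```python
import random

def abc_likelihood_estimate(y, sample_g, M=1000, eps=0.5, seed=0):
    """Monte Carlo estimate of the perturbed likelihood
        g_{theta,eps}(y | x) = (1 / Z_eps) * int K_eps(y, u) g_theta(u | x) du,
    using only draws u^j ~ g_theta(.|x) supplied by sample_g(rng).
    K_eps is the uniform kernel 1{|y - u| <= eps}, so Z_eps = 2 * eps."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(M) if abs(y - sample_g(rng)) <= eps)
    return hits / (M * 2.0 * eps)
```

For example, with `sample_g` drawing from N(1, 1) and y = 1, the estimate approaches the ε-smoothed density value (about 0.38 for ε = 0.5) as M grows.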
The ABC–MLE approach we consider in this article will then be to use MLE for the perturbed HMM defined by gθ,ε. For the off-line case let

$$ l_\theta^{\epsilon}(y_{1:n}) = \log p_{\theta,\epsilon}(y_{1:n}). $$
Results on the consistency and efficiency of this method as n grows can be found
in Dean et al. (2010), Dean and Singh (2011). Under some regularity and other
assumptions (such as the data originating from the HMM considered), the bias of the maximum likelihood estimator (MLE) is O(ε). In addition, one may avoid encountering this bias asymptotically, if one adds noise to the observations in an appropriate way.
This procedure is referred to as noisy ABC, and then one can recover the true
parameter. We remark that the methodology that is considered in this article can
easily incorporate noisy ABC. However, there may be some reasons why one may
not want to use noisy ABC:
1. The consistency results (currently) depend upon the data originating from the
original HMM;
2. The current simulation-based methodology may not be able to be used efficiently
for ε close to zero.
For point 1., if the data do not originate from the HMM of interest, what happens with regards to the asymptotics of noisy ABC for HMMs has not been studied. It may be that some investigators might be uncomfortable with assuming that the
data originate from exactly the HMM being fitted. For point 2., the asymptotic bias (which is, under assumptions, either O(ε) or O(ε²); Dean et al. 2010; Dean and Singh 2011) could be less than the asymptotic variance (under assumptions O(ε²); Dean et al. 2010; Dean and Singh 2011), as ε could be much bigger than unity when
using current simulation methodology. We do not use noisy ABC in this article,
but acknowledge its fundamental importance with regards to parameter estimation
associated to ABC for HMMs; our approach is intended for cases where points
similar to 1.−2. need to be taken into account.
For the ABC–RML we will define the time-varying recursive log-likelihood as

$$ r_{\theta_{0:t}}^{\epsilon}(y_{1:t}) = \log p_{\theta_{0:t},\epsilon}(y_t \mid y_{1:t-1}), $$
where the subscript θ0:t means again that at each time t one computes all the relevant
quantities in Eqs. 3–5 (with gθ,ε substituted for gθ ) using θt as the parameter
value and θ0:t−1 has been used similarly in all the previous times. Finally we write the
ABC–RML recursion for the parameter as
(A1) Lipschitz Continuity of the Likelihood. There exists L < +∞ such that for any x ∈ X, y, y′ ∈ Y, θ ∈ Θ,

$$ |g_\theta(y \mid x) - g_\theta(y' \mid x)| \le L\, |y - y'|. $$

(A3) Boundedness of Likelihood and Transition. There exist 0 < C < \bar{C} < +∞ such that for all x, x′ ∈ X, y ∈ Y, θ ∈ Θ,

$$ C \le f_\theta(x' \mid x) \le \bar{C}, \qquad C \le g_\theta(y \mid x) \le \bar{C}. $$

(A4) Lipschitz Continuity of the Gradient of the Likelihood. fθ(x′|x), gθ(y|x) are differentiable in θ for each x, x′ ∈ X, y ∈ Y. In addition, there exists L < +∞ such that for any x ∈ X, y, y′ ∈ Y, θ ∈ Θ,

$$ \|\nabla g_\theta(y \mid x) - \nabla g_\theta(y' \mid x)\| \le L\, |y - y'|. $$

(A5) Boundedness of Gradients of the Likelihood and Transition. There exist 0 < C < \bar{C} < +∞ such that for all x, x′ ∈ X, y ∈ Y, θ ∈ Θ,

$$ C \le \|\nabla f_\theta(x' \mid x)\| \le \bar{C}, \qquad C \le \|\nabla g_\theta(y \mid x)\| \le \bar{C}. $$
Whilst it is fairly easy to find useful simple models where the above conditions do
not hold uniformly for θ , we remark that the emphasis here is to provide intuition for
the methodology and for this reason similar conditions are popular in the literature,
e.g. Del Moral et al. (2009, 2011), Dean et al. (2010), Tadic and Doucet (2005).
We first present the result on the ABC bias of the log-likelihood. The proof is in
Appendix B.
Proposition 2.1 Assume (A1–A3). Then there exists a C < +∞ such that for any n ≥ 1, μθ ∈ P(X), ε > 0, θ ∈ Θ we have:

$$ |l_\theta^{\epsilon}(y_{1:n}) - l_\theta(y_{1:n})| \le C\, n\, \epsilon. $$
Remark 2.1 The above proposition gives some simple guarantees on the bias of
the ABC log-likelihood. When using SMC algorithms to approximate log( pθ (y1:n )),
the overall error will be decomposed into the deterministic bias that is present
from the ABC approximation (that in Proposition 2.1) and the numerical error of
approximating the log-likelihood. Under some assumptions, the L2 −error of the
SMC estimate of the log-likelihood should not deteriorate any faster than linearly in
time; this is due to the results cited previously. Thus, as the time parameter increases,
the ABC bias of the log-likelihood will not necessarily dominate the simulation-
based error that would be present even if gθ is evaluated.
Theorem 2.1 Assume (A1–A5). Then there exists a C < +∞ such that for any n ≥ 1, μθ ∈ P(X), μ′θ ∈ M(X), ε > 0, θ ∈ Θ we have:

$$ |\nabla l_\theta^{\epsilon}(y_{1:n}) - \nabla l_\theta(y_{1:n})| \le C\, n\, \big(\epsilon^2 + \epsilon\, \|\mu_\theta'\|\big). $$
Remark 2.2 The above Theorem again provides some explicit guarantees when using
an ABC approximation along with SMC-based numerical methods. For example, if one considers approximating gradients in an ABC context as proposed in Yildirim et al. (2013a), then from the results of Del Moral et al. (2011) one expects the variance of the SMC estimates to increase only linearly in time. Again, as time
increases the ABC bias does not necessarily dominate the variance that would be
present even if gθ is evaluated (i.e. one uses SMC on the true model).
Remark 2.3 The result in Theorem 2.1 can be found in Eq. 72 of Dean et al. (2010) and its direct limit (as ε → 0) in Dean and Singh (2011). However, we adopt a new (and fundamentally different) proof technique, with a substantially more elaborate proof.
Algorithm 1 (ABC–SMC)

• For t = 1, . . . , n
  – Step 1: For i = 1, . . . , N, sample the next state x_t^{(i)} ∼ q_{t,θ}(·|x_{t−1}^{(i)}).
    ∗ For j = 1, . . . , M: sample auxiliary observation samples u_t^{(j,i)} ∼ gθ(·|x_t^{(i)}).
  – Step 2: Compute weights

$$ W_t^{(i)} \propto W_{t-1}^{(i)}\, \widetilde W_t^{(i)}, \qquad \sum_{i=1}^{N} W_t^{(i)} = 1, \qquad \widetilde W_t^{(i)} = \frac{\big[\frac{1}{M}\sum_{j=1}^{M} K_\epsilon(y_t, u_t^{(j,i)})\big]\, f_\theta(x_t^{(i)} \mid x_{t-1}^{(i)})}{q_{t,\theta}(x_t^{(i)} \mid x_{t-1}^{(i)})}. $$
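A single time step of this weighting scheme can be sketched in code, under the simplifying assumption that the proposal q_{t,θ} is the transition density fθ (so the fθ/q ratio cancels) and that Kε is a uniform kernel. All names and interfaces below are illustrative, not the paper's:

```python
import random

def abc_smc_step(particles, weights, y_t, propagate, sample_obs, M=10, eps=0.1, seed=0):
    """One time step of an ABC-SMC filter in the spirit of Algorithm 1.

    propagate(x, rng): draws x_t ~ f_theta(.|x_{t-1}); used as the proposal,
        so the f_theta / q_{t,theta} ratio in the weight cancels.
    sample_obs(x, rng): draws a pseudo-observation u ~ g_theta(.|x); the
        density g_theta itself is never evaluated.
    Returns (new particles, normalised weights, average of the W~_t^(i))."""
    rng = random.Random(seed)
    new_particles, incr = [], []
    for x in particles:
        xt = propagate(x, rng)
        # W~_t^(i): fraction of the M pseudo-observations within eps of y_t
        k = sum(1 for _ in range(M) if abs(y_t - sample_obs(xt, rng)) <= eps)
        new_particles.append(xt)
        incr.append(k / M)
    unnorm = [w * v for w, v in zip(weights, incr)]
    tot = sum(unnorm)
    n = len(particles)
    if tot == 0.0:  # every particle missed: fall back to uniform weights
        return new_particles, [1.0 / n] * n, 0.0
    return new_particles, [w / tot for w in unnorm], sum(incr) / n
```

The third return value is the per-step quantity whose running product gives the ABC likelihood estimate used later for parameter estimation.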
3 Computational Strategy
$$ \pi_{\theta,\epsilon}\big(x_{0:n}, u_{1:n}^{1:M} \mid y_{1:n}\big) \propto \mu_\theta(x_0) \prod_{t=1}^{n} \Big[\frac{1}{M} \sum_{j=1}^{M} K_\epsilon(y_t, u_t^{j})\Big] f_\theta(x_t \mid x_{t-1}) \prod_{j=1}^{M} g_\theta(u_t^{j} \mid x_t) \qquad (13) $$
where for every t we use this time M independent samples from the likelihood,
u_t^j ∼ gθ (·|xt ), j = 1, . . . , M. When one integrates out u_{1:n}^1, . . . , u_{1:n}^M, then the targeted
sequence is the same as in Section 2.2, which targets a perturbed HMM with the
likelihood being gθ,ε shown earlier in Eq. 9. Of course, in terms of estimating θ and
MLE, again this yields the same bias as the original ABC approximation, but there are still substantial computational improvements. This is because, as M grows, the behavior gets closer to that of an ideal marginal SMC algorithm that directly targets the perturbed HMM without the auxiliary u variables. We proceed by presenting first
SMC when the model parameters θ are known, and then show how Simultaneous Perturbation Stochastic Approximation can be used when they are not.
For the sake of clarity, and for this sub-section only, consider θ to be fixed and known.
In Algorithm 1 we present the ABC–SMC algorithm of Jasra et al. (2012), which is
used to perform filtering for the perturbed HMM with likelihood gθ, and transition
density fθ. The basic design elements are the importance sampling proposals qt,θ for
the weights, the number of particles N, the number of auxiliary observation samples
M and the ABC precision tolerance . The resampling step is presented here as
optional, but note that to obtain good performance it is necessary to use it when the variance of the weights is high or the effective sample size is low. For more details we refer the reader to Jasra et al. (2012).
The algorithm allows us to approximate πθ, in Eq. 13 using the particles. For
instance, the particle approximation of the marginal of πθ, w.r.t. the u variables is
shown in Eq. 12. In addition, one also obtains particle approximations for pθ,ε (y1:n ) and pθ,ε (yt |y1:t−1 ) as defined in Eqs. 7 and 8, which are critical quantities for parameter estimation. We denote the SMC estimates of these quantities as p^N_{θ,ε}(y1:n) and p^N_{θ,ε}(yt|y1:t−1) respectively. These are given as follows:

$$ p^{N}_{\theta,\epsilon}(y_{1:n}) = \prod_{t=1}^{n} \Big( \frac{1}{N} \sum_{i=1}^{N} \widetilde W_t^{(i)} \Big), $$

with

$$ p^{N}_{\theta,\epsilon}(y_t \mid y_{1:t-1}) = \frac{1}{N} \sum_{i=1}^{N} \widetilde W_t^{(i)}, $$
These estimates are unbiased, i.e. E^N[p^N_{θ,ε}(y_{1:n})] = p_{θ,ε}(y_{1:n}), where E^N[·] denotes the expectation w.r.t. the distribution of all the random variables in Algorithm 1. A similar result holds for p^N_{θ,ε}(y_n|y_{1:n−1}); see Del Moral (2004, Theorems 7.4.2 and 7.4.3, p. 239) for a proof and more details. Note still that log p^N_{θ,ε}(y_{1:n}) or log p^N_{θ,ε}(y_n|y_{1:n−1}) will be biased approximations of the ideal quantities. A usual remedy is to correct the bias up to the first order of a Taylor expansion and estimate the θ-dependent parts of log pθ,ε(y1:n) and log pθ,ε(yn|y1:n−1) instead with

$$ \hat l^{N}_{\theta,\epsilon} = \log p^{N}_{\theta,\epsilon}(y_{1:n}) + \frac{1}{2N}\, p^{N}_{\theta,\epsilon}(y_{1:n})^{-2}, \qquad (14) $$
and
$$ \hat r^{N}_{t,\theta,\epsilon} = \log\Big( \frac{1}{N} \sum_{i=1}^{N} \widetilde W_t^{(i)} \Big) + \frac{1}{2N} \Big( \frac{1}{N} \sum_{i=1}^{N} \widetilde W_t^{(i)} \Big)^{-2}. \qquad (15) $$
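In code, this first-order correction is a one-line adjustment of the raw log-estimate; the sketch below follows the displayed form of Eqs. 14 and 15 (the function name is ours):

```python
import math

def bias_corrected_log(estimate, N):
    """First-order bias correction for the log of an unbiased likelihood
    estimate: log p^N is biased even though p^N is not, and a second-order
    Taylor expansion of log(.) around E[p^N] motivates an O(1/N) additive
    correction, here in the form displayed in Eqs. 14-15."""
    return math.log(estimate) + (1.0 / (2.0 * N)) * estimate ** (-2)
```

For example, with `estimate = 1.0` and `N = 100`, the correction adds 1/(2N) = 0.005 to the raw log-estimate of 0.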
Remark 3.1 The parameter ε determines the accuracy of the marginal likelihoods of the perturbed HMM compared to the original one. At the same time, if ε is very low one may require a high value for M. This can be computed adaptively as in Del Moral et al. (2012), Jasra et al. (2012). We also remark that a drawback of this algorithm is that when d_y grows, with ε and N remaining fixed, one cannot expect the algorithm to work well for every ε. Typically one must increase ε to get reasonable results with moderate computational effort, and this is at the cost of increasing the bias. To maintain ε at a reasonable level, one must consider more sophisticated strategies which are not investigated here.
Remark 3.2 We note that, after suppressing θ, if the HMM can be written in a state
space model form:
Yt = ξ(Xt , Wt )
Xt = ϕ(Xt−1 , Vt )
Fig. 1 A typical run of the offline parameter estimates obtained by the KF, SMC, and ABC–SMC
for the linear Gaussian HMM, along with the ML estimators for θ
$$ \theta_{k+1}(m) = \theta_k(m) + a_k\, \frac{\hat l^{N}_{\theta_k + c_k \Delta_k,\, \epsilon} - \hat l^{N}_{\theta_k - c_k \Delta_k,\, \epsilon}}{2\, c_k\, \Delta_k(m)}. $$
where X0 = x0 ∈ X is known, both (Vn)_{n≥1} and (Wn)_{n≥0} are i.i.d. noise sequences independent of each other, and ξ, ϕ are appropriate functions. Suppose that one can evaluate:
Similar to Murray et al. (2011), Yildirim et al. (2013b), one can construct a ‘collapsed’
ABC approximation
$$ \pi_\epsilon(w_{1:n}, v_{1:n}, u_{1:n} \mid y_{1:n}) \propto \prod_{t=1}^{n} K_\epsilon\Big( \xi\big(\varphi^{(t)}(x_0, v_{1:t}), w_t\big),\; \xi\big(\varphi^{(t)}(x_0, v_{1:t}), u_t\big) \Big)\, p(w_t)\, p(v_t)\, p(u_t). $$
Hence a version of the SMC algorithm in Algorithm 1 can be derived which needs to sample from neither the dynamics of the data nor the transition density of the hidden Markov chain. This representation, however, does not always apply.
We proceed by describing SPSA as a gradient-free method for off-line or batch ABC–MLE; the procedure can be found in Algorithm 2. This algorithm does not require one to evaluate gθ or its gradient. In this context one is interested in estimating θ such that

$$ \nabla l_\theta = 0 $$

holds, where we have dropped the dependence on y1:n for simplicity. Recall that here we do not have an expression for ∇lθ with which to pursue a standard Robbins–Monro procedure (Benveniste et al. 1990). One way around this would be to use a finite difference approximation to estimate the gradient w.r.t. the m-th element of θ as
$$ \frac{\hat l_{\theta + c\, e_m} - \hat l_{\theta - c\, e_m}}{2c}, $$

where e_m is a unit magnitude vector that is zero in every direction except m, and l̂_• is an unbiased estimate of l_•. To avoid having to do 2dθ evaluations of these estimates in total, two for each direction, SPSA has been proposed in Spall (1992) so that the gradient update requires only 2 evaluations. Instead we perturb θ using c_k Δ_k, where Δ_k is a dθ-dimensional zero mean random vector such that E[|Δ_k(m)|^{−1}], or some higher inverse moment, is bounded. In this case we have used the most popular choice, with each entry of Δ_k being ±1 Bernoulli distributed, and the estimates l̂_• are the bias-corrected versions as in Eq. 14. For more details on the conditions and the convergence of this Stochastic Approximation method we refer the reader to Spall (1992), and for useful practical suggestions regarding the implementation to Spall (2003).
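A generic SPSA maximiser along these lines can be sketched as follows. The step-size constants are illustrative defaults (the decay exponents 0.602 and 0.101 are those suggested in Spall 2003), and `l_hat` stands for any possibly noisy evaluation of the objective, such as the bias-corrected estimates above:

```python
import random

def spsa_maximize(l_hat, theta0, iters=2000, a=0.2, c=0.1,
                  alpha=0.602, gamma=0.101, seed=0):
    """Gradient-free maximisation of theta -> l_hat(theta) via SPSA (Spall 1992).
    Each iteration uses exactly two evaluations of l_hat, whatever dim(theta)."""
    rng = random.Random(seed)
    theta = list(theta0)
    for k in range(1, iters + 1):
        ak = a / k ** alpha
        ck = c / k ** gamma
        # Bernoulli +/-1 simultaneous perturbation of every coordinate at once
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        lp = l_hat([t + ck * d for t, d in zip(theta, delta)])
        lm = l_hat([t - ck * d for t, d in zip(theta, delta)])
        # the same (lp - lm) difference is reused for every coordinate
        theta = [t + ak * (lp - lm) / (2.0 * ck * d) for t, d in zip(theta, delta)]
    return theta
```

On a deterministic concave test function such as l(θ) = −(θ₁ − 1)² − (θ₂ + 2)², the iterates settle near the maximiser (1, −2).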
4 Numerical Simulations
We consider two numerical examples that are designed to investigate the accuracy
and behavior of our numerical ABC–MLE algorithms. In order to do this, we
consider scenarios where gθ is a well behaved density, which we nevertheless avoid computing.
In the first example we look at a linear Gaussian model and in the second a HMM
involving the Lorenz ’63 model (Lorenz 1963).
Yt = Xt + σw Wt
Xt = φ Xt−1 + σv Vt ,
Algorithm 3 (ABC–RML using SPSA)

• For i = 1, . . . , N sample independently x_0^{(i)} ∼ μθ. Set W_0^{(i)} = 1/N.
• For t = 1, . . . , n
  – For m = 1, . . . , dθ, sample Δ_t(m) independently from a Bernoulli distribution with success probability 0.5 and support {−1, 1}.
  – Set θ_t^+ = θ_t + c_t Δ_t and θ_t^− = θ_t − c_t Δ_t. For each value use (x_{0:t−1}^{(i)}, W_{t−1}^{(i)}) to compute Steps 1 and 2 of Algorithm 1 (ABC–SMC), returning (W_t^{(i)}(θ_t^+), W̃_t^{(i)}(θ_t^+)) and (W_t^{(i)}(θ_t^−), W̃_t^{(i)}(θ_t^−)) respectively.
  – Compute r̂^N_{t,θ_t^+,ε} and r̂^N_{t,θ_t^−,ε} respectively using Eq. 15.
  – Update θ_t. For m = 1, . . . , dθ,

$$ \theta_{t+1}(m) = \theta_t(m) + a_t\, \frac{\hat r^{N}_{t,\theta_t^{+},\epsilon} - \hat r^{N}_{t,\theta_t^{-},\epsilon}}{2\, c_t\, \Delta_t(m)}. $$

  – Compute Steps 1 to 3 of Algorithm 1 (ABC–SMC) using θ_{t+1} to get (x_{0:t}^{(i)}, W_t^{(i)}).
with Wt, Vt independent and Wt ∼ N(0, 1), Vt ∼ N(0, 1) i.i.d. In the subsequent examples, we will use a simulated dataset obtained with θ = (σv, φ, σw) = (0.2, 0.9, 0.3),
which is the same example as in Poyiadjis et al. (2006).
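Such a dataset can be simulated in a few lines; a sketch for the linear Gaussian HMM above (the function name and seed handling are ours):

```python
import random

def simulate_lg_hmm(n, phi=0.9, sigma_v=0.2, sigma_w=0.3, seed=0):
    """Simulate observations y_{1:n} from the linear Gaussian HMM above,
    with theta = (sigma_v, phi, sigma_w) = (0.2, 0.9, 0.3) by default."""
    rng = random.Random(seed)
    x, ys = 0.0, []
    for _ in range(n):
        x = phi * x + sigma_v * rng.gauss(0.0, 1.0)   # X_t = phi X_{t-1} + sigma_v V_t
        ys.append(x + sigma_w * rng.gauss(0.0, 1.0))  # Y_t = X_t + sigma_w W_t
    return ys
```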
1. Kalman Filtering (KF) for the original HMM is used to compute lˆθ for SPSA,
2. Standard SMC (without ABC) with N = 1000 for the original HMM is used to
compute lˆθ for SPSA,
3. ABC–SMC with N = 200, M = 10, = 0.1 is used to compute lˆθ for SPSA.
The horizontal lines in Fig. 1 show also Maximum Likelihood estimates (MLE)
obtained from an offline grid search optimization that uses KF. All procedures seem
to be very accurate at estimating the MLE obtained from the grid search. This allows
us to investigate RML, which is a more challenging problem.
4.1.2 RML
We now consider a larger data set with n = 50,000 data points, simulated with the
previously indicated parameter values. We use Algorithm 3 described in Section 3.2.
Again we compare the same three procedures outlined above, using fifty independent
runs in each case. The standard SMC and ABC–SMC algorithms were employed with
the same values of N and M as in the off-line case. Also, for each case we used the same step-size sequences for SPSA, which were similar to their off-line counterparts
in Section 4.1.1. In Fig. 2, we plot the medians and credible intervals for the 5–
95 % percentiles of the parameter estimates (across the independent runs). The θt
converge after t = 20,000 time steps, with the KF and SMC yielding similarly valued
estimates. Note there seems to be an apparent bias in both cases relative to the true
parameters (the MLE for the data-set used has been checked that it converges to
the true parameters by n = 5 × 10^4). A similar bias has appeared in Poyiadjis et al.
(2006) for this particular model. The theoretical justification in Spall (1992) applies
directly when SPSA is used for off-line MLE (as in Section 4.1.1) with a finite and
fixed data-set. For RML the argument to be maximized is an ergodic average (Le
Gland and Mevel 1995, 1997; Tadic and Doucet 2005; Tadic 2009), so we believe
the bias accumulated here is due to the step-sizes of SPSA decreasing much faster
than the gradient to be estimated reaches stationarity. Ideally, one would like to run this algorithm for a much longer n, with more slowly decreasing step-sizes, and also delay updating θ until stationarity is reached, but this would make using multiple runs
prohibitive. In Poyiadjis et al. (2006) it seemed that this bias was not considerable for
other models, such as the popular stochastic volatility model. In any case, it would be
useful to examine precisely under what conditions SPSA can be used within RML, but this is beyond the scope of this paper, which puts more emphasis on the relative accuracy of ABC. In Fig. 2 we also observe increased variance from left to right, which we attribute to the progressively added randomness of SMC and ABC–SMC respectively. In particular, the expected reduced accuracy of ABC–SMC against SMC is apparent, but the bias does not appear to be substantial (for ABC–SMC) in this particular example.
Fig. 2 Credible intervals for the 5–95 % percentiles and the medians after multiple runs of parameter
estimates using RML with KF, SMC, and ABC–SMC for the linear Gaussian HMM
where X(m), Ẋ(m) are the mth-components of the state and velocity at any time
respectively. We discretize the model to a discrete-time Markov chain with dynamics:
Xt = ft (Xt−1 ) + Vt , t≥0
Yt = H Xt + QWt , t≥1
where Wt ∼ N(0, I_{d_y}) i.i.d., Wt is independent of Vt, and Q is the Cholesky root of a
Toeplitz matrix defined by the parameters κ and σ as follows:
$$ Q_{ij} = \sigma\, S\big( \kappa^{-1} \min(|i-j|,\, d_y - |i-j|) \big), \qquad i, j \in \{1, \ldots, d_y\}, $$
$$ S(z) = \begin{cases} 1 - \tfrac{3}{2} z + \tfrac{1}{2} z^3, & 0 \le z \le 1 \\ 0, & z > 1 \end{cases} $$
and

$$ H_{ij} = \begin{cases} \tfrac{1}{2}, & i = j \\ \tfrac{1}{2}, & i = j - 1 \\ 0, & \text{otherwise}. \end{cases} $$
When θ = (κ, σ, σ63, ρ, β) = (2.5, 2, 10, 28, 8/3), n = 5,000 and τ = 0.05, a visualization
of the Lorenz ’63 (hidden) dynamics is shown in Fig. 3a and the associated simulated
dataset in Fig. 3b.
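As a rough stand-in for the hidden dynamics, one can Euler-discretise the classical Lorenz '63 vector field and add Gaussian noise. The sketch below is ours: it uses a smaller step than the τ = 0.05 quoted above so that the explicit Euler scheme stays well behaved, the noise scale is illustrative, and the observation map H with correlated noise Q is omitted:

```python
import random

def simulate_lorenz63(n, tau=0.01, sigma63=10.0, rho=28.0, beta=8.0 / 3.0,
                      sigma_v=0.1, seed=0):
    """Euler discretisation of the classical Lorenz '63 vector field with
    additive Gaussian noise, X_t = f(X_{t-1}) + V_t. Step size, noise scale
    and initial condition are illustrative choices, not the paper's."""
    rng = random.Random(seed)
    x1, x2, x3 = 1.0, 1.0, 1.0
    path = []
    for _ in range(n):
        d1 = sigma63 * (x2 - x1)      # classical Lorenz '63 drift
        d2 = x1 * (rho - x3) - x2
        d3 = x1 * x2 - beta * x3
        x1 += tau * d1 + sigma_v * rng.gauss(0.0, 1.0)
        x2 += tau * d2 + sigma_v * rng.gauss(0.0, 1.0)
        x3 += tau * d3 + sigma_v * rng.gauss(0.0, 1.0)
        path.append((x1, x2, x3))
    return path
```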
For the simulated data-set in Fig. 3b and its extension for longer n, in the
remainder we will use ABC–SMC to obtain parameter estimates from RML. In
the subsequent sub-section we will study the performance of these estimates under
different settings. We will use θ̂^{N,M}_{ε,n} to denote the estimate of θ at time n, that was estimated using N particles, M pseudo-observations, tolerance ε and a Gaussian kernel with covariance ε² I_{d_y}. We will compare the behavior of the algorithm as each of N, M, n, ε varies.
Fig. 4 θ̂^{N,10}_{1,5000} when estimating θ = (κ, σ, σ63, ρ) of the Lorenz '63 HMM, using ABC–SMC with values of N ∈ {100, 1000, 10000}. a–d show the θ̂^{N,10}_{1,5000} in box-plots and their true values in dotted green lines. e–h show the MC bias and MC standard deviation of the θ̂^{N,10}_{1,5000}, in red and blue, with curves of least squared-error ∝ 1/√N
Methodol Comput Appl Probab
estimates with parameters fixed is O(N^{−1})), but the addition of parameter estimation complicates things here. The main point is that, as expected, one obtains significantly more reproducible/consistent results as N grows.
Next we look at the influence of the number of auxiliary observation samples. For M ∈ {1, 3, 5, 10, 25, 50}, we show in Fig. 5a–d the box-plots of the terminal estimates θ̂^{5000,M}_{1,5000} from fifty independent runs of ABC–SMC, using N = 5000 and ε = 1. The dotted green lines mark the true θ values which generated the data. In Fig. 5e–h, the MC biases and the MC standard deviations of the θ̂^{5000,M}_{1,5000} are plotted as discrete points, in red and blue, with lines of least squared-error fitted around them. As M
increases, we see reductions in the MC variance. This reduction in variance can be
attributed to the fact that the ABC–SMC algorithm approximates the ideal SMC
algorithm that targets the perturbed HMM. Hence by a Rao–Blackwellization type
argument, one expects a reduction in variance. These results are consistent with Del Moral et al. (2012). For this example, beyond M = 5 there seems to be little impact on the accuracy of the parameter estimates, but this is example specific.
We now vary n. For n ∈ {5000, 10000, 15000} we ran fifty independent runs of ABC–SMC using N = 200, M = 10, and ε = 1, and plotted box-plots of the terminal estimates θ̂^{200,10}_{1,n} in Fig. 6a–d, against the true values of θ marked in dotted green lines. Recall that RML estimation tries to maximize (1/n) log pθ,ε(y1:n), so we expect n to have little effect on either the bias or the variance once n is above some value.
This can also be explained by the bias results in Section 2.3 and the theoretical results
in Dean et al. (2010), Dean and Singh (2011). In Fig. 6e–h the absolute value of the
MC biases and the MC standard deviations have been plotted in red and blue, and
fitted with linear lines of least squared-error.
Finally, we investigate the influence of ε ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50}. For each ε, we again ran fifty independent runs of ABC–SMC with N = 200 and M = 10, for the dataset with n = 5,000. The box-plots of the parameter estimates are plotted, in Fig. 7a–d, against dotted green lines which indicate the true θ. Figure 7e–h show
Fig. 5 θ̂^{5000,M}_{1,5000} when estimating θ = (κ, σ, σ63, ρ) of the Lorenz '63 HMM, using ABC–SMC with values of M ∈ {1, 3, 5, 10, 25, 50}. a–d show the θ̂^{5000,M}_{1,5000} in box-plots and their true values in dotted green lines. e–h show the MC bias and MC standard deviation of the θ̂^{5000,M}_{1,5000}, in red and blue, with lines of least squared-error
Fig. 6 $\theta_{1,n}^{200,10}$ when using ABC–SMC to estimate θ = (κ, σ, σ63, ρ) of the Lorenz '63 HMM, for
datasets of length n ∈ {5000, 10000, 15000}. a–d show the $\theta_{1,n}^{200,10}$ in box-plots and their true values
in dotted green lines. e–h show the MC bias and MC standard deviation of the $\theta_{1,n}^{200,10}$, in red and blue,
with lines of least squared-error
the absolute values of the MC biases in red, and the MC standard deviations in blue.
Fitted to the MC biases is a non-linear least-squares curve proportional to ε + 1/ε. The
result we presented in Section 2.3 states that as ε increases, the bias will increase
at O(ε); hence the term proportional to ε in the fitted curve. However, the ABC–
SMC algorithm becomes less stable for ε too small (in the sense that, for example,
the variance of the weights will become larger as ε shrinks), incurring more variable
estimates. We conjecture this will affect the biases according to a term proportional to 1/ε.
Fig. 7 $\theta_{\epsilon,5000}^{200,10}$ when estimating θ = (κ, σ, σ63, ρ) of the Lorenz '63 HMM, using ABC–SMC with
values of ε ∈ {1, 2, 3, . . . , 10, 50}. a–d show the parameter estimates in box-plots and their true values
in dotted green lines. e–h show the MC biases and their curves of non-linear least squared-error
proportional to ε + 1/ε in red, and the MC standard deviations with their curves of non-linear least
squared-error proportional to 1/ε in blue
Similarly, we fitted to the MC standard deviations non-linear least-squares curves
proportional to 1/ε, and note that the MC standard deviation decreases at this rate as
ε increases.
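The fitted curves above can be reproduced in outline: the conjectured bias model c1·ε + c2/ε is linear in its coefficients, so ordinary least squares applies directly. The sketch below uses synthetic numbers (not the paper's Monte Carlo output); the coefficients 0.3 and 1.2 and the noise level are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic stand-in for the absolute MC biases at each epsilon;
# the "true" coefficients 0.3 and 1.2 are arbitrary.
eps = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50], dtype=float)
rng = np.random.default_rng(0)
bias = 0.3 * eps + 1.2 / eps + 0.05 * rng.standard_normal(eps.size)

# Design matrix for the model c1*eps + c2/eps: linear in (c1, c2),
# so ordinary least squares recovers both coefficients.
A = np.column_stack([eps, 1.0 / eps])
(c1, c2), *_ = np.linalg.lstsq(A, bias, rcond=None)
print(round(c1, 2), round(c2, 2))
```

The same design-matrix trick fits the standard-deviation curve c/ε with a single column.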
5 Conclusions
Acknowledgements We thank the referee for comments that have vastly improved the paper.
We also acknowledge useful discussions on this material with Sumeetpal Singh. The second author
was funded by an MOE grant and acknowledges useful conversations with David Nott. The third
author acknowledges support from EPSRC under grant EP/J01365X/1 since July 2012 and under the
programme grant on Control For Energy and Sustainability EP/G066477/1 during earlier stages of
this work when he was employed at Imperial College.
Appendix A: Notations
We will introduce a round of notations. Firstly, we alert the reader that throughout
appendix k is used as a time index instead of t used earlier. As our analysis will
rely upon that in Tadic and Doucet (2005) our notations will follow that article. It
is remarked that under our assumptions, one can establish the same assumptions
as in Tadic and Doucet (2005). Moreover, the time-inhomogenous upper-bounds in
that paper can be made time-homogenous (albeit less tight) under our assumptions.
In addition, our proof strategy follows ideas in the expanded technical report of
Andrieu et al. (2005).
with the ABC equivalent
$$R_{\theta,\epsilon,n}(x, dx') := g_{\theta,\epsilon}(y_n|x') f_\theta(x'|x)dx', \qquad g_{\theta,\epsilon}(y|x) = \frac{\int_{A_{\epsilon,y}} g_\theta(u|x)\,du}{\int_{A_{\epsilon,y}} du}.$$
To keep consistency with Tadic and Doucet (2005) and to allow the reader to follow
the proofs, we note that the filter at time n ≥ 0, $F_\theta^n(\mu_\theta)$ (respectively the ABC filter
at time n, $F_{\theta,\epsilon}^n(\mu_\theta)$), is exactly, with initial distribution $\mu_\theta \in \mathcal{P}(X)$ and test function $\varphi \in \mathcal{B}_b(X)$,
$$F_\theta^n(\mu_\theta)(\varphi) = \frac{\mu_\theta R_{1,n,\theta}(\varphi)}{\mu_\theta R_{1,n,\theta}(1)}$$
respectively
$$F_{\theta,\epsilon}^n(\mu_\theta)(\varphi) = \frac{\mu_\theta R_{1,n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{1,n,\theta,\epsilon}(1)}$$
where $F_\theta^0(\mu_\theta) = F_{\theta,\epsilon}^0(\mu_\theta) = \mu_\theta$ and $R_{1,n,\theta}(\varphi)(x_0) = \int \prod_{k=1}^n R_{k,\theta}(x_{k-1}, dx_k)\varphi(x_n)$. In addition, we write the filter derivatives as $\widehat F_\theta^n(\mu_\theta, \widehat\mu_\theta)(\varphi)$ and $\widehat F_{\theta,\epsilon}^n(\mu_\theta, \widehat\mu_\theta)(\varphi)$, where the
second argument is the gradient of the initial measure.
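For intuition, the ABC likelihood $g_{\theta,\epsilon}$ admits a closed form in simple cases. A minimal sketch for a one-dimensional Gaussian observation density $g_\theta(\cdot|x) = N(x,1)$ with $A_{\epsilon,y} = [y-\epsilon, y+\epsilon]$ (a toy setting assumed for illustration, not the model used in the paper):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def abc_likelihood(y, x, eps):
    # g_{theta,eps}(y|x) = [Phi(y+eps-x) - Phi(y-eps-x)] / (2*eps):
    # the mass g_theta assigns to the ball A_{eps,y}, over its volume.
    return (norm_cdf(y + eps - x) - norm_cdf(y - eps - x)) / (2.0 * eps)

# As eps -> 0 this recovers the exact density N(y; x, 1)
exact = math.exp(-0.5 * (0.3) ** 2) / math.sqrt(2 * math.pi)
print(abc_likelihood(0.3, 0.0, 1e-4), exact)
```

The vanishing-ε behaviour shown in the last lines is exactly the sense in which the perturbed HMM approaches the original one.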
The following operators will be used below, for n ≥ 1:
$$\widehat G_\theta^n(\mu_\theta, \widehat\mu_\theta)(\varphi) := (\mu_\theta R_{1,n,\theta}(1))^{-1}\big[\widehat\mu_\theta R_{1,n,\theta}(\varphi) - \widehat\mu_\theta R_{1,n,\theta}(1) F_\theta^n(\mu_\theta)(\varphi)\big] \qquad (16)$$
$$H_\theta^n(\mu_\theta)(\varphi) := (\mu_\theta R_{1,n,\theta}(1))^{-1}\big[\mu_\theta \widehat R_{1,n,\theta}(\varphi) - \mu_\theta \widehat R_{1,n,\theta}(1) F_\theta^n(\mu_\theta)(\varphi)\big] \qquad (17)$$
with the convention $\widehat G_\theta^0(\mu_\theta, \widehat\mu_\theta)(\varphi) = \widehat\mu_\theta(\varphi)$. In addition, we set
$$\widehat G_\theta^{(n)}(\mu_\theta, \widehat\mu_\theta)(\varphi) := (\mu_\theta R_{n,\theta}(1))^{-1}\big[\widehat\mu_\theta R_{n,\theta}(\varphi) - \widehat\mu_\theta R_{n,\theta}(1) F_\theta^{(n)}(\mu_\theta)(\varphi)\big]$$
where $F_\theta^{(n)}(\mu_\theta) = \mu_\theta R_{n,\theta}/\mu_\theta R_{n,\theta}(1)$. Finally, an important notational convention is
as follows. Throughout, we use C to denote a constant whose value may change
from line to line in the calculations. This constant will typically not depend upon
important parameters such as ε and n; any important dependencies will be
highlighted.
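On a finite state space the filter recursion above reduces to matrix–vector updates, which makes the notation concrete. A minimal numerical sketch with an assumed two-state model and Gaussian observations; the ABC perturbation here is a crude variance widening, for illustration only (the paper's $g_{\theta,\epsilon}$ uses a uniform kernel):

```python
import numpy as np

f = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition matrix f(x'|x)
means = np.array([0.0, 2.0])             # observation mean per state

def g(y, eps=0.0):
    # Gaussian likelihood per state; eps > 0 widens the noise as a
    # stand-in for an ABC-perturbed likelihood (illustration only).
    s2 = 1.0 + eps ** 2
    return np.exp(-0.5 * (y - means) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

def run_filter(ys, eps=0.0):
    mu = np.array([0.5, 0.5])            # initial distribution mu_theta
    for y in ys:
        mu = (mu @ f) * g(y, eps)        # one application of R_k
        mu /= mu.sum()                   # normalise: F^k(mu)
    return mu

ys = [0.1, 1.9, 2.2, -0.3]
exact, abc = run_filter(ys), run_filter(ys, eps=0.1)
print(np.abs(exact - abc).sum())         # small for small eps
```

The printed distance between the exact and perturbed filters shrinks with ε, mirroring the kind of filter-stability bounds invoked below.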
We have that
$$\log(p_\theta(y_{1:n})) - \log(p_{\theta,\epsilon}(y_{1:n})) = \sum_{k=1}^n \big[\log(p_\theta(y_k|y_{1:k-1})) - \log(p_{\theta,\epsilon}(y_k|y_{1:k-1}))\big] \qquad (18)$$
with, for 1 ≤ k ≤ n,
$$p_\theta(y_k|y_{1:k-1}) = \int_{X^2} g_\theta(y_k|x_k) f_\theta(x_k|x_{k-1}) F_\theta^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k$$
$$p_{\theta,\epsilon}(y_k|y_{1:k-1}) = \int_{X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) F_{\theta,\epsilon}^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k.$$
We will consider each summand in Eq. 18. Only the case k ≥ 2 is considered; the
scenario k = 1 follows by a similar and simpler argument.
Using the inequality |log(x) − log(y)| ≤ |x − y|/(x ∧ y), valid for every x, y > 0, we have
$$|\log(p_\theta(y_k|y_{1:k-1})) - \log(p_{\theta,\epsilon}(y_k|y_{1:k-1}))| \le \frac{|p_\theta(y_k|y_{1:k-1}) - p_{\theta,\epsilon}(y_k|y_{1:k-1})|}{p_\theta(y_k|y_{1:k-1}) \wedge p_{\theta,\epsilon}(y_k|y_{1:k-1})}.$$
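The elementary inequality used here follows from the mean value theorem applied to log on the interval between x and y. A quick numerical sanity check:

```python
import math, random

# Check |log(x) - log(y)| <= |x - y| / min(x, y) on random pairs:
# log(x) - log(y) = (x - y)/c for some c between x and y, and c >= min(x, y).
rng = random.Random(1)
for _ in range(1000):
    x, y = rng.uniform(0.01, 10.0), rng.uniform(0.01, 10.0)
    assert abs(math.log(x) - math.log(y)) <= abs(x - y) / min(x, y) + 1e-12
print("ok")
```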
Note that
$$p_\theta(y_k|y_{1:k-1}) \wedge p_{\theta,\epsilon}(y_k|y_{1:k-1}) = \int_{X^2} g_\theta(y_k|x_k) f_\theta(x_k|x_{k-1}) F_\theta^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k \;\wedge\; \int_{X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) F_{\theta,\epsilon}^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k \ge C > 0 \qquad (19)$$
where we have applied (A3) and C does not depend upon ε. Thus we consider
$$|p_{\theta,\epsilon}(y_k|y_{1:k-1}) - p_\theta(y_k|y_{1:k-1})| = \bigg|\int_{X^2} g_\theta(y_k|x_k) f_\theta(x_k|x_{k-1}) F_\theta^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k - \int_{X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) F_{\theta,\epsilon}^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k\bigg|$$
which is upper-bounded by the sum of
$$\bigg|\int_{X^2} \big[g_\theta(y_k|x_k) - g_{\theta,\epsilon}(y_k|x_k)\big] f_\theta(x_k|x_{k-1}) F_\theta^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k\bigg|$$
and
$$\bigg|\int_{X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1})\big[F_\theta^{k-1}(\mu_\theta)(dx_{k-1}) - F_{\theta,\epsilon}^{k-1}(\mu_\theta)(dx_{k-1})\big]dx_k\bigg|.$$
The first expression can be dealt with by using (A1), which implies
$$\sup_{x\in X} |g_\theta(y_k|x) - g_{\theta,\epsilon}(y_k|x)| \le C\epsilon. \qquad (20)$$
The second expression can be controlled by Jasra et al. (2012, Theorem 2):
$$\sup_{k\ge 1} \big\|F_\theta^{k-1}(\mu_\theta) - F_{\theta,\epsilon}^{k-1}(\mu_\theta)\big\| \le C\epsilon \qquad (21)$$
to yield that
$$|p_{\theta,\epsilon}(y_k|y_{1:k-1}) - p_\theta(y_k|y_{1:k-1})| \le C\epsilon. \qquad (22)$$
One can thus conclude.
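The conclusion just reached can be seen numerically. A sketch on an assumed two-state HMM, with a widened-variance stand-in for the ABC perturbation (illustration only, not the paper's uniform kernel): the log-likelihood discrepancy shrinks as ε decreases.

```python
import numpy as np

f = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition matrix
means = np.array([0.0, 2.0])             # observation mean per state

def loglik(ys, eps):
    # Exact prediction-update recursion for log p_{theta,eps}(y_{1:n});
    # eps widens the observation variance as a stand-in perturbation.
    s2 = 1.0 + eps ** 2
    mu, ll = np.array([0.5, 0.5]), 0.0
    for y in ys:
        w = (mu @ f) * (np.exp(-0.5 * (y - means) ** 2 / s2) / np.sqrt(2 * np.pi * s2))
        ll += np.log(w.sum())            # log p(y_k | y_{1:k-1})
        mu = w / w.sum()                 # filter update
    return ll

ys = [0.1, 1.9, 2.2, -0.3, 0.5]
errs = [abs(loglik(ys, 0.0) - loglik(ys, e)) for e in (0.2, 0.1, 0.05)]
print(errs)   # errors shrink as eps shrinks
```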
$$\nabla\big[\log p_\theta(y_{1:n}) - \log p_{\theta,\epsilon}(y_{1:n})\big] = \sum_{k=1}^n \nabla\big[\log p_\theta(y_k|y_{1:k-1}) - \log p_{\theta,\epsilon}(y_k|y_{1:k-1})\big]. \qquad (23)$$
We will deal with the two terms on the R.H.S. of Eq. 23 in turn. Only the scenario k ≥ 2
is considered; the case k = 1 follows by a similar and simpler argument.
We first start with the summand
$$\frac{\nabla p_\theta(y_k|y_{1:k-1}) - \nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})}{p_\theta(y_k|y_{1:k-1})}.$$
Noting Eq. 19, we need only upper-bound the L1 norm of the following expression:
$$\int_{X^2} \nabla\{g_\theta(y_k|x_k)\} f_\theta(x_k|x_{k-1}) F_\theta^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k - \int_{X^2} \nabla\{g_{\theta,\epsilon}(y_k|x_k)\} f_\theta(x_k|x_{k-1}) F_{\theta,\epsilon}^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k \qquad (24)$$
$$+ \int_{X^2} g_\theta(y_k|x_k)\nabla\{f_\theta(x_k|x_{k-1})\} F_\theta^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k - \int_{X^2} g_{\theta,\epsilon}(y_k|x_k)\nabla\{f_\theta(x_k|x_{k-1})\} F_{\theta,\epsilon}^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k \qquad (25)$$
$$+ \int_{X^2} g_\theta(y_k|x_k) f_\theta(x_k|x_{k-1})\,\widehat F_\theta^{k-1}(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\,dx_k - \int_{X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1})\,\widehat F_{\theta,\epsilon}^{k-1}(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\,dx_k. \qquad (26)$$
We start with Eq. 24. Using (A4) we can establish that, for each k ≥ 1,
$$\sup_{x\in X} \big|\nabla\{g_\theta(y_k|x)\} - \nabla\{g_{\theta,\epsilon}(y_k|x)\}\big| \le C\epsilon. \qquad (27)$$
Then we note that, by Jasra et al. (2012, Theorem 2) (see Eq. 21) and (A5),
$$\bigg|\int_{X^2} \nabla\{g_{\theta,\epsilon}(y_k|x_k)\} f_\theta(x_k|x_{k-1})\big[F_\theta^{k-1}(\mu_\theta)(dx_{k-1}) - F_{\theta,\epsilon}^{k-1}(\mu_\theta)(dx_{k-1})\big]dx_k\bigg| \le C\epsilon$$
and we can again use Jasra et al. (2012, Theorem 2) (i.e. Eq. 21) to deduce that
$$\bigg|\int_{X^2} g_{\theta,\epsilon}(y_k|x_k)\nabla\{f_\theta(x_k|x_{k-1})\}\big[F_\theta^{k-1}(\mu_\theta)(dx_{k-1}) - F_{\theta,\epsilon}^{k-1}(\mu_\theta)(dx_{k-1})\big]dx_k\bigg| \le C\epsilon$$
which upper-bounds the expression in Eq. 25. We now move onto Eq. 26, which is
upper-bounded by
$$\bigg|\int_{X^2} \big[g_\theta(y_k|x_k) - g_{\theta,\epsilon}(y_k|x_k)\big] f_\theta(x_k|x_{k-1})\,\widehat F_\theta^{k-1}(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\,dx_k\bigg|$$
$$+ \bigg|\int_{X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1})\big[\widehat F_\theta^{k-1}(\mu_\theta, \widehat\mu_\theta)(dx_{k-1}) - \widehat F_{\theta,\epsilon}^{k-1}(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\big]dx_k\bigg|$$
$$\le C\epsilon(2 + \|\widehat\mu_\theta\|).$$
Thus we have upper-bounded the L1-norm of the sum of the expressions 24–26 and
we have established that
$$\bigg\|\frac{\nabla p_\theta(y_k|y_{1:k-1}) - \nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})}{p_\theta(y_k|y_{1:k-1})}\bigg\| \le C\epsilon(2 + \|\widehat\mu_\theta\|). \qquad (28)$$
Moving onto the second summand on the R.H.S. of Eq. 23,
$$\frac{\nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})}{p_\theta(y_k|y_{1:k-1})\, p_{\theta,\epsilon}(y_k|y_{1:k-1})}\big[p_{\theta,\epsilon}(y_k|y_{1:k-1}) - p_\theta(y_k|y_{1:k-1})\big].$$
By Eq. 22, we need only consider upper-bounding $\nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})$ in L1. This can
be decomposed into the sum of three expressions:
$$\int_{X^2} \nabla\{g_{\theta,\epsilon}(y_k|x_k)\} f_\theta(x_k|x_{k-1}) F_{\theta,\epsilon}^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k,$$
$$\int_{X^2} g_{\theta,\epsilon}(y_k|x_k)\nabla\{f_\theta(x_k|x_{k-1})\} F_{\theta,\epsilon}^{k-1}(\mu_\theta)(dx_{k-1})\,dx_k$$
and
$$\int_{X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1})\,\widehat F_{\theta,\epsilon}^{k-1}(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\,dx_k.$$
As $\nabla\{g_{\theta,\epsilon}(y_k|x_k)\}$ and $g_{\theta,\epsilon}(y_k|x_k)\nabla\{f_\theta(x_k|x_{k-1})\}$ are upper-bounded, and X is
compact, the first two expressions are upper-bounded in L1. In addition, as
$\int_X g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1})dx_k$ is upper-bounded, we can apply Lemma 5.3 to see that
the third expression is upper-bounded in L1. Hence, we have shown that
$$\bigg\|\frac{\nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})}{p_\theta(y_k|y_{1:k-1})\, p_{\theta,\epsilon}(y_k|y_{1:k-1})}\big[p_{\theta,\epsilon}(y_k|y_{1:k-1}) - p_\theta(y_k|y_{1:k-1})\big]\bigg\| \le C\epsilon(1 + \|\widehat\mu_\theta\|). \qquad (29)$$
Combining the results of Eqs. 28 and 29 and noting Eq. 23, we can conclude.
Theorem 5.1 Assume (A1–A5). Then there exists a C < +∞ such that for any n ≥ 1,
$\mu_\theta \in \mathcal{P}(X)$, $\widehat\mu_\theta \in \mathcal{M}(X)$, ε > 0, θ ∈ Θ:
$$\big\|\widehat F_\theta^n(\mu_\theta, \widehat\mu_\theta) - \widehat F_{\theta,\epsilon}^n(\mu_\theta, \widehat\mu_\theta)\big\| \le C\epsilon(2 + \|\widehat\mu_\theta\|).$$
Proof We have the following telescoping sum decomposition (e.g. Del Moral 2004)
for the difference of the filters, with $\varphi \in \mathcal{B}_b(X)$:
$$F_\theta^n(\mu_\theta)(\varphi) - F_{\theta,\epsilon}^n(\mu_\theta)(\varphi) = \sum_{p=1}^n \Big[F_\theta^{n-p+1,n}\big(F_{\theta,\epsilon}^{n-p}(\mu_\theta)\big)(\varphi) - F_\theta^{n-p+2,n}\big(F_{\theta,\epsilon}^{n-p+1}(\mu_\theta)\big)(\varphi)\Big]$$
where we are using the notation $F_\theta^{q,n}(\mu_\theta)(\varphi) = \frac{\mu_\theta R_{q,n,\theta}(\varphi)}{\mu_\theta R_{q,n,\theta}(1)}$, for 1 ≤ q ≤ n. Hence,
taking gradients and swapping the order of summation and differentiation, and
omitting the second arguments of $\widehat F$ on the R.H.S. (to reduce the notational
burden), we have
$$\widehat F_\theta^n(\mu_\theta, \widehat\mu_\theta)(\varphi) - \widehat F_{\theta,\epsilon}^n(\mu_\theta, \widehat\mu_\theta)(\varphi) = \sum_{p=1}^n \Big[\widehat F_\theta^{n-p+2,n}\big(F_\theta^{(n-p+1)}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_\theta^{(n-p+1)}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\big)(\varphi) - \widehat F_\theta^{n-p+2,n}\big(F_{\theta,\epsilon}^{(n-p+1)}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_{\theta,\epsilon}^{(n-p+1)}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\big)(\varphi)\Big]. \qquad (30)$$
To continue with the proof we will adopt (Tadic and Doucet 2005, Lemma 6.4):
$$\widehat F_\theta^n(\mu_\theta, \widehat\mu_\theta)(\varphi) = \widehat G_\theta^n(\mu_\theta, \widehat\mu_\theta)(\varphi) + \sum_{q=1}^n \widehat G_\theta^{q+1,n}\big(F_\theta^q(\mu_\theta), H_\theta^q(\mu_\theta)\big)(\varphi)$$
with $\widehat G_\theta^n$ and $H_\theta^q(\mu_\theta)$ defined in Eqs. 16 and 17, $\widehat G_\theta^{q+1,n}$ a similar extension of the
notation as for the filter $F_\theta^{q,n}$, and the convention $\widehat G_\theta^{n+1,n}(\mu_\theta, \widehat\mu_\theta) = \widehat\mu_\theta$. Returning
to Eq. 30 and again omitting the second arguments of $\widehat F$ on the R.H.S.:
$$\widehat F_\theta^n(\mu_\theta, \widehat\mu_\theta)(\varphi) - \widehat F_{\theta,\epsilon}^n(\mu_\theta, \widehat\mu_\theta)(\varphi) = \sum_{p=1}^n \Bigg[\widehat G_\theta^{n-p+2,n}\big\{F_\theta^{(n-p+1)}(F_{\theta,\epsilon}^{n-p}(\mu_\theta)), \widehat F_\theta^{(n-p+1)}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))\big\}(\varphi) - \widehat G_\theta^{n-p+2,n}\big\{F_{\theta,\epsilon}^{(n-p+1)}(F_{\theta,\epsilon}^{n-p}(\mu_\theta)), \widehat F_{\theta,\epsilon}^{(n-p+1)}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))\big\}(\varphi)$$
$$+ \sum_{q=n-p+2}^n \Big(\widehat G_\theta^{q+1,n}\big\{F_\theta^{n-p+2,q}[F_\theta^{(n-p+1)}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^{n-p+2,q}[F_\theta^{(n-p+1)}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\big\}(\varphi)$$
$$- \widehat G_\theta^{q+1,n}\big\{F_\theta^{n-p+2,q}[F_{\theta,\epsilon}^{(n-p+1)}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^{n-p+2,q}[F_{\theta,\epsilon}^{(n-p+1)}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\big\}(\varphi)\Big)\Bigg]. \qquad (31)$$
We start first with the summand on the R.H.S. of the second line of Eq. 31 which, in a
compact notation (writing $F_\theta$ for $F_\theta^{(n-p+1)}$ and $\widehat F_\theta$ for $\widehat F_\theta^{(n-p+1)}$, and similarly with
subscript θ, ε), we decompose into the sum of
$$\widehat G_\theta^{p-1}\big\{F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\big\}(\varphi) - \widehat G_\theta^{p-1}\big\{F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\big\}(\varphi) \qquad (32)$$
and
$$\widehat G_\theta^{p-1}\big\{F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\big\}(\varphi) - \widehat G_\theta^{p-1}\big\{F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\big\}(\varphi). \qquad (33)$$
Beginning with Eq. 32, by Tadic and Doucet (2005, Lemma 6.7, Eq. 43) we have
$$\big|\widehat G_\theta^{p-1}\{F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\}(\varphi) - \widehat G_\theta^{p-1}\{F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\}(\varphi)\big|$$
$$\le C\|\varphi\|_\infty\,\rho^{p-1}\,\big\|F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)] - F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\big\|\,\big\|\widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\big\| \le C\epsilon\|\varphi\|_\infty\,\rho^{p-1}\,\big\|\widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\big\|$$
where C does not depend upon μθ, ε, n or p. Then, by Remark 5.1 and Lemma 5.3,
$\|\widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\| \le C(2 + \|\widehat\mu_\theta\|)$ and thus the upper-bound on the L1-norm of Eq. 32:
$$\big|\widehat G_\theta^{p-1}\{F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\}(\varphi) - \widehat G_\theta^{p-1}\{F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\}(\varphi)\big| \le C\epsilon\|\varphi\|_\infty\,\rho^{p-1}(2 + \|\widehat\mu_\theta\|). \qquad (34)$$
Now, moving onto Eq. 33, by Tadic and Doucet (2005, Lemma 6.7, Eq. 42):
$$\big|\widehat G_\theta^{p-1}\{F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\}(\varphi) - \widehat G_\theta^{p-1}\{F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\}(\varphi)\big|$$
$$\le C\rho^{p-1}\|\varphi\|_\infty\,\big\|\widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)] - \widehat F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\big\| \le C\epsilon\|\varphi\|_\infty\,\rho^{p-1}\big(1 + \big\|\widehat F_{\theta,\epsilon}^{n-p}(\mu_\theta)\big\|\big) \le C\epsilon\|\varphi\|_\infty\,\rho^{p-1}(2 + \|\widehat\mu_\theta\|). \qquad (35)$$
Combining Eqs. 34 and 35,
$$\big|\widehat G_\theta^{p-1}\{F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_\theta[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\}(\varphi) - \widehat G_\theta^{p-1}\{F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)], \widehat F_{\theta,\epsilon}[F_{\theta,\epsilon}^{n-p}(\mu_\theta)]\}(\varphi)\big| \le C\epsilon\|\varphi\|_\infty\,\rho^{p-1}(2 + \|\widehat\mu_\theta\|). \qquad (36)$$
We now consider the summands over q in the second and third lines of Eq. 31.
Again adopting the compact notation above, we can decompose the summands over
q into the sum of
$$\widehat G_\theta^{n-q}\big\{F_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\big\}(\varphi) - \widehat G_\theta^{n-q}\big\{F_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\big\}(\varphi) \qquad (37)$$
and
$$\widehat G_\theta^{n-q}\big\{F_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\big\}(\varphi) - \widehat G_\theta^{n-q}\big\{F_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\big\}(\varphi) \qquad (38)$$
where s = q − n + p − 1. We start with Eq. 37; by Tadic and Doucet (2005, Lemma
6.7, Eq. 43), we have
$$\big|\widehat G_\theta^{n-q}\{F_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\}(\varphi) - \widehat G_\theta^{n-q}\{F_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\}(\varphi)\big|$$
$$\le C\|\varphi\|_\infty\,\rho^{n-q}\,\big\|F_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))] - F_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\big\|\,\big\|H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\big\|.$$
Then we will use the stability of the filter (e.g. Tadic and Doucet 2005, Theorem 3.1):
$$\big\|F_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))] - F_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\big\| \le C\rho^s\,\big\|F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta)) - F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))\big\|.$$
By Lemma 5.2, $\|F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta)) - F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))\| \le C\epsilon$, and by Tadic and Doucet
(2005, Lemma 6.8) we have $\|H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\| \le C$, where C does not depend
upon $F_{\theta,\epsilon}^{n-p}(\mu_\theta)$ or ε. Hence, recalling that s = q − n + p − 1,
$$\big|\widehat G_\theta^{n-q}\{F_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\}(\varphi) - \widehat G_\theta^{n-q}\{F_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\}(\varphi)\big| \le C\epsilon\|\varphi\|_\infty\,\rho^{p-1}. \qquad (39)$$
Now, turning to Eq. 38 and applying Tadic and Doucet (2005, Lemma 6.7, Eq. 42), we
have
$$\big|\widehat G_\theta^{n-q}\{F_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\}(\varphi) - \widehat G_\theta^{n-q}\{F_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\}(\varphi)\big| \le C\epsilon\|\varphi\|_\infty\,\rho^{p-1}.$$
Combining this with the bound on Eq. 37 yields
$$\big|\widehat G_\theta^{n-q}\{F_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_\theta(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\}(\varphi) - \widehat G_\theta^{n-q}\{F_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))], H_\theta^s[F_{\theta,\epsilon}(F_{\theta,\epsilon}^{n-p}(\mu_\theta))]\}(\varphi)\big| \le C\epsilon\|\varphi\|_\infty\,\rho^{p-1}. \qquad (40)$$
Then, returning to Eq. 31 and noting Eqs. 36 and 40, we have the upper-bound
$$\big\|\widehat F_\theta^n(\mu_\theta, \widehat\mu_\theta) - \widehat F_{\theta,\epsilon}^n(\mu_\theta, \widehat\mu_\theta)\big\| \le C\epsilon(2 + \|\widehat\mu_\theta\|)\sum_{p=1}^n\Big(\rho^{p-1} + \sum_{q=n-p+2}^n \rho^{p-1}\Big) \le C\epsilon(2 + \|\widehat\mu_\theta\|).$$
Lemma 5.1 Assume (A1–A5). Then there exists a C < +∞ such that for any n ≥ 1,
$\mu_\theta \in \mathcal{P}(X)$, $\widehat\mu_\theta \in \mathcal{M}(X)$, ε > 0, θ ∈ Θ:
$$\big\|\widehat F_\theta^{(n)}(\mu_\theta, \widehat\mu_\theta) - \widehat F_{\theta,\epsilon}^{(n)}(\mu_\theta, \widehat\mu_\theta)\big\| \le C\epsilon(1 + \|\widehat\mu_\theta\|).$$
Proof By Tadic and Doucet (2005, Lemma 6.7) we have the decomposition, for $\varphi \in \mathcal{B}_b(X)$:
$$\widehat F_\theta^{(n)}(\mu_\theta, \widehat\mu_\theta)(\varphi) = \widehat G_\theta^{(n)}(\mu_\theta, \widehat\mu_\theta)(\varphi) - H_\theta^{(n)}(\mu_\theta)(\varphi)$$
where
$$H_\theta^{(n)}(\mu_\theta)(\varphi) := (\mu_\theta R_{n,\theta}(1))^{-1}\big[\mu_\theta \widehat R_{n,\theta}(\varphi) - \mu_\theta \widehat R_{n,\theta}(1) F_\theta^{(n)}(\mu_\theta)(\varphi)\big].$$
Thus, to control the difference, we can consider the two differences $\widehat G_\theta^{(n)}(\mu_\theta, \widehat\mu_\theta)(\varphi) - \widehat G_{\theta,\epsilon}^{(n)}(\mu_\theta, \widehat\mu_\theta)(\varphi)$ and $H_\theta^{(n)}(\mu_\theta)(\varphi) - H_{\theta,\epsilon}^{(n)}(\mu_\theta)(\varphi)$.
where $\bar\mu_\theta^+(\cdot) = \widehat\mu_\theta^+(\cdot)/\widehat\mu_\theta^+(1)$ and $\bar\mu_\theta^-(\cdot) = \widehat\mu_\theta^-(\cdot)/\widehat\mu_\theta^-(1)$. Thus we have
$$\widehat G_\theta^{(n)}(\mu_\theta, \widehat\mu_\theta)(\varphi) - \widehat G_{\theta,\epsilon}^{(n)}(\mu_\theta, \widehat\mu_\theta)(\varphi) = \bigg[\frac{\widehat\mu_\theta^+ R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)} - \frac{\widehat\mu_\theta^+ R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\bigg]\big[F_\theta^{(n)}(\bar\mu_\theta^+)(\varphi) - F_\theta^{(n)}(\mu_\theta)(\varphi)\big]$$
$$+ \frac{\widehat\mu_\theta^+ R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\big[F_\theta^{(n)}(\bar\mu_\theta^+)(\varphi) - F_\theta^{(n)}(\mu_\theta)(\varphi) - F_{\theta,\epsilon}^{(n)}(\bar\mu_\theta^+)(\varphi) + F_{\theta,\epsilon}^{(n)}(\mu_\theta)(\varphi)\big]$$
$$- \bigg[\frac{\widehat\mu_\theta^- R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)} - \frac{\widehat\mu_\theta^- R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\bigg]\big[F_\theta^{(n)}(\bar\mu_\theta^-)(\varphi) - F_\theta^{(n)}(\mu_\theta)(\varphi)\big]$$
$$- \frac{\widehat\mu_\theta^- R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\big[F_\theta^{(n)}(\bar\mu_\theta^-)(\varphi) - F_\theta^{(n)}(\mu_\theta)(\varphi) - F_{\theta,\epsilon}^{(n)}(\bar\mu_\theta^-)(\varphi) + F_{\theta,\epsilon}^{(n)}(\mu_\theta)(\varphi)\big]. \qquad (41)$$
By symmetry, we need only consider the terms including $\widehat\mu_\theta^+$; one can treat those
with $\widehat\mu_\theta^-$ using similar arguments. First dealing with the term on the first line of the
R.H.S. of Eq. 41, we have that
$$\bigg[\frac{\widehat\mu_\theta^+ R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)} - \frac{\widehat\mu_\theta^+ R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\bigg]\big[F_\theta^{(n)}(\bar\mu_\theta^+)(\varphi) - F_\theta^{(n)}(\mu_\theta)(\varphi)\big]$$
$$= \bigg[\frac{\widehat\mu_\theta^+ R_{n,\theta}(1) - \widehat\mu_\theta^+ R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta}(1)} + \widehat\mu_\theta^+ R_{n,\theta,\epsilon}(1)\,\frac{\mu_\theta R_{n,\theta,\epsilon}(1) - \mu_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)\,\mu_\theta R_{n,\theta}(1)}\bigg]\times\big[F_\theta^{(n)}(\bar\mu_\theta^+)(\varphi) - F_\theta^{(n)}(\mu_\theta)(\varphi)\big]$$
thus
$$\bigg|\frac{\widehat\mu_\theta^+ R_{n,\theta}(1) - \widehat\mu_\theta^+ R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta}(1)} + \widehat\mu_\theta^+ R_{n,\theta,\epsilon}(1)\,\frac{\mu_\theta R_{n,\theta,\epsilon}(1) - \mu_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)\,\mu_\theta R_{n,\theta}(1)}\bigg| \le \frac{C\epsilon\,\widehat\mu_\theta^+(1)}{\mu_\theta R_{n,\theta}(1)} + C\epsilon\,\frac{\widehat\mu_\theta^+ R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)\,\mu_\theta R_{n,\theta}(1)}.$$
Now one can show that there exists a C < +∞ such that for any x, y ∈ X the bound
(46) holds, where we have applied Eq. 27. Then we have
$$\frac{\mu_\theta \widehat R_{n,\theta}(\varphi)}{\mu_\theta R_{n,\theta}(1)} - \frac{\mu_\theta \widehat R_{n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{n,\theta,\epsilon}(1)} = \frac{\mu_\theta \widehat R_{n,\theta}(\varphi) - \mu_\theta \widehat R_{n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{n,\theta}(1)} + \mu_\theta \widehat R_{n,\theta,\epsilon}(\varphi)\bigg[\frac{1}{\mu_\theta R_{n,\theta}(1)} - \frac{1}{\mu_\theta R_{n,\theta,\epsilon}(1)}\bigg].$$
Then, as
$$\widehat R_{n,\theta,\epsilon}(\varphi)(x) = \int_X \varphi(x')\big[\nabla\{g_{\theta,\epsilon}(y_n|x')\} f_\theta(x'|x) + g_{\theta,\epsilon}(y_n|x')\nabla\{f_\theta(x'|x)\}\big]dx' \le C\|\varphi\|_\infty \int_X dx' \le C\|\varphi\|_\infty \qquad (47)$$
where the compactness of X and (A5) have been used, we have the upper-bound
$$\bigg|\frac{\mu_\theta \widehat R_{n,\theta}(\varphi)}{\mu_\theta R_{n,\theta}(1)} - \frac{\mu_\theta \widehat R_{n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\bigg| \le C\epsilon\|\varphi\|_\infty. \qquad (48)$$
Moving onto the second bracket on the R.H.S. of Eq. 45, this is equal to
$$\bigg[\frac{\mu_\theta \widehat R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)} - \frac{\mu_\theta \widehat R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}\bigg] F_{\theta,\epsilon}^{(n)}(\mu_\theta)(\varphi) + \frac{\mu_\theta \widehat R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}\big[F_{\theta,\epsilon}^{(n)}(\mu_\theta)(\varphi) - F_\theta^{(n)}(\mu_\theta)(\varphi)\big].$$
By using the inequality Eq. 48, we have
$$\bigg|\bigg[\frac{\mu_\theta \widehat R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)} - \frac{\mu_\theta \widehat R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}\bigg] F_{\theta,\epsilon}^{(n)}(\mu_\theta)(\varphi)\bigg| \le C\epsilon\,\big|F_{\theta,\epsilon}^{(n)}(\mu_\theta)(\varphi)\big| \le C\epsilon\|\varphi\|_\infty.$$
Using Lemma 5.2, and in addition using Eq. 43 in the denominator and Eq. 47 in the
numerator, we have
$$\bigg|\frac{\mu_\theta \widehat R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}\big[F_{\theta,\epsilon}^{(n)}(\mu_\theta)(\varphi) - F_\theta^{(n)}(\mu_\theta)(\varphi)\big]\bigg| \le C\epsilon\|\varphi\|_\infty$$
where C does not depend upon μθ and ε. Thus we have established that
$$\bigg|\frac{\mu_\theta \widehat R_{n,\theta,\epsilon}(1)\, F_{\theta,\epsilon}^{(n)}(\mu_\theta)(\varphi)}{\mu_\theta R_{n,\theta,\epsilon}(1)} - \frac{\mu_\theta \widehat R_{n,\theta}(1)\, F_\theta^{(n)}(\mu_\theta)(\varphi)}{\mu_\theta R_{n,\theta}(1)}\bigg| \le C\epsilon\|\varphi\|_\infty. \qquad (49)$$
One can put together the results of Eqs. 48 and 49 to establish that
$$\big|H_\theta^{(n)}(\mu_\theta)(\varphi) - H_{\theta,\epsilon}^{(n)}(\mu_\theta)(\varphi)\big| \le C\epsilon\|\varphi\|_\infty. \qquad (50)$$
On combining the results of Eqs. 44 and 50 and noting Eq. 45, we conclude the proof.
Lemma 5.2 Assume (A1–A3). Then there exists a C < +∞ such that for any n ≥ 1,
$\mu_\theta \in \mathcal{P}(X)$, ε > 0, θ ∈ Θ:
$$\big\|F_\theta^{(n)}(\mu_\theta) - F_{\theta,\epsilon}^{(n)}(\mu_\theta)\big\| \le C\epsilon.$$
Lemma 5.3 Assume (A1–A5). Then there exists a C < +∞ such that for any n ≥ 1,
$\mu_\theta \in \mathcal{P}(X)$, $\widehat\mu_\theta \in \mathcal{M}(X)$, ε > 0, θ ∈ Θ:
$$\big\|\widehat F_\theta^n(\mu_\theta, \widehat\mu_\theta)\big\| \vee \big\|\widehat F_{\theta,\epsilon}^n(\mu_\theta, \widehat\mu_\theta)\big\| \le C(1 + \|\widehat\mu_\theta\|).$$
Proof We will consider only $\widehat F_\theta^n(\mu_\theta, \widehat\mu_\theta)$, as the ABC filter derivative follows by
similar calculations for any ε > 0 (with upper-bounds that are independent of ε).
By Tadic and Doucet (2005, Lemma 6.4) we have, for $\varphi \in \mathcal{B}_b(X)$,
$$\widehat F_\theta^n(\mu_\theta, \widehat\mu_\theta)(\varphi) = \widehat G_\theta^n(\mu_\theta, \widehat\mu_\theta)(\varphi) + \sum_{p=1}^n \widehat G_\theta^{n-p}\big(F_\theta^p(\mu_\theta), H_\theta^p(\mu_\theta)\big)(\varphi)$$
with ρ ∈ (0, 1). Then, by Tadic and Doucet (2005, Lemma 6.8), it follows that
$$\big\|\widehat F_\theta^n(\mu_\theta, \widehat\mu_\theta)\big\| \le C\Big(\rho^n\|\widehat\mu_\theta\| + \sum_{p=1}^n \rho^{n-p}\Big) \le C(1 + \|\widehat\mu_\theta\|)$$
since $\sum_{p=1}^n \rho^{n-p} \le (1-\rho)^{-1}$.
Remark 5.1 Using the proof above, one can also show that there exists a C < +∞
such that for any n ≥ 1, $\mu_\theta \in \mathcal{P}(X)$, $\widehat\mu_\theta \in \mathcal{M}(X)$, ε > 0, θ ∈ Θ,
$$\big\|\widehat F_\theta^{(n)}(\mu_\theta, \widehat\mu_\theta)\big\| \vee \big\|\widehat F_{\theta,\epsilon}^{(n)}(\mu_\theta, \widehat\mu_\theta)\big\| \le C(1 + \|\widehat\mu_\theta\|).$$
References
Andrieu C, Doucet A, Tadic VB (2005) On-line simulation-based algorithms for parameter es-
timation in general state-space models. In: Proc. of the 44th IEEE Conference on De-
cision and Control and European Control Conference (CDC-ECC '05), pp 332–337. Expanded
Technical Report, available at http://www.maths.bris.ac.uk/~maxca/preprints/andrieu_doucet_tadic_2007.pdf
Arapostathis A, Marcus SI (1990) Analysis of an identification algorithm arising in the adaptive
estimation of Markov chains. Math Control Signals Syst 3:1–29
Barthelmé S, Chopin N (2011) Expectation–Propagation for summary-less, likelihood-free inference.
arXiv:1107.5959 [stat.CO]
Benveniste A, Métivier M, Priouret P (1990) Adaptive algorithms and stochastic approximation.
Springer-Verlag, New York
Beskos A, Crisan D, Jasra A, Whiteley N (2011) Error bounds and normalizing constants for
sequential Monte Carlo in high dimensions. arXiv:1112.1544 [stat.CO]
Bickel P, Li B, Bengtsson T (2008) Sharp failure rates for the bootstrap particle filter in high
dimensions. In: Clarke B, Ghosal S (eds) Pushing the limits of contemporary statistics. IMS,
pp 318–329
Cappé O, Rydén T, Moulines É (2005) Inference in hidden Markov models. Springer, New York
Cappé O (2009) Online sequential Monte Carlo EM algorithm. In: Proc. of IEEE workshop Statist.
Signal Process. (SSP). Cardiff, Wales, UK
Calvet C, Czellar V (2012) Accurate methods for approximate Bayesian computation filtering.
Technical Report, HEC Paris
Cérou F, Del Moral P, Guyader A (2011) A non-asymptotic variance theorem for un-normalized
Feynman–Kac particle models. Ann Inst Henri Poincare 47:629–649
Dean TA, Singh SS, Jasra A, Peters GW (2010) Parameter estimation for Hidden Markov models
with intractable likelihoods. arXiv:1103.5399 [math.ST]
Dean TA, Singh SS (2011) Asymptotic behavior of approximate Bayesian estimators.
arXiv:1105.3655 [math.ST]
Del Moral P (2004) Feynman–Kac formulae: genealogical and interacting particle systems with
applications. Springer, New York
Del Moral P, Doucet A, Jasra A (2006) Sequential Monte Carlo samplers. J R Stat Soc B 68:411–436
Del Moral P, Doucet A, Jasra A (2012) An adaptive sequential Monte Carlo method for approximate
Bayesian computation. Stat Comput 22:1009–1020
Del Moral P, Doucet A, Singh SS (2009) Forward only smoothing using sequential Monte Carlo.
arXiv:1012.5390 [stat.ME]
Del Moral P, Doucet A, Singh SS (2011) Uniform stability of a particle approximation of the optimal
filter derivative. arXiv:1106.2525 [math.ST]
Doucet A, Godsill S, Andrieu C (2000) On sequential Monte Carlo sampling methods for Bayesian
filtering. Stat Comput 10:197–208
Gauchi JP, Vila JP (2013) Nonparametric filtering approaches for identification and inference in
nonlinear dynamic systems. Stat Comput 23:523–533
Jasra A, Singh SS, Martin JS, McCoy E (2012) Filtering via approximate Bayesian computation. Stat
Comput 22:1223–1237
Kantas N, Doucet A, Singh SS, Maciejowski JM, Chopin N (2011) On particle methods for parameter
estimation in general state-space models. (submitted)
Le Gland F, Mevel M (2000) Exponential forgetting and geometric ergodicity in hidden Markov
models. Math Control Signals Syst 13:63–93
Le Gland F, Mevel M (1997) Recursive identification in hidden Markov models. In: Proc. 36th IEEE
conf. decision and control, pp 3468–3473
Le Gland F, Mevel M (1995) Recursive identification of HMM’s with observations in a finite set. In:
Proc. of the 34th conference on decision and control, pp 216–221
Lorenz EN (1963) Deterministic nonperiodic flow. J Atmos Sci 20:130–141
Marin J-M, Pudlo P, Robert CP, Ryder R (2012) Approximate Bayesian computational methods.
Stat Comput 22:1167–1197
Martin JS, Jasra A, Singh SS, Whiteley N, McCoy E (2012) Approximate Bayesian computation for
smoothing. arXiv:1206.5208 [stat.CO]
McKinley J, Cook A, Deardon R (2009) Inference for epidemic models without likelihoods. Int J
Biostat 5:a24
Murray LM, Jones E, Parslow J (2011) On collapsed state-space models and the particle marginal
Metropolis–Hastings sampler. arXiv:1202.6159 [stat.CO]
Pitt MK (2002) Smooth particle filters for likelihood evaluation and maximization. Technical Report,
University of Warwick
Poyiadjis G, Doucet A, Singh SS (2011) Particle approximations of the score and observed informa-
tion matrix in state space models with application to parameter estimation. Biometrika 98:65–80
Poyiadjis G, Singh SS, Doucet A (2006) Gradient-free maximum likelihood parameter estimation
with particle filters. In: Proc Amer. control conf., pp 6–9
Spall JC (1992) Multivariate stochastic approximation using a simultaneous perturbation gradient
approximation. IEEE Trans Autom Control 37(3):332–341
Spall J (2003) Introduction to stochastic search and optimization, 1st edn. Wiley, New York
Tadic VB, Doucet A (2005) Exponential forgetting and geometric ergodicity for optimal filtering in
general state-space models. Stoch Process Appl 115:1408–1436
Tadic VB (2009) Analyticity, convergence and convergence rate of recursive maximum likelihood
estimation in hidden Markov models. arXiv:0904.4264
Whiteley N, Kantas N, Jasra A (2012) Linear variance bounds for particle approximations of time-
homogeneous Feynman–Kac formulae. Stoch Process Appl 122:1840–1865
Yildirim S, Singh SS, Doucet A (2013a) An online expectation–maximisation algorithm for change-
point models. J Comput Graph Stat. doi:10.1080/10618600.2012.674653
Yildirim S, Dean TA, Singh SS, Jasra A (2013b) Approximate Bayesian computation for recursive
maximum likelihood estimation in hidden Markov models. Technical Report, University of
Cambridge