
Methodol Comput Appl Probab

DOI 10.1007/s11009-013-9357-4

Gradient Free Parameter Estimation for Hidden Markov Models with Intractable Likelihoods

Elena Ehrlich · Ajay Jasra · Nikolas Kantas

Received: 17 October 2012 / Revised: 29 March 2013 / Accepted: 25 June 2013


© Springer Science+Business Media New York 2013

Abstract In this article we focus on Maximum Likelihood estimation (MLE) for the static model parameters of hidden Markov models (HMMs). We will consider the case where one cannot or does not want to compute the conditional likelihood density of the observation given the hidden state because of increased computational complexity or analytical intractability. Instead we will assume that one may obtain samples from this conditional likelihood and hence use approximate Bayesian computation (ABC) approximations of the original HMM. Although these ABC approximations will induce a bias, this can be controlled to arbitrary precision via a positive parameter ε, so that the bias decreases with decreasing ε. We first establish that, when using an ABC approximation of the HMM for a fixed batch of data, the bias of the resulting log-marginal likelihood and its gradient is no worse than O(nε), where n is the total number of data-points. Therefore, when using gradient methods to perform MLE for the ABC approximation of the HMM, one may expect parameter estimates of reasonable accuracy. To compute an estimate of the unknown and fixed model parameters, we propose a gradient approach based on simultaneous perturbation stochastic approximation (SPSA) and Sequential Monte Carlo (SMC) for the ABC approximation of the HMM. The performance of this method is illustrated using two numerical examples.

E. Ehrlich · N. Kantas
Department of Mathematics, Imperial College London, London, SW7 2AZ, UK
E. Ehrlich
e-mail: elena.ehrlich05@ic.ac.uk

A. Jasra (B)
Department of Statistics & Applied Probability, National University of Singapore,
Singapore, 117546, Singapore
e-mail: staja@nus.edu.sg

N. Kantas
Department of Statistical Science, University College London, London, WC1E 6BT, UK
e-mail: n.kantas@ucl.ac.uk, n.kantas@imperial.ac.uk

Keywords Approximate Bayesian computation · Hidden Markov models · Parameter estimation · Sequential Monte Carlo

AMS 2000 Subject Classification 65C05 · 62F10

1 Introduction

Hidden Markov models (HMMs) provide a flexible description of a wide variety of


real-life phenomena when a time-varying latent process is observed independently at different epochs. A HMM can be defined as a pair of discrete-time stochastic processes, (Xt, Yt+1)t≥0, where Xt ∈ X ⊆ R^dx is the unobserved process and Yt ∈ Y ⊆ R^dy is the observation at time t. Let θ ∈ Θ ⊂ R^dθ be a vector containing the static parameters of the model. The hidden process (Xt)t≥0 is assumed to be a Markov chain with initial density μθ(x0) at time 0 and transition density fθ(xt|xt−1), so that

Pθ(X0 ∈ A) = ∫_A μθ(x0) dx0  and

Pθ(Xt ∈ A | (Xm)0≤m<t = (xm)0≤m<t, (Ym)1≤m<t = (ym)1≤m<t) = ∫_A fθ(xt | xt−1) dxt,  t ≥ 1,   (1)

where Pθ denotes the probability, A belongs to the Borel σ -algebra of X, B(X),


and dxt is the Lebesgue measure. In addition, each observation Yt is assumed to
be statistically independent of every other quantity except Xt , θ:

Pθ(Yt ∈ B | (Xm)m≥0 = (xm)m≥0, (Ym)m≥1, m≠t = (ym)m≥1, m≠t) = ∫_B gθ(yt | xt) dyt,  t > 0   (2)

with B ∈ B(Y) and gθ(yt|xt) being the conditional likelihood density. The HMM is given by Eqs. 1 and 2 and is often also referred to in the literature as a general state-space model. Here θ is treated as an unknown and static model parameter, which is to be estimated using Maximum Likelihood estimation (MLE). This is
an important problem with many applications ranging from financial modeling to
numerical weather prediction.
Statistical inference for the class of HMMs described above is typically non-trivial.
In most scenarios of practical interest one cannot calculate the marginal likelihood of n given observations

pθ(y1:n) = ∏_{t=1}^{n} ∫_X gθ(yt | xt) pθ(xt | y1:t−1) dxt,

where y1:n := (y1, . . . , yn) are considered fixed and pθ(xt | y1:t−1) is the predictor density at time t. Hence, as the likelihood is not analytically tractable, one must resort to numerical methods both to compute and to maximize pθ(y1:n) w.r.t. θ. When
θ is known, a popular collection of techniques for both estimating the likelihood as
well as performing filtering or smoothing are sequential Monte Carlo (SMC) methods
(Doucet et al. 2000; Cappé et al. 2005). SMC techniques simulate a collection of N
samples (known as particles) in parallel, sequentially in time, and combine importance sampling and resampling to approximate a sequence of probability distributions, defined on spaces of increasing dimension, that are known point-wise up to a multiplicative constant. These techniques provide a natural estimate of the likelihood pθ(y1:n). The estimate is

quite well understood and is known to be unbiased (Del Moral 2004, Chapter 9).
In addition, the relative variance of this quantity is known to increase linearly with
the number of data-points, n (Cérou et al. 2011; Whiteley et al. 2012). When θ is unknown, as is the case here, estimation of θ is further complicated because of the path-degeneracy caused in the population of samples by the resampling step of SMC. This issue has been well documented in the literature (Andrieu et al. 2005; Kantas et al. 2011). However, there are still many specialized SMC techniques which can successfully be used for parameter estimation of HMMs in a wide variety of contexts; see Kantas et al. (2011) for a comprehensive overview. In particular, for
MLE a variety of SMC methods have been proposed in the literature (Cappé 2009;
Del Moral et al. 2009; Poyiadjis et al. 2011). Note that the techniques in these papers
require the evaluation of gθ (y|x) and potentially gradient vectors as well.
In this article, we consider the scenario where gθ (y|x) is intractable. By this
we mean that one cannot calculate it for given y or x either because the density
does not exist or because it is computationally too expensive, e.g. due to the high-
dimensionality of x. In addition, we will assume that an unbiased estimator for gθ(y|x) is also not available. Instead we will assume that one can sample from gθ(·|x) for
any value of x. In this case, one cannot use the standard or the more advanced
SMC methods that are mentioned above (or indeed many other simulation based
approximations). Hence the problem of parameter estimation is very difficult. One
approach which is designed to deal with this problem is Approximate Bayesian
Computation (ABC). ABC is an approach that uses simulated samples from the
likelihood to deal with the restriction of not being able to evaluate its density. Although there is nothing inherently Bayesian about this, the method owes its name to its early success in Bayesian inference; see Marin et al. (2012) and the references therein for
more details. Although here we will focus only upon ABC ideas, we note that there
are possible alternatives, such as Gauchi and Vila (2013), and refer the interested
reader to Gauchi and Vila (2013), Jasra et al. (2012) for a discussion of the relative
merits of ABC.
In the context of HMMs when the model parameters θ are known, the use of ABC
approximations has appeared in Jasra et al. (2012), McKinley et al. (2009) as well as
associated computational methods for filtering and smoothing in Jasra et al. (2012),
Martin et al. (2012), Calvet and Czellar (2012). When the parameter is unknown, the
statistical properties of ML estimators for θ based on ABC approximations has been
studied in detail in Dean et al. (2010), Dean and Singh (2011). ABC approximations
of the HMM lead to a bias, which can be controlled to arbitrary precision via a parameter ε > 0. This bias typically goes to zero as ε ↓ 0. In this article we aim to:

1. Investigate the bias in the log-likelihood and the gradient of the log-likelihood
that is induced by the ABC approximation for a fixed data set,
2. Develop a gradient-based approach based on SMC, with computational cost O(N), that allows one to estimate the model parameters in either a batch or on-line fashion.

In order to implement such an approach one must obtain numerical estimates of the log-marginal likelihood as well as its gradient. Thus, it is important to understand what happens to the bias of the ABC approximation of these latter quantities as the time parameter (or, equivalently, the number of data-points n) grows. We establish, under some assumptions, that this ABC bias, for both quantities, is no worse than

O(nε). This result is closely associated to the theoretical work in Dean et al. (2010), Dean and Singh (2011). These results indicate that the ABC approximation is amenable to numerical implementation and that parameter estimation will not necessarily be dominated by the bias. We will discuss why this is the case later in Remarks 2.1 and 2.2. For the numerical implementation of MLE we will introduce a gradient-free approach based on using finite differences with Simultaneous Perturbation Stochastic Approximation (SPSA) (Spall 1992, 2003). This extends the work in Poyiadjis et al. (2006) to the case when the likelihood is intractable and ABC approximations are used.
This paper is structured as follows. In Section 2 we discuss the estimation procedure using ABC approximations. Our bias result is also given. In Section 3 our
computational strategy is outlined. In Section 4 the method is investigated from a
numerical perspective. In Section 5 the article is concluded with some discussion of
future work. The proofs of our results can be found in the appendices.

2 Model and Approximation

2.1 Maximum Likelihood for Hidden Markov models

Consider first the joint filtering density of the HMM, given by

πθ(x0:n | y1:n) = μθ(x0) ∏_{t=1}^{n} gθ(yt | xt) fθ(xt | xt−1) / ∫_{X^{n+1}} μθ(x0) ∏_{t=1}^{n} gθ(yt | xt) fθ(xt | xt−1) dx0:n ,

where we recall that θ ∈  ⊂ Rdθ is the vector of model parameters, xt ∈ X are the
hidden states and yt ∈ Y the observations. The joint filtering density can be computed
recursively using the well known Bayesian filtering recursions:

πθ(x0:t | y1:t−1) = πθ(x0:t−1 | y1:t−1) fθ(xt | xt−1)     (3)

πθ(x0:t | y1:t) = gθ(yt | xt) πθ(x0:t | y1:t−1) / pθ(yt | y1:t−1)     (4)

where the normalizing constant in Eq. 4 is referred to as the recursive likelihood and is given as follows:

pθ(yt | y1:t−1) = ∫_{X^{t+1}} gθ(yt | xt) πθ(x0:t | y1:t−1) dx0:t     (5)

Furthermore, we write the log-(marginal) likelihood at time n:

lθ (y1:n ) = log( pθ (y1:n )).

In the context of MLE one is usually interested in computing

θ̂ = arg max_{θ ∈ Θ} lθ(y1:n).

Note that this is a batch or off-line procedure, which means that one needs to wait
first to collect the complete data-set and then compute the ML estimate. In this paper
Methodol Comput Appl Probab

we will focus on computing ML estimates based on gradient methods. In this case one
may use iteratively for k ≥ 0

θk+1 = θk + ak+1 ∇lθ(y1:n)|θ=θk ,

where (ak)k≥1 is a step-size sequence that satisfies ∑_k ak = ∞ and ∑_k ak² < ∞ (Benveniste et al. 1990). Note that this scheme is only guaranteed to converge to a local maximum and is sensitive to initialization.
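To fix ideas, the update above can be sketched on a hypothetical tractable model where the gradient is available in closed form (i.i.d. N(θ, 1) data, so ∇lθ(y1:n) = ∑t (yt − θ) and the MLE is the sample mean); the model, the step-size constants and the normalisation by n are illustrative choices, not from the paper.

```python
import numpy as np

# Toy illustration of the batch update theta_{k+1} = theta_k + a_{k+1} grad l_theta(y_{1:n})
# on a hypothetical tractable model: y_t ~ N(theta, 1) i.i.d., for which
# grad l_theta(y_{1:n}) = sum_t (y_t - theta) and the MLE is mean(y).
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=500)  # synthetic data, "true" theta = 2

def grad_loglik(theta, y):
    # closed-form gradient for the toy model (unavailable in the intractable setting)
    return np.sum(y - theta)

theta = 0.0
for k in range(1, 5001):
    a_k = 0.5 * k ** -0.6                            # sum a_k = inf, sum a_k^2 < inf
    theta += a_k * grad_loglik(theta, y) / len(y)    # gradient normalised by n for stability
print(theta)  # converges to the MLE, mean(y)
```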
In case one expects a very long observation sequence, the computation of the
gradient at each iteration of the above gradient ascent algorithm can be prohibitive.
Therefore, one might prefer on-line ML methods, whereby the estimate of the
parameter is updated sequentially as the data arrives. A practical alternative would
be to consider maximizing instead the long run quantity

lim_{n→∞} (1/n) lθ(y1:n) = lim_{n→∞} (1/n) ∑_{t=1}^{n} log pθ(yt | y1:t−1).

Under appropriate regularity and ergodicity conditions for the augmented Markov
chain (Xt , Yt , pθ (xt |y1:t−1 ))t≥0 (Le Gland and Mevel 1997; Tadic and Doucet 2005)
the average log-likelihood is an ergodic average and this leads to a gradient update
scheme based on Stochastic Approximation (Benveniste et al. 1990). For a similar
step-size sequence (at ) t≥1 one may update θt as follows:

θt+1 = θt + at+1 ∇ log ( pθ (yt |y1:t−1 ))|θ =θt .

Upon receiving yt , the parameter estimate is updated in the direction of ascent of


the conditional density of this new observation. The algorithm in the present form is
not suitable for on-line implementation due to the need to evaluate the gradient of log pθ(yt | y1:t−1) at the current parameter estimate, which would require computing the filter from time 0 to time t using the current parameter value θt. To bypass
this problem, the recursive ML (RML) algorithm has been proposed originally in
Arapostathis and Marcus (1990), Le Gland and Mevel (1995, 1997) for finite state
spaces and in Del Moral et al. (2009, 2011), Poyiadjis et al. (2011) in the context of
SMC approximations. It relies on the following update scheme
θt+1 = θt + at+1 ∇ log pθ0:t(yt | y1:t−1),

where the positive non-increasing step-size sequence (at)t≥1 satisfies ∑_t at = ∞ and ∑_t at² < ∞, e.g. at = t^−α for 0.5 < α ≤ 1. The quantity ∇ log pθ0:t(yt | y1:t−1) is defined here as

∇ log pθ0:t(yt | y1:t−1) = ∇ log pθ0:t(y1:t) − ∇ log pθ0:t−1(y1:t−1),
 
where the subscript θ0:t in the notation for ∇ log pθ0:t (y1:t ) indicates that at each time
t the quantities in Eqs. 3–5 are computed using the current parameter estimate θt . The
asymptotic properties of RML have been studied in Arapostathis and Marcus (1990),
Le Gland and Mevel (1995, 1997, 2000) for finite state-space HMMs and Tadic and
Doucet (2005), Tadic (2009) in more general cases. It is shown that under regularity
conditions this algorithm converges towards a local maximum of the average log-
likelihood, whose maximum lies at the ‘true’ parameter value.

In this article, we would like to implement approximate versions of RML and off-
line ML schemes when both the following cases hold:
• We can sample from the conditional distribution of Y|x, for any fixed θ and x.
• We cannot or do not want to evaluate the conditional density of Y|x, gθ (y|x) and
do not have access to an unbiased estimate of it.
Apart from likelihoods which do not admit computable densities, such as some stable distributions, this setting is also relevant when one is interested in using SMC methods but evaluating gθ(y|x) is expensive because dx is large. SMC methods for filtering do not always scale well with the dimension of the hidden state dx, often requiring a computational cost of O(κ^dx), with κ > 1 (Beskos et al. 2011; Bickel et al.
2008). A more detailed discussion on the difficulties of using SMC methods in high
dimensions is far beyond the scope of this article, but we remark that the ideas in this paper can be relevant in this context.

2.2 ABC Approximations

To facilitate ML estimation when the bullet points above hold, we will resort to ABC approximations of the ideal MLE procedures above. We present a short overview here and refer the reader to Dean et al. (2010), Yildirim et al. (2013b) for more details.
First, we consider an ABC approximation of the joint smoothing density as in Jasra et al. (2012), McKinley et al. (2009):

πθ,ε(u1:n, x0:n | y1:n) = μθ(x0) ∏_{t=1}^{n} Kε(yt, ut) gθ(ut | xt) fθ(xt | xt−1) / pθ,ε(y1:n)     (6)

with the ABC marginal likelihood being

pθ,ε(y1:n) = ∫_{X^{n+1} × Y^n} μθ(x0) ∏_{t=1}^{n} Kε(yt, ut) gθ(ut | xt) fθ(xt | xt−1) du1:n dx0:n     (7)

and the ABC recursive likelihood

pθ,ε(yt | y1:t−1) = pθ,ε(y1:t) / pθ,ε(y1:t−1),     (8)
where ut ∈ Y are pseudo-observations and Kε : Y × Y → R+ ∪ {0} is a kernel function whose bandwidth depends upon a precision parameter ε > 0. We will also assume that the kernel is symmetric, i.e. Kε(yt, ut) = Kε(ut, yt). For example, possible choices could be:

Kε(yt, ut) = I_{u : |yt − u| < ε}(ut)  or  Kε(yt, ut) = exp( −(1/2) (yt − ut)^T Σε^{−1} (yt − ut) ),

where I is the indicator function, |·| is a vector norm, and Σε is a positive semi-definite dy × dy matrix.
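A minimal sketch of these two kernel choices, assuming an L1 norm for the indicator kernel; the function names and the particular Σε used in the example are illustrative, not from the paper.

```python
import numpy as np

# Sketch of the two kernel choices above: a uniform (indicator) kernel with
# tolerance eps under an L1 norm, and a Gaussian kernel with covariance
# Sigma_eps. Function names and the particular Sigma_eps are illustrative.

def uniform_kernel(y, u, eps):
    # K_eps(y, u) = 1{ |y - u| < eps }
    return float(np.sum(np.abs(y - u)) < eps)

def gaussian_kernel(y, u, sigma_eps):
    # K_eps(y, u) = exp(-0.5 (y - u)^T Sigma_eps^{-1} (y - u))
    d = y - u
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma_eps, d)))

y = np.array([0.1, -0.2]); u = np.array([0.15, -0.1])
print(uniform_kernel(y, u, eps=0.5))            # 1.0, since |y - u|_1 = 0.15 < 0.5
print(gaussian_kernel(y, u, 0.25 * np.eye(2)))
```

Both kernels attain their maximum when the pseudo-observation matches the data, which is what makes the ABC weights informative.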
Note that in this context the quantity

gθ,ε(yt | xt) = (1/Zε) ∫_Y Kε(yt, ut) gθ(ut | xt) dut,     (9)

can be viewed as the likelihood of an alternative “perturbed” HMM that uses the same transition density but has gθ,ε as the likelihood. It can easily be shown that this HMM will admit a marginal likelihood of (1/Zε^n) pθ,ε(y1:n), which is proportional to the one written above in Eq. 7, with the proportionality constant not depending on θ. Note that a critical condition for this to hold is that we choose Kε(yt, ut) such that the normalizing constant Zε = ∫ Kε(yt, ut) dut of Eq. 9 does not depend upon xt or θ.
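Since one can only sample from gθ(·|x), the perturbed likelihood of Eq. 9 can be estimated (up to the constant 1/Zε) by Monte Carlo over pseudo-observations; the Gaussian sampler below is a hypothetical stand-in for the intractable likelihood, and all numerical values are illustrative.

```python
import numpy as np

# Monte Carlo sketch of the perturbed likelihood in Eq. 9: since one can only
# *sample* from g_theta(.|x), g_{theta,eps}(y|x) is estimated, up to the
# constant 1/Z_eps, by averaging the kernel over simulated pseudo-observations
# u_j ~ g_theta(.|x). The Gaussian sampler is a hypothetical stand-in for the
# intractable likelihood.
rng = np.random.default_rng(1)

def sample_g(x, m):
    return x + rng.normal(size=m)        # stand-in g_theta(.|x) = N(x, 1)

def g_eps_hat(y, x, eps, m=10_000):
    u = sample_g(x, m)
    return np.mean(np.abs(y - u) < eps)  # uniform kernel, up to 1/Z_eps

# exact value here is P(|y - U| < eps) with U ~ N(x, 1), i.e. about 0.159
print(g_eps_hat(y=0.0, x=0.0, eps=0.2))
```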
The ABC–MLE approach we consider in this article will then be to use MLE for the perturbed HMM defined by gθ,ε. For the off-line case let

lθ,ε(y1:n) = log pθ,ε(y1:n)

and denote the ABC–MLE estimate as

θ̂ε = arg max_{θ ∈ Θ} lθ,ε(y1:n).     (10)

Results on the consistency and efficiency of this method as n grows can be found in Dean et al. (2010), Dean and Singh (2011). Under some regularity and other assumptions (such as the data originating from the HMM considered), the bias of the maximum likelihood estimator (MLE) is O(ε). In addition, one may avoid encountering this bias asymptotically if one adds appropriate noise to the observations.
This procedure is referred to as noisy ABC, and then one can recover the true
parameter. We remark that the methodology that is considered in this article can
easily incorporate noisy ABC. However, there may be some reasons why one may
not want to use noisy ABC:
1. The consistency results (currently) depend upon the data originating from the
original HMM;
2. The current simulation-based methodology may not be able to be used efficiently for ε close to zero.
For point 1., if the data do not originate from the HMM of interest, what happens with regards to the asymptotics of noisy ABC for HMMs has not been studied. Some investigators might be uncomfortable with assuming that the data originate from exactly the HMM being fitted. For point 2., the asymptotic bias (which is, under assumptions, either O(ε) or O(ε²); Dean et al. 2010; Dean and Singh 2011) could be less than the asymptotic variance (under assumptions O(ε²); Dean et al. 2010; Dean and Singh 2011), as ε could be much bigger than unity when using current simulation methodology. We do not use noisy ABC in this article, but acknowledge its fundamental importance with regards to parameter estimation associated to ABC for HMMs; our approach is intended for cases where points similar to 1.–2. need to be taken into account.
For the ABC–RML we will define the time-varying log-recursive likelihood as

rθ0:t,ε(y1:t) = log pθ0:t,ε(yt | y1:t−1),

where the subscript θ0:t again means that at each time t one computes all the relevant quantities in Eqs. 3–5 (with gθ,ε substituted for gθ) using θt as the parameter value, and θ0:t−1 has been used similarly at all previous times. Finally, we write the ABC–RML recursion for the parameter as

θt+1 = θt + at+1 ∇rθ0:t,ε(y1:t).     (11)



2.3 Bias Results

We now prove an upper bound on the bias induced by the ABC approximation on the log-marginal likelihood and its gradient. The latter is more relevant for parameter estimation, but the mathematical arguments are considerably more involved for this quantity, in comparison to the ABC bias of the log-likelihood. Hence the log-likelihood is considered first as a simple preliminary result. These results are to be taken in the context of ABC (not noisy ABC) and help to provide some guarantees associated to the numerics.
We consider for this section the scenario

Kε(yt, ut) = I_{Aε,yt}(ut)

where the set Aε,yt is specified below. Here |·| should be understood to be an L1-norm. The hidden state is assumed to lie on a compact set, i.e. X is compact. We use the notation P(X) to denote the class of probability measures on X and M(X) the collection of finite and signed measures on X. ‖·‖ denotes the total variation distance. The initial distribution of the hidden Markov chain is written as μθ ∈ P(X). In addition, we condition on the observed data and do not mention them in any mathematical statement of results (due to the assumptions below). We do not consider the question of whether or not the data originate from a HMM. For the control of the bias of the gradient of the log-likelihood (Theorem 2.1), we assume that dθ = 1. This is not restrictive, as one can use the arguments to prove analogous results when dθ > 1 by considering component-wise arguments for the gradient. In addition, for the gradient result, the derivative of μθ is written ∇μθ ∈ M(X), and the constants C, C̄, L are to be understood as arbitrary lower bounds, upper bounds and Lipschitz constants respectively. We make the following assumptions, which are quite strong but are intended to keep the proofs as short as possible.

(A1) Lipschitz Continuity of the Likelihood. There exists L < +∞ such that for any x ∈ X, y, y′ ∈ Y, θ ∈ Θ:

|gθ(y|x) − gθ(y′|x)| ≤ L|y − y′|.

(A2) Statistic and Metric. The set Aε,y is:

Aε,y = {u : |y − u| < ε}.

(A3) Boundedness of Likelihood and Transition. There exist 0 < C < C̄ < +∞ such that for all x, x′ ∈ X, y ∈ Y, θ ∈ Θ:

C ≤ fθ(x′|x) ≤ C̄,
C ≤ gθ(y|x) ≤ C̄.

(A4) Lipschitz Continuity of the Gradient of the Likelihood. fθ(x′|x), gθ(y|x′) are differentiable in θ for each x, x′ ∈ X, y ∈ Y. In addition, there exists L < +∞ such that for any x ∈ X, y, y′ ∈ Y, θ ∈ Θ:

|∇gθ(y|x) − ∇gθ(y′|x)| ≤ L|y − y′|.

(A5) Boundedness of Gradients of the Likelihood and Transition. There exist 0 < C < C̄ < +∞ such that for all x, x′ ∈ X, y ∈ Y, θ ∈ Θ:

C ≤ ∇fθ(x′|x) ≤ C̄,
C ≤ ∇gθ(y|x) ≤ C̄.
Whilst it is fairly easy to find useful simple models where the above conditions do
not hold uniformly for θ , we remark that the emphasis here is to provide intuition for
the methodology and for this reason similar conditions are popular in the literature,
e.g. Del Moral et al. (2009, 2011), Dean et al. (2010), Tadic and Doucet (2005).
We first present the result on the ABC bias of the log-likelihood. The proof is in
Appendix B.

Proposition 2.1 Assume (A1–A3). Then there exists a C < +∞ such that for any n ≥ 1, μθ ∈ P(X), ε > 0, θ ∈ Θ we have:

|lθ,ε(y1:n) − lθ(y1:n)| ≤ Cnε.

Remark 2.1 The above proposition gives some simple guarantees on the bias of
the ABC log-likelihood. When using SMC algorithms to approximate log( pθ (y1:n )),
the overall error will be decomposed into the deterministic bias that is present
from the ABC approximation (that in Proposition 2.1) and the numerical error of
approximating the log-likelihood. Under some assumptions, the L2 −error of the
SMC estimate of the log-likelihood should not deteriorate any faster than linearly in
time; this is due to the results cited previously. Thus, as the time parameter increases,
the ABC bias of the log-likelihood will not necessarily dominate the simulation-
based error that would be present even if gθ is evaluated.

Proposition 2.1 is reasonably straightforward to prove but is of less interest in the context of parameter estimation, as one is interested in the gradient of the log-likelihood. We now give the result on the ABC bias of the gradient of the log-likelihood. The proof is in Appendix C.

Theorem 2.1 Assume (A1–A5). Then there exists a C < +∞ such that for any n ≥ 1, μθ ∈ P(X), ∇μθ ∈ M(X), ε > 0, θ ∈ Θ we have:

|∇lθ,ε(y1:n) − ∇lθ(y1:n)| ≤ Cnε(2 + ‖∇μθ‖).

Remark 2.2 The above Theorem again provides some explicit guarantees when using
an ABC approximation along with SMC-based numerical methods. For example, if
one can consider approximating gradients in an ABC context as proposed in Yildirim et al. (2013a), then from the results of Del Moral et al. (2011) one expects the variance of the SMC estimates to increase only linearly in time. Again, as time
increases the ABC bias does not necessarily dominate the variance that would be
present even if gθ is evaluated (i.e. one uses SMC on the true model).

Remark 2.3 The result in Theorem 2.1 can be found in Eq. 72 of Dean et al. (2010), and its direct limit (as ε ↓ 0) in Dean and Singh (2011). However, we adopt a new (and fundamentally different) proof technique, with a substantially more elaborate proof

Algorithm 1 SMC with ABC

• Initialization, t = 0:
– For i = 1, . . . , N sample independently x0^(i) ∼ μθ. Set W0^(i) = 1/N.
• For t = 1, . . . , n
– Step 1: For i = 1, . . . , N, sample the next state xt^(i) ∼ qt,θ(· | xt−1^(i)).
∗ For j = 1, . . . , M: sample auxiliary observation samples ut^(j,i) ∼ gθ(· | xt^(i)).
– Step 2: Compute weights

Wt^(i) ∝ Wt−1^(i) W̃t^(i),  ∑_{i=1}^{N} Wt^(i) = 1,  W̃t^(i) = [ (1/M) ∑_{j=1}^{M} Kε(yt, ut^(j,i)) ] fθ(xt^(i) | xt−1^(i)) / qt,θ(xt^(i) | xt−1^(i)).

– Step 3: If required, resample N particles from

π̂θ,ε,t = ∑_{i=1}^{N} Wt^(i) δ_{x0:t^(i)},     (12)

to get new particles x̄0:t^(i) and set Wt^(i) = 1/N; else set x̄0:t^(i) = x0:t^(i).
and an additional result of independent interest is proved. We derive the stability of


the bias with time of the ABC approximation of the filter derivative; see Theorem
5.1 in Appendix D.

3 Computational Strategy

We begin by considering a modified target instead of the ABC targeted filtering


density in Eq. 6:
⎡⎛ ⎞ ⎤
n
1 M M
, x0:n |y1:n ) ∝ μθ (x0 ) ⎣⎝ K (yt , ut )⎠ gθ (ut |xt )⎦ fθ (xt |xt−1 ),
j j
πθ, (u11:n , . . . , u1:n
M

t=1
M j=1 j=1

(13)

where for every t we use this time M independent samples from the likelihood,
j
ut ∼ gθ (·|xt ), j = 1, . . . , M. When one integrates out u11:n , . . . , u1:n
M
then the targeted
sequence is the same as in Section 2.2, which targets a perturbed HMM with the
likelihood being gθ, shown earlier in Eq. 9. Of course, in terms of estimating θ and
MLE, again this yields the same bias as the original ABC approximation, but still
there are substantial computational improvements. This is because as M grows we
the behavior is closer to an ideal marginal SMC algorithm that targets directly the
perturbed HMM without the auxiliary u variables. We proceed by presenting first
SMC when the model parameters θ are known and then show how Simultaneous

Perturbation Stochastic Approximation (SPSA) can be used for (off-line) gradient-free MLE and RML.

3.1 Sequential Monte Carlo

For the sake of clarity, and for this sub-section only, consider θ to be fixed and known. In Algorithm 1 we present the ABC–SMC algorithm of Jasra et al. (2012), which is used to perform filtering for the perturbed HMM with likelihood gθ,ε and transition density fθ. The basic design elements are the importance sampling proposals qt,θ for the weights, the number of particles N, the number of auxiliary observation samples M and the ABC precision tolerance ε. The resampling step is presented here as optional, but note that to get good performance it is necessary to use it when the variance of the weights is high or the effective sample size is low. For more details we refer the reader to Jasra et al. (2012).
The algorithm allows us to approximate πθ,ε in Eq. 13 using the particles. For instance, the particle approximation of the marginal of πθ,ε w.r.t. the u variables is shown in Eq. 12. In addition, one also obtains particle approximations of pθ,ε(y1:n) and pθ,ε(yt | y1:t−1) as defined in Eqs. 7 and 8, which are critical quantities for parameter estimation. We denote the SMC estimates of these quantities as pθ,ε^N(y1:n) and pθ,ε^N(yt | y1:t−1) respectively. These are given as follows:

pθ,ε^N(y1:n) = ∏_{t=1}^{n} (1/N) ∑_{i=1}^{N} W̃t^(i)

with

pθ,ε^N(yt | y1:t−1) = (1/N) ∑_{i=1}^{N} W̃t^(i),

where W̃t^(i) is defined in Algorithm 1. To avoid possible confusion, we remind the reader that, because Zε in Eq. 9 is unknown, pθ,ε^N(y1:n) coincides with the actual marginal likelihood of the perturbed HMM only up to a proportionality constant Zε^n that is independent of θ. Of course, in the context of parameter estimation this does not pose any problems.
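As a concrete illustration, the following is a hedged sketch of Algorithm 1 for a toy scalar HMM, using the bootstrap proposal qt,θ = fθ (so the transition densities cancel in the weights) and resampling at every step; the AR(1)/Gaussian model and all numerical values are illustrative stand-ins, not from the paper.

```python
import numpy as np

# Hedged sketch of Algorithm 1 (ABC-SMC) on a toy scalar HMM, with the
# bootstrap proposal q_{t,theta} = f_theta (so the transition densities cancel
# in the weights) and resampling at every step. The AR(1)/Gaussian model and
# all numerical values are illustrative stand-ins, not from the paper.
rng = np.random.default_rng(2)

def abc_smc(y, theta, eps, N=500, M=20):
    """Estimate log p_{theta,eps}(y_{1:n}) up to the constant n*log(Z_eps)."""
    x = rng.normal(size=N)                        # x_0^{(i)} ~ mu_theta = N(0, 1)
    loglik = 0.0
    for y_t in y:
        x = theta * x + rng.normal(size=N)        # Step 1: x_t^{(i)} ~ f_theta
        u = x[None, :] + rng.normal(size=(M, N))  # u^{(j,i)} ~ g_theta(.|x_t^{(i)})
        w = np.mean(np.abs(y_t - u) < eps, axis=0)  # Step 2: (1/M) sum_j K_eps
        loglik += np.log(np.mean(w) + 1e-300)     # running likelihood estimate
        p = w / w.sum() if w.sum() > 0 else np.full(N, 1.0 / N)
        x = x[rng.choice(N, size=N, p=p)]         # Step 3: resample
    return loglik

# simulate n = 100 observations from the same toy model with theta = 0.8
theta_true, n = 0.8, 100
ys = np.zeros(n)
x = rng.normal()
for t in range(n):
    x = theta_true * x + rng.normal()
    ys[t] = x + rng.normal()
print(abc_smc(ys, theta=0.8, eps=0.5))
```

Running the estimator at different θ values gives the (noisy) log-likelihood surface that the gradient-free optimisation of Section 3.2 then climbs.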
The standard SMC approximation of the likelihood, pθ^N(y1:n), is an unbiased estimate in the sense that

E^N[ pθ^N(y1:n) ] = pθ(y1:n),

where E^N[·] denotes the expectation w.r.t. the distribution of all the random variables in Algorithm 1. A similar result holds for pθ^N(yn | y1:n−1); see Del Moral (2004, Theorems 7.4.2 and 7.4.3, p. 239) for a proof and more details. Note still that log pθ^N(y1:n) or log pθ^N(yn | y1:n−1) will be biased approximations of the ideal quantities. A usual remedy is to correct the bias up to the first order of a Taylor expansion and estimate the θ-dependent parts of log pθ,ε(y1:n) and log pθ,ε(yn | y1:n−1) instead with
l̂θ,ε^N = log( pθ,ε^N(y1:n) ) + (1/(2N)) ( pθ,ε^N(y1:n) )^−2,     (14)

and

r̂t,θ,ε^N = log( (1/N) ∑_{i=1}^{N} W̃t^(i) ) + (1/(2N)) ( (1/N) ∑_{i=1}^{N} W̃t^(i) )^−2     (15)

respectively as suggested in Pitt (2002).
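A small worked instance of the correction in Eqs. 14–15: the average unnormalised weight below is a hypothetical value and the helper name is illustrative, not from the paper.

```python
import numpy as np

# Worked instance of the first-order bias correction in Eqs. 14-15 (as
# suggested in Pitt 2002): the SMC likelihood estimate is unbiased, so its
# logarithm is corrected by a second-order Taylor term. The average
# unnormalised weight below is a hypothetical value, and the helper name is
# illustrative.
def corrected_log_estimate(w_bar, N):
    # log(w_bar) + (1/(2N)) * w_bar^{-2}, with w_bar = (1/N) sum_i W~_t^{(i)}
    return np.log(w_bar) + w_bar ** -2 / (2 * N)

print(corrected_log_estimate(w_bar=0.25, N=1000))  # log(0.25) + 16/2000
```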

Remark 3.1 The parameter ε determines the accuracy of the marginal likelihood of the perturbed HMM compared to that of the original one. At the same time, if ε is very low one may require a high value for M, which can be set adaptively as in Del Moral et al. (2012), Jasra et al. (2012). It is also remarked that a drawback of this algorithm is that when dy grows, with ε and N remaining fixed, one cannot expect the algorithm to work well for every ε. Typically one must increase ε to get reasonable results with moderate computational effort, and this is at the cost of increasing the bias. To maintain ε at a reasonable level, one must consider more sophisticated strategies, which are not investigated here.

Remark 3.2 We note that, after suppressing θ, if the HMM can be written in a state
space model form:

Yt = ξ(Xt , Wt )
Xt = ϕ(Xt−1 , Vt )


Fig. 1 A typical run of the offline parameter estimates obtained by the KF, SMC, and ABC–SMC
for the linear Gaussian HMM, along with the ML estimators for θ

Algorithm 2 SPSA for batch ABC–MLE

• Initialization k = 0. Set $\theta_0$ and choose step size sequences $(a_k)_{k\ge 0}$, $(c_k)_{k\ge 0}$, so that $a_k > 0$, $a_k, c_k \to 0$, $\sum_{k\ge 0} a_k = \infty$, $\sum_{k\ge 0} \frac{a_k^2}{c_k^2} < \infty$.
• For k ≥ 0:
  – For m = 1, ..., $d_\theta$, sample independently $\Delta_k(m)$ from a Bernoulli distribution with success probability 0.5 and support {−1, 1}.
  – Run Algorithm 1 (ABC–SMC) for $\theta_k^+ = \theta_k + c_k\Delta_k$ and $\theta_k^- = \theta_k - c_k\Delta_k$ to obtain $\hat l^N_{\theta_k^+,\epsilon}$ and $\hat l^N_{\theta_k^-,\epsilon}$ respectively.
  – For m = 1, ..., $d_\theta$, update θ(m):

$$\theta_{k+1}(m) = \theta_k(m) + a_k \frac{\hat l^N_{\theta_k^+,\epsilon} - \hat l^N_{\theta_k^-,\epsilon}}{2 c_k \Delta_k(m)}.$$

where $X_0 = x_0 \in \mathsf{X}$ is known, both $(V_n)_{n\ge 1}$ and $(W_n)_{n\ge 0}$ are i.i.d. noise sequences independent of each other, and ξ, φ are appropriate functions. Suppose that one can evaluate:

• the densities of $W_n$ and $V_n$, and sample from the associated distributions,
• ξ and φ point-wise.

Similar to Murray et al. (2011), Yildirim et al. (2013b), one can construct a 'collapsed' ABC approximation

$$\pi_\epsilon(w_{1:n}, v_{1:n}, u_{1:n}|y_{1:n}) \propto \prod_{t=1}^n K_\epsilon\Big(\xi\big(\varphi^{(t)}(x_0, v_{1:t}), w_t\big),\, \xi\big(\varphi^{(t)}(x_0, v_{1:t}), u_t\big)\Big)\, p(w_t)\, p(v_t)\, p(u_t).$$

Hence a version of the SMC algorithm in Algorithm 1 can be derived which does not need to sample from either the dynamics of the data or the transition density of the hidden Markov chain. This representation, however, does not always apply.
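For intuition, one step of such a noise-variable scheme can be sketched as follows; the scalar maps `phi` and `xi`, the Gaussian kernel, and the comparison of the pseudo-observation with the datum are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical scalar model: only point-wise evaluations of phi and xi are needed.
phi = lambda x, v: 0.9 * x + 0.2 * v              # state map x_t = phi(x_{t-1}, v_t)
xi = lambda x, w: x + 0.3 * w                     # observation map y_t = xi(x_t, w_t)
kernel = lambda d, eps: float(np.exp(-0.5 * (d / eps) ** 2))  # Gaussian ABC kernel

def one_step(x_prev, y_t, eps):
    """Propagate by sampling only the noise variables, then weight the simulated
    pseudo-observation against the datum y_t; no transition or observation
    density is ever evaluated."""
    v, u = rng.standard_normal(2)                 # state noise and pseudo-obs noise
    x_t = phi(x_prev, v)
    y_sim = xi(x_t, u)
    return x_t, kernel(y_sim - y_t, eps)

x, w = one_step(x_prev=0.0, y_t=0.5, eps=0.5)
print(x, w)
```

Only draws of the noise terms and point-wise evaluations of ξ, φ are used, which is the point of the collapsed representation.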

3.2 Simultaneous Perturbation Stochastic Approximation (SPSA)

We proceed by describing SPSA as a gradient-free method for off-line or batch ABC–MLE, which can be found in Algorithm 2. This algorithm does not require one to evaluate $g_\theta$ or its gradient. In this context one is interested in estimating θ such that

$$\nabla l_\theta = 0$$

holds, where we have dropped the dependence on $y_{1:n}$ for simplicity. Recall that here we do not have an expression for $\nabla l_\theta$ with which to pursue a standard Robbins–Monro procedure (Benveniste et al. 1990). One way around this would be to use a finite difference approximation to estimate the gradient w.r.t. the m-th element of θ as

$$\frac{\hat l_{\theta + c e_m} - \hat l_{\theta - c e_m}}{2c},$$

where $e_m$ is a unit magnitude vector that is zero in every direction except m, and $\hat l_\bullet$ is an unbiased estimate of $l_\bullet$. To avoid having to perform $2d_\theta$ evaluations of these estimates in total, one for each direction, SPSA was proposed in Spall (1992) so that the gradient update requires only two evaluations. Instead, we perturb θ using $c_k\Delta_k$, where $\Delta_k$ is a $d_\theta$-dimensional zero-mean random vector such that $\mathbb{E}\big[|\Delta_k(m)|^{-1}\big]$, or some higher inverse moment, is bounded. In this case we have used the most popular choice, with each entry of $\Delta_k$ being ±1 Bernoulli distributed, and the estimates $\hat l_\bullet$ are the bias-corrected versions as in Eq. 14. For more details on the conditions and the convergence of this Stochastic Approximation method we refer the reader to Spall (1992), and for useful practical suggestions regarding the implementation to Spall (2003).
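The mechanics of Algorithm 2 can be sketched on a toy surface; here a deterministic quadratic stands in for the noisy SMC estimate $\hat l^N_{\theta,\epsilon}$, and the decay exponents 0.602 and 0.101 follow Spall's practical recommendations rather than the paper's settings:

```python
import numpy as np

def spsa_maximize(loglik, theta0, n_iter=2000, a0=0.1, c0=0.1, seed=1):
    """SPSA ascent: perturb all coordinates simultaneously with a +/-1 Bernoulli
    vector Delta_k and form the gradient estimate from two evaluations only."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for k in range(1, n_iter + 1):
        a_k = a0 / k ** 0.602
        c_k = c0 / k ** 0.101
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        l_plus = loglik(theta + c_k * delta)      # one evaluation at theta_k^+
        l_minus = loglik(theta - c_k * delta)     # one evaluation at theta_k^-
        theta += a_k * (l_plus - l_minus) / (2.0 * c_k * delta)
    return theta

# Toy objective with maximum at (0.2, 0.9).
target = np.array([0.2, 0.9])
print(spsa_maximize(lambda th: -np.sum((th - target) ** 2), [1.0, 0.0]))
```

Note that the per-coordinate division by $\Delta_k(m) \in \{-1, 1\}$ is what makes two evaluations suffice for all $d_\theta$ directions.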

3.2.1 Recursive ML with SPSA

Recall from Eq. 11 in Section 2.2 that the ABC–RML recursion for the parameter is given as

$$\theta_{t+1} = \theta_t + a_{t+1} \nabla r_{\theta_{0:t}}(y_{1:t}).$$

In Algorithm 3 we illustrate how this can be implemented using ABC–SMC. We have extended the RML procedure using both SMC and SPSA that appeared in Poyiadjis et al. (2006) to the case where ABC approximations are used due to the intractability of $\log(g_\theta(y|x))$ and $\nabla\log(g_\theta(y|x))$. In Yildirim et al. (2013b) one can find an alternative approach which implements RML with ABC and does not require the use of SPSA. Although a direct comparison is beyond the scope of this paper, we expect the method in Yildirim et al. (2013b) to be more accurate. On the other hand, Algorithm 3 can possibly be applied to a wider class of models, but the use of SPSA means that we add an additional layer of approximation, and there is a possibility of incurring biases that need to be investigated more thoroughly.

4 Numerical Simulations

We consider two numerical examples that are designed to investigate the accuracy and behavior of our numerical ABC–MLE algorithms. In order to do this, we consider scenarios where $g_\theta$ is a well-behaved density, which we nonetheless avoid computing. In the first example we look at a linear Gaussian model and in the second an HMM involving the Lorenz '63 model (Lorenz 1963).

4.1 Linear Gaussian Model

We consider the following linear Gaussian HMM, with Y = X = R, t ≥ 1:

Yt = Xt + σw Wt
Xt = φ Xt−1 + σv Vt ,

Algorithm 3 RML with ABC–SMC

• Initialization t = 0:
  – Set $\theta_1$ and choose step size sequences $(a_t)_{t\ge 0}$, $(c_t)_{t\ge 0}$, so that $a_t > 0$, $a_t, c_t \to 0$, $\sum_{t\ge 0} a_t = \infty$, $\sum_{t\ge 0} \frac{a_t^2}{c_t^2} < \infty$.
  – For i = 1, ..., N sample independently $x_0^{(i)} \sim \mu_\theta$. Set $W_0^{(i)} = 1/N$.
• For t = 1, ..., n:
  – For m = 1, ..., $d_\theta$, sample independently $\Delta_t(m)$ from a Bernoulli distribution with success probability 0.5 and support {−1, 1}.
  – Set $\theta_t^+ = \theta_t + c_t\Delta_t$ and $\theta_t^- = \theta_t - c_t\Delta_t$. For each value use $\big(x^{(i)}_{0:t-1}, W^{(i)}_{t-1}\big)$ to compute Steps 1 and 2 of Algorithm 1 (ABC–SMC), returning $\big(x^{(i)}_t(\theta_t^+), W^{(i)}_t(\theta_t^+)\big)$ and $\big(x^{(i)}_t(\theta_t^-), W^{(i)}_t(\theta_t^-)\big)$ respectively.
  – Compute $\hat r^N_{t,\theta_t^+,\epsilon}$ and $\hat r^N_{t,\theta_t^-,\epsilon}$ respectively using Eq. 15.
  – Update $\theta_t$. For m = 1, ..., $d_\theta$:

$$\theta_{t+1}(m) = \theta_t(m) + a_t \frac{\hat r^N_{t,\theta_t^+,\epsilon} - \hat r^N_{t,\theta_t^-,\epsilon}}{2 c_t \Delta_t(m)}.$$

  – Compute Steps 1 to 3 of Algorithm 1 (ABC–SMC) using $\theta_{t+1}$ to get $\big(x^{(i)}_{0:t}, W^{(i)}_t\big)$.

with $W_t$, $V_t$ independent and $W_t \overset{\text{i.i.d.}}{\sim} \mathcal N(0, 1)$, $V_t \overset{\text{i.i.d.}}{\sim} \mathcal N(0, 1)$. In the subsequent examples, we will use a simulated dataset obtained with θ = (σ_v, φ, σ_w) = (0.2, 0.9, 0.3), which is the same example as in Poyiadjis et al. (2006).
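For reference, the model above can be simulated and its exact log-likelihood evaluated with a Kalman filter; this is the quantity fed to SPSA in the KF case of Section 4.1.1. A minimal sketch (the initial state variance `p0` and the seed are our own choices):

```python
import numpy as np

def simulate_lg_hmm(n, sigma_v=0.2, phi=0.9, sigma_w=0.3, seed=3):
    """Simulate Y_t = X_t + sigma_w W_t, X_t = phi X_{t-1} + sigma_v V_t."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n + 1)
    y = np.zeros(n)
    for t in range(1, n + 1):
        x[t] = phi * x[t - 1] + sigma_v * rng.standard_normal()
        y[t - 1] = x[t] + sigma_w * rng.standard_normal()
    return y

def kalman_loglik(y, sigma_v, phi, sigma_w, m0=0.0, p0=1.0):
    """Exact log p_theta(y_{1:n}) via the Kalman prediction error decomposition."""
    m, p, ll = m0, p0, 0.0
    for obs in y:
        m_pred = phi * m
        p_pred = phi ** 2 * p + sigma_v ** 2        # predictive state variance
        s = p_pred + sigma_w ** 2                   # predictive observation variance
        ll += -0.5 * (np.log(2 * np.pi * s) + (obs - m_pred) ** 2 / s)
        k = p_pred / s                              # Kalman gain
        m = m_pred + k * (obs - m_pred)
        p = (1 - k) * p_pred
    return ll

y = simulate_lg_hmm(1000)
print(kalman_loglik(y, 0.2, 0.9, 0.3))
```

Passing this function as the objective of an SPSA routine reproduces, in spirit, the KF variant compared in Fig. 1.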

4.1.1 Batch MLE

We begin by considering a short data set of n = 1000 data points. The off-line scenario is the one for which we can expect the best possible performance of the ABC–MLE. If one could not obtain reasonable parameter estimates in this example, one would not expect ABC to be very useful in practice. Recall that in Algorithm 2 (Section 3.2), $\Delta_k(m)$ is the m-th entry of the ±1, zero-mean Bernoulli variable; for the step-sizes we chose $c_k = k^{-0.1}$, $a_k = 1$ for $k < 10^4$, and $a_k = (k - 10^4)^{-0.8}$ for $k \ge 10^4$. In Fig. 1, we compare offline ML estimates for the following cases:

1. Kalman Filtering (KF) for the original HMM is used to compute $\hat l_\theta$ for SPSA,
2. Standard SMC (without ABC) with N = 1000 for the original HMM is used to compute $\hat l_\theta$ for SPSA,
3. ABC–SMC with N = 200, M = 10, ε = 0.1 is used to compute $\hat l_\theta$ for SPSA.

The horizontal lines in Fig. 1 also show the Maximum Likelihood estimates (MLE) obtained from an offline grid search optimization that uses the KF. All procedures seem
to be very accurate at estimating the MLE obtained from the grid search. This allows
us to investigate RML, which is a more challenging problem.

4.1.2 RML

We now consider a larger data set with n = 50,000 data points, simulated with the previously indicated parameter values. We use Algorithm 3 described in Section 3.2. Again we compare the same three procedures outlined above, using fifty independent runs in each case. The standard SMC and ABC–SMC algorithms were employed with the same N, M and ε as in the off-line case. Also, for each case we used the same step-size sequences for SPSA, which were similar to their off-line counterparts in Section 4.1.1. In Fig. 2, we plot the medians and credible intervals for the 5–95 % percentiles of the parameter estimates (across the independent runs). The estimates $\hat\theta_t$ converge after t = 20,000 time steps, with the KF and SMC yielding similarly valued estimates. Note there seems to be an apparent bias in both cases relative to the true parameters (the MLE for the data-set used has been checked to converge to the true parameters by $n = 5 \times 10^4$). A similar bias has appeared in Poyiadjis et al. (2006) for this particular model. The theoretical justification in Spall (1992) applies directly when SPSA is used for off-line MLE (as in Section 4.1.1) with a finite and fixed data-set. For RML the argument to be maximized is an ergodic average (Le Gland and Mevel 1995, 1997; Tadic and Doucet 2005; Tadic 2009), so we believe the bias accumulated here is due to the step-sizes of SPSA decreasing much faster than the gradient to be estimated reaches stationarity. Ideally, one would like to run this algorithm for a much longer n, with slower decreasing step-sizes, and also to delay updating θ until stationarity is reached, but this would make using multiple runs prohibitive. In Poyiadjis et al. (2006) it seemed that this bias was not considerable for other models, such as the popular stochastic volatility model. In any case, it would be useful to examine precisely under what conditions SPSA can be used within RML, but this is beyond the scope of this paper, which puts more emphasis on the relative accuracy of ABC. We also observe increased variance from left to right in Fig. 2, which we attribute to the progressively added randomness of SMC and ABC–SMC respectively. In particular, the expected reduced accuracy of ABC–SMC against SMC is apparent, but the bias does not appear to be substantial (for ABC–SMC) in this particular example.

Fig. 2 Credible intervals for the 5–95 % percentiles and the medians after multiple runs of parameter estimates using RML with KF, SMC, and ABC–SMC for the linear Gaussian HMM. Panels: a Kalman; b Sequential Monte Carlo; c SMC–ABC

4.2 Lorenz '63 Model

4.2.1 Model and Data

We now consider the following non-linear state-space model with $\mathsf X = \mathsf Y = \mathbb R^3$. The original model is such that the hidden process evolves deterministically according to the Lorenz '63 system of ordinary differential equations,

$$\dot X(1) = \sigma_{63}\big(X(2) - X(1)\big)$$
$$\dot X(2) = \rho X(1) - X(2) - X(1)X(3)$$
$$\dot X(3) = X(1)X(2) - \beta X(3),$$

where $X(m)$, $\dot X(m)$ are the m-th components of the state and velocity at any time, respectively. We discretize the model to a discrete-time Markov chain with dynamics

$$X_t = f_t(X_{t-1}) + V_t, \qquad t \ge 1,$$

where $f_t$ is the 4th-order Runge–Kutta approximation of the Lorenz '63 system, $V_t \overset{\text{i.i.d.}}{\sim} \mathcal N(0, \tau I_{d_x})$ and $X_0$ is taken as known. Here τ is used to represent the time-discretization.
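The deterministic part of these dynamics can be sketched directly from the equations above; `lorenz63` is the ODE right-hand side and `f_t` the 4th-order Runge–Kutta step of length τ (the function names are ours):

```python
import numpy as np

def lorenz63(x, sigma63=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz '63 ordinary differential equations."""
    return np.array([
        sigma63 * (x[1] - x[0]),
        rho * x[0] - x[1] - x[0] * x[2],
        x[0] * x[1] - beta * x[2],
    ])

def f_t(x, tau=0.05, **params):
    """One 4th-order Runge-Kutta step of length tau."""
    k1 = lorenz63(x, **params)
    k2 = lorenz63(x + 0.5 * tau * k1, **params)
    k3 = lorenz63(x + 0.5 * tau * k2, **params)
    k4 = lorenz63(x + tau * k3, **params)
    return x + tau / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def step(x_prev, tau=0.05, seed_rng=None, **params):
    """X_t = f_t(X_{t-1}) + V_t with V_t ~ N(0, tau I)."""
    rng = seed_rng or np.random.default_rng()
    return f_t(x_prev, tau=tau, **params) + np.sqrt(tau) * rng.standard_normal(3)

print(f_t(np.ones(3)))
```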
For the observations we use

$$Y_t = H X_t + Q W_t, \qquad t \ge 1,$$

where $W_t \overset{\text{i.i.d.}}{\sim} \mathcal N(0, I_{d_y})$, $W_t$ is independent of $V_t$, and Q is the Cholesky root of a Toeplitz matrix defined by the parameters κ and σ as follows:

$$Q_{ij} = \sigma S\big(\kappa^{-1}\min(|i-j|,\, d_y - |i-j|)\big), \qquad i, j \in \{1, \ldots, d_y\},$$

$$S(z) = \begin{cases} 1 - \dfrac{3}{2}z + \dfrac{1}{2}z^3, & 0 \le z \le 1, \\ 0, & z > 1, \end{cases}$$

and

$$H_{ij} = \begin{cases} \dfrac{1}{2}, & i = j, \\ \dfrac{1}{2}, & i = j - 1, \\ 0, & \text{otherwise.} \end{cases}$$

When θ = (κ, σ, σ_63, ρ, β) = (2.5, 2, 10, 28, 8/3), n = 5,000 and τ = 0.05, a visualization of the Lorenz '63 (hidden) dynamics is shown in Fig. 3a and the associated simulated dataset in Fig. 3b.
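Under our reading of the definitions above (the Toeplitz matrix has entries σS(κ⁻¹min(|i−j|, d_y−|i−j|)) and Q is its lower Cholesky root), the observation noise colouring can be sketched as:

```python
import numpy as np

def taper(z):
    """Compactly supported function S(z) = 1 - 1.5 z + 0.5 z^3 on [0, 1],
    zero for z > 1."""
    return np.where(z <= 1.0, 1.0 - 1.5 * z + 0.5 * z ** 3, 0.0)

def obs_noise_root(d_y=3, kappa=2.5, sigma=2.0):
    """Build the Toeplitz matrix sigma * S(kappa^{-1} min(|i-j|, d_y - |i-j|))
    and return its lower Cholesky root Q."""
    i, j = np.indices((d_y, d_y))
    lag = np.minimum(np.abs(i - j), d_y - np.abs(i - j))
    C = sigma * taper(lag / kappa)
    return np.linalg.cholesky(C)

Q = obs_noise_root()
print(Q)
```

With d_y = 3, κ = 2.5 and σ = 2, all off-diagonal lags equal 1, so the matrix is 2·[[1, 0.432, 0.432], [0.432, 1, 0.432], [0.432, 0.432, 1]], which is positive definite and admits a Cholesky factorization.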
For the simulated data-set in Fig. 3b, and its extension for longer n, in the remainder we will use ABC–SMC to obtain parameter estimates from RML. In the subsequent sub-section we will study the performance of these estimates under different settings. We will use $\hat\theta^{N,M}_{\epsilon,n}$ to denote the estimate of θ at time n that was estimated using N particles, M pseudo-observations and a Gaussian kernel with covariance $\epsilon I_{d_y}$. We will compare the behavior of the algorithm as each of N, M, n, ε varies.

Fig. 3 Evolution of the 3-dimensional Lorenz '63 HMM with n = 5,000: a hidden Markov chain; b observations

4.2.2 Numerical Results

We now examine the performance of the algorithm for N ∈ {100, 1000, 10000}. For each value of N, we ran fifty independent runs of ABC–SMC, using M = 10 and ε = 1. In Fig. 4a–d we plot box-plots of the terminal parameter estimates, $\hat\theta^{N,10}_{1,5000}$, against their true values marked by dotted green lines. In Fig. 4e–h we plot the absolute value of the Monte Carlo (MC) bias (that is, the absolute difference between the estimate and the true value), in red, and the MC standard deviation, in blue. The MC bias and standard deviation points are fitted with least-squares curves proportional to $\frac{1}{\sqrt N}$, the standard MC rate with which the accuracy of the estimates is expected to improve. With regard to the variability of the estimates, one sees the expected reduction in variability as N increases. The bias is harder to quantify; it will not necessarily be the case that the bias falls as N grows. This is because there is a Monte Carlo bias (from the SMC), an optimization bias (from the SPSA), and an approximation bias (from the ABC). Increasing N can only deal with the SMC bias (which for

Fig. 4 $\hat\theta^{N,10}_{1,5000}$ when estimating θ = (κ, σ, σ_63, ρ) of the Lorenz '63 HMM, using ABC–SMC with values of N ∈ {100, 1000, 10000}. a–d show the $\hat\theta^{N,10}_{1,5000}$ in box-plots and their true values in dotted green lines. e–h show the MC bias and MC standard deviation of the $\hat\theta^{N,10}_{1,5000}$, in red and blue, with curves of least squared-error ∝ $\frac{1}{\sqrt N}$

estimates with parameters fixed is $O(N^{-1})$), but the addition of parameter estimation complicates things here. The main point is that, as expected, one obtains significantly more reproducible and consistent results as N grows.
Next we look at the influence of the number of auxiliary observation samples. For M ∈ {1, 3, 5, 10, 25, 50}, we show in Fig. 5a–d the box-plots of the terminal estimates $\hat\theta^{5000,M}_{1,5000}$ from fifty independent runs of ABC–SMC, using N = 5000 and ε = 1. The dotted green lines mark the true θ values which generated the data. In Fig. 5e–h, the MC biases and the MC standard deviations of the $\hat\theta^{5000,M}_{1,5000}$ are plotted as discrete points, in red and blue, with lines of least squared-error fitted around them. As M increases, we see reductions in the MC variance. This reduction in variance can be attributed to the fact that the ABC–SMC algorithm approximates the ideal SMC algorithm that targets the perturbed HMM; hence, by a Rao–Blackwellization type argument, one expects a reduction in variance. These results are consistent with Del Moral et al. (2012). For this example, for M ≥ 5 there seems to be little impact on the accuracy of the parameter estimates, but this is example specific.

We now vary n. For n ∈ {5000, 10,000, 15,000} we ran fifty independent runs of ABC–SMC using N = 200, M = 10, and ε = 1, and plotted box-plots of the terminal estimates $\hat\theta^{200,10}_{1,n}$ in Fig. 6a–d, against the true values of θ marked in dotted green lines. Recall that RML estimation tries to maximize $\frac{1}{n}\log(p_{\theta,\epsilon}(y_{1:n}))$, so we expect n not to have a great effect on either the bias or the variance once it is above some value. This can also be explained by the bias results in Section 2.3 and the theoretical results in Dean et al. (2010), Dean and Singh (2011). In Fig. 6e–h the absolute values of the MC biases and the MC standard deviations have been plotted in red and blue, and fitted with linear lines of least squared-error.

Finally, we investigate the influence of ε ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50}. For each ε, we again ran fifty independent runs of ABC–SMC with N = 200 and M = 10, for the dataset with n = 5,000. The box-plots of the parameter estimates are plotted, in Fig. 7a–d, against dotted green lines which indicate the true θ. Figure 7e–h show

Fig. 5 $\hat\theta^{5000,M}_{1,5000}$ when estimating θ = (κ, σ, σ_63, ρ) of the Lorenz '63 HMM, using ABC–SMC with values of M ∈ {1, 3, 5, 10, 25, 50}. a–d show the $\hat\theta^{5000,M}_{1,5000}$ in box-plots and their true values in dotted green lines. e–h show the MC bias and MC standard deviation of the $\hat\theta^{5000,M}_{1,5000}$, in red and blue, with lines of least squared-error

Fig. 6 $\hat\theta^{200,10}_{1,n}$ when using ABC–SMC to estimate θ = (κ, σ, σ_63, ρ) of the Lorenz '63 HMM, for datasets of length n ∈ {5000, 10000, 15000}. a–d show the $\hat\theta^{200,10}_{1,n}$ in box-plots and their true values in dotted green lines. e–h show the MC bias and MC standard deviation of the $\hat\theta^{200,10}_{1,n}$, in red and blue, with lines of least squared-error

the absolute values of the MC biases in red, and the MC standard deviations in blue. Fitted to the MC biases is a non-linear least squares curve proportional to $\epsilon + \frac{1}{\epsilon}$. The result we presented in Section 2.3 states that as ε increases, the bias will increase at O(ε); hence the term proportional to ε in the fitted curve. However, the ABC–SMC algorithm becomes less stable for ε too small (in the sense that, for example, the variance of the weights will become larger as ε decreases), incurring more varied estimates. We conjecture this will affect biases according to a term proportional to
Fig. 7 $\hat\theta^{200,10}_{\epsilon,5000}$ when estimating θ = (κ, σ, σ_63, ρ) of the Lorenz '63 HMM, using ABC–SMC with values of ε ∈ {1, 2, 3, ..., 10, 50}. a–d show the parameter estimates in box-plots and their true values in dotted green lines. e–h show the MC biases and their curves of non-linear least squared-error proportional to $\epsilon + \frac{1}{\epsilon}$ in red, and the MC standard deviations with their curves of non-linear least squared-error proportional to $\frac{1}{\epsilon}$ in blue
$\frac{1}{\epsilon}$. Similarly, we fitted to the MC standard deviations non-linear least squares curves proportional to $\frac{1}{\epsilon}$, and note that the MC standard deviation decreases at this rate as ε increases.
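Since bias(ε) ≈ aε + b/ε and stdev(ε) ≈ c/ε are linear in their coefficients, the curve fits described here reduce to ordinary least squares. A sketch on synthetic Monte Carlo summaries (the data values are invented for illustration):

```python
import numpy as np

def fit_bias_curve(eps, bias):
    """Least-squares fit of bias(eps) ~ a*eps + b/eps, linear in (a, b)."""
    X = np.column_stack([eps, 1.0 / eps])
    coef, *_ = np.linalg.lstsq(X, bias, rcond=None)
    return coef

def fit_stdev_curve(eps, sd):
    """Least-squares fit of stdev(eps) ~ c/eps; closed-form slope through origin."""
    x = 1.0 / eps
    return float(x @ sd / (x @ x))

eps = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50], dtype=float)
rng = np.random.default_rng(4)
bias = 0.05 * eps + 2.0 / eps + 0.01 * rng.standard_normal(eps.size)
a, b = fit_bias_curve(eps, bias)
print(a, b)   # coefficients recovered near (0.05, 2.0)
```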

5 Conclusions

In this article we have presented how to perform ML parameter estimation using ABC–SMC and SPSA for HMMs. For batch MLE the method appears to be very accurate when well-selected step-sizes are used. In the on-line case (RML), the method again appears to be sensitive to the tuning of the step-sizes, and for moderately long runs one should expect a bias, which in the examples here and in Poyiadjis et al. (2006) seems small but not negligible. We believe this bias is due to using SPSA within another Stochastic Approximation algorithm, i.e. the RML. A theoretical investigation identifying the source of this bias would be an interesting extension of our work. Furthermore, besides the obvious case when the likelihood in the HMM is intractable, these ideas could also be useful for models where the parameter and observations are of moderate dimension and the state-dimension is high. Such models have wide application in data assimilation and numerical weather prediction. In addition, the work presented here is closely related to Yildirim et al. (2013b), where, following a representation similar to Remark 3.2, the authors provide an RML algorithm without using SPSA and also show how on-line Expectation-Maximization techniques like those of Del Moral et al. (2009), Yildirim et al. (2013a) are relevant for ABC–MLE for HMMs. We conclude by mentioning that current ongoing work is trying to address the limitations in efficiency of the presented ABC–SMC algorithm when small ε is used. Two potential ways to proceed are (1) to introduce approximations by the expectation-propagation algorithm in Barthelmé and Chopin (2011), potentially removing SMC, and (2) to consider combining ABC with more advanced SMC approaches such as Del Moral et al. (2006) to allow the use of much lower ε.

Acknowledgements We thank the referee for comments that have vastly improved the paper.
We also acknowledge useful discussions on this material with Sumeetpal Singh. The second author
was funded by an MOE grant and acknowledges useful conversations with David Nott. The third
author acknowledges support from EPSRC under grant EP/J01365X/1 since July 2012 and under the
programme grant on Control For Energy and Sustainability EP/G066477/1 during earlier stages of
this work when he was employed at Imperial College.

Appendix A: Notations

We first introduce some notation. We alert the reader that throughout the appendix k is used as a time index, instead of the t used earlier. As our analysis will rely upon that in Tadic and Doucet (2005), our notation will follow that article. It is remarked that under our assumptions, one can establish the same results as in Tadic and Doucet (2005). Moreover, the time-inhomogeneous upper-bounds in that paper can be made time-homogeneous (albeit less tight) under our assumptions. In addition, our proof strategy follows ideas in the expanded technical report of Andrieu et al. (2005).

$B_b(\mathsf X)$ is the class of bounded, real-valued measurable functions on $\mathsf X$. Throughout, for $\varphi \in B_b(\mathsf X)$, $\|\varphi\|_\infty := \sup_{x \in \mathsf X}|\varphi(x)|$. For $\varphi \in B_b(\mathsf X)$ and any operator $Q : \mathsf X \to \mathcal M(\mathsf X)$, $Q(\varphi)(x) := \int_{\mathsf X} \varphi(y) Q(x, dy)$. In addition, for $\mu_\theta \in \mathcal M(\mathsf X)$, $\mu_\theta Q(\varphi) := \int_{\mathsf X} \mu_\theta(dx) Q(\varphi)(x)$.

We introduce the non-negative operator

$$R_{\theta,n}(x, dx') := g_\theta(y_n|x') f_\theta(x'|x)\,dx'$$

with the ABC equivalent $R_{\theta,\epsilon,n}(x, dx') := g_{\theta,\epsilon}(y_n|x') f_\theta(x'|x)\,dx'$, where $g_{\theta,\epsilon}(y|x) = \int_{A_{\epsilon,y}} g_\theta(u|x)\,du \big/ \int_{A_{\epsilon,y}} du$. To keep consistency with Tadic and Doucet (2005), and to allow the reader to follow the proofs, we note that the filter at time $n \ge 0$, $F^n_\theta(\mu_\theta)$ (respectively the ABC filter at time n, $F^n_{\theta,\epsilon}(\mu_\theta)$), is exactly, with initial distribution $\mu_\theta \in \mathcal P(\mathsf X)$ and test function $\varphi \in B_b(\mathsf X)$,

$$F^n_\theta(\mu_\theta)(\varphi) = \frac{\mu_\theta R_{1,n,\theta}(\varphi)}{\mu_\theta R_{1,n,\theta}(1)}$$

respectively

$$F^n_{\theta,\epsilon}(\mu_\theta)(\varphi) = \frac{\mu_\theta R_{1,n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{1,n,\theta,\epsilon}(1)}$$

where $F^0_\theta(\mu_\theta) = F^0_{\theta,\epsilon}(\mu_\theta) = \mu_\theta$ and $R_{1,n,\theta}(\varphi)(x_0) = \int \prod_{k=1}^n R_{k,\theta}(x_{k-1}, dx_k)\varphi(x_n)$. In addition, we write the filter derivatives as $\widehat F^n_\theta(\mu_\theta, \widehat\mu_\theta)(\varphi)$, $\widehat F^n_{\theta,\epsilon}(\mu_\theta, \widehat\mu_\theta)(\varphi)$, where the second argument is the gradient of the initial measure.

The following operators will be used below, for $n \ge 1$:

$$\widehat G^n_\theta(\mu_\theta, \widehat\mu_\theta)(\varphi) := (\mu_\theta R_{1,n,\theta}(1))^{-1}\big[\widehat\mu_\theta R_{1,n,\theta}(\varphi) - \widehat\mu_\theta R_{1,n,\theta}(1) F^n_\theta(\mu_\theta)(\varphi)\big] \qquad (16)$$

$$\widehat H^n_\theta(\mu_\theta)(\varphi) := \big(F^{n-1}_\theta(\mu_\theta) R_{n,\theta}(1)\big)^{-1}\big[F^{n-1}_\theta(\mu_\theta)\widehat R_{n,\theta}(\varphi) - F^{n-1}_\theta(\mu_\theta)\widehat R_{n,\theta}(1) F^n_\theta(\mu_\theta)(\varphi)\big] \qquad (17)$$

with the convention $\widehat G^0_\theta(\mu_\theta, \widehat\mu_\theta)(\varphi) = \widehat\mu_\theta(\varphi)$. In addition, we set

$$\widehat G^{(n)}_\theta(\mu_\theta, \widehat\mu_\theta)(\varphi) := (\mu_\theta R_{n,\theta}(1))^{-1}\big[\widehat\mu_\theta R_{n,\theta}(\varphi) - \widehat\mu_\theta R_{n,\theta}(1) F^{(n)}_\theta(\mu_\theta)(\varphi)\big],$$

where $F^{(n)}_\theta(\mu_\theta) = \mu_\theta R_{n,\theta}/\mu_\theta R_{n,\theta}(1)$. Finally, an important notational convention is as follows: throughout we use C to denote a constant whose value may change from line to line in the calculations. This constant will typically not depend upon important parameters such as ε and n, and any important dependencies will be highlighted.

Appendix B: Bias of the Log-Likelihood

Proof (Proof of Proposition 2.1) We begin with the equality

$$\log(p_\theta(y_{1:n})) - \log(p_{\theta,\epsilon}(y_{1:n})) = \sum_{k=1}^n \big[\log(p_\theta(y_k|y_{1:k-1})) - \log(p_{\theta,\epsilon}(y_k|y_{1:k-1}))\big] \qquad (18)$$

with, for $1 \le k \le n$,

$$p_\theta(y_k|y_{1:k-1}) = \int_{\mathsf X^2} g_\theta(y_k|x_k) f_\theta(x_k|x_{k-1}) F^{k-1}_\theta(\mu_\theta)(dx_{k-1})\,dx_k$$

$$p_{\theta,\epsilon}(y_k|y_{1:k-1}) = \int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) F^{k-1}_{\theta,\epsilon}(\mu_\theta)(dx_{k-1})\,dx_k.$$

We will consider each summand in Eq. 18. Only the case $k \ge 2$ is considered; the scenario $k = 1$ follows a similar and simpler argument.

Using the inequality $|\log(x) - \log(y)| \le |x - y|/(x \wedge y)$ for every $x, y > 0$, we have

$$|\log(p_\theta(y_k|y_{1:k-1})) - \log(p_{\theta,\epsilon}(y_k|y_{1:k-1}))| \le \frac{|p_\theta(y_k|y_{1:k-1}) - p_{\theta,\epsilon}(y_k|y_{1:k-1})|}{p_\theta(y_k|y_{1:k-1}) \wedge p_{\theta,\epsilon}(y_k|y_{1:k-1})}.$$

Note that

$$p_\theta(y_k|y_{1:k-1}) \wedge p_{\theta,\epsilon}(y_k|y_{1:k-1}) = \int_{\mathsf X^2} g_\theta(y_k|x_k) f_\theta(x_k|x_{k-1}) F^{k-1}_\theta(\mu_\theta)(dx_{k-1})\,dx_k \,\wedge\, \int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) F^{k-1}_{\theta,\epsilon}(\mu_\theta)(dx_{k-1})\,dx_k \,\ge\, C > 0 \qquad (19)$$

where we have applied (A3) and C does not depend upon ε. Thus we consider

$$|p_{\theta,\epsilon}(y_k|y_{1:k-1}) - p_\theta(y_k|y_{1:k-1})| = \bigg|\int_{\mathsf X^2} g_\theta(y_k|x_k) f_\theta(x_k|x_{k-1}) F^{k-1}_\theta(\mu_\theta)(dx_{k-1})\,dx_k - \int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) F^{k-1}_{\theta,\epsilon}(\mu_\theta)(dx_{k-1})\,dx_k\bigg|.$$

The R.H.S. can be upper-bounded by the sum of

$$\bigg|\int_{\mathsf X^2} [g_\theta(y_k|x_k) - g_{\theta,\epsilon}(y_k|x_k)] f_\theta(x_k|x_{k-1}) F^{k-1}_\theta(\mu_\theta)(dx_{k-1})\,dx_k\bigg|$$

and

$$\bigg|\int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) \big[F^{k-1}_\theta(\mu_\theta)(dx_{k-1}) - F^{k-1}_{\theta,\epsilon}(\mu_\theta)(dx_{k-1})\big]\,dx_k\bigg|.$$

The first expression can be dealt with by using (A1), which implies

$$\sup_{x \in \mathsf X} |g_\theta(y_k|x) - g_{\theta,\epsilon}(y_k|x)| \le C\epsilon. \qquad (20)$$

The second expression can be controlled by Jasra et al. (2012, Theorem 2):

$$\sup_{k \ge 1} \|F^{k-1}_\theta(\mu_\theta) - F^{k-1}_{\theta,\epsilon}(\mu_\theta)\| \le C\epsilon \qquad (21)$$

to yield that

$$|p_{\theta,\epsilon}(y_k|y_{1:k-1}) - p_\theta(y_k|y_{1:k-1})| \le C\epsilon. \qquad (22)$$

One can thus conclude. □

Appendix C: Bias of the Gradient of the Log-Likelihood

Proof (Proof of Theorem 2.1) We have that

$$\nabla\big[\log p_\theta(y_{1:n}) - \log p_{\theta,\epsilon}(y_{1:n})\big] = \sum_{k=1}^n \nabla\big[\log p_\theta(y_k|y_{1:k-1}) - \log p_{\theta,\epsilon}(y_k|y_{1:k-1})\big].$$

It then follows that

$$\nabla\big[\log p_\theta(y_{1:n}) - \log p_{\theta,\epsilon}(y_{1:n})\big] = \sum_{k=1}^n \bigg\{\frac{\nabla p_\theta(y_k|y_{1:k-1}) - \nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})}{p_\theta(y_k|y_{1:k-1})} + \frac{\nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})}{p_\theta(y_k|y_{1:k-1})\, p_{\theta,\epsilon}(y_k|y_{1:k-1})}\big[p_{\theta,\epsilon}(y_k|y_{1:k-1}) - p_\theta(y_k|y_{1:k-1})\big]\bigg\}. \qquad (23)$$

We will deal with the two terms on the R.H.S. of Eq. 23 in turn. Only the scenario $k \ge 2$ is considered; the case $k = 1$ follows a similar and simpler argument.

We start with the summand

$$\frac{\nabla p_\theta(y_k|y_{1:k-1}) - \nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})}{p_\theta(y_k|y_{1:k-1})}.$$

Noting Eq. 19, we need only upper-bound the $L_1$ norm of the following expression:

$$\int_{\mathsf X^2} \nabla\{g_\theta(y_k|x_k)\} f_\theta(x_k|x_{k-1}) F^{k-1}_\theta(\mu_\theta)(dx_{k-1})\,dx_k - \int_{\mathsf X^2} \nabla\{g_{\theta,\epsilon}(y_k|x_k)\} f_\theta(x_k|x_{k-1}) F^{k-1}_{\theta,\epsilon}(\mu_\theta)(dx_{k-1})\,dx_k \qquad (24)$$

$$+ \int_{\mathsf X^2} g_\theta(y_k|x_k) \nabla\{f_\theta(x_k|x_{k-1})\} F^{k-1}_\theta(\mu_\theta)(dx_{k-1})\,dx_k - \int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) \nabla\{f_\theta(x_k|x_{k-1})\} F^{k-1}_{\theta,\epsilon}(\mu_\theta)(dx_{k-1})\,dx_k \qquad (25)$$

$$+ \int_{\mathsf X^2} g_\theta(y_k|x_k) f_\theta(x_k|x_{k-1}) \widehat F^{k-1}_\theta(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\,dx_k - \int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) \widehat F^{k-1}_{\theta,\epsilon}(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\,dx_k. \qquad (26)$$

We start with Eq. 24. Using (A4) we can establish that for each $k \ge 1$

$$\sup_{x \in \mathsf X} |\nabla\{g_\theta(y_k|x)\} - \nabla\{g_{\theta,\epsilon}(y_k|x)\}| \le C\epsilon \qquad (27)$$

where C does not depend upon k, ε. Hence

$$\bigg|\int_{\mathsf X^2} \big[\nabla\{g_\theta(y_k|x_k)\} - \nabla\{g_{\theta,\epsilon}(y_k|x_k)\}\big] f_\theta(x_k|x_{k-1}) F^{k-1}_\theta(\mu_\theta)(dx_{k-1})\,dx_k\bigg| \le C\epsilon.$$

Then we note that, by Jasra et al. (2012, Theorem 2) (see Eq. 21) and (A5),

$$\bigg|\int_{\mathsf X^2} \nabla\{g_{\theta,\epsilon}(y_k|x_k)\} f_\theta(x_k|x_{k-1}) \big[F^{k-1}_\theta(\mu_\theta)(dx_{k-1}) - F^{k-1}_{\theta,\epsilon}(\mu_\theta)(dx_{k-1})\big]\,dx_k\bigg| \le C\epsilon.$$

Thus we have shown that the expression in Eq. 24 is upper-bounded by $C\epsilon$ in absolute value. Now, moving onto Eq. 25, by Eq. 20 we have

$$\bigg|\int_{\mathsf X^2} [g_\theta(y_k|x_k) - g_{\theta,\epsilon}(y_k|x_k)] \nabla\{f_\theta(x_k|x_{k-1})\} F^{k-1}_\theta(\mu_\theta)(dx_{k-1})\,dx_k\bigg| \le C\epsilon,$$

and we can again use Jasra et al. (2012, Theorem 2) (i.e. Eq. 21) to deduce that

$$\bigg|\int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) \nabla\{f_\theta(x_k|x_{k-1})\} \big[F^{k-1}_\theta(\mu_\theta)(dx_{k-1}) - F^{k-1}_{\theta,\epsilon}(\mu_\theta)(dx_{k-1})\big]\,dx_k\bigg| \le C\epsilon,$$

and thus that the expression in Eq. 25 is also upper-bounded by $C\epsilon$. We now move onto Eq. 26, which is upper-bounded by

$$\bigg|\int_{\mathsf X^2} [g_\theta(y_k|x_k) - g_{\theta,\epsilon}(y_k|x_k)] f_\theta(x_k|x_{k-1}) \widehat F^{k-1}_\theta(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\,dx_k\bigg| + \bigg|\int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) \big[\widehat F^{k-1}_\theta(\mu_\theta, \widehat\mu_\theta)(dx_{k-1}) - \widehat F^{k-1}_{\theta,\epsilon}(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\big]\,dx_k\bigg|.$$

For the first expression, we can write

$$\Big(\sup_{x \in \mathsf X} |g_\theta(y_k|x) - g_{\theta,\epsilon}(y_k|x)|\Big) \bigg|\int_{\mathsf X} \bigg(\int_{\mathsf X} \frac{g_\theta(y_k|x_k) - g_{\theta,\epsilon}(y_k|x_k)}{\sup_{x \in \mathsf X} |g_\theta(y_k|x) - g_{\theta,\epsilon}(y_k|x)|}\, f_\theta(x_k|x_{k-1})\,dx_k\bigg) \widehat F^{k-1}_\theta(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\bigg|.$$

Then we can apply Eq. 20 and, noting that

$$\int_{\mathsf X} \frac{g_\theta(y_k|x_k) - g_{\theta,\epsilon}(y_k|x_k)}{\sup_{x \in \mathsf X} |g_\theta(y_k|x) - g_{\theta,\epsilon}(y_k|x)|}\, f_\theta(x_k|x_{k-1})\,dx_k \le 1,$$

one can also use Lemma 5.3 to deduce that

$$\bigg|\int_{\mathsf X^2} [g_\theta(y_k|x_k) - g_{\theta,\epsilon}(y_k|x_k)] f_\theta(x_k|x_{k-1}) \widehat F^{k-1}_\theta(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\,dx_k\bigg| \le C\epsilon(1 + \|\widehat\mu_\theta\|).$$

Then, one can easily apply Theorem 5.1 to show that

$$\bigg|\int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) \big[\widehat F^{k-1}_\theta(\mu_\theta, \widehat\mu_\theta)(dx_{k-1}) - \widehat F^{k-1}_{\theta,\epsilon}(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\big]\,dx_k\bigg| \le C\epsilon(2 + \|\widehat\mu_\theta\|).$$

Thus we have upper-bounded the $L_1$-norm of the sum of the expressions 24–26 and we have established that

$$\bigg\|\frac{\nabla p_\theta(y_k|y_{1:k-1}) - \nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})}{p_\theta(y_k|y_{1:k-1})}\bigg\| \le C\epsilon(2 + \|\widehat\mu_\theta\|). \qquad (28)$$

Moving onto the second summand on the R.H.S. of Eq. 23,

$$\frac{\nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})}{p_\theta(y_k|y_{1:k-1})\, p_{\theta,\epsilon}(y_k|y_{1:k-1})}\big[p_{\theta,\epsilon}(y_k|y_{1:k-1}) - p_\theta(y_k|y_{1:k-1})\big].$$

By Eq. 22, we need only consider upper-bounding, in $L_1$, $\nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})$. This can be decomposed into the sum of three expressions:

$$\int_{\mathsf X^2} \nabla\{g_{\theta,\epsilon}(y_k|x_k)\} f_\theta(x_k|x_{k-1}) F^{k-1}_{\theta,\epsilon}(\mu_\theta)(dx_{k-1})\,dx_k,$$

$$\int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) \nabla\{f_\theta(x_k|x_{k-1})\} F^{k-1}_{\theta,\epsilon}(\mu_\theta)(dx_{k-1})\,dx_k,$$

and

$$\int_{\mathsf X^2} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1}) \widehat F^{k-1}_{\theta,\epsilon}(\mu_\theta, \widehat\mu_\theta)(dx_{k-1})\,dx_k.$$

As $\nabla\{g_{\theta,\epsilon}(y_k|x_k)\}$ and $g_{\theta,\epsilon}(y_k|x_k)\nabla\{f_\theta(x_k|x_{k-1})\}$ are upper-bounded, as well as $\mathsf X$ being compact, the first two expressions are upper-bounded in $L_1$. In addition, as $\int_{\mathsf X} g_{\theta,\epsilon}(y_k|x_k) f_\theta(x_k|x_{k-1})\,dx_k$ is upper-bounded, we can apply Lemma 5.3 to see that the third expression is upper-bounded in $L_1$. Hence, we have shown that

$$\bigg\|\frac{\nabla p_{\theta,\epsilon}(y_k|y_{1:k-1})}{p_\theta(y_k|y_{1:k-1})\, p_{\theta,\epsilon}(y_k|y_{1:k-1})}\big[p_{\theta,\epsilon}(y_k|y_{1:k-1}) - p_\theta(y_k|y_{1:k-1})\big]\bigg\| \le C\epsilon(1 + \|\widehat\mu_\theta\|). \qquad (29)$$

Combining the results of Eqs. 28 and 29 and noting Eq. 23, we can conclude. □


Appendix D: Bias of the Gradient of the Filter

Theorem 5.1 Assume (A1–A5). Then there exists a $C < +\infty$ such that for any $n \ge 1$, $\mu_\theta \in \mathcal P(\mathsf X)$, $\widehat\mu_\theta \in \mathcal M(\mathsf X)$, $\epsilon > 0$, $\theta \in \Theta$:

$$\big\|\widehat F^n_\theta(\mu_\theta, \widehat\mu_\theta) - \widehat F^n_{\theta,\epsilon}(\mu_\theta, \widehat\mu_\theta)\big\| \le C\epsilon(2 + \|\widehat\mu_\theta\|).$$

Proof We have the following telescoping sum decomposition (e.g. Del Moral 2004) for the differences in the filters, with $\varphi \in B_b(\mathsf X)$:

$$F^n_\theta(\mu_\theta)(\varphi) - F^n_{\theta,\epsilon}(\mu_\theta)(\varphi) = \sum_{p=1}^n \Big[F^{n-p+1,n}_\theta\big(F^{n-p}_{\theta,\epsilon}(\mu_\theta)\big)(\varphi) - F^{n-p+2,n}_\theta\big(F^{n-p+1}_{\theta,\epsilon}(\mu_\theta)\big)(\varphi)\Big]$$

where we are using the notation $F^{q,n}_\theta(\mu_\theta)(\varphi) = \frac{\mu_\theta R_{q,n,\theta}(\varphi)}{\mu_\theta R_{q,n,\theta}(1)}$, for $1 \le q \le n$. Hence, taking gradients and swapping the order of summation and differentiation, and omitting the second arguments of $\widehat F$ on the R.H.S. (to reduce the notational burden), we have

$$\widehat F^n_\theta(\mu_\theta, \widehat\mu_\theta)(\varphi) - \widehat F^n_{\theta,\epsilon}(\mu_\theta, \widehat\mu_\theta)(\varphi) = \sum_{p=1}^n \Big[\widehat F^{n-p+2,n}_\theta\big(F^{(n-p+1)}_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\, \widehat F^{(n-p+1)}_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big)(\varphi) - \widehat F^{n-p+2,n}_\theta\big(F^{(n-p+1)}_{\theta,\epsilon}[F^{(n-p)}_{\theta,\epsilon}(\mu_\theta)],\, \widehat F^{(n-p+1)}_{\theta,\epsilon}[F^{(n-p)}_{\theta,\epsilon}(\mu_\theta)]\big)(\varphi)\Big]. \qquad (30)$$

To continue with the proof we will adopt Tadic and Doucet (2005, Lemma 6.4):

$$\widehat F^n_\theta(\mu_\theta, \widehat\mu_\theta)(\varphi) = \widehat G^n_\theta(\mu_\theta, \widehat\mu_\theta)(\varphi) + \sum_{q=1}^n \widehat G^{q+1,n}_\theta\big(F^q_\theta(\mu_\theta), \widehat H^q_\theta(\mu_\theta)\big)(\varphi)$$

with $\widehat G^n_\theta$ and $\widehat H^q_\theta(\mu_\theta)$ defined in Eqs. 16 and 17, $\widehat G^{q+1,n}_\theta$ a similar extension to the notation as for the filter $F^{q,n}_\theta$, and the convention $\widehat G^{n+1,n}_\theta(\mu_\theta, \widehat\mu_\theta) = \widehat\mu_\theta$. Returning to Eq. 30, and again omitting the second arguments of $\widehat F$ on the R.H.S.:

$$\widehat F^n_\theta(\mu_\theta, \widehat\mu_\theta)(\varphi) - \widehat F^n_{\theta,\epsilon}(\mu_\theta, \widehat\mu_\theta)(\varphi) = \sum_{p=1}^n \bigg\{\widehat G^{n-p+2,n}_\theta\big\{F^{(n-p+1)}_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta)),\, \widehat F^{(n-p+1)}_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))\big\}(\varphi) - \widehat G^{n-p+2,n}_\theta\big\{F^{(n-p+1)}_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\, \widehat F^{(n-p+1)}_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi) + \sum_{q=n-p+2}^n \Big[\widehat G^{q+1,n}_\theta\big\{F^{n-p+2,q}_\theta[F^{(n-p+1)}_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\, \widehat H^{n-p+2,q}_\theta[F^{(n-p+1)}_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi) - \widehat G^{q+1,n}_\theta\big\{F^{n-p+2,q}_\theta[F^{(n-p+1)}_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\, \widehat H^{n-p+2,q}_\theta[F^{(n-p+1)}_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)\Big]\bigg\}. \qquad (31)$$

We start first with the summand on the R.H.S. of the second line of Eq. 31, which we compactly denote as:
$$\bar G^{p-1}_\theta\big\{F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)-\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi).$$
This can be decomposed further into the sum of
$$\bar G^{p-1}_\theta\big\{F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)-\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)\quad(32)$$
and
$$\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)-\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi).\quad(33)$$
Beginning with Eq. 32, by Tadic and Doucet (2005, Lemma 6.7, Eq. 43) we have
$$\Big|\bar G^{p-1}_\theta\big\{F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)-\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{p-1}\,\big\|F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]-F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\|\,\big\|\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\|$$
where $\rho\in(0,1)$ and $C$ do not depend upon $\mu_\theta$, $\epsilon$ or $n,p$. Applying Lemma 5.2 we have
$$\Big|\bar G^{p-1}_\theta\big\{F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)-\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{p-1}\epsilon\,\big\|\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\|$$
where $C$ does not depend upon $\mu_\theta$, $\epsilon$ or $n,p$. Then, by Remark 5.1 and Lemma 5.3, $\|\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\|\le C(2+\|\bar\mu_\theta\|)$ and thus we have the upper-bound on the $L_1$-norm of Eq. 32:
$$\Big|\bar G^{p-1}_\theta\big\{F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)-\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{p-1}\epsilon\,(2+\|\bar\mu_\theta\|).\quad(34)$$
Now, moving onto Eq. 33, by Tadic and Doucet (2005, Lemma 6.7, Eq. 42):
$$\Big|\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)-\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)\Big|\le C\rho^{p-1}\|\varphi\|_\infty\,\big\|\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]-\bar F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\|.$$
Applying Lemma 5.1:
$$\Big|\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)-\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{p-1}\epsilon\,\big(1+\|\bar F^{n-p}_{\theta,\epsilon}(\mu_\theta)\|\big).$$
Then by Lemma 5.3, we deduce that
$$\Big|\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)-\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{p-1}\epsilon\,(2+\|\bar\mu_\theta\|).\quad(35)$$
Combining Eqs. 34 and 35:
$$\Big|\bar G^{p-1}_\theta\big\{F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_\theta[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)-\bar G^{p-1}_\theta\big\{F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)],\,\bar F_{\theta,\epsilon}[F^{n-p}_{\theta,\epsilon}(\mu_\theta)]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{p-1}\epsilon\,(2+\|\bar\mu_\theta\|).\quad(36)$$
We now consider the summands over $q$ in the second and third lines of Eq. 31. Again adopting the compact notation above, we can decompose the summands over $q$ into the sum of
$$\bar G^{n-q}_\theta\big\{F^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)-\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)\quad(37)$$
and
$$\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)-\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)\quad(38)$$
where $s=q-n+p-1$. We start with Eq. 37; by Tadic and Doucet (2005, Lemma 6.7, Eq. 43) we have
$$\Big|\bar G^{n-q}_\theta\big\{F^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)-\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{n-q}\,\big\|F^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]-F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\|\,\big\|\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\|.$$
Then we will use the stability of the filter (e.g. Tadic and Doucet 2005, Theorem 3.1):
$$\big\|F^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]-F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\|\le C\rho^s\,\big\|F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))-F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))\big\|.$$
By Lemma 5.2, $\|F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))-F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))\|\le C\epsilon$ and thus
$$\Big|\bar G^{n-q}_\theta\big\{F^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)-\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{p-1}\epsilon\,\big\|\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\|.$$
By Tadic and Doucet (2005, Lemma 6.8) we have $\|\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\|\le C$, where $C$ does not depend upon $F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))$ or $\epsilon$, and hence
$$\Big|\bar G^{n-q}_\theta\big\{F^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)-\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{p-1}\epsilon.$$

Now, turning to Eq. 38 and applying Tadic and Doucet (2005, Lemma 6.7, Eq. 42), we have
$$\Big|\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)-\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{n-q}\,\big\|\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]-\bar H^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\|.\quad(39)$$
Then by Tadic and Doucet (2005, Lemma 6.8) we have
$$\big\|\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]-\bar H^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\|\le C\rho^s\,\big\|F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))-F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))\big\|$$
and then, on applying Lemma 5.2, we thus have that
$$\big\|\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]-\bar H^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\|\le C\rho^s\epsilon.$$
Returning to Eq. 39, it follows by the above calculations that:
$$\Big|\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)-\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{p-1}\epsilon.$$

Thus we have proved that
$$\Big|\bar G^{n-q}_\theta\big\{F^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_\theta(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)-\bar G^{n-q}_\theta\big\{F^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))],\,\bar H^s_\theta[F_{\theta,\epsilon}(F^{n-p}_{\theta,\epsilon}(\mu_\theta))]\big\}(\varphi)\Big|\le C\|\varphi\|_\infty\rho^{p-1}\epsilon.\quad(40)$$
Then, returning to Eq. 31 and noting Eqs. 36 and 40, we have the upper-bound
$$\big\|\bar F^n_\theta(\mu_\theta,\bar\mu_\theta)-\bar F^n_{\theta,\epsilon}(\mu_\theta,\bar\mu_\theta)\big\|\le C\epsilon\,(2+\|\bar\mu_\theta\|)\sum_{p=1}^n\Big(\rho^{p-1}+\sum_{q=n-p+2}^n\rho^{p-1}\Big)\le C\epsilon\,(2+\|\bar\mu_\theta\|).$$
□
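As a sanity check on this type of bound, one can approximate the filter derivative by finite differences on a small finite-state model and watch the discrepancy shrink as the perturbation vanishes. A rough Python sketch (the transition matrix, the feature table `h`, and the additive `eps` offset standing in for the ABC perturbation are all invented for illustration; this is not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 6
f = rng.random((d, d)); f /= f.sum(axis=1, keepdims=True)  # transition matrix (invented)
h = rng.random((n + 1, d))                                 # observation features (invented)
mu0 = np.full(d, 1.0 / d)

def filt(theta, eps):
    """Filter after n steps; eps > 0 adds an ABC-style likelihood perturbation."""
    nu = mu0.copy()
    for k in range(1, n + 1):
        g = np.exp(theta * h[k]) + eps
        nu = (nu @ f) * g
        nu /= nu.sum()
    return nu

def filt_grad(theta, eps, delta=1e-6):
    # central finite-difference approximation of the filter derivative in theta
    return (filt(theta + delta, eps) - filt(theta - delta, eps)) / (2 * delta)

exact_grad = filt_grad(0.5, 0.0)
errs = [np.abs(filt_grad(0.5, eps) - exact_grad).sum() for eps in (0.2, 0.1, 0.05)]
print(errs)  # the filter-derivative bias shrinks as eps decreases
```

The finite-difference step and the form of the perturbation are arbitrary choices; the point is only the qualitative behaviour of the bias as $\epsilon\to0$.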




Appendix D.1: Technical Results for ABC Bias of the Filter-Derivative

Lemma 5.1 Assume (A1–A5). Then there exists a $C<+\infty$ such that for any $n\ge1$, $\mu_\theta\in\mathcal{P}(X)$, $\bar\mu_\theta\in\mathcal{M}(X)$, $\epsilon>0$, $\theta\in\Theta$:
$$\big\|\bar F^{(n)}_\theta(\mu_\theta,\bar\mu_\theta)-\bar F^{(n)}_{\theta,\epsilon}(\mu_\theta,\bar\mu_\theta)\big\|\le C\epsilon\,(1+\|\bar\mu_\theta\|).$$

Proof By Tadic and Doucet (2005, Lemma 6.7) we have the decomposition, for $\varphi\in\mathcal{B}_b(X)$:
$$\bar F^{(n)}_\theta(\mu_\theta,\bar\mu_\theta)(\varphi)=\bar G^{(n)}_\theta(\mu_\theta,\bar\mu_\theta)(\varphi)-\bar H^{(n)}_\theta(\mu_\theta)(\varphi)$$
where
$$\bar H^{(n)}_\theta(\mu_\theta)(\varphi):=\mu_\theta R_{n,\theta}(1)^{-1}\big[\mu_\theta\bar R_{n,\theta}(\varphi)-\mu_\theta\bar R_{n,\theta}(1)\,F^{(n)}_\theta(\mu_\theta)(\varphi)\big].$$
Thus, to control the difference, we can consider the two differences $\bar G^{(n)}_\theta(\mu_\theta,\bar\mu_\theta)(\varphi)-\bar G^{(n)}_{\theta,\epsilon}(\mu_\theta,\bar\mu_\theta)(\varphi)$ and $\bar H^{(n)}_\theta(\mu_\theta)(\varphi)-\bar H^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)$.

Control of $\bar G^{(n)}_\theta(\mu_\theta,\bar\mu_\theta)(\varphi)-\bar G^{(n)}_{\theta,\epsilon}(\mu_\theta,\bar\mu_\theta)(\varphi)$. We will use the Hahn–Jordan decomposition $\bar\mu_\theta=\bar\mu^+_\theta-\bar\mu^-_\theta$. It is assumed that both $\bar\mu^+_\theta(1),\bar\mu^-_\theta(1)>0$; the scenario with either $\bar\mu^+_\theta(1)=0$ or $\bar\mu^-_\theta(1)=0$ is straightforward and omitted for brevity. We can write:
$$\bar G^{(n)}_\theta(\mu_\theta,\bar\mu_\theta)(\varphi)=\frac{\bar\mu^+_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}\big[F^{(n)}_\theta(\check\mu^+_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)\big]-\frac{\bar\mu^-_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}\big[F^{(n)}_\theta(\check\mu^-_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)\big]$$
where $\check\mu^+_\theta(\cdot)=\bar\mu^+_\theta(\cdot)/\bar\mu^+_\theta(1)$ and $\check\mu^-_\theta(\cdot)=\bar\mu^-_\theta(\cdot)/\bar\mu^-_\theta(1)$. Thus we have
$$\begin{aligned}
\bar G^{(n)}_\theta(\mu_\theta,\bar\mu_\theta)(\varphi)-\bar G^{(n)}_{\theta,\epsilon}(\mu_\theta,\bar\mu_\theta)(\varphi)
&=\Big[\frac{\bar\mu^+_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}-\frac{\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\Big]\big[F^{(n)}_\theta(\check\mu^+_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)\big]\\
&\quad+\frac{\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\big[F^{(n)}_\theta(\check\mu^+_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)-F^{(n)}_{\theta,\epsilon}(\check\mu^+_\theta)(\varphi)+F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)\big]\\
&\quad-\Big[\frac{\bar\mu^-_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}-\frac{\bar\mu^-_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\Big]\big[F^{(n)}_\theta(\check\mu^-_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)\big]\\
&\quad-\frac{\bar\mu^-_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\big[F^{(n)}_\theta(\check\mu^-_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)-F^{(n)}_{\theta,\epsilon}(\check\mu^-_\theta)(\varphi)+F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)\big].
\end{aligned}\quad(41)$$

By symmetry, we need only consider the terms including $\bar\mu^+_\theta$; one can treat those with $\bar\mu^-_\theta$ by using similar arguments. First, dealing with the term on the first line of the R.H.S. of Eq. 41, we have that
$$\Big[\frac{\bar\mu^+_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}-\frac{\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\Big]\big[F^{(n)}_\theta(\check\mu^+_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)\big]
=\Big[\frac{\bar\mu^+_\theta R_{n,\theta}(1)-\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta}(1)}+\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)\,\frac{\mu_\theta R_{n,\theta,\epsilon}(1)-\mu_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)\,\mu_\theta R_{n,\theta}(1)}\Big]\big[F^{(n)}_\theta(\check\mu^+_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)\big].$$
Now by (A1), for any $n$,
$$\sup_{x\in X}\big|R_{n,\theta}(1)(x)-R_{n,\theta,\epsilon}(1)(x)\big|\le C\epsilon\quad(42)$$
thus
$$\Big|\frac{\bar\mu^+_\theta R_{n,\theta}(1)-\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta}(1)}+\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)\,\frac{\mu_\theta R_{n,\theta,\epsilon}(1)-\mu_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)\,\mu_\theta R_{n,\theta}(1)}\Big|\le\frac{C\epsilon\,\bar\mu^+_\theta(1)}{\mu_\theta R_{n,\theta}(1)}+C\epsilon\,\frac{\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)\,\mu_\theta R_{n,\theta}(1)}.$$
Now one can show that there exists a $C<+\infty$ such that for any $x,y\in X$
$$R_{n,\theta}(1)(x)\ge C\,R_{n,\theta}(1)(y),\qquad R_{n,\theta,\epsilon}(1)(x)\ge C\,R_{n,\theta,\epsilon}(1)(y).\quad(43)$$
Then it follows that
$$\frac{C\epsilon\,\bar\mu^+_\theta(1)}{\mu_\theta R_{n,\theta}(1)}+C\epsilon\,\frac{\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)\,\mu_\theta R_{n,\theta}(1)}\le C\epsilon\,\bar\mu^+_\theta(1).$$

Hence we have shown that
$$\Big|\Big[\frac{\bar\mu^+_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}-\frac{\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\Big]\big[F^{(n)}_\theta(\check\mu^+_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)\big]\Big|\le C\|\varphi\|_\infty\epsilon\,\bar\mu^+_\theta(1).$$
Second, consider the second line of the R.H.S. of Eq. 41. By Lemma 5.2, for any $\mu_\theta\in\mathcal{P}(X)$, $\|F^{(n)}_\theta(\mu_\theta)-F^{(n)}_{\theta,\epsilon}(\mu_\theta)\|\le C\epsilon$, with $C$ independent of $\mu_\theta$; in addition, using Eq. 43, we have
$$\Big|\frac{\bar\mu^+_\theta R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\big[F^{(n)}_\theta(\check\mu^+_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)-F^{(n)}_{\theta,\epsilon}(\check\mu^+_\theta)(\varphi)+F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)\big]\Big|\le C\|\varphi\|_\infty\epsilon\,\bar\mu^+_\theta(1).$$
Thus we have shown:
$$\big|\bar G^{(n)}_\theta(\mu_\theta,\bar\mu_\theta)(\varphi)-\bar G^{(n)}_{\theta,\epsilon}(\mu_\theta,\bar\mu_\theta)(\varphi)\big|\le C\|\varphi\|_\infty\epsilon\,\big[\bar\mu^+_\theta(1)+\bar\mu^-_\theta(1)\big]=C\|\varphi\|_\infty\epsilon\,\|\bar\mu_\theta\|.\quad(44)$$

Control of $\bar H^{(n)}_\theta(\mu_\theta)(\varphi)-\bar H^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)$. We have
$$\bar H^{(n)}_\theta(\mu_\theta)(\varphi)-\bar H^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)=\Big[\frac{\mu_\theta\bar R_{n,\theta}(\varphi)}{\mu_\theta R_{n,\theta}(1)}-\frac{\mu_\theta\bar R_{n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\Big]+\Big[\frac{\mu_\theta\bar R_{n,\theta,\epsilon}(1)\,F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)}{\mu_\theta R_{n,\theta,\epsilon}(1)}-\frac{\mu_\theta\bar R_{n,\theta}(1)\,F^{(n)}_\theta(\mu_\theta)(\varphi)}{\mu_\theta R_{n,\theta}(1)}\Big].\quad(45)$$
We start with the first bracket on the R.H.S. of Eq. 45. We first note that
$$\big|\bar R_{n,\theta}(\varphi)(x)-\bar R_{n,\theta,\epsilon}(\varphi)(x)\big|=\Big|\int_X f_\theta(x'|x)\,\varphi(x')\big[\nabla g_\theta(y_n|x')-\nabla g_{\theta,\epsilon}(y_n|x')\big]\,dx'\Big|\le C\|\varphi\|_\infty\epsilon\quad(46)$$
where we have applied Eq. 27. Then we have
$$\frac{\mu_\theta\bar R_{n,\theta}(\varphi)}{\mu_\theta R_{n,\theta}(1)}-\frac{\mu_\theta\bar R_{n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{n,\theta,\epsilon}(1)}=\frac{\mu_\theta\bar R_{n,\theta}(\varphi)-\mu_\theta\bar R_{n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{n,\theta}(1)}+\mu_\theta\bar R_{n,\theta,\epsilon}(\varphi)\,\frac{\mu_\theta R_{n,\theta,\epsilon}(1)-\mu_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)\,\mu_\theta R_{n,\theta}(1)}.$$
By using Eq. 46 on the first term on the R.H.S. of the above equation, and by using Eq. 42 in the numerator for the second, along with Eq. 43 in the denominator, we have
$$\Big|\frac{\mu_\theta\bar R_{n,\theta}(\varphi)}{\mu_\theta R_{n,\theta}(1)}-\frac{\mu_\theta\bar R_{n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\Big|\le C\epsilon\,\big[\|\varphi\|_\infty+|\mu_\theta\bar R_{n,\theta,\epsilon}(\varphi)|\big].$$
Then, as
$$\big|\bar R_{n,\theta,\epsilon}(\varphi)(x)\big|=\Big|\int_X\varphi(x')\big[\nabla g_{\theta,\epsilon}(y_n|x')\,f_\theta(x'|x)+g_{\theta,\epsilon}(y_n|x')\,\nabla f_\theta(x'|x)\big]\,dx'\Big|\le C\|\varphi\|_\infty\int_X dx'\le C\|\varphi\|_\infty\quad(47)$$
where the compactness of $X$ and (A5) have been used, we have the upper-bound
$$\Big|\frac{\mu_\theta\bar R_{n,\theta}(\varphi)}{\mu_\theta R_{n,\theta}(1)}-\frac{\mu_\theta\bar R_{n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{n,\theta,\epsilon}(1)}\Big|\le C\|\varphi\|_\infty\epsilon.\quad(48)$$
Moving onto the second bracket on the R.H.S. of Eq. 45, this is equal to
$$\Big[\frac{\mu_\theta\bar R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}-\frac{\mu_\theta\bar R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}\Big]F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)+\frac{\mu_\theta\bar R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}\big[F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)\big].$$
By using the inequality Eq. 48, we have
$$\Big|\Big[\frac{\mu_\theta\bar R_{n,\theta,\epsilon}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)}-\frac{\mu_\theta\bar R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}\Big]F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)\Big|\le C\epsilon\,\big|F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)\big|\le C\|\varphi\|_\infty\epsilon.$$
Using Lemma 5.2, and in addition Eq. 43 in the denominator and Eq. 47 in the numerator, we have
$$\Big|\frac{\mu_\theta\bar R_{n,\theta}(1)}{\mu_\theta R_{n,\theta}(1)}\big[F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)-F^{(n)}_\theta(\mu_\theta)(\varphi)\big]\Big|\le C\|\varphi\|_\infty\epsilon$$
where $C$ does not depend upon $\mu_\theta$ and $\epsilon$. Thus we have established that
$$\Big|\frac{\mu_\theta\bar R_{n,\theta,\epsilon}(1)\,F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)}{\mu_\theta R_{n,\theta,\epsilon}(1)}-\frac{\mu_\theta\bar R_{n,\theta}(1)\,F^{(n)}_\theta(\mu_\theta)(\varphi)}{\mu_\theta R_{n,\theta}(1)}\Big|\le C\|\varphi\|_\infty\epsilon.\quad(49)$$
One can put together the results of Eqs. 48 and 49 and establish that
$$\big|\bar H^{(n)}_\theta(\mu_\theta)(\varphi)-\bar H^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)\big|\le C\|\varphi\|_\infty\epsilon.\quad(50)$$
On combining the results of Eqs. 44 and 50 and noting Eq. 45, we conclude the proof. □



Lemma 5.2 Assume (A1–A3). Then there exists a $C<+\infty$ such that for any $n\ge1$, $\mu_\theta\in\mathcal{P}(X)$, $\epsilon>0$, $\theta\in\Theta$:
$$\big\|F^{(n)}_\theta(\mu_\theta)-F^{(n)}_{\theta,\epsilon}(\mu_\theta)\big\|\le C\epsilon.$$

Proof For $\varphi\in\mathcal{B}_b(X)$:
$$F^{(n)}_\theta(\mu_\theta)(\varphi)-F^{(n)}_{\theta,\epsilon}(\mu_\theta)(\varphi)=\frac{\mu_\theta R_{n,\theta}(\varphi)-\mu_\theta R_{n,\theta,\epsilon}(\varphi)}{\mu_\theta R_{n,\theta}(1)}+\mu_\theta R_{n,\theta,\epsilon}(\varphi)\Big[\frac{\mu_\theta R_{n,\theta,\epsilon}(1)-\mu_\theta R_{n,\theta}(1)}{\mu_\theta R_{n,\theta,\epsilon}(1)\,\mu_\theta R_{n,\theta}(1)}\Big].$$
Then, by applying Eq. 42 to both terms on the R.H.S., we have the upper-bound
$$\frac{C\|\varphi\|_\infty\epsilon}{\mu_\theta R_{n,\theta}(1)}.$$
One can conclude by using the inequality Eq. 43 for $R_{n,\theta}(1)(\cdot)$. □
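To illustrate the lemma, one can compare a single Bayes update under a Gaussian likelihood with the update under a uniform-kernel ABC likelihood $g_\epsilon(y|x)=\mathbb{P}(|Y-y|\le\epsilon\,|\,x)/(2\epsilon)$, computed on a grid. A Python sketch (the prior, observation and noise scale are invented for illustration):

```python
import numpy as np
from math import erf

Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / 2**0.5)))  # standard normal CDF

x = np.linspace(-3.0, 3.0, 201)                  # grid over a compact state space
mu = np.exp(-0.5 * x**2); mu /= mu.sum()         # prior (invented)
y, sig = 0.7, 1.0                                # observation and noise scale (invented)

def g_exact(x):
    # exact likelihood g(y|x) = N(y; x, sig^2)
    return np.exp(-0.5 * ((y - x) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

def g_abc(x, eps):
    # ABC likelihood: P(|Y - y| <= eps | x) / (2 eps), uniform kernel
    return (Phi((y + eps - x) / sig) - Phi((y - eps - x) / sig)) / (2 * eps)

def bayes(g):
    w = mu * g
    return w / w.sum()

exact = bayes(g_exact(x))
errs = [np.abs(bayes(g_abc(x, eps)) - exact).sum() for eps in (0.4, 0.2, 0.1)]
print(errs)  # the one-step filter bias shrinks as eps decreases
```

For this smooth, symmetric kernel the observed error in fact decays faster than linearly in $\epsilon$, which is consistent with (and stronger than) the $C\epsilon$ guarantee of the lemma.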


Lemma 5.3 Assume (A1–A5). Then there exists a $C<+\infty$ such that for any $n\ge1$, $\mu_\theta\in\mathcal{P}(X)$, $\bar\mu_\theta\in\mathcal{M}(X)$, $\epsilon>0$, $\theta\in\Theta$:
$$\big\|\bar F^n_\theta(\mu_\theta,\bar\mu_\theta)\big\|\vee\big\|\bar F^n_{\theta,\epsilon}(\mu_\theta,\bar\mu_\theta)\big\|\le C\,(1+\|\bar\mu_\theta\|).$$

Proof We will consider only $\bar F^n_\theta(\mu_\theta,\bar\mu_\theta)$, as the ABC filter derivative follows from similar calculations, for any $\epsilon>0$ (with upper-bounds that are independent of $\epsilon$). By Tadic and Doucet (2005, Lemma 6.4) we have, for $\varphi\in\mathcal{B}_b(X)$:
$$\bar F^n_\theta(\mu_\theta,\bar\mu_\theta)(\varphi)=\bar G^n_\theta(\mu_\theta,\bar\mu_\theta)(\varphi)+\sum_{p=1}^n\bar G^{n-p}_\theta\big(F^p_\theta(\mu_\theta),\,\bar H^p_\theta(\mu_\theta)\big)(\varphi).$$
By Tadic and Doucet (2005, Lemma 6.6) we have the upper-bound
$$\big\|\bar F^n_\theta(\mu_\theta,\bar\mu_\theta)\big\|\le C\Big(\rho^n\|\bar\mu_\theta\|+\sum_{p=1}^n\rho^{n-p}\big\|\bar H^p_\theta(\mu_\theta)\big\|\Big)$$
with $\rho\in(0,1)$. Then, by Tadic and Doucet (2005, Lemma 6.8), it follows that
$$\big\|\bar F^n_\theta(\mu_\theta,\bar\mu_\theta)\big\|\le C\Big(\rho^n\|\bar\mu_\theta\|+\sum_{p=1}^n\rho^{n-p}\Big)$$
from which one concludes. □




Remark 5.1 Using the proof above, one can also show that there exists a $C<+\infty$ such that for any $n\ge1$, $\mu_\theta\in\mathcal{P}(X)$, $\bar\mu_\theta\in\mathcal{M}(X)$, $\epsilon>0$, $\theta\in\Theta$:
$$\big\|\bar F^{(n)}_\theta(\mu_\theta,\bar\mu_\theta)\big\|\vee\big\|\bar F^{(n)}_{\theta,\epsilon}(\mu_\theta,\bar\mu_\theta)\big\|\le C\,(1+\|\bar\mu_\theta\|).$$

References

Andrieu C, Doucet A, Tadic VB (2005) On-line simulation-based algorithms for parameter estimation in general state-space models. In: Proc. of the 44th IEEE conference on decision and control and European control conference (CDC-ECC '05), pp 332–337. Expanded technical report available at http://www.maths.bris.ac.uk/~maxca/preprints/andrieu_doucet_tadic_2007.pdf
Arapostathis A, Marcus SI (1990) Analysis of an identification algorithm arising in the adaptive
estimation of Markov chains. Math Control Signals Syst 3:1–29
Barthelmé S, Chopin N (2011) Expectation–Propagation for summary-less, likelihood-free inference.
arXiv:1107.5959 [stat.CO]
Benveniste A, Métivier M, Priouret P (1990) Adaptive algorithms and stochastic approximation.
Springer-Verlag, New York
Beskos A, Crisan D, Jasra A, Whiteley N (2011) Error bounds and normalizing constants for
sequential Monte Carlo in high-dimensions. arXiv:1112.1544 [stat.CO]
Bickel P, Li B, Bengtsson T (2008) Sharp failure rates for the bootstrap particle filter in high
dimensions. In: Clarke B, Ghosal S (eds) Pushing the limits of contemporary statistics. IMS,
pp 318–329
Cappé O, Rydén T, Moulines É (2005) Inference in hidden Markov models. Springer, New York
Cappé O (2009) Online sequential Monte Carlo EM algorithm. In: Proc. of IEEE workshop Statist.
Signal Process. (SSP). Cardiff, Wales, UK
Calvet C, Czellar V (2012) Accurate methods for approximate Bayesian computation filtering.
Technical Report, HEC Paris
Cérou F, Del Moral P, Guyader A (2011) A non-asymptotic variance theorem for un-normalized
Feynman–Kac particle models. Ann Inst Henri Poincaré 47:629–649
Dean TA, Singh SS, Jasra A, Peters GW (2010) Parameter estimation for Hidden Markov models
with intractable likelihoods. arXiv:1103.5399 [math.ST]
Dean TA, Singh SS (2011) Asymptotic behavior of approximate Bayesian estimators.
arXiv:1105.3655 [math.ST]

Del Moral P (2004) Feynman–Kac formulae: genealogical and interacting particle systems with
applications. Springer, New York
Del Moral P, Doucet A, Jasra A (2006) Sequential Monte Carlo samplers. J R Stat Soc B 68:411–436
Del Moral P, Doucet A, Jasra A (2012) An adaptive sequential Monte Carlo method for approximate
Bayesian computation. Stat Comput 22:1009–1020
Del Moral P, Doucet A, Singh SS (2009) Forward only smoothing using sequential Monte Carlo.
arXiv:1012.5390 [stat.ME]
Del Moral P, Doucet A, Singh SS (2011) Uniform stability of a particle approximation of the optimal
filter derivative. arXiv:1106.2525 [math.ST]
Doucet A, Godsill S, Andrieu C (2000) On sequential Monte Carlo sampling methods for Bayesian
filtering. Stat Comput 10:197–208
Gauchi JP, Vila JP (2013) Nonparametric filtering approaches for identification and inference in
nonlinear dynamic systems. Stat Comput 23:523–533
Jasra A, Singh SS, Martin JS, McCoy E (2012) Filtering via approximate Bayesian computation. Stat
Comput 22:1223–1237
Kantas N, Doucet A, Singh SS, Maciejowski JM, Chopin N (2011) On particle methods for parameter
estimation in general state-space models. (submitted)
Le Gland F, Mevel M (2000) Exponential forgetting and geometric ergodicity in hidden Markov
models. Math Control Signals Syst 13:63–93
Le Gland F, Mevel M (1997) Recursive identification in hidden Markov models. In: Proc. 36th IEEE
conf. decision and control, pp 3468–3473
Le Gland F, Mevel M (1995) Recursive identification of HMM’s with observations in a finite set. In:
Proc. of the 34th conference on decision and control, pp 216–221
Lorenz EN (1963) Deterministic nonperiodic flow. J Atmos Sci 20:130–141
Marin J-M, Pudlo P, Robert CP, Ryder R (2012) Approximate Bayesian computational methods.
Stat Comput 22:1167–1197
Martin JS, Jasra A, Singh SS, Whiteley N, McCoy E (2012) Approximate Bayesian computation for
smoothing. arXiv:1206.5208 [stat.CO]
McKinley J, Cook A, Deardon R (2009) Inference for epidemic models without likelihoods. Int J
Biostat 5:a24
Murray LM, Jones E, Parslow J (2011) On collapsed state-space models and the particle marginal
Metropolis–Hastings sampler. arXiv:1202.6159 [stat.CO]
Pitt MK (2002) Smooth particle filters for likelihood evaluation and maximization. Technical Report,
University of Warwick
Poyiadjis G, Doucet A, Singh SS (2011) Particle approximations of the score and observed informa-
tion matrix in state space models with application to parameter estimation. Biometrika 98:65–80
Poyiadjis G, Singh SS, Doucet A (2006) Gradient-free maximum likelihood parameter estimation
with particle filters. In: Proc Amer. control conf., pp 6–9
Spall JC (1992) Multivariate stochastic approximation using a simultaneous perturbation gradient
approximation. IEEE Trans Autom Control 37(3):332–341
Spall J (2003) Introduction to stochastic search and optimization, 1st edn. Wiley, New York
Tadic VB, Doucet A (2005) Exponential forgetting and geometric ergodicity for optimal filtering in
general state-space models. Stoch Process Appl 115:1408–1436
Tadic VB (2009) Analyticity, convergence and convergence rate of recursive maximum likelihood
estimation in hidden Markov models. arXiv:0904.4264
Whiteley N, Kantas N, Jasra A (2012) Linear variance bounds for particle approximations of time-
homogeneous Feynman–Kac formulae. Stoch Process Appl 122:1840–1865
Yildirim S, Singh SS, Doucet A (2013a) An online expectation–maximisation algorithm for change-point models. J Comput Graph Stat. doi:10.1080/10618600.2012.674653
Yildirim S, Dean TA, Singh SS, Jasra A (2013b) Approximate Bayesian computation for recursive
maximum likelihood estimation in hidden Markov models. Technical Report, University of
Cambridge
