On Particle Methods For Parameter Estimation in State-Space Models
biochemical network models where Xn corresponds to the population of various biochemical species and Yn are imprecise measurements of the size of a subset of these species [93], and neuroscience models where Xn is a state vector determining the neuron's stimulus–response function and Yn some spike train data [77]. However, nonlinear non-Gaussian state-space models are also notoriously difficult to fit to data, and it is only recently, thanks to the advent of powerful simulation techniques, that it has been possible to fully realize their potential.

To illustrate the complexity of inference in state-space models, consider first the scenario where the parameter θ is known. On-line and off-line inference about the state process {Xn} given the observations {Yn} is only feasible analytically for simple models such as the linear Gaussian state-space model. In nonlinear non-Gaussian scenarios, numerous approximation schemes, such as the Extended Kalman filter or the Gaussian sum filter [1], have been proposed over the past fifty years to solve these so-called optimal filtering and smoothing problems, but these methods lack rigor and can be unreliable in practice in terms of accuracy, while deterministic integration methods are difficult to implement. Markov chain Monte Carlo (MCMC) methods can obviously be used, but they are impractical for on-line inference; and even for off-line inference, it can be difficult to build efficient high-dimensional proposal distributions for such algorithms. For nonlinear non-Gaussian state-space models, particle algorithms have emerged as the most successful. Their widespread popularity is due to the fact that they are easy to implement, suitable for parallel implementation [60] and, more importantly, have been demonstrated in numerous settings to yield more accurate estimates than the standard alternatives; for example, see [11, 23, 30, 67].

In most practical situations, the model (1.1)–(1.2) depends on an unknown parameter vector θ that needs to be inferred from the data either in an on-line or off-line manner. In fact, inferring the parameter θ is often the primary problem of interest; for example, for biochemical networks, we are not interested in the population of the species per se, but we want to infer some chemical rate constants, which are parameters of the transition prior fθ(x′|x). Although it is possible to define an extended state that includes the original state Xn and the parameter θ and then apply standard particle methods to perform parameter inference, it was recognized very early on that this naive approach is problematic [54], due to the parameter space not being explored adequately. This has motivated over the past fifteen years the development of many particle methods for the parameter estimation problem, but numerically robust methods have only been proposed recently.

The main objective of this paper is to provide a comprehensive overview of this literature. This paper thus differs from recent survey papers on particle methods, which all primarily focus on estimating the state sequence X0:n or discuss a much wider range of topics, for example, [32, 55, 58, 65]. We will present the main features of each method and comment on their pros and cons. No attempt, however, is made to discuss the intricacies of the specific implementations. For this we refer the reader to the original references.

We have chosen to broadly classify the methods as follows: Bayesian or Maximum Likelihood (ML), and whether they are implemented off-line or on-line. In the Bayesian approach, the unknown parameter is assigned a prior distribution and the posterior density of this parameter given the observations is to be characterized. In the ML approach, the parameter estimate is the maximizing argument of the likelihood of θ given the data. Both these inference procedures can be carried out off-line or on-line. Specifically, in an off-line framework we infer θ using a fixed observation record y0:T. In contrast, on-line methods update the parameter estimate sequentially as observations {yn}n≥0 become available.

The rest of the paper is organized as follows. In Section 2 we present the main computational challenges associated to parameter inference in state-space models. In Section 3 we review particle methods for filtering when the model does not include any unknown parameters, whereas Section 4 is dedicated to smoothing. These filtering and smoothing techniques are at the core of the off-line and on-line ML parameter procedures described in Section 5. In Section 6 we discuss particle methods for off-line and on-line Bayesian parameter inference. The performance of some of these algorithms is illustrated on simple examples in Section 7. Finally, we summarize the main advantages and drawbacks of the methods presented and discuss some open problems in Section 8.

2. COMPUTATIONAL CHALLENGES ASSOCIATED TO PARAMETER INFERENCE

A key ingredient of ML and Bayesian parameter inference is the likelihood function pθ(y0:n) of
3.2 Particle Filtering

3.2.1 Algorithm. Particle filtering methods are a set of simulation-based techniques which approximate numerically the recursions (3.1) to (3.3). We focus here on the APF (auxiliary particle filter [78]) for two reasons: first, this is a popular approach, in particular, in the context of parameter estimation (see, e.g., Section 6.2.3); second, the APF covers as special cases a large class of particle algorithms, such as the bootstrap filter [46] and SISR (Sequential Importance Sampling Resampling [31, 69]). Let

(3.4)   qθ(xn, yn|xn−1) = qθ(xn|yn, xn−1) qθ(yn|xn−1),

where qθ(xn|yn, xn−1) is a probability density function which is easy to sample from and qθ(yn|xn−1) is not necessarily required to be a probability density function but just a nonnegative function of (xn−1, yn) ∈ X × Y one can evaluate. [For n = 0, remove the dependency on xn−1, i.e., qθ(x0, y0) = qθ(x0|y0) qθ(y0).]

The algorithm relies on the following importance weights:

(3.5)   w0(x0) = gθ(y0|x0) µθ(x0) / qθ(x0|y0),

(3.6)   wn(xn−1:n) = gθ(yn|xn) fθ(xn|xn−1) / qθ(xn, yn|xn−1)

for n ≥ 1. In order to alleviate the notational burden, we omit the dependence of the importance weights on θ; we will do so in the remainder of the paper when no confusion is possible. The auxiliary particle filter can be summarized in Algorithm 1 [12, 78].

One recovers the SISR algorithm as a special case of Algorithm 1 by taking qθ(yn|xn−1) = 1 [or, more generally, by taking qθ(yn|xn−1) = hθ(yn), some arbitrary positive function]. Further, one recovers the bootstrap filter by taking qθ(xn|yn, xn−1) = fθ(xn|xn−1). This is an important special case, as some complex models are such that one may sample from fθ(xn|xn−1) but not compute the corresponding density; in such a case the bootstrap filter is the only implementable algorithm. For models such that the density fθ(xn|xn−1) is tractable, [78] recommend selecting qθ(xn|yn, xn−1) = pθ(xn|yn, xn−1) and qθ(yn|xn−1) = pθ(yn|xn−1) when these quantities are tractable, and using approximations of these quantities in scenarios when they are not. The intuition for these recommendations is that this should make the weight function (3.6) nearly constant.

The computational complexity of Algorithm 1 is O(N) per time step; in particular, see, for example, [31], page 201, for an O(N) implementation of the resampling step. At time n, the approximations of pθ(x0:n|y0:n) and pθ(yn|y0:n−1) presented earlier in (2.3) and (3.3), respectively, are given by

(3.7)   p̂θ(dx0:n|y0:n) = Σ_{i=1}^N W_n^i δ_{X^i_{0:n}}(dx0:n),

(3.8)   p̂θ(yn|y0:n−1) = ( (1/N) Σ_{i=1}^N wn(X^i_{n−1:n}) ) ( Σ_{i=1}^N W^i_{n−1} qθ(yn|X^i_{n−1}) ),

where W_n^i ∝ wn(X^i_{n−1:n}), Σ_{i=1}^N W_n^i = 1 and p̂θ(y0) = (1/N) Σ_{i=1}^N w0(X_0^i).
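To make the bootstrap special case concrete [qθ(xn|yn, xn−1) = fθ(xn|xn−1), qθ(yn|xn−1) = 1], the following is a minimal illustrative sketch that also accumulates the log-likelihood estimate used later in (3.9). The function names sample_x0, sample_f and log_g are placeholders for µθ, fθ and gθ; multinomial resampling at every step is used for simplicity, and the example at the bottom uses the scalar linear Gaussian model (7.1) of Section 7 purely as a test case.

```python
import numpy as np

def bootstrap_filter(y, sample_x0, sample_f, log_g, N, rng):
    """Minimal bootstrap particle filter (qθ = fθ).

    y        : array of observations y_0, ..., y_T
    sample_x0: function(N, rng) -> N initial particles drawn from µθ
    sample_f : function(x, rng) -> one-step propagation of each particle through fθ
    log_g    : function(y_n, x) -> log gθ(y_n | x), vectorized in x
    Returns the estimated log-likelihood log p̂θ(y_{0:T}).
    """
    x = sample_x0(N, rng)                       # particles at time 0
    log_lik = 0.0
    for n, y_n in enumerate(y):
        if n > 0:
            x = sample_f(x, rng)                # propagate through fθ
        logw = log_g(y_n, x)                    # incremental weights, cf. (3.5)-(3.6)
        m = logw.max()
        w = np.exp(logw - m)
        log_lik += m + np.log(w.mean())         # accumulates the factors of (3.9)
        W = w / w.sum()
        idx = rng.choice(N, size=N, p=W)        # multinomial resampling
        x = x[idx]
    return log_lik

# Example run on data simulated from model (7.1) with (rho, tau, sigma) = (0.8, 1, 0.2).
rng = np.random.default_rng(0)
rho, tau, sigma = 0.8, 1.0, 0.2
T = 100
x_true = np.zeros(T)
for n in range(1, T):
    x_true[n] = rho * x_true[n - 1] + tau * rng.standard_normal()
y = x_true + sigma * rng.standard_normal(T)

ll = bootstrap_filter(
    y,
    sample_x0=lambda N, rng: tau * rng.standard_normal(N),
    sample_f=lambda x, rng: rho * x + tau * rng.standard_normal(x.size),
    log_g=lambda y_n, x: -0.5 * ((y_n - x) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)),
    N=500, rng=rng)
print(ll)
```

The same loop produces, at each time n, the weighted particle approximation (3.7) of the filter and the likelihood factor p̂θ(yn|y0:n−1).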
In practice, one uses (3.7) mostly to obtain approximations of posterior moments

Σ_{i=1}^N W_n^i ϕ(X^i_{0:n}) ≈ E[ϕ(X0:n)|y0:n],

but expressing particle filtering as a method for approximating distributions (rather than moments) turns out to be a more convenient formalization. The likelihood (3.2) is then estimated through

(3.9)   p̂θ(y0:n) = p̂θ(y0) Π_{k=1}^n p̂θ(yk|y0:k−1).

The resampling procedure is introduced to replicate particles with high weights and discard particles with low weights. It serves to focus the computational efforts on the "promising" regions of the state space. We have presented above the simplest resampling scheme. Lower variance resampling schemes have been proposed in [53, 69], as well as more advanced particle algorithms with better overall performance, for example, the Resample–Move algorithm [44]. For the sake of simplicity, we have also presented a version of the algorithm that operates resampling at every iteration n. It may be more efficient to trigger resampling only when a certain criterion regarding the degeneracy of the weights is met; see [31] and [68], pages 35 and 74.
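As an illustration of such a criterion, the effective sample size (ESS) of the normalized weights is a commonly used trigger, and lower-variance schemes such as systematic resampling can replace multinomial resampling. The sketch below is a generic illustration, not code from the references above; the threshold value is a tuning choice.

```python
import numpy as np

def ess(W):
    """Effective sample size of normalized weights W; lies between 1 and N."""
    return 1.0 / np.sum(W ** 2)

def systematic_resample(W, rng):
    """Systematic resampling: lower variance than multinomial resampling."""
    N = W.size
    u = (rng.random() + np.arange(N)) / N        # one uniform, stratified positions
    cw = np.cumsum(W)
    cw[-1] = 1.0                                 # guard against round-off
    return np.searchsorted(cw, u)                # indices of the particles to copy

def adaptive_resample(x, logw, rng, threshold=0.5):
    """Resample only when the ESS falls below threshold * N."""
    N = x.shape[0]
    W = np.exp(logw - logw.max())
    W /= W.sum()
    if ess(W) < threshold * N:
        idx = systematic_resample(W, rng)
        return x[idx], np.zeros(N)               # weights reset to uniform after resampling
    return x, logw
```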
3.2.2 Convergence results. Many sharp convergence results are available for particle methods [23]. A selection of these results that gives useful insights on the difficulties of estimating static parameters with particle methods is presented below.

Under minor regularity assumptions, one can show that for any n ≥ 0, N ≥ 1 and any bounded test function ϕn : X^{n+1} → [−1, 1], there exist constants Aθ,n,p < ∞ such that for any p ≥ 1

(3.10)   E | ∫ ϕn(x0:n) {p̂θ(dx0:n|y0:n) − pθ(dx0:n|y0:n)} |^p ≤ Aθ,n,p / N^{p/2},

where the expectation is with respect to the law of the particle filter. In addition, for more general classes of functions, we can obtain for any fixed n a Central Limit Theorem (CLT) as N → +∞ ([17] and [23], Proposition 9.4.2). Such results are reassuring but weak, as they reveal nothing regarding long-time behavior. For instance, without further restrictions on the class of functions ϕn and the state-space model, Aθ,n,p typically grows exponentially with n. This is intuitively not surprising, as the dimension of the target density pθ(x0:n|y0:n) is increasing with n. Moreover, the successive resampling steps lead to a depletion of the particle population; pθ(x0:m|y0:n) will eventually be approximated by a single unique particle as n − m increases. This is referred to as the degeneracy problem in the literature ([11], Figure 8.4, page 282). This is a fundamental weakness of particle methods: given a fixed number of particles N, it is impossible to approximate pθ(x0:n|y0:n) accurately when n is large enough.

Fortunately, it is also possible to establish much more positive results. Many state-space models possess the so-called exponential forgetting property ([23], Chapter 4). This property states that for any x0, x′0 ∈ X and data y0:n, there exist constants Bθ < ∞ and λ ∈ [0, 1) such that

(3.11)   ‖pθ(dxn|y1:n, x0) − pθ(dxn|y1:n, x′0)‖TV ≤ Bθ λ^n,

where ‖·‖TV is the total variation distance; that is, the optimal filter forgets its initial condition exponentially fast. This property is typically satisfied when the signal process {Xn}n≥0 is a uniformly ergodic Markov chain and the observations {Yn}n≥0 are not too informative ([23], Chapter 4), or when {Yn}n≥0 are informative enough that they effectively restrict the hidden state to a bounded region [76]. Weaker conditions can be found in [29, 90]. When exponential forgetting holds, it is possible to establish much stronger uniform-in-time convergence results for functions ϕn that depend only on recent states. Specifically, for an integer L > 0 and any bounded test function ΨL : X^L → [−1, 1], there exist constants Cθ,L,p < ∞ such that for any p ≥ 1, n ≥ L − 1,

(3.12)   E | ∫_{X^L} ΨL(xn−L+1:n) ∆θ,n(dxn−L+1:n) |^p ≤ Cθ,L,p / N^{p/2},

where

(3.13)   ∆θ,n(dxn−L+1:n) = ∫_{x0:n−L ∈ X^{n−L+1}} {p̂θ(dx0:n|y0:n) − pθ(dx0:n|y0:n)}.

This result explains why particle filtering is an effective computational tool in many applications such as tracking, where one is only interested in pθ(xn−L+1:n|y0:n), as the approximation error is uniformly bounded over time.

Similar positive results hold for p̂θ(y0:n). This estimate is unbiased for any N ≥ 1 ([23], Theorem 7.4.2, page 239) and, under assumption (3.11), the relative variance of the likelihood estimate p̂θ(y0:n), that is, the variance of the ratio p̂θ(y0:n)/pθ(y0:n), is bounded above by Dθ n/N [14, 90]. This is a great improvement over the exponential increase with n that holds for standard importance sampling techniques; see, for instance, [32]. However, the constants Cθ,L,p and Dθ are typically exponential in nx, the dimension of the state vector Xn. We note that nonstandard particle methods designed to minimize the variance of the estimate of pθ(y0:n) have recently been proposed [92].

Finally, we recall the theoretical properties of particle estimates of the following so-called smoothed additive functional ([11], Section 8.3 and [74]),

(3.14)   S_n^θ = ∫_{X^{n+1}} { Σ_{k=1}^n sk(xk−1:k) } pθ(x0:n|y0:n) dx0:n.

Such quantities are critical when implementing ML parameter estimation procedures; see Section 5. If we substitute p̂θ(dx0:n|y0:n) for pθ(x0:n|y0:n) dx0:n to approximate S_n^θ, then we obtain an estimate Ŝ_n^θ which can be computed recursively in time; see, for example, [11], Section 8.3. For the remainder of this paper we will refer to this approximation as the path space approximation. Even when (3.11) holds, there exist 0 < Fθ, Gθ < ∞ such that the asymptotic bias [23] and variance [81] satisfy

(3.15)   |E(Ŝ_n^θ) − S_n^θ| ≤ Fθ n/N,   V(Ŝ_n^θ) ≥ Gθ n²/N

for sk : X² → [−1, 1], where the variance is w.r.t. the law of the particle filter. The fact that the variance grows at least quadratically in time follows from the degeneracy problem and makes Ŝ_n^θ unsuitable for some on-line likelihood based parameter estimation schemes discussed in Section 5.

4. SMOOTHING

In this section the parameter θ is still assumed known and we focus on smoothing, that is, the problem of estimating the latent variables X0:T given a fixed batch of observations y0:T. Smoothing for a fixed parameter θ is at the core of the two main particle ML parameter inference techniques described in Section 5, as these procedures require computing smoothed additive functionals of the form (3.14). Clearly, one could unfold the recursion (3.1) from n = 0 to n = T to obtain pθ(x0:T|y0:T). However, as pointed out in the previous section, the path space approximation (3.7) suffers from the degeneracy problem and yields potentially high variance estimates of (3.14), as (3.15) holds. This has motivated the development of alternative particle approaches to approximate pθ(x0:T|y0:T) and its marginals.

4.1 Fixed-lag Approximation

For state-space models with "good" forgetting properties [e.g., (3.11)], we have

(4.1)   pθ(x0:n|y0:T) ≈ pθ(x0:n|y0:(n+L)∧T)

for L large enough; that is, observations collected at times k > n + L do not bring any significant additional information about X0:n. In particular, when having to evaluate S_T^θ of the form (3.14), we can approximate the expectation of sn(xn−1:n) w.r.t. pθ(xn−1:n|y0:T) by its expectation w.r.t. pθ(xn−1:n|y0:(n+L)∧T).

Algorithmically, a particle implementation of (4.1) means not resampling the components X^i_{0:n} of the particles X^i_{0:k} obtained by particle filtering at times k > n + L. This was first suggested in [56] and used in [11], Section 8.3, and [74]. This algorithm is simple to implement, but the main practical problem is the choice of L. If taken too small, then pθ(x0:n|y0:(n+L)∧T) is a poor approximation of pθ(x0:n|y0:T). If taken too large, the degeneracy remains substantial. Moreover, even as N → ∞, this particle approximation will have a nonvanishing bias since pθ(x0:n|y0:T) ≠ pθ(x0:n|y0:(n+L)∧T).
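A hedged sketch of how the fixed-lag idea can be used for the additive functionals of (3.14) is given below: within a bootstrap-type filter, each particle carries only a short window of its recent trajectory, and the term sk(xk−1:k) is averaged once time k + L has been reached, so that it is evaluated under the particle approximation of pθ(xk−1:k|y0:(k+L)∧T). The function names are placeholders, as in the earlier sketches; this illustrates the approximation (4.1) rather than reproducing the exact implementations of [56] or [11].

```python
import numpy as np

def fixed_lag_additive(y, sample_x0, sample_f, log_g, s, L, N, rng):
    """Fixed-lag estimate of the smoothed additive functional S_T^θ, cf. (3.14) and (4.1).

    s must be vectorized: s(x_prev, x_curr) -> array of length N.
    Each particle keeps its last few states, resampled together with the
    particle; the term s(x_{k-1}, x_k) is accumulated once time k + L is reached.
    """
    T = len(y) - 1
    x = sample_x0(N, rng)
    buf = [x.copy()]                            # per-particle recent history
    S = 0.0
    for n in range(T + 1):
        if n > 0:
            x = sample_f(x, rng)
            buf.append(x.copy())
            if len(buf) > L + 2:
                buf.pop(0)                      # drop components older than the lag window
        logw = log_g(y[n], x)
        W = np.exp(logw - logw.max())
        W /= W.sum()
        k = n - L                               # the term whose lag window closes at time n
        if k >= 1:
            j = len(buf) - 1 - L
            S += np.sum(W * s(buf[j - 1], buf[j]))
        idx = rng.choice(N, size=N, p=W)        # multinomial resampling
        x = x[idx]
        buf = [b[idx] for b in buf]             # the lag window is resampled with the particle
    for k in range(max(T - L + 1, 1), T + 1):   # terms whose window never closed before T
        j = len(buf) - 1 - (T - k)
        S += np.mean(s(buf[j - 1], buf[j]))
    return S
```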
4.2 Forward–Backward Smoothing

4.2.1 Principle. The joint smoothing density pθ(x0:T|y0:T) can be expressed as a function of the filtering densities {pθ(xn|y0:n)}_{n=0}^T using the following key decomposition:

(4.2)   pθ(x0:T|y0:T) = pθ(xT|y0:T) Π_{n=0}^{T−1} pθ(xn|y0:n, xn+1),

where pθ(xn|y0:n, xn+1) is a backward (in time) Markov transition density given by

(4.3)   pθ(xn|y0:n, xn+1) = fθ(xn+1|xn) pθ(xn|y0:n) / pθ(xn+1|y0:n).

A backward in time recursion for {pθ(xn|y0:T)}_{n=0}^T follows by integrating out x0:n−1 and xn+1:T in (4.2) while applying (4.3),

(4.4)   pθ(xn|y0:T) = pθ(xn|y0:n) ∫ [ fθ(xn+1|xn) pθ(xn+1|y0:T) / pθ(xn+1|y0:n) ] dxn+1.

This is referred to as forward–backward smoothing, as a forward pass yields {pθ(xn|y0:n)}_{n=0}^T, which can be used in a backward pass to obtain {pθ(xn|y0:T)}_{n=0}^T. Combined with {pθ(xn|y0:n, xn+1)}_{n=0}^{T−1}, this allows us to obtain S_T^θ. An alternative to these forward–backward procedures is the generalized two-filter formula [6].

4.2.2 Particle implementation. The decomposition (4.2) suggests that it is possible to sample approximately from pθ(x0:T|y0:T) by running a particle filter from time n = 0 to T, storing the approximate filtering distributions {p̂θ(dxn|y0:n)}_{n=0}^T, that is, the marginals of (3.7), then sampling XT ∼ p̂θ(dxT|y0:T) and, for n = T − 1, T − 2, . . . , 0, sampling Xn ∼ p̂θ(dxn|y0:n, Xn+1), where this distribution is obtained by substituting p̂θ(dxn|y0:n) for pθ(dxn|y0:n) in (4.3):

(4.5)   p̂θ(dxn|y0:n, Xn+1) = Σ_{i=1}^N W_n^i fθ(Xn+1|X_n^i) δ_{X_n^i}(dxn) / Σ_{i=1}^N W_n^i fθ(Xn+1|X_n^i).

This Forward Filtering Backward Sampling (FFBSa) procedure was proposed in [45]. It requires O(N(T + 1)) operations to generate a single path X0:T, as sampling from (4.5) costs O(N) operations. However, as noted in [28], it is possible to sample using rejection from an alternative approximation of pθ(xn|y0:n, Xn+1) in O(1) operations if we use an unweighted particle approximation of pθ(xn|y0:n) in (4.3) and if the transition prior satisfies fθ(x′|x) ≤ C < ∞. Hence, with this approach, sampling a path X0:T costs, on average, only O(T + 1) operations. A related rejection technique was proposed in [48]. In practice, one may generate N such trajectories to compute Monte Carlo averages that approximate smoothing expectations E[ϕ(X0:T)|y0:T]. In that scenario, the first approach costs O(N²(T + 1)), while the second approach costs O(N(T + 1)) on average. In some applications, the rejection sampling procedure can be computationally costly as the acceptance probability can be very small for some particles; see, for example, Section 4.3 in [75] for empirical results. This has motivated the development of hybrid procedures combining FFBSa and rejection sampling [85].

We can also directly approximate the marginals {pθ(xn|y0:T)}_{n=0}^T. Assuming we have an approximation p̄θ(dxn+1|y0:T) = Σ_{i=1}^N W^i_{n+1|T} δ_{X^i_{n+1}}(dxn+1), where W^i_{T|T} = W_T^i, then by using (4.4) and (4.5) we obtain the approximation p̄θ(dxn|y0:T) = Σ_{i=1}^N W^i_{n|T} δ_{X_n^i}(dxn) with

(4.6)   W^i_{n|T} = W_n^i × Σ_{j=1}^N [ W^j_{n+1|T} fθ(X^j_{n+1}|X_n^i) / Σ_{l=1}^N W_n^l fθ(X^j_{n+1}|X_n^l) ].

This Forward Filtering Backward Smoothing (FFBSm, where "m" stands for "marginal") procedure requires O(N²(T + 1)) operations to approximate {pθ(xn|y0:T)}_{n=0}^T instead of O(N(T + 1)) for the path space and fixed-lag methods. However, this high computational complexity of forward–backward estimates can be reduced using fast computational methods [57]. Particle approximations of generalized two-filter smoothing procedures have also been proposed in [6, 38].
4.3 Forward Smoothing

4.3.1 Principle. Whenever we are interested in computing the sequence {S_n^θ}n≥0 recursively in time, the forward–backward procedure described above is cumbersome, as it requires performing a new backward pass with n + 1 steps at time n. An important but not well-known result is that it is possible to implement exactly the forward–backward procedure using only a forward procedure. This result is at the core of [34], but its exposition relies on tools which are nonstandard for statisticians. We follow here the simpler derivation proposed in [24, 25], which simply consists of rewriting (3.14) as

(4.7)   S_n^θ = ∫ V_n^θ(xn) pθ(xn|y0:n) dxn,

where

(4.8)   V_n^θ(xn) := ∫ { Σ_{k=1}^n sk(xk−1:k) } pθ(x0:n−1|y0:n−1, xn) dx0:n−1.

It can be easily checked using (4.2) that V_n^θ(xn) satisfies the following forward recursion for n ≥ 0:

(4.9)   V_{n+1}^θ(xn+1) = ∫ {V_n^θ(xn) + sn+1(xn:n+1)} pθ(xn|y0:n, xn+1) dxn,

with V_0^θ(x0) = 0 and where pθ(xn|y0:n, xn+1) is given by (4.3). In practice, we shall approximate the function V_n^θ on a certain grid of values xn, as explained in the next section.

4.3.2 Particle implementation. We can easily provide a particle approximation of the forward smoothing recursion. Assume we have access to approximations {V̂_n^θ(X_n^i)} of {V_n^θ(X_n^i)} at time n, where p̂θ(dxn|y0:n) = Σ_{i=1}^N W_n^i δ_{X_n^i}(dxn). Then, when updating our particle filter to obtain p̂θ(dxn+1|y0:n+1) = Σ_{i=1}^N W^i_{n+1} δ_{X^i_{n+1}}(dxn+1), we can directly compute the particle approximations {V̂_{n+1}^θ(X^i_{n+1})} by plugging (4.5) and p̂θ(dxn|y0:n) into (4.7)–(4.9) to obtain

(4.10)   V̂_{n+1}^θ(X^i_{n+1}) = Σ_{j=1}^N [ W_n^j fθ(X^i_{n+1}|X_n^j) / Σ_{l=1}^N W_n^l fθ(X^i_{n+1}|X_n^l) ] {V̂_n^θ(X_n^j) + sn+1(X_n^j, X^i_{n+1})}.

4.4 Convergence Results for Particle Smoothing

Empirically, for a fixed number of particles, these smoothing procedures perform significantly better than the naive path space approach to smoothing (i.e., simply propagating forward the complete state trajectory within a particle filtering algorithm). Many theoretical results validating these empirical findings have been established under assumption (3.11) and additional regularity assumptions. The particle estimate of S_n^θ based on the fixed-lag approximation (4.1) has an asymptotic variance in n/N with a nonvanishing (as N → ∞) bias proportional to n and a constant decreasing exponentially fast with L [74]. In [24, 25, 28], it is shown that when (3.11) holds, there exist 0 < Fθ, Hθ < ∞ such that the asymptotic bias and variance of the particle estimate of S_n^θ computed using the forward–backward procedures satisfy

(4.13)   |E(Ŝ_n^θ) − S_n^θ| ≤ Fθ n/N,   V(Ŝ_n^θ) ≤ Hθ n/N.

The biases for the path space and forward–backward estimators of S_n^θ are actually equal [24]. Recently, it has also been established in [75] that, under similar regularity assumptions, the estimate obtained through (4.12) also admits an asymptotic variance in n/N whenever K ≥ 2.
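The particle recursion (4.10) above can be carried alongside the filter at O(N²) cost per step; a minimal sketch, with placeholder names and assuming sn+1 is vectorized over particles, is:

```python
import numpy as np

def forward_smoothing_update(V, x_prev, W_prev, x_new, log_f, s):
    """One step of the particle forward smoothing recursion (4.10).

    V      : array of N values V̂_n^θ(X_n^j)
    x_prev : N particles X_n^j with normalized weights W_prev
    x_new  : N new particles X_{n+1}^i
    log_f  : function(x_next, x) -> log fθ(x_next | x), vectorized in x
    s      : function(x, x_next) -> s_{n+1}(x, x_next), vectorized in x
    Returns the array of N values V̂_{n+1}^θ(X_{n+1}^i).
    """
    V_new = np.empty_like(V)
    for i, xi in enumerate(x_new):
        logw = np.log(W_prev) + log_f(xi, x_prev)   # backward kernel (4.5) evaluated at X_{n+1}^i
        w = np.exp(logw - logw.max())
        w /= w.sum()
        V_new[i] = np.sum(w * (V + s(x_prev, xi)))
    return V_new
```

After the final update, Ŝ_n^θ = Σ_i W_n^i V̂_n^θ(X_n^i) as in (4.7); only the current particles and the N stored values need to be kept in memory, which is what makes the recursion usable on-line.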
)} in n/N whenever K ≥ 2.
Carlo error, a popular strategy is to make the evalu- 5.1.2 Gradient ascent The log-likelihood ℓT (θ)
ated function continuous by using common random may be maximized with the following steepest as-
numbers over different evaluations to ease the opti- cent algorithm: at iteration k + 1
mization. Unfortunately, this strategy is not helpful (5.1) θk+1 = θk + γk+1 ∇θ ℓT (θ)|θ=θk ,
in the particle context. Indeed, in the resampling
i
stage, particles {X n }N are resampled according to where ∇θ ℓT (θ)|θ=θk is the gradient of ℓT (θ) w.r.t. θ
PN i=1 i evaluated at θ = θk and {γk } is a sequence of positive
the distribution i=1 W n+1 δXni (dxn ) which admits real numbers, called the step-size sequence. Typi-
a piecewise constant and hence discontinuous cumu- cally, γk is determined adaptively at iteration k us-
lative distribution function (c.d.f.). A small change ing a line search or the popular Barzilai–Borwein al-
in θ will cause a small change in the importance ternative. Both schemes guarantee convergence to a
i
weights {W n+1 }N i=1 and this will potentially gener- local maximum under weak regularity assumptions;
ate a different set of resampled particles. As a result, see [95] for a survey.
the log-likelihood function estimate will not be con- The score vector ∇θ ℓT (θ) can be computed by us-
tinuous in θ even if ℓT (θ) is continuous. ing Fisher’s identity given in (2.4). Given (2.2), it is
To bypass this problem, an importance sampling easy to check that the score is of the form (3.14). An
method was introduced in [49], but it has compu- alternative to Fisher’s identity to compute the score
tational complexity O(N 2 (T + 1)) and only pro- is presented in [20], but this also requires computing
vides low variance estimates in the neighborhood an expectation of the form (3.14).
of a suitably preselected parameter value. In the These score estimation methods are not appli-
restricted scenario where X ⊆ R, an elegant solu- cable in complex scenarios where it is possible to
tion to the discontinuity problem was proposed in sample from fθ (x′ |x), but the analytical expres-
[72]. The method uses common random numbers sion of this transition kernel is unavailable [51]. For
and introduces a “continuous” version of the re- those models, a naive approach is to use a finite
sampling step by finding a permutation σ such that difference estimate of the gradient; however, this
σ(1) σ(2) σ(N ) might generate too high a variance estimate. An
Xn ≤ Xn ≤ · · · ≤ Xn and defining a piece-
wise linear approximation of the resulting c.d.f. from interesting alternative presented in [50], under the
which particles are resampled, that is, name of iterated filtering, consists of deriving an ap-
! proximation of ∇θ ℓT (θ)|θ=θk based on the posterior
k−1
X σ(k−1)
σ(i) σ(k) x − Xn moments {E(ϑn |y0 : n ), V(ϑn |y0 : n )}Tn=0 of an artifi-
Fn (x) = W n+1 + W n+1 σ(k) σ(k−1)
, cial state-space model with latent Markov process
i=1 Xn − Xn
{Zn = (Xn , ϑn )}Tn=0 ,
σ(k−1) σ(k)
Xn ≤ x ≤ Xn . (5.2) ϑn+1 = ϑn + εn+1 , Xn+1 ∼ fϑn+1 (·|xn ),
This method requires O(N (T + 1) log N ) operations and observed process Yn+1 ∼ gϑn+1 (·|xn+1 ). Here
due to the sorting of the particles, but the result- {εn }n≥1 is a zero-mean white noise sequence with
ing continuous estimate of ℓT (θ) can be maximized variance σ 2 Σ, E(ϑn+1 |ϑn ) = ϑn , E(ϑ0 ) = θk , V(ϑ0 ) =
using standard optimization techniques. Extensions τ 2 Σ. It is shown in [50] that this approximation im-
to the multivariate case where X ⊆ Rnx (with nx > proves as σ 2 , τ 2 → 0 and σ 2 /τ 2 → 0. Clearly, as the
1) have been proposed in [59] and [22]. However, variance σ 2 of the artificial dynamic noise {εn } on
the scheme [59] does not guarantee continuity of the θ-component decreases, it will be necessary to
the likelihood function estimate and only provides use more particles to approximate ∇θ ℓT (θ)|θ=θk as
log-likelihood estimates which are positively corre- the mixing properties of the artificial dynamic model
lated for neighboring values in the parameter space, deteriorates.
whereas the scheme in [22] has O(N 2 ) computa- 5.1.3 Expectation–Maximization Gradient ascent
tional complexity and relies on a nonstandard par- algorithms can be numerically unstable as they re-
ticle filtering scheme. quire to scale carefully the components of the score
When θ is high dimensional, the optimization over vector. The Expectation Maximization (EM) algo-
the parameter space may be made more efficient if rithm is a very popular alternative procedure for
provided with estimates of the gradient. This is ex- maximizing ℓT (θ) [27]. At iteration k + 1, we set
ploited by the algorithms described in the forthcom-
ing sections. (5.3) θk+1 = arg max Q(θk , θ),
θ
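Whatever score estimate is used (Fisher's identity, the alternative of [20], or iterated filtering), the ascent iteration (5.1) itself is only a few lines. The sketch below uses a simple decreasing step-size schedule in place of the line search or Barzilai–Borwein steps mentioned above; score_estimate is a placeholder for, e.g., a Fisher-identity estimate computed with one of the smoothers of Section 4.

```python
import numpy as np

def gradient_ascent(theta0, score_estimate, n_iter=50, gamma0=0.1, alpha=0.6):
    """Steepest ascent on the log-likelihood, cf. (5.1).

    score_estimate : function(theta) -> estimate of ∇θ ℓ_T(θ)
    gamma0, alpha  : step sizes γ_k = gamma0 * k**(-alpha), a simple fixed schedule
    """
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, n_iter + 1):
        theta = theta + gamma0 * k ** (-alpha) * np.asarray(score_estimate(theta))
    return theta
```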
5.1.3 Expectation–Maximization. Gradient ascent algorithms can be numerically unstable as they require carefully scaling the components of the score vector. The Expectation–Maximization (EM) algorithm is a very popular alternative procedure for maximizing ℓT(θ) [27]. At iteration k + 1, we set

(5.3)   θk+1 = arg max_θ Q(θk, θ),

where

(5.4)   Q(θk, θ) = ∫ log pθ(x0:T, y0:T) pθk(x0:T|y0:T) dx0:T.

The sequence {ℓT(θk)}k≥0 generated by this algorithm is nondecreasing. The EM is usually favored by practitioners whenever it is applicable, as it is numerically more stable than gradient techniques. In terms of implementation, the EM consists of computing an ns-dimensional summary statistic of the form (3.14) when pθ(x0:T, y0:T) belongs to the exponential family, and the maximizing argument of Q(θk, θ) can then be characterized explicitly through a suitable function Λ : R^{ns} → Θ, that is,

(5.5)   θk+1 = Λ(T^{−1} S_T^{θk}).

5.1.4 Discussion of particle implementations. The path space approximation (3.7) can be used to approximate the score (2.4) and the summary statistics of the EM algorithm at a computational cost of O(N(T + 1)); see [11], Section 8.3, and [74, 81]. Experimentally, the variance of the associated estimates typically increases quadratically with T [81]. To obtain estimates whose variance typically increases only linearly with T at a similar computational cost, one can use the fixed-lag approximation presented in Section 4.1, or a more recent alternative where the path space method is used but the additive functional of interest, which is a sum of terms over n = 0, . . . , T, is approximated by a sum of similar terms which are now exponentially weighted w.r.t. n [73]. These methods introduce a nonvanishing asymptotic bias that is difficult to quantify but appear to perform well in practice.

To improve over the path space method without introducing any such asymptotic bias, the FFBSm and forward smoothing procedures discussed in Sections 4.2 and 4.3, as well as the generalized two-filter smoother, have been used [6, 24, 25, 81, 82]. Experimentally, the variance of the associated estimates typically increases linearly with T [81], in agreement with the theoretical results in [24, 25, 28]. However, the computational complexity of these techniques is O(N²(T + 1)). For a fixed computational complexity of order O(N²(T + 1)), an informal comparison of the performance of the path space estimate using N² particles and the forward–backward estimate using N particles suggests that both estimates admit a Mean Square Error (MSE) of order O(N^{−2}(T + 1)), but the MSE of the path space estimate is variance dominated, whereas the forward–backward estimates are bias dominated. This can be understood by decomposing the MSE as the sum of the squared bias and the variance and then substituting appropriately for N² particles in (3.15) for the path space method and for N particles in (4.13) for the forward–backward estimates. We confirm this fact experimentally in Section 7.1.

These experimental results suggest that these particle smoothing estimates might thus be of limited interest compared to the path based estimates for ML parameter inference when accounting for computational complexity. However, this comparison ignores that the O(N²) computational complexity of these particle smoothing estimates can be reduced to O(N) by sampling approximately from pθ(x0:T|y0:T) with the FFBSa procedure in Section 4.2 or by using fast computational methods [57]. Related O(N) approaches have been developed for generalized two-filter smoothing [7, 38]. When applicable, these fast computational methods should be favored.

5.2 On-Line Methods

For a long observation sequence the computation of the gradient of ℓT(θ) can be prohibitive and, moreover, we might have real-time constraints. An alternative would be a recursive procedure in which the data are run through once sequentially. If θn is the estimate of the model parameter after the first n observations, a recursive method would update the estimate to θn+1 after receiving the new data yn. Several on-line variants of the ML procedures described earlier are now presented. For these methods to be justified, it is crucial for the observation process to be ergodic, so that the averaged log-likelihood ℓT(θ)/T has a well-defined limit ℓ(θ) as T → +∞.

5.2.1 On-line gradient ascent. An alternative to the batch gradient ascent (5.1) is the following parameter update scheme at time n ≥ 0:

(5.6)   θn+1 = θn + γn+1 ∇ log pθ(yn|y0:n−1)|θ=θn,

where the positive nonincreasing step-size sequence {γn}n≥1 satisfies Σn γn = ∞ and Σn γn² < ∞ [5, 64], for example, γn = n^{−α} for 0.5 < α ≤ 1. Upon receiving yn, the parameter estimate is updated in the direction of ascent of the conditional density of this new observation. In other words, one recognizes in
(5.6) the update of the gradient ascent algorithm (5.1), except that the partial (up to time n) likelihood is used. The algorithm in the present form is, however, not suitable for on-line implementation, because evaluating the gradient of log pθ(yn|y0:n−1) at the current parameter estimate requires computing the filter from time 0 to time n using the current parameter value θn.

An algorithm bypassing this problem has been proposed in the literature for a finite state-space latent process in [64]. It relies on the following update scheme:

(5.7)   θn+1 = θn + γn+1 ∇ log pθ0:n(yn|y0:n−1),

where ∇ log pθ0:n(yn|y0:n−1) is defined as

(5.8)   ∇ log pθ0:n(yn|y0:n−1) = ∇ log pθ0:n(y0:n) − ∇ log pθ0:n−1(y0:n−1),

with the notation ∇ log pθ0:n(y0:n) corresponding to a "time-varying" score which is computed with a filter using the parameter θp at time p. The update rule (5.7) can be thought of as an approximation to the update rule (5.6). If we use Fisher's identity to compute this "time-varying" score, then we have, for 1 ≤ p ≤ n,

(5.9)   sp(xp−1:p) = ∇ log fθ(xp|xp−1)|θ=θp + ∇ log gθ(yp|xp)|θ=θp.

The asymptotic properties of the recursion (5.7) (i.e., the behavior of θn in the limit as n goes to infinity) have been studied in [64] for a finite state-space HMM. It is shown that under regularity conditions this algorithm converges toward a local maximum of the average log-likelihood ℓ(θ), ℓ(θ) being maximized at the "true" parameter value under identifiability assumptions. Similar results hold for the recursion (5.6).

5.2.2 On-line Expectation–Maximization. It is also possible to propose an on-line version of the EM algorithm. This was originally proposed for finite state-space and linear Gaussian models in [35, 42]; see [9] for a detailed presentation in the finite state-space case. Assume that pθ(x0:n, y0:n) is in the exponential family. In the on-line implementation of EM, running averages n^{−1} S_n^θ of the sufficient statistics are computed [8, 35]. Let {θp}0≤p≤n be the sequence of parameter estimates of the on-line EM algorithm computed sequentially based on y0:n−1. When yn is received, we compute

(5.10)   S_{θ0:n} = γn+1 ∫ sn(xn−1:n) pθ0:n(xn−1, xn|y0:n) dxn−1:n + (1 − γn+1) Σ_{k=0}^{n−1} ( Π_{i=k+2}^{n} (1 − γi) ) γk+1 ∫ sk(xk−1:k) pθ0:k(xk−1:k|y0:k) dxk−1:k,

where {γn}n≥1 needs to satisfy Σn γn = ∞ and Σn γn² < ∞. Then the standard maximization step (5.5) is used as in the batch version,

(5.11)   θn+1 = Λ(S_{θ0:n}).

The recursive calculation of S_{θ0:n} is achieved by setting V_{θ0} = 0, then computing

(5.12)   V_{θ0:n}(xn) = ∫ {γn+1 sn(xn−1, xn) + (1 − γn+1) V_{θ0:n−1}(xn−1)} pθ0:n(xn−1|y0:n−1, xn) dxn−1

and, finally,

(5.13)   S_{θ0:n} = ∫ V_{θ0:n}(xn) pθ0:n(xn|y0:n) dxn.

Again, the subscript θ0:n on pθ0:n(x0:n|y0:n) indicates that the posterior density is being computed sequentially using the parameter θp at time p ≤ n. The filtering density is then advanced from time n − 1 to time n by using fθn(xn|xn−1), gθn(yn|xn) and pθn(yn|y0:n) in the fraction on the r.h.s. of (3.1).

Whereas the convergence of the EM algorithm toward a local maximum of the average log-likelihood ℓ(θ) has been established for i.i.d. data [10], its convergence for state-space models remains an open problem, despite empirical evidence that it does converge [8, 9, 24]. This has motivated the development of modified versions of the on-line EM algorithm for which convergence results are easier to establish [4, 62]. However, the on-line EM presented here usually performs better empirically [63].
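The particle version of (5.11)–(5.13) combines the forward smoothing update of Section 4.3.2 with a stochastic-approximation step size. A hedged per-step sketch is given below; it assumes an exponential-family model, vectorized placeholder functions as before, and an externally supplied M-step map Lambda, and it carries per-particle statistics V rather than full trajectories.

```python
import numpy as np

def online_em_step(V, x_prev, W_prev, x_new, W_new, log_f, s_n, gamma, Lambda, freeze=False):
    """One particle on-line EM update, following (5.11)-(5.13).

    V              : per-particle statistics V_{θ_{0:n-1}}(X_{n-1}^j), shape (N, ns)
    x_prev, W_prev : particles and normalized weights at time n-1
    x_new,  W_new  : particles and normalized weights at time n
    log_f          : function(x_n, x) -> log f_{θ_n}(x_n | x), vectorized in x
    s_n            : function(x_prev, x_n) -> s_n(x_{n-1}, x_n), shape (N, ns)
    gamma          : step size γ_{n+1}
    Lambda         : M-step map of (5.5)/(5.11)
    freeze         : skip the M-step during an initial "freezing" phase
    Returns (V_new, theta_new); theta_new is None while frozen.
    """
    N = x_new.shape[0]
    V_new = np.empty_like(V)
    for i in range(N):
        logw = np.log(W_prev) + log_f(x_new[i], x_prev)              # backward kernel, cf. (4.5)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        V_new[i] = w @ (gamma * s_n(x_prev, x_new[i]) + (1.0 - gamma) * V)   # (5.12)
    S = W_new @ V_new                                                # (5.13)
    theta_new = None if freeze else Lambda(S)                        # (5.11)
    return V_new, theta_new
```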
5.2.3 Discussion of particle implementations. Both the on-line gradient and EM procedures require approximating terms (5.8) and (5.10) of the form (3.14), except that the expectation is now w.r.t. the posterior density pθ0:n(x0:n|y0:n), which is updated using the parameter θp at time p ≤ n. In this on-line framework, only the path space, fixed-lag and forward smoothing estimates are applicable; the fixed-lag approximation, however, introduces a nonvanishing bias. For the on-line EM algorithm, similarly to the batch case discussed in Section 5.1.4, the benefits of using the forward smoothing estimate [24] compared to the path space estimate [8] with N² particles are rather limited, as experimentally demonstrated in Section 7.1. However, for the on-line gradient ascent algorithm, the gradient term ∇ log pθ0:n(yn|y0:n−1) in (5.7) is a difference between two score-like vectors (5.8), and the behavior of its particle estimates differs significantly from its EM counterpart. Indeed, the variance of the particle path estimate of ∇ log pθ0:n(yn|y0:n−1) increases linearly with n, yielding an unreliable gradient ascent procedure, whereas the particle forward smoothing estimate has a variance uniformly bounded in time under appropriate regularity assumptions and yields a stable gradient ascent procedure [26]. Hence, the use of a procedure of computational complexity O(N²) is clearly justified in this context. The very recent paper [88] reports that the computationally cheaper estimate (4.12) appears to exhibit similar properties whenever K ≥ 2 and might prove an attractive alternative.

6. BAYESIAN PARAMETER ESTIMATION

In the Bayesian setting, we assign a suitable prior density p(θ) to θ, and inference is based on the joint posterior density p(x0:T, θ|y0:T) in the off-line case or the sequence of posterior densities {p(x0:n, θ|y0:n)}n≥0 in the on-line case.

6.1 Off-Line Methods

6.1.1 Particle Markov chain Monte Carlo methods. Using MCMC is a standard approach to approximate p(x0:T, θ|y0:T). Unfortunately, designing efficient MCMC sampling algorithms for nonlinear non-Gaussian state-space models is a difficult task: one-variable-at-a-time Gibbs sampling typically mixes very poorly for such models, whereas blocking strategies that have been proposed in the literature are typically very model-dependent; see, for instance, [52].

Particle MCMC methods are a class of MCMC techniques which rely on particle methods to build efficient high-dimensional proposal distributions in a generic manner [3]. We limit ourselves here to the presentation of the Particle Marginal Metropolis–Hastings (PMMH) sampler, which is an approximation of an ideal MMH sampler for sampling from p(x0:T, θ|y0:T) that would utilize the following proposal density:

(6.1)   q((x′0:T, θ′)|(x0:T, θ)) = q(θ′|θ) pθ′(x′0:T|y0:T),

where q(θ′|θ) is a proposal density to obtain a candidate θ′ when we are at location θ. The acceptance probability of this sampler is

(6.2)   1 ∧ [ pθ′(y0:T) p(θ′) q(θ|θ′) ] / [ pθ(y0:T) p(θ) q(θ′|θ) ].

Unfortunately, this ideal algorithm cannot be implemented, as we can neither sample exactly from pθ′(x0:T|y0:T) nor compute the likelihood terms pθ(y0:T) and pθ′(y0:T) appearing in the acceptance probability.

The PMMH sampler is an approximation of this ideal MMH sampler which relies on particle approximations of these unknown terms. Given θ and a particle approximation p̂θ(y0:T) of pθ(y0:T), we sample θ′ ∼ q(θ′|θ), then run a particle filter to obtain approximations p̂θ′(dx0:T|y0:T) and p̂θ′(y0:T) of pθ′(dx0:T|y0:T) and pθ′(y0:T). We then sample X′0:T ∼ p̂θ′(dx0:T|y0:T), that is, we choose randomly one of the N particles generated by the particle filter, with probability W_T^i for particle i, and accept (θ′, X′0:T) [and p̂θ′(y0:T)] with probability

(6.3)   1 ∧ [ p̂θ′(y0:T) p(θ′) q(θ|θ′) ] / [ p̂θ(y0:T) p(θ) q(θ′|θ) ].

The acceptance probability (6.3) is a simple approximation of the "ideal" acceptance probability (6.2). This algorithm was first proposed as a heuristic to sample from p(θ|y0:T) in [39]. Its remarkable feature, established in [3], is that it admits p(x0:T, θ|y0:T) as invariant distribution whatever the number of particles N used in the particle approximation. However, the choice of N has an impact on the performance of the algorithm. Using large values of N usually results in PMMH averages with variances lower than the corresponding averages using fewer samples, but the computational cost of constructing p̂θ(y0:T) increases with N.
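A minimal PMMH sketch is given below. It assumes a particle_filter routine returning the log-likelihood estimate log p̂θ(y0:T) together with one trajectory drawn from p̂θ(dx0:T|y0:T) (for instance, an extension of the bootstrap filter sketched in Section 3.2 that keeps ancestral paths), a user-supplied log-prior and a symmetric Gaussian random walk proposal; all names are placeholders.

```python
import numpy as np

def pmmh(y, theta0, particle_filter, log_prior, rw_scale, n_iter, rng):
    """Particle marginal Metropolis-Hastings with a Gaussian random walk proposal.

    particle_filter : function(theta, y, rng) -> (log p̂_θ(y_{0:T}), sampled path X_{0:T})
    log_prior       : function(theta) -> log p(θ)
    The acceptance ratio below is the particle approximation (6.3) of (6.2);
    a symmetric proposal makes the q(θ|θ')/q(θ'|θ) factor cancel.
    """
    theta = np.asarray(theta0, dtype=float)
    log_Z, path = particle_filter(theta, y, rng)      # stored, never recomputed for the current state
    chain = []
    for _ in range(n_iter):
        theta_prop = theta + rw_scale * rng.standard_normal(theta.shape)
        log_Z_prop, path_prop = particle_filter(theta_prop, y, rng)
        log_ratio = (log_Z_prop + log_prior(theta_prop)) - (log_Z + log_prior(theta))
        if np.log(rng.random()) < log_ratio:
            theta, log_Z, path = theta_prop, log_Z_prop, path_prop
        chain.append(theta.copy())
    return np.array(chain)
```

How large N should be in practice, and hence how the cost scales with T, is discussed next.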
A simplified analysis of this algorithm suggests that N should be selected such that the standard deviation of the logarithm of the particle likelihood estimate is around 0.9 if the ideal MMH sampler uses the perfect proposal q(θ′|θ) = p(θ′|y0:n) [79], and around 1.8 if one uses an isotropic normal random walk proposal for a target that is a product of d i.i.d. components with d → ∞ [83]. For general proposal and target densities, a recent theoretical analysis and empirical results suggest that this standard deviation should be selected around 1.2–1.3 [33]. As the variance of this estimate typically increases linearly with T, this means that the computational complexity is of order O(T²) per iteration.

A particle version of the Gibbs sampler is also available [3], which mimics the two-component Gibbs sampler sampling iteratively from p(θ|x0:T, y0:T) and pθ(x0:T|y0:T). These algorithms rely on a nonstandard version of the particle filter where N − 1 particles are generated conditional upon a "fixed" particle. Recent improvements over this particle Gibbs sampler introduce mechanisms to rejuvenate the fixed particle, using forward or backward sampling procedures [66, 89, 91]. These methods perform extremely well empirically but, contrary to the PMMH, it is still unclear how one should scale N with T.

6.2 On-Line Methods

In this context, we are interested in approximating on-line the sequence of posterior densities {p(x0:n, θ|y0:n)}n≥0. We emphasize that, contrary to the on-line ML parameter estimation procedures, none of the methods presented in this section bypass the particle degeneracy problem. This should come as no surprise. As discussed in Section 3.2.2, even for a fixed θ, the particle estimate of pθ(y0:n) has a relative variance that increases linearly with n under favorable mixing assumptions. The methods in this section attempt to approximate p(θ|y0:n) ∝ pθ(y0:n) p(θ). This is a harder problem, as it implicitly requires approximating pθi(y0:n) for all the particles {θi} approximating p(θ|y0:n).

6.2.1 Augmenting the state with the parameter. At first sight, it seems that estimating the sequence of posterior densities {p(x0:n, θ|y0:n)}n≥0 can be easily achieved using standard particle methods by merely introducing the extended state Zn = (Xn, θn), with initial density p(θ0) µθ0(x0) and transition density fθn(xn|xn−1) δθn−1(θn), that is, θn = θn−1. However, this extended process Zn clearly does not possess any forgetting property (as discussed in Section 3), so the algorithm is bound to degenerate. Specifically, the parameter space is explored only in the initial step of the algorithm. Then, each successive resampling step reduces the diversity of the sample of θ values; after a certain time n, the approximation p̂(dθ|y0:n) contains a single unique value for θ. This is clearly a poor approach. Even in the much simpler case when there is no latent variable X0:n, it is shown in [17], Theorem 4, that the asymptotic variance of the corresponding particle estimates diverges at least at a polynomial rate, which grows with the dimension of θ.

A pragmatic approach that has proven useful in some applications is to introduce artificial dynamics for the parameter θ [54],

(6.4)   θn+1 = θn + εn+1,

where {εn}n≥0 is an artificial dynamic noise with decreasing variance. Standard particle methods can now be applied to approximate {p(x0:n, θ0:n|y0:n)}n≥0. A related kernel density estimation method appeared in [67], which proposes to use a kernel density estimate of p(θ|y0:n) from which one samples. As before, the static parameter is transformed into a slowly time-varying one, whose dynamics are related to the kernel bandwidth. To mitigate the artificial variance inflation, a shrinkage correction is introduced. An improved version of this method has recently been proposed in [41]. It is difficult to quantify how much bias is introduced in the resulting estimates by the introduction of this artificial dynamics. Additionally, these methods require a significant amount of tuning, for example, choosing the variance of the artificial dynamic noise or the kernel width. However, they can perform satisfactorily in practice [41, 67].
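In implementation terms, the artificial-dynamics device (6.4) amounts to jittering the θ-component of each extended particle at every step with a noise whose scale shrinks over time. A hedged one-function sketch follows; the constants c and decay are tuning choices of ours, not values from [54] or [67].

```python
import numpy as np

def jitter_parameters(theta_particles, n, rng, c=0.1, decay=1.0):
    """Artificial-dynamics move (6.4): θ_{n+1} = θ_n + ε_{n+1}.

    Adds Gaussian noise of standard deviation c * (n + 1)**(-decay) to each
    θ-particle; applied after resampling within a filter on Z_n = (X_n, θ_n).
    """
    scale = c * (n + 1) ** (-decay)
    return theta_particles + scale * rng.standard_normal(theta_particles.shape)
```

The bias this induces is, as noted above, difficult to quantify, and the decay schedule has to be tuned for each application.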
6.2.2 Practical filtering. The practical filtering approach proposed in [80] relies on the following fixed-lag approximation:

(6.5)   p(x0:n−L, θ|y0:n−1) ≈ p(x0:n−L, θ|y0:n)

for L large enough; that is, observations coming after n − 1 presumably bring little information on x0:n−L. To sample approximately from p(θ|y0:n), one uses the following iterative process: at time n, several MCMC chains are run in parallel to sample from

p(xn−L+1:n, θ|y0:n, X^i_{0:n−L}) = p(xn−L+1:n, θ|yn−L+1:n, X^i_{n−L}),

where the X^i_{n−L} have been obtained at the previous iteration and are such that (approximately) X^i_{n−L} ∼ p(xn−L|y0:n−1) ≈ p(xn−L|y0:n). Then one collects the first component X^i_{n−L+1} of the simulated sample X^i_{n−L+1:n}, increments the time index and runs several new MCMC chains in parallel to sample from p(xn−L+2:n+1, θ|yn−L+2:n+1, X^i_{n−L+1}), and so on. The algorithm is started at time L − 1, with MCMC chains that target p(x0:L−1|y0:L−1). Like all methods based on a fixed-lag approximation, the choice of the lag L is difficult, and this introduces a nonvanishing bias which is difficult to quantify. However, the method performs well on the examples presented in [80].

6.2.3 Using MCMC steps within particle methods. To avoid the introduction of an artificial dynamic model or of a fixed-lag approximation, an approach originally proposed independently in [36] and [44] consists of adding MCMC steps to reintroduce "diversity" among the particles. Assuming we use an auxiliary particle filter to approximate {p(x0:n, θ|y0:n)}n≥0, the particles {X^i_{0:n}, θ_n^i} obtained after the sampling step at time n are approximately distributed according to

p̃(x0:n, θ|y0:n) ∝ p(x0:n−1, θ|y0:n−1) qθ(xn, yn|xn−1).

We have p̃(x0:n, θ|y0:n) = p(x0:n, θ|y0:n) if qθ(xn|yn, xn−1) = pθ(xn|yn, xn−1) and qθ(yn|xn−1) = pθ(yn|xn−1). To add diversity in this population of particles, we introduce an MCMC kernel Kn(d(x′0:n, θ′)|(x0:n, θ)) with invariant density p̃(x0:n, θ|y0:n) and replace, at the end of each iteration, the set of resampled particles (X̄^i_{0:n}, θ̄_n^i) with N "mutated" particles (X̃^i_{0:n}, θ̃_n^i) simulated, for i = 1, . . . , N, from

(X̃^i_{0:n}, θ̃_n^i) ∼ Kn(d(x0:n, θ)|(X̄^i_{0:n}, θ̄_n^i)).

If we use the SISR algorithm, then we can alternatively use an MCMC step of invariant density p(x0:n, θ|y0:n) after the resampling step at time n. Contrary to standard applications of MCMC, the kernel does not have to be ergodic. Ensuring ergodicity would indeed require one to sample an increasing number of variables as n increases; this algorithm would have an increasing cost per iteration, which would prevent its use in on-line scenarios, but it can be an interesting alternative to standard MCMC and was suggested in [61]. In practice, one therefore sets X̃^i_{0:n−L} = X^i_{0:n−L} and only samples θ and X̃^i_{n−L+1:n}, where L is a small integer; often L = 0 (only θ is updated). Note that the memory requirements for this method do not increase over time if p̃θ(x0:n, y0:n) is in the exponential family and thus can be summarized by a set of fixed-dimensional sufficient statistics sn(x0:n, y0:n). This type of method was first used to perform on-line Bayesian parameter estimation in a context where p̃θ(x0:n, y0:n) is in the exponential family [36, 44]. Similar strategies were adopted in [2] and [84]. In the particular scenario where qθ(xn|yn, xn−1) = pθ(xn|yn, xn−1) and qθ(yn|xn−1) = pθ(yn|xn−1), this method was mentioned in [2, 86] and is discussed at length in [70], who named it particle learning. Extensions of this strategy to parameter estimation in conditionally linear Gaussian models, where a part of the state is integrated out using Kalman techniques [15, 31], are proposed in [13].

As opposed to the methods relying on kernel estimates or artificial dynamics, these MCMC-based approaches have the advantage of adding diversity to the particles approximating p(θ|y0:n) without perturbing the target distribution. Unfortunately, these algorithms rely implicitly on the particle approximation of the density p(x0:n|y0:n) even if, algorithmically, it is only necessary to store some fixed-dimensional sufficient statistics {sn(X^i_{0:n}, y0:n)}. Hence, in this respect they suffer from the degeneracy problem. This was noticed as early as [2]; see also the word of caution in the conclusion of [4, 36] and [18]. The practical implication is that one observes empirically that the resulting Monte Carlo estimates can display quite a lot of variability over multiple runs, as demonstrated in Section 7.2. This should not come as a surprise, as the sequence of posterior distributions does not have exponential forgetting properties; hence, there is an accumulation of Monte Carlo errors over time.
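To make the sufficient-statistics mechanism concrete, here is a hedged sketch for the linear Gaussian model (7.1) with ρ and σ known and θ = τ²: each particle carries the fixed-dimensional statistic (n, Σ_k (xk − ρ xk−1)²) of its own trajectory, and the MCMC move on θ reduces to an exact conjugate draw. This is only an illustration of the mechanism; the degeneracy caveat above still applies, since the statistics are themselves functions of the degenerate path approximation.

```python
import numpy as np

def refresh_tau2(stats, a=1.0, b=1.0, rng=None):
    """Conjugate refresh of θ = τ² for model (7.1), with ρ and σ known.

    stats : array of shape (N, 2); stats[i] = (n_i, ssq_i) with
            ssq_i = Σ_k (x_k^i - ρ x_{k-1}^i)², carried along with particle i.
    With an inverse-gamma IG(a, b) prior on τ², the conditional posterior given
    the particle's trajectory is IG(a + n/2, b + ssq/2), so τ² can be redrawn
    exactly for every particle without storing the full path.
    """
    rng = rng or np.random.default_rng()
    n, ssq = stats[:, 0], stats[:, 1]
    return 1.0 / rng.gamma(a + 0.5 * n, 1.0 / (b + 0.5 * ssq))   # one inverse-gamma draw per particle
```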
6.2.4 The SMC² algorithm. The SMC² algorithm, introduced simultaneously in [19] and [43], may be considered as the particle equivalent of Particle MCMC. It mimics an "ideal" particle algorithm proposed in [16] approximating sequentially {p(θ|y0:n)}n≥0, where Nθ particles (in the θ-space) are used to explore these distributions. The Nθ particles at time n are reweighted according to pθ(y0:n+1)/pθ(y0:n) at time n + 1. As these likelihood terms are unknown, we substitute for them p̂θ(y0:n+1)/p̂θ(y0:n), where p̂θ(y0:n) is a particle approximation of the partial likelihood pθ(y0:n) obtained by running a particle filter of Nx particles in the x-dimension, up to time n, for each of the Nθ θ-particles. When particle degeneracy (in the θ-dimension) reaches a certain threshold, θ-particles are refreshed through the succession of a resampling step and an MCMC step, which in these particular settings takes the form of a PMCMC update. The cost per iteration of this algorithm is not constant and, additionally, it is advised to increase Nx with n for the relative variance of p̂θ(y0:n) not to increase; therefore, it cannot be used in truly on-line scenarios. Yet there are practical situations where it may be useful to approximate jointly all the posteriors p(θ|y1:n), for 1 ≤ n ≤ T, for instance, to assess the predictive power of the model.

7. EXPERIMENTAL RESULTS

We focus on illustrating numerically a few algorithms and the impact of the degeneracy problem on parameter inference. This last point is motivated by the fact that particle degeneracy seems to have been overlooked by many practitioners. In this way numerical results may provide valuable insights.

We will consider the following simple scalar linear Gaussian state-space model:

(7.1)   Xn = ρXn−1 + τWn,   Yn = Xn + σVn,

where Vn, Wn are independent zero-mean and unit-variance Gaussians and ρ ∈ [−1, 1]. The main reason for choosing this model is that Kalman recursions can be implemented to provide the exact values of the summary statistics S_n^θ used for ML estimation through the EM algorithm and to compute the exact likelihood pθ(y0:n). Hence, using a fine discretization of the low-dimensional parameter space, we can compute a very good approximation of the true posterior density p(θ|y0:n). In this model it is straightforward to present numerical evidence of some effects of degeneracy for parameter estimation and to show how it can be overcome by choosing an appropriate particle method.

7.1 Maximum Likelihood Methods

As ML methods require approximating smoothed additive functionals S_n^θ of the form (3.14), we begin by investigating the empirical bias, variance and MSE of two standard particle estimates of S_n^θ, where we set sk(xk−1, xk) = xk−1 xk for the model described in (7.1). The first estimate relies on the path space method with computational cost O(N) per time step, which uses p̂θ(dx0:n|y0:n) in (3.7) to approximate S_n^θ as Ŝ_n^θ; see [11], Section 8.3, for more details. The second estimate relies on the forward implementation of FFBSm presented in Section 4.3 using (4.7)–(4.11); see [24]. Recall that this procedure has a computational cost that is O(N²) per time step for N particles and provides the same estimates as the standard forward–backward implementation of FFBSm. For the sake of brevity, we will not consider the remaining smoothing methods of Section 4; for the fixed-lag and the exponentially weighted approximations we refer the reader to [74] and [73], respectively, for numerical experiments.

We use a simulated data set of size 6 × 10⁴ obtained using θ* = (ρ*, τ*², σ*²) = (0.8, 0.1, 1) and then generate 300 independent replications of each method in order to compute the empirical bias and variance of Ŝ_n^θ* when θ is fixed to θ*. In order to make a comparison that takes into account the computational cost, we use N² particles for the O(N) method and N for the O(N²) one. We look separately at the behavior of the bias of Ŝ_n^θ and at the variance and MSE of the rescaled estimates Ŝ_n^θ/√n. The results are presented in Figure 1 for N = 50, 100, 200.

For both methods the bias grows linearly with time, this growth being higher for the O(N²) method. For the variance of Ŝ_n^θ/√n, we observe a linear growth with time for the O(N) method with N² particles, whereas this variance appears roughly constant for the O(N²) method. Finally, the MSE of Ŝ_n^θ/√n grows linearly with time for both methods, as expected. In this particular scenario, the constants of proportionality are such that the MSE is lower for the O(N) method than for the O(N²) method. In general, we can expect the O(N) method to be superior in terms of the bias and the O(N²) method superior in terms of the variance. These results are in agreement with the theoretical results in the literature [24, 25, 28], but additionally show that the lower bound on the variance growth of Ŝ_n^θ for the O(N) method of [81] appears sharp.

We proceed to see how the bias and variance of the estimates of S_n^θ affect the ML estimates when the former are used within both an off-line and an on-line EM algorithm; see Figures 2 and 3, respectively. For the model in (7.1) the E-step corresponds to computing S_n^θ where sk(xk−1, xk) = ((yk − xk)², x²_{k−1}, xk−1 xk, x²_k), and the M-step update function is given by

Λ(z1, z2, z3, z4) = ( z3/z2, z4 − z3²/z2, z1 ).
Fig. 1. Estimating smoothed additive functionals: empirical bias of the estimate of S_n^θ (top panel), empirical variance (middle panel) and MSE (bottom panel) for the estimate of S_n^θ/√n. Left column: O(N) method using N² = 2500, 10,000, 40,000 particles. Right column: O(N²) method using N = 50, 100, 200 particles. In every subplot, the top line corresponds to N = 50, the middle to N = 100 and the lower to N = 200.
A simulated data set for θ∗ = (ρ∗, τ∗, σ∗) = (0.8, 1, 0.2) will be used. In every case we initialize the algorithm using θ_0 = (0.1, 0.1, 0.2) and assume σ∗ is known. In Figures 2 and 3 we present the results obtained using 150 independent replications of the algorithm. For the off-line EM, we use 25 iterations for T = 100, 1000, 2500, 5000, 10,000. For the on-line EM, we use T = 10^5 with the step size set as γ_n = n^{−0.8}, and for the first 50 iterations no M-step update is performed. This "freezing" phase is required to allow for a reasonable estimation of the summary statistic; see [8, 9] for more details. Note that in Figure 3 we plot only the results after the algorithm has converged, that is, for n ≥ 5 × 10^4. In each case, both the O(N) and the O(N²) methods yield fairly accurate results given the low number of particles used. However, we note, as observed previously in the literature, that the on-line EM as well as the on-line gradient ascent method requires a substantial number of observations, that is, over 10,000, before achieving convergence [8, 9, 24, 81]. For smaller data sets, these algorithms can also be used by going through the data, say, K times. Typically, this is cheaper than iterating the off-line algorithms (5.1) or (5.4)–(5.5) K times and can yield comparable parameter estimates [94]. Experimentally, the properties of the estimates of S_n^θ discussed earlier appear to translate into properties of the resulting parameter estimates: the O(N) method provides estimates with less bias but more variance than the O(N²) method.
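The on-line EM schedule just described (step sizes γ_n = n^{−0.8} and an initial freezing phase with no M-step) can be written generically as a stochastic-approximation recursion on the running summary statistic. The sketch below is schematic rather than the implementation behind Figure 3: the per-observation E-step increment is abstracted as a user-supplied function e_step_increment, a hypothetical interface standing in for the O(N) or O(N²) particle smoother.

import numpy as np

def online_em(y, theta0, e_step_increment, m_step, n_freeze=50, decay=0.8):
    """Schematic on-line EM: the running statistic S_bar is updated by stochastic
    approximation, S_bar <- (1 - gamma_n) S_bar + gamma_n s_n, and the parameter
    is refreshed by the M-step only after the freezing phase.
    `e_step_increment(theta, y, n)` must return an estimate of E[s_n | y_{0:n}]
    (hypothetical interface); `m_step` maps averaged statistics to a parameter."""
    theta = np.asarray(theta0, dtype=float)
    S_bar = None
    estimates = []
    for n in range(1, len(y)):
        gamma = n ** (-decay)                     # step size gamma_n = n^{-0.8}
        s_n = e_step_increment(theta, y, n)       # particle estimate of E[s_n | y_{0:n}]
        S_bar = s_n if S_bar is None else (1.0 - gamma) * S_bar + gamma * s_n
        if n > n_freeze:                          # no M-step during the freezing phase
            theta = np.asarray(m_step(S_bar), dtype=float)
        estimates.append(theta.copy())
    return np.array(estimates)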
Fig. 2. Off-line EM: boxplots of θ̂_n for various T using 25 iterations of off-line EM and 150 realizations of the algorithms. Top panels: O(N) method using N = 150² particles. Bottom panels: O(N²) method with N = 150. The dotted horizontal lines are the ML estimates for each time T obtained using Kalman filtering on a grid.
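The benchmarks in Figure 2 (and in the Bayesian experiments below) rely on exact Kalman computations over a fine grid of parameter values. A minimal sketch of such a grid ML search for model (7.1) is given below; the prediction-error decomposition is standard, but the stationary initialisation and the particular grids are illustrative assumptions.

import numpy as np
from itertools import product

def kalman_loglik(y, rho, tau, sigma):
    """Exact log-likelihood of model (7.1) via the prediction-error decomposition,
    starting the state from its stationary distribution (assumption)."""
    m, P = 0.0, tau ** 2 / (1.0 - rho ** 2)
    ll = 0.0
    for obs in y:
        S = P + sigma ** 2                        # predictive variance of y_k
        ll += -0.5 * (np.log(2 * np.pi * S) + (obs - m) ** 2 / S)
        K = P / S                                 # Kalman gain
        m, P = m + K * (obs - m), (1.0 - K) * P   # filtering update
        m, P = rho * m, rho ** 2 * P + tau ** 2   # prediction for the next step
    return ll

def grid_ml(y, rho_grid, tau_grid, sigma):
    """ML estimate of (rho, tau) on a grid, with sigma held fixed (illustrative)."""
    best, argbest = -np.inf, None
    for rho, tau in product(rho_grid, tau_grid):
        ll = kalman_loglik(y, rho, tau, sigma)
        if ll > best:
            best, argbest = ll, (rho, tau)
    return argbest, best

# Example usage (illustrative grid resolution):
# theta_hat, ll = grid_ml(y, np.linspace(-0.99, 0.99, 199), np.linspace(0.05, 2.0, 100), sigma=0.2)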
For more numerical examples regarding the remaining methods discussed in Section 5, we refer the reader to [50, 51] for iterated filtering, to [24, 25, 81] for comparisons of the O(N) and O(N²) methods for EM and gradient ascent, to [8] for the O(N) on-line EM, to [72] and [59], Chapter 10, for smooth likelihood function methods and to [11], Chapters 10–11, for a detailed exposition of off-line EM methods.
Fig. 3. On-line EM: boxplots of θ̂_n for n ≥ 5 × 10^4 using 150 realizations of the algorithms. We also plot the ML estimate at time n obtained using Kalman filtering on a grid (black).
7.2 Bayesian Methods

We still consider the model in (7.1), but simplify it further by fixing either ρ or τ. This is done in order to keep the computations of the benchmarks that use Kalman computations on a grid relatively inexpensive. For those parameters that are not fixed, we shall use the following independent priors: a uniform on [−1, 1] for ρ, and inverse gamma priors for τ² and σ² with shape and scale parameter pairs (a, b) and (c, d), respectively, where a = b = c = d = 1. In all the subsequent examples, we initialize the algorithms by sampling θ from the prior.

We proceed to examine the particle algorithms with MCMC moves that we described in Section 6.2.3. We focus on an efficient implementation of this idea discussed in [70] which can be put in practice for the simple model under consideration. We investigate the effect of the degeneracy problem in this context. The numerical results obtained in this section have been produced in Matlab (code available from the first author) and double-checked using the R program available on the personal web page of the first author of [70, 71].

We first focus on the estimate of the posterior of θ = (τ², σ²) given a long sequence of simulated observations with τ = σ = 1. In this scenario, p_θ(x_{0:n}, y_{0:n}) admits the two-dimensional sufficient statistic

s_n(x_{0:n}, y_{0:n}) = (Σ_{k=1}^{n} (x_k − x_{k−1})², Σ_{k=0}^{n} (y_k − x_k)²),

and θ can be updated using Gibbs steps. We use T = 5 × 10^4 and N = 5000. We ran the algorithm over 100 independent runs on the same data set. We present the results only for τ² and omit those for σ², as they were very similar. The top left panel of Figure 4 shows the box plots for the estimates of the posterior mean, and the top right panel shows how the corresponding relative variance of the estimator of the posterior mean evolves with time. Here the relative variance is defined as the ratio of the empirical variance (over different independent runs) of the posterior mean estimates at time n to the true posterior variance at time n, which in this case is approximated using a Kalman filter on a fine grid. This quantity exhibits a steep increasing trend when n ≥ 15,000 and confirms the aforementioned variability of the estimates of the posterior mean. In the bottom left panel of Figure 4 we plot the average (over different runs) of the estimators of the variance of p(τ²|y_{0:n}). This average variance is also scaled/normalized by the actual posterior variance, the latter again computed using Kalman filtering on a grid. This ratio between the average estimated variance of the posterior and the true one decreases with time n, and it shows that the supports of the approximate posterior densities provided by this method cover, on average, only a small portion of the support of the true posterior. These experiments confirm that in this example the particle method with MCMC steps fails to adequately explore the space of θ. Although the box plots provide some false sense of security, the relative and scaled average variances clearly indicate that any posterior estimates obtained from a single run of the particle method with MCMC steps should be used with caution. Furthermore, in the bottom right panel of Figure 4 we also investigate experimentally the empirical relative variance of the marginal likelihood estimates {p̂(y_{0:n})}_{n≥0}. This relative variance appears to increase quadratically with n for the particle method with MCMC moves instead of linearly, as it does for state-space models with good mixing properties. This suggests that one should increase the number of particles quadratically with the time index to obtain an estimate of the marginal likelihood whose relative variance remains uniformly bounded with respect to the time index. Although we attribute this quadratic relative variance growth to the degeneracy problem, the estimate p̂(y_{0:n}) is not the particle approximation of a smoothed additive functional, so there is not yet any theoretical convergence result explaining this phenomenon rigorously.

One might argue that these particle methods with MCMC moves are meant to be used with larger N and/or shorter data sets T. We shall consider this time a slightly different example where τ = 0.1 is known and we are interested in estimating the posterior of θ = (ρ, σ²) given a sequence of observations obtained using ρ = 0.5 and σ = 1. In that case, the sufficient statistics are

s_n(x_{0:n}, y_{0:n}) = (Σ_{k=1}^{n} x_{k−1}x_k, Σ_{k=0}^{n−1} x_k², Σ_{k=0}^{n} (y_k − x_k)²),

and the parameters can be rejuvenated through a single Gibbs update. In addition, we let T = 5000 and use N = 10^4 particles. In Figure 5 we display the estimated marginal posteriors p(ρ|y_{0:n}) and p(σ²|y_{0:n}) obtained from 50 independent replications of the particle method. On this simple problem, the estimated posteriors seem consistently rather inaccurate for ρ, whereas they perform better for σ², but with some nonnegligible variability over runs which increases as T increases. Similar observations have been reported in [18] and remain unexplained: for some parameters this methodology appears to provide reasonable results despite the degeneracy problem, and for others it provides very unreliable results.
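For the first Bayesian example, the Gibbs rejuvenation of θ = (τ², σ²) exploits conjugacy: given the two sufficient statistics above, each variance is refreshed from an inverse gamma full conditional. The sketch below illustrates one such update; the exact shape/scale bookkeeping (for instance, whether the relevant counts are n or n + 1) is an assumption rather than a transcription of the implementation of [70].

import numpy as np

def gibbs_update_variances(stats, n, a=1.0, b=1.0, c=1.0, d=1.0, rng=None):
    """One conjugate Gibbs refresh of (tau^2, sigma^2) given the sufficient
    statistics stats = (sum_{k=1}^n (x_k - x_{k-1})^2, sum_{k=0}^n (y_k - x_k)^2).
    An inverse gamma IG(shape, scale) draw is obtained as 1 / Gamma(shape, 1/scale).
    The counts n and n + 1 below are an assumption about how edge terms are counted."""
    rng = rng or np.random.default_rng()
    s_x, s_y = stats
    tau2 = 1.0 / rng.gamma(a + 0.5 * n, 1.0 / (b + 0.5 * s_x))
    sigma2 = 1.0 / rng.gamma(c + 0.5 * (n + 1), 1.0 / (d + 0.5 * s_y))
    return tau2, sigma2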
Fig. 4. Top left: box plots for estimates of the posterior mean of τ² at n = 1000, 2000, . . . , 50,000. Top right: relative variance, that is, the empirical variance (over independent runs) of the estimator of the mean of p(τ²|y_{0:n}) using the particle method with MCMC steps, normalized by the true posterior variance computed using Kalman filtering on a grid. Bottom left: average (over independent runs) of the estimated variance of p(τ²|y_{0:n}) using the particle method with MCMC steps, normalized by the true posterior variance. Bottom right: relative variance of {p̂(y_{0:n})}_{n≥0}. All plots are computed using N = 5000 and 100 different independent runs.
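The diagnostics displayed in Figure 4 are simple functions of the per-run posterior summaries and of the Kalman ground truth. The sketch below assumes hypothetical arrays post_means and post_vars of shape (number of runs, number of time points) collecting the per-run estimates, together with the true posterior variances.

import numpy as np

def figure4_diagnostics(post_means, post_vars, true_var):
    """post_means, post_vars: arrays of shape (n_runs, n_times) with, for each
    independent run, the estimated posterior mean and variance of tau^2 at each
    time; true_var: length n_times array of exact posterior variances.
    Returns the relative variance of the mean estimates (top right panel) and
    the scaled average estimated variance (bottom left panel)."""
    rel_var_of_mean = post_means.var(axis=0, ddof=1) / true_var
    scaled_avg_var = post_vars.mean(axis=0) / true_var
    return rel_var_of_mean, scaled_avg_var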
We investigate further the performance of this method by considering the same example for T = 1000, but now with two larger numbers of particles, N = 7.5 × 10^4 and N = 6 × 10^5, over 50 different runs. Additionally, we compare the resulting estimates with the estimates provided by the particle Gibbs sampler of [66] using the same computational cost, that is, N = 50 particles with 3000 and 24,000 iterations, respectively. The results are displayed in Figures 6 and 7. As expected, the performance of the particle method with MCMC moves improves when N increases for a fixed time horizon T. For a fixed computational complexity, the particle Gibbs sampler estimates appear to display less variability. For a higher-dimensional parameter θ and/or very vague priors, this comparison would be more favorable to the particle Gibbs sampler, as illustrated in [3], pages 336–338.
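For reference, one sweep of a basic particle Gibbs sampler alternates a conditional SMC pass, in which the retained trajectory is kept alive in one particle slot, with a Gibbs refresh of θ given the newly sampled trajectory. The sketch below uses a plain bootstrap conditional SMC without the ancestor sampling refinement of [66], and the parameter refresh is abstracted as a user-supplied update_theta (a hypothetical interface, for example conjugate updates similar to those sketched earlier). It is an illustrative outline, not the implementation behind Figures 6 and 7.

import numpy as np

def conditional_smc(y, x_ref, rho, sigma, tau, N, rng):
    """Bootstrap conditional SMC for model (7.1): particle 0 is forced to follow
    the reference trajectory x_ref; a new trajectory is drawn by ancestral
    tracing from the final weights."""
    T = len(y) - 1
    X = np.empty((T + 1, N))
    A = np.empty((T, N), dtype=int)
    X[0] = rng.normal(0.0, tau / np.sqrt(1.0 - rho ** 2), N)
    X[0, 0] = x_ref[0]
    logw = -0.5 * ((y[0] - X[0]) / sigma) ** 2
    for k in range(1, T + 1):
        w = np.exp(logw - logw.max()); w /= w.sum()
        A[k - 1] = rng.choice(N, size=N, p=w)        # multinomial resampling
        A[k - 1, 0] = 0                              # the reference keeps its own ancestor
        X[k] = rho * X[k - 1, A[k - 1]] + tau * rng.normal(size=N)
        X[k, 0] = x_ref[k]                           # ... and its own state
        logw = -0.5 * ((y[k] - X[k]) / sigma) ** 2
    w = np.exp(logw - logw.max()); w /= w.sum()
    b = rng.choice(N, p=w)
    path = np.empty(T + 1)
    for k in range(T, 0, -1):
        path[k] = X[k, b]
        b = A[k - 1, b]
    path[0] = X[0, b]
    return path

def particle_gibbs(y, theta0, x_init, update_theta, tau, N=50, n_iter=3000, rng=None):
    """Alternate conditional SMC and a parameter Gibbs step; `update_theta(path, y, rng)`
    must return a new (rho, sigma) draw from its full conditional (hypothetical interface)."""
    rng = rng or np.random.default_rng()
    rho, sigma = theta0
    x_ref = np.asarray(x_init, dtype=float)
    draws = []
    for _ in range(n_iter):
        x_ref = conditional_smc(y, x_ref, rho, sigma, tau, N, rng)
        rho, sigma = update_theta(x_ref, y, rng)
        draws.append((rho, sigma))
    return np.array(draws)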
Fig. 5. Particle method with MCMC steps, θ = (ρ, σ²): estimated marginal posterior densities for n = 10^3, 2 × 10^3, . . . , 5 × 10^3 over 50 runs (red) versus ground truth (blue).
8. CONCLUSION

Most particle methods proposed originally in the literature to perform inference about static parameters in general state-space models were computationally inefficient as they suffered from the degeneracy problem. Several approaches have been proposed to deal with this problem, either by adding an artificial dynamic on the static parameter [40, 54, 67] or by introducing a fixed-lag approximation [56, 74, 80]. These methods can work very well in practice, but it unfortunately remains difficult or impossible to quantify the bias introduced in most realistic applications. Various asymptotically bias-free methods with good statistical properties and a reasonable computational cost have recently appeared in the literature.

To perform batch ML estimation, the forward filter backward sampler/smoother and generalized two-filter procedures are recommended whenever the O(N²T) computational complexity per iteration of their direct implementations can be lowered to O(NT) using, for example, the methods described in [7, 28, 38, 57]. Otherwise, besides a lowering of memory requirements, not much can be gained from these techniques compared to simply using a standard particle filter with N² particles.
Fig. 6. Estimated marginal posterior densities for θ = (ρ, σ²) with T = 10^3 over 50 runs (black-dashed) versus ground truth (green). Top: particle method with MCMC steps, N = 7.5 × 10^4. Bottom: particle Gibbs with 3000 iterations and N = 50.
Fig. 7. Estimated marginal posterior densities for θ = (ρ, σ²) with T = 10^3 over 50 runs (black-dashed) versus ground truth (green). Top: particle method with MCMC steps, N = 6 × 10^5. Bottom: particle Gibbs with 24,000 iterations and N = 50.
In an on-line ML context, the situation is markedly different. Whereas for the on-line EM algorithm the forward smoothing approach of [24, 81], of complexity O(N²) per time step, will similarly be of limited interest compared to a standard particle filter using N² particles, it is crucial to use this approach when performing on-line gradient ascent, as demonstrated empirically and established theoretically in [26]. In on-line scenarios where one can admit a random computational complexity at each time step, the method presented in [75] is an interesting alternative when it is applicable. Empirically, these on-line ML methods converge rather slowly and will be primarily useful for large data sets.

In a Bayesian framework, batch inference can be conducted using particle MCMC methods [3, 66]. However, these methods are computationally expensive as, for example, an efficient implementation of the PMMH has a computational complexity of order O(T²) per iteration [33]. On-line Bayesian inference remains a challenging open problem as all methods currently available, including particle methods with MCMC moves [13, 36, 84], suffer from the degeneracy problem. These methods should not be ruled out, but should be used cautiously, as they can provide unreliable results even in simple scenarios, as demonstrated in our experiments.

Very recent papers in this dynamic research area have proposed combining individual parameter estimation techniques so as to design more efficient inference algorithms. For example, [21] suggests using the score estimation techniques developed for ML parameter estimation to design better proposal distributions for the PMMH algorithm, whereas [37] demonstrates that particle methods with MCMC moves might be fruitfully used in batch scenarios when plugged into a particle MCMC scheme.

ACKNOWLEDGMENTS

N. Kantas was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/J01365X/1 and the programme grant on Control For Energy and Sustainability (EP/G066477/1). S. S. Singh was supported by the EPSRC (grant number EP/G037590/1). A. Doucet's research was funded in part by EPSRC (EP/K000276/1 and EP/K009850/1). N. Chopin's research was funded in part by the ANR as part of the "Investissements d'Avenir" program (ANR-11-LABEX-0047).

REFERENCES

[1] Alspach, D. and Sorenson, H. (1972). Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Trans. Automat. Control 17 439–448.
[2] Andrieu, C., De Freitas, J. F. G. and Doucet, A. (1999). Sequential MCMC for Bayesian model selection. In Proc. IEEE Workshop Higher Order Statistics 130–134. IEEE, New York.
[3] Andrieu, C., Doucet, A. and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 269–342. MR2758115
[4] Andrieu, C., Doucet, A. and Tadić, V. B. (2005). On-line parameter estimation in general state-space models. In Proc. 44th IEEE Conf. on Decision and Control 332–337. IEEE, New York.
[5] Benveniste, A., Métivier, M. and Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations. Applications of Mathematics (New York) 22. Springer, Berlin. MR1082341
[6] Briers, M., Doucet, A. and Maskell, S. (2010). Smoothing algorithms for state-space models. Ann. Inst. Statist. Math. 62 61–89. MR2577439
[7] Briers, M., Doucet, A. and Singh, S. S. (2005). Sequential auxiliary particle belief propagation. In Proc. Conf. Fusion. Philadelphia, PA.
[8] Cappé, O. (2009). Online sequential Monte Carlo EM algorithm. In Proc. 15th IEEE Workshop on Statistical Signal Processing 37–40. IEEE, New York.
[9] Cappé, O. (2011). Online EM algorithm for hidden Markov models. J. Comput. Graph. Statist. 20 728–749. MR2878999
[10] Cappé, O. and Moulines, E. (2009). On-line expectation–maximization algorithm for latent data models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 71 593–613. MR2749909
[11] Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in Hidden Markov Models. Springer, New York. MR2159833
[12] Carpenter, J., Clifford, P. and Fearnhead, P. (1999). An improved particle filter for non-linear problems. IEE Proceedings—Radar, Sonar and Navigation 146 2–7.
[13] Carvalho, C. M., Johannes, M. S., Lopes, H. F. and Polson, N. G. (2010). Particle learning and smoothing. Statist. Sci. 25 88–106. MR2741816
[14] Cérou, F., Del Moral, P. and Guyader, A. (2011). A nonasymptotic theorem for unnormalized Feynman–Kac particle models. Ann. Inst. Henri Poincaré B Probab. Stat. 47 629–649. MR2841068
[15] Chen, R. and Liu, J. S. (2000). Mixture Kalman filters. J. R. Stat. Soc. Ser. B. Stat. Methodol. 62 493–508. MR1772411
[16] Chopin, N. (2002). A sequential particle filter method for static models. Biometrika 89 539–551. MR1929161
[17] Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. Ann. Statist. 32 2385–2411. MR2153989
[18] Chopin, N., Iacobucci, A., Marin, J. M., Mengersen, K., Robert, C. P., Ryder, R. and Schäfer, C. (2011). On particle learning. In Bayesian Statistics 9 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.) 317–360. Oxford Univ. Press, Oxford. MR3204011
[19] Chopin, N., Jacob, P. E. and Papaspiliopoulos, O. (2013). SMC²: An efficient algorithm for sequential analysis of state space models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 397–426. MR3065473
[20] Coquelin, P. A., Deguest, R. and Munos, R. (2009). Sensitivity analysis in HMMs with application to likelihood maximization. In Proc. 22nd Conf. NIPS. Vancouver.
[21] Dahlin, J., Lindsten, F. and Schön, T. B. (2015). Particle Metropolis–Hastings using gradient and Hessian information. Stat. Comput. 25 81–92. MR3304908
[22] DeJong, D. N., Liesenfeld, R., Moura, G. V., Richard, J.-F. and Dharmarajan, H. (2013). Efficient likelihood evaluation of state-space representations. Rev. Econ. Stud. 80 538–567. MR3054070
[23] Del Moral, P. (2004). Feynman–Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer, New York. MR2044973
[24] Del Moral, P., Doucet, A. and Singh, S. S. (2009). Forward smoothing using sequential Monte Carlo. Technical Report 638, CUED-F-INFENG, Cambridge Univ. Preprint. Available at arXiv:1012.5390.
[25] Del Moral, P., Doucet, A. and Singh, S. S. (2010). A backward particle interpretation of Feynman–Kac formulae. ESAIM Math. Model. Numer. Anal. 44 947–975. MR2731399
[26] Del Moral, P., Doucet, A. and Singh, S. S. (2015). Uniform stability of a particle approximation of the optimal filter derivative. SIAM J. Control Optim. 53 1278–1304. MR3348115
[27] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38. MR0501537
[28] Douc, R., Garivier, A., Moulines, E. and Olsson, J. (2011). Sequential Monte Carlo smoothing for general state space hidden Markov models. Ann. Appl. Probab. 21 2109–2145. MR2895411
[29] Douc, R., Moulines, E. and Ritov, Y. (2009). Forgetting of the initial condition for the filter in general state-space hidden Markov chain: A coupling approach. Electron. J. Probab. 14 27–49. MR2471658
[30] Doucet, A., De Freitas, J. F. G. and Gordon, N. J., eds. (2001). Sequential Monte Carlo Methods in Practice. Springer, New York. MR1847783
[31] Doucet, A., Godsill, S. J. and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Stat. Comput. 10 197–208.
[32] Doucet, A. and Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In The Oxford Handbook of Nonlinear Filtering 656–704. Oxford Univ. Press, Oxford. MR2884612
[33] Doucet, A., Pitt, M. K., Deligiannidis, G. and Kohn, R. (2015). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika 102 295–313.
[34] Elliott, R. J., Aggoun, L. and Moore, J. B. (1995). Hidden Markov Models: Estimation and Control. Applications of Mathematics (New York) 29. Springer, New York. MR1323178
[35] Elliott, R. J., Ford, J. J. and Moore, J. B. (2000). On-line consistent estimation of hidden Markov models. Technical report, Dept. Systems Engineering, Australian National Univ., Canberra.
[36] Fearnhead, P. (2002). Markov chain Monte Carlo, sufficient statistics, and particle filters. J. Comput. Graph. Statist. 11 848–862. MR1951601
[37] Fearnhead, P. and Meligkotsidou, L. (2014). Augmentation schemes for particle MCMC. Preprint. Available at arXiv:1408.6980.
[38] Fearnhead, P., Wyncoll, D. and Tawn, J. (2010). A sequential smoothing algorithm with linear computational cost. Biometrika 97 447–464. MR2650750
[39] Fernández-Villaverde, J. and Rubio-Ramírez, J. F. (2007). Estimating macroeconomic models: A likelihood approach. Rev. Econ. Stud. 74 1059–1087. MR2353620
[40] Flury, T. and Shephard, N. (2009). Learning and filtering via simulation: Smoothly jittered particle filters. Economics Series Working Papers 469.
[41] Flury, T. and Shephard, N. (2011). Bayesian inference based only on simulated likelihood: Particle filter analysis of dynamic economic models. Econometric Theory 27 933–956. MR2843833
[42] Ford, J. J. (1998). Adaptive hidden Markov model estimation and applications. Ph.D. thesis, Dept. Systems Engineering, Australian National Univ., Canberra. Available at http://infoeng.rsise.anu.edu.au/files/jason_ford_thesis.pdf.
[43] Fulop, A. and Li, J. (2013). Efficient learning via simulation: A marginalized resample–move approach. J. Econometrics 176 146–161. MR3084050
[44] Gilks, W. R. and Berzuini, C. (2001). Following a moving target—Monte Carlo inference for dynamic Bayesian models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 63 127–146. MR1811995
[45] Godsill, S. J., Doucet, A. and West, M. (2004). Monte Carlo smoothing for nonlinear time series. J. Amer. Statist. Assoc. 99 156–168. MR2054295
[46] Gordon, N. J., Salmond, D. J. and Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F, Comm., Radar, Signal Proc. 140 107–113.
[47] Higuchi, T. (2001). Self-organizing time series model. In Sequential Monte Carlo Methods in Practice. Stat. Eng. Inf. Sci. 429–444. Springer, New York. MR1847803
[48] Hürzeler, M. and Künsch, H. R. (1998). Monte Carlo approximations for general state-space models. J. Comput. Graph. Statist. 7 175–193. MR1649366
[49] Hürzeler, M. and Künsch, H. R. (2001). Approximating and maximising the likelihood for a general state-space model. In Sequential Monte Carlo Methods in Practice. Stat. Eng. Inf. Sci. 159–175. Springer, New York. MR1847791
[50] Ionides, E. L., Bhadra, A., Atchadé, Y. and King, A. (2011). Iterated filtering. Ann. Statist. 39 1776–1802. MR2850220
[51] Ionides, E. L., Bretó, C. and King, A. A. (2006). Inference for nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA 103 18438–18443.
[52] Kim, S., Shephard, N. and Chib, S. (1998). Stochastic volatility: Likelihood inference and comparison with ARCH models. Rev. Econ. Stud. 65 361–393.
[53] Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Statist. 5 1–25. MR1380850
[54] Kitagawa, G. (1998). A self-organizing state-space model. J. Amer. Statist. Assoc. 93 1203–1215.
[55] Kitagawa, G. (2014). Computational aspects of sequential Monte Carlo filter and smoother. Ann. Inst. Statist. Math. 66 443–471. MR3211870
[56] Kitagawa, G. and Sato, S. (2001). Monte Carlo smoothing and self-organising state-space model. In Sequential Monte Carlo Methods in Practice. Stat. Eng. Inf. Sci. 177–195. Springer, New York. MR1847792
[57] Klaas, M., Briers, M., De Freitas, N., Doucet, A., Maskell, S. and Lang, D. (2006). Fast particle smoothing: If I had a million particles. In Proc. International Conf. Machine Learning 481–488. Pittsburgh, PA.
[58] Künsch, H. R. (2013). Particle filters. Bernoulli 19 1391–1403. MR3102556
[59] Lee, A. (2008). Towards smoother multivariate particle filters. M.Sc. thesis, Computer Science, Univ. British Columbia, Vancouver, BC.
[60] Lee, A. and Whiteley, N. (2014). Forest resampling for distributed sequential Monte Carlo. Preprint. Available at arXiv:1406.6010.
[61] Lee, D. S. and Chia, K. K. (2002). A particle algorithm for sequential Bayesian parameter estimation and model selection. IEEE Trans. Signal Process. 50 326–336.
[62] Le Corff, S. and Fort, G. (2013). Online expectation maximization based algorithms for inference in hidden Markov models. Electron. J. Stat. 7 763–792. MR3040559
[63] Le Corff, S. and Fort, G. (2013). Convergence of a particle-based approximation of the block online expectation maximization algorithm. ACM Trans. Model. Comput. Simul. 23 Art. 2, 22. MR3034212
[64] Le Gland, F. and Mevel, M. (1997). Recursive estimation in hidden Markov models. In Proc. 36th IEEE Conf. Decision and Control 3468–3473. San Diego, CA.
[65] Lin, M., Chen, R. and Liu, J. S. (2013). Lookahead strategies for sequential Monte Carlo. Statist. Sci. 28 69–94. MR3075339
[66] Lindsten, F., Jordan, M. I. and Schön, T. B. (2014). Particle Gibbs with ancestor sampling. J. Mach. Learn. Res. 15 2145–2184. MR3231604
[67] Liu, J. and West, M. (2001). Combined parameter and state estimation in simulation-based filtering. In Sequential Monte Carlo Methods in Practice. Springer, New York. MR1847793
[68] Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer, New York. MR1842342
[69] Liu, J. S. and Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems. J. Amer. Statist. Assoc. 93 1032–1044. MR1649198
[70] Lopes, H. F., Carvalho, C. M., Johannes, M. S. and Polson, N. G. (2011). Particle learning for sequential Bayesian computation. In Bayesian Statistics 9 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.). Oxford Univ. Press, Oxford. MR3204011
[71] Lopes, H. F. and Tsay, R. S. (2011). Particle filters and Bayesian inference in financial econometrics. J. Forecast. 30 168–209. MR2758809
[72] Malik, S. and Pitt, M. K. (2011). Particle filters for continuous likelihood evaluation and maximisation. J. Econometrics 165 190–209. MR2846644
[73] Nemeth, C., Fearnhead, P. and Mihaylova, L. (2013). Particle approximations of the score and observed information matrix for parameter estimation in state space models with linear computational cost. Preprint. Available at arXiv:1306.0735.
[74] Olsson, J., Cappé, O., Douc, R. and Moulines, E. (2008). Sequential Monte Carlo smoothing with application to parameter estimation in nonlinear state space models. Bernoulli 14 155–179. MR2401658
[75] Olsson, J. and Westerborn, J. (2014). Efficient particle-based online smoothing in general hidden Markov models: The PaRIS algorithm. Preprint. Available at arXiv:1412.7550.
[76] Oudjane, N. and Rubenthaler, S. (2005). Stability and uniform particle approximation of nonlinear filters in case of non ergodic signals. Stoch. Anal. Appl. 23 421–448. MR2140972
[77] Paninski, L., Ahmadian, Y., Ferreira, D. G., Koyama, S., Rad, K. R., Vidne, M., Vogelstein, J. and Wu, W. (2010). A new look at state-space models for neural data. J. Comput. Neurosci. 29 107–126. MR2721336
[78] Pitt, M. K. and Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. J. Amer. Statist. Assoc. 94 590–599. MR1702328
[79] Pitt, M. K., Silva, R. d. S., Giordani, P. and Kohn, R. (2012). On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. J. Econometrics 171 134–151. MR2991856
[80] Polson, N. G., Stroud, J. R. and Müller, P. (2008). Practical filtering with sequential parameter learning. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 413–428. MR2424760
[81] Poyiadjis, G., Doucet, A. and Singh, S. S. (2011). Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika 98 65–80. MR2804210
[82] Schön, T. B., Wills, A. and Ninness, B. (2011). System identification of nonlinear state-space models. Automatica J. IFAC 47 39–49. MR2878244
[83] Sherlock, C., Thiery, A. H., Roberts, G. O. and Rosenthal, J. S. (2015). On the efficiency of pseudo-marginal random walk Metropolis algorithms. Ann. Statist. 43 238–275. MR3285606
[84] Storvik, G. (2002). Particle filters in state space models with the presence of unknown static parameters. IEEE Trans. Signal Process. 50 281–289.
[85] Taghavi, E., Lindsten, F., Svensson, L. and Schön, T. B. (2013). Adaptive stopping for fast particle smoothing. In Proc. IEEE ICASSP 6293–6297. Vancouver, BC.
[86] Vercauteren, T., Toledo, A. and Wang, X. (2005). Online Bayesian estimation of hidden Markov models with unknown transition matrix and applications to IEEE 802.11 networks. In Proc. IEEE ICASSP, Vol. IV 13–16. Philadelphia, PA.
[87] West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models, 2nd ed. Springer, New York. MR1482232
[88] Westerborn, J. and Olsson, J. (2014). Efficient particle-based online smoothing in general hidden Markov models. In Proc. IEEE ICASSP 8003–8007. Florence.
[89] Whiteley, N. (2010). Discussion of Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 306–307.
[90] Whiteley, N. (2013). Stability properties of some particle filters. Ann. Appl. Probab. 23 2500–2537. MR3127943
[91] Whiteley, N., Andrieu, C. and Doucet, A. (2010). Efficient Bayesian inference for switching state-space models using discrete particle Markov chain Monte Carlo methods. Preprint. Available at arXiv:1011.2437.
[92] Whiteley, N. and Lee, A. (2014). Twisted particle filters. Ann. Statist. 42 115–141. MR3178458
[93] Wilkinson, D. J. (2012). Stochastic Modelling for Systems Biology, 2nd ed. CRC Press, Boca Raton, FL.
[94] Yildirim, S., Singh, S. S. and Doucet, A. (2013). An online expectation–maximization algorithm for changepoint models. J. Comput. Graph. Statist. 22 906–926. MR3173749
[95] Yuan, Y.-X. (2008). Step-sizes for the gradient method. In Third International Congress of Chinese Mathematicians, Part 1, 2. AMS/IP Stud. Adv. Math. 42 785–796. Amer. Math. Soc., Providence, RI. MR2409671