
Biometrics 65, 1011–1020
December 2009
DOI: 10.1111/j.1541-0420.2009.01197.x
A Latent Model to Detect Multiple Clusters of Varying Sizes
Minge Xie, Qiankun Sun, and Joseph Naus
Department of Statistics, Rutgers, the State University of New Jersey, Piscataway, New Jersey 08854, U.S.A.

email: mxie@stat.rutgers.edu
Summary. This article develops a latent model and likelihood-based inference to detect temporal clustering of events. The model mimics typical processes generating the observed data. We apply model selection techniques to determine the number of clusters, and develop likelihood inference and a Monte Carlo expectation-maximization algorithm to estimate model parameters, detect clusters, and identify cluster locations. Our method differs from the classical scan statistic in that we can simultaneously detect multiple clusters of varying sizes. We illustrate the methodology with two real data applications and evaluate its efficiency through simulation studies. For the typical data-generating process, our methodology is more efficient than a competing procedure that relies on least squares.
Key words: AIC and BIC criteria; Clustering events; EM algorithm; Latent model; Likelihood inference; MCMC algorithm; Scan statistics; Temporal samples.
1. Introduction
This article develops a general latent modeling framework and likelihood-based inference tools for detecting clustering of events within temporal samples. In the past 50 years, researchers have been investigating different types of clustering of events in time and space. Some applications look for an unusually large number of events in small clusters, or for patterns suggesting clumping over the entire study period or area. Other applications are concerned with unusually large clusters within a small region of time, space, or location in a sequence. In some cases, focus is on a specific region, for example a region with heavy pollution. In other cases, researchers scan the entire study area and seek to locate regions with unusually high likelihood of clustering. Practical examples cover a wide range of fields over various disciplines. In epidemiological studies where the etiology of a disease has not been well established, it is often necessary to analyze the data to obtain evidence of temporal clusters (Molinari, Bonaldi, and Daures, 2001). In surveillance for biological terrorism, it is essential to provide early warnings of terrorist attacks; syndromic surveillance for biological terrorism requires statistical methods for detecting relatively abrupt increases in incidence (Wallenstein and Naus, 2004). In environmental studies, people living near a factory generating pollution may have an increased chance of certain diseases, and it is of interest to detect and monitor such clusters (Diggle, Rowlingson, and Su, 2005). In biological studies of DNA sequencing, the detection of unusual clusters of specific patterns can be used to allocate lab resources and help find biologically important origins of diseases (Leung et al., 2005).
A traditional statistical method to detect a cluster of events is via scan statistics; see, e.g., Glaz, Naus, and Wallenstein (2001) and Fu and Lou (2003). The most commonly used scan statistic is the maximum number of cases in a fixed-size moving window that scans through the study area. The test based on this scan statistic has been shown to be a generalized likelihood ratio test for a uniform null against a pulse alternative. A related scan statistic is the diameter of the smallest window that contains a fixed number of cases. Other scan statistics and related likelihood-based tests for localized temporal or spatial clustering have been developed that use a range of fixed window sizes or a range of fixed numbers of cases; see Kulldorff and Nagarwalla (1995) and Naus and Wallenstein (2004), among others. One can also test for several unusually large clusters using asymptotic and approximate results; see Dembo and Karlin (1992) and Su, Wallenstein, and Bishop (2001). Gangnon and Clayton (2004) develop a weighted average likelihood ratio scan statistic and a penalized scan statistic, which can be viewed as generalized scan statistic approaches.
Scan statistics procedures have been very successful in detecting a single significant cluster, and they have also had some success in detecting multiple clusters of fixed sizes. But there are technical difficulties in detecting multiple clusters of varying sizes. In recent years, there have been several attempts to overcome this difficulty. The best approach so far is by Molinari et al. (2001), who use a stepwise regression (SR) model together with model selection procedures to locate and determine the number of clustering regions in temporal data. For a given number of clusters, the locations of the clusters are determined by a least squares method in which the response variable is the set of interarrival times (gaps) between events. To make inference, they rely on bootstrap methodology and the least squares formulation. Because the responses used in their model are usually nonnormally distributed, the least squares method may not be efficient. Also, the bootstrap simulation can be computationally expensive. To overcome the difficulty of using bootstrap simulation, Demattei and Molinari (2006) add a new testing method to Molinari et al.'s (2001) SR method. The new testing method utilizes Bernstein's inequality, resulting in a conservative test. This SR method is extended by Demattei, Molinari, and Daures (2007) to detect arbitrarily shaped multiple clusters in spatial data.

© 2009, The International Biometric Society

Figure 1. A latent model of k clusters in the time window (0, T): waiting times b_1, ..., b_{k+1} alternate with cluster lengths c_1, ..., c_k between time 0 and the endpoint T.
Diggle et al. (2005) develop a two-level hierarchical model based on a latent stochastic process to detect localized regions of unusually high likelihood of an event (relative to background) in temporal-spatial data. They indicate how a third level of hierarchy, a model for the prior distribution of the parameters of the latent stochastic process, can be incorporated for a Bayesian analysis. The model in Diggle et al.'s (2005) approach is useful in many applications. It is very similar to the so-called disease mapping approach (e.g., Clayton and Kaldor, 1987; Besag, York, and Mollie, 1991; Waller et al., 1997; Gangnon and Clayton, 2000). Note, however, that the goal of Diggle et al. (2005) is to describe intensity functions instead of directly detecting and making inference on clusters.
Neill, Moore, and Cooper (2006) propose a Bayesian scan statistic for detecting spatial clusters and compare their method to a frequentist spatial scan statistic method (Kulldorff and Nagarwalla, 1995). By using conjugate priors, they can obtain closed-form solutions for likelihood functions, which leads to a much faster algorithm. As in Kulldorff and Nagarwalla (1995) and many other papers, the potential spatial clusters are limited to a finite set of specific choices. Other Bayesian methods for the detection of clusters include Lawson (1995), Gangnon and Clayton (2000, 2003), Knorr-Held and Raßer (2000), and Denison and Holmes (2001). Although the Bayesian methods may allow us to incorporate prior information from experts in the domain, the choice of priors is always a challenging task (Neill et al., 2006).
Our method deals with the same problems as Molinari et al. (2001), but the approach that we use is very different. By mimicking the processes and mechanisms that generate clusters, we develop a latent model that allows us to use standard likelihood inference to detect multiple clusters in a given time window. Unlike the scan statistics procedures, we focus on detecting multiple clusters of varying sizes. Our method can be used to simultaneously detect multiple clusters, as well as a significant single cluster. In our approach the likelihood function can be fully specified, and we can answer a variety of inference questions related to our goal. Under the assumed model, the likelihood-based approach is more efficient than the SR approach of Molinari et al. (2001), as illustrated in our simulation studies. Furthermore, our latent model approach is very flexible and can incorporate several extensions (Sun, 2008).
The rest of the article is arranged as follows. Section 2 proposes a general latent model for multiple clusters that generates the observed data (observed time points of an event). Section 3 develops, for a given number of clusters, approaches based on likelihood inference to estimate and test model parameters and detect clusters. Section 4 employs model selection approaches, both the Akaike (AIC) and Bayesian (BIC) information criteria, to determine the number of clusters. Section 5 contains two real data analysis examples, including an analysis of the hospital hemoptysis admission data studied by Molinari et al. (2001) and a set of brucellosis event data collected by the CDC (Centers for Disease Control and Prevention). Section 6 provides a comprehensive simulation study to further illustrate and evaluate the proposed methodology. Section 7 provides some additional comments and discussion.
2. A Latent Multiple Cluster Model
Suppose in a given time window, say (0, T), there are k clusters, where k is a fixed integer. Here, clusters are defined as the time intervals within which an event of interest is much more (or less) likely to happen (per unit of time) than outside these time intervals. A temporal (latent) clustering model is specified as in Figure 1. Starting from time 0, we wait for b_1 units of time until the first cluster appears, and the first cluster lasts c_1 units of time. After the first cluster, we wait for b_2 units of time until the second cluster appears, and the second cluster lasts c_2 units of time, and so on until the kth cluster appears, which lasts c_k units of time. After the kth cluster, b_{k+1} is the waiting period until the next cluster, which occurs after the endpoint T. From now on, we will use this clustering model to illustrate the development of our methodology. Other potential latent clustering models are discussed in Section 7.
To complete the model specification, we further assume that the waiting time periods b_1, b_2, ..., b_{k+1} and the cluster interval lengths c_1, c_2, ..., c_k are random variables. We assume that b_1, b_2, ..., b_{k+1} are independent samples from a distribution with a density function π_b(t) = π_b(t; λ_b), and c_1, c_2, ..., c_k are independent samples from a distribution with a density function π_c(t) = π_c(t; λ_c). Here, λ_b and λ_c are unknown parameters, and π_b and π_c are known density functions that may or may not be from the same family of distributions. One simple example is that both π_b(t) and π_c(t) are exponential densities but with means equal to 1/λ_b and 1/λ_c, respectively.
Write b = (b_1, b_2, ..., b_{k+1})′ and c = (c_1, c_2, ..., c_k)′. It can be seen from Figure 1 that I_j = [ Σ_{l=1}^{j−1} (b_l + c_l) + b_j, Σ_{l=1}^{j} (b_l + c_l) ] is the interval of the jth cluster, j = 2, ..., k, and I_1 = (b_1, b_1 + c_1). For convenience, we introduce a random number η such that η = k is the event that exactly k clusters occur in the time window (0, T). Clearly, the event η = k is equivalent to the event

  Σ_{j=1}^{k} (b_j + c_j) + b_{k+1} ≥ T  and  Σ_{j=1}^{k} (b_j + c_j) ≤ T.
The latent variables b and c are not observed. What we can observe in this model setting are only the time points y_1, y_2, ..., y_n at which an event of interest occurs. We assume that the observations y_1, y_2, ..., y_n are independent samples from the following step uniform density function:

  f_Φ(y | b, c, η = k) =
      θ_1 / { T + Σ_{j=1}^{k} (θ_j − 1) c_j },   if y ∈ I_1;
      ...
      θ_k / { T + Σ_{j=1}^{k} (θ_j − 1) c_j },   if y ∈ I_k;
      1 / { T + Σ_{j=1}^{k} (θ_j − 1) c_j },     if y ∉ ∪_{j=1}^{k} I_j;     (1)
where Φ = (θ′, λ′)′ is the collection of all parameters, including the unknown parameters θ = (θ_1, ..., θ_k)′ and the parameters λ = (λ_b, λ_c)′ that are associated with the random variables b_i's and c_i's. When k = 1, the step uniform density function (1) becomes the single step uniform density function used for the single cluster case; see, e.g., Chapter 14 of Glaz et al. (2001). The parameters θ_j > 0 for each j = 1, 2, ..., k; they may or may not be the same across the k clusters. Under the density assumption (1), the event is θ_j times more likely to happen (per unit of time) inside the jth cluster than outside the clusters. The case with θ_j > 1 corresponds to a denser cluster of more events; the case with θ_j < 1 corresponds to a sparser cluster of fewer events; and the case with θ_j = 1 corresponds to no cluster. To see whether there are any significant clusters in the data, we can test the hypothesis H_0: θ_1 = θ_2 = ··· = θ_k = 1 versus H_1: at least one θ_j ≠ 1.
The proposed distribution assumption can alternatively be expressed in terms of Poisson models, similar to those used in Gangnon and Clayton (2004) and others. We use the current formulation of a step uniform function to highlight the interpretation of the parameters θ_j's. The proposed model is also closely related to a Bayesian model. In particular, if we further assume that the θ_j's are random and place priors on the parameters, the proposed model corresponds to a Bayesian hierarchical model. We use the frequentist formulation because it allows us to utilize fully developed likelihood inference and avoid choosing priors. Although we illustrate our latent model for temporal data, the model developed also covers other types of data, for example, with patterns or events in a sequence, such as the DNA data studied by Leung et al. (2005).
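As a concrete illustration, the generating mechanism of Figure 1 together with the step uniform density (1) can be simulated directly. The sketch below assumes exponential waiting times and cluster lengths (the simple example mentioned above); the function names are ours, and the event sampler uses plain rejection against a uniform proposal rather than any routine from the paper.

```python
import random

def simulate_latent_clusters(T, k, lam_b, lam_c, rng):
    """Draw b_1, ..., b_{k+1} ~ Exp(lam_b) and c_1, ..., c_k ~ Exp(lam_c)
    repeatedly until the event {eta = k} holds: the kth cluster ends by T
    and the (k+1)th cluster would start after T. Return I_1, ..., I_k."""
    while True:
        b = [rng.expovariate(lam_b) for _ in range(k + 1)]
        c = [rng.expovariate(lam_c) for _ in range(k)]
        end_k = sum(b[:k]) + sum(c)        # end of the kth cluster
        if end_k <= T <= end_k + b[k]:     # b[k] plays the role of b_{k+1}
            intervals, t = [], 0.0
            for bj, cj in zip(b, c):
                t += bj
                intervals.append((t, t + cj))
                t += cj
            return intervals

def sample_events(n, T, intervals, thetas, rng):
    """Sample n event times from the step uniform density (1) by rejection:
    a uniform proposal on (0, T) is accepted with probability proportional
    to theta_j inside I_j and to 1 outside all cluster intervals."""
    cap = max(max(thetas), 1.0)
    events = []
    while len(events) < n:
        y = rng.uniform(0.0, T)
        w = 1.0
        for (lo, hi), th in zip(intervals, thetas):
            if lo <= y <= hi:
                w = th
                break
        if rng.random() < w / cap:
            events.append(y)
    return sorted(events)
```

Only the event times are retained as data; b and c are discarded as latent quantities.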
We also consider a slight generalization of model (1) that includes a known background function W(t). As mentioned by Molinari et al. (2001) and Wallenstein and Naus (2004), the background value, such as seasonal patterns or population sizes, may not be the same across the time window (0, T). The known background function W(t), usually assessed from separate sources, can be easily incorporated into our model. In this case, we replace equation (1) by

  f_Φ(y | b, c, η = k) =
      θ_1 W(y) / { T̃ + Σ_{j=1}^{k} (θ_j − 1) c̃_j },   if y ∈ I_1;
      ...
      θ_k W(y) / { T̃ + Σ_{j=1}^{k} (θ_j − 1) c̃_j },   if y ∈ I_k;
      W(y) / { T̃ + Σ_{j=1}^{k} (θ_j − 1) c̃_j },       if y ∉ ∪_{j=1}^{k} I_j,     (2)

where T̃ = ∫_0^T W(t) dt and c̃_j = ∫_{I_j} W(t) dt, for j = 1, 2, ..., k. Fitting model (2) is exactly the same as fitting model (1), except that T and c_j need to be replaced by T̃ and c̃_j. To simplify our presentation, we focus our developments in the next few sections on model (1).
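Since W(t) enters model (2) only through the integrals T̃ and c̃_j, the background adjustment reduces to one numerical integration per interval. A minimal sketch, with helper names of our own choosing (any quadrature rule would do in place of the midpoint rule):

```python
def integrate(f, lo, hi, m=1000):
    """Midpoint-rule approximation of the integral of f over (lo, hi)."""
    h = (hi - lo) / m
    return h * sum(f(lo + (i + 0.5) * h) for i in range(m))

def adjusted_quantities(W, T, intervals):
    """T-tilde = integral of W over (0, T) and c-tilde_j = integral of W
    over I_j: the quantities that replace T and c_j when fitting model (2)."""
    T_tilde = integrate(W, 0.0, T)
    c_tilde = [integrate(W, lo, hi) for lo, hi in intervals]
    return T_tilde, c_tilde
```

With a constant background W ≡ 1, T̃ = T and c̃_j = c_j, recovering model (1).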
3. Model Inference for a Given Number of Clusters
The number of clusters k needs to be bounded away from the number of observations n, so that there are no overfitting problems such as the case of a cluster consisting of a single event. We apply model selection techniques to determine k; for a fixed k, we develop a likelihood approach to estimate model parameters and cluster locations and to make statistical inference. We assume that k is known in this section, and discuss how to determine k in Section 4.
3.1 Likelihood Function of Observed Data When η = k
In the latent model illustrated in Figure 1, the probability of k clusters existing in the time window (0, T) can be computed by

  P_Φ(η = k) = P_Φ{ Σ_{j=1}^{k} (b_j + c_j) + b_{k+1} ≥ T and Σ_{j=1}^{k} (b_j + c_j) ≤ T }
             = ∫_0^T ∫_{T−s}^{∞} π_b(t) π_bc^[k](s) dt ds,

where π_bc^[k](s) is the density function of Σ_{j=1}^{k} (b_j + c_j). Conditional on η = k, the joint conditional density function of (b, c) is

  f_Φ(b, c | η = k) = { Π_{j=1}^{k} π_b(b_j) π_c(c_j) } π_b(b_{k+1}) / P_Φ(η = k).
From model (1), the conditional joint density function of the sample observations y = (y_1, y_2, ..., y_n), conditional on b, c, and η = k, is

  f_Φ(y | b, c, η = k) = Π_{i=1}^{n} f(y_i | b, c, η = k)
    = exp{ Σ_{j=1}^{k} (log θ_j) Z_j − n log[ T + Σ_{j=1}^{k} (θ_j − 1) c_j ] },

where Z_j = Z_j(y, b, c) = Σ_{i=1}^{n} 1_{(y_i ∈ I_j)} is the number of events that appear within the jth cluster interval. Here, 1_{(·)} is the indicator function. Thus, the joint density function of y and η = k is

  f_Φ(y, η = k) = ∫∫ f_Φ(y, b, c | η = k) P_Φ(η = k) db dc
               = ∫∫ f_Φ(y | b, c, η = k) f_Φ(b, c | η = k) P_Φ(η = k) db dc,

and the log-likelihood function of observing y and η = k is

  ℓ_k(Φ | y) = log f_Φ(y, η = k)
            = log{ ∫∫ f_Φ(y | b, c, η = k) f_Φ(b, c | η = k) P_Φ(η = k) db dc }.     (3)

Because equation (3) involves multiple integrations, it is complicated to directly compute the log-likelihood function ℓ_k(Φ | y) and its first and second derivatives. As a result, it is hard to obtain the maximum likelihood estimates by directly maximizing the likelihood function. We instead develop a Monte Carlo expectation-maximization (EM) algorithm (see, e.g., Tanner, 1993, Section 4.5) to estimate the model parameters.
3.2 A Monte Carlo EM Algorithm for Model Estimation
Note that the joint density function of (y, b, c, η = k) is explicit:

  f_Φ(y, b, c, η = k) = f_Φ(y | b, c, η = k) f_Φ(b, c | η = k) P_Φ(η = k)
    = exp{ Σ_{j=1}^{k} (log θ_j) Z_j − n log[ T + Σ_{j=1}^{k} (θ_j − 1) c_j ] } × { Π_{j=1}^{k} π_b(b_j) π_c(c_j) } π_b(b_{k+1}).
We treat (y, b, c, η = k) as the complete responses and (y, η = k) as the observed responses, and develop an EM algorithm as follows.

Step 0. Select a set of starting parameter values Φ^(0) = (θ^(0)′, λ^(0)′)′.

Step 1. (E-step). For s = 0, 1, 2, ..., calculate the conditional expectation of the complete log-likelihood function, given the observed data and Φ = Φ^(s): Q(Φ | Φ^(s)) = Q_1(θ | Φ^(s)) + Q_2(λ | Φ^(s)), where

  Q_1(θ | Φ^(s)) = Σ_{j=1}^{k} E(Z_j | y, η = k, Φ^(s)) log θ_j − n E{ log[ T + Σ_{j=1}^{k} (θ_j − 1) c_j ] | y, η = k, Φ^(s) }

and

  Q_2(λ | Φ^(s)) = Σ_{j=1}^{k+1} E{ log π_b(b_j) | y, η = k, Φ^(s) } + Σ_{j=1}^{k} E{ log π_c(c_j) | y, η = k, Φ^(s) }.

Step 2. (M-step). For s = 0, 1, 2, ..., update the parameter estimates Φ^(s+1) = (θ^(s+1)′, λ^(s+1)′)′ by maximizing the Q_1(θ | Φ^(s)) and Q_2(λ | Φ^(s)) functions: θ^(s+1) = argmax_θ Q_1(θ | Φ^(s)) and λ^(s+1) = argmax_λ Q_2(λ | Φ^(s)).

Step 3. Repeat Steps 1 and 2 until ‖Φ^(s+1) − Φ^(s)‖ is sufficiently small.
In the case with π_b and π_c being density functions of the exponential distributions Exp(λ_b) and Exp(λ_c), the updating formula for λ^(s+1) in Step 2 is simply λ_b^(s+1) = (k + 1) / Σ_{j=1}^{k+1} E(b_j | y, η = k, Φ^(s)) and λ_c^(s+1) = k / Σ_{j=1}^{k} E(c_j | y, η = k, Φ^(s)).
The conditional expectations in Step 1 do not usually have explicit forms, but they can be numerically computed using a Gibbs sampling approach. Suppose b* = (b*_1, b*_2, ..., b*_{k+1})′ and c* = (c*_1, c*_2, ..., c*_k)′ are a set of Gibbs samples from f(b, c | y, η = k, Φ^(s)), and we have M sets of such Gibbs samples; see Appendix I in the Supplementary Materials for a Gibbs sampling algorithm to generate b* and c*. When M is large, the four conditional expectations in Step 1 of the EM algorithm can be evaluated by Σ* Z*_j / M, Σ* log{ T + Σ_{j=1}^{k} (θ_j − 1) c*_j } / M, Σ* log π_b(b*_j) / M, and Σ* log π_c(c*_j) / M, respectively, where Σ* denotes the summation over the M sets of Gibbs samples b* and c*, and Z*_j is the count Z_j computed with the b_j and c_j values replaced by their corresponding Gibbs sample values b*_j and c*_j in each Gibbs sample set.
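The Monte Carlo E-step averages and, for the exponential case, the closed-form M-step can be sketched as follows. The Gibbs draws (b*, c*) come from the sampler of the paper's Appendix I, which is not reproduced here; they are simply passed in as a list of M pairs, and the function names are ours.

```python
import math

def intervals_from_bc(b, c):
    """Cluster intervals I_1, ..., I_k implied by waiting times b and lengths c."""
    out, t = [], 0.0
    for bj, cj in zip(b, c):
        t += bj
        out.append((t, t + cj))
        t += cj
    return out

def mc_e_step(y, T, thetas, gibbs_draws):
    """Monte Carlo averages over M Gibbs draws (b*, c*) of E(Z_j | ...) and
    E(log[T + sum_j (theta_j - 1) c_j] | ...), the E-step quantities of Q_1."""
    M, k = len(gibbs_draws), len(thetas)
    EZ, Elog = [0.0] * k, 0.0
    for b, c in gibbs_draws:
        ivs = intervals_from_bc(b, c)
        for j, (lo, hi) in enumerate(ivs):
            EZ[j] += sum(1 for yi in y if lo <= yi <= hi) / M
        Elog += math.log(T + sum((thetas[j] - 1.0) * c[j] for j in range(k))) / M
    return EZ, Elog

def m_step_exponential(gibbs_draws):
    """Closed-form exponential M-step: lambda_b = (k+1)/sum_j E(b_j) and
    lambda_c = k/sum_j E(c_j), with expectations replaced by Gibbs averages."""
    M = len(gibbs_draws)
    k = len(gibbs_draws[0][1])
    Eb = sum(sum(b) for b, _ in gibbs_draws) / M
    Ec = sum(sum(c) for _, c in gibbs_draws) / M
    return (k + 1) / Eb, k / Ec
```

In the exponential case the λ update is just a ratio of counts to averaged totals, so the M-step costs essentially nothing compared with the Gibbs sampling in the E-step.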
The above Monte Carlo EM algorithm does not provide the variance–covariance calculation for the parameter estimators. To obtain an estimate of the variance–covariance matrix, we calculate the observed information matrix using the missing information principle and Louis's method (see, e.g., Tanner, 1993, Sections 4.4.2–4.4.3). In particular, the observed information matrix is

  H_n = −(∂²/∂Φ∂Φ′) ℓ_k(Φ | y)
      = E{ −(∂²/∂Φ∂Φ′) ℓ_k(Φ | b, c, y) | y, η = k } − Var{ (∂/∂Φ) ℓ_k(Φ | b, c, y) | y, η = k },

where ℓ_k(Φ | b, c, y) = log f_Φ(y, b, c, η = k). It can be numerically estimated by

  Ĥ_n = −(1/M) Σ* (∂²/∂Φ∂Φ′) ℓ_k(Φ | b*, c*, y)
        − [ (1/M) Σ* { (∂/∂Φ) ℓ_k(Φ | b*, c*, y) }^⊗2 − { (1/M) Σ* (∂/∂Φ) ℓ_k(Φ | b*, c*, y) }^⊗2 ],

where a^⊗2 = aa′ for a column vector a, and the summations Σ* are over the M sets of Gibbs samples b* and c* in the last iteration of the EM algorithm (see, e.g., Tanner, 1993, Section 4.4.4).
3.3 Estimation of Cluster Locations
We would often like to know where the clusters are located. Based on the model specification, the lower and upper bounds of the jth cluster interval I_j are, respectively, L_j = Σ_{l=1}^{j−1} (b_l + c_l) + b_j and U_j = Σ_{l=1}^{j} (b_l + c_l). They are random quantities that cannot be estimated by maximizing the observed likelihood function (3). Fortunately, based on the M sets of Gibbs samples in the last iteration of the EM algorithm, we are able to obtain M copies of the pair (L*_j, U*_j) = ( Σ_{l=1}^{j−1} (b*_l + c*_l) + b*_j, Σ_{l=1}^{j} (b*_l + c*_l) ). These (L*_j, U*_j) can be treated as samples from the conditional joint distribution (posterior distribution) of (L_j, U_j), given y and η = k. Thus, following Bayesian point estimation techniques, we use the mode of the empirical conditional joint distribution of (L_j, U_j), obtained from the M pairs of (L*_j, U*_j), to estimate the random bounds L_j and U_j. See Fraiman and Meloche (1999) for a kernel-based estimation approach to estimate the mode of a bivariate distribution from its M copies of samples.
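A simple stand-in for the kernel-based mode estimate: evaluate a Gaussian kernel density estimate at each of the M Gibbs pairs (L*_j, U*_j) and return the pair with the highest estimated density. This is only a sketch of the idea (Fraiman and Meloche, 1999, develop the actual estimator), and the fixed bandwidth here is an arbitrary choice of ours.

```python
import math

def bivariate_mode(pairs, bandwidth=1.0):
    """Estimate the mode of a bivariate distribution from sample pairs by
    maximizing a Gaussian kernel density estimate over the samples."""
    def kde_at(p):
        return sum(
            math.exp(-((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
                     / (2.0 * bandwidth ** 2))
            for q in pairs
        )
    return max(pairs, key=kde_at)
```

Evaluating the estimate only at the sample points avoids a grid search and still lands near the densest region of the (L*_j, U*_j) cloud.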
3.4 Likelihood Inference and Simulation-Based Tests Related to the θ_j's
For a fixed k, we can obtain k clusters from the aforementioned estimation algorithm. There is no guarantee that the k clusters are statistically significant. To test whether an identified cluster is significant or not, we use likelihood-based inference.
Let us first consider testing a single (say the jth) cluster to see whether it is significant or not, i.e., H_0: θ_j = 1 versus H_1: θ_j ≠ 1. Let θ̂_j be the estimator of the parameter θ_j. From the observed information matrix Ĥ_n, we can get an estimator of the standard error of log θ̂_j, say se(log θ̂_j). Thus, a Wald-type t statistic for the equivalent hypothesis H_0: log θ_j = 0 is t = log θ̂_j / se(log θ̂_j). When n is large, the statistic t is asymptotically normally distributed based on likelihood inference, and we can use a two-sided z test to test whether θ_j = 1 or not.
Another testing problem of interest is to see whether there are any significant clusters among the k clusters, i.e., H_0: θ_1 = θ_2 = ··· = θ_k = 1 versus H_1: at least one θ_j ≠ 1. The likelihood ratio test statistic is

  R = log{ max_{H_0 ∪ H_1} f_Φ(y, η = k) / max_{H_0} f_Φ(y, η = k) }
    = ℓ_k(Φ | y) |_{Φ = Φ̂} + n log(T) − max_λ log P_λ(η = k)
    = log{ ∫∫ f_Φ̂(y | b, c, η = k) f_Φ̂(b, c | η = k) db dc } + log P_Φ̂(η = k) + n log(T) − max_λ log P_λ(η = k),

where Φ̂ = (θ̂′, λ̂′)′ are the estimates of the parameters obtained from the aforementioned EM algorithm under H_1.
Suppose we have M sets of random samples b* = (b*_1, ..., b*_{k+1}) and c* = (c*_1, ..., c*_k) from f_Φ(b, c | η = k) with Φ = Φ̂; see Appendix I in the Supplementary Materials for a Gibbs sampling algorithm to simulate such b* and c*. By Monte Carlo approximation, the test statistic R can be approximated by

  R̂ = log{ (1/M) Σ* f(y | b*, c*, η = k) } + log P_Φ̂(η = k) + n log(T) − max_λ log P_λ(η = k),

where Σ* is the summation over the M sets of b* and c* samples. Based on likelihood inference, 2R is asymptotically χ² distributed with k degrees of freedom. By comparing 2R̂ with the χ²_k distribution, we perform a formal test for H_0: θ_1 = θ_2 = ··· = θ_k = 1 versus H_1: at least one θ_j ≠ 1.
The above test relies on large sample theory and requires a large n. Alternatively, we consider a simulation-based Monte Carlo testing approach that is computationally intensive. We simulate L sets of data of samples of size n from the model under the null hypothesis and compute for each set the test statistic 2R̂, denoted by 2R̂*. When L is large, the empirical distribution of these 2R̂* values provides a good approximation to the theoretical distribution of the test statistic 2R̂ under the null hypothesis. This is utilized to perform the simulation-based Monte Carlo test.
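The simulation-based test then reduces to an empirical p-value comparing the observed 2R̂ with the L simulated null statistics. A minimal sketch (the add-one correction is our choice, a standard way to keep Monte Carlo p-values strictly positive):

```python
def monte_carlo_p_value(stat_obs, null_stats):
    """Empirical p-value of an observed statistic against L simulated
    null-hypothesis statistics, with an add-one correction."""
    L = len(null_stats)
    exceed = sum(1 for s in null_stats if s >= stat_obs)
    return (exceed + 1) / (L + 1)
```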
4. Determination of the Unknown Number of Clusters
In this section, we employ model selection approaches, both the AIC and BIC information criteria, to determine the number of clusters k from the observed data. In our context, their expressions are

  AIC(k) = −2 log{ ∫∫ f_Φ̂(y | b, c, η = k) f_Φ̂(b, c | η = k) db dc } − 2 log P_Φ̂(η = k) + 2k

and

  BIC(k) = −2 log{ ∫∫ f_Φ̂(y | b, c, η = k) f_Φ̂(b, c | η = k) db dc } − 2 log P_Φ̂(η = k) + k log(n),

respectively. Often the number of observed time points n > e² ≈ 7.389; thus the BIC criterion places a heavier penalty on a large k than the AIC criterion.
To compute the criteria, the unknown parameters are replaced by their estimates Φ̂ = Φ̂(k), which are obtained from the Monte Carlo EM algorithm proposed in Section 3.2 (with the number of clusters set to k). Furthermore, the formulas of the criteria involve integrations that do not have explicit forms, so we numerically evaluate their values. As in Section 3.4, we know how to simulate b* = (b*_1, ..., b*_{k+1}) and c* = (c*_1, ..., c*_k) from f_Φ(b, c | η = k) when Φ = Φ̂(k). By Monte Carlo approximation, the AIC(k) criterion can be computed by

  ÂIC(k) = −2 log{ (1/M) Σ* f(y | b*, c*, η = k) } − 2 log P_Φ̂(η = k) + 2k,

and the BIC(k) criterion can be computed by

  B̂IC(k) = −2 log{ (1/M) Σ* f(y | b*, c*, η = k) } − 2 log P_Φ̂(η = k) + k log(n),

where Σ* is the summation over the M sets of repeatedly simulated (b*, c*)'s from f_Φ̂(k)(b, c | η = k). The k to be chosen is the one with the smallest corresponding ÂIC(k) or B̂IC(k) value.
Based on the developments in Sections 3 and 4, a practical approach for detecting clusters emerges. Denote by K a preselected set of k's, which we would like to be small for computing purposes but large enough to cover all potential choices of the correct number of clusters. For each k ∈ K, apply the Monte Carlo EM algorithm in Section 3.2 to get the parameter estimates, and use either the AIC or the BIC rule to determine the number of clusters k. For the chosen k, use the results in Sections 3.3–3.4 to detect and determine the locations of the clusters.
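The practical recipe above can be sketched as a selection loop. Here fit_fn stands in for the whole Monte Carlo EM fit of Section 3.2; it is a hypothetical interface of our own design that returns the Monte Carlo log-integral term, log P(η = k), and the sample size n needed by the two criteria.

```python
import math

def select_k(candidate_ks, fit_fn):
    """Fit the model for each k in the preselected set K and pick the k
    minimizing AIC and the k minimizing BIC.

    fit_fn(k) -> (log_int, log_pk, n): log of the Monte Carlo estimate of
    the integral term, log P(eta = k) at the fitted parameters, and n."""
    scores = {}
    for k in candidate_ks:
        log_int, log_pk, n = fit_fn(k)
        aic = -2.0 * log_int - 2.0 * log_pk + 2.0 * k
        bic = -2.0 * log_int - 2.0 * log_pk + k * math.log(n)
        scores[k] = (aic, bic)
    k_aic = min(scores, key=lambda k: scores[k][0])
    k_bic = min(scores, key=lambda k: scores[k][1])
    return k_aic, k_bic, scores
```

Both criteria share the fitted log-likelihood term, so one EM fit per candidate k serves both rules.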
5. Real Data Examples
We analyze two real data sets in this section. The first is the hospital hemoptysis admission data studied by Molinari et al. (2001). The second is the brucellosis data collected by the CDC during 1997–2004. Both data sets are included in Appendix II in the Supplementary Materials.

5.1 Hospital Hemoptysis Admission Data Set
The first data set consists of 62 spontaneous hemoptysis admissions (pulmonary disease) at the hospital of Nice (a southern French city) from January 1 to December 31, 1995. Because Nice is a tourist city located on the Mediterranean coast, many tourists (15% of the local population) each summer increase the population at risk. Molinari et al. (2001) suggest using the following function to adjust for the population at risk:

  R(t) = 1 + 72t / (10,000 × 365) + (55,000 / 355,000) 1_{[182, 244]}(t),   t ∈ [1, 365].
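A direct coding of R(t); we assume the denominator of the linear term is 10,000 × 365 and that the tourist term applies on days 182 through 244, as in the display above.

```python
def R(t):
    """Population-at-risk adjustment for the Nice hemoptysis data
    (Molinari et al., 2001): slow linear growth plus a summer tourist
    increase of 55,000 on a base of 355,000 during days 182-244,
    for t in [1, 365]."""
    bump = 55_000 / 355_000 if 182 <= t <= 244 else 0.0
    return 1.0 + 72.0 * t / (10_000 * 365) + bump
```

Used as the background function W(t) of model (2), R(t) down-weights apparent clustering during the tourist season.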
We reanalyze the data using our method for k's in the set K = {1, 2, 3, 4}, both with and without incorporating the background function R(t). Both the AIC and BIC criteria select k = 1 regardless of whether the background function is used
Table 1
Model fitting results of the two real data sets using k = 1 and k = 2, with comparisons to the SR and SaTScan methods

                                      Without background                        With background
                                      k = 1               k = 2                 k = 1               k = 2
I. Hospital hemoptysis data
  Parameter     Log(θ_1)              0.673 (0.457)^a     0.660 (0.410)^a       0.733 (0.424)^a     0.726 (0.393)^a
  estimation    Log(θ_2)              -                   0.111 (0.533)^a       -                   0.120 (0.509)^a
  Testing       Wald 1st              0.141^b or 0.165^c  0.107^b or 0.078^c    0.166^b or 0.101^c  0.130^b or 0.093^c
  (p-values)    Wald 2nd              -                   0.837^b or 0.393^c    -                   0.813^b or 0.332^c
                LRT                   0.262^b or 0.143^c  0.330^b or 0.139^c    0.205^b or 0.097^c  0.282^b or 0.110^c
                (SR)                  0.024^d             0.026^d               0.02^d or 0.130^*   0.30^d
                (SaTScan)             0.273               -                     0.328               -
  Cluster       Proposed 1st          Day [58, 108]       Day [58, 108]         Day [58, 108]       Day [58, 108]
  intervals                           [Feb. 27-Apr. 18]   [Feb. 27-Apr. 18]     [Feb. 27-Apr. 18]   [Feb. 27-Apr. 18]
                Proposed 2nd          -                   Day [198, 235]        -                   Day [198, 235]
                                                          [Jul. 17-Aug. 23]                         [Jul. 17-Aug. 23]
                (SR 1st)              Day [58, 87]        Day [58, 87]          Day [58, 87]        Day [58, 87]
                                      [Feb. 27-Mar. 28]   [Feb. 27-Mar. 28]     [Feb. 27-Mar. 28]   [Feb. 27-Mar. 28]
                (SR 2nd)              -                   Day [187, 201]        -                   Day [187, 201]
                                                          [Jul. 6-Jul. 20]                          [Jul. 6-Jul. 20]
                (SaTScan)             Day [58, 87]        -                     Day [58, 87]        -
                                      [Feb. 27-Mar. 28]                         [Feb. 27-Mar. 28]
II. CDC brucellosis data
  Parameter     Log(θ_1)              1.675 (0.181)^a     0.432 (0.221)^a       1.883 (0.181)^a     0.084 (0.308)^a
  estimation    Log(θ_2)              -                   1.781 (0.187)^a       -                   1.904 (0.182)^a
  Testing       Wald 1st              <0.001^b(c)         0.051^b or 0.072^c    <0.001^b(c)         0.078^b or 0.276^c
  (p-values)    Wald 2nd              -                   <0.001^b(c)           -                   <0.001^b(c)
                LRT                   <0.001^b(c)         <0.001^b(c)           <0.001^b(c)         <0.001^b(c)
                (SaTScan)             <0.001              -                     <0.001              -
  Cluster       Proposed 1st          Week [44, 46]       Week [20, 24]         Week [44, 46]       Week [20, 24]
  intervals                           [Oct. 30-Nov. 19]   [May 15-Jun. 18]      [Oct. 30-Nov. 19]   [May 15-Jun. 18]
                Proposed 2nd          -                   Week [44, 46]         -                   Week [44, 46]
                                                          [Oct. 30-Nov. 19]                         [Oct. 30-Nov. 19]
                (SaTScan)             Week [44, 46]       -                     Week [44, 46]       -
                                      [Oct. 30-Nov. 19]                         [Oct. 30-Nov. 19]

^a The numbers in round brackets are the corresponding standard errors.
^b The results are from level 5% tests whose critical regions are determined by cut-off values obtained from large sample theory.
^c The results are from level 5% tests whose critical regions are determined by cut-off values based on 1000 Monte Carlo simulations.
^d The p-values reported are from Molinari et al. (2001), in which a bootstrap approach is used to obtain these p-values; these p-values are termed not reliable by the follow-up Demattei and Molinari (2006). The p-value marked with ^* is reported in Demattei and Molinari (2006).
or not. Part I of Table 1 summarizes the results of estimation, testing, and cluster detection in the cases of k = 1 and k = 2.
Also included in Table 1 are the results from the SR approach by Molinari et al. (2001) and a varying-size-window scan statistic approach using the free SaTScan software (www.satscan.org). The results of the SR approach are reported in Molinari et al. (2001) with a constraint that a cluster should have a minimum of six events. The p-value marked with ^* is reported in a follow-up paper by Demattei and Molinari (2006), in which they state that the bootstrap-based inference by Molinari et al. (2001) is not reliable.
Based on Table 1, all three methods point to one potential nonsignificant cluster. Our estimated cluster interval is from day 58 to day 108 (February 27 to April 18). The interval covers, but is a little longer than, the estimated cluster interval of day 58 to day 87 (February 27 to March 28) given by the SR and SaTScan methods.
5.2 CDC Brucellosis Data
Brucellosis (Malta fever) is an infectious disease transmitted
from animals to human beings. It is caused by bacteria of the
genus Brucella and is one critical biologic agent reported to
NNDSS (National Notifiable Disease Surveillance System)
(Chang et al., 2003). The data considered here are weekly
events across the United States, collected every year by the
CDC. We analyze the 2004 weekly data, using a background
function estimated by the weekly number of brucellosis cases
averaged over year 1997 to year 2003.
Because the data are provided as weekly counts, to avoid
simultaneously occurring events, we uniformly disperse the
multiple cases within each week and then analyze the trans-
formed data. We analyze the data using our method for k in
{1, 2, 3, 4}. When the background function is not used,
the AIC criterion selects k = 2 and the BIC criterion selects
k = 1. When the background function is used, both the AIC
and BIC criteria select k = 1. Part II of Table 1 lists the
results of model fitting, testing and cluster detection in the
cases of k = 1 and k = 2 for the CDC Brucellosis data.
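The dispersal step described above is simple to implement; below is a minimal sketch (the function name and the 0-indexed week convention are ours, not the paper's):

```python
import random

def disperse_weekly_counts(weekly_counts, seed=None):
    """Spread each week's case count uniformly over that week's interval.

    Week w (0-indexed) occupies the interval [w, w + 1); each of its cases
    becomes an independent Uniform(w, w + 1) draw, which avoids exactly
    simultaneous events while preserving the weekly totals.
    """
    rng = random.Random(seed)
    times = []
    for week, count in enumerate(weekly_counts):
        times.extend(rng.uniform(week, week + 1) for _ in range(count))
    return sorted(times)

# Toy example: 3 cases in week 1, 1 case in week 2, 2 cases in week 4.
times = disperse_weekly_counts([0, 3, 1, 0, 2], seed=42)
```

The sorted continuous times can then be fed to the estimation procedure in place of the tied weekly counts.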
Included in Table 1, for comparison, are also the results
from the varying size window scan statistic using the SaTScan
software. Because the primary cluster is significant, we have
also tried to use the SaTScan software to pick up the secondary
cluster by replacing the event counts in the primary cluster
(week 44 to week 46) with the average outside the primary
cluster. However, it failed to find any additional significant
cluster.
Based on these results, we can conclude that there exists
one significant cluster from week 44 to week 46 (October 30
to November 19) in the 2004 Brucellosis data.
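The masking step used above to rescan for a secondary cluster can be sketched as follows; this is our own illustration of the described count replacement, not SaTScan functionality:

```python
def mask_primary_cluster(weekly_counts, cluster_start, cluster_end):
    """Replace the counts inside a detected primary cluster (an inclusive
    week range) with the average weekly count outside it, so a scan can
    be rerun to look for a secondary cluster."""
    outside = [c for w, c in enumerate(weekly_counts)
               if not (cluster_start <= w <= cluster_end)]
    avg = sum(outside) / len(outside)
    return [avg if cluster_start <= w <= cluster_end else c
            for w, c in enumerate(weekly_counts)]

# Toy example: mask the elevated weeks 2-3 before rescanning.
masked = mask_primary_cluster([1, 1, 10, 12, 1, 1], 2, 3)
```

Rerunning the scan on the masked series then asks whether any remaining interval is still unusually dense.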
6. Simulation Studies
In this section, we perform simulation studies under two set-
tings of pre-fixed clusters to evaluate the performance of the
proposed estimation, testing and cluster detection methods.
Without loss of generality, all simulation studies are done
within the time window (0, 1).
Our first simulation setting fixes a single cluster at (0.258,
0.494), which is based on a realization of the proposed latent
cluster model with k = 1. Choosing θ = 3, we simulate 300
sets of size n = 100 independent time points y_1, y_2, ..., y_100
according to model (1). The second simulation setting is for
a multiple cluster case with k = 3. In particular, based on a
realization of the latent cluster model with k = 3, we fix the
cluster intervals at (0.092, 0.283), (0.490, 0.584), and (0.743,
0.891). We then choose θ = (θ_1, θ_2, θ_3)' = (3, 3.25, 3.5)', and
simulate 300 sets of size n = 180 independent time points
y_1, y_2, ..., y_180 according to model (1). We assume that we can
only observe the y values in our data analysis. We fit each of
the 2 × 300 data sets first with k being their respective true
number of clusters. Table 2 summarizes the results of parame-
ter estimation, cluster detection, and testing. For comparison
purposes, we also include in Table 2 the corresponding results
from the SR method proposed by Molinari et al. (2001) and
Demattei and Molinari (2006).

Table 2
Summary of the simulation results for parameter estimation, testing and cluster location estimation, with
comparisons to the SR method, in the cases when true k = 1 and 3

                                                 k = 1                    k = 3
I. Estimation
  Parameter estimates^a
    Log(θ1)                              1.119 (0.611)^b          1.107 (0.717)^b
    Log(θ2)                                   --                  1.169 (1.042)^b
    Log(θ3)                                   --                  1.269 (0.958)^b
II. Cluster location detection
  Sensitivity       Proposed             0.956 (0.063)^b          0.962 (0.044)^b
                    (SR)^c               0.916 (0.097)^b          0.698 (0.111)^b
  Specificity       Proposed             0.956 (0.076)^b          0.836 (0.149)^b
                    (SR)^c               0.901 (0.126)^b          0.847 (0.139)^b
  PPV               Proposed             0.958 (0.069)^b          0.937 (0.055)^b
                    (SR)^c               0.902 (0.121)^b          0.892 (0.084)^b
  NPV               Proposed             0.965 (0.046)^b          0.913 (0.077)^b
                    (SR)^c               0.924 (0.085)^b          0.626 (0.101)^b
III. Test of significance
  Power             Wald 1               99.7%^d or 99.3%^e       99.0%^d or 99.0%^e
                    Wald 2                    --                  94.3%^d or 95.3%^e
                    Wald 3                    --                  99.7%^d or 99.7%^e
                    LRT                  99.3%^d or 99.3%^e       100%^d or 100%^e
                    (SR)^c               91.0%                    89.0%
  Size              Wald 1               9.3%^d or 6.3%^e         6.3%^d or 6.0%^e
  (Type I error)    Wald 2                    --                  3.3%^d or 4.7%^e
                    Wald 3                    --                  5.7%^d or 5.3%^e
                    LRT                  3.3%^d or 5.7%^e         0.7%^d or 5.0%^e
                    (SR)^c               29.3%                    21.3%

^a The true parameter values are log(θ1) = log(3) = 1.099, log(θ2) = log(3.25) = 1.179 and log(θ3) = log(3.5) = 1.253.
^b The numbers in the round brackets are the corresponding standard deviations of the 300 estimates.
^c The results of the SR method (Molinari et al., 2001; Demattei and Molinari, 2006) are obtained using an R code provided to us by Dr. Molinari.
^d The results are from level 5% tests whose critical regions are determined by cut-off values obtained by the large sample theory.
^e The results are from level 5% tests whose critical regions are determined by cut-off values based on 1000 Monte Carlo simulations.
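The data-generating step of these simulations can be illustrated in code. Model (1) is not reproduced in this excerpt, so the sketch below assumes the natural piecewise-constant form: the event-time density on (0, 1) is proportional to θ_j inside cluster j and to 1 outside, with clusters sorted and non-overlapping:

```python
import random

def simulate_cluster_data(n, clusters, thetas, seed=None):
    """Draw n points on (0, 1) from a piecewise-constant density that is
    proportional to thetas[j] on clusters[j] and to 1 elsewhere (an
    assumed form of the paper's model (1)). Clusters must be sorted,
    non-overlapping subintervals of (0, 1)."""
    rng = random.Random(seed)
    # Build segment boundaries and relative density heights.
    edges, heights, prev = [0.0], [], 0.0
    for (a, b), th in zip(clusters, thetas):
        if a > prev:                     # background gap before the cluster
            edges.append(a)
            heights.append(1.0)
        edges.append(b)                  # the cluster itself
        heights.append(th)
        prev = b
    if prev < 1.0:                       # trailing background segment
        edges.append(1.0)
        heights.append(1.0)
    # Segment probabilities are proportional to height * length.
    masses = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]
    total = sum(masses)
    points = []
    for _ in range(n):
        u = rng.random() * total
        i = 0
        while i < len(masses) - 1 and u > masses[i]:
            u -= masses[i]
            i += 1
        points.append(rng.uniform(edges[i], edges[i + 1]))
    return points

# First simulation setting: one cluster at (0.258, 0.494) with theta = 3.
ys = simulate_cluster_data(100, [(0.258, 0.494)], [3.0], seed=1)
```

With θ = 3, roughly half the mass falls in the cluster even though it covers under a quarter of the window, which is what makes the cluster detectable.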
The first part of Table 2 provides the mean values of the 300
estimates of the main parameters θ (log transformed) and
their standard errors. It indicates that the Monte Carlo EM
algorithm has provided very reasonable estimates for the pa-
rameters θ. The second part of Table 2 employs four empirical
measures to assess the accuracy of the estimated cluster loca-
tions: sensitivity, specificity, positive predictive value (PPV)
and negative predictive value (NPV). Here, sensitivity is the
proportion of the event points (y's) inside the true clusters
that are also inside the estimated clusters. Specificity is the pro-
portion of the event points (y's) outside the true clusters that
are also outside the estimated clusters. PPV is the proportion of
the event points (y's) inside the estimated clusters that are
inside the true clusters. NPV is the proportion of the event
points (y's) outside the estimated clusters that are outside the
true clusters. The closer these measures are to one, the more
accurate the estimated cluster intervals. Reported in Table 2 are the
means and standard deviations of the sensitivity, specificity,
1018 Biometrics, December 2009
Table 3
Model selection using AIC and BIC criteria^a

                          AIC (estimated k)                     BIC (estimated k)
                 k = 1    k = 2    k = 3    k = 4      k = 1    k = 2    k = 3    k = 4
True k = 1      94.00%    3.00%    1.33%    1.67%     99.67%    0.33%      0        0
True k = 3        0      21.33%   55.67%   23.00%     12.67%   42.33%   38.00%    7.00%

^a Reported in the table are the numbers of times (in percentage) that the AIC or BIC criterion selects k = 1, 2, 3, 4, when the
true k is 1 and 3, respectively.
Table 4
Cluster detection results of using k = 1, 2, 3, and 4 when the true number of clusters k = 3
a
True cluster 1 True cluster 2 True cluster 3 Outside
Fitting k = 1 Detected cluster 1 27.89% 10.56% 59.56% 2.00%
k = 2 Detected cluster 1 82.67% 17.00% 0 0.33%
Detected cluster 2 0 18.67% 80.67% 0.67%
k = 3 Detected cluster 1 99.67% 0.33% 0 0
Detected cluster 2 0 98.33% 0 1.67%
Detected cluster 3 0 0 99.33% 0.67%
k = 4 Detected cluster 1 71.00% 0 0 29.00%
Detected cluster 2 28.00% 17.67% 0 54.33%
Detected cluster 3 0 75.00% 2.33% 22.67%
Detected cluster 4 0 0.17% 92.50% 7.33%
a
Reported in the table are the number of times (in percentage) that an estimated/detected cluster overlaps a true cluster.
The results are obtained based on the 300 simulated data sets of size 180 from the three cluster model described in Section 6.
Here, an estimated cluster is counted as an overlap to a true cluster, if it covers more than two-thirds of the true cluster. In the
case when an estimated cluster overlaps with two true clusters, it is counted as 0.5 overlaps to each of the two true clusters, and
so on.
PPV and NPV values in the 300 repeated simulations for the
cluster location estimation method described in Section 3.3,
as well as for the SR method proposed by Molinari et al. (2001)
and Demattei and Molinari (2006). The results indicate that both
our method and the SR method can identify clusters reasonably well,
but all four measures for our method are almost uniformly higher,
with smaller standard deviations.
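The four empirical measures defined above are simple proportions over the event points; a direct translation (helper names are ours):

```python
def in_any(t, intervals):
    """True if t falls in any of the closed intervals [a, b]."""
    return any(a <= t <= b for a, b in intervals)

def cluster_accuracy(points, true_clusters, est_clusters):
    """Sensitivity, specificity, PPV and NPV as defined in Section 6:
    each is a proportion of event points classified by true versus
    estimated cluster membership (assumes every denominator is nonzero)."""
    tp = sum(1 for y in points if in_any(y, true_clusters) and in_any(y, est_clusters))
    fn = sum(1 for y in points if in_any(y, true_clusters) and not in_any(y, est_clusters))
    fp = sum(1 for y in points if not in_any(y, true_clusters) and in_any(y, est_clusters))
    tn = sum(1 for y in points if not in_any(y, true_clusters) and not in_any(y, est_clusters))
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "PPV": tp / (tp + fp),
            "NPV": tn / (tn + fn)}

# Toy example: the estimated cluster misses the tail of the true one.
m = cluster_accuracy([0.1, 0.2, 0.3, 0.6, 0.7, 0.9],
                     true_clusters=[(0.05, 0.35)],
                     est_clusters=[(0.05, 0.25)])
```

In this toy example the estimated interval is too short, so sensitivity and NPV drop below one while specificity and PPV remain perfect.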
The third part of Table 2 examines the powers and sizes
of the level 0.05 tests outlined in Section 3.4. For the power
calculation, we use the same 2 × 300 simulation data sets
described before. To compute the actual size (type I error),
we simulate 300 data sets of 100 time points from the uniform
(0,1) distribution in the case of k = 1 and 300 data sets of
180 time points from the uniform(0,1) distribution in the case
of k = 3. Then the same estimation and testing procedures
as those of the power calculation are applied. Both the Wald
and LRT tests have very high power to detect the clusters in
the simulated data. The actual sizes of the tests are slightly
off if we approximate the distributions of the test statistics by
their large-sample asymptotic distributions, indicating that the
sample sizes used may still be a little too small. But the al-
ternative simulation-based Monte Carlo testing approach is
more or less on target. For this particular type of data from
model (1), our model-based likelihood approaches have more
power to detect clusters, and have smaller and much more ac-
curate Type I errors than those of the step regression method
proposed by Molinari et al. (2001).
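The Monte Carlo calibration behind footnote e of Table 2 works for any scalar test statistic. Below is a hedged sketch that uses a simple stand-in statistic (a fixed-width window count), not the paper's Wald or LRT statistics, which would require the EM fit:

```python
import random

def monte_carlo_cutoff(stat, n, n_sim=1000, level=0.05, seed=None):
    """Approximate the level-`level` upper-tail critical value of `stat`
    under the null of no clustering, i.e. n i.i.d. Uniform(0, 1) event
    times, mirroring the simulation-based calibration in Section 6."""
    rng = random.Random(seed)
    null_stats = sorted(
        stat(sorted(rng.random() for _ in range(n))) for _ in range(n_sim)
    )
    # Empirical (1 - level) quantile: reject when the observed statistic
    # exceeds this cutoff.
    return null_stats[int((1 - level) * n_sim)]

def max_window_count(ys, width=0.1):
    """Stand-in statistic: the largest number of events in any window of
    the given width anchored at an event time (not the paper's test)."""
    return max(sum(1 for t in ys if y <= t <= y + width) for y in ys)

cutoff = monte_carlo_cutoff(max_window_count, n=100, n_sim=200, seed=7)
```

Replacing `max_window_count` with the fitted Wald or LRT statistic recovers the calibration used in the paper.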
In practice, we usually do not know the number of clus-
ters. The AIC or BIC criterion is often used to determine the
number of clusters from data that consist of only the time
points of events. Table 3 summarizes the model selection re-
sults using the AIC and BIC criteria described in Section 4.
The numbers reported are consistent in magnitude with those
reported in the model selection literature (e.g., Pan, 2001). It
appears that the penalty term in the BIC criterion is a little
too big, so it tends to pick a smaller number of clusters. This
phenomenon is also reported in additional results provided in
Sun (2008).
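The selection step itself is mechanical once the maximized log-likelihoods are available; a sketch follows, where the per-cluster parameter count is an illustrative assumption (the paper's exact count follows from its model specification):

```python
import math

def select_k(loglik_by_k, n, params_per_cluster=3):
    """Pick k minimizing AIC = -2*loglik + 2p and BIC = -2*loglik + p*log(n),
    with p = params_per_cluster * k. The count of 3 parameters per cluster
    (e.g. a rate theta_j plus interval parameters) is our assumption for
    illustration, not the paper's definition."""
    def aic(k, ll):
        return -2 * ll + 2 * params_per_cluster * k
    def bic(k, ll):
        return -2 * ll + params_per_cluster * k * math.log(n)
    k_aic = min(loglik_by_k, key=lambda k: aic(k, loglik_by_k[k]))
    k_bic = min(loglik_by_k, key=lambda k: bic(k, loglik_by_k[k]))
    return k_aic, k_bic

# Hypothetical maximized log-likelihoods for k = 1..4 fits, n = 180.
choice = select_k({1: -116.0, 2: -110.0, 3: -109.0, 4: -108.8}, n=180)
```

With these hypothetical log-likelihoods the heavier log(n) penalty makes BIC back off to a smaller k than AIC, the pattern reported in Table 3.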
We carry out an additional study to examine what happens
if we use a wrong k to fit the model. Table 4 lists cluster
detection results of using k = 1, 2, 3, and 4, while the true
number of clusters is k = 3. From Table 4, we can see that
when using a wrong k (k = 1 or 2) smaller than the true number
of clusters k = 3, the method almost always picks up one (if
k = 1) or two (if k = 2) of the three true clusters.
When using the correct k = 3, the method picks up the three
true clusters most of the time; this result is consistent with
those reported in the second part of Table 2. When using
a wrong k = 4, a number greater than the true number of
clusters k = 3, the three true clusters appear in total 99%, 92.84%,
and 94.83% of the time, respectively, and on average 1.1333 detected
clusters fall outside the three true clusters. This is evidence that the
proposed method picks up the three true clusters plus one
false cluster most of the time.
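The overlap rule in the footnote of Table 4 reduces to an interval computation (function names are ours):

```python
def overlap_fraction(est, true):
    """Fraction of the true cluster interval covered by the estimated one."""
    a, b = est
    c, d = true
    return max(0.0, min(b, d) - max(a, c)) / (d - c)

def counts_as_overlap(est, true):
    """Table 4's rule: an estimated cluster is counted as overlapping a
    true cluster if it covers more than two-thirds of the true cluster."""
    return overlap_fraction(est, true) > 2 / 3

# Against the first true cluster (0.092, 0.283) from Section 6:
full = counts_as_overlap((0.05, 0.30), (0.092, 0.283))    # covers it entirely
partial = counts_as_overlap((0.20, 0.30), (0.092, 0.283))  # covers < 2/3
```

The footnote's half-credit rule for an estimated cluster straddling two true clusters would simply apply this test to each true cluster and split the count.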
Finally, we would like to remark that we have also carried out sim-
ulation studies under an alternative design in which we allow
the clusters to change (randomly simulated according to the
latent model) in each of the simulation exercises. The simu-
lation results under this design are similar to (only slightly
worse than) what we reported here. Due to space limitations,
the results are not reported in the article. See Sun (2008) for
such studies and an interpretation of the results.
7. Discussion
Statistical modeling is one of the most widely used tools in
modern applied statistics. A model that mimics the process
generating the sample data retrieves information and provides
great insight into the problem. We develop such a latent model
for a typical cluster generation process. Based on the model,
we develop a likelihood inference-based detection approach
and a Monte Carlo EM algorithm to identify clusters and es-
timate cluster locations. Like the generalized scan statistic,
the latent modeling approach can flexibly adjust for nonuni-
form background variation, and can be generalized to two or
more dimensions (Sun, 2008). Different from the scan statistic
tests, our latent modeling approach has the advantage of not
needing to fix the range of cluster sizes or window sizes and
the number of clusters. Our procedure also gives both a global
test of simultaneous clustering and tests of significance of es-
timated individual clusters. Compared with the SR method
proposed by Molinari et al. (2001) and Demattei and Molinari
(2006), our approach shares an important advantage of their
approach in its ability to detect multiple clusters of vary-
ing sizes in temporal data. But our simulation studies give
cases where our procedure has substantially increased effi-
ciency over the SR approach.
The development in this article is based on the assumed la-
tent model illustrated in Figure 1. It is possible that there are
clusters before the starting time of the study, and that the wait-
ing time between the last cluster before time 0 and the first
cluster after time 0 is longer than b_1. In the exponential case
(with its lack-of-memory property), the existence of clusters
before time 0 does not change the results. In other cases, we
may need to model b_1 separately by a truncated waiting-time
distribution. It is also possible that either 0 or T falls within a
cluster interval. That is, we may assume the first or the last
cluster includes 0 or T. Under such an assumption, we can
in theory use the same techniques described in the article to
develop a similar algorithm. However, for practical purposes,
it may be sufficient to use the algorithm outlined in the arti-
cle, unless additional information is available. Note that the
probability that an event is exactly at the end point 0 or T is
zero; i.e., P(y_i = 0) = P(y_i = T) = 0. In practice, without any
information outside the given time window (0, T), we cannot
distinguish a cluster that includes an endpoint from a cluster
that starts right after (or ends right before) that endpoint. A
simulation study (results not shown here) backs up this
argument.
The likelihood inference described in Section 3.4 is for two-
sided tests. In some applications we might be interested in
one-sided tests. If the one-sided tests are for each single θ_j, or
for θ in the case that k = 1, the Wald-type tests and the likelihood
ratio test described in Section 3.4 can be directly extended
to one-sided tests by dividing the p-values in half. For a test
that involves multiple θ_j's, there is a complication in using the
likelihood ratio tests. This is inherited from the well-known
fact that likelihood ratio tests (similar to the F-test for mul-
tiple regression parameters) are not well suited for one-sided
tests of multiple parameters. We also note that there
are applications in which we would like to limit the parameter
space to θ_j ≥ 1, for j = 1, 2, ..., k, or to θ_j ≤ 1, for j = 1,
2, ..., k. In this case, θ_j = 1 is on the boundary of the param-
eter space, and constrained likelihood inference applies;
see, e.g., Silvapulle and Sen (2005) and the references therein.
Silvapulle and Sen (2005) include an algorithm to compute
p-values for constrained likelihood testing problems. We can
directly incorporate their algorithm into our problem, with few
alterations to the Monte Carlo EM estimation and cluster
location estimation procedures. Constrained likelihood inference
is, however, much more complex both theoretically and
computationally. Because the p-values from constrained likeli-
hood approaches are usually smaller than those from the reg-
ular likelihood inference, in practice one may conservatively
(with some loss in power) use the two-sided p-values discussed
in Section 3.4 for the one-sided testing problems.
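The p-value-halving rule for a single θ_j can be made concrete with the normal approximation to the Wald statistic. The sketch below uses the k = 2 estimate Log(θ1) = 0.432 with standard error 0.221 from Table 1 as an illustration; the null value 0 (i.e., θ_j = 1) and the direction of the alternative are our illustrative choices:

```python
import math

def wald_p_values(estimate, se, null_value=0.0):
    """Two-sided Wald p-value under the normal approximation, plus the
    one-sided p-value (alternative: parameter > null_value), which is
    half the two-sided p-value when the estimate lies on the
    alternative's side."""
    z = (estimate - null_value) / se
    # Normal CDF via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    two_sided = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    one_sided = two_sided / 2 if z > 0 else 1 - two_sided / 2
    return two_sided, one_sided

# Table 1, k = 2 without background: Log(theta_1) = 0.432, SE = 0.221.
two_sided, one_sided = wald_p_values(0.432, 0.221)
```

The two-sided value reproduces the borderline p-value of about 0.05 reported in Table 1, while the one-sided version halves it, illustrating the rule above.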
8. Supplementary Materials
The supplementary materials, including the Gibbs algorithms
used in Section 3 and the data sets used in Section 6, are
available under the Paper Information link at the Biometrics
website http://www.biometrics.tibs.org.
Acknowledgements
The authors thank Dr. Molinari for helpful information about
the SR method and for providing the R code for that ap-
proach. The authors also thank the reviewers for their con-
structive suggestions. The research is partially supported by
NSF grant SES 05-18543 and NSA grant H98230-08-1-0104.
References
Besag, J., York, J., and Mollié, A. (1991). Bayesian image restoration,
with two applications in spatial statistics (with discussion). An-
nals of the Institute of Statistical Mathematics 43, 1–59.
Chang, M., Glynn, M. K., and Groseclose, S. L. (2003). Endemic, noti-
fiable bioterrorism-related diseases, United States, 1992 to 1999.
Emerging Infectious Diseases 9, 556–564.
Clayton, D. and Kaldor, J. (1987). Empirical Bayes estimates of age-
standardized relative risks for use in disease mapping. Biometrics
43, 671–681.
Demattei, C. and Molinari, N. (2006). Multiple temporal cluster de-
tection test using exponential inequalities. Far East Journal of
Theoretical Statistics 19(2), 231–244.
Demattei, C., Molinari, N., and Daures, J. P. (2007). Arbitrarily shaped
multiple spatial cluster detection for case event data. Computa-
tional Statistics and Data Analysis 51, 3931–3945.
Dembo, A. and Karlin, S. (1992). Poisson approximations for r-scan
processes. Annals of Applied Probability 2, 329–357.
Denison, D. and Holmes, C. (2001). Bayesian partitioning for estimating
disease risk. Biometrics 57, 143–149.
Diggle, P., Rowlingson, B., and Su, T.-L. (2005). Point process method-
ology for on-line spatio-temporal disease surveillance. Environ-
metrics 16, 423–434.
Fraiman, R. and Meloche, J. (1999). Multivariate L-estimation (with
discussion). Test 8, 255–317.
Fu, J. C. and Lou, W. Y. (2003). Distribution Theory of Runs and
Patterns and Its Applications: A Finite Markov Chain Imbedding
Approach. World Scientific.
Gangnon, R. E. and Clayton, M. K. (2000). Bayesian detection and
modeling of spatial disease maps. Biometrics 56, 922–935.
Gangnon, R. E. and Clayton, M. K. (2003). A hierarchical model for
spatially clustered disease rates. Statistics in Medicine 22, 3213–
3228.
Gangnon, R. E. and Clayton, M. K. (2004). Likelihood-based tests for
localized spatial clustering of disease. Environmetrics 15, 797–
810.
Glaz, J., Naus, J., and Wallenstein, S. (2001). Scan Statistics. New
York: Springer.
Knorr-Held, L. and Raßer, G. (2000). Bayesian detection of clusters
and discontinuities in disease maps. Biometrics 56, 13–21.
Kulldorff, M. and Nagarwalla, N. (1995). Spatial disease clusters: De-
tection and inference. Statistics in Medicine 14, 799–810.
Lawson, A. B. (1995). Markov chain Monte Carlo methods for putative
pollution source problems in environmental epidemiology. Statis-
tics in Medicine 14, 2473–2486.
Leung, M.-Y., Choi, K.-P., Xia, A., and Chen, L. H. Y. (2005). Non-
random clusters of palindromes in Herpesvirus genomes. Journal
of Computational Biology 12, 331–354.
Molinari, N., Bonaldi, C., and Daures, J. P. (2001). Multiple temporal
cluster detection. Biometrics 57, 577–583.
Naus, J. and Wallenstein, S. (2004). Multiple window and cluster size
scan procedures. Methodology and Computing in Applied Proba-
bility 6, 389–400.
Neill, D. B., Moore, A. W., and Cooper, G. F. (2006). A Bayesian
spatial scan statistic. Advances in Neural Information Processing
Systems 18, 1003–1010.
Pan, W. (2001). Akaike's information criterion in generalized estimating
equations. Biometrics 57, 120–125.
Silvapulle, M. J. and Sen, P. K. (2005). Constrained Statistical Infer-
ence. Wiley & Sons, Inc.
Su, X., Wallenstein, S., and Bishop, D. (2001). Non-overlapping clus-
ters: Approximate distribution and application to molecular biol-
ogy. Biometrics 57, 420–426.
Sun, Q. K. (2008). Statistical models and inferences for detecting mul-
tiple temporal and spatial clusters. Ph.D. dissertation, Rutgers
University.
Tanner, M. (1993). Tools for Statistical Inference. New York: Springer-
Verlag.
Wallenstein, S. and Naus, J. (2004). Statistics for temporal surveil-
lance of bio-terrorism. Morbidity and Mortality Weekly Report
53(Suppl), 74–78.
Waller, L. A., Carlin, B. P., Xia, H., and Gelfand, A. E. (1997). Hier-
archical spatio-temporal mapping of disease rates. Journal of the
American Statistical Association 92, 607–617.
Received December 2007. Revised August 2008.
Accepted October 2008.