A Spatial Scan Statistic
Martin Kulldorff
To cite this article: Martin Kulldorff (1997) A spatial scan statistic, Communications in Statistics -
Theory and Methods, 26:6, 1481-1496, DOI: 10.1080/03610929708831995
ABSTRACT
1. INTRODUCTION
A scan statistic is used to detect clusters in a point process. It has been studied
in the one-dimensional setting by Naus (1965a) and by many others. For a
point process on an interval $[a, b]$, a window $[t, t + w]$ of fixed size $w < b - a$ is
moved along the interval. Over all possible values of $t$, the maximum number
of points in the window is recorded and compared to its distribution under the
null hypothesis of a purely random Poisson process.
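The one-dimensional statistic described above can be sketched in a few lines. This is only an illustration: the function name `scan_statistic_1d` and the sample points are assumptions, not taken from the paper. It uses the observation that the maximum count is always attained by some window whose left edge sits on an observed point.

```python
from bisect import bisect_right


def scan_statistic_1d(points, w):
    """Return the maximum number of points in any window [t, t + w].

    Only windows whose left edge coincides with an observed point are
    checked, since sliding a window right until its left edge meets a
    point never loses any of the points it contains.
    """
    pts = sorted(points)
    best = 0
    for i, t in enumerate(pts):
        # Index just past the last point that is <= t + w.
        j = bisect_right(pts, t + w)
        best = max(best, j - i)
    return best


# Hypothetical points on the interval [0, 1], window size w = 0.1.
points = [0.05, 0.10, 0.12, 0.50, 0.55, 0.90]
print(scan_statistic_1d(points, 0.1))  # prints 3
```

The observed maximum would then be compared with its null distribution, which this sketch does not compute.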
The one-dimensional problem has been extended in various directions.
When the points are grouped into one of several sub-intervals we have aggregated
data. This has been studied by Wallenstein et al. (1989) among others,
and is of interest when we have, for instance, monthly counts of some event.
Weinstock (1981) has studied the problem where under the null hypothesis the
intensity of the underlying Poisson process has a known inhomogeneity. Various
authors, such as Saperstein (1972) and Naus (1974), have studied a related
Bernoulli model, with a sequence of binary outcomes. Loader (1991) allows for
a non-fixed window size. Glaz and Naus (1983) have looked at a scan statistic
searching for multiple clusters. For any of these extensions, and depending on
the application, the scan statistic may or may not be conditioned on the total
number of points observed.
In this paper a spatial scan statistic is proposed. An attempt is made to
treat the problem in a setting as general as possible, except that the analysis is
always conditioned on the total number of observed points. The window may
take any predefined shape and the size of the window is allowed to vary as it
scans the study region. The latter is very useful when we lack prior knowledge
about the size of the area covered by the cluster. The method also allows for
an arbitrary, but known, underlying intensity that governs the distribution of
points under the null hypothesis. This can take many different forms depending
on the application. It is modeled as a measure $\mu$ on a geographical space $G$.
When $G$ is a line and $\mu$ is a uniform measure on $[a, b]$, we obtain the traditional
one-dimensional problem as a special case. With the Lebesgue measure on the
plane we have a homogeneous spatial Poisson process. Other possible measures
include the following:
1. The spatial clustering of trees is studied in forestry. A problem of potential
interest is to see if there are clusters of trees that are of a specific kind or that
have a certain characteristic, after having compensated for the uneven spatial
distribution of all trees. That is, we want to know if the proportion of one kind
of tree is particularly high in some location. The measure of an area would in
this case be the total number of trees growing there.
2. In astronomy, there is an equivalent three-dimensional problem if we want
to detect clusters of a particular kind of star after compensating for the irregular
spatial distribution of all stars.
3. Epidemiologists are interested in geographical clusters of disease. Here it
is necessary to compensate for the uneven density of the population as a whole.
When data are aggregated into census districts the measure will be concentrated
at the central coordinates of those districts.
4. To find uranium deposits, airplanes measure Geiger counts as they fly in
parallel lines over large areas. A high number of counts in a specific area
Let $N$ denote a spatial point process where $N(A)$ is the random number of
points in the set $A \subset G$. As the window moves over the study area it defines a
The null hypothesis is $H_0: p = q$. The alternative hypothesis is $H_1: p > q$, $Z \in \mathcal{Z}$. Under $H_0$,
$N(A) \sim \mathrm{Bin}(\mu(A), p)$ for all sets $A$. Under $H_1$, $N(A) \sim \mathrm{Bin}(\mu(A), p)$ for all
sets $A \subset Z$, and $N(A) \sim \mathrm{Bin}(\mu(A), q)$ for all sets $A \subset Z^c$.
Under the Poisson model, points are generated by an inhomogeneous Poisson
process. There is exactly one zone $Z \subset G$ such that
$N(A) \sim \mathrm{Po}(p\,\mu(A \cap Z) + q\,\mu(A \cap Z^c))$ $\forall A$. The null hypothesis is $H_0: p = q$, while the alternative
hypothesis states that $H_1: p > q$, $Z \in \mathcal{Z}$. Under $H_0$, $N(A) \sim \mathrm{Po}(p\,\mu(A))$ $\forall A$.
Note that one of the parameters, $Z$, disappears under the null hypothesis. This
is unusual but not unheard of; see for example Davies (1977).
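Since the analysis is conditioned on the total number of points, data under the null hypothesis can be simulated by distributing a fixed number of points in proportion to the measure. The sketch below relies on the standard fact that independent Poisson counts, conditioned on their total, are multinomial. It assumes the measure is concentrated on a finite list of locations; the name `simulate_null_counts` and its arguments are illustrative and do not appear in the paper.

```python
import random
from collections import Counter


def simulate_null_counts(measure, n_g, seed=None):
    """Distribute n_g points over locations with probability proportional
    to the measure at each location (null hypothesis, conditioned on n_g).

    `measure` is a list of non-negative weights, one per location.
    Returns a list with the simulated number of points per location.
    """
    rng = random.Random(seed)
    # Draw a location index for each of the n_g points.
    draws = rng.choices(range(len(measure)), weights=measure, k=n_g)
    tally = Counter(draws)
    return [tally.get(i, 0) for i in range(len(measure))]


# Hypothetical measure on three locations, 100 points in total.
counts = simulate_null_counts([10, 20, 70], 100, seed=1)
print(counts, sum(counts))
```

Each simulated data set has exactly the requested total, so test statistics computed on such replicas are directly comparable with the one from the observed data.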
The best choice of window, and thereby the corresponding collection $\mathcal{Z}$ of
zones, depends on the application. Some possibilities are:
1. All circular subsets.
2. All circles centered at any of several foci on a fixed grid, with a possible
upper limit on circle size (Kulldorff and Nagarwalla, 1994).
3. Same as (2) but with a fixed circle size (Turnbull et al., 1989).
3. LIKELIHOOD
It is now time to derive the likelihood ratio test. It is slightly different for
the Bernoulli and the Poisson models, and we start with the former. Let $n_Z$
denote the observed number of points in zone $Z$, and $n_G$ the total number of
observed points.
3.1 Bernoulli model
3.2 Poisson model
The likelihood function for the Poisson model is a little more complex. The
probability of observing $n_G$ points in the study area is
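For the numerator we first take the supremum over all $p$ and $q$ for a fixed $Z$.
Equation 3 takes its maximum when $p = n_Z/\mu(Z)$ and $q = (n_G - n_Z)/(\mu(G) - \mu(Z))$, so

$$
\sup_{p > q} L(Z, p, q) =
\begin{cases}
\dfrac{e^{-n_G}}{n_G!} \left(\dfrac{n_Z}{\mu(Z)}\right)^{n_Z} \left(\dfrac{n_G - n_Z}{\mu(G) - \mu(Z)}\right)^{n_G - n_Z} \prod_{x_i} \mu(x_i) & \text{if } \dfrac{n_Z}{\mu(Z)} > \dfrac{n_G - n_Z}{\mu(G) - \mu(Z)}, \\[1ex]
L_0 & \text{otherwise.}
\end{cases}
$$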
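For a single fixed zone, the Poisson likelihood ratio can be evaluated numerically by plugging in the maximizing values of $p$ and $q$ given above. The following is a minimal sketch on the log scale, assuming the null likelihood is maximized at the overall rate $n_G/\mu(G)$ so that all constant factors cancel in the ratio. The function name `poisson_llr` is illustrative, and the example inputs are the zone A counts and statewide totals reported later in the paper.

```python
import math


def _xlogy(x, y):
    """Return x * log(y), with the convention that 0 * log(0) = 0."""
    return x * math.log(y) if x > 0 else 0.0


def poisson_llr(n_z, mu_z, n_g, mu_g):
    """Log likelihood ratio for one zone Z under the Poisson model.

    n_z, mu_z: observed points and measure inside the zone.
    n_g, mu_g: observed points and measure in the whole study area.
    Returns 0.0 when the rate inside the zone does not exceed the rate
    outside, since the alternative requires p > q.
    """
    if not (0 < mu_z < mu_g and 0 <= n_z <= n_g):
        raise ValueError("need 0 < mu_z < mu_g and 0 <= n_z <= n_g")
    p_hat = n_z / mu_z
    q_hat = (n_g - n_z) / (mu_g - mu_z)
    if p_hat <= q_hat:
        return 0.0
    return (_xlogy(n_z, p_hat)
            + _xlogy(n_g - n_z, q_hat)
            - _xlogy(n_g, n_g / mu_g))


# Zone A of the North Carolina example: 139 cases among 36376 births,
# out of 1503 cases among 753354 births statewide.
print(poisson_llr(139, 36376, 1503, 753354))
```

The test statistic itself is the maximum of this quantity over all zones in the collection; the sketch covers only the inner step for one zone.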
4.1 Detection versus inference
Most statistical methods for cluster analysis of a spatial point process are
either descriptive in the sense that they can detect the location of clusters but
without any inference involved, or they do inference but without the ability to
detect the location of clusters. An important characteristic of the spatial scan
test is that it does both, so that when the null hypothesis is rejected we can
locate the specific area of the map that causes the rejection. To be precise,
let $x = \{x_i, i = 1, \ldots, n_G\}$ denote the set of coordinates of the $n_G$ points in
a data set where $\hat{Z}$ is the most likely cluster, and let $x' = \{x'_i, i = 1, \ldots, n_G\}$
be an alternative configuration with exactly the same number of points. The
following theorem holds for the Bernoulli and Poisson models.
In words, the theorem states that as long as the points within the zone
constituting the most likely cluster are located where they are, we would still
reject the null hypothesis no matter how the rest of the points were shuffled
around. For example, if the null hypothesis is rejected due to a disease cluster
in Seattle, it does not matter how we move around the cases on the U.S. east
coast; the null hypothesis will still be rejected. This might sound like a
self-evident property, but it does not hold for most other tests for spatial clustering
such as Knox (1964), Whittemore et al. (1987), Cuzick and Edwards (1990),
or Diggle and Chetwynd (1991). Those tests are hence not suitable if we want
to know the location of clusters. Rather, they are geared towards answering
the question of whether the phenomenon of clustering occurs over the study
region as a whole, such as if a disease is infectious or not, a question for which
the spatial scan statistic is not suitable.
Proof: Let $\lambda(x)$ and $\lambda(x')$ denote the values of the test statistic for the two
different data sets. Since the two data sets have the same number of points,
the distribution of $\lambda$ under the null hypothesis will be the same, and it is hence
enough to show that $\lambda(x') \geq \lambda(x)$. In the Bernoulli case we have

$$
\lambda(x) = \frac{L(\hat{Z} \mid x)}{L_0} \leq \frac{L(\hat{Z} \mid x')}{L_0} \leq \frac{\sup_{Z} L(Z \mid x')}{L_0} = \lambda(x').
$$
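The first inequality holds since $x'$ has at least as many points within zone
$\hat{Z}$ as $x$. For the Poisson model it is trivially true if $\lambda(x) = 1$. When $\lambda(x) > 1$
we have from equation 4 that

$$
\lambda(x) = \sup_{Z \in \mathcal{Z}} \frac{\left(\dfrac{n_Z}{\mu(Z)}\right)^{n_Z} \left(\dfrac{n_G - n_Z}{\mu(G) - \mu(Z)}\right)^{n_G - n_Z}}{\left(\dfrac{n_G}{\mu(G)}\right)^{n_G}}
$$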
4.2 Power
This means that if we fix the critical region except for its subset $C_k$ as
indicated by statement 1, then the test is uniformly most powerful compared
to all remaining choices of the critical region and with respect to all parameters
$(Z, p, q) \in A_k$. This property is very important in any multiple testing type of
situation, where there is a composite alternative hypothesis and where we
wish to know which part of it causes the rejection. As mentioned before, the
scan statistic has the ability to identify the zone responsible for rejecting the
null hypothesis, and if we fail to detect a real cluster, it is of little comfort
if the null hypothesis is rejected based on an untrue cluster in another part
of the study area. In fact, that is usually less desirable than just failing to
reject the null hypothesis. The problem resembles other multiple comparison
situations, where instead of testing multiple cluster locations simultaneously,
we might test several new agricultural crop varieties to see if any of them are
better than the one presently in use, or we might simultaneously test several
potential risk factors for cancer.
If we are only concerned about rejection versus no rejection, without an
interest in the location of clusters, then the property of being an individually
most powerful test is of little value. For such a problem the likelihood ratio
based spatial scan statistic would be a suboptimal choice.
Now, let $A_Z = \{(Z, p, q) : p > q\}$ and $A_0 = \{(Z, p, q) : p = q\}$. Let $C_Z$
denote the intersection of the critical region $C$ and the subset of the sample
space in which $Z$ is the most likely cluster.
Theorem 2 The test based on $\lambda$ forms an individually most powerful test with
respect to the partitions $\{A_Z\}$ and $\{C_Z\}$. This holds for the Bernoulli as well
as the Poisson model.
Proof: We show that if statements (1) and (2) in the definition are true, then
(3) cannot hold. For an arbitrary $Z$, let $D^- = \{\omega : \omega \in C_Z, \omega \notin C'_Z\}$ and
$D^+ = \{\omega : \omega \in C'_Z, \omega \notin C_Z\}$. Let
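A similar argument holds for the Bernoulli case, based on equation 1. Now,
for any $(Z, p, q) \in A_Z$,

$$
\begin{aligned}
&= M \left( P(\omega \in D^+ \mid H_0) - P(\omega \in D^- \mid H_0) \right) \\
&= M \left( P(\omega \in C'_Z \mid H_0) - P(\omega \in C_Z \mid H_0) \right) \\
&= M \left( P(\omega \in C' \mid H_0) - P(\omega \in C \mid H_0) \right) = 0
\end{aligned}
$$

The second to last equality holds since $C_{Z'} = C'_{Z'}$ for all $Z' \neq Z$ according to
statement 1 in the definition.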
In order to find the value of the test statistic, we need a way to calculate the
likelihood ratio as it is maximized over the collection of zones in the alternative
hypothesis. This might seem like a daunting task since the number of zones
could easily be infinite. Two properties allow us to reduce it to a finite problem.
The number of observed points is always finite, and for a fixed number of
points the likelihood decreases as the measure of the moving window increases.
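The reduction to a finite problem can be illustrated for aggregated data, where the measure sits on a finite set of coordinate points and the windows are circles centered at those points. As the radius grows, the zone changes only when the circle reaches another point, so each center yields at most as many distinct zones as there are points. The sketch below is an illustration under those assumptions; the name `candidate_zones` and the `max_fraction` cap on the measure inside a zone are hypothetical choices, not the paper's implementation.

```python
import math


def candidate_zones(coords, measure, max_fraction=0.5):
    """Enumerate the distinct zones produced by circles centered at each point.

    coords: list of (x, y) locations carrying the measure.
    measure: list of non-negative weights, one per location.
    max_fraction: largest share of the total measure a zone may contain.
    Returns a set of frozensets of location indices.
    """
    n = len(coords)
    cap = max_fraction * sum(measure)
    zones = set()
    for center in coords:
        dist = [math.dist(center, p) for p in coords]
        order = sorted(range(n), key=dist.__getitem__)
        members, mass = [], 0.0
        for k, i in enumerate(order):
            members.append(i)
            mass += measure[i]
            if mass > cap:
                break
            # Record a zone only once all points at this distance are in,
            # so that tied points always enter the circle together.
            if k + 1 == n or dist[order[k + 1]] > dist[i]:
                zones.add(frozenset(members))
    return zones


# Three hypothetical locations on a line with equal measure.
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
measure = [10, 10, 10]
print(len(candidate_zones(coords, measure, max_fraction=1.0)))  # prints 6
print(len(candidate_zones(coords, measure, max_fraction=0.5)))  # prints 3
```

The likelihood ratio then needs to be evaluated only on this finite set of zones.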
TABLE I
The spatial scan statistic applied to sudden infant death syndrome in North
Carolina, adjusted for the uneven geographical distribution of births. Zones
refer to Figure 1 and incidence is the number of deaths per 1000 live births.

                  Zone   # SIDs   # Births   Incidence   p-value
                  $Z$    $n_Z$    $\mu(Z)$
Bernoulli model   A      139      36376      3.8         0.0001
                  B      59       14388      4.1         0.0005
Poisson model     A      139      36376      3.8         0.0001
                  B      59       14388      4.1         0.0003
and the State Center for Health Statistics of the North Carolina Department
of Human Resources. They have previously been analysed by Cressie and
Chan (1989) among others.
For each of the 100 counties in North Carolina, the data comprise the
total number of live births as well as the number of sudden infant deaths
(SIDs) for the years 1974-1984. The number of live births in the counties
ranges from 567 to 52345. The locations of the county seats were used as the
geographical coordinates. The total number of SIDs is 1503 out of 753354
live births. This gives a statewide incidence rate of 2.0 per 1000. The total
number of births in each county, as well as the statewide number of SIDs, are
also stratified into whites and non-whites. The complete data are presented
by Cressie and Chan (1989).
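The incidence figures quoted for the state as a whole and for the zones in Table I can be checked directly from the reported counts. A small sketch follows; the helper name `incidence_per_1000` is illustrative.

```python
def incidence_per_1000(cases, births):
    """Number of cases per 1000 live births."""
    return 1000.0 * cases / births


statewide = incidence_per_1000(1503, 753354)   # all of North Carolina
zone_a = incidence_per_1000(139, 36376)        # zone A in Table I
zone_b = incidence_per_1000(59, 14388)         # zone B in Table I

print(round(statewide, 1), round(zone_a, 1), round(zone_b, 1))  # prints 2.0 3.8 4.1
```

Both zones show roughly twice the statewide rate.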
The measure at the coordinate point of each county is taken as the number
of live births in that county. The measure is zero elsewhere. This is as in
example (3) of Section 1. As zones for the window we use all circles that are
centered at one of the county coordinate points and that include at most half
of the total population. This follows example (2) of Section 2.
Note that the zones are circular only with respect to the aggregated data.
As we draw the circles around one county seat, other counties will either be
completely part of a zone or else not at all, depending on whether their county
seats fall within the circle or not. Hence, we get a compact but irregularly shaped
zone following the county boundaries. This can be seen in Figures 1 and 2.
The Bernoulli model is the most natural one to use for this data set. We
have birth counts, and each birth can correspond to at most one sudden infant
death. Table 1 summarizes the results of the analysis.
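The Bernoulli likelihood ratio for a single zone can be sketched numerically as follows. The sketch assumes the standard binomial likelihood implied by the Bernoulli model of Section 2, with the probabilities inside and outside the zone replaced by the observed proportions; the exact expression from Section 3.1 is not reproduced in this text. The function name `bernoulli_llr` is illustrative, and the inputs are the zone A and zone B counts from Table 1 together with the statewide totals.

```python
import math


def _xlogy(x, y):
    """Return x * log(y), with the convention that 0 * log(0) = 0."""
    return x * math.log(y) if x > 0 else 0.0


def _binom_loglik(cases, pop, prob):
    """Binomial log likelihood kernel: cases successes among pop trials."""
    return _xlogy(cases, prob) + _xlogy(pop - cases, 1.0 - prob)


def bernoulli_llr(n_z, mu_z, n_g, mu_g):
    """Log likelihood ratio for one zone Z under the Bernoulli model.

    n_z, mu_z: cases and population (measure) inside the zone.
    n_g, mu_g: cases and population in the whole study area.
    Returns 0.0 unless the proportion inside exceeds the one outside.
    """
    if not (0 < mu_z < mu_g and 0 <= n_z <= mu_z and 0 <= n_g - n_z <= mu_g - mu_z):
        raise ValueError("counts must be consistent with the populations")
    p_hat = n_z / mu_z
    q_hat = (n_g - n_z) / (mu_g - mu_z)
    if p_hat <= q_hat:
        return 0.0
    return (_binom_loglik(n_z, mu_z, p_hat)
            + _binom_loglik(n_g - n_z, mu_g - mu_z, q_hat)
            - _binom_loglik(n_g, mu_g, n_g / mu_g))


llr_a = bernoulli_llr(139, 36376, 1503, 753354)  # zone A
llr_b = bernoulli_llr(59, 14388, 1503, 753354)   # zone B
print(llr_a > llr_b)  # prints True
```

Zone A gives the larger value of the two, in line with its being reported as the most likely cluster.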
The most likely cluster, A, consists of the counties of Bladen, Columbus,
Hoke, Robeson, and Scotland, in the southern part of the state. The rank is
1/10000, i.e. a p-value of 0.0001.
There is one other significant cluster, B, composed of Halifax, Hertford,
and Northampton counties in the northeast. With a rank of 5/10000 it has a
p-value of 0.0005. This latter test is conservative, because we are comparing a
secondary cluster in the data set with the most likely clusters from the replicas.
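The step from rank to p-value can be sketched as follows. The sketch assumes 9999 simulated replicas, so that the observed data set and the replicas together number 10000; the text states only the ranks out of 10000. The function name `monte_carlo_p_value` and the replica values are hypothetical.

```python
def monte_carlo_p_value(observed, replica_maxima):
    """Rank of the observed statistic among itself and the replicas,
    divided by the total number of data sets.

    replica_maxima: the maximum likelihood ratio from each simulated
    replica, i.e. the statistic of each replica's most likely cluster.
    """
    rank = 1 + sum(1 for r in replica_maxima if r >= observed)
    return rank / (len(replica_maxima) + 1)


# Hypothetical replicas: none exceeds the first observed value,
# four exceed the second.
replicas = [1.0] * 9995 + [20.0] * 4
print(monte_carlo_p_value(25.0, replicas))  # prints 0.0001
print(monte_carlo_p_value(12.5, replicas))  # prints 0.0005
```

Comparing a secondary cluster against the replicas' maxima, as in the second call, is what makes that test conservative.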
6.2 Poisson Model
Since we are dealing with a rare disease, the Poisson model should give a close
approximation to the Bernoulli model. That the results are indeed similar for
this data set can be seen in Table 1.
The Poisson approximation is especially useful when we have covariates
that we wish to include in the analysis. For SIDS, one possible covariate is
race (Cressie and Chan, 1989), which may be related to SIDS through unobserved
variables such as quality of housing or access to health care. The racial
distribution differs widely among the counties in North Carolina, and could
possibly explain the previously detected clusters. We may want to see if there
are still geographic clusters after adjusting for race. This could lead us to other
spatially related risk factors that are otherwise hidden.
The overall incidence of SIDS is 1.512 for white children, and 2.970 for
non-white children (Cressie and Chan, 1989). The underlying measure at each
county coordinate $x$ can now be defined as
TABLE II
The spatial scan statistic applied to sudden infant death syndrome in North
Carolina, adjusted for race and the uneven geographical distribution of live
births. Zones refer to Figure 2.

                Zone   # SIDs   E[# SIDs]   # Births   p-value
                $Z$    $n_Z$    $\mu(Z)$
Poisson model   A      139      94.5        36376      0.0036
                C      191      140.8       86780      0.0060
ACKNOWLEDGEMENTS
Valuable discussions with Laurence Freedman and Lisa McShane are gratefully
acknowledged. This research was partly funded by the Swedish Research
Council in the Humanities and Social Sciences.
BIBLIOGRAPHY
Naus JI, (1965a). The distribution of the size of the maximum cluster of points
on the line. Journal of the American Statistical Association 60, 532-538.
Sahu SK, Bendel RB and Sison CP, (1993). Effect of relative risk and cluster
configuration on the power of the one-dimensional scan statistic. Statistics in
Medicine 12, 1853-1865.
Turnbull BW, Iwano EJ, Burnett WS, Howe HL and Clark LC, (1990). Monitoring
for clusters of disease: Application to leukemia incidence in upstate
New York. American Journal of Epidemiology 132, S136-S143.
Wallenstein S, Naus J and Glaz J, (1993). Power of the scan statistic for
detection of clustering. Statistics in Medicine 12, 1829-1843.
Whittemore AS, Friend N, Brown BW and Holly EA, (1987). A test to detect
clusters of disease. Biometrika 74, 631-635, and 75, 396.