Interfacing Geostatistics and GIS
Editor
Prof. Dr. Jürgen Pilz
Universität Klagenfurt
Institut für Statistik
Universitätsstr. 65-67
9020 Klagenfurt
Austria
juergen.pilz@uni-klu.ac.at
DOI 10.1007/978-3-540-33236-7
© Springer-Verlag Berlin Heidelberg 2009
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations are
liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Preface
Most of the papers contained in this volume grew out of presentations given
at the International Workshop StatGIS03 – Interfacing Geostatistics, GIS and
Spatial Data Bases, held in Pörtschach, Austria, Sept. 29–Oct. 1, 2003, and out
of the discussions that ensued. Some of the papers are new and were not
presented at the conference. The volume should therefore not be regarded as
conference proceedings in the original sense, but rather as a collection of
self-contained, current contributions to the theme of the conference: the
interfacing of geostatistics, geoinformation systems and spatial data base
management.
Although some progress has been made toward interfacing, we still feel
that there is only limited overlap between the different communities. The present
volume is intended to provide a bridge between specialists working in different
areas. Following the topics of the above-mentioned workshop, the volume
is divided into three parts:
Part I starts with general aspects of geostatistical model building
(Pebesma) and then presents new methodological developments in geostatistics,
in particular concerning neural networks (Parkin and Kanevski) and
Gibbs fields as used in statistical physics (Hristopulos). Furthermore, new
developments in Bayesian spatial interpolation with skewed heavy-tailed data
and new classification methods based on wavelets (Hofer et al.) and support
vector machines (Chaouch et al.) are presented.
Part II contains applications of geostatistics to such diverse areas as
geodetic network modelling (Čepek and Pytel), land use policy (Müller and
Munroe), precipitation field modelling (Ahrens), air pollution monitoring
(Shibli and Dubois), soil characterization (Sunila and Horttanainen) and soil
contamination modelling (Palaseanu-Lovejoy et al.). New application
areas such as traffic modelling (Braxmeier et al.) and the spatial modelling of
entrepreneurship data (Breitenecker et al.) are also touched upon.
Part III is devoted to the issues of the integration of different types of
information systems. The paper by Krivoruchko and Bivand deals with the
problems of interfacing GIS and spatial statistics software systems, from the
List of Contributors
Bodo Ahrens
Institut für Meteorologie und Geophysik, Universität Wien, Wien, Austria
Bodo.Ahrens@univie.ac.at
Gennady Andrienko
Fraunhofer Institute AIS Schloss Birlinghoven, Sankt Augustin, Germany
gennady.andrienko@ais.fraunhofer.de
Natalia Andrienko
Fraunhofer Institute AIS Schloss Birlinghoven, Sankt Augustin, Germany
Robert Barr
School of Geography, The University of Manchester, Manchester, UK
Goze Bénié
Geography and Remote-Sensing Department, Université de Sherbrooke,
Sherbrooke, QC, Canada
Goze.Bertin.Benie@USherbrooke.ca
Hans Braxmeier
Department of Applied Information Processing, University of Ulm, Ulm,
Germany
hans.braxmeier@uni-ulm.de
Robert J. Breitenecker
Department of Innovation Management and Entrepreneurship, University of
Klagenfurt, Klagenfurt, Austria
robert.breitenecker@uni-klu.ac.at
Roger Bivand
Norges Handelshøyskole, Bergen, Norway
Roger.Bivand@nhh.no
Aleš Čepek
Faculty of Civil Engineering, CTU Prague, Prague, Czech Republic
cepek@fsv.cvut.cz
A. Chaouch
Institute of Mineralogy and Geochemistry, University of Lausanne, Lausanne,
Switzerland
aziz.chaouch@etu.unil.ch
Josiane Courteau
PRIMUS group, Clinical Research Center, Centre Hospitalier Universitaire de
Sherbrooke, Sherbrooke, QC, Canada
josiane.courteau@usherbrooke.ca
Charmaine Dean
Statistics and Actuarial Science, Simon-Fraser University, Vancouver, BC,
Canada
dean@stat.sfu.ca
Ian Douglas
School of Geography, The University of Manchester, Manchester, UK
Gregoire Dubois
Radioactivity Environmental Monitoring, Institute for the Environment and
Sustainability, Joint Research Centre, European Commission, Ispra, Italy
gregoire.dubois@jrc.it
J. Ferrándiz
Dpto. Estadística e Investigación Operativa, Universitat de València, València,
Spain
Juan.Ferrandiz@uv.es
Albrecht Gebhardt
Department of Statistics, University of Klagenfurt, Klagenfurt, Austria
agebhard@uni-klu.ac.at
V. Gómez-Rubio
Dpto. Matemáticas, Universidad de Castilla-La Mancha, Albacete, Spain
Virgilio.Gomez@uclm.es
Thorgeir S. Helgason
Petromodel Ltd, Reykjavik, Iceland
thorgeir@petromodel.is
Abbas Hemiari
PRIMUS group, Clinical Research Center, Centre Hospitalier Universitaire de
Sherbrooke, Sherbrooke, QC, Canada
Dionissios T. Hristopulos
Department of Mineral Resources Engineering, Technical University of Crete,
Crete, Greece
dionisi@mred.tuc.gr
Vera Hofer
Department of Statistics and Operations Research, Karl-Franzens University
Graz, Graz, Austria
vera.hofer@uni-graz.at
Pekka Horttanainen
Department of Surveying, Institute of Cartography and Geoinformatics,
Helsinki University of Technology (HUT), Espoo, Finland
Mikhail Kanevski
Institute of Geomatics and Analysis of Risk, University of Lausanne,
Switzerland
Mikhail.Kanevski@unil.ch
Hannes Kazianka
Department of Statistics, University of Klagenfurt, Klagenfurt, Austria
hannes.kazianka@uni-klu.ac.at
Konstantin Krivoruchko
Environmental Systems Research Institute Redlands, Redlands, CA, USA
kkrivoruchko@esri.com
A. López
Dpto. Estadística e Investigación Operativa, Universitat de València, València,
Spain
Antonio.Lopez@uv.es
M. Maignan
Institute of Mineralogy and Geochemistry, University of Lausanne, Lausanne,
Switzerland
Darla K. Munroe
Department of Geography, The Ohio State University, Columbus, OH, USA
munroe.9@osu.edu
Daniel Müller
Leibniz Institute of Agricultural Development in Central and Eastern Europe,
Halle (Saale), Germany
mueller@iamo.de
Théophile Niyonsenga
Epidemiology and Biostatistics, Robert Stempel School of Public Health,
Florida International University (FIU), Miami, FL, USA
theophile.niyonsenga@fiu.edu
Monica Palaseanu-Lovejoy
School of Geography, The University of Manchester, Manchester, UK
monica.palaseanu-lovejoy@stud.man.ac.uk
R. Parkin
Institute of Nuclear Safety (IBRAE), Moscow, Russia
park@ibrae.ac.ru
Edzer J. Pebesma
Institute for Geoinformatics (ifgi), University of Münster, Münster, Germany
edzer.pebesma@uni-muenster.de
Jürgen Pilz
Department of Statistics, University of Klagenfurt, Klagenfurt, Austria
juergen.pilz@uni-klu.ac.at
G. Piller
Swiss Federal Office of Public Health (OFSP), Bern, Switzerland
Philipp Pluch
Energy and Petroleum Resources Services GmbH, Vienna, Austria
ppluch@menpet.at
Alexei Pozdnoukhov
Institute of Geomatics and Analysis of Risk, University of Lausanne,
Switzerland
Alexei.Pozdnoukhov@unil.ch
Jan Pytel
Faculty of Civil Engineering, CTU Prague, Prague, Czech Republic
pytel@fsv.cvut.cz
J. Rodriguez
Swiss Federal Office of Public Health (OFSP), Bern, Switzerland
M. Sambrakos
InfoLab, Agricultural University of Athens, Athens, Greece
marios@aua.gr
Volker Schmidt
Department of Stochastics, University of Ulm, Ulm, Germany
volker.schmidt@uni-ulm.de
Erich J. Schwarz
Department of Innovation Management and Entrepreneurship, University of
Klagenfurt, Klagenfurt, Austria
erich.schwarz@uni-klu.ac.at
Syed Shibli
Landmark Eame Ltd, Aberdeen, Scotland, UK
syed.shibli@googlemail.com
Evgeny Spodarev
Department of Stochastics, University of Ulm, Ulm, Germany
evgeny.spodarev@uni-ulm.de
Gunter Spöck
Department of Statistics, University of Klagenfurt, Klagenfurt, Austria
gunter.spoeck@uni-klu.ac.at
Rangsima Sunila
Department of Surveying, Institute of Cartography and Geoinformatics,
Helsinki University of Technology (HUT), Espoo, Finland
rangsima.sunila@hut.fi
T. Tsiligiridis
InfoLab, Agricultural University of Athens, Athens, Greece
tsili@aua.gr
Alain Vanasse
Family Medicine Department, Université de Sherbrooke, Sherbrooke (QC),
Canada
alain.vanasse@usherbrooke.ca
How We Build Geostatistical Models and Deal
with Their Output
Edzer J. Pebesma
1 Introduction
Multivariable linear geostatistical models extend multivariable, multiple linear
regression models to cases where observations are spatially correlated,
enabling the prediction of values at unobserved locations. In multiple linear
regression, the goal is to explain a large part of the observed variability by a set
of regressors and possibly their interactions. The more variability explained,
the better the prediction. Geostatistics extends this by looking at spatial
correlation in the residual variability: at a prediction location, a nearby residual
may carry predictive value for the residual at that location. However,
much of the geostatistical curriculum (literature and software) does not start
off by attempting to explain variability in the observed variables, but rather
starts by describing and modelling the observed variability after assuming the
trend is spatially constant, thereby potentially ignoring available informative
predictors.
Extensions are universal kriging and external drift kriging [5]. In universal
kriging, only the coordinates are used to explain variability. It is no surprise
that this has not become popular, as coordinates hardly ever carry a physical
relation to the observed variable, and they may lead to extreme, unrealistic
extrapolations near the border of the domain. External drift kriging does extend
kriging interpolation with a linear regression model that uses an external
variable for the trend, but it is most often explained as the case where only a
single predictor (the external drift variable) is present. In the following, we will
not distinguish between universal kriging and external drift kriging, as the
procedures are equivalent [7].
Multivariable prediction has been known for a long time, and has been
applied especially when using one or more secondary variables to predict a
primary variable. The general case where m variables are used to predict m
variables, with m larger than, say, 3, is seldom found in the literature. The reasons
for this do not have a statistical ground, but rather stem from the fact that
2 Geostatistical Prediction
In geostatistics, the variability in an observed variable $Z$, taken at location
$s_i$, is assumed to be the sum of a fixed trend and a random residual: $Z(s_i) =
m(s_i) + e(s_i)$, and the trend is modelled as a linear combination of $p$ unknown
coefficients and $p$ known predictors $X_j(s)$:
$$Z(s) = \sum_{j=1}^{p} X_j(s)\,\beta_j + e(s) = X(s)\beta + e(s), \qquad s \in \{s_1, \ldots, s_n\},$$
with $X_1(s) \equiv 1$ when $\beta_1$ is the intercept, and $X(s)$ the $n \times p$ matrix with
predictors. Given knowledge of the (spatial) covariance of $e$, $V = \mathrm{Cov}(e)$, and
knowledge of the covariance between $e(s)$ and $e(s_0)$, $v = (\mathrm{Cov}(e(s_1), e(s_0)), \ldots,
\mathrm{Cov}(e(s_n), e(s_0)))'$, the best linear unbiased (or kriging) predictor is
obtained by
$$\hat{Z}(s_0) = x(s_0)\hat\beta + v'V^{-1}\big(Z(s) - X(s)\hat\beta\big),$$
where $x(s_0)$ contains the known predictors at location $s_0$, and with prediction error variance
$$\sigma^2(s_0) = \sigma_0^2 - v'V^{-1}v + \eta\,\big(X(s)'V^{-1}X(s)\big)^{-1}\eta',$$
with $\sigma_0^2 = \mathrm{Var}(e(s_0))$ and $\eta = (x(s_0) - v'V^{-1}X(s))$. These equations reduce to
traditional multiple regression prediction if $v = 0$ and $V$ is diagonal (weighted
least squares) or $V = \sigma_0^2 I$ (ordinary least squares) [9], and they reduce to
ordinary kriging if the regression only contains an intercept (i.e., $X(s)$ and
$x(s_0)$ contain only a single column of ones).
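As a minimal numerical sketch (not part of the original text), the predictor above can be coded directly; the generalised least squares estimate is assumed for $\hat\beta$, since its formula is not spelled out here.

```python
import numpy as np

# Minimal sketch (assumption: GLS estimate for beta_hat) of the universal kriging predictor above.
def universal_kriging(Z, X, V, x0, v):
    Vi = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ Z)   # GLS trend coefficients
    return x0 @ beta + v @ Vi @ (Z - X @ beta)           # trend + kriged residual

# Toy 1-D example: exponential residual covariance, intercept-plus-coordinate trend
s = np.array([0.0, 1.0, 2.0, 4.0])
Z = np.array([3.1, 3.9, 4.2, 6.0])
cov = lambda h: np.exp(-np.abs(h) / 2.0)
V = cov(s[:, None] - s[None, :])
X = np.column_stack([np.ones_like(s), s])
s0 = 3.0
print(universal_kriging(Z, X, V, np.array([1.0, s0]), cov(s - s0)))
```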
When multiple, spatially cross correlated variables are present, they may
be used in a multivariable prediction [23], not only to enhance the predictions
of each individual variable, but also to assess the prediction error covariances
for all pairs of variables.
In practice, the application of these equations is often restricted to the
data available in a local neighbourhood around $s_0$. The reasons for this may be
computational, to avoid solving kriging systems with a very large ($n \gg 1000$)
covariance matrix, or statistical, to reduce the assumption of globally constant
regression coefficients to the more flexible assumption of locally constant
regression coefficients.
Another specialty on the geostatistics menu is called change of support:
rather than predicting values $Z(s_0)$ for point locations $s_0$, we may want to
predict the integral (mean) $Z(B_0) = \frac{1}{|B_0|}\int_{u \in B_0} Z(u)\,du$, with $|B_0|$ the area
or volume of integration. Block average values can be obtained by averaging
point kriging values, but block average prediction errors cannot; for this
we need block kriging [5, 13]. The reason for wanting block kriging is that
highly detailed spatial predictions may not be wanted, and that block kriging
prediction errors are always smaller than point kriging prediction errors.
In addition to prediction, it may be useful to simulate realisations of random
fields $Z(s)$ that honour the observed data, the regression relations,
and the spatial correlation [19]. Abrahamsen and Espen Benth [2] describe an
algorithm that provides the simulation equivalent of universal (external drift)
kriging.
Table 1. PCB138 (μg/kg dry matter) data summaries; years marked with a ∗ are
the regular monitoring years, other years result from additional sampling programs

year     1986∗   1987   1989   1991∗   1993    1996∗   2000∗   All
mean      7.29   8.39   4.08    3.70   1.03    1.58    1.27    4.20
median    6.90   7.50   2.65    3.05   0.775   1.40    0.90    2.85
max      21.1   19.7   12.3    13.1    2.7     4.9     3.3    21.1
min       1.60   2.10   1.00    0.70   0.25    0.20    0.20    0.2
n        45     29     14      42      6      49      31     216
Figure 1 shows a bubble plot with the spatial locations of the measurement
sites, per year. Symbol size is proportional to log-concentration, which is the
natural scale to view such variables. The summary statistics of Table 1 already
reveal that PCB138 decreases over time. Figure 1 furthermore shows that
high concentrations appear close to the coast. Simply looking at how PCB138
concentrations decrease with time may not be appropriate because the spa-
tial locations of sampling vary from year to year, and the sampling pattern
is not random. The sampling pattern (Fig. 1) is directed towards transects
perpendicular to the Dutch coast (the direction of the main gradient), and
seems clustered; many short distances are present.
3.2 Trend
Fig. 2. PCB138 concentration as a function of sea water depth, for each of the
measured years
e(s) the residual. The regression model explains 77% of the variability in log-
PCB138. Under the assumption of independent data, (i) all terms were highly
significant (p < .001), and (ii) an interaction between year and depth (i.e., a
year-dependent regression slope with depth) was not significant. Clearly, these
significance assertions are of little value, as the data vary spatially, and we
may assume that they are spatially correlated.
Each of the monitoring years has too few measurements to model a
residual variogram (Table 1). For that reason, the residual information of
all years was merged. Simply merging all residuals leads to the variogram
in the first panel of Fig. 3. This would be a valid approach if the residual
spatial pattern were constant over time. Constructing a pooled variogram
by only considering point pairs with both measurements in the same year
(rest of Fig. 3) shows that the hypothesis of a temporally constant spatial
pattern is not valid: a much stronger spatial correlation is revealed under the
hypothesis that only the spatial variability (variogram) is persistent over time.
One single residual variogram model was fitted for all within-year residuals,
$\gamma(h) = 0.08\,\delta(h) + 0.224\,\big(1 - \exp(-h/17247)\big)$, with $\delta(h) = 0$ if $h = 0$ and
$\delta(h) = 1$ if $h > 0$ (last panel of Fig. 3).
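For illustration, the fitted model can be evaluated at arbitrary lags; this small sketch (not from the paper) simply codes the nugget-plus-exponential form quoted above.

```python
import numpy as np

# Sketch: evaluate the fitted pooled residual variogram quoted above
# (nugget 0.08, exponential contribution with sill 0.224 and distance parameter 17247 m).
def gamma(h):
    h = np.asarray(h, dtype=float)
    nugget = np.where(h > 0.0, 0.08, 0.0)          # delta(h) = 0 at h = 0, 1 for h > 0
    return nugget + 0.224 * (1.0 - np.exp(-h / 17247.0))

print(gamma([0.0, 5000.0, 50000.0, 150000.0]))     # semivariances at selected lags (m)
```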
Fig. 3. Different approaches to modelling the sample variogram of the residuals
of the linear regression lines in Fig. 2; top left: residual variogram; top right:
pooled, within-year residual variogram; bottom left: short-distance variogram values
split into smaller distance intervals; bottom right: a model fitted to the bottom
left variogram. Numbers reflect the point pairs that contribute to the sample variogram
estimates
Fig. 4. Sample direct and cross variograms for the four main measurement years,
and fitted Intrinsic Correlation model. The direct variogram model is that of Fig. 3; each
cross variogram is scaled down by a factor equal to the pointwise correlation of the
pair of years; pointwise correlations are approximated by joining spatially nearest
neighbours to form data pairs
thus found, the correlation coefficient was calculated. Next, the two years were
reversed, and a second correlation coefficient was calculated. The average of
these two correlations was used to model the cross variograms of Fig. 4. Be-
cause spatially nearest neighbours of year y were used to approximate the
measured value at a certain location in year x, the estimated correlations
must underestimate the true correlations.
Spatio-temporal prediction under model (1), given the data for each of the
four “main” years and given the direct variograms and the cross variograms
of Fig. 4 is simply a matter of universal cokriging. Universal cokriging yields
spatial predictions for each of the four years, shown in Fig. 5, and yields
in addition spatial prediction error variances for each of the four years, and
spatial prediction error covariances for prediction errors of all pairs of years.
Spatially differentiated estimates of trends can be assessed by combining the
yearly predictions and prediction error (co)variances.
[Fig. 5: universal cokriging predictions for the four main years; panels 1986.pred, 1991.pred, 1996.pred, 2000.pred]
Cokriging basically yields for each location $s_0$ a vector of predictions, which
in our case could be written $y(s_0) = (y_{86}(s_0), y_{91}(s_0), y_{96}(s_0), y_{00}(s_0))'$, along
with the prediction error covariance matrix $\mathrm{Cov}(y(s_0))$. Given this vector we
can calculate for each location $s_0$ a contrast
$$C(s_0) = \lambda'\, y(s_0)$$
• prediction of the difference between the means of 1986 and 1991 versus
the mean of 1996 and 2000: $\lambda = (-\tfrac12, -\tfrac12, \tfrac12, \tfrac12)$
• prediction of the average yearly increase: $\lambda = (-0.065, -0.02, 0.025, 0.061)$
The weights of the latter contrast, which is obviously of major interest when
we want to assess spatially differentiated trends, are obtained as follows. Trend
estimation uses linear regression for predicting concentrations from years,
$y(s_0) = \beta_0(s_0) + \beta_1(s_0)t + e = X\beta(s_0) + e$. The ordinary least squares estimate
of $\beta$ is $(X'X)^{-1}X'y$. The contrast coefficients that estimate $\beta_1(s_0)$ are in the
second row of $(X'X)^{-1}X'$, with
$$X = \begin{bmatrix} 1 & 1986 \\ 1 & 1991 \\ 1 & 1996 \\ 1 & 2000 \end{bmatrix}.$$
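As a quick check (not in the original), the contrast weights for the yearly trend can be reproduced numerically from the second row of $(X'X)^{-1}X'$; the few lines below return values close to $(-0.065, -0.02, 0.025, 0.061)$.

```python
import numpy as np

# Design matrix for an intercept-plus-year linear trend (years 1986, 1991, 1996, 2000)
years = np.array([1986, 1991, 1996, 2000])
X = np.column_stack([np.ones_like(years), years])

# OLS estimator (X'X)^{-1} X'; its second row gives the contrast weights for the slope
H = np.linalg.inv(X.T @ X) @ X.T
lam = H[1]
print(np.round(lam, 3))   # approximately [-0.065 -0.020  0.025  0.061]
```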
Figure 6 shows the predicted trends, as well as the trend predictions di-
vided by their own prediction standard error. Clearly, the majority of the area
Fig. 6. Predicted trends (in ppm/year) for each point location (left); and relative
predicted trends, expressed as fraction of their own prediction standard error (right).
On the right, under the assumed model, relative predicted trends smaller than -2
tentatively indicate trends that cannot be attributed to pure chance (i.e. that are
significant)
4 Shortcomings
This case study shows some of the capabilities of the gstat package for R
[12], or for S-PLUS, which extensively uses the graphics capabilities of the
Trellis/lattice graphics package [6]. The gstat program [18] or R package
[20] offers flexibility with respect to trend modelling, multivariate variogram
modelling, multivariate prediction and simulation, change of support and pre-
diction in a local neighbourhood. Features that it does not address include,
for example, flexible three-dimensional anisotropic variogram modelling, Bayesian
handling of uncertainty in variogram model coefficients [22], and multivariable
space-time modelling in continuous time (i.e., where time is a dimension
rather than a discrete variable, as in the case study of this paper). These
features are available in other R packages or other environments, where
they are potentially hard to combine with the features offered by gstat.
5 Discussion
There may be various historical reasons for not starting off with a linear
regression model for the trend when modelling spatial data. First, geostatistics
was developed by mining engineers, who usually did not have useful predictor
variables available other than the spatial coordinates of observations and
prediction locations. Second, the sample variogram of estimated residuals
is biased because of the estimation of the trend, which would itself require the
true variogram: a chicken-and-egg problem raised by Armstrong [1] but settled by
Kitanidis [15]. Third, leading authors have suggested that predictors are not
needed [14], but that the observations themselves carry enough information. This
is indeed the case when observations are abundant and not too noisy, in which
case even (geo)statistics could be ignored altogether and any contouring
algorithm would suffice. All these factors have led to a situation where much of
the available geostatistical software (GSLIB, [8]; GsTL, [21]; ArcGIS
Geostatistical Analyst, [16]; Isatis, http://www.geovariances.fr) has little
flexibility with respect to modelling external drifts with multiple linear
regression models.
Spartan Random Fields
Dionissios T. Hristopulos
1 Introduction
Spartan spatial random fields (SSRFs) were introduced in [10]. Certain mathe-
matical properties of SSRFs were presented, inference of the model parameters
from synthetic samples was investigated [10], and methods for the uncondi-
tional simulation of SSRFs were developed [11]. This research has focused on
the fluctuation component of the spatial variability, which is assumed to be
statistically homogeneous (stationary) and normally distributed. The probability
density function (pdf) of Spartan fields is determined from an energy
functional $H[X_\lambda(s)]$, according to the Gibbs distribution familiar from
statistical physics,
$$f_X[X_\lambda(s)] = \frac{1}{Z}\,\exp\big(-H[X_\lambda(s)]\big).$$
The constant $Z$ (called the partition function) is the pdf normalization factor
obtained by integrating $\exp(-H)$ over all degrees of freedom (i.e. states of the
SSRF). The subscript $\lambda$ denotes the fluctuation resolution scale. The energy
functional determines the spatial variability by means of interactions between
neighboring locations. One can express the multivariate Gaussian pdf, typically
used in classical geostatistics, in terms of such an energy functional.
In the fluctuation – gradient – curvature (FGC) model, the pdf involves three
main parameters: the scale factor $\eta_0$, the covariance shape parameter $\eta_1$, and
the correlation length ξ . Another factor that adds flexibility to the model is
the coarse-graining kernel that determines the fluctuation resolution λ [10].
As we show below, the resolution is directly related to smoothness properties
of the SSRF. In previous work [10, 11], we have used a kernel with a boxcar
spectral density that imposes a sharp cutoff in frequency (wavevector) space
at kc ∝ λ−1 . We have treated the cutoff frequency as a constant, but it is
also possible to consider it as an additional model parameter, in which case
Np = 4.
A practical implication of an interaction-based energy functional is that
the parameters of the model follow from simple sample constraints that do not
require the full calculation of two-point functions (e.g., correlation function,
variogram). This feature permits fast computation of the model parameters.
In addition, for general spatial distributions (e.g., irregular distribution of
sampling points, anisotropic spatial dependence with unknown a priori prin-
cipal directions), the parameter inference does not require various empirical
assumptions such as choice of lag classes, number of pairs per class, lag and
angle tolerance, etc. [7] used in the calculation of two-point functions. In the
case of SSRFs that model data distributed on irregular supports, the interaction
between ‘near neighbors’ is not uniquely defined. Determin-
ing the neighbor structure for irregular supports increases the computational
effort [10], but the model inference process is still quite fast. Methods for the
non-constrained simulation of SSRFs with Gaussian probability densities on
the square lattice (by filtering Gaussian random variables in Fourier space and
reconstructing the state in real space with the inverse FFT) and for irregular
supports (based on a random phase superposition of cosine modes with fre-
quency distribution modeled on the covariance spectral density), have been
presented in [11].
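A rough sketch of the Fourier-filtering simulation idea mentioned above is given below; it assumes a two-dimensional square lattice, the FGC spectral density given in (5) below with a boxcar cutoff, and it glosses over normalisation constants, so it is illustrative rather than a reproduction of the method in [11].

```python
import numpy as np

# Illustrative lattice simulation by Fourier filtering: scale the FFT of white noise
# by the square root of the (boxcar-limited) FGC spectral density and invert.
def simulate_fgc_lattice(n, eta0, eta1, xi, kc):
    kx = 2.0 * np.pi * np.fft.fftfreq(n)
    k = np.sqrt(kx[:, None]**2 + kx[None, :]**2)
    sd = eta0 * xi**2 * (k <= kc) / (1.0 + eta1 * (k * xi)**2 + (k * xi)**4)  # d = 2
    noise = np.fft.fft2(np.random.randn(n, n))
    field = np.fft.ifft2(np.sqrt(sd) * noise).real       # normalisation glossed over
    return field - field.mean()

x = simulate_fgc_lattice(128, eta0=1.0, eta1=-1.5, xi=4.0, kc=2.0)
```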
The energy functional involves the SSRF states (configurations) Xλ (s). For
notational simplicity, we will not use different symbols for the random field and
its states in the following. As hinted above, the energy functional is properly
defined for SSRFs Xλ (s) with an inherent scale parameter ‘λ’ that denotes
the spatial resolution of the fluctuations. At lower scales, the fluctuations are
coarse-grained. The fluctuation resolution scale is physically meaningful, since
it would be unreasonable to expect a model of fluctuations to be valid for all
length scales. In contrast with classical random field representations, which
do not have a built-in scale for a fluctuation cutoff, SSRFs provide an explicit
‘handle’ for this meaningful parameter. In practical situations, the fluctuation
resolution scale is linked to the measurement support scale and the sampling
density. In the case of numerical simulations, the lattice spacing provides
a lower bound for λ. The fluctuation resolution can also exceed the lattice
spacing, to allow for smoother variations of the field. The general probability
density function of continuum FGC Spartan random fields (FGC-SSRF) in
$\mathbb{R}^d$ is determined from the following functional
$$H_{\mathrm{fgc}}[X_\lambda] = \frac{1}{2\eta_0 \xi^d}\int ds\; h_{\mathrm{fgc}}[X_\lambda(s); \eta_1, \xi],\qquad (3)$$
where $\eta_0$ is a scale factor with dimensions $[X]^2$ that determines the magnitude
of the overall variability of the SSRF, $\eta_1$ is a covariance shape parameter
(dimensionless), $\xi$ is the correlation length, and $h_{\mathrm{fgc}}$ is the normalized (to
$\eta_0 = 1$) local energy at the point $s$. In the case of a Gaussian FGC random
field with mean (not necessarily stationary) $m_{X;\lambda}(s) = E[X_\lambda(s)]$ and isotropic
spatial dependence of the fluctuations, the functional $h_{\mathrm{fgc}}[X_\lambda(s);\eta_1,\xi]$ is given
by the following
$$h_{\mathrm{fgc}}[X_\lambda(s); \eta_1, \xi] = [\chi_\lambda(s)]^2 + \eta_1\,\xi^2\,[\nabla \chi_\lambda(s)]^2 + \xi^4\,\big[\nabla^2 \chi_\lambda(s)\big]^2,\qquad (4)$$
where χλ (s) is the local fluctuation field. The functional (4) is permissible
if Bochner’s theorem [3] for the covariance function is satisfied. As shown in
[10], permissibility requires η1 > −2. The covariance spectral density follows
from the equation
$$\tilde{G}_{x;\lambda}(k) = \frac{\eta_0\,\xi^d\,\big|\tilde{Q}_\lambda(k)\big|^2}{1 + \eta_1 (k\,\xi)^2 + (k\,\xi)^4}\qquad (5)$$
where Q̃λ (k) is the Fourier transform of the smoothing kernel. If the latter
is the boxcar filter with cutoff at kc , (5) leads to a band-limited spectral
density G̃x;λ (k). For negative values of η1 the spectral density develops a
sharp peak, and as η1 approaches the permissibility boundary value equal to
−2, the spectral density tends to become singular. For negative values of η1
the structure of the spectral density leads to a negative hole in the covariance
function in real space. If Q̃λ (k) has no directional dependence, the spectral
density depends on the magnitude but not the direction of the frequency
vector k. Thus, the covariance is an isotropic function of distance in this case.
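The behaviour described above is easy to probe numerically; the following sketch (assumptions: two dimensions and a boxcar kernel with $|\tilde{Q}_\lambda(k)|^2 = 1$ for $k \le k_c$) evaluates the spectral density (5) and shows the peak that develops as $\eta_1$ approaches $-2$.

```python
import numpy as np

# Illustrative sketch (not from the paper): evaluate the FGC spectral density (5)
# with a boxcar kernel to see the peak that develops for negative eta_1.
def fgc_spectral_density(k, eta0, eta1, xi, kc, d=2):
    boxcar = (k <= kc).astype(float)              # sharp cutoff in frequency space
    return eta0 * xi**d * boxcar / (1.0 + eta1 * (k * xi)**2 + (k * xi)**4)

k = np.linspace(0.0, 2.0, 500)
for eta1 in (2.0, 0.0, -1.9):                      # eta_1 must stay above -2 (permissibility)
    G = fgc_spectral_density(k, eta0=1.0, eta1=eta1, xi=5.0, kc=1.5)
    print(eta1, float(G.max()))
```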
On regular lattices, the FGC spectral density is obtained by replac-
ing the operators ∇ and ∇2 in the energy functional with the correspond-
ing finite differences. Then, the local energy becomes hfgc [Xλ (s); η1 , ξ] =
hfgc [χλ {U (s); η1 , ξ}], where U (s) = s ∪ nnb(s) is the local neighborhood set
that contains the point s and its nearest lattice neighbors, χλ {U (s)} is the
set of the SSRF values at the points in U (s), and hfgc [·] is a quadratic func-
tional of the SSRF states that defines interactions between the fluctuation
values $\chi_\lambda\{U(s)\}$. For irregular spatial distributions, there is more than one
possibility for modeling the interactions. One approach, explored in [10], is
to define a background lattice that covers the area of interest and to construct
interactions between the cells of the background lattice. If CB (s) denotes the
cell of the background lattice that includes the point s and nnb {CB (s)} is
the set of nearest neighbors of the cell CB (s), the local neighborhood set in-
volves the sampled points that belong to the cell CB (s) and its neighbors, i.e.
U (s) = s ∈ CB (s) ∪ nnb {CB (s)}.
3 Model Inference
The problem of model inference from available data is a typical inverse prob-
lem. In order to determine the model parameters experimental constraints
need to be defined that capture the main features of the spatial variability
in the data. These constraints should then be related to the interactions in
the SSRF energy functional. The experimental constraints used in [10] for the
square lattice are motivated by the local ‘fluctuation energy measures’
$S_0(s) = \chi^2_\lambda(s)$, $S_1(s) = \sum_{i=1}^{d}\big[\nabla_i \chi_\lambda(s)\big]^2$, and
$S_2(s) = \sum_{i,j=1}^{d}\Delta_2^{(i)}[\chi_\lambda(s)]\,\Delta_2^{(j)}[\chi_\lambda(s)]$,
where $\Delta_2^{(i)}$ denotes the centered second-order difference operator in the $i$-th direction. The respective
experimental constraints are then given by $\overline{S_0(s)}$ (sample variance), $\overline{S_1(s)}$
(average square gradient) and $\overline{S_2(s)}$, where the bar denotes the sample average.
The respective stochastic constraints are $E[S_m(s)]$, $m = 0, 1, 2$, and they
can be expressed in terms of the covariance function. For the isotropic FGC
model, calculation of the stochastic constraints involves a one-dimensional
numerical integration over the magnitude of the frequency. Matching of the
stochastic and experimental constraints is formulated as an optimization prob-
lem in terms of a functional that measures the distance between the two sets
[10] of constraints. Minimization of the distance functional leads to a set of
optimal values η0∗ , η1∗ , ξ ∗ for the model parameters. Use of kc as a fourth param-
eter needs further investigation. It should be noted that constraint matching
is based on the ergodic assumption, and thus a working approximation of
ergodicity should be established for the fluctuation field.
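As an illustration of the experimental constraints, the following sketch (not from the paper) computes sample averages of $S_0$, $S_1$ and $S_2$ on a square lattice with unit spacing using finite differences; the array `chi` is a placeholder fluctuation field, not data from the paper.

```python
import numpy as np

# Sketch: sample-based constraints on a square lattice (unit spacing), following the
# local energy measures S0, S1, S2 described above; chi is a 2-D array of fluctuations.
def sample_constraints(chi):
    s0 = np.mean(chi**2)                                        # sample variance term
    gx = chi[1:, :] - chi[:-1, :]                               # forward difference, x
    gy = chi[:, 1:] - chi[:, :-1]                               # forward difference, y
    s1 = np.mean(gx**2) + np.mean(gy**2)                        # average square gradient
    lap = (chi[2:, 1:-1] + chi[:-2, 1:-1] + chi[1:-1, 2:]       # centred second differences,
           + chi[1:-1, :-2] - 4.0 * chi[1:-1, 1:-1])            # summed over directions
    s2 = np.mean(lap**2)
    return s0, s1, s2

chi = np.random.randn(64, 64)        # placeholder fluctuation field
print(sample_constraints(chi))
```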
The probability density of the FGC-SSRF involves the first- and second-order
derivatives of the field’s states. This requires defining the energy functional in
a manner consistent with the existence of the derivatives. In general, for Gaus-
sian random fields [1, 15], the $n$th-order derivative $\partial^n X_\lambda(s)/\partial s_1^{n_1}\cdots\partial s_d^{n_d}$ exists
in the mean square sense if (i) the mean function $m_{X;\lambda}(s)$ is differentiable,
and (ii) the following derivative of the covariance function exists [1, 15]:
$$\left.\frac{\partial^{2n} G_{x;\lambda}(s,p)}{\partial s_1^{n_1}\cdots\partial s_d^{n_d}\,\partial p_1^{n_1}\cdots\partial p_d^{n_d}}\right|_{s=p},\qquad n = n_1 + \cdots + n_d.\qquad (6)$$
For an isotropic covariance, condition (ii) amounts to the existence of the derivative
of order $2n$ at zero pair separation distance, i.e. the existence of the following
quantity:
$$G^{(2n)}_{x;\lambda}(0) = (-1)^n \left.\frac{d^{2n} G_{x;\lambda}(r)}{dr^{2n}}\right|_{r=0}.\qquad (7)$$
Equation (7) is equivalent to the existence of the corresponding integral of
the covariance spectral density,
$$\left.\frac{d^{2n} G_{x;\lambda}(r)}{dr^{2n}}\right|_{r=0} = \eta_0\,\xi^d\, S_d \int_0^{\infty} dk\; \frac{k^{d+2n-1}\,\big|\tilde{Q}_\lambda(k)\big|^2}{1 + \eta_1 (k\,\xi)^2 + (k\,\xi)^4},\qquad (8)$$
where $S_d = \int d\hat{k} = 2\pi^{d/2}/\Gamma(d/2)$ denotes the surface of the unit sphere in $d$
dimensions. Note that if $\big|\tilde{Q}_\lambda(k)\big|^2 = 1$, i.e. in the absence of smoothing, the
above integral does not exist unless $d + 2n < 4$, which can be attained only for
$d = 1$ and $n = 1$. If the smoothing kernel has a sharp cutoff $k_c$ (band-limited
spectrum), the $2n$-th order derivative is expressed in terms of the following
integral:
$$\left.\frac{d^{2n} G_{x;\lambda}(r)}{dr^{2n}}\right|_{r=0} = \eta_0\,\xi^{-2n}\, S_d \int_0^{k_c\xi} d\kappa\; \frac{\kappa^{d+2n-1}}{1 + \eta_1 \kappa^2 + \kappa^4}.\qquad (9)$$
The integral in (9) exists for all $d$ and $n$. However, if the correlation length $\xi$
significantly exceeds the resolution scale, i.e. $\xi \gg \lambda$ and $k_c\xi \gg 1$, the integrand
behaves as $\kappa^{d+2n-5}$ for $\kappa \gg 1$. It then follows that $G^{(2n)}_{x;\lambda}(0) = \mathrm{regular} +
\alpha_d\,\xi^{-2n}(k_c\xi)^{d+2n-4}$, where ‘regular’ represents the bounded contribution of
the integral, while for fixed $\xi$ the remaining term increases fast with $k_c\xi$. The
constant $\alpha_d$ depends on the dimensionality of space. Hence, for $d \ge 2$ the
singular term in $G^{(2n)}_{x;\lambda}(0)$ leads to large values of the covariance derivatives
for $n \ge 1$. In [10] we focused on the case $k_c\xi \gg 1$, which leads to ‘rough’
Spartan fields. Based on the above, the Gaussian FGC-SSRF can, at least
in principle, interpolate between very smooth Gaussian random fields (e.g.,
Gaussian covariance function) and non-differentiable ones (e.g., exponential,
spherical covariance functions). The ‘degree’ of smoothness depends on the
value of the combined parameter $k_c\xi$. Hence, the FGC-SSRF in effect has four
parameters, $\eta_0, \eta_1, k_c, \xi$, and the value of $k_c\xi$ controls the smoothness
of the model. This property of smoothness control is also shared by random
fields with Matérn class covariance functions [14].
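The scaling of the covariance derivatives with $k_c\xi$ can be checked numerically; the sketch below (not from the paper, scipy assumed) evaluates the integral in (9) for $d = 2$, $n = 2$, for which $d + 2n - 4 = 2$, and shows the rapid growth with the cutoff.

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of the scaling discussed above: the integral in (9) for d = 2, n = 2,
# which should grow roughly like (kc*xi)^(d+2n-4) = (kc*xi)^2 for large cutoffs.
def deriv_integral(kc_xi, eta1, d=2, n=2):
    integrand = lambda kappa: kappa**(d + 2 * n - 1) / (1.0 + eta1 * kappa**2 + kappa**4)
    return quad(integrand, 0.0, kc_xi)[0]

for kc_xi in (1.0, 10.0, 100.0):
    print(kc_xi, deriv_integral(kc_xi, eta1=1.0))
```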
(higher than second order) interaction terms in the energy functional. An ex-
ample is the energy functional of the Landau model e.g. [10], which includes
non-Gaussian terms and exhibits a transition between exponential and power-
law spatial dependence of the covariance function. Geostatistical probability
density models provide sufficient flexibility for fitting various types of non-
Gaussian data. The approaches typically used in geostatistics for modeling
asymmetric distributions with higher-than-normal weight in their tails em-
ploy the logarithmic and the Box-Cox transforms. In the former approach,
the initial distribution is assumed to be approximately lognormal. The log-
arithmic mean mY (s) = E [log Xλ (s)] is first estimated. Then, the fluctu-
ations yλ (s) = log [Xλ (s)] − mY (s) follow the Gaussian distribution, and
they can be modeled by means of the FGC-SSRF normalized energy den-
sity hfgc [yλ (s); η1 , ξ]. If the logarithm of the random field deviates from the
Gaussian distribution, it is possible to modify the energy functional by adding
a non-Gaussian term as follows
The ratio $\overline{S_3}\,\overline{S_0}^{-3/2}$ represents the sample skewness coefficient, while $\overline{S_4}\,\overline{S_0}^{-2}$
the sample kurtosis coefficient. In the case of the Gaussian FGC-SSRF model,
the stochastic moments E [Sm ] , m = 0, 1, 2 (which are used in determining
the model parameters) are expressed exactly in terms of the two-point covari-
ance function. The covariance spectral density also follows directly from the
energy functional. Such explicit expressions are not available for non-Gaussian
energy functionals. The moments must be calculated either by numerical in-
tegration (e.g., Monte Carlo methods) for each set of parameters visited by
the optimization method or by approximate, explicit methods that have been
developed in the framework of many-body theories, e.g. [5, 8, 9, 13].
In statistical physics, e.g. [4, 5, 6] there is a long literature on approxi-
mate but explicit methods (variational approximations, Feynman diagrams,
renormalization group, replicas) that address calculations with non-Gaussian
We present the formalism of the variational method assuming that the SSRF
is defined in a discretized space (e.g. on a lattice). The fluctuation random
field and its states are denoted by the vector y. The characteristic function
$Z[J]$ corresponding to the energy functional $H$ is defined as
$$Z[J] = \mathrm{Tr}\; e^{-H + J\cdot y}.$$
The symbol ‘Tr’ denotes the trace over all the field variables in $H$. For a lattice
field the trace is obtained by integrating over the fluctuations at every point
of the lattice. The cumulant generating functional (CGF) is defined by
$$F[J] = -\log Z[J].$$
The cumulants of the distribution are obtained from the derivatives of the
CGF with respect to J. For example, the mean is given by
$$E[y(s_i)] = -\left.\frac{\partial F[J]}{\partial J_i}\right|_{J=0},\qquad (14)$$
Higher-order cumulants are given by higher order derivatives of the CGF. The
CGF of the Gaussian part H0 −J·y is denoted as F0 [J]. Let us now consider a
variational Gaussian energy functional $H_0$, which is in general different from
the Gaussian component $H_G$ of $H$. The average of an operator $A$ with respect
to the pdf with energy $H_0$ is obtained by means of
$$\langle A\rangle_0 = \frac{\mathrm{Tr}\, A\, e^{-H_0}}{\mathrm{Tr}\, e^{-H_0}}.\qquad (16)$$
The following inequality [5] is valid for all $H_0$:
$$F[J] \le F_0[J] + \langle H - H_0\rangle_0.\qquad (17)$$
The optimal $\hat{H}_0$ that gives the best approximation of $F[J]$ is obtained by
minimizing the variational bound $F_0 + \langle H - H_0\rangle_0$ with respect to the parameters
of $H_0$. The optimal Gaussian pdf has energy $\hat{H}_0$ and provides approximate
estimates of the non-Gaussian covariance function.
It is possible to improve on the variational approximation by expressing
the energy functional $H$ as
$$H = \hat{H}_0 + \big(H_G - \hat{H}_0 + \delta H\big)$$
and treating the component $H_{\mathrm{pert}} = H_G - \hat{H}_0 + \delta H$ of the energy functional as
a perturbation around the optimal Gaussian $\hat{H}_0$. Corrections of the stochastic
moments can then be obtained either by means of simple (low-order) perturbation
expansions, or by means of diagrammatic perturbation methods.
However, there is no a priori guarantee that such corrections will lead to more
accurate estimates, and such approximations must be investigated for each
energy functional.
Here we present a simple example for a univariate non-Gaussian pdf, which il-
lustrates the application of the variational method. Consider the non-Gaussian
energy functional
$$H(y) = \alpha^2 y^2 + \beta^4 y^4,\qquad (19)$$
where $y$ is a fluctuation with variance $E[y^2]$, and the average is over the pdf
$p(y) = Z^{-1}\exp(-H)$. The following Gaussian variational expression is used
as an approximation of the non-Gaussian pdf:
$$p_0(y) = \big(\sqrt{2\pi}\,\sigma\big)^{-1}\exp\!\big(-y^2/2\sigma^2\big).\qquad (20)$$
Hence, the variational energy functional is $H_0 = y^2/2\sigma^2$ and $\sigma$ is the variational
parameter. It follows that $F_0 = -\log(\sqrt{2\pi}\,\sigma)$ and $\langle H - H_0\rangle_0 =
\alpha^2\sigma^2 + 3\,\beta^4\sigma^4 - 1/2$. The variational bound given by (17) is a convex upward
function of $\sigma$, as shown in Fig. 1. The bound is minimized for the following
value of $\sigma$:
$$\hat\sigma = \left[\frac{\alpha^2}{12\,\beta^4}\Big(\sqrt{1 + 12\,\rho^4} - 1\Big)\right]^{1/2} = \left[\frac{1}{12\,\alpha^2\rho^4}\Big(\sqrt{1 + 12\,\rho^4} - 1\Big)\right]^{1/2}.\qquad (21)$$
In the above, $\rho = \beta/\alpha$ is the dimensionless ratio of the quartic over the quadratic
pdf parameters, which measures the deviation of the energy functional from the
Gaussian form. The value of $\hat\sigma^2$ is the variational estimate of the variance.
The exact variance, calculated by numerical integration, and the variational
approximation for various values of the dimensionless coefficient ratio $\rho = \beta/\alpha$
Fig. 1. Plots of the variational bound as a function of σ for four different values of
the ratio β/α
are plotted in Fig. 2, which shows that the variational estimate is an excellent
approximation of the exact result even for large values of the ratio ρ. Esti-
mates based on first-order and cumulant perturbation expansions around the
optimal Gaussian (these will be presented in detail elsewhere) are also shown
in Fig. 2. The additional corrections do not significantly alter the outcome
of the variational approximation for the variance, since all three plots almost
coincide. However, such corrections will be necessary for calculating higher
moments of non-Gaussian distributions. For example, the kurtosis of the
Fig. 2. Plots of the exact variance (numerical) and approximate estimates based
on the variational approach as well as combinations of variational and perturbation
methods (first order and cumulant expansion)
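The agreement shown in Fig. 2 is straightforward to reproduce; the following sketch (not from the paper, scipy assumed) compares the exact variance of $p(y) \propto \exp(-\alpha^2 y^2 - \beta^4 y^4)$, obtained by numerical integration, with the variational estimate $\hat\sigma^2$ derived above.

```python
import numpy as np
from scipy.integrate import quad

# Compare the exact variance of p(y) ~ exp(-(alpha^2 y^2 + beta^4 y^4)), obtained by
# numerical integration, with the variational estimate sigma_hat^2 from (21).
def exact_variance(alpha, beta):
    H = lambda y: alpha**2 * y**2 + beta**4 * y**4
    Z = quad(lambda y: np.exp(-H(y)), -np.inf, np.inf)[0]
    m2 = quad(lambda y: y**2 * np.exp(-H(y)), -np.inf, np.inf)[0]
    return m2 / Z

def variational_variance(alpha, beta):
    return (np.sqrt(alpha**4 + 12.0 * beta**4) - alpha**2) / (12.0 * beta**4)

alpha = 1.0
for rho in (0.2, 0.6, 1.0, 3.0):
    beta = rho * alpha
    print(rho, round(exact_variance(alpha, beta), 4),
          round(variational_variance(alpha, beta), 4))
```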
6 Discussion
Bayesian Trans-Gaussian Kriging
1 Introduction
All these negative facts were the motivation for us to look for more ad-
vanced kriging methodologies that relax the Gaussian assumption and the
disadvantage of not taking into account the uncertainty of the covariance
function.
One of the first papers that addressed the above deficiencies and influenced
our work was De Oliveira [15]. He developed a Bayesian trans-Gaussian kriging
where uncertainties could be specified for the trend, the covariance function
and the parameter of the transformation function. The transformation
he used to make a skew random field Gaussian was the Box-Cox transformation.
Motivated by the conjugacy of the normal-inverse-gamma family to the
normal sampling distribution, he used exactly this kind of prior for the sill,
the range and the trend parameters of his model. His approach differs
from ours in that his prior is informative. Berger et al. [3]
were the first to investigate non-informative reference priors for correlated
stochastic processes as well. The disadvantage of their approach is that the assumed
random field must be Gaussian. Some time later the paper [16] appeared,
in which a non-informative prior for the transformed Gaussian model was also
proposed, but with a fixed range parameter, while the sill and the nugget are
variable.
Our approach seems to be the first completely non-informative approach
to trans-Gaussian Bayesian kriging. We avoid the specification of a prior
distribution by means of a parametric bootstrap, where the sampling distribution
of the maximum likelihood estimates is taken as the posterior for the unknown
parameters.
The restriction $\log(z) > -\lambda$ for all $z$ in the support of the random process
must be fulfilled. This restriction is not relevant in applications, since we only
have a finite set of observations and thus $\min_z \log(z) > -\infty$. The main
advantage of the log-log transformation is that, because of the double logarithm,
highly skewed data can potentially be transformed to a normal distribution.
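A minimal sketch of a double-logarithm transform consistent with the stated restriction is given below; the exact parameterisation used by the authors is not reproduced in the excerpt, so the form $g_\lambda(z) = \log(\lambda + \log z)$ and its inverse are assumptions.

```python
import numpy as np

# Hypothetical form of the log-log transform, g_lam(z) = log(lam + log(z)),
# chosen so that the stated restriction log(z) > -lam makes the argument positive;
# the authors' exact parameterisation may differ.
def g(z, lam):
    z = np.asarray(z, dtype=float)
    return np.log(lam + np.log(z))

def g_inv(y, lam):
    return np.exp(np.exp(y) - lam)

z = np.array([0.5, 2.0, 50.0, 1500.0])     # skewed data, e.g. radioactivity levels
lam = 1.0                                   # must satisfy log(min(z)) > -lam
y = g(z, lam)
assert np.allclose(g_inv(y, lam), z)
```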
For the trend and error model of the transformed Gaussian random field
we assume the conventional geostatistical model:
E{Y (x)} = μ,
where μ is a constant trend. For the Gaussian error model we assume a co-
variance function $C_{\theta,\sigma^2}$ of the form
$$C_{\theta,\sigma^2}(x_1, x_2) = \sigma^2\, k_\theta(x_1 - x_2),$$
where $\sigma^2 = \mathrm{var}\{Y(x)\}$ denotes the variance (overall sill) of the random
field and $k_\theta(\cdot)$ the correlation function (normalized covariance function);
$\theta \in \Theta \subset \mathbb{R}^m$ stands for a parameter vector whose components describe
the range and shape of the positive definite correlation function. Under these
assumptions the probability density of the observed data takes the form
$$J_\lambda(Z)\,\big(2\pi\sigma^2\big)^{-n/2}\,|K_\theta|^{-1/2}\exp\Big\{-\tfrac12\big(g_\lambda(Z) - \mathbf{1}\mu\big)^T\big(\sigma^2 K_\theta\big)^{-1}\big(g_\lambda(Z) - \mathbf{1}\mu\big)\Big\},$$
where $J_\lambda(Z)$ is the determinant of the Jacobian of the specific transformation
used and $\sigma^2 K_\theta$ is the covariance matrix of the transformed data, and
$$\hat\sigma^2_{\lambda,\theta} = \frac{1}{n-1}\,\big(g_\lambda(Z) - \mathbf{1}\hat\mu^{OK}_{\lambda,\theta}\big)^T K_\theta^{-1}\big(g_\lambda(Z) - \mathbf{1}\hat\mu^{OK}_{\lambda,\theta}\big),$$
where E (μ) = μ0 is the fixed a-priori mean and var (μ) = σ 2 Φ is the fixed
a-priori variance for μ. The vector cθ contains the correlations between the
point to be predicted and the observations at the n locations. It can be shown
that the total mean-squared error of prediction (TMSEP) of this predictor,
$$E\big\{Z^{\theta,\sigma^2}_{BK}(x_0) - Z(x_0)\big\}^2 = \sigma^2\Big(1 - c_\theta^T K_\theta^{-1} c_\theta + \big\|1 - \mathbf{1}^T K_\theta^{-1} c_\theta\big\|^2_{(\mathbf{1}^T K_\theta^{-1}\mathbf{1} + \Phi^{-1})^{-1}}\Big),$$
where $\|a\|^2_A$ is a short-hand for the quadratic form $a^T A a$, is always smaller
than the mean-squared error of prediction (MSEP) of the ordinary kriging
predictor. Thus, by accepting a small bias in the Bayes kriging predictor and
using the prior knowledge $E(\mu) = \mu_0$ and $\mathrm{var}(\mu) = \sigma^2\Phi$, one gets better predictions
than with ordinary kriging. We refer to Spöck [20], where these results are
investigated in more detail.
An obvious advantage of the Bayesian approach, besides its ability to deal
with the uncertainty of the model parameters, is the compensation for the lack
of information in case of only few measurements. This has been demonstrated
impressively by Omre [17], Omre and Halvorsen [18] and Abrahamsen [1].
Bayesian linear kriging is not fully Bayesian, since it makes no a-priori
distributional assumptions on the parameters of the covariance function. The
first to also take into account the uncertainty with respect to these parameters,
using a Bayesian setup, were Kitanidis [13] and Handcock and Stein [11]. A
prior different from the one of Handcock and Stein was used by Gaudard et al.
$$p(z_0|Z) \approx \frac{1}{N}\sum_{i=1}^{N} p\big(g_{\hat\lambda_i}(z_0)\,\big|\,\hat\lambda_i, \hat\theta_i, \hat\sigma^2_i, Z\big)\; J_{\hat\lambda_i}(z_0).\qquad (6)$$
Here $p(g_{\hat\lambda_i}(z_0)|\hat\lambda_i, \hat\theta_i, \hat\sigma^2_i, Z)$ is the conditional predictive density,
$$Y(x_0)\,\big|\,\hat\lambda_i, \hat\theta_i, \hat\sigma^2_i, Z \;\sim\; N\Big(\hat{Y}^{\hat\lambda_i,\hat\theta_i,\hat\sigma^2_i}_{BK}(x_0),\; TMSEP_{\hat\lambda_i,\hat\theta_i,\hat\sigma^2_i}\Big),$$
where $\hat{Y}^{\hat\lambda_i,\hat\theta_i,\hat\sigma^2_i}_{BK}(x_0)$ is the Bayes kriging predictor applied to the transformed
data $Y = g_{\hat\lambda_i}(Z)$ for fixed $(\hat\lambda_i, \hat\theta_i, \hat\sigma^2_i)$, and $TMSEP_{\hat\lambda_i,\hat\theta_i,\hat\sigma^2_i}$ is the corresponding
Bayes kriging variance. From this predictive distribution, quantiles, the
median, the mean and probabilities above certain thresholds can easily be
calculated.
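A hedged sketch of how the Monte Carlo mixture (6) can be evaluated in practice is shown below; the transformation `g`, its derivative `g_prime` (the Jacobian), and the toy bootstrap output are placeholders, not values from the study.

```python
import numpy as np
from scipy.stats import norm

def predictive_density(z0_grid, g, g_prime, boot_params):
    """Monte Carlo approximation (6): average the conditional Gaussian densities of the
    transformed value over the bootstrap samples and multiply by the Jacobian.
    boot_params: list of (lam, mean_i, tmsep_i) for each bootstrapped parameter set."""
    dens = np.zeros_like(z0_grid, dtype=float)
    for lam, m, v in boot_params:
        dens += norm.pdf(g(z0_grid, lam), loc=m, scale=np.sqrt(v)) * g_prime(z0_grid, lam)
    return dens / len(boot_params)

# Hypothetical log-log transform g_lam(z) = log(lam + log z) and its derivative
g = lambda z, lam: np.log(lam + np.log(z))
g_prime = lambda z, lam: 1.0 / (z * (lam + np.log(z)))

# Toy bootstrap output (illustrative numbers only)
boot = [(4.0, 1.5, 0.04), (4.1, 1.45, 0.05), (3.9, 1.55, 0.03)]
z0 = np.linspace(50.0, 300.0, 1000)
p = predictive_density(z0, g, g_prime, boot)
cdf = np.cumsum(p) * (z0[1] - z0[0])
median = z0[np.searchsorted(cdf, 0.5 * cdf[-1])]   # posterior predictive median
```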
One of our aims is to have a methodology that is intrinsically Bayesian and can
also be applied to the highly skewed data sets that often occur in applications.
In 2004 one such data set was investigated in detail during the spatial
interpolation contest SIC2004 [8]. Ten training data sets on radioactivity levels
were given to the participants of the contest over a certain period of time to
train their automatic interpolation algorithms. After training, another data set,
called the “Joker”, was given to the participants; it had one property completely
different from the training data: “A small corner located SW of the monitored
area was chosen and a dispersion process was modelled in order to obtain a
few values on the order of 10 times more than the overall background levels
reported for the first data set”, according to [8]. The automatic interpolation
routines applied to these data ranged from ordinary kriging, splines and support
vector machines to neural networks. The performance of the different algorithms
could later be compared to the true values. Performance measures
such as the mean absolute error (MAE) and root mean squared error (RMSE)
have been reported and published in [8]. Because we already knew the true
data when we evaluated the performance of our Bayesian trans-Gaussian
kriging algorithm, the calculations we give here are outside the context of a
competition. But our results are comparable in performance to the best real
geostatistical algorithm from this contest. The winner was a neural network
algorithm.
Figure 1(a) gives the data locations of the Joker data set (blue circles)
together with the locations where the prediction should take place (red stars).
For an exploratory data analysis we refer to Dubois [8]. A histogram of the
Joker data set is shown in Fig. 1(b). The histogram shows that the background
level is quite symmetric; however, there are also some very large values that
can be interpreted as an accidental release of radioactivity.
The methodology we apply to this data set is Bayesian trans-Gaussian
kriging with the log-log transformation. Since we are not sure whether the
variogram model is linear or parabolic at the origin, we have used a convex
combination of a Gaussian and an exponential variogram model. The advantage
of our method is that, according to the data, the bootstrap methodology
then also takes account of this uncertainty about the variogram model at the
origin. The convex combination parameter as well as the Gaussian and exponential
range parameters, the overall sill and the transformation parameter are
part of the bootstrap. Anisotropy is respected as well, by including a
transformation matrix for the coordinates in the maximum likelihood bootstrap.
As already mentioned, the main advantage of our approach is that all
uncertainties are taken into account and no prior specification is necessary.
Figure 2(a)–(d) shows the bootstrapped transformation parameters, covariance
parameters and variogram functions from the posterior. Although estimation
of geometric anisotropy was performed, it turned out that the bootstrapped
semivariogram realizations show no anisotropy. Because we calculate posterior
predictive distributions (see Fig. 3) at all locations where prediction should
take place, by means of Monte Carlo averaging with the samples from the
bootstrap, graphics like quantile maps (Fig. 4(a)–(d) and Fig. 5(b)), a posterior
mean map (Fig. 5(a)), and maps of the probability above thresholds (Fig. 6)
are available. To make our results comparable to the SIC2004 contest we
calculated the MAE = 16.19 and RMSE = 77.64. In terms of MSE this would
have been the second best result in the SIC2004 contest.
Fig. 1. (a) The data locations (blue circles) and the locations where prediction
takes place (red stars). (b) The histogram of the Joker data set
[Fig. 2: histograms of the bootstrapped parameters, panels (a)–(d); panels (c) and (d) are titled “posterior of sill” and “posterior of semivariograms”]
Fig. 4. Maps of the quantiles of the posterior predictive distribution. (a) 5% quan-
tile, (b) 25% quantile, (c) 75% quantile, (d) 95% quantile
[Fig. 5: maps of the posterior predictive mean (a) and median (b)]
Fig. 6. Maps of the probabilities above certain thresholds. (a) threshold 90, (b)
threshold 110, (c) threshold 130, (d) threshold 170
5 Conclusion
Fig. 7. Crossvalidation results. (a) percent of actual data vs. expected percent of
data above the thresholds 10, 20, . . . , 170, 1000, 1100, . . . , 1500. (b) posterior predic-
tive quantiles vs. percent of data below quantiles. (c) data vs. posterior predictive
mean
where
$$p\big(g_\lambda(Z)\,\big|\,\hat\mu^{\theta,\sigma^2}_{BK}, \sigma^2, \lambda, \theta\big) = N\big\{\mathbf{1}\hat\mu^{\theta,\sigma^2}_{BK},\; \sigma^2 K_\theta;\; g_\lambda(Z)\big\}.$$
Philipp Pluch
1 Introduction
Spatial Statistics refers to a class of models and methods for spatial data that
aim at providing quantitative descriptions of natural variables distributed in
space or space and time (see Chiles and Delfiner [2], Cressie [3]). Examples
of such variables are ore grades collected in a mineral field, the density of trees
of a certain species in a forest, or CD (critical dimension) measurements in
semiconductor production. A typical problem in spatial statistics is to predict
values of measurements at places where they were not observed or, if
measured with error, to estimate a smooth spatial surface from the data
(estimation of a regionalized variable). A family of techniques, both stochastic and
non-stochastic, has been developed in geostatistics for this interpolation problem.
The general approach is to consider a class of unbiased estimators, usually
linear in the observations, and to find the one with minimum uncertainty, as
measured by the error variance. A group of techniques, known loosely as kriging,
is a popular family of interpolation methods developed in geostatistics by
Krige [10], Matheron [11] and Journel and Huijbregts [8].
An interesting comparison of ten classes of interpolation techniques and their
characteristics can be found in Burrough and McDonnell [1], and many recently
published papers compare several interpolation techniques.
The goal of kriging, like that of nonparametric regression, is that the understanding
of spatial estimation is enriched by its interpretation in terms of smoothing
estimates. On the other hand, random process models are also valuable in
setting uncertainty estimates for function estimates, especially in low-noise
situations. There are close connections between different mathematical subjects
such as kriging, radial basis function (RBF) interpolation, spline interpolation,
reproducing kernel Hilbert spaces (RKHS), PDEs, Markov random fields
(MRF), etc. A short discussion of these links is given in Horiwitz et al. [7],
see Fig. 7. Splines link different fields of mathematics and statistics and are
used in statistics for spatial modelling (see more in Wahba [15]).
2 Interpolation Techniques
Interpolation techniques can be divided into techniques based on deterministic
and stochastic models. The stochastic approach regards the data $\{y_i\}_{i=1}^n$ as a
realisation of a random field on $\mathbb{R}^d$ at $t_i = (x_{i1}, \ldots, x_{id})$, $i = 1, \ldots, n$, and sets
$g(t)$ to be the best unbiased linear predictor of the random field at site $t$ given
the measurements.
We assume here an intrinsic random field (second-order stationary). Splines
are smooth real valued functions g(t). Define a roughness penalty based on the
sum of integrated squared partial derivatives of a given order n. The choice
of g(t), which interpolates the data and minimises the roughness penalty, is
known as the smoothing thin plate spline introduced by Reinsch [13].
$$Y_i = g(x_i) + \varepsilon_i\qquad (5)$$
$$g(X) \approx g_0 + \sum_{i=1}^{d} g_i(X_i)\qquad (6)$$
$$Y = \alpha + \sum_{i=1}^{d} g_i(X_i) + \varepsilon\qquad (7)$$
3 Smoothing Spline
In this paragraph we consider the following problem: we want to find a function
$g(x)$, among all twice continuously differentiable functions, that minimises (1),
$$\|y - g\|^2 + \rho\, g^T K g,\qquad (8)$$
where
$$K = \Delta^T C^{-1} \Delta.$$
The resulting estimate is a linear smoother of the observations,
$$\hat{g} = S y.\qquad (9)$$
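A minimal numerical sketch of the linear smoother in (9) is given below; a simple second-difference penalty $K = D^T D$ is used as a stand-in for the spline penalty matrix of (8), so it illustrates the structure rather than the exact thin plate construction.

```python
import numpy as np

# Minimal sketch: penalised least squares fit g_hat = (I + rho*K)^{-1} y = S y,
# using a second-difference penalty K = D'D as a stand-in for the penalty of (8).
def smoother_matrix(n, rho):
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second-difference operator
    K = D.T @ D
    return np.linalg.inv(np.eye(n) + rho * K)

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.2 * np.random.randn(50)
S = smoother_matrix(len(y), rho=5.0)
g_hat = S @ y     # smoothed values; a larger rho gives a smoother fit
```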
Equation (10), which we will later encounter in the solution of the kriging prediction
problem, can also be found in ridge regression with the natural basis and the
Demmler-Reinsch basis (see more in Nychka [12]).
When we use this approach with the additive model we are able to find (with
the help of (8)) the following criterion:
$$\sum_{i=1}^{d} \rho_i\, g_i^T K_i g_i + \sum_{i=1}^{n}\Big(Y_i - g_0 - \sum_{j=1}^{d} g_j(x_{ij})\Big)^2.\qquad (11)$$
Minimisation leads to
$$\hat{g}_l = S_l\Big(y - g_0 - \sum_{j=1;\, j\neq l}^{d} \hat{g}_j\Big)\qquad (12)$$
$$\hat{g}(x) = \hat{Y} = g_0 + \sum_{j=1}^{d} O_j^{-1} R_j^T g_j\qquad (15)$$
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \langle \eta_i, f\rangle\big)^2 + \rho\,\|P_1 f\|_R^2 \;\to\; \min_{f \in W_m}\qquad (16)$$
$$f_\rho = \sum_{\nu=1}^{M} d_\nu \phi_\nu + \sum_{i=1}^{n} c_i \xi_i$$
$$\xi_i = P_1 \eta_i,\quad i = 1, \ldots, n$$
$$d = (d_1, d_2, \ldots, d_M)^T = \big((T^T M^{-1} T)^{-1} T^T M^{-1}\big)\,y$$
$$c = (c_1, \ldots, c_n)^T = M^{-1}\big(I - T(T^T M^{-1} T)^{-1} T^T M^{-1}\big)\,y$$
$$M = \Sigma + n\rho I$$
$$\Sigma = \big(\langle \xi_i, \xi_j\rangle\big),\quad i, j = 1, \ldots, n$$
A more general way to describe the roughness penalty function is
$$J_{r+1}(g) = \sum_{|m| = r+1} \frac{(r+1)!}{m_1!\cdots m_d!}\int_{\mathbb{R}^d}\left[\frac{\partial^{\,r+1} g(t)}{\partial t_1^{m_1}\cdots\partial t_d^{m_d}}\right]^2 dt.\qquad (17)$$
When a particular penalty is chosen, the result is invariant under rotations
and translations of $t$.
4 Kriging
Let
$$V_r = \begin{pmatrix} f(t_0)^T \\ F \end{pmatrix},\qquad F = \big(f(t_1), \ldots, f(t_n)\big)^T,\qquad EY(t) = \beta^T f(t),$$
where $\sigma^2 = \sigma(0)$ and $\sigma_0$ is an $(n\times 1)$ vector with elements $\sigma(t_0 - t_i)$, $i = 1, \ldots, n$.
$\Sigma$ has the elements $\sigma_{i,j}$.
If {Y (t)} is a stationary random field with a polynomial drift, then kriging
involves predicting Y (t0 ) by a linear combination Ŷ (t0 ) = αT y. The goal is
to minimise the prediction mean squared error subject to an unbiasedness
constraint.
where
A = Σ −1 F (F T Σ −1 F )−1 (20)
B = Σ −1 − Σ −1 F (F T Σ −1 F )−1 F T Σ −1 (21)
For more details see Kent and Mardia [9], who also give a solution for intrinsic random fields where $\sigma(h)$ is a polynomial in $h$ of degree $2p$. In that case $A$ and $B$ can be found with the use of the Moore-Penrose generalised inverse:
$$B = \big[ (I - U_r (U_r^T U_r)^{-1} U_r^T)\, \Sigma\, (I - U_r (U_r^T U_r)^{-1} U_r^T) \big]^{-} \qquad (22)$$
$$A = (I - B\Sigma)\, U_r (U_r^T U_r)^{-1} \qquad (23)$$
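A numerical illustration of (20)-(21) is straightforward once $\Sigma$ and $F$ are available. The sketch below is not taken from the paper: the exponential covariance and the data are invented, and the matrices are combined in the standard universal-kriging predictor $\hat{Y}(t_0) = f(t_0)^T A^T y + \sigma_0^T B y$, which is the form these matrices typically enter.

```python
import numpy as np
from scipy.spatial.distance import cdist

def exp_cov(h, sill=1.0, rng_par=3.0):
    """Hypothetical exponential covariance sigma(h)."""
    return sill * np.exp(-h / rng_par)

rng = np.random.default_rng(2)
t = rng.uniform(0.0, 10.0, size=(40, 2))            # data sites t_1..t_n
y = rng.normal(size=40)                             # observations (placeholder)
t0 = np.array([[5.0, 5.0]])                         # prediction site

F = np.column_stack([np.ones(40), t])               # f(t) = (1, t1, t2): linear drift
f0 = np.array([1.0, t0[0, 0], t0[0, 1]])

Sigma = exp_cov(cdist(t, t))                        # Sigma_ij = sigma(t_i - t_j)
sigma0 = exp_cov(cdist(t, t0)).ravel()              # sigma(t_0 - t_i)

Si = np.linalg.inv(Sigma)
A = Si @ F @ np.linalg.inv(F.T @ Si @ F)                    # (20)
B = Si - Si @ F @ np.linalg.inv(F.T @ Si @ F) @ F.T @ Si    # (21)

y_hat = f0 @ (A.T @ y) + sigma0 @ (B @ y)           # universal-kriging prediction
```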
by thin plate splines and with the help of GCV, introduced in Wahba [15]; see Figs. 4 and 5. Figure 6 summarises the results of the GCV. Finally, Fig. 7 gives a contour plot of the surface resulting from kriging with a Matérn covariance function.
1 Introduction
The present research deals with novel developments in environmental spatial data modelling with the help of Artificial Neural Networks (ANN). The following spatial prediction problem is considered: given measurements of some physical quantity at a finite (and relatively small) number of points, the objective is to make predictions over the considered region, either on a regular dense grid (the traditional mapping problem) or on an irregular, decision-oriented grid. In many cases, in addition to the available measurements of the primary variable, there is additional information: secondary variables, remote sensing images, a physical model of the phenomenon, soft qualitative information, etc. In the present paper the problem of spatial prediction of the primary variable using comprehensive additional information on a secondary variable is considered. If there is a relationship between the variables (e.g. linear correlation), the secondary one can be treated as an external drift. In order to solve this problem an ANNEX model (ANN + EXternal drift) is proposed. The family of ANNEX models developed for spatial mapping problems is based on the idea of incorporating additional spatially distributed information into the ANN as additional input(s). This secondary information is assumed to be related to the primary variable and to be available both at the training points and at all points of the prediction grid. A similar idea is traditionally used in the geostatistical "kriging with external drift" model [2]. In general, the ANNEX approach can be considered as nonlinear modelling on a hypersurface described by the input variables. In the present work the ANNEX model is applied to a real case study dealing with the average long-term air temperature in June in Kazakh Priaralie. The additional information used is the elevation above sea level at the measurement and prediction locations. ANNEX model results are compared with those of both the standard MLP (without the extra input) and kriging with external drift.
2 ANNEX Model
The problem of spatial mapping of environmental data is rather traditional, and a wide variety of prediction models exist to solve it. In most cases it is necessary to predict values of a spatial function (precipitation, temperature, contamination, etc.) at unsampled points, in particular on a regular grid. Geostatistics is a well-elaborated approach to such problems. All geostatistical models rely on the modelling of spatial correlation structures (variography) and are mainly based on a linearity hypothesis. Thus, geostatistics is a model-dependent approach: solutions depend strongly on the developed model of spatial correlation. Another, data-driven, approach is based on the application of artificial neural networks [1, 3]. Neural networks are robust, nonlinear and highly flexible tools for data modelling. It has been shown that ANN can be applied efficiently to spatial data modelling, especially in combination with geostatistical tools [4]. Data analysis with ANN includes several important steps: data selection and pre-processing, selection of architecture, training, testing, and validation. In the present study the multilayer perceptron (MLP), the workhorse of ANN data modelling, is applied for the spatial prediction of temperature. MLPs are very powerful modelling tools that are able to incorporate, in a nonlinear manner, different kinds of information and data during the modelling procedure. Usually in spatial data modelling the input space of the ANN (the independent variables) is described by the geographical coordinates (e.g., x, y). The output unit (F) of the ANN is the modelled function in the case of univariate prediction, or a vector in the case of multivariate prediction. The idea of the ANNEX model is as follows: if additional information related to the primary variable is available at both the training and prediction points, we can use it as additional input(s) to the standard ANN.
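A minimal sketch of this idea is given below. It trains a standard two-input MLP on the coordinates only and an ANNEX-type three-input MLP that also receives a secondary variable (here an invented elevation). It uses scikit-learn's MLPRegressor, which is not the software used by the authors, and the optimiser differs from theirs; all data are synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 200
xy = rng.uniform(0.0, 100.0, size=(n, 2))                # spatial coordinates (km)
elev = 1000.0 - 5.0 * xy[:, 0] + rng.normal(0, 20, n)    # invented secondary variable
temp = 30.0 - 0.006 * elev + rng.normal(0, 0.5, n)       # invented primary variable

def mlp(hidden):
    # inputs are standardised before entering the network
    return make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=hidden, activation="tanh",
                                      solver="lbfgs", max_iter=5000, random_state=0))

model_std = mlp((8,)).fit(xy, temp)                                # standard MLP: (x, y)
model_annex = mlp((8,)).fit(np.column_stack([xy, elev]), temp)     # ANNEX: (x, y, elevation)

print("MLP R^2:  ", model_std.score(xy, temp))
print("ANNEX R^2:", model_annex.score(np.column_stack([xy, elev]), temp))
```

At prediction time the extra input must also be available on the prediction grid, e.g. from a digital elevation model, exactly as required for kriging with external drift.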
Consider the following examples of external information suitable for ANNEX-type modelling:
1. Availability of "cheap" information on the secondary variable(s). Suppose we are interested in the prediction of some physical quantity (the primary variable) whose measurement is rather complicated and/or expensive. If other variables are available or easily measured at all points (both the measurement and prediction grids), we can check and use this information in order to improve the quality of the primary-variable prediction.
2. A physical model of the phenomenon. Suppose we are given a physical model that describes the phenomenon under study. To include this model in the data-driven ANNEX approach, the output of the physical model at all prediction points and at the measurement locations is used as an extra input(s) for the ANN. In general, a secondary ANN model can be developed to model (learn) the physical phenomenon.
3 Case Study
This case study deals with the prediction of air temperature in Kazakh Priaralie. The selected region covers 1,400,000 km² and contains 400 monitoring stations. The primary variable is the average long-term air temperature in June. The additional information used as an extra ANN input is the elevation of the locations above sea level; it is available on a dense grid from a Digital Elevation Model.
The correlation between air temperature and altitude is linear and equal to 0.9 (Fig. 1). The linearity of the relationship allowed us to use a traditional geostatistical model (kriging with external drift) and to compare its results with those obtained by the ANNEX model. Similar work on modelling air temperature with kriging with external drift can be found in Wackernagel [7].
Following the general methodology of ANN data modelling, the original data were split into training and testing data sets. The spatial locations of the training and testing data points are presented in Fig. 2. An important and difficult problem
concerns the criteria for data splitting. In most cases data are split randomly, but for spatial and clustered data such an approach may not be adequate. In the present study the similarity of the data sets was controlled by comparing summary statistics, histograms and spatial correlation structures (variograms). Since it is difficult to control both the testing and the training dataset, more attention was paid to the similarity of the training data set to the structure of the complete data. Similarity of the spatial structures of the obtained datasets to the initial data is even more important than the statistical factors. Comparison of the spatial structures was carried out with the help of variogram roses, which model anisotropic spatial correlation structures (see Fig. 3). This comparison provided grounds that a split into 168 training and 67 testing points is suitable for the subsequent modelling. More advanced splitting methods could use statistical tests.
Fig. 2. The spatial location of train (circles) and test (cross) data points
Fig. 3. Variogram roses: raw (a), train (b) and test (c) datasets
In the present study, MLP models were used with the following parameters: two (traditional ANN) or three (ANNEX) input neurons, describing the spatial co-ordinates (X, Y) and the altitude, one or two hidden layers, and one output neuron describing air temperature. Backpropagation training with the Levenberg-Marquardt algorithm followed by a conjugate gradient algorithm was used in order to avoid local minima [6]. ANN and ANNEX modelling results are presented below as errors on the test dataset (Table 2). The MLP with structure 2-7-5-1 (7 and 5 neurones in the two hidden layers) showed the best result among MLPs with 2 inputs, while the ANNEX model with structure 3-8-1 (8 neurones in the hidden layer) gave the best result among all considered models. It is worth mentioning that we tried several MLP structures for the ANNEX model and selected the optimal one (see Table 2). Mapping on a grid with the help of the ANNEX model shows a pattern similar to kriging with external drift.
Table 2. The air temperature test results for ANN and ANNEX models
model     correlation   RMSE   MAE    MRE
2-7-5-1   0.917         2.57   1.96   −0.02
3-3-1     0.989         0.96   0.73   −0.01
3-5-1     0.99          0.9    0.7    −0.007
3-7-1     0.991         0.85   0.66   −0.004
3-8-1     0.991         0.84   0.68   −0.001
3-9-1     0.991         0.88   0.69   −0.01
3-10-1    0.99          0.92   0.74   −0.01
4 Conclusions
An Artificial Neural Network with External drift (ANNEX) model for the analysis and mapping of spatially distributed data was applied to real data. It was shown that additional spatially distributed information can be used efficiently by ANNEX and gives rise to a better analysis and modelling of environmental data. The promising results presented are based on a real case study of air temperature mapping. Other kinds of Machine Learning models (besides ANN) could be used, with possible modifications, in the proposed framework. The advantage of the ANNEX model is its ability to model any nonlinear relationship between the variables. An interesting feature found in the study is the robustness and stability of the ANNEX solution with respect to noise; this should be studied in more detail. The ANNEX model performed better even in the case of a linear correlation between the primary and secondary information, a situation that favours kriging with external drift. An even more interesting study should consider nonlinear relationships between the data and the external information.
Acknowledgements
The analysis and presentation of the results as well as the MLP and geostatistical modelling were performed with the help of the GEOSTAT OFFICE software ([5], http://www.ibrae.ac.ru/ mkanev). The work was supported by the INTAS Aral Sea project #72. The authors thank S. Chernov and V. Timonin for programming the GEOSTAT OFFICE software, which was used extensively in this research.
References
1 Introduction
For the last 20 years, the Swiss Federal Office of Public Health (OFSP) has performed more than 65,000 indoor radon measurements throughout Switzerland. Swiss indoor radon data are noisy and poorly spatially correlated. They feature large small-scale variability and a strong spatial clustering, and their univariate distribution is positively skewed and heavy-tailed. One possible way to deal with these data is therefore to transform them into indicators relative to a decision threshold and to apply spatial statistics to these indicators. Indeed, when considering decision making, the task is often to classify indoor radon data into low or high concentration levels. This kind of two-class classification task is commonly solved by geostatistical interpolation of indicators using kriging and/or conditional simulations. However, geostatistical approaches depend on several assumptions about the data (e.g. stationarity) and require modelling of the variogram (see Chiles and Delfiner [1]), a task that is, while sometimes possible, often very difficult and time consuming with indoor radon data. In consequence, data-driven approaches such as support vector machines (SVM) are considered as an alternative to geostatistical approaches. In this paper, their performance in application to indoor radon data classification is assessed in comparison with that of indicator kriging (IK) and sequential indicator simulations (SIS).
2 Data Pre-processing
This study will focus on a small square of 25 × 25 km located at the north-
eastern end of Switzerland, near the “Bodensee”. Over this region, stationarity
Continuous indoor radon levels are transformed into discrete binary indicators, [0;1] for the geostatistical methods or [−1;+1] for SVM, relative to a user-defined threshold. In the present study, the threshold is set at 45 Bq/m³, close to the median of the regional distribution of indoor radon levels. This level was chosen for the present methodological study and does not reflect the decision level defined by Swiss federal law [6]. The indicator I at location u is built by comparing the local indoor radon level Z(u) to the decision threshold Z as follows:
$$I(u; Z) = 1 \text{ (geostat) or } -1 \text{ (SVM)} \quad \text{if } Z(u) < Z, \ Z = 45\ \mathrm{Bq/m^3}$$
$$I(u; Z) = 0 \text{ (geostat) or } +1 \text{ (SVM)} \quad \text{otherwise}$$
After binary coding of the data, the dataset contains 658 indicators. To assess the classification abilities of the methods, a subset of 158 data points is kept for validation purposes only, leaving 500 source data points available for the spatial analysis; see Fig. 1.
Fig. 1. Plot of indicator 1 (white box), indicator 0 (black box) and validation data
(marks)
3 Geostatistical Tools
When dealing with indicator transformed data [0;1], geostatistical tools such
as kriging and simulations provide an estimate of the probability that indicator
1 prevails at any unknown location u. This estimation requires definition and
modelling of the variogram of indicators.
$$\gamma_I(h) = \frac{1}{2N} \sum_{i=1}^{N} \big( I(u_i) - I(u_i + h) \big)^2 \qquad (1)$$
Consider the problem of estimating the indicator value I(u; Z) with known
constant mean m at any location u using N available hard indicators Ik defined
at the threshold Z. The indicator kriging estimate is a linear combination of
available indicators.
$$I(u; Z) = \sum_{k=1}^{N} \lambda_k \, I_k(x_k; Z) + \Big( 1 - \sum_{k=1}^{N} \lambda_k \Big)\, m \qquad (2)$$
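A compact numerical sketch of (2) is given below, assuming the usual simple-kriging route: indicator covariances are derived from the indicator variogram as C(h) = C(0) − γ(h), the weights solve C λ = c₀, and the known mean m enters with the complementary weight. The variogram model and the data are invented; this is not the software used in the study.

```python
import numpy as np
from scipy.spatial.distance import cdist

def indicator_cov(h, sill=0.25, rng_par=2000.0):
    """C(h) = C(0) - gamma(h) for a hypothetical exponential indicator variogram."""
    gamma = sill * (1.0 - np.exp(-h / rng_par))
    return sill - gamma

def ik_estimate(coords, indicators, u, mean_m):
    """Indicator kriging with known constant mean m, cf. (2)."""
    C = indicator_cov(cdist(coords, coords))
    c0 = indicator_cov(cdist(coords, u.reshape(1, -1))).ravel()
    lam = np.linalg.solve(C, c0)
    return lam @ indicators + (1.0 - lam.sum()) * mean_m

rng = np.random.default_rng(4)
coords = rng.uniform(0.0, 25000.0, size=(50, 2))        # metres within the 25 x 25 km square
indicators = rng.integers(0, 2, size=50).astype(float)  # hard indicator data I_k
prob = ik_estimate(coords, indicators, np.array([12500.0, 12500.0]), mean_m=0.5)
```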
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \big( y_i \,(w \cdot z_i + b) - 1 \big) \qquad (7)$$
$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i\, y_i\, z_i \qquad (8)$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i\, y_i = 0 \qquad (9)$$
Equation (6) shows that the weight vector w, and thus the hyperplane H, is defined only in terms of the data with associated α > 0. These data are called support vectors. The offset b of the hyperplane is chosen so that it maximizes the margin.
Classification of any new point is then performed by substituting (8) in (5), which gives the decision function:
$$y_{\mathrm{new}} = \operatorname{sign}\Big( \sum_{i=1}^{\#SV} \alpha_i\, y_i\, z_i^{SV} \cdot z_{\mathrm{new}} + b \Big) \qquad (10)$$
In the linear case, the classification is thus achieved by computing the dot product between the support vector inputs and the inputs of the new point to be classified. The inputs are commonly the spatial coordinates of the points, but they may also contain additional information.
When dealing with noisy data such as indoor radon levels, it is not always advisable to classify all training data correctly. Indeed, the data are likely to contain errors or incoherent values. The classifier is then built to allow misclassification of points that have too strong an impact on the boundary definition, in order to improve its generalisation abilities. The resulting boundary has a so-called "soft margin". This additional constraint is implemented in the optimisation problem by introducing a new parameter called the C-value in addition to slack variables ξi. Basically, misclassified points i are on the wrong side of their margin by an amount C · ξi [2]. The Lagrangian can then be rewritten as follows:
$$L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \big[ y_i\,(w \cdot z_i + b) - (1 - \xi_i) \big] - \sum_{i=1}^{N} \mu_i \xi_i \qquad (12)$$
$$\xi_i \geq 0, \quad i = 1, \ldots, N, \qquad 0 \leq \alpha_i \leq C$$
SVM parameters such as the kernel width and the C-value are unknown and must be tuned. Therefore, the source dataset is split into training and testing data. The model is built with the training data and the parameters are tuned according to the testing data, using different combinations of kernel widths and C-values. Optimality of the SVM parameters is reached following the structural risk minimization principle [7]. Both the testing error and the complexity of the model, defined in terms of the number of support vectors, are considered: the more support vectors, the more complex the model.
Ten different random splits into 400 training and 100 testing data were performed to tune the SVM parameters; indeed, the optimal parameters may vary between random splits due to the noisy data and/or possible algorithmic instabilities. Average values of both the testing error and the normalized number of support vectors over the ten random splits are presented on a map.
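The tuning loop just described can be sketched with scikit-learn's SVC (not the SVM implementation used by the authors). For the Gaussian kernel exp(−γ‖x − x'‖²), the kernel width σ corresponds to γ = 1/(2σ²); test error and the normalised number of support vectors are averaged over several random splits. The data, the grid values and the split sizes below are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
coords = rng.uniform(0.0, 25000.0, size=(500, 2))   # source data: x, y in metres
labels = np.where(rng.random(500) < 0.5, -1, 1)     # binary indicators (placeholder)

widths = [500.0, 1000.0, 3000.0, 5000.0]            # kernel widths sigma in metres
log_C = [0, 1, 2, 3, 4]

for sigma in widths:
    gamma = 1.0 / (2.0 * sigma ** 2)
    for lc in log_C:
        errs, nsv = [], []
        for split in range(10):                      # ten random 400/100 splits
            Xtr, Xte, ytr, yte = train_test_split(
                coords, labels, train_size=400, test_size=100, random_state=split)
            clf = SVC(kernel="rbf", gamma=gamma, C=10.0 ** lc).fit(Xtr, ytr)
            errs.append(1.0 - clf.score(Xte, yte))               # testing error
            nsv.append(clf.n_support_.sum() / len(ytr))          # normalised #SV
        print(sigma, lc, np.mean(errs), np.mean(nsv))
```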
As suggested by Fig. 5, there is no clear unique solution. The minimal test error lies on a straight line, indicating a fairly linear dependence between the optimal kernel width and the logarithm of the C-value. However, two white patches of low testing error can be seen on this line, one for a kernel width of 1000 m and the other at approximately 3 km. The solution with a kernel width of 3 km and log(C) of 3 can be built with fewer support vectors (Fig. 6) and is therefore preferred over the other. The final classification with the optimal parameters is applied to the training and testing data together. Under this configuration, the final classification is performed using only 341 support vectors out of the 500 original data points.
5 Classification Results
Classification maps produced using both the geostatistical methods and support vector machines are presented in this section. The interpolation grid has a square cell size of 200 m, and the maps also show a postplot of the source (marks) and validation (circles) data.
Classifications obtained with geostatistical methods are very similar as
they use the same model of spatial structure: the variogram. However, SIS
Fig. 7. IK classification
6 Conclusions
The classification abilities of geostatistical approaches and SVM for a median-value test decision threshold (45 Bq/m³) are presented in this study. Despite their poor classification abilities, which are easily explained by the high variability of the indoor radon indicator data, all reviewed methods are able to extract spatial information efficiently out of the data. In particular, SVM are promising for indoor radon binary classification as they do not require any prior assumption about the data, such as stationarity. They may therefore be applied to classification problems over large regions, if not over the whole country. A specific feature
Acknowledgments
The work was supported in part by the INTAS Aral sea grant #72.
References
Aggregates, i.e. sand, gravel and crushed rock, are the most frequently used construction materials worldwide, e.g. in concrete, cement, asphalt, etc. Among igneous rocks, granite and basalt are the most important. The properties of
all construction materials need to be appropriate for their intended purpose.
Some applications leave room for choice among many different rock variants,
others require a thorough inspection of the particular features of the rocks.
The question of whether the rock will resist physical and chemical loads, is of
particular importance. Specific imperfections in granite result from the trans-
formation of feldspar to kaolinite, or the decay of biotite and may lead to
reduced strength [6]. As rocks are increasingly being used up to the limits of
their mechanical strength, material tolerances are decreasing. This leads to
the demand for ever more careful assessment of rocks. In order to decrease
the costs of damages arising from improper use of aggregates, and to substan-
tially reduce production costs, the aggregates industry is interested in effective
quality control.
This calls for an efficient and fast method for the classification of aggregates. Automatic means for identifying suitable rock characteristics and their variation therefore have to be devised. It is well known that matter treated with light of
different wavelengths shows characteristic features that are suitable for quali-
tative and quantitative analysis and therefore for identification of substances
(e.g. [9]). The optical characteristics of the material investigated are expressed in a spectrum, i.e. a plot of the absorption, transmission, reflection, or emission
intensity as a function of wavelength, frequency, or energy [2].
Theoretical studies show that different substances have characteristic spectra in certain wavelength regions, even though the appearance of these spectra may vary considerably, depending on the parameters used. This raises the question of whether a statistical method for the classification of aggregates is possible.
The aim of the EUREKA project, PETROSCOPE, is to develop an auto-
matic testing instrument for process and quality control in the construction
aggregates industry [1]. In this paper, stemming from the PETROSCOPE
project, the identification of two types or variants of granite, called Granite
1 and Granite 2, by means of their reflectivity in mid-infrared light is investi-
gated. Ten samples, i.e. particles of size 16–32 mm for each of the two types of
Finnish granite, supplied by Lohja Rudus Oy from Hiiskula gravel pit, were
collected by the Geological Survey of Finland. Then the samples were irradi-
ated with infrared light at equidistant wavenumbers from 560 to 4000 cm−1
and from three positions for each particle. These measurements, performed at
VTT Electronics in Finland, resulted in three curves per particle or sample and therefore 60 curves altogether [15]. Figure 1 shows the spectral lines of the samples for each of the two granite classes or variants.
Fig. 1. Spectral lines of the samples for each of the two granite classes
In general, the spectral lines of the curves seem to be very similar, although
there are also some specific characteristics. This impression is mirrored by the
mean curves of the classes in Fig. 2, which raises the statistical question of whether the differences in the shape of the curves are systematic or just random.
The data observed are continuous curves, not single observations of scalars [7, 8], even though the curves were measured at discrete knots and are therefore represented by data vectors xi = (xi1 , . . . , xin ), where n indicates the dimen-
sion of the vectors, i.e. the number of observation knots. In statistical problems
dealing with spectra, the high dimensionality of the data causes problems in
applying the common techniques of multivariate statistics because the number
of samples compared to the number of observation knots is very small. In this
study the proportion of samples to predictors is about 1:8. Even partial least
Fig. 2. Mean curves from each of three measurements of reflectivity for ten particles
or samples in both classes of granite
squares (PLS) does not lead to reasonable classification error rates because of
the similarity of the curves, as discussed in Sect. 2.
Several techniques exist in order to overcome the problem of high dimen-
sionality (multicollinearity). In the following section feature reduction by a
basis expansion is described. We use the fact that the curves observed are in
L2 (IR), which is a Hilbert space. The idea is to choose a proper basis and
then consider a subspace that contains the essential information of the signal.
This information is obtained by an orthogonal projection of the signal onto this subspace.
In the current research we use a wavelet basis. Wavelets have adequate local
properties and have turned out to be appropriate for statistical modelling of
high dimensional data, as the characteristic features are summarized by a few
basis coefficients.
1.1 Wavelets
Wavelets are functions derived from a mother wavelet $\psi \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$, which is ([13]):
(i) "waving" above and below the abscissa, i.e. $\int_{-\infty}^{\infty} \psi(t)\, dt = 0$; this means that $\psi$ has mean zero;
(ii) well localized.
By translation and dilation of the mother wavelet $\psi$ we obtain the family
$$\psi_{jk}(x) = 2^{j/2}\, \psi(2^j x - k)\,, \qquad (1)$$
where the integers $j$ and $k$ are the spatial parameters of the discrete wavelet transform. Two technical terms are usual in signal processing: the factor of stretch or compression is called the scale, and the inverse of the scale is the resolution. The higher the resolution, the better the approximation and the lower the scale. The relation among level $j$, resolution and scale is shown in Table 1. (In this paper Mallat's indexing is used [13].)
Table 1. Relation among level j, scale and resolution

level        −2    −1    0     1     2
scale         4     2    1    1/2   1/4
resolution   1/4   1/2   1     2     4
$$\{\phi_{J_0 k}, \psi_{jk} \mid j \geq J_0 \wedge k \in \mathbb{Z}\}\,.$$
Then it can be seen easily that the space $V_{j+1}$ is decomposed into two orthonormal subspaces
$$V_{j+1} = V_j \oplus W_j\,.$$
This concept leads to a multiresolution of $L^2(\mathbb{R})$:
where
$$D_j(t) = \sum_{k \in \mathbb{Z}} d_{jk}\, \psi_{jk}(t)$$
$$f(t) = A_J(t) + D_J(t) + D_{J+1}(t) + D_{J+2}(t) + \cdots$$
$$g_i(t) \in V_N, \quad \text{i.e.} \quad g_i(t) = A_J(t) + \sum_{J \leq j < N} D_j(t) \qquad (2)$$
$$\varepsilon_i(t) \in V_N^{\perp}, \quad \text{i.e.} \quad \varepsilon_i(t) = \sum_{j \geq N} D_j(t)\,,$$
where gi (t) represents the systematic part in the curves observed, i.e. the
characteristic feature that should be investigated, and εi (t) stands for the
random error. A projection of the curves observed onto this subspace VN
separates the systematic components of the feature observed from the random
ones.
The question remaining is which wavelet basis should be used in (2) and which scale (resolution) should be chosen. Contrary to the Fourier transform, the wavelet transform is not unique, and there is no rule of thumb that prescribes the proper basis for the problem at hand.
In many practical applications Daubechies wavelets are applied [3, 5]. They have compact support, which is related to computational efficiency [10]. Be-
sides this, they also have some vanishing moments, which improves computa-
tional efficiency, since the higher the number of vanishing moments, the more
information will be concentrated in a smaller number of wavelet coefficients,
and the fine scale wavelet coefficients will be essentially zero where the func-
tion is smooth. However, this increases the support of the wavelets, so that a
trade-off is necessary [10, 12].
This study uses Daubechies wavelets with two vanishing moments. Differ-
ent levels of J were also investigated, and it turned out that J = 5 led to the
lowest classification error in the wavelet model with PCA.
The number of observation knots turns out to play an important role when
working with discretized signals ([3, 11] and the references there). Multireso-
lution analysis requires that the sample size is $2^n$ for some integer n. In cases
where this condition is not satisfied, the problem is often solved by padding
the signal with zeros. This procedure usually introduces unnecessary edge ef-
fects because of the resulting discontinuity of the signal at the borders. These
edge effects are difficult to compensate for. For the data underlying this study
zero padding leads to very low classification error rates.
The classification of the data was carried out according to the minimum Mahalanobis distance. In preparation for this, some basic transformations were conducted, such as the log-transformation of the observations and the standardisation of the observation range to an interval starting at zero, this interval being divided into subintervals of length one. As each sample was measured from three positions, the mean of these measurements was calculated and a baseline correction was carried out. Petrological examination of the samples should ensure that the samples consisted of only one type of granite. The resulting signal was projected onto V
estimation. In PCA and PLS, respectively, the number of features that enter the classification is reduced to a further extent. Although this reduction is not overwhelming – 16 scores remain in PLS and in PCA analysis (a reduction of about 1.5%!) – it has to be stated that the scores corresponding to low eigenvalues are important in cases when the direction of separation is orthogonal to the first PCs (cf. [4]).
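A schematic version of the processing chain described above is sketched below: log-transform, zero-padding to a power of two, a Daubechies-2 decomposition with PyWavelets, reduction to a small number of PCA scores, and assignment by minimum Mahalanobis distance with a pooled covariance. It is a sketch under these assumptions, not the authors' code, and the spectra are simulated.

```python
import numpy as np
import pywt
from sklearn.decomposition import PCA

def wavelet_features(spectrum, level=5):
    """Log-transform, zero-pad to 2^n and keep all db2 coefficients up to 'level'."""
    x = np.log(spectrum)
    n = int(2 ** np.ceil(np.log2(x.size)))
    x = np.pad(x, (0, n - x.size))                     # zero padding
    coeffs = pywt.wavedec(x, "db2", level=level)
    return np.concatenate(coeffs)

rng = np.random.default_rng(6)
spectra = rng.uniform(0.5, 2.5, size=(20, 430))        # 20 mean curves (simulated)
classes = np.repeat([0, 1], 10)                        # Granite 1 / Granite 2

features = np.array([wavelet_features(s) for s in spectra])
scores = PCA(n_components=16).fit_transform(features)  # 16 scores as in the text

# pooled within-class covariance for the Mahalanobis distance
means = np.array([scores[classes == c].mean(axis=0) for c in (0, 1)])
pooled = sum(np.cov(scores[classes == c], rowvar=False) for c in (0, 1)) / 2.0
pinv = np.linalg.pinv(pooled)

def assign(z):
    d = [(z - m) @ pinv @ (z - m) for m in means]
    return int(np.argmin(d))

predicted = np.array([assign(z) for z in scores])
```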
Using this wavelet approach, complete classification of the mean measurements was attained. The results of assigning single measurements in terms of the leave-one-out model are summarized in Tables 2 and 3 below and depend on the additional method of dimension reduction.
Table 2. Assignment of measurements

              M0          M1          M2          M3
  to        G1   G2     G1   G2     G1   G2     G1   G2
from G1     10    0      9    1     10    0     10    0
     G2      0   10      1    9      1    9      0   10
The three measurements for each sample can be found in the columns M1 to M3 of Table 2, showing some misclassifications, whereas the classification based on the mean values gives correct assignments. This raises the question of whether the misclassified assignments based on the single measurements M1 to M3 belong to the same particle or to different particles; unfortunately, they belong to two different particles. But as can be seen in Table 2, the classification error rate for the single measurements is only 3/60 = 0.05, which is very low for curves that are as extremely similar as in the case here.
As the dimensionality of the data was only slightly reduced by PCA one
could ask, whether it was necessary to use PCA and whether PLS would
improve the results. The calculations showed that the error rate was unac-
ceptable in the case when classification was carried out only by use of wavelet
coefficients. Therefore, a further data reduction method was applied. After
calculation of the PLS estimation of the scores, instead of a PCA, the results
improved, but the number of scores was the same as in PCA, i.e. 16 scores
Table 3. Assignment of measurements after PLS dimension reduction

              M0          M1          M2          M3
  to        G1   G2     G1   G2     G1   G2     G1   G2
from G1     10    0     10    0     10    0     10    0
     G2      0   10      1    9      1    9      0   10
were used in further classification. Table 3 shows that only two single curves
were misclassified after application of PLS for further dimension reduction.
These two curves belong to different samples.
PLS is often applied in chemometrics to reduce the dimension of spec-
tral data for classification. The following Table 4 gives an impression of
the classification error, which arises when PLS is used to select the original
log-transformed spectral lines. This means that PLS is used for identifying
the relevant observation knots. As can be seen from Table 4, there is even
Table 4. Assignment of measurements with PLS applied to the log-transformed spectral lines

              M0          M1          M2          M3
  to        G1   G2     G1   G2     G1   G2     G1   G2
from G1      9    1      7    3      9    1      9    1
     G2      1    9      1    9      0   10      2    8
3 Summary
Geostatistical Applications
Simulating the Effects of Rural Development
Policies on Land Use: Evidence from Spatially
Explicit Modeling in the Central Highlands
of Vietnam
1 Introduction
Land cover, the spectral characteristics of the earth's surface, and land use, the operational employment of that land, are closely related. However, there is also a clear distinction between land use and land cover. While land cover refers to the biophysical earth surface, land use is shaped by human, socioeconomic and political influences on the land [7]. In essence, 'land use links land cover to the human activities that transform the landscape' [15]. In most practical applications the analysis of satellite images is used to infer land use from land cover.
Land-use change is a common phenomenon associated with population growth, market development, technical and institutional innovation, and related rural development policy. This paper attempts to assess the impact of policy,
technology, socioeconomic, and geophysical conditions on land use in the last
decade and combines data from a village-level survey with remote sensing
data derived from Landsat images. Our objective is to analyze the influence
of these explanatory variables on land use using a reduced-form, spatially ex-
plicit multinomial logit model. Simulations are then carried out to assess the
effects of three policy scenarios of rural development on land use. An empirical
application is presented for two districts of Dak Lak province in the Central
Highlands of Vietnam. Dak Lak exhibits an interesting case in the study of
land use dynamics with its abundant forest resources, ethnic diversity, high
immigration rates and dynamic agricultural and socioeconomic development.
In particular, the last decade was characterized by rapid, labor- and capital-
intensive growth in the agricultural sector.
Source: Primary data on village level collected in village survey; secondary data
on geophysical and agroecological variables were provided by the Mekong River
Commission (Digital Elevation Model) and the Department for Agriculture
and Rural Development, Dak Lak (Digital Soil Map and protected areas);
rainfall data from own interpolation of data from nine meteorological stations,
classification of soil suitability dummies from expert opinion.
3 Spatial Sampling
Ideally, to integrate spatially explicit data derived from geographical informa-
tion systems (GIS) and remote sensing (RS) techniques with village survey
data, the scale of the analysis should match the agricultural plots as the
unit of decision-making. Yet, in Vietnam as in most developing countries,
plot maps and village boundaries are not available [12]. This renders spatial
modeling a time-consuming and costly task due to the necessary delineation
of the spatial extent of plots or villages, e.g. using Global Positioning Systems
(GPS). To demarcate the spatial base unit for the integration of socioeco-
nomic variables, the geographic positions of all villages were recorded using
GPS and point coverages created. Village boundaries for all surveyed villages
were then approximated by applying a cost-distance algorithm to delineate a
set of explicitly defined ‘accessibility catchments’, generated around each vil-
lage location and based on estimated transport costs [4]. Spatial accessibility
is similar to Euclidean distance functions, but instead of calculating the actual
distance from one point to another, the shortest cost distance (or accumulated
transportation cost) from each cell to the nearest source cells is determined
(Fig. 1).5 The resulting catchment polygons were then used as the base unit for village-level data in subsequent analysis. Hence, the survey data – apart from population – take, for each point, the value of the interviewed village to whose geographic location the point has lower transportation costs than to any other village location.
The units of analysis are square pixels of 50 by 50 m, i.e. 0.25 ha. To focus
on changes at the forest margins influenced by human interventions, we restrict
the analysis to those pixels that have a cost of access below the mean for the
transportation cost surface (Fig. 1). In that way, we include nearly all the agricultural area in 2000 and eliminate remote and high mountainous areas covered mostly with thick primary forest, which are outside measurable human influence.

Fig. 1. Transportation cost surface with spatial sample and approximated village borders [9]

5 For an example of spatial data integration using purely Euclidean distance measures, see [10].
At present, there are no models and test statistics available to account for
substantive spatial interaction in a qualitative dependent variable framework
[2]. To compensate for potential spatial dependence in the dependent vari-
ables, Besag’s coding scheme was used [3], also employed by [14] and [11] in
similar studies. The regular spatial sample was drawn by selecting every 5th
cell in the X and Y directions so that no selected cells are physical neighbors.
The sampling procedures allow us to apply standard estimation techniques [1]
and resulted in a dataset of 22,300 observations used for subsequent econo-
metric modeling. In addition, we include slope as a spatially lagged variable
[11, 13, 14]. These techniques help to reduce spatial autocorrelation although
they may not totally eliminate it [6].
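The regular spatial sample can be illustrated with a few lines of array code (a sketch with invented grids, not the GIS workflow actually used): pixels are first restricted to cells whose accumulated transportation cost lies below the mean of the cost surface, and every 5th cell in both directions is then retained so that no two selected cells are neighbors.

```python
import numpy as np

rng = np.random.default_rng(7)
cost = rng.gamma(shape=2.0, scale=50.0, size=(400, 600))   # transportation cost surface
landcover = rng.integers(0, 3, size=(400, 600))            # 3 land cover classes

accessible = cost < cost.mean()           # keep pixels below mean access cost

# regular sample: every 5th cell in the X and Y directions
sample_mask = np.zeros_like(accessible)
sample_mask[::5, ::5] = True
selected = accessible & sample_mask

rows, cols = np.nonzero(selected)
y = landcover[rows, cols]                 # dependent variable for the MNL model
print(len(y), "sampled pixels")
```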
The MNL has three land cover classes as categorical, unordered dependent variables. To control for potential endogeneity problems, only lagged values of time-variant independent variables such as population growth and road access are considered in the empirical applications. In addition, all variables were tested for multicollinearity. We assess the assumption of independence of irrelevant alternatives using both the Hausman and the Small-Hsiao test and can accept the null hypothesis that outcomes are independent of other alternatives. Model results are reported as raw coefficients in Table 2, with non-agricultural land as the comparison group. Overall predictive power is 88%, measured as the share of locations predicted correctly. Similar to [13], we found that wrong predictions frequently lie on the border between land use classes, which is likely to be related to spatial errors in the source data and artifacts inherent in our technique of data integration.
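The reduced-form multinomial logit itself can be estimated with standard software; the sketch below uses statsmodels' MNLogit on synthetic covariates. Names such as "elev" or "slope_lag" are placeholders for the survey and GIS variables of the paper, not its actual data set.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 22300
X = pd.DataFrame({
    "elev": rng.normal(600, 200, n),        # altitude
    "slope_lag": rng.normal(10, 5, n),      # spatially lagged slope
    "rain_mean": rng.normal(1800, 300, n),  # mean rainfall
    "pop_lag": rng.normal(200, 80, n),      # lagged village population
})
# categorical land use: 0 = non-agricultural (comparison group), 1 = mixed, 2 = paddy
y = rng.integers(0, 3, size=n)

model = sm.MNLogit(y, sm.add_constant(X))
result = model.fit(disp=False)
print(result.summary())

# predictive power: share of pixels whose most probable class matches the observed one
pred = np.asarray(result.predict(sm.add_constant(X))).argmax(axis=1)
print("correctly predicted:", (pred == y).mean())
```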
Coefficients for the geophysical variables are mostly significant at the 1% level and show the expected signs with high predictive power. Agriculture is more likely at lower altitudes, on flatter land, and on more suitable soils. Higher amounts and lower variance of rainfall increase the likelihood of paddy compared to the other two categories. Surprisingly, access to all-year roads did not have a significant effect on the probability of a certain land use class; we assume that this is due to the relatively large areas under agricultural use far away from the all-year road network. Earlier introduction of mineral fertilizer and more irrigated area increase the likelihood that pixels are under paddy production. Lagged population does not seem to have an influence on the amount of area cultivated, as it was probably outweighed by effects of agricultural intensification. In addition, the majority of migrants settled in areas with high proportions of land suitable for paddy cultivation; therefore, lagged population has little influence on the amount of land used for cultivation. The dummy on ethnic composition is significant at the 5% level and has a strong negative effect on the probability of paddy land. A more fragmented landscape in an earlier period decreases the likelihood of present agricultural uses, and a significant amount of fragmented agricultural plots regenerated into non-agricultural uses. Finally, forest protection has a strong effect on the likelihood of observing both mixed agriculture and paddy land.
description                                    proxy
1. Earlier introduction of fertilizer          introduction of NPK 5 years earlier
2. Forest protection                           (a) protection of existing primary forest
                                               (b) protection on slopes > 15 degrees
3. Earlier introduction of fertilizer          scenarios 2 and 3 combined
   and forest protection
Source: authors.
5 Simulation Results
                        simulated prediction
base prediction         mixed agriculture   paddy   non-agricultural land     total
mixed agriculture             347.5          21.9            5.0             374.3
paddy                           0.0          88.0            0.0              88.0
non-agricultural land          12.7           0.4          940.6             953.8
total                         360.2         110.3          945.6           1,416.1
Source: own calculations; numbers reported in km².
Fig. 2. Prediction maps of policy scenarios on land use compared to base prediction
Acknowledgments
We are grateful to Manfred Zeller and Regina Birner for useful comments and
suggestions on previous versions of this paper. The research was funded by
the German Ministry of Economic Development and Cooperation (BMZ) un-
der the Tropical Ecology Support Programme (TÖB) of the German Agency
for Technical Cooperation (GTZ). The writing of this chapter was sup-
ported by the German Research Foundation (DFG) under the Emmy Noether-
Programme.
References
12. Gerald C. Nelson and Jacqueline Geoghegan (2002) Deforestation and land use change: sparse data environments. Agricultural Economics, 27(3):201–216.
13. Gerald C. Nelson, Virginia Harris, and Steven W. Stone (2002) Deforestation, land use, and property rights: Empirical evidence from Darien, Panama. Land Economics, 77(2):187–205.
14. Gerald C. Nelson and Daniel Hellerstein (1997) Do roads cause deforestation? Using satellite images in econometric analysis of land use. American Journal of Agricultural Economics, 79:80–88.
15. Policy Division Committee on Global Change Research NRC (National Re-
search Council), Board on Sustainable Development (1999) Global Environ-
mental Change: Research Pathways for the Next Decade. National Academy
Press, Washington, DC.
16. StataCorp (2003) Stata Statistical Software: Release 8.0. Stata Corporation,
College Station, Texas.
Kriged Road-Traffic Maps
1 Introduction
A common and difficult problem in large cities with heavy traffic is the prediction of traffic jams. In this paper, a first step towards mathematical traffic
forecasting, namely the spatial reconstruction of the present traffic state from
pointwise measurements is briefly described. For details, we refer to [1], where
models of stochastic geometry and geostatistics are used to spatially represent
the traffic state by means of velocity maps. A corresponding Java software that
implements efficient algorithms of spatial extrapolation is developed; see [5].
To illustrate our extrapolation method, we use real traffic data originating
from downtown Berlin. It was provided to us by the Institute of Transport
Research of the German Aerospace Center (DLR). Approximately 300 test
vehicles (taxis) equipped with GPS sensors transmit their geographic coor-
dinates and velocities to a central station at regular time intervals ranging from 30 s up to 6 min; see Fig. 2. Thus, a large data base of more than 13 million positions has been formed since April 2001; see Fig. 1.
In the first stage of our research, only a smaller data set (taxi positions on
all working days from 30.09.2001 till 19.02.2002, 5.00–5.30 pm, moving taxis
only) was considered. Furthermore, the observation window was reduced to
downtown Berlin to avoid inhomogeneities in the taxi positions.
The main idea of the extrapolation technique described in Sects. 2 and 3
below is to interpret the velocities of all vehicles at a given time t as a realization of a spatial random field V (t) = {V (t, u)}, where V (t, u) is the traffic velocity vector at location u ∈ R² and time instant t ≥ 0. The goal is to analyze
the spatial structure of these random fields of velocities in order to describe
the geometry of traffic jams. Since V (t, u) can be measured just pointwise at
some observation points u1 , . . . , un , a spatial extrapolation of the observed
data is necessary. Notice that the velocities strongly depend on the location
and the direction of movement, e.g. the speed limits and consequently the
mean velocities are higher on highways than in downtown streets.
In what follows, the data from a given time interval, i.e. [5.00, 5.30] pm,
will be taken for extrapolation. Keeping this in mind, we shall omit the time
parameter t in further notation.
The extrapolation method described in Sects. 2 and 3 has been imple-
mented in Java, where a software library has been developed comprising the
estimation and fitting of variograms as well as the ordinary kriging with mov-
ing neighborhood; see [5]. As far as it is known to the authors, this is the
first complete implementation of such kriging methods in Java. Much atten-
tion was paid to the efficient implementation of fast algorithms. In contrast to
classical geostatistics operating with relatively small data sets, this efficiency
is of great importance for larger data sets with more than 10,000 entries; see
[1] for details.
In Sect. 4, a numerical example is discussed which shows how the devel-
oped extrapolation technique can be applied to directional traffic data. Some
structural features of the resulting velocity maps (see Figs. 5 and 6) are also
discussed. In Sect. 5, this is combined with a statistical space-time analysis of
polygonal road-traffic trajectories which have been extracted from the original
traffic data. For example, it turns out that the distribution of the number of
segments in these traffic trajectories can be fitted quite well by a geometric dis-
tribution. The directional distribution of the segments reflects the anisotropy
of the street system of downtown Berlin, where the distribution of segment
lengths is demonstrably non-normal. Furthermore, the distributions of veloc-
ity residuals, i.e. the deviations from their means, show interesting skewness
properties which depend on the considered classes of low, medium, and high
mean velocities, respectively. A short outlook to simulation and prediction of
future traffic states is given in Sect. 6.
2 Random Fields
To model traffic maps, non-stationary random fields composed of a determi-
nistic drift and an intrinsically stationary random field of order two (residual)
are used. See e.g. the monographs [4] and [6] for details.
Let X = {X(u), u ∈ R2 } be a non-stationary random field with finite
second moment EX 2 (u) < ∞, u ∈ R2 . Then, X(u) can be decomposed into a
sum X(u) = m(u) + Y (u), where m(u) = EX(u) is the mean field (drift) and
Y (u) = X(u) − m(u) is the deviation field from the mean or residual. Assume
that {Y (u)} is intrinsically stationary of order two. Denote by
$$\gamma(h) = \frac{1}{2}\, E\big[ (Y(u) - Y(u+h))^2 \big] \qquad (1)$$
its variogram function. In practice, the field X can be observed in a compact
(mostly rectangular) window W ⊂ R2 . Let x(u1 ), . . . , x(un ) be a sample of
observed values of X, $u_i \in W$ for all $i$. The extrapolation method described in Sect. 3 yields an "optimal" estimator $\hat{X}(u)$ of the value of $X(u)$ for any $u \in W$, based on the sample variables $X(u_1), \ldots, X(u_n)$.
is formed and its kriging estimator $\hat{Y}^*(u)$ is computed. Finally, the estimator $\hat{X}(u)$ is given by
$$\hat{X}(u) = \hat{m}(u) + \hat{Y}^*(u)\,. \qquad (3)$$
If we suppose that the drift is known, i.e. $\hat{m}(u) = m(u)$ for all $u$, then we have exact values of the deviation field $Y(u_1), \ldots, Y(u_n)$, since in this case
$$\hat{Y}(u) = \sum_{i=1}^{n} \lambda_i\, Y(u_i)\, \mathbf{1}\{u_i \in A(u)\}\,. \qquad (4)$$
The estimation involves only those sample random variables Y (ui ) that are
positioned in the “neighborhood” A(u) of u, i.e. if ui ∈ A(u). Being an arbi-
trary set, this moving neighborhood A(u) contains a priori information about
the geometric dependence structure of the random field Y . For instance, it
could be designed to model the formation of traffic jams; see Sect. 4.
Unbiasedness of the estimator introduced in (4) and minimization of its variance lead to the following conditions on the weights λi: for all i = 1, . . . , n with ui ∈ A(u) it holds that
$$\sum_{j=1}^{n} \lambda_j\, \gamma(u_j - u_i)\, \mathbf{1}\{u_j \in A(u)\} + \mu = \gamma(u - u_i)\,, \qquad (5)$$
$$\sum_{j=1}^{n} \lambda_j\, \mathbf{1}\{u_j \in A(u)\} = 1\,.$$
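The system (5) can be solved numerically for one prediction location as sketched below. Purely for illustration, the moving neighbourhood A(u) is taken as the k nearest observation points, the variogram is a hypothetical exponential model, and the Lagrange multiplier μ is appended to the weight vector; this is not the Java library described above.

```python
import numpy as np
from scipy.spatial import cKDTree

def variogram(h, sill=1.0, rng_par=500.0):
    """Hypothetical exponential variogram gamma(h)."""
    return sill * (1.0 - np.exp(-h / rng_par))

def ok_moving_neighbourhood(obs_xy, obs_y, u, k=12):
    """Ordinary kriging of Y at u using only the k nearest points as A(u), cf. (5)."""
    tree = cKDTree(obs_xy)
    _, idx = tree.query(u, k=k)
    pts, vals = obs_xy[idx], obs_y[idx]

    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    G = np.ones((k + 1, k + 1))
    G[:k, :k] = variogram(d)
    G[k, k] = 0.0                                  # Lagrange row/column
    rhs = np.append(variogram(np.linalg.norm(pts - u, axis=1)), 1.0)

    lam = np.linalg.solve(G, rhs)[:k]
    return lam @ vals

rng = np.random.default_rng(9)
obs_xy = rng.uniform(0.0, 5000.0, size=(300, 2))   # observation points (metres)
obs_y = rng.normal(size=300)                       # velocity residuals (placeholder)
y_hat = ok_moving_neighbourhood(obs_xy, obs_y, np.array([2500.0, 2500.0]))
```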
3.2 Variograms
In this paper, the simplest and most popular variogram estimator, that of Matheron, is used (cf. [3, 6]). It is defined by
$$\hat{\gamma}(h) = \frac{1}{2 N(h)} \sum_{i,j:\; u_i - u_j \approx h} \big( Y(u_i) - Y(u_j) \big)^2 \qquad (6)$$
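A direct implementation of the estimator (6) only needs pairwise distances, binning by lag and averaging of squared increments; a compact (and, for very large n, memory-hungry) version is sketched below with invented data.

```python
import numpy as np
from scipy.spatial.distance import pdist

def empirical_variogram(xy, y, lag_width=100.0, n_lags=20):
    """Matheron estimator (6): average 0.5*(Y(ui)-Y(uj))^2 over pairs with |ui-uj| ~ h."""
    h = pdist(xy)                                              # pairwise distances
    sq = 0.5 * pdist(y.reshape(-1, 1), metric="sqeuclidean")   # 0.5*(Y_i - Y_j)^2
    bins = np.clip((h / lag_width).astype(int), 0, n_lags - 1)
    gamma = np.array([sq[bins == b].mean() if np.any(bins == b) else np.nan
                      for b in range(n_lags)])
    lags = (np.arange(n_lags) + 0.5) * lag_width
    return lags, gamma

rng = np.random.default_rng(10)
xy = rng.uniform(0.0, 2000.0, size=(400, 2))
y = rng.normal(size=400)
lags, gamma_hat = empirical_variogram(xy, y)
```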
The mean field {m(u)} can be estimated from the data by various methods
ranging from radial extrapolation to smoothing techniques such as moving
average and edge preserving smoothing. In what follows, the moving average
is used because of its ease and computational efficiency for large data sets. By
moving average, the value m(u) is estimated as
$$\hat{m}(u) = \frac{1}{N_u} \sum_{u_i \in W(u)} X(u_i) \qquad (9)$$
In the previous sections, we supposed that the drift m(u) is explicitly known.
However, if it has to be estimated from the data, the theoretical background
for the application of the kriging method breaks down (cf. [3], pp. 122–125,
[4] p. 72, [6], p. 214). Nevertheless, practitioners continue to use the ordinary
kriging of residuals with estimated drift based on the data $y^*(u_i) = x(u_i) - \hat{m}(u_i)$, $i = 1, \ldots, n$, legitimized by its ease and satisfactory results.
Fig. 3. Mean field $\hat{m}(u)$ of data set 2
Fig. 6. Velocity field $\hat{X}(u)$
Fig. 7. Traffic jams: $\hat{X}(u) \leq 15$ kph
In Fig. 7, areas with velocities $\hat{X}(u) \leq 15$ kph are marked grey. Some
of these regions might be caused by traffic jams, others are regions with low
average velocities. Indeed, the most likely velocity value in downtown Berlin is about 20 kph, as can be seen in Fig. 12.
bring an extra insight into the structure of traffic data. In particular, they
help us to explain some features of anisotropy and spatial correlation which
we already mentioned in Sect. 4. For a more detailed treatment of the subject,
see [2].
If we think about the way the traffic data are collected, we understand that the locations where the velocities are measured cannot be regarded as deterministic. Moreover, they are stochastically dependent. In fact, each test vehicle follows a route that consists of a random number of segments, each segment connecting two locations where consecutive GPS signals were sent; see Fig. 8. The histogram of the number of segments in the taxi routes is shown in Fig. 9. It turns out that this histogram can be well approximated by a geometric distribution with parameter p = 0.9365064, which is the probability of enlarging a route by a new segment. Furthermore, the geometry of the taxi routes explains the form of the variogram anisotropy mentioned in Sect. 4.
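Fitting the geometric distribution to the observed numbers of segments amounts to a one-line maximum-likelihood estimate: under the parameterisation used here (a route of k segments has probability (1−p)p^(k−1), with p the probability of adding a further segment), p̂ = 1 − 1/k̄. The sketch below uses simulated counts, not the Berlin data.

```python
import numpy as np

rng = np.random.default_rng(11)
# simulated numbers of segments per taxi route (the real counts come from the GPS records)
segments = rng.geometric(p=1.0 - 0.9365, size=5000)   # numpy's p is the stopping probability

k_bar = segments.mean()
p_hat = 1.0 - 1.0 / k_bar     # probability of enlarging a route by one more segment
print(round(p_hat, 4))
```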
In particular, the distribution of the angles between the movement direc-
tion of a vehicle and the eastward direction in Fig. 10(a) reflects the distri-
bution of typical street directions with heavy traffic in downtown Berlin. The
majority of main roads go east or west, which corresponds to angles of 0◦, 180◦, and 360◦, respectively. This is certainly the reason for the charac-
ter of zonal anisotropy of the variograms in Fig. 4. Figure 10(b) shows that
the distribution of segment lengths is demonstrably non-normal. Furthermore,
with probability of ca. 0.9, the distances between two subsequent GPS signals
in the taxi routes do not exceed 1000 m. It is clear that the velocities at two
positions within this distance are correlated. The opposite statement is also
true. As has already been mentioned in Sect. 4, the velocities of two cars at a distance of more than $3\sqrt{r_1^2 + r_2^2} \approx 945$ m from each other are almost independent.
Fig. 10. Histograms of segment directions (in degrees) and lengths (in m)
The histogram in Fig. 11 shows that the distribution of velocity residuals can be fitted well by a normal distribution. Nevertheless, a more detailed
statistical inference shows that the distribution of velocity residuals depends
on the value of mean velocity. One reason for this is that the sum of the
residual and the mean has to be non-negative. Figures 13, 14 and 15 show
the histograms of the velocity residuals measured at locations with mean
velocities (in kph) belonging to three disjoint classes: [15, 20), [25, 30) and
[40, 45), respectively.
If we add the mean velocity values to their residuals, we see that most velocities in downtown Berlin do not exceed 60 kph. The histogram of the velocities themselves is given in Fig. 12, which shows that the most likely velocity value in downtown Berlin (i.e., the mode of the empirical velocity distribution) is about 20 kph. This explains the dominance of low velocity values in the mean field $\hat{m}$ and the threshold maps in Figs. 3 and 7.
6 Outlook
The spatial extrapolation and statistical space-time analysis of traffic data
considered in the present paper is an important step towards stochastic mod-
elling, simulation and prediction of future road-traffic states. Our results can
be used to construct a Markov-type simulator by means of which future routes
of test vehicles can be generated, where the choice of the starting configura-
tion depends on the actually measured traffic situation. In particular, when
sampling the velocity residuals from histograms as given in Figs. 13, 14 and
15, the mean velocity field $\{\hat{m}(u)\}$ will be updated with the velocities observed recently, say on the given day. For example, suppose that significantly
larger velocities than usually have been observed at the considered day in a
certain neighborhood of location u. In this case, the velocity residual at loca-
tion u will be sampled from a histogram which corresponds to a larger class
1
of mean velocities than the “historical” value m(u). Further details concerning
our simulation algorithms can be found in [2].
Then, using the extrapolation technique described in Sects. 2 and 3, ve-
locity maps based on both the measured and simulated traffic data can be
computed. To evaluate the quality of these maps, they are compared with cor-
responding velocity maps computed exclusively from measured traffic data.
The comparison is based on morphological distance measures for digital image
data. These issues will be discussed in a forthcoming paper.
Acknowledgement
This research was supported by the German Aerospace Center (DLR) through
research grant 931/69175067. The authors are grateful to Reinhart Kühne, Pe-
ter Wagner and their co-workers from the DLR Institute of Transport Research
for suggesting the problem as well as for stimulating and fruitful discussions
on the subject.
References
Bodo Ahrens
1 Introduction
Nowadays, limited area numerical weather prediction models provide meteo-
rological forecasts with horizontal grid spacings of only a few kilometers, and grid spacings will decrease further in the coming years owing to progress in high-performance computing [9, 26]. Precipitation forecasts are of primary in-
terest for both researchers and the public. For example, in flood forecasting
systems precipitation is the crucial input parameter, especially in mountain-
ous watersheds. Like the grid spacing of weather prediction models the grid
spacing of regional climate models is decreasing.
Precipitation forecasts have to be evaluated and errors have to be quanti-
fied. The most important evaluation method is comparison of meteorological
simulation results with meteorological observations. But, before errors can be
quantified, two decisions have to be made. First, a set of useful statistics has to be chosen; this is not the issue of this paper, and the interested reader is referred to, for example, Murphy and Winkler [23], Wilks [31], and Wilson [32]. For illustration we apply a small set of simple continuous statistics.
Our focus is on the second problem: What is the observational reference?
Rain station data is commonly preferred to remote sensing data, in partic-
ular radar data, because of the relatively large measurement uncertainties
[e.g., 1, 14, 34]. Is it reasonable to compare precipitation forecasts valid for
grid boxes several kilometers in diameter with sparsely distributed rain station data valid for small areas of ∼1000 cm²? This is often done in an operational framework since it can be implemented by simple means. This area-to-point evaluation has been criticized, and it has been proposed to perform some upscaling or regionalization of the station data up to the forecast grid resolution [29, 12].
Regionalization can be done by some fitting approach yielding a precipita-
tion analysis. For example, a recent analysis of precipitation for the European
Alps by Frei and Häller [17] has a time resolution of 24 h and a spatial grid of about 25 km, with regionally even lower effective resolution depending on the available surface station network. This type of analysis is useful for model validation at the 100 km-scale [see, e.g., 5, 16, 18], but not at the 10 km-scale or below.
Analysis is a smoothing regionalization. This degrades its usefulness for higher-moment statistics if the network is not dense enough. What counts as "dense enough" critically depends on the applied pixel support (is a pixel value representative for boxes with a diameter of ∼100, 10, or 1 km?) and on the analysis scheme. Another regionalization approach is stochastic simulation of precipitation fields conditioned on the available station data. The idea is that the data are respected and the spatial variability is represented more realistically than in an analysis. The forecast can then be compared with an ensemble of simulated fields. The ensemble mean field is an analysis, but the mean higher-moment statistics do not have the same values as when the forecast is compared with the analysis alone.
This paper applies regionalization and performs area-to-point or area-to-
area comparison in evaluation of daily precipitation forecasts. The forecasts to
be evaluated by example are the forecasts of the NWP model ALADIN that is
operational at the Austrian national weather service with 10 km grid spacing.
ALADIN, the forecast days, and the available station data are introduced in
the next section. Section 3 discusses the applied evaluation approaches and
subsequent sections discuss the respective results. Finally, some concluding
remarks will be given.
3 Evaluation Methods
In the following we discuss the evaluation procedures applying a minimal set
of useful statistics. Most important is the mean distance
bias = (1/N_x) Σ_{x=1}^{N_x} (m_x − d_x), with the model forecast field m_x,
the observational field d_x, and the space index x = 1, . . . , N_x. Additional
statistics considered are the coefficient of determination R² (the square of the
linear product-moment correlation; possible values are between 0 and 1 with optimal
value 1), and the ratio of spatial variances SPREX = σ_m²/σ_d² (optimal value 1).
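For concreteness, the following minimal sketch computes the three scores for collocated forecast and reference fields given as flat arrays; it is an illustration only, not the paper's code, and the array names and the toy data are assumptions.

import numpy as np

def evaluation_scores(m, d):
    # m, d: 1-D arrays of equal length (forecast and reference, e.g. mm/d).
    # bias  = mean(m - d); R^2 = squared product-moment correlation;
    # SPREX = var(m) / var(d), the ratio of spatial variances.
    m = np.asarray(m, dtype=float)
    d = np.asarray(d, dtype=float)
    bias = np.mean(m - d)
    r = np.corrcoef(m, d)[0, 1]
    sprex = np.var(m) / np.var(d)
    return bias, r**2, sprex

# hypothetical example with 200 collocated forecast/reference values
rng = np.random.default_rng(0)
d = rng.gamma(shape=2.0, scale=5.0, size=200)
m = d + rng.normal(0.0, 3.0, size=200)
print(evaluation_scores(m, d))

In the results below the bias is reported relative to the reference mean, in percent.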
The applied evaluation methods are comparisons of model fields with (a)
station data (i.e., effectively with point data) and (b) with regionalized and
box averaged precipitation fields (i.e., with area data). Comparison with point
data is often done and, for example, standard in most European Meteorologi-
cal Services [see 10, 32]. It is simple in implementation. Two variants are com-
mon practice: direct comparison of the station data with the closest model grid
box values and thus performing an area-to-point comparison, or interpolation
of the model fields to the station locations and thus performing a point-to-
point comparison. In fact this interpolation smoothes the forecast field that is
eligible since single box values should not be interpreted [2, 20]. But, on the
other hand a simple interpolation like the often applied bi-linear interpolation
assumes that the precipitation field is continuous and introduces no additional
information. Consequently, interpretation of the interpolated values as point
data is delusive. The effective resolution of ALADIN is not the issue here,
and the raw forecasts of ALADIN with about 10 km horizontal resolution are
evaluated by example.
Comparison of model grid box output with regionalized rain fields with ap-
propriate pixel support is an area-to-area comparison and respects the scales.
The second potential advantage of regionalization is that station representa-
tivity problems (clustering of stations around larger cities or along valleys)
can principally be compensated. Here, we call regionalization by some opti-
mization involving data-fitting techniques (regression, polynomial fitting,
spline functions, kriging, etc.) an analysis, and the estimated field an analysis
field. A problem with analyses is that it is difficult to estimate analysis errors.
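As an illustration of the area-to-area idea, the sketch below regionalizes station data onto a forecast grid with a simple block-averaged inverse-distance weighting. It is only a crude stand-in for the block kriging and IDW analyses discussed below, and all names, units and the sub-grid size are assumptions made for the example.

import numpy as np

def idw_block_analysis(xy_sta, z_sta, box_centres, half_width, power=2.0, subgrid=5):
    # Inverse-distance-weighted interpolation of station values onto a
    # sub-grid of each forecast box, then averaging over the box
    # (a crude stand-in for a block-averaged analysis).
    xy_sta = np.asarray(xy_sta, dtype=float)        # (n, 2) station coordinates [km]
    z_sta = np.asarray(z_sta, dtype=float)          # (n,) daily precipitation [mm/d]
    box_centres = np.asarray(box_centres, float)    # (m, 2) forecast box centres [km]
    offs = (np.arange(subgrid) + 0.5) / subgrid * 2.0 * half_width - half_width
    sub = np.array([(dx, dy) for dx in offs for dy in offs])  # sub-pixel offsets
    out = np.empty(len(box_centres))
    for k, c in enumerate(box_centres):
        pts = c + sub                               # sub-pixels of box k
        dist = np.linalg.norm(pts[:, None, :] - xy_sta[None, :, :], axis=2)
        w = 1.0 / np.maximum(dist, 1e-6) ** power
        w /= w.sum(axis=1, keepdims=True)
        out[k] = (w @ z_sta).mean()                 # block average for box k
    return out

For a 10 km forecast grid one would set half_width = 5.0; the resulting block values can then be compared with the forecast boxes using the scores defined above.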
[Figure 1: panels "HZB analysis, block 10 km" for 06.08.2002 and 07.08.2002; axes easting [km] vs. northing [km], shading in mm/d]
Fig. 1. Daily analyses by ordinary block Kriging of HZB rain station data in Austria
for the four days investigated in this paper. The dots in the upper left panel indicate
the station locations. Units are mm/d
Fig. 2. Operational NWP model forecasts for the days shown in Fig. 1. Units are
mm/d
[Figure 3: panel "TAWES analysis, block 10 km, 12.08.2002"; axes easting [km] vs. northing [km], shading in mm/d]
Fig. 3. Daily analysis by ordinary block Kriging of TAWES rain station data. Units
are mm/d
In a first evaluation step the comparison of model grid box data with rain
station data is discussed. The nearest model grid box is used to compare with
the point observations ignoring the corresponding error in location. Figure 5
and Table 1 present the comparison with the HZB and TAWES data set.
There is a large scatter in the results depending on the applied reference, the
TAWES or the HZB data set. For example, the relative bias is +4% in comparison
with TAWES and −4% in comparison with HZB data on August 7th. On the
same day the forecast explains 39% of the TAWES data variability (R² =
0.39) but only 20% of the HZB data variability. If compared with HZB data, the
model underestimates the field variance on the 11th by 10%, whereas it
overestimates the variance by 10% in comparison with TAWES data.
The impact of the station sample size is illustrated by the box plots in
Fig. 5. Twenty random sub-samples of 116 stations (the sample size of the
TAWES set) are drawn from the HZB set. These sub-samples are applied in
the evaluation process and the box plots show the quartiles of the twenty
evaluation results for each day and statistic. Twenty is a small number of
random sub-samples, but enough to illustrate the effects. The range of these
[Figure 5: three box-plot panels, bias [%], R^2 [%] and SPREX [1], for the days 6., 7., 11. and 12. August 2002]
Fig. 5. Comparison of NWP model forecasts against station data with the symbols
“+” indicating the evaluation statistics of forecast vs HZB station data for the four
days and with “×” indicating the comparison results against the TAWES data.
The box plots show the quartiles (the whiskers indicate the range) of results of
comparison against 20 random subsets with sample size 116 from the HZB data
results is substantial. For example, the relative bias range is about 20% for the
days with small bias and even 43% for August 6th. Interestingly, the TAWES
results are not within the interquartile range of the sub-sampling results most
of the time, and the difference is systematic (except for the bias on August 12th).
This extremeness can be explained by the more homogeneous distribution of the
TAWES stations (cf. Fig. 3) in comparison with the sub-sampled HZB stations
and, more importantly, by the different measurement system. The problem of
rain measurements cannot be discussed further here; the interested reader is
referred to [25, 33].
Often, instead of the nearest model box value, a bi-linearly interpolated value
is compared with the station observations, as discussed above. The bias results
differ from those of the nearest-neighbor comparison, but stay within the scatter
range suggested by the box plots. The results of the pattern comparison
improve slightly but systematically (by up to 5%). This is not surprising since
the model data is implicitly smoothed by the interpolation and thus the clas-
sic "double-penalty problem" [small location discrepancies of sharp peaks are
penalized twice, cf. 6] is reduced. Smoothing reduces the model variance and
thus the values of SPREX decrease (by 0.05 for the 7th to 0.2 for the 12th in
the case of TAWES data).
Table 2. Comparison of NWP forecast fields with different analyses and of analyses
with analyses. Analyses considered are done by Kriging (OK) or inverse distance
weighting interpolation (IDW) based on different sets of rain station data (TAWES
or HZB). The last row shows the mean statistics of twenty comparisons of analyses
based on HZB sub-sets versus the total HZB data set
                      bias [%]         R² [%]         SPREX [1]
NWP ∼ OK_HZB          73/−14/−0/−11    2/23/21/47     1.7/0.5/2.2/1.0
NWP ∼ OK_TAWES        70/−14/−1/−8     1/25/30/51     1.8/0.5/2.7/1.1
NWP ∼ IDW_HZB         73/−16/1/−12     2/23/23/50     2.1/0.5/3.1/1.2
NWP ∼ IDW_TAWES       71/−11/−2/−8     1/24/32/54     2.0/0.5/3.2/1.3
OK_TAWES ∼ OK_HZB     2/0/0/−4         84/93/62/87    0.9/1.0/0.8/0.9
IDW_TAWES ∼ OK_HZB    2/−4/2/−3        84/92/60/83    0.8/0.9/0.7/0.8
OK_SS ∼ OK_HZB        −2/2/−1/1        79/93/55/81    0.8/1.0/0.8/0.9
The impact of the analysis scheme is smaller than the impact of the data sample size.
This is evident if, for example, the coefficient of determination is more closely
inspected for the third analysis day, August 11th. Both analyses based on
TAWES data explain only about 60% of the variance of the HZB analysis.
This day is less dominated by large-scale precipitation patterns and shows
the smallest field variance, but has the largest small-scale variability. Small
shifts in the analyzed small-scale pattern lead to more distinct double-penalty
effects than for the other days. These shifts are less influenced by the
analysis method (as the similar R² values indicate) than by the smaller data
sample size. This conclusion is supported by the last row in Table 2, where
the mean error of analyses based on random HZB subsets with sample size
116, as in the TAWES set, is shown. Nevertheless, a value of 60% is sufficient
when compared to the 21% explained by the NWP. What is missing is a
possibility to judge these 21% against the 23% for the 7th, where the
precipitation field is far easier to analyze, as the more than 90% explained
variance indicates, and thus should also be easier to forecast in the sense of
the applied statistics.
Fig. 6. Comparison of NWP model forecasts against analyses. The symbols “+”
indicate the evaluation statistics of forecast vs HZB based analysis for the four
days and “×” indicates the results vs TAWES based analysis. The box plots show
the quartiles (the whiskers indicate the range) of results of comparison against 20
analyses based on random HZB subsets with sample size 116
Fig. 7. Same as Fig. 6, but the box plots summarize the comparison of NWP
forecasts against precipitation simulations conditioned on TAWES data
7 Conclusions
Acknowledgements
1 Introduction
At 16:00 UTC on October 23, 1994, 340 kg of perfluoromethylcyclohexane
(PMCH) were released into the air from Monterfil in Brittany, France. Air
samples were collected at 168 stations in 17 European countries for a period
of 90 hours from the start of the release. The European Tracer Experiment
(ETEX) was initiated with the aim of collecting data for validating long range
transport and dispersion models used for emergency response applications
[4, 13]. Another release was made a month later under different meteorological
conditions.
Although the data have been used in numerous mechanistic atmospheric
dispersion studies, only one previous geostatistical analysis has been performed in
order to provide some basis for a spatial interpolation [5]. Such an analysis
applied fractional Brownian motion models in order to summarise the spatial
correlation structure in terms of the power exponent of the variogram, which
is directly related to the fractal dimension.
Because of the challenging nature of the data set, which is highly skewed
and required a logarithmic transformation in the Dubois et al. study, this
paper attempts to apply more robust variography to the raw data in order to
extract some order out of the chaos. We use the term "robust" in this paper
specifically to refer to the stability of the variogram in the presence of a strong
direct proportional effect, which characterises the ETEX-1 dataset.
Equation (3) shows that for Y (s) that is intrinsically stationary, the var-
iogram of Y (s) should be equivalent to the variogram of Z(s). However, in
practice, the sample variogram estimator is typically based on
which would only be valid for data with constant mean. Again, substituting
(1) into (4) would give the following expression for the variogram
Equation (6) shows that if the mean is constant everywhere, μ(s + h) = μ(s),
so the variogram for Z(s) should be equivalent to that for Y (s) and (4) would
thus apply. However, if the mean is not constant, then we would derive a var-
iogram estimator that exhibits a quadratic growth with h, which would
make the estimator invalid.
One method of incorporating a non-constant mean is to estimate a mean
surface and work with residuals assumed to be intrinsically stationary, e.g. me-
dian polish kriging [1]. Nevertheless, in practice, data showing a strong pro-
portional effect (heteroscedasticity) might still require a rescaling of the var-
iogram. This is because the dispersion of the data then depends on its local
mean.
Since heteroscedasticity is commonly associated with highly skewed data,
one common approach is to transform the data to a logarithmic scale and per-
form variography and kriging in log space before back transforming, e.g. log-
normal kriging [2, 8]. Another approach would be to use alternative measures
Robust Spatial Correlation Analysis of the ETEX-1 Tracer Data 139
such as the relative variogram in order to account for the proportional effect
via some form of rescaling of the sample variogram values.
Both techniques of transformation and use of the relative variogram are
related, as demonstrated by Cressie [1]. If we assume that we have spatial
regions {D_j; j = 1, . . . , n} within which the regionalised variable Y (s) is in-
trinsically stationary with mean μ_j and variogram 2γ_{Z_j}(h) for each region
j = 1, . . . , n, then using the δ method of Kendall and Stuart [9] we can apply
a transformation for the variable Z(s) as follows:

Y_j(s) = g(μ_j) + g′(μ_j)[Z_j(s) − μ_j] + g″(μ_j)[Z_j(s) − μ_j]²/2! + . . . ;   s ∈ D_j   (8)
It follows from (7) that if we were to take the increments for Y j (s) then
We can then define the variogram for Y j (s) by applying the variance op-
erator on both sides of (9) to derive
(10) is similar to the general form of the local relative variogram, defined by
various authors [7, 8] in the following manner:

2γ_{RY_j}(h) = 2γ_{Z_j}(h) / μ_j^n   (11)
The above takes each squared difference between pairs of sample values
and divides it by the square of their average, hence the term “pairwise.”
with the caveat that, although γ̂(h) ≠ Ĉ(0) − Ĉ(h), the differences should
be sufficiently small that one can apply γ̂(h) in practice. Curriero et al. [3]
contend that the "head" and "tail" values in (15) are meaningless for the
omnidirectional direction since the common practice is to count each location
twice. Hence the re-scaling of the variance is implicitly taken into account by
incorporating the mean of all the data contributing to the lag.
A common problem with using (15) to derive an approximate sample vari-
ogram is that the first few lags can result in negative values, since the number
of data used to derive the sample variance is typically greater than the num-
ber of data used to derive the sample covariance itself. Nevertheless, such a
difficulty does not exist if one scales each covariance value by the product of
the standard deviations for the head and tail values, thus giving the following
definition for the correlogram:
ρ̂(h) = Ĉ(h) / (σ̂(s) σ̂(s + h))   (16)
By explicitly accounting for the possibility that some lags contain more
variable values than others, the correlogram is likely to suffer least from the
combination of heteroscedasticity and clustering [? ].
Yet another variation of (11) presented by Isaaks and Srivastava [7] is the
general relative variogram, defined as follows:
γ̂_GR(h) = γ̂(h) / [ (1/(2N_h)) Σ (z(s) + z(s + h)) ]²   (17)
3 Data Description
The ETEX-1 data consist of 155 raw concentration measurements of PMCH
(in units of ng/m³) from the first ETEX release recorded at 26 different times,
where λi are weights assigned to the observed data z(si ) that will determine
their role in defining the value taken by the variable at unsampled location s0 .
The main interest in applying geostatistical techniques is that these weights
are computed from a model of the spatial correlation of the analysed phe-
nomena. Hence, unlike other interpolators (for an overview of interpolation
methods refer to Lam [10]), geostatistics takes the spatial structure of the
variable explicitly into account.
A useful opening move in any geostatistical study is to derive the omnidi-
rectional semivariogram of the variable under study to determine the degree
of spatial correlation, often called a "structural analysis". As noted earlier, the
heteroscedasticity observed in the data can make the inference of a range and
sill very difficult using conventional measures such as the sample variogram.
Although the shape of the semivariogram by itself may not be affected by the
heteroscedasticity if the mean of the sample values is roughly the same for all
lags, the ordinary kriging variance, however, is dependent on the magnitude
of the variogram.
For the sake of brevity, the analyses will now be presented based on the
data for t = 45 hours. Figure 9 shows six different spatial correlation mea-
sures for this time slice, calculated using an angular tolerance of 90 degrees
(omnidirectional). Twelve lags were calculated at a lag increment of roughly
100 km, with a lag tolerance of about 50 km. This ensured that there were at
least 30 pairs for each lag distance.
The functions shown in Fig. 9 also include the sample general relative
variogram (17), pairwise relative variogram (13), non-ergodic covariance (14),
non-ergodic correlogram (16), and the semimadogram (a measure of the mean
absolute difference), defined by the following:

2γ̂_M(h) = (1/N_h) Σ |z(s + h) − z(s)|   (19)
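The following sketch computes these lag-based measures for scattered data under the textbook definitions cited above; it is an illustration only (not the authors' code), and the variable names, lag settings and the small guard against zero pair means are assumptions.

import numpy as np

def spatial_measures(xy, z, lag_width=100e3, n_lags=12, min_pairs=30):
    # Omnidirectional, lag-binned measures: semivariogram, semimadogram,
    # pairwise and general relative variograms, non-ergodic covariance
    # and correlogram. Returns one row per usable lag.
    xy, z = np.asarray(xy, float), np.asarray(z, float)
    i, j = np.triu_indices(len(z), k=1)
    h = np.linalg.norm(xy[i] - xy[j], axis=1)
    bins = np.minimum((h // lag_width).astype(int), n_lags - 1)
    rows = []
    for b in range(n_lags):
        m = bins == b
        if m.sum() < min_pairs:                     # require enough pairs per lag
            continue
        zi, zj = z[i][m], z[j][m]
        dz = zi - zj
        pair_mean = np.maximum((zi + zj) / 2.0, 1e-12)
        gamma = 0.5 * np.mean(dz ** 2)                      # semivariogram
        mado = 0.5 * np.mean(np.abs(dz))                    # semimadogram (19)
        pairwise = 0.5 * np.mean(dz ** 2 / pair_mean ** 2)  # pairwise relative
        general = gamma / np.mean(pair_mean) ** 2           # general relative (17)
        cov = np.mean(zi * zj) - zi.mean() * zj.mean()      # non-ergodic covariance
        rho = cov / (zi.std() * zj.std())                   # correlogram (16)
        rows.append((h[m].mean(), gamma, mado, pairwise, general, rho))
    return np.array(rows)

The defaults mirror the analysis set-up used here (twelve lags of roughly 100 km with at least 30 pairs per lag).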
It can be noted that both the madogram and the variogram provide a poor
measure of the spatial correlation for this highly skewed data set, resulting in
an erratic sample variogram. Both measures rely only on the mean difference
(squared in the case of the variogram) between two data points located a distance h
apart, so no rescaling of the variogram is performed commensurate with the
proportional effect.
The madogram also forms the basis for other well known “robust” variog-
raphy techniques, e.g. that of Cressie and Hawkins [2] using the fourth power
of the square root of the madogram; and that of Genton [6] based on the k-th
quantile of the madogram. However these techniques rely on stability of the
variogram based on deviations from a primarily Gaussian distribution, and
are not expected to fare very well in the presence of a strong proportional
effect.
From Fig. 9, for the other four measures, which scale as some function of
the mean value, some semblance of structure can be inferred, giving an omni-
directional range for the tracer concentration values of around 400,000 m. The
pairwise relative variogram results in a usable albeit smooth model because
values at each pair are rescaled by the mean of the values contributing to
that pair. The correlogram appears to be less erratic than the covariance, due
to the advantage of having the values standardised by the data variances.
Both the correlogram and the pairwise relative variogram result in smoother
spatial correlation structures compared to the general relative variogram and
covariance.
The above results confirm that some sort of spatial dependence exists
for the concentration cloud at t = 45 hours, which was not evident from
the traditional variogram measure. Figure 10 shows the directional pairwise
relative variogram for four principal directions 0 deg, 45 deg, 90 deg, and
135 deg based on an angular tolerance of 22.5 deg. Little or no anisotropy is
evident; for the direction 135 deg a somewhat shorter range of some 300 km
is observed, and this appears to be the direction of minimum continuity. The
range in the direction of maximum continuity appears to be between 400 and
500 km.
For convenience, and as a prerequisite for spatial interpolation, we will
model the spatial variability by assuming a power law model for the sample
variogram. This is performed by fitting a line to a log–log plot of the pairwise
relative variogram values versus lag distance (see Fig. 11) and inferring the
slope of the fit.
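A minimal sketch of such a fit, assuming a power-law (fBm-type) variogram of the form 2γ(h) = c·h^{2H}; the function and variable names are hypothetical and the fractal-dimension note is the usual surface convention, not a result from this paper.

import numpy as np

def fit_power_law(lags, gamma):
    # Fit 2*gamma(h) = c * h**(2H) by least squares on the log-log plot;
    # returns the exponent H and the prefactor c.
    lags, gamma = np.asarray(lags, float), np.asarray(gamma, float)
    ok = (lags > 0) & (gamma > 0)
    slope, intercept = np.polyfit(np.log(lags[ok]), np.log(2.0 * gamma[ok]), 1)
    return slope / 2.0, np.exp(intercept)

# e.g. H, c = fit_power_law(lag_centres, pairwise_rel_variogram)
# for an fBm surface the fractal dimension is often taken as D = 3 - H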
The variance of increments for random fractional Brownian motion (fBm)
models satisfying a distribution with fractal geometry can be written as:
[Figure: four panels plotted versus lag distance [m]]
The dimensions are characteristically high, although within the same range
of values reported by Dubois et al. [5], who based their results on log–log
plots of variograms based on log transformed variables. These high dimen-
sions translate to very low intermittency exponents, corresponding to anti-
persistence, or less continuous phenomena. Such behaviour is also reflected in
the shape of the relative variograms themselves.
6 Discussion
For the ETEX-1 data, the traditional variogram estimator performs poorly in
the presence of a strong proportional effect, showing erratic behaviour at all
lags. Use of the alternative estimators which re-scale the variogram value by
some function of the mean allows us to infer large scale structure from the
Robust Spatial Correlation Analysis of the ETEX-1 Tracer Data 151
highly skewed data. This obviates the need to perform a logarithmic transform
in order to temper the impact of such skewness. The next step in the analysis
would be to perform a spatial interpolation of the concentrations based on
the inferred structural range and curvature. Use of power law (fBm) models
is one alternative.
References
1. Cressie, N. (1993) Statistics for Spatial Data, Revised edition. John Wiley &
Sons.
2. Cressie, N.A.C. and Hawkins, D.M. (1980). Robust Estimation of the Variogram:
I. Mathematical Geology, 12(2):115–125.
3. Curriero, F., Hohn, M., Liebold, A., and Lele, S. (2002). A statistical evaluation
of non-ergodic variogram estimators. Environmental and Ecological Statistics,
9(1):89–110.
4. Girardi, F., Graziani, G., van Veltzen, D., Galmarini, S., Mosca, S., Bianconi, R.,
Bellasio, R. and Klug, W. (eds.) (1998). “The ETEX Project”. EUR Report
18143 EN. Office for Official Publications of the European Communities, Lux-
embourg.
5. Dubois, G., Galmarini, S., and Saisana, M. (2005). Geostatistical Investigation
of ETEX-1: Structural Analysis. Atmospheric Environment, 39: 1683–1693.
6. Genton, M. (1998). Highly Robust Variogram Estimation. Mathematical Geol-
ogy, 30(2):213–221.
7. Isaaks, E. and Srivastava, R.M. (1989). An Introduction to Applied Geostatis-
tics, Oxford University Press, Oxford.
8. Journel, A.G. and Huijbregts, C.J. (1978). Mining Geostatistics. Academic
Press, New York.
9. Kendall, M.G., and Stuart, A. (1969). The advanced theory of statistics, Vol. 1,
3rd ed. Griffin, London.
10. Lam, N.S. (1983). Spatial interpolation methods: a review. The American Car-
tographer, 10(2):129–149.
11. Srivastava, R.M. (1987). A Non-ergodic framework for variograms and covari-
ance functions. M.Sc. Thesis, Stanford University.
12. Srivastava, R.M. and Parker, H.M. (1988). Robust Measures of Spatial Conti-
nuity, In: Geostatistics Volume 1, Proceedings of the Third International Geo-
statistics Congress, Kluwer Academic Publishers.
13. van Dop, H. and Nodop, K. (eds). (1998). ETEX: A European Tracer Experi-
ment, Atmospheric Environment 32:4089–4378.
Fuzzy Model of Soil Polygons
for Managing the Imprecision
1 Introduction
It is impossible to create a perfect representation of the world in a GIS
database, since all GIS data are subject to uncertainty [5]. No information in a
GIS database is a 100% accurate representation of reality. The impression of
certainty usually conveyed by GIS is at odds with the uncertain nature of
geographic information, a contradiction that has been acknowledged as an
important research topic for nearly two decades [4]. Inaccuracies in measurement
and errors in observation bring about imperfection in geographic information.
The information is nevertheless taken and used as if it were accurate or believed
to be true. In fact, the reliability of the information is not yet considered in
terms of level of accuracy or uncertainty. If geographic information is examined
carefully, it contains vagueness, imprecision and inaccuracy, particularly when
it represents an invisible object like soil. Soil polygon boundaries are defined
from field observations combined with human interpretation, and soil types are
classified according to geologists' expertise. For many reasons, one can say that
the soil map is one of the most imprecise maps in the world.
In Finland, mapped areas are large and field observations are relatively sparse.
Manual interpretation is therefore the only feasible alternative in creating
soil maps. Samples of soil are taken randomly by soil mapping surveyors for
soil type classification and some of these samples may be taken back to the
laboratory for detailed tests in case of inadequacies (Fig. 1).
For defining soil polygon boundaries, geologists use aerial photos, geologic
maps and topographic maps at scale 1:20,000, together with knowledge of
geomorphology. Nevertheless, there are no specific rules to define the imprecision
of these boundaries, and neither the imprecision in the data nor the expert
knowledge is documented.
The core idea in this research is to apply fuzzy modeling to the management
of expert knowledge in soil mapping. Fuzzy soil maps are then used in map-
overlay type analysis [11]. The goal is not to construct a fuzzy model of soil
layers but a fuzzy soil map presenting non-crisp soil polygon boundaries. The
map will be created at a certain scale for a certain purpose, and we believe
that a fuzzy soil layer with imprecision is a better input to the analysis than an
artificial crisp polygon map with no information about the uncertainty of the
boundaries.
2 Literature Survey
Recently, there has been much research related to fuzzy concepts.
Brown [2] carried out research on classification and boundary vagueness in
mapping presettlement forest types. In his research, he constructed a model to
test the role of classification ambiguity in affecting boundary vagueness us-
ing fuzzy concepts and Kriging, and he explained methods of determining species
memberships and interpolating membership values. Stefanakis et al. [10] con-
ducted research on incorporating fuzzy set methodologies in a Database
Management System (DBMS) repository for the application domain of GIS.
They considered that fuzzy set methodologies seemed to be instrumental in
the design of efficient tools to support the spatial decision-making process.
The results showed that Zadeh's fuzzy concepts and fuzzy set theory [12]
might be adopted for the representation and analysis of geological data. Jiang
et al. [6] proposed the application of fuzzy measures and argued that the
standardized factors of multi-criteria evaluation belong to a general class of
fuzzy measures and the more specific instance of fuzzy membership. The develop-
ment of classification algorithms that use auxiliary information in fuzzification
and fuzzy set operations to reduce uncertainty in the classification process was
researched in 2000 by Oberthür et al. [9]. The research was conducted in order
to study how to define fuzzy membership functions (FMF) and reduce classi-
fication uncertainty with hedge operators. Zhu et al. [13] introduced soil mapping
using GIS, expert knowledge, and fuzzy logic. Their scheme consisted of three
major components: a model employing a similarity representation of soils, a
set of inference techniques for deriving the similarity representation, and the use of
the similarity representation. Basically, they implemented automated soil infer-
ence under fuzzy logic. To produce a raster soil database for the study areas,
the knowledge base and the spatial data in the GIS database were combined
under the fuzzy inference engine. The output, a comparison of the soil se-
ries inferred by the Soil Land Inference Model (SoLIM) and derived from the
soil map against the field observations for the study area, showed that
SoLIM has higher correctness. The derivation of the fuzzy spatial extent was
developed by Cheng et al. [3]; three fuzzy object models and the data ex-
tracted from field observation were introduced and modeled. Software such as
FUZZEKS [1], FuzME Version 3.0 [8] and ASIS [7] has been programmed to deal
with fuzzy logic and data analysis.
3 Pilot Studies
At this stage of the project, expert knowledge is being collected. The re-
search team is currently trying to document a rule-based model of soil polygon
boundaries. As the geographic environment varies depending on region and
geomorphology, each region needs to be differentiated and taken into consideration.
4 Methodology
From the dataset received from the Geological Survey of Finland, the data will
be read in numerical format as in Fig. 2. These numbers represent different
types of soil, and adjacency of different numbers indicates where the
polygon borders are. As mentioned earlier, membership functions of
classification are needed to construct fuzzy models.
Fig. 2. Example of soil data in numerical format; soil type 1: bedrock, soil type 2:
sand, soil type 3: clay
Table 1. Level of certainty of polygon borders between different soil types ranged
from 0 to 1
eastern Finland bedrock sand clay
bedrock 0.95 0.92 0.88
sand 0.9 0.75
clay 0.85
Fuzzy Model of Soil Polygons 157
From the numbers in the table, logical rules will be constructed in a fuzzy
tool. The expected result is a soil data layer that shows the imprecision of the
soil polygon boundaries. In the next phase, a layer of elevation could be added to
adjust the values of fuzziness.
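As an illustration of how the tabulated certainties could be attached to a coded raster, the sketch below assigns each cell a membership value that drops along polygon borders. The dictionary values, the simple 4-neighbour rule and all names are assumptions made for the example; this is not the project's fuzzy tool.

import numpy as np

# boundary certainties in the spirit of Table 1 (illustrative values only);
# only the cross-type entries are used by the 4-neighbour rule below
CERTAINTY = {(1, 1): 0.95, (1, 2): 0.92, (1, 3): 0.88,
             (2, 2): 0.90, (2, 3): 0.75,
             (3, 3): 0.85}

def boundary_membership(grid):
    # Give every raster cell a membership value for its own soil class:
    # 1.0 in the polygon interior, the tabulated certainty where the cell
    # adjoins a different soil type (4-neighbourhood).
    grid = np.asarray(grid)
    mem = np.ones(grid.shape)
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and grid[rr, cc] != grid[r, c]:
                    pair = tuple(sorted((int(grid[r, c]), int(grid[rr, cc]))))
                    mem[r, c] = min(mem[r, c], CERTAINTY.get(pair, 0.5))
    return mem

# example with the coding of Fig. 2 (1: bedrock, 2: sand, 3: clay)
demo = np.array([[2, 2, 3, 3],
                 [2, 2, 3, 3],
                 [1, 2, 2, 3]])
print(boundary_membership(demo))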
5 Expectations
It takes time to start up the system and to develop a good connection that will
lead to success in the future. Collecting geologists' knowledge and constructing
the documentation of expert knowledge is the first priority. From the documentation,
a rule-based model is created, which will lead to the construction of fuzzy models.
Currently, there are three expectations:
• Documentation of geological knowledge used in interpretation
• Development of a rule-based model of imprecise soil polygons which could
be used for GIS analysis. The aim is not to create a new soil model for geolo-
gists or even for soil mapping, but to improve the representation of the soil data
layer so that it shows the imprecision of the classification, especially around the
boundaries of soil polygons.
• Fuzzy modeling of soil maps to understand uncertainty in geographical
information for better uses in spatial analyses
6 Future Plan
For further research, there are still many possibilities to continue studying
imprecision in soil polygon boundaries, for instance, implementing Kriging to
test the result of the fuzziness. The question may arise whether Kriging
can be used at all, as the values seem to come from a discrete function. Clearly, Kriging
is not going to be used for better classification of soil types; instead, it will be
applied together with the fuzzy model to verify the imprecision of the soil polygon
boundaries and to smooth out the result. Sample points could be taken from the
real site to study soil type misclassification, and these numbers would be used
together with the fuzzy membership functions for better results.
6.1 An Example
Figure 3 shows an example of fuzzy and Kriging application on soil polygon
boundaries.
[Figure 3, panels (a)–(e): coded raster grids (sand = 2, clay = 3) with boundary membership values such as 0.75 along the class boundary]
Fig. 3. (a) Consider a map that contains only two soil types, sand and clay. (b) The
soil map data is transferred to a raster format. (c) From the raster format the data
is coded: sand = 2 and clay = 3. (d) Next step, membership functions represent the
level of classification certainty on soil polygon boundaries. The result is the data
layer that contains imprecise soil polygon boundaries. (e) Kriging is used to test
out the soil property in each class. In this example, misclassification is discovered
and it affects the boundary. Comparison: (f ) The original soil polygon boundary in
raster format. (g) The new soil polygon boundary resulted from Kriging. (h) The
highlighted area shows the error or fuzziness along the boundary. The highlighted
areas are the areas that should be taken into consideration in order to adjust the
values of membership functions
8 Conclusion
Soil polygon boundaries are not crisp in reality. Moreover, soil maps contain
a large amount of uncertainty and imprecision. Therefore a natural way to
model vagueness of soil polygons is to include imprecision in their boundaries.
One way to do this is to develop a fuzzy model for raster data using fuzzy
membership functions for each soil layer. This model will be created using
expert knowledge and fuzzy logic.
In Finland there are not sufficient metadata to assess the uncertainty of
soil polygon boundaries. Thus, the first step of the research is to collect and
document expert knowledge about soil mapping. Only then can the rule-based
fuzzy model be created. The information of imprecision provided by the fuzzy
model will eventually be used in GIS to give an estimation of uncertainty in
soil maps.
Acknowledgements
The project is funded by the Ministry of Forestry and Agriculture of Finland
and in cooperation with Geological Survey of Finland. This project is an off-
spring from the military terrain analysis project conducted by the Scientific
Board of the Finnish Defence Forces. The authors would like to thank Profes-
sor Kirsi Virrantaus for introducing the project and giving her comments and
suggestions, Jukka-Pekka Palmu and Maija Haavisto-Hyvärinen for providing
the information, Professor Vesa Niskanen for instruction in fuzzy mod-
eling, Olga Křemenová for her assistance and cooperation and the StatGIS 2003
conference committee for giving a chance to present the project.
References
1 Introduction
With 300,000 ha of contaminated land, 1.2% of Britain's land area [13, 14],
the UK has a major need for effective environmental risk assessment for land
remediation and reclamation [22]. Such a risk assessment is usually based on
the characterization of potential site contaminants and analysis of source –
pathway – target scenarios [8, 9, 11]. A risk-based contaminant description re-
quires a conceptual model of the site that includes qualitative and quantitative
analyses of pollution sources, contaminant pathways and pollutant receptors
[4, 22]. This characterization typically has to rely on limited and irregularly
distributed point data. In addition, soil and surface material heterogeneity,
as well as the quasi-random nature of contamination sources add to the com-
plexity of developing good spatial models of pollutants on old industrial sites
[13, 22]. Improved and appropriate geostatistical tools and GIS based anal-
ysis can help to overcome some of these problems. This paper tackles the
development of such a methodology for a former coking plant by examining
the sources and pathways of Polycyclic Aromatic Hydrocarbons (PAHs), as
part of an analysis of a wider range of contaminants at the Avenue Coking
Works, near Chesterfield, UK (Fig. 1). In the UK, coking works were estab-
lished alongside the iron and steel plants from the mid 18th century. By the
end of the 19th century surplus gas from coking works was sold as town gas,
and by 1912 coke ovens were being installed at town gas works [10]. Each
works occupied between 0.3 and 200 ha. By 1995, only four of the total of
400 such works were still operating [10]. Tar distillation took place on coal
and/or coke works sites, and was the primary source of organic chemicals
for different industries until petrochemical products took over in the 1960s
[10]. The contamination at former gas and coke works varies with the range of
products and by-products manufactured. On such sites, ground contamination
arises from by-products, waste products from landfills and lagoons, and ancil-
lary products such as ammoniacal liquor, coal tar, spent oxide and foul lime
[12]. The organic contaminants are derived from constituents of coal tar such
of data, and good spatial correlation. Problems arise when potentially contin-
uous processes have not yet led to a normal spatial distribution, yet overlie
an older continuous, but random process, which is spatially correlated.
Lark [21] models complex soil properties by assuming that the soil con-
tamination is formed by a continuous but random component combined with
a quasi point process. The quasi point process characterizes contamination
(or any other process) in a small area of finite extent, which is represented
by only one (or very few) soil sample(s) and does not diffuse continuously
towards its neighbours. The continuous random processes are representative
of the native metal content of the soil parent material and diffuse sources of
pollution, while the quasi point process is defined by localized point sources
of pollution. This situation may describe the pollution of an industrial site. If
we consider that contamination with the same pollutant can result from both
diffuse and point sources, we can expect its measured values to show very little
spatial correlation, if any. These values, which have the point process values
embedded, are called outliers and are considered unusual in their spatial con-
text. The outliers do not belong to the continuous, but random, distribution
of the majority of data, and are not necessarily extreme low or high values
[2, 23]. In the case of pollution, if the outliers are not statistical anomalies
due to errors in measurement or recordings, they indicate different processes
superimposed on the same area and affecting the same variable [2, 18, 24].
2 Site Description
The colliery built in the 1880s at Avenue (Fig. 1), and the adjacent later lime
and iron works were dismantled by 1938 and the site reverted to agricultural
use. When the new, up-to-date 98 ha Avenue coking plant, built in the early
1950s to supply the Sheffield steel industry, was working at full capacity, it
carbonised 2,175 tons of coal a day, producing 1,400 tons of smokeless fuel,
65 tons of 77% sulphuric acid, 35 tons of ammonium sulphate, 70,000 litres of
crude benzole, and 250 tons of tar. Operations ceased in 1992 and since 1999
environmental reclamation work has been going on under the supervision of
the Babtie Group [1].
3 Statistics
As part of the reclamation work, the site owners’ consultants drilled 108 bore-
holes (BH) and 266 trial pits (TP). Seven hundred and twenty nine soil sam-
ples from depths between 10 cm and 18 m below surface level were analysed
for PAH16. The PAH16 levels in parts of the site are two orders of magnitude
higher than the PAH16 environmental threshold of 1,000 ppm. Overall the
concentrations span from 0.05 ppm to over 20,000 ppm [1]. The soil samples
were divided into four categories according to the depth of sampling, 0–1,
1–2, 2–4 m, and more than 4 m. It was originally hypothesised that the 265
soil samples between 10 cm and 1 m below surface level would be spatially cor-
related. To test this hypothesis the empirical PAH16 semi-variogram (Fig. 2)
was built using the Geostatistical Analyst tools in ArcGIS 8.3, considering
that it is more likely to have similar measured values close to the estimated
point, but different measured values further away. In this case the assump-
tion is that the difference in values between two samples depends only on the
distance between them and their relative orientation [5, 16]. The variance, or
standard deviation, of the sample value differences then varies only with the
distance and the direction h between samples and is known as a variogram.
2γ*(h) = (1/n) Σ [z(x) − z(x + h)]²,  where:   (3)
– n = number of data pairs within a given class of distance and direction;
– γ ∗ (h) = calculated semi-variance;
– z(x) = value of the sample at location x;
– z(x + h) = value of the sample at location x + h;
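A minimal sketch of this estimator for scattered point data follows; it is illustrative only, and the coordinate and concentration array names, lag width and number of lags are assumptions.

import numpy as np

def empirical_semivariogram(xy, z, lag_width, n_lags):
    # Estimator (3): gamma*(h) = 1/(2n) * sum [z(x) - z(x+h)]^2 over the n
    # pairs whose separation falls in each distance class (omnidirectional).
    xy, z = np.asarray(xy, float), np.asarray(z, float)
    i, j = np.triu_indices(len(z), k=1)
    h = np.linalg.norm(xy[i] - xy[j], axis=1)
    sq = (z[i] - z[j]) ** 2
    lag_mid, gamma = [], []
    for b in range(n_lags):
        m = (h >= b * lag_width) & (h < (b + 1) * lag_width)
        if m.any():
            lag_mid.append(h[m].mean())
            gamma.append(0.5 * sq[m].mean())
    return np.array(lag_mid), np.array(gamma)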
The results are plotted in a graph in which the horizontal axis represents
the distance h and the vertical axis the experimental semi-variance. If two
samples were picked from the same location, so that h equals 0, we expect the
semi-variance to be 0 for both the calculated and measured semi-variance [5, 16].
The semi-variogram (Fig. 2) shows the presence of both global and local outliers
for PAH16 and no spatial correlation. The global outliers are defined as very
high or very low values relative to all the values
in the dataset [2, 24], and in the semi-variogram they plot as distinct hori-
zontal groupings of points [20]. The local outliers are values which, although
not out of the dataset range, are abnormal relative to the surrounding values.
Consequently, the local outliers have high semi-variogram values for pairs of
points close to each other. These points plot close to the semi-variogram axis
γ [2, 20, 23]. The arrows in Fig. 2 show the links between the local outliers
in the empirical semi-variogram (pair of points very close in space with high
semi-variance) and actual data locations in space (Avenue map).
A Q–Q normal plot for this data set suggests at least two different pop-
ulations (Fig. 3a). This is interpreted as one population set modelling the
diffuse pollution process, and the other describing the point source pollution
process. The statistical outliers were identified through a box and whisker plot
(Fig. 3b). This graph depicts the first quartile, the median, and the third quar-
tile of a dataset. The box's lowest and highest horizontal limits represent the
first and the third quartile positions on the y-axis, respectively; fifty percent
of the data values plot inside this box. The "whiskers" extend to the most
extreme data point that is no more than 1.5 times the interquartile range from
the box [25]. The outliers are values greater than or equal to the third quartile
plus 1.5 times the interquartile range [26]. In Fig. 3b, the identified outliers lie
above the 80th percentile and range from 371 ppm to 12,340 ppm (49 samples
out of 265, or 18.5% of the data); they represent the point source pollution
from the coking works, while the remaining values represent the historical,
diffuse contamination on the site. A Q–Q normal plot for these 49 untransformed
point-source pollutant outlier values suggests a single statistical population (Fig. 3c).
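The outlier rule can be written compactly; the sketch below is illustrative, and the input array name is hypothetical.

import numpy as np

def boxplot_outliers(values):
    # Flag values >= Q3 + 1.5 * IQR, the upper box-and-whisker rule used
    # here to separate point-source outliers from the diffuse background.
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    threshold = q3 + 1.5 * (q3 - q1)
    return v >= threshold, threshold

# e.g. mask, thr = boxplot_outliers(pah16_0_1m)   # the 265 samples from 0-1 m depth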
Spatially, the PAH16 outlier values cluster in three main areas, with a few
isolated outliers elsewhere on the site (Fig. 4). The three main clusters are
associated with (a) waste disposal and tar lagoon 4, or point source 1 (PS1),
(b) stoking area, or point source 2 (PS2), and (c) main plant area, or point
source 3 (PS3). The isolated outliers may be the result of individual spills or
leakage from underground tanks.
The Avenue site was divided into Thiessen polygons based on the PAH16
sampling points (Fig. 5). This procedure took into account each sample depth,
Fig. 3. a: normal Q–Q plot; b: box-whisker plot; c: normal Q–Q plot outliers
and it was assumed that each polygon has the characteristic of the sampled
data. A narrow uncontaminated area of 12–18 m depth separates PS1 and
PS2. Both PS2 and PS3 have small, unpolluted areas surrounded by high-
polluted areas. Pollution may be present below those uncontaminated areas
that were sampled only to a maximum depth of 2 m, since contaminated areas
around them have very high PAH16 values below this depth. This may imply
below it. For the first and last sample depth, 20 cm were subtracted and added
respectively, in order to also place these samples inside 3D objects. The PAH16
concentration decreases from tens of thousands ppm to a few thousands ppm
to tens of ppm over just 2.5–5 m vertical distance. This demonstrates not
only that the pollution is extremely localized, but also that diffusion and dilution
of the point source pollution over 50 years have been relatively slight.
4 Conclusions
Acknowledgements
I would like to thank all who have helped me in this work, particularly Babtie
and East Midlands Development Agency, UK, for access to the site; Babtie,
UK, for permission to use the site data; Nigel Lawson for help in gaining
access to the site; and my supervisors: Prof. Ian Douglas and Prof. Robert
Barr.
References
1. Babtie Group (2000) The Avenue Coking Works, Ground model & preliminary
contamination assessment. Babtie Group, Fairbairn House, Ashton Lane, Sale,
Manchester
2. Barnett V, Lewis T (1994) Outliers in Statistical data. Wiley series in proba-
bility and mathematical statistics, 3rd edn. John Wiley & Sons: New York
3. Burrough PA, McDonnell RA (1998) Principles of geographical information
systems. Oxford University Press: Oxford
4. Carlon C, Critto A, Marcomini A, Nathanail P (2001) Risk based characteriza-
tion of contaminated industrial site using multivariate and geostatistical tools.
Environmental Pollution 111, pp 417–427
5. Clark I (1978) Practical geostatistics. Elsevier Applied Science: New York
6. Clark and Harper (2000) Practical geostatistics 2000. Geostokos (Ecosse) Lim-
ited: Scotland, UK
7. Clark I, Harper WV (2001) Practical geostatistics 2000. Ecosse North America
Llc.: Columbus, OH
8. D.E.T.R. (2000a) Contaminated land: Implementation of Part II A of the En-
vironmental Protection Act 1990. London: HMSO
9. D.E.T.R. (2000b) Guidelines for environmental risk assessment and manage-
ment. Revised Departmental Guidance. London: HMSO
10. D.o.E. (1987) Problems arising from the redevelopment of gas works and similar
sites. Environmental Resources Limited, 2nd edn. London: HMSO
11. D.o.E. (1995a) Gas works, coke works and other coal carbonisation plants,
Industry Profile. London: HMSO
12. D.o.E. (1995b) A guide to risk Assessment and risk management for environ-
mental protection, London: HMSO
13. DTLR (2002) Development on land affected by contamination. Consultation
paper on Draft planning technical advice, London: HMSO
14. Environment Agency (2002) Dealing with contaminated land in England,
Progress in 2002 with implementing the part IIA regime, Environment Agency,
Rio House, Bristol, UK
15. Goovaerts P (1997) Geostatistics for natural resources evaluation, Oxford
University Press
1 Introduction
Project Gama, for the adjustment of geodetic networks, was started at the
Department of Mapping and Cartography, Faculty of Civil Engineering, Czech
TU Prague, in 1998. Originally it was planned as only a local project, whose
main goal was to demonstrate to students the power of object-oriented programming
and at the same time to provide a free independent tool for comparing adjustment
results from other sources. The Gama project received the official status of
GNU software in 2001 and now contains a C++ library (including the small C++
matrix/vector template library gmatvec) and two programs, gama-local and
gama-g3, which correspond to the two development branches of the project.
The stable branch of the Gama project is represented by the command-line pro-
gram gama-local for the adjustment of three-dimensional geodetic networks in a
local coordinate system. The new development branch of the project (gama-g3)
is aimed at the adjustment of geodetic networks in a global geocentric system.
The stable branch (gama-local) enables the common adjustment of possibly cor-
related horizontal directions and distances, horizontal angles, slope distances
and zenith angles, height differences, observed coordinates (used in sequen-
tial adjustment, etc.) and observed coordinate differences (vectors). Although
such an adjustment model has been rendered obsolete by global positioning systems, it
can still serve as an educational tool for demonstrating adjustment procedures
to students and as a starting platform for the development of the new branch of
the project (gama-g3).
The numerical solution of least squares adjustment in geodesy is most com-
monly based on the solution of normal equations. As the Gama project
was also meant to be a comparison tool, it was desirable to use a differ-
ent method, and Singular Value Decomposition (SVD) was implemented as
the main numerical algorithm. As a testing alternative, Gama implements
another algorithm from the family of orthogonal decompositions, based on
Gram–Schmidt orthogonalization (GSO). Practical experience with both al-
gorithms is discussed. In the Gama project geodetic input data are described
Geodesy as a scientific discipline studies the geometry of the Earth or, from
the practical point of view, the positioning of objects located on the Earth's surface
or in its relatively close vicinity. The input information is represented by
geodetic observations.
The spectrum of observation types dealt with in geodesy is very wide and ranges
from classical astro-geodetic observations (astronomical longitude and lati-
tude, variations and position of the Earth pole), through measurements of geophysical
quantities (gravity acceleration and its local anomalies) and traditional
geometric observables like directions, angles and distances, to photogrammet-
ric measurements of historical monuments. But of main importance in
geodesy today are satellite global positioning systems (first of all NAVSTAR
GPS, complemented by other systems like DORIS or GLONASS).
The key role in processing of geodetic data belongs to the sphere of ap-
plied statistics in geodesy traditionally called adjustment of observations. The
processing of geodetic observations is determined by the choice of appropriate
mathematical model, which can be symbolically expressed as
f (c, x, l) = 0, (1)
Ax − l = v (2)
A method commonly used for solving project equations (2) (model explicit
in observations) is based on the normal equations

N = AᵀA,  n = Aᵀl,  x = N⁻¹n   (4)

pᵀNp ≥ 0,  p ≠ 0.
The set of indices O can contain all elements, but more often only selected
elements of x.
In the case of plane geodetic free network we can geometrically interpret
the last constraint (9) as follows. By minimization of the Euclidean norm of
residual vector (3) the shape and scale (if at least one distance is available) of
the adjusted network together with covariances of adjusted observations are
uniquely defined. The second additional constraint (9) then defines localiza-
tion of the network in the coordinate system. Apart from the adjusted network
shape we define simultaneously its shift and rotation in the coordinate system.
Another equivalent interpretation is that constraint (9) defines the par-
ticular solution of (2) in which the trace of variance-covariance submatrix
corresponding to indices i ∈ O is minimal.
minimal upper estimate of the ratio of the relative error of x and the relative error of
the right-hand side l.
From (10) it directly follows that the condition number of the normal equa-
tion matrix N is the square of the condition number of the project equation
matrix A:

κ(N) = (κ(A))²   (11)

We can say that when solving poorly conditioned normal equations we lose
twice as many correct decimal digits in the solution x as with any direct
solution of the project equations.
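Relation (11) is easy to verify numerically; the following toy check (illustrative only, with an arbitrary random matrix) compares the 2-norm condition numbers of a project-equation matrix and of its normal-equation matrix.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(120, 8))     # a toy project-equation matrix
N = A.T @ A                       # the corresponding normal-equation matrix

print(np.linalg.cond(A) ** 2)     # (kappa(A))^2 ...
print(np.linalg.cond(N))          # ... agrees with kappa(N) up to round-off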
Probably the most important class of algorithms for the direct solution of
project equations (2) is the family of orthogonal decomposition algorithms.
Apart from other goals, the GNU project Gama has been planned to be a kind of
etalon, i.e. a tool for checking adjustment results from other software prod-
ucts. For this reason it was desirable to base the adjustment on a numerical
method other than the traditional solution of normal equations, and
Singular Value Decomposition (SVD) was implemented as the main numeri-
cal algorithm. As an alternative, another orthogonal decomposition adjustment
algorithm, GSO (based on Gram–Schmidt orthogonalization), is also available.
We describe both algorithms briefly in the following sections.
5 Gram–Schmidt Orthogonalization
Gram–Schmidt orthogonal decomposition is an algorithm for computing the factor-
ization

A = QR,  QᵀQ = 1   (12)

where Q is an orthogonal matrix and R is an upper triangular matrix. The matrix R
here is identical to the upper triangular matrix of the Cholesky decomposition of
the normal equations

N = AᵀA = RᵀQᵀQR = RᵀR.   (13)

Gram–Schmidt orthogonalization is a very straightforward and relatively
simple algorithm that can be implemented in several variants differing in the
order in which vectors are orthogonalized. The following three algorithms are
adopted from [3, 300–301].
Algorithm 1.1 [Modified Gram–Schmidt (MGS), row version]
for k = 1, 2, . . . , n
    q̂_k := a_k^(k);  r_kk := (q̂_kᵀ q̂_k)^{1/2};  q_k := q̂_k / r_kk;
    for i = k + 1, . . . , n
        r_ki := q_kᵀ a_i^(k);  a_i^(k+1) := a_i^(k) − r_ki q_k;
    end
end
for k = 1, 2, . . . , n
    for i = 1, . . . , k − 1
        r_ik := q_iᵀ a_k^(i);  a_k^(i+1) := a_k^(i) − r_ik q_i;
    end
    q̂_k := a_k^(k);  r_kk := (q̂_kᵀ q̂_k)^{1/2};  q_k := q̂_k / r_kk;
end
for k = 1, 2, . . . , n
    for i = 1, . . . , k − 1
        r_ik := q_iᵀ a_k;
    end
    q̂_k := a_k − Σ_{i=1}^{k−1} r_ik q_i;
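For illustration, here is a compact NumPy transcription of the modified Gram–Schmidt row version (Algorithm 1.1); it is a sketch, not the Gama C++ implementation, and the least-squares usage note below it is likewise only one possible way to apply the factorization.

import numpy as np

def mgs(A):
    # Modified Gram-Schmidt (row-oriented): A = Q R with Q^T Q = I.
    A = np.array(A, dtype=float)          # work on a copy; columns are a_k
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(A[:, k])
        Q[:, k] = A[:, k] / R[k, k]
        for i in range(k + 1, n):
            R[k, i] = Q[:, k] @ A[:, i]
            A[:, i] -= R[k, i] * Q[:, k]
    return Q, R

# least-squares use: for A x ~ l, solve the triangular system R x = Q^T l,
# e.g. x = np.linalg.solve(R, Q.T @ l)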
Q₁ᵀQ₁ = 1   (15)
M₁ = Q₁R   (16)
Q₁ = M₁R⁻¹,  Q₂ = M₂ − Q₁Q₁ᵀM₂   (17)
Q₃ = M₃R⁻¹,  Q₄ = M₄ − Q₃Q₁ᵀM₂   (18)
Ax − l = v.
The result is directly the vector of unknown parameters x and the vector of residu-
als v. Cofactors (weight coefficients) of adjusted parameters q_{x_i x_j} are available
as dot products of rows i and j of the submatrix R⁻¹, cofactors of adjusted ob-
servations q_{l_m l_n} are computed as dot products of rows m and n of the submatrix
Q, and mixed cofactors q_{x_i l_n} similarly as dot products of the i-th row of R⁻¹ and
the n-th row of the matrix Q.
Let us suppose now that the project equation matrix A contains r linearly inde-
pendent columns and the remaining d linearly dependent columns. Without loss
of generality we can assume that the linearly dependent columns are located in the
right part of the matrix A. We denote the linearly independent columns A₁, the linearly
dependent columns A₂ and the matrix of their linear combinations α:

A = (A₁, A₂),  A₂ = A₁α,  x = (x₁ᵀ, x₂ᵀ)ᵀ   (19)
As the matrix A₁ does not contain linearly dependent columns, there exists a unique
solution x̃ of (20) that minimizes the Euclidean norm of v.
is at the same time the Least Square solution of (20) with the same vector of
residuals v.
If we apply the algorithm GSO to the matrix

M^I = ( M₁^I  M₂^I ; M₃^I  M₄^I ) =
⎛ A₁  A₂  −l ⎞
⎜  1   0   0 ⎟   (22)
⎝  0   1   0 ⎠
For a matrix A with linearly dependent columns, d singular values are zero
(d is the dimension of the null space of A). Singular value decomposition explicitly
constructs orthonormal vector bases of the null space and the range of A.
The columns of the matrix U corresponding to nonzero singular values w_i form
the orthonormal basis of the range of A. Similarly, the columns of the matrix V cor-
responding to zero singular values form the orthonormal basis of the null
space of A.

N_A = {x | Ax = 0, x ∈ Rⁿ},   R_A = {y | y = Ax, x ∈ Rⁿ}
In the case of rank-deficient systems, we set into the diagonal of the inverse
matrix W⁻¹ zeros instead of reciprocals for the elements corresponding to linearly
dependent columns of A:

W⁻¹ = diag(·),  with diagonal elements  1/wᵢ  for wᵢ > 0  and  0  for wᵢ = 0   (29)

The resulting particular solution x minimizes both the Euclidean norm of the residuals
and at the same time the norm of the unknown parameters x.
The rather surprising replacement of the reciprocal 1/0 ≡ ∞ by zero can be ex-
plained as follows. The solution vector x of the overdetermined system

Ax = l

The coefficients in the parentheses are dot products of the columns of U and the right-hand
side l, multiplied by the reciprocal value of the singular value. Zero singular values
correspond to linearly dependent columns of the matrix A that add no further infor-
mation to the given system. Setting the corresponding diagonal elements of the matrix
W⁻¹ to zero is equivalent to the elimination of the linearly dependent columns from
the matrix A.
With the matrix W⁻¹ defined according to (29), the cofactors are computed in the
same way for regular and singular systems:

Q_xx = N⁻¹ = (AᵀA)⁻¹ = (VWUᵀUWVᵀ)⁻¹ = VW⁻¹W⁻ᵀVᵀ   (31)
Q_ll = AQ_xxAᵀ = (UWVᵀ)(VW⁻¹W⁻ᵀVᵀ)(VWUᵀ) = UUᵀ   (32)
Q_lx = AQ_xx = UWVᵀVW⁻¹W⁻ᵀVᵀ = UW⁻ᵀVᵀ   (33)
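A small NumPy sketch of this SVD-based solution, with reciprocals of (near-)zero singular values replaced by zero as in (29) and the cofactor matrix of the adjusted parameters as in (31); it illustrates the idea only and is not the Gama implementation, and the tolerance and names are assumptions.

import numpy as np

def svd_adjustment(A, l, tol=1e-10):
    # Least-squares solution of A x ~ l via SVD, with 1/w_i replaced by 0
    # for (near-)zero singular values as in (29); returns x, v and Q_xx.
    U, w, Vt = np.linalg.svd(A, full_matrices=False)    # A = U diag(w) V^T
    w_inv = np.where(w > tol * w.max(), 1.0 / w, 0.0)   # regularized reciprocals
    x = Vt.T @ (w_inv * (U.T @ l))                      # minimum-norm LS solution
    v = A @ x - l                                       # residuals
    Qxx = Vt.T @ np.diag(w_inv ** 2) @ Vt               # (31): V W^-1 W^-T V^T
    return x, v, Qxx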
What now remains is to show how to compute the particular solution that
minimizes only a given subset of the subvector x according to the second regular-
ization condition (9). We compose the overdetermined system of linear equations

ψc + x = x̂   (34)
syntax of our structured data). Conversion from a well-defined data format
into XML is a relatively simple task, but processing of XML is not trivial
and cannot be done without an XML parser. In the GNU Gama project we use the
XML parser expat by James Clark, see http://expat.sourceforge.net/.
We believe that XML is the best data format for the description and exchange of
structured data in the Gama project. One of the goals of our project is to compile
a free collection of geodetic networks described in XML.
Presentation of Entrepreneurship Data
and Aspects of Spatial Modeling
1 Introduction
Positive effects of new firms on the job market, technology transfer, and contributions to structural change have turned political attention to start-ups. Every year the number of new firm creations increases, while on the other hand the number of large enterprises decreases.4 More and more firms have no employees, and a trend towards small-scale self-employment can be recognized.5 Political and economic support programs try to reduce regional discrepancies in business activity, firm development and foundation activity. These programs have to be evaluated and improved continuously.
A meaningful official foundation statistic for entrepreneurship research, and a statistic to assess entrepreneurial activity in Austria for political decision making, does not exist so far.6 Only the Federal Economic Chamber of Austria (WKO) reports a statistic of the foundation activity of commercial firms every year.7 This statistic permits observing a trend in firm foundations,
4 Cp. Wirtschaftskammer Österreich [16]: 23.
5 Cp. Schwarz and Grieshuber [11]: 103ff.
6 For the statistical situation in Germany see e.g. Fritsch et al. [3]: 2f. For efforts towards the development of the official statistical system in Germany cp. Struck [14]: 41ff.
7 The Federal Economic Chamber of Austria is the legal representation of the interests of Austrian entrepreneurs. In its foundation statistic the number of new start-ups is calculated from new entrants into the membership database of the WKO. To exclude pseudo foundations and multiple data set entries, the database has been revised. A detailed description of the data revision can be found in Wirtschaftskammer Österreich [16].
but does not map the overall Austrian foundation activity. Firms which are not within the scope of the WKO are not included in this statistic.8 A few other Austrian public services collect data on newly founded firms, but access to these data sources is limited and the data are not suitable for research.9
To ensure a continuous and complete evaluation of supporting programs
it would be wise to design a monitoring system for all Austrian enterprises
which includes all commercial and noncommercial firm foundations and clos-
ings. Such a system would be a valuable source for entrepreneurship research,
but suitable statistical measures and methods for presenting and modeling
non-normal spatial entrepreneurship data are needed to build such an overall
monitoring system.
This paper briefly reports measures and methods commonly used in descriptive statistics for presenting entrepreneurship data with reference to its spatial distribution. We illustrate these with the numbers of new start-ups in Austria in the year 2001.10 By applying different measures we show how sensitive the presentation of regional differences in foundation activity can be. Further, we give a brief introduction to spatial generalized linear models [5] and to hierarchical Bayesian models for count data [15] for modeling non-normal spatial data.
Charts and tables are instruments for descriptive and explorative data analysis. They are helpful for visualizing data, building hypotheses and presenting results of statistical computations with spatial reference. To compare regional discrepancies, a measure has to be chosen that accounts for the differing area or population of the regions in the calculation or presentation.
For example, if the differences in the firm foundation activity of the Austrian provinces are to be compared, the absolute numbers or the percentage of counted foundations will not be practical. Regions of different size and population cannot be compared with non-standardized measures. Although the absolute values and the percentage are improper measures, both can be found in regional comparisons.11
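The point can be illustrated with a few lines of R using invented figures (they are not WKO data): the ranking of three fictitious provinces changes depending on whether absolute counts, start-ups per 1,000 inhabitants or start-ups per 100 existing firms are used.

# Hypothetical illustration of how the chosen measure changes a regional ranking.
foundings <- data.frame(
  region      = c("Province A", "Province B", "Province C"),
  new_firms   = c(5200, 900, 1500),        # absolute counts (invented)
  population  = c(1600000, 280000, 560000),
  firms_stock = c(98000, 14000, 21000)
)
foundings$per_1000_inhab <- 1000 * foundings$new_firms / foundings$population
foundings$per_100_firms  <- 100  * foundings$new_firms / foundings$firms_stock
# Rankings under the three measures (1 = most foundation activity)
sapply(foundings[, c("new_firms", "per_1000_inhab", "per_100_firms")],
       function(m) rank(-m))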
We give an example of how different measures of foundation activity can
influence the ranking of regions. Figure 1 shows the number of firm foundations
8 Foundations in the field of agriculture and forestry, as well as freelancers, whether they belong to another chamber or not, are not registered by this or any other official statistic.
9 E.g. social insurance institutions, finance offices, commercial credit agencies.
10 Data from Wirtschaftskammer Österreich [16].
11 On the web site of Lower Austria’s Business Portal the foundation statistic of the WKO can be found, but only the absolute counts and the percentages of all new Austrian firms for 2002 are presented there (www.loweraustria.biz/upload/downloads/Betriebsgruendungen%202002.pdf).
12 Cp. Wirtschaftskammer Österreich [16]: 21.
13 Cp. Wirtschaftskammer Österreich [16]: 22; Fritsch and Niese [4]: 4. Fritsch and Niese calculate the number of new start-ups in relation to 100 existing firms of the respective region.
14 Cp. Wirtschaftskammer Österreich [16]: 22.
15 Cp. Egeln et al. [2]; Fritsch and Niese [4]: 3f.
16 Data from Wirtschaftskammer Österreich [16]: 21 and Statistik Austria [13], with own calculations.
Upper Austria, with its iron, steel and chemical industry, has many large enterprises. Burgenland seems to have more small enterprises.17
This short example shows how sensitive the choice of measure for such data is, particularly if political decisions have to be made on the basis of such statistics. Area, population and firm structure should be taken into account when comparing different countries or provinces.
To make statistical information from different EU countries and regions comparable, EUROSTAT has established the Nomenclature des unités territoriales statistiques (NUTS). Using the NUTS classification ensures that regions of comparable size appear at the same level, which makes comparisons possible. Each NUTS unit contains regions which are similar in terms of area, population, economic weight or administrative power.18 Table 1 presents eight different levels for the presentation and comparison of regions in Austria, including the NUTS units.19 It shows the configuration and the number of regions of these units.
With the exception of the post code, all lower levels can be aggregated to a higher level. To assign new firm foundation activity to the levels of NUTS 3, political districts or communities, the addresses or the post codes of the firms can be used. A main problem in that case is that the classification by post code to these levels is not unique. The ranges of postal districts overlap
17 Data Source: Statistik Austria [13].
18 For more information about NUTS see EUROSTAT on the web (http://europa.eu.int/comm/eurostat/ramon/nuts/splash_regions.html).
19 Data from http://www.statistik.at/fachbereich_topograph/tab2.shtml for political units in Austria, http://www.statistik.at/verzeichnis/nuts.pdf for the NUTS classification and http://www.statistik.at/verzeichnis/gemeindeverzeichnis.shtml to count post regions.
to be negotiated, or two regions with common borders do not have to share one street. For this reason, an alternative way of defining neighborhoods between regions could be common infrastructure: in this case two regions are neighbors if they have a street, a highway or a railroad line in common. For a detailed illustration of different neighborhood definitions and different spatial linear modeling strategies in the field of entrepreneurship research see Breitenecker [1].
Gotway and Stroup [5] introduce a spatial approach for analyzing non-normal data. We give a brief introduction to how, in the terms of Gotway and Stroup [5], the theory of generalized linear models can be extended to include discrete and categorical data for spatial prediction.
Let Z = (Z(s_1), ..., Z(s_n))' be a vector of random variables, each having a distribution in the exponential family, and z = (z(s_1), ..., z(s_n))' the corresponding vector of data values at observed spatial locations s = (s_1, ..., s_n)'. Suppose we want to predict a vector of k random variables Z_0 = (Z(s_{0,1}), ..., Z(s_{0,k}))' at unobserved spatial locations s_0 = (s_{0,1}, ..., s_{0,k})'.
We assume that the mean function for Z and Z0 can be written as
E(Z) = μ(s)
E(Z 0 ) = μ(s0 ),
where μ(s) and μ(s0 ) are n × 1 and k × 1 dimensional mean vectors associated
with data locations s and prediction locations s0 , respectively.
We define the mean model through the link function

μ = E(Z) = h(Xβ),   (1)

where h(·) = g^{-1}(·) is the inverse link function. In the case of our example with the number of new start-ups, the canonical link for Poisson distributed data is the log link, η = log(μ(s)).
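As a minimal non-spatial illustration of the log link in (1), the following R lines fit a Poisson GLM to hypothetical regional start-up counts, with the population entering as an offset; the spatial correlation discussed below is ignored at this point.

# Hypothetical data; glm() with family = poisson uses the canonical log link.
set.seed(1)
d <- data.frame(
  startups   = rpois(30, lambda = 40),
  population = runif(30, 2e4, 2e5),
  unemp_rate = runif(30, 3, 12)
)
fit <- glm(startups ~ unemp_rate + offset(log(population)),
           family = poisson(link = "log"), data = d)
summary(fit)$coefficients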
Further we assume that

var[(Z', Z_0')'] ≡ V = [Σ_{ZZ}  Σ_{Z0}; Σ_{0Z}  Σ_{00}],

where Σ_{ZZ}, Σ_{Z0} and Σ_{00} are known positive definite matrices of dimensions n × n, n × k and k × k, respectively. In practice the general symmetric positive definite variance-covariance matrix V can be calculated by
V = υ_μ^{1/2} R(α) υ_μ^{1/2},   (2)
where υ_μ = diag[υ(μ_i)] and υ(μ_i) is the general form of the variance function. The matrix R(α) is the correlation matrix that describes the spatial dependence among the observations; in practice it is estimated by a semivariogram model with parameter vector α containing the nugget effect, partial sill and range. For the Matérn semivariogram model introduced by Handcock and Stein [6], α is extended by a smoothness parameter. For Poisson data υ(μ_i) = μ_i, and the variance-covariance matrix V in (2) can be written as

V = diag[μ_i]^{1/2} R(α) diag[μ_i]^{1/2}.
Gotway and Stroup [5] emphasize that estimation with the generalized linear model is a maximum likelihood procedure, but that the full log-likelihood for estimating β is not needed. It is sufficient to specify the relationship between the mean and the model via the link function, the form of the variance, and the relationship between the variance and the mean. These requirements are fulfilled by the quasi-likelihood procedure (cp. [5, 9: 323ff]).
Prediction with generalized linear models can be accomplished by obtaining β̂_G as the iterative solution of the estimating equations

X'W X β = X'W z*,
Gotway and Stroup [5] emphasized that the estimated parameter vector
β̂ G will still be consistent for β even if the correlation matrix is not correctly
specified.
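The following base R sketch (not the authors' code) illustrates the iterative estimating equations behind this spatial Poisson model, assuming an exponential correlation function with a known range and using simulated data:

# beta is updated from X'WX beta = X'W z*, written here in the equivalent
# Fisher-scoring form beta_new = beta + (D'V^-1 D)^-1 D'V^-1 (z - mu),
# with V as in (2).
set.seed(42)
n  <- 50
xy <- cbind(runif(n), runif(n))                  # region centroids
X  <- cbind(1, runif(n))                         # intercept + one covariate
z  <- rpois(n, exp(X %*% c(2, 0.8)))             # simulated counts
R  <- exp(-as.matrix(dist(xy)) / 0.3)            # exponential correlation, assumed known
beta <- c(log(mean(z)), 0)
for (it in 1:25) {
  mu   <- exp(drop(X %*% beta))                  # log link
  V    <- diag(sqrt(mu)) %*% R %*% diag(sqrt(mu))
  D    <- mu * X                                 # d mu / d beta
  Vinv <- solve(V)
  beta_new <- drop(beta + solve(t(D) %*% Vinv %*% D, t(D) %*% Vinv %*% (z - mu)))
  converged <- max(abs(beta_new - beta)) < 1e-8
  beta <- beta_new
  if (converged) break
}
beta                                             # quasi-likelihood estimate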
A further extension of generalized linear model theory, introduced by Hastie and Tibshirani [7], are generalized additive models. The generalized additive model differs from the generalized linear model in that an additive predictor Σ_j f_j(X_j) replaces the linear predictor Xβ in (1). This theory would still have to be adapted for use with spatial data.
Ver Hoef and Frost [15] develop a Bayesian hierarchical model for analyzing
trend, abundance, and effects of covariates for monitoring programs of multi-
ple sites and apply it to counts of harbor seals in Alaska. This approach also
could be adapted to yearly start-up counts z_ij in region i and year j, which are assumed to be Poisson distributed with mean λ_ij, with

ln(λ_ij) = θ_ij + x_ij' β_i + ε_ij,

where θ_ij is an intercept, x_ij = (x_{1ij}, ..., x_{pij})' is a p × 1 vector of observed covariate values in region i and year j, β_i = (β_{1i}, ..., β_{pi})' is a p × 1 vector of parameters, and ε_ij is an overdispersion parameter. Assuming that, conditional on the covariates, all observations are independent, we can write the joint density

f(z|θ, β) ≡ ∏_{i,j} f(z_ij).
Further, a separate trend model is developed for each region. In Ver Hoef and Frost [15], f(θ_ij|τ_i, δ²) = N(τ_i, δ²) is a normal distribution and the joint distribution can be written as

f(θ|τ, δ²) = ∏_i ∏_j f(θ_ij|τ_i, δ²).
In the next level of the hierarchy the region-specific covariate parameters are grouped. The joint distribution is given by

f(β|μ, σ) = ∏_p ∏_i f(β_pi|μ_p, σ_p²),
where in Ver Hoef and Frost [15] the region-specific covariate parameters are given a normal distribution with mean μ_p and variance σ_p². Also at this level of the hierarchy the region-specific trend parameters τ_i are grouped; Ver Hoef and Frost give them a normal distribution with mean η and variance γ². The joint distributions are given by

f(τ|η, γ) = ∏_i f(τ_i|η, γ²)

and

f(ξ|ν_a, ν_b) = ∏_i f(ξ_i|ν_a, ν_b).
In Ver Hoef and Frost [15], f(ε_ij|0, ξ_i²) is a normal distribution with mean 0 and variance ξ_i², and f(ξ_i|ν_a, ν_b) is a gamma distribution with parameters ν_a and ν_b.
In the fourth and final level of the hierarchy, diffuse priors have to be given to μ_p, σ_p², δ², η_q, γ_q², ν_a and ν_b. Ver Hoef and Frost [15] give the mean parameters μ_p and η_q a normal distribution with mean 0 and variance 1,000,000 to express the uncertainty. A gamma distribution with parameters a and b equal to 0.001 is given to σ_p², δ², γ_q², ν_a and ν_b.
Using Bayes' theorem we can write the posterior distribution as

f(θ, β, τ, ε, δ², ξ, μ, σ, η, γ, ν_a, ν_b | z) ∝ f(z|β, θ) f(β|μ, σ) f(θ|τ, δ²) f(ε|0, ξ) f(τ|η, γ) f(ξ|ν_a, ν_b) f(δ²) f(μ) f(σ) f(η) f(γ) f(ν_a) f(ν_b).
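To indicate how the pieces fit together, the following R function writes down the unnormalized log posterior of a deliberately simplified version of this hierarchy (one covariate, hyperprior variances held fixed rather than given their own diffuse priors); the data layout is hypothetical and, in practice, the full model would be explored with MCMC.

# par stacks theta (one per observation), beta (one per region), eps (one per
# observation), tau (one per region), and the hyper-means mu and eta; region
# is an integer vector of region codes 1, ..., nreg, x a single covariate.
log_posterior <- function(par, z, x, region) {
  nobs <- length(z); nreg <- length(unique(region))
  theta <- par[seq_len(nobs)]
  beta  <- par[nobs + seq_len(nreg)]
  eps   <- par[nobs + nreg + seq_len(nobs)]
  tau   <- par[2 * nobs + nreg + seq_len(nreg)]
  mu    <- par[length(par) - 1]
  eta   <- par[length(par)]
  delta2 <- 1; sigma2 <- 1; gamma2 <- 1; xi2 <- 1      # fixed for this sketch
  lambda <- exp(theta + beta[region] * x + eps)
  sum(dpois(z, lambda, log = TRUE)) +                          # f(z | theta, beta)
    sum(dnorm(theta, tau[region], sqrt(delta2), log = TRUE)) + # f(theta | tau, delta^2)
    sum(dnorm(beta, mu, sqrt(sigma2), log = TRUE)) +           # f(beta | mu, sigma^2)
    sum(dnorm(eps, 0, sqrt(xi2), log = TRUE)) +                # f(eps | 0, xi^2)
    sum(dnorm(tau, eta, sqrt(gamma2), log = TRUE)) +           # f(tau | eta, gamma^2)
    dnorm(mu, 0, 1000, log = TRUE) + dnorm(eta, 0, 1000, log = TRUE)  # diffuse priors
}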
4 Conclusion
In this paper we have briefly discussed some ideas for presenting entrepreneurship data and have pointed out how sensitive comparisons of regions or political districts are to the choice of a suitable measure. Further, we have briefly presented two approaches for the spatial modeling of entrepreneurship data, the generalized linear model of Gotway and Stroup [5] and the Bayesian hierarchical model for count data of Ver Hoef and Frost [15]. In a future project we will test how well these two approaches can be applied to model firm foundations and firm survival with respect to their spatial location. A main problem in the latter approach will be to find suitable prior distributions for the parameters and to compute the posterior distribution with Markov Chain Monte Carlo techniques.
References
are making, and try at least some of the following alternative methods to
check the robustness of your conclusions.
Before proceeding, the reader deserves an explanation for the extension
of the title we have chosen. We found it helpful to stress that even when an
analyst feels “well-clothed” in relation to the expectations of his or her commu-
nity, it may well be that others will have different opinions. This is about as-
sumptions of what “well-clothed” means in different circumstances, and about
the sometimes self-reinforcing views expressed about this in inward-looking
communities of both users and developers. Different users and research com-
munities do things in various ways, often tradition-based, and it is a user’s
inherent right to choose software and tools to use in his/her research and
decision-making. But this right requires that the user (or developer making
tools available) accepts responsibility at least for documenting how the analy-
sis has been done. It is not enough to rely on the assurances of courtiers that
we are fashionably clothed, when our apparel is awry or absent in the view of
others.
Fig. 1. Typically the landscape in which we are living and working is much more
complicated than our models assume
Fig. 2. Indicator transformation: values of Z(s) above the threshold μ become 1's and values below the threshold become 0's (left panel: measured values with measurement error ε(s) against the X-coordinate; right panel: the transformed indicators)
this can be measurement error interval) are displayed as red points in the indicator transformation in the graph to the right. It is quite possible that they would be exchanged if one more measurement were taken.
Notice that after the indicator transformation, input values near and far from the threshold become either zeros or ones. This means that we are losing information when transforming the data.
Then these indicator values are used as input to ordinary (sometimes simple or universal) kriging. Ordinary kriging produces a continuous prediction, and we might expect the prediction at unsampled locations to lie between zero and one (in practice, however, this is often not fulfilled). The prediction is interpreted as the probability that the threshold is exceeded at location s. For instance, a prediction of 0.71 is interpreted as a 71% chance that the threshold is exceeded. The predictions made at each location form a surface that can be interpreted as a map of the probability that the specified threshold is exceeded. If a set of indicators is used as input to ordinary kriging (for example, 10 quantiles of the input data distribution), the resulting set of predictions at each location can be combined to give a cumulative probability distribution, from which a probability density can be estimated and the prediction mean and variance can be calculated.
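A compact, hand-rolled R sketch of the procedure may help fix ideas (synthetic data, an arbitrary exponential covariance with no nugget): the data are indicator-transformed at a threshold and ordinary kriging weights are applied to the 0/1 values, so that the prediction is read as an exceedance probability.

# Indicator transformation followed by ordinary kriging of the indicators.
set.seed(7)
n    <- 60
locs <- cbind(runif(n, 0, 30), runif(n, 0, 30))
z    <- 10 + 3 * sin(locs[, 1] / 5) + rnorm(n)
ind  <- as.numeric(z >= 12)                        # indicator transformation
cov_exp <- function(h, sill = 0.25, range = 8) sill * exp(-h / range)
krige_indicator <- function(s0) {
  C  <- cov_exp(as.matrix(dist(locs)))             # data-to-data covariances
  c0 <- cov_exp(sqrt(colSums((t(locs) - s0)^2)))   # data-to-target covariances
  A  <- rbind(cbind(C, 1), c(rep(1, n), 0))        # ordinary kriging system
  w  <- solve(A, c(c0, 1))[1:n]                    # kriging weights
  sum(w * ind)                                     # estimated P(Z(s0) >= threshold)
}
krige_indicator(c(15, 15))    # may fall slightly outside [0, 1], as noted above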
Although indicator kriging became very popular immediately, a number of problems have been found. If different semivariogram models are used for different thresholds, internally inconsistent results may be obtained. One possible workaround is to use the median indicator variogram for all indicators. However, this nullifies the potential advantage of the model, namely that the spatial structure of a variable may depend on its value: for instance, we might expect the range of correlation to be smaller and the variance larger for large values. Nowadays indicator kriging is mostly used to provide risk-qualified predictions (the probability that a specified threshold is exceeded) at unsampled locations, and not for prediction itself.
Consider this kriging model for the signal Y(s) (see [17]):

Z_i = Y(s_i) + ε_i,   i = 1, ..., n,

where n is the number of measurements. This allows for more than one measurement at the same data location.
Geostatistical prediction and conditional simulation should not honor the data if there is measurement error, since no real data are exact. But geostatistical programs usually assume that the data are perfect, that is ε_i = 0, which
I(Z(B) ≥ T) = I((1/|B|) ∫_B Z(u) du ≥ T) ≠ (1/|B|) ∫_B I(Z(u) ≥ T) du,

see the discussion in [11], meaning that an additional assumption concerning the covariance between point and block indicators, cov(I(Z(s_i) ≥ T), I(Z(B) ≥ T)), needs to be made.
A basic assumption behind any standard geostatistical model is data stationarity. In reality, data more or less depart from stationarity, and the solution is to use detrending and transformation techniques to bring the data close to stationarity; see the case study comparing the performance of indicator, disjunctive, and other kriging variants in Krivoruchko [15]. However, indicator kriging uses the original data and there is no possibility of transforming the data to stationarity. Also, even if the original measurements are stationary, there is no guarantee that the transformed indicator variable will be stationary. For simple statistical models, departures from the stationarity assumption are more serious in their consequences for the reliability of inference than violations of the distributional assumption are for more complex models.
An important advantage of statistical models over deterministic ones is the possibility of estimating prediction uncertainty. Without data pre-processing,
the kriging standard error map does not depend on the data values, only on the measurement density. If the input data are transformed to an approximately Gaussian distribution, the prediction standard errors do depend on the data values. For example, Fig. 3, taken from [15], compares standard error of indicator maps created using indicator kriging and disjunctive kriging with data transformation, for the interpolation of radionuclide soil contamination in Southern Belarus. The probability map created using disjunctive kriging is data dependent, and the largest uncertainty corresponds to areas close to the selected threshold value. Without reliable information on modeling uncertainty, decision-making may be misleading.
This example shows how the practical use of methods can become encumbered with what we can call “encrustations”. A method that was introduced to address a pragmatic issue, or a group of issues, became established at least partly because other methods, acknowledged to be more adequate, were seen as practically or computationally infeasible, as well as poorly matched to users' possibilities. Over the intervening period not only the computational resources but also the research bases have changed, yet, not least for pragmatic reasons, analytical practice has not necessarily followed. We could have chosen to present other examples of areas where spatial statisticians differ sincerely in their approaches to analysis, and others have been noted above in brief. This will for now have to be sufficient to indicate some of the features of one of many debates.
Because of the problems described above regarding the use of indicator kriging, it is safe to use it as an ESDA technique, but not as a prediction model for decision-making. The conditional indicator simulation model is not advisable either, because of the above-mentioned problems with indicator kriging and because of some other problems; see Gotway and Rutherford [10].
where r̄ is the global mean value, defined and calculated as a simple average
value based on all the data. The data could be counts or rates, although in our
opinion working with count data is misleading since the underlying population
also varies among the regions, see detailed discussion in [16].
A local version, called a Local Indicator of Spatial Association or LISA by Anselin [1], is:

I_{i,std}^{local} = ((r_i − r̄)/s) Σ_{j=1}^{N} w_ij ((r_j − r̄)/s),

where r̄ and s are the overall mean and standard deviation, respectively, and the weights w_ij reflect the spatial proximity between regions i and j.
This statistic provides a measure of local similarity (or dissimilarity) for each
region. There are problems with LISA. For instance, it is hard to understand
how to interpret a case where, when using adjacency weights, two adjacent
regions have very different statistics.
Getis and Ord [9] and Anselin [1] give expectations and variances for the local indicators, using both assumptions of normality and randomization, following Cliff and Ord [4] for the global measures. The standard route to drawing inferences has been to treat the difference between the observed measure and its expectation, divided by the square root of its variance, as a standard normal deviate. The Gaussian distribution can be a good model for continuous data, but count data are inherently discrete. In randomization, assuming the observed values are exchangeable, an assumption of stationarity is effectively made; this is violated by counts and rates in any case, and also whenever stationarity is not present. Other indices, which allow the mean and variance of the data to vary with the population in each region and are thus more suitable for measuring clustering in regional populations, are available; see the discussion in [16].
Moran's I can be modified to relax the assumption of constant mean and variance. One such statistic for rates, see Walter [26], is:

I_i^{WM} = ((y_i − r̄n_i)/√(r̄n_i)) Σ_{j=1}^{N} w_ij ((y_j − r̄n_j)/√(r̄n_j)),

where the underlying risk r̄ is assumed to be constant over all regions and is estimated from the data over the entire region. This statistic is based on properties of the Poisson distribution, assuming that E(Y_i) = r̄n_i. The p-values can be computed using Monte Carlo simulation as follows [16]:
1. Generate simulated values for each region under the null (or default) hypothesis of spatial independence. Here we assume the data follow a Poisson distribution with mean E(Y_i) = r̄n_i; these values are simulated from the Poisson distribution and are not a permutation of the observed counts.
2. Compute the statistic of interest, in this case U = I_i^{WM}, for each simulated data set.
3. Repeat M times. This gives U_1, U_2, . . . , U_M.
4. Compare the observed statistic calculated from the available data, say U_obs, to the distribution of the simulated U_j and determine the proportion of simulated U_j values that are greater than U_obs.

The idea is to obtain the proportion of simulated values that are more extreme than the value determined from the data.
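A base R sketch of this Monte Carlo procedure for Walter's modified I is given below; the counts, populations and neighbour matrix are simulated and purely illustrative.

# Monte Carlo p-values for Walter's modified I under Poisson independence.
set.seed(99)
N <- 40
n <- round(runif(N, 500, 5000))            # populations at risk
y <- rpois(N, 0.02 * n)                    # observed counts
W <- matrix(rbinom(N * N, 1, 0.1), N, N)   # arbitrary binary weights,
W <- pmax(W, t(W)); diag(W) <- 0           # symmetrised, no self-neighbours
walter_I <- function(y, n, W) {
  rbar <- sum(y) / sum(n)                  # overall risk from all regions
  dev  <- (y - rbar * n) / sqrt(rbar * n)
  drop(dev * (W %*% dev))                  # one value of I_i^WM per region
}
U_obs <- walter_I(y, n, W)
M <- 999
U_sim <- replicate(M, walter_I(rpois(N, sum(y) / sum(n) * n), n, W))  # steps 1-3
p_val <- (rowSums(U_sim >= U_obs) + 1) / (M + 1)                      # step 4
head(p_val)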
It is natural that users regularly ask for probability values to be made available for global and local indices of spatial association, but providing them is a rather delicate procedure. Some software permutes all the data values across the set of units, as is typically done for global measures. But this does not provide an
adequate basis for inferring about the local neighborhood, in which the range
of values found may be much more restricted. One could attempt to simulate
for each neighborhood, but because the numbers of neighbors are small, very
few draws can be made before all possible combinations have been exhausted.
An underlying problem is that global autocorrelation, perhaps reflecting a
trend in the data, will yield apparently significant local measures, and will also
make the use of the whole pool of data values for simulation wrong, because
within a local neighborhood, the trend limits values to a narrow band. Users
are at risk of drawing conclusions from the output of local indicators that
are not robust, and it is not obvious how to indicate to them how dependent
these indicators, and derived measures, such as probability values, are on the
assumptions being made.
Often software users and developers assume that the data are independent and follow a stationary Gaussian distribution, but this is an unreliable assumption in practice, at least for aggregated data such as cancer and crime rates. In fact, almost all polygonal data are not continuous, and Moran's I should arguably be used only for pedagogical purposes. The best approach is Monte Carlo testing: we generate realizations from a specified univariate distribution that describes the data, calculate the local index for each polygon, and then compute the p-value as in the example using Walter's modified I above. This is certainly the best approach in the case of global statistics, but it may still be misleading for LISA, because the number of neighbors is usually small, often less than 10, and any statistic might be insufficient. One possible solution to the problem is to use several different indices, as in the case study by Krivoruchko et al. [16]. If all or most of the indices give similar results, we can safely draw conclusions about data clustering or cross-correlation. If not, further research is required.
4 Reproducible Research
Leisch and Rossini [19] present arguments for making statistical research re-
producible, so that given the same data, another analyst will be able to re-
create the research outcome used in a paper or report. If software has been
used, the implementation of the method applied should also be documented. If
arguments used by the implemented functions can take different values, then
these also need documentation. An example is the way in which a geostatisti-
cal layer in ESRI ArcMap is defined [12]. Most ArcMap layer types store the
reference to the data source, the symbology for displaying the layer, and other
defining characteristics. A geostatistical layer stores the sources of the data
from which it was created (usually a point feature layers), the symbology, and
other defining characteristics, but it also stores the model parameters from the
interpolation, including type or model for data transformation, covariance and
cross-covariance models, estimated measurement error, trend surface, search-
ing neighborhood, and results of validation and cross-validation.
second order properties of a spatial point process, which involve the relationship between the numbers of events in pairs within the chosen study area. The K function is a summary measure of second order effects and is estimated for a sequence of distances h by

K̂(h) = (R / n²) Σ_{i≠j} I_h(d_ij) / w_ij,

where R is the area of the study area polygon, n is the number of points, I_h(d_ij) is an indicator function which is 1 if d_ij < h and 0 otherwise, and w_ij is an edge adjustment – the proportion of the circumference of a circle centered on i and passing through j that lies within the study area. K̂(h) is often reported as L̂(h), where

L̂(h) = √(K̂(h) / π) − h.
Observed values of K̂(h) for a given study area polygon can be compared with simulated values of the same measure under a given spatial point process model. Most often the model chosen is that of complete spatial randomness, which involves simulating n points within the study area polygon for each simulated pattern following a homogeneous Poisson process. Results are displayed by recording K̂(h) for each simulation, and plotting the largest and smallest values for each h as a simulation envelope. If the observed K̂(h) leaves the envelope, this may be taken to show that – for the chosen number of simulations – it is unlikely that the observed pattern could have been generated by the process used in the simulation. When we wish to test whether a pattern is clustered, it may be more natural to use a process model that suits this hypothesis. The Poisson cluster process includes a spatial clustering mechanism in the model, so that an observed K̂(h) falling within a Poisson cluster process simulation envelope shows that the observed pattern could have been generated by such a model.
The following code example run in the R statistical computing environ-
ment [23] will generate reproducible results, in this case the plot shown
in Fig. 4. The function being called to generate the simulation envelope is
pcp.sim(), contributed to the splancs package [24] by Giovanni Petris, and
using faster code changing the order of calls to the random number generator
contributed by Nicolas Picard, which can be turned off by setting argument
vectorise.loop=FALSE.
# Load the "splancs" package
library(splancs)
# Load the Cardiff juvenile offenders domiciles point
# data set and bounding polygon, assign the distance
# sequence and compute Khat
data(cardiff, package="splancs")
r <- seq(2, 30, by = 2)
K.hat <- khat(as.points(cardiff), cardiff$poly, r)
# Compute the fitted Poisson Clustering Process
pcp.fit <- pcp(as.points(cardiff), cardiff$poly, h0=30, n.int=30)
218 K. Krivoruchko and R. Bivand
m <- npts(as.points(cardiff))/(areapl(cardiff$poly)*pcp.fit$par[2])
# Set the random number generator seed and perform the simulation
# to find the simulation envelope bounds
RNGkind(kind="Mersenne-Twister", normal.kind="Inversion")
set.seed(123)
K.env <- Kenv.pcp(pcp.fit$par[2], m, pcp.fit$par[1], cardiff$poly,
nsim = 20, r = r, vectorise.loop=TRUE)
# Create a function to convert Khat values to Lhat
Lhat <- function(x, r) sqrt(x/pi) - r
# Apply the function to the simulation results
L.env <- lapply(K.env, Lhat, r)
# plot the observed Lhat values
limits <- range(unlist(L.env))
plot(r, Lhat(K.hat, r), ylim = limits,
main = "L function with simulation envelopes and average",
type = "l", xlab = "distance", ylab = "", lwd=3)
# Add the simulation average and envelope to the plot
lines(r, L.env$lower, lty = 5)
lines(r, L.env$upper, lty = 5)
lines(r, L.env$ave, lty = 6)
abline(h = 0)
Fig. 4. Observed L function for Cardiff juvenile offenders' places of residence, with Poisson cluster process simulation envelope using the “Mersenne-Twister” random number generator
Running the code calculates the K̂(h) function from the observed spa-
tial pattern for the chosen sequence of distances, using edge correction for a
bounding polygon, and plots its L̂(h) transformation. We test against a Pois-
son cluster process by simulating such a process within the bounding box,
here only 20 times, and plotting the maximum, mean, and minimum sim-
ulated L̂(h) values around the observed values (Fig. 4). It appears that the
observed data pattern could have been generated by a Poisson cluster process.
Having the code, the specified version of the splancs package and R, a
reviewer can re-investigate the impact on our conclusions of changing the
boundaries used for calculating edge effects, the number of simulations, the
distance sequence, the random number generator, and other parameters of
the model. Research should be documented not just for academic reasons,
and the provision of mechanisms for journaling methods used and thereby
securing the lineage of objects in documents within or derived from GIS is
necessary. Review and decision-making based on reproducible research using
well-documented closed (Geostatistical Analyst) or open (R) code software
is transparent and verifiable and does not require the participation of skilled
researchers. If the steps taken are documented and can be reproduced, the
results are available for checking in the future by the same or other users.
Should newer methods or fresh data become available, the documentation of
the lineage of results means that they can continue to be valuable for the
organizations that have invested in their collection and processing.
5 Concluding Remarks
Acknowledgements
This work was supported in part by EU Contract Number Q5RS-2000-30183.
The authors would like to thank Carol Gotway-Crawford for her helpful
discussions.
References
Reassignment of the Farm Structure Statistical Data Using GIS
1 Introduction
From a rural land use perspective, an important development in Europe is that
agricultural activities are being combined with other activities such as envi-
ronmental care, maintaining the landscape, forestry, preserving recreational
and tourist areas, etc. As a result, there is a strong need for statistical data on
rural populations and particularly on landscapes and land use, which are by
their nature spatial in form. The management, the processing and the display
of such statistical data is therefore, to a large extent, a spatial process. In
this respect, GIS is considered necessary in the production of census maps,
for dealing with census logistics, for monitoring census activities, and for data
dissemination [2].
With the advent of GIS, a wide range of spatial analysis methods has
been developed for carrying out data transformations between different spa-
tial structures. These methods help to present the data in a more meaningful
and consistent manner and enable different data sets, based on different ge-
ographical units, to be brought together and overlaid. They also facilitate
the spatial analysis of statistical data required in the development and/or
calculation of more reliable indicators for the determination of the state and
quality of the environment, and the ability to measure the effect of the agri-
cultural economy, across regions and countries. Most policy makers concerned
with agri-environmental issues at the national level are confronted with frag-
mented information and it is accordingly difficult to use the information in a
way that effectively contributes to policy decision making.
A necessary step in the assessment of agricultural policies and of their
impact on the countryside and landscapes is the study of spatial units that
constitute the underlying structure of these areas. Most statistical data in the
European Union (EU), by means of the Farm Structure Survey (FSS) data,
new satellite data collected in 1998–1999 in relation to those used for the cre-
ation of the Hellenic geo-statistical database. The digital photo-interpretation
of the new satellite data is made using image processing software and other
data such as those from land recordings. The recording, planning and the
use of the data from the field work also define the reliability of the specific
photo-interpretation.
The new geographical database for the country’s area has numerous ad-
vantages, the most important of which are the following:
• It provides a land use/cover map covering all Hellenic territory using 16
classes.
• It takes into account the FSS nomenclature and definitions.
• It enables comparability between different time periods, using the same
source of information, namely census or photo-interpretation.
• It enables comparability between the two sources of information, namely
census versus photo-interpretation. In the case of Hellenic Republic, the
acquisition period of the data is spread over 2 years for both, the LTM
1998–1999 and the FSS 1999/2000 (reference year the 1998–1999 crop
year).
• It enables the integration of the chrono-geographical co-ordinates of the satellite image sources of CLC. This will help in identifying districts for which the image interpretation is one year apart (minus or plus) from the census year (1990 or 2000, respectively). In addition, using the intermediate FSS data that correspond closely to the date of the satellite image, it will be possible to mitigate the effect of time.
As it appears, the new geo-statistical database is in principle more accurate
than CLC. It can be used to calibrate diversity measurements computed from
CLC, although there are some problems because the reference dates may not
coincide. The methodology has been tested in the region of Crete (NUTS II).
The island of Crete covers about 8,267.45 km²; it is located in the southernmost part of the Hellenic Republic and is divided into four administrative areas (NUTS III). To describe the methodology adopted for the problem we are studying, one has to take into account the non-matching areal units and the Modifiable Areal Unit Problem (MAUP) [5]. Note that the problem of temporal incompatibilities, and the procedure for matching data points that do not match due to different collection cycles, are not considered here.
Starting with the non-matching areal unit problem, as it appears in the pilot case, a new object, called the interoperable geo-object, is introduced. This object includes all the procedures required to solve the following two problems.
The first problem has been solved with an appropriate transformation between different spatial structures. This transformation determines the process of aggregation and disaggregation within nested, non-nested and neighboring polygons. To overlay the data, the conceptual model of Fig. 1 has been designed. This model contains and maintains all the polygons and the related geometric data (lines, nodes, etc.) representing the areal units. To link the descriptive with the spatial information, the data of the geographical area have been divided into smaller parts in order to determine the field that identifies the specific entity (PolyKey), which has been used as a reference key to the GIS. Also, a set of spatial queries has been developed to carry out the above transformation. The second problem has been solved through the development of a common geodetic datum, which represents the statistical and the ancillary geographical data jointly on a map. Finally, an automated procedure has been developed to convert the data from the original to the target geodetic datum.
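The two basic operations behind this transformation, aggregation of commune values to districts through the reference key and proportional disaggregation of a district total, can be sketched in a few lines of base R; the table and column names (PolyKey, district, area_ha) are hypothetical.

# Aggregation: commune values summed to the district they belong to.
lookup <- data.frame(PolyKey  = c("C01", "C02", "C03", "C04"),
                     district = c("D1", "D1", "D2", "D2"),
                     area_ha  = c(1200, 800, 500, 1500))
fss <- data.frame(PolyKey = c("C01", "C02", "C03", "C04"),
                  arable_ha = c(300, 150, 90, 410))
x <- merge(fss, lookup, by = "PolyKey")
aggregate(arable_ha ~ district, data = x, FUN = sum)
# Disaggregation: a district total spread over its communes by area share.
district_total <- data.frame(district = "D1", pasture_ha = 600)
y <- merge(lookup, district_total, by = "district")
y$pasture_ha_commune <- y$pasture_ha * y$area_ha / ave(y$area_ha, y$district, FUN = sum)
y[, c("PolyKey", "pasture_ha_commune")]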
The MAUP has been addressed by increasing the spatial detail, using ancillary geographical data [4] such as contour lines, lines representing rivers, or polygons representing lakes. This allows the synthesis of geographical data along with the statistical data. Further, it allows different scenarios to be combined in order to simulate the plotting of the statistical data on a map. For validation and/or prediction purposes, the results have been compared visually with other spatial quantitative information or with sampling data presented on thematic maps.
To automate the transformation between different definitions of administrative units and to achieve the connection between a file containing quantitative data (usually statistical) and GIS data (ancillary and statistical), two object classes have been developed, namely a class for data manipulation and a class for GIS manipulation.
As has been pointed out, the linkage of the two nomenclatures, that is, of the structure survey and the geographical databases, requires the development of a software tool able to display maps and descriptive data in tabular form. This has been achieved by linking the geographical information with the multi-dimensional tabular information of the FSS. Thus, the user becomes part of the GIS without needing specific skills or intimate knowledge of the data used.
The application consists of four items, namely, the relational database, the
class of objects for data manipulation, the class of objects for GIS manipu-
lation and the main body of the application software containing the above
items along with the functions required by the end user.
To begin with, a step-by-step analysis of the software design is required.
The appropriate design steps are described below:
1. The ancillary geographical features, such as contour lines, roads, cities,
lakes and rivers are added on the geographical layer of the area of interest.
This will help to localize the geographical data.
2. From the FSS database only themes associated with agricultural products
have been selected. Note that the use of the geo-object offers the capability
to work at different levels of administrative units. However, in the pilot
case, the FSS data have been selected at prefecture level (NUTS III), in
thousands of hectares, as they are reported in the 2000 census.
3. We develop the entity relationship model as well as the relational database
of the software tool, based on the data provided by the FSS database and
geo-database.
4. The geographical data have been stored in database tables of the software tool, using specially developed functions. Further, the OLEServer method of the QuantitativeInput object has been used with the appropriate DLLs, which have been provided by the FSS, in order to transfer the FSS data into the database.
5. We define the appropriate functions and queries, and we developed object
classes in order to achieve uniformity at both the user and the developer
levels.
6. We developed an application in which the RDBMS, the GIS and the aforementioned object classes have been used. The basic capabilities offered by this application are the following:
• Compose (aggregate) a new FSS theme by selecting one or more
classes, and vice versa.
• Decompose (disaggregate) an existing FSS theme to one or more
classes, and vice versa.
• Correspond (relate) the new FSS themes to classes.
• Classify (sort) the results by date, county (region), or by class.
• Observe the results plotted on a map and classify these by some ge-
ographical characteristics (e.g. allocation of the selected growth by
elevation).
7 Data Analysis
Although the new geo-statistical nomenclature has been harmonized with the FSS nomenclature, there are still some problems related to the two different methodologies. The analysis of these problems has been carried out by comparing the respective areas of the related classes. The available data from the 2000 FSS are based at the Municipality/Commune level (NUTS IV), whereas the data drawn from the new geo-statistical nomenclature are at the district level (NUTS III). The data of the two databases have been compared in a pilot study of the four districts (NUTS III) of the region of Crete (NUTS II). The comparison shows large differences in the agricultural areas. Generally, the examined agricultural areas in the geo-statistical nomenclature are greater than the corresponding agricultural areas in the 2000 FSS. The differences arise from the difficulties in correlating the pasture areas between the two databases, whereas the differences in the arable areas and the areas under permanent crops are related to the different methodologies.
The results found so far are presented in Table 1. Table 1(a) presents the
differences (%) in arable areas, areas under permanent crops, and cultivated
areas, as they were recorded in the districts (NUTS III) of the examined
regions, between the two nomenclatures. A positive sign is in favor of the geo-statistical nomenclature, whereas a negative sign is in favor of the FSS nomenclature. Note that the actual differences in the above classes are not as high as those in the remaining classes, namely agricultural areas (Table 1(b)), pastures and meadows (Table 1(c)), and heterogeneous areas (Table 1(d)). To facilitate the comparison, the actual values are presented for the latter classes.
Table 1. Results showing the differences between classes, as they have been recorded
in the 2000 FSS and the geo-statistical databases
Table 1(a) Differences (%) between the two nomenclatures (2000 FSS – GeoStat)

Region (NUTS II)  District (NUTS III)  Arable areas  Permanent crops  Cultivated areas
Crete             IRAKLIO                   −71             4               −4
                  LASITHI                    54            47               48
                  RETHIMNO                  −91            −7              −24
                  CHANIA                    −72             4               −4
                  Total                     −66             6               −3

Table 1(b) Agricultural areas (ha)

Region (NUTS II)  District (NUTS III)  2000 FSS   GeoStat   Difference
Crete             IRAKLIO               221,982   139,733     82,249
                  LASITHI               127,252    37,864     89,388
                  RETHIMNO              115,842   101,182     14,660
                  CHANIA                116,472   109,191      7,281
                  Total                 581,548   387,970    193,578

Table 1(c) Pastures and meadows (ha)

Region (NUTS II)  District (NUTS III)  2000 FSS   GeoStat   Difference
Crete             IRAKLIO                36,412    69,070     32,658
                  LASITHI                16,817    61,631     44,814
                  RETHIMNO               62,470    53,241     −9,229
                  CHANIA                 63,410    40,167    −23,243
                  Total                 179,109   224,109     45,000

Table 1(d) Heterogeneous areas (ha)

Region (NUTS II)  District (NUTS III)  2000 FSS   GeoStat   Difference
Crete             IRAKLIO                   143    54,339     54,196
                  LASITHI                    12    34,433     34,422
                  RETHIMNO                  159    33,372     33,213
                  CHANIA                     14    32,420     32,406
                  Total                     328   154,564    154,237
It has been observed that the above differences at the regional level (NUTS II) are generally smaller than the corresponding inter-regional ones (district level; NUTS III). This is due to the fact that the mapping unit of 25 ha in the new CLC is not able to identify parcels of smaller size. This is the case in Greece, where the average holding size is around 4.5 ha and the average parcel size is around 0.7 ha. An additional reason is that in the FSS all holdings are recorded at the place of residence of the holder (natural person) or the headquarters (legal person) of the holding.
8 Conclusions
The work presented so far is a pilot study merging, with the use of a software tool, the statistical data available at the administrative level with the geo-referenced land cover, in order to identify and explain the most significant differences encountered between the aggregates of agricultural land cover classes. This has been achieved thanks to the creation of a new geo-statistical database, which is based on both the FSS and the CLC nomenclatures.
The above geo-statistical database seems to provide a good mapping base for Greece, which could be improved further by using suitable satellite images able to produce maps at a scale of at least 1:50,000. Note that the imposed minimum mapping unit of 25 ha results in an overall underestimation of the diversity of landscapes, something which is particularly important in the case of Greece, for which the average size of the holdings is 4.5 ha. Additional sources may be used to provide detailed complementary information,
such as aerial ortho-photographs, the cadastral map of Greece, IACS (In-
tegrated Administrative Control System), MARS (Monitor Agriculture with
Remote Sensing), NATURA2000 database, or other ongoing analysis of the
European landscape.
The methodology of using the interoperable geo-object in conjunction with RDBMS settings and OOP logic means that many of the objects can be reused in similar GIS applications with little maintenance effort. The application developed is an easy-to-use tool, ideal for comparing descriptive census results with interpreted geo-data, as well as for drawing conclusions about the correctness of these data. If the expert combines the ability of simultaneous comparison and display of results from different years, the conclusions will be better founded.
Future research is threefold. Firstly, we will continue improving the idea of the interoperable geo-object by adding methods and properties for uncertainty manipulation and by investigating the requirements of GIS in a fuzzy object data model. Our final objective is to provide the geo-object with the ability to generate and visualize transitions from one state to another, using the rules of an expert spatiotemporal system. Related work on this aspect is given in [8]. Secondly, this study may be considered a first step towards presenting geo-referenced statistical and/or agricultural and environmental data. As soon as this initiative is completed, it will become possible to redistribute quantitative data other than land use from the FSS by defining distribution rules using co-variables. Finally, this research will facilitate the spatial analysis of statistical data required in the development and/or calculation of more reliable indicators.
References
Epidemiological Information Systems
1 Introduction
Since the range of applications of GIS in Public Health is nearly unlimited (as in many other fields), we will focus on Spatial Epidemiology, which covers different topics concerning the spatial spread of diseases: disease mapping, detection of clusters of disease, ecological analysis, etiology, etc.
In this paper we describe how a Geographic Information System for Spatial
Epidemiology can be developed and we briefly discuss the main points to which
attention should be paid.
2 Managing Data
Data needed are determined by the conclusions we want to draw from the
studies. Usually, the main concern is to explore the spatial distribution of
a group of diseases (mortality and/or morbidity), which is accomplished by
means of Disease Mapping (see Sect. 3.2). The second step is often the detec-
tion of those regions where there exists a higher risk of suffering from these
diseases, known as Risk Assessment (as explained in Sect. 3.3).
When a high risk has been detected an explanation is usually required.
Sometimes it is possible to look for relationships between risk and a number
of covariates. This is done via Ecological Analysis, as described in Sect. 3.4.
Studies are always restricted to a period of time and to a particular area. The available data are aggregated on the basis of the units used to measure space and time. For time the year is often used, while there is no clear preference for the spatial units, since they usually depend on administrative boundaries.
The level of aggregation is restricted due to confidentiality issues. Data available in the studies are usually in a form that prevents the identification of single individuals. This means that short periods of time or very small areas can't be used.
Data quality tends to be quite good for mortality, but this is not the case for morbidity, except for a small set of ‘important’ diseases for which special registers are drawn up (for example, childhood malformations, AIDS or cancer).
Measuring exposure to a risk factor is always difficult, and it is often impossible to take exact measurements for every person at risk. Residence is often used as a proxy for the place of exposure, although it can be misleading since people are also expected to spend considerable time at work, for example.
Nevertheless, government agencies and other data providers usually link their information to the appropriate administrative areas, be it quarters, electoral districts, etc. When this is not the case, it is not difficult to refer the actual locations to the standard administrative areas [27].
Finally, it is important to update data on a regular basis, so that recent problems can be investigated.
This kind of data usually comes from a wide range of sources, and the variables of interest depend on the kind of study to be carried out. Some of them may be available from government agencies, but others must be collected ad hoc for the studies.
Maps can be created to represent the spatial distribution of these variables (see Fig. 1). When a continuous representation is required, geostatistical methods [8] can be used to provide estimates at those points where a sample hasn't been taken.
Social and economic factors may also influence the results. It has been shown that way of life, diet, deprivation, etc. may have a strong influence on disease risk. A deprivation index is created taking into account different variables and is often used in standardisation to filter out socio-economic inequalities [12].
Once all these data have been collected, the next step is to make them available from a common source. This compilation implies a debugging of the data in order to detect and correct (or remove) wrong entries. This point is crucial, since data quality is the key to what kind of studies can be carried out. Imputation methods must be used to fill in missing data.
The same database should be used to store all the information available in order to integrate different types of data, and a link among them must be established. This is done by means of the spatial location of the data, so it is necessary to unify how it is specified: coordinates, municipalities, counties, etc. This is a challenging task, since there can be many different spatial units.
When data are measured at individual locations (i.e., different measures at single, possibly different, points), a mechanism to associate these points with administrative regions may be needed [27]. Point data can be analysed, but following a different approach from the one employed for area-based data.
3 Statistical Analysis
Although standard and simpler methods can be used to carry out a number of analyses, we should take advantage of the spatial nature of our data and the special characteristics of the problem we are facing [8, 19].
As a rule, cases (mortality and/or morbidity) and population are stratified according to sex, age group and, if available, a measure of deprivation or poverty. These factors have been shown to be important in the analysis, and this approach helps to reduce bias in the estimations and to remove the effect of these factors [22].
We will also find different measures for administrative regions and periods of time, so that we have variables O_ijk and P_ijk for the observed number of cases and the population, respectively, in region i, age-sex-deprivation stratum j and period of time k. When working with purely spatial models, the third index is dropped and the data are aggregated over time.
When calculating incidence rates, a quotient between affected people (numerator) and population at risk (denominator) is formed [22]. When a comparison between different studies at different locations is required, another region is used as a reference to calculate standard rates for each age-sex-deprivation stratum.
If we call r_jk the reference (or comparison) rate for stratum j and period k and suppose no region – age-sex-deprivation interaction [34], we can obtain an estimate of the expected number of cases in region i, stratum j and period of time k by E_ijk = r_jk · P_ijk. This process is called standardisation [22].
Most models suppose O_ijk drawn from a Poisson distribution whose mean is θ_ik E_ijk, where θ_ik is called the relative risk [34]. Its maximum likelihood estimator is θ̂_ik = Σ_j O_ijk / Σ_j E_ijk, called the Standardised Mortality Ratio (S.M.R.). Usually, a confidence interval is calculated to test for significant departure from the value 1, which marks the standard risk.
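For illustration, the following base R lines carry out the standardisation and compute the S.M.R. for one region with two hypothetical age strata, together with an exact Poisson confidence interval:

# Indirect standardisation and SMR with an exact (Garwood-type) 95% interval.
obs      <- c(12, 30)          # observed cases O_ij in strata j = 1, 2 (invented)
pop      <- c(15000, 8000)     # population at risk P_ij
ref_rate <- c(5e-4, 3e-3)      # reference rates r_j from the standard population
expected <- ref_rate * pop     # E_ij = r_j * P_ij
O <- sum(obs); E <- sum(expected)
smr <- O / E
ci  <- c(lower = qgamma(0.025, O) / E,
         upper = qgamma(0.975, O + 1) / E)
c(SMR = smr, ci)               # departure from 1 indicates excess or deficit risk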
Another layer can be added to the model in order to explain how θ_ik is distributed (see Sect. 3.5). Usually a Generalised Linear Model is constructed [26], although Bayesian hierarchical models with spatial structure have also been used to smooth relative risks (as shown in Sect. 3.5), so that neighbouring regions are taken into account in the local estimates.
By taking a look at a disease map, groups of areas with higher risk (clusters)
can be detected. Due to the problems mentioned in the previous subsection,
however, this approach can be misleading and inaccurate.
Since the detection of clusters of disease is one of the priorities for epi-
demiologists, a number of methods have been developed for this purpose, and
a few reviews have been published in recent years [19, 25, 33].
In the investigation of clusters of disease we can mainly distinguish two
types of study: searching for clusters anywhere in the study area [33] or investigating
a known putative pollution source [10]. Clearly, the statistical assumptions are
quite different depending on which one we are working on.
The relationship between disease and risk factors can be investigated through Ecological
Regression. Generalised Linear Models [26] are often used for this purpose,
although Generalised Additive Models have also been used [20]. An example of
Ecological Regression using GLMs is the study of the relationship between
the hardness of drinking water and cerebrovascular mortality [14].
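A minimal sketch of an ecological regression of this kind as a Poisson GLM in R; the area-level data frame and the variable names (observed, expected, hardness) are hypothetical:

# Invented area-level data: observed and expected cases plus an exposure covariate
areas <- data.frame(observed = c(12, 30, 7, 25, 18),
                    expected = c(10.2, 24.8, 9.1, 19.5, 16.4),
                    hardness = c(120, 45, 200, 60, 95))

# Log relative risk modelled as a linear function of the covariate,
# with the expected counts entering as an offset
fit <- glm(observed ~ hardness + offset(log(expected)),
           family = poisson, data = areas)
exp(coef(fit)["hardness"])  # multiplicative change in risk per unit of hardness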
When performing an Ecological Regression, it is important to pay attention
to how the risk exposure has been measured. If different levels of aggregation
are used, it may happen that some measurements have been made at a broad
level, i.e., the same exposure is assigned to a wide range of the population, and a bias
may be introduced in the analysis. This problem is known as the Ecological
Fallacy [28].
The Bayesian paradigm has been successfully applied to all fields of Spa-
tial Epidemiology. It is based on Bayes’ Theorem, so that the posterior
distribution (after observing the data) is calculated as proportional to the product
of the likelihood and the priors of the random variables. When the posterior distributions
cannot be obtained analytically, as is usually the case, MCMC techniques [16] are employed
to simulate from them.
Probably, the first Bayesian Hierarchical Model applied to Spatial Epidemiology
was the one proposed by Clayton and Kaldor [6], which places a Gamma prior
on all the relative risks and produces a globally smoothed estimate of the
relative risk θ̂i = (Oi + ν)/(Ei + α), which is a compromise between the S.M.R.
of region i and the prior mean (ν/α), reducing extreme values.
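A small R sketch of this Poisson-Gamma smoothing; the Gamma parameters ν and α are simply assumed here (in practice they are estimated from the data), and all counts are invented:

O <- c(0, 2, 15, 4, 1)           # observed cases per region
E <- c(1.1, 2.3, 9.8, 4.4, 0.6)  # expected cases per region

nu    <- 2.0   # assumed Gamma shape (prior mean nu/alpha = 1)
alpha <- 2.0   # assumed Gamma rate

smr      <- O / E                   # raw standardised ratios
smoothed <- (O + nu) / (E + alpha)  # Clayton-Kaldor posterior means

# Extreme SMRs in regions with small expected counts are pulled towards
# the prior mean nu/alpha
cbind(smr, smoothed)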
Other models, such as the one proposed by Besag et al. [2], also produce
smoothed estimates of relative risks. The logarithm of the relative risk is
expressed as the sum of the effect of neighbouring areas plus the effect of
the local area, which can include a linear function of covariates [13]. The
resulting estimates of the relative risks are thus smoothed by taking the
effect of the neighbours into account.
Smoothed estimates of relative risks obtained in these models can be used
to produce choropleth maps. Comparing these maps to those made from SMRs
will show how the effect of extreme values in low-population areas is reduced.
First of all, it is necessary to know current and future needs before designing
the whole system. Perhaps we only need to care about the spatial spread of
diseases, without paying attention to possible causes. Or, on the contrary,
the main concern may be to investigate sources of pollution to see how they affect
health. This is important because data and statistical methods will depend
on these needs.
For basic models with few data, the data could be imported into any available
statistical software, while for huge amounts of data, statistical methods will
require better integration with the database in order to make analysis feasible
in a short time (or feasible at all).
The time required for the analysis may also be important. Exploratory tools
may be used for a first, rapid look into a problem to decide whether a further
investigation should be carried out. If so, more time-consuming
methods (such as, for example, Bayesian models fitted via MCMC) may
be required for a more accurate analysis.
Since data are collected from many different sources, it is crucial to integrate
them into a single database. Before doing this, data quality must be assured
and the data should be checked for inconsistencies, wrong entries, missing values, etc. Data
can be linked by referring to their spatial location, as explained in Sect. 2.
Health data can be stored as single events, but usually a minimal aggre-
gation is defined for space and time, and data will be aggregated on the fly
when performing a study. Depending on the amount of data available it may
be useful to create separate tables of aggregated data to speed up future
investigations.
It is important to provide a way to move up and down the different ad-
ministrative layers in order to be able to carry out studies at different levels.
This can be done by providing a conversion table from one level into another.
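One way such a conversion table and the move to an upper level might be implemented in R; the municipality and province codes are invented:

# Conversion table linking the lower level (municipality) to the upper one (province)
lookup <- data.frame(municipality = c("M1", "M2", "M3", "M4"),
                     province     = c("P1", "P1", "P2", "P2"))

# Case counts stored at municipality level
cases <- data.frame(municipality = c("M1", "M2", "M3", "M4"),
                    observed     = c(4, 7, 2, 11))

# Moving up one administrative layer: join on the common code and aggregate
merged <- merge(cases, lookup, by = "municipality")
aggregate(observed ~ province, data = merged, FUN = sum)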
Basic statistics can be computed with any statistical software available. Al-
though some GIS packages are incorporating methods for spatial
statistics, these are mostly focused on geostatistics and it is difficult to find
the methods used in Spatial Epidemiology.
Accessing the data directly in the database can solve this problem but, as
noted before, when the amount of data is large this can be very slow, and
it may be better to implement the statistical methods inside the database.
Unfortunately, as stated in [9], there is a lack of software in the field of
Spatial Epidemiology, and what can be found so far are isolated programs that
implement only a few methods.
Although GIS, databases and statistical software can be used separately when
performing studies, it can be more helpful (and less time-consuming) to develop
a unified tool.
5 Software Developed
5.1 Rapid Inquiry Facility
The Rapid Inquiry Facility (RIF) was initially developed at the Small Area
Health Statistics Unit [1], but it was rewritten within the framework of the EU-
ROHEIS Project [7], funded by the European Commission. It was intended
to be a Health Information System for Disease Mapping and Risk Assessment
around putative pollution sources.
It is based on ArcView 3.2, Oracle Database 8i, and Oracle Forms and
Reports. A graphical user interface developed in Avenue (ArcView’s internal
programming language) allows the selection of study and comparison regions
together with the period of time and diseases to investigate. Two types of
studies can be done: Disease Mapping and Risk Assessment around putative
pollution sources.
The RIF was developed with a standard structure (see Fig. 2) in order
to allow all the partners of the project to customise it at their Local Health
Authorities with their own data. As the Spanish Partner, we did it at the
Conselleria de Sanitat (Comunidad Valenciana, Spain).
As numerator tables we use mortality data, while hospital admissions (after removing
repeated registers) are used as a proxy for morbidity. The population, as provided
by the Census register, is used as the denominator. All these data are available from
1989 to 1998 at the level of municipality, due to confidentiality restrictions.
A deprivation index [12] was also developed and incorporated into the
system to be used, together with sex and age group, in standardisation.
While mortality and morbidity are stored as single cases in the database
(and later aggregated in the studies), the population is stored as the number of
inhabitants per age-sex-deprivation stratum in each municipality.
Administrative boundaries at three levels (municipality, province and au-
tonomous community) are available in the system, and any of these levels can
be used when carrying out a study. They are stored as shapefiles, which are
used by ArcView to create maps. The code of the municipality is used to link
the different data available.
For Disease Mapping, basic statistics (expected cases, observed cases, rel-
ative risk and its 95% confidence interval) are calculated for each region under
study, and the Poisson-Gamma model [6] is used to provide smoothed estimates
of the relative risks.
For Risk Assessment, one or several points representing putative pollution
sources (nuclear plants, waste incinerators, tile industry, etc.) are selected,
regions are grouped according to their distance to these points, and the
same basic statistics as before are calculated.
Besides maps based on the results, a report with all the statistics is created
by Oracle Forms and Reports. Results are grouped by administrative region
and into six groups depending on sex (males, females and both sexes) and type of
standardisation (whether the deprivation index is taken into account). This way of
presenting the data is useful to compare risk between different sex and deprivation
levels. All studies are stored in the database so they can be accessed later.
Although the RIF provides a rapid look into the data, we missed a few
capabilities in the system. For example, it does not perform any test to compare
results among different sex-deprivation strata, which is quite important. Furthermore,
covariates cannot be used in the studies and there is no possibility of
exporting results to be analysed with external statistical software.
The use of covariates in the study was implemented inside the RIF using
the existing structure. Covariates are treated like sex, age group and the deprivation
index in standardisation [12]. Each covariate is split into groups defined by
the user, and rates are calculated with and without standardising by the covariate
groups. If the covariate really has a relationship with the disease, we expect
the two sets of results to differ.
Acknowledgements
This work has been partly funded by Conselleria de Sanitat, Comunidad Va-
lenciana (Spain) and EUROHEIS Project, code SI2.329122 (2001CVG2-604),
funded by European Commission.
References
1 Introduction
The emerging field of Geomatics has found useful application in several re-
search areas. A new phase of its development merges it into the realm of
epidemiology and public health to bring insight into the regional disparities
of disease incidence and for disease surveillance. New advances in biostatistics
include spatial statistics methods which aim to specifically understand and
model the spatial variability. Spatial statistics, when combined with geomat-
ics, constitutes an excellent and powerful analysis approach to handle and
better understand health issues, specifically in disease-related prevention and
intervention studies.
The main goal of this paper is to explore the integration of geomatics and
spatial statistics with an application to a specific health issue. The outcome of
interest is acute coronary syndrome (ACS) incidence in the province of Que-
bec (Canada) between 1996 and 1998, and hospital readmission at one month
post-discharge. It is an established fact that mortality and hospital-acquired
infection rates are indicators of the quality of care [24], but more recently
some studies have turned their attention to the early hospital readmission
rate [2, 26]. Within this context, the following specific questions are addressed: Is
there spatial heterogeneity and/or spatial aggregation in the ACS incidence
and early readmission rates? Is there any geographical trend in the rates?
Is there an explanation for the spatial heterogeneity? By ACS, we mean the
occurrence of myocardial infarction (MI) or unstable angina. By early read-
mission, we mean readmission of a patient for coronary heart disease (ACS
and angina) within 30 days post-discharge. Although practice guidelines have been
in circulation to standardize the treatment and follow-up of acute myocardial
infarction [16, 22], regional variations are currently reported in the literature
[20, 21, 25]. The presence of a complex network of factors influencing care
quality [2], hospital readmission and the interaction between them have been
put forward as the potential explanation of the observed spatial variability in
hospital readmission rates. Most of these factors center on the patient while
the data available and/or the interests of public policy makers focus on rates
of local health units over a given time period. Over these administrative ge-
ographical units, interest centers on variables that could explain spatial vari-
ability, such as deprivation indices and other area-specific characteristics, and
help to understand inequalities within health care services and accessibility.
2 Methods
2.1 Population
The studied population consists of all the patients living in the Québec
province in Canada, hospitalized for an ACS. The first registered hospital-
ization during the study period (1996–1998) will be considered as the “index
hospitalization”. The Québec register “Maintenance et Exploitation des Don-
nées pour l’Étude de la Clientèle Hospitalière (MED-ECHO)” made possible
the identification of the patients that fit the inclusion criterion. This register lists
all summary administrative data collected when any patient is treated in an
acute care hospital in the Québec province. The validity of these data, concerning
MI, has been studied and the results published [14, 19, 27, 28]. The inclusion
criterion is the presence of code 410 (MI) or 411 (unstable angina) of
the International Classification of Diseases, 9th revision (ICD-9), as the main
diagnosis for the hospitalization. In order to increase the study’s internal va-
lidity, we excluded patients with an erroneous code of residence and patients
that were younger than 25 years old, because they are more likely to have had
an MI caused by a different pathophysiological process. For the same reasons,
we wanted ‘new’ cases of ACS, so we excluded patients that were hospitalized
with an ACS in the year preceding the index hospitalization.
2.3 Variables
The hospitalization for ACS and early hospital readmission are our depen-
dent variables. The former is defined as the first occurrence of an
ACS hospitalization (main diagnosis ICD-9 codes 410 or 411) in the 3-year
study period (the index hospitalization), and the latter is defined as an early readmission
for a coronary heart disease (main diagnosis ICD-9 codes 410 to 414) in the
30 days following the index hospitalization. The incidence and the readmis-
sion rates were calculated by Local Health Unit (LHU). For the readmission
rate, we excluded all in-hospital deaths and all deaths occurring within
30 days post-discharge. Two deprivation indices were retained as potential
explanatory variables for the spatial heterogeneity in health outcomes. Pam-
palon et al. [18] calculated two sets of deprivation quintiles for approximately
9,000 enumeration areas (EA) using a principal component analysis (PCA).
The two principal components of this PCA reflected a material dimension
of deprivation (the percent of people without a secondary certificate, the
employment-to-population ratio, the average income) and a social dimension of de-
privation (the percent of people divorced, separated or widowed, the percent
of single-parents, the percent of persons living alone). For each of these PCA
components, the EA were ordered by their factor score – from the least to
the most deprived – and then the population was divided into quintiles (the
fifth quintile being the most deprived). For each LHU, the population that
belongs to each quintile was calculated. Based on these quintiles, we defined
two variables, the material deprivation index (MDI) and the social deprivation
index (SDI), defined as the percent of the LHU population that belongs to the
fifth quintile of the corresponding component.
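A small R sketch of this last step, computing the percentage of an LHU population that falls into the fifth (most deprived) quintile; the counts are invented:

# Population of one LHU by material-deprivation quintile
pop_by_quintile <- data.frame(quintile = 1:5,
                              pop      = c(5200, 4800, 5100, 4700, 6200))

# Material deprivation index: percent of the LHU population in the fifth quintile
with(pop_by_quintile, 100 * pop[quintile == 5] / sum(pop))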
2.4 Analyses
The analyses were performed in two steps. We first focussed on the geographi-
cal distribution of rates (spatial heterogeneity and clusters). To detect clusters
and particular hot spots, we used SaTScan [12]. To see the general spatial
trend in the rates, we used the Geographically Weighted Regression (GWR)
approach [8], with an intercept-only model (using the GWR package [10]), as well as a
Poisson regression model as a function of latitude and longitude (using SAS
[23]). The ‘best’ spatial autocorrelation model was selected from the variogram
(using GS+ [9]) by minimising the residual sum of squares (RSS) criterion. The
second step was to explain the observed heterogeneity by the available covari-
ables through regression models. Consider the general model

g(E[yi]) = Xi β(ui, vi),

where (ui, vi) denotes the coordinates of the ith LHU. If we assume that LHUs
far from each other are more likely to have different coefficients, a weighted
calibration W(ui, vi) is used, which is a kernel-type function of the distance
with a varying bandwidth. The coefficients β are thus estimated by

β̂(ui, vi) = (X′ W(ui, vi) X)⁻¹ X′ W(ui, vi) y,

the usual weighted least squares form of the GWR estimator [8].
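A compact R sketch of this locally weighted estimation for the Gaussian case, using a fixed-bandwidth Gaussian kernel; the coordinates, rates, covariates and bandwidth are simulated or chosen arbitrarily and only illustrate the mechanics:

set.seed(1)
n   <- 50                                  # simulated LHUs
u   <- runif(n); v <- runif(n)             # centroid coordinates
mdi <- runif(n, 10, 40); sdi <- runif(n, 10, 40)
y   <- 0.5 + 0.02 * mdi + 0.01 * sdi + 0.3 * u + rnorm(n, sd = 0.1)

X  <- cbind(1, mdi, sdi)
bw <- 0.2                                  # arbitrary fixed bandwidth

# Local coefficients at each LHU: weighted least squares with Gaussian kernel weights
beta_hat <- t(sapply(seq_len(n), function(i) {
  d <- sqrt((u - u[i])^2 + (v - v[i])^2)     # distances to the ith LHU
  w <- exp(-0.5 * (d / bw)^2)                # kernel weights W(ui, vi)
  solve(t(X) %*% (w * X), t(X) %*% (w * y))  # (X'WX)^(-1) X'Wy
}))
colnames(beta_hat) <- c("intercept", "mdi", "sdi")
head(beta_hat)                             # a surface of coefficients that can be mapped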
3 Results
A total of 50,839 patients met the inclusion criteria and were listed in the
MED-ECHO register between January 1st 1996 and December 31st 1998.
Almost 10% of the individuals (n = 4,749) died during the “index hospital-
ization”. Thirty-three patients were excluded because of their age (< 25 years
old) or because of an error in the code of residence. Furthermore, 1,516 pa-
tients had been hospitalized in the year preceding the index hospitalization
and were also excluded, for a total cohort of 49,290 patients. Among these,
43,854 (89%) were alive 30 days after discharge. At the index hospitalization,
the average age of patients was 66.2 years (± 13.3), and men represented 63.8%
(n = 31,449) of the total cohort. A total of 4,090 early readmissions (9.3%) were
observed. The early readmission rate is higher for men but lower for
older patients (Table 1). Figure 1 shows the 1996 population within the LHUs of
the province of Quebec, while Figs. 2 and 3 show the material and social deprivation
indices, respectively. We can easily see a high population density in the
southern part of the province, as well as along the shores of the St. Lawrence River.
In addition, as measured by the deprivation indices, we can observe that the
urban regions are generally less deprived in the material sense, but the reverse
is observed for the social deprivation index.
Crude ACS incidence and readmission rates are shown in Figs. 4 and 5. The
most likely clusters are highlighted as black lines. In Figs. 6 and 7, however, we
present the smoothed surface estimates of ACS incidence and early readmission
rates with the GWR method (intercept model only; Monte Carlo test for
spatial variability). The parameter estimates range from 0.00086 to 0.01266
(p < 0.0001) for ACS incidence and from 0.08093 to 0.13639 (p = 0.0300)
for readmission rates. There is a North-East trend for ACS incidence rates
while the readmission rates tend to increase as we move away from the urban
centers (see smoothed surfaces) (Figs. 6 and 7).
Poisson regression models of the number of ACS and readmissions as a
function of latitude and longitude showed similar trends (predicted rates not
shown), confirming the observed trend from the GWR analysis. The method
proposed by Kleinschmidt et al. [11] was explored but the semivariogram of
the signed residuals suggested the absence of an autocorrelation structure, so
the process stopped at the first iteration. Nevertheless, to see if the deprivation
indices explain the heterogeneity in the rates, we performed a GWR on the log-
arithm of the rates as a function of these two indices. The parameter estimates
for readmission rates were significant (p < 0.0001) while for ACS incidence
rates only the SDI estimate was significant (p < 0.0001). The residuals were
defined as the ratio of the observed to the expected rates. Maps of surface residuals (Figs. 8
and 9) show that the two indices explain some of the variation but the ob-
served trend in some regions remains unexplained by the deprivation indices.
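Such a global Poisson trend model can be sketched in R as follows; the counts, populations and coordinates are simulated and purely illustrative:

set.seed(2)
lhu <- data.frame(cases = rpois(40, lambda = 30),
                  pop   = round(runif(40, 5e3, 5e4)),
                  lat   = runif(40, 45, 52),
                  lon   = runif(40, -79, -64))

# Log rate modelled as a linear trend in latitude and longitude,
# with the population at risk entering as an offset
trend <- glm(cases ~ lat + lon + offset(log(pop)),
             family = poisson, data = lhu)
summary(trend)$coefficients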
4 Discussion
Spatial analysis, in the sense of analysing data from a geographical perspective,
has to be linked to geomatics and geographical information systems. Within
the context of public health issues, a map displaying geographic heterogeneity
is one of the most powerful tools for interpretation of spatial data once we
determine which surface estimate best portrays the data and which estimation
methods are appropriate for the health parameters of interest. Elliott et al. [6]
reviewed estimation methods such as Kernel-based and GAM-based methods
of the hierarchical setting by using other patient and LHU level covariables
as potential predictors with a hierarchy in the parameters of interest within
the Bayesian framework and in the context of no prior information (flat pri-
ors) or limited information. Another promising avenue for exploration is how
Monte Carlo simulations may be used to inform the choice of prior distri-
butions. In the context of establishing relationships between health events
and covariables (within exploratory or confirmatory spatial data analysis), it
would be quite interesting to combine both Kernel-based methods such as the
GWR approach and random effects models to deal simultaneously with large
scale spatial variability (uncorrelated heterogeneity) and small scale variabil-
ity (spatial autocorrelation). Another important consideration is the difference
between the semivariogram approach to modelling the spatial autocorrelation struc-
ture and conditional autoregressive (CAR) models [15]. As Fotheringham et al.
[8] pointed out, the GWR approach offers an interesting framework by allow-
ing estimation of a continuous surface for each potential correlate which could
usefully be mapped. This approach offers a fertile ground for future study.
Acknowledgements
This project was subsidized by the Network of Centres of Excellence GEOIDE
and Merck Frosst Canada Ltd. At the time of the study, the first author
had the support of the Clinical Research Center of the Centre Hospitalier
Universitaire de Sherbrooke.
References
1 Classification as an Instrument
for Exploratory Analysis
spatial object. The description of this data set could be substantially simplified
if there were a directional trend in the value distribution, for example, an increase of
values from north to south or from the centre of the territory to its
periphery. Alternatively, a shorter description could be derived if the territory
could be divided into a reasonably small number of coherent regions with low
variation of attribute values within the regions. This technique is known as
regionalisation. Unclassed maps are better suited for detecting trends because
they do not hide differences. Classification discards differences between values
within a class interval and gives the corresponding objects a similar appearance
on the map. When these objects are geographical neighbours, they tend to
be visually associated into clusters. This property makes classed maps well
suited for regionalisation. Which of the two ways of simplification turns out to
be possible or more effective in each specific case depends on the data and
not on the preferences of the analyst. Therefore it is necessary to have both
an unclassed choropleth map and a classification tool in order to properly
investigate data with previously unknown characteristics.
It is clear, however, that a single static classed map cannot appropriately
support regionalisation. It is well known in cartography that a different selection
of the number of classes and class breaks may radically change the spatial
pattern perceived from the map [5, 7]. There is no universal recipe for obtaining
an “ideal” classification with understandable class breaks, on the one hand,
and interpretable coherent regions, on the other hand. Therefore when we say
that classification may be used as an instrument of data analysis, we mean
not a classed map by itself but an interactive tool that allows the analyst to
change the classes and to observe immediately the effect on the map.
The exploratory value of classification was recognised in cartography only
relatively recently. Initially classification was regarded as a tool for conveying
specific messages from the map author to map consumers. Thus, the paper
[8] considers various possible intentions of the map designer and demonstrates
how they can be fulfilled through application of different classification methods
and selection of the number of classes.
In the early nineties, Egbert and Slocum developed a software system called
ExploreMap, intended to support the exploration of data with the use of classed
choropleth maps [4]. The most important feature of the system was the
possibility to interactively change the classes. Another implementation of this
function based on direct manipulation techniques is the “dynamic classifica-
tion” tool incorporated in the system CommonGIS [1, 2]. Exploration on the
basis of classification is additionally supported in CommonGIS by the function
of computing statistics for the classes: the range of variation and the average,
median and quartile values of any selected attribute for each class.
In this paper we describe a recently developed extension of the dynamic
classification tool that exploits the properties of the cumulative frequency
curve and generalised cumulative curves. In the next section we define the
relevant notions and explain the use of cumulative curves in classification and
data exploration.
[Figure: cumulative frequency curve for the attribute CARSCAP (N). The horizontal axis ranges from the minimum value of the attribute (0.13) to the maximum (0.63); the break values 0.3 and 0.5 are projected onto the vertical percentage axis, showing the number of objects (districts) with values no more than 0.3 (14% of all objects), the number of objects with values between 0.3 and 0.5, and the number of objects with values no more than 0.5 (72% of all objects).]
It is important that the cumulative frequency curve does not require prior
classification. However, it can represent results of classification by means of
additional graphical elements, and we used this opportunity in the latest ex-
tension of CommonGIS. Thus, the horizontal axis of the graph may be used
to show the class breaks. In the interface adopted in CommonGIS (Fig. 2) we use
for this purpose segmented bars with segments representing the classification
intervals. The segments are painted in the colours of the classes. The positions
of the breaks are projected onto the curve, and the corresponding points of
the curve are, in their turn, projected onto the vertical axis. The division of
the vertical axis is also shown with the use of coloured segmented bars. The
lengths of the segments are proportional to the numbers of objects in the
corresponding classes. With such a construction it becomes easy to compare
the sizes of the classes. For example, the class breaks shown in Fig. 2 divide
the whole set of objects into 3 groups of approximately equal size, as
demonstrated by the equal lengths of the bar segments on the vertical axis.
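The same quantities can be reproduced outside CommonGIS; a short R sketch with invented attribute values shows the cumulative shares that the coloured bars on the vertical axis encode:

set.seed(3)
carscap <- runif(200, 0.13, 0.63)  # invented values of the classified attribute

cf <- ecdf(carscap)                # cumulative frequency curve
breaks <- c(0.3, 0.5)              # current class breaks

# Share of objects below each break, i.e. the heights at which the projected
# break points meet the vertical axis
round(100 * cf(breaks))

# Sizes of the resulting classes (lengths of the coloured bar segments)
table(cut(carscap, c(min(carscap), breaks, max(carscap)), include.lowest = TRUE))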
The overall interface for classification is shown in Fig. 3. Besides the cumu-
lative curve display exposing statistical characteristics of the current classifi-
cation, it includes a map showing the geographical distribution of the classes. For
the use of the cumulative curve as a classification tool it is important that
its display reacts immediately to any changes of the classes, just as the map
does. The user can change class breaks by moving the sliders (double-ended
vertical arrows) along the slider bar (on the upper right of the window). In
the process of moving the slider the map and the cumulative curve graph are
dynamically redrawn. In particular, the relative lengths of the coloured bars
on the axes and the positions of the projection lines change. Clicking on
the slider bar introduces a new class break; bringing a slider close to another
slider removes the corresponding break. The map and the
cumulative curve display immediately reflect all these changes.
Fig. 3. The interface for classification provided in CommonGIS allows the user to
account both for statistical and for spatial distribution of values
Fig. 4. Generalised cumulative curves are built for the attributes “Area” and “Total
population”. The classification is done on the basis of the attribute “Number of cars
per capita”
Fig. 6. The cumulative curve display was used to divide the districts into two classes
with equal total population
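Roughly speaking, a generalised cumulative curve accumulates the total of a second attribute (here the population) over the objects ordered by the classified attribute. Under that reading, a short R sketch with invented districts of finding a break that yields two classes of equal total population:

set.seed(4)
districts <- data.frame(carscap = runif(100, 0.13, 0.63),
                        pop     = round(runif(100, 1e3, 5e4)))

ord     <- order(districts$carscap)
x       <- districts$carscap[ord]
cum_pop <- cumsum(districts$pop[ord]) / sum(districts$pop)  # generalised cumulative curve

# Break value at which the accumulated population first reaches one half,
# i.e. a two-class division with (approximately) equal total population
x[which(cum_pop >= 0.5)[1]]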
4 Conclusion
Combining REmbeddedPostgres and PostGIS
Albrecht Gebhardt
1 Introduction
This work presents some results of combining several pieces of Open Source
software, including R (see [6]), PostgreSQL (see [2]) and some appropriate
extension packages, see [3]. A central building block is REmbeddedPostgres,
an extension of the PostgreSQL RDBMS which basically makes it possible
to use R syntax within SQL queries. Starting from the work of Duncan Temple Lang
[4], the idea arose to implement more complex statistical
database queries. Finally the focus was put on queries involving spatial
data. At this point PostGIS comes into action. It is another PostgreSQL extension
and implements OpenGIS functions. This extension enables PostgreSQL
to process spatially referenced data.
An implementation of a “linear model” query will be shown. It involves
several modifications of REmbeddedPostgres and needs some extra SQL func-
tions written in R and Perl. This can easily be combined with OpenGIS SQL
functions. As a result a “spatial statistical SQL query” becomes possible.
GIS, database systems and statistical software fulfill different tasks in the
analysis of spatial data. GIS are used for collecting and editing data, genera-
tion of maps, transformation of and operations with maps. Database systems
hold data, are used for indexing and selection by queries and can combine dif-
ferent portions of data via joins between tables. Finally the tasks of statistical
software are exploratory analysis, modelling and estimation.
Several combinations of GIS, database systems (DBMS) and statistics
software are possible:
1 http://www.r-project.org/
2 http://www.postgresql.org/
[Figure: architecture combining GRASS (v. 5.1), PostgreSQL with the OpenGIS/PostGIS and REmbeddedPostgres extensions, and R accessed via library(RPgSQL) and library(GRASS).]
3 PostgreSQL
3 http://grass.itc.it
4 PostGIS
5 REmbeddedPostgres
REmbeddedPostgres is part of the omegahat project. The development of
REmbeddedPostgres was driven mainly by two ideas. Running an R interpreter
within the database saves much data transmission time because computation
takes place at the server and not at a database client. Additionally
4 http://www.opengis.org/techno/specs/99-049.pdf
5 http://www.opengis.org/
6 http://postgis.refractions.net/docs/
7 http://www.omegahat.org/RSPostgres/
8 http://www.omegahat.org
6 Combination
7 Technical Details
The idea is to extend REmbeddedPostgres in the following way to be able to
handle user-defined data types:
• Create user-defined types (needs input and output functions written in C
for PostgreSQL internal use).
• Write type conversion routines similar to the conversion routines for float8
already contained in REmbeddedPostgres.
• Introduce an additional table repg_utypes which contains the type OID,
type name, the names of to/from R converter functions and pathname of
a shared object file containing the shared library code which implements
the converter functions.
It is also necessary to add a “user type registration function” to the R initializa-
tion in REmbeddedPostgres which reads the table repg_utypes, dynamically
loads the shared library, accesses the converter routines (via the dlopen call)
and adds registration info to the converter routines structure.
A first prototype of this approach is working. It reads the table repg_utypes
within the C code by connecting back to the database engine using the libpq
C interface.
Current development focuses on improving the above-mentioned REmbeddedPostgres
extension. This makes use of PL/PERL, another PostgreSQL
extension module, which implements an internal interface to the Perl scripting
language. Combining PL/R and PL/PERL can help in simplifying the
notation of more complex PL/R queries.
The following problem arises: user data types have to be of a fixed length.
That means that different types for several vector dimensions, together with the
corresponding converter functions, have to be written. This leads to a more complex syntax
of the PL/R queries, because calls to the type conversion routines have to
be added. Using PL/PERL can help here to hide these details from the user:
Perl scripts analyze the arguments and construct the PL/R queries. The
appropriate vectorized query can then be executed by using the DBD::PgSPI
Perl module for internal database access from PL/PERL (see [1]).
Currently we can apply, e.g., R's linear model function lm() to a spatial
subset selected by means of OpenGIS functions in two steps: first creating
an SQL view and then applying the PL/PERL function lm to this view. The
PL/PERL function parses the linear model formula given in
S notation, calls the appropriate type conversion routines and finally
executes the PL/R aggregate function containing the call to the R function
lm().
The result of the following query would be the estimated parameter vector
θ of a linear model z = θ0 + θ1 x + θ2 y + ε.
CREATE VIEW spatial_view AS
  SELECT X(geopoint), Y(geopoint), z FROM table
  WHERE Distance(GeomFromText('POINT(x0 y0)', SRID(geopoint)), geopoint) < r;

SELECT lm('z~x+y', 'spatial_view');
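For comparison, the model fitted by the PL/R aggregate corresponds to an ordinary client-side call to lm(); a sketch assuming the columns x, y and z of the view have already been fetched into an R data frame (the values below are simulated):

set.seed(5)
spatial_view <- data.frame(x = runif(30), y = runif(30))
spatial_view$z <- 1.5 + 2.0 * spatial_view$x - 0.5 * spatial_view$y + rnorm(30, sd = 0.1)

# Equivalent of SELECT lm('z~x+y','spatial_view'): estimates of theta0, theta1, theta2
coef(lm(z ~ x + y, data = spatial_view))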
9 http://www.perl.org/
10 http://jamesthornton.com/postgres/7.3/postgres/plperl-database.html
1. A. Descartes and T. Bunce (2000) Programming the Perl DBI. O’Reilly &
Associates, Inc.
2. J. Hartwig (2001) PostgreSQL Professionell und praxisnah. Addison-Wesley.
3. K. Hörhan (2002) Integration von Statistiksoftware in Datenbanksystemen und
deren Anwendung in der räumlichen Statistik. Thesis, University of Klagenfurt
4. D. T. Lang (2001) Scenarios for using R within a relational database manage-
ment system server. Technical report, Bell Labs.
5. Open GIS Consortium Inc. (1999) The OpenGIS abstract specification. http://www.opengis.org/techno/abstract.htm
6. R Development Core Team (2004) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Index