UK - ONS - Small Area Estimation
Contact points
For enquiries about this publication, contact the
Editor, Philip Clarke:
Tel: 020 7533 6241
E-mail: philip.clarke@ons.gov.uk
Contents
Acknowledgements
Executive summary
3 Results
3.1 Introduction
3.2 Covariate definitions
3.3 Specific estimates
3.4 Summary of results
Contents Small Area Estimation Project Report January 2003
7 Data
7.1 The data requirements
7.2 Construction of the database
9 Appendices
Appendix A: model procedures
Appendix B: bibliography
Appendix C: reports and publications
Appendix D: diagnostic plots
Acknowledgements
This is the second report in the Model-Based Small Area Estimation Series. We
would like to thank Professor Ray Chambers and his colleagues at Southampton
University Department of Social Statistics for their comments and suggestions on
an earlier draft. We are also grateful to Professors David Firth, Harvey Goldstein and
Danny Pfeffermann, and to our colleague Paul Smith, for reading and commenting
on the final draft of this report – and in particular for their stimulating suggestions
regarding further research. We have drawn extensively on these suggestions, and on
those made by Professor Chambers, in planning our future research programme. The
final responsibility for this report, and any errors it contains, is of course our own.
Executive Summary
Small area estimation of variables studied in social surveys is a growing need for
government. This interest is for areas of varying sizes but often at the level of
political wards – roughly 2000 households. The basic problem is that surveys are not
designed for this kind of estimation since they are optimised for efficient estimation
at national level.
Methods and Quality Division (now Statistical Methodology Division) of the Office
for National Statistics established the Small Area Estimation Project (SAEP) in April
1998. The aim of the project was to develop a generalised statistical methodology
and an operational system for deriving estimates to known precision from variables
contained in social surveys, for areas defined by a variety of boundary systems.
This report summarises this statistical methodology, presents results obtained from
the methodology and discusses the areas where further research is required.
The small area estimation methodology has been used to produce a number of small
area estimates. Chapter 3 summarises the results of modelling ten variables. These
variables can be approximately described as follows: household income from the FRS
survey, household income from the GHS survey, a measure of social capital, children
from ethnic minorities, number of people to help in a crisis, one parent families,
overcrowded living conditions and three measures of poor health. Final models
and maps of the resulting estimates are presented along with diagrams relating the
estimates to their confidence intervals.
In order to assess the appropriateness of the models, the estimates and their confidence
intervals, a number of diagnostic tests and additional investigations have been
performed as part of the SAEP. These are presented in Chapters 4 and 5. They
address problems such as calibration, the addition of age/sex covariates, modelling
and estimating at different geographies and various factors influencing the calculation
of confidence intervals.
Chapter 6 draws on the evidence so far to assess whether the estimates produced by
the SAEP methodology are good enough to be published. It concludes that they are
– provided the confidence intervals are not too wide and proper diagnostic tests have
been done.
In order to carry out the small area estimation analysis a GIS database had to be
developed that involved survey variables, administrative and census data (for use as
covariates) with all variables matched to a set of digital area boundaries. Chapter 7
describes the construction and individual components of that database.
Finally, possible areas for improvement in the SAEP methodology and further
necessary research are discussed.
1 Context and Motivation
1.1 Introduction
Small area estimation of variables studied in social surveys is a growing need for
government, principally for the establishment of better directed resource allocation for
problems of bad health, bad housing conditions, unemployment and low pay. This
interest is for areas of varying sizes but often at the level of political ward – roughly
2,000 households. The variable can be either continuous (e.g. household income)
in which case interest is in the area mean or discrete (e.g. suffering from long term
limiting illness or not) in which case interest is in the area proportion.
First, size: the survey samples are simply far too small, broadly seven to fifteen
thousand households per year in most surveys, which would result in insufficiently
precise estimates at any level below region. The Labour Force Survey (LFS) is much
larger, with 60,000 households per quarter or, in an annual dataset, 96,000 independent
households. Even so, it is only for a small number of the larger local authorities that
the local LFS sample is large enough for direct estimation. More information on the
LFS can be found in the LFS user guide (ONS 1999).
Secondly, there is the problem of sample design: with the major exception of the LFS,
all the principal national household surveys have clustered designs. This means that
the sample is not randomly distributed nationally, but that certain areas are first
selected as primary sampling units (PSUs) and then households are selected for
interview only from these. The sample frame is a Post Office list of all addresses
called the Postcode Address File (PAF), which is arranged by postcode. The areas
selected as PSUs are postcode sectors: groups of addresses with postcodes differing
only in the last two characters. They constitute bounded areas (full postcodes
themselves are linear routes); e.g. N22 7AW falls within postcode sector N22 7
together with all other codes N22 7xx. The size of postcode sectors is of the order of
2,000 households and there are around 9,000 of them in England and Wales. Typically
in the General Household Survey (GHS) around 3% of postcode sectors are selected as
PSUs. The selection of these is stratified in such a way that their distribution is
nationally representative. The problem for small area estimation is that, irrespective
of the total sample size, with clustering like this the inevitable result for areas the
size of wards is that the vast majority will contain no sample respondents at all.
Hence no direct survey estimate would be possible.
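The scale of the clustering problem can be seen from rough arithmetic on the figures quoted above (around 9,000 postcode sectors, around 3% of them selected as PSUs). The sketch below is illustrative only and uses those rounded figures; the variable names are invented for the example.

```python
# Rough arithmetic on clustered sample coverage, using the rounded figures in the text.
N_SECTORS = 9_000      # approximate number of postcode sectors in England and Wales
PSU_FRACTION = 0.03    # approximate share of sectors selected as PSUs in the GHS

selected_sectors = round(N_SECTORS * PSU_FRACTION)
unsampled_sectors = N_SECTORS - selected_sectors
unsampled_share = unsampled_sectors / N_SECTORS

print(f"Sectors selected as PSUs: {selected_sectors}")
print(f"Sectors with no respondents: {unsampled_sectors} ({unsampled_share:.0%})")
```

On these figures only a few hundred sectors contain any respondents, so areas of similar size, such as wards, are overwhelmingly unsampled however large the total sample.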
Such a technique can be used within the same framework as direct survey estimation if
each estimation area contains a sample and if the same variable as that available at area
level as an auxiliary variable is also measured in the survey. In this case a direct survey
estimate can be calculated and the auxiliary information is used to adjust this estimate
on the basis of the difference between the known area value and the survey’s estimate.
These conditions though are not fulfilled in most British surveys. However, if one
decides to base the estimate on the area-level relationship between the survey and
auxiliary variables, then this relationship can be fitted by regressing individual survey
responses on area values of the covariates. The model once fitted describes the
relationship between the area values of the target survey variable and of the
covariate(s). Simply substituting the known values of the area covariate(s) into the
fitted model can produce estimates of the target variable for specific areas.
While the model has been constructed only on responses from sampled areas, it is
assumed to apply nationally. Thus, as administrative and census covariates are known
for all areas, not just those sampled, the fitted model can be used to obtain estimates
and confidence intervals for all areas. This is the basis of the model-based or synthetic
estimation that ONS has used in its development of small area estimation.
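The procedure described above, regressing individual survey responses on area-level covariates and then substituting the known covariate value of every area into the fitted model, can be sketched as follows. This is a minimal illustration, not the ONS implementation: the data are invented, there is a single covariate, and ordinary least squares stands in for the multilevel model that the report actually uses.

```python
# Minimal sketch of regression synthetic estimation with one area-level covariate.
# All data values are invented; ordinary least squares is an assumption made for
# simplicity and stands in for the report's multilevel model.

def fit_line(x, y):
    """Least-squares intercept and slope for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Known covariate value for every area (hypothetical, e.g. a census proportion).
area_covariate = {"A": 0.10, "B": 0.25, "C": 0.40, "D": 0.55}

# Survey responses exist only for the sampled areas A and C.
survey = [("A", 310.0), ("A", 290.0), ("C", 240.0), ("C", 220.0)]

# Each individual response is paired with the covariate value of its area.
x = [area_covariate[area] for area, _ in survey]
y = [value for _, value in survey]
alpha_hat, beta_hat = fit_line(x, y)

# Substitute each area's known covariate into the fitted model,
# including areas B and D, which have no respondents.
synthetic = {area: alpha_hat + beta_hat * xv for area, xv in area_covariate.items()}
```

Areas B and D receive estimates even though nobody in them was interviewed, which is the point of the approach; the price is that those estimates rest entirely on the model applying to unsampled areas.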
The psychiatric morbidity study examined the CIS-R score, a questionnaire-based
measure assessing psychiatric morbidity on a scale from zero, indicating perfect
mental health, to about 50, indicating very severe problems. The estimation areas
were postcode sectors and the quantities for which estimation was required were the
mean score for residents of the areas and the proportion of residents with a score of
12 or more. A report was produced in October 1996 (Heady & Ruddock 1996). The
outcome was that the fitted models showed high levels of statistical significance and
that the estimates of overall geographic distribution of problems were a substantial
improvement on assumptions of equal morbidity overall or within age/sex bands.
This report summarises the research and progress of the SAEP from 1998 to 2001.
Both the theory and the results are presented, and recommendations are made for
further work.
The task of the SAEP was to develop a generalised statistical methodology and
an operational system for deriving estimates to known precision from variables
contained in social surveys, for areas defined by a variety of boundary systems. It
concentrated on ‘point in time’ estimation of the kind carried out in the psychiatric
morbidity study.
As well as estimates for each area of interest, confidence intervals for each estimate
are presented. A discussion of the appropriateness of confidence intervals is contained
in Chapter 4. A set of diagnostic procedures to be applied following model fitting has
been defined (Chapter 5). These involve the use of plots and maps to check for
randomness.
There were a number of statistical issues to address. One of the main ones related to
the nature of the estimation areas. The original study of psychiatric morbidity had
used postcode sectors. This is statistically the simplest choice as it corresponds to the
PSUs of the sample designs. However, most clients are interested in estimates for
administrative areas such as wards. This can cause some difficulties in the estimation
of confidence intervals, which are described in Chapter 4.
However, SUPCOM went further as it enabled the problem of small area estimation
to be addressed under two different national statistical systems and a study made of
the effects of non-response to surveys and of interviewer variance.
The final report was submitted to Eurostat in March 2000 (ONS et al 2000).
The data from Finland has also been useful in further research looking in particular
at the use of individual covariates as well as or instead of area covariates in models.
This work led to a study of the potential impact of the ecological fallacy on small
area estimation. Two members of the team have presented this as a paper to an RSS
meeting. (An earlier study of this issue is contained in a paper published by the SAEP
team in Statistics in Transition; see Heady & Hennell 2000.)
2 Basic Ideas Behind the SAEP System
2.1 Introduction
As Chapter 1 explained, direct sample-based estimators – of the kind which surveys
typically produce at national and regional level – become progressively less precise
as the size of area decreases. As we continue on down to ever-smaller area sizes, there
always comes a point at which direct sample-based estimation becomes impossible
– because there are no sample respondents in most of the small areas concerned.
As a result the estimates have to be constructed in a different way. There are in fact
several different ways in which they might be constructed – but this report does not
consider them all. Instead it considers a series of variations on a common approach
– which might be described as regression synthetic estimation fitted using area-level
covariates. As that is a bit of a mouthful, we will often refer to the method below
simply as synthetic estimation.
The basic idea is that we construct a statistical model relating the observed value
of the variable of interest – measured at individual, household or address-level – to
“covariates” (a.k.a. “auxiliary variables”) that relate to the small area in which the
address is located. These covariates are generally average values or proportions
relating to all individuals or households in the area. The covariates are generally
obtained from sources covering the whole population – such as the census or the
social security system. Once the model has been fitted, it can be applied to these
known local values – in order to predict the average local value for the target variable.
Under the model-based approach randomness is not introduced via the sampling
process but assumed instead to be a feature of the underlying reality. This reality is
composed most fundamentally not of the actual characteristics of the populations of
particular areas but of underlying tendencies – which can never be directly observed
(though they can be estimated given appropriate assumptions and suitable sample
and covariate data). Actually observed data – whether in the form of a sample, or
of a complete census of the area concerned – is seen as a product of the underlying
tendency and of certain random processes which mean that actually observed
situations always depart to some extent from the underlying tendency.
These models do not, of course, attempt to describe the whole of reality – but only
those parts of it that are relevant to the relation between the evidence available to us
as statisticians and the quantity which we are trying to estimate. As we remarked in
the previous section, this evidence is of two kinds – data from a sample survey, and
auxiliary data relating to the population as a whole.
where the bar above indicates an average, the hat indicates an estimate and the
subscript j indicates the area. For example, if Y represented household income and
we wanted ward-level estimates, then $\hat{\bar{Y}}_j$ would be the estimate of the average
household income in ward(j).
The model is represented below (the formula includes only one covariate, denoted
as X, for simplicity). This relationship is usually written with individual-level
covariates $X_{ij}$, but in our case we use area-level covariates, so we write the basic
multilevel model as:

$$y_{ij} = \alpha + \beta \bar{X}_j + u_j + e_{ij} \qquad (2)$$

where:
■ $\alpha$ and $\beta$ are the regression parameters for intercept and slope respectively
■ $u_j$ is a random area-level term that is assumed to have expectation 0 and variance $\sigma_u^2$
■ $e_{ij}$ is a random individual-level term with expectation 0 and variance $\sigma_e^2$.
A couple of points call for further explanation. In equation (2) the covariate $\bar{X}_j$ is
measured at area level. In principle there is no reason why it should not be specified
as $x_{ij}$ and measured at individual level. The reason for specifying the model as we
have is that the way in which data is currently held within ONS does not enable us to
link survey and covariate data at individual or address level. The auxiliary data in
our models is therefore restricted to area-level means and proportions.
The other point relates to the meaning of the two random terms in the model: $u_j$ and
$e_{ij}$. If we just looked at the ‘fixed’ part of the model, $\alpha + \beta \bar{X}_j$, it would give us an
underlying expected value for individuals living in any area for which the covariate
value was equal to $\bar{X}_j$. The term $u_j$ allows the underlying central value of Y in
area(j) itself to differ from this more general value. The term $e_{ij}$ expresses the
difference between this underlying area-specific value and the actual values for
particular individuals or addresses.
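The roles of the fixed part and the two random terms can be made concrete by simulating one area from the model in equation (2). The sketch below is illustrative only: the parameter values are hypothetical, and normal distributions are assumed for both random terms, whereas the model as stated specifies only their expectations and variances.

```python
import random

random.seed(1)

ALPHA, BETA = 300.0, -200.0      # hypothetical fixed-effect parameters
SIGMA_U, SIGMA_E = 10.0, 40.0    # hypothetical standard deviations of u_j and e_ij

def simulate_area(x_bar_j, n_people):
    """Draw y_ij = alpha + beta * x_bar_j + u_j + e_ij for every unit in one area."""
    fixed_part = ALPHA + BETA * x_bar_j   # shared by all areas with this covariate value
    u_j = random.gauss(0.0, SIGMA_U)      # drawn once: the area's own departure
    y = [fixed_part + u_j + random.gauss(0.0, SIGMA_E) for _ in range(n_people)]
    return fixed_part, u_j, y

fixed_part, u_j, y = simulate_area(x_bar_j=0.25, n_people=2000)
area_mean = sum(y) / len(y)

print(f"fixed part: {fixed_part:.1f}, area effect u_j: {u_j:.1f}, area mean: {area_mean:.1f}")
```

With 2,000 units the individual-level terms largely average out, so the simulated area mean sits close to the fixed part plus the area effect rather than close to the fixed part alone.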
these estimated parameter values – $\hat{\alpha}$, $\hat{\beta}$, $\hat{\sigma}_u^2$ and $\hat{\sigma}_e^2$ – can be used to derive
area-specific estimates and confidence intervals. To see how these parameter estimates can
help with the estimation process it may be useful to express the real area mean $\bar{Y}_j$ in
terms of the random and fixed effects assumed by the model. Equation (3) shows how
the true mean and the model effects are related.

$$\bar{Y}_j = \alpha + \beta \bar{X}_j + u_j + \bar{e}_j \qquad (3)$$

where $\bar{e}_j$ is the average value of the $e_{ij}$ terms for the entire population of area(j).
which, as formula (4) shows, is formed by applying estimates of the fixed effect
parameters to the value of the auxiliary variable for the area concerned. As such it
estimates the underlying expected value of $\bar{Y}$ for any area whose covariate value
equals $\bar{X}_j$, rather than the specific value of $\bar{Y}$ for area(j) itself.
This raises the question of why we have not included any allowance for the two
random effect terms which complete equation (3) and make the difference between
the underlying value just referred to and the specific value of $\bar{Y}$ in area(j). The reason
is that it would be necessary to have sample data from area(j) itself in order to do so.
But, because of the clustered sample designs of the surveys that we are working with,
we usually have direct sample data for only a small proportion of the areas in which
we are interested – and so, for practical purposes, the synthetic estimator is the best
that we can do.
[If we did have sample data for the area concerned, it would be possible to use it, in
conjunction with the parameter estimates $\hat{\sigma}_u^2$ and $\hat{\sigma}_e^2$, to provide an estimate of $u_j$.
For details, see Appendix A. In order to go further and estimate $\bar{e}_j$ one would need a
sample that included a substantial proportion of the population of area(j), which
would not be realistic in survey terms. However, this may not matter much since, as $\bar{e}_j$
is an average value for the whole population of area(j), its value is likely to be small.]
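One common way of using an area's own sample to estimate $u_j$ is the shrinkage predictor for a random-intercept model, sketched below. This is an assumption made for illustration: the report's own procedure is set out in Appendix A and may differ in detail, and all function names and numerical values here are hypothetical.

```python
# Shrinkage estimate of the area effect u_j for a random-intercept model.
# This standard formula is an assumption; the report's procedure is in Appendix A.

def estimate_u(sample_values, x_bar_j, alpha_hat, beta_hat, sigma_u2, sigma_e2):
    """Shrink the area's mean residual towards zero according to its sample size."""
    n_j = len(sample_values)
    sample_mean = sum(sample_values) / n_j
    synthetic = alpha_hat + beta_hat * x_bar_j
    # Weight on the area's own data: close to 1 for large n_j, close to 0 for small n_j.
    shrinkage = sigma_u2 / (sigma_u2 + sigma_e2 / n_j)
    return shrinkage * (sample_mean - synthetic)

u_hat = estimate_u(
    sample_values=[280.0, 300.0, 260.0, 320.0],   # invented responses from area(j)
    x_bar_j=0.25, alpha_hat=300.0, beta_hat=-200.0,
    sigma_u2=100.0, sigma_e2=1600.0,
)
```

With only four respondents and a large individual-level variance, just a fifth of the observed gap between the sample mean and the synthetic value is attributed to the area effect.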
In the case of the synthetic estimator we need to allow for two sources of variability.
One of these is the sampling variability of the synthetic estimator around the
underlying expected value $\alpha + \beta \bar{X}_j$. But we also need to remember that,
according to the model, the unknown real value of $\bar{Y}_j$ is itself a random variable that
varies around this underlying expected value. The variance of $\bar{Y}_j$ is given by

$$\sigma_u^2 + \mathrm{var}(\bar{e}_j) = \sigma_u^2 + \frac{\sigma_e^2}{N_j} \qquad (5)$$
where the first term on the right-hand side of the equation represents the real
variance of particular areas around the underlying expected values and the second
term represents the sampling variance of the synthetic estimator. In the examples
that follow, $\sigma_u^2$ is usually the main component of the MSE.
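The two components just described can be combined into an approximate confidence interval. The sketch below is a hedged illustration, not the report's procedure (which is discussed in Chapter 4 and Appendix A): it assumes a single covariate, takes the sampling variance of the synthetic estimator from the usual variance of a linear combination of $\hat{\alpha}$ and $\hat{\beta}$, and uses a normal multiplier of 1.96. All numerical inputs are hypothetical.

```python
import math

# Approximate MSE and 95% interval for a synthetic estimate with one covariate.
# The 1.96 multiplier and every numerical input are assumptions for illustration.

def synthetic_interval(x_bar_j, alpha_hat, beta_hat, sigma_u2,
                       var_alpha, var_beta, cov_alpha_beta, z=1.96):
    estimate = alpha_hat + beta_hat * x_bar_j
    # Sampling variance of alpha_hat + beta_hat * x_bar_j.
    sampling_var = var_alpha + x_bar_j ** 2 * var_beta + 2 * x_bar_j * cov_alpha_beta
    # Between-area variance plus sampling variance of the synthetic estimator.
    mse = sigma_u2 + sampling_var
    half_width = z * math.sqrt(mse)
    return estimate, mse, (estimate - half_width, estimate + half_width)

estimate, mse, interval = synthetic_interval(
    x_bar_j=0.25, alpha_hat=300.0, beta_hat=-200.0, sigma_u2=100.0,
    var_alpha=9.0, var_beta=64.0, cov_alpha_beta=-20.0,
)
```

In this invented example the between-area variance contributes 100 of the total 103, which mirrors the remark that $\sigma_u^2$ is usually the main component.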
■ design-based methods are those in which the estimators are justified in terms of
the structure of the randomised sample design;
■ model-based methods are those in which the estimators are justified in terms of a
model of the inherent probabilistic structure underlying the population itself.
In principle the properties of any particular estimator can be assessed from either a
design-based or a model-based perspective – with potentially different conclusions.
There is a good deal of literature on the differences between the two approaches (e.g.
Särndal 1984), which interested readers may consult. In the paragraphs which follow,
we will discuss the properties of the particular model-based approach described
above – and point out ways in which these may differ from those of more familiar
design-based estimators.
2.3.2 Equal probability selection, weighting and bias – and how they affect the
interpretation of the estimates
The simplest forms of design-based estimation are those for sample designs in which
each member of the population has an equal chance of selection, and each selected
respondent agrees to take part. In this case it is relatively straightforward to construct
unbiased – or nearly unbiased – estimators. In order to obtain design-unbiased
estimates when selection probabilities are unequal it is necessary to weight the data
to compensate for the unequal selection probabilities. Non-response is usually dealt
with in a similar way: if the probability of responding is related to some variable
whose value is known for both responders and non-responders, the probability of
responding can be estimated. Subject to some assumptions, different response
probabilities can then be allowed for by weighting, in the same way as different
selection probabilities.
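Weighting of this kind can be sketched as a weighted mean in which each respondent's weight is the reciprocal of the probability of being selected and of responding. The example is illustrative only: the values and probabilities are invented, the function name is hypothetical, and the ratio form used here (dividing by the sum of the weights) is one common choice that is nearly, rather than exactly, design-unbiased.

```python
# Inverse-probability weighting for unequal selection and response probabilities.
# Data and probabilities are invented for illustration.

def weighted_mean(values, selection_probs, response_probs=None):
    """Weighted mean with weights 1 / (selection probability * response probability)."""
    if response_probs is None:
        response_probs = [1.0] * len(values)   # everyone selected responds
    weights = [1.0 / (p_sel * p_resp)
               for p_sel, p_resp in zip(selection_probs, response_probs)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

values = [100.0, 200.0, 400.0]
selection_probs = [0.5, 0.25, 0.25]

estimate = weighted_mean(values, selection_probs)
estimate_with_nonresponse = weighted_mean(values, selection_probs, [1.0, 1.0, 0.5])
```

Units that were less likely to be selected, or less likely to respond, stand in for more of the population and so pull the estimate further towards their own values.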
common for model-based analyses to use data from randomised samples, and to
attempt some kind of weighting to allow for different selection and response
probabilities. However, there is no general agreement about how best to weight
for small area estimation and we have not researched the question in this report
(as explained below one of the analyses was re-weighted for unequal selection
probabilities and none of them were re-weighted for non-response).
The idea of bias means different things in the design- and model-based perspectives.
An estimator is design-unbiased for a particular area if its expected value, given the
sample design, is equal to the real-life value for the area concerned. Estimators based
on the samples taken in the area concerned are generally design-unbiased provided
that the data has been appropriately weighted. However, the synthetic estimator
defined in equation (4) is not design-unbiased, since what it estimates is the
underlying expected value for any area with the same covariate values, not the real
value for the area in question.
Too much should not be made of this property of model-unbiasedness – most of our
estimators (those relating to proportions) will in fact be slightly model-biased. The
important thing to remember is that, in the everyday sense of the term, the synthetic
estimate is biased because it relates not to the specific value of the area in question
but to the average value for similar areas – always assuming that the model is true. This
means that you cannot use synthetic estimates to evaluate the impact of special local
factors that have not been included in the model – such as, for instance, a particular
policy adopted by the local authority.
As with bias, so with variance estimation – if one takes the model as a literal
representation of the relevant aspects of reality there is no need to insist on random
sampling. However, if one has doubts about this, then it is a good idea to use
representative random sampling methods. The reason for this is that the models
used in this report base their estimates of between- and within-area variance on the
assumption that the $u_j$ terms for different areas are mutually independent, as are
the $e_{ij}$ terms for different individuals/units within the same area. This may well not
be true in the real world – where, for instance, the richer inhabitants of a local area
might well live at one end of the district, while the poorer residents cluster together
at the other. In this situation, random sampling would ensure that the values entered
into the analysis were indeed independent – and that the resulting variance estimates
behaved in the expected way.
Because the model ignores patterns of spatial covariance, it leaves open two
possibilities that we need to guard against in some other way – even though the
national proportion of small areas whose true value lies within the confidence
interval is correct. The first of these is that the underestimated areas may cluster
together in some parts of the country, while the over-estimated areas cluster together
in other regions. The second of these is that if the model is fitted for one kind of
geographical unit – say wards – the estimated relationship between the target
variable and the auxiliary variables may well not hold for other kinds of areas –
such as constituencies and local authorities. This is an instance of the well-known
Modifiable Areal Unit Problem (MAUP). We will return to these problems later in
this report.
It is worth noticing that none of these spatial biases arise with design-based
estimation, since, provided the selection probabilities and weights are right, the
expected values of the estimates correspond to the real local values at all area levels
– even though the estimators become very unstable at local levels (which is why we
need to use model-based estimators in the first place).
This brings us to another problem, namely how we should handle the potential
contradictions between design-based and model-based estimates. The two sets of
estimates can differ even at national level, but as we have seen, the differences are
particularly important at regional and sub-regional level. As well as the fact that both
sets of estimates have strengths that one might want somehow to combine, there is
the fact that the administrators who use our figures have a legitimate requirement
for consistent figures – and inconsistencies between model-based small area figures,
and design-based figures at region and above are hard to justify in a practical context.
One suggestion, which we shall consider further in a later chapter, is that model-
based small area estimates should be calibrated to the design-based estimates for
regions – i.e. adjusted so that the two sets of regional totals are the same.
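One simple way of making the two sets of regional totals agree is to scale every small area estimate in a region by the ratio of the design-based regional total to the sum of the model-based estimates for that region. The sketch below shows this ratio adjustment as an assumed approach for illustration; it is not necessarily the calibration method examined later in the report, and all names and figures are hypothetical.

```python
# Ratio calibration of model-based small area totals to design-based regional totals.
# The method and all figures are assumptions for illustration.

def calibrate(small_area_totals, region_of, design_region_totals):
    """Scale small area totals so that each region sums to its design-based total."""
    model_region_totals = {}
    for area, total in small_area_totals.items():
        region = region_of[area]
        model_region_totals[region] = model_region_totals.get(region, 0.0) + total
    calibrated = {}
    for area, total in small_area_totals.items():
        region = region_of[area]
        factor = design_region_totals[region] / model_region_totals[region]
        calibrated[area] = total * factor
    return calibrated

small_area_totals = {"w1": 40.0, "w2": 60.0, "w3": 50.0}     # model-based estimates
region_of = {"w1": "North", "w2": "North", "w3": "South"}
design_region_totals = {"North": 110.0, "South": 45.0}       # direct survey estimates

calibrated = calibrate(small_area_totals, region_of, design_region_totals)
```

The adjustment preserves the relative pattern of estimates within each region while forcing agreement at regional level; it also carries the sampling error of the design-based regional totals into the small area figures.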
The choice of auxiliary variables plays a particularly important role. Adding a new
covariate to the model alters both the value of the synthetic estimator and the
estimate of the between-area variance $\hat{\sigma}_u^2$. It may also change the form of the
relationship between the target variable and the covariates – and the new variable
may need to be transformed in order to keep this relationship in a tractable form.
The procedure that we used to decide which covariates to include in the model is
described in Appendix A. The diagnostic checks that we used to judge whether the
form of the models was appropriate are described in later chapters, and a full set of
diagnostic results is given in Appendix D.
However, even with the best methods of model selection there remains a problem
– since the selection is partly based on sample data, there is always a possibility that
the selected model is not the best one. The confidence intervals produced by model-
based methods allow for errors in estimating parameter values – given that the form
of the model, and the choice of covariates, is correct. There is always the possibility
that they are not – which leaves an additional margin of uncertainty, not allowed for
in the confidence intervals.