January 2003

Model-Based Small Area Estimation Series No. 2

Small Area Estimation Project Report

Patrick Heady
Philip Clarke
Gary Brown
Kerry Ellis
Dick Heasman
Sarah Hennell
Jane Longhurst
Bruce Mitchell

Office for National Statistics


© Crown copyright 2003.
Published with the permission of the Controller of
Her Majesty’s Stationery Office (HMSO).

ISBN 1 85774 539 6

A National Statistics publication
National Statistics are produced to high professional
standards set out in the National Statistics Code of
Practice. They undergo regular quality assurance
reviews to ensure that they meet customer needs.
They are produced free from any political interference.

Applications for reproduction should be submitted
to HMSO under HMSO’s Class Licence:
www.clickanduse.hmso.uk

Alternatively applications can be made in writing to:

HMSO Licensing Division
St. Clement’s House
2–16 Colegate
Norwich, NR3 1BQ

Contact points
For enquiries about this publication, contact the
Editor, Philip Clarke:
Tel: 020 7533 6241
E-mail: philip.clarke@ons.gov.uk

For general enquiries, contact the National
Statistics Public Enquiry Service on 0845 601 3034
(minicom: 01633 812399)
E-mail: info@statistics.gov.uk
Fax: 01633 652747

You can also find National Statistics on the internet
– go to www.statistics.gov.uk.

About the Office for National Statistics

The Office for National Statistics (ONS) is the
government agency responsible for compiling,
analysing and disseminating many of the United
Kingdom’s economic, social and demographic
statistics, including the retail prices index, trade
figures and labour market data, as well as the
periodic census of the population and health
statistics. The Director of ONS is also the National
Statistician and the Registrar General for England
and Wales, and the agency administers the
registration of births, marriages and deaths there.

Contents

Acknowledgements

Executive summary

1 Context and motivation
1.1 Introduction
1.2 Outline of the problem
1.3 Technique of synthetic estimation
1.4 Initial Methods and Quality work
1.5 Academic work
1.6 Small area estimation project – SAEP
1.7 SUPCOM and other research work
1.8 Data management and interface system

2 Basic ideas behind the SAEP system
2.1 Introduction
2.2 Model assumptions and estimation
2.3 Comparison of properties of synthetic estimates with those of direct sample estimates
2.4 Assessing the quality of the estimates

3 Results
3.1 Introduction
3.2 Covariate definitions
3.3 Specific estimates
3.4 Summary of results

4 Discussion of the confidence intervals
4.1 Introduction
4.2 Ward-postcode sector problem
4.3 Other departures from randomness
4.4 Interviewer variance
4.5 Overall quality of confidence intervals

5 Discussion of the estimates
5.1 Introduction
5.2 Assessing the form of the model
5.3 Model content: age/sex covariates
5.4 Allowing for spatially correlated errors
5.5 Overall conclusion

6 General discussion – are the synthetic estimates good enough to publish?
6.1 Introduction
6.2 Specific issues
6.3 Recommendation

7 Data
7.1 The data requirements
7.2 Construction of the database

8 Possible improvements to the SAEP system
8.1 Introduction
8.2 Possible changes in ONS data systems
8.3 Further research which would enhance the SAEP system
8.4 Research that would take us beyond the SAEP system

9 Appendices
Appendix A: model procedures
Appendix B: bibliography
Appendix C: reports and publications
Appendix D: diagnostic plots

Acknowledgements

This is the second report in the Model-Based Small Area Estimation Series. We
would like to thank Professor Ray Chambers and his colleagues at Southampton
University Department of Social Statistics for their comments and suggestions on
an earlier draft. We are also grateful to Professors David Firth, Harvey Goldstein and
Danny Pfeffermann, and to our colleague Paul Smith, for reading and commenting
on the final draft of this report – and in particular for their stimulating suggestions
regarding further research. We have drawn extensively on these suggestions, and on
those made by Professor Chambers, in planning our future research programme. The
final responsibility for this report, and any errors it contains, is of course our own.

Executive Summary

Small area estimation of variables studied in social surveys is a growing need for
government. The interest is in areas of varying sizes, but often at the level of
political wards – roughly 2,000 households. The basic problem is that surveys are not
designed for this kind of estimation, since they are optimised for efficient estimation
at national level.

Methods and Quality Division (now Statistical Methodology Division) of the Office
for National Statistics established the Small Area Estimation Project (SAEP) in April
1998. The aim of the project was to develop a generalised statistical methodology
and an operational system for deriving estimates to known precision from variables
contained in social surveys, for areas defined by a variety of boundary systems.

This report summarises this statistical methodology, presents results obtained from
the methodology and discusses the areas where further research is required.

In order to enable surveys to estimate at smaller areas than designed, a common
technique is to combine the survey data with other data sources that are available
on an area basis and build a modelled relationship. This is known as synthetic or
modelled estimation. The technique is used by SAEP to produce estimates at the
required small area levels. The area level relationship between the survey and auxiliary
variables (usually administrative data or census data) can be estimated by regressing
individual survey responses on area values of the covariates. The basic ideas behind
this methodology are discussed in Chapter 2 and a detailed description of the theory
is given in Appendix A.

The small area estimation methodology has been used to produce a number of small
area estimates. Chapter 3 summarises the results of modelling ten variables. These
variables can be approximately described as follows: household income from the FRS
survey, household income from the GHS survey, a measure of social capital, children
from ethnic minorities, number of people to help in a crisis, one parent families,
overcrowded living conditions and three measures of poor health. Final models
and maps of the resulting estimates are presented along with diagrams relating the
estimates to their confidence intervals.

In order to assess the appropriateness of the models, the estimates and their confidence
intervals, a number of diagnostic tests and additional investigations have been
performed as part of the SAEP project. These are presented in Chapters 4 and 5. They
address problems such as calibration, the addition of age/sex covariates, modelling
and estimating at different geographies and various factors influencing the calculation
of confidence intervals.

Chapter 6 draws on the evidence so far to assess whether the estimates produced by
the SAEP methodology are good enough to be published. It concludes that they are
– provided the confidence intervals are not too wide and proper diagnostic tests have
been done.

In order to carry out the small area estimation analysis a GIS database had to be
developed that involved survey variables, administrative and census data (for use as
covariates) with all variables matched to a set of digital area boundaries. Chapter 7
describes the construction and individual components of that database.

Finally, possible improvements to the SAEP methodology and areas where further
research is necessary are discussed in Chapter 8.

1 Context and Motivation

1.1 Introduction
Small area estimation of variables studied in social surveys is a growing need for
government, principally to support better directed resource allocation for
problems of bad health, bad housing conditions, unemployment and low pay. The
interest is in areas of varying sizes, but often at the level of political ward – roughly
2,000 households. The variable can be either continuous (e.g. household income),
in which case interest is in the area mean, or discrete (e.g. suffering from long term
limiting illness or not), in which case interest is in the area proportion.

1.2 Outline of the problem

The basic problem is that surveys are not designed for this kind of estimation. They
are optimised for efficient estimation at national level. This relates not only to the
sample size but also to the sample design.

First, size: the samples are simply far too small. Most surveys cover broadly seven to
fifteen thousand households per year, which would result in insufficiently precise
estimates at any level below region. The Labour Force Survey (LFS) is much larger,
with 60,000 households per quarter or, in an annual dataset, 96,000 independent
households. Even so, it is only for a small number of the larger local authorities that
the local LFS sample is large enough for direct estimation. More information on the
LFS can be found in the LFS user guide (ONS 1999).

Secondly, there is the problem of sample design. With the major exception of the LFS,
all the principal national household surveys have clustered designs. This means that
the sample is not randomly distributed nationally: certain areas are first selected as
primary sampling units (PSU’s), and households are then selected for interview only
from these. The sample frame is a Post Office list of all addresses called the Postcode
Address File (PAF), which is arranged by postcode. The areas selected as PSU’s are
postcode sectors – groups of addresses with postcodes differing only in the last two
characters. They constitute bounded areas (full postcodes themselves are linear
routes); e.g. N22 7AW falls within postcode sector N22 7 together with all other codes
N22 7xx. Postcode sectors contain of the order of 2,000 households and there are
around 9,000 of them in England and Wales. Typically in the General Household
Survey (GHS) around 3% of postcode sectors are selected as PSU’s. Their selection is
stratified in such a way that their distribution is nationally representative. The
problem for small area estimation is that, irrespective of the total sample size, with
clustering like this the vast majority of areas the size of wards will inevitably contain
no sample respondents at all. Hence no direct survey estimate would be possible.
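
The scale of the problem can be illustrated with the round figures quoted above. The sketch below is a back-of-envelope calculation only, not part of the SAEP system: it treats "around 9,000" postcode sectors and "around 3%" selected as exact, and the function name is hypothetical.

```python
def sampled_and_unsampled(n_areas: int, sampling_fraction: float) -> tuple[int, int]:
    """Return (areas containing sample, areas with no sample) for a clustered design."""
    n_sampled = round(n_areas * sampling_fraction)
    return n_sampled, n_areas - n_sampled


# Approximate figures from the text: ~9,000 postcode sectors, ~3% selected as PSUs.
sampled, unsampled = sampled_and_unsampled(9000, 0.03)
share_unsampled = unsampled / 9000
print(f"{sampled} sectors contain sample; {unsampled} ({share_unsampled:.0%}) contain none")
```

Because wards are of a similar size to postcode sectors, a comparable share of wards would be expected to contain no respondents, whatever the overall sample size.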

1.3 Technique of synthetic estimation

In order to enable surveys to estimate at smaller areas than designed, a common
technique is to combine the survey data with other data sources (known as
auxiliary data or covariates) that are available on an area basis. These naturally
cannot be sample survey sources, but can be data either from some administrative
system or from a previous census. If the survey variable of interest is related to these
other data, more precise estimates can be produced.


Such a technique can be used within the same framework as direct survey estimation if
each estimation area contains sample respondents and if the auxiliary variable that is
available at area level is also measured in the survey. In this case a direct survey
estimate can be calculated and the auxiliary information is used to adjust this estimate
on the basis of the difference between the known area value and the survey’s estimate.

These conditions, though, are not fulfilled in most British surveys. However, if one
decides to base the estimate on the area-level relationship between the survey and
auxiliary variables, then this relationship can be fitted by regressing individual survey
responses on area values of the covariates. The model, once fitted, describes the
relationship between the area values of the target survey variable and of the
covariate(s). Estimates of the target variable for specific areas can then be produced
simply by substituting the known values of the area covariate(s) into the fitted model.
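
The two steps just described (fit on respondents, then substitute known area covariates) can be sketched as follows. This is a minimal illustration using ordinary least squares with a single covariate; the area labels, covariate values and responses are invented for the example, and `fit_line` is a hypothetical helper rather than anything used in the project.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = alpha + beta * x; returns (alpha, beta)."""
    x_mean, y_mean = mean(xs), mean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return y_mean - beta * x_mean, beta


# Hypothetical area-level covariate, known for every area.
area_covariate = {"A": 0.10, "B": 0.20, "C": 0.30, "D": 0.40}

# Hypothetical survey responses as (area, response). Area C has no respondents.
survey = [("A", 12.0), ("A", 10.0), ("B", 9.0), ("B", 7.0), ("D", 3.0), ("D", 1.0)]

# Regress individual responses on the covariate value of each respondent's area.
xs = [area_covariate[area] for area, _ in survey]
ys = [y for _, y in survey]
alpha_hat, beta_hat = fit_line(xs, ys)

# Substitute the known covariates into the fitted model, sampled area or not.
estimates = {area: alpha_hat + beta_hat * x for area, x in area_covariate.items()}
```

Area C receives an estimate even though it contributed no responses, which is the point of the approach.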

In fact we use multilevel, also known as “hierarchical”, modelling – an extension
of regression analysis in which the responses are grouped into the areas at which
estimation is required (this is the hierarchy). The reason for using a multilevel model
is that such a model partitions variance into two elements, one representing the
between-area component and the other the within-area component. The former can
be used as the basis for assigning precision to the area estimate.
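
To make the idea of partitioning variance concrete, the sketch below applies a simple analysis-of-variance (method of moments) calculation to invented residuals from a balanced design, in which every area has the same number of respondents. It is an illustrative stand-in under those assumptions, not the estimation procedure used in the project, and the function name is hypothetical.

```python
from statistics import mean


def variance_components(groups):
    """Split variance into (between-area, within-area) parts.

    groups: one list of residuals per area, all of the same length (balanced).
    """
    k = len(groups)          # number of areas
    n = len(groups[0])       # respondents per area
    group_means = [mean(g) for g in groups]
    grand_mean = mean(group_means)
    # Mean square between areas and mean square within areas.
    msb = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    msw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g) / (k * (n - 1))
    within = msw
    between = max(0.0, (msb - msw) / n)  # truncated at zero if negative
    return between, within


# Hypothetical residuals for three areas with two respondents each.
between_var, within_var = variance_components([[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]])
```

In this invented example the areas differ far more from one another than individuals do within an area, so the between-area component dominates.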

While the model has been constructed only on responses from sampled areas, the
model is assumed to apply nationally. Thus as administrative and census covariates
are known for all areas, not just those sampled, then the fitted model can be used to
obtain estimates and confidence intervals for all. This is the basis of the model-based
or synthetic estimation that ONS has used in its development of small area estimation.

1.4 Initial Methods and Quality work

This unit’s first work in this area of research was a simulation study using data on
patients at a group of doctors’ surgeries (Charlton & Heady 1995), carried out in 1995
at the Office of Population Censuses and Surveys (OPCS) prior to the formation of
ONS. Two projects followed in 1996/7. One, funded by the Department of Health,
involved establishing the incidence of psychiatric morbidity. The other was a study
for the Department of the Environment, Transport and the Regions (DETR) on the
possibilities of updating census indicators, that is, using surveys to estimate the
current value of a variable which was measured in the last census.

The psychiatric morbidity study examined the CIS-R score, a questionnaire-based
measure assessing psychiatric morbidity on a scale from zero, indicating perfect
mental health, to about 50, indicating very severe problems. The estimation areas
were postcode sectors and the quantities for which estimation was required were the
mean score for residents of the areas and the proportion of residents with a score of
12 or more. A report was produced in October 1996 (Heady & Ruddock 1996). The
outcome was that the fitted models showed high levels of statistical significance and
that the estimates of overall geographic distribution of problems were a substantial
improvement on assumptions of equal morbidity overall or within age/sex bands.

The study of updating census indicators was an example of a different kind of
problem. Given surveys carried out each year since the census, a national trend can
be estimated and applied to each area’s census value. The problem is to estimate how
the trend in individual areas differs from the national trend and then apply it to the
census value. In order to model this it was decided to use a time series of surveys
including time (zero at census year, one the next year etc.) and the census value as
covariates. The results are contained in a report and in a paper presented to a
Statistics Canada symposium in 1997 (Heady, Ruddock & Goldstein 1997). Although
this project was less successful than the psychiatric morbidity study, it was recognised
as a more difficult problem.


1.5 Academic work

We were also able to draw on the growing academic literature about small area
estimation. In particular, one paper (Ghosh & Rao 1994) provided an appraisal
of different approaches to the problem, including synthetic estimation. Methods were
also being used to provide estimates for application in government. Schaible (1996)
reports on estimation methodologies employed by the US government to provide small
area estimates of, for example, health indicators and unemployment, and another
paper (Drew et al 1982) reports on the small area estimation techniques used for
analysis of the Canadian Labour Force Survey.

1.6 Small Area Estimation Project – SAEP

In the light of the early work and the academic literature it was felt there was
sufficient potential to develop the methods. As a result, the Small Area Estimation
Project (SAEP), a research and development programme into methods and problems
of such estimation, was established by Methods and Quality Division (MQ) (now
Statistical Methodology Division) in April 1998. The programme was funded half
from MQ’s own resources and half by client contributions from Regional Statistics,
Labour Market and Demography and Health Divisions within ONS and, outside
ONS, from the Department of the Environment, Transport and the Regions (DETR)
and the Health Development Agency (formerly the Health Education Authority).

This report summarises the research and progress of the SAEP from 1998 to 2001.
The theory as well as results are presented and recommendations made for further
work.

The task of the SAEP was to develop a generalised statistical methodology and
an operational system for deriving estimates to known precision from variables
contained in social surveys, for areas defined by a variety of boundary systems. It
concentrated on ‘point in time’ estimation of the kind carried out in the psychiatric
morbidity study.

A database within a geographical information system (GIS) environment was
constructed of individual level survey data and aggregated covariate data together
with a variety of digitised boundary systems, e.g. census enumeration district, ward,
local authority (see Chapter 7). The surveys were the major national household-
based surveys conducted by Social Survey Division (SSD). Each survey record
contained a geo-location grid reference. This enabled any record to be assigned,
using standard GIS techniques, to its appropriate area on any of the boundary
systems. The covariate data was input onto the database in the form of aggregated
data of various administrative and census variables at as small a level of aggregation
as possible and corresponding to a boundary system held. Around fifteen census
variables were constructed from the published small area statistics and input at
enumeration district, ward and local authority level. Additionally, data on DSS
benefit claimants was input at ward level and standardised mortality ratios at local
authority level.

This database enabled selection of a number of variables for exploratory modelling.
Datasets could be constructed based on any particular boundary system for which
covariate data were available. Once the dataset was created, modelling could be
carried out to discover what appeared to be the best fitting model. This has been
carried out using the specialist multilevel modelling software MLwiN. The results from
these modelling procedures are presented for ten variables in Chapter 3.

As well as estimates for each area of interest, confidence intervals for
each estimate are also presented. A discussion of the appropriateness of the confidence
intervals is contained in Chapter 4. Following model fitting, a set of diagnostic
procedures has been defined (Chapter 5). These involve the use of plots and maps to
check for randomness.


There were a number of statistical issues to address. One of the main ones related to
the nature of the estimation areas. The original study of psychiatric morbidity had
used postcode sectors. This is statistically the simplest as it corresponds to the PSU’s
of the sample designs. However most clients are interested in estimates for
administrative areas such as wards. This can cause some difficulties in the estimation
of confidence intervals, which are described in Chapter 4.

1.7 SUPCOM and other research work

In conjunction with the development in SAEP, research was pursued into the
precision and robustness of modelled estimates. A concurrent Eurostat SUPCOM
project on small area estimation was conducted in partnership with Statistics
Finland. This aided the ONS project by providing data from the Finnish statistical
registers: our models could be applied to samples drawn from these registers and the
resulting estimates compared with known actual values. The result was that the
models worked well in the area estimation of income and of sauna ownership. The
areas used were Finnish NUTS4 areas.

However, SUPCOM went further as it enabled the problem of small area estimation
to be addressed under two different national statistical systems and a study made of
the effects of non-response to surveys and of interviewer variance.

The final report was submitted to Eurostat in March 2000 (ONS et al 2000).

The data from Finland has also been useful in further research looking in particular
at the use of individual covariates as well as or instead of area covariates in models.
This work led to a study of the potential impact of the ecological fallacy on small
area estimation. Two members of the team have presented this as a paper to an RSS
meeting. (An earlier study of this issue is contained in a paper published by the SAEP
team in Statistics in Transition (Heady & Hennell 2000)).

Further research work is currently continuing in a Eurostat FP5 project, EURAREA,
which is looking to enhance the methodology in the following areas: bringing in the
dimensions of time and of space, estimating cross classifications and allowing for the
effects of survey design.

1.8 Data management and interface system

The SAEP work has been carried out using a GIS database with survey and covariate
data. The creation of datasets has then required a coding effort using the syntax
language of a statistical package. In order to facilitate usage, particularly for more
operationally based estimation, a project is currently under way to store the data in
a more computer-efficient manner and to provide a visual front end to the data
definition and management procedures. A prototype system has been produced,
which will be assessed before being developed further.

2 Basic Ideas Behind the SAEP System

2.1 Introduction
As Chapter 1 explained, direct sample-based estimators – of the kind which surveys
typically produce at national and regional level – become progressively less precise
as the size of area decreases. As we continue on down to ever-smaller area sizes, there
always comes a point at which direct sample-based estimation becomes impossible
– because there are no sample respondents in most of the small areas concerned.

As a result the estimates have to be constructed in a different way. There are in fact
several different ways in which they might be constructed – but this report does not
consider them all. Instead it considers a series of variations on a common approach
– which might be described as regression synthetic estimation fitted using area-level
covariates. As that is a bit of a mouthful, we will often refer to the method below
simply as synthetic estimation.

The basic idea is that we construct a statistical model relating the observed value
of the variable of interest – measured at individual, household or address-level – to
“covariates” (a.k.a. “auxiliary variables”) that relate to the small area in which the
address is located. These covariates are generally average values or proportions
relating to all individuals or households in the area. The covariates are generally
obtained from sources covering the whole population – such as the census or the
social security system. Once the model has been fitted, it can be applied to these
known local values – in order to predict the average local value for the target variable.

2.2 Model assumptions and estimation


2.2.1 Randomised selection or random reality?
These models are based on certain assumptions which differ quite markedly from
the assumptions made in conventional (or design-based, see below) sample survey
inference. In the conventional set-up reality is assumed to be fixed, even if unknown,
and probability only enters the picture because of the randomised process used
to select the sample. The combination of the fixed population characteristics and
the random sample design generates a sampling distribution, whose properties are
estimated from the sample itself and used, in conjunction with key sample statistics,
to estimate the characteristics of the underlying population and to place confidence
intervals round the estimates.

Under the model-based approach randomness is not introduced via the sampling
process but assumed instead to be a feature of the underlying reality. This reality is
composed most fundamentally not of the actual characteristics of the populations of
particular areas but of underlying tendencies – which can never be directly observed
(though they can be estimated given appropriate assumptions and suitable sample
and covariate data). Actually observed data – whether in the form of a sample, or
of a complete census of the area concerned – is seen as a product of the underlying
tendency and of certain random processes which mean that actually observed
situations always depart to some extent from the underlying tendency.

In order to get from observations to a model of the underlying tendencies – and to
get back from that model to estimates and confidence intervals for real-world values
– we need to be able to describe and estimate the properties of these supposed
random processes. Fortunately, specifications of these properties – and therefore
ways of estimating them – are built into the assumptions of the models themselves.


These models do not, of course, attempt to describe the whole of reality – but only
those parts of it that are relevant to the relation between the evidence available to us
as statisticians and the quantity which we are trying to estimate. As we remarked in
the previous section, this evidence is of two kinds – data from a sample survey, and
auxiliary data relating to the population as a whole.

2.2.2 The basic SAEP model


To bring these ideas into focus we will now discuss a slightly simplified version of
the models that we will be applying later in this report. We will assume that we are
interested in estimating the mean of a survey variable within each of a set of small
areas (although the theory can be applied to other statistics). Denoting the survey
variable as $Y$, we want to find

$$\hat{\bar{Y}}_j \tag{1}$$

where the line above indicates an average, the hat indicates an estimate and the
subscript $j$ indicates the area. For example, if $Y$ represented household income and
we wanted ward level estimates, then $\hat{\bar{Y}}_j$ would be the estimate of the average
household income in ward $j$.

The model is represented below (the formula only includes one covariate, denoted
as $X$, for simplicity). This relationship is usually written with individual level
covariates $x_{ij}$, but in our case we use area level covariates, so we write the basic
multilevel model as:

$$y_{ij} = \alpha + \beta \bar{X}_j + u_j + e_{ij} \tag{2}$$

where:

■ $y_{ij}$ is the survey variable of interest for individual/case $i$ within area of interest $j$

■ $\bar{X}_j$ is the (known) population mean for the covariate in area of interest $j$

■ $\alpha$ and $\beta$ are the regression parameters for intercept and slope respectively

■ $u_j$ is a random area-level term that is assumed to have expectation 0 and variance $\sigma_u^2$

■ $e_{ij}$ is a random individual-level term with expectation 0 and variance $\sigma_e^2$.

A couple of points call for further explanation. In equation (2) the covariate $\bar{X}_j$ is
measured at area level. In principle there is no reason why it should not be specified
as $x_{ij}$ and measured at individual level. The reason for specifying the model as we
have is that the way in which data is currently held within ONS does not enable us to
link survey and covariate data at individual- or address-level. The auxiliary data in
our models is therefore restricted to area-level means and proportions.

The other point relates to the meaning of the two random terms in the model: $u_j$ and
$e_{ij}$. If we just looked at the ‘fixed’ part of the model – $\alpha + \beta \bar{X}_j$ – it would give us an
underlying expected value for individuals living in any area for which the covariate
value was equal to $\bar{X}_j$. The term $u_j$ allows the underlying central value of $Y$ in area
$j$ itself to differ from this more general value. The term $e_{ij}$ expresses the difference
between this underlying area-specific value, and the actual values for particular
individuals or addresses.
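
The roles of the fixed part and the two random terms can be seen by generating data from equation (2). The sketch below is illustrative only: the model specifies just the expectation and variance of each random term, so the use of normal distributions here is an assumption, and the parameter values, area labels and function name are invented.

```python
import random


def simulate_responses(x_bar, n_per_area, alpha, beta, sigma_u, sigma_e, seed=0):
    """Draw y_ij = alpha + beta * x_bar[j] + u_j + e_ij for each area j."""
    rng = random.Random(seed)
    data = {}
    for area, x in x_bar.items():
        u_j = rng.gauss(0.0, sigma_u)  # one area-level departure shared by the area
        data[area] = [
            alpha + beta * x + u_j + rng.gauss(0.0, sigma_e)  # e_ij per individual
            for _ in range(n_per_area)
        ]
    return data


# Hypothetical area covariate means.
x_bar = {"A": 0.10, "B": 0.20, "C": 0.40}

simulated = simulate_responses(x_bar, n_per_area=5, alpha=14.0, beta=-30.0,
                               sigma_u=1.0, sigma_e=2.0)

# With both random terms switched off, every response equals the fixed part.
fixed_only = simulate_responses(x_bar, n_per_area=5, alpha=14.0, beta=-30.0,
                                sigma_u=0.0, sigma_e=0.0)
```

Comparing `simulated` with `fixed_only` shows the two layers of departure: every response in an area is shifted by the same $u_j$, and each response then varies individually through $e_{ij}$.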

2.2.3 How the model can be used to produce an estimator


Before producing model-based estimates of the values for particular areas we need
to estimate the main parameters of the model. The way these estimates are derived is
rather complex and is partially explained in Appendix A (which also references texts
where a full explanation is available). What matters for present purposes is the way
these estimated parameter values – $\hat{\alpha}$, $\hat{\beta}$, $\hat{\sigma}_u^2$ and $\hat{\sigma}_e^2$ – can be used to derive area-
specific estimates and confidence intervals. To see how these parameter estimates can
help with the estimation process it may be useful to express the real area mean $\bar{Y}_j$ in
terms of the random and fixed effects assumed by the model. Equation (3) shows how
the true mean and the model effects are related.

$$\bar{Y}_j = \alpha + \beta \bar{X}_j + u_j + \bar{e}_j \tag{3}$$

where $\bar{e}_j$ is the average value of the $e_{ij}$ terms for the entire population of area $j$.

The estimator used in this report is the synthetic estimator:

$$\hat{\bar{Y}}_{j,\mathrm{SYNTH}} = \hat{\alpha} + \hat{\beta} \bar{X}_j \tag{4}$$

which, as formula (4) shows, is formed by applying estimates of the fixed effect
parameters to the value of the auxiliary variable for the area concerned. As such it
estimates the underlying expected value of $Y$ for any area whose covariate value
equals $\bar{X}_j$, rather than the specific value of $\bar{Y}$ for area $j$ itself.

This raises the question of why we have not included any allowance for the two
random effect terms which complete equation (3) and make the difference between
the underlying value just referred to and the specific value of $\bar{Y}$ in area $j$. The reason
is that it would be necessary to have sample data from area $j$ itself in order to do so.
But, because of the clustered sample designs of the surveys that we are working with,
we usually have direct sample data for only a small proportion of the areas in which
we are interested – and so, for practical purposes, the synthetic estimator is the best
that we can do.

[If we did have sample data for the area concerned, it would be possible to use it – in
conjunction with the parameter estimates $\hat{\sigma}_u^2$ and $\hat{\sigma}_e^2$ – to provide an estimate of $u_j$.
For details, see Appendix A. In order to go further and estimate $\bar{e}_j$ one would need a
sample that included a substantial proportion of the population of area(j) – which
would not be realistic in survey terms. However, this may not matter much since, as $\bar{e}_j$
is an average value for the whole population of area(j), its value is likely to be small.]
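
For orientation, the standard shrinkage form of such an estimate in a random-intercept model is sketched below. This is a textbook expression given under stated assumptions, not a transcription of Appendix A, which may set the estimator out differently. The symbols $n_j$ (sample size in area(j)), $\bar{y}_j$ (sample mean in area(j)) and $\gamma_j$ (shrinkage factor) are introduced here for illustration and do not appear in the text above.

```latex
\hat{u}_j = \gamma_j \left( \bar{y}_j - \hat{\alpha} - \hat{\beta} \bar{X}_j \right),
\qquad
\gamma_j = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_e^2 / n_j}
```

With no sample in the area the correction is unavailable, and with a small $n_j$ the factor $\gamma_j$ is close to zero, so the estimate stays close to the synthetic value.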

2.2.4 Calculation of Confidence Intervals


As well as producing estimates of the mean within the small area, it is also important
to be able to assess the accuracy of the estimates. As in the conventional sample-
survey set-up, we do this by placing confidence intervals round our estimates – but in
the model-based case these intervals are calculated in a fundamentally different way.
In the conventional set-up the confidence interval allows for the variability of the
sample-based estimator around the underlying true value.

In the case of the synthetic estimator we need to allow for two sources of variability:
one of these is the sampling variability of the synthetic estimator around the
underlying expected value “$\alpha + \beta \bar{X}_j$”. But we also need to remember that, according
to the model, the unknown real value of $\bar{Y}_j$ is itself a random variable that varies
around this underlying expected value. The variance of $\bar{Y}_j$ is given by

$$\sigma_u^2 + \mathrm{var}(\bar{e}_j) = \sigma_u^2 + \frac{\sigma_e^2}{N_j} \qquad (5)$$

where $N_j$ is the total population (of individuals or addresses) of area(j).


Both the sampling variance of $\hat{\bar{Y}}_{j,\mathrm{SYNTH}}$ and the real variance of $\bar{Y}_j$ need to be allowed
for in calculating the variance of the difference between them – which is often
referred to as the Mean Square Error (MSE) of $\hat{\bar{Y}}_{j,\mathrm{SYNTH}}$. If we neglect the last,
probably small, term in equation (5) we can write this combined variance as

$$\mathrm{MSE}(\hat{\bar{Y}}_{j,\mathrm{SYNTH}}) = \sigma_u^2 + \mathrm{var}(\hat{\alpha} + \hat{\beta} \bar{X}_j) \qquad (6)$$


where the first term on the right-hand side of the equation represents the real
variance of particular areas round the underlying expected values and the second
term represents the sampling variance of the synthetic estimator. In the examples
that follow, $\sigma_u^2$ is usually the main component of the MSE.

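In order to set confidence intervals we substitute estimated variances into equation
(6), obtaining

$$\widehat{\mathrm{MSE}}(\hat{\bar{Y}}_{j,\mathrm{SYNTH}}) = \hat{\sigma}_u^2 + \widehat{\mathrm{var}}(\hat{\alpha} + \hat{\beta} \bar{X}_j) \qquad (7)$$

and set the 95% confidence interval as

$$\hat{\bar{Y}}_{j,\mathrm{SYNTH}} \pm 2\sqrt{\widehat{\mathrm{MSE}}(\hat{\bar{Y}}_{j,\mathrm{SYNTH}})} \qquad (8)$$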
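
A minimal sketch of equations (7) and (8) follows, assuming that the fitting software supplies a 2×2 estimated covariance matrix for the two fixed-effect estimates. All numerical inputs are hypothetical. The sampling variance of the linear combination is expanded as var(α̂) + 2·X̄j·cov(α̂, β̂) + X̄j²·var(β̂).

```python
import math


def synthetic_mse(sigma_u2_hat, cov_alpha_beta, x_bar_j):
    """Equation (7): estimated between-area variance plus the estimated
    sampling variance of alpha_hat + beta_hat * x_bar_j."""
    (v_aa, v_ab), (_, v_bb) = cov_alpha_beta
    sampling_var = v_aa + 2.0 * x_bar_j * v_ab + x_bar_j ** 2 * v_bb
    return sigma_u2_hat + sampling_var


def confidence_interval(estimate, mse):
    """Equation (8): the estimate plus or minus two root-MSEs."""
    half_width = 2.0 * math.sqrt(mse)
    return estimate - half_width, estimate + half_width


# Hypothetical inputs (not taken from the report).
sigma_u2_hat = 8.0                                # estimated between-area variance
cov_alpha_beta = ((0.5, -0.125), (-0.125, 0.25))  # estimated cov matrix of (alpha_hat, beta_hat)
x_bar_j = 2.0                                     # covariate mean for area(j)
estimate = 330.0                                  # synthetic estimate for area(j)

mse = synthetic_mse(sigma_u2_hat, cov_alpha_beta, x_bar_j)
lower, upper = confidence_interval(estimate, mse)
print(mse, lower, upper)
```

In this made-up example the between-area variance (8.0) dominates the sampling variance (1.0).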

2.3 Comparison of properties of synthetic estimates with those of direct sample estimates
2.3.1 A note on terminology
We noted earlier in this chapter that the rationale for model-based estimation
was fundamentally different from what we called ‘conventional’ sample survey
estimation. In fact, over the last two or three decades, the use of model-based
methods in the survey context has increased so rapidly that it is no longer really
appropriate to say that inference methods based on the sampling distribution are
the ‘conventional’ standard. Instead the following terminology is used:

■ design-based methods are those in which the estimators are justified in terms of
the structure of the randomised sample design;

■ model-based methods are those in which the estimators are justified in terms of a
model of the inherent probabilistic structure underlying the population itself.

In principle the properties of any particular estimator can be assessed from either a
design-based or a model-based perspective – with potentially different conclusions.

There is a good deal of literature on the differences between the two approaches (e.g.
Särndal 1984), which interested readers may consult. In the paragraphs which follow,
we will discuss the properties of the particular model-based approach described
above – and point out ways in which these may differ from those of more familiar
design-based estimators.

2.3.2 Equal probability selection, weighting and bias – and how they affect the
interpretation of the estimates
The simplest forms of design-based estimation are those for sample designs in which
each member of the population has an equal chance of selection, and each selected
respondent agrees to take part. In this case it is relatively straightforward to construct
unbiased – or nearly unbiased – estimators. In order to obtain design-unbiased
estimates when selection probabilities are unequal it is necessary to weight the data
to compensate for the unequal selection probabilities. Non-response is usually dealt
with in a similar way: if the probability of responding is related to some variable
whose value is known for both responders and non-responders, the probability of
responding can be estimated. Subject to some assumptions, different response
probabilities can then be allowed for by weighting, in the same way as different
selection probabilities.
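
The weighting idea can be sketched as follows, assuming each respondent's selection probability is known. The data are hypothetical. The function computes a weighted mean with weights equal to the inverse selection probabilities, which is one common form of design-based estimator and is nearly, rather than exactly, design-unbiased; it is offered as an illustration rather than as the weighting scheme used in the surveys discussed here.

```python
def weighted_mean(values, selection_probs):
    """Mean of the sampled values, weighted by inverse selection probability."""
    weights = [1.0 / p for p in selection_probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical sample: the last two units come from a group that was
# sampled at half the rate of the first two.
values = [10.0, 10.0, 40.0, 40.0]
selection_probs = [0.5, 0.5, 0.25, 0.25]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, selection_probs)
print(unweighted, weighted)
```

The unweighted mean under-represents the group sampled at the lower rate; the weights restore its share of the population.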

In an extreme model-based view, selection and response probabilities are irrelevant
– since the probabilistic relationships that underpin statistical inference are inherent
in the structure of reality itself. However, this view is only tenable if one is confident
that the particular structure of probabilistic relationships assumed in one’s model is
a true – and complete – description of the relationships in which one is interested.
Many statisticians would prefer to be more cautious than this – and so it is actually
common for model-based analyses to use data from randomised samples, and to
attempt some kind of weighting to allow for different selection and response
probabilities. However, there is no general agreement about how best to weight
for small area estimation and we have not researched the question in this report
(as explained below, one of the analyses was re-weighted for unequal selection
probabilities and none of them was re-weighted for non-response).

The idea of bias means different things in the design- and model-based perspectives.
An estimator is design-unbiased for a particular area if its expected value, given the
sample design, is equal to the real-life value for the area concerned. Estimators based
on the samples taken in the area concerned are generally design-unbiased provided
that the data has been appropriately weighted. However, the synthetic estimator
defined in equation (4) is not design-unbiased, since what it estimates is the
underlying expected value for any area with the same covariate values, not the real
value for the area in question.

Nevertheless, the synthetic estimator is unbiased in another, rather more abstruse,
sense. The reader may recall that the basic idea of unbiasedness is that the expected
value of the difference between the estimator and the quantity being estimated is zero.
Now, if the model is true, the way we have derived the synthetic estimator for area(j)
implies that its expected value is equal to “$\alpha + \beta \bar{X}_j$” – and the model also implies that
the true value for area(j) is a random variable with this same expected value. Since
the expected values of the estimator and the area itself are the same, their expected
difference is zero – and we can say that the synthetic estimator is model-unbiased.
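
Written out, the argument runs as follows, assuming (as the model does) that the random effects have zero mean and that the fixed-effect estimates are unbiased for $\alpha$ and $\beta$. The subscript $M$, introduced here for illustration, denotes expectation under the model.

```latex
E_M\!\left[ \hat{\bar{Y}}_{j,\mathrm{SYNTH}} - \bar{Y}_j \right]
  = E_M\!\left[ \hat{\alpha} + \hat{\beta} \bar{X}_j \right]
    - E_M\!\left[ \alpha + \beta \bar{X}_j + u_j + \bar{e}_j \right]
  = (\alpha + \beta \bar{X}_j) - (\alpha + \beta \bar{X}_j + 0 + 0)
  = 0
```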

Too much should not be made of this property of model-unbiasedness – most of our
estimators (those relating to proportions) will in fact be slightly model-biased. The
important thing to remember is that, in the everyday sense of the term, the synthetic
estimate is biased because it relates not to the specific value of the area in question
but to the average value for similar areas – always assuming that the model is true. This
means that you cannot use synthetic estimates to evaluate the impact of special local
factors that have not been included in the model – such as, for instance, a particular
policy adopted by the local authority.

There is another kind of question – of considerable importance to policy makers
– which neither design-based estimators nor our synthetic estimator can answer.
This regards the tails of the distribution of area values. For the purposes of obtaining
grants from bodies such as the EU, the important thing may be whether or not the
income level – or some other measure of well-being – in an area is below a particular
threshold. Estimators based on the sample in the area concerned are likely to
overestimate the number of poor areas – since random sampling variation will
mean that there will be a number of non-poor areas whose sample values
nevertheless fall below the poverty threshold. Synthetic estimators, in contrast, are
likely to underestimate the number of poor areas – because they estimate simply the
expected value “$\alpha + \beta \bar{X}_j$” and ignore the possibility that the value of $u_j$ might carry
the area below the poverty threshold.
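
The two effects can be seen in a small simulation. This is a sketch under strong simplifying assumptions: all parameter values are hypothetical, the model is taken to be true, the fixed-effect parameters are treated as known, and each direct estimate is the true area value plus independent normal sampling error.

```python
import random

random.seed(2003)

N_AREAS = 5000
THRESHOLD = 95.0     # hypothetical poverty threshold
SIGMA_U = 5.0        # standard deviation of the area effects u_j
SIGMA_DIRECT = 5.0   # standard deviation of the sampling error of a direct estimate

n_poor_true = n_poor_direct = n_poor_synth = 0
for _ in range(N_AREAS):
    expected = 100.0 + 10.0 * random.random()            # alpha + beta * X_j
    true_value = expected + random.gauss(0.0, SIGMA_U)   # adds the area effect u_j
    direct = true_value + random.gauss(0.0, SIGMA_DIRECT)
    synthetic = expected                                  # ignores u_j entirely
    n_poor_true += true_value < THRESHOLD
    n_poor_direct += direct < THRESHOLD
    n_poor_synth += synthetic < THRESHOLD

print(n_poor_true, n_poor_direct, n_poor_synth)
```

Because every expected value in this set-up lies above the threshold, the synthetic estimator classifies no area as poor, while the direct estimator classifies more areas as poor than there really are.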

2.3.3 Random sampling, variance estimation and spatial patterns


The role of random sampling in design-based estimation goes beyond the fact that
it guarantees equal (or known) selection probabilities to each individual or unit
in the population. It also guarantees – subject to certain limitations built into the
particular sample design – that each combination of individuals has an equal (or
at least knowable) probability of selection. Without this property, it would not be
possible to calculate the variance of the sampling distribution – and so put
confidence intervals round the design-based estimates derived from the survey.

As with bias, so with variance estimation – if one takes the model as a literal
representation of the relevant aspects of reality there is no need to insist on random
sampling. However, if one has doubts about this, then it is a good idea to use
representative random sampling methods. The reason for this is that the models
used in this report base their estimates of between- and within-area variance on the
assumption that the $u_j$ terms for different areas are mutually independent, as are
the $e_{ij}$ terms for different individuals/units within the same area. This may well not
be true in the real world – where, for instance, the richer inhabitants of a local area
might well live at one end of the district, while the poorer residents cluster together
at the other. In this situation, random sampling would ensure that the values entered
into the analysis were indeed independent – and that the resulting variance estimates
behaved in the expected way.

However, even if randomisation has enabled us to estimate the variances correctly
– and so set appropriate confidence intervals for individual small areas – we still
need to be careful about how we extend the results to larger areas. This is because,
regardless of what the model says, the true values for neighbouring areas may well
differ from the model-based estimates in similar ways. So that if, say, the actual value
of household income or the actual proportion of people with health problems in a
particular ward was higher than that predicted by the synthetic estimator, the same
might well be true for neighbouring wards, or even for most of the wards in the same
county. Technically, this is because our model only estimates the variance of $u_j$, and
not the covariance between the values of $u_j$ for neighbouring areas and areas in
different parts of the country.
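
To see why this matters when results are combined over larger areas, consider the average of the area effects over $K$ neighbouring areas. The sketch below assumes, purely for illustration, that every pair of these areas shares a common correlation $\rho$; both $K$ and $\rho$ are symbols introduced here and are not part of the model in the report.

```latex
\mathrm{var}\!\left( \frac{1}{K} \sum_{j=1}^{K} u_j \right)
  = \frac{\sigma_u^2}{K} \quad \text{if the } u_j \text{ are independent},
\qquad
\mathrm{var}\!\left( \frac{1}{K} \sum_{j=1}^{K} u_j \right)
  = \frac{\sigma_u^2 \left[ 1 + (K-1)\rho \right]}{K} \quad \text{if each pair has correlation } \rho .
```

With positive $\rho$ the second expression does not shrink towards zero as $K$ grows, so an interval for the larger area built on the independence assumption would be too narrow.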

Because the model ignores patterns of spatial covariance, it leaves open two
possibilities that we need to guard against in some other way – even though the
national proportion of small areas whose true value lies within the confidence
interval is correct. The first of these is that the underestimated areas may cluster
together in some parts of the country, while the over-estimated areas cluster together
in other regions. The second of these is that if the model is fitted for one kind of
geographical unit – say wards – the estimated relationship between the target
variable and the auxiliary variables may well not hold for other kinds of areas –
such as constituencies and local authorities. This is an instance of the well-known
Modifiable Areal Unit Problem (MAUP). We will return to these problems later in
this report.

It is worth noticing that none of these spatial biases arise with design-based
estimation, since, provided the selection probabilities and weights are right, the
expected values of the estimates correspond to the real local values at all area levels
– even though the estimators become very unstable at local levels (which is why we
need to use model-based estimators in the first place).

This brings us to another problem – namely, how we should handle the potential
contradictions between design-based and model-based estimates. The two sets of
estimates can differ even at national level, but as we have seen, the differences are
particularly important at regional and sub-regional level. As well as the fact that both
sets of estimates have strengths that one might want somehow to combine, there is
the fact that the administrators who use our figures have a legitimate requirement
for consistent figures – and inconsistencies between model-based small area figures,
and design-based figures at region and above are hard to justify in a practical context.
One suggestion, which we shall consider further in a later chapter, is that model-
based small area estimates should be calibrated to the design-based estimates for
regions – i.e. adjusted so that the two sets of regional totals are the same.
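
One simple way of doing this is a ratio adjustment, sketched below. This is an assumption made for illustration: the calibration method considered in the later chapter may differ, and the ward names, populations and estimates are hypothetical. The function scales every small-area mean in a region by a common factor so that the population-weighted mean of the adjusted figures equals the design-based regional mean.

```python
def calibrate_to_region(area_estimates, area_populations, regional_direct_mean):
    """Scale the area means by one common factor so that their
    population-weighted mean matches the design-based regional mean."""
    total_pop = sum(area_populations[a] for a in area_estimates)
    model_regional_mean = (
        sum(area_estimates[a] * area_populations[a] for a in area_estimates)
        / total_pop
    )
    factor = regional_direct_mean / model_regional_mean
    return {a: est * factor for a, est in area_estimates.items()}


# Hypothetical region containing two wards.
area_estimates = {"ward_1": 100.0, "ward_2": 200.0}   # model-based ward means
area_populations = {"ward_1": 3000, "ward_2": 1000}   # ward populations
regional_direct_mean = 150.0                          # design-based regional mean

calibrated = calibrate_to_region(area_estimates, area_populations, regional_direct_mean)
print(calibrated)
```

A common scaling factor preserves the ratios between the areas within a region while removing the inconsistency at regional level.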

2.3.4 The effects of model selection


Once the sample design has been chosen, and appropriate weights set, design-based
estimation proceeds automatically. This is not the case with model-based methods.
For one thing, the form of the model has to be decided on. The basic model
described above is a little too basic for any of the practical applications discussed in
the next chapter. In each of these practical applications the model actually used will
be set out – and the properties of these models will be explained further in Appendix A.


The choice of auxiliary variables plays a particularly important role. Adding a new
covariate to the model alters both the value of the synthetic estimator and the
estimate of the between-area variance $\hat{\sigma}_u^2$. It may also change the form of the
relationship between the target variable and the covariates – and the new variable
may need to be transformed in order to keep this relationship in a tractable form.

The procedure that we used to decide which covariates to include in the model is
described in Appendix A. The diagnostic checks that we used to judge whether the
form of the models was appropriate are described in later chapters, and a full set of
diagnostic results is given in Appendix D.

However, even with the best methods of model selection there remains a problem
– since the selection is partly based on sample data, there is always a possibility that
the selected model is not the best one. The confidence intervals produced by model-
based methods allow for errors in estimating parameter values – given that the form
of the model, and the choice of covariates, is correct. There is always the possibility
that they are not – which leaves an additional margin of uncertainty, not allowed for
in the confidence intervals.

2.4 Assessing the quality of the estimates


Given the various problems and uncertainties outlined above, it is clearly necessary
to use a range of different methods to assess the quality of the synthetic estimates
produced by the SAEP programme. In Chapter 3 we present the models and maps of
the resulting estimates, along with diagrams relating the estimates to their confidence
intervals. This should enable readers to see the method’s assessment of its own
precision – and relate this to their own assessment of the credibility of the mapped
results, taking account of any prior knowledge they may have. Chapters 4 and 5 use
diagnostics and spatial analyses to shed further light on the appropriateness of the
models and the adequacy of the confidence intervals.
