STA 2402 Design and Analysis of Sample Surveys PDF
U)
School of Science and Informatics(S.S.I)
FOR
January 2017
Contents
1 Preliminary
1.1 Course Purpose
1.2 Course Objectives
1.3 Course Description
1.4 Pre-requisite
1.5 Learning and Teaching Methodologies
1.6 Instructional Materials and Equipment
1.7 Assessment
1.8 Course Textbooks
1.9 Course Journals
1.10 Reference Textbooks
1.11 Reference Journals
5 Systematic sampling
5.1 Introduction and Description
5.2 Estimation of Population Mean, Variance and Total
5.2.1 Estimation of population mean: When N = nk
5.3 Exercises
6 Cluster sampling
6.1 Introduction and Description
6.2 Estimation of Population Mean, Variance and Total
6.2.1 Case of equal clusters
6.3 Exercises
10.5 Exercises
14 References
1 Preliminary
To enable students to apply the principles of survey sampling and to understand the
different methods used in sampling.
1. Define a sample survey, and identify the advantages and principal steps in organizing
a survey.
8. Explain how national surveys are conducted, and the work done by the Kenya
National Bureau of Statistics.
Sample survey: definition, advantages and principal steps in organizing a survey. Types of
samples: probability and purposive. Simple random sampling: sampling proportions and
percentages; estimating sample size; stratified random, systematic, cluster and multistage
samples; selections with p.p.s (probability proportional to size). Ratio and regression
estimators, sampling and non-sampling errors, organisation of national surveys, and the
Central Bureau of Statistics. Use of computer packages.
1.4 Pre-requisite
Students attain knowledge through lectures, seminars, tutorials, and independent studies.
Student discussion and seminar paper presentation.
1.7 Assessment:
1. Examination 70%
3. Total 100%
1. Yates, F. (1981). Sampling Methods for Censuses and Surveys. 4th ed. New York.
1. Statistics Surveys
2. Survey Methodology
1.10 Reference Textbooks
1. Deming, W.E. (1950). Some Theory of Sampling. New York: Dover. ISBN-13: 9780486646848; ISBN-10: 048664684X.
2. Lohr, Sharon (1999). Sampling: Design and Analysis. Duxbury Press. ISBN-10: 0-534-35361-4.
1. Annals of Statistics
3. Biometrics
2 THE BASIC CONCEPTS
2.1 Introduction
1. Population total, $Y = \sum_{i=1}^{N} Y_i$
2. Population mean, $\bar{Y} = \frac{Y}{N} = \frac{1}{N}\sum_{i=1}^{N} Y_i$
3. Population variance, $S_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$
1. Sample total, $y = \sum_{i=1}^{n} y_i$
2. Sample mean, $\bar{y} = \frac{y}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i$
4. Sample coefficient of variation, $c_y = \frac{s_y}{\bar{y}}$, where $s_y$ is the sample standard deviation and $\bar{y}$ is the sample mean.
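The population and sample quantities listed above can be computed directly. The following is a minimal sketch in Python; the population and sample values are hypothetical and chosen only for illustration.

```python
from statistics import mean, variance, stdev

# Hypothetical population of N = 5 values and a sample of n = 3 drawn from it.
population = [2, 4, 6, 8, 10]
sample = [2, 6, 10]

Y_total = sum(population)      # population total Y
Y_bar = mean(population)       # population mean
S2_Y = variance(population)    # population variance, divisor N - 1

y_total = sum(sample)          # sample total y
y_bar = mean(sample)           # sample mean
c_y = stdev(sample) / y_bar    # sample coefficient of variation

print(Y_total, Y_bar, S2_Y, y_total, y_bar, c_y)
```

Note that `statistics.variance` and `statistics.stdev` use the divisor (number of values minus 1), which matches the definition of $S_Y^2$ used in these notes.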
Definition 2.3. Sampling units: This refers to the individual items whose character-
istics are to be measured in the sample survey.
Definition. Sampling frame: This is the list of all sampling units. It may be a list of units with identification and particulars, or a map showing the boundaries of sampling units. For example, a manufacturing firm may want to determine how popular a newly manufactured product is within the community; the community then suggests a possible frame for the survey. The firm may decide to concentrate its survey in urban residential areas only. In this case, the frame is a complete list of estates in urban areas. The residents in the chosen estates are interviewed and inferences are made.
Definition. Sampling scheme: This is the technique by which the elements that constitute the sample are obtained from the population.
2.2 Types of Sampling
1. Haphazard sampling: No scheme has been used at all; it is neither probability nor non-probability sampling.
1. We are able to define the set of distinct samples, $S_1, S_2, \ldots, S_N$, which the procedure is capable of selecting if applied to a specific population. This means that we can say precisely what sampling units belong to $S_1$, to $S_2$, and so on.
3. We select one of the $S_i$ by a process in which each $S_i$ receives its appropriate probability $\pi_i$ of being selected.
4. The method for computing the estimate from the sample must be stated and must
lead to a unique estimate for any specific sample.
The simplest type of sampling is simple random sampling (SRS). We shall also make use of common terms in statistics such as statistic, estimator, point estimation, interval estimation, and ratio and regression estimation. The design and analysis of sample surveys is about knowing which estimators and design procedures are good.
1. Precision:- how much variation there is in the estimation from sample to sample.
where µx = E [X].
The trueness of an estimator is measured by its bias, which is defined as the difference between the expectation of the estimator and the population parameter that it estimates.
Generally estimators which have low variance (high precision) and low bias are pre-
ferred.
Remark: There are various methods for determining an estimator: the Method of Moments (MME), Ordinary Least Squares (OLS), Maximum Likelihood Estimation (MLE), Bayesian estimation, the Rao-Blackwell method, the Minimum Chi-Square method and the Minimum Distance method. There exist several criteria for comparing estimators; these include bias, variance, mean squared error (MSE), consistency, sufficiency, and location and scale invariance.
2.4 Principal steps involved in planning and execution of a sample survey.
1. Objective of the survey: The objective of the survey has to be clearly defined
and well understood by the person planning to conduct it. It is expected from the
statistician to be well versed with the issues to be addressed in consultation with
the person who wants to get the survey conducted. In complex surveys, sometimes
the objective is forgotten and data is collected on those issues which are far away
from the objectives.
4. Degree of precision required: The results of any sample survey are always
subjected to some uncertainty. Such uncertainty can be reduced by taking larger
samples or using superior instruments. This involves more cost and more time. So
it is very important to decide about the required degree of precision in the data.
This needs to be conveyed to the surveyor also.
6. The frame: The sampling frame has to be clearly specified. The population is
divided into sampling units such that the units cover the whole population and
every sampling unit is tagged with identification. The list of all sampling units is
called the frame. The frame must cover the whole population and the units must not
overlap each other in the sense that every element in the population must belong to
one and only one unit. For example, the sampling unit can be an individual member
in the family or the whole family.
7. Selection of sample: The size of the sample needs to be specified for the given
sampling plan. This helps in determining and comparing the relative cost and time of
different sampling plans. The method and plan adopted for drawing a representative
sample should also be detailed.
8. The Pre-test: It is advised to try the questionnaire and field methods on a small
scale. This may reveal some troubles and problems beforehand which the surveyor
may face in the field in large scale surveys.
9. Organization of the field work: How to conduct the survey, how to handle
business administrative issues, providing proper training to surveyors, procedures,
plans for handling the non-response and missing observations etc. are some of the
issues which need to be addressed for organizing the survey work in the fields. The
procedure for early checking of the quality of return should be prescribed. It should
be clarified how to handle the situation when the respondent is not available.
10. Summary and analysis of data: It is to be noted that based on the objectives
of the data, the suitable statistical tool is decided which can answer the relevant
questions. In order to use the statistical tool, a valid data set is required and this
dictates the choice of responses to be obtained for the questions in the questionnaire,
e.g., the data has to be qualitative, quantitative, nominal, ordinal etc. After getting
the completed questionnaire back, it needs to be edited to amend the recording errors
and delete the erroneous data. The tabulating procedures, methods of estimation
and tolerable amount of error in the estimation needs to be decided before the start
of survey. Different methods of estimation may be available to get the answer of
the same query from the same data set. So the data needs to be collected which is
compatible with the chosen estimation procedure.
11. Information gained for future surveys: The completed surveys work as guide
for improved sample surveys in future. Beside this they also supply various types
of prior information required to use various statistical tools, e.g., mean, variance,
nature of variability, cost involved etc. Any completed sample survey acts as a
potential guide for the surveys to be conducted in the future. It is generally seen
that the things always do not go in the same way in any complex survey as planned
earlier. Such precautions and alerts help in avoiding the mistakes in the execution
of future surveys.
2.5 Pilot Survey
In planning a survey efficiently, some prior information about the population under consideration and the operational and cost aspects of data collection will be needed. When
such information is not available
Sample surveys have potential advantages over complete enumeration (census). They include:
1. Reduced cost. If data are secured from only a small fraction of the aggregate, expenditures may be expected to be smaller than if a complete census is attempted.
2. Greater speed. For the same reason, the data can be collected and summarized more
quickly with a sample than with a complete count. This may be a vital consideration
when the information is urgently needed.
4. Greater accuracy. Because personnel of higher quality can be employed and can be
given intensive training, a sample may actually produce more accurate results than
5. When a survey involves risky tests such as testing a new drug, sampling must be
used.
2.7 Exercises
2. State and explain the factors that affect the sample survey design.
2.8 Solutions
1.
→ Specify the precision and objectives of the survey and declare other variables of interest (auxiliary information, including explanatory variables and stratification variables).
→ Instruments we use.
2.
→ The nature of the survey, i.e. either a descriptive survey or an analytical survey (model development).
3 SIMPLE RANDOM SAMPLING(SRS)
This is the simplest form of sampling. It is a technique of selecting a sample from a target population in such a way that every unit in the population has an equal and independent chance of being selected into the sample. For a population of size $N$ sampled without replacement, the probability of picking any particular remaining unit is $\frac{1}{N}$ at the first draw, $\frac{1}{N-1}$ at the second draw, $\frac{1}{N-2}$ at the third draw and hence $\frac{1}{N-(r-1)}$ at the $r$th draw. SRS may be performed with replacement (wr) or without replacement (wor).
3.2 Simple Random Sampling with Replacement (SRSWR)
$P(\{i_1, i_2, \ldots, i_n\}) = \frac{1}{N}\cdot\frac{1}{N}\cdots\frac{1}{N} = \frac{1}{N^n}$ (3.1)
There are $N^n$ possible samples (sequences) in the sample space $S$, for a given $(N, n)$. A srswr of $n$ draws from a population of size $N$ will be denoted by srswr$(N, n)$.
Proof:
$E(\bar{y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = E(y_1)$ (3.2)
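Since $y_1, y_2, \ldots, y_n$ are independently and identically distributed (iid) random variables with;
$P(y_i = Y_k) = \frac{1}{N}$, $k = 1, 2, \ldots, N$, $i = 1, 2, \ldots, n$. (3.3)
Let $t_i$ denote the number of times unit $i$ appears in the sample. Then
$Cov(t_i, t_j) = -\frac{n}{N^2}$, $(i \neq j = 1, 2, \ldots, N)$ (Show this)
Now, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{N} t_i Y_i \Rightarrow E(\bar{y}) = E\left(\frac{1}{n}\sum_{i=1}^{N} t_i Y_i\right) = \frac{1}{n}\sum_{i=1}^{N} Y_i E(t_i)$. But $E(t_i) = \frac{n}{N}$, $\Rightarrow$
$E(\bar{y}) = \frac{1}{n}\cdot\frac{n}{N}\sum_{i=1}^{N} Y_i = \bar{Y}$ (3.4)
Hence $E(\bar{y}) = \bar{Y}$.
$Var(\bar{y}) = \frac{\sigma^2}{n}$, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$ (3.5)
Proof:
$Var(\bar{y}) = Var\left(\frac{1}{n}\sum_{i=1}^{N} t_i Y_i\right)$
$= \frac{1}{n^2}\sum_{i=1}^{N} Y_i^2\,Var(t_i) + \frac{1}{n^2}\sum\sum_{i \neq j} Y_i Y_j\,Cov(t_i, t_j)$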
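$= \frac{\sigma^2}{n}$ (3.6)
Note: $(x_1 + x_2)^2 = x_1^2 + 2x_1x_2 + x_2^2 \Rightarrow \left(\sum_{i=1}^{N} x_i\right)^2 = \sum_{i=1}^{N} x_i^2 + \sum\sum_{i \neq j} x_i x_j$
$Var(\bar{y}) = \frac{N-1}{Nn} S_y^2$; $S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$ (3.7)
Corollary 3.3. In srswr$(N, n)$, $Var(\hat{Y}) = \frac{N^2\sigma^2}{n}$.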
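The srswr results $E(\bar{y}) = \bar{Y}$ and $Var(\bar{y}) = \sigma^2/n$ can be checked numerically by listing every one of the $N^n$ ordered samples. The sketch below does this in Python for a small hypothetical population; the values of `Y` and `n` are assumptions made for illustration only.

```python
from itertools import product
from statistics import mean, pvariance

# Hypothetical population (N = 4) and sample size n = 2.
Y = [1, 3, 5, 11]
N, n = len(Y), 2

# All N**n ordered samples of srswr(N, n); each has probability 1 / N**n.
sample_means = [mean(s) for s in product(Y, repeat=n)]

expected_mean = mean(sample_means)      # E(y-bar) over the sample space
var_of_mean = pvariance(sample_means)   # Var(y-bar) over the sample space
sigma2 = pvariance(Y)                   # sigma^2, divisor N

print(len(sample_means), expected_mean, var_of_mean, sigma2 / n)
```

Because every ordered sample is equally likely, plain averages over the enumerated samples are exact expectations, so no simulation error is involved.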
N
ment of size n, then this is chosen at random from distinct sample. Each of the
n
−1
N N
samples has the same probability 1 or of being selected.
n N
n
n
Lemma 3.1. For a srswor$(N, n)$ design the probability of a specified unit being selected at any given draw is $\frac{1}{N}$, i.e.
$P_r(i_k) = \frac{1}{N}$, $r = 1, 2, \ldots, n$. (3.8)
Lemma 3.2. For a srswor$(N, n)$ the probability of two specified units being selected at any two given draws is $\frac{1}{N}\cdot\frac{1}{N-1}$, i.e.
$P_{r,s}(i_r, i_s) = \frac{1}{N(N-1)}$, $r < s$, $r = 1, 2, \ldots, n$. (3.9)
Lemma 3.3. For a srswor$(N, n)$ the probability that a specified unit is included in the sample is $\frac{n}{N}$, i.e. $\pi_i = \frac{n}{N}$.
Lemma 3.4. For a srswor$(N, n)$ the probability that any two specified units are included in the sample is $\frac{n(n-1)}{N(N-1)}$, i.e. $\pi_{i,j} = \frac{n(n-1)}{N(N-1)}$.
The quantities $\pi_i$ and $\pi_{i,j}$ (as defined in Lemma 3.3 and 3.4) are respectively the inclusion probabilities of unit $i$ and of the pair $(i, j)$ in the sample. These are called, respectively, the first order and second order inclusion probabilities of a design.
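The inclusion probabilities of Lemma 3.3 and Lemma 3.4 can be confirmed by counting over all $\binom{N}{n}$ equally likely samples. The following is an illustrative sketch; the choices $N = 5$, $n = 3$ and the unit labels 1 and 2 are assumptions.

```python
from itertools import combinations
from fractions import Fraction

N, n = 5, 3                                # hypothetical sizes
units = range(1, N + 1)
samples = list(combinations(units, n))     # all C(N, n) srswor samples

# First order inclusion probability of unit 1: share of samples containing it.
pi_1 = Fraction(sum(1 in s for s in samples), len(samples))

# Second order inclusion probability of the pair (1, 2).
pi_12 = Fraction(sum(1 in s and 2 in s for s in samples), len(samples))

print(pi_1, pi_12)
```

Exact fractions are used so that the counts can be compared with $n/N$ and $n(n-1)/(N(N-1))$ without rounding.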
$y = \sum_{i=1}^{N} a_i Y_i = \sum_{i=1}^{n} y_i$ (3.12)
$\bar{y} = \frac{1}{n}\sum_{i=1}^{N} a_i Y_i = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}$ (3.13)
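Theorem 3.3. In srswor$(N, n)$ the sample mean $\bar{y}$ is an unbiased estimator of the population mean $\bar{Y}$.
Proof: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{N} a_i Y_i$
$E(\bar{y}) = E\left(\frac{1}{n}\sum_{i=1}^{N} a_i Y_i\right) = \frac{1}{n}\sum_{i=1}^{N} Y_i E(a_i)$
$= \frac{1}{n}\sum_{i=1}^{N} Y_i \cdot \frac{n}{N} = \frac{1}{N}\sum_{i=1}^{N} Y_i = \bar{Y}$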
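Theorem 3.4. In srswor$(N, n)$, $Var(\bar{y}) = \frac{N-n}{Nn} S^2$, where $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$.
Proof:
From $Var(y) = E(y^2) - (E(y))^2$ it follows that $Var(\bar{y}) = E(\bar{y}^2) - (E(\bar{y}))^2$.
But $E(\bar{y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = E\left(\frac{1}{n}\sum_{i=1}^{N} a_i Y_i\right) = \frac{1}{n}\sum_{i=1}^{N} Y_i E(a_i) = \frac{1}{N}\sum_{i=1}^{N} Y_i$
Next, $E(\bar{y}^2) = E\left[\left(\frac{1}{n}\sum_{i=1}^{N} a_i Y_i\right)^2\right] = E\left[\frac{1}{n^2}\sum_{i=1}^{N} a_i Y_i^2 + \frac{1}{n^2}\sum\sum_{i \neq j} a_i a_j Y_i Y_j\right]$, since $a_i^2 = a_i$.
(Take the expectation term by term and use the fact that $E(a_i) = \frac{n}{N}$ and $E(a_i a_j) = \frac{n(n-1)}{N(N-1)}$)
$= \frac{1}{nN}\sum_{i=1}^{N} Y_i^2 + \frac{n-1}{Nn(N-1)}\sum\sum_{i \neq j} Y_i Y_j$
But $\sum\sum_{i \neq j} Y_i Y_j = \left(\sum_{i=1}^{N} Y_i\right)^2 - \sum_{i=1}^{N} Y_i^2$
Therefore, $E(\bar{y}^2) = \frac{1}{Nn}\sum_{i=1}^{N} Y_i^2 + \frac{n-1}{Nn(N-1)}\left[\left(\sum_{i=1}^{N} Y_i\right)^2 - \sum_{i=1}^{N} Y_i^2\right]$
$= \left[\frac{1}{nN} - \frac{n-1}{nN(N-1)}\right]\sum_{i=1}^{N} Y_i^2 + \frac{n-1}{Nn(N-1)}\left(\sum_{i=1}^{N} Y_i\right)^2$
Now $(x_1 + x_2)^2 = x_1^2 + 2x_1x_2 + x_2^2 \Rightarrow \left(\sum_{i=1}^{N} x_i\right)^2 = \sum_{i=1}^{N} x_i^2 + \sum\sum_{i \neq j} x_i x_j$
Therefore, $Var(\bar{y}) = \frac{N-n}{Nn(N-1)}\sum_{i=1}^{N} Y_i^2 + \frac{n-1}{Nn(N-1)}\left(\sum_{i=1}^{N} Y_i\right)^2 - \left(\frac{1}{N}\sum_{i=1}^{N} Y_i\right)^2$
$= \frac{N-n}{Nn(N-1)}\sum_{i=1}^{N} Y_i^2 + \left[\frac{n-1}{Nn(N-1)} - \frac{1}{N^2}\right]\left(\sum_{i=1}^{N} Y_i\right)^2$
$= \frac{N-n}{Nn(N-1)}\sum_{i=1}^{N} Y_i^2 - \frac{N-n}{N^2 n(N-1)}\left(\sum_{i=1}^{N} Y_i\right)^2$
$= \frac{N-n}{Nn(N-1)}\left[\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right]$
But $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 = \frac{1}{N-1}\left[\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right]$. Therefore;
$Var(\bar{y}) = \frac{N-n}{Nn} S^2$ (3.14)
on simplification.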
Theorem 3.5. In srswor$(N, n)$ an unbiased estimator of $Var(\bar{y})$ is $\frac{N-n}{Nn} s^2$, where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$.
Prove this.
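where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ and $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$
Proof:
$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right) = \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right]$
$= \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i^2 + \sum\sum_{i \neq j} y_i y_j\right)\right]$
(open brackets and simplify)
$= \frac{1}{n-1}\left[\left(1 - \frac{1}{n}\right)\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\sum\sum_{i \neq j} y_i y_j\right]$
and
$E\left(\sum\sum_{i \neq j} y_i y_j\right) = E\left(\sum\sum_{i \neq j} a_i a_j Y_i Y_j\right) = \frac{n}{N}\cdot\frac{n-1}{N-1}\sum\sum_{i \neq j} Y_i Y_j$.
Therefore;
$E(s^2) = \frac{1}{n}\cdot\frac{n}{N}\sum_{i=1}^{N} Y_i^2 - \frac{1}{n(n-1)}\cdot\frac{n}{N}\cdot\frac{n-1}{N-1}\sum\sum_{i \neq j} Y_i Y_j$
$= \frac{1}{N}\sum_{i=1}^{N} Y_i^2 - \frac{1}{N(N-1)}\sum\sum_{i \neq j} Y_i Y_j$
$= \frac{1}{N}\sum_{i=1}^{N} Y_i^2 + \frac{1}{N(N-1)}\sum_{i=1}^{N} Y_i^2 - \frac{1}{N(N-1)}\left(\sum_{i=1}^{N} Y_i\right)^2$
$= \left[\frac{1}{N} + \frac{1}{N(N-1)}\right]\sum_{i=1}^{N} Y_i^2 - \frac{1}{N(N-1)}\left(\sum_{i=1}^{N} Y_i\right)^2$
$= \frac{1}{N-1}\left[\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right]$
$= S^2$
Hence;
$E(s^2) = S^2$ (3.15)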
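Both srswor results, $Var(\bar{y}) = \frac{N-n}{Nn}S^2$ and $E(s^2) = S^2$, can be verified by brute force on a small population. The sketch below enumerates every sample; the population values and the sample size are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

Y = [1, 3, 5, 11]                     # hypothetical population, N = 4
N, n = len(Y), 2
samples = list(combinations(Y, n))    # all C(N, n) srswor samples

S2 = variance(Y)                                   # S^2, divisor N - 1
var_ybar = pvariance([mean(s) for s in samples])   # Var(y-bar) over all samples
mean_s2 = mean(variance(s) for s in samples)       # E(s^2) over all samples

print(var_ybar, (N - n) / (N * n) * S2)   # the two values should agree
print(mean_s2, S2)                        # the two values should agree
```

The same enumeration is what Exercise-style questions on "all simple random samples of size n" ask for by hand, so this can also serve as a check on manual work.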
Corollary 3.5. For srswor$(N, n)$, an unbiased estimator of the variance of the estimated total $\hat{Y} = N\bar{y}$ is $\widehat{Var}(\hat{Y}) = \frac{N(N-n)}{n} s^2$.
Proof: $\widehat{Var}(\hat{Y}) = \widehat{Var}(N\bar{y}) = N^2\,\widehat{Var}(\bar{y})$
$= N^2\,\frac{N-n}{Nn}\,s^2$
$= \frac{N(N-n)}{n}\,s^2$, which completes the proof.
Corollary 3.6. An estimator of the standard error of $\bar{y}$ is $\hat{\sigma}(\bar{y}) = \sqrt{\frac{N-n}{Nn}}\,s$. An estimator of the coefficient of variation is $c(\bar{y}) = \sqrt{\frac{N-n}{Nn}}\,\frac{s}{\bar{y}}$. $c(\bar{y})$ is a ratio estimator and a biased estimator of $C(\bar{y})$.
NOTE:
1. The sample mean in srswor$(N, n)$ is a better estimator of $\bar{Y}$ (in the smaller variance sense) than the sample mean in srswr$(N, n)$. Proof: $Var(\bar{y} \mid srswr) - Var(\bar{y} \mid srswor) = \frac{n-1}{Nn} S^2 > 0$ for $n > 1$.
$\bar{y}$ in srswor will be approximately the same as in srswr. If $N$ is very small, say $N \leq 10$, then whatever $n$, $f$ is not negligible and therefore there is considerable gain in using srswor over srswr.
The sample mean $\bar{y}$ and the variance $s^2$ are point estimates of the unknown population mean and variance respectively. An interval estimate of an unknown population parameter is a random interval constructed such that it has a given probability of including the parameter. Consider a population with unknown parameter $\theta$; if one can find an interval $(a, b)$ such that;
$P(a \leq \theta \leq b) = 0.95$ (3.16)
then we say that $(a, b)$ is a 95% confidence interval for $\theta$. It is important to realize that $\theta$ is fixed and the intervals themselves vary.
Some conditions exist under which the distribution of the sample mean in simple random sampling tends to the normal distribution. If the sample size is not too small and the distribution of the population from which the sample is drawn is not too different from the normal, then in srswor the sample mean $\bar{y}$ is approximately normal with mean $\bar{Y}$ and standard deviation $\sqrt{\frac{N-n}{Nn}}\,S$, i.e.
$\bar{y} \sim N\left(\bar{Y}, \frac{N-n}{Nn} S^2\right)$ (3.17)
$z = \frac{\bar{y} - \bar{Y}}{\sqrt{\frac{N-n}{Nn}}\,S} \sim N(0, 1)$
Hence $P\left(-z_{\alpha/2} \leq \frac{\bar{y} - \bar{Y}}{\sqrt{\frac{N-n}{Nn}}\,S} \leq z_{\alpha/2}\right) = 1 - \alpha \Rightarrow P\left(\bar{y} - z_{\alpha/2}\sqrt{\frac{N-n}{Nn}}\,S \leq \bar{Y} \leq \bar{y} + z_{\alpha/2}\sqrt{\frac{N-n}{Nn}}\,S\right) = 1 - \alpha$
where $z_{\alpha/2}$ is the $100\left(1 - \frac{\alpha}{2}\right)\%$ point of the standard normal distribution. Therefore;
$\left(\bar{y} - z_{\alpha/2}\sqrt{\frac{N-n}{Nn}}\,S,\ \bar{y} + z_{\alpha/2}\sqrt{\frac{N-n}{Nn}}\,S\right)$ (3.18)
is the $100(1 - \alpha)\%$ confidence interval for $\bar{Y}$. For $\alpha = 0.05, 0.025, 0.01$ values for $z_{\alpha/2}$
$n\bar{y}^2 = 15 \times (25.4)^2$. The 95% confidence interval for $Y$ at $\alpha = 0.05$ will be;
$Y = \hat{Y} \pm N z_{0.025}\sqrt{var(\bar{y})}$
$= 3302 \pm 130 \times 1.96 \times \sqrt{1.14}$
$= 3302 \pm 272.05$
$\Rightarrow 3029.95 \leq Y \leq 3574.05$
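The arithmetic of this interval for the population total can be reproduced in a few lines. The sketch below uses the figures quoted in the example ($N = 130$, $\bar{y} = 25.4$, $var(\bar{y}) = 1.14$); the variable names are illustrative.

```python
from math import sqrt

N = 130            # population size quoted in the example
y_bar = 25.4       # sample mean
var_y_bar = 1.14   # estimated variance of the sample mean
z = 1.96           # z_{0.025}, for a 95% interval

Y_hat = N * y_bar                       # estimated population total
half_width = N * z * sqrt(var_y_bar)    # N * z * sqrt(var(y-bar))
lower, upper = Y_hat - half_width, Y_hat + half_width

print(round(Y_hat, 2), round(half_width, 2), round(lower, 2), round(upper, 2))
```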
In many situations, the characteristic under study on which the observations are collected
are qualitative in nature. For example, the responses of customers in many marketing
surveys are based on replies like ‘yes’ or ‘no’ , ‘agree’ or ‘disagree’ etc. Sometimes the re-
spondents are asked to arrange several options in the order like first choice, second choice
etc. Sometimes the objective of the survey is to estimate the proportion or the percent-
age of brown eyed persons, unemployed persons, graduate persons or persons favoring a
proposal, etc. In such situations, the first question arises how to do the sampling and
secondly how to estimate the population parameters like population mean, population
variance, etc.
The same sampling procedures that are used for drawing a sample in case of quanti-
tative characteristics can also be used for drawing a sample for qualitative characteristic.
So, the sampling procedures remain same irrespective of the nature of characteristic un-
der study - either qualitative or quantitative. For example, the SRSWOR and SRSWR
procedures for drawing the samples remain the same for qualitative and quantitative char-
acteristics. Similarly, other sampling schemes like stratified sampling, two stage sampling
etc. also remain same.
$P = \frac{A}{N}$ (3.19)
$Q = \frac{N-A}{N} = 1 - P$ (3.20)
An indicator variable $Y$ can be associated with the characteristic under study; then, for $i = 1, 2, \ldots, N$, $Y_i = 1$ if the $i$th unit belongs to $C$ and $Y_i = 0$ if the $i$th unit belongs to $C^*$.
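$Y_{TOTAL} = \sum_{i=1}^{N} Y_i = A$ (3.21)
$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{A}{N} = P$ (3.22)
$p = \frac{a}{n}$ (3.23)
which can be written as $p = \frac{a}{n} = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y}$.
Since $\sum_{i=1}^{N} Y_i^2 = A = NP$, we can write $S^2$ and $s^2$ in terms of $P$ and $Q$ as follows;
$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right)$
$= \frac{1}{N-1}\left(NP - NP^2\right) = \frac{N}{N-1}\,PQ$
Similarly, $\sum_{i=1}^{n} y_i^2 = a = np$ and
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)$
$= \frac{1}{n-1}\left(np - np^2\right)$
$= \frac{n}{n-1}\,pq$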
Note that the quantities ȳ , Ȳ , s2 , and S 2 have been expressed as functions of sample
and population proportions. Since the sample has been drawn by simple random sampling
and sample proportion is same as the sample mean, so the properties of sample proportion
in SRSWOR and SRSWR can be derived using the properties of sample mean directly.
SRSWOR
Since the sample mean $\bar{y}$ is an unbiased estimator of the population mean $\bar{Y}$, i.e. $E(\bar{y}) = \bar{Y}$ in the case of SRSWOR, we have $E(p) = E(\bar{y}) = \bar{Y} = P$, and $p$ is an unbiased estimator of $P$. Using the expression of $Var(\bar{y})$, the variance of $p$ can be derived as $Var(p) = Var(\bar{y}) = \frac{N-n}{Nn} S^2 = \frac{N-n}{N-1}\cdot\frac{PQ}{n}$.
Similarly, using the estimate of $Var(\bar{y})$, the estimate of the variance of $p$ is
$\widehat{Var}(p) = \frac{N-n}{Nn}\,s^2 = \frac{N-n}{N(n-1)}\,pq$ (3.24)
SRSWR
Since the sample mean $\bar{y}$ is an unbiased estimator of the population mean $\bar{Y}$ in the case of SRSWR, the sample proportion satisfies
$E(p) = E(\bar{y}) = \bar{Y} = P$, i.e. $p$ is an unbiased estimator of $P$.
Using the expression of the variance of $\bar{y}$ and its estimate in the case of SRSWR, the variance of $p$ and its estimate can be derived as follows:
$Var(p) = Var(\bar{y}) = \frac{N-1}{Nn} S^2$
$= \frac{N-1}{Nn}\cdot\frac{N}{N-1}\,PQ$
$= \frac{PQ}{n}$
$\Rightarrow \widehat{Var}(p) = \frac{n}{n-1}\cdot\frac{pq}{n}$
$= \frac{pq}{n-1}$ (3.25)
It is easy to see that an estimate of the population total $A$ (the total count) is $\hat{A} = Np = \frac{Na}{n}$; its variance is $Var(\hat{A}) = N^2\,Var(p)$ and the estimate of the variance is $\widehat{Var}(\hat{A}) = N^2\,\widehat{Var}(p)$.
If $N$ and $n$ are large, then $\frac{p - P}{\sqrt{Var(p)}}$ approximately follows $N(0, 1)$. With this approximation we can write $P\left(-z_{\alpha/2} \leq \frac{p - P}{\sqrt{Var(p)}} \leq z_{\alpha/2}\right) = 1 - \alpha$, and the $100(1 - \alpha)\%$ confidence interval of $P$ is
$\left(p - z_{\alpha/2}\sqrt{Var(p)},\ p + z_{\alpha/2}\sqrt{Var(p)}\right)$ (3.26)
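In practice $Var(p)$ is unknown and its estimate is substituted into the interval. The following sketch computes $p$, the estimated variance under SRSWOR and the resulting interval. The function name `proportion_ci` and the sample figures are hypothetical, and the use of the estimated variance in place of $Var(p)$ is an assumption of this illustration.

```python
from math import sqrt

def proportion_ci(a, n, N, z=1.96):
    """Sample proportion with a normal-approximation interval under SRSWOR."""
    p = a / n
    q = 1 - p
    # Estimated variance of p: (N - n) / (N (n - 1)) * p * q
    var_hat = (N - n) / (N * (n - 1)) * p * q
    half = z * sqrt(var_hat)
    return p, var_hat, (p - half, p + half)

# Hypothetical data: 40 of n = 100 sampled units have the attribute, N = 2000.
p, var_hat, (low, high) = proportion_ci(a=40, n=100, N=2000)
print(p, var_hat, low, high)
```

Setting `z` to another normal quantile gives intervals at other confidence levels.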
It may be noted that in this case a discrete random variable is being approximated by a continuous random variable, so a continuity correction of $\frac{1}{2n}$ can be introduced in the confidence limits, and the limits become;
$\left(p - z_{\alpha/2}\sqrt{Var(p)} - \frac{1}{2n},\ p + z_{\alpha/2}\sqrt{Var(p)} + \frac{1}{2n}\right)$ (3.27)
In a field survey the statisticians would like to have a sample size that will give a desired
level of precision of estimator. We note that the required precision is the difference
between the estimator and the true value. This difference is denoted by d.
Suppose that it is desired to find a sample size n such that the estimated value i.e.
sample mean ȳ differs from the true value (Population mean, Ȳ ) by a quantity not ex-
ceeding d with a very high probability, say greater than 1 − α. Hence the problem is to
find n such that;
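$P\left(|\bar{y} - \bar{Y}| \leq d\right) \geq 1 - \alpha$ (3.28)
$P\left(|\bar{y} - \bar{Y}| \leq t\sqrt{\frac{N-n}{Nn}}\,S\right) = 1 - \alpha$ (3.29)
$P\left(|p - P| \leq d\right) \geq 1 - \alpha$ (3.30)
Hence;
$P\left(|p - P| \leq t\sqrt{\frac{N-n}{n(N-1)}\,PQ}\right) = 1 - \alpha$ (3.31)
Equating the bounds in 3.30 and 3.31 we get $t\sqrt{\frac{N-n}{n(N-1)}\,PQ} = d$. This gives;
$n = \frac{\frac{t^2 PQ}{d^2}}{1 + \frac{1}{N}\left(\frac{t^2 PQ}{d^2} - 1\right)}$ (3.32)
With $n_0 = \frac{t^2 PQ}{d^2}$, this is
$n = \frac{n_0}{1 + \frac{n_0 - 1}{N}} \approx \frac{n_0}{1 + \frac{n_0}{N}}$ (3.33)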
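The sample size formula for a proportion can be turned into a small helper. The sketch below is illustrative only: the function name, the planning value $P = 0.5$, the margin $d = 0.05$ and the population size $N = 4000$ are assumptions. Rounding up to the next integer is a common convention so that the required precision is not undershot.

```python
from math import ceil

def sample_size_proportion(P, d, N, t=1.96):
    """Sample size for estimating a proportion within margin d under SRSWOR."""
    n0 = t ** 2 * P * (1 - P) / d ** 2    # first approximation, ignoring the fpc
    n = n0 / (1 + (n0 - 1) / N)           # exact form of the finite-population adjustment
    return n0, n, ceil(n)

n0, n, n_rounded = sample_size_proportion(P=0.5, d=0.05, N=4000)
print(n0, n, n_rounded)
```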
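Example:
Solution:
... $P\left(|\bar{y} - \bar{Y}| \leq 0.1\bar{Y}\right) = 0.95$. Hence, $1.96\,S\sqrt{\frac{N-n}{Nn}} = 0.1\bar{Y}$ or $(1.96)^2\left(\frac{1}{n} - \frac{1}{N}\right) = 0.01\left(\frac{\bar{Y}}{S}\right)^2 = \frac{0.01}{0.36}$
Solving the above equation, we get $n = 136$ (rounded off to the next integer).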
3.4 R Computing Notes.
3.5 Exercises
3. For the same population in 1 above, calculate $s^2$ for all simple random samples of size 3, and verify that $E(s^2) = S^2$.
4. If random samples of size 2 are drawn with replacement from this population, show by finding all possible samples that $Var(\bar{y})$ satisfies the equation $Var(\bar{y}) = \frac{\sigma^2}{n} = \frac{S^2(N-1)}{nN}$. Give a general proof of this result.
5. A simple random sample of 30 households was drawn from a city area containing 14,848 households. The numbers of persons per household in the sample were as follows: 5, 6, 3, 3, 2, 3, 3, 3, 4, 4, 3, 2, 7, 4, 3, 5, 4, 4, 3, 3, 4, 3, 3, 1, 2, 4, 3, 4, 2, 4. Estimate the total number of people in the area and compute the probability that this estimate is within ±10 per cent of the true value.
3.6 Solutions
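1. $\bar{Y} = 19$, $S^2 = 85.6 \Rightarrow S = \sqrt{85.6}$, $N = 430$, $\alpha = \frac{1}{20} = 0.05$. 10% of $\bar{Y} \Rightarrow d = 0.1\bar{Y} = 0.1(19) = 1.9$. $n_0 = \frac{t^2 S^2}{d^2}$ but $t = z_{\alpha/2} = z_{0.05/2} = z_{0.025} = 1.96$. $\Rightarrow n_0 = \frac{(1.96)^2 (85.6)}{1.9^2} = 91.09167$. $n = n_0\left(1 + \frac{n_0}{N}\right)^{-1} = 91.09\left(1 + \frac{91.09}{430}\right)^{-1} = 75.166 \simeq 75$.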
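The calculation in this solution can be checked numerically. The sketch below uses the figures from the solution ($S^2 = 85.6$, $\bar{Y} = 19$, $N = 430$, $t = 1.96$); the function name `sample_size_mean` is hypothetical.

```python
def sample_size_mean(S2, d, N, t=1.96):
    """First approximation n0 and fpc-adjusted n for estimating a mean within d."""
    n0 = t ** 2 * S2 / d ** 2
    n = n0 / (1 + n0 / N)
    return n0, n

n0, n = sample_size_mean(S2=85.6, d=0.1 * 19, N=430)
print(round(n0, 3), round(n, 3))
```

The unrounded value of `n` differs from the hand calculation only in the third decimal place, because the hand calculation rounds $n_0$ to 91.09 before applying the adjustment.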
4 STRATIFIED RANDOM SAMPLING
The objective of any sampling method is usually to estimate the unknown population parameters with the highest precision, i.e. the variance of the estimators should be minimized. If the population is heterogeneous, as it will be in most situations, then a sample taken via SRS might yield high levels of variability. As a result, in a survey where precision is a main factor to be considered, a strategy that addresses heterogeneity must be found. One way of achieving higher precision is to divide the population, which is originally heterogeneous, into subpopulations which are to a large extent homogeneous with respect to the survey characteristics.
In stratified random sampling, the population of $N$ units is first divided into subpopulations of $N_1, N_2, \ldots, N_L$ units, called strata. The strata are mutually disjoint so that;
$N_1 + N_2 + \cdots + N_L = \sum_{i=1}^{L} N_i = N$ (4.1)
4.1 Introduction and Description
4.1.1 Notations
The following is an extension of the notation used previously, where the suffix $i$ denotes the stratum and $j$ denotes the $j$th unit within the stratum.
Let $Y_{ij}$ be the value of the characteristic $y$ on the $j$th unit in the $i$th stratum in the population and $y_{ij}$ the corresponding value in the sample; $j = 1, 2, \ldots, N_i$ ($n_i$ in the sample), $i = 1, 2, \ldots, L$.
Define:
$N_i$ = total number of units in the $i$th stratum
$n_i$ = the number of units in the sample from the $i$th stratum
Note: $j = 1, 2, \ldots, N_i$ → units in a stratum; $i = 1, 2, \ldots, L$ → strata
$n = \sum_{i=1}^{L} n_i$ = total sample size from all the strata
$W_i = \frac{N_i}{N}$ = population proportion for the $i$th stratum, or stratum weight
$f_i = \frac{n_i}{N_i}$ = sampling fraction for the $i$th stratum
Note: The divisor of the variance is $(N_i - 1)$
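$\bar{Y} = \frac{1}{N}\sum_{i=1}^{L}\sum_{j=1}^{N_i} Y_{ij} = \frac{1}{N}\sum_{i=1}^{L} N_i \bar{Y}_i$ (4.2)
where $N = N_1 + N_2 + \cdots + N_L$.
For the population mean per unit, the estimate used in stratified sampling is $\bar{y}_{st}$ (st for stratified), where $\bar{y}_{st} = \frac{1}{N}\sum_{i=1}^{L} N_i \bar{y}_i = \sum_{i=1}^{L} W_i \bar{y}_i$, with $W_i = \frac{N_i}{N}$.
Note: The estimate $\bar{y}_{st}$ is not in general the same as the sample mean. The sample mean $\bar{y}$ can be written as $\bar{y} = \frac{1}{n}\sum_{i=1}^{L} n_i \bar{y}_i$. The difference is that in $\bar{y}_{st}$ the estimates from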
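Theorem 4.1. In stratified random sampling $\bar{y}_{st} = \sum_{i=1}^{L}\frac{N_i \bar{y}_i}{N} = \sum_{i=1}^{L} W_i \bar{y}_i$ is an unbiased estimator of the population mean $\bar{Y}$.
Proof: $E(\bar{y}_{st}) = \sum_{i=1}^{L}\frac{N_i E(\bar{y}_i)}{N} = \sum_{i=1}^{L}\frac{N_i \bar{Y}_i}{N} = \bar{Y}$
Theorem 4.2. In stratified random sampling using srswor in each stratum, $Var(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\,Var(\bar{y}_i) = \frac{1}{N^2}\sum_{i=1}^{L}\frac{N_i(N_i - n_i)}{n_i} S_i^2$
Proof: $Var(\bar{y}_{st}) = Var\left(\sum_{i=1}^{L}\frac{N_i \bar{y}_i}{N}\right) = \sum_{i=1}^{L}\frac{N_i^2}{N^2}\,Var(\bar{y}_i)$
$= \sum_{i=1}^{L}\frac{N_i^2}{N^2}\cdot\frac{N_i - n_i}{N_i}\cdot\frac{S_i^2}{n_i}$. The covariance terms vanish because sampling is independent from stratum to stratum.
$= \frac{1}{N^2}\sum_{i=1}^{L}\frac{N_i(N_i - n_i)}{n_i} S_i^2$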
Corollary 4.1. If the sampling fraction $\frac{n_i}{N_i}$ is negligibly small in each stratum, it reduces to
$Var(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L}\frac{N_i^2 S_i^2}{n_i} = \sum_{i=1}^{L}\frac{W_i^2 S_i^2}{n_i}$
Corollary 4.2. If $\hat{Y}_{st} = N\bar{y}_{st}$ is the estimate of the population total $Y$ then $Var(\hat{Y}_{st}) = \sum_{i=1}^{L} N_i(N_i - n_i)\frac{S_i^2}{n_i}$
Proof: $\hat{Y}_{st} = N\bar{y}_{st} \Rightarrow Var(\hat{Y}_{st}) = Var(N\bar{y}_{st}) = N^2\,Var(\bar{y}_{st})$
$= N^2\cdot\frac{1}{N^2}\sum_{i=1}^{L} N_i(N_i - n_i)\frac{S_i^2}{n_i}$
$= \sum_{i=1}^{L} N_i(N_i - n_i)\frac{S_i^2}{n_i}$ (4.3)
In simple random sampling, the estimate of the variance of each stratum is given by $s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$ for the $i$th stratum. We have found that $Var(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i(N_i - n_i)\frac{S_i^2}{n_i}$.
In stratified random sampling, the unbiased estimate of $Var(\bar{y}_{st})$ is given by $s_{st}^2 = \frac{1}{N^2}\sum_{i=1}^{L} N_i(N_i - n_i)\frac{s_i^2}{n_i}$. Note that if $\bar{y}_{st}$ is normally distributed about $\bar{Y}$ then the confidence limits for $\bar{Y}$ are
$\bar{y}_{st} \pm z_{\alpha/2}\sqrt{Var(\bar{y}_{st})}$ (4.4)
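The stratified estimate, its estimated variance and the resulting confidence limits can be computed from the stratum sizes and the stratum samples. The sketch below is illustrative: the function name, the stratum sizes and the observations are hypothetical, and $s_{st}^2$ is substituted for $Var(\bar{y}_{st})$ in the limits.

```python
from math import sqrt
from statistics import mean, variance

def stratified_estimate(strata_sizes, strata_samples, z=1.96):
    """Return y_st, its estimated variance s_st^2 and approximate limits."""
    N = sum(strata_sizes)
    pairs = list(zip(strata_sizes, strata_samples))
    # y_st = (1 / N) * sum of N_i * (sample mean of stratum i)
    y_st = sum(Ni * mean(s) for Ni, s in pairs) / N
    # s_st^2 = (1 / N^2) * sum of N_i (N_i - n_i) s_i^2 / n_i
    var_st = sum(Ni * (Ni - len(s)) * variance(s) / len(s) for Ni, s in pairs) / N ** 2
    half = z * sqrt(var_st)
    return y_st, var_st, (y_st - half, y_st + half)

# Hypothetical stratum sizes and sample observations.
sizes = [100, 300]
samples = [[10, 12, 14], [20, 22, 24, 26]]
y_st, var_st, ci = stratified_estimate(sizes, samples)
print(y_st, var_st, ci)
```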
4.2 Allocation problem and choice of sample sizes in different strata
Question: How should the sample sizes $n_1, n_2, \ldots, n_L$ be chosen so that the available resources are used in an effective way?
There are two aspects of choosing the sample sizes:
(i) Minimize the cost of survey for a specified precision.
(ii) Maximize the precision for a given cost.
Note: The sample size cannot be determined by minimizing both the cost and variabil-
ity simultaneously. The cost function is directly proportional to the sample size whereas
variability is inversely proportional to the sample size. Based on different ideas, some
allocation procedures are as follows:
Choose the sample size $n_i$ to be the same for all the strata, i.e. draw samples of equal size from each stratum. Let $n$ be the total sample size and $L$ the number of strata; then $n_i = \frac{n}{L}$ for $i = 1, 2, \ldots, L$.
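Under proportional allocation, $n_i = \frac{nN_i}{N}$, so that $\bar{y}_{st} = \frac{1}{n}\sum_{i=1}^{L} n_i \bar{y}_i$, the overall mean. In this case, $\bar{y}_{st}$ coincides with $\bar{y}$ (the overall sample mean). Recall that
$Var(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i(N_i - n_i)\frac{S_i^2}{n_i}$. Now using proportional allocation the variance is;
$Var(\bar{y}_{st})_{prop} = \frac{1}{N^2}\sum_{i=1}^{L} N_i\left(N_i - \frac{nN_i}{N}\right)\frac{S_i^2}{nN_i/N}$
$= \frac{1}{N^2}\sum_{i=1}^{L} N_i\cdot\frac{NN_i - nN_i}{N}\cdot\frac{N S_i^2}{nN_i}$
$= \frac{1}{N^2}\sum_{i=1}^{L}(NN_i - nN_i)\frac{S_i^2}{n}$
$= \frac{N-n}{N^2 n}\sum_{i=1}^{L} N_i S_i^2$ (4.5)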
In optimum allocation, $n_i = \frac{nN_i S_i}{\sum_{i=1}^{L} N_i S_i}$. This allocation arises when $Var(\bar{y}_{st})$ is minimized subject to the constraint $\sum_{i=1}^{L} n_i = n$ (prespecified). There are some limitations of the optimum allocation. Knowledge of $S_i$, $i = 1, 2, \ldots, L$ is needed to determine $n_i$. If there is more than one characteristic, they may lead to conflicting allocations.
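The equal, proportional and optimum allocation rules can be compared side by side. In the sketch below the stratum sizes, standard deviations and total sample size are hypothetical, and the function names are illustrative. In practice the resulting $n_i$ are rounded to integers.

```python
def equal_allocation(n, sizes):
    """n_i = n / L for every stratum."""
    return [n / len(sizes) for _ in sizes]

def proportional_allocation(n, sizes):
    """n_i = n * N_i / N."""
    N = sum(sizes)
    return [n * Ni / N for Ni in sizes]

def optimum_allocation(n, sizes, sds):
    """n_i = n * N_i * S_i / sum(N_i * S_i)."""
    total = sum(Ni * Si for Ni, Si in zip(sizes, sds))
    return [n * Ni * Si / total for Ni, Si in zip(sizes, sds)]

sizes = [200, 300, 500]   # hypothetical stratum sizes N_i
sds = [6, 8, 12]          # hypothetical stratum standard deviations S_i
n = 100

equal = equal_allocation(n, sizes)
prop = proportional_allocation(n, sizes)
opt = optimum_allocation(n, sizes, sds)
print(equal, prop, opt)
```

Optimum allocation shifts sample towards the strata that are larger and more variable, which is why the last stratum receives more than its proportional share here.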
Choice of sample size based on cost of survey and variability
The cost of survey depends upon the nature of survey. A simple choice of the cost
function is
$C = C_0 + \sum_{i=1}^{L} C_i n_i$ (4.6)
where
$C$: total cost
$C_0$: overhead cost, e.g., setting up of office, training people etc.
$C_i$: cost per unit for the $i$th stratum.
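To find $n_i$ under this cost function, consider the Lagrangian function with Lagrangian multiplier $\lambda$:
$$
\begin{aligned}
\phi &= Var(\bar{y}_{st}) + \lambda^2 (C - C_0) \\
&= \sum_{i=1}^{L} w_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right)S_i^2 + \lambda^2\sum_{i=1}^{L} C_i n_i \\
&= \sum_{i=1}^{L} \frac{w_i^2 S_i^2}{n_i} + \lambda^2\sum_{i=1}^{L} C_i n_i - \sum_{i=1}^{L}\frac{w_i^2 S_i^2}{N_i} \\
&= \sum_{i=1}^{L}\left[\frac{w_i S_i}{\sqrt{n_i}} - \lambda\sqrt{C_i n_i}\right]^2 + \text{terms independent of } n_i .
\end{aligned}
$$
Thus $\phi$ is minimum when
$$\frac{w_i S_i}{\sqrt{n_i}} = \lambda\sqrt{C_i n_i} \ \text{ for all } i, \quad\text{or}\quad n_i = \frac{1}{\lambda}\,\frac{w_i S_i}{\sqrt{C_i}} .$$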
How to determine $\lambda$?
There are two ways to determine $\lambda$.
(i) Minimize variability for fixed cost.
(ii) Minimize cost for given variability. We consider both the cases.
(i) Minimize variability for fixed cost
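Let $C = C_0^*$ be the pre-specified cost which is fixed. So $\sum_{i=1}^{L} C_i n_i = C_0^*$, or $\sum_{i=1}^{L} C_i \dfrac{w_i S_i}{\lambda\sqrt{C_i}} = C_0^*$, or
$$\lambda = \frac{\sum_{i=1}^{L}\sqrt{C_i}\, w_i S_i}{C_0^*} .$$
Substituting $\lambda$ in the expression $n_i = \frac{1}{\lambda}\frac{w_i S_i}{\sqrt{C_i}}$, the optimum $n_i$ is obtained as
$$n_i^* = \frac{w_i S_i}{\sqrt{C_i}}\cdot\frac{C_0^*}{\sum_{i=1}^{L}\sqrt{C_i}\, w_i S_i} .$$
The required sample size to estimate $\bar{Y}$ such that the variance is minimum for the given cost $C = C_0^*$ is $n = \sum_{i=1}^{L} n_i^*$.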
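The allocation for fixed cost can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `cost_optimal_allocation` and all input values are hypothetical, and `budget` stands for the fixed amount $C_0^*$ available for $\sum C_i n_i$.

```python
from math import sqrt

def cost_optimal_allocation(weights, sds, unit_costs, budget):
    """Return n_i* = (w_i S_i / sqrt(C_i)) * budget / sum(sqrt(C_i) w_i S_i)."""
    denom = sum(sqrt(c) * w * s for w, s, c in zip(weights, sds, unit_costs))
    return [(w * s / sqrt(c)) * budget / denom
            for w, s, c in zip(weights, sds, unit_costs)]

# Hypothetical inputs (assumed for illustration only): three strata.
weights = [0.5, 0.3, 0.2]       # w_i = N_i / N
sds = [6.0, 8.0, 12.0]          # S_i
unit_costs = [4.0, 9.0, 16.0]   # C_i
budget = 1000.0                 # C_0^*, the fixed amount for sum(C_i * n_i)

allocation = cost_optimal_allocation(weights, sds, unit_costs, budget)
# The allocation spends exactly the budget (up to rounding error).
spent = sum(c * n for c, n in zip(unit_costs, allocation))
```

In practice the $n_i^*$ would be rounded to integers, which this sketch does not do.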
4.2.4 Sample size under proportional allocation for fixed cost and for fixed
variance.
$n_i = \frac{n}{N}N_i = n w_i$. So $C_0 = n\sum_{i=1}^{L} w_i C_i$, or $n = \dfrac{C_0}{\sum_{i=1}^{L} w_i C_i}$; therefore $n_i = \dfrac{C_0\, w_i}{\sum_{i=1}^{L} w_i C_i}$. The required
Now we derive the variance of ȳst under proportional and optimum allocations.
(i) Proportional allocation
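Under proportional allocation, $n_i = \frac{n}{N}N_i$ and $Var(\bar{y}_{st}) = \sum_{i=1}^{L}\frac{N_i - n_i}{N_i n_i}\, w_i^2 S_i^2$, so
$$
\begin{aligned}
Var_{prop}(\bar{y}_{st}) &= \sum_{i=1}^{L}\frac{N_i - \frac{n}{N}N_i}{N_i\,\frac{n}{N}N_i}\left(\frac{N_i}{N}\right)^2 S_i^2 \\
&= \frac{N-n}{Nn}\sum_{i=1}^{L}\frac{N_i S_i^2}{N} \\
&= \frac{N-n}{Nn}\sum_{i=1}^{L} w_i S_i^2 \qquad (4.7)
\end{aligned}
$$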
Example 4.1. A population of size 800 is divided into three strata. Their sizes and
standard deviations are as given below.
Strata 1 2 3
Standard deviation Si 6 8 12
A standard sample of 120 is to be drawn from the population. Determine the sample
size based on;
1. Proportional allocation
2. Optimum allocation
3. Obtain the variance of the estimates of the population mean i.e. V arprop (ȳst ) and
V aropt (ȳst )
Solutions:
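In order to compare $V_{srs}(\bar{y})$ and $V_{prop}(\bar{y}_{st})$, first we attempt to write $S^2$ as a function of $S_i^2$. Consider $S^2 = \frac{1}{N-1}\sum_{i=1}^{L}\sum_{j=1}^{N_i}(Y_{ij} - \bar{Y})^2$, or
$$
\begin{aligned}
(N-1)S^2 &= \sum_{i=1}^{L}\sum_{j=1}^{N_i}(Y_{ij} - \bar{Y})^2 \\
&= \sum_{i=1}^{L}\sum_{j=1}^{N_i}\left[(Y_{ij} - \bar{Y}_i) + (\bar{Y}_i - \bar{Y})\right]^2 \\
&= \sum_{i=1}^{L}\sum_{j=1}^{N_i}(Y_{ij} - \bar{Y}_i)^2 + \sum_{i=1}^{L}\sum_{j=1}^{N_i}(\bar{Y}_i - \bar{Y})^2 \\
&= \sum_{i=1}^{L}(N_i - 1)S_i^2 + \sum_{i=1}^{L}N_i(\bar{Y}_i - \bar{Y})^2, \\
\frac{N-1}{N}S^2 &= \sum_{i=1}^{L}\frac{N_i - 1}{N}S_i^2 + \sum_{i=1}^{L}\frac{N_i}{N}(\bar{Y}_i - \bar{Y})^2 \quad\text{(dividing both sides by } N\text{)}.
\end{aligned}
$$
For large $N_i$, so that $(N_i - 1)/N \approx w_i$ and $(N-1)/N \approx 1$, this gives $S^2 \approx \sum_{i=1}^{L} w_i S_i^2 + \sum_{i=1}^{L} w_i(\bar{Y}_i - \bar{Y})^2$. Multiplying by $\frac{N-n}{Nn}$,
$$Var_{srs}(\bar{y}) = V_{prop}(\bar{y}_{st}) + \frac{N-n}{Nn}\sum_{i=1}^{L} w_i(\bar{Y}_i - \bar{Y})^2 .$$
Since $\sum_{i=1}^{L} w_i(\bar{Y}_i - \bar{Y})^2 \ge 0$, it follows that $Var_{prop}(\bar{y}_{st}) \le Var_{srs}(\bar{y})$. A larger gain is achieved when the $\bar{Y}_i$ differ more from $\bar{Y}$.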
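(b) Optimum allocation
$Var_{opt}(\bar{y}_{st}) = \frac{1}{n}\left(\sum_{i=1}^{L} w_i S_i\right)^2 - \frac{1}{N}\sum_{i=1}^{L} w_i S_i^2$. Consider
$$
\begin{aligned}
Var_{prop}(\bar{y}_{st}) - Var_{opt}(\bar{y}_{st}) &= \frac{N-n}{Nn}\sum_{i=1}^{L} w_i S_i^2 - \left[\frac{1}{n}\left(\sum_{i=1}^{L} w_i S_i\right)^2 - \frac{1}{N}\sum_{i=1}^{L} w_i S_i^2\right] \\
&= \frac{1}{n}\left[\sum_{i=1}^{L} w_i S_i^2 - \left(\sum_{i=1}^{L} w_i S_i\right)^2\right] \\
&= \frac{1}{n}\sum_{i=1}^{L} w_i S_i^2 - \frac{1}{n}\bar{S}^2 \\
&= \frac{1}{n}\sum_{i=1}^{L} w_i (S_i - \bar{S})^2, \quad\text{where } \bar{S} = \sum_{i=1}^{L} w_i S_i .
\end{aligned}
$$
Hence $Var_{prop}(\bar{y}_{st}) - Var_{opt}(\bar{y}_{st}) \ge 0$, or $Var_{opt}(\bar{y}_{st}) \le Var_{prop}(\bar{y}_{st})$. A larger gain in efficiency is achieved when the $S_i$ differ more from $\bar{S}$. Combining the results in (a) and (b), we have
$$Var_{opt}(\bar{y}_{st}) \le Var_{prop}(\bar{y}_{st}) \le Var_{srs}(\bar{y}) .$$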
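The comparison of proportional and optimum allocation can be illustrated numerically. This is a sketch under stated assumptions: the standard deviations (6, 8, 12), $N = 800$ and $n = 120$ are taken from Example 4.1, but the stratum sizes used here are hypothetical because they are not available in the text, and the function names are illustrative.

```python
def var_prop(weights, sds, n, N):
    """Variance of the stratified mean under proportional allocation."""
    return (N - n) / (N * n) * sum(w * s ** 2 for w, s in zip(weights, sds))

def var_opt(weights, sds, n, N):
    """Variance of the stratified mean under optimum (Neyman) allocation."""
    s_bar = sum(w * s for w, s in zip(weights, sds))
    return s_bar ** 2 / n - sum(w * s ** 2 for w, s in zip(weights, sds)) / N

sizes = [200, 300, 300]          # hypothetical N_i (assumed, sums to 800)
sds = [6.0, 8.0, 12.0]           # S_i from Example 4.1
N, n = sum(sizes), 120
weights = [Ni / N for Ni in sizes]

s_bar = sum(w * s for w, s in zip(weights, sds))
n_prop = [n * w for w in weights]                            # n_i = n w_i
n_neyman = [n * w * s / s_bar for w, s in zip(weights, sds)]  # n_i = n w_i S_i / sum(w_i S_i)

vp = var_prop(weights, sds, n, N)
vo = var_opt(weights, sds, n, N)
# The gain equals (1/n) * sum w_i (S_i - S_bar)^2, as derived in (b).
gain = sum(w * (s - s_bar) ** 2 for w, s in zip(weights, sds)) / n
```

With these assumed sizes the optimum allocation moves sample from the least variable stratum to the most variable one, and `vp - vo` matches `gain`.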
Under SRSWOR, an unbiased estimate of $S_i^2$ for the $i$th stratum ($i = 1, 2, ..., L$) is $s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$.
In stratified sampling, $Var(\bar{y}_{st}) = \sum_{i=1}^{L} w_i^2\,\frac{N_i - n_i}{N_i n_i}\,S_i^2$. So an unbiased estimate of $Var(\bar{y}_{st})$ is
$$\widehat{Var}(\bar{y}_{st}) = \sum_{i=1}^{L} w_i^2\,\frac{N_i - n_i}{N_i n_i}\,s_i^2 = \sum_{i=1}^{L}\frac{w_i^2 s_i^2}{n_i} - \sum_{i=1}^{L}\frac{w_i^2 s_i^2}{N_i} = \sum_{i=1}^{L}\frac{w_i^2 s_i^2}{n_i} - \frac{1}{N}\sum_{i=1}^{L} w_i s_i^2 .$$
The second term in this expression represents the reduction due to the finite population correction.
The confidence limits of $\bar{Y}$ can be obtained as
$$\bar{y}_{st} \pm t\sqrt{\widehat{Var}(\bar{y}_{st})} \qquad (4.9)$$
distributed.
4.3 Exercises
College A 200 30 10
College B 100 60 40
Use the information to confirm that Neyman’s allocation scheme is a more efficient
scheme when compared to proportional allocation.
3. A stratified population has 5 strata. The stratum sizes Ni and means Ȳi and Si2 of
some variable Y are as follows;
Stratum Ni Ȳi Si2
2 98 6.9 2.03
3 74 11.2 1.13
4 41 9.1 1.96
5 45 9.6 1.74
2. For a stratified simple random sample of size 80, determine the appropriate stratum
sample sizes under Proportional allocation and Neyman allocation.
4.4 Solutions
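1. $N_1 = 40{,}000$, $N_2 = 20{,}000$, $N_3 = 10{,}000$, $c_0 = 200$ (fixed cost), $c = 20{,}000$, $c_1 = 2.25$, $c_2 = 4.0$, $c_3 = 1.0$, $S_1 = 2S_3$, $S_2 = S_3$. For optimum allocation, we need to find the size of the sample in each stratum, i.e.
$$n_i = \frac{(c - c_0)\,N_i S_i/\sqrt{c_i}}{\sum_{i=1}^{3} N_i S_i\sqrt{c_i}} .$$
Now,
$$\sum_{i=1}^{3} N_i S_i\sqrt{c_i} = 40{,}000\,(2S_3)\sqrt{2.25} + 20{,}000\,(S_3)\sqrt{4} + 10{,}000\,(S_3)\sqrt{1} = 170{,}000\,S_3,$$
$$n_1 = \frac{(20{,}000 - 200)(40{,}000)(2S_3)/1.5}{170{,}000\,S_3} \simeq 6211.8, \quad n_2 = \frac{(20{,}000 - 200)(20{,}000)\,S_3/2}{170{,}000\,S_3} \simeq 1164.7, \quad n_3 = \frac{(20{,}000 - 200)(10{,}000)\,S_3/1}{170{,}000\,S_3} \simeq 1164.7 .$$
Under proportional allocation, $n_i = nN_i/N$, so
$$c = c_0 + \sum_{i=1}^{L} c_i\frac{nN_i}{N} = c_0 + \frac{n}{N}\sum_{i=1}^{L} c_i N_i \;\Rightarrow\; c - c_0 = \frac{n}{N}\sum_{i=1}^{L} c_i N_i \;\Rightarrow\; n = \frac{N(c - c_0)}{\sum_{i=1}^{L} c_i N_i} .$$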
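The arithmetic in Solution 1 can be reproduced with a few lines of Python. This is a sketch for checking the numbers only; the variable names are illustrative, and $S_3$ is set to 1 because it cancels in the allocation formula.

```python
from math import sqrt

# Values quoted in Solution 1; standard deviations are relative to S_3.
sizes = [40_000, 20_000, 10_000]      # N_i
rel_sds = [2.0, 1.0, 1.0]             # S_i / S_3
unit_costs = [2.25, 4.0, 1.0]         # c_i
total_cost, overhead = 20_000, 200    # c and c_0

# Optimum allocation: n_i = (c - c_0) N_i S_i / sqrt(c_i) / sum(N_i S_i sqrt(c_i))
denom = sum(N * s * sqrt(c) for N, s, c in zip(sizes, rel_sds, unit_costs))
optimum = [(total_cost - overhead) * N * s / sqrt(c) / denom
           for N, s, c in zip(sizes, rel_sds, unit_costs)]

# Proportional allocation with the same budget: n = N (c - c_0) / sum(c_i N_i)
N_total = sum(sizes)
n_prop = N_total * (total_cost - overhead) / sum(
    c * N for c, N in zip(unit_costs, sizes))
proportional = [n_prop * N / N_total for N in sizes]
```

Both allocations exhaust the available amount $c - c_0 = 19{,}800$; the optimum allocation places more of the sample in the first stratum, which is the most variable one.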
5 Systematic sampling
The systematic sampling technique is operationally more convenient than the simple ran-
dom sampling. It also ensures at the same time that each unit has equal probability of
inclusion in the sample. In this method of sampling, the first unit is selected with the
help of random numbers and the remaining units are selected automatically according to
a predetermined pattern. This method is known as systematic sampling.
Suppose the N units in the population are numbered 1 to N in some order. Suppose
further that N is expressible as a product of two integers n and k , so that N = nk.
To draw a sample of size n,
2. Suppose it is i.
So first unit is selected at random and other units are selected systematically. This
systematic sample is called k th systematic sample and k is termed as sampling interval.
This is also known as linear systematic sampling.
Example: Let N = 50 and n = 5. So k = 10. Suppose first selected number between
1 and 10 is 3. Then systematic sample consists of units with following serial number 3,
13, 23, 33, 43.
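The selection rule in the example can be sketched in Python as follows. The function name and its `start` parameter are illustrative assumptions; the sketch covers only the case $N = nk$ described above.

```python
import random

def linear_systematic_sample(N, n, start=None, rng=random):
    """Return the serial numbers i, i+k, ..., i+(n-1)k with k = N // n."""
    if N % n != 0:
        raise ValueError("this sketch assumes N = n * k exactly")
    k = N // n
    if start is None:
        start = rng.randint(1, k)   # random start between 1 and k inclusive
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and k")
    return [start + j * k for j in range(n)]

# Reproduces the example: N = 50, n = 5, random start 3.
sample = linear_systematic_sample(50, 5, start=3)
```

Only the first unit is random; once `start` is fixed, the rest of the sample is determined.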
Advantages of systematic sampling:
1. It is easier to draw a sample and often easier to execute it without mistakes. This
is more advantageous when the drawing is done in fields and offices as there may be
substantial saving in time.
2. The cost is low and the selection of units is simple. Much less training is needed for
surveyors to collect units through systematic sampling .
3. The systematic sample is spread more evenly over the population. So no large part
will fail to be represented in the sample. The sample is evenly spread and cross
section is better. Systematic sampling fails in case of too many blanks.
Let yij be observation on the unit bearing the serial number i+(j − 1) k in the population,
i = 1, 2, ..., k, j = 1, 2, ..., n. Suppose the drawn random number is i ≤ k. Sample consists
of ith column (in earlier table). Consider the sample mean given by;
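$\bar{y}_{sys} = \bar{y}_i = \frac{1}{n}\sum_{j=1}^{n} y_{ij}$ as an estimator of the population mean given by $\bar{Y} = \frac{1}{nk}\sum_{i=1}^{k}\sum_{j=1}^{n} y_{ij} = \frac{1}{k}\sum_{i=1}^{k}\bar{y}_i$.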
5.3 Exercises
3. Out of 24 villages in an area, two linear systematic samples of 4 villages each were
selected. The total area under wheat is given in the table below;
Sample villages
2. Estimate the variance of the sample mean and place an upper bound on the error
of estimation.
6 Cluster sampling
It is one of the basic assumptions in any sampling procedure that the population can
be divided into a finite number of distinct and identifiable units, called sampling units.
The smallest units into which the population can be divided are called elements of the
population. The groups of such elements are called clusters.
In many practical situations and many types of populations, a list of elements is not
available and so the use of an element as a sampling unit is not feasible. The method of
cluster sampling or area sampling can be used in such situations.
In cluster sampling;
1. Divide the whole population into clusters according to some well defined rule.
4. Carry out a complete enumeration of the selected clusters, i.e., collect information
on all the sampling units available in selected clusters.
Area sampling.
In case, the entire area containing the populations is subdivided into smaller area
segments and each element in the population is associated with one and only one such
area segment, the procedure is called as area sampling.
Examples:
1. In a city, the list of all the individual persons staying in the houses may be difficult
to obtain or even may be not available but a list of all the houses in the city may
be available. So every individual person will be treated as sampling unit and every
house will be a cluster.
2. The list of all the agricultural farms in a village or a district may not be easily
available but the list of village or districts are generally available. In this case,
every farm is a sampling unit and every village or district is a cluster.
2. Even if the list of elements is available, the location or identification of the units
may be difficult.
3. A necessary condition for the validity of this procedure is that every unit of the
population under study must correspond to one and only one unit of the cluster so
that the total number of sampling units in the frame may cover all the units of the
population under study without any omission or duplication. When this condition
is not satisfied, bias is introduced.
Suppose the population is divided into $N$ clusters and each cluster is of size $M$. Select a sample of $n$ clusters from the $N$ clusters by the method of SRS, generally WOR.
So the total population size is $NM$ and the total sample size is $nM$. Let $y_{ij}$ be the value of the characteristic under study for the $j$th element ($j = 1, 2, ..., M$) in the $i$th cluster ($i = 1, 2, ..., N$).
First select n clusters from N clusters by SRSWOR. Based on n clusters, find the
mean of each cluster separately based on all the units in every cluster. So we have the
cluster means as ȳ1 , ȳ2 , ...., ȳn . Consider the mean of all such cluster means as an estimator
of population mean as
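$$\bar{y}_{cl} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_i .$$
Since the clusters are drawn by SRSWOR, $E(\bar{y}_{cl}) = \frac{1}{n}\sum_{i=1}^{n}E(\bar{y}_i) = \bar{Y}$.
Thus $\bar{y}_{cl}$ is an unbiased estimator of $\bar{Y}$.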
Variance
The variance of $\bar{y}_{cl}$ can be derived on the same lines as the variance of the sample mean in SRSWOR. The only difference is that in SRSWOR the sampling units are $y_1, y_2, ..., y_n$, whereas in the case of $\bar{y}_{cl}$ the sampling units are $\bar{y}_1, \bar{y}_2, ..., \bar{y}_n$. Note that in the case of SRSWOR, $Var(\bar{y}) = \frac{N-n}{Nn}S^2$ and $\widehat{Var}(\bar{y}) = \frac{N-n}{Nn}s^2$. Hence
$$Var(\bar{y}_{cl}) = E(\bar{y}_{cl} - \bar{Y})^2 = \frac{N-n}{Nn}S_b^2,$$
where $S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}(\bar{y}_i - \bar{Y})^2$ is the mean sum of squares between the cluster means in the population.
Estimate of variance:
Using again the philosophy of the estimate of variance in the case of SRSWOR, we can find
$$\widehat{Var}(\bar{y}_{cl}) = \frac{N-n}{Nn}s_b^2,$$
where $s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar{y}_i - \bar{y}_{cl})^2$ is the mean sum of squares between the cluster means in the sample.
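The estimator and its estimated variance for equal-sized clusters can be computed as in the sketch below. The function name and the data are hypothetical and serve only to illustrate the two formulas above.

```python
def cluster_mean_estimate(sample_clusters, N):
    """Return (ybar_cl, estimated variance) for equal-sized clusters under SRSWOR.

    sample_clusters: list of n lists, each holding all M values of one
    selected cluster; N is the number of clusters in the population.
    """
    n = len(sample_clusters)
    cluster_means = [sum(c) / len(c) for c in sample_clusters]
    ybar_cl = sum(cluster_means) / n
    # Mean sum of squares between the sampled cluster means.
    s_b2 = sum((m - ybar_cl) ** 2 for m in cluster_means) / (n - 1)
    var_hat = (N - n) / (N * n) * s_b2
    return ybar_cl, var_hat

# Hypothetical data: n = 3 clusters of M = 4 elements out of N = 10 clusters.
clusters = [[2, 4, 6, 8], [1, 3, 5, 7], [4, 6, 8, 10]]
ybar_cl, var_hat = cluster_mean_estimate(clusters, N=10)
```

Note that the variance estimate depends only on the variation between cluster means, since every selected cluster is completely enumerated.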
6.3 Exercises
7 RATIO AND REGRESSION ESTIMATION
For example, if yi is the quantity of fruits produced in the ith plot, then xi can be the
area of ith plot or the production of fruit in the same plot in previous year.
Theorem 7.1. Let $((x_1, y_1), (x_2, y_2), ..., (x_n, y_n))$ be a random sample of size $n$ on the paired variable $(X, Y)$ drawn, preferably by SRSWOR, from a population of size $N$. The ratio estimate of the population mean $\bar{Y}$ is $\hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}}\bar{X} = \hat{R}\bar{X}$, assuming that the population mean $\bar{X}$ is known. The ratio estimator of the population total $Y_{tot} = \sum_{i=1}^{N} Y_i$ is $\hat{Y}_{R(tot)} = \frac{y_{tot}}{x_{tot}}X_{tot}$, where $X_{tot} = \sum_{i=1}^{N} X_i$ is the population total of $X$, which is assumed to be known, and $y_{tot} = \sum_{i=1}^{n} y_i$ and $x_{tot} = \sum_{i=1}^{n} x_i$ are the sample totals of $Y$ and $X$ respectively. The $\hat{Y}_{R(tot)}$ can
Looking at the structure of ratio estimators, note that the ratio method estimates the relative change $\frac{Y_{tot}}{X_{tot}}$ that occurred after $(x_i, y_i)$ were observed. It is clear that if the
7.2 Bias and mean squared error of ratio estimator
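related to $Y$.
Similarly,
$$E(\varepsilon_1^2) = \frac{f}{n}C_X^2,$$
$$E(\varepsilon_0\varepsilon_1) = \frac{1}{\bar{X}\bar{Y}}E\left[(\bar{x} - \bar{X})(\bar{y} - \bar{Y})\right] = \frac{1}{\bar{X}\bar{Y}}\frac{f}{n}S_{XY} = \frac{1}{\bar{X}\bar{Y}}\frac{f}{n}\rho S_X S_Y = \frac{f}{n}\rho\frac{S_X}{\bar{X}}\frac{S_Y}{\bar{Y}} = \frac{f}{n}\rho C_X C_Y,$$
where $C_X = \frac{S_X}{\bar{X}}$ is the coefficient of variation related to $X$ and $\rho$ is the population correlation coefficient between $X$ and $Y$.
Writing $\hat{\bar{Y}}_R$ in terms of the $\varepsilon_i$'s, we get
$$\hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}}\bar{X} = \frac{(1+\varepsilon_0)\bar{Y}}{(1+\varepsilon_1)\bar{X}}\bar{X} = (1+\varepsilon_0)(1+\varepsilon_1)^{-1}\bar{Y} .$$
Assuming $|\varepsilon_1| < 1$, the term $(1+\varepsilon_1)^{-1}$ may be expanded as an infinite series and it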
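would be convergent. Such an assumption means that $\left|\frac{\bar{x} - \bar{X}}{\bar{X}}\right| < 1$, i.e., a possible estimate $\bar{x}$ of the population mean $\bar{X}$ lies between 0 and $2\bar{X}$. This is likely to hold true if the variation in $\bar{x}$ is not large. In order to ensure that the variation in $\bar{x}$ is small, assume that the sample size $n$ is fairly large. With this assumption,
$$\hat{\bar{Y}}_R = \bar{Y}(1+\varepsilon_0)(1 - \varepsilon_1 + \varepsilon_1^2 - ...) = \bar{Y}(1 + \varepsilon_0 - \varepsilon_1 + \varepsilon_1^2 - \varepsilon_1\varepsilon_0 + ...) .$$
So the estimation error of $\hat{\bar{Y}}_R$ is
$$\hat{\bar{Y}}_R - \bar{Y} = \bar{Y}(\varepsilon_0 - \varepsilon_1 + \varepsilon_1^2 - \varepsilon_1\varepsilon_0 + ...) .$$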
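When the sample size is large, $\varepsilon_0$ and $\varepsilon_1$ are likely to be small quantities, so the terms involving second and higher powers of $\varepsilon_0$ and $\varepsilon_1$ would be negligibly small. In such a case $\hat{\bar{Y}}_R - \bar{Y} \simeq \bar{Y}(\varepsilon_0 - \varepsilon_1)$ and $E(\hat{\bar{Y}}_R - \bar{Y}) = 0$, since $E(\varepsilon_0) = E(\varepsilon_1) = 0$.
So the ratio estimator is an unbiased estimator of the population mean up to the first order of approximation.
If we assume that only terms of $\varepsilon_0$ and $\varepsilon_1$ involving powers more than two are negligibly small (which is more realistic than assuming that powers more than one are negligibly small), then the estimation error of $\hat{\bar{Y}}_R$ can be approximated as
$$\hat{\bar{Y}}_R - \bar{Y} \simeq \bar{Y}(\varepsilon_0 - \varepsilon_1 + \varepsilon_1^2 - \varepsilon_1\varepsilon_0) .$$
Then the bias of $\hat{\bar{Y}}_R$ is given by
$$E(\hat{\bar{Y}}_R - \bar{Y}) = \bar{Y}\left(0 - 0 + \frac{f}{n}C_X^2 - \frac{f}{n}\rho C_X C_Y\right) = \frac{f}{n}\bar{Y}C_X(C_X - \rho C_Y)$$
up to the second order of approximation. The bias generally decreases as the sample size grows large.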
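The bias of $\hat{\bar{Y}}_R$ is zero, i.e. $Bias(\hat{\bar{Y}}_R) = 0$,
if $E(\varepsilon_1^2 - \varepsilon_0\varepsilon_1) = 0$,
or if $\dfrac{Var(\bar{x})}{\bar{X}^2} - \dfrac{Cov(\bar{x}, \bar{y})}{\bar{X}\bar{Y}} = 0$,
or if $\dfrac{1}{\bar{X}^2}\left[Var(\bar{x}) - \dfrac{\bar{X}}{\bar{Y}}Cov(\bar{x}, \bar{y})\right] = 0$,
or if $Var(\bar{x}) - \dfrac{Cov(\bar{x}, \bar{y})}{R} = 0$ (assuming $\bar{X} \neq 0$),
or if $R = \dfrac{\bar{Y}}{\bar{X}} = \dfrac{Cov(\bar{x}, \bar{y})}{Var(\bar{x})}$.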
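$$
\begin{aligned}
MSE(\hat{\bar{Y}}_R) &= E(\hat{\bar{Y}}_R - \bar{Y})^2 \\
&= E\left[\bar{Y}^2(\varepsilon_0 - \varepsilon_1 + \varepsilon_1^2 - \varepsilon_1\varepsilon_0 + ...)^2\right] \\
&\simeq E\left[\bar{Y}^2(\varepsilon_0^2 + \varepsilon_1^2 - 2\varepsilon_0\varepsilon_1)\right] .
\end{aligned}
$$
Under the assumption $|\varepsilon_1| < 1$, and with the terms of $\varepsilon_0$ and $\varepsilon_1$ involving powers more than two being negligibly small,
$$MSE(\hat{\bar{Y}}_R) = \bar{Y}^2\left[\frac{f}{n}C_X^2 + \frac{f}{n}C_Y^2 - \frac{2f}{n}\rho C_X C_Y\right] = \frac{\bar{Y}^2 f}{n}\left[C_X^2 + C_Y^2 - 2\rho C_X C_Y\right]$$
up to the second order of approximation.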
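The ratio estimate and the second-order approximations to its bias and MSE can be evaluated as in the sketch below. The function names and data are hypothetical, and the code assumes $f = (N-n)/N$, which is not restated in this part of the notes.

```python
def ratio_estimate(x, y, X_bar):
    """Ratio estimate (ybar / xbar) * X_bar of the population mean of y."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return y_bar / x_bar * X_bar

def ratio_bias_mse(Y_bar, C_x, C_y, rho, n, N):
    """Second-order approximations to the bias and MSE of the ratio estimator."""
    f = (N - n) / N   # assumption: f denotes (N - n) / N
    bias = Y_bar * (f / n) * (C_x ** 2 - rho * C_x * C_y)
    mse = Y_bar ** 2 * (f / n) * (C_x ** 2 + C_y ** 2 - 2 * rho * C_x * C_y)
    return bias, mse

# Hypothetical sample of n = 4 pairs; X_bar = 11 is assumed known.
x = [10.0, 12.0, 8.0, 10.0]
y = [21.0, 25.0, 15.0, 19.0]
estimate = ratio_estimate(x, y, X_bar=11.0)

# Hypothetical population quantities.
bias, mse = ratio_bias_mse(Y_bar=20.0, C_x=0.2, C_y=0.2, rho=0.5, n=4, N=40)
```

When $C_X = C_Y$ the approximate bias vanishes only if $\rho = 1$, which agrees with the zero-bias condition $C_X = \rho C_Y$ implied by the bias expression.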
7.3 Regression Estimation
The ratio method of estimation uses auxiliary information that is correlated with the study variable to improve precision, and it yields improved estimators when the regression of $Y$ on $X$ is linear and passes through the origin. When the regression of $Y$ on $X$ is linear, it is not necessary that the line should always pass through the origin. Under such conditions, it is more appropriate to use the regression type estimator to estimate the population mean.
In the ratio method, the conventional estimator, the sample mean $\bar{y}$, was improved by multiplying it by a factor $\frac{\bar{X}}{\bar{x}}$, where $\bar{x}$ is an unbiased estimator of the population mean $\bar{X}$ of the auxiliary variable. Now we consider another idea based on difference.
Consider an estimator $(\bar{x} - \bar{X})$ for which $E(\bar{x} - \bar{X}) = 0$. Consider an improved
the optimum value of $\mu$ is the same as the regression coefficient of $y$ on $x$ with a negative sign, i.e., $\mu = -\beta$. So the estimator $\hat{\bar{Y}}^*$ with the optimum value of $\mu$ is $\hat{\bar{Y}}_{reg} = \bar{y} + \beta(\bar{X} - \bar{x})$, which is the regression estimator of $\bar{Y}$, and the procedure of estimation is called the regression method of estimation.
The variance of $\hat{\bar{Y}}_{reg}$ is $Var(\hat{\bar{Y}}_{reg}) = V(\bar{y})\left[1 - \rho^2(\bar{x}, \bar{y})\right]$, where $\rho(\bar{x}, \bar{y})$ is the correlation coefficient between $\bar{x}$ and $\bar{y}$. So $\hat{\bar{Y}}_{reg}$ would be efficient if $x$ and $y$ are highly correlated. The estimator $\hat{\bar{Y}}_{reg}$ is more efficient than $\bar{y}$ if $\rho(\bar{x}, \bar{y}) \neq 0$, which generally holds.
estimator of $\bar{Y}$ is given by $\hat{\bar{Y}}_{reg} = \bar{y} + \hat{\beta}(\bar{X} - \bar{x})$. It is difficult to find the exact expressions of $E(\hat{\bar{Y}}_{reg})$ and $Var(\hat{\bar{Y}}_{reg})$, so we approximate them using the same methodology as in
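$$\hat{\bar{Y}}_{reg} - \bar{Y} = \bar{Y}\varepsilon_0 - \beta\bar{X}\varepsilon_1(1+\varepsilon_2)(1+\varepsilon_3)^{-1},$$
where $\beta = \frac{S_{XY}}{S_X^2}$ is the population regression coefficient. Assuming $|\varepsilon_3| < 1$,
$$\hat{\bar{Y}}_{reg} - \bar{Y} \simeq \bar{Y}\varepsilon_0 - \beta\bar{X}(\varepsilon_1 + \varepsilon_1\varepsilon_2)(1 - \varepsilon_3 + \varepsilon_3^2 - ...) .$$
Retaining the terms of the $\varepsilon_i$'s up to the second power and ignoring others, we have
$$MSE(\hat{\bar{Y}}_{reg}) = E(\hat{\bar{Y}}_{reg} - \bar{Y})^2 \approx E\left[\varepsilon_0^2\bar{Y}^2 + \beta^2\bar{X}^2\varepsilon_1^2 - 2\beta\bar{X}\bar{Y}\varepsilon_0\varepsilon_1\right],$$
which, on substituting the expectations of $\varepsilon_0^2$, $\varepsilon_1^2$ and $\varepsilon_0\varepsilon_1$ together with $\beta = \rho S_Y/S_X$, reduces to
$$MSE(\hat{\bar{Y}}_{reg}) = \frac{f}{n}S_Y^2(1 - \rho^2) . \qquad (7.2)$$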
So up to the second order of approximation, the regression estimator is better than the conventional sample mean estimator under SRSWOR. This is because the regression estimator uses some extra information. However, such extra information also requires some extra cost, so the comparison shows a false superiority in some sense. The regression estimators and SRS estimates can be combined if the cost aspect is also taken into consideration.
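A small numerical sketch of the regression estimator follows. It assumes that $\beta$ is replaced by the sample regression coefficient $s_{xy}/s_x^2$ and that $f = (N-n)/N$; the function names and data are hypothetical.

```python
def regression_estimate(x, y, X_bar):
    """Regression estimate ybar + b * (X_bar - xbar), with b = s_xy / s_x^2."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b = s_xy / s_xx
    return y_bar + b * (X_bar - x_bar)

def regression_mse(S_y, rho, n, N):
    """Approximate MSE (f / n) * S_y^2 * (1 - rho^2) from equation (7.2)."""
    f = (N - n) / N   # assumption: f denotes (N - n) / N
    return (f / n) * S_y ** 2 * (1 - rho ** 2)

# Hypothetical data lying exactly on the line y = 3 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 7.0, 9.0, 11.0]
estimate = regression_estimate(x, y, X_bar=3.0)   # equals 3 + 2 * 3

mse = regression_mse(S_y=4.0, rho=0.5, n=4, N=40)
```

Because the hypothetical data are exactly linear with a non-zero intercept, the regression estimate recovers the value on the line at $\bar{X}$, which the ratio estimate would not.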
7.4 Exercises
assumed to be known. Its value is 22,919. Estimate the total number of inhabitants
in the 196 cities in 1980. (i) Using the ratio estimator (ii) Using the sample estimator
2. In a study to estimate the total sugar content of a truck load of oranges a random
sample of n=10 oranges was juiced and weighed as shown in the table below. The
total weight of all the oranges obtained by first weighing the truck loaded and then
unloaded was found to be 1800 pounds. Estimate the total sugar content of oranges
and place a bound on the error of estimation.
8 DOUBLE SAMPLING (TWO PHASE SAMPLING)
The ratio and regression methods of estimation require the knowledge of population mean
of auxiliary variable X̄ to estimate the population mean of study variable Ȳ . If
information on the auxiliary variable is not available, then there are two options – one
option is to collect a sample only on study variable and use sample mean as an estimator
of population mean.
An alternative solution is to use a part of the budget to collect a large preliminary sample in which $x_i$ alone is measured. The purpose of this sampling is to furnish a good estimate of $\bar{X}$. This method is appropriate when the information about $x_i$ is on file cards that have not been tabulated. After collecting a large preliminary sample of size $n'$ units from the population, select a smaller sample of size $n$ from it and collect the information on $y$. These two estimates are then used to obtain an estimator of the population mean $\bar{Y}$. This procedure of selecting a large sample for collecting information on the auxiliary variable $x$ and then selecting a sub-sample from it for collecting the information on the study variable $y$ is called double sampling or two phase sampling. It is useful when it is considerably cheaper and quicker to collect data on $x$ than on $y$ and there is high correlation between $x$ and $y$.
In this sampling, the randomization is done twice. First a random sample of size $n'$ is drawn from a population of size $N$, and then again a random sample of size $n$ is drawn from the first sample of size $n'$.
So the sample mean in this sampling is a function of the two phases of sampling. If SRSWOR is utilized to draw the samples at both the phases, then
→ number of possible samples at the first phase, where a sample of size $n'$ is drawn from a population of size $N$, is $\binom{N}{n'} = M_0$, say;
→ number of possible samples at the second phase, where a sample of size $n$ is drawn from the first phase sample of size $n'$, is $\binom{n'}{n} = M_1$, say.
Then the sample mean is a function of two variables. If $\tau$ is the statistic calculated at the second phase, taking values $\tau_{ij}$, $i = 1, 2, ..., M_0$, $j = 1, 2, ..., M_1$, with $P_{ij}$ being the probability that the $i$th sample is chosen at the first phase and the $j$th sample is chosen at the second phase, then
$E(\tau) = E_1[E_2(\tau)]$, where $E_2(\tau)$ denotes the expectation over the second phase and $E_1$ denotes the expectation over the first phase. Thus
$$E(\tau) = \sum_{i=1}^{M_0}\sum_{j=1}^{M_1} P_{ij}\tau_{ij} = \sum_{i=1}^{M_0} P_i\sum_{j=1}^{M_1} P_{j|i}\tau_{ij},$$
where $\sum_{i=1}^{M_0} P_i$ is for the first stage and $\sum_{j=1}^{M_1} P_{j|i}\tau_{ij}$ for the second stage.
8.1.1 Variance of τ
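$$
\begin{aligned}
Var(\tau) &= E[\tau - E(\tau)]^2 \\
&= E[(\tau - E_2(\tau)) + (E_2(\tau) - E(\tau))]^2 \\
&= E_1E_2\left\{[\tau - E_2(\tau)]^2 + [E_2(\tau) - E(\tau)]^2\right\} \quad\text{(the cross-product term has zero expectation)} \\
&= E_1E_2[\tau - E_2(\tau)]^2 + E_1E_2[E_2(\tau) - E(\tau)]^2 \\
&= E_1[V_2(\tau)] + E_1[E_2(\tau) - E_1(E_2(\tau))]^2 \\
&= E_1[V_2(\tau)] + V_1[E_2(\tau)]
\end{aligned}
$$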
Note: The two phase sampling can be extended to more than two phases depending
upon the need and objective of the experiment. Various expectations can also be extended
on the similar lines.
8.2 Double sampling in ratio method of estimation
If the population mean $\bar{X}$ is not known, then the double sampling technique is applied. Take a large initial sample of size $n'$ by SRSWOR to estimate the population mean $\bar{X}$ as
$$\hat{\bar{X}} = \bar{x}' = \frac{1}{n'}\sum_{i=1}^{n'} x_i .$$
Then a second sample is a subsample of size $n$ selected from the initial sample by SRSWOR. Let $\bar{y}$ and $\bar{x}$ be the means of $y$ and $x$ based on the subsample. Then;
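$= E(\varepsilon_2^2)$
$$
\begin{aligned}
E(\varepsilon_0\varepsilon_2) &= \frac{1}{\bar{X}\bar{Y}}Cov(\bar{y}, \bar{x}') \\
&= \frac{1}{\bar{X}\bar{Y}}Cov\left[E(\bar{y}|n'), E(\bar{x}'|n')\right] + \frac{1}{\bar{X}\bar{Y}}E\left[Cov(\bar{y}, \bar{x}')|n'\right] \\
&= \frac{1}{\bar{X}\bar{Y}}Cov(\bar{y}', \bar{x}') \\
&= \left(\frac{1}{n'} - \frac{1}{N}\right)\frac{S_{XY}}{\bar{X}\bar{Y}} \\
&= \left(\frac{1}{n'} - \frac{1}{N}\right)\rho C_x C_y
\end{aligned}
$$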
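$$
\begin{aligned}
E(\varepsilon_0^2) &= \frac{1}{\bar{Y}^2}Var(\bar{y}) \\
&= \frac{1}{\bar{Y}^2}\left[V_1\{E_2(\bar{y}|n')\} + E_1\{V_2(\bar{y}|n')\}\right] \\
&= \frac{1}{\bar{Y}^2}\left[V_1(\bar{y}') + E_1\left\{\left(\frac{1}{n} - \frac{1}{n'}\right)s_y'^2\right\}\right] \\
&= \frac{1}{\bar{Y}^2}\left[\left(\frac{1}{n'} - \frac{1}{N}\right)S_y^2 + \left(\frac{1}{n} - \frac{1}{n'}\right)S_y^2\right] \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)\frac{S_y^2}{\bar{Y}^2} \\
&= \left(\frac{1}{n} - \frac{1}{N}\right)C_y^2,
\end{aligned}
$$
where $s_y'^2$ is the mean sum of squares of $y$ based on the initial sample of size $n'$.
$$
\begin{aligned}
E(\varepsilon_1\varepsilon_2) &= \frac{1}{\bar{X}^2}Cov(\bar{x}, \bar{x}') \\
&= \frac{1}{\bar{X}^2}\left[Cov\{E(\bar{x}|n'), E(\bar{x}'|n')\} + 0\right] \\
&= \frac{1}{\bar{X}^2}Var(\bar{x}')
\end{aligned}
$$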
9 VARYING PROBABILITY SAMPLING.
The simple random sampling scheme provides a random sample where every unit in the
population has equal probability of selection. Under certain circumstances, more effi-
cient estimators are obtained by assigning unequal probabilities of selection to the units
in the population. This type of sampling is known as varying probability sampling
scheme.
If Y is the variable under study and X is an auxiliary variable related to Y , then in
the most commonly used varying probability scheme, the units are selected with prob-
ability proportional to the value of X, called as size. This is termed as probability
proportional to a given measure of size (pps) sampling. If the sampling units
vary considerably in size, then SRS does not take into account the possible importance
of the larger units in the population. A large unit, i.e., a unit with large value of Y con-
tributes more to the population total than the units with smaller values, so it is natural to
expect that a selection scheme which assigns more probability of inclusion in a sample to
the larger units than to the smaller units would provide more efficient estimators than the
estimators which provide equal probability to all the units. This is accomplished through
pps sampling.
Note that the “size” considered is the value of auxiliary variable X and not the value
of study variable Y . For example in an agriculture survey, the yield depends on the
area under cultivation. So bigger areas are likely to have larger population and they will
contribute more towards the population total, so the value of the area can be considered
as the size of auxiliary variable. Also, the cultivated area for a previous period can also be
taken as the size while estimating the yield of crop. Similarly, in an industrial survey, the
number of workers in a factory can be considered as the measure of size when studying
the industrial output from the respective factory.
Probability proportional to size sampling can be done in two ways:
1. Selection of units with replacement: The probability of selection of a unit will not
change and the probability of selecting a specified unit is same at any stage. There
is no redistribution of the probabilities after a draw.
9.2 PPS sampling with replacement (WR)
First we discuss the two methods to draw a sample with PPS and WR.
Notations:
Yi : value of study variable for the ith unit of the population, i = 1, 2, . . . , N.
Xi : known value of auxiliary variable (size) for the ith unit of the population.
Pi : probability of selection of ith unit in the population at any given draw and is
proportional to size Xi .
Theorem 9.1. Consider the varying probability scheme with replacement for a sample of size $n$. Let $y_r$ be the value of the $r$th observation on the study variable in the sample and $p_r$ be its initial probability of selection. Define $z_r = \frac{y_r}{Np_r}$, $r = 1, 2, ..., n$. Then $\bar{z} = \frac{1}{n}\sum_{r=1}^{n} z_r$ is an unbiased estimator of the population mean $\bar{Y}$, the variance of $\bar{z}$ is $\frac{\sigma_z^2}{n}$ where $\sigma_z^2 = \sum_{i=1}^{N} P_i\left(\frac{Y_i}{NP_i} - \bar{Y}\right)^2$, and an unbiased estimate of the variance of $\bar{z}$ is $\frac{s_z^2}{n}$ where $s_z^2 = \frac{1}{n-1}\sum_{r=1}^{n}(z_r - \bar{z})^2$.
Proof: Note that $z_r$ can take any one of the $N$ values $Z_1, Z_2, ..., Z_N$, where $Z_i = \frac{Y_i}{NP_i}$, with corresponding initial probabilities $P_1, P_2, ..., P_N$ respectively. So
$$E(z_r) = \sum_{i=1}^{N} Z_i P_i = \sum_{i=1}^{N}\frac{Y_i}{NP_i}P_i = \bar{Y} .$$
Therefore, $E(\bar{z}) = \frac{1}{n}\sum_{r=1}^{n}E(z_r) = \frac{1}{n}\sum_{r=1}^{n}\bar{Y} = \bar{Y}$. So $\bar{z}$ is an unbiased estimator of the population mean $\bar{Y}$.
The variance of $\bar{z}$ is
$$Var(\bar{z}) = \frac{1}{n^2}Var\left(\sum_{r=1}^{n} z_r\right) = \frac{1}{n^2}\sum_{r=1}^{n}Var(z_r) = \frac{\sigma_z^2}{n}$$
(the $z_r$'s are independent in the WR case, each with variance $\sigma_z^2$).
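To show that $\frac{s_z^2}{n}$ is an unbiased estimator of the variance of $\bar{z}$, consider
$$
\begin{aligned}
(n-1)E(s_z^2) &= E\left[\sum_{r=1}^{n}(z_r - \bar{z})^2\right] = E\left[\sum_{r=1}^{n} z_r^2 - n\bar{z}^2\right] \\
&= \sum_{r=1}^{n}\left(\sigma_z^2 + \bar{Y}^2\right) - n\left(\frac{\sigma_z^2}{n} + \bar{Y}^2\right) \\
&= (n-1)\sigma_z^2,
\end{aligned}
$$
using $Var(z_r) = \sum_{i=1}^{N} P_i\left(\frac{Y_i}{NP_i} - \bar{Y}\right)^2 = \sigma_z^2$. Hence $E(s_z^2) = \sigma_z^2$, or $E\left(\frac{s_z^2}{n}\right) = \frac{\sigma_z^2}{n} = Var(\bar{z})$, so that
$$\widehat{Var}(\bar{z}) = \frac{s_z^2}{n} = \frac{1}{n(n-1)}\left[\sum_{r=1}^{n}\left(\frac{y_r}{Np_r}\right)^2 - n\bar{z}^2\right] .$$
Note: If $P_i = \frac{1}{N}$, then $\bar{z} = \bar{y}$ and
$$Var(\bar{z}) = \frac{1}{n}\sum_{i=1}^{N}\frac{1}{N}\left(\frac{Y_i}{N\cdot\frac{1}{N}} - \bar{Y}\right)^2 = \frac{\sigma_y^2}{n},$$
which is the same as in the case of SRSWR.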
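Theorem 9.2. An estimate of the population total is $\hat{Y}_{tot} = \frac{1}{n}\sum_{r=1}^{n}\frac{y_r}{p_r} = N\bar{z}$.
Proof: Taking expectation, we get
$$E(\hat{Y}_{tot}) = \frac{1}{n}\sum_{r=1}^{n}\left[\frac{Y_1}{P_1}P_1 + \frac{Y_2}{P_2}P_2 + ... + \frac{Y_N}{P_N}P_N\right] = \frac{1}{n}\sum_{r=1}^{n}\left[\sum_{i=1}^{N} Y_i\right] = \frac{1}{n}\sum_{r=1}^{n} Y_{tot} = Y_{tot} .$$
Therefore $\hat{Y}_{tot}$ is an unbiased estimator of the population total.
Proof:
$$Var(\hat{Y}_{tot}) = N^2 Var(\bar{z}) = N^2\frac{1}{n}\sum_{i=1}^{N}\frac{1}{N^2}\left(\frac{Y_i}{P_i} - N\bar{Y}\right)^2 P_i = \frac{1}{n}\sum_{i=1}^{N}\left(\frac{Y_i}{P_i} - Y_{tot}\right)^2 P_i = \frac{1}{n}\left[\sum_{i=1}^{N}\frac{Y_i^2}{P_i} - Y_{tot}^2\right] .$$
Corollary 9.1. An estimate of the variance of $\hat{Y}_{tot}$ is $\widehat{Var}(\hat{Y}_{tot}) = N^2\frac{s_z^2}{n}$.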
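The estimators of Theorems 9.1 and 9.2 and Corollary 9.1 can be computed together, as in the following sketch. The function name, the dictionary keys and the data are hypothetical; `p` holds the initial selection probabilities $p_r$ of the sampled units.

```python
def pps_wr_estimates(y, p, N):
    """Estimates of the population mean and total under PPS sampling WR.

    y: sampled values y_r; p: their initial selection probabilities p_r;
    N: population size.
    """
    n = len(y)
    z = [yr / (N * pr) for yr, pr in zip(y, p)]       # z_r = y_r / (N p_r)
    z_bar = sum(z) / n
    s_z2 = sum((zr - z_bar) ** 2 for zr in z) / (n - 1)
    return {
        "mean": z_bar,                   # estimate of the population mean
        "var_mean": s_z2 / n,            # estimated variance of z_bar
        "total": N * z_bar,              # estimate of the population total
        "var_total": N ** 2 * s_z2 / n,  # Corollary 9.1
    }

# Hypothetical sample of n = 3 units from a population of N = 10 units.
N = 10
y = [12.0, 30.0, 45.0]
p = [0.05, 0.10, 0.20]
est = pps_wr_estimates(y, p, N)

# With equal probabilities p_r = 1/N, z_bar reduces to the sample mean.
equal = pps_wr_estimates(y, [1 / N] * 3, N)
```

The second call illustrates the note in the proof: with $P_i = 1/N$ the estimator coincides with $\bar{y}$, as in SRSWR.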
In varying probability scheme without replacement, when the initial probabilities of se-
lection are unequal, then the probability of drawing a specified unit of the population
at a given draw changes with the draw. Generally, the sampling WOR provides a more
efficient estimator than sampling WR. The estimators for population mean and variance
are more complicated. So this scheme is not commonly used in practice, especially in
large scale sample surveys with small sampling fractions.
9.3 Exercises
1. Show how to estimate the population mean and variance in probability scheme without
replacement.
2. Ten farms have 10, 20, 30, 15, 25, 35, 65, 55, 50 and 5 acres of land under maize. It is desired to select a sample of size 4 without replacement and with probability proportional to the acreage of maize in the farm. The total acreage of maize is 310. The first step is to form cumulative totals (C.T) and ranges as follows.
Farm Size C.T Range
1 10 10 1-10
2 20 30 11-30
3 30 60 31-60
4 15 75 61-75
5 25 100 76-100
6 35 135 101-135
7 65 200 136-200
8 55 255 201-255
9 50 305 256-305
10 5 310 306-310
3. Explain the Lahiri’s method as used in probability proportional to size with re-
placement sampling.
4. Define the following estimators
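The cumulative totals and ranges tabulated in Exercise 2 can be built and used programmatically. The sketch below is illustrative only: the function names are hypothetical, and `draw_pps_wr` draws with replacement, so it does not by itself carry out the without-replacement selection that the exercise asks for.

```python
import random
from bisect import bisect_left
from itertools import accumulate

sizes = [10, 20, 30, 15, 25, 35, 65, 55, 50, 5]   # acres under maize, farms 1..10
cumulative = list(accumulate(sizes))               # 10, 30, 60, ..., 310

def farm_for(number):
    """Return the farm whose range contains a number between 1 and 310."""
    return bisect_left(cumulative, number) + 1

def draw_pps_wr(n, rng=random):
    """Draw n farms with replacement, each with probability size / 310."""
    return [farm_for(rng.randint(1, cumulative[-1])) for _ in range(n)]

# A random number of 136 falls in the range 136-200, i.e. farm 7.
chosen = farm_for(136)
```

A random number between 1 and 310 is mapped to the farm whose range contains it, so a farm with more acreage covers more numbers and is selected with proportionally higher probability.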
10 TWO STAGE SAMPLING(SUBSAMPLING)
In cluster sampling, all the elements in the selected clusters are surveyed. Moreover, the efficiency in cluster sampling depends on the size of the cluster: as the size increases, the efficiency decreases. This suggests that higher precision can be attained by distributing a given number of elements over a large number of clusters rather than by taking a small number of clusters and enumerating all elements within them. This is achieved in subsampling. In subsampling
3. From each of the selected cluster, select a sample of specified number of elements
(second stage)
The clusters which form the units of sampling at the first stage are called the first stage
units or primary stage units and the units or group of units within clusters which
form the unit of clusters are called the second stage units or subunits or secondary
stage units. The procedure is generalized to three or more stages and is then termed as
multistage sampling.
For example, in a crop survey;
2. Fields within the villages are the second stage units and
Two stage sampling with equal first stage units: Assume that;
2. NM elements are grouped into N first stage units of M second stage units each,
(i.e., N clusters, each cluster is of size M )
4. Sample of m second stage units is selected from each selected first stage unit (i.e.,
choose m units from each cluster).
Note: Cluster sampling is a special case of two stage sampling in the sense that, from a population of $N$ clusters of equal size $M$, a sample of $n$ clusters is chosen and all their elements are enumerated, i.e. $m = M$. If further $M = m = 1$, we get SRSWOR. If $n = N$, we have the case of stratified sampling.
10.2 Notations
Let;
yij : value of the characteristic under study for the j th second stage units of the ith first
stage unit; i = 1, 2, ..., N , j = 1, 2, ..., m.
$\bar{Y}_i = \frac{1}{M}\sum_{j=1}^{M} y_{ij}$: mean per 2nd stage unit of the $i$th 1st stage unit in the population.
$\bar{Y} = \frac{1}{MN}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij} = \frac{1}{N}\sum_{i=1}^{N}\bar{Y}_i = \bar{Y}_{MN}$: mean per second stage unit in the population.
$\bar{y}_i = \frac{1}{m}\sum_{j=1}^{m} y_{ij}$: mean per 2nd stage unit of the $i$th 1st stage unit in the sample.
$\bar{y} = \frac{1}{mn}\sum_{i=1}^{n}\sum_{j=1}^{m} y_{ij} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_i = \bar{y}_{mn}$: mean per second stage unit in the sample.
Note: The expectations under two stage sampling scheme depend on the stages. For
example, the expectation at second stage unit will be dependent on first stage unit in the
sense that second stage unit will be in the sample provided it was selected in the first
stage.
To calculate the average
1. First average the estimator over all the second stage selections that can be drawn
from a fixed set of n units that the plan selects.
2. Then average over all the possible selections of n units by the plan.
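$E(\hat{\theta}) = E_1[E_2(\hat{\theta})]$, where $E(\hat{\theta})$ is the average over all samples, $E_1$ is the average over all 1st stage samples, and $E_2$ is the average over all possible 2nd stage selections from a fixed set of units.
In case of three stage sampling;
$$E(\hat{\theta}) = E_1\left[E_2\left\{E_3(\hat{\theta})\right\}\right]$$
To calculate the variance, we proceed as follows:
In case of two stage sampling,
$$Var(\hat{\theta}) = E(\hat{\theta} - \theta)^2 = E_1E_2(\hat{\theta} - \theta)^2 .$$
Consider;
$$
\begin{aligned}
E_2(\hat{\theta} - \theta)^2 &= E_2(\hat{\theta}^2) - 2\theta E_2(\hat{\theta}) + \theta^2 \\
&= \left[\{E_2(\hat{\theta})\}^2 + V_2(\hat{\theta})\right] - 2\theta E_2(\hat{\theta}) + \theta^2 .
\end{aligned}
$$
Now average over the first stage selection as;
$$
\begin{aligned}
E_1E_2(\hat{\theta} - \theta)^2 &= E_1\left[E_2(\hat{\theta})\right]^2 + E_1\left[V_2(\hat{\theta})\right] - 2\theta E_1E_2(\hat{\theta}) + E_1(\theta^2) \\
&= E_1\left[\{E_2(\hat{\theta}) - \theta\}^2\right] + E_1\left[V_2(\hat{\theta})\right], \\
Var(\hat{\theta}) &= V_1\left[E_2(\hat{\theta})\right] + E_1\left[V_2(\hat{\theta})\right] .
\end{aligned}
$$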
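10.3 Estimation of population mean
Taking the expectation over both stages,
$$E(\bar{y}_{mn}) = E_1\left[E_2(\bar{y}_{mn})\right] = E_1\left[\frac{1}{n}\sum_{i=1}^{n}\bar{Y}_i\right] = \frac{1}{N}\sum_{i=1}^{N}\bar{Y}_i = \bar{Y} .$$
Thus $\bar{y}_{mn}$ is an unbiased estimator of the population mean.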
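Variance:
$$
\begin{aligned}
Var(\bar{y}) &= E_1\left[V_2(\bar{y}|i)\right] + V_1\left[E_2(\bar{y}|i)\right] \\
&= \frac{1}{n^2}\sum_{i=1}^{n}\left(\frac{1}{m} - \frac{1}{M}\right)E_1(S_i^2) + V_1(\bar{y}_c) \\
&= \frac{1}{n^2}\,n\left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 + \frac{N-n}{Nn}S_b^2 \\
&= \frac{1}{n}\left(\frac{1}{m} - \frac{1}{M}\right)\bar{S}_w^2 + \left(\frac{1}{n} - \frac{1}{N}\right)S_b^2
\end{aligned}
$$
(where $\bar{y}_c$ is based on cluster means as in cluster sampling), and where
$$\bar{S}_w^2 = \frac{1}{N}\sum_{i=1}^{N} S_i^2 = \frac{1}{N(M-1)}\sum_{i=1}^{N}\sum_{j=1}^{M}(Y_{ij} - \bar{Y}_i)^2, \qquad S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}(\bar{Y}_i - \bar{Y})^2 .$$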
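10.4 Estimate of variance
Proof: Consider, as an estimator of $\bar{S}_w^2 = \frac{1}{N}\sum_{i=1}^{N} S_i^2$, where $S_i^2 = \frac{1}{M-1}\sum_{j=1}^{M}(y_{ij} - \bar{Y}_i)^2$, the quantity $\bar{s}_w^2 = \frac{1}{n}\sum_{i=1}^{n} s_i^2$, where $s_i^2 = \frac{1}{m-1}\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2$. So,
$$
\begin{aligned}
E(\bar{s}_w^2) &= E_1\left[\frac{1}{n}\sum_{i=1}^{n}E_2(s_i^2)\right] \\
&= \frac{1}{n}\sum_{i=1}^{n}E_1(S_i^2) \\
&= \frac{1}{n}\sum_{i=1}^{n}\left[\frac{1}{N}\sum_{i=1}^{N} S_i^2\right] \\
&= \frac{1}{N}\sum_{i=1}^{N} S_i^2 = \bar{S}_w^2 .
\end{aligned}
$$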
10.5 Exercises
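1. Consider $s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar{y}_i - \bar{y})^2$ as an estimator of $S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}(\bar{Y}_i - \bar{Y})^2$.
Show that $\widehat{Var}(\bar{y}) = \frac{1}{N}\left(\frac{1}{m} - \frac{1}{M}\right)\bar{s}_w^2 + \left(\frac{1}{n} - \frac{1}{N}\right)s_b^2$.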
11 SOURCES OF ERRORS IN SURVEYS
11.1 Introduction
Sampling theory generally assumes that the true value of each unit in the population
can be obtained and tabulated without error. In practice, this assumption may be violated
for several reasons and because of practical constraints. This results in errors
in the observations as well as in the tabulation. Such errors, which are due to factors
other than sampling, are called non-sampling errors.
Non-sampling errors are unavoidable in censuses and surveys. Data collected
by complete enumeration in a census are free from sampling error but not
from non-sampling errors. Data collected through sample surveys can have both
sampling errors and non-sampling errors. Non-sampling errors arise from
factors other than the inductive process of inferring about the population from a
sample. In general, sampling errors decrease as the sample size increases, whereas
non-sampling errors increase as the sample size increases. In some situations, the non-sampling
errors may be large and deserve greater attention than the sampling error.
In any survey, it is assumed that the value of the characteristic to be measured has
been defined precisely for every population unit. Such a value exists and is unique.
This is called the true value of the characteristic for the population unit. In practical
applications, data collected on the selected units are called survey values, and they differ
from the true values. The difference between the true and observed values is termed
the observational error or response error. Such an error arises mainly from the
lack of precision in measurement techniques and variability in the performance of the
investigators.
Non-sampling errors can occur at every stage of the planning and execution of a survey or census.
They occur at the planning stage, the field work stage, and the tabulation and computation stage.
11.2 Non-Sampling Errors
3. faulty definition,
More specifically, one or more of the following reasons may give rise to non-sampling errors
or indicate their presence:
1. The data specification may be inadequate and inconsistent with the objectives of
the survey or census.
6. The recall errors may pose difficulty in reporting the true data.
9. There can be errors in presenting and printing the tabulated results, graphs etc.
10. In a sample survey, the non-sampling errors arise due to defective frames and faulty
selection of sampling units.
These sources are not exhaustive, but they indicate the possible sources of error.
Non-sampling errors may be broadly classified into three categories.
(a) Specification errors: These errors occur at the planning stage for various reasons,
e.g., inadequate and inconsistent specification of data with respect to the objectives of
surveys/censuses, omission or duplication of units due to imprecise definitions, faulty methods
of enumeration/interview, ambiguous schedules, etc.
(b) Ascertainment errors: These errors occur at the field stage for various reasons, e.g.,
lack of trained and experienced investigators, recall errors and other types of errors in
data collection, lack of adequate inspection, lack of supervision of primary staff, etc.
(c) Tabulation errors: These errors occur at the tabulation stage for various reasons,
e.g., inadequate scrutiny of data, errors in processing the data, errors in publishing the
tabulated results, graphs, etc.
Ascertainment errors may be further sub-divided into:
(i) Coverage errors, owing to over-enumeration or under-enumeration of the population
or the sample, resulting from duplication or omission of units and from non-response.
(ii) Content errors, relating to wrong entries due to errors on the part of
investigators and respondents.
The same division can be made in the case of tabulation errors. Missing or repeated
data at the tabulation stage give rise to coverage errors,
and errors in coding, calculations, etc. give rise to content errors.
Treatment of non-sampling errors: Some conceptual background is needed for the
mathematical treatment of non-sampling errors.
Total error: The difference between the sample survey estimate and the parametric true
value being estimated is termed the total error.
11.3 Sampling errors
2. sampling variance.
If the results are also subject to non-sampling errors, then the total error has
both sampling and non-sampling components.
Total bias: The difference between the expected value of the estimator and the true value
being estimated is termed the total bias. It consists of sampling bias and non-sampling bias.
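The definitions of total error and total bias can be written compactly. As a sketch, let $\hat\theta$ denote the survey estimate and $\theta$ the true value; these symbols are assumed here, following the notation of the two-stage sampling chapter. The mean squared error decomposition shown is the standard identity.

```latex
\[
\text{total error} = \hat\theta - \theta, \qquad
\text{total bias } B = E(\hat\theta) - \theta,
\]
\[
MSE(\hat\theta) = E(\hat\theta - \theta)^2 = Var(\hat\theta) + B^2 .
\]
```

When non-sampling errors are present, $B$ contains both the sampling bias and the non-sampling bias.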
Non-sampling bias: For the sake of simplicity, assume that the two following steps are
involved in the randomization:
(i) for selecting the sample of units and
(ii) for selecting the survey personnel.
12 ORGANISATION OF NATIONAL SURVEYS, AND THE KENYA BUREAU OF STATISTICS (K.N.B.S)
Bureau of Statistics(K.N.B.S)
1. Act as the principal agency of the government for collecting, analysing and
disseminating statistical data in Kenya;
3. Conduct the Population and Housing Census every ten years, and such other
censuses and surveys as the Board may determine;
5. Establish standards and promote the use of best practices and methods in the
production and dissemination of statistical information across the National Statistical
System (NSS); and
6. Plan, authorise, coordinate and supervise all official statistical programmes
undertaken within the national statistical system.
13 PAST EXAMINATION PAPERS
13.1 Paper 1
W1-2-60-1-6
(d) In simple random sampling without replacement, verify that $\bar y$ is an unbiased
estimator of $\bar Y$ and that its sampling variance is given by
\[ Var(\bar y) = \frac{N-n}{N}\,\frac{S^2}{n}, \qquad \text{where } S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar Y\right)^2. \]
[6 marks]
(e) Consider a population consisting of 430 units. By complete enumeration of the
population it was found that $\bar Y = 19$ and $S^2 = 85.6$, these being the true population values.
With simple random samples, how many units must be taken to estimate $\bar Y$ within 10% of $\bar Y$,
apart from a chance of 1 in 20? [5 marks]
(f) The following table represents a summary of data for a complete census of all 440
villages in a sub-division in Kenya. The villages are stratified by the size of their agricultural
area under maize production into strata as shown below.
Stratum Size of Villages(acres) Ni Ȳi Si
was found to be 1800 pounds. Estimate the total sugar content of oranges and place a
bound on the error of estimation.
Oranges Sugar content(yi ) Weight of oranges(xi ) y i xi
1 8 4 1,0,0,1
2 14 4 0,3,0,0
3 9 4 1,0,2,7
4 12 4 0,0,1,5
Estimate the total number of home schooled children in the town using simple random
sampling method and ratio estimation method (No non-response)[8 marks]
(c) Prove that in probability proportional to size with replacement sampling,
\[ Var(\hat Y_{pps}) = \frac{1}{n}\sum_{i=1}^{N} P_i\left(\frac{Y_i}{P_i} - Y\right)^2. \]
[4 marks]
(d) Consider a population of N = 10,000 sampling units where you want to obtain a
systematic random sample of size n = 1000. [2 marks]
(i) How many systematic random samples are there and show using a diagram what
sampling units they consist of [2 marks]
(ii) Use this to explain the relationship between cluster sampling and systematic ran-
dom sampling [2 marks]
(iii) Building on the results in part (ii) explain why variances are difficult to calculate
for systematic random sampling [1 mark]
(iv) Suggest a modified form of systematic random sampling which solves the variance
problem in part (iii) [1 mark]
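As a starting point for part (d)(i), the possible systematic samples can be listed mechanically. This is a minimal sketch, assuming units are labelled 1 to N, that N is a multiple of n, and that a random start r is chosen from 1 to k = N/n; the function name `systematic_samples` is hypothetical.

```python
def systematic_samples(N, n):
    """Return the k = N // n possible systematic samples of size n
    from units labelled 1..N (assumes N is a multiple of n)."""
    if N % n != 0:
        raise ValueError("this sketch assumes N = n * k exactly")
    k = N // n  # sampling interval
    # A random start r in 1..k selects units r, r + k, r + 2k, ...
    return [list(range(r, N + 1, k)) for r in range(1, k + 1)]


samples = systematic_samples(10_000, 1_000)
print(len(samples))                      # number of possible samples
print(samples[0][:3], samples[0][-1])    # first sample: 1, 11, 21, ..., 9991
```

Each possible sample is fixed once the random start is chosen, so only one "cluster" of units is ever observed, which is the link to cluster sampling asked about in parts (ii) and (iii).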
QUESTION FOUR (20 MARKS)
(a) Clearly explain cluster sampling [2 marks]
(b) A mathematics achievement test was given to 486 students prior to their entering
a certain college. From these students a simple random sample of n = 10 students was
selected and their progress in calculus observed. Final calculus grades were then reported,
as given in the table below. It is known µx that for all 486 students taking the achievement
test. Estimate µY for this population. [8 marks]
Student                       1   2   3   4   5   6   7   8   9  10
Achievement test score, X    39  43  21  64  57  47  28  75  34  52
Final calculus grade, Y      65  78  52  82  92  89  73  98  56  75
(c) State and explain sources of non-sampling errors in surveys [2 marks]
(d) A nursery man wants to estimate the average height (in inches) of 1200 seedlings
in a field that is sub-divided into 50 plots that vary in size. A two-stage cluster sample
design produced the following data.
1 63 6 5, 2, 4, 3, 1,
2 57 8 4, 2, 7, 2, 7,
3 30 3 3, 2, 5
4 23 2 4, 4,
TOTAL 173 17
(i) Estimate the average height of seedlings in the field and the standard error of the
estimate [5 marks]
(ii) Construct a 95% confidence interval on the population mean [3 marks]