
Taita Taveta University (T.T.U)
School of Science and Informatics(S.S.I)

Department of Mathematics and Informatics(M&I)

THESE NOTES HAVE BEEN PREPARED BY

Noah Cheruiyot Mutai

BSc. Mathematics and Computer Science (JKUAT - First Class Honours)

MSc. Applied Statistics(JKUAT)

Doctor of Philosophy Candidate(JKUAT-Applied Statistics)

FOR

STA 2402 Design and Analysis of Sample Surveys.

January 2017


Contents
1 Preliminary 5
1.1 Course Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Course Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Course Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Pre-requisite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Learning and Teaching Methodologies: . . . . . . . . . . . . . . . . . . . . 6
1.6 Instructional Materials and Equipment . . . . . . . . . . . . . . . . . . . . 6
1.7 Assessment: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.8 Course Textbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.9 Course Journals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.10 Reference Textbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.11 Reference Journals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 The Basic Concepts 8


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Types of Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Properties of random sampling . . . . . . . . . . . . . . . . . . . . 10
2.3 Properties of estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Principal steps involved in planning and execution of a sample survey. . . . 11
2.5 Pilot Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Advantages of Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Simple Random Sampling(SRS) 17


3.1 Introduction and Description . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Simple Random Sampling with Replacement (SRSWR). . . . . . . . . . . 17
3.2.1 Definition and Estimation of Population Mean, Variance and Total 17
3.3 Simple Random Sampling without Replacement . . . . . . . . . . . . . . . 19
3.3.1 Definition and Estimation of Population Mean, Variance and Total 20
3.3.2 Confidence Intervals for Population Mean Ȳ and Total Y . . . . . . 24
3.3.3 Sampling For Proportions and Percentages . . . . . . . . . . . . . 25
3.3.4 Estimation of Population Proportion . . . . . . . . . . . . . . . . . 26
3.3.5 Estimation of population total or total number of count . . . . . . . 28
3.3.6 Confidence Interval estimation for P . . . . . . . . . . . . . . . . . 28
3.3.7 Determination of Sample Sizes . . . . . . . . . . . . . . . . . . . . 29
3.4 R Computing Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4 Stratified Random Sampling 34


4.1 Introduction and Description . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 Estimation of Population Mean, Variance and Total . . . . . . . . 35
4.1.3 Estimation of Variance . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Allocation problem and choice of sample sizes in different strata . . . . . . 37
4.2.1 Equal allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


4.2.2 Proportional Allocations . . . . . . . . . . . . . . . . . . . . . . . . 38


4.2.3 Optimal Allocation of Sample Sizes (Neyman Allocation) . . . . . 38
4.2.4 Sample size under proportional allocation for fixed cost and for fixed
variance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.5 Variances under different allocations . . . . . . . . . . . . . . . . . 40
4.2.6 Comparison of variances of sample mean under SRS with stratified
mean under proportional and optimal allocation: . . . . . . . . . . 41
4.2.7 Estimate of variance and confidence intervals . . . . . . . . . . . . . 42
4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5 Systematic sampling 45
5.1 Introduction and Description . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Estimation of Population Mean, Variance and Total . . . . . . . . . . . . 46
5.2.1 Estimation of population mean : When N = nk . . . . . . . . . . . 46
5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6 Cluster sampling 48
6.1 Introduction and Description . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2 Estimation of Population Mean, Variance and Total . . . . . . . . . . . . 50
6.2.1 Case of equal clusters . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

7 Ratio and Regression Estimation 52


7.1 Ratio Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.2 Bias and mean squared error of ratio estimator . . . . . . . . . . . . . . . . 53
7.3 Regression Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.3.1 Estimate of variance . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.3.2 Regression estimates when β is computed from sample . . . . . . . 56
7.3.3 Bias of Ȳˆreg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

8 Double Sampling (Two Phase Sampling) 59


8.1 Introduction and Description . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.1.1 Variance of τ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
8.2 Double sampling in ratio method of estimation . . . . . . . . . . . . . . . . 60

9 Varying Probability Sampling. 63


9.1 Introduction and Description . . . . . . . . . . . . . . . . . . . . . . . . . 63
9.2 PPS sampling with replacement (WR) . . . . . . . . . . . . . . . . . . . . 64
9.2.1 Estimation of Population Mean, Variance and Total . . . . . . . . 64
9.2.2 Varying probability scheme without replacement . . . . . . . . . . 65
9.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

10 Two Stage Sampling(Subsampling) 67


10.1 Introduction and Description . . . . . . . . . . . . . . . . . . . . . . . . . . 67
10.2 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
10.3 Estimation of population mean . . . . . . . . . . . . . . . . . . . . . . . . 69
10.4 Estimate of variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

11 Sources of Errors in Surveys 71


11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
11.2 Non-Sampling Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
11.2.1 Sources of non-sampling errors: . . . . . . . . . . . . . . . . . . . . 71
11.3 Sampling errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

12 Organisation of National Surveys, and the Kenya National Bureau of Statistics (K.N.B.S) 75

13 Past Examination Papers 76


13.1 Paper 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

14 References 81


1 Preliminary

1.1 Course Purpose

To enable students to apply the principles of survey sampling and to understand the
different methods used in sampling.

1.2 Course Objectives

At the end of the course the student should be able to;

1. Define a sample survey, and identify the advantages and principal steps in organizing
a survey.

2. Explain probability and purposive types of sample.

3. Apply simple random sampling both in proportions and percentages.

4. Explain the principles of estimating sample size.

5. Explain the methods of random sampling such as stratified, systematic, cluster,


multistage and proportional.

6. Determine the ratio and regression estimators.

7. Distinguish between sampling and non-sampling errors.

8. Explain how national surveys are conducted, and the work done by the Kenya
National Bureau of Statistics.

1.3 Course Description

Sample survey: definition, advantages and principal steps in organizing a survey. Types of
samples: probability and purposive. Simple random sampling: sampling proportions and
percentages; estimating sample size; stratified random, systematic, cluster and multistage
samples; selections with p.p.s (probability proportional to size). Ratio and regression


estimators, sampling and non-sampling errors, organisation of national surveys, and the
Central Bureau of Statistics. Use of computer packages.

1.4 Pre-requisite

Probability and Statistics II

1.5 Learning and Teaching Methodologies:

Students attain knowledge through lectures, seminars, tutorials and independent study, supplemented by student discussions and seminar paper presentations.

1.6 Instructional Materials and Equipment

Chalk/white board, Power-point projector, LCD Projector, Transparencies, films, slides


and computer.

1.7 Assessment:

1. Examination 70%

2. Continuous Assessment 30%

3. Total 100%

1.8 Course Textbooks

1. Yates, F. (1981). Sampling Methods for Censuses and Surveys. 4th ed. New York.

2. Cochran, W.G. (1977). Sampling Techniques. 3rd ed. New York: Wiley.

1.9 Course Journals

1. Statistics Surveys

2. Survey Methodology


1.10 Reference Textbooks

1. Deming, W.E. (1950). Some Theory of Sampling. New York: Dover. ISBN-13: 978-0486646848; ISBN-10: 048664684X.

2. Lohr, Sharon (1999). Sampling: Design and Analysis. Duxbury Press. ISBN-10: 0-534-35361-4.

3. Agarwal, B. (1995). Basic Statistics. Wiley Eastern.

1.11 Reference Journals

1. Annals of Statistics

2. Journal of the American Statistical Association

3. Biometrics


2 The Basic Concepts

2.1 Introduction

Sample survey, finite population sampling or survey sampling is a method of drawing an inference about the characteristics of a population or universe by observing only a part of the population. Such methods are extensively used by government bodies throughout the world for assessing, among others, different characteristics of the national economy, as required for making decisions and for the planning and projection of the future economic structure. Ideally, total information about the population is obtained through a census, where every individual in the population is involved in giving out information. However, most of the time, due to certain constraints to be discussed later, it is not always possible to carry out a census.
In a sample survey the purpose of the survey statistician is to estimate some function of the population parameter, say θ(y), by choosing a sample (a part of the population) and by observing the values of y only on the units selected in the sample. The statistician therefore wants to make an inference about the population by observing only a part of it. This is essential, and perhaps the only practical, method of inference about the characteristics of the population, since in many socio-economic investigations the survey population may be very large, containing say hundreds or thousands of units.

Definition 2.1. Survey population: A finite(survey) population is a collection of


known number N of identifiable units labelled 1, 2, 3, ..., i, ..., N where i stands for the
label as well as the physical unit labelled i. The number N is the size of the population.
The parametric functions of general interest for estimation are;

1. Population total, $Y = \sum_{i=1}^{N} Y_i$

2. Population mean, $\bar{Y} = \frac{Y}{N} = \frac{1}{N}\sum_{i=1}^{N} Y_i$

3. Population variance, $S_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$

4. Population coefficient of variation, $C_Y = \frac{S_Y}{\bar{Y}}$, where $S_Y$ is the population standard deviation and $\bar{Y}$ is the population mean.


Definition 2.2. Sample: A sample is a part of the population/subset of the population


selected for study. A sample may be drawn from a population either with replacement(wr)
or without replacement(wor). After a sample is selected , data are collected from the
sampled units. We shall denote by yi the value of y on the unit selected at the ith draw
(i = 1, 2, ...., n). Thus for example if the sample is S = {2, 3, 2} ,y1 = Y2 , y2 = Y3 , y3 = Y2 .
Clearly yi is a random variable whose possible values lie in the set {Y1 , Y2 , ...., YN }
For a sample s, we shall denote some statistics as follows;

1. Sample total, $y = \sum_{i=1}^{n} y_i$

2. Sample mean, $\bar{y} = \frac{y}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i$

3. Sample variance, $s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$

4. Sample coefficient of variation, $c_y = \frac{s_y}{\bar{y}}$, where $s_y$ is the sample standard deviation and $\bar{y}$ is the sample mean.

Definition 2.3. Sampling units: This refers to the individual items whose character-
istics are to be measured in the sample survey.

Definition. Sampling frame: This is the list of all sampling units. It may be a list of units with identification and particulars, or a map showing the boundaries of sampling units. For example, a manufacturing firm that wants to determine how popular a newly manufactured product is within the community needs a frame for the survey. The firm may decide to concentrate its survey on urban residential areas only; in this case, a complete list of estates in urban areas serves as the frame. The residents in the chosen estates will be interviewed and inferences made.

Definition 2.4. Sampled population: It is the set of individuals in the sampling frame. It is actually a subset of the target population. Note: The sampled population is not necessarily the same as the target population.

Definition. Sampling scheme: It is the technique by which the elements that constitute the sample are obtained from the population.


2.2 Types of Sampling

1. Haphazard sampling: no scheme has been used at all; it is neither probability nor non-probability sampling.

2. Purposive or judgemental sampling or non probability sampling.

3. Probability/random sampling- statistical theory is used and the kind of inferences


made are based on statistical procedures. There is some element of chance associated
with selection of items into the sample.

2.2.1 Properties of random sampling

1. We are able to define the set of distinct samples, S1, S2, ..., SN, which the procedure is capable of selecting if applied to a specific population. This means that we can say precisely which sampling units belong to S1, which belong to S2, and so on.

2. Each possible sample Si has assigned to it a known probability of selection πi .

3. We select one of the Si by a process in which each Si receives its appropriate probability πi of being selected.

4. The method for computing the estimate from the sample must be stated and must
lead to a unique estimate for any specific sample.

The simplest type of sampling is SRS(simple random sampling). We shall also make
use of common terms in statistics like; statistic, estimator, point estimation and interval
estimation, ratio and regression estimation. Design and analysis of sample survey is about
knowing those estimators and design procedures that are good.

2.3 Properties of estimators

1. Precision:- how much variation there is in the estimation from sample to sample.

2. Trueness:- on average how close is the estimate to the population characteristics


being estimated.


3. Accuracy:- a combination of precision and trueness. Precision of an estimator will be measured by its variance, e.g. if the estimator is $X$ then;

$$Var(X) = \sigma_X^2 = E\left[X - \mu_X\right]^2 \qquad (2.1)$$

where $\mu_X = E[X]$.
Trueness of an estimator will be measured by the bias, which is defined as the difference between the expectation of the estimator and the population parameter $\theta$ of which it is an estimate.

$$Bias(X) = E[X] - \theta \qquad (2.2)$$

If $E[X] - \theta = 0$ then $X$ is an unbiased estimator of $\theta$.

Accuracy is measured using the mean squared error (MSE):

$$MSE(X) = \sigma_X^2 + \left(Bias(X)\right)^2 \qquad (2.3)$$

Generally estimators which have low variance (high precision) and low bias are pre-
ferred.
Remark: There are various methods for deriving estimators: the Method of Moments (MME), OLS (Ordinary Least Squares), MLE (Maximum Likelihood Estimation), Bayesian estimation, the Rao-Blackwell method, the Minimum Chi-Square method and the Minimum Distance method. There exist several criteria for comparing estimators; these include bias, variance, MSE (mean squared error), consistency, sufficiency and location or scale invariance.
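As an illustrative sketch only (in the package-free R style of Section 3.4), the bias, variance and MSE of competing estimators can be approximated by repeated sampling. The population, sample size and number of replications below are arbitrary choices, and the comparison of the sample mean with the sample median is purely for illustration.

# Approximate the bias, variance and MSE of two estimators of the population mean.
set.seed(1)
pop <- rexp(5000, rate = 1/10)   # a skewed "population" of N = 5000 units (illustrative)
n <- 30; B <- 10000              # sample size and number of simulated samples
est.mean <- numeric(B); est.med <- numeric(B)
for (k in 1:B) {
  s <- sample(pop, n)            # simple random sample without replacement
  est.mean[k] <- mean(s)
  est.med[k]  <- median(s)
}
mu <- mean(pop)                  # the population parameter being estimated
c(bias = mean(est.mean) - mu, variance = var(est.mean), mse = mean((est.mean - mu)^2))
c(bias = mean(est.med)  - mu, variance = var(est.med),  mse = mean((est.med  - mu)^2))

The simulated MSE of each estimator should be approximately its variance plus the squared bias, in line with equation (2.3).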

2.4 Principal steps involved in planning and execution of a sample

survey.

The broad steps to conduct any sample surveys are as follows:

1. Objective of the survey: The objective of the survey has to be clearly defined
and well understood by the person planning to conduct it. It is expected from the
statistician to be well versed with the issues to be addressed in consultation with


the person who wants to get the survey conducted. In complex surveys, sometimes
the objective is forgotten and data is collected on those issues which are far away
from the objectives.

2. Population to be sampled: Based on the objectives of the survey, decide the


population from which the information can be obtained. For example, population
of farmers is to be sampled for an agricultural survey whereas the population of
patients has to be sampled for determining the medical facilities in a hospital.

3. Data to be collected: It is important to decide which data are relevant for

fulfilling the objectives of the survey and to ensure that no essential data are omitted.
Sometimes, too many questions are asked and some of their outcomes are never uti-
lized. This lowers the quality of the responses and in turn results in lower efficiency
in the statistical inferences.

4. Degree of precision required: The results of any sample survey are always
subjected to some uncertainty. Such uncertainty can be reduced by taking larger
samples or using superior instruments. This involves more cost and more time. So
it is very important to decide about the required degree of precision in the data.
This needs to be conveyed to the surveyor also.

5. Method of measurement: The choice of measuring instrument and the method


to measure the data from the population needs to be specified clearly. For exam-
ple, the data has to be collected through interview, questionnaire, personal visit,
combination of any of these approaches, etc. The forms in which the data is to
be recorded so that the data can be transferred to mechanical equipment for easily
creating the data summary etc. is also needed to be prepared accordingly.

6. The frame: The sampling frame has to be clearly specified. The population is
divided into sampling units such that the units cover the whole population and
every sampling unit is tagged with identification. The list of all sampling units is
called the frame. The frame must cover the whole population and the units must not


overlap each other in the sense that every element in the population must belong to
one and only one unit. For example, the sampling unit can be an individual member
in the family or the whole family.

7. Selection of sample: The size of the sample needs to be specified for the given
sampling plan. This helps in determining and comparing the relative cost and time of
different sampling plans. The method and plan adopted for drawing a representative
sample should also be detailed.

8. The Pre-test: It is advised to try the questionnaire and field methods on a small
scale. This may reveal some troubles and problems beforehand which the surveyor
may face in the field in large scale surveys.

9. Organization of the field work: How to conduct the survey, how to handle
business administrative issues, providing proper training to surveyors, procedures,
plans for handling the non-response and missing observations etc. are some of the
issues which need to be addressed for organizing the survey work in the fields. The
procedure for early checking of the quality of return should be prescribed. It should
be clarified how to handle the situation when the respondent is not available.

10. Summary and analysis of data: It is to be noted that based on the objectives
of the data, the suitable statistical tool is decided which can answer the relevant
questions. In order to use the statistical tool, a valid data set is required and this
dictates the choice of responses to be obtained for the questions in the questionnaire,
e.g., the data has to be qualitative, quantitative, nominal, ordinal etc. After getting
the completed questionnaire back, it needs to be edited to amend the recording errors
and delete the erroneous data. The tabulating procedures, methods of estimation
and tolerable amount of error in the estimation needs to be decided before the start
of survey. Different methods of estimation may be available to get the answer of
the same query from the same data set. So the data needs to be collected which is
compatible with the chosen estimation procedure.

11. Information gained for future surveys: The completed surveys work as guide


for improved sample surveys in future. Beside this they also supply various types
of prior information required to use various statistical tools, e.g., mean, variance,
nature of variability, cost involved etc. Any completed sample survey acts as a
potential guide for the surveys to be conducted in the future. It is generally seen
that the things always do not go in the same way in any complex survey as planned
earlier. Such precautions and alerts help in avoiding the mistakes in the execution
of future surveys.

2.5 Pilot Survey

In planning a survey efficiently, some prior information about the population under consideration and about the operational and cost aspects of data collection will be needed. When such information is not available, a pilot survey, that is, a small preliminary survey, is usually carried out in advance to provide it and to test the questionnaire and the field procedures before the main survey.

2.6 Advantages of Sampling

Sample surveys have potential advantages over complete enumeration(census). They in-
clude;

1. Reduced cost. If data are secured from only a small fraction of the aggregate,
expenditures may be expected to be smaller than if a complete census is attempted

2. Greater speed. For the same reason, the data can be collected and summarized more
quickly with a sample than with a complete count. This may be a vital consideration
when the information is urgently needed.

3. Greater scope. In certain types of inquiry, highly trained personnel or specialized


equipment, limited in availability, must be used to obtain the data. A complete
census may then be impracticable: the choice lies between obtaining the information
by sampling or not at all. Thus surveys which rely on sampling have more scope
and flexibility as to the types of information that can be obtained.

4. Greater accuracy. Because personnel of higher quality can be employed and can be
given intensive training, a sample may actually produce more accurate results than


the kind of complete enumeration that it is feasible to take.

5. When a survey involves risky tests such as testing a new drug, sampling must be
used.

2.7 Exercises

1. State and explain the elements of a sample survey design.

2. State and explain the factors that affect the sample survey design.

2.8 Solutions

1.

→ Specify the target population.

→ Specify the sampling unit.

→ Specify the precision and objectives of survey and declare other variables of interest
(auxiliary information,include explanatory variables, stratification variables).

→ Instruments we use.

→ Develop or decide on sampling design/sampling scheme.

→ Determine the sample size allocation.

→ Explain the data processing procedures.

→ Give an outline of the anticipated final report.

2.

→ The choice of sampling units and the sample sizes.

→ The objective of the sample survey design, which is usually to enable the production of estimates of the desired quantities with adequate accuracy.


→ The nature of the survey, i.e. either a descriptive survey or an analytical survey (model development):

1. Descriptive surveys focus on the estimation of overall population or even sub-population characteristics such as means or totals.

2. Analytical surveys focus on the estimation of relationships among variables and on tests of hypotheses, for instance whether it is reasonable to conclude that two populations have been generated from the same probability distribution, or whether an economic development strategy has a positive impact.


3 Simple Random Sampling(SRS)

3.1 Introduction and Description

This is the simplest form of sampling. It is a technique of selecting a sample from a target population in such a way that any unit in the population has an equal and independent chance of being selected into the sample. Therefore, for a population of size $N$, the probability of picking a particular unit at the first draw is $\frac{1}{N}$, at the second draw it is $\frac{1}{N-1}$, at the third draw $\frac{1}{N-2}$, and hence at the $r$th draw it is $\frac{1}{N-(r-1)}$. In performing SRS one will do it either with replacement (wr) or without replacement (wor).
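As a minimal sketch (anticipating the R notes in Section 3.4), both variants can be drawn with the base R function sample(); the population labels 1 to 20 and the sample size 5 below are arbitrary.

# Draw a simple random sample of n = 5 labels from a population of N = 20 units.
N <- 20; n <- 5
sample(1:N, n)                    # without replacement (srswor): no label can repeat
sample(1:N, n, replace = TRUE)    # with replacement (srswr): a label may repeat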

3.2 Simple Random Sampling with Replacement (SRSWR).

3.2.1 Definition and Estimation of Population Mean, Variance and Total

A sample is said to be selected by simple random sampling with replacement (srswr) in $n$ draws from a population of size $N$ if the sample is drawn by observing the following rules;
1. At each draw, each unit in the population has the same chance of being selected.
2. A unit selected at a draw is returned to the population before the next draw.
The same unit, therefore, might be selected more than once. Thus the probability of getting a sample (sequence) $i_1, i_2, ..., i_n$ is;

$$P(\{i_1, i_2, ..., i_n\}) = \frac{1}{N}\cdot\frac{1}{N}\cdots\frac{1}{N} = \frac{1}{N^n} \qquad (3.1)$$

There are $N^n$ possible samples (sequences) in the sample space $S$, for a given $(N, n)$.
A srswr of $n$ draws from a population of size $N$ will be denoted by srswr$(N, n)$.

Theorem 3.1. In srswr(N, n) sample mean ȳ is an unbiased estimator of the population


mean Ȳ .

Proof:
$$E(\bar{y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(y_i) = E(y_1) \qquad (3.2)$$

since $y_1, y_2, ..., y_n$ are independently and identically distributed (iid) random variables with;

$$P(y_i = Y_k) = \frac{1}{N}, \quad k = 1, 2, ..., N, \quad i = 1, 2, ..., n. \qquad (3.3)$$

Now, $E(y_1) = \frac{1}{N}\sum_{k=1}^{N} Y_k = \bar{Y}$. Hence, $E(\bar{y}) = \bar{Y}$.

Alternatively, let $t_i$ be the number of times unit $i$ occurs in the sample. Then $(t_1, t_2, ..., t_N)$ follows a multinomial distribution with $E(t_i) = \frac{n}{N}$, $Var(t_i) = \frac{n}{N}\left(1 - \frac{1}{N}\right)$ and $Cov(t_i, t_j) = -\frac{n}{N^2}$ $(i \neq j = 1, 2, ..., N)$ (show this).
Now, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{N} t_i Y_i \Rightarrow E(\bar{y}) = E\left(\frac{1}{n}\sum_{i=1}^{N} t_i Y_i\right) = \frac{1}{n}\sum_{i=1}^{N} Y_i E(t_i)$. But $E(t_i) = \frac{n}{N}$, so

$$E(\bar{y}) = \frac{1}{n}\cdot\frac{n}{N}\sum_{i=1}^{N} Y_i = \bar{Y} \qquad (3.4)$$

Hence $E(\bar{y}) = \bar{Y}$.

Corollary 3.1. In srswr$(N, n)$ an unbiased estimator of $Y$ is $\hat{Y} = N\bar{y}$. Prove this.

Theorem 3.2. In srswr$(N, n)$, the variance of the sample mean is given by

$$Var(\bar{y}) = \frac{\sigma^2}{n}, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 \qquad (3.5)$$

Proof:
$$Var(\bar{y}) = Var\left(\frac{1}{n}\sum_{i=1}^{N} t_i Y_i\right) = \frac{1}{n^2}\sum_{i=1}^{N} Y_i^2\, Var(t_i) + \frac{1}{n^2}\sum\sum_{i \neq j} Y_i Y_j\, Cov(t_i, t_j)$$
but $Var(t_i) = \frac{n}{N}\left(1 - \frac{1}{N}\right)$ and $Cov(t_i, t_j) = -\frac{n}{N^2}$ $(i \neq j = 1, 2, ..., N)$, so
$$Var(\bar{y}) = \frac{1}{Nn}\left(1 - \frac{1}{N}\right)\sum_{i=1}^{N} Y_i^2 - \frac{1}{N^2 n}\left[\left(\sum_{i=1}^{N} Y_i\right)^2 - \sum_{i=1}^{N} Y_i^2\right]$$
$$= \frac{1}{Nn}\sum_{i=1}^{N} Y_i^2 - \frac{1}{N^2 n}\left(\sum_{i=1}^{N} Y_i\right)^2 = \frac{1}{Nn}\sum_{i=1}^{N} Y_i^2 - \frac{1}{N^2 n}\left(N^2\bar{Y}^2\right)$$
$$= \frac{1}{n}\left[\frac{1}{N}\sum_{i=1}^{N} Y_i^2 - \bar{Y}^2\right] = \frac{1}{n}\cdot\frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$$

$$= \frac{\sigma^2}{n} \qquad (3.6)$$

Note: $(x_1 + x_2)^2 = x_1^2 + 2x_1 x_2 + x_2^2 \Rightarrow \left(\sum_{i=1}^{N} x_i\right)^2 = \sum_{i=1}^{N} x_i^2 + \sum\sum_{i \neq j} x_i x_j$

Corollary 3.2. In srswr$(N, n)$ the variance of the sample mean may also be written as;

$$Var(\bar{y}) = \frac{N-1}{Nn} S_y^2; \qquad S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 \qquad (3.7)$$

Corollary 3.3. In srswr$(N, n)$, $Var(\hat{Y}) = \frac{N^2\sigma^2}{n}$.

As $n$ increases, $Var(\bar{y})$ decreases; but even if $n = N$, $Var(\bar{y})$ does not vanish. Also, in srswr, $n$ may be arbitrarily large.
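As a sketch only (again in the R style of Section 3.4), the result $Var(\bar{y}) = \sigma^2/n$ under srswr can be checked by simulation; the small population below is the one used in Exercise 2 of Section 3.5, and the number of replications is arbitrary.

# Check Var(ybar) = sigma^2/n under simple random sampling WITH replacement.
set.seed(2)
Y <- c(8, 3, 1, 11, 4, 7)                  # small population (Exercise 2, Section 3.5)
N <- length(Y); n <- 2
sigma2 <- sum((Y - mean(Y))^2) / N         # population variance with divisor N
B <- 50000
ybar <- replicate(B, mean(sample(Y, n, replace = TRUE)))
c(theory = sigma2 / n, simulated = var(ybar))   # the two values should be close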

3.3 Simple Random Sampling without Replacement

A sample of size $n$ is said to be selected by simple random sampling without replacement (srswor) if the selection procedure is such that every possible sequence (sample) has the same chance of being selected. The sampling design is achieved by drawing a sample by the following draw-by-draw procedure;
1. At each draw each available unit in the population has the same chance of being selected.
2. A unit selected at a draw is removed from the population before the next draw.
If the population is of size $N$ and we require a simple random sample without replacement of size $n$, then this is chosen at random from the $\binom{N}{n}$ distinct samples. Each of the $\binom{N}{n}$ samples has the same probability $\frac{1}{\binom{N}{n}}$, i.e. $\binom{N}{n}^{-1}$, of being selected.

Lemma 3.1. For a srswor$(N, n)$ design the probability of a specified unit being selected at any given draw is $\frac{1}{N}$, i.e.

$$P_r(i_k) = \frac{1}{N}, \quad r = 1, 2, ..., n \qquad (3.8)$$

for any given $i_k$.

Lemma 3.2. For a srswor$(N, n)$ the probability of two specified units being selected at any two given draws is $\frac{1}{N}\cdot\frac{1}{N-1}$, i.e.

$$P_{r,s}(i_r, i_s) = \frac{1}{N(N-1)}, \quad r < s, \quad r, s = 1, 2, ..., n \qquad (3.9)$$

for any given $i_r \neq i_s$.

Lemma 3.3. For a srswor$(N, n)$ the probability that a specified unit is included in the sample is $\frac{n}{N}$, i.e.

$$P(i \in s) = \pi_i \;(\text{say}), \quad i = 1, 2, ..., N \qquad (3.10)$$

Lemma 3.4. For a srswor$(N, n)$ the probability that any two specified units are included in the sample is $\frac{n(n-1)}{N(N-1)}$, i.e.

$$P(i \in s,\, j \in s) = \pi_{ij} \;(\text{say}), \quad i \neq j, \quad i, j = 1, 2, ..., N \qquad (3.11)$$

The quantities $\pi_i$ and $\pi_{ij}$ (as defined in Lemmas 3.3 and 3.4) are respectively the inclusion probabilities of unit $i$ and of the pair $(i, j)$ in the sample. These are called, respectively, the first order and second order inclusion probabilities of a design.
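The first and second order inclusion probabilities of Lemmas 3.3 and 3.4 can be checked by a short simulation; this is a sketch only, and the choices of N = 6, n = 3 and the number of replications are arbitrary.

# Estimate pi_1 and pi_{1,2} under srswor(N, n) by simulation and compare with theory.
set.seed(3)
N <- 6; n <- 3; B <- 20000
in1  <- logical(B); in12 <- logical(B)
for (k in 1:B) {
  s <- sample(1:N, n)                      # one srswor(N, n) sample
  in1[k]  <- 1 %in% s                      # is unit 1 included?
  in12[k] <- all(c(1, 2) %in% s)           # are units 1 and 2 both included?
}
c(pi_1  = mean(in1),  theory = n / N)
c(pi_12 = mean(in12), theory = n * (n - 1) / (N * (N - 1)))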

3.3.1 Definition and Estimation of Population Mean, Variance and Total

We consider the problem of estimating, Ȳ , Y and S 2 in srswor. Consider a population


of size N and let n be the size of the simple random sample drawn from this population
without replacement. Now let $a_i$ equal 1 if the $i$th unit is selected and 0 otherwise, $i = 1, 2, ..., N$. Then $a_i$ is a random variable such that;


$E(a_i) = 1 \times$ probability that the $i$th unit is selected $= 1 \times \frac{n}{N} = \frac{n}{N}$, the inclusion probability.
$E(a_i a_j) = 1 \times$ probability that the $i$th and $j$th units are both selected $= 1 \times \frac{n}{N}\cdot\frac{n-1}{N-1} = \frac{n(n-1)}{N(N-1)}$.

Therefore the sample total is

$$y = \sum_{i=1}^{N} a_i Y_i = \sum_{i=1}^{n} y_i \qquad (3.12)$$

and the sample mean is given as;

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{N} a_i Y_i = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (3.13)$$

Theorem 3.3. In srswor$(N, n)$ the sample mean $\bar{y}$ is an unbiased estimator of the population mean $\bar{Y}$.

Proof: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{N} a_i Y_i$, so
$$E(\bar{y}) = E\left(\frac{1}{n}\sum_{i=1}^{N} a_i Y_i\right) = \frac{1}{n}\sum_{i=1}^{N} Y_i E(a_i) = \frac{1}{n}\sum_{i=1}^{N} Y_i\cdot\frac{n}{N} = \frac{1}{N}\sum_{i=1}^{N} Y_i = \bar{Y}$$

Corollary 3.4. For srswor(N, n), Ŷ = N ȳ is an unbiased estimator of the population


total Y.

Theorem 3.4. In srswor$(N, n)$, $Var(\bar{y}) = \frac{N-n}{Nn}S^2$, where $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$.

Proof:
From $Var(y) = E(y^2) - (E(y))^2$ it follows that $Var(\bar{y}) = E(\bar{y}^2) - (E(\bar{y}))^2$, and from Theorem 3.3, $E(\bar{y}) = \bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$.
Next,
$$E(\bar{y}^2) = E\left[\left(\frac{1}{n}\sum_{i=1}^{N} a_i Y_i\right)^2\right] = E\left[\frac{1}{n^2}\sum_{i=1}^{N} a_i Y_i^2 + \frac{1}{n^2}\sum\sum_{i \neq j} a_i a_j Y_i Y_j\right]$$
(since $a_i^2 = a_i$; take the expectation inside and use $E(a_i) = \frac{n}{N}$ and $E(a_i a_j) = \frac{n(n-1)}{N(N-1)}$)
$$= \frac{1}{nN}\sum_{i=1}^{N} Y_i^2 + \frac{n-1}{Nn(N-1)}\sum\sum_{i \neq j} Y_i Y_j$$
But $\sum\sum_{i \neq j} Y_i Y_j = \left(\sum_{i=1}^{N} Y_i\right)^2 - \sum_{i=1}^{N} Y_i^2$, so
$$E(\bar{y}^2) = \left[\frac{1}{nN} - \frac{n-1}{Nn(N-1)}\right]\sum_{i=1}^{N} Y_i^2 + \frac{n-1}{Nn(N-1)}\left(\sum_{i=1}^{N} Y_i\right)^2 = \frac{N-n}{Nn(N-1)}\sum_{i=1}^{N} Y_i^2 + \frac{n-1}{Nn(N-1)}\left(\sum_{i=1}^{N} Y_i\right)^2$$
Therefore,
$$Var(\bar{y}) = \frac{N-n}{Nn(N-1)}\sum_{i=1}^{N} Y_i^2 + \left[\frac{n-1}{Nn(N-1)} - \frac{1}{N^2}\right]\left(\sum_{i=1}^{N} Y_i\right)^2 = \frac{N-n}{Nn(N-1)}\sum_{i=1}^{N} Y_i^2 - \frac{N-n}{N^2 n(N-1)}\left(\sum_{i=1}^{N} Y_i\right)^2$$
$$= \frac{N-n}{Nn(N-1)}\left[\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right]$$
But $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 = \frac{1}{N-1}\left[\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right]$. Therefore;

$$Var(\bar{y}) = \frac{N-n}{Nn} S^2 \qquad (3.14)$$

on simplification.

Theorem 3.5. In srswor$(N, n)$ an unbiased estimator of $Var(\bar{y})$ is $\frac{N-n}{Nn}s^2$, where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$.

Prove this.

Theorem 3.6. In srswor$(N, n)$ the sample variance is an unbiased estimator of the population variance, i.e. $E(s^2) = S^2$, where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ and $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$.

Proof:
$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right) = \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right]$$
$$= \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i^2 + \sum\sum_{i \neq j} y_i y_j\right)\right] \quad \text{(open the brackets and simplify)}$$
$$= \frac{1}{n-1}\left[\left(1 - \frac{1}{n}\right)\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\sum\sum_{i \neq j} y_i y_j\right] = \frac{1}{n}\sum_{i=1}^{n} y_i^2 - \frac{1}{n(n-1)}\sum\sum_{i \neq j} y_i y_j$$
Taking expectations on both sides we have,
$$E(s^2) = \frac{1}{n} E\left(\sum_{i=1}^{n} y_i^2\right) - \frac{1}{n(n-1)} E\left(\sum\sum_{i \neq j} y_i y_j\right)$$
but $E\left(\sum_{i=1}^{n} y_i^2\right) = E\left(\sum_{i=1}^{N} a_i Y_i^2\right) = \frac{n}{N}\sum_{i=1}^{N} Y_i^2$ since $E(a_i) = \frac{n}{N}$, and
$E\left(\sum\sum_{i \neq j} y_i y_j\right) = E\left(\sum\sum_{i \neq j} a_i a_j Y_i Y_j\right) = \frac{n}{N}\cdot\frac{n-1}{N-1}\sum\sum_{i \neq j} Y_i Y_j$ since $E(a_i a_j) = \frac{n(n-1)}{N(N-1)}$.
Therefore;
$$E(s^2) = \frac{1}{n}\cdot\frac{n}{N}\sum_{i=1}^{N} Y_i^2 - \frac{1}{n(n-1)}\cdot\frac{n(n-1)}{N(N-1)}\sum\sum_{i \neq j} Y_i Y_j = \frac{1}{N}\sum_{i=1}^{N} Y_i^2 - \frac{1}{N(N-1)}\sum\sum_{i \neq j} Y_i Y_j$$
$$= \frac{1}{N}\sum_{i=1}^{N} Y_i^2 - \frac{1}{N(N-1)}\left[\left(\sum_{i=1}^{N} Y_i\right)^2 - \sum_{i=1}^{N} Y_i^2\right] = \left[\frac{1}{N} + \frac{1}{N(N-1)}\right]\sum_{i=1}^{N} Y_i^2 - \frac{1}{N(N-1)}\left(\sum_{i=1}^{N} Y_i\right)^2$$
$$= \frac{1}{N-1}\left[\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right] = S^2$$
Hence;

$$E(s^2) = S^2 \qquad (3.15)$$
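As a sketch, the unbiasedness results $E(\bar{y}) = \bar{Y}$ and $E(s^2) = S^2$ can be verified exactly by enumerating every srswor sample of a tiny population; the population is the one used in the exercises of Section 3.5, and the sample size 3 matches Exercise 3.

# Exact check of E(ybar) = Ybar and E(s^2) = S^2 over all srswor samples of size 3.
Y <- c(8, 3, 1, 11, 4, 7); N <- length(Y); n <- 3
idx <- combn(N, n)                               # every possible sample, each equally likely
ybars <- apply(idx, 2, function(s) mean(Y[s]))
s2s   <- apply(idx, 2, function(s) var(Y[s]))    # var() uses divisor n - 1, as in the notes
c(mean_of_ybar = mean(ybars), Ybar = mean(Y))
c(mean_of_s2   = mean(s2s),   S2   = var(Y))     # var(Y) uses divisor N - 1, i.e. S^2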


 
Corollary 3.5. For srswor$(N, n)$, an unbiased estimator of the variance of $\hat{Y}$ is $\widehat{Var}(\hat{Y}) = \frac{N(N-n)}{n}s^2$.

Proof: $\widehat{Var}(\hat{Y}) = \widehat{Var}(N\bar{y}) = N^2\,\widehat{Var}(\bar{y}) = N^2\cdot\frac{N-n}{Nn}s^2 = \frac{N(N-n)}{n}s^2$, which completes the proof.

Corollary 3.6. An estimator of the standard error of $\bar{y}$ is $\hat{\sigma}(\bar{y}) = \sqrt{\frac{N-n}{Nn}}\,s$. An estimator of the coefficient of variation is $c(\bar{y}) = \sqrt{\frac{N-n}{Nn}}\,\frac{s}{\bar{y}}$. $c(\bar{y})$ is a ratio estimator and a biased estimator of $C(\bar{y})$.

NOTE:

1. The sample mean in srswor$(N, n)$ is a better estimator of $\bar{Y}$ (in the smaller variance sense) than the sample mean in srswr$(N, n)$. Proof: $Var(\bar{y}\,|\,srswr) - Var(\bar{y}\,|\,srswor) = \frac{n-1}{Nn}S^2 > 0$ for $n > 1$.

2. In sampling from an infinite population (where each $Y_i$ is an independently and identically distributed random variable) with variance of each random variable $\sigma^2$, $Var(\bar{y}) = \frac{\sigma^2}{n}$. In simple random sampling with replacement, draws may be made an infinite number of times and $Var(\bar{y}) = \frac{\sigma^2}{n}$. In simple random sampling without replacement, however, $Var(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}$. The quantity $\left(1 - \frac{n}{N}\right)$ appearing in the expression above is a correction factor for the finite size of the population and is called the finite population correction factor (fpc) or simply the finite multiplier. If $n$ is very small compared to $N$, the fpc is close to unity and the sampling variance of $\bar{y}$ in srswor will be approximately the same as in srswr. If $N$ is very small, say $N \leq 10$, then whatever $n$, the fpc is not negligible and therefore there is considerable gain in using srswor over srswr.
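A one-line sketch of how the finite population correction behaves for a few illustrative sampling fractions (the values of n/N below are arbitrary):

# Finite population correction 1 - n/N for several sampling fractions.
f <- c(0.01, 0.05, 0.10, 0.50)
data.frame(sampling_fraction = f, fpc = 1 - f)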

3.3.2 Confidence Intervals for Population Mean Ȳ and Total Y

The sample mean ȳ and the variance s2 are point estimates of the unknown population
mean and variance respectively. An interval estimate of unknown population parameter
is a random interval constructed such that it has a given probability of including the
parameter. Consider a population with unknown parameter $\theta$; if one can find an interval $(a, b)$ such that;

P (a ≤ θ ≤ b) = 0.95 (3.16)

then we say that (a, b) is a 95% confidence interval for θ. It is important to realize
that the θ is fixed and the intervals themselves vary.
Some conditions exist under which the distribution of the sample mean in simple random sampling tends to the normal distribution. If the sample size is not too small and the distribution of the population from which the sample is drawn is not very different from the normal, then in srswor the sample mean $\bar{y}$ is approximately normal with mean $\bar{Y}$ and standard deviation $\sqrt{\frac{N-n}{Nn}}\,S$, i.e.

$$\bar{y} \sim N\left(\bar{Y},\; \frac{N-n}{Nn}S^2\right) \qquad (3.17)$$

$$z = \frac{\bar{y} - \bar{Y}}{\sqrt{\frac{N-n}{Nn}}\,S} \sim N(0, 1)$$

Hence
$$P\left(-z_{\alpha/2} \leq \frac{\bar{y} - \bar{Y}}{\sqrt{\frac{N-n}{Nn}}\,S} \leq z_{\alpha/2}\right) = 1 - \alpha \;\Rightarrow\; P\left(\bar{y} - z_{\alpha/2}\sqrt{\frac{N-n}{Nn}}\,S \leq \bar{Y} \leq \bar{y} + z_{\alpha/2}\sqrt{\frac{N-n}{Nn}}\,S\right) = 1 - \alpha$$
where $z_{\alpha/2}$ is the $100\left(1 - \frac{\alpha}{2}\right)\%$ point of the standard normal distribution. Therefore;

$$\left(\bar{y} - z_{\alpha/2}\sqrt{\frac{N-n}{Nn}}\,S,\;\; \bar{y} + z_{\alpha/2}\sqrt{\frac{N-n}{Nn}}\,S\right) \qquad (3.18)$$

is the $100(1 - \alpha)\%$ confidence interval for $\bar{Y}$. For $\alpha = 0.05, 0.025, 0.01$ the values of $z_{\alpha/2}$ are 1.96, 2.24 and 2.58 respectively.


Example:
In a private library, the books are kept on 130 shelves of similar size. The numbers of books on 15 shelves picked at random were found to be 28, 23, 25, 33, 31, 18, 22, 29, 30, 22, 26, 20, 21, 28 and 25. Estimate the total number $Y$ of books in the library and calculate an approximate 95% confidence interval for $Y$.
Solution:
$N = 130$, $n = 15$, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{15}(28 + 23 + \cdots + 25) = 25.4$.
$\hat{Y} = N\bar{y} = 130 \times 25.4 = 3302$. The 95% confidence interval is given by;
$$\hat{Y} \pm N\,z_{0.025}\sqrt{\widehat{var}(\bar{y})}, \quad \text{where } \widehat{var}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n} \text{ and } s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)$$
Here $\sum_{i=1}^{15} y_i^2 = 28^2 + 23^2 + \cdots + 25^2 = 9947$ and $n\bar{y}^2 = 15 \times (25.4)^2$, which gives $s^2 \approx 19.26$ and $\widehat{var}(\bar{y}) \approx 1.14$. The 95% confidence interval for $Y$ at $\alpha = 0.05$ is therefore;
$$\hat{Y} = 3302 \pm 130 \times 1.96 \times \sqrt{1.14} = 3302 \pm 272.05$$
$$\Rightarrow 3029.95 \leq Y \leq 3574.05$$
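The library example can be reproduced in a few lines of R (a sketch in the style of Section 3.4; the data are exactly those given in the example):

# Reproduce the library-shelves example: estimate Y and an approximate 95% CI.
y <- c(28, 23, 25, 33, 31, 18, 22, 29, 30, 22, 26, 20, 21, 28, 25)
N <- 130; n <- length(y)
Yhat   <- N * mean(y)                         # estimated total number of books
v_ybar <- (1 - n / N) * var(y) / n            # estimated variance of the sample mean
Yhat + c(-1, 1) * 1.96 * N * sqrt(v_ybar)     # approximate 95% confidence interval for Y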

3.3.3 Sampling For Proportions and Percentages

In many situations the characteristic under study, on which the observations are collected, is qualitative in nature. For example, the responses of customers in many marketing surveys are based on replies like 'yes' or 'no', 'agree' or 'disagree', etc. Sometimes the re-
spondents are asked to arrange several options in the order like first choice, second choice
etc. Sometimes the objective of the survey is to estimate the proportion or the percent-
age of brown eyed persons, unemployed persons, graduate persons or persons favoring a


proposal, etc. In such situations, the first question arises how to do the sampling and
secondly how to estimate the population parameters like population mean, population
variance, etc.
The same sampling procedures that are used for drawing a sample in the case of quantitative characteristics can also be used for drawing a sample for a qualitative characteristic. So the sampling procedures remain the same irrespective of the nature of the characteristic under study, either qualitative or quantitative. For example, the SRSWOR and SRSWR procedures for drawing the samples remain the same for qualitative and quantitative characteristics. Similarly, other sampling schemes like stratified sampling, two stage sampling, etc. also remain the same.

3.3.4 Estimation of Population Proportion

The population proportion in case of qualitative characteristic can be estimated in a


similar way as the estimation of population mean in case of quantitative characteristic.
Consider a qualitative characteristic based on which the population can be divided into
two mutually exclusive classes, say C and C∗. For example, if C is the part of population
of persons saying ‘yes’ or ‘agreeing’ with the proposal then C∗ is the part of population
of persons saying ‘no’ or ‘disagreeing’ with the proposal. Let A be the number of units in
C and (N − A) units in C∗ be in a population of size N. Then the proportion of units in
C is;

$$P = \frac{A}{N} \qquad (3.19)$$

and the proportion of units in $C^*$ is

$$Q = \frac{N - A}{N} = 1 - P \qquad (3.20)$$

An indicator variable $Y$ can be associated with the characteristic under study: for $i = 1, 2, ..., N$, let $Y_i = 1$ if the $i$th unit belongs to $C$ and $Y_i = 0$ if the $i$th unit belongs to $C^*$.


Now the population total is;

$$Y_{TOTAL} = \sum_{i=1}^{N} Y_i = A \qquad (3.21)$$

and the population mean is;

$$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{A}{N} = P \qquad (3.22)$$

Suppose a sample of size $n$ is drawn from a population of size $N$ by simple random sampling. Let $a$ be the number of units in the sample which fall into class $C$, so that $(n - a)$ units fall in class $C^*$; then the sample proportion of units in $C$ is;

$$p = \frac{a}{n} \qquad (3.23)$$

which can be written as $p = \frac{a}{n} = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y}$.
Since $\sum_{i=1}^{N} Y_i = A = NP$, we can write $S^2$ and $s^2$ in terms of $P$ and $Q$ as follows;
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right) = \frac{1}{N-1}\left(NP - NP^2\right) = \frac{N}{N-1}PQ$$
Similarly, $\sum_{i=1}^{n} y_i = a = np$ and
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right) = \frac{1}{n-1}\left(np - np^2\right) = \frac{n}{n-1}pq$$
Note that the quantities $\bar{y}$, $\bar{Y}$, $s^2$ and $S^2$ have been expressed as functions of the sample and population proportions. Since the sample has been drawn by simple random sampling and the sample proportion is the same as the sample mean, the properties of the sample proportion in SRSWOR and SRSWR can be derived using the properties of the sample mean directly.
SRSWOR
Since the sample mean $\bar{y}$ is an unbiased estimator of the population mean $\bar{Y}$, i.e. $E(\bar{y}) = \bar{Y}$ in the case of SRSWOR, we have $E(p) = E(\bar{y}) = \bar{Y} = P$, so $p$ is an unbiased estimator of $P$. Using the expression for $Var(\bar{y})$, the variance of $p$ can be derived as
$$Var(p) = Var(\bar{y}) = \frac{N-n}{Nn}S^2 = \frac{N-n}{Nn}\cdot\frac{N}{N-1}PQ = \frac{N-n}{N-1}\cdot\frac{PQ}{n}$$
Similarly, using $s^2$ in place of $S^2$, the estimate of the variance of $p$ can be derived as

$$\widehat{Var}(p) = \widehat{Var}(\bar{y}) = \frac{N-n}{Nn}s^2 = \frac{N-n}{Nn}\cdot\frac{n}{n-1}pq = \frac{N-n}{N(n-1)}pq \qquad (3.24)$$

SRSWR
Since the sample mean $\bar{y}$ is an unbiased estimator of the population mean $\bar{Y}$ in the case of SRSWR, the sample proportion satisfies $E(p) = E(\bar{y}) = \bar{Y} = P$, i.e. $p$ is an unbiased estimator of $P$.
Using the expression for the variance of $\bar{y}$ and its estimate in the case of SRSWR, the variance of $p$ and its estimate can be derived as follows:
$$Var(p) = Var(\bar{y}) = \frac{N-1}{Nn}S^2 = \frac{N-1}{Nn}\cdot\frac{N}{N-1}PQ = \frac{PQ}{n}$$

$$\Rightarrow \widehat{Var}(p) = \frac{n}{n-1}\cdot\frac{pq}{n} = \frac{pq}{n-1} \qquad (3.25)$$

3.3.5 Estimation of population total or total number of count

It is easy to see that an estimate of the population total $A$ (or total number of count) is $\hat{A} = Np = \frac{Na}{n}$; its variance is $Var(\hat{A}) = N^2\,Var(p)$ and the estimate of this variance is $\widehat{Var}(\hat{A}) = N^2\,\widehat{Var}(p)$.

3.3.6 Confidence Interval estimation for P

If $N$ and $n$ are large, then $\frac{p - P}{\sqrt{Var(p)}}$ approximately follows $N(0, 1)$. With this approximation we can write $P\left(-z_{\alpha/2} \leq \frac{p - P}{\sqrt{Var(p)}} \leq z_{\alpha/2}\right) = 1 - \alpha$, and the $100(1 - \alpha)\%$ confidence interval of $P$ is

$$\left(p - z_{\alpha/2}\sqrt{Var(p)},\;\; p + z_{\alpha/2}\sqrt{Var(p)}\right) \qquad (3.26)$$
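A short sketch of estimating a proportion with its estimated variance (equation 3.24) and the confidence interval (3.26); the numbers N, n and a below are hypothetical.

# Estimate a proportion from a srswor sample, with estimated variance and 95% CI.
N <- 2000; n <- 200; a <- 56                  # hypothetical: 56 'yes' responses out of 200
p <- a / n; q <- 1 - p
v_p <- (N - n) / (N * (n - 1)) * p * q        # estimated variance of p, equation (3.24)
p + c(-1, 1) * 1.96 * sqrt(v_p)               # approximate 95% confidence interval for P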


It may be noted that in this case a discrete random variable is being approximated by a continuous random variable, so a continuity correction of $\frac{1}{2n}$ can be introduced in the confidence limits, and the limits become;

$$\left(p - z_{\alpha/2}\sqrt{Var(p)} - \frac{1}{2n},\;\; p + z_{\alpha/2}\sqrt{Var(p)} + \frac{1}{2n}\right) \qquad (3.27)$$

3.3.7 Determination of Sample Sizes

In a field survey the statisticians would like to have a sample size that will give a desired
level of precision of estimator. We note that the required precision is the difference
between the estimator and the true value. This difference is denoted by d.
Suppose that it is desired to find a sample size n such that the estimated value i.e.
sample mean ȳ differs from the true value (Population mean, Ȳ ) by a quantity not ex-
ceeding d with a very high probability, say greater than 1 − α. Hence the problem is to
find n such that;

$$P\left(|\bar{y} - \bar{Y}| \leq d\right) \geq 1 - \alpha \qquad (3.28)$$

From srswor, $\bar{y} \sim N\left(\bar{Y}, \frac{N-n}{Nn}S^2\right)$ approximately. Hence,

$$P\left(|\bar{y} - \bar{Y}| \leq t\,S\sqrt{\frac{N-n}{Nn}}\right) = 1 - \alpha \qquad (3.29)$$

where $t = z_{\alpha/2}$ is the $100\left(1 - \frac{\alpha}{2}\right)\%$ point of the standard normal distribution. From equations 3.28 and 3.29, $t\,S\sqrt{\frac{N-n}{Nn}} = d$, i.e. $\frac{1}{n} = \frac{1}{N} + \frac{d^2}{t^2 S^2}$. Hence,
$$n = \frac{\left(\frac{tS}{d}\right)^2}{1 + \frac{1}{N}\left(\frac{tS}{d}\right)^2}.$$
As a first approximation we may take $n_0 = \left(\frac{tS}{d}\right)^2$. If $\frac{n_0}{N}$ is negligibly small, this may be taken as a satisfactory value of $n$. If not, one should calculate $n = \frac{n_0}{1 + \frac{n_0}{N}} = n_0\left(1 + \frac{n_0}{N}\right)^{-1}$. In practice one has to replace $S$ by an advance estimate $s$ (say). In case the problem is that of estimating a population proportion, one may require to find $n$ such that

$$P\left(|p - P| \leq d\right) \geq 1 - \alpha \qquad (3.30)$$

For large samples in srswor, $\frac{p - P}{\sqrt{\frac{N-n}{n(N-1)}PQ}}$ is approximately a standard normal variable.


Hence;

$$P\left(|p - P| \leq t\sqrt{\frac{N-n}{n(N-1)}PQ}\right) = 1 - \alpha \qquad (3.31)$$

Equating 3.30 and 3.31 we get $t\sqrt{\frac{N-n}{n(N-1)}PQ} = d$. This gives;

$$n = \frac{\frac{t^2 PQ}{d^2}}{1 + \frac{1}{N}\left(\frac{t^2 PQ}{d^2} - 1\right)} \qquad (3.32)$$

For practical purposes, $P$ is to be replaced by some suitable estimate $p$ of the same. For large $N$ a first approximation to $n$ is $n_0 = \frac{t^2 PQ}{d^2}$. If $\frac{n_0}{N}$ is negligible, $n_0$ is a satisfactory approximation to $n$. If not, one should calculate $n$ as;

$$n = \frac{n_0}{1 + \frac{n_0 - 1}{N}} \approx \frac{n_0}{1 + \frac{n_0}{N}} \qquad (3.33)$$

Example:

1. Suppose it is required to estimate the average value of output of a group of 5000


factories in a region so that the sample estimate lies within 10% of the true value
with a confidence coefficient of 95%. Determine the minimum sample size required.
The population coefficient of variation is known to be 60%.

Solution:

1. We require $n$ such that $P\left(|\bar{y} - \bar{Y}| \leq 0.1\bar{Y}\right) = 0.95$. Now under the normal approximation, $1.96\,S\sqrt{\frac{N-n}{Nn}} = 0.1\bar{Y}$, or $(1.96)^2\left(\frac{1}{n} - \frac{1}{N}\right) = 0.01\left(\frac{\bar{Y}}{S}\right)^2 = \frac{0.01}{0.36}$.

Solving the above equation, we get $n = 136$ (rounded off to the next integer).

3.4 R Computing Notes.

# Simulate the x and y coordinates of a population of 100 objects,
# uniformly distributed on the unit square:
popnx <- runif(100)
popny <- runif(100)
# Plot the spatial distribution of the population:
plot(popnx, popny)
# Change the size of the circle representing each object:
plot(popnx, popny, cex = 2)
# Select a random sample, without replacement, of 10 objects out of the 100 in the population:
oursample <- sample(1:100, 10)
# Draw the sample points in the same plot:
points(popnx[oursample], popny[oursample])
# Distinguish the sample points from the others by colour:
points(popnx[oursample], popny[oursample], pch = 21, bg = "red", cex = 2)

# Sample estimates:
y <- c(1, 50, 21, 98, 2, 36, 4, 29, 7, 15, 86, 10, 21, 5, 4)
# Sample mean, estimated population total (N = 286) and sample variance:
mean(y); N <- 286; N * mean(y); var(y)
# The estimate of the variance of the sample mean, and its standard error:
(1 - 15/286) * var(y)/15
sqrt(58.06)
# The estimate of the population total, its estimated variance and standard error:
286 * 25.9333
286^2 * 58.0576
sqrt(4748879)

# Simulation
# Print the trees data set:
trees
# The variable of interest is tree volume, which for simplicity we name "y":
y <- trees$Volume
# The 31 trees will serve as our "population" for purposes of simulation:
N <- 31
# Sample size:
n <- 10
# Select a simple random sample of n units from 1, 2, ..., N and print the unit numbers:
s <- sample(1:N, n)
s
# Print the y-values (volumes) of the sample trees, and the sample mean:
y[s]
mean(y[s])
# Select another sample from the population and repeat the estimation procedure:
s <- sample(1:N, n); s
mean(y[s])
# Compare the estimate to the population mean:
mu <- mean(y); mu
# Try a simulation of 6 runs and print the six values of the estimate obtained,
# mainly to check that the simulation procedure has no errors:
b <- 6
# Let R know that the variable ybar is a vector:
ybar <- numeric(b)
for (k in 1:b) { s <- sample(1:N, n); ybar[k] <- mean(y[s]) }
ybar
# Now do a full-size simulation of 10,000 runs:
b <- 10000
ybar <- numeric(b)
for (k in 1:b) { s <- sample(1:N, n); ybar[k] <- mean(y[s]) }
# Summarize the properties of the sampling strategy graphically and numerically:
hist(ybar); mean(ybar); var(ybar)
# Compare the variance calculated directly from the simulation above to the formula
# that applies specifically to simple random sampling with the sample mean:
(1 - n/N) * var(y)/n; sd(ybar); sqrt((1 - n/N) * var(y)/n)
# The mean square error approximated from the simulation should be close to the
# variance but not exactly equal, since they are calculated slightly differently:
mean((ybar - mu)^2)

3.5 Exercises

1. Consider a population consisting of 430 units. By complete enumeration of the population it was found that $\bar{Y} = 19$ and $S^2 = 85.6$, these being the true population values. With simple random sampling, how many units must be taken to estimate $\bar{Y}$ to within 10% of $\bar{Y}$, apart from a chance of 1 in 20?

2. In a population with $N = 6$, the values of $y_i$ are 8, 3, 1, 11, 4, and 7. Calculate the sample mean $\bar{y}$ for all possible simple random samples of size 2. Verify that $\bar{y}$ is an unbiased estimate of $\bar{Y}$.

3. For the same population as in 2 above, calculate $s^2$ for all simple random samples of size 3, and verify that $E(s^2) = S^2$.

4. If random samples of size 2 are drawn with replacement from this population, show by finding all possible samples that $Var(\bar{y})$ satisfies the equation $Var(\bar{y}) = \frac{\sigma^2}{n} = \frac{S^2(N-1)}{nN}$. Give a general proof of this result.

5. A simple random sample of 30 households was drawn from a city area containing 14,848 households. The numbers of persons per household in the sample were as follows: 5, 6, 3, 3, 2, 3, 3, 3, 4, 4, 3, 2, 7, 4, 3, 5, 4, 4, 3, 3, 4, 3, 3, 1, 2, 4, 3, 4, 2, 4. Estimate the total number of people in the area and compute the probability that this estimate is within ±10 per cent of the true value.

3.6 Solutions

1. $\bar{Y} = 19$, $S^2 = 85.6$, $N = 430$, $\alpha = \frac{1}{20} = 0.05$. 10% of $\bar{Y}$ $\Rightarrow d = 0.1\bar{Y} = 0.1(19) = 1.9$. $n_0 = \left(\frac{tS}{d}\right)^2$ with $t = z_{\alpha/2} = z_{0.025} = 1.96$, so $n_0 = \frac{(1.96)^2(85.6)}{(1.9)^2} = 91.09$. Then $n = n_0\left(1 + \frac{n_0}{N}\right)^{-1} = 91.09\left(1 + \frac{91.09}{430}\right)^{-1} = 75.166 \simeq 75$.
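The arithmetic of Solution 1 can be checked with a few lines of R (a sketch using exactly the quantities in the solution above):

# Sample size for estimating Ybar = 19 to within d = 1.9 with 95% confidence, N = 430.
N <- 430; S2 <- 85.6; d <- 1.9; t <- 1.96
n0 <- t^2 * S2 / d^2            # first approximation, (t*S/d)^2
n  <- n0 / (1 + n0 / N)         # adjustment for the finite population
c(n0 = n0, n = n, n_rounded = round(n))   # rounded as in the solution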


4 Stratified Random Sampling

4.1 Introduction and Description

The objective of any sampling method is usually to estimate the unknown population
parameters with the highest precision i.e. the variance of the estimators should be min-
imized. If the population is heterogeneous, as it will be in most situations, then a sample taken via SRS might yield highly variable estimates. As a result, in a survey where precision is a main consideration, a strategy that addresses heterogeneity must be found. One way of achieving higher precision is to divide the population, which is originally heterogeneous, into sub-populations which are to a large extent homogeneous with respect to the survey characteristics.
In stratified random sampling , the population of N units is first divided into subpop-
ulations N1 , N2 , ...., NL called strata. The strata are mutually disjoint so that;

$$N_1 + N_2 + \cdots + N_L = \sum_{i=1}^{L} N_i = N \qquad (4.1)$$

It is important that the number of units in the stratum denoted by Ni , i = 1, 2, ...., L


is known in order to maximize the gain from stratification. After determining the strata,
a sample of size ni , i = 1, 2, ...., L is drawn from each stratum. If simple random sampling
procedure is used to obtain the sub-samples in each stratum then the whole procedure
is called stratified random sampling. The basic idea of stratification is that it may be possible to divide a heterogeneous population into sub-populations which are internally homogeneous. If each sub-population is homogeneous, a precise estimate of the stratum mean can be obtained from a small sample of that stratum. Combining these estimates results in an improvement in the precision of the estimate for the entire population.
Example:
Suppose we wish to find the average height of the students in a school with classes 1 to 12. The height varies a lot, as the students in class 1 are around 6 years of age while the students in class 10 are around 16 years. One can therefore divide all the students into different subpopulations or strata such as


Students of class 1, 2 and 3: Stratum 1


Students of class 4, 5 and 6: Stratum 2
Students of class 7, 8 and 9: Stratum 3
Students of class 10, 11 and 12: Stratum 4
Now draw the samples by SRS from each of the strata 1, 2, 3 and 4. All the drawn
samples combined together will constitute the final stratified sample for further analysis.

4.1.1 Notations

The following is an extension of the previous notation, where the suffix $i$ denotes the stratum and $j$ denotes the $j$th unit within the stratum.
Let $Y_{ij}$ be the value of the characteristic $y$ on the $j$th unit of the $i$th stratum in the population, and $y_{ij}$ the corresponding value in the sample; $j = 1, 2, ..., N_i$ ($n_i$ in the sample), $i = 1, 2, ..., L$.
Define:
$N_i$ = total number of units in the $i$th stratum;
$n_i$ = the number of units in the sample from the $i$th stratum;
Note: $j = 1, 2, ..., N_i \rightarrow$ units within a stratum; $i = 1, 2, ..., L \rightarrow$ strata;
$n = \sum_{i=1}^{L} n_i$ = total sample size from all the strata;
$Y_i = \sum_{j=1}^{N_i} Y_{ij}$ = population total for the $i$th stratum;
$y_i = \sum_{j=1}^{n_i} y_{ij}$ = sample total for the $i$th stratum;
$\bar{y}_i = \frac{y_i}{n_i}$ = sample mean for the $i$th stratum;
$\bar{Y} = \frac{\sum_{i=1}^{L} N_i \bar{Y}_i}{N} = \frac{Y}{N}$ = overall population mean;
$S_i^2 = \frac{1}{N_i - 1}\sum_{j=1}^{N_i}\left(Y_{ij} - \bar{Y}_i\right)^2$ = population variance for the $i$th stratum;
$s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2$ = sample variance for the $i$th stratum;
$W_i = \frac{N_i}{N}$ = population proportion (stratum weight) for the $i$th stratum; and
$f_i = \frac{n_i}{N_i}$ = sampling fraction for the $i$th stratum.
Note: The divisor of the stratum variance is $(N_i - 1)$.

4.1.2 Estimation of Population Mean, Variance and Total

The mean of the target population is given by;


$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{L}\sum_{j=1}^{N_i} Y_{ij} = \frac{1}{N}\sum_{i=1}^{L} N_i \bar{Y}_i \qquad (4.2)$$

where $N = N_1 + N_2 + \cdots + N_L$.
For the population mean per unit, the estimate used in stratified sampling is $\bar{y}_{st}$ ($st$ for stratified), where $\bar{y}_{st} = \frac{1}{N}\sum_{i=1}^{L} N_i \bar{y}_i = \sum_{i=1}^{L} W_i \bar{y}_i$, with $W_i = \frac{N_i}{N}$.
Note: The estimate $\bar{y}_{st}$ is not in general the same as the sample mean. The sample mean $\bar{y}$ can be written as $\bar{y} = \frac{1}{n}\sum_{i=1}^{L} n_i \bar{y}_i$. The difference is that in $\bar{y}_{st}$ the estimates from the individual strata receive their correct weights $\frac{N_i}{N}$. It is evident that $\bar{y}$ coincides with $\bar{y}_{st}$ provided that in every stratum $\frac{n_i}{n} = \frac{N_i}{N}$, or $\frac{n_i}{N_i} = \frac{n}{N} = f_i = f$. This means the sampling fraction is the same in all strata.
The principal properties of the estimate $\bar{y}_{st}$ are outlined in the following theorems. If a simple random sample is used in each stratum, then $\bar{y}_{st}$ has the following properties.

Theorem 4.1. In stratified random sampling, ȳst = Σ_{i=1}^{L} Ni ȳi/N = Σ_{i=1}^{L} Wi ȳi is an unbiased estimator of the population mean Ȳ.
Proof: E(ȳst) = Σ_{i=1}^{L} Wi E(ȳi) = Σ_{i=1}^{L} Wi Ȳi = Ȳ, since under SRS within each stratum E(ȳi) = Ȳi.

Theorem 4.2. In stratified random sampling using SRSWOR in each stratum,
Var(ȳst) = (1/N²) Σ_{i=1}^{L} Ni² Var(ȳi) = (1/N²) Σ_{i=1}^{L} [Ni(Ni − ni)/ni] Si².
Proof: Var(ȳst) = Var(Σ_{i=1}^{L} Ni ȳi/N) = (1/N²) Σ_{i=1}^{L} Ni² Var(ȳi); the covariance terms vanish because sampling is independent from stratum to stratum. Since Var(ȳi) = [(Ni − ni)/(Ni ni)] Si² under SRSWOR, this equals
(1/N²) Σ_{i=1}^{L} [Ni(Ni − ni)/ni] Si².

Corollary 4.1. If the sampling fraction ni/Ni is negligibly small in each stratum, this reduces to
Var(ȳst) = (1/N²) Σ_{i=1}^{L} Ni² Si²/ni = Σ_{i=1}^{L} Wi² Si²/ni.

Corollary 4.2. If Ŷst = N ȳst is the estimate of the population total Y, then
Var(Ŷst) = Σ_{i=1}^{L} Ni(Ni − ni) Si²/ni.


 
Proof: Ŷst = N ȳst ⇒ Var(Ŷst) = Var(N ȳst) = N² Var(ȳst)
= N² · (1/N²) Σ_{i=1}^{L} Ni(Ni − ni) Si²/ni
= Σ_{i=1}^{L} Ni(Ni − ni) Si²/ni.    (4.3)

4.1.3 Estimation of Variance

In simple random sampling, the estimate of the variance of the ith stratum is si² = [1/(ni − 1)] Σ_{j=1}^{ni} (yij − ȳi)², which is unbiased for Si². We have found that Var(ȳst) = (1/N²) Σ_{i=1}^{L} Ni(Ni − ni) Si²/ni.
In stratified random sampling, the unbiased estimate of Var(ȳst) is therefore
s²st = (1/N²) Σ_{i=1}^{L} Ni(Ni − ni) si²/ni.
Note that if ȳst is normally distributed about Ȳ, then a confidence interval for Ȳ is (ȳst − z_{α/2} Sȳst, ȳst + z_{α/2} Sȳst); therefore
ȳst ± z_{α/2} √Var̂(ȳst).    (4.4)
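
The estimate ȳst, its estimated variance and the confidence interval (4.4) are straightforward to compute once the stratum samples are in hand. The following is a minimal Python sketch; the stratum sizes and sample values are invented for illustration and are not taken from these notes.

import math

# Hypothetical stratum data: population sizes N_i and the sampled values y_ij.
N_i = [200, 300, 300]
samples = [
    [12.1, 11.4, 13.0, 12.7],          # stratum 1
    [15.2, 14.8, 16.1, 15.5, 15.0],    # stratum 2
    [18.3, 17.9, 19.2, 18.8, 18.1],    # stratum 3
]

N = sum(N_i)

def mean(x):
    return sum(x) / len(x)

def var(x):
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)   # divisor n_i - 1

# Stratified estimate of the population mean: ybar_st = sum W_i * ybar_i
W = [Ni / N for Ni in N_i]
ybar_i = [mean(s) for s in samples]
ybar_st = sum(w * yb for w, yb in zip(W, ybar_i))

# Unbiased variance estimate: (1/N^2) * sum N_i (N_i - n_i) s_i^2 / n_i
s2_i = [var(s) for s in samples]
n_i = [len(s) for s in samples]
v_hat = sum(Ni * (Ni - ni) * s2 / ni for Ni, ni, s2 in zip(N_i, n_i, s2_i)) / N ** 2

z = 1.96  # z_{alpha/2} for a 95% confidence interval
half_width = z * math.sqrt(v_hat)
print(f"ybar_st = {ybar_st:.3f}, Var-hat = {v_hat:.5f}, "
      f"95% CI = ({ybar_st - half_width:.3f}, {ybar_st + half_width:.3f})")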

4.2 Allocation problem and choice of sample sizes in different strata

Question: How should the sample sizes n1, n2, ..., nL be chosen so that the available resources are used in an effective way?
There are two aspects of choosing the sample sizes:
(i) Minimize the cost of survey for a specified precision.
(ii) Maximize the precision for a given cost.
Note: The sample size cannot be determined by minimizing both the cost and variabil-
ity simultaneously. The cost function is directly proportional to the sample size whereas
variability is inversely proportional to the sample size. Based on different ideas, some
allocation procedures are as follows:

4.2.1 Equal allocation

Choose the sample size ni to be the same for all the strata, i.e., draw samples of equal size from each stratum. Let n be the total sample size and L the number of strata; then ni = n/L for i = 1, 2, ..., L.

4.2.2 Proportional Allocations

If the sample sizes in the strata are chosen such that ni/n = Ni/N = constant, then the design is called stratification with proportional allocation of the ni, i = 1, 2, ..., L. Consider ȳst = (1/N) Σ_{i=1}^{L} Ni ȳi; with proportional allocation ni = nNi/N, so
ȳst = Σ_{i=1}^{L} (Ni/N) ȳi = (1/n) Σ_{i=1}^{L} ni ȳi = ȳ,
the overall sample mean, i.e. in this case ȳst coincides with ȳ.
Recall Var(ȳst) = (1/N²) Σ_{i=1}^{L} Ni(Ni − ni) Si²/ni. Substituting ni = nNi/N, the variance becomes
Var(ȳst)prop = (1/N²) Σ_{i=1}^{L} Ni (Ni − nNi/N) Si² · N/(nNi)
= (1/N²) Σ_{i=1}^{L} (N Ni − nNi) Si²/n
= [(N − n)/(N²n)] Σ_{i=1}^{L} Ni Si²,    (4.5)
which is the formula for Var(ȳst) under proportional allocation.

4.2.3 Optimum Allocation of Sample Sizes (Neyman Allocation)

This allocation takes into account both the size and the variability of the strata: ni ∝ Ni Si, i.e. ni = C* Ni Si, where C* is the constant of proportionality. Then n = Σ_{i=1}^{L} ni = C* Σ_{i=1}^{L} Ni Si, so C* = n/Σ_{i=1}^{L} Ni Si and therefore
ni = n Ni Si / Σ_{i=1}^{L} Ni Si.
This allocation arises when Var(ȳst) is minimized subject to the constraint Σ_{i=1}^{L} ni = n (prespecified). There are some limitations of the optimum allocation: knowledge of Si, i = 1, 2, ..., L is needed in order to compute the ni, and if there is more than one characteristic under study, the characteristics may lead to conflicting allocations.
Choice of sample size based on cost of survey and variability
The cost of survey depends upon the nature of survey. A simple choice of the cost
function is


C = C0 + Σ_{i=1}^{L} Ci ni    (4.6)
where
C: total cost,
C0: overhead cost, e.g., setting up an office, training people, etc.,
Ci: cost per unit in the ith stratum,
Σ_{i=1}^{L} Ci ni: total cost of sampling within the strata.

To find ni under this cost function, consider the Lagrangian function with Lagrangian multiplier λ:
φ = Var(ȳst) + λ²(C − C0)
= Σ_{i=1}^{L} (1/ni − 1/Ni) Wi² Si² + λ² Σ_{i=1}^{L} Ci ni
= Σ_{i=1}^{L} Wi² Si²/ni + λ² Σ_{i=1}^{L} Ci ni − Σ_{i=1}^{L} Wi² Si²/Ni
= Σ_{i=1}^{L} [Wi Si/√ni − λ √(Ci ni)]² + terms independent of ni.
Thus φ is minimized when
Wi Si/√ni = λ √(Ci ni) for all i, i.e. ni = (1/λ) Wi Si/√Ci.
How is λ determined? There are two ways:
(i) minimize the variability for fixed cost;
(ii) minimize the cost for given variability.
We consider both cases.
(i) Minimize variability for fixed cost
Let C = C0* be the prespecified cost, which is fixed, so that Σ_{i=1}^{L} Ci ni = C0*. Substituting ni = (1/λ) Wi Si/√Ci gives Σ_{i=1}^{L} Ci Wi Si/(λ√Ci) = C0*, i.e.
λ = Σ_{i=1}^{L} √Ci Wi Si / C0*.
Substituting λ back into ni = (1/λ) Wi Si/√Ci, the optimum ni is obtained as
ni* = (Wi Si/√Ci) · C0* / Σ_{i=1}^{L} √Ci Wi Si.
The required sample size to estimate Ȳ such that the variance is minimum for the given cost C = C0* is n = Σ_{i=1}^{L} ni*.

(ii) Minimize cost for given variability
Let V = V0 be the prespecified variance. Now determine ni such that
Σ_{i=1}^{L} (1/ni − 1/Ni) Wi² Si² = V0, i.e. Σ_{i=1}^{L} Wi² Si²/ni = V0 + Σ_{i=1}^{L} Wi² Si²/Ni.
Substituting ni = (1/λ) Wi Si/√Ci gives λ Σ_{i=1}^{L} Wi Si √Ci = V0 + Σ_{i=1}^{L} Wi² Si²/Ni, so
λ = [V0 + Σ_{i=1}^{L} Wi² Si²/Ni] / Σ_{i=1}^{L} Wi Si √Ci.
Thus the optimum ni is
ñi = (Wi Si/√Ci) · Σ_{i=1}^{L} Wi Si √Ci / [V0 + Σ_{i=1}^{L} Wi² Si²/Ni].
So the required sample size to estimate Ȳ such that the cost C is minimum for the prespecified variance V0 is n = Σ_{i=1}^{L} ñi.

4.2.4 Sample size under proportional allocation for fixed cost and for fixed
variance.

(i) If the cost C = C0 is fixed, then C0 = Σ_{i=1}^{L} Ci ni. Under proportional allocation ni = (n/N) Ni = n Wi, so C0 = n Σ_{i=1}^{L} Wi Ci, i.e. n = C0/Σ_{i=1}^{L} Wi Ci and therefore
ni = C0 Wi / Σ_{i=1}^{L} Wi Ci.
The required sample size to estimate Ȳ in this case is n = Σ_{i=1}^{L} ni.
(ii) If the variance V0 is fixed, then
Σ_{i=1}^{L} (1/ni − 1/Ni) Wi² Si² = V0, i.e. Σ_{i=1}^{L} Wi² Si²/ni = V0 + Σ_{i=1}^{L} Wi² Si²/Ni.
Using ni = n Wi, this gives
n = Σ_{i=1}^{L} Wi Si² / [V0 + Σ_{i=1}^{L} Wi² Si²/Ni] and ni = n Wi.
This is known as Bowley's allocation.
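
The two formulas of this subsection are easy to evaluate in Python. The sketch below is illustrative only: the stratum sizes, unit costs, standard deviations, C0 and V0 are invented purely to exercise the formulas.

N_i = [400, 350, 250]          # stratum sizes N_i
C_i = [2.0, 3.0, 1.5]          # cost per sampled unit in stratum i
S_i = [5.0, 8.0, 4.0]          # stratum standard deviations S_i
C0 = 900.0                     # budget available for sampling
V0 = 0.05                      # target variance of ybar_st

N = sum(N_i)
W = [Ni / N for Ni in N_i]

# (i) fixed cost: n = C0 / sum(W_i * C_i), then n_i = n * W_i
n_cost = C0 / sum(w * c for w, c in zip(W, C_i))
alloc_cost = [n_cost * w for w in W]

# (ii) fixed variance: n = sum(W_i * S_i^2) / (V0 + sum(W_i^2 * S_i^2 / N_i))
num = sum(w * s ** 2 for w, s in zip(W, S_i))
den = V0 + sum((w * s) ** 2 / Ni for w, s, Ni in zip(W, S_i, N_i))
n_var = num / den
alloc_var = [n_var * w for w in W]

print("fixed cost:     n =", round(n_cost, 1), [round(x, 1) for x in alloc_cost])
print("fixed variance: n =", round(n_var, 1), [round(x, 1) for x in alloc_var])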

4.2.5 Variances under different allocations

Now we derive the variance of ȳst under proportional and optimum allocation.
(i) Proportional allocation
Under proportional allocation ni = (n/N) Ni, and Var(ȳst) = (1/N²) Σ_{i=1}^{L} Ni(Ni − ni) Si²/ni, so
Varprop(ȳst) = (1/N²) Σ_{i=1}^{L} Ni (Ni − nNi/N) Si² · N/(nNi)
= [(N − n)/(Nn)] Σ_{i=1}^{L} (Ni/N) Si²
= [(N − n)/(Nn)] Σ_{i=1}^{L} Wi Si².    (4.7)
(ii) Optimum allocation
Under optimum allocation ni = n Ni Si/Σ_{i=1}^{L} Ni Si, and
Varopt(ȳst) = Σ_{i=1}^{L} (1/ni − 1/Ni) Wi² Si²
= Σ_{i=1}^{L} Wi² Si²/ni − Σ_{i=1}^{L} Wi² Si²/Ni
= Σ_{i=1}^{L} [Wi² Si² Σ_{j=1}^{L} Nj Sj/(n Ni Si)] − Σ_{i=1}^{L} Wi² Si²/Ni
= (1/n)(Σ_{i=1}^{L} Wi Si)² − (1/N) Σ_{i=1}^{L} Wi Si².

Example 4.1. A population of size 800 is divided into three strata. Their sizes and
standard deviations are as given below.


Stratum 1 2 3
Size Ni 200 300 300
Standard deviation Si 6 8 12
A sample of 120 is to be drawn from the population. Determine the sample sizes based on:

1. Proportional allocation

2. Optimum allocation

3. Obtain the variance of the estimates of the population mean i.e. V arprop (ȳst ) and
V aropt (ȳst )

Solutions:

1. Proportional allocation: ni/n = Ni/N ⇒ ni = nNi/N, with n = 120 and N = N1 + N2 + N3 = 200 + 300 + 300 = 800. Therefore n1 = 120×200/800 = 30, n2 = 120×300/800 = 45, n3 = 120×300/800 = 45.

2. Optimum (Neyman) allocation: ni = n Ni Si/Σ Ni Si, with Σ_{i=1}^{L} Ni Si = 200(6) + 300(8) + 300(12) = 7,200. Hence n1 = 120×200×6/7,200 = 20, n2 = 120×300×8/7,200 = 40, n3 = 120×300×12/7,200 = 60.

3. Var(ȳst)prop = [(N − n)/(N²n)] Σ Ni Si², with Σ Ni Si² = 200(6²) + 300(8²) + 300(12²) = 69,600, so
Var(ȳst)prop = [680/(800² × 120)] × 69,600 = 0.61625.
Var(ȳst)opt = (1/n)(Σ Wi Si)² − (1/N) Σ Wi Si² = (7,200)²/(800² × 120) − 69,600/800² = 0.675 − 0.10875 = 0.56625.

Note: Var(ȳst)opt < Var(ȳst)prop.
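
The allocations and variances in Example 4.1 can be checked with a few lines of Python; this sketch simply re-evaluates the formulas above with the example's figures.

N_i = [200, 300, 300]
S_i = [6, 8, 12]
n = 120
N = sum(N_i)
W = [Ni / N for Ni in N_i]

# Proportional and Neyman (optimum) allocations
prop = [n * Ni / N for Ni in N_i]
tot_NS = sum(Ni * Si for Ni, Si in zip(N_i, S_i))
neyman = [n * Ni * Si / tot_NS for Ni, Si in zip(N_i, S_i)]

# Variances of ybar_st under the two allocations (equation 4.7 and the 4.2.5 result)
var_prop = (N - n) / (N * n) * sum(w * s ** 2 for w, s in zip(W, S_i))
var_opt = (sum(w * s for w, s in zip(W, S_i)) ** 2) / n - sum(w * s ** 2 for w, s in zip(W, S_i)) / N

print(prop)       # [30.0, 45.0, 45.0]
print(neyman)     # [20.0, 40.0, 60.0]
print(round(var_prop, 5), round(var_opt, 5))   # 0.61625 0.56625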

4.2.6 Comparison of the variance of the sample mean under SRS with the stratified mean under proportional and optimum allocation

(a) Proportional allocation:
Vsrs(ȳ) = [(N − n)/(Nn)] S²,  Vprop(ȳst) = [(N − n)/(Nn)] Σ_{i=1}^{L} (Ni/N) Si².


In order to compare Vsrs(ȳ) and Vprop(ȳst), we first write S² as a function of the Si². Consider
(N − 1) S² = Σ_{i=1}^{L} Σ_{j=1}^{Ni} (Yij − Ȳ)²
= Σ_{i=1}^{L} Σ_{j=1}^{Ni} [(Yij − Ȳi) + (Ȳi − Ȳ)]²
= Σ_{i=1}^{L} Σ_{j=1}^{Ni} (Yij − Ȳi)² + Σ_{i=1}^{L} Ni (Ȳi − Ȳ)²
= Σ_{i=1}^{L} (Ni − 1) Si² + Σ_{i=1}^{L} Ni (Ȳi − Ȳ)²,
so
[(N − 1)/N] S² = Σ_{i=1}^{L} [(Ni − 1)/N] Si² + Σ_{i=1}^{L} (Ni/N)(Ȳi − Ȳ)².
For simplification, assume that Ni is large enough to permit the approximations (Ni − 1)/Ni ≈ 1 and (N − 1)/N ≈ 1. Therefore
S² = Σ_{i=1}^{L} (Ni/N) Si² + Σ_{i=1}^{L} (Ni/N)(Ȳi − Ȳ)²,
and premultiplying both sides by (N − n)/(Nn),
[(N − n)/(Nn)] S² = [(N − n)/(Nn)] Σ_{i=1}^{L} (Ni/N) Si² + [(N − n)/(Nn)] Σ_{i=1}^{L} (Ni/N)(Ȳi − Ȳ)²,
i.e. Varsrs(ȳ) = Varprop(ȳst) + [(N − n)/(Nn)] Σ_{i=1}^{L} Wi (Ȳi − Ȳ)².
Since Σ_{i=1}^{L} Wi (Ȳi − Ȳ)² ≥ 0, it follows that Varprop(ȳst) ≤ Varsrs(ȳ). The gain in precision is larger the more the Ȳi differ from Ȳ.
(b) Optimum allocation
Varopt(ȳst) = (1/n)(Σ_{i=1}^{L} Wi Si)² − (1/N) Σ_{i=1}^{L} Wi Si². Consider
Varprop(ȳst) − Varopt(ȳst) = [(N − n)/(Nn)] Σ_{i=1}^{L} Wi Si² − [(1/n)(Σ_{i=1}^{L} Wi Si)² − (1/N) Σ_{i=1}^{L} Wi Si²]
= (1/n)[Σ_{i=1}^{L} Wi Si² − (Σ_{i=1}^{L} Wi Si)²]
= (1/n) Σ_{i=1}^{L} Wi (Si − S̄)², where S̄ = Σ_{i=1}^{L} Wi Si.
Hence Varprop(ȳst) − Varopt(ȳst) ≥ 0, i.e. Varopt(ȳst) ≤ Varprop(ȳst); the gain in efficiency is larger the more the Si differ from S̄. Combining the results in (a) and (b), we have
Varopt(ȳst) ≤ Varprop(ȳst) ≤ Varsrs(ȳ).    (4.8)

4.2.7 Estimate of variance and confidence intervals

Under SRSWOR, an unbiased estimate of Si² for the ith stratum (i = 1, 2, ..., L) is si² = [1/(ni − 1)] Σ_{j=1}^{ni} (yij − ȳi)².
In stratified sampling, Var(ȳst) = Σ_{i=1}^{L} Wi² [(Ni − ni)/(Ni ni)] Si², so an unbiased estimate of Var(ȳst) is
Var̂(ȳst) = Σ_{i=1}^{L} Wi² [(Ni − ni)/(Ni ni)] si² = Σ_{i=1}^{L} Wi² si²/ni − Σ_{i=1}^{L} Wi² si²/Ni = Σ_{i=1}^{L} Wi² si²/ni − (1/N) Σ_{i=1}^{L} Wi si².
The second term in this expression represents the reduction due to the finite population correction.
The confidence limits of Ȳ can be obtained as

ȳst ± t √Var̂(ȳst)    (4.9)

assuming ȳst is normally distributed and Var̂(ȳst) is well determined, so that t can be read from normal distribution tables. If only a few degrees of freedom are provided by each stratum, the t values are obtained from the table of Student's t-distribution.
The distribution of Var̂(ȳst) is generally complex. An approximate method of assigning an effective number of degrees of freedom ne to Var̂(ȳst) is
ne = (Σ_{i=1}^{L} gi si²)² / Σ_{i=1}^{L} gi² si⁴/(ni − 1),
where gi = Ni(Ni − ni)/ni and min(ni − 1) ≤ ne ≤ Σ_{i=1}^{L} (ni − 1), assuming the yij are normally distributed.

4.3 Exercises

1. A market researcher is allocated Ksh. 20,000 to conduct a survey by means of stratified random sampling. The population consists of stratum A of size 40,000, stratum B of size 20,000 and stratum C of size 10,000. The set-up cost of administering the survey is 200, and the costs of sampling one unit are 2.25, 4.00 and 1.00 for strata A, B and C respectively. The standard deviation of observations in stratum A is thought to be twice that of strata B and C. Find the optimum and proportional allocations, assuming that all the money is to be spent on the survey. [20 marks]

2. A sample of 30 students is to be drawn from a population of 300 students belonging


to two colleges A and B. The means and standard deviations of their marks are
given below;

Total number of students ȳi Si

College A 200 30 10

College B 100 60 40
Use the information to confirm that Neyman’s allocation scheme is a more efficient
scheme when compared to proportional allocation.


3. A stratified population has 5 strata. The stratum sizes Ni, means Ȳi and variances Si² of some variable Y are as follows:
Stratum Ni Ȳi Si2

1 117 7.3 1.31

2 98 6.9 2.03

3 74 11.2 1.13

4 41 9.1 1.96

5 45 9.6 1.74

1. Calculate the overall population mean and variance.

2. For a stratified simple random sample of size 80, determine the appropriate stratum
sample sizes under Proportional allocation and Neyman allocation.

4.4 Solutions

1. N1 = 40,000, N2 = 20,000, N3 = 10,000, c0 = 200 (fixed cost), C = 20,000, c1 = 2.25, c2 = 4.00, c3 = 1.00, and SA = 2S, SB = SC = S, where S is the (unknown) common standard deviation of strata B and C. For optimum allocation we need the sample size in each stratum:
ni = (C − c0) Ni Si/√ci / Σ_{i=1}^{3} Ni Si √ci.
Now Σ_{i=1}^{3} Ni Si √ci = 40,000(2S)√2.25 + 20,000(S)√4 + 10,000(S)√1 = (120,000 + 40,000 + 10,000) S = 170,000 S, so
n1 = (20,000 − 200)(40,000)(2S)/(1.5 × 170,000 S) ≈ 6,211.8,
n2 = (20,000 − 200)(20,000) S/(2 × 170,000 S) ≈ 1,164.7,
n3 = (20,000 − 200)(10,000) S/(1 × 170,000 S) ≈ 1,164.7.

2. Under proportional allocation ni/n = Ni/N ⇒ ni = nNi/N, and C = c0 + Σ_{i=1}^{3} ci ni = c0 + (n/N) Σ_{i=1}^{3} ci Ni, so n = N(C − c0)/Σ_{i=1}^{3} ci Ni, with N = N1 + N2 + N3 = 70,000 and Σ ci Ni = 2.25(40,000) + 4(20,000) + 1(10,000) = 180,000. Hence n = 70,000 × 19,800/180,000 = 7,700 and
n1 = 7,700 × 40,000/70,000 = 4,400, n2 = 7,700 × 20,000/70,000 = 2,200, n3 = 7,700 × 10,000/70,000 = 1,100.
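
These computations can be verified with a short Python sketch. The common standard deviation S cancels out of both formulas, so it is set to 1 here; everything else comes from the exercise statement.

import math

N_i = [40_000, 20_000, 10_000]
c_i = [2.25, 4.00, 1.00]
S_i = [2.0, 1.0, 1.0]          # S_A = 2S, S_B = S_C = S, with S = 1
C, c0 = 20_000.0, 200.0

budget = C - c0
denom = sum(N * S * math.sqrt(c) for N, S, c in zip(N_i, S_i, c_i))   # = 170,000 S
optimum = [budget * N * S / math.sqrt(c) / denom for N, S, c in zip(N_i, S_i, c_i)]

N = sum(N_i)
n_prop = N * budget / sum(c * Ni for c, Ni in zip(c_i, N_i))          # = 7,700
proportional = [n_prop * Ni / N for Ni in N_i]

print([round(x) for x in optimum])        # [6212, 1165, 1165]
print([round(x) for x in proportional])   # [4400, 2200, 1100]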


5 Systematic sampling

5.1 Introduction and Description

The systematic sampling technique is operationally more convenient than the simple ran-
dom sampling. It also ensures at the same time that each unit has equal probability of
inclusion in the sample. In this method of sampling, the first unit is selected with the
help of random numbers and the remaining units are selected automatically according to
a predetermined pattern. This method is known as systematic sampling.
Suppose the N units in the population are numbered 1 to N in some order. Suppose
further that N is expressible as a product of two integers n and k , so that N = nk.
To draw a sample of size n,

1. Select a random number between 1 and k.

2. Suppose it is i.

3. Select the first unit whose serial number is i.

4. Select every k th unit after ith unit.

5. Sample will contain i, i + k, i + 2k, ..., i + (n − 1) k serial number units.

So the first unit is selected at random and the other units are selected systematically. Such a sample is called an every-kth systematic sample, and k is termed the sampling interval. This is also known as linear systematic sampling.
Example: Let N = 50 and n = 5. So k = 10. Suppose first selected number between
1 and 10 is 3. Then systematic sample consists of units with following serial number 3,
13, 23, 33, 43.
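A linear systematic sample is straightforward to draw in code. The Python sketch below reproduces the N = 50, n = 5, k = 10 example; the random start is drawn with the standard library.

import random

def systematic_sample(N, n, seed=None):
    """Draw a linear systematic sample of size n from units numbered 1..N, assuming N = n*k."""
    k = N // n                      # sampling interval
    rng = random.Random(seed)
    start = rng.randint(1, k)       # random start i between 1 and k
    return [start + j * k for j in range(n)]

# Example: N = 50, n = 5, so k = 10; if the random start is 3 the sample is [3, 13, 23, 33, 43].
print(systematic_sample(50, 5, seed=1))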
Advantages of systematic sampling:

1. It is easier to draw a sample and often easier to execute it without mistakes. This
is more advantageous when the drawing is done in fields and offices as there may be
substantial saving in time.


2. The cost is low and the selection of units is simple. Much less training is needed for
surveyors to collect units through systematic sampling .

3. The systematic sample is spread more evenly over the population, so no large part of the population fails to be represented in the sample; the sample gives a better cross-section. (Systematic sampling can, however, perform poorly if the frame contains too many blanks.)

5.2 Estimation of Population Mean, Variance and Total

5.2.1 Estimation of population mean : When N = nk

Let yij be the observation on the unit bearing the serial number i + (j − 1)k in the population, i = 1, 2, ..., k, j = 1, 2, ..., n. Suppose the drawn random number is i ≤ k, so the sample consists of the ith column of the k × n arrangement of the population. Consider the sample mean
ȳsys = ȳi = (1/n) Σ_{j=1}^{n} yij
as an estimator of the population mean Ȳ = (1/nk) Σ_{i=1}^{k} Σ_{j=1}^{n} yij = (1/k) Σ_{i=1}^{k} ȳi. The probability of selecting the ith column as the systematic sample is 1/k. So
E(ȳsys) = (1/k) Σ_{i=1}^{k} ȳi = Ȳ, and therefore ȳsys is an unbiased estimator of Ȳ.
Further, Var(ȳsys) = (1/k) Σ_{i=1}^{k} (ȳi − Ȳ)².
Consider
(N − 1) S² = Σ_{i=1}^{k} Σ_{j=1}^{n} (yij − Ȳ)²
= Σ_{i=1}^{k} Σ_{j=1}^{n} [(yij − ȳi) + (ȳi − Ȳ)]²
= Σ_{i=1}^{k} Σ_{j=1}^{n} (yij − ȳi)² + n Σ_{i=1}^{k} (ȳi − Ȳ)²
= k(n − 1) S²wsy + n Σ_{i=1}^{k} (ȳi − Ȳ)²,
where S²wsy = [1/(k(n − 1))] Σ_{i=1}^{k} Σ_{j=1}^{n} (yij − ȳi)² is the variation among the units that lie within the same systematic sample. Thus
Var(ȳsys) = (1/k) Σ_{i=1}^{k} (ȳi − Ȳ)² = [(N − 1)/N] S² − [k(n − 1)/N] S²wsy = [(N − 1)/N] S² − [(n − 1)/n] S²wsy,
where [(N − 1)/N] S² is the variation of the population as a whole and [(n − 1)/n] S²wsy is the pooled within variation of the k systematic samples (with N = nk). This expression shows that when the within-sample variation is large, Var(ȳsys) becomes smaller. Thus higher heterogeneity within a systematic sample makes the estimator more efficient, and such heterogeneity is well expected in a systematic sample.


5.3 Exercises

1. Show how to estimate the mean in systematic sampling when N ≠ nk.

2. A census was conducted in a community. In addition to obtaining the usual population information, the surveyors questioned the occupants of every 20th household to determine how long they have occupied their present homes. The results are summarised as follows: n = 115, Σ yi² = 2011.15, Σ yi = 407.1, N = 2300, k = 20. Use these results to estimate the average amount of time people have lived in their present homes and place a bound on the error of estimation.

3. Out of 24 villages in an area, two linear systematic samples of 4 villages each were
selected. The total area under wheat is given in the table below;

Sample villages

Linear Systematic sample 1 2 3 4

1 422 326 481 445

2 335 412 503 348

1. Estimate the total area under wheat

2. Estimate the variance of the sample mean and place an upper bound on the error
of estimation.


6 Cluster sampling

6.1 Introduction and Description

It is one of the basic assumptions in any sampling procedure that the population can
be divided into a finite number of distinct and identifiable units, called sampling units.
The smallest units into which the population can be divided are called elements of the
population. The groups of such elements are called clusters.
In many practical situations and many types of populations, a list of elements is not
available and so the use of an element as a sampling unit is not feasible. The method of
cluster sampling or area sampling can be used in such situations.
In cluster sampling;

1. Divide the whole population into clusters according to some well defined rule.

2. Treat the clusters as sampling units.

3. Choose a sample of clusters according to some procedure.

4. Carry out a complete enumeration of the selected clusters, i.e., collect information
on all the sampling units available in selected clusters.

Area sampling:
If the entire area containing the population is subdivided into smaller area segments, and each element of the population is associated with one and only one such area segment, the procedure is called area sampling.
Examples:

1. In a city, the list of all the individual persons staying in the houses may be difficult
to obtain or even may be not available but a list of all the houses in the city may
be available. So every individual person will be treated as sampling unit and every
house will be a cluster.

2. The list of all the agricultural farms in a village or a district may not be easily
available but the list of village or districts are generally available. In this case,


every farm is a sampling unit and every village or district is a cluster.

Moreover, it is easier, faster, cheaper and convenient to collect information on clusters


rather than on sampling units. In both the examples, draw a sample of clusters from
houses/villages and then collect the observations on all the sampling units available in
the selected clusters.
Conditions under which the cluster sampling is used:
Cluster sampling is preferred when;

1. No reliable listing of elements is available and it is expensive to prepare it.

2. Even if the list of elements is available, the location or identification of the units
may be difficult.

3. A necessary condition for the validity of this procedure is that every unit of the
population under study must correspond to one and only one unit of the cluster so
that the total number of sampling units in the frame may cover all the units of the
population under study without any omission or duplication. When this condition
is not satisfied, bias is introduced.

Open segment and closed segment:


It is not necessary that all the elements associated with an area segment need be
located physically within its boundaries. For example, in the study of farms, the different
fields of the same farm need not lie within the same area segment. Such a segment is
called an open segment. In a closed segment, the sum of the characteristic under study,
i.e., area, livestock etc. for all the elements associated with the segment will account for
all the area, livestock etc. within the segment.
Construction of clusters:
The clusters are constructed such that the sampling units are heterogeneous within the clusters and homogeneous among the clusters; the reason for this will become clear later. This is the opposite of the construction of strata in stratified sampling. There are two options for constructing the clusters: equal size and unequal size. We discuss the estimation of the population mean and its variance in both cases.


6.2 Estimation of Population Mean, Variance and Total

6.2.1 Case of equal clusters

Suppose the population is divided into N clusters and each cluster is of size M. Select a sample of n clusters from the N clusters by the method of SRS, generally WOR.
So total population size = NM and total sample size = nM. Let yij be the value of the characteristic under study for the jth element (j = 1, 2, ..., M) in the ith cluster (i = 1, 2, ..., N), and let
Ȳi = (1/M) Σ_{j=1}^{M} Yij be the mean per element of the ith cluster.
First select n clusters from the N clusters by SRSWOR. Based on the n selected clusters, find the mean of each cluster separately using all the units in that cluster. So we have the cluster means ȳ1, ȳ2, ..., ȳn. Consider the mean of all such cluster means as an estimator of the population mean:
ȳcl = (1/n) Σ_{i=1}^{n} ȳi.
Bias: E(ȳcl) = (1/n) Σ_{i=1}^{n} E(ȳi) = (1/n) Σ_{i=1}^{n} [(1/N) Σ_{j=1}^{N} Ȳj] = Ȳ (since SRS is used to select the clusters).
Thus ȳcl is an unbiased estimator of Ȳ.
Variance:
The variance of ȳcl can be derived along the same lines as the variance of the sample mean in SRSWOR. The only difference is that in SRSWOR the sampling units are y1, y2, ..., yn, whereas in the case of ȳcl the sampling units are the cluster means ȳ1, ȳ2, ..., ȳn. Recall that in SRSWOR, Var(ȳ) = [(N − n)/(Nn)] S² and Var̂(ȳ) = [(N − n)/(Nn)] s². Hence
Var(ȳcl) = E(ȳcl − Ȳ)² = [(N − n)/(Nn)] Sb², where Sb² = [1/(N − 1)] Σ_{i=1}^{N} (Ȳi − Ȳ)²
is the mean sum of squares between the cluster means in the population.
Estimate of variance:
Using again the philosophy of the estimate of variance in SRSWOR, we find
Var̂(ȳcl) = [(N − n)/(Nn)] sb², where sb² = [1/(n − 1)] Σ_{i=1}^{n} (ȳi − ȳcl)²
is the mean sum of squares between cluster means in the sample.
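
For equal clusters the computations reduce to an SRS calculation on the cluster means. The following Python sketch assumes every selected cluster has been completely enumerated; the cluster data and N are invented for illustration.

# Equal-cluster sampling: ybar_cl and its estimated variance.
N = 40
clusters = [
    [4.1, 3.8, 5.0, 4.4],
    [6.2, 5.9, 6.5, 6.1],
    [5.0, 4.7, 5.3, 5.1],
    [3.9, 4.2, 4.0, 4.3],
]
n = len(clusters)

cluster_means = [sum(c) / len(c) for c in clusters]
ybar_cl = sum(cluster_means) / n                       # unbiased for the population mean

# s_b^2 = mean sum of squares between cluster means in the sample
s_b2 = sum((m - ybar_cl) ** 2 for m in cluster_means) / (n - 1)
var_hat = (N - n) / (N * n) * s_b2                     # estimate of Var(ybar_cl)

print(round(ybar_cl, 3), round(var_hat, 5))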


6.3 Exercises


7 Ratio and Regression Estimation

7.1 Ratio Estimation

An important objective in any statistical estimation procedure is to obtain the estimators


of parameters of interest with more precision. It is also well understood that incorporation
of more information in the estimation procedure yields better estimators, provided the
information is valid and proper. Use of such auxiliary information is made through the
ratio method of estimation to obtain an improved estimator of population mean. In ratio
method of estimation, auxiliary information on a variable is available which is linearly
related to the variable under study and is utilized to estimate the population mean.
Let Y be the variable under study and X be any auxiliary variable which is correlated
with Y . The observations xi on X and yi on Y are obtained for each sampling unit. The
population mean X̄ of X (or equivalently the population total Xtot) must be known. For example, the xi's may be values obtained from:

1. some earlier completed census,

2. some earlier surveys,

3. some characteristic on which it is easy to obtain information etc.

For example, if yi is the quantity of fruits produced in the ith plot, then xi can be the
area of ith plot or the production of fruit in the same plot in previous year.

Theorem 7.1. Let (x1, y1), (x2, y2), ..., (xn, yn) be a random sample of size n on the paired variables (X, Y), drawn, preferably by SRSWOR, from a population of size N. The ratio estimator of the population mean Ȳ is
ȲˆR = (ȳ/x̄) X̄ = R̂ X̄,
assuming that the population mean X̄ is known. The ratio estimator of the population total Ytot = Σ_{i=1}^{N} Yi is
ŶR(tot) = (ytot/xtot) Xtot,
where Xtot = Σ_{i=1}^{N} Xi is the population total of X, which is assumed to be known, and ytot = Σ_{i=1}^{n} yi and xtot = Σ_{i=1}^{n} xi are the sample totals of Y and X respectively. ŶR(tot) can be equivalently expressed as ŶR(tot) = (ȳ/x̄) Xtot = R̂ Xtot.

Looking at the structure of the ratio estimator, note that the ratio method estimates the relative change Ytot/Xtot using the observed pairs (xi, yi). It is clear that if the variation among the values yi/xi is small, i.e. the ratio is nearly the same for all i = 1, 2, ..., n, then ytot/xtot (or equivalently ȳ/x̄) varies little from sample to sample and the ratio estimate will be of high precision.

7.2 Bias and mean squared error of ratio estimator

Assume that the random sample (xi, yi), i = 1, 2, ..., n is drawn by SRSWOR and the population mean X̄ is known. Then, averaging over all C(N, n) possible samples, E(ȲˆR) = [1/C(N, n)] Σ over all samples of (ȳ/x̄) X̄ ≠ Ȳ in general. Moreover, it is difficult to find exact expressions for E(ȳ/x̄) and E(ȳ²/x̄²), so we approximate them and proceed as follows.


Let
ε0 = (ȳ − Ȳ)/Ȳ ⇒ ȳ = (1 + ε0) Ȳ,
ε1 = (x̄ − X̄)/X̄ ⇒ x̄ = (1 + ε1) X̄.
Since SRSWOR is being followed,
E(ε0) = 0, E(ε1) = 0,
E(ε0²) = (1/Ȳ²) E(ȳ − Ȳ)² = (1/Ȳ²) [(N − n)/(Nn)] SY² = (f/n) SY²/Ȳ² = (f/n) CY²,
where f = (N − n)/N, SY² = [1/(N − 1)] Σ_{i=1}^{N} (Yi − Ȳ)² and CY = SY/Ȳ is the coefficient of variation of Y.
Similarly,
E(ε1²) = (f/n) CX²,
E(ε0 ε1) = [1/(X̄Ȳ)] E[(x̄ − X̄)(ȳ − Ȳ)] = [1/(X̄Ȳ)] (f/n) SXY = (f/n) ρ SX SY/(X̄Ȳ) = (f/n) ρ CX CY,
where CX = SX/X̄ is the coefficient of variation of X and ρ is the population correlation coefficient between X and Y.
Writing ȲˆR in terms of the ε's, ȲˆR = (ȳ/x̄) X̄ = [(1 + ε0) Ȳ/((1 + ε1) X̄)] X̄ = (1 + ε0)(1 + ε1)⁻¹ Ȳ.
Assuming |ε1| < 1, the term (1 + ε1)⁻¹ may be expanded as an infinite series and it would be convergent. Such an assumption means that |(x̄ − X̄)/X̄| < 1, i.e., the possible estimate x̄ of the population mean X̄ lies between 0 and 2X̄. This is likely to hold true if the variation in x̄ is not large; to ensure that the variation in x̄ is small, assume that the sample size n is fairly large. With this assumption,
ȲˆR = Ȳ (1 + ε0)(1 − ε1 + ε1² − ...) = Ȳ (1 + ε0 − ε1 + ε1² − ε0 ε1 + ...).
So the estimation error of ȲˆR is
ȲˆR − Ȳ = Ȳ (ε0 − ε1 + ε1² − ε0 ε1 + ...).
When the sample size is large, ε0 and ε1 are likely to be small quantities, so the terms involving second and higher powers of ε0 and ε1 are negligibly small. In such a case ȲˆR − Ȳ ≈ Ȳ (ε0 − ε1) and E(ȲˆR − Ȳ) = 0.
So the ratio estimator is an unbiased estimator of the population mean up to the first order of approximation.
If we assume that only the terms of ε0 and ε1 involving powers higher than two are negligibly small (which is more realistic than assuming that powers higher than one are negligible), then the estimation error of ȲˆR can be approximated as
ȲˆR − Ȳ ≈ Ȳ (ε0 − ε1 + ε1² − ε0 ε1).
Then the bias of ȲˆR is given by
E(ȲˆR − Ȳ) = Ȳ [0 − 0 + (f/n) CX² − (f/n) ρ CX CY] = (f/n) Ȳ (CX² − ρ CX CY)
up to the second order of approximation. The bias generally decreases as the sample size grows large.
The bias of ȲˆR is zero, i.e. Bias(ȲˆR) = 0,
if E(ε1² − ε0 ε1) = 0,
or if Var(x̄)/X̄² − Cov(x̄, ȳ)/(X̄Ȳ) = 0,
or if (1/X̄²)[Var(x̄) − (X̄/Ȳ) Cov(x̄, ȳ)] = 0,
or if Var(x̄) − Cov(x̄, ȳ)/R = 0 (assuming X̄ ≠ 0),
or if R = Ȳ/X̄ = Cov(x̄, ȳ)/Var(x̄),
which is satisfied when the regression line of Y on X passes through the origin.


Now, to find the mean squared error, consider


MSE(ȲˆR) = E(ȲˆR − Ȳ)²
= E[Ȳ² (ε0 − ε1 + ε1² − ε0 ε1 + ...)²]
= E[Ȳ² (ε0² + ε1² − 2ε0 ε1)],
under the assumption |ε1| < 1 and neglecting terms of ε0 and ε1 with powers higher than two. Hence
MSE(ȲˆR) = Ȳ² [(f/n) CX² + (f/n) CY² − (2f/n) ρ CX CY] = (f/n) Ȳ² [CX² + CY² − 2ρ CX CY]
up to the second order of approximation.
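
As a concrete illustration, the ratio estimate of Ȳ and its approximate MSE can be computed as below. This is a minimal Python sketch with made-up sample data; Xbar is the assumed known population mean of the auxiliary variable, and the sample variances and covariance stand in for the population quantities.

import math

# Ratio estimation of the population mean (SRSWOR), with the approximate MSE above.
x = [0.40, 0.48, 0.43, 0.42, 0.50, 0.46, 0.39, 0.41, 0.42, 0.44]   # auxiliary variable
y = [4.1, 4.9, 4.4, 4.3, 5.1, 4.7, 3.9, 4.2, 4.3, 4.5]             # study variable (invented)
N, Xbar = 200, 0.45                                                # assumed known

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
R_hat = ybar / xbar
Y_ratio = R_hat * Xbar                     # ratio estimate of the population mean

sx2 = sum((v - xbar) ** 2 for v in x) / (n - 1)
sy2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

f = (N - n) / N
# MSE(Ybar_R) ~ (f/n)(S_Y^2 + R^2 S_X^2 - 2 R S_XY), an equivalent form of the C_X, C_Y expression
mse_hat = f / n * (sy2 + R_hat ** 2 * sx2 - 2 * R_hat * sxy)

print(round(Y_ratio, 3), round(mse_hat, 5), round(math.sqrt(mse_hat), 4))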

7.3 Regression Estimation

The ratio method of estimation uses the auxiliary information which is correlated with
the study variable to improve the precision which results in the improved estimators when
the regression of Y on X is linear and passes through origin. When the regression of Y
on X is linear, it is not necessary that the line should always pass through origin. Under
such conditions, it is more appropriate to use the regression type estimator to estimate
the population means.
In the ratio method, the conventional estimator, the sample mean ȳ, was improved by multiplying it by the factor X̄/x̄, where x̄ is an unbiased estimator of the population mean X̄ of the auxiliary variable. Now we consider another idea, based on a difference.
Consider the statistic x̄ − X̄, for which E(x̄ − X̄) = 0. Consider an improved estimator of Ȳ of the form
Ȳˆ* = ȳ + μ(x̄ − X̄),
which is an unbiased estimator of Ȳ for any constant μ. Now find μ such that Var(Ȳˆ*) is minimum:
Var(Ȳˆ*) = Var(ȳ) + μ² Var(x̄) + 2μ Cov(x̄, ȳ),
∂Var(Ȳˆ*)/∂μ = 0 ⇒ μ = −Cov(x̄, ȳ)/Var(x̄) = −{[(N − n)/(Nn)] SXY}/{[(N − n)/(Nn)] SX²} = −SXY/SX²,
where SXY = [1/(N − 1)] Σ_{i=1}^{N} (Xi − X̄)(Yi − Ȳ) and SX² = [1/(N − 1)] Σ_{i=1}^{N} (Xi − X̄)².

Consider a linear regression model y = xβ + e where y is the dependent variable, x


is the independent variable and e is the random error component which takes care of the
difference arising due to lack of exact relationship between x and y. Note that the value


of the regression coefficient β in a linear regression model y = xβ + e of y on x, obtained by minimizing Σ_{i=1}^{n} ei² based on the n data pairs (xi, yi), i = 1, 2, ..., n, is β = Cov(x, y)/Var(x) = sxy/sx². Thus the optimum value of μ is the same as the regression coefficient of y on x with a negative sign, i.e., μ = −β. So the estimator Ȳˆ* with the optimum value of μ is
Ȳˆreg = ȳ + β(X̄ − x̄),
which is the regression estimator of Ȳ, and the procedure of estimation is called the regression method of estimation.
The variance of Ȳˆreg is Var(Ȳˆreg) = Var(ȳ)[1 − ρ²(x̄, ȳ)], where ρ(x̄, ȳ) is the correlation coefficient between x̄ and ȳ. So Ȳˆreg is efficient if x and y are highly correlated. The estimator Ȳˆreg is more efficient than ȳ if ρ(x̄, ȳ) ≠ 0, which generally holds.
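
The following Python sketch illustrates the regression estimator with β estimated from the sample (as in subsection 7.3.2) and the variance estimate of subsection 7.3.1. The data and the known X̄ are invented for illustration.

# Linear regression estimator of the population mean, with beta estimated by s_xy / s_x^2.
x = [2.1, 3.4, 2.8, 4.0, 3.1, 2.5, 3.7, 2.9]          # auxiliary variable (invented)
y = [21.0, 33.5, 27.8, 40.2, 30.9, 24.6, 36.8, 29.1]  # study variable (invented)
N, Xbar = 500, 3.0                                    # population size and known mean of X

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
sx2 = sum((a - xbar) ** 2 for a in x) / (n - 1)
sy2 = sum((b - ybar) ** 2 for b in y) / (n - 1)

beta_hat = sxy / sx2
y_reg = ybar + beta_hat * (Xbar - xbar)               # regression estimate of Ybar

# Estimated variance: (f/n)(s_y^2 + beta^2 s_x^2 - 2 beta s_xy), cf. subsection 7.3.1
f = (N - n) / N
var_hat = f / n * (sy2 + beta_hat ** 2 * sx2 - 2 * beta_hat * sxy)

print(round(y_reg, 3), round(var_hat, 5))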

7.3.1 Estimate of variance


 
An unbiased sample estimate of Var(Ȳˆreg) is
Var̂(Ȳˆreg) = [f/(n(n − 1))] Σ_{i=1}^{n} [(yi − ȳ) − β0 (xi − x̄)]²
= (f/n)(sy² + β0² sx² − 2β0 sxy).

7.3.2 Regression estimates when β is computed from sample

Suppose a random sample of n paired observations (xi, yi), i = 1, 2, ..., n is drawn by SRSWOR. When β is unknown, it is estimated as β̂ = sxy/sx², and the regression estimator of Ȳ is then Ȳˆreg = ȳ + β̂(X̄ − x̄). It is difficult to find exact expressions for E(Ȳˆreg) and Var(Ȳˆreg), so we approximate them using the same methodology as in the case of the ratio method of estimation. Let
ε0 = (ȳ − Ȳ)/Ȳ ⇒ ȳ = Ȳ(1 + ε0),
ε1 = (x̄ − X̄)/X̄ ⇒ x̄ = X̄(1 + ε1),
ε2 = (sxy − SXY)/SXY ⇒ sxy = SXY(1 + ε2),
ε3 = (sx² − SX²)/SX² ⇒ sx² = SX²(1 + ε3).
Then E(ε0) = 0, E(ε1) = 0, E(ε2) = 0, E(ε3) = 0, E(ε0²) = (f/n) CY², E(ε1²) = (f/n) CX², E(ε0 ε1) = (f/n) ρ CX CY, and
Ȳˆreg = ȳ + (sxy/sx²)(X̄ − x̄) = Ȳ(1 + ε0) + [SXY(1 + ε2)/(SX²(1 + ε3))](−ε1 X̄). The estimation error of Ȳˆreg is


 
Ȳˆreg − Ȳ = Ȳ ε0 − β X̄ ε1 (1 + ε2)(1 + ε3)⁻¹, where β = SXY/SX² is the population regression coefficient. Assuming |ε3| < 1,
Ȳˆreg − Ȳ ≈ Ȳ ε0 − β X̄ (ε1 + ε1 ε2)(1 − ε3 + ε3²)
≈ Ȳ ε0 − β X̄ (ε1 − ε1 ε3 + ε1 ε2).    (7.1)

7.3.3 Bias and mean squared error of Ȳˆreg

The bias of Ȳˆreg up to the second order of approximation is obtained by taking the expectation of (7.1). For the mean squared error, consider
MSE(Ȳˆreg) = E(Ȳˆreg − Ȳ)² ≈ E[Ȳ ε0 − β X̄(ε1 − ε1 ε3 + ε1 ε2)]².
Retaining the terms of the ε's up to the second power and ignoring the others, we have
MSE(Ȳˆreg) ≈ E(Ȳ² ε0² + β² X̄² ε1² − 2β X̄ Ȳ ε0 ε1)
= Ȳ² E(ε0²) + β² X̄² E(ε1²) − 2β X̄ Ȳ E(ε0 ε1)
= (f/n)[Ȳ² SY²/Ȳ² + β² X̄² SX²/X̄² − 2β X̄ Ȳ ρ SX SY/(X̄ Ȳ)]
= (f/n)(SY² + β² SX² − 2βρ SX SY).
Since β = SXY/SX² = ρ SY/SX, substituting it in MSE(Ȳˆreg) we get
MSE(Ȳˆreg) = (f/n) SY² (1 − ρ²).    (7.2)

So, up to the second order of approximation, the regression estimator is better than the conventional sample mean estimator under SRSWOR. This is because the regression estimator uses extra information; however, such extra information also requires extra cost, so the comparison can overstate its superiority. The regression and SRS estimators should therefore be compared with the cost aspect also taken into consideration.

7.4 Exercises

1. The number of inhabitants (in 1000's) in each of a simple random sample of 49 cities drawn from a population of 196 large cities gives Σ yi = 6262 and Σ xi = 5054. The true total number of inhabitants in the 196 cities in 1929, the auxiliary total X, is assumed to be known; its value is 22,919. Estimate the total number of inhabitants in the 196 cities in 1980, (i) using the ratio estimator and (ii) using the simple sample estimator.

2. In a study to estimate the total sugar content of a truck load of oranges a random


sample of n=10 oranges was juiced and weighed as shown in the table below. The
total weight of all the oranges obtained by first weighing the truck loaded and then
unloaded was found to be 1800 pounds. Estimate the total sugar content of oranges
and place a bound on the error of estimation.

Oranges Sugar content(yi ) Weight of oranges(xi ) y i xi

1 0.021 0.40 0.0084

2 0.030 0.48 0.0144

3 0.025 0.43 0.01075

4 0.022 0.42 0.00924

5 0.030 0.50 0.0150

6 0.027 0.46 0.01242

7 0.019 0.39 0.00741

8 0.021 0.41 0.00861

9 0.023 0.42 0.00966

10 0.025 0.44 0.0110

TOTAL 4.35 0.10689
Σ yi = 0.245
3. A company wishes to estimate the average amount of money Ȳ paid to employees for medical expenses during the first three months of the calendar year. Average quarterly reports are available in the fiscal report of the previous year. A random sample of 100 employee records is taken from the population of 1000 employees. The sample results are summarised below; use the data to estimate the population mean and place a bound on the error of estimation.
n = 100, N = 1000, Σ yi = 1750. The total for the corresponding quarter of the previous year is Σ_{i=1}^{100} xi = 1200, and the population total for that quarter of the previous year is X = 12,500. Also Σ yi² = 31,650, Σ xi² = 15,620, Σ xi yi = 22,059.35.


8 Double Sampling (Two Phase Sampling)

8.1 Introduction and Description

The ratio and regression methods of estimation require knowledge of the population mean X̄ of the auxiliary variable in order to estimate the population mean Ȳ of the study variable. If information on the auxiliary variable is not available, then there are two options. One option is to collect a sample only on the study variable and use the sample mean as an estimator of the population mean.
An alternative solution is to use part of the budget to collect information on the auxiliary variable: first collect a large preliminary sample in which x alone is measured. The purpose of this sampling is to furnish a good estimate of X̄. This method is appropriate when the information about x is on file cards that have not been tabulated. After collecting a large preliminary sample of n′ units from the population, select a smaller sample of size n from it and collect the information on y. These two samples are then used to obtain an estimator of the population mean Ȳ. This procedure of selecting a large sample for collecting information on the auxiliary variable x, and then selecting a sub-sample from it for collecting the information on the study variable y, is called double sampling or two phase sampling. It is useful when it is considerably cheaper and quicker to collect data on x than on y and there is high correlation between x and y.
In this sampling, the randomization is done twice. First a random sample of size n′ is drawn from the population of size N, and then a random sample of size n is drawn from the first sample of size n′.
So the sample mean in this sampling is a function of the two phases of sampling. If SRSWOR is utilized to draw the samples at both phases, then:
→ the number of possible samples at the first phase, when a sample of size n′ is drawn from a population of size N, is C(N, n′) = M0, say;
→ the number of possible samples at the second phase, where a sample of size n is drawn from the first-phase sample of size n′, is C(n′, n) = M1, say.
The sample mean is then a function of two random variables. If τ is the statistic calculated at the second phase, taking values τij, i = 1, 2, ..., M0, j = 1, 2, ..., M1, with Pij being the probability that the ith sample is chosen at the first phase and the jth sample at the second phase, then
E(τ) = E1[E2(τ)], where E2 denotes the expectation over the second phase and E1 the expectation over the first phase. Thus
E(τ) = Σ_{i=1}^{M0} Σ_{j=1}^{M1} Pij τij
= Σ_{i=1}^{M0} Σ_{j=1}^{M1} Pi Pj|i τij   (using P(A ∩ B) = P(A) P(B|A))
= Σ_{i=1}^{M0} Pi [Σ_{j=1}^{M1} Pj|i τij],
where the outer sum Σ_{i=1}^{M0} Pi is over the first stage and the inner sum Σ_{j=1}^{M1} Pj|i τij is over the second stage.

8.1.1 Variance of τ

Var(τ) = E[τ − E(τ)]²
= E[(τ − E2(τ)) + (E2(τ) − E(τ))]²
= E1 E2[τ − E2(τ)]² + E1 E2[E2(τ) − E(τ)]²   (the cross-product term vanishes because E2[τ − E2(τ)] = 0)
= E1[V2(τ)] + E1[E2(τ) − E1(E2(τ))]²
= E1[V2(τ)] + V1[E2(τ)].
Note: The two phase sampling can be extended to more than two phases depending
upon the need and objective of the experiment. Various expectations can also be extended
on the similar lines.

8.2 Double sampling in ratio method of estimation

If the population mean X̄ is not known, then the double sampling technique is applied. Take a large initial sample of size n′ by SRSWOR and estimate the population mean X̄ by
X̄̂ = x̄′ = (1/n′) Σ_{i=1}^{n′} xi.
Then a second sample, a subsample of size n, is selected from the initial sample by SRSWOR. Let ȳ and x̄ be the means of y and x based on the subsample. Then


E(x̄′) = X̄, E(x̄) = X̄, E(ȳ) = Ȳ.
The ratio estimator under double sampling now becomes
ȲˆRd = (ȳ/x̄) x̄′.
The exact expressions for the bias and mean squared error of ȲˆRd are difficult to derive, so we find their approximate expressions using the same approach as in the ratio method of estimation. Let
ε0 = (ȳ − Ȳ)/Ȳ, ε1 = (x̄ − X̄)/X̄, ε2 = (x̄′ − X̄)/X̄.
Then E(ε0) = 0, E(ε1) = 0, E(ε2) = 0, and
E(ε1²) = (1/n − 1/N) CX²,
E(ε1 ε2) = (1/X̄²) E[(x̄ − X̄)(x̄′ − X̄)]
= (1/X̄²) E1[E2{(x̄ − X̄)(x̄′ − X̄) | n′}]
= (1/X̄²) E1[(x̄′ − X̄)²]
= (1/n′ − 1/N) SX²/X̄² = (1/n′ − 1/N) CX² = E(ε2²),
E(ε0 ε2) = [1/(X̄Ȳ)] Cov(ȳ, x̄′)
= [1/(X̄Ȳ)] {Cov[E(ȳ | n′), E(x̄′ | n′)] + E[Cov(ȳ, x̄′ | n′)]}
= [1/(X̄Ȳ)] Cov(ȳ′, x̄′)
= (1/n′ − 1/N) SXY/(X̄Ȳ)
= (1/n′ − 1/N) ρ (SX/X̄)(SY/Ȳ)
= (1/n′ − 1/N) ρ CX CY,
where ȳ′ is the sample mean of the y's based on the initial sample of size n′ (y is not actually observed at the first phase; ȳ′ enters only as a device in the derivation),
E(ε0 ε1) = [1/(X̄Ȳ)] Cov(ȳ, x̄)
= (1/n − 1/N) SXY/(X̄Ȳ)
= (1/n − 1/N) ρ (SX/X̄)(SY/Ȳ)
= (1/n − 1/N) ρ CX CY,
and
E(ε0²) = (1/Ȳ²) Var(ȳ)
= (1/Ȳ²)[V1{E2(ȳ | n′)} + E1{V2(ȳ | n′)}]
= (1/Ȳ²)[V1(ȳ′) + E1{(1/n − 1/n′) s′y²}]
= (1/Ȳ²)[(1/n′ − 1/N) SY² + (1/n − 1/n′) SY²]
= (1/n − 1/N) SY²/Ȳ²
= (1/n − 1/N) CY²,
where s′y² is the mean sum of squares of y based on the initial sample of size n′.
Equivalently,
E(ε1 ε2) = (1/X̄²) Cov(x̄, x̄′)
= (1/X̄²)[Cov{E(x̄ | n′), E(x̄′ | n′)} + E{Cov(x̄, x̄′ | n′)}]
= (1/X̄²) Var(x̄′),
where Var(x̄′) is the variance of the mean of x based on the initial sample of size n′.
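
A sketch of the double-sampling ratio estimator ȲˆRd = (ȳ/x̄) x̄′ in Python: a large first-phase sample measures x only, and a sub-sample measures both x and y. All data below are simulated and purely illustrative.

import random

# Double sampling for the ratio estimator: x is cheap to measure, y is expensive.
rng = random.Random(3)
first_phase_x = [rng.uniform(5, 15) for _ in range(200)]       # n' units, x observed only
xbar_prime = sum(first_phase_x) / len(first_phase_x)           # estimates the unknown Xbar

sub_idx = rng.sample(range(len(first_phase_x)), 40)            # second phase: n units
x_sub = [first_phase_x[i] for i in sub_idx]
y_sub = [2.0 * x + rng.gauss(0, 1.0) for x in x_sub]           # y observed on the sub-sample only

xbar = sum(x_sub) / len(x_sub)
ybar = sum(y_sub) / len(y_sub)
y_ratio_double = ybar / xbar * xbar_prime                      # Ybar-hat_Rd = (ybar/xbar) * xbar'
print(round(y_ratio_double, 3))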





9 Varying Probability Sampling.

9.1 Introduction and Description

The simple random sampling scheme provides a random sample where every unit in the
population has equal probability of selection. Under certain circumstances, more effi-
cient estimators are obtained by assigning unequal probabilities of selection to the units
in the population. This type of sampling is known as varying probability sampling
scheme.
If Y is the variable under study and X is an auxiliary variable related to Y , then in
the most commonly used varying probability scheme, the units are selected with prob-
ability proportional to the value of X, called the size. This is termed probability proportional to a given measure of size (pps) sampling. If the sampling units vary considerably in size, then SRS does not take into account the possible importance
of the larger units in the population. A large unit, i.e., a unit with large value of Y con-
tributes more to the population total than the units with smaller values, so it is natural to
expect that a selection scheme which assigns more probability of inclusion in a sample to
the larger units than to the smaller units would provide more efficient estimators than the
estimators which provide equal probability to all the units. This is accomplished through
pps sampling.
Note that the “size” considered is the value of auxiliary variable X and not the value
of study variable Y . For example in an agriculture survey, the yield depends on the
area under cultivation. So bigger areas are likely to have larger population and they will
contribute more towards the population total, so the value of the area can be considered
as the size of auxiliary variable. Also, the cultivated area for a previous period can also be
taken as the size while estimating the yield of crop. Similarly, in an industrial survey, the
number of workers in a factory can be considered as the measure of size when studying
the industrial output from the respective factory.
Probability proportional to size sampling can be done in two ways:

1. Selection of units with replacement: The probability of selection of a unit will not


change and the probability of selecting a specified unit is same at any stage. There
is no redistribution of the probabilities after a draw.

2. Selection of units without replacement: The probability of selection of a unit will


change at any stage and the probabilities are redistributed after each draw. PPS
without replacement (WOR) is more complex than PPS with replacement (WR) .
We consider both the cases separately.

9.2 PPS sampling with replacement (WR)

First we discuss the two methods to draw a sample with PPS and WR.

9.2.1 Estimation of Population Mean, Variance and Total

Notations:
Yi : value of study variable for the ith unit of the population, i = 1, 2, . . . , N.
Xi : known value of auxiliary variable (size) for the ith unit of the population.
Pi : probability of selection of ith unit in the population at any given draw and is
proportional to size Xi .

Theorem 9.1. Consider the varying probability scheme with replacement for a sample of size n. Let yr be the value of the rth observation on the study variable in the sample and pr be its initial probability of selection. Define zr = yr/(N pr), r = 1, 2, ..., n. Then z̄ = (1/n) Σ_{r=1}^{n} zr is an unbiased estimator of the population mean Ȳ, the variance of z̄ is σz²/n, where σz² = Σ_{i=1}^{N} Pi [Yi/(N Pi) − Ȳ]², and an unbiased estimate of the variance of z̄ is sz²/n with sz² = [1/(n − 1)] Σ_{r=1}^{n} (zr − z̄)².

Proof: Note that zr can take any one of the N values Z1, Z2, ..., ZN, where Zi = Yi/(N Pi), with corresponding initial probabilities P1, P2, ..., PN. So
E(zr) = Σ_{i=1}^{N} Zi Pi = Σ_{i=1}^{N} [Yi/(N Pi)] Pi = Ȳ.
Therefore E(z̄) = (1/n) Σ_{r=1}^{n} E(zr) = Ȳ, so z̄ is an unbiased estimator of the population mean Ȳ.
The variance of z̄ is Var(z̄) = (1/n²) Var(Σ_{r=1}^{n} zr) = (1/n²) Σ_{r=1}^{n} Var(zr), since the zr's are independent in the WR case.
Now Var(zr) = E[zr − E(zr)]² = E(zr − Ȳ)² = Σ_{i=1}^{N} (Zi − Ȳ)² Pi = Σ_{i=1}^{N} [Yi/(N Pi) − Ȳ]² Pi = σz² (say).
Therefore Var(z̄) = (1/n²) Σ_{r=1}^{n} σz² = σz²/n.
To show that sz²/n is an unbiased estimator of the variance of z̄, consider
(n − 1) E(sz²) = E[Σ_{r=1}^{n} (zr − z̄)²] = E[Σ_{r=1}^{n} zr² − n z̄²]
= Σ_{r=1}^{n} E(zr²) − n E(z̄²)
= Σ_{r=1}^{n} [Var(zr) + {E(zr)}²] − n[Var(z̄) + {E(z̄)}²]
= Σ_{r=1}^{n} (σz² + Ȳ²) − n(σz²/n + Ȳ²)
= (n − 1) σz².
Hence E(sz²) = σz², so E(sz²/n) = σz²/n = Var(z̄), and
Var̂(z̄) = sz²/n = [1/(n(n − 1))] [Σ_{r=1}^{n} (yr/(N pr))² − n z̄²].
Note: If Pi = 1/N, then zr = yr and z̄ = ȳ, with Var(z̄) = (1/n)(1/N) Σ_{i=1}^{N} (Yi − Ȳ)² = σy²/n, which is the same as in the case of SRSWR.

Theorem 9.2. An unbiased estimator of the population total is Ŷtot = (1/n) Σ_{r=1}^{n} yr/pr = N z̄.
Proof: Taking expectation,
E(Ŷtot) = (1/n) Σ_{r=1}^{n} E(yr/pr) = (1/n) Σ_{r=1}^{n} [(Y1/P1) P1 + (Y2/P2) P2 + ... + (YN/PN) PN] = (1/n) Σ_{r=1}^{n} Σ_{i=1}^{N} Yi = Ytot.
Therefore Ŷtot is an unbiased estimator of the population total.

Theorem 9.3. The variance of Ŷtot is given by Var(Ŷtot) = N² Var(z̄).
Proof: Var(Ŷtot) = N² Var(z̄) = (1/n) Σ_{i=1}^{N} Pi (Yi/Pi − Ytot)² = (1/n) [Σ_{i=1}^{N} Yi²/Pi − Ytot²].

Corollary 9.1. An estimate of the variance of Ŷtot is Var̂(Ŷtot) = N² sz²/n.

9.2.2 Varying probability scheme without replacement

In varying probability scheme without replacement, when the initial probabilities of se-
lection are unequal, then the probability of drawing a specified unit of the population
at a given draw changes with the draw. Generally, the sampling WOR provides a more
efficient estimator than sampling WR. The estimators for population mean and variance


are more complicated. So this scheme is not commonly used in practice, especially in
large scale sample surveys with small sampling fractions.
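
To make the with-replacement scheme of Section 9.2.1 concrete, the Python sketch below selects units by the cumulative-total method and then computes the unbiased estimator of the population total Ŷtot = (1/n) Σ yr/pr together with its estimated variance (Corollary 9.1). The population values and sizes are invented.

import random

# PPS sampling with replacement via the cumulative-total method.
Y = [12, 30, 45, 22, 38, 50, 95, 80, 74, 8]      # study variable (invented)
X = [8, 12, 20, 25, 30, 35, 40, 50, 60, 70]      # sizes (auxiliary variable, invented)
n = 4
N = len(Y)

total_X = sum(X)
P = [x / total_X for x in X]                     # selection probabilities proportional to size

rng = random.Random(7)
cum = []
running = 0
for x in X:                                      # cumulative totals define the selection ranges
    running += x
    cum.append(running)

sample = []
for _ in range(n):                               # with replacement: probabilities never change
    r = rng.randint(1, total_X)
    unit = next(i for i, c in enumerate(cum) if r <= c)
    sample.append(unit)

z = [Y[i] / (N * P[i]) for i in sample]          # z_r = y_r / (N p_r)
zbar = sum(z) / n                                # unbiased for the population mean
Y_hat = N * zbar                                 # unbiased for the population total
s2_z = sum((v - zbar) ** 2 for v in z) / (n - 1)
var_hat_total = N ** 2 * s2_z / n                # Corollary 9.1

print(sample, round(Y_hat, 1), round(var_hat_total, 1))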

9.3 Exercises

1. Show how to estimate the population mean and variance in a varying probability scheme without replacement.
2. Ten farms have 10, 20, 30, 15, 25, 35, 65, 55, 50 and 5 acres of land under maize. It is desired to select a sample of size 4 without replacement and with probability proportional to the acreage of maize on the farm. The total acreage of maize is 310. The first step is to form cumulative totals (C.T.) and ranges as follows.
Farm Size C.T Range

1 10 10 1-10

2 20 30 11-30

3 30 60 31-60

4 15 75 61-75

5 25 100 76-100

6 35 135 101-135

7 65 200 136-200

8 55 255 201-255

9 50 305 256-305

10 5 310 306-310
3. Explain the Lahiri’s method as used in probability proportional to size with re-
placement sampling.
4. Define the following estimators

→ Horvitz Thompson (HT) estimator.

→ Murthy’s unordered estimator

→ Des Raj’s ordered estimator


10 Two Stage Sampling(Subsampling)

10.1 Introduction and Description

In cluster sampling, all the elements in the selected clusters are surveyed. Moreover, the
efficiency in cluster sampling depends on size of the cluster. As the size increases, the effi-
ciency decreases. It suggests that higher precision can be attained by distributing a given
number of elements over a large number of clusters and then by taking a small number of
clusters and enumerating all elements within them. This is achieved in subsampling. In
subsampling

1. Divide the population into clusters.

2. Select a sample of clusters (first stage).

3. From each of the selected cluster, select a sample of specified number of elements
(second stage)

The clusters which form the units of sampling at the first stage are called the first stage
units or primary stage units and the units or group of units within clusters which
form the unit of clusters are called the second stage units or subunits or secondary
stage units. The procedure is generalized to three or more stages and is then termed as
multistage sampling.
For example, in a crop survey;

1. Villages are the first stage units.

2. Fields within the villages are the second stage units and

3. Plots within the fields are the third stage units.

Two stage sampling with equal first stage units: Assume that;

1. Population consists of N M elements.

2. NM elements are grouped into N first stage units of M second stage units each,
(i.e., N clusters, each cluster is of size M )


3. Sample of n first stage units is selected (i.e., choose n clusters)

4. Sample of m second stage units is selected from each selected first stage unit (i.e.,
choose m units from each cluster).

5. Units at each stage are selected with SRSWOR.

Note: Cluster sampling is a special case of two stage sampling in the sense that, from a population of N clusters of equal size, taking m = M (i.e. completely enumerating each of the n selected clusters) gives cluster sampling. If further M = m = 1, we get SRSWOR, and if n = N we have the case of stratified sampling.

10.2 Notations

Let:
yij = value of the characteristic under study for the jth second stage unit of the ith first stage unit, i = 1, 2, ..., N, j = 1, 2, ..., M;
Ȳi = (1/M) Σ_{j=1}^{M} yij = mean per second stage unit of the ith first stage unit in the population;
Ȳ = [1/(MN)] Σ_{i=1}^{N} Σ_{j=1}^{M} yij = (1/N) Σ_{i=1}^{N} Ȳi = ȲMN = mean per second stage unit in the population;
ȳi = (1/m) Σ_{j=1}^{m} yij = mean per second stage unit of the ith first stage unit in the sample;
ȳ = [1/(mn)] Σ_{i=1}^{n} Σ_{j=1}^{m} yij = (1/n) Σ_{i=1}^{n} ȳi = ȳmn = mean per second stage unit in the sample.

Note: The expectations under two stage sampling scheme depend on the stages. For
example, the expectation at second stage unit will be dependent on first stage unit in the
sense that second stage unit will be in the sample provided it was selected in the first
stage.
To calculate the average

1. First average the estimator over all the second stage selections that can be drawn
from a fixed set of n units that the plan selects.

2. Then average over all the possible selections of n units by the plan.

In case of two stage sampling


E(θ̂) = E1[E2(θ̂)], where E(θ̂) is the average over all samples, E1 is the average over all first stage samples and E2 is the average over all possible second stage selections from a fixed set of first stage units.
In the case of three stage sampling,
E(θ̂) = E1[E2{E3(θ̂)}].
To calculate the variance, we proceed as follows. In the case of two stage sampling,
Var(θ̂) = E(θ̂ − θ)² = E1 E2(θ̂ − θ)².
Consider
E2(θ̂ − θ)² = E2(θ̂²) − 2θ E2(θ̂) + θ² = [{E2(θ̂)}² + V2(θ̂)] − 2θ E2(θ̂) + θ².
Now average over the first stage selection:
E1 E2(θ̂ − θ)² = E1[{E2(θ̂)}²] + E1[V2(θ̂)] − 2θ E1[E2(θ̂)] + θ²
= E1[{E2(θ̂) − θ}²] + E1[V2(θ̂)],
and since θ = E1[E2(θ̂)], the first term is V1[E2(θ̂)]. Hence
Var(θ̂) = V1[E2(θ̂)] + E1[V2(θ̂)].

10.3 Estimation of population mean

Consider ȳ = ȳmn as an estimator of the population mean Ȳ.
Bias:
E(ȳ) = E1[E2(ȳmn)]
= E1[(1/n) Σ_{i=1}^{n} E2(ȳi | i)]   (the second stage is dependent on the first stage)
= E1[(1/n) Σ_{i=1}^{n} Ȳi]   (as ȳi is unbiased for Ȳi because SRSWOR is used at the second stage)
= (1/N) Σ_{i=1}^{N} Ȳi
= Ȳ.
Thus ȳmn is an unbiased estimator of the population mean.
Variance:
Var(ȳ) = E1[V2(ȳ | i)] + V1[E2(ȳ | i)]


= E1[V2((1/n) Σ_{i=1}^{n} ȳi | i)] + V1[E2((1/n) Σ_{i=1}^{n} ȳi | i)]
= E1[(1/n²) Σ_{i=1}^{n} V2(ȳi | i)] + V1[(1/n) Σ_{i=1}^{n} E2(ȳi | i)]
= E1[(1/n²) Σ_{i=1}^{n} (1/m − 1/M) Si²] + V1[(1/n) Σ_{i=1}^{n} Ȳi]
= (1/n)(1/m − 1/M) E1(Si²) + V1(ȳc)   (where ȳc is based on cluster means, as in cluster sampling)
= (1/n)(1/m − 1/M) S̄w² + [(N − n)/(Nn)] Sb²
= (1/n)(1/m − 1/M) S̄w² + (1/n − 1/N) Sb²,
where
S̄w² = (1/N) Σ_{i=1}^{N} Si² = [1/(N(M − 1))] Σ_{i=1}^{N} Σ_{j=1}^{M} (yij − Ȳi)² and Sb² = [1/(N − 1)] Σ_{i=1}^{N} (Ȳi − Ȳ)².
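
A small numerical sketch of the two-stage estimator and of the variance estimate obtained by replacing S̄w² and Sb² with s̄w² and sb² (the estimator stated in the exercise of Section 10.5). The cluster sizes and values below are invented.

# Two-stage sampling with equal first stage units: estimate of the mean and its variance.
# N first stage units of M second stage units each; n sampled at stage 1, m at stage 2.
N, M = 50, 20
data = {                      # observed values for the m sampled second stage units of each
    0: [5.1, 4.8, 5.5, 5.0],  # sampled first stage unit (all numbers invented)
    1: [6.2, 6.0, 5.8, 6.4],
    2: [4.4, 4.9, 4.6, 4.2],
}
n = len(data)
m = len(next(iter(data.values())))

ybar_i = {i: sum(v) / m for i, v in data.items()}
ybar = sum(ybar_i.values()) / n                       # ybar_mn, unbiased for Ybar

s2_i = {i: sum((x - ybar_i[i]) ** 2 for x in v) / (m - 1) for i, v in data.items()}
sw2 = sum(s2_i.values()) / n                          # mean within-unit variance
sb2 = sum((yb - ybar) ** 2 for yb in ybar_i.values()) / (n - 1)

# Var-hat(ybar) = (1/N)(1/m - 1/M) sw2 + (1/n - 1/N) sb2
var_hat = (1 / N) * (1 / m - 1 / M) * sw2 + (1 / n - 1 / N) * sb2
print(round(ybar, 3), round(var_hat, 5))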

10.4 Estimate of variance

Theorem 10.1. An unbiased estimator of the variance of ȳ can be obtained by replacing Sb² and S̄w² by their unbiased estimators in the expression for the variance of ȳ.

Proof: Consider, as an estimator of S̄w² = (1/N) Σ_{i=1}^{N} Si² with Si² = [1/(M − 1)] Σ_{j=1}^{M} (yij − Ȳi)², the statistic s̄w² = (1/n) Σ_{i=1}^{n} si², where si² = [1/(m − 1)] Σ_{j=1}^{m} (yij − ȳi)². Then
E(s̄w²) = E1[E2(s̄w² | i)]
= E1[(1/n) Σ_{i=1}^{n} E2(si² | i)]
= E1[(1/n) Σ_{i=1}^{n} Si²]   (as SRSWOR is used at the second stage)
= (1/n) Σ_{i=1}^{n} E1(Si²)
= (1/n) · n · (1/N) Σ_{i=1}^{N} Si²
= (1/N) Σ_{i=1}^{N} Si² = S̄w²,
so s̄w² is unbiased for S̄w².

10.5 Exercises
1. Consider sb² = [1/(n − 1)] Σ_{i=1}^{n} (ȳi − ȳ)² as an estimator of Sb² = [1/(N − 1)] Σ_{i=1}^{N} (Ȳi − Ȳ)². Show that
Var̂(ȳ) = (1/N)(1/m − 1/M) s̄w² + (1/n − 1/N) sb².


11 Sources of Errors in Surveys

11.1 Introduction

11.2 Non-Sampling Errors

It is a general assumption in the sampling theory that the true value of each unit in the
population can be obtained and tabulated without any errors. In practice, this assumption
may be violated due to several reasons and practical constraints. This results in errors
in the observations as well as in the tabulation. Such errors which are due to the factors
other than sampling are called non-sampling errors.
The non-sampling errors are unavoidable in census and surveys. The data collected
by complete enumeration in census is free from sampling error but would not remain free
from non-sampling errors. The data collected through sample surveys can have both –
sampling errors as well as non-sampling errors. The non-sampling errors arise because
of the factors other than the inductive process of inferring about the population from a
sample. In general, the sampling errors decrease as the sample size increases whereas non-
sampling error increases as the sample size increases. In some situations, the non-sampling
errors may be large and deserve greater attention than the sampling error.
In any survey, it is assumed that the value of the characteristic to be measured has
been defined precisely for every population unit. Such a value exists and is unique.
This is called the true value of the characteristic for the population value. In practical
applications, data collected on the selected units are called survey values and they differ
from the true values. Such difference between the true and observed values is termed
as the observational error or response error. Such an error arises mainly from the
lack of precision in measurement techniques and variability in the performance of the
investigators.

11.2.1 Sources of non-sampling errors:

Non sampling errors can occur at every stage of planning and execution of survey or census.
It occurs at planning stage, field work stage as well as at tabulation and computation stage.


The main sources of the nonsampling errors are

1. lack of proper specification of the domain of study and scope of investigation,

2. incomplete coverage of the population or sample,

3. faulty definition,

4. defective methods of data collection and tabulation errors.

More specifically, one or more of the following reasons may give rise to nonsampling errors
or indicate its presence:

1. The data specification may be inadequate and inconsistent with the objectives of
the survey or census.

2. Due to imprecise definition of the boundaries of area units, incomplete or wrong identification of units, faulty methods of enumeration, etc., the data may be duplicated or omitted.

3. The methods of interview and observation collection may be inaccurate or inappropriate.

4. The questionnaire, definitions and instructions may be ambiguous.

5. The investigators may be inexperienced or not trained properly.

6. Recall errors may make it difficult for respondents to report the true data.

7. The scrutiny of the data may not be adequate.

8. The coding, tabulation etc. of the data may be erroneous.

9. There can be errors in presenting and printing the tabulated results, graphs etc.

10. In a sample survey, the non-sampling errors arise due to defective frames and faulty
selection of sampling units.


These sources are not exhaustive but indicate the main possible sources of error. Non-sampling errors may be broadly classified into three categories.
(a) Specification errors: These errors occur at planning stage due to various reasons,
e.g., inadequate and inconsistent specification of data with respect to the objectives of sur-
veys/ census, omission or duplication of units due to imprecise definitions, faulty method
of enumeration/interview/ambiguous schedules etc.
(b) Ascertainment errors: These errors occur at the field stage due to various reasons, e.g., lack of trained and experienced investigators, recall errors and other types of errors in data collection, lack of adequate inspection and lack of supervision of primary staff, etc.
(c) Tabulation errors: These errors occur at tabulation stage due to various reasons,
e.g., inadequate scrutiny of data, errors in processing the data, errors in publishing the
tabulated results, graphs etc.
Ascertainment errors may be further sub-divided into
(i) Coverage errors, owing to over-enumeration or under-enumeration of the population or the sample, resulting from duplication or omission of units and from non-response.
(ii) Content errors, relating to wrong entries due to errors on the part of investigators and respondents.
The same division can be made for tabulation errors as well. There is a possibility of missing data or repetition of data at the tabulation stage, which gives rise to coverage errors, and also of errors in coding, calculation, etc., which give rise to content errors.
Treatment of non-sampling errors: Some conceptual background is needed for the mathematical treatment of non-sampling errors.
Total error: The difference between the sample survey estimate and the true value of the parameter being estimated is termed the total error.

11.3 Sampling errors

If complete accuracy could be ensured in procedures such as the determination, identification and observation of sample units and the tabulation of the collected data, then the total error would consist only of the error due to sampling, termed the sampling error. A measure of the sampling error is the mean squared error (MSE).

The MSE is the expected squared difference between the estimator and the true value, and it has two components (written out symbolically after the list below):

1. square of sampling bias.

2. sampling variance.
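In symbols (a standard decomposition, stated here for completeness), writing $\theta$ for the true value and $\hat{\theta}$ for the estimator,
$$MSE\left(\hat{\theta}\right)=E\left(\hat{\theta}-\theta\right)^2=\left[E\left(\hat{\theta}\right)-\theta\right]^2+E\left[\hat{\theta}-E\left(\hat{\theta}\right)\right]^2,$$
where the first term is the square of the sampling bias and the second is the sampling variance.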

If the results are also subject to non-sampling errors, then the total error has both sampling and non-sampling components.
Total bias: The difference between the expected value and the true value of the estimator is termed the total bias. It consists of the sampling bias and the non-sampling bias.
Non-sampling bias: For the sake of simplicity, assume that randomization involves the following two steps:
(i) selecting the sample of units, and
(ii) selecting the survey personnel.


12 Organisation of National Surveys, and the Kenya National Bureau of Statistics (KNBS)

The Statistics Act 2006 specifically mandates KNBS to:

1. Act as the principal agency of the government for collecting, analysing and disseminating statistical data in Kenya;

2. Act as custodian of official statistics;

3. Conduct the Population and Housing Census every ten years, and such other censuses and surveys as the Board may determine;

4. Maintain a comprehensive and reliable national socio-economic database;

5. Establish standards and promote the use of best practices and methods in the production and dissemination of statistical information across the National Statistical System (NSS); and

6. Plan, authorise, coordinate and supervise all official statistical programmes undertaken within the national statistical system.

NOTE: Check www.knbs.org for more information.


13 Past Examination Papers

13.1 Paper 1

W1-2-60-1-6

TAITA TAVETA UNIVERSITY COLLEGE

University Examinations 2014/2015

FOURTH YEAR SECOND SEMESTER EXAMINATION FOR THE


DEGREE OF BACHELOR OF SCIENCE IN MATHEMATICS AND
COMPUTER SCIENCE

STA 2402 Design and Analysis of Sample Surveys

DATE: August 2015 TIME: 2 Hours


—————————————————————————————————————
Instructions to Candidates: Answer Question One and any other Two Questions
QUESTION ONE (30 MARKS)
(a) Distinguish the following terms as used in the design and analysis of sample surveys
(i) Theoretical population and study population [1 mark]
(ii) Sampling frame and sampling scheme [1 mark]
(b)
(i) Clearly explain the systematic random sampling technique. [3 marks]
(ii) Show that the variance of the mean of a systematic sample is given by
$$Var\left(\bar{y}_{sy}\right)=\frac{N-1}{N}S^2-\frac{n-1}{n}S_{wsy}^2,$$
where $\frac{N-1}{N}S^2$ is the variation in the population as a whole while $\frac{n-1}{n}S_{wsy}^2$ is the pooled within variation of the $k$ systematic samples, with $N=nk$. [4 marks]


(d) In simple random sampling without replacement, verify that $\bar{y}$ is an unbiased estimator of $\bar{Y}$ and that its sampling variance is given by
$$Var\left(\bar{y}\right)=\frac{N-n}{Nn}S^2, \quad\text{where } S^2=\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i-\bar{Y}\right)^2.$$ [6 marks]

(e) Consider a population consisting of 430 units. By complete enumeration of the population it was found that $\bar{Y}=19$ and $S^2=85.6$, these being the true population values. With simple random sampling, how many units must be taken to estimate $\bar{Y}$ to within 10% of $\bar{Y}$, apart from a chance of 1 in 20? [5 marks]
(f) The following table summarises data from a complete census of all 440 villages in a sub-division in Kenya. The villages are stratified by the size of their agricultural area under maize production into strata as shown below.
Stratum Size of Villages(acres) Ni Ȳi Si

1 0-500 163 112.1 56.2

2 501-1500 199 276.7 116.2

3 1502-2500 53 558.1 186.0

4 1502-2500 25 960.1 361.3


(i) How would you draw a sample of 34 villages if the villages are selected according to stratified random sampling with proportional allocation and with optimal allocation? [4 marks]
(ii) Using the data, show that the variance of the estimate of the mean under optimal allocation is less than under proportional allocation. [6 marks]
QUESTION TWO (20 MARKS).
(a) A random sample from a normal population yields the following values for the characteristic Y: 22, 19, 21, 16, 21, 27, 24, 18, 18, 20, 21, 22, 19.
Obtain the 95% confidence interval for the population mean $\bar{Y}$ if the population is known to have $S^2=16$ and size $N=210$. [6 marks]
(b)
(i) Define ratio and regression estimation [2 marks]
(ii) In a study to estimate the total sugar content of a truckload of oranges, a random sample of n = 10 oranges was juiced and weighed, as shown in the table below. The total weight of all the oranges, obtained by weighing the truck first loaded and then unloaded, was found to be 1800 pounds. Estimate the total sugar content of the oranges and place a bound on the error of estimation.
Orange   Sugar content (yi)   Weight of orange (xi)   yi xi

1 0.021 0.40 0.0084

2 0.030 0.48 0.0144

3 0.025 0.43 0.01075

4 0.022 0.42 0.00924

5 0.030 0.50 0.0150

6 0.027 0.46 0.01242

7 0.019 0.39 0.00741

8 0.021 0.41 0.00861

9 0.023 0.42 0.00966

10 0.025 0.44 0.0110

TOTAL   $\sum y_i = 0.245$   $\sum x_i = 4.35$   $\sum y_i x_i = 0.10689$
(c) Show that in SRSWR$(N, n)$ the sampling variance of the mean is given by
$$Var\left(\bar{y}\right)=\frac{\sigma^2}{n}, \quad\text{where } \sigma^2=\frac{1}{N}\sum_{i=1}^{N}\left(Y_i-\bar{Y}\right)^2.$$ [5 marks]

QUESTION THREE (20 MARKS)


(a) Clearly explain the meaning of multi-stage sampling [3 marks]
(b) Suppose that a sociologist wants to estimate the total number of home schooled children in a town based on two-stage sampling. First a small pilot study is to be done. At the first stage, 4 blocks are sampled at random out of the total of 300 blocks existing in the town. At the second stage, 4 households are sampled at random out of each sampled block. (It is known that there are a total of 3950 households in the town.) The data obtained are:
Block   No. of households in block   No. of households sampled   No. of home schooled children (per sampled household)

1 8 4 1,0,0,1

2 14 4 0,3,0,0

3 9 4 1,0,2,7

4 12 4 0,0,1,5


Estimate the total number of home schooled children in the town using the simple random sampling estimation method and the ratio estimation method (assume no non-response). [8 marks]
 
(c) Prove that in probability proportional to size with replacement sampling,
$$Var\left(\hat{Y}_{pps}\right)=\frac{1}{n}\sum_{i=1}^{N}P_i\left(\frac{Y_i}{P_i}-Y\right)^2.$$ [4 marks]
(d) Consider a population of N = 10,000 sampling units from which you want to obtain a systematic random sample of size n = 1000. [2 marks]
(i) How many systematic random samples are there? Show, using a diagram, what sampling units they consist of. [2 marks]
(ii) Use this to explain the relationship between cluster sampling and systematic ran-
dom sampling [2 marks]
(iii) Building on the results in part (ii) explain why variances are difficult to calculate
for systematic random sampling [1 mark]
(iv) Suggest a modified form of systematic random sampling which solves the variance
problem in part (iii) [1 mark]
QUESTION FOUR (20 MARKS)
(a) Clearly explain cluster sampling [2 marks]
(b) A mathematics achievement test was given to 486 students prior to their entering a certain college. From these students a simple random sample of n = 10 students was selected and their progress in calculus observed. Final calculus grades were then reported, as given in the table below. The mean achievement test score $\mu_X$ is known for all 486 students taking the achievement test. Estimate $\mu_Y$ for this population. [8 marks]
Student                       1   2   3   4   5   6   7   8   9   10
Achievement test score, X    39  43  21  64  57  47  28  75  34  52
Final calculus grade, Y      65  78  52  82  92  89  73  98  56  75
(c) State and explain sources of non-sampling errors in surveys [2 marks]
(d) A nurseryman wants to estimate the average height (in inches) of 1200 seedlings in a field that is sub-divided into 50 plots that vary in size. A two-stage cluster sample design produced the following data.


Plot   Number of seedlings Mi   Number of seedlings sampled mi   Heights of seedlings sampled (inches)

1 63 6 5, 2, 4, 3, 1,

2 57 8 4, 2, 7, 2, 7,

3 30 3 3, 2, 5

4 23 2 4, 4,

TOTAL 173 17
(i) Estimate the average height of seedlings in the field and the standard error of the
estimate [5 marks]
(ii) Construct a 95% confidence interval on the population mean [3 marks]


14 References

1. Cochran, W.G. (1977). Sampling Techniques, 3rd ed. New York: Wiley.

2. Yates, F. (1981). Sampling Methods for Censuses and Surveys, 4th ed. New York.

3. Lohr, S.L. Sampling: Design and Analysis, 2nd ed. Brooks/Cole.

