STA 308 Lecture Notes
SAMPLING TECHNIQUES
delivered by
Course Description
Non-probability and probability sampling methods.
Convenience sampling, quota sampling, snowball
sampling, etc. Simple random sampling (with/without
replacement). Estimation of sample size; estimation of
population parameters e.g., total and proportion; ratio
estimators of population means, totals, etc. Stratified
random sampling - proportional and optimum allocations.
Cluster sampling, systematic sampling, multistage
sampling.
Lecture Schedules
Wednesdays: 7:30 am to 8:30 am SW 12
Wednesdays: 8:30 am to 9:30 am SW 12
Thursdays: 8:30 am to 9:30 am LT 21
COURSE OBJECTIVES
At the end of the course, the students should be able to:
distinguish between a population and a sample;
distinguish between probability (or random) and
non-probability (or non-random) sampling;
list at least 3 major probability sampling techniques;
describe simple random sampling (with or without
replacement);
describe at least 2 common methods for selecting a simple
random sample;
select all possible simple random samples of size n from a
given population;
COURSE OBJECTIVES (Continued)
At the end of the course, the students should be able to:
state at least two advantages and disadvantages of SRS;
explain the factors required in the determination of an
appropriate sample size;
estimate an appropriate sample size;
estimate population parameters using simple random
samples;
explain the justification for stratified random sampling;
select stratified random samples of size n using
proportional allocation;
COURSE OBJECTIVES (Continued)
At the end of the course, the students should be able to:
select stratified random samples of size n using optimum
allocation;
state at least two advantages and disadvantages of stratified
random sampling;
estimate population parameters using stratified random
samples;
explain the justification for systematic sampling;
state at least two advantages and disadvantages of systematic
sampling;
estimate population parameters using systematic samples;
Quiz Schedules
[20%]
PROBABILITY AND NON-PROBABILITY SAMPLING
Judgmental (or Purposive) Sampling
The approach to the problem of choosing sample involves selecting units
by considering available auxiliary information more or less subjectively
with a view to ensuring a reflection of the population in the sample. This
approach, therefore, involves selecting units that are known to be typical
with respect to certain characteristics, rather than selecting the units
randomly. The assumption is made that if the units are typical in this
respect, they will be typical with respect to the characteristics being
studied.
For example, in a household expenditure study in a given district, a
sample of households might be selected in such a manner that the
average size of the households in the sample would be the same as the
average for the district, and the distribution of the sample households
among the different types would parallel the distribution throughout the
district.
2. As the most appropriate people/objects for the study would have been
selected, this process becomes a lot less time consuming.
3. With fewer time constraints and a more accurate subject, the costs for
carrying out the sampling project are greatly reduced.
5. If you are looking for a very rare or much sought after group of
people for a particular study, using purposive sampling may usually
be the only way you can track them down.
This happens because the sampler is unlikely to select units which are
too small or too large, although such units exist in the population.
Because the probability of selection of the units is unknown and the
selection may be biased, it is not possible to get strictly valid
estimates of the population values and their sampling errors.
Disadvantages
1. The vast array of inferential statistical procedures is not valid here.
Inferential statistics lets you generalize from a particular sample to a
larger population and make statements about how sure you are that
you are right, or about how accurate you are.
Convenience Sampling
The main stimulus is administrative convenience and thus, a sample is
chosen with the sole concern for its ease of access. To obtain
information quickly and inexpensively, a convenience sample can be
employed. The procedure is simply to contact sampling units that are
convenient: shoppers at a market on a particular day, a few friends and
neighbours, or football fans after a game. A convenience sample is often
used to pre-test a questionnaire.
Advantages
Readily Available: Since most convenience samples are collected with
the populations on hand, data are readily available for the researcher to
collect. They do not typically have to travel great distances to collect the
data, but simply pull from whatever environment is nearby. Having a
sample group readily available is important for meeting quotas quickly,
and allows for the researcher to even do multiple studies in an
expeditious fashion. If a researcher has a deadline to meet, he or she will
turn to this type of sampling instead of any other.
Cost Effective: One of the most important aspects of this method is its
cost effectiveness. There is minimal overhead because there is no
elaborate setup, and researchers can pull from local population groups.
This method allows for funds to be distributed to other aspects of the
project. Often this method of sampling is used to gain funding for a
larger, more thorough research project.
In this instance, funds are not available for a more complete survey, so a
quick selection of the population will be used to demonstrate a need for
the completed project.
Disadvantages
Biased Results: One of the major drawbacks to this sampling method is
the opportunity for bias to cloud the results of the survey. For example, a
researcher looking to predict who will win the next election may only
survey an area close to them, and ignore the fact that the region is located
in northern Ghana and therefore will likely have a more NDC slant.
Personal prejudices may also creep in. For example, a surveyor may not
distribute questionnaires to certain ethnic groups. These factors often
lead to skewed data collection, rendering the data useless for tracing
trends throughout the entire population.
Misrepresentation of Data: Occasionally researchers will ignore the fact
that they did not complete a random survey, and will use the information
to prove facts that are not necessarily true. One example of this
misrepresentation is if certain magazine subscribers are polled on
political opinion, when the magazine appeals to a more liberal segment of
the population. Because a national magazine subscription will have a large
pool of subscribers and the people surveyed could be selected at random,
researchers might be tempted to use the results of this poll as a
representation of the entire population’s view. Nevertheless, because the
magazine appeals to only a select group of people, it is still considered a
convenience sample and will have a liberal slant.
Incomplete Conclusions: Because of the flaws found in this form of
sampling, scientists cannot draw concrete conclusions from their data. In
order to be a truly accurate study, any statistical data must include all
facets of the population. By missing portions of the general population,
researchers are only able to form incomplete conclusions. One example
would be polling only knitting clubs, which are typically frequented by
women. Researchers would not be able to draw conclusions on what
most men thought, as they would be underrepresented in this arena, and
any data collected would not allow them to reach accurate conclusions.
Quota Sampling
A wide variety of procedures go under the name of quota sampling, but
what distinguishes them fundamentally from probability sampling is
that, once the general breakdown of the sample is decided, and the quota
assignments are allocated to interviewers, the choice of the actual sample
units to fit into this framework is left to the interviewers. Quota
sampling is, therefore, a kind of stratified sampling in which the
selection within category is non-random. It is this non-random property
that constitutes its greatest weakness. Since speed is an important
consideration in surveys of public opinions, it has been common to use
the method of quota sampling for the selection of the sample.
Quota sampling starts with the premise that a sample should be well
spread geographically over the population and that it should contain the
same fraction of individuals having certain characteristics as does the
population.
The characteristics usually taken into account are sex, age, occupation,
economic level, and size of place, in addition to geographical control.
Data taken from a recent census are used as the standard. In a sense the
population is divided up into a number of strata whose weights are
obtained from the census or a large-scale survey. Interviewers are then
assigned quotas for the number of interviews to be taken from each
stratum.
Thus, an interviewer will be asked to select so many males and so many
females, so many young and so many old persons, and similarly for the
other characteristics used as control. The interviewer is free to choose
his sample as he likes provided the quota requirements are fulfilled.
The essential difference between quota sampling and stratified random
sampling is that in the former case, the selection of the sample within the
stratum is not strictly random. The interviewer may decide to omit
certain parts of the area entirely if it suits his convenience. He may not
like to approach certain kinds of people and so on. In general,
statisticians have criticized the method for this theoretical weakness,
while market and opinion researchers have defended it for its cheapness
and administrative convenience.
Advantages
1. It saves money when time is an issue
2. It is quick and easy to arrange
Disadvantages
1. The sample may not be representative of the population.
2. Because the sample is non-random, it is impossible to assess the
possible sampling error.
3. It limits the decisions of the researcher as it does not allow much
variety.
Snowball Sampling
A snowball design is a form of judgmental sampling that is very
appropriate when it is necessary to reach a small specialized (typical)
population. Suppose a long range planning group wanted to sample
people who were very knowledgeable about a new specialized
technology, such as the use of solar energy in manufacturing industries.
Even specialized magazines would have a small percentage of readers in
this category. Furthermore, the target group may be employed by diverse
organizations, like the government, universities, research institutions, and
industrial firms.
Under a snowball design each respondent is asked to identify one or
more potential respondents in that particular field.
Low Cost: The cost of locating samples and researching is not high.
The researcher does not spend time and money trying to find the
sample subjects; rather, they are brought to him/her.
Disadvantages
Sampling Bias: Snowball sampling is based on researching one subject
and using the subject to recruit more members for sampling. These
people are known to the initial subject, who is more than likely to
nominate people they know very well. These people will more often than
not share similar traits and behavior characteristics. What the sampler
may end up with is a small subgroup of the entire population. Since this
sampling method is used predominantly in race sampling, one subject
can very well nominate an entire family, close friends and other
acquaintances, all of whom may exhibit the same traits and
characteristics, leading to sampling bias.
Lack of Control over the Sampling Method: In the snowball
sampling method, the researcher actually has very little control over the
sampling method. The type of subjects that the researcher can secure for
sampling is mainly dependent on the original subjects that were
researched. After the first set of original subjects is researched, the
researcher may lose control over the sampling method. This is because
the original subjects are tasked with adding to the sampling pool by
nominating people they know.
Another problem is that the representativeness of the sample is not
guaranteed, because the researcher has no idea of the true distribution
of the population.
Probability (or Random) Sampling
Random samples are defined as those samples in which every
element in the population has a known, non-zero probability of
selection. It is not necessary that the probabilities of selection
be equal for all elements. The advantage of random sampling
is that if it is done properly, it provides a bias-free method of
selecting sample units and permits the measurement of
sampling error in addition to providing valid estimates of the
population values.
Examples of random sampling methods are simple random
sampling, stratified random sampling, systematic sampling
and cluster sampling.
Simple Random Sampling
which every unit has an equal chance of being selected and each distinct
possible sample of size n has the same chance of being chosen. Sampling
is done in one stage with the units of the sample selected, unit by unit,
The above procedure of selecting a sample requires elimination of
random numbers that are either larger than N or those that are repeated
The non-existence of appropriate list of sampling units is one of the
major reasons why simple random samples are not widely used in
practice. Also in situations where the sampling units vary greatly in size,
size will result in sharp increase in sampling variance. Such could be the
case when stores, for example, are the sampling units – the large stores
want to use a sampling scheme that will give the large stores a
The samples can be drawn in two possible ways:
The sampling units are chosen without replacement when the units,
once chosen, are not placed back in the population.
The sampling units are chosen with replacement when the selected
units are placed back in the population.
Simple random sampling without replacement:
SRS without replacement is a method of selection of n units out of the N
units one by one such that at any stage of selection, any one of the
remaining units has the same chance of being selected; the probability
that a given unit is selected at any particular draw is 1/N.
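The selection procedure referred to in these notes, in which random numbers that are larger than N or repeated are eliminated, can be sketched as follows. This is only an illustration: the function name `srs_without_replacement`, the population size N = 50, the sample size n = 5 and the seed are all assumptions, and two-digit random numbers (00 to 99) stand in for a random number table.

```python
import random

def srs_without_replacement(N, n, rng):
    """Return n distinct unit labels from 1..N (assumes n <= N <= 99)."""
    chosen = []
    while len(chosen) < n:
        r = rng.randint(0, 99)          # two-digit random number, 00 to 99
        if r == 0 or r > N:             # no unit carries this label: discard
            continue
        if r in chosen:                 # unit already selected: discard
            continue
        chosen.append(r)
    return chosen

rng = random.Random(308)                # fixed seed, for reproducibility only
sample = srs_without_replacement(N=50, n=5, rng=rng)
print(sample)

# Python's standard library implements the same design directly.
print(rng.sample(range(1, 51), 5))
```

Discarding a repeated number is what makes the draw "without replacement"; keeping repeats instead would give sampling with replacement.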
Symbols and Notations
Capital letters are used to denote population values and lower
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$$
is the variance of $Y_i$ from the population mean $\bar{Y}$,
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
is the unbiased estimator of the population variance.
Sampling without replacement
Properties of Estimates
4
Suppose that we wish to select two units without replacement. Then the
first unit could be selected in five different ways, and the second in four
different ways. This means that there are $5 \times 4 = 20$ possible ways of
selecting the first two units. Note, however, that these units are ordered,
for example $(U_2, U_5)$ and $(U_5, U_2)$. But the two samples are the same.
For instance, when $N = 5$ and $n = 2$, the number of possible
samples is
$${}^{5}C_{2} = \frac{5!}{2!\,(5-2)!} = \frac{5 \times 4 \times 3!}{2 \times 1 \times 3!} = 10.$$
The set of all 10 equally likely possible values of the mean is called the
sampling distribution of the mean. It simply describes all the possible
values of the sample mean that could result from the sampling scheme
being used.
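The sampling distribution of the mean for N = 5 and n = 2 can be built by listing every possible sample. The sketch below does this under an assumption: the five population values are not given in this part of the notes, so the values used here were chosen only because they are consistent with the figures quoted in the notes (population mean 15.4, smallest sample mean 12.0, largest 18.5).

```python
from itertools import combinations
from fractions import Fraction

# Hypothetical population values (an assumption, see the paragraph above).
population = {"U1": 16, "U2": 13, "U3": 17, "U4": 11, "U5": 20}
n = 2

# All unordered samples of size n, and the mean of each one.
samples = list(combinations(population, n))
means = [Fraction(sum(population[u] for u in s), n) for s in samples]

for number, (s, m) in enumerate(zip(samples, means), start=1):
    print(number, s, float(m))

pop_mean = Fraction(sum(population.values()), len(population))
mean_of_means = sum(means) / len(means)
print(len(samples), float(pop_mean), float(mean_of_means))
```

Exact fractions are used so that the comparison between the mean of the sample means and the population mean is not blurred by rounding.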
That is, it is simply a frequency distribution of all possible values of the
mean under the same sampling scheme. Since the sample mean has a
sampling distribution, we say that it is a random variable. The study of a
sampling distribution tells how a sample value may deviate from the true
value. So if the sampling distribution is very much spread out, with a
large variance, the sampling procedure is more apt to be misleading than
it would be if the sampling distribution has a small variance.
Sampling distribution of $\bar{y}$, $\hat{\sigma}^2$, and $s^2$.
Sample number   Units in sample   Values in sample   $\bar{y}$   $\hat{\sigma}^2$   $s^2$
In practice, only one sample will be selected and observed, and hence any
of the means in the table above may turn out to be the estimate of the
population mean. The smallest estimate of the population mean is 12.0, as
found in sample number 6, which will mislead us by the amount
15.4 − 12.0 = 3.4. The largest estimate, obtained from sample
number 9, is 18.5 and it differs from the true mean by the amount
15.4 − 18.5 = −3.1. However, the mean of the sample means, written
$E(\bar{y})$, will always be exactly equal to the population mean, as
shown in the table above. Each of the sample means is an estimate of the
population mean.
The expected value of the sample mean is given by
$$E(\bar{y}) = \sum_{i} \bar{y}_i\, p_i$$
where $\bar{y}_i$ is the mean obtained from the $i$th sample and $p_i$ is the
probability of obtaining the $i$th sample.
average of all the estimates will equal the true population mean.
The variance of the sample mean
and taken from the same population will vary among themselves and will
only by accident coincide with the population value. This is because only
Since the sampling distribution of the mean is a set of all the possible
means that could result from the sampling scheme being used, its
variance is
$$\operatorname{Var}(\bar{y}) = \frac{1}{{}^{N}C_{n}} \sum_{i=1}^{{}^{N}C_{n}} \left(\bar{y}_i - \bar{Y}\right)^2.$$
For our example, the variance of the sample mean is obtained as
$$\operatorname{Var}(\bar{y}) = \frac{1}{10}\left[(14.5 - 15.4)^2 + (16.5 - 15.4)^2 + \cdots + (15.5 - 15.4)^2\right] = 3.69.$$
mean.
Theorem
$$\operatorname{Var}(\bar{y}) = \frac{N-n}{N}\,\frac{S^2}{n} = (1-f)\,\frac{S^2}{n}$$
is used.
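As a rough check, the variance of the sample mean computed from its definition can be compared with the value given by the theorem. The population values below are an assumption (they are not listed in this part of the notes) chosen to be consistent with the quoted mean 15.4 and variance 3.69; the names `var_def` and `var_thm` are illustrative.

```python
from itertools import combinations
from fractions import Fraction

values = [16, 13, 17, 11, 20]      # hypothetical population (assumption)
N, n = len(values), 2
Y_bar = Fraction(sum(values), N)

# Definition: average squared deviation of all possible sample means.
means = [Fraction(sum(s), n) for s in combinations(values, n)]
var_def = sum((m - Y_bar) ** 2 for m in means) / len(means)

# Theorem: Var(y_bar) = (1 - f) * S^2 / n, with S^2 using divisor N - 1.
S2 = sum((y - Y_bar) ** 2 for y in values) / (N - 1)
f = Fraction(n, N)
var_thm = (1 - f) * S2 / n

print(float(var_def), float(var_thm))
```

The theorem needs only $S^2$, N and n, so in practice there is no need to list all possible samples.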
Uses of variance of the sampling mean
Sampling with replacement
since no unit is repeated and the order of selection is not taken into
consideration.
In SRS with replacement, however, a unit can appear more than once in
a sample and the order of selection is taken into consideration. There are
$N^n$ samples and the probability of selecting any one is $1/N^n$.
Theorem
$$\operatorname{Var}(\bar{y}) = \frac{N-1}{N}\,\frac{S^2}{n}$$
Proof
Y
Confidence interval for the sample mean
with the population mean. The variation of the mean among the samples
be sure of some error. What we do not know is how large or how small
judgment on the extent of the error. The likelihood that the error will be
A confidence interval is generally defined by an upper confidence limit
where
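$\bar{y}$ is the sample mean;
$s_{\bar{y}}$ is the standard error of the sample mean;
$z_{\alpha/2}$ is the value from the standard normal distribution that indicates
interval is
$$\bar{y} \pm z_{\alpha/2}\,\frac{\hat{\sigma}}{\sqrt{n}};$$
and if sampling is done without replacement, the interval is
$$\bar{y} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}\,\sqrt{1-f}.$$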
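The interval for sampling without replacement can be sketched in a few lines. The function name `mean_confidence_interval`, the sample values and the population size N = 50 are hypothetical; the sample standard deviation `s` uses the divisor n − 1, and the normal approximation is assumed to hold.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, N, confidence=0.95):
    """Interval for the population mean under SRS without replacement."""
    n = len(sample)
    y_bar = mean(sample)
    s = stdev(sample)                                   # divisor n - 1
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    f = n / N                                           # sampling fraction
    half_width = z * (s / sqrt(n)) * sqrt(1 - f)
    return y_bar - half_width, y_bar + half_width

# Hypothetical sample of n = 5 units from a population of N = 50 units.
lower, upper = mean_confidence_interval([12, 15, 11, 18, 14], N=50)
print(round(lower, 2), round(upper, 2))
```

The factor $\sqrt{1-f}$ narrows the interval as the sample takes in a larger share of the population.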
Estimation of Proportion of Units
of units in the population that fall into a certain class or possess certain
may classify the N units in the population into two groups as follows:
Group 1: Units which possess the attributes
$$y_i = \begin{cases} 1, & \text{if } U_i \text{ is in Group 1} \\ 0, & \text{if } U_i \text{ is in Group 2} \end{cases}$$
Then, it can be seen that the population total is
$$Y = \sum_{i=1}^{N} y_i = A$$
earlier.
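Since $y_i$ takes only the values 1 or 0, $y_i^2$ will also take only the values
1 or 0, and hence
$$\sum_{i=1}^{N} y_i^2 = A = NP.$$
Subsequently, we have
$$\begin{aligned}
S^2 &= \frac{1}{N-1}\sum_{i=1}^{N}\left(y_i - \bar{Y}\right)^2 \\
&= \frac{1}{N-1}\sum_{i=1}^{N}\left(y_i^2 - 2y_i\bar{Y} + \bar{Y}^2\right) \\
&= \frac{1}{N-1}\left(\sum_{i=1}^{N} y_i^2 - N\bar{Y}^2\right) \\
&= \frac{1}{N-1}\left(NP - NP^2\right) \\
&= \frac{NP(1-P)}{N-1} \quad\text{or}\quad \frac{NPQ}{N-1}, \quad\text{where } Q = 1 - P.
\end{aligned}$$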
If, in a simple random sample of n units, a units possess the attribute, then the
sample mean is
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{a}{n} = p$$
and since the sample mean is an unbiased estimator of the population
It follows that the variance of p is given by
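$$\operatorname{Var}(p) = (1-f)\,\frac{NPQ}{n(N-1)} = \frac{N-n}{N-1}\,\frac{PQ}{n}.$$
We have already seen that $s^2$ is an unbiased estimator of $S^2$; it
follows that an unbiased estimator of $\operatorname{Var}(p)$ has the form
$$\operatorname{var}(p) = (1-f)\,\frac{s^2}{n}.$$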
But
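$$\begin{aligned}
s^2 &= \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right) \\
&= \frac{1}{n-1}\left(np - np^2\right) \\
&= \frac{np(1-p)}{n-1} \quad\text{or}\quad \frac{npq}{n-1}, \quad\text{where } q = 1 - p.
\end{aligned}$$
Hence, an unbiased estimator of the variance of p is
$$\operatorname{var}(p) = (1-f)\,\frac{pq}{n-1}.$$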
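A short numerical illustration of the estimator $p = a/n$ and of $\operatorname{var}(p) = (1-f)\,pq/(n-1)$ is given below. The figures (30 units with the attribute in a sample of 100 drawn from 1000) and the function name `estimate_proportion` are hypothetical.

```python
def estimate_proportion(a, n, N):
    """Return p and the unbiased estimate of Var(p) under SRS without replacement."""
    p = a / n                            # sample proportion
    q = 1 - p
    f = n / N                            # sampling fraction
    var_p = (1 - f) * p * q / (n - 1)
    return p, var_p

# Hypothetical survey: 30 of n = 100 sampled units have the attribute, N = 1000.
p, var_p = estimate_proportion(a=30, n=100, N=1000)
print(p, var_p, var_p ** 0.5)            # estimate, its variance, its standard error
```

Note the divisor n − 1 rather than n; it comes from $s^2 = npq/(n-1)$ for 0/1 data.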
Remarks:
Confidence Limits for Proportions
given by
$$p \pm z_{\alpha/2}\, s_p$$
where
$z_{\alpha/2}$ is the percentage point in the normal distribution beyond which
$100(\alpha/2)\%$ of the values lie; and
$s_p = \sqrt{pq/(n-1)}$ in the case of sampling with replacement, and
Determination of Sample Size
The size of the sample to be used in every survey has to be determined
beforehand. Usually, one of the first questions answered in a new survey
design is “How large should the sample be?” This important question has
to be addressed because the decision on sample size is directly related to
the cost of the survey. It is a difficult question to answer precisely since it
is related to a clear understanding of the concept of sampling
distribution. There are several approaches to determine the sample size,
the easiest being the empirical approach – to find out what sample sizes
have been used by others in similar studies.
The Confidence Interval Approach
This is half of the total length of the interval. For a given reliability
more accurate information and the cost of an increased sample size. The
an estimate is to be. The more precise an estimate is, the smaller the
In fixing the percentage confidence level for asserting that the margins
implies a 5% probability that the error will exceed one or the other
margin. This 5% divides into 2.5% probability of going below the lower
If the investigator is interested in one margin only, the requirement of
Suppose that we want an interval that extends d units on either side of the
finite population correction factor, then the size of the margin of error
$$n = \frac{z_{\alpha/2}^2\,\sigma^2}{d^2} \qquad (1)$$
When sampling is without replacement, and the sampling fraction $n/N$
is not negligible, that is, if the calculated value of n is found to be more
sample size is
girls to determine their average protein intake. Let us assume that the
nutritionist would like an interval about 10 units wide. That is, the
estimate should be within 5 units of the true value in either direction. Let
us also assume that a confidence interval of 95% is decided on, and that
from past experience, the nutritionist feels that the population standard
Example 2
takes to drill three holes in a certain metal clamp. How large a sample
will be needed to be 95% confident that the sample mean will be within
Solution
We have
$$n = (1.96)^2\,\frac{40^2}{15^2} = 27.32$$
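The calculation above can be wrapped in a small function. This is a sketch only: the name `sample_size_for_mean` is hypothetical, the exact normal quantile is used in place of the rounded 1.96, and the finite population adjustment $n/(1 + (n-1)/N)$ is the one these notes state for proportions, applied here to means as an assumption. The population size N = 200 is also hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, d, confidence=0.95, N=None):
    """Sample size so that the mean is within d of the true value."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    n = (z * sigma / d) ** 2                             # Equation (1)
    if N is not None:                                    # finite population adjustment
        n = n / (1 + (n - 1) / N)
    return n

n_needed = sample_size_for_mean(sigma=40, d=15)
print(n_needed, ceil(n_needed))          # always round the result up

# With a hypothetical finite population of N = 200 units.
print(ceil(sample_size_for_mean(sigma=40, d=15, N=200)))
```

Rounding up rather than to the nearest integer keeps the margin of error at or below d.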
Determination of Sample Size for Proportions
The procedure for determining the appropriate sample size for estimating
proportions is similar to the one outlined above. We have seen that the
without replacement is
$$\operatorname{Var}(p) = \frac{PQ}{n} \quad\text{or}\quad \operatorname{Var}(p) = \frac{P(1-P)}{n}$$
where P is the population proportion and n is the sample size.
Using this information in Equation (1) the required sample size n is
given by
$$n = \frac{z_{\alpha/2}^2\,P(1-P)}{d^2} \qquad (2)$$
As a usual modification, if the calculated value of n is found to be more
sample size is
$$n_0 = \frac{n}{1 + (n-1)/N}.$$
Equation (2) depends on P which in most practical situations is not
based on the fact that the maximum value of $P(1-P)$ is 0.25, that is
than 0.25. Therefore, substituting 0.5 for P in Equation (2) results in
$$n = \frac{(0.25)\,z_{\alpha/2}^2}{d^2} \qquad (3)$$
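Equations (2) and (3) differ only in the value used for P, so one function can cover both. The sketch below is illustrative: the name `sample_size_for_proportion`, the margin d = 0.04 and the population size N = 5000 are assumptions, and the exact normal quantile replaces the rounded 1.96.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(d, P=0.5, confidence=0.95, N=None):
    """Sample size so that p is within d of P; P = 0.5 is the conservative choice."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    n = z ** 2 * P * (1 - P) / d ** 2      # Equation (2); P = 0.5 gives Equation (3)
    if N is not None:
        n = n / (1 + (n - 1) / N)          # n0 = n / (1 + (n - 1)/N)
    return n

n_unknown_P = sample_size_for_proportion(d=0.04)            # P unknown
n_known_P = sample_size_for_proportion(d=0.04, P=0.17)      # prior guess for P
n_finite = sample_size_for_proportion(d=0.04, N=5000)       # hypothetical N
print(ceil(n_unknown_P), ceil(n_known_P), ceil(n_finite))
```

Because $P(1-P)$ is largest at P = 0.5, the first figure is never smaller than the sample size needed for any other value of P.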
Example 3
certain district are medically indigent. It is felt that the proportion cannot
Solution
Example 4
during the year 2007 and 17% of the cars operate on both petrol and gas.
The ministry is aware that cars which were originally meant to operate on
petrol have been converted to operate on both petrol and gas. How many
proportion of cars in the country that operate on both petrol and gas with
We know that the margin of error d is half the width of the confidence
plan to stay in the country for more than a month. The Board would like
Solution
Since the Board has no idea what proportion of visitors who plan to stay
for more than one month, Equation (3) is used to solve for the sample
happen to the tolerable error if the sample size were to be reduced to 600?
Solving for d in Equation (2), we obtain
$$d = z_{\alpha/2}\sqrt{\frac{P(1-P)}{n}} = (1.96)\sqrt{\frac{(0.5)(0.5)}{600}} = 0.04.$$
Thus, reducing the sample size to 600 increases the tolerable error slightly.
Remark
The issue of sample size is partly technical. The larger the sample, the
more elaborate the analyses that can be sustained. This being so, one cannot
speak of an adequate sample size, still less of an optimal sample size. The
capability.
In many discussions of the sample size appropriate for a particular
Stratified Random Sampling
Stratified random sampling begins with the division of the entire study
population into subsets, called strata (singular form: stratum). Sampling
is then carried out independently in each stratum. This strategy reduces
the sampling error to the extent that the variable which defines the
strata is correlated with the survey variable of interest.
Stratified samples have the following general characteristics:
1. The entire population is first divided into an exhaustive set of strata, using some
external source, such as census data, to form the strata.
2. Within each stratum a separate random sample is selected. This implies that the
sample designer is at liberty to use different sampling fractions in different
strata.
3. From each separate sample, some statistic (e.g. mean) is calculated and
properly weighted to form an overall estimated mean for the whole population.
4. Sample variances are also calculated within each stratum and appropriately
weighted to yield a combined estimate for the whole population.
Reasons for stratification
Though the main advantage of using stratified sampling is the possible increase in
efficiency per unit of cost in estimating the population characteristics, it is also likely
to be useful in a number of situations.
        Foreigners                   Nationals
        $N_1$                        $N_2$
        $Y_{1i}$                     $Y_{2i}$
        $Y_{11} = 2$ books           $Y_{21} = 8$ books
        $Y_{12} = 4$ books           $Y_{22} = 12$ books
        $Y_{13} = 6$ books           $Y_{23} = 16$ books
Mean    $\bar{Y}_1 = 4$ books        $\bar{Y}_2 = 12$ books
2
For the first and second strata in our example, we have
$$\bar{Y}_1 = \frac{Y_1}{N_1} = \frac{12}{3} = 4 \text{ books per foreigner}$$
and
$$\bar{Y}_2 = \frac{Y_2}{N_2} = \frac{36}{3} = 12 \text{ books per national.}$$
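The estimator of the population mean is
$$\bar{y}_{st} = \frac{\sum_{h=1}^{L} N_h \bar{y}_h}{N}.$$
Note: $E(\bar{y}_{st}) = \bar{Y} \Rightarrow E(N\bar{y}_{st}) = N\bar{Y} = Y$. But
$N\bar{y}_{st} = \sum_{h=1}^{L} N_h \bar{y}_h = \hat{Y}_{st}$. That is,
the estimator $\hat{Y}_{st}$ is an unbiased estimator of $Y$: $E(\hat{Y}_{st}) = Y$.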
Example
Let us illustrate these results using our hypothetical population. We can
select $\binom{3}{2} = 3$ possible samples without replacement from each stratum.
Hence, there are altogether $\binom{3}{2}\binom{3}{2} = 3 \times 3 = 9$ possible samples of size
$n = n_1 + n_2 = 2 + 2 = 4$.
Suppose the first sample consists of (2, 4, 8, 12), of which (2, 4) is
from stratum 1 and (8, 12) is from stratum 2. The total of the sample
(2, 4) from stratum 1 is $y_1 = 2 + 4 = 6$. The sample mean of the sample
(2, 4) from stratum 1 is
$$\bar{y}_1 = \frac{y_1}{n_1} = \frac{6}{2} = 3.$$
Hence, an estimate of the total number of books in stratum 1 based on the
sample (2, 4) is $N_1\bar{y}_1 = (3)(3) = 9$.
Similarly,
$$y_2 = 8 + 12 = 20 \quad\text{and}\quad \bar{y}_2 = \frac{y_2}{n_2} = \frac{20}{2} = 10.$$
Hence, an estimate of the total number of books in stratum 2 based on
the sample (8, 12) is $N_2\bar{y}_2 = (3)(10) = 30$. Thus, the estimator of the
total number of books in the population based on the first sample
(2, 4, 8, 12) is $\hat{Y}_{st} = N_1\bar{y}_1 + N_2\bar{y}_2 = 9 + 30 = 39$, and the estimator of
the population mean is
$$\bar{y}_{st} = \frac{\hat{Y}_{st}}{N} = \frac{39}{6} = 6.5 \text{ books.}$$
N 6
Following the above procedure for all other possible samples, we obtain
the results summarized in the table below.
Stratum 1   Stratum 2   $y_1$   $y_2$   $\bar{y}_1$   $\bar{y}_2$   $N_1\bar{y}_1$   $N_2\bar{y}_2$   $\hat{Y}_{st}$   $\bar{y}_{st}$
2, 4        8, 12       6       20      3             10            9                30               39               6.5
2, 4        8, 16       6       24      3             12            9                36               45               7.5
2, 4        12, 16      6       28      3             14            9                42               51               8.5
2, 6        8, 12       8       20      4             10            12               30               42               7.0
2, 6        8, 16       8       24      4             12            12               36               48               8.0
2, 6        12, 16      8       28      4             14            12               42               54               9.0
4, 6        8, 12       10      20      5             10            15               30               45               7.5
4, 6        8, 16       10      24      5             12            15               36               51               8.5
4, 6        12, 16      10      28      5             14            15               42               57               9.5
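The nine rows of the table can be generated by pairing every possible sample from stratum 1 with every possible sample from stratum 2. The sketch below uses the book counts of the hypothetical population in these notes; the variable names are illustrative.

```python
from itertools import combinations, product
from fractions import Fraction

strata = {"foreigners": [2, 4, 6], "nationals": [8, 12, 16]}
n_h = {"foreigners": 2, "nationals": 2}
N = sum(len(v) for v in strata.values())

# All possible without-replacement samples within each stratum.
per_stratum = [list(combinations(strata[h], n_h[h])) for h in strata]

totals = []
for sample in product(*per_stratum):
    # Y_hat_st = sum over strata of N_h * (stratum sample mean)
    Y_hat = sum(len(strata[h]) * Fraction(sum(s), len(s))
                for h, s in zip(strata, sample))
    totals.append(Y_hat)
    print(sample, Y_hat, float(Y_hat / N))

expected_total = sum(totals) / len(totals)
print(expected_total, expected_total / N)
```

Averaging the nine estimates gives the expected value of the estimator, which can then be compared with the true total and mean.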
Now,
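$$E(\bar{y}_{st}) = \frac{1}{9}\left(\frac{39}{6} + \frac{45}{6} + \frac{51}{6} + \cdots + \frac{57}{6}\right) = 8.$$
But the population mean $\bar{Y} = 8$ and therefore $E(\bar{y}_{st}) = \bar{Y} = 8$. Thus, $\bar{y}_{st}$ is
an unbiased estimator of $\bar{Y}$. We also note that
$$E(\hat{Y}_{st}) = \frac{39 + 45 + \cdots + 57}{9} = 48 = Y$$
and $\hat{Y}_{st}$ is an unbiased estimator of $Y$.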
Population and stratum variances
There are two ways of defining the stratum variance, which shows the
variance within each stratum. One is
$$\sigma_h^2 = \frac{1}{N_h}\sum_{i=1}^{N_h}\left(Y_{hi} - \bar{Y}_h\right)^2$$
and the other is
$$s_h^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h}\left(Y_{hi} - \bar{Y}_h\right)^2.$$
The population variance shows the variation of the individual values from
the population mean, Y.
Example
Calculate the within stratum variances and the population variance for
our hypothetical example.
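Note that,
$$\begin{aligned}
\sum_{i=1}^{N_h}\left(Y_{hi} - \bar{Y}_h\right)^2 &= \sum_{i=1}^{N_h}\left(Y_{hi}^2 - 2\bar{Y}_h Y_{hi} + \bar{Y}_h^2\right) \\
&= \sum_{i=1}^{N_h} Y_{hi}^2 - 2N_h\bar{Y}_h^2 + N_h\bar{Y}_h^2 \quad\text{since } \bar{Y}_h = \frac{1}{N_h}\sum_{i=1}^{N_h} Y_{hi} \\
&= \sum_{i=1}^{N_h} Y_{hi}^2 - N_h\bar{Y}_h^2 \\
&= \sum_{i=1}^{N_h} Y_{hi}^2 - \frac{1}{N_h}\left(\sum_{i=1}^{N_h} Y_{hi}\right)^2.
\end{aligned}$$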
Also, we have
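$$\begin{aligned}
\sum_{i=1}^{N_h}\left(Y_{hi} - \bar{Y}\right)^2 &= \sum_{i=1}^{N_h}\left(Y_{hi}^2 - 2\bar{Y}\,Y_{hi} + \bar{Y}^2\right) \\
&= \sum_{i=1}^{N_h} Y_{hi}^2 - 2\bar{Y}\sum_{i=1}^{N_h} Y_{hi} + N_h\bar{Y}^2.
\end{aligned}$$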
Now, we have
$Y_{1i}$         $Y_{1i}^2$    $Y_{2i}$          $Y_{2i}^2$
$Y_{11} = 2$     4             $Y_{21} = 8$      64
$Y_{12} = 4$     16            $Y_{22} = 12$     144
$Y_{13} = 6$     36            $Y_{23} = 16$     256
Total 12         56            Total 36          464
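From
$$\sigma_h^2 = \frac{1}{N_h}\sum_{i=1}^{N_h}\left(Y_{hi} - \bar{Y}_h\right)^2$$
we have
$$\sigma_1^2 = \frac{1}{N_1}\left[\sum_{i=1}^{N_1} Y_{1i}^2 - \frac{1}{N_1}\left(\sum_{i=1}^{N_1} Y_{1i}\right)^2\right] = \frac{1}{3}\left[56 - \frac{1}{3}(12)^2\right] = \frac{8}{3}$$
and
$$s_1^2 = \frac{1}{N_1 - 1}\left[\sum_{i=1}^{N_1} Y_{1i}^2 - \frac{1}{N_1}\left(\sum_{i=1}^{N_1} Y_{1i}\right)^2\right] = \frac{1}{2}\left[56 - \frac{1}{3}(12)^2\right] = 4.$$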
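Similarly,
$$\sigma_2^2 = \frac{1}{N_2}\left[\sum_{i=1}^{N_2} Y_{2i}^2 - \frac{1}{N_2}\left(\sum_{i=1}^{N_2} Y_{2i}\right)^2\right] = \frac{1}{3}\left[464 - \frac{1}{3}(36)^2\right] = \frac{32}{3}$$
and
$$s_2^2 = \frac{1}{N_2 - 1}\left[\sum_{i=1}^{N_2} Y_{2i}^2 - \frac{1}{N_2}\left(\sum_{i=1}^{N_2} Y_{2i}\right)^2\right] = \frac{1}{2}\left[464 - \frac{1}{3}(36)^2\right] = 16.$$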
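The two definitions of the stratum variance differ only in the divisor, which the following sketch makes explicit using the computational form $\sum Y_{hi}^2 - \frac{1}{N_h}\left(\sum Y_{hi}\right)^2$. The function name `stratum_variances` is illustrative; exact fractions are used so that results such as 8/3 are not rounded.

```python
from fractions import Fraction

def stratum_variances(values):
    """Return (sigma_h^2, s_h^2): divisors N_h and N_h - 1 respectively."""
    N_h = len(values)
    # Corrected sum of squares: sum(Y^2) - (sum(Y))^2 / N_h
    ss = sum(y * y for y in values) - Fraction(sum(values) ** 2, N_h)
    return ss / N_h, ss / (N_h - 1)

print(stratum_variances([2, 4, 6]))      # stratum 1 (foreigners)
print(stratum_variances([8, 12, 16]))    # stratum 2 (nationals)
```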
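$$\sigma^2 = \frac{1}{N}\sum_{h=1}^{L}\sum_{i=1}^{N_h}\left(Y_{hi} - \bar{Y}\right)^2 = \frac{1}{N}\left[\sum_{i=1}^{N_1}\left(Y_{1i} - \bar{Y}\right)^2 + \sum_{i=1}^{N_2}\left(Y_{2i} - \bar{Y}\right)^2\right]$$
But
$$\sum_{i=1}^{N_1}\left(Y_{1i} - \bar{Y}\right)^2 = \sum_{i=1}^{N_1} Y_{1i}^2 - 2\bar{Y}\sum_{i=1}^{N_1} Y_{1i} + N_1\bar{Y}^2 = 56 - 2(8)(12) + (3)(8)^2 = 56$$
and
$$\sum_{i=1}^{N_2}\left(Y_{2i} - \bar{Y}\right)^2 = \sum_{i=1}^{N_2} Y_{2i}^2 - 2\bar{Y}\sum_{i=1}^{N_2} Y_{2i} + N_2\bar{Y}^2 = 464 - 2(8)(36) + (3)(8)^2 = 80.$$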
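$$\begin{aligned}
\sigma^2 &= \frac{1}{N}\left[\sum_{h=1}^{L}\sum_{i=1}^{N_h}\left(Y_{hi} - \bar{Y}_h\right)^2 + \sum_{h=1}^{L}\sum_{i=1}^{N_h}\left(\bar{Y}_h - \bar{Y}\right)^2 + 2\sum_{h=1}^{L}\sum_{i=1}^{N_h}\left(Y_{hi} - \bar{Y}_h\right)\left(\bar{Y}_h - \bar{Y}\right)\right] \\
&= \frac{1}{N}\left[\sum_{h=1}^{L}\sum_{i=1}^{N_h}\left(Y_{hi} - \bar{Y}_h\right)^2 + \sum_{h=1}^{L} N_h\left(\bar{Y}_h - \bar{Y}\right)^2\right] \\
&= \frac{1}{N}\sum_{h=1}^{L} N_h\sigma_h^2 + \frac{1}{N}\sum_{h=1}^{L} N_h\left(\bar{Y}_h - \bar{Y}\right)^2.
\end{aligned}$$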
Thus, the overall population variance is given by the sum of the stratum
variances and variances among the stratum means. Using our example,
these terms may be calculated as follows:
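$$\frac{1}{N}\sum_{h=1}^{L} N_h\sigma_h^2 = \frac{1}{6}\left[3\cdot\frac{8}{3} + 3\cdot\frac{32}{3}\right] = \frac{40}{6},$$
$$\frac{1}{N}\sum_{h=1}^{L} N_h\left(\bar{Y}_h - \bar{Y}\right)^2 = \frac{1}{6}\left[3(4 - 8)^2 + 3(12 - 8)^2\right] = 16,$$
$$\sigma^2 = \frac{40}{6} + 16 = \frac{136}{6}.$$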
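Note that $\sigma^2$ is simply the population variance, and may thus be
obtained as
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2 = \frac{1}{6}\left[(2-8)^2 + (4-8)^2 + (6-8)^2 + (8-8)^2 + (12-8)^2 + (16-8)^2\right] = \frac{136}{6}.$$
We may write
$$\sigma_w^2 = \sum_{h=1}^{L}\frac{N_h}{N}\,\sigma_h^2, \qquad \sigma_b^2 = \sum_{h=1}^{L}\frac{N_h}{N}\left(\bar{Y}_h - \bar{Y}\right)^2,$$
and call $\sigma_w^2$ the within stratum variance and $\sigma_b^2$ the between strata
variance.
The overall variance may then be shown as $\sigma^2 = \sigma_w^2 + \sigma_b^2$.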
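The split of the population variance into a within stratum part and a between strata part can be checked numerically for the books example. The helper name `pop_var` is illustrative; all variances here use the divisor N (or $N_h$), matching the $\sigma^2$ notation.

```python
from fractions import Fraction

strata = [[2, 4, 6], [8, 12, 16]]          # foreigners, nationals
N = sum(len(s) for s in strata)
Y_bar = Fraction(sum(sum(s) for s in strata), N)

def pop_var(values, centre):
    """Mean squared deviation of values about centre (divisor = number of values)."""
    return sum((y - centre) ** 2 for y in values) / len(values)

# Within stratum variance: weighted average of the sigma_h^2.
sigma_w2 = sum(len(s) * pop_var(s, Fraction(sum(s), len(s))) for s in strata) / N
# Between strata variance: weighted squared deviations of the stratum means.
sigma_b2 = sum(len(s) * (Fraction(sum(s), len(s)) - Y_bar) ** 2 for s in strata) / N
# Overall population variance computed directly.
sigma2 = pop_var([y for s in strata for y in s], Y_bar)

print(sigma_w2, sigma_b2, sigma2)
```

Stratification pays off when most of the total sits in the between strata term, since only the within stratum part affects the precision of the stratified estimator.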
The variance of $\bar{y}_{st}$
The variance of $\bar{y}_{st}$ is, by definition,
$$\sigma_{\bar{y}_{st}}^2 = \frac{1}{m}\sum^{m}\left(\bar{y}_{st} - \bar{Y}\right)^2.$$
Stratum 1   Stratum 2   $\hat{Y}_{st}$   $\bar{y}_{st}$   $\bar{y}_{st} - \bar{Y}$   $(\bar{y}_{st} - \bar{Y})^2$
2, 4        8, 12       39               39/6             −9/6                       81/36
2, 4        8, 16       45               45/6             −3/6                       9/36
2, 4        12, 16      51               51/6             3/6                        9/36
2, 6        8, 12       42               42/6             −6/6                       36/36
2, 6        8, 16       48               48/6             0                          0
2, 6        12, 16      54               54/6             6/6                        36/36
4, 6        8, 12       45               45/6             −3/6                       9/36
4, 6        8, 16       51               51/6             3/6                        9/36
4, 6        12, 16      57               57/6             9/6                        81/36
There are $m = \binom{3}{2}\binom{3}{2} = 9$ possible samples of size $n = n_1 + n_2 = 2 + 2 = 4$
that we can select. Hence, there are $m = 9$ possible sample means, $\bar{y}_{st}$.
Thus,
$$\sigma_{\bar{y}_{st}}^2 = \frac{1}{9}\cdot\frac{270}{36} = \frac{30}{36} = \frac{5}{6}.$$
When the population and the samples are fairly large, the number of possible samples
$$m = \binom{N_1}{n_1}\binom{N_2}{n_2}$$
will be so large that calculation of $\operatorname{var}(\bar{y}_{st})$ from its basic definition will
be practically difficult.
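We know that
$$\bar{y}_{st} = \frac{N_1\bar{y}_1 + N_2\bar{y}_2}{N} = w_1\bar{y}_1 + w_2\bar{y}_2,$$
where $w_h = N_h/N$ are called stratum weights. Since the samples are
selected by random sampling and are independent, we have
$$\begin{aligned}
\operatorname{var}(\bar{y}_{st}) &= w_1^2\operatorname{var}(\bar{y}_1) + w_2^2\operatorname{var}(\bar{y}_2) \\
&= w_1^2\,\frac{N_1 - n_1}{N_1}\,\frac{s_1^2}{n_1} + w_2^2\,\frac{N_2 - n_2}{N_2}\,\frac{s_2^2}{n_2} \\
&= \sum_{h=1}^{L}\left(\frac{N_h}{N}\right)^2\frac{N_h - n_h}{N_h}\,\frac{s_h^2}{n_h} \qquad (1) \\
&= \frac{1}{N^2}\sum_{h=1}^{L} N_h^2\,\frac{N_h - n_h}{N_h}\,\frac{s_h^2}{n_h} \qquad (2) \\
&= \frac{1}{N^2}\sum_{h=1}^{L}\frac{N_h - n_h}{N_h}\,\frac{N_h^2 s_h^2}{n_h} \qquad (3)
\end{aligned}$$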
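In our present illustration, we have $s_1^2 = 4$ and $s_2^2 = 16$. Hence, we
have
$$\begin{aligned}
\operatorname{var}(\bar{y}_{st}) &= \frac{1}{N^2}\left[\frac{N_1 - n_1}{N_1}\,\frac{N_1^2 s_1^2}{n_1} + \frac{N_2 - n_2}{N_2}\,\frac{N_2^2 s_2^2}{n_2}\right] \\
&= \frac{1}{6^2}\left[\frac{3 - 2}{3}\cdot\frac{3^2 (4)}{2} + \frac{3 - 2}{3}\cdot\frac{3^2 (16)}{2}\right] \\
&= \frac{5}{6}.
\end{aligned}$$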
Example
Given the following data where samples of size $n_1 = 2$ and $n_2 = 2$ are
selected from each stratum, calculate $\operatorname{var}(\bar{y}_{st})$.
$X_{1i}$         $X_{1i}^2$    $X_{2i}$          $X_{2i}^2$
$X_{11} = 1$     1             $X_{21} = 10$     100
$X_{12} = 3$     9             $X_{22} = 16$     256
$X_{13} = 5$     25            $X_{23} = 22$     484
Total 9          35            Total 48          840
The equation
$$\operatorname{var}(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L}\frac{N_h - n_h}{N_h}\,\frac{N_h^2 s_h^2}{n_h}$$
shows that $\operatorname{var}(\bar{y}_{st})$ is dependent on $s_h^2$ when $N$, $N_h$ and $n_h$ are
given. The $s_h^2$ shows the variance within each stratum. Hence, we
obtain the very important conclusion that when the variance within
each stratum is small, $\operatorname{var}(\bar{y}_{st})$ will be small, and therefore, the
precision of $\bar{y}_{st}$ will be high.
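The variance formula can be evaluated directly from the stratum values and the stratum sample sizes. The sketch below applies it to the books illustration and to the data of the example above; the function names `s2` and `var_y_st` are illustrative, and the second result is computed here rather than taken from the notes.

```python
from fractions import Fraction

def s2(values):
    """Stratum variance with divisor N_h - 1."""
    mean = Fraction(sum(values), len(values))
    return sum((y - mean) ** 2 for y in values) / (len(values) - 1)

def var_y_st(strata, n_h):
    """(1/N^2) * sum over strata of ((N_h - n_h)/N_h) * N_h^2 * s_h^2 / n_h."""
    N = sum(len(v) for v in strata)
    total = sum(
        Fraction(len(v) - n, len(v)) * len(v) ** 2 * s2(v) / n
        for v, n in zip(strata, n_h)
    )
    return total / N ** 2

books = var_y_st([[2, 4, 6], [8, 12, 16]], [2, 2])       # books illustration
example = var_y_st([[1, 3, 5], [10, 16, 22]], [2, 2])    # data of the example
print(books, example)
```

The second data set has a much larger spread in its second stratum, and this shows up directly as a larger variance of the stratified mean.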
Different ways of allocating a sample among strata
Various ways exist for allocating samples among strata. We shall learn
about two of them and how to estimate given population parameters
under their allocations.
Proportional allocation
The simplest and most frequently used way of allocating a sample
among strata is to allocate it proportionally to the size of the strata. For
example, if a sample of size $n = 50$ is to be selected from a population
of size $N = 500$, it means that the sampling fraction is to be
$n/N = 50/500 = 0.1$.
That means 10% of each stratum is to be selected for the sample. Then
$$\frac{n_1}{N_1} = \frac{n_2}{N_2} = \cdots = \frac{n_L}{N_L} = 10\%.$$
This method compares favourably with other methods in terms of
precision, and is both simpler and more convenient to use.
Example
Consider the population

Stratum 1      Stratum 2
Y11 = 2        Y21 = 8
Y12 = 4        Y22 = 12
Y13 = 6        Y23 = 16

If we wish to select a simple random sample of size $n = 4$ by proportional
allocation, find $n_h$.
Solution
We have
$$\frac{n_1}{N_1} = \frac{n_2}{N_2} = \frac{n}{N} = f, \qquad f = \frac{n}{N} = \frac{4}{6}.$$
Therefore
$$n_1 = N_1 f = 3\times\frac{4}{6} = 2 \quad\text{and}\quad n_2 = N_2 f = 3\times\frac{4}{6} = 2.$$
Since we are using simple random sampling in each stratum, the
probability of any sampling unit in stratum $h$ being included in the
subsample of size $n_h$ is $n_h/N_h = f$. Hence, since $n_h/N_h = f$ for all
strata, any unit in the population has the same probability $f = 4/6$ of
being included in the sample.
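The allocation rule $n_h = N_h f$ can be sketched in a few lines. The function name is illustrative. Exact fractions are returned, so a non-integer $n_h$ is left for the user to round; the notes give no rounding rule.

```python
from fractions import Fraction

def proportional_allocation(stratum_sizes, n):
    """Return n_h = N_h * f for each stratum, where f = n / N."""
    N = sum(stratum_sizes)
    f = Fraction(n, N)              # common sampling fraction
    return [N_h * f for N_h in stratum_sizes]

# Two strata of 3 units each, total sample of n = 4
print([int(x) for x in proportional_allocation([3, 3], 4)])
```

For the example above this gives $n_1 = n_2 = 2$, matching the hand calculation.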
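Estimator of $\bar{Y}$
The unbiased estimator of $\bar{Y}$ for stratified random sampling is
$$\bar{y}_{st} = \frac{\sum_{h=1}^{L} N_h\bar{y}_h}{\sum_{h=1}^{L} N_h}.$$
Substituting $N_h = n_h/f$ into the above, we obtain
$$\bar{y}_{st}
 = \frac{\sum_{h=1}^{L}\frac{n_h}{f}\,\bar{y}_h}{\sum_{h=1}^{L}\frac{n_h}{f}}
 = \frac{\sum_{h=1}^{L} n_h\bar{y}_h}{n}
 = \frac{\sum_{h=1}^{L}\sum_{i=1}^{n_h} y_{hi}}{n}.$$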
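Thus, under proportional allocation, the estimator $\bar{y}_{st}$ is the mean
of the whole sample of size $n$, and is an unbiased estimator of $\bar{Y}$.
Suppose we select the sample
$n_1$: $y_{11} = 2$, $y_{12} = 6$.
$n_2$: $y_{21} = 8$, $y_{22} = 12$.
Then, an estimate of $\bar{Y}$ is
$$\bar{y}_{st} = \frac{2 + 6 + 8 + 12}{4} = \frac{28}{4} = 7.$$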
Example
Given the following population, estimate the average number of
cigarettes a person smokes by selecting a stratified random sample of
size $n = 5$ by proportional allocation.

Stratum 1 (Males)      Stratum 2 (Females)
Y11 = 20               Y21 = 10
Y12 = 25               Y22 = 12
Y13 = 35               Y23 = 8
Y14 = 30               Y24 = 6
Y15 = 24
Y16 = 26
Solution
The sampling fraction is $f = n/N = 5/10 = 1/2$. Hence, we select
$n_1 = N_1 f = 6\times\tfrac{1}{2} = 3$ units from stratum 1 and
$n_2 = N_2 f = 4\times\tfrac{1}{2} = 2$ units from stratum 2.
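Suppose we select
$n_1$: $y_{11} = 25$, $y_{12} = 20$, $y_{13} = 35$ from Stratum 1
and
$n_2$: $y_{21} = 10$, $y_{22} = 6$ from Stratum 2.
Then the estimate $\bar{y}_{st}$ is
$$\bar{y}_{st} = \frac{1}{5}(25 + 20 + 35 + 10 + 6) = 19.2 \text{ cigarettes}.$$
The true average is
$$\bar{Y} = \frac{1}{10}(20 + 25 + \cdots + 8 + 6) = \frac{1}{10}(196) = 19.6 \text{ cigarettes}.$$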
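A short sketch of this calculation follows, with illustrative function and variable names. It computes the weighted estimator $\sum N_h\bar{y}_h / N$ and the plain mean of the pooled sample, to show that the two agree under proportional allocation, and compares both with the true mean.

```python
def stratified_mean(stratum_sizes, samples):
    """Weighted estimate: sum of N_h * (mean of sample h), divided by N."""
    N = sum(stratum_sizes)
    return sum(N_h * sum(y) / len(y) for N_h, y in zip(stratum_sizes, samples)) / N

males = [20, 25, 35, 30, 24, 26]      # Stratum 1
females = [10, 12, 8, 6]              # Stratum 2
samples = [[25, 20, 35], [10, 6]]     # the sample chosen in the example

weighted = stratified_mean([len(males), len(females)], samples)
pooled = sum(sum(y) for y in samples) / sum(len(y) for y in samples)
true_mean = sum(males + females) / len(males + females)
print(weighted, pooled, true_mean)
```

The weighted and pooled estimates are both 19.2, against a true mean of 19.6.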
Optimum allocation
In some cases, it may be necessary to conduct a sample survey with a
fixed budget, but with varying costs of selecting sampling units from
different strata. For example, when families are stratified into urban and
rural classifications in order to survey their average income, the cost of
selecting sampling units from urban and rural families will usually differ.
The cost function is given by
$$c = c_o + \sum_{h=1}^{L} c_h n_h,$$
where $c_o$ is the fixed cost and $c_h$ is the variable cost. The fixed costs
include office rent, fixed administrative costs, equipment costs, etc., and
the variable cost is the cost per sampling unit in stratum $h$. Hence,
$c_h n_h$ is the cost of selecting $n_h$ families from stratum $h$. Now,
$$C = \sum_{h=1}^{L} c_h n_h. \qquad (1)$$
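Note that $n_h$ is proportional to $N_h s_h/\sqrt{c_h}$. This implies that a
stratum receives a larger share of the sample when it is larger, when it is
more variable internally, or when sampling in it is cheaper.
Example
Given the following population, allocate the sample of size $n = 4$
by optimum allocation. Assume $c_1 =$ GHc 1.00 and $c_2 =$ GHc 4.00.

Stratum 1      Stratum 2
2              8
2              8
4              12
6              16
6              16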
Solution
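$$\begin{aligned}
s_1^2 &= \frac{1}{N_1-1}\sum_{i=1}^{N_1}(Y_{1i}-\bar{Y}_1)^2 \\
      &= \frac{1}{5-1}\left[(2-4)^2 + (2-4)^2 + (4-4)^2 + (6-4)^2 + (6-4)^2\right] \\
      &= \frac{1}{4}(16) = 4.
\end{aligned}$$
$$\begin{aligned}
s_2^2 &= \frac{1}{N_2-1}\sum_{i=1}^{N_2}(Y_{2i}-\bar{Y}_2)^2 \\
      &= \frac{1}{5-1}\left[(8-12)^2 + (8-12)^2 + (12-12)^2 + (16-12)^2 + (16-12)^2\right] \\
      &= \frac{1}{4}(64) = 16.
\end{aligned}$$
$$n_1 = \frac{N_1 s_1/\sqrt{c_1}}{\sum_h N_h s_h/\sqrt{c_h}}\; n = \frac{10}{20}(4) = 2,$$
$$n_2 = \frac{N_2 s_2/\sqrt{c_2}}{\sum_h N_h s_h/\sqrt{c_h}}\; n = \frac{10}{20}(4) = 2.$$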
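The same allocation can be reproduced with a small sketch. The function names are illustrative. The variance uses the divisor $N_h - 1$, as in the solution above, and the result is not rounded.

```python
from math import sqrt

def stratum_variance(values):
    """s_h^2 with divisor N_h - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def optimum_allocation(strata, costs, n):
    """n_h = n * (N_h s_h / sqrt(c_h)) / (sum of N_h s_h / sqrt(c_h))."""
    weights = [len(s) * sqrt(stratum_variance(s)) / sqrt(c)
               for s, c in zip(strata, costs)]
    total = sum(weights)
    return [n * w / total for w in weights]

strata = [[2, 2, 4, 6, 6], [8, 8, 12, 16, 16]]
allocation = optimum_allocation(strata, costs=[1.00, 4.00], n=4)
print(allocation)
```

Here both weights equal 10, so each stratum receives 2 of the 4 units. Stratum 2 is twice as variable ($s_2 = 4$ against $s_1 = 2$), but its unit cost is four times higher, and the two effects cancel.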
Suppose we select

           I                     II
           y11 = 4               y21 = 8
           y12 = 6               y22 = 16
Total      y1 = 10               y2 = 24
Mean       $\bar{y}_1 = 5$       $\bar{y}_2 = 12$

Then
$$\bar{y}_{st} = \frac{N_1\bar{y}_1 + N_2\bar{y}_2}{N} = \frac{5(5) + 5(12)}{10} = 8.5.$$
SYSTEMATIC SAMPLING
We will now learn about systematic sampling, which is more convenient
than simple random sampling and which ensures that each unit has an equal
chance of being included in the sample.
Introduction
In systematic sampling, we select every $k$th unit starting with a unit which
corresponds to the number $r$ chosen at random from 1 to $k$, where $k$ is an
integer such that $k = N/n$. The sample consists of the units
with serial numbers $r,\ r+k,\ r+2k,\ \ldots,\ r+(n-1)k$.
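The selection rule can be sketched as follows. The function name is illustrative, and the random start is drawn with Python's standard `random.randint` unless one is supplied.

```python
import random

def systematic_sample(N, k, r=None):
    """Serial numbers r, r + k, r + 2k, ... (1-based), not exceeding N."""
    if r is None:
        r = random.randint(1, k)    # random start, 1 <= r <= k
    return list(range(r, N + 1, k))

# N = 12, k = 3 and a random start of r = 2
print(systematic_sample(12, 3, r=2))
```

With $N = 12$, $k = 3$ and $r = 2$ the selected serial numbers are 2, 5, 8 and 11, so $n = 4$.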
The random number r is called the random start and k is called
the sampling interval. A sample selected by this procedure is
called a systematic sample with a random start. One can easily
see that r determines the entire sample. In this procedure, we
select with equal probability one of the k possible groups or
samples. Besides the operational convenience, systematic
sampling provides estimators that are more efficient than those
provided by SRS under certain conditions that are reasonable in
practice. Systematic sampling provides a useful alternative to
SRS for the following reasons:
1. Systematic sampling is easier to perform in the field and hence is less
subject to selection errors by field workers than simple random
sampling, especially if a good frame is not available.
2. Systematic sampling can provide greater information per unit cost
than simple random sampling can provide. A systematic sample is
generally more uniformly spread over the entire population and thus
may provide more information about the population than an
equivalent amount of data contained in a simple random sample.
Systematic sampling often suggests itself when there is a sequence
of units occurring naturally in space (trees in a forest) or time
(landing of fishing boats on the coast).
Sample Selection Procedure
A sample which is obtained by systematic sampling may be expressed as
“a 1 in 5 sample” or “a 1 in 10 sample”. Thus, in general, a sample which
is obtained by systematic sampling is expressed as “a 1 in k sample”. This
means that the sampling fraction is $1/k$. There are two common
methods for selecting a sample by systematic sampling, which we shall
call Methods A and B.
Method A
Suppose we have a population
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10 Y11 Y12
Note that $N = nk$, and therefore the population size is an exact multiple
of $k$. Since we select 1 sampling unit from each stratum (each block of
$k = 3$ consecutive units), $k = 3$ shows the number of possible systematic
samples that can be selected, and $n = 4$ is the size of each sample. The
$k = 3$ possible samples that can be obtained from the given population are
shown in the table below.
Sample 1 Sample 2 Sample 3
Y1 Y2 Y3
Y4 Y5 Y6
Y7 Y8 Y9
Y10 Y11 Y12
Since the starting sampling unit is randomly selected from the first $k = 3$
units, the probability of selecting any one of these $k = 3$ sampling units is
$1/k = 1/3$, and the probability of selecting any one of these systematic
samples is also $1/k = 1/3$.
Example
Select a 1 in 8 systematic sample from a population of size $N = 37$.
Solution
The sample size is obtained as
$$\frac{N}{k} = \frac{37}{8} = 4\tfrac{5}{8}.$$
Hence, the sample size will be either 4 or 5. That is,
$$4 < 4\tfrac{5}{8} < 5.$$
The samples are listed in the table below.
I II III IV V VI VII VIII
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8
Y9 Y10 Y11 Y12 Y13 Y14 Y15 Y16
Y17 Y18 Y19 Y20 Y21 Y22 Y23 Y24
Y25 Y26 Y27 Y28 Y29 Y30 Y31 Y32
Y33 Y34 Y35 Y36 Y37
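The table can be generated by taking every $k$th label from each of the $k$ possible starting points. The sketch below (illustrative names) lists all 8 samples and their sizes.

```python
def all_systematic_samples(units, k):
    """Sample number r + 1 (r = 0..k-1) takes units[r], units[r + k], ..."""
    return [units[r::k] for r in range(k)]

labels = [f"Y{i}" for i in range(1, 38)]     # Y1 ... Y37
samples = all_systematic_samples(labels, 8)
sizes = [len(s) for s in samples]
print(sizes)
print(samples[4])                            # column V of the table
```

Samples I to V contain 5 units and samples VI to VIII contain 4, because $37 = 4\times 8 + 5$.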
Method B (Remainder Method)
Assume that $N = nk = 12$, and suppose that we wish to select a 1 in
$k = 3$ sample. A sampling unit (say, the $j$th unit) is randomly selected
from the population. Let $j$ be the 8th unit. Then
$$\frac{j}{k} = \frac{8}{3} = 2 \text{ with remainder } r = 2.$$
Note that $r = 2 < k = 3$ and that the values $r$ can take will be 0, 1, and 2.
When $r = 1$, select $Y_1$ as the starting unit; when $r = 2$, select $Y_2$;
and when $r = 0$, select $Y_3$ (the $k = 3$rd unit).
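A minimal sketch of the remainder rule is given below, assuming the remainder identifies the starting unit and every $k$th unit is taken from there. The function name is illustrative. In practice $j$ would be drawn at random from 1 to $N$; it is fixed here to reproduce the illustration.

```python
def remainder_method_sample(units, k, j):
    """Method B: unit j (1-based) is drawn; r = j mod k fixes the start."""
    r = j % k
    start = r if r != 0 else k      # remainder 0 means start at the k-th unit
    return units[start - 1::k]

labels = [f"Y{i}" for i in range(1, 13)]     # Y1 ... Y12
print(remainder_method_sample(labels, 3, j=8))
```

With $j = 8$ and $k = 3$ the remainder is 2, so the sample is $Y_2, Y_5, Y_8, Y_{11}$, which contains the randomly drawn unit $Y_8$.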
To show that the sample mean of a systematic sample is an unbiased
estimator of the population mean when using Method A with $N = nk$,
let
$$\bar{y}_i = \frac{1}{n}\sum_{j=1}^{n} y_{ij} \qquad (1)$$
be the sample mean for the $i$th systematic sample. Then
$$E(\bar{y}_{sys}) = \frac{1}{k}(\bar{y}_1 + \bar{y}_2 + \cdots + \bar{y}_k),$$
since there are only $k$ possible samples we can select, and the probability
of selecting a certain systematic sample is $1/k$. Using our example with
$N = 12$, $k = 3$, and $n = 4$, we have
$$\begin{aligned}
E(\bar{y}_{sys}) &= \frac{1}{k}(\bar{y}_1 + \bar{y}_2 + \bar{y}_3) \\
 &= \frac{1}{k}\cdot\frac{1}{n}(y_1 + y_2 + \cdots + y_N) \\
 &= \frac{1}{12}(y_1 + y_2 + \cdots + y_{12}) \\
 &= \bar{Y}.
\end{aligned}$$
Hence,
$$E(\bar{y}_{sys}) = \bar{Y}.$$
Example 1
Given the number of books that $N = 9$ children have, select a 1 in 3 sample
by systematic sampling and estimate the population mean.
1, 2, 3, 4, 5, 6, 7, 8, 9
Solution
Note that $N = 9 = 3\times 3 = nk$. There are 3 systematic samples that can
be selected, namely:

          Sample 1   Sample 2   Sample 3
             1          2          3
             4          5          6
             7          8          9
Total       12         15         18

Hence, the sample means are $\bar{y}_1 = 12/3 = 4$, $\bar{y}_2 = 15/3 = 5$, and
$\bar{y}_3 = 18/3 = 6$, which are the estimates of the population mean.
Example 2
Using the data in Example 1 above, show that the sample mean of
the systematic sample is an unbiased estimator of the population mean,
that is, $E(\bar{y}_{sys}) = \bar{Y}$.
Solution
We have
$$E(\bar{y}_{sys}) = \frac{1}{3}(4 + 5 + 6) = 5.$$
But
$$\bar{Y} = \frac{1}{9}(1 + 2 + \cdots + 9) = \frac{45}{9} = 5.$$
Therefore, $E(\bar{y}_{sys}) = \bar{Y}$.
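The check in this example can be sketched as code. The function name is illustrative, and exact fractions are used so that the comparison with $\bar{Y}$ is not affected by rounding.

```python
from fractions import Fraction

def expected_mean_method_a(values, k):
    """E(y_sys) when each of the k systematic samples has probability 1/k."""
    samples = [values[r::k] for r in range(k)]
    means = [Fraction(sum(s), len(s)) for s in samples]
    return sum(means) / k

books = [1, 2, 3, 4, 5, 6, 7, 8, 9]
expected = expected_mean_method_a(books, 3)
population_mean = Fraction(sum(books), len(books))
print(expected, population_mean)
```

Both values are 5, in agreement with the hand calculation.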
Example 3
Suppose that the data in Example 1 are arranged in the following order:
6, 3, 4, 9, 2, 5, 1, 7, 8
Then we have the 3 systematic samples

          Sample 1   Sample 2   Sample 3
             6          3          4
             9          2          5
             1          7          8
Total       16         12         17
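As a check based on the totals in the table above, the individual sample means change with the ordering, but their average over the three equally likely samples is still the population mean:

```latex
\bar{y}_1 = \tfrac{16}{3}, \quad \bar{y}_2 = \tfrac{12}{3} = 4, \quad \bar{y}_3 = \tfrac{17}{3},
\qquad
E(\bar{y}_{sys}) = \tfrac{1}{3}\left(\tfrac{16}{3} + \tfrac{12}{3} + \tfrac{17}{3}\right)
                 = \tfrac{45}{9} = 5 = \bar{Y}.
```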
Example 4
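Given the population $Y_1, Y_2, \ldots, Y_{10}$, select a 1 in 3 systematic
sample.
Solution
We have
$$\frac{N}{k} = \frac{10}{3} = 3\tfrac{1}{3}.$$
Hence, the sample size is 3 or 4, and the samples are

Sample 1   Sample 2   Sample 3
Y1         Y2         Y3
Y4         Y5         Y6
Y7         Y8         Y9
Y10

Hence,
$$E(\bar{y}_{sys}) = \frac{1}{3}\left[\frac{1}{4}(Y_1 + Y_4 + Y_7 + Y_{10}) + \frac{1}{3}(Y_2 + Y_5 + Y_8) + \frac{1}{3}(Y_3 + Y_6 + Y_9)\right],$$
which in general is not equal to
$$\frac{1}{10}(Y_1 + Y_2 + \cdots + Y_{10}) = \bar{Y},$$
and so $\bar{y}_{sys}$ is a biased estimator of $\bar{Y}$.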
To show that the sample mean of a systematic sample is an unbiased
estimator of the population mean when using Method B with $N \neq nk$,
suppose we have the population
$Y_1, Y_2, Y_3, Y_4, Y_5, Y_6, Y_7, Y_8, Y_9, Y_{10}, Y_{11}$
If $k = 3$, then $n = 3$ or 4. Thus, we obtain the samples

Sample 1   Sample 2   Sample 3
Y1         Y2         Y3
Y4         Y5         Y6
Y7         Y8         Y9
Y10        Y11

The probabilities of samples 1, 2, and 3 are 4/11, 4/11, and 3/11,
respectively. Hence,
$$\begin{aligned}
E(\bar{y}_{sys}) &= \frac{4}{11}\bar{y}_1 + \frac{4}{11}\bar{y}_2 + \frac{3}{11}\bar{y}_3 \\
 &= \frac{4}{11}\cdot\frac{1}{4}(Y_1 + Y_4 + Y_7 + Y_{10}) + \frac{4}{11}\cdot\frac{1}{4}(Y_2 + Y_5 + Y_8 + Y_{11}) + \frac{3}{11}\cdot\frac{1}{3}(Y_3 + Y_6 + Y_9) \\
 &= \frac{1}{11}(Y_1 + Y_2 + \cdots + Y_{11}) \\
 &= \bar{Y}.
\end{aligned}$$
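The difference between the two selection rules when $N \neq nk$ can be seen numerically. The sketch below uses hypothetical values for $Y_1, \ldots, Y_{11}$ (they are not from the notes) and illustrative function names. It computes the expected sample mean first with equal probabilities $1/k$, and then with probabilities $n_i/N$ as in Method B.

```python
from fractions import Fraction

def expected_means(values, k):
    """Return (E with sample probabilities 1/k, E with probabilities n_i/N)."""
    N = len(values)
    samples = [values[r::k] for r in range(k)]
    means = [Fraction(sum(s), len(s)) for s in samples]
    equal = sum(means) / k
    weighted = sum(Fraction(len(s), N) * m for s, m in zip(samples, means))
    return equal, weighted

values = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]   # hypothetical Y1 ... Y11
equal, weighted = expected_means(values, 3)
population_mean = Fraction(sum(values), len(values))
print(equal, weighted, population_mean)
```

For these values the equal-probability expectation is 25/6, while the Method B expectation and the population mean are both 4.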
Example 5
Given the number of books that $N = 8$ children have, select a 1 in 3 sample
by systematic sampling and estimate the population mean by using Method
B (Remainder method).
1, 2, 3, 4, 5, 6, 7, 8
Solution
Assume we select the $j = 7$th unit. Then $j/k = 7/3 = 2$ with remainder $r = 1$.
The 3 systematic samples are

          Sample 1   Sample 2   Sample 3
             1          2          3
             4          5          6
             7          8
Total       12         15          9

The probabilities associated with these samples are 3/8, 3/8, and 2/8.
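Carrying the example through with the figures above gives the following sketch. It assumes that the remainder $r = 1$ points to Sample 1, the sample that starts with the first unit and contains the 7th.

```latex
\bar{y}_1 = \tfrac{12}{3} = 4, \quad \bar{y}_2 = \tfrac{15}{3} = 5, \quad \bar{y}_3 = \tfrac{9}{2} = 4.5,
\qquad \text{selected estimate: } \bar{y}_{sys} = \bar{y}_1 = 4.

E(\bar{y}_{sys}) = \tfrac{3}{8}(4) + \tfrac{3}{8}(5) + \tfrac{2}{8}(4.5)
                 = \tfrac{12 + 15 + 9}{8} = \tfrac{36}{8} = 4.5
                 = \tfrac{1}{8}(1 + 2 + \cdots + 8) = \bar{Y}.
```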