Survey Techniques Notes
on
Sample Surveys
and
Design of Experiments

Rafiqullah Khan
Preface
We would like to give an idea of some of the special features of this monograph on Sample
Surveys and Design of Experiments. Before we go further, let us tell you that it has been
prepared according to the new syllabus prescribed by the UGC. We have written these notes in
such a simple style that even a weak student will be able to understand them easily.
We are sure you will agree with us that the facts and formulas of sample surveys are much the
same in all books; the difference lies in the method of presenting these facts to the
students. We have tried to do so in such a simple way that, while going through these notes, a
student will feel as if a teacher is sitting by his side and explaining various things to him.
We are sure that after reading these lecture notes, the student will develop a special interest
in this field and would like to help analyze such data in other disciplines as well.
We think that the real judges of this monograph are the teachers concerned and the students for
whom it is meant. So, we request our teacher friends as well as students to point out our
mistakes, if any, and send their comments and suggestions for the further improvement of this
monograph. Wishing you great success.
Yours sincerely
Dr. Rafiqullah Khan
aruke@rediffmail.com
Department of Statistics and Operations Research
Aligarh Muslim University, Aligarh.
Random sample: A random or probability sample is a sample drawn in such a manner that
each unit in the population has a predetermined probability of selection.
The probability of selection of a unit can be equal as well as unequal.

Mean square error: If t is an estimator of a parameter \theta, its mean square error is

MSE(t) = E(t - \theta)^2.

The MSE may be considered a measure of the accuracy with which the estimator t
estimates the parameter \theta.
The expected value of the squared deviation of the estimator from its expected value is
termed the sampling variance. It is a measure of the divergence of the estimator from its
expected value and is given by

V(t) = E[t - E(t)]^2.

This measure of variability may be termed the precision of the estimator t.
The relation between MSE and sampling variance, or between accuracy and precision, can be
obtained as

MSE(t) = E(t - \theta)^2 = E[{t - E(t)} + {E(t) - \theta}]^2 = V(t) + [E(t) - \theta]^2,

i.e. MSE = sampling variance + (bias)^2.
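The decomposition MSE = variance + bias^2 can be checked numerically. The sketch below is my addition (not part of the original notes); the estimator t = 1.1 × (sample mean) and all numbers are hypothetical, chosen only to make the bias visible.

```python
import random

# Numerical check of MSE(t) = V(t) + [E(t) - theta]^2, using a deliberately
# biased estimator t = 1.1 * (sample mean) of a population mean theta = 10.
random.seed(1)
theta = 10.0

def draw_t():
    sample = [random.gauss(theta, 2.0) for _ in range(5)]
    return 1.1 * sum(sample) / len(sample)   # biased: E(t) = 1.1 * theta

ts = [draw_t() for _ in range(100000)]
mean_t = sum(ts) / len(ts)                           # estimates E(t)
var = sum((t - mean_t) ** 2 for t in ts) / len(ts)   # sampling variance V(t)
mse = sum((t - theta) ** 2 for t in ts) / len(ts)    # MSE(t)
bias2 = (mean_t - theta) ** 2
print(round(mse, 3), round(var + bias2, 3))          # the two values agree
```

The agreement is exact for the empirical moments, since the identity holds term by term once E(t) is replaced by the sample mean of the t values.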
A sample survey has its own limitations, and the advantages of sampling over complete
enumeration can be realized only if
i) the units are drawn in a scientific manner,
ii) an appropriate sampling technique is used, and
iii) the number of units selected in the sample is adequate.
Basic principles of sample surveys
Two basic principles for sample surveys are
i) Validity
ii) Optimization
The principle of optimization takes into account the factors of
a) Efficiency
b) Cost
By validity, we mean that the sample should be so selected that the results could be
interpreted objectively in terms of probability. The principle will be satisfied by selecting a
probability sample, which ensures that there is some definite, pre-assigned probability for
each individual of the population.
Efficiency is measured by the inverse of the sampling variance of the estimator.
Cost is measured by the expenditure incurred in terms of money or man-hours. The principle
of optimization ensures that a given level of efficiency will be reached with minimum cost, or
that the maximum possible efficiency will be attained at a given level of cost.
Sampling and non-sampling errors
The error which arises due to only a sample (a part of the population) being used to estimate the
population parameters and draw inferences about the population is termed sampling error or
sampling fluctuation. Whatever the degree of care in selecting a sample, there will always be a
difference between the parameter and its corresponding estimate. This error is inherent and
unavoidable in any and every sampling scheme. A sample with the smallest sampling error will
always be considered a good representative of the population. This error can be reduced by
increasing the size of the sample (the number of units selected in the sample). In fact, the
sampling error decreases roughly in inverse proportion to the square root of the sample size.

(Figure: sampling error plotted against sample size, decreasing as the sample size grows.)
When the sample survey becomes a census survey, the sampling error becomes zero.
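The inverse-square-root relation can be illustrated with the standard error of the sample mean, SE(\bar{y}) = \sigma/\sqrt{n}: quadrupling the sample size halves the sampling error. The numbers below are hypothetical (my sketch, not from the notes).

```python
import math

# SE(ybar) = sigma / sqrt(n): quadrupling n halves the sampling error.
sigma = 12.0
ses = [sigma / math.sqrt(n) for n in (25, 100, 400)]
print(ses)   # [2.4, 1.2, 0.6]
```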
Non-sampling error
The non-sampling errors primarily arise at the following stages:
i) Failure to measure some of the units in the selected sample
ii) Observational errors due to defective measurement technique
iii) Errors introduced in editing, coding and tabulating the results.
Non-sampling errors are present in both the complete enumeration survey and the sample
survey. In practice, the census survey results may suffer from non-sampling errors although
these may be free from sampling error. The non-sampling error is likely to increase with
increase in sample size, while sampling error decreases with increase in sample size.
Type of surveys
There are various types of surveys which are conducted on the basis of the objectives to be
fulfilled.
Demographic surveys: These surveys are conducted to collect the demographic data, for
example, household composition, family size, number of males in families, etc. Such surveys
are useful in the policy formulation for any city, state or country for the welfare of the people.
Educational surveys: These surveys are conducted to collect the educational data, for
example, how many children go to school, how many persons are graduates, etc. Such surveys
are conducted to examine the educational programs in schools and colleges. Generally,
schools are selected first and then the students from each school constitute the sample.
Economic surveys: These surveys are conducted to collect the economic data, for example,
data related to the export and import of goods, industrial production, consumer expenditure, etc.
Such surveys are helpful in constructing indices indicating the growth in a particular sector
of the economy, or even the overall economic growth of the country.
Employment surveys: These surveys are conducted to collect the employment related data,
for example, the employment rate, labour conditions, wages, etc., in a city, state or country.
Such data help in constructing various indices to assess the employment conditions among the
people.
Health and Nutrition surveys: These surveys are conducted to collect the data related to
health and nutrition issues, for example, number of visits to doctors, food given to children,
nutritional value, etc. Such surveys are conducted in cities, states and countries by
national and international organizations like UNICEF, WHO, etc.
Agricultural surveys: These surveys are conducted to collect the agriculture related data to
estimate, for example, the acreage and production of crops, livestock numbers, use of
fertilizers, pesticides and other related topics. The government bases its planning related to
food issues for the people on such surveys.
Marketing surveys: These surveys are conducted to collect the data related to marketing.
They are conducted by major companies, manufacturers or those who provide services to
consumers, etc. Such data are used for knowing the satisfaction and opinion of consumers, as
well as for developing sales, purchase and promotional activities.
Election surveys: These surveys are conducted to study the likely outcome of an election or a
poll; for example, such polls are conducted in democratic countries to gather the opinions of
people about any candidate contesting the election.
Public polls and surveys: These surveys are conducted to collect the public opinion on any
particular issue; such surveys are generally conducted by the news media and the agencies
which conduct polls and surveys on the current topics of interest to public.
Campus surveys: These surveys are conducted on the students of any educational institution
to study about the educational programs, living facilities, dining facilities, sports activities
etc.
CHAPTER II
Simple Random Sampling

A procedure for selecting a sample of size n out of a finite population of size N, in which
each of the possible distinct samples has an equal chance of being selected, is called random
sampling or simple random sampling.
We may have two distinct types of simple random sampling as follows:
i) Simple random sampling with replacement (srswr).
ii) Simple random sampling without replacement (srswor).
(3, 6, 8), (3, 6, 9), (3, 8, 8), (3, 8, 9), (3, 9, 9), (6, 6, 6), (6, 6, 8), (6, 6, 9),
(6, 8, 8), (6, 8, 9), (6, 9, 9), (8, 8, 8), (8, 8, 9), (8, 9, 9), (9, 9, 9).

By sampling wor, the number of possible samples will be ^N C_n = ^5 C_3 = 10, which are as
follows:

(1, 3, 6), (1, 3, 8), (1, 3, 9), (1, 6, 8), (1, 6, 9), (1, 8, 9), (3, 6, 8), (3, 6, 9), (3, 8, 9), (6, 8, 9).
\bar{Y} = (1/N) \sum_{i=1}^{N} Y_i , population mean.

\bar{y} = (1/n) \sum_{i=1}^{n} y_i , sample mean.

\sigma^2 = (1/N) \sum_{i=1}^{N} (Y_i - \bar{Y})^2 = (1/N) \sum_{i=1}^{N} Y_i^2 - \bar{Y}^2 , population variance.

S^2 = (1/(N-1)) \sum_{i=1}^{N} (Y_i - \bar{Y})^2 = (1/(N-1)) [\sum_{i=1}^{N} Y_i^2 - N \bar{Y}^2] , population mean square.

s^2 = (1/(n-1)) \sum_{i=1}^{n} (y_i - \bar{y})^2 = (1/(n-1)) [\sum_{i=1}^{n} y_i^2 - n \bar{y}^2] , sample mean square.
Theorem: In srswr, the sample mean \bar{y} is an unbiased estimate of the population mean
\bar{Y}, i.e. E(\bar{y}) = \bar{Y}, and its variance is

V(\bar{y}) = ((N-1)/(nN)) S^2 = \sigma^2/n.

Proof: It is immediately seen that

E(\bar{y}) = E[(1/n) \sum_{i=1}^{n} y_i] = (1/n) \sum_{i=1}^{n} E(y_i).

By definition,

E(y_i) = \sum_{i=1}^{N} Y_i Pr(y_i = Y_i) = (1/N) \sum_{i=1}^{N} Y_i = \bar{Y},

since y_i can take any one of the values Y_1, ..., Y_N, each with probability 1/N.
Therefore,

E(\bar{y}) = (1/n) \sum_{i=1}^{n} \bar{Y} = \bar{Y}.
To obtain the variance, consider

V(\bar{y}) = E[\bar{y} - E(\bar{y})]^2 = E[(1/n) \sum_{i=1}^{n} (y_i - \bar{Y})]^2.

The square of a sum expands as [\sum_{i=1}^{n} a_i]^2 = \sum_{i=1}^{n} a_i^2 + \sum_{i \ne j} a_i a_j,
here with a_i = y_i - E(y_i). This can be verified in a particular case: put n = 3, then

(a_1 + a_2 + a_3)^2 = a_1^2 + a_2^2 + a_3^2 + a_1 a_2 + a_1 a_3 + a_2 a_1 + a_2 a_3 + a_3 a_1 + a_3 a_2
                    = \sum_{i=1}^{3} a_i^2 + \sum_{i \ne j} a_i a_j.

Hence

V(\bar{y}) = (1/n^2) \sum_{i=1}^{n} E(y_i - \bar{Y})^2 + (1/n^2) \sum_{i \ne j} E[(y_i - \bar{Y})(y_j - \bar{Y})]
           = (1/n^2) \sum_{i=1}^{n} V(y_i) + (1/n^2) \sum_{i \ne j} Cov(y_i, y_j).     (2.1)
Consider

V(y_i) = E(y_i - \bar{Y})^2 = \sum_{i=1}^{N} (Y_i - \bar{Y})^2 Pr(y_i = Y_i) = (1/N) \sum_{i=1}^{N} (Y_i - \bar{Y})^2,

since y_i can take any one of the values Y_1, ..., Y_N, each with probability 1/N. Thus

V(y_i) = \sigma^2 = ((N-1)/N) S^2, since S^2 = (1/(N-1)) \sum_{i=1}^{N} (Y_i - \bar{Y})^2,     (2.2)
and

Cov(y_i, y_j) = E[(y_i - \bar{Y})(y_j - \bar{Y})] = \sum_{i \ne j} (Y_i - \bar{Y})(Y_j - \bar{Y}) Pr(y_i = Y_i, y_j = Y_j).

In this case y_j can take any one of the values Y_1, ..., Y_N with probability 1/N irrespective of
the value taken by y_i, because the composition of the population remains the same
throughout the sampling process due to sampling with replacement. In other words, for
i \ne j, y_i and y_j are independent, so that

Pr(y_i = Y_i, y_j = Y_j) = Pr(y_i = Y_i) Pr(y_j = Y_j) = (1/N)(1/N) = 1/N^2.

Hence,

Cov(y_i, y_j) = E(y_i - \bar{Y}) E(y_j - \bar{Y}) = [(1/N) \sum_{i=1}^{N} (Y_i - \bar{Y})] [(1/N) \sum_{j=1}^{N} (Y_j - \bar{Y})] = 0.     (2.3)

Substituting the values of equations (2.2) and (2.3) in equation (2.1), we get

V(\bar{y}) = (1/n^2) \sum_{i=1}^{n} ((N-1)/N) S^2 = ((N-1)/(nN)) S^2 = \sigma^2/n.
Corollary: \hat{Y} = N\bar{y} is an unbiased estimate of the population total Y, with variance

V(\hat{Y}) = N^2 \sigma^2/n = N(N-1) S^2/n.

Proof: By definition,

E(\hat{Y}) = E(N\bar{y}) = N E(\bar{y}) = N \bar{Y} = N (1/N) \sum_{i=1}^{N} Y_i = Y,

and

V(\hat{Y}) = V(N\bar{y}) = N^2 V(\bar{y}) = N^2 \sigma^2/n = N(N-1) S^2/n.

Remarks:
i) The standard error (SE) of \bar{y} is SE(\bar{y}) = \sqrt{V(\bar{y})} = S \sqrt{(N-1)/(nN)} = \sigma/\sqrt{n}.
ii) The standard error of \hat{Y} is SE(\hat{Y}) = \sqrt{V(\hat{Y})} = S \sqrt{N(N-1)/n} = N \sigma/\sqrt{n}.
Theorem: In srswr, E(s^2) = \sigma^2 = ((N-1)/N) S^2.

Proof: Since s^2 = (1/(n-1)) [\sum_{i=1}^{n} y_i^2 - n \bar{y}^2], we have

E(s^2) = (1/(n-1)) [\sum_{i=1}^{n} E(y_i^2) - n E(\bar{y}^2)].

Now V(y_i) = E(y_i^2) - \bar{Y}^2, so that E(y_i^2) = \sigma^2 + \bar{Y}^2, since V(y_i) = ((N-1)/N) S^2 = \sigma^2,

and V(\bar{y}) = E(\bar{y}^2) - \bar{Y}^2, so that E(\bar{y}^2) = \sigma^2/n + \bar{Y}^2, since V(\bar{y}) = ((N-1)/(nN)) S^2 = \sigma^2/n for srswr.

Therefore,

E(s^2) = (1/(n-1)) [n(\sigma^2 + \bar{Y}^2) - n(\sigma^2/n + \bar{Y}^2)] = (1/(n-1)) (n-1) \sigma^2 = \sigma^2 = ((N-1)/N) S^2.
Example: In a population with N = 5, the values of Y_i are 8, 3, 11, 4 and 7. Enumerate all
srswr samples of size n = 2 and verify that
i) E(\bar{y}) = \bar{Y},
ii) E(N\bar{y}) = Y,
iii) V(\bar{y}) = ((N-1)/(nN)) S^2 = \sigma^2/n, and
iv) E(s^2) = ((N-1)/N) S^2 = \sigma^2.
Solution:
a) We know that

\bar{Y} = (1/N) \sum Y_i = 6.6,  \sigma^2 = (1/N) \sum Y_i^2 - \bar{Y}^2 = 8.24,  and
S^2 = (1/(N-1)) [\sum Y_i^2 - N \bar{Y}^2] = 10.3.

b) Form a table for the calculations as below:

Samples   \bar{y}_i  \bar{y}_i^2  N\bar{y}_i  s_i^2  |  Samples   \bar{y}_i  \bar{y}_i^2  N\bar{y}_i  s_i^2
(8, 8)       8.0       64.00        40.0        0.0  |  (11, 4)      7.5       56.25        37.5       24.5
(8, 3)       5.5       30.25        27.5       12.5  |  (11, 7)      9.0       81.00        45.0        8.0
(8, 11)      9.5       90.25        47.5        4.5  |  (4, 8)       6.0       36.00        30.0        8.0
(8, 4)       6.0       36.00        30.0        8.0  |  (4, 3)       3.5       12.25        17.5        0.5
(8, 7)       7.5       56.25        37.5        0.5  |  (4, 11)      7.5       56.25        37.5       24.5
(3, 8)       5.5       30.25        27.5       12.5  |  (4, 4)       4.0       16.00        20.0        0.0
(3, 3)       3.0        9.00        15.0        0.0  |  (4, 7)       5.5       30.25        27.5        4.5
(3, 11)      7.0       49.00        35.0       32.0  |  (7, 8)       7.5       56.25        37.5        0.5
(3, 4)       3.5       12.25        17.5        0.5  |  (7, 3)       5.0       25.00        25.0        8.0
(3, 7)       5.0       25.00        25.0        8.0  |  (7, 11)      9.0       81.00        45.0        8.0
(11, 8)      9.5       90.25        47.5        4.5  |  (7, 4)       5.5       30.25        27.5        4.5
(11, 3)      7.0       49.00        35.0       32.0  |  (7, 7)       7.0       49.00        35.0        0.0
(11, 11)    11.0      121.00        55.0        0.0  |
i) E(\bar{y}) = (1/25) \sum \bar{y}_i = 165/25 = 6.6 = \bar{Y}, where 25 is the number of
possible samples.

ii) E(N\bar{y}) = (1/25) \sum N\bar{y}_i = 33 = N E(\bar{y}) = Y.

iii) V(\bar{y}) = (1/25) \sum \bar{y}_i^2 - \bar{Y}^2 = 4.12. Now

((N-1)/(nN)) S^2 = (4/10)(10.3) = 4.12  and  \sigma^2/n = 8.24/2 = 4.12,

therefore V(\bar{y}) = ((N-1)/(nN)) S^2 = \sigma^2/n = 4.12.

iv) E(s^2) = (1/25) \sum s_i^2 = 206/25 = 8.24,     (1a)

and ((N-1)/N) S^2 = (4/5)(10.3) = 8.24.     (2a)

In view of equations (1a) and (2a), we get E(s^2) = ((N-1)/N) S^2 = \sigma^2 = 8.24.
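The enumeration above can be checked mechanically. The short Python sketch below is my addition (not part of the notes); it lists all 5^2 = 25 with-replacement samples and reproduces the values 6.6, 4.12 and 8.24.

```python
from itertools import product

# Worked example: population (8, 3, 11, 4, 7), all 25 srswr samples of size 2.
Y = [8, 3, 11, 4, 7]
N, n = len(Y), 2
Ybar = sum(Y) / N                                    # 6.6
sigma2 = sum((y - Ybar) ** 2 for y in Y) / N         # 8.24

samples = list(product(Y, repeat=n))                 # 25 ordered samples
means = [sum(s) / n for s in samples]
s2s = [sum((y - sum(s) / n) ** 2 for y in s) / (n - 1) for s in samples]

E_ybar = sum(means) / len(means)                     # 6.6  = Ybar
V_ybar = sum((m - Ybar) ** 2 for m in means) / len(means)   # 4.12 = sigma2 / n
E_s2 = sum(s2s) / len(s2s)                           # 8.24 = sigma2
print(E_ybar, V_ybar, E_s2)
```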
In this case y_j can take any one of the values except Y_i, the value which is known to have
already been assumed by y_i, each with equal probability 1/(N-1), so that for i \ne j,

Pr(y_i = Y_i, y_j = Y_j) = Pr(y_i = Y_i) Pr(y_j = Y_j | y_i = Y_i) = (1/N) (1/(N-1)).

Hence,

Cov(y_i, y_j) = (1/(N(N-1))) \sum_{i \ne j} (Y_i - \bar{Y})(Y_j - \bar{Y})

= (1/(N(N-1))) [\sum_{i=1}^{N} (Y_i - \bar{Y}) \sum_{j=1}^{N} (Y_j - \bar{Y}) - \sum_{i=1}^{N} (Y_i - \bar{Y})^2]

= - (1/(N(N-1))) \sum_{i=1}^{N} (Y_i - \bar{Y})^2 = - S^2/N.     (2.6)
Substituting the values of equations (2.5) and (2.6) in equation (2.4), we get

V(\bar{y}) = (1/n^2) [n ((N-1)/N) S^2] + (1/n^2) [n(n-1) (-S^2/N)]
           = (S^2/(nN)) [(N-1) - (n-1)] = ((N-n)/(nN)) S^2 = (1 - f) S^2/n,

where f = n/N is called the sampling fraction and the factor (1 - f) = (N-n)/N is called the
finite population correction (fpc). If the population size N is very large, or if n is small
compared with N, then f = n/N ≈ 0 and consequently fpc ≈ 1.
Alternative expression:

V(\bar{y}) = ((N-n)/(nN)) S^2 = (1/n - 1/N) S^2.
Corollary: \hat{Y} = N\bar{y} is an unbiased estimate of the population total Y, with
variance V(\hat{Y}) = N^2 (1 - f) S^2/n.

Proof: By definition,

E(\hat{Y}) = E(N\bar{y}) = N E(\bar{y}) = N \bar{Y} = N (1/N) \sum_{i=1}^{N} Y_i = Y,

and

V(\hat{Y}) = V(N\bar{y}) = N^2 V(\bar{y}) = N^2 ((N-n)/(nN)) S^2 = N^2 (1 - f) S^2/n.
Remarks:
i) The standard error of \bar{y} is SE(\bar{y}) = S \sqrt{(N-n)/(nN)} = S \sqrt{(1-f)/n} = S \sqrt{1/n - 1/N}.
ii) The standard error of \hat{Y} is SE(\hat{Y}) = N S \sqrt{(N-n)/(nN)} = N S \sqrt{(1-f)/n} = N S \sqrt{1/n - 1/N}.

For a large population, fpc = (1 - f) ≈ 1, and then
i) V(\bar{y}) ≈ S^2/n and SE(\bar{y}) ≈ S/\sqrt{n}, and
ii) V(\hat{Y}) ≈ N^2 S^2/n and SE(\hat{Y}) ≈ N S/\sqrt{n}.
Theorem: In srswor, E(s^2) = S^2, i.e. the sample mean square is an unbiased estimate of the
population mean square.

Proof: As before, E(s^2) = (1/(n-1)) [\sum_{i=1}^{n} E(y_i^2) - n E(\bar{y}^2)]. Now

V(y_i) = E(y_i^2) - \bar{Y}^2, so that E(y_i^2) = ((N-1)/N) S^2 + \bar{Y}^2, since V(y_i) = ((N-1)/N) S^2,

and V(\bar{y}) = E(\bar{y}^2) - \bar{Y}^2, so that E(\bar{y}^2) = ((N-n)/(nN)) S^2 + \bar{Y}^2,
since V(\bar{y}) = ((N-n)/(nN)) S^2 for srswor.

Therefore,

E(s^2) = (1/(n-1)) [n ((N-1)/N) S^2 + n\bar{Y}^2 - ((N-n)/N) S^2 - n\bar{Y}^2]
       = (S^2/((n-1)N)) [n(N-1) - (N-n)] = (S^2/((n-1)N)) (n-1)N = S^2.
Example: A random sample of n = 2 households was drawn (srswor) from a small colony of N = 5
households having monthly incomes (in thousand rupees) as follows:

Households:                  1    2    3    4    5
Income (in thousand rupees): 8   6.5  7.5   7    6

Enumerate all possible samples and verify that
i) E(\bar{y}) = \bar{Y},
ii) E(N\bar{y}) = Y,
iii) V(\bar{y}) = ((N-n)/(nN)) S^2, and
iv) E(s^2) = S^2.
Solution:
a) We know that

\bar{Y} = (1/N) \sum Y_i = 7,  \sigma^2 = (1/N) \sum Y_i^2 - \bar{Y}^2 = 0.5,  and
S^2 = (1/(N-1)) [\sum Y_i^2 - N \bar{Y}^2] = 0.625.

b) Form a table for the calculations as below:

Samples     \bar{y}_i  \bar{y}_i^2  N\bar{y}_i  s_i^2  |  Samples    \bar{y}_i  \bar{y}_i^2  N\bar{y}_i  s_i^2
(8, 6.5)      7.25       52.563       36.25     1.125  |  (8, 7.5)     7.75       60.063       38.75     0.125
(8, 7)        7.50       56.250       37.50     0.500  |  (8, 6)       7.00       49.000       35.00     2.000
(6.5, 7.5)    7.00       49.000       35.00     0.500  |  (6.5, 7)     6.75       45.563       33.75     0.125
(6.5, 6)      6.25       39.063       31.25     0.125  |  (7.5, 7)     7.25       52.563       36.25     0.125
(7.5, 6)      6.75       45.563       33.75     1.125  |  (7, 6)       6.50       42.250       32.50     0.500
i) E(\bar{y}) = (1/10) \sum \bar{y}_i = 7 = \bar{Y}, where 10 is the number of possible samples.

ii) E(N\bar{y}) = (1/10) \sum N\bar{y}_i = 35 = N E(\bar{y}) = Y.

iii) V(\bar{y}) = (1/10) \sum (\bar{y}_i - \bar{Y})^2 = (1/10) \sum \bar{y}_i^2 - \bar{Y}^2 = 0.1875, and
((N-n)/(nN)) S^2 = (3/10)(0.625) = 0.1875. Therefore,

V(\bar{y}) = ((N-n)/(nN)) S^2 = 0.1875.

iv) E(s^2) = (1/10) \sum s_i^2 = 0.625 = S^2.
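As before, the enumeration can be verified with a few lines of Python (my addition, not part of the notes), using all C(5, 2) = 10 without-replacement samples.

```python
from itertools import combinations

# Worked example: incomes (8, 6.5, 7.5, 7, 6), all 10 srswor samples of size 2.
Y = [8, 6.5, 7.5, 7, 6]
N, n = len(Y), 2
Ybar = sum(Y) / N                                    # 7
S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)       # 0.625

samples = list(combinations(Y, n))
means = [sum(s) / n for s in samples]
s2s = [sum((y - sum(s) / n) ** 2 for y in s) / (n - 1) for s in samples]

E_ybar = sum(means) / len(means)                     # 7      = Ybar
V_ybar = sum((m - Ybar) ** 2 for m in means) / len(means)   # 0.1875 = (N-n)S2/(nN)
E_s2 = sum(s2s) / len(s2s)                           # 0.625  = S2
print(E_ybar, V_ybar, E_s2)
```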
Example: In a population with N = 5, the values are 2, 4, 6, 8 and 10. For a simple random
sample of size n = 3, show that V(\bar{y})_{srswor} < V(\bar{y})_{srswr}.

Solution: We know that

V(\bar{y})_{srswor} = ((N-n)/(nN)) S^2  and  V(\bar{y})_{srswr} = ((N-1)/(nN)) S^2,

where S^2 = (1/(N-1)) \sum_{i=1}^{N} (Y_i - \bar{Y})^2 = 10 and \bar{Y} = (1/N) \sum_{i=1}^{N} Y_i = 6.

Thus,

V(\bar{y})_{srswor} = (2/15)(10) = 4/3,  V(\bar{y})_{srswr} = (4/15)(10) = 8/3,  and therefore

V(\bar{y})_{srswor} < V(\bar{y})_{srswr}.
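A direct computation of the two variance formulas (my sketch, not from the notes) confirms the comparison.

```python
# V(ybar)_wor = (N-n)/(nN) * S^2  versus  V(ybar)_wr = (N-1)/(nN) * S^2
# for the population 2, 4, 6, 8, 10 with n = 3.
Y = [2, 4, 6, 8, 10]
N, n = len(Y), 3
Ybar = sum(Y) / N                               # 6
S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)  # 10
V_wor = (N - n) / (n * N) * S2                  # 4/3
V_wr = (N - 1) / (n * N) * S2                   # 8/3
print(V_wor, V_wr, V_wor < V_wr)
```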
Simple random sampling for attributes: Suppose each unit of the population either possesses a
character C or does not. Assign to the i-th member of the population the value Y_i = 1 if it
possesses C and Y_i = 0 otherwise, and let A = \sum_{i=1}^{N} Y_i = NP be the number of units
possessing C, so that P = A/N is the population proportion and Q = 1 - P. Then (noting that
Y_i^2 = Y_i, since Y_i is 0 or 1):

Population mean: \bar{Y} = (1/N) \sum_{i=1}^{N} Y_i = NP/N = P.

Population variance: \sigma^2 = (1/N) \sum_{i=1}^{N} (Y_i - P)^2 = (1/N) \sum_{i=1}^{N} Y_i^2 - P^2 = P - P^2 = PQ.

Population mean square: S^2 = (1/(N-1)) \sum_{i=1}^{N} (Y_i - P)^2 = (1/(N-1)) [\sum_{i=1}^{N} Y_i^2 - NP^2] = (NP - NP^2)/(N-1) = NPQ/(N-1).
Similarly, assign to the i-th member of the sample the value y_i, which equals 1 if this
member possesses the character C and 0 otherwise. Then:

Sample total: \sum_{i=1}^{n} y_i = np = a,  and sample mean: \bar{y} = (1/n) \sum_{i=1}^{n} y_i = a/n = p.

Sample mean square: s^2 = (1/(n-1)) \sum_{i=1}^{n} (y_i - p)^2 = (1/(n-1)) [\sum_{i=1}^{n} y_i^2 - np^2] = (np - np^2)/(n-1) = npq/(n-1).
In srswr, E(p) = E(\bar{y}) = P and V(p) = V(\bar{y}) = \sigma^2/n = PQ/n, so that

V(\hat{A}) = V(\hat{Y}) = N^2 V(\bar{y}) = N^2 PQ/n.

Theorem: \hat{V}(p) = v(p) = pq/(n-1) is an unbiased estimate of V(p) = PQ/n.

Proof: E[\hat{V}(p)] = E[pq/(n-1)] = (1/n) E[npq/(n-1)] = (1/n) E(s^2) = PQ/n,

since in srswr E(s^2) = \sigma^2 = PQ and s^2 = npq/(n-1).

Corollary: \hat{V}(\hat{A}) = \hat{V}(Np) = N^2 \hat{V}(p) = N^2 pq/(n-1) is an unbiased estimate
of V(\hat{A}) = N^2 PQ/n.

Remarks:
i) The standard error (SE) of p is SE(p) = \sqrt{PQ/n}.
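The unbiasedness of pq/(n-1) can be confirmed by complete enumeration of a tiny 0/1 population (my sketch, not from the notes; the population and sample size are hypothetical).

```python
from itertools import product

# Complete enumeration (srswr) check that E[pq/(n-1)] = PQ/n.
# Population: 5 units, 2 of which possess the character C.
Y = [1, 1, 0, 0, 0]
N, n = len(Y), 3
P = sum(Y) / N
Q = 1 - P

vals = []
for s in product(Y, repeat=n):      # all 125 ordered srswr samples
    p = sum(s) / n
    vals.append(p * (1 - p) / (n - 1))
E_v = sum(vals) / len(vals)
print(E_v, P * Q / n)               # both equal PQ/n = 0.08
```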
For srswor, the results are:

i) E(p) = E(\bar{y}) = \bar{Y} = P. This shows that the sample proportion p is an unbiased
estimate of the population proportion P, and

V(p) = V(\bar{y}) = ((N-n)/(nN)) S^2 = ((N-n)/(nN)) (NPQ/(N-1)) = ((N-n)/(N-1)) (PQ/n).

ii) E(\hat{A}) = E(Np) = N E(p) = NP = A, which means that Np is an unbiased estimate of NP, and

V(\hat{A}) = V(\hat{Y}) = N^2 V(\bar{y}) = N^2 ((N-n)/(nN)) S^2 = N^2 ((N-n)/(N-1)) (PQ/n).

Theorem: \hat{V}(p) = v(p) = ((N-n)/N) (pq/(n-1)) is an unbiased estimate of
V(p) = ((N-n)/(N-1)) (PQ/n).

Proof: E[\hat{V}(p)] = E[((N-n)/N) (pq/(n-1))] = ((N-n)/(nN)) E[npq/(n-1)] = ((N-n)/(nN)) E(s^2)
= ((N-n)/(nN)) (NPQ/(N-1)) = ((N-n)/(N-1)) (PQ/n),

since in srswor E(s^2) = S^2 = NPQ/(N-1) and s^2 = npq/(n-1).

Corollary: \hat{V}(\hat{A}) = \hat{V}(Np) = N^2 \hat{V}(p) = N(N-n) pq/(n-1) is an unbiased
estimate of V(\hat{A}) = N^2 ((N-n)/(N-1)) (PQ/n).

Remarks:
The standard error (SE) of p is SE(p) = \sqrt{((N-n)/(N-1)) (PQ/n)}, and the standard error
of \hat{A} is SE(\hat{A}) = N \sqrt{((N-n)/(N-1)) (PQ/n)}.
Example: A list of 3000 voters of a ward in a city was examined for measuring the
accuracy of the ages of individuals. A random sample of 300 names was taken, which revealed
that 51 citizens were shown with wrong ages. Estimate the total number of voters having a
wrong description of age in the list, and estimate the standard error of this estimate.

Solution: Given N = 3000, n = 300, a = 51, so p = a/n = 0.17 and \hat{A} = Np = 510.

i) If srswr is considered, the estimate of the standard error is

Est[SE(\hat{A})] = N \sqrt{pq/(n-1)} = 65.1696 ≈ 65.

ii) If srswor is considered, the estimate of the standard error is

Est[SE(\hat{A})] = \sqrt{N(N-n) pq/(n-1)} = 61.8246 ≈ 62.
Confidence limits

1. Confidence limits for the population mean: For large samples, (\bar{y} - \bar{Y})/SE(\bar{y})
is approximately a standard normal variate, so

Pr[-Z_{\alpha/2} \le (\bar{y} - \bar{Y})/SE(\bar{y}) \le Z_{\alpha/2}] = 1 - \alpha

or Pr[-Z_{\alpha/2} SE(\bar{y}) \le \bar{y} - \bar{Y} \le Z_{\alpha/2} SE(\bar{y})] = 1 - \alpha

or Pr[\bar{y} - Z_{\alpha/2} SE(\bar{y}) \le \bar{Y} \le \bar{y} + Z_{\alpha/2} SE(\bar{y})] = 1 - \alpha.

Thus, with probability (1 - \alpha), the interval \bar{y} \mp Z_{\alpha/2} SE(\bar{y}), i.e.
\bar{y} \mp Z_{\alpha/2} \sigma/\sqrt{n} for large N, will include \bar{Y}.

2. Confidence limits for the population total: On the same lines,

Pr[N\bar{y} - Z_{\alpha/2} SE(\hat{Y}) \le Y \le N\bar{y} + Z_{\alpha/2} SE(\hat{Y})] = 1 - \alpha.

If the sample size is small and the population variance is unknown, the interval is defined as

\bar{y} \mp t_{\alpha/2, n-1} S \sqrt{(N-n)/(nN)}, i.e. \bar{y} \mp t_{\alpha/2, n-1} S/\sqrt{n} when
the population size is very large.
nN n
1 n 2
S 2 is unknown, it can be replaced by its estimator s 2 i
n 1 i 1
y n y 2
44.25 .
Therefore,
6.652
Upper confidence limit 7.125 2.131 10.668853 11 , and
16
6.652
Lower confidence limit 7.125 2.131 3.58 4 .
16
Example: In a mess it was observed that leftovers cost a lot. A survey was conducted to find
the optimum quantity for each item. A random sample of 10 inmates showed that they took
4, 5, 2, 3, 1, 7, 2, 3, 4, 4 slices of bread at breakfast. If 120 breakfasts are to be served
every day, estimate the number of slices required every day. Also obtain a 95% confidence
interval for it.

Solution: Given N = 120, n = 10, and \bar{y} = (1/n) \sum y_i = 3.5, the estimate of the total
is \hat{Y} = N\bar{y} = 420.

Since the sample size is small and the population variance is unknown, the confidence limits are

N\bar{y} \mp t_{\alpha/2, n-1} N s \sqrt{(1-f)/n},  where  s^2 = (1/(n-1)) [\sum y_i^2 - n \bar{y}^2] = 2.9444 (s = 1.716).

Hence, with t_{0.025, 9} = 2.262,

Upper confidence limit = 420 + 2.262 × 120 × 1.716 × \sqrt{(1 - 10/120)/10} = 561.025 ≈ 561, and

Lower confidence limit = 420 - 141.025 = 278.975 ≈ 279.
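The confidence limits can be reproduced as follows (my sketch, not from the notes; the t value 2.262 is taken as given in the example).

```python
import math

# Bread-slices example: N = 120, n = 10, t(0.025, 9) = 2.262.
slices = [4, 5, 2, 3, 1, 7, 2, 3, 4, 4]
N, n, t = 120, len(slices), 2.262
ybar = sum(slices) / n                                  # 3.5
s2 = sum((y - ybar) ** 2 for y in slices) / (n - 1)     # ~2.944
total = N * ybar                                        # 420

half = t * N * math.sqrt(s2) * math.sqrt((1 - n / N) / n)   # ~141.0
print(total, round(total - half), round(total + half))      # 420.0 279 561
```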
Example: 100 villages were selected under srswor from a list of 1521 villages. It was
found that 19 of the selected villages were illegally occupied by some landlords. Estimate the
number of such villages among the total 1521 villages, and give a 95% confidence interval.

Solution: Given N = 1521, n = 100, and a = 19, so p = 0.19. The estimate of the number of
villages illegally occupied by landlords in the population of villages is

\hat{A} = Np = 288.99 ≈ 289.

Since the sample size exceeds 30 and the population proportion is unknown, the confidence
limits are Np \mp Z_{\alpha/2} SE(\hat{A}), where SE(\hat{A}) is unknown and is replaced by its
estimate

\sqrt{N(N-n) pq/(n-1)} = 57.9647.

Thus,

Upper confidence limit = 288.99 + 1.96 × 57.9647 = 402.60 ≈ 403, and

Lower confidence limit = 288.99 - 1.96 × 57.9647 = 175.38 ≈ 175.
Example: A simple random sample of 30 households was drawn without replacement from
a city area containing 14848 households. The numbers of persons per household in the sample
were as follows: 5, 6, 3, 3, 2, 3, 3, 3, 4, 4, 3, 2, 7, 4, 3, 5, 4, 4, 3, 3, 4, 3, 3, 1, 2, 4, 3, 4, 3 and
4. Estimate the average and the total number of people in the area, and compute the probability
that these estimates are within 10% of the true value.

Solution: Given N = 14848 and n = 30, we have \bar{y} = 105/30 = 3.5, so the estimate of the
population total is \hat{Y} = N\bar{y} = 14848 × 3.5 = 51968. Assuming that the population
values are normally distributed, N\bar{y} ~ N(Y, N S \sqrt{(1-f)/n}).
Determination of sample size

Let d be the permissible margin of error in the estimate \bar{y} of \bar{Y}, i.e. we require

Pr[|\bar{y} - \bar{Y}| \ge d] = \alpha,     (2.9)

where \alpha is small; it is the risk we are willing to bear if the actual difference is
greater than d. \alpha is called the level of significance and (1 - \alpha) the level of
confidence or confidence coefficient.

As the population is normally distributed, the sample mean will also follow the normal
distribution, i.e. \bar{y} ~ N[\bar{Y}, V(\bar{y})], so Z = (\bar{y} - \bar{Y})/\sqrt{V(\bar{y})} ~ N(0, 1).

For the given value of \alpha we can find a value Z_{\alpha/2} of the standard normal variate
from the standard normal table such that

Pr[|\bar{y} - \bar{Y}|/\sqrt{V(\bar{y})} \ge Z_{\alpha/2}] = \alpha  or  Pr[|\bar{y} - \bar{Y}| \ge \sqrt{V(\bar{y})} Z_{\alpha/2}] = \alpha.     (2.10)

Comparing equations (2.9) and (2.10), we get

d = Z_{\alpha/2} \sqrt{V(\bar{y})}, so that d^2 = Z_{\alpha/2}^2 V(\bar{y}) = Z_{\alpha/2}^2 (1/n - 1/N) S^2.

Hence

d^2/(Z_{\alpha/2}^2 S^2) = 1/n - 1/N,  i.e.  1/n = 1/n_0 + 1/N, where n_0 = Z_{\alpha/2}^2 S^2/d^2,     (2.11)

and therefore

n = n_0/(1 + n_0/N).     (2.12)

If N is sufficiently large, then n ≈ n_0. For unknown S^2, some rough estimate of S^2
can be used in relations (2.11) and (2.12).
b) Specify the precision in terms of V(\bar{y}), i.e. we have to find the sample size n
such that V(\bar{y}) = V (given). As in the case of the margin of error,

d = Z_{\alpha/2} \sqrt{V(\bar{y})} gives V(\bar{y}) = d^2/Z_{\alpha/2}^2, so n_0 = Z_{\alpha/2}^2 S^2/d^2 = S^2/V(\bar{y}).

Therefore n_0 = S^2/V, and n can be obtained from relation (2.12).

If the margin of error d is instead specified on the estimate \hat{Y} = N\bar{y} of the total,
then n_0 = (N Z_{\alpha/2} S/d)^2; similarly, if V(\hat{Y}) = V is given, then
V(\bar{y}) = V/N^2 and n_0 = N^2 S^2/V. In either case, n follows from relation (2.12).
Example: For a population of size N = 430, we know roughly that \bar{Y} = 19 and S^2 = 85.6.
With srs, what should be the size of the sample to estimate \bar{Y} with a margin of error of
10% of \bar{Y}, apart from a 1 in 20 chance?

Solution: The margin of error in the estimate \bar{y} of \bar{Y} is d = 10% of \bar{Y} = 1.9,
so that

Pr[|\bar{y} - \bar{Y}| \ge 1.9] = 1/20 = 0.05, and n_0 = Z_{\alpha/2}^2 S^2/d^2 = (1.96)^2 (85.6)/(1.9)^2 = 91.0917.

Therefore,

n = n_0/(1 + n_0/N) = 75.168 ≈ 75.
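The arithmetic above is a two-line computation (my sketch, not from the notes).

```python
# Sample-size example: N = 430, S^2 = 85.6, d = 10% of Ybar = 1.9, Z = 1.96.
N, S2, d, Z = 430, 85.6, 1.9, 1.96
n0 = Z ** 2 * S2 / d ** 2        # ~91.09, relation (2.11)
n = n0 / (1 + n0 / N)            # ~75.17, relation (2.12) -> take n = 75
print(round(n0, 2), round(n, 2))
```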
Example: A population consists of 676 petition sheets. How large must the sample be if the total
number of signatures is to be estimated with a margin of error of 1000, apart from a 1 in 20
chance? Assume the population mean square to be 229.

Solution: Let Y be the number of signatures on all the sheets, and let \hat{Y} be the estimate
of Y. The margin of error is specified on the estimate \hat{Y} of Y as

|\hat{Y} - Y| \le 1000, so that Pr[|\hat{Y} - Y| \ge 1000] = 1/20 = 0.05.

We know that

n_0 = (N Z_{\alpha/2} S/d)^2 = (676 × 1.96/1000)^2 × 229 = 402.014, and hence

n = n_0/(1 + n_0/N) = 252.09 ≈ 252.
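A quick check of this example (my sketch, not from the notes):

```python
# Petition-sheets example: margin d = 1000 on the total, N = 676, S^2 = 229.
N, S2, d, Z = 676, 229, 1000, 1.96
n0 = (N * Z / d) ** 2 * S2       # ~402.01
n = n0 / (1 + n0 / N)            # ~252.1 -> take n = 252
print(round(n0, 2), round(n))
```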
Determination of sample size for proportions: Let d be the permissible margin of error in the
estimate p of P. Then

d = Z_{\alpha/2} \sqrt{V(p)}  or  d^2 = Z_{\alpha/2}^2 V(p) = Z_{\alpha/2}^2 ((N-n)/(N-1)) (PQ/n), as sampling is srswor.

Hence

(Z_{\alpha/2}^2 PQ/d^2) ((N-n)/(n(N-1))) = 1,  i.e.  n_0 (N-n)/(n(N-1)) = 1, where n_0 = Z_{\alpha/2}^2 PQ/d^2,     (2.16)

so that n(N-1) = n_0 (N-n), i.e. n(N - 1 + n_0) = n_0 N, or

n = N n_0/(N - 1 + n_0) = n_0/(1 + (n_0 - 1)/N).     (2.17)

If N is sufficiently large, then n ≈ n_0.

b) If the precision is specified in terms of V(p), i.e. V(p) = V (given), substituting
V(p) = V in relation (2.16) gives n_0 = PQ/V, and n can then be obtained from relation (2.17).
c) When the precision is given in terms of the coefficient of variation of p: let

CV(p) = \sqrt{V(p)}/P = e, so that e^2 = V(p)/P^2, or V(p) = e^2 P^2.     (2.18)

Substituting equation (2.18) in relation (2.16), we get

n_0 = PQ/(e^2 P^2) = Q/(e^2 P) = (1/e^2)(1/P - 1),

and hence n is given by relation (2.17).
Remarks:
i) To get n when the margin of error in the estimate \hat{A} = Np of the population total
A = NP is d_A, note that

|\hat{A} - A| = N|p - P| \le d_A, i.e. the corresponding margin on p is d = d_A/N, so d^2 = d_A^2/N^2.

Thus n_0 = (N Z_{\alpha/2}/d_A)^2 PQ, and n can be obtained from relation (2.17).

n = n_0/(1 + n_0/N) = 350.498 ≈ 351.
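Relations (2.16)-(2.17) can be packaged as a small helper. The function name and the numbers in the usage line are mine (hypothetical), not from the notes.

```python
# Sample size for estimating a proportion, per relations (2.16)-(2.17):
# n0 = Z^2 P Q / d^2  and  n = N n0 / (N - 1 + n0).
def sample_size_proportion(N, P, d, Z=1.96):
    Q = 1 - P
    n0 = Z ** 2 * P * Q / d ** 2
    return N * n0 / (N - 1 + n0)

# Hypothetical illustration: population of 2000 units, P roughly 0.5,
# margin of error 0.05 at 95% confidence.
n = sample_size_proportion(2000, 0.5, 0.05)
print(round(n))   # 322
```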
Exercise: In a study of the possible use of sampling to cut down the work of taking
inventory in a stock room, a count is made of the value of the articles on each of 36 shelves
in the room. The values to the nearest dollar are as follows:

29, 38, 42, 44, 45, 47, 51, 53, 53, 54, 56, 56, 56, 58, 58, 59, 60, 60, 60, 60, 61, 61, 61, 62, 64,
65, 65, 67, 67, 68, 69, 71, 74, 77, 82, 85.

The estimate of the total value made from a sample is to be correct within $200, apart from a 1 in
20 chance. An advisor suggests that a simple random sample of 12 shelves will meet the
requirements. Do you agree? (\sum Y_i = 2138, \sum Y_i^2 = 131682.)
Solution: It is given that \sum Y_i = 2138, \sum Y_i^2 = 131682, and N = 36, so

S^2 = (1/(N-1)) [\sum Y_i^2 - (\sum Y_i)^2/N] = (1/35) [131682 - (2138)^2/36] = 134.5,

and the requirement is |\hat{Y} - Y| \le 200 with Pr[|\hat{Y} - Y| \ge 200] = 1/20 = 0.05.

We know that

n_0 = (N Z_{\alpha/2}/d)^2 S^2 = (36 × 1.96/200)^2 × 134.5 = 16.7409, and therefore

n = n_0/(1 + n_0/N) = 11.43 ≈ 12.

So a simple random sample of 12 shelves meets the requirement, and we agree with the advisor.
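A quick verification of the advisor's claim (my sketch, not from the notes):

```python
# Stockroom exercise: N = 36, margin d = 200 on the total, Z = 1.96.
N, d, Z = 36, 200, 1.96
sumY, sumY2 = 2138, 131682
S2 = (sumY2 - sumY ** 2 / N) / (N - 1)     # ~134.53
n0 = (N * Z / d) ** 2 * S2                 # ~16.75
n = n0 / (1 + n0 / N)                      # ~11.43 -> 12 shelves suffice
print(round(S2, 1), round(n, 2))
```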
Exercises

1) The frequency distribution of 232 cities in a country, by population size in thousands, is
given as follows: (Singh and Chaudhary, 1986)

Population   Number of  |  Population   Number of  |  Population   Number of
size class   cities     |  size class   cities     |  size class   cities
  50-75         81      |   500-550        2       |  1800-1850       1
  75-100        45      |   550-600        3       |  1850-1950       0
 100-150        42      |   600-650        1       |  1950-2000       1
 150-200        14      |   650-700        1       |  2000-2050       0
 200-250         9      |   700-750        0       |  2050-2100       1
 250-300         5      |   850-900        1       |  2100-3600       0
 300-350         6      |   800-850        2       |  3600-3650       1
 350-400         5      |   850-900        1       |  3650-7850       0
 450-500         2      |   950-1800       0       |

Calculate the standard error of the estimator of the population mean when
i) a sample of 50 cities is selected with srswor, and
ii) the two largest cities are definitely included in the survey and only 48 cities are drawn
from the remaining 230 cities with srswor.
2) In an agricultural survey, a sample of 36 holdings was selected with srswor from a
population of 432 holdings in a village. Data relating to holding size were recorded as
follows: (Singh and Chaudhary, 1986)

S. No. of   Holding size  |  S. No. of   Holding size  |  S. No. of   Holding size
holding     in acres      |  holding     in acres      |  holding     in acres
    1          21.04      |     13           8.29      |     25          22.13
    2          12.59      |     14           7.27      |     26           1.68
    3          20.30      |     15           1.47      |     27          49.58
    4          16.16      |     16           1.12      |     28           1.68
    5          23.82      |     17          10.67      |     29           4.80
    6           1.79      |     18           5.94      |     30          12.72
    7          26.91      |     19           3.15      |     31           6.31
    8           7.41      |     20           4.84      |     32          14.18
    9           7.68      |     21           9.07      |     33          22.19
   10          66.55      |     22           3.69      |     34           5.50
   11         141.80      |     23          14.61      |     35          25.29
   12          28.12      |     24           1.10      |     36          20.99

Estimate, along with the standard errors, the proportions of holdings P_1, P_2, P_3 and P_4 in
the four holding-size classes 0-4.99, 5.00-9.99, 10.00-24.99, and 25 and above.
CHAPTER III
Stratified Random Sampling

The precision of an estimator of the population parameters (mean or total, etc.) depends on the
size of the sample and the variability or heterogeneity among the units of the population. If
the population is very heterogeneous and considerations of cost limit the size of the sample, it
may prove impossible to get a sufficiently precise estimate by taking a simple random
sample from the entire population. One possible way to estimate the population mean
or total with greater precision is to divide the population into several non-overlapping groups
(sub-populations or classes), each of which is more homogeneous than the entire population,
and to draw a random sample of predetermined size from each of the groups. The groups into
which the population is divided are called strata (each group is called a stratum), and the
whole procedure of dividing the population into strata and then drawing a random sample from
each stratum is called stratified random sampling.
For example, to estimate the average income per household, it may be appropriate to group
the households into two or more groups (strata) according to the rent paid by the households.
The households in any stratum so formed are likely to be more homogeneous with respect to
income than the whole population. Thus, the estimated income per household
based on a stratified sample is likely to be more precise than that based on a simple random
sample of the same size drawn from the whole population.
Notations

Let the population, consisting of N units, first be divided into k strata (sub-populations) of
sizes N_1, N_2, ..., N_k. These sub-populations are non-overlapping, with
N_1 + N_2 + ... + N_k = N. A sample is drawn (by the method of srs) from each stratum
(group or sub-population) independently, the sample size within the i-th stratum being n_i
(i = 1, 2, ..., k), such that n_1 + n_2 + ... + n_k = n. The following symbols refer to stratum i:

N_i, total number of units.

n_i, number of units in the sample.

f_i = n_i/N_i, sampling fraction in the stratum.

W_i = N_i/N, stratum weight.

y_{ij}, value of the characteristic under study for the j-th unit in the i-th stratum,
j = 1, 2, ..., N_i.

\bar{Y}_i = (1/N_i) \sum_{j=1}^{N_i} y_{ij}, mean based on N_i units (stratum mean).

\bar{y}_i = (1/n_i) \sum_{j=1}^{n_i} y_{ij}, mean based on n_i units (sample mean).

\sigma_i^2 = (1/N_i) \sum_{j=1}^{N_i} (y_{ij} - \bar{Y}_i)^2, variance based on N_i units (stratum variance).

S_i^2 = (1/(N_i - 1)) \sum_{j=1}^{N_i} (y_{ij} - \bar{Y}_i)^2, mean square based on N_i units (stratum mean square).

s_i^2 = (1/(n_i - 1)) \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2, sample mean square based on n_i units.

Y = \sum_{i=1}^{k} \sum_{j=1}^{N_i} y_{ij} = \sum_{i=1}^{k} N_i \bar{Y}_i, population total.

\bar{Y} = Y/N = (1/N) \sum_{i=1}^{k} \sum_{j=1}^{N_i} y_{ij} = (1/N) \sum_{i=1}^{k} N_i \bar{Y}_i = \sum_{i=1}^{k} W_i \bar{Y}_i, overall population mean.
Theorem: For stratified random sampling, wor, if in every stratum the sample estimate \bar{y}_i
is unbiased for \bar{Y}_i, and samples are drawn independently in different strata, then

\bar{y}_{st} = \sum_{i=1}^{k} W_i \bar{y}_i

is an unbiased estimate of the overall population mean \bar{Y}, and its variance is

V(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 (1/n_i - 1/N_i) S_i^2.

Proof: Since sampling within each stratum is simple random sampling, E(\bar{y}_i) = \bar{Y}_i, and it
follows that

E(\bar{y}_{st}) = E[\sum_{i=1}^{k} W_i \bar{y}_i] = \sum_{i=1}^{k} W_i E(\bar{y}_i) = \sum_{i=1}^{k} W_i \bar{Y}_i = \bar{Y}.

To obtain the variance, we have

V(\bar{y}_{st}) = E[\bar{y}_{st} - E(\bar{y}_{st})]^2 = E[\sum_{i=1}^{k} W_i {\bar{y}_i - E(\bar{y}_i)}]^2

= E[\sum_{i=1}^{k} W_i^2 {\bar{y}_i - E(\bar{y}_i)}^2] + E[\sum_{i \ne i'} W_i W_{i'} {\bar{y}_i - E(\bar{y}_i)}{\bar{y}_{i'} - E(\bar{y}_{i'})}]

= \sum_{i=1}^{k} W_i^2 V(\bar{y}_i) + \sum_{i \ne i'} W_i W_{i'} Cov(\bar{y}_i, \bar{y}_{i'}).

Since samples are drawn independently in different strata, all covariance terms vanish, so

V(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 V(\bar{y}_i) = \sum_{i=1}^{k} W_i^2 (1/n_i - 1/N_i) S_i^2, as srswor is used within each stratum.
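The variance formula can be verified exactly by enumerating every stratified srswor sample from a tiny two-stratum population. The population values below are hypothetical (my sketch, not from the notes).

```python
from itertools import combinations, product

# Check V(y_st) = sum W_i^2 (1/n_i - 1/N_i) S_i^2 by complete enumeration.
strata = [[2, 4, 6], [10, 14]]          # N1 = 3, N2 = 2 (hypothetical values)
n_i = [2, 1]                            # sample size drawn from each stratum
N = sum(len(h) for h in strata)
W = [len(h) / N for h in strata]
Ybar = sum(sum(h) for h in strata) / N

# The formula.
V_formula = 0.0
for h, ni, Wi in zip(strata, n_i, W):
    Ni = len(h)
    Yi = sum(h) / Ni
    Si2 = sum((y - Yi) ** 2 for y in h) / (Ni - 1)
    V_formula += Wi ** 2 * (1 / ni - 1 / Ni) * Si2

# All stratified samples (each equally likely, samples independent across strata).
means = []
for parts in product(*[list(combinations(h, ni)) for h, ni in zip(strata, n_i)]):
    means.append(sum(Wi * sum(s) / len(s) for Wi, s in zip(W, parts)))
E = sum(means) / len(means)
V_enum = sum((m - E) ** 2 for m in means) / len(means)
print(abs(E - Ybar) < 1e-9, abs(V_enum - V_formula) < 1e-9)   # True True
```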
Alternative expressions of V(\bar{y}_{st}):

i) V(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 (1/n_i - 1/N_i) S_i^2 = \sum_{i=1}^{k} ((N_i - n_i)/N_i) (N_i^2/N^2) S_i^2/n_i
   = (1/N^2) \sum_{i=1}^{k} N_i (N_i - n_i) S_i^2/n_i.

ii) V(\bar{y}_{st}) = (1/N^2) \sum_{i=1}^{k} N_i (N_i - n_i) S_i^2/n_i = (1/N^2) \sum_{i=1}^{k} N_i^2 (1 - f_i) S_i^2/n_i
    = \sum_{i=1}^{k} W_i^2 (1 - f_i) S_i^2/n_i,

where f_i = n_i/N_i is the sampling fraction in the i-th stratum.
Corollary: \hat{Y}_{st} = N \bar{y}_{st} is an unbiased estimate of the population total Y, with
variance

V(\hat{Y}_{st}) = \sum_{i=1}^{k} N_i^2 (1/n_i - 1/N_i) S_i^2.

Proof: By definition,

E(\hat{Y}_{st}) = N E(\bar{y}_{st}) = N \bar{Y} = Y, and

V(\hat{Y}_{st}) = N^2 V(\bar{y}_{st}) = N^2 \sum_{i=1}^{k} W_i^2 (1/n_i - 1/N_i) S_i^2 = \sum_{i=1}^{k} N_i^2 (1/n_i - 1/N_i) S_i^2
= \sum_{i=1}^{k} N_i (N_i - n_i) S_i^2/n_i = \sum_{i=1}^{k} N_i^2 (1 - f_i) S_i^2/n_i.
Remarks:

a) If the N_i are large compared with the n_i (i.e. the sampling fractions f_i = n_i/N_i are
negligible in all strata), then

i) V(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 S_i^2/n_i = (1/N^2) \sum_{i=1}^{k} N_i^2 S_i^2/n_i, and

ii) V(\hat{Y}_{st}) = \sum_{i=1}^{k} N_i^2 S_i^2/n_i.

b) If in every stratum n_i/n = N_i/N, i.e. n_i = n N_i/N = n W_i (proportional allocation),
the variance of \bar{y}_{st} reduces to

V(\bar{y}_{st}) = (1/n - 1/N) \sum_{i=1}^{k} W_i S_i^2 = ((N-n)/(nN)) \sum_{i=1}^{k} W_i S_i^2 = ((1 - f)/n) \sum_{i=1}^{k} W_i S_i^2.

c) If in every stratum n_i/n = N_i/N and the variances in all strata have the same value S^2,
the result reduces to

V(\bar{y}_{st}) = ((1 - f)/n) S^2 \sum_{i=1}^{k} W_i = ((1 - f)/n) S^2, since \sum_{i=1}^{k} W_i = 1.
Estimation of variance
If a simple random sample is taken within each stratum, then an unbiased estimator of $S_i^2$ is
$$s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2,$$
and an unbiased estimator of the variance of $\bar{y}_{st}$ is
$$\hat{V}(\bar{y}_{st}) = v(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 \left(\frac{1}{n_i} - \frac{1}{N_i}\right) s_i^2 = \frac{1}{N^2} \sum_{i=1}^{k} N_i (N_i - n_i) s_i^2 / n_i = \sum_{i=1}^{k} W_i^2 (1 - f_i) s_i^2 / n_i.$$
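As a quick numerical check of the estimator and variance-estimate formulas above, the computation can be sketched in a few lines of Python (the two strata below are illustrative data, not from the text):

```python
# Sketch of y_st and v(y_st) for stratified srswor:
# v(y_st) = sum_i W_i^2 (1/n_i - 1/N_i) s_i^2.

def stratified_estimate(strata):
    """strata: list of (N_i, sample_i) pairs; returns (y_st, v_y_st)."""
    N = sum(N_i for N_i, _ in strata)          # population size
    y_st, v = 0.0, 0.0
    for N_i, sample in strata:
        n_i = len(sample)
        W_i = N_i / N                          # stratum weight
        ybar_i = sum(sample) / n_i             # stratum sample mean
        s2_i = sum((y - ybar_i) ** 2 for y in sample) / (n_i - 1)
        y_st += W_i * ybar_i
        v += W_i ** 2 * (1 / n_i - 1 / N_i) * s2_i
    return y_st, v

# Illustrative strata (hypothetical data):
y_st, v = stratified_estimate([(30, [4, 6, 8]), (70, [10, 12, 14, 16])])
```

The per-stratum finite population corrections $1/n_i - 1/N_i$ make clear that a stratum sampled exhaustively ($n_i = N_i$) contributes nothing to the variance.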
Theorem: If stratified random sampling is with replacement, then $\bar{y}_{st} = \sum_{i=1}^{k} W_i \bar{y}_i$ is an unbiased estimate of the population mean $\bar{Y}$ and its variance is $V(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 \sigma_i^2 / n_i$.
Proof: Unbiasedness follows exactly as before, and since sampling is with replacement within each stratum,
$$V(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 V(\bar{y}_i) = \sum_{i=1}^{k} W_i^2 \sigma_i^2 / n_i = \sum_{i=1}^{k} W_i^2 \frac{N_i - 1}{N_i} S_i^2 / n_i \approx \sum_{i=1}^{k} W_i^2 S_i^2 / n_i \ \text{for large } N_i.$$
Corollary: $\hat{Y}_{st} = N \bar{y}_{st} = N \sum_{i=1}^{k} W_i \bar{y}_i = \sum_{i=1}^{k} N_i \bar{y}_i$ is an unbiased estimate of the population total $Y$ and its variance is
$$V(\hat{Y}_{st}) = V(N \bar{y}_{st}) = N^2 V(\bar{y}_{st}) = N^2 \sum_{i=1}^{k} W_i^2 S_i^2 / n_i = \sum_{i=1}^{k} N_i^2 S_i^2 / n_i.$$
Note: If the variances in all strata have the same value $S^2$ (say), then
$$V(\bar{y}_{st})_{prop} = \frac{1 - f}{n} S^2, \quad \text{as } \sum_{i=1}^{k} W_i = 1.$$
Alternative expressions of $V(\bar{y}_{st})_{prop}$
$$V(\bar{y}_{st})_{prop} = \left(\frac{1}{n} - \frac{1}{N}\right) \sum_{i=1}^{k} W_i S_i^2 = \frac{N - n}{nN} \sum_{i=1}^{k} \frac{N_i}{N} S_i^2 = \frac{N - n}{nN^2} \sum_{i=1}^{k} N_i S_i^2.$$
Optimum allocation: In this method of allocation the sample sizes $n_i$ in the respective strata are determined so as to minimize $V(\bar{y}_{st})$ for a specified cost of conducting the sample survey, or to minimize the cost for a specified value of $V(\bar{y}_{st})$. The simplest cost function has the form
$$\text{Cost} = C = c_0 + \sum_{i=1}^{k} c_i n_i,$$
where the overhead cost $c_0$ is constant and $c_i$ is the average cost of surveying one unit in the $i$-th stratum. Thus
$$C - c_0 = \sum_{i=1}^{k} c_i n_i = C' \ \text{(say)}, \tag{3.1}$$
and
$$V(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 \left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_i^2 = \sum_{i=1}^{k} \frac{W_i^2 S_i^2}{n_i} - \sum_{i=1}^{k} \frac{W_i^2 S_i^2}{N_i},$$
so that
$$V(\bar{y}_{st}) + \sum_{i=1}^{k} \frac{W_i^2 S_i^2}{N_i} = \sum_{i=1}^{k} \frac{W_i^2 S_i^2}{n_i} = V' \ \text{(say)}, \tag{3.2}$$
where $C'$ and $V'$ are functions of the $n_i$. Choosing the $n_i$ to minimize $V$ for fixed $C$, or $C$ for fixed $V$, are both equivalent to minimizing the product
$$V' C' = \left(\sum_{i=1}^{k} \frac{W_i^2 S_i^2}{n_i}\right) \left(\sum_{i=1}^{k} n_i c_i\right).$$
By the Cauchy–Schwarz inequality, with $a_i = W_i S_i / \sqrt{n_i}$ and $b_i = \sqrt{n_i c_i}$,
$$V' C' = \left(\sum_{i=1}^{k} a_i^2\right) \left(\sum_{i=1}^{k} b_i^2\right) \ge \left(\sum_{i=1}^{k} a_i b_i\right)^2 = \left(\sum_{i=1}^{k} W_i S_i \sqrt{c_i}\right)^2.$$
Thus no choice of the $n_i$ can make $V' C'$ smaller than $\left(\sum_{i=1}^{k} W_i S_i \sqrt{c_i}\right)^2$. This minimum value occurs when $b_i / a_i$ is constant, say $\lambda$:
$$\frac{b_i}{a_i} = \frac{n_i \sqrt{c_i}}{W_i S_i} = \lambda \quad \text{or} \quad n_i = \lambda \frac{W_i S_i}{\sqrt{c_i}}, \tag{3.3}$$
so that
$$n_i = n \frac{W_i S_i / \sqrt{c_i}}{\sum_{i=1}^{k} W_i S_i / \sqrt{c_i}} = n \frac{N_i S_i / \sqrt{c_i}}{\sum_{i=1}^{k} N_i S_i / \sqrt{c_i}}. \tag{3.4}$$
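Equation (3.4) translates directly into code. A minimal sketch, with illustrative stratum sizes, standard deviations and per-unit costs (not taken from the text):

```python
from math import sqrt

def optimum_allocation(n, N_i, S_i, c_i):
    """Equation (3.4): n_i proportional to N_i * S_i / sqrt(c_i)."""
    w = [N * S / sqrt(c) for N, S, c in zip(N_i, S_i, c_i)]
    total = sum(w)
    return [n * wi / total for wi in w]

# Illustrative numbers: stratum 2 is larger and more variable, but also
# costlier to survey per unit.
alloc = optimum_allocation(100, [300, 700], [2.0, 8.0], [1.0, 4.0])
```

A large or variable stratum pulls sample towards itself, while a high per-unit cost pushes sample away; (3.4) balances the two.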
Alternative method
To determine the $n_i$ so that $V(\bar{y}_{st})$ is minimum and the cost $C$ is fixed, consider the function
$$\phi = \sum_{i=1}^{k} \frac{W_i^2 S_i^2}{n_i} - \sum_{i=1}^{k} \frac{W_i^2 S_i^2}{N_i} + \lambda \left(c_0 + \sum_{i=1}^{k} c_i n_i - C\right),$$
where $\lambda$ is some unknown constant. Using the calculus method of Lagrange multipliers, we select the $n_i$ and the constant $\lambda$ to minimize $\phi$. Differentiating with respect to $n_i$ and equating to zero, we have
$$\frac{\partial \phi}{\partial n_i} = -\frac{W_i^2 S_i^2}{n_i^2} + \lambda c_i = 0 \quad \text{or} \quad n_i = \frac{1}{\sqrt{\lambda}} \frac{W_i S_i}{\sqrt{c_i}}, \tag{3.5}$$
i.e. $n_i \propto W_i S_i / \sqrt{c_i}$, or $n_i \propto N_i S_i / \sqrt{c_i}$.
Hence
$$n_i = n \frac{W_i S_i / \sqrt{c_i}}{\sum_{i=1}^{k} W_i S_i / \sqrt{c_i}} = n \frac{N_i S_i / \sqrt{c_i}}{\sum_{i=1}^{k} N_i S_i / \sqrt{c_i}}. \tag{3.6}$$
This gives the optimum sample sizes within strata in terms of the total sample size $n$. The solution for the value of $n$ depends on whether the sample is chosen to meet a specified total cost $C$ or to give a specified variance $V$ for $\bar{y}_{st}$.
i) If cost is fixed, substitute the optimum values of $n_i$ in the cost function (3.1) and solve for $n$:
$$C - c_0 = \sum_{i=1}^{k} c_i n_i = n \frac{\sum_{i=1}^{k} c_i\, W_i S_i / \sqrt{c_i}}{\sum_{i=1}^{k} W_i S_i / \sqrt{c_i}} = n \frac{\sum_{i=1}^{k} W_i S_i \sqrt{c_i}}{\sum_{i=1}^{k} W_i S_i / \sqrt{c_i}},$$
so that
$$n = \frac{(C - c_0) \sum_{i=1}^{k} W_i S_i / \sqrt{c_i}}{\sum_{i=1}^{k} W_i S_i \sqrt{c_i}}.$$
Hence,
$$n_i = \frac{(C - c_0)\, W_i S_i / \sqrt{c_i}}{\sum_{i=1}^{k} W_i S_i \sqrt{c_i}}.$$
Substituting this $n_i$ in (3.2), the minimum variance for fixed cost is
$$V(\bar{y}_{st})_{opt} = \frac{\left(\sum_{i=1}^{k} W_i S_i \sqrt{c_i}\right) \left(\sum_{i=1}^{k} W_i S_i / \sqrt{c_i}\right)}{C - c_0} - \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2.$$
ii) If $V$ is fixed, substitute the optimum $n_i$ from (3.6) into equation (3.2):
$$V + \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2 = \sum_{i=1}^{k} \frac{W_i^2 S_i^2}{n_i} = \frac{1}{n} \left(\sum_{i=1}^{k} W_i S_i \sqrt{c_i}\right) \left(\sum_{i=1}^{k} W_i S_i / \sqrt{c_i}\right).$$
Thus,
$$n = \frac{\left(\sum_{i=1}^{k} W_i S_i \sqrt{c_i}\right) \left(\sum_{i=1}^{k} W_i S_i / \sqrt{c_i}\right)}{V + \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2},$$
and hence,
$$n_i = \frac{(W_i S_i / \sqrt{c_i}) \sum_{i=1}^{k} W_i S_i \sqrt{c_i}}{V + \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2}.$$
The corresponding minimum cost is
$$C - c_0 = \sum_{i=1}^{k} c_i n_i = \frac{\sum_{i=1}^{k} c_i (W_i S_i / \sqrt{c_i}) \sum_{i=1}^{k} W_i S_i \sqrt{c_i}}{V + \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2} = \frac{\left(\sum_{i=1}^{k} W_i S_i \sqrt{c_i}\right)^2}{V + \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2}.$$
Remark
An important special case arises if $c_i = c$, that is, if the cost per unit is the same in all strata. The cost becomes $C = c_0 + c \sum_{i=1}^{k} n_i = c_0 + c n$, and optimum allocation for fixed cost reduces to optimum allocation for fixed sample size. The result in this case is as follows:
In stratified random sampling, $V(\bar{y}_{st})$ is minimized for a fixed total sample size $n$ if
$$n_i = n \frac{W_i S_i}{\sum_{i=1}^{k} W_i S_i} = n \frac{N_i S_i}{\sum_{i=1}^{k} N_i S_i}, \quad \text{i.e. } n_i \propto W_i S_i \text{ or } n_i \propto N_i S_i.$$
This is called Neyman allocation, and $V(\bar{y}_{st})$ under optimum allocation for fixed $n$ (Neyman allocation) is
$$V(\bar{y}_{st})_{opt} = \sum_{i=1}^{k} \frac{W_i^2 S_i^2}{n_i} - \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2 = \frac{1}{n} \left(\sum_{i=1}^{k} W_i S_i\right)^2 - \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2.$$
Note: If $N$ is large, $V(\bar{y}_{st})_{opt}$ reduces to
$$V(\bar{y}_{st})_{opt} = \frac{1}{n} \left(\sum_{i=1}^{k} W_i S_i\right)^2.$$
For comparison, under proportional and Neyman allocation respectively,
$$V_{prop} = \frac{1 - f}{n} \sum_{i=1}^{k} W_i S_i^2 = \frac{1}{n} \sum_{i=1}^{k} W_i S_i^2 - \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2,$$
$$V_{opt} = \frac{1}{n} \left(\sum_{i=1}^{k} W_i S_i\right)^2 - \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2.$$
Now
$$(N - 1) S^2 = \sum_{i=1}^{k} \sum_{j=1}^{N_i} (y_{ij} - \bar{Y})^2 = \sum_{i=1}^{k} \sum_{j=1}^{N_i} (y_{ij} - \bar{Y}_i + \bar{Y}_i - \bar{Y})^2$$
$$= \sum_{i=1}^{k} \sum_{j=1}^{N_i} (y_{ij} - \bar{Y}_i)^2 + \sum_{i=1}^{k} \sum_{j=1}^{N_i} (\bar{Y}_i - \bar{Y})^2 + 2 \sum_{i=1}^{k} (\bar{Y}_i - \bar{Y}) \sum_{j=1}^{N_i} (y_{ij} - \bar{Y}_i)$$
$$= \sum_{i=1}^{k} (N_i - 1) S_i^2 + \sum_{i=1}^{k} N_i (\bar{Y}_i - \bar{Y})^2,$$
as the sum of the deviations of the $y_{ij}$ from their stratum mean is zero. Therefore,
$$S^2 = \sum_{i=1}^{k} \frac{N_i - 1}{N - 1} S_i^2 + \sum_{i=1}^{k} \frac{N_i}{N - 1} (\bar{Y}_i - \bar{Y})^2.$$
For large $N$, $1/N \to 0$, so that
$$\frac{N_i - 1}{N - 1} = \frac{(N_i / N) - (1 / N)}{1 - (1 / N)} \approx W_i \quad \text{and} \quad \frac{N_i}{N - 1} = \frac{N_i / N}{1 - (1 / N)} \approx W_i,$$
and hence
$$S^2 \approx \sum_{i=1}^{k} W_i S_i^2 + \sum_{i=1}^{k} W_i (\bar{Y}_i - \bar{Y})^2.$$
Hence,
$$V_{ran} = \frac{1 - f}{n} S^2 = \frac{1 - f}{n} \sum_{i=1}^{k} W_i S_i^2 + \frac{1 - f}{n} \sum_{i=1}^{k} W_i (\bar{Y}_i - \bar{Y})^2$$
$$= V_{prop} + \frac{1 - f}{n} \sum_{i=1}^{k} W_i (\bar{Y}_i - \bar{Y})^2 = V_{prop} + \text{a positive quantity}.$$
Further, consider
$$V_{prop} - V_{opt} = \frac{1}{n} \sum_{i=1}^{k} W_i S_i^2 - \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2 - \frac{1}{n} \left(\sum_{i=1}^{k} W_i S_i\right)^2 + \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2$$
$$= \frac{1}{n} \left[\sum_{i=1}^{k} W_i S_i^2 - \left(\sum_{i=1}^{k} W_i S_i\right)^2\right].$$
Writing $\bar{S} = \sum_{i=1}^{k} W_i S_i$ and using $\sum_{i=1}^{k} W_i = 1$,
$$\sum_{i=1}^{k} W_i S_i^2 - \bar{S}^2 = \sum_{i=1}^{k} W_i S_i^2 + \bar{S}^2 \sum_{i=1}^{k} W_i - 2 \bar{S} \sum_{i=1}^{k} W_i S_i = \sum_{i=1}^{k} W_i (S_i - \bar{S})^2,$$
so that
$$V_{prop} - V_{opt} = \frac{1}{n} \sum_{i=1}^{k} W_i (S_i - \bar{S})^2 = \text{a positive quantity}.$$
Thus,
$$V_{prop} \ge V_{opt}. \tag{3.8}$$
Also,
$$V_{ran} - V_{opt} = \frac{1}{n} \left[\sum_{i=1}^{k} W_i S_i^2 - \left(\sum_{i=1}^{k} W_i S_i\right)^2\right] + \frac{1 - f}{n} \sum_{i=1}^{k} W_i (\bar{Y}_i - \bar{Y})^2,$$
so that $V_{opt} \le V_{prop} \le V_{ran}$.
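The ordering $V_{opt} \le V_{prop} \le V_{ran}$ can be verified numerically. The following is a sketch using the large-sample formulas above on an illustrative two-stratum population (equal per-unit costs, so the optimum is Neyman allocation; the data are not from the text):

```python
from math import sqrt

def variance_comparison(strata, n):
    """strata: list of lists of population values (one list per stratum).
    Returns (V_ran, V_prop, V_opt) from the large-sample formulas, with
    f = n/N and stratum weights W_i = N_i/N."""
    values = [y for s in strata for y in s]
    N = len(values)
    Ybar = sum(values) / N
    f = n / N
    S2 = sum((y - Ybar) ** 2 for y in values) / (N - 1)   # overall mean square
    W, S2_i = [], []
    for s in strata:
        m = sum(s) / len(s)
        W.append(len(s) / N)
        S2_i.append(sum((y - m) ** 2 for y in s) / (len(s) - 1))
    within = sum(w * s2 for w, s2 in zip(W, S2_i))
    V_ran = (1 - f) / n * S2
    V_prop = (1 - f) / n * within
    V_opt = sum(w * sqrt(s2) for w, s2 in zip(W, S2_i)) ** 2 / n - within / N
    return V_ran, V_prop, V_opt

V_ran, V_prop, V_opt = variance_comparison([[0, 1, 2], [4, 6, 11]], 4)
```

The gap between $V_{ran}$ and $V_{prop}$ grows with the spread of the stratum means, and the gap between $V_{prop}$ and $V_{opt}$ with the spread of the stratum standard deviations, exactly as the two difference formulas above state.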
Remark
In comparing the precision of stratified with un-stratified random sampling, it was assumed
that the population values of stratum means and variances were known.
ii) Compute the estimate $\bar{y}_{st}$ for every possible sample under optimum allocation and proportional allocation. Show that the estimates are unbiased. Hence find $V(\bar{y}_{st})$ directly under optimum and proportional allocation, and verify that $V(\bar{y}_{st})$ under optimum allocation agrees with the formula
$$V(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 \left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_i^2,$$
and that $V(\bar{y}_{st})$ under proportional allocation agrees with the formula
$$V(\bar{y}_{st}) = \left(\frac{1}{n} - \frac{1}{N}\right) \sum_{i=1}^{k} W_i S_i^2.$$
Solution: Here stratum I is $\{0, 1, 2\}$ and stratum II is $\{4, 6, 11\}$, so $N_1 = N_2 = 3$, $S_1 = 1$, $S_2 = \sqrt{13}$, and $n = 4$. Under optimum (Neyman) allocation,
$$n_1 = n \frac{N_1 S_1}{\sum_i N_i S_i} \approx 1, \quad \text{and} \quad n_2 = n \frac{N_2 S_2}{\sum_i N_i S_i} \approx 3.$$
Samples and means under optimum allocation ($n_1 = 1$, $n_2 = 3$):
Stratum I sample | Stratum II sample | $\bar{y}_1$ | $\bar{y}_2$ | $\bar{y}_{st}$
(0) | (4, 6, 11) | 0 | 7 | 3.5
(1) | (4, 6, 11) | 1 | 7 | 4.0
(2) | (4, 6, 11) | 2 | 7 | 4.5
Thus $E(\bar{y}_{st}) = (3.5 + 4.0 + 4.5)/3 = 4 = \bar{Y}$, so $\bar{y}_{st}$ is unbiased, and directly
$$V(\bar{y}_{st}) = \frac{1}{3}[(3.5 - 4)^2 + (4.0 - 4)^2 + (4.5 - 4)^2] = 0.1667.$$
By formula,
$$V(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 \left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_i^2 = \frac{1}{4}\left(1 - \frac{1}{3}\right)(1) + \frac{1}{4}\left(\frac{1}{3} - \frac{1}{3}\right)(13) = 0.1667.$$
Samples and means under proportional allocation ($n_1 = n_2 = 2$):
Stratum I sample | Stratum II sample | $\bar{y}_1$ | $\bar{y}_2$ | $\bar{y}_{st}$
(0, 1) | (4, 6) | 0.5 | 5.0 | 2.75
(0, 1) | (4, 11) | 0.5 | 7.5 | 4.00
(0, 1) | (6, 11) | 0.5 | 8.5 | 4.50
(0, 2) | (4, 6) | 1.0 | 5.0 | 3.00
(0, 2) | (4, 11) | 1.0 | 7.5 | 4.25
(0, 2) | (6, 11) | 1.0 | 8.5 | 4.75
(1, 2) | (4, 6) | 1.5 | 5.0 | 3.25
(1, 2) | (4, 11) | 1.5 | 7.5 | 4.50
(1, 2) | (6, 11) | 1.5 | 8.5 | 5.00
$$E(\bar{y}_{st}) = \frac{1}{9}(2.75 + 4.00 + 4.50 + 3.00 + 4.25 + 4.75 + 3.25 + 4.50 + 5.00) = 4 = \bar{Y}.$$
Therefore, $\bar{y}_{st}$ is an unbiased estimate of $\bar{Y}$ under proportional allocation. Directly,
$$V(\bar{y}_{st}) = \frac{1}{9}[(2.75 - 4)^2 + (4.00 - 4)^2 + \cdots + (5.00 - 4)^2] = 0.583.$$
By formula,
$$V(\bar{y}_{st}) = \left(\frac{1}{n} - \frac{1}{N}\right) \sum_{i=1}^{k} W_i S_i^2 = \left(\frac{1}{4} - \frac{1}{6}\right)\left(\frac{1}{2}(1) + \frac{1}{2}(13)\right) = 0.583.$$
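The enumeration above is easy to reproduce in code. This sketch lists all nine proportional-allocation samples from the two strata and checks $E(\bar{y}_{st}) = \bar{Y} = 4$ and the direct variance:

```python
from itertools import combinations

stratum1, stratum2 = [0, 1, 2], [4, 6, 11]   # the two strata of the example
W1 = W2 = 0.5                                 # N1 = N2 = 3, N = 6
# All 9 proportional-allocation samples (n1 = n2 = 2):
estimates = [W1 * sum(s1) / 2 + W2 * sum(s2) / 2
             for s1 in combinations(stratum1, 2)
             for s2 in combinations(stratum2, 2)]
E = sum(estimates) / len(estimates)                        # should equal Ybar = 4
V = sum((t - E) ** 2 for t in estimates) / len(estimates)  # direct variance
```

Each of the nine samples is equally likely, so averaging over them gives the exact expectation and variance of the estimator.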
Exercise
1) 2000 cultivators' holdings in Uttar Pradesh were stratified according to their sizes. The number of holdings $(N_i)$, mean area under wheat per holding $(\bar{Y}_i)$ and standard deviation of area under wheat per holding $(S_i)$ are given below for each stratum:
Stratum number Number of holdings Mean area under Standard deviation of area
wheat/holding under wheat/ holding
1 394 5.4 8.3
2 461 16.3 13.3
3 381 24.3 15.1
4 334 34.3 19.8
5 169 42.1 24.5
6 113 50.1 26.0
7 148 63.8 35.2
For a sample of 200 farms, compute the sample size in each stratum under proportional and optimum allocation. Calculate the sampling variance of the estimated area under wheat from the sample
i) if the farms are selected under proportional allocation by without replacement method.
ii) if the farms are selected under Neyman’s allocation by without replacement method.
Also compute the relative precision from these procedures compared to simple random
sampling.
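The two allocations in exercise 1 follow mechanically from the table; a sketch (the numbers are copied from the table above):

```python
N_i = [394, 461, 381, 334, 169, 113, 148]        # holdings per stratum
S_i = [8.3, 13.3, 15.1, 19.8, 24.5, 26.0, 35.2]  # st. dev. per stratum
n = 200
N = sum(N_i)                                      # 2000 holdings in all

prop = [n * Ni / N for Ni in N_i]                 # proportional: n_i = n W_i
NS = [Ni * Si for Ni, Si in zip(N_i, S_i)]
neyman = [n * w / sum(NS) for w in NS]            # Neyman: n_i prop. to N_i S_i
```

Neyman allocation shifts sample from the large, homogeneous small-holding strata towards the small but highly variable large-holding strata; rounding the results to integers completes the exercise.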
2) Using the data given below and considering the size classes as strata, compare the efficiencies of the following alternative allocations of a sample of 3000 factories for estimating the total output. The sample is to be selected with srswor within each stratum:
i) Proportional allocation
ii) Optimum allocation
Size class (no. of workers) | No. of factories | Output/factory (in '000 Rs) | St. dev. (in '000 Rs)
1-49 18260 100 80
50-99 4315 250 200
100-249 2233 500 600
250-999 1057 1760 1900
1000 and above 567 2250 2500
CHAPTER IV
Ratio and Regression Methods of Estimation
In the theory of simple random sampling, we considered estimators that use only the observed values of the characteristic under study. Often the characteristic $y$ under study is highly correlated with an auxiliary characteristic $x$, and data on $x$ are either readily available or can easily be collected for all the units in the population. The knowledge of $x$, being additional information about the population under study, is termed auxiliary or supplementary information. In such situations it is customary to consider estimators of the characteristic $y$ that use the data on $x$ and are more efficient than estimators based on the characteristic $y$ alone. Two such methods of estimation are:
i) the ratio method of estimation;
ii) the regression method of estimation.
Notation
$$Y = \sum_{i=1}^{N} y_i, \ \text{population total of } y; \qquad \bar{Y} = \frac{1}{N} \sum_{i=1}^{N} y_i = \frac{Y}{N}, \ \text{population mean of } y.$$
$$X = \sum_{i=1}^{N} x_i, \ \text{population total of } x; \qquad \bar{X} = \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{X}{N}, \ \text{population mean of } x.$$
$$R = \frac{Y}{X} = \frac{\bar{Y}}{\bar{X}}, \ \text{ratio of the population totals or means of } y \text{ and } x.$$
$\rho$ : correlation coefficient between $y$ and $x$ in the population.
Suppose it is desired to estimate $Y$, $\bar{Y}$ or $R$ by drawing a simple random sample of $n$ units from the population. Based on the $n$ pairs of observations $(y_i, x_i)$, $i = 1, 2, \ldots, n$, let $\bar{y}$ and $\bar{x}$ be the sample means of $y$ and $x$ respectively, and suppose the population total $X$ or mean $\bar{X}$ is known. The ratio estimators of the population ratio $R$, the total $Y$, and the mean $\bar{Y}$ are defined by
$$\hat{R} = \frac{\bar{y}}{\bar{x}}, \qquad \hat{Y}_R = \hat{R} X, \qquad \hat{\bar{Y}}_R = \hat{R} \bar{X}.$$
For example, in estimating the average number of bullocks per holding in a region, the product of $\hat{R}$ with $\bar{X}$ (the average size of a holding in acres) would provide an estimator of $\bar{Y}$ (the average number of bullocks per holding in the population), i.e.
$y$ : number of bullocks on a holding,
$x$ : area in acres of a holding,
$\hat{R} = \bar{y}/\bar{x}$ : estimate of the number of bullocks per acre in the population,
$\hat{\bar{Y}}_R = \hat{R} \bar{X}$ : estimate of the average number of bullocks per holding in the population.
In sampling hospitals to estimate the number of patient-days of care during a particular month, $y$ may be the number of patient-days of care provided by a hospital during the month, and $x$ the number of beds in the hospital. As another example, $y$ and $x$ may denote the values of the characteristic under study on two successive occasions, e.g. the acreage under a crop during the current year and the previous year respectively.
Theorem: In srswor, for large $n$, $\hat{R} = \bar{y}/\bar{x}$ is approximately unbiased for the population ratio $R$ and has approximate variance
$$V(\hat{R}) = \frac{1 - f}{n (N - 1) \bar{X}^2} \sum_{i=1}^{N} (y_i - R x_i)^2.$$
Proof: Consider
$$\hat{R} - R = \frac{\bar{y}}{\bar{x}} - R = \frac{\bar{y} - R \bar{x}}{\bar{x}} \approx \frac{\bar{y} - R \bar{x}}{\bar{X}}, \quad \text{since for large } n, \ \bar{x} \approx \bar{X}.$$
Under this condition,
$$E(\hat{R} - R) = \frac{1}{\bar{X}} E(\bar{y} - R \bar{x}) = \frac{1}{\bar{X}} [E(\bar{y}) - R E(\bar{x})] = \frac{1}{\bar{X}} \left[\bar{Y} - \frac{\bar{Y}}{\bar{X}} \bar{X}\right] = 0, \quad \text{so } E(\hat{R}) \approx R.$$
Alternative method
By definition,
$$E(\hat{R}) = E\left(\frac{\bar{y}}{\bar{x}}\right) \ne E(\bar{y})\, E\left(\frac{1}{\bar{x}}\right), \quad \text{since } \bar{y} \text{ and } \bar{x} \text{ are not independent but highly correlated.}$$
If $n$ is large, we can take $\bar{x} \approx \bar{X}$; under this condition $E(\hat{R})$ reduces to
$$E(\hat{R}) \approx \frac{1}{\bar{X}} E(\bar{y}) = \frac{\bar{Y}}{\bar{X}} = R.$$
This shows that for large $n$, $\hat{R}$ can be taken as an unbiased estimate of $R$.
To obtain the variance, we have
$$V(\hat{R}) = E(\hat{R} - R)^2 = E\left(\frac{\bar{y} - R \bar{x}}{\bar{x}}\right)^2 \approx \frac{1}{\bar{X}^2} E(\bar{y} - R \bar{x})^2. \tag{4.1}$$
Now consider the variate
$$d_i = y_i - R x_i, \quad i = 1, 2, \ldots, N.$$
Let $\bar{d}$ and $\bar{D}$ be the sample mean and population mean of the variable $d$ respectively, where
$$\bar{d} = \frac{1}{n} \sum_{i=1}^{n} (y_i - R x_i) = \bar{y} - R \bar{x} \tag{4.2}$$
and
$$\bar{D} = \frac{1}{N} \sum_{i=1}^{N} (y_i - R x_i) = \bar{Y} - R \bar{X} = 0, \quad \text{since } R = \frac{\bar{Y}}{\bar{X}}.$$
Since $\bar{d}$ is the mean of a simple random sample, wor,
$$V(\bar{d}) = \frac{1 - f}{n} S_d^2 = \frac{1 - f}{n (N - 1)} \sum_{i=1}^{N} (y_i - R x_i)^2. \tag{4.3}$$
By definition,
$$V(\bar{d}) = E(\bar{d} - \bar{D})^2 = E(\bar{d})^2 = E(\bar{y} - R \bar{x})^2. \tag{4.4}$$
In view of equations (4.1), (4.4), and (4.3), we find that
$$V(\hat{R}) = \frac{1}{\bar{X}^2} E(\bar{y} - R \bar{x})^2 = \frac{1}{\bar{X}^2} V(\bar{d}) = \frac{1 - f}{n (N - 1) \bar{X}^2} \sum_{i=1}^{N} (y_i - R x_i)^2.$$
i) In terms of the correlation coefficient: Since
$$\sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X}) = (N - 1) \rho S_y S_x,$$
we have
$$V(\hat{R}) = \frac{1 - f}{n (N - 1) \bar{X}^2} \sum_{i=1}^{N} (y_i - R x_i)^2 = \frac{1 - f}{n (N - 1) \bar{X}^2} \sum_{i=1}^{N} (y_i - \bar{Y} + R \bar{X} - R x_i)^2, \quad \text{since } \bar{Y} = R \bar{X},$$
$$= \frac{1 - f}{n (N - 1) \bar{X}^2} \sum_{i=1}^{N} [(y_i - \bar{Y}) - R (x_i - \bar{X})]^2$$
$$= \frac{1 - f}{n (N - 1) \bar{X}^2} \left[\sum_{i=1}^{N} (y_i - \bar{Y})^2 + R^2 \sum_{i=1}^{N} (x_i - \bar{X})^2 - 2 R \sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X})\right]$$
$$= \frac{1 - f}{n (N - 1) \bar{X}^2} [(N - 1) S_y^2 + (N - 1) R^2 S_x^2 - 2 R (N - 1) \rho S_y S_x]$$
$$= \frac{1 - f}{n \bar{X}^2} (S_y^2 + R^2 S_x^2 - 2 R \rho S_y S_x) = \frac{1 - f}{n} R^2 \left(\frac{S_y^2}{\bar{Y}^2} + \frac{S_x^2}{\bar{X}^2} - \frac{2 \rho S_y S_x}{\bar{Y} \bar{X}}\right)$$
$$= \frac{1 - f}{n} R^2 (C_{yy} + C_{xx} - 2 \rho C_y C_x),$$
where $C_y = S_y / \bar{Y}$ and $C_x = S_x / \bar{X}$ are the coefficients of variation of $y$ and $x$ respectively; thus $C_{yy} = C_y^2$ and $C_{xx} = C_x^2$ are the squares of the coefficients of variation, also called relative variances.
ii) In terms of the covariance: The covariance of $y$ and $x$ is defined by
$$S_{yx} = \frac{1}{N - 1} \sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X}) = \rho S_y S_x,$$
so that
$$V(\hat{R}) = \frac{1 - f}{n} R^2 \left(\frac{S_y^2}{\bar{Y}^2} + \frac{S_x^2}{\bar{X}^2} - \frac{2 S_{yx}}{\bar{Y} \bar{X}}\right) = \frac{1 - f}{n} R^2 (C_{yy} + C_{xx} - 2 C_{yx}),$$
where $C_{yx} = S_{yx} / (\bar{Y} \bar{X})$ is called the relative covariance.
Estimation of $V(\hat{R})$
Taking $\frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \hat{R} x_i)^2$ as an estimate of $\frac{1}{N - 1} \sum_{i=1}^{N} (y_i - R x_i)^2$, we get
$$\hat{V}(\hat{R}) = v(\hat{R}) = \frac{1 - f}{n (n - 1) \bar{X}^2} \sum_{i=1}^{n} (y_i - \hat{R} x_i)^2.$$
i) In terms of the correlation coefficient:
$$v(\hat{R}) = \frac{1 - f}{n \bar{X}^2} (s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R}\, r\, s_y s_x), \quad \text{where } r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{(n - 1)\, s_y s_x}, \quad s_y^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2.$$
ii) In terms of the covariance:
$$v(\hat{R}) = \frac{1 - f}{n \bar{X}^2} (s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R} s_{yx}), \quad \text{since } s_{yx} = r\, s_y s_x.$$
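A sketch of $\hat{R}$ and its estimated variance as just derived; the data passed in at the end are illustrative, not from the text:

```python
def ratio_estimate(y, x, Xbar, N):
    """Returns (Rhat, v_Rhat) under srswor, per the formulas above:
    v(Rhat) = (1-f) * sum (y_i - Rhat x_i)^2 / (n (n-1) Xbar^2)."""
    n = len(y)
    ybar, xbar = sum(y) / n, sum(x) / n
    Rhat = ybar / xbar
    f = n / N
    ss = sum((yi - Rhat * xi) ** 2 for yi, xi in zip(y, x))
    v_Rhat = (1 - f) * ss / (n * (n - 1) * Xbar ** 2)
    return Rhat, v_Rhat

# Illustrative sample of 3 pairs from a population of N = 30 with Xbar = 2:
Rhat, v = ratio_estimate([2, 3, 5], [1, 2, 3], Xbar=2.0, N=30)
```

Note that the residuals $y_i - \hat{R} x_i$, not the raw deviations of $y$, drive the variance: a sample lying close to a line through the origin gives a small $v(\hat{R})$.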
Corollary: For large $n$, $\hat{\bar{Y}}_R = \hat{R} \bar{X}$ is approximately unbiased for $\bar{Y}$:
$$E(\hat{\bar{Y}}_R) = E(\hat{R} \bar{X}) = \bar{X} E(\hat{R}) = \bar{X} R = \bar{Y}.$$
Further,
$$V(\hat{\bar{Y}}_R) = V(\hat{R} \bar{X}) = \bar{X}^2 V(\hat{R}) = \frac{1 - f}{n (N - 1)} \sum_{i=1}^{N} (y_i - R x_i)^2.$$
Taking $\frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \hat{R} x_i)^2$ as an estimate of $\frac{1}{N - 1} \sum_{i=1}^{N} (y_i - R x_i)^2$, then
$$\hat{V}(\hat{\bar{Y}}_R) = \bar{X}^2 v(\hat{R}) = \frac{1 - f}{n (n - 1)} \sum_{i=1}^{n} (y_i - \hat{R} x_i)^2 = \frac{1 - f}{n} (s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R}\, r\, s_y s_x).$$
Also,
$$\hat{V}(\hat{\bar{Y}}_R) = \frac{1 - f}{n} (s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R} s_{yx}).$$
Corollary: For large $n$, $\hat{Y}_R = \hat{R} X = \hat{R} N \bar{X}$ is approximately unbiased for the population total $Y$:
$$E(\hat{Y}_R) = E(\hat{R} N \bar{X}) = N \bar{X} E(\hat{R}) = N \bar{X} R = N \bar{Y} = Y,$$
and
$$V(\hat{Y}_R) = V(\hat{R} N \bar{X}) = N^2 \bar{X}^2 V(\hat{R}) = \frac{N^2 (1 - f)}{n (N - 1)} \sum_{i=1}^{N} (y_i - R x_i)^2.$$
Its estimate is
$$\hat{V}(\hat{Y}_R) = N^2 \bar{X}^2 v(\hat{R}) = \frac{N^2 (1 - f)}{n (n - 1)} \sum_{i=1}^{n} (y_i - \hat{R} x_i)^2 = \frac{N^2 (1 - f)}{n} (s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R}\, r\, s_y s_x).$$
Also,
$$\hat{V}(\hat{Y}_R) = \frac{N^2 (1 - f)}{n} (s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R} s_{yx}).$$
Corollary: Show that, to the first order of approximation,
$$(CV)^2 = \frac{1 - f}{n} (C_{yy} + C_{xx} - 2 \rho C_y C_x) \quad \text{(relative variance)}.$$
Proof: We know that
$$CV(\hat{R}) = \frac{\sigma_{\hat{R}}}{E(\hat{R})}, \quad \text{and} \quad V(\hat{R}) = \frac{1 - f}{n} R^2 (C_{yy} + C_{xx} - 2 \rho C_y C_x),$$
so that
$$[CV(\hat{R})]^2 = \frac{V(\hat{R})}{R^2} = \frac{1 - f}{n} (C_{yy} + C_{xx} - 2 \rho C_y C_x), \quad \text{since } R \approx E(\hat{R}).$$
Similarly, we can see that the $(CV)^2$ for $\hat{Y}_R$ and $\hat{\bar{Y}}_R$ takes the same value.
Note: The quantity $(CV)^2$ is called the relative variance and is the same for all three estimates $\hat{R}$, $\hat{Y}_R$ and $\hat{\bar{Y}}_R$.
Corollary: If $C_x = C_y = C$, show that the relative variance
$$V\left(\frac{\hat{R}}{R}\right) = \frac{V(\hat{R})}{R^2} = \frac{2 (1 - f)}{n} C^2 (1 - \rho).$$
Proof: By definition,
$$V\left(\frac{\hat{R}}{R}\right) = \frac{1}{R^2} V(\hat{R}) = \frac{1 - f}{n} (C_{yy} + C_{xx} - 2 \rho C_y C_x) = \frac{1 - f}{n} (C^2 + C^2 - 2 \rho C^2) = \frac{2 (1 - f)}{n} C^2 (1 - \rho).$$
Comparison of the ratio estimate with the mean per unit
The variance of the sample mean in srswor is
$$V(\bar{y}) = \left(\frac{1}{n} - \frac{1}{N}\right) S_y^2 = \frac{1 - f}{n} S_y^2,$$
and the variance of the mean based on the ratio method is
$$V(\hat{\bar{Y}}_R) = \frac{1 - f}{n} (S_y^2 + R^2 S_x^2 - 2 R \rho S_y S_x).$$
Obviously the ratio estimate $\hat{\bar{Y}}_R$ will be more precise than $\bar{y}$ if and only if $V(\hat{\bar{Y}}_R) < V(\bar{y})$, so that
$$\frac{1 - f}{n} (S_y^2 + R^2 S_x^2 - 2 R \rho S_y S_x) < \frac{1 - f}{n} S_y^2$$
$$R^2 S_x^2 < 2 R \rho S_y S_x \quad \text{or} \quad \rho > \frac{R S_x}{2 S_y},$$
i.e.
$$\rho > \frac{1}{2} \frac{S_x / \bar{X}}{S_y / \bar{Y}}, \quad \text{or} \quad \rho > \frac{1}{2} \frac{CV(x)}{CV(y)}.$$
Bias of the ratio estimator: Since $\hat{R} \bar{x} = \bar{y}$, we have
$$\mathrm{Cov}(\hat{R}, \bar{x}) = E(\hat{R} \bar{x}) - E(\hat{R}) E(\bar{x}) = E(\bar{y}) - \bar{X} E(\hat{R}) = \bar{Y} - \bar{X} E(\hat{R}),$$
which shows that in general $E(\hat{R}) \ne R$, so that one cannot ignore the bias. Thus
$$E(\hat{R}) = \frac{\bar{Y}}{\bar{X}} - \frac{1}{\bar{X}} \mathrm{Cov}(\hat{R}, \bar{x}) = R - \frac{1}{\bar{X}} \mathrm{Cov}(\hat{R}, \bar{x}),$$
and hence
$$B(\hat{R}) = E(\hat{R}) - R = -\frac{1}{\bar{X}} \mathrm{Cov}(\hat{R}, \bar{x}).$$
Corollary: Prove that
$$\frac{|B(\hat{R})|}{\sigma_{\hat{R}}} \le CV(\bar{x}).$$
Proof: We know that
$$B(\hat{R}) = -\frac{\mathrm{Cov}(\hat{R}, \bar{x})}{\bar{X}} = -\frac{\rho_{\hat{R}, \bar{x}}\, \sigma_{\hat{R}}\, \sigma_{\bar{x}}}{\bar{X}},$$
and since the correlation between $\hat{R}$ and $\bar{x}$ cannot exceed 1 in absolute value,
$$\frac{|B(\hat{R})|}{\sigma_{\hat{R}}} \le \frac{\sigma_{\bar{x}}}{\bar{X}} = CV(\bar{x}).$$
Theorem: In srswor,
$$E(\bar{y} - \bar{Y})(\bar{x} - \bar{X}) = \frac{N - n}{n N} \frac{1}{N - 1} \sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X}).$$
Proof: Let $\bar{u}$ and $\bar{U}$ be the sample mean and population mean of the variable $u$ respectively, where $u_i = y_i + x_i$, so that $\bar{u} = \bar{y} + \bar{x}$ and $\bar{U} = \bar{Y} + \bar{X}$. As sampling is simple random, wor,
$$E(\bar{u}) = \bar{U}, \quad \text{and} \quad V(\bar{u}) = E(\bar{u} - \bar{U})^2 = \frac{N - n}{n N} S_u^2, \quad \text{where } S_u^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (u_i - \bar{U})^2.$$
That is,
$$E(\bar{y} + \bar{x} - \bar{Y} - \bar{X})^2 = \frac{N - n}{n N} \frac{1}{N - 1} \sum_{i=1}^{N} (y_i + x_i - \bar{Y} - \bar{X})^2,$$
or
$$E[(\bar{y} - \bar{Y}) + (\bar{x} - \bar{X})]^2 = \frac{N - n}{n N} \frac{1}{N - 1} \sum_{i=1}^{N} [(y_i - \bar{Y}) + (x_i - \bar{X})]^2,$$
or
$$E(\bar{y} - \bar{Y})^2 + E(\bar{x} - \bar{X})^2 + 2 E(\bar{y} - \bar{Y})(\bar{x} - \bar{X})$$
$$= \frac{N - n}{n N} \frac{1}{N - 1} \left[\sum_{i=1}^{N} (y_i - \bar{Y})^2 + \sum_{i=1}^{N} (x_i - \bar{X})^2 + 2 \sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X})\right].$$
Cancelling the two variance terms against their expectations on the left, we get
$$E(\bar{y} - \bar{Y})(\bar{x} - \bar{X}) = \frac{N - n}{n N} \frac{1}{N - 1} \sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X}) = \frac{1 - f}{n (N - 1)} \sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X}).$$
Theorem: Show that the first approximation to the relative bias of the ratio estimator in simple random sampling, wor, is given by
$$\frac{B(\hat{R})}{R} = \frac{1 - f}{n \bar{X} \bar{Y}} (R S_x^2 - \rho S_y S_x) = \frac{1 - f}{n} (C_{xx} - \rho C_y C_x).$$
Proof: We know that
$$\hat{R} - R = \frac{\bar{y} - R \bar{x}}{\bar{x}} = \frac{\bar{y} - R \bar{x}}{\bar{X} + (\bar{x} - \bar{X})} = \frac{\bar{y} - R \bar{x}}{\bar{X}} \left(1 + \frac{\bar{x} - \bar{X}}{\bar{X}}\right)^{-1}.$$
Expanding by Taylor's series, we get
$$\hat{R} - R = \frac{\bar{y} - R \bar{x}}{\bar{X}} \left[1 - \frac{\bar{x} - \bar{X}}{\bar{X}} + \left(\frac{\bar{x} - \bar{X}}{\bar{X}}\right)^2 - \cdots\right],$$
as $(1 + t)^{-1} = 1 - t + t^2 - \cdots + (-1)^r t^r + \cdots$, valid for $|t| < 1$. To the first order of approximation,
$$E(\hat{R} - R) \approx \frac{1}{\bar{X}} E(\bar{y} - R \bar{x}) - \frac{1}{\bar{X}^2} E[(\bar{y} - R \bar{x})(\bar{x} - \bar{X})].$$
Since $E(\bar{y} - R \bar{x}) = \bar{Y} - R \bar{X} = 0$, consider
$$E[(\bar{y} - R \bar{x})(\bar{x} - \bar{X})] = E[\bar{y}(\bar{x} - \bar{X})] - R E[\bar{x}(\bar{x} - \bar{X})] = E[(\bar{y} - \bar{Y})(\bar{x} - \bar{X})] - R E(\bar{x} - \bar{X})^2,$$
using $E(\bar{x} - \bar{X}) = 0$. By the previous theorem,
$$E[(\bar{y} - R \bar{x})(\bar{x} - \bar{X})] = \frac{1 - f}{n (N - 1)} \left[\sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X}) - R \sum_{i=1}^{N} (x_i - \bar{X})^2\right] = \frac{1 - f}{n} (\rho S_y S_x - R S_x^2).$$
Hence,
$$E(\hat{R} - R) = B(\hat{R}) = \frac{1 - f}{n \bar{X}^2} (R S_x^2 - \rho S_y S_x),$$
and
$$\frac{B(\hat{R})}{R} = \frac{1 - f}{n \bar{X}^2 (\bar{Y} / \bar{X})} (R S_x^2 - \rho S_y S_x) = \frac{1 - f}{n} \left(\frac{S_x^2}{\bar{X}^2} - \frac{\rho S_y S_x}{\bar{Y} \bar{X}}\right) = \frac{1 - f}{n} (C_{xx} - \rho C_y C_x).$$
Note: The bias of the ratio estimator becomes zero when $R = \rho S_y / S_x$, because
$$B(\hat{R}) = \frac{1 - f}{n \bar{X}^2} (R S_x^2 - \rho S_y S_x) = 0 \iff R S_x^2 = \rho S_y S_x \iff R = \rho \frac{S_y}{S_x},$$
which is satisfied only if the line of regression of $y$ on $x$ passes through the origin.
Example: In a locality there are 50 lanes. In 2015 there were 6250 persons living there. Recently a sample of 5 lanes showed the number of residents changing as follows:
Lane number            :   1    2    3    4    5
Persons living in 2015 : 100  150  160  200  140
Recently               : 120  160  200  170  150
Estimate the standard error of the number of persons residing in the locality using
i) the recent sample only;
ii) the information about 2015 as well as the recent sample.
Solution:
i) $\bar{y} = 160$, so $\hat{Y} = N \bar{y} = 50 \times 160 = 8000$, and $V(\hat{Y}) = N^2 \frac{N - n}{N n} S^2$. Since $S^2$ is unknown, its estimator $s_y^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2 = 850$ can be used, hence
$$v(\hat{Y}) = \frac{N^2 (1 - f)}{n} s_y^2 = \frac{2500 \times 0.9}{5} \times 850 = 382500, \quad \text{and} \quad SE(\hat{Y}) \approx 618.5.$$
ii) With $\bar{x} = 150$, $\hat{R} = \bar{y} / \bar{x} = 160 / 150$, and $\hat{Y}_R = \hat{R} X = (160 / 150) \times 6250 \approx 6666.7$. Then
$$v(\hat{Y}_R) = \frac{N^2 (1 - f)}{n} (s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R} s_{yx}),$$
where $s_x^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2 = 1300$ and $s_{yx} = \frac{1}{n - 1} \sum (y_i - \bar{y})(x_i - \bar{x}) = 750$. Therefore,
$$v(\hat{Y}_R) = 450 \times \left(850 + (16/15)^2 \times 1300 - 2 \times (16/15) \times 750\right) = 450 \times 729.1 = 328100,$$
and hence $SE(\hat{Y}_R) \approx 573$, smaller than the standard error obtained from the recent sample alone.
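The lane example can be checked numerically; the sketch below recomputes the summary statistics and both variance estimates using the total-estimator formulas derived above:

```python
x = [100, 150, 160, 200, 140]   # persons per lane in 2015 (auxiliary)
y = [120, 160, 200, 170, 150]   # persons per lane recently (study variable)
N, n, X = 50, 5, 6250
ybar, xbar = sum(y) / n, sum(x) / n
s_y2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
s_x2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
s_yx = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x)) / (n - 1)
Rhat = ybar / xbar
f = n / N
v_Y = N ** 2 * (1 - f) / n * s_y2                      # sample mean only
v_YR = N ** 2 * (1 - f) / n * (s_y2 + Rhat ** 2 * s_x2 - 2 * Rhat * s_yx)
```

The strong positive correlation between the 2015 and recent counts is what makes the ratio estimator's variance smaller here.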
Product Estimator
If the correlation coefficient between the study variable $y$ and the auxiliary variable $x$ is negative, we cannot make use of the ratio estimator, because it gives precise results only when the correlation coefficient is greater than $\frac{1}{2} (C_x / C_y)$. In such situations another type of estimator for the mean $\bar{Y}$ and the total $Y$ is defined as
$$\hat{\bar{Y}}_P = \frac{\bar{y}\, \bar{x}}{\bar{X}}, \qquad \hat{Y}_P = N \frac{\bar{y}\, \bar{x}}{\bar{X}} = \frac{\bar{y}\, \bar{x}}{X / N^2} \cdot \frac{1}{N} = \frac{\bar{y}\, \bar{x}}{\bar{X}} N,$$
which are termed the product estimators.
Note: For the product estimator in a large simple random sample, the coefficient of variation of either $\hat{\bar{Y}}_P$ or $\hat{Y}_P$ is given by
$$(CV)^2 = \frac{1 - f}{n} (C_{yy} + C_{xx} + 2 C_{yx}).$$
Theorem: In srswor, with $b_0$ a preassigned constant, the linear regression estimator $\bar{y}_{lr} = \bar{y} + b_0 (\bar{X} - \bar{x})$ is an unbiased estimate of $\bar{Y}$ with variance
$$V(\bar{y}_{lr}) = \frac{1 - f}{n (N - 1)} \sum_{i=1}^{N} [(y_i - \bar{Y}) - b_0 (x_i - \bar{X})]^2 = \frac{1 - f}{n} (S_y^2 + b_0^2 S_x^2 - 2 b_0 S_{yx}).$$
Note: In most applications $b$ is estimated from the sample. However, sometimes it is reasonable to choose the value of $b$ in advance; the estimator is then called the difference estimator.
Proof: By definition,
$$E(\bar{y}_{lr}) = E[\bar{y} + b_0 (\bar{X} - \bar{x})] = E(\bar{y}) + b_0 \bar{X} - b_0 E(\bar{x}) = \bar{Y} + b_0 \bar{X} - b_0 \bar{X} = \bar{Y}.$$
To obtain the variance, consider the variate
$$u_i = y_i - b_0 (x_i - \bar{X}), \quad i = 1, 2, \ldots, N.$$
Let $\bar{u}$ and $\bar{U}$ be the sample mean and population mean of the variable $u$ respectively, where
$$\bar{u} = \frac{1}{n} \sum_{i=1}^{n} [y_i - b_0 (x_i - \bar{X})] = \bar{y} - b_0 (\bar{x} - \bar{X}) = \bar{y} + b_0 (\bar{X} - \bar{x}) = \bar{y}_{lr}, \tag{4.5}$$
$$\bar{U} = \frac{1}{N} \sum_{i=1}^{N} [y_i - b_0 (x_i - \bar{X})] = \bar{Y} - \frac{b_0}{N} \sum_{i=1}^{N} (x_i - \bar{X}) = \bar{Y},$$
and, as sampling is simple random, wor,
$$V(\bar{u}) = \left(\frac{1}{n} - \frac{1}{N}\right) S_u^2 = \frac{1 - f}{n} S_u^2, \quad \text{where}$$
$$S_u^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (u_i - \bar{U})^2 = \frac{1}{N - 1} \sum_{i=1}^{N} [y_i - b_0 (x_i - \bar{X}) - \bar{Y}]^2 = \frac{1}{N - 1} \sum_{i=1}^{N} [(y_i - \bar{Y}) - b_0 (x_i - \bar{X})]^2.$$
Therefore,
$$V(\bar{y}_{lr}) = \frac{1 - f}{n (N - 1)} \left[\sum_{i=1}^{N} (y_i - \bar{Y})^2 + b_0^2 \sum_{i=1}^{N} (x_i - \bar{X})^2 - 2 b_0 \sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X})\right] = \frac{1 - f}{n} (S_y^2 + b_0^2 S_x^2 - 2 b_0 S_{yx}).$$
Corollary: In simple random sampling, wor, an unbiased estimate of $V(\bar{y}_{lr})$ is
$$\hat{V}(\bar{y}_{lr}) = v(\bar{y}_{lr}) = \frac{1 - f}{n} (s_y^2 + b_0^2 s_x^2 - 2 b_0 s_{yx}),$$
where
$$s_y^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad s_x^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad s_{yx} = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}).$$
Theorem: The value of $b_0$ which minimizes $V(\bar{y}_{lr})$ is
$$b_0 = B = \frac{S_{yx}}{S_x^2} = \frac{\rho S_y S_x}{S_x^2} = \rho \frac{S_y}{S_x}$$
(called the linear regression coefficient of $y$ on $x$ in the population), and the resulting minimum variance is
$$V_{min}(\bar{y}_{lr}) = \frac{1 - f}{n} S_y^2 (1 - \rho^2).$$
Proof: We prove this theorem by contradiction. Let
$$b_0 \ne B = \frac{S_{yx}}{S_x^2}, \quad b_0 = B + d, \quad d \ne 0. \tag{4.6}$$
Substituting equation (4.6) in the expression for $V(\bar{y}_{lr})$, we get
$$V(\bar{y}_{lr}) = \frac{1 - f}{n} \left[S_y^2 + \left(\frac{S_{yx}}{S_x^2} + d\right)^2 S_x^2 - 2 \left(\frac{S_{yx}}{S_x^2} + d\right) S_{yx}\right]$$
$$= \frac{1 - f}{n} \left[S_y^2 + \frac{S_{yx}^2}{S_x^2} + d^2 S_x^2 + 2 d S_{yx} - 2 \frac{S_{yx}^2}{S_x^2} - 2 d S_{yx}\right] = \frac{1 - f}{n} \left[S_y^2 - \frac{S_{yx}^2}{S_x^2} + d^2 S_x^2\right].$$
Clearly the RHS is minimized when $d = 0$, i.e. $b_0 = B$, and then
$$V_{min}(\bar{y}_{lr}) = \frac{1 - f}{n} \left[S_y^2 - \frac{S_{yx}^2}{S_x^2}\right] = \frac{1 - f}{n} \left[S_y^2 - \frac{\rho^2 S_y^2 S_x^2}{S_x^2}\right] = \frac{1 - f}{n} S_y^2 (1 - \rho^2).$$
Alternatively, by calculus,
$$\frac{\partial V(\bar{y}_{lr})}{\partial b_0} = \frac{1 - f}{n} (2 b_0 S_x^2 - 2 S_{yx}) = 0, \quad \text{or} \quad b_0 = \frac{S_{yx}}{S_x^2}.$$
Therefore,
$$V(\bar{y}_{lr}) = \frac{1 - f}{n} \left[S_y^2 + \frac{S_{yx}^2}{S_x^2} - 2 \frac{S_{yx}^2}{S_x^2}\right] = \frac{1 - f}{n} \left[S_y^2 - \left(\frac{S_{yx}}{S_y S_x}\right)^2 S_y^2\right] = \frac{1 - f}{n} S_y^2 (1 - \rho^2).$$
Theorem: If $b$ is the least squares estimate of $B$ and $\bar{y}_{lr} = \bar{y} + b (\bar{X} - \bar{x})$, then under srs of size $n$,
$$V(\bar{y}_{lr}) = \frac{1 - f}{n} S_y^2 (1 - \rho^2),$$
provided $n$ is large enough that the error $(b - B)$ in $b$ is negligible.
Proof: Introduce the residual variate $e_i$, defined by the relation
$$e_i = y_i - \bar{Y} - B (x_i - \bar{X}).$$
Then $y_i = \bar{Y} + B (x_i - \bar{X}) + e_i$, so that
$$\bar{y}_{lr} = \bar{y} + b (\bar{X} - \bar{x}) = \bar{Y} + B (\bar{x} - \bar{X}) + \bar{e} + b (\bar{X} - \bar{x}) = \bar{Y} + \bar{e} - (b - B)(\bar{x} - \bar{X}).$$
Consider
$$b = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} [\bar{Y} + B (x_i - \bar{X}) + e_i] (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = B + \frac{\sum_{i=1}^{n} e_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$
using $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. Two properties of the $e_i$ are that $\sum_{i=1}^{N} e_i = 0$ and
$$\sum_{i=1}^{N} e_i (x_i - \bar{X}) = \sum_{i=1}^{N} [y_i - \bar{Y} - B (x_i - \bar{X})] (x_i - \bar{X}) = \sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X}) - B \sum_{i=1}^{N} (x_i - \bar{X})^2 = 0,$$
by the definition of $B = S_{yx} / S_x^2$. Thus
$$b - B \approx 0, \quad \text{since } \frac{1}{n - 1} \sum_{i=1}^{n} e_i (x_i - \bar{x}) \text{ is an unbiased estimate of } \frac{1}{N - 1} \sum_{i=1}^{N} e_i (x_i - \bar{X}) = 0.$$
Therefore, neglecting the term in $(b - B)$,
$$\bar{y}_{lr} \approx \bar{Y} + \bar{e}.$$
By definition,
$$V(\bar{y}_{lr}) = E(\bar{y}_{lr} - \bar{Y})^2 = E(\bar{e})^2 = V(\bar{e}), \quad \text{as } E(\bar{e}) = 0.$$
Since $\bar{e}$ is the sample mean of the $e_i$'s under srswor,
$$V(\bar{e}) = \frac{1 - f}{n} S_e^2, \quad \text{where } S_e^2 = \frac{1}{N - 1} \sum_{i=1}^{N} e_i^2 = \frac{1}{N - 1} \sum_{i=1}^{N} [y_i - \bar{Y} - B (x_i - \bar{X})]^2.$$
58 RU Khan
1 f N N N
( y i Y ) B ( xi X ) 2 B ( y i Y ) ( xi X )
2 2 2
n ( N 1) i 1 i 1 i 1
2
1 f 2 1 f 2 S yx 2 S yx
S
2 2
[ S y B S x 2 B S yx ] Sy Sx 2
n n S2
x
S2
x
yx
2
1 f 2 S ySx 1 f 2
S y
S y (1 2 ) V ( ylr ) .
n Sx n
Estimation of $V(\bar{y}_{lr})$
$$\hat{V}(\bar{y}_{lr}) = v(\bar{y}_{lr}) = \frac{1 - f}{n} s_y^2 (1 - r^2) = \frac{1 - f}{n} (s_y^2 - b^2 s_x^2), \quad \text{since } b = r \frac{s_y}{s_x},$$
$$= \frac{1 - f}{n} \left[\frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2 - b^2 \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2\right]$$
$$= \frac{1 - f}{n (n - 1)} \left[\sum_{i=1}^{n} (y_i - \bar{y})^2 - \frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]$$
$$= \frac{1 - f}{n (n - 1)} \left[\sum_{i=1}^{n} (y_i - \bar{y})^2 - b \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right]$$
$$= \frac{1 - f}{n (n - 1)} \left[\sum_{i=1}^{n} (y_i - \bar{y})^2 + b^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - 2 b \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right]$$
$$= \frac{1 - f}{n (n - 1)} \sum_{i=1}^{n} [(y_i - \bar{y}) - b (x_i - \bar{x})]^2.$$
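The regression estimator and its variance estimate condense to a few lines; a minimal sketch (the data at the end are illustrative, and chosen perfectly linear so that the residual variance is exactly zero):

```python
def regression_estimate(y, x, Xbar, N):
    """ybar_lr = ybar + b (Xbar - xbar) with least-squares b, plus
    v(ybar_lr) = (1-f)/(n(n-1)) * sum of squared residuals."""
    n = len(y)
    ybar, xbar = sum(y) / n, sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x))
    b = sxy / sxx                                   # least-squares slope
    y_lr = ybar + b * (Xbar - xbar)
    resid = sum((yi - ybar - b * (xi - xbar)) ** 2 for yi, xi in zip(y, x))
    v = (1 - n / N) / (n * (n - 1)) * resid
    return y_lr, v

# Perfectly linear illustrative sample: residuals, and hence v, are zero.
y_lr, v = regression_estimate([1, 2, 3], [1, 2, 3], Xbar=2.5, N=100)
```

Because the variance estimate is built from the regression residuals, the better the sample fits a straight line, the smaller it gets, matching the factor $(1 - r^2)$ in the closed form.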
Comparison of the linear regression estimate with the ratio and mean per unit estimates
For large samples,
$$V(\bar{y}_{sr}) = \frac{1 - f}{n} S_y^2, \tag{4.7}$$
$$V(\bar{y}_R) = \frac{1 - f}{n} (S_y^2 + R^2 S_x^2 - 2 R \rho S_y S_x), \tag{4.8}$$
$$V(\bar{y}_{lr}) = \frac{1 - f}{n} S_y^2 (1 - \rho^2). \tag{4.9}$$
From equations (4.7) and (4.9) it is clear that $V(\bar{y}_{lr}) \le V(\bar{y}_{sr})$, unless $\rho = 0$, in which case $V(\bar{y}_{lr}) = V(\bar{y}_{sr})$ and the two estimates are equally precise.
From equations (4.8) and (4.9), $\bar{y}_{lr}$ will be more precise than $\bar{y}_R$ if and only if $V(\bar{y}_{lr}) < V(\bar{y}_R)$. Consider
$$S_y^2 (1 - \rho^2) \le S_y^2 + R^2 S_x^2 - 2 R \rho S_y S_x \quad \text{or} \quad 0 \le \rho^2 S_y^2 + R^2 S_x^2 - 2 R \rho S_y S_x$$
$$\text{or} \quad (\rho S_y - R S_x)^2 \ge 0 \quad \text{or} \quad \left(\frac{S_{yx}}{S_y S_x} S_y - R S_x\right)^2 \ge 0 \quad \text{or} \quad (B - R)^2 S_x^2 \ge 0. \tag{4.10}$$
The LHS of equation (4.10) is a perfect square and hence always non-negative. Thus we conclude that the linear regression estimate is always better than the ratio estimate, except when $B = R$. Indeed, if $B = R$, then $b \approx \hat{R} = \bar{y} / \bar{x}$, and
$$\bar{y}_{lr} = \bar{y} + b (\bar{X} - \bar{x}) = \bar{y} + \frac{\bar{y}}{\bar{x}} (\bar{X} - \bar{x}) = \frac{\bar{y}}{\bar{x}} \bar{X} = \hat{R} \bar{X} = \hat{\bar{Y}}_R.$$
In that case both estimates, linear regression and ratio, have the same variance, and this occurs only when the regression of $y$ on $x$ is a straight line passing through the origin.
Corollary: In srs, the bias of $\bar{y}_{lr}$ is approximated by $B(\bar{y}_{lr}) = -\mathrm{Cov}(b, \bar{x})$, which will be negligible if the sample size is large.
Proof: We have $\bar{y}_{lr} = \bar{y} + b (\bar{X} - \bar{x})$, so that
$$E(\bar{y}_{lr}) = \bar{Y} + \bar{X} E(b) - E(b \bar{x}) = \bar{Y} - [E(b \bar{x}) - E(b) E(\bar{x})] = \bar{Y} - \mathrm{Cov}(b, \bar{x}),$$
since $E(\bar{x}) = \bar{X}$. Hence $B(\bar{y}_{lr}) = E(\bar{y}_{lr}) - \bar{Y} = -\mathrm{Cov}(b, \bar{x})$.
Example: An eye estimate of the fruit weight ($x_i$) on each tree in an orchard having 100 trees was made. The total weight ($X$) was found to be 12500 kg. A random sample of 10 trees was taken and the actual weights of fruits ($y_i$) along with the eye estimates were recorded:
Actual weight ($y_i$) : 51  42  46  39  71  61  58  57  58  67
With $\bar{y} = 55$, $\bar{x} = 56$, $\bar{X} = 125$ and $b = 0.9035$, the linear regression estimate of the mean is
$$\bar{y}_{lr} = \bar{y} + b (\bar{X} - \bar{x}) = 55 + 0.9035 (125 - 56) = 117.3415 \approx 117 \ \text{kg}.$$
Therefore, the linear regression estimate of the total fruit weight $Y$ is
$$\hat{Y}_{lr} = N \bar{y}_{lr} = 100 \times 117 = 11700 \ \text{kg}.$$
Exercise
1) In a study to estimate the total sugar content of a truckload of oranges, a random sample of 10 oranges was juiced and weighed; the data on the sugar content and weight of the oranges are given in the following table. The total weight of all the oranges, obtained by first weighing the loaded truck and then the unloaded truck, was found to be 1800 pounds. Estimate $Y$ (the total sugar content of the oranges).
Orange number Sugar content ( y ) Weight of orange (x)
1 0.021 0.41
2 0.030 0.48
3 0.025 0.43
4 0.022 0.42
5 0.033 0.50
6 0.027 0.46
7 0.019 0.39
8 0.021 0.41
9 0.023 0.42
10 0.025 0.44
2) For studying milk yield, feeding and management practices of milch animals in the year 2015-16, the whole of Haryana State was divided into 4 zones according to agro-climatic conditions. The total number of milch animals in 15 randomly selected villages of zone A, along with their livestock census data, are shown below:
S. No. of village 1 2 3 4 5 6 7 8
No. of milch animals in survey ( y ) 1129 1144 1125 1138 1137 1127 1163 1153
No. of milch animals in census (x) 1141 1144 1127 1153 1117 1140 1153 1146
S. No. of village 9 10 11 12 13 14 15
No. of milch animals in survey ( y ) 1164 1130 1153 1125 1116 1115 1122
No. of milch animals in census (x) 1189 1137 1170 1115 1130 1118 1122
3) Given that the total area under guava orchards of $N = 146$ villages is $X = 354.78$ acres, estimate the total number of guava trees along with its standard error, using the area under guava orchards as the auxiliary variate. Discuss the efficiency of your estimate compared with one which does not make any use of the information on the auxiliary variate.
CHAPTER V
Systematic Sampling
A sampling technique in which only the first unit is selected with the help of random numbers and the rest get selected automatically according to some pre-determined pattern (a regular spacing pattern) is known as systematic random sampling. Suppose the $N$ units of the population are numbered from 1 to $N$ in some order. Let $N = nk$, where $n$ is the sample size and $k = N / n$, an integer, is usually called the sampling interval. Draw a random number less than or equal to $k$, say $i$, and select the unit with the corresponding serial number and every $k$-th unit in the population thereafter. The resultant sample contains the $n$ units with serial numbers $i, i + k, i + 2k, \ldots, i + (n - 1)k$; it is called an every $k$-th systematic sample, and such a procedure is termed linear systematic sampling.
Example: There are 50 houses on a street. If a sample of size 5 is to be chosen, then $k = N / n = 10$. We select randomly one house out of the first ten, say the 3rd, and then take every 10th house after the selected one, i.e. the 3rd, 13th, 23rd, 33rd and 43rd houses.
Linear systematic sampling suffers from the limitation that it cannot be used when the sampling interval $N / n$ is not an integer. The procedure to be followed then is circular systematic sampling [proposed by Lahiri]. In this method, select the first item randomly out of all $N$ units and then take every $k$-th unit thereafter (where $k$ is the nearest integer to $N / n$) in a cyclical manner until $n$ sampling units are obtained.
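Both selection schemes can be sketched as follows (the 1-based unit numbering follows the text; the function names and the use of Python's `random` module are implementation choices, not from the text):

```python
import random

def linear_systematic(N, n, rng=random):
    """Every k-th systematic sample for N = nk: random start i in 1..k,
    then units i, i+k, ..., i+(n-1)k."""
    k = N // n
    i = rng.randint(1, k)
    return [i + j * k for j in range(n)]

def circular_systematic(N, n, rng=random):
    """Lahiri's circular scheme: random start in 1..N, then every k-th
    unit cyclically, with k the nearest integer to N / n."""
    k = round(N / n)
    start = rng.randint(1, N)
    return [(start - 1 + j * k) % N + 1 for j in range(n)]
```

The modulo arithmetic in the circular version wraps the selection around the end of the list, which is exactly what removes the requirement that $N / n$ be an integer.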
other blocks. (It has been found very useful for sampling material continuously distributed
over time or in space).
Notations
$y_{ij}$ is the observation on the $j$-th unit of the $i$-th sample, $i = 1, 2, \ldots, k$; $j = 1, 2, \ldots, n$.
$$\bar{y}_{i.} = \frac{1}{n} \sum_{j=1}^{n} y_{ij}, \quad \text{mean of the } i\text{-th systematic sample}.$$
For notational convenience, we shall write $\bar{y}_{sy}$ for the mean of the selected systematic sample, $\bar{y}_{i.}$.
$$\bar{Y} = \frac{1}{nk} \sum_{i=1}^{k} \sum_{j=1}^{n} y_{ij} = \frac{1}{k} \sum_{i=1}^{k} \bar{y}_{i.}, \quad \text{population mean}.$$
$$S^2 = \frac{1}{nk - 1} \sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar{Y})^2, \quad \text{mean square between units in the population}.$$
Corollary: $\hat{Y} = N \bar{y}_{sy}$ estimates the population total with variance
$$V(\hat{Y}) = V(N \bar{y}_{sy}) = N^2 V(\bar{y}_{sy}) = N^2 \left[\frac{N - 1}{N} S^2 - \frac{k (n - 1)}{N} S_{wsy}^2\right] = N (N - 1) S^2 - N k (n - 1) S_{wsy}^2,$$
where $S_{wsy}^2 = \frac{1}{k (n - 1)} \sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar{y}_{i.})^2$ is the mean square within the systematic samples.
Theorem: The mean of the systematic sample is more precise than the mean of a simple random sample if and only if $S_{wsy}^2 > S^2$.
Proof: If $\bar{y}$ is the mean of a simple random sample of size $n$, then $V(\bar{y}) = \frac{N - n}{n N} S^2$, and if $\bar{y}_{sy}$ is the mean of a systematic sample, then $V(\bar{y}_{sy}) = \frac{N - 1}{N} S^2 - \frac{k (n - 1)}{N} S_{wsy}^2$, so that
$$V(\bar{y}) - V(\bar{y}_{sy}) = \frac{N - n}{n N} S^2 - \frac{N - 1}{N} S^2 + \frac{k (n - 1)}{N} S_{wsy}^2$$
$$= \frac{n - 1}{n} S_{wsy}^2 - \frac{n - 1}{n} S^2 = \frac{n - 1}{n} (S_{wsy}^2 - S^2),$$
which is a positive quantity only when $S_{wsy}^2 > S^2$.
This result shows that systematic sampling is more precise than simple random sampling if the variance within the systematic samples is larger than the population variance as a whole.
The intraclass correlation coefficient between the pairs of units in the same systematic sample is defined as
$$\rho_w = \frac{E (y_{ij} - \bar{Y})(y_{ij'} - \bar{Y})}{E (y_{ij} - \bar{Y})^2} = \frac{\sum_{i=1}^{k} \sum_{j \ne j'} (y_{ij} - \bar{Y})(y_{ij'} - \bar{Y})}{(n - 1)(N - 1) S^2}, \quad \text{since } S^2 = \frac{1}{N - 1} \sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar{Y})^2,$$
so that
$$\sum_{i=1}^{k} \sum_{j \ne j'} (y_{ij} - \bar{Y})(y_{ij'} - \bar{Y}) = (n - 1)(N - 1) \rho_w S^2. \tag{5.1}$$
By definition,
$$V(\bar{y}_{sy}) = \frac{1}{k} \sum_{i=1}^{k} (\bar{y}_{i.} - \bar{Y})^2 = \frac{1}{k} \sum_{i=1}^{k} \left[\frac{1}{n} \sum_{j=1}^{n} y_{ij} - \bar{Y}\right]^2 = \frac{1}{n^2 k} \sum_{i=1}^{k} \left[\sum_{j=1}^{n} (y_{ij} - \bar{Y})\right]^2$$
$$= \frac{1}{n N} \left[\sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar{Y})^2 + \sum_{i=1}^{k} \sum_{j \ne j'} (y_{ij} - \bar{Y})(y_{ij'} - \bar{Y})\right]. \tag{5.2}$$
Substituting (5.1) into (5.2), we get
$$V(\bar{y}_{sy}) = \frac{1}{n N} [(N - 1) S^2 + (n - 1)(N - 1) \rho_w S^2] = \frac{(N - 1) S^2}{n N} [1 + (n - 1) \rho_w].$$
The relative precision of the systematic sample mean with respect to the simple random sample mean is given by
$$RP = \frac{V(\bar{y}_{sy})}{V(\bar{y})} = \frac{\frac{(N - 1) S^2}{n N} [1 + (n - 1) \rho_w]}{\frac{N - n}{n N} S^2} = \frac{N - 1}{N - n} [1 + (n - 1) \rho_w].$$
It can be seen that the relative precision depends on the value of $\rho_w$:
i) If $\rho_w = -\frac{1}{N - 1}$, the two methods give estimates of equal precision, since
$$\frac{V(\bar{y}_{sy})}{V(\bar{y})} = \frac{N - 1}{N - n} \left[1 - \frac{n - 1}{N - 1}\right] = 1.$$
ii) If $\rho_w < -\frac{1}{N - 1}$, the estimate based on the systematic sample is more precise, i.e. $V(\bar{y}_{sy}) / V(\bar{y}) < 1$.
iii) If $\rho_w > -\frac{1}{N - 1}$, systematic sampling is less precise than simple random sampling.
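Since a population of $N = nk$ units yields exactly $k$ possible systematic samples, $V(\bar{y}_{sy})$ can be obtained by enumeration and set against the srswor variance. A sketch on an illustrative population with a linear trend, where systematic sampling is expected to win:

```python
def systematic_vs_srs(pop, n):
    """Enumerate the k = N/n systematic samples of pop (N = nk) and return
    (V_sy, V_srs), where V_srs = (N - n)/(nN) * S^2."""
    N = len(pop)
    k = N // n
    Ybar = sum(pop) / N
    means = [sum(pop[i::k]) / n for i in range(k)]    # the k sample means
    V_sy = sum((m - Ybar) ** 2 for m in means) / k
    S2 = sum((y - Ybar) ** 2 for y in pop) / (N - 1)
    V_srs = (N - n) / (n * N) * S2
    return V_sy, V_srs

# A population following a linear trend (illustrative):
V_sy, V_srs = systematic_vs_srs(list(range(12)), 4)
```

Each systematic sample of a trending population spreads evenly across the range, so its mean varies little from sample to sample, which is why $V_{sy}$ comes out well below $V_{srs}$ here.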
Sample numbers
Stratum 1 : 1, 2, ..., i, ..., k
Stratum 2 : 1 + k, 2 + k, ..., i + k, ..., 2k
Stratum 3 : 1 + 2k, 2 + 2k, ..., i + 2k, ..., 3k
...
Stratum j : 1 + (j - 1)k, 2 + (j - 1)k, ..., i + (j - 1)k, ..., jk
...
Stratum n : 1 + (n - 1)k, 2 + (n - 1)k, ..., i + (n - 1)k, ..., nk
and one unit is drawn randomly from each stratum, thus giving a stratified sample of size n .
1 k
Then the mean of the j th stratum is y. j yij , j 1, 2,, n , and the population
k i 1
1 k n 1 n 1 k
mean is Y
nk i 1 j 1
yij y. j yi. , and the variance of the mean of this stratified
n j 1 k j 1
n
1 1 2 2
random sample will be V ( y st ) Wj S j .
n N j
j 1 j
Here,
2
2 Nj k2 1
N j k, n j 1, for all j 1, 2, , n , and W j .
2
N n2k 2 n2
Thus,
n S2
1
V ( y st ) 1
j
.
k j 1 n 2
where $S_j^2$ is the mean square between units of the $j$th stratum, defined as

$$S_j^2 = \frac{1}{k-1}\sum_{i=1}^{k}(y_{ij}-\bar{y}_{.j})^2.$$

Therefore,

$$V(\bar{y}_{st}) = \frac{1}{n^2}\left(1-\frac{1}{k}\right)\sum_{j=1}^{n}\frac{1}{k-1}\sum_{i=1}^{k}(y_{ij}-\bar{y}_{.j})^2 = \frac{(k-1)}{nk}\, S_{wst}^2,$$
where

$$S_{wst}^2 = \frac{1}{n(k-1)}\sum_{j=1}^{n}\sum_{i=1}^{k}(y_{ij}-\bar{y}_{.j})^2 = \frac{1}{n}\sum_{j=1}^{n} S_j^2$$

is the mean of the mean squares between units within strata. By definition,

$$\rho_{wst} = \frac{E(y_{ij}-\bar{y}_{.j})(y_{ij'}-\bar{y}_{.j'})}{E(y_{ij}-\bar{y}_{.j})^2} = \frac{\dfrac{1}{kn(n-1)}\displaystyle\sum_{i=1}^{k}\sum_{j\ne j'}(y_{ij}-\bar{y}_{.j})(y_{ij'}-\bar{y}_{.j'})}{\dfrac{1}{nk}\displaystyle\sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij}-\bar{y}_{.j})^2}$$

$$= \frac{\displaystyle\sum_{i=1}^{k}\sum_{j\ne j'}(y_{ij}-\bar{y}_{.j})(y_{ij'}-\bar{y}_{.j'})}{(n-1)\,n(k-1)\,S_{wst}^2}, \quad \text{as } S_{wst}^2 = \frac{1}{n(k-1)}\sum_{j=1}^{n}\sum_{i=1}^{k}(y_{ij}-\bar{y}_{.j})^2, \qquad (5.3)$$

so that

$$\sum_{i=1}^{k}\sum_{j\ne j'}(y_{ij}-\bar{y}_{.j})(y_{ij'}-\bar{y}_{.j'}) = n(n-1)(k-1)\,\rho_{wst}\, S_{wst}^2. \qquad (5.4)$$
By definition,

$$V(\bar{y}_{sy}) = \frac{1}{k}\sum_{i=1}^{k}(\bar{y}_{i.}-\bar{Y})^2 = \frac{1}{k}\sum_{i=1}^{k}\left[\frac{1}{n}\sum_{j=1}^{n} y_{ij}-\frac{1}{n}\sum_{j=1}^{n}\bar{y}_{.j}\right]^2 = \frac{1}{n^2k}\sum_{i=1}^{k}\left[\sum_{j=1}^{n}(y_{ij}-\bar{y}_{.j})\right]^2$$

$$= \frac{1}{n^2k}\left[\sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij}-\bar{y}_{.j})^2 + \sum_{i=1}^{k}\sum_{j\ne j'}(y_{ij}-\bar{y}_{.j})(y_{ij'}-\bar{y}_{.j'})\right]$$

$$= \frac{1}{n^2k}\left[n(k-1)S_{wst}^2 + n(n-1)(k-1)\rho_{wst}S_{wst}^2\right], \quad \text{by using equations (5.3) and (5.4)},$$

$$= \frac{(k-1)S_{wst}^2}{nk}\left[1+(n-1)\rho_{wst}\right].$$
The relative precision of the systematic sample mean with respect to the stratified sample mean is given by

$$\text{Relative precision } (RP) = \frac{V(\bar{y}_{sy})}{V(\bar{y}_{st})} = 1+(n-1)\rho_{wst}.$$

It can be seen that the relative precision depends on the value of $\rho_{wst}$:

i) If $\rho_{wst} = 0$, then $V(\bar{y}_{sy}) = V(\bar{y}_{st})$, i.e. the two methods give estimates of equal precision.

ii) If $\rho_{wst} < 0$, then $V(\bar{y}_{sy}) < V(\bar{y}_{st})$, i.e. the estimate based on the systematic sample is more precise than that based on stratified sampling.

iii) If $\rho_{wst} > 0$, then $V(\bar{y}_{sy}) > V(\bar{y}_{st})$, i.e. systematic sampling is less precise than stratified random sampling.
Suppose that the values of the successive units in the population follow a linear trend, so that $y_i = \alpha + \beta i$, $i = 1, 2, \ldots, N$, where $\alpha$ and $\beta$ are constants. In this case,
$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\sum_{i=1}^{N}(\alpha + \beta i) = \alpha + \frac{\beta}{N}\sum_{i=1}^{N} i = \alpha + \frac{\beta}{N}\cdot\frac{N(N+1)}{2} = \alpha + \beta\,\frac{N+1}{2},$$
and

$$(N-1)S^2 = \sum_{i=1}^{N}(y_i-\bar{Y})^2 = \beta^2\sum_{i=1}^{N}\left(i-\frac{N+1}{2}\right)^2 = \beta^2\left[\sum_{i=1}^{N} i^2 - (N+1)\sum_{i=1}^{N} i + \frac{N(N+1)^2}{4}\right]$$

$$= \beta^2\left[\frac{N(N+1)(2N+1)}{6} - \frac{N(N+1)^2}{4}\right] = \beta^2\,\frac{2N(N+1)(2N+1) - 3N(N+1)^2}{12} = \beta^2\,\frac{N(N+1)(N-1)}{12}.$$

Therefore,

$$S^2 = \frac{N(N+1)}{12}\,\beta^2. \qquad (5.5)$$
Suppose the population of size $N$ is divided into $n$ classes of $k$ units each, i.e. $N = nk$; then

$$S^2 = \frac{nk(nk+1)}{12}\,\beta^2,$$

and accordingly $V(\bar{y}_{sr})$ reduces to

$$V(\bar{y}_{sr}) = \frac{N-n}{nN}\,S^2 = \frac{n(k-1)}{n^2 k}\cdot\frac{nk(nk+1)}{12}\,\beta^2 = \frac{(k-1)(nk+1)}{12}\,\beta^2. \qquad (5.6)$$
For stratified random sampling, we know that in a population with $N = nk$,

$$V(\bar{y}_{st}) = \frac{k-1}{n^2 k}\sum_{j=1}^{n} S_j^2,$$

where $S_j^2$ is the mean square between units of the $j$th stratum,

$$S_j^2 = \frac{1}{k-1}\sum_{i=1}^{k}(y_i-\bar{Y}_k)^2 = \frac{1}{k-1}\left[\sum_{i=1}^{k} y_i^2 - k\,\bar{Y}_k^2\right], \quad \text{where } \bar{Y}_k = \frac{1}{k}\sum_{i=1}^{k} y_i.$$
From equation (5.5), after replacing $N$ by $k$, we get

$$S_j^2 = \frac{k(k+1)}{12}\,\beta^2,$$

and hence,

$$V(\bar{y}_{st}) = \frac{k-1}{n^2 k}\cdot n\cdot\frac{k(k+1)}{12}\,\beta^2 = \frac{(k-1)(k+1)}{12\,n}\,\beta^2 = \frac{k^2-1}{12\,n}\,\beta^2. \qquad (5.7)$$
For systematic sampling, the mean of the second sample exceeds that of the first by $\beta$, the mean of the third exceeds that of the second by $\beta$, and so on. Thus, apart from a constant, the means $\bar{y}_{i.}$ may be replaced by the numbers $\beta, 2\beta, 3\beta, \ldots, k\beta$.
Hence,

$$V(\bar{y}_{sy}) = \frac{1}{k}\sum_{i=1}^{k}(\bar{y}_{i.}-\bar{Y})^2 = \frac{1}{k}\left[\sum_{i=1}^{k}\bar{y}_{i.}^2 - k\left(\frac{1}{k}\sum_{i=1}^{k}\bar{y}_{i.}\right)^2\right], \quad \text{since } \bar{Y} = \frac{1}{k}\sum_{i=1}^{k}\bar{y}_{i.},$$

$$= \beta^2\left[\frac{1}{k}\sum_{i=1}^{k} i^2 - \left(\frac{1}{k}\sum_{i=1}^{k} i\right)^2\right] = \beta^2\left[\frac{(k+1)(2k+1)}{6} - \frac{(k+1)^2}{4}\right]$$

$$= \beta^2\,(k+1)\,\frac{2(2k+1)-3(k+1)}{12} = \frac{k^2-1}{12}\,\beta^2. \qquad (5.8)$$
Comparing the three variances obtained in equations (5.6), (5.7) and (5.8), we see that

$$V(\bar{y}_{st}) = \frac{k^2-1}{12\,n}\,\beta^2 \;\le\; V(\bar{y}_{sy}) = \frac{k^2-1}{12}\,\beta^2 \;\le\; V(\bar{y}_{sr}) = \frac{(k-1)(nk+1)}{12}\,\beta^2.$$
Exercise
Given below are the daily milk yield (in litres) records of the first lactation of a specified cow belonging to the Tharparkar herd maintained at the Government Cattle Farm, India. The milk yields of the first five days were not recorded, this being the colostrum period.
Day 1 2 3 4 5 6 7 8 9 10
Milk yield 10 11 14 10 14 9 10 8 11 10
Day 11 12 13 14 15 16 17 18 19 20
Milk yield 6 9 8 7 9 10 11 11 13 12
Day 21 22 23 24 25 26 27 28 29 30
Milk yield 12 10 11 11 14 15 12 17 18 16
Day 31 32 33 34 35 36 37 38 39 40
Milk yield 13 14 14 15 16 16 16 13 16 17
Day 41 42 43 44 45 46 47 48 49 50
Milk yield 14 16 15 14 14 15 17 15 16 17
Day 51 52 53 54 55 56 57 58 59 60
Milk yield 25 22 23 19 18 16 22 21 21 23
Day 61 62 63 64 65 66 67 68 69 70
Milk yield 21 19 19 19 19 19 19 19 19 19
Day 71 72 73 74 75 76 77 78 79 80
Milk yield 18 19 21 20 17 16 18 18 18 22
Day 81 82 83 84 85 86 87 88 89 90
Milk yield 22 22 20 20 20 18 20 21 21 20
Day 91 92 93 94 95 96 97 98 99 100
Milk yield 18 21 22 22 20 21 21 21 21 21
Find the efficiency of systematic sampling at 7- and 14-day intervals of recording, with respect to the corresponding simple random sampling, in estimating the lactation yield of the cow.
CHAPTER VI
DESIGN OF EXPERIMENTS
Design of experiments deals with planning an experiment: deciding how the observations or measurements should be obtained so that a question can be answered in a valid, efficient and economical way. The design of the experiment and the analysis of the resulting data are inseparable. If the experiment is designed properly, keeping the question in mind, the data generated are valid and a proper analysis provides valid statistical inferences. If the experiment is not well designed, the validity of the statistical inferences is questionable and they may be invalid. It is important first to understand the basic terminology used in experimental design.
Experiment: An experiment is a procedure carried out to make a discovery, test a hypothesis, or understand a cause-and-effect relationship. In other words, it is a way of getting an answer to a question in which the experimenter is interested.
Experimental Unit: For conducting an experiment, the experimental material is divided into smaller parts, and each part is referred to as an experimental unit. It is the unit to which a treatment is applied. Examples are a manufactured item, a plot of land, etc.
Treatment: An object whose effect is measured and compared is called a treatment; for example, drugs, fertilizers, teaching methods, etc.
Experimental Error: It describes the failure of two identically treated experimental units to give identical results. It may arise from mistakes in data entry, systematic errors or mistakes in the design of the experiment itself, or random error caused by environmental conditions or other unpredictable factors.
Completely Randomized Design (CRD): In a CRD the treatments are allocated to the experimental units entirely at random; that is, the randomization is done without any restrictions. The design is completely flexible, i.e., any number of treatments and any number of units per treatment may be used. Moreover, the number of units per treatment need not be equal. A completely randomized design is considered to be most useful in situations where
i) the experimental units are homogeneous
ii) the experiments are small such as laboratory experiments and greenhouse studies, and
iii) some experimental units are likely to be destroyed or fail to respond.
Layout of CRD
A CRD is one in which all the experimental units are taken in a single group that is as homogeneous as possible.
The randomization procedure for allotting the treatments to various units will be as follows.
Step 1: Determine the total number of experimental units.
Step 2: Assign a plot number to each experimental unit, starting from left to right along the rows.
Step 3: Assign the treatments to the experimental units by using random numbers.
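The three steps can be sketched in a few lines of Python (the function name and labels are my own):

```python
import random

def crd_layout(treatments, reps, seed=None):
    """Allot treatments to experimental units completely at random.

    Step 1: the total number of units is len(treatments) * reps.
    Step 2: units are numbered 1..n, left to right along the rows.
    Step 3: treatments are allotted to the numbered units at random.
    """
    rng = random.Random(seed)
    plots = [trt for trt in treatments for _ in range(reps)]
    rng.shuffle(plots)
    return {unit: trt for unit, trt in enumerate(plots, start=1)}
```

For example, `crd_layout(['A', 'B', 'C', 'D', 'E'], 4)` returns a dict mapping the 20 plot numbers to varieties, with each variety appearing on exactly 4 plots.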
The statistical model for a CRD with one observation per unit is

$$Y_{ij} = \mu + t_i + e_{ij},$$

where $\mu$ is the general mean effect, $t_i$ the true effect of the $i$th treatment and $e_{ij}$ the random error. Denote the treatment totals by $Y_1, Y_2, \ldots, Y_i, \ldots, Y_k$, with grand total $GT$.
The different steps in forming the analysis of variance table for a CRD are:

$$\text{Correction Factor } (CF) = \frac{(GT)^2}{n}, \quad \text{where } n \text{ is the total number of observations,}$$

$$\text{Total sum of squares } (TSS) = \sum_{i=1}^{k}\sum_{j=1}^{r_i} y_{ij}^2 - CF,$$

$$\text{Treatment sum of squares } (SST) = \frac{Y_1^2}{r_1}+\frac{Y_2^2}{r_2}+\cdots+\frac{Y_k^2}{r_k} - CF = \sum_{i=1}^{k}\frac{Y_i^2}{r_i} - CF,$$

$$\text{Error sum of squares } (SSE) = \sum_{i=1}^{k}\sum_{j=1}^{r_i} y_{ij}^2 - \sum_{i=1}^{k}\frac{Y_i^2}{r_i} = TSS - SST.$$
Form the ANOVA table:

Source of variation         d.f.     SS     MSS                     Variance ratio F
Treatments                  t - 1    SST    MST = SST/(t - 1)       F = MST/MSE
Error (within treatments)   n - t    SSE    MSE = SSE/(n - t)
Total                       n - 1    TSS
Compare the calculated F with the critical value of F ; (t 1), ( nt ) (corresponding to treatment
degree of freedom and error degree of freedom) so that acceptance or rejection of the null
hypothesis can be determined.
If the null hypothesis is rejected, this indicates that there are significant differences between the treatments. In that case, calculate the critical difference

$$CD = SE(d) \times t_{\alpha;\,(n-t)}, \quad \text{where } SE(d) = \sqrt{MSE\left(\frac{1}{r_i}+\frac{1}{r_j}\right)},$$

$r_i$ is the number of replications for treatment $i$, and $t_{\alpha;(n-t)}$ is the critical $t$ value for the error degrees of freedom at the specified level of significance, either 5% or 1%.
Advantages of a CRD
i) Its layout is very easy.
ii) There is complete flexibility in this design, i.e., any number of treatments and replications for each treatment can be tried.
iii) Whole experimental material can be utilized in this design.
iv) This design yields maximum degrees of freedom for experimental error.
v) The analysis of data is simplest as compared to any other design.
vi) Even if some values are missing the analysis can be done.
Disadvantages of a CRD
i) It is difficult to find experimental units that are homogeneous in all respects, and hence a CRD is seldom suitable for field experiments as compared to other experimental designs.
ii) It is less accurate than other designs.
Example: The following table gives the yield in kg per plot of five varieties of wheat, each applied to four plots in a completely randomized design.

Variety   Yields          Total Y_i    Mean
A         8   8   6  10   32 (T_1)     8
B         10  12  13  9   44 (T_2)     11
C         18  17  13  16  64 (T_3)     16
D         12  10  15  11  48 (T_4)     12
E         8   11  9   8   36 (T_5)     9

Grand total $GT = 224$.

Analysis

$$\text{Correction Factor } (CF) = \frac{(GT)^2}{n} = \frac{(224)^2}{4\times 5} = 2508.8,$$

$$\text{Total sum of squares } (TSS) = \sum_{i=1}^{k}\sum_{j=1}^{r_i} y_{ij}^2 - CF = 8^2 + 8^2 + \cdots + 9^2 + 8^2 - 2508.8 = 207.2.$$
i 1 j 1
Here the $F$ test indicates that there are significant differences between the variety means, since the observed value of the variance ratio is significant at the 5% level of significance. We now wish to know which variety is the best, and which varieties differ significantly among themselves. This can be done with the help of the critical difference $(CD)$.
Now, the standard error of the difference between two treatment means is

$$SE(d) = \sqrt{MSE\left(\frac{1}{r_i}+\frac{1}{r_j}\right)} = \sqrt{\frac{2\,MSE}{r}} = \sqrt{\frac{2\times 3.47}{4}} = 1.32.$$

Therefore,

$$\text{Critical Difference } (CD) = SE(d)\times t_{0.05;\,15} = 1.32\times 2.131 = 2.81.$$
Since the F test in the analysis of variance indicates significant differences between the
varieties, we are justified in comparing the varieties with the help of the above Critical
Difference value.
Varieties A B C D E CD
Mean yields 8 11 16 12 9 2.81
The varieties which do not differ significantly have been underlined by a bar. This method of underlining the treatments which do not differ significantly indicates the significance or non-significance of individual comparisons.
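The whole computation can be checked with a short script. This is a sketch (the function and variable names are mine); the data are the variety yields from the exercise below, whose totals and means agree with the example.

```python
def crd_anova(groups):
    """One-way (CRD) analysis of variance from {treatment: [yields]}.
    Returns CF, TSS, SST, SSE and the variance ratio F."""
    n = sum(len(v) for v in groups.values())          # total observations
    gt = sum(sum(v) for v in groups.values())         # grand total
    cf = gt ** 2 / n                                  # correction factor
    tss = sum(x * x for v in groups.values() for x in v) - cf
    sst = sum(sum(v) ** 2 / len(v) for v in groups.values()) - cf
    sse = tss - sst
    k = len(groups)
    f = (sst / (k - 1)) / (sse / (n - k))             # MST / MSE
    return cf, tss, sst, sse, f

# Yields of the five wheat varieties on four plots each
data = {'A': [8, 8, 6, 10], 'B': [10, 12, 13, 9], 'C': [18, 17, 13, 16],
        'D': [12, 10, 15, 11], 'E': [8, 11, 9, 8]}
```

Running `crd_anova(data)` reproduces $CF = 2508.8$, $TSS = 207.2$ and $MSE = SSE/15 \approx 3.47$.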
Exercise
1) The following table gives the yields of five varieties of paddy, with four replications each, using a completely randomized design.
Varieties Yield in kg.
A 8 8 6 10
B 10 12 13 9
C 18 17 13 16
D 12 10 15 11
E 8 11 9 8
Layout of RBD
Suppose we have four treatments A, B, C and D each replicated four times. We divide the
whole experimental area into four relatively homogeneous blocks and each block into four
units or plots. Treatments are allocated at random to the units of the blocks, fresh
randomization being done for each block. A particular layout may be as follows:
Block 1 A B D C
Block 2 D C B A
Block 3 C B A D
Block 4 A D C B
Analysis
For analysis of this design we use the following linear additive model
$$Y_{ij} = \mu + t_i + b_j + e_{ij}, \quad i = 1, 2, \ldots, t;\; j = 1, 2, \ldots, r,$$

where
$\mu$ = overall mean effect,
$t_i$ = true effect of the $i$th treatment,
$b_j$ = true effect of the $j$th block.
Treatment   Blocks                                          Total
1           y_11   y_12   ...   y_1j   ...   y_1r           T_1
2           y_21   y_22   ...   y_2j   ...   y_2r           T_2
...
i           y_i1   y_i2   ...   y_ij   ...   y_ir           T_i
...
t           y_t1   y_t2   ...   y_tj   ...   y_tr           T_t
Total       B_1    B_2    ...   B_j    ...   B_r            GT
Let

$$Y_{i.} = \sum_{j=1}^{r} y_{ij} = T_i = \text{total of the } i\text{th treatment},$$

$$Y_{.j} = \sum_{i=1}^{t} y_{ij} = B_j = \text{total of the } j\text{th block},$$

$$Y_{..} = \sum_{i=1}^{t}\sum_{j=1}^{r} y_{ij} = GT = \sum_{i=1}^{t} T_i = \sum_{j=1}^{r} B_j.$$
The following formulae are used for the analysis of variance of this design:

$$\text{Correction Factor } (CF) = \frac{(GT)^2}{n}, \quad \text{where } n \text{ is the total number of observations,}$$

$$\text{Total sum of squares } (TSS) = \sum_{i=1}^{t}\sum_{j=1}^{r} y_{ij}^2 - CF,$$

$$\text{Treatment sum of squares } (SST) = \sum_{i=1}^{t}\frac{T_i^2}{r} - CF = \frac{1}{r}\sum_{i=1}^{t} Y_{i.}^2 - CF,$$

$$\text{Block sum of squares } (SSB) = \sum_{j=1}^{r}\frac{B_j^2}{t} - CF = \frac{1}{t}\sum_{j=1}^{r} Y_{.j}^2 - CF,$$

$$\text{Error sum of squares } (SSE) = TSS - SST - SSB.$$
The null hypotheses will be
$H_0: T_1 = T_2 = \cdots = T_t$ (there is no significant difference between the treatments),
$B_1 = B_2 = \cdots = B_r$ (there is no significant difference between the blocks),
and the alternative hypotheses are
$H_A$: the $T_i$ are not all equal (there is a significant difference between the treatments),
the $B_j$ are not all equal (there is a significant difference between the blocks).
Form the following ANOVA table and calculate the $F$ values:

Source of variation         d.f.             SS     MSS                           Variance ratio F
Treatments                  t - 1            SST    MST = SST/(t - 1)             F = MST/MSE
Replication (blocks)        r - 1            SSB    MSB = SSB/(r - 1)             F = MSB/MSE
Error (within treatments)   (r - 1)(t - 1)   SSE    MSE = SSE/[(r - 1)(t - 1)]
Total                       rt - 1 = n - 1   TSS
Compare the calculated $F$ with the critical values $F_{\alpha;\,(t-1),(r-1)(t-1)}$ and $F_{\alpha;\,(r-1),(r-1)(t-1)}$, respectively, so that acceptance or rejection of the null hypothesis can be determined. If the null hypothesis is rejected, this indicates that there are significant differences. Calculate the critical difference

$$CD = SE(d)\times t_{\alpha;\,(r-1)(t-1)}, \quad \text{where } SE(d) = \sqrt{\frac{2\,MSE}{r}}.$$
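The RBD computations above can be sketched as follows (names are mine); the input is a treatments-by-blocks table of yields.

```python
def rbd_anova(table):
    """Two-way (RBD) ANOVA from table[i][j] = yield of treatment i in block j.
    Returns SST, SSB, SSE and the F ratios for treatments and blocks."""
    t, r = len(table), len(table[0])
    gt = sum(map(sum, table))                          # grand total
    cf = gt ** 2 / (t * r)                             # correction factor
    tss = sum(x * x for row in table for x in row) - cf
    sst = sum(sum(row) ** 2 for row in table) / r - cf
    ssb = sum(sum(table[i][j] for i in range(t)) ** 2 for j in range(r)) / t - cf
    sse = tss - sst - ssb
    mse = sse / ((r - 1) * (t - 1))
    return sst, ssb, sse, (sst / (t - 1)) / mse, (ssb / (r - 1)) / mse
```

For a toy 2 x 2 table such as `[[1, 2], [4, 3]]` the pieces add up as they should: $TSS = SST + SSB + SSE$.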
Advantages of a RBD
The precision is higher in an RBD: the amount of information obtained is greater than in a CRD. The RBD is more flexible, and the statistical analysis is simple and easy. Even if some values are missing, the analysis can still be done by using the missing plot technique.
Disadvantages of RBD
When the number of treatments is increased, the block size increases. If the block size is large, maintaining homogeneity within blocks is difficult, and hence this design may not be suitable when the number of treatments is large.
Missing plot technique: suppose the observation of the first treatment in the first block, say $y$, is missing, and let $T_1$, $B_1$ and $G$ denote the totals of the available observations of the affected treatment, the affected block, and the whole experiment. Then

$$SSE = y^2 - \frac{(T_1+y)^2}{r} - \frac{(B_1+y)^2}{t} + \frac{(G+y)^2}{rt} + \text{terms not containing } y.$$
Now the best value of $y$ is the one that minimizes $SSE$. By the method of least squares it is obtained by differentiating $SSE$ with respect to $y$ and equating to zero:

$$\frac{\partial\, SSE}{\partial y} = 0 \;\Rightarrow\; 2y - \frac{2(T_1+y)}{r} - \frac{2(B_1+y)}{t} + \frac{2(G+y)}{rt} = 0$$

$$\Rightarrow\; y\left(1-\frac{1}{r}-\frac{1}{t}+\frac{1}{rt}\right) = \frac{T_1}{r} + \frac{B_1}{t} - \frac{G}{rt}$$

$$\Rightarrow\; y\,\frac{(r-1)(t-1)}{rt} = \frac{t\,T_1 + r\,B_1 - G}{rt} \;\Rightarrow\; y = \frac{t\,T_1 + r\,B_1 - G}{(r-1)(t-1)}.$$
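The estimate can be verified on a perfectly additive table, where the formula recovers the deleted value exactly (the function name is mine):

```python
def rbd_missing_value(T1, B1, G, r, t):
    """Least-squares estimate of a single missing plot in an RBD:
    T1, B1, G are the totals of the available observations in the
    affected treatment, the affected block, and overall."""
    return (t * T1 + r * B1 - G) / ((r - 1) * (t - 1))
```

For the additive table $y_{ij} = i + 10j$ ($t = 3$ treatments, $r = 4$ blocks) with $y_{11} = 11$ deleted, the available totals are $T_1 = 93$, $B_1 = 25$, $G = 313$, and the formula returns 11.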
Example: The yields of six nitrogen treatments on a crop in kgs along with the plan of the
experiment are given below. The number of blocks is five and the nitrogen treatments have
been represented by A, B, C, D, E and F.
Block I Block II Block III Block IV Block V
D B E A F
17 12 23 28 75
C C A F C
12 15 30 64 14
F E C B D
70 26 16 9 20
B A D D B
6 26 20 23 7
A D F E E
20 10 56 33 30
E F B C A
28 62 10 14 23
Analysis
The first step in the analysis of the data is to tabulate the yield figures by treatment and block, in the following manner:

Treatment   Block I  II  III  IV  V    Total T_i   Mean
B           9   12   10   9   7        47 (T_2)    9.4
C           12  15   16   14  14       71 (T_3)    14.2
D           17  10   20   23  20       90 (T_4)    18.0

(the rows for A, E and F are formed in the same way; the grand total is $GT = 804$)
$$\text{Correction Factor } (CF) = \frac{(GT)^2}{n} = \frac{(804)^2}{5\times 6} = 21547.2,$$

$$\text{Total sum of squares } (TSS) = \sum_{i=1}^{t}\sum_{j=1}^{r} y_{ij}^2 - CF.$$
It is clear from the table that the observed value of $F$ is significant at the 5% level of significance, which shows that there are significant differences between the treatment means. We now test the significance of the differences between individual treatments with the help of the $CD$ as usual.
Critical Difference
The standard error of the difference between any two treatment means is

$$SE(d) = \sqrt{\frac{2\,MSE}{r}} = \sqrt{\frac{2\times 20.91}{5}} = 2.89.$$

Therefore,

$$\text{Critical Difference } (CD) = SE(d)\times t_{0.05;\,20} = 2.89\times 2.086 = 6.03.$$
Conclusions represented symbolically
The treatments have been compared by setting them in the descending order of their mean
yields in the following manner
Varieties F E A D C B
Mean yield 65.4 28.4 25.4 18.0 14.2 9.4
The treatments which do not differ significantly have been underlined by a bar. The treatment 'F' has been found to be the best.
Exercise
1) The yield of rice (in kg) with five fertilizers tested in four blocks using an RBD is given in the following layout. Analyze the data and interpret your conclusions.
Block 1 Block 2 Block 3 Block 4
B C A D
10 13 19 20
C A D E
16 21 24 36
A D E B
20 21 32 9
D E B C
23 31 10 13
E B C A
33 11 14 24
2) An experiment was conducted in an RBD to study the comparative performance of six varieties of oranges; the yields (kg/plot) are given below. Analyze the data and give your conclusions.
Treatment Blocks
B1 B2 B3 B4 B5
V1 5.5 5.9 6.3 6.5 6.7
V2 7.4 7.7 7.9 7.5 8.1
V3 4.6 5.1 5.3 4.9 4.7
V4 5.0 5.8 5.6 6.1 5.3
V5 6.7 6.2 6.9 6.8 6.0
V6 8.2 7.9 7.5 7.2 6.9
A Latin square design (LSD) of size $m$ is an arrangement of $m$ Latin letters into $m^2$ positions such that every row and every column contains every treatment precisely once.
Layout of LSD
In this design, the whole experimental material is divided into m 2 experimental units and
arranged in a square so that each row as well as each column contains m units. The m
treatments are then allocated at random to these rows and columns in such a way that every
treatment occurs once and only once in each row and each column.
In an LSD the treatments are usually denoted by A, B, C, D, etc. For a $5\times 5$ LSD the arrangements may be
A B C D E
B A E C D
C D A E B
D E B A C
E C D B A
Square 1
A B C D E
B A D E C
C E A B D
D C E A B
E D B C A
Square 2
A B C D E
B C D E A
C D E A B
D E A B C
E A B C D
Square 3
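One simple way to generate such an arrangement (a sketch; note it does not sample uniformly from all Latin squares) is to start from the cyclic square and randomly permute rows, columns and letters:

```python
import random

def random_latin_square(treatments, seed=None):
    """A randomized m x m Latin square built from the cyclic square."""
    rng = random.Random(seed)
    m = len(treatments)
    square = [[(i + j) % m for j in range(m)] for i in range(m)]  # cyclic square
    rng.shuffle(square)                                           # permute rows
    cols = list(range(m))
    rng.shuffle(cols)                                             # permute columns
    square = [[row[c] for c in cols] for row in square]
    perm = list(treatments)
    rng.shuffle(perm)                                             # relabel treatments
    return [[perm[v] for v in row] for row in square]
```

Each of the three operations preserves the Latin property, so every row and column of the result still contains each treatment exactly once.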
Analysis of LSD
For analysis of this design, we use the linear additive model
$$Y_{ijk} = \mu + r_i + c_j + t_k + e_{ijk}, \quad i, j, k = 1, 2, \ldots, t,$$

where $Y_{ijk}$ is the observation on the $k$th treatment in the $i$th row and $j$th column.
$$\text{Correction Factor } (CF) = \frac{(GT)^2}{t^2}, \quad \text{where } t^2 = n \text{ is the total number of observations,}$$

$$\text{Total sum of squares } (TSS) = \sum_{i=1}^{t}\sum_{j=1}^{t}\sum_{k=1}^{t} y_{ijk}^2 - CF,$$

$$\text{Treatment sum of squares } (SST) = \sum_{k=1}^{t}\frac{T_k^2}{t} - CF,$$

$$\text{Row sum of squares } (SSR) = \sum_{i=1}^{t}\frac{R_i^2}{t} - CF,$$

$$\text{Column sum of squares } (SSC) = \sum_{j=1}^{t}\frac{C_j^2}{t} - CF,$$

$$\text{Error sum of squares } (SSE) = TSS - SST - SSR - SSC.$$
Form the ANOVA table:

Source of variation   d.f.             SS     MSS                          Variance ratio F
Treatments            t - 1            SST    MST = SST/(t - 1)            F = MST/MSE
Rows                  t - 1            SSR    MSR = SSR/(t - 1)            F = MSR/MSE
Columns               t - 1            SSC    MSC = SSC/(t - 1)            F = MSC/MSE
Error                 (t - 1)(t - 2)   SSE    MSE = SSE/[(t - 1)(t - 2)]
Total                 t^2 - 1          TSS
If the calculated value of F is greater than tabulated value of F ; (t 1) (t 2) , then reject H 0 ,
otherwise accept it.
Advantages of LSD
An LSD is more efficient than an RBD or a CRD, because the double grouping (rows and columns) results in a smaller experimental error. When missing values are present, the missing plot technique can be used and the data analyzed.
Disadvantages of LSD
This design is not as flexible as an RBD or a CRD, since the number of treatments is limited to the number of rows and columns. An LSD is seldom used when the number of treatments exceeds 12, and it is not suitable for fewer than 5 treatments.
Missing plot technique: suppose the observation $y$ in the first row and first column, receiving the first treatment, is missing, and let $R_1$, $C_1$, $T_1$ and $G$ denote the totals of the available observations in the affected row, the affected column, the affected treatment, and the whole square. Then

$$\text{Total sum of squares } (TSS) = y^2 - \frac{(G+y)^2}{t^2} + \text{terms not containing } y,$$

$$\text{Treatment sum of squares } (SST) = \frac{(T_1+y)^2}{t} - \frac{(G+y)^2}{t^2} + \text{terms not containing } y,$$

$$\text{Row sum of squares } (SSR) = \frac{(R_1+y)^2}{t} - \frac{(G+y)^2}{t^2} + \text{terms not containing } y,$$

$$\text{Column sum of squares } (SSC) = \frac{(C_1+y)^2}{t} - \frac{(G+y)^2}{t^2} + \text{terms not containing } y,$$

$$\text{Error sum of squares } (SSE) = TSS - SST - SSR - SSC.$$
Error sum of square (SSE) TSS SST SSR SSC
Now the best value of $y$ is the one that minimizes $SSE$. By the method of least squares it is obtained by differentiating $SSE$ with respect to $y$ and equating to zero:

$$\frac{\partial\,SSE}{\partial y} = 0 \;\Rightarrow\; 2y - \frac{2(T_1+y)}{t} - \frac{2(R_1+y)}{t} - \frac{2(C_1+y)}{t} + \frac{4(G+y)}{t^2} = 0$$

$$\Rightarrow\; t^2 y - t(T_1+y) - t(R_1+y) - t(C_1+y) + 2(G+y) = 0$$

$$\Rightarrow\; (t^2 - 3t + 2)\,y = t(T_1 + R_1 + C_1) - 2G \;\Rightarrow\; y = \frac{t(T_1 + R_1 + C_1) - 2G}{(t-1)(t-2)}.$$

The value of $y$ is substituted for the missing observation and the analysis of the data is carried out as usual, subtracting one degree of freedom from both error and total.
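As with the RBD, the estimate can be checked on a perfectly additive square, where the formula recovers the deleted value exactly (the function name is mine):

```python
def lsd_missing_value(T1, R1, C1, G, t):
    """Least-squares estimate of one missing plot in a t x t Latin square;
    T1, R1, C1, G are the available totals for the affected treatment,
    row, column, and the whole experiment."""
    return (t * (T1 + R1 + C1) - 2 * G) / ((t - 1) * (t - 2))
```

For the additive 3 x 3 cyclic square $y_{ij} = 10i + j$ ($i, j = 0, 1, 2$) with $y_{11} = 11$ deleted, the available totals are $T_1 = R_1 = C_1 = 22$ and $G = 88$, and the formula returns 11.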
Example: Below are given the plan and yields in kg/plot of a $5\times 5$ Latin square experiment on the wheat crop, carried out for testing the effects of five manurial treatments A, B, C, D and E, where 'A' denotes the control.
B A E D C
R1 77
15 8 17 20 17
A D C E B
R2 78
9 21 19 16 13
C B D A E
R3 78
18 12 23 8 17
E C A B D
R4 82
18 16 10 15 23
D E B C A
R5 78
22 15 13 18 10
C1 82 , C2 72 , C3 82 , C4 77 , C5 80 ; GT 393
Analyze the data and state your conclusions.
Analysis
$$\text{Correction Factor } (CF) = \frac{(GT)^2}{t^2} = \frac{(393)^2}{25} = 6177.96,$$
$$\text{Total sum of squares } (TSS) = \sum_{i}\sum_{j}\sum_{k} y_{ijk}^2 - CF = 15^2 + 8^2 + \cdots + 10^2 - CF = 483.04,$$

$$\text{Treatment sum of squares } (SST) = \sum_{k=1}^{t}\frac{T_k^2}{t} - CF = \frac{(45)^2 + (68)^2 + (83)^2 + (88)^2 + (109)^2}{5} - 6177.96 = 454.64,$$

$$\text{Row sum of squares } (SSR) = \sum_{i=1}^{t}\frac{R_i^2}{t} - CF = \frac{(77)^2 + (78)^2 + (78)^2 + (82)^2 + (78)^2}{5} - 6177.96 = 3.04,$$
$$\text{Column sum of squares } (SSC) = \sum_{j=1}^{t}\frac{C_j^2}{t} - CF = \frac{(82)^2 + (72)^2 + (82)^2 + (77)^2 + (80)^2}{5} - 6177.96 = 14.24,$$

$$\text{Error sum of squares } (SSE) = TSS - SST - SSR - SSC = 483.04 - 454.64 - 3.04 - 14.24 = 11.12.$$
Source of variation         d.f.   SS       MSS      Variance ratio F
Treatments                  4      454.64   113.66   123.34
Rows                        4      3.04     0.76
Columns                     4      14.24    3.56
Error (within treatments)   12     11.12    0.92
Total                       24     483.04
The observed highly significant value of the variance ratio indicates that there are significant
differences between the treatment means.
Critical Difference
The standard error of the difference between two treatment means is

$$SE(d) = \sqrt{\frac{2\,MSE}{t}} = \sqrt{\frac{2\times 0.92}{5}} = 0.61.$$

Therefore,

$$\text{Critical Difference } (CD) = SE(d)\times t_{0.05;\,12} = 0.61\times 2.179 = 1.33.$$
Summary of Results
The treatment means are calculated from the treatment totals:

Treatments   A     B      C      D      E      CD (5%)
Mean yield   9.0   13.6   17.6   21.8   16.6   1.33

The treatment D is the best of all. The treatments C and E do not differ significantly from each other.
The yield obtained by applying any one of the manurial treatments is significantly higher than that obtained without applying any manure.
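The example's sums of squares can be reproduced from the plan and yields above (a sketch; the names are mine):

```python
# Plan and yields of the 5x5 wheat experiment above
yields = [[15, 8, 17, 20, 17],
          [9, 21, 19, 16, 13],
          [18, 12, 23, 8, 17],
          [18, 16, 10, 15, 23],
          [22, 15, 13, 18, 10]]
plan = [list('BAEDC'), list('ADCEB'), list('CBDAE'), list('ECABD'), list('DEBCA')]

def lsd_anova(yields, plan):
    """Latin-square ANOVA: returns (SST, SSR, SSC, SSE)."""
    t = len(yields)
    gt = sum(map(sum, yields))
    cf = gt ** 2 / t ** 2
    tss = sum(x * x for row in yields for x in row) - cf
    ssr = sum(sum(row) ** 2 for row in yields) / t - cf
    ssc = sum(sum(yields[i][j] for i in range(t)) ** 2 for j in range(t)) / t - cf
    totals = {}                                   # treatment totals from the plan
    for i in range(t):
        for j in range(t):
            totals[plan[i][j]] = totals.get(plan[i][j], 0) + yields[i][j]
    sst = sum(v * v for v in totals.values()) / t - cf
    return sst, ssr, ssc, tss - sst - ssr - ssc
```

Running `lsd_anova(yields, plan)` reproduces $SST = 454.64$, $SSR = 3.04$, $SSC = 14.24$ and $SSE = 11.12$.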
Exercise
1) An oil company tested four different blends of gasoline for fuel efficiency according to a
Latin square design in order to control for the variability of four different drivers and four
different models of cars. Fuel efficiency was measured in miles per gallon (mpg) after driving
cars over a standard course.
Fuel Efficiencies (mpg) For 4 Blends of Gasoline
(Latin Square Design: Blends Indicated by Letters A-D)
Car Model
Driver
I II III IV
1 D B C A
15 33 13.2 29.1
2 B C A D
16.3 26.6 19.4 22.8
3 C A D B
10.8 31.1 17.1 30.3
4 A D B C
14.7 34.0 19.7 21.6
Rows Columns
1 P O N L M
4 2 5 1 3
2 M L O N P
5 1 6 5 3
3 O M L P N
4 8 1 5 4
4 N P M O L
12 7 7 10 5
5 L N P M O
5 4 3 6 9
Rows Columns
B D E A C
5 6 3 10 12
C A B E D
9 4 6 5 5
D C A B E
8 15 7 6 5
E B C D A
5 8 13 9 5
A E D C B
9 6 12 16 8
Factorial Experiments
When two or more factors are investigated simultaneously in a single experiment, such experiments are called factorial experiments. For example, the yield of a crop depends on the particular variety being used and also on the particular fertilizer applied.
Terminologies
Factor: A factor refers to a set of related treatments. We may apply different doses of nitrogen to a crop; hence nitrogen, irrespective of dose, is a factor.
Levels of a factor: Different states or components making up a factor are known as the
levels of that factor, e.g. different doses of nitrogen.
Advantages
1. In such type of experiments we study the individual effects of each factor and their
interactions.
2. In factorial experiments a wide range of factor combinations are used.
3. Factorial approach will result in considerable saving of the experimental
resources, experimental material and time.
Disadvantages
1. When number of factors or levels of factors or both are increased, the number of treatment
combinations increases. Consequently block size increases. If block size increases it may be
difficult to maintain homogeneity of experimental material.
This will lead to increase in experimental error and loss of precision in the experiment.
2. All treatment combinations are to be included for the experiment irrespective of its
importance and hence this results in wastage of experimental material and time.
3. When many treatment combinations are included the execution of the experiment and
statistical analysis become difficult.
2² Factorial Experiment
Consider two factors $A$ and $B$, each at two levels, say $a_0, a_1$ and $b_0, b_1$, so there will be $2 \times 2 = 4$ treatment combinations. They are enumerated as follows:
$a_0b_0$ or $(1)$: $A$ and $B$ both at the first level
$a_1b_0$ or $a$: $A$ at the second level and $B$ at the first level
$a_0b_1$ or $b$: $A$ at the first level and $B$ at the second level
$a_1b_1$ or $ab$: $A$ and $B$ both at the second level
Note: The first level of A and B is generally expressed by the absence of the corresponding
letter in the treatment combination.
These four treatment combinations may be compared using a CRD or RBD or LSD .
In a 2² experiment in an RBD, the analysis is the same as stated for the RBD with $t = 4$ treatments, and the analysis of a 2² experiment in an LSD is the same as stated for the LSD with $t = 4$.
In a factorial experiment, one is more interested in separate tests of the main effects and interactions, which are performed by splitting the treatment SS, carrying 3 degrees of freedom, into three orthogonal components, each carrying a single degree of freedom and each associated with a main effect or an interaction. Yates' method may be used to obtain the factorial effect totals, and hence their SS, from the treatment totals, as follows:
$$\text{Sum of squares due to replication (blocks) } (SSR) = \frac{1}{4}\sum_{i} B_i^2 - CF,$$
$$\text{Total sum of squares } (TSS) = \sum_{i}\sum_{j} y_{ij}^2 - CF,$$

and the sum of squares due to any main effect or interaction is obtained from the corresponding factorial effect total:

$$SSA = \frac{[A]^2}{2^2 r}, \quad SSB = \frac{[B]^2}{2^2 r}, \quad SSAB = \frac{[AB]^2}{2^2 r},$$

$$\text{Error sum of squares } (SSE) = TSS - SSR - SSA - SSB - SSAB.$$
Form the following ANOVA table and calculate the $F$ values:

Source of variation         d.f.        SS      MSS                    Variance ratio F
Replication (blocks)        r - 1       SSR     MSR = SSR/(r - 1)      F = MSR/MSE
Main effect A               1           SSA     MSA = SSA              F = MSA/MSE
Main effect B               1           SSB     MSB = SSB              F = MSB/MSE
Interaction effect AB       1           SSAB    MSAB = SSAB            F = MSAB/MSE
Error (within treatments)   3(r - 1)    SSE     MSE = SSE/[3(r - 1)]
Total                       2^2 r - 1   TSS
If calculated value is greater than tabulated value at given level of significance, then reject
H 0 , otherwise accept it.
Example: An experiment was planned to study the effects of sulphate of potash and superphosphate on the yield of potatoes. All combinations of 2 levels of superphosphate [0% $(p_0)$ and 5% $(p_1)$ per acre] and 2 levels of sulphate of potash [0% $(k_0)$ and 5% $(k_1)$ per acre] were studied in a randomized block design with 4 replications each. The following yields were obtained (Gupta and Kapoor, 2000):
Block
I (1) k p kp
23 25 22 38
II p (1) k kp
40 26 36 38
III (1) k kp p
29 20 30 20
IV kp k p (1)
34 31 24 28
Solution: Taking deviations from $y = 29$, we rearrange the given table for computation:

Treatment          Block I   II   III   IV    Total T_i   T_i^2
(1)                -6        -3   0     -1    -10         100
k                  -4        7    -9    2     -4          16
p                  -7        11   -9    -5    -10         100
kp                 9         9    1     5     24          576
Block totals B_i   -8        24   -17   1     G = 0
Since $G = 0$, the correction factor $CF = G^2/16 = 0$, and

$$\text{Sum of squares due to replication (blocks) } (SSR) = \frac{1}{4}\sum_{i=1}^{r} B_i^2 - CF = \frac{64 + 576 + 289 + 1}{4} - 0 = 232.5.$$
The sums of squares due to the main effects and the interaction are

$$SSK = \frac{[K]^2}{2^2 r} = \frac{(40)^2}{16} = 100, \quad SSP = \frac{[P]^2}{2^2 r} = \frac{(28)^2}{16} = 49, \quad SSKP = \frac{[KP]^2}{2^2 r} = \frac{(28)^2}{16} = 49.$$
Here $TSS = \sum\sum y_{ij}^2 - CF = 660$, so $SSE = TSS - SSR - SSK - SSP - SSKP = 660 - 232.5 - 100 - 49 - 49 = 229.5$ with $3(r-1) = 9$ d.f. and $MSE = 25.5$. Form the ANOVA table:

Source of variation      d.f.   SS      MSS     Variance ratio F
Replication (blocks)     3      232.5   77.5    3.04
Main effect K            1      100.0   100.0   3.92
Main effect P            1      49.0    49.0    1.92
Interaction effect KP    1      49.0    49.0    1.92
Error                    9      229.5   25.5
Total                    15     660.0
As in each case the computed value of $F$ is less than the corresponding critical value, there are no significant main or interaction effects in the experiment. The blocks as well as the treatments do not differ significantly.
Remark: It may be noted that $SSK + SSP + SSKP = 100 + 49 + 49 = 198 = SST$, as it should be.
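Yates' method mentioned above can be sketched in a few lines (the function name is mine). Fed the deviation-coded treatment totals of this example in standard order $[(1), k, p, kp]$, it returns the grand total and the effect totals $[K]$, $[P]$, $[KP]$:

```python
def yates(totals):
    """Yates' algorithm for a 2^m factorial: `totals` are treatment totals
    in standard order, e.g. [(1), a, b, ab].  Each pass writes pair sums
    into the first half of the column and pair differences into the second
    half; after m passes the column holds [G, [A], [B], [AB], ...]."""
    col, m = list(totals), 0
    while (1 << m) < len(totals):
        m += 1
    for _ in range(m):
        sums = [col[i] + col[i + 1] for i in range(0, len(col), 2)]
        diffs = [col[i + 1] - col[i] for i in range(0, len(col), 2)]
        col = sums + diffs
    return col
```

Here `yates([-10, -4, -10, 24])` gives `[0, 40, 28, 28]`, i.e. $G = 0$, $[K] = 40$, $[P] = 28$, $[KP] = 28$, and dividing the squared effect totals by $2^2 r = 16$ reproduces $SSK = 100$, $SSP = 49$, $SSKP = 49$.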
2³ Factorial Experiment
Consider three factors, say $A$, $B$ and $C$, each at two levels, say $(a_0, a_1)$, $(b_0, b_1)$ and $(c_0, c_1)$, respectively, so that there are $2 \times 2 \times 2 = 8$ treatment combinations, which can be written systematically as
$a_0b_0c_0$ or $(1)$: $A$, $B$ and $C$ all at the first level
$a_1b_0c_0$ or $a$: $A$ at the second level and $B$, $C$ at the first level
$a_0b_1c_0$ or $b$: $A$, $C$ at the first level and $B$ at the second level
$a_0b_0c_1$ or $c$: $A$, $B$ at the first level and $C$ at the second level
$a_1b_1c_0$ or $ab$: $C$ at the first level and $A$, $B$ at the second level
$a_1b_0c_1$ or $ac$: $B$ at the first level and $A$, $C$ at the second level
$a_0b_1c_1$ or $bc$: $A$ at the first level and $B$, $C$ at the second level
$a_1b_1c_1$ or $abc$: $A$, $B$ and $C$ all at the second level
A 2³ factorial experiment can be performed as a CRD with 8 treatments, as an RBD with $r$ replications, each replicate containing the 8 treatments, or as an LSD with $t = 8$, and the data can be analyzed accordingly.
$$\text{Correction Factor } (CF) = \frac{G^2}{8r},$$

$$\text{Sum of squares due to replication (blocks) } (SSR) = \frac{1}{8}\sum_{i} B_i^2 - CF, \quad \text{where } B_i \text{ is the total of the } i\text{th block,}$$

$$\text{Total sum of squares } (TSS) = \sum_{i}\sum_{j} y_{ij}^2 - CF.$$
The sum of squares due to any main effect or interaction is obtained from the corresponding factorial effect total:

$$SSA = \frac{[A]^2}{8r}, \quad SSB = \frac{[B]^2}{8r}, \quad SSC = \frac{[C]^2}{8r},$$

$$SSAB = \frac{[AB]^2}{8r}, \quad SSAC = \frac{[AC]^2}{8r}, \quad SSBC = \frac{[BC]^2}{8r}, \quad SSABC = \frac{[ABC]^2}{8r},$$

$$\text{Error sum of squares } (SSE) = TSS - SSR - SSA - SSB - SSC - SSAB - SSAC - SSBC - SSABC.$$
Form the following ANOVA table and calculate the $F$ values:

Source of variation         d.f.        SS       MSS                    Variance ratio F
Replication (blocks)        r - 1       SSR      MSR = SSR/(r - 1)      F = MSR/MSE
Main effect A               1           SSA      MSA = SSA              F = MSA/MSE
Main effect B               1           SSB      MSB = SSB              F = MSB/MSE
Main effect C               1           SSC      MSC = SSC              F = MSC/MSE
Interaction effect AB       1           SSAB     MSAB = SSAB            F = MSAB/MSE
Interaction effect AC       1           SSAC     MSAC = SSAC            F = MSAC/MSE
Interaction effect BC       1           SSBC     MSBC = SSBC            F = MSBC/MSE
Interaction effect ABC      1           SSABC    MSABC = SSABC          F = MSABC/MSE
Error (within treatments)   7(r - 1)    SSE      MSE = SSE/[7(r - 1)]
Total                       8r - 1      TSS
If calculated value is greater than tabulated value at given level of significance, then reject
H 0 , otherwise accept it.
Example: The following table gives the layout and the results of a 2³ factorial design laid out in 4 replicates. The purpose of the experiment is to determine the effects of the fertilizers nitrogen $N$, potash $K$ and phosphate $P$ on potato crop yield.
Block I Block II
nk kp p np kp p k nk
291 391 312 373 407 324 272 306
(1) k n nkp n nkp np (1)
101 265 106 450 89 449 338 106
Block III Block IV
p (1) np kp np nk n p
323 87 324 423 361 272 103 324
nk k n nkp k (1) nkp kp
334 279 128 471 302 131 437 435
Block effects are eliminated by carrying out the analysis of the given design as an RBD for 8 treatment combinations and 4 blocks. The initial calculations are therefore as follows.

Total number of observations $= 4 \times 8 = 32$.

The block totals, the treatment totals and the grand total are summarized below:

Block totals: B_1 = 2289, B_2 = 2291, B_3 = 2369, B_4 = 2375
Treatment totals: (1) = 425, n = 426, k = 1118, nk = 1203, p = 1283, np = 1396, kp = 1666, nkp = 1807
Grand total: G = 9324
$$\text{Correction Factor } (CF) = \frac{G^2}{8r} = \frac{(9324)^2}{32} = 2716780.5,$$

$$\text{Sum of squares due to replication (blocks) } (SSR) = \frac{1}{8}\sum_{i=1}^{4} B_i^2 - CF = \frac{21740988}{8} - 2716780.5 = 843.0,$$

$$\text{Sum of squares due to treatments } (SST) = \frac{1}{4}\sum_{i} T_i^2 - CF = \frac{12694944}{4} - 2716780.5 = 456955.5.$$
The sums of squares due to the main effects and interactions are

$$SSN = \frac{[N]^2}{2^3 r} = \frac{(340)^2}{32} = 3612.5, \qquad SSK = \frac{[K]^2}{2^3 r} = \frac{(2264)^2}{32} = 160178.0,$$

$$SSP = \frac{[P]^2}{2^3 r} = \frac{(2980)^2}{32} = 277512.5, \qquad SSNK = \frac{[NK]^2}{2^3 r} = \frac{(112)^2}{32} = 392.0,$$

$$SSNP = \frac{[NP]^2}{2^3 r} = \frac{(168)^2}{32} = 882.0, \qquad SSKP = \frac{[KP]^2}{2^3 r} = \frac{(676)^2}{32} = 14280.5,$$

$$SSNKP = \frac{[NKP]^2}{2^3 r} = \frac{(56)^2}{32} = 98.0.$$
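These effect totals can be checked by applying the $\pm 1$ contrast signs directly to the treatment totals (a sketch; the names are mine):

```python
def factorial_effect_totals(totals):
    """Factorial effect totals for a 2^3 experiment from treatment totals
    given in standard order [(1), n, k, nk, p, np, kp, nkp].  A treatment
    enters an effect total with sign equal to the product, over the factors
    named in the effect, of +1 (factor at second level) or -1 (first level)."""
    names = ['', 'n', 'k', 'nk', 'p', 'np', 'kp', 'nkp']  # '' stands for (1)
    effects = {}
    for effect in ['n', 'k', 'p', 'nk', 'np', 'kp', 'nkp']:
        total = 0
        for name, t in zip(names, totals):
            sign = 1
            for factor in effect:
                sign *= 1 if factor in name else -1
            total += sign * t
        effects[effect] = total
    return effects
```

With the totals above this reproduces $[N] = 340$, $[K] = 2264$, $[P] = 2980$, $[NK] = 112$, $[NP] = 168$, $[KP] = -676$ and $[NKP] = -56$ (the sums of squares use the squared totals, so the signs do not matter).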
Form the following ANOVA table and calculate the $F$ values:

Source of variation     d.f.   SS            MSS            Variance ratio F
Replication (blocks)    3      SSB = 843.0   MSB = 281.0    F = MSB/MSE = 0.78