Stat Study Mat 2
Structure
2.0 Introduction
2.1 Unit Objectives
2.2 Theory of Sampling
2.3 Types of Sampling
2.4 Sampling Distribution: Distribution of Sample Mean
2.5 Central Limit Theorem
2.6 Estimation: Point and Interval estimates of mean
2.7 Characteristics of Good Estimator: Small and Large Sample Properties.
2.8 Simple Correlation and Regression
2.9 Estimation of Regression Equations of X on Y and Y on X; Interpretation of
Regression Coefficients
2.10 Multiple Regression (introductory)
2.11 Standard Error of an Estimate
2.12 Summary
2.13 Answer to Check Your Progress
2.14 Questions and Exercises
2.0 INTRODUCTION
In statistics, estimation refers to the process by which one makes inferences about a
population, based on information obtained from a sample. Studying an entire population
is often not possible, and even when it is possible, it is very costly and time consuming.
The purpose of this module is to let the students know about the various sampling
selection methods and the statistical laws, popularly known as law of statistical
regularity and law of inertia of large numbers, on the basis of which inferences about
the population from samples are being made. In the process the students would come
to know about central limit theorem, sampling distribution, calculation of standard
error, Correlation and Regression as tools to make inferences about population based
on sample data.
The objectives of this unit are to let the students know about the various
concepts/terms used in the sampling theory of estimation. It helps in learning the
concept of various methods of sampling and in becoming acquainted with the theory of
Sampling Distribution. It also familiarises students with the concepts of Central Limit
Theorem, the computation process of Standard Error, and Correlation and Regression.
Before discussing the sampling theory, let us know the meaning of some important
terms.
POPULATION OR UNIVERSE
POPULATION SIZE
FINITE POPULATION
The population is said to be finite when the number of members of the population can
be expressed as a definite quantity. For example, population of marks obtained by
students in XII class exam of C.B.S.E. is finite because its number of members is a
definite quantity.
INFINITE POPULATION
The population is said to be infinite when the number of members of the population
cannot be expressed as a definite quantity.
EXISTENT/REAL POPULATION
The population is said to be existent when all the members of the population really
exist. For example, population of taxable incomes of all the persons in India is an
existent population.
HYPOTHETICAL POPULATION
The population is said to be hypothetical when all the members of the population do
not really exist. It is built up by repeating the event any number of times. For example,
the population of points obtained in all possible throws of a die is hypothetical.
SAMPLE
Sample refers to that part of aggregate statistical information (i.e. Population) which
is actually selected in the course of an investigation/enquiry to ascertain the
characteristics of the population.
SAMPLE SIZE
Sample size refers to the number of members of the population included in the
sample.
PARAMETER
STATISTIC
Meaning: Under this method selection of sample items is often based on certain
predetermined criteria fixed by the individual judgement of the sampler.
Advantages
Disadvantages
Meaning: Under this method, certain blocks or clusters of higher concentration are
selected for complete enquiry e.g. all transactions of a particular period in a year.
These clusters are used often in multistage sampling wherein sampling is done in
stages.
Meaning: Under this method, the total geographical area (if big) is divided into a
number of smaller non-overlapping areas and then some of the smaller areas are
selected and all units of the selected areas constitute the sample.
Suitability: It is suited in inquiries to be conducted over a large area, when the list of
population concerned is not available.
iv) Quota Sampling
Meaning: Under this method, each person engaged in the primary selection of data is
assigned a fixed quota of investigations e.g. 50 salaried persons in the age group of
25-30 years. Within the quota, the selection of sample items depends entirely on
personal judgement.
Meaning: Under this method, selection of sample items is based on chance in such a
manner that each unit of the population has an equal chance of being included in the
sample. The methods of obtaining a random sample include Lottery System, Random
Tables, Nth number etc.
Advantages
Usefulness: The theories of sampling distribution and tests of significance are based
on random sampling only.
Meaning: Under this method, selection of sampling items is done at uniform intervals
of time, space or order of occurrence.
Methodology
Step 3: Select the first unit of the sample from 1 to k at random and then include every
k-th unit after it in the sample.
For example, from the first 10 houses, one house is selected at random, say the house
with serial number 9. Then the houses with serial numbers 19, 29, 39, 49, 59, 69, 79,
89, 99 should also be selected.
Disadvantage: The sample may be biased if there are periodic features associated
with the sampling interval.
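The systematic procedure can be sketched in a few lines of Python. This is only an illustration: the function name systematic_sample and the list of 100 house serial numbers are assumptions made for the example, not part of the method's definition.

```python
import random

def systematic_sample(population, k, rng=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = rng or random.Random()
    start = rng.randrange(k)      # 0-based position of the first selected unit
    return population[start::k]   # that unit and every k-th unit after it

# Hypothetical frame: houses with serial numbers 1 to 100, sampling interval k = 10.
houses = list(range(1, 101))
sample = systematic_sample(houses, k=10, rng=random.Random(0))
print(sample)
```

Whatever the random start is, consecutive selected serial numbers differ by exactly k, which is why a periodic pattern with the same period can bias the sample.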
Meaning: Under this method, the population is sub-divided into several groups
(called strata) on the basis of purposive sampling and then samples of desired size are
selected from each of them on the basis of random sampling. All the samples
combined together give the stratified sample. Thus, it is a mixture of both purposive
and random sampling.
(b) to ensure that all sections of the population are adequately represented.
1. It eliminates the difference between strata and thereby reduces the sampling
error.
2. It brings about a gain in the precision of the sample estimate when the strata
variability is the least.
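A minimal sketch of stratified selection is given below. The strata ("urban", "rural"), their sizes and the number of units drawn from each are hypothetical values chosen only to show the mechanics: the population is first split into groups, and a simple random sample is then drawn inside each group.

```python
import random

def stratified_sample(strata, sizes, rng=None):
    """Draw a simple random sample of the requested size from each stratum."""
    rng = rng or random.Random()
    sample = []
    for name, units in strata.items():
        sample.extend(rng.sample(units, sizes[name]))  # sampling without replacement
    return sample

# Hypothetical strata: 60 urban units (0-59) and 40 rural units (100-139).
strata = {"urban": list(range(0, 60)), "rural": list(range(100, 140))}
sizes = {"urban": 6, "rural": 4}   # sizes proportional to the strata
combined = stratified_sample(strata, sizes, rng=random.Random(1))
print(combined)
```

Because every stratum contributes a fixed number of units, no section of the population can be left out by chance.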
Meaning: Under this method sampling is done in several stages starting from the
larger units, intermediate units and finally reaching the ultimate units of selection.
Procedure
Step 1: Divide the population into first-stage units. (say country into states)
Step 2: Divide the first-stage units into second-stage units (say a state into districts)
Step 3: Divide the second-stage units into third-stage units (say a district into tehsils)
Step 4: Divide the third-stage units into fourth-stage units (say a tehsil into villages)
Step 5: Divide the fourth-stage units until we reach the ultimate units (say a village
into households)
Step 6: Select some of first stage units at random (say any three states at random) and
then select some second stage units (say any two districts) from each of the selected
first stage units and this process is carried on from stage to stage until the ultimate
units are selected.
Meaning: Under this method, a relatively small sample is tested for drawing a
decision and if the first sample does not give evidence for a definite decision, more
units are chosen at random and added to sample until a decision is possible using
enlarged sample.
Usefulness: It is used to draw inference on the behaviour of the population and in
estimating the unknown characteristics of the population.
normal distribution with mean = $\mu$ and s.d. = $\sigma/\sqrt{n}$ (the standard error of $\bar{x}$), provided the
sample size n is sufficiently large.
[Note: The larger the value of n, the better is the approximation.]
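The statement can be checked by simulation. The sketch below is an illustration under assumed settings: the population is taken to be exponential with mean 1 and s.d. 1 (a deliberately skewed choice), the sample size is 50, and 2000 samples are drawn with a fixed seed.

```python
import random
import statistics

def sample_means(draw, n, trials, rng):
    """Return the means of `trials` independent samples, each of size n."""
    return [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(trials)]

rng = random.Random(42)
draw = lambda r: r.expovariate(1.0)   # skewed population: mean 1, s.d. 1
n = 50
means = sample_means(draw, n, trials=2000, rng=rng)

mean_of_means = statistics.fmean(means)   # should be close to the population mean
sd_of_means = statistics.stdev(means)     # should be close to sigma / sqrt(n)
expected_se = 1.0 / n ** 0.5

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_se, 3))
```

Even though the population itself is far from normal, the sample means cluster around the population mean with a spread close to the standard error.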
CONFIDENCE INTERVALS
A confidence interval is described by three parts:
A confidence level.
A statistic.
A margin of error.
The confidence level describes the uncertainty of a sampling method. The statistic and
the margin of error define an interval estimate that describes the precision of the
method. The interval estimate of a confidence interval is defined by the sample
statistic ± margin of error.
CONFIDENCE LEVEL
Here is how to interpret a confidence level. Suppose we collected all possible samples
from a given population, and computed confidence intervals for each sample. Some
confidence intervals would include the true population parameter; others would not. A
95% confidence level means that 95% of the intervals contain the true population
parameter; a 90% confidence level means that 90% of the intervals contain the
population parameter; and so on.
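This interpretation can be illustrated by repeated sampling. The sketch below makes several assumptions for the sake of the example: a normal population with mean 50 and known s.d. 10, samples of size 25, and intervals of the form sample mean ± 1.96 × s.d./√n.

```python
import random
import statistics

def coverage(mu, sigma, n, trials, z, rng):
    """Fraction of intervals (sample mean ± z*sigma/sqrt(n)) that contain mu."""
    margin = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        centre = statistics.fmean(sample)
        if centre - margin <= mu <= centre + margin:
            hits += 1
    return hits / trials

share = coverage(mu=50.0, sigma=10.0, n=25, trials=2000, z=1.96,
                 rng=random.Random(7))
print(share)   # expected to be near 0.95
```

Each individual interval either contains the parameter or it does not; the 95% refers to the long-run share of intervals that do.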
MARGIN OF ERROR
In a confidence interval, the range of values above and below the sample statistic is
called the margin of error.
For example, suppose the local newspaper conducts an election survey and reports
that the independent candidate will receive 30% of the vote. The newspaper states that
the survey had a 5% margin of error and a confidence level of 95%. These findings
result in the following confidence interval: We are 95% confident that the
independent candidate will receive between 25% and 35% of the vote.
Note: Many public opinion surveys report interval estimates, but not confidence
intervals. They provide the margin of error, but not the confidence level. To clearly
interpret survey results you need to know both! We are much more likely to accept
survey findings if the confidence level is high (say, 95%) than if it is low (say, 50%).
A "Good" estimator is the one which provides an estimate with the following
qualities:
Consistency: The standard deviation of an estimate is called the standard error of that
estimate. The larger the standard error the more error in your estimate. The standard
deviation of an estimate is a commonly used index of the error entailed in estimating a
population parameter based on the information in a random sample of size n from the
entire population.
Efficiency: An efficient estimate is one which has the smallest standard error among
all unbiased estimators.
The "best" estimator is the one which is the closest to the population parameter being
estimated.
(a)
(c) In Business for inspecting the incoming lots of materials from suppliers.
CORRELATION-Basic Concepts:
So, correlation is a term used to describe how strong the relationship between the two
variables appears to be. We say that there is a positive linear correlation if y increases
as x increases and we say there is a negative linear correlation if y decreases as x
increases. There is no correlation if x and y do not appear to be related.
A number of different coefficients are used for different situations. The best known is
the Pearson product-moment correlation coefficient, which is obtained by dividing the
covariance of the two variables by the product of their standard deviations. Despite its
name, it was first introduced by Francis Galton.
The correlation in the above table all goes from low on the left in a line to high on
the right. This is not always the shape of a correlation, as is shown in table.
Correlations can be positive or negative, linear or curved. They also do not go on
forever, and using them to predict values outside the measured range is always
-
A Scatter Diagram may show correlation between two items for three reasons:
(a) There is a cause and effect relationship between the two measured items,
where one is causing the other (at least in part).
(b) The two measured items are both caused by a third item. For example, a
Scatter Diagram may show a correlation between cracks and transparency of glass
utensils because changes in both are caused by changes in furnace temperature.
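where X and Y are the given variables,
$\sum X^2$ is the sum of the squares of the X values, and
$\sum Y^2$ is the sum of the squares of the Y values.
or, in terms of the deviations $d_x$ and $d_y$,
$$r(x,y) = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{\sqrt{N\sum d_x^2 - (\sum d_x)^2}\,\sqrt{N\sum d_y^2 - (\sum d_y)^2}}$$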
The correlation is defined only if both of the standard deviations are finite and both of
them are nonzero. It is a corollary of the Cauchy-Schwarz inequality that the
correlation cannot exceed 1 in absolute value.
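The correlation is 1 in the case of a perfect increasing linear relationship, -1 in the
case of a perfect decreasing linear relationship, and some value in between in all other
cases, indicating the degree of linear dependence between the variables. The closer
the coefficient is to either -1 or 1, the stronger the correlation between the variables.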
If the variables are independent then the correlation is 0, but the converse is not true
because the correlation coefficient detects only linear dependencies between two
variables. Here is an example: Suppose the random variable X is uniformly distributed
on an interval symmetric about zero, say from -1 to 1, and $Y = X^2$. Then Y is
completely determined by X, so
that X and Y are dependent, but their correlation is zero; they are uncorrelated.
However, in the special case when X and Y are jointly normal, uncorrelatedness is
equivalent to independence.
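A small numerical check of this point is sketched below. The grid of X values from -1 to 1 is an assumption standing in for a distribution that is symmetric about zero; the helper pearson_r is written out only for this illustration.

```python
def pearson_r(xs, ys):
    """Pearson correlation from the raw sums of X, Y, XY, X^2 and Y^2."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5

# X values symmetric about zero; Y is an exact function of X.
xs = [i / 10 for i in range(-10, 11)]
ys = [x ** 2 for x in xs]
r = pearson_r(xs, ys)
print(round(r, 6))
```

Y can be predicted perfectly from X here, yet the coefficient is zero up to rounding, because the relationship has no linear component.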
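X Values   Y Values
60         3.1
61         3.6
62         3.8
63         4
65         4.1
Step 1: Count the number of values. N = 5
Step 2: Find X*Y, X*X and Y*Y for each pair of values.
X Value   Y Value   X*Y                 X*X               Y*Y
60        3.1       60 * 3.1 = 186      60 * 60 = 3600    3.1 * 3.1 = 9.61
61        3.6       61 * 3.6 = 219.6    61 * 61 = 3721    3.6 * 3.6 = 12.96
62        3.8       62 * 3.8 = 235.6    62 * 62 = 3844    3.8 * 3.8 = 14.44
63        4         63 * 4 = 252        63 * 63 = 3969    4 * 4 = 16
65        4.1       65 * 4.1 = 266.5    65 * 65 = 4225    4.1 * 4.1 = 16.81
Step 3: Find the sums.
$\sum X$ = 311
$\sum Y$ = 18.6
$\sum XY$ = 1159.7
$\sum X^2$ = 19359
$\sum Y^2$ = 69.82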
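Step 4: Now, substitute in the formula given above.
Correlation(r) = $[N\sum XY - (\sum X)(\sum Y)] / \sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}$
= ((5)*(1159.7)-(311)*(18.6))/sqrt([(5)*(19359)-(311)^2]*[(5)*(69.82)-(18.6)^2])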
= (5798.5 - 5784.6)/sqrt([96795 - 96721]*[349.1 - 345.96])
= 13.9/sqrt(74*3.14)
= 13.9/sqrt(232.36)
= 13.9/15.24336
= 0.9119. This shows high degree positive correlation between X and Y.
This example will guide you to find the relationship between two variables by
calculating the Correlation Co-efficient from the above steps
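The same calculation can be repeated in Python as a cross-check. The function name correlation is an assumption for this sketch; it applies the raw-sum formula to the five pairs of values from the example.

```python
def correlation(xs, ys):
    """r = [N*SXY - SX*SY] / sqrt([N*SXX - SX^2] * [N*SYY - SY^2])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    numerator = n * sxy - sx * sy
    denominator = ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
    return numerator / denominator

x_values = [60, 61, 62, 63, 65]
y_values = [3.1, 3.6, 3.8, 4.0, 4.1]
r = correlation(x_values, y_values)
print(round(r, 4))   # about 0.91, as in the worked example
```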
Example: From a paddy field, 12 plants were selected at random. The length of
panicles in cm (x) and the number of grains per panicle (y) of the selected plants were
recorded. The results are given in the following table. Calculate correlation coefficient
and its testing.
X 22.9 23.9 24.8 21.2 22.2 22.7 23.0 24.0 20.6 21.0 24.0 23.1
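Correlation coefficient
$$r_{xy} = r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}}$$
Correlation Coefficient in terms of the deviations $d_x$ and $d_y$,
$$r(x,y) = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{\sqrt{N\sum d_x^2 - (\sum d_x)^2}\,\sqrt{N\sum d_y^2 - (\sum d_y)^2}}$$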
variables analyzed are equivalent modulo scaling. Scientifically, this more frequently
indicates a trivial result than a profound one. For example, consider discovering a
correlation of 1.0 between how many feet tall a group of people are and the number of
inches from the bottom of their feet to the top of their heads.
Calculation:
The rank correlation coefficient is a special case of the Pearson product-moment coefficient in
which two sets of data Xi and Yi are converted to rankings xi and yi before calculating
the coefficient. In practice, however, a simpler procedure is normally used to calculate
it: the raw scores are converted to ranks, and the differences di between the ranks of
each observation on the two variables are calculated.
Which evaluates to =
This low value shows that the correlation between IQ and hours spent watching TV is
very low. In the case of ties in the original values, this formula should not be used.
Computation for tied observations: There may be two or more items having equal
values. In such a case the ranking is said to be tied, and each of the tied items is
given the same rank, namely the average of the ranks they would otherwise occupy. For
example, if a value is repeated twice at the 5th rank, the common rank to be
assigned to each of the two items is (5+6)/2 = 5.5, the average of 5 and 6.
If the ranks are tied, it is required to apply a correction factor, which is
$(m^3 - m)/12$, added to the sum of squared rank differences. A slightly different
formula is therefore used when there is more than one item having the same value.
Here m is the number of items whose ranks are common, and the correction factor is
repeated as many times as there are groups of tied observations.
Example : Rank Correlation for tied observations. Following are the marks obtained
by 10 students in a class in two tests. Calculate the rank correlation coefficient
between the marks of two tests.
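Test-I   70 68 67 55 60 60 75 63 60 72
Test-II  65 65 80 60 68 58 75 63 60 70
Solution:
Student  Test-I  Rank R1  Test-II  Rank R2  d = R1 - R2  d^2
I        70      3        65       5.5      -2.5         6.25
II       68      4        65       5.5      -1.5         2.25
III      67      5        80       1.0      4            16
IV       55      10       60       8.5      1.5          2.25
V        60      8        68       4.0      4            16
VI       60      8        58       10.0     -2           4
VII      75      1        75       2.0      -1           1
VIII     63      6        63       7.0      -1           1
IX       60      8        60       8.5      -0.5         0.25
X        72      2        70       3.0      -1           1
Total                                                    50
In Test-I the value 60 occurs three times (m = 3); in Test-II the values 65 and 60 each occur twice (m = 2).
$$r = 1 - \frac{6\left[50 + \frac{1}{12}(3^3 - 3) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2)\right]}{10^3 - 10}$$
$$\text{Or, } r = 1 - \frac{6(50 + 2 + 0.5 + 0.5)}{1000 - 10} = 1 - \frac{6 \times 53}{990} \approx 0.679$$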
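The tied-rank calculation can be reproduced with a short script. The helper names average_ranks, tie_correction and rank_correlation are assumptions for this sketch; the highest mark receives rank 1 and tied marks share the average of the ranks they would occupy.

```python
def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share their average rank."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = ordered.index(v) + 1          # first rank occupied by this value
        count = ordered.count(v)              # number of tied items, m
        ranks[v] = first + (count - 1) / 2    # average of the occupied ranks
    return [ranks[v] for v in values]

def tie_correction(values):
    """Sum of (m^3 - m)/12 over every group of m tied values."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sum((m ** 3 - m) / 12 for m in counts.values() if m > 1)

def rank_correlation(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = d_squared + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n ** 3 - n)

test_1 = [70, 68, 67, 55, 60, 60, 75, 63, 60, 72]
test_2 = [65, 65, 80, 60, 68, 58, 75, 63, 60, 70]
r = rank_correlation(test_1, test_2)
print(round(r, 3))
```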
Properties of Correlation:
Property 2:
Property 4: Independent variables are uncorrelated but the converse is not true.
Property 5: Correlation coefficient is the geometric mean of two regression
coefficients.
Limitations:
3. Existence of correlation does not necessarily indicate cause and effect relation.
REGRESSION:
The term "regression" was
first used by a British Biometrician Sir Francis Galton.
The relationship between the independent and dependent variables may be expressed
as a function. Such functional relationship between two variables is termed as
regression. In regression analysis independent variable is also known as regressor or
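Direct Method:
For the regression line Y = a + bX, the normal equations are
$\sum Y = Na + b\sum X$
$\sum XY = a\sum X + b\sum X^2$
Substituting the values of $\sum X$, $\sum Y$, $\sum XY$ and $\sum X^2$ from the given data and solving these equations,
we get the values for a and b.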
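Regression Formula:
Regression Equation(y) = a + bx
Slope(b) = $(N\sum XY - (\sum X)(\sum Y)) / (N\sum X^2 - (\sum X)^2)$
Intercept(a) = $(\sum Y - b(\sum X)) / N$
Where, x and y are the variables, b = The slope of the regression line.
a = The intercept point of the regression line and the y axis.
N = Number of values or elements; X = First Score; Y = Second Score
$\sum XY$ = Sum of the product of First and Second Scores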
X Values   Y Values
60 3.1
61 3.6
62 3.8
63 4
65 4.1
To find the regression equation, we will first find the slope and the intercept and
then use them to form the regression equation.
Step 1: Count the number of values. N = 5
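Step 2: Find X*Y and X*X for each pair of values. See the table below.
X Value   Y Value   X*Y                 X*X
60        3.1       60 * 3.1 = 186      60 * 60 = 3600
61        3.6       61 * 3.6 = 219.6    61 * 61 = 3721
62        3.8       62 * 3.8 = 235.6    62 * 62 = 3844
63        4         63 * 4 = 252        63 * 63 = 3969
65        4.1       65 * 4.1 = 266.5    65 * 65 = 4225
Step 3: Find the sums.
$\sum X$ = 311
$\sum Y$ = 18.6
$\sum XY$ = 1159.7
$\sum X^2$ = 19359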
Step 4: Substitute in the slope formula given above.
Slope(b) = $(N\sum XY - (\sum X)(\sum Y)) / (N\sum X^2 - (\sum X)^2)$
= ((5)*(1159.7)-(311)*(18.6))/((5)*(19359)-(311)^2)
= (5798.5 - 5784.6)/(96795 - 96721)
= 13.9/74
= 0.19
Step 5: Now, substitute in the intercept formula given above.
Intercept(a) = $(\sum Y - b(\sum X)) / N$
= (18.6 - 0.19(311))/5
= (18.6 - 59.09)/5
= -40.49/5
= -8.098
Step 6: Then substitute these values in regression equation formula
Regression Equation(y) = a + bx
= -8.098 + 0.19x.
Suppose we want to know the approximate y value for x = 64. Then we
can substitute the value in the above equation.
Regression Equation(y) = a + bx
= -8.098 + 0.19(64).
= -8.098 + 12.16
= 4.06
This example will guide you to find the relationship between two variables by
calculating the Regression from the above steps.
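The steps above can be sketched in Python as follows. The function name fit_line is an assumption for this illustration. Note that the worked example rounds the slope to 0.19 before computing the intercept, so the script shows both the unrounded fit and the rounded figures used in the text.

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a for y = a + b*x, from raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b

x_values = [60, 61, 62, 63, 65]
y_values = [3.1, 3.6, 3.8, 4.0, 4.1]

a, b = fit_line(x_values, y_values)          # unrounded coefficients

# Reproduce the hand calculation, which rounds the slope to two decimals first.
b_rounded = round(b, 2)
a_rounded = (sum(y_values) - b_rounded * sum(x_values)) / len(x_values)
y_at_64 = a_rounded + b_rounded * 64

print(round(a, 3), round(b, 4))
print(round(a_rounded, 3), b_rounded, round(y_at_64, 2))
```

The unrounded intercept differs noticeably from the hand-computed one, which shows how sensitive the intercept is to rounding the slope when the X values are far from zero; the predicted values within the observed range remain close.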
Deviation Method:
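Regression equation of Y on X:
$$(Y - \bar{Y}) = b_{yx}(X - \bar{X})$$
Where, $b_{yx} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$ or, $b_{yx} = r\frac{\sigma_y}{\sigma_x}$ and $\bar{X} = \frac{\sum X}{n}$; $\bar{Y} = \frac{\sum Y}{n}$.
Regression equation of X on Y:
$$(X - \bar{X}) = b_{xy}(Y - \bar{Y})$$
Where, $b_{xy} = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - (\sum Y)^2}$ or, $b_{xy} = r\frac{\sigma_x}{\sigma_y}$.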
2. If one of the regression coefficient is greater than unity, the other must be less than
unity.
4. Regression coefficients are independent of the change of origin but not of scale.
(independent variable).
1. Both the lines of regression pass through the point ($\bar{X}$, $\bar{Y}$). In other words, the mean
values ($\bar{X}$, $\bar{Y}$) can be obtained as the point of intersection of the two regression lines.
3. If r = ± 1, in this case the two lines of regression either coincide or they are parallel
to each other
Example: If two regression coefficients are byx = 4/5 and bxy = 9/20, what would be the
value of r?
Solution:
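Since the correlation coefficient is the geometric mean of the two regression coefficients, r is the square root of byx × bxy and carries their common sign. A minimal sketch of this calculation follows; the function name is an assumption for the illustration.

```python
import math

def r_from_regression_coefficients(byx, bxy):
    """r = ±sqrt(byx * bxy), taking the sign shared by both coefficients."""
    if byx * bxy < 0:
        raise ValueError("regression coefficients must have the same sign")
    return math.copysign(math.sqrt(byx * bxy), byx)

r = r_from_regression_coefficients(4 / 5, 9 / 20)
print(round(r, 4))   # sqrt(36/100) = 0.6
```

Both coefficients are positive here, so r = +0.6.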
Example : Compute the two regression equations from the following data.
X 1 2 3 4 5
Y 2 3 5 4 6
Solution:
X Y x =X- X y =Y- Y x2 y2 xy
1 2 -2 -2 4 4 4
2 3 -1 -1 1 1 1
3 5 0 1 0 1 0
4 4 1 0 1 0 0
5 6 2 2 4 4 4
Here byx and bxy satisfy the properties of regression coefficients, so our
assumption is correct. (The signs of byx and bxy are the same and their product is less than
one).
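The deviation-method arithmetic for this example can be verified with the short script below. The variable names are assumptions for the sketch; the deviations are taken from the actual means, as in the table.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
n = len(xs)

x_bar = sum(xs) / n                      # mean of X = 3
y_bar = sum(ys) / n                      # mean of Y = 4
dx = [x - x_bar for x in xs]             # x = X - mean of X
dy = [y - y_bar for y in ys]             # y = Y - mean of Y

sum_xy = sum(a * b for a, b in zip(dx, dy))
sum_x2 = sum(a * a for a in dx)
sum_y2 = sum(b * b for b in dy)

byx = sum_xy / sum_x2                    # regression coefficient of Y on X
bxy = sum_xy / sum_y2                    # regression coefficient of X on Y

intercept_yx = y_bar - byx * x_bar       # Y = intercept_yx + byx * X
intercept_xy = x_bar - bxy * y_bar       # X = intercept_xy + bxy * Y

print(f"Y on X: Y = {intercept_yx:.1f} + {byx:.1f} X")
print(f"X on Y: X = {intercept_xy:.1f} + {bxy:.1f} Y")
```

With these data both coefficients equal 0.9, so their product is 0.81, which is less than one as the properties require.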
Multiple regression is an extension of simple linear regression in which more than one
independent variable (X) is used to predict a single dependent variable (Y). The
predicted value of Y is a linear transformation of the X variables such that the sum of
squared deviations of the observed and predicted Y is a minimum. The computations
are more complex, however, because the interrelationships among all the variables
must be taken into account in the weights assigned to the variables. The interpretation
of the results of a multiple regression analysis is also more complex for the same
reason.
Note that this transformation is similar to the linear transformation of two variables
discussed in the previous chapter except that the w's have been replaced with b's and
the X'i has been replaced with a Y'i.
X= an X score (X is your Independent Variable) for which you are trying to predict a
value of Y
The "b" values are called regression weights and are computed in a way that
minimizes the sum of squared deviations
-3 ,657.213)X2
Note that I did not plug in the numbers for X1 and X2. These are the places where
you plug in your values that you are going to use to make a prediction. In this case,
X1 refers to the number of years of school (13) and X2 is the motivation score (49).
So, if we plug in these final numbers, we
can make our prediction. See below.
,356.085)(13) + (-3,657.213)(49)
-179203.437)
So, given a job applicant with 13 years of education completed and who received a
motivation score of 49 on the Higgins Motivation Scale, our single best prediction of
how much this person will earn for our dealership is $685,881.74. Pretty cool, huh?
Think for a few minutes about how a tool like this could be used in whatever career
field you are thinking about going into!
Where absent is measured in days per year; wage in thousands of euros per year;
tenure in years in the firm and age is expressed in years. Using a sample of size 48
(file absent), the following equation has been estimated:
The interpretation of $\hat{\beta}_2$ is the following: holding fixed tenure and wage, if age
increases by one year, worker absenteeism will be reduced by 0.096 days per year.
The interpretation of $\hat{\beta}_3$ is as follows: holding fixed the age and wage, if the tenure
increases by one year, worker absenteeism will be reduced by 0.078 days per year.
Finally, the interpretation of $\hat{\beta}_4$ is the following: holding fixed the age and tenure, if
the wage increases by 1000 euros per year, worker absenteeism will be reduced by
0.036 days per year.
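The "holding fixed" reading of each coefficient can be illustrated numerically. In the sketch below the three slopes are the ones quoted above, while the intercept and the worker's characteristics are hypothetical values chosen only so that a prediction can be computed; the change between the two predictions does not depend on them.

```python
def predict_absent(age, tenure, wage, intercept):
    """Predicted days absent per year from a linear multiple regression."""
    return intercept - 0.096 * age - 0.078 * tenure - 0.036 * wage

HYPOTHETICAL_INTERCEPT = 14.0   # assumed for illustration, not taken from the text

base = predict_absent(age=40, tenure=10, wage=30, intercept=HYPOTHETICAL_INTERCEPT)
older = predict_absent(age=41, tenure=10, wage=30, intercept=HYPOTHETICAL_INTERCEPT)

change = older - base           # effect of one more year of age, others held fixed
print(round(change, 3))
```

Changing only the age by one year moves the prediction by exactly the age coefficient, whatever the values of tenure, wage and the intercept.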
Thus,
Increases Decreases
Decreases Increases
An optimum size of SE. would be the one which secures a compromise between the
precision to be sacrificed and the effort involved in observing the sample of a given
size.
Why does standard error arise? The standard error arises due to the use of sampling (which
is based on some items of the population) as against the complete enumeration (census)
enquiry (which is based on all items of the population).
1. It is used to find confidence limits within which parameters are expected to lie.
For example, mean ± 1 S.E. will cover 68.27% of values, mean ± 2 S.E. will cover
95.45% of values, and mean ± 3 S.E. will cover 99.73% of values. In general,
$\bar{X} \pm Z \cdot$ S.E.($\bar{X}$) and $s \pm Z \cdot$ S.E.(s) will give the confidence limits.
Notation:
The following notation is helpful, when we talk about the standard deviation and the
standard error.
$\mu_i$: Mean of population i        $\bar{x}_i$: Sample estimate of $\mu_i$
The variability of a statistic is measured by its standard deviation. The table below
shows formulas for computing the standard deviation of statistics from simple random
samples. These formulas are valid when the population size is much larger (at least 20
times larger) than the sample size.
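Statistic                                            Standard deviation
Sample mean, $\bar{x}$                               $\sigma_{\bar{x}} = \sigma / \sqrt{n}$
Difference between means, $\bar{x}_1 - \bar{x}_2$    $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$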
Note: In order to compute the standard deviation of a sample statistic, you must know
the value of one or more population parameters.
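The two formulas can be written as small helper functions. The function names and the numerical inputs below are assumptions for the sketch; the population standard deviations are taken as known, as the note above requires.

```python
import math

def sd_of_sample_mean(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def sd_of_difference_of_means(sigma1, n1, sigma2, n2):
    """Standard deviation of (mean1 - mean2) for two independent samples."""
    return math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)

# Hypothetical inputs chosen so the arithmetic is easy to follow.
print(sd_of_sample_mean(10, 25))                  # 10 / 5 = 2.0
print(sd_of_sample_mean(10, 100))                 # four times the sample size halves it
print(sd_of_difference_of_means(3, 9, 4, 16))     # sqrt(1 + 1)
```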
The critical value is a factor used to compute the margin of error. This section
describes how to find the critical value, when the sampling distribution of the statistic
is normal or nearly normal.
The central limit theorem states that the sampling distribution of a statistic will be
nearly normal, if the sample size is large enough. As a rough guide, many statisticians
say that a sample size of 30 is large enough when the population distribution is bell-
shaped. But if the original population is badly skewed, has multiple peaks, and/or has
outliers, researchers like the sample size to be even larger.
When the sampling distribution is nearly normal, the critical value can be expressed
as a t score or as a z score. When the sample size is smaller, the critical value should
only be expressed as a t statistic.
To express the critical value as a z score, find the z score having a cumulative
probability equal to the critical probability (p*).
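A minimal sketch of this lookup is shown below. It assumes a two-sided interval, for which the critical probability is commonly taken as p* = 1 - alpha/2 with alpha = 1 - confidence level; that definition is an assumption here, since it is not spelled out in the text above. The standard library's NormalDist supplies the inverse cumulative probability.

```python
from statistics import NormalDist

def critical_z(confidence_level):
    """z score whose cumulative probability equals p* = 1 - alpha/2."""
    alpha = 1 - confidence_level
    p_star = 1 - alpha / 2
    return NormalDist().inv_cdf(p_star)

for level in (0.90, 0.95, 0.99):
    print(level, round(critical_z(level), 3))
```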
The reciprocal of the standard error, i.e. $\frac{1}{S.E.(\bar{x})} = \frac{\sqrt{n}}{\sigma}$, is the precision of $\bar{x}$. The precision of $\bar{x}$, which is used as an
estimate of the population mean ($\mu$), is directly proportional to the square root of the
sample size (n). It implies that to double the precision of the estimate, the sample size
(n) should be four times as large.
Identify a sample statistic. Choose the statistic (e.g, sample mean, sample
proportion) that you will use to estimate a population parameter.
Find the margin of error. If you are working on a homework problem or a test
question, the margin of error may be given. Often, however, you will need to
compute the margin of error, based on one of the following equations.
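One widely used form, stated here as an assumption rather than as the equations intended by the text, is margin of error = critical value × standard error of the statistic. The sketch below applies it to a sample mean with a known population s.d.; the function name and the inputs (mean 100, s.d. 15, n = 36) are hypothetical.

```python
import math
from statistics import NormalDist

def interval_for_mean(sample_mean, sigma, n, confidence_level):
    """Return (lower, upper, margin) for sample_mean ± z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)   # critical value
    margin = z * sigma / math.sqrt(n)                          # margin of error
    return sample_mean - margin, sample_mean + margin, margin

low, high, margin = interval_for_mean(sample_mean=100, sigma=15, n=36,
                                      confidence_level=0.95)
print(round(low, 2), round(high, 2), round(margin, 2))
```

The interval is then reported together with its confidence level, since the margin of error alone does not tell the reader how much confidence to place in it.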
2.12 SUMMARY
The purpose of this unit is to let the students know about the various sampling
selection methods and the statistical laws, popularly known as law of statistical
regularity and law of inertia of large numbers, on the basis of which inferences about
the population from samples are being made. In the process the students would come
to know about central limit theorem, sampling distribution, calculation of standard
error, Correlation and Regression as tools to make inferences about population based
on sample data.
(A) I only
(B) II only
(C) III only
(D) IV only
(E) None of the above.
2. Define Correlation and its various types. How is a scatter diagram useful in
finding the relationship between two variables?
3. Write down the direct and indirect (deviation form) formulas for the calculation
X 10 20 30 40 50 60 70
Y 100 90 85 70 60 45 30
5.
X 1 2 3 4 5
Y 10 20 30 50 40
(Ans.: r = +0.9)
X 1 2 3 4 5
Y 5 4 3 2 1
Z 3 5 2 1 4
Which pair of judges have the nearest approach to common tastes in beauty?
7.
X 49 69 39 49 29
Y 59 59 59 49 39
8. The coefficient of correlation between two variables X and Y is 0.4 and their
covariance is 10. If variance of X series is 9, find the variance of Y series.
(Ans.: 69.44).
9.
a) Sum of deviations of X = 5
b) Sum of deviations of Y = 4
Supply (Y) 10 20 30 50 40
Y = 3 + 9X)
b) Estimate the likely supply when price is Rs.7? (Ans.: When X=7,
Y=63).
c) What should be the price if the producer set the supply target at 80
units? (Ans.: When Y=80, X=7.5).
11) Compute both the regression equations by using deviation method from the
following data: (Ans.: Y = 18.04 - 1.34 X)
X 2 4 5 6 8 11
Y 18 12 10 8 7 5
X Y
Arithmetic mean 6 8
c) The most likely value of Y, when X = Rs. 100. (Ans.:When X=100, Y = 141.67).
1 2 3= 9.0