
UNIT-II: THEORY OF SAMPLING AND ESTIMATION

Structure

2.0 Introduction
2.1 Unit Objectives
2.2 Theory of Sampling
2.3 Types of Sampling
2.4 Sampling Distribution: Distribution of Sample Mean
2.5 Central Limit Theorem
2.6 Estimation: Point and Interval estimates of mean
2.7 Characteristics of Good Estimator: Small and Large Sample Properties.
2.8 Simple Correlation and Regression
2.9 Estimation of Regression Equations of X on Y and Y on X and Interpretation of
Regression Coefficients
2.10 Multiple Regression (Introductory)
2.11 Standard Error of an Estimate.
2.12 Summary
2.13 Answer to Check Your Progress
2.14 Questions and Exercises

2.0 INTRODUCTION

In statistics, estimation refers to the process by which one makes inferences about a
population based on information obtained from a sample. Many times a study of the entire
population is not possible, and even when it is possible, it is very costly and time consuming.
The purpose of this module is to let the students know about the various sampling
selection methods and the statistical laws, popularly known as the law of statistical
regularity and the law of inertia of large numbers, on the basis of which inferences about
the population are made from samples. In the process the students will come
to know about the central limit theorem, sampling distributions, the calculation of the standard
error, and correlation and regression as tools to make inferences about a population based
on sample data.

2.1 UNIT OBJECTIVES

The objectives of this unit are to let the students know about the various
concepts and terms used in the theory of sampling and estimation, to help them learn the
various methods of sampling, to acquaint them with the theory of sampling
distributions, and to familiarise them with the Central Limit Theorem and with the
computation of the standard error, correlation and regression.

2.2 THEORY OF SAMPLING

Before discussing the sampling theory, let us know the meaning of some important
terms.

POPULATION OR UNIVERSE

Population or Universe refers to the aggregate of statistical information on a particular


character of all the members covered by an investigation/enquiry. For example, marks
obtained by students in XII Class exam of C.B.S.E. constitute population.

POPULATION SIZE

Population size refers to the number of members of the population. For example, the number of
students in the XII class exam of C.B.S.E. is 6,00,000.

FINITE POPULATION

The population is said to be finite when the number of members of the population can
be expressed as a definite quantity. For example, population of marks obtained by
students in XII class exam of C.B.S.E. is finite because its number of members is a
definite quantity.

INFINITE POPULATION

The population is said to be infinite when the number of members of the population
cannot be expressed as a definite quantity.

EXISTENT/REAL POPULATION

The population is said to be existent when all the members of the population really
exist. For example, the population of taxable incomes of all the persons in India is an existent population.
HYPOTHETICAL POPULATION

The population is said to be hypothetical when all the members of the population do
not really exist. It is built up by repeating the event any number of times. For example, the
population of points obtained in all possible throws of a die is a hypothetical population.

SAMPLE

Sample refers to that part of aggregate statistical information (i.e. Population) which
is actually selected in the course of an investigation/enquiry to ascertain the
characteristics of the population.

SAMPLE SIZE

Sample size refers to the number of members of the population included in the sample.

PARAMETER

Parameter is a statistical measure based on each and every item of the


universe/population. For example, Population Mean (μ, called mu), Population
Standard Deviation (σ, called sigma), and the Proportion of Defectives in the whole lot of the
population (P). A parameter shows the characteristic of the universe/population. Since
a parameter remains a constant, it has neither a sampling fluctuation nor a sampling
distribution nor a standard error. Usually, parameters are unknown and statistics are
used as estimates of parameters. It may be noted that Greek letters like μ and σ are used
for these measures so as to differentiate them from the corresponding measures of a sample. A
parameter is used for calculating the standard error of a statistic.

STATISTIC

Statistic is a statistical measure based on items/observations of a sample. For example,


Sample Mean ( X ), Sample Standard Deviation (S), Proportion of Defectives
observed in the sample (p). A statistic shows the characteristic of the sample. Since
the value of a statistic varies from sample to sample, it has a sampling fluctuation, a
sampling distribution and a standard error. The sampling distribution of a statistic is the
probability distribution of that statistic, and the standard error is the standard deviation
of the sampling distribution of that statistic. Usually
parameters are unknown and statistics are used as estimates of parameters.
2.3 TYPES OF SAMPLING

The different methods of sampling are discussed below:

i) Deliberate, Purposive or Judgement Sampling

Meaning: Under this method selection of sample items is often based on certain
predetermined criteria fixed by the individual judgement of the sampler.

Advantages

1. A purposive sample may not vary widely from the average.

2. It is economical and useful if the sample size is small.

Disadvantages

1. There is much scope for personal bias.

2. Degree of accuracy of the estimates is not known.

3. As the sample size increases, the estimates become unreliable due to


accumulation of bias.

ii) Block or Cluster Sampling

Meaning: Under this method, certain blocks or clusters of higher concentration are
selected for complete enquiry e.g. all transactions of a particular period in a year.
These clusters are used often in multistage sampling wherein sampling is done in
stages.

Suitability: It is suitable where there is an unequal concentration of individual units


in the universe.

iii) Area Sampling

Meaning: Under this method, the total geographical area (if big) is divided into a
number of smaller non-overlapping areas and then some of the smaller areas are
selected and all units of the selected areas constitute the sample.

Advantage: It generally makes field interviewing efficient.

Suitability: It is suited in inquiries to be conducted over a large area, when the list of
population concerned is not available.
iv) Quota Sampling

Meaning: Under this method, each person engaged in the primary selection of data is
assigned a fixed quota of investigations e.g. 50 salaried persons in the age group of
25-30 years. Within the quota, the selection of sample items depends entirely on
personal judgement.

Advantage: The benefits of stratification are available.

Disadvantage: There is scope for personal bias.

Suitability: It is suitable in marketing research studies where a full list of the

population cannot be obtained without considerable delay and expenditure.

v) Random (or Probability) Sampling

Meaning: Under this method, selection of sample items is based on chance in such a
manner that each unit of the population has an equal chance of being included in the
sample. The methods of obtaining a random sample include Lottery System, Random
Tables, Nth number etc.

Advantages

1. There is no scope for personal bias.

2. Each item has an equal chance of being selected.

3. It provides more accurate and reliable data.

4. It becomes possible to have an idea about the errors of estimation.

Disadvantages: It is not suitable if the field of enquiry is small.

Suitability: It is suitable when the population is more or less homogeneous with


respect to characteristics under study.

Usefulness: The theories of sampling distributions and tests of significance are based
on random sampling only.
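The lottery method can be imitated on a computer. The short Python sketch below draws a simple random sample of 10 units from a hypothetical population of 100 serially numbered units, giving each unit an equal chance of selection without replacement.

# Simple random sampling (lottery method): every unit has an equal chance
# of being included, and no unit is selected twice.
import random

population = list(range(1, 101))        # hypothetical population of 100 numbered units
sample = random.sample(population, 10)  # draw 10 units at random without replacement
print(sorted(sample))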

vi) Systematic Sampling

Meaning: Under this method, selection of sampling items is done at uniform intervals
of time, space or order of occurrence.
Methodology

Step 1: Determine the sample size (say 10 houses)

Step 2: Determine the sampling interval k as follows:

k = (Total No. of items in the population) / (Total No. of items in the desired sample)
  = 100 (say) / 10 (say) = 10 houses

Step 3: Select the first unit of the sample from 1 to k at random and then include every k-
th unit in the sample.

For example, from the first 10 houses, one house is selected at random suppose with
serial number

9. Then the houses with serial numbers 19, 29, 39, 49, 59, 69, 79, 89, 99 should be
selected.
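The same procedure can be written as a short Python sketch; the 100 serially numbered houses are hypothetical, and the random start reproduces Step 3 above.

# Systematic sampling: pick a random start between 1 and k, then every k-th unit.
import random

houses = list(range(1, 101))   # hypothetical population of 100 serially numbered houses
n = 10                         # desired sample size
k = len(houses) // n           # sampling interval: k = 100 / 10 = 10
start = random.randint(1, k)   # random start between 1 and k (e.g. 9)
sample = houses[start - 1::k]  # the start and every k-th house after it
print(sample)                  # e.g. [9, 19, 29, 39, 49, 59, 69, 79, 89, 99]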

Advantages: Actual selection of the sample is easier and quicker. A systematic


sample is practically equivalent to a random sample if the characteristic under study is
independent of the order of arrangement of the units.

Disadvantage: The sample may be biased if there are periodic features associated
with the sampling interval.

Suitability: It is suitable when the units described are serially numbered.

vii) Stratified Sampling

Meaning: Under this method, the population is sub-divided into several groups
(called strata) on the basis of purposive sampling and then samples of desired size are
selected from each of them on the basis of random sampling. All the samples
combined together give the stratified sample. Thus, it is a mixture of both purposive
and random sampling.

Purposes: The main purposes of stratification are:

(a) to increase the precision of the overall estimates,

(b) to ensure that all sections of the population are adequately represented.

(c) to avoid a large size of the population and

(d) to avoid the heterogeneity of the population.


Advantages

1. It removes the variation between strata from the sampling error and thereby reduces that
error.

2. It brings about a gain in the precision of the sample estimate when the strata
variability is the least.

3. Independent estimates for different strata can be prepared.

4. There is not much scope for personal bias.

Disadvantage: The results may be misleading if the basis of stratification is not


properly decided.
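A minimal Python sketch of stratified sampling is given below; the three strata and the sample of three units per stratum are hypothetical, and each stratum is sampled at random as described above.

# Stratified sampling: divide the population into strata, then draw a random
# sample from each stratum and combine the results.
import random

strata = {
    "stratum_A": list(range(1, 41)),     # 40 units (hypothetical)
    "stratum_B": list(range(41, 91)),    # 50 units (hypothetical)
    "stratum_C": list(range(91, 101)),   # 10 units (hypothetical)
}

stratified_sample = []
for name, units in strata.items():
    stratified_sample.extend(random.sample(units, 3))  # 3 units at random per stratum
print(stratified_sample)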

viii) Multi Stage Sampling

Meaning: Under this method sampling is done in several stages starting from the
larger units, intermediate units and finally reaching the ultimate units of selection.

Procedure

Step 1: Divide the population into first-stage units. (say country into states)

Step 2: Divide the first-stage units into second-stage units (say a state into districts)

Step 3: Divide the second-stage units into third-stage units (say a district into tehsils)

Step 4: Divide the third-stage units into fourth-stage units (say a tehsil into villages)

Step 5: Divide the fourth-stage units until we reach the ultimate/units (say a village
into households)

Step 6: Select some of first stage units at random (say any three states at random) and
then select some second stage units (say any two districts) from each of the selected
first stage units and this process is carried on from stage to stage until the ultimate
units are selected.

Advantage: Usually, considerable saving in cost is achieved.

ix) Sequential Sampling

Meaning: Under this method, a relatively small sample is tested for drawing a
decision and if the first sample does not give evidence for a definite decision, more
units are chosen at random and added to sample until a decision is possible using
enlarged sample.
Usefulness: It is used to draw inference on the behaviour of the population and in
estimating the unknown characteristics of the population.

Advantages of Using Sampling Methods

1. These facilitate quick results.

2. These facilitate more skilled analysis.

3. These facilitate following up of non-responsive units.

4. These facilitate the error estimation.

5. These involve lower costs.

6. These provide higher quality data.

7. These are more scientific as compared to a census.

2.4 SAMPLING DISTRIBUTION: DISTRIBUTION OF SAMPLE MEAN

Meaning: Sampling distribution of a given statistic is the probability distribution of


that statistic.

Examples: Two important sampling distributions for large samples are:

1. Sampling Distribution of Sample Mean: If x represents the mean of a


random sample of size n, drawn from a population with mean μ and standard

deviation (s.d.) σ, then the sampling distribution of x is approximately a

normal distribution with mean = μ and s.d. = standard error of x, provided the
sample size n is sufficiently large.

2. Sampling Distribution of Sample Proportion: If p represents the proportion


of defectives in a random sample of size n drawn from a lot with proportion of
defectives P, then the sampling distribution of p is approximately a normal
distribution with mean = P and s.d. = standard error of p, provided the sample
size n is sufficiently large.

[Note: The larger the value of n, the better is the approximation.]

Properties: The sampling distribution of mean has the following properties:

1. Its mean is the same as the population mean (μ).
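This property can be checked by simulation. The Python sketch below repeatedly draws samples of size 50 from a hypothetical, non-normal (exponential) population with mean 10; the mean of the resulting sample means stays close to the population mean, and their standard deviation stays close to the standard error of the mean, σ/sqrt(n).

# Simulating the sampling distribution of the sample mean.
import random
import statistics

population_mean = 10.0
sample_size = 50
means = []
for _ in range(5000):                      # 5000 repeated samples
    sample = [random.expovariate(1 / population_mean) for _ in range(sample_size)]
    means.append(statistics.mean(sample))

print(round(statistics.mean(means), 2))    # close to the population mean, 10
print(round(statistics.stdev(means), 2))   # close to 10 / sqrt(50) = 1.41 (the standard error)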


2.6 ESTIMATION: POINT AND INTERVAL ESTIMATES OF MEAN

Statisticians use sample statistics to estimate population parameters. For example,


sample means are used to estimate population means; sample proportions, to estimate
population proportions.

An estimate of a population parameter may be expressed in two ways:

Point estimate. A point estimate of a population parameter is a single value of


a statistic. For example, the sample mean x is a point estimate of the population
mean μ; similarly, the sample proportion p is a point estimate of
the population proportion P.

Interval estimate. An interval estimate is defined by two numbers, between


which a population parameter is said to lie. For example, a < x < b is an

interval estimate of the population mean; it indicates that the population
mean is greater than a but less than b.

CONFIDENCE INTERVALS

Statisticians use a confidence interval to express the precision and uncertainty


associated with a particular sampling method. A confidence interval consists of three
parts.

A confidence level.

A statistic.

A margin of error.

The confidence level describes the uncertainty of a sampling method. The statistic and
the margin of error define an interval estimate that describes the precision of the
method. The interval estimate of a confidence interval is defined by the sample
statistic ± margin of error.

For example, suppose we compute an interval estimate of a population parameter. We


might describe this interval estimate as a 95% confidence interval. This means that if
we used the same sampling method to select different samples and compute different
interval estimates, the true population parameter would fall within a range defined by
the sample statistic ± margin of error 95% of the time.
Confidence intervals are preferred to point estimates, because confidence intervals
indicate (a) the precision of the estimate and (b) the uncertainty of the estimate.

CONFIDENCE LEVEL

The probability part of a confidence interval is called a confidence level. The


confidence level describes the likelihood that a particular sampling method will
produce a confidence interval that includes the true population parameter.

Here is how to interpret a confidence level. Suppose we collected all possible samples
from a given population, and computed confidence intervals for each sample. Some
confidence intervals would include the true population parameter; others would not. A
95% confidence level means that 95% of the intervals contain the true population
parameter; a 90% confidence level means that 90% of the intervals contain the
population parameter; and so on.
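This interpretation can be illustrated with a small simulation. The sketch below, written for a hypothetical normal population with mean 50 and standard deviation 5, computes the interval x-bar ± 1.96·σ/sqrt(n) for each of 2000 samples and counts how often the interval contains the true mean; the proportion comes out close to 95%.

# Checking the coverage of a 95% confidence interval by simulation.
import math
import random
import statistics

mu, sigma, n = 50, 5, 30      # hypothetical population and sample size
trials, covered = 2000, 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.mean(sample)
    margin = 1.96 * sigma / math.sqrt(n)   # critical value 1.96 for 95% confidence
    if xbar - margin <= mu <= xbar + margin:
        covered += 1
print(covered / trials)        # approximately 0.95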

MARGIN OF ERROR

In a confidence interval, the range of values above and below the sample statistic is
called the margin of error.

For example, suppose the local newspaper conducts an election survey and reports
that the independent candidate will receive 30% of the vote. The newspaper states that
the survey had a 5% margin of error and a confidence level of 95%. These findings
result in the following confidence interval: We are 95% confident that the
independent candidate will receive between 25% and 35% of the vote.

Note: Many public opinion surveys report interval estimates, but not confidence
intervals. They provide the margin of error, but not the confidence level. To clearly
interpret survey results you need to know both! We are much more likely to accept
survey findings if the confidence level is high (say, 95%) than if it is low (say, 50%).

2.7 CHARACTERISTICS OF GOOD ESTIMATOR: SMALL AND LARGE


SAMPLE PROPERTIES.

A "Good" estimator is the one which provides an estimate with the following
qualities:

Unbiasedness: An estimate is said to be an unbiased estimate of a given parameter


when the expected value of that estimator can be shown to be equal to the parameter
being estimated. For example, the mean of a sample is an unbiased estimate of the
mean of the population from which the sample was drawn. Unbiasedness is a good
quality for an estimate, since, in such a case, using weighted average of several
estimates provides a better estimate than each one of those estimates. Therefore,
unbiasedness allows us to upgrade our estimates. For example, if your estimates of the
population mean µ are, say, 10 and 11.2 from two independent samples of sizes 20
and 30 respectively, then a better estimate of the population mean µ based on both
samples is the weighted average [20 (10) + 30 (11.2)] / (20 + 30) = 10.72.

Consistency: The standard deviation of an estimate is called the standard error of that
estimate. The larger the standard error the more error in your estimate. The standard
deviation of an estimate is a commonly used index of the error entailed in estimating a
population parameter based on the information in a random sample of size n from the
entire population.

An estimator is said to be "consistent" if increasing the sample size produces an


estimate with a smaller standard error; the estimate improves steadily as the
sample size grows. That is, spending more money to obtain a larger sample produces a better
estimate.

Efficiency: An efficient estimate is one which has the smallest standard error among
all unbiased estimators.

The "best" estimator is the one which is the closest to the population parameter being
estimated.

Sampling refers to the selection of a part of aggregate statistical information (i.e.


Population) with a view to ascertain the characteristics of the whole (i.e. population).
Sampling is used in various areas such as:

(a) In Industry for Statistical Quality Control.

(b) In Business for inspecting the incoming lots of materials from suppliers.

2.8 SIMPLE CORRELATION AND REGRESSION

CORRELATION-Basic Concepts:

In statistics, correlation (often measured by a correlation coefficient, r) indicates the


strength and direction of a linear relationship between two random variables. That is
in contrast with the usage of the term in colloquial speech, which denotes any
relationship, not necessarily linear. In general statistical usage, correlation or co-
relation refers to the departure of two random variables from independence. In this
broad sense there are several coefficients, measuring the degree of correlation,
adapted to the nature of the data.

So, correlation is a term used to describe how strong the relationship between the two
variables appears to be. We say that there is a positive linear correlation if y increases
as x increases and we say there is a negative linear correlation if y decreases as x
increases. There is no correlation if x and y do not appear to be related.

A number of different coefficients are used for different situations. The best known is
the Pearson product-moment correlation coefficient, which is obtained by dividing the
covariance of the two variables by the product of their standard deviations. Despite its
name, it was first introduced by Francis Galton.

There are various ways of measuring the correlation coefficient:

1. Scatter diagram method

2. Karl Pearson's product-moment coefficient of correlation

3. Spearman's rank correlation coefficient

Scatter Diagram (Correlation):

The Scatter Diagram helps to identify the existence of a measurable relationship


between two such items by measuring them in pairs and plotting them on a graph, as
in the figure below. This visually shows the correlation between the two sets of
measurements.
Figure 2.1: Scatter Diagram. Each time the first item is measured, the second item is
also measured, and this pair of measurements is plotted as a point on the Scatter
Diagram, with the first item measurement on one axis and the second item
measurement on the other. For each first item measurement there may be a range of
possible second item measurements, and vice versa.


If the points plotted on the Scatter Diagram are randomly scattered, with no
discernible pattern, then this indicates that the two sets of measurements have no
correlation and cannot be said to be related in any way. If, however, the points form
a pattern of some kind, then this shows the type of relationship between the two
measurement sets. The closer the points are to the line, the greater the correlation, as
given in the following table.

Table 2.1: Degrees of correlation

Degree of correlation: None
Interpretation: No relationship can be seen. The y variable is not related to the x
variable in any way.

Degree of correlation: Low
Interpretation: A vague relationship is seen. There is a low positive correlation between
the x variable and the y variable. There might be some connection between the two, but
it is not clear.

Degree of correlation: High
Interpretation: The points are grouped into a clear linear shape. The two variables are
clearly related in some way. Given one, you can predict a moderate range in which the
other will be found.

Degree of correlation: Perfect
Interpretation: All points lie on a line (which is usually straight). The variables are
deterministically related, and given one you can predict the other with accuracy.

The correlations in the above table all go from low on the left in a line to high on
the right. This is not always the shape of a correlation, as is shown in the next table.
Correlations can be positive or negative, linear or curved. They also do not go on
forever, and using them to predict values outside the measured range is always
risky.

Table 2.2: Types of correlation

Type of correlation: Positive
Interpretation: Straight line, sloping up from left to right. Increasing the value of the
'cause' results in a proportionate increase in the value of the 'effect'.

Type of correlation: Negative
Interpretation: Straight line, sloping down from left to right. Increasing the value of the
'cause' results in a proportionate decrease in the value of the 'effect'.

Type of correlation: Curved
Interpretation: Various curves, typically U- or S-shaped. Changing the value of the
'cause' results in the 'effect' changing differently, depending on the position on the
curve.

Type of correlation: Part-linear
Interpretation: Part of the diagram is a straight line (sloping up or down). May be due to
breakdown or overload of the y variable, or is a curve with a part that approximates to a
straight line (which may be treated as such).

A Scatter Diagram may show correlation between two items for three reasons:

(a) There is a cause and effect relationship between the two measured items,
where one is causing the other (at least in part).

(b) The two measured items are both caused by a third item. For example, a
Scatter Diagram may show a correlation between cracks and transparency of glass
utensils because changes in both are caused by changes in furnace temperature.

(c) The correlation appears by chance in the particular set of measurements, with no
real relationship between the two items.
The Karl Pearson coefficient of correlation can be calculated directly from the observations as

r (x,y) = [N ΣXY − ΣX ΣY] / sqrt{[N ΣX² − (ΣX)²] [N ΣY² − (ΣY)²]}

where X and Y are the given variables, ΣX² is the sum of the squares of the X values and
ΣY² is the sum of the squares of the Y values. Equivalently, in deviation form,

r (x,y) = [N Σdx dy − Σdx Σdy] / sqrt{[N Σdx² − (Σdx)²] [N Σdy² − (Σdy)²]}

where dx = (X − A) are the deviations of the X values around an assumed mean A and dy = (Y − B)
are the deviations of the Y values around an assumed mean B.

The correlation is defined only if both of the standard deviations are finite and both of
them are nonzero. It is a corollary of the Cauchy-Schwarz inequality that the
correlation cannot exceed 1 in absolute value.

The correlation is +1 in the case of a perfect increasing linear relationship, −1 in the
case of a perfect decreasing linear relationship, and some value in between in all other
cases, indicating the degree of linear dependence between the variables. The closer
the coefficient is to either −1 or +1, the stronger the correlation between the variables.
If the variables are independent then the correlation is 0, but the converse is not true
because the correlation coefficient detects only linear dependencies between two
variables. Here is an example: Suppose the random variable X is uniformly distributed
on the interval from −1 to +1 and Y = X². Then Y is completely determined by X, so
that X and Y are dependent, but their correlation is zero; they are uncorrelated.
However, in the special case when X and Y are jointly normal, uncorrelatedness is
equivalent to independence.

A correlation between two variables is diluted in the presence of measurement error


around estimates of one or both variables, in which case disattenuation provides a
more accurate coefficient.
Correlation Co-efficient Example: To find the Correlation of

X Y
Values Values

60 3.1

61 3.6

62 3.8

63 4

65 4.1

Step-1: Count the number of observations, N=5.

Step-2: Find X2, Y2, X.Y and their summations.

X Y X*Y X*X Y*Y


Value Value

60 3.1 60 * 3.1 = 60 * 60 = 3.1 * 3.1 =


186 3600 9.61

61 3.6 61 * 3.6 = 61 * 61 = 3.6 * 3.6 =


219.6 3721 12.96

62 3.8 62 * 3.8 = 62 * 62 = 3.8 * 3.8 =


235.6 3844 14.44

63 4 63 * 4 = 63 * 63 = 4 * 4 = 16
252 3969

65 4.1 65 * 4.1 = 65 * 65 = 4.1 * 4.1 =


266.5 4225 16.81

Step 3: Find the sums:
ΣX = 311; ΣY = 18.6; ΣXY = 1159.7; ΣX² = 19359; ΣY² = 69.82.
Step 4: Now, substitute in the formula given above.
r = [N ΣXY − (ΣX)(ΣY)] / sqrt{[N ΣX² − (ΣX)²][N ΣY² − (ΣY)²]}
= ((5)*(1159.7)-(311)*(18.6))/sqrt([(5)*(19359)-(311)²]*[(5)*(69.82)-(18.6)²])
= (5798.5 - 5784.6)/sqrt([96795 - 96721]*[349.1 - 345.96])
= 13.9/sqrt(74*3.14)
= 13.9/sqrt(232.36)
= 13.9/15.24336
= 0.9119. This shows a high degree of positive correlation between X and Y.
This example will guide you to find the relationship between two variables by
calculating the Correlation Co-efficient from the above steps
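The same computation can be reproduced in a few lines of Python using the raw-score formula above; the X and Y values are those of the worked example.

# Pearson correlation coefficient by the raw-score (direct) formula.
import math

X = [60, 61, 62, 63, 65]
Y = [3.1, 3.6, 3.8, 4.0, 4.1]
N = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)

r = (N * sum_xy - sum_x * sum_y) / math.sqrt(
    (N * sum_x2 - sum_x ** 2) * (N * sum_y2 - sum_y ** 2))
print(round(r, 4))   # approximately 0.9119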

Example: From a paddy field, 12 plants were selected at random. The length of
panicles in cm (x) and the number of grains per panicle (y) of the selected plants were
recorded. The results are given in the following table. Calculate correlation coefficient
and its testing.

Y 112 131 147 90 110 106 127 145 85 94 142 111

X 22.9 23.9 24.8 21.2 22.2 22.7 23.0 24.0 20.6 21.0 24.0 23.1

Solution: a) Direct Method:

r (x,y) = [N ΣXY − ΣX ΣY] / sqrt{[N ΣX² − (ΣX)²][N ΣY² − (ΣY)²]}

Substituting the values in the above formula:

r = [12 × 32195.6 − 1400 × 273.4] / sqrt{[12 × 168450 − (1400)²][12 × 6248.20 − (273.4)²]} = 0.95

b) Deviation Method:

r (x,y) = [N Σdx dy − Σdx Σdy] / sqrt{[N Σdx² − (Σdx)²][N Σdy² − (Σdy)²]}

Substituting the values in the above formula:

r = [12 × 449.8 − (−124)(−14.6)] / sqrt{[12 × 6398 − (−124)²][12 × 37 − (−14.6)²]} = 0.95

Interpretation of the size of a correlation:

Several authors have offered guidelines for the interpretation of a correlation
coefficient, for example:

Correlation      Negative          Positive
Small            −0.3 to −0.1      0.1 to 0.3
Medium           −0.5 to −0.3      0.3 to 0.5
Large            −1.0 to −0.5      0.5 to 1.0

Cohen (1988) has observed, however, that all such criteria are in some ways arbitrary
and should not be observed too strictly. This is because the interpretation of a
correlation coefficient depends on the context and purposes. A correlation of 0.9 may be
very low if one is verifying a physical law using high-quality instruments, but may be
regarded as very high in the social sciences where there may be a greater contribution
from complicating factors.
Along this vein, it is important to remember that "large" and "small" should not be
taken as synonyms for "good" and "bad" in terms of determining that a correlation is
meaningful. A correlation of exactly 1.0 means only that the two
variables analyzed are equivalent modulo scaling. Scientifically, this more frequently
indicates a trivial result than a profound one. For example, consider discovering a
correlation of 1.0 between how many feet tall a group of people are and the number of
inches from the bottom of their feet to the top of their heads.

Spearman's rank correlation coefficient:

In statistics, Spearman's rank correlation coefficient or Spearman's rho, named


after Charles Spearman and often denoted by the Greek letter ρ (rho) or as rs, is a
non-parametric measure of correlation that is, it assesses how well an arbitrary
monotonic function could describe the relationship between two variables, without
making any other assumptions about the particular nature of the relationship between
the variables. Certain other measures of correlation are parametric in the sense of
being based on possible relationships of a parameterised form, such as a linear
relationship.

Spearman's coefficient can be used when both dependent (outcome; response)


variable and independent (predictor) variable are ordinal numeric, or when one
variable is an ordinal numeric and the other is a continuous variable. However, it can
also be appropriate to use Spearman's correlation when both variables are continuous.

Calculation:

Spearman's coefficient is simply a special case of the Pearson product-moment coefficient in
which the two sets of data Xi and Yi are converted to rankings xi and yi before calculating
the coefficient. In practice, however, a simpler procedure is normally used: the differences
di between the ranks of each observation on the two variables are calculated, and

ρ = 1 − [6 Σdi²] / [n(n² − 1)].

In the often-quoted example relating IQ to hours spent watching TV, the value obtained is close
to zero; this low value shows that the correlation between IQ and hours spent watching TV is
very low. In the case of ties in the original values, this formula should not be used.

Computation for tied observations: There may be two or more items having equal
values. In such case the same rank is to be given. The ranking is said to be tied. In
such circumstances an average rank is to be given to each individual item. For
example, if a value is repeated twice at the 5th and 6th positions, the common rank to be
assigned to each of the two items is (5 + 6)/2 = 5.5, the average of the ranks 5 and 6 they
would otherwise have occupied. If the ranks are tied, it is required to apply a correction
factor, which is

(m³ − m)/12 for each group of tied values. A slightly different formula is therefore used
when there is more than one item having the same value.

The formula is:

r = 1 − 6[ΣD² + Σ(m³ − m)/12] / (n³ − n)

where m is the number of items whose ranks are common; the correction term is repeated as
many times as there are groups of tied observations.

Example : Rank Correlation for tied observations. Following are the marks obtained
by 10 students in a class in two tests. Calculate the rank correlation coefficient
between the marks of two tests.

Students I II III IV V VI VII VIII IX X

Test-I 70 68 67 55 60 60 75 63 60 72

Test-II 65 65 80 60 68 58 75 63 60 70

Solution:

Students Test-I Rank R1 Test-II Rank RII Di= RI-RII Di2

I 70 3 65 5.5 -2.5 6.25

II 68 4 65 5.5 -1.5 2.25


III 67 5 80 1.0 4 16

IV 55 10 60 8.5 1.5 2.25

V 60 8 68 4.0 4 16

VI 60 8 58 10.0 -2 4

VII 75 1 75 2.0 -1 1

VIII 63 6 63 7.0 -1 1

IX 60 8 60 8.5 -0.5 0.25

X 72 2 70 3.0 -1 1

In Test-I, 60 is repeated 3 times; in Test-II, 65 is repeated twice and 60 is repeated twice.
So m1 = 3, m2 = 2, m3 = 2, and ΣD² = 50.

r = 1 − 6[ΣD² + (m1³ − m1)/12 + (m2³ − m2)/12 + (m3³ − m3)/12] / (n³ − n)

= 1 − 6[50 + (3³ − 3)/12 + (2³ − 2)/12 + (2³ − 2)/12] / (10³ − 10)

= 1 − 6[50 + 2 + 0.5 + 0.5] / 990 = 1 − (6 × 53)/990

Or, r = 1 − 0.32 = 0.68.
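The tie-corrected calculation above can be checked with a short Python sketch; the rank-averaging rule and the correction factor (m³ − m)/12 are exactly those described earlier, and the marks are those of the two tests.

# Spearman's rank correlation with tied observations.

def avg_ranks(values):
    # Rank 1 for the highest value; tied values share the average of their ranks.
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def tie_correction(values):
    # Sum of (m^3 - m) / 12 over each group of m tied values.
    return sum((values.count(v) ** 3 - values.count(v)) / 12 for v in set(values))

test1 = [70, 68, 67, 55, 60, 60, 75, 63, 60, 72]
test2 = [65, 65, 80, 60, 68, 58, 75, 63, 60, 70]

r1, r2 = avg_ranks(test1), avg_ranks(test2)
n = len(test1)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))        # sum of squared rank differences = 50
cf = tie_correction(test1) + tie_correction(test2)    # 2 + 0.5 + 0.5 = 3
r = 1 - 6 * (d2 + cf) / (n ** 3 - n)
print(round(r, 2))   # approximately 0.68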

Properties of Correlation:

Property 1: The correlation coefficient lies between −1 and +1.

Property 2: The correlation coefficient is independent of the change of origin and scale.

Property 3: It is a pure number independent of units of measurement.

Property 4: Independent variables are uncorrelated but the converse is not true.
Property 5: Correlation coefficient is the geometric mean of two regression
coefficients.

Property 6: The correlation coefficient of x and y is symmetric. rxy = ryx.

Limitations:

1. The correlation coefficient assumes a linear relationship regardless of whether that assumption is

correct or not.

2. Extreme values of the variables unduly influence the correlation coefficient.

3. Existence of correlation does not necessarily indicate cause and effect relation.

REGRESSION:

The term 'regression' was
first used by the British biometrician Sir Francis Galton.

The relationship between the independent and dependent variables may be expressed
as a function. Such functional relationship between two variables is termed as
regression. In regression analysis independent variable is also known as regressor or

predictor or explanatory variable while dependent variable is also known as regressed


or explained variable. When only two variables are involved the functional
relationship is known as simple regression. If the relationship between two variables
is a straight line, it is known as simple linear regression; otherwise it is called as
simple non-linear regression.

Direct Method:

The regression equation of Y on X is given as Y= a+bX, Where Y= dependent


variable; X = independent variable and a = intercept, b = the regression coefficient (or
slope) of the line. a and b are also called constants. The constants a and b can be
estimated by the method of least squares, i.e. by minimising the sum of squared errors
Σ(Y − a − bX)². Differentiating this sum with respect to a and b and setting the derivatives
equal to zero (the condition for minimisation) gives the two normal equations:

ΣY = Na + bΣX

ΣXY = aΣX + bΣX²

Substituting the values of ΣX, ΣY, ΣXY and ΣX² from the given data and solving these equations,
we get the values for a and b.
Regression Formula:

Regression Equation(y) = a + bx

Slope(b) = [N ΣXY − (ΣX)(ΣY)] / [N ΣX² − (ΣX)²]

Intercept(a) = [ΣY − b(ΣX)] / N

Where, x and y are the variables; b = the slope of the regression line;
a = the intercept point of the regression line and the y axis;
N = Number of values or elements; X = First Score; Y = Second Score;
ΣXY = Sum of the products of First and Second Scores;
ΣX = Sum of First Scores; ΣY = Sum of Second Scores;
ΣX² = Sum of squared First Scores.

2.9 ESTIMATION OF REGRESSION EQUATIONS OF X ON Y AND Y

ON X AND INTERPRETATION OF REGRESSION COEFFICIENTS

Regression Example: To find the Simple/Linear Regression of

X Y
Values Values

60 3.1

61 3.6

62 3.8

63 4

65 4.1

To find regression equation, we will first find slope, intercept and use it to form
regression equation..
Step 1: Count the number of values. N = 5
Step 2: Find XY, X2. See the below table-

X Y
X*Y X*X
Value Value

60 3.1 60 * 3.1 = 186 60 * 60 = 3600


61 3.6 61 * 3.6 = 219.6 61 * 61 = 3721

62 3.8 62 * 3.8 = 235.6 62 * 62 = 3844

63 4 63 * 4 = 252 63 * 63 = 3969

65 4.1 65 * 4.1 = 266.5 65 * 65 = 4225

Step 3: Find the sums:
ΣX = 311; ΣY = 18.6; ΣXY = 1159.7; ΣX² = 19359.
Step 4: Substitute in the slope formula given above.
Slope(b) = [N ΣXY − (ΣX)(ΣY)] / [N ΣX² − (ΣX)²]
= ((5)*(1159.7)-(311)*(18.6))/((5)*(19359)-(311)²)
= (5798.5 - 5784.6)/(96795 - 96721)
= 13.9/74
= 0.19
Step 5: Now, again substitute in the intercept formula given above.
Intercept(a) = [ΣY − b(ΣX)] / N
= (18.6 - 0.19(311))/5
= (18.6 - 59.09)/5
= -40.49/5
= -8.098
Step 6: Then substitute these values in regression equation formula
Regression Equation(y) = a + bx
= -8.098 + 0.19x.
Suppose if we want to know the approximate y value for the variable x = 64. Then we
can substitute the value in the above equation.
Regression Equation(y) = a + bx
= -8.098 + 0.19(64).
= -8.098 + 12.16
= 4.06
This example will guide you to find the relationship between two variables by
calculating the Regression from the above steps.
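The same computation can be written as a small Python sketch using the slope and intercept formulas above. Note that the sketch keeps the unrounded slope (about 0.1878), so its intercept is about −7.96 rather than the −8.098 obtained above with the rounded slope 0.19; the predicted value at x = 64 is still about 4.06.

# Simple linear regression (least squares) for the worked example.
X = [60, 61, 62, 63, 65]
Y = [3.1, 3.6, 3.8, 4.0, 4.1]
N = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

b = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)   # slope, about 0.1878
a = (sum_y - b * sum_x) / N                                    # intercept, about -7.96
print(round(b, 4), round(a, 2))
print(round(a + b * 64, 2))    # predicted y at x = 64, about 4.06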
Deviation Method:

The regression equation of Y on X is:

(Y − Ȳ) = byx (X − X̄)

Or, Y = Ȳ + byx (X − X̄). byx is known as the regression coefficient of Y on X.

Where, byx = [n ΣXY − ΣX ΣY] / [n ΣX² − (ΣX)²], or byx = r (σy / σx), and X̄ = ΣX / n; Ȳ = ΣY / n.

The regression equation of X on Y is:

(X − X̄) = bxy (Y − Ȳ)

Or, X = X̄ + bxy (Y − Ȳ). bxy is known as the regression coefficient of X on Y.

Where, bxy = [n ΣXY − ΣX ΣY] / [n ΣY² − (ΣY)²], or bxy = r (σx / σy).

Properties of Regression Coefficients:

1. The correlation coefficient is the geometric mean of the two regression coefficients,

i.e. r = ±√(byx · bxy).

2. If one of the regression coefficients is greater than unity, the other must be less than
unity.

3. The arithmetic mean of the regression coefficients is greater than or equal to the
correlation coefficient.

4. Regression coefficients are independent of the change of origin but not of scale.

5. Regression is only a one-way relationship between y (dependent variable) and x

(independent variable).

6. The range of a regression coefficient b is from −∞ to +∞.


Note:

1. Both the lines of regression pass through the point (X̄, Ȳ). In other words, the mean

values (X̄, Ȳ) can be obtained as the point of intersection of the two regression lines.

2. If r = 0, the two variables are uncorrelated, the lines of regression become


perpendicular to each other

3. If r = ± 1, in this case the two lines of regression either coincide or they are parallel
to each other

4. If the regression coefficients are positive, r is positive and if the regression


coefficients are negative, r is negative.

Example: If the two regression coefficients are byx = 4/5 and bxy = 9/20, what would be the
value of r?

Solution:

The correlation coefficient, r = √(byx · bxy) = √((4/5) × (9/20)) = √(36/100) = 0.6.

Example : Compute the two regression equations from the following data.

X 1 2 3 4 5

Y 2 3 5 4 6

If x =2.5, what will be the value of y?

Solution:

X Y x =X- X y =Y- Y x2 y2 xy

1 2 -2 -2 4 4 4

2 3 -1 -1 1 1 1

3 5 0 1 0 1 0

4 4 1 0 1 0 0

5 6 2 2 4 4 4

X=15 Y=20 x2=10 y2=10 xy=9


From the table, byx = Σxy / Σx² = 9/10 = 0.9, so the regression equation of Y on X is
Y − 4 = 0.9 (X − 3), i.e. Y = 1.3 + 0.9X. Similarly, bxy = Σxy / Σy² = 9/10 = 0.9, so the
regression equation of X on Y is X − 3 = 0.9 (Y − 4), i.e. X = −0.6 + 0.9Y.

Here byx and bxy satisfy the properties of regression coefficients: their signs are the same
and their product, 0.81, is less than one.

Correlation coefficient, r = √(byx · bxy) = √(0.9 × 0.9) = 0.9.

When x = 2.5, the estimated value of y is Y = 1.3 + 0.9 (2.5) = 3.55.

2.10 MULTIPLE REGRESSION (INTRODUCTORY)

Multiple regression is an extension of simple linear regression in which more than one
independent variable (X) is used to predict a single dependent variable (Y). The
predicted value of Y is a linear transformation of the X variables such that the sum of
squared deviations of the observed and predicted Y is a minimum. The computations
are more complex, however, because the interrelationships among all the variables
must be taken into account in the weights assigned to the variables. The interpretation
of the results of a multiple regression analysis is also more complex for the same
reason.

With two independent variables the prediction of Y is expressed by the following


equation:

Y'i = a + b1X1i + b2X2i

Note that this transformation is similar to the linear transformation of two variables
discussed in the previous chapter except that the w's have been replaced with b's and
the X'i has been replaced with a Y'i.

Y'i = the predicted value of Y (which is your dependent variable)

a = the intercept, i.e. the predicted value of Y when all of the X's are zero

b1 = The change in Y for each 1 increment change in X1

b2 = The change in Y for each 1 increment change in X2

X = an X score (X is your Independent Variable) for which you are trying to predict a
value of Y
The "b" values are called regression weights and are computed in a way that
minimizes the sum of squared deviations
-3 ,657.213)X2

( Here the values of a, b1 and b2 are taken hypothetically)

Note that I did not plug in the numbers for X1 and X2. These are the places where
you plug in your values that you are going to use to make a prediction. In this case,
X1 refers to the number of years of school

(13) and X2 is the motivation score (49). So, if we plug in these final numbers, we
can make our prediction. See below.
,356.085)(13) + (-3,657.213)(49)
-179203.437)

So, given a job applicant with 13 years of education completed and who received a
motivation score of 49 on the Higgins Motivation Scale, our single best prediction of
how much this person will earn for our dealership is $685,881.74. Pretty cool, huh?
Think a for a few minutes about how a tool like this could be used in whatever career
field you are thinking about going in to!

EXAMPLE: Quantifying the influence of age and wage on absenteeism in the


firm Buenosaires

Buenosaires is a firm devoted to manufacturing fans, having had relatively acceptable


results in recent years. The managers consider that these would have been better if the
absenteeism in the company were not so high. For this purpose, the following model
is proposed:

absent = β1 + β2 age + β3 tenure + β4 wage + u

Where absent is measured in days per year; wage in thousands of euros per year;
tenure in years in the firm and age is expressed in years. Using a sample of size 48
(file absent), the equation has been estimated, and the estimated coefficients are interpreted as follows.

The interpretation of β̂2 is the following: holding tenure and wage fixed, if age
increases by one year, worker absenteeism will be reduced by 0.096 days per year.
The interpretation of β̂3 is as follows: holding age and wage fixed, if tenure
increases by one year, worker absenteeism will be reduced by 0.078 days per year.
Finally, the interpretation of β̂4 is the following: holding age and tenure fixed, if
the wage increases by 1000 euros per year, worker absenteeism will be reduced by
0.036 days per year.
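A model of this kind can be estimated by ordinary least squares. The Python sketch below shows the mechanics with a small, entirely hypothetical data set (it is not the absent file used above), adding a column of ones for the intercept and solving the least-squares problem with NumPy.

# Multiple regression by ordinary least squares on hypothetical data.
import numpy as np

age    = np.array([25, 32, 41, 29, 50, 36, 45, 38])   # years (hypothetical)
tenure = np.array([ 1,  5, 12,  3, 20,  8, 15, 10])   # years in the firm (hypothetical)
wage   = np.array([18, 22, 30, 20, 35, 25, 32, 27])   # thousands of euros (hypothetical)
absent = np.array([ 9,  7,  4,  8,  2,  6,  3,  5])   # days per year (hypothetical)

X = np.column_stack([np.ones(len(age)), age, tenure, wage])   # intercept column plus regressors
beta, *_ = np.linalg.lstsq(X, absent, rcond=None)             # least-squares estimates
print(np.round(beta, 3))   # [intercept, coefficient of age, of tenure, of wage]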

2.11 STANDARD ERROR OF AN ESTIMATE

Standard Error of a given statistic is the standard deviation of sampling distribution of


that statistic for example, Standard Error of Mean is the standard deviation of
sampling distribution of mean. Likewise, Standard Error of Proportion is the standard
deviation of sampling distribution of proportion obtained from all possible samples of
same size drawn from same population. In other words, standard error of a given
statistic is the standard deviation of all possible values of that statistic in repeated
sample of a fixed size from a given population. It is a measure of the divergence
between the statistic and parameter values. This divergence varies with the sample
size (n).

Thus,

Sample Size Standard Error

Increases Decreases

Decreases Increases

An optimum sample size would be the one which secures a compromise between the
precision to be sacrificed and the effort involved in observing a sample of the given
size.

Why is standard deviation called standard error? Standard Deviation is called


standard error in order to distinguish standard deviation of sampling distribution from
that of an original frequency distribution.

Why does the standard error arise? The standard error arises due to the use of sampling (which
is based on some items of the population) as against a complete enumeration or census
enquiry (which is based on all items of the population).

Factor affecting: Standard Error depends on


1. the sample size

2. the nature of the statistic e.g. mean, variance, etc.

3. the mathematical form of the sampling distribution

4. the values of some of the parameters used in the sampling distribution.

USEFULNESS OF STANDARD ERROR

1. It is used to find confidence limits within which parameters are expected to lie.
For example, mean ± 1 S.E. will cover 68.27% of the values, mean ± 2 S.E. will cover
95.45% of the values, and mean ± 3 S.E. will cover 99.73% of the values; in general,
X̄ ± Z·S.E.(X̄) and S ± Z·S.E.(S) give the confidence limits.

2. It is used in testing a given statistical hypothesis at different levels of


significance, as illustrated in the table of critical differences given later in this section.

Notation:
The following notation is helpful, when we talk about the standard deviation and the
standard error.

Population parameter                          Sample statistic

N: Number of observations in the population   n: Number of observations in the sample

Ni: Number of observations in population i    ni: Number of observations in sample i

P: Proportion of successes in the population  p: Proportion of successes in the sample

Pi: Proportion of successes in population i   pi: Proportion of successes in sample i

μ: Mean of the population                     x: Sample estimate of the population mean

μi: Mean of population i                      xi: Sample estimate of μi

σp: Standard deviation of p                   SEp: Standard error of p

σx: Standard deviation of x                   SEx: Standard error of x


Standard Deviation of Sample Estimates

Statisticians use sample statistics to estimate population parameters. Naturally, the


value of a statistic may vary from one sample to the next.

The variability of a statistic is measured by its standard deviation. The table below
shows formulas for computing the standard deviation of statistics from simple random
samples. These formulas are valid when the population size is much larger (at least 20
times larger) than the sample size.

Statistic                                  Standard Deviation

Sample mean, x                             σx = σ / sqrt(n)

Sample proportion, p                       σp = sqrt [ P(1 - P) / n ]

Difference between means, x1 - x2          σx1-x2 = sqrt [ σ1² / n1 + σ2² / n2 ]

Difference between proportions, p1 - p2    σp1-p2 = sqrt [ P1(1-P1) / n1 + P2(1-P2) / n2 ]

Note: In order to compute the standard deviation of a sample statistic, you must know
the value of one or more population parameters.
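The first two formulas of the table can be wrapped in small Python helper functions; the population values (σ = 12, P = 0.4) and sample sizes used in the calls are hypothetical.

# Standard deviation (standard error) of the sample mean and sample proportion.
import math

def sd_sample_mean(sigma, n):
    return sigma / math.sqrt(n)            # sigma / sqrt(n)

def sd_sample_proportion(P, n):
    return math.sqrt(P * (1 - P) / n)      # sqrt[ P(1 - P) / n ]

print(round(sd_sample_mean(12, 36), 3))          # 12 / sqrt(36) = 2.0
print(round(sd_sample_proportion(0.4, 100), 3))  # sqrt(0.4 * 0.6 / 100), about 0.049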

How to Find the Critical Value

The critical value is a factor used to compute the margin of error. This section
describes how to find the critical value, when the sampling distribution of the statistic
is normal or nearly normal.

The central limit theorem states that the sampling distribution of a statistic will be
nearly normal, if the sample size is large enough. As a rough guide, many statisticians
say that a sample size of 30 is large enough when the population distribution is bell-
shaped. But if the original population is badly skewed, has multiple peaks, and/or has
outliers, researchers like the sample size to be even larger.
When the sampling distribution is nearly normal, the critical value can be expressed
as a t score or as a z score. When the sample size is smaller, the critical value should
only be expressed as a t statistic.

To find the critical value, follow these steps.

Compute alpha (α): α = 1 − (confidence level / 100)

Find the critical probability (p*): p* = 1 − α/2

To express the critical value as a z score, find the z score having a cumulative
probability equal to the critical probability (p*).

To express the critical value as a t statistic, follow these steps.

Find the degrees of freedom (DF). When estimating a mean score or a


proportion from a single sample, DF is equal to the sample size minus
one. For other applications, the degrees of freedom may be calculated
differently. We will describe those computations as they come up.

The critical t statistic (t*) is the t statistic having degrees of freedom


equal to DF and a cumulative probability equal to the critical
probability (p*).

Level of significance    Difference between observed and expected     Whether or not considered significant

5%                       If the difference is more than 1.96 S.E.     Significant

5%                       If the difference is less than 1.96 S.E.     Not Significant

1%                       If the difference is more than 2.58 S.E.     Significant

1%                       If the difference is less than 2.58 S.E.     Not Significant
Tutorial Note: In practice, usually the hypotheses are tested at 5% level of
significance. Unless otherwise stated in the examination, the author advises the
students to test the hypothesis at 5% level of significance.

3. It gives an idea about the unreliability of a sample. For example, a greater

Standard Error (S.E.) implies a greater departure of the actual frequencies from the
expected ones and hence a greater unreliability of the sample. The reciprocal of the S.E.,

i.e. 1/S.E. = sqrt(n)/σ, is the precision of x. The precision of x, which is used as an

estimate of the population mean (μ), is directly proportional to the square root of the
sample size (n). It implies that to double the precision of the estimate, the sample size
(n) should be made four times as large.

How to Construct a Confidence Interval

There are four steps to constructing a confidence interval.

Identify a sample statistic. Choose the statistic (e.g, sample mean, sample
proportion) that you will use to estimate a population parameter.

Select a confidence level. As we noted in the previous section, the confidence


level describes the uncertainty of a sampling method. Often, researchers
choose 90%, 95%, or 99% confidence levels; but any percentage can be used.

Find the margin of error. If you are working on a homework problem or a test
question, the margin of error may be given. Often, however, you will need to
compute the margin of error, based on one of the following equations.

Margin of error = Critical value * Standard deviation of statistic


Margin of error = Critical value * Standard error of statistic

For guidance, see how to compute the margin of error.

Specify the confidence interval. The uncertainty is denoted by the confidence


level. And the range of the confidence interval is defined by the following
equation.

Confidence interval = sample statistic ± Margin of error
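The four steps can be put together in a short Python sketch for the mean of a small sample. The data below are hypothetical, and the critical value is taken as a t statistic (obtained from scipy.stats) because the sample size is small.

# Constructing a 95% confidence interval for a population mean.
import math
import statistics
from scipy import stats

sample = [12.1, 11.4, 13.0, 12.6, 11.9, 12.8, 12.2, 11.7]   # hypothetical sample
n = len(sample)
xbar = statistics.mean(sample)                   # step 1: the sample statistic
confidence = 0.95                                # step 2: the confidence level
alpha = 1 - confidence
t_star = stats.t.ppf(1 - alpha / 2, df=n - 1)    # critical t value with n - 1 degrees of freedom
se = statistics.stdev(sample) / math.sqrt(n)     # standard error of the mean
margin = t_star * se                             # step 3: the margin of error
print(round(xbar - margin, 2), round(xbar + margin, 2))   # step 4: the confidence interval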

2.12 SUMMARY

The purpose of this unit is to let the students know about the various sampling
selection methods and the statistical laws, popularly known as law of statistical
regularity and law of inertia of large numbers, on the basis of which inferences about
the population from samples are being made. In the process the students would come
to know about central limit theorem, sampling distribution, calculation of standard
error, Correlation and Regression as tools to make inferences about population based
on sample data.

2.13 ANSWER TO CHECK YOUR PROGRESS

2.14 QUESTIONS AND EXERCISES

1. Problem: Which of the following statements is true?

I. When the margin of error is small, the confidence level is high.


II. When the margin of error is small, the confidence level is low.
III. A confidence interval is a type of point estimate.
IV. A population mean is an example of a point estimate.

(A) I only
(B) II only
(C) III only
(D) IV only
(E) None of the above.

2. Define correlation and its various types. How is a scatter diagram useful in
finding the relationship between two variables?

3. Write down the direct and indirect (deviation form) formulas for the calculation of
the correlation coefficient.

4. Draw a scatter diagram and indicate whether the correlation is positive or


negative: (Ans.: Negative)

X 10 20 30 40 50 60 70

Y 100 90 85 70 60 45 30

5. Calculate the coefficient of correlation from the following data:

X 1 2 3 4 5
Y 10 20 30 50 40

(Ans.: r = +0.9)

6. Three judges in a beauty contest ranked the entries as follows:

X 1 2 3 4 5

Y 5 4 3 2 1

Z 3 5 2 1 4

Which pair of judges have the nearest approach to common tastes in beauty?

(Ans.: rxy = −1, rxz = −0.2, ryz = +0.2. So, Y and Z have the nearest approach.)

7. Calculate the coefficient of correlation from the following data:

X 49 69 39 49 29

Y 59 59 59 49 39

8. The coefficient of correlation between two variables X and Y is 0.4 and their
covariance is 10. If variance of X series is 9, find the variance of Y series.

(Ans.: 69.44).

9. Calculate the coefficient of correlation, given:

a) Sum of deviations of X = 5

b) Sum of deviations of Y = 4

c) Sum of squares of deviations of X = 40

d) Sum of squares of deviations of Y = 50

e) Sum of products of deviations of X and Y = 32

f) No. of pairs of observations = 10.


(Ans.: r = +0.704).

10) The following data relate to the price and quantity supplied of a
commodity:
Price (Rs.)-X 1 2 3 4 5

Supply (Y) 10 20 30 50 40

a) Find out the two regression equations? (Ans.: X = 0.3 + 0.09Y;

Y = 3 + 9X)

b) Estimate the likely supply when price is Rs.7? (Ans.: When X=7,
Y=66).

c) What should be the price if the producer set the supply target at 80
units? (Ans.: When Y=80, X=7.5).

d) Calculate coefficient of correlation? (Ans.: r = +0.9).

11) Compute both the regression equations by using the deviation method from the
following data: (Ans.: Y = 18.04 − 1.34X)

X 2 4 5 6 8 11

Y 18 12 10 8 7 5

12) Given the following information:

X Y

Arithmetic mean 6 8

Standard deviation 5 40/3

Coefficient of correlation between X and Y = 8/15

Find: a) The regression coefficient of Y on X. (Ans.: byx = 1.422)

b) The regression equation of X on Y. (Ans.: X = 0.2Y + 4.4)

c) The most likely value of Y, when X = Rs. 100. (Ans.:When X=100, Y = 141.67).

13) Calculate the coefficient of correlation in each of the following:

a) bxy = +0.09 and byx = +9 (Ans.: r = +0.9).


19) A teacher in mathematics wishes to determine the relationship of marks on final
examination to those on two tests given during the semester. Calling X1, X2 and X3,
the marks of a student on Ist, 2nd and final examination respectively, he made the
following computations from a total of 120 students:

Mean of X1 = 6.8 Mean of X2 = 7.0 Mean of X3 = 74

S.D. of X1 = …   S.D. of X2 = …   S.D. of X3 = 9.0

r12 = .60 r13 = .70 r23 = .65

Find the least square regression equation of X3 on X1 and X2?
