Block-3 MCO-3 Unit-3
STRUCTURE
10.0 Objectives
10.1 Introduction
10.2 Correlation
10.2.1 Scatter Diagram
10.3 The Correlation Coefficient
10.3.1 Karl Pearson’s Correlation Coefficient
10.3.2 Testing for the Significance of the Correlation Coefficient
10.3.3 Spearman’s Rank Correlation
10.4 Simple Linear Regression
10.5 Estimating the Linear Regression
10.5.1 Standard Error of Estimate
10.5.2 Coefficient of Determination
10.6 Difference Between Correlation and Regression
10.7 Let Us Sum Up
10.8 Key Words
10.9 Answers to Self Assessment Exercises
10.10 Terminal Questions/Exercises
10.11 Further Reading
Appendix Tables
10.0 OBJECTIVES
After studying this unit, you should be able to:
10.1 INTRODUCTION
In the previous units, we discussed the statistical treatment of data
relating to one variable only. In many situations, however, researchers and
decision-makers need to consider the relationship between two or more
variables. For example, the sales manager of a company may observe that the
sales are not the same for each month. He/she also knows that the company’s
advertising expenditure varies from year to year. This manager would be
interested in knowing whether a relationship exists between sales and
advertising expenditure. If the manager could successfully define the
relationship, he/she might use this result to do a better job of planning and to
improve predictions of yearly sales with the help of the regression technique.
Similarly, a researcher may be interested in studying the effect of research and
development expenditure on the annual profits of a firm, the relationship that
exists between a price index and purchasing power, etc. The variables are said
to be related if a relationship exists between them.
This unit, therefore, introduces the concepts of correlation and regression and
some statistical techniques of simple correlation and regression analysis. The methods
used are important to the researcher(s) and the decision-maker(s) who need to
determine the relationship between two variables for drawing conclusions and
decision-making.
10.2 CORRELATION
If two variables, say X and Y, vary or move together in the same or in
opposite directions, they are said to be correlated or associated. Thus,
correlation refers to the relationship between the variables. Such relationships
are found between many pairs of variables: for example, between income and
expenditure, absenteeism and production, advertisement expenses and sales, etc.
The type of relationship may differ from one set of variables to another. Let us
discuss some of these relationships with the help of scatter diagrams.
[Figure 10.1: Scatter diagrams of Y against X. (a) Perfect positive correlation,
r = 1; (b) perfect negative correlation, r = –1; (c) positive correlation, r > 0;
(d) negative correlation, r < 0; (e) non-linear correlation; (f) no correlation,
r = 0.]
If the X and Y variables move in the same direction (i.e., both increase or
both decrease), the relationship between them is said to be a positive
correlation [Fig. 10.1 (a) and (c)]. On the other hand, if X and Y move in
opposite directions (i.e., X increases while Y decreases, or vice-versa), the
relationship between them is said to be a negative correlation [Fig. 10.1 (b)
and (d)]. If Y is unaffected by any change in X, the variables are said to be
uncorrelated [Fig. 10.1 (f)]. If the amount of variation in X bears a constant
ratio to the corresponding amount of variation in Y, the relationship between
them is said to be a linear correlation [Fig. 10.1 (a) to (d)]; otherwise it is
a non-linear or curvilinear correlation [Fig. 10.1 (e)]. Since measuring
non-linear correlation is far more complicated, we generally assume that the
association between two variables is of the linear type.
Illustration 1
Table 10.1 : A Company’s Advertising Expenses and Sales Data (Rs. in crore)
Years : 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
Sales (Y) 60 55 50 40 35 30 20 15 11 10
The company’s sales manager claims that the variability in sales occurs because
the marketing department constantly changes its advertisement expenditure.
He/she is quite certain that there is a relationship between sales and
advertising, but does not know what the relationship is.
The different situations shown in Figure 10.1 are all possibilities for describing
the relationship between sales and advertising expenditure for the company. To
determine the appropriate relationship, we have to construct a scatter diagram,
as shown in Figure 10.2, from the values given in Table 10.1.
[Scatter plot: Sales (Rs. Crore) on the vertical axis against Advertising
Expenditure (Rs. Crore) on the horizontal axis.]
Figure 10.2 : Scatter Diagram of Sales and Advertising Expenditure for a Company.
Figure 10.2 indicates that advertising expenditure and sales seem to be linearly
(positively) related. However, the strength of this relationship is not yet
known; that is, how closely the points fall along a straight line is yet to be
determined. The quantitative measure of the strength of the linear relationship
between two variables (here sales and advertising expenditure) is called the
correlation coefficient. In the next section, therefore, we shall study the
methods for determining the coefficient of correlation.
2) How does a scatter diagram approach help in studying the correlation between
two variables?
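The simplified formulae (which are algebraically equivalent to the above
formula) are:

1) $$ r = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}}, \quad \text{where } x = X - \bar{X},\; y = Y - \bar{Y} $$

2) $$ r = \frac{\sum XY - \dfrac{\sum X \sum Y}{n}}{\sqrt{\left[\sum X^2 - \dfrac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right]}} $$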
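The two simplified formulae can be checked against each other numerically. The following is a minimal sketch, not a prescribed procedure: it borrows the advertising and sales figures of Illustration 6 from later in this unit, and the function names `pearson_deviation` and `pearson_raw` are illustrative choices.

```python
from math import sqrt

# Advertising (X) and sales (Y) figures, taken from Illustration 6 of this unit.
X = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
Y = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]

def pearson_deviation(xs, ys):
    """Formula 1: deviations x = X - mean(X), y = Y - mean(Y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [a - mean_x for a in xs]
    dy = [b - mean_y for b in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def pearson_raw(xs, ys):
    """Formula 2: raw sums only, no deviations from the means."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(a * b for a, b in zip(xs, ys))
    sxx = sum(a * a for a in xs)
    syy = sum(b * b for b in ys)
    return (sxy - sx * sy / n) / sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))

r1 = pearson_deviation(X, Y)
r2 = pearson_raw(X, Y)
print(round(r1, 3), round(r2, 3))
```

Both functions return about 0.734 for these data, the value of r quoted for Illustration 6 in the discussion of the coefficient of determination.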
i) ‘r’ is a dimensionless number whose numerical value lies between –1 and +1. The
value +1 represents a perfect positive correlation, while the value –1 represents
a perfect negative correlation. The value 0 (zero) represents lack of correlation.
Figure 10.1 shows a number of scatter plots with corresponding values for
correlation coefficient.
ii) The coefficient of correlation is a pure number and is independent of the units of
measurement of the variables.
iii) The correlation coefficient is independent of any change in the origin and scale
of X and Y values.
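Property (iii) can be illustrated with a short numerical check. This is a sketch under stated assumptions: the data are the advertising and sales figures of Illustration 6, the shifts (2 and 30) and scale factors (0.2 and 2) are arbitrary values chosen for illustration, and the scale factors are taken to be positive (a negative factor would reverse the sign of r).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(a * b for a, b in zip(xs, ys))
    sxx = sum(a * a for a in xs)
    syy = sum(b * b for b in ys)
    return (sxy - sx * sy / n) / sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))

# Data from Illustration 6 of this unit.
X = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
Y = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]

# Hypothetical change of origin and scale: U = (X - 2) / 0.2, V = (Y - 30) / 2.
U = [(x - 2) / 0.2 for x in X]
V = [(y - 30) / 2 for y in Y]

r_xy = pearson_r(X, Y)
r_uv = pearson_r(U, V)
print(round(r_xy, 4), round(r_uv, 4))
```

The two printed values agree, which is why the data may be coded (shifted and rescaled) to simplify hand calculation without affecting r.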
Remark: Care should be taken when interpreting the correlation results.
Although a change in advertising may, in fact, cause sales to change, the fact
that the two variables are correlated does not guarantee a cause and effect
relationship. Two seemingly unconnected variables may often be highly
correlated. For example, we may observe a high degree of correlation: (i)
between the height and the income of individuals or (ii) between the size of the
shoes and the marks secured by a group of persons, even though it is not
possible to conceive of them as causally related. When correlation exists
between two such seemingly unrelated variables, it is called spurious or
nonsense correlation. Therefore, we must avoid basing conclusions on spurious
correlation.
Illustration 2
We know that
$$ r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \dfrac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right]}} $$
You may notice that manual calculations become cumbersome in real-life
research work. Therefore, statistical packages such as Minitab, SPSS, SAS, etc.,
may be used to calculate ‘r’ and other statistical measures as well.
Once the coefficient of correlation has been obtained from sample data, one is
normally interested in asking: Is there an association between the two
variables? Or, with what confidence can we make a statement about the
association between the two variables? Such questions are best answered
statistically by using the following procedure.
The null hypothesis that the population correlation coefficient equals zero
(the variables in the population are uncorrelated) is tested against the
alternative hypothesis that it does not equal zero by using the t-statistic
(hypothesis testing and the t-test are discussed in detail in Units 15 and 16 of
this course):
$$ t = r \sqrt{\frac{n-2}{1-r^2}} $$
where r is the correlation coefficient from the sample and n is the number of
pairs of observations.
Referring to the table of the t-distribution for (n–2) degrees of freedom, we can
find the critical value for t at any desired level of significance (the 5% level
of significance is commonly used). If the absolute value of the calculated t (as
obtained by the above formula) is less than or equal to the table value of t, we
accept the null hypothesis (H0), meaning that the correlation between the two
variables is not significantly different from zero.
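The test procedure can be sketched in a few lines of code. This is an illustrative sketch only: the function names are hypothetical, and the critical values (2.228 for 10 degrees of freedom, 1.99 for 98) are the table values quoted in Illustrations 3 and 4 below rather than values computed by the code.

```python
from math import sqrt

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)) for a sample correlation r from n pairs."""
    return r * sqrt((n - 2) / (1 - r ** 2))

def is_significant(r, n, t_critical):
    """Reject H0 (no correlation) when |t| exceeds the table value."""
    return abs(t_statistic(r, n)) > t_critical

# Same r = 0.55, two different sample sizes (Illustrations 3 and 4).
t_small = t_statistic(0.55, 12)
t_large = t_statistic(0.55, 100)
print(round(t_small, 2), is_significant(0.55, 12, 2.228))
print(round(t_large, 2), is_significant(0.55, 100, 1.99))
```

The same value of r is not significant for n = 12 but is significant for n = 100, which shows how strongly the conclusion depends on the sample size.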
Illustration 3
Solution: Let us take the null hypothesis (H0) that the variables in the
population are uncorrelated.
Applying t-test,
$$ t = r \sqrt{\frac{n-2}{1-r^2}} = 0.55 \sqrt{\frac{12-2}{1-0.55^2}} = 2.08 $$
From the t-distribution (refer to the table given at the end of this unit) with
10 degrees of freedom and a 5% level of significance, the table value of
t0.05/2, (12–2) is 2.228. The calculated value of t is less than the table value
of t. Therefore, we can conclude that this r of 0.55 for n = 12 is not
significantly different from zero. Hence our hypothesis (H0) holds true, i.e.,
the variables in the population are uncorrelated.
Let us take another illustration to test the significance.
Illustration 4
Solution: Let us take the hypothesis that the variables in the population are
uncorrelated. Apply the t-test:
$$ t = r \sqrt{\frac{n-2}{1-r^2}} = 0.55 \sqrt{\frac{100-2}{1-0.55^2}} = 6.52 $$
Referring to the table of the t-distribution for n–2 = 98 degrees of freedom, the
critical value for t at a 5% level of significance [t0.05/2, (100–2)] = 1.99
(approximately). Since the calculated value of t (6.52) exceeds the table value
of t (1.99), we can conclude that there is a statistically significant
association between the variables. Hence, our hypothesis does not hold true.
$$ R = 1 - \frac{6 \sum d^2}{N^3 - N} $$
where N = number of pairs of ranks, and $\sum d^2$ = sum of the squares of the
differences between the ranks of the two variables.
Illustration 5
Salesmen employed by a company were given one month training. At the end
of the training, they conducted a test on 10 salesmen on a sample basis who
were ranked on the basis of their performance in the test. They were then
posted to their respective areas. After six months, they were rated in terms of
their sales performance. Find the degree of association between them.
Salesmen:                        1   2   3   4   5   6   7   8   9   10
Ranks in training (X):           7   1   10  5   6   8   9   2   3   4
Ranks on sales performance (Y):  6   3   9   4   8   10  7   2   1   5
Solution:
Table 10.3: Calculation of Coefficient of Rank Correlation
Salesmen   Ranks Secured     Ranks Secured    Difference in Ranks   D²
           in Training (X)   on Sales (Y)     D = (X–Y)
1          7                 6                 1                    1
2          1                 3                –2                    4
3          10                9                 1                    1
4          5                 4                 1                    1
5          6                 8                –2                    4
6          8                 10               –2                    4
7          9                 7                 2                    4
8          2                 2                 0                    0
9          3                 1                 2                    4
10         4                 5                –1                    1
                                                                    ΣD² = 24
$$ R = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6 \times 24}{10^3 - 10} = 1 - \frac{144}{990} = 0.855 $$
Since R = 0.855 is close to +1, we can say that there is a high degree of
positive correlation between the training and sales performance of the salesmen.
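The arithmetic of Table 10.3 can be reproduced with a short sketch. The function name `spearman_from_ranks` is an illustrative choice, and the code assumes that the inputs are already ranks with no ties.

```python
# Ranks from Illustration 5 (salesmen 1 to 10).
training = [7, 1, 10, 5, 6, 8, 9, 2, 3, 4]
sales = [6, 3, 9, 4, 8, 10, 7, 2, 1, 5]

def spearman_from_ranks(rank_x, rank_y):
    """Return (sum of squared rank differences, R = 1 - 6*sum(d^2) / (N^3 - N))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return sum_d2, 1 - 6 * sum_d2 / (n ** 3 - n)

sum_d2, R = spearman_from_ranks(training, sales)
print(sum_d2, round(R, 3))
```

The sum of squared differences comes to 24 and R to about 0.855, in agreement with the table.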
Now we proceed to test the significance of the result obtained. We are
interested in testing the null hypothesis (H0) that the two sets of ranks are not
associated in the population and that the observed value of R differs from zero
only by chance. The test statistic used is the t-statistic:
$$ t = R \sqrt{\frac{n-2}{1-R^2}} = 0.855 \sqrt{\frac{10-2}{1-0.855^2}} = 4.66 $$
Referring to the t-distribution table for 8 d.f. (n–2), the critical value for t
at a 5% level of significance [t0.05/2, (10–2)] is 2.306. The calculated value of
t is greater than the table value. Hence, we reject the null hypothesis and
conclude that performance in training and performance on sales are closely
associated.
Tied Ranks
Sometimes there is a tie between two or more ranks in the first and/or second
series. For example, if two items share the 4th rank, then instead of awarding
the 4th rank to both observations, we award each of them the average rank
(4+5)/2 = 4.5, so that the mean of the ranks is unaffected. In such cases, an
adjustment is made in Spearman’s formula: $\sum d^2$ is increased by
$(t^3 - t)/12$ for each tie, where t stands for the number of observations in
each tie. The formula can thus be expressed as:
$$ R = 1 - \frac{6\left[\sum d^2 + \dfrac{t^3 - t}{12} + \dfrac{t^3 - t}{12} + \dots\right]}{N^3 - N} $$
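The tie adjustment can be sketched as follows. The data are hypothetical values invented purely for illustration, the function names are not from this unit, and ranks are assigned in ascending order (rank 1 to the smallest value), which is an assumption; ranking in descending order would work the same way.

```python
from collections import Counter

def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of (t^3 - t) / 12 over every group of t tied observations."""
    return sum((t ** 3 - t) / 12 for t in Counter(values).values() if t > 1)

def spearman_tied(x, y):
    """Spearman's R with the tie adjustment added to the sum of squared differences."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = sum_d2 + tie_correction(x) + tie_correction(y)
    return 1 - 6 * adjusted / (n ** 3 - n)

# Hypothetical scores: the two 20s tie for ranks 2 and 3, so each gets 2.5.
x_scores = [10, 20, 20, 30]
y_scores = [1, 2, 3, 4]
ranks_x = average_ranks(x_scores)
R_tied = spearman_tied(x_scores, y_scores)
print(ranks_x, round(R_tied, 3))
```

Here the sum of squared differences is 0.5, the single tie of two observations adds (2³ − 2)/12 = 0.5, and R = 1 − 6(1.0)/60 = 0.9.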
1) Compute the degree of relationship between price of share (X) and price of
debentures over a period of 8 years by using Karl Pearson’s formula and test
the significance (5% level) of the association. Comment on the result.
Price of shares:      42  43  41  53   54  49  41  55
Price of debentures:  98  99  98  102  97  93  95  94
2) Consider the above exercise and assign the ranks to price of shares and price
of debentures. Find the degree of association by applying Spearman’s formula
and test its significance.
10.4 SIMPLE LINEAR REGRESSION
Once we have established that a correlation exists between two variables, we
can develop an estimating equation, known as the regression equation or
estimating line, i.e., a formula which helps us to estimate or predict the
unknown value of one variable from the known value of another variable. In the
words of Ya-Lun Chou, “regression analysis attempts to establish the nature of
the relationship between variables, that is, to study the functional relationship
between the variables and thereby provide a mechanism for prediction, or
forecasting.” For example, if we have confirmed that advertisement expenditure
(independent variable) and sales (dependent variable) are correlated, we can
predict the required amount of advertising expenses for a given amount of sales,
or vice-versa. Thus, the statistical method which is used for prediction is
called regression analysis. When there is a single independent variable and the
relationship between the variables is linear, the technique is called simple
linear regression.
Hence, the technique of regression goes one step beyond correlation: it uses
relationships that have held true in the past as a guide to what may happen in
the future. To do this, we need the regression equation and the correlation
coefficient. The latter is used to determine whether the variables are really
moving together.
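$$ Y_i = \beta_0 + \beta_1 X_i + e_i $$
wherein
Yi = value of the dependent variable for the i-th observation,
Xi = value of the independent variable for the i-th observation,
β0 = Y-intercept,
β1 = slope of the regression line,
ei = error term (i.e., the difference between the actual Y value and the value
of Y predicted by the model).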
i) Regression of Y on X
ii) Regression of X on Y.
When we draw regression lines through a scatter diagram such as those shown
earlier in Fig. 10.1, we may get an infinite number of possible regression
lines for a set of data points. We must, therefore, establish a criterion for
selecting the best line. The criterion used is the Least Squares Method.
According to the least squares criterion, the best regression line is the one that
minimizes the sum of squared vertical distances between the observed (X, Y)
points and the regression line, i.e., $\sum (Y - \hat{Y})^2$ takes its least
value, and for which the sum of the positive and negative deviations is zero,
i.e., $\sum (Y - \hat{Y}) = 0$. It is important to note that the vertical
distance between an (X, Y) point and the regression line is called the ‘error’.
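Both parts of the least squares criterion can be checked numerically. The sketch below is illustrative only: it fits a line to the advertising and sales data of Illustration 6, confirms that the errors sum to zero, and shows that nudging the intercept or slope (by an arbitrary 0.5) only increases the sum of squared errors. The function names are hypothetical.

```python
# Data from Illustration 6 of this unit.
X = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
Y = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]

def fit_line(xs, ys):
    """Least squares intercept a and slope b for the line Y-hat = a + b*X."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    b = (sum(x * y for x, y in zip(xs, ys)) - sx * sy / n) / (
        sum(x * x for x in xs) - sx ** 2 / n
    )
    a = sy / n - b * sx / n
    return a, b

def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = fit_line(X, Y)
residual_sum = sum(y - (a + b * x) for x, y in zip(X, Y))
best = sum_squared_errors(a, b, X, Y)

# Any other line, e.g. one with a shifted intercept or slope, has a larger SSE.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
nudged = [sum_squared_errors(a + da, b + db, X, Y) for da, db in nudges]
print(round(residual_sum, 9), round(best, 2), [round(v, 2) for v in nudged])
```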
Regression Equations
As we discussed above, there are two regression equations, also called
estimating equations, for the two regression lines (Y on X, and X on Y). These
equations are algebraic expressions of the regression lines, expressed as
follows:
Regression Equation of Y on X
$$ \hat{Y} = a + bX $$
$$ \hat{Y} - \bar{Y} = b_{yx} (X - \bar{X}) $$
$$ b_{yx} = r \frac{\sigma_y}{\sigma_x} = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sum X^2 - \dfrac{(\sum X)^2}{N}} $$
Regression Equation of X on Y
$$ \hat{X} = a + bY $$
$$ \hat{X} - \bar{X} = b_{xy} (Y - \bar{Y}) $$
$$ b_{xy} = r \frac{\sigma_x}{\sigma_y} = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sum Y^2 - \dfrac{(\sum Y)^2}{N}} $$
It is worthwhile to note that the estimated simple regression line always passes
through the point of means $(\bar{X}, \bar{Y})$ (which is shown in Figure 10.3).
The following illustration shows how the estimated regression equations are
obtained, and hence how they are used to estimate the value of Y for a given X
value.
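The two regression equations can be computed directly from the raw sums. The following is a minimal sketch using the data of Illustration 6 below; the function name `regression_lines` and the returned tuple layout are illustrative choices.

```python
# Data from Illustration 6 of this unit (Rs. in lakh).
X = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
Y = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]

def regression_lines(xs, ys):
    """Return ((a, b_yx), (a', b_xy)) for Y-hat = a + b_yx*X and X-hat = a' + b_xy*Y."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sx * sy / n
    s_xx = sum(x * x for x in xs) - sx ** 2 / n
    s_yy = sum(y * y for y in ys) - sy ** 2 / n
    x_bar, y_bar = sx / n, sy / n
    b_yx = s_xy / s_xx
    b_xy = s_xy / s_yy
    # Each line passes through the point of means, which fixes its intercept.
    return (y_bar - b_yx * x_bar, b_yx), (x_bar - b_xy * y_bar, b_xy)

(a_yx, b_yx), (a_xy, b_xy) = regression_lines(X, Y)
x_bar, y_bar = sum(X) / len(X), sum(Y) / len(Y)
print(round(b_yx, 3), round(a_yx, 3))
print(round(b_xy, 3), round(a_xy, 3))
```

The slopes agree with the worked values (5.801 and 0.093). The intercepts differ slightly in the third decimal place from the worked illustration (16.146 and –0.243) because the hand calculation rounds the slope and the mean of X before multiplying.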
Illustration 6
(Rs. in lakh)
Advertisement Expenditure:  0.8  1.0  1.6  2.0  2.2  2.6  3.0  3.0  4.0  4.0  4.0  4.6
Sales:                      22   28   22   26   34   18   30   38   30   40   50   46
Solution:
(Rs. in lakh)
Advertising (X)   Sales (Y)   X²       Y²       XY
0.8               22          0.64     484      17.6
1.0               28          1.00     784      28.0
1.6               22          2.56     484      35.2
2.0               26          4.00     676      52.0
2.2               34          4.84     1,156    74.8
2.6               18          6.76     324      46.8
3.0               30          9.00     900      90.0
3.0               38          9.00     1,444    114.0
4.0               30          16.00    900      120.0
4.0               40          16.00    1,600    160.0
4.0               50          16.00    2,500    200.0
4.6               46          21.16    2,116    211.6
ΣX = 32.8         ΣY = 384    ΣX² = 106.96   ΣY² = 13,368   ΣXY = 1,150.0
Now we establish the best regression line (estimated by the least squares
method).
$$ \hat{Y} - \bar{Y} = b_{yx} (X - \bar{X}) $$
$$ \bar{Y} = \frac{384}{12} = 32; \quad \bar{X} = \frac{32.8}{12} = 2.733 $$
$$ b_{yx} = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sum X^2 - \dfrac{(\sum X)^2}{N}} = \frac{1{,}150 - \dfrac{(32.8)(384)}{12}}{106.96 - \dfrac{(32.8)^2}{12}} = 5.801 $$
$$ \hat{Y} - 32 = 5.801 (X - 2.733) $$
$$ \hat{Y} = 5.801X - 15.854 + 32 = 5.801X + 16.146 $$
or
$$ \hat{Y} = 16.146 + 5.801X $$
which is shown in Figure 10.3. Note that, as said earlier, this line passes
through the means $\bar{X}$ (2.733) and $\bar{Y}$ (32).
[Figure 10.3 plots Sales (Rs. lakh) against Advertising (Rs. lakh), marking the
observed points used to fit the estimating line, the points on the estimating
line, positive and negative errors, and the means X̄ = 2.73 and Ȳ = 32.]
Figure 10.3: Least Squares Regression Line of a Company’s Advertising Expenditure
and Sales.
Ŷ = 16.146 + 5.801X
wherein Ŷ = estimated sales for given value of X, and
X = level of advertising expenditure.
To find Ŷ, the estimate of expected sales, we substitute the specified
advertising level into the regression model. For example, if we know that the
company’s marketing department has decided to spend Rs. 2,50,000/- (X = 2.5)
on advertisement during the next quarter, the most likely estimate of sales (Ŷ)
is:
Ŷ = 16.146 + 5.801 (2.5) = 30.6485 (Rs. in lakh)
= Rs. 30,64,850
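Regression Equation of X on Y
$$ \hat{X} - \bar{X} = b_{xy} (Y - \bar{Y}) $$
$$ b_{xy} = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sum Y^2 - \dfrac{(\sum Y)^2}{N}} = \frac{1{,}150 - \dfrac{(32.8)(384)}{12}}{13{,}368 - \dfrac{(384)^2}{12}} = 0.093 $$
$$ \hat{X} - 2.733 = 0.093 (Y - 32) $$
$$ \hat{X} = -0.243 + 0.093Y $$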
The following points about the regression should be noted:
1) The geometric mean of the two regression coefficients (byx and bxy) gives the
coefficient of correlation, i.e., r = ±√(byx × bxy), where r takes the same sign
as the regression coefficients.
2) Both the regression coefficients will always have the same sign (+ or –).
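The two points above can be combined in a short sketch. The function name `r_from_coefficients` is hypothetical, the inputs are the rounded regression coefficients from Illustration 6, and the second call uses made-up negative coefficients simply to show how the sign carries over.

```python
from math import sqrt, copysign

def r_from_coefficients(b_yx, b_xy):
    """Geometric mean of the regression coefficients, carrying their common sign."""
    if b_yx * b_xy < 0:
        raise ValueError("regression coefficients must have the same sign")
    return copysign(sqrt(b_yx * b_xy), b_yx)

# Rounded coefficients from Illustration 6.
r_illustration = r_from_coefficients(5.801, 0.093)

# Hypothetical negative coefficients: r comes out negative as well.
r_negative = r_from_coefficients(-2.0, -0.32)
print(round(r_illustration, 2), round(r_negative, 2))
```

With the rounded coefficients the result is close to the value r = 0.734 used later in this unit; small differences in the third decimal place arise from rounding the coefficients first.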
10.5.1 Standard Error of Estimate
Once the line of best fit is drawn, the next step in the study of regression
analysis is to measure the reliability of the estimated regression equation.
Statisticians have developed a technique to measure the reliability of the
estimated regression equation called “Standard Error of Estimate (Se).” This Se
is similar to the standard deviation which we discussed in Unit-9 of this course.
We will recall that the standard deviation is used to measure the variability of a
distribution about its mean. Similarly, the standard error of estimate
measures the variability, or spread, of the observed values around the
regression line. We would say that both are measures of variability. The
larger the value of Se, the greater the spread of data points around the
regression line. If Se is zero, then all data points would lie exactly on the
regression line. In that case the estimated equation is said to be a perfect
estimator. The formula to measure Se is expressed as:
$$ S_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n}} $$
Illustration 7
R&D (Rs. lakh): 2.5 3.0 4.2 3.0 5.0 7.8 6.5
Solution: To calculate Se for this problem, we must first obtain the value of
$\sum (Y - \hat{Y})^2$. We have done this in Table 10.5.
Table 10.5: Calculation of Σ(Y − Ŷ)² (Rs. in lakh)
$$ \sum (Y - \hat{Y})^2 = 24.62 $$
We can now find the standard error of estimate as follows.
$$ S_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n}} = \sqrt{\frac{24.62}{7}} = 1.875 $$
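The calculation of Se can be sketched in code. This is an illustration under stated assumptions: the function name `standard_error` is hypothetical, the divisor is n as in the formula of this unit (some textbooks divide by n − 2 instead), and the data and fitted line Ŷ = 16.146 + 5.801X are those of Illustration 6.

```python
from math import sqrt

def standard_error(actual, predicted):
    """S_e = sqrt(sum((Y - Y-hat)^2) / n), dividing by n as in this unit."""
    n = len(actual)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return sqrt(sse / n)

# Check of the arithmetic in Illustration 7: sum of squared errors 24.62, n = 7.
se_illustration_7 = sqrt(24.62 / 7)

# Illustration 6 data with the fitted line Y-hat = 16.146 + 5.801 X.
X = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
Y = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]
Y_hat = [16.146 + 5.801 * x for x in X]
se_illustration_6 = standard_error(Y, Y_hat)
print(round(se_illustration_7, 3), round(se_illustration_6, 2))
```

For the advertising data the standard error comes to roughly Rs. 6.44 lakh, which gives a sense of how far actual sales typically lie from the estimating line.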
$$ R^2 = \frac{\text{Explained variation}}{\text{Total variation}} \quad \text{or} \quad R^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} $$
$$ \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} $$
$$ R^2 = r^2 $$
Refer to Illustration 6, where we have computed ‘r’ with the help of regression
coefficients (bxy and byx), as an example for R2
r = 0.734
$$ R^2 = r^2 = 0.734^2 = 0.5388 $$
This means that 53.88 per cent of the variation in the sales (Y) can be
explained by the level of advertising expenditure (X) for the company.
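The identity R² = r² for simple linear regression can be verified numerically. The sketch below is illustrative only: it computes R² as 1 minus the ratio of unexplained to total variation for the Illustration 6 data and compares it with the square of Pearson's r. The variable names are hypothetical.

```python
# Data from Illustration 6 of this unit.
X = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
Y = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
total_variation = sum((y - y_bar) ** 2 for y in Y)

# Least squares line of Y on X.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Unexplained variation is the sum of squared errors about the fitted line.
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
r_squared = 1 - unexplained / total_variation

# Square of Pearson's r from the same sums.
r_sq_from_r = s_xy ** 2 / (s_xx * total_variation)
print(round(r_squared, 4), round(r_sq_from_r, 4))
```

Both routes give about 0.539. The figure of 0.5388 quoted above is slightly lower only because r was rounded to 0.734 before squaring.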
You are given the following data relating to age of Autos and their maintenance
costs. Obtain the two regression equations by the method of least squares and
estimate the likely maintenance cost when the age of Auto is 5 years and also
compute the standard error of estimate.
Least Squares Criterion: The criterion for determining a regression line that
minimizes the sum of squared errors.
2. R = – 0.185
t = – 1.149
C) Y on X : Ŷ = 5 + 3.25x
X on Y : X̂ = – 3 + 0.297y
Students:            A   B   C   D   E   F   G   H   I   J
Rank by 1st judge:   5   2   4   1   8   9   7   6   3   10
Rank by 2nd judge:   1   9   7   8   10  2   4   5   3   6
Find out whether the judges are in agreement with each other or not and apply
the t-test for significance at 5% level.
9) A sales manager of a soft drink company is studying the effect of its latest
advertising campaign. People chosen at random were called and asked how
many bottles they had bought in the past week and how many advertisements
of this product they had seen in the past week.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
10.11 FURTHER READING
A number of good textbooks are available on the topics dealt with in this unit.
The following books may be used for more in-depth study.
Levin, Richard I. and David S. Rubin, 1996, Statistics for Management,
Prentice Hall of India Pvt. Ltd., New Delhi.
Peters, W.S. and G.W. Summers, 1968, Statistical Analysis for Business
Decisions, Prentice Hall, Englewood Cliffs.
Hooda, R.P., 2000, Statistics for Business and Economics, MacMillan India
Ltd., New Delhi.
Gupta, S.P., 1989, Elementary Statistical Methods, Sultan Chand & Sons,
New Delhi.
Chandan, J.S., Statistics for Business and Economics, Vikas Publishing
House Pvt. Ltd., New Delhi.
APPENDIX : TABLE OF t-DISTRIBUTION AREA
The table gives points of the t-distribution corresponding to the degrees of
freedom and the upper tail area (suitable for use in a one-tail test).
Values of tα, m