Correlation
Bivariate data
Bivariate data is data for which there are two variables for each observation. As an
example, the following bivariate data show the ages of husbands and wives of 10 married
couples.
Husband 36 72 37 36 51 50 47 50 37 41
Wife 35 67 33 35 50 46 47 42 36 41
Correlation
Correlation is a statistical technique which measures and analyses the degree or extent to
which two or more variables fluctuate with reference to one another.
Correlation thus denotes the interdependence amongst variates. The degree is
expressed by a coefficient which ranges between $-1$ and $+1$. The direction of change is
indicated by $+$ or $-$ signs.
Correlation thus expresses the relationship through a relative measure of change and it
has nothing to do with the units in which the variables are expressed.
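The unit-free property can be checked numerically. The sketch below is only an illustration: it assumes Pearson's coefficient as the measure of correlation, uses a hypothetical helper named `pearson_r`, and re-expresses the husbands' ages from the bivariate data above in months instead of years.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Ages of the 10 husbands and wives, in years.
husband_years = [36, 72, 37, 36, 51, 50, 47, 50, 37, 41]
wife_years = [35, 67, 33, 35, 50, 46, 47, 42, 36, 41]

# The same husbands' ages in months: only the unit changes.
husband_months = [12 * age for age in husband_years]

r_years = pearson_r(husband_years, wife_years)
r_mixed = pearson_r(husband_months, wife_years)

# The two coefficients agree up to floating-point rounding.
print(round(r_years, 6) == round(r_mixed, 6))
```

Changing the unit multiplies the numerator and the denominator by the same factor, so the ratio is unchanged.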
Uses
The concepts of regression are also based upon the measure of correlation.
Scatter Diagram
A scatter plot of two variables shows the values of one variable on the
$x$-axis and the values of the other variable on the $y$-axis. Scatter plots are well suited
for revealing the relationship between two variables.
[Figure: Scatter Diagram, y (vertical axis) against x (horizontal axis)]
Types of Correlation
If two variables change in the same direction (i.e. if one increases the other also increases,
or if one decreases the other also decreases), then this is called a positive correlation. For
example:
Positive Correlation      Positive Correlation
  X     Y                   X     Y
 10    15                  80    50
 12    20                  70    45
 14    22                  60    30
 18    25                  40    20
 20    37                  30    10
If two variables change in opposite directions (i.e. if one increases, the other decreases
and vice versa), then the correlation is called a negative correlation. For example, T.V.
registrations and cinema attendance.
Negative Correlation      Negative Correlation
  X     Y                   X     Y
 20    40                 100    10
 30    30                  90    20
 40    22                  60    30
 60    15                  40    40
 80    12                  30    50
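The direction of each of the four small tables in this section can be checked with a short sketch. It assumes the sign of the sum of products of deviations from the means as the indicator of direction; the helper names `co_deviation` and `direction` are hypothetical.

```python
def co_deviation(xs, ys):
    """Sum of products of deviations from the means; its sign gives the direction."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

def direction(xs, ys):
    """Label a pair of series as positively or negatively correlated."""
    s = co_deviation(xs, ys)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "none"

# The two tables given as positive correlation.
positive_tables = [
    ([10, 12, 14, 18, 20], [15, 20, 22, 25, 37]),
    ([80, 70, 60, 40, 30], [50, 45, 30, 20, 10]),
]
# The two tables given as negative correlation.
negative_tables = [
    ([20, 30, 40, 60, 80], [40, 30, 22, 15, 12]),
    ([100, 90, 60, 40, 30], [10, 20, 30, 40, 50]),
]

labels = [direction(xs, ys) for xs, ys in positive_tables + negative_tables]
print(labels)  # ['positive', 'positive', 'negative', 'negative']
```

Note that the second positive table has both series falling together, which is still a positive correlation.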
In multiple correlation three or more variables are studied simultaneously. For example,
when we study the relationship between the yield of rice per acre and both the amount of
rainfall and the amount of fertilizers used, it is a problem of multiple correlation. Similarly,
the relationship of plastic hardness, temperature and pressure is multivariate.
In partial correlation we recognize more than two variables, but consider only two
variables to be influencing each other, the effect of the other influencing variables being
kept constant. For example, in the rice problem taken above, if we limit our correlation
analysis of yield and rainfall to periods when a certain average daily temperature existed,
it becomes a problem of partial correlation.
The distinction between linear and non-linear correlation is based upon the constancy of
the ratio of change between the variables. If the amount of change in one variable tends
to bear a constant ratio to the amount of change in the other variable then the correlation
is said to be linear. For example, observe the following two variables X and Y:
X: 10 20 30 40 50
Y: 70 140 210 280 350
It is clear that the ratio of change between the two variables is the same. If such variables
are plotted on a graph paper all the plotted points would fall on a straight line.
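The constancy of the ratio of change for these two series can be verified directly. This is a minimal sketch using the X and Y values listed above; the variable names are illustrative.

```python
xs = [10, 20, 30, 40, 50]
ys = [70, 140, 210, 280, 350]

# Ratio of the change in Y to the change in X between consecutive observations.
pairs = list(zip(xs, ys))
ratios = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pairs, pairs[1:])]

# The relationship is linear when every consecutive ratio is the same.
is_linear = all(abs(r - ratios[0]) < 1e-12 for r in ratios)

print(ratios)     # [7.0, 7.0, 7.0, 7.0]
print(is_linear)  # True
```

Every step gives the same ratio of 7, which is why the plotted points fall on one straight line.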
[Figure: Scatter Diagram, y against x]
[Figure: Scatter Diagram, y against x]
Degrees of Correlation
Through the coefficient of correlation, we can measure the degree or extent of the
correlation between two variables. On the basis of the coefficient of correlation we can
also determine whether the correlation is positive or negative and also its degree or
extent.
Perfect correlation: If two variables change in the same direction and in the
same proportion, the correlation between the two is perfect positive. According
to Karl Pearson the coefficient of correlation in this case is $+1$. On the other hand,
if the variables change in the opposite direction and in the same proportion, the
correlation is perfect negative. Its coefficient of correlation is $-1$. In practice we
rarely come across these types of correlations.
High degree, moderate degree and low degree are the three categories of this kind of
correlation. The following table shows the degree of correlation indicated by the
coefficient of correlation.
Degree of correlation     Positive    Negative
Absence of correlation    Zero        0
Perfect correlation       +1          -1
Scatter Plot.
In this method the values of the two variables are plotted on graph paper. One is taken
along the horizontal ($x$-axis) and the other along the vertical ($y$-axis). By plotting the
data, we get points (dots) on the graph which are generally scattered, hence the name
‘Scatter Plot’.
The manner in which these points are scattered suggests the degree and the direction of
correlation. The degree of correlation is denoted by ‘$r$’ and its direction is given by the
signs positive and negative.
If all points lie on a rising straight line the correlation is perfectly positive and
$r = +1$.
[Figure: Scatter Diagram, y against x]
If all points lie on a falling straight line the correlation is perfectly negative and
$r = -1$.
[Figure: Scatter Diagram, y against x]
If the points are spread widely over a broad strip, rising upwards,
the correlation is low degree positive.
[Figure: Scatter Diagram, y against x]
Though this method is simple and gives a rough idea about the existence
and the degree of correlation, it is not reliable. As it is not a
mathematical method, it cannot measure the degree of correlation exactly.
Merits
Making a scatter diagram is usually the first step in investigating the relationship
between the variables.
Limitations
By applying this method we can get an idea about the direction of correlation and also
whether it is high or low. But we cannot establish the exact degree of correlation between
the variables as is possible by applying the mathematical method.
2) Do you think that there is any correlation between profits and capital employed?
Is it positive? Is it high or low?
[Figure: scatter diagram of Profit (lakhs of Rs.) on the vertical axis against Capital Employed on the horizontal axis]
By looking at the scatter diagram we can say that the variables profits and capital
employed are correlated. Further, the correlation is positive because the trend of the points
is upward, rising from the lower left-hand corner to the upper right-hand corner of the
diagram.
The diagram also indicates that the degree of relationship is high, because the plotted
points lie in a narrow band, which shows that it is a case of a high degree of positive
correlation.
The value of the coefficient of correlation as obtained by the above formula shall always
lie between $-1$ and $+1$.
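The formula referred to here is not legible in this copy. The sums quoted in the worked examples ($\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$, $\sum xy$) are consistent with the usual computational form of Karl Pearson's coefficient, so that form is shown here as an assumption, with $n$ the number of paired observations:

```latex
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left[\,n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2} - \left(\sum y\right)^{2}\right]}},
\qquad -1 \le r \le +1
```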
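Example 1:
Height of father (cm): 165 166 167 168 167 169 170 172
Height of son (cm):    167 168 165 172 168 172 169 171
Solution:
$\sum x^2 = 225828$, $\sum x = 1344$
$\sum y^2 = 228532$, $\sum y = 1352$, $\sum xy = 227160$
$r = \frac{8(227160) - (1344)(1352)}{\sqrt{[8(225828) - 1344^2][8(228532) - 1352^2]}} = \frac{192}{\sqrt{288 \times 352}} = 0.603022689 \approx 0.603$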
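The arithmetic of this solution can be reproduced with a short sketch. It assumes the heights read as listed in the code (fathers as x, sons as y, eight pairs), a reading that reproduces the sums quoted in the solution, and it assumes the usual computational form of Pearson's coefficient.

```python
from math import sqrt

# Heights in cm; fathers are x and sons are y (assumed reading of the table).
father = [165, 166, 167, 168, 167, 169, 170, 172]
son = [167, 168, 165, 172, 168, 172, 169, 171]

n = len(father)
sum_x = sum(father)
sum_y = sum(son)
sum_x2 = sum(x * x for x in father)
sum_y2 = sum(y * y for y in son)
sum_xy = sum(x * y for x, y in zip(father, son))

# r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 1344 1352 225828 228532 227160
print(round(r, 3))                            # 0.603
```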
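Example 2: The following data consist of observations for the weights of 10 different
automobiles (in 1000 pounds) and the corresponding fuel consumptions (gallons per
100 miles).
Weight, x (1000 lb)    Fuel consumption, y (gal per 100 miles)
3.4                    5.5
3.8                    5.9
4.1                    6.5
2.2                    3.3
2.6                    3.6
2.9                    4.6
2.0                    2.9
2.7                    3.6
1.9                    3.1
3.4                    4.9
$\sum x^2 = 89.28$, $\sum x = 29$
$\sum y^2 = 207.31$, $\sum y = 43.9$, $\sum xy = 135.8$
$r = \frac{10(135.8) - (29)(43.9)}{\sqrt{[10(89.28) - 29^2][10(207.31) - 43.9^2]}} = 0.976629971 \approx 0.977$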
Example 3: Suppose that we took 7 mice and measured their body weight and their length
from nose to tail. We obtained the following results and want to know if there is any
relationship between the measured variables. [To keep the calculations simple, we will
use small numbers.]
Mouse    x     y
1        1     2
2        4     5
3        3     8
4        4    12
5        8    14
6        9    19
7        8    22
$\sum x^2 = 251$, $\sum x = 37$
$\sum y^2 = 1278$, $\sum y = 82$, $\sum xy = 553$
$r = \frac{7(553) - (37)(82)}{\sqrt{[7(251) - 37^2][7(1278) - 82^2]}} = 0.901441541 \approx 0.90$
Example 4: The data below are the heights (cm) and weights (kg) of 20 female students
taking STAT 201. Calculate the coefficient of correlation between the heights (fht) and
weights (fwt) of these students.
SL fht fwt
1 167 60
2 164 65
3 170 64
4 163 47
5 152 46
6 160 57
7 170 57
8 160 55
9 157 55
10 170 65
11 150 50
12 156 46
13 168 60
14 159 55
15 160 50
16 172 69
17 175 56
18 169 56
19 169 72
20 156 56
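Solution:
We know that the coefficient of correlation, with heights as $x$ and weights as $y$, is obtained from
$\sum x^2 = 534615$, $\sum x = 3267$
$\sum y^2 = 66113$, $\sum y = 1141$, $\sum xy = 187045$
$r = \frac{20(187045) - (3267)(1141)}{\sqrt{[20(534615) - 3267^2][20(66113) - 1141^2]}} = 0.673318089 \approx 0.673$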
The association between two series of ranks is called rank correlation. The method of
ascertaining the coefficient of correlation by ranks was devised by Charles Edward
Spearman in 1904. This method is especially useful when the actual magnitudes or
item values are not given and only their ranks in the series are known. Spearman’s rank
correlation coefficient, usually denoted by $\rho$ (rho), is given by the formula:
$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$
where $d$ stands for the difference between the pair of ranks and $n$ for the number of paired
observations.
The value of Spearman’s rank correlation coefficient ranges between $-1$ and $+1$. When
$\rho$ is $+1$, the concordance between rankings is perfect and the ranks are in the same
direction. When $\rho$ is $-1$, there is also perfect concordance between rankings but the ranks
are in opposite directions.
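Where actual ranks are given, the steps required for computing rank correlation are:
Take the difference of the two ranks, i.e. $R_1 - R_2$, and denote these differences by $d$.
Square these differences and obtain their total, $\sum d^2$.
Apply the formula for $\rho$.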
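The computation from given ranks can be sketched as follows. The function name `spearman_rho` and the five-item rankings are hypothetical, chosen only to show the two extreme values; the sketch assumes untied ranks and the formula $\rho = 1 - 6\sum d^2 / (n(n^2 - 1))$.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho from two lists of untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(ranks_a)
    # d is the difference between the paired ranks; square and total the differences.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Hypothetical rankings of five items by two judges.
judge_1 = [1, 2, 3, 4, 5]
same_order = [1, 2, 3, 4, 5]
reverse_order = [5, 4, 3, 2, 1]

rho_same = spearman_rho(judge_1, same_order)
rho_reverse = spearman_rho(judge_1, reverse_order)

print(rho_same)     # 1.0
print(rho_reverse)  # -1.0
```

Identical rankings give $+1$ and exactly reversed rankings give $-1$, the two limits of the coefficient.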
Example 1:
Two managers are asked to rank a group of employees in order of potential for eventually
becoming top managers. The rankings are as follows:
Solution:
We know that
$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 0.915$
Thus we find that there is a high degree of positive correlation in the ranks assigned by
the two managers.
When we are given the actual data and not the ranks, it will be necessary to assign the
ranks. Ranks can be assigned by taking either the highest value as 1 or the lowest value as
1. But whether we start with the lowest value or the highest value, we must follow the
same method for all the variables.
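Example 1:
Calculate the rank correlation coefficient for the following data of marks of 2 tests given
to candidates for a clerical job:
Preliminary test: 92 89 87 86 83 77 71 63 53 50
Final test:       86 83 91 77 68 85 52 82 37 57
Solution:
Ranking each test with the highest mark as 1 gives $\sum d^2 = 44$ for $n = 10$. We know that
$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 44}{10(10^2 - 1)} = 1 - 0.267 = 0.733$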
Thus there is a high degree of positive correlation between preliminary and final test.
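The ranking step and the resulting coefficient for the test marks can be reproduced with the sketch below. It assumes the highest mark receives rank 1 and that there are no tied marks (true for this data); the helper name `rank_descending` is hypothetical.

```python
def rank_descending(values):
    """Return ranks with 1 for the highest value; assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

preliminary = [92, 89, 87, 86, 83, 77, 71, 63, 53, 50]
final = [86, 83, 91, 77, 68, 85, 52, 82, 37, 57]

rank_p = rank_descending(preliminary)
rank_f = rank_descending(final)

# Squared differences between the paired ranks.
d_squared = [(p - f) ** 2 for p, f in zip(rank_p, rank_f)]
n = len(preliminary)
rho = 1 - 6 * sum(d_squared) / (n * (n * n - 1))

print(rank_f)          # [2, 4, 1, 6, 7, 3, 9, 5, 10, 8]
print(sum(d_squared))  # 44
print(round(rho, 3))   # 0.733
```

With tied marks the simple `index` lookup would no longer be appropriate, since tied items are usually given the average of the ranks they occupy.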
Merits
This method is simpler to understand and easier to apply compared to the Karl
Pearson’s method.
Where the data are of a qualitative nature like honesty, efficiency, intelligence
etc., this method can be used with great advantage. For example the workers of
two factories can be ranked in order of efficiency and the degree of correlation
established by applying the method.
This is the only method that can be used where we are given the ranks and not the
actual data.
Even where actual data are given rank method can be applied for ascertaining
rough degree of correlation.
Limitations:
This method cannot be used for finding out correlation in a grouped frequency
distribution.
For finding out correlation by the method of least squares we have to calculate
the values of the two regression coefficients, that of $x$ on $y$ and that of $y$ on $x$. The
correlation coefficient is the square root of the product of the two regression coefficients.
Symbolically, $r = \sqrt{b_{xy} \times b_{yx}}$.
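This relationship can be checked against the mice data of Example 3. The sketch assumes the two data columns are x and y, computes each regression coefficient as the co-deviation divided by the relevant sum of squared deviations, and gives $r$ the common sign of the two coefficients; the names `b_yx` and `b_xy` are notational assumptions.

```python
from math import sqrt

# Mice data from Example 3, read as x and y.
x = [1, 4, 3, 4, 8, 9, 8]
y = [2, 5, 8, 12, 14, 19, 22]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

b_yx = sxy / sxx  # regression coefficient of y on x
b_xy = sxy / syy  # regression coefficient of x on y

# Both coefficients share the sign of sxy, so r takes that sign.
sign = 1 if sxy >= 0 else -1
r = sign * sqrt(b_yx * b_xy)

print(round(r, 3))  # 0.901
```

The result agrees with the value of about 0.90 obtained for the same data from the sums.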
Coefficient of Determination
One very convenient and useful way of interpreting the value of the coefficient of correlation
between two variables is to use the square of the coefficient of correlation, which is called
the coefficient of determination. The coefficient of determination thus equals $r^2$.
If the value of $r = 0.9$, $r^2$ will be 0.81, and this would mean that 81% of the
variation in the dependent variable has been explained by the independent variable.