Topic 2 - Correlation Theory
CORRELATION THEORY
Correlation is a measure of the degree to which two variables vary together, or a measure of the
intensity (or strength) of the association between two variables. For example, we may be interested
in finding the correlation between smoking and lung cancer. Correlation does not have to be
performed only between independent and dependent variables. Correlation can be done on two
dependent variables.
In correlation, when the values of one variable increase as the values of another variable increase, the two are said to be positively correlated. On the other hand, if the values of one variable decrease as the values of the other increase, the correlation is negative. There may also be the case where one variable shows no systematic change with changes in another variable; in this case, the two are said to have no correlation.
The strength of the association between the two variables is measured by the correlation coefficient, r. The value of r ranges from -1 to +1 and is independent of the units of measurement. The strength of the association increases as r approaches an absolute value of 1.0, while a value of 0 indicates there is no linear association between the two variables tested. Among the properties of r: it is symmetric in the two variables, it is unaffected by changes of origin or scale in either variable, and it captures only the linear component of the association.
Correlation can be of three types as follows:
1. Simple correlation
2. Multiple correlation
3. Partial correlation
Simple correlation is the covariation between any two variables. In simple correlation, we measure the strength of the linear relationship between two variables without taking into consideration the fact that both of these variables may be influenced by a third variable. For example, when we study the correlation between price (dependent variable) and demand (independent variable), we completely ignore the effect of other factors like money supply, imports and exports, etc., which definitely have a bearing on price.
Partial correlation analysis involves studying the linear relationship between two variables after
excluding the effect of one or more independent factors. In order to get a correct picture of the
relationship between two variables, we should first eliminate the influence of other variables. For
example, study of partial correlation between price and demand would involve studying the
relationship between price and demand excluding the effect of money supply, exports, etc.
Multiple correlation analysis involves studying the effects of all the independent variables simultaneously on a dependent variable. For example, the correlation coefficient between the yield of paddy (X1) and the other variables, viz. type of seedlings (X2), manure (X3), rainfall (X4) and humidity (X5), is the multiple correlation coefficient R1.2345. This coefficient takes a value between 0 and +1.
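One practical way to compute a multiple correlation coefficient is as the square root of R² from an ordinary least-squares regression of the dependent variable on all the regressors. The sketch below illustrates this; the variable names echo the paddy-yield example, but all the data values are invented purely for illustration.

```python
import numpy as np

# Invented illustrative data: paddy yield (X1) with two of its
# drivers, manure (X3) and rainfall (X4).
x1 = np.array([20.0, 24.0, 27.0, 30.0, 34.0, 37.0])  # yield
x3 = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 4.5])        # manure
x4 = np.array([50.0, 60.0, 70.0, 75.0, 85.0, 95.0])  # rainfall

# Regress X1 on the other variables (with an intercept), then take
# R = sqrt(R^2); R is the multiple correlation coefficient R1.34.
A = np.column_stack([np.ones_like(x3), x3, x4])
coef, *_ = np.linalg.lstsq(A, x1, rcond=None)
fitted = A @ coef
ss_res = np.sum((x1 - fitted) ** 2)
ss_tot = np.sum((x1 - x1.mean()) ** 2)
R = np.sqrt(1 - ss_res / ss_tot)
print(f"Multiple correlation R = {R:.4f}")  # always between 0 and +1
```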
2.2 Linear Correlation Coefficient
The coefficient of correlation, r, called the linear correlation coefficient, measures the strength or degree of linear association between two variables. The linear correlation coefficient is sometimes referred to as the Pearson product-moment correlation coefficient in honor of its developer, Karl Pearson. The correlation coefficient between X and Y is given by:
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$
Another formula used to compute the correlation coefficient is as follows:

$$r = \frac{\sum xy}{\sqrt{\sum x^{2} \sum y^{2}}}$$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$, i.e. deviations from their respective means.
Example 1
The following data relates to the sales and profit of ABC Company Limited over 10 years. Calculate the sample correlation coefficient between sales and profit.

Time in years:         1    2    3    4    5    6    7    8    9    10
Sales in ksh.000's:   10   20   30   40   50   60   70   80   90   100
Profit in ksh.000's:   2    3    5    7    8    9   11   12   14    19
Time    X     Y     XY     X²      Y²    x = X-X̄   y = Y-Ȳ     xy     x²     y²
1      10     2     20     100      4      -45       -7       315   2025     49
2      20     3     60     400      9      -35       -6       210   1225     36
3      30     5    150     900     25      -25       -4       100    625     16
4      40     7    280    1600     49      -15       -2        30    225      4
5      50     8    400    2500     64       -5       -1         5     25      1
6      60     9    540    3600     81        5        0         0     25      0
7      70    11    770    4900    121       15        2        30    225      4
8      80    12    960    6400    144       25        3        75    625      9
9      90    14   1260    8100    196       35        5       175   1225     25
10    100    19   1900   10000    361       45       10       450   2025    100
Σ     550    90   6340   38500   1054        0        0      1390   8250    244
$$\bar{X} = \frac{\sum X}{n} = \frac{550}{10} = 55 \qquad \text{and} \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{90}{10} = 9, \qquad n = 10$$
Alternative I:

$$r = \frac{n\sum XY - \sum X\sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}} = \frac{10(6340) - (550)(90)}{\sqrt{\left[10(38500) - (550)^{2}\right]\left[10(1054) - (90)^{2}\right]}}$$

$$r = \frac{63{,}400 - 49{,}500}{\sqrt{(82{,}500)(2{,}440)}} = \frac{13{,}900}{14{,}188.02} = 0.9797$$
Alternative II:

$$r = \frac{\sum xy}{\sqrt{\sum x^{2} \cdot \sum y^{2}}} = \frac{1{,}390}{\sqrt{(8{,}250)(244)}} = \frac{1{,}390}{1{,}418.802} = 0.9797$$
The result of 0.9797 shows that the association between X and Y is very strong, because the r value is close to 1. The correlation coefficient squared equals the coefficient of determination. Yet you need to be careful if you decide to calculate r by taking the square root of the coefficient of determination: you may not recover the correct sign when there is a negative association between the two variables.
In the regression context, r² is a more meaningful measure than r, for the former tells us the proportion of variation in the dependent variable explained by the explanatory variable(s) and therefore provides an overall measure of the extent to which the variation in one variable determines the variation of the other. The latter does not have such value. Moreover, the interpretation of r (= R) in a multiple regression model is of dubious value.
r² is the proportion of variance in one of the variables that can be explained by variation in the other variable, and 1 - r² is the proportion of variance left unexplained. For example, heights are measured for 20 father and (adult) son pairs. The correlation is estimated to be r = 0.6, so r² = 0.36. Thus 36% of the variation in the heights of sons is attributable to variation in the heights of the fathers; 64% of the variance in sons' heights is left unexplained.
[Scatter diagrams (omitted) illustrating perfect positive correlation (r = +1), perfect negative correlation (r = -1), weak positive correlation (0 < r < 0.25) and no correlation (r = 0).]
From Example 1, the sample correlation coefficient was obtained as r = 0.9797, which means that there is a strong positive correlation between profit and sales.
Now, if we square the sample correlation coefficient, we get a new value, r², called the coefficient of determination.
Thus r² = (0.9797)² = 0.9598.
To interpret, we say sales (X) explains or accounts for 95.98% of all changes in profit (Y), holding all other factors constant. This leaves about 4.02% of all changes in profit (Y) to be explained by other factors.
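The whole of Example 1 can be checked with a few lines of Python. The sketch below applies the raw-sums formula to the sales and profit figures above, using only the standard library:

```python
import math

# Sales (X) and profit (Y), both in ksh.000's, from Example 1.
X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
Y = [2, 3, 5, 7, 8, 9, 11, 12, 14, 19]
n = len(X)

# Pearson correlation coefficient via the raw-sums formula.
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(f"r   = {r:.4f}")       # 0.9797
print(f"r^2 = {r ** 2:.4f}")  # 0.9598, the coefficient of determination
```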
Now, the correlation coefficient just described above is more specifically called the Pearson correlation coefficient. However, apart from the Pearson correlation coefficient, we also have the Spearman's rank correlation coefficient, given by:

$$r_s = 1 - \frac{6\sum D^{2}}{n(n^{2} - 1)}$$
where D = the difference, or deviation, between the ranks of the corresponding observations on X and Y, the observations having been ranked either in ascending or descending order.
Spearman's rank correlation coefficient is used to identify and test the strength of a relationship between two sets of data. It is often used as a statistical method to aid in either proving or disproving a hypothesis, e.g. that the depth of a river does not progressively increase the further one moves from the river bank. The formula used to calculate Spearman's rank coefficient is shown below:

$$r_s = 1 - \frac{6\sum D^{2}}{n(n^{2} - 1)}$$

Step 1: Tabulate the data (sample number, width and depth):
Sample   Width (cm)   Depth (cm)
1             0            0
2            50           10
3           150           28
4           200           42
5           250           59
6           300           51
7           350           73
8           400           85
9           450          104
10          500           96
Step 2: Rank each set of data (width rank and depth rank). Rank 1 will be given to the largest number in column 2. Continue ranking until all widths have been ranked. Once all the widths have been ranked, do exactly the same for depth.
Sample   Width (cm)   Width Rank   Depth (cm)   Depth Rank
1             0           10            0           10
2            50            9           10            9
3           150            8           28            8
4           200            7           42            7
5           250            6           59            5
6           300            5           51            6
7           350            4           73            4
8           400            3           85            3
9           450            2          104            1
10          500            1           96            2
If two or more samples have the same value, the mean (average) rank should be used. For example, if there were 3 samples all with the same depth, next in line for rank 6, you would add together the rank values they would have occupied (6 + 7 + 8 = 21), then divide this number by the number of samples with the same depth, in this case 3 (21/3 = 7), so they would all receive a rank of 7. The next greatest depth would then be given a rank of 9, as the sketch after this paragraph demonstrates.
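This mean-rank rule is also what most statistical libraries apply by default. The sketch below assumes SciPy is available and uses invented depth values containing a three-way tie; the data are negated before ranking so that the largest observation receives rank 1, matching the convention used in the tables here:

```python
from scipy.stats import rankdata

# Invented depths with a three-way tie; negate so the largest value
# receives rank 1. Tied values share the mean of the ranks they span.
depths = [90, 70, 55, 55, 55, 40]
ranks = rankdata([-d for d in depths])
print(ranks)  # [1. 2. 4. 4. 4. 6.] -- the tie takes (3+4+5)/3 = 4
```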
Step 3: The next stage is to find d (the difference in rank between the width and depth). First, add a new column to your table, and then calculate d by subtracting the depth rank column (column 5) from the width rank column (column 3). For example, for sample 6 the width rank is 5 and the depth rank is 6, so d = 5 - 6 = -1.
Sample   Width (cm)   Width Rank   Depth (cm)   Depth Rank     d
1             0           10            0           10         0
2            50            9           10            9         0
3           150            8           28            8         0
4           200            7           42            7         0
5           250            6           59            5         1
6           300            5           51            6        -1
7           350            4           73            4         0
8           400            3           85            3         0
9           450            2          104            1         1
10          500            1           96            2        -1
Step 4: The next step is to calculate d². Add another column to your table and label it d². To calculate d², simply square the entries in the d column; e.g. for sample 6 the value of d is -1, so the value you should get is 1. Once all the d² values have been calculated, add them together to obtain Σd².
Sample   Width (cm)   Width Rank   Depth (cm)   Depth Rank     d    d²
1             0           10            0           10         0     0
2            50            9           10            9         0     0
3           150            8           28            8         0     0
4           200            7           42            7         0     0
5           250            6           59            5         1     1
6           300            5           51            6        -1     1
7           350            4           73            4         0     0
8           400            3           85            3         0     0
9           450            2          104            1         1     1
10          500            1           96            2        -1     1
Step 5: To complete the equation we still need to calculate n³ - n, where n is the number of samples, in this case 10. We obtain n³ by cubing n; once n³ has been calculated, subtract n from it.
Sample   Width (cm)   Width Rank   Depth (cm)   Depth Rank     d    d²
1             0           10            0           10         0     0
2            50            9           10            9         0     0
3           150            8           28            8         0     0
4           200            7           42            7         0     0
5           250            6           59            5         1     1
6           300            5           51            6        -1     1
7           350            4           73            4         0     0
8           400            3           85            3         0     0
9           450            2          104            1         1     1
10          500            1           96            2        -1     1

Σd² = 4
n = 10
n³ = 1000
n³ - n = 990
Step 6: All that is left to do now is to insert the values into the equation to calculate rs:

$$r_s = 1 - \frac{6\sum D^{2}}{n(n^{2} - 1)} = 1 - \frac{6 \times 4}{990} = 0.98$$
There is therefore a strong positive correlation between the width and the depth of the river. Thus we can reject the hypothesis and conclude that the depth of the river does progressively increase the further one moves from the river bank.
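The whole Step 1-6 calculation can be verified with SciPy, assuming it is available. Note that scipy.stats.spearmanr gives rank 1 to the smallest value rather than the largest, but because this reverses both rankings consistently, the resulting coefficient is identical:

```python
from scipy.stats import spearmanr

# River width and depth data from the worked example above.
width = [0, 50, 150, 200, 250, 300, 350, 400, 450, 500]
depth = [0, 10, 28, 42, 59, 51, 73, 85, 104, 96]

rs, p_value = spearmanr(width, depth)
print(f"rs = {rs:.4f}")  # 0.9758, which rounds to 0.98
```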
Example:
The data given in Example 1 above relates to the sales and profit of ABC Company Limited over 10 years. Let sales = X and profit = Y. Since sales and profit both increase steadily over the period, each pair of observations receives the same rank, so every deviation D = 0 and ΣD² = 0, with n = 10.
Hence:

$$r_s = 1 - \frac{6\sum D^{2}}{n(n^{2} - 1)} = 1 - \frac{6 \times 0}{10(100 - 1)} = 1$$
In practice, the Spearman's rank correlation coefficient, just like the Pearson's correlation coefficient, takes values in the interval -1 ≤ rs ≤ +1, and the interpretation of both is similar.
Partial correlation is a method used to describe the relationship between two variables whilst taking
away the effects of another variable, or several other variables, on this relationship. Consider a
three-variable regression model. We can compute three correlation coefficients: r12 (the correlation coefficient between Y and X2), r13 (between Y and X3) and r23 (between X2 and X3); note that the subscript 1 represents Y for notational convenience. These correlation coefficients are the simple correlation coefficients, or correlation coefficients of zero order. However, r12, for example, is not likely to reflect the true degree of association between Y and X2 in the presence of X3. As a matter of fact, it is likely to give a false impression of the nature of the association between Y and X2. Therefore, we need a correlation coefficient that is independent of the influence, if any, of X3 on X2 and Y. Such a correlation is known as the partial correlation coefficient. We define r12.3 as the partial correlation coefficient between Y and X2, holding X3 constant; r13.2 and r23.1 are defined analogously.
These partial correlations can be easily obtained from the simple, or zero-order, correlation coefficients as follows:

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{\left(1 - r_{13}^{2}\right)\left(1 - r_{23}^{2}\right)}}$$

with r13.2 and r23.1 obtained by permuting the subscripts accordingly.
The partial correlations above are called first-order correlation coefficients. By order we mean the number of secondary subscripts. Thus r12.34 would be a correlation coefficient of order two, r12.345 would be a correlation coefficient of order three, and so on. As noted previously, r12, r13, and so on are called simple or zero-order correlations. The interpretation of, say, r12.34 is that it gives the coefficient of correlation between Y and X2, holding X3 and X4 constant.
1. Even if r12 = 0, r12.3 will not be zero unless r13 or r23, or both, are zero.
2. If r12 = 0, and r13 and r23 are non-zero and of the same sign, r12.3 will be negative, whereas if they are of opposite signs, it will be positive.
3. The terms r12.3 and r12 (and similar comparisons) need not have the same sign.
4. In the two-variable case we have seen that r² lies between 0 and 1. The same property holds true for the squared partial correlation coefficients.
5. Suppose that r13 = r23 = 0. Does this mean that r12 is also zero? The answer is no. The fact that Y and X3, and X2 and X3, are uncorrelated does not mean that Y and X2 are uncorrelated.
Note that the expression r²12.3 may be called the coefficient of partial determination and may be interpreted as the proportion of the variation in Y not explained by the variable X3 that has been explained by the inclusion of X2 into the model.
Example
Suppose that a rather cranky professor has just administered an exam in his statistics course, and that for each student in the course we have measures on each of the following three variables:
X = a measure of the amount of effort the student put into preparing for the exam;
Y = the student's score on the exam;
Z = a measure of the degree to which the professor inspires fear and trembling in the student.
Suppose the observed zero-order correlations are rXY = +.20, rXZ = +.80 and rYZ = -.40.
It looks odd that the correlation between X and Y should end up as a scant rXY = +.20 and r²XY = .04, indicating that a mere 4% of the variance is shared between the degree of effort that students put into preparing for the exam and the scores that they receive. Examine the other two correlations, however, and you will see that it is not so odd after all. The greater the fear and trembling, the greater the effort that students tend to put into preparing for the exam; hence rXZ = +.80 and r²XZ = .64. On the other hand, the greater the fear and trembling, the less well students tend to do on the exam, as witnessed by rYZ = -.40 and r²YZ = .16. If we remove the suppressing effects of fear and trembling by computing the partial correlation, we find:

$$r_{XY.Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{\left(1 - r_{XZ}^{2}\right)\left(1 - r_{YZ}^{2}\right)}} = \frac{.20 - (.80)(-.40)}{\sqrt{(.36)(.84)}} = +.95$$

The correlation between effort and exam score goes from a scant rXY = +.20 to an impressive rXY.Z = +.95, and the proportion of shared variance between effort and exam score goes from a mere 4% to a very substantial 90% (r²XY.Z ≈ .90).
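The arithmetic of this example is easily verified by plugging the three zero-order correlations into the first-order partial correlation formula given earlier:

```python
import math

# Zero-order correlations from the cranky-professor example.
r_xy = 0.20   # effort vs. exam score
r_xz = 0.80   # fear vs. effort
r_yz = -0.40  # fear vs. exam score

# First-order partial correlation between X and Y, holding Z constant.
r_xy_z = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
print(f"r_XY.Z   = {r_xy_z:.4f}")       # 0.9456, about +.95
print(f"r_XY.Z^2 = {r_xy_z ** 2:.4f}")  # 0.8942, about 90% shared variance
```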
Karl Pearson's coefficient of correlation has the following limitations:
i) It assumes a linear relationship between the variables. Generally, this assumption is not valid, especially in social science, as linear relations rarely exist in such phenomena.
ii) It is liable to be misinterpreted, as a high degree of correlation does not necessarily mean a very close relationship between the variables.
iii) It is tedious to calculate. Its calculation is somewhat cumbersome and often difficult for the mathematically uninitiated (though software has made life a lot easier).
iv) It is unduly affected by the values of extreme items, as the sketch after this list demonstrates. For example, if the values of the x-series are 1, 2, 3, 4, 5 and the corresponding values of the y-series are 5, 4, 3, 2, 1, the coefficient of correlation between them would be -1 (a perfect negative correlation). However, if we add one more value to the x-series, namely 100, and one more value to the y-series, namely 120, the coefficient of correlation of these six pairs of values becomes approximately +.99. Thus the addition of a single pair of extreme values has swung the correlation from -1 all the way to +.99.
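This sensitivity to extreme values is easy to demonstrate. The sketch below reproduces the six-pair example with the statistics module from the Python standard library (statistics.correlation requires Python 3.10 or later):

```python
import statistics

# Five pairs in perfect negative correlation.
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
print(round(statistics.correlation(x, y), 3))  # -1.0

# One extreme pair swings the coefficient almost to +1.
x.append(100)
y.append(120)
print(round(statistics.correlation(x, y), 3))  # 0.998
```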
Summary
While correlation is an important econometric tool, its application to economic analysis is rather limited. For this reason, correlation analysis is rarely used in econometrics. A more versatile and powerful tool in economic analysis is regression, which is considered in the next section. According to Gujarati (2009), regression analysis is the bread-and-butter tool of econometrics.
Ideally, the sample correlation coefficient should lie between -1 and +1 (-1 ≤ r ≤ +1), such that r = +1 indicates a perfect positive correlation, r = -1 a perfect negative correlation, and r = 0 no correlation at all.