
TOPIC TWO

CORRELATION THEORY

2.1 Meaning of Correlation

Correlation is a measure of the degree to which two variables vary together, or a measure of the
intensity (or strength) of the association between two variables. For example, we may be interested
in finding the correlation between smoking and lung cancer. Correlation does not have to be
performed only between independent and dependent variables; it can also be computed between
two dependent variables.

In correlation, when the values of one variable increase with an increase in another variable, it is
said to be a positive correlation. On the other hand, if the values of one variable decrease with an
increase in another variable, it is a negative correlation. There may also be the case where one
variable does not change with changes in another variable. In this case, we say there is no
correlation between the two.

The strength of the association between the two variables is measured by the correlation coefficient,
r. It ranges from -1 to 1 and is independent of the units of measurement. The strength of the
association increases as r approaches an absolute value of 1.0. A value of 0 indicates there is no
association between the two variables tested. Some of the properties of r are as follows:

1. It can be positive or negative, the sign indicating the direction of the association.
2. It lies between the limits of -1 and +1, i.e. -1 ≤ r ≤ +1.
3. It is symmetrical in nature, i.e. the coefficient of correlation between X and Y (rxy) is the
   same as that between Y and X (ryx).
4. It is a measure of linear association. It has no meaning for describing nonlinear relations.
5. Although it is a measure of linear association between two variables, it does not necessarily
imply any cause and effect relationship i.e. one cannot draw cause and effect conclusions
based on correlation. There are two reasons why we cannot make causal statements.
a. We don’t know the direction of the cause – Does X cause Y or does Y cause X?
b. A third variable “Z” may be involved that is responsible for the covariance between X
and Y.

Correlation can be of three types as follows:

1. Simple correlation
2. Multiple correlation
3. Partial correlation

Simple correlation is defined as the covariation between any two variables. In simple
correlation, we measure the strength of the linear relationship between two variables, without
taking into consideration the fact that both these variables may be influenced by a third variable.
For example, when we study the correlation between price (dependent variable) and demand
(independent variable), we completely ignore the effect of other factors like money supply, import
and export etc., which definitely have a bearing on the price.

Partial correlation analysis involves studying the linear relationship between two variables after
excluding the effect of one or more independent factors. In order to get a correct picture of the
relationship between two variables, we should first eliminate the influence of other variables. For
example, study of partial correlation between price and demand would involve studying the
relationship between price and demand excluding the effect of money supply, exports, etc.

Multiple correlation analysis involves studying the effects of all the independent variables
simultaneously on a dependent variable. For example, the correlation coefficient between the
yield of paddy (X1) and the other variables, viz. type of seedlings (X2), manure (X3), rainfall (X4)
and humidity (X5), is the multiple correlation coefficient R1.2345. This coefficient takes values
between 0 and +1.

2.2 Linear Correlation Coefficient

The coefficient of correlation, r, called the linear correlation coefficient, measures the strength
or degree of linear association between two variables. The linear correlation coefficient is
sometimes referred to as the Pearson product-moment correlation coefficient in honor of its
developer, Karl Pearson. The correlation coefficient between X and Y is given by:

n XY   X Y 
r
n X 2

  X  nY 2  Y 
2 2

Another formula used to compute the correlation coefficient is as follows:

$$ r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} $$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$, i.e. deviations from their means.

Example 1
The following data relate to the sales and profit of ABC Company Limited over 10 years. Calculate
the sample correlation coefficient between sales and profit.

Time (years):         1    2    3    4    5    6    7    8    9    10

Sales (Ksh '000):     10   20   30   40   50   60   70   80   90   100

Profit (Ksh '000):    2    3    5    7    8    9    11   12   14   19

Solution: let sales = X and profit = Y.

Time   X     Y    XY     X²      Y²     x = X - X̄   y = Y - Ȳ   xy     x²     y²
1      10    2    20     100     4      -45          -7           315    2025   49
2      20    3    60     400     9      -35          -6           210    1225   36
3      30    5    150    900     25     -25          -4           100    625    16
4      40    7    280    1600    49     -15          -2           30     225    4
5      50    8    400    2500    64     -5           -1           5      25     1
6      60    9    540    3600    81     5            0            0      25     0
7      70    11   770    4900    121    15           2            30     225    4
8      80    12   960    6400    144    25           3            75     625    9
9      90    14   1260   8100    196    35           5            175    1225   25
10     100   19   1900   10000   361    45           10           450    2025   100
Σ      550   90   6340   38500   1054                             1390   8250   244

$$ \bar{X} = \frac{\sum X}{n} = \frac{550}{10} = 55, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{90}{10} = 9, \qquad n = 10 $$

Alternative I:

n XY   X Y 
r
n X 2

  X  nY 2  Y 
2 2

r
10  6340   550  90 63,400  49,500
r
10  38500  550  10  1054  90 
=
2 2
82,500  2,440

13,900
r  0.9797
14,188.02

Alternative II:

$$ r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} = \frac{1{,}390}{\sqrt{(8{,}250)(244)}} = \frac{1{,}390}{1{,}418.80} = 0.9797 $$
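For readers who want to check the arithmetic by machine, the following is a minimal Python sketch (an illustration added here, not part of the original notes) that reproduces r = 0.9797 by both routes, along with the coefficient of determination discussed below:

```python
import math

# Example 1 data: sales (X) and profit (Y), both in Ksh '000
X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
Y = [2, 3, 5, 7, 8, 9, 11, 12, 14, 19]
n = len(X)

# Alternative I: raw-sums formula
num = n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)
den = math.sqrt((n * sum(x * x for x in X) - sum(X) ** 2)
                * (n * sum(y * y for y in Y) - sum(Y) ** 2))
print(round(num / den, 4))        # 0.9797

# Alternative II: deviations-from-means formula
xbar, ybar = sum(X) / n, sum(Y) / n
dx = [x - xbar for x in X]
dy = [y - ybar for y in Y]
r = sum(a * b for a, b in zip(dx, dy)) / math.sqrt(
    sum(a * a for a in dx) * sum(b * b for b in dy))
print(round(r, 4))                # 0.9797
print(round(r ** 2, 4))           # 0.9598, the coefficient of determination
```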

The result of 0.9797 shows that the association between X and Y is very strong, because the value
of r is close to 1. The square of the correlation coefficient equals the coefficient of determination.
However, be careful if you decide to calculate r by taking the square root of the coefficient of
determination: you may not recover the correct sign when there is a negative association between
the two variables.

In the regression context, r2 is a more meaningful measure than r, for the former tells us the
proportion of variation in the dependent variable explained by the explanatory variable(s) and
therefore provides an overall measure of the extent to which the variation in one variable
determines the variation of the other. The latter does not have such value. Moreover, the
interpretation of r (=R) in a multiple regression model is of dubious value.

r2 is the proportion of variance in one of the variables that can be explained by variation in the
other variable. 1-r2 is the proportion of variance left unexplained. For example, heights are
measured for 20 father and (adult) son pairs. The correlation is estimated to be r=0.6. r2 = 0.36. So
36% of the variation in the height of sons is attributable to variation in the height of the father.
64% of the variance in sons' heights is left unexplained.

These possible ranges of sample correlation coefficients are illustrated below:

[Scatter diagrams omitted: the panels illustrate r = +1, r = -1, r = 0, and 0 < r < 0.25.]

From Example 1, the sample correlation coefficient was obtained as r = 0.9797, which means
that there is a strong positive correlation between profit and sales.

Now, if we square the sample correlation coefficient, we get a new value, r², called the
coefficient of determination. Thus r² = (0.9797)² = 0.9598.

Usually, r² is converted to a percentage by multiplying it by 100. Thus the coefficient of
determination is r² = 95.98%.

To interpret, we say sales (X) explains or accounts for 95.98% of all changes in profit (Y),
holding all other factors constant. This leaves about 4.02% of all changes in profit (Y) to be
explained by other factors.

More formally, recall that the formula for r is:

$$ r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} \quad \text{or} \quad r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} $$

and r² is simply the square of either expression.

Now, the correlation coefficient just described is more specifically called the Pearson
correlation coefficient. Apart from the Pearson correlation coefficient, we also have
Spearman's rank correlation coefficient, given by:

$$ r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)} $$

where D is the difference between the ranks of the corresponding observations on X and Y
(ranked either in ascending or descending order) and n is the number of pairs of observations.

2.3 Spearman’s Rank Correlation Coefficient

Spearman's rank correlation coefficient is used to identify and test the strength of a relationship
between two sets of data. It is often used as a statistical aid in either proving or disproving a
hypothesis, e.g. that the depth of a river does not progressively increase with distance from the
river bank. The formula used to calculate Spearman's rank correlation coefficient is shown below:

$$ r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)} $$

Steps in calculating Spearman’s Rank Correlation Coefficient

Step 1: Data Set

Sample   Width (cm)   Depth (cm)
1        0            0
2        50           10
3        150          28
4        200          42
5        250          59
6        300          51
7        350          73
8        400          85
9        450          104
10       500          96

Step 2: Rank each set of data (width rank and depth rank). Rank 1 is given to the largest number
in column 2 (width). Continue ranking until all the widths have been ranked; once they have,
do exactly the same for depth.

Sample   Width (cm)   Width Rank   Depth (cm)   Depth Rank
1        0            10           0            10
2        50           9            10           9
3        150          8            28           8
4        200          7            42           7
5        250          6            59           5
6        300          5            51           6
7        350          4            73           4
8        400          3            85           3
9        450          2            104          1
10       500          1            96           2

If two or more samples share the same value, the mean (average) rank should be used. For example,
if three samples all had the same depth and would occupy ranks 6, 7 and 8, you would add the rank
values together (6 + 7 + 8 = 21) and divide by the number of tied samples, in this case 3
(21/3 = 7), so each would receive a rank of 7. The next greatest depth would then be given a
rank of 9.

Step 3: The next stage is to find d, the difference in rank between the width and the depth. First
add a new column to your table, then calculate d by subtracting the depth rank column (column 5)
from the width rank column (column 3). For example, for sample 6 the width rank is 5 and the
depth rank is 6, so d = 5 - 6 = -1.

Sample   Width (cm)   Width Rank   Depth (cm)   Depth Rank   d
1        0            10           0            10           0
2        50           9            10           9            0
3        150          8            28           8            0
4        200          7            42           7            0
5        250          6            59           5            1
6        300          5            51           6            -1
7        350          4            73           4            0
8        400          3            85           3            0
9        450          2            104          1            1
10       500          1            96           2            -1

Step 4: The next step is to calculate d². Add another column to your table, label it d², and square
each element of d; e.g. for sample 6 the value of d is -1, so d² = 1. Once all the d² values have
been calculated, add them together to obtain ∑d².

Sample   Width (cm)   Width Rank   Depth (cm)   Depth Rank   d    d²
1        0            10           0            10           0    0
2        50           9            10           9            0    0
3        150          8            28           8            0    0
4        200          7            42           7            0    0
5        250          6            59           5            1    1
6        300          5            51           6            -1   1
7        350          4            73           4            0    0
8        400          3            85           3            0    0
9        450          2            104          1            1    1
10       500          1            96           2            -1   1

Step 5: To complete the equation we still need to calculate n³ - n, where n is the number of
samples, in this case 10. Cube n to get n³, then subtract n from it.

Sample   Width (cm)   Width Rank   Depth (cm)   Depth Rank   d    d²
1        0            10           0            10           0    0
2        50           9            10           9            0    0
3        150          8            28           8            0    0
4        200          7            42           7            0    0
5        250          6            59           5            1    1
6        300          5            51           6            -1   1
7        350          4            73           4            0    0
8        400          3            85           3            0    0
9        450          2            104          1            1    1
10       500          1            96           2            -1   1

∑d² = 4
n = 10
n³ = 1000
n³ - n = 990

Step 6: All that is left to do now is to insert the values into the equation to calculate rs:

$$ r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6 \times 4}{990} = 1 - \frac{24}{990} \approx 0.98 $$

There is therefore a strong positive correlation between the width and the depth of the river.
Thus we can reject the hypothesis and conclude that the depth of the river does progressively
increase with distance from the river bank.
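As a cross-check on Steps 1-6, here is a minimal Python sketch (an illustration added here, not part of the original notes). It uses scipy's rankdata, whose default method assigns tied values their average rank, exactly as described in Step 2:

```python
from scipy.stats import rankdata

width = [0, 50, 150, 200, 250, 300, 350, 400, 450, 500]
depth = [0, 10, 28, 42, 59, 51, 73, 85, 104, 96]
n = len(width)

# rankdata assigns rank 1 to the smallest value; the notes rank the largest
# value 1. The direction does not affect r_s as long as both series are
# ranked the same way.
rank_w = rankdata(width)
rank_d = rankdata(depth)

sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_w, rank_d))
rs = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(sum_d2, round(rs, 2))   # 4.0 0.98
```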

Example:

The data given in Example 1 above relate to the sales and profit of ABC Company Limited over
10 years, as follows:

Let sales = X and profit = Y.

Time   X     Y    Rank of X   Rank of Y   D = Rx - Ry   D²
1      10    2    1           1           0             0
2      20    3    2           2           0             0
3      30    5    3           3           0             0
4      40    7    4           4           0             0
5      50    8    5           5           0             0
6      60    9    6           6           0             0
7      70    11   7           7           0             0
8      80    12   8           8           0             0
9      90    14   9           9           0             0
10     100   19   10          10          0             0
                                          ∑D² = 0

With n = 10, we have:

$$ r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6(0)}{10(100 - 1)} = 1 - 0 = 1 $$

Thus, according to Spearman's rank correlation coefficient, there is a perfect positive
correlation between sales and profit.

In practice, Spearman's rank correlation coefficient, just like Pearson's correlation coefficient,
takes values in the interval -1 ≤ rs ≤ 1, and the interpretation of the two coefficients is similar.

2.4 Partial Correlation Coefficients

Partial correlation is a method used to describe the relationship between two variables whilst taking
away the effects of another variable, or several other variables, on this relationship. Consider a
three-variable regression model. We can compute three correlation coefficients: r12 (the correlation
coefficient between Y and X2), r13 (the correlation coefficient between Y and X3) and r23 (the
correlation coefficient between X2 and X3); note that the subscript 1 represents Y for notational
convenience. These are the simple correlation coefficients, or correlation coefficients of zero order.
However, r12, for example, is not likely to reflect the true degree of association between Y and X2
in the presence of X3. As a matter of fact, it is likely to give a false impression of the nature of the
association between Y and X2. Therefore, we need a correlation coefficient that is independent of
the influence, if any, of X3 on X2 and Y. Such a correlation is known as the partial correlation
coefficient. We define:

r12.3 = Partial correlation coefficient between Y and X2, holding X3 constant.

r13.2 = Partial correlation coefficient between Y and X3, holding X2 constant.

r23.1 = Partial correlation coefficient between X2 and X3, holding Y constant.

These partial correlations can be easily obtained from the simple, or zero-order, correlation
coefficients as follows:

$$ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \qquad \text{(2.4.1)} $$

$$ r_{13.2} = \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}} \qquad \text{(2.4.2)} $$

$$ r_{23.1} = \frac{r_{23} - r_{12}\,r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}} \qquad \text{(2.4.3)} $$

The partial correlations above are called first-order correlation coefficients. By order we mean the
number of secondary subscripts. Thus r12.34 would be a correlation coefficient of order two, r12.345
a correlation coefficient of order three, and so on. As noted previously, r12, r13, and so on are called
simple or zero-order correlations. The interpretation of, say, r12.34 is that it gives the coefficient of
correlation between Y and X2, holding X3 and X4 constant.
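Since equations (2.4.1)-(2.4.3) share a single functional form, they are straightforward to compute. Below is a minimal Python helper (an illustrative sketch added here; partial_r is a name chosen for this example, not a standard library function):

```python
import math

def partial_r(r12: float, r13: float, r23: float) -> float:
    """Equation (2.4.1): the first-order partial correlation r12.3,
    i.e. the correlation between variables 1 and 2 holding variable 3
    constant, computed from the three zero-order correlations."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
```

Permuting the arguments gives the other two first-order partials: (2.4.2) is partial_r(r13, r12, r23) and (2.4.3) is partial_r(r23, r12, r13).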

From (2.4.1), for example, we make the following observations:

1. Even if r12 = 0, r12.3 will not be zero unless r13 or r23, or both, are zero.
2. If r12 = 0 while r13 and r23 are non-zero and of the same sign, r12.3 will be negative,
   whereas if they are of opposite signs, it will be positive.
3. The terms r12.3 and r12 (and similar comparisons) need not have the same sign.
4. In the two-variable case we have seen that r² lies between 0 and 1. The same property
   holds true of the squared partial correlation coefficients.
5. Suppose that r13 = r23 = 0. Does this mean that r12 is also zero? The answer is no. The
   fact that Y and X3, and X2 and X3, are uncorrelated does not mean that Y and X2 are
   uncorrelated.

Note that the expression r²12.3 may be called the coefficient of partial determination and may be
interpreted as the proportion of the variation in Y not explained by the variable X3 that has been
explained by the inclusion of X2 into the model.

Example 1

Suppose that a rather cranky professor has just administered an exam in his statistics course, and
that for each student in the course we have measures on each of the following three variables:

X = the amount of effort spent on studying for the exam beforehand.

Y = the student’s score on the exam.

Z= a measure of the degree to which the professor inspires fear and trembling in the student.

The correlations among the three variables are:

X versus Y: rXY = +.20, r²XY = .04

X versus Z: rXZ = +.80, r²XZ = .64

Y versus Z: rYZ = -.40, r²YZ = .16

It looks odd that the correlation between X and Y should end up as a scant rXY = +.20 and r²XY =
.04, indicating that a mere 4% of the variance is shared between the degree of effort that students
put into the exam and the scores that they receive. Examine the other two correlations, however,
and you will see that it is not so odd after all. The greater the fear and trembling, the greater the
effort that students tend to put into preparing for the exam; hence rXZ = +.80 and r²XZ = .64. On
the other hand, the greater the fear and trembling, the less well students tend to do on the exam,
as witnessed by rYZ = -.40 and r²YZ = .16. If we remove the suppressing effect of fear and
trembling from the equation, we find

$$ r_{XY.Z} = \frac{0.20 - (0.80)(-0.40)}{\sqrt{(1 - 0.64)(1 - 0.16)}} = \frac{0.52}{0.55} = 0.95, \qquad r_{XY.Z}^2 = 0.90 $$

The correlation between effort and exam score goes from a scant rXY = +.20 to an impressive
rXY.Z = +.95, and the shared variance between effort and exam score goes from a mere 4% to a
very substantial 90% (r²XY.Z = .90).
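Feeding these zero-order correlations into the partial_r helper sketched earlier in this section reproduces the calculation:

```python
r_xy, r_xz, r_yz = 0.20, 0.80, -0.40

# Partial correlation between effort (X) and score (Y), holding fear (Z) constant
r_xy_z = partial_r(r_xy, r_xz, r_yz)
print(round(r_xy_z, 2))        # 0.95
print(round(r_xy_z ** 2, 2))   # 0.89 (the text rounds 0.95 squared to .90)
```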

2.5 Limitations of the Theory of Linear Correlation


The chief merit of the correlation coefficient is that it not only gives an idea of the co-variation of
the two series but also indicates the direction of the relationship. Thus this coefficient measures both
the degree and the direction of the correlation between two variables. It is the most widely used
method of studying the relationship between inter-related phenomena. However, it also suffers from
certain limitations. These are:

i) It assumes a linear relationship between the variables. Generally, this assumption is not
   valid, especially in the social sciences, as linear relations rarely exist in such phenomena.
ii) It is liable to be misinterpreted, as a high degree of correlation does not necessarily
   mean a very close relationship between the variables.
iii) It is tedious to calculate. Its calculation is somewhat cumbersome, often difficult for
   the mathematically uninitiated (though software has made life a lot easier).
iv) It is unduly affected by the values of extreme items. For example, if the values of the
   X-series are 1, 2, 3, 4, 5 and the corresponding values of the Y-series are 5, 4, 3, 2, 1,
   the coefficient of correlation between them would be -1. However, if we add one more
   value of 100 to the X-series and one more value of 120 to the Y-series, the coefficient
   of correlation of these six pairs of values would be +.99. Thus the addition of a single
   pair of extreme values has swung the correlation from -1 to +.99 (see the numerical
   check below).
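Point (iv) is easy to verify numerically; the following sketch (an illustration added here, not part of the original notes) uses numpy's corrcoef:

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
print(round(np.corrcoef(x, y)[0, 1], 3))   # -1.0: the series are perfectly inverse

# Append a single pair of extreme values
x.append(100)
y.append(120)
print(round(np.corrcoef(x, y)[0, 1], 3))   # 0.998: one extreme pair swings r to about +.99
```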

Summary

While correlation is an important econometric tool, its application to economic analysis is rather
limited. For this reason, correlation analysis is rarely used in econometrics. A more versatile and
powerful econometric tool in economic analysis is regression, which is considered in the next
section. According to Gujarati (2009), regression analysis is the bread-and-butter tool of
econometrics.

The sample correlation coefficient always lies between -1 and +1 (-1 ≤ r ≤ +1), such that:

If r = -1, there is a perfect negative correlation

If -1< r < -0.75, there is a strong negative correlation

If -0.75< r< -0.25, there is a fair negative correlation

If -0.25 < r < 0, there is a weak negative correlation.

If r =0, there is no correlation

If 0< r < 0.25, there is a weak positive correlation

If 0.25< r < 0.75, there is a fair positive correlation

If 0.75< r < 1, there is a strong positive correlation

If r =1, there is a perfect positive correlation

