Correlation
Chapter 5
5.1: Introduction
across certain set of data, where each item of the set may comprise of the
measure the heights and weights of all the students. We observe that each
individual (unit) of the set assumes two values – one relating to the height
and the other to the weight. Such a distribution in which each individual or
unit of the set is made up of two values is called a bivariate distribution. The
distribution.
married couples.
various methods and techniques used for studying and measuring the extent
Correlation is classified in two important ways: (1) positive and negative
correlation, and (2) linear and non-linear correlation.
If the values of the two variables deviate in the same direction i.e. if
the variables deviate in opposite direction. That is, if the increase in the
Note:
(i) If the points are very close to each other, a fairly good amount of
(ii) If the points are scattered and they reveal no upward or downward
trend as in the case of (d) then we say the variables are
uncorrelated.
(iii) If there is an upward trend rising from the lower left hand corner
and going upward to the upper right hand corner, the correlation
downward trend from the upper left hand corner the correlation
diagrams.
positive.
Height Weight
170 65
172 66
181 69
157 55
150 51
168 63
166 61
175 75
177 72
165 64
163 61
152 52
161 60
173 70
175 72
Since the points are dense (close to each other) we can expect a high
degree of correlation between the two variables. Further,
since the points reveal an upward trend, the correlation is positive. Arrange
the data in increasing order of height and check that, as height increases, the
weight also tends to increase.
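The check suggested above can be sketched in a few lines of Python. This is only an illustration, assuming that the first column of the table is height and the second is weight; the names `pairs`, `ordered`, `rising` and `falling` are introduced here and are not part of the original text.

```python
# Height/weight pairs copied from the table above
# (assumption: first column is height, second column is weight).
pairs = [
    (170, 65), (172, 66), (181, 69), (157, 55), (150, 51),
    (168, 63), (166, 61), (175, 75), (177, 72), (165, 64),
    (163, 61), (152, 52), (161, 60), (173, 70), (175, 72),
]

# Arrange the data in increasing order of height (ties broken by weight).
ordered = sorted(pairs)

# Compare each weight with the next one in the height-ordered list.
weights = [w for _, w in ordered]
rising = sum(1 for a, b in zip(weights, weights[1:]) if b > a)
falling = sum(1 for a, b in zip(weights, weights[1:]) if b < a)

for height, weight in ordered:
    print(height, weight)
print("steps where weight rises:", rising)
print("steps where weight falls:", falling)
```

Most steps rise and only a few fall. A handful of local dips does not contradict a positive correlation; it only shows that the relation is not perfect.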
EXERCISES
(1) A Company has just brought out an annual report in which the capital
investment and profits were given for the past few years. Find the
Profits (lakhs) 12 14 13 18 26 38 62
correlations.
(3) Construct the scatter diagram of the data given below and indicate
change of one unit in one variable result in the corresponding change in the
X 2 4 6 8 10
Y 7 13 19 25 31
the corresponding values of y and the above data can be expressed by the
relation
y = 3x +1
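The relation can be checked directly against the tabulated values. The short Python sketch below is an illustration only; the helper `line` and the lists `xs`, `ys` are names chosen here, not part of the original text.

```python
# X and Y values from the table above.
xs = [2, 4, 6, 8, 10]
ys = [7, 13, 19, 25, 31]

def line(x, a=1, b=3):
    """Return a + b*x; the defaults correspond to y = 3x + 1."""
    return a + b * x

# Every tabulated y should equal 3x + 1.
matches = [line(x) == y for x, y in zip(xs, ys)]
print(matches)

# Equal steps in x (2 units) give equal steps in y (6 units),
# which is the defining feature of a linear relation.
steps = [y2 - y1 for y1, y2 in zip(ys, ys[1:])]
print(steps)
```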
In general, a linear relation between x and y has the form
y = a + bx
where ‘a’ and ‘b’ are real numbers. This is nothing but a straight line when
plotted on a graph sheet with different values of x and y and for constant
values of a and b. Such relations generally occur in physical sciences but are
corresponding to a unit change in one variable, the other variable does not
the data is plotted on a graph sheet we will not get a straight line curve. For
y = a + bx + cx²
‘r’ which measures the degree of association between the two values of
related variables given in the data set. It takes values from +1 to -1. If two
sets of data have r = +1, they are said to be perfectly correlated positively; if
r = -1, they are said to be perfectly correlated negatively; and if r = 0, they
are uncorrelated.
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
The following set of data was arrived at from a clinical study. Let us
determine the coefficient of correlation for this set of data. The first
column represents the serial number and the second and third columns
represent the weight (x) and the blood pressure (y) respectively.
1. 78 140
2. 86 160
3. 72 134
4. 82 144
5. 80 180
6. 86 176
7. 84 174
8. 89 178
9. 68 128
10. 71 132
Solution:
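From the data, n = 10 and
Σx = 796, Σy = 1546, Σx² = 63826, Σy² = 243036, Σxy = 124206.
Then
$$r = \frac{10(124206) - (796)(1546)}{\sqrt{\left[10(63826) - (796)^2\right]\left[10(243036) - (1546)^2\right]}} = \frac{11444}{\sqrt{(4644)(40244)}} = 0.8371$$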
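The same calculation can be sketched in Python as a check on the arithmetic. This is a minimal illustration of the sum-based formula for r, not a library routine; the function name `pearson_r` and the variable names are chosen here, and the x and y lists are the second and third columns of the clinical data above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Coefficient of correlation computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Second and third columns of the clinical data.
x = [78, 86, 72, 82, 80, 86, 84, 89, 68, 71]
y = [140, 160, 134, 144, 180, 176, 174, 178, 128, 132]

r = pearson_r(x, y)
print(round(r, 4))
```

Working from the raw sums mirrors the hand calculation, so each intermediate quantity (Σx, Σy, Σxy, Σx², Σy²) can be compared with the tabulated values.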
smallest and numbered 1, 2, 3, ... are said to be in ranks, or ranked data.
These ranks prove useful at certain times when two or more values of one
variable are the same. The coefficient of correlation for such type of data is
Spearman's rank correlation coefficient, denoted by R.
difference in rank ‘d’ for each pair. The following example will explain the
$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
Example 3: The data given below are obtained from student records.
Subject Grade Point Average (x) Graduate Record exam score (y)
1. 8.3 2300
2. 8.6 2250
3. 9.2 2380
4. 9.8 2400
5. 8.0 2000
6. 7.8 2100
7. 9.4 2360
8. 9.0 2350
9. 7.2 2000
Now we first arrange the data in descending order and then rank them
1, 2, 3, ..., 10 accordingly. In case of a tie, the rank of each tied value is the
mean of all the positions they occupy. In x, for instance, 8.6 occupies ranks 5 and
6, so each has rank (5 + 6)/2 = 5.5.
Similarly, in ‘y’, 2000 occupies ranks 9 and 10, so each has rank
(9 + 10)/2 = 9.5.
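The tie rule can be sketched in Python as follows. This is an illustration under one stated assumption: the text says that 8.6 occupies ranks 5 and 6, so a second value of 8.6 is taken as the tenth GPA. The function name `average_ranks_descending` is chosen here.

```python
def average_ranks_descending(values):
    """Rank values from largest (rank 1) to smallest.

    Tied values each receive the mean of the positions they occupy.
    """
    ordered = sorted(values, reverse=True)
    positions = {}
    for position, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]

# GPA values for subjects 1 to 9 from the table; the tenth value (8.6)
# is an assumption inferred from the tie described in the text.
gpa = [8.3, 8.6, 9.2, 9.8, 8.0, 7.8, 9.4, 9.0, 7.2, 8.6]

ranks = average_ranks_descending(gpa)
print(ranks)
```

Both occurrences of 8.6 receive rank 5.5, and the next value, 8.3, receives rank 7, because positions 5 and 6 are already used up by the tie.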
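The formula to be applied is
$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
Subject x y Rank of x Rank of y d d²
1. 8.3 2300 7 5 2 4
2. 8.6 2250 5.5 7 -1.5 2.25
3. 9.2 2380 3 2 1 1
4. 9.8 2400 1 1 0 0
5. 8.0 2000 8 9.5 -1.5 2.25
6. 7.8 2100 9 8 1 1
7. 9.4 2360 2 3 -1 1
8. 9.0 2350 4 4 0 0
9. 7.2 2000 10 9.5 0.5 0.25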
Here Σd² = 12 and n = 10. So
$$R = 1 - \frac{6(12)}{10(100 - 1)} = 1 - 0.0727 = 0.9273$$
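The calculation of R from the two rank columns can be sketched in Python. This is an illustration only: the ranks for subjects 1 to 4 and 6 to 8 are read from the table, while those for subjects 5, 9 and 10 are inferred here from the tie rule and from the stated total Σd² = 12, and should be treated as assumptions. The function name `spearman_rank_r` is chosen here.

```python
def spearman_rank_r(rank_x, rank_y):
    """Rank correlation R = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks of x and y for subjects 1..10.
# Entries 5, 9 and 10 are inferred (assumption), not read from the table.
rank_x = [7, 5.5, 3, 1, 8, 9, 2, 4, 10, 5.5]
rank_y = [5, 7, 2, 1, 9.5, 8, 3, 4, 9.5, 6]

sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
R = spearman_rank_r(rank_x, rank_y)
print(sum_d2, round(R, 4))
```

Only the differences in rank enter the formula, so the same function applies when the data are supplied as ranks alone.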
Note: If we are provided with only ranks without giving the values of x and
EXERCISES
110 2
113 6
120 3
115 4
110 6
115 5
117 4
123 2
106 1
108 4
110 1
110 3
2. The top and bottom numbers that may appear on a die are as
follows:
Top 1 2 3 4 5 6
Bottom 5 6 4 3 1 2
3. The ranks of two sets of variables (Heights and Weights) are given
below. Calculate the rank correlation coefficient R.
Heights 1 2 3 4 5 6 7 8 9 10
Weights 9 1 9 4 5 9 2 7 6 3
5.5: Regression
from the other. This observation leads to a very important concept known as
‘Regression Analysis’.
prediction of the unknown value of one variable from the known value of the
two or more variables that are related causally and for the estimation of
consumption, prices, investments, sales, profits, income etc. are of very great
between two or more variables in terms of the original units of the data.”
Suppose we have a sample of size ‘n’ and it has two sets of measures,
denoted by x and y. We can predict the values of ‘y’ given the values of ‘x’
y* = a + bx
where
$$a = \frac{\sum y - b\sum x}{n}, \qquad b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
term and final examination are given here. Develop a regression equation
which may be used to predict final examination scores from the mid – term
score.
1. 98 90
2. 66 74
3. 100 98
4. 96 88
5. 88 80
6. 45 62
7. 76 78
8. 60 74
9. 74 86
10. 82 80
Solution:
We want to predict the final exam scores from the mid-term scores. So
let us designate ‘y’ for the final exam scores and ‘x’ for the mid-term exam
scores.
Stud x y x² xy
1 98 90 9604 8820
2 66 74 4356 4884
3 100 98 10000 9800
4 96 88 9216 8448
5 88 80 7744 7040
6 45 62 2025 2790
7 76 78 5776 5928
8 60 74 3600 4440
9 74 86 5476 6364
10 82 80 6724 6560
Total 785 810 64521 65074
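Here n = 10, Σx = 785, Σy = 810, Σx² = 64521 and Σxy = 65074, so
b = [10(65074) - (785)(810)] / [10(64521) - (785)²] = 14890 / 28985 = 0.5137
(0.513714 before rounding), and
a = [810 - (0.513714)(785)] / 10 = 40.6735
Therefore the regression equation is
y* = 40.6735 + (0.5137) x
We can use this to find the projected or estimated final scores of the
students.
For example, for the midterm score of 50 the projected final score is
y* = 40.6735 + (0.5137)(50) = 66.36
To give another example, consider the midterm score of 70. Then the
projected final score is
y* = 40.6735 + (0.5137)(70) = 76.63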
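The fitting and prediction steps can be sketched in Python. This is a minimal illustration of the formulas for a and b given earlier, not a library routine; the names `fit_line`, `predict`, `midterm` and `final` are chosen here, and the data are the mid-term (x) and final (y) scores from the table of ten students.

```python
def fit_line(xs, ys):
    """Return (a, b) for the regression equation y* = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Mid-term (x) and final (y) scores for the ten students.
midterm = [98, 66, 100, 96, 88, 45, 76, 60, 74, 82]
final = [90, 74, 98, 88, 80, 62, 78, 74, 86, 80]

a, b = fit_line(midterm, final)

def predict(x):
    """Projected final score for a given mid-term score."""
    return a + b * x

print(round(a, 4), round(b, 4))
for score in (50, 70):
    print(score, round(predict(score), 2))
```

Note that b is computed first and then used, unrounded, to obtain a; rounding b before computing a changes the last digits of a slightly.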
This brings us to the end of this chapter. We close with some problems for
you.
EXERCISES
1. The data given below are obtained from student records. Calculate the
regression equation and compute the estimated GRE scores for GPA = 7.5 and
8.5.
Subject Grade Point Average (x) Graduate Record exam score (y)
the weight and blood pressure of an individual. The following set of data
1. 78 140
2. 86 160
3. 72 134
4. 82 144
5. 80 180
6. 86 176
7. 84 174
8. 89 178
9. 68 128
10. 71 132
3. A horse was tested to find how many minutes it takes to reach a given
point from the starting point. The horse was made to carry loads of
various weights on 10 trials. The data collected are presented in the
table below.
1 11 13
2 23 22
3 16 16
4 32 47
5 12 13
6 28 39
7 29 43
8 19 21
9 25 32
10 20 22
Find the regression equation between the load and the time taken to reach
the goal. Estimate the time taken for loads of 35 kg, 23 kg, and 9
kg. Are the answers in agreement with your intuitive feelings? Justify.