WEEK 4 St.
Linear regression analysis is a statistical technique for calculating a line of best fit through data in which each value
of x has an associated value of y. It is used to calculate values for a and b in the linear equation y = a + bx. Two
methods for finding a and b are shown below.
Graphical method
The following steps are to be taken:
i. Draw the scatter diagram for the data.
ii. Identify two points on the diagram that a straight line will pass through. One of the points ought to be the mean point (x̄, ȳ).
iii. Estimate the constants a and b from the graph:
a = intercept on the y-axis of the drawn straight line.
b = slope or gradient of the line drawn, i.e. slope = vertical length / horizontal length.
iv. State the regression line y = a + bx.
Example 1: Find the relationship between two variables y and x with the following data:
x 1 2 3 4 5
y 3 5 7 9 11
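The graphical steps can be mimicked numerically for the Example 1 data. The sketch below is only an illustration: it assumes the two points read off the drawn line are the mean point and (1, 3), a choice that is not stated in the notes. Because this data lies exactly on a straight line, any two points would give the same answer.

```python
# Data from Example 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

# Point 1: the mean point, which the line of best fit passes through.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Point 2 (assumed): a second point read off the drawn line, here (1, 3).
x2, y2 = xs[0], ys[0]

# Slope = vertical length / horizontal length between the two points.
b = (y_bar - y2) / (x_bar - x2)
# Intercept = value of y where the line crosses the y-axis (x = 0).
a = y_bar - b * x_bar

print(f"y = {a:g} + {b:g}x")
```

With real data the points rarely lie exactly on a line, so a graphical estimate depends on how the line is drawn by eye.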
Algebraic method
In the algebraic method, we use the "normal equations", which are derived by the least squares method. The normal
equations are:
na + bΣx = Σy ------ 1
aΣx + bΣx² = Σxy ------ 2
When equations 1 and 2 are solved simultaneously, we obtain the following estimates of a and b.
Given a number of pairs of data, a line of best fit (y = a + bx) can be constructed by calculating values for a and b using
the following formulae:
$$a = \frac{\sum y}{n} - \frac{b\sum x}{n}$$

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
Where:
x, y = values of pairs of data.
n = the number of pairs of values for x and y.
∑ = A sign meaning the sum of. (The capital of the Greek letter sigma).
Note: the term b must be calculated first as it is used in calculating a.
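The formulae for a and b can be recovered from the normal equations by substitution. The derivation below is a sketch that assumes the second normal equation has its standard least squares form, aΣx + bΣx² = Σxy.

```latex
\begin{aligned}
na + b\sum x &= \sum y && (1)\\
a\sum x + b\sum x^2 &= \sum xy && (2)\\[4pt]
\text{From (1):}\quad a &= \frac{\sum y}{n} - \frac{b\sum x}{n}\\[4pt]
\text{Substitute into (2):}\quad
\frac{\sum x\sum y}{n} - \frac{b\left(\sum x\right)^2}{n} + b\sum x^2 &= \sum xy\\[4pt]
\text{Multiply by } n:\quad
b\left(n\sum x^2 - \left(\sum x\right)^2\right) &= n\sum xy - \sum x\sum y\\[4pt]
b &= \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2}
\end{aligned}
```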
Approach
Set out the pairs of data in two columns, with one column for the values of x and the second column for the
associated values of y. (For example, x for output and y for total cost.)
Set up a column for x², calculate the square of each value of x and enter the value in the x² column.
Set up a column for xy and for each pair of data, multiply x by y and enter the value in the xy column.
Sum each column.
Enter the values into the formulae and solve for b and then a. (It must be in this order as you need b to find a).
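The column approach above can be sketched in a few lines of Python. The function name `fit_line` and the input checks are illustrative choices, not part of the notes. The data used to try it out is the x and y series from Example 1.

```python
def fit_line(xs, ys):
    """Fit y = a + bx using the column method: build x^2 and xy, sum, solve."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)

    # Columns for x squared and xy.
    x_sq = [x * x for x in xs]
    xy = [x * y for x, y in zip(xs, ys)]

    # Sum each column.
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x_sq, sum_xy = sum(x_sq), sum(xy)

    denom = n * sum_x_sq - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal, so the slope is undefined")

    # b is calculated first because a depends on it.
    b = (n * sum_xy - sum_x * sum_y) / denom
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(f"y = {a:g} + {b:g}x")
```

The check on the denominator is an added safeguard: if every x value is the same, the formula for b divides by zero and no line can be fitted.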
Advantages
1. Economics and business: Linear regression analysis is widely used in economics and business. One application is
that it can be used to estimate fixed costs and variable cost per unit (or number of units) from historical total cost
data.
2. Forecasting: Once the equation of the line of best fit is derived, it can be used to forecast the impact of changes
in x on the value of y.
Limitations
1. The analysis is only based on a pair of variables. There might be other variables which affect the outcome but the
analysis cannot identify these.
2. A regression line should only be used for forecasting if there is a good fit between the line and the data. It might not
be valid to extrapolate the line beyond the range of observed data. In the example above the cost associated with
10,000 units was identified as 55.6. However, the data does not cover volumes of this size and it is possible that the
linear relationship between costs and output may not be the same at this level of output.
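The warning about extrapolation can be built into a forecast routine. The sketch below is a hypothetical helper, not a method given in the notes: it assumes the line y = 1 + 2x, which fits the data x = 1 to 5, y = 3 to 11 used in Example 1, and simply flags any forecast whose x lies outside the observed range.

```python
def forecast(a, b, x, x_min, x_max):
    """Return (forecast y, True if x lies inside the observed range of x)."""
    y_hat = a + b * x
    within_range = x_min <= x <= x_max
    return y_hat, within_range


# Assumed inputs: line y = 1 + 2x and the observed x values from Example 1.
a, b = 1.0, 2.0
observed_x = [1, 2, 3, 4, 5]
lo, hi = min(observed_x), max(observed_x)

for x in (4.5, 10):
    y_hat, ok = forecast(a, b, x, lo, hi)
    note = "interpolation" if ok else "extrapolation: treat with caution"
    print(f"x = {x}: forecast y = {y_hat} ({note})")
```

A flag like this does not say the forecast is wrong, only that the data gives no evidence for the linear relationship at that level of x.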
Example 1: Use the data below to calculate or fit the regression line of y on x:
x 1 2 3 4 5
y 3 5 7 9 11
Solution
x      y      xy     x²
1      3      3      1
2      5      10     4
3      7      21     9
4      9      36     16
5      11     55     25
Σx = 15   Σy = 35   Σxy = 125   Σx² = 55
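$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{5(125) - 15(35)}{5(55) - (15)^2} = \frac{625 - 525}{275 - 225} = \frac{100}{50} = 2$$

$$a = \frac{\sum y}{n} - \frac{b\sum x}{n} = \frac{35}{5} - \frac{2(15)}{5} = \frac{35}{5} - \frac{30}{5} = 7 - 6 = 1$$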
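As a quick check on a = 1 and b = 2, the fitted values y = 1 + 2x can be compared with the observed y values. This sketch only uses the Example 1 data and the two estimates found above. The residual (observed minus fitted) is the quantity that the least squares method minimises.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = 1, 2  # estimates from the worked solution

# Fitted value for each x, and the gap between observed and fitted y.
fitted = [a + b * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
sum_sq_residuals = sum(r * r for r in residuals)

print("fitted:   ", fitted)
print("residuals:", residuals)
print("sum of squared residuals:", sum_sq_residuals)
```

Every residual is zero here, so the regression line is y = 1 + 2x and it passes through all five points.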
Example 2: The table below shows the income and expenditure (in ₦'000) of a man for 10 months.
Income (x)        8  18  52  38  26  60  40  50  82  75
Expenditure (y)   2   4   5   7   9  11  13  15  20  23
Fit simple linear regression line y = a + bx to the data.
Solution
x      y      xy      x²      y²
8      2      16      64      4
18     4      72      324     16
52     5      260     2704    25
38     7      266     1444    49
26     9      234     676     81
60     11     660     3600    121
40     13     520     1600    169
50     15     750     2500    225
82     20     1640    6724    400
75     23     1725    5625    529
Σx = 449   Σy = 109   Σxy = 6143   Σx² = 25261   Σy² = 1619
$$b = \frac{(10)(6143) - (449)(109)}{(10)(25261) - (449)^2} = \frac{12489}{51009} = 0.2448$$

$$a = \frac{\sum y}{n} - \frac{b\sum x}{n} = 10.9 - (0.2448)(44.9) = 10.9 - 10.9915 = -0.0915$$

The model is

$$y = a + bx = -0.0915 + 0.2448x$$
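Because a is calculated from b, rounding b before using it affects the intercept. The sketch below recomputes Example 2 from the column totals, once with b rounded to four decimal places (as in the worked solution) and once with b kept exact. Using Python's `Fraction` type for the exact version is an illustrative choice.

```python
from fractions import Fraction

# Column totals from the Example 2 table.
n = 10
sum_x, sum_y = 449, 109
sum_xy, sum_x_sq = 6143, 25261

# Exact slope and intercept, with no rounding at any stage.
b_exact = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x_sq - sum_x ** 2)
a_exact = Fraction(sum_y, n) - b_exact * Fraction(sum_x, n)

# Same route as the worked solution: round b to 4 d.p., then find a.
b_rounded = round(float(b_exact), 4)
a_from_rounded_b = sum_y / n - b_rounded * sum_x / n

print(f"b = {float(b_exact):.4f}")
print(f"a using rounded b   = {a_from_rounded_b:.4f}")
print(f"a using unrounded b = {float(a_exact):.4f}")
```

The rounded route reproduces a = -0.0915, while the unrounded route gives an intercept of about -0.0933. The difference is small here, but it shows why intermediate values should carry as many decimal places as practical.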
Practice Questions
1. A company has recorded the following output levels and associated costs in the past six months:
Month Output (000 of units) Total cost (₦ m)
January 5.8 40.3
February 7.7 47.1
March 8.2 48.7
April 6.1 40.6
May 6.5 44.5
June 7.5 47.1
Required: Construct the equation of a line of best fit for this data.
2. Construct a line of best fit for the following information and estimate the total costs when output is 15,000 units.
Output (000s) Total cost (₦m)
17 63
15 61
12 52
22 74
18 68
Linear regression analysis can be used to construct a regression line for any pairs of data. This does not prove that a
relationship exists between the data and, if one does exist, the regression line gives no indication of how well the line
fits the observations.
Correlation is a measure of how close the points on a scatter graph are to the line of best fit. If all of the points are very
close to the line of best fit then it is highly suggestive that there is a relationship between x and y. However, this is not
necessarily the case. Correlation is not causation.
Types of correlation
Positive correlation means that the value of y increases as the value of x increases (and vice versa).
Perfect positive correlation is when all the data points lie in an exact upward-sloping straight line and a linear
relationship exists between the two variables.
Negative correlation is where the value of y decreases as the value of x increases (and vice versa).
Perfect negative correlation is when all the data points plotted lie in an exact downward-sloping straight line.
'Uncorrelated' means that no correlation is seen to exist between the variables.
Correlation coefficient r
Correlation between different variables can be measured as a correlation coefficient. The formula for the correlation
coefficient (r) will be given to you in the examination.
Formula: Correlation coefficient (r)
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}}$$
Where:
x, y = values of pairs of data.
n = the number of pairs of values for x and y.
This formula might seem difficult, but it is fairly similar to the formula for calculating ‘b’ in the linear cost equation.
The only additional value needed to calculate the correlation coefficient is a value for [nΣy2 – (Σy)2].
In order to do this a further column is needed for y2.
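The overlap between the formulae for b and r can be seen by computing both from the same column totals. The sketch below uses the totals of the income and expenditure table from regression Example 2, which already includes a y² column. Computing r for that data is an added illustration and not a result stated in the notes.

```python
import math

# Column totals from the income/expenditure table (regression Example 2).
n = 10
sum_x, sum_y = 449, 109
sum_xy, sum_x_sq, sum_y_sq = 6143, 25261, 1619

numerator = n * sum_xy - sum_x * sum_y   # shared by b and r
x_term = n * sum_x_sq - sum_x ** 2       # denominator of b
y_term = n * sum_y_sq - sum_y ** 2       # the one extra value that r needs

b = numerator / x_term
r = numerator / math.sqrt(x_term * y_term)

print(f"b = {b:.4f}, r = {r:.4f}")
```

Only `y_term` is new. Everything else was already needed to find b, and r comes out at roughly 0.84 for this data.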
A value of –1 indicates that there is perfect negative correlation between the values for y and the values for x
that have been used in the regression analysis estimates. Perfect negative correlation means that all the values
for x and y, plotted on a graph, would lie on a straight downward-sloping line.
A value of +1 indicates that there is perfect positive correlation between the values for y and the values for x
that have been used in the regression analysis estimates. Perfect positive correlation means that all the values
for x and y, plotted on a graph, would lie on a straight upward-sloping line.
A value of r = 0 indicates no correlation at all between the values of x and y.
For cost estimation, a value for r close to +1 would indicate that the cost estimates are likely to be very reliable.
As a general guide, a value for r between + 0.90 and +1 indicates good correlation between the values of x and y,
suggesting that the formula for costs can be used with reasonable confidence for cost estimation.
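The three reference values of r can be reproduced with small data sets. In the sketch below the function name `pearson_r` is an illustrative choice and the `no_trend` series is made up for the purpose. The upward series is the Example 1 regression data and the downward series is its mirror image.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r from the sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x_sq = sum(x * x for x in xs)
    sum_y_sq = sum(y * y for y in ys)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x_sq - sum_x ** 2) * (n * sum_y_sq - sum_y ** 2))
    if den == 0:
        raise ValueError("r is undefined when x or y has no variation")
    return num / den


xs = [1, 2, 3, 4, 5]
upward = [3, 5, 7, 9, 11]      # exact line y = 1 + 2x
downward = [11, 9, 7, 5, 3]    # exact line y = 13 - 2x
no_trend = [2, 1, 3, 1, 2]     # made-up values with no linear trend

for label, ys in (("upward", upward), ("downward", downward), ("no trend", no_trend)):
    print(f"{label}: r = {pearson_r(xs, ys):+.2f}")
```

The three cases give r = +1, r = -1 and r = 0 respectively.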
Example 1: The advertising costs (x) and revenues (y) of a company over 10 months are given below:
Advertisement (x) (₦'000)   45  70  32  24  75  16  28  43  60  15
Revenue (y) (million)       42  51  38  39  44  20  22  46  47  35
Determine the product moment correlation coefficient for the data.
Solution
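$$\begin{aligned}
r &= \frac{(10)(17{,}171) - (408)(384)}{\sqrt{[10(20{,}864) - (408)^2]\,[10(15{,}700) - (384)^2]}}\\
&= \frac{171{,}710 - 156{,}672}{\sqrt{[208{,}640 - 166{,}464]\,[157{,}000 - 147{,}456]}}\\
&= \frac{15{,}038}{\sqrt{[42{,}176]\,[9{,}544]}}\\
&= \frac{15{,}038}{\sqrt{402{,}527{,}744}}\\
&= \frac{15{,}038}{20063.0941}\\
&= 0.75
\end{aligned}$$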
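The solution above goes straight to the column totals without showing the table. The sketch below rebuilds those totals from the raw advertisement and revenue figures so that each number in the working can be checked. The variable names are illustrative.

```python
import math

x = [45, 70, 32, 24, 75, 16, 28, 43, 60, 15]   # advertisement
y = [42, 51, 38, 39, 44, 20, 22, 46, 47, 35]   # revenue
n = len(x)

# Column totals that the worked solution substitutes into the formula.
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(p * q for p, q in zip(x, y))
sum_x_sq = sum(p * p for p in x)
sum_y_sq = sum(q * q for q in y)
print(sum_x, sum_y, sum_xy, sum_x_sq, sum_y_sq)

numerator = n * sum_xy - sum_x * sum_y
x_term = n * sum_x_sq - sum_x ** 2
y_term = n * sum_y_sq - sum_y ** 2
r = numerator / math.sqrt(x_term * y_term)
print(round(r, 2))
```

The totals are Σx = 408, Σy = 384, Σxy = 17,171, Σx² = 20,864 and Σy² = 15,700, which match the figures used in the solution.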
Example 2: The marks scored by seven students in Mathematics (x) and Accounts (y) are given below. If the
maximum scores obtainable in mathematics and accounts are respectively 50 and 100, determine the Pearson
correlation coefficient for these scores.
X 30 40 35 40 20 25 50
Y 50 70 65 68 40 60 80
Solution
x      y      xy      x²      y²
30     50     1500    900     2500
40     70     2800    1600    4900
35     65     2275    1225    4225
40     68     2720    1600    4624
20     40     800     400     1600
25     60     1500    625     3600
50     80     4000    2500    6400
Σx = 240   Σy = 433   Σxy = 15595   Σx² = 8850   Σy² = 27849
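$$\begin{aligned}
r &= \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2]\,[n\sum y^2 - (\sum y)^2]}}\\
&= \frac{(7)(15595) - (240)(433)}{\sqrt{[7(8850) - (240)^2]\,[7(27849) - (433)^2]}}\\
&= \frac{109165 - 103920}{\sqrt{[4350]\,[7454]}}\\
&= \frac{5245}{\sqrt{32424900}}\\
&= \frac{5245}{5694.28661}\\
&= 0.92
\end{aligned}$$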
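The question mentions that the two subjects are marked out of different maximum scores. The sketch below suggests why this does not matter: rescaling one variable by a positive constant leaves r unchanged. Converting the mathematics marks to percentages is an added illustration, and `pearson_r` is an illustrative function name.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r from the sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    num = n * sum_xy - sum_x * sum_y
    x_term = n * sum(x * x for x in xs) - sum_x ** 2
    y_term = n * sum(y * y for y in ys) - sum_y ** 2
    return num / math.sqrt(x_term * y_term)


maths = [30, 40, 35, 40, 20, 25, 50]      # marked out of 50
accounts = [50, 70, 65, 68, 40, 60, 80]   # marked out of 100

r_raw = pearson_r(maths, accounts)

# Rescale mathematics to a percentage so both subjects are out of 100.
maths_pct = [m * 100 / 50 for m in maths]
r_pct = pearson_r(maths_pct, accounts)

print(round(r_raw, 2), round(r_pct, 2))
```

Both calculations give 0.92, so the raw marks can be used directly in the formula.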
Practice Question
1. A company has recorded the following output levels and associated costs in the past six months:
Month Output (000 of units) Total cost (₦ m)
January 5.8 40.3
February 7.7 47.1
March 8.2 48.7
April 6.1 40.6
May 6.5 44.5
June 7.5 47.1