Topic 3 (Correlation and Regression)
Independent variable (Predictor/ Factor): Variable that influences the dependent variable.
Method?
• Scatter plot
• Pearson’s product moment correlation coefficient
Scatter Plot
• Introduction:
To describe the relationship/ correlation between two variables: the independent variable and the dependent variable.
• In a scatter plot:
➢ x-axis: independent variable
➢ y-axis: dependent variable
Typical scatter plot patterns:
(1) Perfect Negative Correlation
(2) Perfect Positive Correlation
(3) Negative Linear Correlation
(4) Positive Linear Correlation
(5) No Correlation
(6) Curvilinear Correlation
S Huda 2020
Example 1: Scatter plot of the relationship between age of used car and selling price
Draw a scatter plot between age of used car and selling price. Hence, comment on the relationship between the two variables.

Age (in years)    Selling price (RM ‘000)
7                 15
10                13
8                 15
8                 14
9                 11
12                9
5                 18
9                 13

[Figure: scatter plot of selling price (RM ‘000, y-axis) against age (in years, x-axis)]
Comment:
There is a negative linear relationship between the age of a used car and its selling price.
Pearson’s product moment correlation coefficient
To measure the strength of relationship between two variables using Pearson’s product moment
correlation coefficient, 𝑟 with −1.0 ≤ 𝑟 ≤ 1.0
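How to calculate r?
$$ r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}} \quad \text{or} \quad r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$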
r = correlation coefficient
n = number of observations
Σxy = sum of the products of x and y
Σx = sum of all values of x
Σx² = sum of squares of all values of x
Σy = sum of all values of y
Σy² = sum of squares of all values of y
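The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `pearson_r` is hypothetical, and the data are the eight (age, selling price) pairs from Example 1.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r from the raw sums: n, Σx, Σy, Σxy, Σx², Σy²."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Example 1 data: age of used car (years) and selling price (RM '000)
age = [7, 10, 8, 8, 9, 12, 5, 9]
price = [15, 13, 15, 14, 11, 9, 18, 13]

r = pearson_r(age, price)
print(round(r, 4))  # -0.9368
```

The intermediate sums here are Σx = 68, Σy = 108, Σxy = 881, Σx² = 608 and Σy² = 1510.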
Interpretation of r

Value of r            Description
r = 1                 Perfect positive linear relationship between x and y
0.8 ≤ r < 1           Strong positive linear relationship between x and y
0.5 ≤ r < 0.8         Moderate positive linear relationship between x and y
0 < r < 0.5           Weak positive linear relationship between x and y
r = 0                 No linear relationship between x and y
−0.5 < r < 0          Weak negative linear relationship between x and y
−0.8 < r ≤ −0.5       Moderate negative linear relationship between x and y
−1 < r ≤ −0.8         Strong negative linear relationship between x and y
r = −1                Perfect negative linear relationship between x and y
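As a rough sketch, the bands in the table can be turned into a small lookup function. The name `describe_r` is hypothetical. The code assumes the negative side mirrors the positive bands, so that values at or below −0.8 count as strong, which matches how the worked examples describe r = −0.9368.

```python
def describe_r(r):
    """Map a correlation coefficient to the verbal description in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.8:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} linear relationship"

print(describe_r(-0.9368))  # strong negative linear relationship
print(describe_r(0.8271))   # strong positive linear relationship
```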
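Solution (Example 1):
n = 8, Σx = 68, Σy = 108, Σxy = 881, Σx² = 608, Σy² = 1510
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{8(881) - 68(108)}{\sqrt{\left[8(608) - 68^2\right]\left[8(1510) - 108^2\right]}} = -0.9368 $$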
Interpretation: There is a strong negative linear relationship between the age of the used car and its selling price.
Example 2:
Determine the relationship between the amount of fertilizer used and the yield of mango fruit in DDD plantation using
Pearson product moment correlation coefficient and explain its meaning.
Amount of fertilizer (gm) 150 175 190 200 75 150 190 140
Number of fruits 100 115 125 150 60 100 90 80
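Solution:
n = 8, Σx = 1270, Σy = 820, Σxy = 136675, Σx² = 213050, Σy² = 89450
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{8(136675) - 1270(820)}{\sqrt{\left[8(213050) - 1270^2\right]\left[8(89450) - 820^2\right]}} = 0.8271 $$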
Comment: There is a strong positive linear relationship between the amount of fertilizer used and the number of fruits produced.
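The two forms of the formula for r are algebraically the same: multiplying the numerator and denominator of the first form by n gives the second. The short check below, using the Example 2 data, is an illustrative sketch only. The variable names are not from the slides.

```python
from math import sqrt, isclose

# Example 2 data
fertilizer = [150, 175, 190, 200, 75, 150, 190, 140]  # amount of fertilizer (gm)
fruits = [100, 115, 125, 150, 60, 100, 90, 80]        # number of fruits

n = len(fertilizer)
sx, sy = sum(fertilizer), sum(fruits)
sxy = sum(x * y for x, y in zip(fertilizer, fruits))
sx2 = sum(x * x for x in fertilizer)
sy2 = sum(y * y for y in fruits)

# Form 1: correction terms divided by n
r_form1 = (sxy - sx * sy / n) / sqrt((sx2 - sx ** 2 / n) * (sy2 - sy ** 2 / n))
# Form 2: raw sums multiplied by n
r_form2 = (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

print(sx, sy, sxy, sx2, sy2)                 # 1270 820 136675 213050 89450
print(round(r_form1, 4), round(r_form2, 4))  # 0.8271 0.8271
```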
Quick check 1:
The following data give information on the ages (in years) of machine and the number of breakdowns during the
past year for a sample of ten machines at a large sawmill.
Answer:
𝑟 = 0.9566
Comment: There is a strong positive linear relationship between the ages (in years) of the machines and the number of breakdowns for the ten machines at a large sawmill.
Coefficient of Determination, r²
• The ratio of the explained variation to the total variation. r² explains how much of the variability in the dependent variable (Y) can be explained by the independent variable (X). The closer the coefficient of determination, r², is to 100%, the more important the independent variable (X) is in predicting the dependent variable (Y).
r² = (r)² × 100%
Example 3:
The following table shows the data on the tuition class period (hours) and the number of students who failed in the examination.
Tuition Class Period (Hours) 10 12 20 22 12 7 6 7
Number of Student Failed 19 11 6 5 9 20 23 21
Comment: 85.34% of the total variation in the number of students who failed the examination can be explained by the tuition class period, and the other 14.66% can be explained by other independent variables.
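A minimal sketch of the Example 3 calculation is given below. The variable names are illustrative. Squaring the unrounded r gives about 85.33%. The 85.34% quoted above appears to come from squaring r after rounding it to four decimal places, which is an inference from the arithmetic rather than something the slides state.

```python
from math import sqrt

# Example 3 data
hours = [10, 12, 20, 22, 12, 7, 6, 7]    # tuition class period (hours)
failed = [19, 11, 6, 5, 9, 20, 23, 21]   # number of students who failed

n = len(hours)
sx, sy = sum(hours), sum(failed)
sxy = sum(x * y for x, y in zip(hours, failed))
sx2 = sum(x * x for x in hours)
sy2 = sum(y * y for y in failed)

r = (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

r_squared_pct = r ** 2 * 100                     # from the unrounded r
r_squared_pct_rounded = round(r, 4) ** 2 * 100   # from r rounded to 4 d.p.

print(round(r, 4))                      # -0.9238
print(round(r_squared_pct, 2))          # 85.33
print(round(r_squared_pct_rounded, 2))  # 85.34
```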
Example 4:
The following table shows the data on the experience (in years) and monthly salaries (RM ’00) of nine workers
selected randomly from Factory AAA. Calculate the coefficient of determination and explain its meaning.
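Solution:
n = 9, Σx = 80, Σy = 164, Σxy = 1681, Σx² = 968, Σy² = 3202
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{9(1681) - 80(164)}{\sqrt{\left[9(968) - 80^2\right]\left[9(3202) - 164^2\right]}} = 0.9530 $$
r² ≈ 0.908, i.e. about 90.8%
Interpretation: About 90.8% of the total variation in monthly salaries can be explained by experience, and the other 9.2% can be explained by other independent variables.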
• The aim of the regression analysis is to develop the regression equation, or commonly known
as regression model. By using the regression model, the researcher can evaluate the
magnitude of change in one variable due to a certain change in another variable. For
example, a sociologist may want to estimate the increase in the crime rate due to a
particular increase in the unemployment rate.
• A regression model also helps predict the value of one variable for a given value of another
variable. For example, a sociologist can predict the crime rate with a given unemployment
rate.
• It is very important to define the dependent and independent variables correctly. The dependent variable is the variable that we usually want to predict/ estimate/ forecast. If the two are defined wrongly, the prediction may be wrong.
Simple Linear Regression
A simple linear regression model is a basic regression equation/ model with only one independent variable and one dependent variable, fitted using the least squares method.
• To find the regression coefficients (a and b)
• To generate the linear regression equation, Y = a + bX
$$ \text{constant/ y-intercept, } a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} $$
$$ \text{slope, } b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} $$
[Figure: regression lines with the independent variable on the x-axis and the dependent variable on the y-axis]
• b (positive value): Y = a + bX. As x increases, y increases.
• b (negative value): Y = a − bX. As x increases, y decreases.
22 6 𝑦 = 2.7143 + 0.1586𝑥
43 9
a: The estimated value of the dependent variable (Y) when the independent variable (X) is zero.
b: For a one-unit increase in the independent variable (X), the dependent variable (Y) will change (increase/ decrease) by b unit(s).
* increase (b is a positive value) or decrease (b is a negative value)
Regression equation from Example 1:
y = 2.7143 + 0.1586x
x = income (in RM ’00)
y = education expenditures (in RM ’00)
a = 2.7143: When there is no income, the expected education expenditure is RM 271.43.
b = 0.1586: When the income increases by RM 100, the education expenditure will increase by RM 15.86.
Example 6
An observation was carried out to determine the relationship between the age of a worker and the working time (in hours) at a farm. The table below shows the data recorded for ten randomly selected workers.
Age (years) 30 38 35 40 25 24 30 28 42 45
Time (hours) 8.5 8.5 9 8 9 9 8.5 9 7 5
Find the regression model for the above data and interpret the slope value.
Solution:
n = 10, Σx = 337, Σy = 81.5, Σx² = 11843, Σy² = 678.75, Σxy = 2680
$$ b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{2680 - \dfrac{337(81.5)}{10}}{11843 - \dfrac{337^2}{10}} = -0.1369 $$
$$ a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \frac{81.5}{10} - (-0.1369)\,\frac{337}{10} = 12.7637 $$
$$ y = a + bx = 12.7637 - 0.1369x $$
b = −0.1369: When the age of a worker increases by one year, the working time will decrease by 0.1369 hour.
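The least squares calculation for Example 6 can be reproduced with a short Python sketch. The helper name `fit_line` is hypothetical and not part of the slides. It applies the slope and intercept formulas to the raw data.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sx2 = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)   # slope
    a = sy / n - b * sx / n                         # y-intercept
    return a, b

# Example 6 data: age of worker (years) and working time (hours)
age = [30, 38, 35, 40, 25, 24, 30, 28, 42, 45]
hours = [8.5, 8.5, 9, 8, 9, 9, 8.5, 9, 7, 5]

a, b = fit_line(age, hours)
print(round(a, 4), round(b, 4))  # 12.7637 -0.1369
```

Computing a from the unrounded slope avoids carrying rounding error from b into the intercept.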
Example 7
A real estate agent believes that the monthly rents of houses depend on the size of the houses. A sample of
eight houses in a residential area was selected and the information gathered is shown in table below:
Find the regression model for the above data and interpret the slope value.
Solution:
n = 8, Σx = 83, Σy = 103, Σx² = 919, Σy² = 1413, Σxy = 1136
$$ b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{1136 - \dfrac{83(103)}{8}}{919 - \dfrac{83^2}{8}} = 1.1641 $$
$$ a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \frac{103}{8} - 1.1641\,\frac{83}{8} = 0.7970 $$
$$ y = a + bx = 0.7970 + 1.1641x $$
b = 1.1641: When the size of a house increases by 100 square feet, the monthly rent will increase by RM 116.41.
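When only the summary sums are available, as in the Example 7 solution, the coefficients can be computed from those sums directly. The sketch below assumes a hypothetical helper named `fit_from_sums` and uses the sums quoted in the solution.

```python
def fit_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Intercept a and slope b of y = a + b*x from the summary sums."""
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    return a, b

# Sums from the Example 7 solution
a, b = fit_from_sums(n=8, sum_x=83, sum_y=103, sum_xy=1136, sum_x2=919)
print(round(a, 4), round(b, 4))  # 0.797 1.1641
```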
Best Fitted Line/ Regression Line
Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum. A best fit line is needed because the values of y will be predicted from the values of x; hence, the closer the points are to the line, the better the fit and the better the prediction will be.
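The minimum property can be checked numerically. The sketch below uses the Example 6 data and a hypothetical helper `sse` for the sum of squared vertical distances. It compares the least squares line with a few lines whose intercept or slope has been nudged by an arbitrary amount.

```python
def sse(a, b, x, y):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Example 6 data: age of worker (years) and working time (hours)
age = [30, 38, 35, 40, 25, 24, 30, 28, 42, 45]
hours = [8.5, 8.5, 9, 8, 9, 9, 8.5, 9, 7, 5]

# Least-squares coefficients
n = len(age)
sx, sy = sum(age), sum(hours)
sxy = sum(x * y for x, y in zip(age, hours))
sx2 = sum(x * x for x in age)
b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
a = sy / n - b * sx / n

best = sse(a, b, age, hours)

# Nudged lines (the offsets are arbitrary illustration values)
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.01), (0.0, -0.01)]
others = [sse(a + da, b + db, age, hours) for da, db in nudges]

print(all(value > best for value in others))  # True
```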
y = 2.7143 + 0.1586x
This regression line is known as the ‘best fitted’ line, which can be used to predict Y for any value of X by reading from the line instead of using the formula.

Income (RM ’00)    Education Expenditure (RM ’00)
20                 y = 2.7143 + 0.1586(20) = 5.8863
40                 y = 2.7143 + 0.1586(40) = 9.0583

[Figure: best fitted line, with income (RM ’00) on the x-axis]
Prediction/ Estimation
Regression equation: To predict/ estimate the value of the dependent variable given the value of the independent variable.
From Example 1:
Independent variable: Income (RM ’00)
Dependent variable: Education expenditure (RM ’00)
Estimate the education expenditure for the past month if the income is RM 3000.
Regression equation, y = 2.7143 + 0.1586x
Actual unit = RM ’00, so x = 3000 ÷ 100 = 30
Solution: ŷ = 2.7143 + 0.1586(30) = 7.4723 (RM ’00), i.e. RM 747.23
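The unit conversion is the step most easily missed in this prediction. A small sketch follows, with a hypothetical function name and the coefficients taken from the regression equation above. It converts RM to RM ’00 before applying the equation and converts the result back to RM.

```python
def predict_expenditure_rm(income_rm, a=2.7143, b=0.1586):
    """Predicted education expenditure in RM for an income given in RM."""
    x = income_rm / 100    # the model works in units of RM '00
    y_hat = a + b * x      # predicted expenditure in RM '00
    return y_hat * 100     # convert back to RM

print(round(predict_expenditure_rm(3000), 2))  # 747.23
```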
Why significant? The significance value (Sig. = 0.001) is less than the significance level (5% or 0.05).
Regression Coefficients
Regression equation: 𝑌 = 𝑎 + 𝑏𝑋
𝑎 = 23.983
𝑏 = -1.233
𝑌 = 23.983 − 1.233𝑋
Model Summary
𝑟 = -0.937
Interpretation: There is a strong negative linear relationship between the age of the used car and its selling price.
Comment: 87.80% of the total variation in the selling price of the used car can be explained by its age (in years), and the other 12.20% can be explained by other predictors (independent variables).
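These reported values agree with a hand calculation on the used car data (age, selling price) from the first scatter plot example. The sketch below is an illustrative cross-check only. Unrounded, r² is about 0.8776. The 87.80% quoted above corresponds to r² rounded to three decimal places (0.878).

```python
from math import sqrt

# Used car data: age (years) and selling price (RM '000)
age = [7, 10, 8, 8, 9, 12, 5, 9]
price = [15, 13, 15, 14, 11, 9, 18, 13]

n = len(age)
sx, sy = sum(age), sum(price)
sxy = sum(x * y for x, y in zip(age, price))
sx2 = sum(x * x for x in age)
sy2 = sum(y * y for y in price)

b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)   # slope
a = sy / n - b * sx / n                         # y-intercept
r = (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

print(round(a, 3), round(b, 3))  # 23.983 -1.233
print(round(r, 3))               # -0.937
print(round(r ** 2, 3))          # 0.878
```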