SOCI1005 - Correlation and Regression

The document covers the concepts of correlation and regression, detailing how to assess the relationship between two quantitative variables using scatterplots, Pearson’s correlation coefficient, and the coefficient of determination. It explains the calculation of correlation coefficients and regression equations, emphasizing their importance in predicting changes in one variable based on another. Examples are provided to illustrate the application of these statistical methods in real-world scenarios.

Uploaded by

Ariel Salmon

SOCI1005

Correlation and Regression

Lecturer: Ms. Ayesha Facey


Date: Friday, February 18, 2022

Outline
• Correlation

• Scatterplot

• Pearson’s correlation coefficient

• Coefficient of determination

• Regression
Correlation

• Correlation is the association or relationship between two quantitative (interval or ratio) variables.

• If changes in the values of one variable are associated with changes in the values of another variable, the variables are said to be correlated.

• The correlation between two quantitative variables can be assessed both graphically and numerically.
Scatterplot
• A scatterplot (or scatter diagram) is a graphical display of the correlation between two quantitative variables.
Scatterplot
• It is used to examine:

  • whether there is a linear (straight-line) relationship between two interval/ratio variables;

  • the direction of the relationship (positive or negative), if any;

  • the strength of the relationship (are the points tightly clustered or widely dispersed?), if any;

  • whether there are any noticeable deviations (outliers) from the relationship.
Scatterplot

• A scatterplot is created as follows:

  • Place the independent variable (the variable that effects a change) on the x-axis and the dependent variable (the variable that responds to a change) on the y-axis.

  • Plot each pair of values (one for each variable) as a single point on the graph.

  • Determine whether the pattern of the points exhibits a linear relationship.

  • If a linear relationship is observed, draw a “line of best fit” through the points so that it follows their trend, with roughly equal numbers of points on either side of the line.
Examples of Scatterplots
Pearson’s Correlation Coefficient
• Sometimes it is not easy to discern the direction and strength of a relationship from a scatterplot, if one exists. A visual assessment of the strength of the relationship can be quite subjective.
• An objective numerical measure can be calculated instead: Pearson’s Product Moment Correlation Coefficient, also called Pearson’s correlation coefficient or Pearson’s r.
• The correlation between two variables is represented by a number between -1 and +1 inclusive, called the correlation coefficient. These numbers form the range of possible correlation values.
• The ‘1’ indicates that there is perfect correlation between the two variables. Perfect correlation means that all the points in the corresponding scatterplot lie on a straight line. Thus, every change in the independent variable is matched by an exactly proportional change in the dependent variable.
• An imperfect correlation means that while some of the points in the scatterplot may lie on the best-fit line, others lie above or below it. As the name suggests, the best-fit line is the line that best represents the trend in the scatterplot. Therefore, while the dependent variable changes with the independent variable, other factors also influence it.
Pearson’s Correlation Coefficient
• The ‘-’ indicates that there is a negative correlation between the two variables. A negative correlation means that as the value of the independent variable increases, the value of the dependent variable decreases, and vice versa. As such, the slope of the best-fit line is negative.
• The ‘+’ indicates that there is a positive correlation between the two variables. A positive correlation means that as the value of the independent variable increases, the value of the dependent variable increases, and vice versa. As a result, the slope of the best-fit line is positive.
• No correlation is represented by 0. This means that there is no linear relationship: a change in the value of the independent variable is not associated with any consistent increase or decrease in the value of the dependent variable.
• From the correlation coefficient we can determine both the direction and the strength of the relationship.
Pearson’s Correlation Coefficient
• Direction: if r is negative, then there is a negative linear relationship, and if r is positive, then there is a positive linear relationship.
• Strength: This is indicated by the magnitude of the coefficient, such that the larger the absolute value of the coefficient, the stronger the relationship. Use the scale below to determine strength.

  Absolute value of r    Strength
  ≤ 0.19                 very weak
  0.20 – 0.39            weak
  0.40 – 0.69            moderate
  0.70 – 0.89            strong
  ≥ 0.90                 very strong
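The direction and strength rules above can be sketched as a small Python helper. The function name `describe_correlation` is hypothetical, and because the printed scale leaves small gaps (for example between 0.19 and 0.20), the sketch assumes cut points at 0.20, 0.40, 0.70 and 0.90.

```python
def describe_correlation(r):
    """Return (direction, strength) for a Pearson's r using the lecture scale.

    Assumption: values falling in the gaps of the printed scale
    (e.g. 0.195) are assigned using cut points 0.20, 0.40, 0.70, 0.90.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")

    # Strength depends only on the absolute value of r.
    size = abs(r)
    if size < 0.20:
        strength = "very weak"
    elif size < 0.40:
        strength = "weak"
    elif size < 0.70:
        strength = "moderate"
    elif size < 0.90:
        strength = "strong"
    else:
        strength = "very strong"

    # Direction depends only on the sign of r.
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return direction, strength


print(describe_correlation(0.53))
print(describe_correlation(-0.95))
```

Splitting the sign from the magnitude mirrors the slide: the sign of r gives the direction, and only the absolute value is compared against the scale.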
Pearson’s Correlation Coefficient


$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
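As a rough illustration, the computational formula for Pearson’s r can be written as a short Python function. The name `pearson_r` and the four data points are hypothetical; the points were chosen to lie exactly on a straight line so that the result should be a perfect correlation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from paired data, using the sums that appear in the formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x ** 2 for x in xs)
    sum_y2 = sum(y ** 2 for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when one variable does not vary")
    return numerator / denominator


# Hypothetical data: y doubles x, so every point lies on one rising line.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
# Reversing y puts every point on one falling line.
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(rising, falling)
```

The check for a zero denominator is a design choice of this sketch: if either variable takes only one value, the formula divides by zero and no correlation can be reported.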
Coefficient of Determination
• An alternative means of assessing the extent of the relationship between two variables (that is, how closely related they are) is the coefficient of determination (COD).

• The COD is calculated as: COD = r² × 100

• r² always lies between 0 and 1, so the COD lies between 0% and 100%.

• The COD measures the percentage of the variability in a dependent variable explained (or accounted for) by changes in an independent variable.

• For example, if COD = 84% for the relationship between years of experience and salary, then 84% of the variability in salary can be explained by differences in experience.
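A minimal Python sketch of the COD calculation follows. The function name is hypothetical, and the value r = 0.9165 is an assumed correlation chosen only because it reproduces a COD of about 84%, as in the salary example; the slides do not state the underlying r.

```python
def coefficient_of_determination(r):
    """COD as a percentage: r squared times 100."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    return r ** 2 * 100


# Assumed value of r, picked so that the COD comes out near 84%.
cod_positive = coefficient_of_determination(0.9165)
# Squaring removes the sign, so a negative r of the same size gives the same COD.
cod_negative = coefficient_of_determination(-0.9165)
print(round(cod_positive), round(cod_negative))
```

Because r is squared, the COD reports how much variability is explained but not the direction of the relationship; the sign of r is still needed for that.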
Worked Example
The data below show the ages and glucose levels of six persons. Is there a relationship between the variables? Interpret the Pearson’s correlation coefficient. Calculate and interpret the coefficient of determination.

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)
1         43        99
2         21        65
3         25        79
4         42        75
5         57        87
6         59        81
Worked Example

Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY     X²     Y²
1         43        99
2         21        65
3         25        79
4         42        75
5         57        87
6         59        81
Worked Example

Step 2: Multiply x and y together to fill the xy column. For example, row 1 would be 43 × 99 = 4,257.

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY     X²     Y²
1         43        99                  4257
2         21        65                  1365
3         25        79                  1975
4         42        75                  3150
5         57        87                  4959
6         59        81                  4779
Worked Example
Step 3: Take the square of the numbers in the x column and put the results in the x² column.

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY     X²     Y²
1         43        99                  4257   1849
2         21        65                  1365   441
3         25        79                  1975   625
4         42        75                  3150   1764
5         57        87                  4959   3249
6         59        81                  4779   3481
Worked Example
Step 4: Take the square of the numbers in the y column and put the results in the y² column.

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY     X²     Y²
1         43        99                  4257   1849   9801
2         21        65                  1365   441    4225
3         25        79                  1975   625    6241
4         42        75                  3150   1764   5625
5         57        87                  4959   3249   7569
6         59        81                  4779   3481   6561
Worked Example
Step 5: Add up all of the numbers in each column and place the results at the bottom of the column.

SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)   XY      X²      Y²
1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561
Total     247       486                 20485   11409   40022
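Steps 1 to 5 can be reproduced in a few lines of Python as a check on the arithmetic. This is only a sketch; the variable names `ages`, `glucose`, `rows` and `totals` are illustrative choices, while the data values come from the table.

```python
# Data from the worked example: age (x) and glucose level (y) for six subjects.
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

# Steps 1 to 4: one row per subject holding x, y, xy, x squared and y squared.
rows = [(x, y, x * y, x ** 2, y ** 2) for x, y in zip(ages, glucose)]

# Step 5: add up each column to get the totals row.
totals = tuple(sum(column) for column in zip(*rows))

for subject, row in enumerate(rows, start=1):
    print(subject, *row)
print("Total", *totals)
```

The totals printed on the last line should match the bottom row of the chart.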
Worked Example
Step 6: Use the formula to calculate the correlation coefficient.

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

From the table:

Σx = 247
Σy = 486
Σxy = 20,485
Σx² = 11,409
Σy² = 40,022
n is the sample size, in our case n = 6
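Substituting the totals into the formula can be sketched in Python as follows. The variable names are illustrative; the numbers are the column totals from the table.

```python
from math import sqrt

# Totals taken from the bottom row of the chart.
n = 6
sum_x, sum_y = 247, 486
sum_xy, sum_x2, sum_y2 = 20485, 11409, 40022

# Numerator: 6(20485) - (247)(486) = 122910 - 120042 = 2868
numerator = n * sum_xy - sum_x * sum_y

# Bracketed terms: 6(11409) - 247^2 = 7445 and 6(40022) - 486^2 = 3936
x_term = n * sum_x2 - sum_x ** 2
y_term = n * sum_y2 - sum_y ** 2

r = numerator / sqrt(x_term * y_term)
print(round(r, 2))  # 0.53
```

On the strength scale given earlier, r ≈ 0.53 falls in the 0.40 – 0.69 band, so it would be read as a moderate positive linear relationship between age and glucose level.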
Worked Example

COD = r² × 100
    = (0.53)² × 100
    = 28.09%

Approximately 28 percent of the variation in glucose level is explained by age. Therefore, about 72 percent is unexplained.
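As a side check, the COD can also be computed without rounding r first, since r² equals the squared numerator divided by the product of the two bracketed terms. The sketch below assumes the same column totals as the worked example; the variable names are illustrative.

```python
# Totals from the worked example chart.
n = 6
sum_x, sum_y = 247, 486
sum_xy, sum_x2, sum_y2 = 20485, 11409, 40022

numerator = n * sum_xy - sum_x * sum_y
x_term = n * sum_x2 - sum_x ** 2
y_term = n * sum_y2 - sum_y ** 2

# r squared without taking a square root or rounding r.
r_squared = numerator ** 2 / (x_term * y_term)

cod_unrounded = round(r_squared * 100, 2)   # uses the full-precision r squared
cod_from_rounded_r = round(0.53 ** 2 * 100, 2)  # uses r rounded to 0.53, as on the slide
print(cod_unrounded, cod_from_rounded_r)  # 28.07 28.09
```

The two figures differ only in the second decimal place, so the interpretation of approximately 28 percent is unaffected by the rounding of r.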
Example 1

The weights and average daily food consumption were measured for 6 obese adolescent girls.
a) Calculate and interpret the Pearson’s correlation coefficient.
b) Calculate and interpret the coefficient of determination.

Weight (kg)   Food consumption (hundred calories per day)
84            32
93            33
81            33
61            24
95            39
86            32
Example 2
A researcher proposes that exercise makes a person more energized and that, as a result, they require less sleep. He carries out a survey on the number of hours of exercise per week and the average hours of sleep needed per night in order to feel well rested. The results are shown for the 8 participants.
a) What is the correlation between the two variables?
b) Calculate and interpret the coefficient of determination.

Average hours of exercise/week   Hours of sleep needed/night
4                                8.6
5.2                              8.1
2                                9
3.4                              8.5
8                                7.4
10                               6.8
1.5                              9.4
6                                7.7
Least Squares Regression I

• Recall that correlation tells us how strongly two variables are related.

• However, correlation tells us nothing about how much one variable will change when a related variable changes.

• Also, correlation cannot predict the specific value of one variable given the value of a related variable.

• Regression can do both: it estimates how much change to expect and makes specific predictions.

• For example, a regression can tell that “if I had one more year of experience, I could earn $2000 more”. Or it could predict that “a person with 10 years of experience should earn $30,000”.
Least Squares Regression
• Simple linear regression analysis is a basic form of modelling in which we derive a linear equation that adequately represents the relationship between two variables.
• For example, we may want to quantify or estimate the relationship between height and weight. After deriving this relationship from observed data, the model can then be used to predict other outcomes within the observation range.
• What was referred to as the ‘line of best fit’ earlier is also called a regression line. Like any line drawn on a graph, it has an intercept and a slope.
• The regression line is the estimate of the linear relationship between the independent variable (x) and the dependent variable (y). Its intercept ‘a’ and slope ‘b’ are called regression coefficients.
Least Squares Regression

This line can be represented as an equation as follows:

y = a + bx

• y is the dependent variable and x is the independent variable.
• a and b are the least squares (or regression) coefficients; ‘a’ is the intercept and ‘b’ is the slope.
Least Squares Regression

The least squares coefficients are computed as follows:

$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \qquad a = \bar{y} - b\bar{x} $$
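The two coefficient formulas can be sketched as a Python function. The name `least_squares` and the four data points are hypothetical; the points were chosen to lie exactly on the line y = 1 + 2x so the expected intercept and slope are easy to verify by eye.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + bx fitted by least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x ** 2 for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("slope is undefined when all x values are identical")

    # Slope first, because the intercept formula depends on it.
    b = (n * sum_xy - sum_x * sum_y) / denominator
    # Intercept: mean of y minus slope times mean of x.
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Hypothetical points lying exactly on y = 1 + 2x.
a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Computing b before a follows the order of the formulas: the intercept is defined in terms of the slope and the two means.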

Interpreting the Regression Equation: The Intercept

$$ a = \bar{y} - b\bar{x} $$

• The intercept is the point on the vertical axis where the regression line crosses the axis. It is the predicted value for the dependent variable when the independent variable has a value of zero.

• This may or may not be useful information depending on the context of the problem.
Interpreting the Regression Equation: The Slope

$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} $$

The slope is interpreted as the amount of change in the predicted value of the dependent variable associated with a one-unit change in the value of the independent variable.
Relationship between Correlation and
Regression
• If the slope has a negative sign, the direction of the relationship is negative or inverse, meaning that the scores on the two variables move in opposite directions.
• If the slope has a positive sign, the direction of the relationship is positive or direct, meaning that the scores on the two variables move in the same direction.
• If the correlation coefficient is zero, then the slope of the regression line must be zero as well. In a scatterplot, the data points would show no observable linear trend and the line of best fit would be horizontal.
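One way to see why the slope and the correlation coefficient agree in sign is that their formulas share the same numerator, while both denominators are positive whenever the variables vary. The sketch below illustrates this with the age and glucose totals from the earlier worked example; regressing glucose level on age here is an illustration only, not a calculation carried out in the slides.

```python
from math import sqrt

# Column totals from the age (x) and glucose level (y) worked example.
n = 6
sum_x, sum_y = 247, 486
sum_xy, sum_x2, sum_y2 = 20485, 11409, 40022

# Shared numerator of both formulas.
numerator = n * sum_xy - sum_x * sum_y

# Denominators: both are positive when x and y each take more than one value.
x_term = n * sum_x2 - sum_x ** 2
y_term = n * sum_y2 - sum_y ** 2

b = numerator / x_term                 # regression slope
r = numerator / sqrt(x_term * y_term)  # Pearson's r

print(round(b, 3), round(r, 3))
```

Since only the numerator can be negative or zero, b and r are positive together, negative together, or zero together.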
Worked Example
A researcher proposes that exercise makes a person more energized and that, as a result, they require less sleep. He carries out a survey on the number of hours of exercise per week and the average hours of sleep needed per night in order to feel well rested. The results are shown for the 8 participants. Determine the regression equation and interpret the coefficients.

Average hours of exercise/week   Hours of sleep needed/night
4                                8.6
5.2                              8.1
2                                9
3.4                              8.5
8                                7.4
10                               6.8
1.5                              9.4
6                                7.7
Worked Example
Step 1: Find ‘b’

$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} $$

Average hours of exercise/week (x)   Hours of sleep needed/night (y)   xy       x²
4                                    8.6                               34.4     16
5.2                                  8.1                               42.12    27.04
2                                    9                                 18       4
3.4                                  8.5                               28.9     11.56
8                                    7.4                               59.2     64
10                                   6.8                               68       100
1.5                                  9.4                               14.1     2.25
6                                    7.7                               46.2     36
Total: 40.1                          65.5                              310.92   260.85

Step 2: Find ‘a’

$$ a = \bar{y} - b\bar{x} $$
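Steps 1 and 2 can be checked by substituting the column totals into the two formulas. The following Python sketch uses illustrative variable names; the totals are those in the bottom row of the table.

```python
# Column totals for the 8 participants.
n = 8
sum_x, sum_y = 40.1, 65.5
sum_xy, sum_x2 = 310.92, 260.85

# Step 1: slope.
# Numerator is about 8(310.92) - (40.1)(65.5) = -139.19
# Denominator is about 8(260.85) - 40.1^2 = 478.79
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Step 2: intercept, using the means of y (65.5 / 8) and x (40.1 / 8).
a = sum_y / n - b * (sum_x / n)

print(round(b, 2), round(a, 2))  # -0.29 9.64
```

The rounded values are the coefficients that appear in the regression equation for this example.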
Worked Example

Step 3: Write out the equation by substituting the values of ‘a’ and ‘b’

y = 9.64 + (-0.29)x OR
y = 9.64 - 0.29x

Interpretation:

a = 9.64: When a person doesn’t exercise, the predicted average hours of sleep needed per night is 9.64 hours.

b = -0.29: For every additional hour of exercise per week, the predicted average hours of sleep needed per night will decrease by 0.29 hours.
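The fitted equation can be used for prediction, as in the sketch below. The function name `predict_sleep` and the input of 5 hours are hypothetical. Raising an error outside the observed exercise range (1.5 to 10 hours per week in the data) is a design choice of this sketch, reflecting the earlier point that the model should be used within the observation range.

```python
def predict_sleep(hours_exercise, a=9.64, b=-0.29, observed_range=(1.5, 10)):
    """Predicted hours of sleep needed per night from y = a + bx.

    Assumption: predictions are only allowed inside the range of
    exercise hours that was actually observed in the sample.
    """
    low, high = observed_range
    if not low <= hours_exercise <= high:
        raise ValueError("x is outside the observed range; prediction would be an extrapolation")
    return a + b * hours_exercise


# Hypothetical person who exercises 5 hours per week: 9.64 - 0.29(5) = 8.19
predicted = predict_sleep(5)
print(round(predicted, 2))
```

Note that x = 0 lies below the smallest observed value of 1.5 hours, so the intercept of 9.64 is itself an extrapolation and should be interpreted with that in mind.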
Worked Example


Example
A random sample of six drivers insured with a company and having similar auto insurance policies was selected. The following table lists their driving experiences (in years) and monthly auto insurance premiums.

a) Determine the regression equation and interpret the coefficients.
b) Predict the monthly auto insurance premium of a person who has 20 years of driving experience.

Driving experience (years)   Monthly Auto Insurance ($)
5                            64
2                            87
9                            71
15                           44
25                           42
16                           60
Example - Application

• A car rental company charges $40 a day and 20 cents per mile for renting a car. Let y be the total rental charges (in dollars) for one day and x be the miles driven. The equation for the relationship between x and y is
  y = 40 + 0.20x
  • How much will a person pay who rents a car for one day and drives 100 miles?
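The rental equation has the same form as a regression line, with intercept a = 40 (the fixed daily fee) and slope b = 0.20 (the charge per mile). A minimal Python sketch, using the hypothetical function name `rental_charge`:

```python
def rental_charge(miles, daily_fee=40.0, per_mile=0.20):
    """Total one-day rental charge in dollars: y = 40 + 0.20x."""
    if miles < 0:
        raise ValueError("miles driven cannot be negative")
    return daily_fee + per_mile * miles


# One day and 100 miles: 40 + 0.20(100) = 60
charge = rental_charge(100)
print(charge)
```

Here the intercept has a direct meaning: a person who rents the car for a day and drives 0 miles still pays the $40 daily fee.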
Example
