
Correlation and Simple Linear Regression
Lecture: Analysis of Quantitative Data I
Mayuko Onuki
Correlation

Correlation
• Correlation is a measure of linear association between two variables. It indicates:
1. The direction of the correlation
2. The strength of the relationship
• The Pearson correlation coefficient (r) is the statistic that indicates the
association between two variables measured on interval-ratio scales
• Formula:

r_xy = (1 / (N − 1)) Σᵢ₌₁ᴺ (Z-score of xᵢ) × (Z-score of yᵢ)
• r: Pearson product-moment correlation coefficient
• N = sample size
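
To make the formula concrete, here is a minimal Python sketch (the function and data names are ours, not from the lecture):

    import numpy as np

    def pearson_r(x, y):
        # Average product of z-scores, dividing by N - 1 as in the formula
        x, y = np.asarray(x, float), np.asarray(y, float)
        zx = (x - x.mean()) / x.std(ddof=1)   # z-scores using the sample SD
        zy = (y - y.mean()) / y.std(ddof=1)
        return np.sum(zx * zy) / (len(x) - 1)

    # Hypothetical data; the result matches np.corrcoef(x, y)[0, 1]
    print(pearson_r([3, 4, 7, 10], [85, 84, 94, 98]))  # ≈ 0.97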

Direction of the Correlation
• Positive correlation: indicates that the values on the two variables being analyzed move in the same direction. That is, as scores on one variable go up, scores on the other variable go up as well, and vice versa (on average).
• Negative correlation: indicates that the values on the two variables being analyzed move in opposite directions. That is, as scores on one variable go up, scores on the other variable go down, and vice versa (on average).

Strength of the Correlation
Coefficient values range between -1 and 1
General rule of thumb for interpreting coefficients (Weisburd and
Britt, 2007)
• 0 = No correlation
• ± .01 ~ .30 = Weak
• ± .31 ~ .69 = Moderate
• ± .70 ~ .99 = Strong
• ± 1 = Perfect correlation

What correlation is not
• Correlation ≠ Causation

[Figure: scatterplots with varying Pearson correlation coefficients]
Source: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#/media/File:Correlation_examples2.svg
Caution with correlation coefficients
• Association is assumed to be linear
• A weak or strong correlation may be due to outliers
• E.g., the datasets in the tables below each contain an outlier (Student 1), which strongly influences (biases) the correlation coefficient
• A weak correlation may be due to a "truncated" range
• E.g., the data in the tables below show only the students who did well on the test. What about those who did not do well?

E.g., an outlier that makes the correlation weaker (r = .96 → .81):

              Hours Spent Studying (X)   Exam Score (Y)
    Student 1            0                     12
    Student 2            3                     85
    Student 3            4                     84
    Student 4            7                     94
    Student 5           10                     98

E.g., an outlier that makes the correlation stronger (r = .37 → .72):

              Hours Spent Studying (X)   Exam Score (Y)
    Student 1            0                     12
    Student 2           32                     95
    Student 3            4                    100
    Student 4            7                     95
    Student 5           10                    100
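
A quick check of the left-hand table in Python (values match the slide up to rounding):

    import numpy as np

    hours  = [0, 3, 4, 7, 10]     # Student 1 (0 hours, score 12) is the outlier
    scores = [12, 85, 84, 94, 98]

    r_without = np.corrcoef(hours[1:], scores[1:])[0, 1]  # ≈ .96
    r_with    = np.corrcoef(hours, scores)[0, 1]          # ≈ .81
    print(round(r_without, 2), round(r_with, 2))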
Significance of correlation coefficients
Statistical hypotheses for testing the significance of r
• H0: r = 0
• H1: r ≠ 0

Formula for the t-test:

t = r √((n − 2) / (1 − r²))

• Degree of freedom: n - 2
• n: sample size
• r: Pearson correlation coefficient
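
A sketch of this test in Python (r and n are hypothetical; the p-value uses scipy's t distribution):

    import math
    from scipy import stats

    r, n = 0.55, 30                           # hypothetical values
    t = r * math.sqrt((n - 2) / (1 - r**2))   # test statistic
    p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-tailed p-value
    print(t, p)                               # t ≈ 3.49, p ≈ .002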

The coefficient of determination
• The coefficient of determination, r², indicates how much of the variation in the scores of one variable can be explained by the other variable
• E.g., r² = .30 indicates that 30% of the variance in y is explained by x (and vice versa)

[Venn diagrams: r = .00 → r² = .00; r = .55 → r² = .30; r = .30 → r² = .09]
Other types of correlation coefficients
There are specialized versions of the Pearson r for variables measured on
different scales
• Phi coefficient (𝜙)
• Two dichotomous variables
• Formula is the same as the Pearson r
• Point Biserial (rpb)
• One of the variables is a dichotomous variable
• Formula is the same as the Pearson r
• Spearman Rho (𝜌)
• Two ordinal variables (non-normal variables or small sample size)
• Kendall's tau-b and Goodman and Kruskal's gamma are more robust
• Formula: ρ = 1 − (6 Σdᵢ²) / (n(n² − 1))
• di : difference between the two ranks of each observation
• n: sample size
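
A sketch of the rank-difference formula (our example data; the shortcut formula is exact only when there are no tied ranks):

    import numpy as np
    from scipy import stats

    x = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
    y = [0, 20, 28, 27, 50, 29, 7, 17, 6, 12]

    d = stats.rankdata(x) - stats.rankdata(y)    # rank differences
    n = len(x)
    rho = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))
    print(rho, stats.spearmanr(x, y)[0])         # the two values agree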
Simple Linear Regression

Correlation vs. Regression
• Correlation measures a linear association between two
variables
• Regression is a technique to generate predictions, using a
model
• Used in hypothesis testing
• Causality is assumed
• The independent variable is called the "predictor"

Simple linear regression
• Simple linear regression looks for a linear relationship
between two variables, x and y
• Model:

𝑌 = 𝛼 + 𝛽𝑋 + 𝜖

• α: intercept (mean of y when x = 0)
• β: slope (change in y when x increases by 1 unit)
• ϵ: residual (error in a prediction)


Ŷ = α̂ + β̂X
• ^ indicates "predicted"
• ϵᵢ: residual for the ith data point
Error in predictions (Residual)
• The regression equation does not calculate the actual value
of Y. It can only make predictions about the value of Y. So
error (ϵ) is bound to occur, and these errors in prediction
are called residuals
• Error is the difference between the actual, or observed, value of Y
and the predicted value of Y

Y = α + βX + ϵ

Ŷ = α̂ + β̂X

To calculate the error:
ϵ = Y − Ŷ = Y − α̂ − β̂X
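
In code, a residual is simply the observed value minus the predicted value (the coefficients below are hypothetical):

    alpha_hat, beta_hat = 2.0, 0.5    # hypothetical fitted intercept and slope
    X, Y = 10.0, 8.0                  # one observed data point
    Y_hat = alpha_hat + beta_hat * X  # predicted value: 7.0
    epsilon = Y - Y_hat               # residual: 8.0 - 7.0 = 1.0
    print(epsilon)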
How do we find coefficients?
• The method used to find the line is referred to as "ordinary least squares" (OLS)
• Using the model Yᵢ = α + βXᵢ + ϵᵢ for each data point i, OLS finds the estimates of α and β that minimize the sum of squared residuals (SSR):

SSR = Σᵢ₌₁ᴺ (Yᵢ − Ŷᵢ)²

• ^ indicates “predicted”
• The mean of residuals across all the data points is zero.
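
A minimal OLS sketch using the closed-form estimates for simple regression (the data are hypothetical); it also verifies that the residuals average to zero:

    import numpy as np

    x = np.array([0., 3., 4., 7., 10.])
    y = np.array([12., 85., 84., 94., 98.])

    # Closed-form OLS estimates that minimize the sum of squared residuals
    beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    alpha_hat = y.mean() - beta_hat * x.mean()

    residuals = y - (alpha_hat + beta_hat * x)
    print(alpha_hat, beta_hat)
    print(np.isclose(residuals.mean(), 0.0))   # True: residuals average to zero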

Model fit
• The coefficient of determination, R², indicates how much variation in Y is explained by the model

• Indicates how well the model “fits” the data


• Values range between 0 and 1
• If R² = .34 → "34% of the variance in y is explained by the model"

[Venn diagram: overlap between x and y representing R² = .34]
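
One way to compute R² from its definition, as a sketch (in simple regression this equals the squared Pearson r):

    import numpy as np

    def r_squared(y, y_hat):
        # R^2 = 1 - SS_residual / SS_total
        y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1 - ss_res / ss_tot

    # e.g., r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]) ≈ 0.98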
Significance of model fit
Statistical hypotheses for testing the significance of R²
• H0: R² = 0
• H1: R² ≠ 0

F test
F = (variance explained by the regression model) / (variance due to error)
  = (R² / df_regression) / ((1 − R²) / df_residual)

• df_regression = k − 1
• df_residual = N − k
• k: the number of parameters being estimated (intercept and slope)
• N: sample size
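
A sketch of the F test from an R² value (the numbers are hypothetical; scipy's F distribution supplies the p-value):

    from scipy import stats

    R2, N, k = 0.34, 50, 2              # hypothetical R², sample size, parameters
    df_reg, df_res = k - 1, N - k
    F = (R2 / df_reg) / ((1 - R2) / df_res)
    p = stats.f.sf(F, df_reg, df_res)   # upper-tail p-value
    print(F, p)                         # F ≈ 24.7, p < .001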
Regression coefficient
• A regression (slope) coefficient indicates the estimated average change in the outcome when the predictor variable increases by one unit
• Intercept coefficient indicates predicted outcome when X
takes the value of zero
• Formula
β̂ = r (s_y / s_x)
• r: Pearson correlation coefficient between x and y
• s: standard deviation

α̂ = Ȳ − β̂X̄ (Ȳ and X̄ are the sample means of Y and X)
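
The same estimates from summary statistics, as a sketch (all values are hypothetical):

    r = 0.55                    # hypothetical Pearson correlation
    s_x, s_y = 2.0, 8.0         # hypothetical standard deviations of x and y
    x_bar, y_bar = 10.0, 50.0   # hypothetical means

    beta_hat = r * s_y / s_x              # slope: 0.55 * 8 / 2 = 2.2
    alpha_hat = y_bar - beta_hat * x_bar  # intercept: 50 - 2.2 * 10 = 28.0
    print(beta_hat, alpha_hat)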
Significance of regression coefficients
Statistical hypotheses for testing the significance of 𝛽
• H0: 𝛽 = 0
• H1: 𝛽 ≠ 0

t test:

t = β̂ / SE(β̂)

• df = N − k
• N: sample size
• k: number of parameters
Wrapping Words Around the Regression Coefficient
◼ Example: how well can the amount of education people have predict their monthly income?

Ŷ = −3.77 + .61X
• For every unit of increase in X, there is a
corresponding predicted increase of 0.61
units in Y
OR
• For every additional year of education, we
would predict an increase of 0.61 ($1,000),
or $610, in monthly income
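
For instance, plugging a hypothetical 16 years of education into the equation gives Ŷ = −3.77 + .61(16) = 5.99, i.e., a predicted monthly income of about $5,990.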

Source: Urdan, Timothy C., Statistics in Plain English. Taylor and Francis. Shared slides.
Example: Predicting the 2000 US election results in Florida using the 1996 US election results

Model: 𝑌 = 𝛼 + 𝛽𝑋 + 𝜖

Example: Predicting the 2000 US election results in Florida using the 1996 US election results

Ŷ = α̂ + β̂X
  = 45.84 + 0.02X

β̂ × 10,000 = 200

Buchanan’s votes in 2000 were predicted to increase by 200 when Perot’s votes in 1996
increased by 10,000.
Assumptions of OLS regression
1. Linearity: the relationship between x and y is linear

2. Normality of residuals: residuals are normally distributed

3. Homoscedasticity: constant variance of residuals
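
These assumptions are typically examined with residual plots, illustrated in the sketch below (our choice of diagnostics, not from the slides; simulated data):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)
    y = 3 + 2 * x + rng.normal(0, 1, 100)   # simulated data meeting the assumptions

    beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    alpha = y.mean() - beta * x.mean()
    resid = y - (alpha + beta * x)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.scatter(alpha + beta * x, resid)    # checks 1 and 3: curvature or fanning
    ax1.axhline(0, color="gray")
    ax1.set(xlabel="Fitted values", ylabel="Residuals")
    stats.probplot(resid, plot=ax2)         # check 2: normal Q-Q plot of residuals
    plt.show()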

Assumption 1. Linearity

Assumption 2. Normality

Assumption 3. Homoscedasticity

