
Correlation and Linear Regression –

Part 2
Reading
• Chapter 6.2.1, 6.2.2, 6.2.3, and 6.2.4 in the textbook
Bivariate analysis
• We know how to read and interpret the scatterplots. But we want to use
mathematical/statistical language to describe the relationship between
two data sets quantitatively.
• One of the most commonly used statistical tools is correlation.
• Bivariate analysis aims to understand the relationship between two
variables x and y.
• When the two variables are measured on the same object, x is usually
identified as the independent variable or predictor, and y as the
dependent variable or predictand.
• The methods of bivariate statistics aim to describe the strength of the
relationship between variables, such as Pearson’s correlation coefficient
for linear relationships or an equation obtained by regression.
Linear regression
• Simple linear regression seeks to summarize the relationship
between two variables, shown graphically in their scatterplot, by a
single straight line.
• The regression procedure chooses that line producing the least
error for predictions of y given observations of x.

Linear function: $y = bx + a$, where b is called the slope and a is called the intercept.
The value of y depends on the value of x: $\Delta y = b \cdot \Delta x$.
Linear function: $y = bx + a + \text{error}$
The value of y depends on the value of x, but this relationship is masked by random errors.

In ordinary linear regression, the dependent variable Y has some random error (X is considered to be error-free). There is a conditional dependence between the random variables Y and X: the probability that Y is found in a certain interval range is conditionally dependent on the value of X. If we know the value of X, we can make a statistical estimate of the value of Y given X.

We usually use $\hat{Y}$ to represent the estimated or predicted values of Y: $\hat{y} = bx + a$. This is what the linear regression line gives us when we fit the line to the data sample.
Linear function: $y = bx + a$
The value of y depends on the value of x.
Example: $b = \Delta y / \Delta x = 2$ and $a = -1$.

a is the intercept: the point at which the regression line intersects the y-axis (that is, the value of y for x = 0).
b is the slope: if you go one unit step in the x direction, b tells you how many steps you have to go in the y direction.
Linear function: $y = bx + a$
The value of y depends on the value of x. If b = 0, the value of y does not depend on x: $y = 0 \cdot x + a = a$.
A slope b > 0 corresponds to r(x,y) > 0, b = 0 corresponds to r(x,y) = 0, and b < 0 corresponds to r(x,y) < 0.

Linear relationships between the independent variable (x) and the dependent variable (y) are not perfect. We assume errors are affecting the dependent variable.
What is the best linear fit to the data?
What we need is a quantitative measure of how well the line fits the data!

$e_i = y_i - \hat{y}(x_i)$

Figure 6.1 from the textbook: Schematic illustration of simple linear regression. The regression line, $\hat{y} = bx + a$, is chosen as the one minimizing some measure of the vertical differences (the residuals) between the points and the line. In least-squares regression that measure is the sum of the squared vertical distances. The error or residual, $e$, is the difference between the data point (the observed y value) and the regression line (the y value estimated/predicted by the regression).
How to estimate the best fitting line?
Example: Imagine you want to 'predict' the expected outcome on a final exam given the statistical linear relationship between first exam scores (a test taken earlier in the semester) and the final exam scores.

Data available:
X-axis: First exam score (known before the final exam; our variable of interest before the final exam is taken)
Y-axis: Final exam score

You aim to build a linear regression model to predict your final exam score based on your first exam score. You use the scores from courses you have taken to build such a model.
How to estimate the best fitting line?

How do we estimate the best fitting line for final exam scores based on first exam scores?

[Scatterplot: X-axis is the first exam score, Y-axis is the final exam score. Blue dots mark observed scores in the course from past semesters. A red dot marks a point in the X-Y plane: given a first exam score of X = 65, we 'predict' that the student will have a final exam score around 145.]
How to estimate the best fitting line?

Mathematically we formulate this as a minimization problem: minimize the distance of the data points from the linear regression line. Distance is measured here only in the y-direction (because we assume error only in y). Note further that the squared distance is used.
How to estimate the best fitting line?

The deviations from the deterministic model line are interpreted as random errors (following a Gaussian distribution).

Sum of Squared Errors (SSE): $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \hat{y}(x_i) \right)^2$
How to estimate the best fitting line?

Sum of Squared Errors (SSE)

What we need to do is: find the slope and intercept values that minimize the SSE!

This is least-squares regression.


How to estimate the best fitting line?
Minimize
Sum of Squared Errors (SSE) !

A line is defined in the x-y plane when we


know the slope and one point on the line.
In linear regression we want to find the point
where the line intersects the y-axis (intercept)
How to estimate the best fitting line?
• Finding analytic expressions for the least-squares slope, b, and the intercept, a.
• In order to minimize the sum of squared residuals, $SSE = \sum_{i=1}^{n} (y_i - a - b x_i)^2$, we first recap how to find the minimum of a quadratic function.
Recap: finding the maximum and minimum of a quadratic function
• A quadratic function: $f(x) = ax^2 + bx + c$, $a \neq 0$.
• If a > 0, the function has a minimum.
• If a < 0, the function has a maximum.
• The maximum or minimum of the function (the vertex point) can be obtained by finding the root of the derivative of the function.
• The derivative of $f(x)$: $f'(x) = 2ax + b$.
• Setting $f'(x) = 0$ gives $x = -\frac{b}{2a}$.
• When $x = -\frac{b}{2a}$ (and a > 0), we find the minimum of $f(x)$.
How to estimate the best fitting line?
• Finding analytic expressions for the least-squares slope, b, and the intercept, a.
• In order to minimize the sum of squared residuals, $SSE = \sum_{i=1}^{n} (y_i - a - b x_i)^2$,
• set the derivatives of the above equation with respect to the parameters a and b to zero and solve:

$\frac{\partial (SSE)}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0$ and $\frac{\partial (SSE)}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0$
How to estimate the best fitting line?

• Now we have two equations and two unknowns, a and b.
• We can solve for the two unknowns, a and b.
• Derive the final expressions for a and b.
How to estimate the best fitting line?

$\sum_{i=1}^{n} y_i = an + b \sum_{i=1}^{n} x_i$

$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2$

These two equations will be useful in the later analysis of variance.
How to estimate the best fitting line?

• Now we have two equations and two unknowns, a and b.
• We can solve for the two unknowns, a and b:

$a = \bar{y} - b\bar{x}$

$b = \frac{n\bar{x}\,\bar{y} - \sum_{i=1}^{n} x_i y_i}{n\bar{x}^2 - \sum_{i=1}^{n} x_i^2}$
Summary: How to estimate the best fitting line?
• Find the slope b and the intercept a such that the sum of squared residuals is minimized.

• $a = \bar{y} - b\bar{x}$

• $b = \frac{n\bar{x}\,\bar{y} - \sum_{i=1}^{n} x_i y_i}{n\bar{x}^2 - \sum_{i=1}^{n} x_i^2}$

• Other expressions of the slope b (equivalent forms used later in these slides):
$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} = r \, \frac{s_y}{s_x}$


Class Activity:

Fitting the regression line (by hand) and comparing the results.

X (first exam score): 65 67 71 71 66 75 67 70 71 69 69
Y (final exam score): 175 133 185 163 126 198 153 163 159 151 159
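To compare with the by-hand results, here is a minimal Python sketch (assuming NumPy is available) that applies the least-squares formulas above to the class-activity data:

```python
import numpy as np

# Class-activity data: first exam scores (x) and final exam scores (y)
x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69], dtype=float)
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159], dtype=float)

x_bar, y_bar = x.mean(), y.mean()

# Slope: b = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: a = y_bar - b * x_bar (the line passes through the center point)
a = y_bar - b * x_bar

print(f"slope b = {b:.3f}, intercept a = {a:.3f}")

# Cross-check with NumPy's built-in degree-1 polynomial fit
b_np, a_np = np.polyfit(x, y, deg=1)
print(f"np.polyfit: slope = {b_np:.3f}, intercept = {a_np:.3f}")
```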
How to estimate the best fitting line?
Minimize the Sum of Squared Errors (SSE)!

Task 1: Find the best value for the slope:

$\hat{b} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

$\bar{x}$ and $\bar{y}$ represent the sample means of variables X and Y, respectively.
How to estimate the best fitting line?
Minimize the Sum of Squared Errors (SSE)!

Task 1: Find the best value for the slope.

Note on the notation with the "^": many textbooks distinguish the estimated values from the actual true (but unknown) parameter values using a different symbol, or they use Greek letters for the true values and Latin letters for the estimates. The hat (^) on this slide indicates that. In the next slides we omit this nuance.
How to estimate the best fitting line?
Minimize the Sum of Squared Errors (SSE)!

Task 2: Find the intercept point: $a = \bar{y} - b\bar{x}$
Interpretation of the slope and intercept in terms of:

• statistical mean values,


• variances,
• covariance and correlation

In ordinary linear regression, the fitted line goes through the center point $(\bar{x}, \bar{y})$ of the paired data samples, where $\bar{x}$ and $\bar{y}$ are the sample means of x and y; the intercept a follows from this center point.
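As a quick check (not from the slides, but following directly from the formulas above), substituting the intercept $a = \bar{y} - b\bar{x}$ into the fitted line at $x = \bar{x}$ shows why the line must pass through the center point:

$\hat{y}(\bar{x}) = b\bar{x} + a = b\bar{x} + (\bar{y} - b\bar{x}) = \bar{y}$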
How to estimate the best fitting line?

Minimize the Sum of Squared Errors (SSE).

Analytical solution for min(SSE):

Intercept: $a = \bar{y} - b\bar{x}$
Slope: $b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
Interpretation of the slope and intercept in terms of:
• statistical mean values,
• variances,
• covariance and correlation

Slope:

$b = \frac{\text{covariance between } x \text{ and } y}{\text{variance of } x} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
Interpretation of the slope and intercept in terms of:
• statistical mean values,
• variances,
• covariance and correlation

Slope:

$b = \frac{\text{covariance between } x \text{ and } y}{\text{variance of } x}$

$r = \frac{\text{covariance between } x \text{ and } y}{\text{standard deviation of } x \,\times\, \text{standard deviation of } y}$

$b = \text{correlation between } x \text{ and } y \,\times\, \frac{\text{standard deviation of } y}{\text{standard deviation of } x}$
Relationship between the slope b and the Pearson correlation coefficient

Slope of the regression line = correlation coefficient × standard deviation of y / standard deviation of x, i.e., $b = r\,\frac{s_y}{s_x}$.
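A short derivation of this relationship (using the covariance/variance forms from the previous slides, with $s_{xy}$ denoting the sample covariance between x and y):

$b = \frac{s_{xy}}{s_x^2} = \frac{s_{xy}}{s_x s_y} \cdot \frac{s_y}{s_x} = r\,\frac{s_y}{s_x}$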
How to estimate the best fitting line?

Minimize the Sum of Squared Errors (SSE).

The intercept value, $a = \bar{y} - b\bar{x}$, guarantees that the regression line goes through the center point (mean of x, mean of y).

Slope of the regression line: correlation coefficient × standard deviation of y / standard deviation of x, i.e., $b = r\,\frac{s_y}{s_x}$.
How to estimate the best fitting line?

Linear relationship with errors: $y = bx + a + \varepsilon$
The value of y depends on the value of x plus a random error.

Estimated regression line: $\hat{y} = bx + a$, so that

$y = \hat{y} + e = bx + a + e$
Understanding and evaluating regression models
• Two assumptions for the quantities $e_i$:
• $e_i$ are independent random variables with zero mean and constant variance.
• The residuals follow a Gaussian distribution.
• The sample mean of the residuals (dividing the following equation by n) is zero:

$\sum_{i=1}^{n} e_i = 0$

• Given $e_i = y_i - \hat{y}_i$, we have $y_i = \hat{y}_i + e_i$.
• Prove that the mean of $\hat{y}_i$ is $\bar{y}$.
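A one-line sketch of the argument (not from the slides), using $\sum_{i=1}^{n} e_i = 0$ from above:

$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i + e_i) = \frac{1}{n}\sum_{i=1}^{n} \hat{y}_i + \frac{1}{n}\sum_{i=1}^{n} e_i = \frac{1}{n}\sum_{i=1}^{n} \hat{y}_i$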
Understanding and evaluating regression models
• Now we take a closer look at the sum of squared errors (SSE):

$SSE = \sum_{i=1}^{n} e_i^2$

• $s_e^2 = \frac{1}{n} SSE = \frac{1}{n}\sum_{i=1}^{n} e_i^2 = \frac{1}{n}\sum_{i=1}^{n} (e_i - \bar{e})^2$, given $\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = 0$.

This is the variance of the residuals/errors.
• If SSE is divided by n − 2 instead of n, it is the residual variance estimated from the sample of residuals (two degrees of freedom are used by the fitted parameters a and b).
Understanding and evaluating regression models
• Variance of the residuals: $s_e^2 = \frac{1}{n}\sum_{i=1}^{n} e_i^2$
• Variance of the observed y values: $s_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$
• Variance of the estimated/predicted y values: $s_{\hat{y}}^2 = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
• Prove $s_y^2 = s_{\hat{y}}^2 + s_e^2$ (Assignment 5)
Understanding and evaluating regression models
• Variance of the residuals: $s_e^2 = \frac{1}{n}\sum_{i=1}^{n} e_i^2$
• Variance of the observed y values: $s_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$
• Variance of the estimated/predicted y values: $s_{\hat{y}}^2 = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
• Prove $s_y^2 = s_{\hat{y}}^2 + s_e^2$ (Assignment 5)
• Hint:
• Start from $s_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i + e_i - \bar{y})^2$
• Replace $e_i$ with $y_i - \hat{y}_i$ in the above equation:
$\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i + e_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i + y_i - \hat{y}_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^{n} \left[ (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) \right]^2$
• So $\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^{n} \left[ (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) \right]^2$
• Expand the right-hand side of the equation and use the following equations:
$\sum_{i=1}^{n} y_i = an + b\sum_{i=1}^{n} x_i$
$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2$
If you have a different approach, that would be great; feel free to use it!
Understanding and evaluating regression models
• We know $s_y^2 = s_{\hat{y}}^2 + s_e^2$; this is an important relationship!
• $\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \frac{1}{n}\sum_{i=1}^{n} e_i^2$
• Removing 1/n, we get

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} e_i^2$

• $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares (SST).
• $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ is the regression sum of squares (SSR).
• $\sum_{i=1}^{n} e_i^2$ is the sum of squared errors (SSE).
• We get the relationship SST = SSR + SSE.
• This relationship describes how the variation in the predictand, y, can be decomposed into the variation represented by the regression and the variation of the residuals.
Understanding and evaluating regression models
• We take a close look at the relationship SST = SSR + SSE.
• If the variation of the predictand y can be completely explained by SSR, which means SSE = 0, then this is a perfect regression model, and it means all the data points in the scatterplot fall on the regression line.
• If there is absolutely no linear relationship between x and y, the regression slope will be zero, SSR = 0, and SSE = SST.
• So, we can use these three quantities to measure the fit of a regression, or the correspondence between the regression line and a scatterplot of the data.
Variance analysis
• So, we can use quantities related to these three quantities to measure the fit of a regression, or the correspondence between the regression line and a scatterplot of the data.
• These quantities are summarized in an ANOVA table.

Table 6.1 from the textbook


Measures of the fit of a regression model
• So, we can use quantities related to these three quantities to measure the fit of a regression, or the correspondence between the regression line and a scatterplot of the data.
• The first measure is SSE, or the MSE (mean squared error), MSE = SSE/n.
• The smaller the SSE or MSE, the better the fit of the regression model and the better the predictions.
• The second measure is the coefficient of determination, $R^2$:
$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
• $R^2$ can be interpreted as the proportion of the variation of the predictand (proportional to SST) that is described or accounted for by the regression (SSR). It is also known as the explained variance.
• $R^2$ varies from 0 to 1.
• The larger the $R^2$, the better the fit of the regression model and the better the predictions.
• Prove that the square of the correlation coefficient r between x and y is the same as $R^2$ in a simple univariate linear regression (Assignment 5).
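As a quick numerical illustration (a sketch continuing the exam-score example used earlier, with NumPy assumed), the decomposition SST = SSR + SSE and the agreement between $R^2$ and $r^2$ can be checked as follows:

```python
import numpy as np

# Exam-score data from the class activity
x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69], dtype=float)
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159], dtype=float)

# Least-squares fit and predicted values
b, a = np.polyfit(x, y, deg=1)
y_hat = b * x + a

# Sums of squares: SST = SSR + SSE
sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # sum of squared errors

r2 = ssr / sst                         # coefficient of determination
r = np.corrcoef(x, y)[0, 1]            # Pearson correlation coefficient

print(f"SST = {sst:.2f}, SSR + SSE = {ssr + sse:.2f}")
print(f"R^2 = {r2:.4f}, r^2 = {r ** 2:.4f}")  # the two should agree
```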
Multiple linear regression
• Chapter 6.2.8 in the textbook
• Multiple linear regression is the more general and more common situation of
linear regression. As in the case of simple linear regression, there is still a
single predictand, y, but in distinction, there is more than one predictor (x)
variable.
• Let K denote the number of predictor variables; the multiple linear regression will look like:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_K x_K$
• Each of the K predictor variables has its own coefficient, analogous to the slope, b, in a simple univariate linear regression. $b_0$ is the regression constant, analogous to the intercept, a, in a simple univariate linear regression.
• These K + 1 regression coefficients ($b_0, b_1, b_2, b_3, \ldots, b_K$) often are called the regression parameters.
• If K = 1, then $\hat{y} = b_0 + b_1 x_1$, which is the simple linear regression.
Multiple linear regression
• $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_K x_K$
• To find the regression parameters, the same approach is used as in simple linear regression.
• Minimize the sum of squared errors (SSE):

$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
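A minimal sketch of such a least-squares fit with NumPy (the two-predictor data below are made up purely for illustration):

```python
import numpy as np

# Hypothetical data: n samples, K = 2 predictor variables
rng = np.random.default_rng(0)
n, K = 50, 2
X = rng.normal(size=(n, K))                              # predictors x1, x2
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Design matrix: a leading column of ones carries the regression constant b0
A = np.column_stack([np.ones(n), X])

# Least-squares solution minimizes SSE = sum((y - A @ params) ** 2)
params, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = params
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```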
Curvilinear regression
• When we have nonlinear relations, we often assume an intrinsically linear model and then fit the data to the model using polynomial regression.
• That is, we employ models that use regression to fit curves instead of straight lines. The technique is known as curvilinear regression analysis.
• For example, we test several polynomial regression equations. Polynomial equations are formed by raising our independent variable to successive powers.
Curvilinear regression
• In general, a polynomial equation is referred to by its degree, which is the largest exponent.
• For example, the linear equation is a polynomial equation of the first degree, the quadratic is of the second degree, and the cubic is of the third degree.
• The function of the power terms is to introduce bends into the regression line. With simple linear regression, the regression line is straight. With the addition of the quadratic term, we can introduce or model one bend. With the addition of the cubic term, we can model two bends, and so forth.
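A minimal sketch of polynomial (curvilinear) regression using NumPy's polynomial fitting; the data below are made up for illustration and contain one bend:

```python
import numpy as np

# Hypothetical data with one bend (quadratic relationship plus noise)
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 2.0 + 1.5 * x - 0.2 * x ** 2 + rng.normal(scale=0.5, size=x.size)

# Fit polynomials of increasing degree; each added power term allows one more bend
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, deg=degree)   # coefficients, highest degree first
    y_hat = np.polyval(coeffs, x)
    sse = np.sum((y - y_hat) ** 2)
    print(f"degree {degree}: SSE = {sse:.2f}")
```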
