Correlation and Linear Regression
Part 2
Reading
• Sections 6.2.1, 6.2.2, 6.2.3, and 6.2.4 in the textbook
Bivariate analysis
• We know how to read and interpret scatterplots, but we also want to use
mathematical/statistical language to describe the relationship between
two data sets quantitatively.
• One of the most commonly used statistical tools is correlation.
• Bivariate analysis aims to understand the relationship between two
variables x and y.
• When the two variables are measured on the same object, x is usually
identified as the independent variable or predictor, and y as the
dependent variable or predictand.
• The methods of bivariate statistics describe the strength of the
relationship between variables, for example with Pearson's correlation
coefficient for linear relationships or with an equation obtained by regression.
Linear regression
• Simple linear regression seeks to summarize the relationship
between two variables, shown graphically in their scatterplot, by a
single straight line.
• The regression procedure chooses that line producing the least
error for predictions of y given observations of x.
Linear function: $y = bx + a$, where $b$ is called the slope and $a$ is called the intercept.
• The slope is the change in y per unit change in x: $\Delta y = b\,\Delta x$, i.e., $b = \Delta y / \Delta x$. The value of y depends on the value of x.
• With observational scatter, the linear function becomes $y = bx + a + \text{error}$.
• When $b = 0$, $y = 0 \cdot x + a = a$, and the value of y does not depend on x.
[Figure: an example line with slope $b = \Delta y/\Delta x = 2$ and intercept $a = -1$, with panels contrasting $b > 0$ and $b < 0$.]
Residuals: $e_i = y_i - \hat{y}(x_i)$
[Figure: scatterplot with y-axis "Final Exam Score", our variable of interest before the final exam is taken, with residuals shown as vertical distances to the fitted line.]
How to estimate the best fitting line?
• Find analytic expressions for the least-squares slope, $b$, and the intercept, $a$.
• In order to minimize the sum of squared residuals, $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\left(y_i - a - bx_i\right)^2$,
• set the derivatives of the above equation with respect to the parameters $a$ and $b$
to zero and solve:
$$\frac{\partial}{\partial a}\sum_{i=1}^{n}\left(y_i - a - bx_i\right)^2 = -2\sum_{i=1}^{n}\left(y_i - a - bx_i\right) = 0$$
and
$$\frac{\partial}{\partial b}\sum_{i=1}^{n}\left(y_i - a - bx_i\right)^2 = -2\sum_{i=1}^{n}x_i\left(y_i - a - bx_i\right) = 0$$
How to estimate the best fitting line?
• Rearranging the two conditions gives the normal equations:
$$\sum_{i=1}^{n} y_i = an + b\sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2$$
• These two equations will be useful in the later analysis of variance.
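Solving the normal equations is a standard derivation, sketched here for completeness: dividing the first equation by $n$ gives $a = \bar{y} - b\bar{x}$; substituting this into the second equation (with $\sum_{i=1}^{n} x_i = n\bar{x}$) isolates $b$:
$$\sum_{i=1}^{n} x_i y_i = \left(\bar{y} - b\bar{x}\right)n\bar{x} + b\sum_{i=1}^{n} x_i^2
\;\Longrightarrow\;
b = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$
Multiplying the numerator and denominator by $-1$ gives the equivalent form on the next slide.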
How to estimate the best fitting line?
$$b = \frac{n\bar{x}\,\bar{y} - \sum_{i=1}^{n} x_i y_i}{n\bar{x}^2 - \sum_{i=1}^{n} x_i^2}$$
Summary: How to estimate the best fitting line?
• Find the slope $b$ and the intercept $a$ such that the sum of squared residuals is minimized:
• $a = \bar{y} - b\bar{x}$
• $b = \dfrac{n\bar{x}\,\bar{y} - \sum_{i=1}^{n} x_i y_i}{n\bar{x}^2 - \sum_{i=1}^{n} x_i^2}$
X (first exam score):  65   67   71   71   66   75   67   70   71   69   69
Y (final exam score): 175  133  185  163  126  198  153  163  159  151  159
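As a check on the formulas above, here is a minimal Python sketch (an addition for illustration, not from the slides) that computes the least-squares slope and intercept for the exam-score data in the table:

```python
# Least-squares fit of Y (final exam score) on X (first exam score)
# using the closed-form formulas from the slides.

x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
x_bar = sum(x) / n          # sample mean of x
y_bar = sum(y) / n          # sample mean of y

# b = (n*x̄*ȳ - Σ x_i*y_i) / (n*x̄² - Σ x_i²)
b = (n * x_bar * y_bar - sum(xi * yi for xi, yi in zip(x, y))) / \
    (n * x_bar**2 - sum(xi**2 for xi in x))

# a = ȳ - b*x̄  (the fitted line passes through the center point (x̄, ȳ))
a = y_bar - b * x_bar

print(f"slope b = {b:.3f}, intercept a = {a:.2f}")
# For these data: b ≈ 4.827, a ≈ -173.51
```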
How to estimate the best fitting line?
• Minimize the sum of squared errors (SSE)!
• In ordinary linear regression the fitted line goes through the center point $(\bar{x}, \bar{y})$ of the paired data samples, where $\bar{x}$ and $\bar{y}$ are the sample means of x and y.
[Figure: scatterplot with the fitted line, the intercept $a$, and the center point $(\bar{x}, \bar{y})$ marked.]
Interpretation of the slope and intercept in terms of:
slope:
$$b = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
That is, the slope is the covariance between x and y divided by the variance of x.
Interpretation of the slope and intercept in terms of:
slope:
• Pearson's correlation coefficient is the covariance between x and y divided by the product of the standard deviations of x and y:
$$r = \frac{\operatorname{Cov}(x, y)}{s_x\, s_y}$$
• This gives the relationship between the slope $b$ and the Pearson correlation coefficient: the correlation between x and y times the standard deviation of y divided by the standard deviation of x,
$$b = r\,\frac{s_y}{s_x}$$
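A short numpy sketch (again an addition, not from the slides) verifying both identities on the exam-score data; note that the $1/n$ factors cancel in each ratio as long as the same ddof is used throughout:

```python
import numpy as np

x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69], dtype=float)
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159], dtype=float)

cov_xy = np.cov(x, y, ddof=0)[0, 1]    # covariance between x and y
var_x = np.var(x, ddof=0)              # variance of x
s_x, s_y = np.std(x, ddof=0), np.std(y, ddof=0)

b_from_cov = cov_xy / var_x            # b = Cov(x, y) / Var(x)
r = cov_xy / (s_x * s_y)               # Pearson correlation coefficient
b_from_r = r * s_y / s_x               # b = r * s_y / s_x

print(b_from_cov, b_from_r)            # both ≈ 4.827, matching the earlier fit
```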
Estimated regression line:
$$y = \hat{y} + e = bx + a + e$$
Understanding and evaluating regression models
• Two assumptions for the quantities $e_i$:
• $e_i$ are independent random variables with zero mean and constant
variance.
• The residuals follow a Gaussian distribution.
• The residuals sum to zero, so their sample mean (dividing the following
equation by n) is zero:
$$\sum_{i=1}^{n} e_i = 0$$
• Given $e_i = y_i - \hat{y}_i$, we have $y_i = \hat{y}_i + e_i$.
• Prove that the mean of $\hat{y}_i$ is $\bar{y}$.
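One line of reasoning for this exercise (added as a hint), using $\sum_{i=1}^{n} e_i = 0$:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i + e_i\right) = \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i + \frac{1}{n}\sum_{i=1}^{n} e_i = \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i$$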
Understanding and evaluating regression models
• Now, we take a close look at the sum of squared errors (SSE):
$$SSE = \sum_{i=1}^{n} e_i^2$$
• Dividing by n gives the variance of the residuals:
$$s_e^2 = \frac{1}{n}SSE = \frac{1}{n}\sum_{i=1}^{n} e_i^2 = \frac{1}{n}\sum_{i=1}^{n}\left(e_i - \bar{e}\right)^2, \quad \text{given } \bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = 0$$
• The variation in y can be partitioned as
$$\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 = \sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2 + \sum_{i=1}^{n} e_i^2$$
• $\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$ is the total sum of squares (SST).
• $\sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2$ is the regression sum of squares (SSR).
• $\sum_{i=1}^{n} e_i^2$ is the sum of squared errors (SSE).
• We get the relationship SST = SSR + SSE.
• This relationship says that the variation in the predictand y is
partitioned into the variation represented by the regression and
the variation of the residuals.
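Continuing the running example (an illustrative sketch, not from the slides), the decomposition can be verified numerically on the fitted exam-score data:

```python
import numpy as np

x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69], dtype=float)
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159], dtype=float)

# Least-squares fit (np.polyfit returns the highest-degree coefficient first).
b, a = np.polyfit(x, y, deg=1)
y_hat = b * x + a                      # fitted values
e = y - y_hat                          # residuals; they sum to ~0

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
SSE = np.sum(e ** 2)                   # sum of squared errors

print(np.isclose(SST, SSR + SSE))      # True: SST = SSR + SSE
```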
Understanding and evaluating regression models
• We take a close look at the relationship: SST=SSR+SSE.
• If the variation of the predictand y can be completely explained by
SSR, which means SSE=0, then this is a perfect regression model
and it means all the data points in the scatterplot fall on the
regression line.
• If there is absolutely no linear relationship between x and y, the
regression slope will be zero, SSR = 0, and SSE = SST.
• So, we can use these three quantities to measure the fit of a
regression, or the correspondence between the regression line and
a scatterplot of the data.
Variance analysis
• So, we can use quantities related to these three quantities to
measure the fit of a regression, or the correspondence between the
regression line and a scatterplot of the data.
• These quantities are summarized in an ANOVA table.
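For reference, a standard ANOVA table for simple linear regression has the following layout (a textbook-standard form; the slide's own table may differ in detail):

Source       df       SS      MS                 F
Regression   1        SSR     MSR = SSR/1        F = MSR/MSE
Residual     n − 2    SSE     MSE = SSE/(n − 2)
Total        n − 1    SST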
Curvilinear regression
• When we have nonlinear relations, we often assume an intrinsically linear
model, and then fit the data to the model using polynomial regression.
• That is, we employ some models that use regression to fit curves instead of
straight lines. The technique is known as curvilinear regression analysis.
• For example, we test several polynomial regression equations. Polynomial
equations are formed by taking our independent variable to successive powers.
Curvilinear regression
• In general, a polynomial equation is referred to by its degree, which is the
value of its largest exponent.
• For example, the linear equation is a polynomial equation of the first degree,
the quadratic is of the second degree, and the cubic is of the third degree.
• The function of the power terms is to introduce bends into the regression line.
With simple linear regression, the regression line is straight. With the addition
of the quadratic term, we can introduce or model one bend. With the addition
of the cubic term, we can model two bends, and so forth.
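As an illustrative sketch (my example, not from the slides), fitting polynomials of increasing degree with numpy shows how each added power term allows one more bend:

```python
import numpy as np

# Synthetic data with a curved relationship (for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x - 0.2 * x**2 + rng.normal(scale=0.5, size=x.size)

# Degree 1: straight line; degree 2: one bend; degree 3: up to two bends.
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, deg=degree)   # highest power first
    y_hat = np.polyval(coeffs, x)
    sse = np.sum((y - y_hat) ** 2)          # SSE shrinks as the degree grows
    print(degree, coeffs.round(3), round(sse, 2))
```

Note that adding power terms always reduces SSE on the data used for fitting, so the degree should be chosen for substantive reasons, not by minimizing SSE alone.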