05 Linear Regression
LINEAR REGRESSION
AGENDA
0. BASIC FORM
I. ESTIMATING COEFFICIENTS
II. CATEGORICAL VARIABLES
III. MAKING INFERENCES
LINEAR REGRESSION
0. BASIC FORM
                 continuous             categorical
supervised       regression             classification
unsupervised     dimension reduction    clustering
• widely used
• runs fast
• easy to use (not a lot of tuning required)
• highly interpretable
• basis for many other methods
[Figure: the line y = b_0 + b_1 x, with intercept b_0 and slope b_1 = ∆y / ∆x]
We can extend this model to several input variables,
giving us the multiple linear regression model:
y = b_0 + b_1 x_1 + … + b_n x_n + e
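As a minimal sketch of the model above, the prediction for one observation is the intercept plus a weighted sum of the inputs. The function name `predict` and the coefficient values are made up for illustration, and the error term e is not modeled.

```python
def predict(b0, coefs, xs):
    """Return b0 + b1*x1 + ... + bn*xn for one observation."""
    if len(coefs) != len(xs):
        raise ValueError("need one coefficient per input variable")
    return b0 + sum(b * x for b, x in zip(coefs, xs))


# Hypothetical coefficients and inputs, chosen only for illustration.
y_hat = predict(1.0, [2.0, -0.5], [3.0, 4.0])
print(y_hat)  # 1.0 + 2.0*3.0 + (-0.5)*4.0 = 5.0
```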
LINEAR REGRESSION
I. ESTIMATING COEFFICIENTS
Q: How to determine the impact of a particular input
variable on the response variable?
A: The coefficient estimates (\hat{b})
Q: What is meant by estimates?
A: We are making an inference based on a sample.
[Figure: y versus x, with the true model and the estimates labeled]
A fundamental part of statistics is quantifying our
confidence that our estimates reflect the truth.
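Q: How to estimate coefficients for a linear model?
A: By finding the line that minimizes the sum of
squared residuals:
SS_{residuals} = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
where \hat{y}_i is the model prediction and y_i is the observed result.
[Figure: y versus x, showing the gap between each model prediction and observed result]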
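The criterion can be sketched in a few lines. The data and the two candidate lines below are invented for illustration; the point is only that the line with the smaller sum of squared residuals is the better fit under this criterion.

```python
def ss_residuals(y_pred, y_obs):
    """Sum over i of (y_hat_i - y_i)^2."""
    return sum((p - o) ** 2 for p, o in zip(y_pred, y_obs))


def line_predictions(b0, b1, xs):
    """Predictions of the line y = b0 + b1 * x at each x."""
    return [b0 + b1 * x for x in xs]


# Made-up observations, for illustration only.
xs = [1, 2, 3]
ys = [2, 4, 7]

ss_a = ss_residuals(line_predictions(0, 2, xs), ys)  # candidate line y = 2x
ss_b = ss_residuals(line_predictions(1, 1, xs), ys)  # candidate line y = 1 + x
print(ss_a, ss_b)  # 1 10
```

Least squares picks, among all possible lines, the one for which this quantity is smallest.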
Q: How do we calculate estimates that minimize the sum
of squared errors?
A: Through calculus, it can be shown that the
following estimate minimizes the sum of squared
errors:
\hat{b} = (X^T X)^{-1} X^T Y
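The calculus step can be sketched as follows. This is the standard derivation rather than something taken from the slides; it writes the sum of squared errors in matrix form, with b the vector of coefficients, and assumes X^T X is invertible.

```latex
\begin{aligned}
SS_{residuals}(b) &= (Y - Xb)^T (Y - Xb) \\
\frac{\partial SS_{residuals}}{\partial b} &= -2 X^T (Y - Xb) = 0 \\
X^T X \hat{b} &= X^T Y \\
\hat{b} &= (X^T X)^{-1} X^T Y
\end{aligned}
```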
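Let’s walk through a trivial calculation to see how
this works.

X = \begin{bmatrix} 1 & 3.385 \\ 1 & 0.48 \\ 1 & 1.35 \\ 1 & 465 \\ 1 & 36.33 \end{bmatrix}, \quad Y = \begin{bmatrix} 44.5 \\ 15.5 \\ 8.1 \\ 423 \\ 119.5 \end{bmatrix}

The first column of X is a “dummy” column of ones, a placeholder for the
intercept b_0. The second column of X is the predictor column, and Y is
the response column.

Transposing simply means flipping the columns and rows:

X^T X = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 3.385 & 0.48 & 1.35 & 465 & 36.33 \end{bmatrix} \begin{bmatrix} 1 & 3.385 \\ 1 & 0.48 \\ 1 & 1.35 \\ 1 & 465 \\ 1 & 36.33 \end{bmatrix} = \begin{bmatrix} 5 & 506.54 \\ 506.54 & 217558.38 \end{bmatrix}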
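Only square matrices can be inverted.

(X^T X)^{-1} = \begin{bmatrix} 5 & 506.54 \\ 506.54 & 217558.38 \end{bmatrix}^{-1} \approx \begin{bmatrix} 0.26 & -6.1 \times 10^{-4} \\ -6.1 \times 10^{-4} & 6.0 \times 10^{-6} \end{bmatrix}

X^T Y = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 3.385 & 0.48 & 1.35 & 465 & 36.33 \end{bmatrix} \begin{bmatrix} 44.5 \\ 15.5 \\ 8.1 \\ 423 \\ 119.5 \end{bmatrix} = \begin{bmatrix} 610.6 \\ 201205.4 \end{bmatrix}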
\hat{b} = (X^T X)^{-1} X^T Y
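The walkthrough can be reproduced with a short script. This is a sketch using only the standard library; the helper names (`transpose`, `matmul`, `inverse_2x2`) are illustrative, and the 2x2 inverse is hard-coded because this example has a single predictor plus the column of ones.

```python
def transpose(m):
    """Flip rows and columns."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Column of ones (for the intercept) and the predictor column.
X = [[1, 3.385], [1, 0.48], [1, 1.35], [1, 465], [1, 36.33]]
# Response column.
Y = [[44.5], [15.5], [8.1], [423], [119.5]]

Xt = transpose(X)
XtX = matmul(Xt, X)
XtY = matmul(Xt, Y)
b_hat = matmul(inverse_2x2(XtX), XtY)

b0, b1 = b_hat[0][0], b_hat[1][0]
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}")
```

On this data the sketch gives an intercept of roughly 37.2 and a slope of roughly 0.84. For real problems a numerical library with a least-squares solver is usually preferable to inverting X^T X explicitly.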
II. CATEGORICAL VARIABLES
Q: How do we deal with categorical variables (i.e.,
with k levels)?
A: Create k-1 binary (“dummy”) variables.

Major (k=4)         Engineering   Business   Literature
Computer Science    0             0          0
Engineering         1             0          0
Business            0             1          0
Literature          0             0          1
Business            0             1          0
Engineering         1             0          0

Computer Science is the reference level.
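The encoding in the table can be sketched as below. The function name `dummy_encode` is hypothetical; it builds one 0/1 column per non-reference level, in first-seen order, so the reference level ends up as the all-zeros row.

```python
def dummy_encode(values, reference):
    """Encode a categorical column as k-1 binary columns, omitting `reference`."""
    levels = []
    for v in values:
        if v != reference and v not in levels:
            levels.append(v)  # keep first-seen order
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


majors = ["Computer Science", "Engineering", "Business",
          "Literature", "Business", "Engineering"]
levels, rows = dummy_encode(majors, reference="Computer Science")

print(levels)
for major, row in zip(majors, rows):
    print(major, row)
```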
Q: Why k-1 and not k?
A: Because k-1 dummy variables already capture all k levels (the
reference level is the case where every dummy is 0), and adding a k-th
dummy would introduce multicollinearity: the k dummies always sum to 1,
duplicating the intercept column.
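A small check makes the multicollinearity concrete. This sketch deliberately builds the design matrix the wrong way, with a column of ones plus all k dummy columns, and shows that the dummies add up to the column of ones in every row. When columns are linearly dependent like this, X^T X is singular and cannot be inverted.

```python
majors = ["Computer Science", "Engineering", "Business",
          "Literature", "Business", "Engineering"]
levels = sorted(set(majors))  # all k = 4 levels, no reference dropped

# Each row: intercept column of ones followed by k dummy columns.
X = [[1] + [1 if m == level else 0 for level in levels] for m in majors]

# In every row exactly one dummy is 1, so the dummies sum to the intercept.
dummy_sums = [sum(row[1:]) for row in X]
intercepts = [row[0] for row in X]
print(dummy_sums == intercepts)  # True: perfect multicollinearity
```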
III. MAKING INFERENCES
Linear modeling is a parametric technique, meaning
that it relies on specific assumptions about the
underlying data:
1) Linearity and additivity of the relationship
between input and response variables
2) Homoscedasticity of the errors
3) Normality of the error distribution
4) Statistical independence of the errors
Source: http://people.duke.edu/~rnau/testing.htm
INTERPRETING THE OUTPUT
Q: How do we determine whether a coefficient
estimate is significant?
A: Use the p-value associated with the coefficient's t-value.
Q: What is a p-value?
A: The probability of getting an outcome at least as extreme as
the one observed (e.g., the coefficient estimate) if the null
hypothesis were true (p < 0.05 is typically considered
significant).
Q: What is the null hypothesis for linear regression
coefficients?
A: There is no relationship between X and Y.
H_0: b_j = 0
H_a: b_j ≠ 0
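For reference, the t-value used to test this hypothesis is conventionally the estimate divided by its standard error. The symbols SE, \hat{\sigma}^2 and p (the number of estimated coefficients, including the intercept) are not defined in the slides and are introduced here as assumptions; the formulas also assume the error assumptions listed earlier hold.

```latex
t_j = \frac{\hat{b}_j - 0}{SE(\hat{b}_j)}, \qquad
SE(\hat{b}_j) = \sqrt{\hat{\sigma}^2 \left[ (X^T X)^{-1} \right]_{jj}}, \qquad
\hat{\sigma}^2 = \frac{SS_{residuals}}{N - p}
```

Under H_0, t_j follows a t distribution with N - p degrees of freedom, and the p-value is the two-sided tail probability beyond |t_j|.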
Q: What does the confidence interval mean?
A: If we repeated the sampling many times, about 95% of the
confidence intervals built this way would contain the true coefficient.
Confidence intervals are calculated from the error variance.
[Figure: confidence intervals for \hat{b}_j shown against the true value of b_1]
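As a rough sketch of how the error variance feeds into the interval, the script below reuses the five (x, y) pairs from the earlier walkthrough and builds a 95% confidence interval for the slope. It uses the closed-form formulas for a single predictor, and the critical value 3.182 is the two-sided 95% point of the t distribution with 3 degrees of freedom, taken from a standard t table. With only five observations this is an illustration, not a meaningful analysis.

```python
import math

x = [3.385, 0.48, 1.35, 465, 36.33]
y = [44.5, 15.5, 8.1, 423, 119.5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares slope and intercept for one predictor.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Estimate of the error variance from the residuals (n - 2 degrees of freedom).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma2 = sum(r ** 2 for r in residuals) / (n - 2)

# Standard error of the slope, its t-value, and the 95% interval.
se_b1 = math.sqrt(sigma2 / sxx)
t_value = b1 / se_b1
T_CRIT = 3.182  # t table value: two-sided 95%, n - 2 = 3 degrees of freedom
lower, upper = b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1

print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, t = {t_value:.2f}")
print(f"95% CI for b1: ({lower:.3f}, {upper:.3f})")
```

If the interval excludes 0, the same data would reject H_0: b_1 = 0 at the 5% level, which ties the confidence interval back to the significance test.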