
05 Linear Regression

The document discusses linear regression and estimating coefficients. It begins with the basic form of a simple linear regression model relating a single input (x) to a continuous response variable (y). It then explains that coefficients are estimated to minimize the sum of squared errors between predicted and actual y values. Specifically, the coefficients are estimated with the normal equation, β̂ = (X^T X)^-1 X^T Y.


DATA SCIENCE

LINEAR REGRESSION
AGENDA

0. BASIC FORM
I. ESTIMATING COEFFICIENTS
II. CATEGORICAL VARIABLES
III. MAKING INFERENCES
LINEAR REGRESSION

0. BASIC FORM
BASIC FORM

                 continuous             categorical
supervised       regression             classification
unsupervised     dimension reduction    clustering
BASIC FORM

Q: What is the motivation for learning about linear
regression?

• widely used
• runs fast
• easy to use (not a lot of tuning required)
• highly interpretable
• basis for many other methods
BASIC FORM

Q: What is a regression model?

A: A functional relationship between input variables and a
continuous response variable.

The simple linear regression model captures a linear
relationship between a single input variable x and a
response variable y:

y = b0 + b1x + e
BASIC FORM

Q: What do the terms in this model mean?

y = b0 + b1x + e

A: y = response variable (the one we want to predict)
x = input variable (the one we use to train the model)
b0 = intercept (where the line crosses the y-axis)
b1 = slope (the change in y per unit change in x, i.e. ∆y/∆x)
e = error term (the part of y the line does not explain)

[Figure: a fitted line plotted on the x-y plane, annotated with the intercept b0 and the slope b1 = ∆y/∆x]
BASIC FORM

We can extend this model to several input variables,
giving us the multiple linear regression model:

y = b0 + b1x1 + … + bnxn + e
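
To make this concrete, here is a minimal Python sketch (not part of the original slides) of fitting a multiple linear regression; the data values are invented purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: two input variables (x1, x2) and a continuous response y.
# Values are made up for illustration only.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([4.1, 4.9, 9.2, 10.1, 13.0])

model = LinearRegression()      # fits y = b0 + b1*x1 + b2*x2
model.fit(X, y)

print("intercept b0:", model.intercept_)
print("coefficients b1, b2:", model.coef_)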
LINEAR REGRESSION

I. ESTIMATING COEFFICIENTS
ESTIMATING COEFFICIENTS

Q: How to determine the impact of a particular input
variable on the response variable?

A: The coefficient estimates (β̂).
ESTIMATING COEFFICIENTS

Q: What is meant by estimates?
A: We are making an inference based off of a sample.

[Figure: several estimated lines from different samples scattered around the true model, plotted as y against x]

A fundamental part of statistics is quantifying our
confidence that our estimates are reflective of truth.
ESTIMATING COEFFICIENTS

Q: How to estimate coefficients for a linear model?
A: By finding the line that minimizes the sum of
squared residuals:

SS_residuals = Σ_{i=1}^{N} (ŷ_i - y_i)^2

[Figure: scatter plot of y against x with the fitted line; each residual is the vertical gap between the model prediction ŷ_i and the observed result y_i]
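
As a minimal illustration (not from the slides), this criterion can be computed with a short Python function; the example numbers are arbitrary.

import numpy as np

def sum_squared_residuals(y_true, y_pred):
    """Sum of squared residuals: SS = sum over i of (yhat_i - y_i)^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_pred - y_true) ** 2))

# Example: observed responses vs. predictions from some candidate line.
print(sum_squared_residuals([44.5, 15.5, 8.1], [40.0, 18.0, 10.0]))  # 30.11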
ESTIMATING COEFFICIENTS

Q: How to calculate estimates that minimize the sum
of squared errors?
A: Through calculus, it can be shown that the
following equation minimizes the sum of squared
errors:

β̂ = (X^T X)^-1 X^T Y
ESTIMATING COEFFICIENTS

Let's walk through a trivial calculation to see how this
works. Along the way, we'll review some matrix math.

        | 1   3.385 |        | 44.5  |
        | 1   0.48  |        | 15.5  |
    X = | 1   1.35  |    Y = | 8.1   |
        | 1   465   |        | 423   |
        | 1   36.33 |        | 119.5 |

The second column of X is the predictor and Y is the response
column. The first column of 1s is a "dummy" column, the
placeholder for the intercept b0.
ESTIMATING COEFFICIENTS

β̂ = (X^T X)^-1 X^T Y

Transposing simply means flipping the columns and rows.
Multiplying the 2×5 matrix X^T by the 5×2 matrix X gives a
2×2 matrix:

    X^T X = | 5        506.54    |
            | 506.54   217558.38 |
ESTIMATING COEFFICIENTS

β̂ = (X^T X)^-1 X^T Y

Only square matrices can be inverted.

    (X^T X)^-1 = | 5        506.54    |^-1   | 0.26         -6.1×10^-4 |
                 | 506.54   217558.38 |    = | -6.1×10^-4    6.0×10^-6 |

Taking the inverse of a 2×2 matrix simply means swapping the
diagonal entries, negating the off-diagonal entries, and
dividing every entry by the determinant. For example, the
top-left entry is 217558.38 / (5 × 217558.38 - 506.54 × 506.54) ≈ 0.26.
ESTIMATING COEFFICIENTS

β̂ = (X^T X)^-1 X^T Y

Multiplying the 2×5 matrix X^T by the 5×1 response vector Y
gives a 2×1 vector:

    X^T Y = | 610.6    |
            | 201205.4 |
ESTIMATING COEFFICIENTS

β̂ = (X^T X)^-1 X^T Y

    | β̂0 |   | 0.26         -6.1×10^-4 |   | 610.6    |   | 37.201 |
    | β̂1 | = | -6.1×10^-4    6.0×10^-6 | · | 201205.4 | = | 0.838  |
LINEAR REGRESSION

II. CATEGORICAL VARIABLES
CATEGORICAL VARIABLES

Q: How do we deal with categorical variables (i.e., ones
with k levels)?
A: Create k-1 binary ("dummy") variables.

Major (k=4)         Engineering   Business   Literature
Computer Science    0             0          0
Engineering         1             0          0
Business            0             1          0
Literature          0             0          1
Business            0             1          0
Engineering         1             0          0

Computer Science is the reference level.
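
A minimal pandas sketch of this dummy encoding (not from the slides); the Major values come from the example above, and the explicit category order is an assumption used so that Computer Science is the dropped reference level.

import pandas as pd

majors = pd.Series(["Computer Science", "Engineering", "Business",
                    "Literature", "Business", "Engineering"], name="Major")

# List "Computer Science" first so that drop_first removes it,
# making it the reference level.
majors = majors.astype(pd.CategoricalDtype(
    ["Computer Science", "Engineering", "Business", "Literature"]))

dummies = pd.get_dummies(majors, drop_first=True, dtype=int)  # k-1 = 3 columns
print(dummies)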
CATEGORICAL VARIABLES

Q: Why k-1 and not k?
A: Because k-1 dummy variables already capture all possible
levels, and dropping one avoids multicollinearity.

Multicollinearity is when two or more predictor variables in
a regression model are highly correlated.

Q: Does it matter which factor level I leave out?
A: Yes, this is the reference point for all other factor
levels.

Q: Is this a limitation?
A: Not really; a comparison must have a baseline.
CATEGORICAL VARIABLES

Q: Is this the only way to represent categorical data?
A: This is the conventional way to represent nominal data;
ordinal data, however, can be represented with integers.

Ordinal means the data have an order, while nominal data
have no order.

Q: What does this mean?
A: Categories that can be ranked (e.g., strongly disagree,
disagree, neutral, agree, strongly agree) can be represented
as 1, 2, 3, 4, 5.
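
A small sketch of the integer encoding for ordinal data (not from the slides); the Likert labels follow the example, while the column name and values are made up.

import pandas as pd

responses = pd.Series(["agree", "neutral", "strongly disagree",
                       "agree", "strongly agree"], name="opinion")

# Ordinal data have a natural order, so map each level to an integer rank.
likert_scale = {"strongly disagree": 1, "disagree": 2, "neutral": 3,
                "agree": 4, "strongly agree": 5}

print(responses.map(likert_scale))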
LINEAR REGRESSION

III. MAKING INFERENCES
MAKING INFERENCES

Linear modeling is a parametric technique, meaning
that it relies on specific assumptions about the
underlying data:
1) Linearity and additivity of the relationship
between input and response variables
2) Homoscedasticity of the errors
3) Normality of the error distribution
4) Statistical independence of the errors
A quick residual-based check of assumptions 1-3 is sketched below.
Source: http://people.duke.edu/~rnau/testing.htm
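
Below is a minimal sketch (not from the slides) of how these assumptions are often eyeballed with residual plots; statsmodels and matplotlib are assumed to be available, and the simulated x and y stand in for whatever data you actually fit.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Simulated data standing in for a real dataset.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=100)

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# Residuals vs. fitted values: curvature suggests non-linearity,
# a funnel shape suggests heteroscedasticity.
plt.scatter(model.fittedvalues, residuals)
plt.axhline(0, color="gray")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

# Q-Q plot: points near the line suggest roughly normal errors.
sm.qqplot(residuals, line="s")
plt.show()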
INTERPRETING THE OUTPUT

Q: How to determine whether a coefficient estimate is
significant?
A: The p-value associated with the coefficient's t-value.

Q: What is a p-value?
A: The probability of getting a result at least as extreme
as the observed outcome (e.g., the coefficient estimate) if
the null hypothesis were true (p < 0.05 is typically
considered significant).
INTERPRETING THE OUTPUT

Q: What is the null hypothesis for linear regression
coefficients?
A: There is no relationship between X and Y.

H0: bj = 0

Ha: bj ≠ 0
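
To see these tests in practice, a statsmodels fit reports the t-values, p-values, and confidence intervals discussed in this section. A minimal sketch (not from the slides) with simulated data:

import numpy as np
import statsmodels.api as sm

# Simulated data with a genuine relationship between x and y.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=50)

X = sm.add_constant(x)           # adds the intercept column of 1s
results = sm.OLS(y, X).fit()

print(results.summary())         # coef, std err, t, P>|t|, [0.025, 0.975]
print(results.pvalues)           # p-value for each coefficient
print(results.conf_int())        # 95% confidence intervals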
INTERPRETING THE OUTPUT

Q: What does the confidence interval mean?
A: If we repeated the sampling many times, 95% of the
intervals constructed this way would contain the true
coefficient value.

[Figure: 95% confidence intervals for β̂_j from repeated samples, most of them covering the true value of β_j]

Confidence intervals are calculated based on the error
variance.
