Week 2: Linear Regression (Post PDF)
Announcements
n Enrollment
n Concurrent: need to wait one more week to be officially added
n Waitlist
n Still awaiting access to the system to manage it
n Definitely will have room for any MEng or PhD student; will try to add as many students as allowable
Announcements
n Survey feedback
n https://docs.google.com/forms/d/e/1FAIpQLSeK-xJPWlDxxBTd6-OxgEs6QNkJiWEpFQFtMhxNwNd4JTNETw/viewform
Today’s Agenda
n Dependent variable:
n Price Index – a composite metric over many different wineries, computed from thousands of wine auctions in the years 1990-1991
n His model used Log(Price Index)
IEOR 242, Spring 2020 - Week 2
[Scatterplot matrix with pairwise correlations of LogAuctionIndex, WinterRain, HarvestRain, GrowTemp, HarvestTemp, Age, FrancePop, and USAlcConsump]
Linear Regression
n Predict the value of the dependent variable:
n Log(price index)
Y = β₀ + β₁X₁ + … + βₚXₚ + ε
n Parametric method
n Output variable: Y
n Also often called response or dependent variable
Parametric Methods vs. Non-parametric Methods
[Figures: Income as a function of Years of Education and Seniority — a fitted parametric (linear) surface vs. a flexible non-parametric surface]
Linear Regression
n Predict the value of the dependent variable:
n Log(price index)
Y = β₀ + β₁X₁ + … + βₚXₚ + ε
n Parametric method
n Falls into the generic format of Y = f(X) + ε with
f(X) = β₀ + β₁X₁ + … + βₚXₚ
n The least squares criterion is the residual sum of squares:
RSS(β) := Σᵢ₌₁ⁿ (yᵢ − β₀ − β₁xᵢ₁ − … − βₚxᵢₚ)²
Unconstrained Optimization Review
n Ingredients:
n β is a vector of decision variables (often called parameters in ML/Stats)
n f(β) is the objective function (often called loss function or penalty function)
n Optimization problem: minimize f(β) over β
Unconstrained Optimization Review, cont.
n Optimization problem: minimize RSS(β) over β
n Where:
RSS(β) := Σᵢ₌₁ⁿ (yᵢ − β₀ − β₁xᵢ₁ − … − βₚxᵢₚ)²
[Figure: least squares fit of a plane over predictors X₁ and X₂]
n Residuals: eᵢ = yᵢ − ŷᵢ
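The least squares problem above has a direct numerical solution. A minimal sketch in numpy (using made-up toy data, not the course's wine dataset) that builds a design matrix with an intercept column, minimizes RSS(β), and recovers the residuals:

```python
import numpy as np

# Toy data (illustrative only, not the wine dataset)
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)

# Design matrix with an intercept column, then solve min_beta RSS(beta)
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Residuals e_i = y_i - yhat_i and the objective RSS(beta_hat)
y_hat = A @ beta_hat
residuals = y - y_hat
rss = float(np.sum(residuals ** 2))
print(beta_hat, rss)
```

With so little noise, the fitted coefficients land close to the (1.0, 2.0, −0.5) used to generate the data.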
n Seriously, only use the test set once, when you have
finished training your model, to estimate the
performance of the model when you go to apply it
in the real world
n All data used to help build the model is training
data, and the training error (RSS) typically
underestimates the performance error
n Soon in the course we will see how to use some of
the training data as “validation data” to estimate
the performance error during the training phase
Regression Output and Analysis
n β̂₀ = −4.966
n β̂winter-rain = 0.0012 (an additional mm of winter rain is expected to result in an additional 0.0012 units of log(price index), i.e., roughly a 0.12% increase in the price index)
n β̂harvest-rain = −0.0033 (an additional mm of harvest rain is expected to result in a decrease of 0.0033 units of log(price index))
n ….
n β̂USalc = 0.1093 (an additional liter of U.S. per capita alcohol consumption is expected to result in an increase of 0.1093 units of log(price index))
Understanding R²
What really is R²?
R² = 1 − SSE/SST
n where SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² and SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
n How good is this prediction? Well, let’s look at all of the test set data records and compute a version of R², which we call OSR²
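The two quantities can be computed side by side. A sketch with toy numpy data (not the wine dataset), using one common convention for OSR² in which the test-set SST is baselined at the training-set mean — an assumption, since the slide does not spell out the baseline:

```python
import numpy as np

def r_squared(y, y_pred, baseline_mean):
    # R^2 = 1 - SSE/SST, with SST measured against a baseline mean
    sse = np.sum((y - y_pred) ** 2)
    sst = np.sum((y - baseline_mean) ** 2)
    return 1.0 - sse / sst

# Toy data: a linear signal plus noise (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 1))
y = 3.0 * X[:, 0] + rng.normal(size=60)

# Split once; the test set is touched only for the final estimate
X_tr, X_te, y_tr, y_te = X[:40], X[40:], y[:40], y[40:]
A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)

A_te = np.column_stack([np.ones(len(X_te)), X_te])
r2_train = r_squared(y_tr, A_tr @ beta, y_tr.mean())
# OSR^2: the same formula on the test set, baselined at the *training* mean
osr2 = r_squared(y_te, A_te @ beta, y_tr.mean())
print(r2_train, osr2)
```

Unlike the training R², OSR² can come out negative when the model predicts worse than the training-set mean.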
Out-of-Sample R² (OSR²)
OSR² = 0.54
Overfitting
n Overfitting occurs when the estimated model fits
the noise in the training data
n Be careful if watching the R² to determine model quality — it's possible to overfit, and a high training R² can be a sign of being overfit
n Overfitting is more likely when:
n The number of parameters to be estimated is large
n Data is limited
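Both risk factors above — many parameters, little data — can be demonstrated in a few lines. A hedged sketch (toy data, not from the course) fitting a high-degree polynomial to a small sample with numpy's `polyfit`: the training R² looks excellent, while the R² on fresh data is markedly worse.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_poly_r2(degree):
    # Small training sample + many parameters: a recipe for overfitting
    x_tr = rng.uniform(-1, 1, size=15)
    y_tr = x_tr + 0.3 * rng.normal(size=15)
    x_te = rng.uniform(-1, 1, size=200)          # fresh data from same process
    y_te = x_te + 0.3 * rng.normal(size=200)
    coefs = np.polyfit(x_tr, y_tr, degree)

    def r2(x, y):
        pred = np.polyval(coefs, x)
        return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

    return r2(x_tr, y_tr), r2(x_te, y_te)

# Degree 12 with only 15 training points fits the noise almost perfectly
r2_train_big, r2_test_big = fit_poly_r2(12)
print(r2_train_big, r2_test_big)
```

The gap between the two numbers is the overfitting: the flexible model memorizes the training noise, which does not generalize.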
n In other words, is βⱼ ≠ 0?
H₀: βⱼ = 0 vs. Hₐ: βⱼ ≠ 0
n Rejecting H₀ at significance level α is equivalent to the (1−α)% confidence interval for βⱼ not containing 0
[Figure: coefficient estimates with confidence intervals, including HarvestTemp, GrowTemp, and FrancePop]
[Figure: scatterplot of FrancePop against Age of Vintage]
n The data for Age and France population are highly correlated
Multicollinearity
n Occurs when two or more predictors are highly
correlated
n Makes the estimated coefficients β̂ = (β̂₀, β̂₁, …, β̂ₚ) very sensitive to noise in the training data
n Thus can produce very inaccurate estimates which hurts
interpretability and possibly predictive performance
n Tell-tale signs:
n Some of the estimated coefficients have the “wrong” sign
n Some of the coefficients are not significantly different from
zero
Correlation Table

                 WinterRain  HarvestRain  GrowTemp  HarvestTemp    Age  FrancePop  USAlcConsump
LogAuctionIndex        0.06        −0.53      0.56         0.47   0.01      −0.08         −0.27
WinterRain                         −0.12     −0.21        −0.05   0.03      −0.05          0.00
HarvestRain                                   0.04        −0.41  −0.13       0.11         −0.22
GrowTemp                                                   0.51  −0.60       0.52         −0.35
HarvestTemp                                                      −0.28       0.25         −0.04
Age                                                                         −0.99         −0.13
FrancePop                                                                                  0.27
Multicollinearity
n Rule of thumb:
n VIF > 10: definitely a problem
n VIF > 5: could be a problem
n VIF ≤ 5: probably okay
What is VIF?
n For predictor j, let Rⱼ² be the R² from regressing Xⱼ on all of the other predictors. So, define:
VIFⱼ = 1 / (1 − Rⱼ²)
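The definition translates directly into code. A numpy sketch (toy data; the nearly collinear pair echoes Age vs. FrancePop, but the numbers are made up):

```python
import numpy as np

def vif(X):
    """VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing
    column j of X on all of the other columns (plus an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2_j = 1 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)

# Toy data: x2 is almost a copy of x1 (like Age vs. FrancePop)
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)               # independent predictor
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The two collinear columns get very large VIFs (well past the rule-of-thumb threshold of 10), while the independent column stays near 1.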
Coefficients:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -4.9662699   9.3823951   -0.529   0.60166
WinterRain    0.0011863   0.0005628    2.108   0.04616 *
HarvestRain  -0.0033137   0.0010650   -3.112   0.00491 **
GrowTemp      0.6582753   0.1221937    5.387  1.79e-05 ***
HarvestTemp   0.0044212   0.0599935    0.074   0.94189
Age           0.0240080   0.0507587    0.473   0.64068
FrancePop    -0.0290258   0.1369627   -0.212   0.83403
USAlcConsump  0.1092561   0.1678945    0.651   0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom
Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06

Coefficient      VIF
WinterRain      1.30
HarvestRain     1.58
GrowTemp        1.70
HarvestTemp     2.20
Age            66.94
FrancePop      81.79
USAlcConsump   10.44
Coefficients:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -6.8404548   3.0706463   -2.228   0.03553 *
WinterRain    0.0012145   0.0005359    2.266   0.03274 *
HarvestRain  -0.0033611   0.0010203   -3.294   0.00305 **
GrowTemp      0.6671389   0.1125053    5.930  4.05e-06 ***
HarvestTemp   0.0020543   0.0577600    0.036   0.97192
Age           0.0340519   0.0178084    1.912   0.06787 .
USAlcConsump  0.0933334   0.1471271    0.634   0.53184
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3241 on 24 degrees of freedom
Multiple R-squared: 0.789, Adjusted R-squared: 0.7362
F-statistic: 14.95 on 6 and 24 DF, p-value: 4.604e-07

Coefficient      VIF
WinterRain      1.22
HarvestRain     1.51
GrowTemp        1.50
HarvestTemp     2.12
Age             8.58
USAlcConsump    8.35
Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -5.215161    1.672215   -3.119  0.004532 **
WinterRain    0.001119    0.000508    2.202  0.037112 *
HarvestRain  -0.003437    0.001001   -3.433  0.002089 **
GrowTemp      0.664336    0.111067    5.981  3.02e-06 ***
HarvestTemp  -0.006650    0.055432   -0.120  0.905462
Age           0.023466    0.006143    3.820  0.000785 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3202 on 25 degrees of freedom
Multiple R-squared: 0.7854, Adjusted R-squared: 0.7425
F-statistic: 18.3 on 5 and 25 DF, p-value: 1.213e-07

Coefficient      VIF
WinterRain      1.13
HarvestRain     1.49
GrowTemp        1.50
HarvestTemp     2.00
Age             1.04
Coefficients:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -5.2163945   1.6401825   -3.180  0.003782 **
WinterRain    0.0011116   0.0004949    2.246  0.033424 *
HarvestRain  -0.0033766   0.0008504   -3.971  0.000505 ***
GrowTemp      0.6569271   0.0905520    7.255  1.05e-07 ***
Age           0.0235571   0.0059785    3.940  0.000546 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3141 on 26 degrees of freedom
Multiple R-squared: 0.7853, Adjusted R-squared: 0.7523
F-statistic: 23.78 on 4 and 26 DF, p-value: 2.307e-08

Coefficient      VIF
WinterRain      1.11
HarvestRain     1.12
GrowTemp        1.04
Age             1.03
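The sequence of refits above (drop the highest-VIF predictor, refit, check again) can be sketched as a loop. This is a hedged illustration with toy data, not the slides' actual R workflow; `prune_by_vif` and its threshold default are names chosen here for illustration:

```python
import numpy as np

def vifs(X):
    # VIF_j = 1/(1 - R_j^2) from regressing column j on the rest
    n, p = X.shape
    vals = []
    for j in range(p):
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        r2 = 1 - np.var(X[:, j] - A @ beta) / np.var(X[:, j])
        vals.append(1 / (1 - r2))
    return np.array(vals)

def prune_by_vif(X, names, threshold=5.0):
    # Repeatedly drop the predictor with the largest VIF above threshold
    names = list(names)
    while X.shape[1] > 1:
        v = vifs(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Toy data mimicking the Age/FrancePop collinearity
rng = np.random.default_rng(4)
age = rng.normal(size=50)
france_pop = -age + 0.05 * rng.normal(size=50)   # nearly collinear with age
rain = rng.normal(size=50)
X, kept = prune_by_vif(np.column_stack([age, france_pop, rain]),
                       ["Age", "FrancePop", "Rain"])
print(kept)
```

One of the collinear pair is removed and the loop stops once every remaining VIF is below the threshold.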
n Correlation of residuals
n Non-constant variance of residuals
n Outliers/high-leverage points
n The last three are not a major concern in this course,
but it’s always healthy to plot your data, including
residual plots
n See James Section 3.3.3 for more details
All-Wineries Data
n Wineries: Beychevelle, Cheval Blanc, Cos d'Estournel, Giscours, Lafite-Rothschild
[Figures: log auction prices by winery — all five wineries, and the two-winery subset (Cheval Blanc, Cos d'Estournel)]
A Two-Category Model

Auction Price = β₀ + β₁ · WineryCos d'Estournel + β₂ · Age + β₃ · WinterRain + β₄ · HarvestRain + β₅ · GrowTemp

An All-Wineries Model

Auction Price = β₀ + β₁ · WineryCheval Blanc + β₂ · WineryCos d'Estournel + β₃ · WineryGiscours + β₄ · WineryLafite-Rothschild + β₅ · Age + β₆ · WinterRain + β₇ · HarvestRain + β₈ · GrowTemp
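The winery dummies in the model above follow R's treatment coding: one category (Beychevelle, which has no dummy in the slide's equation) is the baseline absorbed into the intercept, and each other winery gets a 0/1 column. A minimal plain-Python sketch of that encoding:

```python
# One-hot encoding of the Winery factor with a baseline category dropped,
# matching R's treatment coding for lm(). Plain Python, no libraries.
wineries = ["Beychevelle", "Cheval Blanc", "Cos d'Estournel",
            "Giscours", "Lafite-Rothschild"]
baseline = "Beychevelle"          # absorbed into the intercept
levels = [w for w in wineries if w != baseline]

def encode(winery):
    # Returns the 4 dummy columns for one observation
    return [1 if winery == lvl else 0 for lvl in levels]

print(encode("Cos d'Estournel"))  # → [0, 1, 0, 0]
print(encode("Beychevelle"))      # → [0, 0, 0, 0] (the baseline)
```

With k categories, only k−1 dummies are needed; including all k alongside an intercept would itself create perfect multicollinearity.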
n R² = 0.85 is excellent
[Predicted values from the slide: 3.762, 3.188, 3.487]
n Ashenfelter: 1986
vintage will be mediocre,
but 1989 vintage will be
“stunningly good”
[Figure: log(Auction Index) by vintage]
A Convergence of Results
Ashenfelter:
Conclusion