
+

Week 2 – Predicting Wine Quality with Linear Regression I
IEOR 242 – Applications in Data Analysis
Spring 2020 – Week 2
+ 2

Announcements

n Enrollment
n Concurrent: need to wait one more week to be officially added

n Waitlist
n Still awaiting access to the system to manage it
n Definitely will have room for any MEng or PhD student… will try to add as many as allowable


+ 3

Announcements

n Survey feedback
n https://docs.google.com/forms/d/e/1FAIpQLSeK-xJPWlDxxBTd6-OxgEs6QNkJiWEpFQFtMhxNwNd4JTNETw/viewform

n Updated syllabus (still in progress)

n ~One guest lecturer per month, targeting more advanced topics at the end; the tradeoff: more material covered per lecture, and a greater expectation of individual knowledge/reading

n Office Hours: W 2-4p, 2nd Floor Blum Hall Suites


+ 4

Announcements

n Possible Guest Speakers and Possible Topics

n Risk Modeler (Civil Eng PhD); Application: Data Breach Model using GLMs
n Epidemiologist (Public Health PhD); Application: Fall Prevention in Older Adults using Fuzzy Clustering
n Search Engineer (Comp Sci PhD); Application: Food and Beverage Click-Through Rates using Ensemble Trees
n Biostatistician (Comp Sci PhD); Application: Precision Medicine/Gene Mapping using Deep Learning
n Actuary (Physics MS); Application: Commercial Lines Residual Model Pricing using Ensemble Trees


+ 5

(Additional Qs for Guest Speakers)


Final Project Questions
n Why is/was that problem important to you? The business?
n What are the results/takeaways/impact?
n (What else did you need to do to get it implemented?)
n (How many people worked on it and for how long?)
n (What other non-math/statistical considerations did you have?)
n What methods did you use?
n What alternatives did you consider?
n How did you evaluate efficacy?
n (How do you continue monitoring for efficacy?)


+ 6

Today’s Agenda

n Predicting wine quality with linear regression

n Model validation, overfitting, and other issues

n Significance, multicollinearity, and other issues

n An improved wine model with categorical variables (most likely next week)


+
Predicting Wine Quality



+ 8

Vintage Bordeaux Wine

n Vintage wine vs. non-vintage wine?

n Large differences in price and quality in different years, even though the wine is produced in a similar way

n Meant to be aged, so it is hard to know the quality of the wine when it initially goes on the market

n Expert tasters predict which wines will be good

n Can analytics be used to develop a different system for assessing the quality of wine?
+ 9

Wine Quality – Ask the Expert



+ 10

Predicting the Quality of Wine

n March 1990: Orley Ashenfelter, a Princeton economics professor, claims he can predict wine quality without tasting the wine


+ 11

Using Linear Regression


n Ashenfelter used (multiple) linear regression
n Predicts a continuous response variable – the dependent variable
n Prediction is based on a set of independent variables

n Independent variables (features):
n Age – older wines are more expensive
n Weather
n Average Growing Season Temperature
n Harvest Rain
n Winter Rain

n Dependent variable:
n Price Index – a composite metric across many different wineries in thousands of wine auctions in the years 1990-1991
n His model used Log(Price Index)
+ 12

Why Log(Price Index)?

n Produces a better linear fit

n The better fit is revealed through plotting
n The log( ) transformation also arises intrinsically, especially in settings where “growth” or “proportion” are natural phenomena


+ 13

US National Debt (1950 – 2014)

Coefficient for Year = 206; R2 = 0.72

[Scatter plot: US national debt vs. year, with fitted line]


+ 14

US National Debt (1950 – 2014)

Coefficient for Year = 0.075; R2 = 0.96

[Scatter plot: log(US national debt) vs. year, with fitted line]
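The contrast on these two slides can be reproduced with a short R sketch; the series below is stylized (smooth exponential growth at ~7.5% per year), not the actual debt data:

set.seed(242)
years <- 1950:2014
debt  <- 2.5 * exp(0.075 * (years - 1950) + rnorm(length(years), sd = 0.05))
fit_raw <- lm(debt ~ years)        # linear fit on the raw scale
fit_log <- lm(log(debt) ~ years)   # linear fit on the log scale
summary(fit_raw)$r.squared         # mediocre, like the 0.72 on slide 13
summary(fit_log)$r.squared         # near 1, like the 0.96 on slide 14
coef(fit_log)["years"]             # slope ~0.075 = the annual growth rate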


+ 15

The Expert’s Reaction

Robert Parker, the world's most influential wine expert at the time:

“Ashenfelter is an absolute total sham”

“Really a Neanderthal way of looking at wine”

“Rather like a movie critic who never goes to see the movie but tells you how good it is based on the actors and the director”


+ 16

Vintage Wine Data

n Log(price index) based on 2015 auction prices

n Winter rain (mm)
n Harvest rain (mm)
n Average temperature in growing season (Celsius)
n Average temperature in harvest season (Celsius)
n Age of wine (years since vintage)
n Population of France
n US alcohol consumption (per capita, in liters of 100% alcohol)


+ 17

Vintage Wine Data, cont.


p = # of independent variables (p = 7); n = # of observations (n = 46)

        y                x1          x2           x3        x4           x5    x6         x7
  Year  LogAuctionIndex  WinterRain  HarvestRain  GrowTemp  HarvestTemp  Age   FrancePop  USAlcConsump
  1952  7.4950           566.4       165.5        17.28     14.39        63    42.46       7.85
  1953  8.0393           653.3        75.6        16.94     17.64        62    42.75       8.03
  1955  7.6858           504.3       129.5        17.30     17.13        60    43.43       7.84
  1957  6.9845           390.8       110.4        16.31     16.47        58    44.31       7.77
  1958  6.7772           538.8       187.0        16.82     19.72        57    44.79       7.74
  1959  8.0757           377.0       182.6        17.68     19.28        56    45.24       7.89
  1960  6.5188           748.2       290.6        16.67     16.18        55    45.68       8.02
  1961  8.4937           747.8        37.7        17.64     21.05        54    46.16       8.08
  1962  7.3880           639.4        51.8        16.58     17.86        53    47.00       8.13
  1964  7.3094           326.5        96.1        17.63     19.43        51    48.31       8.46
  1965  6.2518           548.4       266.6        15.71     15.33        50    48.76       8.62
  1966  7.7443           734.0        85.2        16.81     18.82        49    49.16       8.78
  1967  6.8398           646.9       118.1        16.51     17.16        48    49.55       9.03
  1968  6.2435           508.6       292.1        16.37     16.77        47    49.91       9.28
  1969  6.3459           480.1       243.9        16.65     16.89        46    50.32       9.53
  1970  7.5883           563.5        88.8        16.92     18.69        45    50.77       9.78
  1971  7.1934           488.4       111.9        17.20     17.28        44    51.25       9.99
  1972  6.2049           465.1       157.3        15.27     15.04        43    51.70      10.10
  1973  6.6367           357.2       122.6        17.41     18.50        42    52.12      10.37
  1974  6.2941           503.6       185.1        16.39     16.48        41    52.46      10.48
  ...   ...              ...         ...          ...       ...          ...   ...        ...
  2000  8.1817           487.8        69.0        18.73     19.45        15    59.05       8.24

(In the notation above, observation 1 – the 1952 vintage – has y1 = 7.4950, x11 = 566.4, x21 = 165.5, ..., x71 = 7.85; observation n = 46 – the 2000 vintage – has yn = 8.1817, x1n = 487.8, ..., x7n = 8.24.)

Why do the observations stop after the year 2000?
+ 18

Vintage Wine Data, cont.


[Scatterplot matrix of all eight variables; the pairwise correlations from its upper panel are tabulated below]

                 WinterRain  HarvestRain  GrowTemp  HarvestTemp    Age  FrancePop  USAlcConsump
LogAuctionIndex        0.06        −0.53      0.56         0.47   0.01      −0.08         −0.27
WinterRain                         −0.12     −0.21        −0.05   0.03      −0.05          0.00
HarvestRain                                   0.04        −0.41  −0.13       0.11         −0.22
GrowTemp                                                   0.51  −0.60       0.52         −0.35
HarvestTemp                                                      −0.28       0.25         −0.04
Age                                                                         −0.99         −0.13
FrancePop                                                                                  0.27


+
Linear Regression



+ 20

Linear Regression
n Predict the value of the dependent variable:
n Log(price index)

n Prediction as a linear function of the independent variables:
n Winter rain (mm)
n Harvest rain (mm)
n Average temperature in growing season (Celsius)
n Average temperature in harvest season (Celsius)
n Age of wine (years since vintage)
n Population of France
n US alcohol consumption (per capita, in liters of 100% alcohol)


+ 21

Multiple Linear Regression

Y = β0 + β1 X1 + ... + βp Xp + ε

n Parametric method

n Observed data: (xi, yi), i = 1, ..., n

n Each observed xi is a feature vector: xi = (xi1, xi2, ..., xip)ᵀ

n Each observed yi is a continuous response/dependent variable associated with xi


+
Statistical Learning Interlude



+ 23

General Statistical Learning Model

n Input variables: X = (X1, X2, ..., Xp)
n Also often called features, predictors, or independent variables

n Output variable: Y
n Also often called the response or dependent variable

n Collected data in the form of n pairs:
n (xi, yi), i = 1, ..., n
n xi = (xi1, xi2, ..., xip)ᵀ


+ 24

Parametric Methods

n Start by assuming a particular functional form for f
n For example, assume that f is linear:
  f(X) = β0 + β1 X1 + ... + βp Xp
n f is parameterized by β = (β0, β1, ..., βp)

n Now apply a method that uses the training data to estimate β
n We sometimes call this fitting the model
n Classic example: ordinary least squares, i.e., linear regression
n We will consider more sophisticated approaches as well


+ 25

Parametric Methods

n Advantages of Parametric Methods:
n Simplifies the problem of estimating f to the problem of estimating β
n Potentially relatively less data needed to produce a reliable estimate of β

n Major Disadvantage of Parametric Methods:
n The true functional form of f is usually more complicated than the model we chose
n This may be remedied by selecting a flexible model class, but this comes at the danger of overfitting


+ 26

Non-parametric Methods

n Of course, non-parametric methods do not make parametric assumptions about f

n No explicit functional form is assumed

n Allows for greater flexibility
n Runs a greater risk of overfitting if you are not careful
n Generally requires more data to produce an accurate estimate


+ 27

Tradeoff Between Flexibility and Interpretability

n Why not just always use flexible, non-parametric methods?
n One reason is that parametric models are more interpretable and thus better for inference
n Even if you don’t care about inference, non-parametric methods may overfit the training data

[Two surface plots of Income vs. Years of Education and Seniority: a rigid linear fit and a flexible fit]


+
Back to Linear Regression



+ 29

Linear Regression
n Predict the value of the dependent variable:
n Log(price index)

n Prediction as a linear function of the independent variables:
n Winter rain (mm)
n Harvest rain (mm)
n Average temperature in growing season (Celsius)
n Average temperature in harvest season (Celsius)
n Age of wine (years since vintage)
n Population of France
n US alcohol consumption (per capita, in liters of 100% alcohol)


+ 30

Multiple Linear Regression

Y = β0 + β1 X1 + ... + βp Xp + ε

n Parametric method
n Falls into the generic format Y = f(X) + ε with f(X) = β0 + β1 X1 + ... + βp Xp

n Observed data: (xi, yi), i = 1, ..., n
n Each observed xi is a feature vector: xi = (xi1, xi2, ..., xip)ᵀ
n Each observed yi is a continuous response/dependent variable associated with xi


+ 31

Multiple Linear Regression, cont.

n The (true) regression coefficients β = (β0, β1, ..., βp) are unknown to us

n How do we estimate the regression coefficients?

n Minimize prediction error, as measured by the residual sum of squares (RSS):

  RSS(β) := Σ_{i=1}^{n} (yi − β0 − β1 xi1 − ... − βp xip)²


+ 32

Unconstrained Optimization Review

n Ingredients:
n β is a vector of decision variables (often called parameters in ML/Stats)
n L(β) is the objective function (often called a loss function or penalty function)

n Optimization problem: minimize L(β) over β


+ 33

Unconstrained Optimization Review, cont.

n Optimization problem: minimize L(β) over β

n Definition of optimality: β* solves the above optimization problem if and only if L(β*) ≤ L(β) for all β

n Necessary Optimality Condition: If L is differentiable with gradient ∇L and β* solves the optimization problem, then: ∇L(β*) = 0


+ 34

Multiple Linear Regression Coefficient Estimates

n The regression coefficient estimates β̂ = (β̂0, β̂1, ..., β̂p) are chosen to minimize RSS(β)

n Where:

  RSS(β) := Σ_{i=1}^{n} (yi − β0 − β1 xi1 − ... − βp xip)²

[Figure: least-squares plane fitted to data points over the (X1, X2) plane (from ISLR)]


+ 35

Multiple Linear Regression Coefficient Estimates

n Let X be the n × (p + 1) matrix whose ith row is the appended feature vector (1, xi1, ..., xip)

n Let y be the n-vector of responses yi

n Then the matrix-vector product Xβ is the n-vector of training set predictions associated with the coefficient vector β, and the n-vector of residuals is y − Xβ


+ 36

Multiple Linear Regression Coefficient Estimates, cont.

n Recall the 2-norm of an n-vector v is defined by: ‖v‖₂ = √(v₁² + ... + vₙ²)

n Then it is easy to see that: RSS(β) = ‖y − Xβ‖₂²

n Also, it holds that (slightly less obvious):

  ∇RSS(β) = −2 Xᵀ(y − Xβ)


+ 37

Multiple Linear Regression Coefficient Estimates, cont.

n Using the representation RSS(β) = ‖y − Xβ‖₂² and assuming that XᵀX is invertible, one may use calculus/linear algebra to show that the solution of ∇RSS(β) = 0 is given by:

  β̂ = (XᵀX)⁻¹ Xᵀ y
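As a sanity check, the closed-form solution can be computed directly in R and compared against lm(). A minimal sketch on synthetic data (the variable names here are illustrative, not from the wine dataset):

set.seed(242)
n <- 46
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 7 + 0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 0.3)
X <- cbind(1, x1, x2)                       # append the column of 1s for the intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves the normal equations (X'X)β = X'y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))      # the two columns agree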


+ 38

Multiple Linear Regression, cont.

n Prediction for the ith observation: ŷi := β̂0 + β̂1 xi1 + ... + β̂p xip

n Residuals: ei = yi − ŷi

n RSS with respect to the estimated coefficients:

  RSS = SSE = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − ŷi)²

n SSE is the sum of squared errors (both conventions often used)


+ 39

Vintage Wine Data


p = # of independent variables (p = 7); n = # of observations (n = 46)

(Same vintage wine data table as on slide 17.)


+ 40

Best Practices: Out-of-Sample Testing

Full Dataset (n = 46): observations 1 (1952) through 46 (2000)

Training Set (n = 31): observations 1 (1952) through 31 (1985)

      Year  LogAuctionIndex  WinterRain  HarvestRain  GrowTemp  ...
   1  1952  7.4950           566.4       165.5        17.28     ...
   2  1953  8.0393           653.3        75.6        16.94     ...
   3  1955  7.6858           504.3       129.5        17.30     ...
   4  1957  6.9845           390.8       110.4        16.31     ...
   5  1958  6.7772           538.8       187.0        16.82     ...
   6  1959  8.0757           377.0       182.6        17.68     ...
   7  1960  6.5188           748.2       290.6        16.67     ...
   8  1961  8.4937           747.8        37.7        17.64     ...
   9  1962  7.3880           639.4        51.8        16.58     ...
  10  1964  7.3094           326.5        96.1        17.63     ...
  ..  ...   ...              ...         ...          ...       ...
  30  1984  6.5496           572.6       144.8        16.71     ...
  31  1985  6.9171           667.1        37.2        17.19     ...

Testing Set (n = 15): observations 32 (1986) through 46 (2000)

      Year  LogAuctionIndex  WinterRain  HarvestRain  GrowTemp  ...
  32  1986  6.7793           518.5       171.2        16.65     ...
  33  1987  7.1797           397.0       115.1        17.84     ...
  34  1988  7.2646           734.2        58.8        17.65     ...
  35  1989  7.5922           282.4        85.2        18.62     ...
  ..  ...   ...              ...         ...          ...       ...
  45  1999  7.4462           502.4       253.4        19.07     ...
  46  2000  8.1817           487.8        69.0        18.73     ...


+ 41

Best Practices: Out-of-Sample Testing

n Set aside a “test set” of 20%–50% of the observed data before creating the regression model(s)

n Typical practice: set aside the most recently observed data (for example, the markets most recently entered or the wines most recently matured)

n If there is no time-dependence in the observed data, select a random sample for the test set

n Keep the test set data “hands-off” until you are ready to assess the performance of your regression model
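A minimal sketch of this time-based split in R, assuming a data frame wine with a Year column as in the table on slide 17:

train <- subset(wine, Year <= 1985)   # the 31 earliest vintages, used for training
test  <- subset(wine, Year >= 1986)   # the 15 most recent vintages, held out
# With no time-dependence, a random split would be used instead, e.g.:
# idx <- sample(nrow(wine), size = round(2/3 * nrow(wine)))
# train <- wine[idx, ]; test <- wine[-idx, ]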
+ 42

Best Practices: Out-of-Sample Testing

n Seriously, only use the test set once, when you have finished training your model, to estimate the performance of the model when you go to apply it in the real world
n All data used to help build the model is training data, and the training error (RSS) typically underestimates the performance error
n Soon in the course we will see how to use some of the training data as “validation data” to estimate the performance error during the training phase




+
Regression Output and Analysis


+ 44

Regression Output (from R)


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.9662699 9.3823951 -0.529 0.60166
WinterRain 0.0011863 0.0005628 2.108 0.04616 *
HarvestRain -0.0033137 0.0010650 -3.112 0.00491 **
GrowTemp 0.6582753 0.1221937 5.387 1.79e-05 ***
HarvestTemp 0.0044212 0.0599935 0.074 0.94189
Age 0.0240080 0.0507587 0.473 0.64068
FrancePop -0.0290258 0.1369627 -0.212 0.83403
USAlcConsump 0.1092561 0.1678945 0.651 0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom


Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06



+ 45

Interpreting the Regression Coefficients

n Regression coefficients: β̂ = (β̂0, β̂1, ..., β̂p) are estimates of β = (β0, β1, ..., βp)

n β̂0 = -4.966
n β̂WinterRain = 0.0012 (an additional mm of winter rain is expected to result in an additional 0.0012 units of log(price index))
n β̂HarvestRain = -0.0033 (an additional mm of harvest rain is expected to result in a decrease of 0.0033 units of log(price index))
n ...
n β̂USAlcConsump = 0.1093 (an additional liter of US per capita alcohol consumption is expected to result in an increase of 0.1093 units of log(price index))


+ 46

Understanding R2

n R2 is the coefficient of determination

n R2 is a measure of the overall quality of the regression model

n R2 is a number between 0.0 and 1.0

n A higher R2 means the regression model is a better fit to the (training) data


+ 47

Understanding R2, cont.

n R2 = .924; very good linear model


+ 48

Understanding R2, cont.

n R2 = .710; good linear model


+ 49

Understanding R2, cont.

n R2 = .035; not a good model


+ 50

What really is R2?

n R2 compares two models:
n the regression model (the one determined by minimizing the RSS, the residual sum of squares error), and
n the “baseline” model. Think of the baseline model as a model you might have built using this data but without any real mathematical thinking.
n The baseline model predicts simplistically using only the mean/average of the sample outcomes:

  ȳ = (y1 + ··· + yn)/n = (y1952 + ··· + y1985)/31 = 7.084


+ 51

What really is R2, continued

R2 = 1 − (sum of squared residuals of the regression model) / (sum of squared residuals of the baseline model)

   = 1 − [Σ_{i=1}^{n} (yi − ŷi)²] / [Σ_{i=1}^{n} (yi − ȳ)²]

   = 1 − SSE/SST,  where SST = Σ_{i=1}^{n} (yi − ȳ)²
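Computing R2 by hand in R makes the comparison with the baseline model explicit. A sketch, assuming the training data frame train with the column names from slide 17:

fit <- lm(LogAuctionIndex ~ WinterRain + HarvestRain + GrowTemp + HarvestTemp
          + Age + FrancePop + USAlcConsump, data = train)
y   <- train$LogAuctionIndex
SSE <- sum((y - fitted(fit))^2)   # squared residuals of the regression model
SST <- sum((y - mean(y))^2)       # squared residuals of the baseline (mean) model
1 - SSE / SST                     # equals summary(fit)$r.squared, 0.7894 here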


+ 52

Regression Output (from R)


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.9662699 9.3823951 -0.529 0.60166
WinterRain 0.0011863 0.0005628 2.108 0.04616 *
HarvestRain -0.0033137 0.0010650 -3.112 0.00491 **
GrowTemp 0.6582753 0.1221937 5.387 1.79e-05 ***
HarvestTemp 0.0044212 0.0599935 0.074 0.94189
Age 0.0240080 0.0507587 0.473 0.64068
FrancePop -0.0290258 0.1369627 -0.212 0.83403
USAlcConsump 0.1092561 0.1678945 0.651 0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom


Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06



+ 53

Vintage Wine Data


p = # of independent variables (p = 7); n = # of observations (n = 46)

(Same vintage wine data table as on slide 17.)


+ 54

Best Practices: Out-of-Sample Testing

(Same training/testing split as on slide 40: training set = vintages 1952–1985, n = 31; testing set = vintages 1986–2000, n = 15.)


+ 55

Training vs. Test Data

n Is R2 really what we care about?
n R2 is measured on the training data, the data that we used to fit the model
n What we really care about is predictive performance on new data
n Recall that we set aside some test data…
n We will use this test data to estimate the performance of our model on new data that we might see in the wild


+ 56

Assessing “Real World” Performance of the Regression Model

n Here is our model, based on the training data observations (years 1952 through 1985):
n log(Price Index) = -4.966 + 0.001*(Winter Rain) - 0.003*(Harvest Rain) + 0.658*(Growing Temp) + 0.004*(Harvest Temp) + 0.024*(Age) - 0.029*(France Population) + 0.109*(US Alcohol)

n Use the model to compute predictions and residuals for each observation in the test set (observation years 1986 through 2000)
n Example: prediction for year 1998: 6.932 = -4.966 + 0.001*(693.4) + … + 0.109*(8.10)
n Actual 1998 log(Price Index) = 6.858
n Residual = -0.074 = 6.858 – 6.932

n How good is this prediction? Well, let’s look at all of the test set data records and compute a version of R2, which we call OSR2


+ 57

Out-of-Sample R2 (OSR2 )

OSR2 = 1 − (sum of squared residuals of the regression model on the Test Set) / (sum of squared residuals of the baseline model applied to the Test Set)

     = 1 − [Σ_{t=1986}^{2000} (yt − ŷt)²] / [Σ_{t=1986}^{2000} (yt − 7.084)²]

     = 0.54
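A sketch of the OSR2 computation, continuing the fit/train/test sketches above; note that the baseline prediction on the test set is the training-set mean (7.084), not the test-set mean:

pred     <- predict(fit, newdata = test)
y_test   <- test$LogAuctionIndex
SSE_test <- sum((y_test - pred)^2)
SST_test <- sum((y_test - mean(train$LogAuctionIndex))^2)  # baseline = ȳ from training
1 - SSE_test / SST_test                                    # OSR2, 0.54 on this slide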


+ 58

Out-of-Sample R2 (OSR2 )

n OSR2 is an assessment of the real-world performance of the model we have built

n It should only be computed once, at the end of your analysis, as a final metric

n If OSR2 is significantly smaller than R2 (on the training data), this is an indicator of potential overfitting


+ 59

Overfitting
n Overfitting occurs when the estimated model fits the noise in the training data

n All statistical learning methods are at risk for overfitting


+ 60

Overfitting

n Overfitting is more likely when:
n The number of parameters to be estimated is large
n Data is limited

n Care must be taken to make sure that the model we estimate does not suffer from overfitting
n We will see how to address this issue throughout the course, including today’s lecture

n Overfitting is related to the “bias-variance tradeoff”


+ 61

Flexible Statistical Learning Methods

n Flexible (usually non-parametric) statistical learning methods are able to capture complicated relationships
n Linear regression is relatively inflexible
n Flexibility usually implies that:
n The resulting model is less interpretable
n The method requires more data to produce an accurate estimate than a less flexible method
n There is an increased risk of overfitting

n We will see examples of flexible, non-parametric methods later in the course


+ 62

Bias and Variance of Learning Methods

n Bias refers to the error that is introduced by modeling a complicated relationship with a simple one
n Less flexible methods have more bias

n Variance refers to the amount that our estimated function changes when you slightly change the dataset
n More flexibility usually comes at the cost of higher variance

n The bias-variance tradeoff is a common theme in this course that we will continue discussing
+ 63

The Bias-Variance Tradeoff

[Plot: error vs. model complexity]

Error is measured on a test set

“Model Complexity” is a synonym for “Model Flexibility”


+
Significance Testing, Multicollinearity, and Other Issues


+ 65

Some Important Questions

n Do all of the predictors help to explain the response? Which variables are “significant”?

n Is at least one of the predictors X1, X2, ..., Xp useful in predicting the response Y?


+ 66

Testing the Significance of Regression Coefficients

n Is the independent variable Xj useful in predicting the response Y?
n Does US Alcohol Consumption help to predict log(price index)?

n In other words, is βj ≠ 0?

n This is an inference question, and can be addressed with a hypothesis test:

  H0: βj = 0  vs.  Ha: βj ≠ 0


+ 67

Testing the Significance of Regression Coefficients

H0: βj = 0  vs.  Ha: βj ≠ 0

n The hypothesis test is equivalent to looking at confidence intervals

n Reject the null hypothesis at significance level α if and only if the (1−α)% confidence interval does not contain 0

[Plot: confidence intervals for each coefficient estimate (WinterRain, HarvestRain, GrowTemp, HarvestTemp, Age, FrancePop, USAlcConsump), with terms on the vertical axis and estimates on the horizontal axis]
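The intervals plotted above can be obtained directly in R; a one-line sketch assuming fit is the seven-variable wine model from the earlier sketch:

confint(fit, level = 0.95)   # a coefficient is significant at the 5% level
                             # exactly when its interval excludes 0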


+
Interlude on “Standard Assumptions” for Linear Regression


+ 69

A Useful Set of Conceptual Assumptions

n Question: Where do the previous confidence intervals come from?

n Answer: Some of the statistical analysis associated with linear regression is derived from a certain set of assumptions regarding how the data is generated


+ 70

A Useful Set of Conceptual Assumptions

n 1.) The observed data (xi, yi), i = 1, ..., n, satisfies yi = β0 + β1 xi1 + ... + βp xip + εi, where β = (β0, β1, ..., βp) are the true but unknown regression coefficients and the εi are noise terms

n 2.) ε1, ..., εn are independent and identically distributed normal random variables with mean 0 and variance σ²

n 3.) If the features xi are also regarded as random variables, then they are independent of the noise terms ε1, ..., εn
+ 71

Consequences of the assumptions

n Under the previous set of assumptions, it is possible to prove mathematically that:

n 1.) β̂ is an unbiased estimator of the true vector of coefficients β:  E[β̂ | X] = β

n 2.) The covariance matrix of β̂ given X is:  Cov(β̂ | X) = σ² (XᵀX)⁻¹

n 3.) β̂ is a normally distributed random vector given X
+ 72

Constructing a confidence interval

n Given the formula Cov(β̂ | X) = σ² (XᵀX)⁻¹, we can read off the diagonal entries of this matrix to get the standard errors SE(β̂j) for each coefficient

n Given that β̂ is normally distributed, we can now easily construct confidence intervals in the usual way, i.e., for some z-score (such as z* = 1.96):

  β̂j ± z* · SE(β̂j)

n Question: What’s the problem?
+ 73

Constructing a confidence interval

n Question: What’s the problem?

n Answer: we usually don’t know σ² and must estimate it from the data in order to construct the matrix σ̂² (XᵀX)⁻¹

n Letting e = y − Xβ̂ denote the vector of training set residuals, then use the estimate:

  σ̂² = ‖e‖₂² / (n − p − 1)
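Continuing the earlier synthetic-data sketch (X, y, beta_hat), the standard errors and approximate confidence intervals can be assembled by hand:

e <- y - X %*% beta_hat                            # training set residuals
sigma2_hat <- sum(e^2) / (nrow(X) - ncol(X))       # divide by n − (p + 1)
se <- sqrt(diag(sigma2_hat * solve(t(X) %*% X)))   # matches summary(lm(y ~ x1 + x2))
cbind(beta_hat - 1.96 * se, beta_hat + 1.96 * se)  # approximate 95% intervals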


+ 74

Take-home message of this interlude

n It is important to understand the assumptions that lead to the results of your analysis (e.g., which variables you retain in your model)

n Ultimately though – regardless of whether you believe or doubt that the assumptions hold for your dataset – it is critical to validate your final model on an out-of-sample testing set


+
Back to Significance Testing and Other Issues


+ 76

Testing the Significance of Regression Coefficients in R

n R shows stars * (literally!) for the significant coefficients
n The more stars, the more significant. To be significant at the 5% level (95% confidence interval), the coefficient must have at least one *
n The p-value (Pr(>|t|)) is the boundary point where we switch from significant to not significant (essentially the smallest α such that the coefficient is significant at level α)
n Smaller p-values are better


+ 77

Testing the Significance of Regression Coefficients in R

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.9662699 9.3823951 -0.529 0.60166
WinterRain 0.0011863 0.0005628 2.108 0.04616 *
HarvestRain -0.0033137 0.0010650 -3.112 0.00491 **
GrowTemp 0.6582753 0.1221937 5.387 1.79e-05 ***
HarvestTemp 0.0044212 0.0599935 0.074 0.94189
Age 0.0240080 0.0507587 0.473 0.64068
FrancePop -0.0290258 0.1369627 -0.212 0.83403
USAlcConsump 0.1092561 0.1678945 0.651 0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom


Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06

• Are there coefficients that you are not comfortable with?
• Let’s return to this question in a moment
+ 78

Testing the Significance of the Entire Model

n A more basic question: is the model worth anything at all?

n Frame this question as a hypothesis test:

  H0: β1 = β2 = ... = βp = 0  vs.  Ha: at least one βj ≠ 0

n R reports the F-statistic and corresponding p-value
n Again, a small p-value is good!
n Why is this not the same as checking the p-value of each coefficient?


+ 79

Testing the Significance of the Entire Model

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.9662699 9.3823951 -0.529 0.60166
WinterRain 0.0011863 0.0005628 2.108 0.04616 *
HarvestRain -0.0033137 0.0010650 -3.112 0.00491 **
GrowTemp 0.6582753 0.1221937 5.387 1.79e-05 ***
HarvestTemp 0.0044212 0.0599935 0.074 0.94189
Age 0.0240080 0.0507587 0.473 0.64068
FrancePop -0.0290258 0.1369627 -0.212 0.83403
USAlcConsump 0.1092561 0.1678945 0.651 0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom


Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06

• Are there coefficients that you are not comfortable with?
• Why might the last four coefficients not be significant?
+ 80

Plot of Age versus France Population

[Scatter plot: France Population at Time of Vintage (Millions) vs. Age of Vintage – the points fall almost exactly on a straight downward-sloping line]

n The data for Age and France population are highly correlated

n This is evidence of multicollinearity


+ 81

Multicollinearity
n Occurs when two or more predictors are highly correlated
n Makes the estimated coefficients β̂ = (β̂0, β̂1, ..., β̂p) very sensitive to noise in the training data
n Thus can produce very inaccurate estimates, which hurts interpretability and possibly predictive performance

n Tell-tale signs:
n Some of the estimated coefficients have the “wrong” sign
n Some of the coefficients are not significantly different from zero

n Multicollinearity can usually be fixed by deleting one or more independent variables


+ 82

Correlation Table
[Scatterplot matrix repeated from slide 18; pairwise correlations tabulated below]

                 WinterRain  HarvestRain  GrowTemp  HarvestTemp    Age  FrancePop  USAlcConsump
LogAuctionIndex        0.06        −0.53      0.56         0.47   0.01      −0.08         −0.27
WinterRain                         −0.12     −0.21        −0.05   0.03      −0.05          0.00
HarvestRain                                   0.04        −0.41  −0.13       0.11         −0.22
GrowTemp                                                   0.51  −0.60       0.52         −0.35
HarvestTemp                                                      −0.28       0.25         −0.04
Age                                                                         −0.99         −0.13
FrancePop                                                                                  0.27


+ 83

Multicollinearity

n Multicollinearity can exist without evidence of large correlations in the correlation table

n Better to check the VIFs (variance inflation factors):

  WinterRain  HarvestRain  GrowTemp  HarvestTemp        Age  FrancePop  USAlcConsump
    1.295370     1.578682  1.700079     2.198191  66.936256  81.792302     10.441217

n Rule of thumb:
n VIF > 10: definitely a problem
n VIF > 5: could be a problem
n VIF <= 5: probably okay


+ 84

What is VIF?

n Consider regressing each predictor variable Xj on all of the others:

  Xj = α0 + α1 X1 + ... + α(j−1) X(j−1) + α(j+1) X(j+1) + ... + αp Xp

n If the R2 for the above regression (call it Rj²) is equal to 1, then there exists a perfect linear relationship between Xj and all other independent variables (at least according to the training data)

n So, define:

  VIFj = 1 / (1 − Rj²)
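A sketch of this definition applied to Age (assuming the train data frame from the earlier sketches); the car package's vif() function returns the same values for all predictors at once:

fit_age <- lm(Age ~ WinterRain + HarvestRain + GrowTemp + HarvestTemp
              + FrancePop + USAlcConsump, data = train)
1 / (1 - summary(fit_age)$r.squared)   # ~66.9, matching the VIF table on slide 83
# car::vif(fit) computes all seven VIFs in one call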


+ 85

How do we deal with multicollinearity?

n One approach (sketched in R below):
n Remove a variable with a high VIF, but if there is a “tie” then keep the variables that you “like”
n Iterate this procedure

n This issue falls under the realm of model selection – the process of finding the best model

n Model selection is still somewhat of an art, but we will see some principled approaches later in the course
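A sketch of this iterative pruning in R (assuming fit and train as above), mirroring the sequence on slides 86–90:

library(car)
vif(fit)                                   # FrancePop has the largest VIF (~82): drop it
fit <- update(fit, . ~ . - FrancePop)
vif(fit)                                   # recheck; then prune the insignificant
fit <- update(fit, . ~ . - USAlcConsump)   # USAlcConsump and, after one more look,
fit <- update(fit, . ~ . - HarvestTemp)    # HarvestTemp
summary(fit)                               # the four-variable model of slide 90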


+ 86

VIF Values for the Wine Model

Coefficients:
              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -4.9662699   9.3823951   -0.529   0.60166
WinterRain    0.0011863   0.0005628    2.108   0.04616 *
HarvestRain  -0.0033137   0.0010650   -3.112   0.00491 **
GrowTemp      0.6582753   0.1221937    5.387  1.79e-05 ***
HarvestTemp   0.0044212   0.0599935    0.074   0.94189
Age           0.0240080   0.0507587    0.473   0.64068
FrancePop    -0.0290258   0.1369627   -0.212   0.83403
USAlcConsump  0.1092561   0.1678945    0.651   0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom
Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06

Coefficient      VIF
WinterRain      1.30
HarvestRain     1.58
GrowTemp        1.70
HarvestTemp     2.20
Age            66.94
FrancePop      81.79
USAlcConsump   10.44


+
Building our Better Wine Model

Coefficients:
              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -6.8404548   3.0706463   -2.228   0.03553 *
WinterRain    0.0012145   0.0005359    2.266   0.03274 *
HarvestRain  -0.0033611   0.0010203   -3.294   0.00305 **
GrowTemp      0.6671389   0.1125053    5.930  4.05e-06 ***
HarvestTemp   0.0020543   0.0577600    0.036   0.97192
Age           0.0340519   0.0178084    1.912   0.06787 .
USAlcConsump  0.0933334   0.1471271    0.634   0.53184
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3241 on 24 degrees of freedom
Multiple R-squared: 0.789, Adjusted R-squared: 0.7362
F-statistic: 14.95 on 6 and 24 DF, p-value: 4.604e-07

Coefficient      VIF
WinterRain      1.22
HarvestRain     1.51
GrowTemp        1.50
HarvestTemp     2.12
Age             8.58
USAlcConsump    8.35


+
Building our Better Wine Model

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -5.215161    1.672215   -3.119  0.004532 **
WinterRain   0.001119    0.000508    2.202  0.037112 *
HarvestRain -0.003437    0.001001   -3.433  0.002089 **
GrowTemp     0.664336    0.111067    5.981  3.02e-06 ***
HarvestTemp -0.006650    0.055432   -0.120  0.905462
Age          0.023466    0.006143    3.820  0.000785 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3202 on 25 degrees of freedom
Multiple R-squared: 0.7854, Adjusted R-squared: 0.7425
F-statistic: 18.3 on 5 and 25 DF, p-value: 1.213e-07

Coefficient      VIF
WinterRain      1.13
HarvestRain     1.49
GrowTemp        1.50
HarvestTemp     2.00
Age             1.04


+
Building our Better Wine Model

Coefficients:
              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept) -5.2163945   1.6401825   -3.180  0.003782 **
WinterRain   0.0011116   0.0004949    2.246  0.033424 *
HarvestRain -0.0033766   0.0008504   -3.971  0.000505 ***
GrowTemp     0.6569271   0.0905520    7.255  1.05e-07 ***
Age          0.0235571   0.0059785    3.940  0.000546 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3141 on 26 degrees of freedom
Multiple R-squared: 0.7853, Adjusted R-squared: 0.7523
F-statistic: 23.78 on 4 and 26 DF, p-value: 2.307e-08

Coefficient      VIF
WinterRain      1.11
HarvestRain     1.12
GrowTemp        1.04
Age             1.03


+ 90

A Very Good Wine Model


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.2163945 1.6401825 -3.180 0.003782 **
WinterRain 0.0011116 0.0004949 2.246 0.033424 *
HarvestRain -0.0033766 0.0008504 -3.971 0.000505 ***
GrowTemp 0.6569271 0.0905520 7.255 1.05e-07 ***
Age 0.0235571 0.0059785 3.940 0.000546 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3141 on 26 degrees of freedom


Multiple R-squared: 0.7853, Adjusted R-squared: 0.7523
F-statistic: 23.78 on 4 and 26 DF, p-value: 2.307e-08

n R2 = 0.79 (previously 0.79)

n All coefficients are significantly different from zero
n OSR2 = 0.75
n This model is not really different from Orley Ashenfelter’s model


+ 91

Other Potential Fit Problems

n Nonlinear dependence on the features
n We will discuss this later in the course

n Correlation of residuals
n Non-constant variance of residuals
n Outliers/high-leverage points

n The last three are not a major concern in this course, but it’s always healthy to plot your data, including residual plots
n See James et al., Section 3.3.3 for more details


+
A Better Wine Model Using Categorical Variables


+ 93

Towards an Even Better Model


n The previous model predicts a price index
n OSR2 = 0.75 – pretty good, but not really “great”
n It would be better if we could predict the actual price for a given winery – then we could use the model in direct support of the auction
n This is the “big data” era, yet we have only one price index for each year back to 1952
n Even if we were to look at an individual winery, we would still only have a few dozen data records
n Wouldn’t it be great if we could use the separate data from all wineries in all years? Then we could take advantage of more data.
n Let’s see how we can do this
+ 94

Map of Bordeaux Region

[Map of the Bordeaux region marking five wineries: Cos d’Estournel, Lafite-Rothschild, Beychevelle, Giscours, and Cheval Blanc]

Map data © 2015 Google


+ 95

All-Wineries Data

     Year  LogAuction  Winery  Age  WinterRain  HarvestRain  GrowTemp  HarvestTemp  FrancePop  USAlcConsump


1 1952 6.653108 Cheval Blanc 63 566.4 165.5 17.28 14.39 42.46 7.85
2 1952 6.861502 Lafite-Rothschild 63 566.4 165.5 17.28 14.39 42.46 7.85
3 1953 6.664192 Cheval Blanc 62 653.3 75.6 16.94 17.64 42.75 8.03
4 1955 6.311426 Cheval Blanc 60 504.3 129.5 17.30 17.13 43.43 7.84
5 1955 6.550209 Lafite-Rothschild 60 504.3 129.5 17.30 17.13 43.43 7.84
6 1959 5.380957 Beychevelle 56 377.0 182.6 17.68 19.28 45.24 7.89
7 1959 7.437242 Cheval Blanc 56 377.0 182.6 17.68 19.28 45.24 7.89
8 1959 7.645302 Lafite-Rothschild 56 377.0 182.6 17.68 19.28 45.24 7.89
9 1960 6.405873 Lafite-Rothschild 55 748.2 290.6 16.67 16.18 45.68 8.02
10 1961 5.813802 Beychevelle 54 747.8 37.7 17.64 21.05 46.16 8.08
11 1961 7.311178 Cheval Blanc 54 747.8 37.7 17.64 21.05 46.16 8.08
12 1961 5.822247 Cos d'Estournel 54 747.8 37.7 17.64 21.05 46.16 8.08
13 1961 6.673045 Lafite-Rothschild 54 747.8 37.7 17.64 21.05 46.16 8.08
14 1962 6.747610 Cheval Blanc 53 639.4 51.8 16.58 17.86 47.00 8.13
15 1962 5.416100 Cos d'Estournel 53 639.4 51.8 16.58 17.86 47.00 8.13
16 1962 6.298839 Lafite-Rothschild 53 639.4 51.8 16.58 17.86 47.00 8.13
17 1964 4.354270 Beychevelle 51 326.5 96.1 17.63 19.43 48.31 8.46
18 1964 6.492785 Cheval Blanc 51 326.5 96.1 17.63 19.43 48.31 8.46
19 1964 6.011610 Lafite-Rothschild 51 326.5 96.1 17.63 19.43 48.31 8.46
20 1966 5.957908 Cheval Blanc 49 734.0 85.2 16.81 18.82 49.16 8.78
... ... ... ... ... ... ... ... ... ...
147 2000 7.060588 Lafite-Rothschild 15 487.8 69.0 18.73 19.45 59.05 8.24



+ 96

Training and Testing Set (Split by Year)

Full Dataset (N = 147): observations 1 (1952 vintages) through 147 (2000 vintages)

Training Set (N = 83): vintages 1952–1985

       Year  LogAuction  Winery             ...  USAlcConsump
    1  1952  6.653108    Cheval Blanc       ...  7.85
    2  1952  6.861502    Lafite-Rothschild  ...  7.85
    3  1953  6.664192    Cheval Blanc       ...  8.03
    4  1955  6.311426    Cheval Blanc       ...  7.84
    5  1955  6.550209    Lafite-Rothschild  ...  7.84
    6  1959  5.380957    Beychevelle        ...  7.89
    7  1959  7.437242    Cheval Blanc       ...  7.89
    8  1959  7.645302    Lafite-Rothschild  ...  7.89
    9  1960  6.405873    Lafite-Rothschild  ...  8.02
   10  1961  5.813802    Beychevelle        ...  8.08
  ...   ...  ...         ...                ...  ...
   81  1985  6.018934    Cheval Blanc       ...  9.88
   82  1985  4.885072    Cos d'Estournel    ...  9.88
   83  1985  6.296612    Lafite-Rothschild  ...  9.88

Testing Set (N = 64): vintages 1986–2000

       Year  LogAuction  Winery             ...  USAlcConsump
   84  1986  4.549235    Beychevelle        ...  9.75
   85  1986  5.721852    Cheval Blanc       ...  9.75
   86  1986  4.987435    Cos d'Estournel    ...  9.75
  ...   ...  ...         ...                ...  ...
  146  2000  4.330339    Giscours           ...  8.24
  147  2000  7.060588    Lafite-Rothschild  ...  8.24


+
Regression Model
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2184029 3.4314946 -0.064 0.9494
Age 0.0528857 0.0118313 4.470 2.62e-05 ***
WinterRain 0.0018288 0.0009861 1.855 0.0674 .
HarvestRain 0.0024980 0.0019812 1.261 0.2111
GrowTemp 0.1209556 0.1926703 0.628 0.5320
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9639 on 78 degrees of freedom


Multiple R-squared: 0.2221, Adjusted R-squared: 0.1822
F-statistic: 5.567 on 4 and 78 DF, p-value: 0.0005389

n Why is the R2 low?



+
2015 Auction Prices by Winery and Year

[Plot: log(2015 Auction Price) vs. Vintage Year (1950–2000), one series per winery: Beychevelle, Cheval Blanc, Cos d’Estournel, Giscours, Lafite-Rothschild]


+ 99

Categorical Variables With Two Levels

n For illustration, we will only look at data from two wineries:
n Cheval Blanc – one of the most expensive wineries
n Cos d’Estournel – one of the least expensive wineries

n Two categories correspond to adding one categorical variable

n We will call our variable WineryCos d’Estournel
n Value 1: Wine is from Cos d’Estournel
n Value 0: Wine is from Cheval Blanc


+ 100

2015 Auction Prices for Two Wineries

[Plot: log(2015 Auction Price) vs. Vintage Year for Cheval Blanc and Cos d’Estournel]


+
Categorical Variables with Two Levels

     Year  LogAuction  WineryCos d'Estournel  Age  WinterRain  HarvestRain  GrowTemp


1 1952 6.653108 0 63 566.4 165.5 17.28
2 1953 6.664192 0 62 653.3 75.6 16.94
3 1955 6.311426 0 60 504.3 129.5 17.30
4 1959 7.437242 0 56 377.0 182.6 17.68
5 1961 7.311178 0 54 747.8 37.7 17.64
6 1961 5.822247 1 54 747.8 37.7 17.64
7 1962 6.747610 0 53 639.4 51.8 16.58
8 1962 5.416100 1 53 639.4 51.8 16.58
9 1964 6.492785 0 51 326.5 96.1 17.63
10 1966 5.957908 0 49 734.0 85.2 16.81
11 1966 4.809416 1 49 734.0 85.2 16.81
12 1967 6.146929 0 48 646.9 118.1 16.51
13 1970 5.536231 0 45 563.5 88.8 16.92
14 1970 4.254051 1 45 563.5 88.8 16.92
15 1971 5.975843 0 44 488.4 111.9 17.20
16 1973 4.207376 1 42 357.2 122.6 17.41
17 1974 5.290386 0 41 503.6 185.1 16.39
18 1975 5.689684 0 40 501.8 170.5 17.23
19 1975 4.397162 1 40 501.8 170.5 17.23
... ... ... ... ... ... ...
35 1985 4.885072 1 30 667.1 37.2 17.19



+ 102

A Two-Category Model

Auction Price = β0
              + β1 · WineryCos d’Estournel
              + β2 · Age
              + β3 · WinterRain
              + β4 · HarvestRain
              + β5 · GrowTemp

n What is the interpretation?
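In R the dummy variable does not need to be built by hand: a factor column is encoded automatically. A sketch, assuming a data frame two restricted to the two wineries:

two$Winery <- factor(two$Winery)   # levels: Cheval Blanc (baseline), Cos d'Estournel
fit2 <- lm(LogAuction ~ Winery + Age + WinterRain + HarvestRain + GrowTemp,
           data = two)
summary(fit2)   # shows one dummy coefficient, named "WineryCos d'Estournel"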


+
The Two-Category Model
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.6292230 2.3910006 -2.773 0.00962 **
WineryCos d'Estournel -1.3616758 0.1393778 -9.770 1.12e-10 ***
Age 0.0357669 0.0072019 4.966 2.79e-05 ***
WinterRain 0.0016274 0.0006888 2.363 0.02506 *
HarvestRain -0.0015879 0.0015803 -1.005 0.32330
GrowTemp 0.6093432 0.1397853 4.359 0.00015 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3874 on 29 degrees of freedom


Multiple R-squared: 0.8598, Adjusted R-squared: 0.8356
F-statistic: 35.55 on 5 and 29 DF, p-value: 1.624e-11

n Note that R2 = 0.86

n All variables are significant except HarvestRain



+
Let’s go back to the All-Wineries Data

(Same all-wineries data table as on slide 95, shown through observation 83, the 1985 vintages.)


+ 105

Categorical Variables with More Than Two Levels

n We need k−1 dummy variables to work with k categories (why?)
n Variable WineryCheval Blanc: 1 if from Cheval Blanc, otherwise 0
n Variable WineryCos d’Estournel: 1 if from Cos d’Estournel, otherwise 0
n Variable WineryGiscours: 1 if from Giscours, otherwise 0
n Variable WineryLafite-Rothschild: 1 if from Lafite-Rothschild, otherwise 0
n All variables 0 if the wine is from Beychevelle
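R's model.matrix() makes the k−1 encoding visible; a sketch assuming the all-wineries data frame wine with Winery as a factor:

head(model.matrix(~ Winery, data = wine))
# Columns: (Intercept), WineryCheval Blanc, WineryCos d'Estournel,
# WineryGiscours, WineryLafite-Rothschild. Beychevelle, the first level
# alphabetically, is absorbed into the baseline (all dummies 0).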
+ 106

A Model with More Than Two Categories

Auction Price = β0
              + β1 · WineryCheval Blanc
              + β2 · WineryCos d’Estournel
              + β3 · WineryGiscours
              + β4 · WineryLafite-Rothschild
              + β5 · Age
              + β6 · WinterRain
              + β7 · HarvestRain
              + β8 · GrowTemp


+
Model with Categorical Data for Five
Wineries
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.4945857 1.5757380 -2.852 0.005622 **
WineryCheval Blanc 1.6425245 0.1518157 10.819 < 2e-16 ***
WineryCos d'Estournel 0.2754099 0.1649803 1.669 0.099274 .
WineryGiscours -0.2992903 0.1934825 -1.547 0.126163
WineryLafite-Rothschild 1.8941459 0.1481200 12.788 < 2e-16 ***
Age 0.0307904 0.0054819 5.617 3.23e-07 ***
WinterRain 0.0016349 0.0004462 3.665 0.000463 ***
HarvestRain 0.0003949 0.0009050 0.436 0.663899
GrowTemp 0.3875778 0.0886121 4.374 3.93e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4336 on 74 degrees of freedom


Multiple R-squared: 0.8506, Adjusted R-squared: 0.8345
F-statistic: 52.68 on 8 and 74 DF, p-value: < 2.2e-16

n R2 = 0.85 is excellent

n OSR2 = 0.81 is also excellent


+
Prediction for Cos d’Estournel

n Consider making a prediction for the 2014 vintage
n Cos d’Estournel winery
n Aged 1 year in 2015
n 522.3 mm of winter rain
n 78.9 mm of harvest rain
n Average growing season temperature of 18.23 °C

LogAuctionPrice = -4.495 + 1.643*(0) + 0.2754*(1) – 0.299*(0) + 1.894*(0)
                + 0.031*(1) + 0.002*(522.3) + … + 0.388*(18.23)
                = 3.762


+
Prediction for Giscours

n Consider making a prediction for the 2014 vintage
n Giscours winery
n Aged 1 year in 2015
n 522.3 mm of winter rain
n 78.9 mm of harvest rain
n Average growing season temperature of 18.23 °C

LogAuctionPrice = -4.495 + 1.643*(0) + 0.2754*(0) – 0.299*(1) + 1.894*(0)
                + 0.031*(1) + 0.002*(522.3) + … + 0.388*(18.23)
                = 3.188


+
Prediction for Beychevelle

n Consider making a prediction for the 2014 vintage
n Beychevelle winery
n Aged 1 year in 2015
n 522.3 mm of winter rain
n 78.9 mm of harvest rain
n Average growing season temperature of 18.23 °C

LogAuctionPrice = -4.495 + 1.643*(0) + 0.2754*(0) – 0.299*(0) + 1.894*(0)
                + 0.031*(1) + 0.002*(522.3) + … + 0.388*(18.23)
                = 3.487
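The three predictions above can be reproduced with predict(); a sketch assuming fit5 is the five-winery model from slide 107:

newdata <- data.frame(Winery = c("Cos d'Estournel", "Giscours", "Beychevelle"),
                      Age = 1, WinterRain = 522.3, HarvestRain = 78.9,
                      GrowTemp = 18.23)
predict(fit5, newdata)        # ~3.762, 3.188, 3.487 on the log scale
exp(predict(fit5, newdata))   # back-transform to 2015 auction prices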


+ 111

Showdown in The New York Times


n Showdown on the front page of The New York Times in 1990

n Parker: the 1986 vintage will be “very good to sometimes exceptional”

n Ashenfelter: the 1986 vintage will be mediocre, but the 1989 vintage will be “stunningly good”

n Experts like Parker hadn’t even had a chance to taste the 1989 vintage
+ 112

Years Later, the Winner is Clear

[Plot: log(Auction Index) by vintage year, 1985–1989]


+ 113

A Convergence of Results

Though most critics never acknowledged the value of Ashenfelter’s models, over time the predictions of the models and the experts have converged.

Ashenfelter:

“Unlike the past, the tasters no longer make any horrendous mistakes. Frankly, I kind of killed myself. I don’t have much value added anymore.”


+ 114

Conclusion

n A linear regression model with only a few variables can predict wine prices well

n In many cases, the model outperforms wine experts’ judgments

n A quantitative approach to a traditionally qualitative problem

n Regression capabilities are enhanced by a user who knows how to think with data and learn from the data


+ 115

n Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie, and R. Tibshirani

n Thanks to Rob Freund and John Silberholz (MIT) for the wine datasets
