
+

Week 2 – Predicting Wine Quality with Linear Regression I
IEOR 242 – Applications in Data Analysis
Spring 2020 – Week 2
+ 2

Announcements

n Enrollment
n Concurrent: need to wait one more week to be officially added

n Waitlist
n Still awaiting access to the system to manage it
n Definitely will have room for any MEng or PhD student… will try to add as many as allowable


+ 3

Announcements

n Survey feedback
n https://docs.google.com/forms/d/e/1FAIpQLSeK-xJPWlDxxBTd6-OxgEs6QNkJiWEpFQFtMhxNwNd4JTNETw/viewform

n Updated syllabus (still in progress)

n ~One guest lecturer per month, targeting more advanced topics at the end; the tradeoff: more material covered per lecture, and a greater expectation of individual knowledge/reading

n Office Hours: W 2-4p, 2nd Floor Blum Hall Suites


+ 4

Announcements

n Possible Guest Speakers and Possible Topics

n Risk Modeler (Civil Eng PhD); Application: Data Breach Model using GLMs
n Epidemiologist (Public Health PhD); Application: Fall Prevention in Older Adults using Fuzzy Clustering
n Search Engineer (Comp Sci PhD); Application: Food and Beverage Click-Through Rates using Ensemble Trees
n Biostatistician (Comp Sci PhD); Application: Precision Medicine/Gene Mapping using Deep Learning
n Actuary (Physics MS); Application: Commercial Lines Residual Model Pricing using Ensemble Trees


+ 5

(Additional Qs for Guest Speakers)


Final Project Questions
n Why is/was that problem important to you? The business?
n What are the results/takeaways/impact?
n (What else did you need to do to get it implemented?)
n (How many people worked on it and for how long?)
n (What other non-math/statistical considerations did you have?)
n What methods did you use?
n What alternatives did you consider?
n How did you evaluate efficacy?
n (How do you continue monitoring for efficacy?)


+ 6

Today’s Agenda

n Predicting wine quality with linear regression

n Model validation, overfitting, and other issues

n Significance, multicollinearity, and other issues

n An improved wine model with categorical variables (most likely next week)


+
Predicting Wine Quality



+ 8

Vintage Bordeaux Wine

n Vintage wine vs. non-vintage wine?

n Large differences in price and quality in different years, even though the wine is produced in a similar way

n Meant to be aged, so it is hard to know the quality of the wine when it initially goes on the market

n Expert tasters predict which wines will be good

n Can analytics be used to develop a different system for assessing the quality of wine?
+ 9

Wine Quality – Ask the Expert



+ 10

Predicting the Quality of Wine

n March 1990: Orley Ashenfelter, a Princeton economics professor, claims he can predict wine quality without tasting the wine


+ 11

Using Linear Regression


n Ashenfelter used (multiple) linear regression
n Predicts a continuous response variable – the dependent variable
n Prediction is based on a set of independent variables

n Independent variables (features):
n Age – older wines are more expensive
n Weather
n Average Growing Season Temperature
n Harvest Rain
n Winter Rain

n Dependent variable:
n Price Index – a composite metric across many different wineries in thousands of wine auctions in the years 1990-1991
n His model used Log(Price Index)
+ 12

Why Log(Price Index)?

n Produces a better linear fit

n The better fit is revealed through plotting
n The log( ) transformation also arises intrinsically, especially in settings where “growth” or “proportion” are natural phenomena


+ 13

US National Debt (1950 – 2014)

Coefficient for Year = 206; R2 = 0.72

[Scatter plot: US national debt vs. year, with fitted line]


+ 14

US National Debt (1950 – 2014)

Coefficient for Year = 0.075; R2 = 0.96

[Scatter plot: log(US national debt) vs. year, with fitted line]
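The contrast on these two slides can be reproduced with a short R sketch; the series below is stylized (smooth exponential growth at ~7.5% per year), not the actual debt data:

set.seed(242)
years <- 1950:2014
debt  <- 2.5 * exp(0.075 * (years - 1950) + rnorm(length(years), sd = 0.05))
fit_raw <- lm(debt ~ years)        # linear fit on the raw scale
fit_log <- lm(log(debt) ~ years)   # linear fit on the log scale
summary(fit_raw)$r.squared         # mediocre, like the 0.72 on slide 13
summary(fit_log)$r.squared         # near 1, like the 0.96 on slide 14
coef(fit_log)["years"]             # slope ~0.075 = the annual growth rate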


+ 15

The Expert’s Reaction

Robert Parker, the world's most influential wine expert at the time:

“Ashenfelter is an absolute total sham”

“Really a Neanderthal way of looking at wine”

“Rather like a movie critic who never goes to see the movie but tells you how good it is based on the actors and the director”


+ 16

Vintage Wine Data

n Log(price index) based on 2015 auction prices

n Winter rain (mm)
n Harvest rain (mm)
n Average temperature in growing season (Celsius)
n Average temperature in harvest season (Celsius)
n Age of wine (years since vintage)
n Population of France
n US alcohol consumption (per capita, in liters of 100% alcohol)


+ 17

Vintage Wine Data, cont.


p = # of independent variables (p = 7); n = # of observations (n = 46)

        y                x1          x2           x3        x4           x5    x6         x7
  Year  LogAuctionIndex  WinterRain  HarvestRain  GrowTemp  HarvestTemp  Age   FrancePop  USAlcConsump
  1952  7.4950           566.4       165.5        17.28     14.39        63    42.46       7.85
  1953  8.0393           653.3        75.6        16.94     17.64        62    42.75       8.03
  1955  7.6858           504.3       129.5        17.30     17.13        60    43.43       7.84
  1957  6.9845           390.8       110.4        16.31     16.47        58    44.31       7.77
  1958  6.7772           538.8       187.0        16.82     19.72        57    44.79       7.74
  1959  8.0757           377.0       182.6        17.68     19.28        56    45.24       7.89
  1960  6.5188           748.2       290.6        16.67     16.18        55    45.68       8.02
  1961  8.4937           747.8        37.7        17.64     21.05        54    46.16       8.08
  1962  7.3880           639.4        51.8        16.58     17.86        53    47.00       8.13
  1964  7.3094           326.5        96.1        17.63     19.43        51    48.31       8.46
  1965  6.2518           548.4       266.6        15.71     15.33        50    48.76       8.62
  1966  7.7443           734.0        85.2        16.81     18.82        49    49.16       8.78
  1967  6.8398           646.9       118.1        16.51     17.16        48    49.55       9.03
  1968  6.2435           508.6       292.1        16.37     16.77        47    49.91       9.28
  1969  6.3459           480.1       243.9        16.65     16.89        46    50.32       9.53
  1970  7.5883           563.5        88.8        16.92     18.69        45    50.77       9.78
  1971  7.1934           488.4       111.9        17.20     17.28        44    51.25       9.99
  1972  6.2049           465.1       157.3        15.27     15.04        43    51.70      10.10
  1973  6.6367           357.2       122.6        17.41     18.50        42    52.12      10.37
  1974  6.2941           503.6       185.1        16.39     16.48        41    52.46      10.48
  ...   ...              ...         ...          ...       ...          ...   ...        ...
  2000  8.1817           487.8        69.0        18.73     19.45        15    59.05       8.24

(In the notation above, observation 1 – the 1952 vintage – has y1 = 7.4950, x11 = 566.4, x21 = 165.5, ..., x71 = 7.85; observation n = 46 – the 2000 vintage – has yn = 8.1817, x1n = 487.8, ..., x7n = 8.24.)

Why do the observations stop after the year 2000?
+ 18

Vintage Wine Data, cont.


[Scatterplot matrix of all eight variables; the pairwise correlations from its upper panel are tabulated below]

                 WinterRain  HarvestRain  GrowTemp  HarvestTemp    Age  FrancePop  USAlcConsump
LogAuctionIndex        0.06        −0.53      0.56         0.47   0.01      −0.08         −0.27
WinterRain                         −0.12     −0.21        −0.05   0.03      −0.05          0.00
HarvestRain                                   0.04        −0.41  −0.13       0.11         −0.22
GrowTemp                                                   0.51  −0.60       0.52         −0.35
HarvestTemp                                                      −0.28       0.25         −0.04
Age                                                                         −0.99         −0.13
FrancePop                                                                                  0.27


+
Linear Regression



+ 20

Linear Regression
n Predict the value of the dependent variable:
n Log(price index)

n Prediction as a linear function of the independent variables:
n Winter rain (mm)
n Harvest rain (mm)
n Average temperature in growing season (Celsius)
n Average temperature in harvest season (Celsius)
n Age of wine (years since vintage)
n Population of France
n US alcohol consumption (per capita, in liters of 100% alcohol)


+ 21

Multiple Linear Regression

Y = β0 + β1 X1 + ... + βp Xp + ε

n Parametric method

n Observed data: (xi, yi), i = 1, ..., n

n Each observed xi is a feature vector: xi = (xi1, xi2, ..., xip)ᵀ

n Each observed yi is a continuous response/dependent variable associated with xi


+
Statistical Learning Interlude



+ 23

General Statistical Learning Model

n Input variables: X = (X1, X2, ..., Xp)
n Also often called features, predictors, or independent variables

n Output variable: Y
n Also often called the response or dependent variable

n Collected data in the form of n pairs:
n (xi, yi), i = 1, ..., n
n xi = (xi1, xi2, ..., xip)ᵀ


+ 24

Parametric Methods

n Start by assuming a particular functional form for f
n For example, assume that f is linear:
  f(X) = β0 + β1 X1 + ... + βp Xp
n f is parameterized by β = (β0, β1, ..., βp)

n Now apply a method that uses the training data to estimate β
n We sometimes call this fitting the model
n Classic example: ordinary least squares, i.e., linear regression
n We will consider more sophisticated approaches as well


+ 25

Parametric Methods

n Advantages of Parametric Methods:
n Simplifies the problem of estimating f to the problem of estimating β
n Potentially relatively less data needed to produce a reliable estimate of β

n Major Disadvantage of Parametric Methods:
n The true functional form of f is usually more complicated than the model we chose
n This may be remedied by selecting a flexible model class, but this comes at the danger of overfitting


+ 26

Non-parametric Methods

n Of course, non-parametric methods do not make parametric assumptions about f

n No explicit functional form is assumed

n Allows for greater flexibility
n Runs a greater risk of overfitting if you are not careful
n Generally requires more data to produce an accurate estimate


+ 27

Tradeoff Between Flexibility and Interpretability

n Why not just always use flexible, non-parametric methods?
n One reason is that parametric models are more interpretable and thus better for inference
n Even if you don’t care about inference, non-parametric methods may overfit the training data

[Two surface plots of Income vs. Years of Education and Seniority: a rigid linear fit and a flexible fit]


+
Back to Linear Regression



+ 29

Linear Regression
n Predict the value of the dependent variable:
n Log(price index)

n Prediction as a linear function of the independent variables:
n Winter rain (mm)
n Harvest rain (mm)
n Average temperature in growing season (Celsius)
n Average temperature in harvest season (Celsius)
n Age of wine (years since vintage)
n Population of France
n US alcohol consumption (per capita, in liters of 100% alcohol)


+ 30

Multiple Linear Regression

Y = β0 + β1 X1 + ... + βp Xp + ε

n Parametric method
n Falls into the generic format Y = f(X) + ε with f(X) = β0 + β1 X1 + ... + βp Xp

n Observed data: (xi, yi), i = 1, ..., n
n Each observed xi is a feature vector: xi = (xi1, xi2, ..., xip)ᵀ
n Each observed yi is a continuous response/dependent variable associated with xi


+ 31

Multiple Linear Regression, cont.

n The (true) regression coefficients β = (β0, β1, ..., βp) are unknown to us

n How do we estimate the regression coefficients?

n Minimize prediction error, as measured by the residual sum of squares (RSS):

  RSS(β) := Σ_{i=1}^{n} (yi − β0 − β1 xi1 − ... − βp xip)²


+ 32

Unconstrained Optimization Review

n Ingredients:
n β is a vector of decision variables (often called parameters in ML/Stats)
n L(β) is the objective function (often called a loss function or penalty function)

n Optimization problem: minimize L(β) over β


+ 33

Unconstrained Optimization Review, cont.

n Optimization problem: minimize L(β) over β

n Definition of optimality: β* solves the above optimization problem if and only if L(β*) ≤ L(β) for all β

n Necessary Optimality Condition: If L is differentiable with gradient ∇L and β* solves the optimization problem, then: ∇L(β*) = 0


+ 34

Multiple Linear Regression Coefficient Estimates

n The regression coefficient estimates β̂ = (β̂0, β̂1, ..., β̂p) are chosen to minimize RSS(β)

n Where:

  RSS(β) := Σ_{i=1}^{n} (yi − β0 − β1 xi1 − ... − βp xip)²

[Figure: least-squares plane fitted to data points over the (X1, X2) plane (from ISLR)]


+ 35

Multiple Linear Regression Coefficient Estimates

n Let X be the n × (p + 1) matrix whose ith row is the appended feature vector (1, xi1, ..., xip)

n Let y be the n-vector of responses yi

n Then the matrix-vector product Xβ is the n-vector of training set predictions associated with the coefficient vector β, and the n-vector of residuals is y − Xβ


+ 36

Multiple Linear Regression Coefficient Estimates, cont.

n Recall the 2-norm of an n-vector v is defined by: ‖v‖₂ = √(v₁² + ... + vₙ²)

n Then it is easy to see that: RSS(β) = ‖y − Xβ‖₂²

n Also, it holds that (slightly less obvious):

  ∇RSS(β) = −2 Xᵀ(y − Xβ)


+ 37

Multiple Linear Regression Coefficient Estimates, cont.

n Using the representation RSS(β) = ‖y − Xβ‖₂² and assuming that XᵀX is invertible, one may use calculus/linear algebra to show that the solution of ∇RSS(β) = 0 is given by:

  β̂ = (XᵀX)⁻¹ Xᵀ y
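As a sanity check, the closed-form solution can be computed directly in R and compared against lm(). A minimal sketch on synthetic data (the variable names here are illustrative, not from the wine dataset):

set.seed(242)
n <- 46
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 7 + 0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 0.3)
X <- cbind(1, x1, x2)                       # append the column of 1s for the intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves the normal equations (X'X)β = X'y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))      # the two columns agree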


+ 38

Multiple Linear Regression, cont.

n Prediction for the ith observation: ŷi := β̂0 + β̂1 xi1 + ... + β̂p xip

n Residuals: ei = yi − ŷi

n RSS with respect to the estimated coefficients:

  RSS = SSE = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − ŷi)²

n SSE is the sum of squared errors (both conventions often used)


+ 39

Vintage Wine Data


p = # of independent variables (p = 7); n = # of observations (n = 46)

(Same vintage wine data table as on slide 17.)


+ 40

Best Practices: Out-of-Sample Testing

Full Dataset (n = 46): observations 1 (1952) through 46 (2000)

Training Set (n = 31): observations 1 (1952) through 31 (1985)

      Year  LogAuctionIndex  WinterRain  HarvestRain  GrowTemp  ...
   1  1952  7.4950           566.4       165.5        17.28     ...
   2  1953  8.0393           653.3        75.6        16.94     ...
   3  1955  7.6858           504.3       129.5        17.30     ...
   4  1957  6.9845           390.8       110.4        16.31     ...
   5  1958  6.7772           538.8       187.0        16.82     ...
   6  1959  8.0757           377.0       182.6        17.68     ...
   7  1960  6.5188           748.2       290.6        16.67     ...
   8  1961  8.4937           747.8        37.7        17.64     ...
   9  1962  7.3880           639.4        51.8        16.58     ...
  10  1964  7.3094           326.5        96.1        17.63     ...
  ..  ...   ...              ...         ...          ...       ...
  30  1984  6.5496           572.6       144.8        16.71     ...
  31  1985  6.9171           667.1        37.2        17.19     ...

Testing Set (n = 15): observations 32 (1986) through 46 (2000)

      Year  LogAuctionIndex  WinterRain  HarvestRain  GrowTemp  ...
  32  1986  6.7793           518.5       171.2        16.65     ...
  33  1987  7.1797           397.0       115.1        17.84     ...
  34  1988  7.2646           734.2        58.8        17.65     ...
  35  1989  7.5922           282.4        85.2        18.62     ...
  ..  ...   ...              ...         ...          ...       ...
  45  1999  7.4462           502.4       253.4        19.07     ...
  46  2000  8.1817           487.8        69.0        18.73     ...


+ 41

Best Practices: Out-of-Sample Testing

n Set aside a “test set” of 20%–50% of the observed data before creating the regression model(s)

n Typical practice: set aside the most recently observed data (for example, the markets most recently entered or the wines most recently matured)

n If there is no time-dependence in the observed data, select a random sample for the test set

n Keep the test set data “hands-off” until you are ready to assess the performance of your regression model
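A minimal sketch of this time-based split in R, assuming a data frame wine with a Year column as in the table on slide 17:

train <- subset(wine, Year <= 1985)   # the 31 earliest vintages, used for training
test  <- subset(wine, Year >= 1986)   # the 15 most recent vintages, held out
# With no time-dependence, a random split would be used instead, e.g.:
# idx <- sample(nrow(wine), size = round(2/3 * nrow(wine)))
# train <- wine[idx, ]; test <- wine[-idx, ]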
+ 42

Best Practices: Out-of-Sample Testing

n Seriously, only use the test set once, when you have finished training your model, to estimate the performance of the model when you go to apply it in the real world
n All data used to help build the model is training data, and the training error (RSS) typically underestimates the performance error
n Soon in the course we will see how to use some of the training data as “validation data” to estimate the performance error during the training phase




+
Regression Output and Analysis


+ 44

Regression Output (from R)


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.9662699 9.3823951 -0.529 0.60166
WinterRain 0.0011863 0.0005628 2.108 0.04616 *
HarvestRain -0.0033137 0.0010650 -3.112 0.00491 **
GrowTemp 0.6582753 0.1221937 5.387 1.79e-05 ***
HarvestTemp 0.0044212 0.0599935 0.074 0.94189
Age 0.0240080 0.0507587 0.473 0.64068
FrancePop -0.0290258 0.1369627 -0.212 0.83403
USAlcConsump 0.1092561 0.1678945 0.651 0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom


Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06



+ 45

Interpreting the Regression Coefficients

n Regression coefficients: β̂ = (β̂0, β̂1, ..., β̂p) are estimates of β = (β0, β1, ..., βp)

n β̂0 = -4.966
n β̂WinterRain = 0.0012 (an additional mm of winter rain is expected to result in an additional 0.0012 units of log(price index))
n β̂HarvestRain = -0.0033 (an additional mm of harvest rain is expected to result in a decrease of 0.0033 units of log(price index))
n ...
n β̂USAlcConsump = 0.1093 (an additional liter of US per capita alcohol consumption is expected to result in an increase of 0.1093 units of log(price index))


+ 46

Understanding R2

n R2 is the coefficient of determination

n R2 is a measure of the overall quality of the regression model

n R2 is a number between 0.0 and 1.0

n A higher R2 means the regression model is a better fit to the (training) data


+ 47

Understanding R2, cont.

n R2 = .924; very good linear model


+ 48

Understanding R2, cont.

n R2 = .710; good linear model


+ 49

Understanding R2, cont.

n R2 = .035; not a good model


+ 50

What really is R2?

n R2 compares two models:
n the regression model (the one determined by minimizing the RSS, the residual sum of squares error), and
n the “baseline” model. Think of the baseline model as a model you might have built using this data but without any real mathematical thinking.
n The baseline model predicts simplistically using only the mean/average of the sample outcomes:

  ȳ = (y1 + ··· + yn)/n = (y1952 + ··· + y1985)/31 = 7.084


+ 51

What really is R2, continued

R2 = 1 − (sum of squared residuals of the regression model) / (sum of squared residuals of the baseline model)

   = 1 − [Σ_{i=1}^{n} (yi − ŷi)²] / [Σ_{i=1}^{n} (yi − ȳ)²]

   = 1 − SSE/SST,  where SST = Σ_{i=1}^{n} (yi − ȳ)²
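Computing R2 by hand in R makes the comparison with the baseline model explicit. A sketch, assuming the training data frame train with the column names from slide 17:

fit <- lm(LogAuctionIndex ~ WinterRain + HarvestRain + GrowTemp + HarvestTemp
          + Age + FrancePop + USAlcConsump, data = train)
y   <- train$LogAuctionIndex
SSE <- sum((y - fitted(fit))^2)   # squared residuals of the regression model
SST <- sum((y - mean(y))^2)       # squared residuals of the baseline (mean) model
1 - SSE / SST                     # equals summary(fit)$r.squared, 0.7894 here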


+ 52

Regression Output (from R)


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.9662699 9.3823951 -0.529 0.60166
WinterRain 0.0011863 0.0005628 2.108 0.04616 *
HarvestRain -0.0033137 0.0010650 -3.112 0.00491 **
GrowTemp 0.6582753 0.1221937 5.387 1.79e-05 ***
HarvestTemp 0.0044212 0.0599935 0.074 0.94189
Age 0.0240080 0.0507587 0.473 0.64068
FrancePop -0.0290258 0.1369627 -0.212 0.83403
USAlcConsump 0.1092561 0.1678945 0.651 0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom


Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06



+ 53

Vintage Wine Data


p = # of independent variables (p = 7); n = # of observations (n = 46)

(Same vintage wine data table as on slide 17.)


+ 54

Best Practices: Out-of-Sample Testing

(Same training/testing split as on slide 40: training set = vintages 1952–1985, n = 31; testing set = vintages 1986–2000, n = 15.)


+ 55

Training vs. Test Data

n Is R2 really what we care about?
n R2 is measured on the training data, the data that we used to fit the model
n What we really care about is predictive performance on new data
n Recall that we set aside some test data…
n We will use this test data to estimate the performance of our model on new data that we might see in the wild


+ 56

Assessing “Real World” Performance of the Regression Model

n Here is our model, based on the training data observations (years 1952 through 1985):
n log(Price Index) = -4.966 + 0.001*(Winter Rain) - 0.003*(Harvest Rain) + 0.658*(Growing Temp) + 0.004*(Harvest Temp) + 0.024*(Age) - 0.029*(France Population) + 0.109*(US Alcohol)

n Use the model to compute predictions and residuals for each observation in the test set (observation years 1986 through 2000)
n Example: prediction for year 1998: 6.932 = -4.966 + 0.001*(693.4) + … + 0.109*(8.10)
n Actual 1998 log(Price Index) = 6.858
n Residual = -0.074 = 6.858 – 6.932

n How good is this prediction? Well, let’s look at all of the test set data records and compute a version of R2, which we call OSR2


+ 57

Out-of-Sample R2 (OSR2 )

OSR2 = 1 − (sum of squared residuals of the regression model on the Test Set) / (sum of squared residuals of the baseline model applied to the Test Set)

     = 1 − [Σ_{t=1986}^{2000} (yt − ŷt)²] / [Σ_{t=1986}^{2000} (yt − 7.084)²]

     = 0.54
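A sketch of the OSR2 computation, continuing the fit/train/test sketches above; note that the baseline prediction on the test set is the training-set mean (7.084), not the test-set mean:

pred     <- predict(fit, newdata = test)
y_test   <- test$LogAuctionIndex
SSE_test <- sum((y_test - pred)^2)
SST_test <- sum((y_test - mean(train$LogAuctionIndex))^2)  # baseline = ȳ from training
1 - SSE_test / SST_test                                    # OSR2, 0.54 on this slide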


+ 58

Out-of-Sample R2 (OSR2 )

n OSR2 is an assessment of the real-world performance of the model we have built

n It should only be computed once, at the end of your analysis, as a final metric

n If OSR2 is significantly smaller than R2 (on the training data), this is an indicator of potential overfitting


+ 59

Overfitting
n Overfitting occurs when the estimated model fits the noise in the training data

n All statistical learning methods are at risk for overfitting


+ 60

Overfitting

n Overfitting is more likely when:
n The number of parameters to be estimated is large
n Data is limited

n Care must be taken to make sure that the model we estimate does not suffer from overfitting
n We will see how to address this issue throughout the course, including today’s lecture

n Overfitting is related to the “bias-variance tradeoff”


+ 61

Flexible Statistical Learning Methods

n Flexible (usually non-parametric) statistical learning methods are able to capture complicated relationships
n Linear regression is relatively inflexible
n Flexibility usually implies that:
n The resulting model is less interpretable
n The method requires more data to produce an accurate estimate than a less flexible method
n There is an increased risk of overfitting

n We will see examples of flexible, non-parametric methods later in the course


+ 62

Bias and Variance of Learning Methods

n Bias refers to the error that is introduced by modeling a complicated relationship with a simple one
n Less flexible methods have more bias

n Variance refers to the amount that our estimated function changes when you slightly change the dataset
n More flexibility usually comes at the cost of higher variance

n The bias-variance tradeoff is a common theme in this course that we will continue discussing
+ 63

The Bias-Variance Tradeoff

[Plot: error vs. model complexity]

Error is measured on a test set

“Model Complexity” is a synonym for “Model Flexibility”


+
Significance Testing, Multicollinearity, and Other Issues


+ 65

Some Important Questions

n Do all of the predictors help to explain the response? Which variables are “significant”?

n Is at least one of the predictors X1, X2, ..., Xp useful in predicting the response Y?


+ 66

Testing the Significance of Regression Coefficients

n Is the independent variable Xj useful in predicting the response Y?
n Does US Alcohol Consumption help to predict log(price index)?

n In other words, is βj ≠ 0?

n This is an inference question, and can be addressed with a hypothesis test:

  H0: βj = 0  vs.  Ha: βj ≠ 0


+ 67

Testing the Significance of Regression Coefficients

H0: βj = 0  vs.  Ha: βj ≠ 0

n The hypothesis test is equivalent to looking at confidence intervals

n Reject the null hypothesis at significance level α if and only if the (1−α)% confidence interval does not contain 0

[Plot: confidence intervals for each coefficient estimate (WinterRain, HarvestRain, GrowTemp, HarvestTemp, Age, FrancePop, USAlcConsump), with terms on the vertical axis and estimates on the horizontal axis]
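The intervals plotted above can be obtained directly in R; a one-line sketch assuming fit is the seven-variable wine model from the earlier sketch:

confint(fit, level = 0.95)   # a coefficient is significant at the 5% level
                             # exactly when its interval excludes 0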


+
Interlude on “Standard Assumptions” for Linear Regression


+ 69

A Useful Set of Conceptual Assumptions

n Question: Where do the previous confidence intervals come from?

n Answer: Some of the statistical analysis associated with linear regression is derived from a certain set of assumptions regarding how the data is generated


+ 70

A Useful Set of Conceptual Assumptions

n 1.) The observed data (xi, yi), i = 1, ..., n, satisfies yi = β0 + β1 xi1 + ... + βp xip + εi, where β = (β0, β1, ..., βp) are the true but unknown regression coefficients and the εi are noise terms

n 2.) ε1, ..., εn are independent and identically distributed normal random variables with mean 0 and variance σ²

n 3.) If the features xi are also regarded as random variables, then they are independent of the noise terms ε1, ..., εn
+ 71

Consequences of the assumptions

n Under the previous set of assumptions, it is possible to prove mathematically that:

n 1.) β̂ is an unbiased estimator of the true vector of coefficients β:  E[β̂ | X] = β

n 2.) The covariance matrix of β̂ given X is:  Cov(β̂ | X) = σ² (XᵀX)⁻¹

n 3.) β̂ is a normally distributed random vector given X
+ 72

Constructing a confidence interval

n Given the formula Cov(β̂ | X) = σ² (XᵀX)⁻¹, we can read off the diagonal entries of this matrix to get the standard errors SE(β̂j) for each coefficient

n Given that β̂ is normally distributed, we can now easily construct confidence intervals in the usual way, i.e., for some z-score (such as z* = 1.96):

  β̂j ± z* · SE(β̂j)

n Question: What’s the problem?
+ 73

Constructing a confidence interval

n Question: What’s the problem?

n Answer: we usually don’t know σ² and must estimate it from the data in order to construct the matrix σ̂² (XᵀX)⁻¹

n Letting e = y − Xβ̂ denote the vector of training set residuals, then use the estimate:

  σ̂² = ‖e‖₂² / (n − p − 1)
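Continuing the earlier synthetic-data sketch (X, y, beta_hat), the standard errors and approximate confidence intervals can be assembled by hand:

e <- y - X %*% beta_hat                            # training set residuals
sigma2_hat <- sum(e^2) / (nrow(X) - ncol(X))       # divide by n − (p + 1)
se <- sqrt(diag(sigma2_hat * solve(t(X) %*% X)))   # matches summary(lm(y ~ x1 + x2))
cbind(beta_hat - 1.96 * se, beta_hat + 1.96 * se)  # approximate 95% intervals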


+ 74

Take-home message of this interlude

n It is important to understand the assumptions that lead to the results of your analysis (e.g., which variables you retain in your model)

n Ultimately though – regardless of whether you believe or doubt that the assumptions hold for your dataset – it is critical to validate your final model on an out-of-sample testing set


+
Back to Significance Testing and Other Issues


+ 76

Testing the Significance of Regression Coefficients in R

n R shows stars * (literally!) for the significant coefficients
n The more stars, the more significant. To be significant at the 5% level (95% confidence interval), the coefficient must have at least one *
n The p-value (Pr(>|t|)) is the boundary point where we switch from significant to not significant (essentially the smallest α such that the coefficient is significant at level α)
n Smaller p-values are better


+ 77

Testing the Significance of Regression Coefficients in R

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.9662699 9.3823951 -0.529 0.60166
WinterRain 0.0011863 0.0005628 2.108 0.04616 *
HarvestRain -0.0033137 0.0010650 -3.112 0.00491 **
GrowTemp 0.6582753 0.1221937 5.387 1.79e-05 ***
HarvestTemp 0.0044212 0.0599935 0.074 0.94189
Age 0.0240080 0.0507587 0.473 0.64068
FrancePop -0.0290258 0.1369627 -0.212 0.83403
USAlcConsump 0.1092561 0.1678945 0.651 0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom


Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06

• Are there coefficients that you are not comfortable with?
• Let’s return to this question in a moment
+ 78

Testing the Significance of the Entire Model

n A more basic question: is the model worth anything at all?

n Frame this question as a hypothesis test:

  H0: β1 = β2 = ... = βp = 0  vs.  Ha: at least one βj ≠ 0

n R reports the F-statistic and corresponding p-value
n Again, a small p-value is good!
n Why is this not the same as checking the p-value of each coefficient?


+ 79

Testing the Significance of the Entire Model

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.9662699 9.3823951 -0.529 0.60166
WinterRain 0.0011863 0.0005628 2.108 0.04616 *
HarvestRain -0.0033137 0.0010650 -3.112 0.00491 **
GrowTemp 0.6582753 0.1221937 5.387 1.79e-05 ***
HarvestTemp 0.0044212 0.0599935 0.074 0.94189
Age 0.0240080 0.0507587 0.473 0.64068
FrancePop -0.0290258 0.1369627 -0.212 0.83403
USAlcConsump 0.1092561 0.1678945 0.651 0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom


Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06

• Are there coefficients that you are not comfortable with?
• Why might the last four coefficients not be significant?
+ 80

Plot of Age versus France Population

[Scatter plot: France Population at Time of Vintage (Millions) vs. Age of Vintage – the points fall almost exactly on a straight downward-sloping line]

n The data for Age and France population are highly correlated

n This is evidence of multicollinearity


+ 81

Multicollinearity
n Occurs when two or more predictors are highly correlated
n Makes the estimated coefficients β̂ = (β̂0, β̂1, ..., β̂p) very sensitive to noise in the training data
n Thus can produce very inaccurate estimates, which hurts interpretability and possibly predictive performance

n Tell-tale signs:
n Some of the estimated coefficients have the “wrong” sign
n Some of the coefficients are not significantly different from zero

n Multicollinearity can usually be fixed by deleting one or more independent variables


+ 82

Correlation Table
[Scatterplot matrix repeated from slide 18; pairwise correlations tabulated below]

                 WinterRain  HarvestRain  GrowTemp  HarvestTemp    Age  FrancePop  USAlcConsump
LogAuctionIndex        0.06        −0.53      0.56         0.47   0.01      −0.08         −0.27
WinterRain                         −0.12     −0.21        −0.05   0.03      −0.05          0.00
HarvestRain                                   0.04        −0.41  −0.13       0.11         −0.22
GrowTemp                                                   0.51  −0.60       0.52         −0.35
HarvestTemp                                                      −0.28       0.25         −0.04
Age                                                                         −0.99         −0.13
FrancePop                                                                                  0.27


+ 83

Multicollinearity

n Multicollinearity can exist without evidence of large correlations in the correlation table

n Better to check the VIFs (variance inflation factors):

  WinterRain  HarvestRain  GrowTemp  HarvestTemp        Age  FrancePop  USAlcConsump
    1.295370     1.578682  1.700079     2.198191  66.936256  81.792302     10.441217

n Rule of thumb:
n VIF > 10: definitely a problem
n VIF > 5: could be a problem
n VIF <= 5: probably okay


+ 84

What is VIF?

n Consider regressing each predictor variable Xj on all of the others:

  Xj = α0 + α1 X1 + ... + α(j−1) X(j−1) + α(j+1) X(j+1) + ... + αp Xp

n If the R2 for the above regression (call it Rj²) is equal to 1, then there exists a perfect linear relationship between Xj and all other independent variables (at least according to the training data)

n So, define:

  VIFj = 1 / (1 − Rj²)
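A sketch of this definition applied to Age (assuming the train data frame from the earlier sketches); the car package's vif() function returns the same values for all predictors at once:

fit_age <- lm(Age ~ WinterRain + HarvestRain + GrowTemp + HarvestTemp
              + FrancePop + USAlcConsump, data = train)
1 / (1 - summary(fit_age)$r.squared)   # ~66.9, matching the VIF table on slide 83
# car::vif(fit) computes all seven VIFs in one call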


+ 85

How do we deal with multicollinearity?

n One approach (sketched in R below):
n Remove a variable with a high VIF, but if there is a “tie” then keep the variables that you “like”
n Iterate this procedure

n This issue falls under the realm of model selection – the process of finding the best model

n Model selection is still somewhat of an art, but we will see some principled approaches later in the course
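A sketch of this iterative pruning in R (assuming fit and train as above), mirroring the sequence on slides 86–90:

library(car)
vif(fit)                                   # FrancePop has the largest VIF (~82): drop it
fit <- update(fit, . ~ . - FrancePop)
vif(fit)                                   # recheck; then prune the insignificant
fit <- update(fit, . ~ . - USAlcConsump)   # USAlcConsump and, after one more look,
fit <- update(fit, . ~ . - HarvestTemp)    # HarvestTemp
summary(fit)                               # the four-variable model of slide 90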


+ 86

VIF Values for the Wine Model

Coefficients:
              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -4.9662699   9.3823951   -0.529   0.60166
WinterRain    0.0011863   0.0005628    2.108   0.04616 *
HarvestRain  -0.0033137   0.0010650   -3.112   0.00491 **
GrowTemp      0.6582753   0.1221937    5.387  1.79e-05 ***
HarvestTemp   0.0044212   0.0599935    0.074   0.94189
Age           0.0240080   0.0507587    0.473   0.64068
FrancePop    -0.0290258   0.1369627   -0.212   0.83403
USAlcConsump  0.1092561   0.1678945    0.651   0.52166
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3307 on 23 degrees of freedom
Multiple R-squared: 0.7894, Adjusted R-squared: 0.7253
F-statistic: 12.31 on 7 and 23 DF, p-value: 1.859e-06

Coefficient      VIF
WinterRain      1.30
HarvestRain     1.58
GrowTemp        1.70
HarvestTemp     2.20
Age            66.94
FrancePop      81.79
USAlcConsump   10.44


+
Building our Better Wine Model

Coefficients:
              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -6.8404548   3.0706463   -2.228   0.03553 *
WinterRain    0.0012145   0.0005359    2.266   0.03274 *
HarvestRain  -0.0033611   0.0010203   -3.294   0.00305 **
GrowTemp      0.6671389   0.1125053    5.930  4.05e-06 ***
HarvestTemp   0.0020543   0.0577600    0.036   0.97192
Age           0.0340519   0.0178084    1.912   0.06787 .
USAlcConsump  0.0933334   0.1471271    0.634   0.53184
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3241 on 24 degrees of freedom
Multiple R-squared: 0.789, Adjusted R-squared: 0.7362
F-statistic: 14.95 on 6 and 24 DF, p-value: 4.604e-07

Coefficient      VIF
WinterRain      1.22
HarvestRain     1.51
GrowTemp        1.50
HarvestTemp     2.12
Age             8.58
USAlcConsump    8.35


+
Building our Better Wine Model

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -5.215161    1.672215   -3.119  0.004532 **
WinterRain   0.001119    0.000508    2.202  0.037112 *
HarvestRain -0.003437    0.001001   -3.433  0.002089 **
GrowTemp     0.664336    0.111067    5.981  3.02e-06 ***
HarvestTemp -0.006650    0.055432   -0.120  0.905462
Age          0.023466    0.006143    3.820  0.000785 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3202 on 25 degrees of freedom
Multiple R-squared: 0.7854, Adjusted R-squared: 0.7425
F-statistic: 18.3 on 5 and 25 DF, p-value: 1.213e-07

Coefficient      VIF
WinterRain      1.13
HarvestRain     1.49
GrowTemp        1.50
HarvestTemp     2.00
Age             1.04


+
Building our Better Wine Model

Coefficients:
              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept) -5.2163945   1.6401825   -3.180  0.003782 **
WinterRain   0.0011116   0.0004949    2.246  0.033424 *
HarvestRain -0.0033766   0.0008504   -3.971  0.000505 ***
GrowTemp     0.6569271   0.0905520    7.255  1.05e-07 ***
Age          0.0235571   0.0059785    3.940  0.000546 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3141 on 26 degrees of freedom
Multiple R-squared: 0.7853, Adjusted R-squared: 0.7523
F-statistic: 23.78 on 4 and 26 DF, p-value: 2.307e-08

Coefficient      VIF
WinterRain      1.11
HarvestRain     1.12
GrowTemp        1.04
Age             1.03


+ 90

A Very Good Wine Model


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.2163945 1.6401825 -3.180 0.003782 **
WinterRain 0.0011116 0.0004949 2.246 0.033424 *
HarvestRain -0.0033766 0.0008504 -3.971 0.000505 ***
GrowTemp 0.6569271 0.0905520 7.255 1.05e-07 ***
Age 0.0235571 0.0059785 3.940 0.000546 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3141 on 26 degrees of freedom


Multiple R-squared: 0.7853, Adjusted R-squared: 0.7523
F-statistic: 23.78 on 4 and 26 DF, p-value: 2.307e-08

n R2 = 0.79 (previously 0.79)

n All coefficients are significantly different from zero
n OSR2 = 0.75
n This model is not really different from Orley Ashenfelter’s model


+ 91

Other Potential Fit Problems

n Nonlinear dependence on the features
n We will discuss this later in the course

n Correlation of residuals
n Non-constant variance of residuals
n Outliers/high-leverage points

n The last three are not a major concern in this course, but it’s always healthy to plot your data, including residual plots
n See James et al., Section 3.3.3 for more details


+
A Better Wine Model Using Categorical Variables


+ 93

Towards an Even Better Model


n The previous model predicts a price index
n OSR2 = 0.75 – pretty good, but not really “great”
n It would be better if we could predict the actual price for a given winery – then we could use the model in direct support of the auction
n This is the “big data” era, yet we have only one price index for each year back to 1952
n Even if we were to look at an individual winery, we would still only have a few dozen data records
n Wouldn’t it be great if we could use the separate data from all wineries in all years? Then we could take advantage of more data.
n Let’s see how we can do this
+ 94

Map of Bordeaux Region

[Map of the Bordeaux region marking five wineries: Cos d’Estournel, Lafite-Rothschild, Beychevelle, Giscours, and Cheval Blanc]

Map data © 2015 Google


+ 95

All-Wineries Data

     Year  LogAuction  Winery  Age  WinterRain  HarvestRain  GrowTemp  HarvestTemp  FrancePop  USAlcConsump


1 1952 6.653108 Cheval Blanc 63 566.4 165.5 17.28 14.39 42.46 7.85
2 1952 6.861502 Lafite-Rothschild 63 566.4 165.5 17.28 14.39 42.46 7.85
3 1953 6.664192 Cheval Blanc 62 653.3 75.6 16.94 17.64 42.75 8.03
4 1955 6.311426 Cheval Blanc 60 504.3 129.5 17.30 17.13 43.43 7.84
5 1955 6.550209 Lafite-Rothschild 60 504.3 129.5 17.30 17.13 43.43 7.84
6 1959 5.380957 Beychevelle 56 377.0 182.6 17.68 19.28 45.24 7.89
7 1959 7.437242 Cheval Blanc 56 377.0 182.6 17.68 19.28 45.24 7.89
8 1959 7.645302 Lafite-Rothschild 56 377.0 182.6 17.68 19.28 45.24 7.89
9 1960 6.405873 Lafite-Rothschild 55 748.2 290.6 16.67 16.18 45.68 8.02
10 1961 5.813802 Beychevelle 54 747.8 37.7 17.64 21.05 46.16 8.08
11 1961 7.311178 Cheval Blanc 54 747.8 37.7 17.64 21.05 46.16 8.08
12 1961 5.822247 Cos d'Estournel 54 747.8 37.7 17.64 21.05 46.16 8.08
13 1961 6.673045 Lafite-Rothschild 54 747.8 37.7 17.64 21.05 46.16 8.08
14 1962 6.747610 Cheval Blanc 53 639.4 51.8 16.58 17.86 47.00 8.13
15 1962 5.416100 Cos d'Estournel 53 639.4 51.8 16.58 17.86 47.00 8.13
16 1962 6.298839 Lafite-Rothschild 53 639.4 51.8 16.58 17.86 47.00 8.13
17 1964 4.354270 Beychevelle 51 326.5 96.1 17.63 19.43 48.31 8.46
18 1964 6.492785 Cheval Blanc 51 326.5 96.1 17.63 19.43 48.31 8.46
19 1964 6.011610 Lafite-Rothschild 51 326.5 96.1 17.63 19.43 48.31 8.46
20 1966 5.957908 Cheval Blanc 49 734.0 85.2 16.81 18.82 49.16 8.78
... ... ... ... ... ... ... ... ... ...
147 2000 7.060588 Lafite-Rothschild 15 487.8 69.0 18.73 19.45 59.05 8.24



+ 96

Training and Testing Set (Split by Year)

Full Dataset (N = 147): observations 1 (1952 vintages) through 147 (2000 vintages)

Training Set (N = 83): vintages 1952–1985

       Year  LogAuction  Winery             ...  USAlcConsump
    1  1952  6.653108    Cheval Blanc       ...  7.85
    2  1952  6.861502    Lafite-Rothschild  ...  7.85
    3  1953  6.664192    Cheval Blanc       ...  8.03
    4  1955  6.311426    Cheval Blanc       ...  7.84
    5  1955  6.550209    Lafite-Rothschild  ...  7.84
    6  1959  5.380957    Beychevelle        ...  7.89
    7  1959  7.437242    Cheval Blanc       ...  7.89
    8  1959  7.645302    Lafite-Rothschild  ...  7.89
    9  1960  6.405873    Lafite-Rothschild  ...  8.02
   10  1961  5.813802    Beychevelle        ...  8.08
  ...   ...  ...         ...                ...  ...
   81  1985  6.018934    Cheval Blanc       ...  9.88
   82  1985  4.885072    Cos d'Estournel    ...  9.88
   83  1985  6.296612    Lafite-Rothschild  ...  9.88

Testing Set (N = 64): vintages 1986–2000

       Year  LogAuction  Winery             ...  USAlcConsump
   84  1986  4.549235    Beychevelle        ...  9.75
   85  1986  5.721852    Cheval Blanc       ...  9.75
   86  1986  4.987435    Cos d'Estournel    ...  9.75
  ...   ...  ...         ...                ...  ...
  146  2000  4.330339    Giscours           ...  8.24
  147  2000  7.060588    Lafite-Rothschild  ...  8.24


+
Regression Model
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2184029 3.4314946 -0.064 0.9494
Age 0.0528857 0.0118313 4.470 2.62e-05 ***
WinterRain 0.0018288 0.0009861 1.855 0.0674 .
HarvestRain 0.0024980 0.0019812 1.261 0.2111
GrowTemp 0.1209556 0.1926703 0.628 0.5320
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9639 on 78 degrees of freedom


Multiple R-squared: 0.2221, Adjusted R-squared: 0.1822
F-statistic: 5.567 on 4 and 78 DF, p-value: 0.0005389

n Why is the R2 low?



+
2015 Auction Prices by Winery and Year

[Plot: log(2015 Auction Price) vs. Vintage Year (1950–2000), one series per winery: Beychevelle, Cheval Blanc, Cos d’Estournel, Giscours, Lafite-Rothschild]


+ 99

Categorical Variables With Two Levels

n For illustration, we will only look at data from two wineries:
n Cheval Blanc – one of the most expensive wineries
n Cos d’Estournel – one of the least expensive wineries

n Two categories correspond to adding one categorical variable

n We will call our variable WineryCos d’Estournel
n Value 1: Wine is from Cos d’Estournel
n Value 0: Wine is from Cheval Blanc


+ 100

2015 Auction Prices for Two Wineries

[Plot: log(2015 Auction Price) vs. Vintage Year for Cheval Blanc and Cos d’Estournel]


+
Categorical Variables with Two Levels

     Year  LogAuction  WineryCos d'Estournel  Age  WinterRain  HarvestRain  GrowTemp


1 1952 6.653108 0 63 566.4 165.5 17.28
2 1953 6.664192 0 62 653.3 75.6 16.94
3 1955 6.311426 0 60 504.3 129.5 17.30
4 1959 7.437242 0 56 377.0 182.6 17.68
5 1961 7.311178 0 54 747.8 37.7 17.64
6 1961 5.822247 1 54 747.8 37.7 17.64
7 1962 6.747610 0 53 639.4 51.8 16.58
8 1962 5.416100 1 53 639.4 51.8 16.58
9 1964 6.492785 0 51 326.5 96.1 17.63
10 1966 5.957908 0 49 734.0 85.2 16.81
11 1966 4.809416 1 49 734.0 85.2 16.81
12 1967 6.146929 0 48 646.9 118.1 16.51
13 1970 5.536231 0 45 563.5 88.8 16.92
14 1970 4.254051 1 45 563.5 88.8 16.92
15 1971 5.975843 0 44 488.4 111.9 17.20
16 1973 4.207376 1 42 357.2 122.6 17.41
17 1974 5.290386 0 41 503.6 185.1 16.39
18 1975 5.689684 0 40 501.8 170.5 17.23
19 1975 4.397162 1 40 501.8 170.5 17.23
... ... ... ... ... ... ...
35 1985 4.885072 1 30 667.1 37.2 17.19



+ 102

A Two-Category Model

Auction Price = β0
              + β1 · WineryCos d’Estournel
              + β2 · Age
              + β3 · WinterRain
              + β4 · HarvestRain
              + β5 · GrowTemp

n What is the interpretation?
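In R the dummy variable does not need to be built by hand: a factor column is encoded automatically. A sketch, assuming a data frame two restricted to the two wineries:

two$Winery <- factor(two$Winery)   # levels: Cheval Blanc (baseline), Cos d'Estournel
fit2 <- lm(LogAuction ~ Winery + Age + WinterRain + HarvestRain + GrowTemp,
           data = two)
summary(fit2)   # shows one dummy coefficient, named "WineryCos d'Estournel"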


+
The Two-Category Model
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.6292230 2.3910006 -2.773 0.00962 **
WineryCos d'Estournel -1.3616758 0.1393778 -9.770 1.12e-10 ***
Age 0.0357669 0.0072019 4.966 2.79e-05 ***
WinterRain 0.0016274 0.0006888 2.363 0.02506 *
HarvestRain -0.0015879 0.0015803 -1.005 0.32330
GrowTemp 0.6093432 0.1397853 4.359 0.00015 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3874 on 29 degrees of freedom


Multiple R-squared: 0.8598, Adjusted R-squared: 0.8356
F-statistic: 35.55 on 5 and 29 DF, p-value: 1.624e-11

n Note that R2 = 0.86

n All variables are significant except HarvestRain



+
Let’s go back to the All-Wineries Data

(Same all-wineries data table as on slide 95, shown through observation 83, the 1985 vintages.)


+ 105

Categorical Variables with More Than Two Levels

n We need k−1 dummy variables to work with k categories (why?)
n Variable WineryCheval Blanc: 1 if from Cheval Blanc, otherwise 0
n Variable WineryCos d’Estournel: 1 if from Cos d’Estournel, otherwise 0
n Variable WineryGiscours: 1 if from Giscours, otherwise 0
n Variable WineryLafite-Rothschild: 1 if from Lafite-Rothschild, otherwise 0
n All variables 0 if the wine is from Beychevelle
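R's model.matrix() makes the k−1 encoding visible; a sketch assuming the all-wineries data frame wine with Winery as a factor:

head(model.matrix(~ Winery, data = wine))
# Columns: (Intercept), WineryCheval Blanc, WineryCos d'Estournel,
# WineryGiscours, WineryLafite-Rothschild. Beychevelle, the first level
# alphabetically, is absorbed into the baseline (all dummies 0).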
+ 106

A Model with More Than Two Categories

Auction Price = β0
              + β1 · WineryCheval Blanc
              + β2 · WineryCos d’Estournel
              + β3 · WineryGiscours
              + β4 · WineryLafite-Rothschild
              + β5 · Age
              + β6 · WinterRain
              + β7 · HarvestRain
              + β8 · GrowTemp


+
Model with Categorical Data for Five
Wineries
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.4945857 1.5757380 -2.852 0.005622 **
WineryCheval Blanc 1.6425245 0.1518157 10.819 < 2e-16 ***
WineryCos d'Estournel 0.2754099 0.1649803 1.669 0.099274 .
WineryGiscours -0.2992903 0.1934825 -1.547 0.126163
WineryLafite-Rothschild 1.8941459 0.1481200 12.788 < 2e-16 ***
Age 0.0307904 0.0054819 5.617 3.23e-07 ***
WinterRain 0.0016349 0.0004462 3.665 0.000463 ***
HarvestRain 0.0003949 0.0009050 0.436 0.663899
GrowTemp 0.3875778 0.0886121 4.374 3.93e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4336 on 74 degrees of freedom


Multiple R-squared: 0.8506, Adjusted R-squared: 0.8345
F-statistic: 52.68 on 8 and 74 DF, p-value: < 2.2e-16

n R2 = 0.85 is excellent

n OSR2 = 0.81 is also excellent


+
Prediction for Cos d’Estournel

n Consider making a prediction for the 2014 vintage
n Cos d’Estournel winery
n Aged 1 year in 2015
n 522.3 mm of winter rain
n 78.9 mm of harvest rain
n Average growing season temperature of 18.23 °C

LogAuctionPrice = -4.495 + 1.643*(0) + 0.2754*(1) – 0.299*(0) + 1.894*(0)
                + 0.031*(1) + 0.002*(522.3) + … + 0.388*(18.23)
                = 3.762


+
Prediction for Giscours

n Consider making a prediction for the 2014 vintage
n Giscours winery
n Aged 1 year in 2015
n 522.3 mm of winter rain
n 78.9 mm of harvest rain
n Average growing season temperature of 18.23 °C

LogAuctionPrice = -4.495 + 1.643*(0) + 0.2754*(0) – 0.299*(1) + 1.894*(0)
                + 0.031*(1) + 0.002*(522.3) + … + 0.388*(18.23)
                = 3.188


+
Prediction for Beychevelle

n Consider making a prediction for the 2014 vintage
n Beychevelle winery
n Aged 1 year in 2015
n 522.3 mm of winter rain
n 78.9 mm of harvest rain
n Average growing season temperature of 18.23 °C

LogAuctionPrice = -4.495 + 1.643*(0) + 0.2754*(0) – 0.299*(0) + 1.894*(0)
                + 0.031*(1) + 0.002*(522.3) + … + 0.388*(18.23)
                = 3.487
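The three predictions above can be reproduced with predict(); a sketch assuming fit5 is the five-winery model from slide 107:

newdata <- data.frame(Winery = c("Cos d'Estournel", "Giscours", "Beychevelle"),
                      Age = 1, WinterRain = 522.3, HarvestRain = 78.9,
                      GrowTemp = 18.23)
predict(fit5, newdata)        # ~3.762, 3.188, 3.487 on the log scale
exp(predict(fit5, newdata))   # back-transform to 2015 auction prices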


+ 111

Showdown in The New York Times


n Showdown on the front page of The New York Times in 1990

n Parker: the 1986 vintage will be “very good to sometimes exceptional”

n Ashenfelter: the 1986 vintage will be mediocre, but the 1989 vintage will be “stunningly good”

n Experts like Parker hadn’t even had a chance to taste the 1989 vintage
+ 112

Years Later, the Winner is Clear

[Plot: log(Auction Index) by vintage year, 1985–1989]


+ 113

A Convergence of Results

Though most critics never acknowledged the value of Ashenfelter’s models, over time the predictions of the models and the experts have converged.

Ashenfelter:

“Unlike the past, the tasters no longer make any horrendous mistakes. Frankly, I kind of killed myself. I don’t have much value added anymore.”


+ 114

Conclusion

n A linear regression model with only a few variables can predict wine prices well

n In many cases, the model outperforms wine experts’ judgments

n A quantitative approach to a traditionally qualitative problem

n Regression capabilities are enhanced by a user who knows how to think with data and learn from the data


+ 115

n Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie, and R. Tibshirani

n Thanks to Rob Freund and John Silberholz (MIT) for the wine datasets
