
ECU3091 Econometrics A

Prediction with many regressors and big data II

Dr. Barra Roantree


Trinity College Dublin, 2022

Material from Stock and Watson Chapter 14


Outline
1. What is “Big Data”?
2. Prediction with many predictors: the MSPE, OLS, and
the principle of shrinkage
3. Ridge regression
4. The Lasso
5. Principal components
6. Application to prediction of school test scores
7. Summary
Recap: what is “Big Data”?
“Big Data” can mean many things and comes with its own jargon,
which makes it seem very different from econometrics…
• e.g. “Machine learning:” when a computer (machine) uses a large
data set to learn e.g. about your online shopping preferences

But at its core, machine learning builds on familiar tools


• We focus on one of the major applications of big data:
prediction with many predictors.
• With many predictors we need new methods that go beyond OLS.
• For prediction we do not need – and typically will not have – a
causal interpretation of the estimated coefficients: rather, the quality of
prediction is what matters (measured by the MSPE)
• And it turns out we can get better predictions by allowing for biased
estimators of a certain type
Recap: the Principle of Shrinkage
• The James-Stein shrinkage estimator:
$$\hat{\beta}^{JS} = c\,\hat{\beta}$$
where $0 < c < 1$.
• As c gets smaller:

– The squared bias of the estimator increases,


– But the variance decreases.
– This produces a bias-variance tradeoff, since the MSPE is a function of both
– If k is large, the benefit of smaller variance can beat the cost of
larger bias for the right choice of c – thus reducing the MSPE.

• The estimators we consider all have a shrinkage interpretation


3. Ridge Regression
The ridge regression estimator shrinks the estimate towards
zero by penalizing large squared values of the coefficients.

Minimizes the penalized sum of squared residuals,


$$S^{Ridge}(b;\ \lambda_{Ridge}) = \sum_{i=1}^{n}\left(Y_i - b_1 X_{1i} - \dots - b_k X_{ki}\right)^2 + \lambda_{Ridge}\sum_{j=1}^{k} b_j^{2}$$

where $\lambda_{Ridge}\sum_{j=1}^{k} b_j^{2}$ is a “penalty term” (the objective collapses to OLS when $\lambda_{Ridge} = 0$)

• If the regressors are uncorrelated,

$$\hat{\beta}_j^{Ridge} = \left(\frac{1}{1 + \lambda_{Ridge}/\sum_{i=1}^{n} X_{ji}^{2}}\right)\hat{\beta}_j$$

so the ridge estimator has the James-Stein form, $\hat{\beta}^{JS} = c\,\hat{\beta}$

– (math in S&W Appendix 14.3 and 19.7)
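As a check on this formula, here is a minimal sketch with simulated data and scikit-learn (an illustration under assumed inputs, not the lecture's code): with exactly orthogonal regressors and no intercept, the ridge estimates equal the OLS estimates times the shrinkage factor above, with scikit-learn's alpha playing the role of $\lambda_{Ridge}$.

```python
# A minimal sketch with simulated data and scikit-learn (an illustration, not
# the lecture's code): with exactly orthogonal regressors and no intercept,
# the ridge estimates equal the OLS estimates times the shrinkage factor above.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, k, lam = 200, 3, 50.0
X = np.linalg.qr(rng.normal(size=(n, k)))[0]          # orthonormal columns
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, n)

ols = LinearRegression(fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)  # alpha acts as lambda

shrink = 1.0 / (1.0 + lam / (X ** 2).sum(axis=0))     # James-Stein-type factor
print(ridge.coef_)                                    # ridge estimates
print(shrink * ols.coef_)                             # same values via the formula
```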


Ridge Regression in a Picture
The ridge regression penalty term penalizes the sum of squared residuals for large values of b, as shown here for k = 1:

• The value of the ridge objective function, $S^{Ridge}(b)$, is the sum of squared residuals plus a penalty which is quadratic in b.

• Thus, the penalized sum of squared residuals is minimized at a smaller value of b than is the unpenalized SSR.
Choosing the penalty factor $\lambda_{Ridge}$
The ridge regression estimator has an additional parameter, $\lambda_{Ridge}$:

$$S^{Ridge}(b;\ \lambda_{Ridge}) = \sum_{i=1}^{n}\left(Y_i - b_1 X_{1i} - \dots - b_k X_{ki}\right)^2 + \lambda_{Ridge}\sum_{j=1}^{k} b_j^{2}$$

• It would seem natural to choose $\lambda_{Ridge}$ by minimizing wrt both b and
$\lambda_{Ridge}$ – but doing so would simply choose $\lambda_{Ridge} = 0$, which would
just get you back to OLS!
• Instead, $\lambda_{Ridge}$ can be chosen by minimizing the m-fold cross-validated estimate of the MSPE.
– Choose some value of $\lambda_{Ridge}$, and estimate the MSPE by m-fold cross-validation
– Repeat for many values of $\lambda_{Ridge}$, and choose the one that yields the lowest MSPE.
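A minimal sketch of this procedure, assuming scikit-learn and a simulated data set (not the California school data): loop over candidate values of $\lambda_{Ridge}$, compute the 10-fold cross-validated root MSPE for each, and keep the value with the smallest estimate.

```python
# A minimal sketch of choosing lambda_Ridge by 10-fold cross-validation,
# assuming scikit-learn and simulated data (not the California school data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=200, noise=20.0, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

best_lam, best_rmspe = None, np.inf
for lam in [0.1, 1, 10, 100, 1000]:                   # grid of candidate penalties
    mse = -cross_val_score(Ridge(alpha=lam), X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    rmspe = np.sqrt(mse)                              # cross-validated root MSPE
    if rmspe < best_rmspe:
        best_lam, best_rmspe = lam, rmspe
print(best_lam, round(best_rmspe, 1))                 # penalty with lowest root MSPE
```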
Empirical eg: predicting test scores

Data set: a school-level version of the California elementary district data set, augmented with additional variables describing school, student, and district characteristics

The full data set has 3932 observations. Half of those (1966) are used now – the remaining 1966 are reserved for an out-of-sample comparison of ridge vs. the other prediction methods, done later.

The data set has 817 predictors…


Empirical eg: predicting test scores
Variables in the 817-predictor school test score data set
Empirical eg: predicting test scores
$\lambda_{Ridge}$ is estimated by minimizing the 10-fold cross-validated MSPE (k = 817, n = 1966).
The resulting estimate of the shrinkage parameter is 39.5.

Root MSPE’s:
OLS: 78.2
Ridge: 39.5

Ridge cuts the square root of the MSPE in half, compared to OLS!
4. The Lasso
• The Lasso estimator shrinks the estimate towards zero by
penalizing large absolute values of the coefficients.

• The Lasso regression estimator minimizes the penalized sum of squared residuals,

$$S^{Lasso}(b;\ \lambda_{Lasso}) = \sum_{i=1}^{n}\left(Y_i - b_1 X_{1i} - \dots - b_k X_{ki}\right)^2 + \lambda_{Lasso}\sum_{j=1}^{k} \left|b_j\right|$$

where $\lambda_{Lasso}\sum_{j=1}^{k} \left|b_j\right|$ is the “penalty term.”

• This looks a lot like ridge estimation – but it turns out to


have very different properties…
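A minimal sketch of one of those properties, assuming scikit-learn and simulated sparse data (only 10 of 200 predictors matter); note that scikit-learn scales its penalty differently from the $\lambda_{Lasso}$ above, so the alpha value is illustrative only.

```python
# A minimal sketch of Lasso's variable selection, assuming scikit-learn and
# simulated sparse data (only 10 of 200 predictors matter). scikit-learn's
# alpha is scaled differently from the lecture's lambda_Lasso, so the value
# here is illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=200, n_informative=10,
                       noise=10.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
kept = int(np.sum(lasso.coef_ != 0))
print(f"Lasso keeps {kept} of {X.shape[1]} predictors; the rest are exactly 0.")
```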
Lasso in Pictures

When the OLS estimator is large, the Lasso estimator shrinks it slightly towards zero – less than ridge…

…but when the OLS estimator is small, the Lasso estimator shrinks it all the way to zero, so that the Lasso estimator is exactly zero.

Thus, the Lasso estimator sets some – many – of the $\beta$’s exactly to 0
More on Lasso (1 of 2)
Lasso sets some – many – of the $\beta$’s exactly to 0
• This property gives the Lasso its name: the Least Absolute
Shrinkage and Selection Operator. Selection, because it
selects a subset of the predictors to use for prediction –
and drops the rest.
• This feature means that Lasso can work especially well
when in reality many of the predictors are irrelevant.
• Models in which most of the true $\beta$’s are zero – that is, in
which E(Y|X) depends on just a few X’s – are called
sparse.
Lasso produces sparse models, and works well when the
population model is in fact sparse.
More on Lasso (2 of 2)
• Lasso has another unusual property: the estimated model,
and the selected variables, depend on how the variables are
specified.
• For example, if model A uses the dummy variables
Freshman, Sophomore, and Junior (omitting Senior), and model B
uses the same variables deviated from their means, then Lasso will in
general give different predictions for models A and B,
although OLS (and ridge) will give the same predictions.
• Technically, Lasso predictions are not invariant to linear
transformations of the regressors
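A minimal sketch of this non-invariance, using a hypothetical four-category example and scikit-learn (an assumption, not the lecture's example): the two codings carry exactly the same information, yet Lasso's predictions change while OLS's do not.

```python
# A minimal sketch (a hypothetical four-category example, not the lecture's):
# two codings of the same categorical information give identical OLS
# predictions but generally different Lasso predictions.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n = 300
group = rng.integers(0, 4, n)                       # e.g. four class years
y = np.array([0.0, 1.0, 2.0, 5.0])[group] + rng.normal(0, 1, n)

# Coding A: dummies for groups 1-3 (group 0 omitted).
XA = np.column_stack([(group == j).astype(float) for j in (1, 2, 3)])
# Coding B: dummies for groups 0-2 (group 3 omitted) - same information.
XB = np.column_stack([(group == j).astype(float) for j in (0, 1, 2)])

for name, model in [("OLS", LinearRegression()), ("Lasso", Lasso(alpha=0.1))]:
    pA = model.fit(XA, y).predict(XA)
    pB = model.fit(XB, y).predict(XB)
    print(name, "max prediction difference:", np.abs(pA - pB).max())
# OLS: difference is zero up to rounding; Lasso: the predictions differ.
```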
Predicting test scores
$\lambda_{Lasso}$ is estimated by minimizing the 10-fold cross-validated MSPE (k = 817, n = 1966).
The resulting estimate of the shrinkage parameter is 4527.

Root MSPE’s:
OLS: 78.2
Lasso: 39.7

The Lasso estimator retains only 56 of the 817 predictors.
Like ridge, Lasso cuts the square root of the MSPE in half,
compared to OLS!
5. Principal Components
• Ridge and Lasso reduce the MSPE by shrinking (biasing)
the estimated coefficients to zero – and in the case of
Lasso, by eliminating many of the regressors entirely.
• Instead, Principal components regression collapses the
very many predictors into a much smaller number ($p \ll k$)
of linear combinations of the predictors
• These linear combinations – called the principal
components of X – are computed so that they capture
as much of the variation in the original X’s as possible.
• Because the number p of principal components is small,
OLS can be used, with the principal components as
(new) regressors.
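A minimal sketch of principal components regression, assuming scikit-learn and simulated data (not the lecture's data set): standardize the X's, keep the first p principal components, then run OLS of Y on those components.

```python
# A minimal sketch of principal components regression, assuming scikit-learn
# and simulated data: standardize the X's, keep the first p principal
# components, then run OLS of Y on those components.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=200, noise=20.0, random_state=0)
p = 20                                              # number of components, p << k

Z = PCA(n_components=p).fit_transform(StandardScaler().fit_transform(X))
pcr = LinearRegression().fit(Z, y)                  # OLS with the p PCs as regressors
print(Z.shape)                                      # (500, 20): 200 X's collapsed to 20 PCs
```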
Principal Components in Pictures, k = 2

Suppose you have 2 X’s, and you want to choose a linear combination of those X’s (say, $aX_1 + bX_2$) that captures as much of the variation of the X’s as possible in a single summary variable. What values of a and b would you use?
The Principal Components solution is to choose a and b to solve

$$\max_{a,b}\ \mathrm{var}(aX_1 + bX_2)\quad\text{subject to}\quad a^2 + b^2 = 1$$

For 2 X’s that are positively correlated, the resulting choices of a and b are $a = b = 1/\sqrt{2}$, as shown in the figure.
Principal Components, k > 2

For k > 2 X’s, the principal components are the linear combinations of the X’s that have the greatest variance and that are uncorrelated with the previous principal components. So the jth principal component, $PC_j$, solves

$$\max_{a_{j1},\dots,a_{jk}}\ \mathrm{var}\!\left(\sum_{i=1}^{k} a_{ji} X_i\right)\quad\text{subject to}\quad \sum_{i=1}^{k} a_{ji}^{2} = 1$$

and subject to $PC_j$ being uncorrelated with $PC_1, \dots, PC_{j-1}$.


The first p principal components are the linear
combinations of X that capture as much of the variation
in X as possible.
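One way to compute these weights (a standard linear algebra result, though not derived in the lecture) is as eigenvectors of the sample covariance matrix of the X's; the numpy sketch below, with two positively correlated simulated X's, recovers the $a = b = 1/\sqrt{2}$ solution from the k = 2 example.

```python
# A minimal numpy sketch (standard linear algebra, not the lecture's code):
# the principal-component weights are eigenvectors of the sample covariance
# matrix. With two positively correlated X's this recovers a = b = 1/sqrt(2).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
w1 = eigvecs[:, -1]                                 # first PC weights, unit norm
print(w1)                                           # approx. (1/sqrt(2), 1/sqrt(2)), up to sign
print(np.var(X @ w1), eigvals[-1])                  # its variance ~ the largest eigenvalue
```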
Principal Components as Data Compression
• Principal components can be thought of as a data
compression tool, so that the compressed data have
fewer regressors with as little information loss as possible.
• Data compression is used all the time to reduce very large
data sets to smaller ones. A familiar example is image
compression, where the goal is to retain as many of the
features of the image (photograph) as possible, while
reducing the file size.
• In fact, many data compression algorithms build on or are
cousins of principal components analysis.
How many Principal Components? (1 of 2)

One way to choose p is to plot the increase in the average $R^2$ resulting from adding the pth principal component to a regression of X on $PC_1, \dots, PC_{p-1}$.

This plot is known as a scree plot. For the school test score data set, the scree plot shows that:

• The first principal component explains 18% of the variation in the 817 X’s!
• The first 10 PCs explain 63% of the variation in the 817 X’s!
• Still, it is rather hard to know where to draw the line…
How many Principal Components? (2 of 2)

The scree plot is informative (you should look at it) but doesn’t provide a
simple rule for choosing p.
• The number of principal components p is like the ridge and
Lasso penalty factors $\lambda_{Ridge}$ and $\lambda_{Lasso}$ – all are additional
parameters needed to implement the procedure.
• Like $\lambda_{Ridge}$ and $\lambda_{Lasso}$, p can be estimated by minimizing the
m-fold cross-validated estimate of the MSPE.
– For a given value of p, the principal components forecast is obtained by regressing Y on $PC_1, \dots, PC_p$ using the estimation sample, then using that model to predict in the test sample
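A minimal sketch of choosing p this way, assuming scikit-learn and simulated data; the pipeline recomputes the principal components on each estimation fold before predicting in the held-out fold, as described above.

```python
# A minimal sketch of choosing p by 10-fold cross-validated MSPE, assuming
# scikit-learn and simulated data. The pipeline recomputes the PCs on each
# estimation fold before predicting in the held-out fold.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=200, noise=20.0, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

rmspe = {}
for p in [5, 10, 20, 40, 80]:                       # candidate numbers of components
    pcr = make_pipeline(StandardScaler(), PCA(n_components=p), LinearRegression())
    mse = -cross_val_score(pcr, X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    rmspe[p] = np.sqrt(mse)                         # cross-validated root MSPE
print(min(rmspe, key=rmspe.get))                    # p with the smallest root MSPE
```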
Predicting test scores

p is estimated by minimizing the 10-fold cross-validated MSPE (k = 817, n = 1966).
The resulting estimate is p = 46.

Root MSPE’s:
OLS: 78.2
Principal Components: 39.7

• Principal Components collapses the 817 predictors to 46.


• Like ridge and Lasso, PC cuts the square root of the MSPE in half,
compared to OLS!
