
ECU3091 Econometrics A

Prediction with many regressors and big data II

Dr. Barra Roantree


Trinity College Dublin, 2022

Material from Stock and Watson Chapter 14


Outline
1. What is “Big Data”?
2. Prediction with many predictors: the MSPE, OLS, and
the principle of shrinkage
3. Ridge regression
4. The Lasso
5. Principal components
6. Application to prediction of school test scores
7. Summary
Recap: what is “Big Data”?
“Big Data” can mean many things and comes with its own jargon,
which makes it seem very different from econometrics…
• e.g. “Machine learning:” when a computer (machine) uses a large
data set to learn e.g. about your online shopping preferences

But at its core, machine learning builds on familiar tools


• We focus on one of the major applications of big data:
prediction with many predictors.
• With many predictors we need new methods that go beyond OLS.
• For prediction we do not need – and typically will not have – a
causal interpretation of the estimated coefficients: rather, the quality of
prediction is what matters (measured by the MSPE)
• And it turns out we can get better predictions by allowing for biased
estimators of a certain type
Recap: the Principle of Shrinkage
• The James-Stein shrinkage estimator:
$$\hat{\beta}^{JS} = c\,\hat{\beta}$$
where $0 < c < 1$.
• As c gets smaller:

– The squared bias of the estimator increases,


– But the variance decreases.
– This produces a bias-variance tradeoff, since the MSPE is a function of both
– If k is large, the benefit of smaller variance can beat the cost of
larger bias for the right choice of c – thus reducing the MSPE.

• The estimators we consider all have a shrinkage interpretation


3. Ridge Regression
The ridge regression estimator shrinks the estimate towards
zero by penalizing large squared values of the coefficients.

Minimizes the penalized sum of squared residuals,


$$S^{Ridge}(b;\ \lambda_{Ridge}) = \sum_{i=1}^{n}\left(Y_i - b_1 X_{1i} - \dots - b_k X_{ki}\right)^2 + \lambda_{Ridge}\sum_{j=1}^{k} b_j^{2}$$

where $\lambda_{Ridge}\sum_{j=1}^{k} b_j^{2}$ is a “penalty term” (the objective collapses to OLS when $\lambda_{Ridge} = 0$)

• If the regressors are uncorrelated,

$$\hat{\beta}_j^{Ridge} = \left(\frac{1}{1 + \lambda_{Ridge}/\sum_{i=1}^{n} X_{ji}^{2}}\right)\hat{\beta}_j$$

so the ridge estimator has the James-Stein form, $\hat{\beta}^{JS} = c\,\hat{\beta}$

– (math in S&W Appendix 14.3 and 19.7)
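As a check on this formula, here is a minimal sketch with simulated data and scikit-learn (an illustration under assumed inputs, not the lecture's code): with exactly orthogonal regressors and no intercept, the ridge estimates equal the OLS estimates times the shrinkage factor above, with scikit-learn's alpha playing the role of $\lambda_{Ridge}$.

```python
# A minimal sketch with simulated data and scikit-learn (an illustration, not
# the lecture's code): with exactly orthogonal regressors and no intercept,
# the ridge estimates equal the OLS estimates times the shrinkage factor above.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, k, lam = 200, 3, 50.0
X = np.linalg.qr(rng.normal(size=(n, k)))[0]          # orthonormal columns
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, n)

ols = LinearRegression(fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)  # alpha acts as lambda

shrink = 1.0 / (1.0 + lam / (X ** 2).sum(axis=0))     # James-Stein-type factor
print(ridge.coef_)                                    # ridge estimates
print(shrink * ols.coef_)                             # same values via the formula
```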


Ridge Regression in a Picture
The ridge regression penalty term penalizes the sum of squared residuals for large values of b, as shown here for k = 1:

• The value of the ridge objective function, $S^{Ridge}(b)$, is the sum of squared residuals plus a penalty which is quadratic in b.

• Thus, the penalized sum of squared residuals is minimized at a smaller value of b than is the unpenalized SSR.
Choosing the penalty factor $\lambda_{Ridge}$
The ridge regression estimator has an additional parameter, $\lambda_{Ridge}$:

$$S^{Ridge}(b;\ \lambda_{Ridge}) = \sum_{i=1}^{n}\left(Y_i - b_1 X_{1i} - \dots - b_k X_{ki}\right)^2 + \lambda_{Ridge}\sum_{j=1}^{k} b_j^{2}$$

• It would seem natural to choose $\lambda_{Ridge}$ by minimizing wrt both b and
$\lambda_{Ridge}$ – but doing so would simply choose $\lambda_{Ridge} = 0$, which would
just get you back to OLS!
• Instead, $\lambda_{Ridge}$ can be chosen by minimizing the m-fold cross-validated estimate of the MSPE.
– Choose some value of $\lambda_{Ridge}$, and estimate the MSPE by m-fold cross-validation
– Repeat for many values of $\lambda_{Ridge}$, and choose the one that yields the lowest MSPE.
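A minimal sketch of this procedure, assuming scikit-learn and a simulated data set (not the California school data): loop over candidate values of $\lambda_{Ridge}$, compute the 10-fold cross-validated root MSPE for each, and keep the value with the smallest estimate.

```python
# A minimal sketch of choosing lambda_Ridge by 10-fold cross-validation,
# assuming scikit-learn and simulated data (not the California school data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=200, noise=20.0, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

best_lam, best_rmspe = None, np.inf
for lam in [0.1, 1, 10, 100, 1000]:                   # grid of candidate penalties
    mse = -cross_val_score(Ridge(alpha=lam), X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    rmspe = np.sqrt(mse)                              # cross-validated root MSPE
    if rmspe < best_rmspe:
        best_lam, best_rmspe = lam, rmspe
print(best_lam, round(best_rmspe, 1))                 # penalty with lowest root MSPE
```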
Empirical eg: predicting test scores

Data set: a school-level version of the California elementary district data set, augmented with additional variables describing school, student, and district characteristics

The full data set has 3932 observations. Half of those (1966) are used now – the remaining 1966 are reserved for an out-of-sample comparison of ridge vs. the other prediction methods, done later.

The data set has 817 predictors…


Empirical eg: predicting test scores
Variables in the 817-predictor school test score data set
Empirical eg: predicting test scores
$\lambda_{Ridge}$ is estimated by minimizing the 10-fold cross-validated MSPE (k = 817, n = 1966).
The resulting estimate of the shrinkage parameter is 39.5.

Root MSPE’s:
OLS: 78.2
Ridge: 39.5

Ridge cuts the square root of the MSPE in half, compared to OLS!
4. The Lasso
• The Lasso estimator shrinks the estimate towards zero by
penalizing large absolute values of the coefficients.

• The Lasso regression estimator minimizes the penalized sum of squared residuals,

$$S^{Lasso}(b;\ \lambda_{Lasso}) = \sum_{i=1}^{n}\left(Y_i - b_1 X_{1i} - \dots - b_k X_{ki}\right)^2 + \lambda_{Lasso}\sum_{j=1}^{k} \left|b_j\right|$$

where $\lambda_{Lasso}\sum_{j=1}^{k} \left|b_j\right|$ is the “penalty term.”

• This looks a lot like ridge estimation – but it turns out to


have very different properties…
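A minimal sketch of one of those properties, assuming scikit-learn and simulated sparse data (only 10 of 200 predictors matter); note that scikit-learn scales its penalty differently from the $\lambda_{Lasso}$ above, so the alpha value is illustrative only.

```python
# A minimal sketch of Lasso's variable selection, assuming scikit-learn and
# simulated sparse data (only 10 of 200 predictors matter). scikit-learn's
# alpha is scaled differently from the lecture's lambda_Lasso, so the value
# here is illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=200, n_informative=10,
                       noise=10.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
kept = int(np.sum(lasso.coef_ != 0))
print(f"Lasso keeps {kept} of {X.shape[1]} predictors; the rest are exactly 0.")
```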
Lasso in Pictures

When the OLS estimator is large, the Lasso estimator shrinks it slightly towards zero – less than ridge…

…but when the OLS estimator is small, the Lasso estimator shrinks it all the way to zero, so that the Lasso estimator is exactly zero.

Thus, the Lasso estimator sets some – many – of the $\beta$’s exactly to 0
More on Lasso (1 of 2)
Lasso sets some – many – of the $\beta$’s exactly to 0
• This property gives the Lasso its name: the Least Absolute
Shrinkage and Selection Operator. Selection, because it
selects a subset of the predictors to use for prediction –
and drops the rest.
• This feature means that Lasso can work especially well
when in reality many of the predictors are irrelevant.
• Models in which most of the true $\beta$’s are zero – that is, in
which E(Y|X) depends on just a few X’s – are called
sparse.
Lasso produces sparse models, and works well when the
population model is in fact sparse.
More on Lasso (2 of 2)
• Lasso has another unusual property: the estimated model,
and the selected variables, depend on how the variables are
specified.
• For example, if model A uses the dummy variables
Freshman, Sophomore, and Junior (omitting Senior), and model B
uses the same variables deviated from their means, then Lasso will in
general give different predictions for models A and B,
although OLS (and ridge) will give the same predictions.
• Technically, Lasso predictions are not invariant to linear
transformations of the regressors
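A minimal sketch of this non-invariance, using a hypothetical four-category example and scikit-learn (an assumption, not the lecture's example): the two codings carry exactly the same information, yet Lasso's predictions change while OLS's do not.

```python
# A minimal sketch (a hypothetical four-category example, not the lecture's):
# two codings of the same categorical information give identical OLS
# predictions but generally different Lasso predictions.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n = 300
group = rng.integers(0, 4, n)                       # e.g. four class years
y = np.array([0.0, 1.0, 2.0, 5.0])[group] + rng.normal(0, 1, n)

# Coding A: dummies for groups 1-3 (group 0 omitted).
XA = np.column_stack([(group == j).astype(float) for j in (1, 2, 3)])
# Coding B: dummies for groups 0-2 (group 3 omitted) - same information.
XB = np.column_stack([(group == j).astype(float) for j in (0, 1, 2)])

for name, model in [("OLS", LinearRegression()), ("Lasso", Lasso(alpha=0.1))]:
    pA = model.fit(XA, y).predict(XA)
    pB = model.fit(XB, y).predict(XB)
    print(name, "max prediction difference:", np.abs(pA - pB).max())
# OLS: difference is zero up to rounding; Lasso: the predictions differ.
```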
Predicting test scores
$\lambda_{Lasso}$ is estimated by minimizing the 10-fold cross-validated MSPE (k = 817, n = 1966).
The resulting estimate of the shrinkage parameter is 4527.

Root MSPE’s:
OLS: 78.2
Lasso: 39.7

The Lasso estimator retains only 56 of the 817 predictors.
Like ridge, Lasso cuts the square root of the MSPE in half,
compared to OLS!
5. Principal Components
• Ridge and Lasso reduce the MSPE by shrinking (biasing)
the estimated coefficients to zero – and in the case of
Lasso, by eliminating many of the regressors entirely.
• Instead, Principal components regression collapses the
very many predictors into a much smaller number ($p \ll k$)
of linear combinations of the predictors
• These linear combinations – called the principal
components of X – are computed so that they capture
as much of the variation in the original X’s as possible.
• Because the number p of principal components is small,
OLS can be used, with the principal components as
(new) regressors.
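A minimal sketch of principal components regression, assuming scikit-learn and simulated data (not the lecture's data set): standardize the X's, keep the first p principal components, then run OLS of Y on those components.

```python
# A minimal sketch of principal components regression, assuming scikit-learn
# and simulated data: standardize the X's, keep the first p principal
# components, then run OLS of Y on those components.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=200, noise=20.0, random_state=0)
p = 20                                              # number of components, p << k

Z = PCA(n_components=p).fit_transform(StandardScaler().fit_transform(X))
pcr = LinearRegression().fit(Z, y)                  # OLS with the p PCs as regressors
print(Z.shape)                                      # (500, 20): 200 X's collapsed to 20 PCs
```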
Principal Components in Pictures, k = 2

Suppose you have 2 X’s, and you want to choose a linear combination of those X’s (say, $aX_1 + bX_2$) that captures as much of the variation of the X’s as possible in a single summary variable. What values of a and b would you use?
The Principal Components solution is to choose a and b to solve

$$\max_{a,b}\ \mathrm{var}(aX_1 + bX_2)\quad\text{subject to}\quad a^2 + b^2 = 1$$

For 2 X’s that are positively correlated, the resulting choices of a and b are $a = b = 1/\sqrt{2}$, as shown in the figure.
Principal Components, k > 2

For k > 2 X’s, the principal components are the linear combinations of the X’s that have the greatest variance and that are uncorrelated with the previous principal components. So the jth principal component, $PC_j$, solves

$$\max_{a_{j1},\dots,a_{jk}}\ \mathrm{var}\!\left(\sum_{i=1}^{k} a_{ji} X_i\right)\quad\text{subject to}\quad \sum_{i=1}^{k} a_{ji}^{2} = 1$$

and subject to $PC_j$ being uncorrelated with $PC_1, \dots, PC_{j-1}$.


The first p principal components are the linear
combinations of X that capture as much of the variation
in X as possible.
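One way to compute these weights (a standard linear algebra result, though not derived in the lecture) is as eigenvectors of the sample covariance matrix of the X's; the numpy sketch below, with two positively correlated simulated X's, recovers the $a = b = 1/\sqrt{2}$ solution from the k = 2 example.

```python
# A minimal numpy sketch (standard linear algebra, not the lecture's code):
# the principal-component weights are eigenvectors of the sample covariance
# matrix. With two positively correlated X's this recovers a = b = 1/sqrt(2).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
w1 = eigvecs[:, -1]                                 # first PC weights, unit norm
print(w1)                                           # approx. (1/sqrt(2), 1/sqrt(2)), up to sign
print(np.var(X @ w1), eigvals[-1])                  # its variance ~ the largest eigenvalue
```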
Principal Components as Data Compression
• Principal components can be thought of as a data
compression tool, so that the compressed data have
fewer regressors with as little information loss as possible.
• Data compression is used all the time to reduce very large
data sets to smaller ones. A familiar example is image
compression, where the goal is to retain as many of the
features of the image (photograph) as possible, while
reducing the file size.
• In fact, many data compression algorithms build on or are
cousins of principal components analysis.
How many Principal Components? (1 of 2)

One way to choose p is to plot the increase in the average $R^2$ resulting from adding the pth principal component to a regression of X on $PC_1, \dots, PC_{p-1}$.

This plot is known as a scree plot. For the school test score data set, the scree plot shows that:

• The first principal component explains 18% of the variation in the 817 X’s!
• The first 10 PCs explain 63% of the variation in the 817 X’s!
• Still, it is rather hard to know where to draw the line…
How many Principal Components? (2 of 2)

The scree plot is informative (you should look at it) but doesn’t provide a
simple rule for choosing p.
• The number of principal components p is like the ridge and
Lasso penalty factors $\lambda_{Ridge}$ and $\lambda_{Lasso}$ – all are additional
parameters needed to implement the procedure.
• Like $\lambda_{Ridge}$ and $\lambda_{Lasso}$, p can be estimated by minimizing the
m-fold cross-validated estimate of the MSPE.
– For a given value of p, the principal components forecast is obtained by regressing Y on $PC_1, \dots, PC_p$ using the estimation sample, then using that model to predict in the test sample
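A minimal sketch of choosing p this way, assuming scikit-learn and simulated data; the pipeline recomputes the principal components on each estimation fold before predicting in the held-out fold, as described above.

```python
# A minimal sketch of choosing p by 10-fold cross-validated MSPE, assuming
# scikit-learn and simulated data. The pipeline recomputes the PCs on each
# estimation fold before predicting in the held-out fold.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=200, noise=20.0, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

rmspe = {}
for p in [5, 10, 20, 40, 80]:                       # candidate numbers of components
    pcr = make_pipeline(StandardScaler(), PCA(n_components=p), LinearRegression())
    mse = -cross_val_score(pcr, X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    rmspe[p] = np.sqrt(mse)                         # cross-validated root MSPE
print(min(rmspe, key=rmspe.get))                    # p with the smallest root MSPE
```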
Predicting test scores

p is estimated by minimizing the 10-fold cross-validated MSPE (k = 817, n = 1966).
The resulting estimate is p = 46.

Root MSPE’s:
OLS: 78.2
Principal Components: 39.7

• Principal Components collapses the 817 predictors to 46.


• Like ridge and Lasso, PC cuts the square root of the MSPE in half,
compared to OLS!
