Resampling Methods: Prof. Asim Tewari IIT Bombay
Asim Tewari, IIT Bombay ME 781: Engineering Data Mining and Applications
Resampling
• Resampling involves repeatedly drawing
samples from a training set and refitting a
model of interest on each sample
• Resampling methods
– Cross-validation
– Bootstrap
• Model assessment: The process of evaluating
a model’s performance
Cross-Validation
• The Validation Set Approach
– It involves randomly dividing the available set of
observations into two parts, a training set and a
validation set or hold-out set.
A schematic display of the validation set approach. A set of n observations is randomly split
into a training set (shown in blue, containing observations 7, 22, and 13, among others) and a
validation set (shown in beige, and containing observation 91, among others). The statistical
learning method is fit on the training set, and its performance is evaluated on the validation
set.
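The validation set approach described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's own code: the synthetic data, the polynomial model, the 50/50 split, and the helper name `validation_split_mse` are all assumptions made for the example.

```python
import numpy as np

def validation_split_mse(x, y, degree=1, train_fraction=0.5, seed=0):
    """Fit a polynomial on a random training subset; return the validation MSE."""
    rng = np.random.default_rng(seed)
    n = len(x)
    shuffled = rng.permutation(n)              # random ordering of the n indices
    n_train = int(train_fraction * n)
    train_idx, val_idx = shuffled[:n_train], shuffled[n_train:]

    # Fit on the training set only
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    # Evaluate on the held-out validation set
    predictions = np.polyval(coeffs, x[val_idx])
    return float(np.mean((y[val_idx] - predictions) ** 2))

# Hypothetical synthetic data: a quadratic trend plus Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=200)

mse_linear = validation_split_mse(x, y, degree=1)
mse_quadratic = validation_split_mse(x, y, degree=2)
print("validation MSE, degree 1:", mse_linear)
print("validation MSE, degree 2:", mse_quadratic)
```

Because the data here are generated from a quadratic trend, the degree-2 fit should show the lower validation MSE.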
Left: Validation error estimates for a single split into training and validation data sets. Right:
The validation method was repeated ten times, each time using a different random split of the
observations into a training set and a validation set. This illustrates the variability in the
estimated test MSE that results from this approach.
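The variability across random splits can be reproduced with a short experiment. The sketch below assumes synthetic data and polynomial fits of degree 1 to 5 (neither comes from the slides) and repeats the random 50/50 split ten times, recording the validation MSE for each degree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a quadratic trend plus Gaussian noise
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=n)

degrees = range(1, 6)
n_repeats = 10
val_mse = np.empty((n_repeats, len(degrees)))   # one row per random split

for r in range(n_repeats):
    idx = rng.permutation(n)
    train, val = idx[: n // 2], idx[n // 2 :]   # a fresh random 50/50 split
    for j, d in enumerate(degrees):
        coeffs = np.polyfit(x[train], y[train], d)
        resid = y[val] - np.polyval(coeffs, x[val])
        val_mse[r, j] = np.mean(resid ** 2)

# Range of the validation MSE across the ten splits, per polynomial degree
spread = val_mse.max(axis=0) - val_mse.min(axis=0)
print("MSE spread across splits, by degree:", spread)
```

Each row of `val_mse` corresponds to one curve in a plot like the right-hand panel; the nonzero `spread` is the split-to-split variability the caption refers to.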
• The Validation Set Approach
– The validation estimate of the test error rate can
be highly variable, depending on which observations
fall in the training set and which in the validation set.
– In the validation approach, only a subset of the
observations is used to fit the model. Since statistical
methods tend to perform worse when trained on fewer
observations, the validation error may overestimate the
test error of a model fit on the entire data set.
• Leave-One-Out Cross-Validation
A schematic display of LOOCV. A set of n data points is repeatedly split into a training set
(shown in blue) containing all but one observation, and a validation set that contains only that
observation (shown in beige). The test error is then estimated by averaging the n resulting
MSEs. The first training set contains all but observation 1, the second training set contains all
but observation 2, and so forth.
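The LOOCV procedure in the caption can be written as a simple loop: hold out observation i, fit on the remaining n - 1 observations, record the squared error on observation i, and average the n results. The sketch below is illustrative only; the polynomial model, the synthetic data, and the function name `loocv_mse` are assumptions.

```python
import numpy as np

def loocv_mse(x, y, degree=1):
    """Leave-one-out cross-validation estimate of the test MSE for a polynomial fit."""
    n = len(x)
    squared_errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                       # all observations except i
        coeffs = np.polyfit(x[keep], y[keep], degree)  # fit on n - 1 observations
        prediction = np.polyval(coeffs, x[i])          # predict the held-out point
        squared_errors[i] = (y[i] - prediction) ** 2
    return float(squared_errors.mean())                # average of the n MSEs

# Hypothetical synthetic data: a quadratic trend plus Gaussian noise
rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=50)

cv_linear = loocv_mse(x, y, degree=1)
cv_quadratic = loocv_mse(x, y, degree=2)
print("LOOCV MSE, degree 1:", cv_linear)
print("LOOCV MSE, degree 2:", cv_quadratic)
```

Unlike the validation set approach, there is no randomness in how the data are split, so repeated calls on the same data return the same estimate.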
The Bootstrap Method
• The bootstrap can be used to estimate the standard
errors of model coefficients. It is less useful for linear
models, since the standard errors of their coefficients
can be estimated directly.
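To make the comparison concrete, the sketch below estimates the standard errors of simple linear regression coefficients in two ways: from the usual closed-form expression, and by bootstrapping (resampling observation pairs with replacement and refitting). The data, the number of bootstrap replicates `B = 1000`, and the helper `ols` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical synthetic data: y = 3 + 1.5 x + noise
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=n)
X = np.column_stack([np.ones(n), x])     # design matrix with intercept column

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Direct estimate: sqrt of the diagonal of sigma^2 (X^T X)^{-1}
beta_hat = ols(X, y)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])
se_formula = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

# Bootstrap estimate: refit on B data sets drawn with replacement
B = 1000
boot_betas = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, size=n)     # n row indices, sampled with replacement
    boot_betas[b] = ols(X[idx], y[idx])
se_bootstrap = boot_betas.std(axis=0, ddof=1)

print("formula SE:  ", se_formula)
print("bootstrap SE:", se_bootstrap)
```

On data like these, where the linear model's assumptions hold, the two estimates should be close, which is why the bootstrap adds little in this setting.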
A graphical illustration of the bootstrap approach on a small sample containing n = 3
observations. Each bootstrap data set contains n observations, sampled with replacement from the
original data set. Each bootstrap data set is used to obtain an estimate of α.
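The resampling step in the figure can be sketched as follows. The slides do not define α here; the example assumes it is the variance-minimizing allocation fraction between two assets with returns X and Y, α = (σ_Y² − σ_XY) / (σ_X² + σ_Y² − 2σ_XY), as in the common textbook version of this figure. The simulated returns, the sample size n = 100 (rather than the figure's n = 3), and the function name `alpha_hat` are likewise assumptions.

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of alpha (assumed definition, see the text above)."""
    cov = np.cov(x, y)                       # 2x2 sample covariance matrix
    var_x, var_y, cov_xy = cov[0, 0], cov[1, 1], cov[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)

rng = np.random.default_rng(2)

# Hypothetical simulated returns for two assets
n = 100
returns = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=n)
x, y = returns[:, 0], returns[:, 1]

B = 1000
boot_alphas = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)         # n indices drawn with replacement
    boot_alphas[b] = alpha_hat(x[idx], y[idx])   # one estimate per bootstrap data set

se_alpha = boot_alphas.std(ddof=1)           # bootstrap standard error of alpha
print("alpha estimate:", alpha_hat(x, y))
print("bootstrap SE:  ", se_alpha)
```

The standard deviation of the B bootstrap estimates serves as the estimate of the standard error of the original α estimate.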