Lecture 19
Model Selection
George Box: “All models are wrong, but some are useful.”
Occam’s Razor: “It is futile to do with more what can be done with less.”
“The simplest explanation is best!”
$$\mathrm{MSE}(\hat{\theta}, \theta^*) = E_X\bigl[(\hat{\theta}(x) - \theta^*)^2\bigr] = E_X\bigl[\hat{\theta}^2(x) - 2\theta^*\hat{\theta}(x) + (\theta^*)^2\bigr] = E_X[\hat{\theta}^2(x)] - 2\theta^*\,E_X[\hat{\theta}(x)] + (\theta^*)^2$$
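Adding and subtracting $E_X[\hat{\theta}(x)]^2$ completes this into the familiar bias-variance decomposition (a standard identity, spelled out here for reference):

$$\mathrm{MSE}(\hat{\theta}, \theta^*) = \underbrace{E_X[\hat{\theta}^2(x)] - E_X[\hat{\theta}(x)]^2}_{\mathrm{Var}(\hat{\theta})} + \underbrace{\bigl(E_X[\hat{\theta}(x)] - \theta^*\bigr)^2}_{\text{bias}^2}$$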
Bias-Variance Tradeoff
Model Selection
Find a model that appropriately balances complexity and generalization
capabilities: i.e., that optimizes the bias-variance tradeoff.
Train vs. Test Error
Underfitting vs. Overfitting
● Simple (e.g., linear) models are highly biased; as
such, they often underfit, meaning they fail to
capture regularities in the data.
● Otoh, they are not sensitive to noise (i.e., they
assume so much bias that they don’t change much
with the data), so are comparatively low variance.
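A minimal sketch of this contrast (the data-generating setup and polynomial degrees are illustrative assumptions, not from the lecture): low degrees underfit, very high degrees overfit, and the gap shows up as training error falling while test error rises.

```python
# Sketch: underfitting vs. overfitting with polynomial regression.
# The target function, noise level, and degrees are illustrative choices.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=100)  # noisy nonlinear data

X_train, X_test = X[:70], X[70:]
y_train, y_test = y[:70], y[70:]

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```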
Training vs. Test Data
● Divide data into two sets: training set and test set
● As their names suggest:
○ Train your model on the training set
○ Test your model on the test set
● Goal is to build a model that generalizes well from the training set to the
test set, i.e., from in-sample data to out-of-sample data.
● To achieve this goal, the test set should be representative of the same
underlying distribution as the training set, and should be large enough to
yield statistically meaningful results.
Holdout Method (to evaluate model accuracy)
● Partition the available data into a larger training set and a smaller testing set
● Train model on the training data
● Test model accuracy on the testing data
● Data are often shuffled first (e.g., if they were compiled by different sources)
[Figure: data split into a Training Data portion and a Testing Data portion]
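A minimal sketch of the holdout method using scikit-learn (the dataset, estimator, and split ratio are illustrative assumptions, not prescribed by the lecture):

```python
# Sketch: holdout evaluation with a shuffled 80/20 split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# shuffle=True guards against data that were compiled in some fixed order
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```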
Cross-Validation
● Partition the data multiple times
○ E.g., to partition the data 10 times, create
10 folds, then use each fold in turn as the test
set and the remaining folds as the training set
○ Average accuracy across all partitions to
approximate model accuracy
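A minimal sketch of 10-fold cross-validation with scikit-learn (dataset and classifier are illustrative assumptions):

```python
# Sketch: 10-fold cross-validation, averaging accuracy across folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print("per-fold accuracy:", scores)
print("estimated accuracy:", scores.mean())
```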
λ
● The new, regularized objective must balance the original objective against
the regularization term.
● This balance is achieved via a parameter λ, the weight of the regularization
term, with 1 - λ as the weight of the original objective.
● Higher λ increases bias and thus decreases variance.
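Under this convex-combination convention, ridge regression is one concrete instance (the notation here is illustrative, not the slide's):

$$\min_{\theta}\; (1-\lambda)\,\lVert y - X\theta\rVert_2^2 \;+\; \lambda\,\lVert\theta\rVert_2^2, \qquad \lambda \in [0,1]$$

At $\lambda = 0$ this reduces to ordinary least squares; as $\lambda \to 1$, the coefficients are shrunk toward zero.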
Two (really three) Sources of Error
● Error = Reducible Error + Irreducible Error
● $\mathrm{MSE}(\hat{\theta}, \theta^*)$ is reducible error. It varies from one learning
algorithm to another, presenting opportunities for improvement.
○ Reducible error decomposes into terms of bias and variance.
● Irreducible error arises because Y is almost never (except in toy
examples) completely determined by X.
○ Noise in a statistical model represents missing information.
○ The variance of this noise is the model’s irreducible error.
○ This error cannot be reduced, except perhaps by changing the model:
e.g., adding variables/features.
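For a model $Y = f(X) + \varepsilon$ with noise variance $\sigma^2$, the full decomposition (a standard result, stated here for reference) makes all three sources explicit:

$$E\bigl[(Y - \hat{f}(X))^2\bigr] = \underbrace{\bigl(E[\hat{f}(X)] - f(X)\bigr)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}\bigl(\hat{f}(X)\bigr)}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}$$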
Variable (i.e., Feature) Selection
● Identifying independent variables whose relationship to the dependent
variable is “important”.
● Simple heuristic for eliminating variables from a model: is the coefficient
(essentially) zero?
● Ridge regression does not set coefficients to zero, unless λ = 0, so it cannot
be used for variable selection.
● So, while ridge regression is useful for prediction, it is less effective when the
goal is to explain relationships among variables.
● LASSO, however, can set some coefficients to zero, so it is widely used when
the goal is to build an interpretable model.
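A minimal sketch contrasting the two (the synthetic dataset and penalty strengths are illustrative assumptions): with comparable regularization, LASSO zeroes out some coefficients while ridge only shrinks them.

```python
# Sketch: Lasso produces exact zeros; Ridge only shrinks coefficients.
# Note: scikit-learn weights the penalty with a single alpha rather than
# the (1 - λ, λ) convex-combination convention used above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically > 0
```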
Curse of Dimensionality
Adding new features to a model (i.e., increasing the dimensionality) in the hopes
of improving performance will eventually degrade performance
Curse of Dimensionality
As the dimensionality of the data (i.e., the number of features) increases,
“the volume of the space increases so fast that the available data become sparse” (Wikipedia).
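A minimal sketch of this sparsity effect (the setup is my own illustration, not from the lecture): for points drawn uniformly in the unit hypercube, the relative gap between a query point's nearest and farthest neighbor shrinks as the dimension grows, one symptom of the data becoming sparse.

```python
# Sketch: distance concentration as dimensionality grows.
# As d increases, nearest and farthest neighbors become nearly equidistant.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(1000, d))  # 1000 points in [0, 1]^d
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    gap = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative nearest/farthest gap: {gap:.3f}")
```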