Lecture 19

bias variance trade off discussion

Uploaded by

gautamtalukdar

Model Selection

Model Selection
Box: “All models are wrong, but some are useful.”

Occam’s Razor: “It is futile to do more with what can be done with less.”
“The simplest explanation is best!”

Einstein: “Everything should be made as simple as possible, but no simpler.”

Bias
● Suppose 𝜽* is our model parameter, and 𝜽 is our estimator.
● The function 𝜽 applied to data x ~ X|𝜽* yields a point estimate.
● $E_{x \sim X|\theta^*}[\theta(x)]$ is the expected value of the estimator.
● $\mathrm{Bias}[\theta, \theta^*] = E_{x \sim X|\theta^*}[\theta(x)] - \theta^*$
Mean-squared Error
● Suppose 𝜽* is our model parameter, and 𝜽 is our estimator.
● The function 𝜽 applied to data x ~ X|𝜽* yields a point estimate.
● Mean-squared error is the expected squared error of the estimate.
● $\mathrm{MSE}(\theta, \theta^*) = E_{x \sim X|\theta^*}[(\theta(x) - \theta^*)^2]$
Bias-Variance Decomposition
● Theorem: $\mathrm{MSE}(\theta, \theta^*) = \mathrm{Bias}^2[\theta, \theta^*] + \mathrm{Var}[\theta]$
● So MSE is a combination of bias and variance in our estimator.
● Ideally, we would reduce both, but this is often impossible.
● Instead, we usually trade off one against the other.
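
The decomposition can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are Gaussian with true mean theta_star, and the estimator is a hypothetical "shrunk" sample mean (scaled by c), neither of which comes from the lecture. It estimates bias, variance, and MSE by simulation.

```python
import random
import statistics

def shrunk_mean(sample, c):
    """Hypothetical estimator: the sample mean scaled by a factor c (assumption)."""
    return c * statistics.fmean(sample)

def decompose(theta_star=2.0, sigma=1.0, n=10, c=0.8, trials=5000, seed=0):
    """Simulate many datasets and return (bias, variance, mse) of the estimator."""
    rng = random.Random(seed)
    estimates = [
        shrunk_mean([rng.gauss(theta_star, sigma) for _ in range(n)], c)
        for _ in range(trials)
    ]
    bias = statistics.fmean(estimates) - theta_star
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(t - theta_star) ** 2 for t in estimates])
    return bias, variance, mse

bias, variance, mse = decompose()
print(f"bias^2 + var = {bias ** 2 + variance:.4f}, mse = {mse:.4f}")
```

With c < 1 the estimator is biased towards zero but has lower variance than the plain sample mean, which is the trade-off the theorem describes.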
Proof of Bias-Variance Decomposition
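$\mathrm{Bias}^2[\theta, \theta^*] = (E_X[\theta(x)] - \theta^*)^2 = (E_X[\theta(x)])^2 - 2\theta^* E_X[\theta(x)] + (\theta^*)^2$

$\mathrm{Var}[\theta] = E_X[(\theta(x) - E_X[\theta(x)])^2] = E_X[\theta^2(x)] - (E_X[\theta(x)])^2$

$\mathrm{Bias}^2[\theta, \theta^*] + \mathrm{Var}[\theta] = (E_X[\theta(x)])^2 - 2\theta^* E_X[\theta(x)] + (\theta^*)^2 + E_X[\theta^2(x)] - (E_X[\theta(x)])^2$

$= -2\theta^* E_X[\theta(x)] + (\theta^*)^2 + E_X[\theta^2(x)]$ (the first and last terms cancel)

$\mathrm{MSE}(\theta, \theta^*) = E_X[(\theta(x) - \theta^*)^2] = E_X[\theta^2(x) - 2\theta^*\theta(x) + (\theta^*)^2] = E_X[\theta^2(x)] - 2\theta^* E_X[\theta(x)] + (\theta^*)^2$

The last two results are the same expression, so $\mathrm{MSE}(\theta, \theta^*) = \mathrm{Bias}^2[\theta, \theta^*] + \mathrm{Var}[\theta]$.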
Bias-Variance Tradeoff

Model Selection
Find a model that appropriately balances complexity and generalization
capabilities: i.e., that optimizes the bias-variance tradeoff.

● High bias, low variance (underfitting)

○ Build a model with only very few variables/features
○ Assume a simple relationship among variables (e.g., linear)
○ Make strong structural assumptions—so many, the data can barely be heard

● Low bias, high variance (overfitting)

○ Make weak structural assumptions
○ Allow for a complex relationship among variables (e.g., highly non-linear)
○ The analyst defers to the data almost entirely; their own domain expertise is suppressed
○ As for variables/features, throw in everything in the kitchen sink!
Overfitting
● Problem: Models are always biased
towards training data
● A model overfits when it “memorizes” the
training data
● Overfit models cannot generalize well to
test data
● Solution: Use test data to evaluate models
to mitigate the risk that they overfit
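
A small sketch of "memorizing" the training data. It assumes a 1-nearest-neighbor regressor and synthetic data from y = sin(x) plus noise; both are illustrative choices, not taken from the lecture. The model reproduces every training label exactly, so only the test error reveals how well it generalizes.

```python
import math
import random

def make_data(n, rng):
    """Noisy samples of y = sin(x); the true function is an assumption."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

def predict_1nn(train, x):
    """Return the label of the single closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(train, data):
    """Mean squared error of the 1-NN model (built on train) over data."""
    return sum((predict_1nn(train, x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train, test = make_data(50, rng), make_data(50, rng)
train_mse = mse(train, train)  # zero: each point is its own nearest neighbor
test_mse = mse(train, test)    # positive: the memorized noise does not transfer
print(train_mse, test_mse)
```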

Train vs. Test Error

● The points are the data
● The black line represents the true relationship
● The various colors refer to different estimators (linear, quadratic, & something crazy).

● The grey curve shows the training error. It decreases indefinitely.
● The red curve shows the test error. It has an elbow.
● The colored boxes correspond to the colored fits in the left plot.

Underfitting vs. Overfitting
● Simple (e.g., linear) models are highly biased; as such, they often underfit, meaning they fail to capture regularities in the data.
● On the other hand, they are not sensitive to noise (i.e., they assume so much bias that they don’t change much with the data), so they are comparatively low variance.

● More complicated models are less biased. Because of their flexibility, they end up modeling noise (as well as signal), and consequently overfit.
● Flexible models have high variance, because the models themselves can vary enormously with the data.

Training vs. Test Data
● Divide data into two sets: training set and test set
● As their names suggest:
○ Train your model on the training set
○ Test your model on the test set

● Goal is to build a model that generalizes well from the training set to the
test set, i.e., from in-sample data to out-of-sample data.
● To achieve this goal, the test set should be representative of the training set,
and should be large enough to obtain statistically significant results.
Holdout Method (to evaluate model accuracy)
● Partition our data into a large training set and a smaller testing set
● Train model on the training data
● Test model accuracy on the testing data
● Data are often shuffled first (e.g., if they were compiled by different sources)
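
The holdout steps above can be sketched as follows. The function name holdout_split, the 20% test fraction, and the fixed seed are illustrative assumptions, not part of the lecture.

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then cut off a smaller testing set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, testing set)

train, test = holdout_split(range(100))
print(len(train), len(test))
```

Shuffling before the cut guards against any ordering in how the data were compiled; fixing the seed keeps the split reproducible.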

[Figure: the data partitioned into a Training block and a Testing block]
Cross validation
● Partition data multiple times
○ If you want to partition your data 10 times, create
10 folds, and then use each fold as a test set, and
the rest of the data as a training set
○ Average accuracy across all partitions to
approximate model accuracy

● This is called cross validation

○ Typical to use k partitions for k-fold cross
validation (usually k = 10)
○ Leave-one-out cross validation: cross validation,
to the extreme: k = n, the sample size

● Cross validation is useful for model selection
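
A minimal sketch of k-fold cross validation in plain Python. The helper names, the use of mean squared error as the score (the slides speak of accuracy), and the toy mean-predictor model are all assumptions made for illustration.

```python
import random
import statistics

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, folds[i]

def cross_validate(xs, ys, fit, predict, k=10):
    """Average the per-fold test error (MSE here) across all k folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test_idx]
        scores.append(statistics.fmean(errors))
    return statistics.fmean(scores)

def fit_mean(xs, ys):
    """Toy model: ignore x and remember the mean of y."""
    return statistics.fmean(ys)

def predict_mean(model, x):
    return model

xs = list(range(20))
ys = [2.0 * x for x in xs]
cv_mse = cross_validate(xs, ys, fit_mean, predict_mean, k=5)
print(cv_mse)
```

Calling cross_validate with k equal to len(xs) gives leave-one-out cross validation. For model selection, one would compare this score across candidate models and keep the one with the lowest error.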

Linear Regression, Regularized
Regularization
● The bias-variance decomposition suggests trading bias for variance.
● Regularization is a technique that introduces bias to reduce variance.
● Shrinkage is a form of regularization that shrinks estimates towards zero.
● This technique discourages learning a more complex, flexible model, thereby
mitigating the risk of overfitting.
Regularizers
● A regularizer is a penalty term that is added to an objective function (e.g.,
minimize the sum of the squared residuals) to penalize large coefficients.
● Two popular choices lead to two popular variants on standard regression
● Ridge regression: Penalizes the sum of the squared coefficients
● LASSO: Penalizes the sum of the coefficients' absolute values
○ Least absolute shrinkage and selection operator

● Elastic Net: Penalizes a combination of the two
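
Written out, the three penalty terms might look as follows. The symbols are assumptions not used in the slides: $\beta_1, \dots, \beta_p$ are the regression coefficients and $\alpha \in [0, 1]$ is a hypothetical mixing weight for the elastic net (libraries differ in how they scale these terms).

```latex
R_{\text{ridge}}(\beta) = \sum_{j=1}^{p} \beta_j^2
\qquad
R_{\text{lasso}}(\beta) = \sum_{j=1}^{p} |\beta_j|
\qquad
R_{\text{elastic net}}(\beta) = \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2
```

Each penalty is added to the original objective (e.g., the sum of the squared residuals), so large coefficients raise the value being minimized.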

Regularizers Visualized in 2d
Assuming two coefficients, an increase in one is offset by a decrease in the other

λ
● The new, regularized objective must balance the original objective against
the regularization term.
● This balance is achieved via a parameter λ, the weight of the regularization
term, with 1 - λ as the weight of the original objective.
● A higher λ increases bias and, in exchange, decreases variance.
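
A sketch of how λ shrinks an estimate, assuming the simplest possible setting: one feature, no intercept, a ridge penalty, and made-up data. Minimizing (1 - λ) · Σ(y - b·x)² + λ · b² over the slope b has a closed form, which the hypothetical helper ridge_slope implements.

```python
def ridge_slope(xs, ys, lam):
    """Slope b minimizing (1 - lam) * sum((y - b*x)^2) + lam * b^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (1 - lam) * sxy / ((1 - lam) * sxx + lam)

# Made-up data, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# lam = 0 recovers least squares; lam = 1 puts all weight on the penalty
slopes = [ridge_slope(xs, ys, lam) for lam in (0.0, 0.5, 0.9, 1.0)]
print(slopes)
```

The slope moves steadily towards zero as λ grows, reaching exactly zero only at λ = 1, where the data no longer influence the fit at all.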
Two (really three) Sources of Error
● Error = Reducible Error + Irreducible Error
● MSE(𝜽, 𝜽*) is reducible error. It varies from one learning
algorithm to another, presenting opportunities for improvement.
○ Reducible error decomposes into terms of bias and variance.
● Irreducible error arises because Y is almost never (except in toy
examples) completely determined by X.
○ Noise in a statistical model represents missing information.
○ The variance of this noise is the model’s irreducible error.
○ This error cannot be reduced, except perhaps by changing the model:
e.g., adding variables/features.
Variable (i.e., Feature) Selection
● Identifying independent variables whose relationship to the dependent
variable is “important”.
● Simple heuristic for eliminating variables from a model: is the coefficient
(essentially) zero?
● Ridge regression shrinks coefficients towards zero but does not set them exactly to zero, unless the penalty receives all of the weight (λ = 1 in the weighting above), so it cannot be used for variable selection.
● So, while ridge regression is useful for prediction, it is less effective when the goal is to explain relationships among variables.
● LASSO, however, can set some coefficients to zero, so it is more popular,
and widely used when the goal is to build an interpretable model.
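
Why LASSO zeroes coefficients while ridge does not can be seen in a special case. The sketch assumes an orthonormal design and the additive penalty convention (½·RSS + λ·penalty for LASSO, ½·RSS + (λ/2)·penalty for ridge), which differs from the (1 - λ, λ) weighting used earlier. Under those assumptions each coefficient is a simple function of its least-squares value; the numbers are made up.

```python
def ridge_shrink(b, lam):
    """Ridge solution per coefficient: proportional shrinkage."""
    return b / (1 + lam)

def lasso_shrink(b, lam):
    """LASSO solution per coefficient: soft-thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero

ols = [3.0, -0.4, 0.1, -2.0]  # hypothetical least-squares coefficients
lam = 0.5
ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]
print(ridge)
print(lasso)
```

Ridge scales every coefficient by the same factor, so none becomes zero; LASSO subtracts a fixed amount, which removes the two small coefficients from the model.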
Curse of Dimensionality
Curse of Dimensionality
Adding new features to a model (i.e., increasing the dimensionality) in the hopes
of improving performance will eventually degrade performance

Curse of Dimensionality
As the dimensionality of the data (i.e., the number of features) increases,
“the volume of the space increases so fast that the available data become sparse” (Wikipedia).

Example from Stack Exchange:

● Finding your favorite kind of cookie among four possible flavors (sweet, salty, bitter, sour) requires eating four cookies.
● If there is an additional dimension, e.g., color, and there are three possible colors, you now have to eat 4 x 3 = 12 cookies to find your favorite.
● Add another dimension, e.g., shape, with five possibilities, and you now have to eat 4 x 3 x 5 = 60 cookies!

To cover 20% of the population:

● In 1 dimension, need 20% of the feature's range: $(0.2)^1 = 0.2$
● In 2 dimensions, need 45% of each feature's range: $(0.45)^2 \approx 0.2$
● In 3 dimensions, need 58% of each feature's range: $(0.58)^3 \approx 0.2$
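
The percentages above follow from taking the d-th root of the target coverage. A short sketch, assuming data spread uniformly over a unit hypercube; the helper name side_fraction and the extra 10-dimensional case are illustrative additions.

```python
def side_fraction(coverage, dims):
    """Fraction of each feature's range needed to enclose `coverage` of the volume."""
    return coverage ** (1 / dims)

fractions = {d: round(side_fraction(0.2, d), 2) for d in (1, 2, 3, 10)}
print(fractions)  # {1: 0.2, 2: 0.45, 3: 0.58, 10: 0.85}
```

By 10 dimensions, capturing 20% of the population already requires about 85% of the range of every feature, so a "local" neighborhood is no longer local.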
