
DATA SCIENCE AND AUTOMATION
Master degree in MECHATRONICS AND SMART TECHNOLOGY ENGINEERING

Lectures 6-7: Bias-variance tradeoff, overfitting and regularization

SPEAKER: Prof. Mirko Mazzoleni
PLACE: University of Bergamo


Syllabus
1. Introduction to data science
   1.1 The business perspective
   1.2 Data analysis processes
2. Data visualization
3. Maximum Likelihood Estimation
4. Linear regression
5. Logistic regression
6. Bias-Variance tradeoff
7. Overfitting and regularization
8. Validation and performance metrics
9. Decision trees
10. Neural networks
11. Machine vision
   11.1 Classic approaches
   11.2 CNN and deep learning
12. Unsupervised learning
   12.1 k-means and hierarchical clustering
   12.2 Principal Component Analysis
13. Fault diagnosis
   13.1 Model-based fault diagnosis
   13.2 Signal-based fault diagnosis
   13.3 Data-driven fault diagnosis

2 /32
Outline

1. Bias-variance tradeoff

2. Learning curves

3. Overfitting

4. Regularization

3 /32
Outline

1. Bias-variance tradeoff

2. Learning curves

3. Overfitting

4. Regularization

4 /32
Approximation vs. generalization
The example shows:

• Perfect fit on the in-sample (training) data: 𝐸in = 0

• Bad fit on the out-of-sample (test) data: 𝐸out huge
5 /32
Approximation vs. generalization
The final aim is to have a small 𝐸out: a good approximation of 𝑓 out-of-sample

• Hypothesis space ℋ MORE complex ⟹ better chances to approximate 𝑓 in-sample

• Hypothesis space ℋ LESS complex ⟹ better chances to generalize 𝑓 out-of-sample

The ideal case would be to have a hypothesis space ℋ that contains only the target function 𝑓: ℋ = {𝑓} ⟹ win a lottery ticket ☺
6 /32
Bias and variance of machine learning model
$$\mathbf{bias}^{2} = \big(\bar{g}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^{2} \qquad \mathbf{variance} = \mathbb{E}_{\mathcal{D}}\Big[\big(g^{(\mathcal{D})}(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^{2}\Big]$$

• 𝑔^(𝒟): the function learnt on the dataset 𝒟
• 𝑔̄: the average function in ℋ

[Figure: the hypothesis space ℋ, the target 𝑓, the bias (distance between 𝑔̄ and 𝑓) and the variance (spread of the learnt hypotheses around 𝑔̄).]

VERY SMALL model set: since there is only one hypothesis, both the average function 𝑔̄ and the final hypothesis 𝑔^(𝒟) will be the same, for any dataset. Thus, var = 0. The bias will depend solely on how well this single hypothesis approximates the target 𝑓, and unless we are extremely lucky, we expect a large bias.

VERY LARGE model set: the target function is in ℋ. Different data sets will lead to different hypotheses that agree with 𝑓 on the data set, and are spread around 𝑓 in the red region. Thus, bias ≈ 0 because 𝑔̄ is likely to be close to 𝑓. The variance is large (heuristically represented by the size of the red region).

(An empirical sketch of these two quantities follows below.)

7 /32
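The two quantities above can be estimated numerically. Below is a minimal Python sketch (not from the slides): the target 𝑓(𝑥) = sin(𝜋𝑥), the noise level and the linear model are assumptions chosen only for illustration. It draws many datasets 𝒟, fits 𝑔^(𝒟) on each, builds the average function 𝑔̄, and then averages bias² and variance over a grid of 𝑥 values.

```python
# Minimal sketch: empirical estimate of bias^2 and variance (assumed target/model).
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: np.sin(np.pi * x)          # assumed target function f
x_grid = np.linspace(-1, 1, 200)         # points where bias/variance are evaluated
N, n_datasets, degree = 5, 500, 1        # dataset size, number of datasets D, model degree

predictions = np.empty((n_datasets, x_grid.size))
for d in range(n_datasets):
    x = rng.uniform(-1, 1, N)                     # a new dataset D
    y = f(x) + 0.1 * rng.standard_normal(N)
    theta = np.polyfit(x, y, deg=degree)          # g^(D): least-squares polynomial fit
    predictions[d] = np.polyval(theta, x_grid)

g_bar = predictions.mean(axis=0)                  # average function g_bar(x)
bias2 = np.mean((g_bar - f(x_grid)) ** 2)         # average of (g_bar(x) - f(x))^2
variance = np.mean((predictions - g_bar) ** 2)    # average of E_D[(g^D(x) - g_bar(x))^2]
print(f"bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```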
Bias and variance of machine learning model
Heuristic rule
How many points 𝑁 are required to ensure a good chance of generalizing?

𝑁 ≥ 10 ⋅ (number of model parameters)

General principle
The «model complexity» should match the number of data, not the target complexity

[Figure: in-sample and out-of-sample error vs. model complexity (number of parameters / VC dimension 𝑑*).]
8 /32
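As a tiny illustration of the rule of thumb (the polynomial example is an assumption, not from the slides): a degree-𝑑 polynomial in one variable has 𝑑 + 1 parameters, so the heuristic suggests at least 10(𝑑 + 1) points.

```python
# Minimal sketch of the N >= 10 * (number of parameters) rule of thumb.
def min_points_for(n_params: int, factor: int = 10) -> int:
    """Heuristic minimum number of training points for a model with n_params parameters."""
    return factor * n_params

degree = 4                           # 4th-order polynomial -> degree + 1 = 5 parameters
print(min_points_for(degree + 1))    # suggests at least 50 points
```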
Outline

1. Bias-variance tradeoff

2. Learning curves

3. Overfitting

4. Regularization

9 /32
Learning curves
Learning curves are a graphical tool to understand if a learning model suffers from bias
or variance problems

The idea is to represent, by varying the number of data 𝑁 used to train the model:

• The expected out-of-sample error 𝔼_𝒟[𝐸out(𝑔^(𝒟))]

• The expected in-sample error 𝔼_𝒟[𝐸in(𝑔^(𝒟))]

In practice, the curves are computed from one dataset, or by dividing it into more parts (e.g. 𝒟1, 𝒟2, …, 𝒟6) and taking the «average curve» resulting from the various sub-datasets (a sketch is given below)

10 /32
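A minimal sketch of how such curves can be computed in practice; the use of scikit-learn's learning_curve and the synthetic dataset are assumptions, not prescribed by the slides. The training error approximates 𝐸in and the cross-validated error approximates 𝐸out, both averaged over several splits.

```python
# Minimal sketch: learning curves with scikit-learn (assumed library and data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))                          # assumed synthetic data
y = np.sin(np.pi * X[:, 0]) + 0.2 * rng.standard_normal(200)

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5, scoring="neg_mean_squared_error",
)
E_in = -train_scores.mean(axis=1)    # average in-sample error vs. N
E_out = -val_scores.mean(axis=1)     # average estimate of out-of-sample error vs. N
for n, ei, eo in zip(sizes, E_in, E_out):
    print(f"N = {n:3d}:  E_in = {ei:.3f}   E_out ~ {eo:.3f}")
```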
Learning curves
[Figure: learning curves for two models, expected error vs. the number of data points 𝑁 used to train the model. Left, «simple» model: expected 𝐸out and 𝐸in converge quickly but to a high level (BIAS² "big", VARIANCE "small"). Right, «complex» model: large gap between expected 𝐸out and 𝐸in that closes slowly (BIAS² "small", VARIANCE "big").]
11 /32
Learning curves
Interpretation
• The bias can be present when the expected error is quite high and 𝐸in is similar to 𝐸out
• When there is bias, it is unlikely that more data will help
• The variance can be present when there is a huge gap between 𝐸in and 𝐸out
• When there is variance, it is likely that more data will help

Solving a bias problem:
• Add more features, for instance by combining the original ones
• Boosting

Solving a variance problem:
• Use fewer features
• Get more data
• Use regularization
• Bagging
12 /32
Outline

1. Bias-variance tradeoff

2. Learning curves

3. Overfitting

4. Regularization

13 /32
Overfitting
We encountered the overfitting phenomenon when we talked about the approximation-
generalization tradeoff

We saw how we must use simpler models if we have few data, independently of the complexity of the true function

We now introduce another cause of overfitting: the stochastic noise on the output data 𝑦

14 /32
Overfitting

• Simple function to learn

• 𝑁 = 5 points

• Model: 4th-order polynomial

𝐸in = 0 𝐸out = 0

15 /32
Overfitting

• Simple function to learn

• 𝑁 = 5 noisy points

• Model: 4th-order polynomial

16 /32
Overfitting

• Simple function to learn

• 𝑁 = 5 noisy points

• Model: 4th-order polynomial

𝐸in = 0 𝐸out = 𝐡𝐮𝐠𝐞

17 /32
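The setting of the previous slides can be reproduced with a few lines of Python; the specific "simple function" and noise level below are assumptions. Fitting a 4th-order polynomial to 𝑁 = 5 noisy points interpolates them exactly, so 𝐸in is numerically ~0 while 𝐸out is much larger.

```python
# Minimal sketch: interpolating 5 noisy points with a 4th-order polynomial (assumed target).
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: 0.5 * x + 1.0                            # assumed "simple function to learn"

x_train = np.linspace(-1, 1, 5)
y_train = f(x_train) + 0.3 * rng.standard_normal(5)    # N = 5 noisy points

theta = np.polyfit(x_train, y_train, deg=4)            # 4th-order polynomial interpolates the 5 points

x_test = np.linspace(-1, 1, 1000)
y_test = f(x_test)                                     # noiseless target used to estimate E_out

E_in = np.mean((np.polyval(theta, x_train) - y_train) ** 2)
E_out = np.mean((np.polyval(theta, x_test) - y_test) ** 2)
print(f"E_in  = {E_in:.2e}   (numerically ~0)")
print(f"E_out = {E_out:.2e}  (much larger than E_in)")
```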
Example: a student who has to learn some concepts
To understand the phenomenon of overfitting in an intuitive way, let's consider the following analogy.

The teacher of a course provides solved exercises in order to teach how to solve a problem. The exam exercises must necessarily be different from those provided in class, otherwise the teacher is not able to understand if the student has only memorized how to solve the exercises or if she/he has really learned the concepts.

In the first case (memorizing) the student has not really learned: when faced with a similar (but different) exercise she/he will not be able to solve it. The student has overfitted the exercises seen in class, without having generalized the concepts and therefore the solution method.

18 /32
Overfitting vs. model complexity

• We talk of overfitting when decreasing 𝐸in leads to increasing 𝐸out

• Major source of failure for machine learning systems

• Overfitting leads to bad generalization

• A model can exhibit bad generalization even if it does not overfit

[Figure: in-sample and out-of-sample error vs. model complexity. Low complexity: high bias, low variance; high complexity: low bias, high variance. The overfitting region is where 𝐸out increases while 𝐸in keeps decreasing.]
19 /32
Overfitting vs. model complexity

[Figure: three fits of the same data. Left: underfit (high bias). Middle: OK. Right: overfit (high variance).]

20 /32
Outline

1. Bias-variance tradeoff

2. Learning curves

3. Overfitting

4. Regularization

21 /32
A cure for overfitting
Regularization is the first line of defense against overfitting
We have seen that more complex models are more prone to overfitting. This is because
they are more «powerful» (expressive) and therefore can also adapt to the noise

Simple models show less variance due to their limited expressiveness. The reduction in the variance of the model is often greater than the increase in its bias, so that, overall, the expected error (bias² + variance + noise variance) decreases

However, if we stick to simple models only, we may not get a satisfactory approximation
of the target function 𝑓

How can we retain the benefits of both worlds?

22 /32
Regularization
Idea: in addition to minimizing the «data-fit» cost 𝐸in(𝜽) ≡ 𝐽(𝜽), minimize also the model complexity

Instead of 𝐸in(𝜽), we minimize an augmented error 𝐸aug(𝜽), where ℎ(⋅) is some function that represents our model:

$$E_{\mathrm{aug}}(\boldsymbol{\theta}) \;=\; \underbrace{\frac{1}{N}\sum_{i=1}^{N}\Big(y^{(i)} - h\big(\boldsymbol{x}^{(i)};\boldsymbol{\theta}\big)\Big)^{2}}_{\text{data-fit: how badly the model fits the data (an error term)}} \;+\; \lambda_{\mathrm{reg}}\cdot \underbrace{\Omega(\boldsymbol{\theta})}_{\text{regularizer: how «complex» the model is}}$$

The term 𝜆reg (hyperparameter) weights the importance of minimizing 𝐸in(𝜽) ≡ 𝐽(𝜽) with respect to minimizing Ω(𝜽)
23 /32
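A minimal sketch of the augmented error above, assuming a linear-in-the-parameters model ℎ(𝒙; 𝜽) = 𝚽𝜽 and using the 𝐿2 penalty as an example of Ω(𝜽); the names and the random data are illustrative only.

```python
# Minimal sketch: augmented error E_aug = data-fit + lambda_reg * Omega(theta).
import numpy as np

def E_aug(theta, Phi, y, lambda_reg, omega=lambda t: np.sum(t ** 2)):
    """Mean squared data-fit term plus weighted model-complexity term."""
    residuals = y - Phi @ theta            # y^(i) - h(x^(i); theta) for a linear model
    data_fit = np.mean(residuals ** 2)     # (1/N) * sum of squared residuals
    return data_fit + lambda_reg * omega(theta)

# Purely illustrative usage with random numbers:
rng = np.random.default_rng(0)
Phi, y = rng.standard_normal((50, 3)), rng.standard_normal(50)
print(E_aug(np.zeros(3), Phi, y, lambda_reg=0.1))
```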
Example: 𝐿2 regularization
The 𝐿2 regularization penalizes the sum of the squared model coefficients 𝜽 ∈ ℝ^(𝑑×1):

$$E_{\mathrm{aug}}(\boldsymbol{\theta}) \;=\; \frac{1}{N}\sum_{i=1}^{N}\Big(y^{(i)} - h\big(\boldsymbol{x}^{(i)};\boldsymbol{\theta}\big)\Big)^{2} \;+\; \lambda_{\mathrm{reg}}\cdot \sum_{j=0}^{d-1}\theta_j^{2}$$

• If this 𝐿2 regularization is applied to a linear regression problem, the method is called Ridge regression (a small sketch follows below)

• The intercept 𝜃0 is sometimes not penalized; in this case, 𝑗 starts from 1

• This problem can also be seen as a constrained optimization problem
24 /32
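A minimal sketch of Ridge regression through its closed-form solution, consistent with the 𝐸aug written above (since the data-fit term is divided by 𝑁, the penalty enters the normal equations as 𝑁𝜆reg). Penalizing the intercept and the helper name ridge_fit are choices made here for brevity.

```python
# Minimal sketch: Ridge regression in closed form for a linear model y ~ Phi @ theta.
import numpy as np

def ridge_fit(Phi, y, lambda_reg):
    """Minimize (1/N)*||y - Phi @ theta||^2 + lambda_reg * ||theta||^2."""
    N, d = Phi.shape
    A = Phi.T @ Phi + N * lambda_reg * np.eye(d)   # regularized normal equations
    return np.linalg.solve(A, Phi.T @ y)

# Note (assumption on scaling): scikit-learn's Ridge(alpha=N*lambda_reg, fit_intercept=False)
# solves the same problem, because its data-fit term is not divided by N.
```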
Example: 𝐿2 regularization
$$\begin{aligned} \underset{\boldsymbol{\theta}}{\text{minimize}}\quad & E_{\mathrm{in}}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\Big(y^{(i)} - h\big(\boldsymbol{x}^{(i)};\boldsymbol{\theta}\big)\Big)^{2} \\ \text{subject to}\quad & \boldsymbol{\theta}^{\top}\boldsymbol{\theta} \le c \end{aligned}$$

• With this interpretation, we are explicitly constraining the coefficients 𝜽 not to take big values

• There is a relation between 𝑐 and 𝜆reg such that if 𝑐 ↑, then 𝜆reg ↓ (see the numerical sketch below)

In fact, a larger 𝑐 means that the weights can be larger. This is the same as setting a lower 𝜆reg, because the regularization term will be less important and therefore the weights will not be reduced as much

25 /32
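A small numerical illustration (an assumption, not from the slides) of the relation between 𝑐 and 𝜆reg: as 𝜆reg grows, the squared norm 𝜽ᵀ𝜽 of the ridge solution shrinks, i.e. a large 𝜆reg corresponds to a small effective 𝑐 and vice versa.

```python
# Minimal sketch: theta^T theta of the ridge solution shrinks as lambda_reg increases.
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5
Phi = rng.standard_normal((N, d))                       # assumed regressor matrix
theta_true = np.array([3.0, -2.0, 1.0, 0.5, -1.5])
y = Phi @ theta_true + 0.1 * rng.standard_normal(N)

for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    theta = np.linalg.solve(Phi.T @ Phi + N * lam * np.eye(d), Phi.T @ y)
    print(f"lambda_reg = {lam:5.2f}  ->  theta^T theta = {theta @ theta:7.3f}")
```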
Effect of the regularization hyperparameter 𝜆
[Figure: four fits of the same data with increasing regularization, 𝜆reg,1 = 0, 𝜆reg,2 > 𝜆reg,1, 𝜆reg,3 > 𝜆reg,2, 𝜆reg,4 > 𝜆reg,3, going from overfit to underfit.]

If we regularize too much, we will learn the simplest possible function, which is a horizontal line (constant) with intercept 𝜃0 (a numerical sketch of this effect follows below)

26 /32
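A minimal sketch of the sweep shown in the figure, under assumed data and model choices: a 9th-degree polynomial with 𝐿2 regularization moves from overfitting (𝜆reg = 0, 𝐸in ≈ 0 but larger 𝐸out) towards underfitting (large 𝜆reg, both errors high).

```python
# Minimal sketch: effect of lambda_reg on a regularized polynomial fit (assumed setup).
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(np.pi * x)                              # assumed target
x_tr = np.linspace(-1, 1, 12); y_tr = f(x_tr) + 0.2 * rng.standard_normal(12)
x_te = np.linspace(-1, 1, 500); y_te = f(x_te)

def poly_features(x, degree):                                # columns 1, x, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

Phi_tr, Phi_te = poly_features(x_tr, 9), poly_features(x_te, 9)
for lam in [0.0, 1e-4, 1e-2, 1.0, 100.0]:
    A = Phi_tr.T @ Phi_tr + len(x_tr) * lam * np.eye(Phi_tr.shape[1])
    theta = np.linalg.solve(A, Phi_tr.T @ y_tr)
    E_in = np.mean((Phi_tr @ theta - y_tr) ** 2)
    E_out = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"lambda_reg = {lam:8.4f}   E_in = {E_in:.3f}   E_out = {E_out:.3f}")
```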
Intuition about the importance of 𝐸aug w.r.t. 𝐸in
Minimizing 𝐸aug instead of 𝐸in leads to a better model (that is, a model with better generalization capabilities and so with lower 𝐸out)

[Figure: two learning-curve plots of expected 𝐸out and 𝐸in vs. the number of data points 𝑁 used to train the model. Left: the asymptotic level and the gap between the curves are read as BIAS² and VARIANCE. Right: the gap is read as Ω̃ = 𝐸out − 𝐸in.]
27 /32
Intuition about the importance of 𝐸aug w.r.t. 𝐸in
From the previous graph, as well as through bias and variance, we can interpret 𝐸out as the sum of two contributions, where Ω̃ = 𝐸out − 𝐸in:

$$E_{\mathrm{out}}(\boldsymbol{\theta}) = E_{\mathrm{in}}(\boldsymbol{\theta}) + \tilde{\Omega}(\boldsymbol{\theta})$$

Recalling the definition of 𝐸aug we have

$$E_{\mathrm{aug}}(\boldsymbol{\theta}) = E_{\mathrm{in}}(\boldsymbol{\theta}) + \lambda_{\mathrm{reg}}\,\Omega(\boldsymbol{\theta})$$

The error 𝐸aug is a better proxy of 𝐸out than 𝐸in
28 /32
Intuition about the importance of 𝐸aug w.r.t. 𝐸in

The Holy Grail of machine learning would be to have an expression of 𝐸out to minimize

• In this way, it would be possible to directly minimize the out-of-sample error instead of the in-sample one (or the augmented error)

• The regularization helps to estimate the quantity Ω(𝜽) that, added to 𝐸in, gives 𝐸aug, an estimate of 𝐸out
29 /32
Choosing the regularization term
There are different types of regularizers. The most used are:

• 𝐿2 regularization, also called Ridge regularization: $$\Omega(\boldsymbol{\theta}) = \sum_{j=0}^{d-1}\theta_j^{2}$$

• 𝐿1 regularization, also called Lasso regularization: $$\Omega(\boldsymbol{\theta}) = \sum_{j=0}^{d-1}\lvert\theta_j\rvert$$

• Elastic-net regularization: $$\Omega(\boldsymbol{\theta}) = \beta\sum_{j=0}^{d-1}\theta_j^{2} + (1-\beta)\sum_{j=0}^{d-1}\lvert\theta_j\rvert$$

The Ridge penalty tends to reduce all coefficients to a small value

The Lasso penalty tends to bring some coefficients exactly to zero

(A small sketch of these penalties follows below.)
30 /32
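A minimal sketch of the three penalties Ω(𝜽) listed above; whether 𝜃0 enters the sums is a modelling choice, as noted earlier.

```python
# Minimal sketch: Ridge, Lasso and Elastic-net penalties Omega(theta).
import numpy as np

def omega_ridge(theta):                     # L2: sum of squared coefficients
    return np.sum(theta ** 2)

def omega_lasso(theta):                     # L1: sum of absolute coefficients
    return np.sum(np.abs(theta))

def omega_elastic_net(theta, beta=0.5):     # convex combination of the two
    return beta * omega_ridge(theta) + (1 - beta) * omega_lasso(theta)

theta = np.array([0.0, 2.0, -1.0, 0.5])
print(omega_ridge(theta), omega_lasso(theta), omega_elastic_net(theta, beta=0.3))
```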
Choosing the regularization term

[Figure: level curves of a convex cost function 𝐽(𝜽) ≡ 𝐸in(𝜽) in the (𝜃1, 𝜃2) plane, together with the constraint region of each penalty. Ridge: 𝜃1² + 𝜃2² ≤ 𝑐 (a disk); notice that the regularized estimate of 𝜽 has both 𝜃1 and 𝜃2 «small». Lasso: |𝜃1| + |𝜃2| ≤ 𝑐 (a diamond); notice that the regularized estimate has 𝜃1 = 0 and 𝜃2 «small». The unconstrained minimum of 𝐽(𝜽) is the estimate of 𝜽 without regularization; the point where the level curves touch the constraint region is the estimate with regularization.]
31 /32
Regularization and bias-variance tradeoff
The effects of regularization can be observed in terms of bias and variance:

• Regularization slightly increases the bias (because we obtain a simpler model) in order to considerably reduce the variance of the learning model

• Regularization leads to «smoother» hypotheses, reducing the risk of overfitting

• The regularization hyperparameter 𝜆reg must be chosen specifically for each type of regularizer. Usually, a procedure such as validation or cross-validation is used (a small cross-validation sketch follows below)

32 /32
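A minimal sketch of choosing 𝜆reg by cross-validation; scikit-learn's RidgeCV and the synthetic data are assumptions, not prescribed by the slides, and its alpha plays the role of 𝜆reg (up to the 1/𝑁 scaling of the data-fit term).

```python
# Minimal sketch: selecting the regularization strength by cross-validation (assumed library/data).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(np.pi * x[:, 0]) + 0.2 * rng.standard_normal(60)

# Expressive polynomial model; cross-validation picks how much to regularize it.
Phi = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)
model = RidgeCV(alphas=np.logspace(-4, 2, 25), cv=5).fit(Phi, y)
print("selected regularization strength:", model.alpha_)
```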
