DL-Lec 2 - Bias-Variance-Tradeoff
Outline
1. Bias-variance tradeoff
2. Learning curves
3. Overfitting
4. Regularization
Approximation vs. generalization
[Figure: an example where 𝐸out is huge]

The final aim is to have a small 𝐸out: a good approximation of 𝑓 out-of-sample

The ideal case would be to have a hypothesis space ℋ that contains only the target function 𝑓
Bias and variance of a machine learning model

$$ \mathbf{bias}^2 = \big(\bar{g}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2 \qquad\qquad \mathbf{variance} = \mathbb{E}_{\mathcal{D}}\!\Big[\big(g^{(\mathcal{D})}(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2\Big] $$

• $g^{(\mathcal{D})}$: function learnt on the dataset 𝒟
• $\bar{g}$: average function in ℋ

[Figure: the bias is the distance between $\bar{g}$ and 𝑓; the variance is the spread of the learnt hypotheses around $\bar{g}$ within ℋ]

VERY SMALL model set: since there is only one hypothesis, both the average function $\bar{g}$ and the final hypothesis $g^{(\mathcal{D})}$ will be the same, for any dataset. Thus, variance = 0. The bias will depend solely on how well this single hypothesis approximates the target 𝑓, and unless we are extremely lucky, we expect a large bias

VERY LARGE model set: the target function is in ℋ. Different data sets will lead to different hypotheses that agree with 𝑓 on the data set, and are spread around 𝑓 (the red region in the figure). Thus, bias ≈ 0 because $\bar{g}$ is likely to be close to 𝑓. The variance is large (heuristically represented by the size of the red region)
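As a complement, here is a minimal numerical sketch of these definitions (the target 𝑓(𝑥) = sin(π𝑥), the noise level and the polynomial model are illustrative assumptions, not from the slides): bias² and variance are estimated by learning the hypothesis on many independently sampled datasets.

```python
import numpy as np

# Monte Carlo estimate of bias^2 and variance for polynomial regression.
# The target f, the noise level and the model degree are illustrative assumptions.
rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)            # assumed target function
N, degree, n_datasets = 5, 2, 2000         # points per dataset, model complexity, repetitions
x_eval = np.linspace(-1, 1, 200)           # points where bias/variance are evaluated

preds = np.empty((n_datasets, x_eval.size))
for k in range(n_datasets):
    x = rng.uniform(-1, 1, N)
    y = f(x) + rng.normal(0, 0.1, N)       # one noisy dataset D
    coeffs = np.polyfit(x, y, degree)      # g^(D): hypothesis learnt on D
    preds[k] = np.polyval(coeffs, x_eval)

g_bar = preds.mean(axis=0)                 # average function g_bar(x)
bias2 = np.mean((g_bar - f(x_eval)) ** 2)  # (g_bar - f)^2, averaged over x
variance = np.mean((preds - g_bar) ** 2)   # E_D[(g^(D) - g_bar)^2], averaged over x
print(f"bias^2 ~ {bias2:.3f}   variance ~ {variance:.3f}")
```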
Bias and variance of a machine learning model
Heuristic rule
How many points 𝑁 are required to ensure a good chance of generalizing?
[Figure: out-of-sample error vs. model complexity (number of parameters / VC dimension). General principle: the out-of-sample error is minimized at an intermediate complexity 𝑑∗]
Learning curves
Learning curves are a graphical tool to understand if a learning model suffers from bias
or variance problems
The idea is to plot, as the number of data points 𝑁 used to train the model varies, the in-sample error 𝐸in and the (expected) out-of-sample error 𝐸out

In practice, the curves are computed from one dataset, or by dividing it into several parts and taking the «average curve» resulting from the various sub-datasets
Learning curves

[Figure: two learning-curve plots of the expected 𝐸in and 𝐸out vs. the number of data points 𝑁 used to train the model. Left: bias² "small", variance "big". Right: bias² "big", variance "small"]
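A minimal sketch of the procedure described above (the synthetic data and the linear model are illustrative assumptions): the two curves are computed by training on growing subsets and averaging over several random sub-datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Learning curves for a linear model on assumed synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (500, 1))
y = np.sin(np.pi * X[:, 0]) + rng.normal(0, 0.2, 500)   # assumed noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for N in [5, 10, 20, 50, 100, 200, len(X_train)]:
    e_in, e_out = [], []
    for _ in range(20):                                  # average over random sub-datasets
        idx = rng.choice(len(X_train), N, replace=False)
        model = LinearRegression().fit(X_train[idx], y_train[idx])
        e_in.append(mean_squared_error(y_train[idx], model.predict(X_train[idx])))
        e_out.append(mean_squared_error(y_test, model.predict(X_test)))
    print(f"N={N:3d}  E_in={np.mean(e_in):.3f}  E_out={np.mean(e_out):.3f}")
```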
Overfitting
We encountered the overfitting phenomenon when we talked about the approximation-generalization tradeoff
• 𝑁 = 5 points (no noise): 𝐸in = 0 and 𝐸out = 0
• 𝑁 = 5 noisy points: [figures of the fit on the noisy data]; see the sketch below
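A minimal sketch of this example (the sinusoidal target and the degree-4 polynomial are illustrative assumptions, the slides do not specify them): with 5 noisy points and 5 parameters the fit interpolates the data, so 𝐸in ≈ 0 while 𝐸out can be very large.

```python
import numpy as np

# Fitting a degree-4 polynomial to N = 5 noisy points: E_in goes to ~0, E_out blows up.
rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)                   # assumed target
x = rng.uniform(-1, 1, 5)
y = f(x) + rng.normal(0, 0.2, 5)                  # 5 noisy points

coeffs = np.polyfit(x, y, 4)                      # 5 parameters -> exact interpolation
x_test = np.linspace(-1, 1, 500)

e_in = np.mean((np.polyval(coeffs, x) - y) ** 2)
e_out = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)
print(f"E_in ~ {e_in:.2e}   E_out ~ {e_out:.2e}")  # E_in ~ 0, E_out typically large
```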
Example: a student who has to learn some concepts

To understand the phenomenon of overfitting in an intuitive way, let's consider the following analogy

The teacher of a course provides solved exercises in order to teach how to solve a problem. The exam exercises must necessarily be different from those provided in class, otherwise the teacher is not able to understand whether the student has only memorized how to solve the exercises or has really learned the concepts

In the first case (memorizing) the student has not really learned: when faced with a similar (but different) exercise, she/he will not be able to solve it. The student has overfitted the exercises seen in class, without having generalized the concepts and therefore the solution method
Overfitting vs. model complexity

• We talk of overfitting when decreasing 𝐸in leads to increasing 𝐸out
• Overfitting leads to bad generalization
• A model can exhibit bad generalization even if it does not overfit

[Figure: 𝐸out and the in-sample error vs. model complexity. Low complexity: high bias, low variance; high complexity: low bias, high variance. The overfitting region is at high complexity, where 𝐸in keeps decreasing while 𝐸out grows]
Overfitting vs. model complexity
[Figure: three fits of the same data: underfit, OK, overfit]
A cure for overfitting
Regularization is the first line of defense against overfitting
We have seen that more complex models are more prone to overfitting. This is because
they are more «powerful» (expressive) and therefore can also adapt to the noise
Simple models show less variance due to their limited expressiveness. The reduction in the variance of the model is often greater than the increase in its bias, so that, overall, the expected error (bias² + variance + noise variance) decreases

However, if we stick to simple models only, we may not get a satisfactory approximation of the target function 𝑓
Regularization

Idea: in addition to minimizing the «data-fit» cost 𝐸in(𝜽) ≡ 𝐽(𝜽), minimize also a measure Ω(𝜽) of the model complexity, i.e. minimize the augmented error

$$ E_{\text{aug}}(\boldsymbol{\theta}) = E_{\text{in}}(\boldsymbol{\theta}) + \lambda_{\text{reg}}\,\Omega(\boldsymbol{\theta}) $$

The term 𝜆reg (hyperparameter) weights the importance of minimizing 𝐸in ≡ 𝐽(𝜽) with respect to minimizing Ω(𝜽)
Example: 𝐿2 regularization

The 𝐿2 regularization penalizes the sum of the squared model’s coefficients 𝜽 ∈ ℝ^{𝑑×1}

$$ E_{\text{aug}}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)} - h(\boldsymbol{x}^{(i)};\boldsymbol{\theta})\big)^2 + \lambda_{\text{reg}}\sum_{j=0}^{d-1}\theta_j^2 $$

• The intercept 𝜃0 sometimes is not penalized. In this case, 𝑗 will start from 1
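For a hypothesis that is linear in the parameters, h(𝒙; 𝜽) = Φ(𝒙)𝜽 (an assumption; the slides keep h generic), the minimizer of this augmented error has a closed form. A minimal sketch:

```python
import numpy as np

# Closed-form minimizer of the L2-regularized (ridge) augmented error,
# assuming a linear-in-the-parameters hypothesis h(x; theta) = Phi(x) @ theta.
def fit_ridge(Phi, y, lam_reg, penalize_intercept=False):
    """Return theta minimizing (1/N)*||y - Phi theta||^2 + lam_reg * sum_j theta_j^2."""
    N, d = Phi.shape
    I = np.eye(d)
    if not penalize_intercept:
        I[0, 0] = 0.0                        # assume column 0 of Phi is the intercept
    # Setting the gradient to zero gives (Phi^T Phi / N + lam_reg I) theta = Phi^T y / N
    return np.linalg.solve(Phi.T @ Phi / N + lam_reg * I, Phi.T @ y / N)

# Usage on assumed synthetic data with a degree-4 polynomial feature map
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, 20)
Phi = np.vander(x, 5, increasing=True)       # columns: 1, x, x^2, x^3, x^4
theta = fit_ridge(Phi, y, lam_reg=0.1)
print(theta)
```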
Example: 𝐿2 regularization

The same penalty can be interpreted as solving the constrained problem

$$ \begin{aligned} \text{minimize}\quad & E_{\text{in}}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)} - h(\boldsymbol{x}^{(i)};\boldsymbol{\theta})\big)^2 \\ \text{subject to}\quad & \boldsymbol{\theta}^{\top}\boldsymbol{\theta} \le c \end{aligned} $$

• With this interpretation, we are explicitly constraining the coefficients 𝜽 not to take big values
Effect of the regularization hyperparameter 𝜆

[Figure: fits obtained for increasing regularization strength, 𝜆reg,1 = 0 < 𝜆reg,2 < 𝜆reg,3 < 𝜆reg,4, going from overfit (𝜆reg = 0) to underfit (𝜆reg large)]

If we regularize too much, we will learn the simplest possible function, which is a horizontal line (constant) with intercept 𝜃0
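A minimal sketch of this sweep (the synthetic data, the degree-9 polynomial and scikit-learn's Ridge are illustrative assumptions), showing how 𝐸in and 𝐸out change as 𝜆reg grows:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Sweeping lambda_reg from ~0 (overfit) to very large (underfit, nearly constant fit).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (15, 1))
y = np.sin(np.pi * x[:, 0]) + rng.normal(0, 0.2, 15)
x_test = np.linspace(-1, 1, 300).reshape(-1, 1)
y_test = np.sin(np.pi * x_test[:, 0])

for lam in [1e-9, 1e-4, 1e-2, 1.0, 1e3]:           # 1e-9 plays the role of lambda_reg = 0
    model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=lam))
    model.fit(x, y)
    print(f"lambda_reg={lam:g}  "
          f"E_in={mean_squared_error(y, model.predict(x)):.3f}  "
          f"E_out={mean_squared_error(y_test, model.predict(x_test)):.3f}")
```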
Intuition about the importance of 𝐸aug w.r.t. 𝐸in

Minimizing 𝐸aug instead of 𝐸in leads to a better model (that is, a model with better generalization capabilities and so with lower 𝐸out)

[Figure: two learning-curve plots (expected 𝐸in and 𝐸out vs. number of data points 𝑁 used to train the model), illustrating the role of the term Ω]
Intuition about the importance of 𝐸aug w.r.t. 𝐸in

From the previous graph, as well as through bias and variance, we can interpret 𝐸out as the sum of two contributions:

$$ E_{\text{out}}(\boldsymbol{\theta}) = E_{\text{in}}(\boldsymbol{\theta}) + \tilde{\Omega}(\boldsymbol{\theta}), \qquad \tilde{\Omega} = E_{\text{out}} - E_{\text{in}} $$

Recalling the definition of 𝐸aug we have

$$ E_{\text{aug}}(\boldsymbol{\theta}) = E_{\text{in}}(\boldsymbol{\theta}) + \lambda\,\Omega(\boldsymbol{\theta}) $$

[Figure: learning curves with the gap between 𝐸out and 𝐸in marked as Ω̃]
Intuition about the importance of 𝐸aug w.r.t. 𝐸in

• In this way, minimizing the augmented error amounts to (approximately) minimizing the out-of-sample error directly, instead of the in-sample one
• The regularization helps to estimate the quantity Ω(𝜽) that, added to 𝐸in, gives 𝐸aug, an estimate of 𝐸out
Choosing the regularization term

There are different types of regularizers. The most used are:

• 𝐿2 regularization, also called Ridge regularization: $\Omega(\boldsymbol{\theta}) = \sum_{j=0}^{d-1}\theta_j^2$
• 𝐿1 regularization, also called Lasso regularization: $\Omega(\boldsymbol{\theta}) = \sum_{j=0}^{d-1}|\theta_j|$
Choosing the regularization term

[Figure: level curves of a convex cost function 𝐽(𝜽) ≡ 𝐸in(𝜽) in the (𝜃1, 𝜃2) plane, together with the constraint regions 𝜃1² + 𝜃2² ≤ 𝑐 (Ridge) and |𝜃1| + |𝜃2| ≤ 𝑐 (Lasso). The estimate of 𝜽 without regularization is the minimum of 𝐽; the estimate of 𝜽 with regularization is where the level curves first touch the constraint region. Notice that with Ridge 𝜃1 and 𝜃2 are «small», while with Lasso 𝜃1 = 0 and 𝜃2 is «small»]
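A minimal sketch of the practical consequence of this geometry (the synthetic data and the regularization strengths are illustrative assumptions): Lasso tends to set some coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Ridge shrinks all coefficients; Lasso drives some of them exactly to zero (sparsity).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_theta = np.array([2.0, 0.0, -1.0, 0.0, 0.5])       # only 3 features matter
y = X @ true_theta + rng.normal(0, 0.1, 100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("Ridge coefficients:", np.round(ridge.coef_, 3))  # all non-zero, shrunk
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # some exactly 0
```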
Regularization and bias-variance tradeoff
The effects of regularization can be observed in terms of bias and variance:
• Regularization slightly increases the bias (because we obtain a simpler model) in order to considerably reduce the variance of the learning model
• The regularization hyperparameter 𝜆reg must be chosen specifically for each type of regularizer. Usually, a procedure such as validation or cross-validation is used (see the sketch below)
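A minimal sketch of this selection step (the synthetic data and the candidate grid are illustrative assumptions): choosing 𝜆reg for Ridge by 5-fold cross-validation with scikit-learn.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Choosing lambda_reg by 5-fold cross-validation over a logarithmic grid.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)      # assumed synthetic target

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-4, 3, 15)},        # candidate lambda_reg values
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best lambda_reg:", search.best_params_["alpha"])
```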