05 - Overfitting
[Figure: two fitting scenarios. Left: a simple noisy target with 2nd-order and 10th-order polynomial fits. Right: a complex noiseless target with 2nd-order and 10th-order polynomial fits.]
Assignment 2 FAQ

Q: Does an update of the alphas/weight vector of the adatron occur regardless of whether an example is misclassified?
A: Yes!

Q: That brings up another question: when do we stop?
A: After a fixed number of iterations (use the same bound you use for the perceptron).

Q: The alpha coefficients of the adatron explode. What should I do?
A: Put an upper bound on the magnitude of the alphas.

Q: What's a good value for the learning rate?
A: That requires some experimentation.

Q: The adatron takes a long time to run.
A: The instructor suggests a speedup where the weight vector is not computed from scratch after each update; a sketch of this idea follows.
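Below is a minimal sketch of that speedup, assuming a linear (non-kernel) adatron with the usual alpha update α_i ← α_i + η(1 − y_i·w·x_i); the hyperparameter names and values are illustrative, not the assignment's required settings.

```python
import numpy as np

def adatron_train(X, y, eta=0.01, alpha_cap=100.0, max_iters=100):
    """Linear adatron sketch that keeps the weight vector up to date
    incrementally instead of rebuilding it after every alpha change.
    Assumes y holds +1/-1 labels; eta, alpha_cap, max_iters are illustrative."""
    n, d = X.shape
    alphas = np.zeros(n)
    w = np.zeros(d)                  # invariant: w == sum_i alphas[i] * y[i] * X[i]
    for _ in range(max_iters):       # fixed iteration bound, as for the perceptron
        for i in range(n):
            margin = y[i] * (X[i] @ w)
            # The update is applied to every example, misclassified or not.
            new_alpha = np.clip(alphas[i] + eta * (1.0 - margin), 0.0, alpha_cap)
            delta = new_alpha - alphas[i]
            if delta != 0.0:
                w += delta * y[i] * X[i]   # O(d) incremental update, no recomputation
                alphas[i] = new_alpha
    return w, alphas
```

Maintaining the invariant w = Σ_i α_i y_i x_i makes each update O(d) rather than O(n·d), which is where the speedup comes from.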
A Simple Learning Problem
The bias-variance decomposition
Consider a simple learning problem: two data points and two hypothesis sets.
H0 : h(x) = b
H1 : h(x) = ax + b
[Figure: a two-point data set fit by H0 (a constant) and by H1 (a line).]
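For concreteness, here is an added worked step (assuming standard least-squares fitting to a two-point data set (x_1, y_1), (x_2, y_2)); it is not from the slides.

```latex
% Least-squares fits to two points (x_1, y_1), (x_2, y_2):
% H_0: minimize (b - y_1)^2 + (b - y_2)^2  =>  the best constant is the mean.
% H_1: two parameters, two points  =>  the best line interpolates them exactly.
\[
  h_0(x) = \frac{y_1 + y_2}{2},
  \qquad
  h_1(x) = y_1 + \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1).
\]
```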
(See Section 2.3.)
Let's Repeat the Experiment Many Times
[Figure: the fits obtained from many different two-point data sets, for H0 and H1.]
The bias-variance decomposition
Let's consider an out-of-sample error based on a squared error measure:

E(g^(D)) = E_x[ (g^(D)(x) − f(x))² ]
To abstract away the dependence on a given data set, average over data sets and swap the order of the expectations:

E_D[ E(g^(D)) ] = E_D[ E_x[ (g^(D)(x) − f(x))² ] ]
               = E_x[ E_D[ (g^(D)(x) − f(x))² ] ]
!" $
To evaluate ED g (D) (x) − f (x)
# 2
ḡ(x)
! $
(D)
We consider the “averageḡ(x) hypothesis” ḡ(x) = E D g (x)
ḡ(x)
!" # 2
$ !" # 2
$
ED g (D) (x) − f (x) =ED g (D) (x) − ḡ(x) + ḡ(x)! −(D)
f (x) $
ḡ(x) = ED g D 1, D2, · · · , DK
(x)
K
!"
(D)
# 2 " 1 % #2
= ED g (x) − ḡ(x) ḡ(x) ≈ − f (x) g (Dk )(x)
+ ḡ(x)
D1, D2, · ·K· , DK
k=1
K $
"
+ 2 g (D) (x) − ḡ(x)
# 1
" % #
ḡ(x) ≈ ḡ(x) g−(Dfk(x)
)
(x)
K
k=1
!" # 2
$ " #2
(D)
= ED g (x) − ḡ(x) + ḡ(x) − f (x)
The bias-variance decomposition
!" # 2
$ !" # 2
$ !" # $
(D)! (D) 2
#2E$D g (x) −
" (D)f (x) = E D# $g ! (x)
" − ḡ(x) +# $ ḡ(x) − f (x)
2 2
(x) = ED g (x) − ḡ(x) % + &' ḡ(x)
(x)
− f (x)( % &'
(x)
(
% &' ( % &' (
(x) (x)
Finally, we get: ) * ! !" # 2
$$
ED E (g (D) ) = Ex ED g (D) (x) − f (x)
) * ! !" # 2
$$
ED E (g (D) ) = Ex ED g (D) (x) − f (x)
= Ex [ (x) + (x)]
= Ex [ (x) += (x)] +
AM
L = +
7
The tradeoff between bias and variance
!" #2 $ ! !" # 2
$$
= Ex ḡ(x) − f (x) = Ex ED g (D) (x) − ḡ(x)
H
f
H f
↓ H↑ ↑
⃝ AM
L
8
Let’s Repeat the Experiment Many Times
Example: the target is f(x) = sin(πx), and each data set D consists of two points. For each data set D, you get a different final hypothesis g^(D); averaging over many data sets gives ḡ(x).

[Figure: for H0 and for H1, the average hypothesis ḡ(x) plotted against sin(πx), with the spread of the individual g^(D) around it.]

H0: bias = 0.50, var = 0.25.    H1: bias = 0.21, var = 1.69.
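Here is a minimal sketch of how these numbers can be estimated numerically, following the ḡ(x) ≈ (1/K)·Σ_k g^(D_k)(x) approximation above. It is an added illustration: the uniform sampling of x on [−1, 1], the grid size, and K are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)        # target
x_grid = np.linspace(-1, 1, 1000)      # grid used to estimate E_x[.]
K, N = 10_000, 2                       # number of data sets, points per data set

def fit_h0(x, y):
    # Best constant under squared error: the mean of the y-values.
    return np.full_like(x_grid, y.mean())

def fit_h1(x, y):
    # Least-squares line; with two points it interpolates them.
    a, b = np.polyfit(x, y, 1)
    return a * x_grid + b

for name, fit in [("H0", fit_h0), ("H1", fit_h1)]:
    G = np.empty((K, x_grid.size))
    for k in range(K):
        xk = rng.uniform(-1, 1, N)     # a fresh data set D_k
        G[k] = fit(xk, f(xk))          # g^(D_k) evaluated on the grid
    g_bar = G.mean(axis=0)             # average hypothesis
    bias = np.mean((g_bar - f(x_grid)) ** 2)
    var = np.mean(G.var(axis=0))
    print(f"{name}: bias ~ {bias:.2f}, var ~ {var:.2f}")
```

With these assumptions the estimates should land near the values above; setting N = 5 gives the five-point comparison shown below.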
The bias-variance decomposition
Match the model complexity (learning power) to the amount of data, not to the complexity of the target function f.
Two data points vs. five data points:

[Figure: for each case, ḡ(x) for H0 and H1 plotted against the target sin(πx), with the spread of the individual fits.]

Two data points:   H0: bias = 0.50, var = 0.25, Eout = 0.75 ✓    H1: bias = 0.21, var = 1.69, Eout = 1.90
Five data points:  H0: bias = 0.50, var = 0.1,  Eout = 0.6       H1: bias = 0.21, var = 0.21, Eout = 0.42 ✓

With only two points the simpler H0 wins; with five points the more complex H1 wins.
Decomposing The Learning Curve
There are two views of out-of-sample error: VC analysis and bias-variance analysis.

Generalization is also good. One can obtain a regression version of d_vc. There are other bounds, for example: E[Eout(h)] = E[Ein(h)] + O(σ²·d/N).

[Figure: learning curves, expected error vs. number of data points N. VC analysis decomposes Eout into in-sample error plus generalization error; bias-variance analysis decomposes Eout into bias plus variance.]
What is overfitting?
Illustration of Overfitting on a Simple Example
Assume a quadratic target function and a sample of 5 noisy data points:
[Figure: the quadratic target, the 5 noisy data points (noise = measurement error), and a 4th-order polynomial fit that passes through all of them. See Chapter 4.]
Observations:
✓ Overfitting: a simple target fit with an excessively complex H. We are overfitting the data: Ein = 0 but Eout is large (Eout ≫ 0).
✓ The noise did us in!
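A minimal sketch of this experiment (an added illustration: the particular quadratic coefficients, the noise level, and the use of numpy.polyfit are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: 1.0 - 2.0 * x + 3.0 * x**2   # an illustrative quadratic target
sigma = 0.3                                 # assumed measurement-noise level

# A sample of 5 noisy data points.
x_train = rng.uniform(-1, 1, 5)
y_train = f(x_train) + sigma * rng.normal(size=5)

# A 4th-order polynomial has 5 coefficients, so it interpolates the 5 points: Ein = 0.
coeffs = np.polyfit(x_train, y_train, 4)

x_test = np.linspace(-1, 1, 1000)           # fresh points from the (noiseless) target
e_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
e_out = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)
print(f"Ein ~ {e_in:.2g}, Eout ~ {e_out:.2g}")   # Ein is essentially 0, Eout is typically much larger
```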
[Figure: error versus model complexity (VC dimension d_vc): the in-sample error keeps decreasing while the out-of-sample error eventually rises; the growing gap is overfitting.]
[Figure: a noisy target with 2nd-order and 10th-order polynomial fits, and the corresponding learning curves (Ein and Eout vs. the number of data points N) for the two models.]
Regularization
[Figure: the same fit without regularization and with regularization.]
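A minimal sketch of the idea (an added illustration: ridge-style weight decay on polynomial features is one common regularizer, and the target, noise level, polynomial degree, and λ values here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(np.pi * x)                  # illustrative target
x_train = rng.uniform(-1, 1, 6)
y_train = f(x_train) + 0.2 * rng.normal(size=6)

def poly_features(x, degree=5):
    # Columns 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)

def fit(x, y, lam):
    # Regularized least squares: w = (Z^T Z + lam * I)^(-1) Z^T y.
    Z = poly_features(x)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

x_test = np.linspace(-1, 1, 1000)
for lam in (0.0, 1e-2):                          # without vs. with regularization
    w = fit(x_train, y_train, lam)
    e_out = np.mean((poly_features(x_test) @ w - f(x_test)) ** 2)
    print(f"lambda = {lam}: Eout ~ {e_out:.3f}")
```

Even a small amount of weight decay tames the wild oscillations of the unregularized high-order fit, which is the kind of before/after contrast the figure shows.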