Excellent 05 - Overfitting

Overfitting in ML

Uploaded by Imran S

Approximation vs Generalization

LFD Sections 2.3, 4.1


Case Study: 2nd vs 10th Order Polynomial Fit

[Figure: two panels, each showing Data with a 2nd Order Fit and a 10th Order Fit; left: simple noisy target, right: complex noiseless target]
Assignment 2 FAQ

Q: Does an update of the alphas/weight vector of the adatron occur regardless of whether an example is misclassified?
v Yes!

Q: That brings up another question: when do we stop?
v After a fixed number of iterations (use the same bound you use for the perceptron).

Q: The alpha coefficients of the adatron explode. What should I do?
v Put an upper bound on the magnitude of the alphas.

Q: What's a good value for the learning rate?
v That requires some experimentation.

Q: The adatron takes a long time to run.
v The instructor suggests a speedup where the weight vector is not computed from scratch after each update.
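The FAQ answers can be gathered into code. Below is a minimal kernel-adatron-style sketch, shown with a linear kernel; the function name, learning rate, iteration count, and alpha bound are illustrative choices, not the assignment's required interface.

```python
import numpy as np

def adatron(X, y, lr=0.01, n_iters=100, alpha_max=100.0):
    """Adatron-style training sketch with a linear kernel.

    Unlike the perceptron, every example updates its alpha on every
    pass, not only the misclassified ones.
    """
    n = len(y)
    alphas = np.zeros(n)
    w = np.zeros(X.shape[1])              # maintained incrementally
    for _ in range(n_iters):              # stop after a fixed number of passes
        for i in range(n):
            margin = y[i] * (X[i] @ w)
            new_alpha = alphas[i] + lr * (1.0 - margin)
            new_alpha = np.clip(new_alpha, 0.0, alpha_max)   # bound the alphas
            # Speedup: update w incrementally instead of recomputing
            # it from scratch after each alpha change.
            w += (new_alpha - alphas[i]) * y[i] * X[i]
            alphas[i] = new_alpha
    return alphas, w
```

With a separable training set, the returned weight vector classifies the training points correctly after a modest number of passes.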

A Simple Learning Problem

The bias-variance decomposition

Consider a simple learning problem: two data points and two hypothesis sets:

    H0: h(x) = b
    H1: h(x) = ax + b

[Figure: two data points fit by a constant (H0) and by a line (H1)]

Section 2.3

Let's Repeat the Experiment Many Times

[Figure: many data sets, each giving a different fitted hypothesis]

For each data set D, you get a different g^(D).

So, for a fixed x, g^(D)(x) is a random value, depending on D.

(Slides: ⃝c Malik Magdon-Ismail, Approximation Versus Generalization)
The bias-variance decomposition

Let's consider an out-of-sample error based on a squared error measure:

    E(g^(D)) = E_x[ (g^(D)(x) − f(x))^2 ]

To abstract away the dependence on a given data set, take the expectation over data sets:

    E_D[ E(g^(D)) ] = E_D[ E_x[ (g^(D)(x) − f(x))^2 ] ]
                    = E_x[ E_D[ (g^(D)(x) − f(x))^2 ] ]

And let's focus on

    E_D[ (g^(D)(x) − f(x))^2 ]
!" # 2
$
(D)
The bias-variance
ED
decomposition g (x) − f (x)

!" $
To evaluate ED g (D) (x) − f (x)
# 2
ḡ(x)
! $
(D)
We consider the “averageḡ(x) hypothesis” ḡ(x) = E D g (x)
ḡ(x)
!" # 2
$ !" # 2
$
ED g (D) (x) − f (x) =ED g (D) (x) − ḡ(x) + ḡ(x)! −(D)
f (x) $
ḡ(x) = ED g D 1, D2, · · · , DK
(x)
K
!"
(D)
# 2 " 1 % #2
= ED g (x) − ḡ(x) ḡ(x) ≈ − f (x) g (Dk )(x)
+ ḡ(x)
D1, D2, · ·K· , DK
k=1
K $
"
+ 2 g (D) (x) − ḡ(x)
# 1
" % #
ḡ(x) ≈ ḡ(x) g−(Dfk(x)
)
(x)
K
k=1
!" # 2
$ " #2
(D)
= ED g (x) − ḡ(x) + ḡ(x) − f (x)
6

The bias-variance decomposition

    E_D[ (g^(D)(x) − f(x))^2 ] = E_D[ (g^(D)(x) − ḡ(x))^2 ] + (ḡ(x) − f(x))^2

The first term is var(x); the second is bias(x). Finally, we get:

    E_D[ E(g^(D)) ] = E_x[ E_D[ (g^(D)(x) − f(x))^2 ] ]
                    = E_x[ bias(x) + var(x) ]
                    = bias + var
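The decomposition can be checked numerically by Monte Carlo over data sets. A minimal sketch, assuming an illustrative toy setup (target f(x) = x², two-point data sets, H0 = best constant fit); the total error should match bias + var:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 20000                      # number of data sets D_1 .. D_K
xs = np.linspace(-1, 1, 201)   # grid for the E_x[.] average
f = lambda x: x**2             # toy target (illustrative)

# Each data set has 2 points; H0's best fit is the constant b = mean(y).
bs = np.array([f(rng.uniform(-1, 1, 2)).mean() for _ in range(K)])

g = np.tile(bs[:, None], (1, len(xs)))   # g^(D_k)(x), shape (K, len(xs))
g_bar = g.mean(axis=0)                   # average hypothesis ḡ(x)

bias = ((g_bar - f(xs)) ** 2).mean()             # E_x[(ḡ(x) - f(x))^2]
var = ((g - g_bar) ** 2).mean(axis=0).mean()     # E_x[E_D[(g(x) - ḡ(x))^2]]
total = ((g - f(xs)) ** 2).mean(axis=0).mean()   # E_D[E(g^(D))]

print(bias, var, abs(total - (bias + var)))      # last term ≈ 0 (the identity)
```

For this setup one can also compute the values in closed form (bias = 4/45, var = 2/45), which the simulation approaches.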
The tradeoff between bias and variance

    bias = E_x[ (ḡ(x) − f(x))^2 ]        var = E_x[ E_D[ (g^(D)(x) − ḡ(x))^2 ] ]

A small H sits far from f: high bias, low variance. A large H can reach f but its fits scatter: as the complexity of H goes up, bias goes down and variance goes up.
Let's Repeat the Experiment Many Times

Example: target f(x) = sin(πx), two data points per data set, H0 vs H1.

[Figure: the many fits and the average hypothesis ḡ(x) against sin(πx), for H0 and for H1]

    H0: bias = 0.50, var = 0.25
    H1: bias = 0.21, var = 1.69
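The bias/variance numbers quoted on this slide (H0: 0.50/0.25, H1: 0.21/1.69) can be reproduced approximately by simulation. A minimal sketch of the two-point sin(πx) experiment; K and the grid size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 20000                       # number of two-point data sets
xs = np.linspace(-1, 1, 201)
f = np.sin(np.pi * xs)          # target values on the grid

g0 = np.empty((K, len(xs)))     # H0: h(x) = b
g1 = np.empty((K, len(xs)))     # H1: h(x) = ax + b
for k in range(K):
    x = rng.uniform(-1, 1, 2)
    y = np.sin(np.pi * x)
    g0[k] = y.mean()                          # best constant fit to 2 points
    a = (y[1] - y[0]) / (x[1] - x[0])         # line through the 2 points
    g1[k] = a * (xs - x[0]) + y[0]

results = {}
for name, g in (("H0", g0), ("H1", g1)):
    g_bar = g.mean(axis=0)                            # average hypothesis ḡ
    bias = ((g_bar - f) ** 2).mean()
    var = ((g - g_bar) ** 2).mean(axis=0).mean()
    results[name] = (bias, var)
    print(name, "bias ~", round(bias, 2), " var ~", round(var, 2))
```

Note how the line fits the target better on average (lower bias) but swings wildly from data set to data set (much higher variance).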
The bias-variance decomposition

In learning there is a tradeoff:

v How well can learning approximate the target function?
v How close can we get to that approximation with a finite data set?
Match Model Complexity to the Amount of Data, Not to the Complexity of the Target Function

[Figure: fits and average hypotheses ḡ(x) for H0 and H1 against sin(πx), with 2 data points and with 5 data points]

                 2 data points            5 data points
                 H0         H1            H0         H1
    bias         0.50       0.21          0.50       0.21
    var          0.25       1.69          0.1        0.21
    Eout         0.75 !     1.90          0.6        0.42 !

With 2 data points the simpler H0 wins; with 5 data points the richer H1 wins.
Decomposing the Learning Curve

Two views of out-of-sample error:

    VC Analysis:            Eout = Ein + generalization error
    Bias-Variance Analysis: Eout = bias + variance

[Figure (from Magdon-Ismail, Linear Classification and Regression): expected error vs number of data points N; left panel splits Eout into Ein plus generalization error, right panel splits it into bias plus variance]

Generalization is also good: one can obtain a regression version of dvc. There are other bounds, for example, for linear regression on noisy data with noise variance σ²:

    E[Eout(h)] = E[Ein(h)] + O( σ²(d + 1)/N )

v VC view: pick (H, A) that can fit the data and has a chance to generalize after seeing the data.
v Bias-variance view: pick H that can approximate f (low bias) and not behave wildly (low variance).

Either way, the choice of hypothesis set needs to strike a balance between approximating f on the training data and generalizing on new data.
What is overfitting

Illustration of Overfitting on a Simple Example

Assume a quadratic target function and a sample of 5 noisy data points (noise = measurement error):

[Figure: quadratic target curve with 5 noisy data points]

Chapter 4


What is overfitting

Illustration of Overfitting on a Simple Example

Let's fit this data with a degree 4 polynomial:

[Figure: the 5 noisy data points, the quadratic target, and the degree-4 fit passing through every point]

Observations:
ü We are overfitting the data: Ein = 0, yet Eout is large.
ü The noise did us in! (why?)

Overfitting: fitting the data more than is warranted. Here, a simple target is fit with an excessively complex H, so Ein is small and yet Eout ≫ 0.
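This experiment is easy to reproduce: five noisy samples of a quadratic target, a degree-4 fit (five coefficients, so it interpolates all five points and Ein ≈ 0), and Eout measured against the clean target. A minimal sketch; the target and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x**2                    # quadratic target (illustrative)

x = rng.uniform(-1, 1, 5)
y = f(x) + rng.normal(0, 0.3, 5)      # 5 noisy data points

# Degree-4 polynomial: 5 coefficients, so it interpolates all 5 points.
coeffs = np.polyfit(x, y, deg=4)
E_in = np.mean((np.polyval(coeffs, x) - y) ** 2)

xs = np.linspace(-1, 1, 1001)
E_out = np.mean((np.polyval(coeffs, xs) - f(xs)) ** 2)

print(E_in, E_out)   # E_in ≈ 0 (to numerical precision), E_out much larger
```

The fit chases the noise exactly, which is why zero in-sample error comes with a large out-of-sample error.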


What is overfitting

Overfitting is Not Just Bad Generalization

[Figure: error vs model complexity (VC dimension dvc); the in-sample error keeps decreasing while the out-of-sample error eventually rises — that rising gap is overfitting]

Overfitting: fitting the data more than is warranted. In other words, using a model that is more complex than is necessary: going for lower and lower Ein results in higher and higher Eout.
What is overfitting

Case Study: 2nd vs 10th Order Polynomial Fit

Let's look at another example, with two target functions: a 10th order f with noise, and a 50th order f without noise.

[Figure: data sampled from each target]

We compare two models (special cases of linear models with a feature transform):

    H2:  2nd order polynomial fit
    H10: 10th order polynomial fit

Which model do you pick for which problem and why?
What is overfitting

Case Study: 2nd vs 10th Order Polynomial Fit

Let's compare fitting the data with 2nd degree and 10th degree polynomials:

[Figure: both fits on the simple noisy target (left) and on the complex noiseless target (right)]

    simple noisy target        2nd Order    10th Order
    Ein                        0.050        0.034
    Eout                       0.127        9.00

    complex noiseless target   2nd Order    10th Order
    Ein                        0.029
    Eout                       0.120

Although the data is generated with a 10th degree polynomial, the quadratic fit is better!
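A minimal sketch of the noisy-target half of this case study, averaged over many trials. The target family (random 10th-order Legendre combinations), sample size, and noise level are illustrative, so the numbers differ from the slide's, but the pattern is the same:

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.linspace(-1, 1, 501)
T = 300
E = {2: [0.0, 0.0], 10: [0.0, 0.0]}   # degree -> [avg E_in, avg E_out]

for _ in range(T):
    # A 10th-order target: a random combination of Legendre polynomials.
    target = np.polynomial.legendre.Legendre(rng.normal(0, 1, 11))
    x = rng.uniform(-1, 1, 15)
    y = target(x) + rng.normal(0, 0.5, 15)     # 15 noisy samples
    for deg in (2, 10):
        c = np.polyfit(x, y, deg)
        E[deg][0] += np.mean((np.polyval(c, x) - y) ** 2) / T
        E[deg][1] += np.mean((np.polyval(c, xs) - target(xs)) ** 2) / T

print("H2 :", E[2])    # modest E_in, modest E_out
print("H10:", E[10])   # smaller E_in, (much) larger E_out
```

H10 always achieves lower in-sample error (it contains H2), yet on average its out-of-sample error is far worse: it fits the noise.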
Which hypothesis?

When is H2 Better than H10?

The choice of hypothesis space depends on the number of available data points:

[Figure: learning curves (expected Ein and Eout vs number of data points N) for H2 and for H10]

v High complexity hypothesis set: better chance of approximating the target function.
v Low complexity hypothesis set: better chance of getting low out-of-sample error.

Overfitting: Eout(H10) > Eout(H2)
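The crossover behind these learning curves can be estimated by sweeping N. A minimal sketch, again with an illustrative random 10th-order target family: for small N the simpler H2 has lower expected Eout, while with enough data the flexible H10 wins:

```python
import numpy as np

rng = np.random.default_rng(4)
xs = np.linspace(-1, 1, 501)

def avg_eout(deg, N, trials=200):
    """Average out-of-sample error of a degree-`deg` fit on N noisy points."""
    total = 0.0
    for _ in range(trials):
        # Random 10th-order target, N noisy samples.
        target = np.polynomial.legendre.Legendre(rng.normal(0, 1, 11))
        x = rng.uniform(-1, 1, N)
        y = target(x) + rng.normal(0, 0.5, N)
        c = np.polyfit(x, y, deg)
        total += np.mean((np.polyval(c, xs) - target(xs)) ** 2)
    return total / trials

for N in (15, 200):
    print(N, avg_eout(2, N), avg_eout(10, N))
# Small N: H2 has lower E_out; large N: H10 has lower E_out.
```

This is exactly the slide's point: which hypothesis set is "better" depends on how much data you have.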
Factors that lead to overfitting

v Small number of data points
v Amount of noise
v Complexity of the target function
v Complexity of the hypothesis set
Regularization

The cure for overfitting: regularization.

[Figure: the same fit without regularization (wild) and with regularization (smooth)]
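The cure can be sketched with L2 (weight-decay) regularization: a 10th-order fit on a few noisy points, with and without the penalty. The target, sample size, noise level, and λ here are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(np.pi * x)       # illustrative target
xs = np.linspace(-1, 1, 1001)

def ridge_polyfit(x, y, deg, lam):
    """Degree-`deg` polynomial fit minimizing ||Zw - y||^2 + lam * ||w||^2."""
    Z = np.vander(x, deg + 1)         # polynomial features (decreasing powers)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(deg + 1), Z.T @ y)

def avg_eout(lam, trials=300):
    """Average out-of-sample error of a 10th-order fit on 12 noisy points."""
    total = 0.0
    for _ in range(trials):
        x = rng.uniform(-1, 1, 12)
        y = f(x) + rng.normal(0, 0.3, 12)
        w = ridge_polyfit(x, y, 10, lam)
        total += np.mean((np.polyval(w, xs) - f(xs)) ** 2)
    return total / trials

print("no regularization:  ", avg_eout(0.0))
print("with regularization:", avg_eout(1e-2))   # much smaller E_out
```

Even a small penalty tames the wild coefficients of the unregularized fit, trading a little bias for a large reduction in variance.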
