Excellent 05 - Overfitting

Overfitting in ML

Uploaded by Imran S

Approximation vs Generalization

LFD Sections 2.3, 4.1


Case Study: 2nd vs 10th Order Polynomial Fit

[Figure: two panels, each showing Data with a 2nd Order Fit and a 10th Order Fit; left: simple noisy target, right: complex noiseless target]
Assignment 2 FAQ

Q: Does an update of the alphas/weight vector of the adatron occur regardless of whether an example is misclassified?
v Yes!

Q: That brings up another question: when do we stop?
v After a fixed number of iterations (use the same bound you use for the perceptron).

Q: The alpha coefficients of the adatron explode. What should I do?
v Put an upper bound on the magnitude of the alphas.

Q: What's a good value for the learning rate?
v That requires some experimentation.

Q: The adatron takes a long time to run.
v The instructor suggests a speedup where the weight vector is not computed from scratch after each update.
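The FAQ answers can be gathered into code. Below is a minimal kernel-adatron-style sketch, shown with a linear kernel; the function name, learning rate, iteration count, and alpha bound are illustrative choices, not the assignment's required interface.

```python
import numpy as np

def adatron(X, y, lr=0.01, n_iters=100, alpha_max=100.0):
    """Adatron-style training sketch with a linear kernel.

    Unlike the perceptron, every example updates its alpha on every
    pass, not only the misclassified ones.
    """
    n = len(y)
    alphas = np.zeros(n)
    w = np.zeros(X.shape[1])              # maintained incrementally
    for _ in range(n_iters):              # stop after a fixed number of passes
        for i in range(n):
            margin = y[i] * (X[i] @ w)
            new_alpha = alphas[i] + lr * (1.0 - margin)
            new_alpha = np.clip(new_alpha, 0.0, alpha_max)   # bound the alphas
            # Speedup: update w incrementally instead of recomputing
            # it from scratch after each alpha change.
            w += (new_alpha - alphas[i]) * y[i] * X[i]
            alphas[i] = new_alpha
    return alphas, w
```

With a separable training set, the returned weight vector classifies the training points correctly after a modest number of passes.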

A Simple Learning Problem

The bias-variance decomposition

Consider a simple learning problem: two data points and two hypothesis sets:

    H0: h(x) = b
    H1: h(x) = ax + b

[Figure: two data points fit by a constant (H0) and by a line (H1)]

Section 2.3

Let's Repeat the Experiment Many Times

[Figure: many data sets, each giving a different fitted hypothesis]

For each data set D, you get a different g^(D).

So, for a fixed x, g^(D)(x) is a random value, depending on D.

(Slides: ⃝c Malik Magdon-Ismail, Approximation Versus Generalization)
The bias-variance decomposition

Let's consider an out-of-sample error based on a squared error measure:

    E(g^(D)) = E_x[ (g^(D)(x) − f(x))^2 ]

To abstract away the dependence on a given data set, take the expectation over data sets:

    E_D[ E(g^(D)) ] = E_D[ E_x[ (g^(D)(x) − f(x))^2 ] ]
                    = E_x[ E_D[ (g^(D)(x) − f(x))^2 ] ]

And let's focus on

    E_D[ (g^(D)(x) − f(x))^2 ]
!" # 2
$
(D)
The bias-variance
ED
decomposition g (x) − f (x)

!" $
To evaluate ED g (D) (x) − f (x)
# 2
ḡ(x)
! $
(D)
We consider the “averageḡ(x) hypothesis” ḡ(x) = E D g (x)
ḡ(x)
!" # 2
$ !" # 2
$
ED g (D) (x) − f (x) =ED g (D) (x) − ḡ(x) + ḡ(x)! −(D)
f (x) $
ḡ(x) = ED g D 1, D2, · · · , DK
(x)
K
!"
(D)
# 2 " 1 % #2
= ED g (x) − ḡ(x) ḡ(x) ≈ − f (x) g (Dk )(x)
+ ḡ(x)
D1, D2, · ·K· , DK
k=1
K $
"
+ 2 g (D) (x) − ḡ(x)
# 1
" % #
ḡ(x) ≈ ḡ(x) g−(Dfk(x)
)
(x)
K
k=1
!" # 2
$ " #2
(D)
= ED g (x) − ḡ(x) + ḡ(x) − f (x)
6

The bias-variance decomposition

    E_D[ (g^(D)(x) − f(x))^2 ] = E_D[ (g^(D)(x) − ḡ(x))^2 ] + (ḡ(x) − f(x))^2

The first term is var(x); the second is bias(x). Finally, we get:

    E_D[ E(g^(D)) ] = E_x[ E_D[ (g^(D)(x) − f(x))^2 ] ]
                    = E_x[ bias(x) + var(x) ]
                    = bias + var
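The decomposition can be checked numerically by Monte Carlo over data sets. A minimal sketch, assuming an illustrative toy setup (target f(x) = x², two-point data sets, H0 = best constant fit); the total error should match bias + var:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 20000                      # number of data sets D_1 .. D_K
xs = np.linspace(-1, 1, 201)   # grid for the E_x[.] average
f = lambda x: x**2             # toy target (illustrative)

# Each data set has 2 points; H0's best fit is the constant b = mean(y).
bs = np.array([f(rng.uniform(-1, 1, 2)).mean() for _ in range(K)])

g = np.tile(bs[:, None], (1, len(xs)))   # g^(D_k)(x), shape (K, len(xs))
g_bar = g.mean(axis=0)                   # average hypothesis ḡ(x)

bias = ((g_bar - f(xs)) ** 2).mean()             # E_x[(ḡ(x) - f(x))^2]
var = ((g - g_bar) ** 2).mean(axis=0).mean()     # E_x[E_D[(g(x) - ḡ(x))^2]]
total = ((g - f(xs)) ** 2).mean(axis=0).mean()   # E_D[E(g^(D))]

print(bias, var, abs(total - (bias + var)))      # last term ≈ 0 (the identity)
```

For this setup one can also compute the values in closed form (bias = 4/45, var = 2/45), which the simulation approaches.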
The tradeoff between bias and variance

    bias = E_x[ (ḡ(x) − f(x))^2 ]        var = E_x[ E_D[ (g^(D)(x) − ḡ(x))^2 ] ]

A small H sits far from f: high bias, low variance. A large H can reach f but its fits scatter: as the complexity of H goes up, bias goes down and variance goes up.
Let's Repeat the Experiment Many Times

Example: target f(x) = sin(πx), two data points per data set, H0 vs H1.

[Figure: the many fits and the average hypothesis ḡ(x) against sin(πx), for H0 and for H1]

    H0: bias = 0.50, var = 0.25
    H1: bias = 0.21, var = 1.69
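The bias/variance numbers quoted on this slide (H0: 0.50/0.25, H1: 0.21/1.69) can be reproduced approximately by simulation. A minimal sketch of the two-point sin(πx) experiment; K and the grid size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 20000                       # number of two-point data sets
xs = np.linspace(-1, 1, 201)
f = np.sin(np.pi * xs)          # target values on the grid

g0 = np.empty((K, len(xs)))     # H0: h(x) = b
g1 = np.empty((K, len(xs)))     # H1: h(x) = ax + b
for k in range(K):
    x = rng.uniform(-1, 1, 2)
    y = np.sin(np.pi * x)
    g0[k] = y.mean()                          # best constant fit to 2 points
    a = (y[1] - y[0]) / (x[1] - x[0])         # line through the 2 points
    g1[k] = a * (xs - x[0]) + y[0]

results = {}
for name, g in (("H0", g0), ("H1", g1)):
    g_bar = g.mean(axis=0)                            # average hypothesis ḡ
    bias = ((g_bar - f) ** 2).mean()
    var = ((g - g_bar) ** 2).mean(axis=0).mean()
    results[name] = (bias, var)
    print(name, "bias ~", round(bias, 2), " var ~", round(var, 2))
```

Note how the line fits the target better on average (lower bias) but swings wildly from data set to data set (much higher variance).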
The bias-variance decomposition

In learning there is a tradeoff:

v How well can learning approximate the target function?
v How close can we get to that approximation with a finite data set?
Match Model Complexity to the Amount of Data, Not to the Complexity of the Target Function

[Figure: fits and average hypotheses ḡ(x) for H0 and H1 against sin(πx), with 2 data points and with 5 data points]

                 2 data points            5 data points
                 H0         H1            H0         H1
    bias         0.50       0.21          0.50       0.21
    var          0.25       1.69          0.1        0.21
    Eout         0.75 !     1.90          0.6        0.42 !

With 2 data points the simpler H0 wins; with 5 data points the richer H1 wins.
Decomposing the Learning Curve

Two views of out-of-sample error:

    VC Analysis:            Eout = Ein + generalization error
    Bias-Variance Analysis: Eout = bias + variance

[Figure (from Magdon-Ismail, Linear Classification and Regression): expected error vs number of data points N; left panel splits Eout into Ein plus generalization error, right panel splits it into bias plus variance]

Generalization is also good: one can obtain a regression version of dvc. There are other bounds, for example, for linear regression on noisy data with noise variance σ²:

    E[Eout(h)] = E[Ein(h)] + O( σ²(d + 1)/N )

v VC view: pick (H, A) that can fit the data and has a chance to generalize after seeing the data.
v Bias-variance view: pick H that can approximate f (low bias) and not behave wildly (low variance).

Either way, the choice of hypothesis set needs to strike a balance between approximating f on the training data and generalizing on new data.
What is overfitting

Illustration of Overfitting on a Simple Example

Assume a quadratic target function and a sample of 5 noisy data points (noise = measurement error):

[Figure: quadratic target curve with 5 noisy data points]

Chapter 4


What is overfitting

Illustration of Overfitting on a Simple Example

Let's fit this data with a degree 4 polynomial:

[Figure: the 5 noisy data points, the quadratic target, and the degree-4 fit passing through every point]

Observations:
ü We are overfitting the data: Ein = 0, yet Eout is large.
ü The noise did us in! (why?)

Overfitting: fitting the data more than is warranted. Here, a simple target is fit with an excessively complex H, so Ein is small and yet Eout ≫ 0.
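This experiment is easy to reproduce: five noisy samples of a quadratic target, a degree-4 fit (five coefficients, so it interpolates all five points and Ein ≈ 0), and Eout measured against the clean target. A minimal sketch; the target and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x**2                    # quadratic target (illustrative)

x = rng.uniform(-1, 1, 5)
y = f(x) + rng.normal(0, 0.3, 5)      # 5 noisy data points

# Degree-4 polynomial: 5 coefficients, so it interpolates all 5 points.
coeffs = np.polyfit(x, y, deg=4)
E_in = np.mean((np.polyval(coeffs, x) - y) ** 2)

xs = np.linspace(-1, 1, 1001)
E_out = np.mean((np.polyval(coeffs, xs) - f(xs)) ** 2)

print(E_in, E_out)   # E_in ≈ 0 (to numerical precision), E_out much larger
```

The fit chases the noise exactly, which is why zero in-sample error comes with a large out-of-sample error.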


What is overfitting

Overfitting is Not Just Bad Generalization

[Figure: error vs model complexity (VC dimension dvc); the in-sample error keeps decreasing while the out-of-sample error eventually rises — that rising gap is overfitting]

Overfitting: fitting the data more than is warranted. In other words, using a model that is more complex than is necessary: going for lower and lower Ein results in higher and higher Eout.
What is overfitting

Case Study: 2nd vs 10th Order Polynomial Fit

Let's look at another example, with two target functions: a 10th order f with noise, and a 50th order f without noise.

[Figure: data sampled from each target]

We compare two models (special cases of linear models with a feature transform):

    H2:  2nd order polynomial fit
    H10: 10th order polynomial fit

Which model do you pick for which problem and why?
What is overfitting

Case Study: 2nd vs 10th Order Polynomial Fit

Let's compare fitting the data with 2nd degree and 10th degree polynomials:

[Figure: both fits on the simple noisy target (left) and on the complex noiseless target (right)]

    simple noisy target        2nd Order    10th Order
    Ein                        0.050        0.034
    Eout                       0.127        9.00

    complex noiseless target   2nd Order    10th Order
    Ein                        0.029
    Eout                       0.120

Although the data is generated with a 10th degree polynomial, the quadratic fit is better!
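A minimal sketch of the noisy-target half of this case study, averaged over many trials. The target family (random 10th-order Legendre combinations), sample size, and noise level are illustrative, so the numbers differ from the slide's, but the pattern is the same:

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.linspace(-1, 1, 501)
T = 300
E = {2: [0.0, 0.0], 10: [0.0, 0.0]}   # degree -> [avg E_in, avg E_out]

for _ in range(T):
    # A 10th-order target: a random combination of Legendre polynomials.
    target = np.polynomial.legendre.Legendre(rng.normal(0, 1, 11))
    x = rng.uniform(-1, 1, 15)
    y = target(x) + rng.normal(0, 0.5, 15)     # 15 noisy samples
    for deg in (2, 10):
        c = np.polyfit(x, y, deg)
        E[deg][0] += np.mean((np.polyval(c, x) - y) ** 2) / T
        E[deg][1] += np.mean((np.polyval(c, xs) - target(xs)) ** 2) / T

print("H2 :", E[2])    # modest E_in, modest E_out
print("H10:", E[10])   # smaller E_in, (much) larger E_out
```

H10 always achieves lower in-sample error (it contains H2), yet on average its out-of-sample error is far worse: it fits the noise.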
Which hypothesis?

When is H2 Better than H10?

The choice of hypothesis space depends on the number of available data points:

[Figure: learning curves (expected Ein and Eout vs number of data points N) for H2 and for H10]

v High complexity hypothesis set: better chance of approximating the target function.
v Low complexity hypothesis set: better chance of getting low out-of-sample error.

Overfitting: Eout(H10) > Eout(H2)
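The crossover behind these learning curves can be estimated by sweeping N. A minimal sketch, again with an illustrative random 10th-order target family: for small N the simpler H2 has lower expected Eout, while with enough data the flexible H10 wins:

```python
import numpy as np

rng = np.random.default_rng(4)
xs = np.linspace(-1, 1, 501)

def avg_eout(deg, N, trials=200):
    """Average out-of-sample error of a degree-`deg` fit on N noisy points."""
    total = 0.0
    for _ in range(trials):
        # Random 10th-order target, N noisy samples.
        target = np.polynomial.legendre.Legendre(rng.normal(0, 1, 11))
        x = rng.uniform(-1, 1, N)
        y = target(x) + rng.normal(0, 0.5, N)
        c = np.polyfit(x, y, deg)
        total += np.mean((np.polyval(c, xs) - target(xs)) ** 2)
    return total / trials

for N in (15, 200):
    print(N, avg_eout(2, N), avg_eout(10, N))
# Small N: H2 has lower E_out; large N: H10 has lower E_out.
```

This is exactly the slide's point: which hypothesis set is "better" depends on how much data you have.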
Factors that lead to overfitting

v Small number of data points
v Amount of noise
v Complexity of the target function
v Complexity of the hypothesis set
Regularization

The cure for overfitting: regularization.

[Figure: the same fit without regularization (wild) and with regularization (smooth)]
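The cure can be sketched with L2 (weight-decay) regularization: a 10th-order fit on a few noisy points, with and without the penalty. The target, sample size, noise level, and λ here are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(np.pi * x)       # illustrative target
xs = np.linspace(-1, 1, 1001)

def ridge_polyfit(x, y, deg, lam):
    """Degree-`deg` polynomial fit minimizing ||Zw - y||^2 + lam * ||w||^2."""
    Z = np.vander(x, deg + 1)         # polynomial features (decreasing powers)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(deg + 1), Z.T @ y)

def avg_eout(lam, trials=300):
    """Average out-of-sample error of a 10th-order fit on 12 noisy points."""
    total = 0.0
    for _ in range(trials):
        x = rng.uniform(-1, 1, 12)
        y = f(x) + rng.normal(0, 0.3, 12)
        w = ridge_polyfit(x, y, 10, lam)
        total += np.mean((np.polyval(w, xs) - f(xs)) ** 2)
    return total / trials

print("no regularization:  ", avg_eout(0.0))
print("with regularization:", avg_eout(1e-2))   # much smaller E_out
```

Even a small penalty tames the wild coefficients of the unregularized fit, trading a little bias for a large reduction in variance.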
