
WiSe 2023/24

Deep Learning 1

Lecture 5: Overfitting & Robustness (1)


Recap Lectures 1-4

Lectures 1-4:
▶ With flexible neural network architectures, powerful optimization
techniques, and fast machines, we have means of producing functions
that can accurately fit large amounts of highly nonlinear data.

Question:
▶ Do the learned neural networks generalize to new data, e.g. will they be
able to correctly classify new images?

The data on which we train the model also matter!

1/33
A Bit of Theory
So far, we have only considered the error we use to optimize the model (a.k.a.
the training error):

E_{\text{train}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big(f(x_i, \theta) - t_i\big)^2

In practice, what we really care about is the true error:

E_{\text{true}}(\theta) = \int \big(f(x, \theta) - t\big)^2 \, p(x, t) \, dx \, dt

where p(x, t) is the true probability distribution from which the data is coming
at test time. The true error is much harder to minimize, because we don't
know p(x, t).
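To make the distinction concrete, here is a minimal numpy sketch (not from the lecture) that contrasts the training error with a Monte-Carlo estimate of the true error, assuming a known toy data-generating process p(x, t) so that fresh samples can be drawn at will:

```python
# Minimal sketch (illustrative assumptions): a flexible polynomial model fit on
# few points, its training error, and a Monte-Carlo estimate of its true error.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # hypothetical data-generating process p(x, t): t = sin(x) + noise
    x = rng.uniform(-3, 3, size=n)
    t = np.sin(x) + 0.1 * rng.normal(size=n)
    return x, t

x_train, t_train = sample(20)                         # small training set (finite data)
theta = np.polyfit(x_train, t_train, deg=9)           # flexible model, prone to overfitting
f = lambda x: np.polyval(theta, x)

E_train = np.mean((f(x_train) - t_train) ** 2)        # training error (equation above)
x_test, t_test = sample(100_000)                      # large fresh sample from p(x, t)
E_true_hat = np.mean((f(x_test) - t_test) ** 2)       # Monte-Carlo estimate of the true error

print(f"E_train = {E_train:.4f}, E_true (estimate) = {E_true_hat:.4f}")
```

On such a toy example, the training error is typically much lower than the estimated true error, which is exactly the gap discussed in the following slides.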

2/33
Characterizing Datasets

Factors that make the available dataset D and the true distribution p(x, t)
diverge:

▶ The fact that the dataset is composed of few data points drawn randomly
from the underlying data distribution (finite data).
▶ The fact that the dataset may overrepresent certain parts of the
underlying distribution, e.g. people of a certain age group (dataset bias).
▶ The fact that the dataset may have been generated from an underlying
distribution p_old(x, t) that is now obsolete (distribution shift).

3/33
Practical Examples

Data types and their properties:

▶ Image/text data: thousands of images per class, aggregated from many
sources. Some image compositions may be overrepresented (dataset bias /
spurious correlations).

▶ Sensor data: potentially very large datasets, but sensors may become
decalibrated over time (distribution shift).

▶ Games (Go, chess, etc.): an infinite number of game states can be
produced through computer-computer plays, but since master-level plays
are more expensive to generate, simple games may be overrepresented
(dataset bias).

4/33
Practical Examples (cont.)

Data types and their properties (cont.):

▶ Simulated data (e.g. physics, car driving): theoretically infinite, but
practically limited due to the cost of running simulations. In practice,
we only generate few instances (finite data).

▶ Medical data: limited number of patients due to the rarity of a
particular disease, or regulatory constraints (finite data, dataset bias).
Acquisition devices may evolve over time (distribution shift).

▶ Social data: large amounts of data, but only recent data is relevant.
Risk of not capturing the most recent trends (distribution shift).

5/33
Outline

The Problem of Finite Data


▶ The problem of overfitting

▶ Mitigating overfitting

Dataset Bias 1: Imbalance Between Subgroups


▶ Data from Multiple Domains

▶ Building a 'Domain'-Invariant Classifier

Dataset Bias 2: Spurious Correlations


▶ Examples of Spurious Correlations

▶ Detecting and Mitigating Spurious Correlations

6/33
Part 1: The Problem of Limited Data

7/33
Finite Data and Overfitting

[Figure: two 2D scatter plots comparing the theoretical optimum decision boundary with a model learned in practice]

▶ Assume each data point x ∈ R^d and its label y ∈ {0, 1} is generated
i.i.d. from two Gaussian distributions.

▶ With limited data, one class or target value may be locally predominant
'by chance'. Learning these spurious variations is known as overfitting.

▶ An overfitted model predicts the training data perfectly but works
poorly on new data.

8/33
Model Error and Model Complexity

William of Ockham (1287-1347)


Linked model complexity to how suitable the model is for
explaining phenomena: "Entia non sunt multiplicanda praeter
necessitatem" ("entities should not be multiplied beyond necessity").

Vladimir Vapnik
Showed a formal relation between model complexity
(measured as the VC-dimension) and the error of a classifier.

9/33
Complexity and Generalization Error

Generalization Bound [Vapnik]


Let h denote the VC-dimension of the model class F. The difference between the
true error E_true(θ) and the training error E_train(θ) is upper-bounded (with
probability 1 − η) as:

E_{\text{true}}(\theta) - E_{\text{train}}(\theta) \;\leq\; \sqrt{\frac{h\left(\log\frac{2N}{h} + 1\right) - \log(\eta/4)}{N}}

The VC-dimension h defines the complexity (or flexibility) of the class of
considered models.

Factors that reduce the gap between test error Etrue (θ) and training error
Etrain (θ):
▶ Lowering the VC-dimension h.
▶ Increasing the number of data points N.
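A small numeric illustration of how the bound behaves (the formula is taken as written above; the values of h, N and the confidence parameter η are arbitrary choices for illustration):

```python
# Hedged numeric illustration of the VC generalization bound above.
import numpy as np

def vc_gap_bound(h, N, eta=0.05):
    # upper bound on E_true(theta) - E_train(theta), holding with probability 1 - eta
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

for N in (1_000, 10_000, 100_000):
    print(N, round(vc_gap_bound(h=100, N=N), 3))
# the bound shrinks as N grows, and grows with the VC-dimension h
```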

10/33
Characterizing Complexity (One-Layer Networks)

Interpretation:
▶ Model complexity can be restrained if the input data is low-dimensional
or if the model builds a large margin (i.e. has low sensitivity).

Question:
▶ Can we build similar concepts for deep neural networks?

11/33
Reducing Complexity via Low Dimensionality

[Figure: hard-coded feature extraction producing a few features a_1, a_2, followed by learned parameters]

Approach:
▶ First, generate a low-dimensional representation by extracting a few
features from the high-dimensional input data (either hand-designed, or
automatically generated using methods such as PCA).

▶ Then, learn a neural network on the resulting low-dimensional data.
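A minimal sketch of this two-step approach (an assumed workflow using scikit-learn, not the lecture's exact pipeline; the data and labels below are synthetic placeholders):

```python
# Sketch: project high-dimensional inputs onto a few PCA components, then
# train a small neural network on the low-dimensional representation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20000))            # e.g. noisy gene-expression-like data (d > 20000)
y = (X[:, :10].sum(axis=1) > 0).astype(int)  # hypothetical labels driven by a few directions

model = make_pipeline(
    PCA(n_components=10),                    # low-dimensional representation (a_1, ..., a_k)
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```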

12/33
Reducing Complexity via Low Dimensionality

Observations:
▶ Building low-dimensional representations can be useful when predicting from
noisy high-dimensional data such as gene expression in biology
(d > 20000).
▶ On other tasks such as image recognition, low-dimensional
representations can also delete class-relevant information (e.g. edges).

13/33
Reducing Complexity by Reducing Sensitivity

[Figure: network reading pixels a_1, ..., a_d directly, with learned parameters + regularization]

Weight Decay [4]:


▶ Include in the objective a term that makes the weights tend to zero if
they are not necessary for the prediction task.

E(\theta) = \sum_{i=1}^{N} \big(f(x_i, \theta) - t_i\big)^2 + \lambda \|\theta\|^2

▶ The higher the parameter λ, the more the model's sensitivity to
variations in the input domain is reduced.
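A minimal PyTorch sketch of weight decay (assumed setup; the optimizer's weight_decay argument implements the λ∥θ∥² penalty through its gradient, up to the usual factor-of-2 convention, and the explicit penalty is written out for comparison):

```python
# Weight decay sketch: via the optimizer, or as an explicit penalty term.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# equivalent explicit form of the objective (lambda * ||theta||^2 added to the loss)
def loss_with_decay(pred, target, lam=1e-4):
    mse = ((pred - target) ** 2).sum()
    penalty = sum((p ** 2).sum() for p in model.parameters())
    return mse + lam * penalty
```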

14/33
Reducing Complexity by Reducing Sensitivity

Dropout [7]:
▶ Alternative to weight decay, which consists of adding artificial
multiplicative noise to the input and intermediate neurons, and training
the model subject to that noise.

▶ This is achieved by inserting a dropout layer in the neural network,
which multiplies each input (or activation) a_j by a random variable
b_j ∼ Bernoulli(p), i.e. z_j = b_j · a_j.
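A minimal PyTorch sketch of a dropout layer (assumed architecture). Note the convention: on the slide, b_j ∼ Bernoulli(p) is a retention mask, whereas torch.nn.Dropout(p) takes the probability of dropping a unit:

```python
# Dropout sketch: multiplicative Bernoulli noise on intermediate activations.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),               # multiplies activations by a random Bernoulli mask
    nn.Linear(256, 10),
)

model.train()                        # masks are sampled during training ...
out_train = model(torch.randn(8, 784))
model.eval()                         # ... and the layer acts as the identity at test time
out_test = model(torch.randn(8, 784))
```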

15/33
Reducing Complexity by Reducing Sensitivity
[Figure: effect of dropout on performance on the MNIST dataset]

Note:
▶ On neural networks for image data, dropout tends to yield superior
performance compared to simple weight decay.

16/33
Choosing a Model with Appropriate Complexity

Holdout Validation:
▶ Train multiple neural network models with different regularization
parameters (e.g. λ), and retain the one that performs best on
some validation set disjoint from the training data.

Problem:
▶ Training a model for each parameter λ can be costly. One would
potentially benefit from training a bigger model only once.

17/33
Accelerating Model Selection
Early Stopping Technique [6]:
▶ View the iterative procedure for training a neural network as generating
a sequence of increasingly complex models θ_1, ..., θ_T.
▶ Monitor the validation error throughout training and keep a snapshot
of the model when it had the lowest validation error.

Early stopping:
  θ⋆ ← None
  E⋆ ← ∞
  for t = 1 ... T do
      Run a few SGD steps, and collect the current parameter θ_t
      if E_val(θ_t) < E⋆ then
          θ⋆ ← θ_t
          E⋆ ← E_val(θ_t)
      end if
  end for
Advantage:
▶ No need to train several models (e.g. with different λ's). Only one
training run is needed!
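A minimal runnable sketch of the early-stopping loop above, using a toy regression task (the synthetic data and model are illustrative assumptions, not from the lecture):

```python
# Early stopping sketch: keep a snapshot of the parameters with lowest E_val.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 10)
t = X[:, :1] + 0.1 * torch.randn(200, 1)
X_tr, t_tr, X_val, t_val = X[:150], t[:150], X[150:], t[150:]

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)

best_state, best_val = None, float("inf")
for step in range(1, 501):
    opt.zero_grad()
    loss = ((model(X_tr) - t_tr) ** 2).mean()
    loss.backward()
    opt.step()
    if step % 10 == 0:                                    # "run a few SGD steps" per check
        with torch.no_grad():
            val = ((model(X_val) - t_val) ** 2).mean().item()
        if val < best_val:                                # snapshot theta* with lowest E_val
            best_val = val
            best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)                         # restore the best snapshot
```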

18/33
Very Large Models

▶ When the model becomes very large, there is an interesting 'double
descent' [2] phenomenon that occurs in the context of neural
networks, where the generalization error starts to go down again as
model complexity increases.

▶ This can be interpreted as some implicit averaging between the many
components of the model (interpolating regime).

▶ Increasing model size to a great extent may thus contribute, without
further regularization techniques, to achieving a lower test set error.

19/33
Part 2: Imbalances between Subgroups

20/33
Data from Multiple Domains
▶ The data might come from different
domains (P, Q).
▶ Domains may e.g. correspond to
different acquisition devices, or
different ways they are
configured/calibrated.

▶ One of the domains may be
overrepresented in the available
data, or the ML model may learn
better on a given domain at the
expense of another domain.

Image source: Aubreville et al. Quantifying the Scanner-Induced Domain Gap in Mitosis Detection. CoRR abs/2103.16515 (2021)

21/33
Addressing Multiple Domains

Simple Approach (one-layer networks):


▶ Denoting by P and Q the two domains, regularize the ML model
f(x) = w^⊤x so that both domains generate the same responses on average
at the output:

\min_w \; E(w) + \lambda \cdot \big(\mathbb{E}_P[w^\top x] - \mathbb{E}_Q[w^\top x]\big)^2

(a.k.a. moment matching). The approach can be enhanced to include
higher-order moments such as the variance, etc.

▶ In practice, more powerful tools exist to constrain distributions more
finely in representation space, such as the Wasserstein distance.
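A minimal PyTorch sketch of the moment-matching regularizer above (assumed setup: x_P, y_P are a labeled mini-batch from domain P and x_Q an unlabeled mini-batch from domain Q; training_step is a hypothetical helper):

```python
# Moment matching sketch for a one-layer model w^T x.
import torch
import torch.nn as nn

w = nn.Linear(50, 1, bias=False)          # one-layer model w^T x
opt = torch.optim.SGD(w.parameters(), lr=0.01)
lam = 1.0

def training_step(x_P, y_P, x_Q):
    opt.zero_grad()
    task_loss = ((w(x_P) - y_P) ** 2).mean()        # E(w) on the labeled domain P
    gap = w(x_P).mean() - w(x_Q).mean()             # E_P[w^T x] - E_Q[w^T x]
    loss = task_loss + lam * gap ** 2               # moment-matching penalty
    loss.backward()
    opt.step()
    return loss.item()

x_P, y_P = torch.randn(32, 50), torch.randn(32, 1)
x_Q = torch.randn(32, 50) + 0.5                     # shifted second domain
print(training_step(x_P, y_P, x_Q))
```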

22/33
Addressing Multiple Domains

More Advanced Approach [1]:


▶ Learn an auxiliary neural network (domain critic φ) that tries to classify
the two domains. Learn the parameters of the feature extractor in such a
way that the domain critic φ is no longer able to distinguish between
the two domains.
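A rough, simplified sketch of the domain-critic idea (a generic adversarial formulation, not the exact method of [1]; the main task loss of the classifier is omitted for brevity):

```python
# Adversarial domain-critic sketch: the critic learns to tell the domains
# apart from the features, while the feature extractor learns to fool it.
import torch
import torch.nn as nn

features = nn.Sequential(nn.Linear(50, 32), nn.ReLU())
critic   = nn.Sequential(nn.Linear(32, 1))                 # domain critic phi
bce = nn.BCEWithLogitsLoss()
opt_c = torch.optim.SGD(critic.parameters(), lr=0.01)
opt_f = torch.optim.SGD(features.parameters(), lr=0.01)

def adversarial_step(x_P, x_Q):
    # 1) update the critic to distinguish domain P (label 0) from domain Q (label 1)
    z_P, z_Q = features(x_P).detach(), features(x_Q).detach()
    loss_c = bce(critic(z_P), torch.zeros(len(x_P), 1)) + \
             bce(critic(z_Q), torch.ones(len(x_Q), 1))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # 2) update the feature extractor so the critic can no longer tell Q from P
    loss_f = bce(critic(features(x_Q)), torch.zeros(len(x_Q), 1))
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()

adversarial_step(torch.randn(32, 50), torch.randn(32, 50) + 0.5)
```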

23/33
Addressing Multiple Domains

Example:
▶ Example of one particular class of the Office-Caltech dataset and the
different domains from which the data is taken.

▶ Models equipped with a domain critic, although losing performance on
some domains, achieve better worst-case accuracy.

24/33
Part 3: Spurious Correlations

25/33
Spurious Correlations

▶ Artefact of the distribution of available data (P) where one or several
task-irrelevant input variables are spuriously correlated with the
task-relevant variables.

▶ Spurious correlations are very common in practical datasets, e.g. a
copyright tag occurring only on images of a certain class;
histopathological images of a certain class having been acquired with a
particular device and, as a result, having a different color profile, etc.

26/33
Spurious Correlations and the Clever Hans Effect

[Figure: available data (P) vs. new data (Q)]

▶ An ML classifier is technically able to classify the available data (P)
using either the correct features or the spurious ones. The ML model
doesn't know a priori which feature (the correct one or the spurious
one) to use. A model that bases its decision strategy on the spurious
feature is "right for the wrong reasons" and is also known as a Clever
Hans classifier.
▶ A Clever Hans classifier may fail to function well on the new data (Q)
where the spurious correlation no longer exists, e.g. horses without
copyright tags, or images of a different class with copyright tags.

27/33
Spurious Correlations and the Clever Hans Effect

▶ Test set accuracy doesn't give much information on whether the model
bases its decision on the correct features or exploits the spurious
correlation.

▶ Only an inspection of the decision structure by the user (e.g. using
LRP heatmaps) enables the detection of the flaw in the model [5].

28/33
Generating LRP heatmaps

[Figure: LRP applied to a Fisher Vector pipeline for an image of 'bicycle': Local Features → GMM fitting → Fisher Vector → Normalization + Linear SVM (Hellinger's kernel SVM); relevance is redistributed layer by layer (relevance conservation, redistribution formula) to produce a heatmap]

▶ Explanations are produced using a layer-wise redistribution process
from the output of the model to the input features.

▶ Each layer can have its own redistribution scheme. The redistribution
rules are designed in a way that maximizes explanation quality.
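As an illustration, here is a hedged numpy sketch of one redistribution step for a linear layer, using an LRP-ε style rule; the actual rules applied per layer in the pipeline above may differ:

```python
# LRP-epsilon style redistribution for a single linear layer (sketch).
import numpy as np

def lrp_linear(a, W, R_out, eps=1e-6):
    """Redistribute the relevance R_out at the layer output to its inputs.
    a: input activations (d_in,), W: weights (d_in, d_out), R_out: (d_out,)."""
    z = a @ W                                      # pre-activations z_k = sum_j a_j w_jk
    s = R_out / (z + eps * np.sign(z))             # stabilized ratio per output neuron
    R_in = a * (W @ s)                             # R_j = a_j * sum_k w_jk s_k
    return R_in                                    # conservation: R_in.sum() ~ R_out.sum()

a = np.array([1.0, 0.5, -0.2])
W = np.random.default_rng(0).normal(size=(3, 2))
R_out = np.array([1.0, 0.0])
print(lrp_linear(a, W, R_out), lrp_linear(a, W, R_out).sum())
```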

29/33
Mitigating Reliance on Spurious Correlations

Feature Selection / Unlearning:


▶ Retrain without the feature containing the artefact (e.g. crop images
to avoid copyright tags).

▶ Actively look in the model for units (e.g. subsets of neurons) that
respond to the artifact and remove such units from the model (e.g.
[3]).

Dataset Design:
▶ Manually remove the artifact (e.g. copyright tags) from the classes
that contain it, or alternatively, inject the artifact in every class (so
that it cannot be used anymore for discriminating between classes).

▶ Stratify the dataset in a way that the spurious features are present in
all classes in similar proportions.

Learning with Explanation Constraints:

▶ Include an extra term in the objective that penalizes decision strategies
that are based on unwanted features (previously revealed by an
explanation technique); a sketch follows below.
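One possible instantiation of such a constraint is an input-gradient penalty on regions flagged as unwanted (a sketch in this spirit, not necessarily the specific method meant in the lecture). Here `mask` is a hypothetical binary mask marking, e.g., the copyright-tag area identified by a prior explanation step:

```python
# Explanation-constraint sketch: penalize input gradients on unwanted regions.
import torch
import torch.nn as nn

def constrained_loss(model, x, y, mask, lam=10.0):
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = nn.functional.cross_entropy(logits, y)
    # gradient of the task loss w.r.t. the inputs, used as a crude explanation proxy
    grads, = torch.autograd.grad(task_loss, x, create_graph=True)
    penalty = ((grads * mask) ** 2).sum()          # discourage reliance on masked regions
    return task_loss + lam * penalty

# usage (hypothetical): loss = constrained_loss(model, images, labels, tag_mask)
```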

30/33
Summary

31/33
Summary

▶ While deep learning models can in principle fit very complex prediction
functions, the way they perform in practice is in large part determined
by the amount and quality of the data.

▶ Limited data may subject the ML model to overfitting and lead to
lower performance on new data. Various methods exist to prevent
overfitting (e.g. generating a low-dimensional input vector, or building a
model with limited sensitivity to the input, such as via weight decay or
dropout).

▶ Another problem is dataset bias, where certain parts of the distribution
are over/under-represented, or plagued with spurious correlations.
Reliance of the model on spurious correlations can lead to low test
performance, but this can be detected by Explainable AI approaches. A
number of methods exist to reduce reliance on spurious correlations.

32/33
References

[1] L. Andéol, Y. Kawakami, Y. Wada, T. Kanamori, K.-R. Müller, and G. Montavon.
Learning domain invariant representations by joint Wasserstein distance minimization.
Neural Networks, 167:233-243, 2023.

[2] M. Belkin, D. Hsu, S. Ma, and S. Mandal.
Reconciling modern machine-learning practice and the classical bias-variance trade-off.
PNAS, 116(32):15849-15854, 2019.

[3] P. Chormai, J. Herrmann, K. Müller, and G. Montavon.
Disentangled explanations of neural network predictions by finding relevant subspaces.
CoRR, abs/2212.14855, 2022.

[4] A. Krogh and J. A. Hertz.
A simple weight decay can improve generalization.
In NIPS, pages 950-957, 1991.

[5] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller.
Unmasking Clever Hans predictors and assessing what machines really learn.
Nature Communications, 10(1), Mar. 2019.

[6] L. Prechelt.
Early stopping - but when?
In Neural Networks: Tricks of the Trade, volume 1524 of LNCS, pages 55-69. Springer, 1996.

[7] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.
Dropout: a simple way to prevent neural networks from overfitting.
J. Mach. Learn. Res., 15(1):1929-1958, 2014.

33/33
