
WiSe 2023/24

Deep Learning 1

Lecture 5: Overfitting & Robustness (1)


Recap Lectures 1-4

Lectures 1-4:
▶ With flexible neural network architectures, powerful optimization
techniques, and fast machines, we have means of producing functions
that can accurately fit large amounts of highly nonlinear data.

Question:
▶ Do the learned neural networks generalize to new data, e.g. will they be
able to correctly classify new images?

The data on which we train the model also matter!

1/33
A Bit of Theory
So far, we have only considered the error we use to optimize the model (a.k.a.
the training error):

E_{\text{train}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big(f(x_i, \theta) - t_i\big)^2

In practice, what we really care about is the true error:

E_{\text{true}}(\theta) = \int \big(f(x, \theta) - t\big)^2 \, p(x, t) \, dx \, dt

where p(x, t) is the true probability distribution from which the data is coming
at test time. The true error is much harder to minimize, because we don't
know p(x, t).
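To make the distinction concrete, here is a minimal numpy sketch (not from the lecture) that contrasts the training error with a Monte-Carlo estimate of the true error, assuming a known toy data-generating process p(x, t) so that fresh samples can be drawn at will:

```python
# Minimal sketch (illustrative assumptions): a flexible polynomial model fit on
# few points, its training error, and a Monte-Carlo estimate of its true error.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # hypothetical data-generating process p(x, t): t = sin(x) + noise
    x = rng.uniform(-3, 3, size=n)
    t = np.sin(x) + 0.1 * rng.normal(size=n)
    return x, t

x_train, t_train = sample(20)                         # small training set (finite data)
theta = np.polyfit(x_train, t_train, deg=9)           # flexible model, prone to overfitting
f = lambda x: np.polyval(theta, x)

E_train = np.mean((f(x_train) - t_train) ** 2)        # training error (equation above)
x_test, t_test = sample(100_000)                      # large fresh sample from p(x, t)
E_true_hat = np.mean((f(x_test) - t_test) ** 2)       # Monte-Carlo estimate of the true error

print(f"E_train = {E_train:.4f}, E_true (estimate) = {E_true_hat:.4f}")
```

On such a toy example, the training error is typically much lower than the estimated true error, which is exactly the gap discussed in the following slides.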

2/33
Characterizing Datasets

Factors that make the available dataset D and the true distribution p(x, t)
diverge:

▶ The fact that the dataset is composed of few data points drawn randomly
from the underlying data distribution (finite data).
▶ The fact that the dataset may overrepresent certain parts of the
underlying distribution, e.g. people of a certain age group (dataset bias).
▶ The fact that the dataset may have been generated from an underlying
distribution p_old(x, t) that is now obsolete (distribution shift).

3/33
Practical Examples

Data types and their properties:

▶ Image/text data: thousands of images per class, aggregated from many
sources. Some image compositions may be overrepresented (dataset bias /
spurious correlations).

▶ Sensor data: potentially very large datasets, but sensors may become
decalibrated over time (distribution shift).

▶ Games (Go, chess, etc.): an infinite number of game states can be
produced through computer-computer plays, but since master-level plays
are more expensive to generate, simple games may be overrepresented
(dataset bias).

4/33
Practical Examples (cont.)

Data types and their properties (cont.):

▶ Simulated data (e.g. physics, car driving): theoretically infinite, but
practically limited due to the cost of running simulations. In practice,
we only generate few instances (finite data).

▶ Medical data: limited number of patients due to the rarity of a
particular disease, or regulatory constraints (finite data, dataset bias).
Acquisition devices may evolve over time (distribution shift).

▶ Social data: large amounts of data, but only recent data is relevant.
Risk of not capturing the most recent trends (distribution shift).

5/33
Outline

The Problem of Finite Data


▶ The problem of overfitting

▶ Mitigating overfitting

Dataset Bias 1: Imbalance Between Subgroups


▶ Data from Multiple Domains

▶ Building a 'Domain'-Invariant Classifier

Dataset Bias 2: Spurious Correlations


▶ Examples of Spurious Correlations

▶ Detecting and Mitigating Spurious Correlations

6/33
Part 1: The Problem of Limited Data

7/33
Finite Data and Overfitting

[Figure: two 2D scatter plots comparing the theoretical optimum decision boundary with a model learned in practice]

▶ Assume each data point x ∈ R^d and its label y ∈ {0, 1} is generated
i.i.d. from two Gaussian distributions.

▶ With limited data, one class or target value may be locally predominant
'by chance'. Learning these spurious variations is known as overfitting.

▶ An overfitted model predicts the training data perfectly but works
poorly on new data.

8/33
Model Error and Model Complexity

William of Ockham (1287-1347)


Linked model complexity to how suitable the model is for
explaining phenomena: "Entia non sunt multiplicanda praeter
necessitatem" ("entities should not be multiplied beyond necessity").

Vladimir Vapnik
Showed a formal relation between model complexity
(measured as the VC-dimension) and the error of a classifier.

9/33
Complexity and Generalization Error

Generalization Bound [Vapnik]


Let h denote the VC-dimension of the model class F. The difference between the
true error E_true(θ) and the training error E_train(θ) is upper-bounded (with
probability 1 − η) as:

E_{\text{true}}(\theta) - E_{\text{train}}(\theta) \;\leq\; \sqrt{\frac{h\left(\log\frac{2N}{h} + 1\right) - \log(\eta/4)}{N}}

The VC-dimension h defines the complexity (or flexibility) of the class of
considered models.

Factors that reduce the gap between test error Etrue (θ) and training error
Etrain (θ):
▶ Lowering the VC-dimension h.
▶ Increasing the number of data points N.
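A small numeric illustration of how the bound behaves (the formula is taken as written above; the values of h, N and the confidence parameter η are arbitrary choices for illustration):

```python
# Hedged numeric illustration of the VC generalization bound above.
import numpy as np

def vc_gap_bound(h, N, eta=0.05):
    # upper bound on E_true(theta) - E_train(theta), holding with probability 1 - eta
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

for N in (1_000, 10_000, 100_000):
    print(N, round(vc_gap_bound(h=100, N=N), 3))
# the bound shrinks as N grows, and grows with the VC-dimension h
```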

10/33
Characterizing Complexity (One-Layer Networks)

Interpretation:
▶ Model complexity can be restrained if the input data is low-dimensional
or if the model builds a large margin (i.e. has low sensitivity).

Question:
▶ Can we build similar concepts for deep neural networks?

11/33
Reducing Complexity via Low Dimensionality

[Figure: hard-coded feature extraction producing a few features a_1, a_2, followed by learned parameters]

Approach:
▶ First, generate a low-dimensional representation by extracting a few
features from the high-dimensional input data (either hand-designed, or
automatically generated using methods such as PCA).

▶ Then, learn a neural network on the resulting low-dimensional data.
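A minimal sketch of this two-step approach (an assumed workflow using scikit-learn, not the lecture's exact pipeline; the data and labels below are synthetic placeholders):

```python
# Sketch: project high-dimensional inputs onto a few PCA components, then
# train a small neural network on the low-dimensional representation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20000))            # e.g. noisy gene-expression-like data (d > 20000)
y = (X[:, :10].sum(axis=1) > 0).astype(int)  # hypothetical labels driven by a few directions

model = make_pipeline(
    PCA(n_components=10),                    # low-dimensional representation (a_1, ..., a_k)
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```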

12/33
Reducing Complexity via Low Dimensionality

Observations:
▶ Building low-dimensional representations can be useful when predicting from
noisy high-dimensional data such as gene expression in biology
(d > 20000).
▶ On other tasks such as image recognition, low-dimensional
representations can also delete class-relevant information (e.g. edges).

13/33
Reducing Complexity by Reducing Sensitivity

[Figure: network reading pixels a_1, ..., a_d directly, with learned parameters + regularization]

Weight Decay [4]:


▶ Include in the objective a term that makes the weights tend to zero if
they are not necessary for the prediction task.

E(\theta) = \sum_{i=1}^{N} \big(f(x_i, \theta) - t_i\big)^2 + \lambda \|\theta\|^2

▶ The higher the parameter λ, the more the model's sensitivity to
variations in the input domain is reduced.
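A minimal PyTorch sketch of weight decay (assumed setup; the optimizer's weight_decay argument implements the λ∥θ∥² penalty through its gradient, up to the usual factor-of-2 convention, and the explicit penalty is written out for comparison):

```python
# Weight decay sketch: via the optimizer, or as an explicit penalty term.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# equivalent explicit form of the objective (lambda * ||theta||^2 added to the loss)
def loss_with_decay(pred, target, lam=1e-4):
    mse = ((pred - target) ** 2).sum()
    penalty = sum((p ** 2).sum() for p in model.parameters())
    return mse + lam * penalty
```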

14/33
Reducing Complexity by Reducing Sensitivity

Dropout [7]:
▶ Alternative to weight decay, which consists of adding artificial
multiplicative noise to the input and intermediate neurons, and training
the model subject to that noise.

▶ This is achieved by inserting a dropout layer in the neural network,
which multiplies each input (or activation) a_j by a random variable
b_j ∼ Bernoulli(p), i.e. z_j = b_j · a_j.
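A minimal PyTorch sketch of a dropout layer (assumed architecture). Note the convention: on the slide, b_j ∼ Bernoulli(p) is a retention mask, whereas torch.nn.Dropout(p) takes the probability of dropping a unit:

```python
# Dropout sketch: multiplicative Bernoulli noise on intermediate activations.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),               # multiplies activations by a random Bernoulli mask
    nn.Linear(256, 10),
)

model.train()                        # masks are sampled during training ...
out_train = model(torch.randn(8, 784))
model.eval()                         # ... and the layer acts as the identity at test time
out_test = model(torch.randn(8, 784))
```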

15/33
Reducing Complexity by Reducing Sensitivity
[Figure: effect of dropout on performance on the MNIST dataset]

Note:
▶ On neural networks for image data, dropout tends to yield superior
performance compared to simple weight decay.

16/33
Choosing a Model with Appropriate Complexity

Holdout Validation:
▶ Train multiple neural network models with different regularization
parameters (e.g. λ), and retain the one that performs best on
some validation set disjoint from the training data.

Problem:
▶ Training a model for each parameter λ can be costly. One would
potentially benefit from training a bigger model only once.

17/33
Accelerating Model Selection
Early Stopping Technique [6]:
▶ View the iterative procedure for training a neural network as generating
a sequence of increasingly complex models θ_1, ..., θ_T.
▶ Monitor the validation error throughout training and keep a snapshot
of the model when it had the lowest validation error.

Early stopping:
  θ⋆ ← None
  E⋆ ← ∞
  for t = 1 ... T do
      Run a few SGD steps, and collect the current parameter θ_t
      if E_val(θ_t) < E⋆ then
          θ⋆ ← θ_t
          E⋆ ← E_val(θ_t)
      end if
  end for
Advantage:
▶ No need to train several models (e.g. with different λ's). Only one
training run is needed!
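A minimal runnable sketch of the early-stopping loop above, using a toy regression task (the synthetic data and model are illustrative assumptions, not from the lecture):

```python
# Early stopping sketch: keep a snapshot of the parameters with lowest E_val.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 10)
t = X[:, :1] + 0.1 * torch.randn(200, 1)
X_tr, t_tr, X_val, t_val = X[:150], t[:150], X[150:], t[150:]

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)

best_state, best_val = None, float("inf")
for step in range(1, 501):
    opt.zero_grad()
    loss = ((model(X_tr) - t_tr) ** 2).mean()
    loss.backward()
    opt.step()
    if step % 10 == 0:                                    # "run a few SGD steps" per check
        with torch.no_grad():
            val = ((model(X_val) - t_val) ** 2).mean().item()
        if val < best_val:                                # snapshot theta* with lowest E_val
            best_val = val
            best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)                         # restore the best snapshot
```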

18/33
Very Large Models

▶ When the model becomes very large, there is an interesting 'double
descent' [2] phenomenon that occurs in the context of neural
networks, where the generalization error starts to go down again as
model complexity increases.

▶ This can be interpreted as some implicit averaging between the many
components of the model (interpolating regime).

▶ Increasing model size to a great extent may thus contribute, without
further regularization techniques, to achieving a lower test set error.

19/33
Part 2: Imbalances between Subgroups

20/33
Data from Multiple Domains
▶ The data might come from different
domains (P, Q).
▶ Domains may e.g. correspond to
different acquisition devices, or
different ways they are
configured/calibrated.

▶ One of the domains may be
overrepresented in the available
data, or the ML model may learn
better on a given domain at the
expense of another domain.

Image source: Aubreville et al. Quantifying the Scanner-Induced Domain Gap in Mitosis Detection. CoRR abs/2103.16515 (2021)

21/33
Addressing Multiple Domains

Simple Approach (one-layer networks):


▶ Denoting by P and Q the two domains, regularize the ML model
f(x) = w^⊤x so that both domains generate the same responses on average
at the output:

\min_w \; E(w) + \lambda \cdot \big(\mathbb{E}_P[w^\top x] - \mathbb{E}_Q[w^\top x]\big)^2

(a.k.a. moment matching). The approach can be enhanced to include
higher-order moments such as the variance, etc.

▶ In practice, more powerful tools exist to constrain distributions more
finely in representation space, such as the Wasserstein distance.
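A minimal PyTorch sketch of the moment-matching regularizer above (assumed setup: x_P, y_P are a labeled mini-batch from domain P and x_Q an unlabeled mini-batch from domain Q; training_step is a hypothetical helper):

```python
# Moment matching sketch for a one-layer model w^T x.
import torch
import torch.nn as nn

w = nn.Linear(50, 1, bias=False)          # one-layer model w^T x
opt = torch.optim.SGD(w.parameters(), lr=0.01)
lam = 1.0

def training_step(x_P, y_P, x_Q):
    opt.zero_grad()
    task_loss = ((w(x_P) - y_P) ** 2).mean()        # E(w) on the labeled domain P
    gap = w(x_P).mean() - w(x_Q).mean()             # E_P[w^T x] - E_Q[w^T x]
    loss = task_loss + lam * gap ** 2               # moment-matching penalty
    loss.backward()
    opt.step()
    return loss.item()

x_P, y_P = torch.randn(32, 50), torch.randn(32, 1)
x_Q = torch.randn(32, 50) + 0.5                     # shifted second domain
print(training_step(x_P, y_P, x_Q))
```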

22/33
Addressing Multiple Domains

More Advanced Approach [1]:


▶ Learn an auxiliary neural network (domain critic φ) that tries to classify
the two domains. Learn the parameters of the feature extractor in such a
way that the domain critic φ is no longer able to distinguish between
the two domains.
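A rough, simplified sketch of the domain-critic idea (a generic adversarial formulation, not the exact method of [1]; the main task loss of the classifier is omitted for brevity):

```python
# Adversarial domain-critic sketch: the critic learns to tell the domains
# apart from the features, while the feature extractor learns to fool it.
import torch
import torch.nn as nn

features = nn.Sequential(nn.Linear(50, 32), nn.ReLU())
critic   = nn.Sequential(nn.Linear(32, 1))                 # domain critic phi
bce = nn.BCEWithLogitsLoss()
opt_c = torch.optim.SGD(critic.parameters(), lr=0.01)
opt_f = torch.optim.SGD(features.parameters(), lr=0.01)

def adversarial_step(x_P, x_Q):
    # 1) update the critic to distinguish domain P (label 0) from domain Q (label 1)
    z_P, z_Q = features(x_P).detach(), features(x_Q).detach()
    loss_c = bce(critic(z_P), torch.zeros(len(x_P), 1)) + \
             bce(critic(z_Q), torch.ones(len(x_Q), 1))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # 2) update the feature extractor so the critic can no longer tell Q from P
    loss_f = bce(critic(features(x_Q)), torch.zeros(len(x_Q), 1))
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()

adversarial_step(torch.randn(32, 50), torch.randn(32, 50) + 0.5)
```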

23/33
Addressing Multiple Domains

Example:
▶ Example of one particular class of the Office-Caltech dataset and the
different domains from which the data is taken.

▶ Models equipped with a domain critic, although losing performance on
some domains, achieve better worst-case accuracy.

24/33
Part 3: Spurious Correlations

25/33
Spurious Correlations

▶ Artefact of the distribution of available data (P) where one or several
task-irrelevant input variables are spuriously correlated with the
task-relevant variables.

▶ Spurious correlations are very common in practical datasets, e.g. a
copyright tag occurring only on images of a certain class;
histopathological images of a certain class having been acquired with a
particular device and, as a result, having a different color profile, etc.

26/33
Spurious Correlations and the Clever Hans Effect

[Figure: available data (P) vs. new data (Q)]

▶ An ML classifier is technically able to classify the available data (P)
using either the correct features or the spurious ones. The ML model
doesn't know a priori which feature (the correct one or the spurious
one) to use. A model that bases its decision strategy on the spurious
feature is "right for the wrong reasons" and is also known as a Clever
Hans classifier.
▶ A Clever Hans classifier may fail to function well on the new data (Q)
where the spurious correlation no longer exists, e.g. horses without
copyright tags, or images of a different class with copyright tags.

27/33
Spurious Correlations and the Clever Hans Effect

▶ Test set accuracy doesn't give much information on whether the model
bases its decision on the correct features or exploits the spurious
correlation.

▶ Only an inspection of the decision structure by the user (e.g. using
LRP heatmaps) enables the detection of the flaw in the model [5].

28/33
Generating LRP heatmaps

[Figure: LRP applied to a Fisher Vector pipeline for an image of 'bicycle': Local Features → GMM fitting → Fisher Vector → Normalization + Linear SVM (Hellinger's kernel SVM); relevance is redistributed layer by layer (relevance conservation, redistribution formula) to produce a heatmap]

▶ Explanations are produced using a layer-wise redistribution process
from the output of the model to the input features.

▶ Each layer can have its own redistribution scheme. The redistribution
rules are designed in a way that maximizes explanation quality.
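As an illustration, here is a hedged numpy sketch of one redistribution step for a linear layer, using an LRP-ε style rule; the actual rules applied per layer in the pipeline above may differ:

```python
# LRP-epsilon style redistribution for a single linear layer (sketch).
import numpy as np

def lrp_linear(a, W, R_out, eps=1e-6):
    """Redistribute the relevance R_out at the layer output to its inputs.
    a: input activations (d_in,), W: weights (d_in, d_out), R_out: (d_out,)."""
    z = a @ W                                      # pre-activations z_k = sum_j a_j w_jk
    s = R_out / (z + eps * np.sign(z))             # stabilized ratio per output neuron
    R_in = a * (W @ s)                             # R_j = a_j * sum_k w_jk s_k
    return R_in                                    # conservation: R_in.sum() ~ R_out.sum()

a = np.array([1.0, 0.5, -0.2])
W = np.random.default_rng(0).normal(size=(3, 2))
R_out = np.array([1.0, 0.0])
print(lrp_linear(a, W, R_out), lrp_linear(a, W, R_out).sum())
```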

29/33
Mitigating Reliance on Spurious Correlations

Feature Selection / Unlearning:


▶ Retrain without the feature containing the artefact (e.g. crop images
to avoid copyright tags).

▶ Actively look in the model for units (e.g. subsets of neurons) that
respond to the artifact and remove such units from the model (e.g.
[3]).

Dataset Design:
▶ Manually remove the artifact (e.g. copyright tags) from the classes
that contain it, or alternatively, inject the artifact in every class (so
that it cannot be used anymore for discriminating between classes).

▶ Stratify the dataset in a way that the spurious features are present in
all classes in similar proportions.

Learning with Explanation Constraints:

▶ Include an extra term in the objective that penalizes decision strategies
that are based on unwanted features (previously revealed by an
explanation technique); a sketch follows below.
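One possible instantiation of such a constraint is an input-gradient penalty on regions flagged as unwanted (a sketch in this spirit, not necessarily the specific method meant in the lecture). Here `mask` is a hypothetical binary mask marking, e.g., the copyright-tag area identified by a prior explanation step:

```python
# Explanation-constraint sketch: penalize input gradients on unwanted regions.
import torch
import torch.nn as nn

def constrained_loss(model, x, y, mask, lam=10.0):
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = nn.functional.cross_entropy(logits, y)
    # gradient of the task loss w.r.t. the inputs, used as a crude explanation proxy
    grads, = torch.autograd.grad(task_loss, x, create_graph=True)
    penalty = ((grads * mask) ** 2).sum()          # discourage reliance on masked regions
    return task_loss + lam * penalty

# usage (hypothetical): loss = constrained_loss(model, images, labels, tag_mask)
```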

30/33
Summary

31/33
Summary

▶ While deep learning models can in principle fit very complex prediction
functions, the way they perform in practice is in large part determined
by the amount and quality of the data.

▶ Limited data may subject the ML model to overfitting and lead to
lower performance on new data. Various methods exist to prevent
overfitting (e.g. generating a low-dimensional input vector, or building a
model with limited sensitivity to the input, such as via weight decay or
dropout).

▶ Another problem is dataset bias, where certain parts of the distribution
are over/under-represented, or plagued with spurious correlations.
Reliance of the model on spurious correlations can lead to low test
performance, but this can be detected by Explainable AI approaches. A
number of methods exist to reduce reliance on spurious correlations.

32/33
References

[1] L. Andéol, Y. Kawakami, Y. Wada, T. Kanamori, K.-R. Müller, and G. Montavon.
Learning domain invariant representations by joint Wasserstein distance minimization.
Neural Networks, 167:233-243, 2023.

[2] M. Belkin, D. Hsu, S. Ma, and S. Mandal.
Reconciling modern machine-learning practice and the classical bias-variance trade-off.
PNAS, 116(32):15849-15854, 2019.

[3] P. Chormai, J. Herrmann, K. Müller, and G. Montavon.
Disentangled explanations of neural network predictions by finding relevant subspaces.
CoRR, abs/2212.14855, 2022.

[4] A. Krogh and J. A. Hertz.
A simple weight decay can improve generalization.
In NIPS, pages 950-957, 1991.

[5] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller.
Unmasking Clever Hans predictors and assessing what machines really learn.
Nature Communications, 10(1), Mar. 2019.

[6] L. Prechelt.
Early stopping - but when?
In Neural Networks: Tricks of the Trade, volume 1524 of LNCS, pages 55-69. Springer, 1996.

[7] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.
Dropout: a simple way to prevent neural networks from overfitting.
J. Mach. Learn. Res., 15(1):1929-1958, 2014.

33/33
