Lecture 05
Deep Learning 1
Lectures 1–4:
▶ With flexible neural network architectures, powerful optimization techniques, and fast machines, we have the means of producing functions that can accurately fit large amounts of highly nonlinear data.
Question:
▶ Do the learned neural networks generalize to new data, e.g. will they be able to correctly classify new images?
A Bit of Theory
So far, we have only considered the error we use to optimize the model (aka.
the training error):
$$E_{\text{train}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big( f(x_i, \theta) - t_i \big)^2$$
In practice, what we really care about is the true error:
$$E_{\text{true}}(\theta) = \int \big( f(x, \theta) - t \big)^2 \, p(x, t) \, dx \, dt$$
where p(x, t) is the true probability distribution from which the data is drawn at test time. The true error is much harder to minimize, because we don't know p(x, t).
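To make the two quantities concrete, here is a minimal numerical sketch (with assumed synthetic data and a polynomial model standing in for f) that computes the training error on a small sample and approximates the true error by Monte Carlo averaging over fresh draws from p(x, t):

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Assumed data distribution p(x, t): t = sin(x) plus observation noise
    x = rng.uniform(-3, 3, n)
    t = np.sin(x) + 0.1 * rng.normal(size=n)
    return x, t

# Fit some model f(x, theta) on a small training set (degree-9 polynomial as a stand-in)
x_train, t_train = sample(20)
theta = np.polyfit(x_train, t_train, deg=9)
f = lambda x: np.polyval(theta, x)

# Training error: finite average over the N training points
E_train = np.mean((f(x_train) - t_train) ** 2)

# True error: integral over p(x, t), approximated by averaging over many fresh samples
x_new, t_new = sample(100_000)
E_true = np.mean((f(x_new) - t_new) ** 2)

print(E_train, E_true)  # E_true typically comes out noticeably larger than E_train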
Characterizing Datasets
Factors that make the available dataset D and the true distribution p(x, t) diverge:
▶ The fact that the dataset is composed of few data points drawn randomly from the underlying data distribution (finite data).
▶ The fact that the dataset may overrepresent certain parts of the underlying distribution, e.g. people of a certain age group (dataset bias).
▶ The fact that the dataset may have been generated from an underlying distribution p_old(x, t) that is now obsolete (distribution shift).
Practical Examples
Practical Examples (cont.)
Outline
▶ Mitigating overfitting
Part 1 The Problem of Limited Data
Finite Data and Overfitting
▶ With limited data, one class or target value may be locally predominant 'by chance'. Learning these spurious variations is known as overfitting.
Model Error and Model Complexity
Vladimir Vapnik
Showed a formal relation between model complexity
(measured as the VC-dimension) and the error of a classifier.
Complexity and Generalization Error
$$E_{\text{true}}(\theta) - E_{\text{train}}(\theta) \;\le\; \sqrt{\frac{h\left(\log\frac{2N}{h} + 1\right) - \log(\eta/4)}{N}}$$
which holds with probability at least 1 − η.
Factors that reduce the gap between the true error E_true(θ) and the training error E_train(θ):
▶ Lowering the VC-dimension h.
▶ Increasing the number of data points N.
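As a rough illustration (not part of the original slides), the right-hand side of the bound can be evaluated numerically; the values of h, N and η below are arbitrary choices:

import numpy as np

def vc_gap_bound(h, N, eta=0.05):
    # Right-hand side of the bound above, holding with probability at least 1 - eta
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

print(vc_gap_bound(h=100, N=1_000))    # ~0.64: complex model, little data -> large gap
print(vc_gap_bound(h=100, N=100_000))  # ~0.09: same model, much more data -> small gap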
Characterizing Complexity (One-Layer Networks)
Interpretation:
▶ Model complexity can be restrained if the input data is low-dimensional
or if the model builds a large margin (i.e. has low sensitivity).
Question:
▶ Can we build similar concepts for deep neural networks?
Reducing Complexity via Low Dimensionality
[Diagram: hard-coded feature extraction producing a few features a1, a2, followed by learned parameters]
Approach:
▶ First, generate a low-dimensional representation by extracting a few
features from the high-dimensional input data (either hand-designed, or
automatically generated using methods such as PCA).
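A minimal sketch of this approach, assuming scikit-learn is available and using synthetic data as a stand-in for a real high-dimensional dataset:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for noisy, high-dimensional data (e.g. gene expression, d = 20000)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20000))
X[:, :5] *= 10.0                              # a few high-variance, informative directions
y = (X[:, :5].sum(axis=1) > 0).astype(int)    # labels depend only on those directions

# Extract a few features automatically (PCA), then fit a simple low-complexity classifier
model = make_pipeline(PCA(n_components=10), LogisticRegression())
model.fit(X, y)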
Reducing Complexity via Low Dimensionality
Observations:
▶ Building low-dimensional representations can be useful when predicting from noisy high-dimensional data such as gene expression in biology (d > 20000).
▶ On other tasks such as image recognition, low-dimensional representations can also discard class-relevant information (e.g. edges).
Reducing Complexity by Reducing Sensitivity
[Diagram: network with inputs a1, ..., ad and learned parameters, trained with regularization]
$$E(\theta) = \sum_{i=1}^{N} \big( f(x_i, \theta) - t_i \big)^2 + \lambda \|\theta\|^2$$
▶ The larger the parameter λ, the more the model's sensitivity to variations in the input domain is reduced (a minimal sketch of this penalty follows below).
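A minimal PyTorch sketch of this penalty; the architecture, batch, and value of λ are illustrative assumptions:

import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 50), torch.nn.ReLU(), torch.nn.Linear(50, 1))
lam = 1e-3                                       # regularization strength λ
x, t = torch.randn(32, 10), torch.randn(32, 1)   # placeholder batch (x_i, t_i)

mse = ((model(x) - t) ** 2).sum()                          # sum of squared errors
penalty = sum((p ** 2).sum() for p in model.parameters())  # ‖θ‖²
loss = mse + lam * penalty
loss.backward()

# Many optimizers apply a closely related penalty through their weight_decay argument:
# torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam)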
Reducing Complexity by Reducing Sensitivity
Dropout [7]:
▶ An alternative to weight decay: artificial multiplicative noise is added to the input and intermediate neurons, and the model is trained subject to that noise (see the sketch below).
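A minimal dropout sketch in PyTorch; the layer sizes and dropout rates are illustrative assumptions:

import torch

# Multiplicative (Bernoulli) noise applied to the input and intermediate neurons
model = torch.nn.Sequential(
    torch.nn.Dropout(p=0.2),   # randomly zero 20% of input units during training
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),   # randomly zero 50% of hidden units during training
    torch.nn.Linear(256, 10),
)

model.train()  # noise is active while training
model.eval()   # dropout is disabled at test time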
Reducing Complexity by Reducing Sensitivity
[Figure: effect of dropout on performance on the MNIST dataset]
Note:
▶ On neural networks for image data, dropout tends to yield superior
performance compared to simple weight decay.
Choosing a Model with Appropriate Complexity
Holdout Validation:
▶ Train multiple neural network models with different regularization
parameters (e.g. λ), and retain the one that performs the best on
some validation set disjoint from the training data.
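A minimal sketch of this selection loop, using ridge regression on synthetic data as a stand-in for the neural network; the grid of λ values is an arbitrary choice:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X, t = rng.normal(size=(100, 20)), rng.normal(size=100)
X_tr, t_tr = X[:70], t[:70]      # training data
X_val, t_val = X[70:], t[70:]    # disjoint validation set

best_model, best_err = None, float("inf")
for lam in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    model = Ridge(alpha=lam).fit(X_tr, t_tr)               # one training run per λ
    err = np.mean((model.predict(X_val) - t_val) ** 2)     # validation error
    if err < best_err:
        best_model, best_err = model, err                  # retain the best model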
Problem:
▶ Training a model for each parameter λ can be costly. One would
potentially benefit from training a bigger model only once.
Accelerating Model Selection
Early Stopping Technique [6]:
▶ View the iterative procedure for training a neural network as generating
a sequence of increasingly complex models θ_1, . . . , θ_T.
▶ Monitor the validation error throughout training and keep a snapshot
of the model at the point where it had the lowest validation error.
Early stopping:
θ⋆ ← None
E⋆ ← ∞
for t = 1 . . . T do
    Run a few SGD steps and collect the current parameters θ_t
    if E_val(θ_t) < E⋆ then
        θ⋆ ← θ_t
        E⋆ ← E_val(θ_t)
    end if
end for
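The same loop as a minimal runnable PyTorch sketch; the model, data, and number of steps are placeholders:

import copy
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x_tr, t_tr = torch.randn(64, 10), torch.randn(64, 1)      # training data
x_val, t_val = torch.randn(64, 10), torch.randn(64, 1)    # validation data

best_state, best_err = None, float("inf")
for step in range(200):                                    # t = 1 ... T
    loss = ((model(x_tr) - t_tr) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()           # run an SGD step
    with torch.no_grad():
        val_err = ((model(x_val) - t_val) ** 2).mean().item()
    if val_err < best_err:                                 # E_val(θ_t) < E⋆
        best_err = val_err
        best_state = copy.deepcopy(model.state_dict())     # snapshot θ⋆ ← θ_t

model.load_state_dict(best_state)                          # restore the best snapshot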
Advantage:
▶ No need to train several models (e.g. with different λ's). Only one
training run is needed!
Very Large Models
Part 2 Imbalances between Subgroups
Data from Multiple Domains
▶ The data might come from different domains (P, Q).
▶ Domains may e.g. correspond to different acquisition devices, or different ways they are configured/calibrated.
Image source: Aubreville et al. Quantifying the Scanner-Induced Domain Gap in Mitosis Detection. CoRR abs/2103.16515 (2021)
Addressing Multiple Domains
Addressing Multiple Domains
Addressing Multiple Domains
Example:
▶ One particular class of the Office-Caltech dataset and the different domains from which the data is taken.
Part 3 Spurious Correlations
Spurious Correlations
Spurious Correlations and the Clever Hans Effect
[Figure: available data (P) vs. new data (Q)]
Spurious Correlations and the Clever Hans Effect
▶ Test set accuracy doesn't give much information on whether the model
bases its decision on the correct features or exploits the spurious
correlation.
Generating LRP heatmaps
[Diagram: Fisher Vector classification pipeline (image of 'bicycle', local features, GMM fitting, Fisher vector, normalization, linear SVM with Hellinger's kernel) and the resulting heatmap, obtained by redistributing relevance layer by layer under a conservation constraint and per-layer redistribution formulas]
▶ Each layer can have its own redistribution scheme. The redistribution
rules are designed to maximize explanation quality (one common rule is sketched below).
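As an illustration (not specific to the pipeline above), here is a minimal NumPy sketch of one commonly used redistribution rule, LRP-ε, for a single dense layer; the variable names are assumptions:

import numpy as np

def lrp_epsilon(a, W, R_out, eps=1e-6):
    """Redistribute the output relevance R_out of a dense layer onto its inputs.

    a:     input activations, shape (d_in,)
    W:     layer weights, shape (d_in, d_out); bias terms omitted for simplicity
    R_out: relevance assigned to the outputs, shape (d_out,)
    """
    z = a @ W                             # contributions aggregated at each output
    s = R_out / (z + eps * np.sign(z))    # relevance per unit of contribution (stabilized)
    return a * (W @ s)                    # relevance redistributed to the inputs

# Conservation: the total relevance is (approximately) preserved across the layer
rng = np.random.default_rng(0)
a, W, R_out = rng.random(5), rng.random((5, 3)), rng.random(3)
print(lrp_epsilon(a, W, R_out).sum(), R_out.sum())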
Mitigating Reliance on Spurious Correlations
▶ Actively look in the model for units (e.g. subsets of neurons) that
respond to the artifact and remove such units from the model (e.g.
[3]).
Dataset Design:
▶ Manually remove the artifact (e.g. copyright tags) from the classes
that contain it, or alternatively, inject the artifact in every class (so
that it cannot be used anymore for discriminating between classes); see the sketch after this list.
▶ Stratify the dataset in a way that the spurious features are present in
all classes in similar proportions.
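A minimal sketch of the artifact-injection idea on synthetic image data; the patch location, size, and injection probability are arbitrary assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 28, 28))        # placeholder images
y = rng.integers(0, 10, size=1000)    # placeholder class labels

def add_tag(img):
    """Paste a small bright patch (standing in for e.g. a copyright tag) into a corner."""
    img = img.copy()
    img[:4, :4] = 1.0
    return img

# Inject the artifact at random, independently of the class label, so that its
# presence carries no information usable for discriminating between classes.
mask = rng.random(len(X)) < 0.5
X[mask] = np.stack([add_tag(img) for img in X[mask]])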
Summary
Summary
References
[6] L. Prechelt.
Early stopping – but when?
In Neural Networks: Tricks of the Trade, volume 1524 of LNCS, pages 55–69. Springer, 1996.
[7] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.
Dropout: a simple way to prevent neural networks from overfitting.
J. Mach. Learn. Res., 15(1):1929–1958, 2014.