
WiSe 2023/24

Deep Learning 1

Lecture 6: Overfitting & Robustness (2)


Outline

Worst-Case Analysis
▶ The problem of adversarial examples
▶ Adversarial robustness
Adding Predictive Uncertainty
▶ Why predictive uncertainty?
▶ Density networks
▶ Ensemble models
Adding Prior Knowledge
▶ Translation invariance, local smoothness, etc.
▶ Feature reuse (transfer / multitask / self-supervised learning)

Part 1 Worst-Case Analysis

Motivations

Risk aversion
▶ One big error can often be more harmful than many small errors, e.g. a
system controlled by a neural network may tolerate small errors (which can
be corrected subsequently), but not a big error from which it cannot
recover.

Adversarial components
▶ Even though the neural network may perform well on average, an
adversary may craft inputs that steer the ML system towards the
worst-case decision behavior.

Worst-Case Analysis

[Figure: plot comparing the neural network predictions with the ground truth]

Typical causes of large errors:


▶ High dimensionality of the input space allows one to finely craft patterns to
which the network responds strongly.
▶ High depth/nonlinearity implies that the function is locally steeper than
necessary.

Example: Adversarial Examples
▶ Carefully crafted, nearly invisible perturbations of an existing data point
can cause the prediction of a neural network to change drastically,
while leaving almost no trace of the attack (a sketch of how such a
perturbation can be computed follows below).

Image source: https://arxiv.org/abs/1312.6199

▶ A serious concern in various applications (e.g. biometric identification,
reading traffic signs).
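
A minimal sketch (not part of the slides) of how such a perturbation can be computed with the fast gradient sign method; model, x, y, and the step size epsilon are assumed placeholders, and pixel values are assumed to lie in [0, 1]:

import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=0.01):
    """Craft a small perturbation of x that increases the loss of `model`.

    x: input batch, y: true labels, epsilon: maximum per-pixel change.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction of the sign of the loss gradient, then clamp to
    # the valid image range so the change remains nearly invisible.
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()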
Addressing Worst-Case Behavior

Enhanced Regularization:
▶ Search for high local variations of the decision function and add these
variations as a term of the error function to minimize.
▶ In practice, this can take the form of generating adversarial examples and
forcing them to be predicted in the same way as the original data (see the
training sketch below).
▶ More generic approaches based on Lipschitz continuity (e.g. spectral
norm regularization) can also be used.
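
A minimal sketch of the adversarial-training idea referenced above, assuming an FGSM-style perturbation and a classifier model with optimizer opt (all names are illustrative placeholders, not taken from the lecture):

import torch
import torch.nn.functional as F

def adversarial_training_step(model, opt, x, y, epsilon=0.01, alpha=0.5):
    """One training step that mixes the loss on clean inputs with the loss
    on adversarially perturbed inputs, so both are predicted the same way."""
    # Generate adversarial examples with a single FGSM step.
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    with torch.no_grad():
        x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0)

    # Minimize the error on both the original and the perturbed data.
    opt.zero_grad()
    loss = (1 - alpha) * F.cross_entropy(model(x), y) \
           + alpha * F.cross_entropy(model(x_adv), y)
    loss.backward()
    opt.step()
    return loss.item()

Where a Lipschitz-based approach is preferred instead, PyTorch's torch.nn.utils.spectral_norm can be applied to individual layers to constrain the spectral norm of their weights.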
Data Preprocessing:
▶ In practice, one can also address worst-case behavior by applying
dimensionality reduction (e.g. blurring images) before applying the
neural network.

Part 2 Predictive Uncertainty

Predictive Uncertainty

Practical motivations:
▶ Understand when we can
trust the model in a more
precise way than just looking
at the overall error.
▶ Enables the user to be prompted when the model is unsure, in which case
the user can decide e.g. to collect more data, or to perform the
prediction manually.

Image source: https://doi.org/10.1103/PhysRevD.98.063511

Predictive Uncertainty

Approach 1:
▶ Explicitly encode the uncertainty estimate in the neural network, i.e.
have one output neuron for predicting the actual value of interest, and
a second output neuron for predicting the uncertainty associated with
this prediction.
▶ For example, one predicts that the output is distributed according to
the random variable y ∼ N(µ, σ²), where µ and σ are the two neural
network outputs. (How to train these models will be presented in
Lecture 7.)

Image source: https://brendanhasz.github.io/2019/07/23/bayesian-density-net.html
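
A minimal sketch of such a density network, assuming a one-dimensional regression target; the layer sizes are illustrative and the training objective is deferred to Lecture 7:

import torch
import torch.nn as nn

class DensityNetwork(nn.Module):
    """Predicts a mean and a standard deviation for each input,
    i.e. the parameters of y ~ N(mu, sigma^2)."""

    def __init__(self, n_features, n_hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(n_hidden, 1)      # predicted value
        self.sigma_head = nn.Linear(n_hidden, 1)   # predicted uncertainty

    def forward(self, x):
        h = self.body(x)
        mu = self.mu_head(h)
        # Softplus keeps the predicted standard deviation positive.
        sigma = nn.functional.softplus(self.sigma_head(h)) + 1e-6
        return mu, sigma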

Predictive Uncertainty
Approach 2:
▶ Train an ensemble of neural networks and measure prediction
uncertainty as the discrepancy between the predictions of the individual
networks in the ensemble, e.g. for an ensemble of n neural networks with
respective outputs o₁, ..., oₙ, we generate the two aggregated outputs
µ = avg(o₁, ..., oₙ) and σ = std(o₁, ..., oₙ), which represent the final
prediction and its uncertainty.

Image source: https://doi.org/10.1007/s00521-019-04359-7
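
A minimal sketch of the aggregation step, assuming a list of already trained models that all produce outputs of the same shape:

import torch

def ensemble_predict(models, x):
    """Aggregate the outputs o_1, ..., o_n of an ensemble into a final
    prediction (mean) and an uncertainty estimate (standard deviation)."""
    with torch.no_grad():
        outputs = torch.stack([model(x) for model in models])  # (n, batch, ...)
    mu = outputs.mean(dim=0)
    sigma = outputs.std(dim=0)
    return mu, sigma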

Predictive Uncertainty
Approach 2 (cont.):

Image source: https://doi.org/10.1007/s00521-019-04359-7

▶ Each network in the ensemble may have a different initialization, may
receive different input features, and may be trained on different subsets
of the data.
▶ Uncertainty can then be understood as the effect of random
initialization, feature selection, and data sampling.
▶ The more heterogeneous the ensemble, the higher the estimate of
uncertainty.

Part 3 Beyond Regularization: Prior Knowledge

Prior Knowledge
Recap:
▶ Machine learning is affected by data quality issues (e.g. scarcity of data
or labels, spurious correlations in the dataset, under-representation of
some parts of the distribution, shift between the data available for
training and the data encountered at deployment).
Idea:
▶ There is no point in learning from the data what we already know.
What we already know (our prior knowledge) should ideally be
hard-coded into the model.
Example:
▶ In specific tasks, certain features are known to have no effect on the
quantity to predict. It is better not to give them as input to the neural
network.
[Figure: example network where symptoms (fever, cough, ...) and patient data (male/female, age) are used as inputs to predict diseases (disease 1, disease 2, ...), while the date (month, day) is marked as irrelevant]
Physical Invariances

Example: Rotation/Translation Invariance


▶ Rotating or translating a molecule leaves its atomization energy
unchanged.

▶ Rotation (and translation) invariance can be ensured e.g. by encoding
the molecule by the pairwise distances between its atoms rather than its
3D coordinates, and feeding these distances to a neural network (see the
sketch below).
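
A minimal sketch of the distance-based encoding mentioned above; coords is assumed to be an (n_atoms, 3) tensor of 3D positions:

import torch

def pairwise_distances(coords):
    """Encode a molecule by the distances between all pairs of atoms.

    coords: tensor of shape (n_atoms, 3) with 3D coordinates.
    Returns a flat vector of the upper-triangular distance matrix,
    which is invariant to rotating or translating the molecule.
    """
    diff = coords.unsqueeze(0) - coords.unsqueeze(1)   # (n, n, 3)
    dist = diff.norm(dim=-1)                           # (n, n)
    i, j = torch.triu_indices(len(coords), len(coords), offset=1)
    return dist[i, j]                                  # n * (n - 1) / 2 values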

Physical Invariances

Example: Modeling interaction between two molecules

Image source: https://doi.org/10.1021/acs.jctc.8b01285

▶ Distances are fed as input to a plain neural network.


▶ Works as long as the atoms of the molecules can be indexed consistently
(i.e. the approach stops working when the molecules received as input are
of arbitrary shape and size).

Soft Invariances

Example: Handwritten digit recognition


▶ Rotating digits by a few degrees
usually does not change class
membership.
▶ There are some exceptions, e.g. rotating a
`1' may transform it into a `7'.

Approaches to build invariance:


▶ Use purposely designed neural network architectures, e.g. scattering
networks, pooling networks, etc.
▶ Augment the dataset with elastic distortions, and train the neural
network on the extended dataset (see the sketch below).
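
A minimal sketch of the augmentation approach, using small random rotations of MNIST digits as a stand-in for the elastic distortions mentioned above (dataset, rotation range, and batch size are illustrative):

import torch
from torchvision import datasets, transforms

# Small random rotations preserve class membership for most digits;
# a few degrees keeps a '1' from turning into a '7'.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])

train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=augment)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64,
                                           shuffle=True)

Because the transform is re-sampled every epoch, the network effectively sees an extended dataset of slightly distorted digits.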

Feature Reuse / Transfer Learning

▶ Certain tasks have intrinsically little annotated data (e.g. scientific
data), due to the cost of labeling by an expert.
▶ However, they have similarities with other tasks with much more data
(e.g. general-purpose image recognition). In both cases, one needs to
extract features such as edge or color detectors to solve the task.

Transfer Learning with Deep Networks
Approach:
▶ When two tasks are related, we can train a big neural network on the
first task with abundant data, and reuse the features in intermediate
layers for the task of interest.

▶ This type of transfer learning is very common in applied research. For
image recognition tasks, researchers typically start from a
state-of-the-art network for vision (e.g. ResNet) pretrained on
ImageNet, and retrain the top layers on the specific task (see the
sketch below).
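
A minimal sketch of this recipe, assuming a new task with n_classes categories and the torchvision weights API; here only the replaced top layer is retrained, which is one common variant:

import torch.nn as nn
from torchvision import models

n_classes = 10  # assumed number of classes for the task of interest

# Start from a ResNet pretrained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained feature extractor ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace and retrain only the top layer for the new task.
model.fc = nn.Linear(model.fc.in_features, n_classes)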

How to Generate Useful Features
Classical approach:
▶ Learn a model to classify a more general task (e.g. image recognition).
Self-supervised learning approach:
▶ Create an artificial task where labels can be produced automatically,
and whose solution requires features that are needed for the task of
interest (more in DL2).

Image source: https://doi.org/10.3390/e24040551
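
One concrete instance of such an artificial task, given here purely as an illustration (it is not specified on this slide), is predicting which rotation was applied to an image; the labels are produced automatically:

import torch

def rotation_pretext_batch(images):
    """Create a self-supervised batch: each image is rotated by a random
    multiple of 90 degrees, and the rotation index serves as the label.

    images: tensor of shape (batch, channels, height, width).
    """
    labels = torch.randint(0, 4, (images.shape[0],))
    rotated = torch.stack([
        torch.rot90(img, k=int(k), dims=(1, 2))
        for img, k in zip(images, labels)
    ])
    return rotated, labels  # train a classifier to predict `labels`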

Summary

▶ In practice, expected prediction accuracy may not be the most relevant
quantity. The true practical usefulness of a system is often better
characterized by its worst-case performance.
▶ It is often desirable to equip neural network models with some measure
of predictive uncertainty, so that the network can tell the user when to
trust and when not to trust the prediction.
▶ There is no point in learning what we already know. Prior knowledge (e.g.
invariances, shared features) can be introduced into neural networks. As
a result, the model becomes less affected by overfitting and also more
robust.

