07_regularization
Regularization for Deep Learning
Lecture slides for Chapter 7 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-09-27
Adapted by m.n. for CMPS 392
Definition
• “Regularization is any modification we make to a
learning algorithm that is intended to reduce its
generalization error but not its training error.”
(Goodfellow 2016)
Regularization strategies
• Constraints: adding restrictions on the parameter
values.
• Soft constraints: Adding extra terms in the objective
function:
q Encode prior knowledge.
q Generic preference for a simpler model
• Ensemble methods:
q Combine multiple hypotheses to explain the
training data
(Goodfellow 2016)
Parameter norm penalties
• Regularized objective: J̃(θ; X, y) = J(θ; X, y) + α Ω(θ), with α ∈ [0, ∞)
q α = 0: no regularization
q larger α: more regularization
• Ω: a parameter norm penalty, e.g.
q L1
q L2
(Goodfellow 2016)
L² parameter regularization
aka. ridge regression
• Ω(θ) = (1/2) ‖w‖₂² = (1/2) wᵀw
• ∇_w [(1/2) wᵀw] = w
• Update step: w ← w − ε α w − ε ∇_w J
q w ← w (1 − ε α) − ε ∇_w J
q Weights are shrunk by a multiplicative factor before the usual gradient step.
• Let w* = argmin_w J
• Let w̃ = argmin_w J̃
• Approximating J in the neighborhood of w*:
q Ĵ(w) = J(w*) + (1/2) (w − w*)ᵀ H (w − w*)
q No first-order term since w* is the minimum (∇J(w*) = 0)
(Goodfellow 2016)
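A minimal sketch of the update rule above (not from the slides; the least-squares cost and all values are assumed for illustration): each step first shrinks w by the factor (1 − εα), then applies the ordinary gradient step.

import numpy as np

def sgd_step_weight_decay(w, grad_J, eps=0.01, alpha=0.1):
    # w <- w(1 - eps*alpha) - eps * grad_J(w)
    return w * (1.0 - eps * alpha) - eps * grad_J(w)

# Assumed example: J(w) = 0.5 * ||Xw - y||^2 on synthetic data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
grad_J = lambda w: X.T @ (X @ w - y)

w = np.zeros(3)
for _ in range(200):
    w = sgd_step_weight_decay(w, grad_J)
print(w)  # shrunk relative to the unregularized least-squares solution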
" (regularized solution) compares to
How 𝒘
unregularized solution 𝒘*?
• What is the gradient of /𝐽 𝒘 at 𝒘
2?
• 𝛻 /𝐽 𝒘
2 =𝑯 𝒘 2 − 𝒘∗
• 𝛻 8𝐽 𝒘 2 − 𝒘∗ + 𝛼 𝒘
2 =𝑯 𝒘 2 =𝟎
q 2 = 𝑯𝒘∗
𝑯 + 𝛼𝑰 𝒘
q 2 = 𝑯 + 𝛼𝑰 "𝟏 𝑯𝒘∗
𝒘
• 𝑯 is real and symmetric
q 𝑯 = 𝑸𝚲𝐐𝐓
𝐓 "𝟏 𝐓 ∗ 𝐓 𝑻 "𝟏
• ! = 𝑸𝚲𝐐 + 𝛼𝑰
𝒘 𝑸𝚲𝐐 𝐰 = 𝑸𝚲𝐐 + 𝑸𝛼𝑰𝑸 𝑸𝚲𝐐𝐓 𝐰 ∗
𝑻 "𝟏
• ! = 𝑸 𝚲 + 𝛼𝑰
𝒘 𝑸 𝑸𝚲𝐐𝐓 𝐰 ∗ = 𝑸 𝚲 + 𝛼𝑰 "𝟏 𝑸𝑻 𝑸𝚲𝐐𝐓 𝐰 ∗
𝒘
! = 𝑸 𝚲 + 𝛼𝑰 "𝟏 𝚲𝐐𝐓 𝐰 ∗
(Goodfellow 2016)
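A quick numerical check of the derivation above (toy Hessian and w* are assumed): the direct formula (H + αI)⁻¹ H w* and the eigendecomposition form Q (Λ + αI)⁻¹ Λ Qᵀ w* give the same w̃.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + np.eye(4)                    # symmetric positive definite "Hessian"
w_star = rng.normal(size=4)                # assumed unregularized minimum
alpha = 0.5

w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

lam, Q = np.linalg.eigh(H)                 # H = Q diag(lam) Q^T
w_tilde_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))

print(np.allclose(w_tilde, w_tilde_eig))   # True: components scaled by lam/(lam+alpha)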
Interpretation
• w̃ = Q (Λ + αI)⁻¹ Λ Qᵀ w*
• The component of w* aligned with the i-th eigenvector of H is rescaled by λᵢ / (λᵢ + α):
q λᵢ ≫ α: regularization has little effect along this direction
q λᵢ ≪ α: this component is shrunk to nearly zero
(Goodfellow 2016)
Weight Decay
• Ĵ(w) = J(w*) + (1/2) (w − w*)ᵀ H (w − w*)
• Figure 7.1: contours of Ĵ and of the L² penalty, showing the unregularized solution w* and the regularized solution w̃.
q Along an eigenvector of H with a small eigenvalue, the regularization effect is large (that component is pulled toward 0).
q Along an eigenvector of H with a large eigenvalue, the regularization effect is small (that component stays close to w*).
(Goodfellow 2016)
Special case: Linear Regression
& $ 𝑻
• Cost function: 𝑿𝒘 − 𝒚 𝑿𝒘 − 𝒚 + 𝛼𝒘 𝒘
!
• Normal equations
q 𝑿𝑻 𝑿𝒘 − 𝑿𝑻 𝒚 + 𝛼𝒘 = 𝟎 ⇒ (𝑿𝑻 𝑿 + 𝛼𝑰)𝒘 = 𝑿𝑻 𝒚
1𝟏 Covariance
q 𝒘 = 𝑿𝑻 𝑿 + 𝛼𝑰 𝑿𝑻 𝒚 feature-output
Proportional to the
• Basically, we are adding 𝛼 to the diag. covariance matrix
q The diag. elements correspond to the variance of
each feature
• We perceive the data as having higher variance
q A feature having low covariance with output got
shrunk even more due to this added variance
(Goodfellow 2016)
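A small sketch of the closed-form ridge solution above (synthetic data assumed): w = (XᵀX + αI)⁻¹ Xᵀy, with larger α shrinking the coefficients toward zero.

import numpy as np

def ridge_fit(X, y, alpha):
    # Solve (X^T X + alpha*I) w = X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

print(ridge_fit(X, y, alpha=0.0))    # close to ordinary least squares
print(ridge_fit(X, y, alpha=50.0))   # coefficients visibly shrunk toward zero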
L¹ regularization
• Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|
• J̃(w; X, y) = α ‖w‖₁ + J(w; X, y)
q ∇_w J̃(w; X, y) = α sign(w) + ∇_w J(w; X, y)
• Ĵ(w) = J(w*) + (1/2) (w − w*)ᵀ H (w − w*)
q ∇Ĵ(w) = H (w − w*)
• Assume that H = diag(H₁,₁, …, Hₙ,ₙ) with Hᵢ,ᵢ > 0
q e.g. linear regression after PCA
• J̃(w) ≈ J(w*) + Σᵢ [ (1/2) Hᵢ,ᵢ (wᵢ − wᵢ*)² + α |wᵢ| ]
• Solution: wᵢ = sign(wᵢ*) max( |wᵢ*| − α / Hᵢ,ᵢ , 0 )
(Goodfellow 2016)
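A short sketch of the closed-form L¹ solution above for a diagonal Hessian (the values are made up): it is the familiar soft-thresholding operation, which zeroes out small coefficients.

import numpy as np

def l1_solution(w_star, H_diag, alpha):
    # w_i = sign(w*_i) * max(|w*_i| - alpha / H_ii, 0)
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([0.8, -0.3, 0.05, -2.0])   # assumed unregularized optimum
H_diag = np.ones(4)                          # assumed diagonal Hessian entries
print(l1_solution(w_star, H_diag, alpha=0.5))
# [ 0.3 -0.   0.  -1.5]  -> the two small coefficients are driven exactly to zero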
Interpretation
wᵢ = sign(wᵢ*) max( |wᵢ*| − α / Hᵢ,ᵢ , 0 )
• If wᵢ* > 0:
q wᵢ* > α / Hᵢ,ᵢ ⇒ wᵢ is shifted towards 0 by α / Hᵢ,ᵢ
q wᵢ* ≤ α / Hᵢ,ᵢ ⇒ wᵢ = 0
• If wᵢ* < 0:
q −wᵢ* > α / Hᵢ,ᵢ ⇒ wᵢ = wᵢ* + α / Hᵢ,ᵢ
o wᵢ is shifted towards 0 by α / Hᵢ,ᵢ
q −wᵢ* ≤ α / Hᵢ,ᵢ ⇒ wᵢ = 0
(Goodfellow 2016)
L¹ regularization and sparsity
• The sparsity property induced by L1 regularization can be
used as a feature selection mechanism
q LASSO regression (least absolute shrinkage and
selection operator)
• Equivalent to MAP Bayesian estimation with Laplace prior
q the prior is an isotropic Laplace distribution over w ∈ ℝⁿ:
o Laplace(wᵢ; 0, 1/α) = (α/2) exp(−α |wᵢ|)
o log Laplace(wᵢ; 0, 1/α) = log α − log 2 − α |wᵢ|
(Goodfellow 2016)
Norm Penalties
• MAP: Maximum A-Posteriori
• L1:
q Encourages sparsity,
q equivalent to MAP Bayesian estimation with
Laplace prior
• Squared L2:
q Encourages small weights,
q equivalent to MAP Bayesian estimation with
Gaussian prior
(Goodfellow 2016)
Explicit constraints
• We want to constrain Ω(𝜃) to be less than some
constant 𝑘
q construct a generalized Lagrange function:
o ℒ(θ, α) = J(θ) + α (Ω(θ) − k)
o θ* = argmin_θ max_{α ≥ 0} ℒ(θ, α)
(Goodfellow 2016)
Projection
• Sometimes we may wish to use explicit constraints
rather than penalties.
q we can modify algorithms such as stochastic gradient
descent to take a step downhill on 𝐽 𝜃 and then
project 𝜃 back to the nearest point that satisfies
Ω 𝜃 < 𝑘.
• How to project?
q Project into the unit L2 ball: closed form, θ ← θ / max(1, ‖θ‖₂)
q Project into unit L1 ball:
o No closed-form solution
o Numerical solution
(Goodfellow 2016)
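A minimal sketch of projected gradient descent under the explicit constraint ‖θ‖₂ ≤ k (the objective and its gradient are assumed for illustration); the L² projection has the closed form used below.

import numpy as np

def project_l2_ball(theta, k=1.0):
    # Closed-form projection onto {theta : ||theta||_2 <= k}
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

def projected_gd_step(theta, grad_J, eps=0.1, k=1.0):
    # Step downhill on J, then project theta back into the constraint set
    return project_l2_ball(theta - eps * grad_J(theta), k)

grad_J = lambda th: 2.0 * (th - np.array([3.0, 4.0]))   # assumed quadratic objective
theta = np.zeros(2)
for _ in range(100):
    theta = projected_gd_step(theta, grad_J)
print(theta, np.linalg.norm(theta))   # ends up on the boundary of the unit L2 ball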
Dataset Augmentation
• Best way to regularize is to train with more data
q create fake data and add it to the training set.
q We can generate new (𝒙, 𝑦) pairs easily just by
transforming the 𝒙 inputs in our training set.
q particularly effective for object recognition
o translating the training images a few pixels in each
direction
o rotating the image or scaling
• Some inappropriate transformations:
q horizontal flips: ‘b’ and ‘d’,
q 180° rotations: ‘6’ and ‘9’.
(Goodfellow 2016)
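A small sketch of the two transformations mentioned above, assuming a grayscale image stored as a 2-D NumPy array; np.roll is used as a simple stand-in for translation (it wraps around at the borders, unlike a true shift).

import numpy as np

def random_translate(img, max_shift=2, rng=np.random.default_rng()):
    # Shift the image a few pixels in each direction (np.roll wraps at the borders)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def horizontal_flip(img):
    return img[:, ::-1]

img = np.arange(25, dtype=float).reshape(5, 5)   # stand-in for a training image
augmented = [random_translate(img), horizontal_flip(img)]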
Dataset Augmentation
• Example transformations: affine distortion, noise, elastic deformation, horizontal flip, random translation, hue shift.
(Goodfellow 2016)
Noise Robustness
• Noise with infinitesimal variance can be added:
q At the input
q At the hidden layers
q At the weights:
(Goodfellow 2016)
Injecting noise at the weights
• For η small, with weight noise ε_W ~ 𝒩(0, ηI):
q ŷ_{ε_W}(x) = ŷ_{W+ε_W}(x) ≈ ŷ_W(x) + ε_Wᵀ ∇_W ŷ_W(x)
q E_{p(x,y),ε_W}[ ŷ_{ε_W}² ] = E_{p(x,y)}[ ŷ² ] + E_{p(x,y),ε_W}[ (ε_Wᵀ ∇_W ŷ)² ] + 0
o (the cross term vanishes because E[ε_W] = 0)
q E_{p(x,y),ε_W}[ ŷ_{ε_W}² ] = E_{p(x,y)}[ ŷ² ] + η E_{p(x,y)}[ ‖∇_W ŷ‖² ]
(Goodfellow 2016)
Special case: linear regression
• J̃_W = J + η E_{p(x,y)}[ ‖∇_w ŷ(x)‖² ]
• ŷ = wᵀx + b
• E_{p(x,y)}[ ‖∇_w ŷ(x)‖² ] = E_{p(x)}[ ‖x‖² ]
• which is not a function of the parameters and therefore does not contribute to the gradient of the cost function with respect to w:
q No regularization effect!
(Goodfellow 2016)
Injecting noise at the output
targets
• Most datasets have some amount of mistakes in the y labels.
• It can be harmful to maximize log 𝑝(𝑦 | 𝑥) when 𝑦 is a mistake.
• One way to prevent this is to explicitly model the noise on the labels.
q For example, we can assume that for some small constant 𝜖, the
training set label 𝑦 is correct with probability 1 − 𝜖,
q and otherwise any of the other possible labels might be correct.
• This assumption is easy to incorporate into the cost function analytically,
q rather than by explicitly drawing noise samples.
q For example, label smoothing regularizes a model based on a softmax
with 𝑘 output values
o by replacing the hard 0 targets with ε/(k−1)
o and the hard 1 targets with 1 − ε
• Label smoothing has the advantage of preventing the pursuit of hard
probabilities without discouraging correct classification.
(Goodfellow 2016)
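A minimal sketch of label smoothing as described above, assuming integer class labels: hard 0/1 targets are replaced by ε/(k−1) and 1 − ε.

import numpy as np

def smooth_labels(y, k, eps=0.1):
    # y: integer class indices; returns soft targets of shape (len(y), k)
    targets = np.full((len(y), k), eps / (k - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

print(smooth_labels(np.array([0, 2]), k=3, eps=0.1))
# [[0.9  0.05 0.05]
#  [0.05 0.05 0.9 ]]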
Multi-Task Learning
• Figure 7.2: lower layers share parameters across tasks, task-specific parameters sit on top of the shared representation, and the shared representation can also serve an unsupervised learning context.
(Goodfellow 2016)
Early stopping algorithm
(Goodfellow 2016)
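A sketch of the early stopping meta-algorithm (Algorithm 7.1 in the book); train_step, validation_error, the evaluation interval n and the patience p are assumed placeholders.

def early_stopping_train(theta, train_step, validation_error, n=10, p=5):
    # Keep the parameters with the lowest validation error; stop after p
    # consecutive evaluations without improvement.
    best_theta, best_err, best_step = theta, validation_error(theta), 0
    step, fails = 0, 0
    while fails < p:
        for _ in range(n):                  # train for n steps between evaluations
            theta = train_step(theta)
        step += n
        err = validation_error(theta)
        if err < best_err:
            best_theta, best_err, best_step, fails = theta, err, step, 0
        else:
            fails += 1
    return best_theta, best_step, best_err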
Re-use the validation set
• After early stopping has determined the best number of training steps, the validation data can be folded back into the training set:
q Strategy 1: retrain from scratch on all of the data for the same number of steps.
q Strategy 2: keep the parameters found by early stopping and continue training on all of the data until the validation-set loss falls below the training-set objective reached at the early stopping point.
o The second strategy avoids the cost of retraining but is less well-behaved (the target may never be reached).
(Goodfellow 2016)
Early stopping as a
regularizer
• ε (learning rate) and τ (number of training steps) limit the volume of parameter space reachable from θ₀ (the initial parameters)
• Early stopping is equivalent to L2 regularization in the case of:
q a simple linear model
q with a quadratic error function
q and simple gradient descent
• Ĵ(w) = J(w*) + (1/2) (w − w*)ᵀ H (w − w*)
q ∇_w Ĵ(w) = H (w − w*)
• w^(τ) = w^(τ−1) − ε ∇_w Ĵ(w^(τ−1)) = w^(τ−1) − ε H (w^(τ−1) − w*)
q w^(τ) = (I − εH) w^(τ−1) + εH w*
q w^(τ) − w* = (I − εH) w^(τ−1) + (εH − I) w*
q w^(τ) − w* = (I − εH) (w^(τ−1) − w*)
(Goodfellow 2016)
The number of steps 𝜏 corresponds to some value of the weight
decay coefficient 𝛼
• w^(τ) − w* = (I − εH) (w^(τ−1) − w*)
q H = Q Λ Qᵀ
q w^(τ) − w* = Q (I − εΛ) Qᵀ (w^(τ−1) − w*)
q Qᵀ (w^(τ) − w*) = (I − εΛ) Qᵀ (w^(τ−1) − w*)
o if ε is small enough that |1 − ελᵢ| < 1, every step brings w closer to w*
• Assume we start with w^(0) = 0:
q Qᵀ w^(1) = [I − (I − εΛ)] Qᵀ w*
q Qᵀ w^(2) = [I − (I − εΛ)²] Qᵀ w*
q Qᵀ w^(τ) = [I − (I − εΛ)^τ] Qᵀ w*
• L2 regularization:
q Qᵀ w̃ = (Λ + αI)⁻¹ Λ Qᵀ w*
q Qᵀ w̃ = [I − (Λ + αI)⁻¹ α] Qᵀ w*        (since λᵢ/(λᵢ + α) = 1 − α/(λᵢ + α))
• Compare:
q [I − (Λ + αI)⁻¹ α]  and  [I − (I − εΛ)^τ]
q the two coincide when (Λ + αI)⁻¹ α = (I − εΛ)^τ, i.e. 1 − λᵢ/(λᵢ + α) = (1 − ελᵢ)^τ
(Goodfellow 2016)
Early stopping advantage
• (1 − ελᵢ)^τ = α/(λᵢ + α) ⇒ τ log(1 − ελᵢ) = log α − log(λᵢ + α)
• If ελᵢ ≪ 1 and λᵢ/α ≪ 1, this gives τ ≈ 1/(εα), i.e. α ≈ 1/(τε):
q the number of training steps plays the role of the inverse of the weight decay coefficient.
• Advantage: early stopping determines the right amount of regularization automatically, by monitoring validation error during a single run, whereas weight decay requires separate runs for each candidate value of α.
(Goodfellow 2016)
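A quick numerical illustration of the correspondence above (a 1-D quadratic with made-up values): for a small eigenvalue λ, stopping after τ steps roughly matches weight decay with α = 1/(τε).

lam, w_star = 0.05, 2.0          # small eigenvalue of H, unregularized optimum
eps, tau = 0.01, 100             # learning rate, number of steps
alpha = 1.0 / (tau * eps)        # corresponding weight decay coefficient

w_early = (1 - (1 - eps * lam) ** tau) * w_star   # after tau gradient steps from 0
w_ridge = lam / (lam + alpha) * w_star            # L2-regularized solution

print(w_early, w_ridge)          # ~0.0976 vs ~0.0952: nearly the same shrinkage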
Early Stopping and Weight
Decay
Figure 7.4
(Goodfellow 2016)
Parameter tying
• Formally, we have model A with parameters w^(A) and model B with parameters w^(B)
• The two models map the input to two different, but related outputs:
q ŷ^(A) = f(w^(A), x)
q ŷ^(B) = g(w^(B), x)
q ∀i, w_i^(A) should be close to w_i^(B)
• Regularization:
q Ω(w^(A), w^(B)) = ‖w^(A) − w^(B)‖₂²
(Goodfellow 2016)
Parameter sharing (e.g. CNN)
• Force sets of parameters to be equal.
• Advantage:
q only a subset of the parameters (the unique set)
needs to be stored in memory.
• Natural images have many statistical properties that are
invariant to translation.
q a photo of a cat remains a photo of a cat if it is
translated one pixel to the right
q Parameter sharing has allowed CNNs to dramatically
lower the number of unique model parameters
(Goodfellow 2016)
Sparse Representations
• Sparse parameters: many entries of the weight matrix are zero (penalty placed on w, e.g. ‖w‖₁).
• Sparse representations: many entries of the representation h are zero (penalty placed on the activations, e.g. Ω(h) = α ‖h‖₁).
(Goodfellow 2016)
Bagging
• Bagging (short for bootstrap aggregating) is a
technique for reducing generalization error by
combining several models
q train several different models separately
q the models vote on the output for test examples
• Bagging is an example of model averaging.
q The general term is Ensemble methods.
• The reason that model averaging works is that
different models will usually not make all the same
errors on the test set.
(Goodfellow 2016)
Bagging example
• Consider for example a set of 𝑘 regression models.
• Suppose that each model makes an error εᵢ on each example, with the errors drawn from a zero-mean multivariate normal distribution
q with variances E[εᵢ²] = v
q and covariances E[εᵢ εⱼ] = c
• Then the error made by the average prediction of all the ensemble models is (1/k) Σᵢ εᵢ
• The expected squared error of the ensemble predictor is:
q E[ ((1/k) Σᵢ εᵢ)² ] = (1/k²) E[ Σᵢ ( εᵢ² + Σ_{j≠i} εᵢ εⱼ ) ] = v/k + (k−1)c/k
q if the errors are perfectly correlated (c = v): the error stays at v, averaging does not help
q if the errors are independent (c = 0): the expected squared error drops to v/k
(Goodfellow 2016)
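A Monte Carlo check of the formula above (v, c, k are assumed values): the empirical variance of the averaged error matches v/k + (k−1)c/k.

import numpy as np

k, v, c = 10, 1.0, 0.3
rng = np.random.default_rng(0)
cov = np.full((k, k), c) + (v - c) * np.eye(k)       # variances v, covariances c
errors = rng.multivariate_normal(np.zeros(k), cov, size=200_000)
ensemble_error = errors.mean(axis=1)                 # (1/k) * sum_i eps_i

print(ensemble_error.var())                          # ~0.37 empirically
print(v / k + (k - 1) * c / k)                       # 0.37 from the formula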
Ensemble methods vs. bagging
• Different ensemble methods construct the ensemble of models in
different ways.
• Bagging is a method that allows the same kind of model, training
algorithm and objective function to be reused several times
• Bagging involves constructing 𝑘 different datasets.
q Each dataset has the same number of examples as the original
dataset,
q but each dataset is constructed by sampling with replacement
from the original dataset.
o with high probability, each dataset is missing some of the
examples from the original dataset and also contains several
duplicate examples
o on average around 2/3 of the examples from the original
dataset are found in the resulting training set, if it has the
same size as the original
(Goodfellow 2016)
Bagging
• Figure 7.5: an 8-detector trained on two different bootstrap resamples learns two different rules:
q on the first resampled dataset, the detector learns that a loop on top of the digit corresponds to an 8;
q on the second, it learns that a loop on the bottom corresponds to an 8;
q each rule is brittle on its own, but averaging the two detectors is robust.
(Goodfellow 2016)
Expected number of duplicates
• In a bootstrap sample of n examples drawn with replacement from n originals, the probability that a given example never appears is (1 − 1/n)ⁿ ≈ 1/e ≈ 0.37.
• So on average about 63% of the original examples (≈ 2/3) appear in each resampled dataset, and the remaining slots are filled with duplicates.
(Goodfellow 2016)
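A quick check of the claim above, using a bootstrap resample of an assumed dataset of n indices.

import numpy as np

n = 10_000
rng = np.random.default_rng(0)
sample = rng.integers(0, n, size=n)              # bootstrap: sample with replacement
print(len(np.unique(sample)) / n)                # ~0.632 of the examples appear
print(1 - (1 - 1 / n) ** n, 1 - 1 / np.e)        # analytic value and its limit 1 - 1/e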
Dropout
• Dropout provides an inexpensive approximation to training
and evaluating a bagged ensemble of exponentially many
neural networks.
q removing non-output units from an underlying base
network
o by multiplying its output value by zero
• Each time we load an example into a minibatch, we
randomly sample a different binary mask to apply to all of
the input and hidden units in the network.
q The probability of sampling a mask value of one (causing
a unit to be included) is a hyperparameter fixed before
training begins.
o Typically, an input unit is included with probability 0.8
and a hidden unit is included with probability 0.5
(Goodfellow 2016)
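A minimal sketch of dropout at training time, assuming h holds a minibatch of hidden activations; this uses the common "inverted dropout" variant, which rescales by the keep probability during training instead of rescaling the weights at test time (the weight scaling rule appears two slides below).

import numpy as np

def dropout(h, keep_prob=0.5, training=True, rng=np.random.default_rng()):
    if not training:
        return h                                   # all units active at test time
    mask = rng.random(h.shape) < keep_prob         # 1 with probability keep_prob
    return h * mask / keep_prob                    # inverted-dropout rescaling

h = np.random.default_rng(0).normal(size=(4, 8))   # minibatch of hidden activations
h_train = dropout(h, keep_prob=0.5, training=True) # a fresh random mask per minibatch
h_test = dropout(h, training=False)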
Dropout
Figure 7.6
(Goodfellow 2016)
Computational graph of dropout
(Goodfellow 2016)
Weight scaling inference rule
• Evaluate the trained model with all units included,
q but with the weights going out of unit i multiplied by the probability of including unit i (e.g. ½).
q This corresponds to predicting the geometric mean of the ensemble!
• Consider a softmax regression classifier with n input variables represented by the vector v:
q P(y = yᵢ | v) = softmax(Wᵀv + b)ᵢ
• To index into the family of submodels with a binary mask d:
q P(y = yᵢ | v; d) = softmax(Wᵀ(v ⊙ d) + b)ᵢ
• Geometric mean over all 2ⁿ masks:
q p̃_ensemble(y = yᵢ | v) = [ ∏_{d ∈ {0,1}ⁿ} softmax(Wᵀ(v ⊙ d) + b)ᵢ ]^(1/2ⁿ)
q p̃_ensemble(y = yᵢ | v) ∝ exp( (1/2ⁿ) Σ_{d ∈ {0,1}ⁿ} [ Wᵀᵢ,: (v ⊙ d) + bᵢ ] )
q = exp( (1/2ⁿ) [ 2ⁿ⁻¹ Wᵀᵢ,: v + 2ⁿ bᵢ ] ) = exp( (1/2) Wᵀᵢ,: v + bᵢ )
(Goodfellow 2016)
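A numerical check of the derivation above (small assumed softmax model with n = 4 inputs): the renormalized geometric mean over all 2ⁿ dropout submodels equals the prediction of the full model with weights halved.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3                               # input units, output classes (assumed)
W, b, v = rng.normal(size=(n, k)), rng.normal(size=k), rng.normal(size=n)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Renormalized geometric mean of the 2^n submodels indexed by binary masks d
log_probs = [np.log(softmax(W.T @ (v * np.array(d)) + b))
             for d in itertools.product([0.0, 1.0], repeat=n)]
geo = np.exp(np.mean(log_probs, axis=0))
geo /= geo.sum()

# Weight scaling inference rule: keep every unit but halve the weights
scaled = softmax(W.T @ (v / 2) + b)

print(np.allclose(geo, scaled))           # True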
Another perspective of
dropout
• (1) Dropout is bagging with parameter sharing
• (2) Information erasing: Each hidden unit must be able to perform
well regardless of which other hidden units are in the model
q Dropout thus regularizes each hidden unit to be not merely a
good feature but a feature that is good in many contexts.
q For example, if the model learns a hidden unit hᵢ that detects a
face by finding the nose,
q then dropping hᵢ corresponds to erasing the information that
there is a nose in the image.
q The model must learn another hᵢ,
o either that redundantly encodes the presence of a nose,
o or that detects the face by another feature, such as the
mouth
(Goodfellow 2016)
Adversarial examples
• Search for an input 𝒙′ near a data point 𝒙 such that the
model output is very different at 𝒙′
• In many cases, 𝒙’ can be so similar to 𝒙 that a human
observer cannot tell the difference between the original
example and the adversarial example,
q but the network can make highly different predictions.
• Adversarial training
q training on adversarially perturbed examples from the
training set
• Adversarial examples are interesting in the context of
regularization
q because one can reduce the error rate on the original
i.i.d. test set via adversarial training
(Goodfellow 2016)
Adversarial Examples
Figure 7.8
(Goodfellow 2016)
Adversarial training
• The value of a linear function can change very rapidly if it has
numerous inputs.
q If we change each input by ε, then a linear function with weights w can change by as much as ε ‖w‖₁, which can be a very large amount if w is high-dimensional.
• Adversarial training discourages this highly sensitive locally linear
behavior by encouraging the network to be locally constant in the
neighborhood of the training data.
• This can be seen as a way of explicitly introducing a local
constancy prior into supervised neural nets.
q The classifier may then be trained to assign the same label to 𝒙
and 𝒙’.
q The assumption motivating this approach is that different
classes usually lie on disconnected manifolds, and a small
perturbation should not be able to jump from one class manifold
to another class manifold.
(Goodfellow 2016)
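A minimal sketch of generating such an x′ with the fast gradient sign method used for Figure 7.8; grad_x_loss is an assumed callable returning ∇ₓJ(x, y).

import numpy as np

def fgsm_example(x, y, grad_x_loss, eps=0.05):
    # Move each input coordinate by eps in the direction that increases the loss
    return x + eps * np.sign(grad_x_loss(x, y))

# Adversarial training then adds (x_adv, y) pairs to the training set, which
# encourages the network to be locally constant around each training point.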
Conclusion
• This chapter has described most of the general
strategies used to regularize neural networks.
• Regularization is a central theme of machine learning.
(Goodfellow 2016)