
Regularization for Deep Learning
Lecture slides for Chapter 7 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-09-27
Adapted by m.n. for CMPS 392
Definition
• “Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”
• Developing more effective regularization strategies has been one of the major research efforts in the field.
• Deep learning take:
  q the best fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately!

(Goodfellow 2016)
Regularization strategies
• Constraints: adding restrictions on the parameter values.
• Soft constraints: adding extra terms in the objective function:
  q Encode prior knowledge.
  q Generic preference for a simpler model.
• Ensemble methods:
  q Combine multiple hypotheses to explain the training data.

(Goodfellow 2016)
Parameter norm penalties

• θ: all learnable parameters (weights and biases)
• w: parameters affected by a norm penalty
  q we take the weights and exclude the biases
• α ∈ [0, ∞): regularization strength
  q α = 0 means no regularization
• Ω: norm function
  q L¹
  q L²
• Regularized objective: J̃(θ; X, y) = J(θ; X, y) + α Ω(w)

(Goodfellow 2016)
L² parameter regularization
aka. ridge regression
• Ω(θ) = ½ ‖w‖₂² = ½ wᵀw
• ∇_w (½ wᵀw) = w
• Update step: w ← w − ε α w − ε ∇_w J
  q w ← w(1 − εα) − ε ∇_w J : the weights are shrunk by a multiplicative factor
• Let w* = argmin_w J, and let w̃ = argmin_w J̃
• Approximating J in the neighborhood of w*:
  q Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*)
  q no first-order term since w* is the minimum (∇J(w*) = 0)

(Goodfellow 2016)
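A minimal NumPy sketch of the update step above, assuming a generic gradient function grad_J (an illustrative name): L² weight decay shrinks the weights by the factor (1 − εα) before the ordinary gradient step.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_J, eps=0.1, alpha=0.01):
    """One gradient step on the L2-regularized objective:
    w <- w * (1 - eps*alpha) - eps * grad_J(w)."""
    return w * (1.0 - eps * alpha) - eps * grad_J(w)

# Toy example: J(w) = 0.5 * ||w - 1||^2, so grad_J(w) = w - 1.
w = np.zeros(3)
for _ in range(100):
    w = sgd_step_with_weight_decay(w, lambda w: w - 1.0)
print(w)  # converges near 1/(1 + alpha) ~ 0.990 rather than exactly 1
```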
" (regularized solution) compares to
How 𝒘
unregularized solution 𝒘*?
• What is the gradient of /𝐽 𝒘 at 𝒘
2?
• 𝛻 /𝐽 𝒘
2 =𝑯 𝒘 2 − 𝒘∗
• 𝛻 8𝐽 𝒘 2 − 𝒘∗ + 𝛼 𝒘
2 =𝑯 𝒘 2 =𝟎
q 2 = 𝑯𝒘∗
𝑯 + 𝛼𝑰 𝒘
q 2 = 𝑯 + 𝛼𝑰 "𝟏 𝑯𝒘∗
𝒘
• 𝑯 is real and symmetric
q 𝑯 = 𝑸𝚲𝐐𝐓
𝐓 "𝟏 𝐓 ∗ 𝐓 𝑻 "𝟏
• ! = 𝑸𝚲𝐐 + 𝛼𝑰
𝒘 𝑸𝚲𝐐 𝐰 = 𝑸𝚲𝐐 + 𝑸𝛼𝑰𝑸 𝑸𝚲𝐐𝐓 𝐰 ∗
𝑻 "𝟏
• ! = 𝑸 𝚲 + 𝛼𝑰
𝒘 𝑸 𝑸𝚲𝐐𝐓 𝐰 ∗ = 𝑸 𝚲 + 𝛼𝑰 "𝟏 𝑸𝑻 𝑸𝚲𝐐𝐓 𝐰 ∗

𝒘
! = 𝑸 𝚲 + 𝛼𝑰 "𝟏 𝚲𝐐𝐓 𝐰 ∗

(Goodfellow 2016)
Interpretation
w̃ = Q (Λ + αI)⁻¹ Λ Qᵀ w*
• The components of w* along the eigenvectors of H are rescaled
  q component i is multiplied by λᵢ / (λᵢ + α)
  q λᵢ ≫ α ⇒ the effect of regularization is small
  q λᵢ ≪ α ⇒ the corresponding component is shrunk to nearly zero (by a factor ≈ λᵢ/α)

(Goodfellow 2016)
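A small NumPy check of the result above (all values illustrative): the regularized solution equals the unregularized one with each component along an eigenvector of H rescaled by λᵢ / (λᵢ + α).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + np.eye(4)          # symmetric positive-definite Hessian
w_star = rng.normal(size=4)      # unregularized minimizer
alpha = 0.5

# Direct solution of (H + alpha*I) w_tilde = H w_star
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Same thing through the eigendecomposition H = Q Lambda Q^T
lam, Q = np.linalg.eigh(H)
w_tilde_eig = Q @ (lam / (lam + alpha) * (Q.T @ w_star))

assert np.allclose(w_tilde, w_tilde_eig)
```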
Weight Decay
Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*)
Figure 7.1: contours of the unregularized objective Ĵ and of the L² penalty α wᵀw. Along directions corresponding to small eigenvalues of H the regularization effect is large; along directions with large eigenvalues it is small. The regularized solution lies between the origin and the unregularized solution w*.
(Goodfellow 2016)
Special case: Linear Regression
• Cost function: ½ (Xw − y)ᵀ(Xw − y) + ½ α wᵀw
• Normal equations:
  q XᵀX w − Xᵀy + αw = 0 ⇒ (XᵀX + αI) w = Xᵀy
  q w = (XᵀX + αI)⁻¹ Xᵀy
  q XᵀX is proportional to the feature covariance matrix; Xᵀy is the feature–output covariance
• Basically, we are adding α to the diagonal of the covariance matrix
  q the diagonal elements correspond to the variance of each feature
• We perceive the data as having higher variance
  q a feature having low covariance with the output gets shrunk even more due to this added variance
(Goodfellow 2016)
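A NumPy sketch of the ridge solution above on synthetic data (the data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

alpha = 1.0
# Ridge: w = (X^T X + alpha*I)^(-1) X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# alpha = 0 recovers ordinary least squares
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ridge)
print(w_ols)
```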
L¹ regularization
• Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|
• J̃(w; X, y) = α ‖w‖₁ + J(w; X, y)
  q ∇_w J̃(w; X, y) = α sign(w) + ∇_w J(w; X, y)
• Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*)
  q ∇Ĵ(w) = H (w − w*)
• Assume that H = diag(H₁,₁, …, Hₙ,ₙ), with Hᵢ,ᵢ > 0
  q e.g. linear regression after PCA
• J̃(w) ≈ J(w*) + Σᵢ [ ½ Hᵢ,ᵢ (wᵢ − wᵢ*)² + α |wᵢ| ]
• Solution: wᵢ = sign(wᵢ*) max( |wᵢ*| − α / Hᵢ,ᵢ , 0 )
(Goodfellow 2016)
Interpretation
wᵢ = sign(wᵢ*) max( |wᵢ*| − α / Hᵢ,ᵢ , 0 )
• If wᵢ* > 0:
  q wᵢ* > α / Hᵢ,ᵢ ⇒ wᵢ is shifted towards 0 by α / Hᵢ,ᵢ
  q wᵢ* ≤ α / Hᵢ,ᵢ ⇒ wᵢ = 0
• If wᵢ* < 0:
  q −wᵢ* > α / Hᵢ,ᵢ ⇒ wᵢ = wᵢ* + α / Hᵢ,ᵢ
    o wᵢ is shifted towards 0 by α / Hᵢ,ᵢ
  q −wᵢ* ≤ α / Hᵢ,ᵢ ⇒ wᵢ = 0

(Goodfellow 2016)
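The closed-form solution above is the soft-thresholding operator; a minimal NumPy sketch assuming, as on the slide, a diagonal Hessian (values illustrative):

```python
import numpy as np

def l1_solution(w_star, H_diag, alpha):
    """w_i = sign(w*_i) * max(|w*_i| - alpha / H_ii, 0)  (diagonal Hessian case)."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([2.0, -0.3, 0.05, -1.5])
H_diag = np.array([1.0,  1.0, 1.0,   2.0])
print(l1_solution(w_star, H_diag, alpha=0.5))
# -> [ 1.5   0.    0.   -1.25]: small components are zeroed out (sparsity)
```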
L¹ regularization: sparsity
• The sparsity property induced by L¹ regularization can be used as a feature selection mechanism
  q LASSO regression (least absolute shrinkage and selection operator)
• Equivalent to MAP Bayesian estimation with a Laplace prior
  q the prior is an isotropic Laplace distribution over w ∈ ℝⁿ:
    o Laplace(wᵢ; 0, 1/α) = (α/2) exp(−α |wᵢ|)
    o log Laplace(wᵢ; 0, 1/α) = −log(2/α) − α |wᵢ|
• log posterior ∝ log likelihood + log prior
• max log posterior ⟺ min (negative log likelihood − log prior)

(Goodfellow 2016)
Norm Penalties
• MAP: Maximum A-Posteriori
• L1:
q Encourages sparsity,
q equivalent to MAP Bayesian estimation with
Laplace prior

• Squared L2:
q Encourages small weights,
q equivalent to MAP Bayesian estimation with
Gaussian prior

(Goodfellow 2016)
Explicit constraints
• We want to constrain Ω(θ) to be less than some constant k
  q construct a generalized Lagrange function
• In practice we fix α and lose direct control of k
• The regularized training problem J̃ is equivalent to the explicit-constraint problem for some (unknown) value of k!

(Goodfellow 2016)
Projection
• Sometimes we may wish to use explicit constraints
rather than penalties.
q we can modify algorithms such as stochastic gradient
descent to take a step downhill on 𝐽 𝜃 and then
project 𝜃 back to the nearest point that satisfies
Ω 𝜃 < 𝑘.
• How to project?
q Project into unit L2 ball:
q Project into unit L1 ball:
o No closed-form solution
o Numerical solution

(Goodfellow 2016)
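A sketch of the projected-gradient step described above for an L² constraint ‖θ‖₂ ≤ r (the radius name r is illustrative); the L¹ projection is omitted since, as noted, it has no closed form:

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Project theta back onto the L2 ball {theta : ||theta||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    if norm <= radius:
        return theta
    return theta * (radius / norm)

def projected_sgd_step(theta, grad_J, eps=0.1, radius=1.0):
    """Take a step downhill on J, then project back into the feasible set."""
    return project_l2_ball(theta - eps * grad_J(theta), radius)
```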
Dataset Augmentation
• Best way to regularize is to train with more data
q create fake data and add it to the training set.
q We can generate new (𝒙, 𝑦) pairs easily just by
transforming the 𝒙 inputs in our training set.
q particularly effective for object recognition
o translating the training images a few pixels in each
direction
o rotating the image or scaling
• Some inappropriate transformations:
q horizontal flips: ‘b’ and ‘d’,
q 180◦ rotations: ‘6’ and ‘9’,

(Goodfellow 2016)
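A minimal sketch of the pixel-translation augmentation mentioned above, assuming grayscale images shaped (H, W) (illustrative; real pipelines typically also add rotations, scaling, etc.):

```python
import numpy as np

def random_translate(image, max_shift=2, rng=None):
    """Shift a (H, W) image by up to max_shift pixels in each direction,
    padding the uncovered border with zeros (label-preserving for recognition)."""
    rng = np.random.default_rng() if rng is None else rng
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.zeros_like(image)
    h, w = image.shape
    ys, yd = (dy, 0) if dy >= 0 else (0, -dy)
    xs, xd = (dx, 0) if dx >= 0 else (0, -dx)
    out[yd:h - ys, xd:w - xs] = image[ys:h - yd, xs:w - xd]
    return out

# Each epoch can then see a slightly different copy of every training image.
```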
Dataset Augmentation
Example augmentations (figure): affine distortion, elastic deformation, noise, horizontal flip, random translation, hue shift.
(Goodfellow 2016)
Noise Robustness
• Noise with infinitesimal variance can be added:
q At the input
q At the hidden layers
q At the weights:

(Goodfellow 2016)
Injecting noise at the weights
• Add a random perturbation ε_W ~ N(0, ηI) to the weights. For η small:
  q ŷ_{ε_W}(x) = ŷ_{W+ε_W}(x) ≈ ŷ_W(x) + ε_Wᵀ ∇_W ŷ(x)
  q E_{x,y,ε_W}[ ŷ_{ε_W}(x)² ] = E_{x,y}[ ŷ(x)² ] + η E_{x,y}[ ‖∇_W ŷ(x)‖² ]
  q J̃_W = J + η E_{x,y}[ ‖∇_W ŷ(x)‖² ]
  q Equivalent to adding a regularization term
• Pushes the model into regions where the model is relatively insensitive to small variations in the weights
(Goodfellow 2016)
Special case: linear regression
• J̃_W = J + η E_{x,y}[ ‖∇_W ŷ(x)‖² ]
• ŷ(x) = wᵀx + b
• ∇_w ŷ(x) = x, so E_{x,y}[ ‖∇_w ŷ(x)‖² ] = E_x[ ‖x‖² ]
• which is not a function of the parameters and therefore does not contribute to the gradient of the cost function w.r.t. w:
  q no regularization effect!
(Goodfellow 2016)
Injecting noise at the output
targets
• Most datasets have some amount of mistakes in the y labels.
• It can be harmful to maximize log 𝑝(𝑦 | 𝑥) when 𝑦 is a mistake.
• One way to prevent this is to explicitly model the noise on the labels.
q For example, we can assume that for some small constant 𝜖, the
training set label 𝑦 is correct with probability 1 − 𝜖,
q and otherwise any of the other possible labels might be correct.
• This assumption is easy to incorporate into the cost function analytically,
q rather than by explicitly drawing noise samples.
q For example, label smoothing regularizes a model based on a softmax with k output values
  o by replacing the hard 0 targets with ε/(k − 1)
  o and the hard 1 targets with 1 − ε
• Label smoothing has the advantage of preventing the pursuit of hard
probabilities without discouraging correct classification.

(Goodfellow 2016)
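A sketch of label smoothing as described above, assuming integer class labels and k classes (function name illustrative):

```python
import numpy as np

def smooth_labels(labels, k, eps=0.1):
    """Replace the hard 1 with 1 - eps and each hard 0 with eps / (k - 1)."""
    smoothed = np.full((len(labels), k), eps / (k - 1))
    smoothed[np.arange(len(labels)), labels] = 1.0 - eps
    return smoothed

print(smooth_labels(np.array([0, 2]), k=3, eps=0.1))
# [[0.9  0.05 0.05]
#  [0.05 0.05 0.9 ]]
```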
Multi-Task Learning
Figure: a shared intermediate representation feeds task-specific parameters for the supervised tasks and an unsupervised learning context.
Among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.
Figure 7.2 (Goodfellow 2016)


Learning Curves
Early stopping: stop training when validation-set performance has stopped improving, and return the parameters from the point of lowest validation error.
Figure 7.3 (Goodfellow 2016)


Early stopping
• probably the most commonly used form of
regularization in deep learning.
q the number of training steps (or training time) is
just another hyperparameter.
• The cost is running the validation set evaluation
periodically during training
q Reduce the validation set
q Evaluate the validation loss less frequently
• Periodically save the trained model

(Goodfellow 2016)
Early stopping algorithm

(Goodfellow 2016)
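The algorithm itself appears as an image on the original slide (the book's Algorithm 7.1); below is a minimal Python sketch of that meta-algorithm, with illustrative names (train_step, validation_loss) and a patience-based stopping criterion:

```python
import copy

def train_with_early_stopping(model, train_step, validation_loss,
                              patience=10, eval_every=1):
    """Stop when validation loss has not improved for `patience` evaluations;
    return the parameters from the best validation point."""
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    steps_since_improvement = 0
    step = 0
    while steps_since_improvement < patience:
        for _ in range(eval_every):       # evaluate the validation loss periodically
            train_step(model)
            step += 1
        loss = validation_loss(model)
        if loss < best_loss:
            best_loss = loss
            best_model = copy.deepcopy(model)   # periodically save the best model
            steps_since_improvement = 0
        else:
            steps_since_improvement += 1
    return best_model, best_loss, step
```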
Re-use the validation set
Once early stopping has determined the best number of training steps, the validation data can be folded back into training: either retrain from scratch on all of the data for that many steps, or continue training the saved model on all of the data. The second strategy is cheaper but less well-behaved.
(Goodfellow 2016)
Early stopping as a regularizer
• ε (learning rate) and τ (number of training steps) limit the volume of parameter space reachable from θ₀ (the initial parameters)
• Early stopping is equivalent to L² regularization in the case of:
  q a simple linear model
  q with a quadratic error function
  q and simple gradient descent
• Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*)
  q ∇_w Ĵ(w) = H (w − w*)
• w(τ) = w(τ−1) − ε ∇_w Ĵ(w(τ−1)) = w(τ−1) − ε H (w(τ−1) − w*)
  q w(τ) = (I − εH) w(τ−1) + εH w*
  q w(τ) − w* = (I − εH) w(τ−1) + (εH − I) w*
  q w(τ) − w* = (I − εH) (w(τ−1) − w*)

(Goodfellow 2016)
The number of steps τ corresponds to some value of the weight decay coefficient α
• w(τ) − w* = (I − εH) (w(τ−1) − w*)
  q H = Q Λ Qᵀ
  q w(τ) − w* = Q (I − εΛ) Qᵀ (w(τ−1) − w*)
  q Qᵀ (w(τ) − w*) = (I − εΛ) Qᵀ (w(τ−1) − w*)
    o if ε is small enough that |1 − ελᵢ| < 1, every step brings w closer to w*
• Assume we start with w(0) = 0:
  q Qᵀ w(1) = (I − (I − εΛ)) Qᵀ w*
  q Qᵀ w(2) = (I − (I − εΛ)²) Qᵀ w*
  q Qᵀ w(τ) = (I − (I − εΛ)^τ) Qᵀ w*
• L² regularization:
  q Qᵀ w̃ = (Λ + αI)⁻¹ Λ Qᵀ w*
  q Qᵀ w̃ = (I − (Λ + αI)⁻¹ α) Qᵀ w*        since λᵢ/(λᵢ + α) = 1 − α/(λᵢ + α)
• Compare:
  q (I − (Λ + αI)⁻¹ α)  and  (I − (I − εΛ)^τ)
  q the two coincide when (Λ + αI)⁻¹ α = (I − εΛ)^τ, i.e. 1 − (1 − ελᵢ)^τ = λᵢ/(λᵢ + α)
(Goodfellow 2016)
Early stopping advantage
• α/(λᵢ + α) = (1 − ελᵢ)^τ ⇒ τ log(1 − ελᵢ) = log(α/(λᵢ + α)) = −log(1 + λᵢ/α)
• Assume log(1 + x) ≈ x for small enough x
  q assume λᵢ/α ≪ 1 and ελᵢ ≪ 1
• −τελᵢ ≈ −λᵢ/α ⇒ α ≈ 1/(τε)
  q the number of training iterations τ plays a role inversely proportional to the L² regularization parameter,
  q and the inverse of τε plays the role of the weight decay coefficient.
• Early stopping advantage over weight decay:
  q early stopping automatically determines the correct amount of regularization,
  q while weight decay requires many training experiments with different values of its hyperparameter.

(Goodfellow 2016)
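A small NumPy check of the α ≈ 1/(τε) correspondence on a quadratic objective with a diagonal Hessian (all values illustrative; the match is only approximate and requires ελᵢ ≪ 1 and λᵢ/α ≪ 1):

```python
import numpy as np

lam = np.array([0.05, 0.1, 0.2])     # eigenvalues of H (small relative to alpha)
w_star = np.array([1.0, -2.0, 0.5])  # unregularized minimizer
eps, tau = 0.01, 50
alpha = 1.0 / (tau * eps)            # alpha = 2.0 here

# tau gradient-descent steps on J_hat starting from w = 0 (diagonal H, so componentwise)
w = np.zeros_like(w_star)
for _ in range(tau):
    w = w - eps * lam * (w - w_star)

# L2-regularized solution: w_i = lambda_i / (lambda_i + alpha) * w*_i
w_l2 = lam / (lam + alpha) * w_star

print(w)     # roughly [ 0.025, -0.098, 0.048]
print(w_l2)  # roughly [ 0.024, -0.095, 0.045]  -- close when eps*lam and lam/alpha are small
```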
Early Stopping and Weight
Decay

Figure 7.4
(Goodfellow 2016)
Parameter tying
• Formally, we have model A with parameters w⁽ᴬ⁾ and model B with parameters w⁽ᴮ⁾
• The two models map the input to two different, but related outputs:
  q ŷ⁽ᴬ⁾ = f(w⁽ᴬ⁾, x)
  q ŷ⁽ᴮ⁾ = g(w⁽ᴮ⁾, x)
  q ∀i, wᵢ⁽ᴬ⁾ should be close to wᵢ⁽ᴮ⁾
• Regularization: Ω(w⁽ᴬ⁾, w⁽ᴮ⁾) = ‖w⁽ᴬ⁾ − w⁽ᴮ⁾‖₂²

(Goodfellow 2016)
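A sketch of how the tying penalty above could enter a joint training objective (function and variable names are illustrative):

```python
import numpy as np

def tying_penalty(w_a, w_b, lam=0.1):
    """Omega(w_A, w_B) = ||w_A - w_B||_2^2, encouraging the two models'
    parameters to stay close without forcing them to be equal."""
    return lam * np.sum((w_a - w_b) ** 2)

# total_loss = task_loss_A(w_a) + task_loss_B(w_b) + tying_penalty(w_a, w_b)
```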
Parameter sharing (e.g. CNN)
• Force sets of parameters to be equal.
• Advantage:
q only a subset of the parameters (the unique set)
needs to be stored in memory.
• Natural images have many statistical properties that are
invariant to translation.
q a photo of a cat remains a photo of a cat if it is
translated one pixel to the right
q Parameter sharing has allowed CNNs to dramatically
lower the number of unique model parameters

(Goodfellow 2016)
Sparse Representations
• Sparse parameters: a penalty on the weights (e.g. L¹) drives many weights to zero.
• Sparse representations: a penalty Ω(h) on the activations drives many hidden-unit values to zero; the weights themselves need not be sparse.
(Goodfellow 2016)
Bagging
• Bagging (short for bootstrap aggregating) is a
technique for reducing generalization error by
combining several models
q train several different models separately
q the models vote on the output for test examples
• Bagging is an example of model averaging.
q The general term is Ensemble methods.
• The reason that model averaging works is that
different models will usually not make all the same
errors on the test set.

(Goodfellow 2016)
Bagging example
• Consider for example a set of k regression models.
• Suppose that each model makes an error εᵢ on each example, with the errors drawn from a zero-mean multivariate normal distribution
  q with variances E[εᵢ²] = v
  q and covariances E[εᵢ εⱼ] = c
• Then the error made by the average prediction of all the ensemble models is (1/k) Σᵢ εᵢ
• The expected squared error of the ensemble predictor is
  E[ ((1/k) Σᵢ εᵢ)² ] = v/k + (k − 1)c/k
  q c = v ⇒ no gain, the expected error remains v
  q c = 0 ⇒ maximum gain, the expected error is v/k

(Goodfellow 2016)
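A quick Monte-Carlo check of the two limiting cases above (the values of v, c and k are illustrative):

```python
import numpy as np

def ensemble_mse(k, v, c, n_trials=200_000, seed=0):
    """Average squared error of the mean of k correlated zero-mean errors."""
    rng = np.random.default_rng(seed)
    cov = np.full((k, k), c) + np.eye(k) * (v - c)
    errors = rng.multivariate_normal(np.zeros(k), cov, size=n_trials)
    return np.mean(errors.mean(axis=1) ** 2)

v = 1.0
print(ensemble_mse(k=5, v=v, c=v))    # ~1.0 : perfectly correlated errors, no gain
print(ensemble_mse(k=5, v=v, c=0.0))  # ~0.2 : independent errors, v / k
```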
Ensemble methods vs. bagging
• Different ensemble methods construct the ensemble of models in
different ways.
• Bagging is a method that allows the same kind of model, training
algorithm and objective function to be reused several times
• Bagging involves constructing 𝑘 different datasets.
q Each dataset has the same number of examples as the original
dataset,
q but each dataset is constructed by sampling with replacement
from the original dataset.
o with high probability, each dataset is missing some of the
examples from the original dataset and also contains several
duplicate examples
o on average around 2/3 of the examples from the original
dataset are found in the resulting training set, if it has the
same size as the original

(Goodfellow 2016)
Bagging
Figure (bagging example): on one resampled dataset the detector learns that a loop on top of the digit corresponds to an 8; on another resampled dataset it learns that a loop on the bottom of the digit corresponds to an 8.
(Goodfellow 2016)
Why 2/3?
• N: number of items in the original dataset
• k: number of unique items among the drawn items
• A: number of items drawn with replacement
• P(k) = [ N!/(N − k)! · S(A, k) ] / N^A
  q N!/(N − k)! : all ordered ways to pick which k of the N items appear
  q S(A, k) : a Stirling number of the second kind — all ways to distribute the A draws among k subsets such that no subset is left empty
  q N^A : all possible ways to draw A items
(Goodfellow 2016)
Expected number of duplicates
• The indicator dᵢ corresponds to original item i, taking the value one if i is present among the A draws and zero if not
• P(dᵢ = 0) = (1 − 1/N)^A
• E[dᵢ] = 1 − (1 − 1/N)^A
• E[Σᵢ dᵢ] = Σᵢ E[dᵢ] = N E[dᵢ] = N (1 − (1 − 1/N)^A)
• A = N ⇒ E[k] = N (1 − (1 − 1/N)^N) → N (1 − e⁻¹) as N grows
• E[k] ≈ 0.632 N
(Goodfellow 2016)
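A one-line simulation confirming the ≈ 0.632 fraction of unique examples when drawing N items with replacement (the value of N is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
draws = rng.integers(0, N, size=N)        # sample N items with replacement
unique_fraction = len(np.unique(draws)) / N
print(unique_fraction, 1 - np.exp(-1))    # both ~0.632
```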
More about bagging
• Neural networks reach a wide enough variety of
solution points that they can often benefit from
model averaging
• Model averaging is an extremely powerful and
reliable method for reducing generalization error.
q Its use is usually discouraged when
benchmarking algorithms for scientific papers
• Machine learning contests are usually won by
methods using model averaging over dozens of
models.

(Goodfellow 2016)
Dropout
• Dropout provides an inexpensive approximation to training
and evaluating a bagged ensemble of exponentially many
neural networks.
q removing non-output units from an underlying base
network
o by multiplying its output value by zero
• Each time we load an example into a minibatch, we
randomly sample a different binary mask to apply to all of
the input and hidden units in the network.
q The probability of sampling a mask value of one (causing
a unit to be included) is a hyperparameter fixed before
training begins.
o Typically, an input unit is included with probability 0.8
and a hidden unit is included with probability 0.5

(Goodfellow 2016)
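A minimal sketch of dropout mask sampling during training for a one-hidden-layer network, using the inclusion probabilities mentioned above (0.8 for inputs, 0.5 for hidden units); the names and the ReLU nonlinearity are illustrative, and test-time prediction would use the weight-scaling rule discussed on the later slides:

```python
import numpy as np

def dropout_mask(shape, keep_prob, rng):
    """Binary mask: each unit is kept (mask = 1) with probability keep_prob."""
    return (rng.random(shape) < keep_prob).astype(np.float64)

def forward_train(x, W1, b1, W2, b2, rng, p_in=0.8, p_hidden=0.5):
    x = x * dropout_mask(x.shape, p_in, rng)        # randomly drop input units
    h = np.maximum(0.0, x @ W1 + b1)                # ReLU hidden layer
    h = h * dropout_mask(h.shape, p_hidden, rng)    # randomly drop hidden units
    return h @ W2 + b2
```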
Dropout
Figure 7.6: the sub-networks obtained by removing non-output units from a base network.
In networks with wider layers, the probability of dropping all possible paths from inputs to outputs becomes smaller.
(Goodfellow 2016)
Dropout vs. bagging
• More formally, suppose that a mask vector 𝝁 specifies which units
to include, and 𝐽(𝜽, 𝝁) defines the cost of the model defined by
parameters 𝜽 and mask 𝝁.
q Then dropout training consists in minimizing 𝔼𝝁 𝐽 𝜽, 𝝁 .
q The expectation contains exponentially many terms (2ⁿ, where n is the number of units that may be dropped)
• Dropout training is not quite the same as bagging training.
q In the case of bagging, the models are all independent.
q In the case of dropout, the models share parameters
q In bagging, each model is trained to convergence on its
respective training set
q In dropout, a tiny fraction of the possible sub-networks are each
trained for a single step
q In both, the training set encountered by each sub-network is a
subset of the original training set sampled with replacement

(Goodfellow 2016)
Computational graph of dropout
• The entries of μ are binary and are sampled independently from each other,
  q and μ is not a function of the current value of the model parameters or the input example
(Goodfellow 2016)
Inference
• To make a prediction, a bagged ensemble must accumulate votes from all of its members.
  q We refer to this process as inference
• In bagging, the prediction of the ensemble is (1/k) Σᵢ p⁽ⁱ⁾(y | x)
• In dropout, the arithmetic mean is Σ_μ p(μ) p(y | x, μ)
• The geometric mean is
  p̃_ensemble(y | x) = ( ∏_μ p(y | x, μ) )^(1/2ᵈ)
  where d is the number of units that may be dropped
• To guarantee that the result is a probability distribution,
  q we impose that none of the sub-models assigns probability 0 to any event,
  q and we renormalize the resulting distribution.
(Goodfellow 2016)
Weight scaling inference rule
• Evaluate the trained model with all units,
  q but with the weights going out of unit i multiplied by the probability of including unit i (e.g. ½)
  q this corresponds to predicting the geometric mean of the ensemble!
• Consider a softmax regression classifier with n input variables represented by the vector v:
  P(y = yᵢ | v) = softmax(Wᵀv + b)ᵢ
• To index into the family of sub-models:
  P(y = yᵢ | v; d) = softmax(Wᵀ(v ⊙ d) + b)ᵢ
• p̃_ensemble(y = yᵢ | v) = ( ∏_{d ∈ {0,1}ⁿ} softmax(Wᵀ(v ⊙ d) + b)ᵢ )^(1/2ⁿ)
• Ignoring the normalization:
  p̃_ensemble(y = yᵢ | v) ∝ exp( (1/2ⁿ) Σ_{d ∈ {0,1}ⁿ} (Wᵀ(v ⊙ d) + b)ᵢ )
  = exp( (2ⁿ⁻¹/2ⁿ) (Wᵀv)ᵢ + bᵢ ) = exp( ½ (Wᵀv)ᵢ + bᵢ )
(Goodfellow 2016)
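A small NumPy check of the weight-scaling rule above: for a softmax regression model, the renormalized geometric mean over all 2ⁿ input masks coincides with a single forward pass using halved weights (n is kept small so the ensemble can be enumerated exactly; this is a verification sketch, not how dropout is implemented in practice):

```python
import numpy as np
from itertools import product

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n, k = 6, 4                          # n inputs, k classes: 2^n = 64 sub-models
W = rng.normal(size=(n, k))
b = rng.normal(size=k)
v = rng.normal(size=n)

# Geometric mean over all masks d in {0,1}^n, renormalized at the end
log_probs = np.zeros(k)
for d in product([0.0, 1.0], repeat=n):
    log_probs += np.log(softmax(W.T @ (v * np.array(d)) + b))
geo_mean = np.exp(log_probs / 2 ** n)
geo_mean /= geo_mean.sum()

# Weight scaling rule: one forward pass with the weights multiplied by 1/2
scaled = softmax(0.5 * (W.T @ v) + b)

print(np.allclose(geo_mean, scaled))  # True
```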
Another perspective of
dropout
• (1) Dropout is bagging with parameter sharing
• (2) Information erasing: Each hidden unit must be able to perform
well regardless of which other hidden units are in the model
q Dropout thus regularizes each hidden unit to be not merely a
good feature but a feature that is good in many contexts.
q For example, if the model learns a hidden unit ℎ2 that detects a
face by finding the nose,
q then dropping ℎ2 corresponds to erasing the information that
there is a nose in the image.
q The model must learn another ℎ2 ,
o either that redundantly encodes the presence of a nose,
o or that detects the face by another feature, such as the
mouth

(Goodfellow 2016)
Adversarial examples
• Search for an input 𝒙′ near a data point 𝒙 such that the
model output is very different at 𝒙′
• In many cases, 𝒙’ can be so similar to 𝒙 that a human
observer cannot tell the difference between the original
example and the adversarial example,
q but the network can make highly different predictions.
• Adversarial training
q training on adversarially perturbed examples from the
training set
• Adversarial examples are interesting in the context of
regularization
q because one can reduce the error rate on the original
i.i.d. test set via adversarial training

(Goodfellow 2016)
Adversarial Examples
Figure 7.8
Training on adversarial examples is mostly intended to improve security, but can sometimes provide generic regularization.

(Goodfellow 2016)
Adversarial training
• The value of a linear function can change very rapidly if it has numerous inputs.
  q If we change each input by ε, then a linear function with weights w can change by as much as ε ‖w‖₁, which can be a very large amount if w is high-dimensional.
• Adversarial training discourages this highly sensitive locally linear
behavior by encouraging the network to be locally constant in the
neighborhood of the training data.
• This can be seen as a way of explicitly introducing a local
constancy prior into supervised neural nets.
q The classifier may then be trained to assign the same label to 𝒙
and 𝒙’.
q The assumption motivating this approach is that different
classes usually lie on disconnected manifolds, and a small
perturbation should not be able to jump from one class manifold
to another class manifold.

(Goodfellow 2016)
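A sketch of constructing an adversarial perturbation with the gradient sign (the fast gradient sign method is one common construction; the slide itself does not prescribe it, so treat this as an assumption). grad_loss_wrt_input is an illustrative name for a function returning ∇_x J(θ, x, y):

```python
import numpy as np

def adversarial_example(x, grad_loss_wrt_input, eps=0.01):
    """x' = x + eps * sign(grad_x J): a small perturbation chosen to increase
    the loss, typically imperceptible to a human for small eps."""
    return x + eps * np.sign(grad_loss_wrt_input(x))

# Adversarial training (sketch): mix x' into the training batch and train the
# model to assign x' the same label as x.
```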
Conclusion
• This chapter has described most of the general
strategies used to regularize neural networks.
• Regularization is a central theme of machine
learning

Our next topic is: optimization

(Goodfellow 2016)
