07_regularization
Regularization for Deep Learning
Lecture slides for Chapter 7 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-09-27
Adapted by m.n. for CMPS 392
Definition
• “Regularization is any modification we make to a
learning algorithm that is intended to reduce its
generalization error but not its training error.”
(Goodfellow 2016)
Regularization strategies
• Constraints: adding restrictions on the parameter
values.
• Soft constraints: Adding extra terms in the objective
function:
q Encode prior knowledge.
q Generic preference for a simpler model
• Ensemble methods:
q Combine multiple hypotheses to explain the
training data
(Goodfellow 2016)
Parameter norm penalties
• Regularized objective: J̃(θ; X, y) = J(θ; X, y) + α Ω(θ), with α ∈ [0, ∞)
q α = 0: no regularization
q larger α: more regularization
• Ω: a parameter norm penalty, e.g.
q L1
q L2
(Goodfellow 2016)
L² parameter regularization
aka. ridge regression
• Ω(θ) = (1/2) ‖w‖₂² = (1/2) wᵀw
• ∇_w [(1/2) wᵀw] = w
• Update step: w ← w − ε α w − ε ∇_w J
q w ← w (1 − ε α) − ε ∇_w J
q Weights are shrunk by a multiplicative factor before the usual gradient step.
• Let w* = argmin_w J
• Let w̃ = argmin_w J̃
• Approximating J in the neighborhood of w*:
q Ĵ(w) = J(w*) + (1/2) (w − w*)ᵀ H (w − w*)
q No first-order term since w* is the minimum (∇J(w*) = 0)
(Goodfellow 2016)
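A minimal sketch of the update rule above (not from the slides; the least-squares cost and all values are assumed for illustration): each step first shrinks w by the factor (1 − εα), then applies the ordinary gradient step.

import numpy as np

def sgd_step_weight_decay(w, grad_J, eps=0.01, alpha=0.1):
    # w <- w(1 - eps*alpha) - eps * grad_J(w)
    return w * (1.0 - eps * alpha) - eps * grad_J(w)

# Assumed example: J(w) = 0.5 * ||Xw - y||^2 on synthetic data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
grad_J = lambda w: X.T @ (X @ w - y)

w = np.zeros(3)
for _ in range(200):
    w = sgd_step_weight_decay(w, grad_J)
print(w)  # shrunk relative to the unregularized least-squares solution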
" (regularized solution) compares to
How 𝒘
unregularized solution 𝒘*?
• What is the gradient of /𝐽 𝒘 at 𝒘
2?
• 𝛻 /𝐽 𝒘
2 =𝑯 𝒘 2 − 𝒘∗
• 𝛻 8𝐽 𝒘 2 − 𝒘∗ + 𝛼 𝒘
2 =𝑯 𝒘 2 =𝟎
q 2 = 𝑯𝒘∗
𝑯 + 𝛼𝑰 𝒘
q 2 = 𝑯 + 𝛼𝑰 "𝟏 𝑯𝒘∗
𝒘
• 𝑯 is real and symmetric
q 𝑯 = 𝑸𝚲𝐐𝐓
𝐓 "𝟏 𝐓 ∗ 𝐓 𝑻 "𝟏
• ! = 𝑸𝚲𝐐 + 𝛼𝑰
𝒘 𝑸𝚲𝐐 𝐰 = 𝑸𝚲𝐐 + 𝑸𝛼𝑰𝑸 𝑸𝚲𝐐𝐓 𝐰 ∗
𝑻 "𝟏
• ! = 𝑸 𝚲 + 𝛼𝑰
𝒘 𝑸 𝑸𝚲𝐐𝐓 𝐰 ∗ = 𝑸 𝚲 + 𝛼𝑰 "𝟏 𝑸𝑻 𝑸𝚲𝐐𝐓 𝐰 ∗
𝒘
! = 𝑸 𝚲 + 𝛼𝑰 "𝟏 𝚲𝐐𝐓 𝐰 ∗
(Goodfellow 2016)
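A quick numerical check of the derivation above (toy Hessian and w* are assumed): the direct formula (H + αI)⁻¹ H w* and the eigendecomposition form Q (Λ + αI)⁻¹ Λ Qᵀ w* give the same w̃.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + np.eye(4)                    # symmetric positive definite "Hessian"
w_star = rng.normal(size=4)                # assumed unregularized minimum
alpha = 0.5

w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

lam, Q = np.linalg.eigh(H)                 # H = Q diag(lam) Q^T
w_tilde_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))

print(np.allclose(w_tilde, w_tilde_eig))   # True: components scaled by lam/(lam+alpha)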
Interpretation
• w̃ = Q (Λ + αI)⁻¹ Λ Qᵀ w*
• The component of w* aligned with the i-th eigenvector of H is rescaled by λᵢ / (λᵢ + α):
q λᵢ ≫ α: regularization has little effect along this direction
q λᵢ ≪ α: this component is shrunk to nearly zero
(Goodfellow 2016)
Weight Decay
• Ĵ(w) = J(w*) + (1/2) (w − w*)ᵀ H (w − w*)
• Figure 7.1: contours of Ĵ and of the L² penalty, showing the unregularized solution w* and the regularized solution w̃.
q Along an eigenvector of H with a small eigenvalue, the regularization effect is large (that component is pulled toward 0).
q Along an eigenvector of H with a large eigenvalue, the regularization effect is small (that component stays close to w*).
(Goodfellow 2016)
Special case: Linear Regression
& $ 𝑻
• Cost function: 𝑿𝒘 − 𝒚 𝑿𝒘 − 𝒚 + 𝛼𝒘 𝒘
!
• Normal equations
q 𝑿𝑻 𝑿𝒘 − 𝑿𝑻 𝒚 + 𝛼𝒘 = 𝟎 ⇒ (𝑿𝑻 𝑿 + 𝛼𝑰)𝒘 = 𝑿𝑻 𝒚
1𝟏 Covariance
q 𝒘 = 𝑿𝑻 𝑿 + 𝛼𝑰 𝑿𝑻 𝒚 feature-output
Proportional to the
• Basically, we are adding 𝛼 to the diag. covariance matrix
q The diag. elements correspond to the variance of
each feature
• We perceive the data as having higher variance
q A feature having low covariance with output got
shrunk even more due to this added variance
(Goodfellow 2016)
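A small sketch of the closed-form ridge solution above (synthetic data assumed): w = (XᵀX + αI)⁻¹ Xᵀy, with larger α shrinking the coefficients toward zero.

import numpy as np

def ridge_fit(X, y, alpha):
    # Solve (X^T X + alpha*I) w = X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

print(ridge_fit(X, y, alpha=0.0))    # close to ordinary least squares
print(ridge_fit(X, y, alpha=50.0))   # coefficients visibly shrunk toward zero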
L¹ regularization
• Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|
• J̃(w; X, y) = α ‖w‖₁ + J(w; X, y)
q ∇_w J̃(w; X, y) = α sign(w) + ∇_w J(w; X, y)
• Ĵ(w) = J(w*) + (1/2) (w − w*)ᵀ H (w − w*)
q ∇Ĵ(w) = H (w − w*)
• Assume that H = diag(H₁,₁, …, Hₙ,ₙ) with Hᵢ,ᵢ > 0
q e.g. linear regression after PCA
• J̃(w) ≈ J(w*) + Σᵢ [ (1/2) Hᵢ,ᵢ (wᵢ − wᵢ*)² + α |wᵢ| ]
• Solution: wᵢ = sign(wᵢ*) max( |wᵢ*| − α / Hᵢ,ᵢ , 0 )
(Goodfellow 2016)
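A short sketch of the closed-form L¹ solution above for a diagonal Hessian (the values are made up): it is the familiar soft-thresholding operation, which zeroes out small coefficients.

import numpy as np

def l1_solution(w_star, H_diag, alpha):
    # w_i = sign(w*_i) * max(|w*_i| - alpha / H_ii, 0)
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([0.8, -0.3, 0.05, -2.0])   # assumed unregularized optimum
H_diag = np.ones(4)                          # assumed diagonal Hessian entries
print(l1_solution(w_star, H_diag, alpha=0.5))
# [ 0.3 -0.   0.  -1.5]  -> the two small coefficients are driven exactly to zero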
Interpretation
wᵢ = sign(wᵢ*) max( |wᵢ*| − α / Hᵢ,ᵢ , 0 )
• If wᵢ* > 0:
q wᵢ* > α / Hᵢ,ᵢ ⇒ wᵢ is shifted towards 0 by α / Hᵢ,ᵢ
q wᵢ* ≤ α / Hᵢ,ᵢ ⇒ wᵢ = 0
• If wᵢ* < 0:
q −wᵢ* > α / Hᵢ,ᵢ ⇒ wᵢ = wᵢ* + α / Hᵢ,ᵢ
o wᵢ is shifted towards 0 by α / Hᵢ,ᵢ
q −wᵢ* ≤ α / Hᵢ,ᵢ ⇒ wᵢ = 0
(Goodfellow 2016)
L¹ regularization and sparsity
• The sparsity property induced by L1 regularization can be
used as a feature selection mechanism
q LASSO regression (least absolute shrinkage and
selection operator)
• Equivalent to MAP Bayesian estimation with Laplace prior
q the prior is an isotropic Laplace distribution over w ∈ ℝⁿ:
o Laplace(wᵢ; 0, 1/α) = (α/2) exp(−α |wᵢ|)
o log Laplace(wᵢ; 0, 1/α) = log α − log 2 − α |wᵢ|
(Goodfellow 2016)
Norm Penalties
• MAP: Maximum A-Posteriori
• L1:
q Encourages sparsity,
q equivalent to MAP Bayesian estimation with
Laplace prior
• Squared L2:
q Encourages small weights,
q equivalent to MAP Bayesian estimation with
Gaussian prior
(Goodfellow 2016)
Explicit constraints
• We want to constrain Ω(𝜃) to be less than some
constant 𝑘
q construct a generalized Lagrange function:
o ℒ(θ, α) = J(θ) + α (Ω(θ) − k)
o θ* = argmin_θ max_{α ≥ 0} ℒ(θ, α)
(Goodfellow 2016)
Projection
• Sometimes we may wish to use explicit constraints
rather than penalties.
q we can modify algorithms such as stochastic gradient
descent to take a step downhill on 𝐽 𝜃 and then
project 𝜃 back to the nearest point that satisfies
Ω 𝜃 < 𝑘.
• How to project?
q Project into the unit L2 ball: closed form, θ ← θ / max(1, ‖θ‖₂)
q Project into unit L1 ball:
o No closed-form solution
o Numerical solution
(Goodfellow 2016)
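A minimal sketch of projected gradient descent under the explicit constraint ‖θ‖₂ ≤ k (the objective and its gradient are assumed for illustration); the L² projection has the closed form used below.

import numpy as np

def project_l2_ball(theta, k=1.0):
    # Closed-form projection onto {theta : ||theta||_2 <= k}
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

def projected_gd_step(theta, grad_J, eps=0.1, k=1.0):
    # Step downhill on J, then project theta back into the constraint set
    return project_l2_ball(theta - eps * grad_J(theta), k)

grad_J = lambda th: 2.0 * (th - np.array([3.0, 4.0]))   # assumed quadratic objective
theta = np.zeros(2)
for _ in range(100):
    theta = projected_gd_step(theta, grad_J)
print(theta, np.linalg.norm(theta))   # ends up on the boundary of the unit L2 ball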
Dataset Augmentation
• Best way to regularize is to train with more data
q create fake data and add it to the training set.
q We can generate new (𝒙, 𝑦) pairs easily just by
transforming the 𝒙 inputs in our training set.
q particularly effective for object recognition
o translating the training images a few pixels in each
direction
o rotating the image or scaling
• Some inappropriate transformations:
q horizontal flips: ‘b’ and ‘d’,
q 180° rotations: ‘6’ and ‘9’.
(Goodfellow 2016)
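A small sketch of the two transformations mentioned above, assuming a grayscale image stored as a 2-D NumPy array; np.roll is used as a simple stand-in for translation (it wraps around at the borders, unlike a true shift).

import numpy as np

def random_translate(img, max_shift=2, rng=np.random.default_rng()):
    # Shift the image a few pixels in each direction (np.roll wraps at the borders)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def horizontal_flip(img):
    return img[:, ::-1]

img = np.arange(25, dtype=float).reshape(5, 5)   # stand-in for a training image
augmented = [random_translate(img), horizontal_flip(img)]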
Dataset Augmentation
• Example transformations: affine distortion, noise, elastic deformation, horizontal flip, random translation, hue shift.
(Goodfellow 2016)
Noise Robustness
• Noise with infinitesimal variance can be added:
q At the input
q At the hidden layers
q At the weights:
(Goodfellow 2016)
Injecting noise at the weights
• For η small, with weight noise ε_W ~ 𝒩(0, ηI):
q ŷ_{ε_W}(x) = ŷ_{W+ε_W}(x) ≈ ŷ_W(x) + ε_Wᵀ ∇_W ŷ_W(x)
q E_{p(x,y),ε_W}[ ŷ_{ε_W}² ] = E_{p(x,y)}[ ŷ² ] + E_{p(x,y),ε_W}[ (ε_Wᵀ ∇_W ŷ)² ] + 0
o (the cross term vanishes because E[ε_W] = 0)
q E_{p(x,y),ε_W}[ ŷ_{ε_W}² ] = E_{p(x,y)}[ ŷ² ] + η E_{p(x,y)}[ ‖∇_W ŷ‖² ]
(Goodfellow 2016)
Special case: linear regression
• J̃_W = J + η E_{p(x,y)}[ ‖∇_w ŷ(x)‖² ]
• ŷ = wᵀx + b
• E_{p(x,y)}[ ‖∇_w ŷ(x)‖² ] = E_{p(x)}[ ‖x‖² ]
• which is not a function of the parameters and therefore does not contribute to the gradient of the cost function with respect to w:
q No regularization effect!
(Goodfellow 2016)
Injecting noise at the output
targets
• Most datasets have some amount of mistakes in the y labels.
• It can be harmful to maximize log 𝑝(𝑦 | 𝑥) when 𝑦 is a mistake.
• One way to prevent this is to explicitly model the noise on the labels.
q For example, we can assume that for some small constant 𝜖, the
training set label 𝑦 is correct with probability 1 − 𝜖,
q and otherwise any of the other possible labels might be correct.
• This assumption is easy to incorporate into the cost function analytically,
q rather than by explicitly drawing noise samples.
q For example, label smoothing regularizes a model based on a softmax
with 𝑘 output values
o by replacing the hard 0 targets with ε/(k−1)
o and the hard 1 targets with 1 − ε
• Label smoothing has the advantage of preventing the pursuit of hard
probabilities without discouraging correct classification.
(Goodfellow 2016)
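A minimal sketch of label smoothing as described above, assuming integer class labels: hard 0/1 targets are replaced by ε/(k−1) and 1 − ε.

import numpy as np

def smooth_labels(y, k, eps=0.1):
    # y: integer class indices; returns soft targets of shape (len(y), k)
    targets = np.full((len(y), k), eps / (k - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

print(smooth_labels(np.array([0, 2]), k=3, eps=0.1))
# [[0.9  0.05 0.05]
#  [0.05 0.05 0.9 ]]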
Multi-Task Learning
• Figure 7.2: lower layers share parameters across tasks, task-specific parameters sit on top of the shared representation, and the shared representation can also serve an unsupervised learning context.
(Goodfellow 2016)
Early stopping algorithm
(Goodfellow 2016)
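A sketch of the early stopping meta-algorithm (Algorithm 7.1 in the book); train_step, validation_error, the evaluation interval n and the patience p are assumed placeholders.

def early_stopping_train(theta, train_step, validation_error, n=10, p=5):
    # Keep the parameters with the lowest validation error; stop after p
    # consecutive evaluations without improvement.
    best_theta, best_err, best_step = theta, validation_error(theta), 0
    step, fails = 0, 0
    while fails < p:
        for _ in range(n):                  # train for n steps between evaluations
            theta = train_step(theta)
        step += n
        err = validation_error(theta)
        if err < best_err:
            best_theta, best_err, best_step, fails = theta, err, step, 0
        else:
            fails += 1
    return best_theta, best_step, best_err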
Re-use the validation set
• After early stopping has determined the best number of training steps, the validation data can be folded back into the training set:
q Strategy 1: retrain from scratch on all of the data for the same number of steps.
q Strategy 2: keep the parameters found by early stopping and continue training on all of the data until the validation-set loss falls below the training-set objective reached at the early stopping point.
o The second strategy avoids the cost of retraining but is less well-behaved (the target may never be reached).
(Goodfellow 2016)
Early stopping as a
regularizer
• ε (learning rate) and τ (number of training steps) limit the volume of parameter space reachable from θ₀ (the initial parameters)
• Early stopping is equivalent to L2 regularization in the case of:
q a simple linear model
q with a quadratic error function
q and simple gradient descent
• Ĵ(w) = J(w*) + (1/2) (w − w*)ᵀ H (w − w*)
q ∇_w Ĵ(w) = H (w − w*)
• w^(τ) = w^(τ−1) − ε ∇_w Ĵ(w^(τ−1)) = w^(τ−1) − ε H (w^(τ−1) − w*)
q w^(τ) = (I − εH) w^(τ−1) + εH w*
q w^(τ) − w* = (I − εH) w^(τ−1) + (εH − I) w*
q w^(τ) − w* = (I − εH) (w^(τ−1) − w*)
(Goodfellow 2016)
The number of steps 𝜏 corresponds to some value of the weight
decay coefficient 𝛼
• w^(τ) − w* = (I − εH) (w^(τ−1) − w*)
q H = Q Λ Qᵀ
q w^(τ) − w* = Q (I − εΛ) Qᵀ (w^(τ−1) − w*)
q Qᵀ (w^(τ) − w*) = (I − εΛ) Qᵀ (w^(τ−1) − w*)
o if ε is small enough that |1 − ελᵢ| < 1, every step brings w closer to w*
• Assume we start with w^(0) = 0:
q Qᵀ w^(1) = [I − (I − εΛ)] Qᵀ w*
q Qᵀ w^(2) = [I − (I − εΛ)²] Qᵀ w*
q Qᵀ w^(τ) = [I − (I − εΛ)^τ] Qᵀ w*
• L2 regularization:
q Qᵀ w̃ = (Λ + αI)⁻¹ Λ Qᵀ w*
q Qᵀ w̃ = [I − (Λ + αI)⁻¹ α] Qᵀ w*        (since λᵢ/(λᵢ + α) = 1 − α/(λᵢ + α))
• Compare:
q [I − (Λ + αI)⁻¹ α]  and  [I − (I − εΛ)^τ]
q the two coincide when (Λ + αI)⁻¹ α = (I − εΛ)^τ, i.e. 1 − λᵢ/(λᵢ + α) = (1 − ελᵢ)^τ
(Goodfellow 2016)
Early stopping advantage
• (1 − ελᵢ)^τ = α/(λᵢ + α) ⇒ τ log(1 − ελᵢ) = log α − log(λᵢ + α)
• If ελᵢ ≪ 1 and λᵢ/α ≪ 1, this gives τ ≈ 1/(εα), i.e. α ≈ 1/(τε):
q the number of training steps plays the role of the inverse of the weight decay coefficient.
• Advantage: early stopping determines the right amount of regularization automatically, by monitoring validation error during a single run, whereas weight decay requires separate runs for each candidate value of α.
(Goodfellow 2016)
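A quick numerical illustration of the correspondence above (a 1-D quadratic with made-up values): for a small eigenvalue λ, stopping after τ steps roughly matches weight decay with α = 1/(τε).

lam, w_star = 0.05, 2.0          # small eigenvalue of H, unregularized optimum
eps, tau = 0.01, 100             # learning rate, number of steps
alpha = 1.0 / (tau * eps)        # corresponding weight decay coefficient

w_early = (1 - (1 - eps * lam) ** tau) * w_star   # after tau gradient steps from 0
w_ridge = lam / (lam + alpha) * w_star            # L2-regularized solution

print(w_early, w_ridge)          # ~0.0976 vs ~0.0952: nearly the same shrinkage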
Early Stopping and Weight
Decay
Figure 7.4
(Goodfellow 2016)
Parameter tying
• Formally, we have model A with parameters w^(A) and model B with parameters w^(B)
• The two models map the input to two different, but related outputs:
q ŷ^(A) = f(w^(A), x)
q ŷ^(B) = g(w^(B), x)
q ∀i, w_i^(A) should be close to w_i^(B)
• Regularization:
q Ω(w^(A), w^(B)) = ‖w^(A) − w^(B)‖₂²
(Goodfellow 2016)
Parameter sharing (e.g. CNN)
• Force sets of parameters to be equal.
• Advantage:
q only a subset of the parameters (the unique set)
needs to be stored in memory.
• Natural images have many statistical properties that are
invariant to translation.
q a photo of a cat remains a photo of a cat if it is
translated one pixel to the right
q Parameter sharing has allowed CNNs to dramatically
lower the number of unique model parameters
(Goodfellow 2016)
Sparse Representations
• Sparse parameters: many entries of the weight matrix are zero (penalty placed on w, e.g. ‖w‖₁).
• Sparse representations: many entries of the representation h are zero (penalty placed on the activations, e.g. Ω(h) = α ‖h‖₁).
(Goodfellow 2016)
Bagging
• Bagging (short for bootstrap aggregating) is a
technique for reducing generalization error by
combining several models
q train several different models separately
q the models vote on the output for test examples
• Bagging is an example of model averaging.
q The general term is Ensemble methods.
• The reason that model averaging works is that
different models will usually not make all the same
errors on the test set.
(Goodfellow 2016)
Bagging example
• Consider for example a set of 𝑘 regression models.
• Suppose that each model makes an error εᵢ on each example, with the errors drawn from a zero-mean multivariate normal distribution
q with variances E[εᵢ²] = v
q and covariances E[εᵢ εⱼ] = c
• Then the error made by the average prediction of all the ensemble models is (1/k) Σᵢ εᵢ
• The expected squared error of the ensemble predictor is:
q E[ ((1/k) Σᵢ εᵢ)² ] = (1/k²) E[ Σᵢ ( εᵢ² + Σ_{j≠i} εᵢ εⱼ ) ] = v/k + (k−1)c/k
q if the errors are perfectly correlated (c = v): the error stays at v, averaging does not help
q if the errors are independent (c = 0): the expected squared error drops to v/k
(Goodfellow 2016)
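A Monte Carlo check of the formula above (v, c, k are assumed values): the empirical variance of the averaged error matches v/k + (k−1)c/k.

import numpy as np

k, v, c = 10, 1.0, 0.3
rng = np.random.default_rng(0)
cov = np.full((k, k), c) + (v - c) * np.eye(k)       # variances v, covariances c
errors = rng.multivariate_normal(np.zeros(k), cov, size=200_000)
ensemble_error = errors.mean(axis=1)                 # (1/k) * sum_i eps_i

print(ensemble_error.var())                          # ~0.37 empirically
print(v / k + (k - 1) * c / k)                       # 0.37 from the formula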
Ensemble methods vs. bagging
• Different ensemble methods construct the ensemble of models in
different ways.
• Bagging is a method that allows the same kind of model, training
algorithm and objective function to be reused several times
• Bagging involves constructing 𝑘 different datasets.
q Each dataset has the same number of examples as the original
dataset,
q but each dataset is constructed by sampling with replacement
from the original dataset.
o with high probability, each dataset is missing some of the
examples from the original dataset and also contains several
duplicate examples
o on average around 2/3 of the examples from the original
dataset are found in the resulting training set, if it has the
same size as the original
(Goodfellow 2016)
Bagging
• Figure 7.5: an 8-detector trained on two different bootstrap resamples learns two different rules:
q on the first resampled dataset, the detector learns that a loop on top of the digit corresponds to an 8;
q on the second, it learns that a loop on the bottom corresponds to an 8;
q each rule is brittle on its own, but averaging the two detectors is robust.
(Goodfellow 2016)
Expected number of duplicates
• In a bootstrap sample of n examples drawn with replacement from n originals, the probability that a given example never appears is (1 − 1/n)ⁿ ≈ 1/e ≈ 0.37.
• So on average about 63% of the original examples (≈ 2/3) appear in each resampled dataset, and the remaining slots are filled with duplicates.
(Goodfellow 2016)
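A quick check of the claim above, using a bootstrap resample of an assumed dataset of n indices.

import numpy as np

n = 10_000
rng = np.random.default_rng(0)
sample = rng.integers(0, n, size=n)              # bootstrap: sample with replacement
print(len(np.unique(sample)) / n)                # ~0.632 of the examples appear
print(1 - (1 - 1 / n) ** n, 1 - 1 / np.e)        # analytic value and its limit 1 - 1/e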
Dropout
• Dropout provides an inexpensive approximation to training
and evaluating a bagged ensemble of exponentially many
neural networks.
q removing non-output units from an underlying base
network
o by multiplying its output value by zero
• Each time we load an example into a minibatch, we
randomly sample a different binary mask to apply to all of
the input and hidden units in the network.
q The probability of sampling a mask value of one (causing
a unit to be included) is a hyperparameter fixed before
training begins.
o Typically, an input unit is included with probability 0.8
and a hidden unit is included with probability 0.5
(Goodfellow 2016)
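A minimal sketch of dropout at training time, assuming h holds a minibatch of hidden activations; this uses the common "inverted dropout" variant, which rescales by the keep probability during training instead of rescaling the weights at test time (the weight scaling rule appears two slides below).

import numpy as np

def dropout(h, keep_prob=0.5, training=True, rng=np.random.default_rng()):
    if not training:
        return h                                   # all units active at test time
    mask = rng.random(h.shape) < keep_prob         # 1 with probability keep_prob
    return h * mask / keep_prob                    # inverted-dropout rescaling

h = np.random.default_rng(0).normal(size=(4, 8))   # minibatch of hidden activations
h_train = dropout(h, keep_prob=0.5, training=True) # a fresh random mask per minibatch
h_test = dropout(h, training=False)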
Dropout
Figure 7.6
(Goodfellow 2016)
Computational graph of dropout
(Goodfellow 2016)
Weight scaling inference rule
• Evaluate the trained model with all units included,
q but with the weights going out of unit i multiplied by the probability of including unit i (e.g. ½).
q This corresponds to predicting the geometric mean of the ensemble!
• Consider a softmax regression classifier with n input variables represented by the vector v:
q P(y = yᵢ | v) = softmax(Wᵀv + b)ᵢ
• To index into the family of submodels with a binary mask d:
q P(y = yᵢ | v; d) = softmax(Wᵀ(v ⊙ d) + b)ᵢ
• Geometric mean over all 2ⁿ masks:
q p̃_ensemble(y = yᵢ | v) = [ ∏_{d ∈ {0,1}ⁿ} softmax(Wᵀ(v ⊙ d) + b)ᵢ ]^(1/2ⁿ)
q p̃_ensemble(y = yᵢ | v) ∝ exp( (1/2ⁿ) Σ_{d ∈ {0,1}ⁿ} [ Wᵀᵢ,: (v ⊙ d) + bᵢ ] )
q = exp( (1/2ⁿ) [ 2ⁿ⁻¹ Wᵀᵢ,: v + 2ⁿ bᵢ ] ) = exp( (1/2) Wᵀᵢ,: v + bᵢ )
(Goodfellow 2016)
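A numerical check of the derivation above (small assumed softmax model with n = 4 inputs): the renormalized geometric mean over all 2ⁿ dropout submodels equals the prediction of the full model with weights halved.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3                               # input units, output classes (assumed)
W, b, v = rng.normal(size=(n, k)), rng.normal(size=k), rng.normal(size=n)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Renormalized geometric mean of the 2^n submodels indexed by binary masks d
log_probs = [np.log(softmax(W.T @ (v * np.array(d)) + b))
             for d in itertools.product([0.0, 1.0], repeat=n)]
geo = np.exp(np.mean(log_probs, axis=0))
geo /= geo.sum()

# Weight scaling inference rule: keep every unit but halve the weights
scaled = softmax(W.T @ (v / 2) + b)

print(np.allclose(geo, scaled))           # True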
Another perspective of
dropout
• (1) Dropout is bagging with parameter sharing
• (2) Information erasing: Each hidden unit must be able to perform
well regardless of which other hidden units are in the model
q Dropout thus regularizes each hidden unit to be not merely a
good feature but a feature that is good in many contexts.
q For example, if the model learns a hidden unit hᵢ that detects a
face by finding the nose,
q then dropping hᵢ corresponds to erasing the information that
there is a nose in the image.
q The model must learn another hᵢ,
o either that redundantly encodes the presence of a nose,
o or that detects the face by another feature, such as the
mouth
(Goodfellow 2016)
Adversarial examples
• Search for an input 𝒙′ near a data point 𝒙 such that the
model output is very different at 𝒙′
• In many cases, 𝒙’ can be so similar to 𝒙 that a human
observer cannot tell the difference between the original
example and the adversarial example,
q but the network can make highly different predictions.
• Adversarial training
q training on adversarially perturbed examples from the
training set
• Adversarial examples are interesting in the context of
regularization
q because one can reduce the error rate on the original
i.i.d. test set via adversarial training
(Goodfellow 2016)
Adversarial Examples
Figure 7.8
(Goodfellow 2016)
Adversarial training
• The value of a linear function can change very rapidly if it has
numerous inputs.
q If we change each input by ε, then a linear function with weights w can change by as much as ε ‖w‖₁, which can be a very large amount if w is high-dimensional.
• Adversarial training discourages this highly sensitive locally linear
behavior by encouraging the network to be locally constant in the
neighborhood of the training data.
• This can be seen as a way of explicitly introducing a local
constancy prior into supervised neural nets.
q The classifier may then be trained to assign the same label to 𝒙
and 𝒙’.
q The assumption motivating this approach is that different
classes usually lie on disconnected manifolds, and a small
perturbation should not be able to jump from one class manifold
to another class manifold.
(Goodfellow 2016)
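A minimal sketch of generating such an x′ with the fast gradient sign method used for Figure 7.8; grad_x_loss is an assumed callable returning ∇ₓJ(x, y).

import numpy as np

def fgsm_example(x, y, grad_x_loss, eps=0.05):
    # Move each input coordinate by eps in the direction that increases the loss
    return x + eps * np.sign(grad_x_loss(x, y))

# Adversarial training then adds (x_adv, y) pairs to the training set, which
# encourages the network to be locally constant around each training point.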
Conclusion
• This chapter has described most of the general
strategies used to regularize neural networks.
• Regularization is a central theme of machine learning.
(Goodfellow 2016)