
Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II

Aman Dalmia · Published in Inveterate Learner · Aug 13, 2018 · 20 min read
This is part of a series of blog posts on the Deep Learning book in which we attempt to summarize each chapter, highlighting the concepts we found most important, so that others can use it as a starting point for reading the chapters, while adding further explanations for a few areas that we found difficult to grasp. Please refer to this post for more clarity on notation.

This is a continuation of our previous post on the first three sections of Chapter 8. The main concepts that we discussed were How learning differs from pure optimization, Challenges in Neural Network Optimization and the Basic algorithms used in optimization. The remaining sections of the chapter deal with:

• Parameter Initialization Strategies
• Algorithms with Adaptive Learning Rates
• Approximate Second-Order Methods
• Optimization Strategies and Meta-Algorithms

1. Parameter Initialization Strategies

Training algorithms for deep learning models are iterative in nature and require the specification of an initial point. This is extremely crucial, as it often decides whether or not the algorithm converges and, if it does, whether it converges to a point with high or low cost.

(Image source: https://towardsdatascience.com/random-initialization-for-neural-networks-a-thing-of-the-past-bfcdd806bf9e)

We have limited understanding of neural network optimization, but the one property that we know with complete certainty is that the initialization should break symmetry. This means that if two hidden units are connected to the same input units, they should have different initializations; otherwise the gradient would update both units in the same way and we would not learn anything new by using an additional unit. The idea of having each unit learn something different motivates random initialization of the weights, which is also computationally cheap.

Biases are often chosen heuristically (mostly zero) and only the weights are randomly initialized, almost always from a Gaussian or uniform distribution. The scale of the distribution is of utmost concern. Larger weights have a stronger symmetry-breaking effect but might lead to chaos (extreme sensitivity to small perturbations in the input) and exploding values during forward and back propagation. As an example of how large weights might lead to chaos, suppose a slight noise ϵ is added to the input. Now, if we applied just a simple linear transformation like W * x, the noise would add a factor of W * ϵ to the output; if the weights are large, this ends up making a significant contribution to the output. SGD and its variants tend to halt in areas near the initial values, thereby expressing a prior that the path to the final parameters from the initial values is discoverable by steepest descent algorithms. A more mathematical explanation of the symmetry breaking can be found in the Appendix.

Various suggestions have been made for appropriate initialization of the parameters. The most commonly used ones include sampling the weights of each fully-connected layer having m inputs and n outputs uniformly from the following distributions:

• U(-1 / √m, 1 / √m)
• U(-√(6 / (m+n)), √(6 / (m+n)))

U(a, b) represents the uniform distribution over the interval [a, b]: the probability density of each value between a and b (inclusive) is 1/(b-a), and the density of every other value is 0.

These initializations have already been incorporated into the most commonly used deep learning frameworks, so that you can just specify which initializer to use and the framework takes care of sampling appropriately. For example, Keras, a very popular deep learning framework, has a module called initializers, where the second distribution (among the two mentioned above) is implemented as glorot_uniform.
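For instance, here is a minimal sketch (the layer size is an arbitrary choice of mine) of how an initializer can be specified in Keras:

```python
import tensorflow as tf

# Dense layer whose weights are sampled from the Glorot (Xavier) uniform
# distribution, i.e. the second distribution listed above; biases start at zero.
layer = tf.keras.layers.Dense(
    units=128,
    kernel_initializer="glorot_uniform",  # U(-sqrt(6/(m+n)), sqrt(6/(m+n)))
    bias_initializer="zeros",
)
```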

One drawback of using a scale of 1 / √m is that the weights end up being very small when a layer has many input/output units. Motivated by the idea of keeping the total amount of input to each unit independent of the number of input units m, sparse initialization sets each unit to have exactly k non-zero weights. However, it takes a long time for gradient descent to correct incorrectly large values, and hence this initialization might cause problems.

If the weights are too small, the range of activations across the mini-batch will shrink as the activations propagate forward through the network. By repeatedly identifying the first layer with unacceptably small activations and increasing its weights, it is possible to eventually obtain a network with reasonable initial activations throughout.

The biases are relatively easier to choose. Setting the biases to zero is compatible with most weight initialization schemes, except in a few cases, e.g. when used for an output unit, to prevent saturation at initialization, or when a unit acts as a gate for making a decision. Refer to the chapter for details.

2. Algorithms with Adaptive Learning Rates

• AdaGrad: As mentioned in Part I, it is important to incrementally decrease the learning rate for faster convergence. Instead of manually reducing the learning rate after each (or several) epochs, a better approach is to adapt the learning rate as the training progresses. This can be done by scaling the learning rate of each model parameter individually, inversely proportional to the square root of the sum of historical squared values of the gradient. In the parameter update equation below, r is initialized to 0 and the multiplication in the update step happens element-wise. Since the gradient value is different for each parameter, the learning rate is scaled differently for each parameter too. Parameters with a large gradient get a large decrease in the learning rate: the learning rate might be too high, leading to oscillations, or the parameter might be approaching the minimum, where a large learning rate would cause it to jump over the minimum, as explained in the figure below. Parameters with small gradients get only a small decrease in the learning rate: they might have already approached their respective minima and should not be pushed away, and even if they have not, reducing the learning rate too much would shrink the updates even further, leading to slower learning.

AdaGrad parameter update equation.


This figure illustrates the need to reduce the learning rate when the gradient is large, for a single parameter: 1) one step of gradient descent with a large gradient value; 2) the result of reducing the learning rate: the step moves towards the minimum; 3) if the learning rate had not been reduced, the step would have jumped over the minimum.
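To make the update concrete, here is a minimal NumPy sketch of one AdaGrad step (variable names are mine; lr is the global learning rate and delta a small constant for numerical stability):

```python
import numpy as np

def adagrad_step(theta, grad, r, lr=0.01, delta=1e-7):
    """One AdaGrad update for a parameter vector theta."""
    r = r + grad * grad                               # accumulate squared gradients (element-wise)
    theta = theta - lr / (delta + np.sqrt(r)) * grad  # larger accumulated gradient -> smaller step
    return theta, r
```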

However, accumulation of squared gradients from the very beginning can lead to an excessive and premature decrease in the learning rate. Consider a model with only 2 parameters (for simplicity) where both initial gradients are 1000. After some iterations, the gradient of one parameter has reduced to 100, but that of the other parameter is still around 750. However, because of the accumulation at each update, the accumulated values would still be almost the same. For example, let the accumulated gradients at each step for Parameter 1 be 1000 + 900 + 700 + 400 + 100 = 3100, giving 1/3100 ≈ 0.0003, and those for Parameter 2 be 1000 + 900 + 850 + 800 + 750 = 4300, giving 1/4300 ≈ 0.0002. This leads to a similar decrease in the learning rates of both parameters, even though the parameter with the lower gradient may have its learning rate reduced too much, leading to slower learning.

Figure: the problem with AdaGrad. Accumulated gradients can cause the learning rate to be reduced far too much in the later stages, leading to slower learning.

• RMSProp: RMSProp addresses the problem caused by accumulated gradients in AdaGrad. It modifies the gradient accumulation step to an exponentially weighted moving average in order to discard history from the extreme past. The RMSProp update is given by:

ρ is the weighting used for exponential averaging. As more updates are made, the contribution of past gradient values is reduced, since ρ < 1 and ρ > ρ² > ρ³ …
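In the book's notation (ε is the global learning rate, δ a small stabilizing constant and ⊙ element-wise multiplication), the update is roughly:

$$r \leftarrow \rho\, r + (1 - \rho)\, g \odot g, \qquad \Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g, \qquad \theta \leftarrow \theta + \Delta\theta$$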

This allows the algorithm to converge rapidly after finding a convex bowl, as if it were an instance of AdaGrad initialized within that bowl. Let me explain why this is so. Consider the figure below. The region represented by 1 indicates usual RMSProp parameter updates as given by the update equation, which is nothing but exponentially averaged AdaGrad updates. Once the optimization process lands on A, it essentially lands at the top of a convex bowl. At this point, intuitively, all the updates before A can be seen to be forgotten due to the exponential averaging, and it can be seen as if (exponentially averaged) AdaGrad updates start from point A onwards.
Intuition behind RMSProp: 1) usual parameter updates; 2) once the optimization reaches the convex bowl, exponentially weighted averaging causes the effect of earlier gradients to shrink and, to simplify, we can assume their contribution to be zero. This can be seen as if AdaGrad had been used with the training initiated inside the convex bowl.

• Adam: Its name is adapted from "adaptive moments", and it essentially combines RMSProp and Momentum. Firstly, it views Momentum as an estimate of the first-order moment of the gradient and RMSProp as an estimate of the second moment. The weight update for Adam is given by:

Secondly, since s and r are initialized as zeros, the authors observed a bias during the initial steps of training, and added a correction term for both moments to account for their initialization near the origin. As an example of the effect of this bias correction, we'll look at the values of s and r for a single parameter (in which case everything is represented as a scalar). Let's first understand what would happen if there were no bias correction. Since s (not in bold, as we are looking at the value for a single parameter, so s here is a scalar) is initialized as zero, after the first iteration the value of s would be (1 - ρ1) * g and that of r would be (1 - ρ2) * g². The suggested default values for ρ1 and ρ2 are 0.9 and 0.999 respectively. Thus, the initial values of s and r are pretty small, and this gets compounded as the training progresses. However, if we now use bias correction, after the first iteration the value of s is just g and that of r is just g². This gets rid of the bias that occurs in the initial phase of training. A major advantage of Adam is that it is fairly robust to the choice of these hyperparameters, i.e. ρ1 and ρ2.
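A minimal NumPy sketch of one Adam step, including the bias correction described above (variable names are mine; t is the 1-based iteration count):

```python
import numpy as np

def adam_step(theta, grad, s, r, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update combining momentum (s) and RMSProp-style scaling (r)."""
    s = rho1 * s + (1 - rho1) * grad          # first-moment estimate
    r = rho2 * r + (1 - rho2) * grad * grad   # second-moment estimate
    s_hat = s / (1 - rho1 ** t)               # bias correction: after step 1, s_hat == grad
    r_hat = r / (1 - rho2 ** t)               # bias correction: after step 1, r_hat == grad**2
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r
```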

The figure below shows the comparison between the various optimization methods discussed above. It can be clearly seen that algorithms with adaptive learning rates provide faster convergence:

NAG here refers to Nesterov Accelerated Gradient, which is the same as Nesterov Momentum.
Source: http://ruder.io/optimizing-gradient-descent/index.html#adam

3. Approximate Second-Order Methods

The optimization algorithms that we've looked at so far involved computing only the first derivative. But there are many methods that involve higher-order derivatives as well. The main problem with these algorithms is that they are not practically feasible in their vanilla form, and so certain methods are used to approximate the required quantities. We explain three such methods, all of which use the empirical risk as the objective function:

• Newton's Method: This is the most commonly used higher-order method. It makes use of the curvature of the loss function, via its second-order derivative, to arrive at the optimal point. Using the second-order Taylor series expansion to approximate J(θ) around a point θ0 and ignoring derivatives of order greater than 2 (this has already been discussed in previous chapters), we get:
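In the book's notation, with H denoting the Hessian of J with respect to θ evaluated at θ0, the approximation is roughly:

$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H (\theta - \theta_0)$$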

We know that we get a critical point for any function f(x) by solving for f'(x) = 0. We get the following critical point of the above equation (refer to the Appendix for proof):
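which, in the same notation, is roughly:

$$\theta^{*} = \theta_0 - H^{-1} \nabla_\theta J(\theta_0)$$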

For quadratic surfaces (i.e. where the cost function is quadratic), this directly gives the optimal result in one step, whereas gradient descent would still need to iterate. However, for surfaces that are not quadratic, as long as the Hessian remains positive definite, we can approach the optimal point through a two-step iterative process: 1) compute the inverse of the Hessian and 2) update the parameters.

Saddle points are problematic for Newton's method. If the eigenvalues of the Hessian are not all positive, Newton's method might cause the updates to move in the wrong direction. A way to avoid this is to add regularization to the Hessian:
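With a regularization strength α added along the diagonal, the update becomes, roughly:

$$\theta^{*} = \theta_0 - \left[ H + \alpha I \right]^{-1} \nabla_\theta J(\theta_0)$$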

However, if there is strong negative curvature, i.e. some eigenvalues are large and negative, α needs to be sufficiently large to offset the negative eigenvalues, in which case the Hessian becomes dominated by the αI diagonal term. The update then reduces to the standard gradient divided by α:

Another problem restricting the use of Newton's method is the computational cost. It takes O(k³) time to calculate the inverse of the Hessian, where k is the number of parameters. It's not uncommon for deep neural networks to have about a million parameters, and since the parameters are updated every iteration, this inverse needs to be recomputed at every iteration, which is not computationally feasible.

• Conjugate Gradients: One weakness of the method of steepest descent (i.e. GD) is that line searches happen along the direction of the gradient. Suppose the previous search direction is d(t-1). Once the line search terminates (which it does when the gradient along the current search direction vanishes) at the minimum, the next search direction d(t) is given by the gradient at that point, which is orthogonal to d(t-1) (if it were not orthogonal, it would have some component along d(t-1), which cannot be the case since the gradient along d(t-1) has vanished at that minimum). Upon reaching the minimum along the current search direction, the minimum along the previous search direction is not preserved, undoing, in a sense, the progress made along the previous search direction. In the method of conjugate gradients, we seek a search direction that is conjugate to the previous line search direction:
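Concretely, the new search direction has the form (roughly, in the book's notation):

$$d_t = \nabla_\theta J(\theta) + \beta_t\, d_{t-1}$$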

Now, the previous search direction contributes towards finding the next search direction.

with d(t) and d(t-1) being conjugate if d(t)' H d(t-1) = 0. βt decides how much of d(t-1) is added back to the current search direction. There are two popular choices for βt: Fletcher-Reeves and Polak-Ribière. This discussion assumed the cost function to be quadratic, in which case the conjugate directions ensure that the gradient along the previous direction does not increase in magnitude. To extend the method to training neural networks, there is one additional change: since the cost function is no longer quadratic, there is no guarantee anymore that the conjugate direction preserves the minimum along the previous search directions. Thus, the algorithm includes occasional resets, where the method of conjugate gradients is restarted with a line search along the unaltered gradient.

• BFGS: This algorithm tries to bring the advantages of Newton's method without the additional computational burden, by approximating the inverse of H with a matrix M(t) that is iteratively refined using low-rank updates. A line search is then conducted along the direction M(t)g(t). However, BFGS requires storing the matrix M(t), which takes O(n²) memory, making it infeasible for large models. An approach called Limited Memory BFGS (L-BFGS) has been proposed to tackle this infeasibility by computing the matrix M(t) using the same method as BFGS but assuming that M(t−1) is the identity matrix.
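Outside of deep learning frameworks, L-BFGS is readily available, e.g. in SciPy. Here is a small sketch (the quadratic objective is just an illustrative choice of mine) that supplies only function values and gradients and lets the optimizer build its inverse-Hessian approximation internally:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize f(x) = ||x - 3||^2 with L-BFGS, supplying the gradient explicitly.
fun = lambda x: np.sum((x - 3.0) ** 2)
jac = lambda x: 2.0 * (x - 3.0)

result = minimize(fun, x0=np.zeros(5), jac=jac, method="L-BFGS-B")
print(result.x)  # approximately [3. 3. 3. 3. 3.]
```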

4. Optimization Strategies and Meta-Algorithms

• Batch Normalization: Batch normalization (BN) is one of the most exciting innovations in deep learning; it has significantly stabilized the learning process and allowed faster convergence rates. The intuition behind batch normalization is as follows: most deep learning networks are compositions of many layers (or functions), and the gradient with respect to one layer is taken considering the other layers to be constant. However, in practice all the layers are updated simultaneously, and this can lead to unexpected results. For example, let y* = x W¹ W² … W¹⁰. Here, y* is a linear function of x but not a linear function of the weights. Suppose the gradient is given by g and we now intend to reduce y* by 0.1. Using the first-order Taylor series approximation, taking a step of ϵg would reduce y* by ϵ g' g. Thus, ϵ should be 0.1/(g' g) using just the first-order information. However, higher-order effects also creep in, as the updated y* is given by:

An example of a second-order term would be ϵ² g1 g2 ∏ wi. The product ∏ wi can be negligibly small or exponentially large depending on whether the individual weights are less than or greater than 1. Since the updates to one layer are so strongly dependent on the other layers, choosing an appropriate learning rate is tough. Batch normalization takes care of this problem by using an efficient reparameterization of almost any deep network. Given a matrix of activations H, the normalization is given by H' = (H - μ) / σ, where the subtraction and division are broadcast.

A small constant δ is added when computing σ to ensure that σ is never equal to 0.
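In the book's notation, with the statistics computed over the m examples (rows) of the mini-batch, these are roughly:

$$\mu = \frac{1}{m} \sum_{i} H_{i,:}, \qquad \sigma = \sqrt{\delta + \frac{1}{m} \sum_{i} (H - \mu)_{i,:}^{2}}$$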

Going back to the earlier example of y*, let the input to layer l (i.e. the activations of layer l-1) be h(l-1). Then h(l-1) = x W1 W2 … W(l-1). Now, if x is drawn from a unit Gaussian, then h(l-1) also comes from a Gaussian, although not of zero mean and unit variance, as it is a linear transformation of x. BN makes it zero mean and unit variance. Therefore, y* = Wl h(l-1), and the learning now becomes much simpler, as the parameters at the lower layers mostly do not have any effect. This simplicity was admittedly achieved by rendering the lower layers useless; however, in a realistic deep network with non-linearities, the lower layers remain useful. Finally, the complete reparameterization of BN replaces H with γH' + β. This is done to retain the expressive power of the network, while the mean of the new activations is determined solely by β rather than by a complicated interaction between the layers below. Also, between normalizing X or XW + b, the authors recommend the latter, specifically XW, since b becomes redundant because of β. Practically, this means that when we are using a Batch Normalization layer, the biases should be turned off. In a deep learning framework like Keras, this can be done by setting the parameter use_bias=False in the convolutional layer.
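As a minimal sketch (the filter count and kernel size are arbitrary choices of mine), a typical Conv → BN → ReLU block in Keras looks roughly like this:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters=64):
    """Convolution followed by batch normalization; the conv bias is disabled
    because BN's beta parameter makes it redundant."""
    x = layers.Conv2D(filters, kernel_size=3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)
```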
• Coordinate Descent: Generally, a single weight update is made by taking the gradient with respect to every parameter. However, in cases where some of the parameters are independent of the rest (discussed below), it may be more efficient to take the gradient with respect to those independent sets of parameters separately when making updates. Let me clarify that with an example. Suppose we have the following cost function:
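Roughly, in the book's notation, the cost function is:

$$J(H, W) = \sum_{i,j} |H_{i,j}| + \sum_{i,j} \left( X - W^\top H \right)_{i,j}^{2}$$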

This cost function describes the learning problem called sparse coding. Here, H refers to the sparse representation of X and W is the set of weights used to linearly decode H to retrieve X. Why does this cost function enforce the learning of a sparse representation of X? The first term penalizes values far from 0 (positive or negative, because of the modulus operator |H|), which pushes most of the values towards 0, making H sparse. The second term penalizes the difference between X and the linear reconstruction of X from H using W, enforcing them to take similar values. In this way, H is learned as a sparse "representation" of X. The cost function generally also contains a regularization term like weight decay, which has been omitted here for simplicity. We can divide the entire list of parameters into two sets, W and H. Minimizing the cost function with respect to either of these sets of parameters is a convex problem. Coordinate Descent (CD) refers to minimizing the cost function with respect to only one parameter at a time. It has been shown that by repeatedly cycling through all the parameters, we are guaranteed to arrive at a local minimum. If, instead of one parameter, we take a set of parameters as we did above with W and H, it is called block coordinate descent (the interested reader should explore Alternating Minimization). CD makes sense if either the parameters are clearly separable into independent groups, or if optimizing with respect to a certain set of parameters is significantly more efficient than with respect to the others.

The points A, B, C and D indicate the locations in the parameter space where coordinate descent landed after each gradient step.
Source: https://www.researchgate.net/figure/Coordinate-Descent-CD-CD-algorithm-searches-along-one-coordinate-direction-in-each_fig2_262805949

Coordinate descent may fail terribly when one variable influences the optimal value of another variable.

• Polyak Averaging: Polyak averaging consists of averaging several points in the parameter space that the optimization algorithm traverses. So, if the algorithm encounters the points θ(1), θ(2), … during optimization, the output of Polyak averaging is:
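i.e., roughly:

$$\hat{\theta}^{(t)} = \frac{1}{t} \sum_{i=1}^{t} \theta^{(i)}$$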
The figure below explains the intuition behind Polyak averaging:

The optimization algorithm might oscillate back and forth across a valley without ever reaching the minima. However, the average of those points should be closer to the bottom of the valley.
Most optimization problems in deep learning are non-convex, where the path taken by the optimization algorithm is quite complicated, and it might happen that a point visited in the distant past is quite far from the current point in the parameter space. Including such a distant point in the average might not be useful, which is why an exponentially decaying running average is used instead. This scheme, where the recent iterates are weighted more than the past ones, is called Polyak-Ruppert Averaging:
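i.e., roughly, for a decay rate α:

$$\hat{\theta}^{(t)} = \alpha\, \hat{\theta}^{(t-1)} + (1 - \alpha)\, \theta^{(t)}$$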

• Supervised Pre-training: Sometimes it's hard to directly train a model to solve a specific task. Instead, it might be better to train it on a simpler task and use that as an initialization point for training on the more challenging task. As an intuition for why this seems logical, suppose you didn't have any background in integration and were asked to learn how to compute the following integral:

If you're anything close to a normal person, you can imagine what your first reaction would be.

However, wouldn't it be better if you were asked to understand the more basic integrals first:

I hope you understand what I meant with this example: learning a simpler task puts you in a better position to tackle the more complex one. This particular strategy of training to solve a simpler task before facing the herculean one is called pretraining. A particular type of pretraining, called greedy supervised pretraining, breaks a given supervised learning problem into simpler supervised learning problems and solves for the optimal version of each component in isolation. Building on the above intuition, the hypothesis as to why this works is that it gives better guidance to the intermediate layers of the network and helps with both generalization and optimization. More often than not, the greedy pretraining is followed by a fine-tuning stage, where all the parts are jointly optimized to search for the optimal solution to the full problem. As an example, the figure below shows how each hidden layer is trained one at a time, where the input to the hidden layer being learned is the output of the previously trained hidden layer.
Greedy supervised pretraining: (a) the first hidden layer is trained using only the original inputs and outputs; (b) to train the second hidden layer, the hidden-to-output connection of the first hidden layer is removed and the output of the first hidden layer is used as the input.

Also, FitNets shows an alternative way to guide the training process. Deep networks are hard to train mainly because the deeper the model gets, the more non-linearities are introduced. The authors propose the use of a shallower and wider teacher network that is trained first. Then, a second network that is thinner and deeper, called the student network, is trained to predict not only the final outputs but also the intermediate layers of the teacher network. For those who might not be clear about what deep, shallow, wide and thin mean here, refer to the following diagram:

Explanation of the terms "shallow", "deep", "thin" and "wide" in the context of neural networks.

The idea is that predicting the intermediate layers of the teacher network provides hints as to how the layers of the student network should be used, aiding the optimization procedure. It was shown that without the hints to the hidden layers, the student network performs poorly on both the training and test data.

• Designing Models to Aid Optimization: Most of the work in deep learning has gone towards making the models easier to optimize rather than designing a more powerful optimization algorithm. This is evident from the fact that Stochastic Gradient Descent, which is still the primary method used for training deep models today, has been in use since the 1960s. Many of the current design choices lean towards using linear transformations between layers and using activation functions like ReLU [max(0, x)], which are linear for the most part and enjoy large gradients compared to sigmoidal units, which saturate easily. Also, linear functions increase in a particular direction, so if there's an error, there's a clear direction in which the output should move to reduce it.

Source: https://arxiv.org/pdf/1512.03385.pdf

Residual connections reduce the length of the shortest path from the output to the lower layers, thereby allowing a larger gradient to flow through and hence tackling the vanishing gradient problem. Similarly, GoogLeNet attached multiple copies of the output to intermediate layers so that a larger gradient flows to those layers. Once training is complete, these auxiliary heads are removed.

Inception architecture with the outputs at the intermediate layers marked.

• Continuation Methods and Curriculum Learning: Based on the explanation given in the book (and also given that it's fairly short), I'd recommend the interested reader to refer to the book directly; the content there is already exactly to the point, with little scope for further summarizing, so it would be redundant to explain it again here. To give a very
brief idea, Continuation Methods are a family of
strategies where instead of a single cost function, a series of
cost functions are optimized to aid optimization by making
sure that the algorithm spends most of the time in well-
behaved regions in the parameter space. The series of cost
functions are designed such that the cost functions become
progressively harder to solve, i.e. the first cost function is
easiest to solve and the final cost function (which is actually
the cost function we wanted to minimize) is the toughest,
and the solution to each easier cost function serves as a
good initialization point for the next, harder-to-optimize cost
function. Curriculum Learning can be interpreted as a
type of continuation method where the learning process is
planned such that the simpler concepts are learned first and
then the harder ones which depend on those simpler
concepts. This concept had been shown to accelerate
progress in animal training. One simple way to use it in the
context of machine learning is to initially show the
algorithm more of the easier images and less of the harder
images. As the learning progresses, the proportion of easier
images is reduced and that of the harder images is
increased so that the model can ultimately learn the harder
task.

This concludes the second part of our summary for Chapter 8. This has been one of the most math-intensive chapters that I have ever read, and if you've made it this far, I have two things for you: thank you for taking the time out, and congratulations, as hopefully you now have much better clarity on what goes on under the hood. Since Medium doesn't support math symbols yet, a few of the mathematical notations were not proper; if you're looking for a version with better notation, feel free to have a look at our repository here, which contains Jupyter Notebooks with the same content but with the symbols converted to LaTeX. Our next post will be about Chapter 11: Practical Methodology, which focuses on practical tips for making your deep learning model work. To stay updated with our posts, follow our publication, Inveterate Learner. Finally, I thank my co-author, Ameya Godbole, for thoroughly reviewing the post and suggesting many important changes. A special thanks to Raghav Somani for providing an in-depth review from a theoretical perspective that played an important role in the final shaping of this post.

Patience is a virtue and I'm learning patience. It's a tough lesson.

- Elon Musk
If there’s anything that you might want to share
with me or give any sort of feedback on my
writing/thoughts, I would love to hear it from you.
Feel free to connect with me
on LinkedIn, Facebook or follow me on Github.
