0% found this document useful (0 votes)

5 views23 pages

DL Unit 5 Notes 2

Deep learning unit 5 notes Btech r18

Uploaded by

harshithareddy372

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views23 pages

DL Unit 5 Notes 2

Deep learning unit 5 notes Btech r18

Uploaded by

harshithareddy372

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 23

5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate

veterate Learner | Med…

Open in app Sign up Sign In

Deep Learning Book: Chapter 8 —

Optimization For Training Deep Models Part II
Aman Dalmia · Follow
Published in Inveterate Learner
20 min read · Aug 13, 2018

Listen Share

This is going to be a series of blog posts on the Deep Learning book where we are
attempting to provide a summary of each chapter highlighting the concepts that we found
to be the most important so that other people can use it as a starting point for reading the

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 1/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

chapters, while adding further explanations on few areas that we found difficult to grasp.
Please refer this for more clarity on notation.

T his is a continuation of our previous post on the first three sections of Chapter
8. The main concepts that we discussed were How learning differs from pure
optimization, Challenges in Neural Network Optimization and the Basic algorithms
used in optimization. The remaining sections of the chapter deal with:

Parameter Initialization Strategies

Algorithms with Adaptive Learning Rates

Approximate Second-Order Methods

Optimization Strategies and Meta-Algorithms

1. Parameter Initialization Strategies

Training algorithms for deep learning models are iterative in nature and require the
specification of an initial point. This is extremely crucial as it often decides whether
or not the algorithm converges and if it does, then does the algorithm converge to a
point with high cost or low cost.

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 2/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

Source: https://towardsdatascience.com/random-initialization-for-neural-networks-a-thing-of-the-past-
bfcdd806bf9e

We have limited understanding of neural network optimization but the one

property that we know with complete certainty is that the initialization should
break symmetry. This means that if two hidden units are connected to the same
input units, then these should have different initialization or else the gradient would
update both the units in the same way and we don’t learn anything new by using an
additional unit. The idea of having each unit learn something different motivates
random initialization of weights which is also computationally cheaper.

Biases are often chosen heuristically (zero mostly) and only the weights are
randomly initialized, almost always from a Gaussian or uniform distribution. The
scale of the distribution is of utmost concern. Large weights might have better
symmetry-breaking effect but might lead to chaos (extreme sensitivity to small
perturbations in the input) and exploding values during forward & back
propagation. As an example of how large weights might lead to chaos, consider that
there’s a slight noise adding ϵ to the input. Now, we if did just a simple linear
transformation like W * x, the ϵ noise would add a factor of W * ϵ to the output. In
case the weights are high, this ends up making a significant contribution to the
output. SGD and its variants tend to halt in areas near the initial values, thereby
expressing a prior that the path to the final parameters from the initial values is
discoverable by steepest descent algorithms. A more mathematical explanation for
the symmetry breaking can be found in the Appendix.

Various suggestions have been made for appropriate initialization of the

parameters. The most commonly used ones include sampling the weights of each
fully-connected layer having m inputs and n outputs uniformly from the following
distributions:

U(-1 / √m, 1 / √m)

U(- √6 / (m+n), √6 / (m+n))

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 3/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

U(a, b) represents the uniform distribution where the probability of each value between a and b, a and b
inclusive, is 1/(b-a). The probability of every other value is 0.

These initializations have already been incorporated into the most commonly used
Deep Learning frameworks nowadays so that you can just specify which initializer
to use and the framework takes care of sampling appropriately. For e.g. Keras, which
is a very famous deep learning framework, has a module called initializers, where
the second distribution (among the 2 mentioned above) is implemented as
glorot_uniform .

One drawback of using 1 / √m as the standard deviation is that the weights end up
being small when a layer has too many input/output units. Motivated by the idea to
have the total amount of input to each unit independent of the number of input
units m, Sparse initialization sets each unit to have exactly k non-zero weights.
However, it takes a long time for GD to correct incorrect large values and hence, this
initialization might cause problems.

If the weights are too small, the range of activations across the mini-batch will
shrink as the activations propagate forward through the network.By repeatedly
identifying the ﬁrst layer with unacceptably small activations and increasing its
weights, it is possible to eventually obtain a network with reasonable initial
activations throughout.

The biases are relatively easier to choose. Setting the biases to zero is compatible
with most weight initialization schemes except for a few cases for e.g. when used
for an output unit, to prevent saturation at initialization or when using unit as a gate
for making a decision. Refer to the chapter for details.

2. Algorithms with Adaptive Learning Rates

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 4/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

AdaGrad: As mentioned in Part I , it is important to incrementally decrease the

learning rate for faster convergence. Instead of manually reducing the learning
rate after each (or several) epochs, a better approach is to adapt the learning
rate as the training progresses. This can be done by scaling the learning rates of
each model parameter individually inversely proportional to the square root of the
sum of historical squared values of the gradient. In the parameter update
equation below, r is initialized with 0 and the multiplication in the update step
happens element-wise as mentioned. Since the gradient value would be
different for each parameter, the learning rate is scaled differently for each
parameter too. Thus, those parameters having a large gradient have a large
decrease in the learning rate as the learning rate might be too high leading to
oscillations or it might be approaching the minima but having large learning
rate might cause it to jump over the minima as explained in the figure below,
because of which the learning rate should be decreased for better convergence,
while those with small gradients have a small decrease in the learning rate as
they might have already approached their respective minima and should not be
pushed away from that. Even if they have not, reducing the learning rate too
much would reduce the gradient even further leading to slower learning.

AdaGrad parameter update equation.

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 5/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

This figure illustrates the need to reduce the learning rate if gradient is large in case of a single parameter. 1)
One step of gradient descent representing a large gradient value. 2) Result of reducing the learning rate —
moves towards the minima 3) Scenario if the learning rate was not reduced — it would have jumped over the
minima.

However, accumulation of squared gradients from the very beginning can lead to
excessive and premature decrease in the learning rate. Consider that we had a
model with only 2 parameters (for simplicity) and both the initial gradients are
1000. After some iterations, the gradient of one of the

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 6/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

Figure explaining the problem with AdaGrad. Accumulated gradients can cause the learning rate to be
reduced far too much in the later stages leading to slower learning.

parameters has reduced to 100 but that of the other parameter is still around 750.
However, because of the accumulation at each update, the accumulated gradient
would still have almost the same value. For e.g. let the accumulated gradients at
each step for the Parameter 1 be 1000 + 900 + 700 + 400 + 100 = 3100, 1/3100=0.0003

and that for Parameter 2 be: 1000 + 900 + 850 + 800 + 750 = 4300, 1/4300 = 0.0002 .

This would lead to a similar decrease in the learning rates for both the parameters,
even though the parameter having the lower gradient might have its learning rate
reduced too much leading to slower learning.

RMSProp: RMSProp addresses the problem caused by accumulated gradients in

AdaGrad. It modifies the gradient accumulation step to an exponentially
weighted moving average in order to discard history from the extreme past. The
RMSProp update is given by:

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 7/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

ρ is the weighing used for exponential averaging. As more updates are made, the contribution of past
gradient values are reduced since ρ < 1 and ρ > ρ² >ρ³ …

This allows the algorithm to converge rapidly after finding a convex bowl, as if it
were an instance of AdaGrad initialized within that bowl. Let me explain why this is
so. Consider the figure below. The region represented by 1 indicates usual RMSProp
parameter updates as given by the update equation, which is nothing but
exponentially averaged AdaGrad updates. Once the optimization process lands on A,
it essentially lands at the top of a convex bowl. At this point, intuitively, all the
updates before A can be seen to be forgotten due to the exponential averaging and it
can be seen as if (exponentially averaged) AdaGrad updates start from point A
onwards.

Intuition behind RMSProp. 1) Usual parameter updates 2) Once it reaches the convex bowl, exponentially
weighted averaging would cause the effect of earlier gradients to reduce and to simplify, we can assume
their contribution to be zero. This can be seen as if AdaGrad had been used with the training initiated inside
the convex bowl

Adam: Adapted from “adaptive moments”, it focuses on combining RMSProp and

Momentum. Firstly, it views Momentum as an estimate of the first-order
https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 8/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

moment and RMSProp as that of the second moment. The weight update for
Adam is given by:

Secondly, since s and r are initialized as zeros, the authors observed a bias during
the initial steps of training thereby adding a correction term for both the moments
to account for their initialization near the origin. As an example of what the effect
of this bias correction is, we’ll look at the values of s and r for a single parameter (in
which case everything is now represented as a scalar). Let’s first understand what
would happen if there was no bias correction. Since s (notice that this is not in bold
as we are looking at the value for a single parameter and the s here is a scalar) is
initialized as zero, after the first iteration, the value of s would be (1 — ρ1) * g and
that of r would be (1 — ρ2) * g². The preferred values for ρ1 and ρ2 are 0.9 and 0.99
respectively. Thus, the initial values of s and r are pretty small and this gets
compounded as the training progress. However, if we now use bias correction, after
the first iteration, the value of s is just g and that of r is just g². This gets rid of the
bias that occurs in the initial phase of training. A major advantage of Adam is that
it’s fairly robust to the choice of these hyperparameters, i.e. ρ1 and ρ2.

The figure below shows the comparison between the various optimization methods
discussed above. It can be clearly seen that algorithms with adaptive learning rates
provide faster convergence:

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 9/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

NAG here refers to Nesterov Accelerated Gradient which is the same as Nesterov Momentum. Source:
http://ruder.io/optimizing-gradient-descent/index.html#adam

3. Approximate Second-Order Methods

The optimization algorithms that we’ve looked at till now involved computing only
the first derivative. But there are many methods which involve higher order
derivatives as well. The main problem with these algorithms are that they are not
practically feasible in their vanilla form and so, certain methods are used to
approximate the values of the derivatives. We explain three such methods, all of
which use empirical risk as the objective function:

Newton’s Method: This is the most common higher-order derivative method

used. It makes use of the curvature of the loss function via its second-order
derivative to arrive at the optimal point. Using the second-order Taylor Series
expansion to approximate J(θ) around a point θo and ignoring derivatives of
order greater than 2 (this has already been discussed in previous chapters), we
get:

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 10/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

We know that we get a critical point for any function f(x) by solving for f'(x) = 0 .

We get the following critical point of the above equation (refer to the Appendix for
proof):

For quadratic surfaces (i.e. where cost function is quadratic), this directly gives the
optimal result in one step whereas gradient descent would still need to iterate.
However, for surfaces that are not quadratic, as long as the Hessian remains positive
definite, we can obtain the optimal point through a 2-step iterative process — 1) Get
the inverse of the Hessian and 2) update the parameters.

Saddle points are problematic for Newton’s method. If all the eigenvalues are not
positive, Newton’s method might cause the updates to move in the wrong direction.
A way to avoid this is to add regularization:

However, if there is a strong negative curvature i.e. the eigenvalues are largely
negative, α needs to be sufficiently high to offset the negative eigenvalues in which
case the Hessian becomes dominated by the diagonal matrix. This leads to an
update which becomes the standard gradient divided by α:

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 11/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

Another problem restricting the use of Newton’s method is the computational cost.
It takes O(k³) time to calculate the inverse of the Hessian where k is the number of
parameters. It’s not uncommon for Deep Neural Networks to have about a million
parameters and since the parameters are updated every iteration, this inverse needs
to be calculated at every iteration, which is not computationally feasible.

Conjugate Gradients: One weakness of the method of steepest descent (i.e. GD)
is that line searches happen along the direction of the gradient. Suppose the
previous search direction is d(t-1). Once the search terminates (which it does
when the gradient along the current gradient direction vanishes) at the
minimum, the next search direction, d(t) is given by the gradient at that point,
which is orthogonal to d(t-1) (because if it’s not orthogonal, it’ll have some
component along d(t-1) which cannot be true as at the minimum, the gradient
along d(t-1) has vanished).
Upon getting the minimum along the current search direction, the minimum
along the previous search direction is not preserved, undoing, in a sense, the
progress made in previous search direction.

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 12/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

In the method of conjugate gradients, we seek a search direction that is conjugate to

the previous line search direction:

Now, the previous search direction contributes towards finding the next search direction.

with d(t) and d(t-1) being conjugates if d(t)' H d(t-1) = 0 . βt decides how much of
d(t-1) is added back to the current search direction. There are two popular choices
for βt — Fletcher-Reeves and Polak-Ribière. These discussions assumed the cost
function to be quadratic where the conjugate directions ensure that the gradient
along the previous direction does not increase in magnitude. To extend the concept
to work for training neural networks, there is one additional change. Since it’s no
longer quadratic, there’s no guarantee anymore than the conjugate direction would
preserve the minimum in the previous search directions. Thus, the algorithm
includes occasional resets where the method of conjugate gradients is restarted
with line search along the unaltered gradient.

BFGS: This algorithm tries to bring the advantages of Newton’s method without
the additional computational burden by approximating the inverse of H by M(t),
which is iteratively refined using low-rank updates. Finally, line search is
conducted along the direction M(t)g(t). However, BFGS requires storing the
matrix M(t) which takes O(n²) memory making it infeasible. An approach called
Limited Memory BFGS (L-BFGS) has been proposed to tackle this infeasibility by
computing the matrix M(t) using the same method as BFGS but assuming that
M(t−1) is the identity matrix.

4. Optimization Strategies and Meta-Algorithms

Batch Normalization: Batch normalization (BN) is one of the most exciting
innovations in Deep learning that has significantly stabilized the learning
process and allowed faster convergence rates. The intuition behind batch
normalization is as follows: Most of the Deep Learning networks are
compositions of many layers (or functions) and the gradient with respect to one
layer is taken considering the other layers to be constant. However, in practise

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 13/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

all the layers are updated simultaneously and this can lead to unexpected
results. For example, let y* = x W¹ W² … W¹⁰. Here, y* is a linear function of x
but not a linear function of the weights. Suppose the gradient is given by g and
we now intend to reduce y* by 0.1. Using first-order Taylor Series
approximation, taking a step of ϵg would reduce y* by ϵg’ g. Thus, ϵ should be
0.1/(g’ g) just using the first-order information. However, higher order effects
also creep in as the updated y* is given by:

An example of a second-order term would be ϵ² g1 g2 ∏ wi. ∏ wi can be negligibly

small or exponentially high depending on whether the individual weights are less
than or greater than 1. Since the updates to one layer is so strongly dependent on
the other layers, choosing an appropriate learning rate is tough. Batch
normalization takes care of this problem by using an efficient reparameterization of
almost any deep network. Given a matrix of activations, H, the normalization is
given by: H’ = (H-μ) / σ, where the subtraction and division is broadcasted.

𝛿 is added to ensure that σ is not equal to 0.

Going back to the earlier example of y*, let the activations of layer l be given by h(l-
1). Then h(l-1) = x W1 W2 … W (l-1). Now, if x is drawn from a unit Gaussian, then
h(l-1) also comes from a Gaussian, however, not of zero mean and unit variance, as
it is a linear transformation of x. BN makes it zero mean and unit variance.
Therefore, y* = Wl h(l-1) and thus, the learning now becomes much simpler as the
parameters at the lower layers mostly do not have any effect. This simplicity was
definitely achieved by rendering the lower layers useless. However, in a realistic
deep network with non-linearities, the lower layers remain useful. Finally, the

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 14/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

complete reparameterization of BN is given by replacing H with γH’ + β. This is

done to retain its expressive power and the fact that the mean is solely determined
by XW. Also, among the choice of normalizating X or XW + B, the authors
recommend the latter, specifically XW, since B becomes redundant because of β.
Practically, this means that when we are using the Batch Normalization layer, the
biases should be turned off. In a deep learning framework like Keras, this can be
done by setting the parameter use_bias=False in the Convolutional layer.

Coordinate Descent: Generally, a single weight update is made by taking the

gradient with respect to every parameter. However, in cases where some of the
parameters might be independent (discussed below) of the remaining, it might
be more efficient to take the gradient with respect to those independent sets of
parameters separately for making updates. Let me clarify that with an example.
Suppose we have the following cost function:

This cost function describes the learning problem called sparse coding. Here, H
refers to the sparse representation of X and W is the set of weights used to linearly
decode H to retrieve X. An explanation of why this cost function enforces the
learning of a sparse representation of X follows. The first term of the cost function
penalizes values far from 0 (positive or negative because of the modulus, |H|,
operator. This enforces most of the values to be 0, thereby sparse. The second term
is pretty self-explanatory in that it compensates the difference between X and H
being linearly transformed by W, thereby enforcing them to take the same value. In
this way, H is now learned as a sparse “representation” of X. The cost function
generally consists of additionally a regularization term like weight decay, which has
been avoided for simplicity. Here, we can divide the entire list of parameters into
two sets, W and H. Minimizing the cost function with respect to any of these sets of
parameters is a convex problem. Coordinate Descent (CD) refers to minimizing the
cost function with respect to only 1 parameter at a time. It has been shown that
repeatedly cycling through all the parameters, we are guaranteed to arrive at a local
minima. If instead of 1 parameter, we take a set of parameters as we did before with
W and H, it is called block coordinate descent (the interested reader should explore
https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 15/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

Alternating Minimization). CD makes sense if either the parameters are clearly

separable into independent groups or if optimizing with respect to certain set of
parameters is more efficient than with respect to others.

The points A, B, C and D indicates the locations in the parameter space where coordinate descent landed
after each gradient step. Source: https://www.researchgate.net/figure/Coordinate-Descent-CD-CD-
algorithm-searches-along-one-coordinate-direction-in-each_fig2_262805949

Coordinate descent may fail terribly when one variable influences the optimal value
of another variable.

Polyak Averaging: Polyak averaging consists of averaging several points in the

parameter space that the optimization algorithm traverses through. So, if the
algorithm encounters the points θ(1), θ(2), … during optimization, the output of
Polyak averaging is:

The figure below explains the intuition behind Polyak averaging:

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 16/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

The optimization algorithm might oscillate back and forth across a valley without ever reaching the minima.
However, the average of those points should be closer to the bottom of the valley.

Most optimization problems in deep learning are non-convex where the path taken
by the optimization algorithm is quite complicated and it might happen that a point
visited in the distant past might be quite far from the current point in the parameter
space. Thus, including such a point in the distant past might not be useful, which is
why an exponentially decaying running average is taken. This scheme where the
recent iterates are weighted more than the past ones is called Polyak-Ruppert
Averaging:

Supervised Pre-training: Sometimes it’s hard to directly train to solve for a

specific task. Instead it might be better to train for solving a simpler task and
use that as an initialization point for training to solve the more challenging task.
As an intuition for why this seems logical, consider that you didn’t have any
https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 17/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

background in integration and were asked to learn how to compute the

following integral:

If you’re anyone close to a normal person, your first reaction would be:

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 18/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

Source: https://imgflip.com/i/gdnbg

However, wouldn’t it be better if you were asked to understand the more basic
integrations first:

I hope you understand what I meant with this example — learning a simpler task
would put you in a better position to understand the more complex task. This
particular strategy of training to solve a simpler task before facing the herculean
one is called pretraining. A particular type of pretraining, called greedy supervised
pretraining, firstly breaks a given supervised learning problem into simpler
supervised learning ones and solving for the optimal version of each component in
isolation. To build on the above intuition, the hypothesis as to why this works is that
it gives better guidance to the intermediate layers of the network and helps in both,
generalization and optimization. More often that not, the greedy pretraining is
followed by a fine-tuning stage where all the parts are jointly optimized to search
for the optimal solution to the full problem. As an example, the figure below shows
how each hidden layer is trained one at a time, where the input to the hidden layer
being learned is the output of the previously trained hidden layer.

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 19/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

Greedy supervised pretraining (a) The first hidden layer is being trained only using the original inputs and
outputs. (b) For training the second hidden layer, the hidden-output connection from the first hidden layer is
removed and the output of the first hidden layer is used as the input.

Also, FitNets shows an alternative way to guide the training process. Deep networks
are hard to train mainly because as deeper the model gets, more non-linearities are
introduced. The authors propose the use of a shallower and wider teacher network
that is trained first. Then, a second network which is thinner and deeper, called the
student network is trained to predict not only the final outputs but also the
intermediate layers of the teacher network. For those who might not be clear with
what deep, shallow, wide and thin might mean, refer the following diagram:

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 20/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

Explanation of the terms “shallow”, “deep”, “thin” and “wide” in the context of neural networks.

The idea is that predicting the intermediate layers of the teacher network provides
some hints as to how the layers of the student network should be used and aids the
optimization procedure. It was shown that without the hints to the hidden layers,
the students networks performs poorly in both the training and test data.

Designing Models to Aid Optimization: Most of the work in deep learning has
been towards making the models easier to optimize rather than designing a
more powerful optimization algorithm. This is evident from the fact that
Stochastic Gradient Descent, which is primarily used for training deep models
today, has been in use since the 1960s. Many of the current design choices lend
towards using linear transformations between layers and using activation
functions like ReLU [max(0, x)] which are linear for the most part and enjoy
large gradients as compared to sigmoidal units which saturate easily. Also,
linear functions increase in a particular direction. Thus, if there’s an error,
there’s a clear direction towards which the output should move to minimize the
error.

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 21/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

Source: https://arxiv.org/pdf/1512.03385.pdf

Residual connections reduce the length of the shortest path from the output to the
lower layers, thereby allowing a larger gradient to flow through and hence, tackling
the vanishing gradient problem. Similarly, GoogleNet attached multiple copies of the
output to intermediate layers so that a larger gradient flows to those layers. Once
training is complete, the auxiliary heads are removed.

Inception architecture with the outputs at the intermediate layers being marked.

Continuation Methods and Curriculum Learning: Based on the explanation

given in the book (and also given that it’s fairly short), I’d recommend the
interested reader to refer the book directly. The reason being that I realized that
the content there was exactly to the point with no further scope of summarizing
and hence, found it redundant to explain it again here. To give a very brief idea,
Continuation Methods are a family of strategies where instead of a single cost
function, a series of cost functions are optimized to aid optimization by making
sure that the algorithm spends most of the time in well-behaved regions in the
parameter space. The series of cost functions are designed such that the cost
functions become progressively harder to solve, i.e. the first cost function is
easiest to solve and the final cost function (which is actually the cost function
we wanted to minimize) is the toughest, and the solution to each easier cost
https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 22/31
5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate Learner | Med…

function serves as a good initialization point for the next, harder-to-optimize

cost function. Curriculum Learning can be interpreted as a type of continuation
method where the learning process is planned such that the simpler concepts
are learned first and then the harder ones which depend on those simpler
concepts. This concept had been shown to accelerate progress in animal
training. One simple way to use it in the context of machine learning is to
initially show the algorithm more of the easier images and less of the harder
images. As the learning progresses, the proportion of easier images is reduced
and that of the harder images is increased so that the model can ultimately learn
the harder task.

This concludes the 2nd part of our summary for Chapter 8. This has been one of the
most math intensive chapters that I have ever read and if you’ve made it this far, I
have two things for ya: Thank You for taking the time out and Congratulations, as
hopefully you have a much better clarity on what goes under the hood. Since
Medium doesn’t support math symbols yet, a few of the mathematical notations
were not proper and if you’re looking for a version with better notation, feel free to
have a look over our repository here which contains Jupyter Notebooks entailing the
same content but with the symbols converted to LaTex. Our next post will be about
Chapter 11: Practical Methodology which focuses upon practical tips for making your
Deep Learning model work. To stay updated with our posts, follow our publication,
Inveterate Learner. Finally, I thank my co-author, Ameya Godbole, for thoroughly
reviewing the post and suggesting many important changes. A special thanks to
Raghav Somani for providing an in-depth review from a theoretical perspective that
played an important role in the final shaping of this post.

Patience is a virtue and I’m learning patience. It’s a

tough lesson.

- Elon Musk

https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135 23/31

Supply Chain Management For Dummies
57% (7)
Supply Chain Management For Dummies
4 pages
UNIT3
No ratings yet
UNIT3
17 pages
DL 3unit Last Topic Meta Algoritham
No ratings yet
DL 3unit Last Topic Meta Algoritham
32 pages
DL Mod2
No ratings yet
DL Mod2
45 pages
Unit-2 Improving-Deep-Neural-Networks
No ratings yet
Unit-2 Improving-Deep-Neural-Networks
18 pages
Unit V NNHDL
No ratings yet
Unit V NNHDL
33 pages
Machine Learning and Pattern Recognition Week 8 - Neural - Net - Fitting
No ratings yet
Machine Learning and Pattern Recognition Week 8 - Neural - Net - Fitting
3 pages
Pure Optimization
No ratings yet
Pure Optimization
23 pages
DL 4
No ratings yet
DL 4
15 pages
Fixing Neural Network Course 2 1659759284
No ratings yet
Fixing Neural Network Course 2 1659759284
30 pages
5th Unit DL Final Class Notes
No ratings yet
5th Unit DL Final Class Notes
77 pages
Initializing Neural Networks - Deeplearning - Ai
No ratings yet
Initializing Neural Networks - Deeplearning - Ai
15 pages
DL UNIT II PART II (IMP) Optimization For Training Deep Model
No ratings yet
DL UNIT II PART II (IMP) Optimization For Training Deep Model
81 pages
Unit 5
No ratings yet
Unit 5
36 pages
Deep Learning Module 3
No ratings yet
Deep Learning Module 3
15 pages
ITNN Week3
No ratings yet
ITNN Week3
21 pages
Lecture 8.4
No ratings yet
Lecture 8.4
13 pages
Optimization of Deep Networks
No ratings yet
Optimization of Deep Networks
84 pages
Intro DL 04
No ratings yet
Intro DL 04
35 pages
Hyperparameters
No ratings yet
Hyperparameters
15 pages
Fundamentals of Deep Learning
No ratings yet
Fundamentals of Deep Learning
26 pages
Unit 3
No ratings yet
Unit 3
110 pages
Practical Aspects of Deep Learning PI
No ratings yet
Practical Aspects of Deep Learning PI
46 pages
Module 3-DL
No ratings yet
Module 3-DL
12 pages
Op Tim Ization
No ratings yet
Op Tim Ization
22 pages
Optimization For Deep Learning Theory and Algorithms
No ratings yet
Optimization For Deep Learning Theory and Algorithms
60 pages
Artificial Neural NetworkIV
No ratings yet
Artificial Neural NetworkIV
6 pages
Lecture 2
No ratings yet
Lecture 2
31 pages
DL 12
No ratings yet
DL 12
55 pages
Introduction To Deep Learning - Deep Feed Forward Network
No ratings yet
Introduction To Deep Learning - Deep Feed Forward Network
24 pages
Introduction To Deep Learning AI 2025
No ratings yet
Introduction To Deep Learning AI 2025
78 pages
Deep Neural Network
No ratings yet
Deep Neural Network
60 pages
General Observation
No ratings yet
General Observation
93 pages
A Imprimer 4
No ratings yet
A Imprimer 4
4 pages
6 - Tips For Training Deep Neural Networks
No ratings yet
6 - Tips For Training Deep Neural Networks
59 pages
Neural Networks For Machine Learning: Lecture 9a Overview of Ways To Improve Generalization
No ratings yet
Neural Networks For Machine Learning: Lecture 9a Overview of Ways To Improve Generalization
39 pages
Deep Learning
No ratings yet
Deep Learning
299 pages
Deep Learning Unit 2
No ratings yet
Deep Learning Unit 2
25 pages
Bio Optimization of Deep Learning Network Architectures 22fguqp5
No ratings yet
Bio Optimization of Deep Learning Network Architectures 22fguqp5
11 pages
Lect 12 - Deep Feed Forward NN - Review
No ratings yet
Lect 12 - Deep Feed Forward NN - Review
93 pages
DL CS 7 M4 Live Class Flow
No ratings yet
DL CS 7 M4 Live Class Flow
37 pages
Survey of FNN
No ratings yet
Survey of FNN
25 pages
2 Deep Neural Network - 241120 - 095158
No ratings yet
2 Deep Neural Network - 241120 - 095158
47 pages
DL Test-2
No ratings yet
DL Test-2
28 pages
Cours 4
No ratings yet
Cours 4
30 pages
DGM Mid Sem
No ratings yet
DGM Mid Sem
39 pages
Chap 2 Training Feed Forward Neural Networks
No ratings yet
Chap 2 Training Feed Forward Neural Networks
22 pages
Training Neural Netwok: Data Set
No ratings yet
Training Neural Netwok: Data Set
35 pages
DL Unit 4&5
No ratings yet
DL Unit 4&5
27 pages
L10 Learning II Gradient Based Learning
No ratings yet
L10 Learning II Gradient Based Learning
72 pages
ML Lec 10 Neural Networks
No ratings yet
ML Lec 10 Neural Networks
87 pages
Deep Learning Hand Book 2024
No ratings yet
Deep Learning Hand Book 2024
185 pages
Deep Learning
100% (2)
Deep Learning
49 pages
2.game AI 1
No ratings yet
2.game AI 1
268 pages
Deep Learning Tutorial: Reference: Hung-Yi Lee
100% (1)
Deep Learning Tutorial: Reference: Hung-Yi Lee
179 pages
Initialization
No ratings yet
Initialization
16 pages
CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2015
No ratings yet
CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2015
14 pages
Deep Learning UNIT-II Part1
No ratings yet
Deep Learning UNIT-II Part1
48 pages
Module3 Notes
No ratings yet
Module3 Notes
18 pages
Backpropagation: Fundamentals and Applications for Preparing Data for Training in Deep Learning
From Everand
Backpropagation: Fundamentals and Applications for Preparing Data for Training in Deep Learning
Fouad Sabry
No ratings yet
Mastering Classification Algorithms for Machine Learning: Learn how to apply Classification algorithms for effective Machine Learning solutions (English Edition)
From Everand
Mastering Classification Algorithms for Machine Learning: Learn how to apply Classification algorithms for effective Machine Learning solutions (English Edition)
PARTHA MAJUMDAR
No ratings yet
PRO Argument Facebook Fake News Dissemination
No ratings yet
PRO Argument Facebook Fake News Dissemination
2 pages
Internship - Report NETWORKING PDF
No ratings yet
Internship - Report NETWORKING PDF
24 pages
Roach 1
No ratings yet
Roach 1
2 pages
Continuously Reinforced Concrete Pavement
100% (1)
Continuously Reinforced Concrete Pavement
2 pages
HTML CSS Bootstrap
No ratings yet
HTML CSS Bootstrap
80 pages
Comparison of Different DEM Generation Methods Based On Open Source Datasets
No ratings yet
Comparison of Different DEM Generation Methods Based On Open Source Datasets
23 pages
Acampora B.SM Racer Design and - Jul .1995.MT
No ratings yet
Acampora B.SM Racer Design and - Jul .1995.MT
12 pages
W90.3ELH - Winch Assembly
No ratings yet
W90.3ELH - Winch Assembly
6 pages
Understanding The Basics of Essbase Data and Cubes Operations - Jane Story
No ratings yet
Understanding The Basics of Essbase Data and Cubes Operations - Jane Story
30 pages
Api Tools Presentation
No ratings yet
Api Tools Presentation
18 pages
Ivosights
No ratings yet
Ivosights
49 pages
HTML File Paths
No ratings yet
HTML File Paths
7 pages
Urgent Reminder Notice - Registration For TCS NQT (Phase-3) Recruitment Drive For 2024 Graduating Batch
No ratings yet
Urgent Reminder Notice - Registration For TCS NQT (Phase-3) Recruitment Drive For 2024 Graduating Batch
3 pages
OceanStor Dorado 6.1.x HyperReplication Feature Guide For Block
No ratings yet
OceanStor Dorado 6.1.x HyperReplication Feature Guide For Block
286 pages
Account Statement 1 Sep 2024 To 21 Mar 2025
No ratings yet
Account Statement 1 Sep 2024 To 21 Mar 2025
8 pages
PyQt Tutorial
No ratings yet
PyQt Tutorial
11 pages
DBMS
No ratings yet
DBMS
19 pages
RT 900 User Guide
No ratings yet
RT 900 User Guide
83 pages
63Y Set-Up EN XX
No ratings yet
63Y Set-Up EN XX
12 pages
General Notes: Bridge Site Location Plan
No ratings yet
General Notes: Bridge Site Location Plan
1 page
Manual
No ratings yet
Manual
64 pages
NI Serial Hardware Specifications PDF
No ratings yet
NI Serial Hardware Specifications PDF
62 pages
Algeria (DZA) : Administrative Boundary Common Operational Database (COD-AB)
No ratings yet
Algeria (DZA) : Administrative Boundary Common Operational Database (COD-AB)
3 pages
Catalogo
No ratings yet
Catalogo
3 pages
Challenges and Opportunities of Artificial Intelligence
No ratings yet
Challenges and Opportunities of Artificial Intelligence
9 pages
Trellix Insights: Key Benefits
No ratings yet
Trellix Insights: Key Benefits
8 pages
R Art 42999-10
No ratings yet
R Art 42999-10
5 pages
PyTorch Geometric Temporal Spatiotemporal Signal Processing
No ratings yet
PyTorch Geometric Temporal Spatiotemporal Signal Processing
10 pages
Legal Education Board
No ratings yet
Legal Education Board
2 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

DL Unit 5 Notes 2

Uploaded by

DL Unit 5 Notes 2

Uploaded by

5/8/23, 3:08 PM Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II | by Aman Dalmia | Inveterate

veterate Learner | Med…

Open in app Sign up Sign In

Deep Learning Book: Chapter 8 —

Parameter Initialization Strategies

Algorithms with Adaptive Learning Rates

Approximate Second-Order Methods

Optimization Strategies and Meta-Algorithms

1. Parameter Initialization Strategies

We have limited understanding of neural network optimization but the one

Various suggestions have been made for appropriate initialization of the

U(-1 / √m, 1 / √m)

U(- √6 / (m+n), √6 / (m+n))

2. Algorithms with Adaptive Learning Rates

AdaGrad: As mentioned in Part I , it is important to incrementally decrease the

AdaGrad parameter update equation.

RMSProp: RMSProp addresses the problem caused by accumulated gradients in

Adam: Adapted from “adaptive moments”, it focuses on combining RMSProp and

3. Approximate Second-Order Methods

Newton’s Method: This is the most common higher-order derivative method

In the method of conjugate gradients, we seek a search direction that is conjugate to

4. Optimization Strategies and Meta-Algorithms

An example of a second-order term would be ϵ² g1 g2 ∏ wi. ∏ wi can be negligibly

𝛿 is added to ensure that σ is not equal to 0.

complete reparameterization of BN is given by replacing H with γH’ + β. This is

Coordinate Descent: Generally, a single weight update is made by taking the

Alternating Minimization). CD makes sense if either the parameters are clearly

Polyak Averaging: Polyak averaging consists of averaging several points in the

The figure below explains the intuition behind Polyak averaging:

Supervised Pre-training: Sometimes it’s hard to directly train to solve for a

background in integration and were asked to learn how to compute the

Continuation Methods and Curriculum Learning: Based on the explanation

function serves as a good initialization point for the next, harder-to-optimize

Patience is a virtue and I’m learning patience. It’s a

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.