
Lecture 9

Math Foundations Team
Introduction
Definition
Working rule
Example 1
Example 1
Example 2
Example 3
Motivation
Unconstrained Optimization

► We move in the direction of the negative gradient to decrease the objective function.
► We move until we encounter a point at which the gradient is zero (a minimal sketch follows below).
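A minimal sketch of this rule on a simple quadratic, in Python. The objective, the fixed learning rate, and the stopping tolerance are illustrative choices, not values from the lecture.

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Repeatedly step against the gradient until it is numerically zero."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:   # gradient ~ 0: we have reached a stationary point
            break
        theta = theta - alpha * g     # move in the direction of the negative gradient
    return theta

# Illustrative objective L(theta) = (theta_1 - 1)^2 + 2*(theta_2 + 3)^2
grad = lambda th: np.array([2 * (th[0] - 1), 4 * (th[1] + 3)])
print(gradient_descent(grad, theta0=[0.0, 0.0]))   # approximately [1, -3]
```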
Example
Optimization using gradient descent
Optimization using gradient descent
Example

Figure: Left: with a learning rate of 0.01, the local minimum is reached within a couple of steps. Right: when the learning rate is reduced to 0.001, relatively more steps are needed to reach the local minimum.
Example

Figure: Left: with a learning rate of 0.01, the minimum is reached. Right: when the learning rate is reduced to 0.001, relatively more steps are needed to reach the minimum.
Batch gradient descent
Mini-batch stochastic gradient
Stochastic Gradient Descent

► In the extreme case S can contain only one index chosen at random, and the approach is then called stochastic gradient descent.
► The key idea in stochastic gradient descent is that the gradient of the sample-specific objective function is an excellent approximation of the true gradient.
► When the learning rate decreases at a suitable rate, and under some mild assumptions, stochastic gradient descent can be shown to converge almost surely to a local minimum (see the sketch below).
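A sketch of this extreme case on a synthetic least-squares problem. The data, the decay schedule, and the number of updates are assumptions made for illustration; the slides' own example uses MNIST.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares data: y = X @ theta_true + noise
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.01 * rng.normal(size=n)

theta = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)                     # S = {i}: a single index chosen at random
    g = (X[i] @ theta - y[i]) * X[i]        # gradient of the sample-specific loss
    alpha_t = 0.1 / (1.0 + 0.001 * t)       # learning rate decaying at a suitable rate
    theta -= alpha_t * g

print(np.linalg.norm(theta - theta_true))   # small: the iterates approach the minimizer
```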
Example

Figure: loss vs. number of updates (epochs: 5, dataset: MNIST, layers: lin-relu-lin-relu-lin-relu, loss: cross-entropy, opt: Adam). Left: batch gradient descent; the entire dataset is used for every update (thus, 5 epochs result in 5 updates). Right: stochastic gradient descent; every update is based on a single sample only. Centre: mini-batch gradient descent; every update is done using a batch of 100 samples.
Example

Figure: Left: though the loss update is done for every sample in SGD, this plot shows the loss averaged over 100 such updates. Right: a summary of measured accuracy for the various methods.
Learning rate Algorithm 1 : Decay

► How are we to decide the value of the learning rate?
► What happens if we choose a large value for the learning rate and keep it constant? In this case, the algorithm might come close to the optimal answer in the very first iteration, but it will then oscillate around the optimal point.
► What happens if we choose a small value for the learning rate and keep it constant? In this case, it will take a very long time for the algorithm to converge to the optimal point. Both cases are illustrated in the toy run below.
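A toy illustration of both failure modes on the one-dimensional objective L(θ) = θ², whose gradient is 2θ. The two learning-rate values are illustrative, not from the slides.

```python
def run(alpha, theta=5.0, steps=20):
    """Constant-learning-rate gradient descent on L(theta) = theta^2."""
    path = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # gradient of theta^2 is 2*theta
        path.append(theta)
    return path

print(run(alpha=1.0)[:6])    # too large: [5, -5, 5, -5, ...] oscillates around the optimum at 0
print(run(alpha=0.001)[:6])  # too small: roughly [5.0, 4.99, 4.98, ...] creeps towards 0 very slowly
```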
Learning rate Algorithm 1 : Decay

► Choose a variable learning rate - large initially but decaying with time.
► This will enable the algorithm to make large strides towards the optimal point at first and then slowly converge.
► With a learning rate dependent on time, the update step becomes θ_{t+1} = θ_t − α_t ∇L(θ_t), as sketched below.
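A sketch of the time-dependent update θ_{t+1} = θ_t − α_t ∇L(θ_t) with one commonly used decay schedule, α_t = α_0 / (1 + k·t). The schedule and its constants are assumptions, since the slides do not fix a particular form.

```python
import numpy as np

def gd_with_decay(grad, theta0, alpha0=0.9, k=0.05, steps=200):
    """theta_{t+1} = theta_t - alpha_t * grad(theta_t) with alpha_t = alpha0 / (1 + k*t)."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(steps):
        alpha_t = alpha0 / (1.0 + k * t)   # large strides early, smaller and smaller later
        theta = theta - alpha_t * grad(theta)
    return theta

grad = lambda th: 2 * th                          # gradient of L(theta) = ||theta||^2
print(gd_with_decay(grad, theta0=[5.0, -3.0]))    # close to the optimum at the origin
```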
Learning rate Algorithm 1 : Decay
Learning rate Algorithm 2 : Line search

► Line search uses the optimum step-size directly in order to provide the best improvement.
► It is rarely used in vanilla gradient descent because of its computational expense, but is helpful in some specialized variations of gradient descent.
► Let L(θ) be the function being optimized, and let d_t = −∇L(θ_t).
► The update step is θ_{t+1} = θ_t + α_t d_t.
► In line search the learning rate α_t is chosen at the t-th step so as to minimize the value of the objective function at θ_{t+1}.
► Therefore the step-size α_t is computed as α_t = argmin_α L(θ_t + α d_t). For a quadratic objective this minimizer has a closed form, as sketched below.
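For a quadratic objective L(θ) = ½ θᵀAθ − bᵀθ, setting the derivative of L(θ_t + α d_t) with respect to α to zero gives the closed form α_t = (d_tᵀ d_t) / (d_tᵀ A d_t). A sketch with an arbitrary positive-definite A and vector b, both illustrative:

```python
import numpy as np

# Quadratic objective L(theta) = 0.5 * theta^T A theta - b^T theta, grad L = A theta - b.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])        # illustrative positive-definite matrix
b = np.array([1.0, -2.0])         # illustrative vector

theta = np.zeros(2)
for t in range(50):
    d = -(A @ theta - b)                  # d_t = -grad L(theta_t)
    if np.linalg.norm(d) < 1e-10:
        break
    alpha = (d @ d) / (d @ (A @ d))       # exact line search: argmin_alpha L(theta_t + alpha*d_t)
    theta = theta + alpha * d             # theta_{t+1} = theta_t + alpha_t * d_t

print(theta)                   # converges to ...
print(np.linalg.solve(A, b))   # ... the exact minimizer A^{-1} b
```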
Line Search example
Line search Algorithms

► One question remains - how do we perform the optimization min_α L(θ_t + α d_t)?
► An important property that we exploit in typical line-search settings is that the objective function is a unimodal function of α.
► This is especially true if we do not use the original objective function but quadratic or convex approximations of it.
► The first step in the optimization is to identify a range [0, α_max] in which to perform the search for the optimum α.
Line search Algorithms

► We can sweep through geometrically increasing values of α and evaluate the objective function at each (a bracketing sketch follows below).
► It is then possible to narrow the search interval by using binary search, the golden-section search method, or the Armijo rule.
► The first two of these methods are exact methods and require the objective function to be unimodal in α; the last is an inexact method that does not rely on unimodality.
► The Armijo rule has broader applicability than either the binary search or golden-section search methods. It will be part of Assignment 2.
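A sketch of the geometric sweep used to pick α_max. The starting value and growth factor are assumptions; phi denotes the one-dimensional function α ↦ L(θ_t + α d_t).

```python
def bracket_alpha_max(phi, alpha_start=1e-4, growth=2.0, max_iter=60):
    """Sweep geometrically increasing step sizes until phi starts increasing again.

    For a unimodal phi, the minimum is then bracketed in [0, alpha_max]."""
    alpha, best = alpha_start, phi(0.0)
    for _ in range(max_iter):
        val = phi(alpha)
        if val > best:            # phi has turned upwards: the minimum lies to the left
            return alpha
        best, alpha = val, alpha * growth
    return alpha                  # fallback if the sweep never turned upwards

# Illustrative phi(alpha) = L(theta_t + alpha*d_t), unimodal with its minimum at alpha = 3
phi = lambda a: (a - 3.0) ** 2
print(bracket_alpha_max(phi))     # an alpha_max that brackets the minimum (here ~6.55)
```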
Line search Algorithms -Binary search
Line search Algorithms-Golden-section search

► Initialize the search interval to [a, b] = [0, α_max].
► We use the fact that for any mid-samples m1, m2 in the region [a, b], where a < m1 < m2 < b, at least one of the intervals [a, m1] or [m2, b] can be dropped. Sometimes we can go as far as dropping [a, m2) or (m1, b].
► When α = a yields the minimum of the objective function H(α) = L(θ_t + α d_t) among the evaluated points, we can drop the interval (m1, b].
► Similarly, when α = b yields the minimum of H(α), we can drop the interval [a, m2). When α = m1 is the value at which the minimum is achieved, we can drop (m2, b].
► When α = m2 is the value at which the minimum is achieved, we can drop [a, m1).
Line search Algorithms-Golden-section search

► The new bounds on the search interval [a, b] are reset based on
the exclusions mentioned in the previous slide.
► At the end of the process we are left with an interval
containing 0 or 1 evaluated point.
► If we have an interval containing no evaluated point, we select a
random point α = p in the reset interval [a, b], and then another
point q in the larger of the intervals [a, p] and [p, b].
► On the other hand if we are left with an interval [a, b]
containing a single evaluated point α = p, then we select α =
q in the larger of the intervals [a, p] and [p, b].
► This yields another four points on which to continue the golden-section search. We continue until we achieve the desired accuracy (a standard implementation is sketched below).
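A standard golden-section search over [0, α_max] for a unimodal H(α). This textbook variant compares only the two interior points at each iteration, a slight simplification of the endpoint bookkeeping described above; the tolerance is an assumption.

```python
import math

def golden_section(H, a, b, tol=1e-6):
    """Shrink [a, b] around the minimizer of a unimodal H by a fixed ratio per step."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0    # 1/golden ratio, about 0.618
    m1 = b - invphi * (b - a)                # interior points with a < m1 < m2 < b
    m2 = a + invphi * (b - a)
    f1, f2 = H(m1), H(m2)
    while b - a > tol:
        if f1 < f2:                          # minimum cannot lie in (m2, b]: drop it
            b, m2, f2 = m2, m1, f1
            m1 = b - invphi * (b - a)
            f1 = H(m1)
        else:                                # minimum cannot lie in [a, m1): drop it
            a, m1, f1 = m1, m2, f2
            m2 = a + invphi * (b - a)
            f2 = H(m2)
    return 0.5 * (a + b)

H = lambda alpha: (alpha - 0.7) ** 2 + 1.0   # unimodal in alpha, minimum at alpha = 0.7
print(golden_section(H, 0.0, 2.0))           # approximately 0.7
```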
When do we use line search?

► The line-search method can be shown to converge to a local optimum, but it is computationally expensive. For this reason, it is rarely used in vanilla gradient descent.
► Some methods like Newton's method, however, require exact line search.
► Fast inexact methods like Armijo's rule are used in vanilla gradient descent (a backtracking sketch follows below).
► One advantage of using exact line search is that fewer steps are needed to achieve convergence to a local optimum. This might more than compensate for the computational expense of individual steps.
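A sketch of the Armijo (backtracking) rule mentioned above: it shrinks a trial step until a sufficient-decrease condition holds. The constants alpha0, rho, and c are conventional choices, not values fixed by the lecture.

```python
import numpy as np

def armijo_step(L, grad, theta, alpha0=1.0, rho=0.5, c=1e-4):
    """Backtrack alpha0, alpha0*rho, alpha0*rho^2, ... until sufficient decrease holds."""
    g = grad(theta)
    d = -g                                               # descent direction
    alpha = alpha0
    while L(theta + alpha * d) > L(theta) + c * alpha * (g @ d):
        alpha *= rho                                     # inexact: no unimodality required
    return alpha

# Illustrative objective and its gradient
L = lambda th: (th[0] - 1.0) ** 2 + 2.0 * (th[1] + 3.0) ** 2
grad = lambda th: np.array([2.0 * (th[0] - 1.0), 4.0 * (th[1] + 3.0)])
print(armijo_step(L, grad, np.array([0.0, 0.0])))        # an acceptable step size (here 0.25)
```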
