
Lecture 9

Math Foundations Team
Introduction
Definition
Working rule
Example 1
Example 1
Example 2
Example 3
Motivation
Unconstrained Optimization

► We move in the direction of the negative gradient to decrease the objective function.
► We move until we encounter a point at which the gradient is zero (a minimal sketch follows below).
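A minimal sketch of this rule on a simple quadratic, in Python. The objective, the fixed learning rate, and the stopping tolerance are illustrative choices, not values from the lecture.

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Repeatedly step against the gradient until it is numerically zero."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:   # gradient ~ 0: we have reached a stationary point
            break
        theta = theta - alpha * g     # move in the direction of the negative gradient
    return theta

# Illustrative objective L(theta) = (theta_1 - 1)^2 + 2*(theta_2 + 3)^2
grad = lambda th: np.array([2 * (th[0] - 1), 4 * (th[1] + 3)])
print(gradient_descent(grad, theta0=[0.0, 0.0]))   # approximately [1, -3]
```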
Example
Optimization using gradient descent
Optimization using gradient descent
Example

Figure: Left: with a learning rate of 0.01, the local minimum is reached within a couple of steps. Right: when the learning rate is reduced to 0.001, relatively more steps are needed to reach the local minimum.
Example

Figure: Left: with a learning rate of 0.01, the minimum is reached. Right: when the learning rate is reduced to 0.001, relatively more steps are needed to reach the minimum.
Batch gradient descent
Mini-batch stochastic gradient
Stochastic Gradient Descent

► In the extreme case S can contain only one index chosen at random, and the approach is then called stochastic gradient descent.
► The key idea in stochastic gradient descent is that the gradient of the sample-specific objective function is an excellent approximation of the true gradient.
► When the learning rate decreases at a suitable rate, and under some mild assumptions, stochastic gradient descent can be shown to converge almost surely to a local minimum (see the sketch below).
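A sketch of this extreme case on a synthetic least-squares problem. The data, the decay schedule, and the number of updates are assumptions made for illustration; the slides' own example uses MNIST.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares data: y = X @ theta_true + noise
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.01 * rng.normal(size=n)

theta = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)                     # S = {i}: a single index chosen at random
    g = (X[i] @ theta - y[i]) * X[i]        # gradient of the sample-specific loss
    alpha_t = 0.1 / (1.0 + 0.001 * t)       # learning rate decaying at a suitable rate
    theta -= alpha_t * g

print(np.linalg.norm(theta - theta_true))   # small: the iterates approach the minimizer
```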
Example

Figure: loss vs. number of updates (epochs: 5, dataset: MNIST, layers: lin-relu-lin-relu-lin-relu, loss: cross-entropy, opt: Adam). Left: batch gradient descent; the entire dataset is used for every update (thus, 5 epochs result in 5 updates). Right: stochastic gradient descent; every update is based on a single sample only. Centre: mini-batch gradient descent; every update is done using a batch of 100 samples.
Example

Figure: Left: though the loss update is done for every sample in SGD, this plot shows the loss averaged over 100 such updates. Right: a summary of measured accuracy for the various methods.
Learning rate Algorithm 1 : Decay

► How are we to decide the value of the learning rate?
► What happens if we choose a large value for the learning rate and keep it constant? In this case, the algorithm might come close to the optimal answer in the very first iteration, but it will then oscillate around the optimal point.
► What happens if we choose a small value for the learning rate and keep it constant? In this case, it will take a very long time for the algorithm to converge to the optimal point. Both cases are illustrated in the toy run below.
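A toy illustration of both failure modes on the one-dimensional objective L(θ) = θ², whose gradient is 2θ. The two learning-rate values are illustrative, not from the slides.

```python
def run(alpha, theta=5.0, steps=20):
    """Constant-learning-rate gradient descent on L(theta) = theta^2."""
    path = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # gradient of theta^2 is 2*theta
        path.append(theta)
    return path

print(run(alpha=1.0)[:6])    # too large: [5, -5, 5, -5, ...] oscillates around the optimum at 0
print(run(alpha=0.001)[:6])  # too small: roughly [5.0, 4.99, 4.98, ...] creeps towards 0 very slowly
```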
Learning rate Algorithm 1 : Decay

► Choose a variable learning rate - large initially but decaying with time.
► This will enable the algorithm to make large strides towards the optimal point at first and then slowly converge.
► With a learning rate dependent on time, the update step becomes θ_{t+1} = θ_t − α_t ∇L(θ_t), as sketched below.
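A sketch of the time-dependent update θ_{t+1} = θ_t − α_t ∇L(θ_t) with one commonly used decay schedule, α_t = α_0 / (1 + k·t). The schedule and its constants are assumptions, since the slides do not fix a particular form.

```python
import numpy as np

def gd_with_decay(grad, theta0, alpha0=0.9, k=0.05, steps=200):
    """theta_{t+1} = theta_t - alpha_t * grad(theta_t) with alpha_t = alpha0 / (1 + k*t)."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(steps):
        alpha_t = alpha0 / (1.0 + k * t)   # large strides early, smaller and smaller later
        theta = theta - alpha_t * grad(theta)
    return theta

grad = lambda th: 2 * th                          # gradient of L(theta) = ||theta||^2
print(gd_with_decay(grad, theta0=[5.0, -3.0]))    # close to the optimum at the origin
```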
Learning rate Algorithm 1 : Decay
Learning rate Algorithm 2 : Line search

► Line search uses the optimum step-size directly in order to provide the best improvement.
► It is rarely used in vanilla gradient descent because of its computational expense, but is helpful in some specialized variations of gradient descent.
► Let L(θ) be the function being optimized, and let d_t = −∇L(θ_t).
► The update step is θ_{t+1} = θ_t + α_t d_t.
► In line search the learning rate α_t is chosen at the t-th step so as to minimize the value of the objective function at θ_{t+1}.
► Therefore the step-size α_t is computed as α_t = argmin_α L(θ_t + α d_t). For a quadratic objective this minimizer has a closed form, as sketched below.
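For a quadratic objective L(θ) = ½ θᵀAθ − bᵀθ, setting the derivative of L(θ_t + α d_t) with respect to α to zero gives the closed form α_t = (d_tᵀ d_t) / (d_tᵀ A d_t). A sketch with an arbitrary positive-definite A and vector b, both illustrative:

```python
import numpy as np

# Quadratic objective L(theta) = 0.5 * theta^T A theta - b^T theta, grad L = A theta - b.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])        # illustrative positive-definite matrix
b = np.array([1.0, -2.0])         # illustrative vector

theta = np.zeros(2)
for t in range(50):
    d = -(A @ theta - b)                  # d_t = -grad L(theta_t)
    if np.linalg.norm(d) < 1e-10:
        break
    alpha = (d @ d) / (d @ (A @ d))       # exact line search: argmin_alpha L(theta_t + alpha*d_t)
    theta = theta + alpha * d             # theta_{t+1} = theta_t + alpha_t * d_t

print(theta)                   # converges to ...
print(np.linalg.solve(A, b))   # ... the exact minimizer A^{-1} b
```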
Line Search example
Line search Algorithms

► One question remains - how do we perform the optimization min_α L(θ_t + α d_t)?
► An important property that we exploit in typical line-search settings is that the objective function is a unimodal function of α.
► This is especially true if we do not use the original objective function but quadratic or convex approximations of it.
► The first step in the optimization is to identify a range [0, α_max] in which to perform the search for the optimum α.
Line search Algorithms

► We can sweep through geometrically increasing values of α and evaluate the objective function at each (a bracketing sketch follows below).
► It is then possible to narrow the search interval by using binary search, the golden-section search method, or the Armijo rule.
► The first two of these methods are exact methods and require the objective function to be unimodal in α; the last is an inexact method that does not rely on unimodality.
► The Armijo rule has broader applicability than either the binary search or golden-section search methods. It will be part of Assignment 2.
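A sketch of the geometric sweep used to pick α_max. The starting value and growth factor are assumptions; phi denotes the one-dimensional function α ↦ L(θ_t + α d_t).

```python
def bracket_alpha_max(phi, alpha_start=1e-4, growth=2.0, max_iter=60):
    """Sweep geometrically increasing step sizes until phi starts increasing again.

    For a unimodal phi, the minimum is then bracketed in [0, alpha_max]."""
    alpha, best = alpha_start, phi(0.0)
    for _ in range(max_iter):
        val = phi(alpha)
        if val > best:            # phi has turned upwards: the minimum lies to the left
            return alpha
        best, alpha = val, alpha * growth
    return alpha                  # fallback if the sweep never turned upwards

# Illustrative phi(alpha) = L(theta_t + alpha*d_t), unimodal with its minimum at alpha = 3
phi = lambda a: (a - 3.0) ** 2
print(bracket_alpha_max(phi))     # an alpha_max that brackets the minimum (here ~6.55)
```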
Line search Algorithms -Binary search
Line search Algorithms-Golden-section search

► Initialize the search interval to [a, b] = [0, α_max].
► We use the fact that for any mid-samples m1, m2 in the region [a, b], where a < m1 < m2 < b, at least one of the intervals [a, m1] or [m2, b] can be dropped. Sometimes we can go as far as dropping [a, m2) or (m1, b].
► When α = a yields the minimum of the objective function H(α) = L(θ_t + α d_t) among the evaluated points, we can drop the interval (m1, b].
► Similarly, when α = b yields the minimum of H(α), we can drop the interval [a, m2). When α = m1 is the value at which the minimum is achieved, we can drop (m2, b].
► When α = m2 is the value at which the minimum is achieved, we can drop [a, m1).
Line search Algorithms-Golden-section search

► The new bounds on the search interval [a, b] are reset based on
the exclusions mentioned in the previous slide.
► At the end of the process we are left with an interval
containing 0 or 1 evaluated point.
► If we have an interval containing no evaluated point, we select a
random point α = p in the reset interval [a, b], and then another
point q in the larger of the intervals [a, p] and [p, b].
► On the other hand if we are left with an interval [a, b]
containing a single evaluated point α = p, then we select α =
q in the larger of the intervals [a, p] and [p, b].
► This yields another four points on which to continue the golden-section search. We continue until we achieve the desired accuracy (a standard implementation is sketched below).
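A standard golden-section search over [0, α_max] for a unimodal H(α). This textbook variant compares only the two interior points at each iteration, a slight simplification of the endpoint bookkeeping described above; the tolerance is an assumption.

```python
import math

def golden_section(H, a, b, tol=1e-6):
    """Shrink [a, b] around the minimizer of a unimodal H by a fixed ratio per step."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0    # 1/golden ratio, about 0.618
    m1 = b - invphi * (b - a)                # interior points with a < m1 < m2 < b
    m2 = a + invphi * (b - a)
    f1, f2 = H(m1), H(m2)
    while b - a > tol:
        if f1 < f2:                          # minimum cannot lie in (m2, b]: drop it
            b, m2, f2 = m2, m1, f1
            m1 = b - invphi * (b - a)
            f1 = H(m1)
        else:                                # minimum cannot lie in [a, m1): drop it
            a, m1, f1 = m1, m2, f2
            m2 = a + invphi * (b - a)
            f2 = H(m2)
    return 0.5 * (a + b)

H = lambda alpha: (alpha - 0.7) ** 2 + 1.0   # unimodal in alpha, minimum at alpha = 0.7
print(golden_section(H, 0.0, 2.0))           # approximately 0.7
```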
When do we use line search?

► The line-search method can be shown to converge to a local optimum, but it is computationally expensive. For this reason, it is rarely used in vanilla gradient descent.
► Some methods like Newton's method, however, require exact line search.
► Fast inexact methods like Armijo's rule are used in vanilla gradient descent (a backtracking sketch follows below).
► One advantage of using exact line search is that fewer steps are needed to achieve convergence to a local optimum. This might more than compensate for the computational expense of individual steps.
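A sketch of the Armijo (backtracking) rule mentioned above: it shrinks a trial step until a sufficient-decrease condition holds. The constants alpha0, rho, and c are conventional choices, not values fixed by the lecture.

```python
import numpy as np

def armijo_step(L, grad, theta, alpha0=1.0, rho=0.5, c=1e-4):
    """Backtrack alpha0, alpha0*rho, alpha0*rho^2, ... until sufficient decrease holds."""
    g = grad(theta)
    d = -g                                               # descent direction
    alpha = alpha0
    while L(theta + alpha * d) > L(theta) + c * alpha * (g @ d):
        alpha *= rho                                     # inexact: no unimodality required
    return alpha

# Illustrative objective and its gradient
L = lambda th: (th[0] - 1.0) ** 2 + 2.0 * (th[1] + 3.0) ** 2
grad = lambda th: np.array([2.0 * (th[0] - 1.0), 4.0 * (th[1] + 3.0)])
print(armijo_step(L, grad, np.array([0.0, 0.0])))        # an acceptable step size (here 0.25)
```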
