
Gradient Descent

Dr. Xiaowei Huang


https://cgi.csc.liv.ac.uk/~xiaowei/
Up to now,
• Three machine learning algorithms:
  • decision tree learning
  • k-NN
  • linear regression

Only the optimization objectives have been discussed so far, but how do we actually solve them?
Today’s Topics
• Derivative
• Gradient
• Directional Derivative
• Method of Gradient Descent
• Example: Gradient Descent on Linear Regression
• Linear Regression: Analytical Solution
Problem Statement: Gradient-Based Optimization
• Most ML algorithms involve optimization
• Minimize/maximize a function f(x) by altering x
  • Usually stated as a minimization, e.g., of the loss
  • Maximization is accomplished by minimizing −f(x)
• f(x) is referred to as the objective function or criterion
  • In minimization, also referred to as the loss function, cost function, or error function
• Example:
  • linear least squares
  • linear regression
• Denote the optimum value by x* = arg min f(x)
Derivative
Derivative of a function
• Suppose we have a function y = f(x), where x and y are real numbers
• The derivative of the function is denoted f'(x) or dy/dx
• The derivative f'(x) gives the slope of f(x) at the point x
• It specifies how to scale a small change in the input to obtain the corresponding change in the output:
  f(x + ε) ≈ f(x) + ε f'(x)
• It tells us how to make a small change in the input to make a small improvement in y

Recall: what is the derivative of each of the following functions?
  f(x) = x²
  f(x) = eˣ
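As a quick numerical check (a minimal Python sketch, not from the slides; the test point and step are illustrative), the approximation f(x + ε) ≈ f(x) + ε f'(x) can be verified directly:

# Minimal sketch: verify f(x + eps) ≈ f(x) + eps * f'(x) numerically.
def f(x):
    return x ** 2          # f(x) = x^2, so f'(x) = 2x

def f_prime(x):
    return 2 * x

x, eps = 3.0, 1e-4
exact = f(x + eps)                     # 9.00060001
approx = f(x) + eps * f_prime(x)       # 9.0006
print(exact, approx)                   # the two values agree to ~1e-8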

Calculus in Optimization
• Suppose we have a function y = f(x), where x, y are real numbers
• Sign function: sign(a) = +1 if a > 0, 0 if a = 0, −1 if a < 0
• We know that
  f(x − ε sign(f'(x))) < f(x)
  for small enough ε
• Therefore, we can reduce f(x) by moving x in small steps with the opposite sign of the derivative
• This technique is called gradient descent (Cauchy, 1847)

Why opposite?
Example
• Function f(x) = x², f'(x) = 2x, step size ε = 0.1

• For x = −2: f'(−2) = −4, sign(f'(−2)) = −1
  • f(−2 − ε·(−1)) = f(−1.9) < f(−2)

• For x = 2: f'(2) = 4, sign(f'(2)) = 1
  • f(2 − ε·1) = f(1.9) < f(2)
Gradient Descent Illustrated
• For x < 0, f(x) decreases as x increases, and f'(x) < 0
• For x > 0, f(x) increases as x increases, and f'(x) > 0
• Use f'(x) to follow the function downhill
• Reduce f(x) by moving in the direction with the opposite sign of the derivative f'(x)
Stationary Points, Local Optima
• At these points the derivative provides no information about the direction to move
• Points where f'(x) = 0 are known as stationary or critical points
• Local minimum/maximum: a point where f(x) is lower/higher than at all its neighbors
• Saddle points: neither maxima nor minima
Presence of multiple minima
• Optimization algorithms may fail to find the global minimum
• We generally accept such solutions
Gradient
Minimizing functions with multidimensional inputs
• We often minimize functions with multidimensional inputs

• For minimization to make sense there must still be only one (scalar) output
Functions with multiple inputs
• Partial derivative ∂f(x)/∂xᵢ measures how f changes as only the variable xᵢ increases at point x

• The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector
• The gradient is the vector containing all of the partial derivatives, denoted ∇ₓf(x)
Example
• y = 5x₁⁵ + 4x₂ + x₃² + 2
• What is the exact gradient at the instance (1, 2, 3)?

• The gradient is (25x₁⁴, 4, 2x₃)

• At the instance (1, 2, 3), it is (25, 4, 6)
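A minimal Python sketch (not from the slides; the helper name and step h are illustrative) that checks this hand-computed gradient with central finite differences:

# Sketch: check the gradient of y = 5*x1^5 + 4*x2 + x3^2 + 2 numerically.
def y(x):
    x1, x2, x3 = x
    return 5 * x1 ** 5 + 4 * x2 + x3 ** 2 + 2

def numerical_gradient(f, x, h=1e-6):
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h                       # perturb coordinate i up
        xm[i] -= h                       # and down
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

print(numerical_gradient(y, [1.0, 2.0, 3.0]))   # ≈ [25.0, 4.0, 6.0]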
Functions with multiple inputs
• The gradient is the vector containing all of the partial derivatives, denoted ∇ₓf(x)

• Element i of the gradient is the partial derivative of f with respect to xᵢ

• Critical points are where every element of the gradient is equal to zero
Example
• y = 5x₁⁵ + 4x₂ + x₃² + 2
• What are the critical points?

• The gradient is (25x₁⁴, 4, 2x₃)

• Setting 25x₁⁴ = 0 and 2x₃ = 0 holds for all instances whose x₁ and x₃ are 0, but the second component is always 4 ≠ 0. So there is no critical point.
Directional Derivative
Directional Derivative
• The directional derivative in direction u (a unit vector) is the slope of f in direction u, i.e., the derivative of f(x + αu) with respect to α, evaluated at α = 0
• By the chain rule, this evaluates to uᵀ∇ₓf(x)

• Example: let u be the standard basis vector eᵢ (all zeros except a 1 in position i) in Cartesian coordinates; then the directional derivative in direction u is the partial derivative ∂f(x)/∂xᵢ
Directional Derivative
• To minimize f, find the direction in which f decreases the fastest:
  min over unit vectors u of uᵀ∇ₓf(x) = min ‖u‖₂ ‖∇ₓf(x)‖₂ cos θ
• where θ is the angle between u and the gradient

• Substituting ‖u‖₂ = 1 and ignoring factors that do not depend on u, this simplifies to minimizing cos θ

• This is minimized when u points in the direction opposite to the gradient

• In other words, the gradient points directly uphill, and the negative gradient points directly downhill
Method of Gradient Descent
Method of Gradient Descent
• The gradient points directly uphill, and the negative gradient points directly downhill
• Thus we can decrease f by moving in the direction of the negative gradient
• This is known as the method of steepest descent, or gradient descent
• Steepest descent proposes a new point
  x' = x − ε ∇ₓf(x)
• where ε is the learning rate, a positive scalar, often set to a small constant
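A minimal Python sketch of this update rule (the example function, starting point, and hyperparameters are illustrative choices):

# Minimal gradient descent sketch: x' = x - eps * gradient.
import numpy as np

def gradient_descent(grad, x0, eps=0.1, tol=1e-6, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop when the gradient is (near) zero
            break
        x = x - eps * g               # step opposite to the gradient
    return x

# Example: minimize f(x) = x1^2 + x2^2, whose gradient is (2*x1, 2*x2)
print(gradient_descent(lambda x: 2 * x, [4.0, -3.0]))   # ≈ [0, 0]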


Choosing ε: Line Search
• We can choose ε in several different ways
• Popular approach: set ε to a small constant
• Another approach is called line search:
  • Evaluate f(x − ε ∇ₓf(x)) for several values of ε and choose the one that results in the smallest objective function value
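A sketch of this grid-style line search in Python (the candidate values and example function are illustrative; more refined line-search strategies exist):

# Sketch: try several learning rates and keep the best resulting point.
import numpy as np

def line_search_step(f, grad, x, candidates=(1e-3, 1e-2, 1e-1, 1.0)):
    g = grad(x)
    trials = [x - eps * g for eps in candidates]   # one trial point per eps
    return min(trials, key=f)                      # smallest objective wins

f = lambda x: np.sum(x ** 2)
x = np.array([4.0, -3.0])
print(line_search_step(f, lambda x: 2 * x, x))     # eps = 0.1 wins from this grid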
Example: Gradient Descent on Linear Regression

• Linear regression: f(w) = ½ ‖Xw − y‖₂²

• The gradient is ∇w f(w) = Xᵀ(Xw − y)

• The gradient descent algorithm is:
  • Set the step size ε and tolerance δ to small, positive numbers.
  • While ‖Xᵀ(Xw − y)‖₂ > δ, do
    w ← w − ε Xᵀ(Xw − y)
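A runnable sketch of this algorithm on toy data (the data, ε, and δ below are illustrative choices):

# Sketch: gradient descent for f(w) = 0.5 * ||Xw - y||^2 on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = np.zeros(3)
eps, delta = 0.01, 1e-8
for _ in range(10_000):                  # cap iterations for safety
    g = X.T @ (X @ w - y)                # gradient = X^T (Xw - y)
    if np.linalg.norm(g) <= delta:
        break
    w = w - eps * g
print(w)                                 # ≈ [2.0, -1.0, 0.5]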
Linear Regression: Analytical solution

Convergence of Steepest Descent
• Steepest descent converges when every element of the gradient is zero
  • In practice, when it is very close to zero
• We may be able to avoid the iterative algorithm and jump directly to the critical point by solving the equation ∇ₓf(x) = 0 for x
Linear Regression: Analytical solution
• Linear regression: f(w) = ½ ‖Xw − y‖₂²

• The gradient is ∇w f(w) = Xᵀ(Xw − y)

• Let Xᵀ(Xw − y) = 0

• Then, we have w = (XᵀX)⁻¹Xᵀy
Linear Regression: Analytical solution
• Algebraic view of the minimizer
  • If X were square and invertible, we could just solve Xw = y and get w = X⁻¹y
  • But typically X is a tall matrix (more examples than features), so we solve the normal equations (XᵀX)w = Xᵀy instead
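A sketch of the analytical solution in Python; it solves the normal equations (XᵀX)w = Xᵀy with a linear solver rather than forming an explicit inverse (the toy data is illustrative):

# Sketch: analytical solution w = (X^T X)^(-1) X^T y via a linear solve.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # tall matrix: more rows than columns
y = X @ np.array([2.0, -1.0, 0.5])

w = np.linalg.solve(X.T @ X, X.T @ y)    # solves (X^T X) w = X^T y
print(w)                                 # ≈ [2.0, -1.0, 0.5]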
Generalization to discrete spaces
Generalization to discrete spaces
• Gradient descent is limited to continuous spaces
• The concept of repeatedly making the best small move can be generalized to discrete spaces
• Ascending an objective function of discrete parameters is called hill climbing
Exercises
• Given the function f(x) = eˣ/(1 + eˣ), how many critical points does it have?

• Given the function f(x₁, x₂) = 9x₁² + 3x₂ + 4, how many critical points does it have?

• Please write a program to do the following: given any differentiable function (such as the two above), a step size ε, a starting point x, and a target point x', determine whether it is possible to reach x' from x. If possible, in how many steps? Adjust ε to see how the answer changes.
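One possible starting point for the programming exercise, as a rough sketch; the reachability criterion (passing within a tolerance of x'), the step cap, and the default values are assumptions, not part of the exercise statement:

# Rough sketch (criteria below are assumptions): follow the negative
# derivative from x with step size eps and report whether we come within
# a small tolerance of the target x_prime, and after how many steps.
def reach(f_prime, x, x_prime, eps=0.1, tol=1e-3, max_steps=100_000):
    for step in range(max_steps):
        if abs(x - x_prime) <= tol:
            return True, step            # reached the target in `step` steps
        x = x - eps * f_prime(x)         # one gradient descent step
    return False, max_steps              # did not reach x_prime

print(reach(lambda x: 2 * x, 2.0, 0.0))  # f(x) = x^2: e.g., (True, 35)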
Extended Materials
Beyond Gradient: Jacobian and Hessian matrices
• Sometimes we need to find all the derivatives of a function whose input and output are both vectors
• If we have a function f: ℝᵐ → ℝⁿ
• Then the n × m matrix of partial derivatives, known as the Jacobian matrix J, is defined as
  Jᵢ,ⱼ = ∂f(x)ᵢ/∂xⱼ
Second derivative
• The derivative of a derivative
• For a function f: ℝⁿ → ℝ, the derivative with respect to xᵢ of the derivative of f with respect to xⱼ is denoted ∂²f/∂xᵢ∂xⱼ
• In a single dimension we can denote it by f''(x)
• It tells us how the first derivative will change as we vary the input
• This is important, as it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone
Second derivative measures curvature
• (Figure: quadratic functions with different curvatures)
Hessian
• The second derivative with many dimensions
• H(f)(x) is defined as
  H(f)(x)ᵢ,ⱼ = ∂²f(x)/∂xᵢ∂xⱼ
• The Hessian is the Jacobian of the gradient

• The Hessian matrix is symmetric, i.e., Hᵢ,ⱼ = Hⱼ,ᵢ,
  • anywhere that the second partial derivatives are continuous
  • So the Hessian matrix can be decomposed into a set of real eigenvalues and an orthogonal basis of eigenvectors
  • The eigenvalues of H are useful for determining the learning rate, as seen in the next two slides
Role of eigenvalues of Hessian
• The second derivative in direction d (a unit vector) is dᵀHd
• If d is an eigenvector, the second derivative in that direction is given by its eigenvalue
  • For other directions, it is a weighted average of the eigenvalues (weights between 0 and 1, with eigenvectors at smaller angles to d receiving larger weight)
• The maximum eigenvalue determines the maximum second derivative, and the minimum eigenvalue determines the minimum second derivative
Learning rate from Hessian
• Second-order Taylor series of f(x) around the current point x⁽⁰⁾:
  f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀg + ½ (x − x⁽⁰⁾)ᵀH(x − x⁽⁰⁾)
• where g is the gradient and H is the Hessian at x⁽⁰⁾

• If we use learning rate ε, the new point x is given by x⁽⁰⁾ − εg. Thus we get
  f(x⁽⁰⁾ − εg) ≈ f(x⁽⁰⁾) − ε gᵀg + ½ ε² gᵀHg

• There are three terms:
  • the original value of f,
  • the expected improvement due to the slope, and
  • the correction to be applied due to curvature
• Solving for the step size that makes this approximation smallest (when gᵀHg > 0) gives
  ε* = gᵀg / (gᵀHg)
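A small Python sketch of this optimal step size on a quadratic (the Hessian and starting point are illustrative):

# Sketch: optimal step eps* = (g^T g) / (g^T H g) for f(x) = 0.5 * x^T H x.
import numpy as np

H = np.array([[5.0, 0.0],
              [0.0, 1.0]])          # two different curvatures
x0 = np.array([1.0, 1.0])
g = H @ x0                          # gradient of 0.5 x^T H x at x0

eps_star = (g @ g) / (g @ H @ g)    # step minimizing the second-order model
print(eps_star, x0 - eps_star * g)  # ≈ 0.206, new point ≈ [-0.03, 0.79]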
Second Derivative Test: Critical Points
• At a critical point, f'(x) = 0
• When f''(x) > 0, the first derivative f'(x) increases as we move to the right and decreases as we move to the left
  • We conclude that x is a local minimum
• For a local maximum, f'(x) = 0 and f''(x) < 0
• When f''(x) = 0, the test is inconclusive: x may be a saddle point or part of a flat region
Multidimensional Second Derivative Test
• In multiple dimensions, we need to examine the second derivatives in all dimensions
• Eigendecomposition generalizes the test
• Test the eigenvalues of the Hessian to determine whether a critical point is a local maximum, a local minimum, or a saddle point
• When H is positive definite (all eigenvalues are positive), the point is a local minimum
• Similarly, negative definite (all eigenvalues negative) implies a local maximum
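A small Python sketch of this eigenvalue test (the classification helper and tolerance are illustrative):

# Sketch: classify a critical point from the eigenvalues of the Hessian.
import numpy as np

def classify_critical_point(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)          # H is symmetric
    if np.all(eig > tol):
        return "local minimum"           # positive definite
    if np.all(eig < -tol):
        return "local maximum"           # negative definite
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive"                # some eigenvalue is (near) zero

# Hessian of f(x) = x1^2 - x2^2 is diag(2, -2):
print(classify_critical_point(np.diag([2.0, -2.0])))   # saddle point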
Saddle point
• Contains both positive and negative curvature
• The function is f(x) = x₁² − x₂²
  • Along axis x₁, the function curves upwards: this axis is an eigenvector of H with a positive eigenvalue
  • Along axis x₂, the function curves downwards; this direction is an eigenvector of H with a negative eigenvalue
• At a saddle point, the Hessian has both positive and negative eigenvalues
Inconclusive Second Derivative Test
• The multidimensional second derivative test can be inconclusive, just like the univariate case
• The test is inconclusive when all non-zero eigenvalues have the same sign but at least one eigenvalue is zero
  • since the univariate second derivative test is inconclusive in the cross-section corresponding to the zero eigenvalue
Poor Condition Number
• There are different second derivatives in each direction at a single point
• The condition number of H, e.g., λmax/λmin, measures how much they differ
• Gradient descent performs poorly when H has a poor condition number
  • Because in one direction the derivative increases rapidly, while in another direction it increases slowly
  • The step size must be small to avoid overshooting the minimum in the high-curvature direction, but it is then too small to make progress in directions with less curvature
Gradient Descent without H
• Consider an H with condition number 5
  • The direction of most curvature has five times more curvature than the direction of least curvature
• Due to the small step size, gradient descent wastes time
• An algorithm based on the Hessian can predict that steepest descent is not promising
Newton’s method uses Hessian
• Another second-derivative method
• Using the Taylor series of f(x) around the current point x⁽⁰⁾:
  f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀg + ½ (x − x⁽⁰⁾)ᵀH(x − x⁽⁰⁾)
• Solving for the critical point of this function gives
  x* = x⁽⁰⁾ − H⁻¹g

• When f is a (positive definite) quadratic function, use this solution to jump directly to the minimum of the function
• When f is not quadratic, apply the solution iteratively
• This can reach the critical point much faster than gradient descent
• But it is useful only when the nearby critical point is a minimum
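A minimal Python sketch of the iterative Newton update (the quadratic example is illustrative; on it, a single step reaches the minimum):

# Sketch of Newton's method: x' = x - H(x)^(-1) g(x), applied iteratively.
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)   # Newton step
    return x

# For a positive definite quadratic f(x) = 0.5 x^T H x, one step suffices:
H = np.array([[5.0, 0.0], [0.0, 1.0]])
print(newton(lambda x: H @ x, lambda x: H, [1.0, 1.0]))   # ≈ [0, 0]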
Summary of Gradient Methods
• First-order optimization algorithms: those that use only the gradient
• Second-order optimization algorithms: those that also use the Hessian matrix, such as Newton's method
• The family of functions used in ML is complicated, so optimization is more complex than in other fields
  • No guarantees
• Some guarantees are available for Lipschitz continuous functions, i.e., functions satisfying
  |f(x) − f(y)| ≤ L ‖x − y‖₂
• with Lipschitz constant L
Convex Optimization
• Applicable only to convex functions: functions which are well-behaved,
  • e.g., they lack saddle points and all of their local minima are global minima
• For such functions, the Hessian is positive semi-definite everywhere
• Many ML optimization problems, particularly in deep learning, cannot be expressed as convex optimization
Constrained Optimization
• We may wish to optimize f(x) when the solution x is constrained to lie in a set S
  • Such values of x are called feasible solutions
• Often we want a solution that is small, such as ‖x‖ ≤ 1
• Simple approach: modify gradient descent taking the constraint into account (using a Lagrangian formulation)
Ex: Least squares with Lagrangian
• We wish to minimize f(x) = ½ ‖Ax − b‖₂²
• subject to the constraint xᵀx ≤ 1
• We introduce the Lagrangian
  L(x, λ) = f(x) + λ(xᵀx − 1)
• and solve the problem min over x of max over λ ≥ 0 of L(x, λ)
• For the unconstrained problem (no Lagrangian), the smallest-norm solution is x = A⁺b
  • If this solution is not feasible, differentiate the Lagrangian with respect to x to obtain AᵀAx − Aᵀb + 2λx = 0
  • The solution takes the form x = (AᵀA + 2λI)⁻¹Aᵀb
  • Choosing λ: keep solving the linear equation and increasing λ until x has the correct norm
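A rough Python sketch of this λ search (the toy data and the doubling schedule for λ are illustrative assumptions; it stops once the norm constraint holds rather than matching the norm exactly):

# Sketch: solve x = (A^T A + 2*lambda*I)^(-1) A^T b and grow lambda
# until the constraint ||x|| <= 1 holds (doubling schedule is an assumption).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = A @ np.full(5, 2.0)                  # unconstrained solution is infeasible

lam = 0.0
x = np.linalg.solve(A.T @ A + 2 * lam * np.eye(5), A.T @ b)
while x @ x > 1.0:                       # infeasible: norm too large
    lam = max(2 * lam, 1e-3)             # increase lambda and re-solve
    x = np.linalg.solve(A.T @ A + 2 * lam * np.eye(5), A.T @ b)
print(lam, np.linalg.norm(x))            # final norm is <= 1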
Generalized Lagrangian: KKT
• More sophisticated than the Lagrangian
• Karush-Kuhn-Tucker (KKT) is a very general solution to constrained optimization
• While the Lagrangian allows only equality constraints, KKT allows both equality and inequality constraints
• To define a generalized Lagrangian, we need to describe S in terms of equalities and inequalities
Generalized Lagrangian
• The set S is described in terms of m functions g⁽ⁱ⁾ and n functions h⁽ʲ⁾ so that
  S = { x | for all i, g⁽ⁱ⁾(x) = 0 and for all j, h⁽ʲ⁾(x) ≤ 0 }
• The functions g are the equality constraints and the functions h are the inequality constraints

• Introduce new variables λᵢ and αⱼ for each constraint (called KKT multipliers), giving the generalized Lagrangian
  L(x, λ, α) = f(x) + Σᵢ λᵢ g⁽ⁱ⁾(x) + Σⱼ αⱼ h⁽ʲ⁾(x)
• We can now solve the unconstrained optimization problem min over x of max over λ of max over α ≥ 0 of L(x, λ, α)
