
Gradient Descent

Dr. Xiaowei Huang


https://cgi.csc.liv.ac.uk/~xiaowei/
Up to now,
• Three machine learning algorithms:
  • decision tree learning
  • k-NN
  • linear regression

Only the optimization objectives have been discussed so far, but how do we actually solve them?
Today’s Topics
• Derivative
• Gradient
• Directional Derivative
• Method of Gradient Descent
• Example: Gradient Descent on Linear Regression
• Linear Regression: Analytical Solution
Problem Statement: Gradient-Based Optimization
• Most ML algorithms involve optimization
• Minimize/maximize a function f(x) by altering x
  • Usually stated as a minimization, e.g., of the loss
  • Maximization is accomplished by minimizing −f(x)
• f(x) is referred to as the objective function or criterion
  • In minimization, also referred to as the loss function, cost function, or error function
• Example:
  • linear least squares
  • linear regression
• Denote the optimum value by x* = arg min f(x)
Derivative
Derivative of a function
• Suppose we have a function y = f(x), where x and y are real numbers
• The derivative of the function is denoted f'(x) or dy/dx
• The derivative f'(x) gives the slope of f(x) at the point x
• It specifies how to scale a small change in the input to obtain the corresponding change in the output:
  f(x + ε) ≈ f(x) + ε f'(x)
• It tells us how to make a small change in the input to make a small improvement in y

Recall: what is the derivative of each of the following functions?
  f(x) = x²
  f(x) = eˣ
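As a quick numerical check (a minimal Python sketch, not from the slides; the test point and step are illustrative), the approximation f(x + ε) ≈ f(x) + ε f'(x) can be verified directly:

# Minimal sketch: verify f(x + eps) ≈ f(x) + eps * f'(x) numerically.
def f(x):
    return x ** 2          # f(x) = x^2, so f'(x) = 2x

def f_prime(x):
    return 2 * x

x, eps = 3.0, 1e-4
exact = f(x + eps)                     # 9.00060001
approx = f(x) + eps * f_prime(x)       # 9.0006
print(exact, approx)                   # the two values agree to ~1e-8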

Calculus in Optimization
• Suppose we have a function y = f(x), where x, y are real numbers
• Sign function: sign(a) = +1 if a > 0, 0 if a = 0, −1 if a < 0
• We know that
  f(x − ε sign(f'(x))) < f(x)
  for small enough ε
• Therefore, we can reduce f(x) by moving x in small steps with the opposite sign of the derivative
• This technique is called gradient descent (Cauchy, 1847)

Why opposite?
Example
• Function f(x) = x², f'(x) = 2x, step size ε = 0.1

• For x = −2: f'(−2) = −4, sign(f'(−2)) = −1
  • f(−2 − ε·(−1)) = f(−1.9) < f(−2)

• For x = 2: f'(2) = 4, sign(f'(2)) = 1
  • f(2 − ε·1) = f(1.9) < f(2)
Gradient Descent Illustrated
• For x < 0, f(x) decreases as x increases, and f'(x) < 0
• For x > 0, f(x) increases as x increases, and f'(x) > 0
• Use f'(x) to follow the function downhill
• Reduce f(x) by moving in the direction with the opposite sign of the derivative f'(x)
Stationary Points, Local Optima
• At these points the derivative provides no information about the direction to move
• Points where f'(x) = 0 are known as stationary or critical points
• Local minimum/maximum: a point where f(x) is lower/higher than at all its neighbors
• Saddle points: neither maxima nor minima
Presence of multiple minima
• Optimization algorithms may fail to find the global minimum
• We generally accept such solutions
Gradient
Minimizing functions with multidimensional inputs
• We often minimize functions with multidimensional inputs

• For minimization to make sense there must still be only one (scalar) output
Functions with multiple inputs
• Partial derivative ∂f(x)/∂xᵢ measures how f changes as only the variable xᵢ increases at point x

• The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector
• The gradient is the vector containing all of the partial derivatives, denoted ∇ₓf(x)
Example
• y = 5x₁⁵ + 4x₂ + x₃² + 2
• What is the exact gradient at the instance (1, 2, 3)?

• The gradient is (25x₁⁴, 4, 2x₃)

• At the instance (1, 2, 3), it is (25, 4, 6)
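A minimal Python sketch (not from the slides; the helper name and step h are illustrative) that checks this hand-computed gradient with central finite differences:

# Sketch: check the gradient of y = 5*x1^5 + 4*x2 + x3^2 + 2 numerically.
def y(x):
    x1, x2, x3 = x
    return 5 * x1 ** 5 + 4 * x2 + x3 ** 2 + 2

def numerical_gradient(f, x, h=1e-6):
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h                       # perturb coordinate i up
        xm[i] -= h                       # and down
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

print(numerical_gradient(y, [1.0, 2.0, 3.0]))   # ≈ [25.0, 4.0, 6.0]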
Functions with multiple inputs
• The gradient is the vector containing all of the partial derivatives, denoted ∇ₓf(x)

• Element i of the gradient is the partial derivative of f with respect to xᵢ

• Critical points are where every element of the gradient is equal to zero
Example
• y = 5x₁⁵ + 4x₂ + x₃² + 2
• What are the critical points?

• The gradient is (25x₁⁴, 4, 2x₃)

• Setting 25x₁⁴ = 0 and 2x₃ = 0 holds for all instances whose x₁ and x₃ are 0, but the second component is always 4 ≠ 0. So there is no critical point.
Directional Derivative
Directional Derivative
• The directional derivative in direction u (a unit vector) is the slope of f in direction u, i.e., the derivative of f(x + αu) with respect to α, evaluated at α = 0
• By the chain rule, this evaluates to uᵀ∇ₓf(x)

• Example: let u be the standard basis vector eᵢ (all zeros except a 1 in position i) in Cartesian coordinates; then the directional derivative in direction u is the partial derivative ∂f(x)/∂xᵢ
Directional Derivative
• To minimize f, find the direction in which f decreases the fastest:
  min over unit vectors u of uᵀ∇ₓf(x) = min ‖u‖₂ ‖∇ₓf(x)‖₂ cos θ
• where θ is the angle between u and the gradient

• Substituting ‖u‖₂ = 1 and ignoring factors that do not depend on u, this simplifies to minimizing cos θ

• This is minimized when u points in the direction opposite to the gradient

• In other words, the gradient points directly uphill, and the negative gradient points directly downhill
Method of Gradient Descent
Method of Gradient Descent
• The gradient points directly uphill, and the negative gradient points directly downhill
• Thus we can decrease f by moving in the direction of the negative gradient
• This is known as the method of steepest descent, or gradient descent
• Steepest descent proposes a new point
  x' = x − ε ∇ₓf(x)
• where ε is the learning rate, a positive scalar, often set to a small constant
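A minimal Python sketch of this update rule (the example function, starting point, and hyperparameters are illustrative choices):

# Minimal gradient descent sketch: x' = x - eps * gradient.
import numpy as np

def gradient_descent(grad, x0, eps=0.1, tol=1e-6, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop when the gradient is (near) zero
            break
        x = x - eps * g               # step opposite to the gradient
    return x

# Example: minimize f(x) = x1^2 + x2^2, whose gradient is (2*x1, 2*x2)
print(gradient_descent(lambda x: 2 * x, [4.0, -3.0]))   # ≈ [0, 0]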


Choosing ε: Line Search
• We can choose ε in several different ways
• Popular approach: set ε to a small constant
• Another approach is called line search:
  • Evaluate f(x − ε ∇ₓf(x)) for several values of ε and choose the one that results in the smallest objective function value
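A sketch of this grid-style line search in Python (the candidate values and example function are illustrative; more refined line-search strategies exist):

# Sketch: try several learning rates and keep the best resulting point.
import numpy as np

def line_search_step(f, grad, x, candidates=(1e-3, 1e-2, 1e-1, 1.0)):
    g = grad(x)
    trials = [x - eps * g for eps in candidates]   # one trial point per eps
    return min(trials, key=f)                      # smallest objective wins

f = lambda x: np.sum(x ** 2)
x = np.array([4.0, -3.0])
print(line_search_step(f, lambda x: 2 * x, x))     # eps = 0.1 wins from this grid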
Example: Gradient Descent on Linear Regression

• Linear regression: f(w) = ½ ‖Xw − y‖₂²

• The gradient is ∇w f(w) = Xᵀ(Xw − y)

• The gradient descent algorithm is:
  • Set the step size ε and tolerance δ to small, positive numbers.
  • While ‖Xᵀ(Xw − y)‖₂ > δ, do
    w ← w − ε Xᵀ(Xw − y)
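A runnable sketch of this algorithm on toy data (the data, ε, and δ below are illustrative choices):

# Sketch: gradient descent for f(w) = 0.5 * ||Xw - y||^2 on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = np.zeros(3)
eps, delta = 0.01, 1e-8
for _ in range(10_000):                  # cap iterations for safety
    g = X.T @ (X @ w - y)                # gradient = X^T (Xw - y)
    if np.linalg.norm(g) <= delta:
        break
    w = w - eps * g
print(w)                                 # ≈ [2.0, -1.0, 0.5]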
Linear Regression: Analytical solution

Convergence of Steepest Descent
• Steepest descent converges when every element of the gradient is zero
  • In practice, when it is very close to zero
• We may be able to avoid the iterative algorithm and jump directly to the critical point by solving the equation ∇ₓf(x) = 0 for x
Linear Regression: Analytical solution
• Linear regression: f(w) = ½ ‖Xw − y‖₂²

• The gradient is ∇w f(w) = Xᵀ(Xw − y)

• Let Xᵀ(Xw − y) = 0

• Then, we have w = (XᵀX)⁻¹Xᵀy
Linear Regression: Analytical solution
• Algebraic view of the minimizer
  • If X were square and invertible, we could just solve Xw = y and get w = X⁻¹y
  • But typically X is a tall matrix (more examples than features), so we solve the normal equations (XᵀX)w = Xᵀy instead
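A sketch of the analytical solution in Python; it solves the normal equations (XᵀX)w = Xᵀy with a linear solver rather than forming an explicit inverse (the toy data is illustrative):

# Sketch: analytical solution w = (X^T X)^(-1) X^T y via a linear solve.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # tall matrix: more rows than columns
y = X @ np.array([2.0, -1.0, 0.5])

w = np.linalg.solve(X.T @ X, X.T @ y)    # solves (X^T X) w = X^T y
print(w)                                 # ≈ [2.0, -1.0, 0.5]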
Generalization to discrete spaces
Generalization to discrete spaces
• Gradient descent is limited to continuous spaces
• The concept of repeatedly making the best small move can be generalized to discrete spaces
• Ascending an objective function of discrete parameters is called hill climbing
Exercises
• Given the function f(x) = eˣ/(1 + eˣ), how many critical points does it have?

• Given the function f(x₁, x₂) = 9x₁² + 3x₂ + 4, how many critical points does it have?

• Please write a program to do the following: given any differentiable function (such as the two above), a step size ε, a starting point x, and a target point x', determine whether it is possible to reach x' from x. If possible, in how many steps? Adjust ε to see how the answer changes.
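One possible starting point for the programming exercise, as a rough sketch; the reachability criterion (passing within a tolerance of x'), the step cap, and the default values are assumptions, not part of the exercise statement:

# Rough sketch (criteria below are assumptions): follow the negative
# derivative from x with step size eps and report whether we come within
# a small tolerance of the target x_prime, and after how many steps.
def reach(f_prime, x, x_prime, eps=0.1, tol=1e-3, max_steps=100_000):
    for step in range(max_steps):
        if abs(x - x_prime) <= tol:
            return True, step            # reached the target in `step` steps
        x = x - eps * f_prime(x)         # one gradient descent step
    return False, max_steps              # did not reach x_prime

print(reach(lambda x: 2 * x, 2.0, 0.0))  # f(x) = x^2: e.g., (True, 35)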
Extended Materials
Beyond Gradient: Jacobian and Hessian matrices
• Sometimes we need to find all the derivatives of a function whose input and output are both vectors
• If we have a function f: ℝᵐ → ℝⁿ
• Then the n × m matrix of partial derivatives, known as the Jacobian matrix J, is defined as
  Jᵢ,ⱼ = ∂f(x)ᵢ/∂xⱼ
Second derivative
• The derivative of a derivative
• For a function f: ℝⁿ → ℝ, the derivative with respect to xᵢ of the derivative of f with respect to xⱼ is denoted ∂²f/∂xᵢ∂xⱼ
• In a single dimension we can denote it by f''(x)
• It tells us how the first derivative will change as we vary the input
• This is important, as it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone
Second derivative measures curvature
• (Figure: quadratic functions with different curvatures)
Hessian
• The second derivative with many dimensions
• H(f)(x) is defined as
  H(f)(x)ᵢ,ⱼ = ∂²f(x)/∂xᵢ∂xⱼ
• The Hessian is the Jacobian of the gradient

• The Hessian matrix is symmetric, i.e., Hᵢ,ⱼ = Hⱼ,ᵢ,
  • anywhere that the second partial derivatives are continuous
  • So the Hessian matrix can be decomposed into a set of real eigenvalues and an orthogonal basis of eigenvectors
  • The eigenvalues of H are useful for determining the learning rate, as seen in the next two slides
Role of eigenvalues of Hessian
• The second derivative in direction d (a unit vector) is dᵀHd
• If d is an eigenvector, the second derivative in that direction is given by its eigenvalue
  • For other directions, it is a weighted average of the eigenvalues (weights between 0 and 1, with eigenvectors at smaller angles to d receiving larger weight)
• The maximum eigenvalue determines the maximum second derivative, and the minimum eigenvalue determines the minimum second derivative
Learning rate from Hessian
• Second-order Taylor series of f(x) around the current point x⁽⁰⁾:
  f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀg + ½ (x − x⁽⁰⁾)ᵀH(x − x⁽⁰⁾)
• where g is the gradient and H is the Hessian at x⁽⁰⁾

• If we use learning rate ε, the new point x is given by x⁽⁰⁾ − εg. Thus we get
  f(x⁽⁰⁾ − εg) ≈ f(x⁽⁰⁾) − ε gᵀg + ½ ε² gᵀHg

• There are three terms:
  • the original value of f,
  • the expected improvement due to the slope, and
  • the correction to be applied due to curvature
• Solving for the step size that makes this approximation smallest (when gᵀHg > 0) gives
  ε* = gᵀg / (gᵀHg)
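A small Python sketch of this optimal step size on a quadratic (the Hessian and starting point are illustrative):

# Sketch: optimal step eps* = (g^T g) / (g^T H g) for f(x) = 0.5 * x^T H x.
import numpy as np

H = np.array([[5.0, 0.0],
              [0.0, 1.0]])          # two different curvatures
x0 = np.array([1.0, 1.0])
g = H @ x0                          # gradient of 0.5 x^T H x at x0

eps_star = (g @ g) / (g @ H @ g)    # step minimizing the second-order model
print(eps_star, x0 - eps_star * g)  # ≈ 0.206, new point ≈ [-0.03, 0.79]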
Second Derivative Test: Critical Points
• At a critical point, f'(x) = 0
• When f''(x) > 0, the first derivative f'(x) increases as we move to the right and decreases as we move to the left
  • We conclude that x is a local minimum
• For a local maximum, f'(x) = 0 and f''(x) < 0
• When f''(x) = 0, the test is inconclusive: x may be a saddle point or part of a flat region
Multidimensional Second Derivative Test
• In multiple dimensions, we need to examine the second derivatives in all dimensions
• Eigendecomposition generalizes the test
• Test the eigenvalues of the Hessian to determine whether a critical point is a local maximum, a local minimum, or a saddle point
• When H is positive definite (all eigenvalues are positive), the point is a local minimum
• Similarly, negative definite (all eigenvalues negative) implies a local maximum
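A small Python sketch of this eigenvalue test (the classification helper and tolerance are illustrative):

# Sketch: classify a critical point from the eigenvalues of the Hessian.
import numpy as np

def classify_critical_point(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)          # H is symmetric
    if np.all(eig > tol):
        return "local minimum"           # positive definite
    if np.all(eig < -tol):
        return "local maximum"           # negative definite
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive"                # some eigenvalue is (near) zero

# Hessian of f(x) = x1^2 - x2^2 is diag(2, -2):
print(classify_critical_point(np.diag([2.0, -2.0])))   # saddle point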
Saddle point
• Contains both positive and negative curvature
• The function is f(x) = x₁² − x₂²
  • Along axis x₁, the function curves upwards: this axis is an eigenvector of H with a positive eigenvalue
  • Along axis x₂, the function curves downwards; this direction is an eigenvector of H with a negative eigenvalue
• At a saddle point, the Hessian has both positive and negative eigenvalues
Inconclusive Second Derivative Test
• The multidimensional second derivative test can be inconclusive, just like the univariate case
• The test is inconclusive when all non-zero eigenvalues have the same sign but at least one eigenvalue is zero
  • since the univariate second derivative test is inconclusive in the cross-section corresponding to the zero eigenvalue
Poor Condition Number
• There are different second derivatives in each direction at a single point
• The condition number of H, e.g., λmax/λmin, measures how much they differ
• Gradient descent performs poorly when H has a poor condition number
  • Because in one direction the derivative increases rapidly, while in another direction it increases slowly
  • The step size must be small to avoid overshooting the minimum in the high-curvature direction, but it is then too small to make progress in directions with less curvature
Gradient Descent without H
• Consider an H with condition number 5
  • The direction of most curvature has five times more curvature than the direction of least curvature
• Due to the small step size, gradient descent wastes time
• An algorithm based on the Hessian can predict that steepest descent is not promising
Newton’s method uses Hessian
• Another second-derivative method
• Using the Taylor series of f(x) around the current point x⁽⁰⁾:
  f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀg + ½ (x − x⁽⁰⁾)ᵀH(x − x⁽⁰⁾)
• Solving for the critical point of this function gives
  x* = x⁽⁰⁾ − H⁻¹g

• When f is a (positive definite) quadratic function, use this solution to jump directly to the minimum of the function
• When f is not quadratic, apply the solution iteratively
• This can reach the critical point much faster than gradient descent
• But it is useful only when the nearby critical point is a minimum
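A minimal Python sketch of the iterative Newton update (the quadratic example is illustrative; on it, a single step reaches the minimum):

# Sketch of Newton's method: x' = x - H(x)^(-1) g(x), applied iteratively.
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)   # Newton step
    return x

# For a positive definite quadratic f(x) = 0.5 x^T H x, one step suffices:
H = np.array([[5.0, 0.0], [0.0, 1.0]])
print(newton(lambda x: H @ x, lambda x: H, [1.0, 1.0]))   # ≈ [0, 0]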
Summary of Gradient Methods
• First-order optimization algorithms: those that use only the gradient
• Second-order optimization algorithms: those that also use the Hessian matrix, such as Newton's method
• The family of functions used in ML is complicated, so optimization is more complex than in other fields
  • No guarantees
• Some guarantees are available for Lipschitz continuous functions, i.e., functions satisfying
  |f(x) − f(y)| ≤ L ‖x − y‖₂
• with Lipschitz constant L
Convex Optimization
• Applicable only to convex functions: functions which are well-behaved,
  • e.g., they lack saddle points and all of their local minima are global minima
• For such functions, the Hessian is positive semi-definite everywhere
• Many ML optimization problems, particularly in deep learning, cannot be expressed as convex optimization
Constrained Optimization
• We may wish to optimize f(x) when the solution x is constrained to lie in a set S
  • Such values of x are called feasible solutions
• Often we want a solution that is small, such as ‖x‖ ≤ 1
• Simple approach: modify gradient descent taking the constraint into account (using a Lagrangian formulation)
Ex: Least squares with Lagrangian
• We wish to minimize f(x) = ½ ‖Ax − b‖₂²
• subject to the constraint xᵀx ≤ 1
• We introduce the Lagrangian
  L(x, λ) = f(x) + λ(xᵀx − 1)
• and solve the problem min over x of max over λ ≥ 0 of L(x, λ)
• For the unconstrained problem (no Lagrangian), the smallest-norm solution is x = A⁺b
  • If this solution is not feasible, differentiate the Lagrangian with respect to x to obtain AᵀAx − Aᵀb + 2λx = 0
  • The solution takes the form x = (AᵀA + 2λI)⁻¹Aᵀb
  • Choosing λ: keep solving the linear equation and increasing λ until x has the correct norm
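A rough Python sketch of this λ search (the toy data and the doubling schedule for λ are illustrative assumptions; it stops once the norm constraint holds rather than matching the norm exactly):

# Sketch: solve x = (A^T A + 2*lambda*I)^(-1) A^T b and grow lambda
# until the constraint ||x|| <= 1 holds (doubling schedule is an assumption).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = A @ np.full(5, 2.0)                  # unconstrained solution is infeasible

lam = 0.0
x = np.linalg.solve(A.T @ A + 2 * lam * np.eye(5), A.T @ b)
while x @ x > 1.0:                       # infeasible: norm too large
    lam = max(2 * lam, 1e-3)             # increase lambda and re-solve
    x = np.linalg.solve(A.T @ A + 2 * lam * np.eye(5), A.T @ b)
print(lam, np.linalg.norm(x))            # final norm is <= 1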
Generalized Lagrangian: KKT
• More sophisticated than the Lagrangian
• Karush-Kuhn-Tucker (KKT) is a very general solution to constrained optimization
• While the Lagrangian allows only equality constraints, KKT allows both equality and inequality constraints
• To define a generalized Lagrangian, we need to describe S in terms of equalities and inequalities
Generalized Lagrangian
• The set S is described in terms of m functions g⁽ⁱ⁾ and n functions h⁽ʲ⁾ so that
  S = { x | for all i, g⁽ⁱ⁾(x) = 0 and for all j, h⁽ʲ⁾(x) ≤ 0 }
• The functions g are the equality constraints and the functions h are the inequality constraints

• Introduce new variables λᵢ and αⱼ for each constraint (called KKT multipliers), giving the generalized Lagrangian
  L(x, λ, α) = f(x) + Σᵢ λᵢ g⁽ⁱ⁾(x) + Σⱼ αⱼ h⁽ʲ⁾(x)
• We can now solve the unconstrained optimization problem min over x of max over λ of max over α ≥ 0 of L(x, λ, α)
