Gradient Descent - Xiaowei Huang
So far only the optimization objectives have been discussed; but how do we solve them?
Today’s Topics
• Derivative
• Gradient
• Directional Derivative
• Method of Gradient Descent
• Example: Gradient Descent on Linear Regression
• Linear Regression: Analytical Solution
Problem Statement: Gradient-Based Optimization
• Most ML algorithms involve optimization
• Minimize/maximize a function f (x) by altering x
• Usually stated as a minimization, e.g., of the loss
• Maximization accomplished by minimizing –f(x)
• f (x) referred to as objective function or criterion
• In the minimization setting, also referred to as the loss function, cost function, or error function
• Example:
• linear least squares
• Linear regression
• Denote optimum value by x*=argmin f (x)
Derivative
Derivative of a function
• Suppose we have function y=f (x), x, y real numbers
• Derivative of function denoted: f’(x) or as dy/dx
• Derivative f’(x) gives the slope of f (x) at point x
• It specifies how to scale a small change in input to obtain a corresponding change in the
output:
f (x + ε) ≈ f (x) + ε f’ (x)
• It tells us how to make a small change in the input to obtain a small improvement in y
• We know that f(x − ε sign(f’(x))) is less than f(x) for small enough ε
• Therefore, we can reduce f(x) by moving x in small steps with the opposite sign of the derivative
• Why opposite? Because the derivative points in the direction in which f increases
• This technique is called gradient descent (Cauchy, 1847)
Example
• Function f(x) = x², step size ε = 0.1
• f’(x) = 2x (a small sketch follows below)
• For minimization to make sense, there must still be only one (scalar) output, even when the function has multiple inputs
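A minimal Python sketch of the x² example above (the starting point x = 3.0 and the 20 iterations are illustrative assumptions): repeatedly stepping opposite to the derivative with ε = 0.1 drives x towards the minimizer at 0.

# Gradient descent on f(x) = x^2 with step size eps = 0.1.
# f'(x) = 2x, so each update is x <- x - eps * 2x.

def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x = 3.0          # arbitrary starting point
eps = 0.1        # step size from the slide
for step in range(20):
    x = x - eps * f_prime(x)   # move opposite to the sign of the derivative

print(x, f(x))   # x is now close to 0, the minimizer of f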
Functions with multiple inputs
• The partial derivative ∂f/∂xᵢ measures how f changes as only the variable xᵢ increases at point x
• Collecting all partial derivatives into a vector gives the gradient ∇ₓf(x), which generalizes the derivative to functions with multiple inputs
Directional Derivative
• The directional derivative in a unit direction u is the slope of f in that direction, uᵀ∇ₓf(x)
• To minimize f, find the direction in which f decreases the fastest: this is −∇ₓf(x), giving the update x′ = x − ε∇ₓf(x), known as the method of gradient descent (or steepest descent)
• The step size ε can be a small constant, or chosen by line search: evaluate f(x − ε∇ₓf(x)) for several values of ε and choose the one that results in the smallest objective function value
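A sketch of the steepest-descent update with the line-search choice of ε just described; the quadratic objective, the candidate step sizes, and the starting point are all illustrative assumptions.

import numpy as np

def f(x):
    # illustrative objective: a simple quadratic bowl
    return float(x @ x)

def grad_f(x):
    return 2 * x

x = np.array([2.0, -1.5])
candidate_eps = [0.01, 0.05, 0.1, 0.5]   # several values of eps to try

for _ in range(10):
    g = grad_f(x)
    # line search: evaluate f(x - eps * g) for each eps, keep the smallest
    best_eps = min(candidate_eps, key=lambda eps: f(x - eps * g))
    x = x - best_eps * g

print(x)   # close to the minimizer [0, 0]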
Example: Gradient Descent on Linear Regression
• Linear regression: minimize the squared error f(w) = ½ ||Xw − y||²
• The gradient is ∇w f(w) = Xᵀ(Xw − y)
• Let g = Xᵀ(Xw − y) and let ε be the step size
• Then, we have the gradient descent update w ← w − εg
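A minimal NumPy sketch of this update, using the gradient Xᵀ(Xw − y) from above; the synthetic data, step size, and iteration count are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))         # tall design matrix: 20 examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # noiseless targets for a clean illustration

w = np.zeros(3)                      # initial weights
eps = 0.02                           # step size (learning rate)

for _ in range(500):
    grad = X.T @ (X @ w - y)         # gradient of 0.5 * ||Xw - y||^2
    w = w - eps * grad               # gradient descent update

print(w)                             # approaches w_true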
Linear Regression: Analytical solution
• Algebraic view of the minimizer
• If X is square and invertible, just solve Xw = y and get w = X⁻¹y
• But typically X is a tall matrix (more equations than unknowns), so instead minimize ||Xw − y||², whose minimizer is the least-squares solution w = (XᵀX)⁻¹Xᵀy
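A small sketch of this algebraic view for a tall X: the normal equations (XᵀX)w = Xᵀy and NumPy's least-squares solver give the same minimizer. The random data here are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # tall matrix: more equations than unknowns
y = rng.normal(size=20)

# Normal equations: w = (X^T X)^{-1} X^T y  (assumes X^T X is invertible)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically preferable solver that avoids forming X^T X explicitly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))   # True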
Generalization to discrete spaces
• Gradient descent is limited to continuous spaces
• Concept of repeatedly making the best small move can be generalized
to discrete spaces
• Ascending an objective function of discrete parameters is called hill
climbing
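A small sketch of hill climbing over the integers, as a discrete analogue of the "best small move" idea; the objective and the ±1 neighbourhood are illustrative assumptions.

def objective(x):
    # illustrative discrete objective over the integers, peaked at x = 7
    return -(x - 7) ** 2

def hill_climb(x):
    while True:
        # best small move among the discrete neighbours x-1 and x+1
        best = max([x - 1, x + 1], key=objective)
        if objective(best) <= objective(x):
            return x                  # no neighbour improves: local optimum
        x = best

print(hill_climb(0))                  # 7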
Exercises
• Given the function f(x) = eˣ/(1 + eˣ), how many critical points does it have?
• Consider a function with a saddle point, e.g. f(x₁, x₂) = x₁² − x₂²:
• Along axis x₁, the function curves upwards: this axis is an eigenvector of H with a positive eigenvalue
• Along axis x₂, the function curves downwards: this axis is an eigenvector of H with a negative eigenvalue
• At a saddle point, the Hessian has both positive and negative eigenvalues
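As a quick check on the saddle-point discussion, a sketch computing the eigenvalues of the (constant) Hessian of the example f(x₁, x₂) = x₁² − x₂² assumed above.

import numpy as np

# Hessian of f(x1, x2) = x1^2 - x2^2 (constant everywhere)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)   # [-2.  2.]: one positive, one negative -> saddle point at the origin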
Inconclusive Second Derivative Test
• The multidimensional second derivative test can be inconclusive, just like the univariate test
• The test is inconclusive when all non-zero eigenvalues have the same sign but at least one eigenvalue is zero
• This is because the univariate second derivative test is inconclusive in the cross-section corresponding to the zero eigenvalue
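A tiny sketch of an inconclusive case (the function f(x₁, x₂) = x₁² + x₂⁴ is my own illustrative choice): at the origin the Hessian eigenvalues are 2 and 0, so the test says nothing, even though the origin is in fact a minimum.

import numpy as np

# Hessian of f(x1, x2) = x1^2 + x2^4 evaluated at the origin:
# d^2f/dx1^2 = 2, d^2f/dx2^2 = 12*x2^2 = 0 at x2 = 0, mixed partials are 0.
H_origin = np.array([[2.0, 0.0],
                     [0.0, 0.0]])

print(np.linalg.eigvalsh(H_origin))   # [0. 2.]: nonzero eigenvalues share a sign, one is zero -> inconclusive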
Poor Condition Number
• There can be a different second derivative in each direction at a single point
• The condition number of H, e.g. λmax/λmin, measures how much these second derivatives differ
• Gradient descent performs poorly when H has a poor (large) condition number
• This is because in one direction the derivative increases rapidly, while in another direction it increases slowly
• The step size must be small to avoid overshooting the minimum along the high-curvature direction, but it is then too small to make progress in directions with less curvature
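A sketch of this effect on the quadratic f(x) = ½xᵀHx with an ill-conditioned diagonal H (condition number 100 here, an illustrative choice): the usable step size is dictated by the high-curvature direction, so progress along the low-curvature direction is slow.

import numpy as np

# Quadratic objective f(x) = 0.5 * x^T H x with an ill-conditioned Hessian
H = np.diag([100.0, 1.0])        # condition number lambda_max / lambda_min = 100

x = np.array([1.0, 1.0])
eps = 0.019                      # must stay below 2 / lambda_max = 0.02 to avoid overshooting

for _ in range(100):
    x = x - eps * (H @ x)        # gradient of f is H x

print(x)   # first coordinate (high curvature) is ~0, second (low curvature) shrinks slowly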
Gradient Descent without H
• Suppose H has condition number 5
• The direction of most curvature has five times more curvature than the direction of least curvature
• Due to the small step size, gradient descent wastes time
• An algorithm based on the Hessian can predict that steepest descent is not promising
Newton’s method uses Hessian
• Another second-derivative method
• Use the second-order Taylor series of f(x) around the current point x⁽⁰⁾:
f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ∇ₓf(x⁽⁰⁾) + ½(x − x⁽⁰⁾)ᵀH(x⁽⁰⁾)(x − x⁽⁰⁾)
• Solving for the critical point of this approximation gives the Newton update x* = x⁽⁰⁾ − H(x⁽⁰⁾)⁻¹∇ₓf(x⁽⁰⁾)
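A sketch of a single Newton step derived from that approximation: solve H Δ = −∇f(x⁽⁰⁾) and move to x⁽⁰⁾ + Δ; for the ill-conditioned quadratic used in the earlier sketch, one step lands exactly on the minimizer.

import numpy as np

H = np.diag([100.0, 1.0])               # Hessian of f(x) = 0.5 * x^T H x
x0 = np.array([1.0, 1.0])
grad = H @ x0                           # gradient of f at x0

# Newton update: x1 = x0 - H^{-1} grad (solve the linear system rather than invert H)
x1 = x0 - np.linalg.solve(H, grad)

print(x1)   # [0. 0.]: the exact minimizer in one step for a quadratic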