
Lecture 10

Math Foundations
Team
Introduction

► We will look at nonlinear optimization concepts in this lecture.


► We already know how to compute gradients, but there are some minutiae of gradient descent that we need to address.
► Machine learning algorithms depend heavily on the correctness of the gradient: if the gradient is computed erroneously, the algorithms may fail to find a local or global optimum.
► We will also look into some challenges in nonlinear optimization.
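The correctness point above is often checked in practice with a finite-difference gradient check. The sketch below is illustrative and not from the lecture; the loss function and weights are arbitrary choices.

```python
import numpy as np

# Illustrative loss with a hand-derived analytic gradient.
def loss(w):
    return np.sum(w**2) + np.sin(w[0])

def grad(w):
    g = 2 * w
    g[0] += np.cos(w[0])
    return g

# Central finite-difference estimate of the gradient, used as a
# sanity check on the analytic formula.
def numeric_grad(f, w, eps=1e-6):
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = np.array([0.3, -1.2])
assert np.allclose(grad(w), numeric_grad(loss, w), atol=1e-5)
print("analytic gradient matches finite differences")
```

If the two disagree, the analytic gradient is almost certainly wrong, which is exactly the failure mode described above.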
Optimization in machine learning
Optimization of an additively separable sum
Overfitting in machine learning

► In traditional optimization, we focus on updating parameters so that the objective function is minimized as much as possible.
► In machine learning, minimization of the objective function is performed over training data, but the model is applied to test data, which is unseen.
► We need to avoid the problem of overfitting the training data.
Example of overfitting

► Consider the following data on four variables x1, x2, x3, x4 and an associated output variable y.
► Let us say that this is a sample of real-life data where the output y ≈ x1.

 x1   x2   x3   x4    y
 61    2    3   0.1   49
 40    0    4   0.5   40
 68    0   10   1.0   70

► Minimizing squared error, we notice that one good solution is w1 = 1, w2 = w3 = w4 = 0, w5 = 0.
► This solution does not give zero squared error with respect to the actual observations, but gives an error close to zero.
► The solution w1 = 1, w2 = w3 = w4 = 0, w5 = 0 is a very good one since it captures the real-life relationship between the output variable y and x1.

► Consider w1 = 0, w2 = 7, w3 = 5, w4 = 0, w5 = 20. This solution gives zero training error.
► It is a very poor solution, since the output variable y does not depend on x1 at all, while we know that there is actually a strong dependence between y and x1.
► Therefore it will incur a high error on test data.
► This example illustrates that minimizing the loss function to the greatest possible extent may not be a good thing, since the model may then perform poorly on real-life data.
Feature processing

► Taking small steps in w2 and large steps in w1 will make us move steadily towards the optimal value of w2, but oscillate with respect to the optimal value of w1, overshooting the target each time.
► This makes convergence very slow.
► It is therefore helpful to have features with similar variance.
Challenges in Gradient-Based Optimization

1. Flat regions and local optima
2. Different levels of curvature in different directions
Local optima and flat regions

► Another problem to contend with is the presence of flat regions, where the gradient is close to zero.

► Flat regions are problematic because the speed of descent depends on the magnitude of the gradient, given a fixed learning rate.
► The optimization process will take a long time to cross a flat region of space, which makes convergence slow.
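A tiny one-dimensional sketch makes the slowdown concrete. The functions x^4 (flat near its minimum) and x^2 (well-conditioned bowl) are illustrative choices, not from the lecture:

```python
# Run gradient descent on f(x) = x**4, whose gradient 4*x**3 is nearly
# zero close to the minimum, and on f(x) = x**2, whose gradient 2*x is not.
def descend(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_flat  = descend(lambda x: 4 * x**3, x0=0.5)  # crawls through the flat region
x_steep = descend(lambda x: 2 * x,    x0=0.5)  # converges geometrically

print(f"x^4 descent after 100 steps: {x_flat:.4f}")
print(f"x^2 descent after 100 steps: {x_steep:.2e}")
```

With the same learning rate and the same number of steps, descent on x^4 is still far from the minimum while descent on x^2 has essentially reached it.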
Different levels of curvature

► In multi-dimensional settings, the components of the gradient with respect to different parameters can vary widely. This causes convergence problems, since the update step oscillates with respect to some components while moving steadily with respect to others.
► Consider the simplest possible case of a bowl-like convex, quadratic objective function with a single global minimum: L = x^2 + y^2 represents a perfectly circular bowl, while L = x^2 + 4y^2 represents an elliptical one.
► We shall show contour plots of both functions and how gradient descent performs in finding the minimum of each.
Different levels of curvature

► The second-order derivative measures the rate of change of the gradient: a high second-order derivative means high curvature.
► From the point of view of gradient descent, we want moderate curvature in all dimensions, as it means that the gradient does not change much more in some dimensions than in others.
► We can then take gradient-descent steps of large sizes.
Different levels of curvature

► In the case of the perfect bowl, a sufficiently large step size from any point can take us directly to the optimum of the function in one step, since the gradient at any point points towards the optimum.
► This is not true for the elliptical bowl: the gradient at any point does not point towards the optimum of the function.
Contour plots

► Note that the gradient at any point is orthogonal to the contour line at that point.
► This is because the dot product of the gradient ∇F and a small displacement δx along the contour line gives the change in the value of the function along that displacement.
► Since the function remains constant along the contour line, ∇F · δx = 0.
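This orthogonality can be checked numerically. As a sketch, take F(x, y) = x^2 + y^2, whose contour x^2 + y^2 = 1 can be parameterized as (cos t, sin t) with tangent direction (−sin t, cos t):

```python
import numpy as np

# At each sampled point on the unit-circle contour, the dot product of
# the gradient ∇F = (2x, 2y) and the contour tangent should vanish.
for t in np.linspace(0.0, 2 * np.pi, 8):
    point = np.array([np.cos(t), np.sin(t)])
    grad = 2 * point                           # ∇F at this point
    tangent = np.array([-np.sin(t), np.cos(t)])  # direction along contour
    assert abs(grad @ tangent) < 1e-12

print("gradient is orthogonal to the contour tangent at all sampled points")
```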
Differential curvature

► A closer look at the contour plot for the elliptical bowl shows oscillatory movement in the y-direction: in each step we correct the mistake of overshooting made in the previous step. The gradient component along the y-direction is larger than the component along the x-direction.
► Along the x-direction, we make small movements towards the optimal x-value. Overall, after many training steps, we find that we have made little progress towards the optimum.
► It should be kept in mind that the direction of steepest descent is, for most objective functions, only an instantaneous direction of best improvement, and is not the correct direction of descent in the longer term.
Revisiting feature normalization

► We show how feature normalization addresses, in some measure, the differential curvature problem.
► Consider the following toy dataset, where the two input attributes are x1 and x2, and the output attribute is y.
► We intend to find a relationship of the form y = w1x1 + w2x2 from the data. The coefficients w1 and w2 are found using gradient descent on the loss function computed from the data.

 x1    x2    y
 0.1   25    7
 0.8   10    1
 0.4   10    4

► Loss function:

J(w) = (0.1w1 + 25w2 − 7)^2 + (0.8w1 + 10w2 − 1)^2 + (0.4w1 + 10w2 − 4)^2

► The objective function is much more sensitive to w2 than to w1.
► One way to get around this issue is to standardize each column to zero mean and unit variance.
► The curvatures with respect to w1 and w2 then become much more similar, and differential curvature is reduced.
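As a sketch of this preprocessing step, standardizing the toy dataset above in NumPy:

```python
import numpy as np

# The toy dataset from the slide: x2 is on a much larger scale than x1.
X = np.array([[0.1, 25.0],
              [0.8, 10.0],
              [0.4, 10.0]])

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print("column means:", X_std.mean(axis=0))  # close to [0, 0]
print("column stds: ", X_std.std(axis=0))   # [1, 1]
```

After standardization, both columns contribute on a comparable scale to the loss, so the gradient components with respect to w1 and w2 are of similar magnitude and a single learning rate works well for both.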
Constrained Optimization

► Optimization problems are of two types:
1. Unconstrained optimization
2. Constrained optimization
► We have discussed algorithms to solve unconstrained optimization.
► How do we find the solution to an optimization problem with constraints?
► Example of a constrained optimization problem:

maximize f(x, y) = x^2 y
subject to g(x, y) : x^2 + y^2 = 1
Constrained Optimization : Lagrange Multiplier Method

► A constrained maximization (minimization) problem is rewritten as a Lagrange function whose saddle point corresponds to the constrained maximum (minimum).
► The Lagrange function uses Lagrange multipliers as a strategy for finding the local maxima and minima of a function subject to constraints.
► The maximum of f(x, y) under the constraint g(x, y) is obtained when their gradients point in the same direction.
► Introduce a Lagrange multiplier λ for the equality constraint. Mathematically, ∇f(x, y) = λ∇g(x, y).
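For the example problem introduced earlier (maximize f(x, y) = x^2 y subject to x^2 + y^2 = 1), the condition ∇f = λ∇g can be solved by hand and then verified numerically. Setting ∇f = λ∇g gives 2xy = 2λx and x^2 = 2λy, which together with the constraint yield x^2 = 2/3, y = 1/√3, λ = y:

```python
import numpy as np

# Hand-derived candidate for the constrained maximum of f = x^2 * y
# on the circle g = x^2 + y^2 = 1.
x, y = np.sqrt(2 / 3), 1 / np.sqrt(3)
lam = y

grad_f = np.array([2 * x * y, x**2])   # ∇f = (2xy, x^2)
grad_g = np.array([2 * x, 2 * y])      # ∇g = (2x, 2y)

assert np.isclose(x**2 + y**2, 1)          # constraint is satisfied
assert np.allclose(grad_f, lam * grad_g)   # gradients are parallel

print("constrained maximum f(x, y) =", x**2 * y)  # 2/(3*sqrt(3)) ≈ 0.385
```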
Constrained Optimization : Lagrange Multiplier Method

Consider the following optimization problem:

maximize xy
subject to x + y = 6

► The Lagrangian is L(x, y, λ) = xy − λ(x + y − 6).
► The optimal x and y values remain the same whether you take +λ or −λ for the equality constraint.
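The stationarity conditions of this Lagrangian can be worked through directly. A quick check (a sketch, not from the slides): ∂L/∂x = y − λ = 0 and ∂L/∂y = x − λ = 0 give x = y = λ, and the constraint x + y = 6 then forces x = y = 3.

```python
# Verify the hand-derived solution of: maximize xy subject to x + y = 6.
# Stationarity of L(x, y, λ) = xy − λ(x + y − 6) gives x = y = λ.
lam = 3.0
x, y = lam, lam

assert x + y == 6                        # constraint satisfied
assert y - lam == 0 and x - lam == 0     # stationarity conditions hold

print("maximum of xy subject to x + y = 6 is", x * y)  # 9.0
```

Intuitively, the product of two numbers with a fixed sum is largest when the numbers are equal, which agrees with x = y = 3.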
