
Lecture 10

Math Foundations
Team
Introduction

► We will look at nonlinear optimization concepts in this lecture.


► We already know how to compute gradients, but there are some minutiae of gradient descent that we need to address.
► Machine learning algorithms depend heavily on the correctness of the gradient: if the gradient is computed erroneously, the algorithms may fail to find a local or global optimum.
► We will also look into some challenges in nonlinear optimization.
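The correctness point above is often checked in practice with a finite-difference gradient check. The sketch below is illustrative and not from the lecture; the loss function and weights are arbitrary choices.

```python
import numpy as np

# Illustrative loss with a hand-derived analytic gradient.
def loss(w):
    return np.sum(w**2) + np.sin(w[0])

def grad(w):
    g = 2 * w
    g[0] += np.cos(w[0])
    return g

# Central finite-difference estimate of the gradient, used as a
# sanity check on the analytic formula.
def numeric_grad(f, w, eps=1e-6):
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = np.array([0.3, -1.2])
assert np.allclose(grad(w), numeric_grad(loss, w), atol=1e-5)
print("analytic gradient matches finite differences")
```

If the two disagree, the analytic gradient is almost certainly wrong, which is exactly the failure mode described above.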
Optimization in machine learning
Optimization of an additively separable sum
Overfitting in machine learning

► In traditional optimization, we focus on updating parameters so that the objective function is minimized as much as possible.
► In machine learning, minimization of the objective function is performed over training data, but the model is applied to test data, which is unseen.
► We need to avoid the problem of overfitting the training data.
Example of overfitting

► Consider the following data on four variables x1, x2, x3, x4 and an associated output variable y.
► Let us say that this is a sample of real-life data where the output y ≈ x1.

 x1   x2   x3   x4    y
 61    2    3   0.1   49
 40    0    4   0.5   40
 68    0   10   1.0   70

► Minimizing squared error, we notice that one good solution is w1 = 1, w2 = w3 = w4 = 0, w5 = 0.
► This solution does not give zero squared error with respect to the actual observations, but gives an error close to zero.
► The solution w1 = 1, w2 = w3 = w4 = 0, w5 = 0 is a very good one since it captures the real-life relationship between the output variable y and x1.

► Consider w1 = 0, w2 = 7, w3 = 5, w4 = 0, w5 = 20. This solution gives zero training error.
► It is a very poor solution, since the output variable y does not depend on x1 at all, while we know that there is actually a strong dependence between y and x1.
► Therefore it will incur a high error on test data.
► This example illustrates that minimizing the loss function to the greatest possible extent may not be a good thing, since the model may then perform poorly on real-life data.
Feature processing

► Taking small steps in w2 and large steps in w1 will make us move steadily towards the optimal value of w2, but oscillate with respect to the optimal value of w1, overshooting the target each time.
► This makes convergence very slow.
► It is therefore helpful to have features with similar variance.
Challenges in Gradient-Based Optimization

1. Flat regions and local optima
2. Different levels of curvature in different directions
Local optima and flat regions

► Another problem to contend with is the presence of flat regions, where the gradient is close to zero.

► Flat regions are problematic because the speed of descent depends on the magnitude of the gradient, given a fixed learning rate.
► The optimization process will take a long time to cross a flat region of space, which makes convergence slow.
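A tiny one-dimensional sketch makes the slowdown concrete. The functions x^4 (flat near its minimum) and x^2 (well-conditioned bowl) are illustrative choices, not from the lecture:

```python
# Run gradient descent on f(x) = x**4, whose gradient 4*x**3 is nearly
# zero close to the minimum, and on f(x) = x**2, whose gradient 2*x is not.
def descend(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_flat  = descend(lambda x: 4 * x**3, x0=0.5)  # crawls through the flat region
x_steep = descend(lambda x: 2 * x,    x0=0.5)  # converges geometrically

print(f"x^4 descent after 100 steps: {x_flat:.4f}")
print(f"x^2 descent after 100 steps: {x_steep:.2e}")
```

With the same learning rate and the same number of steps, descent on x^4 is still far from the minimum while descent on x^2 has essentially reached it.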
Different levels of curvature

► In multi-dimensional settings, the components of the gradient with respect to different parameters can vary widely. This causes convergence problems, since the update step oscillates with respect to some components while moving steadily with respect to others.
► Consider the simplest possible case of a bowl-like convex, quadratic objective function with a single global minimum: L = x^2 + y^2 represents a perfectly circular bowl, while L = x^2 + 4y^2 represents an elliptical one.
► We shall show contour plots of both functions and how gradient descent performs in finding the minimum of each.
Different levels of curvature

► The second-order derivative measures the rate of change of the gradient: a high second-order derivative means high curvature.
► From the point of view of gradient descent, we want moderate curvature in all dimensions, as it means that the gradient does not change much more in some dimensions than in others.
► We can then take gradient-descent steps of large sizes.
Different levels of curvature

► In the case of the perfect bowl, a sufficiently large step size from any point can take us directly to the optimum of the function in one step, since the gradient at any point points towards the optimum.
► This is not true for the elliptical bowl: the gradient at any point does not point towards the optimum of the function.
Contour plots

► Note that the gradient at any point is orthogonal to the contour line at that point.
► This is because the dot product of the gradient ∇F and a small displacement δx along the contour line gives the change in the value of the function along that displacement.
► Since the function remains constant along the contour line, ∇F · δx = 0.
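This orthogonality can be checked numerically. As a sketch, take F(x, y) = x^2 + y^2, whose contour x^2 + y^2 = 1 can be parameterized as (cos t, sin t) with tangent direction (−sin t, cos t):

```python
import numpy as np

# At each sampled point on the unit-circle contour, the dot product of
# the gradient ∇F = (2x, 2y) and the contour tangent should vanish.
for t in np.linspace(0.0, 2 * np.pi, 8):
    point = np.array([np.cos(t), np.sin(t)])
    grad = 2 * point                           # ∇F at this point
    tangent = np.array([-np.sin(t), np.cos(t)])  # direction along contour
    assert abs(grad @ tangent) < 1e-12

print("gradient is orthogonal to the contour tangent at all sampled points")
```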
Differential curvature

► A closer look at the contour plot for the elliptical bowl shows oscillatory movement in the y-direction: in each step we correct the mistake of overshooting made in the previous step. The gradient component along the y-direction is larger than the component along the x-direction.
► Along the x-direction, we make small movements towards the optimal x-value. Overall, after many training steps, we find that we have made little progress towards the optimum.
► It should be kept in mind that the direction of steepest descent is, for most objective functions, only an instantaneous direction of best improvement, and is not the correct direction of descent in the longer term.
Revisiting feature normalization

► We show how feature normalization addresses, in some measure, the differential curvature problem.
► Consider the following toy dataset, where the two input attributes are x1 and x2, and the output attribute is y.
► We intend to find a relationship of the form y = w1x1 + w2x2 from the data. The coefficients w1 and w2 are found using gradient descent on the loss function computed from the data.

 x1    x2    y
 0.1   25    7
 0.8   10    1
 0.4   10    4

► Loss function:

J(w) = (0.1w1 + 25w2 − 7)^2 + (0.8w1 + 10w2 − 1)^2 + (0.4w1 + 10w2 − 4)^2

► The objective function is much more sensitive to w2 than to w1.
► One way to get around this issue is to standardize each column to zero mean and unit variance.
► The curvatures with respect to w1 and w2 then become much more similar, and differential curvature is reduced.
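As a sketch of this preprocessing step, standardizing the toy dataset above in NumPy:

```python
import numpy as np

# The toy dataset from the slide: x2 is on a much larger scale than x1.
X = np.array([[0.1, 25.0],
              [0.8, 10.0],
              [0.4, 10.0]])

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print("column means:", X_std.mean(axis=0))  # close to [0, 0]
print("column stds: ", X_std.std(axis=0))   # [1, 1]
```

After standardization, both columns contribute on a comparable scale to the loss, so the gradient components with respect to w1 and w2 are of similar magnitude and a single learning rate works well for both.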
Constrained Optimization

► Optimization problems are of two types:
1. Unconstrained optimization
2. Constrained optimization
► We have discussed algorithms to solve unconstrained optimization.
► How do we find the solution to an optimization problem with constraints?
► Example of a constrained optimization problem:

maximize f(x, y) = x^2 y
subject to g(x, y) : x^2 + y^2 = 1
Constrained Optimization : Lagrange Multiplier Method

► A constrained maximization (minimization) problem is rewritten as a Lagrange function whose saddle point corresponds to the constrained maximum (minimum).
► The Lagrange function uses Lagrange multipliers as a strategy for finding the local maxima and minima of a function subject to constraints.
► The maximum of f(x, y) under the constraint g(x, y) is obtained when their gradients point in the same direction.
► Introduce a Lagrange multiplier λ for the equality constraint. Mathematically, ∇f(x, y) = λ∇g(x, y).
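For the example problem introduced earlier (maximize f(x, y) = x^2 y subject to x^2 + y^2 = 1), the condition ∇f = λ∇g can be solved by hand and then verified numerically. Setting ∇f = λ∇g gives 2xy = 2λx and x^2 = 2λy, which together with the constraint yield x^2 = 2/3, y = 1/√3, λ = y:

```python
import numpy as np

# Hand-derived candidate for the constrained maximum of f = x^2 * y
# on the circle g = x^2 + y^2 = 1.
x, y = np.sqrt(2 / 3), 1 / np.sqrt(3)
lam = y

grad_f = np.array([2 * x * y, x**2])   # ∇f = (2xy, x^2)
grad_g = np.array([2 * x, 2 * y])      # ∇g = (2x, 2y)

assert np.isclose(x**2 + y**2, 1)          # constraint is satisfied
assert np.allclose(grad_f, lam * grad_g)   # gradients are parallel

print("constrained maximum f(x, y) =", x**2 * y)  # 2/(3*sqrt(3)) ≈ 0.385
```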
Constrained Optimization : Lagrange Multiplier Method

Consider the following optimization problem:

maximize xy
subject to x + y = 6

► The Lagrangian is L(x, y, λ) = xy − λ(x + y − 6).
► The optimal x and y values remain the same whether you take +λ or −λ for the equality constraint.
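The stationarity conditions of this Lagrangian can be worked through directly. A quick check (a sketch, not from the slides): ∂L/∂x = y − λ = 0 and ∂L/∂y = x − λ = 0 give x = y = λ, and the constraint x + y = 6 then forces x = y = 3.

```python
# Verify the hand-derived solution of: maximize xy subject to x + y = 6.
# Stationarity of L(x, y, λ) = xy − λ(x + y − 6) gives x = y = λ.
lam = 3.0
x, y = lam, lam

assert x + y == 6                        # constraint satisfied
assert y - lam == 0 and x - lam == 0     # stationarity conditions hold

print("maximum of xy subject to x + y = 6 is", x * y)  # 9.0
```

Intuitively, the product of two numbers with a fixed sum is largest when the numbers are equal, which agrees with x = y = 3.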
