ConvexSpring25_Week9

The document discusses various algorithms for optimization, focusing on gradient descent and its convergence properties under different conditions such as bounded gradient, smoothness, and strong convexity. It outlines the iterative process of these algorithms, error metrics for optimality, and the differences between gradient descent and stochastic gradient descent. Additionally, it covers accelerated gradient descent and the challenges in finite sum settings commonly encountered in machine learning.


Module C: Algorithms for Optimization

Recall that an optimization problem in standard form is given by

\[
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{s.t.} \quad & g_i(x) \le 0, \quad i \in [m] := \{1, 2, \dots, m\}, \\
& h_j(x) = 0, \quad j \in [p].
\end{aligned}
\]

Most algorithms generate a sequence $x_0, x_1, x_2, \dots$ by exploiting local information collected along the path.

Zeroth Order: Only the values $f(x_t)$, $g_i(x_t)$, $h_j(x_t)$ are available.

First Order: Gradients $\nabla f(x_t)$, $\nabla g_i(x_t)$, $\nabla h_j(x_t)$ are used. Heavily used in ML.

Second Order: Hessian information is used, e.g., Newton's method.

Distributed Algorithms

Stochastic/Randomized Algorithms

Measure of progress

Let $x^\star$ be the optimal solution. Iterative algorithms are run until one of the following error metrics is sufficiently small.

$\mathrm{err}_t := \|x_t - x^\star\|$

$\mathrm{err}_t := f(x_t) - f(x^\star)$

A solution $\bar{x}$ is $\epsilon$-optimal when
\[ f(\bar{x}) \le f(x^\star) + \epsilon. \]

We often run the algorithm till $\mathrm{err}_t$ is smaller than a sufficiently small $\epsilon > 0$.

In the presence of constraints, we define
\[ \mathrm{err}_t := \max\big(f(x_t) - f(x^\star),\, g_1(x_t),\, g_2(x_t),\, \dots,\, g_m(x_t),\, |h_1(x_t)|, \dots, |h_p(x_t)|\big). \]
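For illustration, a minimal sketch of this stopping criterion in Python (the callables f, g_list, h_list and the optimal value f_star are assumed to be available; this is not part of the slides):

    def constrained_error(x_t, f, g_list, h_list, f_star):
        """Error metric max(f(x_t) - f*, g_i(x_t), |h_j(x_t)|) for the constrained case."""
        terms = [f(x_t) - f_star]
        terms += [g(x_t) for g in g_list]        # inequality constraint values g_i(x_t)
        terms += [abs(h(x_t)) for h in h_list]   # equality residuals |h_j(x_t)|
        return max(terms)

    # Stop the iteration once constrained_error(x_t, ...) <= eps.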

First order methods: Gradient descent

Consider the unconstrained optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$.

Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.

A fixed point of the iteration satisfies $x^\star = x^\star - \eta_t \nabla f(x^\star) \implies \nabla f(x^\star) = 0$, which is the stationarity condition.

The convergence rate depends on the choice of step size $\eta_t$ and on the characteristics of the function.

Bounded Gradient: $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$.

Smooth: A differentiable convex $f$ is $\beta$-smooth if for any $x, y$, we have
\[ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2} \|y - x\|^2. \]
We can obtain a quadratic upper bound on the function from local information.

Strongly Convex: A differentiable convex $f$ is $\alpha$-strongly convex if for any $x, y$, we have
\[ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2. \]
We can obtain a quadratic lower bound on the function from local information.
If $f$ is twice differentiable, then
– $f$ is $\beta$-smooth if and only if $\nabla^2 f(x) \preceq \beta I$, i.e., $\lambda_{\max}(\nabla^2 f(x)) \le \beta$ for all $x \in \mathbb{R}^n$.
– $f$ is $\alpha$-strongly convex if and only if $\nabla^2 f(x) \succeq \alpha I$, i.e., $\lambda_{\min}(\nabla^2 f(x)) \ge \alpha$ for all $x \in \mathbb{R}^n$.
Determine $\beta$ and $\alpha$ for $f(x) = \|Ax - b\|_2^2$.
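One possible way to work this out (a sketch, not given on the slide): since the Hessian of $f$ is constant,
\[ \nabla f(x) = 2A^\top (Ax - b), \qquad \nabla^2 f(x) = 2A^\top A, \]
so $\beta = 2\lambda_{\max}(A^\top A)$ and $\alpha = 2\lambda_{\min}(A^\top A)$; note that $\alpha > 0$ only when $A$ has full column rank.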

Gradient Descent with Bounded Gradient Assumption

Let $x_0, x_1, \dots, x_T$ be the iterates generated by the GD algorithm. For any $t$, we define $\hat{x}_t := \frac{1}{t} \sum_{i=0}^{t-1} x_i$. Let $x^\star$ be the optimal solution.
Theorem 1: Convergence of Gradient Descent

Let the function $f$ satisfy $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$. Let $\|x_0 - x^\star\| \le D$. Then, for the choice of step size $\eta_t = \frac{D}{G\sqrt{T}}$, we have
\[ f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}. \]

To find an $\epsilon$-optimal solution, choose $T \ge \left(\frac{DG}{\epsilon}\right)^2$ and $\eta = \frac{\epsilon}{G^2}$.
Possible Limitation: Need to know G and D.
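For concreteness, a minimal Python sketch of this scheme (grad_f, x0, G, D, T are assumed to be supplied by the user; an illustration, not part of the slides):

    import numpy as np

    def gd_bounded_gradient(grad_f, x0, G, D, T):
        """GD with constant step eta = D / (G * sqrt(T)); returns the averaged iterate."""
        eta = D / (G * np.sqrt(T))
        x = np.asarray(x0, dtype=float).copy()
        x_bar = np.zeros_like(x)
        for _ in range(T):
            x_bar += x / T              # running average of x_0, ..., x_{T-1}
            x = x - eta * grad_f(x)
        return x_bar                    # Theorem 1 bounds f(x_bar) - f(x*) by DG / sqrt(T)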

Proof: Define the following (potential) function:
\[ \Phi_t := \frac{1}{2\eta} \|x_t - x^\star\|^2. \]
We show that $\Phi_t$ is decreasing in $t$ by bounding $\Phi_{t+1} - \Phi_t$.

Gradient Descent with Smoothness Assumption

Recall that a differentiable convex $f$ is $\beta$-smooth if for any $x, y$, we have
\[ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2} \|y - x\|^2. \]
Theorem 2
Let the function $f$ be $\beta$-smooth. Let $\|x_0 - x^\star\| \le D$. Then, for the choice of step size $\eta_t = \frac{1}{\beta}$, we have
\[ f(x_T) - f(x^\star) \le \frac{\beta \|x_0 - x^\star\|^2}{2T}. \]

Proof: Define the following (potential) function:
\[ \Phi_t := t\,[f(x_t) - f(x^\star)] + \frac{\beta}{2} \|x_t - x^\star\|^2. \]
We show that $\Phi_t$ is decreasing in $t$ by bounding $\Phi_{t+1} - \Phi_t$.

Gradient Descent with Smoothness and Strong Convexity

Recall that a differentiable convex $f$ is $\alpha$-strongly convex if for any $x, y$, we have
\[ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2. \]
Theorem 3
Let the function $f$ be $\beta$-smooth and $\alpha$-strongly convex with $\alpha \le \beta$. Define the condition number $\kappa := \frac{\beta}{\alpha}$. Then, for the choice of step size $\eta_t = \frac{1}{\beta}$, we have
\[ f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}} \big(f(x_0) - f(x^\star)\big). \]

Note: To obtain an $\epsilon$-optimal solution, choose $T = O\!\left(\kappa \log\left(\frac{1}{\epsilon}\right)\right)$.

Proof: Define the following (potential) function:
\[ \Phi_t := (1 + \gamma)^t \,[f(x_t) - f(x^\star)], \quad \text{where } \gamma = \frac{1}{\kappa - 1} = \frac{\alpha}{\beta - \alpha}. \]
We need to show that $\Phi_{t+1} \le \Phi_t$.

Summary of gradient descent convergence rates

Consider the unconstrained optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$.

Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.

Theorem 4: GD Convergence rates

Let $\|x_0 - x^\star\| \le D$.

If $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$, then with $\eta_t = \frac{D}{G\sqrt{T}}$, $f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.

If $f$ is $\beta$-smooth, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.

If $f$ is $\beta$-smooth and $\alpha$-strongly convex, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}}\big(f(x_0) - f(x^\star)\big)$, where $\kappa := \frac{\beta}{\alpha}$ is the condition number.
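A minimal sketch of plain GD with the constant step size $\eta_t = 1/\beta$ used in the last two cases above (grad_f, x0, beta, T are assumed inputs; the commented usage refers back to the earlier least-squares exercise):

    import numpy as np

    def gd_smooth(grad_f, x0, beta, T):
        """Plain gradient descent with the constant step size eta = 1/beta."""
        x = np.asarray(x0, dtype=float).copy()
        eta = 1.0 / beta
        for _ in range(T):
            x = x - eta * grad_f(x)
        return x   # last iterate; the smooth and strongly convex rates bound its suboptimality

    # Possible use on f(x) = ||Ax - b||_2^2 (cf. the earlier exercise):
    #   grad_f = lambda x: 2 * A.T @ (A @ x - b)
    #   beta   = 2 * np.linalg.eigvalsh(A.T @ A).max()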

Gradient descent: Constrained Case

Consider the constrained optimization problem $\min_{x \in X} f(x)$, where $X \subseteq \mathbb{R}^n$ is a convex feasible set.

Projected Gradient Descent (PGD): $x_{t+1} = \Pi_X[x_t - \eta_t \nabla f(x_t)]$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$, where $\Pi_X(y)$ is the projection of $y$ onto the set $X$.

Theorem 5
Let $\|x_0 - x^\star\| \le D$.

If $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$, then with $\eta_t = \frac{D}{G\sqrt{T}}$, $f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.

If $f$ is $\beta$-smooth, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.

If $f$ is $\beta$-smooth and $\alpha$-strongly convex, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}}\big(f(x_0) - f(x^\star)\big)$, where $\kappa := \frac{\beta}{\alpha}$ is the condition number.

Note: Convergence rates remain unchanged.

Note: Projection itself is another optimization problem!

Non-expansive property, which preserves the convergence rates:
\[ \|\Pi_X(y_1) - \Pi_X(y_2)\| \le \|y_1 - y_2\|. \]
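A minimal PGD sketch, with the projection passed in as a function (project, grad_f, x0, eta, T are assumed inputs; an illustration, not from the slides):

    import numpy as np

    def projected_gd(grad_f, project, x0, eta, T):
        """Projected gradient descent: gradient step followed by projection onto X."""
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(T):
            x = project(x - eta * grad_f(x))
        return x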

When is Projection easy to find?

Note that $\Pi_X(y) = \operatorname{argmin}_{x \in X} \|y - x\|^2$. Find a closed-form expression for the projection in each of the following cases (sketches for the first three are given after the list).

$X = \{x \in \mathbb{R}^n : \|x\|_2 \le r\}$.

$X = \{x \in \mathbb{R}^n : x_l \le x \le x_u\}$.

$X = \{x \in \mathbb{R}^n : Ax = b\}$.

$X = \{x \in \mathbb{R}^n : x \ge 0,\ \sum_{i=1}^n x_i \le 1\}$.
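Hedged sketches of closed-form projections for the first three cases (the affine case assumes $A$ has full row rank; the last set has no one-line formula and is typically handled with a sort-based simplex-projection routine, omitted here):

    import numpy as np

    def proj_ball(y, r):
        """Projection onto {x : ||x||_2 <= r}: rescale if y lies outside the ball."""
        norm = np.linalg.norm(y)
        return y if norm <= r else (r / norm) * y

    def proj_box(y, x_l, x_u):
        """Projection onto {x : x_l <= x <= x_u}: componentwise clipping."""
        return np.clip(y, x_l, x_u)

    def proj_affine(y, A, b):
        """Projection onto {x : Ax = b}, assuming A has full row rank."""
        # y - A^T (A A^T)^{-1} (A y - b)
        return y - A.T @ np.linalg.solve(A @ A.T, A @ y - b)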

Accelerated Gradient Descent

Start with $x_0 = y_0 = z_0 \in \mathbb{R}^n$. At every time step $t$,
\[
\begin{aligned}
y_{t+1} &= x_t - \frac{1}{\beta} \nabla f(x_t), \\
z_{t+1} &= z_t - \eta_t \nabla f(x_t), \\
x_{t+1} &= (1 - \tau_{t+1})\, y_{t+1} + \tau_{t+1}\, z_{t+1}.
\end{aligned}
\]

Theorem 6
Let $f$ be $\beta$-smooth, $\eta_t = \frac{t+1}{2\beta}$, and $\tau_t = \frac{2}{t+2}$. Then, we have
\[ f(y_T) - f(x^\star) \le \frac{2\beta \|x_0 - x^\star\|^2}{T(T+1)}. \]

Proof: Define $\Phi_t = t(t+1)\big(f(y_t) - f(x^\star)\big) + 2\beta \|z_t - x^\star\|^2$ and show that $\Phi_{t+1} \le \Phi_t$.
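A minimal sketch of this three-sequence scheme (grad_f, x0, beta, T are assumed inputs; note that the combination step at time $t$ uses $\tau_{t+1} = 2/(t+3)$):

    import numpy as np

    def accelerated_gd(grad_f, x0, beta, T):
        """Three-sequence accelerated gradient descent for a beta-smooth convex f."""
        x = np.asarray(x0, dtype=float).copy()
        y = x.copy()
        z = x.copy()
        for t in range(T):
            g = grad_f(x)
            y = x - g / beta                    # gradient step
            z = z - (t + 1) / (2 * beta) * g    # "z" step with eta_t = (t+1) / (2*beta)
            tau = 2.0 / (t + 3)                 # tau_{t+1} = 2 / ((t+1) + 2)
            x = (1 - tau) * y + tau * z         # convex combination gives the next query point
        return y                                # Theorem 6 bounds f(y_T) - f(x*)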

Accelerated Gradient Descent 2

Start with $x_0 = y_0$. At every time step $t$,
\[
\begin{aligned}
y_{t+1} &= x_t - \frac{1}{\beta} \nabla f(x_t), \\
x_{t+1} &= \left(1 + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right) y_{t+1} - \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\, y_t.
\end{aligned}
\]
Theorem 7

Let $f$ be $\beta$-smooth and $\alpha$-strongly convex with $\kappa = \frac{\beta}{\alpha}$, and let $\gamma = \frac{1}{\sqrt{\kappa} - 1}$. Then, we have
\[ f(y_T) - f(x^\star) \le (1 + \gamma)^{-T}\, \frac{\alpha + \beta}{2}\, \|x_0 - x^\star\|^2. \]

Improvement upon the previous rate, where we had $\gamma = \frac{1}{\kappa - 1}$.
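A minimal sketch of this momentum form (grad_f, x0, beta, alpha, T are assumed inputs; an illustration, not from the slides):

    import numpy as np

    def accelerated_gd_sc(grad_f, x0, beta, alpha, T):
        """Momentum form of AGD for a beta-smooth, alpha-strongly convex f."""
        kappa = beta / alpha
        gamma = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum coefficient
        x = np.asarray(x0, dtype=float).copy()
        y_prev = x.copy()                                     # y_0 = x_0
        for _ in range(T):
            y = x - grad_f(x) / beta                          # gradient step
            x = (1 + gamma) * y - gamma * y_prev              # extrapolation (momentum) step
            y_prev = y
        return y_prev                                         # Theorem 7 bounds f(y_T) - f(x*)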

Further details

AGD was invented by Nesterov in a series of papers in the 1980s and early 2000s, and was later popularized by ML researchers.

The convergence rates in the previous two theorems are the best possible
ones.

Book by Nesterov:
https://link.springer.com/book/10.1007/978-1-4419-8853-9

https://francisbach.com/continuized-acceleration/

https://www.nowpublishers.com/article/Details/OPT-036

Finite Sum Setting

A large number of problems that arise in (supervised) ML can be written as
\[ \min_{x \in \mathbb{R}^n} f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x) = \frac{1}{N} \sum_{i=1}^{N} l(x, \xi_i). \]

Example: Regression/Least Squares, SVM, NN Training

The above problem can also be viewed as a sample average approximation of the stochastic optimization problem
\[ f(x) = \mathbb{E}[l(x, \xi)] \]
involving an uncertain parameter or random variable $\xi$.


Challenge: $N$ (the number of samples) and $n$ (the dimension of the decision variable) may both be large. Samples may be located on different servers.
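A concrete least-squares instance of the finite-sum setting (a sketch; the data A, b and the sizes N, n are synthetic placeholders):

    import numpy as np

    # Synthetic least-squares finite sum: f(x) = (1/N) * sum_i (a_i^T x - b_i)^2
    rng = np.random.default_rng(0)
    N, n = 1000, 20
    A = rng.standard_normal((N, n))
    b = rng.standard_normal(N)

    def grad_fi(x, i):
        """Gradient of the i-th summand f_i(x) = (a_i^T x - b_i)^2."""
        return 2.0 * (A[i] @ x - b[i]) * A[i]

    def grad_f(x):
        """Full gradient: the average of the N per-sample gradients."""
        return (2.0 / N) * A.T @ (A @ x - b)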

Gradient Descent vs. Stochastic Gradient Descent

Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t) = x_t - \frac{\eta_t}{N} \sum_{i=1}^{N} \nabla f_i(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.

Each step requires $N$ gradient computations.

Stochastic Gradient Descent (SGD): At every time step $t$,

Pick an index (sample) $i_t$ uniformly at random from the set $\{1, 2, \dots, N\}$.
Set $x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t)$.

Each step requires one gradient computation, which is a noisy version of the true gradient of the cost function at $x_t$.
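A minimal SGD sketch (grad_fi and N can be taken from the finite-sum sketch above; step_size is a function of the iteration counter; an illustration, not from the slides):

    import numpy as np

    def sgd(grad_fi, x0, N, step_size, T, seed=0):
        """SGD: at each step, follow the gradient of one uniformly sampled summand."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float).copy()
        for t in range(T):
            i = rng.integers(N)                   # sample i_t uniformly from {0, ..., N-1}
            x = x - step_size(t) * grad_fi(x, i)  # noisy but unbiased gradient step
        return x

    # e.g. sgd(grad_fi, np.zeros(n), N, step_size=lambda t: 1e-2, T=10_000)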

Key result for SGD convergence

Under the following assumptions:

Convexity: each $f_i$ is convex,
Bounded variance: $\mathbb{E}[\|\nabla f_{i_t}(x)\|^2] \le \sigma^2$ for some $\sigma$ and all $x$,
Unbiased gradient estimate: $\mathbb{E}[\nabla f_{i_t}(x)] = \nabla f(x)$ for all $x$,

the iterates generated by the SGD algorithm satisfy
\[
\sum_{t=0}^{T-1} \eta_t \big[\mathbb{E}[f(x_t)] - f(x^\star)\big] \le \frac{1}{2}\|x_0 - x^\star\|^2 + \frac{\sigma^2}{2} \sum_{t=0}^{T-1} \eta_t^2
\]
\[
\implies \mathbb{E}[f(\hat{x}_T)] - f(x^\star) \le \frac{\|x_0 - x^\star\|^2}{2\sum_{t=0}^{T-1} \eta_t} + \frac{\sigma^2 \sum_{t=0}^{T-1} \eta_t^2}{2\sum_{t=0}^{T-1} \eta_t},
\]
where $\hat{x}_T = \frac{1}{\sum_{t=0}^{T-1} \eta_t} \sum_{t=0}^{T-1} \eta_t x_t$.

Choice of stepsize

A constant step size will not give us convergence. For convergence, we need to choose step sizes that are diminishing and square-summable, i.e.,
\[ \lim_{T \to \infty} \sum_{t=0}^{T-1} \eta_t = \infty, \qquad \lim_{T \to \infty} \sum_{t=0}^{T-1} \eta_t^2 < \infty. \]

If $\eta_t := \frac{1}{c\sqrt{t+1}}$, then $\mathbb{E}[f(\hat{x}_T)] - f(x^\star) \le O\!\left(\frac{\log T}{\sqrt{T}}\right)$. This rate does not improve when the function is smooth.

When the function is smooth, for $\eta_t := \eta$ chosen appropriately, the R.H.S. will be of order $O\!\left(\frac{1}{\eta T}\right) + O(\eta)$.
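For example, the diminishing schedule $\eta_t = \frac{1}{c\sqrt{t+1}}$ can be plugged into the SGD sketch from the previous slide (c is a tuning constant; an illustration only):

    import numpy as np

    def diminishing_step(t, c=1.0):
        """Diminishing schedule eta_t = 1 / (c * sqrt(t + 1))."""
        return 1.0 / (c * np.sqrt(t + 1))

    # e.g. x_hat = sgd(grad_fi, np.zeros(n), N, step_size=diminishing_step, T=10_000)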

Analysis for Smooth and Strongly Convex Functions

When the function $f$ is $\beta$-smooth and $\alpha$-strongly convex, we have the following guarantees for SGD after $T$ iterations.

If $\eta_t := \frac{1}{ct}$ for a suitable constant $c$, then the error bound is $O\!\left(\frac{\log T}{T}\right)$. This can be improved to $O\!\left(\frac{1}{T}\right)$.

If $\eta_t := \eta$, then the error bound is
\[ \mathbb{E}[\|x_T - x^\star\|^2] \le (1 - \eta\alpha)^T \|x_0 - x^\star\|^2 + \frac{\eta\sigma^2}{2\alpha}. \]
With a constant step size $\eta < \frac{1}{\alpha}$, convergence to a neighborhood of the optimal solution is fast.

Extension: Mini-Batch
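A minimal mini-batch SGD sketch (not from the slides): each step averages per-sample gradients over a batch of indices drawn uniformly at random, which reduces the variance of the gradient estimate roughly in proportion to the batch size.

    import numpy as np

    def minibatch_sgd(grad_fi, x0, N, batch_size, step_size, T, seed=0):
        """Mini-batch SGD: average per-sample gradients over a random batch at each step."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float).copy()
        for t in range(T):
            batch = rng.integers(N, size=batch_size)             # indices sampled with replacement
            g = np.mean([grad_fi(x, i) for i in batch], axis=0)  # averaged stochastic gradient
            x = x - step_size(t) * g
        return x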

25
Extension: Stochastic Averaging
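A hedged sketch in the spirit of the SAGA update from the papers listed under Further Reading below: keep a table of the most recently seen gradient of each $f_i$ and correct the current stochastic gradient with it, so each step still costs one new gradient evaluation while the variance of the search direction shrinks.

    import numpy as np

    def saga(grad_fi, x0, N, eta, T, seed=0):
        """SAGA-style variance reduction: store the last seen gradient of each summand."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float).copy()
        table = np.array([grad_fi(x, i) for i in range(N)])  # one full pass to initialize the table
        table_mean = table.mean(axis=0)
        for _ in range(T):
            j = rng.integers(N)
            g = grad_fi(x, j)
            direction = g - table[j] + table_mean            # unbiased, variance-reduced direction
            x = x - eta * direction
            table_mean += (g - table[j]) / N                 # keep the running mean consistent
            table[j] = g
        return x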

Further Reading

SAG: Schmidt, Mark, Nicolas Le Roux, and Francis Bach. "Minimizing finite sums with the stochastic average gradient." Mathematical Programming 162 (2017): 83-112.

SAGA: Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien. "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives." Advances in Neural Information Processing Systems 27 (2014).

Recent Review: Gower, Robert M., Mark Schmidt, Francis Bach, and Peter Richtárik. "Variance-reduced methods for machine learning." Proceedings of the IEEE 108, no. 11 (2020): 1968-1983.

Allen-Zhu, Zeyuan. "Katyusha: The First Direct Acceleration of Stochastic Gradient Methods." Journal of Machine Learning Research 18 (2018): 1-51.

Varre, Aditya, and Nicolas Flammarion. "Accelerated SGD for non-strongly-convex least squares." In Conference on Learning Theory, pp. 2062-2126. PMLR, 2022.

Hanzely, Filip, Konstantin Mishchenko, and Peter Richtárik. "SEGA: Variance reduction via gradient sketching." Advances in Neural Information Processing Systems 31 (2018).

Extension: Adaptive Step-sizes

AdaGrad: Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12, no. 7 (2011).

Adam: Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
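For reference, a hedged sketch of a diagonal AdaGrad-style update in the spirit of the Duchi et al. paper (eta and eps are assumed tuning constants; an illustration, not the paper's exact algorithm):

    import numpy as np

    def adagrad(grad_fi, x0, N, eta, T, eps=1e-8, seed=0):
        """AdaGrad-style SGD: per-coordinate steps scaled by accumulated squared gradients."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float).copy()
        accum = np.zeros_like(x)                 # running sum of squared gradients, per coordinate
        for _ in range(T):
            g = grad_fi(x, rng.integers(N))
            accum += g ** 2
            x = x - eta * g / (np.sqrt(accum) + eps)
        return x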

