ConvexSpring25_Week9

The document discusses various algorithms for optimization, focusing on gradient descent and its convergence properties under different conditions such as bounded gradient, smoothness, and strong convexity. It outlines the iterative process of these algorithms, error metrics for optimality, and the differences between gradient descent and stochastic gradient descent. Additionally, it covers accelerated gradient descent and the challenges in finite sum settings commonly encountered in machine learning.


Module C: Algorithms for Optimization

Recall that an optimization problem in standard form is given by

\[
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{s.t.} \quad & g_i(x) \le 0, \quad i \in [m] := \{1, 2, \dots, m\}, \\
& h_j(x) = 0, \quad j \in [p].
\end{aligned}
\]

Most algorithms generate a sequence $x_0, x_1, x_2, \dots$ by exploiting local information collected along the path.

Zeroth Order: Only the values $f(x_t)$, $g_i(x_t)$, $h_j(x_t)$ are available.

First Order: Gradients $\nabla f(x_t)$, $\nabla g_i(x_t)$, $\nabla h_j(x_t)$ are used. Heavily used in ML.

Second Order: Hessian information is used, e.g., Newton's method.

Distributed Algorithms

Stochastic/Randomized Algorithms

Measure of progress

Let $x^\star$ be the optimal solution. Iterative algorithms are run until one of the following error metrics is sufficiently small.

$\mathrm{err}_t := \|x_t - x^\star\|$

$\mathrm{err}_t := f(x_t) - f(x^\star)$

A solution $\bar{x}$ is $\epsilon$-optimal when
\[ f(\bar{x}) \le f(x^\star) + \epsilon. \]

We often run the algorithm till $\mathrm{err}_t$ is smaller than a sufficiently small $\epsilon > 0$.

In the presence of constraints, we define
\[ \mathrm{err}_t := \max\big(f(x_t) - f(x^\star),\, g_1(x_t),\, g_2(x_t),\, \dots,\, g_m(x_t),\, |h_1(x_t)|, \dots, |h_p(x_t)|\big). \]
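For illustration, a minimal sketch of this stopping criterion in Python (the callables f, g_list, h_list and the optimal value f_star are assumed to be available; this is not part of the slides):

    def constrained_error(x_t, f, g_list, h_list, f_star):
        """Error metric max(f(x_t) - f*, g_i(x_t), |h_j(x_t)|) for the constrained case."""
        terms = [f(x_t) - f_star]
        terms += [g(x_t) for g in g_list]        # inequality constraint values g_i(x_t)
        terms += [abs(h(x_t)) for h in h_list]   # equality residuals |h_j(x_t)|
        return max(terms)

    # Stop the iteration once constrained_error(x_t, ...) <= eps.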

First order methods: Gradient descent

Consider the unconstrained optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$.

Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.

A fixed point of the iteration satisfies $x^\star = x^\star - \eta_t \nabla f(x^\star) \implies \nabla f(x^\star) = 0$, which is the stationarity condition.

The convergence rate depends on the choice of step size $\eta_t$ and on the characteristics of the function.

Bounded Gradient: $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$.

Smooth: A differentiable convex $f$ is $\beta$-smooth if for any $x, y$, we have
\[ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2} \|y - x\|^2. \]
We can obtain a quadratic upper bound on the function from local information.

Strongly Convex: A differentiable convex $f$ is $\alpha$-strongly convex if for any $x, y$, we have
\[ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2. \]
We can obtain a quadratic lower bound on the function from local information.
If $f$ is twice differentiable, then
– $f$ is $\beta$-smooth if and only if $\nabla^2 f(x) \preceq \beta I$, i.e., $\lambda_{\max}(\nabla^2 f(x)) \le \beta$ for all $x \in \mathbb{R}^n$.
– $f$ is $\alpha$-strongly convex if and only if $\nabla^2 f(x) \succeq \alpha I$, i.e., $\lambda_{\min}(\nabla^2 f(x)) \ge \alpha$ for all $x \in \mathbb{R}^n$.
Determine $\beta$ and $\alpha$ for $f(x) = \|Ax - b\|_2^2$.
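One possible way to work this out (a sketch, not given on the slide): since the Hessian of $f$ is constant,
\[ \nabla f(x) = 2A^\top (Ax - b), \qquad \nabla^2 f(x) = 2A^\top A, \]
so $\beta = 2\lambda_{\max}(A^\top A)$ and $\alpha = 2\lambda_{\min}(A^\top A)$; note that $\alpha > 0$ only when $A$ has full column rank.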

Gradient Descent with Bounded Gradient Assumption

Let $x_0, x_1, \dots, x_T$ be the iterates generated by the GD algorithm. For any $t$, we define $\hat{x}_t := \frac{1}{t} \sum_{i=0}^{t-1} x_i$. Let $x^\star$ be the optimal solution.
Theorem 1: Convergence of Gradient Descent

Let the function $f$ satisfy $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$. Let $\|x_0 - x^\star\| \le D$. Then, for the choice of step size $\eta_t = \frac{D}{G\sqrt{T}}$, we have
\[ f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}. \]

To find an $\epsilon$-optimal solution, choose $T \ge \left(\frac{DG}{\epsilon}\right)^2$ and $\eta = \frac{\epsilon}{G^2}$.
Possible Limitation: Need to know G and D.
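For concreteness, a minimal Python sketch of this scheme (grad_f, x0, G, D, T are assumed to be supplied by the user; an illustration, not part of the slides):

    import numpy as np

    def gd_bounded_gradient(grad_f, x0, G, D, T):
        """GD with constant step eta = D / (G * sqrt(T)); returns the averaged iterate."""
        eta = D / (G * np.sqrt(T))
        x = np.asarray(x0, dtype=float).copy()
        x_bar = np.zeros_like(x)
        for _ in range(T):
            x_bar += x / T              # running average of x_0, ..., x_{T-1}
            x = x - eta * grad_f(x)
        return x_bar                    # Theorem 1 bounds f(x_bar) - f(x*) by DG / sqrt(T)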

Proof: Define the following (potential) function:
\[ \Phi_t := \frac{1}{2\eta} \|x_t - x^\star\|^2. \]
We show that $\Phi_t$ is decreasing in $t$ by bounding $\Phi_{t+1} - \Phi_t$.

Gradient Descent with Smoothness Assumption

Recall that a differentiable convex $f$ is $\beta$-smooth if for any $x, y$, we have
\[ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2} \|y - x\|^2. \]
Theorem 2
Let the function $f$ be $\beta$-smooth. Let $\|x_0 - x^\star\| \le D$. Then, for the choice of step size $\eta_t = \frac{1}{\beta}$, we have
\[ f(x_T) - f(x^\star) \le \frac{\beta \|x_0 - x^\star\|^2}{2T}. \]

Proof: Define the following (potential) function:
\[ \Phi_t := t\,[f(x_t) - f(x^\star)] + \frac{\beta}{2} \|x_t - x^\star\|^2. \]
We show that $\Phi_t$ is decreasing in $t$ by bounding $\Phi_{t+1} - \Phi_t$.

Gradient Descent with Smoothness and Strong Convexity

Recall that a differentiable convex $f$ is $\alpha$-strongly convex if for any $x, y$, we have
\[ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2. \]
Theorem 3
Let the function $f$ be $\beta$-smooth and $\alpha$-strongly convex with $\alpha \le \beta$. Define the condition number $\kappa := \frac{\beta}{\alpha}$. Then, for the choice of step size $\eta_t = \frac{1}{\beta}$, we have
\[ f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}} \big(f(x_0) - f(x^\star)\big). \]

Note: To obtain an $\epsilon$-optimal solution, choose $T = O\!\left(\kappa \log\left(\frac{1}{\epsilon}\right)\right)$.

Proof: Define the following (potential) function:
\[ \Phi_t := (1 + \gamma)^t \,[f(x_t) - f(x^\star)], \quad \text{where } \gamma = \frac{1}{\kappa - 1} = \frac{\alpha}{\beta - \alpha}. \]
We need to show that $\Phi_{t+1} \le \Phi_t$.

Summary of gradient descent convergence rates

Consider the unconstrained optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$.

Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.

Theorem 4: GD Convergence rates

Let $\|x_0 - x^\star\| \le D$.

If $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$, then with $\eta_t = \frac{D}{G\sqrt{T}}$, $f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.

If $f$ is $\beta$-smooth, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.

If $f$ is $\beta$-smooth and $\alpha$-strongly convex, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}}\big(f(x_0) - f(x^\star)\big)$, where $\kappa := \frac{\beta}{\alpha}$ is the condition number.
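A minimal sketch of plain GD with the constant step size $\eta_t = 1/\beta$ used in the last two cases above (grad_f, x0, beta, T are assumed inputs; the commented usage refers back to the earlier least-squares exercise):

    import numpy as np

    def gd_smooth(grad_f, x0, beta, T):
        """Plain gradient descent with the constant step size eta = 1/beta."""
        x = np.asarray(x0, dtype=float).copy()
        eta = 1.0 / beta
        for _ in range(T):
            x = x - eta * grad_f(x)
        return x   # last iterate; the smooth and strongly convex rates bound its suboptimality

    # Possible use on f(x) = ||Ax - b||_2^2 (cf. the earlier exercise):
    #   grad_f = lambda x: 2 * A.T @ (A @ x - b)
    #   beta   = 2 * np.linalg.eigvalsh(A.T @ A).max()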

Gradient descent: Constrained Case

Consider the constrained optimization problem $\min_{x \in X} f(x)$, where $X \subseteq \mathbb{R}^n$ is a convex feasible set.

Projected Gradient Descent (PGD): $x_{t+1} = \Pi_X[x_t - \eta_t \nabla f(x_t)]$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$, where $\Pi_X(y)$ is the projection of $y$ onto the set $X$.

Theorem 5
Let $\|x_0 - x^\star\| \le D$.

If $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$, then with $\eta_t = \frac{D}{G\sqrt{T}}$, $f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.

If $f$ is $\beta$-smooth, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.

If $f$ is $\beta$-smooth and $\alpha$-strongly convex, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}}\big(f(x_0) - f(x^\star)\big)$, where $\kappa := \frac{\beta}{\alpha}$ is the condition number.

Note: Convergence rates remain unchanged.

Note: Projection itself is another optimization problem!

Non-expansive property, which preserves the convergence rates:
\[ \|\Pi_X(y_1) - \Pi_X(y_2)\| \le \|y_1 - y_2\|. \]
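A minimal PGD sketch, with the projection passed in as a function (project, grad_f, x0, eta, T are assumed inputs; an illustration, not from the slides):

    import numpy as np

    def projected_gd(grad_f, project, x0, eta, T):
        """Projected gradient descent: gradient step followed by projection onto X."""
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(T):
            x = project(x - eta * grad_f(x))
        return x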

When is Projection easy to find?

Note that $\Pi_X(y) = \operatorname{argmin}_{x \in X} \|y - x\|^2$. Find a closed-form expression for the projection in each of the following cases (sketches for the first three are given after the list).

$X = \{x \in \mathbb{R}^n : \|x\|_2 \le r\}$.

$X = \{x \in \mathbb{R}^n : x_l \le x \le x_u\}$.

$X = \{x \in \mathbb{R}^n : Ax = b\}$.

$X = \{x \in \mathbb{R}^n : x \ge 0,\ \sum_{i=1}^n x_i \le 1\}$.
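Hedged sketches of closed-form projections for the first three cases (the affine case assumes $A$ has full row rank; the last set has no one-line formula and is typically handled with a sort-based simplex-projection routine, omitted here):

    import numpy as np

    def proj_ball(y, r):
        """Projection onto {x : ||x||_2 <= r}: rescale if y lies outside the ball."""
        norm = np.linalg.norm(y)
        return y if norm <= r else (r / norm) * y

    def proj_box(y, x_l, x_u):
        """Projection onto {x : x_l <= x <= x_u}: componentwise clipping."""
        return np.clip(y, x_l, x_u)

    def proj_affine(y, A, b):
        """Projection onto {x : Ax = b}, assuming A has full row rank."""
        # y - A^T (A A^T)^{-1} (A y - b)
        return y - A.T @ np.linalg.solve(A @ A.T, A @ y - b)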

Accelerated Gradient Descent

Start with $x_0 = y_0 = z_0 \in \mathbb{R}^n$. At every time step $t$,
\[
\begin{aligned}
y_{t+1} &= x_t - \frac{1}{\beta} \nabla f(x_t), \\
z_{t+1} &= z_t - \eta_t \nabla f(x_t), \\
x_{t+1} &= (1 - \tau_{t+1})\, y_{t+1} + \tau_{t+1}\, z_{t+1}.
\end{aligned}
\]

Theorem 6
Let $f$ be $\beta$-smooth, $\eta_t = \frac{t+1}{2\beta}$, and $\tau_t = \frac{2}{t+2}$. Then, we have
\[ f(y_T) - f(x^\star) \le \frac{2\beta \|x_0 - x^\star\|^2}{T(T+1)}. \]

Proof: Define $\Phi_t = t(t+1)\big(f(y_t) - f(x^\star)\big) + 2\beta \|z_t - x^\star\|^2$ and show that $\Phi_{t+1} \le \Phi_t$.
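A minimal sketch of this three-sequence scheme (grad_f, x0, beta, T are assumed inputs; note that the combination step at time $t$ uses $\tau_{t+1} = 2/(t+3)$):

    import numpy as np

    def accelerated_gd(grad_f, x0, beta, T):
        """Three-sequence accelerated gradient descent for a beta-smooth convex f."""
        x = np.asarray(x0, dtype=float).copy()
        y = x.copy()
        z = x.copy()
        for t in range(T):
            g = grad_f(x)
            y = x - g / beta                    # gradient step
            z = z - (t + 1) / (2 * beta) * g    # "z" step with eta_t = (t+1) / (2*beta)
            tau = 2.0 / (t + 3)                 # tau_{t+1} = 2 / ((t+1) + 2)
            x = (1 - tau) * y + tau * z         # convex combination gives the next query point
        return y                                # Theorem 6 bounds f(y_T) - f(x*)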

Accelerated Gradient Descent 2

Start with $x_0 = y_0$. At every time step $t$,
\[
\begin{aligned}
y_{t+1} &= x_t - \frac{1}{\beta} \nabla f(x_t), \\
x_{t+1} &= \left(1 + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right) y_{t+1} - \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\, y_t.
\end{aligned}
\]
Theorem 7

Let $f$ be $\beta$-smooth and $\alpha$-strongly convex with $\kappa = \frac{\beta}{\alpha}$, and let $\gamma = \frac{1}{\sqrt{\kappa} - 1}$. Then, we have
\[ f(y_T) - f(x^\star) \le (1 + \gamma)^{-T}\, \frac{\alpha + \beta}{2}\, \|x_0 - x^\star\|^2. \]

Improvement upon the previous rate, where we had $\gamma = \frac{1}{\kappa - 1}$.
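A minimal sketch of this momentum form (grad_f, x0, beta, alpha, T are assumed inputs; an illustration, not from the slides):

    import numpy as np

    def accelerated_gd_sc(grad_f, x0, beta, alpha, T):
        """Momentum form of AGD for a beta-smooth, alpha-strongly convex f."""
        kappa = beta / alpha
        gamma = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum coefficient
        x = np.asarray(x0, dtype=float).copy()
        y_prev = x.copy()                                     # y_0 = x_0
        for _ in range(T):
            y = x - grad_f(x) / beta                          # gradient step
            x = (1 + gamma) * y - gamma * y_prev              # extrapolation (momentum) step
            y_prev = y
        return y_prev                                         # Theorem 7 bounds f(y_T) - f(x*)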

Further details

AGD was invented by Nesterov in a series of papers in the 1980s and early 2000s, and was later popularized by ML researchers.

The convergence rates in the previous two theorems are the best possible
ones.

Book by Nesterov:
https://link.springer.com/book/10.1007/978-1-4419-8853-9

https://francisbach.com/continuized-acceleration/

https://www.nowpublishers.com/article/Details/OPT-036

Finite Sum Setting

A large number of problems that arise in (supervised) ML can be written as
\[ \min_{x \in \mathbb{R}^n} f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x) = \frac{1}{N} \sum_{i=1}^{N} l(x, \xi_i). \]

Example: Regression/Least Squares, SVM, NN Training

The above problem can also be viewed as a sample average approximation of the stochastic optimization problem
\[ f(x) = \mathbb{E}[l(x, \xi)] \]
involving an uncertain parameter or random variable $\xi$.


Challenge: $N$ (the number of samples) and $n$ (the dimension of the decision variable) may both be large. Samples may be located on different servers.
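A concrete least-squares instance of the finite-sum setting (a sketch; the data A, b and the sizes N, n are synthetic placeholders):

    import numpy as np

    # Synthetic least-squares finite sum: f(x) = (1/N) * sum_i (a_i^T x - b_i)^2
    rng = np.random.default_rng(0)
    N, n = 1000, 20
    A = rng.standard_normal((N, n))
    b = rng.standard_normal(N)

    def grad_fi(x, i):
        """Gradient of the i-th summand f_i(x) = (a_i^T x - b_i)^2."""
        return 2.0 * (A[i] @ x - b[i]) * A[i]

    def grad_f(x):
        """Full gradient: the average of the N per-sample gradients."""
        return (2.0 / N) * A.T @ (A @ x - b)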

Gradient Descent vs. Stochastic Gradient Descent

Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t) = x_t - \frac{\eta_t}{N} \sum_{i=1}^{N} \nabla f_i(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.

Each step requires $N$ gradient computations.

Stochastic Gradient Descent (SGD): At every time step $t$,

Pick an index (sample) $i_t$ uniformly at random from the set $\{1, 2, \dots, N\}$.
Set $x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t)$.

Each step requires one gradient computation, which is a noisy version of the true gradient of the cost function at $x_t$.
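A minimal SGD sketch (grad_fi and N can be taken from the finite-sum sketch above; step_size is a function of the iteration counter; an illustration, not from the slides):

    import numpy as np

    def sgd(grad_fi, x0, N, step_size, T, seed=0):
        """SGD: at each step, follow the gradient of one uniformly sampled summand."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float).copy()
        for t in range(T):
            i = rng.integers(N)                   # sample i_t uniformly from {0, ..., N-1}
            x = x - step_size(t) * grad_fi(x, i)  # noisy but unbiased gradient step
        return x

    # e.g. sgd(grad_fi, np.zeros(n), N, step_size=lambda t: 1e-2, T=10_000)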

Key result for SGD convergence

Under the following assumptions:

Convexity: each $f_i$ is convex,
Bounded variance: $\mathbb{E}[\|\nabla f_{i_t}(x)\|^2] \le \sigma^2$ for some $\sigma$ and all $x$,
Unbiased gradient estimate: $\mathbb{E}[\nabla f_{i_t}(x)] = \nabla f(x)$ for all $x$,

the iterates generated by the SGD algorithm satisfy
\[
\sum_{t=0}^{T-1} \eta_t \big[\mathbb{E}[f(x_t)] - f(x^\star)\big] \le \frac{1}{2}\|x_0 - x^\star\|^2 + \frac{\sigma^2}{2} \sum_{t=0}^{T-1} \eta_t^2
\]
\[
\implies \mathbb{E}[f(\hat{x}_T)] - f(x^\star) \le \frac{\|x_0 - x^\star\|^2}{2\sum_{t=0}^{T-1} \eta_t} + \frac{\sigma^2 \sum_{t=0}^{T-1} \eta_t^2}{2\sum_{t=0}^{T-1} \eta_t},
\]
where $\hat{x}_T = \frac{1}{\sum_{t=0}^{T-1} \eta_t} \sum_{t=0}^{T-1} \eta_t x_t$.

Choice of stepsize

A constant step size will not give us convergence. For convergence, we need to choose step sizes that are diminishing and square-summable, i.e.,
\[ \lim_{T \to \infty} \sum_{t=0}^{T-1} \eta_t = \infty, \qquad \lim_{T \to \infty} \sum_{t=0}^{T-1} \eta_t^2 < \infty. \]

If $\eta_t := \frac{1}{c\sqrt{t+1}}$, then $\mathbb{E}[f(\hat{x}_T)] - f(x^\star) \le O\!\left(\frac{\log T}{\sqrt{T}}\right)$. This rate does not improve when the function is smooth.

When the function is smooth, for $\eta_t := \eta$ chosen appropriately, the R.H.S. will be of order $O\!\left(\frac{1}{\eta T}\right) + O(\eta)$.
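For example, the diminishing schedule $\eta_t = \frac{1}{c\sqrt{t+1}}$ can be plugged into the SGD sketch from the previous slide (c is a tuning constant; an illustration only):

    import numpy as np

    def diminishing_step(t, c=1.0):
        """Diminishing schedule eta_t = 1 / (c * sqrt(t + 1))."""
        return 1.0 / (c * np.sqrt(t + 1))

    # e.g. x_hat = sgd(grad_fi, np.zeros(n), N, step_size=diminishing_step, T=10_000)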

Analysis for Smooth and Strongly Convex Functions

When the function $f$ is $\beta$-smooth and $\alpha$-strongly convex, we have the following guarantees for SGD after $T$ iterations.

If $\eta_t := \frac{1}{ct}$ for a suitable constant $c$, then the error bound is $O\!\left(\frac{\log T}{T}\right)$. This can be improved to $O\!\left(\frac{1}{T}\right)$.

If $\eta_t := \eta$, then the error bound is
\[ \mathbb{E}[\|x_T - x^\star\|^2] \le (1 - \eta\alpha)^T \|x_0 - x^\star\|^2 + \frac{\eta\sigma^2}{2\alpha}. \]
With a constant step size $\eta < \frac{1}{\alpha}$, convergence to a neighborhood of the optimal solution is fast.

Extension: Mini-Batch
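A minimal mini-batch SGD sketch (not from the slides): each step averages per-sample gradients over a batch of indices drawn uniformly at random, which reduces the variance of the gradient estimate roughly in proportion to the batch size.

    import numpy as np

    def minibatch_sgd(grad_fi, x0, N, batch_size, step_size, T, seed=0):
        """Mini-batch SGD: average per-sample gradients over a random batch at each step."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float).copy()
        for t in range(T):
            batch = rng.integers(N, size=batch_size)             # indices sampled with replacement
            g = np.mean([grad_fi(x, i) for i in batch], axis=0)  # averaged stochastic gradient
            x = x - step_size(t) * g
        return x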

25
Extension: Stochastic Averaging
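A hedged sketch in the spirit of the SAGA update from the papers listed under Further Reading below: keep a table of the most recently seen gradient of each $f_i$ and correct the current stochastic gradient with it, so each step still costs one new gradient evaluation while the variance of the search direction shrinks.

    import numpy as np

    def saga(grad_fi, x0, N, eta, T, seed=0):
        """SAGA-style variance reduction: store the last seen gradient of each summand."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float).copy()
        table = np.array([grad_fi(x, i) for i in range(N)])  # one full pass to initialize the table
        table_mean = table.mean(axis=0)
        for _ in range(T):
            j = rng.integers(N)
            g = grad_fi(x, j)
            direction = g - table[j] + table_mean            # unbiased, variance-reduced direction
            x = x - eta * direction
            table_mean += (g - table[j]) / N                 # keep the running mean consistent
            table[j] = g
        return x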

Further Reading

SAG: Schmidt, Mark, Nicolas Le Roux, and Francis Bach. "Minimizing finite sums with the stochastic average gradient." Mathematical Programming 162 (2017): 83-112.

SAGA: Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien. "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives." Advances in Neural Information Processing Systems 27 (2014).

Recent Review: Gower, Robert M., Mark Schmidt, Francis Bach, and Peter Richtárik. "Variance-reduced methods for machine learning." Proceedings of the IEEE 108, no. 11 (2020): 1968-1983.

Allen-Zhu, Zeyuan. "Katyusha: The First Direct Acceleration of Stochastic Gradient Methods." Journal of Machine Learning Research 18 (2018): 1-51.

Varre, Aditya, and Nicolas Flammarion. "Accelerated SGD for non-strongly-convex least squares." In Conference on Learning Theory, pp. 2062-2126. PMLR, 2022.

Hanzely, Filip, Konstantin Mishchenko, and Peter Richtárik. "SEGA: Variance reduction via gradient sketching." Advances in Neural Information Processing Systems 31 (2018).

Extension: Adaptive Step-sizes

AdaGrad: Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12, no. 7 (2011).

Adam: Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
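For reference, a hedged sketch of a diagonal AdaGrad-style update in the spirit of the Duchi et al. paper (eta and eps are assumed tuning constants; an illustration, not the paper's exact algorithm):

    import numpy as np

    def adagrad(grad_fi, x0, N, eta, T, eps=1e-8, seed=0):
        """AdaGrad-style SGD: per-coordinate steps scaled by accumulated squared gradients."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float).copy()
        accum = np.zeros_like(x)                 # running sum of squared gradients, per coordinate
        for _ in range(T):
            g = grad_fi(x, rng.integers(N))
            accum += g ** 2
            x = x - eta * g / (np.sqrt(accum) + eps)
        return x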

