
Mathematics for Machine Learning
Amit Chattopadhyay
IIIT-Bangalore

Module 1: Convex Optimization: Optimization Algorithms
5. Optimization Algorithms
Overview

f_0^* = \min_{x \in \mathbb{R}^n} f_0(x)

Overall Line-Search Algorithm (f_0, x_0 ∈ dom f_0, ε > 0)

1. Start with an initial candidate point x_0 ∈ R^n.
2. Generate a sequence of candidate points {x_k} (for k = 1, 2, ...) converging towards the actual minimum using the update rule

   x_{k+1} = x_k + s_k v_k,

   where the scalar s_k > 0 is called the stepsize and v_k ∈ R^n is the update direction.
3. Stop if the current solution meets the desired accuracy level ε.

Note: The behavior of the algorithm depends on the choice of the direction v_k and the stepsize s_k; a minimal sketch of this loop follows.
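A minimal Python sketch of this generic loop, assuming the gradient, a direction rule, and a stepsize rule are supplied as callables; the names line_search_minimize, direction_fn, and stepsize_fn are illustrative, not from the slides:

import numpy as np

def line_search_minimize(grad_f0, x0, direction_fn, stepsize_fn,
                         eps=1e-6, max_iter=1000):
    """Generic line-search loop: x_{k+1} = x_k + s_k * v_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f0(x)
        if np.linalg.norm(g) <= eps:    # stop once the desired accuracy is reached
            break
        v = direction_fn(x, g)          # update direction v_k
        s = stepsize_fn(x, v, g)        # stepsize s_k > 0
        x = x + s * v                   # update rule
    return x

With direction_fn returning -g and a constant stepsize_fn this reduces to plain gradient descent; the slides below refine both choices.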
Classification of Algorithms

First-order methods:
• Gradient descent method
• Subgradient method
• Proximal gradient descent
• Stochastic gradient descent
Second-order methods:
• Newton’s method
• Barrier method
• Primal-dual interior-point methods
• Quasi-Newton methods
• Proximal Newton method
First-order: Gradient Descent Method

The local rate of variation of f_0 along direction v_k:

\lim_{s \to 0} \frac{f_0(x_k + s v_k) - f_0(x_k)}{s} = \nabla f_0(x_k)^T v_k

Descent Directions: for any v_k with \nabla f_0(x_k)^T v_k < 0 (and a sufficiently small stepsize) we have

f_0(x_{k+1}) < f_0(x_k).

Steepest Descent Direction: the direction of maximum local decrease is

v_k = -\frac{\nabla f_0(x_k)}{\|\nabla f_0(x_k)\|}
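As a sketch under the same assumptions as above, the steepest-descent rule that could plug into the generic loop is just the normalized negative gradient (the function name is illustrative):

import numpy as np

def steepest_descent_direction(x, g):
    """Direction of maximum local decrease: v_k = -grad f0(x_k) / ||grad f0(x_k)||."""
    return -g / np.linalg.norm(g)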
First-order: Gradient Descent Method

Stepsize:
Restriction of f_0 along v_k: ϕ(s) = f_0(x_k + s v_k)
Goal: find s > 0 such that ϕ(s) < ϕ(0).
Exact Line Search: s* = arg min_{s ≥ 0} ϕ(s) (computationally expensive)
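Exact line search can be approximated numerically with a one-dimensional minimizer; this illustrative sketch uses scipy.optimize.minimize_scalar over a bounded interval (the bound s_max is an assumption, not from the slides):

from scipy.optimize import minimize_scalar

def exact_line_search(f0, x_k, v_k, s_max=10.0):
    """Approximate s* = argmin_{0 <= s <= s_max} phi(s), where phi(s) = f0(x_k + s * v_k)."""
    phi = lambda s: f0(x_k + s * v_k)
    return minimize_scalar(phi, bounds=(0.0, s_max), method="bounded").x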
Stepsize: Practical Approach

Armijo Condition
Valid step sizes must satisfy ϕ(s) ≤ ϕ(0) + s α δ_k, where δ_k = ∇f_0(x_k)^T v_k.
More explicitly, f_0(x_k + s v_k) ≤ f_0(x_k) + s α ∇f_0(x_k)^T v_k for a chosen α ∈ (0, 1).
Note: s̄ is the smallest s > 0 at which ϕ(s) and the line l̄(s) = ϕ(0) + s α δ_k cross; the Armijo condition is satisfied for all s ∈ (0, s̄).
Stepsize: Practical Approach

Backtracking Line Search

Require: f_0 differentiable, α ∈ (0, 1), β ∈ (0, 1), x_k ∈ dom f_0, v_k a descent direction, s_init a positive constant (typically s_init = 1)
1: Set s = s_init, δ_k = ∇f_0(x_k)^T v_k
2: If f_0(x_k + s v_k) ≤ f_0(x_k) + s α δ_k, then return s_k = s
3: Else set s ← β s and go to step 2.
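A minimal Python sketch of this procedure, assuming f_0 and its gradient are supplied as callables; the parameter defaults are illustrative:

def backtracking_line_search(f0, grad_f0, x_k, v_k,
                             alpha=0.3, beta=0.5, s_init=1.0):
    """Return a stepsize s satisfying the Armijo condition
    f0(x_k + s * v_k) <= f0(x_k) + s * alpha * delta_k."""
    s = s_init
    delta_k = grad_f0(x_k) @ v_k        # delta_k = grad f0(x_k)^T v_k, negative for a descent direction
    while f0(x_k + s * v_k) > f0(x_k) + s * alpha * delta_k:
        s *= beta                       # shrink the stepsize and test the condition again
    return s

With v_k = -grad_f0(x_k), this is the stepsize rule used in the convergence argument on the following slides.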
Stepsize: Lower Bound

Assumption: f_0 has a Lipschitz continuous gradient on S_0, i.e., ∃ L > 0 such that

\|\nabla f_0(x) - \nabla f_0(y)\|_2 \le L \|x - y\|_2, \quad \forall x, y \in S_0.

Lower bound on step-size: there exists a constant s_lb > 0 such that

s_k \ge s_{lb}, \quad \forall k = 0, 1, \ldots
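For the specific choice v_k = -∇f_0(x_k), a standard explicit value for this constant can be sketched from the descent lemma (this is not stated on the slide): the Armijo condition then holds for every s ≤ 2(1-α)/L, so backtracking started from s_init never returns a stepsize below

s_{lb} = \min\!\left(s_{init},\ \frac{2\beta(1-\alpha)}{L}\right).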
Convergence

Convergence to a stationary point: consider the iteration

x_{k+1} = x_k - s_k \nabla f_0(x_k)

with stepsizes s_k computed via backtracking line search, satisfying the Armijo condition:

f_0(x_{k+1}) \le f_0(x_k) - s_k \alpha \|\nabla f_0(x_k)\|_2^2

\implies f_0(x_k) - f_0(x_{k+1}) \ge s_k \alpha \|\nabla f_0(x_k)\|_2^2 \ge s_{lb}\,\alpha \|\nabla f_0(x_k)\|_2^2, \quad \forall k = 0, 1, \ldots

Summing over iterations (the left-hand side telescopes):

s_{lb}\,\alpha \sum_{i=0}^{k} \|\nabla f_0(x_i)\|_2^2 \le f_0(x_0) - f_0(x_{k+1}) \le f_0(x_0) - f_0^*

\implies \lim_{k \to \infty} \|\nabla f_0(x_k)\|_2 = 0 \quad (\text{the algorithm converges to a stationary point of } f_0)
Convergence

\sum_{i=0}^{k} \|\nabla f_0(x_i)\|_2^2 \ge (k + 1) \min_{i=0,\ldots,k} \|\nabla f_0(x_i)\|_2^2

\implies g_k^* = \min_{i=0,\ldots,k} \|\nabla f_0(x_i)\|_2 \le \frac{1}{\sqrt{1 + k}} \cdot \frac{1}{\sqrt{s_{lb}\,\alpha}} \sqrt{f_0(x_0) - f_0^*}

\implies g_k^* \propto \frac{1}{\sqrt{1 + k}}

The stopping criterion is set as \|\nabla f_0(x_k)\|_2 \le \varepsilon.
The exit condition is reached in at most

k_{max} = \left\lceil \frac{1}{\varepsilon^2} \cdot \frac{f_0(x_0) - f_0^*}{s_{lb}\,\alpha} \right\rceil

iterations.
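A hedged numerical illustration of this bound, with all values hypothetical: taking s_{lb}\alpha = 0.1, f_0(x_0) - f_0^* \le 100 and \varepsilon = 10^{-3} gives

k_{max} = \left\lceil \frac{1}{(10^{-3})^2} \cdot \frac{100}{0.1} \right\rceil = 10^{9},

which shows how the 1/\varepsilon^2 factor dominates the iteration count for small tolerances.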
Convergence: Convex Function

Assume f_0 is convex and has a Lipschitz continuous gradient. Then:

1. The gradient algorithm converges to the global minimum x^*.
2. f_0(x_k) \to f_0^* at a rate \propto \frac{1}{k}.
3. f_0(x_k) - f_0^* \le \varepsilon is reached in at most

k_{max} = \left\lceil \frac{\|x_0 - x^*\|_2^2}{2\,\varepsilon\, s_{lb}} \right\rceil

iterations.
Second-order: Newton’s Algorithm

x_{k+1} = x_k - s_k [\nabla^2 f_0(x_k)]^{-1} \nabla f_0(x_k), \quad k = 0, 1, \ldots

• The update direction is v_k = -[\nabla^2 f_0(x_k)]^{-1} \nabla f_0(x_k)
• s_k can be found by the backtracking algorithm (see the sketch below)
• Particularly useful for minimizing strongly convex functions, since \nabla^2 f_0(x) \succeq mI for all x with m > 0, so the Hessian is invertible everywhere
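A minimal sketch of one damped Newton update, assuming the Hessian is available as a callable and reusing a backtracking stepsize; names and defaults are illustrative:

import numpy as np

def newton_step(f0, grad_f0, hess_f0, x_k, alpha=0.3, beta=0.5, s_init=1.0):
    """One damped Newton update x_{k+1} = x_k + s_k * v_k,
    with v_k = -[hess f0(x_k)]^{-1} grad f0(x_k) and s_k from backtracking."""
    g = grad_f0(x_k)
    H = hess_f0(x_k)
    v = -np.linalg.solve(H, g)          # Newton direction: solve H v = -g instead of forming the inverse
    s, delta_k = s_init, g @ v          # delta_k < 0 when the Hessian is positive definite
    while f0(x_k + s * v) > f0(x_k) + s * alpha * delta_k:
        s *= beta                       # backtracking until the Armijo condition holds
    return x_k + s * v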
Second-order: Newton’s Algorithm

• Obtained by minimizing the second-order Taylor approximation of f_0 at x_k:

f_0(x) \simeq f_q^{(k)}(x) = f_0(x_k) + \nabla f_0(x_k)^T (x - x_k) + \frac{1}{2}(x - x_k)^T \nabla^2 f_0(x_k)(x - x_k)

• \lambda_k^2 = \nabla f_0(x_k)^T [\nabla^2 f_0(x_k)]^{-1} \nabla f_0(x_k) (the Newton decrement)
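Setting the gradient of this quadratic model to zero recovers the (undamped, s_k = 1) Newton update, a one-line check not shown on the slide:

\nabla f_q^{(k)}(x) = \nabla f_0(x_k) + \nabla^2 f_0(x_k)(x - x_k) = 0
\implies x = x_k - [\nabla^2 f_0(x_k)]^{-1} \nabla f_0(x_k).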
Stochastic Gradient Descent Method (SGD)

Consider minimizing an average of functions:

\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x)

• Ordinary gradient descent: x_{k+1} = x_k - s_k \frac{1}{m} \sum_{i=1}^{m} \nabla f_i(x_k) (expensive when m is large)
• Stochastic gradient descent: x_{k+1} = x_k - s_k \nabla f_{i_k}(x_k), where i_k ∈ {1, ..., m} is chosen uniformly at random
• E[\nabla f_{i_k}(x)] = \nabla f(x): SGD uses an unbiased estimate of the gradient at each step
• The iteration cost is independent of m (the number of functions); see the sketch below
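A minimal Python sketch of the SGD update, assuming the component gradients ∇f_i are supplied as a callable grad_fi(x, i) with a 0-based index, and using a constant stepsize; both choices are illustrative:

import numpy as np

def sgd(grad_fi, m, x0, stepsize=0.01, n_iters=1000, seed=0):
    """Stochastic gradient descent x_{k+1} = x_k - s_k * grad f_{i_k}(x_k),
    with i_k drawn uniformly from {0, ..., m-1}."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        i_k = rng.integers(m)               # uniform index: E[grad f_{i_k}(x)] = grad f(x)
        x = x - stepsize * grad_fi(x, i_k)  # per-iteration cost independent of m
    return x

In practice the stepsize is typically decreased over the iterations; a constant value is used here only to keep the sketch short.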
