
Convex Optimization

Stephen Boyd Lieven Vandenberghe

Revised slides by Stephen Boyd, Lieven Vandenberghe, and Parth Nobel


9. Unconstrained minimization
Outline

Terminology and assumptions

Gradient descent method

Steepest descent method

Newton’s method

Self-concordant functions

Implementation

Convex Optimization Boyd and Vandenberghe 9.1


Unconstrained minimization

▶ unconstrained minimization problem

minimize f (x)
▶ we assume
– f convex, twice continuously differentiable (hence dom f open)
– optimal value p★ = inf x f (x) is attained at x★ (not necessarily unique)

▶ optimality condition is ∇f (x) = 0

▶ minimizing f is the same as solving ∇f (x) = 0

▶ a set of n equations with n unknowns

Convex Optimization Boyd and Vandenberghe 9.2


Quadratic functions

▶ convex quadratic: f (x) = (1/2)xT Px + qT x + r, P ⪰ 0

▶ we can solve exactly via linear equations

∇f (x) = Px + q = 0

▶ much more on this special case later
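The point above can be sketched in a few lines of NumPy; the data P, q below is illustrative, not from the slides:

```python
import numpy as np

# Convex quadratic f(x) = (1/2) x^T P x + q^T x + r with P > 0 (illustrative data):
# the minimizer is found by solving the linear equations P x = -q.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # symmetric positive definite
q = np.array([1.0, -1.0])

x_star = np.linalg.solve(P, -q)       # solves grad f(x) = P x + q = 0
assert np.allclose(P @ x_star + q, 0.0)
```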

Convex Optimization Boyd and Vandenberghe 9.3


Iterative methods

▶ for most non-quadratic functions, we use iterative methods


▶ these produce a sequence of points x (k) ∈ dom f , k = 0, 1, . . .

▶ x (0) is the initial point or starting point


▶ x (k) is the kth iterate
▶ we hope that the method converges, i.e.,

f (x (k) ) → p★, ∇f (x (k) ) → 0

Convex Optimization Boyd and Vandenberghe 9.4


Initial point and sublevel set

▶ algorithms in this chapter require a starting point x (0) such that


– x (0) ∈ dom f
– sublevel set S = {x | f (x) ≤ f (x (0) )} is closed

▶ 2nd condition is hard to verify, except when all sublevel sets are closed
– equivalent to condition that epi f is closed
– true if dom f = Rn
– true if f (x) → ∞ as x → bd dom f

▶ examples of differentiable functions with closed sublevel sets:


    f (x) = log ( ∑_{i=1}^m exp(aiT x + bi ) ),     f (x) = − ∑_{i=1}^m log(bi − aTi x)

Convex Optimization Boyd and Vandenberghe 9.5


Strong convexity and implications
▶ f is strongly convex on S if there exists an m > 0 such that

∇2 f (x) ⪰ mI for all x ∈ S

▶ same as f (x) − (m/2)∥x∥_2^2 being convex

▶ if f is strongly convex, for x, y ∈ S,

    f (y) ≥ f (x) + ∇f (x)^T (y − x) + (m/2)∥y − x∥_2^2

▶ hence, S is bounded
▶ we conclude p★ > −∞, and for x ∈ S,

    f (x) − p★ ≤ (1/(2m)) ∥∇f (x)∥_2^2
▶ useful as stopping criterion (if you know m, which usually you do not)
Convex Optimization Boyd and Vandenberghe 9.6
Outline

Terminology and assumptions

Gradient descent method

Steepest descent method

Newton’s method

Self-concordant functions

Implementation

Convex Optimization Boyd and Vandenberghe 9.7


Descent methods

▶ descent methods generate iterates as

x (k+1) = x (k) + t (k) Δx (k)

with f (x (k+1) ) < f (x (k) ) (hence the name)


▶ other notations: x+ = x + tΔx, x := x + tΔx

▶ Δx (k) is the step, or search direction


▶ t (k) > 0 is the step size, or step length
▶ from convexity, f (x+ ) < f (x) implies ∇f (x) T Δx < 0
▶ this means Δx is a descent direction

Convex Optimization Boyd and Vandenberghe 9.8


Generic descent method

General descent method.


given a starting point x ∈ dom f .
repeat
1. Determine a descent direction Δx.
2. Line search. Choose a step size t > 0.
3. Update. x := x + tΔx.
until stopping criterion is satisfied.

Convex Optimization Boyd and Vandenberghe 9.9


Line search types

▶ exact line search: t = argmint>0 f (x + tΔx)

▶ backtracking line search (with parameters 𝛼 ∈ (0, 1/2) , 𝛽 ∈ (0, 1) )


– starting at t = 1, repeat t := 𝛽t until f (x + tΔx) < f (x) + 𝛼t∇f (x) T Δx

▶ graphical interpretation: reduce t (i.e., backtrack) until t ≤ t0

[figure: f (x + tΔx) versus t, with the lines f (x) + t∇f (x)^T Δx and f (x) + 𝛼t∇f (x)^T Δx; backtracking stops once t ≤ t0]
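In code, the backtracking rule is a short loop. The sketch below follows the slide's condition with parameters 𝛼, 𝛽; the quadratic test function is illustrative, not from the slides:

```python
import numpy as np

# Backtracking line search with alpha in (0, 1/2), beta in (0, 1).
def backtracking(f, grad, x, dx, alpha=0.3, beta=0.8):
    t = 1.0
    # shrink t until the sufficient-decrease condition holds
    while f(x + t * dx) > f(x) + alpha * t * grad(x) @ dx:
        t *= beta
    return t

# illustrative test: f(x) = x^T x, descent direction -grad f(x)
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, -2.0])
t = backtracking(f, grad, x, -grad(x))
assert f(x - t * grad(x)) < f(x)   # the accepted step decreases f
```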

Convex Optimization Boyd and Vandenberghe 9.10


Gradient descent method
▶ general descent method with Δx = −∇f (x)

given a starting point x ∈ dom f .


repeat
1. Δx := −∇f (x) .
2. Line search. Choose step size t via exact or backtracking line search.
3. Update. x := x + tΔx.
until stopping criterion is satisfied.

▶ stopping criterion usually of the form ∥∇f (x) ∥ 2 ≤ 𝜖


▶ convergence result: for strongly convex f ,

f (x (k) ) − p★ ≤ ck (f (x (0) ) − p★)

c ∈ (0, 1) depends on m, x (0) , line search type


▶ very simple, but can be very slow
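As an illustration, a minimal NumPy sketch of the method above, with backtracking line search; the strongly convex test function is illustrative, not from the slides:

```python
import numpy as np

# Gradient descent with backtracking line search and gradient-norm stopping rule.
def gradient_descent(f, grad, x, eps=1e-6, alpha=0.3, beta=0.8, max_iter=1000):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:   # stopping criterion ||grad f(x)||_2 <= eps
            break
        dx = -g                        # descent direction
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                  # backtracking line search
        x = x + t * dx
    return x

# strongly convex quadratic with minimizer at the origin (gamma = 10)
f = lambda x: 0.5 * (x[0] ** 2 + 10 * x[1] ** 2)
grad = lambda x: np.array([x[0], 10 * x[1]])
x_star = gradient_descent(f, grad, np.array([10.0, 1.0]))
assert np.linalg.norm(x_star) < 1e-5
```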
Convex Optimization Boyd and Vandenberghe 9.11
Example: Quadratic function on R2

▶ take f (x) = (1/2) (x1^2 + 𝛾x2^2 ), with 𝛾 > 0


▶ with exact line search, starting at x (0) = (𝛾, 1) :

    x1 (k) = 𝛾 ( (𝛾 − 1)/(𝛾 + 1) )^k ,     x2 (k) = ( −(𝛾 − 1)/(𝛾 + 1) )^k

– very slow if 𝛾 ≫ 1 or 𝛾 ≪ 1
– example for 𝛾 = 10 at right; the iterates x (0) , x (1) , . . . oscillate across the valley (called zig-zagging)

[figure: contour plot with zig-zagging iterates, x1 ∈ [−10, 10], x2 ∈ [−4, 4]]

Convex Optimization Boyd and Vandenberghe 9.12


Example: Nonquadratic function on R2

▶ f (x1 , x2 ) = e^{x1 + 3x2 − 0.1} + e^{x1 − 3x2 − 0.1} + e^{−x1 − 0.1}

[figure: iterates x (0) , x (1) , x (2) for backtracking line search (left) and exact line search (right)]

Convex Optimization Boyd and Vandenberghe 9.13


Example: A problem in R100

▶ f (x) = cT x − ∑_{i=1}^{500} log(bi − aTi x)

[figure: f (x (k) ) − p★ versus k (semilog scale, 10^−4 to 10^4) for exact and backtracking line search, k = 0, . . . , 200]

▶ linear convergence, i.e., a straight line on a semilog plot

Convex Optimization Boyd and Vandenberghe 9.14


Outline

Terminology and assumptions

Gradient descent method

Steepest descent method

Newton’s method

Self-concordant functions

Implementation

Convex Optimization Boyd and Vandenberghe 9.15


Steepest descent method

▶ normalized steepest descent direction (at x, for norm ∥ · ∥ ):

Δxnsd = argmin{∇f (x) T v | ∥v∥ = 1}

▶ interpretation: for small v, f (x + v) ≈ f (x) + ∇f (x) T v;

▶ direction Δxnsd is unit-norm step with most negative directional derivative

▶ (unnormalized) steepest descent direction: Δxsd = ∥∇f (x) ∥ ∗ Δxnsd

▶ satisfies ∇f (x) T Δxsd = −∥∇f (x) ∥_∗^2

▶ steepest descent method


– general descent method with Δx = Δxsd
– convergence properties similar to gradient descent

Convex Optimization Boyd and Vandenberghe 9.16


Examples

▶ Euclidean norm: Δxsd = −∇f (x)


▶ quadratic norm ∥x∥ P = (xT Px)^{1/2} (P ∈ Sn++ ): Δxsd = −P^{−1} ∇f (x)
▶ ℓ1 -norm: Δxsd = −(𝜕f (x)/𝜕xi )ei , where |𝜕f (x)/𝜕xi | = ∥∇f (x) ∥ ∞
▶ unit balls and normalized steepest descent directions for the quadratic norm and the ℓ1 -norm:

[figure: unit ball of each norm, with −∇f (x) and Δxnsd marked]
Convex Optimization Boyd and Vandenberghe 9.17


Choice of norm for steepest descent

[figure: iterates x (0) , x (1) , x (2) of steepest descent for two different quadratic norms]

▶ steepest descent with backtracking line search for two quadratic norms
▶ ellipses show {x | ∥x − x (k) ∥ P = 1}
▶ interpretation of steepest descent with quadratic norm ∥ · ∥ P : gradient descent after change of variables x̄ = P^{1/2} x
▶ shows choice of P has strong effect on speed of convergence

Convex Optimization Boyd and Vandenberghe 9.18


Outline

Terminology and assumptions

Gradient descent method

Steepest descent method

Newton’s method

Self-concordant functions

Implementation

Convex Optimization Boyd and Vandenberghe 9.19


Newton step

▶ Newton step is Δxnt = −∇2 f (x) −1 ∇f (x)

▶ interpretation: x + Δxnt minimizes second order approximation

    f̂ (x + v) = f (x) + ∇f (x)^T v + (1/2) v^T ∇^2 f (x) v

[figure: graph of f and its quadratic model f̂, with (x, f (x)) and (x + Δxnt , f (x + Δxnt )) marked]

Convex Optimization Boyd and Vandenberghe 9.20


Another interpretation

▶ x + Δxnt solves linearized optimality condition

    ∇f (x + v) ≈ ∇f̂ (x + v) = ∇f (x) + ∇^2 f (x) v = 0

[figure: f ′ and its linearization f̂ ′, with (x, f ′ (x)) and (x + Δxnt , f ′ (x + Δxnt )) marked]

Convex Optimization Boyd and Vandenberghe 9.21


And one more interpretation

▶ Δxnt is the steepest descent direction at x in the local Hessian norm ∥u∥_{∇^2 f (x)} = ( uT ∇^2 f (x) u )^{1/2}

[figure: points x + Δxnsd and x + Δxnt relative to the Hessian-norm ellipse]

▶ dashed lines are contour lines of f ; ellipse is {x + v | vT ∇2 f (x)v = 1}


▶ arrow shows −∇f (x)

Convex Optimization Boyd and Vandenberghe 9.22


Newton decrement
▶ Newton decrement is 𝜆(x) = ( ∇f (x)^T ∇^2 f (x)^{−1} ∇f (x) )^{1/2}
▶ a measure of the proximity of x to x★
▶ gives an estimate of f (x) − p★, using quadratic approximation f̂ :

    f (x) − inf_y f̂ (y) = (1/2) 𝜆(x)^2

▶ equal to the norm of the Newton step in the quadratic Hessian norm:

    𝜆(x) = ( Δxnt^T ∇^2 f (x) Δxnt )^{1/2}

▶ directional derivative in the Newton direction: ∇f (x)^T Δxnt = −𝜆(x)^2


▶ affine invariant (unlike ∥∇f (x) ∥ 2 )

Convex Optimization Boyd and Vandenberghe 9.23


Newton’s method

given a starting point x ∈ dom f , tolerance 𝜖 > 0.


repeat
1. Compute the Newton step and decrement.
Δxnt := −∇2 f (x) −1 ∇f (x) ; 𝜆2 := ∇f (x) T ∇2 f (x) −1 ∇f (x) .
2. Stopping criterion. quit if 𝜆2 /2 ≤ 𝜖 .
3. Line search. Choose step size t by backtracking line search.
4. Update. x := x + tΔxnt .

▶ affine invariant, i.e., independent of linear changes of coordinates


▶ Newton iterates for f̃ (y) = f (Ty) with starting point y (0) = T −1 x (0) are y (k) = T −1 x (k)
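As an illustration, a minimal NumPy sketch of the algorithm in the box above; the smooth, strictly convex test function is illustrative, not from the slides:

```python
import numpy as np

# Newton's method with backtracking line search, following the box above.
def newton(f, grad, hess, x, eps=1e-10, alpha=0.1, beta=0.7, max_iter=50):
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)      # Newton step dx_nt
        lam_sq = -g @ dx                 # Newton decrement squared: g^T H^{-1} g
        if lam_sq / 2 <= eps:            # stopping criterion
            break
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                    # backtracking line search
        x = x + t * dx
    return x

# test function with minimizer at the origin
f = lambda x: np.exp(x[0]) + np.exp(-x[0]) + x[1] ** 2
grad = lambda x: np.array([np.exp(x[0]) - np.exp(-x[0]), 2 * x[1]])
hess = lambda x: np.diag([np.exp(x[0]) + np.exp(-x[0]), 2.0])

x_star = newton(f, grad, hess, np.array([1.0, 2.0]))
assert np.linalg.norm(x_star) < 1e-6
```

With these few iterations the decrement-based stopping rule is reached quickly, consistent with the quadratic local convergence discussed below.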

Convex Optimization Boyd and Vandenberghe 9.24


Classical convergence analysis

assumptions
▶ f strongly convex on S with constant m
▶ ∇2 f is Lipschitz continuous on S, with constant L > 0:

∥∇2 f (x) − ∇2 f (y) ∥ 2 ≤ L∥x − y∥ 2

(L measures how well f can be approximated by a quadratic function)

outline: there exist constants 𝜂 ∈ (0, m2 /L) , 𝛾 > 0 such that


▶ if ∥∇f (x) ∥ 2 ≥ 𝜂, then f (x (k+1) ) − f (x (k) ) ≤ −𝛾
▶ if ∥∇f (x) ∥ 2 < 𝜂, then

    (L/(2m^2)) ∥∇f (x (k+1) ) ∥ 2 ≤ ( (L/(2m^2)) ∥∇f (x (k) ) ∥ 2 )^2

Convex Optimization Boyd and Vandenberghe 9.25


Classical convergence analysis

damped Newton phase ( ∥∇f (x) ∥ 2 ≥ 𝜂)


▶ most iterations require backtracking steps
▶ function value decreases by at least 𝛾
▶ if p★ > −∞, this phase ends after at most (f (x (0) ) − p★)/𝛾 iterations

quadratically convergent phase ( ∥∇f (x) ∥ 2 < 𝜂)


▶ all iterations use step size t = 1
▶ ∥∇f (x) ∥ 2 converges to zero quadratically: if ∥∇f (x (k) ) ∥ 2 < 𝜂, then for l ≥ k

    (L/(2m^2)) ∥∇f (x (l) ) ∥ 2 ≤ ( (L/(2m^2)) ∥∇f (x (k) ) ∥ 2 )^{2^{l−k}} ≤ (1/2)^{2^{l−k}}

Convex Optimization Boyd and Vandenberghe 9.26


conclusion: number of iterations until f (x) − p★ ≤ 𝜖 is bounded above by

    (f (x (0) ) − p★)/𝛾 + log2 log2 (𝜖0 /𝜖)

▶ 𝛾 , 𝜖0 are constants that depend on m, L, x (0)


▶ second term is small (of the order of 6) and almost constant for practical purposes
▶ in practice, constants m, L (hence 𝛾 , 𝜖0 ) are usually unknown
▶ provides qualitative insight in convergence properties (i.e., explains two algorithm phases)

Convex Optimization Boyd and Vandenberghe 9.27


Example: R2
(same problem as slide 9.13)

[figure: f (x (k) ) − p★ versus k = 0, . . . , 5 (semilog scale, 10^−15 to 10^5), with iterates x (0) , x (1) shown on the contour plot]

▶ backtracking parameters 𝛼 = 0.1, 𝛽 = 0.7


▶ converges in only 5 steps
▶ quadratic local convergence

Convex Optimization Boyd and Vandenberghe 9.28


Example in R100
(same problem as page 9.14)
[figure: left, f (x (k) ) − p★ versus k (semilog scale, 10^−15 to 10^5) for exact line search and backtracking, k = 0, . . . , 10; right, step size t (k) versus k for both line searches]

▶ backtracking parameters 𝛼 = 0.01, 𝛽 = 0.5


▶ backtracking line search almost as fast as exact l.s. (and much simpler)
▶ clearly shows two phases in algorithm
Convex Optimization Boyd and Vandenberghe 9.29
Example in R10000
(with sparse ai )
    f (x) = − ∑_{i=1}^{10000} log(1 − xi^2 ) − ∑_{i=1}^{100000} log(bi − aTi x)

[figure: f (x (k) ) − p★ versus k = 0, . . . , 20 (semilog scale, 10^−5 to 10^5)]

▶ backtracking parameters 𝛼 = 0.01, 𝛽 = 0.5.


▶ performance similar to that for the small examples
Convex Optimization Boyd and Vandenberghe 9.30
Outline

Terminology and assumptions

Gradient descent method

Steepest descent method

Newton’s method

Self-concordant functions

Implementation

Convex Optimization Boyd and Vandenberghe 9.31


Self-concordance

shortcomings of classical convergence analysis


▶ depends on unknown constants (m, L, . . . )
▶ bound is not affinely invariant, although Newton’s method is

convergence analysis via self-concordance (Nesterov and Nemirovski)


▶ does not depend on any unknown constants
▶ gives affine-invariant bound
▶ applies to special class of convex functions (‘self-concordant’ functions)
▶ developed to analyze polynomial-time interior-point methods for convex optimization

Convex Optimization Boyd and Vandenberghe 9.32


Self-concordant functions

definition
▶ convex f : R → R is self-concordant if |f ′′′ (x)| ≤ 2f ′′ (x)^{3/2} for all x ∈ dom f
▶ f : Rn → R is self-concordant if g(t) = f (x + tv) is self-concordant for all x ∈ dom f , v ∈ Rn

examples on R
▶ linear and quadratic functions
▶ negative logarithm f (x) = − log x
▶ negative entropy plus negative logarithm: f (x) = x log x − log x
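As a quick numerical check of the definition, for f (x) = − log x the bound |f ′′′ (x)| ≤ 2f ′′ (x)^{3/2} holds with equality (this short script is illustrative, not from the slides):

```python
# For f(x) = -log x: f''(x) = 1/x^2 and f'''(x) = -2/x^3, so
# |f'''(x)| = 2 (f''(x))^{3/2}: the self-concordance bound holds with equality.
for x in [0.1, 1.0, 7.5]:
    f2 = 1.0 / x ** 2
    f3 = -2.0 / x ** 3
    assert abs(abs(f3) - 2.0 * f2 ** 1.5) < 1e-12
```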

affine invariance: if f : R → R is s.c., then f̃ (y) = f (ay + b) is s.c.:

f̃ ′′′ (y) = a^3 f ′′′ (ay + b), f̃ ′′ (y) = a^2 f ′′ (ay + b)

Convex Optimization Boyd and Vandenberghe 9.33


Self-concordant calculus

properties
▶ preserved under positive scaling 𝛼 ≥ 1, and sum
▶ preserved under composition with affine function
▶ if g is convex with dom g = R++ and |g′′′ (x)| ≤ 3g′′ (x)/x then

f (x) = log(−g(x)) − log x

is self-concordant

examples: properties can be used to show that the following are s.c.
▶ f (x) = − ∑_{i=1}^m log(bi − aTi x) on {x | aTi x < bi , i = 1, . . . , m}
▶ f (X) = − log det X on Sn++
▶ f (x) = − log(y^2 − xT x) on {(x, y) | ∥x∥ 2 < y}

Convex Optimization Boyd and Vandenberghe 9.34


Convergence analysis for self-concordant functions

summary: there exist constants 𝜂 ∈ (0, 1/4] , 𝛾 > 0 such that


▶ if 𝜆(x) > 𝜂, then
f (x (k+1) ) − f (x (k) ) ≤ −𝛾
▶ if 𝜆(x) ≤ 𝜂, then

    2𝜆(x (k+1) ) ≤ ( 2𝜆(x (k) ) )^2

(𝜂 and 𝛾 only depend on backtracking parameters 𝛼, 𝛽)

complexity bound: number of Newton iterations bounded by

    (f (x (0) ) − p★)/𝛾 + log2 log2 (1/𝜖)

for 𝛼 = 0.1, 𝛽 = 0.8, 𝜖 = 10 −10 , bound evaluates to 375(f (x (0) ) − p★) + 6

Convex Optimization Boyd and Vandenberghe 9.35


Numerical example
150 randomly generated instances of

    minimize f (x) = − ∑_{i=1}^m log(bi − aTi x)

[figure: number of Newton iterations (0–25) versus f (x (0) ) − p★ (0–35); ◦: m = 100, n = 50; □: m = 1000, n = 500; ^: m = 1000, n = 50]
▶ number of iterations much smaller than 375(f (x (0) ) − p★) + 6
▶ bound of the form c(f (x (0) ) − p★) + 6 with smaller c (empirically) valid
Convex Optimization Boyd and Vandenberghe 9.36
Outline

Terminology and assumptions

Gradient descent method

Steepest descent method

Newton’s method

Self-concordant functions

Implementation

Convex Optimization Boyd and Vandenberghe 9.37


Implementation

main effort in each iteration: evaluate derivatives and solve Newton system

HΔx = −g

where H = ∇2 f (x) , g = ∇f (x)

via Cholesky factorization

H = LL^T , Δxnt = −L^{−T} L^{−1} g, 𝜆(x) = ∥L^{−1} g∥ 2

▶ cost (1/3)n3 flops for unstructured system


▶ cost ≪ (1/3)n3 if H sparse, banded
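The factorization-based solve above is a few lines of NumPy; H and g below are illustrative stand-ins for the Hessian and gradient:

```python
import numpy as np

# Solve the Newton system H dx = -g via Cholesky, as on this slide.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])                  # stands in for the Hessian
g = np.array([1.0, 2.0])                    # stands in for the gradient

L = np.linalg.cholesky(H)                   # H = L L^T
y = np.linalg.solve(L, g)                   # forward substitution: L y = g
dx_nt = -np.linalg.solve(L.T, y)            # back substitution: L^T dx = -y
lam = np.linalg.norm(y)                     # lambda(x) = ||L^{-1} g||_2

assert np.allclose(H @ dx_nt, -g)           # dx_nt solves the Newton system
assert np.isclose(lam ** 2, g @ np.linalg.solve(H, g))   # lambda^2 = g^T H^{-1} g
```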

Convex Optimization Boyd and Vandenberghe 9.38


example of dense Newton system with structure
    f (x) = ∑_{i=1}^n 𝜓i (xi ) + 𝜓0 (Ax + b), H = D + AT H0 A

▶ assume A ∈ Rp×n , dense, with p ≪ n


▶ D diagonal with diagonal elements 𝜓i′′ (xi ) ; H0 = ∇2 𝜓0 (Ax + b)

method 1: form H , solve via dense Cholesky factorization: (cost (1/3)n3 )


method 2 (page ??): factor H0 = L0 L0T ; write Newton system as

DΔx + AT L0 w = −g, L0T AΔx − w = 0

eliminate Δx from first equation; compute w and Δx from

(I + L0T AD−1 AT L0 )w = −L0T AD−1 g, DΔx = −g − AT L0 w

cost: 2p2 n (dominated by computation of L0T AD −1 AT L0 )
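A NumPy sketch of method 2, with small illustrative dimensions (p ≪ n); the random data is not from the slides:

```python
import numpy as np

# Block elimination for H = D + A^T H0 A (method 2 on this slide).
rng = np.random.default_rng(0)
n, p = 8, 2
A = rng.standard_normal((p, n))
d = rng.uniform(1.0, 2.0, n)                # positive diagonal of D
H0 = np.array([[3.0, 0.5], [0.5, 2.0]])     # small positive definite block
g = rng.standard_normal(n)

L0 = np.linalg.cholesky(H0)                 # H0 = L0 L0^T
B = A.T @ L0                                # so H = D + B B^T
# solve the small p x p system for w, then recover dx from D dx = -g - B w
S = np.eye(p) + B.T @ (B / d[:, None])      # I + L0^T A D^{-1} A^T L0
w = np.linalg.solve(S, -B.T @ (g / d))
dx = (-g - B @ w) / d

H = np.diag(d) + A.T @ H0 @ A               # dense H, for checking only
assert np.allclose(H @ dx, -g)              # matches the dense solve
```

Forming S costs roughly 2p^2 n flops, which dominates when p ≪ n, matching the cost noted above.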

Convex Optimization Boyd and Vandenberghe 9.39
