
CS 726: Nonlinear Optimization 1

Lecture 3: Differentiability

Michael C. Ferris

Computer Sciences Department


University of Wisconsin-Madison

January 29, 2021



Background Material

• Since we spent part of the lecture on the Background Quiz, I would like you to review the whole of [Wright and Recht(2020), Chapter 2].
• For the definitions of local and global minimizers, see [Wright and Recht(2020), Section 2.1].
• Reading through this additional information, and the Bertsekas material mentioned last time, may be helpful.



Taxonomy of Solutions

2.1 A Taxonomy of Solutions to Optimization Problems

Before we can begin designing algorithms, we must determine what it means to solve an optimization problem. Suppose that f is a function mapping some domain D ⊂ R^n to the real line R. We have the following definitions.
• x* ∈ D is a local minimizer of f if there is a neighborhood N of x* such that f(x) ≥ f(x*) for all x ∈ N ∩ D.
• x* ∈ D is a global minimizer of f if f(x) ≥ f(x*) for all x ∈ D.
• x* ∈ D is a strict local minimizer if it is a local minimizer for some neighborhood N of x*, and in addition f(x) > f(x*) for all x ∈ N with x ≠ x*.
• x* is an isolated local minimizer if there is a neighborhood N of x* such that f(x) ≥ f(x*) for all x ∈ N ∩ D and, in addition, N contains no local minimizers other than x*.
• x* is the unique minimizer if it is the only global minimizer.

For the constrained optimization problem

    min_{x ∈ Ω} f(x),    (2.1)

where Ω ⊂ D ⊂ R^n is a closed set, we modify the terminology slightly to use the word “solution” rather than “minimizer.” That is, we have the analogous definitions of local solution and global solution.
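
To make the taxonomy concrete, here is a minimal numerical sketch (the test function and the grid search are my own illustration, not from the lecture or the text): f(x) = x^4 − 3x² + x has a global minimizer near x ≈ −1.30 and a strict local minimizer near x ≈ 1.13 that is not global.

```python
import numpy as np

# Hypothetical 1-D example: f has a local minimizer that is not global.
f = lambda x: x**4 - 3 * x**2 + x

# Coarse grid search to locate the two basins (illustration only).
xs = np.linspace(-3.0, 3.0, 100_001)
vals = f(xs)

x_glob = xs[np.argmin(vals)]            # global minimizer (approx. -1.30)
print(f"global minimizer ~ {x_glob:.3f}, f = {f(x_glob):.3f}")

# Restricting to x > 0 isolates the other basin; its bottom is a strict
# local (but not global) minimizer of f.
pos = xs > 0
x_loc = xs[pos][np.argmin(vals[pos])]   # local minimizer (approx. 1.13)
print(f"local minimizer  ~ {x_loc:.3f}, f = {f(x_loc):.3f}")
```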



Taylor’s Theorem and Differentiability

Convention
f : R^n → R.
  – Df(x) is a 1 × n row vector.
  – ∇f(x) = [Df(x)]^T (a column vector).
g : R^n → R^m.
  – Dg(x) ∈ R^{m×n}.
  – [Dg(x)]^T = ∇g(x) ∈ R^{n×m}.
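
As a quick sanity check on these shape conventions, here is a small finite-difference sketch (the helper Df and the test function are my own hypothetical choices, not course code):

```python
import numpy as np

def Df(f, x, h=1e-6):
    """Numerical derivative of f : R^n -> R, as a 1 x n row vector."""
    n = x.size
    row = [(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(n)]
    return np.array(row).reshape(1, n)

f = lambda x: x[0]**2 + 3 * x[0] * x[1]   # f : R^2 -> R
x = np.array([1.0, 2.0])

D = Df(f, x)     # row vector, shape (1, 2)
grad = D.T       # gradient = transpose, a column vector of shape (2, 1)
print(D.shape, grad.shape)   # (1, 2) (2, 1)
print(grad.ravel())          # ~ [8. 3.], matching [2*x0 + 3*x1, 3*x0]
```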



Theorem (First order Taylor)
([Wright and Recht(2020), Theorem 2.1])
Let f ∈ C^1. Then
1. f(x + p) = f(x) + ∇f(x + γp)^T p for some γ ∈ (0, 1];
2. f(x + p) = f(x) + ∇f(x)^T p + o(‖p‖), where lim_{t↓0} o(t)/t = 0;
3. f(x + p) = f(x) + ∫_0^1 ∇f(x + γp)^T p dγ.
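
A short numerical check of item 2 (the test function and direction are my own choices): the remainder f(x + p) − f(x) − ∇f(x)^T p, divided by ‖p‖, should tend to zero as ‖p‖ ↓ 0.

```python
import numpy as np

# f(x) = exp(x1) + sin(x2), with its gradient written out by hand.
f = lambda x: np.exp(x[0]) + np.sin(x[1])
grad = lambda x: np.array([np.exp(x[0]), np.cos(x[1])])

x = np.array([0.5, 1.0])
d = np.array([1.0, -2.0]) / np.sqrt(5.0)   # fixed unit-norm direction

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    p = t * d                               # so ||p|| = t
    rem = f(x + p) - f(x) - grad(x) @ p
    print(f"||p|| = {t:.0e}, remainder/||p|| = {rem / t: .3e}")
# The ratios shrink roughly linearly in ||p||, consistent with o(||p||).
```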



Theorem (Second order Taylor)
([Wright and Recht(2020), Theorem 2.1])
Let f ∈ C^2. Then
1. f(x + p) = f(x) + ∇f(x)^T p + (1/2) p^T ∇²f(x + γp) p for some γ ∈ (0, 1];
2. f(x + p) = f(x) + ∇f(x)^T p + (1/2) p^T ∇²f(x) p + o(‖p‖²);
3. ∇f(x + p) = ∇f(x) + ∫_0^1 ∇²f(x + γp) p dγ.
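
The same experiment adapted to the second-order expansion (again with a test function of my own choosing): the remainder should now vanish faster than ‖p‖².

```python
import numpy as np

f = lambda x: np.exp(x[0]) + np.sin(x[1])
grad = lambda x: np.array([np.exp(x[0]), np.cos(x[1])])
hess = lambda x: np.diag([np.exp(x[0]), -np.sin(x[1])])  # Hessian by hand

x = np.array([0.5, 1.0])
d = np.array([1.0, -2.0]) / np.sqrt(5.0)

for t in [1e-1, 1e-2, 1e-3]:
    p = t * d
    quad = f(x) + grad(x) @ p + 0.5 * p @ hess(x) @ p  # 2nd-order model
    print(f"||p|| = {t:.0e}, remainder/||p||^2 = {(f(x + p) - quad) / t**2: .3e}")
```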



Theorem (First Order Sufficiency Condition)
Let f : R^n → R̄ (an extended real-valued function). Suppose f is convex and x̄ is a local minimizer of f, i.e., f(x̄) = min_{x ∈ N} f(x) for some neighborhood N of x̄. Then x̄ is a global minimizer of f over R^n.



Proof.
Suppose f is convex and x̄ is a local minimizer of f, and suppose ∃ x̂ which is better, i.e. f(x̂) < f(x̄). Construct an ε-ball around x̄ such that f(x) ≥ f(x̄) ∀x ∈ B_ε(x̄).

[Figure: the segment connecting x̂ and x̄, with x_λ inside the ε-ball]

Now define x_λ = x̄ + λ(x̂ − x̄). Note this is equivalent to λx̂ + (1 − λ)x̄, so x_λ lies on the segment joining x̄ and x̂. If we take λ > 0 sufficiently small, we can ensure x_λ ∈ B_ε(x̄). Therefore f(x_λ) ≥ f(x̄).



Proof (cont.)
But we can also bound f(x_λ) from above. By our definition of x_λ,

    f(x̄) ≤ f(x_λ) = f(λx̂ + (1 − λ)x̄).

Since f is convex,

    f(λx̂ + (1 − λ)x̄) ≤ λf(x̂) + (1 − λ)f(x̄).

By assumption f(x̂) < f(x̄), so

    f(x_λ) ≤ λf(x̂) + (1 − λ)f(x̄) < λf(x̄) + (1 − λ)f(x̄) = f(x̄).

Thus we have found a point in the chosen neighborhood that is strictly better than the local minimizer, a contradiction; x̄ must be a global minimizer.
It is worth noting that this result holds regardless of differentiability; it depends only on the convexity of f.



Uniqueness of Global Minimizers

Theorem (Uniqueness Result)
Let f : R^n → R̄ be strictly convex. Then, if f has a global minimizer, that global minimizer is unique.

Proof.
Let x^1, x^2 be distinct global minimizers. For 0 < λ < 1, because f is strictly convex,

    f(λx^1 + (1 − λ)x^2) < λf(x^1) + (1 − λ)f(x^2).

For both x^1 and x^2 to be global minimizers, f(x^1) = f(x^2), so we may replace f(x^2) with f(x^1) on the right-hand side:

    f(λx^1 + (1 − λ)x^2) < λf(x^1) + (1 − λ)f(x^1) = f(x^1).

This says ∃ a point x between x^1 and x^2 with f(x) < f(x^1), contradicting the assumption that x^1 is a global minimizer. Hence f must have a unique global minimizer.
As we will see throughout this text, a crucial quantity in optimization is the Lipschitz constant L for the gradient of f, which is defined to satisfy

    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,  for all x, y ∈ dom(f).    (2.7)

We say that a continuously differentiable function f with this property is L-smooth or has L-Lipschitz gradients. We say that f is L_0-Lipschitz if

    |f(x) − f(y)| ≤ L_0‖x − y‖,  for all x, y ∈ dom(f).    (2.8)

From (2.2), we have

    f(y) − f(x) − ∇f(x)^T (y − x) = ∫_0^1 [∇f(x + γ(y − x)) − ∇f(x)]^T (y − x) dγ.

By using (2.7), we have

    [∇f(x + γ(y − x)) − ∇f(x)]^T (y − x) ≤ ‖∇f(x + γ(y − x)) − ∇f(x)‖ ‖y − x‖ ≤ Lγ‖y − x‖².

By substituting this bound into the previous integral, we obtain the following result.

Lemma 2.2  Given an L-smooth function f, we have for any x, y ∈ dom(f) that

    f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖².    (2.9)

Lemma 2.2 asserts that f can be upper bounded by a quadratic function whose value at x is equal to f(x). When f is twice continuously differentiable, we can characterize the constant L in terms of the eigenvalues of the Hessian ∇²f(x).
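
A concrete quadratic sketch of these facts (the matrix A and sample points are my own choices, not from the text): for f(x) = (1/2) x^T A x with A symmetric positive semidefinite, ∇f(x) = Ax, so the smallest valid L in (2.7) is the largest eigenvalue of A, and the bound (2.9) can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M.T @ M                        # symmetric positive semidefinite

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = np.linalg.eigvalsh(A)[-1]      # largest eigenvalue = Lipschitz constant

for _ in range(5):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x)**2
    assert f(y) <= upper + 1e-10   # quadratic upper bound of Lemma 2.2
print("quadratic upper bound (2.9) verified on random samples")
```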
Strict and Strong Convexity

Definition (Notation)
From now on, x_1, x_n, etc. will be used to refer to components of vectors, and x^1, x^n, etc. will be used to refer to distinct points.

Definition (Strictly convex)
A function f : R^n → R̄ is strictly convex if for all x, y with x ≠ y and all α ∈ (0, 1),

    f((1 − α)x + αy) < (1 − α)f(x) + αf(y).

Note this definition is identical to that of a convex function, except that the inequality is now strict (hence the endpoints α = 0, 1, where equality always holds, are excluded).
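
For contrast (my own example, not from the slides): f(x) = x^4 is strictly convex on R but not strongly convex, since its curvature vanishes at 0. The sketch below checks the strict inequality and shows the strong-convexity modulus candidate collapsing near 0.

```python
import numpy as np

# f(x) = x^4: strictly convex on R, but NOT strongly convex, because
# f''(x) = 12 x^2 -> 0 as x -> 0, so no rho > 0 works globally.
f = lambda x: x**4
rng = np.random.default_rng(3)

# Strict convexity: midpoint value strictly below the chord midpoint.
for _ in range(3):
    x, y = rng.uniform(-2.0, 2.0, size=2)
    assert f(0.5 * (x + y)) < 0.5 * f(x) + 0.5 * f(y) or np.isclose(x, y)

# Failure of strong convexity: the implied modulus near 0 goes to 0.
for t in [1.0, 0.1, 0.01]:
    x, y = -t, t
    # Largest rho with f(mid) <= avg - (rho/2)*(1/4)*|x - y|^2 at lambda = 1/2:
    rho = 8 * (0.5 * f(x) + 0.5 * f(y) - f(0.0)) / (x - y)**2
    print(f"t = {t:.2f}: best rho on [-t, t] <= {rho:.2e}")
```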



Strong Convexity

Let f : Ω → R and h : Ω → R^n, where Ω ⊂ R^n is an open convex set.

Definition
f is strongly convex (ρ) on Ω if ∃ρ > 0 such that ∀x, y ∈ Ω, λ ∈ [0, 1],

    f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y) − (ρ/2)λ(1 − λ)‖x − y‖².

Definition
h is strongly monotone (ρ) on Ω if ∃ρ > 0 such that ∀x, y ∈ Ω,

    ⟨h(x) − h(y), x − y⟩ ≥ ρ‖x − y‖².



Theorem (Strong convexity)
If f is continuously differentiable on Ω then the following are equivalent:
(a) f is strongly convex (ρ) on Ω;
(b) for all x, y ∈ Ω,
    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (ρ/2)‖x − y‖²;
(c) ∇f is strongly monotone (ρ) on Ω.
If f is twice continuously differentiable on Ω, then
(d) for all x, y, z ∈ Ω, ⟨x − y, ∇²f(z)(x − y)⟩ ≥ ρ‖x − y‖²
is equivalent to the above.
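
A quadratic example makes the equivalences tangible (this concrete check is mine, not from the lecture): for f(x) = (1/2) x^T A x with A symmetric positive definite, ∇²f(z) = A everywhere, so (d) holds with ρ = λ_min(A), and inequality (b) can then be verified at random point pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M.T @ M + 0.5 * np.eye(4)      # symmetric positive definite

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
rho = np.linalg.eigvalsh(A)[0]     # smallest eigenvalue = modulus in (d)

for _ in range(5):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    lower = f(x) + grad(x) @ (y - x) + 0.5 * rho * np.linalg.norm(x - y)**2
    assert f(y) >= lower - 1e-10   # inequality (b) of the theorem
print(f"rho = {rho:.3f}; lower bound (b) verified on random samples")
```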



Proof.
We show (a) ⟺ (b) ⟺ (c).
(a) ⟹ (b): The hypothesis gives

    [f(x + λ(y − x)) − f(x)]/λ ≤ f(y) − f(x) − (ρ/2)(1 − λ)‖x − y‖²,

so taking the limit as λ ↓ 0,

    ⟨∇f(x), y − x⟩ ≤ f(y) − f(x) − (ρ/2)‖x − y‖².



Proof (cont.)
(b) ⟹ (c): Applying (b) twice gives

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (ρ/2)‖x − y‖²,
    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (ρ/2)‖x − y‖².

Adding these inequalities gives

    f(y) + f(x) ≥ f(x) + f(y) + ⟨∇f(x) − ∇f(y), y − x⟩ + ρ‖x − y‖²,

from which the result follows after rearranging.



Proof (cont.)
(c) ⟹ (b): The hypothesis gives

    ∫_0^1 ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt ≥ ∫_0^1 ρt‖x − y‖² dt,

which implies the result, since by the integral form of Taylor's theorem the left-hand side equals f(y) − f(x) − ⟨∇f(x), y − x⟩, while the right-hand side equals (ρ/2)‖x − y‖².



Proof (cont.)
(b) ⟹ (a): Letting y = u and x = (1 − λ)u + λv in (b) gives

    f(u) ≥ f((1 − λ)u + λv) + ⟨∇f((1 − λ)u + λv), λ(u − v)⟩ + (ρ/2)‖λ(u − v)‖².    (1)

Also letting y = v and x = (1 − λ)u + λv in (b) implies

    f(v) ≥ f((1 − λ)u + λv) + ⟨∇f((1 − λ)u + λv), (1 − λ)(v − u)⟩ + (ρ/2)‖(1 − λ)(v − u)‖².    (2)

Adding (1 − λ) times (1) to λ times (2) makes the gradient terms cancel and gives the required result.



Proof (cont.)
To complete the proof, we assume that f is twice continuously differentiable on Ω.
(d) ⟹ (c): This follows from the hypothesis since

    ⟨∇f(x) − ∇f(y), x − y⟩ = ⟨∫_0^1 ∇²f(y + t(x − y))(x − y) dt, x − y⟩ ≥ ρ‖x − y‖².

(c) ⟹ (d): Let x, y, z ∈ Ω. Then z + λ(x − y) ∈ Ω for sufficiently small λ > 0, so

    ⟨x − y, ∇²f(z)(x − y)⟩ = ⟨x − y, ∇f(z + λ(x − y)) − ∇f(z)⟩/λ + o(1)
                           ≥ ρ‖x − y‖² + o(1).

The result follows in the limit as λ ↓ 0.



We can then give the definition of the strong convexity modulus m:

    f((1 − α)x + αy) + (1/2) m α(1 − α)‖x − y‖₂² ≤ (1 − α)f(x) + αf(y).

Strong convexity clearly implies strict convexity.

Lemma
If a function f is strictly convex, i.e.

    f(x + λ(y − x)) < f(x) + λ(f(y) − f(x))  for x ≠ y and λ ∈ (0, 1),

then

    f(y) > f(x) + ∇f(x)^T (y − x).



Proof.
Suppose instead that f(y) = f(x) + ⟨∇f(x), y − x⟩ for some x ≠ y. Let

    φ(t) = f(x + t(y − x)) − f(x) − t⟨∇f(x), y − x⟩.

Then φ is convex, the supposition says φ(1) = φ(0) = 0, and we note that φ′(0) = 0. The gradient inequality gives φ(t) ≥ φ(0) + tφ′(0) = 0, while convexity gives φ(t) ≤ (1 − t)φ(0) + tφ(1) = 0, so φ(t) = 0 for all t ∈ [0, 1]. This contradicts f being strictly convex, since f would be affine on the segment from x to y.
Hence f(y) > f(x) + ⟨∇f(x), y − x⟩ for all x ≠ y.
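
A quick numerical spot-check of the lemma (the strictly convex test function is my own pick): with f(x) = e^x in one dimension, the gap f(y) − f(x) − f′(x)(y − x) is strictly positive whenever x ≠ y.

```python
import numpy as np

f = np.exp                   # strictly convex on R; f' = exp as well
rng = np.random.default_rng(2)

for _ in range(5):
    x, y = rng.uniform(-2.0, 2.0, size=2)
    gap = f(y) - f(x) - f(x) * (y - x)   # f(y) minus first-order model at x
    print(f"x = {x:+.3f}, y = {y:+.3f}, gap = {gap:.3e}")
    assert gap > 0.0         # strict inequality for x != y
```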



References

S. J. Wright and B. Recht. Optimization for Data Analysis. In proof, 2020.

