
Optimization for Machine Learning

Lecture 6: Tractable nonconvex problems


6.881: MIT

Suvrit Sra
Massachusetts Institute of Technology

04 Mar, 2021
Tractable nonconvex problems

Not all non-convex problems are bad

♠ Generalizing the notion of convexity
♠ Problems with hidden convexity
♠ Miscellaneous examples from applications
♠ The list is much longer and growing!

Spectral problems

Simplest example: eigenvalues

Largest eigenvalue of a symmetric matrix:

Ax = λ_max x   ⇔   max_{x^T x = 1} x^T A x.

Nonconvex problem, but we know how to solve it!

L(x, θ) := −x^T A x + θ(x^T x − 1)
∇_x L = 0:  −2Ax + 2θx = 0,  i.e.,  Ax = θx.

The necessary condition asks for (θ, x) to be an eigenpair. Thus, x^T A x
is maximized by the largest such pair.

Alternative: Let A = UDU^*; then max_{x^T x = 1} x^T A x = max_{y^T y = 1} ∑_i λ_i y_i², where y = U^* x.

max_{y^T y = 1} ∑_i λ_i y_i²  =  max_{z^T 1 = 1, z ≥ 0} ∑_i λ_i z_i,

which is a convex optimization problem.

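A minimal numerical sketch of this slide (not from the lecture; it assumes NumPy is available and uses shifted power iteration, a standard method for the top eigenpair): the Rayleigh quotient of the iterate matches λ_max even though the problem is nonconvex.

import numpy as np

# Hypothetical symmetric test matrix (any real symmetric A works here).
rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = (M + M.T) / 2

# Shifted power iteration: maximizes the Rayleigh quotient x^T A x over the unit sphere.
# The shift makes A + shift*I positive semidefinite, so the iteration targets λ_max(A).
shift = np.linalg.norm(A, 1)           # any bound on the spectral radius works
x = rng.standard_normal(n)
for _ in range(2000):
    x = A @ x + shift * x
    x /= np.linalg.norm(x)

print(x @ A @ x)                       # value of the nonconvex problem max x^T A x, ||x|| = 1
print(np.linalg.eigvalsh(A)[-1])       # λ_max(A): the two numbers agree
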
Generalized eigenvalues

Let A, B be symmetric matrices; the (largest) generalized eigenvalue is:

max_{x ≠ 0}  (x^T A x) / (x^T B x)

(more generally: Ax = λBx, generalized eigenvectors)

Exercise: Study its Lagrangian formulation as well as a convex
reformulation (similar to the “alternative” on slide 4)

Read the book: https://web.stanford.edu/~boyd/lmibook/lmibook.pdf

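A small sketch (an assumption beyond the slide: B is taken positive definite so the ratio is well defined, and SciPy is available): scipy.linalg.eigh solves the pencil Ax = λBx, and its top eigenvector maximizes the Rayleigh ratio above.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
n = 20
A = rng.standard_normal((n, n)); A = (A + A.T) / 2             # symmetric
C = rng.standard_normal((n, n)); B = C @ C.T + n * np.eye(n)   # symmetric positive definite (assumption)

lam, V = eigh(A, B)          # generalized eigenproblem A v = λ B v, eigenvalues in ascending order
x = V[:, -1]                 # eigenvector for the largest generalized eigenvalue
print(lam[-1], (x @ A @ x) / (x @ B @ x))   # the two values coincide
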
Trust region subproblem

min_x   x^T A x + 2b^T x + c
s.t.    x^T B x + 2d^T x + e ≤ 0.

Here A and B are merely symmetric. Hence, nonconvex.

The dual problem can be formulated as (Verify!)

max_{u, v ∈ R}   u
s.t.   [ A + vB        b + vd     ]
       [ (b + vd)^T    c + ve − u ]   ⪰ 0,
       v ≥ 0.

Importantly, strong duality holds (see Appendix B of BV).

(alternatively: turns out the SDP relaxation of the primal is exact)

Ref: See Wang, Kılınç-Karzan, The generalized trust-region subproblem: solution
complexity and convex hull results, 2019, for recent results.

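A minimal sketch (not the lecture's method; NumPy assumed) for the classic case B = I, d = 0, e = −1, i.e. min x^T A x + 2b^T x s.t. ||x||² ≤ 1: the KKT conditions give x = −(A + μI)^{-1} b with μ ≥ max(0, −λ_min(A)) and either μ = 0 or ||x|| = 1, so a bisection on μ finds the global minimizer. The degenerate "hard case" (b orthogonal to the bottom eigenspace) is ignored here.

import numpy as np

def trs(A, b, iters=200):
    """Global solution of min x^T A x + 2 b^T x s.t. ||x|| <= 1 (hard case ignored)."""
    n = len(b)
    lam_min = np.linalg.eigvalsh(A)[0]
    mu_lo = max(0.0, -lam_min) + 1e-12                 # smallest shift making A + mu*I PSD
    x = np.linalg.solve(A + mu_lo * np.eye(n), -b)
    if np.linalg.norm(x) <= 1.0:
        return x                                       # interior solution (or hard case, ignored)
    mu_hi = mu_lo + 1.0
    while np.linalg.norm(np.linalg.solve(A + mu_hi * np.eye(n), -b)) > 1.0:
        mu_hi *= 2.0                                   # ||x(mu)|| decreases in mu, so this stops
    for _ in range(iters):                             # bisection on ||x(mu)|| = 1
        mu = 0.5 * (mu_lo + mu_hi)
        if np.linalg.norm(np.linalg.solve(A + mu * np.eye(n), -b)) > 1.0:
            mu_lo = mu
        else:
            mu_hi = mu
    return np.linalg.solve(A + mu_hi * np.eye(n), -b)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6)); A = (A + A.T) / 2     # indefinite in general: nonconvex objective
b = rng.standard_normal(6)
x = trs(A, b)
print(np.linalg.norm(x), x @ A @ x + 2 * b @ x)        # feasible point attaining the global minimum
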
Toeplitz-Hausdorff Theorem

Let A be a complex, square matrix. Its numerical range is

W(A) := { x^* A x | ||x||_2 = 1, x ∈ C^n }.

Theorem. The set W(A) is convex (amazing!).

Exercise: If A is Hermitian, show that W(A) = [λ_min, λ_max].
Exercise: If AA^* = A^* A, then W(A) = conv(λ_i(A)).

Explore: Let A_1, ..., A_n be Hermitian. When is the set

{ (z^* A_1 z, z^* A_2 z, ..., z^* A_n z) | z ∈ C^d, ||z|| = 1 }

convex? (This is also called the “joint numerical range”.)

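A quick numerical look (a sketch, not a proof; NumPy assumed, data hypothetical): sampling x^* A x over random unit vectors traces out W(A), and in the Hermitian case the samples land in the real interval [λ_min, λ_max], as the first exercise asserts.

import numpy as np

rng = np.random.default_rng(5)
n = 6
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))   # generic complex matrix

def sample_numerical_range(A, num=20000):
    """Return x^* A x for `num` random unit vectors x in C^n."""
    m = A.shape[0]
    X = rng.standard_normal((num, m)) + 1j * rng.standard_normal((num, m))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    return np.einsum('ij,jk,ik->i', X.conj(), A, X)

W = sample_numerical_range(A)        # scatter-plotting W would show a convex region in C

H = (A + A.conj().T) / 2             # Hermitian case: W(H) = [lambda_min, lambda_max]
WH = sample_numerical_range(H)
lam = np.linalg.eigvalsh(H)
print(np.max(np.abs(WH.imag)))                                   # ~0: values are real
print(WH.real.min() >= lam[0] - 1e-9, WH.real.max() <= lam[-1] + 1e-9)   # True True
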
Principal Component Analysis (PCA)

Let A ∈ R^{n×p}. Consider the nonconvex problem

min_X  ||A − X||_F²   s.t.  rank(X) = k.

The well-known Eckart-Young-Mirsky theorem shows that

X^* = U_k Σ_k V_k^T,

where A has the SVD A = UΣV^T.

Why is this true?

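A quick numerical check of Eckart-Young-Mirsky (a sketch, NumPy assumed, data hypothetical): the truncated SVD attains error equal to the norm of the tail singular values, and rank-k competitors built from random column spaces never do better.

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 20))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
X_star = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]            # truncated SVD: U_k Σ_k V_k^T
best = np.linalg.norm(A - X_star, 'fro')
print(best, np.sqrt(np.sum(s[k:] ** 2)))                   # error equals norm of the tail singular values

# Rank-k competitors: for each random column space, take the best rank-k matrix within it.
errs = []
for _ in range(200):
    L = rng.standard_normal((30, k))
    X = L @ np.linalg.lstsq(L, A, rcond=None)[0]           # project A's columns onto range(L)
    errs.append(np.linalg.norm(A - X, 'fro'))
print(min(errs) >= best - 1e-12)                           # True: no competitor beats the SVD
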
PCA via the Fantope

Another characterization of the SVD (nonconvex problem):

min_{Z = Z^T}  ||A − AZ||_F²,   s.t.  rank(Z) = k, Z is a projection
⇔  max_{Z = Z^T}  ⟨A^T A, Z⟩,   s.t.  rank(Z) = k, Z is a projection.

Optimal solution here is Z = V_k V_k^T, the top-k eigenvectors of A^T A.

Equivalent convex problem!
First, write the constraint set C as

C = { Z = Z^T | rank(Z) = k, Z is a projection }
  = { Z = Z^T | λ_i(Z) ∈ {0, 1}, Tr(Z) = k }.

Fantope

Now consider the convex hull: C̄ = conv C

C̄ = { Z = Z^T | λ_i(Z) ∈ [0, 1], Tr(Z) = k }
  = { Z = Z^T | 0 ⪯ Z ⪯ I, Tr(Z) = k }.

The set C̄ is called the Fantope (named after Ky Fan).

Exercise: Now invoke the “maximize a convex function” idea from Lecture 5
to claim that the convex problem max_{Z = Z^T} ⟨A^T A, Z⟩ s.t. Z ∈ C̄ solves
the original problem.

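A small sketch of the convex problem over the Fantope (not part of the lecture; it assumes cvxpy and NumPy, and hypothetical data): solving the SDP max ⟨A^T A, Z⟩ over C̄ returns Z = V_k V_k^T, matching the nonconvex solution.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, p, k = 40, 8, 3
A = rng.standard_normal((n, p))
S = A.T @ A                                     # matrix in the objective <A^T A, Z>

# Fantope:  Z symmetric,  0 ⪯ Z ⪯ I,  Tr(Z) = k
Z = cp.Variable((p, p), symmetric=True)
constraints = [Z >> 0, np.eye(p) - Z >> 0, cp.trace(Z) == k]
prob = cp.Problem(cp.Maximize(cp.trace(S @ Z)), constraints)
prob.solve()

# Nonconvex solution: top-k eigenvectors of A^T A
w, V = np.linalg.eigh(S)
Zk = V[:, -k:] @ V[:, -k:].T
print(prob.value, np.trace(S @ Zk))             # equal optimal values
print(np.linalg.norm(Z.value - Zk))             # ~0 (up to solver tolerance) when the eigengap is positive
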
Sparsity

Nonconvex Sparse optimization

The ℓ0-quasi-norm is defined as

||x||_0 := card {x_i | x_i ≠ 0}.

Projection onto the ℓ0-ball:

min_x  (1/2)||x − y||_2²,   s.t.  ||x||_0 ≤ k.

Nonconvex but tractable: If ||y||_0 ≤ k, then clearly x = y.
Otherwise, pick the k largest entries of |y|, and set the rest to 0.

Exercise: Prove the above claim.
Exercise: Similarly solve  min_x  (1/2)||x − y||_2² + λ||x||_0

Used in so-called “Iterative Hard Thresholding” algorithms

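A direct sketch of both operations (NumPy assumed; the second function is the standard hard-thresholding operator, which solves the second exercise coordinatewise by comparing λ with y_i²/2):

import numpy as np

def project_l0(y, k):
    """Euclidean projection onto {x : ||x||_0 <= k}: keep the k largest-magnitude entries of y."""
    x = np.zeros_like(y)
    idx = np.argsort(np.abs(y))[-k:]      # indices of the k largest |y_i|
    x[idx] = y[idx]
    return x

def hard_threshold(y, lam):
    """Minimizer of 0.5*||x - y||^2 + lam*||x||_0: zero out entries with |y_i| <= sqrt(2*lam)."""
    return np.where(np.abs(y) > np.sqrt(2 * lam), y, 0.0)

y = np.array([0.3, -2.0, 0.1, 1.5, -0.7])
print(project_l0(y, 2))         # [ 0.  -2.   0.   1.5  0. ]
print(hard_threshold(y, 0.5))   # entries with |y_i| <= 1 are set to 0
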
Compressed Sensing

min ||x||_0   s.t.  Ax = b

If the “measurement matrix” A satisfies the so-called restricted
isometry property with constant δ_s ∈ (0, 1),

(1 − δ_s)||x||² ≤ ||Ax||² ≤ (1 + δ_s)||x||²,   for all s-sparse x,

then (for small enough δ_s) the ℓ1 convex relaxation is exact.

Explore (search keywords): compressed sensing, sparse recovery, restricted isometry

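A small recovery sketch (assumes cvxpy and NumPy; the data are hypothetical and a Gaussian measurement matrix is used because such matrices satisfy RIP with high probability at this scaling): minimizing the ℓ1 norm subject to Ax = b recovers the sparse signal exactly.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, m, s = 200, 80, 8                             # ambient dimension, measurements, sparsity
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
idx = rng.choice(n, s, replace=False)
x_true[idx] = rng.standard_normal(s)
b = A @ x_true

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == b])   # ℓ1 relaxation of min ||x||_0
prob.solve()
print(np.linalg.norm(x.value - x_true))          # ~0 (up to solver tolerance): exact recovery
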
Generalized convexity

Geometric programming

Monomial: g : R^n_{++} → R of the form

g(x) = γ x_1^{a_1} · · · x_n^{a_n},   γ > 0, a_i ∈ R.

Posynomial: sum of monomials, e.g., f(x) = ∑_j g_j(x)

Geometric Program

min_x   f(x)
s.t.    f_i(x) ≤ 1,  i ∈ [m]
        g_j(x) = 1,  j ∈ [r],

where the f_i are posynomials and the g_j are monomials.

Clearly, nonconvex.

Geometric programming

Make the change of variables y_i = log x_i (recall x_i > 0). Then, for a monomial,

f(x) = f(e^y) = γ (e^{y_1})^{a_1} · · · (e^{y_n})^{a_n} = e^{a^T y + b},

for b = log γ. Thus, after taking logs, the geometric program becomes

min_y   log ( ∑_k exp(a_{0k}^T y + b_{0k}) )
s.t.    log ( ∑_k exp(a_{ik}^T y + b_{ik}) ) ≤ 0,  i ∈ [m]
        c_j^T y + d_j = 0,  j ∈ [r],

for suitable sets of vectors {a_ik}, {b_ik}, and c_j.

Recall, log-sum-exp is convex, so the above is a convex optimization problem.

Ref: See Chapter 8.8 of BV; search online for “geometric programming”

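A minimal sketch of the change of variables (SciPy/NumPy assumed; the posynomial below is a hypothetical unconstrained example, not from the lecture): in y = log x the objective becomes log-sum-exp of affine functions, which any smooth local method minimizes globally; mapping back x = e^y gives the posynomial minimizer.

import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# posynomial  f(x) = x1/x2 + x2 + 1/(x1*x2):  exponent rows a_k, coefficients gamma_k = 1
A_exp = np.array([[ 1.0, -1.0],    # x1 * x2^-1
                  [ 0.0,  1.0],    # x2
                  [-1.0, -1.0]])   # x1^-1 * x2^-1
b = np.log(np.ones(3))             # b_k = log gamma_k = 0

def F(y):
    """log f(e^y) = logsumexp(A y + b): convex in y."""
    return logsumexp(A_exp @ y + b)

res = minimize(F, x0=np.zeros(2))  # smooth convex problem: BFGS finds the global minimum
x_opt = np.exp(res.x)              # map back to the positive variables
print(x_opt, np.exp(res.fun))      # minimizer and minimal posynomial value (here 2*sqrt(2))
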
Generalized convexity

Quasiconvexity: If the level sets L_t(f) = {x | f(x) ≤ t} are convex,
we say f is quasiconvex.

Arcwise convexity: f(γ_xy(t)) ≤ (1 − t)f(x) + tf(y), where the
arc γ_xy : [0, 1] → X joins point x to point y.

Several other notions of generalized convexity exist (see
also: genconv.org!)

Exercise: Suppose a set X is arcwise convex, and f : X → R is an
arcwise convex function. Prove that a local optimum of f is also
global (assume regularity as needed).
Exercise: View GP as arcwise convexity using: γ(t) = x^{1−t} y^t

Linear fractional programming

min_x   (a^T x + b) / (c^T x + d)
s.t.    Gx ≤ h,  c^T x + d > 0,  Ex = f.

This problem is nonconvex, but it is quasiconvex. Provided it is
feasible, it is equivalent to the LP

min_{y,z}   a^T y + bz
s.t.        Gy − hz ≤ 0,  z ≥ 0
            Ey = fz,  c^T y + dz = 1.

The two problems are connected via the transformation

y = x / (c^T x + d),   z = 1 / (c^T x + d).

See BV Chapter 4 for details.

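A small sketch of the LP reformulation (SciPy/NumPy assumed; the instance below is hypothetical and has no Ex = f block): solve the LP in (y, z) with linprog, then undo the transformation via x = y / z.

import numpy as np
from scipy.optimize import linprog

# min (a^T x + b0) / (c^T x + d)   s.t.  G x <= h  (a box 0 <= x <= 3), with c^T x + d > 0 on it
a = np.array([1.0, -1.0]); b0 = 0.5
c = np.array([0.0,  1.0]); d  = 2.0
G = np.vstack([np.eye(2), -np.eye(2)]); h = np.array([3.0, 3.0, 0.0, 0.0])

# LP in (y, z):  min a^T y + b0*z   s.t.  G y - h z <= 0,  c^T y + d z = 1,  z >= 0
n = len(a)
cost   = np.concatenate([a, [b0]])
A_ub   = np.hstack([G, -h[:, None]]);        b_ub = np.zeros(len(h))
A_eq   = np.concatenate([c, [d]])[None, :];  b_eq = np.array([1.0])
bounds = [(None, None)] * n + [(0, None)]    # y free, z >= 0
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

y, z = res.x[:n], res.x[n]
x = y / z                                    # undo the transformation
print(x, (a @ x + b0) / (c @ x + d))         # optimal x and the quasiconvex objective value
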
Generalized Perron-Frobenius

Let A, B ∈ R^{m×n}.

max_{x, λ}   λ
s.t.         λAx ≤ Bx,  x^T 1 = 1,  x ≥ 0.

Exercise: Try solving it directly somehow.
Exercise: Cast this as an (extended) linear-fractional program.

Challenge: Simplex convexity

Let Δ_n be the probability simplex, i.e., the set of vectors x =
(x_1, ..., x_n) such that x_i ≥ 0 and x^T 1 = 1. Assume that n ≥ 2.
Prove that the following “Bethe entropy”

g(x) = ∑_i [ x_i log(1/x_i) + (1 − x_i) log(1 − x_i) ]

is concave on Δ_n.

The Polyak-Łojasiewicz class

PL class, aka gradient-dominated:

f(x) − f(x^*) ≤ τ ||∇f(x)||^α,   α ≥ 1.

Observe that if ∇f(x) = 0, then x must be a global optimum.

Exercise: Let f be convex on R^n. Prove that on the set
{x | ||x − x^*|| ≤ R}, f is PL with τ = R and α = 1.

Exercise: Let f be strongly convex with parameter μ. Prove
that f is a PL function with τ = 1/(2μ) and α = 2.

Important non-convex PL example

▶ Let g(x) = (g_1(x), ..., g_m(x)) be a differentiable function.
▶ Consider the system of nonlinear equations g(x) = 0.
▶ Assume that m ≤ n and that ∃ x^* s.t. g(x^*) = 0.
▶ Assume the Jacobian J(x) = (∇g_1(x), ..., ∇g_m(x)) is
  non-degenerate on a convex set X containing x^*. Then,
  σ = inf_{x ∈ X} λ_min(J(x)^T J(x)) > 0.
▶ Let f(x) = (1/2) ∑_i g_i(x)²; note that ∇f(x) = J(x) g(x).

||∇f(x)||² = g(x)^T J(x)^T J(x) g(x) ≥ σ ||g(x)||² = 2σ (f(x) − f(x^*))

Thus, f is PL with τ = 1/(2σ), α = 2.

Exercise: When m < n, are the Hessians of f degenerate at solutions?

Explore: Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear
Convergence of Gradient and Proximal-Gradient Methods Under the
Polyak-Łojasiewicz Condition. https://arxiv.org/abs/1608.04636

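A small numerical sketch of this slide (hypothetical data, not from the lecture; NumPy assumed): an overparametrized smooth system g(x) = 0 with m < n gives a nonconvex f(x) = (1/2)||g(x)||², yet plain gradient descent drives f to its global minimum value 0 at a linear rate, as the PL inequality predicts. Here the m×n Jacobian has full row rank everywhere, so the slide's σ is positive.

import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 20
B = rng.standard_normal((m, n)) / np.sqrt(n)
phi = lambda x: x + 0.1 * np.sin(x)              # mild elementwise nonlinearity, phi'(x) in [0.9, 1.1]
x_star = rng.standard_normal(n)
b = B @ phi(x_star)                              # guarantees a solution x* exists

g   = lambda x: B @ phi(x) - b
Jac = lambda x: B * (1 + 0.1 * np.cos(x))        # m x n Jacobian (rows are ∇g_i^T): B @ diag(phi')
f   = lambda x: 0.5 * np.dot(g(x), g(x))

# Jac(x) Jac(x)^T >= 0.81 * B B^T > 0 for all x (full row rank), so sigma > 0 and f is PL.
x = np.zeros(n)
vals = [f(x)]
for _ in range(1000):
    x -= 0.1 * (Jac(x).T @ g(x))                 # ∇f(x) = Jac(x)^T g(x) in this m x n convention
    vals.append(f(x))
print(vals[0], vals[200], vals[-1])              # geometric decay of f toward 0 (a global minimum)
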
Other tractable nonconvex problems

Instances of matrix completion, deep linear neural networks,
tensor factorization, many others. Check out the great collection
by Ju Sun: https://sunju.org/research/nonconvex/
Submodular optimization (later in course)
Any combinatorial problem whose convex relaxation is tight
Non-Euclidean convexity (hinted at today, later in course)

Example without “spurious” local minima: Deep Linear Network

min L(W_1, ..., W_L) = (1/2) ||W_L W_{L−1} · · · W_1 X − Y||_F²,

here X ∈ R^{d_x×n}: data/input matrix; and Y ∈ R^{d_y×n}: “label”/output matrix.

Theorem. Let k = min(d_x, d_y) be the “width” of the network. Let
V = { (W_1, ..., W_L) | rank(∏_l W_l) = k }. Then, every critical point of L(W) in V
is a global minimum, while every critical point in V^c is a saddle point.

Ref: Chulhee Yun, Suvrit Sra, Ali Jadbabaie. Global optimality conditions for
deep neural networks. ICLR 2018.

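A small empirical sketch (hypothetical data, plain gradient descent, hidden width not smaller than min(d_x, d_y); NumPy assumed; this only illustrates the statement, it does not prove it): on a 3-layer linear network, gradient descent from a generic small initialization reaches essentially the same loss as the best single linear map, consistent with the absence of spurious local minima.

import numpy as np

rng = np.random.default_rng(4)
dx, dh, dy, n = 8, 8, 4, 100
X = rng.standard_normal((dx, n))
Y = rng.standard_normal((dy, dx)) @ X + 0.1 * rng.standard_normal((dy, n))   # near-linear targets

W1 = 0.1 * rng.standard_normal((dh, dx))
W2 = 0.1 * rng.standard_normal((dh, dh))
W3 = 0.1 * rng.standard_normal((dy, dh))
loss = lambda: 0.5 * np.linalg.norm(W3 @ W2 @ W1 @ X - Y, 'fro') ** 2

lr = 2e-4
for _ in range(50000):
    R  = W3 @ W2 @ W1 @ X - Y                  # residual of the current weights
    G1 = (W3 @ W2).T @ R @ X.T                 # dL/dW1
    G2 = W3.T @ R @ (W1 @ X).T                 # dL/dW2
    G3 = R @ (W2 @ W1 @ X).T                   # dL/dW3
    W1, W2, W3 = W1 - lr * G1, W2 - lr * G2, W3 - lr * G3

W_ls = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T   # best single linear map Y ≈ W X
best = 0.5 * np.linalg.norm(W_ls @ X - Y, 'fro') ** 2
print(loss(), best)                                  # the two losses should nearly coincide
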
