Prerequisite Topics
February 3, 2015
This is meant to be a brief, informal refresher of some topics that will form building blocks in
this course. The content of the first two sections of this document is mainly taken from Appendix
A of B & V, with some supplemental information where needed. See the end for a list of potentially
helpful resources you can consult for further information.
Lipschitz A function f is Lipschitz with Lipschitz constant L if ||f(x) − f(y)|| ≤ L||x − y||
∀x, y ∈ dom f. If we refer to a function f as Lipschitz, we are making a stronger statement
than mere continuity: a Lipschitz function is not only continuous, but it also cannot change
value very rapidly. This is related to the smoothness of f, but a function can be Lipschitz
without being smooth; for example, f(x) = |x| is Lipschitz with L = 1 but is not differentiable
at 0.
Taylor Expansion The first order Taylor expansion of a function gives us an easy way to form a
linear approximation to that function around a point x:
f(y) ≈ f(x) + ∇f(x)^T (y − x)
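As a quick sanity check, here is a small NumPy sketch (the function f and the points are my own toy choices, not from the notes) showing that the first order approximation error shrinks like ||y − x||²:

    import numpy as np

    # Toy function with a known gradient: f(x) = sum(x^2), grad f(x) = 2x.
    f = lambda x: np.sum(x**2)
    grad_f = lambda x: 2 * x

    x = np.array([1.0, 2.0])
    y = x + 1e-3 * np.array([1.0, -1.0])

    approx = f(x) + grad_f(x) @ (y - x)   # first order Taylor approximation at x
    print(abs(f(y) - approx))             # tiny: the error is O(||y - x||^2)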
1.2 Sets
Interior The interior int C of the set C is the set of all points x ∈ C for which ∃ε > 0 s.t.
{y : ||y − x||₂ ≤ ε} ⊆ C.
Closure The closure cl C of a set C is the set of all x such that ∀ε > 0 ∃y ∈ C s.t. ||x − y||₂ ≤ ε.
The closure of any set is itself closed (see below), and can be considered the union of the
interior of C and the boundary of C.
Boundary The boundary bd C is the set of points x for which the following is true: ∀ε > 0
∃y ∈ C, z ∉ C s.t. ||y − x||₂ ≤ ε and ||z − x||₂ ≤ ε.
Complement The complement of the set C ⊆ Rn is denoted by Rn \ C. It is the set of all points
not in C.
Open vs Closed A set C is open if intC = C. A set is closed if its complement is open.
Equality You’ll notice that above we used a notion of equality for sets. To show formally that
sets A and B are equal, you must show A ⊆ B and B ⊆ A.
1.3 Norms
See B & V for a much more detailed treatment of this topic. I am going to list the most common
norms so that you are aware of the notation we will be using in this class:
ℓ0 ||x||₀ is the number of nonzero elements in x. We often want to minimize this, but it is
nonconvex (and actually, not a real norm), so we approximate it (you could say we relax it) to
other norms (e.g. ℓ1).
ℓp ||x||p = (|x1|^p + · · · + |xn|^p)^{1/p}, where p ≥ 1. Some common examples:
• ||x||₁ = Σ_{i=1}^n |xi|
• ||x||₂ = (Σ_{i=1}^n xi²)^{1/2}
• ||x||∞ = max_i |xi|
Spectral/Operator Norm ||X||op = σ1 (X), the largest singular value of X.
Trace Norm ||X||tr = Σ_{i=1}^r σi(X), the sum of all the singular values of X.
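All of these are easy to compute numerically; a minimal NumPy sketch (the vector and matrix here are arbitrary examples of mine):

    import numpy as np

    x = np.array([3.0, -4.0, 0.0])
    X = np.array([[1.0, 2.0], [3.0, 4.0]])

    l0 = np.count_nonzero(x)           # "l0 norm": number of nonzeros (not a true norm)
    l1 = np.linalg.norm(x, 1)          # sum of absolute values -> 7.0
    l2 = np.linalg.norm(x, 2)          # Euclidean norm -> 5.0
    linf = np.linalg.norm(x, np.inf)   # largest absolute entry -> 4.0

    sigma = np.linalg.svd(X, compute_uv=False)   # singular values, decreasing order
    op_norm = sigma[0]                 # spectral/operator norm
    trace_norm = sigma.sum()           # sum of the singular values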
Chain Rule Let h(x) = g(f(x)) for g : R → R. We have: ∇h(x) = g′(f(x)) ∇f(x)
Hessian In the world of optimization, we denote the Hessian matrix as ∇²f(x) ∈ Rn×n (some of
you have maybe seen this symbol used as the Laplace operator in other courses). The ij-th
entry of the Hessian is given by: (∇²f(x))ij = ∂²f(x) / ∂xi ∂xj
Matrix Differentials In general we will not be using these too much in class. The major
differentials you need to know are:
• ∂/∂X ½ tr(X^T X) = X
• ∂/∂X tr(XA) = A^T
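Identities like these are easy to verify numerically; a small sketch (random test matrices of my own) checking the second identity against central finite differences:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 3))
    A = rng.standard_normal((3, 3))

    # Numerical gradient of f(X) = tr(XA), one entry of X at a time.
    eps = 1e-6
    grad = np.zeros_like(X)
    for i in range(3):
        for j in range(3):
            E = np.zeros_like(X)
            E[i, j] = eps
            grad[i, j] = (np.trace((X + E) @ A) - np.trace((X - E) @ A)) / (2 * eps)

    print(np.allclose(grad, A.T))   # True: the gradient matches A^T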
2 Linear Algebra
2.1 Matrix Subspaces
Row Space The row space of a matrix A is the subspace spanned by the rows of A.
Column Space The column space of a matrix A is the subspace spanned by the columns of A.
Null Space The null space of a matrix A is the set of all x such that Ax = 0.
Rank rankA is the number of linearly independent columns in A (or, equivalently, the number of
linearly independent rows). A matrix A ∈ Rm×n is full rank if rankA = min{m, n}. Recall
that if A is square and full rank, it is invertible.
2.2 Singular Value Decomposition
Any matrix A ∈ Rm×n of rank r can be factored via its singular value decomposition (SVD):
A = U ΣV^T
Here U ∈ Rm×r has the property that U T U = I and V ∈ Rn×r likewise satisfies V T V = I.
Σ = diag(σ1 , σ2 , ..., σr ) where the singular values σi are ordered by decreasing value. Some
useful facts that we can learn using this decomposition:
• The SVD of A has the following implication for the eigendecomposition of A^T A:
A^T A = [V W] [Σ² 0; 0 0] [V W]^T
where the columns of W span the orthogonal complement of the column space of V.
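A quick NumPy check of this relationship (random A of my own choosing): the eigenvalues of A^T A are the squared singular values of A.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))

    s = np.linalg.svd(A, compute_uv=False)   # singular values, decreasing
    evals = np.linalg.eigvalsh(A.T @ A)      # eigenvalues of A^T A, ascending

    print(np.allclose(np.sort(s**2), evals))  # True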
3 Canonical ML Problems
3.1 Linear Regression
Linear regression is the problem of finding f : X → Y , where X ∈ Rn×p, Y is an n-dimensional
vector of real values, and f is a linear function. Canonically, we find f by finding the vector β̂ ∈ Rp
that minimizes the least squares objective:
β̂ = argmin_β ||Xβ − Y||₂²
For Y ∈ Rn×q, the multiple linear regression problem, we find a matrix B̂ ∈ Rp×q such that:
B̂ = argmin_B ||XB − Y||²F (the squared Frobenius norm, i.e. the sum of squared entries)
Note that in its basic form, the linear regression problem can be solved in closed form: when
X^T X is invertible, β̂ = (X^T X)⁻¹ X^T Y.
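A minimal NumPy sketch (synthetic data of my own) comparing the normal equations solution against a generic least squares solver:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))
    Y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)

    # Closed form via the normal equations (assumes X has full column rank).
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

    # The same solution from a generic least squares solver.
    beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(np.allclose(beta_hat, beta_lstsq))   # True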
3.2 Logistic Regression
Logistic regression is the problem of finding f : X → Y , where Y is an n-dimensional vector of
binary values, and f has the form f(x) = σ(β^T x). Here σ(α) = 1/(1 + exp(−α)) is the logistic
(sigmoid) function; its inverse is the logit function.
We typically solve for β by maximizing the likelihood of the observed data, which results in the
following optimization problem:
β̂ = argmax_β Σ_{i=1}^n [yi β^T xi − log(1 + exp(β^T xi))]
(here the yi are coded as 0/1)
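There is no closed form here, so we solve it with an iterative method. A small sketch (synthetic data, step size, and iteration count are my own ad hoc choices) of plain gradient ascent on this log-likelihood:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 3))
    beta_true = np.array([1.0, -2.0, 0.5])          # hypothetical ground truth
    y = (rng.random(200) < sigmoid(X @ beta_true)).astype(float)

    # Gradient of the log-likelihood: X^T (y - sigmoid(X beta)).
    beta = np.zeros(3)
    step = 0.01                                     # fixed step size, chosen ad hoc
    for _ in range(2000):
        beta += step * X.T @ (y - sigmoid(X @ beta))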
3.3 Support Vector Machines
β̂ = argmin_β ½||β||₂²
s.t. yi(β^T xi) ≥ 1 ∀i = 1, ..., n
In its simplest form, the support vector machine seeks to find the hyperplane (parameterized
by β) that separates the classes (encoded in the constraint) and does so in a way that creates the
largest margin between the data points and the plane (encoded in the objective that is minimized).
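Since this is exactly a quadratic program, one way to see it in action is with a generic convex solver. A sketch using cvxpy (the two-cluster data are my own construction; note the hard-margin problem above is infeasible unless the classes are linearly separable):

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(0)
    # Two well-separated clusters, so a separating hyperplane exists.
    X = np.vstack([rng.standard_normal((20, 2)) + 3.0,
                   rng.standard_normal((20, 2)) - 3.0])
    y = np.hstack([np.ones(20), -np.ones(20)])      # labels in {-1, +1}

    beta = cp.Variable(2)
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)),
                         [cp.multiply(y, X @ beta) >= 1])
    problem.solve()
    print(beta.value)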
3.4 Regularization/Penalization
Regularization (sometimes referred to as penalization) is a technique that can be applied to
almost all machine learning problems. Most of the time, we regularize in an effort to simplify the
learned function, often by forcing the parameters to be “small” (either in absolute size or in rank)
and/or setting many of them to be zero. Regularization is also sometimes used to incorporate prior
knowledge about the problem.
We incorporate regularization by adding either constraints or penalties to the existing optimiza-
tion problem. This is easiest to see in the context of linear regression. Where previously we only
had least squares loss, we can add penalties to create the following two variations:
Ridge Regression By adding a squared ℓ2 penalty, our objective to minimize becomes:
β̂ = argmin_β ||Xβ − Y||₂² + λ||β||₂²
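Like ordinary least squares, ridge regression has a closed form solution; a brief NumPy sketch (reusing synthetic data of my own):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))
    Y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)

    # X^T X + lambda*I is invertible for any lambda > 0, so this works
    # even when X does not have full column rank.
    lam = 1.0
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ Y)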