
10-725/36-725: Convex Optimization

Prerequisite Topics

February 3, 2015

This is meant to be a brief, informal refresher of some topics that will form building blocks in
this course. The content of the first two sections of this document is mainly taken from Appendix
A of B & V, with some supplemental information where needed. See the end for a list of potentially
helpful resources you can consult for further information.

1 Real Analysis and Calculus


1.1 Properties of Functions
Limits You should be comfortable with the notion of limits, not necessarily because you will have
to evaluate them, but because they are key to understanding other attributes of functions.
Informally, lim_{x→a} f(x) is the value that f approaches as x approaches the value a.
Continuity A function f(x) is continuous at a particular point x0 if, as a sequence x1, x2, ... approaches x0, the values f(x1), f(x2), ... approach f(x0). In limit notation: lim_{i→∞} f(xi) = f(lim_{i→∞} xi). f is continuous if it is continuous at all points x0 ∈ dom f.
Differentiability A function f : Rn → R is differentiable at x ∈ int dom f if there exists a row vector Df(x) that satisfies the following limit:

    lim_{z ∈ dom f, z ≠ x, z → x} ||f(z) − f(x) − Df(x)(z − x)||_2 / ||z − x||_2 = 0

We refer to Df(x) as the derivative of f at x; it is the transpose of the gradient ∇f(x).


Smoothness f is smooth if the derivatives of f are continuous over all of dom f. We can also speak of smoothness of a certain order, meaning that the derivatives of f are continuous up to that order.
It is also reasonable to talk about smoothness over a particular interval of the domain of f.

Lipschitz A function f is Lipschitz with Lipschitz constant L if ||f(x) − f(y)|| ≤ L||x − y||
∀x, y ∈ dom f. Calling a function f Lipschitz is a stronger statement about the continuity of f: a Lipschitz function is not only continuous, it also cannot change value too rapidly. This is related to the smoothness of f, but a function can be Lipschitz without being smooth (for example, f(x) = |x| is Lipschitz with L = 1 but is not differentiable at 0).

Taylor Expansion The first order Taylor expansion of a function gives us an easy way to form a linear approximation to that function:

    f(y) ≈ f(x) + ∇f(x)^T (y − x)

An equivalent form that is often useful is the following:

    f(y) = f(x) + ∫_0^1 ∇f(t(x − y) + y)^T (y − x) dt

For a quadratic approximation, we add another term:

    f(y) ≈ f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇^2 f(x)(y − x)

Often when doing convergence analysis we will upper bound the Hessian and use the quadratic approximation to understand how well a technique does as a function of iterations.
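As a quick sanity check of these approximations, here is a small NumPy sketch; the example function (log-sum-exp) and the evaluation points are choices made here for illustration, not part of these notes:

```python
# Numerically compare f(y) with its first- and second-order Taylor
# approximations around x, for f(x) = log(sum(exp(x))).
import numpy as np

def f(x):
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    p = np.exp(x)
    return p / p.sum()              # gradient of log-sum-exp is the softmax

def hess_f(x):
    p = np.exp(x) / np.sum(np.exp(x))
    return np.diag(p) - np.outer(p, p)

x = np.array([0.5, -1.0, 2.0])
y = x + 0.1 * np.array([1.0, -2.0, 0.5])   # a nearby point

first  = f(x) + grad_f(x) @ (y - x)
second = first + 0.5 * (y - x) @ hess_f(x) @ (y - x)
print(f(y), first, second)   # the quadratic approximation should be closer to f(y)
```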

1.2 Sets
Interior The interior int C of the set C is the set of all points x ∈ C for which ∃ε > 0 s.t. {y : ||y − x||_2 ≤ ε} ⊆ C.
Closure The closure cl C of a set C is the set of all x such that ∀ε > 0 ∃y ∈ C s.t. ||x − y||_2 ≤ ε. The closure can be considered the union of the interior of C and the boundary of C; for a closed set (see below), cl C = C.
Boundary The boundary bd C is the set of points x for which the following is true: ∀ε > 0 ∃y ∈ C, z ∉ C s.t. ||y − x||_2 ≤ ε and ||z − x||_2 ≤ ε.
Complement The complement of the set C ⊆ R^n is denoted R^n \ C. It is the set of all points not in C.
Open vs Closed A set C is open if int C = C. A set is closed if its complement is open.
Equality You’ll notice that above we used a notion of equality for sets. To show formally that
sets A and B are equal, you must show A ⊆ B and B ⊆ A.

1.3 Norms
See B & V for a much more detailed treatment of this topic. I am going to list the most common
norms so that you are aware of the notation we will be using in this class:
ℓ0 ||x||_0 is the number of nonzero elements in x. We often want to minimize this, but it is non-convex (and, in fact, not a real norm), so we approximate it (you could say we relax it) with other norms (e.g. ℓ1).
ℓp ||x||_p = (|x1|^p + · · · + |xn|^p)^{1/p}, where p ≥ 1. Some common examples:
• ||x||_1 = Σ_{i=1}^n |xi|
• ||x||_2 = (Σ_{i=1}^n xi^2)^{1/2}
• ||x||_∞ = max_i |xi|
Spectral/Operator Norm ||X||_op = σ1(X), the largest singular value of X.
Trace Norm ||X||_tr = Σ_{i=1}^r σi(X), the sum of all the singular values of X.
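Here is a short NumPy sketch computing the norms above (the example vector and matrix are arbitrary choices made for illustration):

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0, 1.0])
print(np.count_nonzero(x))          # "l0 norm": number of nonzero elements
print(np.linalg.norm(x, 1))         # l1 norm: sum of absolute values
print(np.linalg.norm(x, 2))         # l2 norm: Euclidean length
print(np.linalg.norm(x, np.inf))    # l-infinity norm: max absolute value

X = np.array([[2.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
sigma = np.linalg.svd(X, compute_uv=False)
print(sigma[0])                     # spectral/operator norm: largest singular value
print(sigma.sum())                  # trace (nuclear) norm: sum of singular values
print(np.linalg.norm(X, 2), np.linalg.norm(X, 'nuc'))  # same two values via norm()
```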

1.4 Linear/Affine Functions


In this course, a linear function will be a function f(x) = a^T x. Affine functions are linear functions with an added intercept term: g(x) = a^T x + b.

1.5 Derivatives of Functions


See B & V for some nice examples. Consider the following for a function f : Rn → R:
Gradient The ith element of ∇f(x) is the partial derivative of f w.r.t. the ith dimension of the input x: ∇f(x)_i = ∂f(x)/∂x_i

Chain Rule Let h(x) = g(f(x)) for g : R → R. We have: ∇h(x) = g′(f(x)) ∇f(x)

Hessian In the world of optimization, we denote the Hessian matrix as ∇^2 f(x) ∈ R^{n×n} (some of you may have seen this symbol used as the Laplace operator in other courses). The ijth entry of the Hessian is given by: ∇^2 f(x)_{ij} = ∂^2 f(x)/∂x_i ∂x_j. (Both the gradient and the Hessian are checked numerically in the sketch at the end of this subsection.)

Matrix Differentials In general we will not be using these too much in class. The major differ-
entials you need to know are:
• ∂_X (1/2) tr(X^T X) = X

• ∂_X tr(XA) = A^T
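Here is the numerical sketch mentioned above: it checks an analytic gradient and Hessian against finite differences for a simple quadratic function. The example function, the random data, and the helper names are choices made here for illustration, not something taken from B & V:

```python
# Check grad_f against central differences of f, and hess_f against
# central differences of grad_f, for f(x) = x^T A x + b^T x with A symmetric.
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # symmetric matrix
b = rng.standard_normal(n)

f      = lambda x: x @ A @ x + b @ x
grad_f = lambda x: 2 * A @ x + b          # analytic gradient
hess_f = lambda x: 2 * A                  # analytic Hessian (constant here)

x, eps = rng.standard_normal(n), 1e-6
num_grad = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps) for e in np.eye(n)])
num_hess = np.array([(grad_f(x + eps*e) - grad_f(x - eps*e)) / (2*eps) for e in np.eye(n)])

print(np.allclose(num_grad, grad_f(x), atol=1e-4))    # should print True
print(np.allclose(num_hess, hess_f(x), atol=1e-4))    # should print True
```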

2 Linear Algebra
2.1 Matrix Subspaces
Row Space The row space of a matrix A is the subspace spanned by the rows of A.
Column Space The column space of a matrix A is the subspace spanned by the columns of A.
Null Space The null space of a matrix A is the set of all x such that Ax = 0.

Rank rankA is the number of linearly independent columns in A (or, equivalently, the number of
linearly independent rows). A matrix A ∈ Rm×n is full rank if rankA = min{m, n}. Recall
that if A is square and full rank, it is invertible.
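A quick NumPy illustration of rank and invertibility (the example matrices are chosen here, not taken from the notes):

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 4.0]])    # second row is 2x the first row
B = np.array([[1.0, 2.0], [0.0, 3.0]])

print(np.linalg.matrix_rank(A))   # 1: not full rank, so A is not invertible
print(np.linalg.matrix_rank(B))   # 2: full rank, so B is invertible
print(np.linalg.inv(B))
```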

2.2 Orthogonal Subspaces


Two subspaces S1, S2 ⊆ R^n are orthogonal if s1^T s2 = 0 ∀ s1 ∈ S1, s2 ∈ S2.
2.3 Decomposition
Eigen Decomposition If A ∈ S^n, the set of real, symmetric, n × n matrices, then A can be factored:

    A = QΛQ^T

Here Q is an orthogonal matrix, which means that Q^T Q = I. Λ = diag(λ1, λ2, ..., λn), where the eigenvalues λi are ordered by decreasing value. Some useful facts about A that we can ascertain from the eigen decomposition:
• |A| = ∏_{i=1}^n λi
• tr A = Σ_{i=1}^n λi
• A is invertible iff (if and only if) all its eigenvalues are nonzero. Then A^{-1} = QΛ^{-1}Q^T (note that I have used the fact that for orthogonal Q, Q^{-1} = Q^T).
• A is positive semidefinite if all its eigenvalues are nonnegative.
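These facts are easy to verify numerically. Here is a small NumPy sketch on a random symmetric matrix (note that np.linalg.eigh returns eigenvalues in increasing order, so they are flipped to match the decreasing convention above):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                          # real symmetric matrix

lam, Q = np.linalg.eigh(A)
lam, Q = lam[::-1], Q[:, ::-1]             # decreasing eigenvalue order

print(np.allclose(Q @ np.diag(lam) @ Q.T, A))                    # A = Q Lambda Q^T
print(np.allclose(np.linalg.det(A), np.prod(lam)))               # |A| = product of eigenvalues
print(np.allclose(np.trace(A), np.sum(lam)))                     # tr A = sum of eigenvalues
print(np.allclose(np.linalg.inv(A), Q @ np.diag(1/lam) @ Q.T))   # A^{-1} = Q Lambda^{-1} Q^T
```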
Singular Value Decomposition Any matrix A ∈ R^{m×n} with rank r can be factored as:

    A = UΣV^T

Here U ∈ R^{m×r} has the property that U^T U = I and V ∈ R^{n×r} likewise satisfies V^T V = I. Σ = diag(σ1, σ2, ..., σr), where the singular values σi are ordered by decreasing value. Some useful facts that we can learn using this decomposition:
• The SVD of A has the following implication for the eigendecomposition of A^T A:

    A^T A = [V W] [Σ^2 0; 0 0] [V W]^T

where W is the matrix such that [V W] is orthogonal.
• The condition number of A (an important concept for us in this course) is cond A = σ1/σr.
Pseudoinverse The (reduced) SVD of a matrix A that is not invertible yields the pseudoinverse A† = VΣ^{-1}U^T.
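A NumPy sketch of the reduced SVD, the condition number, and the pseudoinverse follows; the example matrix is random and (almost surely) has full column rank, a choice made here for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))                    # rank r = 3 almost surely

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # reduced SVD: A = U diag(s) V^T
V = Vt.T

print(np.allclose(U @ np.diag(s) @ Vt, A))
print(np.allclose(U.T @ U, np.eye(3)), np.allclose(V.T @ V, np.eye(3)))
print(s[0] / s[-1], np.linalg.cond(A))             # condition number sigma_1 / sigma_r

A_pinv = V @ np.diag(1 / s) @ U.T                  # pseudoinverse via the SVD
print(np.allclose(A_pinv, np.linalg.pinv(A)))
```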

3 Canonical ML Problems
3.1 Linear Regression
Linear regression is the problem of finding f : X → Y , where X ∈ Rn×p , Y is an n-dimensional
vector of real values and f is a linear function. Canonically, we find f by finding the vector β̂ ∈ Rp
that minimizes the least squares objective:

    β̂ = argmin_β ||Xβ − Y||_2^2

For Y ∈ R^{n×q}, the multiple linear regression problem, we find a matrix B̂ such that:

    B̂ = argmin_B ||XB − Y||_F^2

Note that in its basic form, the linear regression problem can be solved in closed form.
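For concreteness, here is a sketch of the closed-form (normal equations) solution checked against a library least-squares solver; the synthetic data are generated here purely for illustration:

```python
# beta_hat = (X^T X)^{-1} X^T Y, computed by solving the normal equations.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)        # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)     # library solver

print(np.allclose(beta_closed, beta_lstsq))
```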
3.2 Logistic Regression
Logistic regression is the problem of finding f : X → Y, where Y is an n-dimensional vector of binary values, and f has the form f(x) = σ(β^T x), where σ is the logistic (sigmoid) function σ(α) = 1/(1 + exp(−α)). We typically solve for β by maximizing the likelihood of the observed data, which results in the following optimization problem (for yi ∈ {0, 1}):

    β̂ = argmax_β Σ_{i=1}^n [yi β^T xi − log(1 + exp(β^T xi))]
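As an illustration (not part of the original notes), here is a minimal sketch that maximizes this log-likelihood by gradient ascent for yi ∈ {0, 1}; the synthetic data, step size, and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

sigmoid = lambda z: 1 / (1 + np.exp(-z))

beta = np.zeros(p)
step = 0.1
for _ in range(500):
    grad = X.T @ (y - sigmoid(X @ beta))   # gradient of the log-likelihood
    beta += step / n * grad                # (scaled) gradient ascent step

print(beta)   # roughly recovers the direction of beta_true, up to noise
```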

3.3 Support Vector Machines


Like logistic regression, SVMs attempt to find a function that linearly separates two classes. In this
case, the elements of Y are either 1 or −1. SVMs frame the problem as the following constrained
optimization problem (in primal form):
    β̂ = argmin_β (1/2)||β||_2^2
    s.t. yi (β^T xi) ≥ 1 ∀i = 1, ..., n
In its simplest form, the support vector machine seeks to find the hyperplane (parameterized
by β) that separates the classes (encoded in the constraint) and does so in a way that creates the
largest margin between the data points and the plane (encoded in the objective that is minimized).
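Here is a sketch of this primal problem using the cvxpy package (assumed to be installed); the toy data are generated here so that they are linearly separable and the constraints are feasible:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n, p = 40, 2
# Two well-separated clusters, labeled +1 and -1.
X = np.vstack([rng.standard_normal((n // 2, p)) + 3,
               rng.standard_normal((n // 2, p)) - 3])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

beta = cp.Variable(p)
objective = cp.Minimize(0.5 * cp.sum_squares(beta))       # (1/2)||beta||_2^2
constraints = [cp.multiply(y, X @ beta) >= 1]             # y_i (beta^T x_i) >= 1
prob = cp.Problem(objective, constraints)
prob.solve()

print(beta.value)   # the max-margin hyperplane (no intercept in this simple form)
```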

3.4 Regularization/Penalization
Regularization (sometimes referred to as penalization) is a technique that can be applied to al-
most all machine learning problems. Most of the time, we regularize in an effort to simplify the
learned function, often by forcing the parameters to be “small” (either in absolute size or in rank)
and/or setting many of them to be zero. Regularization is also sometimes used to incorporate prior
knowledge about the problem.
We incorporate regularization by adding either constraints or penalties to the existing optimiza-
tion problem. This is easiest to see in the context of linear regression. Where previously we only
had least squares loss, we can add penalties to create the following two variations:
Ridge Regression By adding an ℓ2 penalty, our objective to minimize becomes:

    β̂ = argmin_β ||Xβ − Y||_2^2 + λ||β||_2^2

This will result in many elements of β being close to 0 (more so if λ is larger).


Lasso Regression By adding an ℓ1 penalty, our objective to minimize becomes:

    β̂ = argmin_β ||Xβ − Y||_2^2 + λ||β||_1

This will result in many elements of β being 0 (more if λ is larger).


The first example is nice because it can still be solved in closed form. Notice, however, that the ℓ1 penalty creates issues not only for a closed-form solution, but also for standard first-order methods, because it is not differentiable everywhere. We will study how to deal with this later in the course.
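For reference, a sketch of the ridge closed form, β̂ = (X^T X + λI)^{-1} X^T Y, on synthetic data; the data and the choice of λ are arbitrary assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 10
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

lam = 5.0
beta_ols   = np.linalg.solve(X.T @ X, X.T @ Y)                  # ordinary least squares
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)  # ridge closed form

print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))     # ridge shrinks the coefficients
```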
4 Further Resources
In addition to B & V, the following are good sources of information on these topics:
• Matrix Cookbook: http://www.mit.edu/~wingated/stuff_i_use/matrix_cookbook.pdf
• Linear Algebra Lectures by Zico Kolter: http://www.cs.cmu.edu/~zkolter/course/linalg/index.html

• Functional Analysis/Matrix Calculus Lectures by Aaditya Ramdas: http://www.cs.cmu.edu/~aramdas/videos.html
