Classification of Optimization methods
Contents
1 Introduction
1.1 Motivation examples
1.2 Continuous optimization: First steps
1.3 Linear regression
2 Unconstrained optimization
3 Convexity
3.1 Convex sets
3.2 Convex functions
3.3 The first and second order characterization of convex functions
3.4 Other rules for detecting convexity of a function
4 Convex optimization
4.1 Basic properties
4.2 Quadratic programming
4.3 Convex cone programming
4.3.1 Duality in convex cone programming
4.3.2 Second order cone programming
4.3.3 Semidefinite programming
4.4 Computational complexity
4.4.1 Good news – the ellipsoid method
4.4.2 Bad news – copositive programming
4.5 Applications
4.5.1 Robust PCA
4.5.2 Minimum volume enclosing ellipsoid
5 Karush–Kuhn–Tucker optimality conditions
6 Methods
6.1 Line search
6.2 Unconstrained problems
6.3 Constrained problems
6.3.1 Methods of feasible directions
6.3.2 Active-set methods
6.3.3 Penalty and barrier methods
6.4 Conjugate gradient method
7 Selected topics
7.1 Robust optimization
7.2 Concave programming
Appendix
Notation
Bibliography
Chapter 1
Introduction
which is an integer linear programming problem. Denoting by A the incidence matrix of the graph G, the problem has the compact form
max xts subject to Ax = 0, 0 ≤ x ≤ u.
The best known algorithms utilize the discrete nature of the problem. On the other hand, the LP formulation is beneficial, too. Since the matrix A is totally unimodular, the resulting optimal solution is automatically integral, provided the capacities are integral. Hence the problem is efficiently solvable by means of linear programming, despite the integrality conditions.
Another advantage of the LP formulation is that we can easily modify it to different variants of the
problem. Consider for example the problem of finding a minimum-cost flow. Denote by cij the cost of
sending a unit of flow along the edge (i, j) ∈ E and by d > 0 the minimum required flow. Then the
problem reads as an LP problem
min Σ_{(i,j)∈E} cij xij subject to Ax = 0, 0 ≤ x ≤ u, xts ≥ d.
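To make this concrete, the following minimal sketch solves the max-flow LP with SciPy on a small made-up digraph; the graph, the capacities and the use of scipy.optimize.linprog are illustrative choices, not part of the text. Because the incidence matrix is totally unimodular and the capacities are integral, the vertex solution returned by the solver is integral.

```python
import numpy as np
from scipy.optimize import linprog

# Max-flow as an LP on a toy digraph (node 0 = s, node 3 = t).
# The last edge is the artificial return arc t -> s carrying the flow value.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 0)]
caps = [2, 2, 1, 2, 2, np.inf]

n, m = 4, len(edges)
A = np.zeros((n, m))                 # incidence matrix: +1 leaving, -1 entering
for e, (i, j) in enumerate(edges):
    A[i, e], A[j, e] = 1, -1

c = np.zeros(m)
c[-1] = -1.0                         # maximize x_ts  <=>  minimize -x_ts

res = linprog(c, A_eq=A, b_eq=np.zeros(n), bounds=list(zip([0] * m, caps)))
print("max flow:", -res.fun)         # integral thanks to total unimodularity
print("edge flows:", res.x[:-1])
```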
Inequality “≥”: Let x ∈ Rn be an arbitrary vector such that ‖x‖₂ = 1. Let A = QΛQT be a spectral decomposition of the matrix A, with the eigenvalues ordered λ1 ≥ · · · ≥ λn. Denoting y := QT x, we have ‖y‖₂ = 1 and
xT Ax = xT QΛQT x = yT Λy = Σ_{i=1}^n λi yi² ≤ λ1 Σ_{i=1}^n yi² = λ1 ‖y‖₂² = λ1.
Example 1.3 (Functional optimization). In principle, the number of variables need not be finite. For
example, in a functional problem, we want to find a function satisfying certain constraints and minimizing
a specified criterion. For illustration, imagine a problem of computing the best trajectory for a spacecraft
traveling from Earth to Mercury; the variable here is the curve of the trajectory described by a function,
and the objective is to minimize travel time. Certain simple functional problems can be solved analytically,
but in general they are solved by discretization of the unknown function and then application of classical
optimization methods.
Isoperimetric problems belong to this area, too. It is well known that the ball has the smallest surface area among all bodies that enclose a given volume. But how is it when two volumes are given and we wish to minimize the total surface area (including the separating surface)? This problem is known as the double bubble problem, and it was not solved until Hutchings et al. [2002]. The minimum area shape consists of two spherical surfaces meeting at angles of 120° = 2π/3. The separating surface is also a spherical surface; it is a disc in the case of two equally sized volumes. See the illustration in Figure 1.1.
Example 1.4 (When nature optimizes). Snell's law quantifies the bending of light as it passes through a boundary between two media. The less dense the medium, the faster light travels. The trajectory of light is such that it is traversed in the least time (the so-called Fermat principle of least time). See the illustration in Figure 1.2.
Figure 1.2: A light ray refracting at the boundary between a less dense medium and a denser medium.
Naturally, to solve a problem minx∈M f(x) means to find its minimum point, called the optimal solution. However, sometimes the problem is so hard that we are contented with an approximate solution instead.
Beware that the minimal value of the function f(x) on the set M need not be attained. Consider for example the problem minx∈R x, which is unbounded from below, or the problem minx∈R e^x, which is bounded from below but whose infimum 0 is not attained. A sufficient condition for the existence of a minimum is given by the Weierstrass theorem.
Theorem 1.5 (Weierstrass). If f(x) is continuous and M compact, then f(x) attains a minimum on M.
Another problem appears when local minima exist. The basic methods for solving optimization problems are iterative. They start at an initial point and move in a decreasing direction of the objective function. When they approach a local minimum, they get stuck, and it is difficult to escape from this situation.
This phenomenon does not occur in linear programming, or more generally in convex optimization, since each local minimum is a global one (see Theorem 3.15).
Notice that the concept of a local minimum can be used in discrete optimization, too. For instance, in
the minimum spanning tree problem we can define a local neighbourhood as the set of all spanning trees
obtained by replacing just one edge.
Classification
The feasible set M is often defined by a system of equations and inequalities
gj(x) ≤ 0, j = 1, . . . , J,
hℓ(x) = 0, ℓ = 1, . . . , L,
or, in compact vector form, g(x) ≤ 0, h(x) = 0, where g : Rn → RJ and h : Rn → RL. Depending on the type of the objective function and the feasible set, we classify optimization problems as follows:
• Linear programming. Functions f(x), gj(x), hℓ(x) are linear. We assume that the reader has a basic background in linear programming.
• Convex optimization. Functions f(x), gj(x) are convex and hℓ(x) are linear.
Basic transformations
If one wants to find a maximum of f(x) on the set M, then the problem is easily reduced to a minimization problem via
max_{x∈M} f(x) = − min_{x∈M} −f(x).
An equation constraint can be reduced to inequalities since h(x) = 0 is equivalent to h(x) ≤ 0, h(x) ≥ 0, but this is not recommended in view of numerical issues.
More generally, the problem min f(x) subject to g(x) ≤ 0, h(x) = 0 can be transformed to
min ϕ(f (x)) subject to ψ(g(x)) ≤ 0, η(h(x)) = 0,
provided
• ϕ(z) is increasing on its domain, e.g., z^k, z^{1/k}, log(z);
• ψ(z) preserves nonnegativity, i.e., z ≤ 0 ⇔ ψ(z) ≤ 0, e.g., z³;
• η(z) preserves roots, i.e., z = 0 ⇔ η(z) = 0, e.g., z².
Both optimization problems then possess the same optimal solutions. The optimal values differ, but they can be easily computed from the optimal solutions.
Example 1.6 (Geometric programming). The transformation turns out to be very convenient in geometric programming, for instance. To illustrate it, consider the particular example
min x²y subject to 5xy³ ≤ 1, 7x⁻³y ≤ 1, x, y > 0.
Taking the logarithm of the objective function and of both sides of the constraints yields
min 2 log(x) + log(y) subject to log(5) + log(x) + 3 log(y) ≤ 0, log(7) − 3 log(x) + log(y) ≤ 0.
The substitution x′ := log(x), y′ := log(y) then leads to the LP problem
min 2x′ + y′ subject to log(5) + x′ + 3y′ ≤ 0, log(7) − 3x′ + y′ ≤ 0.
Moving the objective function to the constraints. The frequently used transformation is to move
the objective function to the constraints, that is, the problem minx∈M f (x) is transformed to
min z subject to f (x) ≤ z, x ∈ M.
The objective function now is linear, and all possible obstacles are hidden in the constraints.
Example 1.7 (A finite minimax). Consider the problem
min_{x∈M} max_{i=1,...,s} fi(x).
The problems of type min–max are very hard in general. However, in our situation, the outer objective
function is the maximum on a finite set. The problem thus can be written as
min z subject to fi (x) ≤ z, i = 1, . . . , s, x ∈ M.
In the original formulation, the outer objective function maxi=1,...,s fi (x) is nonsmooth. After the trans-
formation, the objective function is linear.
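As a tiny numerical illustration (the data are made up), the sketch below solves min over x ∈ [0, 1] of max(x, 1 − x) through the epigraph reformulation:

```python
from scipy.optimize import linprog

# Epigraph reformulation of a small minimax problem. Variables: (x, z).
# minimize max(x, 1 - x) over [0, 1]  ->  min z s.t. x <= z, 1 - x <= z.
c = [0, 1]                          # minimize z
A_ub = [[1, -1],                    # x - z <= 0
        [-1, -1]]                   # -x - z <= -1   (i.e., 1 - x <= z)
b_ub = [0, -1]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1), (None, None)])
print("x* =", res.x[0], " optimal value z* =", res.fun)   # 0.5 and 0.5
```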
Surprisingly, the converse transformation can be sometimes convenient as well. Moving the constraints
into the objective function is addressed in Section 6.3.3.
Linear regression leads to the optimization problem minx∈Rn ‖Ax − b‖ for a suitable vector norm. The geometric interpretation of this problem is to find the projection of the vector b ∈ Rm onto the column space S(A) of the matrix A. The typical choices are the following norms:
• Euclidean norm. The problem then reads minx∈Rn ‖Ax − b‖₂² = minx∈Rn Σ_{i=1}^m (Ai∗x − bi)², that is, it is the ordinary least squares problem. If the matrix A has full column rank, then the solution is unique and has the form x∗ = (ATA)⁻¹ATb. This approach is also justified by statistics: Suppose that the dependence is really linear and the entries of the right-hand side vector b are affected by independent and normally distributed errors. Then x∗ is the best linear unbiased estimator and also the maximum likelihood estimator.
• Manhattan norm. The problem minx∈Rn ‖Ax − b‖₁ can be expressed as the linear program
min eT z subject to −z ≤ Ax − b ≤ z, z ∈ Rm, x ∈ Rn.
This case has also a statistical interpretation. The optimal solution produces the maximum likelihood
estimator as long as the noise follows the Laplace distribution.
• Maximum norm. The problem minx∈Rn ‖Ax − b‖∞ is also equivalent to an LP problem:
min z subject to −ze ≤ Ax − b ≤ ze, z ∈ R, x ∈ Rn.
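A minimal sketch of the Manhattan-norm LP with scipy.optimize.linprog follows; the data and the planted outlier are illustrative. It also previews the robustness to outliers discussed next.

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(A, b):
    """Least absolute deviations via the LP above; a sketch, not a tuned solver."""
    m, n = A.shape
    # variables: (x, z) with z >= |Ax - b| componentwise
    c = np.concatenate([np.zeros(n), np.ones(m)])
    I = np.eye(m)
    A_ub = np.block([[A, -I], [-A, -I]])
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + m))
    return res.x[:n]

rng = np.random.default_rng(0)
A = np.column_stack([np.ones(30), rng.uniform(0, 10, 30)])
b = A @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 30)
b[0] += 50                      # one gross outlier
print("L1 fit:", l1_regression(A, b))                    # close to (1, 2)
print("L2 fit:", np.linalg.lstsq(A, b, rcond=None)[0])   # dragged by the outlier
```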
Outliers. An outlier is an observation that differs significantly from the others; see Figure 1.5. Usually,
it is caused by some experimental error. An outlier spoils the linear tendency in data and the resulting
estimator can be distorted. The Manhattan norm is less sensitive to outliers than the other norms, but
still outliers can cause problems.
If we expect or estimate that there are k ≪ m outliers in the data, then we can modify the linear regression problem accordingly, e.g., by discarding the k largest residuals.
Cardinality. The cardinality of a vector x ∈ Rn is the number of its nonzero entries, and it is denoted by ‖x‖₀. Seeking an estimator with few nonzero coefficients leads to the problem
min_{x∈Rn} ‖Ax − b‖₂² + γ‖x‖₀,
where γ > 0 is a suitably chosen constant. Again, this is a hard combinatorial problem. That is why ‖x‖₀ is approximated by the Manhattan norm (in some sense, it is the best approximation). As a consequence, we get an effectively solvable optimization problem
min_{x∈Rn} ‖Ax − b‖₂² + γ‖x‖₁.
Example 1.8 (Signal reconstruction). Consider the problem of signal reconstruction. Let a vector x̃ ∈ Rn represent the unknown signal, and let y = x̃ + err represent the observed noisy signal. We want to smooth the noisy signal and find a good approximation of x̃. To this end, we will seek a vector x ∈ Rn that is close to y and that is also smooth, i.e., without big oscillations.
This idea leads to a multi-objective optimization problem: minimize both the distance ‖x − y‖₂ and a measure φ(x) of the oscillations. Scalarizing, we solve
min_{x∈Rn} ‖x − y‖₂² + γ φ(x),
where γ > 0 is a parameter. A smaller value of γ prioritizes the first objective, and so the resulting signal is closer to the observed signal, while a larger γ penalizes oscillations and produces smoother signals.
Denote by D ∈ R(n−1)×n the difference matrix with entries Dii = 1, Di,i+1 = −1 and zeros elsewhere. With the choice φ(x) = ‖Dx‖₁, the problem reads
min_{x∈Rn} ‖x − y‖₂² + γ ‖Dx‖₁,
in which we aim to find a signal approximation in the form of a piecewise constant function. This approach is called total variation reconstruction, and it is used when processing digital signals.
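A minimal sketch of total variation reconstruction, assuming the cvxpy modelling library and the scalarized objective written above; the signal, the noise level and the weight gamma are made up.

```python
import numpy as np
import cvxpy as cp

n = 200
t = np.arange(n)
x_true = np.where(t < n // 2, 1.0, -1.0)        # piecewise constant signal
y = x_true + 0.3 * np.random.default_rng(0).normal(size=n)

D = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)    # D_ii = 1, D_{i,i+1} = -1
x = cp.Variable(n)
gamma = 2.0
cp.Problem(cp.Minimize(cp.sum_squares(x - y) + gamma * cp.norm1(D @ x))).solve()
print("error of reconstruction:", np.linalg.norm(x.value - x_true))
```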
A comparison by pictures is presented in Figure 1.6, originating from the website
http://stanford.edu/class/ee364a/lectures/approx.pdf
A similar approach is used for image analysis and processing, e.g., for deblurring of blurred images,
reconstruction of damaged images, etc. See website
http://www.imm.dtu.dk/~pcha/mxTV/
(Figure 1.6 panels: the original signal x and the noisy signal xcor on the left; three solutions x̂ on the trade-off curve ‖x̂ − xcor‖₂ versus φquad(x̂), resp. φtv(x̂), on the right; the index i runs from 0 to 2000.)
Figure 1.6: Example 1.8: In both pictures on the left-hand side, there is the original signal with the noisy signal beneath it. On the right-hand side, there are reconstructed signals with decreasing values of γ. The top picture employs quadratic smoothing (i.e., ‖Dx‖₂ instead of ‖Dx‖₁), while the bottom picture uses the total variation reconstruction, which better approximates the digital signal.
Chapter 2
Unconstrained optimization
The objective function f : Rn → R is either general or we impose some differentiability assumptions later
on. First we present the well-known first order necessary optimality condition.
Theorem 2.1 (First order necessary optimality condition). Let f (x) be differentiable and let x∗ ∈ Rn be
a local extremal point. Then ∇f (x∗ ) = o.
Proof. Without loss of generality assume that x∗ is a local minimum. Recall that for any i = 1, . . . , n
∂f(x∗)/∂xi = lim_{λ→0} [f(x∗ + λei) − f(x∗)]/λ.
Since x∗ is a local minimum, the numerator is nonnegative for all sufficiently small |λ|, so the limit is nonnegative for λ → 0+ and nonpositive for λ → 0−. Hence ∂f(x∗)/∂xi = 0 for every i.
Obviously, the above condition is only a necessary condition for optimality since it cannot distinguish between minima, maxima and inflection points; see Figure 2.1. A point with zero gradient is called a stationary point.
We mention two second order optimality conditions, one is a necessary condition and one is a sufficient
condition.
Theorem 2.2 (Second order necessary optimality condition). Let f (x) be twice continuously differentiable
and let x∗ ∈ M be a local minimum. Then the Hessian matrix ∇2 f (x∗ ) is positive semidefinite.
Proof. The continuity of the second partial derivatives implies that for every λ ∈ R and y ∈ Rn there is θ ∈ (0, 1) such that
f(x∗ + λy) = f(x∗) + λ∇f(x∗)T y + (1/2) λ² yT ∇²f(x∗ + θλy) y. (2.1)
In other words, this is Taylor's expansion with the Lagrange remainder. Due to minimality of x∗ we have f(x∗ + λy) ≥ f(x∗), and from Theorem 2.1 we have ∇f(x∗) = o. Hence
λ² yT ∇²f(x∗ + θλy) y ≥ 0.
Dividing by λ² and letting λ → 0, the continuity of the second derivatives yields yT ∇²f(x∗) y ≥ 0.
Theorem 2.3 (Second order sufficient optimality condition). Let f (x) be twice continuously differentiable.
If ∇f (x∗ ) = o and ∇2 f (x∗ ) is positive definite for a certain x∗ ∈ M , then x∗ is a strict local minimum.
Figure 2.1: Stationary points of a function f(x) include local minima, local maxima and inflection points.
Proof. We proceed similarly as in the proof of Theorem 2.2. In equation (2.1) we have, for y ≠ o and λ ≠ 0 sufficiently small,
λ∇f(x∗)T y = 0,  (1/2) λ² yT ∇²f(x∗ + θλy) y > 0.
Therefore f(x∗ + λy) > f(x∗).
We see that the gap between the necessary and the sufficient conditions is quite tight. However, the example f(x) = −x⁴ shows that the gap is not zero: The point x = 0 is a strict local maximum (and not a minimum), the sufficient condition is not satisfied (as expected), but the necessary condition is satisfied.
Example 2.4 (The least squares method). Consider a system of linear equations Ax = b, where A ∈ Rm×n, b ∈ Rm and the matrix A has rank n (cf. Section 1.3). Usually, m is much greater than n. Since this system practically never has an exact solution, we seek an approximate solution by means of the optimization problem
min_{x∈Rn} ‖Ax − b‖₂.
Here we aim to find a vector x that minimizes the Euclidean norm of the difference between the left- and right-hand sides of the system Ax = b. Since the square is an increasing function, the minimum is attained at the same point as for the problem
min_{x∈Rn} ‖Ax − b‖₂².
We now check the assumptions of Theorem 2.3. The gradient of the objective function is 2ATAx − 2ATb (see the appendix, page 63). Setting it to zero, we get the condition ATAx = ATb, whence x = (ATA)⁻¹ATb. The Hessian of the objective function is 2ATA, which is a positive definite matrix. Therefore the point x = (ATA)⁻¹ATb is a strict local minimum. Moreover, since the objective function is convex, this solution is indeed the global minimum (as we will see later from Theorem 4.4).
If the matrix A does not have full column rank, then any solution of the system of linear equations ATAx = ATb is a candidate for an optimum. In fact, one can show that all these infinitely many solutions of the system of equations are optimal solutions of our problem.
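A quick numerical check of this example on random data, assuming NumPy:

```python
import numpy as np

# The stationary point of ||Ax - b||_2^2 solves the normal equations
# A^T A x = A^T b; we compare with numpy's least-squares routine.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))           # full column rank with probability 1
b = rng.normal(size=50)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_normal, x_lstsq))  # True: both give the global minimum
```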
Chapter 3
Convexity
Convex sets and convex functions appeared more than 100 years ago, and the topic was pioneered by Hölder (1889), Jensen (1906), Minkowski (1910) and other famous mathematicians.
The segment between two points x1, x2 ∈ Rn is the set
u(x1, x2) := {x ∈ Rn ; x = λ1x1 + λ2x2, λ1, λ2 ≥ 0, λ1 + λ2 = 1}.
Definition 3.1. A set M ⊆ Rn is convex if it contains the segment u(x1, x2) of every two points x1, x2 ∈ M.
Convexity of a set can be equivalently characterized by using convex combinations of all k-tuples of its points.
Theorem 3.2. A set M ⊆ Rn is convex if and only if for every k ∈ N, every x1, . . . , xk ∈ M and every λ1, . . . , λk ≥ 0 with Σ_{i=1}^k λi = 1 we have Σ_{i=1}^k λi xi ∈ M.
Proof. An exercise.
Obviously, the union of convex sets need not be convex. On the other hand, the intersection of convex sets is always convex.
Theorem 3.3. Let Mi ⊆ Rn, i ∈ I, be convex sets. Then ∩i∈I Mi is convex.
Proof. Let x1, x2 ∈ ∩i∈I Mi. Then for every i ∈ I we have x1, x2 ∈ Mi, and hence also every convex combination λ1x1 + λ2x2 ∈ Mi.
This property justifies introduction of the concept of the convex hull of a set M as the minimal (with
respect to inclusion) convex set containing M .
Definition 3.4. The convex hull of a set M ⊆ Rn is the intersection of all convex sets in Rn containing M. We denote it by conv(M).
Proposition 3.5. A set M ⊆ Rn is convex if and only if M = conv(M).
Proof. “⇒” Since M is convex, it is one of those convex sets that are intersected to form conv(M); hence conv(M) ⊆ M, and the converse inclusion is obvious.
“⇐” Due to Theorem 3.3, the set conv(M) is convex, so M = conv(M) is also convex.
Figure 3.1: Two sets M and N separated by the hyperplane aT x = b: M lies in the halfspace aT x ≤ b and N in the halfspace aT x ≥ b.
Recall that the relative interior of a set M ⊆ Rn is the interior of M when restricted to the smallest
affine subspace containing M . We denote it by ri(M ).
Theorem 3.6. If M ⊆ Rn is convex, then ri(M ) is convex.
Proof. Let x1, x2 ∈ ri(M). Then there exist relative ε-neighbourhoods Oε(x1), Oε(x2) ⊆ M. Consider a convex combination x := λ1x1 + λ2x2. An arbitrary point of Oε(x) has the form x + y for some y ∈ Oε(o), and x + y = λ1x1 + λ2x2 + y = λ1(x1 + y) + λ2(x2 + y) belongs to M thanks to the fact that x1 + y, x2 + y ∈ M.
An important property of disjoint convex sets is their linear separability; see Figure 3.1.
Definition 3.7. Two nonempty sets M, N ⊆ Rn are separable if there exist a vector a ∈ Rn, a ≠ o, and a number b ∈ R such that
aT x ≤ b ∀x ∈ M,
aT x ≥ b ∀x ∈ N,
but not
aT x = b ∀x ∈ M ∪ N.
We state one version of the separation theorem below. We omit the proof as it is included in another
course.
Theorem 3.8 (Separation theorem). Let M, N ⊆ Rn be nonempty and convex. Then they are separable
if and only if ri(M ) ∩ ri(N ) = ∅.
Let M ⊆ Rn be convex and closed. Using the separation property, we can separate a boundary point x∗ of M from the set M by a hyperplane aT x = b; we call this hyperplane a supporting hyperplane of M. We then have aT x∗ = b (i.e., the hyperplane contains the point x∗), and the set M lies in the positive halfspace defined by the hyperplane, that is, aT x ≤ b for every x ∈ M.
Proposition 3.9. Let M ⊆ Rn be convex and closed. Then M is equal to the intersection of the positive
halfspaces determined by all supporting hyperplanes of M .
Proof. From the property aT x ≤ b ∀x ∈ M we get that M lies in the intersection of the halfspaces. We prove the converse inclusion by contradiction: If there is x∗ ∉ M lying in the intersection of the halfspaces, then we can separate it (or, more precisely, its neighbourhood) from M by a supporting hyperplane. Thus we have found a halfspace not containing x∗; a contradiction.
The above statement is not only of theoretical importance. Using supporting hyperplanes, we can enclose the set M in a convex polyhedron with arbitrary precision; see Figure 3.2. This property is used in certain algorithms, too; they start with an initial selection of supporting hyperplanes and iteratively include other ones when needed, in particular when one has to separate some point from the set M.
Definition 3.10. Let M ⊆ Rn be a convex set. Then a function f : Rn → R is convex on M if for every x1, x2 ∈ M and every λ1, λ2 ≥ 0, λ1 + λ2 = 1, one has
f(λ1x1 + λ2x2) ≤ λ1 f(x1) + λ2 f(x2).
If we have
f(λ1x1 + λ2x2) < λ1 f(x1) + λ2 f(x2)
for every convex combination with x1 ≠ x2 and λ1, λ2 > 0, then f is strictly convex on M.
Analogously we define a concave function: f (x) is concave if −f (x) is convex. Obviously, a function is
linear (or, more precisely, affine) if and only if it is both convex and concave.
Example 3.11. Any vector norm is a convex function because, for any x1, x2 ∈ Rn and λ1, λ2 ≥ 0, λ1 + λ2 = 1, the triangle inequality and homogeneity yield
‖λ1x1 + λ2x2‖ ≤ ‖λ1x1‖ + ‖λ2x2‖ = λ1‖x1‖ + λ2‖x2‖.
In particular, the smooth Euclidean norm ‖x‖₂ is convex, as well as the non-smooth norms ‖x‖₁ and ‖x‖∞, or any matrix norm.
Analogously as in Theorem 3.2 we can characterize convex functions by means of convex combinations
of k-tuples of points.
Figure 3.3: The epigraph E of a nonconvex function (on the left) and a convex function (on the right).
Theorem 3.12 (Jensen's inequality). Let f be convex on a convex set M ⊆ Rn. Then for every x1, . . . , xk ∈ M and λ1, . . . , λk ≥ 0 with Σ_{i=1}^k λi = 1 we have f(Σ_{i=1}^k λi xi) ≤ Σ_{i=1}^k λi f(xi).
Proof. We proceed by mathematical induction on k. The statement is obvious for k = 2, so we turn our attention to the induction step. Define α := Σ_{i=1}^{k−1} λi. Since α + λk = 1 and Σ_{i=1}^{k−1} α⁻¹λi = 1, we get using the induction hypothesis
f(Σ_{i=1}^k λi xi) = f(α Σ_{i=1}^{k−1} α⁻¹λi xi + λk xk) ≤ α f(Σ_{i=1}^{k−1} α⁻¹λi xi) + λk f(xk)
≤ α Σ_{i=1}^{k−1} α⁻¹λi f(xi) + λk f(xk) = Σ_{i=1}^k λi f(xi).
Theorem 3.13. A function f (x) is convex on M if and only if it is convex on each segment in M . That
is, the function g(t) = f (x + ty) is convex on the corresponding compact interval domain of variable t for
every x ∈ M and every y of norm 1.
Theorem 3.15 (Fenchel, 1951). Let M ⊆ Rn be a convex set. Then a function f : Rn → R is convex on M if and only if its epigraph E := {(x, z) ∈ M × R ; z ≥ f(x)} is a convex set.
Proof. “⇒” Denote by E the epigraph of f(x) on M, and let (x1, z1), (x2, z2) ∈ E be arbitrarily chosen. Consider a convex combination λ1(x1, z1) + λ2(x2, z2) = (λ1x1 + λ2x2, λ1z1 + λ2z2). By convexity of f,
f(λ1x1 + λ2x2) ≤ λ1 f(x1) + λ2 f(x2) ≤ λ1 z1 + λ2 z2,
so the combination belongs to E.
“⇐” Let E be convex. For any x1, x2 ∈ M we have (x1, f(x1)), (x2, f(x2)) ∈ E. Consider a convex combination λ1x1 + λ2x2 ∈ M. Due to convexity of E we have (λ1x1 + λ2x2, λ1f(x1) + λ2f(x2)) ∈ E, that is, f(λ1x1 + λ2x2) ≤ λ1 f(x1) + λ2 f(x2).
The following property, illustrated on Figure 3.4, is frequently used in optimization. We will see later
in Chapter 4 that the feasible set M of an optimization problem minx∈M f (x) is usually described by a
system of inequalities gj (x) ≤ 0, j = 1, . . . , J. If functions gj are convex, then the set M is convex, too.
Theorem 3.16. Let M ⊆ Rn be a convex set and f : Rn → R a convex function. For any b ∈ R the set
{x ∈ M ; f (x) ≤ b} is convex.
Figure 3.4: The set Mb := {x ∈ M ; f (x) ≤ b} illustrated for a convex function (on the left) and a
nonconvex function (on the right); see Theorem 3.16.
Figure 3.5: The tangent line to the graph of a convex function f (x) at point (x1 , f (x1 )).
Another nice property of convex functions is their continuity. We state this result without a proof,
which can be found e.g. in Lange [2016].
Theorem 3.17. Let M ⊆ Rn be a nonempty convex set of dimension n, and let f : Rn → R be a convex
function. Then f (x) is continuous and locally Lipschitz on int M .
Theorem 3.18 (The first order characterization of a convex function, Avriel 1976, Mangasarian 1969). Let ∅ ≠ M ⊆ Rn be a convex set and let f(x) be a function differentiable on an open superset of M. Then f(x) is convex on M if and only if for every x1, x2 ∈ M
f(x2) − f(x1) ≥ ∇f(x1)T (x2 − x1).
Proof idea of “⇐”: Let x := λ1x1 + λ2x2 be a convex combination. The characterization at x gives f(x1) − f(x) ≥ ∇f(x)T (x1 − x) and f(x2) − f(x) ≥ ∇f(x)T (x2 − x). Multiply the first inequality by λ1, the second one by λ2, and summing up we get
λ1 f(x1) + λ2 f(x2) − f(x) ≥ ∇f(x)T (λ1x1 + λ2x2 − x) = 0,
that is, the convexity inequality.
Theorem 3.20 (The second order characterization of a convex function, Fenchel, 1951). Let ∅ ≠ M ⊆ Rn be an open convex set of dimension n, and suppose that a function f : M → R is twice continuously differentiable on M. Then f(x) is convex on M if and only if the Hessian ∇²f(x) is positive semidefinite for every x ∈ M.
Proof. Let x∗ ∈ M be arbitrary. Due to continuity of the second partial derivatives we have that for every λ ∈ R and y ∈ Rn with x∗ + λy ∈ M there is θ ∈ (0, 1) such that
f(x∗ + λy) = f(x∗) + λ∇f(x∗)T y + (1/2) λ² yT ∇²f(x∗ + θλy) y. (3.3)
“⇒” From Theorem 3.18 we get f(x∗ + λy) − f(x∗) ≥ λ∇f(x∗)T y, which together with (3.3) yields
yT ∇²f(x∗ + θλy) y ≥ 0.
Letting λ → 0 and using continuity of the Hessian, we obtain yT ∇²f(x∗) y ≥ 0.
“⇐” If the Hessian is positive semidefinite on M, then the last term in (3.3) is nonnegative, so f(x∗ + λy) ≥ f(x∗) + λ∇f(x∗)T y, and f is convex by Theorem 3.18.
Remark 3.21. For strict convexity, we can state the following conditions:
(1) If f is strictly convex, then the Hessian ∇²f(x) is positive definite almost everywhere on M; at the remaining points it is positive semidefinite.
(2) If the Hessian ∇²f(x) is positive definite on M, then f is strictly convex.
In the first item, we cannot claim positive definiteness everywhere on M: using an analogous reasoning as in the proof of Theorem 3.20, the limit transition λ → 0 can turn the strict inequality into a non-strict one.
Example 3.22.
1. The function f(x) = x⁴ is strictly convex on R, but its second derivative f″(x) = 12x² vanishes at x = 0.
2. The function f(x) = x⁻² has a positive second derivative everywhere on R \ {0}, but it is not convex there. The reason is that R \ {0} is not a convex set, and the defining inequality of a convex function fails even for some convex combinations avoiding zero. Therefore it is necessary that the domain is a convex set. Still, f(x) is convex separately on (0, ∞) and on (−∞, 0).
Example 3.23. Consider a quadratic function f : Rn → R given by the formula f(x) = xTAx + bTx + c, where A ∈ Rn×n is symmetric, b ∈ Rn and c ∈ R (see the appendix, page 63). Then
• f (x) is convex if and only if A is positive semidefinite,
• f (x) is strictly convex if and only if A is positive definite.
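The two conditions can be tested numerically through the eigenvalues of A; a small sketch (the test matrices and the tolerance are illustrative):

```python
import numpy as np

# Convexity of the quadratic f(x) = x^T A x + b^T x + c from Example 3.23,
# decided via the eigenvalues of the symmetric matrix A.
def quadratic_convexity(A, tol=1e-10):
    lam = np.linalg.eigvalsh(A)        # eigenvalues of a symmetric matrix
    if lam.min() > tol:
        return "strictly convex"       # A positive definite
    if lam.min() >= -tol:
        return "convex"                # A positive semidefinite
    return "not convex"

print(quadratic_convexity(np.array([[2.0, 0.0], [0.0, 8.0]])))   # strictly convex
print(quadratic_convexity(np.array([[1.0, 1.0], [1.0, 1.0]])))   # convex (singular)
print(quadratic_convexity(np.array([[1.0, 0.0], [0.0, -1.0]])))  # not convex
```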
Chapter 4
Convex optimization
A convex optimization problem has the form
min f(x) subject to x ∈ M,
where f : Rn → R is a convex function and M ⊆ Rn is a convex set. Often the feasible set M is described in the form
M = {x ∈ Rn ; gj(x) ≤ 0, j = 1, . . . , J},
where gj : Rn → R, j = 1, . . . , J, are convex functions. By Theorem 3.16 the set M is then convex. In this chapter, however, we will deal with a general convex set M.
4.1 Basic properties
Example 4.1. Consider the problem
min x1 + x2 subject to x1² + x2² ≤ 2.
Another example is the problem
min x1² + x2² + 2x2 subject to x1² + x2² ≤ 2.
We find their optima in Example 4.6.
Theorem 4.2. Consider a convex optimization problem min f(x) subject to x ∈ M. Then:
(1) each local minimum is a global minimum;
(2) the set of all optimal solutions is convex;
(3) if f is strictly convex, then there is at most one optimal solution.
Proof.
(1) Let x0 ∈ M be a local minimum and suppose to the contrary that there is x∗ ∈ M such that f(x∗) < f(x0). Consider the convex combination x = λx∗ + (1 − λ)x0 ∈ M, λ ∈ (0, 1). Then
f(x) ≤ λf(x∗) + (1 − λ)f(x0) < f(x0).
This is in contradiction with local minimality of x0 since for arbitrarily small λ > 0 we have f(x) < f(x0).
(2) Let x1, x2 ∈ M be two optimal solutions and denote by z = f(x1) = f(x2) the optimal value. Any convex combination x = λ1x1 + λ2x2 ∈ M then satisfies
f(x) ≤ λ1 f(x1) + λ2 f(x2) = λ1 z + λ2 z = z,
so x is optimal, too.
(3) Suppose to the contrary that x1, x2 ∈ M, x1 ≠ x2, are two optimal solutions. Denote by z = f(x1) = f(x2) the optimal value. The convex combination x = λ1x1 + λ2x2 ∈ M, λ1, λ2 > 0, then satisfies
f(x) < λ1 f(x1) + λ2 f(x2) = λ1 z + λ2 z = z,
that is, x is better than the optimal solution; a contradiction.
Notice that a convex optimization problem need not possess an optimal solution. Consider, for example,
minx∈R ex . This situation may happen even if the feasible set is compact:
Example 4.3. Consider the function f : [1, 2] → R defined by
f(x) = x for 1 < x ≤ 2,  f(1) = 2.
This function is convex, but not continuous, and the minimum on [1, 2] is not attained.
Theorem 4.4. Let ∅ ≠ M ⊆ Rn be an open convex set and f : M → R a convex differentiable function on M. Then x∗ ∈ M is an optimal solution if and only if ∇f(x∗) = o.
Proof. “⇒” Let x∗ ∈ M be an optimal solution. Then it is a local minimum, too, and according to
Theorem 2.1 we have ∇f (x∗ ) = o.
“⇐” Let ∇f (x∗ ) = o. By Theorem 3.18 we have f (x) − f (x∗ ) ≥ ∇f (x∗ )T (x − x∗ ) = 0 for any x ∈ M .
Therefore f (x) ≥ f (x∗ ) and x∗ is an optimal solution.
We cannot remove the assumption that M is open. For instance, for the problem minx∈[1,2] x we have
M = [1, 2] convex and the objective function f (x) = x is differentiable on R, but its derivative at the
optimal point x∗ = 1 is f ′ (1) = 1.
We can generalize the theorem as follows.
Theorem 4.5. Let ∅ ≠ M ⊆ Rn be a convex set and f : M′ → R a convex function differentiable on an open set M′ ⊇ M. Then x∗ ∈ M is an optimal solution if and only if ∇f(x∗)T (y − x∗) ≥ 0 for every y ∈ M.
Proof. “⇒” Suppose to the contrary that there is y ∈ M such that ∇f (x∗ )T (y − x∗ ) < 0. Consider the
convex combination xλ = λy + (1 − λ)x∗ = x∗ + λ(y − x∗ ) ∈ M . Then
0 > ∇f(x∗)T (y − x∗) = lim_{λ→0+} [f(x∗ + λ(y − x∗)) − f(x∗)]/λ = lim_{λ→0+} [f(xλ) − f(x∗)]/λ.
Hence f (xλ ) < f (x∗ ) for a sufficiently small λ > 0; a contradiction.
“⇐” By Theorem 3.18, for every y ∈ M we have f (y) − f (x∗ ) ≥ ∇f (x∗ )T (y − x∗ ) ≥ 0. Therefore
f (y) ≥ f (x∗ ), and x∗ is an optimal solution.
The condition from Theorem 4.5 is particularly satisfied if ∇f (x∗ ) = o. This means that each stationary
point is a global minimum.
Example 4.6. The first problem in Example 4.1 reads
min x1 + x2 subject to x21 + x22 ≤ 2.
Obviously, the optimum is x∗ = (−1, −1)T. We can verify it by means of Theorem 4.5. First, compute ∇f(x∗) = (1, 1)T. Now, we have to show that for each feasible y we have
∇f(x∗)T (y − x∗) = (1, 1) · (y1 + 1, y2 + 1)T = y1 + y2 + 2 ≥ 0,
or y1 + y2 ≥ −2. This is true: by the Cauchy–Schwarz inequality, y1² + y2² ≤ 2 implies |y1 + y2| ≤ √2 · ‖y‖₂ ≤ 2.
The second problem in Example 4.1 reads
min x21 + x22 + 2x2 subject to x21 + x22 ≤ 2.
We compute ∇f(x) = (2x1, 2x2 + 2)T, and this gradient vanishes at the point x∗ = (0, −1)T. Since this point satisfies the constraint, it is the optimum.
Example 4.7 (Rating system). Many methods have been developed to provide ratings of sport teams or other entities. Here we present the following method [Langville and Meyer, 2012]. Consider n teams that we want to rate by numbers r1, . . . , rn ∈ R. Let A ∈ Rn×n be a known scoring matrix, where aij gives the scoring of team i against team j. This matrix is skew-symmetric, that is, A = −AT, since aii = 0 and aij = −aji. The rating vector r = (r1, . . . , rn)T should reflect the scorings, so ideally we have aij = ri − rj, or in matrix form A = reT − erT. This is hardly satisfied in practice, but we aim to find the best approximation, which leads to the optimization formulation
min f(r) := ‖A − (reT − erT)‖²_F.
Since the Hessian is positive semidefinite, the function f(r) is convex. The optimality condition ∇f(r) = 0 yields the system of linear equations
(nIn − eeT) r = Ae.
The matrix has rank n − 1, and so the solution set is the line r = (1/n)Ae + αe, α ∈ R. The function f(r) is constant on this line, so the whole line is the optimal solution set. In practice, we usually normalize the rating vector such that eTr = 0. Since eTAe = 0, we obtain the resulting formula for the rating vector r = (1/n)Ae.
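A few lines suffice to compute the rating vector; the scoring matrix below is made up for illustration.

```python
import numpy as np

# The rating formula r = (1/n) A e from Example 4.7 on a toy
# skew-symmetric scoring matrix for n = 3 teams.
A = np.array([[ 0.,  2., -1.],
              [-2.,  0.,  3.],
              [ 1., -3.,  0.]])       # a_ij = scoring of team i against team j
n = A.shape[0]
r = A @ np.ones(n) / n                # normalized so that e^T r = 0
print("ratings:", r, " sum:", r.sum())
```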
Naturally, for special problems in convex optimization we can derive special properties. In the following
sections we will discuss several particular classes of convex optimization problems.
4.2 Quadratic programming
In quadratic programming, we minimize a quadratic function subject to linear constraints:
min xT Cx + dT x subject to x ∈ M,
where C ∈ Rn×n is symmetric, d ∈ Rn, and M is a convex polyhedron. The problem is convex, and thus efficiently solvable, if C is positive semidefinite. In contrast, maximizing a convex quadratic function is hard:
Theorem 4.8. The problem maxx∈M xT Cx is NP-hard even when C is positive definite.
Proof. We will construct a reduction from the NP-complete problem Set-Partitioning: Given a set of numbers α1, . . . , αn ∈ N, can we group them into two subsets such that the sums of the numbers in both subsets are the same? Equivalently, is there x ∈ {±1}n such that Σ_{i=1}^n αi xi = 0? This problem can be formulated as follows
max Σ_{i=1}^n xi² subject to Σ_{i=1}^n αi xi = 0, x ∈ [−1, 1]n.
The optimal value of this problem is n if and only if Set-Partitioning is solvable. This optimization
problem follows the template since the constraints are linear and the objective function has the form of
xT Cx + dT x for C = In and d = o.
Example 4.9 (Portfolio selection problem). This is a textbook example of an application of convex quadratic programming. The pioneer in this area was Harry Markowitz, a Nobel Prize winner in Economics in 1990, awarded for his results from 1952.
The problem is formulated as follows: a capital K is to be invested in n investments. The return of investment i is ci. The mathematical formulation of the portfolio selection problem is the linear program
max cT x subject to eT x = K, x ≥ o.
The returns of investments are usually not known exactly and they are modelled as random quantities.
Suppose that the vector c is random, its expected value is c̃ := E c and the covariance matrix is Σ :=
cov c = E (c − c̃)(c − c̃)T , which is positive semidefinite (Proof: for every x ∈ Rn we have xT Σx =
xT (E (c − c̃)(c − c̃)T )x = E xT (c − c̃)(c − c̃)T x = E ((c − c̃)T x)2 ≥ 0). For a real vector x ∈ Rn , the expected
value of the objective function value cT x is E (cT x) = c̃T x, and the variance of cT x is var(cT x) = xT Σx.
Maximizing the expected value of the reward leads to the linear programming problem
max c̃T x subject to eT x = K, x ≥ o.
Taking into account the risks of investments, we model the problem as a convex quadratic program
max c̃T x − γ xT Σx subject to eT x = K, x ≥ o,
where the parameter γ > 0 expresses the trade-off between the expected reward and the risk.
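A sketch of this quadratic program, assuming the cvxpy modelling library; the data c̃, Σ, K and γ are illustrative.

```python
import numpy as np
import cvxpy as cp

# Markowitz portfolio model with made-up expected returns and covariance.
c_tilde = np.array([0.10, 0.08, 0.03])
Sigma = np.array([[0.040, 0.006, 0.000],
                  [0.006, 0.010, 0.000],
                  [0.000, 0.000, 0.001]])   # positive semidefinite
K, gamma = 1.0, 5.0

x = cp.Variable(3)
prob = cp.Problem(cp.Maximize(c_tilde @ x - gamma * cp.quad_form(x, Sigma)),
                  [cp.sum(x) == K, x >= 0])
prob.solve()
print("portfolio:", x.value)
```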
Example 4.10 (Quadrocopter trajectory planning). We need to plan a trajectory for a quadrocopter fleet such that collisions are avoided and the fleet is transferred from an initial state to a terminal state with minimum effort. In our model, time is discretized into time slots of length h. The variables are the position pi(k), velocity vi(k) and acceleration ai(k) of quadrocopter i in time step k. The constraints are:
• physical constraints: the relations between velocity and acceleration, position and velocity, . . . (e.g., vi(k) = vi(k − 1) + h · ai(k − 1), pi(k) = pi(k − 1) + h · vi(k − 1), . . . ),
• restrictions on the maximum velocity, acceleration and jerk (i.e., the derivative of acceleration),
• the initial and terminal state (positions etc.),
• the collision avoidance constraint is nonlinear (‖pi(k) − pj(k)‖₂ ≥ r ∀i ≠ j), so we have to linearize it.
The objective function is given by the sum of norms of accelerations over the particular time steps (Σ_{i,k} ‖ai(k) + g‖₂²). For more details see:
• https://www.youtube.com/watch?v=wwK7WvvUvlI
• F. Augugliaro, A.P. Schoellig, and R. D'Andrea, Generation of collision-free trajectories for a quadrocopter fleet: A sequential convex programming approach, IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 1917–1922.
The practical importance of this problem is underlined by the fact that collision free planning is a very
topical research problem in air traffic control of airports.
A special case arises when the conic constraint involves positive semidefinite matrices:
min cT x subject to x1A1 + · · · + xnAn ⪰ B,
where A1, . . . , An, B are symmetric matrices. Such problems are called semidefinite programs; see Section 4.3.3.
Now the question is which relation should replace y ≥ 0 in the case of convex cone programming (4.1). It is not hard to see that neither y ≥ 0 nor y ≥K 0 works well. In fact, we are interested in such y for which yT a ≥ 0 for every a ≥K 0. Obviously, the set of such y forms a cone – this cone is called the dual cone of K.
Definition 4.15. Let K ⊆ Rn be a cone. Then its dual cone is the cone
K∗ = {y ∈ Rn ; y T a ≥ 0 ∀a ∈ K}.
By using the dual cone, we formulate the dual problem to (4.1) as follows:
max bT y subject to AT y = c, y ≥K∗ 0. (4.3)
If x is feasible for the primal problem and y for the dual problem, then
cT x = yT Ax ≥ yT b.
In other words, the objective value of each feasible solution of the primal problem is an upper bound on every objective value of the dual problem. Therefore the inequality holds true even for the extremal values.
Now we state some basic properties of dual cones. Some of them are illustrated on Figure 4.1. For
instance, Proposition 4.18(4) is illustrated by Figures 4.1a and 4.1c.
Example 4.17.
• The nonnegative orthant is self-dual, that is, (Rn+)∗ = Rn+ (see Figure 4.1a).
• The Lorentz cone is self-dual as well, L∗ = L (see Figure 4.1b).
• The cone of positive semidefinite matrices is also self-dual; herein, the scalar product of positive semidefinite matrices A, B is defined by ⟨A, B⟩ := tr(AB) = Σ_{i,j} aij bij.
More generally, a problem may combine linear and conic constraints:
min cT x subject to Ax ≥ b, Bx ≥K d.
(a) Nonnegative orthant Rn+ and its dual (Rn+)∗ for n = 2. (b) Lorentz cone L and its dual L∗. (c) A cone K and its dual K∗ in R². (d) Generalized Lorentz cone L and its dual L∗.
Figure 4.1: Cones and their duals (for the sake of better visibility the dual cones are multiplied by −1,
i.e., rotated around the origin).
Theorem 4.19 (Strong duality). The primal and dual optimal values are the same provided at least one
of the following conditions holds
(1) the primal problem is strictly feasible, that is, there is x such that Ax >K b,
(2) the dual problem is strictly feasible, that is, there is y >K∗ 0 such that AT y = c.
Proof. We present the basic idea of the proof of (1) without technical details; in view of duality the point
(2) is analogous.
Let f∗ be the optimal value and assume that c ≠ 0 (otherwise f∗ = 0 and we have strong duality with y = 0). Define the set
M := {y = Ax − b; x ∈ Rn , cT x ≤ f ∗ }.
It is easy to see that M ∩ int(K) = ∅. Otherwise there is x such that Ax >K b and cT x ≤ f ∗ , so by a
small change of x in the direction of −c we obtain a super-optimal value.
Since both M and K are convex sets, we can separate them by a hyperplane λT y = 0 (the zero right-
hand side follows from the fact that K is a cone). Since K lies in the positive halfspace, we have λT y ≥ 0
for every y ∈ K, whence λ ∈ K∗ . Since M lies in the negative halfspace, we have λT y ≤ 0 for every y ∈ M;
so λT Ax ≤ λT b for every x such that cT x ≤ f ∗ . This can happen only if the normal vectors AT λ and c
are linearly dependent, that is, AT λ = µc for µ ≥ 0.
Figure 4.2: (Example 4.20) A second order cone program, for which the optimal value is not attained.
Observe that µ > 0. Otherwise, if µ = 0, then ATλ = 0 and also λT b ≥ 0. Due to strict feasibility of the primal problem there is x̃ such that Ax̃ >K b. Since λ ≥K∗ 0, λ ≠ 0, we get by premultiplication that λT(Ax̃ − b) > 0, i.e., λT b < 0; a contradiction.
By normalizing µ = 1 we obtain ATλ = c. This yields a dual feasible solution λ since it satisfies ATλ = c, λ ≥K∗ 0. Moreover, we know that λT b ≥ λT Ax = cT x for every x such that cT x ≤ f∗ (including points with cT x arbitrarily close to f∗), whence λT b ≥ f∗. Together with weak duality, this shows that the dual optimal value equals f∗.
Notice that even when strong duality holds and both primal and dual optimal values are (the same
and) finite, it may happen that the optimal value is not attained (formally, we should write “inf” instead
of “min”). The following example illustrates this situation.
Example 4.20. Even though the problem is strictly feasible, the optimal value is not attained; see Figure 4.2.
The next example illustrates the situation when the assumption of Theorem 4.19 as well as strong
duality are not satisfied.
Example 4.21. Consider the conic program
min x2 subject to (x1, x2, x1)T ≥L 0.
The constraint means x1 ≥ √(x1² + x2²), which holds if and only if x2 = 0 and x1 ≥ 0. We express it equivalently as
min x2 subject to x2 = 0, x1 ≥ 0.
We can see that the optimal value is 0 and each feasible solution is optimal, that is, the optimal solution set consists of the nonnegative part of the first axis.
To construct the dual problem, we rewrite the primal program into the canonical form; the dual then reads
max 0 subject to y1 + y3 = 0, y2 = 1, y ≥L 0.
The inequality y ≥L 0 takes the form y3 ≥ √(y1² + y2²), which together with y1 + y3 = 0 forces y2 = 0, contradicting y2 = 1. Hence the dual problem is infeasible, even though the primal problem has a finite optimal value.
We express
(B | d) = ( D   f
            pT  q ),
so the condition Bx ≥L d takes the form ‖Dx − f‖₂ ≤ pT x − q. Thus we have an explicit description of problem (4.4):
min cT x subject to ‖Dx − f‖₂ ≤ pT x − q. (4.5)
Recall that the problem is not easily transformable to a convex quadratic problem, even when allowing
convex quadratic constraints. Thus we have a new class of optimization problems, which are efficiently
solvable and contain many interesting problems. Actually, a lot of functions and nonlinear conditions can
be expressed in the form of (4.5).
A semidefinite program in the basic form reads
min cT x subject to Σ_{k=1}^n xk A(k) ⪰ B, (4.7)
where c ∈ Rn, the matrices A(1), . . . , A(n), B ∈ Rm×m are symmetric, and the relation A ⪰ B means that A − B is positive semidefinite.1)
How to construct the dual problem? According to (4.3), the dual problem has m² variables, so they constitute a matrix of variables Y ∈ Rm×m. The dual objective function is Σ_{i,j} bij yij = ⟨B, Y⟩, the equations have the form Σ_{i,j} a(k)ij yij = ck, and the condition Y ≥K∗ 0 takes the form Y ⪰ 0. In total, the dual problem reads
max ⟨B, Y⟩ subject to ⟨A(k), Y⟩ = ck, k = 1, . . . , n, Y ⪰ 0.
• Second order cone constraints. They can be expressed as semidefinite constraints. Basically, it is sufficient to show it for the condition ‖x‖₂ ≤ z; the others can be handled by a linear transformation. We have
‖x‖₂ ≤ z ⇔ ( z·In  x
             xT    z ) ⪰ 0. (4.8)
Proof. For z = 0 the equivalence holds trivially, so we assume z > 0. We consider the matrix as the matrix of a quadratic form and transform it to a block diagonal matrix by using elementary row & column transformations. Subtracting the (1/z)xT-multiple of the first block row from the second one, and applying the same for the columns, we get
( z·In  x          ( z·In  0
  xT    z )   ∼      0     z − (1/z)xTx ).
This matrix is positive semidefinite if and only if z > 0 and xTx ≤ z², or, after taking the square root, ‖x‖₂ ≤ z. (A numerical sanity check of this equivalence is sketched after this list.)
• Eigenvalues. Many conditions on eigenvalues can be expressed as semidefinite programs. For instance, the largest eigenvalue λmax of a symmetric matrix A ∈ Rn×n satisfies
λmax(A) = min z subject to zIn ⪰ A.
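The following sketch checks the equivalence (4.8) numerically on a concrete vector (an illustrative test, not a proof):

```python
import numpy as np

# ||x||_2 <= z  iff  the block matrix [[z*I, x], [x^T, z]] is PSD, cf. (4.8).
def block_psd(x, z):
    n = len(x)
    M = np.block([[z * np.eye(n), x[:, None]],
                  [x[None, :], np.array([[z]])]])
    return np.linalg.eigvalsh(M).min() >= -1e-10

x = np.array([3.0, 4.0])            # ||x||_2 = 5
print(block_psd(x, 5.0))            # True  (boundary case)
print(block_psd(x, 4.9))            # False
```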
Example 4.24 (The risk of a portfolio). Recall the portfolio selection problem from Example 4.9,
max cT x subject to eT x = K, x ≥ o,
where c is a random vector with the expected value c̃ := E c and the covariance matrix Σ := cov c =
E (c − c̃)(c − c̃)T . Assume that a portfolio x̃ is chosen, but for the covariance matrix we know only an
interval estimation Σ1 ≤ Σ ≤ Σ2 (understood entrywise). What is the risk of the portfolio x̃? The risk is given by the variance of the reward cT x̃, which is equal to x̃T Σ x̃. Thus the largest possible variance is computed by the semidefinite program
max x̃T Σ x̃ subject to Σ1 ≤ Σ ≤ Σ2, Σ ⪰ 0,
in the matrix variable Σ.
The objective function is linear in variable Σ, and the constraints are easily transformed to the basic form
(4.7) by means of Example 4.23.
1) The relation ⪰ defines a partial order, known also as the Löwner order. Karel Löwner was an American mathematician of Czech origin (born near Prague).
• The feasible set M shouldn’t be too flat or too large. There must exist “reasonably” large numbers
r, R > 0 such that M contains a ball of radius r and also M lies in the ball {x; kxk2 ≤ R}.
Pn (k)
min z subject to k=1 xk A + zIm B, z ≥ 0.
Example 4.25. In some cases, the ellipsoid method provides a polynomial algorithm for problems with exponentially many or even infinitely many constraints. For example, let M be the unit ball described by its tangent hyperplanes, that is,
M = {x ∈ Rn ; aT x ≤ 1 ∀a : ‖a‖₂ = 1}.
To check whether a given point x∗ ∈ Rn belongs to the set M, we do not need to process all the infinitely many inequalities. It is sufficient to check the possibly violated constraint, which is the one with a = x∗/‖x∗‖₂ (for x∗ ≠ o).
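This separation oracle is a few lines of code; a sketch (the function name is illustrative):

```python
import numpy as np

# Separation oracle from Example 4.25: given x*, either certify x* in M
# (the unit ball) or return a violated inequality a^T x <= 1.
def separation_oracle(x_star):
    norm = np.linalg.norm(x_star)
    if norm <= 1:
        return None                     # x* belongs to M
    a = x_star / norm                   # the most violated constraint
    return a                            # a^T x_star = ||x_star||_2 > 1

print(separation_oracle(np.array([0.3, 0.4])))   # None: inside the ball
print(separation_oracle(np.array([3.0, 4.0])))   # [0.6, 0.8]
```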
Denote by
C := {A ∈ Rn×n symmetric ; xT Ax ≥ 0 ∀x ≥ 0}
the cone of copositive matrices and by
C∗ := conv{xxT ; x ≥ 0}
its dual cone of completely positive matrices. Obviously, the cone C covers both nonnegative symmetric matrices and positive semidefinite matrices, but it contains other matrices, too. Similarly, the matrices in C∗ are nonnegative and positive semidefinite, but not each such matrix belongs to C∗. Notice that even to decide whether a given matrix is copositive is a co-NP-complete problem [Murty and Kabadi, 1987]. Checking complete positivity of a matrix is NP-hard [Dickinson and Gijben, 2014], but whether the problem lies in NP is not known yet.
Consider a copositive program [Dür, 2010]: the objective function is linear in the matrix variable X, and the equality constraints are linear, too. The only nonlinear constraint is X ∈ C, which makes the problem a convex conic program. Consider also a convex program with the complete positivity condition X ∈ C∗ on the matrix X. Both problems are convex, but NP-hard. We prove it for the latter.
The proof is based on a reduction from the maximum independent set problem. Let G = (V, E) be a graph with n vertices and let α denote the size of a maximum independent set in graph G, that is, the size of a maximum set I ⊆ V such that i, j ∈ I ⇒ {i, j} ∉ E.
Let x∗ denote an optimal solution of (4.11) and let α(x∗) be the number of its positive entries. The support of the vector x∗ then corresponds to an independent set of size α(x∗). Denote by x̃ ∈ Rα(x∗) the restriction of the vector x∗ to its positive entries, so the zero entries are removed. Then the optimal value h of problem (4.11) can be expressed as the maximum of (eT x̃)² over the correspondingly restricted feasible set,
since tr(eeT X) = tr(eeT xxT) = tr(eT x xT e) = (eT x)². It is not hard to see that the optimal solution of this problem is a vector with identical entries, that is, x̃∗ = α(x∗)^{−1/2} e. Hence h = (eT x̃∗)² = (α(x∗)^{−1/2} α(x∗))² = α(x∗). That is why h equals the size of the maximum independent set in graph G.
4.5 Applications
4.5.1 Robust PCA
Let A ∈ Rm×n be a matrix representing certain data. The problem is to determine some essential infor-
mation hidden in the data. For example, if the matrix represents a picture, then we may want to recognize
some pattern (e.g., a face) or to perform some operations such as reconstruction of a damaged picture.
To this end the SVD decomposition of A may serve well, however, for some purposes it is not sufficient.
We will formulate the problem as the so called robust PCA (principal component analysis):
→ Decompose A = L + S such that L has low rank and S is sparse.
Then L represents the fundamental information in the data and S can be interpreted as noise. This problem is rather vaguely defined, and that is why we consider the (approximate) optimization problem
formulation
min ‖L‖∗ + ‖S‖ℓ1 subject to A = L + S, (4.12)
where ‖S‖ℓ1 is the entrywise sum norm defined as
‖S‖ℓ1 := Σ_{i,j} |sij|,
and ‖L‖∗ the nuclear norm defined as the sum of the singular values, that is,
‖L‖∗ := Σ_i σi(L).
Notice that the nuclear norm is a good approximation of the matrix rank since it is the best convex
underestimator of the rank on a unit ball. Similarly, the entrywise sum norm is a good approximation of
matrix sparsity.
Problem (4.12) is a convex optimization problem since a norm is always convex. Hence the problem is
effectively solvable even though the best algorithms used are not so easy to describe by simple means.
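A minimal sketch of problem (4.12), assuming the cvxpy modelling library; the synthetic data (a rank-one matrix plus sparse spikes) are made up for illustration.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
L_true = np.outer(rng.normal(size=20), rng.normal(size=20))     # low-rank part
S_true = np.zeros((20, 20))
S_true[rng.integers(0, 20, 10), rng.integers(0, 20, 10)] = 5.0  # sparse part
A = L_true + S_true

L = cp.Variable((20, 20))
S = cp.Variable((20, 20))
# nuclear norm + entrywise sum of absolute values, as in (4.12)
prob = cp.Problem(cp.Minimize(cp.normNuc(L) + cp.sum(cp.abs(S))),
                  [L + S == A])
prob.solve()
print("rank of recovered L:", np.linalg.matrix_rank(L.value, tol=1e-6))
```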
Chapter 5
Karush–Kuhn–Tucker optimality conditions
In this chapter, we consider the optimization problem
min f(x) subject to x ∈ M,
where f : Rn → R is a differentiable function and the feasible set M ⊆ Rn is described by the system
gj (x) ≤ 0, j = 1, . . . , J,
hℓ (x) = 0, ℓ = 1, . . . , L,
Equality constraints
Consider for a while an equality constrained problem
min f(x) subject to hℓ(x) = 0, ℓ = 1, . . . , L.
Let x∗ be a feasible point. When is x∗ optimal? First we discuss the case when the constraints are linear.
Proposition 5.1. Let x∗ be a local minimum of the problem min f(x) subject to Ax = b. Then ∇f(x∗) ∈ R(A).
Proof. The feasible set is the solution set of the system Ax = b, so it is an affine subspace x∗ + Ker(A).
Let B be a matrix whose columns form a basis of Ker(A). Then the feasible set can be expressed as x = x∗ + Bv, v ∈ Rk. Substituting for x, we obtain the unconstrained optimization problem
min_{v∈Rk} f(x∗ + Bv).
By Theorem 2.1, the necessary condition for local optimality of v = 0 is a zero gradient, that is, ∇f(x∗)TB = 0T. In other words, ∇f(x∗) ∈ Ker(A)⊥ = R(A).
Now, the idea is based on linearization of possibly nonlinear functions hℓ . The equation hℓ (x) = 0 will
be replaced by the tangent hyperplane of the corresponding manifold at point x∗ :
∇hℓ (x∗ )T (x − x∗ ) = 0,
Figure 5.1: (a) Degenerate case: the intersection of the curves h1(x) = 0 and h2(x) = 0 is a point, but the intersection of the tangent lines is a line. (b) Regular case: the intersection of the curves is a point, as well as the intersection of the tangent lines.
so that the linearized constraints can be expressed as A(x − x∗) = 0, where the rows of A are the gradients ∇hℓ(x∗)T. In order that x∗ is optimal, the objective function gradient ∇f(x∗) must be perpendicular to the intersection of the tangent hyperplanes; in other words, ∇f(x∗) must be a linear combination of the gradients ∇hℓ(x∗) of the tangent hyperplanes. According to Proposition 5.1 we have ∇f(x∗) ∈ R(A). This leads to the condition
∇f(x∗) + Σ_{ℓ=1}^L ∇hℓ(x∗) µℓ = 0.
As illustrated by Figure 5.1, this idea can be wrong since a degenerate situation may appear as depicted
on the figure. Thus we need to avoid such a degenerate case. This can be achieved by the assumption on
linear independence of gradients ∇hℓ (x∗ ).
Theorem 5.2. Let ∇hℓ (x∗ ), ℓ = 1, . . . , L, be linearly independent. If x∗ is a local optimum, then there is
µ ∈ RL such that
∇f (x∗ ) + ∇h(x∗ )µ = 0.
Coefficients µ1 , . . . , µL are called Lagrange multipliers. The condition stated in the theorem is a nec-
essary condition. This is convenient for us since we can restrict the feasible set to a much smaller set of
candidates for optima – ideally the candidate is unique.
Theorem 5.3 (KKT conditions). Denote by I(x∗) := {j ; gj(x∗) = 0} the set of active constraints at x∗, and let ∇hℓ(x∗), ℓ = 1, . . . , L, ∇gj(x∗), j ∈ I(x∗), be linearly independent. If x∗ is a local optimum, then there exist λ ∈ RJ, λ ≥ 0, and µ ∈ RL such that
∇f(x∗) + ∇h(x∗)µ + ∇g(x∗)λ = 0, (5.2)
λj gj(x∗) = 0, j = 1, . . . , J. (5.3)
Remark. Condition (5.3) is called the complementarity condition since it says that for every j = 1, . . . , J we have λj = 0 or gj(x∗) = 0. If gj(x∗) < 0, then λj = 0, and hence the variable λj does not act in the KKT conditions; this corresponds to the situation that x∗ does not lie on the border of the set described by this constraint. Conversely, if gj(x∗) = 0, then complementarity places no restriction on λj. In summary, we can say that the complementarity condition restricts attention to the Lagrange multipliers λj of the active constraints only.
Proof. (Main idea.) We linearize the problem: the objective function and the constraint functions are replaced by their tangent hyperplanes at the point x∗. This results in the linear programming problem
min ∇f(x∗)T x subject to ∇hℓ(x∗)T (x − x∗) = 0, ℓ = 1, . . . , L, ∇gj(x∗)T (x − x∗) ≤ 0, j ∈ I(x∗).
Due to the linear independence assumption, the solution x∗ remains optimal (this is a small step for a reader, but a giant leap in the proof). The dual problem to the linear program is
max Σ_{ℓ=1}^L ∇hℓ(x∗)T x∗ µℓ + Σ_{j∈I(x∗)} ∇gj(x∗)T x∗ λj subject to
∇f(x∗) + Σ_{ℓ=1}^L ∇hℓ(x∗) µℓ + Σ_{j∈I(x∗)} ∇gj(x∗) λj = 0,
λj ≥ 0, j ∈ I(x∗).
Since the primal problem has an optimum, the dual problem must be feasible. For j ∉ I(x∗) define λj := 0, and we have that also the system
∇f(x∗) + ∇h(x∗)µ + ∇g(x∗)λ = 0, λ ≥ 0
is feasible. Hence there exist λ ≥ 0, µ satisfying (5.2). Condition (5.3) is fulfilled since for j ∈ I(x∗) we have gj(x∗) = 0 by definition, and for j ∉ I(x∗) we put λj = 0.
Conditions (5.2)–(5.3) are called Karush–Kuhn–Tucker conditions [Karush, 1939; Kuhn and Tucker,
1951], or KKT conditions in short.
Since the linear independence assumption is hard to check in general (notice that x∗ is unknown), alternative assumptions were derived, too. Usually, they are easier to verify, but at the price of being stronger. One commonly used assumption is Slater's condition: there exists x0 ∈ M such that g(x0) < 0.
Theorem 5.4. Consider the problem
min f(x) subject to g(x) ≤ 0, x ∈ M,
where f(x), gj(x) are convex functions and M is a convex set. Suppose that Slater's condition is satisfied. If x∗ is an optimum of the above problem, then there exists λ ≥ 0 such that λT g(x∗) = 0 and x∗ is an optimum of the problem
min f(x) + λT g(x) subject to x ∈ M. (5.4)
Proof. Define the sets
A := {(r, s) ∈ RJ × R ; ∃x ∈ M : g(x) ≤ r, f(x) ≤ s},
B := {(r, s) ∈ RJ × R ; r ≤ o, s ≤ f(x∗)};
see Figure 5.2. Both sets are convex, and their interiors are disjoint, since otherwise there is a point x ∈ M such that g(x) < 0 and f(x) < f(x∗). Therefore a separating hyperplane exists, having the form λT r + λ0 s = c, where (λ, λ0) ≠ 0. The separability implies:
∀(r, s) ∈ A : λT r + λ0 s ≥ c,
∀(r, s) ∈ B : λT r + λ0 s ≤ c.
Figure 5.2: The sets A and B in the (r, s)-plane and the separating hyperplane λT r + λ0 s = c.
Since (0, f (x∗ )) ∈ A∩B, this point lies on the hyperplane, and thus c = λ0 f (x∗ ). Analogously (g(x∗ ), f (x∗ )) ∈
A ∩ B, so this point also lies on the hyperplane, yielding
λT g(x∗ ) + λ0 f (x∗ ) = c = λ0 f (x∗ ),
which gives the complementarity constraint λT g(x∗ ) = 0.
For every i we have (−ei , f (x∗ )) ∈ B, so this point lies in the negative halfspace. This means that
−λT ei + λ0 f (x∗ ) ≤ c, from which λi ≥ 0. Therefore λ ≥ 0. Analogously we deduce λ0 ≥ 0: Since
(o, f (x∗ ) − 1) ∈ B, so λT o + λ0 (f (x∗ ) − 1) ≤ c, and hence λ0 ≥ 0.
Since g(x0 ) < 0, we have (r, f (x0 )) ∈ A for every r in the neighbourhood of 0. Hence the separating
hyperplane cannot be vertical, which means λ0 6= 0. Without loss of generality we normalize it such that
λ0 = 1. Let us prove it formally: Suppose to the contrary that λ0 = 0. Now, c = 0 and in view of λ 6= 0
there is i such that λi > 0. Substituting the point (−εei , f (x0 )) ∈ A, where ε > 0 is small enough, into
the inequality, we get −ελi ≥ 0; a contradiction.
For every x ∈ M we have (g(x), f (x)) ∈ A, which fulfills
λT g(x) + f (x) ≥ c = λT g(x∗ ) + f (x∗ ).
This proves that x∗ is the optimum of (5.4).
Applying the optimality conditions from Theorem 2.1 to problem (5.4), we obtain the KKT conditions
as a corollary:
Corollary 5.5. Suppose that Slater’s condition is satisfied for the convex optimization problem
min f (x) subject to g(x) ≤ 0.
If x∗ is an optimum, then there exists λ ≥ 0 such that the KKT conditions are satisfied, i.e.,
∇f(x∗) + ∇g(x∗)λ = 0, (5.5a)
λT g(x∗) = 0. (5.5b)
We obtain also a general form involving equality constraints.
Corollary 5.6. Suppose that Slater’s condition is satisfied for the convex optimization problem
min f (x) subject to g(x) ≤ 0, Ax = b.
If x∗ is an optimum, then there exist λ ≥ 0 and µ such that the KKT conditions are satisfied, i.e.,
∇f (x∗ ) + ∇g(x∗ )λ + AT µ = 0,
λT g(x∗ ) = 0.
Figure 5.3: (Example 5.7) Slater’s condition is not satisfied and KKT conditions property (Corollary 5.5)
fails.
Example 5.7. If Slater’s condition is not satisfied, then the KKT conditions property (Corollary 5.5) can
fail. Consider an optimization problem minx∈M x1 illustrated in Figure 5.3. Two constraints describe the
feasible set having the form of a half-line starting from point x∗ . Point x∗ is optimal. The KKT conditions
read −∇f (x∗ ) = ∇g(x∗ )λ, but the point x∗ does not fulfill them since the gradients ∇g1 (x∗ ) = (0, −1)T
and ∇g2 (x∗ ) = (0, 1)T span a vertical line, not containing the opposite of the objective function gradient
−∇f (x∗ ) = (−1, 0)T .
In optimization, necessary optimality conditions are usually preferred to sufficient optimality conditions
since they often help to restrict the feasible set to a smaller set of candidate optimal solutions. Anyway,
sufficient optimality conditions are also of interest, and below we show that the KKT conditions do this
job under general assumptions.
Theorem 5.8. Let x∗ be a feasible point of the problem min f(x) subject to g(x) ≤ 0, let f(x) be a convex function, and let gj(x), j ∈ I(x∗), be convex functions, too. If the KKT conditions (5.5) are satisfied for x∗ with certain λ ≥ 0, then x∗ is an optimal solution.
Proof. Convexity of the function f(x) implies f(x) − f(x∗) ≥ ∇f(x∗)T (x − x∗) due to Theorem 3.18. Analogously, for the functions gj(x), j ∈ I(x∗), we have gj(x) − gj(x∗) ≥ ∇gj(x∗)T (x − x∗). The KKT conditions give ∇f(x∗) = −∇g(x∗)λ, from which, premultiplying by (x − x∗) and using λ ≥ 0, complementarity and g(x) ≤ 0, we get for every feasible x
f(x) − f(x∗) ≥ ∇f(x∗)T (x − x∗) = −λT ∇g(x∗)T (x − x∗) ≥ −λT (g(x) − g(x∗)) = −λT g(x) ≥ 0.
Hence x∗ is an optimal solution.
Chapter 6
Methods
To solve an optimization problem is a very difficult task in general; indeed, it is undecidable (provably, there cannot exist an algorithm)! Thus we can hardly hope to solve every problem optimally. Many algorithms therefore produce approximate solutions only – KKT points, local optima etc. If the problem is large and hard, then we often use heuristic methods (genetic and evolutionary algorithms, simulated annealing, tabu search, . . . ). On the other hand, many hard optimization problems can be solved by using
global optimization techniques. However, they work in small dimensions only since their computational
complexity is high. The choice of a suitable method thus depends not only on the type of the problem,
but also on the dimensions, time restrictions etc.
In the following sections, we present selected methods for basic types of optimization problems.
Armijo rule
We assume that f(x) is differentiable and f′(0) < 0, so it locally decreases at x = 0. We want to decrease the objective function by moving to the right from the point x = 0. We wish to decrease it significantly, that is, not to get stuck locally close to x = 0, but to move away from this current point if possible.
Consider the condition
f(x) ≤ f(0) + εf′(0)x,
where ε ∈ (0, 1) is a given parameter; it ensures a significant decrease and prevents x from being too large. In addition, the condition
f(x) ≥ f(0) + ε′f′(0)x
must be satisfied for a certain parameter ε′ > ε (which ensures that x is not too small); see Figure 6.1b.
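In practice, the Armijo rule is often realized by backtracking: we start with a trial step and shrink it until the sufficient-decrease condition holds. A minimal Python sketch follows; the constants eps, step and shrink are illustrative choices, not values prescribed above.

    def armijo_backtracking(phi, dphi0, eps=0.25, step=1.0, shrink=0.5, max_iter=50):
        """Find x > 0 satisfying phi(x) <= phi(0) + eps * dphi0 * x.
        phi   -- the univariate function alpha -> f(x_k + alpha * d_k)
        dphi0 -- its derivative at 0, assumed negative (descent direction)
        """
        phi0 = phi(0.0)
        x = step
        for _ in range(max_iter):
            if phi(x) <= phi0 + eps * dphi0 * x:
                return x               # sufficient decrease achieved
            x *= shrink                # step was too large; shrink it
        return x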
Figure 6.1: (a) The Armijo rule – seeking the intersection point. (b) The Armijo rule as a termination condition.
Figure 6.2: Newton method: approximation of f(x) at the point xk by a quadratic function q(x), and the move to its minimum xk+1.
Newton method
This is the classical Newton method applied to finding a root of f′(x) = 0. Here we need f(x) to be twice differentiable.
The method is iterative and constructs a sequence of points x0 = 0, x1, x2, . . . that, under some assumptions, converges to a local minimum. The basic idea is to approximate the function f(x) by a function q(x) such that both have the same value and the same first and second derivatives at the current point xk (in the kth iteration). Thus we want q(xk) = f(xk), q′(xk) = f′(xk) and q′′(xk) = f′′(xk); see Figure 6.2. This suggests that q(x) should be a quadratic polynomial. Such a quadratic function is unique and it is described by
q(x) = f(xk) + f′(xk)(x − xk) + (1/2) f′′(xk)(x − xk)^2.
(Proof: put x := xk.) The minimum of the quadratic function q(x) is at the stationary point (where the derivative is zero), provided f′′(xk) > 0, so
0 = f′(xk) + f′′(xk)(x − xk),
whence the next point is xk+1 := xk − f′(xk)/f′′(xk).
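In Python, the iteration can be sketched as follows; the test function in the last line is an illustrative choice.

    def newton_1d(df, ddf, x0, tol=1e-12, max_iter=100):
        """Newton method for f'(x) = 0: x_{k+1} = x_k - f'(x_k)/f''(x_k)."""
        x = x0
        for _ in range(max_iter):
            step = df(x) / ddf(x)
            x -= step
            if abs(step) < tol:
                break
        return x

    # minimize f(x) = x - log(x): f'(x) = 1 - 1/x, f''(x) = 1/x**2, minimum at x = 1
    print(newton_1d(lambda x: 1 - 1/x, lambda x: 1/x**2, x0=0.5))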
Figure 6.3: In blue: the contours of the convex quadratic function f(x) = x1^2 + 4x2^2. In red: the iterations of the steepest descent method with initial point (5, −1)T.
Gradient methods 1)
In the kth iteration, the current point is xk. We determine a direction dk in which the objective function locally decreases, that is, ∇f(xk)T dk < 0. Then we call a line search method applied to the function ϕ(α) := f(xk + αdk). Denote by αk the output value. The next point is then set as xk+1 := xk + αk dk.
How to choose dk? The simplest way is the steepest descent method, which takes dk := −∇f(xk), that is, the direction in which the objective function locally decreases most rapidly. This choice need not be the best one; see Figure 6.3, which illustrates the slow convergence even for the simple convex quadratic function f(x) = x1^2 + 4x2^2. There are advanced methods that also take into account the Hessian ∇2f(xk) or its approximation, and that combine the steepest descent direction with the directions of the previous iteration(s); see also the conjugate gradient method in Section 6.4.
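The run of Figure 6.3 can be reproduced by the following sketch; for a quadratic objective the exact line search step is available in closed form.

    import numpy as np

    # Steepest descent on f(x) = x1^2 + 4*x2^2 = x^T Q x with Q = diag(1, 4).
    Q = np.diag([1.0, 4.0])
    x = np.array([5.0, -1.0])                 # initial point of Figure 6.3

    for k in range(100):
        g = 2.0 * Q @ x                       # gradient of f at x
        if np.linalg.norm(g) < 1e-10:
            break
        alpha = (g @ g) / (2.0 * g @ Q @ g)   # exact minimizer of f(x - alpha*g)
        x = x - alpha * g
    print(k, x)                               # converges to the origin, but slowly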
Example 6.1 (Learning of neural networks). Basically, the steepest descent method is used in the learning of artificial neural networks (for an introduction see Higham and Higham [2019]). The goal of the learning is to set up the weights of the inputs of particular neurons such that the neural network performs best on the training data. Mathematically speaking, the variables are the weights of the inputs of the neurons. The objective function that we minimize is the distance between the actual output vector and the ideal output vector. It is hard to find the optimal solution since this optimization problem is nonlinear, nonconvex and high-dimensional. That is why the problem is solved iteratively, and at each step the weights are refined by means of steepest descent. Computing the gradient of the objective function is also computationally
1) The history of gradient methods dates back to 1847, when A. L. Cauchy introduced a gradient-like method to solve the astronomical problem of calculating the orbit of a celestial body.
Figure 6.4: The Hessian ∇2 f (xk ) is not positive definite.
demanding, since the training data are usually large, so we simplify further and approximate the gradient by a partial value based on the gradient of a randomly chosen training sample. This approach is called stochastic gradient descent.
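As a minimal illustration of the principle (on plain least squares rather than a neural network), stochastic gradient descent may be sketched as follows; the data and the step length are artificial.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))                   # training inputs
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.01 * rng.normal(size=1000)    # noisy targets

    w = np.zeros(3)                                  # weights to be learned
    lr = 0.01                                        # step length
    for epoch in range(20):
        for i in rng.permutation(len(y)):            # one random sample at a time
            g = (X[i] @ w - y[i]) * X[i]             # gradient of (x_i^T w - y_i)^2 / 2
            w -= lr * g
    print(w)                                         # close to w_true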
Example 6.2. Optimization techniques are also used to solve problems that are not optimization problems in essence. Consider, for example, the problem of solving a system of linear equations Ax = b, where A is a positive definite matrix. Then the optimal solution of the convex quadratic program
min (1/2) xT Ax − bT x subject to x ∈ Rn
is the point A−1 b, the same as the solution of the equations, since at this point the gradient ∇f(x) = Ax − b of the objective function f(x) = (1/2) xT Ax − bT x vanishes. Thus we can solve linear equations by optimization techniques. This is indeed used in practice, in particular for large and sparse systems. There exist several ways to choose the vector dk in this context. For instance, the conjugate gradient method combines the gradient and the previous direction, that is, it takes a linear combination of ∇f(xk) and dk−1; see Section 6.4.
Newton method
This works in a similar fashion as in the univariate case. We approximate the objective function by a quadratic function, whose minimum becomes the current point of the subsequent iteration.
In step k, the current point is xk, and at this point we approximate f(x) using the Taylor expansion
f(x) ≈ f(xk) + ∇f(xk)T(x − xk) + (1/2) (x − xk)T ∇2f(xk)(x − xk).
This gives us a quadratic function. If its Hessian matrix ∇2f(xk) is positive definite, then its minimum is unique and it is the point with zero gradient. This leads us to the system
∇2f(xk)(x − xk) = −∇f(xk),
whose solution x = xk − (∇2f(xk))−1 ∇f(xk) is set as the current point xk+1 of the next iteration.
Comment. The expression y := (∇2 f (xk ))−1 ∇f (xk ) is evaluated by solving the system of linear equa-
tions ∇2 f (xk )y = ∇f (xk ), not by inverting the matrix.
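A sketch of one Newton iteration along these lines, with the Hessian system solved directly; the example objective is an illustrative quadratic, for which a single step reaches the minimum.

    import numpy as np

    def newton_step(grad, hess, x):
        """Solve hess(x) y = grad(x) and return x - y (no matrix inversion)."""
        y = np.linalg.solve(hess(x), grad(x))
        return x - y

    # Example: f(x) = (x1 - 1)^2 + 10*(x2 + 2)^2
    grad = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
    hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
    print(newton_step(grad, hess, np.array([5.0, 5.0])))   # -> [1, -2]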
The advantage of this method is rapid convergence (if we start close to the minimum). A drawback is that the Hessian ∇2f(xk) need not be positive definite; see the example in Figure 6.4. Another drawback is that evaluating the Hessian might be computationally demanding. Therefore, diverse variants of this method exist (quasi-Newton methods) that approximate the Hessian matrix or regularize it.
6.3 Constrained problems

Consider the problem
min f(x) subject to x ∈ M,
where f : Rn → R is a differentiable function and the feasible set M ⊆ Rn is characterized by the system
gj(x) ≤ 0, j = 1, . . . , J,
hℓ(x) = 0, ℓ = 1, . . . , L.
6.3.1 Methods of feasible directions

In the methods of feasible directions (Frank and Wolfe [1956], Zoutendijk [1960]), in each iteration we find a descent direction by solving an auxiliary problem in which the step is normalized by a constraint of the type ‖x − xk‖ ≤ 1. If we use the Euclidean norm, then we are seeking the steepest descent direction that is feasible. In order that the auxiliary problem be easy to solve, we usually employ the maximum or the Manhattan norm. For the latter, for example, the problem takes the form of a linear program, in which ‖x − xk‖1 ≤ 1 is replaced by
eT z ≤ 1, x − xk ≤ z, −x + xk ≤ z.
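As an illustration, the following sketch computes the steepest descent direction in the Manhattan norm, min gT d subject to ‖d‖1 ≤ 1, by solving exactly this linear program with scipy.optimize.linprog; the constraints describing M would be appended in the same manner.

    import numpy as np
    from scipy.optimize import linprog

    def l1_steepest_descent_direction(g):
        """Solve min g^T d s.t. ||d||_1 <= 1, written with variables (d, z) as
        e^T z <= 1,  d - z <= 0,  -d - z <= 0."""
        n = len(g)
        c = np.concatenate([g, np.zeros(n)])           # objective: g^T d
        I = np.eye(n)
        A_ub = np.vstack([
            np.concatenate([np.zeros(n), np.ones(n)])[None, :],   # e^T z <= 1
            np.hstack([I, -I]),                                   #  d - z <= 0
            np.hstack([-I, -I]),                                  # -d - z <= 0
        ])
        b_ub = np.concatenate([[1.0], np.zeros(2 * n)])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (2 * n))
        return res.x[:n]

    print(l1_steepest_descent_direction(np.array([3.0, -1.0])))   # -> [-1, 0]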
6.3.2 Active-set methods

In active-set methods, the constraints that are active at the current point xk are treated as equations and the remaining inequality constraints are temporarily ignored. The active set at xk is
W := {j; gj(xk) = 0}.
Figure 6.5: At point x∗ we have ∇f (x∗ ) − ∇g1 (x∗ ) + ∇g2 (x∗ ) = 0, so index 1 is removed from the active
set and index 2 remains there.
If we move to the boundary of M during the computation and another constraint becomes active, then we include it in W. If we reach a local minimum x∗ during the computation of this auxiliary problem, then we assume that the KKT conditions are satisfied, that is, there exist λ and µ such that
∇f(x∗) + ∇h(x∗)µ + ∑_{j∈W} λj ∇gj(x∗) = 0.
Now, if λj ≥ 0, then j remains in W; otherwise the index j is removed from W. This treatment is based on the interpretation of Lagrange multipliers as the negative derivatives of the objective function with respect to the right-hand sides of the constraints. Hence λj < 0 implies that, locally, a decrease of gj(x) yields a decrease of f(x); see Figure 6.5.
The schema of this method resembles the simplex method in linear programming, in which we move
from one feasible basis to another and dynamically change the active set. Therefore, the active-set method
is primarily used in optimization problems with linear constraints.
Penalty methods
Consider the problem
min f (x) subject to x ∈ M,
where f(x) is a continuous function and M ≠ ∅ is a closed set. A penalty function is any continuous nonnegative function q : Rn → R satisfying the conditions:
• q(x) = 0 for every x ∈ M,
• q(x) > 0 for every x ∉ M.
Penalty methods are based on a transformation of the problem to the unconstrained problem
min f(x) + c · q(x) subject to x ∈ Rn, (6.1)
where c > 0 is a parameter.
Penalty methods are implemented such that c is not constant but is increased during the iterations. Too high a value of c at the beginning leads to a numerically ill-conditioned problem. That is why in practice the values from a suitable sequence ck > 0 with ck →k→∞ ∞ are used.
Theorem 6.3. Let ck > 0 be a sequence of numbers such that ck →k→∞ ∞, and let xk be an optimal solution of the problem
min f(x) + ck · q(x) subject to x ∈ Rn.
If xk →k→∞ x∗, then x∗ is an optimal solution of the original problem min f(x) subject to x ∈ M.
Proof. If x∗ ∉ M, then for k∗ large enough we have xk ∉ M for all k ≥ k∗. Since q(x∗) > 0 and q(xk) → q(x∗) by continuity, the penalty term grows without bound: f(x∗) + ck · q(x∗) →k→∞ ∞ and also f(xk) + ck · q(xk) →k→∞ ∞. This contradicts the optimality of xk, because any fixed x ∈ M has the constant objective value f(x) + ck · q(x) = f(x).
Consider now the case x∗ ∈ M and suppose to the contrary that x∗ is not optimal. Then there is a point x′ ∈ M such that f(x′) < f(x∗). Since the penalization is zero on the feasible set M, we get
f(xk) + ck · q(xk) ≤ f(x′) + ck · q(x′) = f(x′) < f(x∗),
and taking the limit k → ∞ yields f(x∗) ≤ f(x′) < f(x∗); a contradiction.
Example 6.4. For constraints of type g(x) ≤ 0 we often use the penalty function
q(x) := ∑_{j=1}^J (gj(x)^+)^2 = ∑_{j=1}^J max(0, gj(x))^2,
which preserves smoothness of the objective function, and for constraints of type h(x) = 0 we can use the penalty function
q(x) := ∑_{ℓ=1}^L hℓ(x)^2.
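A minimal sketch of the penalty method, using the quadratic penalty of Example 6.4 on an illustrative two-dimensional problem:

    import numpy as np
    from scipy.optimize import minimize

    # min f(x) s.t. g(x) <= 0 with f(x) = (x1-2)^2 + (x2-2)^2, g(x) = x1 + x2 - 2.
    f = lambda x: (x[0] - 2.0)**2 + (x[1] - 2.0)**2
    g = lambda x: x[0] + x[1] - 2.0
    q = lambda x: max(0.0, g(x))**2          # the penalty of Example 6.4

    x = np.zeros(2)
    for c in [1.0, 10.0, 100.0, 1000.0]:     # increasing sequence c_k
        res = minimize(lambda z: f(z) + c * q(z), x)
        x = res.x                            # warm start for the next c_k
    print(x)                                 # approaches the optimum (1, 1)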
Barrier methods
Consider again the problem
min f (x) subject to x ∈ M,
where f(x) is a continuous function. Suppose that M is a connected set satisfying M = cl(int M), that is, it is equal to the closure of its interior. A barrier function is any continuous nonnegative function q : int M → R such that q(x) → ∞ as x → ∂M. This means that when x approaches the boundary of M, the barrier function grows to infinity.
The original problem is then transformed to an unconstrained problem
min f(x) + (1/c) q(x) subject to x ∈ Rn, (6.2)
where c > 0 is a parameter.
The algorithm is similar to penalty methods, that is, we iteratively seek optimal solutions of the auxiliary problems as c → ∞. A drawback of this method is that we have to know an initial feasible solution at the beginning. The advantage is its simplicity.
The pioneers of these methods are Fiacco and McCormick [1968].
Example 6.5. For constraints of type g(x) ≤ 0 we often use the barrier function in the form
q(x) := − ∑_{j=1}^J 1/gj(x)
or in the form
q(x) := − ∑_{j=1}^J log(−gj(x)).
The latter is utilized in the popular interior point methods, whose implementations can solve linear programs and certain convex optimization problems (such as quadratic programs) in polynomial time. For example, the linear program
min cT x subject to Ax ≤ b
is transformed to the unconstrained problem
min cT x − (1/c) ∑_{i=1}^m log(bi − (Ax)i) subject to x ∈ Rn.
In semidefinite programming, where the constraint requires a matrix X to be positive semidefinite, one analogously uses the barrier function
q(X) := − log(det(X)).
Under certain assumptions the optimal solutions of the auxiliary problems converge to the optimum
of the original problem.
Theorem 6.6. Let ck > 0 be a sequence of numbers such that ck →k→∞ ∞. Let xk be an optimal solution
of problem
min f(x) + (1/ck) q(x) subject to x ∈ Rn.
If xk →k→∞ x∗ , then x∗ is an optimal solution of the original problem minx∈M f (x).
Proof. Suppose to the contrary that x∗ is not optimal, that is, there is x′ ∈ M such that f(x′) < f(x∗). Due to continuity of f(x) there is x′′ ∈ int M such that f(x′′) < f(x∗). Since (1/ck) q(x′′) →k→∞ 0, for k large enough we have
f(x′′) + (1/ck) q(x′′) < f(x∗).
On the other hand, f(xk) + (1/ck) q(xk) ≥ f(xk) →k→∞ f(x∗) since q is nonnegative. Hence for k large enough
f(x′′) + (1/ck) q(x′′) < f(xk) + (1/ck) q(xk),
which contradicts the optimality of xk.
For convex optimization problems under general assumptions (e.g., strictly convex barrier function
and M bounded) the optimal solution x(c) of (6.2) is unique and the points x(c), c > 0, draw a smooth
curve, called the central path, whose limit as c → ∞ is the optimal solution of the original problem.
Certain algorithms use the same principle: for increasing values of c they find (approximations of) the optimal solutions x(c). With a small change of c the point x(c) moves continuously, so it is easy and fast to reoptimize and find the new optimum. For the theoretical analysis of polynomiality of certain convex optimization problems, short steps are used, but in practice larger steps are convenient. Typically, we increase c by a factor of 1.1.
A natural question is why not to choose a large value of c right at the beginning. First, numerical issues cause troubles then. Second, such a choice does not make the algorithm faster: the Newton method (or another method used to solve (6.2)) is slow if we start far from the optimum. Therefore, tracing the central path using fast steps is the most convenient way. Notice that we face some difficulties at the beginning, but this issue can be overcome.
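A minimal central-path sketch for a linear program, assuming a known strictly feasible starting point and using damped Newton steps for the auxiliary problems (the data are an illustrative toy LP); note that minimizing f(x) + (1/c)q(x) is equivalent to minimizing c·f(x) + q(x).

    import numpy as np

    # Toy LP: min x1 + x2 s.t. x1 + 2*x2 >= 1, x >= 0, written as A x <= b.
    cost = np.array([1.0, 1.0])
    A = np.array([[-1.0, -2.0], [-1.0, 0.0], [0.0, -1.0]])
    b = np.array([-1.0, 0.0, 0.0])

    x = np.array([1.0, 1.0])                 # strictly feasible start
    c = 1.0
    while c < 1e6:
        for _ in range(20):                  # damped Newton for fixed c
            s = b - A @ x                    # slacks, positive inside M
            grad = c * cost + A.T @ (1.0 / s)
            hess = A.T @ np.diag(1.0 / s**2) @ A
            dx = -np.linalg.solve(hess, grad)
            t = 1.0
            while np.any(b - A @ (x + t * dx) <= 0):
                t *= 0.5                     # damp to stay strictly inside M
            x = x + t * dx
            if np.linalg.norm(t * dx) < 1e-12:
                break
        c *= 1.1                             # the factor suggested above
    print(np.round(x, 4))                    # approaches the optimum (0, 0.5)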
6.4 Conjugate gradient method

Throughout this section, A is a symmetric positive definite matrix, ⟨x, y⟩ := xT Ay denotes the inner product it induces, and we minimize the quadratic function f(x) = (1/2) xT Ax − bT x, that is, we solve the system Ax = b.
Proof. It is sufficient to show that the vector ∇f(xk+1) = gk+1 is perpendicular to the subspace x1 + span{d1, . . . , dk}, that is, perpendicular to every vector d1, . . . , dk. For j ≤ k write
gk+1 = Axk+1 − b = A(xj + ∑_{i=j}^k αi di) − b = gj + ∑_{i=j}^k αi A di.
Premultiplying by dTj and using the A-orthonormality of the directions together with αj = −dTj gj, we get
dTj gk+1 = dTj gj + αj = 0.
The choice of the basis d1, . . . , dn. We choose the basis such that span{d1, . . . , dk} = span{g1, . . . , gk} for every k = 1, . . . , n. At the beginning we naturally put d1 := −g1/√⟨g1, g1⟩. In the (k + 1)st iteration we construct the vector dk+1 from the vector −gk+1 by making it orthogonal (with respect to ⟨·, ·⟩) to the subspace span{d1, . . . , dk}.
Proposition 6.8. For every k we have span{g1, . . . , gk} = span{g1, Ag1, A^2 g1, . . . , A^{k−1} g1}.
Proof. We prove it by mathematical induction on k. By the definition and the induction hypothesis we have gk = Axk − b, where xk ∈ x1 + span{d1, . . . , dk−1} = x1 + span{g1, Ag1, . . . , A^{k−2} g1}. Hence
gk ∈ Ax1 − b + span{Ag1, A^2 g1, . . . , A^{k−1} g1} ⊆ span{g1, Ag1, A^2 g1, . . . , A^{k−1} g1}.
In fact, we have equality since gk does not belong to span{g1, . . . , gk−1}. Otherwise, according to Proposition 6.7, the vector gk would be orthogonal to this subspace, so gk = Axk − b would have to be zero, meaning that xk is the solution x∗.
Since gk+1 is orthogonal (in the standard sense) to vectors d1 , . . . , dk , it is also orthogonal to g1 , . . . , gk ,
and by Proposition 6.8 it is A-orthogonal to vectors g1 , . . . , gk−1 , too. Thus, in order to compute dk+1 , it
is sufficient to make −gk+1 orthogonal to vector dk . This is performed by the following statement. Notice
that the resulting value of dk+1 is not normalized, so we have to normalize it afterwards.
Proposition 6.9. We have dk+1 = −gk+1 + βk+1 dk, where βk+1 = ⟨dk, gk+1⟩.
Proof. We already know that ⟨gk+1, di⟩ = 0 for i = 1, . . . , k − 1. Hence dk+1 has the form dk+1 = −gk+1 + βk+1 dk for a certain βk+1. From the equality 0 = ⟨dk, dk+1⟩ = dTk A(−gk+1 + βk+1 dk) we derive the value
βk+1 = dTk Agk+1 / dTk Adk = ⟨dk, gk+1⟩ / ⟨dk, dk⟩ = ⟨dk, gk+1⟩,
using the normalization ⟨dk, dk⟩ = 1.
Summary. Now we have all the ingredients to explicitly write the algorithm:
1: choose x1 ∈ Rn and put d0 := 0,
2: for k = 1, . . . , n do
     gk := Axk − b,
     βk := dTk−1 Agk,
     dk := −gk + βk dk−1,  dk := dk / √(dTk Adk),
     αk := −dTk gk,
     xk+1 := xk + αk dk.
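A direct transcription of the algorithm into Python might read as follows; the test matrix at the end is an illustrative choice.

    import numpy as np

    def conjugate_gradients(A, b, x1):
        """The algorithm above: solves A x = b for symmetric positive definite A,
        i.e., minimizes (1/2) x^T A x - b^T x."""
        n = len(b)
        x = x1.astype(float)
        d = np.zeros(n)                      # d_0 := 0
        for _ in range(n):
            g = A @ x - b                    # g_k := A x_k - b
            beta = d @ A @ g                 # beta_k := d_{k-1}^T A g_k
            d = -g + beta * d                # d_k := -g_k + beta_k d_{k-1}
            dAd = d @ A @ d
            if dAd <= 0:                     # g_k = 0: x_k is already the solution
                break
            d = d / np.sqrt(dAd)             # A-normalization of d_k
            alpha = -d @ g                   # alpha_k := -d_k^T g_k
            x = x + alpha * d
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(conjugate_gradients(A, b, np.zeros(2)))   # equals np.linalg.solve(A, b)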
The basic idea of the conjugate gradient method can be used to minimize a general nonlinear function
f (x) over space Rn . Herein, the key idea is to construct the improving direction dk as a linear combination
of gradient gk and the previous direction dk−1 . Vector gk is then the gradient of function f (x) at point
xk , and the coefficients are computed analogously. The resulting method is called the method of Fletcher–
Reeves (1964). There exist several variants, which differ in the values of coefficients βk .
There are also methods employing Krylov subspaces for solving systems Ax = b, where matrix A is
not necessarily symmetric positive definite. For example, let us mention GMRES (Generalized minimal
residual method, Saad & Schultz, 1986), which in the kth iteration computes the vector xk minimizing the Euclidean norm of the residual (i.e., ‖Ax − b‖2) over the subspace span{b, Ab, A^2 b, . . . , A^{k−1} b}.
Chapter 7
Selected topics
7.1 Robust optimization

Consider the linear program
min cT x subject to Ax ≤ b, x ≥ 0.
Suppose that A and b are not known exactly and the only information we have are interval estimates of their values. That is, we know a matrix of intervals [A, A] and a vector of interval right-hand sides [b, b]. We say that a vector x is a robust feasible solution if it fulfills the inequality Ax ≤ b for each A ∈ [A, A] and b ∈ [b, b]. Due to nonnegativity of the variables, x is robust feasible if and only if it satisfies the worst-case inequalities, with A replaced by the matrix of right endpoints and b by the vector of left endpoints. Hence the robust counterpart of the linear program reads
min cT x subject to Ax ≤ b, x ≥ 0.
Example 7.1 (Catfish diet problem). This example comes from http://www.fao.org/3/x5738e/x5738e0h.htm. It is a simplified optimization model of finding a minimum cost catfish diet in Thailand. The mathematical formulation reads
min cT x subject to Ax ≥ b, x ≥ 0,
where variable xj stands for the number of units of food j to be consumed by the catfish, bi is the required minimal amount of nutrient i, cj is the price per unit of food j, and aij is the amount of nutrient i contained in one unit of food j. The data are recorded in Table 7.1. Thus we have
    A = ( 9     65    44    12    0
          1.10  3.90  2.57  1.99  0
          0.02  3.7   0.3   0.1   38.0 ),   b = (30, 250, 0.5)T,   c = (2.15, 8.0, 6.0, 2.0, 0.4)T.
Since the nutritive values are not known exactly, we assume that their accuracy is 5%. Hence the exact value of each entry of matrix A lies in the interval [0.95 · aij, 1.05 · aij]. Following the approach described above, the robust counterpart is obtained by replacing the constraint matrix with the matrix of interval left endpoints 0.95 · A, that is,
    A = ( 8.550  61.75  41.800  11.400  0.00
          1.045  3.705  2.4415  1.8905  0.00
          0.019  3.515  0.2850  0.0950  36.1 ).
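Both the nominal and the robust diet can be computed, for instance, with scipy.optimize.linprog; the sketch below assumes the nutrient constraints read Ax ≥ b with x ≥ 0, as formulated above.

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[9.0, 65.0, 44.0, 12.0, 0.0],
                  [1.10, 3.90, 2.57, 1.99, 0.0],
                  [0.02, 3.7, 0.3, 0.1, 38.0]])
    b = np.array([30.0, 250.0, 0.5])
    c = np.array([2.15, 8.0, 6.0, 2.0, 0.4])

    for M in (A, 0.95 * A):                 # nominal matrix, then left endpoints
        res = linprog(c, A_ub=-M, b_ub=-b)  # A x >= b as -A x <= -b; default x >= 0
        print(res.fun, res.x)               # the robust diet is somewhat costlier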
Table 7.1: (Example 7.1) Catfish diet problem: Nutritive value of foods and the nutritional demands
Consider now the general case of the interval linear program with variables unrestricted in sign
min cT x subject to Ax ≤ b.
Let aT x ≤ d be a selected inequality, and let the intervals [a, a] = ([a1, a1], . . . , [an, an])T and [d, d] be given. A solution x is a robust solution of the selected inequality if it satisfies
aT x ≤ d ∀a ∈ [a, a], ∀d ∈ [d, d],
or,
max_{a∈[a,a]} aT x ≤ d.
Lemma 7.2. Denote by a∆ = (1/2)(a − a) the vector of interval radii and by ac = (1/2)(a + a) the vector of interval midpoints. Then
max_{a∈[a,a]} aT x = aTc x + aT∆ |x|.
Proof. For every a ∈ [a, a] we have aT x ≤ aTc x + aT∆ |x| entrywise, and the inequality is attained as an equation for a certain a ∈ [a, a]. If x ≥ 0, then aTc x + aT∆ |x| = aTc x + aT∆ x = aT x. If x ≤ 0, then aTc x + aT∆ |x| = aTc x − aT∆ x = aT x. Otherwise we apply this idea entrywise, so that the inequality is attained as an equation for an a each entry of which is the left or the right endpoint of its interval.
The left-hand side function is convex, but not smooth. Nevertheless, we can rewrite the constraint as a system of linear constraints by introducing an auxiliary variable y ∈ Rn:
aTc x + aT∆ y ≤ d, x ≤ y, −x ≤ y.
Therefore linearity is preserved – the robust solutions of interval linear programs are also described by
linear constraints.
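Lemma 7.2 admits a quick numerical check on random interval data (the data below are artificial):

    import numpy as np

    rng = np.random.default_rng(1)
    al = rng.normal(size=5)                     # left endpoints
    au = al + rng.uniform(0.0, 2.0, size=5)     # right endpoints
    x = rng.normal(size=5)
    ac, ad = (al + au) / 2.0, (au - al) / 2.0   # midpoints and radii

    best = np.where(x >= 0, au, al) @ x         # endpoint choice attaining the max
    print(np.isclose(best, ac @ x + ad @ np.abs(x)))   # True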
Example 7.3 (Robust classification). Consider two classes of data: the first one comprises given points x1, . . . , xp ∈ Rn, and the second one contains given points y1, . . . , yq ∈ Rn. We wish to construct a classifier that is able to predict to which class a new input belongs. A basic linear classifier is based on separation of the data by the widest separating band. Mathematically, we seek a hyperplane aT x + b = 1 such that the first set of points belongs to the positive halfspace, the second set of points belongs to the negative halfspace, and the separating band is as wide as possible. This leads to the convex quadratic program (see Figure 7.1a)
min ‖a‖2 subject to xTi a + b ≥ 1 ∀i, yTj a + b ≤ −1 ∀j.
Figure 7.1: (Example 7.3) A linear classifier for real data and the robust linear classifier for interval data: (a) the widest separating band for real data; (b) the widest separating band for interval data.
Suppose now that the data are not measured exactly and one knows them only with a specified accuracy. Hence we are given vectors of intervals [xi, xi] = [(xc)i − (x∆)i, (xc)i + (x∆)i], i = 1, . . . , p, and [yj, yj] = [(yc)j − (y∆)j, (yc)j + (y∆)j], j = 1, . . . , q, comprising the true data. Using the approach described above, the robust counterpart model reads (see Figure 7.1b)
min ‖a‖2 subject to (xc)Ti a − (x∆)Ti a′ + b ≥ 1 ∀i, (yc)Tj a + (y∆)Tj a′ + b ≤ −1 ∀j, ±a ≤ a′.
Ellipsoidal uncertainty
Consider again the linear program in the form with variables unrestricted in sign
min cT x subject to Ax ≤ b.
Let aT x ≤ d be a selected inequality and suppose that the vector a of its coefficients may attain any value in the ellipsoid
E = {a ∈ Rn; a = p + P u, ‖u‖2 ≤ 1},
which is expressed as the image of the unit ball under a linear (or, more precisely, affine) mapping.
x is a robust solution of the selected inequality if it satisfies
aT x ≤ d ∀a ∈ E
or,
max_{a∈E} aT x ≤ d.
Lemma 7.4. A point x is a robust solution of the selected inequality if and only if
pT x + ‖P T x‖2 ≤ d.
Proof. Write
max_{a∈E} aT x = max_{‖u‖2≤1} (p + P u)T x = pT x + max_{‖u‖2≤1} uT P T x = pT x + ‖P T x‖2,
where the inner maximum is attained at u = P T x/‖P T x‖2 by the Cauchy–Schwarz inequality. Hence the robust condition takes the form
pT x + ‖P T x‖2 ≤ d.
The left-hand side function is smooth and convex – indeed it is a second order cone constraint.
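The formula can also be checked numerically by sampling the ellipsoid (the data below are artificial):

    import numpy as np

    rng = np.random.default_rng(2)
    p, P, x = rng.normal(size=3), rng.normal(size=(3, 3)), rng.normal(size=3)

    u = rng.normal(size=(100000, 3))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # points on the unit sphere
    sampled = ((p + u @ P.T) @ x).max()             # max of a^T x over a = p + P u
    closed = p @ x + np.linalg.norm(P.T @ x)
    print(sampled, closed)                          # sampled <= closed, nearly equal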
Example 7.5. Consider again the portfolio selection problem (Example 4.9)
max cT x subject to eT x = K, x ≥ o, (7.2)
where c is a random Gaussian vector with expected value c̃ := E c and covariance matrix Σ := cov c = E (c − c̃)(c − c̃)T. The level sets of the density function are ellipsoids, so it is natural to work with them. For the random vector c we have the probability P(c − c̃ ∈ Eη) = η, where Eη is a certain ellipsoid (concretely, Eη = {d ∈ Rn; d = F−1(η) √Σ u, ‖u‖2 ≤ 1}, where F−1(η) is the quantile function of the normal distribution and √Σ is the positive semidefinite square root of the matrix Σ, i.e., (√Σ)2 = Σ).
One of the possible ways to solve (7.2) is to consider the deterministic counterpart
max z subject to P (cT x ≥ z) ≥ η, eT x = K, x ≥ o,
where η ∈ [1/2, 1] is a fixed value, e.g., η = 0.95. Obviously, condition P(cT x ≥ z) ≥ η is fulfilled if dT x ≥ z
holds for every d ∈ Eη + c̃. Hence we can approximate the problem as
max z subject to dT x ≥ z ∀d ∈ Eη + c̃, eT x = K, x ≥ o.
This optimization problem involves ellipsoidal uncertainty, so we can equivalently write it as
max z subject to c̃T x − F−1(η) ‖√Σ x‖2 ≥ z, eT x = K, x ≥ o.
Since F−1(η) ≥ 0 for any η ≥ 1/2, this is a second order cone programming problem.
f(x) = f(∑_{i=1}^m αi vi) ≤ ∑_{i=1}^m αi f(vi) ≤ ∑_{i=1}^m αi f(v1) = f(v1).
Therefore v1 is an optimum.
This property holds in linear programming, too. For computing an optimal solution, however, it is not very helpful, since the polyhedron M may contain many vertices and we do not know which one is optimal. By Theorem 4.8, concave programming is NP-hard.
Typical problems resulting in concave programming comprise:
• Fixed charge problems. The objective function has the form f(x) = ∑_{i=1}^k fi(xi), where fi(xi) = 0 for xi = 0 and fi(xi) = ci + gi(xi) for xi > 0. Herein, fi(xi) represents a price (e.g., the price for the transport of goods of size xi). Hence the price is naturally zero when xi = 0. When xi > 0, we pay a fixed charge ci plus the price gi(xi) depending on the size of xi. We can assume that gi(xi) is concave, since the larger xi, the smaller the relative price for a unit of goods (e.g., due to discounts).
• Multiplicative programming. The objective function has the form f(x) = ∏_{i=1}^k xi. This is not a concave function in general, but its logarithm is concave: log f(x) = ∑_{i=1}^k log(xi). Such problems appear in geometry, where, for example, we minimize the volume of a body (e.g., a cuboid) subject to some constraints (e.g., the cuboid contains specified points).
Appendix
Derivative of matrix expressions. Let A ∈ Rn×n, b ∈ Rn and c ∈ R. Consider the quadratic function f : Rn → R defined as
f(x) = xT Ax + bT x + c.
Then
∇f(x) = (A + AT)x + b,
∇2f(x) = A + AT.
For the linear term we have ∂(bT x)/∂xk = bk, whence ∇bT x = b.
For the quadratic term, we get
∂/∂xk (xT Ax) = ∂/∂xk ∑_{i=1}^n ∑_{j=1}^n aij xi xj = ∂/∂xk ( akk xk^2 + ∑_{i≠k} (aik + aki) xi xk + ∑_{i,j≠k} aij xi xj )
= 2 akk xk + ∑_{i≠k} (aik + aki) xi = ∑_{i=1}^n (aik + aki) xi = ((A + AT)x)_k.
Hence the gradient reads ∇xT Ax = (A + AT)x. Since the gradient is a linear function, its particular coordinates are differentiated in the same way as the linear term above. Therefore ∇2 xT Ax = A + AT.
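The formulas can be verified by a finite-difference check on random data, for instance:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 4
    A, b = rng.normal(size=(n, n)), rng.normal(size=n)
    f = lambda x: x @ A @ x + b @ x          # the constant c does not matter

    x = rng.normal(size=n)
    h = 1e-6
    fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(n)])
    print(np.allclose(fd, (A + A.T) @ x + b, atol=1e-5))   # True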
Notation
Functions
‖x‖p    ℓp-norm of a vector x ∈ Rn, ‖x‖p = (∑_{i=1}^n |xi|^p)^{1/p}
‖x‖1    Manhattan norm of a vector x ∈ Rn, ‖x‖1 = ∑_{i=1}^n |xi|
‖x‖2    Euclidean norm of a vector x ∈ Rn, ‖x‖2 = (∑_{i=1}^n xi^2)^{1/2}
‖x‖∞    maximum norm of a vector x ∈ Rn, ‖x‖∞ = max_{i=1,...,n} |xi|
P (c) probability of a random event c
Bibliography
M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming. Theory and Algorithms. 3rd ed.
John Wiley & Sons., NJ, 2006. 3
A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization. Analysis, algorithms, and engi-
neering applications. SIAM, Philadelphia, PA, 2001.
https://www2.isye.gatech.edu/~nemirovs/Lect_ModConvOpt.pdf. 31
A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
https://www2.isye.gatech.edu/~nemirovs/FullBookDec11.pdf. 59
S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming. Optim.
Eng., 8(1):67, 2007.
P. J. Dickinson and L. Gijben. On the computational complexity of membership problems for the com-
pletely positive cone and its dual. Comput. Optim. Appl., 57(2):403–415, 2014.
http://dx.doi.org/10.1007/s10589-013-9594-z. 38
A. V. Fiacco and G. P. McCormick. Nonlinear Programming: Sequential Unconstrained Minimization Techniques. John Wiley & Sons, New York, 1968. 53
C. A. Floudas and P. M. Pardalos, editors. Encyclopedia of Optimization. 2nd ed. Springer, New York, 2009. 29
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Res. Logist. Quart., 3(1-2):
95–110, 1956. 51
M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Natl.
Bur. Stand., 49(6):409–436, 1952.
https://doi.org/10.6028/jres.049.044. 55
C. F. Higham and D. J. Higham. Deep learning: An introduction for applied mathematicians. SIAM Rev.,
61(4):860–891, 2019. 49
M. Hutchings, F. Morgan, M. Ritoré, and A. Ros. Proof of the double bubble conjecture. Ann. Math.,
155(2):459–489, 2002.
https://arxiv.org/pdf/math/0406017. 8
W. Karush. Minima of functions of several variables with inequalities as side constraints. M.Sc. disserta-
tion, Department of Mathematics, University of Chicago, Chicago, IL, USA, 1939. 43
H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Proceedings of the Second Berkeley Symposium
on Mathematical Statistics and Probability, 1950, pages 481–492, Berkeley, 1951. University of California
Press. 43
K. Lange. MM Optimization Algorithms, volume 147 of Other Titles Appl. Math. SIAM, Philadelphia,
PA, 2016. 23
A. N. Langville and C. D. Meyer. Who’s #1? The science of rating and ranking. Princeton University
Press, Princeton, NJ, 2012. 29
J. Liesen and Z. Strakoš. Krylov Subspace Methods, Principles and Analysis. Oxford University Press,
Oxford, 2013. 55, 56
D. Luenberger and Y. Ye. Linear and Nonlinear Programming. Springer, New York, third edition, 2008.
3, 55
K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming.
Math. Program., 39(2):117–129, 1987.
https://doi.org/10.1007/BF02592948. 38
P. M. Pardalos and S. A. Vavasis. Quadratic programming with one negative eigenvalue is NP-hard. J.
Glob. Optim., 1(1):15–22, 1991.
https://doi.org/10.1007/BF00120662. 29
S. A. Vavasis. Nonlinear Optimization: Complexity Issues. Oxford University Press, New York, 1991. 29
W. Zhu. Unsolvability of some optimization problems. Appl. Math. Comput., 174(2):921–926, 2006.
https://doi.org/10.1016/j.amc.2005.05.025. 7
G. Zoutendijk. Methods of feasible directions: A study in linear and nonlinear programming. PhD thesis,
University of Amsterdam, Amsterdam, Netherlands, 1960. 51