Course on Optimization Techniques (Curs Tehnici de Optimizare)
Ion Necoara
Automation and Systems Engineering Department
University Politehnica Bucharest
Email: i.necoara@yahoo.com
2009
Foreword
This course on numerical optimization techniques is intended for students of the Automation and Systems Engineering Department in the second year of their bachelor programme, as well as for interested master and PhD students from neighboring subjects. The course's aim is to give an introduction to numerical methods for the solution of optimization problems, in order to prepare the students for using and developing these methods for specific applications in engineering. The course's focus is on continuous optimization (rather than discrete optimization), with special emphasis on nonlinear programming. For this reason, the course is divided into two major parts:
I. unconstrained optimization
II. constrained optimization
As for bibliography, I recommend the textbook Numerical Optimization by J. Nocedal and S. Wright, and the excellent textbooks Convex Optimization by S. Boyd and L. Vandenberghe (this book is freely available and can be downloaded from the home page of Stephen Boyd) and Introductory Lectures on Convex Optimization: A Basic Course by Y. Nesterov.
Background: Students are required to have solid knowledge of linear algebra (e.g. matrix theory, concepts from vector spaces, etc.) and calculus (notions of differentiable functions, convergence of sequences, etc.).
Acknowledgement: I would like to thank Prof. M. Diehl (K.U. Leuven) and my students C. Hristescu, A. Manciu, V. Caiter, F. Panait and F. Georgescu (UPB) for their help in writing these notes.
Part I
Introduction
Chapter 1
Background
1.1 Review of matrix analysis
In this course we adopt the convention of considering elements $x \in \mathbb{R}^n$ to be column vectors, i.e. $x = [x_1 \cdots x_n]^T$. On $\mathbb{R}^n$ the inner product is defined as
\[ \langle x, y \rangle = x^T y = \sum_{i=1}^n x_i y_i. \]
When not specified otherwise, the norm on $\mathbb{R}^n$ is the standard Euclidean norm (i.e. the norm induced by this inner product):
\[ \|x\| = \sqrt{\langle x, x \rangle}. \]
The angle $\theta$ between two non-zero vectors $x$ and $y$ may be defined by:
\[ \cos \theta = \frac{\langle x, y \rangle}{\|x\|\,\|y\|}, \qquad 0 \le \theta \le \pi. \]
The fundamental Cauchy-Schwarz inequality states that for any inner product and the corresponding induced norm the following inequality holds:
\[ |\langle x, y \rangle| \le \|x\|\,\|y\| \qquad \forall x, y \in \mathbb{R}^n, \]
with equality if and only if x and y are linearly dependent.
Any norm $\|\cdot\|$ has a dual norm $\|\cdot\|_*$ defined by:
\[ \|x\|_* = \max_{\|y\| = 1} \langle x, y \rangle. \]
The trace of a square matrix $Q = [Q_{ij}]_{ij} \in \mathbb{R}^{n \times n}$ is defined as
\[ \operatorname{Trace}(Q) = \sum_{i=1}^n Q_{ii}. \]
A scalar $\lambda \in \mathbb{C}$ and a non-zero vector $x$ that satisfy the equation $Qx = \lambda x$ are called an eigenvalue and an eigenvector of $Q$, respectively. The eigenvalue-eigenvector equation may be written equivalently as
\[ (\lambda I_n - Q)x = 0, \quad x \neq 0, \]
i.e. the matrix $\lambda I_n - Q$ is singular, that is,
\[ \det(\lambda I_n - Q) = 0. \]
Therefore, the characteristic polynomial of $Q$ is defined as
\[ p_Q(\lambda) = \det(\lambda I_n - Q). \]
Clearly the set of roots of $p_Q(\lambda) = 0$ coincides with the set of eigenvalues of $Q$. The set of all eigenvalues of $Q$ is called the spectrum of $Q$ and is denoted by $\sigma(Q) = \{\lambda_1, \cdots, \lambda_n\}$. Using this notation we have
\[ p_Q(\lambda) = (\lambda - \lambda_1) \cdots (\lambda - \lambda_n) \]
and thus $p_Q(0) = \prod_i (-\lambda_i)$. From the previous discussion we obtain the following lemma:
Lemma 1.1.1 The following equalities hold:
\[ \det(Q) = \prod_i \lambda_i, \qquad \operatorname{Trace}(Q) = \sum_i \lambda_i, \qquad \lambda_i(Q^k) = \lambda_i^k \quad \text{and} \quad \lambda_i(aI_n + bQ) = a + b\lambda_i \quad \forall i, \; a, b \in \mathbb{R}. \]
We denote by $S^n$ the vector space of symmetric matrices:
\[ S^n = \{Q \in \mathbb{R}^{n \times n} : Q = Q^T\}. \]
On this space we define the inner product using the trace:
\[ \langle Q, P \rangle = \operatorname{Trace}(QP) \qquad \forall Q, P \in S^n. \]
Using the well-known cyclic property of the trace we have:
\[ \operatorname{Trace}(QPR) = \operatorname{Trace}(RQP) = \operatorname{Trace}(PRQ), \]
for any matrices $Q$, $P$ and $R$ of appropriate dimensions. As a consequence, we also have
\[ x^T Q x = \operatorname{Trace}(Q x x^T) \qquad \forall x \in \mathbb{R}^n. \]
For a symmetric matrix $Q \in S^n$ the corresponding eigenvalues are real, i.e. $\sigma(Q) \subseteq \mathbb{R}$. A symmetric matrix $Q \in S^n$ is positive semidefinite (notation $Q \succeq 0$) if
\[ x^T Q x \ge 0 \qquad \forall x \in \mathbb{R}^n, \]
and positive definite (notation $Q \succ 0$) if $x^T Q x > 0$ for all $x \in \mathbb{R}^n$, $x \neq 0$. We say that $Q \succeq P$ if $Q - P \succeq 0$. We denote the set of positive semidefinite (positive definite) matrices by $S^n_+$ ($S^n_{++}$).
We have the following characterization of a positive semidefinite matrix:
Lemma 1.1.2 The following statements are equivalent:
(i) The matrix $Q$ is positive semidefinite.
(ii) All the eigenvalues of $Q$ are non-negative.
(iii) All the principal minors of $Q$ are non-negative.
(iv) There exists a matrix $L$ such that $Q = L^T L$.
Let us denote by $\lambda_{\min}$ and $\lambda_{\max}$ the smallest and the largest eigenvalue of a symmetric matrix $Q \in S^n$. Then,
\[ \lambda_{\min} = \min_{x \neq 0} \frac{x^T Q x}{x^T x} \qquad \text{and} \qquad \lambda_{\max} = \max_{x \neq 0} \frac{x^T Q x}{x^T x}. \]
We conclude that
\[ \lambda_{\min} I_n \preceq Q \preceq \lambda_{\max} I_n. \]
We can derive definitions for certain matrix norms from vector norms. Given a vector norm $\|\cdot\|$, we define the corresponding induced matrix norm as:
\[ \|Q\| = \sup_{x \neq 0} \frac{\|Qx\|}{\|x\|}. \]
For the Euclidean norm the corresponding matrix norm is:
\[ \|Q\| = \big(\lambda_{\max}(Q^T Q)\big)^{1/2}. \]
The Frobenius norm of a matrix $Q \in \mathbb{R}^{m \times n}$ is defined by
\[ \|Q\|_F = \Big( \sum_{i=1}^m \sum_{j=1}^n Q_{ij}^2 \Big)^{1/2}. \]
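As a quick numerical illustration of these two norms, here is a small NumPy sketch (the matrix is arbitrary; NumPy's built-in norms are used only to confirm the formulas above):

```python
import numpy as np

Q = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# Induced 2-norm: square root of the largest eigenvalue of Q^T Q
two_norm = np.sqrt(np.max(np.linalg.eigvalsh(Q.T @ Q)))

# Frobenius norm: square root of the sum of squared entries
fro_norm = np.sqrt(np.sum(Q**2))

print(two_norm, np.linalg.norm(Q, 2))      # the two values agree
print(fro_norm, np.linalg.norm(Q, 'fro'))  # the two values agree
```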
1.2 Review of mathematical analysis
It is important to extend a function $f$ to all of $\mathbb{R}^n$ by defining its value to be $+\infty$ outside its domain. In the following we assume that all functions are implicitly extended in this way. A scalar function $f : \mathbb{R}^n \to \mathbb{R}$ has as effective domain the set
\[ \operatorname{dom} f = \{x \in \mathbb{R}^n : f(x) < \infty\}. \]
The function $f$ is said to be differentiable at a point $x \in \mathbb{R}^n$ if there exists a vector $g \in \mathbb{R}^n$ such that for all $y \in \mathbb{R}^n$:
\[ f(x + y) = f(x) + \langle g, y \rangle + \mathcal{R}(\|y\|), \]
where $\lim_{\|y\| \to 0} \frac{\mathcal{R}(\|y\|)}{\|y\|} = 0$ and $\mathcal{R}(0) = 0$. The vector $g$ is called the derivative or the gradient of $f$ at the point $x$ and is written $\nabla f(x)$. In other words, a function is differentiable at a point $x$ if it admits a first-order linear approximation at $x$. The gradient is uniquely determined and we define it as a column vector with components
\[ \nabla f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}. \]
The function $f$ is said to be differentiable on a set $X \subseteq \operatorname{dom} f$ if it is differentiable at all points of $X$.
The quantity (whenever the limit exists)
\[ f'(x; d) = \lim_{t \downarrow 0} \frac{f(x + td) - f(x)}{t} \]
is called the directional derivative of $f$ at $x$ along direction $d$. Note that the directional derivative may exist even for a non-differentiable function:
Example 1.2.1 For the function $f(x) = \|x\|$ we have $f'(x; d) = \langle \frac{x}{\|x\|}, d \rangle$ for $x \neq 0$, while at the point of non-differentiability $x = 0$ the directional derivative still exists: $f'(0; d) = \|d\|$.
A scalar function $f$ on $\mathbb{R}^n$ is said to be twice differentiable at $x$ if it is differentiable at $x$ and we can find a symmetric matrix $H \in \mathbb{R}^{n \times n}$ such that for all $y \in \mathbb{R}^n$
\[ f(x + y) = f(x) + \langle \nabla f(x), y \rangle + \frac{1}{2} y^T H y + \mathcal{R}(\|y\|^2), \]
where $\lim_{\|y\| \to 0} \frac{\mathcal{R}(\|y\|^2)}{\|y\|^2} = 0$. The matrix $H$ is called the Hessian and is denoted $\nabla^2 f(x)$. In conclusion, a function is twice differentiable at $x$ if it admits a second-order quadratic approximation in a neighborhood of $x$. As for the gradient, the Hessian is unique, whenever it exists, and is a symmetric matrix with the components
\[ \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}. \]
The function $f$ is said to be twice differentiable on a set $X \subseteq \operatorname{dom} f$ if it is twice differentiable at all points of $X$. The Hessian can be seen as the derivative of the vector function $\nabla f$:
\[ \nabla f(x + y) = \nabla f(x) + \nabla^2 f(x) y + \mathcal{R}(\|y\|). \]
Example 1.2.2 Let $f$ be a quadratic function
\[ f(x) = \frac{1}{2} x^T Q x - q^T x + r, \]
where $Q \in \mathbb{R}^{n \times n}$ is a symmetric matrix. Then, it is clear that the gradient of $f$ at $x$ is
\[ \nabla f(x) = Qx - q \]
and the Hessian at $x$ is
\[ \nabla^2 f(x) = Q. \]
A function that is at least once differentiable is said to be a smooth function. A function that is $k$ times differentiable with the $k$-th derivative continuous is said to belong to the class $C^k$.
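The gradient formula of Example 1.2.2 can be verified numerically by central finite differences (an illustrative NumPy sketch; the data $Q$, $q$, $r$ and the test point are arbitrary):

```python
import numpy as np

def f(x, Q, q, r):
    return 0.5 * x @ Q @ x - q @ x + r

def num_grad(fun, x, eps=1e-6, **kw):
    """Central finite-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (fun(x + e, **kw) - fun(x - e, **kw)) / (2 * eps)
    return g

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, -1.0]); r = 0.5
x = np.array([0.7, -0.3])
print(np.allclose(num_grad(f, x, Q=Q, q=q, r=r), Q @ x - q, atol=1e-5))  # True
```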
For a differentiable function $g : \mathbb{R} \to \mathbb{R}$, we have the classical first-order Taylor approximation in mean value form or in integral form:
\[ g(b) - g(a) = g'(\tau)(b - a) = \int_a^b g'(\tau)\, d\tau, \]
for some $\tau$ in the interval $(a, b)$.
These equalities can be extended to any differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ using the previous relations for the function $g(t) = f(x + t(y - x))$:
\[ f(y) = f(x) + \langle \nabla f(x + \tau(y - x)), y - x \rangle \quad \text{for some } \tau \in (0, 1), \]
\[ f(y) = f(x) + \int_0^1 \langle \nabla f(x + \tau(y - x)), y - x \rangle\, d\tau. \]
The reader should note that, using the rules of differentiation, we used:
\[ g'(\tau) = \langle \nabla f(x + \tau(y - x)), y - x \rangle. \]
Some extensions are possible:
\[ \nabla f(y) = \nabla f(x) + \int_0^1 \nabla^2 f(x + \tau(y - x))(y - x)\, d\tau, \]
\[ f(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2} (y - x)^T \nabla^2 f(x + \tau(y - x))(y - x), \quad \text{for some } \tau \in (0, 1). \]
A differentiable function has a Lipschitz continuous gradient if there exists a constant $L > 0$ such that
\[ \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \qquad \forall x, y. \]
Using the Taylor approximations given above we obtain the following lemma:
Lemma 1.2.3 (i) A twice differentiable function $f$ has a Lipschitz continuous gradient (with constant $L$) if and only if the following inequality holds:
\[ \|\nabla^2 f(x)\| \le L \qquad \forall x. \]
(ii) If a differentiable function has a Lipschitz continuous gradient, then
\[ |f(y) - f(x) - \langle \nabla f(x), y - x \rangle| \le \frac{L}{2} \|y - x\|^2 \qquad \forall x, y. \]
Interpretation: From Lemma 1.2.3 it follows that a differentiable function with a Lipschitz continuous gradient is bounded from above by a special quadratic function having the Hessian $L I_n$ (here $I_n$ is the identity matrix in $\mathbb{R}^{n \times n}$):
\[ f(y) \le \frac{L}{2} \|y - x\|^2 + \langle \nabla f(x), y - x \rangle + f(x) \qquad \forall y. \]
A twice differentiable function has a Lipschitz continuous Hessian if there exists a constant $M > 0$ such that
\[ \|\nabla^2 f(x) - \nabla^2 f(y)\| \le M \|x - y\| \qquad \forall x, y. \]
For this class of functions we have the following characterization:
Lemma 1.2.4 For a twice differentiable function $f$ which has a Lipschitz continuous Hessian we have:
\[ \|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\| \le \frac{M}{2} \|y - x\|^2 \qquad \forall x, y. \]
Moreover, the following inequality also holds:
\[ -M\|x - y\| I_n \preceq \nabla^2 f(x) - \nabla^2 f(y) \preceq M\|x - y\| I_n \qquad \forall x, y. \]
Chapter 2
Convex Theory
2.1 Convex sets
Definition 2.1.1 A set $S$ is an affine set if for any two points $x_1, x_2 \in S$ and any $\alpha \in \mathbb{R}$ we have $\alpha x_1 + (1 - \alpha) x_2 \in S$ (i.e. the line generated by any two points from $S$ is included in $S$).
Figure 2.1: Affine set generated by two points $x_1$ and $x_2$.
Example 2.1.2 The solution set of a linear system is an affine set, i.e. the set $\{x \in \mathbb{R}^n : Ax = b\}$ is affine.
The affine combination of $p$ points $x_1, \cdots, x_p$ is defined as:
\[ \sum_{i=1}^p \alpha_i x_i, \quad \text{where } \sum_{i=1}^p \alpha_i = 1, \; \alpha_i \in \mathbb{R}. \]
The affine hull of a set $S \subseteq \mathbb{R}^n$, denoted $\mathcal{A}(S)$, is the set containing all the possible affine combinations of points from $S$:
\[ \mathcal{A}(S) = \Big\{ \sum_{i \in I,\, I \text{ finite}} \alpha_i x_i : x_i \in S, \; \sum_i \alpha_i = 1, \; \alpha_i \in \mathbb{R} \Big\}. \]
In other words, $\mathcal{A}(S)$ is the smallest affine set that contains $S$.
Definition 2.1.3 The set $S$ is called convex if for any two points $x_1, x_2 \in S$ and $\alpha \in [0, 1]$ we have $\alpha x_1 + (1 - \alpha) x_2 \in S$ (i.e. the segment generated by any two points from $S$ is included in $S$).
Figure 2.2: Convex set.
It follows immediately that any affine set is convex. Furthermore, the convex combination of $p$ points $x_1, \cdots, x_p$ is defined as:
\[ \sum_{i=1}^p \alpha_i x_i, \quad \text{where } \sum_{i=1}^p \alpha_i = 1, \; \alpha_i \ge 0. \]
The convex hull of a set $S$, denoted $\operatorname{Conv}(S)$, is the set containing all possible convex combinations of points from $S$:
\[ \operatorname{Conv}(S) = \Big\{ \sum_{i \in I,\, I \text{ finite}} \alpha_i x_i : x_i \in S, \; \sum_i \alpha_i = 1, \; \alpha_i \ge 0 \Big\}. \]
Note that the convex hull of a set is the smallest convex set that contains the given set. It follows that if $S$ is convex, the convex hull of $S$ coincides with $S$.
Theorem 2.1.4 (Caratheodory's Theorem) If $S \subseteq \mathbb{R}^n$ is a convex set, then every element of $S$ is a convex combination of at most $n + 1$ points of $S$.
Figure 2.3: Convex hull.
A hyperplane is the convex set defined as
\[ \{x \in \mathbb{R}^n : a^T x = b\}, \qquad a \neq 0, \; b \in \mathbb{R}. \]
A halfspace is the convex set defined as
\[ \{x \in \mathbb{R}^n : a^T x \le b\} \quad \text{or} \quad \{x \in \mathbb{R}^n : a^T x \ge b\}, \]
where $a \neq 0$ and $b \in \mathbb{R}$.
Figure 2.4: Hyperplane with normal $a$ and the halfspaces $a^T x > b$ and $a^T x < b$.
A ball with center $x_0 \in \mathbb{R}^n$ and radius $r > 0$ is a convex set defined as:
\[ B(x_0, r) = \{x \in \mathbb{R}^n : \|x - x_0\| \le r\} = \{x \in \mathbb{R}^n : x = x_0 + ru, \; \|u\| \le 1\}. \]
An ellipsoid is the convex set defined as:
\[ \{x \in \mathbb{R}^n : (x - x_0)^T Q^{-1} (x - x_0) \le 1\} = \{x_0 + Lu : \|u\| \le 1\}, \]
where $Q \succ 0$ and $Q = L^T L$.
A polyhedron is the convex set described by a finite set of hyperplanes and/or halfspaces:
\[ \{x \in \mathbb{R}^n : a_i^T x \le b_i, \; i = 1, \cdots, m, \quad c_j^T x = d_j, \; j = 1, \cdots, p\}. \]
A polygon is a bounded polyhedron. Another representation of a polyhedron is given in terms of vertices:
\[ \Big\{ \sum_{i=1}^{n_1} \alpha_i v_i + \sum_{j=1}^{n_2} \beta_j r_j : \sum_i \alpha_i = 1, \; \alpha_i \ge 0, \; \beta_j \ge 0 \;\; \forall j \Big\}, \]
where the $v_i$ are called vertices and the $r_j$ are called affine rays.
Figure 2.5: Unbounded polygon generated by vertices and affine rays.
Figure 2.6: Bounded polygon.
Definition 2.1.5 A set $K$ is called a cone if for any $x \in K$ and $\alpha \ge 0$, $\alpha \in \mathbb{R}$, we have $\alpha x \in K$. We say that $K$ is a convex cone if $K$ is a convex set and a cone.
The conic combination of $p$ points $x_1, \cdots, x_p$ is defined as:
\[ \sum_{i=1}^p \alpha_i x_i, \quad \text{where } \alpha_i \ge 0. \]
CHAPTER 2. CONVEX THEORY 15
The conic hull of a set S, denoted Con(S), is the set containing all possible conic combi-
nations with elements from S:
Con(S) = {
iI,I nite
i
x
i
: x
i
S,
i
0}
Note that the conic hull of a set is the smallest cone that contains the given set.
Figure 2.7: Conic hull generated by two points x
1
and x
2
.
Figure 2.8: Conic hull generated by a set S.
For a given cone $K$ (with an associated scalar product $\langle \cdot, \cdot \rangle$) its dual cone, denoted $K^*$, is defined as:
\[ K^* = \{y : \langle x, y \rangle \ge 0 \;\; \forall x \in K\}. \]
Note that the dual cone is always a closed set. Using the fact that $\langle x, y \rangle = \|x\|\,\|y\| \cos \theta(x, y)$, we conclude that the angle between a vector from $K$ and a vector from $K^*$ is at most $\frac{\pi}{2}$. If the cone $K$ satisfies the condition $K = K^*$, it is called self-dual. Some important self-dual cones follow.
$\mathbb{R}^n_+ = \{x \in \mathbb{R}^n : x \ge 0\}$ is called the orthant cone and is self-dual with respect to the usual scalar product $\langle x, y \rangle = x^T y$, i.e. $(\mathbb{R}^n_+)^* = \mathbb{R}^n_+$.
$L^n = \{[x^T \; t]^T \in \mathbb{R}^{n+1} : \|x\| \le t\}$ is called the Lorentz cone or ice-cream cone, and it is also self-dual with respect to the scalar product $\langle [x^T \; t]^T, [y^T \; v]^T \rangle = x^T y + tv$, i.e. $(L^n)^* = L^n$.
$S^n_+ = \{X \in S^n : X \succeq 0\}$ is the semidefinite cone and is also self-dual with respect to the scalar product $\langle X, Y \rangle = \operatorname{Trace}(XY)$, i.e. $(S^n_+)^* = S^n_+$.
Figure 2.9: Ice-cream cone.
2.1.1 Operations that preserve convexity of sets
The intersection of convex sets is convex, i.e. if every set of the family $\{S_i\}_{i \in I}$ is convex, then $\cap_{i \in I} S_i$ is also convex.
The sum of two convex sets $S_1$ and $S_2$ is also convex: $S_1 + S_2 = \{x + y : x \in S_1, y \in S_2\}$. Moreover, $\alpha S = \{\alpha x : x \in S\}$ is convex if the set $S$ is convex and $\alpha \in \mathbb{R}$.
The affine image of a convex set $S$ is also convex, i.e. given an affine function $f(x) = Qx + b$, the image of $S$ through $f$, $f(S) = \{f(x) : x \in S\}$, is also convex. Similarly, the pre-image $f^{-1}(S) = \{x : f(x) \in S\}$ is also convex.
Linear Matrix Inequalities (LMI): It can be easily proved that the set of positive semidefinite matrices $S^n_+$ is convex. Let us now regard an affine map $G : \mathbb{R}^m \to S^n$, $G(x) = A_0 + \sum_{i=1}^m x_i A_i$, with symmetric matrices $A_0, \cdots, A_m \in S^n$. The expression
\[ G(x) \succeq 0 \]
is called a linear matrix inequality (LMI). It defines a convex set $\{x \in \mathbb{R}^m : G(x) \succeq 0\}$, as the pre-image of $S^n_+$ under the affine map $G(x)$.
Theorem 2.1.7 (Hyperplane separation theorem) Let $S_1$ and $S_2$ be two convex sets such that $S_1 \cap S_2 = \emptyset$. Then, there exists a hyperplane that separates these two sets, i.e. there exist $a \neq 0$ and $b \in \mathbb{R}$ such that $a^T x \le b$ for all $x \in S_1$ and $a^T x \ge b$ for all $x \in S_2$.
Figure 2.10: Separation theorem.
Theorem 2.1.8 (Supporting hyperplane theorem) Let $S$ be a convex set and $x_0 \in \operatorname{bd}(S) = \operatorname{cl}(S) \setminus \operatorname{int}(S)$. Then there exists a supporting hyperplane for $S$ at $x_0$, i.e. there exists $a \neq 0$ such that $a^T x \le a^T x_0$ for all $x \in S$.
2.2 Convex functions
Definition 2.2.1 The function $f : \mathbb{R}^n \to \mathbb{R}$ is called convex if its effective domain $\operatorname{dom} f$ is a convex set and
\[ f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \]
for all $x, y \in \operatorname{dom} f$ and $\alpha \in [0, 1]$.
If
\[ f(\alpha x + (1 - \alpha) y) < \alpha f(x) + (1 - \alpha) f(y), \]
for all $x \neq y \in \operatorname{dom} f$ and $\alpha \in (0, 1)$, then $f$ is called a strictly convex function.
If there is a constant $\sigma > 0$ such that
\[ f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \frac{\sigma}{2} \alpha (1 - \alpha) \|x - y\|^2, \]
for all $x, y \in \operatorname{dom} f$ and $\alpha \in [0, 1]$, then $f$ is called a strongly convex function.
Figure 2.11: Convex function.
Jensen's inequality tells us that $f$ is a convex function if and only if
\[ f\Big( \sum_{i=1}^p \alpha_i x_i \Big) \le \sum_{i=1}^p \alpha_i f(x_i) \]
for all $x_i \in \operatorname{dom} f$ and $\alpha_i \in [0, 1]$ with $\sum_i \alpha_i = 1$.
The geometrical interpretation of convexity is very simple. For a convex function the function values are below the corresponding chord, that is, the values of a convex function at points on the line segment $\alpha x + (1 - \alpha) y$ are less than or equal to the height of the chord joining the points $(x, f(x))$ and $(y, f(y))$.
Remark 2.2.2 A function is convex if and only if it is convex when restricted to any line that intersects its domain. Rephrased, $f$ is convex if and only if for all $x \in \operatorname{dom} f$ and for all $d$, the function $g(\alpha) = f(x + \alpha d)$ is convex on $\{\alpha \in \mathbb{R} : x + \alpha d \in \operatorname{dom} f\}$. This property is very useful for testing whether a function is convex by restricting it to a line.
A function $f : \mathbb{R}^n \to \mathbb{R}$ is called concave if $-f$ is convex.
2.2.1 First-order conditions for convex functions
Theorem 2.2.3 (Convexity for $C^1$ functions) Assume that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and $\operatorname{dom} f$ is a convex set. Then $f$ is convex if and only if
\[ f(y) \ge f(x) + \nabla f(x)^T (y - x) \qquad \forall x, y \in \operatorname{dom} f. \quad (2.1) \]
Proof: From the convexity of $f$ we have that for any $x, y \in \operatorname{dom} f$ and for any $\alpha \in [0, 1]$:
\[ f(x + \alpha(y - x)) - f(x) \le \alpha (f(y) - f(x)) \]
and therefore
\[ \nabla f(x)^T (y - x) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha(y - x)) - f(x)}{\alpha} \le f(y) - f(x). \]
Conversely, to prove that for $z = x + \alpha(y - x) = (1 - \alpha)x + \alpha y$ it holds that $f(z) \le (1 - \alpha) f(x) + \alpha f(y)$, let us use (2.1) twice at $z$, in order to obtain $f(x) \ge f(z) + \nabla f(z)^T (x - z)$ and $f(y) \ge f(z) + \nabla f(z)^T (y - z)$, which yield, when weighted with $(1 - \alpha)$ and $\alpha$ and added to each other,
\[ (1 - \alpha) f(x) + \alpha f(y) \ge f(z) + \nabla f(z)^T \underbrace{[(1 - \alpha)(x - z) + \alpha(y - z)]}_{= (1 - \alpha)x + \alpha y - z = 0}. \]
The interpretation is simple: the tangents are below the graph of a convex function.
A straightforward consequence of this theorem is the following statement: assume that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and convex; then
\[ (\nabla f(x) - \nabla f(y))^T (x - y) \ge 0 \qquad \forall x, y \in \operatorname{dom} f. \]
2.2.2 Second-order conditions for convex functions
Theorem 2.2.4 (Convexity for $C^2$ functions) Assume that $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable and $\operatorname{dom} f$ is convex. Then $f$ is convex if and only if for all $x \in \operatorname{dom} f$ the Hessian is positive semidefinite, i.e.
\[ \nabla^2 f(x) \succeq 0 \qquad \forall x \in \operatorname{dom} f. \quad (2.2) \]
Proof: To prove (2.1) $\Rightarrow$ (2.2) we use a second-order Taylor expansion of $f$ at $x$ in an arbitrary direction $d$:
\[ f(x + td) = f(x) + t \nabla f(x)^T d + \frac{1}{2} t^2 d^T \nabla^2 f(x) d + o(t^2 \|d\|^2). \]
From this we obtain
\[ d^T \nabla^2 f(x) d = \lim_{t \downarrow 0} \frac{2}{t^2} \underbrace{\big[ f(x + td) - f(x) - t \nabla f(x)^T d \big]}_{\ge 0, \text{ because of } (2.1)} \ge 0. \]
Conversely, to prove (2.2) $\Rightarrow$ (2.1) we use the Taylor rest-term formula with some $\tau \in [0, 1]$:
\[ f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} \underbrace{(y - x)^T \nabla^2 f(x + \tau(y - x))(y - x)}_{\ge 0, \text{ due to } (2.2)}. \]
Example 2.2.5
1. The function $f(x) = -\log(x)$ is convex on $(0, \infty)$ because $f''(x) = \frac{1}{x^2} > 0$ for all $x > 0$.
2. The quadratic function $f(x) = r + q^T x + \frac{1}{2} x^T Q x$ is convex on $\mathbb{R}^n$ if and only if $Q \succeq 0$, because $\nabla^2 f(x) = Q$ for all $x \in \mathbb{R}^n$. Note that any affine function is convex and concave at the same time.
3. The function $f(x, t) = \frac{x^T x}{t}$ is convex on $\mathbb{R}^n \times (0, \infty)$ because its Hessian
\[ \nabla^2 f(x, t) = \begin{bmatrix} \frac{2}{t} I_n & -\frac{2}{t^2} x \\ -\frac{2}{t^2} x^T & \frac{2}{t^3} x^T x \end{bmatrix} \]
is positive semidefinite. To see this, multiply it from left and right with $v = [z^T \; s]^T \in \mathbb{R}^{n+1}$, which yields $v^T \nabla^2 f(x, t) v = \frac{2}{t^3} \|tz - sx\|^2 \ge 0$ if $t > 0$.
Theorem 2.2.6 (Convexity of sublevel sets) The sublevel set $\{x \in \operatorname{dom} f : f(x) \le c\}$ of a convex function $f : \mathbb{R}^n \to \mathbb{R}$ is convex for any constant $c \in \mathbb{R}$.
Proof: If $f(x) \le c$ and $f(y) \le c$, then for any $\alpha \in [0, 1]$ it also holds that
\[ f((1 - \alpha)x + \alpha y) \le (1 - \alpha) f(x) + \alpha f(y) \le (1 - \alpha) c + \alpha c = c. \]
Epigraph of a function: Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. We define its epigraph as the following set:
\[ \operatorname{epi} f = \{[x^T \; t]^T \in \mathbb{R}^{n+1} : x \in \operatorname{dom} f, \; f(x) \le t\}. \]
Theorem 2.2.7 (Convexity of epigraph) A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if its epigraph is a convex set.
2.2.3 Operations that preserve convexity of functions
1. If $f_1$ and $f_2$ are convex functions and $\alpha_1, \alpha_2 \ge 0$, then $\alpha_1 f_1 + \alpha_2 f_2$ is also convex.
2. If $f$ is convex, then $g(x) = f(Ax + b)$ (i.e. the composition of a convex function with an affine function) is also convex.
3. Let $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ be such that $f(\cdot, y)$ is convex for any $y \in S \subseteq \mathbb{R}^m$. Then the new function
\[ g(x) = \sup_{y \in S} f(x, y) \]
is also convex.
4. Composition with a monotone convex function: if $f : \mathbb{R}^n \to \mathbb{R}$ is convex and $g : \mathbb{R} \to \mathbb{R}$ is convex and monotonically increasing, then the function $g \circ f : \mathbb{R}^n \to \mathbb{R}$ is also convex. For twice differentiable functions this follows from
\[ \nabla^2 (g \circ f)(x) = \underbrace{g''(f(x))}_{\ge 0} \underbrace{\nabla f(x) \nabla f(x)^T}_{\succeq 0} + \underbrace{g'(f(x))}_{\ge 0} \underbrace{\nabla^2 f(x)}_{\succeq 0} \succeq 0. \]
Conjugate functions: Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. We define its conjugate, denoted $f^*$, as the function
\[ f^*(y) = \sup_{x \in \mathbb{R}^n} \underbrace{y^T x - f(x)}_{F(x, y)}. \]
From the previous discussion it follows that $f^*$ is always a convex function (as a supremum of functions that are affine in $y$), and by definition $f^*(y) \ge y^T x - f(x)$ for all $x, y$.
Example 2.2.8 For the convex quadratic function $f(x) = \frac{1}{2} x^T Q x$, where $Q \succ 0$, we have
\[ f^*(y) = \frac{1}{2} y^T Q^{-1} y. \]
Chapter 3
Fundamental Concepts of Optimization
3.1 Introduction
Why are we interested in optimization problems? Optimization is used in many applications from diverse areas:
Business: allocation of resources in logistics, investment...
Science: estimation and fitting of models to measurement data, design of experiments...
Engineering: design and operation of technological systems such as bridges, cars, aircraft, digital devices...
3.1.1 The evolution of optimization
The birth of the theory of extremal problems (minimum/maximum problems) goes back centuries before the time of Christ. Ancient mathematicians were interested in a number of questions of isoperimetric type: e.g. what closed curve of a given length encloses the maximum area? This was the age of geometric approaches for solving such optimization problems, and the ancients used these approaches to derive the solutions. However, a rigorous answer to this type of problem was not given until the nineteenth century!
The isoperimetric problem can be traced back to the legendary story of queen Dido, told by Virgil in his Aeneid. Virgil tells of the escape of Dido from her treacherous brother in the first chapter of the Aeneid. Dido had to decide on the choice of a tract of land near the future city of Carthage, while satisfying the famous constraint of selecting "a space of ground, which (from the bull's hide) they first inclosed". According to the legend, the Phoenicians cut the oxhide into thin strips and enclosed a large expanse. It is now customary to think that the decision by Dido reduced to the isoperimetric problem of finding a figure of greatest area among those surrounded by a curve whose length is given. It is not excluded that Dido and her subjects solved practical versions of the problem, when the tower was to be located at the sea coast and part of the boundary coastline of the tract was somehow prescribed in advance. The foundation of Carthage is usually dated to the ninth century before Christ, when there was no hint of Euclidean geometry. Rope-stretching around stakes leads to convex figures. The Dido problem has a unique solution in the class of convex figures provided that the fixed nonempty part of the boundary is a convex polygonal line.
Figure 3.1: Dido's problem.
There are yet other methods the mathematicians in the days before calculus could have used to solve optimization problems, namely algebraic approaches. One of the most elegant is the arithmetic-geometric mean inequality:
\[ \frac{x_1 + x_2 + \cdots + x_n}{n} \ge (x_1 x_2 \cdots x_n)^{1/n}, \qquad x_i \ge 0, \; n \ge 1, \]
with equality if and only if $x_1 = x_2 = \cdots = x_n$.
For example, to show that of all rectangles with a given area it is the square that has the smallest perimeter, we can use this simple algebraic inequality: if we call the sides of the rectangle $x$ and $y$, then the problem is to determine them so that we minimize
\[ 2(x + y) \quad \text{subject to} \quad xy = A, \]
where $A$ is given. From the arithmetic-geometric mean inequality we get
\[ \frac{x + y}{2} \ge \sqrt{xy} = \sqrt{A}, \]
with equality if $x = y = \sqrt{A}$.
Decision making became a science in the twentieth century once calculus was developed. A simple civil engineering problem that can be solved using calculus is as follows: given two cities on opposite sides of a river of constant width $w$, located at distances $a$ and $b$ from the river and with lateral separation $d$, we need to find the optimal location where we should build a bridge so as to make the journey between the two cities as short as possible:
\[ \min_x f(x), \]
where $f(x) = \sqrt{x^2 + a^2} + w + \sqrt{b^2 + (d - x)^2}$. Imposing $f'(x) = 0$ gives the optimal location $x^* = \frac{ad}{a + b}$.
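The closed form can be checked against a numerical minimizer in a few lines (an illustrative SciPy sketch; the data a, b, d, w are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

a, b, d, w = 3.0, 2.0, 10.0, 1.0

# Total travel length as a function of the bridge position x
f = lambda x: np.sqrt(x**2 + a**2) + w + np.sqrt(b**2 + (d - x)**2)

res = minimize_scalar(f, bounds=(0.0, d), method='bounded')
print(res.x, a * d / (a + b))  # both are 6.0 (= ad/(a+b))
```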
Figure 3.2: Optimal location application.
3.2 What Characterizes an Optimization Problem?
An optimization problem consists of the following three ingredients:
An objective function, $f(x)$, that shall be minimized or maximized,
decision variables, $x$, that can be chosen, and
constraints that shall be respected, e.g. of the form $g(x) \le 0$ (inequality constraints) or $h(x) = 0$ (equality constraints).
3.2.1 Mathematical Formulation in Standard Form
\[ \min_{x \in \mathbb{R}^n} f(x) \]
\[ \text{s.t.} \quad g(x) \le 0, \quad (3.1) \]
\[ \phantom{\text{s.t.}} \quad h(x) = 0. \]
Here, $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^n \to \mathbb{R}^m$ and $h : \mathbb{R}^n \to \mathbb{R}^p$ are usually assumed to be differentiable. Note that the inequalities hold for all components, i.e.
\[ g(x) \le 0 \iff g_i(x) \le 0, \; i = 1, \ldots, m, \]
\[ h(x) = 0 \iff h_j(x) = 0, \; j = 1, \ldots, p. \]
Example 3.2.1
\[ \min_{x \in \mathbb{R}^2} \; x_1^2 + x_2^2 \]
\[ \text{s.t.} \quad x_1^2 + x_2 - 1 \le 0, \]
\[ \phantom{\text{s.t.}} \quad x_1 - x_2 - 1 = 0. \]
Definition 3.2.2
1. The set $\{x \in \mathbb{R}^n : f(x) = c\}$ is the level set of $f$ for the value $c$.
2. The feasible set is $X = \{x \in \mathbb{R}^n : g(x) \le 0, \; h(x) = 0\}$.
3. The point $x^* \in \mathbb{R}^n$ is a global minimizer (often also called a global minimum) if and only if $x^* \in X$ and $f(x^*) \le f(x)$ for all $x \in X$.
4. The point $x^* \in \mathbb{R}^n$ is a strict global minimizer iff $x^* \in X$ and $f(x^*) < f(x)$ for all $x \in X \setminus \{x^*\}$.
5. The point $x^* \in \mathbb{R}^n$ is a local minimizer iff $x^* \in X$ and there exists a neighborhood $N$ of $x^*$ so that $f(x^*) \le f(x)$ for all $x \in X \cap N$.
6. The point $x^* \in \mathbb{R}^n$ is a strict local minimizer iff $x^* \in X$ and there exists a neighborhood $N$ of $x^*$ so that $f(x^*) < f(x)$ for all $x \in (X \cap N) \setminus \{x^*\}$.
Figure 3.3: Local and global minima.
Example 3.2.3 Consider the following one-dimensional problem:
\[ \min_{x \in \mathbb{R}} \; \sin(x) \exp(x) \quad \text{s.t.} \quad x \ge 0, \; x \le 4. \]
The feasible set is $X = \{x \in \mathbb{R} : x \ge 0, x \le 4\} = [0, 4]$.
Three local minimizers (which?)
One global minimizer (which?)
An important issue in optimization theory is when minimizers exist.
Theorem 3.2.4 (Weierstrass) If the feasible set $X \subseteq \mathbb{R}^n$ is compact (i.e. bounded and closed) and $f : X \to \mathbb{R}$ is continuous, then there exists a global minimizer of the optimization problem $\min_{x \in X} f(x)$.
Proof: Regard the graph of $f$, $G = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : x \in X, f(x) = t\}$. Note that $G$ is a compact set, and so is the projection of $G$ onto its last coordinate, the set $\operatorname{Proj}_{\mathbb{R}} G = \{t \in \mathbb{R} : \exists x \text{ such that } (x, t) \in G\}$, which is a compact interval $[f_{\min}, f_{\max}] \subset \mathbb{R}$. By construction, there must be at least one $x^*$ so that $(x^*, f_{\min}) \in G$.
Thus, minimizers exist under fairly mild circumstances. Though the proof was constructive, it does not lend itself to an efficient algorithm. The topic of this lecture is how to practically find minimizers with the help of computer algorithms.
3.3 Types of Optimization Problems
In order to choose the right algorithm for a practical problem, we should know how to classify it and which mathematical structures can be exploited. Replacing an inadequate algorithm by a suitable one can make solution times many orders of magnitude shorter.
3.3.1 Nonlinear Programming (NLP)
In this lecture we mainly treat algorithms for general Nonlinear Programming (NLP) problems that are given in the form
\[ \min_{x \in \mathbb{R}^n} f(x) \]
\[ \text{s.t.} \quad g(x) \le 0, \quad (3.2) \]
\[ \phantom{\text{s.t.}} \quad h(x) = 0, \]
where $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^n \to \mathbb{R}^m$, $h : \mathbb{R}^n \to \mathbb{R}^p$ are assumed to be continuously differentiable at least once, often twice and sometimes more.
Many problems have more structure, which we should exploit in order to solve problems faster.
3.3.2 Linear Programming (LP)
When the functions $f$, $g$ and $h$ are affine in the general formulation (3.2), the general NLP becomes something easier to solve, namely a Linear Program (LP). Explicitly, an LP can be written as follows:
\[ \min_{x \in \mathbb{R}^n} c^T x \]
\[ \text{s.t.} \quad Ax - b = 0, \quad (3.3) \]
\[ \phantom{\text{s.t.}} \quad Cx - d \le 0. \]
Here, the problem data are given by $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{p \times n}$, $b \in \mathbb{R}^p$, $C \in \mathbb{R}^{m \times n}$, and $d \in \mathbb{R}^m$. Note that we could also have a constant contribution to the objective, i.e. $f(x) = c^T x + c_0$, but this would not change the minimizers $x^*$.
LPs can be solved very efficiently since the 1940s, when G. Dantzig invented the famous simplex method, an active-set method which is still widely used, but which got an equally efficient competitor in the so-called interior-point methods. LPs can nowadays be solved even if they have millions of variables and constraints, every business student knows how to use them, and LPs arise in myriads of applications. LP algorithms are not treated in detail in this lecture, but please recognize them if you encounter them in practice and use the right software.
Software: CPLEX, SOPLEX, lp_solve, LINGO, MATLAB (linprog), SeDuMi, YALMIP.
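To make the standard form concrete, here is a minimal sketch solving a tiny LP with SciPy's linprog (the data are arbitrary; SciPy takes the inequalities directly in the form $Cx \le d$):

```python
import numpy as np
from scipy.optimize import linprog

# min c^T x  s.t.  A x = b,  C x <= d  (and the solver's default bound x >= 0)
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]]); b = np.array([1.0])
C = np.array([[-1.0, 0.0], [0.0, -1.0]]); d = np.array([0.0, 0.0])

res = linprog(c, A_ub=C, b_ub=d, A_eq=A, b_eq=b)
print(res.x, res.fun)  # optimal point [1, 0], optimal value 1.0
```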
3.3.3 Quadratic Programming (QP)
If in the general NLP formulation (3.2) the constraints $g$, $h$ are affine (as for an LP), but the objective is a linear-quadratic function, we call the resulting problem a Quadratic Programming Problem or Quadratic Program (QP). A general QP can be formulated as follows:
\[ \min_{x \in \mathbb{R}^n} \frac{1}{2} x^T Q x + q^T x + r \]
\[ \text{s.t.} \quad Ax - b = 0, \quad (3.4) \]
\[ \phantom{\text{s.t.}} \quad Cx - d \le 0. \]
Here, in addition to the same problem data as in the LP, we also have the Hessian matrix $Q \in \mathbb{R}^{n \times n}$. Its name stems from the fact that $\nabla_x^2 f(x) = Q$, where $f(x) = \frac{1}{2} x^T Q x + q^T x + r$.
Convex QP. If the Hessian matrix $Q$ is positive semidefinite (i.e. $Q \succeq 0$) we call the QP (3.4) a convex QP. Convex QPs are tremendously easier to solve globally than non-convex QPs (i.e., those where the Hessian $Q$ is not positive semidefinite), which might have different local minima.
Strictly convex QP. If the Hessian matrix $Q$ is positive definite (i.e. $Q \succ 0$) we call the QP (3.4) a strictly convex QP. Strictly convex QPs are a subclass of convex QPs, but often still a bit easier to solve than non-strictly convex QPs.
Example 3.3.1 (non-convex QP)
\[ \min_{x \in \mathbb{R}^2} \frac{1}{2} x^T \begin{bmatrix} 5 & 0 \\ 0 & -1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 2 \end{bmatrix}^T x \]
\[ \text{s.t.} \quad -1 \le x_1 \le 1, \quad 1 \le x_2 \le 10. \]
This problem has local minimizers at $x_1^* = [0 \;\ 1]^T$ and $x_2^* = [0 \;\ 10]^T$, but only $x_2^*$ is a global minimizer.
Example 3.3.2 (strictly convex QP)
\[ \min_{x \in \mathbb{R}^2} \frac{1}{2} x^T \begin{bmatrix} 5 & 0 \\ 0 & 1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 2 \end{bmatrix}^T x \]
\[ \text{s.t.} \quad -1 \le x_1 \le 1, \quad 1 \le x_2 \le 10. \]
This problem has only one (strict) local minimizer at $x^* = [0 \;\ 1]^T$, which is also the global minimizer.
Software: MOSEK, MATLAB (quadprog), SeDuMi, YALMIP.
3.3.4 Convex Optimization (CP)
Both LPs and convex QPs are part of an important class of optimization problems, namely the convex optimization problems. An optimization problem with convex feasible set $X$ and convex objective function $f : \mathbb{R}^n \to \mathbb{R}$ is called a convex optimization problem (CP), i.e.
\[ \min_{x \in \mathbb{R}^n} f(x) \]
\[ \text{s.t.} \quad g(x) \le 0, \quad (3.5) \]
\[ \phantom{\text{s.t.}} \quad Ax - b = 0, \]
where $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^n \to \mathbb{R}^m$ are convex functions and the equality constraints are described by an affine function $h(x) = Ax - b$.
Example 3.3.3 (Quadratically Constrained Quadratic Program (QCQP)) A convex optimization problem of the form (3.5) with the functions $f$ and $g_i$ being convex quadratic is called a Quadratically Constrained Quadratic Program (QCQP):
\[ \min_{x \in \mathbb{R}^n} \frac{1}{2} x^T Q_0 x + q_0^T x + r_0 \]
\[ \text{s.t.} \quad \frac{1}{2} x^T Q_i x + q_i^T x + r_i \le 0, \quad i = 1, \cdots, m, \quad (3.6) \]
\[ \phantom{\text{s.t.}} \quad Ax - b = 0. \]
By choosing $Q_1 = \cdots = Q_m = 0$ we obtain a usual QP, and by also setting $Q_0 = 0$ we obtain an LP. Therefore, the class of QCQPs contains both LPs and QPs as subclasses.
Example 3.3.4 (Semidefinite Programming (SDP)) An interesting class of convex optimization problems makes use of linear matrix inequalities (LMI) in order to describe the feasible set. As it involves the constraint that some matrices should remain positive semidefinite, this problem class is called Semidefinite Programming (SDP). A general SDP can be formulated as:
\[ \min_{x \in \mathbb{R}^n} c^T x \]
\[ \text{s.t.} \quad A_0 + \sum_{i=1}^n A_i x_i \succeq 0, \quad (3.7) \]
\[ \phantom{\text{s.t.}} \quad Ax - b = 0, \]
where $A_i \in S^k$ for all $i = 0, \cdots, n$. It turns out that all LPs, QPs, and QCQPs can also be formulated as SDPs, besides several other convex problems. Semidefinite Programming is a very powerful tool in convex optimization.
Minimizing the largest eigenvalue can be formulated as an SDP. We regard a symmetric matrix $G(x)$ that depends affinely on some design variables $x \in \mathbb{R}^n$, i.e. $G(x) = A_0 + \sum_{i=1}^n A_i x_i$ with $A_i \in S^n$ for all $i = 0, \cdots, n$. If we want to minimize the largest eigenvalue of $G(x)$, i.e. to solve
\[ \min_x \lambda_{\max}(G(x)), \]
we can formulate this problem as an SDP by adding a slack variable $t \in \mathbb{R}$, as follows:
\[ \min_{t \in \mathbb{R},\, x \in \mathbb{R}^n} t \quad \text{s.t.} \quad tI_n - \sum_{i=1}^n A_i x_i - A_0 \succeq 0. \]
Software: Excellent tools to formulate and solve convex optimization problems in a MATLAB environment are YALMIP and CVX, which are available as open-source codes and easy to install. These are interfaces that use e.g. SeDuMi as a solver.
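Outside MATLAB, the same eigenvalue SDP can be written in a few lines with CVXPY (an illustrative sketch; the data matrices are random symmetric matrices and the solver choice is left to CVXPY):

```python
import numpy as np
import cvxpy as cp

n, m = 4, 2  # matrix size, number of design variables
rng = np.random.default_rng(0)
A = [s + s.T for s in rng.standard_normal((m + 1, n, n))]  # symmetric A_0..A_m

x = cp.Variable(m)
t = cp.Variable()
G = A[0] + sum(x[i] * A[i + 1] for i in range(m))

# min t  s.t.  t*I - G(x) >> 0, i.e. lambda_max(G(x)) <= t
prob = cp.Problem(cp.Minimize(t), [t * np.eye(n) - G >> 0])
prob.solve()
G_opt = A[0] + sum(x.value[i] * A[i + 1] for i in range(m))
print(t.value, np.max(np.linalg.eigvalsh(G_opt)))  # the two values agree
```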
3.3.5 Unconstrained Optimization Problems
Any NLP without constraints is called an unconstrained optimization problem. It has the general form
\[ \min_{x \in \mathbb{R}^n} f(x). \quad (3.8) \]
Unconstrained nonlinear optimization will be the focus of Part II of these lecture notes, while general constrained optimization problems are the focus of Part III.
3.3.6 Non-Differentiable Optimization Problems
If one or more of the problem functions $f$, $g$ and $h$ are not differentiable in an optimization problem (3.2), we speak of a non-differentiable or non-smooth optimization problem. Non-differentiable optimization problems are much harder to solve than general NLPs. A few solvers exist (Microsoft Excel solver, Nelder-Mead method, random search, genetic algorithms...), but they are typically much slower than derivative-based methods (which are the topic of this course).
3.3.7 Mixed-Integer Programming (MIP)
A mixed-integer problem is a problem with integer decisions. An MIP can be formulated as follows:
\[ \min_{x \in \mathbb{R}^n,\, z \in \mathbb{Z}^m} f(x, z) \]
\[ \text{s.t.} \quad g(x, z) \le 0, \quad (3.9) \]
\[ \phantom{\text{s.t.}} \quad h(x, z) = 0. \]
Mixed-integer nonlinear program (MINLP): if the functions $f$, $g$ and $h$ are twice differentiable in $x$ and $z$, we speak of a mixed-integer nonlinear program. Generally speaking, these problems are very hard to solve, due to the combinatorial nature of the variable $z$. However, if a relaxed problem, where the variables $z$ are no longer restricted to the integers but to the real numbers, is convex, often very efficient solution algorithms exist. More specifically, we would require that the following problem is convex:
\[ \min_{x \in \mathbb{R}^n,\, z \in \mathbb{R}^m} f(x, z) \quad \text{s.t.} \quad g(x, z) \le 0, \; h(x, z) = 0. \]
The efficient solution algorithms are often based on the technique of branch-and-bound, which uses partially relaxed problems where some of the $z$ are fixed to specific integer values and some of them are relaxed, and which exploits the fact that the solution of the relaxed problems can only be better than the best integer solution. This way, the search through the combinatorial tree can be made more efficient than pure enumeration. Two important examples of such problems are given in the following.
Mixed-integer linear program (MILP): if the functions $f$, $g$ and $h$ are affine in both $x$ and $z$, we speak of a mixed-integer linear program. A famous problem in this class is the traveling salesman problem.
Software: small to medium size (i.e. $n < 100$) problems of this form can efficiently be solved with codes such as the commercial code CPLEX or the free code lp_solve, which comes with a nice manual: http://lpsolve.sourceforge.net/5.5/.
Mixed-integer quadratic programs (MIQP): if $g$ and $h$ are affine functions and $f$ is convex quadratic in both $x$ and $z$, we speak of a mixed-integer QP (MIQP).
Software: small to medium size (i.e. $n < 100$) problems of this form are also efficiently solvable, mostly by commercial solvers (e.g. those in TOMLAB).
Part II
Unconstrained Optimization
Chapter 4
Optimality Conditions
In this part of the course we regard unconstrained optimization problems of the form
\[ \min_{x \in \mathbb{R}^n} f(x), \quad (4.1) \]
where we assume that the objective function $f : \mathbb{R}^n \to \mathbb{R}$ has an open effective domain $\operatorname{dom} f \subseteq \mathbb{R}^n$. We recall that in this course we assume the function $f$ to be extended to all of $\mathbb{R}^n$ by defining its value to be $+\infty$ outside its domain. Therefore, we are only interested in minimizers that lie inside $\operatorname{dom} f$. We might have $\operatorname{dom} f = \mathbb{R}^n$, but often this is not the case, as in the following example, where we consider $\operatorname{dom} f = (0, \infty)$ and we assume that outside $\operatorname{dom} f$ the function takes the value $+\infty$:
\[ \min_{x \in \mathbb{R}} \; \frac{1}{x} + x. \quad (4.2) \]
4.1 Necessary Optimality Conditions
A direction $d \in \mathbb{R}^n$ is called a descent direction for the function $f \in C^1$ at a point $x \in \operatorname{dom} f$ if
\[ \nabla f(x)^T d < 0. \]
Interpretation: if $d$ is a descent direction at $x \in \operatorname{dom} f$, then the objective function can be improved around $x$. Indeed, as $\operatorname{dom} f$ is open, we can find a $t > 0$ small enough so that for all $\tau \in [0, t]$ we have $x + \tau d \in \operatorname{dom} f$ and $\nabla f(x + \tau d)^T d < 0$ (due to continuity of $\nabla f(\cdot)$ around $x$). By Taylor's theorem, there exists some $\tau \in (0, t)$ such that
\[ f(x + td) = f(x) + t \underbrace{\nabla f(x + \tau d)^T d}_{< 0} < f(x). \]
Theorem 4.1.1 (First Order Necessary Conditions (FONC)) Let $f$ be a differentiable function (i.e. $f \in C^1$) and $x^*$ a local minimizer. Then
\[ \nabla f(x^*) = 0. \quad (4.3) \]
Proof: Let us assume by contradiction that $\nabla f(x^*) \neq 0$. Then $d = -\nabla f(x^*)$ is a descent direction at $x^*$. Indeed, we can find a $t > 0$ small enough such that for all $\tau \in [0, t]$ we have
\[ \nabla f(x^* + \tau d)^T d = -\nabla f(x^* - \tau \nabla f(x^*))^T \nabla f(x^*) < 0, \]
and by Taylor's theorem, for some $\tau \in (0, t)$,
\[ f(x^* - t \nabla f(x^*)) = f(x^*) - t \underbrace{\nabla f(x^* - \tau \nabla f(x^*))^T \nabla f(x^*)}_{> 0} < f(x^*). \]
This is a contradiction with our hypothesis that $x^*$ is a local minimizer.
Any point $x^*$ satisfying $\nabla f(x^*) = 0$ is called a stationary point.
Theorem 4.1.2 (Second Order Necessary Conditions (SONC)) Let $f \in C^2$ and let $x^*$ be a local minimizer. Then
\[ \nabla^2 f(x^*) \succeq 0. \quad (4.4) \]
Proof: If (4.4) does not hold, there exists some $d \in \mathbb{R}^n$ so that $d^T \nabla^2 f(x^*) d < 0$. By continuity of $\nabla^2 f$, for $t > 0$ small enough there exists $\tau \in (0, t)$ such that
\[ f(x^* + td) = f(x^*) + t \underbrace{\nabla f(x^*)^T d}_{= 0} + \frac{1}{2} t^2 \underbrace{d^T \nabla^2 f(x^* + \tau d) d}_{< 0} < f(x^*), \]
which is a contradiction with $x^*$ being a local minimizer.
4.2 Sufficient Optimality Conditions
Theorem 4.2.1 (Second Order Sufficient Conditions (SOSC)) Let $f \in C^2$ and let $x^*$ be a stationary point (i.e. $\nabla f(x^*) = 0$) with $\nabla^2 f(x^*) \succ 0$. Then $x^*$ is a strict local minimizer.
Proof: Let $\lambda_{\min}$ denote the smallest eigenvalue of $\nabla^2 f(x^*)$. Clearly, $\lambda_{\min} > 0$ since $\nabla^2 f(x^*) \succ 0$, and moreover
\[ d^T \nabla^2 f(x^*) d \ge \lambda_{\min} \|d\|^2 \qquad \forall d \in \mathbb{R}^n. \]
Using a Taylor expansion we have:
\[ f(x^* + d) - f(x^*) = \underbrace{\nabla f(x^*)^T d}_{= 0} + \frac{1}{2} d^T \nabla^2 f(x^*) d + \mathcal{R}(\|d\|^2) \ge \frac{\lambda_{\min}}{2} \|d\|^2 + \mathcal{R}(\|d\|^2) = \Big( \frac{\lambda_{\min}}{2} + \frac{\mathcal{R}(\|d\|^2)}{\|d\|^2} \Big) \|d\|^2. \]
Since $\lambda_{\min} > 0$, there exist $\epsilon > 0$ and $\delta > 0$ such that
\[ \frac{\lambda_{\min}}{2} + \frac{\mathcal{R}(\|d\|^2)}{\|d\|^2} \ge \delta \quad \text{for all } \|d\| \le \epsilon, \]
so $f(x^* + d) > f(x^*)$ for all $0 < \|d\| \le \epsilon$, i.e. $x^*$ is a strict local minimizer.
Note that the second order sufficient condition (SOSC) is not necessary for a stationary point $x^*$ to be a strict local minimizer: for example, $x^* = 0$ is a strict local minimizer of $f(x) = x^4$, although $f''(x^*) = 0$.
4.3 Optimality Conditions for Convex Problems
In this section we discuss sufficient conditions for optimality in the convex case. The first result refers to the following constrained optimization problem:
Theorem 4.3.1 Let $X$ be a convex set and $f \in C^1$ (not necessarily convex). For the constrained optimization problem
\[ \min_{x \in X} f(x) \]
the following conditions hold:
(i) If $x^*$ is a local minimizer, then $\nabla f(x^*)^T (x - x^*) \ge 0$ for all $x \in X$.
(ii) If $f$ is a convex function, then $x^*$ is a global minimizer if and only if $\nabla f(x^*)^T (x - x^*) \ge 0$ for all $x \in X$.
Proof: (i) Suppose that there exists a $y \in X$ such that
\[ \nabla f(x^*)^T (y - x^*) < 0. \]
Using one version of Taylor's theorem we have that for $t > 0$ there exists some $\tau \in [0, 1]$ such that
\[ f(x^* + t(y - x^*)) = f(x^*) + t \nabla f(x^* + \tau t(y - x^*))^T (y - x^*). \]
Since $\nabla f$ is continuous, taking $t$ small enough we have $\nabla f(x^* + \tau t(y - x^*))^T (y - x^*) < 0$ and thus $f(x^* + t(y - x^*)) < f(x^*)$, which contradicts the fact that $x^*$ is a local minimizer (note that $x^* + t(y - x^*) \in X$ for $t \in [0, 1]$, by convexity of $X$).
(ii) If $f$ is convex, then $f(x) \ge f(x^*) + \nabla f(x^*)^T (x - x^*)$ for all $x \in X$; since $\nabla f(x^*)^T (x - x^*) \ge 0$, it follows that $f(x) \ge f(x^*)$, i.e. $x^*$ is a global minimizer.
Theorem 4.3.2 For a convex optimization problem $\min_{x \in X} f(x)$ (i.e. $X$ convex and $f$ convex), every local minimizer $x^*$ is also a global minimizer.
Proof: Let $y \in X$ be arbitrary. Since $x^*$ is a local minimizer, there exists a neighborhood $N$ of $x^*$ containing points of the segment between $x^*$ and $y$, i.e. we have $x = x^* + t(y - x^*)$ with $t \le 1$ but $t > 0$, and $x \in X \cap N$. Due to local optimality and convexity, we have
\[ f(x^*) \le f(x^* + t(y - x^*)) \le f(x^*) + t(f(y) - f(x^*)). \]
It follows that $t(f(y) - f(x^*)) \ge 0$, and since $t > 0$ we get $f(y) \ge f(x^*)$, as desired.
Theorem 4.3.3 (Convex First Order Sufficient Conditions (cFOSC)) Let $f \in C^1$ be convex. If $x^*$ is a stationary point (i.e. $\nabla f(x^*) = 0$), then $x^*$ is a global minimizer of the unconstrained convex optimization problem $\min_{x \in \mathbb{R}^n} f(x)$.
Proof: Since $f$ is convex we have
\[ f(x) \ge f(x^*) + \underbrace{\nabla f(x^*)^T}_{= 0} (x - x^*) = f(x^*) \qquad \forall x \in \mathbb{R}^n. \]
In conclusion, for a general $C^1$ function a necessary condition for $x^*$ to be a local minimizer is
\[ \nabla f(x^*) = 0. \quad (4.5) \]
In general we will solve the nonlinear system of equations $\nabla f(x) = 0$ to find candidate minimizers. For a convex $C^1$ function, a necessary and sufficient condition for $x^*$ to be a global minimizer is
\[ \nabla f(x^*) = 0. \quad (4.6) \]
Citation [Rockafellar]: "The true watershed in optimization is not between linear and non-linear, but between convex and non-convex."
4.4 Perturbation Analysis
In numerical mathematics, we can never evaluate functions at precisions higher than machine precision. Thus, we usually compute only solutions to slightly perturbed problems, and are most interested in minimizers that are stable against small perturbations. This is the case for strict local minimizers that satisfy the second order sufficient condition.
To this aim we regard functions $f(x, a)$ that depend not only on $x \in \mathbb{R}^n$ but also on some disturbance parameter $a \in \mathbb{R}^m$. We are interested in the parametric family of problems $\min_x f(x, a)$ yielding minimizers $x^*(a)$ depending on $a$.
Theorem 4.4.1 (Stability of Parametric Solutions) Assume that $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is $C^2$, and regard the minimization of $f(\cdot, \bar{a})$ for a given fixed value of $\bar{a} \in \mathbb{R}^m$. If $\bar{x}$ satisfies the SOSC condition, i.e.
\[ \nabla_x f(\bar{x}, \bar{a}) = 0 \quad \text{and} \quad \nabla_x^2 f(\bar{x}, \bar{a}) \succ 0, \]
then there is a neighborhood $N \subseteq \mathbb{R}^m$ around $\bar{a}$ on which a differentiable parametric minimizer function $x^* : N \to \mathbb{R}^n$ is well defined, with $x^*(\bar{a}) = \bar{x}$. Its derivative at $\bar{a}$ is given by
\[ \frac{\partial x^*(\bar{a})}{\partial a} = -\big[ \nabla_x^2 f(\bar{x}, \bar{a}) \big]^{-1} \frac{\partial (\nabla_x f(\bar{x}, \bar{a}))}{\partial a}. \quad (4.7) \]
Moreover, each such $x^*(a)$ is a strict local minimizer of $f(\cdot, a)$. The existence and differentiability of $x^* : N \to \mathbb{R}^n$ follows from the implicit function theorem applied to the stationarity condition $\nabla_x f(x^*(a), a) = 0$; differentiating this identity with respect to $a$ gives
\[ 0 = \frac{d(\nabla_x f(x^*(a), a))}{da} = \underbrace{\frac{\partial (\nabla_x f(x^*(a), a))}{\partial x}}_{= \nabla_x^2 f} \frac{\partial x^*(a)}{\partial a} + \frac{\partial (\nabla_x f(x^*(a), a))}{\partial a}, \]
from which (4.7) follows.
Chapter 5
Convergence of Descent Direction Methods
The fact that all local minimizers $x^*$ are stationary points suggests solving the system
\[ \nabla f(x^*) = 0 \quad (5.1) \]
of $n$ equations with $n$ unknowns $x^*$. In some cases this system can be solved analytically:
Example 5.0.2 (Unconstrained QP) Let us consider the unconstrained convex QP
\[ \min_{x \in \mathbb{R}^n} \frac{1}{2} x^T Q x + q^T x + r, \quad (5.2) \]
where $Q \succ 0$. Due to the condition $0 = \nabla f(x^*) = Qx^* + q$, the unique minimizer is $x^* = -Q^{-1} q$. The optimal value of (5.2) is given by the following basic relation:
\[ \min_{x \in \mathbb{R}^n} \frac{1}{2} x^T Q x + q^T x + r = -\frac{1}{2} q^T Q^{-1} q + r. \quad (5.3) \]
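This closed form is easy to verify numerically (an illustrative NumPy sketch with arbitrary data):

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])  # Q positive definite
q = np.array([1.0, 2.0]); r = 0.0

x_star = -np.linalg.solve(Q, q)                       # x* = -Q^{-1} q
f_star = 0.5 * x_star @ Q @ x_star + q @ x_star + r  # optimal value
print(np.allclose(f_star, -0.5 * q @ np.linalg.solve(Q, q) + r))  # True
```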
However, in most cases the system $\nabla f(x) = 0$ cannot be solved analytically, and we use iterative methods that generate a sequence $\{x_k\}_k$ converging to $x^*$.
5.1 Rates of convergence
The order of convergence is the largest positive number $p$ which satisfies the following relation:
\[ 0 \le \overline{\lim_{k \to \infty}} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^p} < \infty, \]
where we recall that the lim sup of a sequence $\{z_k\}_k$ is defined as:
\[ \overline{\lim_{k \to \infty}}\, z_k = \lim_{n \to \infty} y_n, \quad \text{where } y_n = \sup_{k \ge n} z_k. \]
Assuming that the limit exists, $p$ indicates the behavior of the tail of the sequence. When $p$ is big, the convergence rate is high, because the distance to $x^*$ is roughly raised to the power $p$ in one step:
\[ \|x_{k+1} - x^*\| \approx \|x_k - x^*\|^p. \]
Linear convergence: if $\gamma \in (0, 1)$ and $p = 1$, then
\[ \|x_{k+1} - x^*\| \le \gamma \|x_k - x^*\| \]
and thus $\|x_k - x^*\| \le c\, \gamma^k$. For example, $x_k = \gamma^k$, where $\gamma \in (0, 1)$, converges linearly to $0$.
Superlinear convergence: if
\[ \lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0, \]
i.e. $\|x_{k+1} - x^*\| \le \gamma_k \|x_k - x^*\|$ with $\gamma_k \to 0$. For example, $x_k = \frac{1}{k!}$ converges superlinearly to $0$, as $\frac{x_{k+1}}{x_k} = \frac{1}{k+1} \to 0$.
Quadratic convergence: if
\[ \lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^2} = \gamma, \quad \text{where } \gamma \in (0, \infty) \text{ and } p = 2, \]
which is equivalent to
\[ \|x_{k+1} - x^*\| \le \gamma \|x_k - x^*\|^2. \]
For example, $x_k = \frac{1}{2^{2^k}}$ converges quadratically to $0$, because $\frac{x_{k+1}}{x_k^2} = \frac{(2^{2^k})^2}{2^{2^{k+1}}} = 1 < \infty$. For $k = 6$, $x_k = 2^{-64} \approx 0$, so in practice convergence up to machine precision is reached after roughly 6 iterations.
R-convergence: If the norm sequence $\|x_k - x^*\|$ is bounded above by a sequence $y_k \to 0$, i.e. $\|x_k - x^*\| \le y_k$, and $y_k$ converges with a given rate, i.e. Q-linearly, Q-superlinearly or Q-quadratically, then $x_k$ is said to converge R-linearly, R-superlinearly, or R-quadratically to $x^*$. Here, R indicates "root", because, e.g., R-linear convergence can also be defined via the root criterion $\overline{\lim}_{k \to \infty} \sqrt[k]{\|x_k - x^*\|} < 1$.
Example 5.1.1
\[ x_k = \begin{cases} \frac{1}{2^k} & \text{if } k \text{ even} \\ 0 & \text{else} \end{cases} \quad (5.4) \]
This is a fast R-linear convergence, but not as regular as Q-linear convergence.
Remark 5.1.2 The three different Q-convergence rates and the three different R-convergence rates have the following relations with each other. Here, $X \Rightarrow Y$ should be read as "if a sequence converges with rate X, this implies that the sequence also converges with rate Y":
\[ \text{Q-quadratic} \Rightarrow \text{Q-superlinear} \Rightarrow \text{Q-linear} \]
\[ \text{R-quadratic} \Rightarrow \text{R-superlinear} \Rightarrow \text{R-linear} \]
(and each Q-rate implies the corresponding R-rate). Note that the quadratic rate reaches convergence the fastest.
5.2 Convergence theorems
Considering a metric space $(X, \rho)$, a numerical method can be viewed as a point-to-set map $M : X \to 2^X$, defining iterations $x_{k+1} \in M(x_k)$. The degree of freedom given by the choice $x_{k+1} \in M(x_k)$ is used for taking into account specific computational details of various methods. However, a numerical method is not a random process, since such a method generates the same sequence $\{x_k\}_k$ when we start from the same initial point $x_0$. Defining the method this way offers a greater degree of freedom.
Example 5.2.1 We consider the following point-to-set map
\[ x_{k+1} \in \Big[ -\frac{|x_k|}{n}, \; \frac{|x_k|}{n} \Big], \]
and a particular realization, for a given $x_0$, is
\[ x_{k+1} = \frac{|x_k|}{n}. \]
Definition 5.2.2 Given a metric space $(X, \rho)$, a subset $S \subseteq X$ and a numerical method given as a point-to-set map $M : X \to 2^X$, we call $\phi : X \to \mathbb{R}$ a decreasing function for the pair $(S, M)$ if it satisfies the following two conditions:
(i) for all $x \in S$ and $y \in M(x)$ we have $\phi(y) \le \phi(x)$;
(ii) for all $x \notin S$ and $y \in M(x)$ we have $\phi(y) < \phi(x)$.
Example 5.2.3 Consider the optimization problem $\min_{x \in X} f(x)$, where $X$ is a convex set and $f$ is differentiable. We define $S = \{x^* \in \mathbb{R}^n : \langle \nabla f(x^*), x - x^* \rangle \ge 0 \;\; \forall x \in X\}$ (the set of stationary points) and take $\phi = f$ as the decreasing function.
Consider now an iterative descent method $x_{k+1} = x_k + \alpha_k d_k$, where $d_k$ is a descent direction. The ideal (exact) step size is
\[ \alpha_k = \arg\min_{\alpha \ge 0} f(x_k + \alpha d_k). \]
However, in many situations this univariate optimization problem is very difficult to solve. Therefore, other choices of $\alpha_k$ were developed, among which the most practical are the Wolfe conditions: choose $\alpha_k$ such that the following two conditions are satisfied:
(W1) $f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T d_k$, where $c_1 \in (0, 1)$;
(W2) $\nabla f(x_k + \alpha_k d_k)^T d_k \ge c_2 \nabla f(x_k)^T d_k$, where $0 < c_1 < c_2 < 1$.
We can also choose the step size $\alpha_k$ in another way that avoids searching for an $\alpha_k$ satisfying the second Wolfe condition, namely by backtracking:
0. choose $\bar{\alpha} > 0$, $\beta \in (0, 1)$, $c_1 \in (0, 1)$ and set $\alpha = \bar{\alpha}$;
1. while $f(x_k + \alpha d_k) > f(x_k) + c_1 \alpha \nabla f(x_k)^T d_k$, update $\alpha = \beta \alpha$;
2. set $\alpha_k = \alpha$.
In general the initial $\bar{\alpha} = 1$. Note that with backtracking we can find $\alpha_k$ in a finite number of steps. Moreover, the $\alpha_k$ found with this method is not too small, since $\alpha_k$ is close to $\alpha_k / \beta$, which was rejected at the previous iteration for being too long a step size.
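The backtracking rule takes only a few lines of code (an illustrative Python sketch; the parameter values are common defaults, not prescribed by the text):

```python
import numpy as np

def backtracking(f, grad_f, x, d, alpha=1.0, beta=0.5, c1=1e-4):
    """Shrink alpha until the sufficient-decrease condition (W1) holds."""
    slope = grad_f(x) @ d  # directional derivative, negative for a descent d
    while f(x + alpha * d) > f(x) + c1 * alpha * slope:
        alpha *= beta
    return alpha

# Usage on f(x) = ||x||^2 with the steepest-descent direction
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, -2.0])
d = -grad_f(x)
print(backtracking(f, grad_f, x, d))  # a step size satisfying (W1)
```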
Theorem 5.2.6 (Convergence theorem for descent direction methods) Let $\min_{x \in \mathbb{R}^n} f(x)$ be an optimization problem with $f \in C^1$, and consider the iterative method $x_{k+1} = x_k + \alpha_k d_k$, where $d_k$ is a descent direction. Furthermore, we assume that $f$ is bounded from below, the step size $\alpha_k$ is chosen to satisfy the two Wolfe conditions (W1)-(W2), and $\nabla f$ is Lipschitz continuous. Then
\[ \sum_{k=0}^{\infty} \cos^2 \theta_k \|\nabla f(x_k)\|^2 < \infty, \]
where $\theta_k$ is the angle made by $d_k$ with $-\nabla f(x_k)$.
Proof: Since $\nabla f$ is Lipschitz, there exists $L > 0$ so that
\[ \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \qquad \forall x, y. \]
From (W2) we have:
\[ (\nabla f(x_{k+1}) - \nabla f(x_k))^T d_k \ge (c_2 - 1) \nabla f(x_k)^T d_k. \]
Using the Cauchy-Schwarz inequality we obtain
\[ \|\nabla f(x_{k+1}) - \nabla f(x_k)\|\, \|d_k\| \ge (c_2 - 1) \nabla f(x_k)^T d_k. \]
Using now the Lipschitz property of the gradient we obtain
\[ L \alpha_k \|d_k\|^2 \ge (c_2 - 1) \nabla f(x_k)^T d_k, \quad \text{i.e.} \quad \alpha_k \ge \frac{c_2 - 1}{L} \frac{\nabla f(x_k)^T d_k}{\|d_k\|^2}. \]
From (W1) we have:
\[ f(x_{k+1}) \le f(x_k) + c_1 \frac{c_2 - 1}{L} \frac{(\nabla f(x_k)^T d_k)^2}{\|d_k\|^2}, \]
which leads to
\[ f(x_{k+1}) \le f(x_k) - c_1 \frac{1 - c_2}{L} \frac{(\nabla f(x_k)^T d_k)^2}{\|\nabla f(x_k)\|^2 \|d_k\|^2} \|\nabla f(x_k)\|^2. \]
In conclusion, using the notation $c = c_1 \frac{1 - c_2}{L}$, we get:
\[ f(x_{k+1}) \le f(x_k) - c \cos^2 \theta_k \|\nabla f(x_k)\|^2, \]
and thus by summing these inequalities we get:
\[ f(x_N) \le f(x_0) - c \sum_{j=0}^{N-1} \cos^2 \theta_j \|\nabla f(x_j)\|^2. \]
Since $f$ is bounded from below, letting $N \to \infty$ we obtain
\[ \sum_{k=0}^{\infty} \cos^2 \theta_k \|\nabla f(x_k)\|^2 < \infty, \quad \text{i.e.} \quad \cos^2 \theta_k \|\nabla f(x_k)\|^2 \to 0. \]
In particular, if the directions $d_k$ stay bounded away from orthogonality to the gradient, i.e. $\cos \theta_k \ge \delta > 0$ for all $k$, then $\nabla f(x_k) \to 0$, i.e. $x_k$ converges to a stationary point.
Chapter 6
First order methods
In this chapter we present first order methods (i.e. methods based on evaluations of the function and its gradient) for solving the following unconstrained optimization problem:
\[ f^* = \min_{x \in \mathbb{R}^n} f(x), \]
with $f \in C^2$.
6.1 The gradient method (steepest descent method)
The gradient method is based on the following iteration:
\[ x_{k+1} = x_k - \alpha_k \nabla f(x_k), \]
where the step size $\alpha_k$ is chosen using one of the three methods presented in the previous chapter: either exact (ideal), or satisfying the Wolfe conditions, or by backtracking.
Interpretation:
1. The direction $d = -\nabla f(x)$ used in the gradient method is a descent direction, since
\[ \nabla f(x)^T d = -\|\nabla f(x)\|^2 < 0 \quad \text{for all } x \text{ with } \nabla f(x) \neq 0. \]
2. The iterate $x_{k+1}$ is obtained by solving the following convex quadratic problem:
\[ x_{k+1} = \arg\min_{y \in \mathbb{R}^n} \; f(x_k) + \nabla f(x_k)^T (y - x_k) + \frac{1}{2\alpha_k} \|y - x_k\|^2, \]
i.e. we approximate the objective function $f$ locally around $x_k$ by a quadratic model with the Hessian given by $Q = \frac{1}{\alpha_k} I_n$.
3. The gradient method has the fastest local decrease; this is why it is also called the steepest descent method. Indeed, for all directions $d$ with $\|d\| = 1$ we have
\[ f(x + \alpha d) = f(x) + \alpha \nabla f(x)^T d + \mathcal{R}(\alpha). \]
From the Cauchy-Schwarz inequality we also get
\[ \nabla f(x)^T d \ge -\|\nabla f(x)\|\, \|d\| = -\|\nabla f(x)\|. \]
Using this inequality and taking the particular direction $d_0 = -\frac{\nabla f(x)}{\|\nabla f(x)\|}$, we get
\[ f(x + \alpha d) \ge f(x) - \alpha \|\nabla f(x)\| + \mathcal{R}(\alpha), \]
while
\[ f(x + \alpha d_0) = f(x) - \alpha \|\nabla f(x)\| + \mathcal{R}(\alpha), \]
i.e. the largest decrease is obtained for the anti-gradient direction $d_0$.
6.1.1 Convergence of the gradient method
Theorem 6.1.1 If the following conditions hold:
(i) $f$ is differentiable with $\nabla f$ continuous (i.e. $f \in C^1$);
(ii) the level set $S_{f(x_0)} = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$ is compact for all initial points $x_0$;
(iii) the step size satisfies the first Wolfe condition (W1);
then any limit point of the sequence $\{x_k\}_k$ is a stationary point.
Proof: We base our proof on the general convergence theorem presented in the previous chapter: we define the map
\[ M(x) = x - \alpha \nabla f(x). \]
Since $f$ is differentiable with $\nabla f$ continuous, it follows that $M(x)$ is a continuous point-to-point map and thus a closed map. We define $S = \{x^* \in \mathbb{R}^n : \nabla f(x^*) = 0\}$ and take $\phi = f$ as the decreasing function, so the general convergence theorem applies.
A second convergence result follows directly from Theorem 5.2.6 when the step sizes satisfy both Wolfe conditions and $\nabla f$ is Lipschitz: for the gradient method $d_k = -\nabla f(x_k)$, so $\theta_k = 0$ and $\cos \theta_k = 1$. In conclusion,
\[ \sum_{k \ge 0} \cos^2 \theta_k \|\nabla f(x_k)\|^2 = \sum_{k \ge 0} \|\nabla f(x_k)\|^2 < \infty. \]
It follows that the sequence $x_k$ has the property $\nabla f(x_k) \to 0$ as $k \to \infty$.
Remark 6.1.3 We observe that from the first convergence theorem for the gradient method we obtained that some subsequence of $\{x_k\}_k$ converges to a stationary point $x^*$, while from the second convergence theorem $\|\nabla f(x_k)\| \to 0$.
6.1.2 Choosing the optimal step size
In the case when the step size is constant over all iterations, i.e. $x_{k+1} = x_k - \alpha \nabla f(x_k)$, we are interested in finding the optimal $\alpha$ that guarantees the fastest convergence rate. We assume $\nabla f$ to be Lipschitz with constant $L > 0$. It follows that
\[ f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|x - y\|^2 \qquad \forall x, y. \]
Therefore,
\[ f(x_{k+1}) \le f(x_k) - \alpha \|\nabla f(x_k)\|^2 + \frac{L}{2} \alpha^2 \|\nabla f(x_k)\|^2 = f(x_k) - \alpha \Big( 1 - \frac{L\alpha}{2} \Big) \|\nabla f(x_k)\|^2. \]
The step size that guarantees the largest decrease per iteration is obtained from the condition
\[ \max_{\alpha > 0} \; \alpha \Big( 1 - \frac{L\alpha}{2} \Big), \quad \text{i.e.} \quad \alpha = \frac{1}{L}. \]
For the gradient method with constant step size, the optimal step size is therefore $\alpha = \frac{1}{L}$. In this case the decrease at each step is given by
\[ f(x_{k+1}) \le f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2, \]
and summing up these inequalities we obtain
\[ f(x_{N+1}) \le f(x_0) - \frac{1}{2L} \sum_{k=0}^N \|\nabla f(x_k)\|^2, \]
i.e.
\[ \frac{1}{2L} \sum_{k=0}^N \|\nabla f(x_k)\|^2 \le f(x_0) - f(x_{N+1}) \le f(x_0) - f^*. \]
Let us define
\[ g_N^* = \min_{k = 0, \ldots, N} \|\nabla f(x_k)\|. \]
It follows that
\[ \frac{1}{2L} (N + 1) (g_N^*)^2 \le f(x_0) - f^*. \]
In conclusion, after $N$ steps the following convergence rate is obtained:
\[ g_N^* \le \frac{1}{\sqrt{N + 1}} \sqrt{2L(f(x_0) - f^*)}, \]
i.e. the gradient method has in this case a sublinear convergence rate.
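The constant-step gradient method is short to state in code (a minimal illustrative sketch; the quadratic test problem and the choice $L = \lambda_{\max}(Q)$ are assumptions made for the demo):

```python
import numpy as np

def gradient_method(grad_f, x0, L, max_iter=500, tol=1e-8):
    """Gradient method with the optimal constant step size 1/L."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x -= g / L          # x_{k+1} = x_k - (1/L) grad f(x_k)
    return x

# Test on f(x) = 1/2 x^T Q x - q^T x, whose gradient is Qx - q
Q = np.array([[3.0, 1.0], [1.0, 2.0]]); q = np.array([1.0, 1.0])
L = np.max(np.linalg.eigvalsh(Q))   # Lipschitz constant of the gradient
x = gradient_method(lambda x: Q @ x - q, np.zeros(2), L)
print(np.allclose(x, np.linalg.solve(Q, q)))  # True
```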
Note that nothing can be said in this case about the convergence of $x_k$ to some stationary point $x^*$ or of $f(x_k)$ to the optimal value $f^*$.
6.2 The conjugate directions method
We consider the strictly convex quadratic problem
\[ \min_{x \in \mathbb{R}^n} \frac{1}{2} x^T Q x - q^T x, \qquad Q \succ 0, \]
whose unique minimizer satisfies the linear system $Qx^* = q$. Since $Q$ is invertible, $x^* = Q^{-1} q$. However, in many cases the computation of the inverse is very expensive; in the general case it has complexity $\mathcal{O}(n^3)$. In the sequel we present a less expensive numerical method for computing the solution $x^*$.
Definition 6.2.1 Two vectors $d_1$ and $d_2$ are called Q-orthogonal if $d_1^T Q d_2 = 0$. A set of vectors $\{d_1, d_2, \cdots, d_k\}$ is called Q-orthogonal if $d_i^T Q d_j = 0$ for all $i \neq j$.
Note that if $Q \succ 0$ and $\{d_1, d_2, \cdots, d_k\}$ are Q-orthogonal and different from zero, then they are linearly independent. Moreover, for the case $k = n$ they make a basis for $\mathbb{R}^n$. In conclusion, if $\{d_1, d_2, \cdots, d_n\}$ is Q-orthogonal and different from zero, there exist $\alpha_1, \cdots, \alpha_n \in \mathbb{R}$ such that $x^* = \alpha_1 d_1 + \alpha_2 d_2 + \cdots + \alpha_n d_n$ (i.e. a linear combination of the vectors from the basis). In order to find $\alpha_i$ we use
\[ \alpha_i = \frac{d_i^T Q x^*}{d_i^T Q d_i} = \frac{d_i^T q}{d_i^T Q d_i}. \]
We conclude that
\[ x^* = \sum_i \frac{d_i^T q}{d_i^T Q d_i}\, d_i, \]
and thus the sequence generated by
\[ x_{k+1} = x_k + \alpha_k d_k, \quad \alpha_k = -\frac{r_k^T d_k}{d_k^T Q d_k}, \quad r_k = Q x_k - q, \]
converges to $x^*$ in at most $n$ steps.
Note that the residual $r_k = Q x_k - q$ coincides with the gradient of the quadratic objective function.
Theorem 6.2.3 Let $\{d_1, d_2, \cdots, d_n\}$ be a Q-orthogonal vector set with nonzero elements and define the subspace $S_k = \operatorname{Span}\{d_1, d_2, \cdots, d_k\}$. Then for any $x_0 \in \mathbb{R}^n$ the sequence
\[ x_{k+1} = x_k + \alpha_k d_k, \quad \text{where } \alpha_k = -\frac{r_k^T d_k}{d_k^T Q d_k}, \]
has the following properties:
(i) $x_{k+1} = \arg\min_{x \in x_0 + S_k} \frac{1}{2} x^T Q x - q^T x$;
(ii) the residual at step $k$ is orthogonal to all the previous directions, i.e.
\[ r_k^T d_i = 0 \qquad \forall i < k. \]
From (ii) we obtain that
\[ \nabla f(x_k) \perp S_{k-1}. \]
The conjugate directions method for solving a strictly convex QP consists of the following steps:
0. Given $x_0 \in \mathbb{R}^n$, define $d_0 = -\nabla f(x_0) = -r_0 = -(Q x_0 - q)$.
1. $x_{k+1} = x_k + \alpha_k d_k$ with $\alpha_k = -\frac{r_k^T d_k}{d_k^T Q d_k}$.
2. $d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k$, where $\beta_k = \frac{r_{k+1}^T Q d_k}{d_k^T Q d_k}$.
Note that at every step a new direction is chosen as a linear combination of the current (anti)gradient and the previous direction. The conjugate directions method is computationally cheap since it uses simple update formulas (i.e. operations with vectors), as the sketch below shows.
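A compact implementation of these steps for a strictly convex QP (an illustrative Python sketch; the stopping tolerance and the test data are arbitrary):

```python
import numpy as np

def conjugate_directions(Q, q, x0, tol=1e-10):
    """Conjugate directions method for min 1/2 x'Qx - q'x with Q > 0."""
    x = x0.copy()
    r = Q @ x - q            # residual = gradient of the quadratic
    d = -r
    for _ in range(len(q)):  # converges in at most n steps
        if np.linalg.norm(r) < tol:
            break
        alpha = -(r @ d) / (d @ Q @ d)
        x = x + alpha * d
        r = Q @ x - q
        beta = (r @ Q @ d) / (d @ Q @ d)
        d = -r + beta * d
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]]); q = np.array([1.0, 2.0])
print(np.allclose(conjugate_directions(Q, q, np.zeros(2)),
                  np.linalg.solve(Q, q)))  # True
```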
Theorem 6.2.4 (Properties of the conjugate directions method) The following properties hold for the conjugate directions method:
(i) $\operatorname{Span}\{d_0, d_1, \cdots, d_k\} = \operatorname{Span}\{r_0, r_1, \cdots, r_k\} = \operatorname{Span}\{r_0, Q r_0, \cdots, Q^k r_0\}$;
(ii) $d_k^T Q d_i = 0$ for all $i < k$;
(iii) $\alpha_k = \frac{r_k^T r_k}{d_k^T Q d_k}$;
(iv) $\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$.
6.2.1 Extension to general unconstrained optimization problems
For a general unconstrained optimization problem $\min_{x \in \mathbb{R}^n} f(x)$, we repeat the same iterations as in the quadratic case with the following identifications:
\[ Q = \nabla^2 f(x_k), \qquad r_k = \nabla f(x_k), \]
for $n$ steps, and then we reinitialize $x_0 = x_n$ and repeat until $\|\nabla f(x_k)\| \le \epsilon$:
0. $r_0 = \nabla f(x_0)$ and $d_0 = -\nabla f(x_0)$;
1. $x_{k+1} = x_k + \alpha_k d_k$ for all $k = 0, \cdots, n - 1$, where $\alpha_k = -\frac{r_k^T d_k}{d_k^T \nabla^2 f(x_k) d_k}$;
2. $d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k$, where $\beta_k = \frac{r_{k+1}^T \nabla^2 f(x_k) d_k}{d_k^T \nabla^2 f(x_k) d_k}$;
3. after $n$ iterations we replace $x_0$ with $x_n$ and repeat the whole process.
Note that in the general case this method is not necessarily convergent. The algorithm can be made convergent by modifying $\beta_k$ adequately. We have the following updating rules:
Fletcher-Reeves: $\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$;
Polak-Ribiere: $\beta_k = \frac{(r_{k+1} - r_k)^T r_{k+1}}{r_k^T r_k}$.
Chapter 7
Newton Type Optimization
In this chapter we treat how to solve a general unconstrained nonlinear optimization problem using also information about the Hessian (second order information) or some approximation of it (i.e. still based on first order information):
\[ \min_{x \in \mathbb{R}^n} f(x), \quad (7.1) \]
with $f \in C^2$.
7.1 Newton method
In numerical analysis, Newton's method (or the Newton-Raphson method) is a method for finding roots of a system of equations in one or more dimensions. We consider the first order necessary condition for optimality, which reduces to the system of equations
\[ \nabla f(x^*) = 0, \]
with $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$, which has as many components as variables.
The Newton idea consists of linearizing the nonlinear equations at $x_k$ to find $x_{k+1} = x_k + d_k$:
\[ \nabla f(x_k) + \nabla^2 f(x_k) d_k = 0 \quad \Longrightarrow \quad d_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k). \]
In general, we call $d_k$ the Newton direction, and the Newton method consists of the following iteration:
\[ x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k). \]
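In code, one Newton step is one linear solve per iteration (a minimal illustrative sketch; the Newton system is solved with np.linalg.solve rather than forming the inverse, and the test function is an arbitrary smooth convex example):

```python
import numpy as np

def newton(grad_f, hess_f, x0, max_iter=50, tol=1e-10):
    """Pure Newton method: x_{k+1} = x_k - [hess f(x_k)]^{-1} grad f(x_k)."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x -= np.linalg.solve(hess_f(x), g)  # Newton direction via a linear solve
    return x

# Test on f(x) = x1^2 + x1^4 + x2^2, with minimizer at the origin
grad = lambda x: np.array([2 * x[0] + 4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[2 + 12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, np.array([1.0, 1.0])))  # approx [0, 0]
```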
Another interpretation of the Newton method for optimization can be obtained from a second order Taylor approximation of the objective function $f$. We recall the second order sufficient conditions for optimality (SOSC): if there exists an $x^*$ satisfying
\[ \nabla f(x^*) = 0 \quad \text{and} \quad \nabla^2 f(x^*) \succ 0, \]
then $x^*$ is a local minimum. If $x_k$ is close to $x^*$, we approximate $f$ around $x_k$ by the quadratic model $f(x_k) + \nabla f(x_k)^T d + \frac{1}{2} d^T \nabla^2 f(x_k) d$, and thus we define the Newton direction as
\[ d_k = \arg\min_d \; f(x_k) + \nabla f(x_k)^T d + \frac{1}{2} d^T \nabla^2 f(x_k) d. \]
Note that if $x_k$ is sufficiently close to $x^*$, then $\nabla^2 f(x_k) \succ 0$, and thus from the optimality conditions for a strictly convex QP we obtain again $d_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)$, i.e. the same formula, but with a different interpretation.
7.1.1 Local convergence rates
In this section we analyze the local convergence rate of the Newton method:
\[ x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k). \]
Theorem 7.1.1 (Quadratic convergence of the Newton method) Let $f \in C^2$ and let $x^*$ be a local minimum satisfying SOSC (i.e. $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$), so that there exists $l > 0$ with $\nabla^2 f(x^*) \succeq l I_n$. Moreover, we assume that $\nabla^2 f(x)$ is Lipschitz, i.e.
\[ \|\nabla^2 f(x) - \nabla^2 f(y)\| \le M \|x - y\| \qquad \forall x, y \in \operatorname{dom} f, \]
where $M > 0$. If $x_0$ is sufficiently close to $x^*$, i.e.
\[ \|x_0 - x^*\| \le \frac{2}{3} \frac{l}{M}, \]
then the Newton iteration $x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$ has the property that the sequence $\{x_k\}_k$ converges to $x^*$ quadratically.
Proof: Since $x^*$ is a stationary point, Taylor's theorem in integral form gives
\[ \nabla f(x_k) = \nabla f(x^*) + \int_0^1 \nabla^2 f(x^* + \tau(x_k - x^*))(x_k - x^*)\, d\tau. \]
We obtain:
\begin{align*}
x_{k+1} - x^* &= x_k - x^* - \nabla^2 f(x_k)^{-1} \nabla f(x_k) \\
&= \nabla^2 f(x_k)^{-1} \big[ \nabla^2 f(x_k)(x_k - x^*) - \nabla f(x_k) + \nabla f(x^*) \big] \\
&= \nabla^2 f(x_k)^{-1} \Big[ \nabla^2 f(x_k)(x_k - x^*) - \int_0^1 \nabla^2 f(x^* + \tau(x_k - x^*))(x_k - x^*)\, d\tau \Big] \\
&= \nabla^2 f(x_k)^{-1} \int_0^1 \big[ \nabla^2 f(x_k) - \nabla^2 f(x^* + \tau(x_k - x^*)) \big](x_k - x^*)\, d\tau.
\end{align*}
Since $\|\nabla^2 f(x_k) - \nabla^2 f(x^*)\| \le M \|x_k - x^*\|$, it follows that
\[ -M\|x_k - x^*\| I_n \preceq \nabla^2 f(x_k) - \nabla^2 f(x^*) \preceq M\|x_k - x^*\| I_n. \]
It follows that
\[ \nabla^2 f(x_k) \succeq \nabla^2 f(x^*) - M\|x_k - x^*\| I_n \succeq l I_n - M\|x_k - x^*\| I_n \succ 0, \]
provided that $\|x_k - x^*\| \le \frac{2}{3} \frac{l}{M}$, which leads to
\[ 0 \preceq \nabla^2 f(x_k)^{-1} \preceq \frac{1}{l - M\|x_k - x^*\|} I_n. \]
We conclude that
\begin{align*}
\|x_{k+1} - x^*\| &= \Big\| \nabla^2 f(x_k)^{-1} \int_0^1 \big[ \nabla^2 f(x_k) - \nabla^2 f(x^* + \tau(x_k - x^*)) \big]\, d\tau\, (x_k - x^*) \Big\| \\
&\le \frac{1}{l - M\|x_k - x^*\|} \int_0^1 M(1 - \tau) \|x_k - x^*\|\, d\tau\, \|x_k - x^*\| \\
&= \frac{1}{l - M\|x_k - x^*\|} \frac{M}{2} \|x_k - x^*\|^2.
\end{align*}
For $\|x_k - x^*\| \le \frac{2}{3} \frac{l}{M}$ we have $l - M\|x_k - x^*\| \ge \frac{l}{3}$, so $\|x_{k+1} - x^*\| \le \frac{3M}{2l} \|x_k - x^*\|^2 \le \|x_k - x^*\|$, i.e. the iterates remain in the region of quadratic convergence and $\{x_k\}_k$ converges quadratically to $x^*$.
Remarks:
1. In practice the Newton method is combined with a step size (damped Newton): $x_{k+1} = x_k - \alpha_k \nabla^2 f(x_k)^{-1} \nabla f(x_k)$. When $x_k$ is sufficiently close to $x^*$ the full step $\alpha_k = 1$ is taken, recovering quadratic convergence.
2. When the Hessian is not positive definite, the iteration can be regularized:
\[ x_{k+1} = x_k - (\lambda_k I_n + \nabla^2 f(x_k))^{-1} \nabla f(x_k). \]
3. The main disadvantage of this method is that we need to calculate the Hessian of $f$ and then to invert such a matrix (i.e. solve a linear system with it).
7.2 Quasi-Newton methods
As we have mentioned before, the main disadvantage of the Newton method is that the computation of the Hessian and of its inverse is expensive in general (at least for large problems). In quasi-Newton methods the purpose is to replace $\nabla^2 f(x_k)^{-1}$ with a matrix $H_k$ which can be calculated more easily, and thus we get the following iteration:
\[ x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k). \]
Note that the direction $d_k = -H_k \nabla f(x_k)$ is a descent direction if $H_k \succ 0$:
\[ \nabla f(x_k)^T d_k = -\underbrace{\nabla f(x_k)^T H_k \nabla f(x_k)}_{> 0} < 0. \]
Furthermore, the step size $\alpha_k$ is chosen to satisfy the Wolfe conditions. In general, as in the Newton method, when $x_k$ is sufficiently close to the solution $x^*$ we have $\alpha_k = 1$. The goal is to find update rules for the matrix $H_k$ such that asymptotically it converges to the true inverse of the Hessian, i.e.
\[ H_k \to \nabla^2 f(x^*)^{-1}. \]
From Taylor approximation we have:
f(x
k+1
) = f(x
k
+
k
) f(x
k
) +
2
f(x
k
)(x
k+1
x
k
)
In conclusion, approximating the true Hessian
2
f(x
k
) with a matrix B
k+1
we obtain the
following relation
f(x
k+1
) f(x
k
) = B
k+1
(x
k+1
x
k
)
or equivalently using the notation H
k+1
= B
1
k+1
H
k+1
[f(x
k+1
) f(x
k
)] = x
k+1
x
k
(7.2)
which is called the secant equation.
For H
1
k
=
2
f(x
k
) we recover Newtons method. Note that we have a similar interpreta-
tion as for the Newton method, namely in each iteration, a convex quadratic approximation
of the function is considered (i.e. B
k
0) and we minimize it to obtain the next direction:
d
k
= arg min
d
f(x
k
) +f(x
k
)
T
d +
1
2
d
T
B
k
d. (7.3)
Since the Hessian is symmetric we require for the matrices B
k
and H
k
to be symmetric
matrices as well. In conclusion we have n equations (from (??)) with
n(n+1)
2
unknowns (by
imposing symmetry) and thus we obtain an innite number of solutions. In the sequel we
will derive dierent update rules that satisfy (??) and symmetry.
CHAPTER 7. NEWTON TYPE OPTIMIZATION 62
7.2.1 Rank one updates
In this case we update the matrix H
k
using the following formula
H
k+1
= H
k
+
k
u
k
u
T
k
,
where we choose
k
R and u
k
R
n
such that the secant equation holds.
We start with a symmetric matrix H
0
0 and we denote with
k
= x
k+1
x
k
and
k
= f(x
k+1
) f(x
k
).
We update the matrix H
k+1
as follows
H
k+1
= H
k
+
(
k
H
k
k
)(
k
H
k
k
)
T
T
k
(
k
H
k
k
)
.
To guarantee that H
k
0 the following inequality must hold:
T
k
(
k
H
k
k
) > 0.
7.2.2 Rank two updates
In this case we update the matrix H
k
using the following formula:
H
k+1
= H
k
+
k
T
k
T
k
H
k
T
k
H
k
T
k
H
k
k
.
This update is called Davidon-Fletcher-Powell (DFP) update.
We again start with an initial matrix H
0
0. The following properties hold for the (DFP)
update
(i) H
k
0.
(ii) if f(x) =
1
2
x
T
Qx+q
T
x is quadratic and strictly convex then the (DFP) update yields
conjugate directions, i.e. d
k
= H
k
f(x
k
) are Q-conjugate directions. Moreover,
H
n
= Q
1
and in particular if H
0
= I
n
then the directions d
k
coincide with the
directions from the conjugate directions method. Therefore, we can nd the solution
of a quadratic problem in maximum n steps by using the (DFP) iterations.
CHAPTER 7. NEWTON TYPE OPTIMIZATION 63
If in the (DFP) method we do not use (??) but the equation f(x
k+1
) f(x
k
) =
B
k+1
(x
k+1
x
k
) we obtain the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update:
H
k+1
= H
k
+
H
k
T
k
+
k
T
k
H
k
T
k
H
k
k
H
k
T
k
H
k
T
k
H
k
k
= 1 +
T
k
T
k
H
k
k
The same proprieties are valid for (BFGS) method as for the (DFP) method. However,
from a numerical point of view (BFGS) is considered the most stable.
Remark 7.2.1 Note that quasi Newton methods require only rst order information. More-
over, the directions generated by the quasi Newton method are descent directions if we ensure
that H
k
0. In general H
k
(
2
f(x
))
1
and under certain assumptions we will show
that we have superlinear convergence.
7.2.3 Local convergence for quasi Newton method
Theorem 7.2.2 Let x
<
2(1 )
M
(7.5)
Then x
k
converge to x
superlinearly if
k
0.
CHAPTER 7. NEWTON TYPE OPTIMIZATION 64
Proof: We will show that x
k+1
x
k
x
k
x
with
k
< 1. For this aim let us
compute
x
k+1
x
= x
k
x
H
k
f(x
k
)
= x
k
x
H
k
(f(x
k
) f(x
))
= H
k
(H
1
k
(x
k
x
)) H
k
1
0
2
f(x
+ (x
k
x
))(x
k
x
)d
= H
k
(H
1
k
2
f(x
k
))(x
k
x
) H
k
1
0
_
2
f(x
+ (x
k
x
))
2
f(x
k
)
_
(x
k
x
)d.
Taking the norm on both sides we obtain:
x
k+1
x
k
x
k
x
1
0
Mx
+ (x
k
x
) x
k
d x
k
x
=
_
k
+ M
1
0
(1 )d
. .
=
1
2
x
k+1
x
_
x
k+1
x
=
_
k
+
M
2
x
k
x
_
. .
=
k
x
k
x
(
k
+
M
2
x
k
x
)
. .
0
x
k
x
Truncated Newton method: This approach is suitable for large scale problems and
consists in solving the linear system
2
f(x
k
)d = f(x
k
) (7.6)
inexactly, e.g. by iterative linear algebra.
Chapter 8
Estimation and Fitting Problems
Estimation and tting problems are optimization problems with a special objective, namely
a least squares objective
min
xR
n
1
2
M(x)
2
. (8.1)
Here, R
m
are the m measurements and M : R
n
R
m
is a model, and x R
n
are called model parameters. If the true value for x would be known, we could evaluate
the model M(x) to obtain model predictions for the measurements. The computation of
M(x), which might be a very complex function and for example involve the solution of a
dierential equation, is sometimes called the forward problem: for given model inputs,
we determine the model outputs.
In estimation and tting problems, as (??), the situation is opposite: we want to nd
those model parameters x that yield a prediction M(x) that is as close as possible to the
actual measurements . This problem is often called an inverse problem: for given model
outputs , we want to nd the corresponding model inputs x.
This type of optimization problem arises in applications like
function approximation
online estimation for process control
weather forecast (weather data reconciliation)
65
CHAPTER 8. ESTIMATION AND FITTING PROBLEMS 66
parameter estimation
8.1 Linear least squares
Denition 8.1.1 (Moore-Penrose Pseudo Inverse) Assume J R
mn
with rank(J) =
k, and that the singular value decomposition (SVD) of J is given by J = UV
T
. Then, the
Moore-Penrose pseudo inverse J
+
is given by:
J
+
= V
+
U
T
,
where for
=
_
2
.
.
.
k
0
.
.
.
0
_
_
holds
+
=
_
1
1
1
2
0
.
.
.
1
k
0
.
.
.
_
_
Theorem 8.1.2 If rank(J) = n, then
J
+
= (J
T
J)
1
J
T
.
If rank(J) = m, then
J
+
= J
T
(JJ
T
)
1
.
CHAPTER 8. ESTIMATION AND FITTING PROBLEMS 67
Proof: Let us compute
(J
T
J)
1
J
T
= (V
T
U
T
UV
T
)
1
V
T
U
T
= V (
T
)
1
V
T
V
T
U
T
= V (
T
)
1
T
U
T
= V
_
2
1
2
2
.
.
.
2
r
_
_
1
_
2
0
.
.
.
r
_
_
U
T
= V
+
U
T
.
Similarly for the other case.
Note that if rank(J) = n, i.e. the columns of J are linearly independent then J
T
J is
invertible.
Many models in estimation and tting problems are linear functions of x. If M is linear,
i.e. M(x) = Jx, then the objective function becomes f(x) =
1
2
Jx
2
which is a convex
function since
2
f(x) = J
T
J0. Assuming that rank(J) = n, the global minimizer is
found by
J
T
Jx
J
T
= 0 x
= (J
T
J)
1
J
T
. .
=J
+
. (8.2)
Example [Average linear least squares]: Let us regard the simple optimization prob-
lem:
min
xR
1
2
m
i=1
(
i
x)
2
.
This is a linear least squares problem, where the vector and the matrix J R
m1
are
given by
=
_
2
.
.
.
m
_
_
, J =
_
_
1
1
.
.
.
1
_
_
. (8.3)
CHAPTER 8. ESTIMATION AND FITTING PROBLEMS 68
Because J
T
J = m, it can be easily seen that
J
+
= (J
T
J)
1
J
T
=
1
m
_
1 1 1
(8.4)
so we conclude that the local minimizer equals the average of the given points
i
:
x
= J
+
=
1
m
m
i=1
i
= . (8.5)
Example [Linear Regression]:
Figure: Given a set of points which tend to a linear relation between two units
Given data points {t
i
}
i=m
i=1
with corresponding values {
i
}
i=m
i=1
, nd the 2-dimensional pa-
rameter vector x = (x
1
, x
2
), so that the polynomial of degree one p(t; x) = x
1
+x
2
t provides
a prediction of at time t. The corresponding optimization problem looks like:
min
xR
2
1
2
m
i=1
(
i
p(t
i
; x))
2
= min
xR
2
1
2
_
_
_
_
J
_
x
1
x
2
__
_
_
_
2
2
(8.6)
where is the same vector as in (??) and J is given by
J =
_
_
1 t
1
1 t
2
.
.
.
.
.
.
1 t
n
_
_
. (8.7)
The local minimizer is found by equation (??), whereas the calculation of (J
T
J) is straight-
forward:
J
T
J =
_
m
t
i
t
i
t
2
i
_
= m
_
1
t
t
t
2
_
(8.8)
In order to obtain x
, rst (J
T
J)
1
is calculated
1
:
(J
T
J)
1
=
1
det(J
T
J)
adj(J
T
J) =
1
m(
t
2
(
t)
2
)
_
t
2
t 1
_
. (8.9)
1
Recall that the adjugate of a matrix A R
nxn
is given by taking the transpose of the cofactor matrix,
adj(A) = C
T
where C
ij
= (1)
i+j
M
ij
with M
ij
the (i, j) minor of A.
CHAPTER 8. ESTIMATION AND FITTING PROBLEMS 69
Second, we compute J
T
as follows:
J
T
=
_
1 1
t
1
t
m
_
_
1
.
.
.
m
_
_
=
_
i
t
i
_
= m
_
nt
_
. (8.10)
Hence, the local minimizer is found by combining the expressions (??) and (??). Note that
t
2
(
t)
2
=
1
m
(t
i
t)
2
=
2
t
. (8.11)
where we used in the last transformation a standard denition of the variance
t
. The
correlation coecient is similarly dened by
=
(
i
)(t
i
t)
m
t
. (8.12)
The two-dimensional parameter vector x = (x
1
, x
2
) is found:
x
=
1
2
t
_
t
2
t +
t
_
=
_
_
. (8.13)
Finally, this can be written as a polynomial of rst degree:
p(t; x
) = + (t
t)
t
. (8.14)
What do we do with the teasers, the solution of teaser one is given at this point of the
lecture
8.2 Ill posed linear least squares
When J
T
J is invertible, the set of optimal solutions X
,
given by equation (??): X
= {(J
T
J)
1
J}. If J
T
J is not invertible, the set of solutions
X
is given by
X
= {x : f(x) = 0} = {x : J
T
Jx J
T
= 0}. (8.15)
CHAPTER 8. ESTIMATION AND FITTING PROBLEMS 70
In order to pick a unique point out of this set, we might choose to search for the minimum
norm solution, i.e. the vector x
.
min
xR
n
1
2
x
2
s.t. x X
. (8.16)
We will show below that this minimal norm solution is given by the Moore-Penrose pseudo
inverse.
8.2.1 Regularization for least squares
The minimum norm solution can be approximated by a regularized problem
min
x
1
2
Jx
2
2
+
2
x
2
2
, (8.17)
with small > 0. To get unique solution
f(x) = J
T
Jx J
T
+ x (8.18)
= (J
T
J + I)x J
T
(8.19)
x
= (J
T
J + I)
1
J
T
(8.20)
Lemma 8.2.1
lim
0
(J
T
J + I)
1
J
T
= J
+
.
Proof: Taking the SVD of J = UV
T
, (J
T
J + I)
1
J
T
can be written in the form:
(J
T
J + I)
1
J
T
= (V
T
U
T
UV
T
+ I
..
V V
T
)
1
J
T
..
U
T
V
T
= V (
T
+ I)
1
V
T
V
T
U
T
= V (
T
+ I)
1
T
U
T
CHAPTER 8. ESTIMATION AND FITTING PROBLEMS 71
Rewriting the right hand side of the equation explicitly:
= V
_
2
1
+
.
.
.
2
r
+
.
.
.
_
1
_
1
0
.
.
.
r
.
.
.
0
.
.
.
0 0
_
_
U
T
Calculating the matrix product simplies the equation:
= V
_
2
1
+
0
.
.
.
2
r
+
.
.
.
0
.
.
.
0
0
_
_
U
T
It can be easily seen that for 0 each diagonal element has the solution:
lim
0
2
i
+
=
_
1
i
if
i
= 0
0 if
i
= 0
(8.21)
i
) =
2
i
and
i
,
j
independent. Then holds
P(|x) =
m
i=1
P(
i
| x) (8.22)
= C
m
i=1
exp
_
(
i
M
i
(x))
2
2
2
i
_
(8.23)
log P(|x) = log(C) +
m
i=1
(
i
M
i
(x))
2
2
2
i
(8.24)
with a constant C. Due to monotonicity of the logarithm holds that the argument maxi-
mizing P(|x) is given by
arg max
xR
n
P(|x) = arg min
xR
n
log(P(|x)) (8.25)
= arg min
xR
n
m
i=1
(
i
M
i
(x))
2
2
2
(8.26)
= arg min
xR
n
1
2
S
1
( M(x))
2
2
(8.27)
Thus, the least squares problem has a statistical interpretation. Note that due to the fact
that we might have dierent standard deviations
i
for dierent measurements
i
we need
to scale both measurements and model functions in order to obtain an objective in the
usual least squares form
M(x)
2
2
, as
min
x
1
2
n
i=1
_
i
M
i
(x)
i
_
2
= min
x
1
2
S
1
( M(x))
2
2
(8.28)
= min
x
1
2
S
1
S
1
M(x)
2
2
(8.29)
with S =
_
1
.
.
.
m
_
_
.
CHAPTER 8. ESTIMATION AND FITTING PROBLEMS 73
Statistical Interpretation of Regularization terms: Note that a regularization term
like x x
2
2
that is added to the objective can be interpreted as a pseudo measurement
x of the parameter value x, which includes a statistical assumption: the smaller , the larger
we implicitly assume the standard deviation of this pseudo-measurement. As the data of a
regularization term are usually given before the actual measurements, regularization is also
often interpreted as a priori knowledge. Note that not only the Euclidean norm with one
scalar weighting can be chosen, but many other forms of regularization are possible, e.g.
terms of the form A(x x)
2
2
with some matrix A.
8.4 L1-estimation
Instead of using .
2
2
, i.e. the L2-norm in equation (??), we might alternatively use .
1
,
i.e., the L1-norm. This gives rise to the so called L1-estimation problem:
min
x
M(x)
1
= min
x
m
i=1
|
i
M
i
(x)| (8.30)
Like the L2-estimation problem, also the L1-estimation problem can be interpreted statisti-
cally as a maximum-likelihood estimate. However, in the L1-case, the measurement errors
are assumed to follow a Laplace distribution instead of a Gaussian.
An interesting observation is that the optimal L1-t of a constant x to a sample of dierent
scalar values
1
, . . . ,
m
just gives the median of this sample, i.e.
arg min
xR
m
i=1
|
i
x| = median of {
1
, . . . ,
m
}. (8.31)
Remember that the same problem with the L2-norm gave the average of
i
. Generally
speaking, the median is less sensitive to outliers than the average, and a detailed analysis
shows that the solution to general L1-estimation problems is also less sensitive to a few
outliers. Therefore, L1-estimation is sometimes also called robust parameter estimation.
CHAPTER 8. ESTIMATION AND FITTING PROBLEMS 74
8.5 Gauss-Newton (GN) Method
Linear least squares problems can be solved easily. Solving non-linear least squares prob-
lems globally is in general NP-hard, but in order to nd a local minimum we can iteratively
solve it, and in each iteration approximate the problem by its linearization at the current
guess. This way we obtain a better guess for the next iterate, etc., just as in Newtons
method for root nding problems.
For non-linear least squares problems of the form
min
x
1
2
M(x)
2
2
. .
=f(x)
(8.32)
the so called Gauss-Newton (GN) method is used. To describe this method, let us rst
for notational convenience introduce the shorthand F(x) = M(x) and redene the
objective to
f(x) =
1
2
F(x)
2
2
(8.33)
where F(x) is a nonlinear function F : R
m
R
n
with m > n (more measurements than
parameters). At a given point x
k
(iterate k), F(x) is linearized, and the next iterate x
k+1
obtained by solving a linear least squares problem. We expand
F(x) F(x
k
) + J(x
k
)(x x
k
) (8.34)
where J(x) is the Jacobian of F(x) which is dened as
J(x) =
F(x)
x
. (8.35)
Then, x
k+1
can be found as solution of the following linear least squares problem:
x
k+1
= arg min
x
1
2
F(x
k
) + J(x
k
)(x x
k
)
2
2
(8.36)
For simplicity, we write J(x
k
) as J and F(x
k
) as F:
x
k+1
= arg min
x
1
2
F + J(x x
k
)
2
2
(8.37)
= x
k
+ arg min
d
1
2
F + Jd
2
2
(8.38)
= x
k
(J
T
J)
1
J
T
F (8.39)
= x
k
+ d
GN
k
(8.40)
CHAPTER 8. ESTIMATION AND FITTING PROBLEMS 75
The Gauss-Newton method is only applicable to least-squares problems, because the method
linearizes the non-linear function inside the L2-norm. Note that in equation J
T
J might
not always be invertible.
8.6 Levenberg-Marquardt (LM) Method
This method is a generalization of the Gauss-Newton method that is in particular applicable
if J
T
J is not invertible, and can lead to more robust convergence far from a solution. The
Levenberg-Marquardt (LM) method makes the step p
k
smaller by penalizing the norm of
the step. It denes the the step as:
d
LM
k
= arg min
d
1
2
F(x
k
) + J(x
k
)d
2
2
+
k
2
d
2
2
(8.41)
= (J
T
J +
k
I)
1
J
T
F (8.42)
with some
k
> 0. Using this step, it iterates as usual
x
k+1
= x
k
+ d
LM
k
. (8.43)
If we would make
k
very big, we would not correct the point, but we would stay where we
are: for
k
we get d
LM
k
0. More precisely, d
LM
k
=
1
k
J
T
F + R
_
1
2
k
_
. On the other
hand, for small
k
, i.e. for
k
0 we get d
LM
k
J
+
F.
It is interesting to note that the gradient of the least squares objective function f(x) =
1
2
F(x)
2
2
equals
f(x) = J(x)
T
F(x), (8.44)
which is the rightmost term in the step of both the Gauss-Newton and the Levenberg-
Marquardt method. Thus, if the gradient equals zero, then also d
GN
k
= d
LM
k
= 0. This is
a necessary condition for convergence to stationary points: the GN and LM method both
stay at a point x
k
with f(x
k
) = 0. In the following chapter we will in much more detail
analyse the convergence properties of these two methods, which are in fact part of a larger
family, namely the Newton type optimization methods.
We now discuss the convergence theory for these methods (we use a similar reasoning as
in Theorem ??):
CHAPTER 8. ESTIMATION AND FITTING PROBLEMS 76
Theorem 8.6.1 Let x
<
2(1 )
M
(8.46)
Then x
k
converge to x
linearly if
k
> > 0.
Proof: See the proof of Theorem ??.
Chapter 9
Globalisation Strategies
A Newton-type method only converges locally if
+
2
x
0
x
< 1 (9.1)
(9.2)
x
0
x
2
1
(9.3)
Recall that is a Lipschitz constant of the Hessian that is bounding the non-linearity of
the problem, and is a measure of the approximation error of the Hessian.
But what if x
0
x
k = 0
while f(x
k
) > TOL do
obtain B
k
0
p
k
B
1
k
f(x
k
)
get t
k
from the backtracking algorithm
x
k+1
x
k
+ t
k
p
k
k k + 1
end while
x
x
k
Note: For computational eciency, f(x
k
) should only be evaluated once in each itera-
tion.
Because f(x
k+1
) f(x
k
) we have f(x
k
) f
p
k
) > f(x
k
) +
t
k
f(x
k
)
T
p
k
(9.11)
f(x
k
+
t
k
p
k
) f(x
k
)
. .
=f(x
k
+p
k
)
T
p
k
t
k
, with (0,
t
k
)
>
t
k
f(x
k
)
T
p
k
(9.12)
f(x
k
+ p
k
)
T
p
k
> f(x
k
)
T
p
k
(9.13)
(f(x
k
+ p
k
) f(x
k
))
T
. .
Lp
k
p
k
> (1 ) (f(x
k
)
T
p
k
)
. .
p
k
T
B
k
p
k
(9.14)
L
t
k
p
k
2
> (1 ) p
k
T
B
k
p
k
. .
1
c
2
p
k
2
2
(9.15)
t
k
>
(1 )
c
2
L
(9.16)
(Recall that
1
c
1
eig(B
k
)
1
c
2
).
We have shown that the step length will not be shorter than
(1)
c
2
L
, and will thus never
become zero.
9.3 Trust-region methods (TR)
Line-search methods and trust-region methods both generate steps with the help of a
quadratic model of the objective function, but they use this model in dierent ways. Line
search methods use it to generate a search direction and then focus their eorts on nding
a suitable step length along this direction. Trust-region methods dene a region around
the current iterate within they trust the model to be an adequate representation of the
objective function and then choose the step to be the approximate minimizer of the model
in this region. In eect, they choose the direction and length of the step simultaneously.
CHAPTER 9. GLOBALISATION STRATEGIES 82
If a step is not acceptable, they reduce the size of the region and nd a new minimzer. In
general, the direction of the step changes whenever the size of the trust region is altered.
The size of the trust region is critical to the eectiviness of each step. (cited from [?]).
The idea is to iterate x
k+1
= x
k
+ p
k
with
p
k
= arg min
pR
n
m
k
(x
k
+ p) s.t. p
k
(9.17)
Equation (??) is called the TR-Subproblem, and
k
> 0 is called the TR-Radius.
One particular advantage of this new type of subproblem is that we even can use indenite
Hessians without problems. Remember that for an indenite Hessian the unconstrained
quadratic model is not bounded below. A trust-region constraint will always ensure that
the feasible set of the subproblem is bounded so that it always has a well-dened minimizer.
Before dening the trustworthiness of a model, recall that:
m
k
(x
k
+ p) = f(x
k
) +f(x
k
)
T
p +
1
2
p
T
B
k
p (9.18)
Denition [Trustworthiness]: A measure for the trustworthiness of a model is the ratio
of actual and predicted decrease.
k
=
f(x
k
) f(x
k
+ p
k
)
m
k
(x
k
) m
k
(x
k
+ p
k
)
. .
>0 if f(x
k
)=0
=
A
red
P
red
(9.19)
We have f(x
k
+ p
k
) < f(x
k
) only if
k
> 0.
k
1 means a very trustworthy model.
The trust region algorithm is described in Algorithm ??.
The general convergence of the TR algorithm can be found in Theorem 4.5 in [?]
CHAPTER 9. GLOBALISATION STRATEGIES 83
Algorithm 3 Trust Region
Inputs:
max
, [0,
1
4
] (when do we accept a step),
0
, x
0
, TOL > 0
Output: x
k = 0
while f(x
k
) > TOL do
Solve the TR-subproblem ?? and get p
k
(approximately)
Compute
k
Adapt
k+1
:
if
k
<
1
4
then
k+1
k
1
4
(bad model: reduce radius)
else if
k
>
3
4
and p
k
=
k
then
k+1
min(2
k
,
max
) (good model: increase radius, but not too much)
else
k+1
k
end if
Decide on acceptance of step
if
k
> then
x
k+1
x
k
+ p
k
(we trust the model)
else
x
k+1
x
k
null step
end if
end while
x
x
k
Chapter 10
Calculating Derivatives
In the previous chapters we saw that we regularily need to calculate f and
2
f. There
are several methods for calculating these derivatives:
1. By hand
Expensive and error prone.
2. Symbolic dierentiation
Using Mathematica or Maple. The disadvantage is that the result is often a very long
code and expensive to evaluate.
3. Numerical dierentiation (nite dierences)
Easy and fast, but innacurate
f(x + tp) f(x)
t
f(x)
T
p (10.1)
How should we choose t? If we take t too small, the derivative will suer from
numerical noise. On the other hand, if we take t too large, the linearization error will
be dominant. A good rule of thumb is to use t =
mach
, with
mach
the machine
precision (or the precision of f, if it is lower than the machine precision).
84
CHAPTER 10. CALCULATING DERIVATIVES 85
The accuracy of this method is
mach
, which means in practice that we loose half the
valid digits compared to the function evaluation. Second order derivates are therefore
even more dicult to accurately calculate.
4. Imaginary trick in MATLAB
If f : R
n
R is analytic, then for t = 10
100
we have
f(x)
T
p =
(f(x + itp))
t
(10.2)
which can be calculated up to machine precision.
Proof:
g(z) = f(x + zp)
g(z) = g(0) + g
(0)z +
1
2
g
(0)z
2
+ O(z
3
)
g(it) = g(0) + g
(0)it +
1
2
g
(0)i
2
t
2
+ O(t
3
)
= g(0)
1
2
g
(0)t
2
+ g
(0)it + O(t
3
)
(g(it)) = g
(0)t + O(t
3
)
j<n+i
n+i
x
j
dx
j
dt
, i = 1, . . . , m (10.3)
with
dx
j
dt
x
j
.
Remarks:
in each sum, only one or two terms are non-zero,
CHAPTER 10. CALCULATING DERIVATIVES 87
Algorithm 5 Forward Automatic Dierentation
Input: x
1
, . . . , x
n
(and all partial derivatives
n+i
x
j
)
Output: x
n+m
for i = 1 to m do
x
n+i
j<n+i
n+i
x
j
x
j
end for
the previous two algorithms can be combined, eliminating the need to store the
intermediate variables.
Example [Forward Automatic Dierentiation]:
f(x
1
, x
2
, x
3
) = sin(x
1
x
2
) + exp(x
1
x
2
x
3
)
x
4
= x
1
x
2
x
5
= sin(x
4
)
x
6
= x
4
x
3
x
7
= exp(x
6
)
x
8
= x
5
+ x
7
x
4
= x
1
x
2
+ x
1
x
2
x
5
= cos(x
4
) x
4
x
6
= x
4
x
3
+ x
4
x
3
x
7
= exp(x
6
) x
6
x
8
= x
5
+ x
7
We can prove that cost(algo ??) cost(algo ??), or in other words that cost(f
T
p) 2
cost(f) (note that p x).
How do we get the full gradient of f? Call the algorithm n times with n dierent seed
vectors x R
n
:
CHAPTER 10. CALCULATING DERIVATIVES 88
x =
_
_
1
0
0
.
.
.
0
_
_
,
_
_
0
1
0
.
.
.
0
_
_
, . . . ,
_
_
0
0
0
.
.
.
1
_
_
(10.4)
And hence we have cost(f) 2n cost(f).
AD forward is slightly more expensive than numerical nite dierences (FD), but is exact
up to machine precision.
Software: ADOL-C, Adic, Adifor.
10.2 Automatic Dierentiation: Reverse (Backward)
Mode
Recall that in forward AD we used
dx
n+i
dt
=
j<n+i
n+i
x
j
dx
j
dt
. In reverse AD we will instead
use
df
dx
i
=
j>max(i,n)
df
dx
j
j
x
i
(10.5)
Notation: In the reverse mode of AD we use bar quantities instead of the dot quanti-
ties that we used in the forward mode. These quantities can be interpreted as derivatives
of the nal output with respect to the respective intermediate quantity. We write x
i
df
dx
i
,
so that e.g. f(x) =
_
_
x
1
.
.
.
x
n
_
_
.
CHAPTER 10. CALCULATING DERIVATIVES 89
Algorithm 6 Reverse Automatic Dierentiation
Input: all partial derivatives
j
x
i
Output: x
1
, x
2
, . . . , x
n
x
1
, x
2
, . . . , x
n+m1
0
x
n+m
1
for j = n + m down to n + 1 do
for all i < j do
x
i
x
i
+ x
j
j
x
i
end for
end for
Example [Reverse Automatic Dierentiation]:
x
1
, x
2
, . . . , x
7
0
x
8
1
(j = 8 : x
8
= x
5
+ x
7
))
x
5
x
5
+ 1 x
8
x
7
x
7
+ 1 x
8
(j = 7 : x
7
= exp(x
6
))
x
6
x
6
+ exp(x
6
) x
7
(j = 6 : x
6
= x
4
x
3
)
x
4
x
4
+ x
3
x
6
x
3
x
3
+ x
4
x
6
(j = 5 : x
5
= sin(x
4
))
x
4
x
4
+ cos(x
4
) x
5
(j = 4 : x
4
= x
1
x
2
)
x
1
x
1
+ x
2
x
4
x
2
x
2
+ x
1
x
4
Output: x
1
, x
2
, x
3
with f(x) =
_
_
x
1
x
2
x
3
_
_
The gradient is returned ine one reverse sweep. Furthermore, it can be shown that cost(algo
CHAPTER 10. CALCULATING DERIVATIVES 90
??) 5 cost(algo ??). In other words cost(f) 5 cost(f), regardless of the dimension
n!
The only disadvantage is that, unlike in forward AD, you have to store all intermediate
variables and partial derivatives in reverse AD.
Part III
Constrained Optimization
91
Chapter 11
The Lagrangian Function and Duality
Let us in this section regard a (not-necessarily convex) NLP in standard form (??) with
functions f : R
n
R, g : R
n
R
p
, and h : R
n
R
q
.
Denition [Primal Optimization Problem]: We will denote the globally optimal
value of the objective function subject to the constraints as the primal optimal value p
,
i.e.,
p
=
_
min
xR
n
f(x) s.t. g(x) = 0, h(x) 0
_
, (11.1)
and we will denote this optimization problem as the primal optimization problem.
Denition [Lagrangian Function and Lagrange Multipliers]: We dene the so
called Lagrangian function to be
L(x, , ) = f(x)
T
g(x)
T
h(x). (11.2)
Here, we have introduced the so called Lagrange multipliers or dual variables R
p
and R
q
. The Lagrangian function plays a crucial role in both convex and general
nonlinear optimization. We typically require the inequality multipliers to be positive,
0, while the sign of the equality multipliers is arbitrary. This is motivated by the
following basic lemma.
92
CHAPTER 11. THE LAGRANGIAN FUNCTION AND DUALITY 93
Lemma [Lower Bound Property of Lagrangian]: If x is a feasible point of (??) and
0, then
L( x, , ) f( x). (11.3)
Proof: L( x, , ) = f( x)
T
g( x)
..
=0
T
..
0
h( x)
..
0
f( x).
Denition [Lagrange Dual Function]: We dene the so called Lagrange dual func-
tion as the unconstrained inmum of the Lagrangian over x, for xed multipliers , .
q(, ) = inf
xR
n
L(x, , ). (11.4)
This function will often take the value , in which case we will say that the pair (, )
is dual infeasible for reasons that we motivate in the last example of this subsection.
Lemma [Lower Bound Property of Lagrange Dual]: If 0, then
q(, ) p
(11.5)
Proof: The lemma is an immediate consequence of Eq. (??) which implies that for any
feasible x holds q(, ) f( x). This inequality holds in particular for the global minimizer
x
) = p
.
Theorem [Concavity of Lagrange Dual]: The function q : R
p
R
q
R is concave,
even if the original NLP was not convex.
Proof: We will show that q is convex. The Lagrangian L is an ane function in the
multipliers and , which in particular implies that L is convex in (, ). Thus, the
function q(, ) = sup
x
L(x, , ) is the supremum of convex functions in (, ) that
are indexed by x, and therefore convex.
CHAPTER 11. THE LAGRANGIAN FUNCTION AND DUALITY 94
A natural question to ask is what is the best lower bound that we can get from the Lagrange
dual function. We obtain it by maximizing the Lagrange dual over all possible multiplier
values, yielding the so called dual problem.
Denition [Dual Problem]: The dual problem with dual optimal value d
is de-
ned as the convex maximization problem
d
=
_
max
R
p
,R
q
q(, ) s.t. 0
_
. (11.6)
It is interesting to note that the dual problem is always convex, even if the so called primal
problem is not.
As an immediate consequence of the last lemma, we obtain a very fundamental result that
is called weak duality
Theorem [Weak Duality]:
d
(11.7)
This theorem holds for any arbitrary optimization problem, but does only unfold its full
strength in convex optimization, where very often holds a strong version of duality, which
we will not prove in this course.
Theorem [Strong Duality]: If the primal optimization problem (??) is convex and
a technical constraint qualication (e.g. Slaters condition) holds, then primal and dual
objective are equal to each other,
d
= p
. (11.8)
Strong duality allows us to reformulate a convex optimization problem into its dual, which
looks very dierently but gives the same solution. We will look at this at hand of two
examples.
CHAPTER 11. THE LAGRANGIAN FUNCTION AND DUALITY 95
Example [Dual of a strictly convex QP]: We regard the following strictly convex QP
(i.e., with B0)
p
= min
xR
n
c
T
x +
1
2
x
T
Bx (11.9a)
subject to Ax b = 0, (11.9b)
Cx d 0. (11.9c)
Its Lagrangian function is given by
L(x, , ) = c
T
x +
1
2
x
T
Bx
T
(Ax b)
T
(Cx d)
=
T
b +
T
d +
1
2
x
T
Bx +
_
c A
T
C
T
_
T
x.
The Lagrange dual function is the inmum value of the Lagrangian with respect to x, which
only enters the last two terms in the above expression. We obtain
q(, ) =
T
b +
T
d + inf
xR
n
_
1
2
x
T
Bx +
_
c A
T
C
T
_
T
x
_
=
T
b +
T
d
1
2
_
c A
T
C
T
_
T
B
1
_
c A
T
C
T
_
where we have made use of the basic result (??) in the last row.
Therefore, the dual optimization problem of the QP (??) is given by
d
= max
R
p
,R
q
1
2
c
T
B
1
c +
_
b + AB
1
c
d + CB
1
c
_
T
_
1
2
_
_
T
_
A
C
_
B
1
_
A
C
_
T
_
_
(11.10a)
subject to 0. (11.10b)
Due to the fact that the objective is concave, this problem is again a convex QP, but not a
strictly convex one. Note that the rst term is a constant, but that we have to keep it in
order to make sure that d
= p
= min
xR
n
c
T
x (11.11a)
subject to Ax b = 0, (11.11b)
Cx d 0. (11.11c)
Its Lagrangian function is given by
L(x, , ) = c
T
x
T
(Ax b)
T
(Cx d)
=
T
b +
T
d +
_
c A
T
C
T
_
T
x.
Here, the Lagrange dual is
q(, ) =
T
b +
T
d + inf
xR
n
_
c A
T
C
T
_
T
x
=
T
b +
T
d +
_
0 if c A
T
C
T
= 0
else.
Thus, the objective function q(, ) of the dual optimization problem is at all points
that do not satisfy the linear equality c A
T
C
T
= 0. As we want to maximize, these
points can be regarded as infeasible points of the dual problem (that is why we called them
dual infeasible), and we can explicitly write the dual of the above LP (??) as
d
= max
R
p
,R
q
_
b
d
_
T
_
_
(11.12a)
subject to c A
T
C
T
= 0, (11.12b)
0. (11.12c)
This is again an LP and it can be proven that strong duality holds for all LPs for which
at least one feasible point exists, i.e. we have d
= p
if there exists a
smooth curve x(t) : [0, ) R
n
with x(0) = x
(x
) of at x
.
97
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION98
Example [Tangent Cone]:
h(x) =
_
(x
1
1)
2
+ x
2
2
1
(x
2
2)
2
x
2
1
+ 4
_
(12.2)
x =
_
0
4
_
: T
(x
) =
_
p|p
T
_
0
1
_
0
_
= R R
++
(12.3)
x =
_
0
0
_
: T
(x
) =
_
p|p
T
_
1
0
_
0 & p
T
_
0
1
_
0
_
= R
++
R
++
(12.4)
Insert gure for this example.
12.1 Karush-Kuhn-Tucker (KKT) Necessary Optimal-
ity Conditions
Theorem [First Order Necessary Conditions, variant 0]: If x
is a local minimum
of the NLP (??) then
1. x
2. for all tangents p T
(x
) holds: f(x
)
T
p 0
Proof: (By contradiction) If p T
(x
) with f(x
)
T
p < 0 there would exist a feasible
curve x(t) with
df( x(t))
dt
t=0
= f(x
)
T
p < 0.
12.2 Active constraints and constraint qualication
How can we characterize T
(x
)?
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION99
Denition [Active/Inactive Constraint]: An inequality constraint h
i
(x) 0 is called
active at x
i h
i
(x
(x
).
Denition [LICQ]: The linear independence constraint qualication (LICQ) holds at
x
i all vectors g
i
(x
) for i A(x
) are linearly
independent.
Note: this is a technical condition, and is usually satised.
Insert gure that illustrates LICQ, and also illustrates why one should usually avoid re-
placing an equality with two inequalities.
Denition [Linearized Feasible Cone]: F(x
) = {p|g
i
(x
)
T
p = 0, i = 1, . . . , m & h
i
(x
)
T
p
0, i A(x
.
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION100
Example [Linearized Feasible Cone]:
h(x) =
_
(x
1
1)
2
+ x
2
2
1
(x
2
2)
2
x
2
1
+ 4
_
(12.5)
x
=
_
0
4
_
, A(x
) = {2} (12.6)
h
2
(x) =
_
2x
1
2(x
2
2)
_
(12.7)
=
_
0
4
_
(12.8)
F(x
) =
_
p|
_
0
4
_
T
p 0
_
(12.9)
Lemma: At any x
holds
1. T
(x
) F(x
)
2. If LICQ holds at x
then T
(x
) = F(x
).
Proof:
1.
p T
x(t) with p =
d x
dt
t=0
& x(0) = x
dg
i
( x(t))
dt
= g
i
(x
)
T
p = 0, i = 1, . . . , m (12.13)
dh
i
( x(t))
dt
t=0
= lim
t0+
h
i
( x(t)) h
i
(x
)
t
0 (12.14)
for i A(x
) : h
i
(x
) = 0 & h
i
( x(t)) 0
dh
i
dt
h
i
(x
)
T
p 0 (12.15)
p F(x
) (12.16)
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION101
2. For the full proof see [Noc2006]. The idea is to use the implicit function theorem to
construct a curve x(t) which has a given vector p F(x
) as tangent.
and x
2. p F(x
) : f(x
)
T
p 0.
How can we simplify the second condition? Here helps the following lemma. To interpret
it, remember that F(x
) = {p|Gp = 0, Hp 0} with G =
dg
dx
(x
), H =
_
h
i
(x
)
T
.
.
.
_
with
i A(x
).
Farkas Lemma: For any matrices G R
mn
, H R
qn
and vector c R
n
holds
either R
m
, R
q
with 0 & c = G
T
+ H
T
(12.17)
or p R
n
with Gp = 0 & Hp 0 & c
T
p < 0 (12.18)
but never both (theorem of alternatives).
Proof: In the proof we use the separating hyperplane theorem with respect to the point
c R
n
and the set S = {G
T
+ H
T
| R
n
, R
q
, 0}. S is a convex cone. The
separating hyperplane theorem states that two convex sets in our case the set S and the
point c can always be separated by a hyperplane. In our case, the hyperplane touches
the set S at the origin, and is described by a normal vector p. Separation of S and c means
that for all y S holds that y
T
p 0 and on the other hand, c
T
p < 0.
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION102
Either c S (??) (12.19)
or c / S (12.20)
p R
n
: y S : p
T
y 0 & p
T
c < 0 (12.21)
p R
n
: , with 0 : p
T
(G
T
+ H
T
) 0 & p
T
c < 0 (12.22)
p R
n
: Gp = 0 & Hp 0 & p
T
c < 0 (??) (12.23)
From Farkas lemma follows the desired simplication of the previous theorem:
Theorem (variant 2) [KKT Conditions]: If x
) g(x
) h(x
) = 0 (12.24a)
g(x
) = 0 (12.24b)
h(x
) 0 (12.24c)
0 (12.24d)
i
h
i
(x
) = 0, i = 1, . . . , q. (12.24e)
Note: The KKT conditions are the First order necessary conditions for optimality (FONC)
for constrained optimization, and are thus the equivalent to f(x
) = 0 in unconstrained
optimization.
Proof: We know already that (??), (??) x
) : p
T
f(x
) : p
T
f(x
) : p
T
f(x
) < 0 (12.25)
,
i
0 : f(x
) =
g
i
(x
)
i
+
iA(x
)
h
i
(x
)
i
(12.26)
(12.27)
Now we set all components of that are not element of A(x
) to zero, i.e.
i
= 0 if h
i
(x
) >
0, and conditions (??) and (??) are trivially satised, as well as (??) due to
iA(x
)
h
i
(x
)
i
=
i={1,...,q}
h
i
(x
)
i
if
i
= 0 for i / A(x
).
Though it is not necessary for the proof of the necessity of the optimality conditions of the
above theorem (variant 2), we point out that the theorem is 100 % equivalent to variant 1,
but has the computational advantage that its conditions can be checked easily: if someone
gives you a triple (x
, , ) = 0. In absence
of inequalities, the KKT conditions simplify to
x
L(x, ) = 0, g(x) = 0, a formulation that
is due to Lagrange and was much earlier known than the KKT conditions.
Example [KKT Condition]:
minimize
x R
2
_
0
1
_
T
x (12.28)
subject to
_
x
2
1
+ x
2
2
1
(x
2
2)
2
x
2
1
+ 4
_
0 (12.29)
(12.30)
Does the local minimizer x
=
_
0
4
_
satisfy the KKT conditions?
First:
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION104
A(x
) = {2} (12.31)
f(x
) =
_
0
1
_
(12.32)
h
2
(x
) =
_
0
4
_
(12.33)
Then we write down the KKT conditions, which are for the specic dimensions of this
example equivalent to the right hand side terms:
(??) f(x
) h
1
(x
)
1
h
2
(x
)
2
= 0 (12.34)
(??) (12.35)
(??) h
1
(x
) 0 & h
2
(x
) 0 (12.36)
(??)
1
0 &
2
0 (12.37)
(??)
1
h
1
(x
) = 0 &
2
h
2
(x
) = 0 (12.38)
Finally we check that indeed, all ve conditions are satised:
(??)
_
0
1
_
_
0
4
_
2
= 0 (
1
is inactive, use
1
= 0,
2
=
1
4
)(12.39)
(??) (12.40)
(??) h
1
(x
) > 0 & h
2
(x
) = 0 (12.41)
(??)
1
= 0 &
2
=
1
4
0 (12.42)
(??)
1
h
1
(x
) = 0h
1
(x
) = 0 &
2
h
2
(x
) =
2
0 = 0 (12.43)
12.3 Convex problems
Theorem: Regard a convex NLP and a point x
g
i
(x)
i
h
i
(x)
i
L is a convex function of x, and for xed , its gradient is zero, L(x
, , ) =
0. Therefore, x
We know that
d
= L(x
, , ) = f(x
g
i
(x
)
i
. .
0
h
i
(x
)
i
. .
0
= f(x
) and
x
is feasible: i.e. p
= d
and x
is global minimizer.
12.4 Complementarity
The last KKT conditio is called the complementarity condition. Visualized, the situation
for h
i
(x) and
i
that satisfy the three conditions h
i
0,
i
0 and h
i
i
= 0 is the
following:
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION106
gure
Denition: Regard a KKT point (x
, , ). For i A(x
) we say h
i
is weakly active if
i
= 0, otherwise, if
i
> 0, we call it strictly active. We say that strict complementarity
holds at this KKT point i all active constraints are strictly active. We dene the set of
weakly active constraints to be A
0
(x
, ).
The sets are disjoint and A(x
) = A
0
(x
, ) A
+
(x
, ).
Note: strict complementarity makes many theorems easier.
12.5 Second order conditions
Denition: Regard the KKT point (x
, ) is the following
set:
C(x
, ) = { p|g(x
)
T
p = 0, h
i
(x
)
T
p = 0 if i A
+
(x
, ), h
i
(x
)
T
p 0 if i A
0
(x
, )}
(12.47)
Note: C(x
, ) F(x
, ) T
(x
). Thus, the
critical cone is a subset of all feasible directions. In fact: it contains all feasible directions
which are from rst order information neither uphill or downhill directions, as the following
theorem shows.
Theorem:
Regard the KKT point (x
(x
) holds
p C(x
, ) f(x
)
T
p = 0. (12.48)
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION107
Proof:
Use
x
L(x
, ):
f(x
)
T
p =
T
g
T
p
. .
=0
+
i,
i
>0
i
h
i
(x
)
T
p
. .
=0
+
i,
i
=0
i
h
i
(x
) = 0 (12.49)
Conversely, if p T
(x
) then all terms on the right hand side must be non-negative, so that
f(x
)
T
p = 0 implies in particular
i,
i
>0
i
h
i
(x
)
T
p = 0 which implies h
i
(x
)
T
p = 0
for all i A
+
(x
, ), i.e. p C(x
, ).
Example:
min x
2
s.t. 1 x
2
1
x
2
2
0 (12.50)
x
=
_
0
1
_
(12.51)
h(x) =
_
2x
1
2x
2
_
(12.52)
f(x) =
_
0
1
_
(12.53)
=?
f(x) h(x) = 0 (12.54)
_
0
1
_
_
0
2
_
= 0 =
1
2
(12.55)
x
=
_
0
1
_
, =
1
2
is a KKT point.
T
(x
) = F(x
) = {p | h
T
p 0} = {p |
_
0
2
_
T
p 0} (12.56)
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION108
C(x
, ) = {p | h
T
p = 0 if > 0} (12.57)
= {p |
_
0
2
_
T
p = 0} (12.58)
Theorem (SONC): Regard x
with LICQ. If x
, ) holds that p
T
2
x
L(x
, , )p 0
Theorem (SOSC): If x
, ), p = 0, holds that p
T
2
x
L(x
, , )p > 0
then x
is a local minimizer.
Note:
2
x
L(x
, , ) =
2
f(x
2
g
i
(x
2
h
i
(x
), i.e.
2
x
L contains
curvature of constraints.
Sketch of proof of both theorems:
Regard the following restriction of the feasible set (
):
= {x | g(x) = 0, h
i
(x) = 0 if i A
+
(x
, ), h
i
(x) 0 if i A
0
(x
, )} (12.59)
The critical cone is the tangent cone of this set
.
First, for any feasible direction p T
(x
) \ C(x
, ) we have f(x
)
T
p > 0. Thus, the
dicult directions are those in the critical cone only.
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION109
So let us regard points in the set
. For xed , we have for all x
:
L(x, , ) = f(x)
i
g
i
(x)
..
=0
i,
i
>0
i
h
i
(x)
. .
=0
i,
i
=0
i
h
i
(x)
. .
=0
(12.60)
= f(x) (12.61)
Also:
x
L(x
, , ) = 0.
So for all x
we have:
f(x) = L(x, , ) (12.62)
= L(x
, , )
. .
=f(x
)
+
x
L(x
, , )
T
. .
=0
(x x
) +
1
2
(x x
)
T
2
x
L(x
, , )(x x
) + o(x x
2
) (12.63)
= f(x
) +
1
2
(x x
)
T
2
x
L(x
, , )(x x
) + o(x x
2
) (12.64)
x
L =
_
0
1
_
+
_
2x
1
2x
2
_
(12.66)
2
x
L = 0 +
_
2 0
0 2
_
(12.67)
CHAPTER 12. OPTIMALITYCONDITIONS FOR CONSTRAINEDOPTIMIZATION110
For =
1
2
and x
=
_
0
1
_
we have:
C(x
, ) = {p | h
T
p = 0} = {p |
_
0
2
_
T
p = 0} = {
_
p
1
0
_
} (12.68)
p C p =
_
p
1
0
_
(12.69)
2
x
L(x
, , ) =
1
2
_
2 0
0 2
_
=
_
1 0
0 1
_
(12.70)
SONC:
_
p
1
0
_
T
_
1 0
0 1
__
p
1
0
_
. .
=p
2
1
0 (12.71)
SOSC:
if p = 0, p C : p
T
2
x
Lp > 0 (12.72)
if p
1
= 0 : p
2
1
> 0 (12.73)
Example 2:
min x
2
s.t. 2x
2
x
2
1
1 (x
2
+ 1)
2
(12.74)
Here x
=
_
0
1
_
, =
1
2
is still a KKT point.
x
L(x
, ) = 0 (12.75)
2
x
L(x
, ) =
_
2 0
0 2
_
(12.76)
Chapter 13
Equality constrained optimization
In this chapter the problem of
minimize
x R
n
f(x) (13.1a)
subject to g(x) = 0 (13.1b)
with f : R
n
R, g : R
n
R
m
, f and g are both smooth functions, will be further treated
in detail.
13.1 Optimality conditions
KKT condition
The necessary KKT optimality condition for
L(x, ) = f(x)
T
g(x) (13.2)
leads to the expression
L(x
) = 0 (13.3)
g(x
) = 0 (13.4)
111
CHAPTER 13. EQUALITY CONSTRAINED OPTIMIZATION 112
Keep in mind that this expression is only valid if we have LICQ, or equivalently stated as
the entries g
i
(x
), g
2
(x
), . . . , g
m
(x
)) (13.5)
=
g
x
(x)
T
. (13.6)
The rank of the matrix g(x) must be m to obtain LICQ. The tangent space is dened as
T
(x
) =
_
p|g(x)
T
p = 0
_
(13.7)
= KER(g(x)
T
) (13.8)
An explicit form of the kernel KER
_
g(x)
T
_
can be obtained by basis for this space
Z R
n(nm)
such that the kernel
_
g(x)
T
_
= (Z), i.e. g(x)
T
Z = 0 and rank(Z) =
n m. This basis (Z
1
Z
2
. . . Z
nm
) can be obtained by using a QR-factorization of the
matrix g(x).
SONC and SOSC
The SONC looks like
p = Zv with v R
nm
(13.9)
z
T
2
xL(x
)Z 0 (13.10)
The SOSC points out that if
Z
T
2
xL(x
)Z 0 (13.11)
and the LICQ and KKT conditions are satised, then x
is a minimizer.
13.2 Equality constrained QP
Regard the optimization problem
minimize
x
1
2
x
T
Bx + g
T
x (13.12a)
subject to b + Ax = 0 (13.12b)
CHAPTER 13. EQUALITY CONSTRAINED OPTIMIZATION 113
with B R
nxn
, A R
mxn
, B = B
T
. The KKT condition leads to the equation
Bx + g A
T
= 0 (13.13a)
b + Ax = 0. (13.13b)
In matrix notation
_
B A
T
A 0
_ _
x
_
=
_
g
b
_
(13.14)
The left hand side matrix is nearly symmetric. With a few reformulations a symmetric
matrix is obtained
_
B A
T
A 0
_ _
x
_
=
_
g
b
_
(13.15)
Lemma [KKT-Matrix-Lemma]: Dene the matrix
_
B A
T
A 0
_
(13.16)
as the KKT matrix. Regard some matrix B R
nm
, B = B
T
, A R
mn
with m n. If
the rank(A) = m (A is of full rank, i.e. the LICQ holds) and for all p kernel(A), p = 0
holds p
T
Bp > 0 (SOSC). Then the KKT-matrix is invertible.
Refer to Nocedal NOC, 16.1
Remark that for a QP
B =
2
x
L(x
) (13.17)
A = g(x)
T
(13.18)
so that the above invertibility condition is equivalent to SOSC.
CHAPTER 13. EQUALITY CONSTRAINED OPTIMIZATION 114
13.2.1 Solving the KKT system
Solving KKT systems is an important research topic, there exist many ways to solve the
system (??). Some methods are:
(i) Brute Force: obtain an LU-factorization of KKT-matrix
(ii) As the KKT-matrix is not denite, a standard Cholesky decomposition does not work.
Use an indenite Cholesky decomposition.
(iii) Schur complement method or so called Range Space method: rst eliminate x, by
equation
x = B
1
(A
T
g) (13.19)
and plug it in to the second equation (??). Get from
b + A(B
1
(A
T
g)) = 0. (13.20)
This method requires that B is invertible, which is not always true.
(iv) Null Space Method: First nd basis Z R
N(nm)
of KER(A), set x = Zv + y with
b + Ay = 0 (a special solution) every x = Zv + y satises b + Ax = 0, so we have to
regard only (??). This is an unconstrained problem
minimize
v R
nm
g
T
(Zv + y) +
1
2
(Zv + y)
T
B(Zv + y) (13.21a)
Z
T
BZv + Z
T
g + Z
T
By = 0 (13.21b)
v = (Z
T
BZ)
1
(Z
T
g + Z
T
By). (13.21c)
The matrix Z
T
BZ is called Reduced Hessian. This method is always possible if
SOSC holds, in practice the matrices are sparse matrices.
(v) Sparse direct methods like sparse LU decomposition.
(vi) Iterative methods of linear algebra.
CHAPTER 13. EQUALITY CONSTRAINED OPTIMIZATION 115
13.3 Newton Lagrange method
Regard again the optimization problem (??) as stated at the beginning of the chapter. The
idea now is to apply Newtons method to solve the nonlinear KKT conditions
L(x, ) = 0 (13.22a)
g(x) = 0 (13.22b)
Dene
_
x
_
= w and F(w) =
_
L(x, )
g(x)
_
(13.23)
with w R
n+m
, F : R
n+m
R
n+m
, so that the optimization is just a nonlinear root
nding problem
F(w) = 0, (13.24)
which we solve again by Newtons method.
F(
k
) +
F
w
k
w
k
)(w w
k
) = 0 (13.25)
Written in terms of gradients
x
L(x
k
,
k
) +
2
x
L(x, )(x x
k
) g(x
k
)(
k
) = 0 (13.26)
2
x
L(x, )(x x
k
) is the linearisation with respect to x, g(x
k
)(
k
) the linearisation
with respect to . Recall that L = f g.
g(x
k
) +g(x
k
)
T
(x x
k
) = 0 (13.27)
Written in matrix form an interesting result is obtained
_
x
L
g
_
+
_
2
x
L g
g
T
0
_
. .
KKT-matrix
_
x x
k
(
k
)
_
= 0 (13.28)
The KKT-matrix is invertible if the KKT-matrix lemma holds. From this point it is clear
that at a given solution (x
). Thus, if (x
)
_
do
get x
k
and
k
from (??)
x
k+1
= x
k
+ x
k
k+1
=
k
+
k
k = k + 1
end while
then the Newton method is well dened for all (x
0
,
0
) in neighborhood of (x
) and
converges Q-quadratically.
The method is stated as an algorithm in Algorithm ??.
Using the denition
k+1
=
k
+
k
(13.29)
L(x
k
,
k
) = f(x
k
) g(x
k
)
k
(13.30)
the system (??) needed for calculating
k
and x
k
is equivalent to
_
f(x
k
)
g(x
k
)
_
+
_
2
L g
g
T
_ _
x
k
k+1
_
= 0. (13.31)
This formulation shows that the new iterate does not depend strongly on the old multiplier
guess, only via the Hessian matrix. We will later see that we can approximate the Hessian
with dierent methods.
CHAPTER 13. EQUALITY CONSTRAINED OPTIMIZATION 117
13.4 Quadratic model interpretation
Theorem x
k+1
and
k+1
are obtained from the solution of a QP:
minimize
x R
n
f(x
k
)
T
(x x
k
) +
1
2
(x x
k
)
T
2
L(x
k
,
k
)(x x
k
) (13.32a)
subject to g(x
k
) +g(x
k
)
T
(x x
k
) = 0 (13.32b)
So we can get a QP solution x
QP
and
QP
and take it as next NLP solution guess x
k+1
and
k+1
.
Proof: KKT of QP
f(x
k
) +
2
L(x
k
,
k
) g(x
k
) g
k
= 0 (13.33)
g +g
T
(x
x
k
) = 0 (13.34)
More generally, can replace
2
x
L(x
k
,
k
) by some approximation B
k
, (B
k
= B
T
k
often
B
k
0) by Quasi-Newton updates or other.
13.5 Constrained Gauss Newton
Regard:
minimize
x R
n
1
2
F(x)
2
2
(13.35a)
subject to g(x) = 0 (13.35b)
As in the unconstrained case, linearize both F and g. Get approximation by
minimize
x R
n
1
2
F(x
k
) + J(x
k
)(x x
k
)
2
2
(13.36a)
subject to g(x
k
) +g(x
k
)
T
(x x
k
) = 0 (13.36b)
CHAPTER 13. EQUALITY CONSTRAINED OPTIMIZATION 118
This is a LS-QP which is convex. We call this the constrained Gauss Newton method, this
approach gets new iterate x
k+1
by solution of (??)(??) in each iteration. Note that no
multipliers
k+1
are needed. The KKT conditions of LS-QP
x
1
2
F + J(x x
k
) = J
T
J(x x
k
) + J
T
F (13.37)
equals
J
T
J(x x
k
) + J
T
F g(x x
k
) = 0 (13.38)
g +g
T
= 0 (13.39)
Recall that J
T
J the same is as by Newton iteration, but we replace the Hessian. The
constrained Gaus Newton gives a Newton type iteration with B
k
= J
T
J. For LS,
2
x
L(x, ) = J(x)
T
J(x) +
F
i
(x)
2
F
i
(x)
2
g
i
(x) (13.40)
One can show that gets small if F is small. As the unconstrained case CGN converges
well if F 0.
13.6 An equality constrained BFGS method
CHAPTER 13. EQUALITY CONSTRAINED OPTIMIZATION 119
Algorithm 8 Equality constrained BFGS method
Choose x
0
, B
0
, tolerance
k = 0
Evaluate f(x
0
), g(x
0
),
g
x
(x
0
)
while g(x
k
) > tolerance or L(x
k
,
k
) > tolerance do
Solve KKT-system (e.g. with CVX):
_
f
g
_
+
_
B
k
g
x
T
g
x
0
_
_
p
k
k
_
= 0
Set
k
=
k
Choose step length t
k
(0, 1] (details 11.7)
x
k+1
= x
k
+ t
k
p
k
k+1
=
k
+ t
k
k
Compute old Lagrange gradient:
x
L(x
k
,
k+1
) = f(x
k
)
g
x
(x
k
)
T
k+1
Evaluate f(x
k+1
), g(x
k+1
),
g
x
(x
k+1
)
Compute new Lagrange gradient
x
L(x
k+1
,
k+1
)
Set s
k
= x
k+1
x
k
Set y
k
=
x
L(x
k+1
,
k+1
)
x
L(x
k
,
k+1
)
Calculate B
k+1
(e.g. with a BFGS update) using s
k
and y
k
.
k = k + 1
end while
Remark: B
k+1
can alternatively be obtained by either calculating the exact Hessian
2
L(x
k+1
,
k+1
) or by calculating the Gauss-Newton Hessian (J(x
k+1
)
T
J(x
k+1
) for a LS
objective function).
CHAPTER 13. EQUALITY CONSTRAINED OPTIMIZATION 120
13.7 Local convergence
Recall:
Theorem [Newton type convergence]: Regard the root nding problem F(x) = 0,
F : R
n
R
n
with F(x
(1 ) then x
k
x
2
L(x
k
,
k
) is not too big (Gauss-Newton).
Proof:
J
k
=
_
2
L(x
k
,
k
)
g
x
(x
k
)
T
g
x
(x
k
) 0
_
(13.41)
M
k
=
_
B
k
g
x
(x
k
)
T
g
x
(x
k
) 0
_
(13.42)
J
k
M
k
=
_
2
L(x
k
,
k
) B
k
0
0 0
_
(13.43)
_
= 0 then
DT
1
(x)[p] = f(x)
T
p g(x)
1
(13.47)
DT
1
(x)[p] p
T
Bp (
)g(x)
1
(13.48)
Corollary: If B 0 &
= 0 (13.55)
f(x)
T
p =
T
g
x
(x)p p
T
Bp (13.56)
=
T
g(x) p
T
Bp (13.57)
|f(x)
T
p|
g(x)
1
p
T
Bp (13.58)
(??) (13.59)
(if not,
increase ).
13.9 Careful BFGS updating
How can we make sure that B
k
remains positive denite?
Lemma: If B
k
0 and y
T
k
s
k
> 0 then B
k+1
from BFGS update is positive denite.
Update Nocedal proof reference.
CHAPTER 13. EQUALITY CONSTRAINED OPTIMIZATION 123
Proof: Nocedal?
This is as good as we can desire because:
Lemma: If y
T
k
s
k
< 0 & B
k+1
s
k
= y
k
then B
k+1
is not positive semidenite.
Proof: s
T
k
B
k+1
s
k
= s
T
k
y
k+1
< 0 i.e. s
k
is a direction of negative curvature of B
k+1
.
Insert gure with proof.
Powells trick: If y
T
k
s
k
< 0.2s
T
k
Bs
k
then do update with a y
k
instead of y
k
with y
k
=
y
k
+ (B
k
s
k
y
k
) so that y
k
T
s
k
= 0.2s
T
k
B
k
s
k
> 0.
Remark: If = 1 then y
k
= B
k
s
k
and B
k+1
= B
k
(thus, the choice of between 0 and 1
damps the BFGS update if necessary).
Chapter 14
Inequality Constrained Optimization
Algorithms
For simplicity, drop equalities and regard:
min
x
f(x)
s.t. h(x) 0
(14.1)
In the KKT conditions we had (for i = 1, . . . , q):
1. f(x)
q
i=1
h
i
(x)
i
= 0
2. h
i
(x) 0
3.
i
0
4.
i
h
i
(x) = 0
Conditions 2, 3 and 4 are non-smooth, which implies that Newtons method wont work
here.
124
CHAPTER 14. INEQUALITY CONSTRAINED OPTIMIZATION ALGORITHMS 125
14.1 Quadratic Programming via active set method
Regard the QP problem to be solved:
min
x
g
T
x +
1
2
x
T
Bx
s.t. Ax + b 0
(14.2)
Assume a convex QP (B 0). The KKT conditions are necessary and sucient for global
optimality (this is the basis for the algorithm):
Bx
+ g A
T
= 0 (14.3)
Ax
+ b 0 (14.4)
0 (14.5)
i
(Ax
+ b)
i
0 (14.6)
for i = 1, . . . , q.
How do we nd x
A
so that:
Bx
+ g A
T
A
A
= 0 (14.12)
A
A
x
+ b
A
= 0 (14.13)
A
A
x
+ b
I
0 (14.14)
A
0 (14.15)
and
=
_
I
_
with
I
= 0 (14.16)
Active set method idea:
Choose a set A
Solve (??) and (??) to get x
and
to A:
A
k+1
= A
k
{i
}
set k = k + 1 and go back to (3)
(b) If t
k
= 1 is possible then x
k
is feasible, we only need to check if
k
0.
If YES: Solution is found
if NO: Drop index i
in A
k
with
k,i
< 0 and A
k+1
= A
k
\{i
}.
set k = k + 1 and go back to (3)
Remark: we can prove that f(x
k+1
) f(x
k
) (with f the quadratic performance index).
14.2 Sequential Quadratic Programming (SQP)
Regard the NLP:
min
x
f(x)
s.t. h(x) 0
(14.19)
The SQP idea is to solve in each iteration the QP:
min
p
f(x)
T
p +
1
2
p
T
Bp
s.t. h(x
h
) +
h
x
(x
h
)p 0
(14.20)
Local convergence would follow from equality constrained optimization if the active set of
the QP is the same as the active set of the NLP, at least in the last iterations.
CHAPTER 14. INEQUALITY CONSTRAINED OPTIMIZATION ALGORITHMS 128
Theorem [ROBINSON]: If x
) and regard:
f(x) + Bp
h
A
x
(x)
T
QP
A
= 0 (14.21)
h
A
(x) +
h
A
x
(x)p = 0 (14.22)
this denes an implicit function
_
p(x, B)
QP
A
(x, B)
_
(14.23)
with
p(x
, B) = 0 and
QP
A
(x
, B) =
A
) (14.24)
This follows from
f(x
) + B0
h
A
x
(x
)
T
A
= 0
x
L(x
) = 0 (14.25)
h
A
(x
) +
h
A
x
(x
)0 = 0 (14.26)
which hold because of
h
A
(x
) = 0 (14.27)
h
I
(x
) > 0 (14.28)
I
= 0 (14.29)
Note that
A
> 0 because of strict complementarity.
For x x
QP
A
(x, B) > 0 (14.30)
and even more:
h
I
(x) +
h
I
x
(x)p(x, B) > 0 (14.31)
CHAPTER 14. INEQUALITY CONSTRAINED OPTIMIZATION ALGORITHMS 129
Therefore a solution of the QP has the same same active set as the NLP and also satises
strict complementarity.
Remark: we can generalise his Theorem to the case where the jacobian
h
x
(x
h
) is only
approximated.
14.3 Powells classical SQP algorithm
For an equality and inequality constrained NLP, we can use the BFGS algorithm as before
but:
1. We solve an inequality constrained QP instead of a linear system
2. We use T
1
(x) = f(x) + g(x)
1
+
q
i=1
|min(0, h
i
(x))|
3. Use full Lagrange gradient
x
L(x, , ) in the BFGS formula
(eg fmincon in Matlab).
14.4 Interior Point Methods
The IP method is an alternative for the active set method for QPs or LPs and for the
SQP method. The previous methods had problems with the non-smoothness in the KTT-
conditions (2), (3) and (4) (for i = 1, . . . , q):
1. f(x)
q
i=1
h
i
(x)
i
= 0
2. h
i
(x) 0
3.
i
0
CHAPTER 14. INEQUALITY CONSTRAINED OPTIMIZATION ALGORITHMS 130
4.
i
h
i
(x) = 0 .
The IP-idea is to replace 2,3 and 4 by a smooth condition (which is an approximation):
h
i
(x)
i
= with > 0 but small. The KKT-conditions now become a smooth root nding
problem:
f(x)
q
i=1
h
i
(x)
i
= 0 (14.32)
h
i
(x)
i
= 0 i = 1, . . . , q (14.33)
These conditions are called the IP-KKT conditions and can be solved by Newtons method
and yields solutions x() and ().
We can show that for 0
x() x
(14.34)
()
(14.35)
The IP algorithm:
1. Start with a big 0, choose (0, 1)
2. Solve IP-KKT to get x() and ()
3. Replace and go to 2.
(ititialize Newton iteration with old solution).
Remark: The set of solutions
_
x()
()
_
for (0, ) is called the central path.
Remark 2: In fact, the IP-KKT is equivalent to FONC of the Barrier Problem (BP):
min
x
f(x)
q
i=1
log h
i
(x) (14.36)
FONC of BP f(x)
q
i=1
1
h
i
(x)
h
i
(x) = 0 (14.37)
CHAPTER 14. INEQUALITY CONSTRAINED OPTIMIZATION ALGORITHMS 131
with
i
=
h
i
(x)
this is equivalent to IP-KKT.
For convex problems IP methods are well understood with strong complexity results (they
are used e.g. in CVX).
Chapter 15
Optimal Control Problems
We regard a dynamical system with dynamics
x
k+1
= f(x
k
, u
k
) (15.1)
with u
k
the controls or inputs and x
k
the states. Let x R
n
x
and let u R
n
u
with
k = 0, . . . , N 1.
15.1 Optimal control problem (OCP) formulation
minimize
x
0
, u
0
, x
1
, . . . , u
N1
, x
N
N1
k=0
L(x
k
, u
k
) + E(x
N
) (15.2a)
subject to x
k+1
f(x
k
, u
k
) = 0 for k = 0, . . . , N 1(15.2b)
Remark that (??) implies a lot of constraints. Sometimes this amount of constraints is not
enough, one could add some extra constraints. For example if the rst state and last state
are xed
r(x
0
, x
n
) = 0 (15.3)
132
CHAPTER 15. OPTIMAL CONTROL PROBLEMS 133
then we would also say that the controls are constrained. Another constraint would be
inequalities of the form
h(x
k
, u
k
) 0, k = 0, . . . , N 1 (15.4)
Remark that a free parameter could be added to the optimisation formulation, e.g. the
constant size of a pot in a chemical reactor. For this we dene an extra dummy state for
k = 0, . . . , N 1
p
k+1
= p
k
(15.5)
As an example consider for equation (??)
r(x
0
, x
n
) = x
0
x
0
(15.6)
where x
0
is a xed initial value. Another example would be considering both ends xed
r(x
0
, x
n
) =
_
x
0
x
0
x
N
x
N
_
. (15.7)
In many applications, cycles or periodic boundary conditions are optimized by adding
constraints in the form
r(x
0
, x
n
) = (x
0
x
n
) (15.8)
15.2 KKT conditions of optimal control problems
First summarize the variables w = {x
0
, u
0
, x
1
, u
1
, . . . , u
N1
, x
N
} and summarize the multi-
pliers = {
1
, . . . ,
N
,
r
}. The optimal control problem has the form
minimize
w
F(w) (15.9a)
subject to G(w) = 0 (15.9b)
Where
G(w) =
_
_
x
1
f(x
0
, u
0
)
x
2
f(x
1
, u
1
)
.
.
.
x
N
f(x
N1
, u
N1
)
r(x
0
, x
N
)
_
_
(15.9c)
CHAPTER 15. OPTIMAL CONTROL PROBLEMS 134
The Lagragian function has the form
L(, ) = F()
T
G()
=
N1
k=0
L(x
k
, u
k
) + E(x
n
)
N1
k=0
T
k+1
(x
k+1
f(x
k
, u
k
))
T
r
r(x
0
, x
n
) (15.10)
The KKT-conditions of the problem are
w
L(w, ) = 0 (15.11a)
G(w) (15.11b)
In more detail, ???? the derivative of L with respect to x
k
, where n = 0 and n = N are
special cases. First n = 0 is treated
x
0
L(, ) =
x
0
L(x
0
, u
0
) +
f
x
0
(x
0
, u
0
)
T
r
x
0
(x
0
, x
N
)
T
r
= 0. (15.12a)
Then the case for k = 1, . . . , N 1 is treated
x
k
L(, ) =
x
k
L(x
k
, u
k
)
k
+
f
x
k
(x
k
, u
k
)
T
k+1
= 0. (15.12b)
Now the special case n = N
x
N
L(, ) =
x
N
E(x
N
)
N
r
x
N
(x
0
, x
N
)
T
r
= 0. (15.12c)
The Lagrangian with respect to u is calculated, for k = 0, . . . , N 1
u
k
L(, ) =
u
k
L(x
k
, u
k
) +
f
x
k
(x
k
, u
k
)
T
k+1
= 0. (15.12d)
The last two conditions are
x
k+1
f(x
k
, u
k
) = 0 k = 0, . . . , N 1 (15.12e)
r(x
0
, x
n
) = 0 (15.12f)
The equations (??) till (??) are the KKT-system of the OCP. There exist dierent ap-
proaches to solve this system. On method is to solve equations (??) to (??) directly, this is
called the simultaneous approach. The other approach is to calculate all the states in (??)
by forwards elemination. This is called the sequential approach and treated rst.
CHAPTER 15. OPTIMAL CONTROL PROBLEMS 135
15.3 Sequential approach to optimal control
This method is also called single shooting or reduced approach. The idea is to keep only
x
0
and U = [u
T
0
, . . . , u
T
N1
]
T
as variables. The states x
1
, . . . , x
N
are eleminated recursively
by
x
0
(x
0
, U) = x
0
(15.13)
x
k+1
(x
0
, U) = f( x
k
(x
0
, ), u
k
) (15.14)
Then the optimal control problem is equivalent to a problem with less variables
minimize
x
0
, U
N1
k=0
L( x
k
(x
0
, U), u
k
) + E( x
k
(x
0
, U)) (15.15a)
subject to r(x
0
, x
N
(x
0
, U)) = 0 (15.15b)
Remark that equation (??) is implicitly satised. This is called the reduced optimal control
problem. It can be solved by e.g. Newton type method (SQP if inequalities are present).
If r(x
0
, x
N
) = x
0
x
0
one can also eliminate x
0
x
0
. The optimality conditions for this
problem are found in the next subsection.
15.4 Backward dierentiation of sequential Lagrangian
The Lagrangian function is given by
L(x
0
, U,
r
) =
N1
k=0
L( x
k
(x
0
, U), u
k
) + E( x
k
(x
0
, U))
T
r
r(x
0
, x
N
(x
0
, U)) (15.16)
so the KKT conditions for the reduced optimal control problem are
x
0
L(x
0
, U,
r
) = 0 (15.17a)
u
k
L(x
0
, U,
r
) = 0 k = 0, . . . , N 1 (15.17b)
r(x
0
, x
N
(x
0
, U)) = 0 (15.17c)
Usually dierences are linearized by nite dierences, the I-Trick or forward automatic
dierention (AD). But here, backward automatic dierentiation (AD) is more ecient.
CHAPTER 15. OPTIMAL CONTROL PROBLEMS 136
Algorithm 9 Result of backward AD to KKT-ROCP
Inputs
x
0
, u
0
,. . .,u
N1
,
r
Outputs
x
0
L,
u
k
L and r
Set x
0
x
0
Set k = 0, execute forward sweep:
repeat
x
k+1
= f( x
k
, u
k
)
k = k + 1
until k = N 1
Get r(x
0
, x
N
)
Compute intermediate quantities
N
, . . . ,
1
by
N
= E( x
N
)
r
x
n
(x
0
, x
N
)
T
r
Set k = N 1, execute backward sweep:
repeat
k
=
x
k
L( x
k
, u
k
) +
f
x
k
( x
k
, u
k
)
T
k+1
= 0
k = k 1
until k = 1
Compute
x
0
L =
x
0
(x
0
, u
0
)
r
x
0
(x
0
, x
N
)
T
r
+
f
x
0
(x
0
, u
0
)
T
1
= 0
Set k0
repeat
u
k
L =
u
k
L( x
k
, u
k
) +
f
u
k
( x
k
, u
k
)
T
k+1
= 0
k = k + 1
until k = N 1
CHAPTER 15. OPTIMAL CONTROL PROBLEMS 137
The result for backward AD to the equations (??) to (??) to get
x
0
L and
u
k
L is stated
in Algorithm ??. Compare the equations (??) to (??) whereas
k
k
with the algorithm.
We get a second interpretation to the second approach with backward AD: when solving
(??) to (??) we eliminate all equations that kan be eliminated by (??), (??) and (??).
Only the equations (??), (??) and (??) remain. Backward automatic dierentiation (AD)
gives gradient at a cost scaling linearly with N and forward dierences with respect to
u
0
, . . . , u
N1
, would grow with N
2
.
The sequential and backward automatic dierentiation (AD) leads to a small dense (Ja-
cobians are dense matrices) nonlinear system in variables (x
0
, u
0
, . . . , u
N1
,
r
). The next
sections tries to avoid the dense Jacobians.
15.5 Simultaneous optimal control
This method is also called multiple shooting or one shot optimization. The idea is to
solve (??) to (??) directly by a sparsity exploiting Newton-type method. If we regard the
original OCP, it is a NLP in variables w = (x
0
, u
0
, x
1
, u
1
, . . . , u
N1
, x
N
) with multipliers
(
1
, . . . ,
N
,
r
) = . In the SQP method we get
w
k+1
= w
k
+ w
k
(15.18)
k+1
=
k
QP
(15.19)
by solving
minimize
w
F(
k
)
T
w +
1
2
w
T
B
k
w (15.20a)
subject to G(w) +
G
w
(w)w (15.20b)
If we use
B
k
=
2
L(
k
,
k
) (15.21)
CHAPTER 15. OPTIMAL CONTROL PROBLEMS 138
this QP is very structured and equivalent to
minimize
x
0
, u
0
, . . . , x
n
1
2
N1
k=0
_
x
k
u
k
_
T
Q
k
_
x
k
u
k
_
+
1
2
x
T
N
Q
N
x
N
+
N
k=0
_
x
N
u
N
_
T
g
k
+ x
T
N
g
N
(15.22)
subject to r(x
0
, x
N
) +
r(x
0
, x
N
)
x
0
x
0
+
r(x
0
, x
N
)
x
N
x
N
= 0 (15.23)
x
k
f
u
k
(x
k
, u
k
)u
k
= 0 for k = 0, . . . , N 1 (15.24)
With
Q
k
=
2
(x
k
,u
k
)
L (15.25)
Q
N
=
2
x
N
L (15.26)
g
k
=
x
k
,u
k
L(x
k
, u
k
) (15.27)
g
N
= E(x
N
) (15.28)
Note that for k = m
x
k
x
m
L = 0 (15.29a)
x
k
u
m
L = 0 (15.29b)
2
u
k
u
m
L = 0 (15.29c)
This QP leads to a very sparse linear system and can be solved at a cost linear with N.
Also simultaneous approaches can deal better with unstable systems x
k+1
= f(x
k
, u
k
).
Chapter 16
Summary of the Lecture
To be added...
139
Bibliography
[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
2004.
[2] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer,
Boston, 2004.
[3] J. Nocedal and S. Wright. Numerical Optimization. Springer Verlag, 2006.
140