Course Notes For MATH 524: Non-Linear Optimization
Francisco Blanco-Silva

Contents

Chapter 1. Review of Optimization from Vector Calculus
    The Theory of Optimization
    Exercises
Chapter 2. Existence and Characterization of Extrema for Unconstrained Optimization
    1. Anatomy of a function
    2. Existence results
    3. Characterization results
    Examples
    Exercises
Chapter 3. Numerical Approximation for Unconstrained Optimization
    1. Newton-Raphson's Method
    2. Secant Methods
    3. The Method of Steepest Descent
    4. Effective Algorithms for Unconstrained Optimization
    Exercises
Chapter 4. Existence and Characterization of Extrema for Constrained Optimization
    1. Necessary Conditions
    2. Sufficient Conditions
    Key Examples
    Exercises
Chapter 5. Numerical Approximation for Constrained Optimization
    1. Projection Methods for Linear Equality Constrained Programs
    2. Linear Programming: The Simplex Method
    3. The Frank-Wolfe Method
    Exercises
Index
Bibliography
Appendix A. Rates of Convergence

List of Figures

4.1 Can you tell what are the global maximum and minimum values of f in S?
4.2 Cones for (0, 0), (−1, −1) and (0, −1/2).
4.3 Feasibility region for (P) in Example 4.4
CHAPTER 1

Review of Optimization from Vector Calculus
    Hess f(x0, y0) = ( ∂²f/∂x²(x0, y0)    ∂²f/∂x∂y(x0, y0) )
                     ( ∂²f/∂y∂x(x0, y0)   ∂²f/∂y²(x0, y0)  )
The second step was the notion of global (or absolute) minima: points
(x0 , y0 ) that satisfy f (x0 , y0 ) ≤ f (x, y) for any point (x, y) in the domain of
f . We always started with the easier setting, in which we placed restrictions
on the domain of our functions:
Theorem 1.3. A continuous real-valued function always attains its min-
imum value on a compact set K. If the function is also differentiable in the
interior of K, to search for global minima we perform the following steps:
Interior Candidates: List the critical points of f located in the interior
of K.
Boundary Candidates: List the points in the boundary of K where f may
have minimum values.
Evaluation/Selection: Evaluate f at all candidates and select the one(s)
with the smallest value.
Example 1.2. A flat circular plate has the shape of the region
x2 + y 2 ≤ 1.
The plate, including the boundary, is heated so that the temperature at the
point (x, y) is given by f (x, y) = 100(x2 + 2y 2 − x) in Celsius degrees. Find
the temperature at the coldest point of the plate.
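The three steps of Theorem 1.3 can be carried out numerically for this example. The sketch below is an addition to these notes (it assumes NumPy is available) and is only an illustration of the Interior/Boundary/Selection procedure, not the original worked solution.

```python
import numpy as np

# Temperature on the unit plate (Example 1.2): f(x, y) = 100*(x**2 + 2*y**2 - x)
def f(x, y):
    return 100.0 * (x**2 + 2.0 * y**2 - x)

# Interior candidate: solving grad f = 100*(2x - 1, 4y) = 0 gives (1/2, 0)
interior = (0.5, 0.0)

# Boundary candidates: on x**2 + y**2 = 1, substitute y**2 = 1 - x**2,
# so f restricted to the boundary is g(x) = 100*(2 - x - x**2) on [-1, 1]
xs = np.linspace(-1.0, 1.0, 200001)
g = 100.0 * (2.0 - xs - xs**2)
x_b = xs[np.argmin(g)]
boundary = (x_b, np.sqrt(max(0.0, 1.0 - x_b**2)))

# Evaluation/Selection: the smallest value among all candidates
candidates = [interior, boundary]
coldest = min(f(x, y) for x, y in candidates)
print(coldest)   # -25.0, attained at the interior point (1/2, 0)
```

The boundary search here is a brute-force grid scan; in the notes this step is done by hand with single-variable calculus.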
The behavior of each of the factors as the absolute value of x goes to infinity leads to our claim:

    lim_{|x|→∞} a_n xⁿ = +∞,
    lim_{|x|→∞} ( 1 + a_{n−1}/(a_n x) + ··· + a_0/(a_n xⁿ) ) = 1.
This method can be extended to more than two dimensions, and more
than one constraint. For instance:
Exercises
Problem 1.1 (Advanced). State and prove similar statements as in
Definition 1, Theorems 1.1, 1.2 and 1.3, but for local and global maxima.
Problem 1.2 (Basic). Find and sketch the domain of the following
functions.
(a) f(x, y) = √(y − x − 2)
(b) f(x, y) = log(x² + y² − 4)
(c) f(x, y) = (x − 1)(y + 2) / ((y − x)(y − x³))
1See Appendix B
CHAPTER 2

Existence and Characterization of Extrema for Unconstrained Optimization
1. Anatomy of a function
1.1. Continuity and Differentiability.
Definition. We say that a function f : R^{d1} → R^{d2} is continuous at a point x⋆ ∈ R^{d1} if for all ε > 0 there exists δ > 0 so that ‖f(x) − f(x⋆)‖ < ε for all x ∈ R^{d1} satisfying ‖x − x⋆‖ < δ.
Example 2.1. Let f : R² → R be given by

    f(x, y) = 2xy/(x² + y²) if (x, y) ≠ (0, 0),   and f(0, 0) = 0.

This function is trivially continuous at any point (x, y) ≠ (0, 0). However, it fails to be continuous at the origin. Notice how we obtain different values as we approach (0, 0) through different generic lines y = mx with m ∈ R:

    lim_{x→0} f(x, mx) = lim_{x→0} 2mx²/((1 + m²)x²) = 2m/(1 + m²).
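A quick numerical experiment (a sketch added to these notes, not part of the original text) confirms that the values along each line y = mx stabilize at 2m/(1 + m²), so no single limit exists at the origin:

```python
def f(x, y):
    # f from Example 2.1: 2xy/(x**2 + y**2) away from the origin, 0 at it
    return 0.0 if (x, y) == (0.0, 0.0) else 2.0 * x * y / (x**2 + y**2)

for m in (0.0, 0.5, 1.0, 2.0):
    values = [f(t, m * t) for t in (1e-2, 1e-5, 1e-9)]
    expected = 2.0 * m / (1.0 + m**2)
    # the value along y = m*x is constant in t, and depends on m
    assert all(abs(v - expected) < 1e-12 for v in values)
print([2.0 * m / (1.0 + m**2) for m in (0.0, 0.5, 1.0, 2.0)])
```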
Definition. A function T : Rd1 → Rd2 is said to be a linear map (or a
linear transformation) if it satisfies
T (x + λy) = T (x) + λT (y) for all x, y ∈ Rd1 , λ ∈ R.
The kernel and image of a linear map are respectively given by
ker T = {x ∈ Rd1 : T (x) = 0},
im T = {y ∈ Rd2 : there exists x ∈ Rd1 so that y = T (x)}.
The kernel of a real-valued linear map T = ⟨a, ·⟩ has a very simple expression:

    ker T = ker⟨a, ·⟩ = {(x1, x2, . . . , xd) ∈ R^d : a1x1 + a2x2 + ··· + adxd = 0}.

The graph of a real-valued linear function can be identified with a hyperplane in R^{d+1}:

    Graph T = {(x, y) ∈ R^{d+1} : y = T(x)}
            = {(x1, x2, . . . , xd, y) ∈ R^{d+1} : a1x1 + a2x2 + ··· + adxd = y}
            = ker⟨[a1, a2, . . . , ad, −1], ·⟩.
Remark 2.2. For each linear map T : R^{d1} → R^{d2} there exists a matrix A of size d2 × d1 so that T(x)ᵀ = A · xᵀ; in coordinates,

    y1 = a_{1,1} x1 + ··· + a_{1,d1} x_{d1}
    y2 = a_{2,1} x1 + ··· + a_{2,d1} x_{d1}
       ⋮
    y_{d2} = a_{d2,1} x1 + ··· + a_{d2,d1} x_{d1}
The converse is not true in general: the existence of partial derivatives (or even of all the directional derivatives) does not guarantee that a function is differentiable at a point. For instance, the function f : R² → R given by

    f(x, y) = y³/(x² + y²) if (x, y) ≠ (0, 0),   and f(0, 0) = 0,
is not differentiable at (0, 0), although all partial derivatives and all direc-
tional derivatives exist at that point.
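The claim can be tested numerically. In the sketch below (an addition to these notes), every directional derivative of this f at the origin exists, but the map v ↦ D_v f(0, 0) is not linear in v, which rules out differentiability:

```python
import math

def f(x, y):
    # the counterexample above: y**3/(x**2 + y**2) away from the origin
    return 0.0 if (x, y) == (0.0, 0.0) else y**3 / (x**2 + y**2)

def dir_deriv(v1, v2, t=1e-8):
    # forward-difference approximation of the directional derivative at (0, 0)
    return f(t * v1, t * v2) / t

# both partial derivatives exist: f_x(0, 0) = 0, f_y(0, 0) = 1
assert abs(dir_deriv(1.0, 0.0) - 0.0) < 1e-9
assert abs(dir_deriv(0.0, 1.0) - 1.0) < 1e-9

# if f were differentiable, D_v f(0, 0) would equal <grad f(0, 0), v> = v2;
# for the unit vector v = (1, 1)/sqrt(2) the actual value is v2/2 instead
s = 1.0 / math.sqrt(2.0)
assert abs(dir_deriv(s, s) - s / 2.0) < 1e-9
assert abs(dir_deriv(s, s) - s) > 0.3
```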
2. Existence results
2.1. Continuous functions on compact domains. The existence
of global extrema is guaranteed for continuous functions over compact sets
thanks to the following two basic results:
Theorem 2.12 (Bounded Value Theorem). The image f (K) of a con-
tinuous real-valued function f : Rd → R on a compact set K is bounded:
there exists M > 0 so that |f (x)| ≤ M for all x ∈ K.
Theorem 2.13 (Extreme Value Theorem). A continuous real-valued
function f : K → R on a compact set K ⊂ Rd takes on minimum and
maximum values on K.
2.2. Continuous functions on unbounded domains. Extra restric-
tions must be applied to the behavior of f in this case, if we want to guar-
antee the existence of extrema.
Theorem 2.14. Coercive functions always have a global minimum.
Proof. Since f is coercive, there exists r > 0 so that f(x) > f(0) for all x satisfying ‖x‖ > r. On the other hand, consider the closed ball K_r = {x ∈ R^d : ‖x‖ ≤ r}. The continuity of f guarantees a global minimum x⋆ of f on the compact set K_r, with f(x⋆) ≤ f(0). It follows that f(x⋆) ≤ f(x) for all x ∈ R^d. □
3. Characterization results
Differentiability is key to guarantee characterization of extrema. Critical
points lead the way:
Theorem 2.15 (First order necessary optimality condition for minimization). Suppose f : R^d → R is differentiable at x⋆. If x⋆ is a local minimum, then ∇f(x⋆) = 0.
To be able to classify extrema of a suitably differentiable function, we take into account the behavior of the function around the point (x, f(x)) with respect to the tangent hyperplane at that point. Second derivatives make this process very easy.
Theorem 2.16. Suppose f : R^d → R is coercive and continuously differentiable at a point x⋆. If x⋆ is a global minimum, then ∇f(x⋆) = 0.

Theorem 2.17 (Second order necessary optimality condition for minimization). Suppose that f : R^d → R is twice continuously differentiable at x⋆.
• If x⋆ is a local minimum, then ∇f(x⋆) = 0 and Hess f(x⋆) is positive semidefinite.
• If x⋆ is a strict local minimum, then ∇f(x⋆) = 0 and Hess f(x⋆) is positive definite.

Theorem 2.18 (Second order sufficient optimality conditions for minimization). Suppose f : D ⊆ R^d → R is twice continuously differentiable at a point x⋆ in the interior of D and ∇f(x⋆) = 0. Then x⋆ is a:
Local Minimum: if Hess f(x⋆) is positive semidefinite.
Strict Local Minimum: if Hess f(x⋆) is positive definite.
If D = R^d and x⋆ ∈ R^d satisfies ∇f(x⋆) = 0, then x⋆ is a:
Global Minimum: if Hess f(x) is positive semidefinite for all x ∈ R^d.
Strict Global Minimum: if Hess f(x) is positive definite for all x ∈ R^d.
Theorem 2.19. Any local minimum of a convex function f : C → R on
a convex set C ⊆ Rd is also a global minimum. If f is a strictly convex
function, then any local minimum is the unique strict global minimum.
Theorem 2.20. Suppose f : C → R is a convex function with continuous
first partial derivatives on a convex set C ⊆ Rd . Then, any critical point of
f in C is a global minimum of f .
Examples
Example 2.13. Find a global minimum in R³ (if it exists) for the function

    f(x, y, z) = e^{x−y} + e^{y−x} + e^{x²} + z².

This function has continuous partial derivatives of any order in R³. Its continuity does not guarantee existence of a global minimum initially since the domain is not compact, but we may try our luck with its critical points. Note

    ∇f(x, y, z) = ( e^{x−y} − e^{y−x} + 2x e^{x²}, −e^{x−y} + e^{y−x}, 2z ).

The only critical point is then (0, 0, 0) (Why?). The Hessian at that point is positive definite:

    Hess f(0, 0, 0) = (  4  −2  0 )
                      ( −2   2  0 ),   ∆1 = 4 > 0, ∆2 = 4 > 0, ∆3 = 8 > 0.
                      (  0   0  2 )
By Theorem 2.18, f(0, 0, 0) = 3 is a priori a strict local minimum value. To prove that this point is actually a strict global minimum, notice that

    Hess f(x, y, z) = ( e^{x−y} + e^{y−x} + 4x²e^{x²} + 2e^{x²}   −e^{x−y} − e^{y−x}   0 )
                      ( −e^{x−y} − e^{y−x}                         e^{x−y} + e^{y−x}   0 )
                      (  0                                          0                  2 )

The first principal minor is trivially positive: ∆1 = e^{x−y} + e^{y−x} + 4x²e^{x²} + 2e^{x²}, since it is a sum of three positive terms and one non-negative term.
The second principal minor is also positive:

    ∆2 = det ( e^{x−y} + e^{y−x} + 4x²e^{x²} + 2e^{x²}   −e^{x−y} − e^{y−x} )
             ( −e^{x−y} − e^{y−x}                         e^{x−y} + e^{y−x} )
       = (e^{x−y} + e^{y−x})² + (e^{x−y} + e^{y−x})(4x²e^{x²} + 2e^{x²}) − (e^{x−y} + e^{y−x})²
       = (e^{x−y} + e^{y−x})(4x²e^{x²} + 2e^{x²}) > 0.
The third principal minor is positive too: ∆3 = 2∆2 > 0. We have just
proved that Hessf (x, y, z) is positive definite for all (x, y, z) ∈ R3 , and thus
(0, 0, 0) is a strict global minimum.
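The positive-definiteness argument can be double-checked numerically. The sketch below (an addition to these notes, assuming NumPy) samples the Hessian of Example 2.13 at random points and verifies that all eigenvalues are positive:

```python
import numpy as np

# Hessian of f(x, y, z) = exp(x-y) + exp(y-x) + exp(x**2) + z**2
def hess(x, y, z):
    a = np.exp(x - y) + np.exp(y - x)
    b = (4.0 * x**2 + 2.0) * np.exp(x**2)
    return np.array([[a + b, -a, 0.0],
                     [-a,     a, 0.0],
                     [0.0,  0.0, 2.0]])

H0 = hess(0.0, 0.0, 0.0)
assert np.allclose(H0, [[4.0, -2.0, 0.0], [-2.0, 2.0, 0.0], [0.0, 0.0, 2.0]])

rng = np.random.default_rng(0)
for x, y, z in rng.uniform(-2.0, 2.0, size=(200, 3)):
    assert np.linalg.eigvalsh(hess(x, y, z)).min() > 0.0
print("Hessian positive definite at all sampled points")
```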
Example 2.14. Find global minima in R2 (if they exist) for the function
f (x, y) = ex−y + ey−x .
This function also has continuous partial derivatives of any order, but no extremum is guaranteed a priori. Notice that all points (x, y) satisfying y = x are critical. For such points, the corresponding Hessians and eigenvalues are

    Hess f(x, x) = (  2  −2 ),   λ1 = 4 > 0, λ2 = 0;
                   ( −2   2 )

therefore, Hess f(x, x) is positive semidefinite at each critical point. By Theorem 2.18, f(x, x) = 2 is a local minimum value for all x ∈ R. To prove these are global minima, notice that for each (x, y) ∈ R²:

    Hess f(x, y) = ( e^{x−y} + e^{y−x}    −e^{x−y} − e^{y−x} ) = (e^{x−y} + e^{y−x}) (  1  −1 )
                   ( −e^{x−y} − e^{y−x}    e^{x−y} + e^{y−x} )                        ( −1   1 )

    λ1 = 2(e^{x−y} + e^{y−x}) > 0,   λ2 = 0.
The Hessian is positive semidefinite for all points, hence proving that any
point in the line y = x is a global minimum of f .
Example 2.15. Find local and global minima in R2 (if they exist) for
the function
f (x, y) = x3 − 12xy + 8y 3 .
This is a polynomial of degree 3, so we have continuous partial derivatives
of any order. It is easy to see that this function has no global minima:
    lim_{x→−∞} f(x, 0) = lim_{x→−∞} x³ = −∞.
Let's search instead for local minima. From the equation ∇f(x, y) = (0, 0) we obtain two critical points: (0, 0) and (2, 1). The corresponding Hessians and their eigenvalues are:

    Hess f(0, 0) = (   0  −12 ),   λ1 = −12 < 0, λ2 = 12 > 0,
                   ( −12    0 )

    Hess f(2, 1) = (  12  −12 ),   λ1 = 30 − 6√13 > 0, λ2 = 30 + 6√13 > 0.
                   ( −12   48 )
By Theorem 2.18, we have that f (2, 1) = −8 is a local minimum, but
f (0, 0) = 0 is not.
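As a quick sanity check (this code sketch is an addition to these notes and assumes NumPy), the eigenvalues quoted above can be recomputed numerically from the Hessian of f(x, y) = x³ − 12xy + 8y³:

```python
import numpy as np

# Hessian of f(x, y) = x**3 - 12*x*y + 8*y**3: f_xx = 6x, f_xy = -12, f_yy = 48y
def hess(x, y):
    return np.array([[6.0 * x, -12.0], [-12.0, 48.0 * y]])

lam0 = np.linalg.eigvalsh(hess(0.0, 0.0))
assert np.allclose(lam0, [-12.0, 12.0])        # indefinite: saddle at (0, 0)

lam1 = np.linalg.eigvalsh(hess(2.0, 1.0))
assert np.allclose(lam1, [30.0 - 6.0 * np.sqrt(13.0), 30.0 + 6.0 * np.sqrt(13.0)])
assert lam1.min() > 0.0                        # positive definite: local minimum at (2, 1)
print(lam1)
```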
Example 2.16. Find local and global minima in R2 (if they exist) for
the function
f (x, y) = x4 − 4xy + y 4 .
This is a polynomial of degree 4, so we do have continuous partial derivatives of any order. There are three critical points: (0, 0), (−1, −1) and (1, 1). The latter two are both strict local minima by virtue of Theorem 2.18:

    Hess f(−1, −1) = Hess f(1, 1) = ( 12  −4 ),   ∆1 = 12 > 0, ∆2 = 128 > 0.
                                    ( −4  12 )
We proved in Example 2.9 that f is coercive. By Theorems 2.14 and 2.16 we
have that f (−1, −1) = f (1, 1) = −2 must be strict global minimum values.
Exercises
Problem 2.1 (Basic). Consider the function
    f(x, y) = (x + y)/(2 + cos x)
At what points (x, y) ∈ R2 is this function continuous?
Problem 2.2 (Advanced). Prove the following facts about linear maps
T : Rd1 → Rd2 :
(a) For the real-valued case T : R^d → R, there exists a unique vector a ∈ R^d so that T(x) = ⟨a, x⟩ for all x ∈ R^d.
(b) For the general case T : R^{d1} → R^{d2}, there exists a unique matrix A of size d2 × d1 so that T(x)ᵀ = A · xᵀ.
(c) Linear maps are continuous functions.
Problem 2.6 (Basic). [13, p.32, #4] Write each of the quadratic forms below in the form x A xᵀ for an appropriate symmetric matrix A:
(a) 3x2 − xy + 2y 2 .
(b) x2 + 2y 2 − 3z 2 + 2xy − 4xz + 6yz.
(c) 2x2 − 4z 2 + xy − yz.
Problem 2.7 (Intermediate). Identify which of the following real-valued
functions are coercive. Explain the reason.
(a) f(x, y) = √(x² + y²).
(b) f (x, y) = x2 + 9y 2 − 6xy.
(c) f (x, y) = x4 − 3xy + y 4 .
(d) Rosenbrock functions Ra,b .
Problem 2.8 (Advanced). [13, p.36, #32] Find an example of a con-
tinuous, real-valued, non-coercive function f : R2 → R that satisfies, for all
t ∈ R,
    lim_{x→∞} f(x, tx) = lim_{y→∞} f(ty, y) = ∞.
While the correct expressions for ∇f and Hessf are quickly computed, trying
to find critical points results in an error:
>>> solve(gradient) # Search of critical points by solving ∇f = 0
NotImplementedError: could not solve
4*x**2*sqrt(-log(exp(x**2)/(2*x**2))) - 6*(-log(exp(x**2)/(2*x**2)))**(5/2)
1. Newton-Raphson’s Method
1.1. Newton-Raphson Method to search for roots of univariate functions. In order to find a good estimation of √2 with many decimal places, we allow a computer to find better and better approximations of the root of the polynomial p(x) = x² − 2. We start with an initial guess, say x0 = 3. We construct a sequence {xn}_{n∈N} that converges to √2 as follows:

    x_{n+1} = xn − p(xn)/p′(xn) = xn/2 + 1/xn.
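In code, the Newton-Raphson step x_{n+1} = xn − p(xn)/p′(xn) takes only a few lines. This sketch (an addition to these notes) starts from the initial guess x0 = 3:

```python
# Newton-Raphson for p(x) = x**2 - 2, starting at x0 = 3
def newton_sqrt2(x0, steps):
    x = x0
    for _ in range(steps):
        x = x - (x**2 - 2.0) / (2.0 * x)   # x_{n+1} = x_n - p(x_n)/p'(x_n)
    return x

approx = newton_sqrt2(3.0, 6)
print(approx)   # ~ 1.4142135623730951, correct to machine precision
```

Six steps already reach machine precision, a first taste of the quadratic convergence proved later in this chapter.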
Example 3.2. Consider now f(x) = sign(x)√|x| over R, with root at x = 0. The Newton-Raphson method fails miserably with this function: for any x0 ≠ 0,

    x1 = x0 − f(x0)/f′(x0) = x0 − sign(x0)|x0|^{1/2} / ( ½|x0|^{−1/2} ) = −x0.
This sequence turns into a loop: x2n = x0 , x2n+1 = −x0 for all n ∈ N (see
Figure 3.3).
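The two-cycle is easy to observe in code; the sketch below (an addition to these notes) runs the iteration from x0 = 0.81:

```python
import math

# The pathology of Example 3.2: for f(x) = sign(x)*sqrt(|x|),
# one Newton step maps x0 to -x0, so the iteration never settles.
def f(x):
    return math.copysign(math.sqrt(abs(x)), x)

def fprime(x):
    return 1.0 / (2.0 * math.sqrt(abs(x)))

x = 0.81
orbit = [x]
for _ in range(6):
    x = x - f(x) / fprime(x)
    orbit.append(x)
print(orbit)   # alternates: 0.81, -0.81, 0.81, -0.81, ...
```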
    x_{n+1} − x⋆ = ( ∆f[xn, xn, x⋆] / ∆f[xn, xn] ) (xn − x⋆)².

Therefore,

    lim_n (x_{n+1} − x⋆)/(xn − x⋆)² = lim_n ∆f[xn, xn, x⋆]/∆f[xn, xn] = f″(x⋆)/(2f′(x⋆)).

If f″(x⋆) ≠ 0, the Newton-Raphson iteration exhibits quadratic convergence.¹
Remark 3.1. We have just proven that, if a Newton-Raphson iteration
for a function f gives a convergent sequence, the convergence is quadratic.
But, how can we guarantee convergence to a root of f? The key is how far from the root we can start the sequence, given the structure of the graph of f.
Theorem 3.1 (Local Convergence for the Newton-Raphson Method). Let x⋆ be a simple root of the equation f(x) = 0, and assume that there exists ε > 0 so that
• f is twice continuously differentiable in the interval (x⋆ − ε, x⋆ + ε), and
• there are no critical points of f on that interval.
Set

    M(ε) = max{ |f″(s)/(2f′(t))| : x⋆ − ε < s, t < x⋆ + ε }.

If ε is small enough so that εM(ε) < 1, then
(a) There are no other roots of f in (x⋆ − ε, x⋆ + ε).
(b) Any Newton-Raphson iteration starting at an initial guess x0 ≠ x⋆ in that interval will converge (quadratically) to x⋆.

(Marginal note: I have seen this Theorem in [9] with the condition 2εM(ε) < 1 instead, but I could not see why that 2 was necessary. What am I missing?)

Proof. Start with Taylor's Theorem for f around x⋆. Given x ≠ x⋆ satisfying |x − x⋆| < ε, there exists ξ between x and x⋆ so that

    f(x) = f(x⋆) + (x − x⋆)f′(x⋆) + ½(x − x⋆)²f″(ξ)
         = (x − x⋆)f′(x⋆)( 1 + (x − x⋆) f″(ξ)/(2f′(x⋆)) ),

since f(x⋆) = 0. Note that the three factors in the last expression are never zero:

    x − x⋆ ≠ 0 (since x ≠ x⋆ by hypothesis),
    f′(x⋆) ≠ 0 (no critical points by hypothesis),
    |(x − x⋆) f″(ξ)/(2f′(x⋆))| ≤ εM(ε) < 1 (by hypothesis on M(ε)).
This proves (a).
We want to prove now that all terms of a Newton-Raphson iteration
stay in the interval (x? − ε, x? + ε). We do that by induction:
• |x0 − x? | < ε by hypothesis.
1See Appendix A
Since εM(ε) < 1, lim_n (εM(ε))ⁿ = 0, and {xn}_{n∈N} converges to x⋆. □
Example 3.3. Let's use this Theorem to prove convergence of Newton-Raphson iterations for f(x) = x² − 2 for initial guesses x0 in a meaningful interval, something like the obvious choice x0 ∈ [1, 2]. Pick for example ε = √2/2, so we can analyze convergence with initial guesses in the interval [√2/2, 3√2/2] ⊃ [1, 2]. Note that

    M(√2/2) = max{ 1/(2t) : √2/2 < t < 3√2/2 } = √2/2,

and thus,

    εM(ε) = (√2/2) · M(√2/2) = 1/2 < 1.

By virtue of Theorem 3.1 we have quadratic convergence to √2 if we start with any initial guess in this interval.
1.3. Extension to higher dimensions. Let's proceed to extend this process to functions g : R^d → R^d as follows.
• Any function g : R^d → R^d can be described in the form g(x) = (g1(x), g2(x), . . . , gd(x)) for d real-valued functions gk : R^d → R (1 ≤ k ≤ d).
• For such a function g, the derivative is the Jacobian

    J g = ∇g = ( ∂g1/∂x1   ∂g1/∂x2   ···   ∂g1/∂xd )
               ( ∂g2/∂x1   ∂g2/∂x2   ···   ∂g2/∂xd )
               (    ⋮         ⋮       ⋱       ⋮    )
               ( ∂gd/∂x1   ∂gd/∂x2   ···   ∂gd/∂xd )
or equivalently,

    x1ᵀ = x0ᵀ − [∇g(x0)]⁻¹ · g(x0)ᵀ.

Example 3.4 considers the function

    g(x, y) = (x³ − y, y³ − x).

For an initial guess (x0, y0), the sequence computed by this method is then given by

    (x_{n+1}, y_{n+1})ᵀ = (xn, yn)ᵀ − 1/(9xn²yn² − 1) · ( 3yn²   1   ) · ( xn³ − yn )
                                                        (  1    3xn² )   ( yn³ − xn )
(a) Starting at (x0, y0) = (−1, 1), the sequence converges to (0, 0).
n xn yn
0 −1.00000000 1.00000000
1 −0.50000000 0.50000000
2 −0.14285714 0.14285714
3 −0.00549451 0.00549451
4 −0.00000033 0.00000033
5 −0.00000000 0.00000000
6 −0.00000000 0.00000000
(b) Starting at (x0 , y0 ) = (3.5, 2.1), the sequence converges to (1, 1).
n xn yn
0 3.50000000 2.10000000
1 2.37631607 1.57961573
2 1.65945969 1.27476534
3 1.23996276 1.10419072
4 1.04837462 1.02274752
5 1.00260153 1.00133122
6 1.00000824 1.00000451
7 1.00000000 1.00000000
8 1.00000000 1.00000000
(c) Starting at (x0 , y0 ) = (−13.5, −7.3), the sequence converges to
(−1, −1).
n xn yn
0 −13.50000000 −7.30000000
1 −9.00900415 −4.92301873
2 −6.01982204 −3.36480659
3 −4.03494126 −2.36199873
4 −2.72553474 −1.73750959
5 −1.87830623 −1.36573112
6 −1.36121191 −1.15374930
7 −1.09518303 −1.04341362
8 −1.00932090 −1.00463507
9 −1.00010404 −1.00005571
10 −1.00000001 −1.00000001
11 −1.00000000 −1.00000000
12 −1.00000000 −1.00000000
13 −1.00000000 −1.00000000
These are precisely the critical points of f(x, y) = x⁴ − 4xy + y⁴ that we found in Example 2.16: (0, 0), (−1, −1) and (1, 1). See Figure 3.4.
Example 3.6. A similar process for the Rosenbrock function
R1,1 (x, y) = (1 − x)2 + (y − x2 )2
gives the following recursive formula:

    (x_{n+1}, y_{n+1})ᵀ = (xn, yn)ᵀ − [Hess R1,1(xn, yn)]⁻¹ · ∇R1,1(xn, yn)ᵀ
                        = 1/(2xn² − 2yn + 1) · ( 2xn³ − 2xn·yn + 1,  xn(2xn³ − 2xn·yn − xn + 2) )ᵀ
For instance, starting with the initial guess (x0 , y0 ) = (−2, 2), the sequence
converges to the critical point (1, 1) in a few steps. See Figure 3.4.
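The same iteration, written with a generic Hessian solve instead of the closed formula above, can be sketched as follows (an addition to these notes, assuming NumPy):

```python
import numpy as np

# Newton-Raphson on the Rosenbrock function R(x, y) = (1-x)**2 + (y-x**2)**2
def grad(x, y):
    return np.array([2.0 * (x - 1.0) + 4.0 * x * (x**2 - y), 2.0 * (y - x**2)])

def hess(x, y):
    return np.array([[2.0 + 12.0 * x**2 - 4.0 * y, -4.0 * x],
                     [-4.0 * x, 2.0]])

p = np.array([-2.0, 2.0])   # initial guess from Example 3.6
for _ in range(20):
    p = p - np.linalg.solve(hess(*p), grad(*p))
print(p)   # converges to the critical point (1, 1)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse Hessian explicitly, which is the recommended practice in higher dimensions.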
There are some theoretical results that aid in the search for a good initial guess in the case of multivariate functions. The following states a simple set of conditions on f and x0 to guarantee quadratic convergence of the corresponding sequences {xn}_{n∈N} to a critical point x⋆.

Theorem 3.2 (Quadratic Convergence Theorem). Suppose f : R^d → R is a twice continuously differentiable real-valued function, and x⋆ is a critical point of f. Let N(x)ᵀ = xᵀ − [Hess f(x)]⁻¹ · ∇f(x)ᵀ. If there exist
(a) h > 0 so that² ‖[Hess f(x⋆)]⁻¹‖ ≤ 1/h,
(b) β > 0, L > 0 for which ‖Hess f(x) − Hess f(x⋆)‖ ≤ L‖x − x⋆‖ provided ‖x − x⋆‖ ≤ β,
then for all x ∈ R^d satisfying ‖x − x⋆‖ ≤ min{β, 2h/(3L)},

    ‖N(x) − x⋆‖ / ‖x − x⋆‖² ≤ 3L/(2h).
Example 3.8. There is an implementation of the Newton-Raphson method in the scipy.optimize library in Python: the routine fmin_ncg() (the cg indicates that the inversion of the Hessian is performed with the conjugate gradient technique). The following session illustrates how to use this routine.
²Recall the norm of a matrix M, defined by ‖M‖ = max{‖M · x‖ : ‖x‖ = 1}.
We call fmin_ncg with the function and its gradient (with the option fprime=), and with the initial guess (2, 2).
>>> fmin_ncg(rosen, [2.,2.], fprime=rosen_der, retall=False)
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 277
Function evaluations: 300
Gradient evaluations: 1540
Hessian evaluations: 0
array([ 0.99999993, 0.99999987])
2. Secant Methods
The Newton-Raphson method to compute local extrema of real-valued functions f : R^d → R has several shortcomings:
• Convergence is not guaranteed.
• Even when convergence is guaranteed, the limit of a Newton-Raphson iteration is not necessarily a local minimum. Any critical point of f is a target.
• For a successful implementation of Newton-Raphson we need expressions for the function itself, its gradient, and its Hessian matrix.
Secant methods take care of the latter issue, while retaining comparable convergence rates.
2.1. A secant method to search for roots of univariate func-
tions. To√ explain how it works, let’s once again try to find an accurate
value of 2 as the root of the polynomial p(x) = x2 − 2.
(a) Consider two initial guesses x0 = 3, x1 = 2.8. Notice f(3) = 7 ≠ 5.84 = f(2.8).
(b) The line that joins the points (3, 7) and (2.8, 5.84) has equation

    y − 7 = ((5.84 − 7)/(2.8 − 3)) (x − 3),   that is,   y = 5.8x − 10.4.

The latter can be seen as a linear approximation to the original function, one that did not use the derivative of f. This linear function intersects the x-axis at

    x2 = 10.4/5.8 ≈ 1.7931034483.
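Repeating this construction with the two most recent iterates gives the secant iteration; the sketch below (an addition to these notes) continues from the same guesses:

```python
# The secant iteration for p(x) = x**2 - 2, from the guesses x0 = 3, x1 = 2.8
def p(x):
    return x**2 - 2.0

x_prev, x_curr = 3.0, 2.8
for _ in range(8):
    # the next iterate is where the secant line crosses the x-axis
    x_next = x_curr - p(x_curr) * (x_curr - x_prev) / (p(x_curr) - p(x_prev))
    x_prev, x_curr = x_curr, x_next
print(x_curr)   # ~ 1.4142135623730951 = sqrt(2)
```

The first computed iterate is the value x2 ≈ 1.7931034483 obtained above; no derivative of p is ever evaluated.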
Example 3.10. Let's use this method to find any of the roots (−1, −1), (0, 0), (1, 1) of the function g(x, y) = (x³ − y, y³ − x) from Example 3.4. Starting with initial guess (x0, y0) = (−1, 1), we take the first step from Newton-Raphson:

    ∇g(x0, y0) = ( 3x0²   −1  ) = (  3  −1 )
                 (  −1   3y0² )   ( −1   3 )

    L0(x, y)ᵀ = ( −2 ) + (  3  −1 ) · ( x + 1 ) = (  3x − y + 2 )
                (  2 )   ( −1   3 )   ( y − 1 )   ( −x + 3y − 2 )
This gives (x1, y1) = (−1/2, 1/2) as root of L0. For the second step, we compute the matrix A1 satisfying the secant property:

    A1 = A0 + [ (g(−1/2, 1/2) − g(−1, 1) − A0 · [1/2, −1/2]ᵀ)ᵀ ⊗ [1/2, −1/2] ] / ‖[1/2, −1/2]‖².

Since g(−1/2, 1/2) = (−5/8, 5/8), g(−1, 1) = (−2, 2), and A0 · [1/2, −1/2]ᵀ = [2, −2]ᵀ, this gives

    A1 = (  3  −1 ) + ( [−5/8, 5/8]ᵀ ⊗ [1/2, −1/2] ) / (1/2)
         ( −1   3 )

       = (  3  −1 ) + ( −5/8   5/8 ) = ( 19/8  −3/8 )
         ( −1   3 )   (  5/8  −5/8 )   ( −3/8  19/8 )
This gives a linear approximation

    L1(x, y)ᵀ = ( −5/8 ) + ( 19/8  −3/8 ) · ( x + 1/2 ) = (  (19/8)x − (3/8)y + 3/4 )
                (  5/8 )   ( −3/8  19/8 )   ( y − 1/2 )   ( −(3/8)x + (19/8)y − 3/4 )

which has root (x2, y2) = (−3/11, 3/11).
If we continue the computations, we arrive at the root (0, 0) to an accuracy of 6 decimal places in just 5 steps!

    n   xn          yn          g(xn, yn)
    0   −1.000000   1.000000    (−2.000000, 2.000000)
    1   −0.500000   0.500000    (−0.625000, 0.625000)
    2   −0.272727   0.272727    (−0.293013, 0.293013)
    3   −0.072136   0.072136    (−0.072511, 0.072511)
    4   −0.006172   0.006172    (−0.006172, 0.006172)
    5   −0.000035   0.000035    (−0.000035, 0.000035)
    6   −0.000000   0.000000    (−0.000000, 0.000000)
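The whole computation can be sketched in a few lines. The code below (an addition to these notes, assuming NumPy) applies Broyden's rank-one update from the same starting data:

```python
import numpy as np

# A sketch of Broyden's method for g(x, y) = (x**3 - y, y**3 - x),
# reproducing the iteration of Example 3.10 from the guess (-1, 1).
def g(v):
    x, y = v
    return np.array([x**3 - y, y**3 - x])

x = np.array([-1.0, 1.0])
A = np.array([[3.0, -1.0], [-1.0, 3.0]])    # A0 = Jacobian of g at the guess
for _ in range(10):
    x_new = x - np.linalg.solve(A, g(x))
    s = x_new - x
    if np.linalg.norm(s) < 1e-12:           # converged; avoid dividing by 0
        x = x_new
        break
    # rank-one secant update, enforcing A1 @ s = g(x_new) - g(x)
    A = A + np.outer(g(x_new) - g(x) - A @ s, s) / (s @ s)
    x = x_new
print(x)   # converges to the root (0, 0)
```

The first iterate is (−1/2, 1/2) and the second (−3/11, 3/11), matching the hand computation above.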
2.4. Secant Methods for Optimization. Given a real-valued func-
tion f : Rd → R, we may apply Broyden’s method to search for roots of the
gradient ∇f : Rd → Rd . As it happened with Newton’s method, we are not
guaranteed convergence to a local minimum.
which proves that the gradients of f at consecutive terms of a sequence of steepest descent are perpendicular. Now, by virtue of the recurrence formula (12),
n xn yn f (xn , yn )
0 −1.000000 1.000000 6.000000
1 0.000000 0.000000 0.000000
2 nan nan nan
n xn yn f (xn , yn )
0 3.500000 2.100000 140.110600
1 1.044472 1.753064 3.310777
2 1.141931 1.063276 −1.878163
3 1.008581 1.044435 −1.988879
4 1.013966 1.006319 −1.998931
5 1.000898 1.004472 −1.999891
6 1.001437 1.000651 −1.999989
7 1.000093 1.000461 −1.999999
8 1.000149 1.000067 −2.000000
9 1.000010 1.000048 −2.000000
10 1.000015 1.000007 −2.000000
11 1.000001 1.000005 −2.000000
12 1.000002 1.000001 −2.000000
13 1.000000 1.000001 −2.000000
14 1.000000 1.000000 −2.000000
15 1.000000 1.000000 −2.000000
n xn yn f (xn , yn )
0 −13.500000 −7.300000 35660.686600
1 2.362722 −4.871733 640.498302
2 1.434154 1.194162 −0.586492
3 1.021502 1.130993 −1.896212
4 1.038817 1.017881 −1.991558
5 1.002305 1.012291 −1.999167
6 1.003909 1.001808 −1.999917
7 1.000236 1.001246 −1.999992
8 1.000399 1.000185 −1.999999
9 1.000024 1.000127 −2.000000
10 1.000041 1.000019 −2.000000
11 1.000002 1.000013 −2.000000
12 1.000004 1.000002 −2.000000
13 1.000000 1.000001 −2.000000
14 1.000000 1.000000 −2.000000
15 1.000000 1.000000 −2.000000
n xn yn f (xn , yn ) n xn yn f (xn , yn )
0 −2.000000 2.000000 13.000000 17 0.916394 0.789239 0.009544
1 −0.166290 2.309522 6.567163 18 0.911201 0.818326 0.008028
2 0.256054 −0.056128 0.568264 19 0.929317 0.821560 0.006766
3 0.613477 0.007683 0.285318 20 0.925024 0.845608 0.005723
4 0.568566 0.259241 0.190235 21 0.939976 0.848277 0.004847
5 0.715784 0.285524 0.132227 22 0.936397 0.868329 0.004118
6 0.689755 0.431319 0.098227 23 0.948845 0.870551 0.003502
7 0.779264 0.447299 0.074310 24 0.945840 0.887385 0.002986
8 0.761554 0.546496 0.057977 25 0.956276 0.889248 0.002548
9 0.823325 0.557524 0.045696 26 0.953739 0.903457 0.002178
10 0.810322 0.630358 0.036667 27 0.962537 0.905028 0.001864
11 0.855862 0.638488 0.029614 28 0.960386 0.917075 0.001597
12 0.845883 0.694385 0.024199 29 0.967837 0.918405 0.001369
13 0.880846 0.700627 0.019862 30 0.966007 0.928657 0.001176
14 0.872964 0.744776 0.016437 31 0.972342 0.929788 0.001010
15 0.900551 0.749702 0.013647 32 0.970780 0.938539 0.000869
16 0.894200 0.785276 0.011399 33 0.976182 0.939503 0.000748
Assume we have a quadratic function p : R^d → R satisfying p(0) = 0. There exist a d-dimensional vector D = (q1, . . . , qd) and a symmetric matrix Q = (q_{jk})_{j,k=1}^d (with q_{jk} = q_{kj} for all 1 ≤ j, k ≤ d) so that

    p(x) = ⟨D, x⟩ + ½ Q_Q(x) = Σ_{k=1}^d ( ½ q_{kk} xk² + qk xk ) + Σ_{1≤j<k≤d} q_{jk} xj xk,

where Q_A(x) denotes the quadratic form x A xᵀ. Such a function attains its global minimum at x⋆ = −DQ⁻¹, with minimum value

    p(x⋆) = ½ DQ⁻¹Dᵀ − DQ⁻¹Dᵀ = −½ DQ⁻¹Dᵀ = −½ Q_{Q⁻¹}(D).    (13)
If xn is a term in a sequence of steepest descent, then to compute xn+1 we
proceed as follows:
therefore,

    [p(x_{n+1}) − p(x⋆)] / [p(xn) − p(x⋆)]
        = [ p(xn) − p(x⋆) − ‖vn‖⁴/(2Q_Q(vn)) ] / [ p(xn) − p(x⋆) ]
        = 1 − ‖vn‖⁴ / ( 2Q_Q(vn) (p(xn) − p(x⋆)) )
        = 1 − ‖vn‖⁴ / ( 2Q_Q(vn) ( ½ xnQxnᵀ + Dxnᵀ + ½ DQ⁻¹Dᵀ ) )
        = 1 − ‖vn‖⁴ / ( Q_Q(vn) ( xnQxnᵀ + 2Dxnᵀ + DQ⁻¹Dᵀ ) ).
Note that in the denominator we may rewrite some of the terms:

    xnQxnᵀ = xnQ(Q⁻¹Q)xnᵀ = (xnQ)Q⁻¹(xnQ)ᵀ,
    2Dxnᵀ = Dxnᵀ + Dxnᵀ = xnDᵀ + D(Q⁻¹Q)xnᵀ
          = xn(QQ⁻¹)Dᵀ + DQ⁻¹(xnQ)ᵀ
          = (xnQ)Q⁻¹Dᵀ + DQ⁻¹(xnQ)ᵀ.

This allows us to rewrite the ratio in the following convenient form:

    [p(x_{n+1}) − p(x⋆)] / [p(xn) − p(x⋆)] = 1 − ‖vn‖⁴ / ( Q_Q(vn) (xnQ + D)Q⁻¹(xnQ + D)ᵀ )
                                           = 1 − ‖vn‖⁴ / ( Q_Q(vn) Q_{Q⁻¹}(vn) ).
We are ready to state the main result of this subsection:

Theorem 3.9. Given a d-dimensional vector D, and a positive definite symmetric matrix Q of size d × d, consider the quadratic function p(x) = ½ Q_Q(x) + ⟨D, x⟩. Any sequence {xn}_{n∈N} of steepest descent converges to the global minimum x⋆ = −DQ⁻¹. The sequence of evaluations {p(xn)}_{n∈N} converges linearly to p(x⋆) = −½ Q_{Q⁻¹}(D). In particular, if 0 < λ1 ≤ λ2 ≤ ··· ≤ λd are the eigenvalues of Q, then

    [p(x_{n+1}) − p(x⋆)] / [p(xn) − p(x⋆)] ≤ ( (λd − λ1)/(λd + λ1) )².
Proof. We start by offering the following lower bound estimate³ involving the associated directions of steepest descent vn, in terms of the largest and smallest eigenvalues of Q. For all n ∈ N,

    ‖vn‖⁴ / ( Q_Q(vn) Q_{Q⁻¹}(vn) ) ≥ 4λ1λd / (λ1 + λd)².    (14)

We have then

    [p(x_{n+1}) − p(x⋆)] / [p(xn) − p(x⋆)] = 1 − ‖vn‖⁴ / ( Q_Q(vn) Q_{Q⁻¹}(vn) )
³This is left as an advanced exercise. It is not too tricky; if you are stuck, see e.g. [1, Section 1.3.1] for a proof.
        ≤ 1 − 4λ1λd/(λ1 + λd)² = ( (λd − λ1)/(λd + λ1) )².  □
Example 3.14. The global minimum value of the quadratic function p(x, y) = 5x² + 5y² − xy − 11x + 11y + 11 is zero, and found at (1, −1). Notice that we may write this function in the form p(x, y) = ½ Q_Q(x, y) + ⟨D, [x, y]⟩ + 11, where

    D = [−11, 11],   Q = ( 10  −1 ).
                         ( −1  10 )

The symmetric matrix Q has eigenvalues λ1 = 9 > 0, λ2 = 11 > 0 and is therefore positive definite. Theorem 3.9 states that sequences of steepest descent exhibit linear convergence with a rate of convergence not larger than

    δ = ( (11 − 9)/(11 + 9) )² = 0.01.
Observe the computations of the first six iterations, and the values of the ratios p(xn, yn)/p(x_{n−1}, y_{n−1}), when we use (1.5, 3.5) as our initial guess.

    n   xn             yn              p(xn, yn)        p(xn, yn)/p(x_{n−1}, y_{n−1})
0 1.5000000000 3.5000000000 100.2500000000
1 1.4498874016 −0.9600212545 1.0019989373 0.0099950019
2 1.0049975009 −0.9550224916 0.0100149812 0.0099950019
3 1.0044966254 −0.9996004124 0.0001000998 0.0099950019
4 1.0000499500 −0.9995504497 0.0000010005 0.0099950019
5 1.0000449438 −0.9999960061 0.0000000100 0.0099950042
6 1.0000004993 −0.9999955067 0.0000000001 0.0099950528
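These numbers can be reproduced with a short steepest-descent loop using the exact line search for quadratics. The sketch below is an addition to these notes (it assumes NumPy):

```python
import numpy as np

# Steepest descent with exact line search on the quadratic of Example 3.14:
# p(x, y) = 5x**2 + 5y**2 - xy - 11x + 11y + 11
Q = np.array([[10.0, -1.0], [-1.0, 10.0]])
D = np.array([-11.0, 11.0])

def p(v):
    return 0.5 * v @ Q @ v + D @ v + 11.0

x = np.array([1.5, 3.5])            # initial guess from the example
values = [p(x)]
for _ in range(6):
    v = -(Q @ x + D)                # steepest-descent direction, -grad p
    t = (v @ v) / (v @ Q @ v)       # exact minimizer of p along x + t*v
    x = x + t * v
    values.append(p(x))

ratios = [b / a for a, b in zip(values, values[1:])]
print(x, ratios[0])   # ratios stay near 0.009995, below the bound 0.01
```

The step length t = ‖v‖²/(v Q vᵀ) is the closed-form line search available only because p is quadratic.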
    x_{n+1} = xn + tn wn,
Using these principles, we are going to see two constructions based upon
secant methods: the DFP and BFGS methods.
We call fmin_bfgs with the function and its gradient (with the option fprime=), the initial guess (−2, 2), and activate the option retall=True, which returns the complete list of iterates computed by the algorithm.
>>> result = fmin_bfgs(R, [-2.,2.], fprime=jacR, retall=True)
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 12
Function evaluations: 15
Gradient evaluations: 15
>>> plt.figure(figsize=(8,8));
... plt.axes(aspect='equal');
... plt.contour(X, Y, (1-X)**2+(Y-X**2)**2, \
... colors='k', \
... levels=[0.2,0.8,1,1.4,1.78,2.23,4,5.4,8,13,32,64]);
... plt.plot(x, x**2, 'r--');
... plt.xlim(-3, 3);
... plt.ylim(-1, 3);
... plt.plot([p[0] for p in result[1]], [p[1] for p in result[1]], 'b.-');
... plt.title("Convergence to critical point:\nBFGS on Rosenbrock");
... plt.show()
Exercises
Problem 3.1 (Basic). [9, p.249 #1] The following sequences all converge
to zero.
    vn = n⁻¹⁰,   wn = 10⁻ⁿ,   xn = 10^(−n²),   yn = n¹⁰ · 3⁻ⁿ,   zn = 10^(−3·2ⁿ)
Indicate the type of convergence (See Appendix A).
Problem 3.2 (Advanced). [9, p.249 #4] Give an example of a positive sequence {εn}_{n∈N} converging to zero in such a way that lim_n ε_{n+1}/εnᵖ = 0 for some p > 1, but not converging to zero with any order q > p.
with

    D = [12, −47, −8],   Q = (  10  −18    2 )
                             ( −18   40   −1 )
                             (   2   −1    3 )
(a) Find the global minimum value of p, and its location.
(b) Compute the eigenvalues of Q. Is Q positive definite?
(c) What is the worst-case scenario rate of convergence of sequences of
steepest descent for this function?
(d) Compute sequences of steepest descent for this function with the
initial guesses below. Make sure to report a table similar to the one
in Example 3.14.
• (0, 0, 0)
• (15.09, 7.66, −6.56)
• (11.77, 6.42, −4.28)
• (4.46, 2.25, 1.85)
CHAPTER 4

Existence and Characterization of Extrema for Constrained Optimization
or better, if we set
S = {x ∈ D : g1 (x) ≤ 0, . . . , gm (x) ≤ 0, h1 (x) = 0, . . . , h` (x) = 0},
we may simply write

    (P) :  min_{x∈S} f.
Figure 4.1. Can you tell what are the global maximum and
minimum values of f in S?
triangle S is a Slater point for (P ), and all relevant functions are convex.
Definition. Given a consistent program (P) as defined in (22), for each feasible point x ∈ S, we define:
(a) The cone of improving directions of f at x, as

    F0(x) = {v ∈ R^d : ‖v‖ = 1, ⟨∇f(x), v⟩ < 0}.

(b) The set of indices of the binding inequality constraints for x, as

    I(x) = {k ∈ {1, . . . , m} : gk(x) = 0}.

(c) The cone of inward pointing directions for the binding constraints at x, as

    G0(x) = {v ∈ R^d : ‖v‖ = 1, ⟨∇gk(x), v⟩ < 0 for all k ∈ I(x)}.

(d) The set of tangent directions for the equality constraints at x, as

    H0(x) = {v ∈ R^d : ⟨∇hk(x), v⟩ = 0, 1 ≤ k ≤ ℓ}.
Example 4.2. For the program (P ) in Example 4.1, consider the feasible
points (0, 0), (−1, −1) and (0, −1/2). Since ∇f (x, y) = [4x3 , 4y 3 ], we have
the following cones on improving direction
F0 (0, 0) = ∅,
F0 (−1, −1) = {v = (v1 , v2 ) ∈ R2 : ‖v‖ = 1, v1 + v2 > 0},
F0 (0, −1/2) = {v = (v1 , v2 ) ∈ R2 : ‖v‖ = 1, v2 > 0}.
The indices of binding inequality constraints are
I(0, 0) = {3}, I(−1, −1) = {1, 2}, I(0, −1/2) = ∅,
and therefore, the cones of inward pointing directions for the binding con-
straints are
G0 (0, 0) = {v = (v1 , v2 ) ∈ R2 : ‖v‖ = 1, v1 + v2 < 0},
G0 (−1, −1) = {v = (v1 , v2 ) ∈ R2 : ‖v‖ = 1, v1 > 0, v2 > 0},
G0 (0, −1/2) = ∅.
Since there are no equality constraints, we do not have any sets of tangent
directions.
Figure 4.2. Cones for (0, 0), (−1, −1) and (0, −1/2).
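These cone computations are easy to sanity-check numerically. Below is a small, self-contained Python sketch (gradients copied from the example) that tests one unit vector against the defining inequalities at the point (−1, −1):

```python
import math

# Check membership in F0(-1,-1) and G0(-1,-1) from Example 4.2.
# Gradients at a generic point, copied from the text:
def grad_f(x, y):  return (4 * x**3, 4 * y**3)
def grad_g1(x, y): return (2 * x, 0.0)
def grad_g2(x, y): return (0.0, 2 * y)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

v = (1 / math.sqrt(2), 1 / math.sqrt(2))   # unit vector with v1 + v2 > 0

in_F0 = dot(grad_f(-1, -1), v) < 0                 # improving direction?
in_G0 = (dot(grad_g1(-1, -1), v) < 0 and           # inward for g1 and g2,
         dot(grad_g2(-1, -1), v) < 0)              # the binding constraints
print(in_F0, in_G0)   # True True
```

Both tests succeed, as predicted by the membership conditions v1 + v2 > 0 and v1, v2 > 0 derived above.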
1. Necessary Conditions
We begin with two results that focus on the structure of the equality
constraints.
Theorem 4.1 (Geometric Necessary Condition for Linear Equality Con-
straints). If (P ) is a consistent program with linear equality constraints
hk (x) = ⟨ak , x⟩ + bk (ak ∈ Rd , bk ∈ R) for all 1 ≤ k ≤ ℓ, then for all
feasible local minima x ∈ S,
F0 (x) ∩ G0 (x) ∩ H0 (x) = ∅.
Theorem 4.2 (Geometric Necessary Condition for Linearly Independent
Equality Constraints). If x ∈ S is a feasible local minimum for the consistent
program (P ), and the gradient vectors {∇hk (x) : 1 ≤ k ≤ `} are linearly
independent, then F0 (x) ∩ G0 (x) ∩ H0 (x) = ∅.
Example 4.3. Continuing with Example 4.1, let's check whether the point
(0, 0) is a candidate for an optimal solution of this program. We use
Theorem 4.3 to verify this claim:
∇f (x, y) = [4x³ , 4y³ ]   ∇f (0, 0) = [0, 0],
∇g1 (x, y) = [2x, 0]   ∇g1 (0, 0) = [0, 0],
∇g2 (x, y) = [0, 2y]   ∇g2 (0, 0) = [0, 0],
∇g3 (x, y) = [e^{x+y} , e^{x+y} ]   ∇g3 (0, 0) = [1, 1].
Notice how the gradients line up nicely. Can we find λk ≥ 0 so that:
(a) [λ0 , λ1 , λ2 , λ3 ] 6= [0, 0, 0, 0],
(b) λ1 = λ2 = 0 (since λ1 g1 (0, 0) = −2λ1 and λ2 g2 (0, 0) = −2λ2 ), and
(c) the following linear combination is equal to [0, 0]
λ0 ∇f (0, 0) + λ1 ∇g1 (0, 0) + λ2 ∇g2 (0, 0) + λ3 ∇g3 (0, 0) = [0, 0],
λ0 [0, 0] + λ1 [0, 0] + λ2 [0, 0] + λ3 [1, 1] = [0, 0],
We may select, for instance, λ0 = 1, λ1 = λ2 = λ3 = 0, which proves that
the point (0, 0) is indeed a candidate for an optimal solution of (P ).
Remark 4.1. The conditions (a) and (b) of Theorem 4.4 are called the
KKT conditions of the program (P ) in the literature. The values λk , µk are
called multipliers.
Example 4.4. Set f (x, y) = (x − 12)2 + (y + 6)2 . Consider the program
(P ) designed to find the global minimum of this function on the set S =
{(x, y) ∈ R2 : x2 + 3x + y 2 − 4.5y ≤ 6.5, (x − 9)2 + y 2 ≤ 64, 8x + 4y = 20}. We
want to prove that the point (2, 1) is a good candidate for optimal solution
of (P ). The point (2, 1) is feasible. To see this, set
g1 (x, y) = x2 + 3x + y 2 − 4.5y − 6.5,
g2 (x, y) = (x − 9)2 + y 2 − 64,
h1 (x, y) = 8x + 4y − 20,
(or simpler equivalent constraints), and notice that
g1 (2, 1) = 0, g2 (2, 1) = −14, h1 (2, 1) = 0.
We have concluded that (P ) is consistent. Notice that I(2, 1) = {1}, since
g1 (2, 1) = 0 and g2 (2, 1) ≠ 0. Further,
∇f (x, y) = [2(x − 12), 2(y + 6)] ∇f (2, 1) = [−20, 14],
∇g1 (x, y) = [2x + 3, 2y − 4.5] ∇g1 (2, 1) = [7, −2.5],
∇g2 (x, y) = [2(x − 9), 2y] ∇g2 (2, 1) = [−14, 2],
∇h1 (x, y) = [8, 4]   ∇h1 (2, 1) = [8, 4].
The vectors ∇g1 (2, 1) = [7, −2.5] and ∇h1 (2, 1) = [8, 4] are linearly indepen-
dent. Therefore, to verify that (2, 1) is candidate for optimal solution of
(P ), we may now use Theorem 4.4. The KKT conditions read as follows:
we are looking for λk ≥ 0, µ1 ∈ R so that λk gk (2, 1) = 0 (k = 1, 2) and
∇f (2, 1) + λ1 ∇g1 (2, 1) + λ2 ∇g2 (2, 1) + µ1 ∇h1 (2, 1) = [0, 0],
Let’s address the first condition: Since g1 (2, 1) = 0 and g2 (2, 1) = −14 < 0,
it must be λ2 = 0. The second condition turns then into the equation
[−20, 14] + λ1 [7, −2.5] + 0 · [−14, 2] + µ1 [8, 4] = [0, 0]
or equivalently
    [   7    8 ] [ λ1 ]   [  20 ]
    [ −2.5   4 ] [ µ1 ] = [ −14 ],
which gives λ1 = 4 and µ1 = −1. This proves that the point (2, 1) is indeed
a good candidate for the optimal solution of (P ).
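The 2 × 2 system above can of course also be solved by machine; a minimal numpy check of the multipliers:

```python
import numpy as np

# Solve the system for the multipliers in Example 4.4:
#   7*lam1 + 8*mu1 = 20,  -2.5*lam1 + 4*mu1 = -14
A = np.array([[7.0, 8.0],
              [-2.5, 4.0]])
b = np.array([20.0, -14.0])
lam1, mu1 = np.linalg.solve(A, b)
print(lam1, mu1)   # lam1 = 4, mu1 = -1 (up to rounding)
```

Since λ1 = 4 ≥ 0 comes out nonnegative, the KKT conditions are indeed satisfied at (2, 1).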
There are other instances in which the KKT conditions can be used
instead of those in the Fritz John Theorem.
Theorem 4.5 (Slater Necessary Condition). Suppose that the inequality
constraints gk of a super-consistent program (P ) are pseudo-convex (1 ≤
k ≤ m), the equality constraints hk are linear (1 ≤ k ≤ `), and the vectors
∇hk (x) are linearly independent at a feasible point x. Then the KKT con-
ditions (a) and (b) of Theorem 4.4 are necessary to characterize x as an
optimal solution of (P ).
Theorem 4.6. If all constraints of a consistent program (P ) are lin-
ear, then the KKT conditions (a) and (b) of Theorem 4.4 are necessary to
characterize optimal solutions of (P ).
2. Sufficient Conditions
It all boils down to a single result.
Theorem 4.7 (KKT Sufficient Conditions). Let x ∈ S be a feasible
point of the consistent program (P ) for which there are multipliers λk ≥ 0
(1 ≤ k ≤ m) and µk ∈ R (1 ≤ k ≤ `) satisfying the conditions (a) and (b) of
Theorem 4.4. If f is pseudo-convex, gk is quasi-convex for all 1 ≤ k ≤ m,
and hk is linear for all 1 ≤ k ≤ `, then x is a global optimal solution of (P ).
Example 4.5. We saw that the point (0, 0) satisfies the KKT condi-
tions for the super-consistent convex program (P ) in Example 4.1. As a
consequence of Theorems 4.3 and 4.7, this point must be the global optimal
solution of (P ).
We also saw that the point (2, 1) satisfies the KKT conditions for the
program (P ) in Example 4.4. It is not hard to see that this program is
super-consistent, f is pseudo-convex, g1 and g2 are quasi-convex, and h1
is linear. By virtue of Theorems 4.4 and 4.7, the point (2, 1) must be the
optimal solution of (P ).
Key Examples
In this section we are going to use the KKT conditions to address the
characterization of optimal solutions of generic programs.
Example 4.6. Let Q be a symmetric d × d matrix. Consider the
associated quadratic form QQ (x). We wish to find the global maximum of
this function over the unit ball Bd = {x ∈ Rd : ‖x‖ ≤ 1}.
An equivalent program (P ) is thus defined with f (x) = −QQ (x) as
its objective function, and a single inequality constraint g1 (x) = ‖x‖² −
1. This is trivially a super-consistent program with a convex inequality
constraint. Checking the KKT conditions to look for the optimal solution
is thus justified under the hypothesis of Theorem 4.5. Notice that
∇f (x)⊺ = −2Qx⊺ ,
∇g1 (x) = 2x;
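Although the worked example is cut short here, its punchline can be previewed numerically: for a symmetric Q with a positive eigenvalue, the maximum of QQ (x) over the unit ball is the largest eigenvalue of Q, attained at a corresponding unit eigenvector. The matrix below is a made-up illustration, not one from the text:

```python
import numpy as np

# Maximum of the quadratic form x^T Q x over the unit ball:
# it equals max(lambda_max(Q), 0), attained at a unit eigenvector
# of Q whenever lambda_max(Q) > 0.
Q = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # symmetric; eigenvalues 1 and 3
eigvals, eigvecs = np.linalg.eigh(Q)    # eigh returns ascending eigenvalues
lam_max = eigvals[-1]
x_star = eigvecs[:, -1]                 # unit eigenvector for lam_max
print(lam_max)                          # 3 (up to rounding)
print(x_star @ Q @ x_star)              # the quadratic form at x_star: also 3
```

The KKT conditions for this program reduce to Qx = λ1 x, which is exactly why eigenvectors show up as the candidate solutions.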
Exercises
Problem 4.1 (Basic). Consider the following problem: Find the global
minimum of the function f (x, y) = 6(x − 10)2 + 4(y − 12.5)2 on the set
S = {(x, y) ∈ R2 : x2 + (y − 5)2 ≤ 50, x2 + 3y 2 ≤ 200, (x − 6)2 + y 2 ≤ 37}.
(a) Write the statement of this problem as a program with the nota-
tion from equation 22. Label the objective function, as well as the
inequality constraints accordingly.
(b) Is the objective function f pseudo-convex? Why or why not?
(c) Are the inequality constraints quasi-convex? Why or why not?
(d) Sketch the feasibility region. Label all relevant objects involved.
(e) Is the point (7, 6) feasible? Why or why not?
(f) Employ Theorem 4.4 to write a necessary condition for optimality
and verify that it is satisfied by the point (7, 6).
(g) Employ Theorem 4.7 to decide whether this point is an optimal
solution of (P ).
Problem 4.2 (Basic). [8, lec6 constr opt, 10] Let f (x, y) = (x − 4)2 +
(y − 6)2 . Consider the program (P ) to find the global minimum of f on the
set S = {(x, y) ∈ R2 : y − x2 ≥ 0, y ≤ 4}.
(a) Write the statement of this problem as a program with the nota-
tion from equation 22. Label the objective function, as well as the
inequality constraints accordingly.
(b) Is the objective function f pseudo-convex? Why or why not?
(c) Are the inequality constraints quasi-convex? Why or why not?
(d) Sketch the feasibility region. Label all relevant objects involved.
(e) Is the point (2, 4) feasible? Why or why not?
(f) Employ Theorem 4.4 to write a necessary condition for optimality
and verify that it is satisfied by the point (2, 4).
(g) Employ Theorem 4.7 to decide whether this point is an optimal
solution of (P ).
Problem 4.3 (Basic). [8, lec6 constr opt, 12] Let f (x, y) = (x−9/4)2 +
(y − 2)2 . Consider the program (P ) to find the global minimum of f on the
set S = {(x, y) ∈ R2 : y − x2 ≥ 0, x + y ≤ 6, x ≥ 0, y ≥ 0}.
(a) Write down the KKT optimality conditions and verify that these
conditions are satisfied at the point (3/2, 9/4).
(b) Present a graphical interpretation of the KKT conditions at (3/2, 9/4).
(c) Show that this point is the optimal solution to the program.
Problem 4.4 (Basic). Find examples of non-diagonal 3 × 3 symmetric
square matrices with integer-valued eigenvalues of each type below:
• A1 positive definite,
• A2 positive semi-definite,
• A3 negative definite,
• A4 negative semi-definite, and
• A5 indefinite.
For each of these matrices, find the maximum of their corresponding qua-
dratic form QAk (x, y, z) over the unit ball B3 = {(x, y, z) ∈ R3 : x2 + y 2 +
z 2 ≤ 1}.
Problem 4.5 (Advanced). [8, lec6 constr opt, 19]
Arrow-Hurwicz-Uzawa constraint qualification: Consider the prob-
lem to minimize f (x) subject to x ∈ X (X being an open set X ⊆ Rd )
and a set of continuous inequality constraints gk (x) ≤ 0 (1 ≤ k ≤ m). Let
J = {k ∈ {1, . . . , m} : gk is pseudo-concave}. Prove that if the set
{x ∈ X : ⟨∇gk (x⋆ ), x⟩ ≤ 0 (k ∈ J ); ⟨∇gk (x⋆ ), x⟩ < 0 (k ∈ I(x⋆ ) \ J )}
Example 5.1. We would like to find the minimum value of the function
f (x, y, z) = x2 + y 2 + z 2 over the line at the intersection of the planes
x + y + z = 0 and x − y + 2z = 3.
Among the methods we use in a course of Vector Calculus, one would
start by computing a parameterization of the line. For instance, by forcing
z = 0 and solving the system formed by the two planes (with this
restriction), we find that the point (3/2, −3/2, 0) belongs to this line. The
cross product of the normal vectors to the planes gives the direction of the line:
                         | i   j   k |
[1, 1, 1] × [1, −1, 2] = | 1   1   1 | = [3, −1, −2].
                         | 1  −1   2 |
We have then the line with equation (3/2 + 3t, −3/2 − t, −2t), t ∈ R. A
restriction of f on this line gives
    ϕ(t) = f (3/2 + 3t, −3/2 − t, −2t)
         = (3/2 + 3t)² + (−3/2 − t)² + (−2t)²
         = 9/4 + 9t² + 9t + 9/4 + t² + 3t + 4t²
         = 14t² + 12t + 9/2.
The minimum of this function occurs at t = −3/7. This yields the point
    (3/2 − 3 · 3/7, −3/2 + 3/7, 2 · 3/7) = (3/14, −15/14, 6/7).
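The arithmetic in this example is easy to double-check with exact rational arithmetic, using only the Python standard library:

```python
from fractions import Fraction

# phi(t) = 14 t^2 + 12 t + 9/2 is minimized at the vertex t = -b/(2a).
a, b = Fraction(14), Fraction(12)
t_min = -b / (2 * a)

# Plug t_min back into the parameterization of the line.
x = Fraction(3, 2) + 3 * t_min
y = Fraction(-3, 2) - t_min
z = -2 * t_min
print(t_min)       # -3/7
print((x, y, z))   # (Fraction(3, 14), Fraction(-15, 14), Fraction(6, 7))
```

This confirms both the minimizer t = −3/7 and the point (3/14, −15/14, 6/7) obtained above.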
to x̄0 ; that is, we force for example ‖x̄0 − x0 ‖ ≤ R0 for some R0 > 0. This
gives the Direction Finding Program (DF P )
Note this is a convex program (it is linear, as a matter of fact) with a Slater
point at 0. The KKT conditions for the (DF P ) program are therefore
necessary and sufficient for optimality.
Notice how much simpler these conditions are: Set g1 (v) = ‖v‖² − R0² ,
hk (v) = ⟨ak , v⟩ for 1 ≤ k ≤ ℓ. Find v ∈ Rd with Av⊺ = 0⊺ , λ1 ≥ 0, µk ∈ R,
1 ≤ k ≤ ℓ so that:
    λ1 (‖v‖² − R0² ) = 0,
    ∇f (x0 ) + 2λ1 v + ∑_{k=1}^{ℓ} µk ak = 0.
Example 5.2. Let’s illustrate this technique with the running example
5.1. Assume the initial guess is the feasible point (3/2, −3/2, 0). The corre-
sponding (DF P ) for a direction search with unit vectors reads as follows:
λ1 (‖v‖² − 1) = 0,
2λ1 v + µ1 [1, 1, 1] + µ2 [1, −1, 2] = [−3, 3, 0].
and thus, writing A = [ 1  1  1 ; 1  −1  2 ] for the constraint matrix,
    [µ1 ; µ2 ] = (A A⊺ )⁻¹ A [3, −3, 0]⊺
               = [ 3/7  −1/7 ; −1/7  3/14 ] [0 ; 6]
               = [ −6/7 ; 9/7 ].
With M = I − A⊺ (A A⊺ )⁻¹ A the projection onto the null space of A,
        [  9/14  −3/14  −3/7 ]
    M = [ −3/14   1/14   1/7 ]
        [ −3/7    1/7    2/7 ],
we obtain
    λ1 = ½ ( [3, −3, 0] M [3, −3, 0]⊺ )^{1/2} = (3/7)√14,
    v0⊺ = −(1/(2λ1 )) M [3, −3, 0]⊺ = −(√14/12) [18/7, −6/7, −12/7]⊺ = (1/√14) [−3, 1, 2]⊺ .
Notice how v0 satisfies all required constraints, and ‖v0 ‖ = 1. This vector is
an optimal solution of (DF P ), and therefore a direction of steepest descent
for (P ) from the point (3/2, −3/2, 0).
We perform now the line search from x0 in this direction:
    ϕ0 (t) = f (x0 + t v0 ) = t² − (6/7)√14 t + 9/2,
    t0 = argmin_{t≥0} ϕ0 (t) = (3/7)√14.
We have then
    x1 = x0 + t0 v0 = (3/2, −3/2, 0) + (3/7)√14 · (1/√14)(−3, 1, 2) = (3/14, −15/14, 6/7).
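The whole projected step can be reproduced in a few lines of numpy. This is a sketch consistent with the numbers above; P below denotes the projector I − A⊺(AA⊺)⁻¹A onto the null space of the constraint matrix A:

```python
import numpy as np

# One projected steepest-descent step for f(x) = ||x||^2 subject to
# x + y + z = 0 and x - y + 2z = 3, starting from x0 = (3/2, -3/2, 0).
A = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 2.0]])
x0 = np.array([1.5, -1.5, 0.0])
grad = 2 * x0                                      # gradient of ||x||^2

P = np.eye(3) - A.T @ np.linalg.solve(A @ A.T, A)  # projector onto null(A)
d = -P @ grad
v0 = d / np.linalg.norm(d)                         # unit descent direction

t0 = -x0 @ v0              # exact line search for ||x0 + t v0||^2
x1 = x0 + t0 * v0
print(v0)                  # (1/sqrt(14)) * [-3, 1, 2], up to rounding
print(x1)                  # [3/14, -15/14, 6/7], up to rounding
```

Note that x1 satisfies both equality constraints by construction, since v0 lies in the null space of A.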
Example 5.3. Let’s illustrate this technique with the running exam-
ple 5.1. Assume once again that the initial guess is the feasible point
(3/2, −3/2, 0). The corresponding program (P ′ ) to search for the Newton-
Raphson direction is
    min_{v∈S} { 9/2 + ⟨[3, −3, 0], v⟩ + v v⊺ },   S = { v ∈ R3 : [ 1  1  1 ; 1  −1  2 ] v⊺ = 0⊺ }.
a1 x1 + · · · + ad xd + s = b, s ≥ 0.
a1 x1 + · · · + ad xd − s = b, s ≥ 0.
Example 5.4. Consider the linear program (LP ) given by the non-standard
formulation
min 3y − 2x
(x,y,z)∈R3
(LP ) : x − 3y + 2z ≤ 3,
2y − x ≥ 2,
y ≥ 0, z ≥ 0
We can easily convert the objective function to a maximum, and the first
two inequality constraints into equality constraints by introducing a slack
variable s1 ≥ 0 and a surplus variable s2 ≥ 0. Notice that the variable x is
We have found two different optimal solutions of the program (LP ) using
the simplex method: (2, 0) and (5/3, 2/3). Notice that in this case, any
other point in the segment joining those two points must also be a solution.
Namely: for any t ∈ [0, 1], the point (2 − (1/3)t, (2/3)t) satisfies
    2(2 − (1/3)t) + (2/3)t = 4             (The first constraint is satisfied)
    (2 − (1/3)t) + 2((2/3)t) = 2 + t ≤ 3   (The second constraint is satisfied)
    2 − (1/3)t ≥ 5/3 > 0                   (The third constraint is satisfied)
    (2/3)t ≥ 0                             (The fourth constraint is satisfied)
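A tiny exact-arithmetic check of the computation above, with the four constraints read off from the displayed inequalities:

```python
from fractions import Fraction

# Every point (2 - t/3, 2t/3), t in [0, 1], on the segment joining the two
# optimal solutions (2, 0) and (5/3, 2/3) remains feasible.
for t in (Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1)):
    x = 2 - t / 3
    y = 2 * t / 3
    assert 2 * x + y == 4            # first constraint (binding on the segment)
    assert x + 2 * y <= 3            # second constraint
    assert x >= Fraction(5, 3)       # x stays >= 5/3 > 0 (third constraint)
    assert y >= 0                    # fourth constraint
print("segment is feasible")
```

Since the binding constraint 2x + y = 4 holds with equality along the whole segment, the objective value is constant there as well.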
Example 5.7. What happens if we are unable to employ Rule 1 from the
simplex method? This situation arises in unbounded programs. Consider
the one below:
max 2x + y
(x,y)∈R2
(LP ) : −x + y ≤ 1
x − 2y ≤ 2
x ≥ 0, y ≥ 0
But notice that at this new stage we are unable to apply Rule 1, since there
are no positive coefficients for x2 . Any pivot operation that we apply using
the second row will change the value of z; in particular, we should be able
to perform enough changes to make z as large as we desire. For instance:
what row operations would you perform to get z = 19, with the feasible
point (8, 3)?
We call the optimization routine linprog with the values of c first (as a list),
then the values of the matrix A (as a list of lists), and the values of b
(as a list). We use the option bounds= to provide a collection of bounds
(min_value, max_value) for each of the relevant variables. For instance, if
we request the first variable x0 to satisfy a ≤ x0 ≤ b, we input (a,b). If
any of the bounds is infinite, we signal it with None. A simple output looks
like this:
>>> linprog(c1, A1, b1, bounds=(x0_bnds, x1_bnds), method='simplex')
fun: -2.3333333333333335
message: 'Optimization terminated successfully.'
nit: 2
slack: array([ 0., 0.])
status: 0
success: True
x: array([ 1.66666667, 0.66666667])
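For reference, here is a complete, self-contained call following the same pattern. The data below are made up for illustration (the text's c1, A1, b1 come from an earlier example not reproduced here); note also that recent versions of scipy default to the 'highs' solver, method='simplex' having been deprecated:

```python
from scipy.optimize import linprog

# Hypothetical data: minimize -2x + 3y subject to
#   x + y <= 4,  x - y <= 2,  x >= 0,  y >= 0.
c = [-2.0, 3.0]
A = [[1.0, 1.0],
     [1.0, -1.0]]
b = [4.0, 2.0]
x0_bnds = (0, None)   # 0 <= x, no upper bound
x1_bnds = (0, None)   # 0 <= y, no upper bound
res = linprog(c, A_ub=A, b_ub=b, bounds=(x0_bnds, x1_bnds))
print(res.x, res.fun)   # optimal point (2, 0) with value -4, up to rounding
```

The returned object also carries res.slack (values of b − Ax at the optimum) and res.status, mirroring the fields shown in the session above.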
Basic Variables: [2 3]
Current Solution:
x = [ 0.0000 0.0000]
Tableau:
[[ 1.0000 0.5000 0.5000 0.0000 2.0000]
[ 0.0000 1.5000 -0.5000 1.0000 1.0000]
[ 0.0000 -0.5000 0.5000 0.0000 2.0000]]
Basic Variables: [0 3]
Current Solution:
x = [ 2.0000 0.0000]
Tableau:
[[ 1.0000 0.0000 0.6667 -0.3333 1.6667]
[ 0.0000 1.0000 -0.3333 0.6667 0.6667]
[ 0.0000 0.0000 0.3333 0.3333 2.3333]]
Basic Variables: [0 1]
Current Solution:
x = [ 1.6667 0.6667]
Once an optimal solution x̄0 of (LP0 ) has been obtained, a line-search is per-
formed on the segment joining x0 with x̄0 (which by hypothesis is contained
in the feasibility region S):
    t0 = argmin_{0≤t≤1} f (x0 + t(x̄0 − x0 )).
LB = max(−∞, ξ0 ) = −9.
At this point the stopping criterion gives |U B − LB| = |13 − (−9)| = 22.
We proceed to the second iteration step, but we update the upper bound
first: U B = f (x1 , y1 ) = f (3/2, 0) = 25/4.
We need the approximation to f at (3/2, 0):
    L1 (x, y) = f (3/2, 0) + ⟨∇f (3/2, 0), (x − 3/2, y)⟩
              = 25/4 + ⟨[−3, −4], (x − 3/2, y)⟩
              = 25/4 − 3(x − 3/2) − 4y
After applying the simplex method to this program, we find that the solution
is the point (x̄1 , ȳ1 ) = (0, 3/2), with ξ1 = L1 (0, 3/2) = 19/4.
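A one-line check of the reported value ξ1 = L1 (0, 3/2) = 19/4, using exact rationals:

```python
from fractions import Fraction

# L1(x, y) = 25/4 - 3(x - 3/2) - 4y, the linearization computed above.
def L1(x, y):
    return Fraction(25, 4) - 3 * (x - Fraction(3, 2)) - 4 * y

xi1 = L1(0, Fraction(3, 2))
print(xi1)   # 19/4
```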
The solution is again the point (x̄2 , ȳ2 ) = (5/4, 1/4), with ξ2 = L2 (5/4, 1/4) =
49/8. No further computations are needed at this point to realize that
(x3 , y3 ) = (x̄2 , ȳ2 ) = (x2 , y2 ) = (5/4, 1/4). Notice that |U B − LB| = 0, and
the stopping criterion has been satisfied as expected.
The solution of the program (P ) is precisely this point.
Exercises
Problem 5.1 (Basic). Use a projection method to find the global mini-
mum of the function f (x, y, z) = x2 + 2y 2 + 3z 2 over the plane x + y + z = 1.
Problem 5.2 (Basic). Use a projection method to find the global min-
imum of the function f (x, y, z, t) = x2 + 2y 2 + 2z 2 + t2 over the plane
S = {(x, y, z, t) ∈ R4 : x + y + z = t, x + y = 4}.
Problem 5.3 (Basic). Solve the following linear program by the simplex
method:
max 4x + y − z
(x,y,z)∈R3
(LP ) : x + 3z ≤ 6
3x + y + 3z ≤ 9
x ≥ 0, y ≥ 0, z ≥ 0
Jacobian, 12
kernel, 11
Karush-Kuhn-Tucker
    conditions, 67
    multipliers, 67
Program, 63
    consistent, 63
    convex, 63
    Direction Finding, 75
    linear, 63
    super-consistent, 63
APPENDIX A
Rates of Convergence
1. Function operations
Observe how easily we can perform all of the following:
Function evaluation: with the method
.subs({variable1: value1, variable2: value2, ...})
Limits: with the function limit(object, variable, value).
Basic operations: with the usual operators for addition, subtraction, mul-
tiplication and division.
Composition: again with the method .subs().
>>> f.subs({x: pi}) # f (π)
0
>>> f.subs({x: 0}) # f (0) --- returns "not a number"
nan
>>> limit(f, x, 0) # Compute limx→0 f (x) instead
1
>>> (f.subs({x: x+h}) - f)/h # A difference quotient...
(sin(h + x)/(h + x) - sin(x)/x)/h
>>> limit( (f.subs({x: x+h}) - f)/h, h, 0) # ... and its limit as h → 0
(x*cos(x) - sin(x))/x**2
12*b*x**2 - 4*b*y + 2
>>> Delta1 = _ # Store that value as 'Delta1'
>>> hessian.det() # Compute the determinant of the Hessian
-16*b**2*x**2 + 2*b*(8*b*x**2 - 4*b*(-x**2 + y) + 2)
>>> Delta2 = simplify(_) # Store that value as 'Delta2'
It is then a simple task (in some cases) to search for critical points by
solving symbolically ∇f = 0, and checking whether they are local maxima,
local minima or saddle points.
>>> solve(gradient, [x,y]) # Critical points of R
[(a, a**2)]
>>> crit_points = _ # This is a list. We call it 'crit_points'
>>> for point in crit_points:
... x0,y0 = point
... print(point)
... print("Delta1 = ", Delta1.subs({x:x0, y:y0}))
... print("Delta2 = ", Delta2.subs({x:x0, y:y0}))
...
(a, a**2)
Delta1 = 8*a**2*b + 2
Delta2 = 4*b
>>> 8*a**2*b + 2 > 0 # Is Delta1 > 0? (remember a,b>0)
True
>>> 4*b > 0 # Is Delta2 > 0?
True
The conclusion after this small session is that any Rosenbrock function
R(x, y) = (a − x)2 + b(y − x2 )2 has a global minimum at the point (a, a2 ).
A word of warning. Symbolic differentiation and manipulation of expres-
sions may not work in certain cases. For those, numerical approximation is
more suited (and incidentally, that is the reason you are taking this course).
>>> solve(f.diff(x))
NotImplementedError: multiple generators [x, tan(x/2)]
No algorithms are implemented to solve equation
x**2*(-tan(x/2)**2 + 1)/(tan(x/2)**2 + 1) - 2*x*tan(x/2)/(tan(x/2)**2 + 1)
3. Integration
Symbolic integration for the computation of antiderivatives is also possible.
While the symbolic setting allows the computation of definite integrals in
many cases, they are preferably done in a numerical setting.
>>> R.integrate(x) # ∫ R(x, y) dx
-a*x**2 + b*x**5/5 + x**3*(-2*b*y/3 + 1/3) + x*(a**2 + b*y**2)
>>> R.integrate(y) # ∫ R(x, y) dy
-b*x**2*y**2 + b*y**3/3 + y*(a**2 - 2*a*x + b*x**4 + x**2)
>>> R.integrate(x, (x, 0, 1)).integrate(y, (y, 0, 1)) # ∫₀¹ ∫₀¹ R(x, y) dx dy
a**2/4 - a/6 + 11*b/360 + 1/24
>>> f.integrate(x) # ∫ sin(x)/x dx
Si(x)
>>> f.integrate(x, (x, 0, pi)) # ∫₀^π sin(x)/x dx
-2 + pi*Si(pi)
>>> _.evalf() # How much is that, actually?
3.81803183741885
4. Sequences, series
1. matplotlib
The matplotlib libraries are fundamentally focused on 2D plotting.
They are open source with license based on the Python Software Foundation
(PSF) license. If you are planning to use them for your scientific production,
it is customary to cite John Hunter’s 2007 seminal paper [12].
In this section we are going to explore just a handful of utilities:
• The module pyplot, that allows us to use a similar syntax and
interface as in matlab or octave
• The toolkit mplot3d to extend matplotlib for simple 3D plotting.
• The toolkits basemap and cartopy to extend matplotlib for
projections and geographic mapping.
• The toolkit ggplot for those of you familiar with the R plotting
system.
Let’s start with a simple example or usage of the module pyplot. We
are going to use exclusively the Rosenbrock function R1,1 .
import numpy as np, matplotlib.pyplot as plt

def R(x,y): return (1.0-x)**2 + (y - x**2)**2
For each plot, we usually indicate in our session the intent to create a
figure. At that point it is customary to impose the size of the figure, number
of subplots (if more than one), kind of axis, usage of grid, etc. This is what
we call the layout of our diagrams. For instance, to create a simple plot
(with size 5 × 10.5) of the graph of f (x) = R(x, 1) for −2 ≤ x ≤ 2, but
focusing only on the window [−2.5, 2.5] × [−0.5, 10], we issue the following
commands:
>>> x = np.linspace(-2, 2) # −2 ≤ x ≤ 2
Figure C.2. Tinkering with color, style and width of lines in pyplot
lines R1,1 (x, y) = c for c = 1, 2, 3, 5, 10, 15, 20 over the window [−2, 2] ×
[−2, 3], we could issue the following commands:
>>> y = np.linspace(-2,3) # −2 ≤ y ≤ 3
>>> X,Y = np.meshgrid(x,y) # generate the window
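The listing breaks off at this point. A plausible completion (assumed, not part of the original notes) that draws the requested level lines and saves the figure:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

def R(x, y): return (1.0 - x)**2 + (y - x**2)**2   # Rosenbrock R_{1,1}

x = np.linspace(-2, 2)           # -2 <= x <= 2
y = np.linspace(-2, 3)           # -2 <= y <= 3
X, Y = np.meshgrid(x, y)         # generate the window
plt.contour(X, Y, R(X, Y), levels=[1, 2, 3, 5, 10, 15, 20])
plt.savefig("level_lines.png")
```

In an interactive session one would call plt.show() instead of plt.savefig(); the file name level_lines.png is chosen here only for the sake of the sketch.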