
Course notes for MATH 524: Non-Linear Optimization

Francisco Blanco-Silva

Department of Mathematics, University of South Carolina


E-mail address: blanco@math.sc.edu
URL: people.math.sc.edu/blanco
2010 Mathematics Subject Classification. 49M, 65K, 90C

This version of the notes was completed on November 22, 2017.


Contents

List of Figures
Chapter 1. Review of Optimization from Vector Calculus
The Theory of Optimization
Exercises
Chapter 2. Existence and Characterization of Extrema for Unconstrained Optimization
1. Anatomy of a function
2. Existence results
3. Characterization results
Examples
Exercises
Chapter 3. Numerical Approximation for Unconstrained Optimization
1. Newton-Raphson's Method
2. Secant Methods
3. The Method of Steepest Descent
4. Effective Algorithms for Unconstrained Optimization
Exercises
Chapter 4. Existence and Characterization of Extrema for Constrained Optimization
1. Necessary Conditions
2. Sufficient Conditions
Key Examples
Exercises
Chapter 5. Numerical Approximation for Constrained Optimization
1. Projection Methods for Linear Equality Constrained Programs
2. Linear Programming: The simplex method
3. The Frank-Wolfe Method
Exercises
Index
Bibliography
Appendix A. Rates of Convergence
Appendix B. Basic sympy commands for Calculus
1. Function operations
2. Derivatives, Gradients, Hessians
3. Integration
4. Sequences, series
5. Power series, series expansions
Appendix C. Basic graphing in Python
1. matplotlib
List of Figures

1.1 Details of the graph of R1,1
1.2 Global minima in unbounded domains
1.3 Contour plots for problem 1.4
2.1 Detail of the graph of W0.5,7
2.2 Convex sets.
2.3 Convex Functions.
3.1 Newton-Raphson iterative method
3.2 Initial guess must be carefully chosen in Newton-Raphson
3.3 Newton-Raphson fails for some functions
3.4 Newton-Raphson method
3.5 Secant iterative method
3.6 The Method of Steepest Descent: Polynomial function
3.7 The Method of Steepest Descent: Rosenbrock Function
3.8 The BFGS method: Rosenbrock function
3.9 Newton method in desmos.com
4.1 Can you tell what are the global maximum and minimum values of f in S?
4.2 Cones for (0, 0), (−1, −1) and (0, −1/2).
4.3 Feasibility region for (P) in example 4.4
5.1 Illustration of the simplex method for Example 5.4
5.2 Set up for Example 5.9
5.3 Frank-Wolfe iteration to solve the program (P) in Example 5.9.
C.1 Basic rendering of functions with pyplot
C.2 Tinkering with color, style and width of lines in pyplot
C.3 Contour plots with pyplot
CHAPTER 1

Review of Optimization from Vector Calculus

The starting point of these notes is the concept of optimization as developed in a course of Vector Calculus (see e.g. [6, Chapter 14]).
Definition. If f(x, y) is differentiable in an open region containing the point (x0, y0), we define the gradient vector of f(x, y) at (x0, y0) as the vector

∇f(x0, y0) = ( ∂f/∂x (x0, y0), ∂f/∂y (x0, y0) ).

Given any vector v = [v1, v2] with ‖v‖ = (v1² + v2²)^(1/2) = 1 (what we call a unit vector or a direction), we define the directional derivative of f in the direction v at (x0, y0) by

Dv f(x0, y0) = ⟨∇f(x0, y0), v⟩ = v1 ∂f/∂x (x0, y0) + v2 ∂f/∂y (x0, y0).
Remark 1.1. The gradient has many interesting properties. Assume f(x, y) is a differentiable function.
Fastest Increase: At any point (x, y), the function f increases most rapidly in the direction of the gradient vector v = ∇f(x, y). The derivative in that direction is Dv f(x, y) = ‖∇f(x, y)‖.
Fastest Decrease: At any point (x, y), the function f decreases most rapidly in the direction v = −∇f(x, y). The derivative in that direction is Dv f(x, y) = −‖∇f(x, y)‖.
Zero Change: Any direction v perpendicular to a non-zero gradient is a direction of zero change in f at (x, y): Dv f(x, y) = 0.
Tangents to Level Curves: At every point (x, y) in the domain of f, the gradient ∇f(x, y) is perpendicular to the level curve through (x, y).
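These properties are easy to experiment with in sympy, the CAS used in Appendix B. The snippet below is a small sketch on a made-up test function (f = x² + 3y², not one from the notes): it computes the gradient, a directional derivative, and the maximal rate of increase ‖∇f‖.

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**2 + 3*y**2  # a hypothetical test function

# Gradient: the vector of first partial derivatives
grad = sp.Matrix([sp.diff(f, x), sp.diff(f, y)])

# Directional derivative D_v f = <grad f, v> for a unit vector v
v = sp.Matrix([sp.Rational(3, 5), sp.Rational(4, 5)])  # ||v|| = 1
Dv = grad.dot(v)

# Fastest increase at (1, 1): the rate equals ||grad f(1, 1)||
g = grad.subs({x: 1, y: 1})    # [2, 6]
max_rate = sp.sqrt(g.dot(g))   # 2*sqrt(10)
```

Any other unit direction v gives Dv f(1, 1) = ⟨(2, 6), v⟩ ≤ 2√10, with equality exactly when v points along the gradient.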
Definition. Let D ⊆ R² be a region on the plane containing the point (x0, y0). We say that the real-valued function f : D → R has a local minimum at (x0, y0) if f(x0, y0) ≤ f(x, y) for all domain points (x, y) in an open disk centered at (x0, y0). In that case, we also say that f(x0, y0) is a local minimum value of f in D.

Emphasis was placed on finding conditions on the function f that guarantee existence and characterization of minima:

Theorem 1.1. Let D ⊆ R² and let f : D → R be a function for which the first partial derivatives ∂f/∂x and ∂f/∂y exist in D. If (x0, y0) ∈ D is a local minimum of f, then ∇f(x0, y0) = 0.
The local minima of these functions are among the zeros of the equation ∇f(x, y) = 0, the so-called critical points of f. More formally:
Definition. An interior point of the domain of a function f(x, y) where both partial derivatives are zero, or where at least one of the partial derivatives does not exist, is a critical point of f.

We employed the Second Derivative Test for Local Extreme Values to characterize some minima:
Theorem 1.2. Suppose that f : R² → R and its first and second partial derivatives are continuous throughout a disk centered at the point (x0, y0), and that ∇f(x0, y0) = 0. If the two following conditions are satisfied, then f(x0, y0) is a local minimum value:

(1) ∂²f/∂x² (x0, y0) > 0;
(2) det Hess f(x0, y0) > 0, where Hess f(x0, y0) denotes the matrix of second partial derivatives

    Hess f(x0, y0) = [ ∂²f/∂x² (x0, y0)    ∂²f/∂x∂y (x0, y0) ]
                     [ ∂²f/∂y∂x (x0, y0)   ∂²f/∂y² (x0, y0)  ]

Remark 1.2. The restriction of this result to univariate functions is even simpler: suppose f″ is continuous on an open interval that contains x0. If f′(x0) = 0 and f″(x0) > 0, then f has a local minimum at x0.
Example 1.1 (Rosenbrock Functions). Given strictly positive parameters a, b > 0, consider the Rosenbrock function

Ra,b(x, y) = (a − x)² + b(y − x²)².

It is easy to see that Rosenbrock functions are polynomials (prove it!). The domain is therefore the whole plane. Figure 1.1 illustrates a contour plot with several level lines of R1,1 on the domain D = [−2, 2] × [−1, 3], as well as its graph.
It is also easy to verify that the image is the interval [0, ∞). Indeed, note first that Ra,b(x, y) ≥ 0 for all (x, y) ∈ R². Zero is attained: Ra,b(a, a²) = 0. Note also that Ra,b(0, y) = a² + by² is a polynomial of degree 2 in y, therefore unbounded.
Let's locate all local minima:
• The gradient and Hessian are given respectively by

∇Ra,b(x, y) = ( 2(x − a) + 4bx(x² − y), 2b(y − x²) ),

Figure 1.1. Details of the graph of R1,1

Hess Ra,b(x, y) = [ 12bx² − 4by + 2    −4bx ]
                  [ −4bx                2b  ]
• The search for critical points ∇Ra,b = 0 gives only the point (a, a²).
• ∂²Ra,b/∂x² (a, a²) = 8ba² + 2 > 0.
• The Hessian at that point has positive determinant:

det Hess Ra,b(a, a²) = det [ 8ba² + 2, −4ab; −4ab, 2b ] = 4b > 0.

There is only one local minimum at (a, a²).
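These computations can be verified symbolically. The following sketch, in the spirit of Appendix B, keeps a and b as positive symbols; `sp.solve` and `sp.hessian` are standard sympy calls.

```python
import sympy as sp

x, y, a, b = sp.symbols('x y a b', positive=True)
R = (a - x)**2 + b*(y - x**2)**2   # the Rosenbrock function

# Critical points: solve grad R = 0
grad = [sp.diff(R, x), sp.diff(R, y)]
crit = sp.solve(grad, [x, y], dict=True)   # expect only (a, a**2)

# Hessian evaluated at the critical point
H = sp.hessian(R, (x, y)).subs({x: a, y: a**2})
detH = sp.simplify(H.det())                # expect 4*b > 0
```

Both definiteness conditions of Theorem 1.2 come out for free: H[0, 0] simplifies to 8ba² + 2 and detH to 4b, positive for any b > 0.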

The second step was the notion of global (or absolute) minima: points
(x0 , y0 ) that satisfy f (x0 , y0 ) ≤ f (x, y) for any point (x, y) in the domain of
f . We always started with the easier setting, in which we placed restrictions
on the domain of our functions:
Theorem 1.3. A continuous real-valued function always attains its minimum value on a compact set K. If the function is also differentiable in the interior of K, to search for global minima we perform the following steps:
Interior Candidates: List the critical points of f located in the interior of K.
Boundary Candidates: List the points in the boundary of K where f may have minimum values.
Evaluation/Selection: Evaluate f at all candidates and select the one(s) with the smallest value.
Example 1.2. A flat circular plate has the shape of the region x² + y² ≤ 1. The plate, including the boundary, is heated so that the temperature at the point (x, y) is given by f(x, y) = 100(x² + 2y² − x) in Celsius degrees. Find the temperature at the coldest point of the plate.

We start by searching for critical points. The equation ∇f(x, y) = 0 gives x = 1/2, y = 0. The point (1/2, 0) is clearly inside the plate. This is our first candidate.
The border of the plate can be parameterized by ϕ(t) = (cos t, sin t) for t ∈ [0, 2π). The search for minima on the boundary of the plate can then be coded as an optimization problem for the function h(t) = (f ∘ ϕ)(t) = 100(cos² t + 2 sin² t − cos t) on the interval [0, 2π). Note that h′(t) = 100 sin t (2 cos t + 1), so h′(t) = 0 for t ∈ {0, 2π/3, π, 4π/3}. We thus have four more candidates:

ϕ(0) = (1, 0),  ϕ(2π/3) = (−1/2, √3/2),  ϕ(π) = (−1, 0),  ϕ(4π/3) = (−1/2, −√3/2).

Evaluation of the function at all candidates gives us the solution to this problem:

f(1/2, 0) = −25 °C.
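The whole candidate search can be scripted. The sketch below follows the same steps (interior critical points, then the parameterized boundary) and recovers the value −25:

```python
import sympy as sp

x, y, t = sp.symbols('x y t', real=True)
f = 100*(x**2 + 2*y**2 - x)

# Interior candidates: critical points of f
interior = sp.solve([sp.diff(f, x), sp.diff(f, y)], [x, y], dict=True)

# Boundary candidates: restrict f to the circle x = cos t, y = sin t
h = f.subs({x: sp.cos(t), y: sp.sin(t)})
ts = sp.solve(sp.diff(h, t), t)   # zeros of 100 sin t (2 cos t + 1)

# Evaluate every candidate and keep the smallest value
values = [f.subs(s) for s in interior] + [sp.simplify(h.subs(t, t0)) for t0 in ts]
coldest = min(values)
```

Note that `sp.solve` on a trigonometric equation returns one representative per solution family, which is enough here since f(x, y) only depends on y through y².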

In a second setting, we remove the restriction of boundedness on the domain. In this case, global minima will only be guaranteed for very special functions.
Example 1.3. Any polynomial p_n(x) = a_n xⁿ + a_{n−1} x^{n−1} + · · · + a_0 with even degree n ≥ 2 and positive leading coefficient satisfies lim_{|x|→∞} p_n(x) = +∞. To see this, we may write

a_n xⁿ + a_{n−1} x^{n−1} + · · · + a_0 = a_n xⁿ ( 1 + a_{n−1}/(a_n x) + · · · + a_0/(a_n xⁿ) ).

The behavior of each of the factors as the absolute value of x goes to infinity leads to our claim:

lim_{|x|→∞} a_n xⁿ = +∞,
lim_{|x|→∞} ( 1 + a_{n−1}/(a_n x) + · · · + a_0/(a_n xⁿ) ) = 1.

It is clear that a polynomial of this kind must attain a minimum somewhere in its domain. The critical points will lead to it.
Example 1.4. Find the global minima of the function

f(x) = log(x⁴ − 2x² + 2).

Note first that the domain of f is the whole real line, since x⁴ − 2x² + 2 = (x² − 1)² + 1 ≥ 1 for all x ∈ R. Note also that we can write f(x) = (g ∘ h)(x) with g(x) = log(x) and h(x) = x⁴ − 2x² + 2. Since g is one-to-one and increasing, we can focus on h to obtain the requested solution. For instance, lim_{|x|→∞} f(x) = +∞, since lim_{|x|→∞} h(x) = +∞. This guarantees the existence of global minima. To look for them, h again points to the possible locations by solving for its critical points: h′(x) = 4x³ − 4x = 0 gives x ∈ {−1, 0, 1}. Since h(±1) = 1 < 2 = h(0), f attains its minima at x = ±1.
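Since g is increasing, the search reduces to the critical points of h; a quick sympy check, as a sketch:

```python
import sympy as sp

x = sp.symbols('x', real=True)
h = x**4 - 2*x**2 + 2
f = sp.log(h)

# g(x) = log(x) is increasing, so it suffices to minimize h
crit = sp.solve(sp.diff(h, x), x)          # roots of 4x^3 - 4x
min_val = min(f.subs(x, c) for c in crit)  # the global minimum value of f
```

The minimum value is f(±1) = log(1) = 0, while x = 0 yields the larger value log(2).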

We learned other useful characterizations for extrema, when the domain could be expressed as solutions of equations:

Figure 1.2. Global minima in unbounded domains

Theorem 1.4 (Orthogonal Gradient). Suppose f(x, y) is differentiable in a region whose interior contains a smooth curve C : r(t) = (x(t), y(t)). If P0 is a point on C where f has a local extremum relative to its values on C, then ∇f is orthogonal to C at P0.
This result leads to the Method of Lagrange Multipliers.
Theorem 1.5 (Lagrange Multipliers on one constraint). Suppose that f(x, y) and h1(x, y) are differentiable and ∇h1 ≠ 0 when h1(x, y) = 0. To find the local extrema of f subject to the constraint h1(x, y) = 0 (if these exist), find the values of x, y and λ that simultaneously satisfy the equations

∇f = λ∇h1, and h1(x, y) = 0.
Example 1.5. Find the minimum value of the expression 3x + 4y for values of x and y on the circle x² + y² = 1.
We start by modeling this problem to adapt the technique of Lagrange multipliers:

f(x, y) = 3x + 4y (target),  h1(x, y) = x² + y² − 1 (constraint).

Look for the values of x, y and λ that satisfy the equations ∇f = λ∇h1, h1(x, y) = 0:

3 = 2λx,  4 = 2λy,  1 = x² + y².

Equivalently, λ ≠ 0 and x, y satisfy

x = 3/(2λ),  y = 2/λ,  1 = 9/(4λ²) + 4/λ².

These equations lead to λ = ±5/2, and there are only two possible candidates for a minimum. Evaluation of f on those gives that the minimum is attained at the point (−3/5, −4/5).
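The same system can be handed to sympy. This sketch builds the Lagrange equations directly from f and h1:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = 3*x + 4*y
h1 = x**2 + y**2 - 1

# The Lagrange system: grad f = lam * grad h1, plus the constraint
eqs = [sp.diff(f, x) - lam*sp.diff(h1, x),
       sp.diff(f, y) - lam*sp.diff(h1, y),
       h1]
sols = sp.solve(eqs, [x, y, lam], dict=True)

values = [f.subs(s) for s in sols]
minimum = min(values)          # attained at (-3/5, -4/5)
```

The two candidates are ±(3/5, 4/5) with values ±5, so the minimum is −5.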


This method can be extended to more than two dimensions, and more
than one constraint. For instance:

Theorem 1.6 (Lagrange Multipliers on two constraints). Suppose that f(x, y, z), h1(x, y, z), h2(x, y, z) are differentiable with ∇h1 not parallel to ∇h2. To find the local extrema of f subject to the constraints h1(x, y, z) = h2(x, y, z) = 0 (if these exist), find the values of x, y, z, λ1 and λ2 that simultaneously satisfy the equations

∇f = λ1∇h1 + λ2∇h2,  h1(x, y, z) = 0,  h2(x, y, z) = 0.
Example 1.6. The cylinder x² + y² = 1 intersects the plane x + y + z = 1 in an ellipse. Find the points on the ellipse that lie closest to the origin.
We again model this as a Lagrange multipliers problem:

f(x, y, z) = x² + y² + z² (target),
h1(x, y, z) = x² + y² − 1, h2(x, y, z) = x + y + z − 1 (constraints).

The gradient equation ∇f = λ1∇h1 + λ2∇h2 gives

2x = 2λ1x + λ2,  2y = 2λ1y + λ2,  2z = λ2.

These equations are satisfied simultaneously only in two scenarios:
(a) λ1 = 1, λ2 = 0 and z = 0;
(b) λ1 ≠ 1 and x = y = z/(1 − λ1).

Resolving each case we find four candidates:

(1, 0, 0), (0, 1, 0), (√2/2, √2/2, 1 − √2), (−√2/2, −√2/2, 1 + √2).

The first two are our solution.
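A sketch of the same computation in sympy, assuming the polynomial system solves cleanly; the candidates with the smallest value of the target are kept at the end:

```python
import sympy as sp

x, y, z, l1, l2 = sp.symbols('x y z lambda1 lambda2', real=True)
f = x**2 + y**2 + z**2
h1 = x**2 + y**2 - 1
h2 = x + y + z - 1

# grad f = l1 grad h1 + l2 grad h2, plus both constraints
eqs = [sp.diff(f, v) - l1*sp.diff(h1, v) - l2*sp.diff(h2, v) for v in (x, y, z)]
eqs += [h1, h2]
sols = sp.solve(eqs, [x, y, z, l1, l2], dict=True)

values = [sp.simplify(f.subs(s)) for s in sols]
min_val = min(values)
closest = [(s[x], s[y], s[z]) for s in sols
           if sp.simplify(f.subs(s) - min_val) == 0]
```

The four candidates have squared distances 1, 1, 4 − 2√2 and 4 + 2√2, confirming that (1, 0, 0) and (0, 1, 0) are the closest points.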

The Theory of Optimization


The purpose of these notes is the development of a theory to deal with
optimization in a more general setting.
• We start in a Euclidean d–dimensional space with the usual topology based on the distance

d(x, y) = ‖x − y‖ = ⟨x − y, x − y⟩^(1/2) = ( Σ_{k=1}^{d} (xk − yk)² )^(1/2).

For instance, the open ball of radius r > 0 centered at a point x⋆ is the set Br(x⋆) = {x ∈ R^d : ‖x − x⋆‖ < r}.
• Given a real-valued function f : D → R on a domain D ⊆ R^d, we define the concepts of extrema and extreme values:
Definition. Given a real-valued function f : D → R on a domain D ⊆ R^d, we say that a point x⋆ ∈ D is a:
global minimum: f(x⋆) ≤ f(x) for all x ∈ D.
global maximum: f(x⋆) ≥ f(x) for all x ∈ D.
strict global minimum: f(x⋆) < f(x) for all x ∈ D \ {x⋆}.

strict global maximum: f(x⋆) > f(x) for all x ∈ D \ {x⋆}.
local minimum: There exists δ > 0 so that f(x⋆) ≤ f(x) for all x ∈ Bδ(x⋆) ∩ D.
local maximum: There exists δ > 0 so that f(x⋆) ≥ f(x) for all x ∈ Bδ(x⋆) ∩ D.
strict local minimum: There exists δ > 0 so that f(x⋆) < f(x) for all x ∈ Bδ(x⋆) ∩ D, x ≠ x⋆.
strict local maximum: There exists δ > 0 so that f(x⋆) > f(x) for all x ∈ Bδ(x⋆) ∩ D, x ≠ x⋆.
In this setting, the objective of optimization is the search for extrema in the
following two scenarios:
Unconstrained Optimization: if D is an open set (usually the whole
space Rd ).
Constrained Optimization: if D can be described as a set of constraints:
x ∈ D if there exist m, n ∈ N and functions gk : Rd → R (1 ≤ k ≤
m), hj : Rd → R (1 ≤ j ≤ n) so that
gk (x) ≤ 0 (1 ≤ k ≤ m)
hj (x) = 0 (1 ≤ j ≤ n)

For each of these problems, we follow a similar program:


Existence of extrema: Establish results that guarantee the existence of extrema depending on the properties of D and f.
Characterization of extrema: Establish results that describe conditions for points x ∈ D to be extrema of f.
Tracking extrema: Design robust numerical algorithms that find the extrema for scientific computing purposes.
The development of existence of solutions of any optimization problem, as well as the characterization results for unconstrained optimization, will be covered in Chapter 2. The design of algorithms to track extrema in the unconstrained setting will be covered in Chapter 3. Chapter 4 is devoted to characterization results for constrained optimization, and Chapter 5 to the design of algorithms in that setting.

Exercises
Problem 1.1 (Advanced). State and prove statements similar to those in Definition 1 and Theorems 1.1, 1.2 and 1.3, but for local and global maxima.
Problem 1.2 (Basic). Find and sketch the domain of the following functions.
(a) f(x, y) = √(y − x − 2)
(b) f(x, y) = log(x² + y² − 4)
(c) f(x, y) = (x − 1)(y + 2) / ((y − x)(y − x³))
(d) f(x, y) = log(xy + x − y − 1)


Problem 1.3 (Basic). Find and sketch the level lines f(x, y) = c on the same set of coordinate axes for the given values of c.
(a) f(x, y) = x + y − 1, c ∈ {−3, −2, −1, 0, 1, 2, 3}.
(b) f(x, y) = x² + y², c ∈ {0, 1, 4, 9, 16, 25}.
(c) f(x, y) = xy, c ∈ {−9, −4, −1, 0, 1, 4, 9}.
Problem 1.4 (CAS). Use a Computer Algebra System of your choice to produce contour plots of the given functions on the given domains.
(a) f(x, y) = (cos x)(cos y) e^(−√(x² + y²)/4) on [−2π, 2π] × [−2π, 2π].
(b) g(x, y) = xy(x² − y²)/(x² + y²) on [−1, 1] × [−1, 1].
(c) h(x, y) = y² − y⁴ − x² on [−1, 1] × [−1, 1].
(d) k(x, y) = e^(−y) cos x on [−2π, 2π] × [−2, 0].

Figure 1.3. Contour plots for problem 1.4

Problem 1.5 (Basic). Sketch the curve f(x, y) = c together with ∇f and the tangent line at the given point. Write an equation for the tangent line.
(a) f(x, y) = x² + y², c = 4, (√2, √2).
(b) f(x, y) = x² − y, c = 1, (√2, 1).
(c) f(x, y) = xy, c = −1, (2, −1/2).

(d) f(x, y) = x² − xy + y², c = 7, (−1, 2).


Problem 1.6 (Basic). For the function

f(x, y) = (x − y)/(x + y),

at the point P0 = (−1/2, 3/2), find the directions v and the directional derivatives Dv f(P0) for which
(a) Dv f(P0) is largest.
(b) Dv f(P0) is smallest.
(c) Dv f(P0) = 0.
(d) Dv f(P0) = 1.
(e) Dv f(P0) = −2.
Problem 1.7 (Intermediate). The derivative of f(x, y) at (1, 2) in the direction (√2/2)[1, 1] is 2√2 and in the direction [0, −1] is −3. What is the derivative of f in the direction (√5/5)[−1, −2]?
Problem 1.8 (Intermediate). Find the absolute maxima and minima of the function f(x, y) = (4x − x²) cos y on the rectangular plate 1 ≤ x ≤ 3, −π/4 ≤ y ≤ π/4.
Problem 1.9 (Basic). Find two numbers a ≤ b such that

∫_a^b (24 − 2x − x²)^(1/3) dx

has its largest value.
Problem 1.10 (Basic). Find the points of the hyperbolic cylinder x² − z² − 1 = 0 in R³ that are closest to the origin.
Problem 1.11 (Intermediate). Find the extreme values of the function f(x, y, z) = xy + z² on the circle in which the plane y − x = 0 intersects the sphere x² + y² + z² = 4.
Problem 1.12 (CAS). Write a routine (in your favorite CAS) that uses symbolic computation¹ to find the minimum of a differentiable real-valued function f : R → R over
(a) a closed interval [a, b];
(b) an interval of the form [a, ∞) or (−∞, b].
The routine should accept as input:
• the expression of the function f,
• the endpoints a, b.

¹See Appendix B
CHAPTER 2

Existence and Characterization of Extrema for Unconstrained Optimization

In this chapter we will study different properties of functions and domains that guarantee existence of extrema for unconstrained optimization. Once we have them, we explore characterization of those points. We start with a reminder of the definition of continuous and differentiable functions, and then we proceed to introduce other functions with advantageous properties for optimization purposes.

1. Anatomy of a function
1.1. Continuity and Differentiability.
Definition. We say that a function f : R^d1 → R^d2 is continuous at a point x⋆ ∈ R^d1 if for all ε > 0 there exists δ > 0 so that for all x ∈ R^d1 satisfying ‖x − x⋆‖ < δ, we have ‖f(x) − f(x⋆)‖ < ε.
Example 2.1. Let f : R² → R be given by

f(x, y) = 2xy/(x² + y²) if (x, y) ≠ (0, 0),  f(0, 0) = 0.

This function is trivially continuous at any point (x, y) ≠ (0, 0). However, it fails to be continuous at the origin. Notice how we obtain different values as we approach (0, 0) through different generic lines y = mx with m ∈ R:

lim_{x→0} f(x, mx) = lim_{x→0} 2mx²/((1 + m²)x²) = 2m/(1 + m²).
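We can let sympy do the algebra along the lines y = mx; restricting f to the line and cancelling the common factor x² makes the dependence on the slope explicit:

```python
import sympy as sp

x, y, m = sp.symbols('x y m', real=True)
f = 2*x*y / (x**2 + y**2)

# Restrict f to the line y = m*x; the x's cancel and the value
# depends only on the slope m, so no single limit exists at (0, 0)
on_line = sp.cancel(f.subs(y, m*x))   # -> 2*m/(m**2 + 1)
lim_flat = on_line.subs(m, 0)         # along the x-axis: 0
lim_diag = on_line.subs(m, 1)         # along y = x: 1
```

Two different approach directions produce two different limiting values, which is exactly the failure of continuity at the origin.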
Definition. A function T : Rd1 → Rd2 is said to be a linear map (or a
linear transformation) if it satisfies
T (x + λy) = T (x) + λT (y) for all x, y ∈ Rd1 , λ ∈ R.
The kernel and image of a linear map are respectively given by
ker T = {x ∈ Rd1 : T (x) = 0},
im T = {y ∈ Rd2 : there exists x ∈ Rd1 so that y = T (x)}.

Remark 2.1. For each real-valued linear map T : R^d → R there exists a ∈ R^d so that T(x) = ⟨a, x⟩ for all x ∈ R^d.

The kernel of a linear map in this case has a very simple expression:

ker T = ker⟨a, ·⟩ = {(x1, x2, . . . , xd) ∈ R^d : a1x1 + a2x2 + · · · + adxd = 0}.

The graph of a real-valued linear function can be identified with a hyperplane in R^(d+1):

Graph T = {(x, y) ∈ R^(d+1) : y = T(x)}
        = {(x1, x2, . . . , xd, y) ∈ R^(d+1) : a1x1 + a2x2 + · · · + adxd = y}
        = ker⟨[a1, a2, . . . , ad, −1], ·⟩.
Remark 2.2. For each linear map T : R^d1 → R^d2 there exists a matrix A = (aij) of size d2 × d1 so that T(x)ᵀ = A · xᵀ; written out coordinate by coordinate,

y1 = a11 x1 + · · · + a1,d1 xd1
y2 = a21 x1 + · · · + a2,d1 xd1
    ⋮
yd2 = ad2,1 x1 + · · · + ad2,d1 xd1

Definition. A function f : R^d1 → R^d2 is said to be differentiable at x⋆ if there exists a linear map J : R^d1 → R^d2 so that

lim_{h→0} ‖f(x⋆ + h) − f(x⋆) − J(h)‖ / ‖h‖ = 0.
Example 2.2. Consider a real-valued function f : R → R of a real variable. To prove differentiability at a point x⋆, we need a linear map: J(h) = ah for some a ∈ R. Notice how in that case,

|f(x⋆ + h) − f(x⋆) − J(h)| / |h| = | (f(x⋆ + h) − f(x⋆))/h − a |;

therefore, we could pick a = lim_{h→0} h⁻¹(f(x⋆ + h) − f(x⋆)); this is the definition of derivative we learned in Calculus: a = f′(x⋆).


Remark 2.3. If a function f : R^d1 → R^d2 is differentiable at x⋆, then all of the partial derivatives exist at x⋆, and the linear map J is given by multiplication by the Jacobian matrix of partial derivatives, evaluated at x⋆:

J(h)ᵀ = [ ∂f1/∂x1    ∂f1/∂x2    · · ·   ∂f1/∂xd1  ] [ h1  ]
        [ ∂f2/∂x1    ∂f2/∂x2    · · ·   ∂f2/∂xd1  ] [ h2  ]
        [    ⋮           ⋮        ⋱         ⋮      ] [  ⋮  ]
        [ ∂fd2/∂x1   ∂fd2/∂x2   · · ·   ∂fd2/∂xd1 ] [ hd1 ]

The converse is not true in general: the existence of partial derivatives (or even of all the directional derivatives) does not guarantee that a function is differentiable at a point. For instance, the function f : R² → R given by

f(x, y) = y³/(x² + y²) if (x, y) ≠ (0, 0),  f(0, 0) = 0,

is not differentiable at (0, 0), although all partial derivatives and all directional derivatives exist at that point.

A friendly version of the differentiability of real-valued functions comes with the next result (see, e.g. [6, p. 818]).

Theorem 2.1. If the partial derivatives ∂f/∂x1, . . . , ∂f/∂xd of a real-valued function f : R^d → R are continuous on an open region G ⊆ R^d, then f is differentiable at every point of G.


Example 2.3. Let f : R^d → R. To prove that f is differentiable at a point x⋆ ∈ R^d we need a linear map J(h) = ⟨a, h⟩ for some a ∈ R^d. Under the conditions of Theorem 2.1 we may use

a = ∇f(x⋆) = ( ∂f(x⋆)/∂x1, . . . , ∂f(x⋆)/∂xd ),

if we are able to prove that all partial derivatives are continuous in an open set containing x⋆.

It is a simple task to prove that all differentiable functions are continuous. Is it true that all continuous functions are differentiable?
Example 2.4 (Weierstrass Function). For any positive real numbers a, b satisfying 0 < a < 1 < b and ab ≥ 1, consider the Weierstrass function Wa,b : R → R given by

Wa,b(x) = Σ_{n=0}^{∞} aⁿ cos(bⁿπx).

This function is continuous everywhere, yet nowhere differentiable! (see Figure 2.1). For a proof, see e.g. [11].

Figure 2.1. Detail of the graph of W0.5,7



A few more useful results about higher order derivatives follow:

Theorem 2.2 (Clairaut). If f : R^d → R and its partial derivatives of orders 1 and 2, ∂f/∂xk and ∂²f/∂xk∂xj (1 ≤ k, j ≤ d), are defined throughout an open region containing the point x⋆, and are all continuous at x⋆, then

∂²f(x⋆)/∂xk∂xj = ∂²f(x⋆)/∂xj∂xk  (1 ≤ k, j ≤ d).
Definition (Hessian). Given a twice-differentiable function f : R^d → R, we define the Hessian of f at x to be the following matrix of second partial derivatives (all entries evaluated at x):

Hess f(x) = [ ∂²f/∂x1²     ∂²f/∂x1∂x2   · · ·   ∂²f/∂x1∂xd ]
            [ ∂²f/∂x2∂x1   ∂²f/∂x2²     · · ·   ∂²f/∂x2∂xd ]
            [     ⋮             ⋮         ⋱         ⋮       ]
            [ ∂²f/∂xd∂x1   ∂²f/∂xd∂x2   · · ·   ∂²f/∂xd²   ]

Functions that satisfy the conditions of Theorem 2.2 have symmetric Hessians. We shall need some properties of symmetric matrices.
Definition. Given a symmetric d × d matrix A = (aij), we define its associated quadratic form as the function QA : R^d → R given by

QA(x) = x A xᵀ = Σ_{i,j=1}^{d} aij xi xj,

where x = (x1, . . . , xd) is treated as a row vector.
We say that a symmetric matrix is:
positive definite: if QA (x) > 0 for all x ∈ Rd \ {0}.
positive semidefinite: if QA (x) ≥ 0 for all x ∈ Rd .
negative definite: if QA (x) < 0 for all x ∈ Rd \ {0}.
negative semidefinite: if QA (x) ≤ 0 for all x ∈ Rd .
indefinite: if there exist x, y ∈ Rd so that QA (x)QA (y) < 0.
Example 2.5. Let A be the 3 × 3 symmetric matrix

A = [  2  −1  2 ]
    [ −1   3  0 ]
    [  2   0  5 ]

The associated quadratic form is given by

QA(x, y, z) = [x y z] [  2  −1  2 ] [x]
                      [ −1   3  0 ] [y]
                      [  2   0  5 ] [z]
            = x(2x − y + 2z) + y(−x + 3y) + z(2x + 5z)
            = 2x² + 3y² + 5z² − 2xy + 4xz.
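The expansion can be double-checked numerically; this sketch uses NumPy and a hypothetical helper `quad_form` (a name not used in the notes) for Q_A:

```python
import numpy as np

A = np.array([[2, -1, 2],
              [-1, 3, 0],
              [2, 0, 5]], dtype=float)

def quad_form(A, v):
    """Q_A(v) = v A v^T, with v treated as a row vector."""
    v = np.asarray(v, dtype=float)
    return float(v @ A @ v)

# The expanded polynomial 2x^2 + 3y^2 + 5z^2 - 2xy + 4xz
def poly(x, y, z):
    return 2*x**2 + 3*y**2 + 5*z**2 - 2*x*y + 4*x*z
```

For instance, quad_form(A, (1, 2, -1)) and poly(1, 2, -1) both evaluate to 11.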

To easily classify symmetric matrices, we usually employ any of the following three criteria:
Definition. Given a general square d × d matrix A, we define for each 1 ≤ ℓ ≤ d a principal minor of order ℓ as the determinant of the submatrix obtained by deleting d − ℓ rows and the d − ℓ columns with the same indices. There are always (d choose ℓ) principal minors of order ℓ; we write A^d_ℓ for any of them.
Among those principal minors, we define the leading principal minors as those that correspond to the upper left-hand corner ℓ × ℓ submatrix of A. We denote them by ∆ℓ: thus ∆1 = a11, ∆2 is the determinant of the top-left 2 × 2 block, ∆3 that of the top-left 3 × 3 block, and so on.

Theorem 2.3 (Principal Minor Criteria). A symmetric matrix A is:
• Positive definite if and only if ∆ℓ > 0 for all 1 ≤ ℓ ≤ d.
• Negative definite if and only if (−1)^ℓ ∆ℓ > 0 for all 1 ≤ ℓ ≤ d.
• Positive semidefinite if and only if A^d_ℓ ≥ 0 for all principal minors, for all 1 ≤ ℓ ≤ d.
• Negative semidefinite if and only if (−1)^ℓ A^d_ℓ ≥ 0 for all principal minors, for all 1 ≤ ℓ ≤ d.
Example 2.6. The matrix A in Example 2.5 is positive definite:

∆1 = 2 > 0,
∆2 = det [2, −1; −1, 3] = 5 > 0,
∆3 = det A = 2 det [−1, 3; 2, 0] + 5 det [2, −1; −1, 3] = 13 > 0.

Example 2.7. Consider the symmetric matrix A below:

A = [ 1  4  6 ]
    [ 4  2  1 ]
    [ 6  1  6 ]

Computing the leading principal minors, we obtain

∆1 = 1 > 0,
∆2 = det [1, 4; 4, 2] = −14 < 0,
∆3 = det A = −109 < 0.

These leading principal minors do not fit any of the criteria in Theorem 2.3: ∆2 < 0 rules out positive (semi)definiteness, and ∆1 > 0 rules out negative (semi)definiteness. We can conclude that A is indefinite.
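Leading principal minors are straightforward to compute with NumPy; the helper below (a hypothetical name, not from the notes) reproduces the two classifications above:

```python
import numpy as np

def leading_principal_minors(A):
    """[Delta_1, ..., Delta_d] for a d x d matrix A."""
    d = A.shape[0]
    return [np.linalg.det(A[:k, :k]) for k in range(1, d + 1)]

A = np.array([[2, -1, 2], [-1, 3, 0], [2, 0, 5]], dtype=float)
B = np.array([[1, 4, 6], [4, 2, 1], [6, 1, 6]], dtype=float)

minors_A = leading_principal_minors(A)   # ~ [2, 5, 13]: positive definite
minors_B = leading_principal_minors(B)   # ~ [1, -14, -109]: indefinite
```

The determinants come back as floats, so in general they should be compared against the expected integer values with a small tolerance.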
Theorem 2.4 (Eigenvalue Criteria). Given a general square d × d matrix A, consider the function pA : C → C given by pA(λ) = det(A − λI_d). This is a polynomial of degree (at most) d in λ. We call it the characteristic polynomial of A. The roots (in C) of the characteristic polynomial are called the eigenvalues of A. Symmetric matrices enjoy the following properties:
(a) The eigenvalues of a symmetric matrix are all real.
(b) If λ ∈ R is a root of multiplicity n of the characteristic polynomial of a (non-trivial) symmetric matrix, then there exist n linearly independent vectors {x1, x2, . . . , xn} satisfying Axk = λxk (1 ≤ k ≤ n).
(c) If λ1 ≠ λ2 are different roots of the characteristic polynomial of a symmetric matrix, and x1, x2 ∈ R^d satisfy Axk = λk xk (k = 1, 2), then ⟨x1, x2⟩ = 0.
(d) A symmetric matrix is positive definite (resp. negative definite) if and only if all its eigenvalues are positive (resp. negative).
(e) A symmetric matrix is positive semidefinite (resp. negative semidefinite) if and only if all its eigenvalues are non-negative (resp. non-positive).
(f) A symmetric matrix is indefinite if there exist two eigenvalues λ1 ≠ λ2 with different signs.
Example 2.8. Let's compute the eigenvalues of the matrix A in Example 2.5:

det(A − λI) = det [ 2−λ, −1, 2; −1, 3−λ, 0; 2, 0, 5−λ ]
            = 2 det [−1, 2; 3−λ, 0] + (5 − λ) det [2−λ, −1; −1, 3−λ]
            = −4(3 − λ) + (5 − λ)((2 − λ)(3 − λ) − 1)
            = −λ³ + 10λ² − 26λ + 13.

Notice that this polynomial has three real roots: λ1 ≈ 0.653537810432577, λ2 ≈ 3.27775436623747, and λ3 ≈ 6.06870782332996, all of them positive (as we expected).
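Numerically, the same eigenvalues come out of NumPy, either from the matrix itself or from the characteristic polynomial:

```python
import numpy as np

A = np.array([[2, -1, 2], [-1, 3, 0], [2, 0, 5]], dtype=float)

# eigvalsh is tailored to symmetric matrices: real eigenvalues, sorted
eigs = np.linalg.eigvalsh(A)
positive_definite = bool(np.all(eigs > 0))

# Roots of the characteristic polynomial -t^3 + 10 t^2 - 26 t + 13
roots = np.sort(np.roots([-1, 10, -26, 13]).real)
```

Both routes agree to floating-point precision, and all eigenvalues are positive, matching the positive-definiteness found in Example 2.6.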
1.2. Coercive Functions. Another set of functions that plays an important role in optimization is the kind we explored in Example 1.3.
Definition (Coercive functions). A continuous real-valued function f is said to be coercive if for all M > 0 there exists R = R(M) > 0 so that f(x) ≥ M whenever ‖x‖ ≥ R.

Remark 2.4. This is equivalent to the limit condition

lim_{‖x‖→∞} f(x) = +∞.

Example 2.9. We saw in Example 1.3 how even-degree polynomials with positive leading coefficients are coercive, and how this helped guarantee the existence of a minimum.
We must be careful assessing coerciveness of polynomials in higher dimension. Consider for example p2(x, y) = x² − 2xy + y². Note how p2(x, x) = 0 for any x ∈ R, which proves p2 is not coercive.
To see that the polynomial p4(x, y) = x⁴ + y⁴ − 4xy is coercive, we start by factoring the leading terms:

x⁴ + y⁴ − 4xy = (x⁴ + y⁴)( 1 − 4xy/(x⁴ + y⁴) ).

Assume r > 2 is large, and that x² + y² = r². We have then

x⁴ + y⁴ ≥ r⁴/2  (Why?)
|xy| ≤ r²/2  (Why?)

therefore,

|4xy/(x⁴ + y⁴)| ≤ 4/r²,
1 − 4xy/(x⁴ + y⁴) ≥ 1 − 4/r²,
(x⁴ + y⁴)( 1 − 4xy/(x⁴ + y⁴) ) ≥ (r⁴/2)(1 − 4/r²) = r²(r² − 4)/2.

We can then conclude that given M > 0, if x² + y² ≥ 2 + √(4 + 2M), then p4(x, y) ≥ M. This proves p4 is coercive.
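The derived lower bound r²(r² − 4)/2 can be probed numerically. A small tolerance is needed because the bound is attained exactly along the diagonal x = y:

```python
import numpy as np

def p4(x, y):
    return x**4 + y**4 - 4*x*y

# On the circle x^2 + y^2 = r^2 with r > 2, the argument above gives
# p4 >= r^2 (r^2 - 4) / 2.  Sample many directions for a few radii.
theta = np.linspace(0.0, 2*np.pi, 721)
bound_holds = True
for r in (2.5, 5.0, 10.0):
    xs, ys = r*np.cos(theta), r*np.sin(theta)
    lower = r**2 * (r**2 - 4) / 2
    bound_holds &= bool(np.all(p4(xs, ys) >= lower - 1e-9))
```

Near the origin the bound says nothing, and indeed p4 dips below zero there (p4(1, 1) = −2); coercivity is purely a statement about large ‖(x, y)‖.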
1.3. Convex Functions. There is one more kind of function we should explore.
Definition (Convex Sets). A subset C ⊆ Rd is said to be convex if for
every x, y ∈ C, and every λ ∈ [0, 1], the point λy + (1 − λ)x is also in C.

The following result is an interesting characterization of convex sets that allows us to actually construct any convex set from a family of points.
Theorem 2.5. Let C ⊆ Rd be a convex set and let {x1 , x2 , . . . , xn } ⊂ C
be a family of points in C. The convex combinations λ1 x1 +λ2 x2 +· · ·+λn xn
are also in C, provided λk ≥ 0 for all 1 ≤ k ≤ n and λ1 + λ2 + · · · + λn = 1.

Figure 2.2. Convex sets.

Definition (Convex Functions). Given a convex set C ⊆ R^d, we say that a real-valued function f : C → R is convex if

f(λy + (1 − λ)x) ≤ λf(y) + (1 − λ)f(x)  for all x, y ∈ C and all λ ∈ [0, 1].

If instead we have f(λy + (1 − λ)x) < λf(y) + (1 − λ)f(x) for all x ≠ y in C and 0 < λ < 1, we say that the function is strictly convex. A function f is said to be concave (resp. strictly concave) if −f is convex (resp. strictly convex).
Remark 2.5. There is an alternative definition of convex functions using
the concept of epigraph of a function. Given a convex function f : C → R
on a convex set C, the epigraph of f is a set epi(f ) ⊂ Rd+1 defined by
epi(f ) = {(x, y) ∈ Rd+1 : x ∈ C, y ∈ R, f (x) ≤ y}.
The function f is convex if and only if its epigraph is a convex set.
Convex functions have many pleasant properties:
Theorem 2.6. Convex functions are continuous.
Theorem 2.7. Let f : C → R be a real-valued convex function defined on a convex set C ⊆ Rᵈ. If λ₁, . . . , λₙ are nonnegative numbers satisfying λ₁ + · · · + λₙ = 1 and x₁, . . . , xₙ are n different points in C, then
f(λ₁x₁ + · · · + λₙxₙ) ≤ λ₁f(x₁) + · · · + λₙf(xₙ).
1. ANATOMY OF A FUNCTION 19
Theorem 2.8. If f : C → R is a function on a convex set C ⊆ Rᵈ with continuous first partial derivatives on C, then
(a) f is convex if and only if for all x, y ∈ C,
f(x) + ⟨∇f(x), y − x⟩ ≤ f(y).
(b) f is strictly convex if for all x ≠ y ∈ C,
f(x) + ⟨∇f(x), y − x⟩ < f(y).
Remark 2.6. Theorem 2.8 implies that the graph of any (strictly) convex function always lies above the tangent hyperplane at any point of the graph.
Figure 2.3. Convex Functions.
Two more useful characterizations of convex functions follow.
Theorem 2.9. Suppose that f : C → R is a function with second par-
tial derivatives on an open convex set C ⊆ Rd . If the Hessian is positive
semidefinite (resp. positive definite) on C, then f is convex (resp. strictly
convex).
Theorem 2.10. Let C ⊆ Rd be a convex set.
(a) If fₖ : C → R are convex functions for 1 ≤ k ≤ n, then so is the sum f : C → R given by
f(x) = f₁(x) + · · · + fₙ(x).
If at least one of them is strictly convex, then so is f .
(b) If f : C → R is convex (resp. strictly convex) on C, then so is λf
for any λ > 0.
(c) If f : C → R is convex (resp. strictly convex) on C, and g : f (C) →
R is an increasing convex function (resp. strictly increasing convex),
then so is g ◦ f .
(d) If f, g : C → R are convex functions on C, then so is max{f, g}.
Example 2.10. Consider the function f(x, y, z) defined on R³ by
f(x, y, z) = 2x² + y² + z² + 2yz.
Notice that for all (x, y, z) ∈ R³,
Hessf(x, y, z) = [4 0 0; 0 2 2; 0 2 2], ∆₁ = 4 > 0, ∆₂ = 8 > 0, ∆₃ = 0.
By virtue of Theorem 2.9, we infer that the function f is convex, but not strictly convex.
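The Hessian and its leading principal minors in this example can be double-checked symbolically; a minimal sketch, assuming the sympy library is available:

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = 2*x**2 + y**2 + z**2 + 2*y*z

# Hessian of f and its leading principal minors Delta_1, Delta_2, Delta_3
H = sp.hessian(f, (x, y, z))
minors = [H[:k, :k].det() for k in (1, 2, 3)]

print(H)       # Matrix([[4, 0, 0], [0, 2, 2], [0, 2, 2]])
print(minors)  # [4, 8, 0]
```

The vanishing third minor is exactly what signals that the Hessian is positive semidefinite but not positive definite.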
Example 2.11. To prove that f(x, y, z) = e^(x²+y²+z²) is convex, rather than computing the Hessian and checking whether it is positive (semi)definite, it is easier to realize that we can write f = g ∘ h with
g : R → R, g(x) = eˣ, and h : R³ → R, h(x, y, z) = x² + y² + z².
The function g is trivially strictly increasing and convex (since g′(x) = g″(x) = eˣ > 0 for all x ∈ R). The function h is strictly convex, since (by Theorem 2.9)
Hessh(x, y, z) = [2 0 0; 0 2 0; 0 0 2], ∆₁ = 2 > 0, ∆₂ = 4 > 0, ∆₃ = 8 > 0.
By virtue of (c) in Theorem 2.10, we infer that f is strictly convex.
Example 2.12. Set C = {(x, y) ∈ R² : x > 0, y > 0}. Consider the function f : C → R given by
f(x, y) = x² − 4xy + 5y² − log(xy).
Notice we may write f = g + h with g, h : C → R given respectively by g(x, y) = x² − 4xy + 5y² and h(x, y) = −log(xy). Note also that both functions are strictly convex, since for all (x, y) ∈ C:
Hessg(x, y) = [2 −4; −4 10], ∆₁ = 2 > 0, ∆₂ = 4 > 0,
Hessh(x, y) = [x⁻² 0; 0 y⁻²], ∆₁ = x⁻² > 0, ∆₂ = (xy)⁻² > 0.
By virtue of part (a) in Theorem 2.10, we infer that f is strictly convex.
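The defining inequality of convexity can also be spot-checked numerically on random points of C. This is not a proof, just a sanity check; the sampling ranges below are our own choice:

```python
import math
import random

def f(x, y):
    return x**2 - 4*x*y + 5*y**2 - math.log(x * y)

random.seed(1)
for _ in range(1000):
    # two random points of C = {x > 0, y > 0} and a random lambda in [0, 1]
    x1, y1 = random.uniform(0.1, 10.0), random.uniform(0.1, 10.0)
    x2, y2 = random.uniform(0.1, 10.0), random.uniform(0.1, 10.0)
    lam = random.random()
    mid = f(lam*x2 + (1-lam)*x1, lam*y2 + (1-lam)*y1)
    assert mid <= lam * f(x2, y2) + (1 - lam) * f(x1, y1) + 1e-9
```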
We find it useful to study generalizations of convex functions as well. In these notes we are going to focus on two such generalizations: quasi-convexity and pseudo-convexity.
Definition (Quasi-convex functions). Given a convex set C ⊆ Rᵈ, we say that a real-valued function f : C → R is quasi-convex if for all x, y ∈ C and all 0 ≤ λ ≤ 1,
f(λy + (1 − λ)x) ≤ max{f(x), f(y)}.
We say that f is quasi-concave if
f(λy + (1 − λ)x) ≥ min{f(x), f(y)}.
Definition (Pseudo-convex functions). Given a convex set C ⊆ Rᵈ, we say that a real-valued differentiable function f : C → R is pseudo-convex if for all x, y ∈ C satisfying ⟨∇f(x), y − x⟩ ≥ 0, it must be that f(y) ≥ f(x).
Remark 2.7. The three notions are related as follows:
• Differentiable convex functions are pseudo-convex.
• Convex functions are quasi-convex.
• Pseudo-convex functions are quasi-convex.
Quasi-convex functions have a very nice characterization by means of their level sets:
Theorem 2.11. Given a convex set C ⊆ Rᵈ, a real-valued function f : C → R is quasi-convex if and only if its level sets Λₜ(f) = {x ∈ C : f(x) ≤ t} are convex for all t ∈ R.
We are now ready to explore existence and characterization of extrema
in a wide variety of situations.

2. Existence results
2.1. Continuous functions on compact domains. The existence
of global extrema is guaranteed for continuous functions over compact sets
thanks to the following two basic results:
Theorem 2.12 (Bounded Value Theorem). The image f (K) of a con-
tinuous real-valued function f : Rd → R on a compact set K is bounded:
there exists M > 0 so that |f (x)| ≤ M for all x ∈ K.
Theorem 2.13 (Extreme Value Theorem). A continuous real-valued
function f : K → R on a compact set K ⊂ Rd takes on minimum and
maximum values on K.
2.2. Continuous functions on unbounded domains. Extra restric-
tions must be applied to the behavior of f in this case, if we want to guar-
antee the existence of extrema.
Theorem 2.14. Coercive functions always have a global minimum.
Proof. Since f is coercive, there exists r > 0 so that f(x) > f(0) for all x satisfying ‖x‖ > r. On the other hand, consider the closed ball Kᵣ = {x ∈ Rᵈ : ‖x‖ ≤ r}. The continuity of f guarantees a global minimum x⋆ ∈ Kᵣ with f(x⋆) ≤ f(0). It is then f(x⋆) ≤ f(x) for all x ∈ Rᵈ trivially. □

3. Characterization results
Differentiability is key to guarantee characterization of extrema. Critical
points lead the way:
Theorem 2.15 (First order necessary optimality condition for minimiza-
tion). Suppose f : Rd → R is differentiable at x? . If x? is a local minimum,
then ∇f (x? ) = 0.
To be able to classify the extrema of a sufficiently differentiable function, we take into account the behavior of the graph of the function with respect to the tangent hyperplane at the point (x, f(x)). Second derivatives make this process very easy.
Theorem 2.16. Suppose f : Rd → R is coercive and continuously differ-
entiable at a point x? . If x? is a global minimum, then ∇f (x? ) = 0.
Theorem 2.17 (Second order necessary optimality condition for minimization). Suppose that f : Rᵈ → R is twice continuously differentiable at x⋆. If x⋆ is a local minimum (strict or not), then ∇f(x⋆) = 0 and Hessf(x⋆) is positive semidefinite.
Theorem 2.18 (Second order sufficient optimality conditions for minimization). Suppose f : D ⊆ Rᵈ → R is twice continuously differentiable at a point x⋆ in the interior of D and ∇f(x⋆) = 0. Then x⋆ is a:
Local Minimum: if Hessf is positive semidefinite on a neighborhood of x⋆.
Strict Local Minimum: if Hessf(x⋆) is positive definite.
If D = Rᵈ and x⋆ ∈ Rᵈ satisfies ∇f(x⋆) = 0, then x⋆ is a:
Global Minimum: if Hessf(x) is positive semidefinite for all x ∈ Rᵈ.
Strict Global Minimum: if Hessf(x) is positive definite for all x ∈ Rᵈ.
Theorem 2.19. Any local minimum of a convex function f : C → R on
a convex set C ⊆ Rd is also a global minimum. If f is a strictly convex
function, then any local minimum is the unique strict global minimum.
Theorem 2.20. Suppose f : C → R is a convex function with continuous
first partial derivatives on a convex set C ⊆ Rd . Then, any critical point of
f in C is a global minimum of f .

Examples
Example 2.13. Find a global minimum in R³ (if it exists) for the function
f(x, y, z) = e^(x−y) + e^(y−x) + e^(x²) + z².
This function has continuous partial derivatives of any order in R³. Its continuity alone does not guarantee the existence of a global minimum, since the domain is not compact, but we may try our luck with its critical points. Note
∇f(x, y, z) = (e^(x−y) − e^(y−x) + 2xe^(x²), −e^(x−y) + e^(y−x), 2z).
The only critical point is then (0, 0, 0) (Why?). The Hessian at that point is positive definite:
Hessf(0, 0, 0) = [4 −2 0; −2 2 0; 0 0 2], ∆₁ = 4 > 0, ∆₂ = 4 > 0, ∆₃ = 8 > 0.
By Theorem 2.18, f(0, 0, 0) = 3 is a priori a strict local minimum value. To prove that this point is actually a strict global minimum, notice that
Hessf(x, y, z) = [e^(x−y) + e^(y−x) + (4x² + 2)e^(x²), −e^(x−y) − e^(y−x), 0; −e^(x−y) − e^(y−x), e^(x−y) + e^(y−x), 0; 0, 0, 2].
The first principal minor is trivially positive: ∆₁ = e^(x−y) + e^(y−x) + 4x²e^(x²) + 2e^(x²), since it is a sum of three positive terms and one non-negative term. The second principal minor is also positive:
∆₂ = det [e^(x−y) + e^(y−x) + 4x²e^(x²) + 2e^(x²), −e^(x−y) − e^(y−x); −e^(x−y) − e^(y−x), e^(x−y) + e^(y−x)]
 = (e^(x−y) + e^(y−x))² + (e^(x−y) + e^(y−x))(4x²e^(x²) + 2e^(x²)) − (e^(x−y) + e^(y−x))²
 = (e^(x−y) + e^(y−x))(4x²e^(x²) + 2e^(x²)) > 0.
The third principal minor is positive too: ∆₃ = 2∆₂ > 0. We have just proved that Hessf(x, y, z) is positive definite for all (x, y, z) ∈ R³, and thus (0, 0, 0) is a strict global minimum.
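The computations in this example are easy to confirm symbolically; a sketch assuming sympy is available (the is_positive_definite check at the end is sympy's own built-in test):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = sp.exp(x - y) + sp.exp(y - x) + sp.exp(x**2) + z**2

grad = sp.Matrix([sp.diff(f, v) for v in (x, y, z)])
origin = {x: 0, y: 0, z: 0}

# the origin is a critical point, and the Hessian there is positive definite
print(grad.subs(origin).T)      # Matrix([[0, 0, 0]])
H0 = sp.hessian(f, (x, y, z)).subs(origin)
print(H0)                       # Matrix([[4, -2, 0], [-2, 2, 0], [0, 0, 2]])
print(H0.is_positive_definite)  # True
```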
Example 2.14. Find global minima in R² (if they exist) for the function
f(x, y) = e^(x−y) + e^(y−x).
This function also has continuous partial derivatives of any order, but no extremum is guaranteed a priori. Notice that all points (x, y) satisfying y = x are critical. For such points, the corresponding Hessians and eigenvalues are
Hessf(x, x) = [2 −2; −2 2], λ₁ = 4 > 0, λ₂ = 0;
therefore, Hessf(x, x) is positive semidefinite at each critical point. By Theorem 2.18, f(x, x) = 2 is a local minimum value for all x ∈ R. To prove these are global minima, notice that for each (x, y) ∈ R²:
Hessf(x, y) = [e^(x−y) + e^(y−x), −e^(x−y) − e^(y−x); −e^(x−y) − e^(y−x), e^(x−y) + e^(y−x)] = (e^(x−y) + e^(y−x)) [1 −1; −1 1],
λ₁ = 2(e^(x−y) + e^(y−x)) > 0, λ₂ = 0.
The Hessian is positive semidefinite at all points, hence proving that any point on the line y = x is a global minimum of f.

Example 2.15. Find local and global minima in R2 (if they exist) for
the function
f(x, y) = x³ − 12xy + 8y³.
This is a polynomial of degree 3, so we have continuous partial derivatives of any order. It is easy to see that this function has no global minima:
lim_{x→−∞} f(x, 0) = lim_{x→−∞} x³ = −∞.
Let's search instead for local minima. From the equation ∇f(x, y) = 0 we obtain two critical points: (0, 0) and (2, 1). The corresponding Hessians and their eigenvalues are:
Hessf(0, 0) = [0 −12; −12 0], λ₁ = −12 < 0, λ₂ = 12 > 0,
Hessf(2, 1) = [12 −12; −12 48], λ₁ = 30 − 6√13 > 0, λ₂ = 30 + 6√13 > 0.
By Theorems 2.17 and 2.18, we have that f(2, 1) = −8 is a local minimum value, but f(0, 0) = 0 is not.
Example 2.16. Find local and global minima in R2 (if they exist) for
the function
f(x, y) = x⁴ − 4xy + y⁴.
This is a polynomial of degree 4, so we do have continuous partial derivatives of any order. There are three critical points: (0, 0), (−1, −1) and (1, 1). The latter two are both strict local minima (by virtue of Theorem 2.18):
Hessf(−1, −1) = Hessf(1, 1) = [12 −4; −4 12], ∆₁ = 12 > 0, ∆₂ = 128 > 0.
We proved in Example 2.9 that f is coercive. By Theorems 2.14 and 2.16 we have that f(−1, −1) = f(1, 1) = −2 must be the global minimum value.

Exercises
Problem 2.1 (Basic). Consider the function
f(x, y) = (x + y)/(2 + cos x).
At what points (x, y) ∈ R² is this function continuous?
Problem 2.2 (Advanced). Prove the following facts about linear maps T : R^(d₁) → R^(d₂):
(a) For the real-valued case T : Rᵈ → R, there exists a unique vector a ∈ Rᵈ so that T(x) = ⟨a, x⟩ for all x ∈ Rᵈ.
(b) For the general case T : R^(d₁) → R^(d₂), there exists a unique matrix A of size d₂ × d₁ so that T(x)ᵀ = A · xᵀ.
(c) Linear maps are continuous functions.

Problem 2.3 (Intermediate). Give an example of a 2 × 2 symmetric matrix of each kind below:
(a) positive definite,
(b) positive semidefinite,
(c) negative definite,
(d) negative semidefinite,
(e) indefinite.
Problem 2.4 (Basic). [13, p.31, #2] Classify the following matrices according to whether they are positive or negative definite or semidefinite or indefinite:
(a) [1 0 0; 0 3 0; 0 0 5]
(b) [−1 0 0; 0 −3 0; 0 0 −2]
(c) [7 0 0; 0 −8 0; 0 0 5]
(d) [3 1 2; 1 5 3; 2 3 7]
(e) [−4 0 1; 0 −3 2; 1 2 −5]
(f) [2 −4 0; −4 8 0; 0 0 −3]
Problem 2.5 (Basic). [13, p.31, #3] Write the quadratic form Q_A(x) associated with each of the following matrices A:
(a) [−1 2; 2 3]
(b) [1 −1 0; −1 −2 2; 0 2 3]
(c) [2 −3; −3 0]
(d) [−3 1 2; 1 2 −1; 2 −1 4]

Problem 2.6 (Basic). [13, p.32, #4] Write each of the quadratic forms below in the form xAxᵀ for an appropriate symmetric matrix A:
(a) 3x² − xy + 2y².
(b) x² + 2y² − 3z² + 2xy − 4xz + 6yz.
(c) 2x² − 4z² + xy − yz.
Problem 2.7 (Intermediate). Identify which of the following real-valued functions are coercive. Explain the reason.
(a) f(x, y) = √(x² + y²).
(b) f(x, y) = x² + 9y² − 6xy.
(c) f(x, y) = x⁴ − 3xy + y⁴.
(d) Rosenbrock functions R_{a,b}.
Problem 2.8 (Advanced). [13, p.36, #32] Find an example of a continuous, real-valued, non-coercive function f : R² → R that satisfies, for all t ∈ R,
lim_{x→∞} f(x, tx) = lim_{y→∞} f(ty, y) = ∞.

Problem 2.9 (Basic). Let C ⊆ R2 be a convex figure. Given a point


P ∈ C, let n(P ) be the number of chords for which P is a midpoint. For
instance, if C is a disk, any point P ∈ C satisfies n(P ) = 0 (if the point
P is on the circle), n(P ) = ∞ (if the point P is the center of the disk), or
n(P ) = 1 for any other point in the interior of C.
(a) Are there convex sets that contain points P with n(P) = 2? If so, sketch an example.
(b) Are there convex sets that contain points P with n(P) = m for any m ≥ 3? If so, sketch an example for m = 3.
Problem 2.10 (Basic). [13, p.77, #1,2,7] Determine whether the given
functions are convex, concave, strictly convex or strictly concave on the
specified domains:
(a) f (x) = log(x) on (0, ∞).
(b) f (x) = e−x on R.
(c) f (x) = |x| on [−1, 1].
(d) f (x) = |x3 | on R.
(e) f (x, y) = 5x2 + 2xy + y 2 − √x + 2x + 23 on R .
2
(f) f(x, y) = x²/2 + 3y²/2 + 3xy on R².
(g) f(x, y) = 4e^(3x−y) + 5e^(x²+y²) on R².
(h) f (x, y, z) = x1/2 +y 1/3 +z 1/5 on C = {(x, y, z) : x > 0, y > 0, z > 0}.
Problem 2.11 (Intermediate). [13, p.79 #11] Sketch the epigraph of
the following functions
(a) f (x) = ex .
(b) f (x, y) = x2 + y 2 .
Problem 2.12 (Advanced). Prove the Bounded Value and Extreme
Value Theorems (Theorems 2.12 and 2.13).
Problem 2.13 (Intermediate). For the following optimization problems,
state whether existence of a solution is guaranteed:
(a) f (x) = (1 + x)/x over [1, ∞)
(b) f (x) = 1/x over [1, 2)
(c) The piecewise function over [1, 2] given by f(x) = 1/x for 1 ≤ x < 2, and f(2) = 1.
Problem 2.14 (Advanced). State and prove equivalent results to The-
orems 2.15, 2.17 and 2.18 to describe necessary and sufficient conditions for
the characterization of maxima.
Problem 2.15 (Basic). [13, p.32, #7] Use the Principal Minor Criteria
(Theorem 2.3) to determine—if possible—the nature of the critical points
of the following functions:
(a) f (x, y) = x3 + y 3 − 3x − 12y + 20.

(b) f (x, y, z) = 3x2 + 2y 2 + 2z 2 + 2xy + 2yz + 2xz.


(c) f (x, y, z) = x2 + y 2 + z 2 − 4xy.
(d) f (x, y) = x4 + y 4 − x2 − y 2 + 1.
(e) f (x, y) = 12x3 + 36xy − 2y 3 + 9y 2 − 72x + 60y + 5.
Problem 2.16 (Intermediate). [13, p.35 #26] Show that the function
f(x, y, z) = e^(x²+y²+z²) − x⁴ − y⁶ − z⁶
has a global minimum on R³.
Problem 2.17 (Intermediate). [13, p.36 #33] Consider the function
f(x, y) = x³ + e^(3y) − 3xe^(y).
Show that f has exactly one critical point, and that this point is a local
minimum but not a global minimum.
Problem 2.18 (Basic). Let f (x, y) = − log(1 − x − y) − log x − log y.
(a) Find the domain D of f .
(b) Prove that D is a convex set.
(c) Prove that f is strictly convex on D.
(d) Find the strict global minimum.
Problem 2.19 (Basic). [13, p.81 #27] Find local and global minima in R³ (if they exist) for the function
f(x, y, z) = e^(x+z−y) + e^(y−x−z).
CHAPTER 3
Numerical Approximation for Unconstrained Optimization

Although technically any characterization result finds the exact value of the extrema of a function, computationally this is hardly feasible (especially for functions of very high dimension). See the following session based on Problem 2.16 for an example, where we try to find the critical points of the function f(x, y, z) = e^(x²+y²+z²) − x⁴ − y⁶ − z⁶ symbolically in Python with the sympy libraries:
# Importing necessary symbols/libraries/functions
from sympy.abc import x,y,z
from sympy import Matrix, solve, exp
from sympy.tensor.array import derive_by_array

# Description of f, computation of its gradient and Hessian
f = exp(x**2 + y**2 + z**2) - x**4 - y**6 - z**6
gradient = derive_by_array(f, [x,y,z])
hessian = Matrix([derive_by_array(gradient, a) for a in [x,y,z]])

While the correct expressions for ∇f and Hessf are quickly computed, trying
to find critical points results in an error:
>>> solve(gradient) # Search of critical points by solving ∇f = 0
NotImplementedError: could not solve
4*x**2*sqrt(-log(exp(x**2)/(2*x**2))) - 6*(-log(exp(x**2)/(2*x**2)))**(5/2)

The task is too complex to be performed symbolically, although the obvious answer is (0, 0, 0). A better way to approach this is to approximate the minimum using the structure of the graph of f. In these notes we are going to explore several strategies to accomplish this task, based on the concept of iterative methods for finding zeros of real-valued functions.

1. Newton-Raphson’s Method
1.1. Newton-Raphson Method to search for roots of univariate functions. In order to find a good estimation of √2 with many decimal places, we allow a computer to find better and better approximations of the root of the polynomial p(x) = x² − 2. We start with an initial guess, say x₀ = 3. We construct a sequence {xₙ}ₙ∈ℕ that converges to √2 as follows:

(a) Find the tangent line to the graph of p at x₀:
y = p(x₀) + p′(x₀)(x − x₀).
Intuitively, we substitute the original function p by the approximation given by the derivative. It is much easier to seek roots of this simpler function, and that is what we do to approximate the roots of p.
(b) Provided the corresponding line is not horizontal (p′(x₀) ≠ 0), report the intersection of this line with the x-axis. Call this intersection x₁:
p′(x₀)(x₁ − x₀) = −p(x₀),
or equivalently,
x₁ = x₀ − p(x₀)/p′(x₀).
(c) Repeat this process, to get a sequence
xₙ₊₁ = xₙ − p(xₙ)/p′(xₙ) = xₙ − (xₙ² − 2)/(2xₙ) = xₙ/2 + 1/xₙ.
Figure 3.1. Newton-Raphson iterative method
Note the result of applying this process a few times:


n xn p(xn )
0 3.000000000000000 7.0000E + 00
1 1.833333333333333 1.3611E + 00
2 1.462121212121212 1.3780E − 01
3 1.414998429894803 2.2206E − 03
4 1.414213780047198 6.1568E − 07
5 1.414213562373112 4.7518E − 14
6 1.414213562373095 −4.4409E − 16
7 1.414213562373095 4.4409E − 16
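The table above is easy to reproduce with a short script; a minimal sketch (the stopping rule and iteration cap are our own choices, not part of the text):

```python
def newton(f, fprime, x0, tol=1e-14, maxiter=50):
    """Newton-Raphson iteration: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(maxiter):
        step = f(x) / fprime(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# p(x) = x^2 - 2 with initial guess x0 = 3, as in the table
root = newton(lambda x: x**2 - 2, lambda x: 2*x, 3.0)
print(root)  # approximately 1.414213562373095
```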
Definition. Given a differentiable real-valued function f : R → R and an initial guess x₀ ∈ R, we define the Newton-Raphson iteration to be the sequence {xₙ}ₙ∈ℕ given by one of the two following recursive formulas:
f′(xₙ)(xₙ₊₁ − xₙ) = −f(xₙ),
xₙ₊₁ = xₙ − f(xₙ)/f′(xₙ).    (3)
The Newton-Raphson method (or Newton method) refers to employing this sequence to search for and approximate roots of the equation f(x) = 0.
Example 3.1. Consider now the function f(x) = 1 − 1/x over (0, ∞), which has the obvious root x = 1. The Newton-Raphson method gives the following iterates for any x₀ ∈ (0, ∞):
xₙ₊₁ = xₙ − f(xₙ)/f′(xₙ) = xₙ(2 − xₙ).
Notice the two factors on the right-hand side of that expression: xₙ, and 2 − xₙ. If the initial guess does not satisfy 0 < x₀ < 2, then the next iteration gives a non-positive value (see Figure 3.2). The method will not work in those instances: convergence to a solution is not guaranteed.
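A quick numeric sketch of this basin-of-attraction behavior (our own illustration, not from the text):

```python
def step(x):
    # one Newton-Raphson step for f(x) = 1 - 1/x
    return x * (2 - x)

# inside (0, 2) the iteration converges to the root x = 1
x = 0.5
for _ in range(30):
    x = step(x)
print(x)  # 1.0

# an initial guess x0 >= 2 immediately leaves the domain (0, infinity)
print(step(2.5))  # -1.25
```

Since 1 − xₙ₊₁ = (1 − xₙ)², the error squares at every step inside the basin, which is why 30 iterations reach the root exactly in floating point.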
Figure 3.2. Initial guess must be carefully chosen in Newton-Raphson
Example 3.2. Consider now f(x) = sign(x)√|x| over R, with root at x = 0. The Newton-Raphson method fails miserably with this function: for any x₀ ≠ 0,
x₁ = x₀ − f(x₀)/f′(x₀) = x₀ − sign(x₀)|x₀|^(1/2) / ((1/2)|x₀|^(−1/2)) = x₀ − 2x₀ = −x₀.
This sequence turns into a loop: x₂ₙ = x₀, x₂ₙ₊₁ = −x₀ for all n ∈ N (see Figure 3.3).

Figure 3.3. Newton-Raphson fails for some functions

1.2. Efficiency of Newton-Raphson's Method. To study the error in a Newton-Raphson iteration {xₙ}ₙ∈ℕ that converges to a root x⋆ of the function f, we observe that (using f(x⋆) = 0)
xₙ₊₁ − x⋆ = xₙ − x⋆ − f(xₙ)/f′(xₙ)
 = (xₙ − x⋆)(1 − (f(xₙ) − f(x⋆))/((xₙ − x⋆)f′(xₙ)))
 = (xₙ − x⋆)(1 − (1/f′(xₙ)) · (f(xₙ) − f(x⋆))/(xₙ − x⋆)).    (4)
Recall at this point the definition of divided differences for an n-times differentiable function g : R → R, and a family of values t₀ ≤ t₁ ≤ · · · ≤ tₙ:
∆g[t₀, t₁] = (g(t₁) − g(t₀))/(t₁ − t₀) (if t₀ ≠ t₁),
∆g[t₀, t₀] = g′(t₀),
∆g[t₀, t₁, . . . , tₙ] = (∆g[t₁, t₂, . . . , tₙ] − ∆g[t₀, t₁, . . . , tₙ₋₁])/(tₙ − t₀) (if t₀ ≠ tₙ),
∆g[t₀, . . . , t₀] = g⁽ⁿ⁾(t₀)/n! (with t₀ repeated n + 1 times; Why? Hint: Taylor's Polynomial).

We can then rewrite (4) in terms of divided differences as follows:
xₙ₊₁ − x⋆ = (xₙ − x⋆)(1 − ∆f[xₙ, x⋆]/∆f[xₙ, xₙ])
 = (xₙ − x⋆)(∆f[xₙ, xₙ] − ∆f[xₙ, x⋆])/∆f[xₙ, xₙ]
 = (xₙ − x⋆) · ((∆f[xₙ, xₙ] − ∆f[xₙ, x⋆])/(xₙ − x⋆)) · ((xₙ − x⋆)/∆f[xₙ, xₙ])
 = (xₙ − x⋆)² ∆f[xₙ, xₙ, x⋆]/∆f[xₙ, xₙ].
Therefore,
lim_n (xₙ₊₁ − x⋆)/(xₙ − x⋆)² = lim_n ∆f[xₙ, xₙ, x⋆]/∆f[xₙ, xₙ] = f″(x⋆)/(2f′(x⋆)).
If f″(x⋆) ≠ 0, the Newton-Raphson iteration exhibits quadratic convergence.¹
Remark 3.1. We have just proven that, if a Newton-Raphson iteration for a function f converges, then the convergence is quadratic. But how can we guarantee convergence to a root of f? The key is how far from the root we may start the sequence, given the structure of the graph of f.
Theorem 3.1 (Local Convergence for the Newton-Raphson Method). Let x⋆ be a simple root of the equation f(x) = 0, and assume that there exists ε > 0 so that
• f is twice continuously differentiable in the interval (x⋆ − ε, x⋆ + ε), and
• there are no critical points of f on that interval.
Set
M(ε) = max{ |f″(s)/(2f′(t))| : x⋆ − ε < s, t < x⋆ + ε }.
If ε is small enough so that εM(ε) < 1, then
(a) There are no other roots of f in (x⋆ − ε, x⋆ + ε).
(b) Any Newton-Raphson iteration starting at an initial guess x₀ ≠ x⋆ in that interval will converge (quadratically) to x⋆.
[Margin note: I have seen this Theorem in [9] with the condition 2εM(ε) < 1 instead, but I could not see why that 2 was necessary. What am I missing?]
Proof. Start with Taylor's Theorem for f around x⋆. Given x ≠ x⋆ satisfying |x − x⋆| < ε, there exists ξ between x and x⋆ so that
f(x) = f(x⋆) + (x − x⋆)f′(x⋆) + ½(x − x⋆)²f″(ξ)
 = (x − x⋆)f′(x⋆)(1 + (x − x⋆)f″(ξ)/(2f′(x⋆))).
Note that none of the three factors in the last expression is zero:
x − x⋆ ≠ 0 (since x ≠ x⋆ by hypothesis),
f′(x⋆) ≠ 0 (no critical points by hypothesis),
|(x − x⋆)f″(ξ)/(2f′(x⋆))| ≤ εM(ε) < 1, so the third factor is positive (by hypothesis on M(ε)).
This proves (a).
We want to prove now that all terms of a Newton-Raphson iteration stay in the interval (x⋆ − ε, x⋆ + ε). We do that by induction:
• |x₀ − x⋆| < ε by hypothesis.
¹See Appendix A.
• Assume |xₙ − x⋆| < ε. In that case,
xₙ₊₁ − x⋆ = (xₙ − x⋆)² ∆f[xₙ, xₙ, x⋆]/∆f[xₙ, xₙ],
but ∆f[xₙ, xₙ] = f′(xₙ), and there exists ξₙ between xₙ and x⋆ so that ∆f[xₙ, xₙ, x⋆] = ½f″(ξₙ); therefore,
|xₙ₊₁ − x⋆| = |xₙ − x⋆|² |f″(ξₙ)/(2f′(xₙ))| ≤ ε²M(ε) = ε · εM(ε) < ε.
The next step is to prove that there is convergence. A computation similar to the previous one gives
|xₙ − x⋆| ≤ εM(ε)|xₙ₋₁ − x⋆| ≤ (εM(ε))ⁿ|x₀ − x⋆|.
Since εM(ε) < 1, limₙ (εM(ε))ⁿ = 0, and {xₙ}ₙ∈ℕ converges to x⋆. □
Example 3.3. Let's use this Theorem to prove convergence of Newton-Raphson iterations for f(x) = x² − 2 with initial guesses x₀ in a meaningful interval, something like the obvious choice x₀ ∈ [1, 2]. Pick for example ε = √2/2, so we can analyze convergence with initial guesses in the interval [√2/2, 3√2/2] ⊃ [1, 2]. Note that
M(√2/2) = max{ 1/(2t) : √2/2 < t < 3√2/2 } = √2/2,
and thus
εM(ε) = (√2/2) M(√2/2) = 1/2 < 1.
By virtue of Theorem 3.1 we have quadratic convergence to √2 if we start with any initial guess in this interval.
1.3. Extension to higher dimensions. Let's proceed to extend this process to functions g : Rᵈ → Rᵈ as follows.
• Any function g : Rᵈ → Rᵈ can be described in the form g(x) = (g₁(x), g₂(x), . . . , g_d(x)) for d real-valued functions gₖ : Rᵈ → R (1 ≤ k ≤ d).
• For such a function g, the derivative is the Jacobian, the d × d matrix whose entry in row j and column k is ∂gⱼ/∂xₖ:
Jg = ∇g = [∂g₁/∂x₁ ∂g₁/∂x₂ · · · ∂g₁/∂x_d; ∂g₂/∂x₁ ∂g₂/∂x₂ · · · ∂g₂/∂x_d; . . . ; ∂g_d/∂x₁ ∂g_d/∂x₂ · · · ∂g_d/∂x_d].
Start with a guess for the solution, x₀. As we did in the one-dimensional case, we substitute the function g by the linear approximation given by the Jacobian:
yᵀ = g(x₀)ᵀ + ∇g(x₀)(x − x₀)ᵀ.
Provided the determinant of the Jacobian ∇g(x₀) is not zero, there is only one solution of the equation y = 0. The computation of x₁ is therefore the solution of the following system of linear equations:
∇g(x₀)(x₁ − x₀)ᵀ = −g(x₀)ᵀ,
or equivalently,
x₁ᵀ = x₀ᵀ − (∇g(x₀))⁻¹ · g(x₀)ᵀ.
We iterate this process to obtain a sequence of approximations {xₙ}ₙ∈ℕ to a root, via either of the following two recursive formulas:
∇g(xₙ)(xₙ₊₁ − xₙ)ᵀ = −g(xₙ)ᵀ,
xₙ₊₁ᵀ = xₙᵀ − (∇g(xₙ))⁻¹ · g(xₙ)ᵀ.    (5)
Example 3.4. Consider the function g : R² → R² given by
g(x, y) = (x³ − y, y³ − x).
Its Jacobian at each (x, y) is given by
∇g(x, y) = [3x² −1; −1 3y²].
Note the determinant of this matrix is det ∇g(x, y) = 9x²y² − 1 = (3xy − 1)(3xy + 1). For any point (x, y) that does not make this expression zero, this is an invertible matrix with
(∇g(x, y))⁻¹ = 1/(9x²y² − 1) [3y² 1; 1 3x²].
For an initial guess (x₀, y₀), the sequence computed by this method is then given by
(xₙ₊₁, yₙ₊₁)ᵀ = (xₙ, yₙ)ᵀ − 1/(9xₙ²yₙ² − 1) [3yₙ² 1; 1 3xₙ²] (xₙ³ − yₙ, yₙ³ − xₙ)ᵀ.
Let's run this process with three different initial guesses:
(a) Starting at (x₀, y₀) = (−1.0, 1.0), the sequence converges to (0, 0).

n xn yn
0 −1.00000000 1.00000000
1 −0.50000000 0.50000000
2 −0.14285714 0.14285714
3 −0.00549451 0.00549451
4 −0.00000033 0.00000033
5 −0.00000000 0.00000000
6 −0.00000000 0.00000000
(b) Starting at (x0 , y0 ) = (3.5, 2.1), the sequence converges to (1, 1).
n xn yn
0 3.50000000 2.10000000
1 2.37631607 1.57961573
2 1.65945969 1.27476534
3 1.23996276 1.10419072
4 1.04837462 1.02274752
5 1.00260153 1.00133122
6 1.00000824 1.00000451
7 1.00000000 1.00000000
8 1.00000000 1.00000000
(c) Starting at (x0 , y0 ) = (−13.5, −7.3), the sequence converges to
(−1, −1).
n xn yn
0 −13.50000000 −7.30000000
1 −9.00900415 −4.92301873
2 −6.01982204 −3.36480659
3 −4.03494126 −2.36199873
4 −2.72553474 −1.73750959
5 −1.87830623 −1.36573112
6 −1.36121191 −1.15374930
7 −1.09518303 −1.04341362
8 −1.00932090 −1.00463507
9 −1.00010404 −1.00005571
10 −1.00000001 −1.00000001
11 −1.00000000 −1.00000000
12 −1.00000000 −1.00000000
13 −1.00000000 −1.00000000
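The runs above can be reproduced with a few lines of numpy, solving a linear system at each step rather than inverting the Jacobian explicitly (a minimal sketch; the fixed iteration count is our own choice):

```python
import numpy as np

def newton_system(g, jac, x0, iters=20):
    """Multivariate Newton-Raphson: solve jac(x_n) d = -g(x_n), set x_{n+1} = x_n + d."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        d = np.linalg.solve(jac(x), -g(x))
        x = x + d
    return x

g = lambda v: np.array([v[0]**3 - v[1], v[1]**3 - v[0]])
jac = lambda v: np.array([[3*v[0]**2, -1.0], [-1.0, 3*v[1]**2]])

print(newton_system(g, jac, [-1.0, 1.0]))  # converges to (0, 0), as in (a)
print(newton_system(g, jac, [3.5, 2.1]))   # converges to (1, 1), as in (b)
```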
Remark 3.2. The Newton-Raphson method to solve g = 0, as given by the recursive formula with the inverse of the Jacobian, is very convenient to provide explicit descriptions of the different iterations. However, it is hardly suitable for practical purposes, due to the computational issues involved in matrix inversion.
To avoid dealing with matrix inversion, we exclusively use the equivalent formula ∇g(xₙ)(xₙ₊₁ − xₙ)ᵀ = −g(xₙ)ᵀ, since it offers a simple system of linear equations: a much more reliable process, prone to less numerical error.
1.4. Optimization via Newton's Method. We can readily see how this process aids in the computation of critical points of a twice continuously differentiable real-valued function f : Rᵈ → R:
(a) Set g(x) = ∇f(x) = (∂f/∂x₁, . . . , ∂f/∂x_d).
(b) It is then ∇g(x) = Hessf(x).
(c) Perform a Newton method (with initial guess x₀) on g = ∇f to obtain the recursive formula
xₙ₊₁ᵀ = xₙᵀ − (Hessf(xₙ))⁻¹ · ∇f(xₙ)ᵀ.    (6)
Example 3.5. Consider the polynomial p₄(x, y) = x⁴ − 4xy + y⁴. Notice ∇p₄(x, y) = (4x³ − 4y, 4y³ − 4x), which is 4 times the function g in Example 3.4 and therefore has the same roots. The critical points we found were (0, 0), (−1, −1) and (1, 1). See Figure 3.4.
Example 3.6. A similar process for the Rosenbrock function
R₁,₁(x, y) = (1 − x)² + (y − x²)²
gives the following recursive formula:
(xₙ₊₁, yₙ₊₁)ᵀ = (xₙ, yₙ)ᵀ − (HessR₁,₁(xₙ, yₙ))⁻¹ · ∇R₁,₁(xₙ, yₙ)ᵀ
 = 1/(2xₙ² − 2yₙ + 1) (2xₙ³ − 2xₙyₙ + 1, xₙ(2xₙ³ − 2xₙyₙ − xₙ + 2))ᵀ.
For instance, starting with the initial guess (x₀, y₀) = (−2, 2), the sequence converges to the critical point (1, 1) in a few steps. See Figure 3.4.
Figure 3.4. Newton-Raphson method
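The search in Example 3.6 can be sketched numerically, writing out the gradient and Hessian of R₁,₁ by hand and solving the linear system of each step with numpy (a sketch under those assumptions, not a library routine):

```python
import numpy as np

# R(x, y) = (1 - x)^2 + (y - x^2)^2
def grad(v):
    x, y = v
    return np.array([2*(x - 1) + 4*x*(x**2 - y), 2*(y - x**2)])

def hess(v):
    x, y = v
    return np.array([[2 + 12*x**2 - 4*y, -4*x], [-4*x, 2.0]])

v = np.array([-2.0, 2.0])
for _ in range(20):
    # Hess R(v_n) (v_{n+1} - v_n)^T = -grad R(v_n)^T
    v = v + np.linalg.solve(hess(v), -grad(v))
print(v)  # converges to the critical point (1, 1)
```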
Remark 3.3. The equivalent recursive formula to (6) to search for critical points of a twice continuously differentiable real-valued function f : Rᵈ → R is given by
Hessf(xₙ)(xₙ₊₁ − xₙ)ᵀ = −∇f(xₙ)ᵀ.    (7)
We refer to this formula as the Newton-Raphson recursive formula, and the
sequence {xn }n∈N is coined the Newton-Raphson iteration. The sequence of
directions associated to {xn+1 − xn }n∈N are called Newton-Raphson direc-
tions.
Example 3.7. The equivalent recursive formula to the one we obtained in Example 3.4 is as follows:
[3xₙ² −1; −1 3yₙ²] (xₙ₊₁ − xₙ, yₙ₊₁ − yₙ)ᵀ = −(xₙ³ − yₙ, yₙ³ − xₙ)ᵀ.
All we need to do is, at each step n, solve for X and Y the system of linear equations
3xₙ²(X − xₙ) − (Y − yₙ) = −(xₙ³ − yₙ),
−(X − xₙ) + 3yₙ²(Y − yₙ) = −(yₙ³ − xₙ),
or equivalently,
3xₙ²X − Y = 2xₙ³,
−X + 3yₙ²Y = 2yₙ³.
There are some theoretical results that aid in the search for a good initial guess in the case of multivariate functions. The following states a simple set of conditions on f and x₀ that guarantee quadratic convergence of the corresponding sequences {xₙ}ₙ∈ℕ to a critical point x⋆.
Theorem 3.2 (Quadratic Convergence Theorem). Suppose f : Rᵈ → R is a twice continuously differentiable real-valued function, and x⋆ is a critical point of f. Let N(x)ᵀ = xᵀ − (Hessf(x))⁻¹ · ∇f(x)ᵀ. Suppose there exist
(a) h > 0 so that² ‖(Hessf(x⋆))⁻¹‖ ≤ 1/h, and
(b) β > 0, L > 0 for which ‖Hessf(x) − Hessf(x⋆)‖ ≤ L‖x − x⋆‖ provided ‖x − x⋆‖ ≤ β.
In that case, for all x ∈ Rᵈ satisfying ‖x − x⋆‖ ≤ min{β, 2h/(3L)},
‖N(x) − x⋆‖/‖x − x⋆‖² ≤ 3L/(2h).
²Recall the norm of a matrix M, defined by ‖M‖ = max{‖M · x‖ : ‖x‖ = 1}.
Example 3.8. There is an implementation of the Newton-Raphson method in the scipy.optimize libraries in Python: the routine fmin_ncg() (the cg indicates that the inversion of the Hessian is performed using the technique of conjugate gradient). The following session illustrates how to use this method to approximate the minimum of the Rosenbrock function R₁,₁₀₀ with an initial guess (x₀, y₀) = (2, 2):
import numpy as np
from scipy.optimize import fmin_ncg

# Rosenbrock R1,100 function and its Jacobian ∇R1,100
from scipy.optimize import rosen, rosen_der

We call fmin_ncg with the function and its gradient (passed with the option fprime=), and with the initial guess (2, 2).
>>> fmin_ncg(rosen, [2.,2.], fprime=rosen_der, retall=False)
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 277
Function evaluations: 300
Gradient evaluations: 1540
Hessian evaluations: 0
array([ 0.99999993, 0.99999987])

2. Secant Methods
The Newton-Raphson method to compute local extrema of real-valued
functions f : Rd → R, has several shortcomings:
• Convergence is not guaranteed.
• Even when convergence is guaranteed, the limit of a Newton-Raphson
iteration is not necessarily a local minimum. Any critical point of
f is a target.
• For a successful implementation of Newton-Raphson we do need
expressions for the function itself, its gradient and Hessian matrix.
The secant method takes care of the latter issue, while retaining comparable convergence rates.
2.1. A secant method to search for roots of univariate functions. To explain how it works, let's once again try to find an accurate value of √2 as the root of the polynomial f(x) = x² − 2.
(a) Consider two initial guesses x0 = 3, x1 = 2.8. Notice f(3) = 7 ≠ 5.84 = f(2.8).
(b) The line that joins the points (3, 7) and (2.8, 5.84) has equation
\[ y - 7 = \frac{5.84-7}{2.8-3}\,(x-3), \qquad\text{that is,}\qquad y = 5.8x - 10.4. \]
The latter can be seen as a linear approximation to the original function, one that did not use the derivative of f. This linear function intersects the x-axis at
\[ x_2 = \frac{10.4}{5.8} \approx 1.7931034483. \]
40 3. NUMERICAL APPROXIMATION FOR UNCONSTRAINED OPTIMIZATION

We take this value as an approximation to the root of f (x) = 0.


(c) Repeat this process to get a sequence of approximations xn .
Example 3.9. Observe the result of applying this recursive process, and compare with the similar experiment we conducted using the Newton-Raphson method on page 30.
n xn f (xn )
0 3.000000000000000 7.0000E + 00
1 2.800000000000000 5.8400E + 00
2 1.793103448275862 1.2152E + 00
3 1.528528528528528 3.3640E − 01
4 1.427253172054743 3.7052E − 02
5 1.414717869757887 1.4267E − 03
6 1.414215876250105 6.5446E − 06
7 1.414213562785585 1.1667E − 09
8 1.414213562373095 8.8818E − 16
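The recursion is only a couple of lines in any language. A minimal sketch (my own, not from the notes) in Python that reproduces the table above for f(x) = x² − 2:

```python
def secant(f, x0, x1, steps):
    """Secant iteration:
    x_{n+1} = x_n - f(x_n) * (x_n - x_{n-1}) / (f(x_n) - f(x_{n-1}))."""
    xs = [x0, x1]
    for _ in range(steps):
        f0, f1 = f(xs[-2]), f(xs[-1])
        xs.append(xs[-1] - f1 * (xs[-1] - xs[-2]) / (f1 - f0))
    return xs

f = lambda x: x**2 - 2
iterates = secant(f, 3.0, 2.8, 7)   # iterates[2] matches x2 = 10.4/5.8
```

The final entry agrees with √2 to machine precision, as in the last row of the table.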

Figure 3.5. Secant iterative method

Definition. Given a function f : R → R and two initial guesses x0 ≠ x1 satisfying f(x0) ≠ f(x1), we define the Secant method iteration to be the sequence {xn}n∈N obtained by one of the following recursive formulas:
\[ \frac{f(x_n)-f(x_{n-1})}{x_n-x_{n-1}}\,(x_{n+1}-x_n) = -f(x_n), \qquad\text{i.e.}\qquad x_{n+1} = x_n - f(x_n)\,\frac{x_n-x_{n-1}}{f(x_n)-f(x_{n-1})}. \tag{8} \]
The Secant method refers to employing this sequence to search and approx-
imate roots of the equation f (x) = 0.

2.2. Efficiency of the Secant Method. Let f : R → R be a real-valued function with a root x⋆, and let {xn}n∈N be a secant method iteration. Then
\begin{align*}
x_{n+1}-x^\star &= x_n - f(x_n)\,\frac{x_n-x_{n-1}}{f(x_n)-f(x_{n-1})} - x^\star\\
&= (x_n-x^\star)\left(1 - \frac{x_n-x_{n-1}}{f(x_n)-f(x_{n-1})}\cdot\frac{f(x_n)-\overbrace{f(x^\star)}^{0}}{x_n-x^\star}\right)\\
&= (x_n-x^\star)\left(1 - \frac{\Delta f[x_n,x^\star]}{\Delta f[x_{n-1},x_n]}\right)\\
&= (x_n-x^\star)\,\frac{\Delta f[x_{n-1},x_n]-\Delta f[x_n,x^\star]}{\Delta f[x_{n-1},x_n]}\\
&= (x_n-x^\star)(x_{n-1}-x^\star)\,\frac{\Delta f[x_{n-1},x_n]-\Delta f[x_n,x^\star]}{x_{n-1}-x^\star}\cdot\frac{1}{\Delta f[x_{n-1},x_n]}\\
&= (x_n-x^\star)(x_{n-1}-x^\star)\,\frac{\Delta f[x_{n-1},x_n,x^\star]}{\Delta f[x_{n-1},x_n]}
\end{align*}
We use this identity to prove the corresponding local convergence result,
as we did for the Newton-Raphson method:
Theorem 3.3 (Local Convergence for the Secant Method). Let x⋆ be a simple root of the equation f(x) = 0, and suppose there exists ε > 0 so that
• f is twice continuously differentiable in the interval (x⋆ − ε, x⋆ + ε), and
• there are no critical points of f on that interval.
Set
\[ M(\varepsilon) = \max\left\{ \left|\frac{f''(s)}{2f'(t)}\right| : x^\star-\varepsilon < s,t < x^\star+\varepsilon \right\}. \]
If ε is small enough so that εM(ε) < 1, then
(a) there are no other roots of f in (x⋆ − ε, x⋆ + ε);
(b) any Secant method iteration starting with initial guesses x0 ≠ x1 (neither of them equal to x⋆, and with f(x0) ≠ f(x1)) in that interval will converge to x⋆, superlinearly with order the golden ratio (1 + √5)/2 ≈ 1.618.
The proof is similar to the one for Theorem 3.1, and is left as an exercise.

2.3. Generalization of the Secant Method to higher dimensions: Broyden's Method. The objective now is the search for a root of a function g : Rd → Rd, which we assume of the form
\[ g(x_1,\dots,x_d) = \big(g_1(x_1,\dots,x_d),\dots,g_d(x_1,\dots,x_d)\big) \]
for appropriate functions gk : Rd → R (1 ≤ k ≤ d). We start at an initial guess x0 ∈ Rd and, if possible, we choose as x1 the first step of Newton-Raphson:
\[ x_1^\intercal = x_0^\intercal - \nabla g(x_0)^{-1}\cdot g(x_0)^\intercal. \]


At that point we craft a linear approximation L1 : Rd → Rd to the graph of g that interpolates (x0, g(x0)) and (x1, g(x1)). We may impose this secant property by requesting L1 to be of the form
L1 (x)| = g(x1 )| + A1 · (x − x1 )| ,
where A1 is a non-singular square matrix of size d × d that satisfies
A1 · (x1 − x0 )| = g(x1 )| − g(x0 )| .
At this point, we approximate the root of g by the root x2 of L1 :
A1 · (x2 − x1 )| = −g(x1 )| ,
or equivalently,
\[ x_2^\intercal = x_1^\intercal - A_1^{-1}\cdot g(x_1)^\intercal. \]
We repeat this process to obtain a sequence of approximations {xn }n∈N via
a sequence of non-singular square matrices An of size d × d that satisfy the
secant property
An · (xn+1 − xn )| = g(xn+1 )| − g(xn )| . (9)
The corresponding linear functions are given by
Ln (x)| = g(xn )| + An · (x − xn )| .
The recurrence formula can then be expressed in any of the following two
ways:
\[ A_n\cdot(x_{n+1}-x_n)^\intercal = -g(x_n)^\intercal, \qquad x_{n+1}^\intercal = x_n^\intercal - A_n^{-1}\cdot g(x_n)^\intercal. \tag{10} \]

We refer to this process as the Broyden Method for finding roots.


Remark 3.4. There is not a unique way to choose the matrices An in Broyden's method (and I encourage you to find one of your own devising). A straightforward construction is given by the following recursive formulas:
\[ A_0 = \nabla g(x_0), \qquad A_{n+1} = A_n + \frac{\big[\big(g(x_{n+1})-g(x_n)\big)^\intercal - A_n\,(x_{n+1}-x_n)^\intercal\big]\otimes(x_{n+1}-x_n)}{\|x_{n+1}-x_n\|^2}. \tag{11} \]
Remark 3.5. There are other possibilities for the selection of the point
x1 that do not involve using the Jacobian ∇g(x0 ) as the matrix A0 . For
the purposes of root searching this choice is not too crucial but, as we shall
see in the context of optimization, having the initial matrix A0 satisfy extra
properties could be very advantageous. To learn about different strategies
in this regard, see e.g. [5, chapter 8].

Example 3.10. Let’s use this method  to find any of the roots (−1, −1),
(0, 0), (1, 1) of the function g(x, y) = x3 − y, y 3 − x from Example 3.4.
Starting with initial guess (x0, y0) = (−1, 1), we take the first step from Newton-Raphson:
\[ \nabla g(x_0,y_0) = \begin{pmatrix}3x_0^2 & -1\\ -1 & 3y_0^2\end{pmatrix} = \begin{pmatrix}3 & -1\\ -1 & 3\end{pmatrix} \]
\[ L_0(x,y)^\intercal = \begin{pmatrix}-2\\ 2\end{pmatrix} + \begin{pmatrix}3 & -1\\ -1 & 3\end{pmatrix}\cdot\begin{pmatrix}x+1\\ y-1\end{pmatrix} = \begin{pmatrix}3x-y+2\\ -x+3y-2\end{pmatrix} \]
This gives (x1, y1) = (−1/2, 1/2) as root of L0. For the second step, we compute the matrix A1 satisfying the secant property:
\begin{align*}
A_1 &= A_0 + \frac{\big[g(-\tfrac12,\tfrac12)^\intercal - g(-1,1)^\intercal - A_0\cdot[\tfrac12,-\tfrac12]^\intercal\big]\otimes[\tfrac12,-\tfrac12]}{\big\|[\tfrac12,-\tfrac12]\big\|^2}\\
&= \begin{pmatrix}3&-1\\-1&3\end{pmatrix} + 2\left[\begin{pmatrix}-5/8\\ 5/8\end{pmatrix} - \begin{pmatrix}-2\\ 2\end{pmatrix} - \begin{pmatrix}2\\ -2\end{pmatrix}\right]\otimes\big[\tfrac12,-\tfrac12\big]\\
&= \begin{pmatrix}3&-1\\-1&3\end{pmatrix} + 2\,\big[-\tfrac58,\tfrac58\big]^\intercal\otimes\big[\tfrac12,-\tfrac12\big]\\
&= \begin{pmatrix}3&-1\\-1&3\end{pmatrix} + \begin{pmatrix}-5/8&5/8\\ 5/8&-5/8\end{pmatrix} = \begin{pmatrix}19/8&-3/8\\ -3/8&19/8\end{pmatrix}
\end{align*}
This gives a linear approximation
\[ L_1(x,y)^\intercal = \begin{pmatrix}-5/8\\ 5/8\end{pmatrix} + \begin{pmatrix}19/8&-3/8\\ -3/8&19/8\end{pmatrix}\cdot\begin{pmatrix}x+1/2\\ y-1/2\end{pmatrix} = \begin{pmatrix}\tfrac{19}{8}x - \tfrac38 y + \tfrac34\\[2pt] -\tfrac38 x + \tfrac{19}{8}y - \tfrac34\end{pmatrix} \]
which has root (x2, y2) = (−3/11, 3/11).
If we continue the computations, we arrive at the root (0, 0) to an accuracy of six decimal places in just six steps!
n xn yn g(xn, yn)
0 −1.000000 1.000000 (−2.000000, 2.000000)
1 −0.500000 0.500000 (−0.625000, 0.625000)
2 −0.272727 0.272727 (−0.293013, 0.293013)
3 −0.072136 0.072136 (−0.072511, 0.072511)
4 −0.006172 0.006172 (−0.006172, 0.006172)
5 −0.000035 0.000035 (−0.000035, 0.000035)
6 −0.000000 0.000000 (−0.000000, 0.000000)
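In code, Broyden's iteration amounts to one linear solve and one rank-one update (11) per step. A minimal sketch (my own, not from the notes), applied to the same g and initial guess:

```python
import numpy as np

def g(v):
    x, y = v
    return np.array([x**3 - y, y**3 - x])

def jac_g(v):
    x, y = v
    return np.array([[3*x**2, -1.0], [-1.0, 3*y**2]])

def broyden(g, jac, x0, steps):
    x = np.asarray(x0, dtype=float)
    A = jac(x)                            # A0 = Jacobian at the initial guess
    for _ in range(steps):
        d = np.linalg.solve(A, -g(x))     # solve A_n (x_{n+1} - x_n) = -g(x_n)
        x_new = x + d
        if d @ d < 1e-30:                 # already converged; avoid dividing by 0
            return x_new
        # rank-one secant update (11)
        A = A + np.outer(g(x_new) - g(x) - A @ d, d) / (d @ d)
        x = x_new
    return x

root = broyden(g, jac_g, [-1.0, 1.0], 8)
```

Starting from (−1, 1) the iterates reproduce the table above, heading to the root (0, 0).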
2.4. Secant Methods for Optimization. Given a real-valued func-
tion f : Rd → R, we may apply Broyden’s method to search for roots of the
gradient ∇f : Rd → Rd . As it happened with Newton’s method, we are not
guaranteed convergence to a local minimum.

Example 3.11. As we did in Example 3.5, we may explore the convergence to different critical points of the polynomial p4(x, y) = x⁴ − 4xy + y⁴ using different initial guesses. Note once again that the gradient of p4 is a multiple of the function g from Example 3.10, ∇p4 = 4g, so both have the same roots:
For instance, starting with (x0 , y0 ) = (−1, 1) we converge to the critical
point (0, 0), which is not a local minimum.

3. The Method of Steepest Descent


The method of Steepest Descent (also known as the method of Gradient
Descent) is based upon the following property of gradients that we learned
in Vector Calculus:
Theorem 3.4. If f : Rd → R is continuously differentiable, then at any
point x ∈ Rd , the vector − ∇f (x) points in the direction of most rapid
decrease for f at x. The rate of decrease of f at x in this direction is
precisely −k∇f (x)k.
Remark 3.6. For this reason, the vector − ∇f (x)/k∇f (x)k is called the
direction of steepest descent of f at x.

In order to search for a local minimum for a twice continuously differentiable function f : Rd → R, we start by choosing an initial guess x0.
(a) Restrict the function f over the line through x0 in the direction of −∇f(x0):
\[ \varphi_0(t) = f\big(x_0 - t\nabla f(x_0)\big), \qquad t\ge 0. \]
(b) Line-search: Search for the value of t0 ≥ 0 that minimizes φ0, and set
\[ x_1 = x_0 - t_0\nabla f(x_0). \]
(c) Repeat this process to get the sequence
\[ x_{n+1} = x_n - t_n\nabla f(x_n), \qquad t_n = \operatorname*{argmin}_{t\ge 0}\varphi_n(t) = \operatorname*{argmin}_{t\ge 0} f\big(x_n - t\nabla f(x_n)\big). \tag{12} \]

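The loop above can be sketched in a few lines. The following is my own sketch (not from the notes): scipy's bounded scalar minimizer stands in for the line search, and the search bracket (0, 1] is an assumption that happens to cover the optimal steps for the example function p4(x, y) = x⁴ − 4xy + y⁴ used below.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p4(v):
    x, y = v
    return x**4 - 4*x*y + y**4

def grad_p4(v):
    x, y = v
    return np.array([4*x**3 - 4*y, 4*y**3 - 4*x])

def steepest_descent(f, grad, x0, steps):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        d = -grad(x)                      # direction of steepest descent
        if np.linalg.norm(d) < 1e-12:     # critical point: stop
            break
        # line search: minimize phi(t) = f(x + t d) over the bracket
        t = minimize_scalar(lambda t: f(x + t*d),
                            bounds=(0.0, 1.0), method='bounded').x
        x = x + t*d
    return x

xmin = steepest_descent(p4, grad_p4, [3.5, 2.1], 40)
```

With the initial guess (3.5, 2.1) the iterates approach the minimizer (1, 1), where p4 = −2.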
Remark 3.7. Sequences constructed following the formula in (12) are


said to be sequences of Steepest Descent for f .
Unlike Newton-Raphson or the secant methods, this algorithm guarantees that the sequence of values {f(xn)}n∈N is non-increasing: f(xn+1) ≤ f(xn) for all n ∈ N. And even better: if the sequence converges, its limit must be a critical point of f. These results are formalized in Theorems 3.5 and 3.6 below.
Steepest descent sequences have another interesting property: on each
step n, the direction of steepest descent is perpendicular to the direction of
steepest descent of step n + 1 (!!) We state and prove this result in Theorem
3.7.

Theorem 3.5. Let f : Rd → R be a continuously differentiable real-


valued function, and let {xn }n∈N be a sequence of steepest descent for f . If
∇f (xN ) 6= 0, then f (xN +1 ) < f (xN ).

Theorem 3.6. Let f : Rd → R be a real-valued function, let x0 ∈ Rd be


an initial guess. Assume S = {x ∈ Rd : f (x) ≤ f (x0 )} is a compact set
and f is continuously differentiable in S. Under these conditions, the limit
of any convergent subsequence of the associated sequence of steepest descent
{xn }n∈N is a critical point of f .

Theorem 3.7. Let f : Rd → R be a continuously differentiable real-


valued function, and {xn }n∈N a sequence of steepest descent for f . For any
n ∈ N, hxn+2 − xn+1 , xn+1 − xn i = 0.

Proof. Consider for each n ∈ N the function φn(t) = f(xn − t∇f(xn)), with a global minimum at tn ≥ 0. It must then be
\[ 0 = \varphi_n'(t_n) = \big\langle \nabla f(x_n),\, -\nabla f\big(x_n - t_n\nabla f(x_n)\big)\big\rangle = -\langle\nabla f(x_{n+1}),\nabla f(x_n)\rangle, \]
which proves that the gradients of consecutive terms of the sequence of steepest descent for f are perpendicular. Now, by virtue of the recurrence formula (12),
\[ \langle x_{n+2}-x_{n+1},\, x_{n+1}-x_n\rangle = \langle -t_{n+1}\nabla f(x_{n+1}),\, -t_n\nabla f(x_n)\rangle = t_{n+1}t_n\,\langle\nabla f(x_{n+1}),\nabla f(x_n)\rangle = 0, \]
which proves the statement. □

Example 3.12. For the polynomial function p4 (x, y) = x4 − 4xy + y 4


from Example 3.5, using the same initial guesses as in Example 3.4, we find
the following behavior:
• Starting at (x0 , y0 ) = (−1.0, 1.0), the sequence jumps to (0, 0) in
one step. At that point, since the gradient of the function is zero,
the method of Steepest descent ceases to work.

n xn yn f (xn , yn )
0 −1.000000 1.000000 6.000000
1 0.000000 0.000000 0.000000
2 nan nan nan

• Starting at (x0 , y0 ) = (3.5, 2.1), the sequence converges to (1, 1).



n xn yn f (xn , yn )
0 3.500000 2.100000 140.110600
1 1.044472 1.753064 3.310777
2 1.141931 1.063276 −1.878163
3 1.008581 1.044435 −1.988879
4 1.013966 1.006319 −1.998931
5 1.000898 1.004472 −1.999891
6 1.001437 1.000651 −1.999989
7 1.000093 1.000461 −1.999999
8 1.000149 1.000067 −2.000000
9 1.000010 1.000048 −2.000000
10 1.000015 1.000007 −2.000000
11 1.000001 1.000005 −2.000000
12 1.000002 1.000001 −2.000000
13 1.000000 1.000001 −2.000000
14 1.000000 1.000000 −2.000000
15 1.000000 1.000000 −2.000000

• Starting at (x0 , y0 ) = (−13.5, −7.3), the sequence converges to (1, 1)


as well.

n xn yn f (xn , yn )
0 −13.500000 −7.300000 35660.686600
1 2.362722 −4.871733 640.498302
2 1.434154 1.194162 −0.586492
3 1.021502 1.130993 −1.896212
4 1.038817 1.017881 −1.991558
5 1.002305 1.012291 −1.999167
6 1.003909 1.001808 −1.999917
7 1.000236 1.001246 −1.999992
8 1.000399 1.000185 −1.999999
9 1.000024 1.000127 −2.000000
10 1.000041 1.000019 −2.000000
11 1.000002 1.000013 −2.000000
12 1.000004 1.000002 −2.000000
13 1.000000 1.000001 −2.000000
14 1.000000 1.000000 −2.000000
15 1.000000 1.000000 −2.000000

Example 3.13. Notice what happens when we try to implement the


same process on the Rosenbrock function R1,1 (x, y) = (1 − x)2 + (y − x2 )2 ,
with the initial guess (x0 , y0 ) = (−2, 2). The sequence does converge to the
minimum (1, 1), albeit very slowly.

Figure 3.6. The Method of Steepest Descent: Polynomial function

n xn yn f (xn , yn ) n xn yn f (xn , yn )
0 −2.000000 2.000000 13.000000 17 0.916394 0.789239 0.009544
1 −0.166290 2.309522 6.567163 18 0.911201 0.818326 0.008028
2 0.256054 −0.056128 0.568264 19 0.929317 0.821560 0.006766
3 0.613477 0.007683 0.285318 20 0.925024 0.845608 0.005723
4 0.568566 0.259241 0.190235 21 0.939976 0.848277 0.004847
5 0.715784 0.285524 0.132227 22 0.936397 0.868329 0.004118
6 0.689755 0.431319 0.098227 23 0.948845 0.870551 0.003502
7 0.779264 0.447299 0.074310 24 0.945840 0.887385 0.002986
8 0.761554 0.546496 0.057977 25 0.956276 0.889248 0.002548
9 0.823325 0.557524 0.045696 26 0.953739 0.903457 0.002178
10 0.810322 0.630358 0.036667 27 0.962537 0.905028 0.001864
11 0.855862 0.638488 0.029614 28 0.960386 0.917075 0.001597
12 0.845883 0.694385 0.024199 29 0.967837 0.918405 0.001369
13 0.880846 0.700627 0.019862 30 0.966007 0.928657 0.001176
14 0.872964 0.744776 0.016437 31 0.972342 0.929788 0.001010
15 0.900551 0.749702 0.013647 32 0.970780 0.938539 0.000869
16 0.894200 0.785276 0.011399 33 0.976182 0.939503 0.000748

3.1. Efficiency of Steepest Descent Method. The analysis of efficiency of the method of Steepest descent is quite involved, but it boils down to studying the efficiency of Steepest descent for quadratic functions, since any function can be approximated by a Taylor polynomial of degree two. We will study that easier case in these notes.
Theorem 3.8 (Taylor’s Formula). If f : Rd → R is a real-valued function
of d variables with continuous first and second partial derivatives on Rd , then

Figure 3.7. The Method of Steepest Descent: Rosenbrock Function

for any choice x, y ∈ Rd, there exists a point ξ = ξ(x, y) in the segment joining x and y so that
\[ f(x) = f(y) + \langle\nabla f(y),\, x-y\rangle + \tfrac12\,Q_{\operatorname{Hess}f(\xi)}(x-y). \]

d
Assume we have a quadratic function p : Rd → R satisfying p(0) = 0. There exist a d-dimensional vector D = [q1, …, qd] and a symmetric matrix Q = (qjk)j,k=1,…,d (with qjk = qkj for all 1 ≤ j, k ≤ d) so that
\[ p(x) = \langle D,x\rangle + \tfrac12 Q_Q(x) = \sum_{k=1}^d\Big(\tfrac12 q_{kk}x_k^2 + q_kx_k\Big) + \sum_{1\le j<k\le d} q_{jk}x_jx_k. \]
The gradient of this function is thus ∇p(x) = x·Q + D. It has one unique critical point x⋆ = −DQ⁻¹ (Why?). At that point, it is
\begin{align*}
p(x^\star) &= \tfrac12\big({-DQ^{-1}}\big)Q\big({-DQ^{-1}}\big)^\intercal + D\big({-DQ^{-1}}\big)^\intercal\\
&= \tfrac12 DQ^{-1}D^\intercal - DQ^{-1}D^\intercal = -\tfrac12 DQ^{-1}D^\intercal = -\tfrac12 Q_{Q^{-1}}(D). \tag{13}
\end{align*}
If xn is a term in a sequence of steepest descent, then to compute xn+1 we
proceed as follows:

(a) The direction of steepest descent at xn is vn = −∇p(xn) = −(xnQ + D).
(b) The restriction φ : (0, ∞) → R of the quadratic function p over the half-line through xn in the direction vn is given by
\begin{align*}
\varphi(t) &= p(x_n + tv_n)\\
&= \tfrac12(x_n+tv_n)Q(x_n+tv_n)^\intercal + D(x_n+tv_n)^\intercal\\
&= \tfrac12\, v_nQv_n^\intercal\, t^2 + \big(x_nQ+D\big)v_n^\intercal\, t + \underbrace{\tfrac12\, x_nQx_n^\intercal + Dx_n^\intercal}_{p(x_n)}\\
&= \tfrac12\, Q_Q(v_n)\,t^2 - \|v_n\|^2\,t + p(x_n),
\end{align*}
since (xnQ + D)vn⊺ = −vn·vn⊺ = −‖vn‖².
(c) The restriction function has its global minimum at
\[ t_n = \frac{\|v_n\|^2}{Q_Q(v_n)}; \]
therefore, the next iteration occurs at
\[ x_{n+1} = x_n + t_nv_n = x_n + \frac{\|v_n\|^2}{Q_Q(v_n)}\,v_n. \]

We want to observe the convergence behavior of the sequence of evaluations {p(xn)}n∈N to p(x⋆). We have
\[ p(x_{n+1}) = p\Big(x_n + \tfrac{\|v_n\|^2}{Q_Q(v_n)}\,v_n\Big) = \tfrac12\,Q_Q(v_n)\,\frac{\|v_n\|^4}{Q_Q(v_n)^2} - \frac{\|v_n\|^4}{Q_Q(v_n)} + p(x_n) = p(x_n) - \frac{\|v_n\|^4}{2Q_Q(v_n)}; \]
therefore,
\begin{align*}
\frac{p(x_{n+1})-p(x^\star)}{p(x_n)-p(x^\star)} &= 1 - \frac{\|v_n\|^4}{2Q_Q(v_n)\big(p(x_n)-p(x^\star)\big)}\\
&= 1 - \frac{\|v_n\|^4}{2Q_Q(v_n)\big(\tfrac12 x_nQx_n^\intercal + Dx_n^\intercal + \tfrac12 DQ^{-1}D^\intercal\big)}\\
&= 1 - \frac{\|v_n\|^4}{Q_Q(v_n)\big(x_nQx_n^\intercal + 2Dx_n^\intercal + DQ^{-1}D^\intercal\big)}.
\end{align*}
Note in the denominator we may rewrite some of the terms:
\begin{align*}
x_nQx_n^\intercal &= x_nQ(Q^{-1}Q)x_n^\intercal = (x_nQ)Q^{-1}(x_nQ)^\intercal,\\
2Dx_n^\intercal &= x_nD^\intercal + D(Q^{-1}Q)x_n^\intercal = (x_nQ)Q^{-1}D^\intercal + DQ^{-1}(x_nQ)^\intercal.
\end{align*}
This allows us to rewrite the ratio in the following convenient form:
\[ \frac{p(x_{n+1})-p(x^\star)}{p(x_n)-p(x^\star)} = 1 - \frac{\|v_n\|^4}{Q_Q(v_n)\,(x_nQ+D)Q^{-1}(x_nQ+D)^\intercal} = 1 - \frac{\|v_n\|^4}{Q_Q(v_n)\,Q_{Q^{-1}}(v_n)}. \]
We are ready to state the main result of this subsection:
Theorem 3.9. Given a d-dimensional vector D, and a positive definite symmetric matrix Q of size d × d, consider the quadratic function p(x) = ½Q_Q(x) + ⟨D, x⟩. Any sequence {xn}n∈N of steepest descent converges to the global minimum x⋆ = −DQ⁻¹. The sequence of evaluations {p(xn)}n∈N converges linearly to p(x⋆) = −½Q_{Q⁻¹}(D). In particular, if 0 < λ1 ≤ λ2 ≤ ⋯ ≤ λd are the eigenvalues of Q, then
\[ \frac{p(x_{n+1})-p(x^\star)}{p(x_n)-p(x^\star)} \le \left(\frac{\lambda_d-\lambda_1}{\lambda_d+\lambda_1}\right)^2. \]
Proof. We start by offering the following lower bound estimate³ involving the associated directions of steepest descent vn in terms of the largest and smallest eigenvalues of Q. For all n ∈ N,
\[ \frac{\|v_n\|^4}{Q_Q(v_n)\,Q_{Q^{-1}}(v_n)} \ge \frac{4\lambda_1\lambda_d}{(\lambda_1+\lambda_d)^2}. \tag{14} \]
We have then
\[ \frac{p(x_{n+1})-p(x^\star)}{p(x_n)-p(x^\star)} = 1 - \frac{\|v_n\|^4}{Q_Q(v_n)\,Q_{Q^{-1}}(v_n)} \le 1 - \frac{4\lambda_1\lambda_d}{(\lambda_1+\lambda_d)^2} = \left(\frac{\lambda_d-\lambda_1}{\lambda_d+\lambda_1}\right)^2. \qquad\square \]
³This is left as an advanced exercise. It is not too tricky; if you are stuck, see e.g. [1, section 1.3.1] for a proof.
Example 3.14. The global minimum value of the quadratic function p(x, y) = 5x² + 5y² − xy − 11x + 11y + 11 is zero, and it is attained at (1, −1). Notice that we may write this function in the form p(x, y) = ½Q_Q(x, y) + ⟨D, [x, y]⟩ + 11, where
\[ D = [-11, 11], \qquad Q = \begin{pmatrix}10 & -1\\ -1 & 10\end{pmatrix}. \]
The symmetric matrix Q has eigenvalues λ1 = 9 > 0, λ2 = 11 > 0 and is therefore positive definite. Theorem 3.9 states that sequences of steepest descent exhibit linear convergence with a rate of convergence not larger than δ = ((11 − 9)/(11 + 9))² = 0.01.
Observe the computations of the first six iterations, and the values of the ratios p(xn, yn)/p(xn−1, yn−1), when we use (1.5, 3.5) as our initial guess.
n xn yn p(xn, yn) p(xn, yn)/p(xn−1, yn−1)
0 1.5000000000 3.5000000000 100.2500000000
1 1.4498874016 −0.9600212545 1.0019989373 0.0099950019
2 1.0049975009 −0.9550224916 0.0100149812 0.0099950019
3 1.0044966254 −0.9996004124 0.0001000998 0.0099950019
4 1.0000499500 −0.9995504497 0.0000010005 0.0099950019
5 1.0000449438 −0.9999960061 0.0000000100 0.0099950042
6 1.0000004993 −0.9999955067 0.0000000001 0.0099950528
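As a sanity check (my own, not part of the notes), the iteration with the exact step tn = ‖vn‖²/Q_Q(vn) reproduces the table, and every observed ratio indeed stays below the bound δ = 0.01:

```python
import numpy as np

Q = np.array([[10.0, -1.0], [-1.0, 10.0]])
D = np.array([-11.0, 11.0])

def p(v):
    # p(x, y) = 5x^2 + 5y^2 - xy - 11x + 11y + 11, with p(x*) = 0
    return 0.5 * v @ Q @ v + D @ v + 11.0

x = np.array([1.5, 3.5])
values = [p(x)]
for _ in range(6):
    v = -(x @ Q + D)              # steepest-descent direction v_n
    t = (v @ v) / (v @ Q @ v)     # exact minimizer of the restriction
    x = x + t * v
    values.append(p(x))

# since p(x*) = 0, these are exactly the ratios from the table
ratios = [values[k+1] / values[k] for k in range(6)]
```

All six ratios come out near 0.009995, comfortably below the theoretical bound 0.01.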

4. Effective Algorithms for Unconstrained Optimization


All of the methods we have explored so far (Newton-Raphson, Secant
methods, Steepest descent) offer sound algorithms to compute local extrema
of real-valued functions f : Rd → R. They do have some pros and cons.
• The method of Steepest descent produces sequences with non-increasing function values.
• In order to obtain new approximations on each Steepest descent iteration, we have to solve a one-dimensional optimization problem, each of these offering its own computational issues.
• Both Newton-Raphson and Secant methods offer faster sequences, but we cannot always guarantee convergence.
• Another drawback of Newton-Raphson is the fact that we do need
to evaluate the function itself, its gradient and Hessian matrix.
• The recurrence formulas of Broyden’s method are simple, and re-
quire only evaluations of the function itself.
The goal of this section is precisely to gather the best properties of the previous methods, so we may craft new methods with all the advantages, but none of the shortcomings.

Given a function f : Rd → R with continuous first partial derivatives,


and a given initial guess x0 ∈ Rd , we search for a recursive formula to
approximate a minimum of f . We request that this formula has the form

xn+1 = xn + tn wn ,

with positive parameters tn > 0, and vectors wn satisfying the following


criteria:
Non-increasing sequences: f(xn+1) < f(xn) whenever ∇f(xn) ≠ 0. This is achieved by requiring the vectors wn to make an angle larger than π/2 with respect to ∇f(xn):
\[ \langle w_n, \nabla f(x_n)\rangle < 0. \tag{15} \]

Why does this work? Consider at each step n ∈ N the restriction φn(t) = f(xn + twn) of f over the ray xn + twn with t > 0. We have then φ′n(0) = ⟨∇f(xn), wn⟩ < 0 (a decreasing function near t = 0). There is then a guaranteed value tn > 0 (that could be very small) that gives us a point xn+1 = xn + tnwn with f(xn+1) < f(xn).
Control over length of steps: The steps tn are not too short, nor too
long.
This is achieved by picking first 0 < µ < λ < 1 and forcing vectors wn that satisfy
\[ \langle w_n,\, \nabla f(x_{n+1}) - \lambda\nabla f(x_n)\rangle > 0 \tag{16} \]
\[ f(x_{n+1}) \le f(x_n) + \mu t_n\langle w_n, \nabla f(x_n)\rangle \tag{17} \]

if this is at all possible.


Easy to compute: Duh!

Is it possible to create an iteration satisfying these criteria? The following result gives us a condition that helps in this regard:

Theorem 3.10 (Wolfe). Suppose that f : Rd → R is a real-valued function with continuous partial derivatives. Assume there exists M ∈ R so that f(x) ≥ M for all x. Let µ, λ be fixed numbers satisfying 0 < µ < λ < 1. If wn, xn ∈ Rd satisfy Criterion (15), then there exist real numbers an, bn such that 0 ≤ an < bn and
(a) Criterion (16) is satisfied for any choice of tn ∈ (0, bn), and
(b) Criterion (17) is satisfied for any choice of tn ∈ (an, bn).

Remark 3.8. For a proof, see [13, Theorem 3.3.1].

Using these principles, we are going to see two constructions based upon
secant methods: the DFP and BFGS methods.

4.1. The DFP Method. The Davidon-Fletcher-Powell method is one of the earliest and most effective secant methods. Its effectiveness stems from the fact that the method simultaneously generates conjugate directions and constructs an approximation of the inverse of the Hessian matrix. In each step, the inverse of the Hessian is updated by the sum of two symmetric rank-1 matrices. For this reason, it is referred to as a rank-2 correction procedure. It is also called a variable metric method.
To minimize f : Rd → R, select an initial guess x0 and an initial positive definite matrix A0. If xn and An have been computed, then
(a) Find tn > 0 so that xn+1 = xn + tnwn, with wn⊺ = −An · ∇f(xn)⊺, satisfies criteria (16) and (17).
(b) Update
\[ A_{n+1} = A_n + \frac{1}{\langle y_n,d_n\rangle}\,d_n\otimes d_n - \frac{1}{Q_{A_n}(y_n)}\,\big(A_ny_n^\intercal\big)\otimes\big(A_ny_n^\intercal\big) \tag{18} \]
with dn = xn+1 − xn = tnwn, and yn = ∇f(xn+1) − ∇f(xn).


Remark 3.9. Unlike in the general Broyden’s method, all the matrices
An constructed in (18) are positive definite.
4.2. The BFGS method. The DFP method was soon superseded by
the BFGS method, which is its dual (interchanging the roles of dn and y n ).
Different parts of this method were devised by Broyden, Fletcher, Goldfarb and Shanno independently in 1969. Their communications were sent as manuscripts between March and June of 1969, and reviewed between October 1969 and January 1970.
mathematicians derived similar formulas using different techniques (see [3],
[7], [10], [15] and [16])
To minimize f : Rd → R, select an initial guess x0 and an initial positive definite matrix A0. If xn and An have been computed, then
(a) Find tn > 0 so that xn+1 = xn + tnwn, with wn⊺ = −An⁻¹ · ∇f(xn)⊺, satisfies criteria (16) and (17).
(b) Update
\[ A_{n+1} = A_n + \frac{1}{\langle d_n,y_n\rangle}\,y_n\otimes y_n - \frac{1}{Q_{A_n}(d_n)}\,\big(A_nd_n^\intercal\big)\otimes\big(A_nd_n^\intercal\big) \tag{19} \]
with dn = xn+1 − xn = tnwn, and yn = ∇f(xn+1) − ∇f(xn).


Remark 3.10. Like in the DFP method, all the matrices An constructed
in (19) are positive definite. For a proof, see e.g. [13, Theorem 3.5.2].
Example 3.15. In Python there is an implementation of BFGS in the scipy.optimize libraries: the routine fmin_bfgs(). The following session illustrates how to use this method to approximate the minimum of the Rosenbrock function R1,1 with an initial guess (x0, y0) = (−2, 2).

import numpy as np, matplotlib.pyplot as plt
from scipy.optimize import fmin_bfgs

# Rosenbrock R1,1 function
def R(x):
    return (1.0-x[0])**2 + (x[1] - x[0]**2)**2

# its Jacobian/gradient ∇R1,1
def jacR(x):
    return np.array([-2.*(1.-x[0])-4.*x[0]*(x[1]-x[0]**2),
                     2.*(x[1]-x[0]**2)])

# Setup for diagrams.
x = np.linspace(-3,3)
y = np.linspace(-3,3)
X,Y = np.meshgrid(x,y)

We call fmin_bfgs with the function and its gradient (with the option
fprime=), the initial guess (−2, 2), and activate the option retall=True
that offers the complete list of iterations obtained by the algorithm.
>>> result = fmin_bfgs(R, [-2.,2.], fprime=jacR, retall=True)
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 12
Function evaluations: 15
Gradient evaluations: 15

>>> plt.figure(figsize=(8,8));
... plt.axes(aspect='equal');
... plt.contour(X, Y, (1-X)**2+(Y-X**2)**2, \
... colors='k', \
... levels=[0.2,0.8,1,1.4,1.78,2.23,4,5.4,8,13,32,64]);
... plt.plot(x, x**2, 'r--');
... plt.xlim(-3, 3);
... plt.ylim(-1, 3);
... plt.plot([p[0] for p in result[1]], [p[1] for p in result[1]], 'b.-');
... plt.title("Convergence to critical point:\nBFGS on Rosenbrock");
... plt.show()

Exercises
Problem 3.1 (Basic). [9, p.249 #1] The following sequences all converge to zero:
\[ v_n = n^{-10}, \quad w_n = 10^{-n}, \quad x_n = 10^{-n^2}, \quad y_n = n^{10}3^{-n}, \quad z_n = 10^{-3\cdot 2^n}. \]
Indicate the type of convergence (see Appendix A).
Problem 3.2 (Advanced). [9, p.249 #4] Give an example of a positive sequence {εn}n∈N converging to zero in such a way that \(\lim_n \varepsilon_{n+1}/\varepsilon_n^p = 0\) for some p > 1, but not converging to zero with any order q > p.

Figure 3.8. The BFGS method: Rosenbrock function

Problem 3.3 (Basic). Find an example of a function f : R → R (different from the function in Example 3.2) with a unique root at x = 0 for which the Newton-Raphson sequence is a loop no matter the initial guess x0 ≠ 0: x2n = x0, x2n+1 = −x0 for all n ∈ N. Bonus points if your function is trigonometric.
Problem 3.4 (Intermediate). [9, p.251 #14] Consider the equation
x = cos x.
(a) Show graphically that there exists a unique positive root x? . Indi-
cate approximately where it is located.
(b) Show that Newton’s method  applied to f (x) = x − cos x converges
for any initial guess x0 ∈ 0, π2 .


Problem 3.5 (Intermediate). [9, p.251 #16] Consider the equation
\[ \tan x + \lambda x = 0, \qquad (0 < \lambda < 1). \]
(a) Show graphically, as simply as possible, that there is exactly one root x⋆ in the interval (½π, π).
(b) Does Newton's method converge to the root x⋆ ∈ (½π, π) if the initial approximation is taken to be x0 = π? Justify your answer.
Problem 3.6 (Intermediate). [9, p.252 #17] Consider the equation
log2 x − x − 1 = 0, (x > 0).
(a) Graphical considerations suggest that there is exactly one positive
solution x? , and that 0 < x? < 1. Prove this.
(b) What is the largest positive 0 < x0 ≤ 1 such that Newton’s method
with f (x) = log2 x − x − 1 started at x0 converges to x? ?

Problem 3.7 (Advanced). [9, p.252 #18] Consider Kepler’s equation


x − a sin x − b = 0, 0 < |a| < 1, b∈R
where a, b are parameters.
(a) Show that for each a, b there is exactly one real solution x? =
x? (a, b) that satisfies
b − |a| ≤ x? (a, b) ≤ b + |a|
(b) Let m ∈ N satisfy mπ < b < (m+1)π. Show that Newton’s method
for f (x) = x − a sin x − b with starting value
(
(m + 1)π if (−1)m a > 0
x0 =
mπ otherwise
is guaranteed to converge (monotonically) to x? (a, b).
Problem 3.8 (Basic). Consider the two equivalent equations
x log x − 1 = 0, (20)
1
log x − = 0. (21)
x
(a) Show that there is exactly one positive root and find a rough interval
containing it.
(b) For both (20) and (21), determine the largest interval on which
Newton’s method converges.
Hint: Investigate the convexity of the functions involved.
Problem 3.9 (CAS). Design a process in desmos.com to test the search
for critical points given by the recursion formulas produced by Newton’s
method.
Problem 3.10 (CAS). In a computer language or CAS of your choice,
design a routine that gathers the following as input:
• the definition of a generic real-valued function f : R → R,
• the derivative f 0 of that function,
• an initial guess x0 ∈ R,
• a number N of steps,
and produces the first N + 1 terms of the Newton-Raphson sequence to
approximate a root of f .
Modify the previous routine to receive as input, instead of a number of
steps, a tolerance tol indicating the accuracy of the solution. For example,
if we require a root of the equation f (x) = 0 accurate to the first 16 correct
decimal places, we use tol = 1e-16.
Problem 3.11 (Basic). The purpose of this problem is the design of
Horner’s method to evaluate polynomials effectively. Given a polynomial
\[ p(x) = \sum_{k=0}^n a_kx^k = a_0 + a_1x + \cdots + a_nx^n, \]

Figure 3.9. Newton method in desmos.com

where a0 , a1 , . . . , an are real numbers, and given x0 ∈ R, we define the


Horner’s scheme {b0 , b1 , . . . , bn } to evaluate p(x0 ) as follows:
bn = an
bn−1 = an−1 + bn x0
bn−2 = an−2 + bn−1 x0
..
.
b0 = a0 + b1 x0
(a) Prove that b0 = p(x0 )
(b) Use Horner’s method to evaluate p(x) = 2x3 − 6x2 + 2x − 1 at
x = 3. Illustrate all steps, and count the number of basic operations
(addition, subtraction, multiplication, division) used.
(c) Employ the usual method of evaluation of polynomials to evaluate p(x) = 2x³ − 6x² + 2x − 1 at x = 3. Count the number of basic operations (note that x³ counts as two multiplications, x × x × x, for instance).
Problem 3.12 (CAS). In a computer language or CAS of your choice,
write a routine to apply Horner’s scheme to evaluate polynomials. Your
routine should gather the following inputs:
• A list of coefficients [a0, a1, ..., an] representing the polyno-
mial p(x) = a0 + a1 x + · · · + an xn .
• A value x0

The output of your routine should be p(x0 ).


Problem 3.13 (CAS). Use any of the routines that you wrote in Prob-
lem 3.10 to produce a table and a visual representation for the numerical
solution of the following equations, with the given initial guesses and steps.
(a) f (x) = sin x, with x0 = 0.5, 5 steps.
(b) f (x) = sin x, with x = 3, enough steps to obtain accurately the first
16 correct decimal places of π.
(c) f (x) = −1 + log x, with x = 2, enough steps to obtain accurately
the first 16 correct decimal places of e.
Problem 3.14 (CAS). The objective of this problem is to use Newton’s method to find an approximation to the golden ratio φ = (1 + √5)/2, accurate to the first 16 decimal places. Find first an appropriate polynomial p(x) with integer coefficients for which φ is a root. Employ any of the routines that you wrote in Problem 3.10 with a good initial guess to guarantee the required result.
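For instance, φ is the positive root of p(x) = x2 − x − 1; a bare-bones Newton iteration in Python (a sketch, not necessarily the routine from Problem 3.10) converges to it very quickly:

```python
def newton(f, fprime, x0, steps):
    """Plain Newton-Raphson iteration: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

# p(x) = x^2 - x - 1 has the golden ratio as its positive root
phi = newton(lambda x: x * x - x - 1, lambda x: 2 * x - 1, 2.0, 8)
```

Quadratic convergence roughly doubles the number of correct digits per step, so a handful of iterations already attains all 16 decimal places.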
Problem 3.15 (Intermediate—CAS). [8, lec3 newton mthd, 3.1] Con-
sider the function
f (x) = 9x − 4 log(x − 7).
We wish to study the behavior of Newton-Raphson to find approximations
to the critical points of this function.
(a) Find the domain D of f .
(b) Find the global minimum of f analytically.
(c) Compute an exact formula for the Newton-Raphson iterate xn+1
for an initial guess x0 ∈ D.
(d) Compute five iterations of the Newton-Raphson method starting at
each of the following initial guesses:
(a) x0 = 7.4.
(b) x0 = 7.2.
(c) x0 = 7.01.
(d) x0 = 7.8.
(e) x0 = 7.88.
(e) Prove that the Newton-Raphson method converges to the optimal
solution for any initial guess x0 ∈ (7, 7.8888).
(f) What is the behavior of the Newton-Raphson method if the initial
guess is not in the interval (7, 7.8888)?
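For part (d), applying Newton-Raphson to f ′ (x) = 9 − 4/(x − 7) gives, after a short simplification, the iterate xn+1 = 2xn − 7 − (9/4)(xn − 7)2 . A quick Python sketch (my own naming) to tabulate the iterations:

```python
def newton_critical(x0, steps):
    """Newton-Raphson for the critical points of f(x) = 9x - 4*log(x - 7):
    x - f'(x)/f''(x) simplifies to 2x - 7 - (9/4)*(x - 7)**2."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(2 * x - 7 - 2.25 * (x - 7) ** 2)
    return xs
```

Starting from x0 = 7.4, the sequence converges quadratically to the global minimizer x = 7 + 4/9.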
Problem 3.16 (Intermediate—CAS). [8, lec3 newton mthd, 3.2] Con-
sider the function
f (x) = 6x − 4 log(x − 2) − 3 log(25 − x).
We wish to study the behavior of Newton-Raphson to find approximations
to the critical points of this function.
(a) Find the domain D of f .
EXERCISES 59

(b) Find the global minimum of f analytically.


(c) Compute an exact formula for the Newton-Raphson iterate xn+1
for an initial guess x0 ∈ D.
(d) Compute five iterations of the Newton-Raphson method starting at
each of the following initial guesses:
(a) x0 = 2.6.
(b) x0 = 2.7.
(c) x0 = 2.4.
(d) x0 = 2.8.
(e) x0 = 3.
(e) Prove that the Newton-Raphson method converges to the optimal
solution for any initial guess x0 ∈ (2, 3.05).
(f) What is the behavior of the Newton-Raphson method if the initial
guess is not in the interval (2, 3.05)?
Problem 3.17 (Basic). Approximate the solution of the following sys-
tem by computing two steps of Newton-Raphson’s method for an appropri-
ate function g : R3 → R3 , and initial guess x0 = (1, 0, 1).
3 = x2 + y 2 + z 2
1 = x2 + y 2 − z
1 = x + y + z
Problem 3.18 (Intermediate). [1, p.91 #1.4.1] The purpose of this exercise is to show that Newton’s method is unaffected by linear scaling
of the variables. Consider a linear invertible transformation of variables
x| = A · y | . Write Newton’s method in the space of the variables y and
show that it generates the sequence y n | = A−1 · xn | , where {xn }n∈N is the
sequence generated by Newton’s method in the space of variables x.
Problem 3.19 (Basic). Let A be a square matrix. An LU-decomposition
is a factorization of A = L·U into a lower triangular matrix L and an upper
triangular matrix U , both of which have non-zero entries in their diagonals.
For example, the general case for 3 × 3 square matrices:
[ a11  a12  a13 ]   [ `11  0    0   ]   [ u11  u12  u13 ]
[ a21  a22  a23 ] = [ `21  `22  0   ] · [ 0    u22  u23 ]
[ a31  a32  a33 ]   [ `31  `32  `33 ]   [ 0    0    u33 ]
(the first factor on the right is L, the second is U ).
(a) Find an LU-decomposition of the following matrix
[ 4  3 ]
[ 6  3 ]
that satisfies that all diagonal entries of L are ones.
(b) Find an example of a 2 × 2 square matrix for which there is not any
possible LU-decomposition.
Problem 3.20 (Advanced). Prove the following statements:
(a) A square matrix A of size d × d admits an LU-decomposition if and only if the leading principal minors are non-zero: ∆k ≠ 0 for 1 ≤ k ≤ d.
(b) If A is a symmetric positive definite matrix, then it is possible to
find an LU-decomposition where U = L| : A = L · L| . In this case,
this factorization is also called a Cholesky decomposition.
Problem 3.21 (CAS). In a computer language or CAS of your choice,
design a routine that solves a linear system A x| = c| :
[ a11  · · ·  a1d ] [ x1 ]   [ c1 ]
[  ..   ..    ..  ] [ .. ] = [ .. ]
[ ad1  · · ·  add ] [ xd ]   [ cd ]
by performing first an LU-decomposition A = L·U (provided this is possible!)
and manipulating the resulting equation to solve instead the two (faster)
systems:
(a) Find y that solves the system L · y | = c| by Gaussian elimination.
(b) Find x that solves the system U · x| = y | by Gaussian elimination.
You may design a routine that computes LU-decompositions, or you may
use a built-in routine for that purpose.
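A sketch of such a routine in Python, with a hand-rolled (no pivoting) Doolittle decomposition standing in for a built-in one; it assumes the leading principal minors are non-zero, as in Problem 3.20(a):

```python
import numpy as np

def lu_decompose(A):
    """Doolittle LU-decomposition A = L @ U, without pivoting."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for j in range(n - 1):
        for i in range(j + 1, n):
            L[i, j] = U[i, j] / U[j, j]
            U[i, :] -= L[i, j] * U[j, :]
    return L, U

def lu_solve(A, c):
    """Solve A x = c via L y = c (forward) and U x = y (backward)."""
    L, U = lu_decompose(A)
    n = len(c)
    y = np.zeros(n)
    for i in range(n):                      # forward substitution
        y[i] = c[i] - L[i, :i] @ y[:i]
    x = np.zeros(n)
    for i in reversed(range(n)):            # back substitution
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```

Each triangular solve costs O(d2 ) operations, which is the point of factoring A once and reusing L and U .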
Problem 3.22 (CAS). In a computer language or CAS of your choice,
design a routine that gathers the following as input:
• the definition of a generic real-valued function f : Rd → R,
• the gradient ∇f of that function,
• an initial guess x0 ∈ Rd ,
• a number N of steps,
and produces the first N + 1 terms of the Newton-Raphson sequence to
approximate a root of f .
Problem 3.23 (Advanced). Prove the Local convergence for the Secant
method, Theorem 3.3.
Problem 3.24 (Basic). Approximate the solution for the system in
Problem 3.17 by computing the first two iterations of a Broyden method
with A0 = ∇g(x0 ) for an appropriate function g : R3 → R3 and initial guess
x0 = (1, 0, 1).
Problem 3.25 (Basic). Compute the first two iterations of Broyden
method with initial guess (1, 4) to search for the critical points of the function
f (x, y) = 2x2 + y 2 − xy
 
(a) Using A0 = I2 , the 2 × 2 identity matrix.
(b) Using A0 = Hessf (1, 4).
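A sketch of the method in Python for Problems 3.24 and 3.25, assuming the standard "good Broyden" rank-one update An+1 = An + ((∆g − An ∆x) ∆x| )/(∆x| ∆x); the names are my own:

```python
import numpy as np

def broyden(g, x0, A0, steps, tol=1e-12):
    """Broyden's method for g(x) = 0: each step solves A_n s = -g(x_n),
    then corrects A by a rank-one update along the step s."""
    x, A = np.asarray(x0, float), np.asarray(A0, float)
    for _ in range(steps):
        gx = g(x)
        if np.linalg.norm(gx) < tol:
            break
        s = np.linalg.solve(A, -gx)
        y = g(x + s) - gx                        # observed change in g
        A = A + np.outer(y - A @ s, s) / (s @ s)  # rank-one correction
        x = x + s
    return x

# critical points of f(x, y) = 2x^2 + y^2 - xy: roots of the gradient
grad_f = lambda v: np.array([4 * v[0] - v[1], 2 * v[1] - v[0]])
root = broyden(grad_f, [1.0, 4.0], np.eye(2), 10)
```

With A0 = I2 and initial guess (1, 4), the iterates reach the only critical point, the origin, in a handful of steps.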
Problem 3.26 (Advanced). Prove Theorems 3.5, and 3.6.
Problem 3.27 (Advanced). We want to prove estimate (14) in the proof of Theorem 3.9 on page 50. This follows directly from the equivalent statement below, which is easier to handle. Prove the following result:

Kantorovich Estimate. Given a positive definite symmetric matrix Q of size d × d, consider the quadratic function p(x) = 1/2 QQ (x). Assume 0 < λ1 ≤ λ2 ≤ · · · ≤ λd are the eigenvalues of Q. For any sequence {xn }n∈N of steepest descent for p, we have the following estimate involving the directions of steepest descent {v n }n∈N :

kv n k4 / (QQ (v n ) · QQ−1 (v n )) ≥ 4λ1 λd / (λ1 + λd )2

Problem 3.28 (Basic). Approximate the solution for the system in


Problem 3.17 by computing the first two iterations of a Steepest descent
method for an appropriate function g : R3 → R and initial guess x0 =
(1, 0, 1).
Problem 3.29 (Basic). Compute the first two iterations of the method
of Steepest descent with initial guess (1, 4) to search for the critical points
of the function f (x, y) = 2x2 + y 2 − xy.
Problem 3.30. [8, lec5 steep desce, 8.3] Consider the quadratic polynomial
p(x, y) = 1/2 QQ (x, y) + hD, [x, y]i + 13,
with
D = [4, −15],    Q = [ 10  −9 ]
                     [ −9  10 ].
(a) Find the global minimum value of p, and its location.
(b) Compute the eigenvalues of Q. Is Q positive definite?
(c) What is the worst-case scenario rate of convergence of sequences of
steepest descent for this function?
(d) Compute sequences of steepest descent for this function with the
initial guesses below. Make sure to report a table similar to the one
in Example 3.14.
• (0, 0)
• (−0.4, 0)
• (10, 0)
• (11, 0)
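For part (d), note that a quadratic objective admits an exact line search in closed form: along the direction −g, with g| = Qx| + D| the gradient of p, the minimizing step is t = kgk2 /QQ (g). A Python sketch using this standard formula (it produces the iterates; formatting a table as in Example 3.14 is left out):

```python
import numpy as np

Q = np.array([[10.0, -9.0], [-9.0, 10.0]])
D = np.array([4.0, -15.0])

def steepest_descent(x0, steps, tol=1e-14):
    """Steepest descent with exact line search for
    p(x, y) = 1/2 [x, y] Q [x, y]^T + D . [x, y] + 13."""
    x = np.asarray(x0, float)
    for _ in range(steps):
        g = Q @ x + D                 # gradient of p at x
        if g @ g < tol:
            break
        t = (g @ g) / (g @ Q @ g)     # exact minimizer along -g
        x = x - t * g
    return x

x_min = steepest_descent([0.0, 0.0], 300)
```

The minimizer solves Qx| = −D| , i.e. (x, y) = (5, 6); since the eigenvalues of Q are 1 and 19, the worst-case contraction factor per step is ((19 − 1)/(19 + 1))2 = 0.81, which is what part (c) asks about.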
Problem 3.31. [8, lec5 steep desce, 8.4] Consider the quadratic polynomial
p(x, y, z) = 1/2 QQ (x, y, z) + hD, [x, y, z]i,
with
D = [12, −47, −8],    Q = [ 10   −18  2  ]
                          [ −18  40   −1 ]
                          [ 2    −1   3  ].
(a) Find the global minimum value of p, and its location.
(b) Compute the eigenvalues of Q. Is Q positive definite?
(c) What is the worst-case scenario rate of convergence of sequences of
steepest descent for this function?
(d) Compute sequences of steepest descent for this function with the
initial guesses below. Make sure to report a table similar to the one
in Example 3.14.
• (0, 0, 0)
• (15.09, 7.66, −6.56)
• (11.77, 6.42, −4.28)
• (4.46, 2.25, 1.85)
CHAPTER 4

Existence and Characterization of Extrema for Constrained Optimization

Let us begin by reviewing the concept of constrained optimization, and some associated notation and terminology. The objective is the search for extrema of a function f : D → R, where the input values x belong to an open subset D ⊆ Rd and satisfy a finite set of constraints of the form
g1 (x) ≤ 0, g2 (x) ≤ 0, . . . , gm (x) ≤ 0,
h1 (x) = 0, h2 (x) = 0, . . . , h` (x) = 0,
for real-valued functions gk : Rd → R (1 ≤ k ≤ m) and hk : Rd → R (1 ≤ k ≤ `).
For simplicity, we write instead

(P )    min x∈D f (x)
        g1 (x) ≤ 0, . . . , gm (x) ≤ 0        (22)
        h1 (x) = 0, . . . , h` (x) = 0

or better, if we set
S = {x ∈ D : g1 (x) ≤ 0, . . . , gm (x) ≤ 0, h1 (x) = 0, . . . , h` (x) = 0},
we may simply write
(P ) = min x∈S f.

We refer to it as the program (P ). The function f is called the objective


function of (P ). We refer to the functions gk as the inequality constraints.
The functions hk are called the equality constraints.
A point x ∈ D that satisfies all the constraints of the program (P ) is said
to be feasible. The set of all feasible points is called the feasibility region
of (P ). If the feasibility region is non-empty, we say that the program
(P ) is consistent. If a feasible point satisfies gk (x) < 0 for all inequality
constraints, we call it a Slater point. Consistent programs (P ) that have
Slater points are said to be super-consistent.
If the objective function f and all constraints gk , hk of a program (P )
are linear functions, we denote it by (LP ) and refer to it as a linear program.
If the objective function, all constraints gk , hk and the set D are convex, we
call (P ) a convex program.
Example 4.1. Let f (x, y) = x4 + y 4 , and consider the program (P ) = min x∈S f (x, y), where S = {(x, y) ∈ R2 : x2 ≤ 1, y 2 ≤ 1, ex+y ≤ 1}. The inequality constraints are g1 (x, y) = x2 − 1, g2 (x, y) = y 2 − 1 and g3 (x, y) = ex+y − 1 (although you may choose simpler equivalent expressions). The feasibility region is thus a triangular region (see Figure 4.1):
S = {(x, y) ∈ R2 : |x| ≤ 1, |y| ≤ 1, x + y ≤ 0} ≠ ∅.

Figure 4.1. Can you tell what are the global maximum and minimum values of f in S?

This is a super-consistent convex program, since any interior point of the triangle S is a Slater point for (P ), and all relevant functions are convex.
Definition. Given a consistent program (P ) as defined in (22), for each
feasible point x ∈ S, we define:
(a) The cone of improving directions of f at x, as
F0 (x) = {v ∈ Rd : kvk = 1, h∇f (x), vi < 0}
(b) The set of indices of the binding inequality constraints for x, as
I(x) = {k ∈ {1, . . . , m} : gk (x) = 0}.
(c) The cone of inward pointing directions for the binding constraints
at x, as
G0 (x) = {v ∈ Rd : kvk = 1, h∇gk (x), vi < 0 for all k ∈ I(x)}
(d) The set of tangent directions for the equality constraints at x, as
H0 (x) = {v ∈ Rd : h∇hk (x), vi = 0, 1 ≤ k ≤ `}
Example 4.2. For the program (P ) in Example 4.1, consider the feasible
points (0, 0), (−1, −1) and (0, −1/2). Since ∇f (x, y) = [4x3 , 4y 3 ], we have
the following cones of improving directions:
F0 (0, 0) = ∅,
F0 (−1, −1) = {v = (v1 , v2 ) ∈ R2 : kvk = 1, v1 + v2 > 0},
F0 (0, −1/2) = {v = (v1 , v2 ) ∈ R2 : kvk = 1, v2 > 0}
The indices of binding inequality constraints are
I(0, 0) = {3}, I(−1, −1) = {1, 2}, I(0, −1/2) = ∅,
and therefore, the cones of inward pointing directions for the binding con-
straints are
G0 (0, 0) = {v = (v1 , v2 ) ∈ R2 : kvk = 1, v1 + v2 < 0},
G0 (−1, −1) = {v = (v1 , v2 ) ∈ R2 : kvk = 1, v1 > 0, v2 > 0},
G0 (0, −1/2) = {v ∈ R2 : kvk = 1} (vacuously: no inequality constraint is binding at this point).
Since there are no equality constraints, we do not have any sets of tangent
directions.

Figure 4.2. Cones for (0, 0), (−1, −1) and (0, −1/2).

In the following sections we are going to discuss necessary and sufficient conditions for the existence of (strict) global minima for any consistent program (P ). We are going to focus exclusively on the main results without showing their proofs (if interested in those proofs, consult [8, lec6 constr opt]).
1. Necessary Conditions
We begin with two results that focus on the structure of the equality
constraints.
Theorem 4.1 (Geometric Necessary Condition for Linear Equality Con-
straints). If (P ) is a consistent program with linear equality constraints
hk (x) = hak , xi + bk (ak ∈ Rd , bk ∈ R) for all 1 ≤ k ≤ `, then for all
feasible local minima x ∈ S,
F0 (x) ∩ G0 (x) ∩ H0 (x) = ∅.
Theorem 4.2 (Geometric Necessary Condition for Linearly Independent
Equality Constraints). If x ∈ S is a feasible local minimum for the consistent
program (P ), and the gradient vectors {∇hk (x) : 1 ≤ k ≤ `} are linearly
independent, then F0 (x) ∩ G0 (x) ∩ H0 (x) = ∅.

As a consequence, an algebraic version of this geometric necessary condition gives the following result.
Theorem 4.3 (Fritz John Necessary Conditions). If x ∈ S is a feasible
local minimum of the consistent program (P ), then there exist λk ≥ 0 for
0 ≤ k ≤ m, and µ1 , . . . , µ` ∈ R so that
(a) [λ0 , λ1 , . . . , λm , µ1 . . . , µ` ] 6= 0,
(b) λk gk (x) = 0 for all 1 ≤ k ≤ m.
(c) λ0 ∇f (x) + λ1 ∇g1 (x) + · · · + λm ∇gm (x) + µ1 ∇h1 (x) + · · · + µ` ∇h` (x) = 0.

Example 4.3. Continuing with example 4.1, let’s check if the point
(0, 0) is a candidate for optimal solution of this program. Let’s use Theorem
4.3 to verify this claim:
∇f (x, y) = [4x3 , 4y 3 ] ∇f (0, 0) = [0, 0],
∇g1 (x, y) = [2x, 0] ∇g1 (0, 0) = [0, 0],
∇g2 (x, y) = [0, 2y] ∇g2 (0, 0) = [0, 0],
∇g3 (x, y) = [ex+y , ex+y ]        ∇g3 (0, 0) = [1, 1].
Notice how the gradients line up nicely—can we find λk ≥ 0 so that:
(a) [λ0 , λ1 , λ2 , λ3 ] 6= [0, 0, 0, 0],
(b) λ1 = λ2 = 0 (since λ1 g1 (0, 0) = −λ1 and λ2 g2 (0, 0) = −λ2 ), and
(c) the following linear combination is equal to [0, 0]
λ0 ∇f (0, 0) + λ1 ∇g1 (0, 0) + λ2 ∇g2 (0, 0) + λ3 ∇g3 (0, 0) = [0, 0],
λ0 [0, 0] + λ1 [0, 0] + λ2 [0, 0] + λ3 [1, 1] = [0, 0],
We may select, for instance λ0 = 1, λ1 = λ2 = λ3 = 0, which proves that
the point (0, 0) is indeed a candidate for optimal solution of (P ).
Further properties of the involved functions provide us with simpler sets of conditions.
Theorem 4.4 (Karush-Kuhn-Tucker Necessary Conditions). If x ∈ S
is a feasible local minimum of the consistent program (P ) for which all the
vectors {∇hk (x), ∇gj (x) : 1 ≤ k ≤ `, j ∈ I(x)} are linearly independent,
then there exist λk ≥ 0 for 1 ≤ k ≤ m, and µ1 , . . . , µ` ∈ R so that
(a) λk gk (x) = 0 for all 1 ≤ k ≤ m.
(b) ∇f (x) + λ1 ∇g1 (x) + · · · + λm ∇gm (x) + µ1 ∇h1 (x) + · · · + µ` ∇h` (x) = 0.

Remark 4.1. The conditions (a) and (b) of Theorem 4.4 are called the
KKT conditions of the program (P ) in the literature. The values λk , µk are
called multipliers.
Example 4.4. Set f (x, y) = (x − 12)2 + (y + 6)2 . Consider the program
(P ) designed to find the global minimum of this function on the set S =
{(x, y) ∈ R2 : x2 + 3x + y 2 − 4.5y ≤ 6.5, (x − 9)2 + y 2 ≤ 64, 8x + 4y = 20}. We
want to prove that the point (2, 1) is a good candidate for optimal solution
of (P ). The point (2, 1) is feasible. To see this, set
g1 (x, y) = x2 + 3x + y 2 − 4.5y − 6.5,
g2 (x, y) = (x − 9)2 + y 2 − 64,
h1 (x, y) = 8x + 4y − 20,
(or simpler equivalent constraints), and notice that
g1 (2, 1) = 0, g2 (2, 1) = −14, h1 (2, 1) = 0.
We have concluded that (P ) is consistent. Notice that I(2, 1) = {1}, since
g1 (2, 1) = 0, g2 (2, 1) 6= 0. Further,
∇f (x, y) = [2(x − 12), 2(y + 6)] ∇f (2, 1) = [−20, 14],
∇g1 (x, y) = [2x + 3, 2y − 4.5] ∇g1 (2, 1) = [7, −2.5],
∇g2 (x, y) = [2(x − 9), 2y] ∇g2 (2, 1) = [−14, 2],
∇h1 (x, y) = [8, 4] ∇h1 (2, 1) = [8, 4],
The vectors ∇g1 (2, 1) = [7, −2.5] and ∇h1 (2, 1) = [8, 4] are linearly indepen-
dent. Therefore, to verify that (2, 1) is a candidate for optimal solution of
(P ), we may now use Theorem 4.4. The KKT conditions read as follows:
we are looking for λk ≥ 0, µ1 ∈ R so that λk gk (2, 1) = 0 (k = 1, 2) and
∇f (2, 1) + λ1 ∇g1 (2, 1) + λ2 ∇g2 (2, 1) + µ1 ∇h1 (2, 1) = [0, 0],
Let’s address the first condition: Since g1 (2, 1) = 0 and g2 (2, 1) = −14 < 0,
it must be λ2 = 0. The second condition turns then into the equation
[−20, 14] + λ1 [7, −2.5] + 0 · [−14, 2] + µ1 [8, 4] = [0, 0]
Figure 4.3. Feasibility region for (P ) in example 4.4
or equivalently
[ 7     8 ] [ λ1 ]   [ 20  ]
[ −2.5  4 ] [ µ1 ] = [ −14 ],
which gives λ1 = 4, and µ1 = −1. This proves that the point (2, 1) is indeed a good candidate for the optimal solution of (P ).
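The little 2 × 2 system above is easy to check numerically; a quick sketch (gradients copied from the computation in this example):

```python
import numpy as np

grad_f = np.array([-20.0, 14.0])    # grad f(2, 1)
grad_g1 = np.array([7.0, -2.5])     # grad g1(2, 1)
grad_h1 = np.array([8.0, 4.0])      # grad h1(2, 1)

# solve grad_f + lam1*grad_g1 + mu1*grad_h1 = 0 for the multipliers
lam1, mu1 = np.linalg.solve(np.column_stack([grad_g1, grad_h1]), -grad_f)
```

The solve returns λ1 = 4 ≥ 0 and µ1 = −1, confirming the KKT conditions at (2, 1).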

There are other instances in which the KKT conditions can be used
instead of those in the Fritz John Theorem.
Theorem 4.5 (Slater Necessary Condition). Suppose that the inequality
constraints gk of a super-consistent program (P ) are pseudo-convex (1 ≤
k ≤ m), the equality constraints hk are linear (1 ≤ k ≤ `), and the vectors
∇hk (x) are linearly independent at a feasible point x. Then the KKT con-
ditions (a) and (b) of Theorem 4.4 are necessary to characterize x as an
optimal solution of (P ).
Theorem 4.6. If all constraints of a consistent program (P ) are lin-
ear, then the KKT conditions (a) and (b) of Theorem 4.4 are necessary to
characterize optimal solutions of (P ).

2. Sufficient Conditions
It all boils down to a single result.
Theorem 4.7 (KKT Sufficient Conditions). Let x ∈ S be a feasible
point of the consistent program (P ) for which there are multipliers λk ≥ 0
(1 ≤ k ≤ m) and µk ∈ R (1 ≤ k ≤ `) satisfying the conditions (a) and (b) of
Theorem 4.4. If f is pseudo-convex, gk is quasi-convex for all 1 ≤ k ≤ m,
and hk is linear for all 1 ≤ k ≤ `, then x is a global optimal solution of (P ).
Example 4.5. We saw that the point (0, 0) satisfies the KKT condi-
tions for the super-consistent convex program (P ) in Example 4.1. As a
consequence of Theorems 4.3 and 4.7, this point must be the optimal global
minimum of (P ).
We also saw that the point (2, 1) satisfies the KKT conditions for the
program (P ) in Example 4.4. It is not hard to see that this program is
super-consistent, f is pseudo-convex, g1 and g2 are quasi-convex, and h1
is linear. By virtue of Theorems 4.4 and 4.7, the point (2, 1) must be the
optimal solution of (P ).

Key Examples
In this section we are going to use the KKT conditions to address the characterization of optimal solutions of generic programs.
Example 4.6. Let Q be a symmetric d × d square matrix. Consider
the associated quadratic form QQ (x). We wish to find the global maximum of this function over the unit ball Bd = {x ∈ Rd : kxk ≤ 1}.
An equivalent program (P ) is thus defined with f (x) = −QQ (x) as
its objective function, and a single inequality constraint g1 (x) = kxk2 −
1. This is trivially a super-consistent program with a convex inequality
constraint. Checking the KKT conditions to look for the optimal solution
is thus justified under the hypothesis of Theorem 4.5. Notice that
∇f (x)| = −2Qx| ,
∇g1 (x) = 2x;
therefore, the KKT conditions request the search for x ∈ Bd and λ ≥ 0 so that λ(1 − kxk2 ) = 0 and −2Qx| + 2λx| = 0| .
It must be kxk = 1 by the first condition. The second condition states
that x must be an eigenvector of Q with eigenvalue λ: Qx| = λx| . The
value of the objective function in this case is f (x) = −QQ (x) = −xQx| =
−λkxk2 = −λ. In order to obtain the requested minimum value of f (assuming it is different from zero), λ has to be the largest non-negative eigenvalue of Q, and x its corresponding normalized eigenvector.
 
Example 4.7. A simple case of the previous example: Set
Q = [ 1  3 ]
    [ 3  1 ].
The eigenvalues of Q are −2 and 4, and therefore the maximum of the associated quadratic form QQ (x, y) = x2 + y 2 + 6xy over the ball x2 + y 2 ≤ 1 happens at the (normalized) solution of the system
[ 1  3 ] [ x ]     [ x ]
[ 3  1 ] [ y ] = 4 [ y ].
This gives the points ±(√2/2, √2/2).
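This computation is easy to reproduce numerically; a sketch with numpy (np.linalg.eigh returns the eigenvalues of a symmetric matrix in ascending order):

```python
import numpy as np

Q = np.array([[1.0, 3.0], [3.0, 1.0]])
w, V = np.linalg.eigh(Q)       # eigenvalues [-2, 4], orthonormal eigenvectors
lam, x = w[-1], V[:, -1]       # largest eigenvalue and a unit eigenvector for it

value = x @ Q @ x              # maximum of Q_Q over the unit ball: equals lam
```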
Example 4.8. Let x0 ∈ Rd \ {0} and r > 0. Find the point on the sphere of radius r, Sd = {x ∈ Rd : kxk = r}, that is closest to x0 .
We may write a super-consistent convex program to solve this optimiza-


tion problem by using f (x) = kx − x0 k2 as objective function, and one
equality constraint h1 (x) = kxk2 − r2 . With this choice, we are well within
the hypothesis of Theorems 4.4 and 4.7. The KKT conditions request µ ∈ R
and a point x ∈ Rd with kxk = r so that
∇f (x) + µ ∇h1 (x) = 0.
This gives 2(x − x0 ) + 2µx = 0, or equivalently, (1 + µ)x = x0 .
It must be µ = −1 + kx0 k/r. We have then two cases:
(a) If kx0 k = r, then µ = 0 and x = x0 is the only solution.
(b) If kx0 k ≠ r (the point x0 is not on the sphere), then x = r x0 /kx0 k.
Example 4.9. Find the minimum value of a (real-valued) linear map
over the unit ball.
Given a ∈ Rd \{0}, consider the corresponding linear map L(x) = ha, xi.
We wish to find the minimum value of L over all points in the unit ball
Bd = {x ∈ Rd : kxk ≤ 1}.
An equivalent program (P ) is defined with L as its objective function
and g1 (x) = kxk2 − 1. This is a super-consistent convex program. Checking
the KKT conditions is justified under the hypothesis of Theorem 4.4. Notice
that
∇L(x) = a,
∇g1 (x) = 2x;
therefore, the KKT conditions request the search for x ∈ Bd and λ ≥ 0 so that λ(kxk2 − 1) = 0 and a + 2λx = 0.
This first condition imposes kxk = 1. The second condition requires x = −a/(2λ). These two put together imply that it must be λ = kak/2, and hence x = −a/kak.

Exercises
Problem 4.1 (Basic). Consider the following problem: Find the global
minimum of the function f (x, y) = 6(x − 10)2 + 4(y − 12.5)2 on the set
S = {(x, y) ∈ R2 : x2 + (y − 5)2 ≤ 50, x2 + 3y 2 ≤ 200, (x − 6)2 + y 2 ≤ 37}.
(a) Write the statement of this problem as a program with the nota-
tion from equation 22. Label the objective function, as well as the
inequality constraints accordingly.
(b) Is the objective function f pseudo-convex? Why or why not?
(c) Are the inequality constraints quasi-convex? Why or why not?
(d) Sketch the feasibility region. Label all relevant objects involved.
(e) Is the point (7, 6) feasible? Why or why not?
(f) Employ Theorem 4.4 to write a necessary condition for optimality
and verify that it is satisfied by the point (7, 6).
(g) Employ Theorem 4.7 to decide whether this point is an optimal
solution of (P ).
Problem 4.2 (Basic). [8, lec6 constr opt, 10] Let f (x, y) = (x − 4)2 +
(y − 6)2 . Consider the program (P ) to find the global minimum of f on the
set S = {(x, y) ∈ R2 : y − x2 ≥ 0, y ≤ 4}.
(a) Write the statement of this problem as a program with the nota-
tion from equation 22. Label the objective function, as well as the
inequality constraints accordingly.
(b) Is the objective function f pseudo-convex? Why or why not?
(c) Are the inequality constraints quasi-convex? Why or why not?
(d) Sketch the feasibility region. Label all relevant objects involved.
(e) Is the point (2, 4) feasible? Why or why not?
(f) Employ Theorem 4.4 to write a necessary condition for optimality
and verify that it is satisfied by the point (2, 4).
(g) Employ Theorem 4.7 to decide whether this point is an optimal
solution of (P ).
Problem 4.3 (Basic). [8, lec6 constr opt, 12] Let f (x, y) = (x−9/4)2 +
(y − 2)2 . Consider the program (P ) to find the global minimum of f on the
set S = {(x, y) ∈ R2 : y − x2 ≥ 0, x + y ≤ 6, x ≥ 0, y ≥ 0}.
(a) Write down the KKT optimality conditions and verify that these
conditions are satisfied at the point (3/2, 9/4).
(b) Present a graphical interpretation of the KKT conditions at (3/2, 9/4).
(c) Show that this point is the optimal solution to the program.
Problem 4.4 (Basic). Find examples of non-diagonal 3 × 3 symmetric
square matrices with integer-valued eigenvalues of each type below:
• A1 positive definite,
• A2 positive semi-definite,
• A3 negative definite,
• A4 negative semi-definite, and
• A5 indefinite.
For each of these matrices, find the maximum of their corresponding qua-
dratic form QAk (x, y, z) over the unit ball B3 = {(x, y, z) ∈ R3 : x2 + y 2 +
z 2 ≤ 1}.
Problem 4.5 (Advanced). [8, lec6 constr opt, 19]
Arrow-Hurwicz-Uzawa constraint qualification: Consider the problem to minimize f (x) subject to x ∈ X (X being an open set X ⊆ Rd ) and a set of continuous inequality constraints gk (x) ≤ 0 (1 ≤ k ≤ m). Let J = {k ∈ {1, . . . , m} : gk is pseudo-concave}. Prove that if the set
{x ∈ X : h∇gk (x? ), xi ≤ 0 (k ∈ J ); h∇gk (x? ), xi < 0 (k ∈ I(x? ) \ J )}
is nonempty, then x? satisfies the KKT conditions.


CHAPTER 5

Numerical Approximation for Constrained Optimization

1. Projection Methods for Linear Equality constrained programs


Consider the minimization of a function f : Rd → R subject to ` lin-
ear constraints hk (x) = 0, where hk (x) = hak , xi − bk for vectors ak =
[ak1 , . . . , akd ] ∈ Rd , and real values bk ∈ R (1 ≤ k ≤ `). If we set
 
A = [ a11  · · ·  a1d ]
    [  ..   ..    ..  ]
    [ a`1  · · ·  a`d ]

and b = [b1 , . . . , b` ], we may write these linear constraints as Ax| = b| .
The corresponding program (P) has f as objective function and linear equality constraints hk : Rd → R as defined above. If this program happens to be consistent (it depends on A and b), then Theorem 4.6 allows us to use KKT to find the optimal solutions.
The KKT conditions read as follows: Find x ∈ Rd satisfying Ax| = b| , and µk ∈ R for 1 ≤ k ≤ ` so that
∇f (x) + µ1 a1 + · · · + µ` a` = 0.        (23)

Example 5.1. We would like to find the minimum value of the function
f (x, y, z) = x2 + y 2 + z 2 over the line at the intersection of the planes
x + y + z = 0 and x − y + 2z = 3.
Among the methods we use in a course of Vector Calculus, one would
start by computing a parameterization of the line first. For instance, by
forcing z = 0 and solving the system formed by the two planes (with this
restriction), we find that the point (3/2, −3/2, 0) belongs in this line. The
cross product of the normal vectors to the planes is the direction of the line:

                         | i   j   k |
[1, 1, 1] × [1, −1, 2] = | 1   1   1 | = [3, −1, −2].
                         | 1  −1   2 |
We have then the line with equation (3/2 + 3t, −3/2 − t, −2t), t ∈ R. A restriction of f on this line gives
ϕ(t) = f (3/2 + 3t, −3/2 − t, −2t)
     = (3/2 + 3t)2 + (−3/2 − t)2 + (−2t)2
     = 9/4 + 9t2 + 9t + 9/4 + t2 + 3t + 4t2
     = 14t2 + 12t + 9/2.
The minimum of this function occurs at t = −3/7. This yields the point
(3/2 − 3 · 3/7, −3/2 + 3/7, 2 · 3/7) = (3/14, −15/14, 6/7).

The method we have explained in Chapter 4 starts by collecting the constraints first, and claiming the use of the KKT conditions. In this case, Theorem 4.6 guarantees we may use this technique. Set
h1 (x, y, z) = x + y + z = h[1, 1, 1], [x, y, z]i,
h2 (x, y, z) = x − y + 2z − 3 = h[1, −1, 2], [x, y, z]i − 3.
The corresponding KKT conditions request feasible points (x, y, z) ∈ R3 and two real values µ1 , µ2 ∈ R that satisfy:
0 = x + y + z,
3 = x − y + 2z,
[0, 0, 0] = [2x, 2y, 2z] + µ1 [1, 1, 1] + µ2 [1, −1, 2].
This gives the following system:
0 = x + y + z = −1/2 (µ1 + µ2 ) − 1/2 (µ1 − µ2 ) − 1/2 (µ1 + 2µ2 ),
3 = x − y + 2z = −1/2 (µ1 + µ2 ) + 1/2 (µ1 − µ2 ) − (µ1 + 2µ2 ),
which reduces to
[ 3/2  1 ] [ µ1 ]   [ 0  ]
[ 1    3 ] [ µ2 ] = [ −3 ].
There is a unique solution µ1 = 6/7, µ2 = −9/7, which yields the point (3/14, −15/14, 6/7).

Notice how, in either case above, we ended up solving the optimization problem symbolically. In this section we are going to adapt the techniques we learned in Chapter 3 to approximate the solution of this kind of program numerically. As we have done in the past, it all starts by selecting a feasible initial guess, and solving a related program, associated to a simpler approximation instead. We explore the following options:
• Using linear approximations (this leads to the steepest descent method).
• Using quadratic approximations (this leads to the Newton method).

1.1. Steepest Descent. Given a feasible initial guess x0 ∈ Rd (satisfying Ax0 | = b| ), we proceed to search for a next iteration x̄0 within the feasibility region that minimizes the linear approximant L0 (x) = f (x0 ) + ∇f (x0 )(x − x0 )| . We usually further impose a maximum distance from x0 to x̄0 ; that is, we force for example kx̄0 − x0 k ≤ R0 for some R0 > 0. This gives the Direction Finding Program (DF P )
min x∈S0 L0 (x),    S0 = {x ∈ Rd : Ax| = b| , kx − x0 k ≤ R0 }.
If we set v = x − x0 , the program translates into simpler terms:
min v∈S00 ∇f (x0 )v | ,    S00 = {v ∈ Rd : Av | = 0| , kvk ≤ R0 }.

Note this is a convex program (it is linear, as a matter of fact) with a Slater
point at 0. The KKT conditions for the (DF P ) program are therefore
necessary and sufficient for optimality.
Notice how much simpler these conditions are: Set g1 (v) = kvk2 − R02 , hk (v) = hak , vi for 1 ≤ k ≤ `. Find v ∈ Rd with Av | = 0| , λ1 ≥ 0, and µk ∈ R (1 ≤ k ≤ `) so that:
λ1 (kvk2 − R02 ) = 0,
∇f (x0 ) + 2λ1 v + µ1 a1 + · · · + µ` a` = 0.

This gives two possibilities.
(a) λ1 = 0, in which case we have ∇f (x0 ) + µ1 a1 + · · · + µ` a` = 0. This means that x0 is already a feasible point that satisfies the KKT conditions for (P ).
(b) λ1 ≠ 0, and v with kvk = R0 that satisfies
2λ1 v + µ1 a1 + · · · + µ` a` = −∇f (x0 ).

A solution is found by the following formulas (it is easy to check why, and left to the reader):
P = R02 (I − A| (AA| )−1 A),
λ1 = 1/2 (QP (∇f (x0 )))1/2 ,
[µ1 , . . . , µ` ]| = −(AA| )−1 A ∇f (x0 )| ,        (24)
v | = −(QP (∇f (x0 )))−1/2 P ∇f (x0 )| .

The solution v 0 of the (DF P ) points to a direction of steepest descent for the program (P ) starting at x0 . We perform now a line-search: We proceed to restrict f on the half line starting at x0 in the direction given by v 0 , and search for a global minimum:
t0 = argmin t≥0 ϕ0 (t) = argmin t≥0 f (x0 + tv 0 ).
Set x1 = x0 + t0 v 0 . We iterate this process to obtain a sequence of feasible points. Given xn ∈ Rd a feasible guess, and Rn > 0:
v n = argmin v∈Sn0 ∇f (xn )v | ,    Sn0 = {v ∈ Rd : kvk ≤ Rn , Av | = 0| },
tn = argmin t≥0 ϕn (t) = argmin t≥0 f (xn + tv n ),
xn+1 = xn + tn v n .

Example 5.2. Let’s illustrate this technique with the running example 5.1. Assume the initial guess is the feasible point (3/2, −3/2, 0). The corresponding (DF P ) for a direction search with unit vectors reads as follows:
min v∈S 0 h[3, −3, 0], vi,    S 0 = {v ∈ R3 : kvk ≤ 1, Av | = 0| },
where
A = [ 1  1   1 ]
    [ 1  −1  2 ].
The KKT conditions request a feasible v ∈ R3 , λ1 ≥ 0, µ1 , µ2 ∈ R so that
λ1 (kvk2 − 1) = 0,
2λ1 v + µ1 [1, 1, 1] + µ2 [1, −1, 2] = [−3, 3, 0].
The formulas in (24) give (here R0 = 1 and ∇f (x0 ) = [3, −3, 0]):

AA| = [ 3  2 ]        (AA| )−1 = [ 3/7   −1/7 ]
      [ 2  6 ],                  [ −1/7  3/14 ],

(AA| )−1 A = [ 2/7   4/7    1/7 ]
             [ 1/14  −5/14  2/7 ],

P = I3 − A| (AA| )−1 A = [ 9/14   −3/14  −3/7 ]
                         [ −3/14  1/14   1/7  ]
                         [ −3/7   1/7    2/7  ],

and thus, since A ∇f (x0 )| = [0, 6]| ,

[µ1 , µ2 ]| = −(AA| )−1 [0, 6]| = [6/7, −9/7]| ,
λ1 = 1/2 (QP (∇f (x0 )))1/2 = 1/2 ([3, −3, 0] · [18/7, −6/7, −12/7]| )1/2 = 1/2 √(72/7) = (3/7)√14,
v 0 | = −(QP (∇f (x0 )))−1/2 P ∇f (x0 )| = −(√14/12) [18/7, −6/7, −12/7]| = (1/√14) [−3, 1, 2]| .

Notice how v 0 satisfies all required constraints, and kv 0 k = 1. This vector is an optimal solution of (DF P ), and therefore a direction of steepest descent for (P ) from the point (3/2, −3/2, 0).
We perform now the line search from x0 in this direction:
ϕ0 (t) = f (x0 + tv 0 ) = t2 − (6/7)√14 t + 9/2,
t0 = argmin t≥0 ϕ0 (t) = (3/7)√14.
We have then
x1 = x0 + t0 v 0 = (3/2, −3/2, 0) + (3/7)√14 · (1/√14) [−3, 1, 2] = (3/14, −15/14, 6/7),
which happens to be the optimal solution of the program (P ).
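The whole computation can be reproduced numerically; a sketch of one full step in Python (formulas (24) with R0 = 1, and the closed-form line search for this objective, which gives t0 = −hx0 , vi when kvk = 1):

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0], [1.0, -1.0, 2.0]])
x0 = np.array([1.5, -1.5, 0.0])
grad = 2 * x0                                    # gradient of x^2+y^2+z^2 at x0

# formulas (24), with R0 = 1
P = np.eye(3) - A.T @ np.linalg.inv(A @ A.T) @ A
mu = -np.linalg.inv(A @ A.T) @ A @ grad
v = -(grad @ P @ grad) ** -0.5 * (P @ grad)      # unit direction of descent

t0 = -(x0 @ v)                                   # exact line search for ||x||^2
x1 = x0 + t0 * v
```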


1.2. Newton-Raphson. Given a feasible initial guess x0 ∈ Rd (satis-
fying Ax0 | = b| ), we proceed to search for a next iteration x1 within the
feasibility region that minimizes the quadratic approximant
Q0 (x) = f (x0 ) + h∇f (x0 ), x − x0 i + 21 QHessf (x0 ) (x − x0 ).
No further imposition on the distance between x1 and x0 is needed at this
point. The corresponding associated program (P 0 ) becomes
min x∈S Q0 (x),    S = {x ∈ Rd : Ax| = b| }.
Notice that ∇Q0 (x)| = ∇f (x0 )| + Hessf (x0 ) · (x − x0 )| . Replacing v = x − x0 ,
the KKT conditions for the program (P 0 ) read as follows: Find µk ∈ R (1 ≤ k ≤ `) and v ∈ Rd satisfying Av | = 0| so that
Hessf (x0 ) · v | + µ1 a1 | + · · · + µ` a` | = −∇f (x0 )| .
If det Hessf (x0 ) 6= 0, the system has a unique solution. To compute it, set
first H = [Hessf (x0 )]−1 . The solution in this case can be written according
to the following formulas:
\[
\begin{aligned}
v_0^\top &= \left( H A^\top \left( A H A^\top \right)^{-1} A H - H \right) \cdot \nabla f(x_0)^\top \\
[\mu_1, \ldots, \mu_\ell]^\top &= -\left( A H A^\top \right)^{-1} A H \cdot \nabla f(x_0)^\top
\end{aligned}
\tag{25}
\]

The optimal solution v₀ of this system points to a Newton-Raphson direction
for the program (P) starting at x₀. Set then x₁ = x₀ + v₀. We iterate this
process to find a sequence of feasible points. Given xₙ ∈ R^d a feasible guess,
\[
x_{n+1} = x_n + \underbrace{\operatorname*{argmin}_{A v^\top = 0^\top}
\left\{ f(x_n) + \langle \nabla f(x_n), v \rangle + \tfrac12 Q_{\operatorname{Hess} f(x_n)}(v) \right\}}_{v_n}.
\]

Example 5.3. Let's illustrate this technique with the running example 5.1.
Assume once again that the initial guess is the feasible point (3/2, −3/2, 0).
The corresponding program (P₀) to search for the Newton-Raphson direction is
\[
\min_{v \in S} \; \tfrac92 + \langle [3, -3, 0], v \rangle + v v^\top,
\qquad
S = \left\{ v \in \mathbb{R}^3 :
\begin{bmatrix} 1 & 1 & 1 \\ 1 & -1 & 2 \end{bmatrix} v^\top = 0^\top \right\}.
\]
The KKT conditions request multipliers µ₁, µ₂ ∈ R and a feasible v ∈ R³ so that
\[
2v + \mu_1 [1, 1, 1] + \mu_2 [1, -1, 2] = [-3, 3, 0].
\]
Since Hess f(3/2, −3/2, 0) = 2I₃ (non-singular), we have H = ½ I₃. The formulas in (25) give


   
\[
\begin{aligned}
v_0^\top &= \left(
\begin{bmatrix} 1/2 & 1/2\\ 1/2 & -1/2\\ 1/2 & 1 \end{bmatrix}
\begin{bmatrix} 3/2 & 1\\ 1 & 3 \end{bmatrix}^{-1}
\begin{bmatrix} 1/2 & 1/2 & 1/2\\ 1/2 & -1/2 & 1 \end{bmatrix}
- \tfrac12 I_3 \right)
\begin{bmatrix} 3\\ -3\\ 0 \end{bmatrix} \\
&= \begin{bmatrix} -9/28 & 3/28 & 3/14\\ 3/28 & -1/28 & -1/14\\ 3/14 & -1/14 & -1/7 \end{bmatrix}
\begin{bmatrix} 3\\ -3\\ 0 \end{bmatrix}
= \begin{bmatrix} -9/7\\ 3/7\\ 6/7 \end{bmatrix}
\end{aligned}
\]
\[
\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}
= -\begin{bmatrix} 3/2 & 1\\ 1 & 3 \end{bmatrix}^{-1}
\begin{bmatrix} 1/2 & 1/2 & 1/2\\ 1/2 & -1/2 & 1 \end{bmatrix}
\begin{bmatrix} 3\\ -3\\ 0 \end{bmatrix}
= -\begin{bmatrix} 2/7 & 4/7 & 1/7\\ 1/14 & -5/14 & 2/7 \end{bmatrix}
\begin{bmatrix} 3\\ -3\\ 0 \end{bmatrix}
= \begin{bmatrix} 6/7\\ -9/7 \end{bmatrix}
\]
At this point, we may calculate
\[
x_1 = x_0 + v_0 = \left(\tfrac32, -\tfrac32, 0\right) + \left(-\tfrac97, \tfrac37, \tfrac67\right)
= \left(\tfrac{3}{14}, -\tfrac{15}{14}, \tfrac67\right),
\]
which happens to be the optimal solution of the program (P).
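The computation in formulas (25) can be verified numerically as well. A minimal numpy sketch, using the data of this example (H = ½ I₃, since Hess f = 2I₃):

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 2.0]])
grad = np.array([3.0, -3.0, 0.0])    # gradient of f at x0 = (3/2, -3/2, 0)
H = np.eye(3) / 2                    # inverse of Hess f(x0) = 2 I3

# Formulas (25): Newton-Raphson direction and KKT multipliers
K = np.linalg.inv(A @ H @ A.T)
v = (H @ A.T @ K @ A @ H - H) @ grad
mu = -K @ A @ H @ grad

x1 = np.array([1.5, -1.5, 0.0]) + v
print(x1, mu)                        # (3/14, -15/14, 6/7) and (6/7, -9/7)
```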

2. Linear Programming: The simplex method


A general linear program (LP) is usually expressed in the following standard form:
\[
(LP):\quad
\begin{cases}
\displaystyle \max_{x \in \mathbb{R}^d} \; \langle c, x \rangle & \\
\langle a_k, x \rangle = b_k & (1 \le k \le \ell) \\
x_k \ge 0 & (1 \le k \le d)
\end{cases}
\]
for a given c = [c₁, . . . , c_d], a_k = [a_{k1}, . . . , a_{kd}] ∈ R^d, b_k ∈ R for 1 ≤ k ≤ ℓ.


We accomplish this by performing the following operations, where necessary:
Objective function: If the original program requests min f(x), convert it
to max(−f(x)).
Slack variables: If the original program contains an inequality constraint
of the form ha, xi ≤ b with a = [a1 , . . . , ad ], convert it to an equality
constraint by adding a non-negative slack variable s. The resulting
constraint is

a1 x1 + · · · + ad xd + s = b, s ≥ 0.

Surplus variables: If the original program contains an inequality con-


straint of the form ha, xi ≥ b, convert it to an equality constraint
by subtracting a non-negative surplus variable s. The resulting
constraint is

a1 x1 + · · · + ad xd − s = b, s ≥ 0.

Unrestricted variables in sign: If some variable xk is unrestricted in



sign, replace it everywhere in the formulation by x+
k − xk , where
+ −
xk ≥ 0 and xk ≥ 0.

Once in standard form, we usually represent a linear program by its
corresponding tableau:
\[
\begin{bmatrix} 1 & -c & 0 \\ 0^\top & A & b^\top \end{bmatrix},
\]
where
\[
A = \begin{bmatrix} a_{11} & \cdots & a_{1d} \\ \vdots & \ddots & \vdots \\ a_{\ell 1} & \cdots & a_{\ell d} \end{bmatrix}
\]
and b = [b₁, . . . , b_ℓ].

Example 5.4. Consider the linear program (LP) given by the non-standard
formulation
\[
(LP):\quad
\begin{cases}
\displaystyle \min_{(x,y,z) \in \mathbb{R}^3} \; 3y - 2x & \\
x - 3y + 2z \le 3, & \\
2y - x \ge 2, & \\
y \ge 0, \; z \ge 0
\end{cases}
\]

We can easily convert the objective function to a maximum, and the first
two inequality constraints into equality constraints by introducing a slack
variable s₁ ≥ 0 and a surplus variable s₂ ≥ 0. Notice that the variable x is
unrestricted in sign. We replace it by the difference x⁺ − x⁻ of two
non-negative variables:
\[
\begin{cases}
\max \; 2x^+ - 2x^- - 3y & \\
x^+ - x^- - 3y + 2z + s_1 = 3, & \\
-x^+ + x^- + 2y - s_2 = 2, & \\
x^+ \ge 0, \; x^- \ge 0, \; y \ge 0, \; z \ge 0, \; s_1 \ge 0, \; s_2 \ge 0
\end{cases}
\]
The corresponding tableau is as follows:

    1  -2   2   3   0   0   0 |  0
    0   1  -1  -3   2   1   0 |  3
    0  -1   1   2   0   0  -1 |  2
    z  x+  x-   y   z  s1  s2

The Simplex method to find the optimal solution of a linear program in


standard form is based on the following two rules:
Rule 1: If all variables have a nonnegative coefficient on the first row, the
current basic solution (the last column) is optimal. Otherwise, pick
a variable xk with a negative coefficient (−ck ) in the first row—the
entering variable—and pivot it with another row.
Rule 2: The selection of row to perform the pivot for the entering variable
xk is performed by choosing among the rows j for which ajk > 0,
the one with the minimum ratio bj /ajk .
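Rules 1 and 2 are easy to implement for small tableaus. The following sketch is a hypothetical, minimal implementation (not part of the text's toolbox): it picks the most negative coefficient as the entering variable (the rule above allows any negative one), pivots by the minimum-ratio test, and stops when Rule 1 declares optimality. It assumes the program is bounded and that the slack columns provide an initial feasible basis.

```python
import numpy as np

def simplex(tableau):
    """Iterate Rules 1 and 2 on a tableau [[1, -c, 0], [0, A, b]]."""
    T = np.array(tableau, dtype=float)
    while True:
        row0 = T[0, 1:-1]
        if np.all(row0 >= 0):                  # Rule 1: current solution is optimal
            return T
        k = 1 + int(np.argmin(row0))           # entering variable (most negative)
        col = T[1:, k]
        ratios = np.full(col.shape, np.inf)    # Rule 2: minimum ratio b_j / a_jk
        ratios[col > 0] = T[1:, -1][col > 0] / col[col > 0]
        j = 1 + int(np.argmin(ratios))
        T[j] /= T[j, k]                        # normalize the pivot row...
        for i in range(len(T)):                # ...and clear the column elsewhere
            if i != j:
                T[i] -= T[i, k] * T[j]

# Example 5.5 below: max x1 + x2 with 2x1 + x2 <= 4, x1 + 2x2 <= 3
T = simplex([[1, -1, -1, 0, 0, 0],
             [0,  2,  1, 1, 0, 4],
             [0,  1,  2, 0, 1, 3]])
print(T[0, -1])                                # optimal value z = 7/3
```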
Example 5.5. Let's illustrate this process with the following linear program
\[
(LP):\quad
\begin{cases}
\displaystyle \max_{(x,y) \in \mathbb{R}^2} \; x + y & \\
2x + y \le 4 & \\
x + 2y \le 3 & \\
x \ge 0, \; y \ge 0
\end{cases}
\]

We start by converting to standard. For ease of computations below, we


first rename x = x1 and y = x2 . We introduce the slack variables x3 and x4
as they are needed. We finish the preparation step by finding the tableau of
this program.
\[
(LP):\quad
\begin{cases}
\max \; x_1 + x_2 & \\
2x_1 + x_2 + x_3 = 4 & \\
x_1 + 2x_2 + x_4 = 3 & \\
x_1 \ge 0, \; x_2 \ge 0, \; x_3 \ge 0, \; x_4 \ge 0
\end{cases}
\]

    1  -1  -1   0   0 |  0
    0   2   1   1   0 |  4
    0   1   2   0   1 |  3
    z  x1  x2  x3  x4

At this stage, we have the following initial situation:

    z = 0;   x3 = 4, x4 = 3 (basic solutions);   x1 = x2 = 0 (non-basic solutions)

The first entering variable is x1 . We have two choices to pivot. Rule 2


indicates that we must use the second row, since for this row, we have the
ratio 4/2 = 2, while the third row offers a bigger ratio: 3/1 = 3.
   
1 −1 −1 0 0 0 1 0 −1/2 1/2 0 2
0 1 1/2 1/2 0 2
 → 0 1 1/2 1/2 0 2

 
0 1 2 0 1 3   0 0 3/2 −1/2 1 1
z x1 x2 x3 x4 z x1 x2 x3 x4
At this stage, we have the following situation:

    z = 2;   x1 = 2, x4 = 1 (basic solution);   x2 = x3 = 0 (non-basic solution)

The second entering variable is x2 . We again have two choices to pivot.


Rule 2 indicates that we must use the third row, since for this row the ratio
is 1/(3/2) = 2/3. For the second row, the ratio is larger: 2/(1/2) = 4.
   
1 0 −1/2 1/2 0 2 1 0 0 1/3 1/3 7/3
0 1
 1/2 1/2 0 2  → 0 1 0
 2/3 −1/3 5/3 
0 0 1 −1/3 2/3 2/3 0 0 1 −1/3 2/3 2/3
z x1 x2 x3 x4 z x 1 x2 x3 x4
There are no more negative coefficients on the first row. This leads to an
optimal solution of the tableau given by z = 7/3, x1 = 5/3, x2 = 2/3, x3 = x4 = 0.
The global maximum of the function f(x, y) = x + y on the set S = {(x, y) ∈
R² : x ≥ 0, y ≥ 0, 2x + y ≤ 4, x + 2y ≤ 3} is attained at the point (5/3, 2/3).
The corresponding maximum value is thus 7/3. An illustration of the three
steps carried out in these computations can be observed in Figure 5.1.
Example 5.6. What happens if the program does not have a unique
solution? Can the simplex method offer this information? Consider for
f(x, y) = x + ½y the program
\[
(LP):\quad
\begin{cases}
\displaystyle \max_{(x,y) \in \mathbb{R}^2} f(x, y) & \\
2x + y \le 4 & \\
x + 2y \le 3 & \\
x \ge 0, \; y \ge 0
\end{cases}
\]
Once in standard form, this program has the tableau

    1  -1  -1/2   0   0 |  0
    0   2    1    1   0 |  4
    0   1    2    0   1 |  3
    z  x1   x2   x3  x4

Figure 5.1. Illustration of the simplex method for Example 5.5

The initial solution gives

    z = 0;   x3 = 4, x4 = 3 (basic solutions);   x1 = x2 = 0 (non-basic solutions)
There is an entering variable at x1, which has to be pivoted with the second row:

    1  -1  -1/2   0   0 |  0         1   0   0   1/2   0 |  2
    0   1   1/2  1/2   0 |  2   →    0   1  1/2   1/2   0 |  2
    0   1    2    0   1 |  3         0   0  3/2  -1/2   1 |  1
    z  x1   x2   x3  x4              z  x1   x2   x3   x4
The solution at this stage—which already offers an optimal solution of the
program (LP)—gives

    z = 2;   x1 = 2, x4 = 1 (basic solutions);   x2 = x3 = 0 (non-basic solutions)
Notice now that at this point we could pivot on the variable x2 (its coefficient
on the first row is already zero) without changing the value of z.

    1   0   0   1/2    0 |   2         1   0   0   1/2    0 |   2
    0   1  1/2  1/2    0 |   2   →     0   1   0   2/3  -1/3 | 5/3
    0   0   1  -1/3  2/3 | 2/3         0   0   1  -1/3   2/3 | 2/3
    z  x1  x2   x3    x4               z  x1  x2   x3    x4
The solution at this stage gives

    z = 2;   x1 = 5/3, x2 = 2/3 (basic solutions);   x3 = x4 = 0 (non-basic solutions)

We have found two different optimal solutions of the program (LP) using
the simplex method: (2, 0) and (5/3, 2/3). Notice that in this case, any
other point in the segment joining those two points must also be a solution.
Namely: for any t ∈ [0, 1], the point (2 − t/3, 2t/3) satisfies

    f(2 − t/3, 2t/3) = 2 − t/3 + t/3 = 2      (the value is always 2)
    2(2 − t/3) + 2t/3 = 4                     (the first constraint is satisfied)
    (2 − t/3) + 2(2t/3) = 2 + t ≤ 3           (the second constraint is satisfied)
    2 − t/3 ≥ 5/3 > 0                         (the third constraint is satisfied)
    2t/3 ≥ 0                                  (the fourth constraint is satisfied)
Example 5.7. What happens if we are unable to employ Rule 2 from the
simplex method? This situation arises on unbounded programs. Consider
the one below:
\[
(LP):\quad
\begin{cases}
\displaystyle \max_{(x,y) \in \mathbb{R}^2} \; 2x + y & \\
-x + y \le 1 & \\
x - 2y \le 2 & \\
x \ge 0, \; y \ge 0
\end{cases}
\]

Once in standard form, its tableau is as follows:

    1  -2  -1   0   0 |  0
    0  -1   1   1   0 |  1
    0   1  -2   0   1 |  2
    z  x1  x2  x3  x4

At this stage, an initial solution is given by

    z = 0;   x3 = 1, x4 = 2 (basic solutions);   x1 = x2 = 0 (non-basic solutions)

The entering variable x1 must be pivoted with the third row.

    1   0  -5   0   2 |  4
    0   0  -1   1   1 |  3
    0   1  -2   0   1 |  2
    z  x1  x2  x3  x4

At this stage, we have

    z = 4;   x1 = 2, x3 = 3 (basic solutions);   x2 = x4 = 0 (non-basic solutions)

But notice that at this new stage we are unable to apply Rule 2, since there
are no positive coefficients in the column of x2. Any pivot operation that we apply using
the second row will change the value of z; in particular, we should be able
to perform enough changes to make z as large as we desire. For instance:
what row operations would you perform to get z = 19, with the feasible
point (8, 3)?

Example 5.8. In Python there is an implementation of the simplex
algorithm in the library scipy.optimize: the routine linprog with the
option method='simplex'. It is smart enough to indicate, among other
things,
• whether the program terminates successfully, or is infeasible or unbounded;
• the value of the slack or surplus variables, when these are used;
• in case of failure, the coordinates of the last point obtained by the
algorithm;
• the different tableaux computed at each stage;
• the index of the pivot element selected at each stage.
The following session illustrates how to use it to solve some of the examples
we have shown in this section.
import numpy as np, matplotlib.pyplot as plt
from scipy.optimize import linprog, linprog_verbose_callback

# First program: max(x + y) with 2x + y ≤ 4, x + 2y ≤ 3, x ≥ 0, y ≥ 0
# We obtained the value 7/3 at the point (5/3, 2/3)
c1 = [-1, -1]
A1 = [[2, 1], [1, 2]]
b1 = [4, 3]
x0_bnds = (0, None)   # This means literally 0 ≤ x0 < ∞
x1_bnds = (0, None)   # and this, 0 ≤ x1 < ∞

# Second program: max(2x + y) with −x + y ≤ 1, x − 2y ≤ 2, x ≥ 0, y ≥ 0
# The program is unbounded
c2 = [-2, -1]
A2 = [[-1, 1], [1, -2]]
b2 = [1, 2]

We call the optimization routine linprog with the values of c first (as a list),
then the values of the matrix A (as a list of lists), and the values of b
(as a list). We use the option bounds= to provide a collection of bounds
(min value, max value) for each of the relevant variables. For instance if
we request the first variable, x0 to satisfy a ≤ x0 ≤ b, we input (a,b). If
any of the bounds is infinite, we signal it with None. A simple output looks
like this:
>>> linprog(c1, A1, b1, bounds=(x0_bnds, x1_bnds), method='simplex')
fun: -2.3333333333333335
message: 'Optimization terminated successfully.'
nit: 2
slack: array([ 0., 0.])
status: 0
success: True
x: array([ 1.66666667, 0.66666667])

>>> linprog(c2, A2, b2, bounds=(x0_bnds, x1_bnds), method='simplex')


fun: -4.0
message: 'Optimization failed. The problem appears to be unbounded.'
nit: 1

slack: array([ 3., 0.])


status: 3
success: False
x: array([ 2., 0.])

For extra information, we issue the option callback= with a user-defined
callback function, or the default callback linprog_verbose_callback already
defined in scipy.optimize
>>> linprog(c1, A1, b1, bounds=(x0_bounds, x1_bounds), \
... method='simplex', callback=linprog_verbose_callback)
--------- Initial Tableau - Phase 1 ----------

[[ 2.0000 1.0000 1.0000 0.0000 4.0000]


[ 1.0000 2.0000 0.0000 1.0000 3.0000]
[ -1.0000 -1.0000 0.0000 0.0000 0.0000]
[ 0.0000 0.0000 0.0000 0.0000 0.0000]]

Pivot Element: T[nan, nan]

Basic Variables: [2 3]

Current Solution:
x = [ 0.0000 0.0000]

Current Objective Value:


f = -0.0

--------- Initial Tableau - Phase 2 ----------

[[ 2.0000 1.0000 1.0000 0.0000 4.0000]


[ 1.0000 2.0000 0.0000 1.0000 3.0000]
[ -1.0000 -1.0000 0.0000 0.0000 0.0000]]

Pivot Element: T[0, 0]

Basic Variables: [2 3]

Current Solution:
x = [ 0.0000 0.0000]

Current Objective Value:


f = -0.0

--------- Iteration 1 - Phase 2 --------

Tableau:
[[ 1.0000 0.5000 0.5000 0.0000 2.0000]
[ 0.0000 1.5000 -0.5000 1.0000 1.0000]
[ 0.0000 -0.5000 0.5000 0.0000 2.0000]]

Pivot Element: T[1, 1]



Basic Variables: [0 3]

Current Solution:
x = [ 2.0000 0.0000]

Current Objective Value:


f = -2.0

--------- Iteration Complete - Phase 2 -------

Tableau:
[[ 1.0000 0.0000 0.6667 -0.3333 1.6667]
[ 0.0000 1.0000 -0.3333 0.6667 0.6667]
[ 0.0000 0.0000 0.3333 0.3333 2.3333]]

Basic Variables: [0 1]

Current Solution:
x = [ 1.6667 0.6667]

Current Objective Value:


f = -2.33333333333

3. The Frank-Wolfe Method


Also known as the Conditional-Gradient method, it is widely used to
solve programs where the feasibility region S is described by a system of
linear inequalities.
\[
(P):\quad
\begin{cases}
\displaystyle \min_{x \in S} f(x) & \\
S = \{x \in \mathbb{R}^d : \langle a_k, x \rangle \le b_k \text{ with } a_k \in \mathbb{R}^d,\ b_k \in \mathbb{R},\ 1 \le k \le \ell\}
\end{cases}
\]

This is an iterative method that, at any feasible initial guess x0 ∈ S,


considers an associated linear program (LP0 ) that minimizes the linear ap-
proximation L0 (x) = f (x0 ) + h∇f (x0 ), x − x0 i on the same feasible region
S.
\[
(LP_0):\quad
\begin{cases}
\displaystyle \min_{x \in S} \langle \nabla f(x_0), x - x_0 \rangle & \\
S = \{x \in \mathbb{R}^d : \langle a_k, x \rangle \le b_k \text{ with } a_k \in \mathbb{R}^d,\ b_k \in \mathbb{R},\ 1 \le k \le \ell\}
\end{cases}
\]

Once an optimal solution x̄0 of (LP0 ) has been obtained, a line-search is per-
formed on the segment joining x0 with x̄0 (which by hypothesis is contained
in the feasibility region S).

\[
t_0 = \operatorname*{argmin}_{0 \le t \le 1} f\big( x_0 + t(\bar{x}_0 - x_0) \big).
\]

Set then x1 = x0 + t0 (x̄0 − x0 ). We repeat this process to obtain a sequence


{xn }n∈N of feasible points.

We usually devise a stopping criterion (given a fixed tolerance ε > 0)


to guarantee that we are close enough to the optimal solution of (P ). An
example of such a process is illustrated below:
Initialization: Let x0 ∈ S be an initial feasible guess, and set LB = −∞,
U B = f (x0 )—the lower and upper bounds (respectively) for the
stopping criteria.
Iteration: Assume we have xn , LB, U B. Set
Ln (x) = f (xn ) + h∇f (xn ), x − xn i.
Find
\[
\begin{aligned}
\bar{x}_n &= \operatorname*{argmin}_{x \in S} L_n(x)
           = \operatorname*{argmin}_{x \in S} \langle \nabla f(x_n), x - x_n \rangle, \\
\xi_n &= \min_{x \in S} L_n(x) = L_n(\bar{x}_n), \\
t_n &= \operatorname*{argmin}_{0 \le t \le 1} f\big( x_n + t(\bar{x}_n - x_n) \big), \\
x_{n+1} &= x_n + t_n (\bar{x}_n - x_n), \\
LB &= \max(LB, \xi_n)
\end{aligned}
\]
Stopping Criteria: If |U B − LB| ≤ ε, then stop. Otherwise, update the
upper bound, U B = f (xn+1 ), and perform the next Iteration.
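The iteration above can be sketched in a few lines of Python, delegating the subproblem (LPₙ) to scipy.optimize.linprog. The sketch below uses the data of Example 5.9, which follows; the exact line search exploits the fact that f there is quadratic with Hessian 2I, and a fixed iteration cap stands in for the ε-based stopping criterion:

```python
import numpy as np
from scipy.optimize import linprog

# Feasible region of Example 5.9: the square |x| + |y| <= 3/2, as A x <= b
A = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [-1.0, 1.0]])
b = np.full(4, 1.5)

def grad(x):                     # gradient of f(x, y) = (x-3)^2 + (y-2)^2
    return np.array([2 * (x[0] - 3), 2 * (x[1] - 2)])

x = np.array([0.0, 0.0])         # feasible initial guess
for _ in range(50):
    # Linear subproblem: minimize <grad f(x_n), y> over the feasible region
    res = linprog(grad(x), A_ub=A, b_ub=b, bounds=[(None, None)] * 2)
    d = res.x - x
    denom = 2 * d @ d            # f is quadratic with Hessian 2I, so the exact
    if denom == 0:               # minimizer of f(x + t d) over the reals is
        break                    # t = -<grad f(x), d> / (2 |d|^2), clipped to [0, 1]
    t = np.clip(-grad(x) @ d / denom, 0.0, 1.0)
    x = x + t * d

print(x)                         # converges to (5/4, 1/4)
```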
Example 5.9. Let us use this technique to try and find the minimum
value of the function f (x, y) = (x − 3)2 + (y − 2)2 over the square with
vertices at (−3/2, 0), (0, 3/2), (3/2, 0) and (0, −3/2). We start by defining
the inequality constraints:
\[
\begin{aligned}
g_1(x, y) &= x + y - \tfrac32 = \langle [1, 1], [x, y] \rangle - \tfrac32 \\
g_2(x, y) &= x - y - \tfrac32 = \langle [1, -1], [x, y] \rangle - \tfrac32 \\
g_3(x, y) &= -x - y - \tfrac32 = \langle [-1, -1], [x, y] \rangle - \tfrac32 \\
g_4(x, y) &= -x + y - \tfrac32 = \langle [-1, 1], [x, y] \rangle - \tfrac32
\end{aligned}
\]
 
We also need the gradient of f: ∇f(x, y) = [2(x − 3), 2(y − 2)]. At any
point (x0 , y0 ), the linear approximation needed for this method is given by
L0 (x, y) = f (x0 , y0 ) + h∇f (x0 , y0 ), (x − x0 , y − y0 )i
= (x0 − 3)2 + (y0 − 2)2 + 2(x0 − 3)(x − x0 ) + 2(y0 − 2)(y − y0 ).
Let’s assume that the initial guess is (0, 0), and we are looking for an exact
solution (ε = 0). At this point, we set up LB = −∞, and U B = f (0, 0) = 13.
According to our description of Frank-Wolfe, we have to solve at this step
the program
\[
(LP_0):\quad
\begin{cases}
\displaystyle \min_{(x,y) \in S} \; -6x - 4y & \\
S = \{(x, y) \in \mathbb{R}^2 : g_k(x, y) \le 0, \; 1 \le k \le 4\}
\end{cases}
\]

Figure 5.2. Set up for Example 5.9

By performing the simplex method, we find the optimal solution of this
program attained at the point (x̄₀, ȳ₀) = (3/2, 0), with ξ₀ = L₀(3/2, 0) = −9.
We proceed to perform a line-search on the segment joining (0, 0) and (3/2, 0):
\[
t_0 = \operatorname*{argmin}_{0 \le t \le 1} f\left( \tfrac32 t, 0 \right)
= \operatorname*{argmin}_{0 \le t \le 1} \left( 4 + \left( \tfrac32 t - 3 \right)^2 \right) = 1.
\]
It is then (x₁, y₁) = (x₀, y₀) + t₀(x̄₀ − x₀, ȳ₀ − y₀) = (3/2, 0), and
LB = max(−∞, ξ₀) = −9.
At this point the stopping criterion gives |UB − LB| = |13 − (−9)| = 22.
We proceed to the second iteration step, but we update the upper bound
first: U B = f (x1 , y1 ) = f (3/2, 0) = 25/4.
We need the approximation to f at (3/2, 0):
\[
L_1(x, y) = f\left(\tfrac32, 0\right)
+ \left\langle \nabla f\left(\tfrac32, 0\right), \left(x - \tfrac32, y\right) \right\rangle
= \tfrac{25}{4} + \left\langle [-3, -4], \left(x - \tfrac32, y\right) \right\rangle
= \tfrac{25}{4} - 3\left(x - \tfrac32\right) - 4y
\]

The corresponding associated program at this stage is as follows:
\[
(LP_1):\quad
\begin{cases}
\displaystyle \min_{(x,y) \in S} \; -3x - 4y & \\
S = \{(x, y) \in \mathbb{R}^2 : g_k(x, y) \le 0, \; 1 \le k \le 4\}
\end{cases}
\]

After applying the simplex method to this program, we find that the solution
is the point (x̄1 , ȳ1 ) = (0, 3/2), with ξ1 = L1 (0, 3/2) = 19/4.

A line search between (x₁, y₁) and (x̄₁, ȳ₁) gives
\[
t_1 = \operatorname*{argmin}_{0 \le t \le 1} f\left( \tfrac32 (1 - t), \tfrac32 t \right)
= \operatorname*{argmin}_{0 \le t \le 1} \left( \tfrac92 t^2 - \tfrac32 t + \tfrac{25}{4} \right) = \tfrac16,
\]
and therefore (x₂, y₂) = (x₁, y₁) + t₁(x̄₁ − x₁, ȳ₁ − y₁) = (5/4, 1/4). We also
have LB = max(−9, ξ₁) = 19/4.
At this point, the stopping criterion gives |UB − LB| = |25/4 − 19/4| = 3/2.
We proceed to the third step, but update the upper bound first: UB =
f(x₂, y₂) = f(5/4, 1/4) = 49/8.
We need the approximation to f at (5/4, 1/4):
\[
L_2(x, y) = f\left(\tfrac54, \tfrac14\right)
+ \left\langle \nabla f\left(\tfrac54, \tfrac14\right), \left(x - \tfrac54, y - \tfrac14\right) \right\rangle
= -\tfrac72 x - \tfrac72 y + \tfrac{91}{8}.
\]

The corresponding associated program at this stage is as follows:
\[
(LP_2):\quad
\begin{cases}
\displaystyle \min_{(x,y) \in S} \; -\tfrac72 x - \tfrac72 y & \\
S = \{(x, y) \in \mathbb{R}^2 : g_k(x, y) \le 0, \; 1 \le k \le 4\}
\end{cases}
\]

The solution is again the point (x̄2 , ȳ2 ) = (5/4, 1/4), with ξ2 = L2 (5/4, 1/4) =
49/8. No further computations are needed at this point to realize that
(x3 , y3 ) = (x̄2 , ȳ2 ) = (x2 , y2 ) = (5/4, 1/4). Notice that |U B − LB| = 0, and
the stopping criterion has been satisfied, as expected.
The solution of the program (P ) is precisely this point.

Figure 5.3. Frank-Wolfe iteration to solve the program (P )


in Example 5.9.

Exercises
Problem 5.1 (Basic). Use a projection method to find the global mini-
mum of the function f (x, y, z) = x2 + 2y 2 + 3z 2 over the plane x + y + z = 1.

Problem 5.2 (Basic). Use a projection method to find the global min-
imum of the function f (x, y, z, t) = x2 + 2y 2 + 2z 2 + t2 over the plane

S = {(x, y, z, t) ∈ R4 : x + y + z = t, x + y = 4}

Problem 5.3 (Basic). Solve the following linear program by the simplex
method:
\[
(LP):\quad
\begin{cases}
\displaystyle \max_{(x,y,z) \in \mathbb{R}^3} \; 4x + y - z & \\
x + 3z \le 6 & \\
3x + y + 3z \le 9 & \\
x \ge 0, \; y \ge 0, \; z \ge 0
\end{cases}
\]

Problem 5.4 (Basic). The following tableaux were obtained in the
course of solving linear programs with two non-negative variables x1 and
x2, and two inequality constraints, for which slack or surplus variables x3 and x4
were needed. In each case, indicate whether the corresponding linear program
has a unique optimal solution, has several optimal solutions (and
in that case find them all), is unbounded, or is degenerate.

    (a)   1   0   3   2   0 | 20
          0   1  -2  -1   0 |  4
          0   0  -1   0   1 |  2
          z  x1  x2  x3  x4

    (b)   1   0  -1   0   2 | 20
          0   0   0   1  -2 |  5
          0   1  -2   0   3 |  6
          z  x1  x2  x3  x4

    (c)   1   2   0   0   1 |  8
          0   3   1   0  -2 |  4
          0  -2   0   1   1 |  0
          z  x1  x2  x3  x4

    (d)   1   0   0   2   0 |  5
          0   0  -1   1   1 |  4
          0   1   1  -1   0 |  4
          z  x1  x2  x3  x4

Problem 5.5 (Intermediate). Consider the following linear program
\[
(LP):\quad
\begin{cases}
\displaystyle \max_{(x,y,z) \in \mathbb{R}^3} \; 5x + 3y + z & \\
x + y + z \le 6 & \\
5x + 3y + 6z \le 15 & \\
x \ge 0, \; y \ge 0, \; z \ge 0
\end{cases}
\]


Assume the following is an associated tableau:

    1   0    0     5   0    1 | 15
    0   0  0.4  -0.2   1 -0.2 |  3
    0   1  0.6   1.2   0  0.2 |  3
    z  x1   x2    x3  x4   x5
(a) What basic solution does this tableau represent? Is this solution
optimal? Explain why or why not.
(b) Does this tableau represent a unique optimal solution? If not, find
at least three alternative optimal solutions.
Problem 5.6 (CAS). In a computer language or CAS of your choice,
design a routine that gathers the following as input:
• A d-dimensional vector c = [c1 , . . . , cd ] representing a linear func-
tion f (x) = hc, xi.
• A matrix A with dimension ` × d, and an `-dimensional vector
b = [b1 , . . . , b` ]. These two objects represent ` linear inequality
constraints of the form Ax| ≤ b| .
The routine attempts to find a solution to the following linear program:
\[
(LP):\quad
\begin{cases}
\displaystyle \min_{x \in S} \langle c, x \rangle & \\
S = \{x \in \mathbb{R}^d : A x^\top \le b^\top, \; x_k \ge 0, \; 1 \le k \le d\}
\end{cases}
\]
If a solution is found, the code outputs it. If the program is unbounded, the
code indicates so.
Problem 5.7 (Basic). Use the Frank-Wolfe method to find the minimum
value of the function f (x, y) = (x−1)2 +(y+5)2 over the square with vertices
at (2, 2), (3, 3), (4, 2) and (3, 1). Use the center of the square as initial guess.
Illustrate graphically each step of the method.
Index

Arrow-Hurwicz-Uzawa constraint qualification, 71
BFGS method, see Broyden-Fletcher-Goldfarb-Shanno method
Broyden method, 42
  iteration, 42
  recursive formula, 42
Broyden-Fletcher-Goldfarb-Shanno method, 53
Characteristic Polynomial, 16
Cholesky decomposition, 59
Conditional-Gradient method, see Frank-Wolfe method
Cone
  of improving directions, 64
  of inward pointing directions for the binding constraints, 64
Conjugate gradient, 38
Constrained optimization, 63
Constraint, 63
  equality, 63
  inequality, 63
Convergence
  cubic, 97
  linear, 97
  logarithmic, 97
  quadratic, 97
  rate of, 97
  sublinear, 97
  superlinear, 97
Convex
  function, 18, 26
  set, 17
Davidon-Fletcher-Powell method, 53
Derivative
  directional, 1, 12
  partial, 12
DFP method, see Davidon-Fletcher-Powell method
Direction, 1
  of steepest descent, 44, 75
Divided differences, 32
Eigenvalue, 16
Epigraph, 18, 26
Extreme Value, 6
Extremum, 6
Feasibility region, 63
Feasible
  point, 63
Frank-Wolfe method, 86
  lower bound, 87
  stopping criteria, 87
  upper bound, 87
Function
  coercive, 17, 21, 25
  continuous, 11, 24
  convex, 18, 22, 26
  differentiable, 12
  pseudo-convex, 21
  quasi-concave, 20
  quasi-convex, 20
  Rosenbrock, 2, 25, 37, 46
  strictly convex, 18, 22
  Weierstrass, 13
Gradient, 1
  descent, see Steepest descent
Hessian, 14
Horner's
  method, 56
  scheme, 56
image, 11
Indices of the binding inequality constraints, 64
Jacobian, 12
Karush-Kuhn-Tucker
  conditions, 67
  multipliers, 67
kernel, 11
Lagrange Multipliers, 5
Level set, 21
Line-search, 44, 75, 77, 86
Linear map, 11, 70
Linear programming, 78
Linear transformation, see Linear map
LU-decomposition, 59
Matrix
  inverse, 36
  inversion, 36
  Jacobian, 12
  leading principal minor, 15
  principal minor, 15
  Symmetric, 14, 25
  Indefinite, 14, 25, 71
  Negative Definite, 14, 25, 71
  Negative Semidefinite, 14, 25, 71
  Positive Definite, 14, 25, 71
  Positive Semidefinite, 14, 25, 71
Maximum
  global, 6
  local, 7
  strict global, 7
  strict local, 7
Minimum
  global, 6, 65
  local, 7, 65
  strict global, 6, 65
  strict local, 7, 65
Newton method, see Newton-Raphson method
Newton-Raphson method, 31, 35, 38, 77
  direction, 38, 78
  error, 32
  iteration, 31, 35, 38
  Local convergence for, 33
  recursive formula, 31, 35, 38
Objective function of (P), 63
Optimization
  Constrained, 7
  Unconstrained, 7
Principal minor, 15
  leading, 15
Program, 63
  consistent, 63
  convex, 63
  Direction Finding, 75
  linear, 63
  super-consistent, 63
Quadratic Form, 14, 25
scipy, 38, 53, 84
Secant method, 40
  iteration, 40
  Local convergence for, 41
  recursive formula, 40
  Secant property, 42
Set of tangent directions for equality constraints, 64
Simplex method, 78
  entering variable, 80
  slack variable, 79
  surplus variable, 79
  tableau, 79
  Unrestricted variables in sign, 79
Slater point, 63
Steepest descent, 44
  direction of, 44
  error, 50
  sequence of, 44
tableau, 79
Theorem
  Bounded Value, 21, 26
  Clairaut, 14
  Eigenvalue Criteria, 16
  Extreme Value, 21, 26
  Fritz John necessary conditions, 66
  Geometric necessary condition, 66
  Kantorovich estimate, 50, 61
  Karush-Kuhn-Tucker, 67, 68
  KKT necessary conditions, 67
  KKT sufficient conditions, 68
  Local Convergence for Newton-Raphson, 33
  Local Convergence for Secant, 41
  Orthogonal Gradient, 5
  Principal Minor Criteria, 15, 26
  Quadratic Convergence, 38
  Slater condition, 68
  Taylor's formula, 47
  Wolfe, 52
Vector
  unit, 1
Bibliography

[1] Dimitri P Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.


[2] Francisco J Blanco-Silva. Mastering SciPy. Packt Publishing Ltd, 2015.
[3] Charles George Broyden. The convergence of a class of double-rank minimization
algorithms 1. general considerations. IMA Journal of Applied Mathematics, 6(1):76–
90, 1970.
[4] George Dantzig. Linear Programming and Extensions. A RAND Corporation Research
Study. Princeton University Press, 1963.
[5] John E Dennis Jr and Robert B Schnabel. Numerical methods for unconstrained
optimization and nonlinear equations. SIAM, 1996.
[6] Ross L Finney, Maurice D Weir, and George Brinton Thomas. Thomas’ calculus:
early transcendentals. Addison-Wesley, 2001.
[7] Roger Fletcher. A new approach to variable metric algorithms. The computer journal,
13(3):317–322, 1970.
[8] Robert Freund. Nonlinear programming. Massachusetts Institute of Technology: MIT
OpenCourseWare, 2004. https://ocw.mit.edu License: Creative Commons BY-NC-
SA.
[9] Walter Gautschi. Numerical analysis. Springer Science & Business Media, 2011.
[10] Donald Goldfarb. A family of variable-metric methods derived by variational means.
Mathematics of computation, 24(109):23–26, 1970.
[11] Godfrey Harold Hardy. Weierstrass's non-differentiable function.
Trans. Amer. Math. Soc., 17(3):301–325, 1916.
[12] J. D. Hunter. Matplotlib: A 2d graphics environment. Computing In Science & En-
gineering, 9(3):90–95, 2007.
[13] Anthony L Peressini, Francis E Sullivan, and J Jerry Uhl. The mathematics of non-
linear programming. Springer-Verlag New York, 1988.
[14] Walter Rudin et al. Principles of mathematical analysis, volume 3. McGraw-hill New
York, 1964.
[15] David F Shanno. Conditioning of quasi-newton methods for function minimization.
Mathematics of computation, 24(111):647–656, 1970.
[16] David F Shanno and Paul C Kettler. Optimal conditioning of quasi-newton methods.
Mathematics of Computation, 24(111):657–664, 1970.

APPENDIX A

Rates of Convergence

Definition. Consider a convergent sequence {xn}n∈N ⊂ R^d with x⋆ = limn xn.
We say that this sequence exhibits
Linear Convergence: If there exists 0 < δ < 1 so that
\[
\lim_n \frac{\lVert x_{n+1} - x^\star \rVert}{\lVert x_n - x^\star \rVert} = \delta.
\]
We refer to δ as the rate of convergence.
Superlinear Convergence: If
\[
\lim_n \frac{\lVert x_{n+1} - x^\star \rVert}{\lVert x_n - x^\star \rVert} = 0.
\]
Sublinear Convergence: If
\[
\lim_n \frac{\lVert x_{n+1} - x^\star \rVert}{\lVert x_n - x^\star \rVert} = 1.
\]
If, additionally,
\[
\lim_n \frac{\lVert x_{n+2} - x_{n+1} \rVert}{\lVert x_{n+1} - x_n \rVert} = 1,
\]
we say that the sequence exhibits logarithmic convergence to x⋆.
Convergence of order q > 1: If {xn} exhibits superlinear convergence,
and there exist q > 1 and 0 < δ < 1 so that
\[
\lim_n \frac{\lVert x_{n+1} - x^\star \rVert}{\lVert x_n - x^\star \rVert^q} = \delta.
\]
In particular,
• Convergence with q = 2 is said to be quadratic.
• Convergence with q = 3 is said to be cubic.
• etc.
In any of these cases, we do refer to δ as the rate of convergence, as
we did in the linear case.
A practical method to estimate the order of convergence q of a sequence is
to compute the following quotients, which converge to q:
\[
q \approx \frac{\log\left| \dfrac{x_{n+1} - x_n}{x_n - x_{n-1}} \right|}
             {\log\left| \dfrac{x_n - x_{n-1}}{x_{n-1} - x_{n-2}} \right|}
\tag{26}
\]
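The estimate (26) is straightforward to compute. The sketch below applies it to the sequence xn = 2^(−2ⁿ) of Example A.3 below, and reports a value close to q = 2 (quadratic convergence):

```python
import numpy as np

# First few terms of x_n = 2^(-2^n)
x = [2.0 ** -(2 ** n) for n in range(6)]

d = np.abs(np.diff(x))                                # |x_{n+1} - x_n|
q = np.log(d[-1] / d[-2]) / np.log(d[-2] / d[-3])     # formula (26)
print(q)                                              # close to 2
```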

Example A.1. The sequence xₙ = 1/n! exhibits superlinear convergence,
since limₙ 1/n! = 0 and
\[
\lim_n \frac{x_{n+1}}{x_n} = \lim_n \frac{1}{n+1} = 0.
\]

Example A.2. Given a ∈ R, 0 < r < 1, the geometric sequence xₙ = arⁿ
exhibits linear convergence, since limₙ arⁿ = 0 and
\[
\lim_n \frac{x_{n+1}}{x_n} = r < 1.
\]
The rate of convergence is precisely r.
Example A.3. The sequence xₙ = 2^(−2ⁿ) converges to zero and is superlinear:
\[
\lim_n \frac{x_{n+1}}{x_n} = \lim_n 2^{-2^n} = 0.
\]
Using the estimation for q given by the formula in (26), we obtain that this
sequence exhibits quadratic convergence. What is its rate of convergence?
Example A.4. The sequence xₙ = 1/n converges to zero and is sublinear,
since
\[
\lim_n \frac{x_{n+1}}{x_n} = \lim_n \frac{n}{n+1} = 1.
\]
Notice
\[
\lim_n \frac{|x_{n+2} - x_{n+1}|}{|x_{n+1} - x_n|} = \lim_n \frac{n}{n+2} = 1;
\]
therefore, this sequence exhibits logarithmic convergence.
APPENDIX B

Basic sympy commands for Calculus

A typical sympy session usually starts by loading the symbols we need,


some basic functions, and basic constructors. After that, we proceed to the
description of the functions we require.
# Symbols, including one for infinity, π and e
from sympy.abc import x,y,t,h
from sympy import oo, pi, E
# Symbols with conditions
from sympy import var
a,b = var('a,b', positive=True)

# Basic functions we may need
from sympy import sqrt, sin, cos, tan, exp, log

# Some basic symbolic manipulations may be needed
from sympy import solve, factor, expand, simplify, limit

# To do vector calculus, we need these two as well
from sympy import Matrix
from sympy.tensor.array import derive_by_array

# If in a jupyter notebook, we may want to render output as LaTeX
from sympy import init_printing
init_printing()

# Description of f
f = sin(x)/x

# A generic Rosenbrock function
# Note the symbols a, b act as parameters, while x and y act as variables
R = (a-x)**2 + b*(y-x**2)**2

We are going to use these functions to perform several common opera-


tions in Calculus.

1. Function operations
Observe how easily we can perform all of the following:
Function evaluation: with the method
.subs({variable1: value1, variable2: value2, ...})
Limits: with the function limit(object, variable, value).

Basic operations: with the usual operators for addition, subtraction, mul-
tiplication and division.
Composition: again with the method .subs().
>>> f.subs({x: pi}) # f (π)
0
>>> f.subs({x: 0}) # f (0) --- returns "not a number"
nan
>>> limit(f, x, 0) # Compute limx→0 f (x) instead
1
>>> (f.subs({x: x+h}) - f)/h # A divided quotient...
(sin(h + x)/(h + x) - sin(x)/x)/h
>>> limit( (f.subs({x: x+h}) - f)/h, h, 0) # ... and its limit as h → 0
(x*cos(x) - sin(x))/x**2

Notice how smart sympy is in regard to the properties of symbols


>>> sqrt(x**2) # Square root of the square of a variable without conditions
sqrt(x**2)
>>> sqrt(a**2) # Square root of the square of a positive variable
a

Directional limits are also possible


>>> limit(1/x, x, 0, dir="+")
oo
>>> limit(1/x, x, 0, dir="-")
-oo

2. Derivatives, Gradients, Hessians


For functions of one variable, to obtain the symbolic derivative of a
function (of any order), we usually employ the method
.diff(variable, order)
For functions of several variables, we employ instead
derive_by_array(function, list-of-variables)
If necessary, we may arrange our outputs as matrices, so we can employ
proper matrix operations with them.
>>> f.diff(x) # f 0 (x) without the need to mess with limits
cos(x)/x - sin(x)/x**2
>>> f.diff(x, 2) # f 00 (x)
(-sin(x) - 2*cos(x)/x + 2*sin(x)/x**2)/x
>>> derive_by_array(R, [x,y]) # The gradient of R, ∇R
[-2*a - 4*b*x*(-x**2 + y) + 2*x, b*(-2*x**2 + 2*y)]
>>> gradient = _ # Store that in the variable 'gradient'
>>> derive_by_array(gradient, [x,y]) # The Hessian of R, HessR
[[8*b*x**2 - 4*b*(-x**2 + y) + 2, -4*b*x], [-4*b*x, 2*b]]
>>> hessian = Matrix(2,2, _) # Store that as a matrix, call it 'hessian'
>>> hessian[0,0] # If we want to access the first entry of the matrix
8*b*x**2 - 4*b*(-x**2 + y) + 2
>>> simplify(_) # Simplify that expression

12*b*x**2 - 4*b*y + 2
>>> Delta1 = _ # Store that value as 'Delta1'
>>> hessian.det() # Compute the determinant of the Hessian
-16*b**2*x**2 + 2*b*(8*b*x**2 - 4*b*(-x**2 + y) + 2)
>>> Delta2 = simplify(_) # Store that value as 'Delta2'

It is then a simple task (in some cases) to search for critical points by
solving symbolically ∇f = 0, and checking whether they are local maxima,
local minima or saddle points.
>>> solve(gradient, [x,y]) # Critical points of R
[(a, a**2)]
>>> crit_points = _ # This is a list. We call it 'crit_points'
>>> for point in crit_points:
... x0,y0 = point
... print(point)
... print("Delta1 = ", Delta1.subs({x:x0, y:y0}))
... print("Delta2 = ", Delta2.subs({x:x0, y:y0}))
...
(a, a**2)
Delta1 = 8*a**2*b + 2
Delta2 = 4*b
>>> 8*a**2*b + 2 > 0 # Is Delta1 > 0? (remember a,b>0)
True
>>> 4*b > 0 # Is Delta2 > 0?
True

The conclusion after this small session is that any Rosenbrock function
R(x, y) = (a − x)² + b(y − x²)² has a local minimum at the point (a, a²);
since R ≥ 0 everywhere and R(a, a²) = 0, this is in fact a global minimum.
A word of warning: symbolic differentiation and manipulation of expressions
may not work in certain cases. For those, numerical approximation is better
suited (and incidentally, that is the reason you are taking this course).
>>> solve(f.diff(x))
NotImplementedError: multiple generators [x, tan(x/2)]
No algorithms are implemented to solve equation
x**2*(-tan(x/2)**2 + 1)/(tan(x/2)**2 + 1) - 2*x*tan(x/2)/(tan(x/2)**2 + 1)
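In this particular case sympy's numerical root finder nsolve can step in: to locate the first positive critical point of f(x) = sin(x)/x, we solve f'(x) = 0 starting from a rough guess read off the graph (the starting value 4 below is such an assumption):

```python
from sympy import symbols, sin, nsolve

x = symbols('x')
f = sin(x)/x

# f'(x) = 0 reduces to tan(x) = x; iterate from the rough guess x = 4
root = nsolve(f.diff(x), x, 4)   # first positive critical point, about 4.4934
```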

3. Integration
Symbolic integration for the computation of antiderivatives is also pos-
sible. Definite integrals, while often computable in the symbolic setting,
are preferably evaluated numerically.
>>> R.integrate(x) # ∫ R(x, y) dx
-a*x**2 + b*x**5/5 + x**3*(-2*b*y/3 + 1/3) + x*(a**2 + b*y**2)
>>> R.integrate(y) # ∫ R(x, y) dy
-b*x**2*y**2 + b*y**3/3 + y*(a**2 - 2*a*x + b*x**4 + x**2)
>>> R.integrate((x, 0, 1)).integrate((y, 0, 1)) # ∫₀¹ ∫₀¹ R(x, y) dx dy
a**2 - a + b/5 + 1/3
>>> f.integrate(x) # ∫ sin(x)/x dx
Si(x)


>>> f.integrate((x, 0, pi)) # ∫₀^π sin(x)/x dx
Si(pi)
>>> _.evalf() # How much is that, actually?
1.85193705198247
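The numerical route for a definite integral such as ∫₀^π sin(x)/x dx usually goes through scipy.integrate.quad (an assumption here: scipy is installed alongside sympy). Since sin(x)/x is undefined at x = 0, we feed quad the equivalent np.sinc(x/π):

```python
import numpy as np
from scipy.integrate import quad

# np.sinc(t) = sin(pi*t)/(pi*t), so np.sinc(x/pi) = sin(x)/x,
# continuously extended with the value 1 at x = 0
value, abserr = quad(lambda x: np.sinc(x/np.pi), 0, np.pi)
# value is approximately Si(pi) = 1.8519370...
```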

4. Sequences, series
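The notes leave this section as a placeholder; as a minimal sketch of what sympy offers here, symbolic sums over an integer index are handled by Sum:

```python
from sympy import symbols, Sum, oo, pi

n = symbols('n', positive=True, integer=True)

# The p-series with p = 2, summed in closed form
s = Sum(1/n**2, (n, 1, oo)).doit()   # pi**2/6

# A geometric series
g = Sum(1/2**n, (n, 1, oo)).doit()   # 1
```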

5. Power series, series expansions


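This section is likewise a placeholder in the notes; the basic tool is the method .series(variable, point, order). A sketch for the function f = sin(x)/x used throughout this appendix:

```python
from sympy import symbols, sin, Rational

x = symbols('x')
f = sin(x)/x

# Taylor expansion of f around x = 0, with remainder O(x**8)
expansion = f.series(x, 0, 8)   # 1 - x**2/6 + x**4/120 - x**6/5040 + O(x**8)
poly = expansion.removeO()      # drop the O(x**8) term
```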
APPENDIX C

Basic graphing in Python

One of the advantages of doing scientific computing with Python is the


wealth of different libraries we may use to represent data and functions
graphically. In this appendix we are going to do a quick overview of some
of the most widely used:
(a) matplotlib
(b) bokeh
(c) plotly

1. matplotlib
The matplotlib libraries are fundamentally focused on 2D plotting.
They are open source, with a license based on the Python Software Foundation
(PSF) license. If you are planning to use them for your scientific production,
it is customary to cite John Hunter’s 2007 seminal paper [12].
In this section we are going to explore just a handful of utilities:
• The module pyplot, which allows us to use a syntax and interface
similar to those of matlab or octave.
• The toolkit mplot3d, which extends matplotlib for simple 3D plotting.
• The toolkits basemap and cartopy, which extend matplotlib for pro-
jection and geographic mapping.
• The toolkit ggplot, for those of you familiar with the R plotting
system.
Let’s start with a simple example of usage of the module pyplot. We
are going to use exclusively the Rosenbrock function R1,1 .
import numpy as np, matplotlib.pyplot as plt

def R(x,y): return (1.0-x)**2 + (y - x**2)**2

For each plot, we usually indicate in our session the intent to create a
figure. At that point it is customary to impose the size of the figure, number
of subplots (if more than one), kind of axis, usage of grid, etc. This is what
we call the layout of our diagrams. For instance, to create a simple plot
(with size 5 × 10.5) of the graph of f(x) = R(x, 1) for −2 ≤ x ≤ 2, but
focusing only on the window [−2.5, 2.5] × [−0.5, 10], we issue the following
commands:
>>> x = np.linspace(-2, 2) # −2 ≤ x ≤ 2


>>> plt.figure(figsize=(5,10.5)); # Create a figure of requested size


... plt.axes(aspect='equal'); # I want x's and y's to be the same size
... plt.grid(); # I want a grid
... plt.xlim(-2.5, 2.5); # Set my window
... plt.ylim(-0.5, 10);
... plt.plot(x, R(x,1.0)); # Just plot it...
... plt.show() # ...and request it shows in screen

See Figure C.1, left.


It is possible to combine several plots on the same figure:
>>> plt.figure(figsize=(5,10.5));
... plt.axes(aspect='equal');
... plt.grid();
... plt.xlim(-2.5, 2.5);
... plt.ylim(-0.5, 10);
... for section in range(1,6): # Do 5 plots: R(x,n)
... plt.plot(x,R(x,section)) # where n=1,2,3,4,5
...
... plt.show()

See Figure C.1, center.


It is hard to see which graph corresponds to which function, in spite of
the different chosen colors. We solve this issue by labeling each graph,
and placing a legend in the diagram:
>>> plt.figure(figsize=(5,10.5));
... plt.axes(aspect='equal');
... plt.grid();
... plt.xlim(-2.5, 2.5);
... plt.ylim(-0.5, 10);
... for section in range(1,6): # The labels go here
... plt.plot(x,R(x,section), label=section)
...
... plt.legend() # The legend is requested here
... plt.show()

See Figure C.1, right.


It is possible to control your plots in finer detail. See for example how we
modify line width, type, color, etc. As an exercise, try to comment each line,
indicating what modifications have been performed on the corresponding
graphs.
>>> plt.figure(); # Let pyplot choose size and layout
... plt.plot(x, R(x,1), 'r-', lw=0.5);
... plt.plot(x, R(x,2), 'b--',lw=2);
... plt.plot(x, R(x,3), 'go');
... plt.plot(x, R(x,4), 'y-.');
... plt.show()

To plot a set of level lines (a contour plot), we can either do it
manually, or request pyplot to do it for us. For instance, to render the level

Figure C.1. Basic rendering of functions with pyplot

Figure C.2. Tinkering with color, style and width of lines in pyplot

lines R1,1 (x, y) = c for c = 1, 2, 3, 5, 10, 15, 20 over the window [−2, 2] ×
[−2, 3], we could issue the following commands:
>>> y = np.linspace(-2,3) # −2 ≤ y ≤ 3
>>> X,Y = np.meshgrid(x,y) # generate the window

>>> plt.figure(); # Let pyplot choose size and layout



... plt.contour(X, Y, R(X,Y), levels=[1,2,3,5,10,15,20]);


... plt.show()

See Figure C.3, left.


Although each level line has been rendered with a different color, it is
hard to see which one is which. Placing a legend on an already cluttered
diagram may not be the best option. Fortunately, there is a neat method
to display relevant information on top of each level line:
>>> plt.figure(); # Let pyplot choose size and layout
... CS = plt.contour(X, Y, R(X,Y), levels=[1,2,3,5,10,15,20]);
... plt.clabel(CS, fontsize=9, inline=1);
... plt.show()

See Figure C.3, right.

Figure C.3. Contour plots with pyplot
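The toolkit mplot3d listed at the beginning of this appendix can render the same Rosenbrock function as a surface. A sketch, writing the figure to a file instead of calling plt.show() so that it also runs headless (the filename is arbitrary):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                  # non-interactive backend
import matplotlib.pyplot as plt

def R(x, y):
    return (1.0 - x)**2 + (y - x**2)**2

x = np.linspace(-2, 2)
y = np.linspace(-2, 3)
X, Y = np.meshgrid(x, y)               # generate the window

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # provided by the mplot3d toolkit
surf = ax.plot_surface(X, Y, R(X, Y))  # surface z = R(x, y)
fig.savefig("rosenbrock_surface.png")
```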
