CAAM 454/554
Matthias Heinkenschloss
Spring 2018
(Generated November 16, 2018)
Chapter 1. Basic Properties and Examples
1.1 Introduction
1.2 Quadratic Optimization Problems and Linear Systems
1.3 Linear Elliptic Partial Differential Equations
1.3.1 Elliptic Partial Differential Equations in 1D
1.3.2 Elliptic Partial Differential Equations in 2D
1.4 An Optimal Control Problem
1.5 A Data Assimilation Problem
1.6 Problems
1.1. Introduction
This chapter introduces examples of linear systems and their basic properties. Later these examples
will be used to illustrate the application of iterative methods to be introduced in Chapters 2 and 3.
The linear system properties will be used to guide the selection of iterative solvers and to analyze
their convergence properties.
The next section explores the connection between convex quadratic optimization problems and
linear systems. Specifically, we will show that the solution of a convex quadratic optimization
problem is characterized by a linear system, namely the necessary and sufficient optimality
conditions. Sections 1.3, 1.4, and 1.5 introduce specific examples of linear systems. Section 1.3
introduces linear systems arising from a finite difference discretization of elliptic partial differential
equations. Sections 1.4 and 1.5 introduce two equality constrained convex quadratic optimization
problems governed by partial differential equations.
0 ≤ (t²/2) (Ax^(∗) − b)^T A (Ax^(∗) − b) − t ‖Ax^(∗) − b‖₂²
  ≤ (t²/2) ‖A‖₂ ‖Ax^(∗) − b‖₂² − t ‖Ax^(∗) − b‖₂²
  = ( t‖A‖₂/2 − 1 ) t ‖Ax^(∗) − b‖₂².

With t = 1/‖A‖₂, the previous inequality implies

0 ≤ − (1/(2‖A‖₂)) ‖Ax^(∗) − b‖₂²,

i.e., Ax^(∗) = b.
Note that the previous theorem establishes the connection between solutions of Ax = b and
minimizers of q(x) = (1/2) x^T A x − b^T x, but it does not establish their existence. If A ∈ R^{n×n} is
symmetric positive semidefinite, a minimizer of q exists if and only if b ∈ R(A). If A ∈ R^{n×n} is
symmetric positive definite, there exists a unique minimizer of q.
Theorem 1.2.2 Let H ∈ R^{n×n} be symmetric and satisfy (1.5), and let A ∈ R^{m×n}, m < n. The
equality constrained quadratic program (1.4) has a solution x ∈ R^n if and only if there exists
λ ∈ R^m such that

\begin{pmatrix} H & A^T \\ A & 0 \end{pmatrix}
\begin{pmatrix} x \\ λ \end{pmatrix}
=
\begin{pmatrix} c \\ b \end{pmatrix}. (1.6)

If, in addition, A ∈ R^{m×n}, m < n, has rank m, then the equality constrained quadratic program (1.4) has a
unique solution x ∈ R^n and there exists a unique vector λ ∈ R^m that satisfies (1.6). See Problem 1.1.
Theorem 1.2.3 If H ∈ Rn×n is symmetric and satisfies (1.7), and if A ∈ Rm×n , m < n, has rank m,
the matrix

K = \begin{pmatrix} H & A^T \\ A & 0 \end{pmatrix} (1.8)
is symmetric indefinite and it has n positive eigenvalues and m negative eigenvalues.
Proof: Let A = U (Σ, 0) V^T be the singular value decomposition of A. Since A has rank m,
Σ ∈ R^{m×m} is a diagonal matrix with positive diagonal entries. We write V = (V₁, V₂) with
V₁ ∈ R^{n×m} and V₂ ∈ R^{n×(n−m)} and define Ĥ11 = V₁^T H V₁, Ĥ12 = V₁^T H V₂, Ĥ22 = V₂^T H V₂. Since
N(A) = R(V₂), (1.7) implies that Ĥ22 is symmetric positive definite.
The matrices K and

\begin{pmatrix} V^T & 0 \\ 0 & U^T \end{pmatrix}
\begin{pmatrix} H & A^T \\ A & 0 \end{pmatrix}
\begin{pmatrix} V & 0 \\ 0 & U \end{pmatrix}
=
\begin{pmatrix} Ĥ11 & Ĥ12 & Σ \\ Ĥ12^T & Ĥ22 & 0 \\ Σ & 0 & 0 \end{pmatrix}
=: K̂

and

\begin{pmatrix} I & 0 & 0 \\ 0 & 0 & I \\ 0 & I & −Ĥ12^T Σ^{−1} \end{pmatrix}
\begin{pmatrix} Ĥ11 & Ĥ12 & Σ \\ Ĥ12^T & Ĥ22 & 0 \\ Σ & 0 & 0 \end{pmatrix}
\begin{pmatrix} I & 0 & 0 \\ 0 & 0 & I \\ 0 & I & −Σ^{−1} Ĥ12 \end{pmatrix}
=
\begin{pmatrix} Ĥ11 & Σ & 0 \\ Σ & 0 & 0 \\ 0 & 0 & Ĥ22 \end{pmatrix}
have the same inertia (i.e., the same number of positive and negative eigenvalues).¹ The eigenvalues
of the last matrix above are equal to the eigenvalues of

\begin{pmatrix} Ĥ11 & Σ \\ Σ & 0 \end{pmatrix} ∈ R^{2m×2m}   and of   Ĥ22 ∈ R^{(n−m)×(n−m)}.
The inertia of

\begin{pmatrix} Ĥ11 & Σ \\ Σ & 0 \end{pmatrix} ∈ R^{2m×2m} (1.9)

is equal to the inertia of

\begin{pmatrix} Σ^{−1} & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} Ĥ11 & Σ \\ Σ & 0 \end{pmatrix}
\begin{pmatrix} Σ^{−1} & 0 \\ 0 & I \end{pmatrix}
=
\begin{pmatrix} Σ^{−1} Ĥ11 Σ^{−1} & I \\ I & 0 \end{pmatrix}. (1.10)
If µ is an eigenvalue of (1.10), then

\begin{pmatrix} Σ^{−1} Ĥ11 Σ^{−1} & I \\ I & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= µ \begin{pmatrix} x \\ y \end{pmatrix}. (1.11)
¹Sylvester’s Law of Inertia states that if A is a symmetric n × n matrix and X is a nonsingular n × n matrix, then
A and X^T A X have the same inertia, i.e., the same numbers of positive, zero, and negative eigenvalues. See, e.g., [GL89,
Thm. 8.1.12].
If µ = 0, then x = 0 and y = 0. Therefore all eigenvalues of (1.10) are non-zero. From (1.11) we
find y = µ^{−1} x and

( Σ^{−1} Ĥ11 Σ^{−1} + µ^{−1} I ) x = µ x.

If Σ^{−1} Ĥ11 Σ^{−1} = W Λ W^T is the eigen-decomposition of Σ^{−1} Ĥ11 Σ^{−1}, then µ² W^T x − µ Λ W^T x − W^T x =
0. Hence, the eigenvalues µ of (1.10) are the roots of

µ² − µ λ_j − 1 = 0, j = 1, . . . , m,

which are

µ_{j±} = λ_j/2 ± sqrt( λ_j²/4 + 1 ), j = 1, . . . , m.

This shows that (1.10) (and therefore (1.9)) has m positive and m negative eigenvalues.
The optimality system (1.6) is a particular type of saddle point system. See, e.g., the survey
paper [BGL05] by Benzi, Golub, and Liesen.
In Sections 1.4 and 1.5 we discuss applications that lead to optimization problems of the type
(1.2) or (1.4).
The system (1.12) is called a two-point boundary value problem (BVP). The conditions (1.12b)
and (1.12c) specify the value of the solution at the boundary and are called Dirichlet boundary
conditions. Other boundary conditions are possible. The function f and the scalars ε, c, r, g₀ and
g₁ are given. We assume that ε > 0, r ≥ 0.
We want to compute an approximate solution of (1.12). We use a so-called finite difference
method to accomplish this. With this approach approximations of the solution y of (1.12) at
specified points 0 = x 0 < x 1 < . . . < x n+1 = 1 are obtained through the solution of a linear system.
We select a grid

0 = x₀ < x₁ < . . . < x_{n+1} = 1

with mesh size

h = 1/(n+1)

and with equidistant points

x_i = i/(n+1) = i h.
Central Differences
g'(x) ≈ ( g(x + h) − g(x − h) ) / (2h). (1.13)
The solution y1, . . . , yn of (1.18) is an approximation of the solution y of (1.12) at the points
x 1, . . . , x n .
Figure 1.1: Solution of the differential equation (1.19) computed using central finite difference
approximation (1.18) with uniform mesh size h = 0.05 (left plot), h = 0.01 (middle plot), h = 0.002
(right plot)
The finite difference scheme (1.16) can be represented by the 3-point stencil

[ −(ε + (h/2) c)/h²     (2ε + h² r)/h²     −(ε − (h/2) c)/h² ]. (1.20)

For many operations, the stencil is all we need if we want to work with the matrix. For example, if
we set y₀ = g₀, y_{n+1} = g₁, then the ith equation in (1.18) reads

−((ε + (h/2) c)/h²) y_{i−1} + ((2ε + h² r)/h²) y_i − ((ε − (h/2) c)/h²) y_{i+1} = f(x_i), i = 1, . . . , n.
The left hand side is obtained by first multiplying the stencil (1.20) component wise with yi−1 yi yi+1
and then summing up the resulting values.
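As an illustration, the following minimal Python sketch (not part of the original notes; the names eps, c, r, f, g0, g1, n are assumptions chosen for this example) assembles the tridiagonal central difference matrix and right-hand side just described and solves the resulting system with a sparse direct solver.

    # Sketch: central finite difference system for -eps*y'' + c*y' + r*y = f on (0,1)
    # with Dirichlet data y(0) = g0, y(1) = g1, following the stencil (1.20).
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def central_fd_system(eps, c, r, f, g0, g1, n):
        h = 1.0 / (n + 1)
        x = np.arange(1, n + 1) * h                # interior grid points x_1, ..., x_n
        lower = -(eps + 0.5 * h * c) / h**2        # coefficient of y_{i-1}
        diag  = (2 * eps + h**2 * r) / h**2        # coefficient of y_i
        upper = -(eps - 0.5 * h * c) / h**2        # coefficient of y_{i+1}
        A = sp.diags([lower * np.ones(n - 1), diag * np.ones(n), upper * np.ones(n - 1)],
                     offsets=[-1, 0, 1], format="csr")
        b = np.array(f(x), dtype=float)
        b[0]  -= lower * g0                        # move known boundary values to the rhs
        b[-1] -= upper * g1
        return A, b

    A, b = central_fd_system(1e-3, 1.0, 0.0, lambda x: np.ones_like(x), 0.0, 0.0, 99)
    y = spla.spsolve(A, b)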
We summarize a few properties of the matrix in (1.18) that will be important later.
• If ε > 0, c = 0, and r ≥ 0, the matrix is symmetric positive definite (see Problem 1.2).
• If ε > 0 and h ≤ 2ε/|c|, the entries a_ij of the matrix in (1.18) satisfy Σ_{j≠i} |a_ij| ≤ |a_ii| for
i = 1, . . . , n, with “<” for i = 1 and i = n. See Problem 1.4. We say the matrix in (1.18) is row-wise
diagonally dominant. Theorem 2.6.5, which will be introduced in the next chapter, will imply
that the matrix in (1.18) is nonsingular if h < 2ε/|c|.
• Another important property of matrices arising from finite difference discretizations is that of
an M-matrix. We will introduce M-matrices in Section 2.6.5. However, we note already
that the matrix (1.18) is only an M-matrix when h < 2ε/|c|. This can be used to explain
the behavior of the finite difference approximations observed in Figure 1.1. Stynes’ paper
[Sty05, Sec. 4] contains a nice discussion of M-matrix properties of the matrix in (1.18) and
we will discuss it in Section 2.7.
Upwind Differences
As we have mentioned before, the finite difference scheme (1.16) with uniform meshes requires
that

h < 2ε/|c|

to avoid artificial oscillations in the computed solution. The problem with the finite difference
scheme (1.16) results from the use of central finite differences (1.15) for c y'(x_i). Instead of the
central finite difference approximation (1.15),

c y'(x_i) ≈ c ( y(x_i + h) − y(x_i − h) ) / (2h),

one uses the upwind (backward) finite difference approximation

c y'(x_i) ≈ c ( y(x_i) − y(x_i − h) ) / h

(for c > 0). The resulting scheme reads

−((ε + h c)/h²) y_{i−1} + ((2ε + h c + h² r)/h²) y_i − (ε/h²) y_{i+1} = f(x_i), i = 1, . . . , n,

and, after moving the boundary values y₀ = g₀ and y_{n+1} = g₁ to the right-hand side, it can be
written as a linear system whose right-hand side is

( f(x₁) + ((ε + h c)/h²) g₀,  f(x₂),  . . . ,  f(x_{n−1}),  f(x_n) + (ε/h²) g₁ )^T. (1.24)
If c = 0, the matrices in (1.18) and (1.24) are identical. Let ε > 0, c > 0, and r ≥ 0. We
have seen that for c ≠ 0 the matrix in (1.18) is row-wise diagonally dominant only if h is small
relative to ε/|c|. In contrast, the matrix in (1.24) is row-wise diagonally dominant for any h > 0,
i.e., if a_ij are the entries of the matrix, then

Σ_{j≠i} |a_ij| ≤ |a_ii| for i = 1, . . . , n,

with “<” for i = 1 and i = n. See Problem 1.4. Theorem 2.6.5, which will be introduced in the
next chapter, will imply that the matrix in (1.24) is nonsingular for all h.
As we have mentioned earlier, the M-matrix property is important for matrices arising from
finite difference discretizations. We will see in Section 2.7 that the matrix (1.24) resulting from the
upwind discretization of the convection term is an M-matrix for any mesh size h > 0. See also
Stynes’ paper [Sty05, Sec. 4].
Figure 1.2 shows the computed finite difference approximations for the equation (1.19) with
parameter ε = 10⁻³ using upwind finite differences (1.24) with uniform mesh size h = 0.05,
h = 0.02, and h = 0.01. The upwind finite difference scheme (1.24) leads to much better results
for larger mesh sizes h than the central finite difference scheme (1.18). In particular, the
approximations computed using upwind finite differences (1.24) are nonnegative for nonnegative
right hand side functions f and boundary data g₀, g₁.
Figure 1.2: Solution of the differential equation (1.19) computed using upwind finite difference
approximation (1.24) with uniform mesh size h = 0.05 (left plot), h = 0.02 (middle plot), h = 0.01
(right plot)
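The upwind system can be assembled in the same way as the central one. The following short Python sketch (again an illustration, not the notes' code; all names are assumptions) builds the matrix and right-hand side of (1.24); note that the resulting matrix is row-wise diagonally dominant for every h > 0.

    # Sketch: upwind finite difference system (1.24) for c > 0.
    import numpy as np
    import scipy.sparse as sp

    def upwind_fd_system(eps, c, r, f, g0, g1, n):
        h = 1.0 / (n + 1)
        x = np.arange(1, n + 1) * h
        lower = -(eps + h * c) / h**2                    # coefficient of y_{i-1}
        diag  = (2 * eps + h * c + h**2 * r) / h**2      # coefficient of y_i
        upper = -eps / h**2                              # coefficient of y_{i+1}
        A = sp.diags([lower * np.ones(n - 1), diag * np.ones(n), upper * np.ones(n - 1)],
                     offsets=[-1, 0, 1], format="csr")
        b = np.array(f(x), dtype=float)
        b[0]  -= lower * g0                              # adds (eps + h*c)/h**2 * g0
        b[-1] -= upper * g1                              # adds eps/h**2 * g1
        return A, b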
y₀ = y_{n+1},   ( y₀ − y_{−1} )/h = ( y_{n+1} − y_n )/h,

which implies

y₀ = y_{n+1},   y_{−1} = y_n. (1.26)
Together with the upwind discretization this yields a linear system whose right-hand side is

( f(x₁), f(x₂), . . . , f(x_n), f(x_{n+1}) )^T. (1.28)
We select a grid
0 = x 1,0 < x 1,1 < . . . < x 1,n1 +1 = 1, 0 = x 2,0 < x 2,1 < . . . < x 2,n2 +1 = 1,
Ay = b. (1.31)
The precise structure of A and b depends on the ordering of the unknowns and equations.
9 10 11 12
5 6 7 8
1 2 3 4
Figure 1.3: Simple 4 × 3 grid with lexicographic ordering of the grid points.
If we order the grid-points lexicographically2, as indicated for a small example in Figure 1.3,
then the vector of unknowns is
y = ( y₁₁, . . . , y_{n₁1}, y₁₂, . . . , y_{n₁2}, . . . . . . , y_{1n₂}, . . . , y_{n₁n₂} )^T (1.32)
and the matrix A can be computed as a Kronecker product involving matrices from the 1D dis-
cretization. For i = 1, 2 define
A_i := \frac{1}{h_i^2} \begin{pmatrix}
2ε + h_i c_i + h_i^2 r & −ε & & & \\
−(ε + h_i c_i) & 2ε + h_i c_i + h_i^2 r & −ε & & \\
 & \ddots & \ddots & \ddots & \\
 & & −(ε + h_i c_i) & 2ε + h_i c_i + h_i^2 r & −ε \\
 & & & −(ε + h_i c_i) & 2ε + h_i c_i + h_i^2 r
\end{pmatrix}. (1.33)
Recall that for matrices C ∈ Rm×n and B ∈ R p×q the Kronecker product C ⊗ B is the matrix
C ⊗ B = \begin{pmatrix}
c_{11} B & \cdots & c_{1n} B \\
\vdots & & \vdots \\
c_{m1} B & \cdots & c_{mn} B
\end{pmatrix} ∈ R^{mp×nq}. (1.34)
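The Kronecker-product assembly is easy to carry out with standard sparse-matrix tools. The following small Python sketch (illustrative only; the placeholder 1D matrices and the combination I_{n₂} ⊗ A₁ + A₂ ⊗ I_{n₁}, which mirrors the pattern of (1.57) later in the notes, are assumptions and do not reproduce the exact formula (1.35)) shows the mechanics and the resulting dimension.

    # Sketch: Kronecker products as in (1.34) and a 2D assembly from 1D matrices.
    import scipy.sparse as sp

    n1, n2 = 4, 3
    A1 = sp.random(n1, n1, density=0.5, format="csr")   # placeholder 1D matrices
    A2 = sp.random(n2, n2, density=0.5, format="csr")
    A  = sp.kron(sp.identity(n2), A1) + sp.kron(A2, sp.identity(n1))
    print(A.shape)   # (n1*n2, n1*n2) = (12, 12)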
Using lexicographic ordering of the grid points, the matrix A in (1.31) is given by
where Ini ∈ Rni ×ni is the ni × ni identity matrix. The matrix A in (1.35) is of the form
A = \begin{pmatrix}
D₁ & −F₁ & & & \\
−E₂ & D₂ & −F₂ & & \\
 & \ddots & \ddots & \ddots & \\
 & & −E_{n₂−1} & D_{n₂−1} & −F_{n₂−1} \\
 & & & −E_{n₂} & D_{n₂}
\end{pmatrix}, (1.36a)
²A vector v is lexicographically less than a vector w if there exists an index j such that v₁ = w₁, . . . , v_{j−1} = w_{j−1} and
v_j < w_j. The grid points shown in Figure 1.3 are in lexicographic order, i.e., the grid point (x_{1,i}, x_{2,i}) is lexicographically
less than (x_{1,k}, x_{2,k}) if and only if i < k.
with

D_i = \begin{pmatrix}
d & −ε/h₁² & & & \\
−(ε + h₁ c₁)/h₁² & d & −ε/h₁² & & \\
 & \ddots & \ddots & \ddots & \\
 & & −(ε + h₁ c₁)/h₁² & d & −ε/h₁² \\
 & & & −(ε + h₁ c₁)/h₁² & d
\end{pmatrix} ∈ R^{n₁×n₁}, i = 1, . . . , n₂, (1.36b)

where

d = (2ε + h₁ c₁)/h₁² + (2ε + h₂ c₂)/h₂² + r,
and

−E_{i+1} = diag( −(ε + h₂ c₂)/h₂², . . . , −(ε + h₂ c₂)/h₂² ) ∈ R^{n₁×n₁}, (1.36c)

−F_i = diag( −ε/h₂², . . . , −ε/h₂² ) ∈ R^{n₁×n₁}. (1.36d)
Example 1.3.2 Consider the convection diffusion equation (1.29) with Ω = (0, 1)², ε = 10⁻⁴,
θ = 47.3°, c = (cos θ, sin θ), r = 0, f = 0, and Dirichlet conditions

g(x₁, x₂) = 1 if x₁ = 0 and x₂ ≤ 0.25,   g(x₁, x₂) = 1 if x₂ = 0,   g(x₁, x₂) = 0 else.
Figure 1.4 shows a sketch of the problem data and Figure 1.5 shows the computed solution.
Figure 1.4: Sketch of the problem data for the 2D advection diffusion equation in Example 1.3.2.
Figure 1.5: Finite difference approximation of the solution to the convection diffusion equation
(1.29) with data specified in Example 1.3.2 computed using an n1 = 10 by n2 = 10 grid.
Other orderings of the grid points are possible. In particular, the red-black (checkerboard)
ordering of the grid points, illustrated in Figure 1.6, will be useful for some iterative methods.
Other orderings correspond to a symmetric permutation
P A P^T P u = P b (1.37)
of the system (1.31). Here the permutation matrix P is derived from the ordering of the nodes. For
example, if the red-black (checkerboard) ordering is used, then for the 4 × 3 grid the permutation
matrix is determined from
P (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)T = (1, 7, 2, 8, 9, 3, 10, 4, 5, 11, 6, 12)T .
5 11 6 12
9 3 10 4
1 7 2 8
Figure 1.6: Simple 4 × 3 grid with red-black ordering of the grid points.
With a red-black ordering of the equations and unknowns the system matrix is of the form

A = \begin{pmatrix} D_r & A_{rb} \\ A_{br} & D_b \end{pmatrix}, (1.38a)

where

D_r = D_b = diag( (2ε + h₁ c₁)/h₁² + (2ε + h₂ c₂)/h₂² + r, . . . , (2ε + h₁ c₁)/h₁² + (2ε + h₂ c₂)/h₂² + r ) ∈ R^{(n₁n₂/2)×(n₁n₂/2)} (1.38b)
(if n1 n2 is odd, Dr has one more column and row than Db ), and the matrices Ar b , Abr have at most
four nonzero entries per row and per column. For the example grid shown in Figure 1.6, these
matrices are
A_{rb} = \begin{pmatrix}
-\frac{ε}{h_1^2} & 0 & -\frac{ε}{h_2^2} & 0 & 0 & 0 \\
-\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2} & 0 & -\frac{ε}{h_2^2} & 0 & 0 \\
-\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2} & -\frac{ε}{h_2^2} & 0 \\
0 & -\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε+h_1 c_1}{h_1^2} & 0 & -\frac{ε}{h_2^2} \\
0 & 0 & -\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε}{h_1^2} & 0 \\
0 & 0 & 0 & -\frac{ε+h_2 c_2}{h_2^2} & -\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2}
\end{pmatrix}, (1.38c)

A_{br} = \begin{pmatrix}
-\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2} & -\frac{ε}{h_2^2} & 0 & 0 & 0 \\
0 & -\frac{ε+h_1 c_1}{h_1^2} & 0 & -\frac{ε}{h_2^2} & 0 & 0 \\
-\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε}{h_1^2} & 0 & -\frac{ε}{h_2^2} & 0 \\
0 & -\frac{ε+h_2 c_2}{h_2^2} & -\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2} & 0 & -\frac{ε}{h_2^2} \\
0 & 0 & -\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2} \\
0 & 0 & 0 & -\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε+h_1 c_1}{h_1^2}
\end{pmatrix}. (1.38d)
As we have seen in the 1D case, for many matrix operations, the stencil is all we need. In fact
if we work with the stencil it is often favorable to store the unknowns as a matrix. If we store
the unknowns in an (n1 + 2) × (n2 + 2) array (we use zero based indexing) and for i ∈ {0, n1 } or
j ∈ {0, n2 } set the values to given boundary data (cf., (1.29b)),
then for i ∈ {1, . . . , n1 }, j ∈ {1, . . . , n2 }, the (i, j)-th equation can be written as
−(ε/h₂²) y_{i,j+1} − ((ε + h₁ c₁)/h₁²) y_{i−1,j} + ( (2ε + h₁ c₁)/h₁² + (2ε + h₂ c₂)/h₂² + r ) y_{i,j}
−(ε/h₁²) y_{i+1,j} − ((ε + h₂ c₂)/h₂²) y_{i,j−1} = f(x_{1,i}, x_{2,j}). (1.39)
The finite difference scheme (1.39) can be represented by the 5-point stencil
\begin{pmatrix}
0 & −ε/h₂² & 0 \\
−(ε + h₁ c₁)/h₁² & (2ε + h₁ c₁)/h₁² + (2ε + h₂ c₂)/h₂² + r & −ε/h₁² \\
0 & −(ε + h₂ c₂)/h₂² & 0
\end{pmatrix}. (1.40)
The left hand side in (1.39) is obtained by first multiplying the stencil (1.40) component wise with
\begin{pmatrix}
y_{i−1,j+1} & y_{i,j+1} & y_{i+1,j+1} \\
y_{i−1,j} & y_{i,j} & y_{i+1,j} \\
y_{i−1,j−1} & y_{i,j−1} & y_{i+1,j−1}
\end{pmatrix}
and then summing up the resulting values.
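The stencil-based, matrix-free application of (1.39) can be written very compactly with array slicing. The following Python sketch is an illustration under the assumption that the unknowns are stored in an (n₁+2) × (n₂+2) array whose boundary entries hold the Dirichlet data; the function name and arguments are not from the notes.

    # Sketch: apply the 5-point stencil (1.40) to the interior of a 2D array y.
    import numpy as np

    def apply_stencil(y, eps, c1, c2, r, h1, h2):
        d = (2*eps + h1*c1)/h1**2 + (2*eps + h2*c2)/h2**2 + r
        w = -(eps + h1*c1)/h1**2      # weight of the west  neighbor y_{i-1,j}
        e = -eps/h1**2                # weight of the east  neighbor y_{i+1,j}
        s = -(eps + h2*c2)/h2**2      # weight of the south neighbor y_{i,j-1}
        n = -eps/h2**2                # weight of the north neighbor y_{i,j+1}
        yc = y[1:-1, 1:-1]
        return (d*yc + w*y[:-2, 1:-1] + e*y[2:, 1:-1]
                     + s*y[1:-1, :-2] + n*y[1:-1, 2:])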
where the parameters ε > 0, r ≥ 0, c = (c₁, c₂)^T with c₁, c₂ ≥ 0 and the functions f, g are given,
but the function u on Ω_c has to be selected. Here χ_{Ω_c} is the indicator function with χ_{Ω_c}(x) = 1
if x ∈ Ω_c and χ_{Ω_c}(x) = 0 otherwise. Given another subset Ω_o ⊂ Ω, we want to find a function u
such that the corresponding solution y of the PDE (1.41) is as close as possible to a desired function y_des.
For example (1.41) could model the temperature distribution in a convection oven and u would be a
volumetric heat source provided through heaters located at Ωc . We want to heat the oven to achieve
a desired temperature y des in the region Ωo of the oven.
We will model ‘as close as possible’ in the least squares sense and formulate the problem as the
minimization problem
minimize (1/2) ∫_{Ω_o} ( y(x) − y_des(x) )² dx + (α/2) ∫_{Ω_c} u(x)² dx (1.42a)

subject to −ε Δy(x) + c · ∇y(x) + r y(x) = f(x) + u(x) χ_{Ω_c}(x), x ∈ Ω, (1.42b)

y(x) = g(x), x ∈ ∂Ω. (1.42c)

The term (α/2) ∫_{Ω_c} u(x)² dx with α > 0 in the objective penalizes excessively large |u|. We refer to u
as the control, to y as the state, and to (1.42b,c) as the state equation.
0 = x 1,0 < x 1,1 < . . . < x 1,n1 +1 = 1, 0 = x 2,0 < x 2,1 < . . . < x 2,n2 +1 = 1,
We assume that the control region Ωc and the observation region Ωo are rectangles with corners
that coincide with grid points. For example Ωc = (x 1,ν, x 1,µ ) × (x 2,k , x 2,l ).
Applying the discretization (1.30) to (1.42b,c) leads to the system
−ε ( y_{i−1,j} − 2y_{ij} + y_{i+1,j} )/h₁² − ε ( y_{i,j−1} − 2y_{ij} + y_{i,j+1} )/h₂²
+ c₁ ( y_{ij} − y_{i−1,j} )/h₁ + c₂ ( y_{ij} − y_{i,j−1} )/h₂ + r y_{ij}
= f(x_{1,i}, x_{2,j}) + u_{ij} χ_{Ω_c}(x_{1,i}, x_{2,j}), (1.43a)
for i = 1, . . . , n₁, j = 1, . . . , n₂,

y_{ij} = g(x_{1,i}, x_{2,j}), if i ∈ {0, n₁} or j ∈ {0, n₂}. (1.43b)
These equations can be arranged into a linear system
Ay + Bu = c. (1.44)
We can use (1.43b) to eliminate the y_{ij} corresponding to boundary points, as we have done in
the previous section. In this case the number of y variables is n_y = (n₁ − 1)(n₂ − 1). Here
we include all (n₁ + 1)(n₂ + 1) equations (1.43) into (1.44). Thus, the number of y variables is
n_y = (n₁ + 1)(n₂ + 1). In either case, the number of u variables, n_u, is the number of grid points
in Ω_c.
Recall that the control region Ωc and the observation region Ωo are rectangles with corners that
coincide with grid points. The integrals are discretized using
∫_{Ω_o} ( y(x) − y_des(x) )² dx ≈ h₁ h₂ Σ_{(x_{1,i},x_{2,j}) ∈ Ω_o} ( y_{ij} − y_des(x_{1,i}, x_{2,j}) )²
= ( y − y_des )^T Q ( y − y_des ), (1.45a)

∫_{Ω_c} u(x)² dx ≈ h₁ h₂ Σ_{(x_{1,i},x_{2,j}) ∈ Ω_c} u_{ij}² = u^T R u. (1.45b)
In particular Q ∈ Rny ×ny and R ∈ Rnu ×nu are diagonal matrices with diagonal entries h1 h2 .
Combining (1.43)–(1.45) leads to the following discretization of (1.42).
Minimize (1/2) ( y − y_des )^T Q ( y − y_des ) + (α/2) u^T R u, (1.46a)

subject to A y + B u = c. (1.46b)
Obviously, (1.46) is a special case of (1.4),
minimize (1/2) \begin{pmatrix} y \\ u \end{pmatrix}^T \begin{pmatrix} Q & 0 \\ 0 & αR \end{pmatrix} \begin{pmatrix} y \\ u \end{pmatrix}
− \begin{pmatrix} Q y_des \\ 0 \end{pmatrix}^T \begin{pmatrix} y \\ u \end{pmatrix}
+ (1/2) y_des^T Q y_des,

subject to ( A  B ) \begin{pmatrix} y \\ u \end{pmatrix} = c.

The constant (1/2) y_des^T Q y_des in the objective can be dropped since it does not change the solution y, u.
The matrix A is invertible. Therefore, we can eliminate y via

y = −A^{-1} B u + A^{-1} c

and obtain an unconstrained problem in u alone,

minimize (1/2) u^T H u + d^T u + γ, (1.47)

where H = B^T A^{-T} Q A^{-1} B + α R and the vector d and the scalar γ collect the terms that are linear
and constant in u. Note that while A, B, Q, R are sparse matrices, H is dense and in general it
is expensive to form the matrix H explicitly. Instead, we will construct methods that solve (1.47)
iteratively and in each iteration require the computation of one matrix-vector product Hv, where
the vector v is determined by the iterative method.
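The matrix-vector product Hv can be formed without ever assembling H. The following Python sketch illustrates this, under the assumption that A, B, Q, R are given sparse matrices and α a given scalar; the factorization of A is computed once and reused in every product.

    # Sketch: matrix-free product v -> Hv = B^T A^{-T} Q A^{-1} B v + alpha * R v.
    import scipy.sparse.linalg as spla

    def make_apply_H(A, B, Q, R, alpha):
        lu = spla.splu(A.tocsc())                 # factor A once
        def apply_H(v):
            y = lu.solve(B @ v)                   # y = A^{-1} B v
            w = lu.solve(Q @ y, trans="T")        # w = A^{-T} Q y
            return B.T @ w + alpha * (R @ v)
        return apply_H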
∂y/∂t (x, t) − α ∂²y/∂x² (x, t) + β ∂y/∂x (x, t) = 0, x ∈ (0, 1), t ∈ (0, T), (1.48a)
y(0, t) = y(1, t), t ∈ (0, T), (1.48b)
y_x(0, t) = y_x(1, t), t ∈ (0, T), (1.48c)
y(x, 0) = y₀(x), x ∈ (0, 1). (1.48d)
We assume that α > 0 and β > 0 are known. In this example we use
α = 0.01, β = 1, T = 0.5.
We want to determine the initial data y0 from measurements of the solution y(x, t) at certain points
in space and in time. We consider a discretization of this problem.
First, we discretize the boundary value problem (1.48) in space using the upwind finite difference
method (1.26,1.27). We divide [0, 1] into n subintervals with length h = 1/n and gridpoints x i = ih,
i = 0, . . . , n. The upwind finite difference method (1.26,1.27) leads to the system of ordinary
differential equations (ODEs)
y'(t) = K y(t), t ∈ (0, T), y(0) = y₀,

where

K = \frac{1}{h^2} \begin{pmatrix}
−2α − βh & α & & & α + βh \\
α + βh & −2α − βh & α & & \\
 & \ddots & \ddots & \ddots & \\
 & & α + βh & −2α − βh & α \\
α & & & α + βh & −2α − βh
\end{pmatrix} ∈ R^{n×n},
y(t) = exp(Kt)y0,
where exp(Kt) ∈ Rn×n is the matrix exponential of Kt. For small n it can be evaluated using
Matlab’s expm, e.g., expm(K ∗ t).3 For larger problems we need to apply an ODE solver. We use
the Crank-Nicolson (trapezoidal) method.
We subdivide the time interval [0, T] into nt subintervals of equal length ∆t = T/nt and we set
t j = j∆t, j = 0, . . . , nt . The Crank-Nicolson (trapezoidal) scheme is given by
(1/Δt) ( y_{j+1} − y_j ) = (1/2) ( K y_{j+1} + K y_j ), j = 0, . . . , n_t − 1.

The vector y_j is an approximation of y(t_j). Rearranging terms shows that for a given y₀ we can
compute y_{j+1}, j = 0, . . . , n_t − 1, by successively solving

( I − (Δt/2) K ) y_{j+1} = ( I + (Δt/2) K ) y_j, j = 0, . . . , n_t − 1. (1.50)
We use discretization parameters
(Figure: surface plot of the computed solution over x and t.)
3 NOTE: exp(K ∗ t) is different from expm(K ∗ t) and the former does not give the matrix exponential, but evaluates
the exponential of the matrix entries.
Now, suppose that we do not know y0 . We want to estimate y0 from measurements of the
computed solution. To specify the spatial measurement, we let m be such that n/m is integer and
we define an observation matrix H ∈ R^{m×n} with entries H_{ij} = 1 if j = (n/m) i and H_{ij} = 0 otherwise.
with

A = \begin{pmatrix}
H ( (I − (Δt/2) K)^{-1} (I + (Δt/2) K) )^{n_t/m_t} \\
\vdots \\
H ( (I − (Δt/2) K)^{-1} (I + (Δt/2) K) )^{n_t}
\end{pmatrix} ∈ R^{m_t m × n}, (1.54b)

b = \begin{pmatrix} z₁ \\ \vdots \\ z_{m_t} \end{pmatrix} ∈ R^{m_t m}. (1.54c)
The formulation (1.53) of the least squares problem uses (1.52), which is fine for theoretical
purposes, but not something that should be used to implement the problem. For the solution of
(1.53) we never compute A, but we use methods that require the action of A on a vector v ∈ R^n and
the action of A^T on a vector w ∈ R^{m_t m}.
For a given vector v ∈ R^n we compute w = A v ∈ R^{m_t m} as follows:
1. Set y₀ = v ∈ R^n.
2. For j = 0, . . . , m_t − 1 do
   perform n_t/m_t steps of (1.50) and set w_{j+1} = H y_{(j+1) n_t/m_t}.
   end
3. Set w = ( w₁^T, . . . , w_{m_t}^T )^T.
Of course, in an implementation we do not generate n_t arrays to store the y_j's, but we use only one
array for the current y_j.
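The procedure above translates directly into code. The Python sketch below is an illustration (not the notes' implementation); K, H, dt, nt, mt are assumed to be given, nt is assumed to be a multiple of mt, and only one array is kept for the current time step.

    # Sketch: compute w = A v for the observation operator (1.54).
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def apply_A(v, K, H, dt, nt, mt):
        n = K.shape[0]
        I = sp.identity(n, format="csc")
        Mminus = spla.splu((I - 0.5 * dt * K).tocsc())
        Mplus = I + 0.5 * dt * K
        y = v.copy()                       # y_0 = v
        blocks = []
        for j in range(nt):
            y = Mminus.solve(Mplus @ y)    # advance one Crank-Nicolson step
            if (j + 1) % (nt // mt) == 0:
                blocks.append(H @ y)       # w_k = H y_{k*nt/mt}
        return np.concatenate(blocks)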
The transpose of A is given by

A^T = ( ( (I + (Δt/2)K)^T (I − (Δt/2)K)^{-T} )^{n_t/m_t} H^T, . . . , ( (I + (Δt/2)K)^T (I − (Δt/2)K)^{-T} )^{n_t} H^T ) ∈ R^{n × m_t m}.

The action of A^T on a vector w = ( w₁^T, . . . , w_{m_t}^T )^T is computed analogously by a backward sweep,
for j = m_t − 1, . . . , 0.
Now, we want to recover the initial condition y₀^ex from measurements of the solution. We set

m = 5 and m_t = 25,

and generate observations

z_k = H y^ex_{k n_t/m_t} + η_k, k = 1, . . . , m_t,

where η_k represents noise. We use 1% normally distributed noise.
We use the conjugate gradient method to solve the resulting least squares problem (1.53). The
exact initial data and our estimate of the initial data computed from noisy observations are shown
in Figure 1.8.
Figure 1.8: True initial data and estimated initial data. The estimated initial data are computed by
solving the least squares problem (1.53).
The least squares problem (1.53) is highly ill-conditioned. Thus small errors in the observations
can lead to large errors in the computed solution y₀^comp. This is what we have seen in Figure 1.8.
To remedy this situation, one can regularize the problem, i.e., replace (1.53) by
min_{y₀ ∈ R^n} (1/2) \left\| \begin{pmatrix}
H ( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t/m_t} \\
\vdots \\
H ( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t}
\end{pmatrix} y₀ − \begin{pmatrix} z₁ \\ \vdots \\ z_{m_t} \end{pmatrix} \right\|_2^2 + (ρ/2) \| W y₀ \|_2^2, (1.55)
where ρ > 0 is a regularization parameter and W ∈ R n×n is a given matrix. The regularized least
squares problems can also be written as
min_{y₀ ∈ R^n} ‖ A y₀ − b ‖₂²,
where now
A = \begin{pmatrix}
H ( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t/m_t} \\
\vdots \\
H ( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t} \\
\sqrt{ρ} W
\end{pmatrix} ∈ R^{(m_t m + n) × n},   b = \begin{pmatrix} z₁ \\ \vdots \\ z_{m_t} \\ 0 \end{pmatrix} ∈ R^{m_t m + n}.
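One way to solve the regularized problem, consistent with the matrix-free point of view above, is to apply the conjugate gradient method to the normal equations (AᵀA + ρ WᵀW) y₀ = Aᵀb. The Python sketch below is an illustration only; apply_A and apply_AT (the actions of A and Aᵀ, e.g. as sketched earlier), the dimension n, and W are assumed to be available.

    # Sketch: CG on the normal equations of the regularized problem (1.55).
    import numpy as np
    import scipy.sparse.linalg as spla

    def solve_regularized(apply_A, apply_AT, b, n, rho, W=None):
        if W is None:
            W = np.eye(n)                           # W = I as in the example below
        def normal_op(y):
            return apply_AT(apply_A(y)) + rho * (W.T @ (W @ y))
        Aop = spla.LinearOperator((n, n), matvec=normal_op)
        rhs = apply_AT(b)
        y0, info = spla.cg(Aop, rhs)
        return y0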
Introductions to the regularization of inverse problems are given, e.g., in the books by Tarantola
[Tar05] and Vogel [Vog02].
We take the same data as above and estimate the initial data by solving the regularized least
squares problem (1.55) with W = I and ρ = 10−2 . The regularized least squares problem is
again solved using the conjugate gradient method. Figure 1.9 shows the exact initial data and the
estimated initial data. This regularization gives an excellent estimate.
Figure 1.9: True initial data and estimated initial data. The estimated initial data are computed by
solving the regularized least squares problem (1.55) with ρ = 10⁻².
Everything we have done can be applied to other linear time-dependent PDEs. As an example,
we consider a two-dimensional analogue of (1.48) with periodic boundary conditions in the x₁ direction
and homogeneous Dirichlet boundary conditions in the x₂ direction. We assume that α > 0 and β > 0 are known. In this example we use
α = 0.01, β = 1, T = 0.5.
We discretize the spatial domain using n1 subintervals of length h1 = 1/n1 in the x 1 direction
and n2 + 1 subintervals of length h2 = 1/(n2 + 1) in the x 2 direction. If we define
K₁ = \frac{1}{h_1^2} \begin{pmatrix}
−2α − βh₁ & α & & & α + βh₁ \\
α + βh₁ & −2α − βh₁ & α & & \\
 & \ddots & \ddots & \ddots & \\
 & & α + βh₁ & −2α − βh₁ & α \\
α & & & α + βh₁ & −2α − βh₁
\end{pmatrix} ∈ R^{n₁×n₁},

K₂ = \frac{1}{h_2^2} \begin{pmatrix}
−2α & α & & \\
α & −2α & α & \\
 & \ddots & \ddots & \ddots \\
 & & α & −2α
\end{pmatrix} ∈ R^{n₂×n₂},
and use a lexicographic ordering of unknowns

y(t) = ( y₁₁(t), . . . , y_{n₁1}(t), y₁₂(t), . . . , y_{n₁2}(t), . . . . . . , y_{1n₂}(t), . . . , y_{n₁n₂}(t) )^T,

then the matrix of the semi-discretized system is

K = I_{n₂} ⊗ K₁ + K₂ ⊗ I_{n₁}. (1.57)
To construct an observation matrix, let m₁, m₂ be integers such that n₁/m₁ and n₂/m₂ are
integers and for ℓ = 1, 2 define H_ℓ ∈ R^{m_ℓ × n_ℓ} with entries defined as in the 1D case. We use

m₁ = m₂ = 5 and m_t = 25,

and generate observations

z_k = H y^ex_{k n_t/m_t} + η_k, k = 1, . . . , m_t,

where η_k represents noise. We use 1% normally distributed noise. We use the conjugate gradient
method to solve the resulting least squares problem (1.53). The exact initial data and our
estimate of the initial data computed from noisy observations are shown in Figure 1.11. In this case
the standard least squares problem (1.53) provides a good estimate.
1.6. Problems
Problem 1.1
i. Let H ∈ Rn×n be symmetric and satisfy
vT Hv > 0 for all v ∈ N ( A) \ {0},
and let A ∈ Rm×n , m < n, be a matrix with rank m.
Show that the equality constrained quadratic program (1.4) has a solution x ∈ R^n if and only
if there exists λ ∈ R^m such that (1.6) is satisfied. Moreover, show that x and λ are unique.
Hint: Since A has rank m, there exists an m × m invertible submatrix B of A. Without
loss of generality assume that the first m columns of A are linearly independent, i.e., that
A = (B | N ) with B ∈ Rm×m invertible and N ∈ Rm×(n−m) . Write
x = \begin{pmatrix} x_B \\ x_N \end{pmatrix}, x_B ∈ R^m, x_N ∈ R^{n−m},
and convert the equality constrained quadratic program (1.4) into an unconstrained quadratic
program in x N .
ii. Let H ∈ Rn×n be symmetric and satisfy
vT Hv ≥ 0 for all v ∈ N ( A).
If (1.4) has a solution, is it unique?
iii. Let A ∈ Rm×n , m < n, have rank r < m and let b ∈ R ( A). Is the vector λ of Lagrange
multipliers unique?
Problem 1.2 Let α1, α2 be real numbers. Verify that the eigenvalues of the n × n matrix
A = \begin{pmatrix}
α₁ & α₂ & & & \\
α₂ & α₁ & α₂ & & \\
 & \ddots & \ddots & \ddots & \\
 & & α₂ & α₁ & α₂ \\
 & & & α₂ & α₁
\end{pmatrix} (1.60)

are given by

λ_j = α₁ + 2α₂ cos( jπ/(n+1) ), j = 1, . . . , n,

and that an eigenvector associated with the eigenvalue λ_j is

v_j = sqrt( 2/(n+1) ) ( sin( jπ/(n+1) ), sin( j 2π/(n+1) ), . . . , sin( j nπ/(n+1) ) )^T.

Moreover, show that v_i^T v_j = 0 for i ≠ j, and ‖v_i‖₂ = 1.
Problem 1.3 Let α1, α2, α3 be real numbers. Verify that the eigenvalues of the n × n matrix
A = \begin{pmatrix}
α₁ & α₃ & & & \\
α₂ & α₁ & α₃ & & \\
 & \ddots & \ddots & \ddots & \\
 & & α₂ & α₁ & α₃ \\
 & & & α₂ & α₁
\end{pmatrix} (1.61)

are given by

λ_j = α₁ + 2α₃ sqrt( α₂/α₃ ) cos( jπ/(n+1) ), j = 1, . . . , n,

and that a (non-normalized) eigenvector associated with the eigenvalue λ_j is

v_j = ( (α₂/α₃)^{1/2} sin( jπ/(n+1) ), (α₂/α₃)^{2/2} sin( j 2π/(n+1) ), . . . , (α₂/α₃)^{n/2} sin( j nπ/(n+1) ) )^T.
Problem 1.4
ii. Let a_ij be the entries of the matrix in (1.24). Show that if ε > 0 and c, r ≥ 0, then for any h > 0,

Σ_{j≠i} |a_ij| ≤ |a_ii| for i = 1, . . . , n,
Problem 1.5 We study the eigenvalues of AT A for the matrix that arises in the least squares
problem (1.54) for a slightly simplified problem.
We observe the ODE solution at every grid point (i.e., m = n and the observation matrix is
H = I ∈ R^{n×n}) and at time steps n_t/m_t, 2n_t/m_t, . . . , n_t, where m_t is such that n_t/m_t is an integer. Hence,
the matrix A in (1.54b) becomes
A = \begin{pmatrix}
( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t/m_t} \\
\vdots \\
( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t}
\end{pmatrix} ∈ R^{m_t n × n}.
i. Determine AT A.
ii. Suppose there exists an orthonormal matrix V ∈ Rn×n and a diagonal matrix D ∈ Rn×n such
that
K = V DV T .
The diagonal entries of D are the eigenvalues of K and the columns of V are the corresponding
eigenvectors.
What are the eigenvalues and eigenvectors of ( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^ℓ, ℓ ∈ N?
and of the corresponding AT A, obtained using T = 0.5, n = 100, nt = 50 (∆t = 0.01), and
mt = 25.
Note K results from the finite difference discretization of (1.48) with α = 1 and β = 0.
[BGL05] M. Benzi, G. H. Golub, and J. Liesen. Numerical solution of saddle point problems.
In A. Iserles, editor, Acta Numerica 2005, pages 1–137. Cambridge University Press,
Cambridge, London, New York, 2005.
[ESW05] H. C. Elman, D. J. Silvester, and A. J. Wathen. Finite Elements and Fast Iterative
Solvers with Applications in Incompressible Fluid Dynamics. Oxford University Press,
Oxford, 2005.
[GL89] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3 of Johns Hopkins
Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD,
second edition, 1989.
[HPUU09] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich. Optimization with PDE Constraints,
volume 23 of Mathematical Modelling, Theory and Applications. Springer Verlag, Heidelberg,
New York, Berlin, 2009. URL: http://dx.doi.org/10.1007/978-1-4020-8839-1,
doi:10.1007/978-1-4020-8839-1.
[LKM10] W. Lahoz, B. Khattatov, and R. Menard, editors. Data Assimilation: Making Sense of
Observations. Springer, Berlin, Heidelberg, 2010. URL: http://dx.doi.org/10.1007/978-3-540-74703-1,
doi:10.1007/978-3-540-74703-1.
[Saa03] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, second
edition, 2003.
[Tar05] A. Tarantola. Inverse problem theory and methods for model parameter estimation.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2005.
[Trö10] F. Tröltzsch. Optimal Control of Partial Differential Equations: Theory, Methods and
Applications, volume 112 of Graduate Studies in Mathematics. American Mathemat-
ical Society, Providence, RI, 2010. URL: http://dx.doi.org/10.1090/gsm/112.
Chapter 2. Stationary Iterative Methods

2.1. Introduction
In this section we study linear fixed point iterative methods for the solution of square systems of
linear equations
Ax = b. (2.1)
Many methods discussed in this section are derived from a splitting of the matrix A. Let
A=M−N (2.2)
with nonsingular M ∈ Rn×n . Since M is nonsingular,
Ax = b if and only if x = M −1 N x + M −1 b.
Thus x solves the linear system Ax = b if and only if x is a fixed point of the map x 7→
M −1 N x + M −1 b. We can try to find a fixed point using the fixed point iteration
x^{(k+1)} = M^{-1} ( N x^{(k)} + b ), (2.3)

which, since N = M − A, may also be written as

x^{(k+1)} = M^{-1} N x^{(k)} + M^{-1} b = ( I − M^{-1} A ) x^{(k)} + M^{-1} b. (2.4)
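The iteration (2.4) is equivalently x^{(k+1)} = x^{(k)} + M^{-1}(b − A x^{(k)}). A generic Python sketch of this form is given below as an illustration; solve_M is assumed to apply M^{-1} (for example a diagonal or triangular solve), and the names are not from the notes.

    # Sketch: generic stationary (splitting) iteration x_{k+1} = x_k + M^{-1}(b - A x_k).
    import numpy as np

    def stationary_iteration(A, b, solve_M, x0, maxit=100, tol=1e-8):
        x = x0.copy()
        for k in range(maxit):
            r = b - A @ x                       # residual
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            x = x + solve_M(r)                  # x^{(k+1)} = x^{(k)} + M^{-1} r^{(k)}
        return x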
We discuss several stationary iterative methods and their convergence properties. We focus
on classical stationary iterative methods like the Jacobi method, the Gauss-Seidel method, and the
successive overrelaxation (SOR) method, but we also touch upon multigrid methods and domain
decomposition methods. We present several convergence results for the Jacobi method, the Gauss-
Seidel method, and the SOR method. Many of these convergence results fit beautifully with the
properties of systems arising from discretizations of PDEs and we will use the examples from
Sections 1.3.1 and 1.3.2 to illustrate the convergence behavior of these stationary iterative methods.
Finally, we will show that for systems with a symmetric positive definite matrix, the Jacobi and the
Gauss-Seidel method can also be interpreted as particular coordinate descent minimization methods.
The methods introduced in this section generate a sequence of approximations x (k) ∈ Rn to the
solution of the linear system (2.1). We use superscripts (k) to denote the k-th iteration. The vector
x (k) has components x i(k) , i = 1, . . . , n.
end
The (pointwise) forward Gauss-Seidel (GS) Method is derived from the (pointwise) Jacobi
Method by using new information as soon as it becomes available, and it is given as follows.
For i = 1, . . . , n do

x_i^{(k+1)} = (1/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij x_j^{(k+1)} − Σ_{j=i+1}^{n} a_ij x_j^{(k)} ) (2.6)

end
In an implementation of the Gauss-Seidel Method, only one array is needed to store x, since x i(k)
can be overwritten by x i(k+1) as soon as it becomes available.
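The following short Python sketch (illustrative, for a dense matrix A) shows one Jacobi sweep and one forward Gauss-Seidel sweep of the form (2.6); in the Gauss-Seidel sweep the vector x is overwritten in place, so the updated components are used as soon as they are available.

    # Sketch: one Jacobi sweep and one forward Gauss-Seidel sweep.
    import numpy as np

    def jacobi_sweep(A, b, x):
        x_new = np.empty_like(x)
        for i in range(len(b)):
            s = A[i, :] @ x - A[i, i] * x[i]              # sum over j != i with old x
            x_new[i] = (b[i] - s) / A[i, i]
        return x_new

    def gauss_seidel_sweep(A, b, x):
        for i in range(len(b)):
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]   # uses already updated x[:i]
            x[i] = (b[i] - s) / A[i, i]
        return x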
Note that the Jacobi method is independent of a symmetric ordering of equations and unknowns
in the sense that if Π is a permutation matrix, then the Jacobi method applied to Ax = b is identical
to the Jacobi method applied to
Π AΠT Πx = Πb.
The Gauss-Seidel method, however, depends on the ordering of equations and unknowns, since it
uses new information as soon as it is computed. For example, if we start the Gauss-Seidel method
with the last equation and unknown and work backwards we obtain the (pointwise) backward
Gauss-Seidel (GS) Method
For i = n, n − 1, . . . , 1 do
x_i^{(k+1)} = (1/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij x_j^{(k)} − Σ_{j=i+1}^{n} a_ij x_j^{(k+1)} ) (2.7)

end
Let x^GS be the (pointwise) forward Gauss-Seidel iterate given by (2.6). A new iteration is
obtained if we choose ω > 0 and set

x_i^{(k+1)} = x_i^{(k)} + ω ( x_i^GS − x_i^{(k)} ) = ω x_i^GS + (1 − ω) x_i^{(k)}, i = 1, . . . , n. (2.8)

Written out, this iteration, the (pointwise) forward Successive Over Relaxation (SOR) method, reads

For i = 1, . . . , n do

x_i^{(k+1)} = (ω/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij x_j^{(k+1)} − Σ_{j=i+1}^{n} a_ij x_j^{(k)} ) + (1 − ω) x_i^{(k)} (2.9)

end
Like the (pointwise) Gauss-Seidel method, the (pointwise) Successive Over Relaxation (SOR) also
depends on the ordering of the equations and unknowns. In principle, it is possible to do the same
with the Jacobi method. For example, if x J is the (pointwise) Jacobi iterate given by (2.5), then we
can generate a new iterate using
x_i^{(k+1)} = x_i^{(k)} + ω ( x_i^J − x_i^{(k)} ) = ω x_i^J + (1 − ω) x_i^{(k)}, i = 1, . . . , n. (2.10)

In the literature, this iteration is usually referred to as the (pointwise) damped Jacobi Method.
Using the definition (2.5) of the Jacobi iterate, it can be written as follows.
For i = 1, . . . , n do
x_i^{(k+1)} = (ω/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij x_j^{(k)} − Σ_{j=i+1}^{n} a_ij x_j^{(k)} ) + (1 − ω) x_i^{(k)} (2.11)

end
The Jacobi, Gauss–Seidel, and SOR method can be expressed using matrix-vector notation.
We split
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1,n−1} & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2,n−1} & a_{2n} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_{n−1,1} & a_{n−1,2} & \cdots & a_{n−1,n−1} & a_{n−1,n} \\
a_{n1} & a_{n2} & \cdots & a_{n,n−1} & a_{nn}
\end{pmatrix} (2.12a)

into its diagonal

D = \begin{pmatrix}
a_{11} & 0 & \cdots & 0 & 0 \\
0 & a_{22} & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & a_{n−1,n−1} & 0 \\
0 & 0 & \cdots & 0 & a_{nn}
\end{pmatrix}, (2.12b)

the strict lower triangular part −E and the strict upper triangular part −F,

−E = \begin{pmatrix}
0 & 0 & \cdots & 0 & 0 \\
a_{21} & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_{n−1,1} & a_{n−1,2} & \cdots & 0 & 0 \\
a_{n1} & a_{n2} & \cdots & a_{n,n−1} & 0
\end{pmatrix},
−F = \begin{pmatrix}
0 & a_{12} & \cdots & a_{1,n−1} & a_{1n} \\
0 & 0 & \cdots & a_{2,n−1} & a_{2n} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & a_{n−1,n} \\
0 & 0 & \cdots & 0 & 0
\end{pmatrix}. (2.12c)

That is

A = D − E − F. (2.12d)
This leads to the following representations of the previous methods:
(pointwise) Jacobi:

x^{(k+1)} = D^{-1} ( (E + F) x^{(k)} + b ) = ( I − D^{-1} A ) x^{(k)} + D^{-1} b, (2.13)

(pointwise) forward GS:

x^{(k+1)} = (D − E)^{-1} ( F x^{(k)} + b ), (2.14)

(pointwise) backward GS:

x^{(k+1)} = (D − F)^{-1} ( E x^{(k)} + b ), (2.15)

(pointwise) forward SOR:

x^{(k+1)} = (D − ωE)^{-1} ( [ωF + (1 − ω)D] x^{(k)} + ωb ), (2.16)

and damped Jacobi:

x^{(k+1)} = ( I − ωD^{-1} A ) x^{(k)} + ωD^{-1} b. (2.17)
The Jacobi method (2.13), the forward GS (2.14), the backward GS (2.15), and the forward SOR
(2.16) are special cases of (2.3) with the following splittings A = M − N:

Jacobi: M = D, N = E + F. (2.18a)
forward GS: M = D − E, N = F. (2.18b)
backward GS: M = D − F, N = E. (2.18c)
forward SOR: M = (1/ω) ( D − ωE ), N = (1/ω) ( (1 − ω) D + ωF ). (2.18d)
damped Jacobi: M = (1/ω) D, N = (1/ω) D − A. (2.18e)
We can also derive block versions of the Jacobi, Gauss–Seidel, and SOR method. Suppose that
and we define
If we use the matrices D, E, F in (2.19) in the equations (2.13), (2.14), (2.15), (2.16), and (2.3) we
obtain the block Jacobi Method the block forward GS Method, the block backward GS Method, and
the block forward SOR Method, respectively. Each iteration of these methods requires the solution
of systems of size mi × mi for i = 1, . . . , n.
A = M − N,
then x solves
Ax = b
if and only if x satisfies
x = M −1 N x + M −1 b.
We set G = M^{-1} N and f = M^{-1} b and consider the basic iterative method

x^{(k+1)} = G x^{(k)} + f.

When does {x^{(k)}} converge? If this sequence converges, x^{(k)} → x^{(∗)} (k → ∞), then the limit x^{(∗)}
satisfies

x^{(∗)} = G x^{(∗)} + f,

i.e., it is a fixed point of the map x ↦ Gx + f. The errors e^{(k)} = x^{(k)} − x^{(∗)} satisfy

e^{(k+1)} = G e^{(k)} = G^{k+1} e^{(0)}. (2.21)
If there exists a matrix norm that is submultiplicative and subordinate to a vector norm¹ such that
‖G‖ < 1, then the series Σ_{k=0}^{∞} G^k converges to (I − G)^{-1}. In particular I − G is invertible and there
exists a unique fixed point x^{(∗)} = G x^{(∗)} + f. Moreover, since the errors satisfy e^{(k)} = G^k e^{(0)}, we have
‖e^{(k)}‖ ≤ ‖G‖^k ‖e^{(0)}‖ → 0 (k → ∞).
Thus, if we are able to find a matrix norm such that ‖G‖ < 1, we are guaranteed convergence, but
how do we know whether such a norm exists?
If u₁, . . . , u_n are eigenvectors of A with eigenvalues λ₁, . . . , λ_n, and we set U = (u₁, . . . , u_n) and Λ = diag(λ₁, . . . , λ_n),
then
AU = ( Au1, . . . , Aun ) = (u1 λ 1, . . . , un λ n ) = UΛ.
If the matrix U is invertible, that is if we can find n linearly independent eigenvectors u1, . . . , un ,
then
A = UΛU −1 . (2.24)
If there exists an invertible matrix U and a diagonal matrix Λ such that (2.24) holds, we say that
the matrix A is diagonalizable.
Given a matrix U ∈ C^{n×n} we define U^* = \bar{U}^T. A matrix is unitarily diagonalizable, if the matrix
U ∈ Cn×n of eigenvectors is not only invertible but satisfies U −1 = U ∗ , i.e., U ∗U = I. A matrix
U ∈ Cn×n with U ∗U = I is called a unitary matrix. If the matrix U ∈ Rn×n , then U ∗ = U T and a
square real matrix U with U T U = I is called orthogonal. Unfortunately, not all square matrices are
diagonalizable and not all diagonalizable matrices are unitarily diagonalizable.
Theorem 2.4.1 If A ∈ R^{n×n} is symmetric, then all eigenvalues λ₁, . . . , λ_n are real and there exist n
orthonormal eigenvectors; in other words, there exists a real diagonal matrix Λ = diag(λ₁, . . . , λ_n) ∈
R^{n×n} and an orthogonal matrix U ∈ R^{n×n}, such that A = U Λ U^T.
Even if a square matrix is not diagonalizable, it can be written in Jordan normal form, sometimes
called Jordan canonical form.
Theorem 2.4.3 (Jordan Normal Form) For any square matrix A ∈ Cn×n there exists a nonsingu-
lar matrix U ∈ Cn×n such that
U^{-1} A U = \begin{pmatrix}
J₁ & 0 & \cdots & 0 \\
0 & J₂ & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & J_k
\end{pmatrix} =: J, (2.25)

where

J_i = \begin{pmatrix}
λ_i & 1 & 0 & \cdots & 0 & 0 \\
0 & λ_i & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & λ_i & 1 \\
0 & 0 & 0 & \cdots & 0 & λ_i
\end{pmatrix}
= λ_i I + N_i,   N_i := \begin{pmatrix}
0 & 1 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & 0 & 1 \\
0 & 0 & 0 & \cdots & 0 & 0
\end{pmatrix}.
The fixed point iteration (2.26), x^{(k+1)} = G x^{(k)} + f, converges for any initial vector x^{(0)} if and only
if ρ(G) < 1. This section provides the proof of this result. We begin with the case of a diagonalizable
iteration matrix G.
Thus

x = Gx + f if and only if y_j = λ_j y_j + (U^{-1} f)_j, j = 1, . . . , n,

where y = U^{-1} x.
If ρ(G) < 1, i.e., |λ_i| < 1, i = 1, . . . , n, then (1 − λ_i) y_i^{(∗)} = (U^{-1} f)_i has a unique solution y_i^{(∗)}.
Consequently, x = Gx + f has a unique fixed point x^{(∗)} = U y^{(∗)}. Moreover, the equations (2.21)
for the error imply

U^{-1} e^{(k)} = Λ U^{-1} e^{(k−1)} = Λ^k U^{-1} e^{(0)}. (2.27)

If we define z^{(k)} = U^{-1} e^{(k)}, then (2.27) reads

z_i^{(k)} = λ_i z_i^{(k−1)} = λ_i^k z_i^{(0)}, i = 1, . . . , n. (2.28)

Clearly, if ρ(G) = max_{i=1,...,n} |λ_i| < 1, then z_i^{(k)} → 0 (k → ∞), i = 1, . . . , n, for any z₁^{(0)}, . . . , z_n^{(0)},
and e^{(k)} → 0 (k → ∞) for any initial error e^{(0)}. Moreover, if ρ(G) < 1 the error z^{(k)} = U^{-1} e^{(k)}
decreases monotonically and the components z_i^{(k)} decrease the faster the smaller |λ_i|.
On the other hand, if the fixed point iteration (2.26) converges for any starting vector x^{(0)}, then
x^{(k)} → x^{(∗)} = x^{(∗)}(x^{(0)}) (k → ∞). (Note that we do not know yet that there is only one fixed
point and therefore the limit may depend on the initial vector x^{(0)}.) The errors e^{(k)} = x^{(k)} − x^{(∗)}
satisfy (2.21) and z^{(k)} = U^{-1} e^{(k)} satisfies (2.28). The iterates given by (2.28) converge, z_i^{(k)} → 0
(k → ∞), i = 1, . . . , n, for any initial errors z₁^{(0)}, . . . , z_n^{(0)}, only if ρ(G) = max_{i=1,...,n} |λ_i| < 1, and in
this case the fixed point x^{(∗)} is unique.
We have shown the following result.
Theorem 2.5.1 Let G ∈ Rn×n be diagonalizable. There exists a unique fixed point x (∗) of x = Gx+ f
and the iteration (2.26) converges to x (∗) for any initial vector x (0) if and only if ρ(G) < 1.
If G is unitarily diagonalizable, i.e., if G is normal, then G = UΛU −1 with kU k2 = 1. In this
case (2.27) implies
‖e^{(k)}‖₂ = ‖U^* e^{(k)}‖₂ ≤ ρ(G)^k ‖U^* e^{(0)}‖₂ = ρ(G)^k ‖e^{(0)}‖₂.
Hence if ρ(G) < 1 the error e (k) decreases monotonically in the 2-norm.
The diagonalizability of G can also be used to establish a relation between the spectral radius
of G and norms of matrix powers. We have
‖G^k‖₂ = ‖U Λ^k U^{-1}‖₂ ≤ ‖U‖₂ ‖Λ^k‖₂ ‖U^{-1}‖₂ = ρ(G)^k ‖U‖₂ ‖U^{-1}‖₂

and

‖G^k‖₂ = ( ‖U^{-1}‖₂ ‖U Λ^k U^{-1}‖₂ ‖U‖₂ ) / ( ‖U‖₂ ‖U^{-1}‖₂ ) ≥ ‖Λ^k‖₂ / ( ‖U‖₂ ‖U^{-1}‖₂ ) = ρ(G)^k / ( ‖U‖₂ ‖U^{-1}‖₂ ).

Hence

ρ(G) ( 1/( ‖U‖₂ ‖U^{-1}‖₂ ) )^{1/k} ≤ ‖G^k‖₂^{1/k} ≤ ρ(G) ( ‖U‖₂ ‖U^{-1}‖₂ )^{1/k}. (2.29)
Note that if U is unitary, we even have ‖G^k‖₂ = ρ(G)^k. The inequalities (2.29) imply

lim_{k→∞} ‖G^k‖₂^{1/k} = ρ(G). (2.30)
Since all matrix norms are equivalent, we even have

lim_{k→∞} ‖G^k‖^{1/k} = ρ(G) for any matrix norm ‖·‖. (2.31)

We have proven (2.31) for diagonalizable matrices G. We will see shortly that (2.31) is true for
all matrices.
Since the errors e^{(k)} = x^{(k)} − x^{(∗)} of the iteration (2.26) obey

‖e^{(k)}‖ = ‖G^k e^{(0)}‖ ≤ ‖G^k‖ ‖e^{(0)}‖ = ( ‖G^k‖^{1/k} )^k ‖e^{(0)}‖,

‖G^k‖ is called the convergence factor (for k steps) of the iteration (2.26) and ‖G^k‖^{1/k} is called the
average convergence factor (per step for k steps) of the iteration (2.26).
The matrix G has eigenvalues λ 1 = 0.5, λ 2 = 0.3, and it is diagonalizable but not normal. The
2-norms of the errors e (k) and the components z1(k) , z2(k) of the error z (k) = U −1 e (k) are shown in
Figure 2.1. The components z1(k) , z2(k) of the error z (k) = U −1 e (k) decrease monotonically by a
factor λ 1 = 0.5 and λ 2 = 0.3, respectively.
Figure 2.1: Left plot: Convergence of the iterates e (k+1) = Ge (k) for G given by (2.32) and initial
iterate e (0) = (38, 38)T . Right plot: The average convergence factor kG k k 1/k for G given by (2.32).
The red line indicates ρ(G).
Remark 2.5.4 Assume that ρ(G) < 1. The errors e^{(k)} = x^{(k)} − x^{(∗)} of the iteration (2.26) obey

‖e^{(k)}‖ = ‖G^k e^{(0)}‖ ≤ ‖G^k‖ ‖e^{(0)}‖ = ( ‖G^k‖^{1/k} )^k ‖e^{(0)}‖ ≈ ρ(G)^k ‖e^{(0)}‖.

Hence we can use ρ(G)^k as an estimate for ‖e^{(k)}‖/‖e^{(0)}‖. In particular, if we want to reduce the
error below a factor 10^{−d} times the initial error, i.e., we want

‖e^{(k)}‖ / ‖e^{(0)}‖ ≤ 10^{−d},
then we should expect to need k̄ iterations where k̄ is such that ρ(G) k̄ ≤ 10−d , or
k̄ ≥ −d/ log10 ( ρ(G)). (2.33)
Note this estimate is sharp for unitarily diagonalizable matrices if we use the 2-norm, but can be
too optimistic otherwise. See, e.g., Remark 2.5.8 below. Table 2.1 shows the estimated number of
linear fixed point iterations (2.26) needed to reduce the initial error by a factor 10−2 for various
spectral radii of G. The estimate is (2.33).
ρ(G) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.99
k̄ 2 3 4 6 7 10 13 21 44 90 459
Table 2.1: The estimated (using (2.33)) number k̄ of linear fixed point iterations (2.26) that need to
be executed to reduce the initial error by a factor 10−2 for various spectral radii of G.
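The entries of Table 2.1 can be reproduced directly from the estimate (2.33). The short Python check below is an illustration (not part of the notes) for d = 2.

    # Sketch: reproduce Table 2.1 via k_bar >= -d / log10(rho(G)) with d = 2.
    import math

    d = 2
    for rho in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
        k_bar = math.ceil(-d / math.log10(rho))
        print(f"rho = {rho:5.2f}   k_bar = {k_bar}")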
Note that

N_i^k = 0 for k ≥ n_i.

Consequently,

J_i^k = Σ_{j=0}^{k} ( k! / (j!(k−j)!) ) λ_i^{k−j} N_i^j = Σ_{j=0}^{min{k,n_i}} ( k! / (j!(k−j)!) ) λ_i^{k−j} N_i^j

and

‖J_i^k‖ ≤ Σ_{j=0}^{min{k,n_i}} ( k! / (j!(k−j)!) ) |λ_i|^{k−j} ‖N_i‖^j. (2.35)
For j = 1, . . . , k we have

k! / (j!(k−j)!) = k(k−1) · · · (k−j+1) / j! ≤ k(k−1) · · · (k−j+1) ≤ k^j.

Consequently,

1 ≤ k! / (j!(k−j)!) ≤ k^j for j = 0, . . . , k.
If |λ_i| < 1, then

‖J_i^k‖ ≤ Σ_{j=0}^{min{k,n_i}} ( k! / (j!(k−j)!) ) |λ_i|^{k−j} ‖N_i‖^j ≤ k^{n_i} |λ_i|^k Σ_{j=0}^{n_i} |λ_i|^{−j} ‖N_i‖^j → 0 (k → ∞),

since k^{n_i} |λ_i|^k → 0 (k → ∞).
This allows us to extend the arguments used to establish Theorem 2.5.1. The equations (2.21) for
the error imply

U^{-1} e^{(k)} = J U^{-1} e^{(k−1)} = J^k U^{-1} e^{(0)}. (2.36)

If we define z^{(k)} = U^{-1} e^{(k)}, then (2.36) reads

z_i^{(k)} = J_i^k z_i^{(0)}, i = 1, . . . , ℓ,

where now z_i^{(k)} ∈ R^{n_i}, i = 1, . . . , ℓ, are subvectors of z^{(k)} corresponding to the Jordan blocks. We
leave the careful proof of the following theorem as an exercise.
Theorem 2.5.5 Let G be a square matrix. There exists a unique fixed point x (∗) of x = Gx + f and
the iteration (2.26) converges to x (∗) for any initial vector x (0) if and only if ρ(G) < 1.
Theorem 2.5.6 If G is a square matrix, then for any matrix norm ‖·‖ we have (2.31), i.e.,
lim_{k→∞} ‖G^k‖^{1/k} = ρ(G).
Figure 2.2: Left plot: Convergence of the iterates e^{(k+1)} = G e^{(k)} for G given by (2.39) and initial
iterate e^{(0)} = (1, 1)^T. Right plot: The average convergence factor ‖G^k‖^{1/k} for G given by (2.39).
The red line indicates ρ(G).
Asymptotically, the smaller the spectral radius, the faster the linear fixed point iteration converges.
However, the spectral radius only describes the asymptotic convergence behavior for
sufficiently large iteration numbers. For non-normal matrices, and especially for non-diagonalizable matrices,
there can be significant transient effects, such as the one observed in the convergence of the error
e^{(k)} in Figure 2.2. This is studied in more detail in the book by Trefethen and Embree [TE05].
Remark 2.5.8 As we have seen in Example 2.5.7 for non-diagonalizable matrices, kG k k 1/k ≈
ρ(G) only for (potentially very) large k. Hence for non-normal matrices and especially for non-
diagonalizable matrices the estimate (2.33) for the number of iterations required to reduce the error
by a factor 10−d may be way too optimistic! For instance, for the matrix in Example 2.5.7, k=33
iterations are required to achieve ke (k) k2 /ke (0) k2 ≤ 10−2 , but −2/ log10 (0.75) < 17.
‖A‖_T ≤ ρ(A) + ε.

Proof: For a nonsingular matrix T define the vector norm ‖x‖_T = ‖Tx‖_∞. The induced matrix norm satisfies

‖A‖_T = max_{x≠0} ‖Ax‖_T/‖x‖_T = max_{x≠0} ‖TAx‖_∞/‖Tx‖_∞ = max_{y≠0} ‖TAT^{-1}y‖_∞/‖y‖_∞ = ‖TAT^{-1}‖_∞.
(b) From the Jordan canonical form (2.34) of a matrix A we find a nonsingular matrix U such
that

B = U A U^{-1} = \begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1,n−1} & b_{1n} \\
0 & b_{22} & \cdots & b_{2,n−1} & b_{2n} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & b_{n−1,n−1} & b_{n−1,n} \\
0 & 0 & \cdots & 0 & b_{nn}
\end{pmatrix},

where the diagonal entries b_{ii}, i = 1, . . . , n, are the eigenvalues of A. (In fact more can be said
about the entries of B, but this is not necessary for our purposes.) Given δ > 0 define
D = diag(δ^{-1}, δ^{-2}, . . . , δ^{-n}). Then

(D B D^{-1})_{ij} = 0 if i > j,   (D B D^{-1})_{ii} = b_{ii},   (D B D^{-1})_{ij} = b_{ij} δ^{j−i} if i < j.

Consequently,

‖D B D^{-1}‖_∞ = max_i Σ_{j=1}^{n} |(D B D^{-1})_{ij}| ≤ max_i ( |b_{ii}| + n max_{j>i} |b_{ij}| δ^{j−i} ).
Corollary 2.5.10 Let G be a square matrix. There exists a matrix (operator) norm k · k such that
kGk < 1 if and only if ρ(G) < 1.
Proof: Assume there exists a matrix (operator) norm ‖·‖ such that ‖G‖ < 1. For any eigenvalue
λ of G and corresponding eigenvector v, Gv = λv. Taking norms gives |λ| ‖v‖ = ‖Gv‖ ≤ ‖G‖ ‖v‖,
i.e., |λ| ≤ ‖G‖ < 1, and hence ρ(G) < 1. Conversely, if ρ(G) < 1, the preceding result yields, for
ε = (1 − ρ(G))/2 > 0, a matrix norm with ‖G‖_T ≤ ρ(G) + ε < 1.
If π is a permutation of {1, . . . , n} and Π is the corresponding permutation matrix with entries
Π_{ij} = 1 if j = π(i) and Π_{ij} = 0 else, then

(Π A Π^T)_{i,j} = A_{π(i),π(j)}, i, j = 1, . . . , n. (2.40)
For the specific permutation π(1) = n, π(2) = n − 1, . . . , π(n) = 1, the corresponding permutation
matrix is

Π = \begin{pmatrix} & & 1 \\ & \iddots & \\ 1 & & \end{pmatrix} (2.41)
and the (pointwise) forward Gauss-Seidel method applied to Π A Π^T Π x = Π b is equivalent to the
(pointwise) backward Gauss-Seidel method applied to Ax = b. Therefore, the conditions on the
matrix in this section imply convergence of the forward Gauss-Seidel method and the backward
Gauss-Seidel method, as well as the Gauss-Seidel method for any symmetric reordering of the
equations and unknowns.
Definition 2.6.1 A square matrix A is said to be reducible, if there exists a permutation matrix P
such that P APT is block upper triangular, i.e.,
P A P^T = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix}

with square blocks A_{11}, A_{22}.
The irreducibility of a matrix can often be tested using the directed graph G(A) associated with
the matrix A. A directed graph consists of vertices and directed edges. The directed graph G(A)
associated with the matrix A ∈ R^{n×n} consists of n vertices labeled 1, . . . , n, and there is an oriented
edge from vertex i to vertex j if and only if a_{ij} ≠ 0. It can be shown that A is irreducible if and only if
the graph G(A) is connected in the sense that for each pair of vertices i and j there is an oriented
path from i to j, that is, there exist vertices i = i₀, i₁, i₂, . . . , i_k = j such that G(A) contains the
oriented edge (i_{ℓ−1}, i_ℓ), ℓ = 1, . . . , k.
The directed graphs associated with

A₁ = \begin{pmatrix} −1 & −2 & 0 \\ 0 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix},   A₂ = \begin{pmatrix} 2 & −1 & 0 \\ −1 & 2 & −1 \\ 0 & −1 & 2 \end{pmatrix}

are shown in Figure 2.3.
Figure 2.3: Directed graphs associated with A1 (left plot) and A2 (right plot).
The graph associated with A1 has no directed path from vertex 1 to vertex 3 (or from vertex 2
to vertex 3). Therefore the matrix A1 is reducible. In fact if we use the permutation
P = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix},

then

P^{-1} = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}   and   P A₁ P^{-1} = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 1 & 0 \\ 0 & −2 & −1 \end{pmatrix}.
The graph associated with A₂ has a directed path from every vertex i to every vertex j. Therefore the matrix
A₂ is irreducible.
Theorem 2.6.5 If the square matrix A is strictly row-wise (or column-wise) diagonally dominant
or if it is irreducibly row-wise (or column-wise) diagonally dominant, then A is nonsingular.
Proof: (a) Let A be strictly row-wise diagonally dominant. Suppose there exists an x ≠ 0 with
Ax = 0. The ith equation in this system implies

|a_{ii}| |x_i| = | − Σ_{j=1,j≠i}^{n} a_{ij} x_j | ≤ Σ_{j=1,j≠i}^{n} |a_{ij}| |x_j|.

Choosing i with |x_i| = ‖x‖_∞ > 0 and dividing by |x_i| gives |a_{ii}| ≤ Σ_{j≠i} |a_{ij}|, which contradicts the
strict row-wise diagonal dominance. Hence A is nonsingular.
(b) Now let A be irreducibly row-wise diagonally dominant and suppose there exists x ≠ 0 with
Ax = 0, scaled so that ‖x‖_∞ = 1. For any index i with |x_i| = 1 the ith equation and the diagonal
dominance force
|a_{ii}| = |a_{ii}| |x_i| = | Σ_{j≠i} a_{ij} x_j | ≤ Σ_{j≠i} |a_{ij}| |x_j| ≤ ( Σ_{j≠i} |a_{ij}| ) |x_i| ≤ |a_{ii}| |x_i| = |a_{ii}|, (2.42)

so equality must hold throughout.
These equalities imply that for any ℓ such that a_{iℓ} ≠ 0, the corresponding component of x must
satisfy |x_ℓ| = 1 = ‖x‖_∞.
Now let ν be an index such that

Σ_{j=1,j≠ν}^{n} |a_{νj}| < |a_{νν}|. (2.43)

Since A is irreducible, there is an oriented path i₁, i₂, . . . in G(A) from an index i₁ with |x_{i₁}| = 1 = ‖x‖_∞
to ν. As above,

|a_{i₁i₁}| = |a_{i₁i₁}| |x_{i₁}| ≤ Σ_{j≠i₁} |a_{i₁j}| |x_j| ≤ ( Σ_{j≠i₁} |a_{i₁j}| ) |x_{i₁}| ≤ |a_{i₁i₁}| |x_{i₁}| = |a_{i₁i₁}|,

so again equality holds throughout. Since a_{i₁i₂} ≠ 0, we have |x_{i₂}| = 1. We can now repeat the same
argument to show that |x_{i₂}| = . . . = |x_ν| = 1. Since for row ν the inequality (2.43) holds, we have

|a_{νν}| |x_ν| ≤ Σ_{j≠ν} |a_{νj}| |x_j| ≤ ( Σ_{j≠ν} |a_{νj}| ) |x_ν| < |a_{νν}| |x_ν|,
a contradiction. Hence there is no x , 0 such that Ax = 0, which means that A is nonsingular.
(c) If A is strictly column-wise diagonally dominant, then (a) implies that AT is nonsingular.
Hence A is nonsingular.
(d) The nonsingularity of A follows from application of (b) to AT .
Theorem 2.6.6 If the square matrix A is strictly row-wise (or column-wise) diagonally dominant
or if it is irreducibly row-wise (or column-wise) diagonally dominant, then the pointwise Jacobi
method converges for any x (0) .
Proof: (a) Let A be strictly row-wise diagonally dominant. The iteration matrix for the (pointwise)
Jacobi method is given by
G = D−1 (E + F).
Let λ be an eigenvalue of G and let v be a corresponding eigenvector with |v_m| = 1 and |v_i| ≤ 1,
i ≠ m. The mth equation in λv = Gv and the row-wise diagonal dominance of A imply

|λ| = | Σ_{i=1,i≠m}^{n} (a_{mi}/a_{mm}) v_i | ≤ Σ_{i=1,i≠m}^{n} |a_{mi}/a_{mm}| |v_i| < 1.
Hence ρ(G) < 1 and the assertion follows from Theorem 2.5.5.
(b) If A is irreducibly row-wise diagonally dominant, then one can show as in part (a) that every
eigenvalue λ of G satisfies
|λ| ≤ 1.
Suppose that ρ(G) = 1, i.e., that there exists an eigenvalue λ of G with |λ| = 1. In this case
λ D − E − F is singular. However, since |λ| = 1 the matrix λ D − E − F is also irreducibly row-wise
diagonally dominant. This contradicts Theorem 2.6.5. Hence ρ(G) < 1 holds.
(c) Let A be strictly column-wise diagonally dominant. By part (a) the (pointwise) Jacobi
iteration matrix for A^T, which is given by G̃ = D^{-1}(E^T + F^T), satisfies ρ(G̃) < 1. The matrices
D^{-1}(E^T + F^T) and (E^T + F^T)D^{-1} have the same eigenvalues, and the matrices (E^T + F^T)D^{-1} and
((E^T + F^T)D^{-1})^T = D^{-1}(E + F) = G have the same eigenvalues. Hence ρ(G) = ρ(G̃) < 1.
Theorem 2.6.7 If the square matrix A is strictly row-wise (column-wise) diagonally dominant
or if it is irreducibly row-wise (column-wise) diagonally dominant, then the pointwise (forward)
Gauss-Seidel method converges for any symmetric reordering of the equations and unknowns for
any starting value.
Proof: First we show that pointwise forward Gauss-Seidel method applied to Ax = b converges
for any starting value provided that A is strictly row-wise diagonally dominant or if it is irreducibly
row-wise diagonally dominant. Since for any permutation matrix Π, Π AΠT is strictly row-wise
diagonally dominant (irreducibly row-wise diagonally dominant) if and only if A is strictly row-
wise diagonally dominant (irreducibly row-wise diagonally dominant), this implies convergence
of pointwise forward Gauss-Seidel method for any symmetric reordering of the equations and
unknowns.
(a) Let A be strictly row-wise diagonally dominant. The iteration matrix for the (pointwise)
forward Gauss-Seidel method is given by
G = (D − E) −1 F.
Let λ be an eigenvalue of G and let v be a corresponding eigenvector with ‖v‖_∞ = 1. Let the index m
be such that |v_m| = 1 and |v_i| ≤ 1, i ≠ m. The mth equation in λv = Gv is equivalent to
λ Σ_{i≤m} a_{mi} v_i = − Σ_{i>m} a_{mi} v_i

and implies

|λ| = | Σ_{i>m} a_{mi} v_i | / | a_{mm} v_m + Σ_{i<m} a_{mi} v_i | ≤ ( Σ_{i>m} |a_{mi}| |v_i| ) / ( |a_{mm}| − Σ_{i<m} |a_{mi}| |v_i| ).
The last term is of the form c1 /(d − c2 ) with c1, c2 ≥ 0, d > 0, and d − c1 − c2 > 0 . (For the latter
inequality we use that A is strictly row-wise diagonally dominant, |vm | = 1 and |vi | ≤ 1, i , m.)
Since c1 /(d − c2 ) = c1 /(c1 + (d − c1 − c2 )) < 1, we have
|λ| < 1.
Hence all eigenvalues λ of the (pointwise) forward Gauss-Seidel iteration matrix satisfy |λ| < 1.
Consequently ρ(G) < 1.
(b) If A is irreducibly row-wise diagonally dominant, then one can show as in part (a) that every
eigenvalue λ of G satisfies
|λ| ≤ 1.
Suppose that ρ(G) = 1, i.e., that there exists an eigenvalue λ of G with |λ| = 1. In this case
λ(D − E) − F is singular. However, since |λ| = 1 the matrix λ(D − E) − F is also irreducibly
row-wise diagonally dominant. This contradicts Theorem 2.6.5. Hence ρ(G) < 1 holds.
(c) Now let A be strictly column-wise diagonally dominant or irreducibly column-wise diagonally
dominant. Then A^T = D − E^T − F^T is strictly row-wise diagonally dominant or irreducibly
row-wise diagonally dominant. By parts (a) and (b), the (pointwise) backward Gauss-Seidel iteration
for A^T converges. Since the (pointwise) backward Gauss-Seidel iteration matrix for A^T is
(D − E^T)^{-1} F^T, we have ρ((D − E^T)^{-1} F^T) < 1.
The iteration matrix for the (pointwise) forward Gauss-Seidel iteration for A is G = (D − E)^{-1} F.
We have

(D − E^T)^{-1} G^T (D − E^T) = (D − E^T)^{-1} F^T (D − E^T)^{-1} (D − E^T) = (D − E^T)^{-1} F^T,

so ρ(G) = ρ(G^T) = ρ((D − E^T)^{-1} F^T) < 1.
If the splitting (2.12) is used we obtain the pointwise forward SOR method. If the splitting (2.19)
is used we have the block forward SOR method. The forward SOR iteration matrix is given by

G_ω = (D − ωE)^{-1} ( ωF + (1 − ω)D ). (2.44)

Theorem 2.6.8 The iteration matrix G_ω of the (pointwise or block) forward SOR method satisfies
ρ(G_ω) ≥ |1 − ω|.
Proof: We use the following two properties of the determinant. The determinant of the product
of two square matrices is the product of the determinants of the matrices. The determinant of a
square matrix is the product of the eigenvalues of the matrix.
Since

G_ω = (D − ωE)^{-1} ( ωF + (1 − ω)D ) = (I − ωD^{-1}E)^{-1} ( ωD^{-1}F + (1 − ω)I )

and since (I − ωD^{-1}E)^{-1} is a lower triangular matrix with ones on the diagonal and ωD^{-1}F is a
strict upper (block) triangular matrix, we have

| Π_{i=1}^{n} λ_i | = |det(G_ω)| = |det( (I − ωD^{-1}E)^{-1} )| |det( ωD^{-1}F + (1 − ω)I )| = |1 − ω|^n.

Hence ρ(G_ω) = max_i |λ_i| ≥ |1 − ω|.
Corollary 2.6.9 If the (pointwise or block) forward SOR method converges for any initial vector,
then ω ∈ (0, 2).
Proof: If the (pointwise or block) forward SOR method converges for any initial vector, then
ρ(G_ω) < 1. By Theorem 2.6.8, |1 − ω| ≤ ρ(G_ω) < 1, which implies ω ∈ (0, 2).
Theorem 2.6.10 If A ∈ C^{n×n} is Hermitian positive definite, then the (pointwise or block) forward
SOR method converges for all ω ∈ (0, 2).
Proof: We have

G_ω = (D − ωE)^{-1} ( ωF + (1 − ω)D ) = I − ( (1/ω)(D − ωE) )^{-1} A = I − M_ω^{-1} A,

where

M_ω = (1/ω)(D − ωE) = (1/ω) D − E.

Let λ ∈ C be an eigenvalue of G_ω with corresponding eigenvector v ∈ C^n. Then

A v = (1 − λ) M_ω v.

Since A is positive definite, v^* A v > 0 and therefore λ ≠ 1. Multiplying by v^* from the left gives
1/(1 − λ) = v^* M_ω v / (v^* A v), and hence

2 Re( 1/(1 − λ) ) = 1/(1 − λ) + conj( 1/(1 − λ) ) = v^* M_ω v / (v^* A v) + v^* M_ω^* v / (v^* A v) = v^* (M_ω + M_ω^*) v / (v^* A v).
Moreover, M_ω + M_ω^* = (2/ω) D − (E + E^*) = (2/ω − 1) D + A, and therefore

2 Re( 1/(1 − λ) ) = v^* (M_ω + M_ω^*) v / (v^* A v) = 1 + (2/ω − 1) (v^* D v) / (v^* A v).

If A ∈ C^{n×n} is Hermitian positive definite, its (block) diagonal D is Hermitian positive definite.
Hence

2 Re( 1/(1 − λ) ) = 1 + (2/ω − 1) (v^* D v) / (v^* A v) > 1.
If we set λ = α + i β, then
1 2(1 − α)
1 < 2 Re = ,
1 − λ (1 − α) 2 + β 2
which implies
|λ| 2 = α 2 + β 2 < 1.
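The statement of Theorem 2.6.10 can be observed numerically. The following Matlab lines are not part of the original notes; they sweep ω over (0, 2) for a symmetric positive definite tridiagonal matrix and compute ρ(Gω).

% Not from the original notes: rho(G_omega) of the forward SOR iteration (2.44)
% for a symmetric positive definite tridiagonal matrix; all values are < 1.
n = 20;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
D = diag(diag(A));  E = -tril(A,-1);  F = -triu(A,1);
omega = 0.05:0.05:1.95;
rho = zeros(size(omega));
for i = 1:numel(omega)
    w = omega(i);
    Gw = (D - w*E) \ (w*F + (1-w)*D);
    rho(i) = max(abs(eig(Gw)));
end
plot(omega, rho); xlabel('\omega'); ylabel('\rho(G_\omega)');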
Remark 2.6.11 Since the (pointwise or block) forward SOR method with ω = 1 is the (pointwise or
block) forward Gauss-Seidel method, Theorem 2.6.10 shows that the forward Gauss-Seidel method
applied to a Hermitian positive definite system converges.
Theorem 2.6.12 If A ∈ Rn×n and 2D − A are symmetric positive definite, then the (pointwise or
block) Jacobi method converges for all initial iterates.
and P(αD^{-1}E + α^{-1}D^{-1}F)P^T have the same eigenvalues. Therefore, A is consistently ordered if
and only if PAP^T is consistently ordered.
Let G_J = D^{-1}(E + F) and
\[
G_\omega = (D - \omega E)^{-1} \bigl( \omega F + (1-\omega) D \bigr)
\]
be the iteration matrices of the pointwise [block] Jacobi and forward SOR method, respectively.
Then:
i. With μ, also −μ is an eigenvalue of G_J.
ii. If µ is an eigenvalue of G J and
(λ + ω − 1) 2 = λω2 µ2, (2.47)
then λ is an eigenvalue of Gω .
\[
\rho(G_{GS}) = \rho(G_J)^2 .
\]
Proof: By Theorem 2.6.17 the eigenvalues µ of G J and the eigenvalues λ of GGS = G1 obey
λ = µ2 .
and
\[
\rho(G_{\omega_{\mathrm{opt}}}) = \left( \frac{\rho(G_J)}{1 + \sqrt{1 - \rho(G_J)^2}} \right)^{2},
\qquad
\omega_{\mathrm{opt}} = \frac{2}{1 + \sqrt{1 - \rho(G_J)^2}} \in (1, 2) .
\]
In Remark 2.5.4 we have discussed the relation between the spectral radius ρ(G) of a basic
iterative method and the number of iterations k̄ needed to reduce the size of the initial error
e^{(0)} = x^{(0)} − x^∗ by a factor 10^{−d}. An estimate is
\[
\bar k \approx \frac{d}{-\log_{10} \rho(G)} .
\]
Note that this tends to be a good estimate for unitarily diagonalizable G, but can be too optimistic
otherwise. Table 2.2 shows the number of iterations needed to reduce the initial error by a factor
10−2 with the Jacobi method, the Gauss-Seidel method or the SOR method for various ρ(G J ).
Table 2.2: Estimated number of Jacobi iterations, Gauss-Seidel iterations or SOR iterations that
need to be executed to reduce the initial error by a factor 10−2 for various spectral radii of G J .
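The numbers behind a table of this type can be regenerated with a few Matlab lines. The snippet below is not part of the original notes; it uses the estimate k ≈ d/(−log₁₀ ρ) with d = 2 and the relations ρ(G_GS) = ρ(G_J)² and ρ(G_ωopt) = ωopt − 1 stated above.

% Not from the original notes: estimated iteration counts for error reduction 1e-2.
d     = 2;
rhoJ  = [0.9 0.99 0.999];
rhoGS = rhoJ.^2;
wopt  = 2 ./ (1 + sqrt(1 - rhoJ.^2));
rhoSOR = wopt - 1;                         % = (rhoJ./(1+sqrt(1-rhoJ.^2))).^2
its = @(r) ceil(d ./ (-log10(r)));
disp([rhoJ' its(rhoJ)' its(rhoGS)' its(rhoSOR)'])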
In Remark 2.6.14 we have established that A is consistently ordered if and only if P APT
is consistently ordered, where P is a permutation matrix. Furthermore, the eigenvalues of
the Jacobi iteration matrix D−1 (E + F) corresponding to A and of the Jacobi iteration matrix
(PDP^T)^{-1}(PEP^T + PFP^T) = PD^{-1}(E + F)P^T corresponding to PAP^T are identical. Thus, Theorem 2.6.19 remains valid for any symmetrically permuted system PAP^T(Px) = Pb of the system Ax = b. In
particular, if P is the permutation (2.41), then Theorem 2.6.19 implies convergence of the pointwise
backward SOR method.
2.6.5. M-Matrices
M-matrices play a role in the discretization of partial differential equations, see Section 2.7.
2. A is nonsingular, and
Proof: Suppose there exists a non-positive diagonal entry a_{ii} ≤ 0. Let A_i be the ith column
of A. By properties 1 and 3 of the M-matrix A, A−1 Ai ≤ 0, which contradicts A−1 Ai = ei ,
where ei is the i-th unit vector.
ii. Let Π be a permutation matrix. The matrix A is an M-matrix if and only if Π AΠT is an
M-matrix.
Proof: Equation (2.40) shows that the diagonal entries of Π AΠT are equal to the diagonal
entries of A (but reordered according to π). Moreover, the entries in the π(i)-th row of Π AΠT
are equal to the entries in the i-th row of A.
The four conditions in our definition of an M-matrix are redundant. We refer to the literature
(e.g., [Axe94, Hac94, Saa03, Var00, You71]) for a complete discussion of M-matrices. The
following result shows the connection between M-matrices and the convergence of the (pointwise)
Jacobi method.
2. a_{ij} ≤ 0 for i ≠ j, i, j = 1, . . . , n,
then A is an M-matrix if and only if ρ(G J ) < 1, where G J = I − D−1 A and D is the diagonal of A.
Since I − G J = D −1 A, A is invertible if and only if ρ(G J ) < 1. Hence property 3 in the definition
of an M-matrix is satisfied if and only if ρ(G J ) < 1.
The properties 1 and 2 imply that all entries of G_J are non-negative. Hence all entries of
G_J^j, j = 0, 1, \ldots, and of \sum_{j=0}^{k} G_J^j are non-negative. By (2.49) all entries of (I − G_J)^{-1} = A^{-1}D are
non-negative. Since the diagonal entries of D are positive, all entries of A^{-1} are non-negative.
The following result establishes the relationship between the convergence of the Jacobi and the
Gauss-Seidel method for matrices A = D − E − F for which all entries of D−1 E and D −1 F are
non-negative. In particular, if the matrix entries satisfy a_{ii} > 0 and a_{ij} ≤ 0 for i ≠ j, then the entries of D^{-1}E and D^{-1}F are non-negative. Let Π be a permutation matrix, let
\[
\widehat A = \Pi A \Pi^T
\]
and let \widehat A = \widehat D − \widehat E − \widehat F with diagonal matrix \widehat D, strict lower triangular matrix −\widehat E, and strict upper
triangular matrix −\widehat F. Then the entries of D^{-1}E and D^{-1}F are non-negative if and only if the entries
of \widehat D^{-1}\widehat E and \widehat D^{-1}\widehat F are non-negative.
Theorem 2.6.23 (Stein-Rosenberg) Let A = D − E − F. If all entries of D−1 E and D−1 F are
non-negative, then exactly one of the following alternatives (2.50a)–(2.50d) holds for the Jacobi iteration
and Gauss-Seidel iteration with any symmetric re-ordering:
For a proof see, e.g., Varga [Var00] or the original paper [SR48].
\[
h^{-2}
\begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
& \ddots & \ddots & \ddots & \\
& & -1 & 2 & -1 \\
& & & -1 & 2
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{pmatrix}
=
\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_{n-1}) \\ f(x_n) \end{pmatrix}. \qquad (2.52)
\]
The solution y_1, \ldots, y_n of (2.52) is an approximation of the solution u of (2.51) at the points x_i = ih, i = 1, \ldots, n, with h = 1/(n+1). The eigenvalues of a tridiagonal Toeplitz matrix of the form
\[
\begin{pmatrix}
\alpha_1 & \alpha_2 & & & \\
\alpha_2 & \alpha_1 & \alpha_2 & & \\
& \ddots & \ddots & \ddots & \\
& & \alpha_2 & \alpha_1 & \alpha_2 \\
& & & \alpha_2 & \alpha_1
\end{pmatrix} \in \mathbb{R}^{n \times n} \qquad (2.53)
\]
are given by
\[
\lambda_j = \alpha_1 + 2 \alpha_2 \cos\Bigl( \frac{j\pi}{n+1} \Bigr), \qquad j = 1, \ldots, n,
\]
and
\[
v_j = \sqrt{\tfrac{2}{n+1}} \bigl( \sin(j\pi x_1), \sin(j\pi x_2), \ldots, \sin(j\pi x_n) \bigr)^T
\]
is an eigenvector associated with the eigenvalue λ_j. See Problem 1.2 and also Iserles [Ise96,
pp. 197–203].
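The eigenvalue formula for (2.53) is easy to verify numerically. The following lines are not part of the original notes and use the example α₁ = 2, α₂ = −1.

% Not from the original notes: check of the eigenvalue formula for (2.53).
n  = 10;  a1 = 2;  a2 = -1;
T  = a1*eye(n) + a2*diag(ones(n-1,1),1) + a2*diag(ones(n-1,1),-1);
lam_numeric = sort(eig(T));
lam_formula = sort(a1 + 2*a2*cos((1:n)'*pi/(n+1)));
disp(max(abs(lam_numeric - lam_formula)))   % round-off level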
Since the Jacobi iteration matrix G J = D−1 (E + F) corresponding to (2.52) is
\[
G_J = \frac{1}{2}
\begin{pmatrix}
0 & 1 & & & \\
1 & 0 & 1 & & \\
& \ddots & \ddots & \ddots & \\
& & 1 & 0 & 1 \\
& & & 1 & 0
\end{pmatrix}, \qquad (2.54)
\]
the eigenvalues and corresponding eigenvectors of the Jacobi iteration matrix (2.54) are
\[
\lambda_j = \cos\Bigl( \frac{j\pi}{n+1} \Bigr), \qquad (2.55a)
\]
\[
v_j = \sqrt{\tfrac{2}{n+1}} \bigl( \sin(j\pi x_1), \sin(j\pi x_2), \ldots, \sin(j\pi x_n) \bigr)^T \qquad (2.55b)
\]
for j = 1, \ldots, n. The spectral radius of the Jacobi iteration matrix (2.54) is
\[
\rho(G_J) = \cos\Bigl( \frac{\pi}{n+1} \Bigr) \approx 1 - \frac{\pi^2}{2(n+1)^2} .
\]
Next, we consider the Gauss-Seidel iteration matrix GGS = (D − E) −1 F. Our analysis follows
the paper by Kohaupt [Koh98]. See also Iserles [Ise96, pp. 200–203]. The Gauss-Seidel iteration
matrix can also be written as G_{GS} = (I − D^{-1}E)^{-1} D^{-1}F, and for (2.52)
\[
D^{-1} E = \frac{1}{2}
\begin{pmatrix}
0 & & & \\
1 & 0 & & \\
& \ddots & \ddots & \\
& & 1 & 0
\end{pmatrix},
\qquad
D^{-1} F = \frac{1}{2}
\begin{pmatrix}
0 & 1 & & \\
& 0 & 1 & \\
& & \ddots & \ddots \\
& & & 0
\end{pmatrix}.
\]
Note that
\[
(D^{-1} E)^2 = \Bigl( \tfrac12 \Bigr)^2
\begin{pmatrix}
0 & & & & \\
0 & 0 & & & \\
1 & 0 & 0 & & \\
& \ddots & \ddots & \ddots & \\
& & 1 & 0 & 0
\end{pmatrix},
\quad \ldots, \quad
(D^{-1} E)^{n-1} = \Bigl( \tfrac12 \Bigr)^{n-1}
\begin{pmatrix}
0 & & & \\
\vdots & \ddots & & \\
0 & & \ddots & \\
1 & 0 & \cdots & 0
\end{pmatrix},
\]
and (D^{-1}E)^n = 0. Therefore, (I − D^{-1}E)^{-1} = \sum_{j=0}^{n-1} (D^{-1}E)^j, and, with β = 1/2,
\[
G_{GS} = (I - D^{-1} E)^{-1} D^{-1} F =
\begin{pmatrix}
0 & \beta & 0 & \cdots & 0 & 0 \\
0 & \beta^2 & \beta & & & 0 \\
0 & \beta^3 & \beta^2 & \beta & & \\
\vdots & \vdots & & \ddots & \ddots & 0 \\
0 & \beta^{n-1} & \beta^{n-2} & \cdots & \beta^2 & \beta \\
0 & \beta^{n} & \beta^{n-1} & \cdots & \beta^3 & \beta^2
\end{pmatrix}.
\]
This matrix is not normal. In fact, (G_{GS}^T G_{GS})_{11} = 0, but (G_{GS} G_{GS}^T)_{11} = \beta^2 = 2^{-2}.
The eigenvalues of G_{GS} are λ_0 = 0 and
\[
\lambda_j = \cos^2\Bigl( \frac{j\pi}{n+1} \Bigr), \qquad j = 1, \ldots, \lfloor n/2 \rfloor .
\]
The eigenspace associated with λ_0 = 0 is one-dimensional and spanned by v_0 = (1, 0, \ldots, 0)^T. The
eigenvectors associated with the other eigenvalues λ_j, j = 1, \ldots, \lfloor n/2 \rfloor, are
\[
v_j = \Bigl( \sqrt{\lambda_j}\, \sin(j\pi x_1), \bigl(\sqrt{\lambda_j}\bigr)^2 \sin(j\pi x_2), \ldots, \bigl(\sqrt{\lambda_j}\bigr)^n \sin(j\pi x_n) \Bigr)^T .
\]
See Kohaupt [Koh98]. The spectral radius of the Gauss-Seidel iteration matrix is
\[
\rho(G_{GS}) = \cos^2\Bigl( \frac{\pi}{n+1} \Bigr) = \bigl( \rho(G_J) \bigr)^2 \approx \Bigl( 1 - \frac{\pi^2}{2(n+1)^2} \Bigr)^2 .
\]
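The relation ρ(G_GS) = ρ(G_J)² for the model matrix (2.52) can be checked numerically. The lines below are not part of the original notes.

% Not from the original notes: rho(G_J) and rho(G_GS) for the matrix in (2.52).
n = 20;  h = 1/(n+1);
A = (1/h^2)*(2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1));
D = diag(diag(A));  E = -tril(A,-1);  F = -triu(A,1);
GJ  = D \ (E+F);   rhoJ  = max(abs(eig(GJ)));
GGS = (D-E) \ F;   rhoGS = max(abs(eig(GGS)));
fprintf('rho(G_J)  = %.6f   cos(pi/(n+1)) = %.6f\n', rhoJ, cos(pi/(n+1)));
fprintf('rho(G_GS) = %.6f   rho(G_J)^2    = %.6f\n', rhoGS, rhoJ^2);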
The central difference approximation (1.15) of the advection term c y'(x) leads to the tridiagonal
linear system
\[
\frac{1}{h^2}
\begin{pmatrix}
2\epsilon + h^2 r & -(\epsilon - \tfrac{h}{2} c) & & & \\
-(\epsilon + \tfrac{h}{2} c) & 2\epsilon + h^2 r & -(\epsilon - \tfrac{h}{2} c) & & \\
& \ddots & \ddots & \ddots & \\
& & -(\epsilon + \tfrac{h}{2} c) & 2\epsilon + h^2 r & -(\epsilon - \tfrac{h}{2} c) \\
& & & -(\epsilon + \tfrac{h}{2} c) & 2\epsilon + h^2 r
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{pmatrix}
=
\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_{n-1}) \\ f(x_n) \end{pmatrix}. \qquad (2.57)
\]
We have already mentioned in Section 1.3.1 that (2.57) leads to spurious oscillations unless the
mesh size is sufficiently small and satisfies
\[
h < 2\epsilon/|c|
\]
(if uniform meshes are used). The poor behavior of this discretization scheme can be explained
by the fact that this matrix is not an M-matrix if h > 2ε/|c|. In fact, if h > 2ε/|c| we have
−(ε + (h/2)c) > 0 or −(ε − (h/2)c) > 0, depending on the sign of the advection c, and the sign condition in
Definition 2.6.20 of an M-matrix is violated. We will argue in the next paragraph that the matrix in
(2.57) is irreducibly row-wise diagonally dominant if h < 2ε/|c|. See also Stynes [Sty05, Sec. 4].
If ε ± (1/2)hc ≠ 0, the matrix is irreducible (cf. Example 2.6.2). Moreover, using Problem 1.4,
the matrix in (2.57) is irreducibly row-wise diagonally dominant provided h < 2ε/|c|. Therefore,
if h < 2ε/|c|, Theorems 2.6.6 and 2.6.7 imply that both the Jacobi and the Gauss-Seidel method
converge. By Theorem 2.6.22 the matrix in (2.57) is an M-matrix for h < 2ε/|c|. Since the matrix
in (2.57) is a tridiagonal matrix, it is consistently ordered (Theorem 2.6.16). Corollary 2.6.18 and
Theorem 2.6.19 apply. Numerical experiments for the matrix in (2.57) with ε = 10^{-2}, c = 1, r = 0
show that the spectral radii of the (pointwise) Jacobi iteration matrix G_J, the (pointwise) forward Gauss-Seidel iteration matrix G_{GS}, and the SOR iteration matrix G_ω are less than one if the mesh size satisfies h < 2ε/|c| = 0.02.
See Figure 2.4 and Table 2.3.
Next we consider the upwind discretization. Let c > 0. (The case c < 0 leads to a matrix that
has the same properties as that for c > 0.) The upwind discretization (1.21) leads to the tridiagonal
linear system
\[
\frac{1}{h^2}
\begin{pmatrix}
2\epsilon + h c + h^2 r & -\epsilon & & & \\
-(\epsilon + h c) & 2\epsilon + h c + h^2 r & -\epsilon & & \\
& \ddots & \ddots & \ddots & \\
& & -(\epsilon + h c) & 2\epsilon + h c + h^2 r & -\epsilon \\
& & & -(\epsilon + h c) & 2\epsilon + h c + h^2 r
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{pmatrix}
=
\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_{n-1}) \\ f(x_n) \end{pmatrix}. \qquad (2.58)
\]
Since ε, c > 0, the matrix in (2.58) is irreducible (cf. Example 2.6.2), and it follows from
Problem 1.4 that the matrix in (2.58) is irreducibly row-wise diagonally dominant for any mesh
size h > 0. Therefore, by Theorems 2.6.6 and 2.6.7 the (pointwise) Jacobi iteration and the
(pointwise) forward Gauss-Seidel iteration converge, and by Theorem 2.6.22 the matrix in (2.58) is
an M-matrix for any mesh size h > 0. By Theorem 2.6.16 it is consistently ordered. Corollary 2.6.18 and
Theorem 2.6.19 apply. Spectral radii for the matrix in (2.58) with ε = 10^{-2}, c = 1, r = 0 are shown
in Figure 2.4 and in Table 2.3.
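The behavior shown in Figure 2.4 can be reproduced with a few Matlab lines. The snippet below is not part of the original notes; it only computes the Jacobi spectral radii for the central difference matrix (2.57) and the upwind matrix (2.58) for a few mesh sizes.

% Not from the original notes: Jacobi spectral radii for (2.57) and (2.58),
% eps = 1e-2, c = 1, r = 0.
eps_ = 1e-2;  c = 1;  r = 0;
for n = [19 49 99 199]                       % h = 1/(n+1)
    h  = 1/(n+1);  e = ones(n,1);
    Ac = (1/h^2)*( diag((2*eps_ + h^2*r)*e) ...
                  - diag((eps_ - h*c/2)*e(1:n-1),  1) ...
                  - diag((eps_ + h*c/2)*e(1:n-1), -1) );      % central, (2.57)
    Au = (1/h^2)*( diag((2*eps_ + h*c + h^2*r)*e) ...
                  - diag(eps_*e(1:n-1),           1) ...
                  - diag((eps_ + h*c)*e(1:n-1),  -1) );       % upwind, (2.58)
    rhoc = max(abs(eig(eye(n) - diag(diag(Ac))\Ac)));
    rhou = max(abs(eig(eye(n) - diag(diag(Au))\Au)));
    fprintf('h = %6.4f   rho_J central = %6.4f   rho_J upwind = %6.4f\n', h, rhoc, rhou);
end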
Figure 2.4: Spectral radii of the (pointwise) Jacobi iteration matrix G_J, the (pointwise) forward Gauss-Seidel iteration matrix G_{GS}, and the SOR iteration matrix G_ω with ω given by (2.48) for the matrix in (2.57) with
ε = 10^{-2}, c = 1, r = 0 (left plot) and for the matrix in (2.58) with ε = 10^{-2}, c = 1, r = 0 (right
plot) for various mesh sizes h. For the central difference scheme (2.57) the spectral radii are only
less than one for sufficiently small mesh sizes. For the upwind difference scheme (2.58) the spectral
radii are less than one for all mesh sizes.
\[
q(x) \stackrel{\mathrm{def}}{=} \tfrac12 x^T A x - b^T x .
\]
Moreover, there exists a unique solution x^{(∗)} of Ax = b and, therefore, a unique minimizer of q.
Using Ax^{(∗)} = b we can show
\[
q(x) = \tfrac12 \| x - x^{(*)} \|_A^2 - \tfrac12 \| x^{(*)} \|_A^2 ,
\]
where
\[
\| x \|_A^2 \stackrel{\mathrm{def}}{=} x^T A x .
\]
Hence,
\[
q(x^{(k+1)}) - q(x^{(k)}) = \tfrac12 \| x^{(k+1)} - x^{(*)} \|_A^2 - \tfrac12 \| x^{(k)} - x^{(*)} \|_A^2 ,
\]
i.e., the difference in function values is equal to half of the difference in the squared errors, where
the error is measured in the A-norm.
We can view basic iterative methods x^{(k+1)} = (I − M^{-1}A)x^{(k)} + M^{-1}b as methods that decrease the quadratic function q.
Theorem 2.8.1 If A ∈ Rn×n is symmetric positive definite and M ∈ Rn×n is a matrix such that
M T + M − A is symmetric positive definite, then the iterates generated by x (k+1) = (I − M −1 A)x (k) +
M −1 b converge to the minimizer x (∗) of q and obey
\[
\| x^{(k+1)} - x^{(*)} \|_A^2 \le \Bigl( 1 - \frac{\theta}{\lambda_{\max}} \Bigr) \| x^{(k)} - x^{(*)} \|_A^2 ,
\]
where λ_max is the largest eigenvalue of A and θ > 0 is the smallest eigenvalue of A M^{-T}(M^T + M − A)M^{-1}A.
Proof: Recall that λ_min ‖x‖_2^2 ≤ ‖x‖_A^2 ≤ λ_max ‖x‖_2^2 for all x ∈ R^n. Equations (2.59) and (2.61)
imply that
\[
\begin{aligned}
\tfrac12 \| x^{(k+1)} - x^{(*)} \|_A^2 &= q(x^{(k+1)}) - q(x^{(*)}) \\
&= q(x^{(k)}) - q(x^{(*)}) - \tfrac12 (M^{-1} r^{(k)})^T (M^T + M - A)(M^{-1} r^{(k)}) \\
&= \tfrac12 \| x^{(k)} - x^{(*)} \|_A^2 - \tfrac12 (M^{-1} r^{(k)})^T (M^T + M - A)(M^{-1} r^{(k)}) \\
&\le \tfrac12 \| x^{(k)} - x^{(*)} \|_A^2 - \tfrac{\theta}{2} \| x^{(k)} - x^{(*)} \|_2^2 \\
&\le \tfrac12 \| x^{(k)} - x^{(*)} \|_A^2 - \tfrac{\theta}{2 \lambda_{\max}} \| x^{(k)} - x^{(*)} \|_A^2 .
\end{aligned}
\]
This implies the desired result.
Remark 2.8.2 i. Note that Theorem 2.6.12 is a special case of the previous theorem with M = D.
ii. A different proof of Theorem 2.8.1 is given in Problem 2.11. Theorem 2.8.1 describes the
convergence in terms of the A-norm of the error x (k) − x (∗) , which up to a constant is q(x (k) ), while
Problem 2.11 uses the spectral radius of the iteration matrix.
For the damped Jacobi method (2.17) we have M^T = M = ω^{-1}D and
\[
M + M^T - A = \frac{2}{\omega} D - A .
\]
Theorem 2.8.3 Let A ∈ R^{n×n} be symmetric with positive diagonal entries, and let ω > 0. The
matrix 2ω^{-1}D − A is positive definite if and only if ω satisfies
\[
0 < \omega < \frac{2}{1 - \mu_{\min}} ,
\]
where μ_min ≤ 0 is the minimum eigenvalue of I − D^{-1}A.
Proof: The matrix 2ω^{-1}D − A is positive definite if and only if
\[
H \stackrel{\mathrm{def}}{=} 2\omega^{-1} I - D^{-1/2} A D^{-1/2} = (2\omega^{-1} - 1) I + D^{1/2} (I - D^{-1} A) D^{-1/2}
\]
is positive definite. The eigenvalues of H are 2ω^{-1} − 1 + μ_i, where μ_i are the eigenvalues of I − D^{-1}A.
Since \sum_{i=1}^{n} \mu_i = \mathrm{trace}(I - D^{-1}A) = 0 and the eigenvalues μ_i are real, it follows that μ_min ≤ 0.
Therefore, H is positive definite if and only if 2ω^{-1} − 1 + μ_i > 0, i = 1, . . . , n, i.e., if and only if 0 < ω < 2/(1 − μ_min).
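The admissible ω-range of Theorem 2.8.3 is easy to verify numerically. The following Matlab lines are not part of the original notes.

% Not from the original notes: check of Theorem 2.8.3 for a small spd example.
n = 8;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);   % spd, positive diagonal
D = diag(diag(A));
mu = eig(eye(n) - D\A);  mu_min = min(real(mu));
omega_max = 2/(1 - mu_min);
for omega = [0.5*omega_max, 0.99*omega_max, 1.01*omega_max]
    H = (2/omega)*D - A;
    fprintf('omega = %.4f   min eig(2/omega*D - A) = %+.4e\n', omega, min(eig(H)));
end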
For the remainder of this section we study so-called coordinate descent methods for the min-
imization of q(x) = 12 xT Ax − bT x, where A ∈ Rn×n is symmetric positive definite. We will
show that the Jacobi and the Gauss-Seidel method applied to Ax = b are particular cases of these
coordinate descent methods. In addition, many multilevel and domain decomposition methods for
the solution of discretized partial differential equations can be interpreted as coordinate descent
methods [Xu92]. Coordinate descent methods are also used in nonlinear optimization. See, e.g.,
[BT89, Wri15].
We consider two approaches, the so-called parallel directional correction (PDC) method and
the sequential directional correction (SDC) method.
One iteration of the PDC method is given as follows.
For i = 1, \ldots, n do (in parallel)
Solve
\[
\min_{\alpha \in \mathbb{R}} q(x^{(k)} + \alpha e^{(i)}) . \qquad (2.62)
\]
end
Set x^{(k+1)} = x^{(k)} + \sum_{i=1}^{n} \alpha_i e^{(i)}, where \alpha_i denotes the solution of (2.62).
The SDC method performs the minimization along the directions e^{(i)}, i = 1, \ldots, n, sequentially. One
iteration of the SDC method is given as follows.
Set w^{(0)} = x^{(k)}.
For i = 1, \ldots, n do (sequentially)
Solve
\[
\min_{\alpha \in \mathbb{R}} q(w^{(i-1)} + \alpha e^{(i)}) \qquad (2.63)
\]
and set w^{(i)} = w^{(i-1)} + \alpha_i e^{(i)}, where \alpha_i denotes the solution of (2.63).
end
Set x^{(k+1)} = w^{(n)}.
i. For v ≠ 0 the solution α_* of \min_{\alpha \in \mathbb{R}} q(x + \alpha v) is given by
\[
\alpha_* = \frac{v^T (b - A x)}{v^T A v} .
\]
Moreover,
\[
q(x + \alpha_* v) = q(x) + \alpha_* (A x - b)^T v + \frac{\alpha_*^2}{2} v^T A v = q(x) - \frac{\alpha_*^2}{2} v^T A v < q(x),
\]
provided α_* ≠ 0.
ii. The SDC iterates satisfy q(x^{(k+1)}) ≤ q(x^{(k)}) for all k.
iii. The PDC iterates in general do not satisfy q(x^{(k+1)}) ≤ q(x^{(k)}).
Figure 2.5: One iteration of the PDC method. Note that q(x (k+1) ) > q(x (k) ).
Theorem 2.8.6 Let A ∈ R^{n×n} be a symmetric positive definite matrix and let e^{(i)}, i = 1, \ldots, n, be
linearly independent. The PDC and the SDC method are iterative methods of the form
\[
x^{(k+1)} = (I - M^{-1} A) x^{(k)} + M^{-1} b
\]
with nonsingular M, which depends on whether the PDC or the SDC method is used.
Proof: We consider the PDC method and leave the proof for the SDC method as an exercise.
The solution of (2.62) is
\[
\alpha_i = \frac{(e^{(i)})^T (b - A x^{(k)})}{(e^{(i)})^T A e^{(i)}} .
\]
Hence,
\[
\begin{aligned}
x^{(k+1)} &= x^{(k)} + \sum_{i=1}^{n} \alpha_i e^{(i)}
= x^{(k)} + \sum_{i=1}^{n} \frac{(e^{(i)})^T (b - A x^{(k)})}{(e^{(i)})^T A e^{(i)}}\, e^{(i)} \\
&= x^{(k)} + \sum_{i=1}^{n} \frac{e^{(i)} (e^{(i)})^T}{(e^{(i)})^T A e^{(i)}} (b - A x^{(k)})
= x^{(k)} + \widehat M (b - A x^{(k)}) = (I - \widehat M A) x^{(k)} + \widehat M b
\end{aligned}
\]
with
\[
\widehat M = \sum_{i=1}^{n} \frac{e^{(i)} (e^{(i)})^T}{(e^{(i)})^T A e^{(i)}} .
\]
We will show that \widehat M is invertible. This implies the assertion with M = \widehat M^{-1}.
To show that \widehat M is invertible, assume
\[
0 = \widehat M v = \sum_{i=1}^{n} \frac{e^{(i)} (e^{(i)})^T}{(e^{(i)})^T A e^{(i)}} v = \sum_{i=1}^{n} \frac{(e^{(i)})^T v}{(e^{(i)})^T A e^{(i)}}\, e^{(i)} .
\]
Since the vectors e^{(i)}, i = 1, \ldots, n, are linearly independent, this implies (e^{(i)})^T v = 0 for i = 1, \ldots, n. Because the e^{(i)} span R^n, it follows that v = 0. Hence \widehat M is invertible.
Theorem 2.8.7 Let A ∈ Rn×n be a symmetric positive definite matrix. If e (i) , i = 1, . . . , n, are the
Cartesian unit vectors, then the PDC and the SDC method are equivalent to the (pointwise) Jacobi
and the (pointwise forward) Gauss-Seidel method, respectively.
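The equivalence stated in Theorem 2.8.7 can be verified directly. The Matlab lines below are not part of the original notes; they perform one PDC step and one SDC step with the Cartesian unit vectors and compare them with one Jacobi step and one forward Gauss-Seidel step.

% Not from the original notes: PDC/SDC steps with Cartesian unit vectors
% coincide with Jacobi and forward Gauss-Seidel steps.
n = 6;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);   % spd
b = ones(n,1);  x = rand(n,1);
alpha = (b - A*x) ./ diag(A);          % PDC corrections, all from the same x
x_pdc = x + alpha;
w = x;                                 % SDC: corrections applied sequentially
for i = 1:n
    w(i) = w(i) + (b(i) - A(i,:)*w) / A(i,i);
end
x_sdc = w;
D = diag(diag(A));  E = -tril(A,-1);  F = -triu(A,1);
x_jac = D \ ((E+F)*x + b);
x_gs  = (D-E) \ (F*x + b);
disp([norm(x_pdc - x_jac), norm(x_sdc - x_gs)])   % both differences ~ 0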
Theorem 2.8.8 Let A ∈ Rn×n be a symmetric positive definite matrix. If e (i) , i = 1, . . . , n, are
linearly independent , then the SDC method converges to the unique minimizer x (∗) of q for any
initial vector x (0) .
Proof: Recall that q(x^{(k)}) ≤ q(x^{(0)}) for all k. Hence,
\[
q(x^{(0)}) \ge q(x^{(k)}) = \tfrac12 (x^{(k)})^T A x^{(k)} - b^T x^{(k)}
\ge \frac{\lambda_{\min}}{2} \| x^{(k)} \|_2^2 - \| b \|_2 \, \| x^{(k)} \|_2 \quad \text{for all } k,
\]
where λ_min > 0 is the smallest eigenvalue of A. Since \{ \frac{\lambda_{\min}}{2} \| x^{(k)} \|_2^2 - \| b \|_2 \| x^{(k)} \|_2 \} is bounded, the
sequence \{ x^{(k)} \} must be bounded.
There exists a subsequence {x (k j ) } with
lim x (k j ) = x (∗) .
j→∞
We show that x^{(∗)} is the unique minimizer of q. Using the monotonicity of the q(x^{(k)})'s and
Theorem 2.8.6 we find
\[
\alpha_i = \frac{(e^{(i)})^T (b - A x^{(*)})}{(e^{(i)})^T A e^{(i)}} = 0, \qquad i = 1, \ldots, n,
\]
(see Remark 2.8.5i), which implies Ax^{(∗)} = b. Thus the limit x^{(∗)} is the unique minimizer of q.
Finally, since \lim_{j\to\infty} q(x^{(k_j)}) = q(x^{(∗)}) and since q(x^{(k+1)}) ≤ q(x^{(k)}) for all k, we have
\[
\lim_{k\to\infty} q(x^{(k)}) = q(x^{(*)}) .
\]
The inequality (2.59) and the positive definiteness of A imply \lim_{k\to\infty} x^{(k)} = x^{(∗)}.
Remark 2.8.9 If e (i) , i = 1, . . . , n, are the Cartesian unit vectors, then Theorem 2.8.8 implies the
convergence of the Gauss-Seidel method for symmetric positive definite systems. Of course, we
already know this from Theorem 2.6.10, and our previous convergence theory also established
k x (k+1) − x (∗) k2 ≤ ρ(GGS )k x (k) − x (∗) k2 .
2.9. Problems
Problem 2.1 Let A = D − E − F, where D is the diagonal of A, −E is the strict lower triangular part
of A and −F is the strict upper triangular part of A. One iteration of the symmetric SOR (SSOR)
method for the solution of Ax = b uses one iteration of the forward SOR method followed by one
iteration of the backward SOR method:
\[
x^{(k+\frac12)} = (D - \omega E)^{-1} \bigl( [\omega F + (1-\omega) D] x^{(k)} + \omega b \bigr),
\]
\[
x^{(k+1)} = (D - \omega F)^{-1} \bigl( [\omega E + (1-\omega) D] x^{(k+\frac12)} + \omega b \bigr).
\]
i. Show that
\[
x^{(k+1)} = (I - M_{\mathrm{SSOR}}^{-1} A) x^{(k)} + M_{\mathrm{SSOR}}^{-1} b,
\]
where
\[
M_{\mathrm{SSOR}} = \frac{1}{\omega(2-\omega)} (D - \omega E) D^{-1} (D - \omega F).
\]
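The identity claimed in part i can be checked numerically before proving it. The Matlab lines below are not part of the original notes; they compare one SSOR sweep with the splitting form for a small example.

% Not from the original notes: numerical check of the SSOR splitting matrix.
n = 6;  omega = 1.3;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
b = ones(n,1);  x = rand(n,1);
D = diag(diag(A));  E = -tril(A,-1);  F = -triu(A,1);
xhalf = (D - omega*E) \ ((omega*F + (1-omega)*D)*x     + omega*b);  % forward SOR
xnew  = (D - omega*F) \ ((omega*E + (1-omega)*D)*xhalf + omega*b);  % backward SOR
M = (D - omega*E)*(D\(D - omega*F)) / (omega*(2-omega));
disp(norm(xnew - ((eye(n) - M\A)*x + M\b)))   % ~ 0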
Problem 2.2 Let G ∈ Rn×n be a square matrix. The purpose of the exercise is to show that
i. Define
\[
\binom{k}{j} = \begin{cases} \dfrac{k!}{j!\,(k-j)!} & \text{when } j = 0, \ldots, k, \\ 0 & \text{otherwise.} \end{cases}
\]
Let
\[
J_\nu = \begin{pmatrix}
\lambda_\nu & 1 & 0 & \cdots & 0 & 0 \\
0 & \lambda_\nu & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
\vdots & & & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \lambda_\nu & 1 \\
0 & 0 & 0 & \cdots & 0 & \lambda_\nu
\end{pmatrix} \in \mathbb{C}^{n_\nu \times n_\nu}
\]
be a Jordan block of order n_ν with eigenvalue λ_ν ∈ C, λ_ν ≠ 0. Show that the components of
its kth power satisfy
\[
(J_\nu^k)_{ij} = \binom{k}{j-i} \lambda_\nu^{\,k-j+i} .
\]
Show that
\[
\| J_\nu^k \|_\infty = \sum_{j=1}^{n_\nu} \bigl| (J_\nu^k)_{1j} \bigr|
\]
and
\[
\| J^k \|_\infty = \max_{\nu = 1, \ldots, \ell} \| J_\nu^k \|_\infty ,
\]
where
\[
J = \begin{pmatrix}
J_1 & 0 & \cdots & 0 & 0 \\
0 & J_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & J_{\ell-1} & 0 \\
0 & 0 & \cdots & 0 & J_\ell
\end{pmatrix} .
\]
If λ_ν ≠ 0, then parts i and iii imply
\[
\| J_\nu^k \|_\infty = \sum_{j=1}^{n_\nu} \bigl| (J_\nu^k)_{1j} \bigr|
= \sum_{j=1}^{n_\nu} \binom{k}{j-1} |\lambda_\nu|^{\,k-j+1}
= |\lambda_\nu|^k \sum_{j=1}^{n_\nu} \binom{k}{j-1} |\lambda_\nu|^{\,-j+1} .
\]
Problem 2.3 Prove Theorem 2.5.5 using the Jordan normal form of G.
Problem 2.6
i. Show that for a square matrix A
\[
\lim_{k\to\infty} A^k = 0 \iff \rho(A) < 1 .
\]
ii. Let S_k = \sum_{j=0}^{k} A^j. Show that
\[
\lim_{k\to\infty} S_k = (I - A)^{-1} \iff \rho(A) < 1 .
\]
Problem 2.7 Let A = D − E − F, where either D, −E and −F are given as in (2.12) or in (2.19).
Given ω ∈ R, the iteration matrix of the damped Jacobi method is given by G_{J,ω} = (1 − ω)I + ω G_J, where G_J = D^{-1}(E + F).
ii. Show that if all eigenvalues of G_J are real and ordered such that λ_1 ≥ . . . ≥ λ_n and if λ_1 < 1,
then the spectral radius of G_{J,ω} is minimal for
\[
\omega_{\mathrm{opt}} = \frac{2}{2 - \lambda_1 - \lambda_n} .
\]
Problem 2.8 Let A ∈ Rn×n be symmetric positive definite, b ∈ Rn , and let x ∗ be the solution of
Ax = b.
i. Show that the iteration
\[
x^{(k+1)} = x^{(k)} - \omega (A x^{(k)} - b)
\]
converges to the solution x^∗ of Ax = b for any x^{(0)} if and only if
\[
\omega \in \Bigl( 0, \frac{2}{\| A \|_2} \Bigr).
\]
Note: If we define Q(x) = \tfrac12 x^T A x - b^T x, then ∇Q(x) = Ax − b and the iteration x^{(k+1)} =
x^{(k)} − ω(Ax^{(k)} − b) = x^{(k)} − ω∇Q(x^{(k)}) is the steepest descent method for the minimization
of Q. In the context of solving symmetric positive definite linear systems, this iteration is
also known as the Richardson iteration. We will return to this method in Section 3.6.1.
ii. Now consider the iteration
\[
x^{(k+1)} = x^{(k)} - \omega (A x^{(k)} - b + e^{(k)}),
\]
where e^{(k)} ∈ R^n is an error that satisfies ‖e^{(k)}‖_2 ≤ δ for all k. Let λ_min and λ_max be the
smallest and largest eigenvalue of A, respectively. Show that if
\[
q \stackrel{\mathrm{def}}{=} \max\{ |1 - \omega \lambda_{\min}|, |1 - \omega \lambda_{\max}| \} < 1,
\]
then the errors satisfy \| x^{(k)} - x^* \|_2 \le q^k \| x^{(0)} - x^* \|_2 + \omega \delta / (1 - q) for all k.
Problem 2.9 Let A ∈ Rn×n be nonsingular and let x ∗ be the solution of Ax = b. Given the
splitting A = M − N, where M is nonsingular, we consider the basic iterative method x (k+1) =
(I − M −1 A)x (k) + M −1 b. Due to floating point errors, we can only compute
\[
\widehat x^{(k+1)} = (I - M^{-1} A) \widehat x^{(k)} + M^{-1} b + d^{(k)},
\]
where d^{(k)} ∈ R^n represents the error in the computation of (I − M^{-1}A)\widehat x^{(k)}. The error \widehat e^{(k)} = \widehat x^{(k)} − x^∗
obeys
\[
\widehat e^{(k+1)} = (I - M^{-1} A) \widehat e^{(k)} + d^{(k)} .
\]
ii. Assume that ρ(I − M^{-1}A) < 1 and ‖d^{(k)}‖_2 ≤ δ. Prove that the sequence of errors \{\widehat e^{(k)}\}
remains bounded. Find as good an upper bound as you can for \limsup_{k\to\infty} \| \widehat e^{(k)} \|_2.
Problem 2.10 (The heavy ball method [Pol64]. Taken in modified form from [Ber95, p.78].)
Let A ∈ Rn×n be symmetric positive definite with smallest and largest eigenvalue λ min and
λ max , respectively, and let b ∈ Rn . Furthermore, let x ∗ be the solution of Ax = b and consider the
iteration
x (k+1) = x (k) − α( Ax (k) − b) + β(x (k) − x (k−1) ), (2.65)
where α is a positive stepsize and β is a scalar with 0 < β < 1.
Show that the iteration (2.65) converges to x^∗ for any x^{(0)} if 0 < α < 2(1 + β)/λ_max.
Hint: Consider the iteration
\[
\begin{pmatrix} x^{(k+1)} \\ x^{(k)} \end{pmatrix}
= \begin{pmatrix} (1+\beta) I - \alpha A & -\beta I \\ I & 0 \end{pmatrix}
\begin{pmatrix} x^{(k)} \\ x^{(k-1)} \end{pmatrix}
+ \alpha \begin{pmatrix} b \\ 0 \end{pmatrix}
\]
and show that μ is an eigenvalue of the matrix in the above equation if and only if μ + β/μ is equal
to 1 + β − αλ, where λ is an eigenvalue of A.
Problem 2.11 Let A, M ∈ R^{n×n} be symmetric positive definite and consider the iteration x^{(k+1)} = (I − M^{-1}A)x^{(k)} + M^{-1}b.
ii. Show that if M − A is positive semidefinite, then all eigenvalues of I − M −1 A are contained
in the interval [0, 1). In particular, ρ(I − M −1 A) < 1.
iii. Let A = D − E − F where either D, −E and −F = −ET are given as in (2.12) or in (2.19),
and let M = D. Use part i to prove Theorem 2.6.12.
iv. Let A = D − E − F where either D, −E and −F = −ET are given as in (2.12) or in (2.19),
and let
M = (D − E)D−1 (D − ET )
be the matrix corresponding to the symmetric Gauss-Seidel method (see Problem 2.1). Show
that ρ(I − M −1 A) < 1.
Problem 2.12
i. Let B_1 ∈ R^{n_1×n_2} and B_2 ∈ R^{n_2×n_1} and consider the matrix
\[
B = \begin{pmatrix} 0 & B_1 \\ B_2 & 0 \end{pmatrix} .
\]
Such a matrix is called 2-cyclic.
– Show that the spectrum of B is
ii. Let
\[
A = \begin{pmatrix} D_1 & A_1 \\ A_2 & D_2 \end{pmatrix}
= \underbrace{\begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix}}_{=D}
+ \underbrace{\begin{pmatrix} 0 & 0 \\ A_2 & 0 \end{pmatrix}}_{=-E}
+ \underbrace{\begin{pmatrix} 0 & A_1 \\ 0 & 0 \end{pmatrix}}_{=-F} .
\]
– Show that \rho(I - D^{-1} A) = \sqrt{ \rho(D_1^{-1} A_1 D_2^{-1} A_2) }.
– Show that \rho(I - (D - E)^{-1} A) = \rho(D_1^{-1} A_1 D_2^{-1} A_2).
iii. Let D, E, F be the matrices in part ii. Show that the eigenvalues of αD−1 E + α −1 D−1 F do
not depend on α, that is, A is consistently ordered.
is a nonsingular (block) diagonal matrix and B is three-cyclic. How can the eigenvalues of
the (block) Jacobi iteration matrix be related to those of the (block) forward Gauss-Seidel
iteration matrix? How does the asymptotic convergence rate of the (block) Jacobi method
compare with that of the (block) forward Gauss-Seidel method.
iii. Answer the same questions as in ii for the case when (block) SOR replaced (block) forward
Gauss-Seidel.
\[
B = \begin{pmatrix}
0 & E_1 & & & \\
& 0 & E_2 & & \\
& & \ddots & \ddots & \\
& & & 0 & E_{p-1} \\
E_p & & & & 0
\end{pmatrix} .
\]
Problem 2.14 Let the symmetric positive definite matrix A = B + C be split into two matrices
B, C ∈ Rn×n such that B is symmetric positive definite and C is symmetric positive semidefinite.
The linear system Ax = b can be split into
i. Show that x (∗) solves Ax = b if and only if x (∗) solves (2.66a) if and only if x (∗) solves
(2.66b).
with
GADI = (C + r I) −1 (r I − B)(B + r I) −1 (r I − C).
What is d?
iii. Show that the spectral radius ρ(G_ADI) is equal to the spectral radius
\[
\rho\bigl( (rI - B)(B + rI)^{-1} (rI - C)(C + rI)^{-1} \bigr).
\]
The same arguments can be applied to show that the spectral radius of (r I − C)(C + r I) −1
satisfies
ρ (r I − C)(C + r I) −1 ≤ 1.
v. Use part iv, to show that ρ(GADI ) < 1. (Hint: For two symmetric matrices M1, M2 , we have
ρ(M1 M2 ) ≤ k M1 M2 k2 ≤ k M1 k2 k M2 k2 = ρ(M1 ) ρ(M2 ).)
vi. Assume that the matrices B and C commute, i.e., BC = CB. In this case they are simultane-
ously diagonalizable, i.e., there exists an orthogonal matrix V ∈ Rn×n and diagonal matrices
D B = diag( β1, . . . , βn ) and DC = diag(γ1, . . . , γn ) such that
B = V DBV T , and C = V DC V T .
Let x^{(∗)} solve Ax = b. Show that the error e^{(k)} = x^{(k)} − x^{(∗)} of the iteration (2.67) satisfies
\[
\| e^{(k)} \|_2 \le \max_{j=1,\ldots,n} \left| \frac{(r - \gamma_j)(r - \beta_j)}{(r + \gamma_j)(r + \beta_j)} \right|^{k} \| e^{(0)} \|_2 .
\]
Instead of choosing a fixed parameter r, we can select a different parameter r_i for the ith
iteration. In this case
\[
\| e^{(k)} \|_2 \le \max_{j=1,\ldots,n} \left| \prod_{i=1}^{k} \frac{(r_i - \gamma_j)(r_i - \beta_j)}{(r_i + \gamma_j)(r_i + \beta_j)} \right| \| e^{(0)} \|_2 .
\]
Problem 2.15 Let Ω = (0, 1) 2 with boundary ∂Ω and let γ ≥ 0. Consider the Poisson equation
−∆u(x, y) + γu(x, y) = f (x, y), (x, y) ∈ Ω (2.68a)
u(x, y) = 0, (x, y) ∈ ∂Ω. (2.68b)
The finite difference method for (2.68) with n x = n y = n and h = 1/(n + 1) leads to a system of
equations
− ui−1, j − ui, j−1 + 4ui j − ui+1, j − ui, j+1 + γh2ui j = h2 f (x i, y j ), (2.69)
for i = 1, . . . , n and j = 1, . . . , n, where ui j = 0 for i ∈ {0, n x } or j ∈ {0, n y }. This leads to linear
system Au = b. The alternating-direction implicit (ADI)3 method splits the equations into
[−ui−1, j + 2ui j − ui+1, j ] + [−ui, j−1 + 2ui j − ui, j+1 ] + γh2ui j = h2 f (x i, y j ). (2.70)
3The ADI method was originally developed by Peaceman and Rachford [PR55]. See also [Pea90] for a history of
this method.
their action on a vector z ∈ R^{n^2} with components z_{ij}. (Note that in the context of finite difference
discretization the entries of the vectors u, w, z, ... correspond to functions on a grid with points
(x_i, y_j). Therefore it is convenient to use double indices w_{ij} to indicate that this is the value of a
function at grid point (x_i, y_j).)
\[
w_{ij} = -z_{i-1,j} + 2 z_{ij} - z_{i+1,j} + \tfrac12 \gamma h^2 z_{ij} \quad \text{if } w = Bz, \qquad (2.71a)
\]
\[
w_{ij} = -z_{i,j-1} + 2 z_{ij} - z_{i,j+1} + \tfrac12 \gamma h^2 z_{ij} \quad \text{if } w = Cz, \qquad (2.71b)
\]
for i, j = 1, \ldots, n. (We set z_{0,j} = z_{n+1,j} = z_{i,0} = z_{i,n+1} = 0!)
Given the matrix splitting,
A= B+C
we use the iterative scheme
(B + r I)u (k+1/2) = (r I − C)u (k) + b, (2.72a)
(C + r I)u (k+1) = (r I − B)u (k+1/2) + b (2.72b)
which can be combined into
u (k+1) = GADIu (k) + (C + r I) −1 [I + (r I − B)(B + r I) −1 ]b. (2.73)
with
GADI = (C + r I) −1 (r I − B)(B + r I) −1 (r I − C).
The advantage of (2.72) is that the computation of u^{(k+1/2)} and u^{(k+1)} requires the solution of two
block-diagonal systems, provided we order the unknowns in a suitable way.
The definition (2.71) shows that if we order the unknowns u^{(k+1/2)} and right hand side in (2.72a)
along grid lines with constant y_j,
\[
u^{(k+1/2)} = \bigl( u^{(k+1/2)}_{11}, \ldots, u^{(k+1/2)}_{n1},\; u^{(k+1/2)}_{12}, \ldots, u^{(k+1/2)}_{n2},\; \ldots\ldots,\; u^{(k+1/2)}_{1n}, \ldots, u^{(k+1/2)}_{nn} \bigr)^T,
\]
then
\[
B + r I = \begin{pmatrix} T & & \\ & \ddots & \\ & & T \end{pmatrix},
\]
where
\[
T = \begin{pmatrix}
2 + \tfrac12 \gamma h^2 + r & -1 & & & \\
-1 & 2 + \tfrac12 \gamma h^2 + r & -1 & & \\
& \ddots & \ddots & \ddots & \\
& & -1 & 2 + \tfrac12 \gamma h^2 + r & -1 \\
& & & -1 & 2 + \tfrac12 \gamma h^2 + r
\end{pmatrix} . \qquad (2.74)
\]
If we order the unknowns u^{(k+1)} and right hand side in (2.72b) along grid lines with constant x_i,
\[
u^{(k+1)} = \bigl( u^{(k+1)}_{11}, \ldots, u^{(k+1)}_{1n},\; u^{(k+1)}_{21}, \ldots, u^{(k+1)}_{2n},\; \ldots\ldots,\; u^{(k+1)}_{n1}, \ldots, u^{(k+1)}_{nn} \bigr)^T,
\]
then
\[
C + r I = \begin{pmatrix} T & & \\ & \ddots & \\ & & T \end{pmatrix},
\]
where T is given as before.
Thus, one iteration of (2.72) requires the solution of 2n linear systems with the tridiagonal matrix T (which corresponds
to the discretization of a one-dimensional differential equation).
i. Show that
\[
B v^{(k,\ell)} = \bigl( \lambda_k + \tfrac12 \gamma h^2 \bigr) v^{(k,\ell)}, \qquad (2.75a)
\]
\[
C v^{(k,\ell)} = \bigl( \lambda_\ell + \tfrac12 \gamma h^2 \bigr) v^{(k,\ell)}, \qquad (2.75b)
\]
for k, \ell = 1, \ldots, n, where the eigenvalues are given by
\[
\lambda_k = 4 \sin^2\Bigl( \frac{k\pi}{2(n+1)} \Bigr), \qquad k = 1, \ldots, n,
\]
and the eigenvectors v^{(k,\ell)}, k, \ell = 1, \ldots, n, have components
\[
v^{(k\ell)}_{ij} = \sin\Bigl( \frac{k\pi i}{n+1} \Bigr) \sin\Bigl( \frac{\ell\pi j}{n+1} \Bigr), \qquad i, j = 1, \ldots, n.
\]
iii. Apply the ADI method to partial differential equation (2.68) with γ ≥ 0 and right hand side
f such that the exact solution is u(x, y) = 16x(1 − x)y(1 − y). (Any r > 0 will work. Try
r = 1.)
More information on the convergence of the ADI method for the model problem can be found in
the original paper by Peaceman and Rachford [PR55], or in the book by Stoer and Bulirsch [SB93,
Sec. 8.6].
Problem 2.16 Let A ∈ Rn×n be symmetric positive definite and let B ∈ Rm×n have rank m < n.
Sy = r (2.77)
ii. The Uzawa iteration for the solution of (2.76) generates a sequence of iterates x (k), y (k) as
follows:
(Note that it is x (k+1) in (2.78b), not x (k) !) Show that if A is symmetric positive definite, the
iterates y (k) generated by the Uzawa iteration are the iterates generated by gradient method
discussed in Problem 2.8 applied to the Schur complement system (2.77).
\[
\begin{pmatrix} A & B^T \\ -\omega B & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} c \\ -\omega d \end{pmatrix}. \qquad (2.79)
\]
Show that the Uzawa iteration (2.78) is obtained from a matrix splitting
\[
\begin{pmatrix} A & B^T \\ -\omega B & 0 \end{pmatrix} = M - N
\]
and is given by
\[
\begin{pmatrix} x^{(k+1)} \\ y^{(k+1)} \end{pmatrix}
= M^{-1} N \begin{pmatrix} x^{(k)} \\ y^{(k)} \end{pmatrix}
+ M^{-1} \begin{pmatrix} c \\ -\omega d \end{pmatrix}.
\]
What are M, N, and M^{-1}N?
Show that ρ(M^{-1}N) < 1 if and only if ω ∈ (0, 2/‖B A^{-1} B^T‖_2).
Problem 2.17 Let B ∈ Rm×n have rank m < n, and let A ∈ Rn×n be symmetric positive semidefinite
and symmetric positive definite on the null-space N (B) of B, i.e., let A satisfy vT Av ≥ 0 for all
v ∈ Rn and vT Av > 0 for all v ∈ N (B) \ {0}.
\[
\begin{pmatrix} A + \omega B^T B & B^T \\ -\omega B & 0 \end{pmatrix} = M - N
\]
and is given by
iv. Show that ρ(M^{-1}N) < 1 if and only if \rho\bigl( B(\omega^{-1} A + B^T B)^{-1} B^T \bigr) \in (0, 2).
v. Let A be symmetric positive definite. Show that \rho\bigl( B(\omega^{-1} A + B^T B)^{-1} B^T \bigr) \le 1 for all ω > 0.
Hint: First show that for sufficiently small ε > 0 we have v^T(\omega^{-1} A + B^T B) v \ge v^T(\epsilon I + B^T B) v
for all vectors v ∈ R^n. Then v^T B(\omega^{-1} A + B^T B)^{-1} B^T v \le \ldots
\[
\underbrace{\begin{pmatrix} A & B^T \\ B & D \end{pmatrix}}_{=K}
\underbrace{\begin{pmatrix} y \\ z \end{pmatrix}}_{=x}
= \underbrace{\begin{pmatrix} e \\ f \end{pmatrix}}_{=b} \qquad (2.83)
\]
with nonsingular symmetric matrix A ∈ Rn×n , symmetric matrix D ∈ Rm×m , and matrix B ∈ Rm×n .
(D − B A−1 BT )z = f − B A−1 e.
iii. Show that K is positive definite if and only if A, D, and S are positive definite.
iv. Assume that K is positive definite (in particular A, D, and S are positive definite).
Show that the block (forward) Gauss-Seidel method applied to (2.83) is equivalent to the
iteration
z (k+1) = D−1 B A−1 BT z (k) + D−1 ( f − B A−1 e).
v. Again assume that K is positive definite (in particular A, D, and S are positive definite).
Show that the eigenvalues of the iteration matrix D−1 B A−1 BT are contained in [0, 1).
Problem 2.19 We apply a simple Domain Decomposition Method to the finite difference dis-
cretization
\[
\frac{1}{h^2}
\begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
& \ddots & \ddots & \ddots & \\
& & -1 & 2 & -1 \\
& & & -1 & 2
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{pmatrix}
=
\begin{pmatrix} f(x_1) + \frac{1}{h^2} g_0 \\ f(x_2) \\ \vdots \\ f(x_{n-1}) \\ f(x_n) + \frac{1}{h^2} g_1 \end{pmatrix} . \qquad (2.84)
\]
We write (2.84) as
Ay = b. (2.85)
Assume that n = (k + 1)m − 1 for k, m ∈ N. We partition the equations (2.84) into groups
corresponding to indices
Γ = {m, 2m, . . . , km}
and
I j = { jm + 1, . . . , ( j + 1)m − 1}, j = 0, . . . , k
and we define
I = ∪ kj=0 I j .
Given an vector y ∈ Rn and an index set J ⊂ {1, . . . , n}, we use y J ∈ R| J | to denote the
subvector corresponding to that index set. Similarly, given a matrix A ∈ Rn×n and index sets
J, K ⊂ {1, . . . , n}, we use A JK ∈ R| J |×|K | to denote the submatrix corresponding to these index sets
i. If we reorder the equations and unknowns in (2.84) according to I = ∪ kj=1 I j , Γ, then the
resulting system is ! ! !
AI I AIΓ yI bI
= , (2.86)
AΓI AΓΓ yΓ bΓ
where AΓI = ATIΓ .
The matrices AI I , AIΓ, AΓΓ have a particular structure. Sketch the system (2.86) such that this
structure is revealed. It may be useful to start with the special case k = 1, m = 3.
ii. The system (2.86) is of the type (2.83). Implement the block Gauss-Seidel iteration of
Problem 2.18 iv.,
\[
y_\Gamma^{(k+1)} = A_{\Gamma\Gamma}^{-1} A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}\, y_\Gamma^{(k)} + A_{\Gamma\Gamma}^{-1} \bigl( b_\Gamma - A_{\Gamma I} A_{II}^{-1} b_I \bigr).
\]
Use the structure of A_{II}, A_{IΓ}, A_{ΓΓ}. (For example, A_{II} is block diagonal, and for bigger
problems the application of A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}\, y_\Gamma^{(k)} can be done in parallel.)
Use k = 9, m = 10, and data f , g0, g1 such that the exact solution of (2.84) is y(x) = cos(2πx).
Plot the convergence history k Ay (k) − bk2 vs. k.
Problem 2.20 This problem studies the Kaczmarz method originally proposed in [Kac37] (see also
the translation [Kac93]). See the books by Natterer [Nat01] and Natterer and Wübbeling [NW01]
for application of the Kaczmarz method in image reconstruction.
Let A ∈ Rn×n be nonsingular, let b ∈ Rn , let ei be the i-th unit vector and let ai = AT ei ∈ Rn be
the transpose of the i-th row of A.
i. The projection of y ∈ R^n onto the set \{ x ∈ \mathbb{R}^n : a_i^T x = b_i \} is the solution x of
\[
\min_{x \in \mathbb{R}^n} \tfrac12 \| x - y \|_2^2 \quad \text{subject to} \quad a_i^T x = b_i . \qquad (2.87)
\]
Show that
\[
x = y - \frac{(A y - b)^T e_i}{a_i^T a_i}\, a_i .
\]
(Hint: Since (2.87) is a convex optimization problem, the Lagrange Multiplier Theorem
provides the necessary and sufficient conditions for the solution of (2.87).)
ii. Given a vector x (k) the Kaczmarz iteration computes a new approximation x (k+1) of the linear
system as follows.
For i = 1, \ldots, n
\[
x^{(k+i/n)} = x^{(k+(i-1)/n)} - \frac{(A x^{(k+(i-1)/n)} - b)^T e_i}{a_i^T a_i}\, a_i .
\]
End
Let n = 2. Sketch the Kaczmarz iteration, i.e., the steps x^{(0)}, x^{(1/2)}, x^{(1)}, x^{(1+1/2)}, \ldots (a small Matlab sketch of one Kaczmarz sweep is given after this problem).
iii. The Gauss-Seidel iteration applied to AA^T \widetilde x = b computes
For i = 1, \ldots, n
\[
\widetilde x^{(k+1)}_i = \frac{1}{(AA^T)_{ii}} \Bigl( b_i - \sum_{j<i} (AA^T)_{ij}\, \widetilde x^{(k+1)}_j - \sum_{j>i} (AA^T)_{ij}\, \widetilde x^{(k)}_j \Bigr)
= \widetilde x^{(k)}_i - \frac{1}{(AA^T)_{ii}} \Bigl( \sum_{j<i} (AA^T)_{ij}\, \widetilde x^{(k+1)}_j + \sum_{j \ge i} (AA^T)_{ij}\, \widetilde x^{(k)}_j - b_i \Bigr).
\]
End
or, equivalently,
For i = 1, \ldots, n
\[
\widetilde x^{(k+i/n)} = \widetilde x^{(k+(i-1)/n)} - \frac{1}{(AA^T)_{ii}}\, e_i \Bigl( \sum_{j=1}^{n} (AA^T)_{ij}\, \widetilde x^{(k+(i-1)/n)}_j - b_i \Bigr).
\]
End
Show that the Kaczmarz iteration for Ax = b is equivalent to the Gauss-Seidel iteration applied
to AA^T \widetilde x = b in the sense that the iterates satisfy x^{(k)} = A^T \widetilde x^{(k)}.
iv. What can you say about the convergence of the Kaczmarz iteration?
v. This part applies the Kaczmarz iteration to an image deblurring problem. The true image is
represented by a function f : [0, 1] → [0, 1] (think of f (x) as the gray scale of the image at
x). The blurred image g : [0, 1] → R is given by
\[
\int_0^1 k(\xi_1, \xi_2) f(\xi_2)\, d\xi_2 = g(\xi_1), \qquad \xi_1 \in [0, 1], \qquad (2.88)
\]
where χ I is the indicator function on the interval I. We insert these approximations into
(2.88) and approximate the integral by the midpoint rule. This leads to the linear system
Kf = g, (2.89)
where
f = ( f 1, . . . , f n )T , g = (g1, . . . , gn )T ,
and
Ki j = h k (ξi, ξ j ), i, j = 1, . . . , n.
Let n = 100 and construct the true image f_true via

xi = ((1:n)'-0.5)/n;        % midpoints of the n subintervals (assumed grid;
                            % the definition of xi is not shown in the original)
ftrue = zeros(n,1);
ftrue = exp( -(xi-0.75).^2 * 70 );
ind = (0.1<=xi) & (xi<=0.25);
ftrue(ind) = 0.8;
ind = (0.3<=xi) & (xi<=0.35);
ftrue(ind) = 0.3;
and compute the resulting blurred image g = Kf true . See the left plot in Figure 2.6. We want
to recover f true from g and (2.89).
The matrix K is highly ill-conditioned. Solving (2.89) using Matlab ’s backslash leads to a
highly oscillatory function, indicated by the blue dashed lines in center plot in Figure 2.6.
Apply the Kaczmarz iteration to (2.89) with starting point f (0) = 0. Stop the iteration when
kKf (k) − gk2 ≤ 10−2 kgk2 . Generate plots like those shown in Figure 2.6, as well as a plot of
the residual norms kKf (k) − gk2 .
The right plot in Figure 2.6 shows that the Kaczmarz iteration recovers the true image fairly
well, especially the smooth parts of true image. This is due to the early termination of the
iteration.
Figure 2.6: Left plot: True image f true and blurred image g. Middle plot: True image f true , recovered
image f = K−1 g and blurred image g. Right plot: True image f true , recovered image f (k) using
the Kaczmarz iteration with stopping criteria kKf (k) − gk2 ≤ 10−2 kgk2 , and blurred image g. The
image f (k) recovered using the Kaczmarz iteration matches f true and especially the smooth part of
f true well.
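The following Matlab lines are not part of the original notes; they sketch one full Kaczmarz sweep, i.e., the map x^{(k)} → x^{(k+1)} of part ii, which, applied repeatedly, is the iteration used in part v.

% Not from the original notes: one Kaczmarz sweep x^(k) -> x^(k+1).
function x = kaczmarz_sweep(A, b, x)
  n = size(A,1);
  for i = 1:n
    ai = A(i,:)';                              % transpose of the i-th row of A
    x  = x - ((ai'*x - b(i)) / (ai'*ai)) * ai; % project onto {x : a_i'*x = b_i}
  end
end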
References
[PR55] D. W. Peaceman and H. H. Rachford, Jr. The numerical solution of parabolic and elliptic
differential equations. J. Soc. Indust. Appl. Math., 3:28–41, 1955.
[Saa03] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, second
edition, 2003.
[SB93] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer Verlag, New York,
Berlin, Heidelberg, London, Paris, second edition, 1993.
[TE05] L. N. Trefethen and M. Embree. Spectra and Pseudospectra. The Behavior of Nonnormal
Matrices and Operators. Princeton University Press, Princeton, NJ, 2005.
[Var62] R. S. Varga. Matrix Iterative Analysis. Prentice Hall, Englewood Cliffs, NJ, 1962.
[Wri15] S. J. Wright. Coordinate descent algorithms. Math. Program., 151(1, Ser. B):3–34,
2015. URL: http://dx.doi.org/10.1007/s10107-015-0892-3, doi:10.1007/
s10107-015-0892-3.
[Xu92] J. Xu. Iterative methods by space decomposition and subspace correction. SIAM Review,
34:581–613, 1992.
[You71] D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, New York,
1971. Republished by Dover [You03].
[You03] D. M. Young. Iterative Solution of Large Linear Systems. Dover Publications Inc.,
Mineola, NY, 2003. Unabridged republication of the 1971 edition [You71].
Chapter 3
Krylov Subspace Methods
3.1. Introduction
In this section we study Krylov subspace methods for the solution of square linear systems
\[
A x = b, \qquad (3.1)
\]
or of the equivalent quadratic minimization problem
\[
\min \; \tfrac12 x^T A x - b^T x \qquad (3.2)
\]
with symmetric positive definite matrix A. These methods successively generate nested subspaces
\[
\mathcal{K}_1(A, r_0) \subset \mathcal{K}_2(A, r_0) \subset \ldots \subset \mathcal{K}_k(A, r_0) \subset \mathbb{R}^n,
\]
\[
\langle x, y \rangle = x^T y
\]
and
\[
\| x \| = \| x \|_2 = \sqrt{x^T x} .
\]
The reason for this notation is that the Krylov subspace methods can be easily extended to linear
operator equations in Hilbert spaces (X, ⟨·, ·⟩). In the general case the transpose of A has to be replaced
by the adjoint A∗ , which is the linear operator that satisfies hAx, yi = hx, A∗ yi for all x, y ∈ X.
\[
\mathcal{V}_k = \mathcal{R}(V_k). \qquad (3.4)
\]
\[
\| x \|_A \stackrel{\mathrm{def}}{=} \langle A x, x \rangle^{1/2}, \qquad (3.5)
\]
then Ax^∗ = b implies
\[
\tfrac12 \langle A x, x \rangle - \langle b, x \rangle
= \tfrac12 \langle A(x - x^*), x - x^* \rangle - \tfrac12 \langle A x^*, x^* \rangle
= \tfrac12 \| x - x^* \|_A^2 - \tfrac12 \| x^* \|_A^2 .
\]
Thus,
\[
\tfrac12 \langle A x, x \rangle - \langle b, x \rangle < \tfrac12 \langle A y, y \rangle - \langle b, y \rangle
\quad \text{if and only if} \quad \| x - x^* \|_A < \| y - x^* \|_A .
\]
Therefore, we can minimize \tfrac12 \langle A x, x \rangle - \langle b, x \rangle over x_0 + \mathcal{V}_k to compute the approximation x_k of the
solution x^* of the linear system A x = b. Using (3.4) the minimization problem
\[
\min_{x \in x_0 + \mathcal{V}_k} \tfrac12 \langle A x, x \rangle - \langle b, x \rangle \qquad (3.6)
\]
can be written as
\[
\min_{y \in \mathbb{R}^k} \tfrac12 \langle V_k^T A V_k y, y \rangle - \langle V_k^T (b - A x_0), y \rangle + \tfrac12 \langle A x_0, x_0 \rangle . \qquad (3.7)
\]
Theorem 3.2.1 Let A ∈ Rn×n be symmetric positive semidefinite, let b ∈ Rn , and let Vk ⊂ Rn . The
vector x k ∈ x 0 + Vk solves (3.6) if and only if
hAx k − b, vi = 0 ∀ v ∈ Vk . (3.8)
Proof: Let {v1, . . . , vk } be a basis for Vk and define Vk = (v1, · · · , vk ) ∈ Rn×k . For every
x k ∈ x 0 + Vk there exists a unique y k ∈ R k such that x k = x 0 + Vk y k . The minimization problem
(3.6) is equivalent to (3.7). By Theorem 1.2.1 the minimization problem (3.7) has a solution y k if
and only if VkT AVk y k = VkT (b − Ax 0 )
hAx k − b, vi = 0 ∀ v ∈ Vk . (3.10)
Figure 3.1: The vector x_k ∈ x_0 + V_k is a Galerkin approximation if and only if the residual Ax_k − b
is orthogonal to V_k.
Now suppose that Vk = R (Vk ), and that we have computed the Galerkin approximation
x k ∈ x 0 + Vk . If Ax k − b , 0, then we want to expand the subspace, i.e., generate a new subspace
Vk+1 = R ((Vk , vk+1 )) in such a way that the Galerkin approximation x k+1 ∈ x 0 + Vk+1 is a better
approximation to x ∗ in the sense that
What are the requirements on the vector vk+1 ? The next theorem shows that the new Galerkin
approximation x k+1 ∈ x 0 + Vk+1 satisfies (3.12) if and only the vector vk+1 is not orthogonal to
Ax k − b.
Theorem 3.2.3 Let A ∈ Rn×n be symmetric positive definite. Furthermore, let Vk ∈ Rn×k and
Vk+1 = (Vk , vk+1 ) ∈ Rn×(k+1) , and let Vk = R (Vk ), Vk+1 = R (Vk+1 ) be the corresponding
subspaces. Then,
\[
\min_{x \in x_0 + \mathcal{V}_{k+1}} \tfrac12 \langle A x, x \rangle - \langle b, x \rangle
< \min_{x \in x_0 + \mathcal{V}_k} \tfrac12 \langle A x, x \rangle - \langle b, x \rangle
\]
Proof: Define Q(x) \stackrel{\mathrm{def}}{=} \tfrac12 \langle A x, x \rangle - \langle b, x \rangle. Note that ∇Q(x) = Ax − b, and note that for all v ∈ R^n
and λ ∈ R we have
\[
Q(x_k + \lambda v) = Q(x_k) + \frac{\lambda^2}{2} \langle A v, v \rangle + \lambda \langle A x_k - b, v \rangle. \qquad (3.13)
\]
i. If hAx k − b, vk+1 i , 0, then we can find λ ∈ R with λhAx k − b, vk+1 i < 0 and |λ| sufficiently
small such that (3.13) with v = vk+1 implies
Since Q(x k+1 ) < Q(x k ) = min x∈x 0 +Vk Q(x) it follows that x k+1 < x 0 + Vk . Therefore x k+1 − x k
can be written in the form x k+1 − x k = v + λvk+1 , where v ∈ Vk , and λ , 0. With (3.10) this yields
the desired
We use Theorem 3.2.3 to construct our subspaces. Let x_0 be given. We want to find \mathcal{V}_1 = \mathrm{span}\{v_1\}
such that \tfrac12 \langle A x_1, x_1 \rangle - \langle b, x_1 \rangle < \tfrac12 \langle A x_0, x_0 \rangle - \langle b, x_0 \rangle. By Theorem 3.2.3, v_1 must satisfy
\[
\langle A x_0 - b, v_1 \rangle \neq 0 .
\]
A natural choice is v_1 = r_0 = b - A x_0, i.e.,
\[
\mathcal{V}_1 = \mathrm{span}\{ r_0 \} .
\]
Let x_1 = \mathrm{argmin}_{x \in x_0 + \mathcal{V}_1} \tfrac12 \langle A x, x \rangle - \langle b, x \rangle. We want to find \mathcal{V}_2 = \mathrm{span}\{ \mathcal{V}_1 \cup \{v_2\} \} such that
\tfrac12 \langle A x_2, x_2 \rangle - \langle b, x_2 \rangle < \tfrac12 \langle A x_1, x_1 \rangle - \langle b, x_1 \rangle. Thus we have to choose v_2 such that
\[
\langle A x_1 - b, v_2 \rangle \neq 0 .
\]
Vk = span{r 0, Ar 0, . . . , Ak−1r 0 }.
Definition 3.2.4 Given A ∈ Rn×n and v ∈ Rn the Krylov subspace (generated by A and v) is defined
by
Kk ( A, v) ≡ span{v, Av, . . . , Ak−1 v}.
Examples 3.2.5 Let e k denote the kth unit vector. Moreover, we use the subspaces Vk =
span{e1, · · · , e k } with corresponding matrices Vk = (e1, · · · , e k ) ∈ R3×k , k = 1, 2, 3
i. Consider
\[
A = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
b = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}.
\]
The unique solution of Ax = b is given by x^∗ = (0, 0, 1)^T, but (3.10) is not solvable for k < 3. For
example, in the case k = 2, (3.11) is equivalent to
\[
V_2^T A \underbrace{V_2 z}_{x_2} = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} z = \begin{pmatrix} 1 \\ 0 \end{pmatrix} = V_2^T b,
\]
which has no solution.
ii. Consider
\[
A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
b = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}.
\]
The unique solution of Ax = b is given by x^∗ = (1, 0, 1)^T. For k = 1 there exists a unique Galerkin
approximation given by x_1 = e_1, but for k = 2 (3.11) is not solvable.
Since in the general case (3.10) does not have a solution or has no unique solution, we need to
define our approximation differently. If A is nonsingular, then x^∗ solves Ax = b if and only if it
minimizes \tfrac12 \| A x - b \|^2 = \tfrac12 \| A(x - x^*) \|^2, and we can use the residual as a measure for the
error. This is possible even if A is not square.
Since
\[
\tfrac12 \| A x - b \|^2 = \tfrac12 \langle A^T A x, x \rangle - \langle A^T b, x \rangle + \tfrac12 \| b \|^2,
\]
the least squares problem (3.15) is equivalent to
One can see immediately that minimum residual approximations x k ∈ x 0 + Vk to x ∗ are Galerkin
approximations for the solution of the normal equation
AT Ax = AT b.
Previously we have seen that if A is symmetric positive definite, the use of Galerkin approxi-
mations leads to the Krylov subspace. What subspace do we select if we use minimum residual
approximations? We have seen that minimum residual approximations are Galerkin approximations
for the normal equation (see (3.15) and (3.16)). Hence, we can apply our arguments for Galerkin
approximation and select the subspace Vk = Kk ( AT A, AT r 0 ). This is done when A ∈ Rm×n .
If A ∈ Rn×n , then we select Vk = Kk ( A, r 0 ), i.e., we compute x k ∈ x 0 + Kk ( A, r 0 ) as the
solution of
\[
\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \tfrac12 \| A x - b \|^2 . \qquad (3.17)
\]
\[
\min_{x \in x_0 + \mathcal{K}_{k+1}(A, r_0)} \tfrac12 \| A x - b \|^2
< \min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \tfrac12 \| A x - b \|^2,
\]
\[
e(x) \stackrel{\mathrm{def}}{=} x^* - x, \qquad r(x) \stackrel{\mathrm{def}}{=} b - A x .
\]
These representations will be important in the convergence analysis of Krylov subspace methods.
Any vector v ∈ \mathcal{K}_k(A, r_0) can be written as v = \sum_{i=0}^{k-1} \gamma_i A^i r_0 = \pi_{k-1}(A) r_0 for some polynomial
\pi_{k-1}(t) = \sum_{i=0}^{k-1} \gamma_i t^i of degree less than or equal to k − 1. If x ∈ x_0 + \mathcal{K}_k(A, r_0), where r_0 = b − A x_0,
then the error obeys
\[
e(x) = x^* - x = x^* - x_0 - \pi_{k-1}(A) r_0
\]
for some polynomial \pi_{k-1} of degree less than or equal to k − 1. Moreover, since r_0 = b − A x_0 =
A(x^* − x_0) we have
\[
e(x) = \bigl( I - \pi_{k-1}(A) A \bigr) (x^* - x_0)
\]
(note that A\pi_{k-1}(A) = \pi_{k-1}(A) A) for the polynomial \pi_{k-1} of degree less than or equal to k − 1 that
appears in the error representation.
or, equivalently,
\[
\tfrac12 \| e_k \|_A^2
= \min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \tfrac12 \| x - x^* \|_A^2
= \min_{\pi \in \mathcal{P}_{k-1}} \tfrac12 \| (I - \pi(A) A) e_0 \|_A^2, \qquad (3.20)
\]
where \mathcal{P}_{k-1} is the set of all polynomials of degree less than or equal to k − 1 and e_0 = x^* − x_0,
e_k = x^* − x_k.
where P k−1 is the set of all polynomials of degree less than or equal to k − 1.
Later in Section 3.8, we will use (3.20) and (3.21) in the convergence analysis of Krylov
subspace methods.
Theorem 3.2.7 Let A ∈ Rn×n be nonsingular. If Ak r 0 ∈ Kk ( A, r 0 ) for some k, then there exists a
polynomial \pi_{k-1} of degree less than or equal to k − 1 such that A^{-1} r_0 = \pi_{k-1}(A) r_0.
Proof: Let k be the smallest integer such that Ak r 0 ∈ Kk ( A, r 0 ). Since A j r 0 < K j ( A, r 0 ) for all
j = 1, . . . , k − 1, the vectors r 0, Ar 0, . . . , Ak−1r 0 are linearly independent. The nonsingularity of
A implies that Ar 0, A2r 0, . . . , Ak r 0 are linearly independent. By assumption r 0, Ar 0, A2r 0, . . . , Ak r 0
are linearly dependent. Therefore there exist scalars λ j such that
\[
r_0 = \sum_{j=1}^{k} \lambda_j A^j r_0 = A \sum_{j=0}^{k-1} \lambda_{j+1} A^j r_0 .
\]
The previous theorem implies that Krylov subspace approximation algorithms terminate after
k iterations if Ak r 0 ∈ Kk ( A, r 0 ). In fact, Ax = b is equivalent to A x̃ = r 0 = b − Ax 0 and if
Ak r 0 ∈ Kk ( A, r 0 ), then
x̃ = A−1r 0 = π k−1 ( A)r 0 ∈ Kk ( A, r 0 ).
\[
\tilde v_{j+1} = A v_j - \sum_{i=1}^{j} \langle A v_j, v_i \rangle v_i, \qquad v_{j+1} = \tilde v_{j+1} / \| \tilde v_{j+1} \|,
\]
to compute orthonormal bases \{ v_1, \ldots, v_{j+1} \} of the Krylov subspaces
\mathcal{K}_{j+1}(A, r_0). For numerical reasons it is better to use the modified Gram-Schmidt method. This
leads to the so-called Arnoldi Iteration.
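Since the listing of Algorithm 3.3.2 is referenced repeatedly below, the following Matlab sketch may help; it is not part of the original notes and follows the standard modified Gram-Schmidt formulation of the Arnoldi iteration, so variable names only loosely mirror the algorithm's steps.

% Not from the original notes: Arnoldi iteration with modified Gram-Schmidt.
% Returns V = (v_1,...,v_{m+1}) and the (m+1) x m matrix Hbar.
function [V, H] = arnoldi_sketch(A, r0, m)
  n = length(r0);
  V = zeros(n, m+1);  H = zeros(m+1, m);
  V(:,1) = r0 / norm(r0);
  for j = 1:m
    w = A*V(:,j);
    for i = 1:j                     % modified Gram-Schmidt orthogonalization
      H(i,j) = V(:,i)'*w;
      w      = w - H(i,j)*V(:,i);
    end
    H(j+1,j) = norm(w);
    if H(j+1,j) == 0, break; end    % A^j r0 lies in K_j(A,r0) ("lucky breakdown")
    V(:,j+1) = w / H(j+1,j);
  end
end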
Theorem 3.3.3 If Algorithm 3.3.2 does not stop in step j, then for all k ≤ j + 1
Proof: The fact that {v1, . . . , vk } is an orthonormal basis of Kk ( A, r 0 ) follows from the properties
of the Gram-Schmidt orthogonalization and Theorem 3.3.1.
The orthogonality of the vectors v1, . . . , vk and steps (b)-(e) of Algorithm 3.3.2 implies
\[
A v_\ell = \sum_{i=1}^{\ell+1} v_i h_{i\ell},
\]
which is the `th column in the identity AVk = Vk+1 H̄k . If we multiply both sides in this identity by
VkT and use the orthogonality of the vectors v1, · · · , vk+1 , we obtain VkT AVk = Hk .
If Algorithm 3.3.2 stops at step j < m, then h_{j+1,j} = 0, i.e., A v_j = \sum_{i=1}^{j} v_i h_{ij}. Theorem 3.3.1
implies A^j r_0 ∈ \mathcal{K}_j(A, r_0).
Corollary 3.3.5 Let A ∈ R^{n×n} be symmetric. If Algorithm 3.3.4 does not stop in step j, then for all
k ≤ j + 1
\[
A V_k = V_k T_k + \beta_{k+1} v_{k+1} e_k^T ,
\]
where V_k = (v_1, \ldots, v_k) ∈ R^{n×k}, T_k ∈ R^{k×k} is the matrix given in (3.23) and e_k is the kth unit vector
in R^k.
If Algorithm 3.3.4 stops in step (d) of iteration j < m, then A j r 0 ∈ K j ( A, r 0 ).
Note that the work per iteration in the Lanczos Iteration is constant, whereas the work in the
Arnoldi Iteration grows linearly with the iteration count j.
GMRES, a generalized minimal residual algorithm for solving nonsymmetric linear systems, was developed by
Saad and Schultz [SS86].
It computes minimum residual approximations using the Krylov subspace Kk ( A, r 0 ). The kth
step computes the solution x k of
min k Ax − bk,
x∈x 0 +Kk ( A,r 0 )
where r 0 = b− Ax 0 . The Arnoldi Iteration 3.3.2 is used to generate an orthonormal basis {v1, · · · , vk }
of \mathcal{K}_k(A, r_0). By Theorem 3.3.3 and from v_1 = r_0/\|r_0\| we have
\[
b - A(x_0 + V_k y) = r_0 - A V_k y = V_{k+1} \bigl( \| r_0 \| e_1 - \bar H_k y \bigr),
\]
where e_1 is the first unit vector in R^{k+1}. Since x ∈ x_0 + \mathcal{K}_k(A, r_0) if and only if x = x_0 + V_k y for some
y ∈ R^k, and since V_{k+1} has orthonormal columns, the problem \min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \| A x - b \| is equivalent to
\[
\min_{y \in \mathbb{R}^k} \bigl\| \bar H_k y - \| r_0 \| e_1 \bigr\| .
\]
If in step (2d) h k+1,k = 0, then Ak r 0 ∈ Kk ( A, r 0 ), cf. Theorem 3.3.3, and by Theorem 3.2.7,
x 0 + Vk y k solves Ax = b. This is sometimes called lucky breakdown.
GMRES is terminated if the residual or the relative residual is smaller than some tolerance ε > 0, i.e., if
\[
\| r_k \| < \epsilon \qquad \text{or} \qquad \| r_k \| < \epsilon \| b \| .
\]
Since
\[
A x_k = b - (b - A x_k) = b - r_k,
\]
the perturbation theory for the solution of linear systems implies the following estimates for the
error x^∗ − x_k:
\[
\| x^* - x_k \| \le \| A^{-1} \| \, \| r_k \| \le \| A^{-1} \| \, \epsilon
\]
if GMRES(m) stops with ‖r_k‖ < ε, and
\[
\frac{\| x^* - x_k \|}{\| x^* \|} \le \| A \| \, \| A^{-1} \| \, \frac{\| r_k \|}{\| b \|} \le \| A \| \, \| A^{-1} \| \, \epsilon
\]
if GMRES(m) stops with ‖r_k‖ < ε‖b‖. Note that ‖A‖ ‖A^{-1}‖ is the condition number of A.
The question when to restart is difficult in general. See [Emb03] for some interesting examples.
\[
A = \begin{pmatrix}
0 & \cdots & \cdots & 0 & 1 \\
1 & 0 & & & 0 \\
0 & 1 & 0 & & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & 0
\end{pmatrix} \in \mathbb{R}^{n \times n},
\qquad
b = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \in \mathbb{R}^n,
\qquad
x_0 = 0 \in \mathbb{R}^n .
\]
The Arnoldi Iteration applied to this example yields
\[
\bar H_k = \begin{pmatrix}
0 & \cdots & \cdots & 0 \\
1 & 0 & & \\
0 & 1 & 0 & \\
\vdots & & \ddots & \ddots \\
0 & \cdots & 0 & 1
\end{pmatrix} \in \mathbb{R}^{(k+1) \times k} .
\]
From this it can be seen that the GMRES iterates are given by x_k = x_0 for k = 1, \ldots, n − 1, and
x_n = x^∗. Thus, the residuals are not reduced until the last iterate.
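The stagnation can be reproduced with Matlab's built-in gmres. The lines below are not part of the original notes.

% Not from the original notes: full (unrestarted) GMRES for the cyclic shift
% example; the residual norm does not decrease until the very last step.
n  = 20;
A  = [zeros(1,n-1) 1; eye(n-1) zeros(n-1,1)];
b  = [1; zeros(n-1,1)];
[x,flag,relres,iter,resvec] = gmres(A, b, [], 1e-12, n);
disp(resvec')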
\[
G_i = \begin{pmatrix}
1 & & & & \\
& \ddots & & & \\
& & 1 & & \\
& & & c_i & -s_i \\
& & & s_i & c_i
\end{pmatrix} \in \mathbb{R}^{i \times i},
\]
where
\[
\hat z^{(k)} = \bigl( z_1^{(k)}, \ldots, z_k^{(k)} \bigr)^T \in \mathbb{R}^k .
\]
Therefore the residual is given by
\[
\| r_k \| = \bigl| z_{k+1}^{(k)} \bigr| . \qquad (3.27)
\]
Due to the special structure of the Givens rotations G_k, the vectors z^{(k)} obey
\[
z_\ell^{(k)} = z_\ell^{(k-1)}, \quad 1 \le \ell \le k-1, \qquad
z_k^{(k)} = c_{k+1} z_k^{(k-1)}, \qquad
z_{k+1}^{(k)} = s_{k+1} z_k^{(k-1)} .
\]
hAx k − b, vi = 0 ∀ v ∈ Kk ( A, r 0 ). (3.29)
where e_1 is the first unit vector in R^k. If we set x_k = x_0 + V_k y_k and v = V_k ν in (3.29) and use the
previous identities, then (3.29) is equivalent to
\[
T_k y_k = \| r_0 \| e_1 . \qquad (3.30)
\]
Since A ∈ R^{n×n} is symmetric positive definite, the tridiagonal matrix T_k ∈ R^{k×k} is symmetric
positive definite. We can use the LDL^T decomposition of T_k to solve (3.30).
There exist matrices
\[
L_k = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
\ell_2 & 1 & & \vdots \\
& \ddots & \ddots & 0 \\
0 & \cdots & \ell_k & 1
\end{pmatrix},
\qquad
D_k = \begin{pmatrix}
d_1 & 0 & \cdots & 0 \\
0 & d_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & d_k
\end{pmatrix} \qquad (3.31)
\]
such that
\[
T_k = L_k D_k L_k^T . \qquad (3.32)
\]
If we insert (3.23) and (3.31) into (3.32) and compare the matrix entries on both sides of the identity,
we see that the entries in (3.31) are given by
\[
d_1 = \alpha_1, \qquad
\ell_i = \beta_i / d_{i-1}, \quad d_i = \alpha_i - \ell_i^2 d_{i-1} = \alpha_i - \beta_i \ell_i, \quad i = 2, \ldots, k.
\]
Thus, given L_{k-1}, D_{k-1}, we only need to compute \ell_k = \beta_k / d_{k-1}, d_k = \alpha_k - \beta_k \ell_k
in order to obtain L_k, D_k.
In a naive implementation, we would compute y k by solving (3.30) and the set x k = x 0 + Vk y k .
This would require us to store all columns of Vk . Consequently, the storage requirements would
increase linearly with the iteration count. Fortunately, the fact that Tk is tridiagonal can be used
to limit the storage to only four vectors of length n independent of the number of iterations k
performed. To accomplish this, we define W k ∈ Rn×k and z k ∈ R k by
\[
W_k = V_k L_k^{-T}, \qquad z_k = L_k^T y_k . \qquad (3.33)
\]
Then
\[
x_k = x_0 + V_k y_k = x_0 + V_k L_k^{-T} L_k^T y_k = x_0 + W_k z_k .
\]
If we let W_k = (w_1, w_2, \ldots, w_k) and insert this into (3.33) (written as W_k L_k^T = V_k), then we find
that
\[
(w_1, \; \ell_2 w_1 + w_2, \; \ldots, \; \ell_k w_{k-1} + w_k) = (v_1, v_2, \ldots, v_k).
\]
Thus,
\[
w_1 = v_1, \quad w_2 = v_2 - \ell_2 w_1, \quad \ldots, \quad w_k = v_k - \ell_k w_{k-1} . \qquad (3.34)
\]
If we set z k = (ζ1, ζ2, . . . , ζ k )T in (3.33), then this equation becomes
\[
\begin{pmatrix}
 & & & 0 \\
 & L_{k-1} D_{k-1} & & \vdots \\
 & & & 0 \\
0 & \cdots & \ell_k d_{k-1} & d_k
\end{pmatrix}
\begin{pmatrix} \zeta_1 \\ \zeta_2 \\ \vdots \\ \zeta_k \end{pmatrix}
= \begin{pmatrix} \| r_0 \| \\ 0 \\ \vdots \\ 0 \end{pmatrix} .
\]
Since L_{k-1} D_{k-1} z_{k-1} = \| r_0 \| e_1, it follows that
\[
z_k = \begin{pmatrix} z_{k-1} \\ \zeta_k \end{pmatrix}, \qquad
\zeta_k = -\ell_k d_{k-1} \zeta_{k-1} / d_k , \qquad (3.35)
\]
with
\[
\zeta_1 = \| r_0 \| / d_1 = \| r_0 \| / \alpha_1 .
\]
Hence
x k = x 0 + W k z k = x 0 + W k−1 z k−1 + ζ k w k = x k−1 + ζ k w k . (3.36)
This enables us to make the transition from (vk−1, w k−1, x k−1 ) to (vk , w k , x k ) with a minimal amount
of work and storage.
Using the identity A V_k = V_k T_k + \beta_{k+1} v_{k+1} e_k^T established in Corollary 3.3.5 we obtain the
following formula for the residual,
\[
\| r_k \| = \| A x_k - b \| = \beta_{k+1} \bigl| y_k^{(k)} \bigr| .
\]
The vector y_k is the solution of (3.30) and y_k^{(j)} denotes the j-th component of y_k = L_k^{-T} z_k. Using
(3.31) we conclude that the last component of y_k is equal to the last component ζ_k of z_k. Hence
\[
\| r_k \| = \| A x_k - b \| = \beta_{k+1} | \zeta_k | .
\]
Algorithm 3.5.1
(0) Given A ∈ Rn×n symmetric positive definite, x 0, b ∈ Rn , and > 0.
(1) Compute r 0 = b − Ax 0 .
Set v̂1 = r 0
β1 = kr 0 k, ζ0 = 1,
v0 = 0, k = 0.
(2) While kr k k = | β k+1 ζ k | >
k = k + 1,
If β k , 0, then vk = v̂k / β k ;
Else vk = v̂k (= 0).
Endif
v̂k+1 = Avk − β k vk−1
α k = hv̂k+1, vk i ,
v̂k+1 = v̂k+1 − α k vk
β k+1 = k v̂k+1 k
If k = 1, then
d1 = α1 ,
w1 = v1 ,
ζ1 = β1 /α1 ,
x 1 = x 0 + ζ1 v1 ,
Else
` k = β k /d k−1 ,
d k = α k − βk ` k ,
w k = vk − ` k w k−1 ,
ζ k = −` k d k−1 ζ k−1 /d k ,
x k = x k−1 + ζ k w k .
Endif
End
To implement Algorithm 3.5.1 we need one array to hold the x k , one array to hold the w k and
two arrays to hold vk+1, vk (vk−1 can be overwritten by v̂k+1 ).
Algorithm 3.5.1 is equivalent to the conjugate gradient method (both algorithms generate the
same iterates). The conjugate gradient method will be discussed in Section 3.7 below.
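For reference, the following Matlab function is a direct transcription of the listing of Algorithm 3.5.1 above; it is not part of the original notes and adds a maximum-iteration safeguard that the listing does not have.

% Not from the original notes: Matlab transcription of Algorithm 3.5.1
% (A symmetric positive definite).
function x = lanczos_spd_solve(A, x0, b, eps_tol, maxit)
  x    = x0;
  r0   = b - A*x0;
  vhat = r0;                  % \hat v_1
  beta = norm(r0);            % \beta_1
  zeta = 1;  v = zeros(size(b));  vold = zeros(size(b));
  d = 0;  w = zeros(size(b));  k = 0;
  while abs(beta*zeta) > eps_tol && k < maxit   % ||r_k|| = |beta_{k+1} zeta_k|
    k = k + 1;
    if beta ~= 0, v = vhat/beta; else, v = vhat; end
    vhat  = A*v - beta*vold;
    alpha = vhat'*v;
    vhat  = vhat - alpha*v;
    betanew = norm(vhat);
    if k == 1
      d = alpha;  w = v;  zeta = beta/alpha;  x = x + zeta*v;
    else
      ell  = beta/d;
      dnew = alpha - beta*ell;
      w    = v - ell*w;
      zeta = -ell*d*zeta/dnew;
      d    = dnew;
      x    = x + zeta*w;
    end
    vold = v;  beta = betanew;
  end
end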
3.5.2. SYMMLQ
If we want to extend the approach in Section 3.5.1 to matrices A ∈ Rn×n which are symmetric but
not necessarily positive definite, we encounter two difficulties. First, the Galerkin approximation
problem (3.29) may not have a solution. See Example 3.2.5. Using the Lanczos Iteration, Algorithm
3.3.4, we can transform (3.29) into (3.30) and (3.29) has a solution if and only if (3.30) has a solution.
The second difficulty is that if A ∈ Rn×n is not positive definite, then Tk may not be positive definite
and the LDLT decomposition of Tk cannot be used.
Paige and Saunders [PS75] developed an algorithm SYMMLQ that overcomes these difficulties.
Instead of using the LDL^T decomposition of T_k, they use an LQ decomposition, that is, they generate
a lower triangular matrix
\[
\bar L_k = \begin{pmatrix}
d_1 & & & & \\
e_2 & d_2 & & & \\
f_3 & e_3 & d_3 & & \\
& \ddots & \ddots & \ddots & \\
& & f_k & e_k & \bar d_k
\end{pmatrix}
\]
and an orthogonal matrix Q_k such that
\[
T_k = \bar L_k Q_k .
\]
The system (3.30) can be solved using the LQ decomposition (assuming a solution exists). Paige
and Saunders [PS75] suggest a modification x kL of the Galerkin approximation that always exists
(even if (3.29), (3.30) do not have a solution). Moreover, they show that the errors x kL − x ∗ are
nonincreasing, i.e., that
\[
\| x^* - x_k^L \| \le \| x^* - x_{k-1}^L \| .
\]
Just like Algorithm 3.5.1, SYMMLQ requires a small, fixed amount of storage. The algorithm
is listed below. For details we refer to [PS75].
3.5.3. MINRES
If A ∈ Rn×n is symmetric but not necessarily positive definite, we can compute an approximation to
the solution x^∗ of Ax = b using the minimum residual approach, that is, we compute approximations
x_k by solving
\[
\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \tfrac12 \| A x - b \| .
\]
The basic idea is the same as the one presented in Section 3.4.1. However, since A is symmetric, we
use the Lanczos iteration. Algorithm 3.3.4 generates orthogonal matrices Vk , Vk+1 and a tridiagonal
Tk such that
\[
A V_k = V_{k+1} \begin{pmatrix} T_k \\ \beta_{k+1} e_k^T \end{pmatrix},
\qquad
r_0 = \| r_0 \| V_{k+1} e_1,
\]
where e_1 is the first unit vector in R^{k+1}. See Corollary 3.3.5. The problem \min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \| A x - b \|
is equivalent to
\[
\min_{y \in \mathbb{R}^k} \| A V_k y - r_0 \|
= \min_{y \in \mathbb{R}^k} \left\| \begin{pmatrix} T_k \\ \beta_{k+1} e_k^T \end{pmatrix} y - \| r_0 \| e_1 \right\| .
\]
The structure of these small (k + 1) × k least squares systems can be used to update the solution
x k = x 0 + Vk y k without storing all columns of Vk . The details are given in [PS75]. The resulting
algorithm is known as MINRES. For symmetric matrices, MINRES is mathematically equivalent
to GMRES (without restart), but unlike GMRES the MINRES implementation requires a small,
fixed amount of storage. The algorithm is listed below. For details we refer to [PS75].
A simple minimization algorithm is the gradient method. The gradient of the quadratic function
Q is given by
∇Q(x) = Ax − b.
In Problem 2.8 we have already studied the steepest descent method with constant step size,
x k+1 = x k − α∇Q(x k ),
r_k = −∇Q(x_k) = b − Ax_k ≠ 0.
Theorem 3.6.2 Let A be symmetric positive definite. The iterates generated by the Gradient Method
3.6.1 obey
\[
\| x^* - x_{k+1} \|_A^2
= \left( 1 - \frac{(r_k^T r_k)^2}{(r_k^T A r_k)(r_k^T A^{-1} r_k)} \right) \| x^* - x_k \|_A^2 . \qquad (3.39)
\]
If λ_min, λ_max are the smallest and the largest eigenvalues of A, respectively, then the iterates
generated by the Gradient Method 3.6.1 obey
\[
\| x^* - x_{k+1} \|_A^2 \le \left( \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} \right)^{2} \| x^* - x_k \|_A^2 . \qquad (3.40)
\]
The proof of the second part of the previous theorem uses the Kantorovich inequality, stated in the
following lemma. We leave the proof of Theorem 3.6.2 and of the following lemma as an exercise.
Lemma 3.6.3 (Kantorovich Inequality) Let A be symmetric positive definite. If λ_min, λ_max are
the smallest and the largest eigenvalue of A, respectively, then
\[
\frac{(x^T x)^2}{(x^T A x)(x^T A^{-1} x)} \ge \frac{4 \lambda_{\min} \lambda_{\max}}{(\lambda_{\min} + \lambda_{\max})^2} .
\]
It is not difficult to show that the successive residuals generated by the gradient method are
orthogonal, i.e.,
\[
\langle r_k, r_{k+1} \rangle = 0 .
\]
This leads to a convergence behavior of the gradient method known as zig-zagging. It is illustrated
in Figure 3.2, where we have plotted the contours of Q and the gradient iterates for
\[
A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix},
\qquad
b = \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
x_0 = \begin{pmatrix} 5 \\ 2 \end{pmatrix}.
\]
The solution is x ∗ = (1, 1)T . The plot on the left in Figure 3.2 shows the first iterations and the plot
on the right zooms into the region around the solution.
Figure 3.2: Typical Convergence Behavior of the Gradient Method. The right picture is a zoom of
the picture on the left around the minimizer (1, 1)T .
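The zig-zagging seen in Figure 3.2 can be reproduced with a few lines of code. The following sketch (an illustration, not a transcription of Algorithm 3.6.1, whose listing is not reproduced here) uses the exact steepest descent step size $\alpha_k = \langle r_k, r_k\rangle/\langle Ar_k, r_k\rangle$ on the 2x2 example above; successive residuals are orthogonal, which produces the characteristic pattern.

% Steepest descent with exact line search on the 2x2 example of Figure 3.2.
A = [2 -1; -1 2];  b = [1; 1];  x = [5; 2];
X = x;                                   % store iterates for plotting
for k = 1:25
    r = b - A*x;                         % r_k = -grad Q(x_k)
    alpha = (r'*r) / (r'*A*r);           % exact (steepest descent) step size
    x = x + alpha*r;
    X = [X, x];                          %#ok<AGROW>
end
plot(X(1,:), X(2,:), 'o-');              % successive steps zig-zag toward (1,1)
xlabel('x_1'); ylabel('x_2');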
$$x_{k+1} = x_k + \alpha_k r_k = x_k + (\alpha_k^{-1} I)^{-1} r_k.$$
The step size $\alpha_k$ is chosen so that
$$(\alpha_k^{-1} I)^{-1} r_k \approx A^{-1} r_k. \tag{3.41}$$
Of course the right hand side in (3.41) essentially requires the solution of the original problem,
which is not feasible. Therefore, we replace (3.41) by a condition that only involves information
that is easily available. Given the previous and current iterate x k−1, x k , and the corresponding
gradients −r k−1 = ∇Q(x k−1 ) = Ax k−1 − b , −r k = ∇Q(x k ) = Ax k − b define
∆x = x k − x k−1, ∆r = −r k + r k−1 .
The first choice requires $\alpha_k \Delta r \approx \Delta x$ in the least squares sense, i.e.,
$$\alpha_k^{(1)} = \frac{\langle \Delta x, \Delta r\rangle}{\|\Delta r\|^2}. \tag{3.44}$$
The second choice requires $\alpha_k^{-1}\Delta x \approx \Delta r$ in the least squares sense, i.e.,
$$\alpha_k^{(2)} = \frac{\|\Delta x\|^2}{\langle \Delta x, \Delta r\rangle}. \tag{3.47}$$
In the initial iteration k = 0 where x k−1 and r k−1 are not available the steepest descent step size is
used. This leads to the following algorithm.
Convergence results for Algorithm 3.6.4 are given by Raydan [Ray93] and by Dai and Liao
[DL02]. See also Fletcher’s paper [Fle05] for an overview.
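A minimal sketch (not from the notes, and not a transcription of Algorithm 3.6.4) of the gradient method with the Barzilai–Borwein step size (3.44); as described above, the steepest descent step is used in the first iteration.

% Gradient method with Barzilai-Borwein step size alpha^(1) of (3.44).
function x = bb_gradient(A, b, x, tol, maxit)
    r = b - A*x;
    alpha = (r'*r) / (r'*A*r);           % steepest descent step for k = 0
    for k = 0:maxit-1
        if norm(r) < tol, break; end
        xold = x;  rold = r;
        x = x + alpha*r;
        r = b - A*x;
        dx = x - xold;                   % Delta x
        dr = -r + rold;                  % Delta r = -r_{k+1} + r_k
        alpha = (dx'*dr) / (dr'*dr);     % alpha^(1) from (3.44)
    end
end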
The CG method is equivalent to Algorithm 3.5.1 (both algorithms generate the same iterates) and it
can be derived from Algorithm 3.5.1. For a discussion of the relation between Algorithm 3.5.1 and
the conjugate gradient method derived below see, e.g., the books by Golub and van Loan [GL96,
Secs. 9.3, 10.2] or by Saad [Saa03, Sec 6.7]. In this section we derive the CG method without using
the results from the previous sections.
Our goal is the minimization of the quadratic function
$$Q(x) = \tfrac12 \langle x, Ax\rangle - \langle x, b\rangle. \tag{3.48}$$
The necessary and sufficient optimality conditions are stated in the following theorem, which is
just an application of Theorem 3.2.1.
Theorem 3.7.1 Let A be symmetric and positive definite on span{p0, . . . , pk−1, r k }, i.e., let
$$\langle Av, v\rangle > 0 \quad \forall v \in \operatorname{span}\{p_0, \ldots, p_{k-1}, r_k\},\; v \ne 0.$$
The vector x k+1 ∈ x 0 + span{p0, . . . , pk−1, r k } solves (3.49) if and only if
hAx k+1 − b, vi = 0 ∀v ∈ span{p0, . . . , pk−1, r k }. (3.50)
Now, let us discuss how the search direction pk and the step size α k are computed. For k = 0
we have p0 = r 0 and x 1 = x 0 + α0r 0 , where α0 ∈ R is computed so that (3.50) is satisfied, i.e., so
that
$$\langle A(\alpha_0 r_0) - r_0, r_0\rangle = 0.$$
This gives
$$\alpha_0 = \frac{\|r_0\|^2}{\langle Ar_0, r_0\rangle}.$$
To see how the search direction pk and the step size α k are computed in iteration k > 0, let us
assume we have already computed the solution x k of (3.49) with k replaced by k − 1. We write
x_{k+1} = x_k + \alpha_k p_k. From (3.50) we obtain $\langle A(x_k + \alpha_k p_k) - b, p_i\rangle = 0$ for $i = 0, \ldots, k-1$.
Since $x_k$ solves (3.50) with $k$ replaced by $k-1$ and since $p_{k-1} \in \operatorname{span}\{p_0, \ldots, p_{k-2}, r_{k-1}\}$, we find
$$\langle Ax_k - b, p_i\rangle = 0 \quad \text{for } i = 0, \ldots, k-1,$$
and therefore
$$\alpha_k \langle Ap_k, p_i\rangle = 0 \quad \text{for } i = 0, \ldots, k-1.$$
Lemma 3.7.2 Let $A \in \mathbb{R}^{n\times n}$ be a symmetric positive definite matrix and let $x_k$ satisfy (3.50) with
$k$ replaced by $k-1$. The vector $x_{k+1} = x_k + \alpha_k p_k$, $\alpha_k \ne 0$, satisfies (3.50) if and only if
$$\langle Ap_k, p_i\rangle = 0, \quad i = 0, \ldots, k-1.$$
To continue our discussion of the computation of search direction pk and step size α k , let us
assume that {p0, . . . , pk−1 } is an A–orthogonal basis of span{p0, . . . , pk−2, r k−1 }. We will see in a mo-
ment how this can be accomplished. Lemma 3.7.2 shows that if $x_{k+1} = x_k + \alpha_k p_k$, with $\alpha_k \ne 0$, satisfies (3.50), then $p_0, \ldots, p_{k-1}, p_k$ are A–orthogonal. Since $p_0, \ldots, p_{k-1}, p_k \in \operatorname{span}\{p_0, \ldots, p_{k-1}, r_k\}$,
the vectors p0, . . . , pk−1, pk form an A–orthogonal basis of span{p0, . . . , pk−1, r k }.
Our next task is to compute pk so that p0, . . . , pk−1, pk is an A–orthogonal basis of
span{p0, . . . , pk−1, r k }. This can be accomplished using the Gram-Schmidt process applied with the
scalar product $\langle Ax, y\rangle$ instead of $\langle x, y\rangle$. Let $p_0, \ldots, p_{k-1}$ be A–orthogonal and satisfy $\langle p_i, Ap_i\rangle \ne 0$; then
$$p_k = r_k - \sum_{i=0}^{k-1} \frac{\langle r_k, Ap_i\rangle}{\langle p_i, Ap_i\rangle}\, p_i \tag{3.51}$$
satisfies
hpi, Apk i = 0, i = 0, . . . , k − 1
and {p0, . . . , pk−1, pk } is an A–orthogonal basis of span{p0, . . . , pk−1, r k }. Moreover, pk = 0 if and
only if r k ∈ span{p0, . . . , pk−1 }.
We obtain the following result.
Lemma 3.7.3 Let A ∈ Rn×n be symmetric positive definite. If p0, . . . , pk−1 are A–orthogonal and
satisfy $\langle p_j, Ap_j\rangle \ne 0$, $j = 0, \ldots, k-1$, and if $p_k$ is given by (3.51), then
i. span{p0, . . . , pk−1, pk } = span{p0, . . . , pk−1, r k },
ii. hpk , Apk i = 0 if and only if r k = 0.
Proof: i. The first statement is a consequence of the Gram-Schmidt method.
ii. If r k = 0, then pk = 0 by definition (3.51) of pk and hpk , Apk i = 0. On the other hand,
if hpk , Apk i = 0, then the symmetric positive definiteness of A implies pk = 0. Thus, by part i,
r k ∈ span{p0, . . . , pk−1 }. Theorem 3.7.1 implies
hb − Ax k , p j i = hr k , p j i = 0, j = 0, . . . , k − 1.
The conditions r k ∈ span{p0, . . . , pk−1 } and hr k , p j i = 0, j = 0, . . . , k − 1, imply r k = 0.
Equation (3.51) shows how to compute the search direction pk in step k. Given pk , we have to
calculate α k such that x k+1 = x k + α k pk satisfies
hAx k+1 − b, pi i = hAx k − b, pi i + α k hApk , pi i = 0 for i = 0, . . . , k.
Since x k satisfies (3.50) and since the p j ’s are A–orthogonal,
hAx k − b, pi i = 0, hApk , pi i = 0 for i = 0, . . . , k − 1.
Thus, α k must be chosen so that
hAx k − b, pk i + α k hApk , pk i = 0,
i.e.,
$$\alpha_k = -\frac{\langle Ax_k - b, p_k\rangle}{\langle Ap_k, p_k\rangle} = \frac{\langle r_k, p_k\rangle}{\langle Ap_k, p_k\rangle}.$$
Under the assumptions of Lemma 3.7.3 ii. the step size is well defined as long as $r_k \ne 0$.
This leads to the following algorithm.
(c) $\alpha_k = \langle r_k, p_k\rangle / \langle Ap_k, p_k\rangle$.
(d) $x_{k+1} = x_k + \alpha_k p_k$.
(e) $r_{k+1} = r_k - \alpha_k Ap_k$.
Kk+1 ( A, r 0 ) = span{r 0, Ar 0, . . . , Ak r 0 }.
Theorem 3.7.5 Let A ∈ Rn×n be symmetric positive definite. If x 0, . . . , x k and p0, . . . , pk are the
vectors generated by Algorithm 3.7.4, then the following assertions are true:
i. $\operatorname{span}\{p_0, \ldots, p_k\} = \operatorname{span}\{r_0, \ldots, r_k\} = \mathcal{K}_{k+1}(A, r_0)$,
ii. $\langle r_k, Ap_j\rangle = 0$ for $j = 0, \ldots, k-2$,
iii. hr k , r j i = 0, hr k , p j i = 0 for j = 0, . . . , k − 1.
Moreover, since p0, . . . , pk−1, r 0, . . . , r k−1 ⊂ Kk ( A, r 0 ) by the induction hypothesis, we find that
r k = r k−1 − α k−1 Apk−1 ∈ Kk+1 ( A, r 0 ), and
To prove $\mathcal{K}_{k+1}(A, r_0) \subset \operatorname{span}\{p_0, \ldots, p_{k-1}, r_k\}$, note that
$$Ap_{k-1} = \frac{1}{\alpha_{k-1}}(r_{k-1} - r_k) \in \operatorname{span}\{r_0, \ldots, r_{k-1}, r_k\} = \operatorname{span}\{p_0, \ldots, p_{k-1}, p_k\}.$$
Thus,
$$A^k r_0 = \sum_{i=0}^{k-1} \gamma_i A p_i \in \operatorname{span}\{p_0, \ldots, p_{k-1}, p_k\}.$$
From Theorem 3.7.5 i. we now see that (3.49) is equivalent to the problem
$$\min_{x \in x_0 + \mathcal{K}_{k+1}(A, r_0)} \tfrac12\langle x, Ax\rangle - \langle x, b\rangle. \tag{3.52}$$
This shows that the Conjugate Gradient Method is equivalent to the Lanczos method derived in
Section 3.5.1.
Due to Theorem 3.7.5 step 2b in Algorithm 3.7.4 reduces to
pk = r k + β k−1 pk−1,
where
$$\beta_{k-1} = -\frac{\langle r_k, Ap_{k-1}\rangle}{\langle p_{k-1}, Ap_{k-1}\rangle}. \tag{3.53}$$
Two other simplifications are possible. First, using
$$p_k = r_k - \frac{\langle r_k, Ap_{k-1}\rangle}{\langle p_{k-1}, Ap_{k-1}\rangle}\, p_{k-1}$$
and $\langle r_k, p_{k-1}\rangle = 0$ (Theorem 3.7.5 iii.), we find
$$\langle r_k, p_k\rangle = \|r_k\|^2 - \frac{\langle r_k, Ap_{k-1}\rangle}{\langle p_{k-1}, Ap_{k-1}\rangle}\langle r_k, p_{k-1}\rangle = \|r_k\|^2. \tag{3.54}$$
Thus,
$$\alpha_k = \frac{\langle r_k, p_k\rangle}{\langle Ap_k, p_k\rangle} = \frac{\|r_k\|^2}{\langle Ap_k, p_k\rangle}. \tag{3.55}$$
Moreover, taking the scalar product between $p_{k+1} = r_{k+1} - \big(\langle r_{k+1}, Ap_k\rangle/\langle p_k, Ap_k\rangle\big)\, p_k$ and $r_k$ yields
$$-\frac{\langle r_{k+1}, Ap_k\rangle}{\langle p_k, Ap_k\rangle}\langle r_k, p_k\rangle = -\langle r_{k+1}, r_k\rangle + \langle r_k, p_{k+1}\rangle.$$
Now, using Theorem 3.7.5, the A–orthogonality of the $p_j$'s and (3.54) we find
$$-\frac{\langle r_{k+1}, Ap_k\rangle}{\langle p_k, Ap_k\rangle}\langle r_k, p_k\rangle
= -\langle r_{k+1}, r_k\rangle + \langle r_k, p_{k+1}\rangle
= \langle r_k, p_{k+1}\rangle
= \Big\langle r_k - \frac{\langle r_k, p_k\rangle}{\langle p_k, Ap_k\rangle}\, A p_k,\; p_{k+1}\Big\rangle
= \langle r_{k+1}, p_{k+1}\rangle
= \|r_{k+1}\|^2.$$
Using (3.54) again, we see that
$$\beta_k = -\frac{\langle r_{k+1}, Ap_k\rangle}{\langle p_k, Ap_k\rangle} = \frac{\|r_{k+1}\|^2}{\|r_k\|^2}. \tag{3.56}$$
This gives the following final version of the conjugate gradient method.
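The listing of Algorithm 3.7.6 is not reproduced here; as a hedged companion, the following minimal sketch implements the conjugate gradient iteration exactly as derived above, using the simplified updates (3.55) and (3.56).

% Conjugate gradient sketch using the simplified updates (3.55) and (3.56).
function x = cg_sketch(A, b, x, tol, maxit)
    r = b - A*x;
    p = r;
    rho = r'*r;
    for k = 0:maxit-1
        if sqrt(rho) < tol, break; end      % stopping test on ||r_k||
        Ap = A*p;
        alpha = rho / (p'*Ap);              % alpha_k, see (3.55)
        x = x + alpha*p;
        r = r - alpha*Ap;
        rho_new = r'*r;
        beta = rho_new / rho;               % beta_k, see (3.56)
        p = r + beta*p;
        rho = rho_new;
    end
end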
We will comment on the stopping criteria in Step 2a of the Conjugate Gradient Method in
Section 3.7.2.
The following result on the monotonicity of the Conjugate Gradient iterates is important for
some optimization applications.
Theorem 3.7.7 Let $A \in \mathbb{R}^{n\times n}$ be symmetric positive definite. The iterates generated by the Conjugate Gradient Algorithm 3.7.6 started with $x_0 = 0$ obey the monotonicity property
$$0 < \|x_1\| < \|x_2\| < \cdots.$$
Next we comment on what happens when the Conjugate Gradient Algorithm 3.7.6 is applied to
a problem in which A is symmetric, but not necessarily positive definite.
Remark 3.7.8 (CG for positive semidefinite systems) i. We have derived the conjugate gradient
algorithm for symmetric positive definite systems. However, the conjugate gradient algorithm can
still be used if A ∈ Rn×n is symmetric positive semidefinite and b ∈ R ( A).
Since A is symmetric, the Fundamental Theorem of Linear Algebra implies R ( A) = N ( A) ⊥ .
By induction we can show that all directions $p_k \in \mathcal{R}(A) = \mathcal{N}(A)^\perp$. Since A is symmetric positive
semidefinite, $\langle v, Av\rangle \ge \lambda^+_{\min}\|v\|^2$ for all $v \in \mathcal{N}(A)^\perp$, where $\lambda^+_{\min}$ is the smallest strictly positive
eigenvalue of A. Thus, $\alpha_k$ in step (2b) is well defined. Since $p_k \in \mathcal{N}(A)^\perp$ for all k, it follows that
the iterates x k of the CG algorithm obey
x k ∈ x 0 + N ( A) ⊥ .
The minimum norm solution x † of Ax = b is the solution in N ( A) ⊥ . The iterates generated by the
CG algorithm converge to PN ( A) x 0 + x † , where PN ( A) x 0 is the projection of x 0 onto N ( A). See
Section 3.8.5 for additional details.
The convergence behavior is illustrated in Figure 3.3 below.
ii. If $A \in \mathbb{R}^{n\times n}$ is symmetric positive semidefinite and $b \notin \mathcal{R}(A)$, then the minimization problem
$\min \tfrac12\langle Ax, x\rangle - \langle b, x\rangle$ does not have a solution. If the conjugate gradient method is applied in this
case, then in some iteration k the negative gradient $r_k \ne 0$, but the search direction $p_k$ satisfies (in
exact arithmetic)
$$\langle Ap_k, p_k\rangle = 0$$
and ideally the CG algorithm should be terminated. In floating point arithmetic, however, $\langle Ap_k, p_k\rangle$
will never be exactly zero. Generally, the size of $\langle Ap_k, p_k\rangle$ depends on the specific problem and it is
difficult to determine whether '$\langle Ap_k, p_k\rangle = 0$'.
Consider the matrix
$$A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}.$$
Figure 3.3: Convergence of Conjugate Gradient Method for Symmetric Positive Semidefinite
Systems Ax = b with b ∈ R ( A).
i. If $b = (-1, 2, -1)^T$,
then b ∈ R ( A) and the minimum norm solution of Ax = b is x † = (−1/3, 2/3, −1/3)T . The
Conjugate Gradient Algorithm 3.7.6 with x 0 = 0 terminates in iteration k = 1 with the minimum
norm solution.
ii. If $b = (1, 1, 1)^T$,
then $b \notin \mathcal{R}(A)$ and we are in the situation of Remark 3.7.8, part ii. Application of the Conjugate
Gradient Algorithm 3.7.6 with $x_0 = 0$ gives $p_0 = r_0 = b$, $\|r_0\| = \sqrt{3}$, and $p_0^T A p_0 = 0$.
The previous example is motivated by elliptic PDEs with Neumann boundary conditions.
To discretize this problem we use a finite difference method on a grid 0 = x 0 < x 1 < . . . <
x n+1 = 1 with equidistant points x i = ih and mesh size h = 1/(n + 1). We approximate the
derivatives using central finite differences (see Section 1.3.1), that is we discretize (3.57a) by
$$\frac{-y_{i-1} + 2y_i - y_{i+1}}{h^2} = f(x_i), \qquad i = 0, \ldots, n+1,$$
and we discretize (3.57b) by
$$\frac{y_1 - y_{-1}}{h} = 0, \qquad \frac{y_{n+2} - y_n}{h} = 0.$$
This leads to the following discretization of (3.57).
$$\frac{1}{h^2}\begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}
\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \\ y_{n+1} \end{pmatrix}
= \begin{pmatrix} \tfrac12 f(x_0) \\ f(x_1) \\ \vdots \\ f(x_n) \\ \tfrac12 f(x_{n+1}) \end{pmatrix}. \tag{3.58}$$
The matrix A in (3.58) is symmetric positive semidefinite. It is easy to check that $\mathcal{N}(A) = \operatorname{span}\{e\}$,
where $e = (1, \ldots, 1)^T \in \mathbb{R}^{n+2}$. Thus, the system (3.58) has a solution if and only if
$$\tfrac12 f(x_0) + \sum_{i=1}^{n} f(x_i) + \tfrac12 f(x_{n+1}) = 0.$$
Note that $\int_0^1 f(x)\,dx \approx h\big(\tfrac12 f(x_0) + \sum_{i=1}^{n} f(x_i) + \tfrac12 f(x_{n+1})\big)$ using the composite trapezoidal rule.
Figure 3.4: Minimum norm solution of (3.57) with f (x) = 4π 2 cos(2πx) (solid black line), solution
of (3.58) with n = 39 computed with the Conjugate Gradient Algorithm 3.7.6 and starting value
(0, . . . , 0)T (dashed red line) and with starting value (0.3, . . . , 0.3)T (dash-dotted blue line).
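The experiment of Figure 3.4 can be reproduced along the following lines. This is a hedged sketch (not from the notes): it builds the system (3.58) for $f(x) = 4\pi^2\cos(2\pi x)$ and uses MATLAB's pcg (without a preconditioner) as a stand-in for the Conjugate Gradient Algorithm 3.7.6 with starting value zero.

% Build the (n+2)x(n+2) system (3.58) for f(x) = 4*pi^2*cos(2*pi*x) and solve
% it with CG starting from x0 = 0 (cf. Figure 3.4).
n = 39;  h = 1/(n+1);  xgrid = (0:n+1)'*h;
e = ones(n+2,1);
A = spdiags([-e 2*e -e], -1:1, n+2, n+2) / h^2;
A(1,1) = 1/h^2;  A(end,end) = 1/h^2;          % Neumann rows, cf. (3.58)
f = 4*pi^2*cos(2*pi*xgrid);
rhs = f;  rhs(1) = f(1)/2;  rhs(end) = f(end)/2;
% A is singular but rhs is (up to rounding) in R(A); pcg may return a nonzero
% flag, yet the iterate approximates the minimum norm solution.
[y, flag] = pcg(A, rhs, 1e-8, 200);
plot(xgrid, y, 'r--', xgrid, cos(2*pi*xgrid), 'k-');
legend('CG with x_0 = 0', 'exact solution');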
Figure 3.5: Plot of the residuals and of hApk , pk i for the first 60 iterations for the Conjugate Gradient
Algorithm 3.7.6 applied to (3.58) with n = 39 and incompatible right hand side f (x) = cos(πx/2).
Around iteration k = 42 the quantity hApk , pk i is small, while the residual remains large.
Remark 3.7.11 (CG for indefinite systems) If A is symmetric indefinite, then typically in some
iteration k, the Conjugate Gradient Algorithm 3.7.6 generates a direction pk such that
hApk , pk i < 0.
In this case,
$$\min_{\alpha}\; \tfrac12\langle x_k + \alpha p_k, A(x_k + \alpha p_k)\rangle - \langle x_k + \alpha p_k, b\rangle$$
does not have a solution. The conjugate gradient method should be truncated if hApk , pk i < 0 (or
equivalently α k < 0) is detected.
$$A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & -1 \\ 0 & -1 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.$$
The matrix A is symmetric indefinite and $b \notin \mathcal{R}(A)$.
If we apply the Conjugate Gradient Algorithm 3.7.6 with $x_0 = 0$, then $p_0 = r_0 = b$, $\|r_0\| = \sqrt{3}$
and pT0 Ap0 = −1. The Conjugate Gradient Algorithm 3.7.6 should be terminated in step 2b of the
k = 1 iteration.
Let λ max, λ min be the largest and smallest eigenvalues of A, respectively. As we have already noted
in Section 3.4.1, if the Conjugate Gradient Algorithm 3.7.6 stops with $\|r_k\| < \varepsilon$, then
$$\|x_* - x_k\| \le \lambda_{\min}^{-1}\|r_k\| \le \lambda_{\min}^{-1}\varepsilon, \tag{3.59}$$
and if the Conjugate Gradient Algorithm 3.7.6 stops with $\|r_k\| < \varepsilon\|b\|$, then
$$\frac{\|x_* - x_k\|}{\|x_*\|} \le \frac{\lambda_{\max}}{\lambda_{\min}}\,\frac{\|r_k\|}{\|b\|} \le \frac{\lambda_{\max}}{\lambda_{\min}}\,\varepsilon. \tag{3.60}$$
By design, the iterates x k of the Conjugate Gradient Algorithm 3.7.6 solve
$$\min_{x\in x_0+\mathcal{K}_k(A, r_0)} \tfrac12\langle Ax, x\rangle - \langle b, x\rangle, \tag{3.61}$$
this means that the $x_k$ of the Conjugate Gradient Algorithm 3.7.6 solve
$$\min_{x\in x_0+\mathcal{K}_k(A, r_0)} \tfrac12\|x - x_*\|_A^2.$$
$$\|x - x_*\| \le \lambda_{\min}^{-1/2}\,\|x - x_*\|_A \tag{3.62}$$
and
$$\frac{\|x_* - x_k\|}{\|x_*\|} \le \frac{\lambda_{\max}}{\sqrt{\lambda_{\min}}}\,\frac{\|x_* - x_k\|_A}{\|b\|}. \tag{3.63}$$
If we compare (3.59) and (3.62), then we see that $\|x - x_*\|_A$ is a much better estimate for the error
$\|x_k - x_*\|$ than the residual $\|r_k\|$ if $\lambda_{\min} \ll 1$. Also note that $\|Ax - b\| = \|x - x_*\|_{A^2}$. Of course,
$\|x - x_*\|_A$ is not computable, but this indicates that for symmetric positive definite matrices it is
better to compute approximate solutions by minimizing Q in (3.61) than by minimizing $\|Ax - b\|$.
We also note that x solves the linear least squares problem (3.64) if and only if it solves the normal
equations
AT Ax = AT b. (3.65)
If the rank of A is less than n (which is for example the case when m < n), then (3.64) and (3.65)
have infinitely many solutions. If x ∗ is a particular solution of (3.64) or (3.65), then any vector
in x ∗ + N ( A) also solves (3.64) and (3.65). The minimum norm solution x † of (3.64) or (3.65)
satisfies x † ⊥ N ( A). By the fundamental theorem of linear algebra,
Rn = N ( A) ⊕ R ( AT ), N ( A) ⊥ R ( AT ),
Rm = N ( AT ) ⊕ R ( A), N ( AT ) ⊥ R ( A).
Hence the minimum norm solution can be written as $x^\dagger = A^T y$ for some $y \in \mathbb{R}^m$, and inserting this into (3.65) gives
$$A^T A A^T y = A^T b.$$
CGNR
The CG Method 3.7.6 applied to (3.65) leads to the following algorithm. Here we set
r k = b − Ax k .
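The listing itself is not reproduced here; the following is a hedged sketch of CG applied to the normal equations (3.65) in the usual CGNR form, implemented without explicitly forming $A^T A$.

% CGNR sketch: CG applied to A'*A*x = A'*b without forming A'*A.
function x = cgnr_sketch(A, b, x, tol, maxit)
    r = b - A*x;                  % residual of the original system
    z = A'*r;                     % residual of the normal equations
    p = z;
    for k = 0:maxit-1
        if norm(z) < tol, break; end
        w = A*p;
        alpha = (z'*z) / (w'*w);  % = <z,z> / <A'*A*p, p>
        x = x + alpha*p;
        r = r - alpha*w;
        znew = A'*r;
        beta = (znew'*znew) / (z'*z);
        p = znew + beta*p;
        z = znew;
    end
end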
CGNE
Now we consider a linear system (3.66) with $A \in \mathbb{R}^{m\times n}$ and $b \in \mathcal{R}(A)$. The system matrix $AA^T$ in
(3.66) is symmetric positive semidefinite. Hence, we can use the CG method to solve (3.66). This
gives the following algorithm.
Algorithm 3.7.14
(0) Given $A \in \mathbb{R}^{m\times n}$, $b \in \mathcal{R}(A)$, $y_0 \in \mathbb{R}^m$, $\varepsilon > 0$.
(1) Set r 0 = b − AAT y0 ,
p0 = r 0 .
(2) For k = 0, 1, 2, · · · do
(a) If $\|r_k\| < \varepsilon$ stop; else
(b) α k = kr k k 2 /k AT pk k 2 .
(c) y k+1 = y k + α k pk .
(d) r k+1 = r k − α k AAT pk .
(e) β k = kr k+1 k 2 /kr k k 2 .
(f) pk+1 = r k+1 + β k pk .
The iterates $y_k$ generated by Algorithm 3.7.14 solve
$$\min_{y_k \in y_0 + \mathcal{K}_k(AA^T\!,\, r_0)} \tfrac12\langle AA^T y_k, y_k\rangle - \langle b, y_k\rangle. \tag{3.69}$$
Since $b = Ax^\dagger$,
$$\tfrac12\langle AA^T y_k, y_k\rangle - \langle Ax^\dagger, y_k\rangle = \tfrac12\|A^T y_k - x^\dagger\|^2 - \tfrac12\|x^\dagger\|^2,$$
so the vectors $x_k = A^T y_k$ solve
$$\min_{x_k \in x_0 + \mathcal{K}_k(A^T A,\, A^T r_0)} \tfrac12\|x_k - x^\dagger\|^2. \tag{3.70}$$
The last character in the name CGNE is motivated by the error minimizing property of the
iterates. Sometimes Algorithm 3.7.15 is also called Craig’s method [Cra55].
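A hedged sketch of Algorithm 3.7.14 (CGNE / Craig's method) that additionally carries the iterate $x_k = A^T y_k$ along explicitly; the update of x alongside y is an implementation convenience, not part of the listing above.

% CGNE / Craig's method sketch: CG applied to A*A'*y = b, with x = A'*y
% updated alongside y (cf. Algorithm 3.7.14).
function x = cgne_sketch(A, b, y, tol, maxit)
    x = A'*y;
    r = b - A*x;                     % r_0 = b - A*A'*y_0
    p = r;
    rho = r'*r;
    for k = 0:maxit-1
        if sqrt(rho) < tol, break; end
        q = A'*p;
        alpha = rho / (q'*q);        % alpha_k = ||r_k||^2 / ||A' p_k||^2
        y = y + alpha*p;
        x = x + alpha*q;             % maintains x = A'*y
        r = r - alpha*(A*q);         % r_{k+1} = r_k - alpha_k A A' p_k
        rho_new = r'*r;
        beta = rho_new / rho;        % beta_k = ||r_{k+1}||^2 / ||r_k||^2
        p = r + beta*p;
        rho = rho_new;
    end
end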
We define the error and the residual
$$e(x) \overset{\rm def}{=} x_* - x, \qquad r(x) \overset{\rm def}{=} b - Ax.$$
These representations will be important in the convergence analysis of Krylov subspace methods.
If x ∈ x 0 + Kk ( A, r 0 ), where r 0 = b − Ax 0 , then the error obeys
for the polynomial pk−1 of degree k − 1 that appears in the error representation.
Algorithm 3.5.1 and the Conjugate Gradient Algorithm 3.7.6 both compute iterates x k that
solve
$$\min_{x\in x_0 + \mathcal{K}_k(A, r_0)} \tfrac12\langle Ax, x\rangle - \langle b, x\rangle.$$
or, equivalently,
where
P k−1 is the set of all polynomials of degree less than or equal to k − 1
and e0 = x ∗ − x 0 , e k = x ∗ − x k . If p ∈ P k−1 , then the polynomial q(t) = 1 − p(t)t satisfies
q ∈ Pk, q(0) = 1.
Hence, Krylov subspace methods using Galerkin approximations generate iterates x k such that
λ 1 ≥ . . . ≥ λ n > 0.
The spectrum of A is
σ( A) = {λ 1, . . . , λ n }
Let v j denote the jth column of V . Using
$$e_0 = \sum_{j=1}^{n} \langle e_0, v_j\rangle\, v_j, \qquad v_i^T v_j = \delta_{ij},$$
j=1
we find that
Consequently, Krylov subspace methods using Galerkin approximations generate iterates x k such
that
$$\|e_k\|_A \le \min_{q\in P_k,\, q(0)=1}\; \max_{\lambda\in\sigma(A)} |q(\lambda)|\; \|e_0\|_A. \tag{3.77}$$
and
where P k−1 is the set of all polynomials of degree less than or equal to k − 1. Hence
Instead of comparing the residual r k with the initial residual r 0 we can also derive the following
estimate.
Minimum residual methods are used when A is symmetric, but not positive (semi-)definite, or
when A is non-symmetric. When A is diagonalizable, we can repeat the argument of Section 3.8.2.
However, eigenvalues of A may be complex and the matrix V of eigenvectors in general will not be
orthogonal (the matrix V of eigenvectors can be chosen unitary, $V^* = V^{-1}$, if and only if A is normal, $A^*A = AA^*$).
Of course, non-symmetric matrices A may not be diagonalizable at all. The lack of (unitary)
diagonalizability of A makes the convergence analysis of minimum residual methods in the general
case difficult.
If A ∈ Rn×n is diagonalizable, i.e. if A = V DV −1 , where D is a diagonal matrix, then
q( A) = V q(D)V −1 .
Hence,
kq( A)k ≤ κ 2 (V )kq(D)k,
where
κ 2 (V ) = kV k kV −1 k
is the condition number of V . If A ∈ Rn×n is diagonalizable by an orthogonal matrix V , then
κ 2 (V ) = 1.
The diagonal entries of D are the eigenvalues of A. Since A is not necessarily symmetric, A
may have complex eigenvalues. If σ( A) ⊂ C is the set of eigenvalues of A, then
Thus, if A is diagonalizable we obtain the following estimates from (3.80) and (3.81).
If we replace σ( A) by the interval [a, b] with 0 < a < b, then we can compute the solution of the
best approximation problem analytically using the so-called Chebyshev polynomials. See [Riv90].
Definition 3.8.1 For $k \in \mathbb{N}_0$ the Chebyshev Polynomials (of the first kind) $T_k$ are defined recursively by
$$T_0(x) = 1, \qquad T_1(x) = x, \qquad T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x), \quad k \ge 1.$$
Using the identity $\cos((k+1)\theta) + \cos((k-1)\theta) = 2\cos(\theta)\cos(k\theta)$, one can see that $\cos(k\theta)$ defines a polynomial in $x = \cos(\theta) \in [-1,1]$ and that on $[-1,1]$ the k-th Chebyshev polynomial is given by
$$T_k(x) = \cos\big(k \arccos(x)\big).$$
(Figure: the Chebyshev polynomials $T_n$, $n = 0, \ldots, 5$, on $[-1, 1]$.)
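A short numerical check (not from the notes) of the three-term recurrence against the cosine representation on $[-1,1]$:

% Evaluate T_k(x) by the three-term recurrence and compare with cos(k*acos(x)).
k = 5;  x = linspace(-1, 1, 201);
Tkm1 = ones(size(x));          % T_0
Tk   = x;                      % T_1
for j = 2:k
    Tnew = 2*x.*Tk - Tkm1;     % T_j = 2 x T_{j-1} - T_{j-2}
    Tkm1 = Tk;  Tk = Tnew;
end
max(abs(Tk - cos(k*acos(x))))  % should be of the order of machine precision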
In particular,
$$|T_k(x)| \le 1, \quad x \in [-1, 1], \tag{3.88}$$
and
$$|T_k(x)| > 1, \quad x \notin [-1, 1]. \tag{3.89}$$
Furthermore, the extrema of the Chebyshev polynomial Tk are
x j = cos( jπ/k), j = 0, 1, . . . , k
with
Tk (x j ) = (−1) j , j = 0, 1, . . . , k,
Theorem 3.8.3 Let $0 < a < b$ and $k \in \mathbb{N}$. The solution of $\min_{q\in P_k,\, q(0)=1}\max_{x\in[a,b]}|q(x)|$ is given by
$$q_k^*(x) = T_k\!\left(\frac{b + a - 2x}{b - a}\right) \Big/ T_k\!\left(\frac{b + a}{b - a}\right).$$
The maximum is given by
$$\max_{x\in[a,b]} |q_k^*(x)| = \left[T_k\!\left(\frac{b + a}{b - a}\right)\right]^{-1}. \tag{3.90}$$
Proof: Since a > 0 we have that (b + a)/(b − a) > 1 and, thus, the denominator in the definition
of qk∗ is greater than one (see (3.89)). By construction, the polynomial qk∗ satisfies qk∗ (0) = 1.
The proof of optimality of qk∗ is by contradiction. Suppose that p̃k ∈ P k with p̃k (0) = 1 is a
polynomial with
$$\max_{x\in[a,b]} |q_k^*(x)| > \max_{x\in[a,b]} |\tilde{p}_k(x)|. \tag{3.91}$$
Let $\tilde{x}_0 < \tilde{x}_1 < \cdots < \tilde{x}_k$ be the points in $[a, b]$ at which $q_k^*$ attains its extreme values with alternating sign, and set $r = \tilde{p}_k - q_k^*$. Then (3.91) implies
$$r(\tilde{x}_i) \begin{cases} < 0, & i = 0, 2, 4, \ldots, \\ > 0, & i = 1, 3, 5, \ldots. \end{cases}$$
Thus, the polynomial r has k zeros in the intervals ( x̃ j , x̃ j+1 ), j = 0, 1, . . . , k − 1. Moreover,
r (0) = p̃k (0) − qk∗ (0) = 0. Hence, since r ∈ P k has k + 1 zeros, we can conclude that r = 0.
Equation (3.90) follows immediately from (3.88).
We derive convergence estimates from (3.92) by selecting sets $\Lambda \supset \sigma(A)$ and constructing
polynomials.
If we set Λ = [λ min, λ max ] ⊃ σ( A) in (3.92a), use Theorem 3.8.3 with [a, b] = [λ min, λ max ],
and apply Remark 3.8.4, then we obtain the following convergence result.
Theorem 3.8.6 Let A ∈ Rn×n be a symmetric positive definite matrix and let λ min , λ max be
the smallest and the largest eigenvalues of A, respectively. The conjugate gradient iterations
x k ∈ x 0 + Kk ( A, r 0 ) satisfy
$$\|x_k - x_*\|_A \le 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|x_0 - x_*\|_A,$$
where κ = λ max /λ min .
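As a hedged numerical illustration (this worked example is not in the notes): for $\kappa = 100$ one has $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1) = 9/11 \approx 0.82$, so roughly $k \approx 73$ iterations guarantee a reduction of the A-norm error by a factor of $10^6$, since $2\,(9/11)^{73} \le 10^{-6}$. By contrast, the one-step bound of the next theorem only gives the factor $(\kappa-1)/(\kappa+1) = 99/101 \approx 0.98$ per iteration, which would require on the order of $690$ iterations for the same reduction; this is one way to see why the dependence on $\sqrt{\kappa}$ rather than $\kappa$ matters.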
Theorem 3.8.6 estimates the overall reduction of the error in the A–norm, but it does not indicate
by how much the error in the A–norm decreases in each iteration. Such a result can be obtained if
we set Λ = [λ min, λ max ] ⊃ σ( A) in (3.92b) and use Theorem 3.8.3 with [a, b] = [λ min, λ max ].
Theorem 3.8.7 Let A ∈ Rn×n be a symmetric positive definite matrix and let λ min , λ max be
the smallest and the largest eigenvalues of A, respectively. The conjugate gradient iterations
x k ∈ x 0 + Kk ( A, r 0 ) satisfy
$$\|x_k - x_*\|_A \le \frac{\kappa - 1}{\kappa + 1}\,\|x_{k-1} - x_*\|_A,$$
where κ = λ max /λ min .
If A has a few well separated small eigenvalues $\lambda_1 \le \ldots \le \lambda_\ell$ with $\lambda_\ell \ll \lambda_{\ell+1}$, and a few well
separated large eigenvalues $\lambda_{n-r+1} \le \ldots \le \lambda_n$ with $\lambda_{n-r} \ll \lambda_{n-r+1}$, then the following theorem
gives a better estimate than Theorem 3.8.6.
Theorem 3.8.8 Let $A \in \mathbb{R}^{n\times n}$ be a symmetric positive definite matrix with eigenvalues $0 < \lambda_1 \le \cdots \le \lambda_n$. The conjugate gradient iterates satisfy
$$\|x_{\ell+r+k} - x_*\|_A \le 2\left(\prod_{i=1}^{\ell} \frac{\lambda_{n-r}}{\lambda_i}\right)\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|x_0 - x_*\|_A,$$
where $\kappa = \lambda_{n-r}/\lambda_{\ell+1}$.
Theorem 3.8.9 If
A = ρI + Ac,
where ρ > 0 and Ac ∈ Rn×n is a symmetric positive semidefinite matrix with eigenvalues µ1 ≥
µ2 ≥ . . . ≥ µn ≥ 0, then the iterates of the Conjugate Gradient method obey
$$\|x_k - x_*\|_A \le \left(\prod_{j=1}^{k} \frac{\mu_j}{\mu_j + \rho}\right) \|x_0 - x_*\|_A.$$
The estimate in Theorem 3.8.9 is better than the one in Theorem 3.8.6 if the eigenvalues µ j
of Ac decay to zero sufficiently fast. Theorem 3.8.9 is based on the work by Winther [Win80].
Theorem 3.8.9 explains the excellent performance of the Conjugate Gradient algorithm applied to
the regularized data assimilation least squares problem (1.55).
$$x_* = \sum_{i=1}^{r} \frac{\langle b, v_i\rangle}{\lambda_i}\, v_i + \sum_{i=r+1}^{n} \gamma_i v_i,$$
and $\|x_* - x_k\|_A \to 0$ implies $P_{\mathcal{R}}(x_* - x_k) = P_{\mathcal{R}} x_* - P_{\mathcal{R}} x_k = x^\dagger - (P_{\mathcal{R}} x_0 + \hat{x}_k) \to 0$, so that
$$x_k \to x^\dagger + P_{\mathcal{N}} x_0.$$
The convergence of the Conjugate Gradient Method in the positive semidefinite case is illustrated
in Figure 3.3. See also the earlier Examples 3.7.9 and 3.7.10.
Proof: From Theorem 3.2.7 we know that there exists a polynomial $p_{k_*-1}$ of degree less than or equal
to $k_* - 1 \le n - 1$ such that $x_* - x_0 = A^{-1}r_0 = p_{k_*-1}(A)r_0$ and $r_0 - Ap_{k_*-1}(A)r_0 = 0$. Hence,
Note that the proof does not require the diagonalizability of A. If A is diagonalizable and has J
distinct eigenvalues we can proceed as in the proof of Theorem 3.8.5 but with (3.92a) replaced by
(3.93a) to show that GMRES or MINRES converges in at most J iterations.
If $A \in \mathbb{R}^{n\times n}$ is nonsingular and symmetric indefinite, then there exist an orthogonal matrix
$V \in \mathbb{R}^{n\times n}$ and a real diagonal matrix $D \in \mathbb{R}^{n\times n}$ with
$$A = V D V^T,$$
and hence $\kappa_2(V) = 1$.
Again, we derive convergence estimates from (3.93) by selecting sets $\Lambda$ and constructing
polynomials.
If $A \in \mathbb{R}^{n\times n}$ is nonsingular and symmetric indefinite, then $\sigma(A) \subset [a, b] \cup [c, d]$ with $a \le b < 0 < c \le d$. If we set
$$\bar\lambda = \max_{\lambda\in\sigma(A)} |\lambda| \qquad\text{and}\qquad \underline\lambda = \min_{\lambda\in\sigma(A)} |\lambda|,$$
then
$$[a, b] \subset [-\bar\lambda, -\underline\lambda], \qquad [c, d] \subset [\underline\lambda, \bar\lambda].$$
Our convergence results are based on (3.93) with
$$\Lambda = \{\lambda \in \mathbb{R} : \underline\lambda \le |\lambda| \le \bar\lambda\} \supset \sigma(A).$$
Let $[k/2]$ denote the largest integer less than or equal to $k/2$. If we use the fact that for
q ∈ P[k/2] with q(0) = 1 the polynomial q(λ 2 ) satisfies q(λ 2 ) ∈ P k and q(02 ) = q(0) = 1, we can
prove the following result, cf. e.g. [Sto83, p. 547].
Theorem 3.8.11 Let A ∈ Rn×n be a nonsingular, symmetric indefinite matrix. If x k are MINRES
iterates, then the residuals $r_k = b - Ax_k$ obey
$$\|r_k\| \le 2\left(\frac{\kappa - 1}{\kappa + 1}\right)^{[k/2]} \|r_0\|,$$
where $\kappa = \bar\lambda/\underline\lambda$. In general the residuals do not decrease in every iteration if the matrix is indefinite.
Remark 3.8.14 The assumption implicitly underlying Theorems 3.8.11 and 3.8.13 is that the
intervals containing the eigenvalues of A are of equal size and that they have the same distance
from the origin:
$$[a, b] \subset [-\bar\lambda, -\underline\lambda], \qquad [c, d] \subset [\underline\lambda, \bar\lambda].$$
If this is the case and if the eigenvalues are equally distributed, then Theorems 3.8.11 and 3.8.13
give a good description. However, as in the positive definite case the distribution and clustering
of the eigenvalues will be important for the convergence of the method, and if there are few well
separated clusters of eigenvalues, Theorems 3.8.11 and 3.8.13 will be too pessimistic.
Theorem 3.8.15 Let A ∈ Rn×n and let x k be the minimum residual approximation. If the symmetric
part AS = 21 ( A + AT ) of A is positive definite, then the residuals r k = b − Ax k obey
$$\|r_k\| \le \left[1 - \frac{\lambda_{\min}(A_S)^2}{\lambda_{\max}(A^T A)}\right]^{1/2} \|r_{k-1}\|,$$
where λ min ( AS ) and λ max ( AT A) denote the smallest eigenvalue of AS and the largest eigenvalue
of AT A, respectively.
Thus, $\left[1 - \dfrac{\lambda_{\min}(A_S)^2}{\lambda_{\max}(A^T A)}\right]^{1/2}$ is well defined.
ii. If the symmetric part AS is positive definite, then A is nonsingular. To see this note that
Ax = 0 implies 0 = xT Ax = xT AS x. Since AS is positive definite we find that x = 0.
Moreover, for all $x$ with $\|x\| = 1$,
$$x^T A^T A x \le \lambda_{\max}(A^T A) \qquad\text{and}\qquad x^T A x = x^T A_S x \ge \lambda_{\min}(A_S),$$
so that
$$\|(I + \alpha A)x\|^2 = 1 + 2\alpha\, x^T A x + \alpha^2\, x^T A^T A x \le 1 + 2\alpha\,\lambda_{\min}(A_S) + \alpha^2\,\lambda_{\max}(A^T A)$$
for $\alpha < 0$. The term on the right hand side is minimized by $\alpha = -\lambda_{\min}(A_S)/\lambda_{\max}(A^T A)$, and with
this choice of α
$$\min_{\alpha} \|I + \alpha A\| \le \left[1 - \frac{\lambda_{\min}(A_S)^2}{\lambda_{\max}(A^T A)}\right]^{1/2},$$
which yields the desired estimate.
The situation of Theorem 3.8.15 frequently occurs if the linear system is obtained from a
discretization of a partial differential equation. For example consider the linear systems (1.22) and
(1.36) which arise in the finite difference discretization of (1.12) and (1.29), respectively. Using the
Gershgorin Circle Theorem 2.4.4 one can show that the symmetric parts of the matrices in (1.22)
and (1.36) are symmetric positive definite if c ≥ 0, c1, c2 ≥ 0, and r ≥ 0.
If A is diagonalizable, then one can derive error estimates from (3.93). See, for example [SS86]
and [Saa03, Sec 6.11.4]. However, if V is not unitary, then κ(V ) may be large and the resulting
estimate based on (3.93) may be useless. Thus if A is not unitarily diagonalizable, i.e. if A is
not normal, then the eigenvalues may be irrelevant for the convergence of the minimum residual
methods. See [TE05]. To see what can happen when A is not diagonalizable, consider the following
example.
$$A = \begin{pmatrix} 1 & 1 & & & \\ & 1 & 1 & & \\ & & \ddots & \ddots & \\ & & & 1 & 1 \\ & & & & 1 \end{pmatrix} \in \mathbb{R}^{n\times n}, \qquad
b = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix} \in \mathbb{R}^n.$$
The matrix A has one eigenvalue $\lambda = 1$ with multiplicity n. The eigenspace corresponding to the
eigenvalue $\lambda = 1$ is the span of $e_1$, the first unit vector. Thus, A is not diagonalizable.
The solution of the system Ax = b is given by x ∗ = ((−1) n−1, . . . , −1, 1, −1, 1)T .
If we start GMRES with x 0 = 0, then the orthogonal vectors generated by the Arnoldi process
are given by
vi = en−i+1, i = 1, . . . , n.
Thus, although the eigenvalues of A are perfectly clustered, GMRES needs n iterations to reach the
solution.
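The slow convergence can be observed directly with MATLAB's gmres; the following is a hedged sketch (not from the notes) that builds the bidiagonal matrix above for n = 30 and plots the residual history, which reaches zero only at iteration n (in exact arithmetic).

% GMRES on the bidiagonal example: all eigenvalues equal 1, yet n iterations
% are needed to reach the solution.
n = 30;
A = spdiags([ones(n,1) ones(n,1)], [0 1], n, n);   % 1 on diagonal and superdiagonal
b = [zeros(n-1,1); 1];
[x, flag, relres, iter, resvec] = gmres(A, b, [], 1e-12, n);
semilogy(0:numel(resvec)-1, resvec / norm(b), 'o-');
xlabel('k'); ylabel('||r_k|| / ||r_0||');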
3.9. Preconditioning
The convergence of Krylov subspace method is strongly influenced by the distribution of eigenvalues
of the system matrix A. Roughly speaking, if A is normal, the convergence is better the more the
eigenvalues are clustered and the fewer clusters there are. If A is not normal the situation
is unfortunately more complicated (as we have seen, e.g., in Example 3.8.17).
If A does not have this property, then one can replace the original system
Ax = b (3.94)
by the system
$$K_L^{-1} A K_R^{-1}\, \hat{x} = K_L^{-1} b, \qquad \hat{x} = K_R x, \tag{3.95}$$
where $K_L, K_R$ are nonsingular matrices chosen such that 1) the distribution of eigenvalues of
$K_L^{-1} A K_R^{-1}$ is more favorable, and 2) the application of $K_L^{-1}$ and $K_R^{-1}$ is relatively inexpensive (so
that the potential savings due to the reduced number of iterations when the Krylov subspace method
is applied to $K_L^{-1} A K_R^{-1}\hat{x} = K_L^{-1} b$ are not destroyed by the more expensive matrix vector
multiplications with $K_L^{-1} A K_R^{-1}$).
If the original matrix A is symmetric, then we typically want the transformed system matrix to
be symmetric as well, and in this case we require that
K L = K, KR = KT .
It is easy to apply the Krylov subspace method to the transformed system (3.95). However,
since we are interested in $x = K_R^{-1}\hat{x}$ and not $\hat{x}$, we formulate the Krylov subspace method applied to
the transformed system (3.95) in terms of the original variables. If the matrix A is symmetric and
$K_L = K_R^T = K$, this has the additional advantage that we do not need K, but only $M = KK^T$. The
fact that we only need M, and not a factorization $M = KK^T$, is important for constructing preconditioners.
We will discuss some preconditioned Krylov subspace methods next, and later (see Section
3.9.4) introduce a few basic but common preconditioners.
Hence, in search for a preconditioner, we look for a symmetric positive definite matrix M
such that AM −1 or, equivalently, M −1 A has a favorable eigenvalue distribution, and then we can
construct K so that
M = K KT .
The computation of such a K could be expensive. Fortunately, however, a matrix K with $M = KK^T$
is never needed in the implementation of preconditioned conjugate gradient method. As we will
see shortly, we only have to solve systems where the system matrix is M.
We set
$$\hat{A} = K^{-1} A K^{-T}, \qquad \hat{x} = K^T x, \qquad \hat{b} = K^{-1} b.$$
Let us apply the Conjugate Gradient Algorithm 3.7.6 to the preconditioned system (3.96). By
$\hat{x}_k, \hat{r}_k, \hat{p}_k$ we denote the vectors computed by the conjugate gradient method applied to $\hat{A}\hat{x} = \hat{b}$.
With the transformations
$$\hat{x}_k = K^T x_k, \qquad \hat{r}_k = K^{-1} r_k, \qquad \hat{p}_k = K^T p_k,$$
we obtain
$$\langle \hat{p}_k, \hat{A}\hat{p}_k\rangle = \langle \hat{p}_k, K^{-1}AK^{-T}\hat{p}_k\rangle = \langle p_k, A p_k\rangle, \qquad \|\hat{r}_k\|^2 = \langle r_k, (KK^T)^{-1} r_k\rangle.$$
Moreover,
$$K^T p_{k+1} = \hat{p}_{k+1} = \hat{r}_{k+1} + \beta_k \hat{p}_k = K^{-1} r_{k+1} + \beta_k K^T p_k$$
and
$$p_{k+1} = (KK^T)^{-1} r_{k+1} + \beta_k p_k.$$
Thus, if we set M = K K T and introduce a vector z k = M −1r k , we obtain the algorithm stated next.
Before we state the final version of the preconditioned CG method, we note that since K is
nonsingular, $M = KK^T$ is symmetric positive definite. On the other hand, if M is symmetric
positive definite, then we can find a nonsingular K such that $M = KK^T$. Therefore, we only need
M. The matrix K is used to construct the preconditioned CG method, but is not needed for its
implementation.
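The listing of Algorithm 3.9.1 is not reproduced here; as a hedged companion, the following sketch shows the standard preconditioned CG loop, which indeed uses only solves with M and never a factor K.

% Preconditioned CG sketch: only solves with M = K*K' are needed, never K itself.
function x = pcg_sketch(A, M, b, x, tol, maxit)
    r = b - A*x;
    z = M \ r;                    % z_k = M^{-1} r_k
    p = z;
    rho = r'*z;
    for k = 0:maxit-1
        if norm(r) < tol, break; end
        Ap = A*p;
        alpha = rho / (p'*Ap);
        x = x + alpha*p;
        r = r - alpha*Ap;
        z = M \ r;
        rho_new = r'*z;
        beta = rho_new / rho;     % beta_k = <r_{k+1}, z_{k+1}> / <r_k, z_k>
        p = z + beta*p;
        rho = rho_new;
    end
end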
Since the Preconditioned Conjugate Gradient Algorithm 3.9.1 is equivalent to the Conjugate
Gradient Algorithm 3.7.6 applied to (3.96), the iterates solve
$$\min_{x\in x_0 + \mathcal{K}_k(M^{-1}A,\, M^{-1}r_0)} \tfrac12\langle Ax, x\rangle - \langle b, x\rangle. \tag{3.97}$$
Corollary 3.9.2 Let A, M ∈ Rn×n be symmetric positive definite. The iterates generated by the
Preconditioned Conjugate Gradient Algorithm 3.7.6 started with x 0 = 0 obey the monotonicity
property
$$0 < \|x_1\|_M < \|x_2\|_M < \cdots.$$
Moreover, in the preconditioned version the residual norm $\|K_L^{-1} r_k\|$ is monitored, not the residual
norm of the original problem. In Algorithm 3.9.4 the iteration is terminated if the transformed residual $\hat{r}_k = K^{-1} r_k$ is small.
Using the transformations
$$\hat{v}_k = K^T v_k, \qquad \hat{\tilde{v}}_k = K^T \tilde{v}_k, \qquad \hat{r}_k = K^{-1} r_k,$$
we obtain
$$\tilde{v}_{k+1} = (KK^T)^{-1} A v_k - \delta_k v_{k-1} = (KK^T)^{-1}\big(A v_k - \delta_k\, KK^T v_{k-1}\big),$$
$$\gamma_k = \langle KK^T \tilde{v}_{k+1}, v_k\rangle, \qquad \delta_{k+1} = \langle KK^T \tilde{v}_{k+1}, \tilde{v}_{k+1}\rangle^{1/2}.$$
Introducing new vectors $\tilde{u}_k = KK^T\tilde{v}_k$ and setting $M = KK^T$ we obtain the algorithm given
next. As we have already noted in Section 3.9.1, M is symmetric positive definite if and only if there
exists a nonsingular K such that $M = KK^T$. Therefore, we only need M. The matrix K is used to
construct the algorithm, but is not needed for its implementation.
As we have mentioned in Section 3.5 we do not have to store all vectors $v_1, \ldots, v_k$ and $\tilde{u}_1, \ldots, \tilde{u}_k$,
but only the vectors $v_k$ and $\tilde{u}_{k-1}, \tilde{u}_k, \tilde{u}_{k+1}$. This is done in the preconditioned SYMMLQ, which is
stated below. However, Algorithm 3.9.5 shows how to transform from the $\hat{v}_k$'s to the $v_k$, and how to
apply the preconditioner in the form $M = KK^T$ without using the factors K. The same ideas apply
to MINRES. The preconditioned SYMMLQ is stated as Algorithm 3.9.6 and the preconditioned
MINRES is stated as Algorithm 3.9.7.
(2) For k = 1, 2, . . . do
      $\tilde{u}_{k+1} = A v_k - \dfrac{\delta_k}{\delta_{k-1}}\,\tilde{u}_{k-1}$,
      $\gamma_k = \langle \tilde{u}_{k+1}, v_k\rangle$,
      $\tilde{u}_{k+1} = \tilde{u}_{k+1} - \dfrac{\gamma_k}{\delta_k}\,\tilde{u}_k$,
      Solve $M\tilde{v}_{k+1} = \tilde{u}_{k+1}$,
      $\delta_{k+1} = \langle \tilde{v}_{k+1}, \tilde{u}_{k+1}\rangle^{1/2}$.
      If $\delta_{k+1} \ne 0$, then $v_{k+1} = \tilde{v}_{k+1}/\delta_{k+1}$;
      Else $v_{k+1} = \tilde{v}_{k+1}$ (= 0).
      Endif
      If k = 1, then
            $\bar{d}_k = \gamma_k$, $\tilde{e}_{k+1} = \delta_{k+1}$,
      Elseif k > 1, then
            Apply Givens rotation $G_k$ to row k:
                  $\bar{d}_k = s_k \tilde{e}_k - c_k \gamma_k$,
                  $e_k = c_k \tilde{e}_k + s_k \gamma_k$.
            Apply Givens rotation $G_k$ to row k + 1:
                  $f_{k+1} = s_k \delta_{k+1}$,
                  $\tilde{e}_{k+1} = -c_k \delta_{k+1}$.
      Endif
      Determine Givens rotation $G_{k+1}$:
            $d_k = \sqrt{\bar{d}_k^2 + \delta_{k+1}^2}$,
            $c_{k+1} = \bar{d}_k/d_k$,
            $s_{k+1} = \delta_{k+1}/d_k$.
      If k = 1, then
            $\zeta_1 = \delta_1/d_1$.
      Elseif k = 2, then
            $\zeta_2 = -\zeta_1 e_2/d_2$,
      Elseif k > 2, then
            $\zeta_k = (-\zeta_{k-1} e_k - \zeta_{k-2} f_k)/d_k$,
      Endif
      $x_k^L = x_{k-1}^L + \zeta_k\big(c_{k+1}\bar{w}_k + s_{k+1} v_{k+1}\big)$.
      $\bar{w}_{k+1} = s_{k+1}\bar{w}_k - c_{k+1} v_{k+1}$.
      If $\|r_k\| < \varepsilon$ goto (3).
End
(3) $x_k = x_k^L + (\zeta_k s_{k+1}/c_{k+1})\,\bar{w}_{k+1}$.
In Chapter 2 we studied splitting methods based on a decomposition
$$A = M - N$$
with nonsingular $M \in \mathbb{R}^{n\times n}$. Specifically, we rewrote the linear system $Ax = b$ as the fixed point
equation $x = M^{-1}Nx + M^{-1}b$ and studied the fixed point iteration
$$x_{k+1} = M^{-1}N x_k + M^{-1}b = x_k + M^{-1}(b - Ax_k). \tag{3.100}$$
This iteration converges for any initial vector x if and only if the spectral radius of $I - M^{-1}A$ is less
than one, that is if and only if all eigenvalues of I − M −1 A are inside the unit circle in the complex
plane. Since eigenvalues λ of I − M −1 A and µ of M −1 A are related via λ = 1 − µ, the eigenvalues
of I − M −1 A are inside the unit circle if and only if the eigenvalues of M −1 A are inside the circle
of radius one with center one. In particular the eigenvalues of M −1 A are clustered. This suggested
the use of the matrix M as a preconditioner.
Note that for (3.100) to converge as a standalone iterative method, all eigenvalues of $M^{-1}A$ must be inside the
circle of radius one with center one. However, M can still be used as a preconditioner if there are
eigenvalues of $M^{-1}A$ that are outside this circle.
When Krylov subspace methods are used that exploit the symmetry of the system matrix, such
as the Conjugate Gradient Method 3.9.1, SYMMLQ 3.9.6 , or MINRES 3.9.7, the preconditioner
M must be symmetric positive definite. This is one reason why we introduced the symmetric SOR
method and the symmetric Gauss-Seidel method in Problem 2.1. The matrix M for these methods
is symmetric positive definite.
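As a hedged illustration (Problem 2.1 itself is not shown here): for a symmetric matrix written as $A = D + L + L^T$, the symmetric Gauss–Seidel preconditioner is commonly expressed as $M = (D+L)D^{-1}(D+L)^T$, which is symmetric positive definite when A is. The sketch below passes it to MATLAB's pcg in factored form, so that only triangular and diagonal solves are performed; the model problem is an arbitrary SPD example.

% Symmetric Gauss-Seidel preconditioner M = (D+L) D^{-1} (D+L)^T, used with pcg.
n  = 100;  e = ones(n,1);
A  = spdiags([-e 2*e -e], -1:1, n, n);       % SPD model problem (illustration only)
b  = ones(n,1);
D  = spdiags(spdiags(A,0), 0, n, n);
L  = tril(A,-1);
M1 = (D + L) / D;          % lower triangular factor (D+L) D^{-1}
M2 = (D + L)';             % upper triangular factor, so M1*M2 = (D+L) D^{-1} (D+L)^T
[x, flag, relres, iter] = pcg(A, b, 1e-10, 200, M1, M2);
fprintf('flag = %d, iterations = %d, relative residual = %.2e\n', flag, iter, relres);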
3.10. Problems
Problem 3.1 Show that if the symmetric part $\tfrac12(A + A^T)$ of the matrix $A \in \mathbb{R}^{n\times n}$ is positive definite
on Vk ⊂ Rn , i.e. if there exists c > 0 such that
Problem 3.3 Let A ∈ Rn×n be nonsingular. We want to compute approximations x k of the solution
x ∗ of Ax = b using Galerkin approximations. That is we want to compute x k ∈ x 0 + Kk ( A, r 0 )
such that
hAx k − b, vi = 0 ∀ v ∈ Kk ( A, r 0 ). (3.102)
i. Describe an algorithm that uses the Arnoldi Iteration to compute an orthonormal basis for
$\mathcal{K}_k(A, r_0)$ and uses this basis to write (3.102) as a $k \times k$ linear system.
This algorithm is known as the Full Orthogonalization Method (FOM).
ii. Implement your algorithm in Matlab . If the system does not have a unique solution, your
algorithm should return with an error message.
iii. Apply your algorithm to solve the first linear system in Example 3.2.5.
iv. Apply your algorithm to solve (1.24) with h = 0.02 and the data specified in Example 1.3.1.
Show that the matrix A in (1.24) has positive definite symmetric part. (Hint: Gershgorin
Circle Theorem 2.4.4.) Hence Problem 3.1 guarantees the well-posedness of the FOM for
this example.
Problem 3.4
• [Emb03] Apply GMRES to the system Ax = b with
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 3 \\ 0 & 0 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 2 \\ -4 \\ 1 \end{pmatrix}.$$
Set x 0 = 0 and restart after m = 1, after m = 2 and after m = 3. In all cases set the maximum
number of iterations to 30.
Generate one plot that shows the normalized residuals kr k k/kr 0 k for the three restart cases.
• Apply GMRES(m) to solve (1.24) with h = 0.02 and the data specified in Example 1.3.1.
Use m = 2, 5, 10, 20, 60. (Since m = 60 > n, this corresponds to full GMRES - no restart).
Generate one plot that shows the normalized residuals kr k k/kr 0 k for the five restart cases.
Problem 3.7 Let A ∈ Rn×n be a symmetric positive definite matrix. Show that if v1, . . . , v j ∈ Rn
are nonzero and A–orthogonal, then they are linearly independent.
Problem 3.9 Let A ∈ Rn×n be a symmetric positive definite matrix, let B ∈ Rn×m and let I ∈ Rm×m
be the identity matrix.
Ax = b − Bd (3.104)
The previous result shows that even though the Conjugate Gradient Algorithm 3.7.6 is applied
to the nonsymmetric system (3.103), it effectively only “sees” the small symmetric positive definite
system (3.104), provided that the initial iterate is chosen appropriately. This result is important in
applications of the Conjugate Gradient Algorithm 3.7.6 to linear systems that arise from the finite
element discretization of partial differential equations with Dirichlet boundary conditions.
The finite difference discretization with mesh size h = 1/(n + 1) leads to the (n + 2) × (n + 2)
linear system
$$\begin{pmatrix} 1 & 0 & 0 \\ -e_1 & A & -e_n \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} y_0 \\ y \\ y_{n+1} \end{pmatrix} = \begin{pmatrix} 1 \\ b \\ -1 \end{pmatrix} \tag{3.105}$$
with y = (y1, . . . , yn )T , e1, en being the first and nth unit vector in Rn ,
$$A = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix} \in \mathbb{R}^{n\times n}
\qquad\text{and}\qquad b = h^2\big(f(h), \ldots, f(nh)\big)^T.$$
Even though in (3.105) the I block is not in the bottom right, but split, the result in part i still
applies since a symmetric permutation of (3.105) leads to a system of the type (3.103).
– Use the Conjugate Gradient method to solve the (n + 2) × (n + 2) system (3.105) with
f (x) = π 2 cos(πx) and n = 30 using the initial iterate
The exact solution of the differential equation is y(x) = cos(πx). For both cases plot the
solution computed by pcg as well as the convergence history.
Note: For the system (3.105) arising from a one-dimensional differential equation it is easy
to eliminate y0 and yn+1 . However, for the discretization of partial differential equations in
higher dimensions the approach in this problem is very convenient and frequently used.
Problem 3.10 Let A, B, V ∈ Rn×n , V invertible, and let p, q be arbitrary polynomials. Show the
following identities.
(i) Ap( A) = p( A) A.
(ii) If A = V −1 BV , then p( A) = V −1 p(B)V .
(iii) kp( A)q( A)k ≤ kp( A)k kq( A)k .
Let A ∈ Rn×n be symmetric positive definite and B ∈ Rn×n . Show the following identities.
(iv) kBk A = k A1/2 B A−1/2 k ,
(v) kp(B)k A = k A1/2 p(B) A−1/2 k ,
(vi) kp( A)k A = kp( A)k .
Problem 3.13 Let A ∈ Rn×n be a nonsingular, symmetric indefinite matrix with eigenvalues
Problem 3.14 Show that the directions pk and the residuals r k generated by the Preconditioned
Conjugate Gradient Algorithm 3.9.1 obey
Show that the iterates x k of the Preconditioned Conjugate Gradient Algorithm 3.9.1 solve
$$\min_{x\in x_0 + \mathcal{K}_k(M^{-1}A,\, M^{-1}r_0)} \tfrac12\langle Ax, x\rangle - \langle b, x\rangle$$
and, if x 0 = 0, obey
0 < k x1 kM < k x2 kM < . . . .
Problem 3.15 Let A be a symmetric positive definite matrix with constant diagonal. Show that
the Preconditioned Conjugate Gradient Algorithm 3.9.1 with Jacobi preconditioning produces the
same iterates as the (unpreconditioned) Conjugate Gradient Algorithm 3.7.6.
Problem 3.16 Follow the approach in Section 3.9.1 to derive the preconditioned gradient method
with steepest descent step-size and with Barzilai-Borwein step size.
Problem 3.18 This problem explores the implementation of the preconditioned conjugate gradient
method using Gauss-Seidel-type preconditioners which is described in [Eis81].
TO BE ADDED.
[Ben02] M. Benzi. Preconditioning techniques for large linear systems: a survey. J. Comput. Phys.,
182(2):418–477, 2002. URL: http://dx.doi.org/10.1006/jcph.2002.7176,
doi:10.1006/jcph.2002.7176.
[Cra55] E. J. Craig. The n-step iteration procedures. J. of Mathematics and Physics, 34:64–73,
1955.
[DL02] Y.-H. Dai and L.-Z. Liao. R-linear convergence of the Barzilai and Borwein gradient
method. IMA J. Numer. Anal., 22(1):1–10, 2002. URL: http://dx.doi.org/10.
1093/imanum/22.1.1, doi:10.1093/imanum/22.1.1.
[Emb03] M. Embree. The tortoise and the hare restart GMRES. SIAM Rev., 45(2):259–266
(electronic), 2003.
[Fle05] R. Fletcher. On the Barzilai-Borwein method. In L. Qi, K. Teo, and X. Yang, editors,
Optimization and control with applications, volume 96 of Appl. Optim., pages 235–256.
Springer, New York, 2005. URL: http://dx.doi.org/10.1007/0-387-24255-4_
10, doi:10.1007/0-387-24255-4_10.
[GL96] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Studies in the
Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition,
1996.
[GO89] G. H. Golub and D. P. O’Leary. Some history of the conjugate gradient and Lanczos
algorithms: 1948–1976. SIAM Rev., 31(1):50–102, 1989.
[Gre97] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia,
1997.
[HS52] M.R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems.
J. of Research National Bureau of Standards, 49:409–436, 1952.
[O’L01] D. P. O’Leary. Commentary on methods of conjugate gradients for solving linear systems
by Magnus R. Hestenes and Eduard Stiefel. In D. R. Lide, editor, A Century of Excellence
in Measurements, Standards, and Technology - A Chronicle of Selected NBS/NIST Pub-
lications 1901-2000, pages 81–85. Natl. Inst. Stand. Technol. Special Publication 958,
U. S. Government Printing Office, Washington, D. C, 2001. Electronically available
at http://nvlpubs.nist.gov/nistpubs/sp958-lide/cntsp958.htm (accessed
February 6, 2012).
[OT14] M. A. Olshanskii and E. E. Tyrtyshnikov. Iterative Methods for Linear Systems: Theory
and Applications. SIAM, Philadelphia, 2014.
[PS75] C. C. Paige and M. A. Saunders. Solution of sparse indefinite systems of linear equations.
SIAM J. Numer. Anal., 12:617–629, 1975.
[Ray93] M. Raydan. On the Barzilai and Borwein choice of steplength for the gradient method.
IMA J. Numer. Anal., 13(3):321–326, 1993. URL: http://dx.doi.org/10.1093/
imanum/13.3.321, doi:10.1093/imanum/13.3.321.
[Riv90] T. J. Rivlin. Chebyshev Polynomials. From Approximation Theory to Algebra and Number
Theory. Pure and Applied Mathematics (New York). John Wiley & Sons Inc., New York,
second edition, 1990.
[Saa03] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, second
edition, 2003.
[SS86] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving
nonsymmetric linear systems. SIAM J. Sci. Stat. Comp., 7:856–869, 1986.
[Sto83] J. Stoer. Solution of large linear systems of equations by conjugate gradient type methods.
In A. Bachem, M. Grötschel, and B. Korte, editors, Mathematical Programming, The
State of The Art, pages 540–565. Springer Verlag, Berlin, Heidelberg, New-York, 1983.
[TB97] L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[TE05] L. N. Trefethen and M. Embree. Spectra and Pseudospectra. The Behavior of Nonnormal
Matrices and Operators. Princeton University Press, Princeton, NJ, 2005.
[Vor03] H. A. van der Vorst. Iterative Krylov Methods for Large Linear Systems, volume 13
of Cambridge Monographs on Applied and Computational Mathematics. Cambridge
University Press, Cambridge, 2003.
[Win80] R. Winther. Some superlinear convergence results for the conjugate gradient methods.
SIAM J. Numer. Anal., 17:14–17, 1980.
Chapter
4
Introduction to Unconstrained
Optimization
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.2 Existence of Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.3 Unconstrained Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . 201
4.4 Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
4.5 Convergence of Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
4.1. Introduction
We study the solution of unconstrained minimization problems
$$\min_{x\in\mathbb{R}^n} f(x),$$
where $f : \mathbb{R}^n \to \mathbb{R}$. Maximization problems can be converted into minimization problems, since
$\max_x f(x) = -\min_x\big(-f(x)\big)$. Hence, all results derived for minimization problems can be readily applied to maximization
problems.
$$\min\; f(x) \quad\text{s.t.}\quad x \in \mathcal{F}, \tag{4.1}$$
Definition 4.2.1 The point x ∗ ∈ Rn is called a local minimum of f over F , if there exists r > 0
such that
f (x ∗ ) ≤ f (x) for all x ∈ Br (x ∗ ) ∩ F .
The point $x_* \in \mathbb{R}^n$ is called a strict local minimum of f over $\mathcal{F}$, if there exists $r > 0$ such that
$f(x_*) < f(x)$ for all $x \in B_r(x_*) \cap \mathcal{F}$, $x \ne x_*$.
Recall that F ⊂ Rn is compact if and only if it is closed and bounded. Many feasible sets F
are not bounded. In such cases we can obtain an existence result if we impose stronger conditions
on the objective function. In particular, we require the growth condition
$$f(x) \to \infty \quad\text{as } \|x\| \to \infty,\; x \in \mathcal{F}. \tag{4.2}$$
Theorem 4.2.3 If the set F ⊂ Rn of feasible points is closed and if f : F → R is continuous and
satisfies (4.2), then there exists x ∗ ∈ F such that f (x ∗ ) = inf { f (x) : x ∈ F }.
Proof: Let f ∗ = inf { f (x) : x ∈ F } and let {x k } be a sequence of feasible points x k ∈ F such
that lim k→∞ f (x k ) = f ∗ . In particular { f (x k )} is bounded. Because of (4.2), the sequence {x k }
must also be bounded. Thus, there exists M > 0 such that k x k k ≤ M for all k ∈ N. Since
x k ∈ {x ∈ Rn : k xk ≤ M } and lim k→∞ f (x k ) = f ∗ ,
$$f_* = \inf\big\{ f(x) : x \in \mathcal{F} \cap \{x \in \mathbb{R}^n : \|x\| \le M\} \big\}.$$
The set F ∩ {x ∈ Rn : k xk2 ≤ M } is closed and bounded and, hence, compact. Thus, Theorem
4.2.2 gives the existence of x ∗ ∈ F ∩ {x ∈ Rn : k xk2 ≤ M } ⊂ F such that f (x ∗ ) = f ∗ .
If f˜ is continuous and bounded from below, then the function f (x) = f˜(x) + α2 k xk22 , α > 0,
satisfies (4.2) with k · k = k · k2 . In many applications one is really interested in minimizing f˜(x),
but the objective function used in the optimization is f (x) = f˜(x) + α2 k xk22 , α > 0. Theorem 4.2.3
provides one reason why one may do this.
It should be noted that Theorems 4.2.2 and 4.2.3 make statements about global minima. The
minimization algorithms that will be discussed later in this chapter are only guaranteed to find
so-called local minima. We will define local and global minima next, and provide necessary and
sufficient conditions for a point x ∗ to be a (local) minimum.
Lemma 4.3.1 Let r > 0 and let f : Br (x) → R be continuously differentiable on Br (x). If
x + v ∈ Br (x), then
$$f(x+v) = f(x) + \nabla f(x)^T v + \int_0^1 \big(\nabla f(x+tv) - \nabla f(x)\big)^T v \, dt. \tag{4.3}$$
Proof: Define φ(t) = f (x + tv). We have φ0 (t) = ∇ f (x + tv)T v and φ00 (t) = vT ∇2 f (x + tv)v.
The fundamental theorem of calculus states that
$$\varphi(1) = \varphi(0) + \varphi'(0) + \int_0^1 \big(\varphi'(t) - \varphi'(0)\big)\, dt. \tag{4.7}$$
for all v with kvk < . This proves (4.4). Equation (4.6) can be obtained analogously.
0 ≤ f (x ∗ + tv) − f (x ∗ )
for all v ∈ Rn with kvk = 1 and all t ∈ (−r, r). If we set v equal to the ith unit vector ei , this implies
$$0 \le \lim_{t\to 0^+} \frac{f(x_* + te_i) - f(x_*)}{t} = \frac{\partial}{\partial x_i} f(x_*), \qquad
0 \le \lim_{t\to 0^+} \frac{f(x_* - te_i) - f(x_*)}{t} = -\frac{\partial}{\partial x_i} f(x_*).$$
Hence, $\frac{\partial}{\partial x_i} f(x_*) = 0$.
ii. From part i we know that ∇ f (x ∗ ) = 0. Suppose ∇2 f (x ∗ ) is not positive semidefinite. Then
there exist λ > 0 and w ∈ Rn such that
wT ∇2 f (x ∗ )w ≤ −λ kwk22 .
Let $\sigma = \lambda/2$. Lemma 4.3.1 and the definition of a local minimum guarantee the existence of
$\varepsilon \in (0, r)$ such that with $v = \big(\varepsilon/(2\|w\|_2)\big)\, w$,
$$0 \le f(x_* + v) - f(x_*)
= \tfrac12 v^T\nabla^2 f(x_*) v + \int_0^1\!\!\int_0^1 t\, v^T\big(\nabla^2 f(x_* + \tau t v) - \nabla^2 f(x_*)\big) v \, d\tau\, dt
< \tfrac12 v^T\nabla^2 f(x_*) v + \sigma\|v\|_2^2 \le -\frac{\lambda}{2}\|v\|_2^2 + \sigma\|v\|_2^2 = 0.$$
This is a contradiction. Hence our assumption that ∇2 f (x ∗ ) is not positive semidefinite must be
false.
The proof of the second part of Theorem 4.3.2 showed how to generate a point x with a function
value lower than f (x ∗ ) when the Hessian ∇2 f (x ∗ ) is not positive semidefinite. Such information
will be important when we design algorithms for the computation of minimum points.
The proof of the second part of this theorem is identical to our proof of Theorem 4.3.2(ii). The
proof of the first part can be carried out analogously using the first part of Lemma 4.3.1.
Theorem 4.3.3 shows that directions d with ∇ f (x)T d < 0 are descent directions, provided that
f is continuously differentiable, and that directions of negative curvature are descent directions,
provided that f is twice continuously differentiable and ∇ f (x) = 0. In particular, eigenvectors
of ∇2 f (x) corresponding to negative eigenvalues are directions of negative curvature and descent
directions. Geometrically, ∇ f (x)T d < 0 if and only if the angle between d and −∇ f (x) is less than
90 degrees (see Figure 4.1).
Theorem 4.3.5 (Sufficient Optimality Conditions) Suppose there exists $r > 0$ such that $f :
\mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable on $B_r(x_*)$. If $\nabla f(x_*) = 0$ and if $\nabla^2 f(x_*)$ is positive
definite, then $x_*$ is a strict local minimum. More precisely, there exist $c > 0$ and $\varepsilon > 0$ such that
$$f(x) \ge f(x_*) + c\,\|x - x_*\|^2 \quad \text{for all } x \in B_\varepsilon(x_*). \tag{4.8}$$
Let $\lambda > 0$ denote the smallest eigenvalue of $\nabla^2 f(x_*)$ and let $\sigma < \lambda/4$. Lemma 4.3.1 guarantees the existence of $\varepsilon > 0$ such that for all $x \in B_\varepsilon(x_*)$, with
$v = x - x_*$,
$$f(x) - f(x_*) = f(x_* + v) - f(x_*)
= \tfrac12 v^T\nabla^2 f(x_*) v + \int_0^1\!\!\int_0^1 t\, v^T\big(\nabla^2 f(x_* + \tau t v) - \nabla^2 f(x_*)\big) v \, d\tau\, dt
\ge \frac{\lambda}{2}\|v\|_2^2 - \sigma\|v\|_2^2 \ge \frac{\lambda}{4}\|v\|_2^2.$$
Since all norms on Rn are equivalent, this gives the assertion.
There is a gap between the necessary optimality conditions in Theorem 4.3.2 and the sufficient
optimality conditions in Theorem 4.3.5. The necessary conditions imply that at a minimum x ∗ ,
∇ f (x ∗ ) = 0 and ∇2 f (x ∗ ) is positive semidefinite. However, we need positive definiteness of
∇2 f (x ∗ ) and ∇ f (x ∗ ) = 0 to guarantee that x ∗ is a local minimum. If these two conditions are
satisfied, then x ∗ is even a strong local minimum and the local quadratic growth condition (4.8) is
satisfied.
If f is a quadratic function, this gap can be overcome. For quadratic minimization problems
$$\min\; \tfrac12 x^T H x + c^T x + d, \tag{4.9}$$
where H ∈ Rn×n is symmetric, c ∈ Rn and d ∈ R we have the following result on the characterization
of solutions and existence of solutions.
H x = −c (4.10)
ii. The quadratic minimization problem (4.9) has a solution if and only if H ∈ Rn×n is symmetric
positive semi-definite and c ∈ R (H). In this case the set of solutions of (4.9) is given by
S = x ∗ + N (H),
where x ∗ denotes a particular solution of (4.10) and N (H) denotes the null space of H.
With this identity, the first part can be proven using the techniques applied in the proof of
Theorem 4.3.2. The second part follows directly from the theory of linear systems applied to
(4.10).
Later, see Theorem 4.4.5, we will generalize the previous theorem and show that the gap
between necessary and sufficient optimality conditions can be closed if f is a convex continuously
differentiable function. In this case ∇ f (x ∗ ) = 0 is necessary and sufficient for an optimum.
t x + (1 − t)y ∈ C
The convexity of sets and the convexity of functions are related through the following theorem.
Proof: i. Let f be convex and let x, y ∈ C be arbitrary. If x = y, then (4.11) is trivial. Thus, let
x , y. Consider the function
Thus, (4.11) is violated if we set y = x + th and, by i., f can not be convex. This is a contradiction.
Therefore the Hessian must be positive semidefinite for all x ∈ C.
The Hessian of
$$f(x) = \tfrac12 x^T H x + c^T x + d$$
is given by $\nabla^2 f(x) = H$. Hence $f(x) = \tfrac12 x^T H x + c^T x + d$ is convex if and only if H is positive semidefinite.
Theorem 4.4.4 Let the feasible set $\mathcal{F} \subset \mathbb{R}^n$ be convex and let $f : \mathcal{F} \to \mathbb{R}$ be convex. If there exists
a global minimum, then all local minima are global minima and the set $S = \{x_* \in \mathcal{F} : f(x_*) = f_*\}$ of
global minima is convex.
Proof: i. Suppose x̄ ∈ F is a local minimum but not a global minimum. Then there exists x ∗ ∈ C
such that f (x ∗ ) < f ( x̄). Since f is convex,
for all t ∈ (0, 1]. This contradicts the assumption that x̄ is a local minimum.
ii. Let x 1, x 2 ∈ F be global minima, i.e., f (x i ) ≤ f (x) for all x ∈ C, i = 1, 2. By the convexity
of f ,
f (t x 1 + (1 − t)x 2 ) ≤ t f (x 1 ) + (1 − t) f (x 2 ) ≤ f (x)
for all x ∈ F and for all t ∈ [0, 1]. This proves the convexity of S.
For general twice continuously differentiable nonlinear functions there is a gap between the
necessary optimality conditions in Theorem 4.3.2 and the sufficient optimality conditions in Theo-
rem 4.3.5. This gap disappears if f is convex.
Proof: If x ∗ is a global minimum, then Theorem 4.3.2 implies that ∇ f (x ∗ ) = 0. On the other
hand, if ∇ f (x ∗ ) = 0, then (4.11) with y = x ∗ implies f (x) ≥ f (x ∗ ). Thus, x ∗ is a global minimum.
(i) The sequence is called q–linearly convergent if there exist c ∈ (0, 1) and k̂ ∈ N such that
kz k+1 − z∗ k ≤ c kz k − z∗ k
(ii) The sequence is called q–superlinearly convergent if there exists a sequence {ck } with ck > 0
and lim k→∞ ck = 0 such that
kz k+1 − z∗ k ≤ ck kz k − z∗ k
or, equivalently, if
kz k+1 − z∗ k
lim = 0.
k→∞ kz k − z∗ k
(iii) If there exist $p > 1$, $c > 0$, and $\hat{k} \in \mathbb{N}$ such that
$$\|z_{k+1} - z_*\| \le c\, \|z_k - z_*\|^p$$
for all $k \ge \hat{k}$, then the sequence is called q–convergent with q–order at least p. In particular,
if $p = 2$, we say the sequence is q–quadratically convergent and if $p = 3$, we say the sequence
is q–cubically convergent.
Figure 4.4 illustrates q-linear convergence, q-superlinear convergence, and q-quadratic convergence using the sequences $z_k = 1/2^k$, $z_k = 1/k!$, and $z_0 = 1$, $z_k = 1/2^{2^{k-1}}$, $k \ge 1$, respectively.
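A small numerical check (not from the notes) of the three model sequences: the ratio of consecutive terms is constant for the q-linear sequence, tends to zero for the q-superlinear one, and shrinks roughly like the square of the current term for the q-quadratic one.

% Compute the three model sequences of Figure 4.4 and the ratios z_{k+1}/z_k.
K = 8;  k = (1:K)';
zlin  = 1 ./ 2.^k;                 % q-linear:      ratio constant (1/2)
zsup  = 1 ./ factorial(k);         % q-superlinear: ratio -> 0
zquad = 1 ./ 2.^(2.^(k-1));        % q-quadratic:   z_{k+1} is of the order z_k^2
disp([zlin(2:end)./zlin(1:end-1), zsup(2:end)./zsup(1:end-1), ...
      zquad(2:end)./zquad(1:end-1)]);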
Remark 4.5.2 For q-linear convergence the choice of norm is important. For example, consider the
sequence $\{z_k\} \subset \mathbb{R}^2$ defined by
$$z_{2k} = \frac{1}{(2\sqrt{2})^k}\,(1, 0)^T, \qquad z_{2k+1} = \frac{1}{(2\sqrt{2})^k}\,\Big(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\Big)^T.$$
We have
$$\|z_{2k}\|_\infty = \frac{1}{(2\sqrt{2})^k} = \frac12\,\frac{1}{\sqrt{2}\,(2\sqrt{2})^{k-1}} = \frac12\|z_{2k-1}\|_\infty, \qquad
\|z_{2k+1}\|_\infty = \frac{1}{\sqrt{2}\,(2\sqrt{2})^k} = \frac{1}{\sqrt{2}}\|z_{2k}\|_\infty,$$
and therefore the sequence converges q-linearly with q-factor $1/\sqrt{2}$ in the $\infty$-norm. However, if we
use the 2-norm, then
$$\|z_{2k}\|_2 = \frac{1}{(2\sqrt{2})^k} = \frac{1}{2\sqrt{2}}\,\frac{1}{(2\sqrt{2})^{k-1}} = \frac{1}{2\sqrt{2}}\|z_{2k-1}\|_2, \qquad
\|z_{2k+1}\|_2 = \frac{1}{(2\sqrt{2})^k} = \|z_{2k}\|_2,$$
which means the sequence does not converge q-linearly in the 2-norm.
Since all norms in Rl are equivalent, if a sequence converges q–superlinearly/q–order at least
p > 1 in one norm, it also converges q–superlinearly/q–order at least p > 1 in any other norm.
then we say that the sequence $\{z_k\}$ converges r–quadratically to $z_*$. The condition (4.12) is equivalent
to the existence of $\kappa > 0$ and $\tilde{c} \in (0, 1)$ such that
$$\|z_k - z_*\| \le \kappa\, \tilde{c}^{\,2^k} \quad \forall k \in \mathbb{N}. \tag{4.13}$$
(i) The sequence is called r–linearly convergent if there exist $c \in (0, 1)$ and $\kappa > 0$ such that
$$\|z_k - z_*\| \le \kappa\, c^k \quad \forall k \in \mathbb{N}.$$
(ii) The sequence is called r–superlinearly convergent if there exist $\kappa > 0$ and a sequence $\{c_k\}$
with $c_k > 0$ and $\lim_{k\to\infty} c_k = 0$ such that
$$\|z_k - z_*\| \le \kappa \prod_{i=1}^{k} c_i \quad \forall k \in \mathbb{N}.$$
(iii) The sequence is said to converge r–quadratically, if there exist $\kappa > 0$ and $c \in (0, 1)$ such that
$$\|z_k - z_*\| \le \kappa\, c^{2^k} \quad \forall k \in \mathbb{N}.$$
Remark 4.5.4 Since all norms in Rn are equivalent, if a sequence converges r–linearly/r–
superlinearly/r–quadratically in one norm, it also converges r–linearly/r–superlinearly/r–
quadratically in any other norm.
Note that the sequence $\{z_k\} \subset \mathbb{R}^2$ from Remark 4.5.2 defined by
$$z_{2k} = \frac{1}{(2\sqrt{2})^k}\,(1, 0)^T, \qquad z_{2k+1} = \frac{1}{(2\sqrt{2})^k}\,\Big(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\Big)^T,$$
satisfies
$$\|z_{2k}\|_2 = \frac{1}{(2\sqrt{2})^k} = \frac{1}{\big(\sqrt{2\sqrt{2}}\big)^{2k}}, \qquad
\|z_{2k+1}\|_2 = \frac{1}{(2\sqrt{2})^k} = \sqrt{2\sqrt{2}}\,\frac{1}{\big(\sqrt{2\sqrt{2}}\big)^{2k+1}},$$
i.e.,
$$\|z_k\|_2 \le \kappa\, c^k \qquad\text{with } \kappa = \sqrt{2\sqrt{2}} \text{ and } c = 1/\sqrt{2\sqrt{2}}.$$
Remark 4.5.5 Although the conjugate gradient (CG) method introduced in Section 3.7 converges
in n iterations and therefore the notion of convergence of sequences does not apply, the
convergence results in Section 3.8.5 essentially state q-convergence and r-convergence results for
the CG method. In particular, Theorem 3.8.7 essentially states the q-linear convergence of the
CG method. Theorems 3.8.6 and 3.8.8 state the r-linear convergence of the CG method, and
Theorem 3.8.9 essentially states r-superlinear convergence of the CG method.
The paper [Pot89] by Potra gives sufficient conditions for a sequence to have the q-order and/or
the r-order of convergence greater than one.
The next result shows that q-linear convergence of the steps z k −z k−1 implies r-linear convergence
of the iterates z k − z∗ .
Lemma 4.5.6 Let {z k } ⊂ Rl be a sequence of vectors. If there exist c ∈ (0, 1) and k̄ ∈ N such that
then there exists $z_* \in \mathbb{R}^l$ such that the sequence $\{z_k\}$ converges to $z_*$ at least r–linearly.
for all $i \ge 0$ and all $k \ge \bar{k}$. This shows that $\{z_k\}$ is a Cauchy sequence. There exists a limit $z_*$ of this
sequence. Letting $l \to \infty$ in the previous inequality yields
$$\|z_* - z_k\| \le \left(\frac{c^{1-\bar{k}}}{1 - c}\,\|z_{\bar{k}} - z_{\bar{k}-1}\|\right) c^k$$
for all $k \ge \bar{k}$. This implies the r–linear convergence.
4.6. Problems
Problem 4.2 (See [Ber95, p.13]) In each of the following problems fully justify your answer using
optimality conditions
i. Show that the 2–dimensional function f (x, y) = (x 2 − 4) 2 + y 2 has two global minima and
one stationary point (point at which ∇ f (x, y) = 0), which is neither a local maximum nor a
local minimum.
ii. Show that the 2–dimensional function f (x, y) = (y − x 2 ) 2 − x 2 has only one stationary point,
which is neither a local maximum nor a local minimum.
iii. Find all local minima of the 2-dimensional function f(x, y) = (1/2) x^2 + x cos(y).
Problem 4.3 (See [Hes75], [Ber95, p.14]) Let f : Rn → R be a differentiable function. Suppose
that a point x_* is a local minimum of f along every line that passes through x_*, that is, for every p ∈ R^n the function
φ(t) = f(x_* + t p) has a local minimum at t = 0.
i. Show that ∇ f (x ∗ ) = 0.
ii. Show by example that x ∗ need not be a local minimizer of f .
(Hint: Consider f (x, y) = (y − αx 2 )(y − βx 2 ) with 0 < α < β and (x ∗, y∗ ) = (0, 0). For
α < γ < β, f(x, γx^2) < 0 if x ≠ 0.
What are the eigenvalues of ∇2 f at (x ∗, y∗ )?)
Problem 4.4 Let H ∈ Rn×n be symmetric positive semi-definite, c ∈ Rn , d ∈ R, and consider the
function
f(x) = (1/2) x^T H x + c^T x + d.
Show that f is convex using the Definition 4.4.1 of a convex function. (Do not use Theo-
rem 4.4.3.)
Show that f is bounded from below if and only if c ∈ R (H). (Hint: Use that H can be
diagonalized by an orthogonal matrix and express R (H) in terms of the eigenvectors of H.)
where α ∈ (0, 1) and 1 < p < 2. Show that {z k } converges r–quadratically, but that the q–
convergence order is less than or equal to p.
Problem 4.7 Let A ∈ Rn×n be nonsingular and let k · k, ||| · ||| be a vector and a matrix norm such
that k Mvk ≤ |||M ||| kvk and |||M N ||| ≤ |||M ||| |||N ||| for all M, N ∈ Rn×n , v ∈ Rn .
Schulz’s method for computing the inverse of A generates a sequence of matrices {X k } via the
iteration
X k+1 = 2X k − X k AX k .
ii. Show that {X k } converges q-quadratically to A−1 for any X0 with |||I − AX0 ||| < 1.
iii. Show that {X_k} converges q-quadratically for any X_0 = α A^T with α ∈ (0, 2/λ_max), where
λ max is the largest eigenvalue of AAT .
(Hint: First use ||| · ||| = k · k2 and ii., then equivalence of norms on Rn×n .)
[Pot89] F. A. Potra. On Q-order and R-order of convergence. J. Optim. Theory Appl., 63(3):415–
431, 1989.
5.1. Introduction
Let f : Rn → R be twice differentiable. We want to compute a (local) minimizer x ∗ of f . Let x k
be a guess for x ∗ . Then min x f (x) is equivalent to mins f (x k + s). Of course the second problem
is as difficult as the first one. Therefore we replace the nonlinear function f by its quadratic Taylor
approximation,
f(x_k + s) ≈ m_k(x_k + s) = f(x_k) + ∇f(x_k)^T s + (1/2) s^T ∇^2 f(x_k) s, and minimize the quadratic model, i.e., solve min_s m_k(x_k + s). (5.1)
The quadratic problem (5.1) has a unique solution s k if and only if ∇2 f (x k ) is positive definite. In
this case the unique solution is given by
s_k = −(∇^2 f(x_k))^{-1} ∇f(x_k). (5.2)
x k+1 = x k + s k
as our new approximation of the (local) minimizer x ∗ of f . This is Newton’s method. In the
following section we will prove the well-posedness of the iteration (i.e., we will show that the
∇2 f (x k )’s are positive definite) and the convergence of the sequence of iterates {x k } generated by
Newton’s method, provided that the initial guess x 0 is sufficiently close to a point x ∗ at which the
second order sufficient optimality conditions are satisfied.
We note that (5.2) makes sense if the Hessian ∇2 f (x k ) is invertible. However, we want to
minimize f(x_k + s) ≈ m_k(x_k + s) = f(x_k) + ∇f(x_k)^T s + (1/2) s^T ∇^2 f(x_k) s and therefore need the
Hessian not merely to be invertible, but positive definite.
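The iteration just described can be sketched in a few lines. The test function (Rosenbrock), the starting point, and the stopping tolerance are illustrative choices, not part of the text, and no safeguard for an indefinite Hessian is included, so this is only the local method.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, maxit=50):
    """Local Newton method: solve Hess(x_k) s_k = -grad(x_k) and set x_{k+1} = x_k + s_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        s = np.linalg.solve(hess(x), -g)   # assumes the Hessian is positive definite near x_*
        x = x + s
    return x

# Illustrative example: Rosenbrock function, minimizer x_* = (1, 1)
grad = lambda x: np.array([-400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0]), 200*(x[1]-x[0]**2)])
hess = lambda x: np.array([[-400*x[1] + 1200*x[0]**2 + 2, -400*x[0]],
                           [-400*x[0],                     200.0]])
print(newton(grad, hess, [1.2, 1.2]))
```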
The following notation will be useful. Throughout this section k · k is an arbitrary vector norm
on R^n. We continue to use B_r(x̄) = {x ∈ R^n : ||x − x̄|| < r} to denote the open ball around x̄ with radius r. Moreover, we define the set of Lipschitz continuous functions on D ⊂ R^n with Lipschitz constant L, Lip_L(D) = {F : ||F(x) − F(y)|| ≤ L ||x − y|| for all x, y ∈ D}.
By λ_min(x) and λ_max(x) we denote the smallest and largest eigenvalue of the Hessian of f at x, i.e., λ_min(x) = λ_min(∇^2 f(x)) and λ_max(x) = λ_max(∇^2 f(x)).
Lemma 5.2.1 Let f : Rn → R be twice continuously differentiable in an open set D ⊂ Rn . For all
x, y ∈ D such that {y + t(x − y) : t ∈ [0, 1]} ⊂ D,
∇f(x) − ∇f(y) = ∫_0^1 ∇^2 f(y + t(x − y))(x − y) dt.
Proof: Apply the fundamental theorem of calculus to the functions φ_i(t) = (∂/∂x_i) f(y + t(x − y)), i = 1, . . . , n, on [0, 1].
Lemma 5.2.2 (Banach Lemma) If A ∈ R^{n×n} is an invertible matrix and if B ∈ R^{n×n} is such that ||A^{-1}(A − B)|| < 1, then B is invertible and
||B^{-1}|| ≤ ||A^{-1}|| / (1 − ||A^{-1}(A − B)||). (5.4)
Lemma 5.2.3 Let D ⊂ Rn be an open set and let f : D → R be twice differentiable on D with
∇2 f ∈ Lip L (D). If the second order sufficient optimality conditions are satisfied at the point
x_* ∈ D, then there exists ε > 0 such that B_ε(x_*) ⊂ D and for all x ∈ B_ε(x_*),
Remark 5.2.4 The previous lemma shows that if the second order sufficient optimality conditions
are satisfied at x ∗ , then the Hessian ∇2 f (x) is also positive definite in a neighborhood of x ∗ . In
particular, the quadratic problem
min_s ∇f(x)^T s + (1/2) s^T ∇^2 f(x) s
that determines the Newton step has a unique solution if x is sufficiently close to x ∗ .
= ∇^2 f(x_0)^{-1} ( ∇^2 f(x_0)(x_0 − x_*) + ∇f(x_*) − ∇f(x_0) )    (note that ∇f(x_*) = 0)
= ∇^2 f(x_0)^{-1} ∫_0^1 ( ∇^2 f(x_0) − ∇^2 f(x_* + t(x_0 − x_*)) )(x_0 − x_*) dt,
where we have used Lemma 5.2.1 to obtain the last equality. Using (5.6) and the Lipschitz continuity
of ∇2 f we obtain
||x_1 − x_*|| ≤ 2L ||∇^2 f(x_*)^{-1}|| ||x_0 − x_*||^2 / 2 = L ||∇^2 f(x_*)^{-1}|| ||x_0 − x_*||^2 < σ ||x_0 − x_*|| < ε.
This proves (5.10) for k = 0. The induction step can be proven analogously and is omitted.
ii. Since σ < 1 and
k x k+1 − x ∗ k < σk x k − x ∗ k < . . . < σ k+1 k x 0 − x ∗ k,
we find that lim k→∞ x k = x ∗ . The q–quadratic convergence rate follows from (5.10) with
c = Lk∇2 f (x ∗ ) −1 k.
Lemma 5.2.6 Let D ⊂ R^n be an open set and let f : D → R be twice continuously differentiable on D with ∇^2 f ∈ Lip_L(D). Moreover, let x_* ∈ D be a point at which the second order sufficient optimality conditions are satisfied. If ∇^2 f(x_k) + ∆(x_k) is invertible, then
||x_{k+1} − x_*|| ≤ (L ||(∇^2 f(x_k) + ∆(x_k))^{-1}|| / 2) ||x_k − x_*||^2 + ||(∇^2 f(x_k) + ∆(x_k))^{-1} ∆(x_k)|| ||x_k − x_*|| + ||(∇^2 f(x_k) + ∆(x_k))^{-1} δ(x_k)||.
Proof: The definition (5.11) of the perturbed Newton method, ∇f(x_*) = 0 and Lemma 5.2.1 imply that
x_{k+1} − x_* = x_k − x_* − (∇^2 f(x_k) + ∆(x_k))^{-1} (∇f(x_k) + δ(x_k))
= (∇^2 f(x_k) + ∆(x_k))^{-1} ( ∆(x_k)(x_k − x_*) − δ(x_k) )
  + (∇^2 f(x_k) + ∆(x_k))^{-1} ∫_0^1 ( ∇^2 f(x_k) − ∇^2 f(x_* + t(x_k − x_*)) )(x_k − x_*) dt.
The previous lemma provides the basic estimate for the convergence analysis of the iteration
(5.11). One convergence result is the following.
Theorem 5.2.7 Let D ⊂ Rn be an open set and let f : D → R be twice differentiable on D with
∇2 f ∈ Lip L (D). Furthermore, let x ∗ ∈ D be a point at which the second order sufficient optimality
conditions are satisfied. If the perturbed Hessians ∇2 f (x) + ∆(x) are invertible for all x ∈ D and if
there exist η ∈ [0, 1), α > 1, c > 0, and M ≥ 0 such that the gradient perturbations δ(x) and the Hessian
perturbations ∆(x) satisfy
k(∇2 f (x) + ∆(x)) −1 k ≤ M,
k(∇2 f (x) + ∆(x)) −1 ∆(x)k ≤ η
and
k(∇2 f (x) + ∆(x)) −1 δ(x)k ≤ ck x − x ∗ k α
for all x ∈ D, then for all σ ∈ (η, 1) there exists an ε > 0 such that Newton's method with inexact derivative information (5.11) with starting point x_0 ∈ B_ε(x_*) generates iterates x_k which converge to x_* and which obey
||x_{k+1} − x_*|| ≤ (M L/2) ||x_k − x_*||^2 + c ||x_k − x_*||^α + η ||x_k − x_*|| ≤ σ ||x_k − x_*||
for all k.
If n is large or if only Hessian-times-vector products ∇^2 f(x_k)v are available for any given vector v, but the computation of the entire Hessian is expensive, then we can use the Conjugate Gradient Algorithm 3.7.6 or the Preconditioned Conjugate Gradient Algorithm 3.9.1 to compute an approximate solution s_k. We focus on the Conjugate Gradient Algorithm 3.7.6. We stop the Conjugate Gradient Algorithm 3.7.6 if the residual ∇^2 f(x_k)s_k + ∇f(x_k) is sufficiently small. More precisely, we stop the Conjugate Gradient Algorithm 3.7.6 if
||∇^2 f(x_k) s_k + ∇f(x_k)|| ≤ η_k ||∇f(x_k)||, (5.13)
where η_k ≥ 0. If s_k is computed such that (5.13) holds, the resulting method is known as the inexact Newton method. The parameter η_k is called the forcing parameter.
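A minimal sketch of such an inexact (truncated) Newton iteration is given below. The inner conjugate gradient loop is stopped by the relative residual test with forcing parameter η_k, which is the standard reading of (5.13); the fixed η, the tolerance, and the absence of a negative-curvature safeguard are simplifications made for the illustration.

```python
import numpy as np

def newton_cg(grad, hess, x0, eta=0.1, tol=1e-8, maxit=50):
    """Inexact Newton: run CG on Hess(x_k) s = -grad(x_k) only until
    ||Hess(x_k) s + grad(x_k)|| <= eta * ||grad(x_k)|| (forcing-term test)."""
    x = np.asarray(x0, dtype=float)
    for k in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        H = hess(x)
        s = np.zeros_like(x)
        r = -g - H @ s            # residual of the Newton system at s = 0
        p = r.copy()
        while np.linalg.norm(r) > eta * np.linalg.norm(g):
            Hp = H @ p
            alpha = (r @ r) / (p @ Hp)     # assumes p^T H p > 0 (H positive definite)
            s += alpha * p
            r_new = r - alpha * Hp
            p = r_new + ((r_new @ r_new) / (r @ r)) * p
            r = r_new
        x = x + s
    return x
```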
The following theorem analyzes the convergence of the inexact Newton method.
Theorem 5.3.1 Let D ⊂ Rn be an open set and let f : D → R be twice differentiable on D with
∇2 f ∈ Lip L (D). Furthermore, let x ∗ ∈ D be a point at which the second order sufficient optimality
conditions are satisfied. Define κ ∗ = k∇2 f (x ∗ ) −1 k k∇2 f (x ∗ )k.
If the sequence {η k } of forcing parameters satisfies 0 < η k ≤ η with η such that 4κ ∗ η < 1, then
for all σ ∈ (4κ_* η, 1) there exists an ε > 0 such that the inexact Newton method (5.13) with starting point x_0 ∈ B_ε(x_*) generates iterates x_k which converge to x_* and which obey
k x k+1 − x ∗ k ≤ Lk∇2 f (x ∗ ) −1 k k x k − x ∗ k 2 + 4η k κ ∗ k x k − x ∗ k ≤ σk x k − x ∗ k
for all k.
Proof: Let ε_1 > 0 be the parameter given by Lemma 5.2.3. Furthermore, let σ ∈ (4κ_* η, 1) be arbitrary and let
ε = min{ ε_1, (σ − 4κ_* η) / (L ||∇^2 f(x_*)^{-1}||) }.
We set r_k = −∇f(x_k) − ∇^2 f(x_k)s_k. If ||x_k − x_*|| < ε, then
x_{k+1} − x_* = x_k − x_* + s_k
= x_k − x_* − ∇^2 f(x_k)^{-1} ∇f(x_k) − ∇^2 f(x_k)^{-1} r_k
= ∇^2 f(x_k)^{-1} ∫_0^1 ( ∇^2 f(x_k) − ∇^2 f(x_* + t(x_k − x_*)) )(x_k − x_*) dt − ∇^2 f(x_k)^{-1} r_k.
where ei is the ith unit vector. The step–size δi usually differs from one component to the other.
Dennis and Schnabel [DS96] recommend
δ_i = √ε max{|x_i|, typx_i} sign(x_i), (5.16)
where ε is an approximation of the relative error in the function evaluation and where typx_i is a typical size provided by the user; it is used to prevent difficulties when x_i is close to zero.
We will study later where this choice comes from.
To compute Hessian approximations, we proceed as follows. If the gradient of f is available,
we compute the ith column Hi of the matrix H ∈ Rn×n as
H_i = ( ∇f(x + δ_i e_i) − ∇f(x) ) / δ_i,
where δ_i is chosen as in (5.16), and then we approximate
∇^2 f(x) ≈ (1/2)(H + H^T).
If only function values of f are available, we approximate
∂^2 f(x)/(∂x_i ∂x_j) ≈ ( [f(x + δ_i e_i + δ_j e_j) − f(x + δ_i e_i)] − [f(x + δ_j e_j) − f(x)] ) / (δ_i δ_j), (5.17)
where
δ_j = ε^{1/3} max{|x_j|, typx_j} sign(x_j). (5.18)
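A sketch of these finite difference approximations is given below. The default typical sizes typx_i = 1 and the convention sign(0) = 1 are assumptions made for the illustration.

```python
import numpy as np

def fd_gradient(f, x, typx=None, eps=np.finfo(float).eps):
    """One-sided finite difference gradient with the step size rule (5.16)."""
    x = np.asarray(x, dtype=float)
    typx = np.ones_like(x) if typx is None else np.asarray(typx, dtype=float)
    g, fx = np.empty_like(x), f(x)
    for i in range(x.size):
        sgn = 1.0 if x[i] >= 0 else -1.0
        delta = np.sqrt(eps) * max(abs(x[i]), typx[i]) * sgn
        e = np.zeros_like(x); e[i] = delta
        g[i] = (f(x + e) - fx) / delta
    return g

def fd_hessian_from_gradient(grad, x, typx=None, eps=np.finfo(float).eps):
    """Forward differences of the gradient, column by column, symmetrized as in the text."""
    x = np.asarray(x, dtype=float)
    typx = np.ones_like(x) if typx is None else np.asarray(typx, dtype=float)
    n = x.size
    H, g0 = np.empty((n, n)), grad(x)
    for i in range(n):
        sgn = 1.0 if x[i] >= 0 else -1.0
        delta = np.sqrt(eps) * max(abs(x[i]), typx[i]) * sgn
        e = np.zeros(n); e[i] = delta
        H[:, i] = (grad(x + e) - g0) / delta
    return 0.5 * (H + H.T)
```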
To see why the choices (5.16) and (5.18) for the finite difference parameters make sense we consider finite difference derivative approximations for a scalar function g : R → R. A one-sided finite difference approximation of the derivative is given by
g'(x) ≈ ( g(x + δ) − g(x) ) / δ (5.19)
for a sufficiently small δ. Using the Taylor expansion
Now suppose that instead of the exact function g we can only compute an approximation g_ε. In this case the finite difference approximation (5.28) of the derivative of g is
g'(x) ≈ ( g_ε(x + δ) − g_ε(x) ) / δ.
Suppose that
|g(x ± δ) − g_ε(x ± δ)| ≤ ε,   |g_ε(x) − g(x)| ≤ ε.
From
g'(x) − ( g_ε(x + δ) − g_ε(x) ) / δ = g'(x) − ( g(x + δ) − g(x) ) / δ + ( g(x + δ) − g_ε(x + δ) ) / δ + ( g_ε(x) − g(x) ) / δ
and the estimates (5.20) we obtain
| g'(x) − ( g_ε(x + δ) − g_ε(x) ) / δ |
≤ M_2 |δ| / 2 + |g(x + δ) − g_ε(x + δ)| / |δ| + |g_ε(x) − g(x)| / |δ|
≤ M_2 |δ| / 2 + 2ε / |δ|. (5.21)
The term M_2|δ|/2 in the error bound (5.21) results from the use of finite differences with exact function values and the term 2ε/|δ| results from the use of inexact function values. The error bound
(5.21) and its two components are sketched in Figure 5.1.
provided x is a floating point number. Here ε_mach is the unit roundoff or machine precision; it is given by ε_mach = 2^{-24} ≈ 6·10^{-8} if single precision arithmetic is used and ε_mach = 2^{-53} ≈ 1.2·10^{-16} if double precision arithmetic is used. In particular, if one-sided finite differences are used to approximate g'(x), then our previous analysis recommends a step size of
δ_* = sqrt( |g(x)| / M_2 ) sqrt(ε_mach),
k x k+1 − x ∗ k ≤ c k x k − x ∗ k 2 ⇐⇒ k∇ f (x k+1 )k ≤ c̃ k∇ f (x k )k 2 .
Thus, we can use the gradients to observe q–superlinear, q–quadratic, or faster convergence rates
of the iterates. Note that this is not possible if the iterates x_k converge only q-linearly. The inequality ||x_{k+1} − x_*|| ≤ c ||x_k − x_*||, c ∈ (0, 1), generally does not imply ||∇f(x_{k+1})|| ≤ c̃ ||∇f(x_k)|| with c̃ ∈ (0, 1). Hence, the gradients may not converge q-linearly to zero even if the iterates converge q-linearly.
We can use the gradient norm as a truncation criterion. The estimate (5.7) yields
||∇f(x)|| < tol_g  ⟹  ||x − x_*|| ≤ 2 ||∇^2 f(x_*)^{-1}|| tol_g,
provided x is sufficiently close to x_*.
If the iterates converge q-superlinearly or faster, we can use the norm of the step as a truncation criterion. This stopping criterion is based on the following result.
5.6. Problems
Problem 5.1
i. Suppose we start Newton’s method for the minimization of f (x) = x 4 − 1 at x 0 > 0. What
are the iterates generated by Newton’s method? Prove that the iterates satisfy x k > 0 and that
they converge to the minimizer x ∗ of f .
iii. What is the rate of convergence of Newton iterates x k in part i? Does this contradict the
convergence result in Theorem 5.2.5? Explain!
lim x k = x ∗,
k→∞
Problem 5.4 Let f : Rn → R be twice differentiable and assume there exists L > 0 such that the
Hessian satisfies k∇2 f (x) − ∇2 f (y)k ≤ Lk x − yk for all x, y ∈ Rn . Furthermore, let x ∗ ∈ Rn be a
point at which the second order sufficient optimality conditions are satisfied. Given a symmetric
positive definite H ∈ Rn×n consider the simplified Newton iteration
H s k = −∇ f (x k ),
x k+1 = x k + s k .
Prove that if ||I − H^{-1} ∇^2 f(x_*)|| < 1, then for every σ ∈ (||I − H^{-1} ∇^2 f(x_*)||, 1) there exists ε > 0 such that the iterates generated by the simplified Newton method with starting value x_0, ||x_0 − x_*|| < ε, converge to x_* and obey
k x k+1 − x ∗ k ≤ σ k x k − x ∗ k
for all k.
What can you say about the convergence of the simplified Newton method if H = ∇2 f (x ∗ )?
Problem 5.5 Let f : Rn → R be twice differentiable and let the Hessian satisfy
k∇2 f (x) − ∇2 f (y)k ≤ Lk x − yk ∀x, y ∈ Rn .
Consider the Newton-type iteration
Hk s k = −∇ f (x k ), (5.25a)
x k+1 = x k + s k , (5.25b)
where Hk is a symmetric positive definite matrix, for the computation of a local minimizer x ∗ of f .
i. Show that
||x_{k+1} − x_*|| ≤ (L ||H_k^{-1}|| / 2) ||x_k − x_*||^2 + ||I − H_k^{-1} ∇^2 f(x_k)|| ||x_k − x_*||.
ii. State and prove a result for the local q-linear convergence of (5.25).
iii. Let ∇2 f (x k ) be symmetric positive definite. Suppose we want to compute an approximate
solution s k of
∇2 f (x k )s = −∇ f (x k ) (5.26)
by applying the Jacobi iterative method.
– What is the Jacobi iterative method applied to (5.26)? (Denote the ith iterate of the Jacobi method by s_k^{(i)}, where k refers to the iteration number in the Newton-type iteration and i is the iteration number of the Jacobi method.)
– Show that one step of the Jacobi iterative method started at s_k^{(0)} = 0 leads to s_k = s_k^{(1)} given by (5.25a). What is H_k?
Suppose that
Σ_{i ≠ j} | ∂^2 f(x_k)/(∂x_i ∂x_j) | ≤ η ∂^2 f(x_k)/∂x_j^2,   j = 1, . . . , n, k ∈ N,
with 0 ≤ η < 1.
Use your results in parts ii. and iii. to show that the Newton-type iteration (5.25), where s k
is computed by applying one iteration of the Jacobi method with zero initial value to (5.26),
converges locally q-linearly.
Problem 5.6 If we use the Preconditioned Conjugate Gradient Algorithm 3.9.1 with symmetric
positive definite preconditioner M to compute an approximate solution s k of (5.12), then the
Preconditioned Conjugate Gradient Iteration stops if
( ∇^2 f(x_k)s_k + ∇f(x_k) )^T M^{-1} ( ∇^2 f(x_k)s_k + ∇f(x_k) ) ≤ η_k ∇f(x_k)^T M^{-1} ∇f(x_k). (5.27)
Problem 5.7 Let g : R → R be three times continuously differentiable and let g_ε(x), g_ε(x ± δ) be approximations of g(x), g(x ± δ), respectively, with |g(x) − g_ε(x)| ≤ ε and |g(x ± δ) − g_ε(x ± δ)| ≤ ε. A centered finite difference approximation of g''(x) is obtained by writing
g''(x) ≈ ( g'(x + δ/2) − g'(x − δ/2) ) / δ
and inserting the approximations
g'(x + δ/2) ≈ ( g(x + δ) − g(x) ) / δ,   g'(x − δ/2) ≈ ( g(x) − g(x − δ) ) / δ
into the previous expression. This leads to
g''(x) ≈ ( g(x + δ) − 2g(x) + g(x − δ) ) / δ^2. (5.29)
Compute the optimal step size for the approximation of g''(x) using centered finite differences with inexact function evaluations and determine the error
| g''(x) − ( g_ε(x + δ_*) − 2g_ε(x) + g_ε(x − δ_*) ) / δ_*^2 |.
iii. Let g(x) = exp(x). Compute approximations of g0 (x) using one-sided finite differences and
centered finite differences using x = 1 and δ = 10−i , i = 1, . . . , 20, and compute centered
finite difference approximations of g00 (x). Plot the error between the derivative and its
approximation in a log-log-scale. Explain your results.
f (x 1, x 2 ) = (x 1 − 2) 4 + (x 1 − 2) 2 x 22 + (x 2 + 1) 2 .
i. Solve min f (x) using Newton’s method with starting value x = (1, 1).
ii. Repeat your computations in i. using finite difference approximations for the gradient and the
Hessian.
In both cases plot the error k x k − x ∗ k2 and the norm of the gradient k∇ f (x k )k2 . Carefully explain
and justify what stopping criteria you use and carefully document your choice of the finite difference
approximations. Experiment with different finite difference step-sizes.
Explain your results.
Problem 5.9 Given a function f : Rn → R and a positive scalar c > 0, consider the two problems
ii. Assume that f : Rn → R is twice continuously differentiable. Fix y. Apply one step of
Newton’s method at x = x k to
min_{x ∈ R^n} f(x) + (c/2) ||x − y||_2^2.
What is x k+1 ?
iii. Assume that f : Rn → R is twice continuously differentiable and has a Lipschitz continuous
second derivative. Let L > 0 be the Lipschitz constant of ∇2 f , let λ k ≥ 0 be the smallest
eigenvalue of ∇2 f (x k ), and let x ∗ be a local minimum of f . Show that x k+1 from part (ii)
satisfies
||x_{k+1} − x_*||_2 ≤ ( L / (2λ_k + 2c) ) ||x_k − x_*||_2^2 + ( c / (λ_k + c) ) ||y − x_*||_2.
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/
10.1137/1.9781611971200, doi:10.1137/1.9781611971200.
[GL83] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3 of Johns Hopkins
Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD,
1983.
[Kel95] C. T. Kelley. Iterative methods for linear and nonlinear equations, volume 16 of Fron-
tiers in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 1995. URL: https://doi.org/10.1137/1.9781611970944, doi:
10.1137/1.9781611970944.
6.1. Introduction
Newton’s method as well as many other methods generate iterates x k such that, under appropriate
conditions on the function, the sequence of iterates converges to a local minimizer x ∗ provided that
the initial iterate x_0 is sufficiently close to x_*. What 'sufficiently close' means depends on the method and on properties of the function and its derivatives, such as the Lipschitz constant of the second derivative. For practical problems it is impossible to say a priori whether an initial iterate x_0 is sufficiently close to the solution. Therefore, we need to modify the methods such that convergence to a solution is guaranteed, under certain assumptions on the problem, from any starting point.
In this chapter we investigate two globalization techniques: line search methods and trust-region
methods. By globalization of the iteration we mean a technique that ensures convergence from any
starting point. It does not mean convergence to a global minimum.
= R(x k )T R0 (x k )s k + 12 k R0 (x k )s k k22
> R(x k )T R0 (x k )s k .
min_s ∇f(x_k)^T s + (1/2) s^T ∇^2 f(x_k) s (6.2)
can be computed using the Preconditioned Conjugate Gradient Algorithm 3.9.1. We use i
as the iteration counter in the conjugate gradient method and s k,i to denote the ith iterate
generated by the Preconditioned Conjugate Gradient Algorithm applied to (6.2). Since
for x k away from a point x ∗ at which the second order sufficient optimality conditions
are satisfied, the Hessian ∇2 f (x k ) may not be positive definite, we need to modify the
Preconditioned Conjugate Gradient Algorithm 3.9.1. Specifically, we need to check whether
piT ∇2 f (x k )pi ≤ 0 for the ith direction computed in the Preconditioned Conjugate Gradient
Algorithm. If piT ∇2 f (x k )pi ≤ 0, the Hessian is not positive definite and we stop the
Preconditioned Conjugate Gradient Algorithm. Of course, if the Hessian ∇2 f (x k ) is not
positive definite, the quadratic subproblem (6.2) does not have a solution. However, the
Preconditioned Conjugate Gradient iterate s k,i−1 computed up to that iterate is a descent
direction, as we will show next.
First, we state the Preconditioned Conjugate Gradient Algorithm for the approximate solution
of (6.2).
If the Preconditioned Conjugate Gradient Algorithm 6.2.2 stops in iteration i > 0 with
s k = s k,i , then
Ki (Mk−1 ∇2 f (x k ), Mk−1 ∇ f (x k )) = span{p0, . . . , pi−1 }
and s k = s k,i solves
min ∇ f (x k )T s + 12 sT ∇2 f (x k )s. (6.3)
s ∈ span{p0, . . . , pi−1 }
(See Section 3.9.1 and Problem 3.14.) Furthermore, Step 2b in Algorithm 6.2.2 implies
pT ∇2 f (x k )p > 0 ∀ p ∈ span{p0, . . . , pi−1 }. (6.4)
Therefore,
∇f(x_k)^T s_{k,i} + (1/2) s_{k,i}^T ∇^2 f(x_k) s_{k,i} < 0,   with s_{k,i}^T ∇^2 f(x_k) s_{k,i} > 0 by (6.4),
which implies ∇ f (x k )T s k < 0. Hence, the Preconditioned Conjugate Gradient Algo-
rithm 6.2.2 generates descent directions s k = s k,i if it stops in iteration i > 0.
If the Preconditioned Conjugate Gradient Algorithm 6.2.2 stops in the initial iteration with r_0^T z_0 = r_0^T M_k^{-1} r_0 < ε, one uses the direction s_k = −∇f(x_k).
Example 6.2.3 Consider f (x) = x 2 and x 0 = 2. Furthermore, we select the steps s k = (−1) k+1
and the step lengths α k = 2 + 3(2−(k+1) ). The iterates are x k = (−1) k (1 + 2−k ). We have
f (x k+1 ) < f (x k ) and ∇ f (x k )T s k < 0 for all k. However, lim k→∞ f (x k ) = 1. The problem here is
that the decrease in the function is too small. We have
f(x_{k+1}) = f(x_k) − 2^{-k} − (3/4)·2^{-2k}.
satisfies the simple decrease f (x k+1 ) < f (x k ), we require that the sufficient decrease condition
f (x k + α k s k ) ≤ f (x k ) + c1 α k ∇ f (x k )T s k , (6.5)
where c1 ∈ (0, 1), is satisfied. This parameter is chosen to be small; c1 = 10−4 is a typical value.
Consider the function φ(α) = f (x k + αs k ). Clearly, φ0 (α) = ∇ f (x k + αs k )T s k and φ0 (0) =
∇f(x_k)^T s_k. The sufficient decrease condition (6.5) requires that the actual decrease φ(0) − φ(α_k) = f(x_k) − f(x_k + α_k s_k) is at least a fraction c_1 of the decrease φ(0) − (φ(0) + φ'(0)α_k) = −α_k ∇f(x_k)^T s_k > 0 predicted by the first order Taylor approximation of φ around 0.
Moreover, since ∇ f (x k )T s k < 0, we have φ0 (0) < 0. By continuity of φ0 there exists ᾱ > 0 such
that |φ0 (α) − φ0 (0)| < −(1 − c1 )φ0 (0) for all α ∈ (0, ᾱ). Consequently,
φ(α) − φ(0) − c_1 φ'(0)α = ( ∫_0^1 φ'(tα) dt ) α − c_1 φ'(0) α
= ( ∫_0^1 ( φ'(tα) − φ'(0) ) dt ) α + (1 − c_1) φ'(0) α
< −(1 − c_1) φ'(0) α + (1 − c_1) φ'(0) α = 0
Example 6.2.5 Again we consider f (x) = x 2 and x 0 = 2. This time we select s k = −1 and step
size α k = 2−(k+1) . This gives the iterates x k = 1 + 2−k . The steps satisfy ∇ f (x k )T s k < 0 for all k
and the sufficient decrease condition
f(x_{k+1}) = f(x_k) − 2^{-k} − (3/4)·2^{-2k} < f(x_k) − c_1 (2^{-k} + 2^{-2k}) = f(x_k) + c_1 α_k ∇f(x_k)^T s_k
for c1 ∈ (0, 3/4). However, lim k→∞ f (x k ) = 1. The problem here is that the step sizes α k are too
small.
In addition to the sufficient decrease condition (6.5) we need a condition that guarantees that the
step sizes α k are not unnecessarily small. What this means will be made precise in Lemma 6.2.9.
There are several conditions that can be added to the sufficient decrease condition (6.5) to ensure
that the step sizes α k are not unnecessarily small. We list some of the commonly used step size
conditions.
The step size α k satisfies the Wolfe conditions if
f (x k + α k s k ) ≤ f (x k ) + c1 α k ∇ f (x k )T s k , (6.6a)
∇ f (x k + α k s k )T s k ≥ c2 ∇ f (x k )T s k , (6.6b)
f (x k + α k s k ) ≤ f (x k ) + c1 α k ∇ f (x k )T s k , (6.7a)
|∇ f (x k + α k s k )T s k | ≤ c2 |∇ f (x k )T s k |, (6.7b)
Since (6.7b) is equivalent to
−c_2 ∇f(x_k)^T s_k ≥ ∇f(x_k + α_k s_k)^T s_k ≥ c_2 ∇f(x_k)^T s_k,
the strong Wolfe condition is in fact stronger than the Wolfe condition.
f(x_k + ᾱ s_k) − f(x_k) = ᾱ ∇f(x_k + α̂ s_k)^T s_k. (6.9)
∇f(x_k + α̂ s_k)^T s_k = c_1 ∇f(x_k)^T s_k > c_2 ∇f(x_k)^T s_k, (6.10)
since c_1 < c_2 and ∇f(x_k)^T s_k < 0. Therefore, α̂ satisfies the Wolfe conditions (6.6) and both inequalities in (6.6) hold strictly. Hence there exists an interval around α̂ such that the Wolfe conditions (6.6) are satisfied for all step sizes in this interval.
Since the term on the left hand side in (6.10) is negative, the strong Wolfe conditions (6.7) hold in the same interval.
Let 0 < c_1 < 1/2. The step size α_k satisfies the Goldstein conditions if
(1 − c_1) α_k ∇f(x_k)^T s_k ≤ f(x_k + α_k s_k) − f(x_k) ≤ c_1 α_k ∇f(x_k)^T s_k. (6.11)
Step lengths satisfying the Goldstein conditions are sketched in Figure 6.4. Notice that the minimizer of φ does not satisfy the Goldstein conditions.
The condition α k(i+1) ≤ β2 α k(i) ensures that the trial step size is at least reduced by a factor β2 .
The condition α k(i+1) ≥ β1 α k(i) ensures that the trial step size is not reduced too fast.
If β1 < β2 , then one has some flexibility to introduce information about f to increase the
performance of the Backtracking Algorithm 6.2.8. We will return to this issue in Section 6.2.4.
The simplest form of the Backtracking Algorithm 6.2.8 is obtained when β_1 = β_2 = β. In this case, α_k = α_k^{(0)} β^m, where m ∈ {0, 1, 2, . . .} is the smallest integer such that α_k = α_k^{(0)} β^m satisfies the sufficient decrease condition (6.5). This is known as the Armijo rule. The choice β = 1/2 is common.
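A minimal sketch of the Armijo rule (β_1 = β_2 = β) is given below; the cap on the number of backtracking steps is an added safeguard for the illustration, not part of the rule itself.

```python
def armijo(f, x, fx, g, s, alpha0=1.0, c1=1e-4, beta=0.5, max_backtracks=40):
    """Armijo rule: largest alpha = alpha0 * beta^m satisfying the sufficient decrease
    condition f(x + alpha*s) <= f(x) + c1 * alpha * g^T s."""
    slope = g @ s                 # grad f(x)^T s; must be negative for a descent direction
    alpha = alpha0
    for _ in range(max_backtracks):
        if f(x + alpha * s) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= beta             # reduce the trial step size by the factor beta
    return alpha                  # safeguard: give up after max_backtracks reductions
```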
(∇ f (x k + α k s k ) − ∇ f (x k ))T s k ≤ k∇ f (x k + α k s k ) − ∇ f (x k )k2 ks k k2
≤ α k Lks k k22 .
f (x k + α k s k ) − f (x k ) − α k ∇ f (x k )T s k ≥ −c1 α k ∇ f (x k )T s k .
Moreover,
f(x_k + α_k s_k) − f(x_k) − α_k ∇f(x_k)^T s_k = α_k ∫_0^1 ( ∇f(x_k + t α_k s_k) − ∇f(x_k) )^T s_k dt ≤ (L/2) α_k^2 ||s_k||_2^2.
Combining both inequalities yields the estimate
α_k ≥ −2 c_1 ∇f(x_k)^T s_k / ( L ||s_k||_2^2 ).
iii. Let α k be determined through the Backtracking Algorithm 6.2.8. If α k = α k(i) , then α k(i−1)
did not satisfy the sufficient decrease condition (6.5). Thus,
α_k^{(i-1)} ≥ −2(1 − c_1) ∇f(x_k)^T s_k / ( L ||s_k||_2^2 ).
cos θ_k = ∇f(x_k)^T s_k / ( ||∇f(x_k)||_2 ||s_k||_2 ).
f (x k+1 ) − f (x k ) ≤ c1 α k ∇ f (x k )T s k .
The lower bound for the step size α_k gives the desired result
Σ_{k=0}^∞ ( (∇f(x_k)^T s_k)^2 / ( ||s_k||_2^2 ||∇f(x_k)||_2^2 ) ) ||∇f(x_k)||_2^2 < ∞.
If cos2 θ k is bounded away from zero, Theorem 6.2.10 guarantees that lim k→∞ ∇ f (x k ) = 0.
The following lemma shows that cos2 θ k is bounded away from zero, if the sequence of condition
numbers {cond2 (Bk )} is bounded.
If cos2 θ k is bounded away from zero, Theorem 6.2.10 guarantees that lim k→∞ ∇ f (x k ) = 0.
However, convergence of ∇ f (x k ) does not imply convergence of x k , in general. The following
corollary of Theorem 6.2.10 is a typical result that addresses convergence of the iterates {x k }.
cos2 θ k ≥ c > 0
Proof: Since cos2 θ k ≥ c > 0, Theorem 6.2.10 guarantees that lim k→∞ ∇ f (x k ) = 0. Let x ∗ ∈ D
be an accumulation point and let {x k j } be a subsequence such that lim j→∞ x k j = x ∗ . Then
0 = lim ∇ f (x k j ) = ∇ f (x ∗ ).
j→∞
Hence, the accumulation point x ∗ is a critical point. By the sufficient decrease condition,
and, since lim j→∞ x k j = x ∗ , lim j→∞ f (x k j ) = f (x ∗ ), which implies that the accumulation point x ∗
is a critical point, but not a maximum point.
The sufficient decrease condition implies that all iterates x k are in the compact set L. Hence,
{x k } has a convergent subsequence.
lim_{k→∞} ||∇f(x_k) + ∇^2 f(x_k) s_k||_2 / ||s_k||_2 = 0, (6.12)
then there is an index k_0 such that for all k ≥ k_0 the sufficient decrease condition (6.5) with c_1 ∈ (0, 1/2) is satisfied with α_k = 1.
ii. If {x_k} converges to a point x_* at which ∇^2 f(x_*) is positive definite, and if (6.12) holds, then there is an index k_0 such that for all k ≥ k_0 the Wolfe conditions (6.6) with c_1 ∈ (0, 1/2) are satisfied with α_k = 1.
iii. If {x_k} converges to a point x_* at which ∇^2 f(x_*) is positive definite, and if (6.12) holds, then there is an index k_0 such that for all k ≥ k_0 the Goldstein conditions (6.11) with 0 < c_1 < 1/2 are satisfied with α_k = 1.
s k = −(∇2 f (x k ) + µ k I) −1 ∇ f (x k ),
f(x_k − α_k ∇f(x_k)) = f(x_k) − α_k ||∇f(x_k)||_2^2 + (α_k^2/2) ∇f(x_k)^T ∇^2 f(x_k − α_k τ ∇f(x_k)) ∇f(x_k)
for some τ ∈ [0, 1]. Hence an α_k that satisfies the sufficient decrease condition (6.13) must satisfy
α_k ≤ 2(1 − c_1) ||∇f(x_k)||_2^2 / ( ∇f(x_k)^T ∇^2 f(x_k − α_k τ ∇f(x_k)) ∇f(x_k) ). (6.14)
In particular, if the smallest eigenvalue λ min (x ∗ ) of ∇2 f (x ∗ ) satisfies λ min (x ∗ ) > 2(1 − c1 ), the step
size α k in the steepest descent method will not be equal to one near the solution.
We compute
m'(α) = ( 2[φ(α_k^{(0)}) − φ'(0)α_k^{(0)} − φ(0)] / (α_k^{(0)})^2 ) α + φ'(0),   m''(α) = 2[φ(α_k^{(0)}) − φ'(0)α_k^{(0)} − φ(0)] / (α_k^{(0)})^2.
Since the sufficient decrease condition (6.5) is not satisfied for α_k^{(0)}, we have
φ(α_k^{(0)}) > φ(0) + c_1 α_k^{(0)} φ'(0) = φ(0) + α_k^{(0)} φ'(0) + (c_1 − 1) α_k^{(0)} φ'(0) > φ(0) + α_k^{(0)} φ'(0).
Hence, m''(α) > 0 for all α and the minimizer of m is
α_* = −(α_k^{(0)})^2 φ'(0) / ( 2[φ(α_k^{(0)}) − φ'(0)α_k^{(0)} − φ(0)] ).
We set
α_k^{(1)} = β_1 α_k^{(0)} if α_* < β_1 α_k^{(0)},   α_k^{(1)} = β_2 α_k^{(0)} if α_* > β_2 α_k^{(0)},   α_k^{(1)} = α_* otherwise.
Step 2. Suppose the sufficient decrease condition (6.5) is not satisfied for α_k^{(1)}. To find α_k^{(2)}, we
compute the cubic interpolant m(α) that satisfies
m(0) = φ(0) = f (x k ), m0 (0) = φ0 (0) = ∇ f (x k )T s k ,
m(α k(0) ) = φ(α k(0) ), m(α k(1) ) = φ(α k(1) ).
This interpolant is given by
m(α) = aα 3 + bα 2 + φ0 (0)α + φ(0).
where a, b are computed by solving the 2 × 2 system
is a minimizer of m. We set
α_k^{(2)} = β_1 α_k^{(1)} if α_* < β_1 α_k^{(1)},   α_k^{(2)} = β_2 α_k^{(1)} if α_* > β_2 α_k^{(1)},   α_k^{(2)} = α_* otherwise.
Step i (i ≥ 3). Suppose the sufficient decrease condition (6.5) is not satisfied for α k(i) . Then we
repeat the procedure in step 2 with α k(0) and α k(1) in (6.15) replaced by α k(i−1) and α k(i) , respectively.
m_k(x_k + s) = f(x_k) + ∇f(x_k)^T s + (1/2) s^T B_k s, (6.16)
where Bk is a symmetric matrix (in Newton’s method Bk = ∇2 f (x k )). Typically, the model m k of
f is only a good model for f near x k . Hence minimizing m k (x k + s) over all s ∈ Rn does not make
sense. Instead, one should minimize m k (x k + s) only over those s for which m k (x k + s) is expected
to be a sufficiently good approximation to f (x k + s).
min m k (x k + s)
(6.17)
s.t. ksk2 ≤ ∆ k ,
where the set {s : ksk2 ≤ ∆ k } is the trust-region and ∆ k > 0 is the trust-region radius. One can
admit more general models [ADLT98, CGT00], but we focus on quadratic models (6.16).
With the quadratic model (6.16) the trust-region subproblem (6.17) is given by
min f (x k ) + ∇ f (x k )T s + 21 sT Bk s
(6.18)
s.t. ksk2 ≤ ∆ k .
In (6.18) we minimize a continuous function m k over a compact set. Hence, the minimum exists.
A characterization of the solution will be given in Lemma 6.3.5 below. However, it is not necessary
to solve the trust-region subproblem (6.18) exactly. Instead the trust-region step s k has to give a
decrease in the model that is at least as good as the decrease obtained by minimizing in the direction
of the negative gradient. We will make this precise below (see (6.19)).
If the step s k is accepted, i.e. if x k+1 = x k + s k , then the iteration k is called successful. Note
that we increase the iteration count even if the iteration is not successful.
Sensible choices for the parameters η 1, η 2 , γ1, γ2 and γ3 in the Trust Region Algorithm 6.3.1
are η 1 = 0.01, η 2 = 0.9, γ1 = γ2 = 0.5, γ3 = 2 [CGT00, p. 117]. See also [DS83, Sec. 6.4].
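The following sketch implements a bare-bones version of such a trust-region iteration with the parameter values suggested above. It uses only the Cauchy point as the (admissible but crude) trust-region step and omits the parameter γ_1, so it is an illustration of the mechanism rather than Algorithm 6.3.1 itself.

```python
import numpy as np

def trust_region(f, grad, hess, x0, delta0=1.0,
                 eta1=0.01, eta2=0.9, gamma2=0.5, gamma3=2.0, tol=1e-8, maxit=200):
    """Basic trust-region loop: acceptance test based on rho_k = ared_k / pred_k
    and radius update with the parameters eta1, eta2, gamma2, gamma3."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for k in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        B = hess(x)
        s = cauchy_point(g, B, delta)          # any step with fraction of Cauchy decrease
        pred = -(g @ s + 0.5 * s @ B @ s)      # pred_k = m_k(x_k) - m_k(x_k + s_k) > 0
        ared = f(x) - f(x + s)                 # ared_k = f(x_k) - f(x_k + s_k)
        rho = ared / pred
        if rho >= eta1:                        # successful iteration: accept the step
            x = x + s
            if rho >= eta2:                    # very successful: enlarge the region
                delta = gamma3 * delta
        else:                                  # unsuccessful: shrink the region
            delta = gamma2 * delta
    return x

def cauchy_point(g, B, delta):
    """Minimizer of the quadratic model along -g inside the trust region."""
    gBg = g @ B @ g
    t = delta / np.linalg.norm(g)
    if gBg > 0:
        t = min(t, np.linalg.norm(g) ** 2 / gBg)
    return -t * g
```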
Proof: If ∇ f (x k ) = 0, then the right hand sides in (6.19) and in (6.20) are zero and the assertion
follows.
Let ∇f(x_k) ≠ 0. Define
ψ(t) = m_k( x_k − t ∇f(x_k)/||∇f(x_k)||_2 ) − m_k(x_k) = −t ||∇f(x_k)||_2 + ( t^2 / (2 ||∇f(x_k)||_2^2) ) ∇f(x_k)^T B_k ∇f(x_k)
and let t ∗ be the minimizer of ψ on the interval [0, ∆ k ] . Condition (6.19) implies that
m k (x k + s k ) − m k (x k ) ≤ β1 ψ(t ∗ ).
In this case
ψ(t_*) = −||∇f(x_k)||_2^4 / ( 2 ∇f(x_k)^T B_k ∇f(x_k) ) = −(1/(2c_k)) ||∇f(x_k)||_2^2. (6.21)
If t ∗ = ∆ k , then either the unconstrained minimizer of ψ is greater than ∆ k , i.e.
||∇f(x_k)||_2^3 / ( ∇f(x_k)^T B_k ∇f(x_k) ) ≥ ∆_k,
or the unconstrained problem min ψ(t) has no solution. The latter is the case if and only if
∇ f (x k )T Bk ∇ f (x k ) ≤ 0. Thus, either
ψ(t_*) = −∆_k ||∇f(x_k)||_2 + ( ∆_k^2 / (2 ||∇f(x_k)||_2^2) ) ∇f(x_k)^T B_k ∇f(x_k) ≤ −(∆_k/2) ||∇f(x_k)||_2, (6.22)
or
ψ(t ∗ ) ≤ −∆ k k∇ f (x k )k2 . (6.23)
The assertion now follows from (6.21)–(6.23).
Corollary 6.3.3 Suppose that k is a successful iteration, i.e. that ρ k > η 1 . If s k satisfies (6.19),
then
f(x_k) − f(x_{k+1}) ≥ (β_1 η_1 / 2) ||∇f(x_k)||_2 min{ (1/c_k) ||∇f(x_k)||_2, ∆_k }, (6.24)
where ck is defined as in Lemma 6.3.2.
Proof: The proof follows immediately from Lemma 6.3.2 if we use the definition of ρ k , aredk ,
and predk .
A basic convergence result is based on the estimate (6.24). We also assume that scalars ck
defined in Lemma 6.3.2 are bounded. This is guaranteed, e.g., if the norms kBk k are bounded.
Proof: Suppose that lim inf_{k→∞} ||∇f(x_k)||_2 > 0. Then there exists ε > 0 such that ||∇f(x_k)||_2 > ε for all k.
In the first step of the proof we show that
Σ_{k=1}^∞ ∆_k < ∞. (6.25)
If there are only finitely many successful iterations, then there exists K such that ∆ k+1 ≤ γ2 ∆ k for
all k ≥ K. This implies (6.25).
Suppose there are infinitely many successful iterations and let {ki } be the subsequence of
successful iterations. Algorithm 6.3.1 implies
∆ ki +1 ≤ γ3 ∆ ki
and
∆ ki + j+1 ≤ γ2 ∆ ki + j , j = 1, . . . , ki+1 − ki − 1.
Hence
Σ_{k=k_i}^{k_{i+1}-1} ∆_k ≤ ∆_{k_i} + γ_3 ∆_{k_i} + γ_3 γ_2 ∆_{k_i} + . . . + γ_3 γ_2^{k_{i+1}-k_i-2} ∆_{k_i} ≤ ( 1 + γ_3/(1 − γ_2) ) ∆_{k_i}
and
Σ_{k=1}^∞ ∆_k ≤ ( 1 + γ_3/(1 − γ_2) ) Σ_{i=1}^∞ ∆_{k_i}.
Inequality (6.24) and ||∇f(x_k)||_2 > ε for all k imply that
Σ_{i=1}^∞ ∆_{k_i} < ∞.
On the other hand, (6.20) and ||∇f(x_k)||_2 > ε for all k imply the existence of c̃ > 0 such that
Hence,
|ρ_k − 1| = |ared_k − pred_k| / pred_k ≤ ( (c + L/2)/c̃ ) ∆_k → 0.
Since ρ k ≥ η 2 implies that the kth iteration is successful and the trust-region radius is increased,
this contradicts ∆ k → 0.
Theorem 6.3.4 is a basic convergence result. Additional results can be found in Chapter 6 of the book [CGT00] by Conn, Gould, and Toint.
min g^T s + (1/2) s^T B s
(6.26)
s.t. ksk2 ≤ ∆.
Recall that a trust-region step s k does not need to solve the trust-region subproblem (6.18)
exactly. It only needs to satisfy the fraction of Cauchy decrease condition (6.19) which in the
simplified notation of this section reads
g^T s_k + (1/2) s_k^T B s_k ≤ β_1 min{ g^T s + (1/2) s^T B s : s = −t g, ||s||_2 ≤ ∆ },   ||s_k||_2 ≤ β_2 ∆. (6.27)
Lemma 6.3.5 The vector s∗ is a solution of the trust–region subproblem (6.26) if and only if there
exists a scalar λ ∗ ≥ 0 such that the following conditions are satisfied:
Proof: i. Let λ ∗ ≥ 0 and s∗ ∈ Rn satisfy (6.28). Theorem 4.3.6 shows that s∗ is a global minimizer
of
ψ̂(s) = g^T s + (1/2) s^T (B + λ_* I) s = g^T s + (1/2) s^T B s + (λ_*/2) ||s||_2^2.
Hence, for all s ∈ Rn we have the inequality
g^T s + (1/2) s^T B s + (λ_*/2) ||s||_2^2 = ψ̂(s) ≥ ψ̂(s_*) = g^T s_* + (1/2) s_*^T B s_* + (λ_*/2) ||s_*||_2^2. (6.29)
If ||s_*||_2^2 < ∆^2, then (6.28b) implies λ_* = 0 and the inequality (6.29) reads
g^T s + (1/2) s^T B s = ψ̂(s) ≥ ψ̂(s_*) = g^T s_* + (1/2) s_*^T B s_*   for all s ∈ R^n.
Thus, if ||s_*||_2^2 < ∆^2, then s_* is an unconstrained minimizer of g^T s + (1/2) s^T B s.
If ||s_*||_2^2 = ∆^2, then for all s ∈ R^n with ||s||_2^2 ≤ ∆^2 the inequality (6.29) implies
g^T s + (1/2) s^T B s ≥ g^T s_* + (1/2) s_*^T B s_* + (λ_*/2)( ||s_*||_2^2 − ||s||_2^2 )
= g^T s_* + (1/2) s_*^T B s_* + (λ_*/2)( ∆^2 − ||s||_2^2 )
≥ g^T s_* + (1/2) s_*^T B s_*.
Since the set {(s − s_*)/||s − s_*||_2 : s ≠ s_* with ||s||_2 = ||s_*||_2 = ∆} is dense in the unit ball, (6.31)
implies positive semidefiniteness of B + λ ∗ I.
In the final step, we show that λ ∗ ≥ 0. Our proof is by contradiction. Suppose λ ∗ < 0. Since
(6.28) holds, Theorem 4.3.6 shows that s∗ is a global minimizer of
ψ̂(s) = g^T s + (1/2) s^T (B + λ_* I) s = g^T s + (1/2) s^T B s + (λ_*/2) ||s||_2^2.
Hence, for all s ∈ R^n with ||s||_2^2 > ||s_*||_2^2 = ∆^2,
g^T s + (1/2) s^T B s ≥ g^T s_* + (1/2) s_*^T B s_* + (λ_*/2)( ||s_*||_2^2 − ||s||_2^2 ) > g^T s_* + (1/2) s_*^T B s_*,
since λ_*/2 < 0 and ||s_*||_2^2 − ||s||_2^2 < 0.
Since s_* solves (6.26), we also have g^T s + (1/2) s^T B s ≥ g^T s_* + (1/2) s_*^T B s_* for all s with ||s||_2^2 ≤ ∆^2 = ||s_*||_2^2. Thus, s_* minimizes g^T s + (1/2) s^T B s over R^n. Theorem 4.3.6 implies that B s_* + g = 0. Together with (6.28a) this implies λ_* s_* = 0. Since ||s_*||_2 = ∆, we have s_* ≠ 0. Therefore, λ_* = 0, which contradicts the assumption λ_* < 0.
Bs = −g (6.32)
2. If the solution of (6.26) was not found in step 1, then find λ > 0 such that B + λI is positive
semi–definite and
ks(λ)k2 = ∆ (6.33)
where s(λ) is the solution of
(B + λI)s(λ) = −g.
If B is positive definite, then the solution of (6.32) is unique and can be computed using the
Cholesky decomposition.
The problem of finding λ_* > 0 such that B + λ_* I is positive semi-definite and (6.33) is satisfied is a particular root finding problem. To get better insight into this problem, we use the eigendecomposition of B.
[Figure: the function λ ↦ ||s(λ)||_2^2 with poles at λ = −µ_1, −µ_2, the level ∆^2, and the root λ_*.]
and
||s(λ)||_2^2 = Σ_{i=1}^n (q_i^T g)^2 / (µ_i + λ)^2. (6.34)
The function
λ ↦ Σ_{i=1}^n (q_i^T g)^2 / (µ_i + λ)^2
has poles at −µ_1 > . . . > −µ_n. We need to find λ ≥ max{−µ_1, 0} such that
Σ_{i=1}^n (q_i^T g)^2 / (µ_i + λ)^2 = ∆^2.
φ(λ) = ks(λ)k2 − ∆ = 0.
will generate iterates λ j that satisfy λ j−1 > λ j ≥ λ ∗ for all k ≥ 1, provided that λ 0 ∈ (−µ1, λ ∗ ). In
addition, λ 7→ ks(λ)k2 − ∆ is a rational function.
With
(d/dλ) s(λ) = Σ_{i=1}^n ( q_i^T g / (µ_i + λ)^2 ) q_i = −(B + λ I)^{-1} s(λ)
we find that
φ'(λ) = −s(λ)^T (B + λ I)^{-1} s(λ) / ||s(λ)||_2.
It is advantageous to consider Newton's method applied to the equivalent root finding problem
ψ(λ) = 1/||s(λ)||_2 − 1/∆ = 0.
We find that
ψ'(λ) = −φ'(λ) / ( φ(λ) + ∆ )^2.
Hence, the new Newton iterate λ_+ is
λ_+ = λ − ψ(λ)/ψ'(λ)
    = λ + ( 1/||s(λ)||_2 − 1/∆ ) ||s(λ)||_2^2 / φ'(λ)
    = λ − ( ||s(λ)||_2 / ∆ ) ( ||s(λ)||_2 − ∆ ) / φ'(λ)
    = λ − ( ||s(λ)||_2 / ∆ ) φ(λ) / φ'(λ). (6.35)
Newton's method should be safeguarded, however. We know that
λ − φ(λ)/φ'(λ) ≤ λ_*.
Thus, if λ^low ∈ [−µ_1, λ_*) is a known lower bound for the root λ_*, then
λ_+^low = max{ λ^low, λ − φ(λ)/φ'(λ) }
is another, possibly improved lower bound. To obtain an upper bound for λ_* consider the identity
( B − µ_1 I + (λ_* + µ_1) I ) s(λ_*) = −g.
is another, possibly improved upper bound. If the Newton iterate (6.35) satisfies
λ_+ ∉ [λ_+^low, λ_+^up],
then we set
λ_+ = max{ sqrt( λ_+^low λ_+^up ), 10^{-3} λ_+^up }.
There is one case that requires more care. It is known as the hard case [MS83], and occurs if −µ_1 > 0 and
lim_{λ→−µ_1^+} Σ_{i=1}^n (q_i^T g)^2 / (µ_i + λ)^2 < ∆^2.
Note that the hard case can only occur if q_1^T g = 0, since for q_1^T g ≠ 0
lim_{λ→−µ_1^+} Σ_{i=1}^n (q_i^T g)^2 / (µ_i + λ)^2 = lim_{λ→−µ_1^+} ( (q_1^T g)^2 / (µ_1 + λ)^2 + Σ_{i=2}^n (q_i^T g)^2 / (µ_i + λ)^2 ) = ∞.
Since B + λ_* I must be positive semidefinite, λ_* = −µ_1 > 0. To ensure conditions (6.28a,b) we set s = s(−µ_1) + τ q_1, where τ is chosen to satisfy ||s(−µ_1) + τ q_1||_2 = ∆.
If µ_1 > 0 then
    Compute s(0) = −Σ_{i=1}^n (q_i^T g / µ_i) q_i.
    If ||s(0)||_2 ≤ ∆ then stop.
elseif µ_1 = 0 and q_1^T g = 0 then
    Compute s(0) = −Σ_{i=2}^n (q_i^T g / µ_i) q_i.
    If ||s(0)||_2 ≤ ∆ then stop.
elseif µ_1 < 0 and q_1^T g = 0 then
    Compute s(−µ_1) = −Σ_{i=2}^n (q_i^T g / (µ_i − µ_1)) q_i.
    If ||s(−µ_1)||_2 < ∆ then compute τ such that ||s(−µ_1) + τ q_1||_2 = ∆,
    set s = s(−µ_1) + τ q_1, and stop.
endif
Set λ_0^up = ||g||_2/∆ − min{0, µ_1}.
Set λ_0^low = max{0, −µ_1}.
Set λ_0 = max{ sqrt(λ_0^low λ_0^up), 10^{-3} λ_0^up }.
For j = 0, 1, 2, . . .
    Compute φ(λ_j) and φ'(λ_j).
    If φ(λ_j) < 0, then λ_{j+1}^up = min{λ_j, λ_j^up}. Else λ_{j+1}^up = λ_j^up.
    Set λ_{j+1}^low = max{ λ_j^low, λ_j − φ(λ_j)/φ'(λ_j) }.
End
Algorithm 6.3.6 is due to [Heb73, Rei71, Mor78]. The iteration (6.35) is introduced in [Heb73,
Rei71]. The safeguards and many important implementation details were introduced in [Mor78],
where the algorithm is described in the context of least squares problems. See also [DS96,
Sec. 6.4.1]. The eigen decomposition of B is a very convenient tool for the implementation of the
Hebden–Reinsch–Moré algorithm, but it is not necessary; see [Mor78, MS83] and [CGT00, Sec. 7.3].
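The following sketch illustrates the root finding just described when the eigendecomposition of B is available. The interior-solution test, the starting value, and the simple clamping at −µ_1 are simplifications; the hard case and the full safeguards of Algorithm 6.3.6 are not handled.

```python
import numpy as np

def trs_eig(B, g, delta, maxit=100, tol=1e-10):
    """Trust-region subproblem via the eigendecomposition B = Q diag(mu) Q^T and
    the Newton iteration (6.35) on phi(lambda) = ||s(lambda)|| - delta (sketch only)."""
    mu, Q = np.linalg.eigh(B)
    qg = Q.T @ g

    def s_of(lam):                       # s(lambda) = -(B + lambda I)^{-1} g
        return -Q @ (qg / (mu + lam))

    if mu[0] > 0 and np.linalg.norm(s_of(0.0)) <= delta:
        return s_of(0.0), 0.0            # interior solution, lambda_* = 0

    lam = max(0.0, -mu[0]) + 1e-3        # start strictly to the right of -mu_1
    for _ in range(maxit):
        s = s_of(lam)
        ns = np.linalg.norm(s)
        phi = ns - delta
        if abs(phi) <= tol * delta:
            break
        dphi = -(s @ (Q @ ((Q.T @ s) / (mu + lam)))) / ns   # phi'(lambda)
        lam = lam - (ns / delta) * (phi / dphi)             # Newton step (6.35)
        lam = max(lam, -mu[0] + 1e-12)                      # keep B + lambda I positive definite
    return s_of(lam), lam
```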
Fortunately the basic convergence result for trust–region methods only requires that the fraction
of Cauchy decrease condition (6.27) is satisfied. This gives a lot of flexibility in computing a
trust–region step. Next we will discuss a few techniques for the computation of trust-region steps
that satisfy the fraction of Cauchy decrease condition (6.27).
Suppose that B is positive definite, i.e., that the eigenvalues are all positive, 0 < µ_1 < . . . < µ_n. Let us consider the curve
s(λ) = −Σ_{i=1}^n ( q_i^T g / (µ_i + λ) ) q_i.
and
lim_{λ→∞} s(λ)/||s(λ)||_2 = lim_{λ→∞} −( 1 / ( Σ_{i=1}^n (q_i^T g)^2/(µ_i + λ)^2 )^{1/2} ) Σ_{i=1}^n ( q_i^T g / (µ_i + λ) ) q_i
= lim_{λ→∞} −( 1 / ( Σ_{i=1}^n (q_i^T g)^2/((µ_i/λ) + 1)^2 )^{1/2} ) Σ_{i=1}^n ( q_i^T g / ((µ_i/λ) + 1) ) q_i
= −( 1 / ( Σ_{i=1}^n (q_i^T g)^2 )^{1/2} ) Σ_{i=1}^n (q_i^T g) q_i
= −g/||g||_2.
The idea is to compute the step s as a combination of the Cauchy step (the minimizer of g^T s + (1/2) s^T B s along s = −t g, t ≥ 0) and the Newton step −B^{-1} g.
In particular, s_1 solves
min_{s = −t g, t ≥ 0} g^T s + (1/2) s^T B s
Moreover, we have shown in Theorem 3.7.7 that the norms of the iterates are monotonically increasing, 0 < ||s_1|| < ||s_2|| < . . .. Hence, we can use the conjugate gradient method for the computation of a trust-region step. If B is symmetric positive definite we will use the conjugate gradient method until an iterate s_{i+1} violates the trust-region bound, ||s_i|| ≤ ∆ < ||s_{i+1}||, or g + B s_i is small. We can also admit general symmetric B (B may not be positive definite). In this case, if we detect a conjugate gradient direction p_i such that p_i^T B p_i ≤ 0, then we will move from s_i along p_i until we hit the trust-region bound. The resulting algorithm is due to [Ste83] (and a slightly different version to [Toi81]) and is listed below. It can be shown that the step computed by the Steihaug-Toint Conjugate Gradient Algorithm 6.3.8 below satisfies the fraction of Cauchy decrease condition (6.27).
Subspace Techniques
Let V = span{v1, . . . , vk } ⊂ Rn . If −g ∈ V, then
min{ g^T s + (1/2) s^T B s : s ∈ V, ||s||_2 ≤ ∆ } ≤ min{ g^T s + (1/2) s^T B s : s = −t g, ||s||_2 ≤ ∆ }.
satisfies the fraction of Cauchy decrease condition (6.27). If we set V = (v1, . . . , vk ) ∈ Rn×k , then
(6.36) is equivalent to
min_ŝ (V^T g)^T ŝ + (1/2) ŝ^T V^T B V ŝ
s.t. ||V ŝ||_2 ≤ ∆. (6.37)
If we assume that V has full rank k, then there exists a nonsingular R ∈ R k×k such that RT R = V T V .
Such an R can be computed, e.g., using the Cholesky decomposition of V T V . Since kV ŝk2 = k R ŝk2
we can set s̃ = R ŝ to write (6.37) as
if the trust-region radius is inactive, ksSTCG k2 < ∆, then ‘≥’ above becomes ‘=’. Differences
between the Steihaug-Toint Conjugate Gradient Algorithm 6.3.8 and the solution of (6.36) with V
computed via Lanczos arise only when the trust-region radius is active. See [CGT00, Sec. 7.5.4].
If −B^{-1} g can be computed one can also choose V = span{−g, −B^{-1} g}. If a negative curvature direction d, i.e., a d with d^T B d < 0, exists and can be computed, then one can choose V = span{−g, −B^{-1} g, d}. Note that if B is positive definite and if −g, −B^{-1} g ∈ V, then the double dogleg step s_d computed by Algorithm 6.3.7 satisfies
6.4. Problems
Problem 6.4
Consider f(x) = (1/2) ||F(x)||_2^2, where F : R^n → R^n is a continuously differentiable function.
Furthermore, let s k be the solution of the Newton equation
F 0 (x k )s k = −F (x k ).
i. Show that if F 0 (x k ) is nonsingular, the sufficient decrease condition (6.5) can be formulated
as a condition involving F, but not F 0.
ii. Let the conditions in i. be satisfied. Give a condition (or conditions) on the sequence {α k } of
step sizes that produces iterates x k+1 = x k + α k s k with
lim kF (x k )k2 = 0.
k→∞
iii. Assume that the Jacobian F 0 is Lipschitz continuous. Without using Theorem 6.2.13 show
that your sufficient decrease condition derived in i. is satisfied for α k = 1, provided that the
iterate x k is sufficiently close to a root x ∗ of F at which F 0 (x ∗ ) is nonsingular.
(Hint: The quadratic convergence of the sequence {x k } of Newton iterates implies quadratic
convergence of the sequence {F (x k )} of corresponding function values.)
Problem 6.5
Let f : Rn → R be twice continuously differentiable.
Vk = {Vk z : z ∈ Rmk }.
Suppose that Bk ∈ Rn×n is a symmetric matrix which satisfies vT Bk v > 0 for all v ∈ Vk and
consider the subproblem
min_{s ∈ V_k} ∇f(x_k)^T s + (1/2) s^T B_k s. (6.39)
Compute the solution s_k of (6.39). Show that if s_k ≠ 0, then s_k is a descent direction for f
at x k .
iii. Consider the iteration x_{k+1} = x_k + α_k s_k, where s_k is the solution of (6.39) and α_k is the step size. If s_k ≠ 0 for all k and if the assumptions of Theorem 6.2.10 are satisfied, then
Σ_{k=0}^∞ ( ∇f(x_k)^T s_k / ( ||∇f(x_k)||_2 ||s_k||_2 ) )^2 ||∇f(x_k)||_2^2 < ∞.
Suppose there exist 0 < λ min ≤ λ max such that all eigenvalues of VkT Bk Vk , k ∈ N, are
bounded from below by λ min and from above by λ max .
Show that
lim kVkT ∇ f (x k )k2 = 0.
k→∞
Under what condition(s), can one show that
lim k∇ f (x k )k2 = 0?
k→∞
Problem 6.6 Let A ∈ Rn×n be symmetric positive definite and let b ∈ Rn . We consider the
following algorithm for minimizing
f(x) = (1/2) x^T A x − b^T x.
Given linearly independent vectors v (1), . . . , v (n) ∈ Rn one step of the parallel directional
correction (PDC) method introduced in Section 2.8 is given as follows:
end
Recall that
θ_i = ( (v^{(i)})^T (b − A x) ) / ( (v^{(i)})^T A v^{(i)} ).
ii. Determine the optimal step size argminα∈R f (x (k) + αs (k) ) and show that it satisfies the
sufficient decrease condition (6.5) if c1 < 1/2. Prove that
Σ_{k=0}^∞ [ (A x^{(k)} − b)^T s^{(k)} ]^2 / ( (s^{(k)})^T A s^{(k)} ) < ∞.
iii. Suppose that the directions v (i) are the unit vectors e (i) . What is s (k) ? Use part ii to show that
lim Ax (k) − b = 0.
k→∞
Problem 6.7
Let f : Rn → R be convex.
i. Show that the function φ(x) = f(x) + (µ/2) ||x − y||_2^2 is strictly convex for any given y ∈ R^n and any given µ > 0.
(A function φ is strictly convex if for any x_1 ≠ x_2 and t ∈ (0, 1), φ(t x_1 + (1 − t) x_2) < t φ(x_1) + (1 − t) φ(x_2).)
We consider the following iteration. Given x k , the new iterate x k+1 is computed as the solution
of
min_x f(x) + (µ_k/2) ||x − x_k||_2^2,
where µ_k ≥ 0 is such that x ↦ f(x) + (µ_k/2) ||x − x_k||_2^2 is strictly convex (this ensures that local minima are global minima).
iii. Assume that f is bounded from below on the set {x ∈ Rn | f (x) ≤ f (x 0 )}. Show that
Σ_{k=0}^∞ µ_k ||x_{k+1} − x_k||_2^2 < ∞.
lim_{k→∞} ∇f(x_k) = 0.
v. Now let H be symmetric positive definite and let f be the convex quadratic function f(x) = c^T x + (1/2) x^T H x. Furthermore, let 0 < µ_k ≤ µ for all k.
[CGT00] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Trust–Region Methods. SIAM, Philadel-
phia, 2000.
[DS83] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. Prentice-Hall, Englewood Cliffs, N. J, 1983. Republished
as [DS96].
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/
10.1137/1.9781611971200, doi:10.1137/1.9781611971200.
[Heb73] M. D. Hebden. An algorithm for minimization using exact second order derivatives.
Technical Report T.P. 515, Atomic Energy Research Establishment, Harwell, England,
1973.
[MS83] J. J. Moré and D. C. Sorensen. Computing a trust region step. SIAM J. Sci. Statist.
Comput., 4(3):553–572, 1983.
[MT94] J. J. Moré and D. J. Thuente. Line search algorithms with guaranteed sufficient decrease.
ACM Transactions on Mathematical Software, 20(3):286–307, 1994.
[RSS01] M. Rojas, S. A. Santos, and D. C. Sorensen. A new matrix-free algorithm for the large-
scale trust-region subproblem. SIAM J. Optim., 11(3):611–646 (electronic), 2000/01.
[Ste83] T. Steihaug. The conjugate gradient method and trust regions in large scale optimization.
SIAM J. Numer. Anal., 20:626–637, 1983.
[Toi81] Ph. L. Toint. Towards an efficient sparsity exploiting Newton method for minimization.
In I. S. Duff, editor, Sparse Matrices and Their Uses, pages 57–87. Academic Press,
New York, 1981.
7.1. Introduction
Given a smooth function R : Rn → Rm with component functions Ri , i = 1, . . . , m, we consider the
solution of the nonlinear least squares problem
and
∇^2 f(x) = R'(x)^T R'(x) + Σ_{i=1}^m ∇^2 R_i(x) R_i(x), (7.3)
respectively, where R0 (x) denotes the Jacobian of R and ∇2 Ri (x) is the Hessian of the ith component
function. Note that the first part of the Hessian, R0 (x)T R0 (x), uses derivative information already
required for the computation of ∇ f (x). In this chapter we study a variation of Newton’s method,
called the Gauss-Newton method, for the minimization of f (x) = 12 k R(x)k22 in which the Hessian
∇2 f (x) is replaced by R0 (x)T R0 (x). Before we discuss the Gauss-Newton method, we present a
class of problems that leads to nonlinear least squares problems (7.1) in the next Section 7.2 and
then we discuss the special case of linear least squares problems R(x) = Ax + b in Section 7.3.
The Gauss-Newton method is presented and analyzed in Section 7.4. The final Section 7.5 of this chapter treats a more complicated nonlinear least squares problem, parameter identification in
ordinary differential equations.
Figure 7.1: Least squares curve fitting. Given data (t i, bi ), i = 1, . . . , m, we want to find a function
ϕ(t; x 1, . . . , x n ) parameterized by x = (x 1, . . . , x n )T such that the sum of squares of the residuals
ϕ(t i ; x) − bi , i = 1, . . . , m, indicated by solid blue lines is minimal.
If the function R or, equivalently, ϕ depends linearly on x, then we call (7.4) a linear least
squares problem. Otherwise, we call (7.4) a nonlinear least squares problem.
For linear least squares problems the model function ϕ is of the form
ϕ(t; x_1, . . . , x_n) = x_1 ϕ_1(t) + . . . + x_n ϕ_n(t)
with some functions ϕ_1, . . . , ϕ_n. Notice that the function ϕ depends linearly on the parameters
x 1, . . . , x n but it may be a nonlinear function of t! If we introduce
A = ( ϕ_1(t_1) ϕ_2(t_1) . . . ϕ_n(t_1) ; ϕ_1(t_2) ϕ_2(t_2) . . . ϕ_n(t_2) ; . . . ; ϕ_1(t_m) ϕ_2(t_m) . . . ϕ_n(t_m) ) ∈ R^{m×n}, (7.5)
and
b = (b1, . . . , bm )T ∈ Rm, (7.6)
then
R(x) = Ax − b
and, thus, a linear least squares problem can be written in the form
Example 7.2.1 ([Esp81, Sec. 6]) The temperature T dependence of the rate constant k for an
elementary chemical reaction is almost always expressed by a relation of the form
k(T; C, U) = C T^n exp( −U/(RT) ), (7.8)
where R = 8.314 [J/(mol K)] is the general gas constant. Usually, n is assigned one of the values
0, 1/2, or 1. Depending on the choice of n, the notation in (7.8) varies. For example, for n = 0 we
obtain the Arrhenius equation, which is commonly written as
k(T; A, E) = A exp( −E/(RT) ). (7.9)
In (7.9), A is called the pre-exponential factor and E is called the activation energy, and both have to be estimated from experiments.
Consider the reaction
NO + ClNO2 → NO2 + ClNO.
Measurements of the rate constant k (measured in cm3 mol−1 sec−1 ) for various temperatures T
(measured in K) are shown in the following table.
The coefficients C and U in (7.8) can be computed by solving the nonlinear least squares
problem
min_{C,U} (1/2) Σ_{i=1}^5 ( k_i − k(T_i; C, U) )^2. (7.10)
Alternatively, we can divide both sides in (7.8) by T n and take the logarithm. This gives
ln( k(T; C, U)/T^n ) = ln(C) − U/(RT).
The quantities x 1 = ln(C) and x 2 = U can be estimated by solving the linear least squares problem
min_x (1/2) || ( . . . ; 1  −1/(R T_i) ; . . . ) x − ( . . . ; ln(k_i/T_i^n) ; . . . ) ||_2^2, (7.11)
where the ith row of the matrix is (1, −1/(R T_i)) and the ith entry of the vector is ln(k_i/T_i^n).
The problems (7.10) and (7.11) are related but different, because in the former we match k(T_i; C, U) with k_i while in the latter we match ln( k(T_i; C, U)/T_i^n ) with ln( k_i/T_i^n ).
We first estimate C, U from the linear least squares problem (7.11) and then use these estimates
as starting values in an optimization routine¹ to solve (7.10). The results are shown in the following tables.
Solution of (7.11)
n C U Res
0 6.167e + 11 2.808e + 04 2.705e + 12
1/2 2.087e + 10 2.674e + 04 2.673e + 12
1 7.062e + 08 2.541e + 04 2.642e + 12
Solution of (7.10)
n C U Res
0 6.167e + 11 2.807e + 04 2.661e + 12
1/2 2.087e + 10 2.673e + 04 2.632e + 12
1 8.621e + 08 2.595e + 04 2.464e + 12
Here, Res = Σ_{i=1}^5 ( k_i − k(T_i; C, U) )^2.
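The two-stage procedure (linearized fit (7.11) for starting values, then the nonlinear fit (7.10)) can be sketched as follows. The data arrays below are placeholders, since the measured values are not repeated here, and scipy.optimize.least_squares is used in place of the (unspecified) optimization routine from the footnote.

```python
import numpy as np
from scipy.optimize import least_squares

R = 8.314                                   # gas constant, J/(mol K)
n = 0                                       # Arrhenius form (7.9)

# Placeholder data (T_i, k_i); not the measured values from the table above.
T = np.array([300.0, 320.0, 340.0, 360.0, 380.0])
k = np.array([1.0e5, 3.0e5, 8.0e5, 2.0e6, 4.0e6])

# Linearized problem (7.11): rows (1, -1/(R T_i)), right hand side ln(k_i / T_i^n).
A = np.column_stack([np.ones_like(T), -1.0 / (R * T)])
b = np.log(k / T**n)
x, *_ = np.linalg.lstsq(A, b, rcond=None)
C0, U0 = np.exp(x[0]), x[1]                 # starting values for the nonlinear fit

# Nonlinear problem (7.10): residuals k_i - k(T_i; C, U).
res = lambda cu: k - cu[0] * T**n * np.exp(-cu[1] / (R * T))
sol = least_squares(res, x0=[C0, U0])
print(sol.x)
```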
So far we have considered scalar measurements (t i, bi ) ∈ R × R, i = 1, . . . m. Everything that
was said before can be extended to the case t i ∈ R k , bi ∈ R` , i = 1, . . . m.
We will return to the solution of linear and nonlinear least squares problems. For some
applications and statistical aspects of linear and nonlinear least squares problems we refer to the
book [BW88].
Since
(1/2) ||A x − b||_2^2 = (1/2) (A x − b)^T (A x − b) = (1/2) x^T A^T A x − x^T A^T b + (1/2) b^T b,
(7.12) is an instance of (4.9) with H = A^T A and c = −A^T b. Hence we can apply Theorem 4.3.6. A vector x_* solves (7.12) if and only if x_* solves
A^T A x = A^T b. (7.13)
The equations (7.13) are called the normal equations. Using the singular value decomposition of
A one can easily show that
R ( AT A) = R ( AT ), N ( AT A) = N ( A).
Since AT b ∈ R ( AT ) = R ( AT A), the normal equations are solvable. We obtain the following result.
1We have used the Matlab function ls.
[Figure: sketch of b, its projection A x onto R(A), and the residual A x − b.]
Theorem 7.3.1 A vector x ∗ solves (7.12) if and only if x ∗ solves the normal equation (7.13).
The normal equation has at least one solution x ∗ . The set of solutions of (7.13) is given by
Sb = x ∗ + N ( A),
where x ∗ denotes a particular solution of (7.13) and N ( A) denotes the null space of A.
If N ( A) , {0}, the set of solutions of (7.12) forms a manifold in Rn . In this case the minimum
norm solution x † is of interest. It is the solution of (7.12) with smallest norm. Mathematically, the
minimum norm solution x † is the solution of
min_{x ∈ S_b} ||x||_2.
It can be shown that the minimum norm solution x † is the solution of the least squares problem
which is perpendicular to the null-space of A. See also Figure 7.3 and (7.18) below.
If A ∈ Rm×n has rank n, which implies m ≥ n, then AT A is invertible and the solution of the
least squares problem (7.12) (or equivalently the normal equation (7.13)) is unique and given by
x = ( AT A) −1 AT b. (7.14)
If A ∈ R^{m×n} has rank m, which implies m ≤ n, then A^T A is not invertible (unless m = n) and the least squares problem has infinitely many solutions. The matrix A A^T is invertible and it is easy to verify that
x = AT ( A AT ) −1 b is a solution of the least squares problem (7.12) (or equivalently the normal
equation (7.13)). In fact, since AT ( A AT ) −1 b ∈ R ( AT ) = R ( AT A) = N ( AT A) ⊥ = N ( A) ⊥ ,
x † = AT ( A AT ) −1 b (7.15)
such that
A = UΣV T . (7.16)
The decomposition (7.16) of A is called the singular value decomposition of A. The scalars σi ,
i = 1, . . . , min{m, n} are called the singular values of A.
Using the orthogonality of U and V we find that
z_i = u_i^T b / σ_i,   i = 1, . . . , r,
z_i arbitrary,   i = r + 1, . . . , n.
Moreover,
A V z = U Σ z = Σ_{i=1}^r σ_i z_i u_i = Σ_{i=1}^r (u_i^T b) u_i (7.17)
and
min (1/2) ||A x − b||_2^2 = (1/2) Σ_{i=r+1}^m (u_i^T b)^2.
Since V is orthogonal, we find that ||x||_2 = ||V z||_2 = ||z||_2.
Hence, the minimum norm solution of the linear least squares problem is given by x^† = V z^†,
z_i^† = u_i^T b / σ_i,   i = 1, . . . , r,
z_i^† = 0,   i = r + 1, . . . , n,
i.e.,
x^† = Σ_{i=1}^r ( u_i^T b / σ_i ) v_i. (7.18)
Since {v1, . . . , vr } is an orthonormal basis for N ( A) ⊥ , we see that x † ⊥ N ( A). Moreover, since
{u1, . . . , ur } is an orthonormal basis for R ( A), the projection PR ( A) b of b onto R ( A) is given by
P_{R(A)} b = Σ_{i=1}^r (u_i^T b) u_i
we see that Ax ∗ = PR ( A) b for all solutions x ∗ of (7.12) (see (7.17)). The structure of the solution
of the linear least squares problem is sketched in Figure 7.3.
Given the SVD (7.16) of A, the matrix
A† = V Σ†U T , (7.19)
where Σ† ∈ Rn×m is the diagonal matrix with diagonal entries 1/σ1, . . . , 1/σr , 0, . . . , 0, is called
the Moore–Penrose pseudo inverse. The minimum norm solution (7.18) of the linear least squares
problem is given by
x † = A† b. (7.20)
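In code, (7.18)-(7.20) amount to a few lines. The rank tolerance used to decide which singular values are treated as zero is an implementation choice made for this sketch.

```python
import numpy as np

def min_norm_lsq(A, b, rtol=1e-12):
    """Minimum norm solution x^dagger = A^dagger b of min ||Ax - b||_2 via the SVD, cf. (7.18)."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    r = np.sum(sigma > rtol * sigma[0])           # numerical rank
    return Vt[:r].T @ ((U[:, :r].T @ b) / sigma[:r])

# Small check against NumPy's pseudo-inverse on a rank-deficient matrix (m = 2 < n = 3, rank 1)
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
b = np.array([1.0, 2.0])
print(min_norm_lsq(A, b), np.linalg.pinv(A) @ b)
```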
[Figure 7.3: A maps R^n to R^m; shown are the solution set S_b, the minimum norm solution x^† ∈ N(A)^⊥, the subspaces N(A) and R(A), and the projection P_{R(A)} b of b onto R(A).]
AP = QR, (7.21)
Now we partition
Q^T b = ( c_1 ; c_2 ; d ),   c_1 ∈ R^r, c_2 ∈ R^{n−r}, d ∈ R^{m−n}, (7.23)
and we set
y = P^T x, i.e., x = P y.
Let
y = ( y_1 ; y_2 ),   y_1 ∈ R^r, y_2 ∈ R^{n−r}. (7.24)
This yields
||A x − b||_2^2 = || ( R_1 y_1 + R_2 y_2 − c_1 ; c_2 ; d ) ||_2^2 = ||R_1 y_1 + R_2 y_2 − c_1||_2^2 + ||c_2||_2^2 + ||d||_2^2.
If y solves the minimization problem on the right hand side, then x = P y solves the minimization problem on the left hand side and vice versa. Since R_1 ∈ R^{r×r} is nonsingular, we can compute
y_1 = −R_1^{-1} (R_2 y_2 − c_1)
for any y_2 ∈ R^{n−r}, so that the set of minimizing vectors y is given by
{ ( −R_1^{-1}(R_2 y_2 − c_1) ; y_2 ) | y_2 ∈ R^{n−r} }
and we find that
Consequently, the set of solutions of the linear least squares problem min_x ||A x − b||_2 is given by
S_b = { P ( −R_1^{-1}(R_2 y_2 − c_1) ; y_2 ) | y_2 ∈ R^{n−r} } (7.25)
and
min_x ||A x − b||_2^2 = ||c_2||_2^2 + ||d||_2^2.
A particular solution can be found by setting y2 = 0 which yields
y = ( R_1^{-1} c_1 ; 0 )   and   x = P ( R_1^{-1} c_1 ; 0 ).
The minimum norm solution x † is defined as the solution of
min_{x ∈ S_b} ||x||_2.
Using (7.25) and the fact that ||P y||_2 = ||y||_2 for all y ∈ R^n, we find that
min_{x ∈ S_b} ||x||_2 = min_{y_2 ∈ R^{n−r}} || P ( −R_1^{-1}(R_2 y_2 − c_1) ; y_2 ) ||_2
= min_{y_2 ∈ R^{n−r}} || ( −R_1^{-1} R_2 ; I ) y_2 + ( R_1^{-1} c_1 ; 0 ) ||_2. (7.26)
The right hand side in (7.26) is just another linear least squares problem in y2 . Its solution can be
obtained by solving the normal equations
(R1−1 R2 )T R1−1 R2 + I y2 = (R1−1 R2 )T R1−1 c1 (7.27)
−R1−1 R2
!
∈ Rn×(n−r) (7.28)
I
and proceeding as above. This matrix in (7.28) has full rank n − r. Consequently, (7.27) or,
equivalently, (7.26) has a unique solution y2∗ . The minimum norm solution of (7.12) is given by
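The QR based construction above can be sketched in Matlab as follows; the rank test via the diagonal of R and the test data A, b are our own choices for illustration:

% Minimum norm solution via QR with column pivoting, following (7.21)-(7.28).
A = [1 2 3; 2 4 6; 1 0 1];   b = [1; 1; 1];      % rank(A) = 2 test data
[m,n]   = size(A);
[Q,R,P] = qr(A);                                 % A*P = Q*R with column pivoting
r   = sum(abs(diag(R)) > max(m,n)*eps(abs(R(1,1))));   % numerical rank
R1  = R(1:r,1:r);   R2 = R(1:r,r+1:n);
c   = Q'*b;   c1 = c(1:r);
T   = R1\R2;   c1t = R1\c1;
y2  = (T'*T + eye(n-r)) \ (T'*c1t);              % normal equations (7.27)
y1  = c1t - T*y2;                                % y1 = -R1^{-1}(R2*y2 - c1)
xdag = P*[y1; y2];
norm(xdag - pinv(A)*b)                           % agrees with the SVD formula (7.18)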
We note that once the Jacobian R′(x) is computed, we can compute the gradient of f and we can
compute the first term in the Hessian of f. If R′(x)^T R′(x) is large compared to Σ_{i=1}^{m} R_i(x) ∇²R_i(x),
then ∇²f(x) ≈ R′(x)^T R′(x). This will be the case if, e.g., the R_i(x), i = 1, . . . , m, are small, or if
the ∇²R_i(x), i = 1, . . . , m, are small. The latter condition means that the R_i are almost linear.
If we omit the second order derivative information, then the approximate Newton system is of
the form
R′(x_k)^T R′(x_k) s = −R′(x_k)^T R(x_k).
This system is always solvable, and it has a unique solution if and only if the Jacobian R′(x_k) has
rank n. The previous system is the normal equation for the linear least squares problem
min_s (1/2) ||R′(x_k) s + R(x_k)||_2^2.
Suppose that the matrices
∇²f(x_k) + Δ(x_k) = R′(x_k)^T R′(x_k)
are invertible, that
|| (R′(x_k)^T R′(x_k))^{-1} ||_2 ≤ M,
and that
|| (∇²f(x_k) + Δ(x_k))^{-1} Δ(x_k) ||_2 ≤ α_k ≤ α < 1,
i.e.,
|| (R′(x_k)^T R′(x_k))^{-1} Σ_{i=1}^{m} R_i(x_k) ∇²R_i(x_k) ||_2 ≤ α_k ≤ α < 1. (7.31)
Under these assumptions the Gauss–Newton method is locally convergent and the iterates obey
||x_{k+1} − x*||_2 ≤ α_k ||x_k − x*||_2 + (ML/2) ||x_k − x*||_2^2.
Theorem 7.4.2 (Local Convergence of the GN Method for Full Rank Problems) Let D ⊂ R^n
be an open set and let x* ∈ D be a (local) solution of the nonlinear least squares problem. Suppose
that R : D → R^m is continuously differentiable with R′ ∈ Lip_L(D) and suppose that for all x ∈ D
the Jacobian R′(x) has rank n. If there exist ω > 0 and κ ∈ (0, 1) such that for all x ∈ D and all
t ∈ [0, 1] the conditions
|| (R′(x)^T R′(x))^{-1} (R′(x) − R′(x*))^T R(x*) ||_2 ≤ κ ||x − x*||_2, (7.32)
|| (R′(x)^T R′(x))^{-1} R′(x)^T (R′(x + t(x* − x)) − R′(x))(x − x*) ||_2 ≤ ω t ||x − x*||_2^2 (7.33)
hold, then there exists ε > 0 such that if x_0 ∈ B_ε(x*), then the iterates {x_k} generated by the Gauss–
Newton method converge to x* and obey the estimate
||x_{k+1} − x*||_2 ≤ (ω/2) ||x_k − x*||_2^2 + κ ||x_k − x*||_2. (7.34)
Proof: i. The definition of the Gauss–Newton step and the optimality condition R′(x*)^T R(x*) = 0 yield
x_{k+1} − x* = x_k − x* − (R′(x_k)^T R′(x_k))^{-1} R′(x_k)^T R(x_k)
= (R′(x_k)^T R′(x_k))^{-1} [ R′(x_k)^T { R′(x_k)(x_k − x*) − R(x_k) + R(x*) } + (R′(x*) − R′(x_k))^T R(x*) ]
= (R′(x_k)^T R′(x_k))^{-1} [ R′(x_k)^T ∫_0^1 ( R′(x_k) − R′(x* + t(x_k − x*)) )(x_k − x*) dt + (R′(x*) − R′(x_k))^T R(x*) ].
Hence
||x_{k+1} − x*||_2 ≤ || ∫_0^1 (R′(x_k)^T R′(x_k))^{-1} R′(x_k)^T { R′(x_k) − R′(x* + t(x_k − x*)) }(x_k − x*) dt ||_2
+ || (R′(x_k)^T R′(x_k))^{-1} (R′(x*) − R′(x_k))^T R(x*) ||_2
≤ (ω/2) ||x_k − x*||_2^2 + κ ||x_k − x*||_2.
ii. Let ε_1 > 0 be such that B_{ε_1}(x*) ⊂ D and let σ ∈ (κ, 1) be arbitrary. If
ε ≤ min{ ε_1, 2(σ − κ)/ω }
and x_0 ∈ B_ε(x*), then
||x* − x_1||_2 ≤ (ω/2) ||x_0 − x*||_2^2 + κ ||x_0 − x*||_2 < ( (ω/2) ε + κ ) ||x_0 − x*||_2 ≤ σ ||x_0 − x*||_2.
A simple induction argument shows that all iterates remain in B_ε(x*) and that ||x_{k+1} − x*||_2 ≤ σ ||x_k − x*||_2 for all k, which implies convergence.
The condition (7.33) is implied by the Lipschitz continuity of R′ and by the uniform boundedness
of ||(R′(x)^T R′(x))^{-1} R′(x)^T||_2. In fact, if R′ ∈ Lip_L(D) and
a = sup_{x ∈ D} || (R′(x)^T R′(x))^{-1} R′(x)^T ||_2 < ∞,
then
|| (R′(x)^T R′(x))^{-1} R′(x)^T (R′(x + t(x* − x)) − R′(x))(x − x*) ||_2
≤ || (R′(x)^T R′(x))^{-1} R′(x)^T ||_2 · L t ||x* − x||_2 · ||x − x*||_2 ≤ a L t ||x − x*||_2^2.
Thus (7.33) holds with ω ≤ aL. The condition (7.32) is more interesting. Clearly, if R(x*) = 0
(zero residual problem) or if R is affine linear, then (7.32) is satisfied with κ = 0 and the
Gauss–Newton method converges locally q-quadratically. We will show in Lemma 7.4.3 below that
(7.32) is essentially equivalent to the condition (7.31) with α = κ. Lemma 7.4.4 below relates
(7.32) (via the results in Lemma 7.4.3) to the second order sufficient optimality condition. The
analysis follows [Boc88, Sec. 3] and [Hei93].
Lemma 7.4.3 Let D ⊂ R^n be an open set and let x* ∈ D be a (local) solution of the nonlinear least
squares problem. Suppose that R : D → R^m is continuously differentiable. Moreover, assume that
R_i, i = 1, . . . , m, is twice differentiable at x* and that R′(x*)^T R′(x*) is invertible.
i. If
|| (R′(x*)^T R′(x*))^{-1} (R′(x)^T − R′(x*)^T) R(x*) || ≤ κ ||x − x*|| for all x ∈ D,
then
|| (R′(x*)^T R′(x*))^{-1} Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) ||_2 ≤ κ.
ii. If
|| (R′(x*)^T R′(x*))^{-1} Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) ||_2 ≤ κ̂,
then for any κ > κ̂ there exists ε > 0 such that for all x ∈ B_ε(x*)
|| (R′(x*)^T R′(x*))^{-1} (R′(x)^T − R′(x*)^T) R(x*) || ≤ κ ||x − x*||.
for all h ∈ R^n with ||h||_2 = 1. Since we can cancel δ on the left and on the right hand side of the
previous inequality, this yields
|| (R′(x*)^T R′(x*))^{-1} Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) ||_2 ≤ κ + φ(δ).
If we take the limit δ → 0, we obtain the assertion.
ii. The second assertion can be proven in a similar way.
If the assumptions of Lemma 7.4.3 are satisfied, if R_i, i = 1, . . . , m, are twice differentiable,
and if the Hessians ∇²R_i, i = 1, . . . , m, are Lipschitz continuous, then Lemma 7.4.3 ii. shows that
(7.32) implies (7.31) with α ∈ (κ, 1) for all x_k sufficiently close to x*.
The next result relates (7.32) (via the results in Lemma 7.4.3) to the second order sufficient
optimality condition.
Lemma 7.4.4 Let D ⊂ R^n be an open set. Suppose that R_i : D → R, i = 1, . . . , m, are
twice continuously differentiable. If R′(x*)^T R′(x*) is invertible, then the following statements are
equivalent:
i. There exists λ > 0 with
h^T R′(x*)^T R′(x*) h − | h^T Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) h | ≥ λ ||h||_2^2 for all h ∈ R^n. (7.35)
Since
|| [R′(x*)^T R′(x*)]^{-1/2} Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) [R′(x*)^T R′(x*)]^{-1/2} ||_2
= || (R′(x*)^T R′(x*))^{-1} Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) ||_2,
If
(1/2) ||R′(x_k) s_k + R(x_k)||_2^2 < (1/2) ||R(x_k)||_2^2, (7.37)
then s_k is a descent direction. Hence we can use a line search. The new iterate is
x_{k+1} = x_k + α_k s_k,
where the step size α_k > 0 is chosen according to the conditions in Section 6.2.2 applied to
f(x) = (1/2) ||R(x)||_2^2.
Often the special structure of the function can be used to find equivalent but more convenient
representations of the line search conditions. For example, the sufficient decrease condition (6.5)
for f(x) = (1/2) ||R(x)||_2^2 is given by
(1/2) ||R(x_k + α_k s_k)||_2^2 ≤ (1/2) ||R(x_k)||_2^2 + c_1 α_k R(x_k)^T R′(x_k) s_k. (7.38)
If s_k is the exact solution of the linear least squares problem min_s (1/2) ||R′(x_k) s + R(x_k)||_2^2, then the
sufficient decrease condition (7.38) is equivalent to
(1/2) ||R(x_k + α_k s_k)||_2^2 ≤ (1/2) ||R(x_k)||_2^2 + c_1 α_k ( ||R′(x_k) s_k + R(x_k)||_2^2 − ||R(x_k)||_2^2 ). (7.39)
See Problem 6.1. The representation (7.39) of the sufficient decrease condition only requires
the quantities ||R(x_k)||_2 and ||R′(x_k) s_k + R(x_k)||_2, which have to be computed anyway during the
Gauss–Newton algorithm.
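A minimal sketch of a damped Gauss–Newton iteration using (7.39) as the sufficient decrease test is given below; the function handles Rfun and Jfun, the stopping tolerance, and the backtracking safeguard are placeholders supplied by the user:

function x = gauss_newton_ls(Rfun, Jfun, x, maxit, tol)
% Damped Gauss-Newton method with Armijo backtracking based on (7.39).
% Rfun(x) returns R(x), Jfun(x) returns the Jacobian R'(x).
c1 = 1e-4;
for k = 1:maxit
    R = Rfun(x);   J = Jfun(x);
    g = J'*R;                                 % gradient of f(x) = 0.5*||R(x)||^2
    if norm(g) <= tol, break; end
    s = -J\R;                                 % Gauss-Newton step (linear least squares)
    phi0 = 0.5*norm(R)^2;
    pred = norm(J*s + R)^2 - norm(R)^2;       % equals grad f(x)'*s for the exact GN step
    alpha = 1;
    while 0.5*norm(Rfun(x + alpha*s))^2 > phi0 + c1*alpha*pred && alpha > 1e-12
        alpha = alpha/2;                      % backtracking
    end
    x = x + alpha*s;
end
end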
If R′(x_k) is rank deficient, the linear least squares problem (7.40) has infinitely many solutions. How do we choose the step s_k from the set of least squares solutions?
It seems unreasonable to take arbitrarily large steps. We will take the minimum norm solution
as our step. This step can be computed with the methods described in Sections 7.3.1 or 7.3.2. Note
that if R′(x_k)^T R(x_k) = 0, then s_k = 0 is the minimum norm solution of (7.40). Thus, if the first
order necessary optimality conditions for (1/2) ||R(x)||_2^2 are satisfied at x_k, in particular, if x_k is a local
minimizer, then the Gauss–Newton method with the choice (7.41) will not move away from such a
point.
Our convergence analysis of the Gauss–Newton method for the rank deficient case follows
[Boc88, DH79]. See also [DH95] and [Deu04, Ch. 4]. If R′(x) ∈ R^{m×n} has rank n, then
R′(x_k)† = (R′(x_k)^T R′(x_k))^{-1} R′(x_k)^T.
We note that
R′(x*)^T R(x*) = 0 ⟺ R′(x*)† R(x*) = 0.
Clearly, if R′(x*)^T R(x*) = 0, the minimum norm solution of min_s (1/2) ||R′(x*) s + R(x*)||_2^2 is
R′(x*)† R(x*) = 0. On the other hand, if the minimum norm solution R′(x*)† R(x*) of
min_s (1/2) ||R′(x*) s + R(x*)||_2^2 is zero, then R′(x*)^T R(x*) = 0.
Theorem 7.4.5 (Local Convergence of the GN Method) Let D ⊂ Rn be an open set and let
R : D → Rm be continuously differentiable in D. If there exist ω > 0 and κ ∈ (0, 1) such that for
all x ∈ D and all t ∈ [0, 1] the following conditions hold
|| (R′(y)† − R′(x)†) ( R(x) − R′(x) R′(x)† R(x) ) ||_2 ≤ κ ||y − x||_2, (7.42a)
|| R′(y)† (R′(x + t(y − x)) − R′(x))(y − x) ||_2 ≤ ω t ||y − x||_2^2, (7.42b)
where α_0 = ||R′(x_0)† R(x_0)||_2 and B_{α_0/(1−δ_0)}(x_0) ⊂ D. In particular, x_0 and
x_1 = x_0 − R′(x_0)† R(x_0) belong to B_{α_0/(1−δ_0)}(x_0).
First we note that the identity R′(x)† R′(x) R′(x)† = R′(x)† implies
(R′(y)† − R′(x)†) ( R(x) − R′(x) R′(x)† R(x) ) = R′(y)† ( R(x) − R′(x) R′(x)† R(x) ).
From the identities s_{k+1} = −R′(x_{k+1})† R(x_{k+1}) and s_k = −R′(x_k)† R(x_k) we find that
Remark 7.4.6 i. Note that the proof of Theorem 7.4.5 only required the property
R′(x_k)† R′(x_k) R′(x_k)† = R′(x_k)†, not all of the four properties of the Moore–Penrose pseudoinverse.
and vice versa. If we add a multiple µ_k > 0 of the identity to R′(x_k)^T R′(x_k), then the resulting
matrix R′(x_k)^T R′(x_k) + µ_k I is positive definite and ||(R′(x_k)^T R′(x_k) + µ_k I)^{-1}||_2 ≤ µ_k^{-1}.
Furthermore, using the SVD R′(x_k) = UΣV^T we can show that the unique solution s_k of
( R′(x_k)^T R′(x_k) + µ_k I ) s = −R′(x_k)^T R(x_k) (7.44)
is given by
s_k = − Σ_{i=1}^{min{m,n}} ( σ_i / (σ_i^2 + µ_k) ) (u_i^T R(x_k)) v_i. (7.45)
It can be shown that µ → ||s_k(µ)||_2^2, where
s_k(µ) = − Σ_{i=1}^{min{m,n}} ( σ_i / (σ_i^2 + µ) ) (u_i^T R(x_k)) v_i,
is monotonically decreasing (see Problem 7.3). Furthermore,
lim_{µ_k → 0} − Σ_{i=1}^{min{m,n}} ( σ_i / (σ_i^2 + µ_k) ) (u_i^T R(x_k)) v_i = −R′(x_k)† R(x_k)
(see (7.18)).
(see (7.18)). Thus the parameter µ k > 0 can be used to control the size of the step s k . In the
nearly rank deficient case we use the step (7.45). However, in an implementation of this variation
of the Gauss–Newton method we do not set up and solve (7.44). Instead we note that (7.44) are the
necessary and sufficient optimality conditions for the linear least squares problem
min_s (1/2) || ( R′(x_k); √µ_k I ) s + ( R(x_k); 0 ) ||_2^2. (7.46)
Note also that
(1/2) || ( R′(x_k); √µ_k I ) s + ( R(x_k); 0 ) ||_2^2 = (1/2) ||R′(x_k) s + R(x_k)||_2^2 + (µ_k/2) ||s||_2^2.
There exist methods for the solution of linear least squares problems of the type (7.46) that utilize
the special structure of this problem.
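For illustration, the step can also be obtained in Matlab from the augmented problem (7.46) with the backslash operator; here Jk, Rk, and mu stand for R′(x_k), R(x_k), and µ_k and are assumed to be available:

% Levenberg-Marquardt step via the augmented least squares problem (7.46).
% Jk = R'(x_k) (m x n), Rk = R(x_k) (m x 1), mu > 0 are assumed given.
[m,n] = size(Jk);
s_lm  = - [Jk; sqrt(mu)*eye(n)] \ [Rk; zeros(n,1)];
% This is equivalent to solving (Jk'*Jk + mu*eye(n))*s = -Jk'*Rk, cf. (7.44),
% but avoids forming Jk'*Jk explicitly.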
We still have to discuss the choice of µ_k. Clearly (7.45) is a perturbation of the Gauss–Newton
step. We do not want to add an unnecessarily large µ_k > 0. On the other hand, we want to pick
µ_k > 0 so that the size of the step s_k (or, alternatively, the size of (R′(x_k)^T R′(x_k) + µ_k I)^{-1})
does not become artificially large; this is delicate because the rank of R′(x_k) is difficult to determine. A method that chooses
µ_k adaptively is the Levenberg–Marquardt method [Lev44, Mar63]. This method is closely related
to trust-region methods, which were discussed in Section 6.3. In fact, if s_k solves (7.44) then it
solves
min_s (1/2) ||R′(x_k) s + R(x_k)||_2^2
s.t. ||s||_2 ≤ Δ_k.
If a scaling matrix D_k is used, then, analogously,
(1/2) || ( R′(x_k); √µ_k D_k ) s + ( R(x_k); 0 ) ||_2^2 = (1/2) ||R′(x_k) s + R(x_k)||_2^2 + (µ_k/2) ||D_k s||_2^2,
and
min_s (1/2) || ( R′(x_k); √µ_k D_k ) s + ( R(x_k); 0 ) ||_2^2
is equivalent to
min_s (1/2) ||R′(x_k) s + R(x_k)||_2^2 s.t. ||D_k s||_2 ≤ Δ_k,
where Δ_k = ||D_k s_k||_2. For the choice of scaling see [Mar63] and [Mor78].
A trust-region view of the Levenberg–Marquardt method is described in [Mor78]. NL2SOL
is an older, but still popular code for the solution of nonlinear least squares problems [DGW81,
DGE81, Gay83]. For example, it is part of R, a language and environment for statistical computing
and graphics; see http://www.r-project.org.
σ_A A + σ_B B → σ_C C + σ_D D. (7.47)
Here σ_A, σ_B, σ_C, σ_D are the stoichiometric coefficients. The compounds A, B are the reactants,
C, D are the products. The → indicates that the reaction is irreversible. For a reversible reaction
we use ⇌. For example, the reversible reaction of carbon dioxide and hydrogen to form methane
plus water is
CO2 + 4 H2 ⇌ CH4 + 2 H2O. (7.48)
Notice that the number of atoms on the left and on the right hand side balance. For example, there is a
single C atom and there are two O atoms. However, the appearance of two reactants and two products
is accidental. The stoichiometric coefficients in (7.48) are σCO2 = 1, σ H2 = 4, σCH4 = 1, σ H2 O = 2.
For each reaction we have a rate r of the reaction that together with the stoichiometric coefficients
determines the change in concentrations resulting from the reaction. Concentrations are typically
measured in [mol/L]. The reaction rate is defined as the number of reactive events per second per
unit volume and is measured in [mol sec^{-1} L^{-1}]. For example, if the rate of the reaction (7.47) is r
and if we denote the concentration of compound A, . . . by C_A, . . ., we have the following changes in
concentrations:
(d/dt) C_A(t) = . . . − σ_A r . . . , (d/dt) C_B(t) = . . . − σ_B r . . . ,
(d/dt) C_C(t) = . . . + σ_C r . . . , (d/dt) C_D(t) = . . . + σ_D r . . . .
The dots indicate that other reactions or inflows and outflows will in general also enter the change
in concentration. For a reaction of the form (7.47) the rate of the reaction r is of the form
r = k C_A^α C_B^β,
where k is the reaction rate constant and α, β are nonnegative parameters. The sum α + β is called
the order of the reaction. The reaction rate constant depends on the temperature and is often given
by the Arrhenius equation (7.9).
As a particular example we consider an autocatalytic reaction. This example is taken from
[Ram97, S. 4.2]. ‘Autocatalysis is a term commonly used to describe the experimentally observable
phenomenon of a homogeneous chemical reaction which shows a marked increase in rate in time,
reaches its peak at about 50 percent conversion, and the drops off. The temperature has to remain
constant and all ingredients must be mixed at the start for proper observation.’ We consider the
catalytic thermal decomposition of a single compound A into two products B and C, of which B is
the autocatalytic agent. A can decompose via two routes, a slow uncatalyzed one (r 1 ) and another
catalyzed by B (r 3 ) The three essential kinetic steps are
A → B + C Start or background reaction,
A + B → AB Complex formation,
AB → 2B + C Autocatalytic step.
The autocatalytic agent B forms a complex AB (second reaction). Next, the complex AB decom-
poses, thereby releasing B in addition to forming B and C (third reaction). The last two reactions
form the path by which most of A decomposes. The first reaction is the starter, but continues
concurrently with the last two as long as there is any A.
Again, we denote the concentration of compound A, . . . by C_A, . . .. The reaction rates for the
three reactions are
r_1 = k_1 C_A, r_2 = k_2 C_A C_B, r_3 = k_3 C_AB.
The resulting system of ordinary differential equations is
dC_A/dt = −k_1 C_A − k_2 C_A C_B, (7.49a)
dC_B/dt = k_1 C_A − k_2 C_A C_B + 2 k_3 C_AB, (7.49b)
dC_AB/dt = k_2 C_A C_B − k_3 C_AB, (7.49c)
dC_C/dt = k_1 C_A + k_3 C_AB (7.49d)
with given initial values
C_A(0) = C_A0, C_B(0) = C_B0, C_AB(0) = C_AB0, C_C(0) = C_C0. (7.49e)
We set
p = (k_1, k_2, k_3)^T,
y(t) = (C_A(t), C_B(t), C_AB(t), C_C(t))^T,
y_0 = (C_A0, C_B0, C_AB0, C_C0)^T.
We see that the initial value problem (7.49a)–(7.49e) is a particular instance of the initial value
problem
y0 (t) = F (t, y(t), p), t ∈ [t 0, t f ]
(7.50)
y(t 0 ) = y0 (p),
where F : R × Rn × Rl → Rn , y0 : Rl → Rn .
We first review a result on the existence and uniqueness of the solution of the initial value
problem (7.50).
Theorem 7.5.1 Let G ⊂ R × R^n be an open connected set, let P ⊂ R^l, and for each p ∈ P let
F(·, ·, p) : G → R^n be continuous and bounded by M. If
J = { (t, y) ∈ R × R^n : |t − t_0| ≤ δ, ||y − y_0(p)||_2 ≤ δM } ⊂ G
and F is Lipschitz continuous with respect to y on J, i.e., there exists L > 0 such that
||F(t, y, p) − F(t, ỹ, p)||_2 ≤ L ||y − ỹ||_2 for all (t, y), (t, ỹ) ∈ J,
then there exists a unique solution y(·; p) of the initial value problem (7.50) on I = [t_0 − δ, t_0 + δ].
Figure 7.4: Solution of the autocatalytic reaction (7.49) with initial values C_A(0) = 1, C_B(0) = 0, C_AB(0) = 0, C_C(0) = 0 and parameters k_1 = 10^{-4}, k_2 = 1, k_3 = 8 · 10^{-4}.
Figure 7.4 shows the solution of (7.49a)–(7.49e) with initial values C A (0) = 1, CB (0) =
0, C AB (0) = 0, CC (0) = 0 and parameters k 1 = 10−4 , k2 = 1, k3 = 8 ∗ 10−4 on the time in-
terval t 0 = 0 to t f = 7200 secs. (2 hrs). The computations were done using the Matlab ODE
solver ode23s with the default options.
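The computation behind Figure 7.4 can be reproduced with a few lines of Matlab (a sketch; the variable names are ours):

% Solve the autocatalytic reaction ODE (7.49) with ode23s, cf. Figure 7.4.
k1 = 1e-4;  k2 = 1;  k3 = 8e-4;            % rate constants
y0 = [1; 0; 0; 0];                          % C_A(0), C_B(0), C_AB(0), C_C(0)
rhs = @(t,y) [ -k1*y(1) - k2*y(1)*y(2);
                k1*y(1) - k2*y(1)*y(2) + 2*k3*y(3);
                k2*y(1)*y(2) - k3*y(3);
                k1*y(1) + k3*y(3) ];
[t,y] = ode23s(rhs, [0 7200], y0);          % two hours of reaction time
plot(t, y); xlabel('Time (sec)'); ylabel('Concentration (kmol/L)');
legend('C_A','C_B','C_{AB}','C_C');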
Now, suppose the reaction rates are not known, but have to be determined through an experiment.
Given initial concentrations we will run the reaction and measure the concentrations ŷi ∈ R4 at
times t i , i = 1, . . . , m. We try to fit the function y(t; p) to the measurements, where y(·; p) is the
solution of the initial value problem (7.50). This leads to the nonlinear least squares problem
min_p (1/2) ||R(p)||_2^2, (7.51)
where
R(p) = ( y(t_1; p) − ŷ_1; y(t_2; p) − ŷ_2; . . . ; y(t_m; p) − ŷ_m ) ∈ R^{mn} (7.52)
and y(·; p) is the solution of the initial value problem (7.50). For the evaluation of R(p) at a
given p we have to solve the ODE (7.50), evaluate the solution y(·; p) of this ODE at the points t_i,
i = 1, . . . , m, and then assemble the vector R(p) in (7.52). Thus, R : R^l → R^{mn} is a composition
of functions
p → y(·; p) → y(t_i; p) → R(p),
R^l → C(I, R^n) → R^n → R^{mn}.
By C^ℓ(S, R^n) we denote the set of all functions g : S ⊂ R^j → R^n which are ℓ times continuously
differentiable on S. If the ODE (7.50) is solved numerically, then we do not obtain the exact solution
y(·; p), but only an approximation yh (·; p). Consequently, we are only able to compute
p 7→ yh (·; p) 7→ yh (t i ; p) 7→ Rh (p).
The error between R(p) and Rh (p) depends on the accuracy of the ODE solver. Thus, while we
want to minimize 12 k R(p)k22 , we do not have access to this function, but only to 12 k Rh (p)k22 .
Next we consider the differentiability of the map
p → y(·; p), R^l → C(I, R^n).
If this map is differentiable at p, its derivative applied to a direction δp is a function t → W(t; p) δp with
W(·; p) : R → R^{n×l}
such that
lim_{||δp||_2 → 0} (1/||δp||_2) max_{t ∈ I} || y(t; p + δp) − y(t; p) − W(t; p) δp ||_2 = 0. (7.53)
The existence of the derivative can be established with the aid of the implicit function theorem,
which also tells us how the derivative can be computed. For more details we refer to [Ama90],
[HNW93, Sec. I.14], or [Wal98]. The following result is taken from [WA86, Thm. 3.2.16].
Theorem 7.5.2 Let G ⊂ R × R^n be an open connected set and p̄ ∈ R^l. Further, let δ, δ_1 > 0. If
{ (t, y) ∈ R × R^n : t ∈ I, ||y − y_0(p)||_2 ≤ ε̄ + M |t − t_0| } ⊂ G,
then for each p ∈ P there exists a unique solution y(·; p) ∈ C^ℓ(I, R^n) of the initial value problem
(7.50). Moreover, the solution is ℓ times continuously differentiable with respect to p and the first
derivative W(·; p) = (d/dp) y(·; p) is the solution of
W′(t) = ∂_y F(t, y(t; p), p) W(t) + ∂_p F(t, y(t; p), p), t ∈ I,
W(t_0) = (d/dp) y_0(p). (7.54)
The linear differential equation (7.54) is sometimes also called the sensitivity equation. The
function Wi j (t) is the sensitivity of the ith component of the solution with respect to the parameter
p j . From (7.53) we see that
yi (t; p + (δp) j e j ) = yi (t; p) + Wi j (t; p)(δp) j + o((δp) j ) for all t ∈ I.
Here e_j denotes the jth unit vector and (δp)_j ∈ R.
For the ODE (7.49a)–(7.49e), written in the more abstract notation
(d/dt) y_1 = −k_1 y_1 − k_2 y_1 y_2,
(d/dt) y_2 = k_1 y_1 − k_2 y_1 y_2 + 2 k_3 y_3,
(d/dt) y_3 = k_2 y_1 y_2 − k_3 y_3,
(d/dt) y_4 = k_1 y_1 + k_3 y_3, (7.55)
y_1(0) = y_{1,0}, y_2(0) = y_{2,0}, y_3(0) = y_{3,0}, y_4(0) = y_{4,0}, (7.56)
the sensitivity equations are given by
(d/dt) W = [ −k_1 − k_2 y_2   −k_2 y_1    0      0
              k_1 − k_2 y_2   −k_2 y_1    2 k_3  0
              k_2 y_2          k_2 y_1   −k_3    0
              k_1              0          k_3    0 ] W
         + [ −y_1   −y_1 y_2    0
              y_1   −y_1 y_2    2 y_3
              0      y_1 y_2   −y_3
              y_1    0          y_3 ], (7.57)
where W(t) ∈ R^{4×3} has the entries W_{ij}(t), i = 1, . . . , 4, j = 1, 2, 3.
Since the initial values (7.56) do not depend on the parameter p, the sensitivity W obeys the initial
conditions W(0) = 0 ∈ R^{4×3}.
With the solution W(·; p) of the sensitivity equation (7.54) the Jacobian of R defined in (7.52)
is given by
R′(p) = ( W(t_1; p); W(t_2; p); . . . ; W(t_m; p) ) ∈ R^{mn×l}. (7.59)
In practice the ODE (7.50) and the sensitivity equations have to be solved numerically. Since
most ODE solvers are adaptive, one has to solve the original ODE (7.50) and the sensitivity equation
(7.54) simultaneously for y and W . Instead of the exact solutions y(·; p) and W (·; p) of (7.50) and
(7.54) one obtains approximations yh (·; p) and Wh (·; p) thereof. Thus in practice one has only
R_h(p) and
(R′(p))_h = ( W_h(t_1; p); W_h(t_2; p); . . . ; W_h(t_m; p) ) ∈ R^{mn×l} (7.60)
available. It holds that R_h(p) ≈ R(p) and (R′(p))_h ≈ R′(p), and estimates for the errors
||R_h(p) − R(p)||_2 and ||(R′(p))_h − R′(p)||_2 are typically available. Usually, the approximation (R′(p))_h
of R′(p) is not the derivative of R_h(p), the approximation of R(p). Therefore we have chosen
the notation (R′(p))_h over (R_h(p))′.² In fact, often we do not even know whether R_h(p) is
differentiable. Codes for the numerical solution of ODEs often choose the time steps adaptively.
Rules for this adaptation involve min, max, | · |. This might lead to nondifferentiability of R_h(p).
If (7.50) and (7.54) have to be solved simultaneously for y and W using a numerical ODE solver,
the sensitivity equation (7.54) typically has to be written in vector form. For example the ODE
resulting from (7.55) and (7.57) is given by
²Here (R′(p))_h indicates that we differentiate first and then discretize the derivative, whereas (R_h(p))′ indicates that
we discretize first and then take the derivative of the discretized R_h(p).
(d/dt) y_1 = −k_1 y_1 − k_2 y_1 y_2,
(d/dt) y_2 = k_1 y_1 − k_2 y_1 y_2 + 2 k_3 y_3,
(d/dt) y_3 = k_2 y_1 y_2 − k_3 y_3,
(d/dt) y_4 = k_1 y_1 + k_3 y_3,
(d/dt) W_11 = (−k_1 − k_2 y_2) W_11 − k_2 y_1 W_21 − y_1,
(d/dt) W_21 = (k_1 − k_2 y_2) W_11 − k_2 y_1 W_21 + 2 k_3 W_31 + y_1,
(d/dt) W_31 = k_2 y_2 W_11 + k_2 y_1 W_21 − k_3 W_31,
(d/dt) W_41 = k_1 W_11 + k_3 W_31 + y_1,
(d/dt) W_12 = (−k_1 − k_2 y_2) W_12 − k_2 y_1 W_22 − y_1 y_2,
(d/dt) W_22 = (k_1 − k_2 y_2) W_12 − k_2 y_1 W_22 + 2 k_3 W_32 − y_1 y_2,
(d/dt) W_32 = k_2 y_2 W_12 + k_2 y_1 W_22 − k_3 W_32 + y_1 y_2,
(d/dt) W_42 = k_1 W_12 + k_3 W_32,
(d/dt) W_13 = (−k_1 − k_2 y_2) W_13 − k_2 y_1 W_23,
(d/dt) W_23 = (k_1 − k_2 y_2) W_13 − k_2 y_1 W_23 + 2 k_3 W_33 + 2 y_3,
(d/dt) W_33 = k_2 y_2 W_13 + k_2 y_1 W_23 − k_3 W_33 − y_3,
(d/dt) W_43 = k_1 W_13 + k_3 W_33 + y_3. (7.61)
Figures 7.5 and 7.6 show the solution of (7.61). As before, the computations were done using
the Matlab ODE solver ode23s with the default options. The parameters were k1 = 10−4 , k 2 = 1,
k_3 = 8 · 10^{-4} and t_0 = 0 to t_f = 7200 secs. (2 hrs). Note the different scales of the sensitivities
with respect to k_1, k_3 and k_2. For example, since k_1 = 10^{-4}, the sensitivities dy_j/dk_1 can be
about 10^4 times larger than y_j, j = 1, . . . , 4. In this case scaling issues have to be dealt with when
solving (7.61). When using a numerical solver such as the Matlab ODE solver ode23s it might be
necessary to choose different absolute tolerances AbsTol for each solution component in (7.61). In
our computations we have used AbsTol = 10−6 for all components (the default). A more sensible
choice might be AbsTol = 10−6 for components 1–4, AbsTol = 10−6 /k 1 for components 5–8,
AbsTol = 10−6 /k 2 for components 9–12, and AbsTol = 10−6 /k3 for components 13–16.
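Such component-wise tolerances can be passed to the solver through odeset; the sketch below assumes a user-written function rhs_sens implementing the right hand side of (7.61):

% Component-wise absolute tolerances for the combined state/sensitivity system (7.61).
% rhs_sens(t,Y) is assumed to implement the 16-dimensional right hand side of (7.61).
k1 = 1e-4;  k2 = 1;  k3 = 8e-4;
abstol = 1e-6 * [ones(4,1); ones(4,1)/k1; ones(4,1)/k2; ones(4,1)/k3];
opts   = odeset('AbsTol', abstol, 'RelTol', 1e-3);
Y0     = [1; 0; 0; 0; zeros(12,1)];        % y(0) and W(0) = 0
[t,Y]  = ode23s(@rhs_sens, [0 7200], Y0, opts);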
(Figures 7.5 and 7.6: plots of the concentrations C_A, C_B, C_AB, C_C over time and of the corresponding sensitivities dC_A/dk_j, dC_B/dk_j, dC_AB/dk_j, dC_C/dk_j, j = 1, 2, 3.)
DASSL and DASPK are two Fortran codes for the solution of ODEs [BCP95]. Actually, DASSL
and DASPK solve differential–algebraic equations (DAEs), which are systems of ODEs coupled
with nonlinear algebraic equations. Both codes have been augmented to solve the DAE and the
corresponding sensitivity equations simultaneously. The original codes DASSL and DASPK and
their augmentations DASSLSO and DASPKSO are available. For details we refer to the paper
[MP96].
Alternatively, the sensitivities can be approximated by finite differences,
(∂/∂p_j) y(t_i; p) ≈ (1/(δp)_j) ( y(t_i; p + (δp)_j e_j) − y(t_i; p) ),
where e_j is the jth unit vector and (δp)_j ∈ R. The scalar (δp)_j ∈ R can and typically does vary with
j. Thus, we can compute a finite difference approximation of R′(p) as follows. For j = 1, . . . , l
choose (δp)_j ∈ R sufficiently small and compute the solution y(·; p + (δp)_j e_j) of (7.50) with p
replaced by p + (δp)_j e_j. The jth column (R′(p))_j of R′(p) is then approximated by
(1/(δp)_j) ( R(p + (δp)_j e_j) − R(p) ).
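A minimal sketch of this column-by-column finite difference approximation; Rfun is assumed to evaluate R(p) by solving (7.50) and assembling (7.52), and dp contains the chosen step sizes:

function Jh = fd_jacobian(Rfun, p, dp)
% Columnwise finite difference approximation of R'(p).
% Rfun(p) returns R(p); dp is a vector of step sizes, one per parameter.
Rp = Rfun(p);
Jh = zeros(length(Rp), length(p));
for j = 1:length(p)
    ej      = zeros(length(p),1);   ej(j) = 1;
    Jh(:,j) = (Rfun(p + dp(j)*ej) - Rp) / dp(j);
end
end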
Figure 7.7: y4 and corresponding sensitivities. The solid curves are the sensitivities computed
using the sensitivity equation method (these are identical to the corresponding plots in Figure 7.6)
and the dashed curves are the sensitivity approximations via finite differences with (δp) j = 10−1 p j ,
j = 1, 2, 3.
As before, the computations were done using the Matlab ODE solver ode23s with the default options. The parameters were k_1 = 10^{-4}, k_2 = 1,
k_3 = 8 · 10^{-4} and t_0 = 0 to t_f = 7200 secs. (2 hrs). If we want to extend the error analysis
of finite difference approximations performed for the function g to y(·; p), then we will obtain a
different constant L_j for each component p_j, with L_j ≈ max_t ||(∂²/∂p_j²) y(t; p)||_2. To compute the finite
difference step size (δp)_j we need an estimate for L_j and an estimate for the error level ε in the
evaluation of y(t; p), t ∈ [0, 7200]. Using our previous sensitivity computations we estimate that
max_t ||(∂/∂p_j) y(t; p)||_2 = O(1/p_j). (This seems to be reasonable for j = 1, 3, but it is too high for
j = 2.) For the second partial derivatives we use the estimate max_t ||(∂²/∂p_j²) y(t; p)||_2 = O(1/p_j²). Thus,
the optimal finite difference step size for the jth parameter is (δp*)_j = 2 √(ε/L_j) = O(p_j √ε). The
default options in the Matlab ODE solver ode23s attempt to compute an approximate solution
that is within AbsTol = 10^{-6} of the true solution. Thus, we estimate that ε ≈ 10^{-6}. This gives
Figure 7.8: y4 and corresponding sensitivities. The solid curves are the sensitivities computed
using the sensitivity equation method (these are identical to the corresponding plots in Figure 7.6)
and the dashed curves are the sensitivity approximations via finite differences with (δp) j = 10−2 p j ,
j = 1, 2, 3.
an estimate (δp*)_j = O(p_j 10^{-3}) for the optimal step size. We see that for j = 1, 3 the best
agreement between the sensitivities computed using the sensitivity equation method and the
finite difference approximations of the sensitivities is achieved for (δp*)_j = O(p_j 10^{-2}), j = 1, 3.
For j = 2, however, the best agreement is achieved for (δp*)_2 = O(p_2 10^{-1}). The calculations
for the solution components y1 − y3 gave similar results. This example indicates how difficult it
is to approximate derivatives of vector valued functions using finite differences, especially if the
variables and functions have different scales and the functions are not computed exactly.
Figure 7.9: y4 and corresponding sensitivities. The solid curves are the sensitivities computed
using the sensitivity equation method (these are identical to the corresponding plots in Figure 7.6)
and the dashed curves are the sensitivity approximations via finite differences with (δp) j = 10−3 p j ,
j = 1, 2, 3.
Another approach to the computation of derivatives of the solution of ODEs with respect to
parameters is automatic differentiation (which is now also known as computational differentiation).
Given the source code of a computer program (or a set of computer programs) for the solution of a
differential equation, automatic differentiation tools take this program and generate source code for
a new program that computes the solution of the ODE as well as the derivatives of the solution with
respect to parameters. Actually, this technique is not limited to computer programs for the solution
of ODEs. Automatic differentiation is based on the observation that inside computer programs
only elementary functions such as +, ∗, sin, log are executed. The derivatives of these elementary
operations are known. A computer program can be viewed as a composition of such elementary
functions. The derivative of a composition of functions is obtained by the chain rule. This is
the basic mathematical observation underlying automatic differentiation. See the paper [Gri03]
by Griewank and the book [GW08] by Griewank and Walther. Of course computer programs
also include operations that are not necessarily differentiable such as max, | · |, and if-then-else
statements. Automatic differentiation techniques will always generate an augmented program that
also generates ‘derivatives’. It is important that the user applies these tools intelligently.
ADIC (for the automatic differentiation of programs written in C/C++) and ADIFOR (for the
automatic differentiation of programs written in Fortran 77) are available from https://wiki.
mcs.anl.gov/autodiff3.
7.6. Problems
Note: It can actually be shown that there exists only one matrix X ∈ Rn×m that satisfies
AX A = A, X AX = X, ( AX )T = AX, (X A)T = X A.
Hence these four identities are also used to define the Moore–Penrose pseudo inverse. For more
details see, e.g., [Bjö96, pp. 15-17] and [BIG74, CM79, Gro77, Nas76].
A† = AT ( A AT ) −1 .
min_x (1/2) ||Ax − b||_2^2 + (µ/2) ||x||_2^2, (7.64)
where A ∈ Rm×n , b ∈ Rm , and µ ≥ 0.
i. Show that for each µ > 0, (7.64) has a unique solution x(µ).
ii. Let µ2 > µ1 > 0 and let x 1 = x(µ1 ), x 2 = x(µ2 ) be the solutions of (7.64) with µ = µ1 and
µ = µ2 , respectively.
Show that
(1/2) ||Ax_1 − b||_2^2 + (µ_1/2) ||x_1||_2^2 ≤ (1/2) ||Ax_2 − b||_2^2 + (µ_2/2) ||x_2||_2^2,
(1/2) ||Ax_1 − b||_2^2 ≤ (1/2) ||Ax_2 − b||_2^2,
||x_2||_2^2 ≤ ||x_1||_2^2.
has to be solved iteratively using, e.g., the conjugate gradient methods described in Sections 3.7.3
and 3.7.3. We assume that the computed step s k satisfies
Formulate and prove an extension of the local convergence Theorem 7.4.2 for this inexact
Gauss-Newton method.
Hint: Revisit Theorem 5.3.1. One convergence result for inexact Gauss-Newton methods is
presented in [Mar87].
where t is the time in days after the maximum luminosity and L(t) is the luminosity relative
to the maximum luminosity. The table in lumi_data.m gives the relative luminosity for the
type I supernovae SN1939A measured in 1939. The peak luminosity occured at day 0.0, but all
measurement before day 7.0 are omitted because the model above cannot account for the luminosity
before and immediately after the maximum.
i. Plot the data. You should notice two distinct regions, and thus two exponentials are required
to provide an adequate fit.
ii. Use the function lsqcurvefit from the Matlab optimization toolbox to fit the data to the
model above, i.e., to determine C1, C2, α1, α2 . Plot the fit along with the data. Also plot the
residuals. Do the residuals look random? Experience plays a role in choosing the starting
values. It is known that the time constants α_1 and α_2 are about 5.0 and 60.0, respectively. Try
different values for C1 and C2 . How sensitive is the resulting fit to these values?
differences. Use the parameter values p_1 = 0.9875, p_2 = 0.2566, p_3 = 0.3323. Evaluate the
sensitivities at t_i = 0, 0.1, 0.2, . . . , 4 (tspan = [0:0.1:4] if you use the Matlab ODE solvers).
For the computation of the finite difference approximations use (δp) j = δp, j = 1, . . . , 3, with
δp = 10−3, 10−6, 10−9 . For each δp plot the error WhS (·; p) − WhFD (·; p) (in log-format).
Problem 7.7 Consider the continuously stirred tank reactor (CSTR) with heating jacket in
Figure 7.10. Reactant A is fed at a flow rate F, molar concentration C Ao , and temperature TA to
the reactor, where the irreversible endothermic reaction A → B occurs. The rate of the reaction is
given by the following relation
r = k_0 exp(−E/(RT)) C_A,
where C A is the molar concentration of A in the reactor holdup, T is the reactor temperature. The
parameters k 0 , E, and R are called the pre-exponential factor, the activation energy, and the gas
constant, respectively. In the CSTR, the product stream is withdrawn at flow rate F and heat is
provided to the reactor through the heating jacket, where the heating fluid is fed at a flow rate F_h
and temperature T_h. We assume that the flow rates F and F_h are constant.
Write a program for the computation of approximate sensitivities WhS (·; p) via the sensitivity
equation method and for the computation of approximate sensitivities WhFD (·; p) via finite
differences.
Use the parameter values specified in Table 7.1 for p in (7.68).
Problem 7.8 In this problem we will determine parameters in a simple model for an electrical
furnace.
The mathematical model for the oven involves the following quantities: t: time (seconds), T:
temperature inside the oven (°C), C: 'heat capacity' of the oven including load (joule/°C), Q:
rate of loss of heat inside the oven to the environment (joule/sec), V: voltage of the source of
electricity (volt), I: intensity of the electric current (amp), R: resistance of the heating of the oven
plus regulation resistance (ohm) (R = ∞ corresponds to the 'open' (disconnected) circuit). Then,
according to the laws of physics:
• I = V/R,
• Q = kT, where k is a constant of loss of heat per second and per degree of temperature
difference between oven and the environment,
• [(V²/R) − kT]/C = rate of increase of temperature of the oven (°C per second).
The temperature of the oven will then evolve according to the differential equation
T′(t) = ( (V²/R(t)) − k T(t) ) / C. (7.70)
We set
u(t) = 1/R(t).
This gives the differential equation
T′(t) = −α T(t) + β u(t), where α = k/C and β = V²/C. (7.72)
The model (7.72) depends on the parameters α, β. These depend on the particular oven (geometry,
material, ...). Our goal is to determine the parameters from measurements using the least squares
formulation.
First, we note that the solution of (7.72) for constant u is given by
T(t) = T(t_0) e^{−α(t−t_0)} + (β u / α) ( 1 − e^{−α(t−t_0)} ). (7.73)
For u ≡ 1, temperature measurements T̂i at times t i , i = 0, . . . , m, are given in Table 7.2.
i    t_i    T̂_i        i    t_i    T̂_i
0 0.0 1.0000 11 1.1 1.6672
1 0.1 1.0953 12 1.2 1.6988
2 0.2 1.1813 13 1.3 1.7275
3 0.3 1.2592 14 1.4 1.7534
4 0.4 1.3298 15 1.5 1.7770
5 0.5 1.3935 16 1.6 1.7982
6 0.6 1.4512 17 1.7 1.8174
7 0.7 1.5034 18 1.8 1.8348
8 0.8 1.5508 19 1.9 1.8505
9 0.9 1.5935 20 2.0 1.8647
10 1.0 1.6322
i.e., determine R.
Note that since none of the temperature measurements above can be assumed to be exact, we
include T_0 = T(t_0) as a variable in the least squares problem.
[Bjö96] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
[BW88] D. M. Bates and D. G. Watts. Nonlinear Regression Analysis and its Applications. John
Wiley and Sons, Inc., Somerset, New Jersey, 1988.
[Deu04] P. Deuflhard. Newton Methods for Nonlinear Problems. Affine Invariance and Adaptive
Algorithms, volume 35 of Springer Series in Computational Mathematics. Springer-
Verlag, Berlin, 2004.
[DH79] P. Deuflhard and G. Heindl. Affine invariant convergence theorems for Newton’s method
and extensions to related methods. SIAM J. Numer. Anal., 16:1–10, 1979.
[DH95] P. Deuflhard and A. Hohmann. Numerical Analysis. A First Course in Scientific Com-
putation. Walter De Gruyter, Berlin, New York, 1995.
[Esp81] J. H. Espenson. Chemical Kinetics and Reaction Mechanisms. Mc Graw Hill, New
York, 1981.
[GL89] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3 of Johns Hopkins
Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD,
second edition, 1989.
[Gro77] C. W. Groetsch. Generalized Inverses of Linear Operators. Marcel Dekker, Inc., New
York, Basel, 1977.
[GW08] A. Griewank and A. Walther. Evaluating Derivatives. Principles and Techniques of Al-
gorithmic Differentiation. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, second edition, 2008. URL: https://doi.org/10.1137/1.
9780898717761, doi:10.1137/1.9780898717761.
[Hei93] M. Heinkenschloss. Mesh independence for nonlinear least squares problems with norm
constraints. SIAM J. Optimization, 3:81–117, 1993. URL: http://dx.doi.org/10.
1137/0803005, doi:10.1137/0803005.
[KMN88] D. Kahaner, C.B. Moler, and S. Nash. Numerical Methods and Software. Prentice Hall,
Englewood Cliffs, NJ, 1988.
[Lev44] K. Levenberg. A method for the solution of certain nonlinear problems in least squares.
Quarterly Applied Mathematics, 2:164–168, 1944.
[Mar87] J. M. Martinez. An algorithm for solving sparse nonlinear least squares problems.
Computing, 39:307–325, 1987. URL: https://doi.org/10.1007/BF02239974,
doi:10.1007/BF02239974.
[MP96] T. Maly and L. R. Petzold. Numerical methods and software for sensitivity analysis of
differential-algebraic systems. Applied Numerical Mathematics, 20:57–79, 1996.
[Nas76] M. Z. Nashed. Generalized Inverses and Applications. Academic Press, Boston, San
Diego, New York, London, 1976.
Chapter
8
Implicit Constraints
8.1. Introduction
This section studies the optimization of functions whose evaluation requires the solution of an
implicit equation. In this section we use u ∈ Rnu to denote the optimization variable. The objective
function we want to minimize is given by
f̂(u) = f(y(u), u), (8.1)
where y(u) ∈ R^{n_y} is the solution of an equation
c(y, u) = 0. (8.2)
Here f : R^{n_y} × R^{n_u} → R and c : R^{n_y} × R^{n_u} → R^{n_y} are given functions.
While this problem structure may seem special, we have seen several examples of it already.
The data assimilation problems in Section 1.5 are one class of examples. The data assimilation
problem (1.54) is an optimization problem in y_0, which plays the role of u. The evaluation of
f̂(y_0) = (1/2) ||Ay_0 − b||_2^2 requires the solution of (1.50), which plays the role of (8.2), to compute
y^T = (y_1^T, . . . , y_{n_t}^T). Another class of problems that fits (abstractly) into the setting of minimizing
(8.1) is parameter identification in ordinary differential equations studied in Section 7.5. The vector
of optimization variables is p, which plays the role of u. To evaluate f̂(p) = (1/2) ||R(p)||_2^2 in (7.51) we
have to solve the ordinary differential equation (7.50) to get (7.52). Here the ordinary differential
equation (7.50) plays the role of (8.2). The solution y in this example is a function, not merely a
vector in R^{n_y}, and therefore this example does not fit precisely into the setting (8.1), (8.2). However,
this setting can be extended to cover this example. We will see additional examples of problems
with the structure (8.1), (8.2) in Sections 8.3 and 8.4 below.
The problem
min_{u ∈ R^{n_u}} f̂(u) (8.3)
is an unconstrained optimization problem and can in principle be solved using any of the optimization
methods studied before. These methods require the computation of the gradient of f̂(u) and,
possibly, Hessian information. We will compute derivatives of f̂ in the next section.
We call (8.3) an implicitly constrained optimization problem because the solution of (8.2) is
invisible to the optimization algorithm. Of course, in principle one can formulate (8.3), (8.1),
(8.2) as an equality constrained optimization problem. In fact, since y is tied to u via the implicit
equation (8.2), we could just include this equation into the problem formulation and reformulate
(8.3), (8.1), (8.2) as
min f (y, u),
(8.4)
s.t. c(y, u) = 0.
In (8.4), the optimization variables are y ∈ Rny and u ∈ Rnu . The formulation (8.4) can have
significant advantages over (8.3), but in many applications the formulation of the optimization
problem as a constrained problem may not be possible, for example, because of the huge size of
y, which in applications can easily be many millions. The solution of constrained optimization
problems is also beyond the scope of this class. Therefore, we focus on (8.3).
Assumption 8.2.1
• There exists an open set D ⊂ Rny ×nu with {(y, u) : u ∈ U, c(y, u) = 0} ⊂ D such that f
and c are twice continuously differentiable on D.
• The inverse cy (y, u) −1 exists for all (y, u) ∈ {(y, u) : u ∈ U, c(y, u) = 0}.
Under these assumptions the implicit function theorem guarantees the existence of a differen-
tiable function
y : R nu → R n y
defined by
c(y(u), u) = 0.
Note that our Assumptions 8.2.1 are stronger than those required in the implicit function theorem.
The standard assumptions of the implicit function theorem, however, only guarantee the local
existence of the implicit function y(·).
To simplify the notation we write c_y(y(u), u) and c_u(y(u), u) instead of c_y(y, u)|_{y=y(u)} and
c_u(y, u)|_{y=y(u)}, respectively. With this notation, differentiating c(y(u), u) = 0 gives
y_u(u) = −c_y(y(u), u)^{-1} c_u(y(u), u). (8.6)
The derivative y_u(u) is also called the sensitivity (of y with respect to u).
Since y(·) is differentiable, the function f̂ is differentiable and its gradient is given by
∇f̂(u) = y_u(u)^T ∇_y f(y(u), u) + ∇_u f(y(u), u) (8.7)
       = −c_u(y(u), u)^T c_y(y(u), u)^{-T} ∇_y f(y(u), u) + ∇_u f(y(u), u).
Note that if we define the matrix
W(y, u) = ( −c_y(y, u)^{-1} c_u(y, u); I ), (8.8)
then
W(y(u), u) = ( y_u(u); I ) (8.9)
and the gradient of f̂ can be written as
∇f̂(u) = W(y(u), u)^T ∇f(y(u), u), (8.10)
where ∇f denotes the gradient of f with respect to (y, u).
The matrix W (y, u) will play a role later.
Equation (8.7) suggests the following (sensitivity equation) method for computing the gradient.
1. Given u, solve c(y, u) = 0 for y (if not done already for the evaluation of f̂(u)).
Denote the solution by y(u).
2. Solve c_y(y(u), u) S = −c_u(y(u), u) for the sensitivity matrix S = y_u(u) ∈ R^{n_y×n_u}.
3. Compute ∇f̂(u) = S^T ∇_y f(y(u), u) + ∇_u f(y(u), u).
The computation of the sensitivity matrix S requires the solution of n_u systems of linear
equations c_y(y(u), u) S = −c_u(y(u), u), all of which have the same system matrix but different
right hand sides. If n_u is large this can be expensive. The gradient computation can be executed
right hand sides. If nu is large this can be expensive. The gradient computation can be executed
more efficiently since for the computation of ∇ fD(u) we do not need S, but only the application
of ST to ∇ y f (y(u), u). If we revisit (8.7), we can define λ(u) = −cy (y(u), u) −T ∇ y f (y(u), u), or,
equivalently, we can define λ(u) ∈ Rny as the solution of
cy (y(u), u)T λ = −∇ y f (y(u), u). (8.11)
In optimization problems (8.1), (8.2) arising from discretized optimal control problems, the system
(8.11) are called the (discrete) adjoint equations and λ(u) is the (discrete) adjoint. We will see soon
(see (8.13)) that λ(u) is the Lagrange multiplier corresponding to the constraint problem (8.4).
With λ(u) the gradient can now be written as
∇ fD(u) = ∇u f (y(u), u) + cu (y(u), u)T λ(u), (8.12)
which suggests the so-called adjoint equation method for computing the gradient.
1. Given u, solve c(y, u) = 0 for y (if not done already for the evaluation of fD(u)).
Denote the solution by y(u).
The gradient computation using the adjoint equation method can also be expressed using the
Lagrangian
L(y, u, λ) = f (y, u) + λT c(y, u) (8.13)
corresponding to the constraint problem (8.4). Using the Lagrangian, the equation (8.11) can be
written as
∇ y L(y, u, λ)| y=y(u),λ=λ(u) = 0. (8.14)
Moreover, (8.12) can be written as
∇f̂(u) = ∇_u L(y, u, λ)|_{y=y(u), λ=λ(u)}. (8.15)
The adjoint equations (8.11) or (8.14) are easy to write down in this abstract setting, but (hand)
generating a code to set up and solve the adjoint equations can be quite a different matter. This
will become somewhat apparent when we discuss a simple optimal control example in Section 8.4.
The following observation can be used to generate some checks that indicate the correctness of the
adjoint code. Assume that we have a code that for given u computes the solution y of c(y, u) = 0.
Often it is not too difficult to derive from this a code that for given r computes the solution s of
cy (y, u)s = r. If λ solves the adjoint equation cy (y, u)T λ = −∇ y f (y, u), then
− sT ∇ y f (y, u) = sT cy (y, u)T λ = r T λ (8.16)
must hold.
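Both (8.16) and the gradient formula (8.12) lead to simple numerical checks of an adjoint code. The sketch below is written in terms of assumed user-supplied routines (solve_state, solve_linearized, solve_adjoint, grad_y_f, gradhat, fhat) and a reference point u0; none of these names come from the text:

% Two consistency checks for an adjoint based gradient code (sketch, placeholders).
% solve_state(u)           : returns y with c(y,u) = 0
% solve_linearized(y,u,r)  : returns s with c_y(y,u) s = r
% solve_adjoint(y,u)       : returns lambda with c_y(y,u)' lambda = -grad_y f(y,u)
% grad_y_f(y,u), gradhat(u), fhat(u) : gradient of f w.r.t. y, adjoint gradient (8.12), objective
u   = u0;
y   = solve_state(u);
lam = solve_adjoint(y, u);
r   = randn(size(y));
s   = solve_linearized(y, u, r);
check1 = abs(-s'*grad_y_f(y,u) - r'*lam);                    % should be ~ 0, cf. (8.16)
v = randn(size(u));   h = 1e-6;
check2 = abs((fhat(u + h*v) - fhat(u))/h - gradhat(u)'*v);   % O(h) agreement with (8.12)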
If we use ∇_{yλ}L(y, u, λ) = c_y(y, u)^T and (8.6) in the previous equation, we find that
λ_u(u) = −c_y(y(u), u)^{-T} ( ∇_{yy}L(y(u), u, λ(u)) y_u(u) + ∇_{yu}L(y(u), u, λ(u)) ). (8.17)
To simplify the expression, we have used the notation ∇_{yy}L(y(u), u, λ(u)) instead of
∇_{yy}L(y, u, λ)|_{y=y(u), λ=λ(u)}, and analogous notation for the other derivatives of L. We will
continue to use this notation in the following.
Now we can compute the Hessian of f̂ by differentiating (8.15),
∇²f̂(u) = ∇_{uy}L(y(u), u, λ(u)) y_u(u) + ∇_{uu}L(y(u), u, λ(u)) + ∇_{uλ}L(y(u), u, λ(u)) λ_u(u). (8.18)
If we insert (8.17) and (8.6) into (8.18) and observe that ∇_{uλ}L(y(u), u, λ(u)) = c_u(y(u), u)^T, the
Hessian can be written as
∇²f̂(u) = c_u(y(u), u)^T c_y(y(u), u)^{-T} ∇_{yy}L(y(u), u, λ(u)) c_y(y(u), u)^{-1} c_u(y(u), u)
        − c_u(y(u), u)^T c_y(y(u), u)^{-T} ∇_{yu}L(y(u), u, λ(u))
        − ∇_{uy}L(y(u), u, λ(u)) c_y(y(u), u)^{-1} c_u(y(u), u) + ∇_{uu}L(y(u), u, λ(u))
      = W(y(u), u)^T ( ∇_{yy}L(y(u), u, λ(u))  ∇_{yu}L(y(u), u, λ(u)); ∇_{uy}L(y(u), u, λ(u))  ∇_{uu}L(y(u), u, λ(u)) ) W(y(u), u). (8.19)
Obviously the identity (8.19) can be used to compute the Hessian. However, in many cases the
computation of the Hessian is too expensive. In that case optimization algorithms that only require
the computation of Hessian-times-vector products ∇²f̂(u)v can be used. The prime example is
the Newton-CG Algorithm, where an approximation of the Newton step s_k is computed using the
CG Algorithm 6.2.2. Using the equality (8.19), Hessian-times-vector products can be computed
as follows.
1. Given u, solve c(y, u) = 0 for y (if not done already). Denote the solution by y(u).
2. Solve the adjoint equation c_y(y(u), u)^T λ = −∇_y f(y(u), u) for λ (if not done already). Denote the solution by λ(u).
3. Solve c_y(y(u), u) w = c_u(y(u), u) v for w.
4. Solve c_y(y(u), u)^T p = ∇_{yy}L(y(u), u, λ(u)) w − ∇_{yu}L(y(u), u, λ(u)) v for p.
5. Compute
∇²f̂(u)v = c_u(y(u), u)^T p − ∇_{uy}L(y(u), u, λ(u)) w + ∇_{uu}L(y(u), u, λ(u)) v.
Hence, if y(u) and λ(u) are already known, then the computation of ∇²f̂(u)v requires the solution
of two linear systems: one similar to the linearized state equation, Step 3, and one similar to the
adjoint equation, Step 4.
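In code, each Hessian-times-vector product amounts to the two linear solves of Steps 3 and 4; the following Matlab sketch uses assumed user-supplied routines (cy_solve, cyT_solve, cu, cuT, Lyy, Lyu, Luy, Luu) that apply the corresponding operators:

function Hv = hessvec(y, u, lam, v)
% Hessian-times-vector product via Steps 1-5 (sketch, all routines are placeholders).
% cy_solve / cyT_solve solve with c_y(y,u) and its transpose; cu / cuT apply c_u and
% its transpose; Lyy, Lyu, Luy, Luu apply the second derivative blocks of the Lagrangian.
w  = cy_solve(y, u, cu(y, u, v));                         % Step 3
p  = cyT_solve(y, u, Lyy(y,u,lam,w) - Lyu(y,u,lam,v));    % Step 4
Hv = cuT(y, u, p) - Luy(y,u,lam,w) + Luu(y,u,lam,v);      % Step 5
end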
We conclude this section with an observation concerning the connection between the Newton
equation ∇²f̂(u) s_u = −∇f̂(u), or a Newton-like equation Ĥ s_u = −∇f̂(u), and the solution of
a quadratic program. These observations also emphasize the connection between the implicitly
constrained problem (8.3) and the nonlinear programming problem (8.4).
Theorem 8.2.5 Let c_y(y(u), u) be invertible and let ∇²f̂(u) be symmetric positive semidefinite.
The vector s_u solves the Newton equation
∇²f̂(u) s_u = −∇f̂(u) (8.20)
if and only if (s_y, s_u) with s_y = −c_y(y(u), u)^{-1} c_u(y(u), u) s_u solves the quadratic program
min ( ∇_y f(y, u); ∇_u f(y, u) )^T ( s_y; s_u ) + (1/2) ( s_y; s_u )^T ( ∇_{yy}L(y, u, λ)  ∇_{yu}L(y, u, λ); ∇_{uy}L(y, u, λ)  ∇_{uu}L(y, u, λ) ) ( s_y; s_u )
s.t. c_y(y, u) s_y + c_u(y, u) s_u = 0, (8.21)
where y = y(u) and λ = λ(u).
Proof: Every feasible point of (8.21) obeys
( s_y; s_u ) = ( −c_y(y(u), u)^{-1} c_u(y(u), u) s_u; s_u ) = W(y(u), u) s_u.
Thus, using (8.10) and (8.19), we see that (8.21) is equivalent to
min_{s_u} s_u^T ∇f̂(u) + (1/2) s_u^T ∇²f̂(u) s_u. (8.22)
The desired result now follows from the equivalence of (8.21) and (8.22).
The computation of first and second order derivatives of the function fD in (8.1) is based on
application of the implicit function theorem and can be complicated and laborious. We will see a
concrete, rather simple example in Section 8.4 below.
There are approaches to compute or approximate these derivatives. They include:
• Finite difference approximations [DS96, Sec. 5.6], [Sal86].
• Algorithmic differentiation (also called automatic differentiation) [Gri03, GW08].
• Approximation of gradients using the complex-step method [MSA03], [ST98].
• Computation of second derivatives via so-called hyper-dual numbers [FA11, FA12].
Figure 8.1: Example of a neural network with n0 = 4 inputs, with two hidden layers with n1 = n2 = 5
neurons each, and with n3 = 2 outputs.
function,
y_i^L = σ_i( b_i^L + Σ_{j=1}^{n_{L−1}} w_{ij}^L y_j^{L−1} ), i = 1, . . . , n_L.
Since
0 < σ_j(z) < 1, j = 1, . . . , n_L, and Σ_{j=1}^{n_L} σ_j(z) = 1,
the outputs y_i^L = σ_i(· · ·) are interpreted as probabilities, e.g., the input into the network
belongs to class i with probability y_i^L = σ_i(· · ·).
If the weights w_{ij}^ℓ and biases b_i^ℓ, i = 1, . . . , n_ℓ, ℓ = 1, . . . , L,
are given, then the outputs y_1^L, . . . , y_{n_L}^L of the neural network for given inputs y_1^0, . . . , y_{n_0}^0 are
computed using Algorithm 8.3.1.
1. For ℓ = 1, . . . , L − 1:
Compute the outputs of hidden layer ℓ,
y_i^ℓ = σ( b_i^ℓ + Σ_{j=1}^{n_{ℓ−1}} w_{ij}^ℓ y_j^{ℓ−1} ), i = 1, . . . , n_ℓ.
2. Compute the network outputs
y_i^L = σ_i( b_i^L + Σ_{j=1}^{n_{L−1}} w_{ij}^L y_j^{L−1} ), i = 1, . . . , n_L.
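Algorithm 8.3.1 takes only a few lines of Matlab if the weights and biases are stored in cell arrays; sigma and sigma_out below stand for the hidden-layer and output-layer activation functions and are assumptions of this sketch:

function yL = forward_pass(W, b, y0, sigma, sigma_out)
% Forward pass of Algorithm 8.3.1.
% W{l} is the n_l x n_{l-1} weight matrix, b{l} the n_l x 1 bias vector,
% sigma and sigma_out are (vectorized) activation functions.
L = numel(W);
y = y0;
for l = 1:L-1
    y = sigma(b{l} + W{l}*y);       % hidden layers
end
yL = sigma_out(b{L} + W{L}*y);      % output layer, e.g. a soft-max
end

For example, sigma = @(z) 1./(1+exp(-z)) and sigma_out = @(z) exp(z)./sum(exp(z)) give logistic hidden layers and a soft-max output layer.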
For the following presentation a more compact notation will be useful. First we define the layer
output vectors
y^ℓ = ( y_1^ℓ, . . . , y_{n_ℓ}^ℓ )^T ∈ R^{n_ℓ}, ℓ = 0, . . . , L.
Then we aggregate the weights w_{ij}^ℓ, i = 1, . . . , n_ℓ, j = 1, . . . , n_{ℓ−1}, and biases b_i^ℓ, i = 1, . . . , n_ℓ, into a vector of parameters
associated with level ℓ,
u^ℓ ∈ R^{(n_{ℓ−1}+1) n_ℓ}, ℓ = 1, . . . , L.
Finally, we define the functions
σ^ℓ( y^{ℓ−1}, u^ℓ ) = ( σ( b_i^ℓ + Σ_{j=1}^{n_{ℓ−1}} w_{ij}^ℓ y_j^{ℓ−1} ) )_{i=1,...,n_ℓ} ∈ R^{n_ℓ}, ℓ = 1, . . . , L − 1,
and
σ^L( y^{L−1}, u^L ) = ( σ_i( b_i^L + Σ_{j=1}^{n_{L−1}} w_{ij}^L y_j^{L−1} ) )_{i=1,...,n_L} ∈ R^{n_L}.
With this notation, given network inputs y^0 ∈ R^{n_0} the corresponding network output vector
y^L ∈ R^{n_L} can be computed recursively using
y^ℓ = σ^ℓ( y^{ℓ−1}, u^ℓ ), ℓ = 1, . . . , L. (8.27)
The system (8.27) has the structure of a so-called discrete-time system. The index ℓ corresponds
to time. The input into the system at 'time' ℓ is u^ℓ, and the state of the system at 'time' ℓ is y^ℓ ∈ R^{n_ℓ}.
The final state, the output of the neural network y^L ∈ R^{n_L}, depends on the inputs y^0 ∈ R^{n_0} and the
network parameters u^1, . . . , u^L,
y^L = y^L(u^1, . . . , u^L; y^0).
So far we have assumed that the network parameters u^1, . . . , u^L are given. Now we describe how
they can be computed. Given a sequence of inputs y_k^0 ∈ R^{n_0}, k = 1, . . . , K, and corresponding desired
outputs ŷ_k^L ∈ R^{n_L}, k = 1, . . . , K, we want to find parameters u^1, . . . , u^L so that the network outputs
generated with these parameters and inputs match the desired outputs. This can be formulated as a
least squares problem
min_{u^1,...,u^L} (1/2) Σ_{k=1}^{K} || y^L(u^1, . . . , u^L; y_k^0) − ŷ_k^L ||_2^2, (8.28)
where y^L(u^1, . . . , u^L; y_k^0) is defined by (8.27) with input y^0 = y_k^0. The data y_k^0 ∈ R^{n_0}, ŷ_k^L ∈ R^{n_L},
k = 1, . . . , K, are also called training data, and solving (8.28) is also called training the neural
network. Instead of a least squares functional, other functionals are possible to quantify 'matching'
of the network outputs y^L(u^1, . . . , u^L; y_k^0) and the desired outputs ŷ_k^L.
The problem (8.28) is an implicitly constrained optimization problem since evaluation of
y^L(u^1, . . . , u^L; y_k^0) requires the solution of (8.27). Alternatively, (8.27) could be entered as a
constraint into the optimization problem and (8.28) can equivalently be formulated as the following
constrained problem in the optimization variables u^1, . . . , u^L and y^1, . . . , y^L:
min (1/2) Σ_{k=1}^{K} || y_k^L − ŷ_k^L ||_2^2 (8.29a)
s.t. y_k^ℓ = σ^ℓ( y_k^{ℓ−1}, u^ℓ ), ℓ = 1, . . . , L, k = 1, . . . , K. (8.29b)
8.3.2. Backpropagation
For the minimization (8.28) we note that the objective function is a sum of functions that all depend
on the same variables u1, . . . , u L . Thus the gradient of the sum is the sum of gradients, etc. For
derivative computations it is therefore sufficient to consider the case K = 1 and drop the subscript
k.
We consider the problem
min_{u^1,...,u^L} (1/2) || y^L(u^1, . . . , u^L; y^0) − ŷ^L ||_2^2. (8.30)
For the gradient computation we need the Jacobian of the map
(u^1, . . . , u^L) → y^L(u^1, . . . , u^L; y^0).
Since y^L is implicitly defined through (8.27), the Jacobian can be computed by applying the implicit
function theorem to (8.27).
Let σy` ∈ Rn` ×n`−1 denote the partial Jacobian of σ ` with respect to y`−1 and let σu` ∈
Rn` ×((n`−1 +1)n` ) denote the partial Jacobian of σ ` with respect to u` . Because of the size of
the Jacobian, it is more convenient to describe the computation of the Jacobian applied to a vector
v^1, . . . , v^L. The recursion (8.27) defines functions
(u^1, . . . , u^ℓ) → y^ℓ(u^1, . . . , u^ℓ; y^0), ℓ = 1, . . . , L.
The directional derivatives
w^ℓ = Dy^ℓ(u^1, . . . , u^ℓ; y^0) ( v^1; . . . ; v^ℓ ), ℓ = 1, . . . , L,
are computed by the implicit function theorem applied to (8.27), i.e., they are computed using
w^0 = 0, (8.31a)
w^ℓ = σ_y^ℓ( y^{ℓ−1}, u^ℓ ) w^{ℓ−1} + σ_u^ℓ( y^{ℓ−1}, u^ℓ ) v^ℓ, ℓ = 1, . . . , L. (8.31b)
The recursion (8.31) allows us to compute the Jacobian Dy L (u1, . . . , u L ; y0 ) applied to a vector.
For optimization we also need the application of the transpose of this Jacobian. To compute
Dy L (u1, . . . , u L ; y0 ) T r for a given vector r ∈ RnL we first note that (8.31) is equivalent to
[ I                                        ] [ w^1 ]   [ σ_u^1(y^0, u^1)                           ] [ v^1 ]
[ −σ_y^2(y^1, u^2)   I                     ] [ w^2 ]   [        σ_u^2(y^1, u^2)                    ] [ v^2 ]
[        ⋱           ⋱                     ] [  ⋮  ] = [                 ⋱                         ] [  ⋮  ]   (8.32)
[            −σ_y^L(y^{L−1}, u^L)   I      ] [ w^L ]   [                        σ_u^L(y^{L−1}, u^L)] [ v^L ]
The lower block bidiagonal matrix on the left is denoted by A and the block diagonal matrix on the right by B.
We have, for any vectors v^1, . . . , v^L,
( Dy^L(u^1, . . . , u^L; y^0)^T r )^T ( v^1; . . . ; v^L ) = r^T Dy^L(u^1, . . . , u^L; y^0) ( v^1; . . . ; v^L ) = r^T w^L
= ( 0; . . . ; 0; r )^T ( w^1; . . . ; w^L ) = ( 0; . . . ; 0; r )^T A^{-1} B ( v^1; . . . ; v^L )
= ( B^T A^{-T} ( 0; . . . ; 0; r ) )^T ( v^1; . . . ; v^L ). (8.33)
Since (8.33) holds for any vector v^1, . . . , v^L,
Dy^L(u^1, . . . , u^L; y^0)^T r = B^T A^{-T} ( 0; . . . ; 0; r ). (8.34)
If we define
p = A^{-T} ( 0; . . . ; 0; r )
and use the definition of A and B in (8.32), the quantity Dy^L(u^1, . . . , u^L; y^0)^T r can be obtained
as follows. Compute
p^L = r, (8.35a)
p^{ℓ−1} = σ_y^ℓ( y^{ℓ−1}, u^ℓ )^T p^ℓ, ℓ = L, . . . , 2, (8.35b)
and set
Dy^L(u^1, . . . , u^L; y^0)^T r = ( σ_u^1( y^0, u^1 )^T p^1; . . . ; σ_u^L( y^{L−1}, u^L )^T p^L ). (8.35c)
In particular, if we define
f(u^1, . . . , u^L) = (1/2) || y^L(u^1, . . . , u^L; y^0) − ŷ^L ||_2^2,
the gradient
∇f(u^1, . . . , u^L) = Dy^L(u^1, . . . , u^L; y^0)^T ( y^L(u^1, . . . , u^L; y^0) − ŷ^L )
can be computed as follows. Compute
p^L = y^L(u^1, . . . , u^L; y^0) − ŷ^L, (8.36a)
p^{ℓ−1} = σ_y^ℓ( y^{ℓ−1}, u^ℓ )^T p^ℓ, ℓ = L, . . . , 2, (8.36b)
and set
∇f(u^1, . . . , u^L) = ( σ_u^1( y^0, u^1 )^T p^1; . . . ; σ_u^L( y^{L−1}, u^L )^T p^L ). (8.36c)
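The recursion (8.36) is the classical backpropagation algorithm. A Matlab sketch, under the assumption that the layer outputs from the forward pass are stored as ystore{l+1} = y^l and that the partial Jacobians are available as function handles Sy{l} and Su{l}:

% Backpropagation (8.36) for f = 0.5*||y^L - yhat||^2 (sketch, placeholders).
% ystore{l+1} = y^l, l = 0,...,L; Sy{l}, Su{l} evaluate the partial Jacobians
% sigma_y^l(y^{l-1},u^l) and sigma_u^l(y^{l-1},u^l); u{l} holds the layer parameters.
p    = ystore{L+1} - yhat;                      % p^L, cf. (8.36a)
grad = cell(L,1);
for l = L:-1:1
    grad{l} = Su{l}(ystore{l}, u{l})' * p;      % block l of the gradient, cf. (8.36c)
    if l > 1
        p = Sy{l}(ystore{l}, u{l})' * p;        % p^{l-1}, cf. (8.36b)
    end
end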
If instead of the least squares functional (8.30) we use another metric φ to quantify the distance
between the output y^L(u^1, . . . , u^L; y^0) and the desired output ŷ^L, we are led to the minimization problem
min_{u^1,...,u^L} φ( y^L(u^1, . . . , u^L; y^0); ŷ^L ).
Its gradient is
∇f(u^1, . . . , u^L) = Dy^L(u^1, . . . , u^L; y^0)^T ∇_{y^L} φ( y^L(u^1, . . . , u^L; y^0); ŷ^L ),
which is computed as in (8.36), now starting the recursion with p^L = ∇_{y^L} φ( y^L(u^1, . . . , u^L; y^0); ŷ^L ); one then sets
∇f(u^1, . . . , u^L) = ( σ_u^1( y^0, u^1 )^T p^1; . . . ; σ_u^L( y^{L−1}, u^L )^T p^L ). (8.38c)
8.3.3. An Example
This example is adapted from [Bri15]. (To be added later.)
where for given function u ∈ L 2 ((0, 1) × (0, T )) the function y(u; ·) is the solution of
(∂/∂t) y(x, t) − ν (∂²/∂x²) y(x, t) + y(x, t) (∂/∂x) y(x, t) = r(x, t) + u(x, t), (x, t) ∈ (0, 1) × (0, T),
y(0, t) = y(1, t) = 0, t ∈ (0, T), (8.39b)
y(x, 0) = y_0(x), x ∈ (0, 1),
where z : (0, 1) × (0, T ) → R, r : (0, 1) × (0, T ) → R, and y0 : (0, 1) → R are given functions
and ω, ν > 0 are given parameters. The parameter ν > 0 is also called the viscosity and the
differential equation (8.39b) is known as the (viscous) Burgers’ equation. The problem (8.39) is
studied, e.g., in [LMT97, Vol01]. As we have mentioned earlier, (8.39) can be viewed as a first step
towards solving optimal control problems governed by the Navier-Stokes equations [Gun03]. More
generally, (8.39) belongs to the class of partial differential equation (PDE) constrained optimization
problems [HPUU09].
In the context of (8.39) the function u is called the control, y is called the state, and (8.39b)
is called the state equation. We do not study the infinite dimensional problem (8.39), but instead
consider a discretization of (8.39).
Now we subdivide the spatial interval $[0,1]$ into $n$ subintervals $[x_{i-1},x_i]$, $i=1,\dots,n$, with $x_i = ih$ and $h = 1/n$. We define the piecewise linear ('hat') functions
$$
\varphi_i(x) = \begin{cases}
h^{-1}\bigl(x-(i-1)h\bigr), & x\in[(i-1)h, ih]\cap[0,1], \\
h^{-1}\bigl(-x+(i+1)h\bigr), & x\in[ih,(i+1)h]\cap[0,1], \\
0, & \text{else},
\end{cases}
\qquad i = 0,\dots,n. \tag{8.41}
$$
[Figure: the hat functions $\varphi_0,\dots,\varphi_5$ on the grid $x_0,\dots,x_5$.]
We approximate the state and the control by
$$ y_h(x,t) = \sum_{j=1}^{n-1} y_j(t)\varphi_j(x) \tag{8.42} $$
and
$$ u_h(x,t) = \sum_{j=0}^{n} u_j(t)\varphi_j(x). \tag{8.43} $$
We set
$$ \vec y(t) = \bigl(y_1(t),\dots,y_{n-1}(t)\bigr)^T \quad\text{and}\quad \vec u(t) = \bigl(u_0(t),\dots,u_n(t)\bigr)^T. $$
If we insert the approximations (8.42), (8.43) into (8.40) and require (8.40) to hold for $\varphi = \varphi_i$, $i = 1,\dots,n-1$, then we obtain the system of ordinary differential equations
$$ M_h\frac{d}{dt}\vec y(t) + A_h\vec y(t) + N_h(\vec y(t)) + B_h\vec u(t) = r_h(t), \qquad t\in(0,T), \tag{8.44} $$
where $M_h, A_h \in \mathbb{R}^{(n-1)\times(n-1)}$, $B_h \in \mathbb{R}^{(n-1)\times(n+1)}$, $r_h(t)\in\mathbb{R}^{n-1}$, and $N_h(\vec y(t))\in\mathbb{R}^{n-1}$ are matrices and vectors determined by the weak form (8.40). Inserting the approximations (8.42), (8.43) into the objective leads to the integrand in (8.45a) below,
where $M_h\in\mathbb{R}^{(n-1)\times(n-1)}$ is defined as before and $Q_h\in\mathbb{R}^{(n+1)\times(n+1)}$, $g_h(t)\in\mathbb{R}^{n-1}$ are a matrix and vector with entries
$$ (Q_h)_{ij} = \int_0^1 \varphi_j(x)\varphi_i(x)\,dx, \qquad (g_h(t))_i = -\int_0^1 z(x,t)\varphi_i(x)\,dx. $$
Thus a semi-discretization of the optimal control problem (8.39) is given by
$$ \min_{\vec u}\ \int_0^T \tfrac12\vec y(t)^T M_h\vec y(t) + (g_h(t))^T\vec y(t) + \tfrac{\omega}{2}\vec u(t)^T Q_h\vec u(t)\,dt, \tag{8.45a} $$
where $\vec y(t)$ is the solution of
$$ M_h\tfrac{d}{dt}\vec y(t) + A_h\vec y(t) + N_h(\vec y(t)) + B_h\vec u(t) = r_h(t), \quad t\in(0,T), \qquad \vec y(0) = \vec y_0, \tag{8.45b} $$
where $\vec y_0 = \bigl(y_0(h),\dots,y_0(1-h)\bigr)^T$.
Using the definition (8.41) of $\varphi_i$, $i=0,\dots,n$, it is easy to compute that
$$
M_h = \frac{h}{6}\begin{pmatrix} 4 & 1 & & & \\ 1 & 4 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 4 & 1 \\ & & & 1 & 4 \end{pmatrix} \in \mathbb{R}^{(n-1)\times(n-1)},
\qquad
A_h = \frac{\nu}{h}\begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix} \in \mathbb{R}^{(n-1)\times(n-1)}.
$$
Later we also need the Jacobian $N_h'(\vec y(t))\in\mathbb{R}^{(n-1)\times(n-1)}$, which is shown in Figure 8.5.
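For concreteness, a possible Matlab sketch for assembling these two matrices (the function name is ours, not from the notes) is:

% Hedged sketch: assemble the tridiagonal mass and stiffness matrices M_h and
% A_h of the piecewise linear basis (8.41) for given n and viscosity nu.
function [Mh, Ah] = assemble_mass_stiffness(n, nu)
  h  = 1/n;
  e  = ones(n-1,1);
  Mh = (h/6)  * spdiags([e 4*e e],   -1:1, n-1, n-1);  % (h/6)*tridiag(1,4,1)
  Ah = (nu/h) * spdiags([-e 2*e -e], -1:1, n-1, n-1);  % (nu/h)*tridiag(-1,2,-1)
end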
To discretize the problem in time, we use the Crank-Nicolson method. We let $0 = t_0 < t_1 < \dots < t_{N+1} = T$ be a partition of the time interval and we define
$$ \Delta t_i = t_{i+1} - t_i, \qquad i = 0,\dots,N. $$
We also introduce $\Delta t_{-1} = \Delta t_{N+1} = 0$.
The fully discretized problem is given by
$$ \min_{\vec u_0,\dots,\vec u_{N+1}}\ \sum_{i=0}^{N+1}\frac{\Delta t_{i-1}+\Delta t_i}{2}\Bigl(\tfrac12\vec y_i^T M_h\vec y_i + (g_h)_i^T\vec y_i + \tfrac{\omega}{2}\vec u_i^T Q_h\vec u_i\Bigr), \tag{8.46a} $$
subject to the Crank-Nicolson discretization (8.46b) of the state equation (8.45b), where $\vec y_0$ is given. We denote the objective function in (8.46a) by $\hat f$ and we set $\mathbf u = (\vec u_0^T,\dots,\vec u_{N+1}^T)^T$.
We call $\mathbf u$ the control, $\mathbf y = (\vec y_1^T,\dots,\vec y_{N+1}^T)^T$ the state, and (8.46b) is called the (discretized) state equation.
As with many applications, the verification that (8.46) satisfies the Assumptions 8.2.1, especially the first and third one, is difficult. If the set $U$ of admissible controls $\mathbf u$ is constrained in a suitable manner and if the parameters $\nu, h, \Delta t_i$ are chosen properly, then it is possible to verify Assumptions 8.2.1. We ignore this issue and continue as if Assumptions 8.2.1 were valid for (8.46). Our numerical experiments indicate that this is fine for our problem setting. We also note that our simple Galerkin finite element method in space produces meaningful results only if the mesh size $h$ is sufficiently small (relative to the viscosity $\nu$ and the size of the solution $y$); otherwise the computed solution exhibits spurious oscillations. Again, for our parameter settings, our discretization is sufficient.
Since the Burgers' equation (8.46b) is quadratic in $\vec y_{i+1}$, the computation of $\vec y_{i+1}$, $i=0,\dots,N$, requires the solution of a system of nonlinear equations. We apply Newton's method to compute the solution $\vec y_{i+1}$ of (8.46b). We use the computed state $\vec y_i$ at the previous time step as the initial iterate in Newton's method.
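A minimal Matlab sketch of this inner Newton iteration (the handle res_and_jac, the tolerance, and the iteration cap are our assumptions, not part of the notes) is:

% Hedged sketch: solve one Crank-Nicolson step (8.46b) for y_{i+1} by
% Newton's method, started from the previous state y_i.
function ynew = cn_step_newton(yold, res_and_jac, tol, maxit)
  ynew = yold;                        % initial iterate: state at previous time step
  for k = 1:maxit
    [F, J] = res_and_jac(ynew);       % residual of (8.46b) and its Jacobian w.r.t. y_{i+1}
    if norm(F) <= tol
      return;
    end
    ynew = ynew - J\F;                % Newton update
  end
end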
The Lagrangian corresponding to (8.46) is
$$
\begin{aligned}
L(\mathbf y,\mathbf u,\boldsymbol\lambda)
= \sum_{i=0}^{N+1}\frac{\Delta t_{i-1}+\Delta t_i}{2}\Bigl(\tfrac12\vec y_i^T M_h\vec y_i + (g_h)_i^T\vec y_i + \tfrac{\omega}{2}\vec u_i^T Q_h\vec u_i\Bigr)
+ \sum_{i=0}^{N}\vec\lambda_{i+1}^T\Bigl[&\Bigl(M_h+\tfrac{\Delta t_i}{2}A_h\Bigr)\vec y_{i+1} + \tfrac{\Delta t_i}{2}N_h(\vec y_{i+1}) + \tfrac{\Delta t_i}{2}B_h\vec u_{i+1} \\
&+\Bigl(-M_h+\tfrac{\Delta t_i}{2}A_h\Bigr)\vec y_i + \tfrac{\Delta t_i}{2}N_h(\vec y_i) + \tfrac{\Delta t_i}{2}B_h\vec u_i
- \tfrac{\Delta t_i}{2}\bigl(r_h(t_i)+r_h(t_{i+1})\bigr)\Bigr].
\end{aligned} \tag{8.47}
$$
The adjoint equations corresponding to (8.11) are obtained by setting the partial derivatives with respect to $\vec y_i$ of the Lagrangian (8.47) to zero and are given by
$$
\begin{aligned}
\Bigl(M_h+\tfrac{\Delta t_N}{2}A_h+\tfrac{\Delta t_N}{2}N_h'(\vec y_{N+1})\Bigr)^T\vec\lambda_{N+1} &= -\tfrac{\Delta t_N}{2}\bigl(M_h\vec y_{N+1}+(g_h)_{N+1}\bigr), \\
\Bigl(M_h+\tfrac{\Delta t_{i-1}}{2}A_h+\tfrac{\Delta t_{i-1}}{2}N_h'(\vec y_i)\Bigr)^T\vec\lambda_i &= -\Bigl(-M_h+\tfrac{\Delta t_i}{2}A_h+\tfrac{\Delta t_i}{2}N_h'(\vec y_i)\Bigr)^T\vec\lambda_{i+1}
- \tfrac{\Delta t_{i-1}+\Delta t_i}{2}\bigl(M_h\vec y_i+(g_h)_i\bigr), \quad i = N,\dots,1,
\end{aligned} \tag{8.48}
$$
where $N_h'(\vec y_i)$ denotes the Jacobian of $N_h(\vec y_i)$. (Recall that $\Delta t_{N+1} = 0$.) Given the solution of (8.48), the gradient of the objective function $\hat f$ can be obtained by computing the partial derivatives with respect to $\vec u_i$ of the Lagrangian (8.47). The gradient is given by
$$
\nabla_{\mathbf u}\hat f(\mathbf u) = \begin{pmatrix}
\omega\frac{\Delta t_0}{2}Q_h\vec u_0 + \frac{\Delta t_0}{2}B_h^T\vec\lambda_1 \\[2pt]
\omega\frac{\Delta t_0+\Delta t_1}{2}Q_h\vec u_1 + B_h^T\bigl(\frac{\Delta t_0}{2}\vec\lambda_1+\frac{\Delta t_1}{2}\vec\lambda_2\bigr) \\[2pt]
\vdots \\[2pt]
\omega\frac{\Delta t_{N-1}+\Delta t_N}{2}Q_h\vec u_N + B_h^T\bigl(\frac{\Delta t_{N-1}}{2}\vec\lambda_N+\frac{\Delta t_N}{2}\vec\lambda_{N+1}\bigr) \\[2pt]
\omega\frac{\Delta t_N}{2}Q_h\vec u_{N+1} + \frac{\Delta t_N}{2}B_h^T\vec\lambda_{N+1}
\end{pmatrix}. \tag{8.49}
$$
(Recall that $\Delta t_{-1} = \Delta t_{N+1} = 0$.)
We summarize the gradient computation using adjoints in the following algorithm.
Of course, if we have computed the solution $\vec y_1,\dots,\vec y_{N+1}$ of the discretized Burgers equation (8.46b) for the given $\vec u_0,\dots,\vec u_{N+1}$ already, then we can skip step 1 in Algorithm 8.4.1. Furthermore, we can assemble the components of the gradient $\nabla_{\mathbf u}\hat f(\mathbf u)$ that depend on $\vec\lambda_{i+1}$ immediately after it has been computed. This way we do not have to store all $\vec\lambda_1,\dots,\vec\lambda_{N+1}$.
We conclude by adapting Algorithm 8.2.4 to our problem. Since the objective function (8.46a) is quadratic and the implicit constraints (8.46b) are quadratic in $\mathbf y$ and linear in $\mathbf u$, most of the second derivative terms are zero. The multiplication of the Hessian $\nabla_{\mathbf u}^2\hat f(\mathbf u)$ times a vector $\mathbf v$ can be performed using Algorithm 8.4.2 below. In step 4 of the following algorithm we use that $N_h(\vec y)$ is quadratic; hence $\frac{d}{d\vec y}\bigl(N_h'(\vec y)^T\vec\lambda\bigr)\vec w = N_h'(\vec w)^T\vec\lambda$.
1. Given $\vec u_1,\dots,\vec u_{N+1}$ and $\vec y_0$, compute $\vec y_1,\dots,\vec y_{N+1}$ as in Step 1 of Algorithm 8.4.1.
3. Compute $\vec w_1,\dots,\vec w_{N+1}$ from
$$ \Bigl(M_h+\tfrac{\Delta t_i}{2}A_h+\tfrac{\Delta t_i}{2}N_h'(\vec y_{i+1})\Bigr)\vec w_{i+1} = \Bigl(M_h-\tfrac{\Delta t_i}{2}A_h-\tfrac{\Delta t_i}{2}N_h'(\vec y_i)\Bigr)\vec w_i + \tfrac{\Delta t_i}{2}B_h(\vec v_i+\vec v_{i+1}), $$
$i = 0,\dots,N$, where $\vec w_0 = 0$.
5. Compute
and $z(x,t) = y_0(x)$, $t\in(0,T)$ (cf. [KV99]). For the discretization we use $n_x = 80$ spatial subintervals and 80 time steps, i.e., $\Delta t = 1/80$.
The solution $y$ of the discretized Burgers' equation (8.46b) with $u(x,t) = 0$ as well as the desired state $z$ are shown in Figure 8.6.
Figure 8.6: Solution of Burgers’ equation with u = 0 (no control) (left) and desired state z (right)
The solution $u^*$ of the optimal control problem (8.39), the corresponding solution $y(u^*)$ of the discretized Burgers' equation (8.46b), and the solution $\lambda(u^*)$ of (8.48) are plotted in Figure 8.7 below.
The convergence history of the Newton-CG method with Armijo line search applied to (8.46a) is shown in Table 8.1. We use the Newton-CG Algorithm with gradient stopping tolerance $\text{gtol} = 10^{-8}$ and compute steps $s_k$ such that $\|\nabla^2\hat f(u_k)s_k + \nabla\hat f(u_k)\| \le \eta_k\|\nabla\hat f(u_k)\|_2$ with $\eta_k = \min\{0.01, \|\nabla\hat f(u_k)\|_2\}$.
Figure 8.7: Optimal control u∗ (upper left), corresponding solution y(u∗ ) of Burgers’ equation
(upper right) and corresponding Lagrange multipliers λ(u∗ ) (bottom)
Table 8.1: Convergence history of a Newton-CG method applied to the solution of (8.39)
8.4.5. Checkpointing
In Algorithm 8.4.1 we note that the state equation is solved forward for the $\vec y_i$'s while the adjoint equation is solved backward for the $\vec\lambda_i$'s. Moreover, the states $\vec y_{N+1},\dots,\vec y_1$ are needed for the computation of the adjoints $\vec\lambda_{N+1},\dots,\vec\lambda_1$. If the size of the state vectors $\vec y_i$ is small enough so that all states $\vec y_1,\dots,\vec y_{N+1}$ can be held in the computer memory, this dependence does not pose a difficulty. However, for many problems, such as flow control problems governed by the unsteady Navier-Stokes equations, the states are too large to hold the entire state history in computer memory. In this case one needs to apply so-called checkpointing techniques.
With checkpointing one trades memory for state re-computations. In a simple scheme one does not keep every state $\vec y_0,\vec y_1,\dots,\vec y_{N+1}$, but only every $M$th state $\vec y_0,\vec y_M,\dots,\vec y_{N+1}$ (here we assume that $N+1$ is an integer multiple of $M$). In the computation of the adjoint variables $\vec\lambda_i$ for $i\in\{kM+1,\dots,(k+1)M-1\}$ and some $k\in\{0,\dots,(N+1)/M\}$ one needs $\vec y_i$, which has not been stored. Therefore, one uses the stored $\vec y_{kM}$ to re-compute $\vec y_{kM+1},\dots,\vec y_{(k+1)M-1}$.
2. Adjoint computation.
   2.1. Compute $\vec\lambda_{N+1}$ by solving
$$ \Bigl(M_h+\tfrac{\Delta t_N}{2}A_h+\tfrac{\Delta t_N}{2}N_h'(\vec y_{N+1})\Bigr)^T\vec\lambda_{N+1} = -\tfrac{\Delta t_N}{2}\bigl(M_h\vec y_{N+1}+(g_h)_{N+1}\bigr). $$
Note that for k = (N +1)/M −1 one really does not need to recompute the states ~y N+2−M , . . . , ~y N
in step 2.2.1, since they are the last states computed in step 1.1. and should be stored there.
Algorithm 8.4.3 requires storage for (N + 1)/M + 1 vectors ~y0, ~y M , . . . , ~y N+1 , for M − 1 vectors
~y k M+1, . . . , ~y(k+1)M−1 computed in step 2.2.1, and for one vector λ~ i .
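To make the bookkeeping concrete, here is a hedged Matlab sketch of this simple checkpointing scheme for an abstract time stepper; the handles forward_step, adjoint_init, and adjoint_step and the assumption that N+1 is a multiple of M are ours, and the gradient accumulation is only indicated.

% Hedged sketch of simple checkpointing: store every M-th state during the
% forward sweep, re-compute the states of one segment at a time during the
% backward (adjoint) sweep.
function adjoint_with_checkpoints(y0, Np1, M, forward_step, adjoint_init, adjoint_step)
  nseg = Np1/M;                              % assumes N+1 is a multiple of M
  chk  = cell(nseg+1,1);  chk{1} = y0;       % checkpoints y_0, y_M, ..., y_{N+1}
  y = y0;
  for i = 1:Np1                              % forward sweep
    y = forward_step(y, i);
    if mod(i, M) == 0, chk{i/M + 1} = y; end
  end
  lam = adjoint_init(chk{nseg+1});           % e.g. lambda_{N+1} from y_{N+1}
  for k = nseg-1:-1:0                        % segments [kM, (k+1)M], last one first
    yseg = cell(M+1,1);  yseg{1} = chk{k+1}; % re-compute y_{kM+1},...,y_{(k+1)M}
    for j = 1:M
      yseg{j+1} = forward_step(yseg{j}, k*M + j);
    end
    for j = M:-1:1                           % adjoint steps within the segment
      lam = adjoint_step(lam, yseg{j}, yseg{j+1}, k*M + j);
      % ...accumulate the gradient contribution of step k*M+j here...
    end
  end
end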
The simple checkpointing scheme used in Algorithm 8.4.3 is not optimal in the sense that given
a certain memory size to store state information it uses too many state re-computations. The issue
of optimal checkpointing is studied in the context of Automatic Differentiation. The so-called
reverse mode automatic differentiation is closely related to gradient computations via the adjoint
method. We refer to [Gri03, Sec. 4] or [GW08] for more details.
8.5. Optimization
In the previous sections we have discussed the computation of gradient and Hessian information for the implicitly constrained optimization problem (8.3), (8.1), (8.2). Thus it seems we should be able to apply a gradient based optimization algorithm, like the Newton-CG Algorithm, to solve the problem. In fact, in the previous section we have used the Newton-CG Algorithm to solve the discretized optimal control problem (8.46). However, there are important issues left to be dealt with. These are perhaps not so obvious when one deals with the algorithms in the previous sections 'on paper', but they become apparent when one actually has to implement the algorithms.
In the $k$-th iteration of the Newton-CG Algorithm we have to compute the gradient $\nabla\hat f(u_k)$, we have to apply the Hessian $\nabla^2\hat f(u_k)$ to a number of vectors, and we have to evaluate the function $\hat f$ at some trial points. In a Matlab implementation of the Newton-CG Algorithm one may require the user to supply three functions
function [f] = fval(u, usr_par)
function [g] = grad(u, usr_par)
function [Hv] = Hessvec(v, u, usr_par)
that evaluate the objective function $\hat f(u)$, evaluate the gradient $\nabla\hat f(u)$, and evaluate the Hessian-times-vector product $\nabla^2\hat f(u)v$, respectively. The last argument usr_par is included to allow the user to pass problem specific parameters to the functions.
Now, if we look at Algorithms 8.2.2, 8.2.3, and 8.2.4 we see that the computation of $\nabla\hat f(u)$ and $\nabla^2\hat f(u)v$ all require the computation of $y(u)$. Furthermore, the computation of $\nabla^2\hat f(u)v$ requires the computation of $\lambda(u)$. Since the computation of $y(u)$ can be expensive, we want to reuse an already computed $y(u)$ rather than recompute $y(u)$ every time fval, grad, or Hessvec is called. Similarly, we want to reuse $\lambda(u)$, which has to be computed as part of the gradient computation in Algorithm 8.2.3, during subsequent calls of Hessvec. Of course, if $u$ changes, we must recompute $y(u)$ and $\lambda(u)$. How can we do this?
If we know precisely what is going on in our optimization algorithm, then $y(u)$ and $\lambda(u)$ can be reused. For example, if we use the Newton-CG Algorithm, then we know that $\hat f(u_k)$ is evaluated before $\nabla\hat f(u_k)$ is computed. Moreover, we know that Hessian-times-vector products $\nabla^2\hat f(u_k)v$ are computed only after $\nabla\hat f(u_k)$ is computed. Thus, in this case, when fval is called, we compute $y(u_k)$ and store it to make it available for reuse in subsequent calls to grad and Hessvec. Similarly, if the gradient is implemented via Algorithm 8.2.3, then when grad is called we compute $\lambda(u_k)$ and store it to make it available for reuse in subsequent calls to Hessvec. This strategy works only because we know that the functions fval, grad, and Hessvec are called in the right order. If the optimization algorithm is changed such that, say, $\nabla\hat f(u_k)$ is computed before $\hat f(u_k)$, the optimization algorithm will fail because it is no longer interfaced correctly with our problem.
We need to find a way that allows us to separate the optimization algorithm (which does not need and should not need to know that the evaluation of our objective function depends on the implicit function $y(u)$) from the particular optimization problem, but allows us to avoid unnecessary recomputations of $y(u)$ and $\lambda(u)$. Such software design issues are extremely important for the efficient implementation of optimization algorithms in which function evaluations may involve expensive simulations. We refer to [BvH04, HV99, PSS09] for more discussion of such issues. In our Matlab implementation we deal with this issue by expanding our interface between optimization algorithm and application slightly.
In our Matlab implementation, we require the user to supply a fourth function, unew. The function unew is called by the optimization algorithm whenever $u$ has been changed and before any of the three functions fval, grad, or Hessvec are called. In our context, whenever unew is called with argument $u$ we compute $y(u)$ and store it to make it available for reuse in subsequent calls to fval, grad, and Hessvec. If the implementer of the optimization algorithm changes the algorithm and, say, requires the computation of $\nabla\hat f(u_k)$ before the computation of $\hat f(u_k)$, then she/he needs to ensure that unew is called with argument $u_k$ before grad is called. This change of the optimization algorithm does not need to be communicated to the user of the optimization algorithm. The interface would still work. This interface is used in a Matlab implementation of the Newton-CG Algorithm and of a limited memory BFGS method which are available at http://www.caam.rice.edu/~heinken/software. The introduction of unew enables us to separate the optimization from the application and to avoid unnecessary recomputations of $y(u)$ and $\lambda(u)$. It is not totally satisfactory, however, since it requires that the optimization algorithm developer implements the use of unew correctly and it requires the application person not to accidentally overwrite information between two calls of unew. These requirements become more difficult to fulfill as the optimization algorithm and applications become more complex. The papers mentioned above discuss other approaches when C++ instead of Matlab is used.
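One hedged way to realize this caching in Matlab is sketched below; the calling convention in which unew returns an updated usr_par, and the helper names solve_state, solve_adjoint, and assemble_gradient, are assumptions for illustration only (each function would live in its own file).

% Hedged sketch of the caching interface: unew refreshes the stored state when
% u changes; grad computes and stores the adjoint for reuse by Hessvec.
function usr_par = unew(u, usr_par)
  usr_par.u      = u;
  usr_par.y      = solve_state(u, usr_par);      % compute y(u) once per new u
  usr_par.lambda = [];                           % adjoint not yet available
end

function [g, usr_par] = grad(u, usr_par)
  if isempty(usr_par.lambda)
    usr_par.lambda = solve_adjoint(usr_par.y, u, usr_par);   % lambda(u)
  end
  g = assemble_gradient(usr_par.y, usr_par.lambda, u, usr_par);
end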
In the simple problem (8.46) we are able to solve the implicit constraints (8.46b) rather accurately. Consequently, even for an optimization stopping tolerance $\text{gtol} = 10^{-8}$ (which arguably is small for our discretization of (8.39)) the Newton-CG Algorithm converges. In other applications the inexactness in the solution of the implicit equation will affect the optimization algorithm even for coarser stopping tolerances gtol.
Table 8.2: Performance of a Newton-CG method with $\text{gtol} = 10^{-12}$ applied to the solution of (8.39). The systems (8.46b) are solved with a residual stopping tolerance of $10^{-2}\min\{h^2,\Delta t^2\}$.
Table 8.3: Performance of a Newton-CG method with $\text{gtol} = 10^{-12}$ applied to the solution of (8.39). The systems (8.46b) are solved with a residual stopping tolerance of $10^{-5}\min\{h^2,\Delta t^2\}$.
The 'hand-tuning' of stopping tolerances for the implicit equation and the optimization algorithm is, of course, very unsatisfactory. Ideally one would like an optimization algorithm that selects these tolerances automatically and allows more inexact and therefore less expensive solves of the implicit equation at the beginning of the optimization iteration. One difficulty is that one cannot compute the error in function and derivative information; one can usually only provide an asymptotic estimate of the form $|\hat f_\varepsilon(u) - \hat f(u)| = O(\varepsilon)$, where $\hat f_\varepsilon$ denotes the objective value obtained with an inexact solve.
There are approaches to handle inexact function and derivative information in optimization algorithms. For example, a general approach to this problem is presented in the book [Pol97]. Additionally, Section 10.6 in [CGT00] and [KHRv14] describe approaches to adjust the accuracy of function values and derivatives in a trust-region method (see also the references in that section). Handling inexactness in optimization algorithms to increase the efficiency of the overall algorithm, by using rough, inexpensive function and derivative information whenever possible while maintaining the robustness of the optimization algorithm, is an important research problem. Although approaches exist, more work remains to be done.
$$
H = \begin{pmatrix}
\nabla_{yy}L(y,u,\lambda) & \nabla_{yu}L(y,u,\lambda) \\
\nabla_{uy}L(y,u,\lambda) & \nabla_{uu}L(y,u,\lambda)
\end{pmatrix}.
$$
The QP (8.50) is almost identical to the QPs (8.21) and (8.24) arising in Newton-type methods for the implicitly constrained problem (8.3), (8.1), (8.2). In the QPs (8.21) and (8.24), $y = y(u)$ and $\lambda = \lambda(u)$, and the right hand side of the constraint is $c(y(u),u) = 0$. This indicates that one step of an SQP method for (8.4) may not be computationally more expensive than one step of a Newton-type method for (8.3), (8.1), (8.2). However, SQP methods profit from the decoupling of the variables $y$ and $u$ and can be significantly more efficient than Newton-type methods for (8.3), (8.1), (8.2), because the latter compute iterates that are on the constraint manifold.
8.6. Problems
Problem 8.1
[Bur40] J. M. Burgers. Application of a model system to illustrate some points of the statistical
theory of free turbulence. Nederl. Akad. Wetensch., Proc., 43:2–12, 1940.
[Dre90] S. E. Dreyfus. Artificial neural networks, back propagation, and the Kelley-Bryson
gradient procedure. J. Guidance Control Dynam., 13(5):926–928, 1990. URL: http:
//dx.doi.org/10.2514/3.25422, doi:10.2514/3.25422.
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/
10.1137/1.9781611971200, doi:10.1137/1.9781611971200.
[FA11] J. A. Fike and J. J. Alonso. The development of hyper-dual numbers for exact second-
derivative calculations. Proceedings, 49th AIAA Aerospace Sciences Meeting including
the New Horizons Forum and Aerospace Exposition. Orlando, Florida, 2011. URL:
https://doi.org/10.2514/6.2011-886, doi:10.2514/6.2011-886.
[FA12] J. A. Fike and J. J. Alonso. Automatic differentiation through the use of hyper-dual num-
bers for second derivatives. In S. Forth, P. Hovland, E. Phipps, J. Utke, and A. Walther,
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org (Accessed April 10, 2017).
[HPUU09] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich. Optimization with PDE Con-
straints, volume 23 of Mathematical Modelling, Theory and Applications. Springer
Verlag, Heidelberg, New York, Berlin, 2009. URL: http://dx.doi.org/10.1007/
978-1-4020-8839-1, doi:10.1007/978-1-4020-8839-1.
[KV99] K. Kunisch and S. Volkwein. Control of Burger’s equation by a reduced order ap-
proach using proper orthogonal decomposition. Journal of Optimization Theory and
Applications, 102:345–371, 1999.
[LMT97] H. V. Ly, K. D. Mease, and E. S. Titi. Distributed and boundary control of the viscous
Burgers’ equation. Numer. Funct. Anal. Optim., 18(1-2):143–188, 1997.
[PSS09] A. D. Padula, S. D. Scott, and W. W. Symes. A software framework for abstract expres-
sion of coordinate-free linear algebra and optimization algorithms. ACM Trans. Math.
Software, 36(2):Art. 8, 36, 2009. URL: https://doi.org/10.1145/1499096.
1499097, doi:10.1145/1499096.1499097.
[Sal86] D. E. Salane. Adaptive routines for forming jacobians numerically. Technical Report
SAND86–1319, Sandia National Laboratories, 1986.
[ST98] W. Squire and G. Trapp. Using complex variables to estimate derivatives of real
functions. SIAM Rev., 40(1):110–112, 1998. URL: https://doi.org/10.1137/
S003614459631241X, doi:10.1137/S003614459631241X.
[Vol01] S. Volkwein. Distributed control problems for the Burgers equation. Comput. Optim.
Appl., 18(2):115–140, 2001.
$$ f(x_k + s) \approx m_k(x_k + s) = f(x_k) + \nabla f(x_k)^T s + \tfrac12 s^T B_k s. $$
We can use the model to compute a new guess $x_{k+1}$. For example, if $B_k$ is symmetric positive definite and if we use a line-search method, then the unique minimizer of the model is $s_k = -B_k^{-1}\nabla f(x_k)$ and the new approximation is $x_{k+1} = x_k + \alpha_k s_k$.
At the new iterate $x_{k+1}$ we build a new model. How should we choose $B_{k+1}$? The matrix $B_{k+1}$ should be symmetric and positive definite, so that the model $m_{k+1}(x_{k+1}+s) = f(x_{k+1}) + \nabla f(x_{k+1})^T s + \tfrac12 s^T B_{k+1}s$ has the unique minimizer $s_{k+1} = -B_{k+1}^{-1}\nabla f(x_{k+1})$. Moreover, we require that the gradients of the model $m_{k+1}$ at $x_k$ and $x_{k+1}$ coincide with the gradients of $f$ at these points. That is, we require
$$ \nabla m_{k+1}(x_{k+1}+s)\big|_{s=x_k-x_{k+1}} = \nabla f(x_k) \quad\text{and}\quad \nabla m_{k+1}(x_{k+1}+s)\big|_{s=0} = \nabla f(x_{k+1}). $$
The latter condition is automatically satisfied for any $B_{k+1}$ by the choice of $m_{k+1}$. The first condition implies
$$ B_{k+1}(x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k). \tag{9.1} $$
The condition (9.1) is known as the secant equation. In the derivation of quasi-Newton methods
the notation
s k = x k+1 − x k , y k = ∇ f (x k+1 ) − ∇ f (x k )
is commonly used and we will adopt it here as well. In this notation the secant equation (9.1) is
given by
Bk+1 s k = y k .
We note that the notation $s_k = x_{k+1} - x_k$ is misleading in the context of line searches: there we have $\alpha_k s_k = x_{k+1} - x_k$, and in this case the formulas below should be applied with $s_k$ replaced by $\alpha_k s_k = x_{k+1} - x_k$.
Since $B_{k+1}\in\mathbb{R}^{n\times n}$ is symmetric, we have $\tfrac12(n+1)n$ entries to determine, but the secant equation $B_{k+1}s_k = y_k$ provides only $n$ equations. There are infinitely many symmetric matrices $B_{k+1}$ that satisfy $B_{k+1}s_k = y_k$. We choose the one that is closest to $B_k$. This leads to the minimization problem (9.2), in which the distance is measured in the weighted Frobenius norm
$$ \||B - B_k\|| = \|M(B - B_k)M\|_F, $$
where $M$ is a symmetric nonsingular matrix and the Frobenius norm of a square matrix $A$ is given by
$$ \|A\|_F = \Bigl(\sum_{i,j=1}^n A_{ij}^2\Bigr)^{1/2}. $$
Theorem 9.2.1 Let $M\in\mathbb{R}^{n\times n}$ be a symmetric nonsingular matrix and let $s, y\in\mathbb{R}^n$ be given vectors with $s\ne0$. If $B\in\mathbb{R}^{n\times n}$ is a symmetric matrix, then the solution of the least change problem (9.3) is given by
$$ B_+ = B + \frac{(y - Bs)c^T + c(y - Bs)^T}{c^T s} - \frac{(y - Bs)^T s}{(c^T s)^2}\,cc^T, \tag{9.4} $$
where $c = M^{-2}s$.
ii. Given $s$ and $c$, there are infinitely many symmetric nonsingular matrices $M$ such that $c = M^{-2}s$. For each of these matrices, the solution of the least change problem (9.3) is given by the same matrix (9.4).
If we use the weighting matrix $M = I$ in (9.2), then application of Theorem 9.2.1 gives the so-called PSB (Powell symmetric Broyden) update
$$ B_{k+1}^{\rm PSB} = B_k + \frac{(y_k - B_k s_k)s_k^T + s_k(y_k - B_k s_k)^T}{s_k^T s_k} - \frac{(y_k - B_k s_k)^T s_k}{(s_k^T s_k)^2}\,s_k s_k^T. \tag{9.5} $$
Choosing instead a weighting matrix $M$ with $M^{-2}s_k = y_k$, i.e., $c = y_k$ in (9.4), gives the DFP (Davidon Fletcher Powell) update
$$ B_{k+1}^{\rm DFP} = B_k + \frac{(y_k - B_k s_k)y_k^T + y_k(y_k - B_k s_k)^T}{y_k^T s_k} - \frac{(y_k - B_k s_k)^T s_k}{(y_k^T s_k)^2}\,y_k y_k^T, \tag{9.6} $$
which can be rewritten as
$$ B_{k+1}^{\rm DFP} = (I - \rho_k y_k s_k^T)B_k(I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T, \tag{9.7a} $$
where
$$ \rho_k = \frac{1}{y_k^T s_k}. \tag{9.7b} $$
Since the model $m_{k+1}$ should have a unique minimizer, we want $B_{k+1}$ to be symmetric positive definite. The following theorem characterizes when the DFP update is symmetric positive definite.
Theorem 9.2.3 Let $B_k\in\mathbb{R}^{n\times n}$ be symmetric positive definite and $s_k\ne0$. The matrix $B_{k+1}^{\rm DFP}$ is symmetric positive definite if and only if $y_k^T s_k > 0$.
Proof: i. Let $B_{k+1}^{\rm DFP}$ be symmetric positive definite. Since $B_{k+1}^{\rm DFP}$ satisfies the secant equation $B_{k+1}^{\rm DFP}s_k = y_k$, we have $0 < s_k^T B_{k+1}^{\rm DFP}s_k = s_k^T y_k$.
ii. Let $y_k^T s_k > 0$. For any $v\ne0$ we have
$$ v^T B_{k+1}^{\rm DFP}v = v^T(I - \rho_k y_k s_k^T)B_k(I - \rho_k s_k y_k^T)v + \rho_k v^T y_k y_k^T v = w^T B_k w + \rho_k(v^T y_k)^2, $$
where $w = v - (\rho_k y_k^T v)s_k$. Since $\rho_k = 1/(y_k^T s_k) > 0$ and $B_k$ is symmetric positive definite, $v^T B_{k+1}^{\rm DFP}v = w^T B_k w + \rho_k(v^T y_k)^2 \ge 0$. The right hand side is zero if and only if $w^T B_k w = 0$ and $v^T y_k = 0$. However, if $v^T y_k = 0$, then $w = v - (\rho_k y_k^T v)s_k = v \ne 0$ and $w^T B_k w = v^T B_k v > 0$. Hence, $v^T B_{k+1}^{\rm DFP}v > 0$ for all $v\ne0$.
If $B_k$ is symmetric positive definite with known inverse and if $y_k^T s_k > 0$, then we can compute the inverse of $B_{k+1}^{\rm DFP}$ using the Sherman-Morrison-Woodbury formula.
i. Let $u, v\in\mathbb{R}^n$ and let $A\in\mathbb{R}^{n\times n}$ be nonsingular. Then $A + uv^T$ is nonsingular if and only if
$$ 1 + v^T A^{-1}u \equiv \sigma \ne 0. $$
In this case
$$ (A + uv^T)^{-1} = A^{-1} - \frac{1}{\sigma}A^{-1}uv^T A^{-1}. $$
ii. Let $U, V\in\mathbb{R}^{n\times m}$, $m < n$, and assume that $A\in\mathbb{R}^{n\times n}$ is nonsingular. Then $A + UV^T$ is nonsingular if and only if
$$ I + V^T A^{-1}U \equiv \Sigma \in\mathbb{R}^{m\times m} $$
is invertible. In this case
$$ (A + UV^T)^{-1} = A^{-1} - A^{-1}U\,\Sigma^{-1}V^T A^{-1}. $$
If $B_k$ is symmetric positive definite with inverse $H_k = B_k^{-1}$ and if $y_k^T s_k > 0$, then the inverse $H_{k+1}^{\rm DFP}$ of the DFP update $B_{k+1}^{\rm DFP}$ in (9.6) is given by
$$ H_{k+1}^{\rm DFP} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}. \tag{9.8} $$
We have developed the DFP update (9.6) using the least change principle applied to the replacement $B_{k+1}$ of the Hessian. If $B_k$ is invertible with inverse $H_k$, then we can try to update the inverse $H_k$ to obtain $H_{k+1}$, a replacement for the inverse of the Hessian at $x_{k+1}$. The matrix $H_{k+1}$ should be symmetric and it should satisfy the secant equation $H_{k+1}y_k = s_k$. In addition, we require that $H_{k+1}$ is close to $H_k$. Thus we update the inverse by solving (9.9).
Of course the problem (9.9) is of the same type as the problem (9.3) and everything that we have derived so far can be applied to solve (9.9). We only have to change the notation $B_k \to H_k$, $s_k \to y_k$, and $y_k \to s_k$. If in (9.9) we use the weighted Frobenius norm with a symmetric nonsingular weighting matrix $M$ such that $M^{-2}y_k = s_k$, the solution leads to the BFGS (Broyden Fletcher Goldfarb Shanno) update
$$ H_{k+1}^{\rm BFGS} = (I - \rho_k s_k y_k^T)H_k(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \tag{9.10a} $$
where
$$ \rho_k = 1/(y_k^T s_k). \tag{9.10b} $$
The following result corresponds to Theorem 9.2.3 and equation (9.8).
Theorem 9.2.5 Let $H_k\in\mathbb{R}^{n\times n}$ be symmetric positive definite and $y_k^T s_k > 0$. The matrix $H_{k+1}^{\rm BFGS}$ is symmetric positive definite, and its inverse $B_{k+1}^{\rm BFGS} = (H_{k+1}^{\rm BFGS})^{-1}$ is given by
$$ B_{k+1}^{\rm BFGS} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. \tag{9.11} $$
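As a small illustration, a hedged Matlab sketch of one application of the inverse update (9.10) (the function name is ours) is:

% Hedged sketch: apply the inverse BFGS update (9.10) to H given the pair (s, y).
% Requires the curvature condition y'*s > 0.
function Hnew = bfgs_inverse_update(H, s, y)
  rho  = 1/(y'*s);
  n    = numel(s);
  V    = eye(n) - rho*(y*s');          % I - rho*y*s'
  Hnew = V'*H*V + rho*(s*s');          % (I-rho*s*y')*H*(I-rho*y*s') + rho*s*s'
end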
9.3.2. Line-Search
One step in a quasi-Newton method with line search is given as follows. Suppose we have given an approximation $x_k$ of a (local) minimizer and a symmetric positive definite matrix $H_k$ that replaces the inverse of the Hessian. Hence, our model of $f(x_k + s)$ is
$$ m_k(x_k + s) = f(x_k) + \nabla f(x_k)^T s + \tfrac12 s^T H_k^{-1}s. $$
Update $B_k$ using
$$ B_{k+1} = B_k - \frac{B_k(x_{k+1}-x_k)(x_{k+1}-x_k)^T B_k}{(x_{k+1}-x_k)^T B_k(x_{k+1}-x_k)} + \frac{r_k r_k^T}{r_k^T(x_{k+1}-x_k)}. \tag{9.12} $$
The update (9.12) is just the standard BFGS update with $y_k$ replaced by $r_k$. When $\theta_k \ne 1$, then
$$ (x_{k+1}-x_k)^T r_k = 0.2\,(x_{k+1}-x_k)^T B_k(x_{k+1}-x_k) > 0. $$
If we have computed
$$ r_k = H_k q_k, $$
which can be done using the same steps as used in the computation of $H_{k+1}q_{k+1}$, then this leads to a recursion for the computation of $r_{k+1} = H_{k+1}q_{k+1}$, which is summarized in the following algorithm.
Algorithm 9.3.1
Compute $r_{k+1} = H_{k+1}q_{k+1}$ for a given $q_{k+1}$.
1. For $i = k,\dots,0$ do
   a. $\alpha_i = \rho_i s_i^T q_{i+1}$.
   b. $q_i = q_{i+1} - \alpha_i y_i$.
2. $r_0 = H_0 q_0$.
3. For $i = 0,\dots,k$ do
   a. $\beta_i = \rho_i y_i^T r_i$.
   b. $r_{i+1} = r_i + (\alpha_i - \beta_i)s_i$.
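A hedged Matlab sketch of this two-loop recursion (cf. [Noc80]); the storage layout with the pairs $(s_i, y_i)$ as matrix columns is our choice:

% Hedged sketch: compute r = H_{k+1}*q from H_0 and stored pairs (s_i, y_i),
% i = 0,...,k, via the two-loop recursion of Algorithm 9.3.1.
function r = lbfgs_two_loop(q, S, Y, H0)
  m     = size(S,2);                   % number of stored pairs
  alpha = zeros(m,1);  rho = zeros(m,1);
  for i = m:-1:1                       % step 1: backward loop
    rho(i)   = 1/(Y(:,i)'*S(:,i));
    alpha(i) = rho(i)*(S(:,i)'*q);
    q        = q - alpha(i)*Y(:,i);
  end
  r = H0*q;                            % step 2
  for i = 1:m                          % step 3: forward loop
    beta = rho(i)*(Y(:,i)'*r);
    r    = r + (alpha(i) - beta)*S(:,i);
  end
end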
9.4. Problems
The problem
$$ \min\ \|B - A\|_F \quad\text{subject to } B = B^T $$
is solved by $B = \tfrac12(A + A^T)$.
References
[DM77] J. E. Dennis, Jr. and J. J. Moré. Quasi–Newton methods, motivation and theory. SIAM
Review, 19:46–89, 1977.
[DS83] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. Prentice-Hall, Englewood Cliffs, N. J, 1983. Republished
as [DS96].
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/10.
1137/1.9781611971200, doi:10.1137/1.9781611971200.
[JS04] F. Jarre and J. Stoer. Optimierung. Springer Verlag, Berlin, Heidelberg, New-York, 2004.
[MS79] H. Matthies and G. Strang. The solution of nonlinear finite element equations. Internat.
J. Numer. Methods Engrg., 14:1613–1626, 1979.
[Noc80] J. Nocedal. Updating quasi-Newton matrices with limited storage. Math. Comp.,
35(151):773–782, 1980.
371
Chapter
10
Newton’s Method
10.1 Derivation of Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
10.2 Local Q-Quadratic Convergence of Newton’s Method . . . . . . . . . . . . . . . . 375
10.3 Modifications of Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . 376
10.3.1 Divided Difference Newton Methods . . . . . . . . . . . . . . . . . . . . 376
10.3.2 The Chord Method and the Shamanskii Method . . . . . . . . . . . . . . . 377
10.3.3 Inexact Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
10.4 Truncation of the Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
10.5 Newton’s Method and Fixed Point Iterations . . . . . . . . . . . . . . . . . . . . . 380
10.6 Kantorovich and Mysovskii Convergence Theorems . . . . . . . . . . . . . . . . . 381
10.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
Proof: Apply the fundamental theorem of calculus to the functions φi (t) = Fi (y + t(x − y)), i =
1, . . . , n, on [0, 1].
and
$$ \frac{1}{2\|F'(x_*)^{-1}\|}\|x - x_*\| \le \|F(x)\| \le 2\|F'(x_*)\|\,\|x - x_*\|. \tag{10.4} $$
$$ \lim_{k\to\infty}x_k = x_*, $$
Proof: Let $\varepsilon_1$ be the parameter determined by Lemma 10.2.2 and let $\sigma\in(0,1)$. We will show by induction that if
$$ \varepsilon = \min\{\varepsilon_1,\ \sigma/(\|F'(x_*)^{-1}\|L)\} $$
and $\|x_0 - x_*\| < \varepsilon$, then
$$ \|x_{k+1} - x_*\| \le L\|F'(x_*)^{-1}\|\,\|x_k - x_*\|^2 < \sigma\|x_k - x_*\| < \varepsilon \tag{10.5} $$
for all iterates $x_k$.
If $\|x_0 - x_*\| < \varepsilon$, then
$$ x_1 - x_* = x_0 - x_* - F'(x_0)^{-1}F(x_0) = F'(x_0)^{-1}\int_0^1\bigl(F'(x_0) - F'(x_* + t(x_0 - x_*))\bigr)(x_0 - x_*)\,dt. $$
Using (10.3) and the Lipschitz continuity of $F'$ we obtain
$$ \|x_1 - x_*\| \le 2L\|F'(x_*)^{-1}\|\,\|x_0 - x_*\|^2/2 = L\|F'(x_*)^{-1}\|\,\|x_0 - x_*\|^2 < \sigma\|x_0 - x_*\| < \varepsilon. $$
This proves (10.5) for $k = 0$. The induction step can be proven analogously and is therefore omitted.
Since $\sigma < 1$ and
$$ \|x_{k+1} - x_*\| < \sigma\|x_k - x_*\| < \dots < \sigma^{k+1}\|x_0 - x_*\|, $$
we find that $\lim_{k\to\infty}x_k = x_*$. The q-quadratic convergence rate follows from (10.5) with $c = L\|F'(x_*)^{-1}\|$.
One can easily modify the results in Section 5.2.3 to analyze Newton’s method for systems of
nonlinear equations with inexact function evaluations.
For k = 0, . . .
Compute and factor F 0 (x km ).
For j = 0, . . . , m − 1
Check truncation criteria.
Solve F 0 (x km )s km+ j = −F (x km+ j ).
Set x km+ j+1 = x km+ j + s km+ j .
End
End
A special case of the Shamanskii Method is the Chord Method. Here the Jacobian $F'(x_k)$ is computed and factored only once in the initial iteration. This case is included in the previous algorithm if we formally set $m = \infty$. For $m = 1$ we obtain Newton's method. The convergence rates for the Shamanskii method are derived in [Pol97, Sec. 1.4.5]. These methods can be viewed as Newton-like iterations of the form
$$ x_{k+1} = x_k - A_k^{-1}F(x_k). $$
For $k = 0,\dots$
  Check truncation criteria.
  Compute $s_k$ such that $\|F'(x_k)s_k + F(x_k)\| \le \eta_k\|F(x_k)\|$.
  Set $x_{k+1} = x_k + s_k$.
End
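A hedged Matlab sketch of this inexact Newton iteration, with GMRES as the inner solver and a simple forcing-term choice that is our assumption rather than the notes':

% Hedged sketch of the inexact Newton method: the Newton system is solved only
% up to the relative residual eta_k (here with unpreconditioned GMRES).
function x = inexact_newton(F, Fprime, x, tol, maxit)
  for k = 1:maxit
    Fx = F(x);
    if norm(Fx) <= tol
      return;
    end
    eta = min(0.5, sqrt(norm(Fx)));            % forcing term (an assumption)
    s   = gmres(Fprime(x), -Fx, [], eta, numel(x));
    x   = x + s;
  end
end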
The following theorem analyzes the convergence of the inexact Newton method. It mirrors Theorem 5.3.1.
Theorem 10.3.3 Let $D\subset\mathbb{R}^n$ be an open set and let $F : D\to\mathbb{R}^n$ be differentiable on $D$ with $F'\in{\rm Lip}_L(D)$. Moreover, let $x_*\in D$ be a root of $F$ and let $F'(x_*)$ be nonsingular.
If $4\eta\|F'(x_*)^{-1}\|\,\|F'(x_*)\| < 1$ and if $\{\eta_k\}$ satisfies $0 < \eta_k \le \eta$, then for all $\sigma\in(4\eta\|F'(x_*)^{-1}\|\,\|F'(x_*)\|, 1)$ there exists an $\varepsilon > 0$ such that the inexact Newton method 10.3.2 with starting point $x_0\in B_\varepsilon(x_*)$ generates iterates $x_k$ which converge to $x_*$,
$$ \lim_{k\to\infty}x_k = x_*, $$
and which obey
$$ \|x_{k+1} - x_*\| \le L\|F'(x_*)^{-1}\|\,\|x_k - x_*\|^2 + 4\eta_k\|F'(x_*)^{-1}\|\,\|F'(x_*)\|\,\|x_k - x_*\| \le \sigma\|x_k - x_*\| $$
for all $k$.
Proof: Let $\varepsilon_1 > 0$ be the parameter given by Lemma 10.2.2. Furthermore, let $\sigma\in(4\eta\|F'(x_*)^{-1}\|\,\|F'(x_*)\|, 1)$ be arbitrary and let
$$ \varepsilon = \min\bigl\{\varepsilon_1,\ 2\bigl(\sigma - 4\eta\|F'(x_*)^{-1}\|\,\|F'(x_*)\|\bigr)/(L\|F'(x_*)^{-1}\|)\bigr\}. $$
We set $r_k = -F(x_k) - F'(x_k)s_k$. If $\|x_k - x_*\| < \varepsilon$, then
$$ x_{k+1} - x_* = x_k - x_* + s_k = x_k - x_* - F'(x_k)^{-1}\bigl[F(x_k) + r_k\bigr]
= x_k - x_* - F'(x_k)^{-1}\bigl[F(x_k) - F(x_*)\bigr] - F'(x_k)^{-1}r_k
= F'(x_k)^{-1}\int_0^1\bigl[F'(x_k) - F'(x_* + t(x_k - x_*))\bigr](x_k - x_*)\,dt - F'(x_k)^{-1}r_k. $$
Taking norms yields
$$ \|x_{k+1} - x_*\| \le \frac{L}{2}\|F'(x_k)^{-1}\|\,\|x_k - x_*\|^2 + \eta_k\|F'(x_k)^{-1}\|\,\|F(x_k)\|
\le L\|F'(x_*)^{-1}\|\,\|x_k - x_*\|^2 + 4\eta_k\|F'(x_*)^{-1}\|\,\|F'(x_*)\|\,\|x_k - x_*\|
\le \bigl(L\|F'(x_*)^{-1}\|\,\|x_k - x_*\| + 4\eta\|F'(x_*)^{-1}\|\,\|F'(x_*)\|\bigr)\|x_k - x_*\|
\le \sigma\|x_k - x_*\|. $$
if we set $G(x) = x - F'(x)^{-1}F(x)$. Convergence of the fixed point iteration (10.8) can be proven if $D\subset\mathbb{R}^n$ and if $G : D\to D$ is a contraction mapping, i.e., if there exists $\gamma < 1$ such that
$$ \|G(x) - G(y)\| \le \gamma\|x - y\| \qquad \forall\, x, y\in D. $$
Theorem 10.5.1 (Banach Fixed Point Theorem) Let $D\subset\mathbb{R}^n$ be closed and let $G : D\to\mathbb{R}^n$ be a contraction mapping on $D$ such that $G(x)\in D$ for all $x\in D$. There exists a unique fixed point $x_*$ of $G$ in $D$, and for all $x_0\in D$ the sequence $\{x_k\}$ generated by the fixed point iteration (10.8) converges q-linearly with q-factor $\gamma$ to the fixed point $x_*$, i.e.,
$$ \|x_{k+1} - x_*\| \le \gamma\|x_k - x_*\|. $$
Proof: Consider the sequence $\{x_k\}$ generated by $x_{k+1} = G(x_k)$ with $x_0\in D$. First note that
$$ \|x_{k+1} - x_k\| = \|G(x_k) - G(x_{k-1})\| \le \gamma\|x_k - x_{k-1}\| \le \dots \le \gamma^k\|x_1 - x_0\| $$
and
$$ \|x_{k+\ell} - x_k\| \le \sum_{i=1}^{\ell}\|x_{k+i} - x_{k+i-1}\| \le \sum_{i=1}^{\ell}\gamma^{k+i-1}\|x_1 - x_0\| = \gamma^k\frac{1-\gamma^\ell}{1-\gamma}\|x_1 - x_0\| \le \gamma^k\frac{1}{1-\gamma}\|x_1 - x_0\| \qquad \forall\, k, \ell. $$
Thus, given any $\epsilon > 0$,
$$ \|x_{k+\ell} - x_k\| < \epsilon \qquad \forall\, k > \ln\Bigl(\frac{(1-\gamma)\epsilon}{\|x_1 - x_0\|}\Bigr)\Big/\ln\gamma,\ \ \ell \ge 1. $$
Thus the sequence $\{x_k\}$ is a Cauchy sequence and therefore has a limit, $\lim_{k\to\infty}x_k = x_*$. By continuity of $G$,
$$ x_* = \lim_{k\to\infty}x_k = G\bigl(\lim_{k\to\infty}x_k\bigr) = G(x_*), $$
which means the limit is a fixed point. The q-linear convergence of the sequence follows from $\|x_{k+1} - x_*\| = \|G(x_k) - G(x_*)\| \le \gamma\|x_k - x_*\|$. If $y_*$ is another fixed point of $G$ in $D$, then
$$ \|x_* - y_*\| = \|G(x_*) - G(y_*)\| \le \gamma\|x_* - y_*\|, $$
which, since $\gamma < 1$, implies $x_* = y_*$, i.e., the fixed point is unique.
Proof: For an arbitrary $\gamma$ with $\|G'(x_*)\| < \gamma < 1$ choose $\varepsilon$ so that $B_\varepsilon(x_*)\subset O$ and
$$ \|G'(x)\| \le \gamma \qquad \forall\, x\in B_\varepsilon(x_*). $$
Application of the Banach Fixed Point Theorem with $D = B_\varepsilon(x_*)$ implies the desired result.
We can use the previous result applied to $G(x) = x - F'(x)^{-1}F(x)$ to prove the local q-superlinear convergence of Newton's method. See Problem 10.9.
$$ \|F'(x_0)^{-1}\| \le \beta, \qquad \|F'(x_0)^{-1}F(x_0)\| \le \eta. $$
Define $\alpha = L\beta\eta$. If $\alpha \le 1/2$ and $r \ge r_0 \equiv \bigl(1 - \sqrt{1-2\alpha}\bigr)/(\beta L)$, then the sequence $\{x_k\}$ produced by Newton's method is well defined and converges to $x_*$, a unique zero of $F$ in $\overline{B_{r_0}(x_0)}$, and
$$ \|x_k - x_*\| \le (2\alpha)^{2^k}\,\frac{\eta}{\alpha}, \qquad k = 0, 1, \dots \tag{10.9} $$
For a proof of the Kantorovich Theorem see e.g. [KA64], [OR00, Sec. 12], or [Den71].
It is important to notice that the Kantorovich Theorem establishes the existence and local uniqueness of a solution of $F(x) = 0$. It only requires smoothness properties of $F$ on $B_r(x_0)$ and estimates of $\|F'(x_0)^{-1}\|$ and $\|F'(x_0)^{-1}F(x_0)\|$ at a single point $x_0$. This aspect of the Kantorovich Theorem is useful in many situations. See also Problem 10.7v. On the other hand, the Kantorovich Theorem only predicts r-quadratic convergence of the iterates, which is inferior to the q-quadratic convergence predicted by Theorem 10.2.3.
An important property of Newton's method is its scale invariance. Consider the differentiable function $F : \mathbb{R}^n\to\mathbb{R}^n$. Suppose we transform the variables $x$ and $F(x)$ by
$$ \hat x = D_x x, \qquad \hat F(\hat x) = D_F F(x), $$
where $D_F, D_x$ are nonsingular matrices. Instead of solving $F(x) = 0$, we solve the equivalent system $\hat F(\hat x) = 0$, where
$$ \hat F(\hat x) = D_F F(D_x^{-1}\hat x) = D_F F(x). $$
Let $\{\hat x_k\}$ be the sequence of Newton iterates for the function $\hat F$ with starting value $\hat x_0 = D_x x_0$. It is not difficult to show, see Problem 10.3, that for all $k$
$$ \hat x_k = D_x x_k. $$
Thus, Newton’s method is invariant with respect to scaling of the function F. Moreover, if the
initial iterate is scaled by the matrix D x , then all subsequent Newton iterates are scaled by the same
matrix. The invariance of Newton’s method with respect to the scaling of the function F is not
reflected in the previous convergence Theorems 10.2.3 or 10.6.1. The Lipschitz constant L depends
on the choice of the scaling matrix DF . Thus, if the scaling matrix DF is varied, the convergence
Theorems 10.2.3 and 10.6.1 predict a different convergence behavior, although the previous analysis
shows that Newton’s method does produce the same iterates, i.e., does not change. Therefore, affine
invariant convergence theorems have been introduced in [DH79] and slightly refined in [Boc88].
See also [Deu04]. The following theorem is due to [DH79], [Boc88] and is an extension of the
Newton–Mysovskii Theorem in, e.g., [KA64], [OR00, Thm. 12.4.6].
If $x_0\in D$ is such that
$$ \|F'(x_0)^{-1}F(x_0)\| \le \alpha $$
and
$$ h \stackrel{\rm def}{=} \alpha\omega/2 < 1, $$
then the Newton iterates $x_k$ remain in the ball of radius $\alpha\sum_{j=0}^{\infty}h^{2^j-1} \le \alpha/(1-h)$ around $x_0$, converge to a zero $x_*$ of $F$, and satisfy
$$ \|x_k - x_*\| \le \sigma_k\,\|x_k - x_{k-1}\|^2, \tag{10.12} $$
where
$$ \sigma_k = (\omega/2)\sum_{j=0}^{\infty}h^{2^k(2^j-1)} \le \frac{\omega/2}{1 - h^{2^k}}. $$
$$ x_{k+1} - x_k = -F'(x_k)^{-1}F(x_k) = -F'(x_k)^{-1}\bigl(F(x_k) - F(x_{k-1}) - F'(x_{k-1})(x_k - x_{k-1})\bigr)
= -F'(x_k)^{-1}\int_0^1\bigl[F'(x_{k-1} + t(x_k - x_{k-1})) - F'(x_{k-1})\bigr](x_k - x_{k-1})\,dt. $$
Next, we prove
$$ \|x_k - x_0\| \le \alpha\sum_{j=0}^{k-1}h^{2^j-1} \tag{10.13} $$
by induction. For $k = 1$,
$$ \|x_1 - x_0\| = \|F'(x_0)^{-1}F(x_0)\| \le \alpha. $$
Now, assume that (10.13) holds for $k$. We have $\|x_{k+1} - x_0\| \le \|x_{k+1} - x_k\| + \|x_k - x_0\|$, where, by the induction hypothesis,
$$ \|x_k - x_0\| \le \alpha\sum_{j=0}^{k-1}h^{2^j-1} $$
and, by (10.11),
$$ \|x_{k+1} - x_k\| \le (\omega/2)\|x_k - x_{k-1}\|^2 \le \prod_{j=0}^{k-1}(\omega/2)^{2^j}\,\underbrace{\|x_1 - x_0\|^{2^k}}_{\le\,\alpha^{2^k}} \le (\omega/2)^{2^k-1}\alpha^{2^k-1}\,\alpha = h^{2^k-1}\alpha. \tag{10.14} $$
Thus,
$$ \|x_{k+1} - x_0\| \le \|x_{k+1} - x_k\| + \|x_k - x_0\| \le \alpha h^{2^k-1} + \alpha\sum_{j=0}^{k-1}h^{2^j-1} = \alpha\sum_{j=0}^{k}h^{2^j-1}, $$
$$ \|x_{k+l} - x_k\| \le \sum_{j=0}^{l-1}\|x_{k+j+1} - x_{k+j}\| \le \sum_{j=0}^{l-1}\frac{\omega}{2}\|x_{k+j} - x_{k+j-1}\|^2. \tag{10.15} $$
Now we apply (10.11) to the term $\|x_{k+j} - x_{k+j-1}\|^2$ and use (10.14) to estimate
$$
\begin{aligned}
\|x_{k+j} - x_{k+j-1}\|^2 &\le (\omega/2)^2\|x_{k+j-1} - x_{k+j-2}\|^4 \le \dots \le \prod_{i=1}^{j}(\omega/2)^{2^i}\,\|x_k - x_{k-1}\|^{2^{j+1}} \\
&= (\omega/2)^{2^{j+1}-2}\,\|x_k - x_{k-1}\|^{2^{j+1}-2}\,\|x_k - x_{k-1}\|^2 \\
&\le (\omega/2)^{2^{j+1}-2}\bigl(h^{2^{k-1}-1}\alpha\bigr)^{2^{j+1}-2}\,\|x_k - x_{k-1}\|^2
= \Bigl(\frac{\alpha\omega}{2}h^{2^{k-1}-1}\Bigr)^{2^{j+1}-2}\|x_k - x_{k-1}\|^2 \\
&= h^{2^{k-1}(2^{j+1}-2)}\,\|x_k - x_{k-1}\|^2 = h^{2^k(2^j-1)}\,\|x_k - x_{k-1}\|^2.
\end{aligned}
$$
This yields $F(x_*) = 0$.
The previous theorem and the Kantorovich theorem provide existence results. Theorem 10.6.2 is affine invariant, but one has to assume invertibility of $F'(x)$ on $D$. In the Kantorovich Theorem 10.6.1 the invertibility of $F'(x_k)$ is one of the results. Note that Theorem 10.6.2 does not state the uniqueness of the zero $x_*$. Like the Kantorovich theorem, the previous theorem establishes the r-quadratic convergence of Newton's method.
10.7. Problems
ii. Let {x k } be the iterates generated by Newton’s method. Show that f (x k ) ≥ 0 for all k ≥ 1.
(Hint: Use Theorem 4.4.3 i.)
iv. x.
$$ x_{k+1} = x_k - A_k^{-1}F(x_k), \tag{10.17} $$
where $A_k\in\mathbb{R}^{n\times n}$ is an invertible matrix which approximates $F'(x_k)$. Prove the following theorem.
and
$$ \|A_k^{-1}\bigl(A_k - F'(x_k)\bigr)\| \le \alpha_k \le \alpha < 1, $$
then there exists an $\varepsilon > 0$ such that the generalized Newton method (10.17) with starting point $x_0\in B_\varepsilon(x_*)$ generates iterates $x_k$ which converge to $x_*$,
$$ \lim_{k\to\infty}x_k = x_*, $$
$$ \hat F(\hat x) = D_F F(D_x^{-1}\hat x) = D_F F(x). $$
Here $D_x$ and $D_F$ are nonsingular matrices. Suppose that for a given starting vector $x_0$ the sequence $\{x_k\}$ of Newton iterates for the function $F$ is well defined. Let $\{\hat x_k\}$ be the sequence of Newton iterates for the function $\hat F$ with starting value $\hat x_0 = D_x x_0$. Show that for all $k$
$$ \hat x_k = D_x x_k. $$
i. Prove that the convergence rate is at least q-linear and derive the q-linear factor
$$ \limsup_{k\to\infty}\frac{\|x_{k+1} - x_*\|}{\|x_k - x_*\|}. $$
ii. Prove that the convergence rate is at least superlinear if and only if $\lim_{k\to\infty}\alpha_k = 1$.
iii. Let $\alpha_k\ne1$ for all $k$. Is it possible for $\{x_k\}$ to converge quadratically? Prove your assertion.
is used to solve exit distribution problems in radiative transfer. The goal is to find a function $H$ such that (10.18) is satisfied.
To solve the problem numerically, we discretize the integrals by the composite mid-point rule,
$$ \int_0^1 g(\nu)\,d\nu \approx \frac1N\sum_{j=1}^N g(\nu_j), $$
i. Solve this system numerically using Newton's method with finite difference Jacobian approximations.
where $H^k$ denotes the $k$th iterate, or if the number of iterations exceeds 20.
– Perform two runs, one with $c = 0.9$ and the other with $c = 0.9999$. In both runs use $N = 100$ subintervals for the discretization of the integral and the starting value $H_j^0 = 1$, $j = 1,\dots,N$.
– Output a table that shows the iteration number $k$, $\|F(H^k)\|_\infty$, $\|F(H^k)\|_\infty/\|F(H^{k-1})\|_\infty$, and $\|H^k - H^{k-1}\|_\infty$.
– Turn in the program source codes, the tables generated by the program, and a plot of the approximate solution $H(\mu)$ of (10.18).
– different starting values ((10.18) and (10.19) have two different solutions, only one of which is physically meaningful)
– different parameters $t$ in the finite difference approximations
Problem 10.6 Let $A\in\mathbb{R}^{n\times n}$ and assume there exist a diagonal matrix $D\in\mathbb{R}^{n\times n}$ and a nonsingular matrix $V\in\mathbb{R}^{n\times n}$ such that $A = VDV^{-1}$.
The problem of finding an eigenvalue $\lambda_*$ and a corresponding eigenvector $v_*$ with unit length of $A$ can be formulated as a root finding problem
$$ F(v,\lambda) = 0, \tag{10.20} $$
where $F : \mathbb{R}^n\times\mathbb{R}\to\mathbb{R}^n\times\mathbb{R}$ is given by
$$ F(v,\lambda) = \begin{pmatrix} Av - \lambda v \\ \tfrac12 v^T v - \tfrac12 \end{pmatrix}. $$
(i) Formulate Newton's method for the computation of $(v_*,\lambda_*)$.
(ii) Show that the Jacobian $F'(v_*,\lambda_*)$ is nonsingular if $\lambda_*$ is a simple eigenvalue.
(iii) Show that the Jacobian $F'(\cdot,\cdot)$ is Lipschitz continuous.
(iv) Prove the local q-quadratic convergence of Newton's method for the solution of (10.20) under the assumption that $\lambda_*$ is a simple eigenvalue.
(v) Apply this method to compute an eigenvalue of
$$ A = \frac{1}{h^2}\begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}\in\mathbb{R}^{n\times n}, \qquad h = 1/(n+1). $$
Use $n = 100$ and starting values $v_0 = (1,\dots,1)^T/\sqrt{n}$, $\lambda_0 = 1$.
Stop the iteration when $\|F(v,\lambda)\|_2 < 10^{-6}/n$.
Output a table that shows the iteration number $k$ and $\|F(v_k,\lambda_k)\|_2$.
What is the eigenvalue $\lambda_k$? What is the eigenvector (plot using x = (h:h:1-h); plot(x,v);)?
See Problem 1.2 for eigenvalues and eigenvectors of symmetric tridiagonal matrices.
Problem 10.7 The implicit Euler method for solving large systems of ordinary differential equations
$$ \frac{d}{dt}y(t) = G(y(t), t), \qquad t\in[0,T], $$
partitions the time interval $[0,T]$ into smaller time intervals $0 = t_0 < t_1 < \dots < t_I = T$ and at each time step $t_{i+1}$ computes $y_{i+1}\approx y(t_{i+1})$ as the solution of
$$ \frac{y_{i+1} - y_i}{\Delta t} = G(y_{i+1}, t_{i+1}), \tag{10.21} $$
where $y_i\approx y(t_i)$ is given from the previous time step and $\Delta t = t_{i+1} - t_i > 0$.
The equation (10.21) has to be solved for $y_{i+1}\in\mathbb{R}^n$. This problem is concerned with Newton's method for the solution of this system. The indices $i, i+1$ refer to the time steps in the implicit Euler method; they do not have anything to do with Newton's method. The Newton iterates are $y_{i+1}^k$, $k = 0, 1,\dots$
ii. Let $G_y(y,t)$ denote the Jacobian of $G(y,t)$ with respect to $y\in\mathbb{R}^n$ and assume that $\|G_y(y,t)\| < M$ for all $y\in\mathbb{R}^n$ and all $t\in\mathbb{R}$.
iii. Assume that $\|G_y(y,t) - G_y(z,t)\| < L_G\|y - z\|$ for all $y, z\in\mathbb{R}^n$ and all $t\in\mathbb{R}$.
Show that the Jacobian $F'$ of $F$ satisfies $\|F'(y) - F'(z)\| < L\|y - z\|$ for all $y, z\in\mathbb{R}^n$. What is $L$?
iv. Under the assumptions in ii. and iii. Newton's method for the solution of $F(y_{i+1}) = 0$ can be shown to converge locally q-quadratically. Moreover, if $y_{i+1}^k$, $k\in\mathbb{N}$, denote the Newton iterates, then one can show that
$$ \|y_{i+1}^{k+1} - y_{i+1}^*\| \le L\,\|(F'(y_{i+1}^*))^{-1}\|\,\|y_{i+1}^k - y_{i+1}^*\|^2 $$
provided $y_{i+1}^k$ is sufficiently close to $y_{i+1}^*$.
Use your results in ii. and iii. to show why Newton's method will perform better the smaller $\Delta t$ is.
v. In iv. we have assumed the existence of $y_{i+1}^*$ such that $F(y_{i+1}^*) = 0$. Use the estimates in ii. and iii. to establish the existence of such a solution using the Kantorovich Theorem 10.6.1. (Hint: The role of $x_0$ in the Kantorovich Theorem is played by $y_i$.)
Hint: Use Banach's lemma, Lemma 5.2.2, to show that $F'(x)$ is invertible for all $x\in B_r(x_*)$ and use the arguments in the proof of Banach's lemma to derive the bound $\|F'(x)^{-1}F'(x_*)\| \le 3$ for all $x\in B_r(x_*)$.
Problem 10.9 Let $F$ be twice differentiable in the open set $D\subset\mathbb{R}^n$ and let $F'(x)$ be invertible for all $x\in D$.
• Use Theorem 10.5.2 to establish the local q-superlinear convergence of Newton's method.
$$ g'(x_*) = \dots = g^{(p-1)}(x_*) = 0, $$
then there exists $\varepsilon > 0$ such that $x_*$ is the only fixed point in $B_\varepsilon(x_*)$ and the fixed point iteration converges to $x_*$ for any $x_0\in B_\varepsilon(x_*)$ with q-order $p$ and
$$ \lim_{k\to\infty}\frac{|x_{k+1} - x_*|}{|x_k - x_*|^p} = \frac{|g^{(p)}(x_*)|}{p!}. $$
$$ f(x_*) = f'(x_*) = \dots = f^{(m-1)}(x_*) = 0, \qquad f^{(m)}(x_*) \ne 0. $$
ii. Suppose that $f'(x)\ne0$ for all $x\in D\setminus\{x_*\}$. Show that the Newton iteration
$$ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} $$
iv. Construct an example which shows that the iteration (10.24) may not converge if the multiplicity $m$ is overestimated.
Problem 10.13 Let $A\in\mathbb{R}^{n\times n}$ be nonsingular and let $\|\cdot\|$, $|||\cdot|||$ be a vector and a matrix norm such that $\|Mv\| \le |||M|||\,\|v\|$ and $|||MN||| \le |||M|||\,|||N|||$ for all $M, N\in\mathbb{R}^{n\times n}$, $v\in\mathbb{R}^n$.
Schulz's method for computing the inverse of $A$ generates a sequence of matrices $\{X_k\}$ via the iteration
$$ X_{k+1} = 2X_k - X_k A X_k. $$
ii. Show that $\{X_k\}$ converges q-quadratically to the inverse of $A$ for all $X_0$ with $|||I - AX_0||| < 1$.
iii. Show that $\{X_k\}$ converges q-quadratically for all $X_0 = \alpha A^T$ with $\alpha\in(0, 2/\lambda_{\max})$, where $\lambda_{\max}$ is the largest eigenvalue of $AA^T$.
Note: The assumption that $A$ is invertible can be relaxed and convergence of $X_k$ to the generalized inverse can be shown. See [BI66].
Problem 10.14 Let $A\in\mathbb{R}^{n\times n}$ be nonsingular and let $G : \mathbb{R}^n\to\mathbb{R}^n$ be Lipschitz continuous with Lipschitz constant $L > 0$, i.e., $\|G(x) - G(y)\|_2 \le L\|x - y\|_2$ for all $x, y\in\mathbb{R}^n$.
Let $x_*$ be a solution of
$$ Ax + G(x) = 0. \tag{10.25} $$
a. Show that if $L\|A^{-1}\|_2 < 1$, then (10.25) has at most one solution.
$$ f(x_k + \alpha_k s_k) \le f(x_k) + c\,\alpha_k\nabla f(x_k)^T s_k $$
is satisfied for $c\in(0,1)$ independent of $k$. Suppose that the step lengths $\alpha_k$ are bounded away from zero, i.e., that $\alpha_k \ge \alpha > 0$ for all $k$. Show that $\lim_{k\to\infty}Ax_k + G(x_k) = 0$.
[Den71] J. E. Dennis, Jr. . Toward a unified convergence theory for Newton–like methods. In L. B.
Rall, editor, Nonlinear Functional and Applications, pages 425–472. Academic Press,
New-York, 1971.
[DES82] R. S. Dembo, S. C. Eisenstat, and T. Steihaug. Inexact Newton methods. SIAM J. Numer.
Anal., 19:400–408, 1982.
[Deu04] P. Deuflhard. Newton Methods for Nonlinear Problems. Affine Invariance and Adaptive
Algorithms, volume 35 of Springer Series in Computational Mathematics. Springer-
Verlag, Berlin, 2004.
[DH79] P. Deuflhard and G. Heindl. Affine invariant convergence theorems for Newton’s method
and extensions to related methods. SIAM J. Numer. Anal., 16:1–10, 1979.
[DS83] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. Prentice-Hall, Englewood Cliffs, N. J, 1983. Republished
as [DS96].
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/10.
1137/1.9781611971200, doi:10.1137/1.9781611971200.
[KA64] L.V. Kantorovich and G.P. Akilov. Functional Analysis in Normed Spaces. Pergamon
Press, New York, 1964.
[Kel95] C. T. Kelley. Iterative methods for linear and nonlinear equations, volume 16 of Fron-
tiers in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1995. URL: https://doi.org/10.1137/1.9781611970944, doi:10.1137/1.9781611970944.
F : Rn → Rn .
In the multidimensional case we can try to generalize this approach as follows: Given two iterates
x k , x k+1 ∈ Rn we try to find a nonsingular matrix Bk+1 ∈ Rn×n which satisfies the so-called secant
equation
Bk+1 (x k+1 − x k ) = F (x k+1 ) − F (x k ). (11.1)
Then we compute the new iterate as follows
In the one-dimensional case $b_{k+1}$ is uniquely determined from the secant equation. In the multidimensional case this is not true. For example, if $n = 2$, $x_{k+1} - x_k = (1,1)^T$, and $F(x_{k+1}) - F(x_k) = (1,2)^T$, then the matrices
$$ \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 1 \\ 2 & 0 \end{pmatrix} $$
satisfy (11.1).
Therefore we choose $B_{k+1}\in\mathbb{R}^{n\times n}$ as the solution of$^1$
$$ \min\ \|B - B_k\| \quad\text{s.t. } B(x_{k+1}-x_k) = F(x_{k+1}) - F(x_k), $$
or, with $s_k = x_{k+1} - x_k$ and $y_k = F(x_{k+1}) - F(x_k)$,
$$ \min\ \|B - B_k\| \quad\text{s.t. } Bs_k = y_k. \tag{11.2} $$
This can be interpreted as follows: $B_{k+1}$ should satisfy the secant equation (11.1) and $B_{k+1}$ should be as close to the old matrix $B_k$ as possible, to preserve as much information contained in $B_k$ as possible.
and
$$ \Bigl\|\frac{vv^T}{v^Tv}\Bigr\| = 1, \qquad v\in\mathbb{R}^n,\ v\ne0. $$
If $s_k\ne0$ then a solution to (11.2) is given by
$$ B_{k+1} = B_k + \frac{(y_k - B_k s_k)s_k^T}{s_k^T s_k}. \tag{11.3} $$
The matrix $B_{k+1}$ is the unique solution to (11.2) if $\|\cdot\|$ is the Frobenius norm.
$^1$Recall that we always assume that the matrix norm is submultiplicative and that it is compatible with the vector norm.
The uniqueness for the case that $\|\cdot\|$ is the Frobenius norm follows from the strict convexity of the Frobenius norm and from the convexity of the set $\{B\in\mathbb{R}^{n\times n}\,|\, Bs_k = y_k\}$.
The pairs $(\|\cdot\|, |||\cdot|||) = (\|\cdot\|_2, \|\cdot\|_2)$ and $(\|\cdot\|, |||\cdot|||) = (\|\cdot\|_2, \|\cdot\|_F)$ satisfy the properties assumed in Lemma 11.1.1.
Lemma 11.1.2 Let $B_k$ be invertible. The matrix $B_{k+1}$ defined in (11.3) is invertible if and only if $s_k^T B_k^{-1}y_k \ne 0$.
Proof: The result follows immediately from the Sherman-Morrison-Woodbury Lemma 9.2.4.
For $k = 0,\dots$
  Solve $B_k s_k = -F(x_k)$ for $s_k$.
  Set $x_{k+1} = x_k + s_k$.
  Evaluate $F(x_{k+1})$.
  Check truncation criteria.
  Set $B_{k+1} = B_k + \dfrac{\bigl([F(x_{k+1}) - F(x_k)] - B_k s_k\bigr)s_k^T}{s_k^T s_k}$.
End
To start Broyden's method we need an initial guess $x_0$ for the root $x_*$ and an initial matrix $B_0\in\mathbb{R}^{n\times n}$. In practice one often chooses
$$ B_0 = F'(x_0), \quad\text{or}\quad B_0 = \gamma I, $$
where $\gamma$ is a suitable scalar. Other choices, for example finite difference approximations to $F'(x_0)$ or choices based on the specific structure of $F'$, are also used.
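A hedged Matlab sketch of this basic iteration with an explicitly stored matrix $B_k$ (suitable only for moderate $n$; the function name is ours):

% Hedged sketch of Broyden's method with the rank-one update (11.3).
function x = broyden(F, x, B, tol, maxit)
  Fx = F(x);
  for k = 1:maxit
    if norm(Fx) <= tol
      return;
    end
    s    = -B\Fx;                        % solve B_k s_k = -F(x_k)
    x    = x + s;                        % x_{k+1} = x_k + s_k
    Fnew = F(x);
    y    = Fnew - Fx;                    % y_k = F(x_{k+1}) - F(x_k)
    B    = B + ((y - B*s)*s')/(s'*s);    % Broyden update (11.3)
    Fx   = Fnew;
  end
end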
k k x k k2 kF (x k )k2 ks k k2
0 0.250000E + 01 0.875017E + 01
1 0.166594E + 01 0.207320E + 01 0.880545E + 00
2 0.147651E + 01 0.873418E + 00 0.192204E + 00
3 0.141033E + 01 0.381251E + 00 0.132189E + 00
4 0.141763E + 01 0.158635E + 00 0.155521E + 00
5 0.142386E + 01 0.429850E − 01 0.962019E − 01
6 0.141585E + 01 0.468140E − 02 0.104304E − 01
7 0.141438E + 01 0.607409E − 03 0.258315E − 02
8 0.141421E + 01 0.405145E − 05 0.628819E − 03
9 0.141421E + 01 0.272411E − 07 0.180577E − 05
10 0.141421E + 01 0.118219E − 10 0.115425E − 07
The convergence of Broyden’s method is characterized in the following theorem.
Theorem 11.1.5 Let the assumptions of Theorem 10.2.3 hold. There exists an $\varepsilon > 0$ such that if
$$ \|x_* - x_0\| \le \varepsilon \quad\text{and}\quad \|F'(x_*) - B_0\| \le \varepsilon, $$
then Broyden's method generates iterates $x_k$ which converge to $x_*$ and the convergence rate is q-superlinear, i.e., there exists a sequence $c_k$ with $\lim_{k\to\infty}c_k = 0$ such that
$$ \|x_* - x_{k+1}\| \le c_k\|x_* - x_k\| $$
for all $k$.
A detailed convergence analysis of Broyden's method can be found e.g. in [DS83] or in [Kel95].
Remark 11.1.6 Under suitable assumptions, the iterates $x_k$ in Broyden's method converge towards a zero $x_*$ of $F$. The Broyden matrices $B_k$, however, generally do not converge to the Jacobian $F'(x_*)$. See e.g. [DS83, Lemma 8.2.7].
Scaling of Variables
Suppose we transform the variables $x$ and $F(x)$ by
$$ \hat x = D_x x, \qquad \hat F(\hat x) = D_F F(D_x^{-1}\hat x). $$
Now, we perform the Broyden update for the scaled problem and then transform it back to the original formulation. This process is equivalent to using the update
$$ B_{k+1} = B_k + \frac{(y_k - B_k s_k)(D_x^T D_x s_k)^T}{s_k^T D_x^T D_x s_k} $$
in the original formulation, provided the starting matrix is $\hat B_0 = D_F B_0 D_x^{-1}$. See Problem 11.1.
This leads to the following lemma.
Lemma 11.2.1 Suppose that $k$ iterations of Broyden's method are applied to the problem $F(x) = 0$ with starting vector $x_0$ and initial matrix $B_0$ and that these iterations generate the Broyden matrices $B_0,\dots,B_k$. Suppose further that $k$ iterations of Broyden's method are applied to the problem $\hat F(x) = A^{-1}F(x) = 0$ with starting vector $x_0$ and initial matrix $\hat B_0 = A^{-1}B_0$ and that these iterations generate the Broyden matrices $\hat B_0,\dots,\hat B_k$. Then
$$ \hat B_i = A^{-1}B_i, \qquad i = 0,\dots,k. $$
This lemma is important, because it says that we can choose the initial Broyden matrix to be the identity matrix if we scale the function $F$ by $A^{-1} = B_0^{-1}$.
$$ B_{k+1} = B_k + \frac{F(x_{k+1})s_k^T}{s_k^T s_k} = B_k + u_k v_k^T, $$
where
$$ u_k = F(x_{k+1})/\|s_k\|_2, \qquad v_k = s_k/\|s_k\|_2. $$
If we set
$$ w_k = B_k^{-1}u_k/(1 + v_k^T B_k^{-1}u_k), $$
then
$$ B_{k+1}^{-1} = (I - w_k v_k^T)B_k^{-1} = (I - w_k v_k^T)(I - w_{k-1}v_{k-1}^T)B_{k-1}^{-1} = \dots = \prod_{i=0}^{k}(I - w_i v_i^T)\,B_0^{-1}. $$
Assuming $B_0 = I$ (cf. Lemma 11.2.1), the next step in Broyden's method is then
$$ s_{k+1} = -B_{k+1}^{-1}F(x_{k+1}) = -\Bigl(I + \frac{s_{k+1}s_k^T}{\|s_k\|_2^2}\Bigr)\underbrace{\prod_{i=0}^{k-1}\Bigl(I + \frac{s_{i+1}s_i^T}{\|s_i\|_2^2}\Bigr)}_{=\,B_k^{-1}}F(x_{k+1}). $$
Solving the previous equation for $s_{k+1}$ yields
$$ s_{k+1} = -\frac{B_k^{-1}F(x_{k+1})}{1 + s_k^T B_k^{-1}F(x_{k+1})/\|s_k\|_2^2}. \tag{11.8} $$
Note that by the Sherman-Morrison-Woodbury Lemma 9.2.4,
$$ 1 + s_k^T B_k^{-1}F(x_{k+1})/\|s_k\|_2^2 = 1 + v_k^T B_k^{-1}u_k \ne 0 $$
if and only if $B_{k+1}$ is nonsingular.
To implement Broyden's method in this form, we need to store the vectors $s_0,\dots,s_k$ to compute $B_k^{-1}F(x_k)$. If storage is limited and we can only store $L$ such vectors, then we have two choices. Either we can restart the Broyden algorithm after iteration $L$, or we can replace the oldest $s_{k-L}$ by $s_k$. Thus, in the second case we use the approximation
$$ B_k^{-1} \approx \prod_{i=k-L+1}^{k-1}\Bigl(I + \frac{s_{i+1}s_i^T}{\|s_i\|_2^2}\Bigr). $$
For $k = 0,\dots$
  Evaluate $F(x_k)$.
  Check truncation criteria.
  Solve $B_k s_k = -F(x_k)$:
    If $k = 0$, then $s_k = -F(x_k)$.
    If $k > 0$, then compute
$$ z = B_{k-1}^{-1}F(x_k) = \prod_{i=k-L}^{k-2}\Bigl(I + \frac{s_{i+1}s_i^T}{\|s_i\|_2^2}\Bigr)F(x_k) $$
    and set $s_k = -z/\bigl(1 + s_{k-1}^T z/\|s_{k-1}\|_2^2\bigr)$.
  Set $x_{k+1} = x_k + s_k$.
End
In the derivation of Algorithm 11.3.1, we have used that the Broyden update
$$ B_{k+1} = B_k + \frac{\bigl(F(x_{k+1}) - F(x_k) - B_k s_k\bigr)s_k^T}{s_k^T s_k} $$
for $s_k$ given by $B_k s_k = -F(x_k)$ is equal to
$$ B_{k+1} = B_k + \frac{F(x_{k+1})s_k^T}{s_k^T s_k}. $$
In some globalizations of Broyden's method, however, we compute the new iterate as $x_{k+1} = x_k + t_k s_k$ with $t_k\in(0,1]$ and use the Broyden update
$$ B_{k+1} = B_k + \frac{\bigl(F(x_{k+1}) - F(x_k) - B_k(x_{k+1}-x_k)\bigr)(x_{k+1}-x_k)^T}{\|x_{k+1}-x_k\|_2^2}
= B_k + \frac{\bigl(F(x_{k+1}) - (1-t_k)F(x_k)\bigr)s_k^T}{t_k\|s_k\|_2^2}. \tag{11.9} $$
In this case, (11.7) does not apply. However, we are still able to reproduce $B_{k+1}^{-1}$ from the vectors $s_0,\dots,s_{k+1}$.
We apply the Sherman-Morrison-Woodbury formula, Lemma 9.2.4, to the Broyden update (11.9),
$$ B_{k+1} = B_k + \frac{\bigl(F(x_{k+1}) - (1-t_k)F(x_k)\bigr)s_k^T}{t_k\|s_k\|_2^2} = B_k + u_k v_k^T, $$
where
$$ u_k = \bigl(F(x_{k+1}) - (1-t_k)F(x_k)\bigr)/(t_k\|s_k\|_2), \qquad v_k = s_k/\|s_k\|_2. $$
If we set
$$ w_k = B_k^{-1}u_k/(1 + v_k^T B_k^{-1}u_k) $$
and assume that $B_0 = I$, then
$$ B_{k+1}^{-1} = \prod_{i=0}^{k}(I - w_i v_i^T). $$
Similar to the previous calculations we find that
$$ B_k^{-1}u_k = \prod_{i=0}^{k-1}(I - w_i v_i^T)\bigl(F(x_{k+1}) - (1-t_k)F(x_k)\bigr)/(t_k\|s_k\|_2)
= \Bigl(\underbrace{\prod_{i=0}^{k-1}(I - w_i v_i^T)F(x_{k+1})}_{=\,z} - (1-t_k)\underbrace{\prod_{i=0}^{k-1}(I - w_i v_i^T)F(x_k)}_{=\,B_k^{-1}F(x_k)\,=\,-s_k}\Bigr)\Big/(t_k\|s_k\|_2)
= \bigl(z + (1-t_k)s_k\bigr)/(t_k\|s_k\|_2) $$
and
$$ w_k = \frac{1}{t_k\|s_k\|_2 + v_k^T z + (1-t_k)v_k^T s_k}\bigl(z + (1-t_k)s_k\bigr) = \frac{1}{\|s_k\|_2 + v_k^T z}\bigl(z + (1-t_k)s_k\bigr). $$
As before, we set
$$ \beta = \frac{1}{\|s_k\|_2 + v_k^T z}. $$
The next step in Broyden's method is given by
$$ s_{k+1} = -B_{k+1}^{-1}F(x_{k+1}) = -(I - w_k v_k^T)\prod_{i=0}^{k-1}(I - w_i v_i^T)F(x_{k+1}) = -(I - w_k v_k^T)z = -(z - w_k v_k^T z)
= -z + \beta\bigl(z + (1-t_k)s_k\bigr)\bigl(\beta^{-1} - \|s_k\|_2\bigr) = (1-t_k)s_k - \|s_k\|_2\, w_k. $$
Hence, for $k \ge 0$,
$$ w_k = -\frac{1}{\|s_k\|_2}\bigl(s_{k+1} - (1-t_k)s_k\bigr) $$
and
$$ B_{k+1}^{-1} = \prod_{i=0}^{k}\Bigl(I + \frac{(s_{i+1} - (1-t_i)s_i)s_i^T}{\|s_i\|_2^2}\Bigr). \tag{11.10} $$
As in the case $t_k = 1$, (11.10) depends on $s_{k+1}$, but $B_{k+1}^{-1}$ also defines $s_{k+1}$. Thus we have to solve
$$ s_{k+1} = -B_{k+1}^{-1}F(x_{k+1}) = -\Bigl(I + \frac{(s_{k+1} - (1-t_k)s_k)s_k^T}{\|s_k\|_2^2}\Bigr)B_k^{-1}F(x_{k+1})
= -B_k^{-1}F(x_{k+1}) - \frac{s_k^T B_k^{-1}F(x_{k+1})}{\|s_k\|_2^2}\,s_{k+1} + (1-t_k)\frac{s_k^T B_k^{-1}F(x_{k+1})}{\|s_k\|_2^2}\,s_k. $$
For $k = 0,\dots$
  Evaluate $F(x_k)$.
  Check truncation criteria.
  Solve $B_k s_k = -F(x_k)$:
    If $k = 0$, then $s_k = -F(x_k)$.
    If $k > 0$, then compute
$$ z = B_{k-1}^{-1}F(x_k) = \prod_{i=k-L}^{k-2}\Bigl(I + \frac{(s_{i+1} - (1-t_i)s_i)s_i^T}{\|s_i\|_2^2}\Bigr)F(x_k) $$
    and set
$$ s_k = -\frac{\|s_{k-1}\|_2^2}{\|s_{k-1}\|_2^2 + s_{k-1}^T z}\Bigl(z - \frac{(1-t_{k-1})s_{k-1}^T z}{\|s_{k-1}\|_2^2}\,s_{k-1}\Bigr). $$
  Compute $t_k\in(0,1]$.
  Set $x_{k+1} = x_k + t_k s_k$.
End
11.4. Problems
Problem 11.1 ([DS83, Problem 8.5.12]) Suppose we transform the variables $x$ and $F(x)$ by
$$ \hat x = D_x x, \qquad \hat F(\hat x) = D_F F(x), $$
where $D_F, D_x$ are nonsingular matrices, perform Broyden's method in the new variable and function space, and then transform back to the original variables. Show that this process is equivalent to using the update
$$ B_{k+1} = B_k + \frac{(y_k - B_k s_k)(D_x^T D_x s_k)^T}{s_k^T D_x^T D_x s_k}, $$
provided the starting matrix is $\hat B_0 = D_F B_0 D_x^{-1}$. Notice that the update is independent of the scaling $D_F$. Notice also that the new Jacobian $\hat F'(\hat x)$ is related to the old Jacobian by $\hat F'(\hat x) = D_F F'(x)D_x^{-1}$. See also Problem 10.3.
$$ F(x) = \begin{pmatrix} A_1 x + b_1 \\ F_2(x) \end{pmatrix}, $$
where $A_1\in\mathbb{R}^{m\times n}$, $m < n$. Suppose that Broyden's method is used to solve $F(x) = 0$, generating a sequence of Broyden matrices $B_0, B_1,\dots$. Let $B_k$ be partitioned into
$$ B_k = \begin{pmatrix} B_k^1 \\ B_k^2 \end{pmatrix}, \qquad B_k^1\in\mathbb{R}^{m\times n},\ B_k^2\in\mathbb{R}^{(n-m)\times n}. $$
[BB88] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J. Numer.
Anal., 8(1):141–148, 1988. URL: http://dx.doi.org/10.1093/imanum/8.1.
141, doi:10.1093/imanum/8.1.141.
[BC99] F. Bouttier and P. Courtier. Data assimilation concepts and methods. Technical report,
European Centre for Medium-Range Weather Forecasts (ECMWF), 1999. https:
//www.ecmwf.int/en/learning/education-material/lecture-notes
(accessed Nov. 23, 2017).
[Ben02] M. Benzi. Preconditioning techniques for large linear systems: a survey. J. Comput.
Phys., 182(2):418–477, 2002. URL: http://dx.doi.org/10.1006/jcph.2002.
7176, doi:10.1006/jcph.2002.7176.
[BGL05] M. Benzi, G. H. Golub, and J. Liesen. Numerical solution of saddle point problems.
In A. Iserles, editor, Acta Numerica 2005, pages 1–137. Cambridge University Press,
Cambridge, London, New York, 2005.
[Bjö96] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
[Bur40] J. M. Burgers. Application of a model system to illustrate some points of the statistical
theory of free turbulence. Nederl. Akad. Wetensch., Proc., 43:2–12, 1940.
[BW88] D. M. Bates and D. G. Watts. Nonlinear Regression Analysis and its Applications.
John Wiley and Sons, Inc., Somerset, New Jersey, 1988.
[Cra55] E. J. Craig. The n-step iteration procedures. Journal of Mathematics and Physics, 34:64–73, 1955.
[Den71] J. E. Dennis, Jr. Toward a unified convergence theory for Newton-like methods. In L. B. Rall, editor, Nonlinear Functional Analysis and Applications, pages 425–472. Academic Press, New York, 1971.
[DES82] R. S. Dembo, S. C. Eisenstat, and T. Steihaug. Inexact Newton methods. SIAM J.
Numer. Anal., 19:400–408, 1982.
[Deu04] P. Deuflhard. Newton Methods for Nonlinear Problems. Affine Invariance and Adaptive
Algorithms, volume 35 of Springer Series in Computational Mathematics. Springer-
Verlag, Berlin, 2004.
[DGE81] J. E. Dennis, D. M. Gay, and R. E. Welsch. Algorithm 573: NL2SOL, an adaptive nonlinear least-squares algorithm. TOMS, 7:369–383, 1981. Fortran code available from http://www.netlib.org/toms/573.
[DGW81] J. E. Dennis, D. M. Gay, and R. E. Welsch. An adaptive nonlinear least-squares
algorithm. TOMS, 7:348–368, 1981.
[DH79] P. Deuflhard and G. Heindl. Affine invariant convergence theorems for Newton’s
method and extensions to related methods. SIAM J. Numer. Anal., 16:1–10, 1979.
[DH95] P. Deuflhard and A. Hohmann. Numerical Analysis. A First Course in Scientific
Computation. Walter De Gruyter, Berlin, New York, 1995.
[DL02] Y.-H. Dai and L.-Z. Liao. R-linear convergence of the Barzilai and Borwein gradient
method. IMA J. Numer. Anal., 22(1):1–10, 2002. URL: http://dx.doi.org/10.
1093/imanum/22.1.1, doi:10.1093/imanum/22.1.1.
[DM77] J. E. Dennis, Jr. and J. J. Moré. Quasi–Newton methods, motivation and theory. SIAM
Review, 19:46–89, 1977.
[Dre90] S. E. Dreyfus. Artificial neural networks, back propagation, and the Kelley-Bryson
gradient procedure. J. Guidance Control Dynam., 13(5):926–928, 1990. URL: http:
//dx.doi.org/10.2514/3.25422, doi:10.2514/3.25422.
[DS83] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and Unconstrained Optimization. Prentice-Hall, Englewood Cliffs, NJ, 1983. Republished as [DS96].
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/
10.1137/1.9781611971200, doi:10.1137/1.9781611971200.
[Emb03] M. Embree. The tortoise and the hare restart GMRES. SIAM Rev., 45(2):259–266
(electronic), 2003.
[Esp81] J. H. Espenson. Chemical Kinetics and Reaction Mechanisms. McGraw-Hill, New York, 1981.
[ESW05] H. C. Elman, D. J. Silvester, and A. J. Wathen. Finite Elements and Fast Iterative
Solvers with Applications in Incompressible Fluid Dynamics. Oxford University Press,
Oxford, 2005.
[FA11] J. A. Fike and J. J. Alonso. The development of hyper-dual numbers for exact second-
derivative calculations. Proceedings, 49th AIAA Aerospace Sciences Meeting including
the New Horizons Forum and Aerospace Exposition. Orlando, Florida, 2011. URL:
https://doi.org/10.2514/6.2011-886, doi:10.2514/6.2011-886.
[FA12] J. A. Fike and J. J. Alonso. Automatic differentiation through the use of hyper-dual num-
bers for second derivatives. In S. Forth, P. Hovland, E. Phipps, J. Utke, and A. Walther,
editors, Recent advances in algorithmic differentiation, volume 87 of Lect. Notes Com-
put. Sci. Eng., pages 163–173. Springer, Heidelberg, 2012. URL: https://doi.org/
10.1007/978-3-642-30023-3_15, doi:10.1007/978-3-642-30023-3_15.
[Fle05] R. Fletcher. On the Barzilai-Borwein method. In L. Qi, K. Teo, and X. Yang, editors,
Optimization and control with applications, volume 96 of Appl. Optim., pages 235–256.
Springer, New York, 2005. URL: http://dx.doi.org/10.1007/0-387-24255-4_
10, doi:10.1007/0-387-24255-4_10.
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org (Accessed April 10, 2017).
[GL83] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3 of Johns Hopkins
Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD,
1983.
[GL89] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3 of Johns Hopkins
Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD,
second edition, 1989.
[GL96] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Studies in the
Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition,
1996.
[GO89] G. H. Golub and D. P. O’Leary. Some history of the conjugate gradient and Lanczos
algorithms: 1948–1976. SIAM Rev., 31(1):50–102, 1989.
[Gre97] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia, 1997.
[Gro77] C. W. Groetsch. Generalized Inverses of Linear Operators. Marcel Dekker, Inc., New
York, Basel, 1977.
[Heb73] M. D. Hebden. An algorithm for minimization using exact second order derivatives.
Technical Report T.P. 515, Atomic Energy Research Establishment, Harwell, England,
1973.
[Hei93] M. Heinkenschloss. Mesh independence for nonlinear least squares problems with
norm constraints. SIAM J. Optimization, 3:81–117, 1993. URL: http://dx.doi.
org/10.1137/0803005, doi:10.1137/0803005.
[HJ85] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, London, New York, 1985.
[HPUU09] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich. Optimization with PDE Con-
straints, volume 23 of Mathematical Modelling, Theory and Applications. Springer
Verlag, Heidelberg, New York, Berlin, 2009. URL: http://dx.doi.org/10.1007/
978-1-4020-8839-1, doi:10.1007/978-1-4020-8839-1.
[HS52] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49:409–436, 1952.
[Ise96] A. Iserles. A First Course in the Numerical Analysis of Differential Equations. Cambridge University Press, Cambridge, London, New York, 1996.
[JS04] F. Jarre and J. Stoer. Optimierung. Springer Verlag, Berlin, Heidelberg, New York, 2004.
[KA64] L. V. Kantorovich and G. P. Akilov. Functional Analysis in Normed Spaces. Pergamon Press, New York, 1964.
[Kel95] C. T. Kelley. Iterative methods for linear and nonlinear equations, volume 16 of Fron-
tiers in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 1995. URL: https://doi.org/10.1137/1.9781611970944,
doi:10.1137/1.9781611970944.
[MP96] T. Maly and L. R. Petzold. Numerical methods and software for sensitivity analysis of
differential-algebraic systems. Applied Numerical Mathematics, 20:57–79, 1996.
[MS79] H. Matthies and G. Strang. The solution of nonlinear finite element equations. Internat.
J. Numer. Methods Engrg., 14:1613–1626, 1979.
[MS83] J. J. Moré and D. C. Sorensen. Computing a trust region step. SIAM J. Sci. Statist.
Comput., 4(3):553–572, 1983.
[MT94] J. J. Moré and D. J. Thuente. Line search algorithms with guaranteed sufficient decrease.
ACM Transactions on Mathematical Software, 20(3):286–307, 1994.
[Nas76] M. Z. Nashed. Generalized Inverses and Applications. Academic Press, Boston, San Diego, New York, London, 1976.
[Noc80] J. Nocedal. Updating quasi-Newton matrices with limited storage. Math. Comp.,
35(151):773–782, 1980.
[OT14] M. A. Olshanskii and E. E. Tyrtyshnikov. Iterative Methods for Linear Systems: Theory
and Applications. SIAM, Philadelphia, 2014.
[Pot89] F. A. Potra. On Q-order and R-order of convergence. J. Optim. Theory Appl., 63(3):415–
431, 1989.
[PR55] D. W. Peaceman and H. H. Rachford, Jr. The numerical solution of parabolic and
elliptic differential equations. J. Soc. Indust. Appl. Math., 3:28–41, 1955.
[PSS09] A. D. Padula, S. D. Scott, and W. W. Symes. A software framework for abstract expres-
sion of coordinate-free linear algebra and optimization algorithms. ACM Trans. Math.
Software, 36(2):Art. 8, 36, 2009. URL: https://doi.org/10.1145/1499096.
1499097, doi:10.1145/1499096.1499097.
[Ray93] M. Raydan. On the Barzilai and Borwein choice of steplength for the gradient method.
IMA J. Numer. Anal., 13(3):321–326, 1993. URL: http://dx.doi.org/10.1093/
imanum/13.3.321, doi:10.1093/imanum/13.3.321.
[RSS01] M. Rojas, S. A. Santos, and D. C. Sorensen. A new matrix-free algorithm for the large-
scale trust-region subproblem. SIAM J. Optim., 11(3):611–646 (electronic), 2000/01.
[Saa03] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, second
edition, 2003.
[Sal86] D. E. Salane. Adaptive routines for forming Jacobians numerically. Technical Report SAND86–1319, Sandia National Laboratories, 1986.
[SB93] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer Verlag, New
York, Berlin, Heidelberg, London, Paris, second edition, 1993.
[SS86] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comp., 7:856–869, 1986.
[ST98] W. Squire and G. Trapp. Using complex variables to estimate derivatives of real
functions. SIAM Rev., 40(1):110–112, 1998. URL: https://doi.org/10.1137/
S003614459631241X, doi:10.1137/S003614459631241X.
[Ste83] T. Steihaug. The conjugate gradient method and trust regions in large scale optimiza-
tion. SIAM J. Numer. Anal., 20:626–637, 1983.
[Sto83] J. Stoer. Solution of large linear systems of equations by conjugate gradient type methods. In A. Bachem, M. Grötschel, and B. Korte, editors, Mathematical Programming, The State of The Art, pages 540–565. Springer Verlag, Berlin, Heidelberg, New York, 1983.
[Tar05] A. Tarantola. Inverse problem theory and methods for model parameter estimation.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2005.
[TB97] L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[TE05] L. N. Trefethen and M. Embree. Spectra and Pseudospectra. The Behavior of Nonnor-
mal Matrices and Operators. Princeton University Press, Princeton, NJ, 2005.
[Toi81] Ph. L. Toint. Towards an efficient sparsity exploiting Newton method for minimization.
In I. S. Duff, editor, Sparse Matrices and Their Uses, pages 57–87. Academic Press,
New York, 1981.
[Trö10] F. Tröltzsch. Optimal Control of Partial Differential Equations: Theory, Methods and
Applications, volume 112 of Graduate Studies in Mathematics. American Mathemat-
ical Society, Providence, RI, 2010. URL: http://dx.doi.org/10.1090/gsm/112.
[Var62] R. S. Varga. Matrix Iterative Analysis. Prentice Hall, Englewood Cliffs, NJ, 1962.
[Vol01] S. Volkwein. Distributed control problems for the Burgers equation. Comput. Optim.
Appl., 18(2):115–140, 2001.
[Vor03] H. A. van der Vorst. Iterative Krylov Methods for Large Linear Systems, volume 13
of Cambridge Monographs on Applied and Computational Mathematics. Cambridge
University Press, Cambridge, 2003.
[Win80] R. Winther. Some superlinear convergence results for the conjugate gradient method. SIAM J. Numer. Anal., 17:14–17, 1980.
[Wri15] S. J. Wright. Coordinate descent algorithms. Math. Program., 151(1, Ser. B):3–
34, 2015. URL: http://dx.doi.org/10.1007/s10107-015-0892-3, doi:10.
1007/s10107-015-0892-3.
[Xu92] J. Xu. Iterative methods by space decomposition and subspace correction. SIAM
Review, 34:581–613, 1992.
[You71] D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, New York,
1971. Republished by Dover [You03].
[You03] D. M. Young. Iterative Solution of Large Linear Systems. Dover Publications Inc.,
Mineola, NY, 2003. Unabridged republication of the 1971 edition [You71].