CAAM 454/554
Matthias Heinkenschloss
Spring 2018
(Generated November 16, 2018)
Chapter 1. Basic Properties and Examples
1.1 Introduction
1.2 Quadratic Optimization Problems and Linear Systems
1.3 Linear Elliptic Partial Differential Equations
1.3.1 Elliptic Partial Differential Equations in 1D
1.3.2 Elliptic Partial Differential Equations in 2D
1.4 An Optimal Control Problem
1.5 A Data Assimilation Problem
1.6 Problems
1.1. Introduction
This chapter introduces examples of linear systems and their basic properties. Later these examples
will be used to illustrate the application of iterative methods to be introduced in Chapters 2 and 3.
The linear system properties will be used to guide the selection of iterative solvers and to analyze
their convergence properties.
The next section explores the connection between convex quadratic optimization problems and
linear systems. Specifically, we will show that the solution of a convex quadratic optimization
problem is characterized by a linear system, namely the necessary and sufficient optimality
conditions. Sections 1.3, 1.4, and 1.5 introduce specific examples of linear systems. Section 1.3
introduces linear systems arising from a finite difference discretization of elliptic partial differential
equations. Sections 1.4 and 1.5 introduce two equality constrained convex quadratic optimization
problems governed by partial differential equations.
0 ≤ (t²/2) (Ax^(∗) − b)^T A (Ax^(∗) − b) − t ‖Ax^(∗) − b‖₂²
  ≤ (t²/2) ‖A‖₂ ‖Ax^(∗) − b‖₂² − t ‖Ax^(∗) − b‖₂²
  = ( t‖A‖₂/2 − 1 ) t ‖Ax^(∗) − b‖₂².

With t = 1/‖A‖₂, the previous inequality implies

0 ≤ − (1/(2‖A‖₂)) ‖Ax^(∗) − b‖₂²,

i.e., Ax^(∗) = b.
Note that the previous theorem establishes the connection between solutions of Ax = b and
minimizers of q(x) = (1/2) x^T A x − b^T x, but it does not establish their existence. If A ∈ R^{n×n} is
symmetric positive semidefinite, a minimizer of q exists if and only if b ∈ R(A). If A ∈ R^{n×n} is
symmetric positive definite, there exists a unique minimizer of q.
Theorem 1.2.2 Let H ∈ R^{n×n} be symmetric and satisfy (1.5), and let A ∈ R^{m×n}, m < n. The
equality constrained quadratic program (1.4) has a solution x ∈ R^n if and only if there exists
λ ∈ R^m such that

\begin{pmatrix} H & A^T \\ A & 0 \end{pmatrix}
\begin{pmatrix} x \\ λ \end{pmatrix}
=
\begin{pmatrix} c \\ b \end{pmatrix}. (1.6)

If, in addition, A ∈ R^{m×n}, m < n, has rank m, then the equality constrained quadratic program (1.4) has a
unique solution x ∈ R^n and there exists a unique vector λ ∈ R^m that satisfies (1.6). See Problem 1.1.
Theorem 1.2.3 If H ∈ Rn×n is symmetric and satisfies (1.7), and if A ∈ Rm×n , m < n, has rank m,
the matrix

K = \begin{pmatrix} H & A^T \\ A & 0 \end{pmatrix} (1.8)
is symmetric indefinite and it has n positive eigenvalues and m negative eigenvalues.
Proof: Let A = U (Σ, 0) V^T be the singular value decomposition of A. Since A has rank m,
Σ ∈ R^{m×m} is a diagonal matrix with positive diagonal entries. We write V = (V₁, V₂) with
V₁ ∈ R^{n×m} and V₂ ∈ R^{n×(n−m)} and define Ĥ11 = V₁^T H V₁, Ĥ12 = V₁^T H V₂, Ĥ22 = V₂^T H V₂. Since
N(A) = R(V₂), (1.7) implies that Ĥ22 is symmetric positive definite.
The matrices K and

\begin{pmatrix} V^T & 0 \\ 0 & U^T \end{pmatrix}
\begin{pmatrix} H & A^T \\ A & 0 \end{pmatrix}
\begin{pmatrix} V & 0 \\ 0 & U \end{pmatrix}
=
\begin{pmatrix} Ĥ11 & Ĥ12 & Σ \\ Ĥ12^T & Ĥ22 & 0 \\ Σ & 0 & 0 \end{pmatrix}
=: K̂

and

\begin{pmatrix} I & 0 & 0 \\ 0 & 0 & I \\ 0 & I & −Ĥ12^T Σ^{−1} \end{pmatrix}
\begin{pmatrix} Ĥ11 & Ĥ12 & Σ \\ Ĥ12^T & Ĥ22 & 0 \\ Σ & 0 & 0 \end{pmatrix}
\begin{pmatrix} I & 0 & 0 \\ 0 & 0 & I \\ 0 & I & −Σ^{−1} Ĥ12 \end{pmatrix}
=
\begin{pmatrix} Ĥ11 & Σ & 0 \\ Σ & 0 & 0 \\ 0 & 0 & Ĥ22 \end{pmatrix}
have the same inertia (i.e., the same number of positive and negative eigenvalues).¹ The eigenvalues
of the last matrix above are equal to the eigenvalues of

\begin{pmatrix} Ĥ11 & Σ \\ Σ & 0 \end{pmatrix} ∈ R^{2m×2m}   and of   Ĥ22 ∈ R^{(n−m)×(n−m)}.
The inertia of

\begin{pmatrix} Ĥ11 & Σ \\ Σ & 0 \end{pmatrix} ∈ R^{2m×2m} (1.9)

is equal to the inertia of

\begin{pmatrix} Σ^{−1} & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} Ĥ11 & Σ \\ Σ & 0 \end{pmatrix}
\begin{pmatrix} Σ^{−1} & 0 \\ 0 & I \end{pmatrix}
=
\begin{pmatrix} Σ^{−1} Ĥ11 Σ^{−1} & I \\ I & 0 \end{pmatrix}. (1.10)
If µ is an eigenvalue of (1.10), then

\begin{pmatrix} Σ^{−1} Ĥ11 Σ^{−1} & I \\ I & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= µ \begin{pmatrix} x \\ y \end{pmatrix}. (1.11)
¹Sylvester’s Law of Inertia states that if A is a symmetric n × n matrix and X is a nonsingular n × n matrix, then
A and X^T A X have the same inertia, i.e., the same numbers of positive, zero, and negative eigenvalues. See, e.g., [GL89,
Thm. 8.1.12].
If µ = 0, then x = 0 and y = 0. Therefore all eigenvalues of (1.10) are non-zero. From (1.11) we
find y = µ^{−1} x and

( Σ^{−1} Ĥ11 Σ^{−1} + µ^{−1} I ) x = µ x.

If Σ^{−1} Ĥ11 Σ^{−1} = W Λ W^T is the eigen-decomposition of Σ^{−1} Ĥ11 Σ^{−1}, then µ² W^T x − µ Λ W^T x − W^T x =
0. Hence, the eigenvalues µ of (1.10) are the roots of

µ² − µ λ_j − 1 = 0, j = 1, . . . , m,

which are

µ_{j±} = λ_j/2 ± sqrt( λ_j²/4 + 1 ), j = 1, . . . , m.

This shows that (1.10) (and therefore (1.9)) has m positive and m negative eigenvalues.
The optimality system (1.6) is a particular type of saddle point system. See, e.g., the survey
paper [BGL05] by Benzi, Golub, and Liesen.
In Sections 1.4 and 1.5 we discuss applications that lead to optimization problems of the type
(1.2) or (1.4).
The system (1.12) is called a two-point boundary value problem (BVP). The conditions (1.12b)
and (1.12c) specify the value of the solution at the boundary and are called Dirichlet boundary
conditions. Other boundary conditions are possible. The function f and the scalars ε, c, r, g₀ and
g₁ are given. We assume that ε > 0, r ≥ 0.
We want to compute an approximate solution of (1.12). We use a so-called finite difference
method to accomplish this. With this approach approximations of the solution y of (1.12) at
specified points 0 = x 0 < x 1 < . . . < x n+1 = 1 are obtained through the solution of a linear system.
We select a grid

0 = x₀ < x₁ < . . . < x_{n+1} = 1

with mesh size

h = 1/(n+1)

and with equidistant points

x_i = i/(n+1) = i h.
Central Differences
g'(x) ≈ ( g(x + h) − g(x − h) ) / (2h). (1.13)
The solution y1, . . . , yn of (1.18) is an approximation of the solution y of (1.12) at the points
x 1, . . . , x n .
Figure 1.1: Solution of the differential equation (1.19) computed using central finite difference
approximation (1.18) with uniform mesh size h = 0.05 (left plot), h = 0.01 (middle plot), h = 0.002
(right plot)
The finite difference scheme (1.16) can be represented by the 3-point stencil

[ −(ε + (h/2) c)/h²     (2ε + h² r)/h²     −(ε − (h/2) c)/h² ]. (1.20)

For many operations, the stencil is all we need if we want to work with the matrix. For example, if
we set y₀ = g₀, y_{n+1} = g₁, then the ith equation in (1.18) reads

−((ε + (h/2) c)/h²) y_{i−1} + ((2ε + h² r)/h²) y_i − ((ε − (h/2) c)/h²) y_{i+1} = f(x_i), i = 1, . . . , n.
The left hand side is obtained by first multiplying the stencil (1.20) component wise with yi−1 yi yi+1
and then summing up the resulting values.
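As an illustration, the following minimal Python sketch (not part of the original notes; the names eps, c, r, f, g0, g1, n are assumptions chosen for this example) assembles the tridiagonal central difference matrix and right-hand side just described and solves the resulting system with a sparse direct solver.

    # Sketch: central finite difference system for -eps*y'' + c*y' + r*y = f on (0,1)
    # with Dirichlet data y(0) = g0, y(1) = g1, following the stencil (1.20).
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def central_fd_system(eps, c, r, f, g0, g1, n):
        h = 1.0 / (n + 1)
        x = np.arange(1, n + 1) * h                # interior grid points x_1, ..., x_n
        lower = -(eps + 0.5 * h * c) / h**2        # coefficient of y_{i-1}
        diag  = (2 * eps + h**2 * r) / h**2        # coefficient of y_i
        upper = -(eps - 0.5 * h * c) / h**2        # coefficient of y_{i+1}
        A = sp.diags([lower * np.ones(n - 1), diag * np.ones(n), upper * np.ones(n - 1)],
                     offsets=[-1, 0, 1], format="csr")
        b = np.array(f(x), dtype=float)
        b[0]  -= lower * g0                        # move known boundary values to the rhs
        b[-1] -= upper * g1
        return A, b

    A, b = central_fd_system(1e-3, 1.0, 0.0, lambda x: np.ones_like(x), 0.0, 0.0, 99)
    y = spla.spsolve(A, b)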
We summarize a few properties of the matrix in (1.18) that will be important later.
• If ε > 0, c = 0, and r ≥ 0, the matrix is symmetric positive definite (see Problem 1.2).
• If ε > 0 and h ≤ 2ε/|c|, the entries a_ij of the matrix in (1.18) satisfy Σ_{j≠i} |a_ij| ≤ |a_ii| for
i = 1, . . . , n, with “<” for i = 1 and i = n. See Problem 1.4. We say the matrix in (1.18) is row-wise
diagonally dominant. Theorem 2.6.5, which will be introduced in the next chapter, will imply
that the matrix in (1.18) is nonsingular if h < 2ε/|c|.
• Another important property of matrices arising from finite difference discretizations is that of
an M-matrix. We will introduce M-matrices in Section 2.6.5. However, we note already
that the matrix (1.18) is only an M-matrix when h < 2ε/|c|. This can be used to explain
the behavior of the finite difference approximations observed in Figure 1.1. Stynes’ paper
[Sty05, Sec. 4] contains a nice discussion of M-matrix properties of the matrix in (1.18) and
we will discuss it in Section 2.7.
Upwind Differences
As we have mentioned before, the finite difference scheme (1.16) with uniform meshes requires
that

h < 2ε/|c|

to avoid artificial oscillations in the computed solution. The problem with the finite difference
scheme (1.16) results from the use of central finite differences (1.15) for c y'(x_i). Instead of the
central finite difference approximation (1.15),

c y'(x_i) ≈ c ( y(x_i + h) − y(x_i − h) ) / (2h),

one uses the upwind (backward) finite difference approximation

c y'(x_i) ≈ c ( y(x_i) − y(x_i − h) ) / h

(for c > 0). The resulting scheme reads

−((ε + h c)/h²) y_{i−1} + ((2ε + h c + h² r)/h²) y_i − (ε/h²) y_{i+1} = f(x_i), i = 1, . . . , n,

and, after moving the boundary values y₀ = g₀ and y_{n+1} = g₁ to the right-hand side, it can be
written as a linear system whose right-hand side is

( f(x₁) + ((ε + h c)/h²) g₀,  f(x₂),  . . . ,  f(x_{n−1}),  f(x_n) + (ε/h²) g₁ )^T. (1.24)
If c = 0, the matrices in (1.18) and (1.24) are identical. Let ε > 0, c > 0, and r ≥ 0. We
have seen that for c ≠ 0 the matrix in (1.18) is row-wise diagonally dominant only if h is small
relative to ε/|c|. In contrast, the matrix in (1.24) is row-wise diagonally dominant for any h > 0,
i.e., if a_ij are the entries of the matrix, then

Σ_{j≠i} |a_ij| ≤ |a_ii| for i = 1, . . . , n,

with “<” for i = 1 and i = n. See Problem 1.4. Theorem 2.6.5, which will be introduced in the
next chapter, will imply that the matrix in (1.24) is nonsingular for all h.
As we have mentioned earlier, the M-matrix property is important for matrices arising from
finite difference discretizations. We will see in Section 2.7 that the matrix (1.24) resulting from the
upwind discretization of the convection term is an M-matrix for any mesh size h > 0. See also
Stynes’ paper [Sty05, Sec. 4].
Figure 1.2 shows the computed finite difference approximations for the equation (1.19) with
parameter ε = 10⁻³ using upwind finite differences (1.24) with uniform mesh size h = 0.05,
h = 0.02, and h = 0.01. The upwind finite difference scheme (1.24) leads to much better results
for larger mesh sizes h than the central finite difference scheme (1.18). In particular, the
approximations computed using upwind finite differences (1.24) are nonnegative for nonnegative
right hand side functions f and boundary data g₀, g₁.
Figure 1.2: Solution of the differential equation (1.19) computed using upwind finite difference
approximation (1.24) with uniform mesh size h = 0.05 (left plot), h = 0.02 (middle plot), h = 0.01
(right plot)
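The upwind system can be assembled in the same way as the central one. The following short Python sketch (again an illustration, not the notes' code; all names are assumptions) builds the matrix and right-hand side of (1.24); note that the resulting matrix is row-wise diagonally dominant for every h > 0.

    # Sketch: upwind finite difference system (1.24) for c > 0.
    import numpy as np
    import scipy.sparse as sp

    def upwind_fd_system(eps, c, r, f, g0, g1, n):
        h = 1.0 / (n + 1)
        x = np.arange(1, n + 1) * h
        lower = -(eps + h * c) / h**2                    # coefficient of y_{i-1}
        diag  = (2 * eps + h * c + h**2 * r) / h**2      # coefficient of y_i
        upper = -eps / h**2                              # coefficient of y_{i+1}
        A = sp.diags([lower * np.ones(n - 1), diag * np.ones(n), upper * np.ones(n - 1)],
                     offsets=[-1, 0, 1], format="csr")
        b = np.array(f(x), dtype=float)
        b[0]  -= lower * g0                              # adds (eps + h*c)/h**2 * g0
        b[-1] -= upper * g1                              # adds eps/h**2 * g1
        return A, b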
y₀ = y_{n+1},   ( y₀ − y_{−1} )/h = ( y_{n+1} − y_n )/h,

which implies

y₀ = y_{n+1},   y_{−1} = y_n. (1.26)
Together with the upwind discretization this yields a linear system whose right-hand side is

( f(x₁), f(x₂), . . . , f(x_n), f(x_{n+1}) )^T. (1.28)
We select a grid
0 = x 1,0 < x 1,1 < . . . < x 1,n1 +1 = 1, 0 = x 2,0 < x 2,1 < . . . < x 2,n2 +1 = 1,
Ay = b. (1.31)
The precise structure of A and b depends on the ordering of the unknowns and equations.
9 10 11 12
5 6 7 8
1 2 3 4
Figure 1.3: Simple 4 × 3 grid with lexicographic ordering of the grid points.
If we order the grid-points lexicographically2, as indicated for a small example in Figure 1.3,
then the vector of unknowns is
y = ( y₁₁, . . . , y_{n₁1}, y₁₂, . . . , y_{n₁2}, . . . . . . , y_{1n₂}, . . . , y_{n₁n₂} )^T (1.32)
and the matrix A can be computed as a Kronecker product involving matrices from the 1D dis-
cretization. For i = 1, 2 define
A_i := \frac{1}{h_i^2} \begin{pmatrix}
2ε + h_i c_i + h_i^2 r & −ε & & & \\
−(ε + h_i c_i) & 2ε + h_i c_i + h_i^2 r & −ε & & \\
 & \ddots & \ddots & \ddots & \\
 & & −(ε + h_i c_i) & 2ε + h_i c_i + h_i^2 r & −ε \\
 & & & −(ε + h_i c_i) & 2ε + h_i c_i + h_i^2 r
\end{pmatrix}. (1.33)
Recall that for matrices C ∈ Rm×n and B ∈ R p×q the Kronecker product C ⊗ B is the matrix
C ⊗ B = \begin{pmatrix}
c_{11} B & \cdots & c_{1n} B \\
\vdots & & \vdots \\
c_{m1} B & \cdots & c_{mn} B
\end{pmatrix} ∈ R^{mp×nq}. (1.34)
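The Kronecker-product assembly is easy to carry out with standard sparse-matrix tools. The following small Python sketch (illustrative only; the placeholder 1D matrices and the combination I_{n₂} ⊗ A₁ + A₂ ⊗ I_{n₁}, which mirrors the pattern of (1.57) later in the notes, are assumptions and do not reproduce the exact formula (1.35)) shows the mechanics and the resulting dimension.

    # Sketch: Kronecker products as in (1.34) and a 2D assembly from 1D matrices.
    import scipy.sparse as sp

    n1, n2 = 4, 3
    A1 = sp.random(n1, n1, density=0.5, format="csr")   # placeholder 1D matrices
    A2 = sp.random(n2, n2, density=0.5, format="csr")
    A  = sp.kron(sp.identity(n2), A1) + sp.kron(A2, sp.identity(n1))
    print(A.shape)   # (n1*n2, n1*n2) = (12, 12)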
Using lexicographic ordering of the grid points, the matrix A in (1.31) is given by
where Ini ∈ Rni ×ni is the ni × ni identity matrix. The matrix A in (1.35) is of the form
A = \begin{pmatrix}
D₁ & −F₁ & & & \\
−E₂ & D₂ & −F₂ & & \\
 & \ddots & \ddots & \ddots & \\
 & & −E_{n₂−1} & D_{n₂−1} & −F_{n₂−1} \\
 & & & −E_{n₂} & D_{n₂}
\end{pmatrix}, (1.36a)
²A vector v is lexicographically less than a vector w if there exists an index j such that v₁ = w₁, . . . , v_{j−1} = w_{j−1} and
v_j < w_j. The grid points shown in Figure 1.3 are in lexicographic order, i.e., the grid point (x_{1,i}, x_{2,i}) is lexicographically
less than (x_{1,k}, x_{2,k}) if and only if i < k.
with

D_i = \begin{pmatrix}
d & −ε/h₁² & & & \\
−(ε + h₁ c₁)/h₁² & d & −ε/h₁² & & \\
 & \ddots & \ddots & \ddots & \\
 & & −(ε + h₁ c₁)/h₁² & d & −ε/h₁² \\
 & & & −(ε + h₁ c₁)/h₁² & d
\end{pmatrix} ∈ R^{n₁×n₁}, i = 1, . . . , n₂, (1.36b)

where

d = (2ε + h₁ c₁)/h₁² + (2ε + h₂ c₂)/h₂² + r,
and

−E_{i+1} = diag( −(ε + h₂ c₂)/h₂², . . . , −(ε + h₂ c₂)/h₂² ) ∈ R^{n₁×n₁}, (1.36c)

−F_i = diag( −ε/h₂², . . . , −ε/h₂² ) ∈ R^{n₁×n₁}. (1.36d)
Example 1.3.2 Consider the convection diffusion equation (1.29) with Ω = (0, 1)², ε = 10⁻⁴,
θ = 47.3°, c = (cos θ, sin θ), r = 0, f = 0, and Dirichlet conditions

g(x₁, x₂) = 1 if x₁ = 0 and x₂ ≤ 0.25,   g(x₁, x₂) = 1 if x₂ = 0,   g(x₁, x₂) = 0 else.
Figure 1.4 shows a sketch of the problem data and Figure 1.5 shows the computed solution.
Figure 1.4: Sketch of the problem data for the 2D advection diffusion equation in Example 1.3.2.
Figure 1.5: Finite difference approximation of the solution to the convection diffusion equation
(1.29) with data specified in Example 1.3.2 computed using an n1 = 10 by n2 = 10 grid.
Other orderings of the grid points are possible. In particular, the red-black (checkerboard)
ordering of the grid points, illustrated in Figure 1.6, will be useful for some iterative methods.
Other orderings correspond to a symmetric permutation
P A P^T P u = P b (1.37)
of the system (1.31). Here the permutation matrix P is derived from the ordering of the nodes. For
example, if the red-black (checkerboard) ordering is used, then for the 4 × 3 grid the permutation
matrix is determined from
P (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)T = (1, 7, 2, 8, 9, 3, 10, 4, 5, 11, 6, 12)T .
5 11 6 12
9 3 10 4
1 7 2 8
Figure 1.6: Simple 4 × 3 grid with red-black ordering of the grid points.
With a red-black ordering of the equations and unknowns the system matrix is of the form

A = \begin{pmatrix} D_r & A_{rb} \\ A_{br} & D_b \end{pmatrix}, (1.38a)

where

D_r = D_b = diag( (2ε + h₁ c₁)/h₁² + (2ε + h₂ c₂)/h₂² + r, . . . , (2ε + h₁ c₁)/h₁² + (2ε + h₂ c₂)/h₂² + r ) ∈ R^{(n₁n₂/2)×(n₁n₂/2)} (1.38b)
(if n1 n2 is odd, Dr has one more column and row than Db ), and the matrices Ar b , Abr have at most
four nonzero entries per row and per column. For the example grid shown in Figure 1.6, these
matrices are
A_{rb} = \begin{pmatrix}
-\frac{ε}{h_1^2} & 0 & -\frac{ε}{h_2^2} & 0 & 0 & 0 \\
-\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2} & 0 & -\frac{ε}{h_2^2} & 0 & 0 \\
-\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2} & -\frac{ε}{h_2^2} & 0 \\
0 & -\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε+h_1 c_1}{h_1^2} & 0 & -\frac{ε}{h_2^2} \\
0 & 0 & -\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε}{h_1^2} & 0 \\
0 & 0 & 0 & -\frac{ε+h_2 c_2}{h_2^2} & -\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2}
\end{pmatrix}, (1.38c)

A_{br} = \begin{pmatrix}
-\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2} & -\frac{ε}{h_2^2} & 0 & 0 & 0 \\
0 & -\frac{ε+h_1 c_1}{h_1^2} & 0 & -\frac{ε}{h_2^2} & 0 & 0 \\
-\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε}{h_1^2} & 0 & -\frac{ε}{h_2^2} & 0 \\
0 & -\frac{ε+h_2 c_2}{h_2^2} & -\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2} & 0 & -\frac{ε}{h_2^2} \\
0 & 0 & -\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε+h_1 c_1}{h_1^2} & -\frac{ε}{h_1^2} \\
0 & 0 & 0 & -\frac{ε+h_2 c_2}{h_2^2} & 0 & -\frac{ε+h_1 c_1}{h_1^2}
\end{pmatrix}. (1.38d)
As we have seen in the 1D case, for many matrix operations, the stencil is all we need. In fact
if we work with the stencil it is often favorable to store the unknowns as a matrix. If we store
the unknowns in an (n1 + 2) × (n2 + 2) array (we use zero based indexing) and for i ∈ {0, n1 } or
j ∈ {0, n2 } set the values to given boundary data (cf., (1.29b)),
then for i ∈ {1, . . . , n1 }, j ∈ {1, . . . , n2 }, the (i, j)-th equation can be written as
−(ε/h₂²) y_{i,j+1} − ((ε + h₁ c₁)/h₁²) y_{i−1,j} + ( (2ε + h₁ c₁)/h₁² + (2ε + h₂ c₂)/h₂² + r ) y_{i,j}
−(ε/h₁²) y_{i+1,j} − ((ε + h₂ c₂)/h₂²) y_{i,j−1} = f(x_{1,i}, x_{2,j}). (1.39)
The finite difference scheme (1.39) can be represented by the 5-point stencil
\begin{pmatrix}
0 & −ε/h₂² & 0 \\
−(ε + h₁ c₁)/h₁² & (2ε + h₁ c₁)/h₁² + (2ε + h₂ c₂)/h₂² + r & −ε/h₁² \\
0 & −(ε + h₂ c₂)/h₂² & 0
\end{pmatrix}. (1.40)
The left hand side in (1.39) is obtained by first multiplying the stencil (1.40) component wise with
\begin{pmatrix}
y_{i−1,j+1} & y_{i,j+1} & y_{i+1,j+1} \\
y_{i−1,j} & y_{i,j} & y_{i+1,j} \\
y_{i−1,j−1} & y_{i,j−1} & y_{i+1,j−1}
\end{pmatrix}
and then summing up the resulting values.
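The stencil-based, matrix-free application of (1.39) can be written very compactly with array slicing. The following Python sketch is an illustration under the assumption that the unknowns are stored in an (n₁+2) × (n₂+2) array whose boundary entries hold the Dirichlet data; the function name and arguments are not from the notes.

    # Sketch: apply the 5-point stencil (1.40) to the interior of a 2D array y.
    import numpy as np

    def apply_stencil(y, eps, c1, c2, r, h1, h2):
        d = (2*eps + h1*c1)/h1**2 + (2*eps + h2*c2)/h2**2 + r
        w = -(eps + h1*c1)/h1**2      # weight of the west  neighbor y_{i-1,j}
        e = -eps/h1**2                # weight of the east  neighbor y_{i+1,j}
        s = -(eps + h2*c2)/h2**2      # weight of the south neighbor y_{i,j-1}
        n = -eps/h2**2                # weight of the north neighbor y_{i,j+1}
        yc = y[1:-1, 1:-1]
        return (d*yc + w*y[:-2, 1:-1] + e*y[2:, 1:-1]
                     + s*y[1:-1, :-2] + n*y[1:-1, 2:])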
where the parameters ε > 0, r ≥ 0, c = (c₁, c₂)^T with c₁, c₂ ≥ 0 and the functions f, g are given,
but the function u on Ω_c has to be selected. Here χ_{Ω_c} is the indicator function with χ_{Ω_c}(x) = 1
if x ∈ Ω_c and χ_{Ω_c}(x) = 0 otherwise. Given another subset Ω_o ⊂ Ω, we want to find a function u
such that the corresponding solution y of the PDE (1.41) is as close as possible to a desired function y_des.
For example (1.41) could model the temperature distribution in a convection oven and u would be a
volumetric heat source provided through heaters located at Ωc . We want to heat the oven to achieve
a desired temperature y des in the region Ωo of the oven.
We will model ‘as close as possible’ in the least squares sense and formulate the problem as the
minimization problem
minimize (1/2) ∫_{Ω_o} ( y(x) − y_des(x) )² dx + (α/2) ∫_{Ω_c} u(x)² dx (1.42a)

subject to −ε Δy(x) + c · ∇y(x) + r y(x) = f(x) + u(x) χ_{Ω_c}(x), x ∈ Ω, (1.42b)

y(x) = g(x), x ∈ ∂Ω. (1.42c)

The term (α/2) ∫_{Ω_c} u(x)² dx with α > 0 in the objective penalizes excessively large |u|. We refer to u
as the control, to y as the state, and to (1.42b,c) as the state equation.
0 = x 1,0 < x 1,1 < . . . < x 1,n1 +1 = 1, 0 = x 2,0 < x 2,1 < . . . < x 2,n2 +1 = 1,
We assume that the control region Ωc and the observation region Ωo are rectangles with corners
that coincide with grid points. For example Ωc = (x 1,ν, x 1,µ ) × (x 2,k , x 2,l ).
Applying the discretization (1.30) to (1.42b,c) leads to the system
−ε ( y_{i−1,j} − 2y_{ij} + y_{i+1,j} )/h₁² − ε ( y_{i,j−1} − 2y_{ij} + y_{i,j+1} )/h₂²
+ c₁ ( y_{ij} − y_{i−1,j} )/h₁ + c₂ ( y_{ij} − y_{i,j−1} )/h₂ + r y_{ij}
= f(x_{1,i}, x_{2,j}) + u_{ij} χ_{Ω_c}(x_{1,i}, x_{2,j}), (1.43a)
for i = 1, . . . , n₁, j = 1, . . . , n₂,

y_{ij} = g(x_{1,i}, x_{2,j}), if i ∈ {0, n₁} or j ∈ {0, n₂}. (1.43b)
These equations can be arranged into a linear system
Ay + Bu = c. (1.44)
We can use (1.43b) to eliminate the y_{ij} corresponding to boundary points, as we have done in
the previous section. In this case the number of y variables is n_y = (n₁ − 1)(n₂ − 1). Here
we include all (n₁ + 1)(n₂ + 1) equations (1.43) into (1.44). Thus, the number of y variables is
n_y = (n₁ + 1)(n₂ + 1). In either case, the number of u variables, n_u, is the number of grid points
in Ω_c.
Recall that the control region Ωc and the observation region Ωo are rectangles with corners that
coincide with grid points. The integrals are discretized using
∫_{Ω_o} ( y(x) − y_des(x) )² dx ≈ h₁ h₂ Σ_{(x_{1,i},x_{2,j}) ∈ Ω_o} ( y_{ij} − y_des(x_{1,i}, x_{2,j}) )²
= ( y − y_des )^T Q ( y − y_des ), (1.45a)

∫_{Ω_c} u(x)² dx ≈ h₁ h₂ Σ_{(x_{1,i},x_{2,j}) ∈ Ω_c} u_{ij}² = u^T R u. (1.45b)
In particular Q ∈ Rny ×ny and R ∈ Rnu ×nu are diagonal matrices with diagonal entries h1 h2 .
Combining (1.43)–(1.45) leads to the following discretization of (1.42).
Minimize (1/2) ( y − y_des )^T Q ( y − y_des ) + (α/2) u^T R u, (1.46a)

subject to A y + B u = c. (1.46b)
Obviously, (1.46) is a special case of (1.4),
minimize (1/2) \begin{pmatrix} y \\ u \end{pmatrix}^T \begin{pmatrix} Q & 0 \\ 0 & αR \end{pmatrix} \begin{pmatrix} y \\ u \end{pmatrix}
− \begin{pmatrix} Q y_des \\ 0 \end{pmatrix}^T \begin{pmatrix} y \\ u \end{pmatrix}
+ (1/2) y_des^T Q y_des,

subject to ( A  B ) \begin{pmatrix} y \\ u \end{pmatrix} = c.

The constant (1/2) y_des^T Q y_des in the objective can be dropped since it does not change the solution y, u.
The matrix A is invertible. Therefore, we can eliminate y via

y = −A^{-1} B u + A^{-1} c

and obtain an unconstrained problem in u alone,

minimize (1/2) u^T H u + d^T u + γ, (1.47)

where H = B^T A^{-T} Q A^{-1} B + α R and the vector d and the scalar γ collect the terms that are linear
and constant in u. Note that while A, B, Q, R are sparse matrices, H is dense and in general it
is expensive to form the matrix H explicitly. Instead, we will construct methods that solve (1.47)
iteratively and in each iteration require the computation of one matrix-vector product Hv, where
the vector v is determined by the iterative method.
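The matrix-vector product Hv can be formed without ever assembling H. The following Python sketch illustrates this, under the assumption that A, B, Q, R are given sparse matrices and α a given scalar; the factorization of A is computed once and reused in every product.

    # Sketch: matrix-free product v -> Hv = B^T A^{-T} Q A^{-1} B v + alpha * R v.
    import scipy.sparse.linalg as spla

    def make_apply_H(A, B, Q, R, alpha):
        lu = spla.splu(A.tocsc())                 # factor A once
        def apply_H(v):
            y = lu.solve(B @ v)                   # y = A^{-1} B v
            w = lu.solve(Q @ y, trans="T")        # w = A^{-T} Q y
            return B.T @ w + alpha * (R @ v)
        return apply_H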
∂y/∂t (x, t) − α ∂²y/∂x² (x, t) + β ∂y/∂x (x, t) = 0, x ∈ (0, 1), t ∈ (0, T), (1.48a)
y(0, t) = y(1, t), t ∈ (0, T), (1.48b)
y_x(0, t) = y_x(1, t), t ∈ (0, T), (1.48c)
y(x, 0) = y₀(x), x ∈ (0, 1). (1.48d)
We assume that α > 0 and β > 0 are known. In this example we use
α = 0.01, β = 1, T = 0.5.
We want to determine the initial data y0 from measurements of the solution y(x, t) at certain points
in space and in time. We consider a discretization of this problem.
First, we discretize the boundary value problem (1.48) in space using the upwind finite difference
method (1.26,1.27). We divide [0, 1] into n subintervals with length h = 1/n and gridpoints x i = ih,
i = 0, . . . , n. The upwind finite difference method (1.26,1.27) leads to the system of ordinary
differential equations (ODEs)
y'(t) = K y(t), t ∈ (0, T), y(0) = y₀,

where

K = \frac{1}{h^2} \begin{pmatrix}
−2α − βh & α & & & α + βh \\
α + βh & −2α − βh & α & & \\
 & \ddots & \ddots & \ddots & \\
 & & α + βh & −2α − βh & α \\
α & & & α + βh & −2α − βh
\end{pmatrix} ∈ R^{n×n},
y(t) = exp(Kt)y0,
where exp(Kt) ∈ Rn×n is the matrix exponential of Kt. For small n it can be evaluated using
Matlab’s expm, e.g., expm(K ∗ t).3 For larger problems we need to apply an ODE solver. We use
the Crank-Nicolson (trapezoidal) method.
We subdivide the time interval [0, T] into nt subintervals of equal length ∆t = T/nt and we set
t j = j∆t, j = 0, . . . , nt . The Crank-Nicolson (trapezoidal) scheme is given by
(1/Δt) ( y_{j+1} − y_j ) = (1/2) ( K y_{j+1} + K y_j ), j = 0, . . . , n_t − 1.

The vector y_j is an approximation of y(t_j). Rearranging terms shows that for a given y₀ we can
compute y_{j+1}, j = 0, . . . , n_t − 1, by successively solving

( I − (Δt/2) K ) y_{j+1} = ( I + (Δt/2) K ) y_j, j = 0, . . . , n_t − 1. (1.50)
We use discretization parameters
(Figure: surface plot of the computed solution over x and t.)
3 NOTE: exp(K ∗ t) is different from expm(K ∗ t) and the former does not give the matrix exponential, but evaluates
the exponential of the matrix entries.
Now, suppose that we do not know y0 . We want to estimate y0 from measurements of the
computed solution. To specify the spatial measurement, we let m be such that n/m is integer and
we define an observation matrix H ∈ R^{m×n} with entries H_{ij} = 1 if j = (n/m) i and H_{ij} = 0 otherwise.
with

A = \begin{pmatrix}
H ( (I − (Δt/2) K)^{-1} (I + (Δt/2) K) )^{n_t/m_t} \\
\vdots \\
H ( (I − (Δt/2) K)^{-1} (I + (Δt/2) K) )^{n_t}
\end{pmatrix} ∈ R^{m_t m × n}, (1.54b)

b = \begin{pmatrix} z₁ \\ \vdots \\ z_{m_t} \end{pmatrix} ∈ R^{m_t m}. (1.54c)
The formulation (1.53) of the least squares problem uses (1.52), which is fine for theoretical
purposes, but not something that should be used to implement the problem. For the solution of
(1.53) we never compute A, but we use methods that require the action of A on a vector v ∈ R^n and
the action of A^T on a vector w ∈ R^{m_t m}.
For a given vector v ∈ R^n we compute w = A v ∈ R^{m_t m} as follows:
1. Set y₀ = v ∈ R^n.
2. For j = 0, . . . , m_t − 1 do
   perform n_t/m_t steps of (1.50) and set w_{j+1} = H y_{(j+1) n_t/m_t}.
   end
3. Set w = ( w₁^T, . . . , w_{m_t}^T )^T.
Of course, in an implementation we do not generate n_t arrays to store the y_j's, but we use only one
array for the current y_j.
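The procedure above translates directly into code. The Python sketch below is an illustration (not the notes' implementation); K, H, dt, nt, mt are assumed to be given, nt is assumed to be a multiple of mt, and only one array is kept for the current time step.

    # Sketch: compute w = A v for the observation operator (1.54).
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def apply_A(v, K, H, dt, nt, mt):
        n = K.shape[0]
        I = sp.identity(n, format="csc")
        Mminus = spla.splu((I - 0.5 * dt * K).tocsc())
        Mplus = I + 0.5 * dt * K
        y = v.copy()                       # y_0 = v
        blocks = []
        for j in range(nt):
            y = Mminus.solve(Mplus @ y)    # advance one Crank-Nicolson step
            if (j + 1) % (nt // mt) == 0:
                blocks.append(H @ y)       # w_k = H y_{k*nt/mt}
        return np.concatenate(blocks)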
The transpose of A is given by

A^T = ( ( (I + (Δt/2)K)^T (I − (Δt/2)K)^{-T} )^{n_t/m_t} H^T, . . . , ( (I + (Δt/2)K)^T (I − (Δt/2)K)^{-T} )^{n_t} H^T ) ∈ R^{n × m_t m}.

The action of A^T on a vector w = ( w₁^T, . . . , w_{m_t}^T )^T is computed analogously by a backward sweep,
for j = m_t − 1, . . . , 0.
Now, we want to recover the initial condition y₀^ex from measurements of the solution. We set

m = 5 and m_t = 25,

and generate observations

z_k = H y^ex_{k n_t/m_t} + η_k, k = 1, . . . , m_t,

where η_k represents noise. We use 1% normally distributed noise.
We use the conjugate gradient method to solve the resulting least squares problem (1.53). The
exact initial data and our estimate of the initial data computed from noisy observations are shown
in Figure 1.8.
Figure 1.8: True initial data and estimated initial data. The estimated initial data are computed by
solving the least squares problem (1.53).
The least squares problem (1.53) is highly ill-conditioned. Thus small errors in the observations
can lead to large errors in the computed solution y₀^comp. This is what we have seen in Figure 1.8.
To remedy this situation, one can regularize the problem, i.e., replace (1.53) by
min_{y₀ ∈ R^n} (1/2) \left\| \begin{pmatrix}
H ( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t/m_t} \\
\vdots \\
H ( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t}
\end{pmatrix} y₀ − \begin{pmatrix} z₁ \\ \vdots \\ z_{m_t} \end{pmatrix} \right\|_2^2 + (ρ/2) \| W y₀ \|_2^2, (1.55)
where ρ > 0 is a regularization parameter and W ∈ R n×n is a given matrix. The regularized least
squares problems can also be written as
min_{y₀ ∈ R^n} ‖ A y₀ − b ‖₂²,
where now
A = \begin{pmatrix}
H ( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t/m_t} \\
\vdots \\
H ( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t} \\
\sqrt{ρ} W
\end{pmatrix} ∈ R^{(m_t m + n) × n},   b = \begin{pmatrix} z₁ \\ \vdots \\ z_{m_t} \\ 0 \end{pmatrix} ∈ R^{m_t m + n}.
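One way to solve the regularized problem, consistent with the matrix-free point of view above, is to apply the conjugate gradient method to the normal equations (AᵀA + ρ WᵀW) y₀ = Aᵀb. The Python sketch below is an illustration only; apply_A and apply_AT (the actions of A and Aᵀ, e.g. as sketched earlier), the dimension n, and W are assumed to be available.

    # Sketch: CG on the normal equations of the regularized problem (1.55).
    import numpy as np
    import scipy.sparse.linalg as spla

    def solve_regularized(apply_A, apply_AT, b, n, rho, W=None):
        if W is None:
            W = np.eye(n)                           # W = I as in the example below
        def normal_op(y):
            return apply_AT(apply_A(y)) + rho * (W.T @ (W @ y))
        Aop = spla.LinearOperator((n, n), matvec=normal_op)
        rhs = apply_AT(b)
        y0, info = spla.cg(Aop, rhs)
        return y0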
Introductions to the regularization of inverse problems are given, e.g., in the books by Tarantola
[Tar05] and Vogel [Vog02].
We take the same data as above and estimate the initial data by solving the regularized least
squares problem (1.55) with W = I and ρ = 10−2 . The regularized least squares problem is
again solved using the conjugate gradient method. Figure 1.9 shows the exact initial data and the
estimated initial data. This regularization gives an excellent estimate.
Figure 1.9: True initial data and estimated initial data. The estimated initial data are computed by
solving the regularized least squares problem (1.55) with ρ = 10⁻².
Everything we have done can be applied to other linear time-dependent PDEs. As an example,
we consider a two-dimensional analogue of (1.48) with periodic boundary conditions in the x₁ direction
and homogeneous Dirichlet boundary conditions in the x₂ direction. We assume that α > 0 and β > 0 are known. In this example we use
α = 0.01, β = 1, T = 0.5.
We discretize the spatial domain using n1 subintervals of length h1 = 1/n1 in the x 1 direction
and n2 + 1 subintervals of length h2 = 1/(n2 + 1) in the x 2 direction. If we define
K₁ = \frac{1}{h_1^2} \begin{pmatrix}
−2α − βh₁ & α & & & α + βh₁ \\
α + βh₁ & −2α − βh₁ & α & & \\
 & \ddots & \ddots & \ddots & \\
 & & α + βh₁ & −2α − βh₁ & α \\
α & & & α + βh₁ & −2α − βh₁
\end{pmatrix} ∈ R^{n₁×n₁},

K₂ = \frac{1}{h_2^2} \begin{pmatrix}
−2α & α & & \\
α & −2α & α & \\
 & \ddots & \ddots & \ddots \\
 & & α & −2α
\end{pmatrix} ∈ R^{n₂×n₂},
and use a lexicographic ordering of unknowns

y(t) = ( y₁₁(t), . . . , y_{n₁1}(t), y₁₂(t), . . . , y_{n₁2}(t), . . . . . . , y_{1n₂}(t), . . . , y_{n₁n₂}(t) )^T,

then the matrix of the semi-discretized system is

K = I_{n₂} ⊗ K₁ + K₂ ⊗ I_{n₁}. (1.57)
To construct an observation matrix, let m₁, m₂ be integers such that n₁/m₁ and n₂/m₂ are
integers and for ℓ = 1, 2 define H_ℓ ∈ R^{m_ℓ × n_ℓ} with entries defined as in the 1D case. We use

m₁ = m₂ = 5 and m_t = 25,

and generate observations

z_k = H y^ex_{k n_t/m_t} + η_k, k = 1, . . . , m_t,

where η_k represents noise. We use 1% normally distributed noise. We use the conjugate gradient
method to solve the resulting least squares problem (1.53). The exact initial data and our
estimate of the initial data computed from noisy observations are shown in Figure 1.11. In this case
the standard least squares problem (1.53) provides a good estimate.
1.6. Problems
Problem 1.1
i. Let H ∈ Rn×n be symmetric and satisfy
vT Hv > 0 for all v ∈ N ( A) \ {0},
and let A ∈ Rm×n , m < n, be a matrix with rank m.
Show that the equality constrained quadratic program (1.4) has a solution x ∈ R^n if and only
if there exists λ ∈ R^m such that (1.6) is satisfied. Moreover, show that x and λ are unique.
Hint: Since A has rank m, there exists an m × m invertible submatrix B of A. Without
loss of generality assume that the first m columns of A are linearly independent, i.e., that
A = (B | N ) with B ∈ Rm×m invertible and N ∈ Rm×(n−m) . Write
x = \begin{pmatrix} x_B \\ x_N \end{pmatrix}, x_B ∈ R^m, x_N ∈ R^{n−m},
and convert the equality constrained quadratic program (1.4) into an unconstrained quadratic
program in x N .
ii. Let H ∈ Rn×n be symmetric and satisfy
vT Hv ≥ 0 for all v ∈ N ( A).
If (1.4) has a solution, is it unique?
iii. Let A ∈ Rm×n , m < n, have rank r < m and let b ∈ R ( A). Is the vector λ of Lagrange
multipliers unique?
Problem 1.2 Let α1, α2 be real numbers. Verify that the eigenvalues of the n × n matrix
A = \begin{pmatrix}
α₁ & α₂ & & & \\
α₂ & α₁ & α₂ & & \\
 & \ddots & \ddots & \ddots & \\
 & & α₂ & α₁ & α₂ \\
 & & & α₂ & α₁
\end{pmatrix} (1.60)

are given by

λ_j = α₁ + 2α₂ cos( jπ/(n+1) ), j = 1, . . . , n,

and that an eigenvector associated with the eigenvalue λ_j is

v_j = sqrt( 2/(n+1) ) ( sin( jπ/(n+1) ), sin( j 2π/(n+1) ), . . . , sin( j nπ/(n+1) ) )^T.

Moreover, show that v_i^T v_j = 0 for i ≠ j, and ‖v_i‖₂ = 1.
Problem 1.3 Let α1, α2, α3 be real numbers. Verify that the eigenvalues of the n × n matrix
A = \begin{pmatrix}
α₁ & α₃ & & & \\
α₂ & α₁ & α₃ & & \\
 & \ddots & \ddots & \ddots & \\
 & & α₂ & α₁ & α₃ \\
 & & & α₂ & α₁
\end{pmatrix} (1.61)

are given by

λ_j = α₁ + 2α₃ sqrt( α₂/α₃ ) cos( jπ/(n+1) ), j = 1, . . . , n,

and that a (non-normalized) eigenvector associated with the eigenvalue λ_j is

v_j = ( (α₂/α₃)^{1/2} sin( jπ/(n+1) ), (α₂/α₃)^{2/2} sin( j 2π/(n+1) ), . . . , (α₂/α₃)^{n/2} sin( j nπ/(n+1) ) )^T.
Problem 1.4
ii. Let a_ij be the entries of the matrix in (1.24). Show that if ε > 0 and c, r ≥ 0, then for any h > 0,

Σ_{j≠i} |a_ij| ≤ |a_ii| for i = 1, . . . , n,
Problem 1.5 We study the eigenvalues of AT A for the matrix that arises in the least squares
problem (1.54) for a slightly simplified problem.
We observe the ODE solution at every grid point (i.e., m = n and the observation matrix is
H = I ∈ R^{n×n}) and at time steps n_t/m_t, 2n_t/m_t, . . . , n_t, where m_t is such that n_t/m_t is an integer. Hence,
the matrix A in (1.54b) becomes
A = \begin{pmatrix}
( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t/m_t} \\
\vdots \\
( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^{n_t}
\end{pmatrix} ∈ R^{m_t n × n}.
i. Determine AT A.
ii. Suppose there exists an orthonormal matrix V ∈ Rn×n and a diagonal matrix D ∈ Rn×n such
that
K = V DV T .
The diagonal entries of D are the eigenvalues of K and the columns of V are the corresponding
eigenvectors.
What are the eigenvalues and eigenvectors of ( (I − (Δt/2)K)^{-1} (I + (Δt/2)K) )^ℓ, ℓ ∈ N?
and of the corresponding AT A, obtained using T = 0.5, n = 100, nt = 50 (∆t = 0.01), and
mt = 25.
Note K results from the finite difference discretization of (1.48) with α = 1 and β = 0.
[BGL05] M. Benzi, G. H. Golub, and J. Liesen. Numerical solution of saddle point problems.
In A. Iserles, editor, Acta Numerica 2005, pages 1–137. Cambridge University Press,
Cambridge, London, New York, 2005.
[ESW05] H. C. Elman, D. J. Silvester, and A. J. Wathen. Finite Elements and Fast Iterative
Solvers with Applications in Incompressible Fluid Dynamics. Oxford University Press,
Oxford, 2005.
[GL89] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3 of Johns Hopkins
Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD,
second edition, 1989.
[HPUU09] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich. Optimization with PDE Constraints,
volume 23 of Mathematical Modelling, Theory and Applications. Springer Verlag, Heidelberg,
New York, Berlin, 2009. URL: http://dx.doi.org/10.1007/978-1-4020-8839-1,
doi:10.1007/978-1-4020-8839-1.
[LKM10] W. Lahoz, B. Khattatov, and R. Menard, editors. Data Assimilation: Making Sense of
Observations. Springer, Berlin, Heidelberg, 2010. URL: http://dx.doi.org/10.1007/978-3-540-74703-1,
doi:10.1007/978-3-540-74703-1.
[Saa03] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, second
edition, 2003.
[Tar05] A. Tarantola. Inverse problem theory and methods for model parameter estimation.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2005.
[Trö10] F. Tröltzsch. Optimal Control of Partial Differential Equations: Theory, Methods and
Applications, volume 112 of Graduate Studies in Mathematics. American Mathemat-
ical Society, Providence, RI, 2010. URL: http://dx.doi.org/10.1090/gsm/112.
Chapter 2. Stationary Iterative Methods

2.1. Introduction
In this section we study linear fixed point iterative methods for the solution of square systems of
linear equations
Ax = b. (2.1)
Many methods discussed in this section are derived from a splitting of the matrix A. Let
A=M−N (2.2)
with nonsingular M ∈ Rn×n . Since M is nonsingular,
Ax = b if and only if x = M −1 N x + M −1 b.
Thus x solves the linear system Ax = b if and only if x is a fixed point of the map x 7→
M −1 N x + M −1 b. We can try to find a fixed point using the fixed point iteration
x^{(k+1)} = M^{-1} ( N x^{(k)} + b ), (2.3)

which, since N = M − A, may also be written as

x^{(k+1)} = M^{-1} N x^{(k)} + M^{-1} b = ( I − M^{-1} A ) x^{(k)} + M^{-1} b. (2.4)
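The iteration (2.4) is equivalently x^{(k+1)} = x^{(k)} + M^{-1}(b − A x^{(k)}). A generic Python sketch of this form is given below as an illustration; solve_M is assumed to apply M^{-1} (for example a diagonal or triangular solve), and the names are not from the notes.

    # Sketch: generic stationary (splitting) iteration x_{k+1} = x_k + M^{-1}(b - A x_k).
    import numpy as np

    def stationary_iteration(A, b, solve_M, x0, maxit=100, tol=1e-8):
        x = x0.copy()
        for k in range(maxit):
            r = b - A @ x                       # residual
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            x = x + solve_M(r)                  # x^{(k+1)} = x^{(k)} + M^{-1} r^{(k)}
        return x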
We discuss several stationary iterative methods and their convergence properties. We focus
on classical stationary iterative methods like the Jacobi method, the Gauss-Seidel method, and the
successive overrelaxation (SOR) method, but we also touch upon multigrid methods and domain
decomposition methods. We present several convergence results for the Jacobi method, the Gauss-
Seidel method, and the SOR method. Many of these convergence results fit beautifully with the
properties of systems arising from discretizations of PDEs and we will use the examples from
Sections 1.3.1 and 1.3.2 to illustrate the convergence behavior of these stationary iterative methods.
Finally, we will show that for systems with a symmetric positive definite matrix, the Jacobi and the
Gauss-Seidel method can also be interpreted as particular coordinate descent minimization methods.
The methods introduced in this section generate a sequence of approximations x (k) ∈ Rn to the
solution of the linear system (2.1). We use superscripts (k) to denote the k-th iteration. The vector
x (k) has components x i(k) , i = 1, . . . , n.
end
The (pointwise) forward Gauss-Seidel (GS) Method is derived from the (pointwise) Jacobi
Method by using new information as soon as it becomes available, and it is given as follows.
For i = 1, . . . , n do

x_i^{(k+1)} = (1/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij x_j^{(k+1)} − Σ_{j=i+1}^{n} a_ij x_j^{(k)} ) (2.6)

end
In an implementation of the Gauss-Seidel Method, only one array is needed to store x, since x i(k)
can be overwritten by x i(k+1) as soon as it becomes available.
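The following short Python sketch (illustrative, for a dense matrix A) shows one Jacobi sweep and one forward Gauss-Seidel sweep of the form (2.6); in the Gauss-Seidel sweep the vector x is overwritten in place, so the updated components are used as soon as they are available.

    # Sketch: one Jacobi sweep and one forward Gauss-Seidel sweep.
    import numpy as np

    def jacobi_sweep(A, b, x):
        x_new = np.empty_like(x)
        for i in range(len(b)):
            s = A[i, :] @ x - A[i, i] * x[i]              # sum over j != i with old x
            x_new[i] = (b[i] - s) / A[i, i]
        return x_new

    def gauss_seidel_sweep(A, b, x):
        for i in range(len(b)):
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]   # uses already updated x[:i]
            x[i] = (b[i] - s) / A[i, i]
        return x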
Note that the Jacobi method is independent of a symmetric ordering of equations and unknowns
in the sense that if Π is a permutation matrix, then the Jacobi method applied to Ax = b is identical
to the Jacobi method applied to
Π AΠT Πx = Πb.
The Gauss-Seidel method, however, depends on the ordering of equations and unknowns, since it
uses new information as soon as it is computed. For example, if we start the Gauss-Seidel method
with the last equation and unknown and work backwards we obtain the (pointwise) backward
Gauss-Seidel (GS) Method
For i = n, n − 1, . . . , 1 do
x_i^{(k+1)} = (1/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij x_j^{(k)} − Σ_{j=i+1}^{n} a_ij x_j^{(k+1)} ) (2.7)

end
Let x^GS be the (pointwise) forward Gauss-Seidel iterate given by (2.6). A new iteration is
obtained if we choose ω > 0 and set

x_i^{(k+1)} = x_i^{(k)} + ω ( x_i^GS − x_i^{(k)} ) = ω x_i^GS + (1 − ω) x_i^{(k)}, i = 1, . . . , n. (2.8)

Written out, this iteration, the (pointwise) forward Successive Over Relaxation (SOR) method, reads

For i = 1, . . . , n do

x_i^{(k+1)} = (ω/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij x_j^{(k+1)} − Σ_{j=i+1}^{n} a_ij x_j^{(k)} ) + (1 − ω) x_i^{(k)} (2.9)

end
Like the (pointwise) Gauss-Seidel method, the (pointwise) Successive Over Relaxation (SOR) also
depends on the ordering of the equations and unknowns. In principle, it is possible to do the same
with the Jacobi method. For example, if x J is the (pointwise) Jacobi iterate given by (2.5), then we
can generate a new iterate using
x_i^{(k+1)} = x_i^{(k)} + ω ( x_i^J − x_i^{(k)} ) = ω x_i^J + (1 − ω) x_i^{(k)}, i = 1, . . . , n. (2.10)

In the literature, this iteration is usually referred to as the (pointwise) damped Jacobi Method.
Using the definition (2.5) of the Jacobi iterate, it can be written as follows.
For i = 1, . . . , n do
x_i^{(k+1)} = (ω/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij x_j^{(k)} − Σ_{j=i+1}^{n} a_ij x_j^{(k)} ) + (1 − ω) x_i^{(k)} (2.11)

end
The Jacobi, Gauss–Seidel, and SOR method can be expressed using matrix-vector notation.
We split
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1,n−1} & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2,n−1} & a_{2n} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_{n−1,1} & a_{n−1,2} & \cdots & a_{n−1,n−1} & a_{n−1,n} \\
a_{n1} & a_{n2} & \cdots & a_{n,n−1} & a_{nn}
\end{pmatrix} (2.12a)

into its diagonal

D = \begin{pmatrix}
a_{11} & 0 & \cdots & 0 & 0 \\
0 & a_{22} & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & a_{n−1,n−1} & 0 \\
0 & 0 & \cdots & 0 & a_{nn}
\end{pmatrix}, (2.12b)

the strict lower triangular part −E and the strict upper triangular part −F,

−E = \begin{pmatrix}
0 & 0 & \cdots & 0 & 0 \\
a_{21} & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_{n−1,1} & a_{n−1,2} & \cdots & 0 & 0 \\
a_{n1} & a_{n2} & \cdots & a_{n,n−1} & 0
\end{pmatrix},
−F = \begin{pmatrix}
0 & a_{12} & \cdots & a_{1,n−1} & a_{1n} \\
0 & 0 & \cdots & a_{2,n−1} & a_{2n} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & a_{n−1,n} \\
0 & 0 & \cdots & 0 & 0
\end{pmatrix}. (2.12c)

That is

A = D − E − F. (2.12d)
This leads to the following representations of the previous methods:
(pointwise) Jacobi:

x^{(k+1)} = D^{-1} ( (E + F) x^{(k)} + b ) = ( I − D^{-1} A ) x^{(k)} + D^{-1} b, (2.13)

(pointwise) forward GS:

x^{(k+1)} = (D − E)^{-1} ( F x^{(k)} + b ), (2.14)

(pointwise) backward GS:

x^{(k+1)} = (D − F)^{-1} ( E x^{(k)} + b ), (2.15)

(pointwise) forward SOR:

x^{(k+1)} = (D − ωE)^{-1} ( [ωF + (1 − ω)D] x^{(k)} + ωb ), (2.16)

and damped Jacobi:

x^{(k+1)} = ( I − ωD^{-1} A ) x^{(k)} + ωD^{-1} b. (2.17)
The Jacobi method (2.13), the forward GS (2.14), the backward GS (2.15), and the forward SOR
(2.16) are special cases of (2.3) with the following splittings A = M − N:

Jacobi: M = D, N = E + F. (2.18a)
forward GS: M = D − E, N = F. (2.18b)
backward GS: M = D − F, N = E. (2.18c)
forward SOR: M = (1/ω) ( D − ωE ), N = (1/ω) ( (1 − ω) D + ωF ). (2.18d)
damped Jacobi: M = (1/ω) D, N = (1/ω) D − A. (2.18e)
We can also derive block versions of the Jacobi, Gauss–Seidel, and SOR method. Suppose that
and we define
If we use the matrices D, E, F in (2.19) in the equations (2.13), (2.14), (2.15), (2.16), and (2.3) we
obtain the block Jacobi Method the block forward GS Method, the block backward GS Method, and
the block forward SOR Method, respectively. Each iteration of these methods requires the solution
of systems of size mi × mi for i = 1, . . . , n.
A = M − N,
then x solves
Ax = b
if and only if x satisfies
x = M −1 N x + M −1 b.
We set G = M^{-1} N and f = M^{-1} b and consider the basic iterative method

x^{(k+1)} = G x^{(k)} + f.

When does {x^{(k)}} converge? If this sequence converges, x^{(k)} → x^{(∗)} (k → ∞), then the limit x^{(∗)}
satisfies

x^{(∗)} = G x^{(∗)} + f,

i.e., it is a fixed point of the map x ↦ Gx + f. The errors e^{(k)} = x^{(k)} − x^{(∗)} satisfy

e^{(k+1)} = G e^{(k)} = G^{k+1} e^{(0)}. (2.21)
If there exists a matrix norm that is submultiplicative and subordinate to a vector norm¹ such that
‖G‖ < 1, then the series Σ_{k=0}^{∞} G^k converges to (I − G)^{-1}. In particular I − G is invertible and there
exists a unique fixed point x^{(∗)} = G x^{(∗)} + f. Moreover, since the errors satisfy e^{(k)} = G^k e^{(0)}, we have
‖e^{(k)}‖ ≤ ‖G‖^k ‖e^{(0)}‖ → 0 (k → ∞).
Thus, if we are able to find a matrix norm such that ‖G‖ < 1, we are guaranteed convergence, but
how do we know whether such a norm exists?
If u₁, . . . , u_n are eigenvectors of A with eigenvalues λ₁, . . . , λ_n, and we set U = (u₁, . . . , u_n) and Λ = diag(λ₁, . . . , λ_n),
then
AU = ( Au1, . . . , Aun ) = (u1 λ 1, . . . , un λ n ) = UΛ.
If the matrix U is invertible, that is if we can find n linearly independent eigenvectors u1, . . . , un ,
then
A = UΛU −1 . (2.24)
If there exists an invertible matrix U and a diagonal matrix Λ such that (2.24) holds, we say that
the matrix A is diagonalizable.
Given a matrix U ∈ C^{n×n} we define U^* = \bar{U}^T. A matrix is unitarily diagonalizable, if the matrix
U ∈ Cn×n of eigenvectors is not only invertible but satisfies U −1 = U ∗ , i.e., U ∗U = I. A matrix
U ∈ Cn×n with U ∗U = I is called a unitary matrix. If the matrix U ∈ Rn×n , then U ∗ = U T and a
square real matrix U with U T U = I is called orthogonal. Unfortunately, not all square matrices are
diagonalizable and not all diagonalizable matrices are unitarily diagonalizable.
Theorem 2.4.1 If A ∈ R^{n×n} is symmetric, then all eigenvalues λ₁, . . . , λ_n are real and there exist n
orthonormal eigenvectors; in other words, there exists a real diagonal matrix Λ = diag(λ₁, . . . , λ_n) ∈
R^{n×n} and an orthogonal matrix U ∈ R^{n×n}, such that A = U Λ U^T.
Even if a square matrix is not diagonalizable, it can be written in Jordan normal form, sometimes
called Jordan canonical form.
Theorem 2.4.3 (Jordan Normal Form) For any square matrix A ∈ Cn×n there exists a nonsingu-
lar matrix U ∈ Cn×n such that
U^{-1} A U = \begin{pmatrix}
J₁ & 0 & \cdots & 0 \\
0 & J₂ & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & J_k
\end{pmatrix} =: J, (2.25)

where

J_i = \begin{pmatrix}
λ_i & 1 & 0 & \cdots & 0 & 0 \\
0 & λ_i & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & λ_i & 1 \\
0 & 0 & 0 & \cdots & 0 & λ_i
\end{pmatrix}
= λ_i I + N_i,   N_i := \begin{pmatrix}
0 & 1 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & 0 & 1 \\
0 & 0 & 0 & \cdots & 0 & 0
\end{pmatrix}.
The fixed point iteration (2.26), x^{(k+1)} = G x^{(k)} + f, converges for any initial vector x^{(0)} if and only
if ρ(G) < 1. This section provides the proof of this result. We begin with the case of a diagonalizable
iteration matrix G.
Thus

x = Gx + f if and only if y_j = λ_j y_j + (U^{-1} f)_j, j = 1, . . . , n,

where y = U^{-1} x.
If ρ(G) < 1, i.e., |λ_i| < 1, i = 1, . . . , n, then (1 − λ_i) y_i^{(∗)} = (U^{-1} f)_i has a unique solution y_i^{(∗)}.
Consequently, x = Gx + f has a unique fixed point x^{(∗)} = U y^{(∗)}. Moreover, the equations (2.21)
for the error imply

U^{-1} e^{(k)} = Λ U^{-1} e^{(k−1)} = Λ^k U^{-1} e^{(0)}. (2.27)

If we define z^{(k)} = U^{-1} e^{(k)}, then (2.27) reads

z_i^{(k)} = λ_i z_i^{(k−1)} = λ_i^k z_i^{(0)}, i = 1, . . . , n. (2.28)

Clearly, if ρ(G) = max_{i=1,...,n} |λ_i| < 1, then z_i^{(k)} → 0 (k → ∞), i = 1, . . . , n, for any z₁^{(0)}, . . . , z_n^{(0)},
and e^{(k)} → 0 (k → ∞) for any initial error e^{(0)}. Moreover, if ρ(G) < 1 the error z^{(k)} = U^{-1} e^{(k)}
decreases monotonically and the components z_i^{(k)} decrease the faster the smaller |λ_i|.
On the other hand, if the fixed point iteration (2.26) converges for any starting vector x^{(0)}, then
x^{(k)} → x^{(∗)} = x^{(∗)}(x^{(0)}) (k → ∞). (Note that we do not know yet that there is only one fixed
point and therefore the limit may depend on the initial vector x^{(0)}.) The errors e^{(k)} = x^{(k)} − x^{(∗)}
satisfy (2.21) and z^{(k)} = U^{-1} e^{(k)} satisfies (2.28). The iterates given by (2.28) converge, z_i^{(k)} → 0
(k → ∞), i = 1, . . . , n, for any initial errors z₁^{(0)}, . . . , z_n^{(0)}, only if ρ(G) = max_{i=1,...,n} |λ_i| < 1, and in
this case the fixed point x^{(∗)} is unique.
We have shown the following result.
Theorem 2.5.1 Let G ∈ Rn×n be diagonalizable. There exists a unique fixed point x (∗) of x = Gx+ f
and the iteration (2.26) converges to x (∗) for any initial vector x (0) if and only if ρ(G) < 1.
If G is unitarily diagonalizable, i.e., if G is normal, then G = UΛU −1 with kU k2 = 1. In this
case (2.27) implies
‖e^{(k)}‖₂ = ‖U^* e^{(k)}‖₂ ≤ ρ(G)^k ‖U^* e^{(0)}‖₂ = ρ(G)^k ‖e^{(0)}‖₂.
Hence if ρ(G) < 1 the error e (k) decreases monotonically in the 2-norm.
The diagonalizability of G can also be used to establish a relation between the spectral radius
of G and norms of matrix powers. We have
‖G^k‖₂ = ‖U Λ^k U^{-1}‖₂ ≤ ‖U‖₂ ‖Λ^k‖₂ ‖U^{-1}‖₂ = ρ(G)^k ‖U‖₂ ‖U^{-1}‖₂

and

‖G^k‖₂ = ( ‖U^{-1}‖₂ ‖U Λ^k U^{-1}‖₂ ‖U‖₂ ) / ( ‖U‖₂ ‖U^{-1}‖₂ ) ≥ ‖Λ^k‖₂ / ( ‖U‖₂ ‖U^{-1}‖₂ ) = ρ(G)^k / ( ‖U‖₂ ‖U^{-1}‖₂ ).

Hence

ρ(G) ( 1/( ‖U‖₂ ‖U^{-1}‖₂ ) )^{1/k} ≤ ‖G^k‖₂^{1/k} ≤ ρ(G) ( ‖U‖₂ ‖U^{-1}‖₂ )^{1/k}. (2.29)
Note that if U is unitary, we even have ‖G^k‖₂ = ρ(G)^k. The inequalities (2.29) imply

lim_{k→∞} ‖G^k‖₂^{1/k} = ρ(G). (2.30)
Since all matrix norms are equivalent, we even have

lim_{k→∞} ‖G^k‖^{1/k} = ρ(G) for any matrix norm ‖·‖. (2.31)

We have proven (2.31) for diagonalizable matrices G. We will see shortly that (2.31) is true for
all matrices.
Since the errors e^{(k)} = x^{(k)} − x^{(∗)} of the iteration (2.26) obey

‖e^{(k)}‖ = ‖G^k e^{(0)}‖ ≤ ‖G^k‖ ‖e^{(0)}‖ = ( ‖G^k‖^{1/k} )^k ‖e^{(0)}‖,

‖G^k‖ is called the convergence factor (for k steps) of the iteration (2.26) and ‖G^k‖^{1/k} is called the
average convergence factor (per step for k steps) of the iteration (2.26).
The matrix G has eigenvalues λ 1 = 0.5, λ 2 = 0.3, and it is diagonalizable but not normal. The
2-norms of the errors e (k) and the components z1(k) , z2(k) of the error z (k) = U −1 e (k) are shown in
Figure 2.1. The components z1(k) , z2(k) of the error z (k) = U −1 e (k) decrease monotonically by a
factor λ 1 = 0.5 and λ 2 = 0.3, respectively.
Figure 2.1: Left plot: Convergence of the iterates e (k+1) = Ge (k) for G given by (2.32) and initial
iterate e (0) = (38, 38)T . Right plot: The average convergence factor kG k k 1/k for G given by (2.32).
The red line indicates ρ(G).
Remark 2.5.4 Assume that ρ(G) < 1. The errors e^{(k)} = x^{(k)} − x^{(∗)} of the iteration (2.26) obey

‖e^{(k)}‖ = ‖G^k e^{(0)}‖ ≤ ‖G^k‖ ‖e^{(0)}‖ = ( ‖G^k‖^{1/k} )^k ‖e^{(0)}‖ ≈ ρ(G)^k ‖e^{(0)}‖.

Hence we can use ρ(G)^k as an estimate for ‖e^{(k)}‖/‖e^{(0)}‖. In particular, if we want to reduce the
error below a factor 10^{−d} times the initial error, i.e., we want

‖e^{(k)}‖ / ‖e^{(0)}‖ ≤ 10^{−d},
then we should expect to need k̄ iterations where k̄ is such that ρ(G) k̄ ≤ 10−d , or
k̄ ≥ −d/ log10 ( ρ(G)). (2.33)
Note this estimate is sharp for unitarily diagonalizable matrices if we use the 2-norm, but can be
too optimistic otherwise. See, e.g., Remark 2.5.8 below. Table 2.1 shows the estimated number of
linear fixed point iterations (2.26) needed to reduce the initial error by a factor 10−2 for various
spectral radii of G. The estimate is (2.33).
ρ(G) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.99
k̄ 2 3 4 6 7 10 13 21 44 90 459
Table 2.1: The estimated (using (2.33)) number k̄ of linear fixed point iterations (2.26) that need to
be executed to reduce the initial error by a factor 10−2 for various spectral radii of G.
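The entries of Table 2.1 can be reproduced directly from the estimate (2.33). The short Python check below is an illustration (not part of the notes) for d = 2.

    # Sketch: reproduce Table 2.1 via k_bar >= -d / log10(rho(G)) with d = 2.
    import math

    d = 2
    for rho in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
        k_bar = math.ceil(-d / math.log10(rho))
        print(f"rho = {rho:5.2f}   k_bar = {k_bar}")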
Note that

N_i^k = 0 for k ≥ n_i.

Consequently,

J_i^k = Σ_{j=0}^{k} ( k! / (j!(k−j)!) ) λ_i^{k−j} N_i^j = Σ_{j=0}^{min{k,n_i}} ( k! / (j!(k−j)!) ) λ_i^{k−j} N_i^j

and

‖J_i^k‖ ≤ Σ_{j=0}^{min{k,n_i}} ( k! / (j!(k−j)!) ) |λ_i|^{k−j} ‖N_i‖^j. (2.35)
For j = 1, . . . , k we have

k! / (j!(k−j)!) = k(k−1) · · · (k−j+1) / j! ≤ k(k−1) · · · (k−j+1) ≤ k^j.

Consequently,

1 ≤ k! / (j!(k−j)!) ≤ k^j for j = 0, . . . , k.
If |λ_i| < 1, then

‖J_i^k‖ ≤ Σ_{j=0}^{min{k,n_i}} ( k! / (j!(k−j)!) ) |λ_i|^{k−j} ‖N_i‖^j ≤ k^{n_i} |λ_i|^k Σ_{j=0}^{n_i} |λ_i|^{−j} ‖N_i‖^j → 0 (k → ∞),

since k^{n_i} |λ_i|^k → 0 (k → ∞).
This allows us to extend the arguments used to establish Theorem 2.5.1. The equations (2.21) for
the error imply

U^{-1} e^{(k)} = J U^{-1} e^{(k−1)} = J^k U^{-1} e^{(0)}. (2.36)

If we define z^{(k)} = U^{-1} e^{(k)}, then (2.36) reads

z_i^{(k)} = J_i^k z_i^{(0)}, i = 1, . . . , ℓ,

where now z_i^{(k)} ∈ R^{n_i}, i = 1, . . . , ℓ, are subvectors of z^{(k)} corresponding to the Jordan blocks. We
leave the careful proof of the following theorem as an exercise.
Theorem 2.5.5 Let G be a square matrix. There exists a unique fixed point x (∗) of x = Gx + f and
the iteration (2.26) converges to x (∗) for any initial vector x (0) if and only if ρ(G) < 1.
Theorem 2.5.6 If G is a square matrix, then for any matrix norm ‖·‖ we have (2.31), i.e.,
lim_{k→∞} ‖G^k‖^{1/k} = ρ(G).
Figure 2.2: Left plot: Convergence of the iterates e^{(k+1)} = G e^{(k)} for G given by (2.39) and initial
iterate e^{(0)} = (1, 1)^T. Right plot: The average convergence factor ‖G^k‖^{1/k} for G given by (2.39).
The red line indicates ρ(G).
Asymptotically, the smaller the spectral radius, the faster the linear fixed point iteration converges.
However, the spectral radius only describes the asymptotic convergence behavior for
sufficiently large iteration numbers. For non-normal matrices, and especially for non-diagonalizable matrices,
there can be significant transient effects, such as the one observed in the convergence of the error
e^{(k)} in Figure 2.2. This is studied in more detail in the book by Trefethen and Embree [TE05].
Remark 2.5.8 As we have seen in Example 2.5.7 for non-diagonalizable matrices, kG k k 1/k ≈
ρ(G) only for (potentially very) large k. Hence for non-normal matrices and especially for non-
diagonalizable matrices the estimate (2.33) for the number of iterations required to reduce the error
by a factor 10−d may be way too optimistic! For instance, for the matrix in Example 2.5.7, k=33
iterations are required to achieve ke (k) k2 /ke (0) k2 ≤ 10−2 , but −2/ log10 (0.75) < 17.
‖A‖_T ≤ ρ(A) + ε.

Proof: For a nonsingular matrix T define the vector norm ‖x‖_T = ‖Tx‖_∞. The induced matrix norm satisfies

‖A‖_T = max_{x≠0} ‖Ax‖_T/‖x‖_T = max_{x≠0} ‖TAx‖_∞/‖Tx‖_∞ = max_{y≠0} ‖TAT^{-1}y‖_∞/‖y‖_∞ = ‖TAT^{-1}‖_∞.
(b) From the Jordan canonical form (2.34) of a matrix A we find a nonsingular matrix U such
that

B = U A U^{-1} = \begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1,n−1} & b_{1n} \\
0 & b_{22} & \cdots & b_{2,n−1} & b_{2n} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & b_{n−1,n−1} & b_{n−1,n} \\
0 & 0 & \cdots & 0 & b_{nn}
\end{pmatrix},

where the diagonal entries b_{ii}, i = 1, . . . , n, are the eigenvalues of A. (In fact more can be said
about the entries of B, but this is not necessary for our purposes.) Given δ > 0 define
D = diag(δ^{-1}, δ^{-2}, . . . , δ^{-n}). Then

(D B D^{-1})_{ij} = 0 if i > j,   (D B D^{-1})_{ii} = b_{ii},   (D B D^{-1})_{ij} = b_{ij} δ^{j−i} if i < j.

Consequently,

‖D B D^{-1}‖_∞ = max_i Σ_{j=1}^{n} |(D B D^{-1})_{ij}| ≤ max_i ( |b_{ii}| + n max_{j>i} |b_{ij}| δ^{j−i} ).
Corollary 2.5.10 Let G be a square matrix. There exists a matrix (operator) norm k · k such that
kGk < 1 if and only if ρ(G) < 1.
Proof: Assume there exists a matrix (operator) norm ‖·‖ such that ‖G‖ < 1. For any eigenvalue
λ of G and corresponding eigenvector v, Gv = λv. Taking norms gives |λ| ‖v‖ = ‖Gv‖ ≤ ‖G‖ ‖v‖,
i.e., |λ| ≤ ‖G‖ < 1, and hence ρ(G) < 1. Conversely, if ρ(G) < 1, the preceding result yields, for
ε = (1 − ρ(G))/2 > 0, a matrix norm with ‖G‖_T ≤ ρ(G) + ε < 1.
If π is a permutation of {1, . . . , n} and Π is the corresponding permutation matrix with entries
Π_{ij} = 1 if j = π(i) and Π_{ij} = 0 else, then

(Π A Π^T)_{i,j} = A_{π(i),π(j)}, i, j = 1, . . . , n. (2.40)
For the specific permutation π(1) = n, π(2) = n − 1, . . . , π(n) = 1, the corresponding permutation
matrix is

Π = \begin{pmatrix} & & 1 \\ & \iddots & \\ 1 & & \end{pmatrix} (2.41)
and the (pointwise) forward Gauss-Seidel method applied to Π A Π^T Π x = Π b is equivalent to the
(pointwise) backward Gauss-Seidel method applied to Ax = b. Therefore, the conditions on the
matrix in this section imply convergence of the forward Gauss-Seidel method and the backward
Gauss-Seidel method, as well as the Gauss-Seidel method for any symmetric reordering of the
equations and unknowns.
Definition 2.6.1 A square matrix A is said to be reducible, if there exists a permutation matrix P
such that P APT is block upper triangular, i.e.,
P A P^T = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix}

with square blocks A_{11}, A_{22}.
The irreducibility of a matrix can often be tested using the directed graph G(A) associated with
the matrix A. A directed graph consists of vertices and directed edges. The directed graph G(A)
associated with the matrix A ∈ R^{n×n} consists of n vertices labeled 1, . . . , n, and there is an oriented
edge from vertex i to vertex j if and only if a_{ij} ≠ 0. It can be shown that A is irreducible if and only if
the graph G(A) is connected in the sense that for each pair of vertices i and j there is an oriented
path from i to j, that is, there exist vertices i = i₀, i₁, i₂, . . . , i_k = j such that G(A) contains the
oriented edge (i_{ℓ−1}, i_ℓ), ℓ = 1, . . . , k.
The directed graphs associated with

A₁ = \begin{pmatrix} −1 & −2 & 0 \\ 0 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix},   A₂ = \begin{pmatrix} 2 & −1 & 0 \\ −1 & 2 & −1 \\ 0 & −1 & 2 \end{pmatrix}

are shown in Figure 2.3.
Figure 2.3: Directed graphs associated with A1 (left plot) and A2 (right plot).
The graph associated with A1 has no directed path from vertex 1 to vertex 3 (or from vertex 2
to vertex 3). Therefore the matrix A1 is reducible. In fact if we use the permutation
P = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix},

then

P^{-1} = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}   and   P A₁ P^{-1} = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 1 & 0 \\ 0 & −2 & −1 \end{pmatrix}.
The graph associated with A₂ has a directed path from every vertex i to every vertex j. Therefore the matrix
A₂ is irreducible.
Theorem 2.6.5 If the square matrix A is strictly row-wise (or column-wise) diagonally dominant
or if it is irreducibly row-wise (or column-wise) diagonally dominant, then A is nonsingular.
Proof: (a) Let A be strictly row-wise diagonally dominant. Suppose there exists an x ≠ 0 with
Ax = 0. The ith equation in this system implies

|a_{ii}| |x_i| = | − Σ_{j=1,j≠i}^{n} a_{ij} x_j | ≤ Σ_{j=1,j≠i}^{n} |a_{ij}| |x_j|.

Choosing i with |x_i| = ‖x‖_∞ > 0 and dividing by |x_i| gives |a_{ii}| ≤ Σ_{j≠i} |a_{ij}|, which contradicts the
strict row-wise diagonal dominance. Hence A is nonsingular.
(b) Now let A be irreducibly row-wise diagonally dominant and suppose there exists x ≠ 0 with
Ax = 0, scaled so that ‖x‖_∞ = 1. For any index i with |x_i| = 1 the ith equation and the diagonal
dominance force
|a_{ii}| = |a_{ii}| |x_i| = | Σ_{j≠i} a_{ij} x_j | ≤ Σ_{j≠i} |a_{ij}| |x_j| ≤ ( Σ_{j≠i} |a_{ij}| ) |x_i| ≤ |a_{ii}| |x_i| = |a_{ii}|, (2.42)

so equality must hold throughout.
These equalities imply that for any ℓ such that a_{iℓ} ≠ 0, the corresponding component of x must
satisfy |x_ℓ| = 1 = ‖x‖_∞.
Now let ν be an index such that

Σ_{j=1,j≠ν}^{n} |a_{νj}| < |a_{νν}|. (2.43)

Since A is irreducible, there is an oriented path i₁, i₂, . . . in G(A) from an index i₁ with |x_{i₁}| = 1 = ‖x‖_∞
to ν. As above,

|a_{i₁i₁}| = |a_{i₁i₁}| |x_{i₁}| ≤ Σ_{j≠i₁} |a_{i₁j}| |x_j| ≤ ( Σ_{j≠i₁} |a_{i₁j}| ) |x_{i₁}| ≤ |a_{i₁i₁}| |x_{i₁}| = |a_{i₁i₁}|,

so again equality holds throughout. Since a_{i₁i₂} ≠ 0, we have |x_{i₂}| = 1. We can now repeat the same
argument to show that |x_{i₂}| = . . . = |x_ν| = 1. Since for row ν the inequality (2.43) holds, we have

|a_{νν}| |x_ν| ≤ Σ_{j≠ν} |a_{νj}| |x_j| ≤ ( Σ_{j≠ν} |a_{νj}| ) |x_ν| < |a_{νν}| |x_ν|,
a contradiction. Hence there is no x , 0 such that Ax = 0, which means that A is nonsingular.
(c) If A is strictly column-wise diagonally dominant, then (a) implies that AT is nonsingular.
Hence A is nonsingular.
(d) The nonsingularity of A follows from application of (b) to AT .
Theorem 2.6.6 If the square matrix A is strictly row-wise (or column-wise) diagonally dominant
or if it is irreducibly row-wise (or column-wise) diagonally dominant, then the pointwise Jacobi
method converges for any x (0) .
Proof: (a) Let A be strictly row-wise diagonally dominant. The iteration matrix for the (pointwise)
Jacobi method is given by
G = D−1 (E + F).
Let λ be an eigenvalue of G and let v be a corresponding eigenvector with |v_m| = 1 and |v_i| ≤ 1,
i ≠ m. The mth equation in λv = Gv and the row-wise diagonal dominance of A imply

|λ| = | Σ_{i=1,i≠m}^{n} (a_{mi}/a_{mm}) v_i | ≤ Σ_{i=1,i≠m}^{n} |a_{mi}/a_{mm}| |v_i| < 1.
Hence ρ(G) < 1 and the assertion follows from Theorem 2.5.5.
(b) If A is irreducibly row-wise diagonally dominant, then one can show as in part (a) that every
eigenvalue λ of G satisfies
|λ| ≤ 1.
Suppose that ρ(G) = 1, i.e., that there exists an eigenvalue λ of G with |λ| = 1. In this case
λ D − E − F is singular. However, since |λ| = 1 the matrix λ D − E − F is also irreducibly row-wise
diagonally dominant. This contradicts Theorem 2.6.5. Hence ρ(G) < 1 holds.
(c) Let A be strictly column-wise diagonally dominant. By part (a) the (pointwise) Jacobi
iteration matrix for A^T, which is given by G̃ = D^{-1}(E^T + F^T), satisfies ρ(G̃) < 1. The matrices
D^{-1}(E^T + F^T) and (E^T + F^T)D^{-1} have the same eigenvalues, and the matrices (E^T + F^T)D^{-1} and
((E^T + F^T)D^{-1})^T = D^{-1}(E + F) = G have the same eigenvalues. Hence ρ(G) = ρ(G̃) < 1.
Theorem 2.6.7 If the square matrix A is strictly row-wise (column-wise) diagonally dominant
or if it is irreducibly row-wise (column-wise) diagonally dominant, then the pointwise (forward)
Gauss-Seidel method converges for any symmetric reordering of the equations and unknowns for
any starting value.
Proof: First we show that pointwise forward Gauss-Seidel method applied to Ax = b converges
for any starting value provided that A is strictly row-wise diagonally dominant or if it is irreducibly
row-wise diagonally dominant. Since for any permutation matrix Π, Π AΠT is strictly row-wise
diagonally dominant (irreducibly row-wise diagonally dominant) if and only if A is strictly row-
wise diagonally dominant (irreducibly row-wise diagonally dominant), this implies convergence
of pointwise forward Gauss-Seidel method for any symmetric reordering of the equations and
unknowns.
(a) Let A be strictly row-wise diagonally dominant. The iteration matrix for the (pointwise)
forward Gauss-Seidel method is given by
G = (D − E) −1 F.
Let λ be an eigenvalue of G and let v be a corresponding eigenvector with ‖v‖_∞ = 1. Let the index m
be such that |v_m| = 1 and |v_i| ≤ 1, i ≠ m. The mth equation in λv = Gv is equivalent to
λ Σ_{i≤m} a_{mi} v_i = − Σ_{i>m} a_{mi} v_i

and implies

|λ| = | Σ_{i>m} a_{mi} v_i | / | a_{mm} v_m + Σ_{i<m} a_{mi} v_i | ≤ ( Σ_{i>m} |a_{mi}| |v_i| ) / ( |a_{mm}| − Σ_{i<m} |a_{mi}| |v_i| ).
The last term is of the form c1 /(d − c2 ) with c1, c2 ≥ 0, d > 0, and d − c1 − c2 > 0 . (For the latter
inequality we use that A is strictly row-wise diagonally dominant, |vm | = 1 and |vi | ≤ 1, i , m.)
Since c1 /(d − c2 ) = c1 /(c1 + (d − c1 − c2 )) < 1, we have
|λ| < 1.
Hence all eigenvalues λ of the (pointwise) forward Gauss-Seidel iteration matrix satisfy |λ| < 1.
Consequently ρ(G) < 1.
(b) If A is irreducibly row-wise diagonally dominant, then one can show as in part (a) that every
eigenvalue λ of G satisfies
|λ| ≤ 1.
Suppose that ρ(G) = 1, i.e., that there exists an eigenvalue λ of G with |λ| = 1. In this case
λ(D − E) − F is singular. However, since |λ| = 1 the matrix λ(D − E) − F is also irreducibly
row-wise diagonally dominant. This contradicts Theorem 2.6.5. Hence ρ(G) < 1 holds.
(c) Now let A be strictly column-wise diagonally dominant or irreducibly column-wise diagonally
dominant. Then A^T = D − E^T − F^T is strictly row-wise diagonally dominant or irreducibly
row-wise diagonally dominant. By parts (a) and (b), the (pointwise) backward Gauss-Seidel iteration
for A^T converges. Since the (pointwise) backward Gauss-Seidel iteration matrix for A^T is
(D − E^T)^{-1} F^T, we have ρ((D − E^T)^{-1} F^T) < 1.
The iteration matrix for the (pointwise) forward Gauss-Seidel iteration for A is G = (D − E)^{-1} F.
We have

(D − E^T)^{-1} G^T (D − E^T) = (D − E^T)^{-1} F^T (D − E^T)^{-1} (D − E^T) = (D − E^T)^{-1} F^T,

so ρ(G) = ρ(G^T) = ρ((D − E^T)^{-1} F^T) < 1.
If the splitting (2.12) is used we obtain the pointwise forward SOR method. If the splitting (2.19)
is used we have the block forward SOR method. The forward SOR iteration matrix is given by

G_ω = (D − ωE)^{-1} ( ωF + (1 − ω)D ). (2.44)

Theorem 2.6.8 The iteration matrix G_ω of the (pointwise or block) forward SOR method satisfies
ρ(G_ω) ≥ |1 − ω|.
Proof: We use the following two properties of the determinant. The determinant of the product
of two square matrices is the product of the determinants of the matrices. The determinant of a
square matrix is the product of the eigenvalues of the matrix.
Since

G_ω = (D − ωE)^{-1} ( ωF + (1 − ω)D ) = (I − ωD^{-1}E)^{-1} ( ωD^{-1}F + (1 − ω)I )

and since (I − ωD^{-1}E)^{-1} is a lower triangular matrix with ones on the diagonal and ωD^{-1}F is a
strict upper (block) triangular matrix, we have

| Π_{i=1}^{n} λ_i | = |det(G_ω)| = |det( (I − ωD^{-1}E)^{-1} )| |det( ωD^{-1}F + (1 − ω)I )| = |1 − ω|^n.

Hence ρ(G_ω) = max_i |λ_i| ≥ |1 − ω|.
Corollary 2.6.9 If the (pointwise or block) forward SOR method converges for any initial vector,
then ω ∈ (0, 2).
Proof: If the (pointwise or block) forward SOR method converges for any initial vector, then
ρ(G_ω) < 1. By Theorem 2.6.8, |1 − ω| ≤ ρ(G_ω) < 1, which implies ω ∈ (0, 2).
Theorem 2.6.10 If A ∈ C^{n×n} is Hermitian positive definite, then the (pointwise or block) forward
SOR method converges for all ω ∈ (0, 2).
Proof: We have

G_ω = (D − ωE)^{-1} ( ωF + (1 − ω)D ) = I − ( (1/ω)(D − ωE) )^{-1} A = I − M_ω^{-1} A,

where

M_ω = (1/ω)(D − ωE) = (1/ω) D − E.

Let λ ∈ C be an eigenvalue of G_ω with corresponding eigenvector v ∈ C^n. Then

A v = (1 − λ) M_ω v.

Since A is positive definite, v^* A v > 0 and therefore λ ≠ 1. Multiplying by v^* from the left gives
1/(1 − λ) = v^* M_ω v / (v^* A v), and hence

2 Re( 1/(1 − λ) ) = 1/(1 − λ) + conj( 1/(1 − λ) ) = v^* M_ω v / (v^* A v) + v^* M_ω^* v / (v^* A v) = v^* (M_ω + M_ω^*) v / (v^* A v).
Moreover, M_ω + M_ω^* = (2/ω) D − (E + E^*) = (2/ω − 1) D + A, and therefore

2 Re( 1/(1 − λ) ) = v^* (M_ω + M_ω^*) v / (v^* A v) = 1 + (2/ω − 1) (v^* D v) / (v^* A v).

If A ∈ C^{n×n} is Hermitian positive definite, its (block) diagonal D is Hermitian positive definite.
Hence

2 Re( 1/(1 − λ) ) = 1 + (2/ω − 1) (v^* D v) / (v^* A v) > 1.
If we set λ = α + i β, then
1 2(1 − α)
1 < 2 Re = ,
1 − λ (1 − α) 2 + β 2
which implies
|λ| 2 = α 2 + β 2 < 1.
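The statement of Theorem 2.6.10 can be observed numerically. The following Matlab lines are not part of the original notes; they sweep ω over (0, 2) for a symmetric positive definite tridiagonal matrix and compute ρ(Gω).

% Not from the original notes: rho(G_omega) of the forward SOR iteration (2.44)
% for a symmetric positive definite tridiagonal matrix; all values are < 1.
n = 20;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
D = diag(diag(A));  E = -tril(A,-1);  F = -triu(A,1);
omega = 0.05:0.05:1.95;
rho = zeros(size(omega));
for i = 1:numel(omega)
    w = omega(i);
    Gw = (D - w*E) \ (w*F + (1-w)*D);
    rho(i) = max(abs(eig(Gw)));
end
plot(omega, rho); xlabel('\omega'); ylabel('\rho(G_\omega)');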
Remark 2.6.11 Since the (pointwise or block) forward SOR method with ω = 1 is the (pointwise or
block) forward Gauss-Seidel method, Theorem 2.6.10 shows that the forward Gauss-Seidel method
applied to a Hermitian positive definite system converges.
Theorem 2.6.12 If A ∈ Rn×n and 2D − A are symmetric positive definite, then the (pointwise or
block) Jacobi method converges for all initial iterates.
and P(αD^{-1}E + α^{-1}D^{-1}F)P^T have the same eigenvalues. Therefore, A is consistently ordered if
and only if PAP^T is consistently ordered.
Let G_J = D^{-1}(E + F) and
\[
G_\omega = (D - \omega E)^{-1} \bigl( \omega F + (1-\omega) D \bigr)
\]
be the iteration matrices of the pointwise [block] Jacobi and forward SOR method, respectively.
Then:
i. With μ, also −μ is an eigenvalue of G_J.
ii. If µ is an eigenvalue of G J and
(λ + ω − 1) 2 = λω2 µ2, (2.47)
then λ is an eigenvalue of Gω .
\[
\rho(G_{GS}) = \rho(G_J)^2 .
\]
Proof: By Theorem 2.6.17 the eigenvalues µ of G J and the eigenvalues λ of GGS = G1 obey
λ = µ2 .
and
\[
\rho(G_{\omega_{\mathrm{opt}}}) = \left( \frac{\rho(G_J)}{1 + \sqrt{1 - \rho(G_J)^2}} \right)^{2},
\qquad
\omega_{\mathrm{opt}} = \frac{2}{1 + \sqrt{1 - \rho(G_J)^2}} \in (1, 2) .
\]
In Remark 2.5.4 we have discussed the relation between the spectral radius ρ(G) of a basic
iterative method and the number of iterations k̄ needed to reduce the size of the initial error
e^{(0)} = x^{(0)} − x^∗ by a factor 10^{−d}. An estimate is
\[
\bar k \approx \frac{d}{-\log_{10} \rho(G)} .
\]
Note that this tends to be a good estimate for unitarily diagonalizable G, but can be too optimistic
otherwise. Table 2.2 shows the number of iterations needed to reduce the initial error by a factor
10−2 with the Jacobi method, the Gauss-Seidel method or the SOR method for various ρ(G J ).
Table 2.2: Estimated number of Jacobi iterations, Gauss-Seidel iterations or SOR iterations that
need to be executed to reduce the initial error by a factor 10−2 for various spectral radii of G J .
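The numbers behind a table of this type can be regenerated with a few Matlab lines. The snippet below is not part of the original notes; it uses the estimate k ≈ d/(−log₁₀ ρ) with d = 2 and the relations ρ(G_GS) = ρ(G_J)² and ρ(G_ωopt) = ωopt − 1 stated above.

% Not from the original notes: estimated iteration counts for error reduction 1e-2.
d     = 2;
rhoJ  = [0.9 0.99 0.999];
rhoGS = rhoJ.^2;
wopt  = 2 ./ (1 + sqrt(1 - rhoJ.^2));
rhoSOR = wopt - 1;                         % = (rhoJ./(1+sqrt(1-rhoJ.^2))).^2
its = @(r) ceil(d ./ (-log10(r)));
disp([rhoJ' its(rhoJ)' its(rhoGS)' its(rhoSOR)'])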
In Remark 2.6.14 we have established that A is consistently ordered if and only if P APT
is consistently ordered, where P is a permutation matrix. Furthermore, the eigenvalues of
the Jacobi iteration matrix D−1 (E + F) corresponding to A and of the Jacobi iteration matrix
(PDP^T)^{-1}(PEP^T + PFP^T) = PD^{-1}(E + F)P^T corresponding to PAP^T are identical. Thus, Theorem 2.6.19 remains valid for any symmetrically permuted system PAP^T(Px) = Pb of the system Ax = b. In
particular, if P is the permutation (2.41), then Theorem 2.6.19 implies convergence of the pointwise
backward SOR method.
2.6.5. M-Matrices
M-matrices play a role in the discretization of partial differential equations, see Section 2.7.
2. A is nonsingular, and
Proof: Suppose there exists a non-positive diagonal entry a_{ii} ≤ 0. Let A_i be the ith column
of A. By properties 1 and 3 of the M-matrix A, A−1 Ai ≤ 0, which contradicts A−1 Ai = ei ,
where ei is the i-th unit vector.
ii. Let Π be a permutation matrix. The matrix A is an M-matrix if and only if Π AΠT is an
M-matrix.
Proof: Equation (2.40) shows that the diagonal entries of Π AΠT are equal to the diagonal
entries of A (but reordered according to π). Moreover, the entries in the π(i)-th row of Π AΠT
are equal to the entries in the i-th row of A.
The four conditions in our definition of an M-matrix are redundant. We refer to the literature
(e.g., [Axe94, Hac94, Saa03, Var00, You71]) for a complete discussion of M-matrices. The
following result shows the connection between M-matrices and the convergence of the (pointwise)
Jacobi method.
2. a_{ij} ≤ 0 for i ≠ j, i, j = 1, . . . , n,
then A is an M-matrix if and only if ρ(G J ) < 1, where G J = I − D−1 A and D is the diagonal of A.
Since I − G J = D −1 A, A is invertible if and only if ρ(G J ) < 1. Hence property 3 in the definition
of an M-matrix is satisfied if and only if ρ(G J ) < 1.
The properties 1 and 2 imply that all entries of G_J are non-negative. Hence all entries of
G_J^j, j = 0, 1, \ldots, and of \sum_{j=0}^{k} G_J^j are non-negative. By (2.49) all entries of (I − G_J)^{-1} = A^{-1}D are
non-negative. Since the diagonal entries of D are positive, all entries of A^{-1} are non-negative.
The following result establishes the relationship between the convergence of the Jacobi and the
Gauss-Seidel method for matrices A = D − E − F for which all entries of D−1 E and D −1 F are
non-negative. In particular, if the matrix entries satisfy a_{ii} > 0 and a_{ij} ≤ 0 for i ≠ j, then the entries of D^{-1}E and D^{-1}F are non-negative. Let Π be a permutation matrix, let
\[
\widehat A = \Pi A \Pi^T
\]
and let \widehat A = \widehat D − \widehat E − \widehat F with diagonal matrix \widehat D, strict lower triangular matrix −\widehat E, and strict upper
triangular matrix −\widehat F. Then the entries of D^{-1}E and D^{-1}F are non-negative if and only if the entries
of \widehat D^{-1}\widehat E and \widehat D^{-1}\widehat F are non-negative.
Theorem 2.6.23 (Stein-Rosenberg) Let A = D − E − F. If all entries of D−1 E and D−1 F are
non-negative, then exactly one of the following alternatives (2.50a)–(2.50d) holds for the Jacobi iteration
and Gauss-Seidel iteration with any symmetric re-ordering:
For a proof see, e.g., Varga [Var00] or the original paper [SR48].
\[
h^{-2}
\begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
& \ddots & \ddots & \ddots & \\
& & -1 & 2 & -1 \\
& & & -1 & 2
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{pmatrix}
=
\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_{n-1}) \\ f(x_n) \end{pmatrix}. \qquad (2.52)
\]
The solution y_1, \ldots, y_n of (2.52) is an approximation of the solution u of (2.51) at the points x_i = ih, i = 1, \ldots, n, with h = 1/(n+1). The eigenvalues of a tridiagonal Toeplitz matrix of the form
\[
\begin{pmatrix}
\alpha_1 & \alpha_2 & & & \\
\alpha_2 & \alpha_1 & \alpha_2 & & \\
& \ddots & \ddots & \ddots & \\
& & \alpha_2 & \alpha_1 & \alpha_2 \\
& & & \alpha_2 & \alpha_1
\end{pmatrix} \in \mathbb{R}^{n \times n} \qquad (2.53)
\]
are given by
\[
\lambda_j = \alpha_1 + 2 \alpha_2 \cos\Bigl( \frac{j\pi}{n+1} \Bigr), \qquad j = 1, \ldots, n,
\]
and
\[
v_j = \sqrt{\tfrac{2}{n+1}} \bigl( \sin(j\pi x_1), \sin(j\pi x_2), \ldots, \sin(j\pi x_n) \bigr)^T
\]
is an eigenvector associated with the eigenvalue λ_j. See Problem 1.2 and also Iserles [Ise96,
pp. 197–203].
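The eigenvalue formula for (2.53) is easy to verify numerically. The following lines are not part of the original notes and use the example α₁ = 2, α₂ = −1.

% Not from the original notes: check of the eigenvalue formula for (2.53).
n  = 10;  a1 = 2;  a2 = -1;
T  = a1*eye(n) + a2*diag(ones(n-1,1),1) + a2*diag(ones(n-1,1),-1);
lam_numeric = sort(eig(T));
lam_formula = sort(a1 + 2*a2*cos((1:n)'*pi/(n+1)));
disp(max(abs(lam_numeric - lam_formula)))   % round-off level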
Since the Jacobi iteration matrix G J = D−1 (E + F) corresponding to (2.52) is
\[
G_J = \frac{1}{2}
\begin{pmatrix}
0 & 1 & & & \\
1 & 0 & 1 & & \\
& \ddots & \ddots & \ddots & \\
& & 1 & 0 & 1 \\
& & & 1 & 0
\end{pmatrix}, \qquad (2.54)
\]
the eigenvalues and corresponding eigenvectors of the Jacobi iteration matrix (2.54) are
\[
\lambda_j = \cos\Bigl( \frac{j\pi}{n+1} \Bigr), \qquad (2.55a)
\]
\[
v_j = \sqrt{\tfrac{2}{n+1}} \bigl( \sin(j\pi x_1), \sin(j\pi x_2), \ldots, \sin(j\pi x_n) \bigr)^T \qquad (2.55b)
\]
for j = 1, \ldots, n. The spectral radius of the Jacobi iteration matrix (2.54) is
\[
\rho(G_J) = \cos\Bigl( \frac{\pi}{n+1} \Bigr) \approx 1 - \frac{\pi^2}{2(n+1)^2} .
\]
Next, we consider the Gauss-Seidel iteration matrix GGS = (D − E) −1 F. Our analysis follows
the paper by Kohaupt [Koh98]. See also Iserles [Ise96, pp. 200–203]. The Gauss-Seidel iteration
matrix can also be written as G_{GS} = (I − D^{-1}E)^{-1} D^{-1}F, and for (2.52)
\[
D^{-1} E = \frac{1}{2}
\begin{pmatrix}
0 & & & \\
1 & 0 & & \\
& \ddots & \ddots & \\
& & 1 & 0
\end{pmatrix},
\qquad
D^{-1} F = \frac{1}{2}
\begin{pmatrix}
0 & 1 & & \\
& 0 & 1 & \\
& & \ddots & \ddots \\
& & & 0
\end{pmatrix}.
\]
Note that
\[
(D^{-1} E)^2 = \Bigl( \tfrac12 \Bigr)^2
\begin{pmatrix}
0 & & & & \\
0 & 0 & & & \\
1 & 0 & 0 & & \\
& \ddots & \ddots & \ddots & \\
& & 1 & 0 & 0
\end{pmatrix},
\quad \ldots, \quad
(D^{-1} E)^{n-1} = \Bigl( \tfrac12 \Bigr)^{n-1}
\begin{pmatrix}
0 & & & \\
\vdots & \ddots & & \\
0 & & \ddots & \\
1 & 0 & \cdots & 0
\end{pmatrix},
\]
and (D^{-1}E)^n = 0. Therefore, (I − D^{-1}E)^{-1} = \sum_{j=0}^{n-1} (D^{-1}E)^j, and, with β = 1/2,
\[
G_{GS} = (I - D^{-1} E)^{-1} D^{-1} F =
\begin{pmatrix}
0 & \beta & 0 & \cdots & 0 & 0 \\
0 & \beta^2 & \beta & & & 0 \\
0 & \beta^3 & \beta^2 & \beta & & \\
\vdots & \vdots & & \ddots & \ddots & 0 \\
0 & \beta^{n-1} & \beta^{n-2} & \cdots & \beta^2 & \beta \\
0 & \beta^{n} & \beta^{n-1} & \cdots & \beta^3 & \beta^2
\end{pmatrix}.
\]
This matrix is not normal. In fact, (G_{GS}^T G_{GS})_{11} = 0, but (G_{GS} G_{GS}^T)_{11} = \beta^2 = 2^{-2}.
The eigenvalues of G_{GS} are λ_0 = 0 and
\[
\lambda_j = \cos^2\Bigl( \frac{j\pi}{n+1} \Bigr), \qquad j = 1, \ldots, \lfloor n/2 \rfloor .
\]
The eigenspace associated with λ_0 = 0 is one-dimensional and spanned by v_0 = (1, 0, \ldots, 0)^T. The
eigenvectors associated with the other eigenvalues λ_j, j = 1, \ldots, \lfloor n/2 \rfloor, are
\[
v_j = \Bigl( \sqrt{\lambda_j}\, \sin(j\pi x_1), \bigl(\sqrt{\lambda_j}\bigr)^2 \sin(j\pi x_2), \ldots, \bigl(\sqrt{\lambda_j}\bigr)^n \sin(j\pi x_n) \Bigr)^T .
\]
See Kohaupt [Koh98]. The spectral radius of the Gauss-Seidel iteration matrix is
\[
\rho(G_{GS}) = \cos^2\Bigl( \frac{\pi}{n+1} \Bigr) = \bigl( \rho(G_J) \bigr)^2 \approx \Bigl( 1 - \frac{\pi^2}{2(n+1)^2} \Bigr)^2 .
\]
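The relation ρ(G_GS) = ρ(G_J)² for the model matrix (2.52) can be checked numerically. The lines below are not part of the original notes.

% Not from the original notes: rho(G_J) and rho(G_GS) for the matrix in (2.52).
n = 20;  h = 1/(n+1);
A = (1/h^2)*(2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1));
D = diag(diag(A));  E = -tril(A,-1);  F = -triu(A,1);
GJ  = D \ (E+F);   rhoJ  = max(abs(eig(GJ)));
GGS = (D-E) \ F;   rhoGS = max(abs(eig(GGS)));
fprintf('rho(G_J)  = %.6f   cos(pi/(n+1)) = %.6f\n', rhoJ, cos(pi/(n+1)));
fprintf('rho(G_GS) = %.6f   rho(G_J)^2    = %.6f\n', rhoGS, rhoJ^2);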
The central difference approximation (1.15) of the advection term c y'(x) leads to the tridiagonal
linear system
\[
\frac{1}{h^2}
\begin{pmatrix}
2\epsilon + h^2 r & -(\epsilon - \tfrac{h}{2} c) & & & \\
-(\epsilon + \tfrac{h}{2} c) & 2\epsilon + h^2 r & -(\epsilon - \tfrac{h}{2} c) & & \\
& \ddots & \ddots & \ddots & \\
& & -(\epsilon + \tfrac{h}{2} c) & 2\epsilon + h^2 r & -(\epsilon - \tfrac{h}{2} c) \\
& & & -(\epsilon + \tfrac{h}{2} c) & 2\epsilon + h^2 r
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{pmatrix}
=
\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_{n-1}) \\ f(x_n) \end{pmatrix}. \qquad (2.57)
\]
We have already mentioned in Section 1.3.1 that (2.57) leads to spurious oscillations unless the
mesh size is sufficiently small and satisfies
\[
h < 2\epsilon/|c|
\]
(if uniform meshes are used). The poor behavior of this discretization scheme can be explained
by the fact that this matrix is not an M-matrix if h > 2ε/|c|. In fact, if h > 2ε/|c| we have
−(ε + (h/2)c) > 0 or −(ε − (h/2)c) > 0, depending on the sign of the advection c, and the sign condition in
Definition 2.6.20 of an M-matrix is violated. We will argue in the next paragraph that the matrix in
(2.57) is irreducibly row-wise diagonally dominant if h < 2ε/|c|. See also Stynes [Sty05, Sec. 4].
If ε ± (1/2)hc ≠ 0, the matrix is irreducible (cf. Example 2.6.2). Moreover, using Problem 1.4,
the matrix in (2.57) is irreducibly row-wise diagonally dominant provided h < 2ε/|c|. Therefore,
if h < 2ε/|c|, Theorems 2.6.6 and 2.6.7 imply that both the Jacobi and the Gauss-Seidel method
converge. By Theorem 2.6.22 the matrix in (2.57) is an M-matrix for h < 2ε/|c|. Since the matrix
in (2.57) is a tridiagonal matrix, it is consistently ordered (Theorem 2.6.16). Corollary 2.6.18 and
Theorem 2.6.19 apply. Numerical experiments for the matrix in (2.57) with ε = 10^{-2}, c = 1, r = 0
show that the spectral radii of the (pointwise) Jacobi iteration matrix G_J, the (pointwise) forward Gauss-Seidel iteration matrix G_{GS}, and the SOR iteration matrix G_ω are less than one if the mesh size satisfies h < 2ε/|c| = 0.02.
See Figure 2.4 and Table 2.3.
Next we consider the upwind discretization. Let c > 0. (The case c < 0 leads to a matrix that
has the same properties as that for c > 0.) The upwind discretization (1.21) leads to the tridiagonal
linear system
\[
\frac{1}{h^2}
\begin{pmatrix}
2\epsilon + h c + h^2 r & -\epsilon & & & \\
-(\epsilon + h c) & 2\epsilon + h c + h^2 r & -\epsilon & & \\
& \ddots & \ddots & \ddots & \\
& & -(\epsilon + h c) & 2\epsilon + h c + h^2 r & -\epsilon \\
& & & -(\epsilon + h c) & 2\epsilon + h c + h^2 r
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{pmatrix}
=
\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_{n-1}) \\ f(x_n) \end{pmatrix}. \qquad (2.58)
\]
Since ε, c > 0, the matrix in (2.58) is irreducible (cf. Example 2.6.2), and it follows from
Problem 1.4 that the matrix in (2.58) is irreducibly row-wise diagonally dominant for any mesh
size h > 0. Therefore, by Theorems 2.6.6 and 2.6.7 the (pointwise) Jacobi iteration and the
(pointwise) forward Gauss-Seidel iteration converge, and by Theorem 2.6.22 the matrix in (2.58) is
an M-matrix for any mesh size h > 0. By Theorem 2.6.16 it is consistently ordered. Corollary 2.6.18 and
Theorem 2.6.19 apply. Spectral radii for the matrix in (2.58) with ε = 10^{-2}, c = 1, r = 0 are shown
in Figure 2.4 and in Table 2.3.
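The behavior shown in Figure 2.4 can be reproduced with a few Matlab lines. The snippet below is not part of the original notes; it only computes the Jacobi spectral radii for the central difference matrix (2.57) and the upwind matrix (2.58) for a few mesh sizes.

% Not from the original notes: Jacobi spectral radii for (2.57) and (2.58),
% eps = 1e-2, c = 1, r = 0.
eps_ = 1e-2;  c = 1;  r = 0;
for n = [19 49 99 199]                       % h = 1/(n+1)
    h  = 1/(n+1);  e = ones(n,1);
    Ac = (1/h^2)*( diag((2*eps_ + h^2*r)*e) ...
                  - diag((eps_ - h*c/2)*e(1:n-1),  1) ...
                  - diag((eps_ + h*c/2)*e(1:n-1), -1) );      % central, (2.57)
    Au = (1/h^2)*( diag((2*eps_ + h*c + h^2*r)*e) ...
                  - diag(eps_*e(1:n-1),           1) ...
                  - diag((eps_ + h*c)*e(1:n-1),  -1) );       % upwind, (2.58)
    rhoc = max(abs(eig(eye(n) - diag(diag(Ac))\Ac)));
    rhou = max(abs(eig(eye(n) - diag(diag(Au))\Au)));
    fprintf('h = %6.4f   rho_J central = %6.4f   rho_J upwind = %6.4f\n', h, rhoc, rhou);
end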
Figure 2.4: Spectral radii of the (pointwise) Jacobi iteration matrix G_J, the (pointwise) forward Gauss-Seidel iteration matrix G_{GS}, and the SOR iteration matrix G_ω with ω given by (2.48) for the matrix in (2.57) with
ε = 10^{-2}, c = 1, r = 0 (left plot) and for the matrix in (2.58) with ε = 10^{-2}, c = 1, r = 0 (right
plot) for various mesh sizes h. For the central difference scheme (2.57) the spectral radii are only
less than one for sufficiently small mesh sizes. For the upwind difference scheme (2.58) the spectral
radii are less than one for all mesh sizes.
\[
q(x) \stackrel{\mathrm{def}}{=} \tfrac12 x^T A x - b^T x .
\]
Moreover, there exists a unique solution x^{(∗)} of Ax = b and, therefore, a unique minimizer of q.
Using Ax^{(∗)} = b we can show
\[
q(x) = \tfrac12 \| x - x^{(*)} \|_A^2 - \tfrac12 \| x^{(*)} \|_A^2 ,
\]
where
\[
\| x \|_A^2 \stackrel{\mathrm{def}}{=} x^T A x .
\]
Hence,
\[
q(x^{(k+1)}) - q(x^{(k)}) = \tfrac12 \| x^{(k+1)} - x^{(*)} \|_A^2 - \tfrac12 \| x^{(k)} - x^{(*)} \|_A^2 ,
\]
i.e., the difference in function values is equal to half of the difference in the squared errors, where
the error is measured in the A-norm.
We can view basic iterative methods x^{(k+1)} = (I − M^{-1}A)x^{(k)} + M^{-1}b as methods that decrease the quadratic function q.
Theorem 2.8.1 If A ∈ Rn×n is symmetric positive definite and M ∈ Rn×n is a matrix such that
M T + M − A is symmetric positive definite, then the iterates generated by x (k+1) = (I − M −1 A)x (k) +
M −1 b converge to the minimizer x (∗) of q and obey
\[
\| x^{(k+1)} - x^{(*)} \|_A^2 \le \Bigl( 1 - \frac{\theta}{\lambda_{\max}} \Bigr) \| x^{(k)} - x^{(*)} \|_A^2 ,
\]
where λ_max is the largest eigenvalue of A and θ > 0 is the smallest eigenvalue of A M^{-T}(M^T + M − A)M^{-1}A.
Proof: Recall that λ_min ‖x‖_2^2 ≤ ‖x‖_A^2 ≤ λ_max ‖x‖_2^2 for all x ∈ R^n. Equations (2.59) and (2.61)
imply that
\[
\begin{aligned}
\tfrac12 \| x^{(k+1)} - x^{(*)} \|_A^2 &= q(x^{(k+1)}) - q(x^{(*)}) \\
&= q(x^{(k)}) - q(x^{(*)}) - \tfrac12 (M^{-1} r^{(k)})^T (M^T + M - A)(M^{-1} r^{(k)}) \\
&= \tfrac12 \| x^{(k)} - x^{(*)} \|_A^2 - \tfrac12 (M^{-1} r^{(k)})^T (M^T + M - A)(M^{-1} r^{(k)}) \\
&\le \tfrac12 \| x^{(k)} - x^{(*)} \|_A^2 - \tfrac{\theta}{2} \| x^{(k)} - x^{(*)} \|_2^2 \\
&\le \tfrac12 \| x^{(k)} - x^{(*)} \|_A^2 - \tfrac{\theta}{2 \lambda_{\max}} \| x^{(k)} - x^{(*)} \|_A^2 .
\end{aligned}
\]
This implies the desired result.
Remark 2.8.2 i. Note that Theorem 2.6.12 is a special case of the previous theorem with M = D.
ii. A different proof of Theorem 2.8.1 is given in Problem 2.11. Theorem 2.8.1 describes the
convergence in terms of the A-norm of the error x (k) − x (∗) , which up to a constant is q(x (k) ), while
Problem 2.11 uses the spectral radius of the iteration matrix.
For the damped Jacobi method (2.17) we have M^T = M = ω^{-1}D and
\[
M + M^T - A = \frac{2}{\omega} D - A .
\]
Theorem 2.8.3 Let A ∈ R^{n×n} be symmetric with positive diagonal entries, and let ω > 0. The
matrix 2ω^{-1}D − A is positive definite if and only if ω satisfies
\[
0 < \omega < \frac{2}{1 - \mu_{\min}} ,
\]
where μ_min ≤ 0 is the minimum eigenvalue of I − D^{-1}A.
Proof: The matrix 2ω^{-1}D − A is positive definite if and only if
\[
H \stackrel{\mathrm{def}}{=} 2\omega^{-1} I - D^{-1/2} A D^{-1/2} = (2\omega^{-1} - 1) I + D^{1/2} (I - D^{-1} A) D^{-1/2}
\]
is positive definite. The eigenvalues of H are 2ω^{-1} − 1 + μ_i, where μ_i are the eigenvalues of I − D^{-1}A.
Since \sum_{i=1}^{n} \mu_i = \mathrm{trace}(I - D^{-1}A) = 0 and the eigenvalues μ_i are real, it follows that μ_min ≤ 0.
Therefore, H is positive definite if and only if 2ω^{-1} − 1 + μ_i > 0, i = 1, . . . , n, i.e., if and only if 0 < ω < 2/(1 − μ_min).
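The admissible ω-range of Theorem 2.8.3 is easy to verify numerically. The following Matlab lines are not part of the original notes.

% Not from the original notes: check of Theorem 2.8.3 for a small spd example.
n = 8;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);   % spd, positive diagonal
D = diag(diag(A));
mu = eig(eye(n) - D\A);  mu_min = min(real(mu));
omega_max = 2/(1 - mu_min);
for omega = [0.5*omega_max, 0.99*omega_max, 1.01*omega_max]
    H = (2/omega)*D - A;
    fprintf('omega = %.4f   min eig(2/omega*D - A) = %+.4e\n', omega, min(eig(H)));
end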
For the remainder of this section we study so-called coordinate descent methods for the min-
imization of q(x) = 12 xT Ax − bT x, where A ∈ Rn×n is symmetric positive definite. We will
show that the Jacobi and the Gauss-Seidel method applied to Ax = b are particular cases of these
coordinate descent methods. In addition, many multilevel and domain decomposition methods for
the solution of discretized partial differential equations can be interpreted as coordinate descent
methods [Xu92]. Coordinate descent methods are also used in nonlinear optimization. See, e.g.,
[BT89, Wri15].
We consider two approaches, the so-called parallel directional correction (PDC) method and
the sequential directional correction (SDC) method.
One iteration of the PDC method is given as follows.
For i = 1, \ldots, n do (in parallel)
Solve
\[
\min_{\alpha \in \mathbb{R}} q(x^{(k)} + \alpha e^{(i)}) . \qquad (2.62)
\]
end
Set x^{(k+1)} = x^{(k)} + \sum_{i=1}^{n} \alpha_i e^{(i)}, where \alpha_i denotes the solution of (2.62).
The SDC method performs the minimization along the directions e^{(i)}, i = 1, \ldots, n, sequentially. One
iteration of the SDC method is given as follows.
Set w^{(0)} = x^{(k)}.
For i = 1, \ldots, n do (sequentially)
Solve
\[
\min_{\alpha \in \mathbb{R}} q(w^{(i-1)} + \alpha e^{(i)}) \qquad (2.63)
\]
and set w^{(i)} = w^{(i-1)} + \alpha_i e^{(i)}, where \alpha_i denotes the solution of (2.63).
end
Set x^{(k+1)} = w^{(n)}.
i. For v ≠ 0 the solution α_* of \min_{\alpha \in \mathbb{R}} q(x + \alpha v) is given by
\[
\alpha_* = \frac{v^T (b - A x)}{v^T A v} .
\]
Moreover,
\[
q(x + \alpha_* v) = q(x) + \alpha_* (A x - b)^T v + \frac{\alpha_*^2}{2} v^T A v = q(x) - \frac{\alpha_*^2}{2} v^T A v < q(x),
\]
provided α_* ≠ 0.
ii. The SDC iterates satisfy q(x^{(k+1)}) ≤ q(x^{(k)}) for all k.
iii. The PDC iterates in general do not satisfy q(x^{(k+1)}) ≤ q(x^{(k)}).
Figure 2.5: One iteration of the PDC method. Note that q(x (k+1) ) > q(x (k) ).
Theorem 2.8.6 Let A ∈ R^{n×n} be a symmetric positive definite matrix and let e^{(i)}, i = 1, \ldots, n, be
linearly independent. The PDC and the SDC method are iterative methods of the form
\[
x^{(k+1)} = (I - M^{-1} A) x^{(k)} + M^{-1} b
\]
with nonsingular M, which depends on whether the PDC or the SDC method is used.
Proof: We consider the PDC method and leave the proof for the SDC method as an exercise.
The solution of (2.62) is
\[
\alpha_i = \frac{(e^{(i)})^T (b - A x^{(k)})}{(e^{(i)})^T A e^{(i)}} .
\]
Hence,
\[
\begin{aligned}
x^{(k+1)} &= x^{(k)} + \sum_{i=1}^{n} \alpha_i e^{(i)}
= x^{(k)} + \sum_{i=1}^{n} \frac{(e^{(i)})^T (b - A x^{(k)})}{(e^{(i)})^T A e^{(i)}}\, e^{(i)} \\
&= x^{(k)} + \sum_{i=1}^{n} \frac{e^{(i)} (e^{(i)})^T}{(e^{(i)})^T A e^{(i)}} (b - A x^{(k)})
= x^{(k)} + \widehat M (b - A x^{(k)}) = (I - \widehat M A) x^{(k)} + \widehat M b
\end{aligned}
\]
with
\[
\widehat M = \sum_{i=1}^{n} \frac{e^{(i)} (e^{(i)})^T}{(e^{(i)})^T A e^{(i)}} .
\]
We will show that \widehat M is invertible. This implies the assertion with M = \widehat M^{-1}.
To show that \widehat M is invertible, assume
\[
0 = \widehat M v = \sum_{i=1}^{n} \frac{e^{(i)} (e^{(i)})^T}{(e^{(i)})^T A e^{(i)}} v = \sum_{i=1}^{n} \frac{(e^{(i)})^T v}{(e^{(i)})^T A e^{(i)}}\, e^{(i)} .
\]
Since the vectors e^{(i)}, i = 1, \ldots, n, are linearly independent, this implies (e^{(i)})^T v = 0 for i = 1, \ldots, n. Because the e^{(i)} span R^n, it follows that v = 0. Hence \widehat M is invertible.
Theorem 2.8.7 Let A ∈ Rn×n be a symmetric positive definite matrix. If e (i) , i = 1, . . . , n, are the
Cartesian unit vectors, then the PDC and the SDC method are equivalent to the (pointwise) Jacobi
and the (pointwise forward) Gauss-Seidel method, respectively.
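The equivalence stated in Theorem 2.8.7 can be verified directly. The Matlab lines below are not part of the original notes; they perform one PDC step and one SDC step with the Cartesian unit vectors and compare them with one Jacobi step and one forward Gauss-Seidel step.

% Not from the original notes: PDC/SDC steps with Cartesian unit vectors
% coincide with Jacobi and forward Gauss-Seidel steps.
n = 6;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);   % spd
b = ones(n,1);  x = rand(n,1);
alpha = (b - A*x) ./ diag(A);          % PDC corrections, all from the same x
x_pdc = x + alpha;
w = x;                                 % SDC: corrections applied sequentially
for i = 1:n
    w(i) = w(i) + (b(i) - A(i,:)*w) / A(i,i);
end
x_sdc = w;
D = diag(diag(A));  E = -tril(A,-1);  F = -triu(A,1);
x_jac = D \ ((E+F)*x + b);
x_gs  = (D-E) \ (F*x + b);
disp([norm(x_pdc - x_jac), norm(x_sdc - x_gs)])   % both differences ~ 0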
Theorem 2.8.8 Let A ∈ Rn×n be a symmetric positive definite matrix. If e (i) , i = 1, . . . , n, are
linearly independent , then the SDC method converges to the unique minimizer x (∗) of q for any
initial vector x (0) .
Proof: Recall that q(x^{(k)}) ≤ q(x^{(0)}) for all k. Hence,
\[
q(x^{(0)}) \ge q(x^{(k)}) = \tfrac12 (x^{(k)})^T A x^{(k)} - b^T x^{(k)}
\ge \frac{\lambda_{\min}}{2} \| x^{(k)} \|_2^2 - \| b \|_2 \, \| x^{(k)} \|_2 \quad \text{for all } k,
\]
where λ_min > 0 is the smallest eigenvalue of A. Since \{ \frac{\lambda_{\min}}{2} \| x^{(k)} \|_2^2 - \| b \|_2 \| x^{(k)} \|_2 \} is bounded, the
sequence \{ x^{(k)} \} must be bounded.
There exists a subsequence {x (k j ) } with
lim x (k j ) = x (∗) .
j→∞
We show that x^{(∗)} is the unique minimizer of q. Using the monotonicity of the q(x^{(k)})'s and
Theorem 2.8.6 we find
\[
\alpha_i = \frac{(e^{(i)})^T (b - A x^{(*)})}{(e^{(i)})^T A e^{(i)}} = 0, \qquad i = 1, \ldots, n,
\]
(see Remark 2.8.5i), which implies Ax^{(∗)} = b. Thus the limit x^{(∗)} is the unique minimizer of q.
Finally, since \lim_{j\to\infty} q(x^{(k_j)}) = q(x^{(∗)}) and since q(x^{(k+1)}) ≤ q(x^{(k)}) for all k, we have
\[
\lim_{k\to\infty} q(x^{(k)}) = q(x^{(*)}) .
\]
The inequality (2.59) and the positive definiteness of A imply \lim_{k\to\infty} x^{(k)} = x^{(∗)}.
Remark 2.8.9 If e (i) , i = 1, . . . , n, are the Cartesian unit vectors, then Theorem 2.8.8 implies the
convergence of the Gauss-Seidel method for symmetric positive definite systems. Of course, we
already know this from Theorem 2.6.10, and our previous convergence theory also established
k x (k+1) − x (∗) k2 ≤ ρ(GGS )k x (k) − x (∗) k2 .
2.9. Problems
Problem 2.1 Let A = D − E − F, where D is the diagonal of A, −E is the strict lower triangular part
of A and −F is the strict upper triangular part of A. One iteration of the symmetric SOR (SSOR)
method for the solution of Ax = b uses one iteration of the forward SOR method followed by one
iteration of the backward SOR method:
\[
x^{(k+\frac12)} = (D - \omega E)^{-1} \bigl( [\omega F + (1-\omega) D] x^{(k)} + \omega b \bigr),
\]
\[
x^{(k+1)} = (D - \omega F)^{-1} \bigl( [\omega E + (1-\omega) D] x^{(k+\frac12)} + \omega b \bigr).
\]
i. Show that
\[
x^{(k+1)} = (I - M_{\mathrm{SSOR}}^{-1} A) x^{(k)} + M_{\mathrm{SSOR}}^{-1} b,
\]
where
\[
M_{\mathrm{SSOR}} = \frac{1}{\omega(2-\omega)} (D - \omega E) D^{-1} (D - \omega F).
\]
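The identity claimed in part i can be checked numerically before proving it. The Matlab lines below are not part of the original notes; they compare one SSOR sweep with the splitting form for a small example.

% Not from the original notes: numerical check of the SSOR splitting matrix.
n = 6;  omega = 1.3;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
b = ones(n,1);  x = rand(n,1);
D = diag(diag(A));  E = -tril(A,-1);  F = -triu(A,1);
xhalf = (D - omega*E) \ ((omega*F + (1-omega)*D)*x     + omega*b);  % forward SOR
xnew  = (D - omega*F) \ ((omega*E + (1-omega)*D)*xhalf + omega*b);  % backward SOR
M = (D - omega*E)*(D\(D - omega*F)) / (omega*(2-omega));
disp(norm(xnew - ((eye(n) - M\A)*x + M\b)))   % ~ 0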
Problem 2.2 Let G ∈ Rn×n be a square matrix. The purpose of the exercise is to show that
i. Define
\[
\binom{k}{j} = \begin{cases} \dfrac{k!}{j!\,(k-j)!} & \text{when } j = 0, \ldots, k, \\ 0 & \text{otherwise.} \end{cases}
\]
Let
\[
J_\nu = \begin{pmatrix}
\lambda_\nu & 1 & 0 & \cdots & 0 & 0 \\
0 & \lambda_\nu & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
\vdots & & & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \lambda_\nu & 1 \\
0 & 0 & 0 & \cdots & 0 & \lambda_\nu
\end{pmatrix} \in \mathbb{C}^{n_\nu \times n_\nu}
\]
be a Jordan block of order n_ν with eigenvalue λ_ν ∈ C, λ_ν ≠ 0. Show that the components of
its kth power satisfy
\[
(J_\nu^k)_{ij} = \binom{k}{j-i} \lambda_\nu^{\,k-j+i} .
\]
Show that
\[
\| J_\nu^k \|_\infty = \sum_{j=1}^{n_\nu} \bigl| (J_\nu^k)_{1j} \bigr|
\]
and
\[
\| J^k \|_\infty = \max_{\nu = 1, \ldots, \ell} \| J_\nu^k \|_\infty ,
\]
where
\[
J = \begin{pmatrix}
J_1 & 0 & \cdots & 0 & 0 \\
0 & J_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & J_{\ell-1} & 0 \\
0 & 0 & \cdots & 0 & J_\ell
\end{pmatrix} .
\]
If λ_ν ≠ 0, then parts i and iii imply
\[
\| J_\nu^k \|_\infty = \sum_{j=1}^{n_\nu} \bigl| (J_\nu^k)_{1j} \bigr|
= \sum_{j=1}^{n_\nu} \binom{k}{j-1} |\lambda_\nu|^{\,k-j+1}
= |\lambda_\nu|^k \sum_{j=1}^{n_\nu} \binom{k}{j-1} |\lambda_\nu|^{\,-j+1} .
\]
Problem 2.3 Prove Theorem 2.5.5 using the Jordan normal form of G.
Problem 2.6
i. Show that for a square matrix A
\[
\lim_{k\to\infty} A^k = 0 \iff \rho(A) < 1 .
\]
ii. Let S_k = \sum_{j=0}^{k} A^j. Show that
\[
\lim_{k\to\infty} S_k = (I - A)^{-1} \iff \rho(A) < 1 .
\]
Problem 2.7 Let A = D − E − F, where either D, −E and −F are given as in (2.12) or in (2.19).
Given ω ∈ R, the iteration matrix of the damped Jacobi method is given by G_{J,ω} = (1 − ω)I + ω G_J, where G_J = D^{-1}(E + F).
ii. Show that if all eigenvalues of G_J are real and ordered such that λ_1 ≥ . . . ≥ λ_n and if λ_1 < 1,
then the spectral radius of G_{J,ω} is minimal for
\[
\omega_{\mathrm{opt}} = \frac{2}{2 - \lambda_1 - \lambda_n} .
\]
Problem 2.8 Let A ∈ Rn×n be symmetric positive definite, b ∈ Rn , and let x ∗ be the solution of
Ax = b.
i. Show that the iteration
\[
x^{(k+1)} = x^{(k)} - \omega (A x^{(k)} - b)
\]
converges to the solution x^∗ of Ax = b for any x^{(0)} if and only if
\[
\omega \in \Bigl( 0, \frac{2}{\| A \|_2} \Bigr).
\]
Note: If we define Q(x) = \tfrac12 x^T A x - b^T x, then ∇Q(x) = Ax − b and the iteration x^{(k+1)} =
x^{(k)} − ω(Ax^{(k)} − b) = x^{(k)} − ω∇Q(x^{(k)}) is the steepest descent method for the minimization
of Q. In the context of solving symmetric positive definite linear systems, this iteration is
also known as the Richardson iteration. We will return to this method in Section 3.6.1.
ii. Now consider the iteration
\[
x^{(k+1)} = x^{(k)} - \omega (A x^{(k)} - b + e^{(k)}),
\]
where e^{(k)} ∈ R^n is an error that satisfies ‖e^{(k)}‖_2 ≤ δ for all k. Let λ_min and λ_max be the
smallest and largest eigenvalue of A, respectively. Show that if
\[
q \stackrel{\mathrm{def}}{=} \max\{ |1 - \omega \lambda_{\min}|, |1 - \omega \lambda_{\max}| \} < 1,
\]
then the errors satisfy \| x^{(k)} - x^* \|_2 \le q^k \| x^{(0)} - x^* \|_2 + \omega \delta / (1 - q) for all k.
Problem 2.9 Let A ∈ Rn×n be nonsingular and let x ∗ be the solution of Ax = b. Given the
splitting A = M − N, where M is nonsingular, we consider the basic iterative method x (k+1) =
(I − M −1 A)x (k) + M −1 b. Due to floating point errors, we can only compute
\[
\widehat x^{(k+1)} = (I - M^{-1} A) \widehat x^{(k)} + M^{-1} b + d^{(k)},
\]
where d^{(k)} ∈ R^n represents the error in the computation of (I − M^{-1}A)\widehat x^{(k)}. The error \widehat e^{(k)} = \widehat x^{(k)} − x^∗
obeys
\[
\widehat e^{(k+1)} = (I - M^{-1} A) \widehat e^{(k)} + d^{(k)} .
\]
ii. Assume that ρ(I − M^{-1}A) < 1 and ‖d^{(k)}‖_2 ≤ δ. Prove that the sequence of errors \{\widehat e^{(k)}\}
remains bounded. Find as good an upper bound as you can for \limsup_{k\to\infty} \| \widehat e^{(k)} \|_2.
Problem 2.10 (The heavy ball method [Pol64]. Taken in modified form from [Ber95, p.78].)
Let A ∈ Rn×n be symmetric positive definite with smallest and largest eigenvalue λ min and
λ max , respectively, and let b ∈ Rn . Furthermore, let x ∗ be the solution of Ax = b and consider the
iteration
x (k+1) = x (k) − α( Ax (k) − b) + β(x (k) − x (k−1) ), (2.65)
where α is a positive stepsize and β is a scalar with 0 < β < 1.
Show that the iteration (2.65) converges to x^∗ for any x^{(0)} if 0 < α < 2(1 + β)/λ_max.
Hint: Consider the iteration
\[
\begin{pmatrix} x^{(k+1)} \\ x^{(k)} \end{pmatrix}
= \begin{pmatrix} (1+\beta) I - \alpha A & -\beta I \\ I & 0 \end{pmatrix}
\begin{pmatrix} x^{(k)} \\ x^{(k-1)} \end{pmatrix}
+ \alpha \begin{pmatrix} b \\ 0 \end{pmatrix}
\]
and show that μ is an eigenvalue of the matrix in the above equation if and only if μ + β/μ is equal
to 1 + β − αλ, where λ is an eigenvalue of A.
Problem 2.11 Let A, M ∈ R^{n×n} be symmetric positive definite and consider the iteration x^{(k+1)} = (I − M^{-1}A)x^{(k)} + M^{-1}b.
ii. Show that if M − A is positive semidefinite, then all eigenvalues of I − M −1 A are contained
in the interval [0, 1). In particular, ρ(I − M −1 A) < 1.
iii. Let A = D − E − F where either D, −E and −F = −ET are given as in (2.12) or in (2.19),
and let M = D. Use part i to prove Theorem 2.6.12.
iv. Let A = D − E − F where either D, −E and −F = −ET are given as in (2.12) or in (2.19),
and let
M = (D − E)D−1 (D − ET )
be the matrix corresponding to the symmetric Gauss-Seidel method (see Problem 2.1). Show
that ρ(I − M −1 A) < 1.
Problem 2.12
i. Let B_1 ∈ R^{n_1×n_2} and B_2 ∈ R^{n_2×n_1} and consider the matrix
\[
B = \begin{pmatrix} 0 & B_1 \\ B_2 & 0 \end{pmatrix} .
\]
Such a matrix is called 2-cyclic.
– Show that the spectrum of B is
ii. Let
\[
A = \begin{pmatrix} D_1 & A_1 \\ A_2 & D_2 \end{pmatrix}
= \underbrace{\begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix}}_{=D}
+ \underbrace{\begin{pmatrix} 0 & 0 \\ A_2 & 0 \end{pmatrix}}_{=-E}
+ \underbrace{\begin{pmatrix} 0 & A_1 \\ 0 & 0 \end{pmatrix}}_{=-F} .
\]
– Show that \rho(I - D^{-1} A) = \sqrt{ \rho(D_1^{-1} A_1 D_2^{-1} A_2) }.
– Show that \rho(I - (D - E)^{-1} A) = \rho(D_1^{-1} A_1 D_2^{-1} A_2).
iii. Let D, E, F be the matrices in part ii. Show that the eigenvalues of αD−1 E + α −1 D−1 F do
not depend on α, that is, A is consistently ordered.
is a nonsingular (block) diagonal matrix and B is three-cyclic. How can the eigenvalues of
the (block) Jacobi iteration matrix be related to those of the (block) forward Gauss-Seidel
iteration matrix? How does the asymptotic convergence rate of the (block) Jacobi method
compare with that of the (block) forward Gauss-Seidel method.
iii. Answer the same questions as in ii for the case when (block) SOR replaced (block) forward
Gauss-Seidel.
\[
B = \begin{pmatrix}
0 & E_1 & & & \\
& 0 & E_2 & & \\
& & \ddots & \ddots & \\
& & & 0 & E_{p-1} \\
E_p & & & & 0
\end{pmatrix} .
\]
Problem 2.14 Let the symmetric positive definite matrix A = B + C be split into two matrices
B, C ∈ Rn×n such that B is symmetric positive definite and C is symmetric positive semidefinite.
The linear system Ax = b can be split into
i. Show that x (∗) solves Ax = b if and only if x (∗) solves (2.66a) if and only if x (∗) solves
(2.66b).
with
GADI = (C + r I) −1 (r I − B)(B + r I) −1 (r I − C).
What is d?
iii. Show that the spectral radius ρ(G_ADI) is equal to the spectral radius
\[
\rho\bigl( (rI - B)(B + rI)^{-1} (rI - C)(C + rI)^{-1} \bigr).
\]
The same arguments can be applied to show that the spectral radius of (r I − C)(C + r I) −1
satisfies
ρ (r I − C)(C + r I) −1 ≤ 1.
v. Use part iv, to show that ρ(GADI ) < 1. (Hint: For two symmetric matrices M1, M2 , we have
ρ(M1 M2 ) ≤ k M1 M2 k2 ≤ k M1 k2 k M2 k2 = ρ(M1 ) ρ(M2 ).)
vi. Assume that the matrices B and C commute, i.e., BC = CB. In this case they are simultane-
ously diagonalizable, i.e., there exists an orthogonal matrix V ∈ Rn×n and diagonal matrices
D B = diag( β1, . . . , βn ) and DC = diag(γ1, . . . , γn ) such that
B = V DBV T , and C = V DC V T .
Let x^{(∗)} solve Ax = b. Show that the error e^{(k)} = x^{(k)} − x^{(∗)} of the iteration (2.67) satisfies
\[
\| e^{(k)} \|_2 \le \max_{j=1,\ldots,n} \left| \frac{(r - \gamma_j)(r - \beta_j)}{(r + \gamma_j)(r + \beta_j)} \right|^{k} \| e^{(0)} \|_2 .
\]
Instead of choosing a fixed parameter r, we can select a different parameter r_i for the ith
iteration. In this case
\[
\| e^{(k)} \|_2 \le \max_{j=1,\ldots,n} \left| \prod_{i=1}^{k} \frac{(r_i - \gamma_j)(r_i - \beta_j)}{(r_i + \gamma_j)(r_i + \beta_j)} \right| \| e^{(0)} \|_2 .
\]
Problem 2.15 Let Ω = (0, 1) 2 with boundary ∂Ω and let γ ≥ 0. Consider the Poisson equation
−∆u(x, y) + γu(x, y) = f (x, y), (x, y) ∈ Ω (2.68a)
u(x, y) = 0, (x, y) ∈ ∂Ω. (2.68b)
The finite difference method for (2.68) with n x = n y = n and h = 1/(n + 1) leads to a system of
equations
− ui−1, j − ui, j−1 + 4ui j − ui+1, j − ui, j+1 + γh2ui j = h2 f (x i, y j ), (2.69)
for i = 1, . . . , n and j = 1, . . . , n, where ui j = 0 for i ∈ {0, n x } or j ∈ {0, n y }. This leads to linear
system Au = b. The alternating-direction implicit (ADI)3 method splits the equations into
[−ui−1, j + 2ui j − ui+1, j ] + [−ui, j−1 + 2ui j − ui, j+1 ] + γh2ui j = h2 f (x i, y j ). (2.70)
3The ADI method was originally developed by Peaceman and Rachford [PR55]. See also [Pea90] for a history of
this method.
their action on a vector z ∈ R^{n^2} with components z_{ij}. (Note that in the context of finite difference
discretization the entries of the vectors u, w, z, ... correspond to functions on a grid with points
(x_i, y_j). Therefore it is convenient to use double indices w_{ij} to indicate that this is the value of a
function at grid point (x_i, y_j).)
\[
w_{ij} = -z_{i-1,j} + 2 z_{ij} - z_{i+1,j} + \tfrac12 \gamma h^2 z_{ij} \quad \text{if } w = Bz, \qquad (2.71a)
\]
\[
w_{ij} = -z_{i,j-1} + 2 z_{ij} - z_{i,j+1} + \tfrac12 \gamma h^2 z_{ij} \quad \text{if } w = Cz, \qquad (2.71b)
\]
for i, j = 1, \ldots, n. (We set z_{0,j} = z_{n+1,j} = z_{i,0} = z_{i,n+1} = 0!)
Given the matrix splitting,
A= B+C
we use the iterative scheme
(B + r I)u (k+1/2) = (r I − C)u (k) + b, (2.72a)
(C + r I)u (k+1) = (r I − B)u (k+1/2) + b (2.72b)
which can be combined into
u (k+1) = GADIu (k) + (C + r I) −1 [I + (r I − B)(B + r I) −1 ]b. (2.73)
with
GADI = (C + r I) −1 (r I − B)(B + r I) −1 (r I − C).
The advantage of (2.72) is that the computation of u^{(k+1/2)} and u^{(k+1)} requires the solution of two
block-diagonal systems, provided we order the unknowns in a suitable way.
The definition (2.71) shows that if we order the unknowns u^{(k+1/2)} and right hand side in (2.72a)
along grid lines with constant y_j,
\[
u^{(k+1/2)} = \bigl( u^{(k+1/2)}_{11}, \ldots, u^{(k+1/2)}_{n1},\; u^{(k+1/2)}_{12}, \ldots, u^{(k+1/2)}_{n2},\; \ldots\ldots,\; u^{(k+1/2)}_{1n}, \ldots, u^{(k+1/2)}_{nn} \bigr)^T,
\]
then
\[
B + r I = \begin{pmatrix} T & & \\ & \ddots & \\ & & T \end{pmatrix},
\]
where
\[
T = \begin{pmatrix}
2 + \tfrac12 \gamma h^2 + r & -1 & & & \\
-1 & 2 + \tfrac12 \gamma h^2 + r & -1 & & \\
& \ddots & \ddots & \ddots & \\
& & -1 & 2 + \tfrac12 \gamma h^2 + r & -1 \\
& & & -1 & 2 + \tfrac12 \gamma h^2 + r
\end{pmatrix} . \qquad (2.74)
\]
If we order the unknowns u^{(k+1)} and right hand side in (2.72b) along grid lines with constant x_i,
\[
u^{(k+1)} = \bigl( u^{(k+1)}_{11}, \ldots, u^{(k+1)}_{1n},\; u^{(k+1)}_{21}, \ldots, u^{(k+1)}_{2n},\; \ldots\ldots,\; u^{(k+1)}_{n1}, \ldots, u^{(k+1)}_{nn} \bigr)^T,
\]
then
\[
C + r I = \begin{pmatrix} T & & \\ & \ddots & \\ & & T \end{pmatrix},
\]
where T is given as before.
Thus, one iteration of (2.72) requires the solution of 2n linear systems with the tridiagonal matrix T (which corresponds
to the discretization of a one-dimensional differential equation).
i. Show that
\[
B v^{(k,\ell)} = \bigl( \lambda_k + \tfrac12 \gamma h^2 \bigr) v^{(k,\ell)}, \qquad (2.75a)
\]
\[
C v^{(k,\ell)} = \bigl( \lambda_\ell + \tfrac12 \gamma h^2 \bigr) v^{(k,\ell)}, \qquad (2.75b)
\]
for k, \ell = 1, \ldots, n, where the eigenvalues are given by
\[
\lambda_k = 4 \sin^2\Bigl( \frac{k\pi}{2(n+1)} \Bigr), \qquad k = 1, \ldots, n,
\]
and the eigenvectors v^{(k,\ell)}, k, \ell = 1, \ldots, n, have components
\[
v^{(k\ell)}_{ij} = \sin\Bigl( \frac{k\pi i}{n+1} \Bigr) \sin\Bigl( \frac{\ell\pi j}{n+1} \Bigr), \qquad i, j = 1, \ldots, n.
\]
iii. Apply the ADI method to partial differential equation (2.68) with γ ≥ 0 and right hand side
f such that the exact solution is u(x, y) = 16x(1 − x)y(1 − y). (Any r > 0 will work. Try
r = 1.)
More information on the convergence of the ADI method for the model problem can be found in
the original paper by Peaceman and Rachford [PR55], or in the book by Stoer and Bulirsch [SB93,
Sec. 8.6].
Problem 2.16 Let A ∈ Rn×n be symmetric positive definite and let B ∈ Rm×n have rank m < n.
Sy = r (2.77)
ii. The Uzawa iteration for the solution of (2.76) generates a sequence of iterates x (k), y (k) as
follows:
(Note that it is x (k+1) in (2.78b), not x (k) !) Show that if A is symmetric positive definite, the
iterates y (k) generated by the Uzawa iteration are the iterates generated by gradient method
discussed in Problem 2.8 applied to the Schur complement system (2.77).
\[
\begin{pmatrix} A & B^T \\ -\omega B & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} c \\ -\omega d \end{pmatrix}. \qquad (2.79)
\]
Show that the Uzawa iteration (2.78) is obtained from a matrix splitting
\[
\begin{pmatrix} A & B^T \\ -\omega B & 0 \end{pmatrix} = M - N
\]
and is given by
\[
\begin{pmatrix} x^{(k+1)} \\ y^{(k+1)} \end{pmatrix}
= M^{-1} N \begin{pmatrix} x^{(k)} \\ y^{(k)} \end{pmatrix}
+ M^{-1} \begin{pmatrix} c \\ -\omega d \end{pmatrix}.
\]
What are M, N, and M^{-1}N?
Show that ρ(M^{-1}N) < 1 if and only if ω ∈ (0, 2/‖B A^{-1} B^T‖_2).
Problem 2.17 Let B ∈ Rm×n have rank m < n, and let A ∈ Rn×n be symmetric positive semidefinite
and symmetric positive definite on the null-space N (B) of B, i.e., let A satisfy vT Av ≥ 0 for all
v ∈ Rn and vT Av > 0 for all v ∈ N (B) \ {0}.
\[
\begin{pmatrix} A + \omega B^T B & B^T \\ -\omega B & 0 \end{pmatrix} = M - N
\]
and is given by
iv. Show that ρ(M^{-1}N) < 1 if and only if \rho\bigl( B(\omega^{-1} A + B^T B)^{-1} B^T \bigr) \in (0, 2).
v. Let A be symmetric positive definite. Show that \rho\bigl( B(\omega^{-1} A + B^T B)^{-1} B^T \bigr) \le 1 for all ω > 0.
Hint: First show that for sufficiently small ε > 0 we have v^T(\omega^{-1} A + B^T B) v \ge v^T(\epsilon I + B^T B) v
for all vectors v ∈ R^n. Then v^T B(\omega^{-1} A + B^T B)^{-1} B^T v \le \ldots
\[
\underbrace{\begin{pmatrix} A & B^T \\ B & D \end{pmatrix}}_{=K}
\underbrace{\begin{pmatrix} y \\ z \end{pmatrix}}_{=x}
= \underbrace{\begin{pmatrix} e \\ f \end{pmatrix}}_{=b} \qquad (2.83)
\]
with nonsingular symmetric matrix A ∈ Rn×n , symmetric matrix D ∈ Rm×m , and matrix B ∈ Rm×n .
(D − B A−1 BT )z = f − B A−1 e.
iii. Show that K is positive definite if and only if A, D, and S are positive definite.
iv. Assume that K is positive definite (in particular A, D, and S are positive definite).
Show that the block (forward) Gauss-Seidel method applied to (2.83) is equivalent to the
iteration
z (k+1) = D−1 B A−1 BT z (k) + D−1 ( f − B A−1 e).
v. Again assume that K is positive definite (in particular A, D, and S are positive definite).
Show that the eigenvalues of the iteration matrix D−1 B A−1 BT are contained in [0, 1).
Problem 2.19 We apply a simple Domain Decomposition Method to the finite difference dis-
cretization
\[
\frac{1}{h^2}
\begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
& \ddots & \ddots & \ddots & \\
& & -1 & 2 & -1 \\
& & & -1 & 2
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{pmatrix}
=
\begin{pmatrix} f(x_1) + \frac{1}{h^2} g_0 \\ f(x_2) \\ \vdots \\ f(x_{n-1}) \\ f(x_n) + \frac{1}{h^2} g_1 \end{pmatrix} . \qquad (2.84)
\]
We write (2.84) as
Ay = b. (2.85)
Assume that n = (k + 1)m − 1 for k, m ∈ N. We partition the equations (2.84) into groups
corresponding to indices
Γ = {m, 2m, . . . , km}
and
I j = { jm + 1, . . . , ( j + 1)m − 1}, j = 0, . . . , k
and we define
I = ∪ kj=0 I j .
Given an vector y ∈ Rn and an index set J ⊂ {1, . . . , n}, we use y J ∈ R| J | to denote the
subvector corresponding to that index set. Similarly, given a matrix A ∈ Rn×n and index sets
J, K ⊂ {1, . . . , n}, we use A JK ∈ R| J |×|K | to denote the submatrix corresponding to these index sets
i. If we reorder the equations and unknowns in (2.84) according to I = ∪ kj=1 I j , Γ, then the
resulting system is ! ! !
AI I AIΓ yI bI
= , (2.86)
AΓI AΓΓ yΓ bΓ
where AΓI = ATIΓ .
The matrices AI I , AIΓ, AΓΓ have a particular structure. Sketch the system (2.86) such that this
structure is revealed. It may be useful to start with the special case k = 1, m = 3.
ii. The system (2.86) is of the type (2.83). Implement the block Gauss-Seidel iteration of
Problem 2.18 iv.,
\[
y_\Gamma^{(k+1)} = A_{\Gamma\Gamma}^{-1} A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}\, y_\Gamma^{(k)} + A_{\Gamma\Gamma}^{-1} \bigl( b_\Gamma - A_{\Gamma I} A_{II}^{-1} b_I \bigr).
\]
Use the structure of A_{II}, A_{IΓ}, A_{ΓΓ}. (For example, A_{II} is block diagonal, and for bigger
problems the application of A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}\, y_\Gamma^{(k)} can be done in parallel.)
Use k = 9, m = 10, and data f , g0, g1 such that the exact solution of (2.84) is y(x) = cos(2πx).
Plot the convergence history k Ay (k) − bk2 vs. k.
Problem 2.20 This problem studies the Kaczmarz method originally proposed in [Kac37] (see also
the translation [Kac93]). See the books by Natterer [Nat01] and Natterer and Wübbeling [NW01]
for application of the Kaczmarz method in image reconstruction.
Let A ∈ Rn×n be nonsingular, let b ∈ Rn , let ei be the i-th unit vector and let ai = AT ei ∈ Rn be
the transpose of the i-th row of A.
i. The projection of y ∈ R^n onto the set \{ x ∈ \mathbb{R}^n : a_i^T x = b_i \} is the solution x of
\[
\min_{x \in \mathbb{R}^n} \tfrac12 \| x - y \|_2^2 \quad \text{subject to} \quad a_i^T x = b_i . \qquad (2.87)
\]
Show that
\[
x = y - \frac{(A y - b)^T e_i}{a_i^T a_i}\, a_i .
\]
(Hint: Since (2.87) is a convex optimization problem, the Lagrange Multiplier Theorem
provides the necessary and sufficient conditions for the solution of (2.87).)
ii. Given a vector x (k) the Kaczmarz iteration computes a new approximation x (k+1) of the linear
system as follows.
For i = 1, \ldots, n
\[
x^{(k+i/n)} = x^{(k+(i-1)/n)} - \frac{(A x^{(k+(i-1)/n)} - b)^T e_i}{a_i^T a_i}\, a_i .
\]
End
Let n = 2. Sketch the Kaczmarz iteration, i.e., the steps x^{(0)}, x^{(1/2)}, x^{(1)}, x^{(1+1/2)}, \ldots (a small Matlab sketch of one Kaczmarz sweep is given after this problem).
iii. The Gauss-Seidel iteration applied to AA^T \widetilde x = b computes
For i = 1, \ldots, n
\[
\widetilde x^{(k+1)}_i = \frac{1}{(AA^T)_{ii}} \Bigl( b_i - \sum_{j<i} (AA^T)_{ij}\, \widetilde x^{(k+1)}_j - \sum_{j>i} (AA^T)_{ij}\, \widetilde x^{(k)}_j \Bigr)
= \widetilde x^{(k)}_i - \frac{1}{(AA^T)_{ii}} \Bigl( \sum_{j<i} (AA^T)_{ij}\, \widetilde x^{(k+1)}_j + \sum_{j \ge i} (AA^T)_{ij}\, \widetilde x^{(k)}_j - b_i \Bigr).
\]
End
or, equivalently,
For i = 1, \ldots, n
\[
\widetilde x^{(k+i/n)} = \widetilde x^{(k+(i-1)/n)} - \frac{1}{(AA^T)_{ii}}\, e_i \Bigl( \sum_{j=1}^{n} (AA^T)_{ij}\, \widetilde x^{(k+(i-1)/n)}_j - b_i \Bigr).
\]
End
Show that the Kaczmarz iteration for Ax = b is equivalent to the Gauss-Seidel iteration applied
to AA^T \widetilde x = b in the sense that the iterates satisfy x^{(k)} = A^T \widetilde x^{(k)}.
iv. What can you say about the convergence of the Kaczmarz iteration?
v. This part applies the Kaczmarz iteration to an image deblurring problem. The true image is
represented by a function f : [0, 1] → [0, 1] (think of f (x) as the gray scale of the image at
x). The blurred image g : [0, 1] → R is given by
\[
\int_0^1 k(\xi_1, \xi_2) f(\xi_2)\, d\xi_2 = g(\xi_1), \qquad \xi_1 \in [0, 1], \qquad (2.88)
\]
where χ I is the indicator function on the interval I. We insert these approximations into
(2.88) and approximate the integral by the midpoint rule. This leads to the linear system
Kf = g, (2.89)
where
f = ( f 1, . . . , f n )T , g = (g1, . . . , gn )T ,
and
Ki j = h k (ξi, ξ j ), i, j = 1, . . . , n.
Let n = 100 and construct the true image f_true via

xi = ((1:n)'-0.5)/n;        % midpoints of the n subintervals (assumed grid;
                            % the definition of xi is not shown in the original)
ftrue = zeros(n,1);
ftrue = exp( -(xi-0.75).^2 * 70 );
ind = (0.1<=xi) & (xi<=0.25);
ftrue(ind) = 0.8;
ind = (0.3<=xi) & (xi<=0.35);
ftrue(ind) = 0.3;
and compute the resulting blurred image g = Kf true . See the left plot in Figure 2.6. We want
to recover f true from g and (2.89).
The matrix K is highly ill-conditioned. Solving (2.89) using Matlab ’s backslash leads to a
highly oscillatory function, indicated by the blue dashed lines in center plot in Figure 2.6.
Apply the Kaczmarz iteration to (2.89) with starting point f (0) = 0. Stop the iteration when
kKf (k) − gk2 ≤ 10−2 kgk2 . Generate plots like those shown in Figure 2.6, as well as a plot of
the residual norms kKf (k) − gk2 .
The right plot in Figure 2.6 shows that the Kaczmarz iteration recovers the true image fairly
well, especially the smooth parts of true image. This is due to the early termination of the
iteration.
Figure 2.6: Left plot: True image f true and blurred image g. Middle plot: True image f true , recovered
image f = K−1 g and blurred image g. Right plot: True image f true , recovered image f (k) using
the Kaczmarz iteration with stopping criteria kKf (k) − gk2 ≤ 10−2 kgk2 , and blurred image g. The
image f (k) recovered using the Kaczmarz iteration matches f true and especially the smooth part of
f true well.
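The following Matlab lines are not part of the original notes; they sketch one full Kaczmarz sweep, i.e., the map x^{(k)} → x^{(k+1)} of part ii, which, applied repeatedly, is the iteration used in part v.

% Not from the original notes: one Kaczmarz sweep x^(k) -> x^(k+1).
function x = kaczmarz_sweep(A, b, x)
  n = size(A,1);
  for i = 1:n
    ai = A(i,:)';                              % transpose of the i-th row of A
    x  = x - ((ai'*x - b(i)) / (ai'*ai)) * ai; % project onto {x : a_i'*x = b_i}
  end
end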
References
[PR55] D. W. Peaceman and H. H. Rachford, Jr. The numerical solution of parabolic and elliptic
differential equations. J. Soc. Indust. Appl. Math., 3:28–41, 1955.
[Saa03] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, second
edition, 2003.
[SB93] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer Verlag, New York,
Berlin, Heidelberg, London, Paris, second edition, 1993.
[TE05] L. N. Trefethen and M. Embree. Spectra and Pseudospectra. The Behavior of Nonnormal
Matrices and Operators. Princeton University Press, Princeton, NJ, 2005.
[Var62] R. S. Varga. Matrix Iterative Analysis. Prentice Hall, Englewood Cliffs, NJ, 1962.
[Wri15] S. J. Wright. Coordinate descent algorithms. Math. Program., 151(1, Ser. B):3–34,
2015. URL: http://dx.doi.org/10.1007/s10107-015-0892-3, doi:10.1007/
s10107-015-0892-3.
[Xu92] J. Xu. Iterative methods by space decomposition and subspace correction. SIAM Review,
34:581–613, 1992.
[You71] D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, New York,
1971. Republished by Dover [You03].
[You03] D. M. Young. Iterative Solution of Large Linear Systems. Dover Publications Inc.,
Mineola, NY, 2003. Unabridged republication of the 1971 edition [You71].
Chapter 3
Krylov Subspace Methods
3.1. Introduction
In this section we study Krylov subspace methods for the solution of square linear systems
\[
A x = b, \qquad (3.1)
\]
or of the equivalent quadratic minimization problem
\[
\min \; \tfrac12 x^T A x - b^T x \qquad (3.2)
\]
with symmetric positive definite matrix A. These methods successively generate nested subspaces
\[
\mathcal{K}_1(A, r_0) \subset \mathcal{K}_2(A, r_0) \subset \ldots \subset \mathcal{K}_k(A, r_0) \subset \mathbb{R}^n,
\]
\[
\langle x, y \rangle = x^T y
\]
and
\[
\| x \| = \| x \|_2 = \sqrt{x^T x} .
\]
The reason for this notation is that the Krylov subspace methods can be easily extended to linear
operator equations in Hilbert spaces (X, ⟨·, ·⟩). In the general case the transpose of A has to be replaced
by the adjoint A∗ , which is the linear operator that satisfies hAx, yi = hx, A∗ yi for all x, y ∈ X.
\[
\mathcal{V}_k = \mathcal{R}(V_k). \qquad (3.4)
\]
\[
\| x \|_A \stackrel{\mathrm{def}}{=} \langle A x, x \rangle^{1/2}, \qquad (3.5)
\]
then Ax^∗ = b implies
\[
\tfrac12 \langle A x, x \rangle - \langle b, x \rangle
= \tfrac12 \langle A(x - x^*), x - x^* \rangle - \tfrac12 \langle A x^*, x^* \rangle
= \tfrac12 \| x - x^* \|_A^2 - \tfrac12 \| x^* \|_A^2 .
\]
Thus,
\[
\tfrac12 \langle A x, x \rangle - \langle b, x \rangle < \tfrac12 \langle A y, y \rangle - \langle b, y \rangle
\quad \text{if and only if} \quad \| x - x^* \|_A < \| y - x^* \|_A .
\]
Therefore, we can minimize \tfrac12 \langle A x, x \rangle - \langle b, x \rangle over x_0 + \mathcal{V}_k to compute the approximation x_k of the
solution x^* of the linear system A x = b. Using (3.4) the minimization problem
\[
\min_{x \in x_0 + \mathcal{V}_k} \tfrac12 \langle A x, x \rangle - \langle b, x \rangle \qquad (3.6)
\]
can be written as
\[
\min_{y \in \mathbb{R}^k} \tfrac12 \langle V_k^T A V_k y, y \rangle - \langle V_k^T (b - A x_0), y \rangle + \tfrac12 \langle A x_0, x_0 \rangle . \qquad (3.7)
\]
Theorem 3.2.1 Let A ∈ Rn×n be symmetric positive semidefinite, let b ∈ Rn , and let Vk ⊂ Rn . The
vector x k ∈ x 0 + Vk solves (3.6) if and only if
hAx k − b, vi = 0 ∀ v ∈ Vk . (3.8)
Proof: Let {v1, . . . , vk } be a basis for Vk and define Vk = (v1, · · · , vk ) ∈ Rn×k . For every
x k ∈ x 0 + Vk there exists a unique y k ∈ R k such that x k = x 0 + Vk y k . The minimization problem
(3.6) is equivalent to (3.7). By Theorem 1.2.1 the minimization problem (3.7) has a solution y k if
and only if VkT AVk y k = VkT (b − Ax 0 )
hAx k − b, vi = 0 ∀ v ∈ Vk . (3.10)
Figure 3.1: The vector x_k ∈ x_0 + V_k is a Galerkin approximation if and only if the residual Ax_k − b
is orthogonal to V_k.
Now suppose that Vk = R (Vk ), and that we have computed the Galerkin approximation
x k ∈ x 0 + Vk . If Ax k − b , 0, then we want to expand the subspace, i.e., generate a new subspace
Vk+1 = R ((Vk , vk+1 )) in such a way that the Galerkin approximation x k+1 ∈ x 0 + Vk+1 is a better
approximation to x ∗ in the sense that
What are the requirements on the vector vk+1 ? The next theorem shows that the new Galerkin
approximation x k+1 ∈ x 0 + Vk+1 satisfies (3.12) if and only the vector vk+1 is not orthogonal to
Ax k − b.
Theorem 3.2.3 Let A ∈ Rn×n be symmetric positive definite. Furthermore, let Vk ∈ Rn×k and
Vk+1 = (Vk , vk+1 ) ∈ Rn×(k+1) , and let Vk = R (Vk ), Vk+1 = R (Vk+1 ) be the corresponding
subspaces. Then,
\[
\min_{x \in x_0 + \mathcal{V}_{k+1}} \tfrac12 \langle A x, x \rangle - \langle b, x \rangle
< \min_{x \in x_0 + \mathcal{V}_k} \tfrac12 \langle A x, x \rangle - \langle b, x \rangle
\]
Proof: Define Q(x) \stackrel{\mathrm{def}}{=} \tfrac12 \langle A x, x \rangle - \langle b, x \rangle. Note that ∇Q(x) = Ax − b, and note that for all v ∈ R^n
and λ ∈ R we have
\[
Q(x_k + \lambda v) = Q(x_k) + \frac{\lambda^2}{2} \langle A v, v \rangle + \lambda \langle A x_k - b, v \rangle. \qquad (3.13)
\]
i. If hAx k − b, vk+1 i , 0, then we can find λ ∈ R with λhAx k − b, vk+1 i < 0 and |λ| sufficiently
small such that (3.13) with v = vk+1 implies
Since Q(x k+1 ) < Q(x k ) = min x∈x 0 +Vk Q(x) it follows that x k+1 < x 0 + Vk . Therefore x k+1 − x k
can be written in the form x k+1 − x k = v + λvk+1 , where v ∈ Vk , and λ , 0. With (3.10) this yields
the desired
We use Theorem 3.2.3 to construct our subspaces. Let x_0 be given. We want to find \mathcal{V}_1 = \mathrm{span}\{v_1\}
such that \tfrac12 \langle A x_1, x_1 \rangle - \langle b, x_1 \rangle < \tfrac12 \langle A x_0, x_0 \rangle - \langle b, x_0 \rangle. By Theorem 3.2.3, v_1 must satisfy
\[
\langle A x_0 - b, v_1 \rangle \neq 0 .
\]
A natural choice is v_1 = r_0 = b - A x_0, i.e.,
\[
\mathcal{V}_1 = \mathrm{span}\{ r_0 \} .
\]
Let x_1 = \mathrm{argmin}_{x \in x_0 + \mathcal{V}_1} \tfrac12 \langle A x, x \rangle - \langle b, x \rangle. We want to find \mathcal{V}_2 = \mathrm{span}\{ \mathcal{V}_1 \cup \{v_2\} \} such that
\tfrac12 \langle A x_2, x_2 \rangle - \langle b, x_2 \rangle < \tfrac12 \langle A x_1, x_1 \rangle - \langle b, x_1 \rangle. Thus we have to choose v_2 such that
\[
\langle A x_1 - b, v_2 \rangle \neq 0 .
\]
Vk = span{r 0, Ar 0, . . . , Ak−1r 0 }.
Definition 3.2.4 Given A ∈ Rn×n and v ∈ Rn the Krylov subspace (generated by A and v) is defined
by
Kk ( A, v) ≡ span{v, Av, . . . , Ak−1 v}.
Examples 3.2.5 Let e k denote the kth unit vector. Moreover, we use the subspaces Vk =
span{e1, · · · , e k } with corresponding matrices Vk = (e1, · · · , e k ) ∈ R3×k , k = 1, 2, 3
i. Consider
\[
A = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
b = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}.
\]
The unique solution of Ax = b is given by x^∗ = (0, 0, 1)^T, but (3.10) is not solvable for k < 3. For
example, in the case k = 2, (3.11) is equivalent to
\[
V_2^T A \underbrace{V_2 z}_{x_2} = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} z = \begin{pmatrix} 1 \\ 0 \end{pmatrix} = V_2^T b,
\]
which has no solution.
ii. Consider
\[
A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
b = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}.
\]
The unique solution of Ax = b is given by x^∗ = (1, 0, 1)^T. For k = 1 there exists a unique Galerkin
approximation given by x_1 = e_1, but for k = 2 (3.11) is not solvable.
Since in the general case (3.10) does not have a solution or has no unique solution, we need to
define our approximation differently. If A is nonsingular, then x^∗ solves Ax = b if and only if it
minimizes \tfrac12 \| A x - b \|^2 = \tfrac12 \| A(x - x^*) \|^2, and we can use the residual as a measure for the
error. This is possible even if A is not square.
Since
\[
\tfrac12 \| A x - b \|^2 = \tfrac12 \langle A^T A x, x \rangle - \langle A^T b, x \rangle + \tfrac12 \| b \|^2,
\]
the least squares problem (3.15) is equivalent to
One can see immediately that minimum residual approximations x k ∈ x 0 + Vk to x ∗ are Galerkin
approximations for the solution of the normal equation
AT Ax = AT b.
Previously we have seen that if A is symmetric positive definite, the use of Galerkin approxi-
mations leads to the Krylov subspace. What subspace do we select if we use minimum residual
approximations? We have seen that minimum residual approximations are Galerkin approximations
for the normal equation (see (3.15) and (3.16)). Hence, we can apply our arguments for Galerkin
approximation and select the subspace Vk = Kk ( AT A, AT r 0 ). This is done when A ∈ Rm×n .
If A ∈ Rn×n , then we select Vk = Kk ( A, r 0 ), i.e., we compute x k ∈ x 0 + Kk ( A, r 0 ) as the
solution of
\[
\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \tfrac12 \| A x - b \|^2 . \qquad (3.17)
\]
\[
\min_{x \in x_0 + \mathcal{K}_{k+1}(A, r_0)} \tfrac12 \| A x - b \|^2
< \min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \tfrac12 \| A x - b \|^2,
\]
\[
e(x) \stackrel{\mathrm{def}}{=} x^* - x, \qquad r(x) \stackrel{\mathrm{def}}{=} b - A x .
\]
These representations will be important in the convergence analysis of Krylov subspace methods.
Any vector v ∈ \mathcal{K}_k(A, r_0) can be written as v = \sum_{i=0}^{k-1} \gamma_i A^i r_0 = \pi_{k-1}(A) r_0 for some polynomial
\pi_{k-1}(t) = \sum_{i=0}^{k-1} \gamma_i t^i of degree less than or equal to k − 1. If x ∈ x_0 + \mathcal{K}_k(A, r_0), where r_0 = b − A x_0,
then the error obeys
\[
e(x) = x^* - x = x^* - x_0 - \pi_{k-1}(A) r_0
\]
for some polynomial \pi_{k-1} of degree less than or equal to k − 1. Moreover, since r_0 = b − A x_0 =
A(x^* − x_0) we have
\[
e(x) = \bigl( I - \pi_{k-1}(A) A \bigr) (x^* - x_0)
\]
(note that A\pi_{k-1}(A) = \pi_{k-1}(A) A) for the polynomial \pi_{k-1} of degree less than or equal to k − 1 that
appears in the error representation.
or, equivalently,
\[
\tfrac12 \| e_k \|_A^2
= \min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \tfrac12 \| x - x^* \|_A^2
= \min_{\pi \in \mathcal{P}_{k-1}} \tfrac12 \| (I - \pi(A) A) e_0 \|_A^2, \qquad (3.20)
\]
where \mathcal{P}_{k-1} is the set of all polynomials of degree less than or equal to k − 1 and e_0 = x^* − x_0,
e_k = x^* − x_k.
where P k−1 is the set of all polynomials of degree less than or equal to k − 1.
Later in Section 3.8, we will use (3.20) and (3.21) in the convergence analysis of Krylov
subspace methods.
Theorem 3.2.7 Let A ∈ Rn×n be nonsingular. If Ak r 0 ∈ Kk ( A, r 0 ) for some k, then there exists a
polynomial \pi_{k-1} of degree less than or equal to k − 1 such that A^{-1} r_0 = \pi_{k-1}(A) r_0.
Proof: Let k be the smallest integer such that Ak r 0 ∈ Kk ( A, r 0 ). Since A j r 0 < K j ( A, r 0 ) for all
j = 1, . . . , k − 1, the vectors r 0, Ar 0, . . . , Ak−1r 0 are linearly independent. The nonsingularity of
A implies that Ar 0, A2r 0, . . . , Ak r 0 are linearly independent. By assumption r 0, Ar 0, A2r 0, . . . , Ak r 0
are linearly dependent. Therefore there exist scalars λ j such that
\[
r_0 = \sum_{j=1}^{k} \lambda_j A^j r_0 = A \sum_{j=0}^{k-1} \lambda_{j+1} A^j r_0 .
\]
The previous theorem implies that Krylov subspace approximation algorithms terminate after
k iterations if Ak r 0 ∈ Kk ( A, r 0 ). In fact, Ax = b is equivalent to A x̃ = r 0 = b − Ax 0 and if
Ak r 0 ∈ Kk ( A, r 0 ), then
x̃ = A−1r 0 = π k−1 ( A)r 0 ∈ Kk ( A, r 0 ).
\[
\tilde v_{j+1} = A v_j - \sum_{i=1}^{j} \langle A v_j, v_i \rangle v_i, \qquad v_{j+1} = \tilde v_{j+1} / \| \tilde v_{j+1} \|,
\]
to compute orthonormal bases \{ v_1, \ldots, v_{j+1} \} of the Krylov subspaces
\mathcal{K}_{j+1}(A, r_0). For numerical reasons it is better to use the modified Gram-Schmidt method. This
leads to the so-called Arnoldi Iteration.
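Since the listing of Algorithm 3.3.2 is referenced repeatedly below, the following Matlab sketch may help; it is not part of the original notes and follows the standard modified Gram-Schmidt formulation of the Arnoldi iteration, so variable names only loosely mirror the algorithm's steps.

% Not from the original notes: Arnoldi iteration with modified Gram-Schmidt.
% Returns V = (v_1,...,v_{m+1}) and the (m+1) x m matrix Hbar.
function [V, H] = arnoldi_sketch(A, r0, m)
  n = length(r0);
  V = zeros(n, m+1);  H = zeros(m+1, m);
  V(:,1) = r0 / norm(r0);
  for j = 1:m
    w = A*V(:,j);
    for i = 1:j                     % modified Gram-Schmidt orthogonalization
      H(i,j) = V(:,i)'*w;
      w      = w - H(i,j)*V(:,i);
    end
    H(j+1,j) = norm(w);
    if H(j+1,j) == 0, break; end    % A^j r0 lies in K_j(A,r0) ("lucky breakdown")
    V(:,j+1) = w / H(j+1,j);
  end
end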
Theorem 3.3.3 If Algorithm 3.3.2 does not stop in step j, then for all k ≤ j + 1
Proof: The fact that {v1, . . . , vk } is an orthonormal basis of Kk ( A, r 0 ) follows from the properties
of the Gram-Schmidt orthogonalization and Theorem 3.3.1.
The orthogonality of the vectors v1, . . . , vk and steps (b)-(e) of Algorithm 3.3.2 implies
\[
A v_\ell = \sum_{i=1}^{\ell+1} v_i h_{i\ell},
\]
which is the `th column in the identity AVk = Vk+1 H̄k . If we multiply both sides in this identity by
VkT and use the orthogonality of the vectors v1, · · · , vk+1 , we obtain VkT AVk = Hk .
If Algorithm 3.3.2 stops at step j < m, then h_{j+1,j} = 0, i.e., A v_j = \sum_{i=1}^{j} v_i h_{ij}. Theorem 3.3.1
implies A^j r_0 ∈ \mathcal{K}_j(A, r_0).
Corollary 3.3.5 Let A ∈ R^{n×n} be symmetric. If Algorithm 3.3.4 does not stop in step j, then for all
k ≤ j + 1
\[
A V_k = V_k T_k + \beta_{k+1} v_{k+1} e_k^T ,
\]
where V_k = (v_1, \ldots, v_k) ∈ R^{n×k}, T_k ∈ R^{k×k} is the matrix given in (3.23) and e_k is the kth unit vector
in R^k.
If Algorithm 3.3.4 stops in step (d) of iteration j < m, then A j r 0 ∈ K j ( A, r 0 ).
Note that the work per iteration in the Lanczos Iteration is constant, whereas the work in the
Arnoldi Iteration grows linearly with the iteration count j.
GMRES, a generalized minimal residual algorithm for solving nonsymmetric linear systems, was developed by
Saad and Schultz [SS86].
It computes minimum residual approximations using the Krylov subspace Kk ( A, r 0 ). The kth
step computes the solution x k of
min k Ax − bk,
x∈x 0 +Kk ( A,r 0 )
where r 0 = b− Ax 0 . The Arnoldi Iteration 3.3.2 is used to generate an orthonormal basis {v1, · · · , vk }
of \mathcal{K}_k(A, r_0). By Theorem 3.3.3 and from v_1 = r_0/\|r_0\| we have
\[
b - A(x_0 + V_k y) = r_0 - A V_k y = V_{k+1} \bigl( \| r_0 \| e_1 - \bar H_k y \bigr),
\]
where e_1 is the first unit vector in R^{k+1}. Since x ∈ x_0 + \mathcal{K}_k(A, r_0) if and only if x = x_0 + V_k y for some
y ∈ R^k, and since V_{k+1} has orthonormal columns, the problem \min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \| A x - b \| is equivalent to
\[
\min_{y \in \mathbb{R}^k} \bigl\| \bar H_k y - \| r_0 \| e_1 \bigr\| .
\]
If in step (2d) h k+1,k = 0, then Ak r 0 ∈ Kk ( A, r 0 ), cf. Theorem 3.3.3, and by Theorem 3.2.7,
x 0 + Vk y k solves Ax = b. This is sometimes called lucky breakdown.
GMRES is terminated if the residual or the relative residual is smaller than some tolerance ε > 0, i.e., if
\[
\| r_k \| < \epsilon \qquad \text{or} \qquad \| r_k \| < \epsilon \| b \| .
\]
Since
\[
A x_k = b - (b - A x_k) = b - r_k,
\]
the perturbation theory for the solution of linear systems implies the following estimates for the
error x^∗ − x_k:
\[
\| x^* - x_k \| \le \| A^{-1} \| \, \| r_k \| \le \| A^{-1} \| \, \epsilon
\]
if GMRES(m) stops with ‖r_k‖ < ε, and
\[
\frac{\| x^* - x_k \|}{\| x^* \|} \le \| A \| \, \| A^{-1} \| \, \frac{\| r_k \|}{\| b \|} \le \| A \| \, \| A^{-1} \| \, \epsilon
\]
if GMRES(m) stops with ‖r_k‖ < ε‖b‖. Note that ‖A‖ ‖A^{-1}‖ is the condition number of A.
The question when to restart is difficult in general. See [Emb03] for some interesting examples.
\[
A = \begin{pmatrix}
0 & \cdots & \cdots & 0 & 1 \\
1 & 0 & & & 0 \\
0 & 1 & 0 & & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & 0
\end{pmatrix} \in \mathbb{R}^{n \times n},
\qquad
b = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \in \mathbb{R}^n,
\qquad
x_0 = 0 \in \mathbb{R}^n .
\]
The Arnoldi Iteration applied to this example yields
\[
\bar H_k = \begin{pmatrix}
0 & \cdots & \cdots & 0 \\
1 & 0 & & \\
0 & 1 & 0 & \\
\vdots & & \ddots & \ddots \\
0 & \cdots & 0 & 1
\end{pmatrix} \in \mathbb{R}^{(k+1) \times k} .
\]
From this it can be seen that the GMRES iterates are given by x_k = x_0 for k = 1, \ldots, n − 1, and
x_n = x^∗. Thus, the residuals are not reduced until the last iterate.
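The stagnation can be reproduced with Matlab's built-in gmres. The lines below are not part of the original notes.

% Not from the original notes: full (unrestarted) GMRES for the cyclic shift
% example; the residual norm does not decrease until the very last step.
n  = 20;
A  = [zeros(1,n-1) 1; eye(n-1) zeros(n-1,1)];
b  = [1; zeros(n-1,1)];
[x,flag,relres,iter,resvec] = gmres(A, b, [], 1e-12, n);
disp(resvec')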
\[
G_i = \begin{pmatrix}
1 & & & & \\
& \ddots & & & \\
& & 1 & & \\
& & & c_i & -s_i \\
& & & s_i & c_i
\end{pmatrix} \in \mathbb{R}^{i \times i},
\]
where
\[
\hat z^{(k)} = \bigl( z_1^{(k)}, \ldots, z_k^{(k)} \bigr)^T \in \mathbb{R}^k .
\]
Therefore the residual is given by
\[
\| r_k \| = \bigl| z_{k+1}^{(k)} \bigr| . \qquad (3.27)
\]
Due to the special structure of the Givens rotations G_k, the vectors z^{(k)} obey
\[
z_\ell^{(k)} = z_\ell^{(k-1)}, \quad 1 \le \ell \le k-1, \qquad
z_k^{(k)} = c_{k+1} z_k^{(k-1)}, \qquad
z_{k+1}^{(k)} = s_{k+1} z_k^{(k-1)} .
\]
hAx k − b, vi = 0 ∀ v ∈ Kk ( A, r 0 ). (3.29)
where e_1 is the first unit vector in R^k. If we set x_k = x_0 + V_k y_k and v = V_k ν in (3.29) and use the
previous identities, then (3.29) is equivalent to
\[
T_k y_k = \| r_0 \| e_1 . \qquad (3.30)
\]
Since A ∈ R^{n×n} is symmetric positive definite, the tridiagonal matrix T_k ∈ R^{k×k} is symmetric
positive definite. We can use the LDL^T decomposition of T_k to solve (3.30).
There exist matrices
\[
L_k = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
\ell_2 & 1 & & \vdots \\
& \ddots & \ddots & 0 \\
0 & \cdots & \ell_k & 1
\end{pmatrix},
\qquad
D_k = \begin{pmatrix}
d_1 & 0 & \cdots & 0 \\
0 & d_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & d_k
\end{pmatrix} \qquad (3.31)
\]
such that
\[
T_k = L_k D_k L_k^T . \qquad (3.32)
\]
If we insert (3.23) and (3.31) into (3.32) and compare the matrix entries on both sides of the identity,
we see that the entries in (3.31) are given by
\[
d_1 = \alpha_1, \qquad
\ell_i = \beta_i / d_{i-1}, \quad d_i = \alpha_i - \ell_i^2 d_{i-1} = \alpha_i - \beta_i \ell_i, \quad i = 2, \ldots, k.
\]
Thus, given L_{k-1}, D_{k-1}, we only need to compute \ell_k = \beta_k / d_{k-1}, d_k = \alpha_k - \beta_k \ell_k
in order to obtain L_k, D_k.
In a naive implementation, we would compute y k by solving (3.30) and the set x k = x 0 + Vk y k .
This would require us to store all columns of Vk . Consequently, the storage requirements would
increase linearly with the iteration count. Fortunately, the fact that Tk is tridiagonal can be used
to limit the storage to only four vectors of length n independent of the number of iterations k
performed. To accomplish this, we define W k ∈ Rn×k and z k ∈ R k by
\[
W_k = V_k L_k^{-T}, \qquad z_k = L_k^T y_k . \qquad (3.33)
\]
Then
\[
x_k = x_0 + V_k y_k = x_0 + V_k L_k^{-T} L_k^T y_k = x_0 + W_k z_k .
\]
If we let W_k = (w_1, w_2, \ldots, w_k) and insert this into (3.33) (written as W_k L_k^T = V_k), then we find
that
\[
(w_1, \; \ell_2 w_1 + w_2, \; \ldots, \; \ell_k w_{k-1} + w_k) = (v_1, v_2, \ldots, v_k).
\]
Thus,
\[
w_1 = v_1, \quad w_2 = v_2 - \ell_2 w_1, \quad \ldots, \quad w_k = v_k - \ell_k w_{k-1} . \qquad (3.34)
\]
If we set z k = (ζ1, ζ2, . . . , ζ k )T in (3.33), then this equation becomes
\[
\begin{pmatrix}
 & & & 0 \\
 & L_{k-1} D_{k-1} & & \vdots \\
 & & & 0 \\
0 & \cdots & \ell_k d_{k-1} & d_k
\end{pmatrix}
\begin{pmatrix} \zeta_1 \\ \zeta_2 \\ \vdots \\ \zeta_k \end{pmatrix}
= \begin{pmatrix} \| r_0 \| \\ 0 \\ \vdots \\ 0 \end{pmatrix} .
\]
Since L_{k-1} D_{k-1} z_{k-1} = \| r_0 \| e_1, it follows that
\[
z_k = \begin{pmatrix} z_{k-1} \\ \zeta_k \end{pmatrix}, \qquad
\zeta_k = -\ell_k d_{k-1} \zeta_{k-1} / d_k , \qquad (3.35)
\]
with
\[
\zeta_1 = \| r_0 \| / d_1 = \| r_0 \| / \alpha_1 .
\]
Hence
x k = x 0 + W k z k = x 0 + W k−1 z k−1 + ζ k w k = x k−1 + ζ k w k . (3.36)
This enables us to make the transition from (vk−1, w k−1, x k−1 ) to (vk , w k , x k ) with a minimal amount
of work and storage.
Using the identity A V_k = V_k T_k + \beta_{k+1} v_{k+1} e_k^T established in Corollary 3.3.5 we obtain the
following formula for the residual,
\[
\| r_k \| = \| A x_k - b \| = \beta_{k+1} \bigl| y_k^{(k)} \bigr| .
\]
The vector y_k is the solution of (3.30) and y_k^{(j)} denotes the j-th component of y_k = L_k^{-T} z_k. Using
(3.31) we conclude that the last component of y_k is equal to the last component ζ_k of z_k. Hence
\[
\| r_k \| = \| A x_k - b \| = \beta_{k+1} | \zeta_k | .
\]
Algorithm 3.5.1
(0) Given A ∈ Rn×n symmetric positive definite, x 0, b ∈ Rn , and > 0.
(1) Compute r 0 = b − Ax 0 .
Set v̂1 = r 0
β1 = kr 0 k, ζ0 = 1,
v0 = 0, k = 0.
(2) While kr k k = | β k+1 ζ k | >
k = k + 1,
If β k , 0, then vk = v̂k / β k ;
Else vk = v̂k (= 0).
Endif
v̂k+1 = Avk − β k vk−1
α k = hv̂k+1, vk i ,
v̂k+1 = v̂k+1 − α k vk
β k+1 = k v̂k+1 k
If k = 1, then
d1 = α1 ,
w1 = v1 ,
ζ1 = β1 /α1 ,
x 1 = x 0 + ζ1 v1 ,
Else
` k = β k /d k−1 ,
d k = α k − βk ` k ,
w k = vk − ` k w k−1 ,
ζ k = −` k d k−1 ζ k−1 /d k ,
x k = x k−1 + ζ k w k .
Endif
End
To implement Algorithm 3.5.1 we need one array to hold the x k , one array to hold the w k and
two arrays to hold vk+1, vk (vk−1 can be overwritten by v̂k+1 ).
Algorithm 3.5.1 is equivalent to the conjugate gradient method (both algorithms generate the
same iterates). The conjugate gradient method will be discussed in Section 3.7 below.
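For reference, the following Matlab function is a direct transcription of the listing of Algorithm 3.5.1 above; it is not part of the original notes and adds a maximum-iteration safeguard that the listing does not have.

% Not from the original notes: Matlab transcription of Algorithm 3.5.1
% (A symmetric positive definite).
function x = lanczos_spd_solve(A, x0, b, eps_tol, maxit)
  x    = x0;
  r0   = b - A*x0;
  vhat = r0;                  % \hat v_1
  beta = norm(r0);            % \beta_1
  zeta = 1;  v = zeros(size(b));  vold = zeros(size(b));
  d = 0;  w = zeros(size(b));  k = 0;
  while abs(beta*zeta) > eps_tol && k < maxit   % ||r_k|| = |beta_{k+1} zeta_k|
    k = k + 1;
    if beta ~= 0, v = vhat/beta; else, v = vhat; end
    vhat  = A*v - beta*vold;
    alpha = vhat'*v;
    vhat  = vhat - alpha*v;
    betanew = norm(vhat);
    if k == 1
      d = alpha;  w = v;  zeta = beta/alpha;  x = x + zeta*v;
    else
      ell  = beta/d;
      dnew = alpha - beta*ell;
      w    = v - ell*w;
      zeta = -ell*d*zeta/dnew;
      d    = dnew;
      x    = x + zeta*w;
    end
    vold = v;  beta = betanew;
  end
end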
3.5.2. SYMMLQ
If we want to extend the approach in Section 3.5.1 to matrices A ∈ Rn×n which are symmetric but
not necessarily positive definite, we encounter two difficulties. First, the Galerkin approximation
problem (3.29) may not have a solution. See Example 3.2.5. Using the Lanczos Iteration, Algorithm
3.3.4, we can transform (3.29) into (3.30) and (3.29) has a solution if and only if (3.30) has a solution.
The second difficulty is that if A ∈ Rn×n is not positive definite, then Tk may not be positive definite
and the LDLT decomposition of Tk cannot be used.
Paige and Saunders [PS75] developed an algorithm SYMMLQ that overcomes these difficulties.
Instead of using the LDL^T decomposition of T_k, they use an LQ decomposition, that is, they generate
a lower triangular matrix
\[
\bar L_k = \begin{pmatrix}
d_1 & & & & \\
e_2 & d_2 & & & \\
f_3 & e_3 & d_3 & & \\
& \ddots & \ddots & \ddots & \\
& & f_k & e_k & \bar d_k
\end{pmatrix}
\]
and an orthogonal matrix Q_k such that
\[
T_k = \bar L_k Q_k .
\]
The system (3.30) can be solved using the LQ decomposition (assuming a solution exists). Paige
and Saunders [PS75] suggest a modification x kL of the Galerkin approximation that always exists
(even if (3.29), (3.30) do not have a solution). Moreover, they show that the errors x kL − x ∗ are
nonincreasing, i.e., that
\[
\| x^* - x_k^L \| \le \| x^* - x_{k-1}^L \| .
\]
Just like Algorithm 3.5.1, SYMMLQ requires a small, fixed amount of storage. The algorithm
is listed below. For details we refer to [PS75].
3.5.3. MINRES
If A ∈ Rn×n is symmetric but not necessarily positive definite, we can compute an approximation to
the solution x^∗ of Ax = b using the minimum residual approach, that is, we compute approximations
x_k by solving
\[
\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \tfrac12 \| A x - b \| .
\]
The basic idea is the same as the one presented in Section 3.4.1. However, since A is symmetric, we
use the Lanczos iteration. Algorithm 3.3.4 generates orthogonal matrices Vk , Vk+1 and a tridiagonal
Tk such that
\[
A V_k = V_{k+1} \begin{pmatrix} T_k \\ \beta_{k+1} e_k^T \end{pmatrix},
\qquad
r_0 = \| r_0 \| V_{k+1} e_1,
\]
where e_1 is the first unit vector in R^{k+1}. See Corollary 3.3.5. The problem \min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \| A x - b \|
is equivalent to
\[
\min_{y \in \mathbb{R}^k} \| A V_k y - r_0 \|
= \min_{y \in \mathbb{R}^k} \left\| \begin{pmatrix} T_k \\ \beta_{k+1} e_k^T \end{pmatrix} y - \| r_0 \| e_1 \right\| .
\]
The structure of these small (k + 1) × k least squares systems can be used to update the solution
x k = x 0 + Vk y k without storing all columns of Vk . The details are given in [PS75]. The resulting
algorithm is known as MINRES. For symmetric matrices, MINRES is mathematically equivalent
to GMRES (without restart), but unlike GMRES the MINRES implementation requires a small,
fixed amount of storage. The algorithm is listed below. For details we refer to [PS75].
A simple minimization algorithm is the gradient method. The gradient of the quadratic function
Q is given by
∇Q(x) = Ax − b.
In Problem 2.8 we have already studied the steepest descent method with constant step size,
x k+1 = x k − α∇Q(x k ),
r_k = −∇Q(x_k) = b − Ax_k ≠ 0.
Theorem 3.6.2 Let A be symmetric positive definite. The iterates generated by the Gradient Method
3.6.1 obey
\[
\| x^* - x_{k+1} \|_A^2
= \left( 1 - \frac{(r_k^T r_k)^2}{(r_k^T A r_k)(r_k^T A^{-1} r_k)} \right) \| x^* - x_k \|_A^2 . \qquad (3.39)
\]
If λ_min, λ_max are the smallest and the largest eigenvalues of A, respectively, then the iterates
generated by the Gradient Method 3.6.1 obey
\[
\| x^* - x_{k+1} \|_A^2 \le \left( \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} \right)^{2} \| x^* - x_k \|_A^2 . \qquad (3.40)
\]
The proof of the second part of the previous theorem uses the Kantorovich inequality, stated in the
following lemma. We leave the proof of Theorem 3.6.2 and of the following lemma as an exercise.
Lemma 3.6.3 (Kantorovich Inequality) Let A be symmetric positive definite. If λ_min, λ_max are
the smallest and the largest eigenvalue of A, respectively, then
\[
\frac{(x^T x)^2}{(x^T A x)(x^T A^{-1} x)} \ge \frac{4 \lambda_{\min} \lambda_{\max}}{(\lambda_{\min} + \lambda_{\max})^2} .
\]
It is not difficult to show that the successive residuals generated by the gradient method are
orthogonal, i.e.,
\[
\langle r_k, r_{k+1} \rangle = 0 .
\]
This leads to a convergence behavior of the gradient method known as zig-zagging. It is illustrated
in Figure 3.2, where we have plotted the contours of Q and the gradient iterates for
\[
A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix},
\qquad
b = \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
x_0 = \begin{pmatrix} 5 \\ 2 \end{pmatrix}.
\]
The solution is x ∗ = (1, 1)T . The plot on the left in Figure 3.2 shows the first iterations and the plot
on the right zooms into the region around the solution.
Figure 3.2: Typical Convergence Behavior of the Gradient Method. The right picture is a zoom of
the picture on the left around the minimizer (1, 1)T .
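The zig-zagging seen in Figure 3.2 can be reproduced with a few lines of code. The following sketch (an illustration, not a transcription of Algorithm 3.6.1, whose listing is not reproduced here) uses the exact steepest descent step size $\alpha_k = \langle r_k, r_k\rangle/\langle Ar_k, r_k\rangle$ on the 2x2 example above; successive residuals are orthogonal, which produces the characteristic pattern.

% Steepest descent with exact line search on the 2x2 example of Figure 3.2.
A = [2 -1; -1 2];  b = [1; 1];  x = [5; 2];
X = x;                                   % store iterates for plotting
for k = 1:25
    r = b - A*x;                         % r_k = -grad Q(x_k)
    alpha = (r'*r) / (r'*A*r);           % exact (steepest descent) step size
    x = x + alpha*r;
    X = [X, x];                          %#ok<AGROW>
end
plot(X(1,:), X(2,:), 'o-');              % successive steps zig-zag toward (1,1)
xlabel('x_1'); ylabel('x_2');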
$$x_{k+1} = x_k + \alpha_k r_k = x_k + (\alpha_k^{-1} I)^{-1} r_k.$$
The step size $\alpha_k$ is chosen so that
$$(\alpha_k^{-1} I)^{-1} r_k \approx A^{-1} r_k. \tag{3.41}$$
Of course the right hand side in (3.41) essentially requires the solution of the original problem,
which is not feasible. Therefore, we replace (3.41) by a condition that only involves information
that is easily available. Given the previous and current iterate x k−1, x k , and the corresponding
gradients −r k−1 = ∇Q(x k−1 ) = Ax k−1 − b , −r k = ∇Q(x k ) = Ax k − b define
∆x = x k − x k−1, ∆r = −r k + r k−1 .
The first choice requires $\alpha_k \Delta r \approx \Delta x$ in the least squares sense, i.e.,
$$\alpha_k^{(1)} = \frac{\langle \Delta x, \Delta r\rangle}{\|\Delta r\|^2}. \tag{3.44}$$
The second choice requires $\alpha_k^{-1}\Delta x \approx \Delta r$ in the least squares sense, i.e.,
$$\alpha_k^{(2)} = \frac{\|\Delta x\|^2}{\langle \Delta x, \Delta r\rangle}. \tag{3.47}$$
In the initial iteration k = 0 where x k−1 and r k−1 are not available the steepest descent step size is
used. This leads to the following algorithm.
Convergence results for Algorithm 3.6.4 are given by Raydan [Ray93] and by Dai and Liao
[DL02]. See also Fletcher’s paper [Fle05] for an overview.
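A minimal sketch (not from the notes, and not a transcription of Algorithm 3.6.4) of the gradient method with the Barzilai–Borwein step size (3.44); as described above, the steepest descent step is used in the first iteration.

% Gradient method with Barzilai-Borwein step size alpha^(1) of (3.44).
function x = bb_gradient(A, b, x, tol, maxit)
    r = b - A*x;
    alpha = (r'*r) / (r'*A*r);           % steepest descent step for k = 0
    for k = 0:maxit-1
        if norm(r) < tol, break; end
        xold = x;  rold = r;
        x = x + alpha*r;
        r = b - A*x;
        dx = x - xold;                   % Delta x
        dr = -r + rold;                  % Delta r = -r_{k+1} + r_k
        alpha = (dx'*dr) / (dr'*dr);     % alpha^(1) from (3.44)
    end
end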
The CG method is equivalent to Algorithm 3.5.1 (both algorithms generate the same iterates) and it
can be derived from Algorithm 3.5.1. For a discussion of the relation between Algorithm 3.5.1 and
the conjugate gradient method derived below see, e.g., the books by Golub and van Loan [GL96,
Secs. 9.3, 10.2] or by Saad [Saa03, Sec 6.7]. In this section we derive the CG method without using
the results from the previous sections.
Our goal is the minimization of the quadratic function
$$Q(x) = \tfrac12 \langle x, Ax\rangle - \langle x, b\rangle. \tag{3.48}$$
The necessary and sufficient optimality conditions are stated in the following theorem, which is
just an application of Theorem 3.2.1.
Theorem 3.7.1 Let A be symmetric and positive definite on span{p0, . . . , pk−1, r k }, i.e., let
$$\langle Av, v\rangle > 0 \quad \forall v \in \operatorname{span}\{p_0, \ldots, p_{k-1}, r_k\},\; v \ne 0.$$
The vector x k+1 ∈ x 0 + span{p0, . . . , pk−1, r k } solves (3.49) if and only if
hAx k+1 − b, vi = 0 ∀v ∈ span{p0, . . . , pk−1, r k }. (3.50)
Now, let us discuss how the search direction pk and the step size α k are computed. For k = 0
we have p0 = r 0 and x 1 = x 0 + α0r 0 , where α0 ∈ R is computed so that (3.50) is satisfied, i.e., so
that
$$\langle A(\alpha_0 r_0) - r_0, r_0\rangle = 0.$$
This gives
$$\alpha_0 = \frac{\|r_0\|^2}{\langle Ar_0, r_0\rangle}.$$
To see how the search direction pk and the step size α k are computed in iteration k > 0, let us
assume we have already computed the solution x k of (3.49) with k replaced by k − 1. We write
x_{k+1} = x_k + \alpha_k p_k. From (3.50) we obtain $\langle A(x_k + \alpha_k p_k) - b, p_i\rangle = 0$ for $i = 0, \ldots, k-1$.
Since $x_k$ solves (3.50) with $k$ replaced by $k-1$ and since $p_{k-1} \in \operatorname{span}\{p_0, \ldots, p_{k-2}, r_{k-1}\}$, we find
$$\langle Ax_k - b, p_i\rangle = 0 \quad \text{for } i = 0, \ldots, k-1,$$
and therefore
$$\alpha_k \langle Ap_k, p_i\rangle = 0 \quad \text{for } i = 0, \ldots, k-1.$$
Lemma 3.7.2 Let $A \in \mathbb{R}^{n\times n}$ be a symmetric positive definite matrix and let $x_k$ satisfy (3.50) with
$k$ replaced by $k-1$. The vector $x_{k+1} = x_k + \alpha_k p_k$, $\alpha_k \ne 0$, satisfies (3.50) if and only if
$$\langle Ap_k, p_i\rangle = 0, \quad i = 0, \ldots, k-1.$$
To continue our discussion of the computation of search direction pk and step size α k , let us
assume that {p0, . . . , pk−1 } is an A–orthogonal basis of span{p0, . . . , pk−2, r k−1 }. We will see in a mo-
ment how this can be accomplished. Lemma 3.7.2 shows that if $x_{k+1} = x_k + \alpha_k p_k$, with $\alpha_k \ne 0$, satisfies (3.50), then $p_0, \ldots, p_{k-1}, p_k$ are A–orthogonal. Since $p_0, \ldots, p_{k-1}, p_k \in \operatorname{span}\{p_0, \ldots, p_{k-1}, r_k\}$,
the vectors p0, . . . , pk−1, pk form an A–orthogonal basis of span{p0, . . . , pk−1, r k }.
Our next task is to compute pk so that p0, . . . , pk−1, pk is an A–orthogonal basis of
span{p0, . . . , pk−1, r k }. This can be accomplished using the Gram-Schmidt process applied with the
scalar product $\langle Ax, y\rangle$ instead of $\langle x, y\rangle$. Let $p_0, \ldots, p_{k-1}$ be A–orthogonal and satisfy $\langle p_i, Ap_i\rangle \ne 0$; then
$$p_k = r_k - \sum_{i=0}^{k-1} \frac{\langle r_k, Ap_i\rangle}{\langle p_i, Ap_i\rangle}\, p_i \tag{3.51}$$
satisfies
hpi, Apk i = 0, i = 0, . . . , k − 1
and {p0, . . . , pk−1, pk } is an A–orthogonal basis of span{p0, . . . , pk−1, r k }. Moreover, pk = 0 if and
only if r k ∈ span{p0, . . . , pk−1 }.
We obtain the following result.
Lemma 3.7.3 Let A ∈ Rn×n be symmetric positive definite. If p0, . . . , pk−1 are A–orthogonal and
satisfy $\langle p_j, Ap_j\rangle \ne 0$, $j = 0, \ldots, k-1$, and if $p_k$ is given by (3.51), then
i. span{p0, . . . , pk−1, pk } = span{p0, . . . , pk−1, r k },
ii. hpk , Apk i = 0 if and only if r k = 0.
Proof: i. The first statement is a consequence of the Gram-Schmidt method.
ii. If r k = 0, then pk = 0 by definition (3.51) of pk and hpk , Apk i = 0. On the other hand,
if hpk , Apk i = 0, then the symmetric positive definiteness of A implies pk = 0. Thus, by part i,
r k ∈ span{p0, . . . , pk−1 }. Theorem 3.7.1 implies
hb − Ax k , p j i = hr k , p j i = 0, j = 0, . . . , k − 1.
The conditions r k ∈ span{p0, . . . , pk−1 } and hr k , p j i = 0, j = 0, . . . , k − 1, imply r k = 0.
Equation (3.51) shows how to compute the search direction pk in step k. Given pk , we have to
calculate α k such that x k+1 = x k + α k pk satisfies
hAx k+1 − b, pi i = hAx k − b, pi i + α k hApk , pi i = 0 for i = 0, . . . , k.
Since x k satisfies (3.50) and since the p j ’s are A–orthogonal,
hAx k − b, pi i = 0, hApk , pi i = 0 for i = 0, . . . , k − 1.
Thus, α k must be chosen so that
hAx k − b, pk i + α k hApk , pk i = 0,
i.e.,
$$\alpha_k = -\frac{\langle Ax_k - b, p_k\rangle}{\langle Ap_k, p_k\rangle} = \frac{\langle r_k, p_k\rangle}{\langle Ap_k, p_k\rangle}.$$
Under the assumptions of Lemma 3.7.3 ii. the step size is well defined as long as $r_k \ne 0$.
This leads to the following algorithm.
(c) $\alpha_k = \langle r_k, p_k\rangle / \langle Ap_k, p_k\rangle$.
(d) $x_{k+1} = x_k + \alpha_k p_k$.
(e) $r_{k+1} = r_k - \alpha_k Ap_k$.
Kk+1 ( A, r 0 ) = span{r 0, Ar 0, . . . , Ak r 0 }.
Theorem 3.7.5 Let A ∈ Rn×n be symmetric positive definite. If x 0, . . . , x k and p0, . . . , pk are the
vectors generated by Algorithm 3.7.4, then the following assertions are true:
i. $\operatorname{span}\{p_0, \ldots, p_k\} = \operatorname{span}\{r_0, \ldots, r_k\} = \mathcal{K}_{k+1}(A, r_0)$,
ii. $\langle r_k, Ap_j\rangle = 0$ for $j = 0, \ldots, k-2$,
iii. hr k , r j i = 0, hr k , p j i = 0 for j = 0, . . . , k − 1.
Moreover, since p0, . . . , pk−1, r 0, . . . , r k−1 ⊂ Kk ( A, r 0 ) by the induction hypothesis, we find that
r k = r k−1 − α k−1 Apk−1 ∈ Kk+1 ( A, r 0 ), and
To prove $\mathcal{K}_{k+1}(A, r_0) \subset \operatorname{span}\{p_0, \ldots, p_{k-1}, r_k\}$, note that
$$Ap_{k-1} = \frac{1}{\alpha_{k-1}}(r_{k-1} - r_k) \in \operatorname{span}\{r_0, \ldots, r_{k-1}, r_k\} = \operatorname{span}\{p_0, \ldots, p_{k-1}, p_k\}.$$
Thus,
$$A^k r_0 = \sum_{i=0}^{k-1} \gamma_i A p_i \in \operatorname{span}\{p_0, \ldots, p_{k-1}, p_k\}.$$
From Theorem 3.7.5 i. we now see that (3.49) is equivalent to the problem
$$\min_{x \in x_0 + \mathcal{K}_{k+1}(A, r_0)} \tfrac12\langle x, Ax\rangle - \langle x, b\rangle. \tag{3.52}$$
This shows that the Conjugate Gradient Method is equivalent to the Lanczos method derived in
Section 3.5.1.
Due to Theorem 3.7.5 step 2b in Algorithm 3.7.4 reduces to
pk = r k + β k−1 pk−1,
where
$$\beta_{k-1} = -\frac{\langle r_k, Ap_{k-1}\rangle}{\langle p_{k-1}, Ap_{k-1}\rangle}. \tag{3.53}$$
Two other simplifications are possible. First, using
$$p_k = r_k - \frac{\langle r_k, Ap_{k-1}\rangle}{\langle p_{k-1}, Ap_{k-1}\rangle}\, p_{k-1}$$
and $\langle r_k, p_{k-1}\rangle = 0$ (Theorem 3.7.5 iii.), we find
$$\langle r_k, p_k\rangle = \|r_k\|^2 - \frac{\langle r_k, Ap_{k-1}\rangle}{\langle p_{k-1}, Ap_{k-1}\rangle}\langle r_k, p_{k-1}\rangle = \|r_k\|^2. \tag{3.54}$$
Thus,
$$\alpha_k = \frac{\langle r_k, p_k\rangle}{\langle Ap_k, p_k\rangle} = \frac{\|r_k\|^2}{\langle Ap_k, p_k\rangle}. \tag{3.55}$$
Moreover, taking the scalar product between $p_{k+1} = r_{k+1} - \big(\langle r_{k+1}, Ap_k\rangle/\langle p_k, Ap_k\rangle\big)\, p_k$ and $r_k$ yields
$$-\frac{\langle r_{k+1}, Ap_k\rangle}{\langle p_k, Ap_k\rangle}\langle r_k, p_k\rangle = -\langle r_{k+1}, r_k\rangle + \langle r_k, p_{k+1}\rangle.$$
Now, using Theorem 3.7.5, the A–orthogonality of the $p_j$'s and (3.54) we find
$$-\frac{\langle r_{k+1}, Ap_k\rangle}{\langle p_k, Ap_k\rangle}\langle r_k, p_k\rangle
= -\langle r_{k+1}, r_k\rangle + \langle r_k, p_{k+1}\rangle
= \langle r_k, p_{k+1}\rangle
= \Big\langle r_k - \frac{\langle r_k, p_k\rangle}{\langle p_k, Ap_k\rangle}\, A p_k,\; p_{k+1}\Big\rangle
= \langle r_{k+1}, p_{k+1}\rangle
= \|r_{k+1}\|^2.$$
Using (3.54) again, we see that
$$\beta_k = -\frac{\langle r_{k+1}, Ap_k\rangle}{\langle p_k, Ap_k\rangle} = \frac{\|r_{k+1}\|^2}{\|r_k\|^2}. \tag{3.56}$$
This gives the following final version of the conjugate gradient method.
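The listing of Algorithm 3.7.6 is not reproduced here; as a hedged companion, the following minimal sketch implements the conjugate gradient iteration exactly as derived above, using the simplified updates (3.55) and (3.56).

% Conjugate gradient sketch using the simplified updates (3.55) and (3.56).
function x = cg_sketch(A, b, x, tol, maxit)
    r = b - A*x;
    p = r;
    rho = r'*r;
    for k = 0:maxit-1
        if sqrt(rho) < tol, break; end      % stopping test on ||r_k||
        Ap = A*p;
        alpha = rho / (p'*Ap);              % alpha_k, see (3.55)
        x = x + alpha*p;
        r = r - alpha*Ap;
        rho_new = r'*r;
        beta = rho_new / rho;               % beta_k, see (3.56)
        p = r + beta*p;
        rho = rho_new;
    end
end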
We will comment on the stopping criteria in Step 2a of the Conjugate Gradient Method in
Section 3.7.2.
The following result on the monotonicity of the Conjugate Gradient iterates is important for
some optimization applications.
Theorem 3.7.7 Let $A \in \mathbb{R}^{n\times n}$ be symmetric positive definite. The iterates generated by the Conjugate Gradient Algorithm 3.7.6 started with $x_0 = 0$ obey the monotonicity property
$$0 < \|x_1\| < \|x_2\| < \cdots.$$
Next we comment on what happens when the Conjugate Gradient Algorithm 3.7.6 is applied to
a problem in which A is symmetric, but not necessarily positive definite.
Remark 3.7.8 (CG for positive semidefinite systems) i. We have derived the conjugate gradient
algorithm for symmetric positive definite systems. However, the conjugate gradient algorithm can
still be used if A ∈ Rn×n is symmetric positive semidefinite and b ∈ R ( A).
Since A is symmetric, the Fundamental Theorem of Linear Algebra implies R ( A) = N ( A) ⊥ .
By induction we can show that all directions $p_k \in \mathcal{R}(A) = \mathcal{N}(A)^\perp$. Since A is symmetric positive
semidefinite, $\langle v, Av\rangle \ge \lambda^+_{\min}\|v\|^2$ for all $v \in \mathcal{N}(A)^\perp$, where $\lambda^+_{\min}$ is the smallest strictly positive
eigenvalue of A. Thus, $\alpha_k$ in step (2b) is well defined. Since $p_k \in \mathcal{N}(A)^\perp$ for all k, it follows that
the iterates x k of the CG algorithm obey
x k ∈ x 0 + N ( A) ⊥ .
The minimum norm solution x † of Ax = b is the solution in N ( A) ⊥ . The iterates generated by the
CG algorithm converge to PN ( A) x 0 + x † , where PN ( A) x 0 is the projection of x 0 onto N ( A). See
Section 3.8.5 for additional details.
The convergence behavior is illustrated in Figure 3.3 below.
ii. If $A \in \mathbb{R}^{n\times n}$ is symmetric positive semidefinite and $b \notin \mathcal{R}(A)$, then the minimization problem
$\min \tfrac12\langle Ax, x\rangle - \langle b, x\rangle$ does not have a solution. If the conjugate gradient method is applied in this
case, then in some iteration k the negative gradient $r_k \ne 0$, but the search direction $p_k$ satisfies (in
exact arithmetic)
$$\langle Ap_k, p_k\rangle = 0$$
and ideally the CG algorithm should be terminated. In floating point arithmetic, however, $\langle Ap_k, p_k\rangle$
will never be exactly zero. Generally, the size of $\langle Ap_k, p_k\rangle$ depends on the specific problem and it is
difficult to determine whether '$\langle Ap_k, p_k\rangle = 0$'.
Consider the matrix
$$A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}.$$
Figure 3.3: Convergence of Conjugate Gradient Method for Symmetric Positive Semidefinite
Systems Ax = b with b ∈ R ( A).
i. If $b = (-1, 2, -1)^T$,
then b ∈ R ( A) and the minimum norm solution of Ax = b is x † = (−1/3, 2/3, −1/3)T . The
Conjugate Gradient Algorithm 3.7.6 with x 0 = 0 terminates in iteration k = 1 with the minimum
norm solution.
ii. If $b = (1, 1, 1)^T$,
then $b \notin \mathcal{R}(A)$ and we are in the situation of Remark 3.7.8, part ii. Application of the Conjugate
Gradient Algorithm 3.7.6 with $x_0 = 0$ gives $p_0 = r_0 = b$, $\|r_0\| = \sqrt{3}$, and $p_0^T A p_0 = 0$.
The previous example is motivated by elliptic PDEs with Neumann boundary conditions.
To discretize this problem we use a finite difference method on a grid 0 = x 0 < x 1 < . . . <
x n+1 = 1 with equidistant points x i = ih and mesh size h = 1/(n + 1). We approximate the
derivatives using central finite differences (see Section 1.3.1), that is we discretize (3.57a) by
$$\frac{-y_{i-1} + 2y_i - y_{i+1}}{h^2} = f(x_i), \qquad i = 0, \ldots, n+1,$$
and we discretize (3.57b) by
$$\frac{y_1 - y_{-1}}{h} = 0, \qquad \frac{y_{n+2} - y_n}{h} = 0.$$
This leads to the following discretization of (3.57).
$$\frac{1}{h^2}\begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}
\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \\ y_{n+1} \end{pmatrix}
= \begin{pmatrix} \tfrac12 f(x_0) \\ f(x_1) \\ \vdots \\ f(x_n) \\ \tfrac12 f(x_{n+1}) \end{pmatrix}. \tag{3.58}$$
The matrix A in (3.58) is symmetric positive semidefinite. It is easy to check that $\mathcal{N}(A) = \operatorname{span}\{e\}$,
where $e = (1, \ldots, 1)^T \in \mathbb{R}^{n+2}$. Thus, the system (3.58) has a solution if and only if
$$\tfrac12 f(x_0) + \sum_{i=1}^{n} f(x_i) + \tfrac12 f(x_{n+1}) = 0.$$
Note that $\int_0^1 f(x)\,dx \approx h\big(\tfrac12 f(x_0) + \sum_{i=1}^{n} f(x_i) + \tfrac12 f(x_{n+1})\big)$ using the composite trapezoidal rule.
Figure 3.4: Minimum norm solution of (3.57) with f (x) = 4π 2 cos(2πx) (solid black line), solution
of (3.58) with n = 39 computed with the Conjugate Gradient Algorithm 3.7.6 and starting value
(0, . . . , 0)T (dashed red line) and with starting value (0.3, . . . , 0.3)T (dash-dotted blue line).
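The experiment of Figure 3.4 can be reproduced along the following lines. This is a hedged sketch (not from the notes): it builds the system (3.58) for $f(x) = 4\pi^2\cos(2\pi x)$ and uses MATLAB's pcg (without a preconditioner) as a stand-in for the Conjugate Gradient Algorithm 3.7.6 with starting value zero.

% Build the (n+2)x(n+2) system (3.58) for f(x) = 4*pi^2*cos(2*pi*x) and solve
% it with CG starting from x0 = 0 (cf. Figure 3.4).
n = 39;  h = 1/(n+1);  xgrid = (0:n+1)'*h;
e = ones(n+2,1);
A = spdiags([-e 2*e -e], -1:1, n+2, n+2) / h^2;
A(1,1) = 1/h^2;  A(end,end) = 1/h^2;          % Neumann rows, cf. (3.58)
f = 4*pi^2*cos(2*pi*xgrid);
rhs = f;  rhs(1) = f(1)/2;  rhs(end) = f(end)/2;
% A is singular but rhs is (up to rounding) in R(A); pcg may return a nonzero
% flag, yet the iterate approximates the minimum norm solution.
[y, flag] = pcg(A, rhs, 1e-8, 200);
plot(xgrid, y, 'r--', xgrid, cos(2*pi*xgrid), 'k-');
legend('CG with x_0 = 0', 'exact solution');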
Figure 3.5: Plot of the residuals and of hApk , pk i for the first 60 iterations for the Conjugate Gradient
Algorithm 3.7.6 applied to (3.58) with n = 39 and incompatible right hand side f (x) = cos(πx/2).
Around iteration k = 42 the quantity hApk , pk i is small, while the residual remains large.
Remark 3.7.11 (CG for indefinite systems) If A is symmetric indefinite, then typically in some
iteration k, the Conjugate Gradient Algorithm 3.7.6 generates a direction pk such that
hApk , pk i < 0.
In this case,
$$\min_{\alpha}\; \tfrac12\langle x_k + \alpha p_k, A(x_k + \alpha p_k)\rangle - \langle x_k + \alpha p_k, b\rangle$$
does not have a solution. The conjugate gradient method should be truncated if hApk , pk i < 0 (or
equivalently α k < 0) is detected.
$$A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & -1 \\ 0 & -1 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.$$
The matrix A is symmetric indefinite and $b \notin \mathcal{R}(A)$.
If we apply the Conjugate Gradient Algorithm 3.7.6 with $x_0 = 0$, then $p_0 = r_0 = b$, $\|r_0\| = \sqrt{3}$
and pT0 Ap0 = −1. The Conjugate Gradient Algorithm 3.7.6 should be terminated in step 2b of the
k = 1 iteration.
Let λ max, λ min be the largest and smallest eigenvalues of A, respectively. As we have already noted
in Section 3.4.1, if the Conjugate Gradient Algorithm 3.7.6 stops with $\|r_k\| < \varepsilon$, then
$$\|x_* - x_k\| \le \lambda_{\min}^{-1}\|r_k\| \le \lambda_{\min}^{-1}\varepsilon, \tag{3.59}$$
and if the Conjugate Gradient Algorithm 3.7.6 stops with $\|r_k\| < \varepsilon\|b\|$, then
$$\frac{\|x_* - x_k\|}{\|x_*\|} \le \frac{\lambda_{\max}}{\lambda_{\min}}\,\frac{\|r_k\|}{\|b\|} \le \frac{\lambda_{\max}}{\lambda_{\min}}\,\varepsilon. \tag{3.60}$$
By design, the iterates x k of the Conjugate Gradient Algorithm 3.7.6 solve
$$\min_{x\in x_0+\mathcal{K}_k(A, r_0)} \tfrac12\langle Ax, x\rangle - \langle b, x\rangle, \tag{3.61}$$
this means that the $x_k$ of the Conjugate Gradient Algorithm 3.7.6 solve
$$\min_{x\in x_0+\mathcal{K}_k(A, r_0)} \tfrac12\|x - x_*\|_A^2.$$
$$\|x - x_*\| \le \lambda_{\min}^{-1/2}\,\|x - x_*\|_A \tag{3.62}$$
and
$$\frac{\|x_* - x_k\|}{\|x_*\|} \le \frac{\lambda_{\max}}{\sqrt{\lambda_{\min}}}\,\frac{\|x_* - x_k\|_A}{\|b\|}. \tag{3.63}$$
If we compare (3.59) and (3.62), then we see that $\|x - x_*\|_A$ is a much better estimate for the error
$\|x_k - x_*\|$ than the residual $\|r_k\|$ if $\lambda_{\min} \ll 1$. Also note that $\|Ax - b\| = \|x - x_*\|_{A^2}$. Of course,
$\|x - x_*\|_A$ is not computable, but this indicates that for symmetric positive definite matrices it is
better to compute approximate solutions by minimizing Q in (3.61) than by minimizing $\|Ax - b\|$.
We also note that x solves the linear least squares problem (3.64) if and only if it solves the normal
equations
AT Ax = AT b. (3.65)
If the rank of A is less than n (which is for example the case when m < n), then (3.64) and (3.65)
have infinitely many solutions. If x ∗ is a particular solution of (3.64) or (3.65), then any vector
in x ∗ + N ( A) also solves (3.64) and (3.65). The minimum norm solution x † of (3.64) or (3.65)
satisfies x † ⊥ N ( A). By the fundamental theorem of linear algebra,
Rn = N ( A) ⊕ R ( AT ), N ( A) ⊥ R ( AT ),
Rm = N ( AT ) ⊕ R ( A), N ( AT ) ⊥ R ( A).
Hence the minimum norm solution can be written as $x^\dagger = A^T y$ for some $y \in \mathbb{R}^m$, and inserting this into (3.65) gives
$$A^T A A^T y = A^T b.$$
CGNR
The CG Method 3.7.6 applied to (3.65) leads to the following algorithm. Here we set
r k = b − Ax k .
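The listing itself is not reproduced here; the following is a hedged sketch of CG applied to the normal equations (3.65) in the usual CGNR form, implemented without explicitly forming $A^T A$.

% CGNR sketch: CG applied to A'*A*x = A'*b without forming A'*A.
function x = cgnr_sketch(A, b, x, tol, maxit)
    r = b - A*x;                  % residual of the original system
    z = A'*r;                     % residual of the normal equations
    p = z;
    for k = 0:maxit-1
        if norm(z) < tol, break; end
        w = A*p;
        alpha = (z'*z) / (w'*w);  % = <z,z> / <A'*A*p, p>
        x = x + alpha*p;
        r = r - alpha*w;
        znew = A'*r;
        beta = (znew'*znew) / (z'*z);
        p = znew + beta*p;
        z = znew;
    end
end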
CGNE
Now we consider a linear system (3.66) with $A \in \mathbb{R}^{m\times n}$ and $b \in \mathcal{R}(A)$. The system matrix $AA^T$ in
(3.66) is symmetric positive semidefinite. Hence, we can use the CG method to solve (3.66). This
gives the following algorithm.
Algorithm 3.7.14
(0) Given $A \in \mathbb{R}^{m\times n}$, $b \in \mathcal{R}(A)$, $y_0 \in \mathbb{R}^m$, $\varepsilon > 0$.
(1) Set r 0 = b − AAT y0 ,
p0 = r 0 .
(2) For k = 0, 1, 2, · · · do
(a) If $\|r_k\| < \varepsilon$ stop; else
(b) α k = kr k k 2 /k AT pk k 2 .
(c) y k+1 = y k + α k pk .
(d) r k+1 = r k − α k AAT pk .
(e) β k = kr k+1 k 2 /kr k k 2 .
(f) pk+1 = r k+1 + β k pk .
The iterates $y_k$ generated by Algorithm 3.7.14 solve
$$\min_{y_k \in y_0 + \mathcal{K}_k(AA^T\!,\, r_0)} \tfrac12\langle AA^T y_k, y_k\rangle - \langle b, y_k\rangle. \tag{3.69}$$
Since $b = Ax^\dagger$,
$$\tfrac12\langle AA^T y_k, y_k\rangle - \langle Ax^\dagger, y_k\rangle = \tfrac12\|A^T y_k - x^\dagger\|^2 - \tfrac12\|x^\dagger\|^2,$$
so the vectors $x_k = A^T y_k$ solve
$$\min_{x_k \in x_0 + \mathcal{K}_k(A^T A,\, A^T r_0)} \tfrac12\|x_k - x^\dagger\|^2. \tag{3.70}$$
The last character in the name CGNE is motivated by the error minimizing property of the
iterates. Sometimes Algorithm 3.7.15 is also called Craig’s method [Cra55].
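A hedged sketch of Algorithm 3.7.14 (CGNE / Craig's method) that additionally carries the iterate $x_k = A^T y_k$ along explicitly; the update of x alongside y is an implementation convenience, not part of the listing above.

% CGNE / Craig's method sketch: CG applied to A*A'*y = b, with x = A'*y
% updated alongside y (cf. Algorithm 3.7.14).
function x = cgne_sketch(A, b, y, tol, maxit)
    x = A'*y;
    r = b - A*x;                     % r_0 = b - A*A'*y_0
    p = r;
    rho = r'*r;
    for k = 0:maxit-1
        if sqrt(rho) < tol, break; end
        q = A'*p;
        alpha = rho / (q'*q);        % alpha_k = ||r_k||^2 / ||A' p_k||^2
        y = y + alpha*p;
        x = x + alpha*q;             % maintains x = A'*y
        r = r - alpha*(A*q);         % r_{k+1} = r_k - alpha_k A A' p_k
        rho_new = r'*r;
        beta = rho_new / rho;        % beta_k = ||r_{k+1}||^2 / ||r_k||^2
        p = r + beta*p;
        rho = rho_new;
    end
end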
We define the error and the residual
$$e(x) \overset{\rm def}{=} x_* - x, \qquad r(x) \overset{\rm def}{=} b - Ax.$$
These representations will be important in the convergence analysis of Krylov subspace methods.
If x ∈ x 0 + Kk ( A, r 0 ), where r 0 = b − Ax 0 , then the error obeys
for the polynomial pk−1 of degree k − 1 that appears in the error representation.
Algorithm 3.5.1 and the Conjugate Gradient Algorithm 3.7.6 both compute iterates x k that
solve
$$\min_{x\in x_0 + \mathcal{K}_k(A, r_0)} \tfrac12\langle Ax, x\rangle - \langle b, x\rangle.$$
or, equivalently,
where
P k−1 is the set of all polynomials of degree less than or equal to k − 1
and e0 = x ∗ − x 0 , e k = x ∗ − x k . If p ∈ P k−1 , then the polynomial q(t) = 1 − p(t)t satisfies
q ∈ Pk, q(0) = 1.
Hence, Krylov subspace methods using Galerkin approximations generate iterates x k such that
λ 1 ≥ . . . ≥ λ n > 0.
The spectrum of A is
σ( A) = {λ 1, . . . , λ n }
Let v j denote the jth column of V . Using
$$e_0 = \sum_{j=1}^{n} \langle e_0, v_j\rangle\, v_j, \qquad v_i^T v_j = \delta_{ij},$$
j=1
we find that
Consequently, Krylov subspace methods using Galerkin approximations generate iterates x k such
that
$$\|e_k\|_A \le \min_{q\in P_k,\, q(0)=1}\; \max_{\lambda\in\sigma(A)} |q(\lambda)|\; \|e_0\|_A. \tag{3.77}$$
and
where P k−1 is the set of all polynomials of degree less than or equal to k − 1. Hence
Instead of comparing the residual r k with the initial residual r 0 we can also derive the following
estimate.
Minimum residual methods are used when A is symmetric, but not positive (semi-)definite, or
when A is non-symmetric. When A is diagonalizable, we can repeat the argument of Section 3.8.2.
However, eigenvalues of A may be complex and the matrix V of eigenvectors in general will not be
orthogonal (the matrix V of eigenvectors can be chosen unitary, $V^* = V^{-1}$, if and only if A is normal, $A^*A = AA^*$).
Of course, non-symmetric matrices A may not be diagonalizable at all. The lack of (unitary)
diagonalizability of A makes the convergence analysis of minimum residual methods in the general
case difficult.
If A ∈ Rn×n is diagonalizable, i.e. if A = V DV −1 , where D is a diagonal matrix, then
q( A) = V q(D)V −1 .
Hence,
kq( A)k ≤ κ 2 (V )kq(D)k,
where
κ 2 (V ) = kV k kV −1 k
is the condition number of V . If A ∈ Rn×n is diagonalizable by an orthogonal matrix V , then
κ 2 (V ) = 1.
The diagonal entries of D are the eigenvalues of A. Since A is not necessarily symmetric, A
may have complex eigenvalues. If σ( A) ⊂ C is the set of eigenvalues of A, then
Thus, if A is diagonalizable we obtain the following estimates from (3.80) and (3.81).
If we replace σ( A) by the interval [a, b] with 0 < a < b, then we can compute the solution of the
best approximation problem analytically using the so-called Chebyshev polynomials. See [Riv90].
Definition 3.8.1 For $k \in \mathbb{N}_0$ the Chebyshev Polynomials (of the first kind) $T_k$ are defined recursively by
$$T_0(x) = 1, \qquad T_1(x) = x, \qquad T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x), \quad k \ge 1.$$
Using the identity $\cos((k+1)\theta) + \cos((k-1)\theta) = 2\cos(\theta)\cos(k\theta)$, one can see that $\cos(k\theta)$ defines a polynomial in $x = \cos(\theta) \in [-1,1]$ and that on $[-1,1]$ the k-th Chebyshev polynomial is given by
$$T_k(x) = \cos\big(k \arccos(x)\big).$$
(Figure: the Chebyshev polynomials $T_n$, $n = 0, \ldots, 5$, on $[-1, 1]$.)
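A short numerical check (not from the notes) of the three-term recurrence against the cosine representation on $[-1,1]$:

% Evaluate T_k(x) by the three-term recurrence and compare with cos(k*acos(x)).
k = 5;  x = linspace(-1, 1, 201);
Tkm1 = ones(size(x));          % T_0
Tk   = x;                      % T_1
for j = 2:k
    Tnew = 2*x.*Tk - Tkm1;     % T_j = 2 x T_{j-1} - T_{j-2}
    Tkm1 = Tk;  Tk = Tnew;
end
max(abs(Tk - cos(k*acos(x))))  % should be of the order of machine precision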
In particular,
$$|T_k(x)| \le 1, \quad x \in [-1, 1], \tag{3.88}$$
and
$$|T_k(x)| > 1, \quad x \notin [-1, 1]. \tag{3.89}$$
Furthermore, the extrema of the Chebyshev polynomial Tk are
x j = cos( jπ/k), j = 0, 1, . . . , k
with
Tk (x j ) = (−1) j , j = 0, 1, . . . , k,
Theorem 3.8.3 Let $0 < a < b$ and $k \in \mathbb{N}$. The solution of $\min_{q\in P_k,\, q(0)=1}\max_{x\in[a,b]}|q(x)|$ is given by
$$q_k^*(x) = T_k\!\left(\frac{b + a - 2x}{b - a}\right) \Big/ T_k\!\left(\frac{b + a}{b - a}\right).$$
The maximum is given by
$$\max_{x\in[a,b]} |q_k^*(x)| = \left[T_k\!\left(\frac{b + a}{b - a}\right)\right]^{-1}. \tag{3.90}$$
Proof: Since a > 0 we have that (b + a)/(b − a) > 1 and, thus, the denominator in the definition
of qk∗ is greater than one (see (3.89)). By construction, the polynomial qk∗ satisfies qk∗ (0) = 1.
The proof of optimality of qk∗ is by contradiction. Suppose that p̃k ∈ P k with p̃k (0) = 1 is a
polynomial with
$$\max_{x\in[a,b]} |q_k^*(x)| > \max_{x\in[a,b]} |\tilde{p}_k(x)|. \tag{3.91}$$
Let $\tilde{x}_0 < \tilde{x}_1 < \cdots < \tilde{x}_k$ be the points in $[a, b]$ at which $q_k^*$ attains its extreme values with alternating sign, and set $r = \tilde{p}_k - q_k^*$. Then (3.91) implies
$$r(\tilde{x}_i) \begin{cases} < 0, & i = 0, 2, 4, \ldots, \\ > 0, & i = 1, 3, 5, \ldots. \end{cases}$$
Thus, the polynomial r has k zeros in the intervals ( x̃ j , x̃ j+1 ), j = 0, 1, . . . , k − 1. Moreover,
r (0) = p̃k (0) − qk∗ (0) = 0. Hence, since r ∈ P k has k + 1 zeros, we can conclude that r = 0.
Equation (3.90) follows immediately from (3.88).
We derive convergence estimates from (3.92) by selecting sets $\Lambda \supset \sigma(A)$ and constructing
polynomials.
If we set Λ = [λ min, λ max ] ⊃ σ( A) in (3.92a), use Theorem 3.8.3 with [a, b] = [λ min, λ max ],
and apply Remark 3.8.4, then we obtain the following convergence result.
Theorem 3.8.6 Let A ∈ Rn×n be a symmetric positive definite matrix and let λ min , λ max be
the smallest and the largest eigenvalues of A, respectively. The conjugate gradient iterations
x k ∈ x 0 + Kk ( A, r 0 ) satisfy
$$\|x_k - x_*\|_A \le 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|x_0 - x_*\|_A,$$
where κ = λ max /λ min .
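As a hedged numerical illustration (this worked example is not in the notes): for $\kappa = 100$ one has $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1) = 9/11 \approx 0.82$, so roughly $k \approx 73$ iterations guarantee a reduction of the A-norm error by a factor of $10^6$, since $2\,(9/11)^{73} \le 10^{-6}$. By contrast, the one-step bound of the next theorem only gives the factor $(\kappa-1)/(\kappa+1) = 99/101 \approx 0.98$ per iteration, which would require on the order of $690$ iterations for the same reduction; this is one way to see why the dependence on $\sqrt{\kappa}$ rather than $\kappa$ matters.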
Theorem 3.8.6 estimates the overall reduction of the error in the A–norm, but it does not indicate
by how much the error in the A–norm decreases in each iteration. Such a result can be obtained if
we set Λ = [λ min, λ max ] ⊃ σ( A) in (3.92b) and use Theorem 3.8.3 with [a, b] = [λ min, λ max ].
Theorem 3.8.7 Let A ∈ Rn×n be a symmetric positive definite matrix and let λ min , λ max be
the smallest and the largest eigenvalues of A, respectively. The conjugate gradient iterations
x k ∈ x 0 + Kk ( A, r 0 ) satisfy
$$\|x_k - x_*\|_A \le \frac{\kappa - 1}{\kappa + 1}\,\|x_{k-1} - x_*\|_A,$$
where κ = λ max /λ min .
If A has a few well separated small eigenvalues $\lambda_1 \le \ldots \le \lambda_\ell$ with $\lambda_\ell \ll \lambda_{\ell+1}$, and a few well
separated large eigenvalues $\lambda_{n-r+1} \le \ldots \le \lambda_n$ with $\lambda_{n-r} \ll \lambda_{n-r+1}$, then the following theorem
gives a better estimate than Theorem 3.8.6.
Theorem 3.8.8 Let $A \in \mathbb{R}^{n\times n}$ be a symmetric positive definite matrix with eigenvalues $0 < \lambda_1 \le \cdots \le \lambda_n$. The conjugate gradient iterates satisfy
$$\|x_{\ell+r+k} - x_*\|_A \le 2\left(\prod_{i=1}^{\ell} \frac{\lambda_{n-r}}{\lambda_i}\right)\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|x_0 - x_*\|_A,$$
where $\kappa = \lambda_{n-r}/\lambda_{\ell+1}$.
Theorem 3.8.9 If
A = ρI + Ac,
where ρ > 0 and Ac ∈ Rn×n is a symmetric positive semidefinite matrix with eigenvalues µ1 ≥
µ2 ≥ . . . ≥ µn ≥ 0, then the iterates of the Conjugate Gradient method obey
$$\|x_k - x_*\|_A \le \left(\prod_{j=1}^{k} \frac{\mu_j}{\mu_j + \rho}\right) \|x_0 - x_*\|_A.$$
The estimate in Theorem 3.8.9 is better than the one in Theorem 3.8.6 if the eigenvalues µ j
of Ac decay to zero sufficiently fast. Theorem 3.8.9 is based on the work by Winther [Win80].
Theorem 3.8.9 explains the excellent performance of the Conjugate Gradient algorithm applied to
the regularized data assimilation least squares problem (1.55).
$$x_* = \sum_{i=1}^{r} \frac{\langle b, v_i\rangle}{\lambda_i}\, v_i + \sum_{i=r+1}^{n} \gamma_i v_i,$$
and $\|x_* - x_k\|_A \to 0$ implies $P_{\mathcal{R}}(x_* - x_k) = P_{\mathcal{R}} x_* - P_{\mathcal{R}} x_k = x^\dagger - (P_{\mathcal{R}} x_0 + \hat{x}_k) \to 0$, so that
$$x_k \to x^\dagger + P_{\mathcal{N}} x_0.$$
The convergence of the Conjugate Gradient Method in the positive semidefinite case is illustrated
in Figure 3.3. See also the earlier Examples 3.7.9 and 3.7.10.
Proof: From Theorem 3.2.7 we know that there exists a polynomial $p_{k_*-1}$ of degree less than or equal
to $k_* - 1 \le n - 1$ such that $x_* - x_0 = A^{-1}r_0 = p_{k_*-1}(A)r_0$ and $r_0 - Ap_{k_*-1}(A)r_0 = 0$. Hence,
Note that the proof does not require the diagonalizability of A. If A is diagonalizable and has J
distinct eigenvalues we can proceed as in the proof of Theorem 3.8.5 but with (3.92a) replaced by
(3.93a) to show that GMRES or MINRES converges in at most J iterations.
If $A \in \mathbb{R}^{n\times n}$ is nonsingular and symmetric indefinite, then there exist an orthogonal matrix
$V \in \mathbb{R}^{n\times n}$ and a real diagonal matrix $D \in \mathbb{R}^{n\times n}$ with
$$A = V D V^T,$$
and hence $\kappa_2(V) = 1$.
Again, we derive convergence estimates from (3.93) by selecting sets $\Lambda$ and constructing
polynomials.
If $A \in \mathbb{R}^{n\times n}$ is nonsingular and symmetric indefinite, then $\sigma(A) \subset [a, b] \cup [c, d]$ with $a \le b < 0 < c \le d$. If we set
$$\bar\lambda = \max_{\lambda\in\sigma(A)} |\lambda| \qquad\text{and}\qquad \underline\lambda = \min_{\lambda\in\sigma(A)} |\lambda|,$$
then
$$[a, b] \subset [-\bar\lambda, -\underline\lambda], \qquad [c, d] \subset [\underline\lambda, \bar\lambda].$$
Our convergence results are based on (3.93) with
$$\Lambda = \{\lambda \in \mathbb{R} : \underline\lambda \le |\lambda| \le \bar\lambda\} \supset \sigma(A).$$
Let $[k/2]$ denote the largest integer less than or equal to $k/2$. If we use the fact that for
q ∈ P[k/2] with q(0) = 1 the polynomial q(λ 2 ) satisfies q(λ 2 ) ∈ P k and q(02 ) = q(0) = 1, we can
prove the following result, cf. e.g. [Sto83, p. 547].
Theorem 3.8.11 Let A ∈ Rn×n be a nonsingular, symmetric indefinite matrix. If x k are MINRES
iterates, then the residuals $r_k = b - Ax_k$ obey
$$\|r_k\| \le 2\left(\frac{\kappa - 1}{\kappa + 1}\right)^{[k/2]} \|r_0\|,$$
where $\kappa = \bar\lambda/\underline\lambda$. In general the residuals do not decrease in every iteration if the matrix is indefinite.
Remark 3.8.14 The assumption implicitly underlying Theorems 3.8.11 and 3.8.13 is that the
intervals containing the eigenvalues of A are of equal size and that they have the same distance
from the origin:
$$[a, b] \subset [-\bar\lambda, -\underline\lambda], \qquad [c, d] \subset [\underline\lambda, \bar\lambda].$$
If this is the case and if the eigenvalues are equally distributed, then Theorems 3.8.11 and 3.8.13
give a good description. However, as in the positive definite case the distribution and clustering
of the eigenvalues will be important for the convergence of the method, and if there are few well
separated clusters of eigenvalues, Theorems 3.8.11 and 3.8.13 will be too pessimistic.
Theorem 3.8.15 Let A ∈ Rn×n and let x k be the minimum residual approximation. If the symmetric
part AS = 21 ( A + AT ) of A is positive definite, then the residuals r k = b − Ax k obey
$$\|r_k\| \le \left[1 - \frac{\lambda_{\min}(A_S)^2}{\lambda_{\max}(A^T A)}\right]^{1/2} \|r_{k-1}\|,$$
where λ min ( AS ) and λ max ( AT A) denote the smallest eigenvalue of AS and the largest eigenvalue
of AT A, respectively.
Thus, $\left[1 - \dfrac{\lambda_{\min}(A_S)^2}{\lambda_{\max}(A^T A)}\right]^{1/2}$ is well defined.
ii. If the symmetric part AS is positive definite, then A is nonsingular. To see this note that
Ax = 0 implies 0 = xT Ax = xT AS x. Since AS is positive definite we find that x = 0.
Moreover, for all $x$ with $\|x\| = 1$,
$$x^T A^T A x \le \lambda_{\max}(A^T A) \qquad\text{and}\qquad x^T A x = x^T A_S x \ge \lambda_{\min}(A_S),$$
so that
$$\|(I + \alpha A)x\|^2 = 1 + 2\alpha\, x^T A x + \alpha^2\, x^T A^T A x \le 1 + 2\alpha\,\lambda_{\min}(A_S) + \alpha^2\,\lambda_{\max}(A^T A)$$
for $\alpha < 0$. The term on the right hand side is minimized by $\alpha = -\lambda_{\min}(A_S)/\lambda_{\max}(A^T A)$, and with
this choice of α
$$\min_{\alpha} \|I + \alpha A\| \le \left[1 - \frac{\lambda_{\min}(A_S)^2}{\lambda_{\max}(A^T A)}\right]^{1/2},$$
which yields the desired estimate.
The situation of Theorem 3.8.15 frequently occurs if the linear system is obtained from a
discretization of a partial differential equation. For example consider the linear systems (1.22) and
(1.36) which arise in the finite difference discretization of (1.12) and (1.29), respectively. Using the
Gershgorin Circle Theorem 2.4.4 one can show that the symmetric parts of the matrices in (1.22)
and (1.36) are symmetric positive definite if c ≥ 0, c1, c2 ≥ 0, and r ≥ 0.
If A is diagonalizable, then one can derive error estimates from (3.93). See, for example [SS86]
and [Saa03, Sec 6.11.4]. However, if V is not unitary, then κ(V ) may be large and the resulting
estimate based on (3.93) may be useless. Thus if A is not unitarily diagonalizable, i.e. if A is
not normal, then the eigenvalues may be irrelevant for the convergence of the minimum residual
methods. See [TE05]. To see what can happen when A is not diagonalizable, consider the following
example.
$$A = \begin{pmatrix} 1 & 1 & & & \\ & 1 & 1 & & \\ & & \ddots & \ddots & \\ & & & 1 & 1 \\ & & & & 1 \end{pmatrix} \in \mathbb{R}^{n\times n}, \qquad
b = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix} \in \mathbb{R}^n.$$
The matrix A has one eigenvalue $\lambda = 1$ with multiplicity n. The eigenspace corresponding to the
eigenvalue $\lambda = 1$ is the span of $e_1$, the first unit vector. Thus, A is not diagonalizable.
The solution of the system Ax = b is given by x ∗ = ((−1) n−1, . . . , −1, 1, −1, 1)T .
If we start GMRES with x 0 = 0, then the orthogonal vectors generated by the Arnoldi process
are given by
vi = en−i+1, i = 1, . . . , n.
Thus, although the eigenvalues of A are perfectly clustered, GMRES needs n iterations to reach the
solution.
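The slow convergence can be observed directly with MATLAB's gmres; the following is a hedged sketch (not from the notes) that builds the bidiagonal matrix above for n = 30 and plots the residual history, which reaches zero only at iteration n (in exact arithmetic).

% GMRES on the bidiagonal example: all eigenvalues equal 1, yet n iterations
% are needed to reach the solution.
n = 30;
A = spdiags([ones(n,1) ones(n,1)], [0 1], n, n);   % 1 on diagonal and superdiagonal
b = [zeros(n-1,1); 1];
[x, flag, relres, iter, resvec] = gmres(A, b, [], 1e-12, n);
semilogy(0:numel(resvec)-1, resvec / norm(b), 'o-');
xlabel('k'); ylabel('||r_k|| / ||r_0||');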
3.9. Preconditioning
The convergence of Krylov subspace method is strongly influenced by the distribution of eigenvalues
of the system matrix A. Roughly speaking, if A is normal, the convergence is better the more the
eigenvalues are clustered and the fewer clusters there are. If A is not normal the situation
is unfortunately more complicated (as we have seen, e.g., in Example 3.8.17).
If A does not have this property, then one can replace the original system
Ax = b (3.94)
by the system
$$K_L^{-1} A K_R^{-1}\, \hat{x} = K_L^{-1} b, \qquad \hat{x} = K_R x, \tag{3.95}$$
where $K_L, K_R$ are nonsingular matrices chosen such that 1) the distribution of eigenvalues of
$K_L^{-1} A K_R^{-1}$ is more favorable, and 2) the application of $K_L^{-1}$ and $K_R^{-1}$ is relatively inexpensive (so
that the potential savings due to the reduced number of iterations when the Krylov subspace method
is applied to $K_L^{-1} A K_R^{-1}\hat{x} = K_L^{-1} b$ are not destroyed by the more expensive matrix vector
multiplications with $K_L^{-1} A K_R^{-1}$).
If the original matrix A is symmetric, then we typically want the transformed system matrix to
be symmetric as well, and in this case we require that
K L = K, KR = KT .
It is easy to apply the Krylov subspace method to the transformed system (3.95). However,
since we are interested in $x = K_R^{-1}\hat{x}$ and not $\hat{x}$, we formulate the Krylov subspace method applied to
the transformed system (3.95) in terms of the original variables. If the matrix A is symmetric and
$K_L = K_R^T = K$, this has the additional advantage that we do not need K, but only $M = KK^T$. The
fact that we only need M, and not a factorization $M = KK^T$, is important for constructing preconditioners.
We will discuss some preconditioned Krylov subspace methods next, and later (see Section
3.9.4) introduce a few basic but common preconditioners.
Hence, in search for a preconditioner, we look for a symmetric positive definite matrix M
such that AM −1 or, equivalently, M −1 A has a favorable eigenvalue distribution, and then we can
construct K so that
M = K KT .
The computation of such a K could be expensive. Fortunately, however, a matrix K with $M = KK^T$
is never needed in the implementation of preconditioned conjugate gradient method. As we will
see shortly, we only have to solve systems where the system matrix is M.
We set
$$\hat{A} = K^{-1} A K^{-T}, \qquad \hat{x} = K^T x, \qquad \hat{b} = K^{-1} b.$$
Let us apply the Conjugate Gradient Algorithm 3.7.6 to the preconditioned system (3.96). By
$\hat{x}_k, \hat{r}_k, \hat{p}_k$ we denote the vectors computed by the conjugate gradient method applied to $\hat{A}\hat{x} = \hat{b}$.
With the transformations
$$\hat{x}_k = K^T x_k, \qquad \hat{r}_k = K^{-1} r_k, \qquad \hat{p}_k = K^T p_k,$$
we obtain
$$\langle \hat{p}_k, \hat{A}\hat{p}_k\rangle = \langle \hat{p}_k, K^{-1}AK^{-T}\hat{p}_k\rangle = \langle p_k, A p_k\rangle, \qquad \|\hat{r}_k\|^2 = \langle r_k, (KK^T)^{-1} r_k\rangle.$$
Moreover,
$$K^T p_{k+1} = \hat{p}_{k+1} = \hat{r}_{k+1} + \beta_k \hat{p}_k = K^{-1} r_{k+1} + \beta_k K^T p_k$$
and
$$p_{k+1} = (KK^T)^{-1} r_{k+1} + \beta_k p_k.$$
Thus, if we set M = K K T and introduce a vector z k = M −1r k , we obtain the algorithm stated next.
Before we state the final version of the preconditioned CG method, we note that since K is
nonsingular, $M = KK^T$ is symmetric positive definite. On the other hand, if M is symmetric
positive definite, then we can find a nonsingular K such that $M = KK^T$. Therefore, we only need
M. The matrix K is used to construct the preconditioned CG method, but is not needed for its
implementation.
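The listing of Algorithm 3.9.1 is not reproduced here; as a hedged companion, the following sketch shows the standard preconditioned CG loop, which indeed uses only solves with M and never a factor K.

% Preconditioned CG sketch: only solves with M = K*K' are needed, never K itself.
function x = pcg_sketch(A, M, b, x, tol, maxit)
    r = b - A*x;
    z = M \ r;                    % z_k = M^{-1} r_k
    p = z;
    rho = r'*z;
    for k = 0:maxit-1
        if norm(r) < tol, break; end
        Ap = A*p;
        alpha = rho / (p'*Ap);
        x = x + alpha*p;
        r = r - alpha*Ap;
        z = M \ r;
        rho_new = r'*z;
        beta = rho_new / rho;     % beta_k = <r_{k+1}, z_{k+1}> / <r_k, z_k>
        p = z + beta*p;
        rho = rho_new;
    end
end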
Since the Preconditioned Conjugate Gradient Algorithm 3.9.1 is equivalent to the Conjugate
Gradient Algorithm 3.7.6 applied to (3.96), the iterates solve
$$\min_{x\in x_0 + \mathcal{K}_k(M^{-1}A,\, M^{-1}r_0)} \tfrac12\langle Ax, x\rangle - \langle b, x\rangle. \tag{3.97}$$
Corollary 3.9.2 Let A, M ∈ Rn×n be symmetric positive definite. The iterates generated by the
Preconditioned Conjugate Gradient Algorithm 3.7.6 started with x 0 = 0 obey the monotonicity
property
$$0 < \|x_1\|_M < \|x_2\|_M < \cdots.$$
Moreover, in the preconditioned version the residual norm $\|K_L^{-1} r_k\|$ is monitored, not the residual
norm of the original problem. In Algorithm 3.9.4 the iteration is terminated if the transformed residual $\hat{r}_k = K^{-1} r_k$ is small.
Using the transformations
$$\hat{v}_k = K^T v_k, \qquad \hat{\tilde{v}}_k = K^T \tilde{v}_k, \qquad \hat{r}_k = K^{-1} r_k,$$
we obtain
$$\tilde{v}_{k+1} = (KK^T)^{-1} A v_k - \delta_k v_{k-1} = (KK^T)^{-1}\big(A v_k - \delta_k\, KK^T v_{k-1}\big),$$
$$\gamma_k = \langle KK^T \tilde{v}_{k+1}, v_k\rangle, \qquad \delta_{k+1} = \langle KK^T \tilde{v}_{k+1}, \tilde{v}_{k+1}\rangle^{1/2}.$$
Introducing new vectors $\tilde{u}_k = KK^T\tilde{v}_k$ and setting $M = KK^T$ we obtain the algorithm given
next. As we have already noted in Section 3.9.1, M is symmetric positive definite if and only if there
exists a nonsingular K such that $M = KK^T$. Therefore, we only need M. The matrix K is used to
construct the algorithm, but is not needed for its implementation.
As we have mentioned in Section 3.5 we do not have to store all vectors $v_1, \ldots, v_k$ and $\tilde{u}_1, \ldots, \tilde{u}_k$,
but only the vectors $v_k$ and $\tilde{u}_{k-1}, \tilde{u}_k, \tilde{u}_{k+1}$. This is done in the preconditioned SYMMLQ, which is
stated below. However, Algorithm 3.9.5 shows how to transform from the $\hat{v}_k$'s to the $v_k$, and how to
apply the preconditioner in the form $M = KK^T$ without using the factors K. The same ideas apply
to MINRES. The preconditioned SYMMLQ is stated as Algorithm 3.9.6 and the preconditioned
MINRES is stated as Algorithm 3.9.7.
(2) For k = 1, 2, . . . do
      $\tilde{u}_{k+1} = A v_k - \dfrac{\delta_k}{\delta_{k-1}}\,\tilde{u}_{k-1}$,
      $\gamma_k = \langle \tilde{u}_{k+1}, v_k\rangle$,
      $\tilde{u}_{k+1} = \tilde{u}_{k+1} - \dfrac{\gamma_k}{\delta_k}\,\tilde{u}_k$,
      Solve $M\tilde{v}_{k+1} = \tilde{u}_{k+1}$,
      $\delta_{k+1} = \langle \tilde{v}_{k+1}, \tilde{u}_{k+1}\rangle^{1/2}$.
      If $\delta_{k+1} \ne 0$, then $v_{k+1} = \tilde{v}_{k+1}/\delta_{k+1}$;
      Else $v_{k+1} = \tilde{v}_{k+1}$ (= 0).
      Endif
      If k = 1, then
            $\bar{d}_k = \gamma_k$, $\tilde{e}_{k+1} = \delta_{k+1}$,
      Elseif k > 1, then
            Apply Givens rotation $G_k$ to row k:
                  $\bar{d}_k = s_k \tilde{e}_k - c_k \gamma_k$,
                  $e_k = c_k \tilde{e}_k + s_k \gamma_k$.
            Apply Givens rotation $G_k$ to row k + 1:
                  $f_{k+1} = s_k \delta_{k+1}$,
                  $\tilde{e}_{k+1} = -c_k \delta_{k+1}$.
      Endif
      Determine Givens rotation $G_{k+1}$:
            $d_k = \sqrt{\bar{d}_k^2 + \delta_{k+1}^2}$,
            $c_{k+1} = \bar{d}_k/d_k$,
            $s_{k+1} = \delta_{k+1}/d_k$.
      If k = 1, then
            $\zeta_1 = \delta_1/d_1$.
      Elseif k = 2, then
            $\zeta_2 = -\zeta_1 e_2/d_2$,
      Elseif k > 2, then
            $\zeta_k = (-\zeta_{k-1} e_k - \zeta_{k-2} f_k)/d_k$,
      Endif
      $x_k^L = x_{k-1}^L + \zeta_k\big(c_{k+1}\bar{w}_k + s_{k+1} v_{k+1}\big)$.
      $\bar{w}_{k+1} = s_{k+1}\bar{w}_k - c_{k+1} v_{k+1}$.
      If $\|r_k\| < \varepsilon$ goto (3).
End
(3) $x_k = x_k^L + (\zeta_k s_{k+1}/c_{k+1})\,\bar{w}_{k+1}$.
In Chapter 2 we studied splitting methods based on a decomposition
$$A = M - N$$
with nonsingular $M \in \mathbb{R}^{n\times n}$. Specifically, we rewrote the linear system $Ax = b$ as the fixed point
equation $x = M^{-1}Nx + M^{-1}b$ and studied the fixed point iteration
$$x_{k+1} = M^{-1}N x_k + M^{-1}b = x_k + M^{-1}(b - Ax_k). \tag{3.100}$$
This iteration converges for any initial vector x if and only if the spectral radius of $I - M^{-1}A$ is less
than one, that is if and only if all eigenvalues of I − M −1 A are inside the unit circle in the complex
plane. Since eigenvalues λ of I − M −1 A and µ of M −1 A are related via λ = 1 − µ, the eigenvalues
of I − M −1 A are inside the unit circle if and only if the eigenvalues of M −1 A are inside the circle
of radius one with center one. In particular the eigenvalues of M −1 A are clustered. This suggested
the use of the matrix M as a preconditioner.
Note that for (3.100) to converge as a standalone iterative method, all eigenvalues of $M^{-1}A$ must be inside the
circle of radius one with center one. However, M can still be used as a preconditioner if there are
eigenvalues of $M^{-1}A$ that are outside this circle.
When Krylov subspace methods are used that exploit the symmetry of the system matrix, such
as the Conjugate Gradient Method 3.9.1, SYMMLQ 3.9.6 , or MINRES 3.9.7, the preconditioner
M must be symmetric positive definite. This is one reason why we introduced the symmetric SOR
method and the symmetric Gauss-Seidel method in Problem 2.1. The matrix M for these methods
is symmetric positive definite.
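As a hedged illustration (Problem 2.1 itself is not shown here): for a symmetric matrix written as $A = D + L + L^T$, the symmetric Gauss–Seidel preconditioner is commonly expressed as $M = (D+L)D^{-1}(D+L)^T$, which is symmetric positive definite when A is. The sketch below passes it to MATLAB's pcg in factored form, so that only triangular and diagonal solves are performed; the model problem is an arbitrary SPD example.

% Symmetric Gauss-Seidel preconditioner M = (D+L) D^{-1} (D+L)^T, used with pcg.
n  = 100;  e = ones(n,1);
A  = spdiags([-e 2*e -e], -1:1, n, n);       % SPD model problem (illustration only)
b  = ones(n,1);
D  = spdiags(spdiags(A,0), 0, n, n);
L  = tril(A,-1);
M1 = (D + L) / D;          % lower triangular factor (D+L) D^{-1}
M2 = (D + L)';             % upper triangular factor, so M1*M2 = (D+L) D^{-1} (D+L)^T
[x, flag, relres, iter] = pcg(A, b, 1e-10, 200, M1, M2);
fprintf('flag = %d, iterations = %d, relative residual = %.2e\n', flag, iter, relres);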
3.10. Problems
Problem 3.1 Show that if the symmetric part $\tfrac12(A + A^T)$ of the matrix $A \in \mathbb{R}^{n\times n}$ is positive definite
on Vk ⊂ Rn , i.e. if there exists c > 0 such that
Problem 3.3 Let A ∈ Rn×n be nonsingular. We want to compute approximations x k of the solution
x ∗ of Ax = b using Galerkin approximations. That is we want to compute x k ∈ x 0 + Kk ( A, r 0 )
such that
hAx k − b, vi = 0 ∀ v ∈ Kk ( A, r 0 ). (3.102)
i. Describe an algorithm that uses the Arnoldi Iteration to compute an orthonormal basis for
$\mathcal{K}_k(A, r_0)$ and uses this basis to write (3.102) as a $k \times k$ linear system.
This algorithm is known as the Full Orthogonalization Method (FOM).
ii. Implement your algorithm in Matlab . If the system does not have a unique solution, your
algorithm should return with an error message.
iii. Apply your algorithm to solve the first linear system in Example 3.2.5.
iv. Apply your algorithm to solve (1.24) with h = 0.02 and the data specified in Example 1.3.1.
Show that the matrix A in (1.24) has positive definite symmetric part. (Hint: Gershgorin
Circle Theorem 2.4.4.) Hence Problem 3.1 guarantees the well-posedness of the FOM for
this example.
Problem 3.4
• [Emb03] Apply GMRES to the system Ax = b with
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 3 \\ 0 & 0 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 2 \\ -4 \\ 1 \end{pmatrix}.$$
Set x 0 = 0 and restart after m = 1, after m = 2 and after m = 3. In all cases set the maximum
number of iterations to 30.
Generate one plot that shows the normalized residuals kr k k/kr 0 k for the three restart cases.
• Apply GMRES(m) to solve (1.24) with h = 0.02 and the data specified in Example 1.3.1.
Use m = 2, 5, 10, 20, 60. (Since m = 60 > n, this corresponds to full GMRES - no restart).
Generate one plot that shows the normalized residuals kr k k/kr 0 k for the five restart cases.
Problem 3.7 Let A ∈ Rn×n be a symmetric positive definite matrix. Show that if v1, . . . , v j ∈ Rn
are nonzero and A–orthogonal, then they are linearly independent.
Problem 3.9 Let A ∈ Rn×n be a symmetric positive definite matrix, let B ∈ Rn×m and let I ∈ Rm×m
be the identity matrix.
Ax = b − Bd (3.104)
The previous result shows that even though the Conjugate Gradient Algorithm 3.7.6 is applied
to the nonsymmetric system (3.103), it effectively only “sees” the small symmetric positive definite
system (3.104), provided that the initial iterate is chosen appropriately. This result is important in
applications of the Conjugate Gradient Algorithm 3.7.6 to linear systems that arise from the finite
element discretization of partial differential equations with Dirichlet boundary conditions.
The finite difference discretization with mesh size h = 1/(n + 1) leads to the (n + 2) × (n + 2)
linear system
$$\begin{pmatrix} 1 & 0 & 0 \\ -e_1 & A & -e_n \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} y_0 \\ y \\ y_{n+1} \end{pmatrix} = \begin{pmatrix} 1 \\ b \\ -1 \end{pmatrix} \tag{3.105}$$
with y = (y1, . . . , yn )T , e1, en being the first and nth unit vector in Rn ,
$$A = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix} \in \mathbb{R}^{n\times n}
\qquad\text{and}\qquad b = h^2\big(f(h), \ldots, f(nh)\big)^T.$$
Even though in (3.105) the I block is not in the bottom right, but split, the result in part i still
applies since a symmetric permutation of (3.105) leads to a system of the type (3.103).
– Use the Conjugate Gradient method to solve the (n + 2) × (n + 2) system (3.105) with
f (x) = π 2 cos(πx) and n = 30 using the initial iterate
The exact solution of the differential equation is y(x) = cos(πx). For both cases plot the
solution computed by pcg as well as the convergence history.
Note: For the system (3.105) arising from a one-dimensional differential equation it is easy
to eliminate y0 and yn+1 . However, for the discretization of partial differential equations in
higher dimensions the approach in this problem is very convenient and frequently used.
Problem 3.10 Let A, B, V ∈ Rn×n , V invertible, and let p, q be arbitrary polynomials. Show the
following identities.
(i) Ap( A) = p( A) A.
(ii) If A = V −1 BV , then p( A) = V −1 p(B)V .
(iii) kp( A)q( A)k ≤ kp( A)k kq( A)k .
Let A ∈ Rn×n be symmetric positive definite and B ∈ Rn×n . Show the following identities.
(iv) kBk A = k A1/2 B A−1/2 k ,
(v) kp(B)k A = k A1/2 p(B) A−1/2 k ,
(vi) kp( A)k A = kp( A)k .
Problem 3.13 Let A ∈ Rn×n be a nonsingular, symmetric indefinite matrix with eigenvalues
Problem 3.14 Show that the directions pk and the residuals r k generated by the Preconditioned
Conjugate Gradient Algorithm 3.9.1 obey
Show that the iterates x k of the Preconditioned Conjugate Gradient Algorithm 3.9.1 solve
$$\min_{x\in x_0 + \mathcal{K}_k(M^{-1}A,\, M^{-1}r_0)} \tfrac12\langle Ax, x\rangle - \langle b, x\rangle$$
and, if x 0 = 0, obey
0 < k x1 kM < k x2 kM < . . . .
Problem 3.15 Let A be a symmetric positive definite matrix with constant diagonal. Show that
the Preconditioned Conjugate Gradient Algorithm 3.9.1 with Jacobi preconditioning produces the
same iterates as the (unpreconditioned) Conjugate Gradient Algorithm 3.7.6.
Problem 3.16 Follow the approach in Section 3.9.1 to derive the preconditioned gradient method
with steepest descent step-size and with Barzilai-Borwein step size.
Problem 3.18 This problem explores the implementation of the preconditioned conjugate gradient
method using Gauss-Seidel-type preconditioners which is described in [Eis81].
TO BE ADDED.
[Ben02] M. Benzi. Preconditioning techniques for large linear systems: a survey. J. Comput. Phys.,
182(2):418–477, 2002. URL: http://dx.doi.org/10.1006/jcph.2002.7176,
doi:10.1006/jcph.2002.7176.
[Cra55] E. J. Craig. The n-step iteration procedures. J. of Mathematics and Physics, 34:64–73,
1955.
[DL02] Y.-H. Dai and L.-Z. Liao. R-linear convergence of the Barzilai and Borwein gradient
method. IMA J. Numer. Anal., 22(1):1–10, 2002. URL: http://dx.doi.org/10.
1093/imanum/22.1.1, doi:10.1093/imanum/22.1.1.
[Emb03] M. Embree. The tortoise and the hare restart GMRES. SIAM Rev., 45(2):259–266
(electronic), 2003.
[Fle05] R. Fletcher. On the Barzilai-Borwein method. In L. Qi, K. Teo, and X. Yang, editors,
Optimization and control with applications, volume 96 of Appl. Optim., pages 235–256.
Springer, New York, 2005. URL: http://dx.doi.org/10.1007/0-387-24255-4_
10, doi:10.1007/0-387-24255-4_10.
[GL96] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Studies in the
Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition,
1996.
[GO89] G. H. Golub and D. P. O’Leary. Some history of the conjugate gradient and Lanczos
algorithms: 1948–1976. SIAM Rev., 31(1):50–102, 1989.
[Gre97] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia,
1997.
[HS52] M.R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems.
J. of Research National Bureau of Standards, 49:409–436, 1952.
[O’L01] D. P. O’Leary. Commentary on methods of conjugate gradients for solving linear systems
by Magnus R. Hestenes and Eduard Stiefel. In D. R. Lide, editor, A Century of Excellence
in Measurements, Standards, and Technology - A Chronicle of Selected NBS/NIST Pub-
lications 1901-2000, pages 81–85. Natl. Inst. Stand. Technol. Special Publication 958,
U. S. Government Printing Office, Washington, D. C, 2001. Electronically available
at http://nvlpubs.nist.gov/nistpubs/sp958-lide/cntsp958.htm (accessed
February 6, 2012).
[OT14] M. A. Olshanskii and E. E. Tyrtyshnikov. Iterative Methods for Linear Systems: Theory
and Applications. SIAM, Philadelphia, 2014.
[PS75] C. C. Paige and M. A. Saunders. Solution of sparse indefinite systems of linear equations.
SIAM J. Numer. Anal., 12:617–629, 1975.
[Ray93] M. Raydan. On the Barzilai and Borwein choice of steplength for the gradient method.
IMA J. Numer. Anal., 13(3):321–326, 1993. URL: http://dx.doi.org/10.1093/
imanum/13.3.321, doi:10.1093/imanum/13.3.321.
[Riv90] T. J. Rivlin. Chebyshev Polynomials. From Approximation Theory to Algebra and Number
Theory. Pure and Applied Mathematics (New York). John Wiley & Sons Inc., New York,
second edition, 1990.
[Saa03] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, second
edition, 2003.
[SS86] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving
nonsymmetric linear systems. SIAM J. Sci. Stat. Comp., 7:856–869, 1986.
[Sto83] J. Stoer. Solution of large linear systems of equations by conjugate gradient type methods.
In A. Bachem, M. Grötschel, and B. Korte, editors, Mathematical Programming, The
State of The Art, pages 540–565. Springer Verlag, Berlin, Heidelberg, New-York, 1983.
[TB97] L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[TE05] L. N. Trefethen and M. Embree. Spectra and Pseudospectra. The Behavior of Nonnormal
Matrices and Operators. Princeton University Press, Princeton, NJ, 2005.
[Vor03] H. A. van der Vorst. Iterative Krylov Methods for Large Linear Systems, volume 13
of Cambridge Monographs on Applied and Computational Mathematics. Cambridge
University Press, Cambridge, 2003.
[Win80] R. Winther. Some superlinear convergence results for the conjugate gradient methods.
SIAM J. Numer. Anal., 17:14–17, 1980.
Chapter
4
Introduction to Unconstrained
Optimization
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.2 Existence of Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.3 Unconstrained Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . 201
4.4 Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
4.5 Convergence of Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
4.1. Introduction
We study the solution of unconstrained minimization problems
$$\min_{x\in\mathbb{R}^n} f(x),$$
where $f : \mathbb{R}^n \to \mathbb{R}$. Maximization problems can be converted into minimization problems, since
$\max_x f(x) = -\min_x\big(-f(x)\big)$. Hence, all results derived for minimization problems can be readily applied to maximization
problems.
$$\min\; f(x) \quad\text{s.t.}\quad x \in \mathcal{F}, \tag{4.1}$$
Definition 4.2.1 The point x ∗ ∈ Rn is called a local minimum of f over F , if there exists r > 0
such that
f (x ∗ ) ≤ f (x) for all x ∈ Br (x ∗ ) ∩ F .
The point $x_* \in \mathbb{R}^n$ is called a strict local minimum of f over $\mathcal{F}$, if there exists $r > 0$ such that
$f(x_*) < f(x)$ for all $x \in B_r(x_*) \cap \mathcal{F}$, $x \ne x_*$.
Recall that F ⊂ Rn is compact if and only if it is closed and bounded. Many feasible sets F
are not bounded. In such cases we can obtain an existence result if we impose stronger conditions
on the objective function. In particular, we require the growth condition
$$f(x) \to \infty \quad\text{as } \|x\| \to \infty,\; x \in \mathcal{F}. \tag{4.2}$$
Theorem 4.2.3 If the set F ⊂ Rn of feasible points is closed and if f : F → R is continuous and
satisfies (4.2), then there exists x ∗ ∈ F such that f (x ∗ ) = inf { f (x) : x ∈ F }.
Proof: Let f ∗ = inf { f (x) : x ∈ F } and let {x k } be a sequence of feasible points x k ∈ F such
that lim k→∞ f (x k ) = f ∗ . In particular { f (x k )} is bounded. Because of (4.2), the sequence {x k }
must also be bounded. Thus, there exists M > 0 such that k x k k ≤ M for all k ∈ N. Since
x k ∈ {x ∈ Rn : k xk ≤ M } and lim k→∞ f (x k ) = f ∗ ,
$$f_* = \inf\big\{ f(x) : x \in \mathcal{F} \cap \{x \in \mathbb{R}^n : \|x\| \le M\} \big\}.$$
The set F ∩ {x ∈ Rn : k xk2 ≤ M } is closed and bounded and, hence, compact. Thus, Theorem
4.2.2 gives the existence of x ∗ ∈ F ∩ {x ∈ Rn : k xk2 ≤ M } ⊂ F such that f (x ∗ ) = f ∗ .
If f˜ is continuous and bounded from below, then the function f (x) = f˜(x) + α2 k xk22 , α > 0,
satisfies (4.2) with k · k = k · k2 . In many applications one is really interested in minimizing f˜(x),
but the objective function used in the optimization is f (x) = f˜(x) + α2 k xk22 , α > 0. Theorem 4.2.3
provides one reason why one may do this.
It should be noted that Theorems 4.2.2 and 4.2.3 make statements about global minima. The
minimization algorithms that will be discussed later in this chapter are only guaranteed to find
so-called local minima. We will define local and global minima next, and provide necessary and
sufficient conditions for a point x ∗ to be a (local) minimum.
Lemma 4.3.1 Let r > 0 and let f : Br (x) → R be continuously differentiable on Br (x). If
x + v ∈ Br (x), then
$$f(x+v) = f(x) + \nabla f(x)^T v + \int_0^1 \big(\nabla f(x+tv) - \nabla f(x)\big)^T v \, dt. \tag{4.3}$$
Proof: Define φ(t) = f (x + tv). We have φ0 (t) = ∇ f (x + tv)T v and φ00 (t) = vT ∇2 f (x + tv)v.
The fundamental theorem of calculus states that
$$\varphi(1) = \varphi(0) + \varphi'(0) + \int_0^1 \big(\varphi'(t) - \varphi'(0)\big)\, dt. \tag{4.7}$$
for all v with kvk < . This proves (4.4). Equation (4.6) can be obtained analogously.
0 ≤ f (x ∗ + tv) − f (x ∗ )
for all v ∈ Rn with kvk = 1 and all t ∈ (−r, r). If we set v equal to the ith unit vector ei , this implies
$$0 \le \lim_{t\to 0^+} \frac{f(x_* + te_i) - f(x_*)}{t} = \frac{\partial}{\partial x_i} f(x_*), \qquad
0 \le \lim_{t\to 0^+} \frac{f(x_* - te_i) - f(x_*)}{t} = -\frac{\partial}{\partial x_i} f(x_*).$$
Hence, $\frac{\partial}{\partial x_i} f(x_*) = 0$.
ii. From part i we know that ∇ f (x ∗ ) = 0. Suppose ∇2 f (x ∗ ) is not positive semidefinite. Then
there exist λ > 0 and w ∈ Rn such that
wT ∇2 f (x ∗ )w ≤ −λ kwk22 .
Let $\sigma = \lambda/2$. Lemma 4.3.1 and the definition of a local minimum guarantee the existence of
$\varepsilon \in (0, r)$ such that with $v = \big(\varepsilon/(2\|w\|_2)\big)\, w$,
$$0 \le f(x_* + v) - f(x_*)
= \tfrac12 v^T\nabla^2 f(x_*) v + \int_0^1\!\!\int_0^1 t\, v^T\big(\nabla^2 f(x_* + \tau t v) - \nabla^2 f(x_*)\big) v \, d\tau\, dt
< \tfrac12 v^T\nabla^2 f(x_*) v + \sigma\|v\|_2^2 \le -\frac{\lambda}{2}\|v\|_2^2 + \sigma\|v\|_2^2 = 0.$$
This is a contradiction. Hence our assumption that ∇2 f (x ∗ ) is not positive semidefinite must be
false.
The proof of the second part of Theorem 4.3.2 showed how to generate a point x with a function
value lower than f (x ∗ ) when the Hessian ∇2 f (x ∗ ) is not positive semidefinite. Such information
will be important when we design algorithms for the computation of minimum points.
The proof of the second part of this theorem is identical to our proof of Theorem 4.3.2(ii). The
proof of the first part can be carried out analogously using the first part of Lemma 4.3.1.
Theorem 4.3.3 shows that directions d with ∇ f (x)T d < 0 are descent directions, provided that
f is continuously differentiable, and that directions of negative curvature are descent directions,
provided that f is twice continuously differentiable and ∇ f (x) = 0. In particular, eigenvectors
of ∇2 f (x) corresponding to negative eigenvalues are directions of negative curvature and descent
directions. Geometrically, ∇ f (x)T d < 0 if and only if the angle between d and −∇ f (x) is less than
90 degrees (see Figure 4.1).
Theorem 4.3.5 (Sufficient Optimality Conditions) Suppose there exists $r > 0$ such that $f :
\mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable on $B_r(x_*)$. If $\nabla f(x_*) = 0$ and if $\nabla^2 f(x_*)$ is positive
definite, then $x_*$ is a strict local minimum. More precisely, there exist $c > 0$ and $\varepsilon > 0$ such that
$$f(x) \ge f(x_*) + c\,\|x - x_*\|^2 \quad \text{for all } x \in B_\varepsilon(x_*). \tag{4.8}$$
Let $\lambda > 0$ denote the smallest eigenvalue of $\nabla^2 f(x_*)$ and let $\sigma < \lambda/4$. Lemma 4.3.1 guarantees the existence of $\varepsilon > 0$ such that for all $x \in B_\varepsilon(x_*)$, with
$v = x - x_*$,
$$f(x) - f(x_*) = f(x_* + v) - f(x_*)
= \tfrac12 v^T\nabla^2 f(x_*) v + \int_0^1\!\!\int_0^1 t\, v^T\big(\nabla^2 f(x_* + \tau t v) - \nabla^2 f(x_*)\big) v \, d\tau\, dt
\ge \frac{\lambda}{2}\|v\|_2^2 - \sigma\|v\|_2^2 \ge \frac{\lambda}{4}\|v\|_2^2.$$
Since all norms on Rn are equivalent, this gives the assertion.
There is a gap between the necessary optimality conditions in Theorem 4.3.2 and the sufficient
optimality conditions in Theorem 4.3.5. The necessary conditions imply that at a minimum x ∗ ,
∇ f (x ∗ ) = 0 and ∇2 f (x ∗ ) is positive semidefinite. However, we need positive definiteness of
∇2 f (x ∗ ) and ∇ f (x ∗ ) = 0 to guarantee that x ∗ is a local minimum. If these two conditions are
satisfied, then x ∗ is even a strong local minimum and the local quadratic growth condition (4.8) is
satisfied.
If f is a quadratic function, this gap can be overcome. For quadratic minimization problems
$$\min\; \tfrac12 x^T H x + c^T x + d, \tag{4.9}$$
where H ∈ Rn×n is symmetric, c ∈ Rn and d ∈ R we have the following result on the characterization
of solutions and existence of solutions.
H x = −c (4.10)
ii. The quadratic minimization problem (4.9) has a solution if and only if H ∈ Rn×n is symmetric
positive semi-definite and c ∈ R (H). In this case the set of solutions of (4.9) is given by
S = x ∗ + N (H),
where x ∗ denotes a particular solution of (4.10) and N (H) denotes the null space of H.
With this identity, the first part can be proven using the techniques applied in the proof of
Theorem 4.3.2. The second part follows directly from the theory of linear systems applied to
(4.10).
Later, see Theorem 4.4.5, we will generalize the previous theorem and show that the gap
between necessary and sufficient optimality conditions can be closed if f is a convex continuously
differentiable function. In this case ∇ f (x ∗ ) = 0 is necessary and sufficient for an optimum.
t x + (1 − t)y ∈ C
The convexity of sets and the convexity of functions are related through the following theorem.
Proof: i. Let f be convex and let x, y ∈ C be arbitrary. If x = y, then (4.11) is trivial. Thus, let
x , y. Consider the function
Thus, (4.11) is violated if we set y = x + th and, by i., f can not be convex. This is a contradiction.
Therefore the Hessian must be positive semidefinite for all x ∈ C.
The Hessian of
$$f(x) = \tfrac12 x^T H x + c^T x + d$$
is given by $\nabla^2 f(x) = H$. Hence $f(x) = \tfrac12 x^T H x + c^T x + d$ is convex if and only if H is positive semidefinite.
Theorem 4.4.4 Let the feasible set $\mathcal{F} \subset \mathbb{R}^n$ be convex and let $f : \mathcal{F} \to \mathbb{R}$ be convex. If there exists
a global minimum, then all local minima are global minima and the set $S = \{x_* \in \mathcal{F} : f(x_*) = f_*\}$ of
global minima is convex.
Proof: i. Suppose x̄ ∈ F is a local minimum but not a global minimum. Then there exists x ∗ ∈ C
such that f (x ∗ ) < f ( x̄). Since f is convex,
for all t ∈ (0, 1]. This contradicts the assumption that x̄ is a local minimum.
ii. Let x 1, x 2 ∈ F be global minima, i.e., f (x i ) ≤ f (x) for all x ∈ C, i = 1, 2. By the convexity
of f ,
f (t x 1 + (1 − t)x 2 ) ≤ t f (x 1 ) + (1 − t) f (x 2 ) ≤ f (x)
for all x ∈ F and for all t ∈ [0, 1]. This proves the convexity of S.
For general twice continuously differentiable nonlinear functions there is a gap between the
necessary optimality conditions in Theorem 4.3.2 and the sufficient optimality conditions in Theo-
rem 4.3.5. This gap disappears if f is convex.
Proof: If x ∗ is a global minimum, then Theorem 4.3.2 implies that ∇ f (x ∗ ) = 0. On the other
hand, if ∇ f (x ∗ ) = 0, then (4.11) with y = x ∗ implies f (x) ≥ f (x ∗ ). Thus, x ∗ is a global minimum.
(i) The sequence is called q–linearly convergent if there exist c ∈ (0, 1) and k̂ ∈ N such that
kz k+1 − z∗ k ≤ c kz k − z∗ k
(ii) The sequence is called q–superlinearly convergent if there exists a sequence {ck } with ck > 0
and lim k→∞ ck = 0 such that
kz k+1 − z∗ k ≤ ck kz k − z∗ k
or, equivalently, if
kz k+1 − z∗ k
lim = 0.
k→∞ kz k − z∗ k
(iii) If there exist $p > 1$, $c > 0$, and $\hat{k} \in \mathbb{N}$ such that
$$\|z_{k+1} - z_*\| \le c\, \|z_k - z_*\|^p$$
for all $k \ge \hat{k}$, then the sequence is called q–convergent with q–order at least p. In particular,
if $p = 2$, we say the sequence is q–quadratically convergent and if $p = 3$, we say the sequence
is q–cubically convergent.
Figure 4.4 illustrates q-linear convergence, q-superlinear convergence, and q-quadratic convergence using the sequences $z_k = 1/2^k$, $z_k = 1/k!$, and $z_0 = 1$, $z_k = 1/2^{2^{k-1}}$, $k \ge 1$, respectively.
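A small numerical check (not from the notes) of the three model sequences: the ratio of consecutive terms is constant for the q-linear sequence, tends to zero for the q-superlinear one, and shrinks roughly like the square of the current term for the q-quadratic one.

% Compute the three model sequences of Figure 4.4 and the ratios z_{k+1}/z_k.
K = 8;  k = (1:K)';
zlin  = 1 ./ 2.^k;                 % q-linear:      ratio constant (1/2)
zsup  = 1 ./ factorial(k);         % q-superlinear: ratio -> 0
zquad = 1 ./ 2.^(2.^(k-1));        % q-quadratic:   z_{k+1} is of the order z_k^2
disp([zlin(2:end)./zlin(1:end-1), zsup(2:end)./zsup(1:end-1), ...
      zquad(2:end)./zquad(1:end-1)]);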
Remark 4.5.2 For q-linear convergence the choice of norm is important. For example, consider the
sequence $\{z_k\} \subset \mathbb{R}^2$ defined by
$$z_{2k} = \frac{1}{(2\sqrt{2})^k}\,(1, 0)^T, \qquad z_{2k+1} = \frac{1}{(2\sqrt{2})^k}\,\Big(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\Big)^T.$$
We have
$$\|z_{2k}\|_\infty = \frac{1}{(2\sqrt{2})^k} = \frac12\,\frac{1}{\sqrt{2}\,(2\sqrt{2})^{k-1}} = \frac12\|z_{2k-1}\|_\infty, \qquad
\|z_{2k+1}\|_\infty = \frac{1}{\sqrt{2}\,(2\sqrt{2})^k} = \frac{1}{\sqrt{2}}\|z_{2k}\|_\infty,$$
and therefore the sequence converges q-linearly with q-factor $1/\sqrt{2}$ in the $\infty$-norm. However, if we
use the 2-norm, then
$$\|z_{2k}\|_2 = \frac{1}{(2\sqrt{2})^k} = \frac{1}{2\sqrt{2}}\,\frac{1}{(2\sqrt{2})^{k-1}} = \frac{1}{2\sqrt{2}}\|z_{2k-1}\|_2, \qquad
\|z_{2k+1}\|_2 = \frac{1}{(2\sqrt{2})^k} = \|z_{2k}\|_2,$$
which means the sequence does not converge q-linearly in the 2-norm.
Since all norms in Rl are equivalent, if a sequence converges q–superlinearly/q–order at least
p > 1 in one norm, it also converges q–superlinearly/q–order at least p > 1 in any other norm.
then we say that the sequence $\{z_k\}$ converges r–quadratically to $z_*$. The condition (4.12) is equivalent
to the existence of $\kappa > 0$ and $\tilde{c} \in (0, 1)$ such that
$$\|z_k - z_*\| \le \kappa\, \tilde{c}^{\,2^k} \quad \forall k \in \mathbb{N}. \tag{4.13}$$
(i) The sequence is called r–linearly convergent if there exist $c \in (0, 1)$ and $\kappa > 0$ such that
$$\|z_k - z_*\| \le \kappa\, c^k \quad \forall k \in \mathbb{N}.$$
(ii) The sequence is called r–superlinearly convergent if there exist $\kappa > 0$ and a sequence $\{c_k\}$
with $c_k > 0$ and $\lim_{k\to\infty} c_k = 0$ such that
$$\|z_k - z_*\| \le \kappa \prod_{i=1}^{k} c_i \quad \forall k \in \mathbb{N}.$$
(iii) The sequence is said to converge r–quadratically, if there exist $\kappa > 0$ and $c \in (0, 1)$ such that
$$\|z_k - z_*\| \le \kappa\, c^{2^k} \quad \forall k \in \mathbb{N}.$$
Remark 4.5.4 Since all norms in Rn are equivalent, if a sequence converges r–linearly/r–
superlinearly/r–quadratically in one norm, it also converges r–linearly/r–superlinearly/r–
quadratically in any other norm.
Note that the sequence $\{z_k\} \subset \mathbb{R}^2$ from Remark 4.5.2 defined by
$$z_{2k} = \frac{1}{(2\sqrt{2})^k}\,(1, 0)^T, \qquad z_{2k+1} = \frac{1}{(2\sqrt{2})^k}\,\Big(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\Big)^T,$$
satisfies
$$\|z_{2k}\|_2 = \frac{1}{(2\sqrt{2})^k} = \frac{1}{\big(\sqrt{2\sqrt{2}}\big)^{2k}}, \qquad
\|z_{2k+1}\|_2 = \frac{1}{(2\sqrt{2})^k} = \sqrt{2\sqrt{2}}\,\frac{1}{\big(\sqrt{2\sqrt{2}}\big)^{2k+1}},$$
i.e.,
$$\|z_k\|_2 \le \kappa\, c^k \qquad\text{with } \kappa = \sqrt{2\sqrt{2}} \text{ and } c = 1/\sqrt{2\sqrt{2}}.$$
Remark 4.5.5 Although the conjugate gradient (CG) method introduced in Section 3.7 converges
in n iterations and therefore the notion of convergence of sequences does not apply, the
convergence results in Section 3.8.5 essentially state q-convergence and r-convergence results for
the CG method. In particular, Theorem 3.8.7 essentially states the q-linear convergence of the
CG method. Theorems 3.8.6 and 3.8.8 state the r-linear convergence of the CG method, and
Theorem 3.8.9 essentially states r-superlinear convergence of the CG method.
The paper [Pot89] by Potra gives sufficient conditions for a sequence to have the q-order and/or
the r-order of convergence greater than one.
The next result shows that q-linear convergence of the steps z k −z k−1 implies r-linear convergence
of the iterates z k − z∗ .
Lemma 4.5.6 Let {z k } ⊂ Rl be a sequence of vectors. If there exist c ∈ (0, 1) and k̄ ∈ N such that
then there exists $z_* \in \mathbb{R}^l$ such that the sequence $\{z_k\}$ converges to $z_*$ at least r–linearly.
for all $i \ge 0$ and all $k \ge \bar{k}$. This shows that $\{z_k\}$ is a Cauchy sequence. There exists a limit $z_*$ of this
sequence. Letting $l \to \infty$ in the previous inequality yields
$$\|z_* - z_k\| \le \left(\frac{c^{1-\bar{k}}}{1 - c}\,\|z_{\bar{k}} - z_{\bar{k}-1}\|\right) c^k$$
for all $k \ge \bar{k}$. This implies the r–linear convergence.
4.6. Problems
Problem 4.2 (See [Ber95, p.13]) In each of the following problems fully justify your answer using
optimality conditions
i. Show that the 2–dimensional function f (x, y) = (x 2 − 4) 2 + y 2 has two global minima and
one stationary point (point at which ∇ f (x, y) = 0), which is neither a local maximum nor a
local minimum.
ii. Show that the 2–dimensional function f (x, y) = (y − x 2 ) 2 − x 2 has only one stationary point,
which is neither a local maximum nor a local minimum.
iii. Find all local minima of the 2-dimensional function f(x, y) = (1/2) x^2 + x cos(y).
Problem 4.3 (See [Hes75], [Ber95, p.14]) Let f : Rn → R be a differentiable function. Suppose
that a point x_* is a local minimum of f along every line that passes through x_*, that is, for every p ∈ R^n the function
φ(t) = f(x_* + t p) has a local minimum at t = 0.
i. Show that ∇ f (x ∗ ) = 0.
ii. Show by example that x ∗ need not be a local minimizer of f .
(Hint: Consider f (x, y) = (y − αx 2 )(y − βx 2 ) with 0 < α < β and (x ∗, y∗ ) = (0, 0). For
α < γ < β, f(x, γx^2) < 0 if x ≠ 0.
What are the eigenvalues of ∇2 f at (x ∗, y∗ )?)
Problem 4.4 Let H ∈ Rn×n be symmetric positive semi-definite, c ∈ Rn , d ∈ R, and consider the
function
f(x) = (1/2) x^T H x + c^T x + d.
Show that f is convex using the Definition 4.4.1 of a convex function. (Do not use Theo-
rem 4.4.3.)
Show that f is bounded from below if and only if c ∈ R (H). (Hint: Use that H can be
diagonalized by an orthogonal matrix and express R (H) in terms of the eigenvectors of H.)
where α ∈ (0, 1) and 1 < p < 2. Show that {z k } converges r–quadratically, but that the q–
convergence order is less than or equal to p.
Problem 4.7 Let A ∈ Rn×n be nonsingular and let k · k, ||| · ||| be a vector and a matrix norm such
that k Mvk ≤ |||M ||| kvk and |||M N ||| ≤ |||M ||| |||N ||| for all M, N ∈ Rn×n , v ∈ Rn .
Schulz’s method for computing the inverse of A generates a sequence of matrices {X k } via the
iteration
X k+1 = 2X k − X k AX k .
ii. Show that {X k } converges q-quadratically to A−1 for any X0 with |||I − AX0 ||| < 1.
iii. Show that {X_k} converges q-quadratically for any X_0 = α A^T with α ∈ (0, 2/λ_max), where
λ max is the largest eigenvalue of AAT .
(Hint: First use ||| · ||| = k · k2 and ii., then equivalence of norms on Rn×n .)
[Pot89] F. A. Potra. On Q-order and R-order of convergence. J. Optim. Theory Appl., 63(3):415–
431, 1989.
5.1. Introduction
Let f : Rn → R be twice differentiable. We want to compute a (local) minimizer x ∗ of f . Let x k
be a guess for x ∗ . Then min x f (x) is equivalent to mins f (x k + s). Of course the second problem
is as difficult as the first one. Therefore we replace the nonlinear function f by its quadratic Taylor
approximation,
f(x_k + s) ≈ m_k(x_k + s) = f(x_k) + ∇f(x_k)^T s + (1/2) s^T ∇^2 f(x_k) s, and minimize the quadratic model, i.e., solve min_s m_k(x_k + s). (5.1)
The quadratic problem (5.1) has a unique solution s k if and only if ∇2 f (x k ) is positive definite. In
this case the unique solution is given by
s_k = −(∇^2 f(x_k))^{-1} ∇f(x_k). (5.2)
x k+1 = x k + s k
as our new approximation of the (local) minimizer x ∗ of f . This is Newton’s method. In the
following section we will prove the well-posedness of the iteration (i.e., we will show that the
∇2 f (x k )’s are positive definite) and the convergence of the sequence of iterates {x k } generated by
Newton’s method, provided that the initial guess x 0 is sufficiently close to a point x ∗ at which the
second order sufficient optimality conditions are satisfied.
We note that (5.2) makes sense if the Hessian ∇2 f (x k ) is invertible. However, we want to
minimize f(x_k + s) ≈ m_k(x_k + s) = f(x_k) + ∇f(x_k)^T s + (1/2) s^T ∇^2 f(x_k) s and therefore need the
Hessian not merely to be invertible, but positive definite.
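The iteration just described can be sketched in a few lines. The test function (Rosenbrock), the starting point, and the stopping tolerance are illustrative choices, not part of the text, and no safeguard for an indefinite Hessian is included, so this is only the local method.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, maxit=50):
    """Local Newton method: solve Hess(x_k) s_k = -grad(x_k) and set x_{k+1} = x_k + s_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        s = np.linalg.solve(hess(x), -g)   # assumes the Hessian is positive definite near x_*
        x = x + s
    return x

# Illustrative example: Rosenbrock function, minimizer x_* = (1, 1)
grad = lambda x: np.array([-400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0]), 200*(x[1]-x[0]**2)])
hess = lambda x: np.array([[-400*x[1] + 1200*x[0]**2 + 2, -400*x[0]],
                           [-400*x[0],                     200.0]])
print(newton(grad, hess, [1.2, 1.2]))
```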
The following notation will be useful. Throughout this section k · k is an arbitrary vector norm
on R^n. We continue to use B_r(x̄) = {x ∈ R^n : ||x − x̄|| < r} to denote the open ball around x̄ with radius r. Moreover, we define the set of Lipschitz continuous functions on D ⊂ R^n with Lipschitz constant L, Lip_L(D) = {F : ||F(x) − F(y)|| ≤ L ||x − y|| for all x, y ∈ D}.
By λ_min(x) and λ_max(x) we denote the smallest and largest eigenvalue of the Hessian of f at x, i.e., λ_min(x) = λ_min(∇^2 f(x)) and λ_max(x) = λ_max(∇^2 f(x)).
Lemma 5.2.1 Let f : Rn → R be twice continuously differentiable in an open set D ⊂ Rn . For all
x, y ∈ D such that {y + t(x − y) : t ∈ [0, 1]} ⊂ D,
∇f(x) − ∇f(y) = ∫_0^1 ∇^2 f(y + t(x − y))(x − y) dt.
Proof: Apply the fundamental theorem of calculus to the functions φ_i(t) = (∂/∂x_i) f(y + t(x − y)), i = 1, . . . , n, on [0, 1].
Lemma 5.2.2 (Banach Lemma) If A ∈ R^{n×n} is an invertible matrix and if B ∈ R^{n×n} is such that ||A^{-1}(A − B)|| < 1, then B is invertible and
||B^{-1}|| ≤ ||A^{-1}|| / (1 − ||A^{-1}(A − B)||). (5.4)
Lemma 5.2.3 Let D ⊂ Rn be an open set and let f : D → R be twice differentiable on D with
∇2 f ∈ Lip L (D). If the second order sufficient optimality conditions are satisfied at the point
x_* ∈ D, then there exists ε > 0 such that B_ε(x_*) ⊂ D and for all x ∈ B_ε(x_*),
Remark 5.2.4 The previous lemma shows that if the second order sufficient optimality conditions
are satisfied at x ∗ , then the Hessian ∇2 f (x) is also positive definite in a neighborhood of x ∗ . In
particular, the quadratic problem
min_s ∇f(x)^T s + (1/2) s^T ∇^2 f(x) s
that determines the Newton step has a unique solution if x is sufficiently close to x ∗ .
= ∇^2 f(x_0)^{-1} ( ∇^2 f(x_0)(x_0 − x_*) + ∇f(x_*) − ∇f(x_0) )    (note that ∇f(x_*) = 0)
= ∇^2 f(x_0)^{-1} ∫_0^1 ( ∇^2 f(x_0) − ∇^2 f(x_* + t(x_0 − x_*)) )(x_0 − x_*) dt,
where we have used Lemma 5.2.1 to obtain the last equality. Using (5.6) and the Lipschitz continuity
of ∇2 f we obtain
||x_1 − x_*|| ≤ 2L ||∇^2 f(x_*)^{-1}|| ||x_0 − x_*||^2 / 2 = L ||∇^2 f(x_*)^{-1}|| ||x_0 − x_*||^2 < σ ||x_0 − x_*|| < ε.
This proves (5.10) for k = 0. The induction step can be proven analogously and is omitted.
ii. Since σ < 1 and
k x k+1 − x ∗ k < σk x k − x ∗ k < . . . < σ k+1 k x 0 − x ∗ k,
we find that lim k→∞ x k = x ∗ . The q–quadratic convergence rate follows from (5.10) with
c = Lk∇2 f (x ∗ ) −1 k.
Lemma 5.2.6 Let D ⊂ R^n be an open set and let f : D → R be twice continuously differentiable on D with ∇^2 f ∈ Lip_L(D). Moreover, let x_* ∈ D be a point at which the second order sufficient optimality conditions are satisfied. If ∇^2 f(x_k) + ∆(x_k) is invertible, then
||x_{k+1} − x_*|| ≤ (L ||(∇^2 f(x_k) + ∆(x_k))^{-1}|| / 2) ||x_k − x_*||^2 + ||(∇^2 f(x_k) + ∆(x_k))^{-1} ∆(x_k)|| ||x_k − x_*|| + ||(∇^2 f(x_k) + ∆(x_k))^{-1} δ(x_k)||.
Proof: The definition (5.11) of the perturbed Newton method, ∇f(x_*) = 0 and Lemma 5.2.1 imply that
x_{k+1} − x_* = x_k − x_* − (∇^2 f(x_k) + ∆(x_k))^{-1} (∇f(x_k) + δ(x_k))
= (∇^2 f(x_k) + ∆(x_k))^{-1} ( ∆(x_k)(x_k − x_*) − δ(x_k) )
  + (∇^2 f(x_k) + ∆(x_k))^{-1} ∫_0^1 ( ∇^2 f(x_k) − ∇^2 f(x_* + t(x_k − x_*)) )(x_k − x_*) dt.
The previous lemma provides the basic estimate for the convergence analysis of the iteration
(5.11). One convergence result is the following.
Theorem 5.2.7 Let D ⊂ Rn be an open set and let f : D → R be twice differentiable on D with
∇2 f ∈ Lip L (D). Furthermore, let x ∗ ∈ D be a point at which the second order sufficient optimality
conditions are satisfied. If the perturbed Hessians ∇2 f (x) + ∆(x) are invertible for all x ∈ D and if
there exist η ∈ [0, 1), α > 1, c > 0, and M ≥ 0 such that the gradient perturbations δ(x) and the Hessian
perturbations ∆(x) satisfy
k(∇2 f (x) + ∆(x)) −1 k ≤ M,
k(∇2 f (x) + ∆(x)) −1 ∆(x)k ≤ η
and
k(∇2 f (x) + ∆(x)) −1 δ(x)k ≤ ck x − x ∗ k α
for all x ∈ D, then for all σ ∈ (η, 1) there exists an ε > 0 such that Newton's method with inexact derivative information (5.11) with starting point x_0 ∈ B_ε(x_*) generates iterates x_k which converge to x_* and which obey
||x_{k+1} − x_*|| ≤ (M L/2) ||x_k − x_*||^2 + c ||x_k − x_*||^α + η ||x_k − x_*|| ≤ σ ||x_k − x_*||
for all k.
If n is large or if only Hessian-times-vector products ∇^2 f(x_k)v are available for any given vector v, but the computation of the entire Hessian is expensive, then we can use the Conjugate Gradient Algorithm 3.7.6 or the Preconditioned Conjugate Gradient Algorithm 3.9.1 to compute an approximate solution s_k. We focus on the Conjugate Gradient Algorithm 3.7.6. We stop the Conjugate Gradient Algorithm 3.7.6 if the residual ∇^2 f(x_k)s_k + ∇f(x_k) is sufficiently small. More precisely, we stop the Conjugate Gradient Algorithm 3.7.6 if
||∇^2 f(x_k) s_k + ∇f(x_k)|| ≤ η_k ||∇f(x_k)||, (5.13)
where η_k ≥ 0. If s_k is computed such that (5.13) holds, the resulting method is known as the inexact Newton method. The parameter η_k is called the forcing parameter.
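A minimal sketch of such an inexact (truncated) Newton iteration is given below. The inner conjugate gradient loop is stopped by the relative residual test with forcing parameter η_k, which is the standard reading of (5.13); the fixed η, the tolerance, and the absence of a negative-curvature safeguard are simplifications made for the illustration.

```python
import numpy as np

def newton_cg(grad, hess, x0, eta=0.1, tol=1e-8, maxit=50):
    """Inexact Newton: run CG on Hess(x_k) s = -grad(x_k) only until
    ||Hess(x_k) s + grad(x_k)|| <= eta * ||grad(x_k)|| (forcing-term test)."""
    x = np.asarray(x0, dtype=float)
    for k in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        H = hess(x)
        s = np.zeros_like(x)
        r = -g - H @ s            # residual of the Newton system at s = 0
        p = r.copy()
        while np.linalg.norm(r) > eta * np.linalg.norm(g):
            Hp = H @ p
            alpha = (r @ r) / (p @ Hp)     # assumes p^T H p > 0 (H positive definite)
            s += alpha * p
            r_new = r - alpha * Hp
            p = r_new + ((r_new @ r_new) / (r @ r)) * p
            r = r_new
        x = x + s
    return x
```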
The following theorem analyzes the convergence of the inexact Newton method.
Theorem 5.3.1 Let D ⊂ Rn be an open set and let f : D → R be twice differentiable on D with
∇2 f ∈ Lip L (D). Furthermore, let x ∗ ∈ D be a point at which the second order sufficient optimality
conditions are satisfied. Define κ ∗ = k∇2 f (x ∗ ) −1 k k∇2 f (x ∗ )k.
If the sequence {η k } of forcing parameters satisfies 0 < η k ≤ η with η such that 4κ ∗ η < 1, then
for all σ ∈ (4κ_* η, 1) there exists an ε > 0 such that the inexact Newton method (5.13) with starting point x_0 ∈ B_ε(x_*) generates iterates x_k which converge to x_* and which obey
k x k+1 − x ∗ k ≤ Lk∇2 f (x ∗ ) −1 k k x k − x ∗ k 2 + 4η k κ ∗ k x k − x ∗ k ≤ σk x k − x ∗ k
for all k.
Proof: Let ε_1 > 0 be the parameter given by Lemma 5.2.3. Furthermore, let σ ∈ (4κ_* η, 1) be arbitrary and let
ε = min{ ε_1, (σ − 4κ_* η) / (L ||∇^2 f(x_*)^{-1}||) }.
We set r_k = −∇f(x_k) − ∇^2 f(x_k)s_k. If ||x_k − x_*|| < ε, then
x_{k+1} − x_* = x_k − x_* + s_k
= x_k − x_* − ∇^2 f(x_k)^{-1} ∇f(x_k) − ∇^2 f(x_k)^{-1} r_k
= ∇^2 f(x_k)^{-1} ∫_0^1 ( ∇^2 f(x_k) − ∇^2 f(x_* + t(x_k − x_*)) )(x_k − x_*) dt − ∇^2 f(x_k)^{-1} r_k.
where ei is the ith unit vector. The step–size δi usually differs from one component to the other.
Dennis and Schnabel [DS96] recommend
δ_i = √ε max{|x_i|, typx_i} sign(x_i), (5.16)
where ε is an approximation of the relative error in the function evaluation and where typx_i is a typical size provided by the user; it is used to prevent difficulties when x_i is close to zero.
We will study later where this choice comes from.
To compute Hessian approximations, we proceed as follows. If the gradient of f is available,
we compute the ith column Hi of the matrix H ∈ Rn×n as
H_i = ( ∇f(x + δ_i e_i) − ∇f(x) ) / δ_i,
where δ_i is chosen as in (5.16), and then we approximate
∇^2 f(x) ≈ (1/2)(H + H^T).
If only function values of f are available, we approximate
∂^2 f(x)/(∂x_i ∂x_j) ≈ ( [f(x + δ_i e_i + δ_j e_j) − f(x + δ_i e_i)] − [f(x + δ_j e_j) − f(x)] ) / (δ_i δ_j), (5.17)
where
δ_j = ε^{1/3} max{|x_j|, typx_j} sign(x_j). (5.18)
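A sketch of these finite difference approximations is given below. The default typical sizes typx_i = 1 and the convention sign(0) = 1 are assumptions made for the illustration.

```python
import numpy as np

def fd_gradient(f, x, typx=None, eps=np.finfo(float).eps):
    """One-sided finite difference gradient with the step size rule (5.16)."""
    x = np.asarray(x, dtype=float)
    typx = np.ones_like(x) if typx is None else np.asarray(typx, dtype=float)
    g, fx = np.empty_like(x), f(x)
    for i in range(x.size):
        sgn = 1.0 if x[i] >= 0 else -1.0
        delta = np.sqrt(eps) * max(abs(x[i]), typx[i]) * sgn
        e = np.zeros_like(x); e[i] = delta
        g[i] = (f(x + e) - fx) / delta
    return g

def fd_hessian_from_gradient(grad, x, typx=None, eps=np.finfo(float).eps):
    """Forward differences of the gradient, column by column, symmetrized as in the text."""
    x = np.asarray(x, dtype=float)
    typx = np.ones_like(x) if typx is None else np.asarray(typx, dtype=float)
    n = x.size
    H, g0 = np.empty((n, n)), grad(x)
    for i in range(n):
        sgn = 1.0 if x[i] >= 0 else -1.0
        delta = np.sqrt(eps) * max(abs(x[i]), typx[i]) * sgn
        e = np.zeros(n); e[i] = delta
        H[:, i] = (grad(x + e) - g0) / delta
    return 0.5 * (H + H.T)
```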
To see why the choices (5.16) and (5.18) for the finite difference parameters make sense we consider finite difference derivative approximations for a scalar function g : R → R. A one-sided finite difference approximation of the derivative is given by
g'(x) ≈ ( g(x + δ) − g(x) ) / δ (5.19)
for a sufficiently small δ. Using the Taylor expansion
Now suppose that instead of the exact function g we can only compute an approximation g_ε. In this case the finite difference approximation (5.28) of the derivative of g is
g'(x) ≈ ( g_ε(x + δ) − g_ε(x) ) / δ.
Suppose that
|g(x ± δ) − g_ε(x ± δ)| ≤ ε,   |g_ε(x) − g(x)| ≤ ε.
From
g'(x) − ( g_ε(x + δ) − g_ε(x) ) / δ = g'(x) − ( g(x + δ) − g(x) ) / δ + ( g(x + δ) − g_ε(x + δ) ) / δ + ( g_ε(x) − g(x) ) / δ
and the estimates (5.20) we obtain
| g'(x) − ( g_ε(x + δ) − g_ε(x) ) / δ |
≤ M_2 |δ| / 2 + |g(x + δ) − g_ε(x + δ)| / |δ| + |g_ε(x) − g(x)| / |δ|
≤ M_2 |δ| / 2 + 2ε / |δ|. (5.21)
The term M_2|δ|/2 in the error bound (5.21) results from the use of finite differences with exact function values and the term 2ε/|δ| results from the use of inexact function values. The error bound
(5.21) and its two components are sketched in Figure 5.1.
provided x is a floating point number. Here ε_mach is the unit roundoff or machine precision; it is given by ε_mach = 2^{-24} ≈ 6·10^{-8} if single precision arithmetic is used and ε_mach = 2^{-53} ≈ 1.2·10^{-16} if double precision arithmetic is used. In particular, if one-sided finite differences are used to approximate g'(x), then our previous analysis recommends a step size of
δ_* = sqrt( |g(x)| / M_2 ) sqrt(ε_mach),
k x k+1 − x ∗ k ≤ c k x k − x ∗ k 2 ⇐⇒ k∇ f (x k+1 )k ≤ c̃ k∇ f (x k )k 2 .
Thus, we can use the gradients to observe q–superlinear, q–quadratic, or faster convergence rates
of the iterates. Note that this is not possible if the iterates x_k converge only q-linearly. The inequality ||x_{k+1} − x_*|| ≤ c ||x_k − x_*||, c ∈ (0, 1), generally does not imply ||∇f(x_{k+1})|| ≤ c̃ ||∇f(x_k)|| with c̃ ∈ (0, 1). Hence, the gradients may not converge q-linearly to zero even if the iterates converge q-linearly.
We can use the gradient norm as a truncation criterion. The estimate (5.7) yields
||∇f(x)|| < tol_g  ⟹  ||x − x_*|| ≤ 2 ||∇^2 f(x_*)^{-1}|| tol_g,
provided x is sufficiently close to x_*.
If the iterates converge q-superlinearly or faster, we can use the norm of the step as a truncation criterion. This stopping criterion is based on the following result.
5.6. Problems
Problem 5.1
i. Suppose we start Newton’s method for the minimization of f (x) = x 4 − 1 at x 0 > 0. What
are the iterates generated by Newton’s method? Prove that the iterates satisfy x k > 0 and that
they converge to the minimizer x ∗ of f .
iii. What is the rate of convergence of Newton iterates x k in part i? Does this contradict the
convergence result in Theorem 5.2.5? Explain!
lim x k = x ∗,
k→∞
Problem 5.4 Let f : Rn → R be twice differentiable and assume there exists L > 0 such that the
Hessian satisfies k∇2 f (x) − ∇2 f (y)k ≤ Lk x − yk for all x, y ∈ Rn . Furthermore, let x ∗ ∈ Rn be a
point at which the second order sufficient optimality conditions are satisfied. Given a symmetric
positive definite H ∈ Rn×n consider the simplified Newton iteration
H s k = −∇ f (x k ),
x k+1 = x k + s k .
Prove that if ||I − H^{-1} ∇^2 f(x_*)|| < 1, then for every σ ∈ (||I − H^{-1} ∇^2 f(x_*)||, 1) there exists ε > 0 such that the iterates generated by the simplified Newton method with starting value x_0, ||x_0 − x_*|| < ε, converge to x_* and obey
k x k+1 − x ∗ k ≤ σ k x k − x ∗ k
for all k.
What can you say about the convergence of the simplified Newton method if H = ∇2 f (x ∗ )?
Problem 5.5 Let f : Rn → R be twice differentiable and let the Hessian satisfy
k∇2 f (x) − ∇2 f (y)k ≤ Lk x − yk ∀x, y ∈ Rn .
Consider the Newton-type iteration
Hk s k = −∇ f (x k ), (5.25a)
x k+1 = x k + s k , (5.25b)
where Hk is a symmetric positive definite matrix, for the computation of a local minimizer x ∗ of f .
i. Show that
||x_{k+1} − x_*|| ≤ (L ||H_k^{-1}|| / 2) ||x_k − x_*||^2 + ||I − H_k^{-1} ∇^2 f(x_k)|| ||x_k − x_*||.
ii. State and prove a result for the local q-linear convergence of (5.25).
iii. Let ∇2 f (x k ) be symmetric positive definite. Suppose we want to compute an approximate
solution s k of
∇2 f (x k )s = −∇ f (x k ) (5.26)
by applying the Jacobi iterative method.
– What is the Jacobi iterative method applied to (5.26)? (Denote the ith iterate of the Jacobi method by s_k^{(i)}, where k refers to the iteration number in the Newton-type iteration and i is the iteration number of the Jacobi method.)
– Show that one step of the Jacobi iterative method started at s_k^{(0)} = 0 leads to s_k = s_k^{(1)} given by (5.25a). What is H_k?
Suppose that
Σ_{i ≠ j} | ∂^2 f(x_k)/(∂x_i ∂x_j) | ≤ η ∂^2 f(x_k)/∂x_j^2,   j = 1, . . . , n, k ∈ N,
with 0 ≤ η < 1.
Use your results in parts ii. and iii. to show that the Newton-type iteration (5.25), where s k
is computed by applying one iteration of the Jacobi method with zero initial value to (5.26),
converges locally q-linearly.
Problem 5.6 If we use the Preconditioned Conjugate Gradient Algorithm 3.9.1 with symmetric
positive definite preconditioner M to compute an approximate solution s k of (5.12), then the
Preconditioned Conjugate Gradient Iteration stops if
( ∇^2 f(x_k)s_k + ∇f(x_k) )^T M^{-1} ( ∇^2 f(x_k)s_k + ∇f(x_k) ) ≤ η_k ∇f(x_k)^T M^{-1} ∇f(x_k). (5.27)
Problem 5.7 Let g : R → R be three times continuously differentiable and let g_ε(x), g_ε(x ± δ) be approximations of g(x), g(x ± δ), respectively, with |g(x) − g_ε(x)| ≤ ε and |g(x ± δ) − g_ε(x ± δ)| ≤ ε. A centered finite difference approximation of g''(x) is obtained by writing
g''(x) ≈ ( g'(x + δ/2) − g'(x − δ/2) ) / δ
and inserting the approximations
g'(x + δ/2) ≈ ( g(x + δ) − g(x) ) / δ,   g'(x − δ/2) ≈ ( g(x) − g(x − δ) ) / δ
into the previous expression. This leads to
g''(x) ≈ ( g(x + δ) − 2g(x) + g(x − δ) ) / δ^2. (5.29)
Compute the optimal step size for the approximation of g''(x) using centered finite differences with inexact function evaluations and determine the error
| g''(x) − ( g_ε(x + δ_*) − 2g_ε(x) + g_ε(x − δ_*) ) / δ_*^2 |.
iii. Let g(x) = exp(x). Compute approximations of g0 (x) using one-sided finite differences and
centered finite differences using x = 1 and δ = 10−i , i = 1, . . . , 20, and compute centered
finite difference approximations of g00 (x). Plot the error between the derivative and its
approximation in a log-log-scale. Explain your results.
f (x 1, x 2 ) = (x 1 − 2) 4 + (x 1 − 2) 2 x 22 + (x 2 + 1) 2 .
i. Solve min f (x) using Newton’s method with starting value x = (1, 1).
ii. Repeat your computations in i. using finite difference approximations for the gradient and the
Hessian.
In both cases plot the error k x k − x ∗ k2 and the norm of the gradient k∇ f (x k )k2 . Carefully explain
and justify what stopping criteria you use and carefully document your choice of the finite difference
approximations. Experiment with different finite difference step-sizes.
Explain your results.
Problem 5.9 Given a function f : Rn → R and a positive scalar c > 0, consider the two problems
ii. Assume that f : Rn → R is twice continuously differentiable. Fix y. Apply one step of
Newton’s method at x = x k to
min_{x ∈ R^n} f(x) + (c/2) ||x − y||_2^2.
What is x k+1 ?
iii. Assume that f : Rn → R is twice continuously differentiable and has a Lipschitz continuous
second derivative. Let L > 0 be the Lipschitz constant of ∇2 f , let λ k ≥ 0 be the smallest
eigenvalue of ∇2 f (x k ), and let x ∗ be a local minimum of f . Show that x k+1 from part (ii)
satisfies
||x_{k+1} − x_*||_2 ≤ ( L / (2λ_k + 2c) ) ||x_k − x_*||_2^2 + ( c / (λ_k + c) ) ||y − x_*||_2.
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/
10.1137/1.9781611971200, doi:10.1137/1.9781611971200.
[GL83] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3 of Johns Hopkins
Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD,
1983.
[Kel95] C. T. Kelley. Iterative methods for linear and nonlinear equations, volume 16 of Fron-
tiers in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 1995. URL: https://doi.org/10.1137/1.9781611970944, doi:
10.1137/1.9781611970944.
6.1. Introduction
Newton’s method as well as many other methods generate iterates x k such that, under appropriate
conditions on the function, the sequence of iterates converges to a local minimizer x ∗ provided that
the initial iterate x_0 is sufficiently close to x_*. What 'sufficiently close' means depends on the method and on properties of the function and its derivatives, such as the Lipschitz constant of the second derivative. For practical problems it is impossible to say a priori whether an initial iterate x_0 is sufficiently close to the solution. Therefore, we need to modify the methods such that convergence to a solution is guaranteed, under certain assumptions on the problem, from any starting point.
In this chapter we investigate two globalization techniques: line search methods and trust-region
methods. By globalization of the iteration we mean a technique that ensures convergence from any
starting point. It does not mean convergence to a global minimum.
= R(x k )T R0 (x k )s k + 12 k R0 (x k )s k k22
> R(x k )T R0 (x k )s k .
min_s ∇f(x_k)^T s + (1/2) s^T ∇^2 f(x_k) s (6.2)
can be computed using the Preconditioned Conjugate Gradient Algorithm 3.9.1. We use i
as the iteration counter in the conjugate gradient method and s k,i to denote the ith iterate
generated by the Preconditioned Conjugate Gradient Algorithm applied to (6.2). Since
for x k away from a point x ∗ at which the second order sufficient optimality conditions
are satisfied, the Hessian ∇2 f (x k ) may not be positive definite, we need to modify the
Preconditioned Conjugate Gradient Algorithm 3.9.1. Specifically, we need to check whether
piT ∇2 f (x k )pi ≤ 0 for the ith direction computed in the Preconditioned Conjugate Gradient
Algorithm. If piT ∇2 f (x k )pi ≤ 0, the Hessian is not positive definite and we stop the
Preconditioned Conjugate Gradient Algorithm. Of course, if the Hessian ∇2 f (x k ) is not
positive definite, the quadratic subproblem (6.2) does not have a solution. However, the
Preconditioned Conjugate Gradient iterate s k,i−1 computed up to that iterate is a descent
direction, as we will show next.
First, we state the Preconditioned Conjugate Gradient Algorithm for the approximate solution
of (6.2).
If the Preconditioned Conjugate Gradient Algorithm 6.2.2 stops in iteration i > 0 with
s k = s k,i , then
Ki (Mk−1 ∇2 f (x k ), Mk−1 ∇ f (x k )) = span{p0, . . . , pi−1 }
and s k = s k,i solves
min ∇ f (x k )T s + 12 sT ∇2 f (x k )s. (6.3)
s ∈ span{p0, . . . , pi−1 }
(See Section 3.9.1 and Problem 3.14.) Furthermore, Step 2b in Algorithm 6.2.2 implies
pT ∇2 f (x k )p > 0 ∀ p ∈ span{p0, . . . , pi−1 }. (6.4)
Therefore,
∇f(x_k)^T s_{k,i} + (1/2) s_{k,i}^T ∇^2 f(x_k) s_{k,i} < 0,   with s_{k,i}^T ∇^2 f(x_k) s_{k,i} > 0 by (6.4),
which implies ∇ f (x k )T s k < 0. Hence, the Preconditioned Conjugate Gradient Algo-
rithm 6.2.2 generates descent directions s k = s k,i if it stops in iteration i > 0.
If the Preconditioned Conjugate Gradient Algorithm 6.2.2 stops in the initial iteration with r_0^T z_0 = r_0^T M_k^{-1} r_0 < ε, one uses the direction s_k = −∇f(x_k).
Example 6.2.3 Consider f (x) = x 2 and x 0 = 2. Furthermore, we select the steps s k = (−1) k+1
and the step lengths α k = 2 + 3(2−(k+1) ). The iterates are x k = (−1) k (1 + 2−k ). We have
f (x k+1 ) < f (x k ) and ∇ f (x k )T s k < 0 for all k. However, lim k→∞ f (x k ) = 1. The problem here is
that the decrease in the function is too small. We have
f(x_{k+1}) = f(x_k) − 2^{-k} − (3/4)·2^{-2k}.
satisfies the simple decrease f (x k+1 ) < f (x k ), we require that the sufficient decrease condition
f (x k + α k s k ) ≤ f (x k ) + c1 α k ∇ f (x k )T s k , (6.5)
where c1 ∈ (0, 1), is satisfied. This parameter is chosen to be small; c1 = 10−4 is a typical value.
Consider the function φ(α) = f (x k + αs k ). Clearly, φ0 (α) = ∇ f (x k + αs k )T s k and φ0 (0) =
∇f(x_k)^T s_k. The sufficient decrease condition (6.5) requires that the actual decrease φ(0) − φ(α_k) = f(x_k) − f(x_k + α_k s_k) is at least a fraction c_1 of the decrease φ(0) − (φ(0) + φ'(0)α_k) = −α_k ∇f(x_k)^T s_k > 0 predicted by the first order Taylor approximation of φ around 0.
Moreover, since ∇ f (x k )T s k < 0, we have φ0 (0) < 0. By continuity of φ0 there exists ᾱ > 0 such
that |φ0 (α) − φ0 (0)| < −(1 − c1 )φ0 (0) for all α ∈ (0, ᾱ). Consequently,
φ(α) − φ(0) − c_1 φ'(0)α = ( ∫_0^1 φ'(tα) dt ) α − c_1 φ'(0) α
= ( ∫_0^1 ( φ'(tα) − φ'(0) ) dt ) α + (1 − c_1) φ'(0) α
< −(1 − c_1) φ'(0) α + (1 − c_1) φ'(0) α = 0
Example 6.2.5 Again we consider f (x) = x 2 and x 0 = 2. This time we select s k = −1 and step
size α k = 2−(k+1) . This gives the iterates x k = 1 + 2−k . The steps satisfy ∇ f (x k )T s k < 0 for all k
and the sufficient decrease condition
f(x_{k+1}) = f(x_k) − 2^{-k} − (3/4)·2^{-2k} < f(x_k) − c_1 (2^{-k} + 2^{-2k}) = f(x_k) + c_1 α_k ∇f(x_k)^T s_k
for c1 ∈ (0, 3/4). However, lim k→∞ f (x k ) = 1. The problem here is that the step sizes α k are too
small.
In addition to the sufficient decrease condition (6.5) we need a condition that guarantees that the
step sizes α k are not unnecessarily small. What this means will be made precise in Lemma 6.2.9.
There are several conditions that can be added to the sufficient decrease condition (6.5) to ensure
that the step sizes α k are not unnecessarily small. We list some of the commonly used step size
conditions.
The step size α k satisfies the Wolfe conditions if
f (x k + α k s k ) ≤ f (x k ) + c1 α k ∇ f (x k )T s k , (6.6a)
∇ f (x k + α k s k )T s k ≥ c2 ∇ f (x k )T s k , (6.6b)
f (x k + α k s k ) ≤ f (x k ) + c1 α k ∇ f (x k )T s k , (6.7a)
|∇ f (x k + α k s k )T s k | ≤ c2 |∇ f (x k )T s k |, (6.7b)
Since (6.7b) is equivalent to
−c_2 ∇f(x_k)^T s_k ≥ ∇f(x_k + α_k s_k)^T s_k ≥ c_2 ∇f(x_k)^T s_k,
the strong Wolfe condition is in fact stronger than the Wolfe condition.
f(x_k + ᾱ s_k) − f(x_k) = ᾱ ∇f(x_k + α̂ s_k)^T s_k. (6.9)
∇f(x_k + α̂ s_k)^T s_k = c_1 ∇f(x_k)^T s_k > c_2 ∇f(x_k)^T s_k, (6.10)
since c_1 < c_2 and ∇f(x_k)^T s_k < 0. Therefore, α̂ satisfies the Wolfe conditions (6.6) and both inequalities in (6.6) hold strictly. Hence there exists an interval around α̂ such that the Wolfe conditions (6.6) are satisfied for all step sizes in this interval.
Since the term on the left hand side in (6.10) is negative, the strong Wolfe conditions (6.7) hold in the same interval.
Let 0 < c_1 < 1/2. The step size α_k satisfies the Goldstein conditions if
(1 − c_1) α_k ∇f(x_k)^T s_k ≤ f(x_k + α_k s_k) − f(x_k) ≤ c_1 α_k ∇f(x_k)^T s_k. (6.11)
Step lengths satisfying the Goldstein conditions are sketched in Figure 6.4. Notice that the minimizer of φ does not satisfy the Goldstein conditions.
The condition α k(i+1) ≤ β2 α k(i) ensures that the trial step size is at least reduced by a factor β2 .
The condition α k(i+1) ≥ β1 α k(i) ensures that the trial step size is not reduced too fast.
If β1 < β2 , then one has some flexibility to introduce information about f to increase the
performance of the Backtracking Algorithm 6.2.8. We will return to this issue in Section 6.2.4.
The simplest form of the Backtracking Algorithm 6.2.8 is obtained when β_1 = β_2 = β. In this case, α_k = α_k^{(0)} β^m, where m ∈ {0, 1, 2, . . .} is the smallest integer such that α_k = α_k^{(0)} β^m satisfies the sufficient decrease condition (6.5). This is known as the Armijo rule. The choice β = 1/2 is common.
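A minimal sketch of the Armijo rule (β_1 = β_2 = β) is given below; the cap on the number of backtracking steps is an added safeguard for the illustration, not part of the rule itself.

```python
def armijo(f, x, fx, g, s, alpha0=1.0, c1=1e-4, beta=0.5, max_backtracks=40):
    """Armijo rule: largest alpha = alpha0 * beta^m satisfying the sufficient decrease
    condition f(x + alpha*s) <= f(x) + c1 * alpha * g^T s."""
    slope = g @ s                 # grad f(x)^T s; must be negative for a descent direction
    alpha = alpha0
    for _ in range(max_backtracks):
        if f(x + alpha * s) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= beta             # reduce the trial step size by the factor beta
    return alpha                  # safeguard: give up after max_backtracks reductions
```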
(∇ f (x k + α k s k ) − ∇ f (x k ))T s k ≤ k∇ f (x k + α k s k ) − ∇ f (x k )k2 ks k k2
≤ α k Lks k k22 .
f (x k + α k s k ) − f (x k ) − α k ∇ f (x k )T s k ≥ −c1 α k ∇ f (x k )T s k .
Moreover,
f(x_k + α_k s_k) − f(x_k) − α_k ∇f(x_k)^T s_k = α_k ∫_0^1 ( ∇f(x_k + t α_k s_k) − ∇f(x_k) )^T s_k dt ≤ (L/2) α_k^2 ||s_k||_2^2.
Combining both inequalities yields the estimate
α_k ≥ −2 c_1 ∇f(x_k)^T s_k / ( L ||s_k||_2^2 ).
iii. Let α k be determined through the Backtracking Algorithm 6.2.8. If α k = α k(i) , then α k(i−1)
did not satisfy the sufficient decrease condition (6.5). Thus,
α_k^{(i-1)} ≥ −2(1 − c_1) ∇f(x_k)^T s_k / ( L ||s_k||_2^2 ).
cos θ_k = ∇f(x_k)^T s_k / ( ||∇f(x_k)||_2 ||s_k||_2 ).
f (x k+1 ) − f (x k ) ≤ c1 α k ∇ f (x k )T s k .
The lower bound for the step size α_k gives the desired result
Σ_{k=0}^∞ ( (∇f(x_k)^T s_k)^2 / ( ||s_k||_2^2 ||∇f(x_k)||_2^2 ) ) ||∇f(x_k)||_2^2 < ∞.
If cos2 θ k is bounded away from zero, Theorem 6.2.10 guarantees that lim k→∞ ∇ f (x k ) = 0.
The following lemma shows that cos2 θ k is bounded away from zero, if the sequence of condition
numbers {cond2 (Bk )} is bounded.
If cos2 θ k is bounded away from zero, Theorem 6.2.10 guarantees that lim k→∞ ∇ f (x k ) = 0.
However, convergence of ∇ f (x k ) does not imply convergence of x k , in general. The following
corollary of Theorem 6.2.10 is a typical result that addresses convergence of the iterates {x k }.
cos2 θ k ≥ c > 0
Proof: Since cos2 θ k ≥ c > 0, Theorem 6.2.10 guarantees that lim k→∞ ∇ f (x k ) = 0. Let x ∗ ∈ D
be an accumulation point and let {x k j } be a subsequence such that lim j→∞ x k j = x ∗ . Then
0 = lim ∇ f (x k j ) = ∇ f (x ∗ ).
j→∞
Hence, the accumulation point x ∗ is a critical point. By the sufficient decrease condition,
and, since lim j→∞ x k j = x ∗ , lim j→∞ f (x k j ) = f (x ∗ ), which implies that the accumulation point x ∗
is a critical point, but not a maximum point.
The sufficient decrease condition implies that all iterates x k are in the compact set L. Hence,
{x k } has a convergent subsequence.
lim_{k→∞} ||∇f(x_k) + ∇^2 f(x_k) s_k||_2 / ||s_k||_2 = 0, (6.12)
then there is an index k_0 such that for all k ≥ k_0 the sufficient decrease condition (6.5) with c_1 ∈ (0, 1/2) is satisfied with α_k = 1.
ii. If {x_k} converges to a point x_* at which ∇^2 f(x_*) is positive definite, and if (6.12) holds, then there is an index k_0 such that for all k ≥ k_0 the Wolfe conditions (6.6) with c_1 ∈ (0, 1/2) are satisfied with α_k = 1.
iii. If {x_k} converges to a point x_* at which ∇^2 f(x_*) is positive definite, and if (6.12) holds, then there is an index k_0 such that for all k ≥ k_0 the Goldstein conditions (6.11) with 0 < c_1 < 1/2 are satisfied with α_k = 1.
s k = −(∇2 f (x k ) + µ k I) −1 ∇ f (x k ),
f(x_k − α_k ∇f(x_k)) = f(x_k) − α_k ||∇f(x_k)||_2^2 + (α_k^2/2) ∇f(x_k)^T ∇^2 f(x_k − α_k τ ∇f(x_k)) ∇f(x_k)
for some τ ∈ [0, 1]. Hence an α_k that satisfies the sufficient decrease condition (6.13) must satisfy
α_k ≤ 2(1 − c_1) ||∇f(x_k)||_2^2 / ( ∇f(x_k)^T ∇^2 f(x_k − α_k τ ∇f(x_k)) ∇f(x_k) ). (6.14)
In particular, if the smallest eigenvalue λ min (x ∗ ) of ∇2 f (x ∗ ) satisfies λ min (x ∗ ) > 2(1 − c1 ), the step
size α k in the steepest descent method will not be equal to one near the solution.
We compute
m'(α) = ( 2[φ(α_k^{(0)}) − φ'(0)α_k^{(0)} − φ(0)] / (α_k^{(0)})^2 ) α + φ'(0),   m''(α) = 2[φ(α_k^{(0)}) − φ'(0)α_k^{(0)} − φ(0)] / (α_k^{(0)})^2.
Since the sufficient decrease condition (6.5) is not satisfied for α_k^{(0)}, we have
φ(α_k^{(0)}) > φ(0) + c_1 α_k^{(0)} φ'(0) = φ(0) + α_k^{(0)} φ'(0) + (c_1 − 1) α_k^{(0)} φ'(0) > φ(0) + α_k^{(0)} φ'(0).
Hence, m''(α) > 0 for all α and the minimizer of m is
α_* = −(α_k^{(0)})^2 φ'(0) / ( 2[φ(α_k^{(0)}) − φ'(0)α_k^{(0)} − φ(0)] ).
We set
α_k^{(1)} = β_1 α_k^{(0)} if α_* < β_1 α_k^{(0)},   α_k^{(1)} = β_2 α_k^{(0)} if α_* > β_2 α_k^{(0)},   α_k^{(1)} = α_* otherwise.
Step 2. Suppose the sufficient decrease condition (6.5) is not satisfied for α_k^{(1)}. To find α_k^{(2)}, we
compute the cubic interpolant m(α) that satisfies
m(0) = φ(0) = f (x k ), m0 (0) = φ0 (0) = ∇ f (x k )T s k ,
m(α k(0) ) = φ(α k(0) ), m(α k(1) ) = φ(α k(1) ).
This interpolant is given by
m(α) = aα 3 + bα 2 + φ0 (0)α + φ(0).
where a, b are computed by solving the 2 × 2 system
is a minimizer of m. We set
α_k^{(2)} = β_1 α_k^{(1)} if α_* < β_1 α_k^{(1)},   α_k^{(2)} = β_2 α_k^{(1)} if α_* > β_2 α_k^{(1)},   α_k^{(2)} = α_* otherwise.
Step i (i ≥ 3). Suppose the sufficient decrease condition (6.5) is not satisfied for α k(i) . Then we
repeat the procedure in step 2 with α k(0) and α k(1) in (6.15) replaced by α k(i−1) and α k(i) , respectively.
m_k(x_k + s) = f(x_k) + ∇f(x_k)^T s + (1/2) s^T B_k s, (6.16)
where Bk is a symmetric matrix (in Newton’s method Bk = ∇2 f (x k )). Typically, the model m k of
f is only a good model for f near x k . Hence minimizing m k (x k + s) over all s ∈ Rn does not make
sense. Instead, one should minimize m k (x k + s) only over those s for which m k (x k + s) is expected
to be a sufficiently good approximation to f (x k + s).
min m k (x k + s)
(6.17)
s.t. ksk2 ≤ ∆ k ,
where the set {s : ksk2 ≤ ∆ k } is the trust-region and ∆ k > 0 is the trust-region radius. One can
admit more general models [ADLT98, CGT00], but we focus on quadratic models (6.16).
With the quadratic model (6.16) the trust-region subproblem (6.17) is given by
min f (x k ) + ∇ f (x k )T s + 21 sT Bk s
(6.18)
s.t. ksk2 ≤ ∆ k .
In (6.18) we minimize a continuous function m k over a compact set. Hence, the minimum exists.
A characterization of the solution will be given in Lemma 6.3.5 below. However, it is not necessary
to solve the trust-region subproblem (6.18) exactly. Instead the trust-region step s k has to give a
decrease in the model that is at least as good as the decrease obtained by minimizing in the direction
of the negative gradient. We will make this precise below (see (6.19)).
If the step s k is accepted, i.e. if x k+1 = x k + s k , then the iteration k is called successful. Note
that we increase the iteration count even if the iteration is not successful.
Sensible choices for the parameters η 1, η 2 , γ1, γ2 and γ3 in the Trust Region Algorithm 6.3.1
are η 1 = 0.01, η 2 = 0.9, γ1 = γ2 = 0.5, γ3 = 2 [CGT00, p. 117]. See also [DS83, Sec. 6.4].
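The following sketch implements a bare-bones version of such a trust-region iteration with the parameter values suggested above. It uses only the Cauchy point as the (admissible but crude) trust-region step and omits the parameter γ_1, so it is an illustration of the mechanism rather than Algorithm 6.3.1 itself.

```python
import numpy as np

def trust_region(f, grad, hess, x0, delta0=1.0,
                 eta1=0.01, eta2=0.9, gamma2=0.5, gamma3=2.0, tol=1e-8, maxit=200):
    """Basic trust-region loop: acceptance test based on rho_k = ared_k / pred_k
    and radius update with the parameters eta1, eta2, gamma2, gamma3."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for k in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        B = hess(x)
        s = cauchy_point(g, B, delta)          # any step with fraction of Cauchy decrease
        pred = -(g @ s + 0.5 * s @ B @ s)      # pred_k = m_k(x_k) - m_k(x_k + s_k) > 0
        ared = f(x) - f(x + s)                 # ared_k = f(x_k) - f(x_k + s_k)
        rho = ared / pred
        if rho >= eta1:                        # successful iteration: accept the step
            x = x + s
            if rho >= eta2:                    # very successful: enlarge the region
                delta = gamma3 * delta
        else:                                  # unsuccessful: shrink the region
            delta = gamma2 * delta
    return x

def cauchy_point(g, B, delta):
    """Minimizer of the quadratic model along -g inside the trust region."""
    gBg = g @ B @ g
    t = delta / np.linalg.norm(g)
    if gBg > 0:
        t = min(t, np.linalg.norm(g) ** 2 / gBg)
    return -t * g
```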
Proof: If ∇ f (x k ) = 0, then the right hand sides in (6.19) and in (6.20) are zero and the assertion
follows.
Let ∇f(x_k) ≠ 0. Define
ψ(t) = m_k( x_k − t ∇f(x_k)/||∇f(x_k)||_2 ) − m_k(x_k) = −t ||∇f(x_k)||_2 + ( t^2 / (2 ||∇f(x_k)||_2^2) ) ∇f(x_k)^T B_k ∇f(x_k)
and let t ∗ be the minimizer of ψ on the interval [0, ∆ k ] . Condition (6.19) implies that
m k (x k + s k ) − m k (x k ) ≤ β1 ψ(t ∗ ).
In this case
ψ(t_*) = −||∇f(x_k)||_2^4 / ( 2 ∇f(x_k)^T B_k ∇f(x_k) ) = −(1/(2c_k)) ||∇f(x_k)||_2^2. (6.21)
If t ∗ = ∆ k , then either the unconstrained minimizer of ψ is greater than ∆ k , i.e.
||∇f(x_k)||_2^3 / ( ∇f(x_k)^T B_k ∇f(x_k) ) ≥ ∆_k,
or the unconstrained problem min ψ(t) has no solution. The latter is the case if and only if
∇ f (x k )T Bk ∇ f (x k ) ≤ 0. Thus, either
ψ(t_*) = −∆_k ||∇f(x_k)||_2 + ( ∆_k^2 / (2 ||∇f(x_k)||_2^2) ) ∇f(x_k)^T B_k ∇f(x_k) ≤ −(∆_k/2) ||∇f(x_k)||_2, (6.22)
or
ψ(t ∗ ) ≤ −∆ k k∇ f (x k )k2 . (6.23)
The assertion now follows from (6.21)–(6.23).
Corollary 6.3.3 Suppose that k is a successful iteration, i.e. that ρ k > η 1 . If s k satisfies (6.19),
then
f(x_k) − f(x_{k+1}) ≥ (β_1 η_1 / 2) ||∇f(x_k)||_2 min{ (1/c_k) ||∇f(x_k)||_2, ∆_k }, (6.24)
where ck is defined as in Lemma 6.3.2.
Proof: The proof follows immediately from Lemma 6.3.2 if we use the definition of ρ k , aredk ,
and predk .
A basic convergence result is based on the estimate (6.24). We also assume that scalars ck
defined in Lemma 6.3.2 are bounded. This is guaranteed, e.g., if the norms kBk k are bounded.
Proof: Suppose that lim inf_{k→∞} ||∇f(x_k)||_2 > 0. Then there exists ε > 0 such that ||∇f(x_k)||_2 > ε for all k.
In the first step of the proof we show that
Σ_{k=1}^∞ ∆_k < ∞. (6.25)
If there are only finitely many successful iterations, then there exists K such that ∆ k+1 ≤ γ2 ∆ k for
all k ≥ K. This implies (6.25).
Suppose there are infinitely many successful iterations and let {ki } be the subsequence of
successful iterations. Algorithm 6.3.1 implies
∆ ki +1 ≤ γ3 ∆ ki
and
∆ ki + j+1 ≤ γ2 ∆ ki + j , j = 1, . . . , ki+1 − ki − 1.
Hence
Σ_{k=k_i}^{k_{i+1}-1} ∆_k ≤ ∆_{k_i} + γ_3 ∆_{k_i} + γ_3 γ_2 ∆_{k_i} + . . . + γ_3 γ_2^{k_{i+1}-k_i-2} ∆_{k_i} ≤ ( 1 + γ_3/(1 − γ_2) ) ∆_{k_i}
and
Σ_{k=1}^∞ ∆_k ≤ ( 1 + γ_3/(1 − γ_2) ) Σ_{i=1}^∞ ∆_{k_i}.
Inequality (6.24) and ||∇f(x_k)||_2 > ε for all k imply that
Σ_{i=1}^∞ ∆_{k_i} < ∞.
On the other hand, (6.20) and ||∇f(x_k)||_2 > ε for all k imply the existence of c̃ > 0 such that
Hence,
|ρ_k − 1| = |ared_k − pred_k| / pred_k ≤ ( (c + L/2)/c̃ ) ∆_k → 0.
Since ρ k ≥ η 2 implies that the kth iteration is successful and the trust-region radius is increased,
this contradicts ∆ k → 0.
Theorem 6.3.4 is a basic convergence result. Additional results can be found in Chapter 6 of the book [CGT00] by Conn, Gould, and Toint.
min g^T s + (1/2) s^T B s
(6.26)
s.t. ksk2 ≤ ∆.
Recall that a trust-region step s k does not need to solve the trust-region subproblem (6.18)
exactly. It only needs to satisfy the fraction of Cauchy decrease condition (6.19) which in the
simplified notation of this section reads
g^T s_k + (1/2) s_k^T B s_k ≤ β_1 min{ g^T s + (1/2) s^T B s : s = −t g, ||s||_2 ≤ ∆ },   ||s_k||_2 ≤ β_2 ∆. (6.27)
Lemma 6.3.5 The vector s∗ is a solution of the trust–region subproblem (6.26) if and only if there
exists a scalar λ ∗ ≥ 0 such that the following conditions are satisfied:
Proof: i. Let λ ∗ ≥ 0 and s∗ ∈ Rn satisfy (6.28). Theorem 4.3.6 shows that s∗ is a global minimizer
of
ψ̂(s) = g^T s + (1/2) s^T (B + λ_* I) s = g^T s + (1/2) s^T B s + (λ_*/2) ||s||_2^2.
Hence, for all s ∈ Rn we have the inequality
g^T s + (1/2) s^T B s + (λ_*/2) ||s||_2^2 = ψ̂(s) ≥ ψ̂(s_*) = g^T s_* + (1/2) s_*^T B s_* + (λ_*/2) ||s_*||_2^2. (6.29)
If ||s_*||_2^2 < ∆^2, then (6.28b) implies λ_* = 0 and the inequality (6.29) reads
g^T s + (1/2) s^T B s = ψ̂(s) ≥ ψ̂(s_*) = g^T s_* + (1/2) s_*^T B s_*   for all s ∈ R^n.
Thus, if ||s_*||_2^2 < ∆^2, then s_* is an unconstrained minimizer of g^T s + (1/2) s^T B s.
If ||s_*||_2^2 = ∆^2, then for all s ∈ R^n with ||s||_2^2 ≤ ∆^2 the inequality (6.29) implies
g^T s + (1/2) s^T B s ≥ g^T s_* + (1/2) s_*^T B s_* + (λ_*/2)( ||s_*||_2^2 − ||s||_2^2 )
= g^T s_* + (1/2) s_*^T B s_* + (λ_*/2)( ∆^2 − ||s||_2^2 )
≥ g^T s_* + (1/2) s_*^T B s_*.
Since the set {(s − s_*)/||s − s_*||_2 : s ≠ s_* with ||s||_2 = ||s_*||_2 = ∆} is dense in the unit ball, (6.31)
implies positive semidefiniteness of B + λ ∗ I.
In the final step, we show that λ ∗ ≥ 0. Our proof is by contradiction. Suppose λ ∗ < 0. Since
(6.28) holds, Theorem 4.3.6 shows that s∗ is a global minimizer of
ψ̂(s) = g^T s + (1/2) s^T (B + λ_* I) s = g^T s + (1/2) s^T B s + (λ_*/2) ||s||_2^2.
Hence, for all s ∈ R^n with ||s||_2^2 > ||s_*||_2^2 = ∆^2,
g^T s + (1/2) s^T B s ≥ g^T s_* + (1/2) s_*^T B s_* + (λ_*/2)( ||s_*||_2^2 − ||s||_2^2 ) > g^T s_* + (1/2) s_*^T B s_*,
since λ_*/2 < 0 and ||s_*||_2^2 − ||s||_2^2 < 0.
Since s_* solves (6.26), we also have g^T s + (1/2) s^T B s ≥ g^T s_* + (1/2) s_*^T B s_* for all s with ||s||_2^2 ≤ ∆^2 = ||s_*||_2^2. Thus, s_* minimizes g^T s + (1/2) s^T B s over R^n. Theorem 4.3.6 implies that B s_* + g = 0. Together with (6.28a) this implies λ_* s_* = 0. Since ||s_*||_2 = ∆, we have s_* ≠ 0. Therefore, λ_* = 0, which contradicts the assumption λ_* < 0.
Bs = −g (6.32)
2. If the solution of (6.26) was not found in step 1, then find λ > 0 such that B + λI is positive
semi–definite and
ks(λ)k2 = ∆ (6.33)
where s(λ) is the solution of
(B + λI)s(λ) = −g.
If B is positive definite, then the solution of (6.32) is unique and can be computed using the
Cholesky decomposition.
The problem of finding λ_* > 0 such that B + λ_* I is positive semi-definite and (6.33) is satisfied is a particular root finding problem. To get better insight into this problem, we use the eigendecomposition of B.
[Figure: the function λ ↦ ||s(λ)||_2^2 with poles at λ = −µ_1, −µ_2, the level ∆^2, and the root λ_*.]
and
||s(λ)||_2^2 = Σ_{i=1}^n (q_i^T g)^2 / (µ_i + λ)^2. (6.34)
The function
λ ↦ Σ_{i=1}^n (q_i^T g)^2 / (µ_i + λ)^2
has poles at −µ_1 > . . . > −µ_n. We need to find λ ≥ max{−µ_1, 0} such that
Σ_{i=1}^n (q_i^T g)^2 / (µ_i + λ)^2 = ∆^2.
φ(λ) = ks(λ)k2 − ∆ = 0.
will generate iterates λ j that satisfy λ j−1 > λ j ≥ λ ∗ for all k ≥ 1, provided that λ 0 ∈ (−µ1, λ ∗ ). In
addition, λ 7→ ks(λ)k2 − ∆ is a rational function.
With
(d/dλ) s(λ) = Σ_{i=1}^n ( q_i^T g / (µ_i + λ)^2 ) q_i = −(B + λ I)^{-1} s(λ)
we find that
φ'(λ) = −s(λ)^T (B + λ I)^{-1} s(λ) / ||s(λ)||_2.
It is advantageous to consider Newton's method applied to the equivalent root finding problem
ψ(λ) = 1/||s(λ)||_2 − 1/∆ = 0.
We find that
ψ'(λ) = −φ'(λ) / ( φ(λ) + ∆ )^2.
Hence, the new Newton iterate λ_+ is
λ_+ = λ − ψ(λ)/ψ'(λ)
    = λ + ( 1/||s(λ)||_2 − 1/∆ ) ||s(λ)||_2^2 / φ'(λ)
    = λ − ( ||s(λ)||_2 / ∆ ) ( ||s(λ)||_2 − ∆ ) / φ'(λ)
    = λ − ( ||s(λ)||_2 / ∆ ) φ(λ) / φ'(λ). (6.35)
Newton's method should be safeguarded, however. We know that
λ − φ(λ)/φ'(λ) ≤ λ_*.
Thus, if λ^low ∈ [−µ_1, λ_*) is a known lower bound for the root λ_*, then
λ_+^low = max{ λ^low, λ − φ(λ)/φ'(λ) }
is another, possibly improved lower bound. To obtain an upper bound for λ_* consider the identity
( B − µ_1 I + (λ_* + µ_1) I ) s(λ_*) = −g.
is another, possibly improved upper bound. If the Newton iterate (6.35) satisfies
λ_+ ∉ [λ_+^low, λ_+^up],
then we set
λ_+ = max{ sqrt( λ_+^low λ_+^up ), 10^{-3} λ_+^up }.
There is one case that requires more care. It is known as the hard case [MS83], and occurs if −µ_1 > 0 and
lim_{λ→−µ_1^+} Σ_{i=1}^n (q_i^T g)^2 / (µ_i + λ)^2 < ∆^2.
Note that the hard case can only occur if q_1^T g = 0, since for q_1^T g ≠ 0
lim_{λ→−µ_1^+} Σ_{i=1}^n (q_i^T g)^2 / (µ_i + λ)^2 = lim_{λ→−µ_1^+} ( (q_1^T g)^2 / (µ_1 + λ)^2 + Σ_{i=2}^n (q_i^T g)^2 / (µ_i + λ)^2 ) = ∞.
Since B + λ_* I must be positive semidefinite, λ_* = −µ_1 > 0. To ensure conditions (6.28a,b) we set s = s(−µ_1) + τ q_1, where τ is chosen to satisfy ||s(−µ_1) + τ q_1||_2 = ∆.
If µ_1 > 0 then
    Compute s(0) = −Σ_{i=1}^n (q_i^T g / µ_i) q_i.
    If ||s(0)||_2 ≤ ∆ then stop.
elseif µ_1 = 0 and q_1^T g = 0 then
    Compute s(0) = −Σ_{i=2}^n (q_i^T g / µ_i) q_i.
    If ||s(0)||_2 ≤ ∆ then stop.
elseif µ_1 < 0 and q_1^T g = 0 then
    Compute s(−µ_1) = −Σ_{i=2}^n (q_i^T g / (µ_i − µ_1)) q_i.
    If ||s(−µ_1)||_2 < ∆ then compute τ such that ||s(−µ_1) + τ q_1||_2 = ∆,
    set s = s(−µ_1) + τ q_1, and stop.
endif
Set λ_0^up = ||g||_2/∆ − min{0, µ_1}.
Set λ_0^low = max{0, −µ_1}.
Set λ_0 = max{ sqrt(λ_0^low λ_0^up), 10^{-3} λ_0^up }.
For j = 0, 1, 2, . . .
    Compute φ(λ_j) and φ'(λ_j).
    If φ(λ_j) < 0, then λ_{j+1}^up = min{λ_j, λ_j^up}. Else λ_{j+1}^up = λ_j^up.
    Set λ_{j+1}^low = max{ λ_j^low, λ_j − φ(λ_j)/φ'(λ_j) }.
End
Algorithm 6.3.6 is due to [Heb73, Rei71, Mor78]. The iteration (6.35) is introduced in [Heb73,
Rei71]. The safeguards and many important implementation details were introduced in [Mor78],
where the algorithm is described in the context of least squares problems. See also [DS96,
Sec. 6.4.1]. The eigen decomposition of B is a very convenient tool for the implementation of the
Hebden–Reinsch–Moré algorithm, but it is not necessary; see [Mor78, MS83] and [CGT00, Sec. 7.3].
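The following sketch illustrates the root finding just described when the eigendecomposition of B is available. The interior-solution test, the starting value, and the simple clamping at −µ_1 are simplifications; the hard case and the full safeguards of Algorithm 6.3.6 are not handled.

```python
import numpy as np

def trs_eig(B, g, delta, maxit=100, tol=1e-10):
    """Trust-region subproblem via the eigendecomposition B = Q diag(mu) Q^T and
    the Newton iteration (6.35) on phi(lambda) = ||s(lambda)|| - delta (sketch only)."""
    mu, Q = np.linalg.eigh(B)
    qg = Q.T @ g

    def s_of(lam):                       # s(lambda) = -(B + lambda I)^{-1} g
        return -Q @ (qg / (mu + lam))

    if mu[0] > 0 and np.linalg.norm(s_of(0.0)) <= delta:
        return s_of(0.0), 0.0            # interior solution, lambda_* = 0

    lam = max(0.0, -mu[0]) + 1e-3        # start strictly to the right of -mu_1
    for _ in range(maxit):
        s = s_of(lam)
        ns = np.linalg.norm(s)
        phi = ns - delta
        if abs(phi) <= tol * delta:
            break
        dphi = -(s @ (Q @ ((Q.T @ s) / (mu + lam)))) / ns   # phi'(lambda)
        lam = lam - (ns / delta) * (phi / dphi)             # Newton step (6.35)
        lam = max(lam, -mu[0] + 1e-12)                      # keep B + lambda I positive definite
    return s_of(lam), lam
```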
Fortunately the basic convergence result for trust–region methods only requires that the fraction
of Cauchy decrease condition (6.27) is satisfied. This gives a lot of flexibility in computing a
trust–region step. Next we will discuss a few techniques for the computation of trust-region steps
that satisfy the fraction of Cauchy decrease condition (6.27).
Suppose that B is positive definite, i.e., that the eigenvalues are all positive, 0 < µ_1 < . . . < µ_n. Let us consider the curve
s(λ) = −Σ_{i=1}^n ( q_i^T g / (µ_i + λ) ) q_i.
and
lim_{λ→∞} s(λ)/||s(λ)||_2 = lim_{λ→∞} −( 1 / ( Σ_{i=1}^n (q_i^T g)^2/(µ_i + λ)^2 )^{1/2} ) Σ_{i=1}^n ( q_i^T g / (µ_i + λ) ) q_i
= lim_{λ→∞} −( 1 / ( Σ_{i=1}^n (q_i^T g)^2/((µ_i/λ) + 1)^2 )^{1/2} ) Σ_{i=1}^n ( q_i^T g / ((µ_i/λ) + 1) ) q_i
= −( 1 / ( Σ_{i=1}^n (q_i^T g)^2 )^{1/2} ) Σ_{i=1}^n (q_i^T g) q_i
= −g/||g||_2.
The idea is to compute the step s as a combination of the Cauchy step (the minimizer of g^T s + (1/2) s^T B s along s = −t g, t ≥ 0) and the Newton step −B^{-1} g.
In particular, s_1 solves
min_{s = −t g, t ≥ 0} g^T s + (1/2) s^T B s
Moreover, we have shown in Theorem 3.7.7 that the norms of the iterates are monotonically increasing, 0 < ||s_1|| < ||s_2|| < . . .. Hence, we can use the conjugate gradient method for the computation of a trust-region step. If B is symmetric positive definite we will use the conjugate gradient method until an iterate s_{i+1} violates the trust-region bound, ||s_i|| ≤ ∆ < ||s_{i+1}||, or g + B s_i is small. We can also admit general symmetric B (B may not be positive definite). In this case, if we detect a conjugate gradient direction p_i such that p_i^T B p_i ≤ 0, then we will move from s_i along p_i until we hit the trust-region bound. The resulting algorithm is due to [Ste83] (and a slightly different version to [Toi81]) and is listed below. It can be shown that the step computed by the Steihaug-Toint Conjugate Gradient Algorithm 6.3.8 below satisfies the fraction of Cauchy decrease condition (6.27).
Subspace Techniques
Let V = span{v1, . . . , vk } ⊂ Rn . If −g ∈ V, then
min{ g^T s + (1/2) s^T B s : s ∈ V, ||s||_2 ≤ ∆ } ≤ min{ g^T s + (1/2) s^T B s : s = −t g, ||s||_2 ≤ ∆ }.
satisfies the fraction of Cauchy decrease condition (6.27). If we set V = (v1, . . . , vk ) ∈ Rn×k , then
(6.36) is equivalent to
min_ŝ (V^T g)^T ŝ + (1/2) ŝ^T V^T B V ŝ
s.t. ||V ŝ||_2 ≤ ∆. (6.37)
If we assume that V has full rank k, then there exists a nonsingular R ∈ R k×k such that RT R = V T V .
Such an R can be computed, e.g., using the Cholesky decomposition of V T V . Since kV ŝk2 = k R ŝk2
we can set s̃ = R ŝ to write (6.37) as
if the trust-region radius is inactive, ksSTCG k2 < ∆, then ‘≥’ above becomes ‘=’. Differences
between the Steihaug-Toint Conjugate Gradient Algorithm 6.3.8 and the solution of (6.36) with V
computed via Lanczos arise only when the trust-region radius is active. See [CGT00, Sec. 7.5.4].
If −B^{-1} g can be computed one can also choose V = span{−g, −B^{-1} g}. If a negative curvature direction d, i.e., a d with d^T B d < 0, exists and can be computed, then one can choose V = span{−g, −B^{-1} g, d}. Note that if B is positive definite and if −g, −B^{-1} g ∈ V, then the double dogleg step s_d computed by Algorithm 6.3.7 satisfies
6.4. Problems
Problem 6.4
Consider f(x) = (1/2) ||F(x)||_2^2, where F : R^n → R^n is a continuously differentiable function.
Furthermore, let s k be the solution of the Newton equation
F 0 (x k )s k = −F (x k ).
i. Show that if F 0 (x k ) is nonsingular, the sufficient decrease condition (6.5) can be formulated
as a condition involving F, but not F 0.
ii. Let the conditions in i. be satisfied. Give a condition (or conditions) on the sequence {α k } of
step sizes that produces iterates x k+1 = x k + α k s k with
lim kF (x k )k2 = 0.
k→∞
iii. Assume that the Jacobian F 0 is Lipschitz continuous. Without using Theorem 6.2.13 show
that your sufficient decrease condition derived in i. is satisfied for α k = 1, provided that the
iterate x k is sufficiently close to a root x ∗ of F at which F 0 (x ∗ ) is nonsingular.
(Hint: The quadratic convergence of the sequence {x k } of Newton iterates implies quadratic
convergence of the sequence {F (x k )} of corresponding function values.)
Problem 6.5
Let f : Rn → R be twice continuously differentiable.
Vk = {Vk z : z ∈ Rmk }.
Suppose that Bk ∈ Rn×n is a symmetric matrix which satisfies vT Bk v > 0 for all v ∈ Vk and
consider the subproblem
min_{s ∈ V_k} ∇f(x_k)^T s + (1/2) s^T B_k s. (6.39)
Compute the solution s_k of (6.39). Show that if s_k ≠ 0, then s_k is a descent direction for f
at x k .
iii. Consider the iteration x_{k+1} = x_k + α_k s_k, where s_k is the solution of (6.39) and α_k is the step size. If s_k ≠ 0 for all k and if the assumptions of Theorem 6.2.10 are satisfied, then
Σ_{k=0}^∞ ( ∇f(x_k)^T s_k / ( ||∇f(x_k)||_2 ||s_k||_2 ) )^2 ||∇f(x_k)||_2^2 < ∞.
Suppose there exist 0 < λ min ≤ λ max such that all eigenvalues of VkT Bk Vk , k ∈ N, are
bounded from below by λ min and from above by λ max .
Show that
lim kVkT ∇ f (x k )k2 = 0.
k→∞
Under what condition(s), can one show that
lim k∇ f (x k )k2 = 0?
k→∞
Problem 6.6 Let A ∈ Rn×n be symmetric positive definite and let b ∈ Rn . We consider the
following algorithm for minimizing
f(x) = (1/2) x^T A x − b^T x.
Given linearly independent vectors v (1), . . . , v (n) ∈ Rn one step of the parallel directional
correction (PDC) method introduced in Section 2.8 is given as follows:
end
Recall that
θ_i = ( (v^{(i)})^T (b − A x) ) / ( (v^{(i)})^T A v^{(i)} ).
ii. Determine the optimal step size argminα∈R f (x (k) + αs (k) ) and show that it satisfies the
sufficient decrease condition (6.5) if c1 < 1/2. Prove that
Σ_{k=0}^∞ [ (A x^{(k)} − b)^T s^{(k)} ]^2 / ( (s^{(k)})^T A s^{(k)} ) < ∞.
iii. Suppose that the directions v (i) are the unit vectors e (i) . What is s (k) ? Use part ii to show that
lim Ax (k) − b = 0.
k→∞
Problem 6.7
Let f : Rn → R be convex.
i. Show that the function φ(x) = f(x) + (µ/2) ||x − y||_2^2 is strictly convex for any given y ∈ R^n and any given µ > 0.
(A function φ is strictly convex if for any x_1 ≠ x_2 and t ∈ (0, 1), φ(t x_1 + (1 − t) x_2) < t φ(x_1) + (1 − t) φ(x_2).)
We consider the following iteration. Given x k , the new iterate x k+1 is computed as the solution
of
min_x f(x) + (µ_k/2) ||x − x_k||_2^2,
where µ_k ≥ 0 is such that x ↦ f(x) + (µ_k/2) ||x − x_k||_2^2 is strictly convex (this ensures that local minima are global minima).
iii. Assume that f is bounded from below on the set {x ∈ Rn | f (x) ≤ f (x 0 )}. Show that
Σ_{k=0}^∞ µ_k ||x_{k+1} − x_k||_2^2 < ∞.
lim_{k→∞} ∇f(x_k) = 0.
v. Now let H be symmetric positive definite and let f be the convex quadratic function f(x) = c^T x + (1/2) x^T H x. Furthermore, let 0 < µ_k ≤ µ for all k.
[CGT00] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Trust–Region Methods. SIAM, Philadel-
phia, 2000.
[DS83] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. Prentice-Hall, Englewood Cliffs, N. J, 1983. Republished
as [DS96].
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/
10.1137/1.9781611971200, doi:10.1137/1.9781611971200.
[Heb73] M. D. Hebden. An algorithm for minimization using exact second order derivatives.
Technical Report T.P. 515, Atomic Energy Research Establishment, Harwell, England,
1973.
[MS83] J. J. Moré and D. C. Sorensen. Computing a trust region step. SIAM J. Sci. Statist.
Comput., 4(3):553–572, 1983.
[MT94] J. J. Moré and D. J. Thuente. Line search algorithms with guaranteed sufficient decrease.
ACM Transactions on Mathematical Software, 20(3):286–307, 1994.
[RSS01] M. Rojas, S. A. Santos, and D. C. Sorensen. A new matrix-free algorithm for the large-
scale trust-region subproblem. SIAM J. Optim., 11(3):611–646 (electronic), 2000/01.
[Ste83] T. Steihaug. The conjugate gradient method and trust regions in large scale optimization.
SIAM J. Numer. Anal., 20:626–637, 1983.
[Toi81] Ph. L. Toint. Towards an efficient sparsity exploiting Newton method for minimization.
In I. S. Duff, editor, Sparse Matrices and Their Uses, pages 57–87. Academic Press,
New York, 1981.
7.1. Introduction
Given a smooth function R : Rn → Rm with component functions Ri , i = 1, . . . , m, we consider the
solution of the nonlinear least squares problem
and
∇^2 f(x) = R'(x)^T R'(x) + Σ_{i=1}^m ∇^2 R_i(x) R_i(x), (7.3)
respectively, where R0 (x) denotes the Jacobian of R and ∇2 Ri (x) is the Hessian of the ith component
function. Note that the first part of the Hessian, R0 (x)T R0 (x), uses derivative information already
required for the computation of ∇ f (x). In this chapter we study a variation of Newton’s method,
called the Gauss-Newton method, for the minimization of f (x) = 12 k R(x)k22 in which the Hessian
∇2 f (x) is replaced by R0 (x)T R0 (x). Before we discuss the Gauss-Newton method, we present a
class of problems that leads to nonlinear least squares problems (7.1) in the next Section 7.2 and
then we discuss the special case of linear least squares problems R(x) = Ax + b in Section 7.3.
The Gauss-Newton method is presented and analyzed in Section 7.4. The final Section 7.5 of this chapter treats a more complicated nonlinear least squares problem, parameter identification in
ordinary differential equations.
Figure 7.1: Least squares curve fitting. Given data (t i, bi ), i = 1, . . . , m, we want to find a function
ϕ(t; x 1, . . . , x n ) parameterized by x = (x 1, . . . , x n )T such that the sum of squares of the residuals
ϕ(t i ; x) − bi , i = 1, . . . , m, indicated by solid blue lines is minimal.
If the function R or, equivalently, ϕ depends linearly on x, then we call (7.4) a linear least
squares problem. Otherwise, we call (7.4) a nonlinear least squares problem.
For linear least squares problems the model function ϕ is of the form
ϕ(t; x_1, . . . , x_n) = x_1 ϕ_1(t) + . . . + x_n ϕ_n(t)
with some functions ϕ_1, . . . , ϕ_n. Notice that the function ϕ depends linearly on the parameters
x 1, . . . , x n but it may be a nonlinear function of t! If we introduce
A = ( ϕ_1(t_1) ϕ_2(t_1) . . . ϕ_n(t_1) ; ϕ_1(t_2) ϕ_2(t_2) . . . ϕ_n(t_2) ; . . . ; ϕ_1(t_m) ϕ_2(t_m) . . . ϕ_n(t_m) ) ∈ R^{m×n}, (7.5)
and
b = (b1, . . . , bm )T ∈ Rm, (7.6)
then
R(x) = Ax − b
and, thus, a linear least squares problem can be written in the form
Example 7.2.1 ([Esp81, Sec. 6]) The temperature T dependence of the rate constant k for an
elementary chemical reaction is almost always expressed by a relation of the form
k(T; C, U) = C T^n exp( −U/(RT) ), (7.8)
where R = 8.314 [J/(mol K)] is the general gas constant. Usually, n is assigned one of the values
0, 1/2, or 1. Depending on the choice of n, the notation in (7.8) varies. For example, for n = 0 we
obtain the Arrhenius equation, which is commonly written as
k(T; A, E) = A exp( −E/(RT) ). (7.9)
In (7.9), A is called the pre-exponential factor and E is called the activation energy, and both have to be estimated from experiments.
Consider the reaction
NO + ClNO2 → NO2 + ClNO.
Measurements of the rate constant k (measured in cm3 mol−1 sec−1 ) for various temperatures T
(measured in K) are shown in the following table.
The coefficients C and U in (7.8) can be computed by solving the nonlinear least squares
problem
min_{C,U} (1/2) Σ_{i=1}^5 ( k_i − k(T_i; C, U) )^2. (7.10)
Alternatively, we can divide both sides in (7.8) by T n and take the logarithm. This gives
ln( k(T; C, U)/T^n ) = ln(C) − U/(RT).
The quantities x 1 = ln(C) and x 2 = U can be estimated by solving the linear least squares problem
min_x (1/2) || ( . . . ; 1  −1/(R T_i) ; . . . ) x − ( . . . ; ln(k_i/T_i^n) ; . . . ) ||_2^2, (7.11)
where the ith row of the matrix is (1, −1/(R T_i)) and the ith entry of the vector is ln(k_i/T_i^n).
The problems (7.10) and (7.11) are related but different, because in the former we match k(T_i; C, U) with k_i while in the latter we match ln( k(T_i; C, U)/T_i^n ) with ln( k_i/T_i^n ).
We first estimate C, U from the linear least squares problem (7.11) and then use these estimates
as starting values in an optimization routine¹ to solve (7.10). The results are shown in the following tables.
Solution of (7.11)
n C U Res
0 6.167e + 11 2.808e + 04 2.705e + 12
1/2 2.087e + 10 2.674e + 04 2.673e + 12
1 7.062e + 08 2.541e + 04 2.642e + 12
Solution of (7.10)
n C U Res
0 6.167e + 11 2.807e + 04 2.661e + 12
1/2 2.087e + 10 2.673e + 04 2.632e + 12
1 8.621e + 08 2.595e + 04 2.464e + 12
Here, Res = Σ_{i=1}^5 ( k_i − k(T_i; C, U) )^2.
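The two-stage procedure (linearized fit (7.11) for starting values, then the nonlinear fit (7.10)) can be sketched as follows. The data arrays below are placeholders, since the measured values are not repeated here, and scipy.optimize.least_squares is used in place of the (unspecified) optimization routine from the footnote.

```python
import numpy as np
from scipy.optimize import least_squares

R = 8.314                                   # gas constant, J/(mol K)
n = 0                                       # Arrhenius form (7.9)

# Placeholder data (T_i, k_i); not the measured values from the table above.
T = np.array([300.0, 320.0, 340.0, 360.0, 380.0])
k = np.array([1.0e5, 3.0e5, 8.0e5, 2.0e6, 4.0e6])

# Linearized problem (7.11): rows (1, -1/(R T_i)), right hand side ln(k_i / T_i^n).
A = np.column_stack([np.ones_like(T), -1.0 / (R * T)])
b = np.log(k / T**n)
x, *_ = np.linalg.lstsq(A, b, rcond=None)
C0, U0 = np.exp(x[0]), x[1]                 # starting values for the nonlinear fit

# Nonlinear problem (7.10): residuals k_i - k(T_i; C, U).
res = lambda cu: k - cu[0] * T**n * np.exp(-cu[1] / (R * T))
sol = least_squares(res, x0=[C0, U0])
print(sol.x)
```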
So far we have considered scalar measurements (t i, bi ) ∈ R × R, i = 1, . . . m. Everything that
was said before can be extended to the case t i ∈ R k , bi ∈ R` , i = 1, . . . m.
We will return to the solution of linear and nonlinear least squares problems. For some
applications and statistical aspects of linear and nonlinear least squares problems we refer to the
book [BW88].
Since
(1/2) ||A x − b||_2^2 = (1/2) (A x − b)^T (A x − b) = (1/2) x^T A^T A x − x^T A^T b + (1/2) b^T b,
(7.12) is an instance of (4.9) with H = A^T A and c = −A^T b. Hence we can apply Theorem 4.3.6. A vector x_* solves (7.12) if and only if x_* solves
A^T A x = A^T b. (7.13)
The equations (7.13) are called the normal equations. Using the singular value decomposition of
A one can easily show that
R ( AT A) = R ( AT ), N ( AT A) = N ( A).
Since AT b ∈ R ( AT ) = R ( AT A), the normal equations are solvable. We obtain the following result.
1We have used the Matlab function ls.
[Figure: sketch of b, its projection A x onto R(A), and the residual A x − b.]
Theorem 7.3.1 A vector x ∗ solves (7.12) if and only if x ∗ solves the normal equation (7.13).
The normal equation has at least one solution x ∗ . The set of solutions of (7.13) is given by
Sb = x ∗ + N ( A),
where x ∗ denotes a particular solution of (7.13) and N ( A) denotes the null space of A.
If N ( A) , {0}, the set of solutions of (7.12) forms a manifold in Rn . In this case the minimum
norm solution x † is of interest. It is the solution of (7.12) with smallest norm. Mathematically, the
minimum norm solution x † is the solution of
min_{x ∈ S_b} ||x||_2.
It can be shown that the minimum norm solution x † is the solution of the least squares problem
which is perpendicular to the null-space of A. See also Figure 7.3 and (7.18) below.
If A ∈ Rm×n has rank n, which implies m ≥ n, then AT A is invertible and the solution of the
least squares problem (7.12) (or equivalently the normal equation (7.13)) is unique and given by
x = ( AT A) −1 AT b. (7.14)
If A ∈ R^{m×n} has rank m, which implies m ≤ n, then A^T A is not invertible (unless m = n) and the least squares problem has infinitely many solutions. The matrix A A^T is invertible and it is easy to verify that
x = AT ( A AT ) −1 b is a solution of the least squares problem (7.12) (or equivalently the normal
equation (7.13)). In fact, since AT ( A AT ) −1 b ∈ R ( AT ) = R ( AT A) = N ( AT A) ⊥ = N ( A) ⊥ ,
x † = AT ( A AT ) −1 b (7.15)
such that
A = UΣV T . (7.16)
The decomposition (7.16) of A is called the singular value decomposition of A. The scalars σi ,
i = 1, . . . , min{m, n} are called the singular values of A.
Using the orthogonality of U and V we find that
z_i = u_i^T b / σ_i,   i = 1, . . . , r,
z_i arbitrary,   i = r + 1, . . . , n.
Moreover,
A V z = U Σ z = Σ_{i=1}^r σ_i z_i u_i = Σ_{i=1}^r (u_i^T b) u_i (7.17)
and
min (1/2) ||A x − b||_2^2 = (1/2) Σ_{i=r+1}^m (u_i^T b)^2.
Since V is orthogonal, we find that ||x||_2 = ||V z||_2 = ||z||_2.
Hence, the minimum norm solution of the linear least squares problem is given by x^† = V z^†,
z_i^† = u_i^T b / σ_i,   i = 1, . . . , r,
z_i^† = 0,   i = r + 1, . . . , n,
i.e.,
x^† = Σ_{i=1}^r ( u_i^T b / σ_i ) v_i. (7.18)
Since {v1, . . . , vr } is an orthonormal basis for N ( A) ⊥ , we see that x † ⊥ N ( A). Moreover, since
{u1, . . . , ur } is an orthonormal basis for R ( A), the projection PR ( A) b of b onto R ( A) is given by
P_{R(A)} b = Σ_{i=1}^r (u_i^T b) u_i
we see that Ax ∗ = PR ( A) b for all solutions x ∗ of (7.12) (see (7.17)). The structure of the solution
of the linear least squares problem is sketched in Figure 7.3.
Given the SVD (7.16) of A, the matrix
A† = V Σ†U T , (7.19)
where Σ† ∈ Rn×m is the diagonal matrix with diagonal entries 1/σ1, . . . , 1/σr , 0, . . . , 0, is called
the Moore–Penrose pseudo inverse. The minimum norm solution (7.18) of the linear least squares
problem is given by
x † = A† b. (7.20)
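In code, (7.18)-(7.20) amount to a few lines. The rank tolerance used to decide which singular values are treated as zero is an implementation choice made for this sketch.

```python
import numpy as np

def min_norm_lsq(A, b, rtol=1e-12):
    """Minimum norm solution x^dagger = A^dagger b of min ||Ax - b||_2 via the SVD, cf. (7.18)."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    r = np.sum(sigma > rtol * sigma[0])           # numerical rank
    return Vt[:r].T @ ((U[:, :r].T @ b) / sigma[:r])

# Small check against NumPy's pseudo-inverse on a rank-deficient matrix (m = 2 < n = 3, rank 1)
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
b = np.array([1.0, 2.0])
print(min_norm_lsq(A, b), np.linalg.pinv(A) @ b)
```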
[Figure 7.3: A maps R^n to R^m; shown are the solution set S_b, the minimum norm solution x^† ∈ N(A)^⊥, the subspaces N(A) and R(A), and the projection P_{R(A)} b of b onto R(A).]
AP = QR, (7.21)
Now we partition
Q^T b = ( c_1 ; c_2 ; d ),   c_1 ∈ R^r, c_2 ∈ R^{n−r}, d ∈ R^{m−n}, (7.23)
and we set
y = P^T x, i.e., x = P y.
Let
y = ( y_1 ; y_2 ),   y_1 ∈ R^r, y_2 ∈ R^{n−r}. (7.24)
This yields
||A x − b||_2^2 = || ( R_1 y_1 + R_2 y_2 − c_1 ; c_2 ; d ) ||_2^2 = ||R_1 y_1 + R_2 y_2 − c_1||_2^2 + ||c_2||_2^2 + ||d||_2^2.
If y solves the minimization problem on the right hand side, then x = P y solves the minimization problem on the left hand side and vice versa. Since R_1 ∈ R^{r×r} is nonsingular, we can compute
y_1 = −R_1^{-1} (R_2 y_2 − c_1)
for any y_2 ∈ R^{n−r}, so that the set of minimizing vectors y is given by
{ ( −R_1^{-1}(R_2 y_2 − c_1) ; y_2 ) | y_2 ∈ R^{n−r} }
and we find that
Consequently, the set of solutions of the linear least squares problem min_x ||A x − b||_2 is given by
S_b = { P ( −R_1^{-1}(R_2 y_2 − c_1) ; y_2 ) | y_2 ∈ R^{n−r} } (7.25)
and
min_x ||A x − b||_2^2 = ||c_2||_2^2 + ||d||_2^2.
A particular solution can be found by setting y2 = 0 which yields
y = ( R_1^{-1} c_1 ; 0 )   and   x = P ( R_1^{-1} c_1 ; 0 ).
The minimum norm solution x † is defined as the solution of
min_{x ∈ S_b} ||x||_2.
Using (7.25) and the fact that ||P y||_2 = ||y||_2 for all y ∈ R^n, we find that
min_{x ∈ S_b} ||x||_2 = min_{y_2 ∈ R^{n−r}} || P ( −R_1^{-1}(R_2 y_2 − c_1) ; y_2 ) ||_2
= min_{y_2 ∈ R^{n−r}} || ( −R_1^{-1} R_2 ; I ) y_2 + ( R_1^{-1} c_1 ; 0 ) ||_2. (7.26)
The right hand side in (7.26) is just another linear least squares problem in y2 . Its solution can be
obtained by solving the normal equations
(R1−1 R2 )T R1−1 R2 + I y2 = (R1−1 R2 )T R1−1 c1 (7.27)
−R1−1 R2
!
∈ Rn×(n−r) (7.28)
I
and proceeding as above. This matrix in (7.28) has full rank n − r. Consequently, (7.27) or,
equivalently, (7.26) has a unique solution y2∗ . The minimum norm solution of (7.12) is given by
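The QR based construction above can be sketched in Matlab as follows; the rank test via the diagonal of R and the test data A, b are our own choices for illustration:

% Minimum norm solution via QR with column pivoting, following (7.21)-(7.28).
A = [1 2 3; 2 4 6; 1 0 1];   b = [1; 1; 1];      % rank(A) = 2 test data
[m,n]   = size(A);
[Q,R,P] = qr(A);                                 % A*P = Q*R with column pivoting
r   = sum(abs(diag(R)) > max(m,n)*eps(abs(R(1,1))));   % numerical rank
R1  = R(1:r,1:r);   R2 = R(1:r,r+1:n);
c   = Q'*b;   c1 = c(1:r);
T   = R1\R2;   c1t = R1\c1;
y2  = (T'*T + eye(n-r)) \ (T'*c1t);              % normal equations (7.27)
y1  = c1t - T*y2;                                % y1 = -R1^{-1}(R2*y2 - c1)
xdag = P*[y1; y2];
norm(xdag - pinv(A)*b)                           % agrees with the SVD formula (7.18)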
We note that once the Jacobian R′(x) is computed, we can compute the gradient of f and we can
compute the first term in the Hessian of f. If R′(x)^T R′(x) is large compared to Σ_{i=1}^{m} R_i(x) ∇²R_i(x),
then ∇²f(x) ≈ R′(x)^T R′(x). This will be the case if, e.g., the R_i(x), i = 1, . . . , m, are small, or if
the ∇²R_i(x), i = 1, . . . , m, are small. The latter condition means that the R_i are almost linear.
If we omit the second order derivative information, then the approximate Newton system is of
the form
R′(x_k)^T R′(x_k) s = −R′(x_k)^T R(x_k).
This system is always solvable, and it has a unique solution if and only if the Jacobian R′(x_k) has
rank n. The previous system is the normal equation for the linear least squares problem
min_s (1/2) ||R′(x_k) s + R(x_k)||_2^2.
Suppose that the matrices
∇²f(x_k) + Δ(x_k) = R′(x_k)^T R′(x_k)
are invertible, that
|| (R′(x_k)^T R′(x_k))^{-1} ||_2 ≤ M,
and that
|| (∇²f(x_k) + Δ(x_k))^{-1} Δ(x_k) ||_2 ≤ α_k ≤ α < 1,
i.e.,
|| (R′(x_k)^T R′(x_k))^{-1} Σ_{i=1}^{m} R_i(x_k) ∇²R_i(x_k) ||_2 ≤ α_k ≤ α < 1. (7.31)
Under these assumptions the Gauss–Newton method is locally convergent and the iterates obey
||x_{k+1} − x*||_2 ≤ α_k ||x_k − x*||_2 + (ML/2) ||x_k − x*||_2^2.
Theorem 7.4.2 (Local Convergence of the GN Method for Full Rank Problems) Let D ⊂ R^n
be an open set and let x* ∈ D be a (local) solution of the nonlinear least squares problem. Suppose
that R : D → R^m is continuously differentiable with R′ ∈ Lip_L(D) and suppose that for all x ∈ D
the Jacobian R′(x) has rank n. If there exist ω > 0 and κ ∈ (0, 1) such that for all x ∈ D and all
t ∈ [0, 1] the conditions
|| (R′(x)^T R′(x))^{-1} (R′(x) − R′(x*))^T R(x*) ||_2 ≤ κ ||x − x*||_2, (7.32)
|| (R′(x)^T R′(x))^{-1} R′(x)^T (R′(x + t(x* − x)) − R′(x))(x − x*) ||_2 ≤ ω t ||x − x*||_2^2 (7.33)
hold, then there exists ε > 0 such that if x_0 ∈ B_ε(x*), then the iterates {x_k} generated by the Gauss–
Newton method converge to x* and obey the estimate
||x_{k+1} − x*||_2 ≤ (ω/2) ||x_k − x*||_2^2 + κ ||x_k − x*||_2. (7.34)
Proof: i. The definition of the Gauss–Newton step and the optimality condition R′(x*)^T R(x*) = 0 yield
x_{k+1} − x* = x_k − x* − (R′(x_k)^T R′(x_k))^{-1} R′(x_k)^T R(x_k)
= (R′(x_k)^T R′(x_k))^{-1} [ R′(x_k)^T { R′(x_k)(x_k − x*) − R(x_k) + R(x*) } + (R′(x*) − R′(x_k))^T R(x*) ]
= (R′(x_k)^T R′(x_k))^{-1} [ R′(x_k)^T ∫_0^1 ( R′(x_k) − R′(x* + t(x_k − x*)) )(x_k − x*) dt + (R′(x*) − R′(x_k))^T R(x*) ].
Hence
||x_{k+1} − x*||_2 ≤ || ∫_0^1 (R′(x_k)^T R′(x_k))^{-1} R′(x_k)^T { R′(x_k) − R′(x* + t(x_k − x*)) }(x_k − x*) dt ||_2
+ || (R′(x_k)^T R′(x_k))^{-1} (R′(x*) − R′(x_k))^T R(x*) ||_2
≤ (ω/2) ||x_k − x*||_2^2 + κ ||x_k − x*||_2.
ii. Let ε_1 > 0 be such that B_{ε_1}(x*) ⊂ D and let σ ∈ (κ, 1) be arbitrary. If
ε ≤ min{ ε_1, 2(σ − κ)/ω }
and x_0 ∈ B_ε(x*), then
||x* − x_1||_2 ≤ (ω/2) ||x_0 − x*||_2^2 + κ ||x_0 − x*||_2 < ( (ω/2) ε + κ ) ||x_0 − x*||_2 ≤ σ ||x_0 − x*||_2.
A simple induction argument shows that all iterates remain in B_ε(x*) and that ||x_{k+1} − x*||_2 ≤ σ ||x_k − x*||_2 for all k, which implies convergence.
The condition (7.33) is implied by the Lipschitz continuity of R′ and by the uniform boundedness
of ||(R′(x)^T R′(x))^{-1} R′(x)^T||_2. In fact, if R′ ∈ Lip_L(D) and
a = sup_{x ∈ D} || (R′(x)^T R′(x))^{-1} R′(x)^T ||_2 < ∞,
then
|| (R′(x)^T R′(x))^{-1} R′(x)^T (R′(x + t(x* − x)) − R′(x))(x − x*) ||_2
≤ || (R′(x)^T R′(x))^{-1} R′(x)^T ||_2 · L t ||x* − x||_2 · ||x − x*||_2 ≤ a L t ||x − x*||_2^2.
Thus (7.33) holds with ω ≤ aL. The condition (7.32) is more interesting. Clearly, if R(x*) = 0
(zero residual problem) or if R is affine linear, then (7.32) is satisfied with κ = 0 and the
Gauss–Newton method converges locally q-quadratically. We will show in Lemma 7.4.3 below that
(7.32) is essentially equivalent to the condition (7.31) with α = κ. Lemma 7.4.4 below relates
(7.32) (via the results in Lemma 7.4.3) to the second order sufficient optimality condition. The
analysis follows [Boc88, Sec. 3] and [Hei93].
Lemma 7.4.3 Let D ⊂ R^n be an open set and let x* ∈ D be a (local) solution of the nonlinear least
squares problem. Suppose that R : D → R^m is continuously differentiable. Moreover, assume that
R_i, i = 1, . . . , m, is twice differentiable at x* and that R′(x*)^T R′(x*) is invertible.
i. If
|| (R′(x*)^T R′(x*))^{-1} (R′(x)^T − R′(x*)^T) R(x*) || ≤ κ ||x − x*|| for all x ∈ D,
then
|| (R′(x*)^T R′(x*))^{-1} Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) ||_2 ≤ κ.
ii. If
|| (R′(x*)^T R′(x*))^{-1} Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) ||_2 ≤ κ̂,
then for any κ > κ̂ there exists ε > 0 such that for all x ∈ B_ε(x*)
|| (R′(x*)^T R′(x*))^{-1} (R′(x)^T − R′(x*)^T) R(x*) || ≤ κ ||x − x*||.
for all h ∈ R^n with ||h||_2 = 1. Since we can cancel δ on the left and on the right hand side of the
previous inequality, this yields
|| (R′(x*)^T R′(x*))^{-1} Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) ||_2 ≤ κ + φ(δ).
If we take the limit δ → 0, we obtain the assertion.
ii. The second assertion can be proven in a similar way.
If the assumptions of Lemma 7.4.3 are satisfied, if R_i, i = 1, . . . , m, are twice differentiable,
and if the Hessians ∇²R_i, i = 1, . . . , m, are Lipschitz continuous, then Lemma 7.4.3 ii. shows that
(7.32) implies (7.31) with α ∈ (κ, 1) for all x_k sufficiently close to x*.
The next result relates (7.32) (via the results in Lemma 7.4.3) to the second order sufficient
optimality condition.
Lemma 7.4.4 Let D ⊂ R^n be an open set. Suppose that R_i : D → R, i = 1, . . . , m, are
twice continuously differentiable. If R′(x*)^T R′(x*) is invertible, then the following statements are
equivalent:
i. There exists λ > 0 with
h^T R′(x*)^T R′(x*) h − | h^T Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) h | ≥ λ ||h||_2^2 for all h ∈ R^n. (7.35)
Since
|| [R′(x*)^T R′(x*)]^{-1/2} Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) [R′(x*)^T R′(x*)]^{-1/2} ||_2
= || (R′(x*)^T R′(x*))^{-1} Σ_{i=1}^{m} R_i(x*) ∇²R_i(x*) ||_2,
If
(1/2) ||R′(x_k) s_k + R(x_k)||_2^2 < (1/2) ||R(x_k)||_2^2, (7.37)
then s_k is a descent direction. Hence we can use a line search. The new iterate is
x_{k+1} = x_k + α_k s_k,
where the step size α_k > 0 is chosen according to the conditions in Section 6.2.2 applied to
f(x) = (1/2) ||R(x)||_2^2.
Often the special structure of the function can be used to find equivalent but more convenient
representations of the line search conditions. For example, the sufficient decrease condition (6.5)
for f(x) = (1/2) ||R(x)||_2^2 is given by
(1/2) ||R(x_k + α_k s_k)||_2^2 ≤ (1/2) ||R(x_k)||_2^2 + c_1 α_k R(x_k)^T R′(x_k) s_k. (7.38)
If s_k is the exact solution of the linear least squares problem min_s (1/2) ||R′(x_k) s + R(x_k)||_2^2, then the
sufficient decrease condition (7.38) is equivalent to
(1/2) ||R(x_k + α_k s_k)||_2^2 ≤ (1/2) ||R(x_k)||_2^2 + c_1 α_k ( ||R′(x_k) s_k + R(x_k)||_2^2 − ||R(x_k)||_2^2 ). (7.39)
See Problem 6.1. The representation (7.39) of the sufficient decrease condition only requires
the quantities ||R(x_k)||_2 and ||R′(x_k) s_k + R(x_k)||_2, which have to be computed anyway during the
Gauss–Newton algorithm.
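A minimal sketch of a damped Gauss–Newton iteration using (7.39) as the sufficient decrease test is given below; the function handles Rfun and Jfun, the stopping tolerance, and the backtracking safeguard are placeholders supplied by the user:

function x = gauss_newton_ls(Rfun, Jfun, x, maxit, tol)
% Damped Gauss-Newton method with Armijo backtracking based on (7.39).
% Rfun(x) returns R(x), Jfun(x) returns the Jacobian R'(x).
c1 = 1e-4;
for k = 1:maxit
    R = Rfun(x);   J = Jfun(x);
    g = J'*R;                                 % gradient of f(x) = 0.5*||R(x)||^2
    if norm(g) <= tol, break; end
    s = -J\R;                                 % Gauss-Newton step (linear least squares)
    phi0 = 0.5*norm(R)^2;
    pred = norm(J*s + R)^2 - norm(R)^2;       % equals grad f(x)'*s for the exact GN step
    alpha = 1;
    while 0.5*norm(Rfun(x + alpha*s))^2 > phi0 + c1*alpha*pred && alpha > 1e-12
        alpha = alpha/2;                      % backtracking
    end
    x = x + alpha*s;
end
end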
If R′(x_k) is rank deficient, the linear least squares problem (7.40) has infinitely many solutions. How do we choose the step s_k from the set of least squares solutions?
It seems unreasonable to take arbitrarily large steps. We will take the minimum norm solution
as our step. This step can be computed with the methods described in Sections 7.3.1 or 7.3.2. Note
that if R′(x_k)^T R(x_k) = 0, then s_k = 0 is the minimum norm solution of (7.40). Thus, if the first
order necessary optimality conditions for (1/2) ||R(x)||_2^2 are satisfied at x_k, in particular, if x_k is a local
minimizer, then the Gauss–Newton method with the choice (7.41) will not move away from such a
point.
Our convergence analysis of the Gauss–Newton method for the rank deficient case follows
[Boc88, DH79]. See also [DH95] and [Deu04, Ch. 4]. If R′(x) ∈ R^{m×n} has rank n, then
R′(x_k)† = (R′(x_k)^T R′(x_k))^{-1} R′(x_k)^T.
We note that
R′(x*)^T R(x*) = 0 ⟺ R′(x*)† R(x*) = 0.
Clearly, if R′(x*)^T R(x*) = 0, the minimum norm solution of min_s (1/2) ||R′(x*) s + R(x*)||_2^2 is
R′(x*)† R(x*) = 0. On the other hand, if the minimum norm solution R′(x*)† R(x*) of
min_s (1/2) ||R′(x*) s + R(x*)||_2^2 is zero, then R′(x*)^T R(x*) = 0.
Theorem 7.4.5 (Local Convergence of the GN Method) Let D ⊂ Rn be an open set and let
R : D → Rm be continuously differentiable in D. If there exist ω > 0 and κ ∈ (0, 1) such that for
all x ∈ D and all t ∈ [0, 1] the following conditions hold
|| (R′(y)† − R′(x)†) ( R(x) − R′(x) R′(x)† R(x) ) ||_2 ≤ κ ||y − x||_2, (7.42a)
|| R′(y)† (R′(x + t(y − x)) − R′(x))(y − x) ||_2 ≤ ω t ||y − x||_2^2, (7.42b)
where α_0 = ||R′(x_0)† R(x_0)||_2 and B_{α_0/(1−δ_0)}(x_0) ⊂ D. In particular, x_0 and
x_1 = x_0 − R′(x_0)† R(x_0) belong to B_{α_0/(1−δ_0)}(x_0).
First we note that the identity R′(x)† R′(x) R′(x)† = R′(x)† implies
(R′(y)† − R′(x)†) ( R(x) − R′(x) R′(x)† R(x) ) = R′(y)† ( R(x) − R′(x) R′(x)† R(x) ).
From the identities s_{k+1} = −R′(x_{k+1})† R(x_{k+1}) and s_k = −R′(x_k)† R(x_k) we find that
Remark 7.4.6 i. Note that the proof of Theorem 7.4.5 only required the property
R′(x_k)† R′(x_k) R′(x_k)† = R′(x_k)†, not all of the four properties of the Moore–Penrose pseudoinverse.
and vice versa. If we add a multiple µ_k > 0 of the identity to R′(x_k)^T R′(x_k), then the resulting
matrix R′(x_k)^T R′(x_k) + µ_k I is positive definite and ||(R′(x_k)^T R′(x_k) + µ_k I)^{-1}||_2 ≤ µ_k^{-1}.
Furthermore, using the SVD R′(x_k) = UΣV^T we can show that the unique solution s_k of
( R′(x_k)^T R′(x_k) + µ_k I ) s = −R′(x_k)^T R(x_k) (7.44)
is given by
s_k = − Σ_{i=1}^{min{m,n}} ( σ_i / (σ_i^2 + µ_k) ) (u_i^T R(x_k)) v_i. (7.45)
It can be shown that µ → ||s_k(µ)||_2^2, where
s_k(µ) = − Σ_{i=1}^{min{m,n}} ( σ_i / (σ_i^2 + µ) ) (u_i^T R(x_k)) v_i,
is monotonically decreasing (see Problem 7.3). Furthermore,
lim_{µ_k → 0} − Σ_{i=1}^{min{m,n}} ( σ_i / (σ_i^2 + µ_k) ) (u_i^T R(x_k)) v_i = −R′(x_k)† R(x_k)
(see (7.18)).
(see (7.18)). Thus the parameter µ k > 0 can be used to control the size of the step s k . In the
nearly rank deficient case we use the step (7.45). However, in an implementation of this variation
of the Gauss–Newton method we do not set up and solve (7.44). Instead we note that (7.44) are the
necessary and sufficient optimality conditions for the linear least squares problem
min_s (1/2) || ( R′(x_k); √µ_k I ) s + ( R(x_k); 0 ) ||_2^2. (7.46)
Note also that
(1/2) || ( R′(x_k); √µ_k I ) s + ( R(x_k); 0 ) ||_2^2 = (1/2) ||R′(x_k) s + R(x_k)||_2^2 + (µ_k/2) ||s||_2^2.
There exist methods for the solution of linear least squares problems of the type (7.46) that utilize
the special structure of this problem.
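For illustration, the step can also be obtained in Matlab from the augmented problem (7.46) with the backslash operator; here Jk, Rk, and mu stand for R′(x_k), R(x_k), and µ_k and are assumed to be available:

% Levenberg-Marquardt step via the augmented least squares problem (7.46).
% Jk = R'(x_k) (m x n), Rk = R(x_k) (m x 1), mu > 0 are assumed given.
[m,n] = size(Jk);
s_lm  = - [Jk; sqrt(mu)*eye(n)] \ [Rk; zeros(n,1)];
% This is equivalent to solving (Jk'*Jk + mu*eye(n))*s = -Jk'*Rk, cf. (7.44),
% but avoids forming Jk'*Jk explicitly.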
We still have to discuss the choice of µ_k. Clearly (7.45) is a perturbation of the Gauss–Newton
step. We do not want to add an unnecessarily large µ_k > 0. On the other hand, we want to pick
µ_k > 0 so that the size of the step s_k (or, alternatively, the size of (R′(x_k)^T R′(x_k) + µ_k I)^{-1})
does not become artificially large; this is delicate because the rank of R′(x_k) is difficult to determine. A method that chooses
µ_k adaptively is the Levenberg–Marquardt method [Lev44, Mar63]. This method is closely related
to trust-region methods, which were discussed in Section 6.3. In fact, if s_k solves (7.44) then it
solves
min_s (1/2) ||R′(x_k) s + R(x_k)||_2^2
s.t. ||s||_2 ≤ Δ_k.
If a scaling matrix D_k is used, then, analogously,
(1/2) || ( R′(x_k); √µ_k D_k ) s + ( R(x_k); 0 ) ||_2^2 = (1/2) ||R′(x_k) s + R(x_k)||_2^2 + (µ_k/2) ||D_k s||_2^2,
and
min_s (1/2) || ( R′(x_k); √µ_k D_k ) s + ( R(x_k); 0 ) ||_2^2
is equivalent to
min_s (1/2) ||R′(x_k) s + R(x_k)||_2^2 s.t. ||D_k s||_2 ≤ Δ_k,
where Δ_k = ||D_k s_k||_2. For the choice of scaling see [Mar63] and [Mor78].
A trust-region view of the Levenberg–Marquardt method is described in [Mor78]. NL2SOL
is an older, but still popular code for the solution of nonlinear least squares problems [DGW81,
DGE81, Gay83]. For example, it is part of R, a language and environment for statistical computing
and graphics; see http://www.r-project.org.
σ_A A + σ_B B → σ_C C + σ_D D. (7.47)
Here σ_A, σ_B, σ_C, σ_D are the stoichiometric coefficients. The compounds A, B are the reactants,
C, D are the products. The → indicates that the reaction is irreversible. For a reversible reaction
we use ⇌. For example, the reversible reaction of carbon dioxide and hydrogen to form methane
plus water is
CO2 + 4 H2 ⇌ CH4 + 2 H2O. (7.48)
Notice that the number of atoms on the left and on the right hand side balance. For example, there is a
single C atom and there are two O atoms. However, the appearance of two reactants and two products
is accidental. The stoichiometric coefficients in (7.48) are σCO2 = 1, σ H2 = 4, σCH4 = 1, σ H2 O = 2.
For each reaction we have a rate r of the reaction that together with the stoichiometric coefficients
determines the change in concentrations resulting from the reaction. Concentrations are typically
measured in [mol/L]. The reaction rate is defined as the number of reactive events per second per
unit volume and is measured in [mol sec^{-1} L^{-1}]. For example, if the rate of the reaction (7.47) is r
and if we denote the concentration of compound A, . . . by C_A, . . ., we have the following changes in
concentrations:
(d/dt) C_A(t) = . . . − σ_A r . . . , (d/dt) C_B(t) = . . . − σ_B r . . . ,
(d/dt) C_C(t) = . . . + σ_C r . . . , (d/dt) C_D(t) = . . . + σ_D r . . . .
The dots indicate that other reactions or inflows and outflows will in general also enter the change
in concentration. For a reaction of the form (7.47) the rate of the reaction r is of the form
r = k C_A^α C_B^β,
where k is the reaction rate constant and α, β are nonnegative parameters. The sum α + β is called
the order of the reaction. The reaction rate constant depends on the temperature and is often given
by the Arrhenius equation (7.9).
As a particular example we consider an autocatalytic reaction. This example is taken from
[Ram97, S. 4.2]. ‘Autocatalysis is a term commonly used to describe the experimentally observable
phenomenon of a homogeneous chemical reaction which shows a marked increase in rate in time,
reaches its peak at about 50 percent conversion, and the drops off. The temperature has to remain
constant and all ingredients must be mixed at the start for proper observation.’ We consider the
catalytic thermal decomposition of a single compound A into two products B and C, of which B is
the autocatalytic agent. A can decompose via two routes, a slow uncatalyzed one (r 1 ) and another
catalyzed by B (r 3 ) The three essential kinetic steps are
A → B + C Start or background reaction,
A + B → AB Complex formation,
AB → 2B + C Autocatalytic step.
The autocatalytic agent B forms a complex AB (second reaction). Next, the complex AB decom-
poses, thereby releasing B in addition to forming B and C (third reaction). The last two reactions
form the path by which most of A decomposes. The first reaction is the starter, but continues
concurrently with the last two as long as there is any A.
Again, we denote the concentration of compound A, . . . by C_A, . . .. The reaction rates for the
three reactions are
r_1 = k_1 C_A, r_2 = k_2 C_A C_B, r_3 = k_3 C_AB.
The resulting system of ordinary differential equations is
dC_A/dt = −k_1 C_A − k_2 C_A C_B, (7.49a)
dC_B/dt = k_1 C_A − k_2 C_A C_B + 2 k_3 C_AB, (7.49b)
dC_AB/dt = k_2 C_A C_B − k_3 C_AB, (7.49c)
dC_C/dt = k_1 C_A + k_3 C_AB (7.49d)
with given initial values
C_A(0) = C_A0, C_B(0) = C_B0, C_AB(0) = C_AB0, C_C(0) = C_C0. (7.49e)
We set
p = (k_1, k_2, k_3)^T,
y(t) = (C_A(t), C_B(t), C_AB(t), C_C(t))^T,
y_0 = (C_A0, C_B0, C_AB0, C_C0)^T.
We see that the initial value problem (7.49a)–(7.49e) is a particular instance of the initial value
problem
y0 (t) = F (t, y(t), p), t ∈ [t 0, t f ]
(7.50)
y(t 0 ) = y0 (p),
where F : R × Rn × Rl → Rn , y0 : Rl → Rn .
We first review a result on the existence and uniqueness of the solution of the initial value
problem (7.50).
Theorem 7.5.1 Let G ⊂ R × R^n be an open connected set, let P ⊂ R^l, and for each p ∈ P let
F(·, ·, p) : G → R^n be continuous and bounded by M. If
J = { (t, y) ∈ R × R^n : |t − t_0| ≤ δ, ||y − y_0(p)||_2 ≤ δM } ⊂ G
and F is Lipschitz continuous with respect to y on J, i.e., there exists L > 0 such that
||F(t, y, p) − F(t, ỹ, p)||_2 ≤ L ||y − ỹ||_2 for all (t, y), (t, ỹ) ∈ J,
then there exists a unique solution y(·; p) of the initial value problem (7.50) on I = [t_0 − δ, t_0 + δ].
Figure 7.4: Solution of the autocatalytic reaction (7.49) with initial values C_A(0) = 1, C_B(0) = 0, C_AB(0) = 0, C_C(0) = 0 and parameters k_1 = 10^{-4}, k_2 = 1, k_3 = 8 · 10^{-4}.
Figure 7.4 shows the solution of (7.49a)–(7.49e) with initial values C A (0) = 1, CB (0) =
0, C AB (0) = 0, CC (0) = 0 and parameters k 1 = 10−4 , k2 = 1, k3 = 8 ∗ 10−4 on the time in-
terval t 0 = 0 to t f = 7200 secs. (2 hrs). The computations were done using the Matlab ODE
solver ode23s with the default options.
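The computation behind Figure 7.4 can be reproduced with a few lines of Matlab (a sketch; the variable names are ours):

% Solve the autocatalytic reaction ODE (7.49) with ode23s, cf. Figure 7.4.
k1 = 1e-4;  k2 = 1;  k3 = 8e-4;            % rate constants
y0 = [1; 0; 0; 0];                          % C_A(0), C_B(0), C_AB(0), C_C(0)
rhs = @(t,y) [ -k1*y(1) - k2*y(1)*y(2);
                k1*y(1) - k2*y(1)*y(2) + 2*k3*y(3);
                k2*y(1)*y(2) - k3*y(3);
                k1*y(1) + k3*y(3) ];
[t,y] = ode23s(rhs, [0 7200], y0);          % two hours of reaction time
plot(t, y); xlabel('Time (sec)'); ylabel('Concentration (kmol/L)');
legend('C_A','C_B','C_{AB}','C_C');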
Now, suppose the reaction rates are not known, but have to be determined through an experiment.
Given initial concentrations we will run the reaction and measure the concentrations ŷi ∈ R4 at
times t i , i = 1, . . . , m. We try to fit the function y(t; p) to the measurements, where y(·; p) is the
solution of the initial value problem (7.50). This leads to the nonlinear least squares problem
min_p (1/2) ||R(p)||_2^2, (7.51)
where
R(p) = ( y(t_1; p) − ŷ_1; y(t_2; p) − ŷ_2; . . . ; y(t_m; p) − ŷ_m ) ∈ R^{mn} (7.52)
and y(·; p) is the solution of the initial value problem (7.50). For the evaluation of R(p) at a
given p we have to solve the ODE (7.50), evaluate the solution y(·; p) of this ODE at the points t_i,
i = 1, . . . , m, and then assemble the vector R(p) in (7.52). Thus, R : R^l → R^{mn} is a composition
of functions
p → y(·; p) → y(t_i; p) → R(p),
R^l → C(I, R^n) → R^n → R^{mn}.
By C^ℓ(S, R^n) we denote the set of all functions g : S ⊂ R^j → R^n which are ℓ times continuously
differentiable on S. If the ODE (7.50) is solved numerically, then we do not obtain the exact solution
y(·; p), but only an approximation yh (·; p). Consequently, we are only able to compute
p 7→ yh (·; p) 7→ yh (t i ; p) 7→ Rh (p).
The error between R(p) and Rh (p) depends on the accuracy of the ODE solver. Thus, while we
want to minimize 12 k R(p)k22 , we do not have access to this function, but only to 12 k Rh (p)k22 .
Next we consider the differentiability of the map
p → y(·; p), R^l → C(I, R^n).
If this map is differentiable at p, its derivative applied to a direction δp is a function t → W(t; p) δp with
W(·; p) : R → R^{n×l}
such that
lim_{||δp||_2 → 0} (1/||δp||_2) max_{t ∈ I} || y(t; p + δp) − y(t; p) − W(t; p) δp ||_2 = 0. (7.53)
The existence of the derivative can be established with the aid of the implicit function theorem,
which also tells us how the derivative can be computed. For more details we refer to [Ama90],
[HNW93, Sec. I.14], or [Wal98]. The following result is taken from [WA86, Thm. 3.2.16].
Theorem 7.5.2 Let G ⊂ R × R^n be an open connected set and p̄ ∈ R^l. Further, let δ, δ_1 > 0. If
{ (t, y) ∈ R × R^n : t ∈ I, ||y − y_0(p)||_2 ≤ ε̄ + M |t − t_0| } ⊂ G,
then for each p ∈ P there exists a unique solution y(·; p) ∈ C^ℓ(I, R^n) of the initial value problem
(7.50). Moreover, the solution is ℓ times continuously differentiable with respect to p and the first
derivative W(·; p) = (d/dp) y(·; p) is the solution of
W′(t) = ∂_y F(t, y(t; p), p) W(t) + ∂_p F(t, y(t; p), p), t ∈ I,
W(t_0) = (d/dp) y_0(p). (7.54)
The linear differential equation (7.54) is sometimes also called the sensitivity equation. The
function Wi j (t) is the sensitivity of the ith component of the solution with respect to the parameter
p j . From (7.53) we see that
yi (t; p + (δp) j e j ) = yi (t; p) + Wi j (t; p)(δp) j + o((δp) j ) for all t ∈ I.
Here e_j denotes the jth unit vector and (δp)_j ∈ R.
For the ODE (7.49a)–(7.49e), written in the more abstract notation
(d/dt) y_1 = −k_1 y_1 − k_2 y_1 y_2,
(d/dt) y_2 = k_1 y_1 − k_2 y_1 y_2 + 2 k_3 y_3,
(d/dt) y_3 = k_2 y_1 y_2 − k_3 y_3,
(d/dt) y_4 = k_1 y_1 + k_3 y_3, (7.55)
y_1(0) = y_{1,0}, y_2(0) = y_{2,0}, y_3(0) = y_{3,0}, y_4(0) = y_{4,0}, (7.56)
the sensitivity equations are given by
(d/dt) W = [ −k_1 − k_2 y_2   −k_2 y_1    0      0
              k_1 − k_2 y_2   −k_2 y_1    2 k_3  0
              k_2 y_2          k_2 y_1   −k_3    0
              k_1              0          k_3    0 ] W
         + [ −y_1   −y_1 y_2    0
              y_1   −y_1 y_2    2 y_3
              0      y_1 y_2   −y_3
              y_1    0          y_3 ], (7.57)
where W(t) ∈ R^{4×3} has the entries W_{ij}(t), i = 1, . . . , 4, j = 1, 2, 3.
Since the initial values (7.56) do not depend on the parameter p, the sensitivity W obeys the initial
conditions W(0) = 0 ∈ R^{4×3}.
With the solution W(·; p) of the sensitivity equation (7.54) the Jacobian of R defined in (7.52)
is given by
R′(p) = ( W(t_1; p); W(t_2; p); . . . ; W(t_m; p) ) ∈ R^{mn×l}. (7.59)
In practice the ODE (7.50) and the sensitivity equations have to be solved numerically. Since
most ODE solvers are adaptive, one has to solve the original ODE (7.50) and the sensitivity equation
(7.54) simultaneously for y and W . Instead of the exact solutions y(·; p) and W (·; p) of (7.50) and
(7.54) one obtains approximations yh (·; p) and Wh (·; p) thereof. Thus in practice one has only
R_h(p) and
(R′(p))_h = ( W_h(t_1; p); W_h(t_2; p); . . . ; W_h(t_m; p) ) ∈ R^{mn×l} (7.60)
available. It holds that R_h(p) ≈ R(p) and (R′(p))_h ≈ R′(p), and estimates for the errors
||R_h(p) − R(p)||_2 and ||(R′(p))_h − R′(p)||_2 are typically available. Usually, the approximation (R′(p))_h
of R′(p) is not the derivative of R_h(p), the approximation of R(p). Therefore we have chosen
the notation (R′(p))_h over (R_h(p))′.² In fact, often we do not even know whether R_h(p) is
differentiable. Codes for the numerical solution of ODEs often choose the time steps adaptively.
Rules for this adaptation involve min, max, | · |. This might lead to nondifferentiability of R_h(p).
If (7.50) and (7.54) have to be solved simultaneously for y and W using a numerical ODE solver,
the sensitivity equation (7.54) typically has to be written in vector form. For example the ODE
resulting from (7.55) and (7.57) is given by
²Here (R′(p))_h indicates that we differentiate first and then discretize the derivative, whereas (R_h(p))′ indicates that
we discretize first and then take the derivative of the discretized R_h(p).
(d/dt) y_1 = −k_1 y_1 − k_2 y_1 y_2,
(d/dt) y_2 = k_1 y_1 − k_2 y_1 y_2 + 2 k_3 y_3,
(d/dt) y_3 = k_2 y_1 y_2 − k_3 y_3,
(d/dt) y_4 = k_1 y_1 + k_3 y_3,
(d/dt) W_11 = (−k_1 − k_2 y_2) W_11 − k_2 y_1 W_21 − y_1,
(d/dt) W_21 = (k_1 − k_2 y_2) W_11 − k_2 y_1 W_21 + 2 k_3 W_31 + y_1,
(d/dt) W_31 = k_2 y_2 W_11 + k_2 y_1 W_21 − k_3 W_31,
(d/dt) W_41 = k_1 W_11 + k_3 W_31 + y_1,
(d/dt) W_12 = (−k_1 − k_2 y_2) W_12 − k_2 y_1 W_22 − y_1 y_2,
(d/dt) W_22 = (k_1 − k_2 y_2) W_12 − k_2 y_1 W_22 + 2 k_3 W_32 − y_1 y_2,
(d/dt) W_32 = k_2 y_2 W_12 + k_2 y_1 W_22 − k_3 W_32 + y_1 y_2,
(d/dt) W_42 = k_1 W_12 + k_3 W_32,
(d/dt) W_13 = (−k_1 − k_2 y_2) W_13 − k_2 y_1 W_23,
(d/dt) W_23 = (k_1 − k_2 y_2) W_13 − k_2 y_1 W_23 + 2 k_3 W_33 + 2 y_3,
(d/dt) W_33 = k_2 y_2 W_13 + k_2 y_1 W_23 − k_3 W_33 − y_3,
(d/dt) W_43 = k_1 W_13 + k_3 W_33 + y_3. (7.61)
Figures 7.5 and 7.6 show the solution of (7.61). As before, the computations were done using
the Matlab ODE solver ode23s with the default options. The parameters were k1 = 10−4 , k 2 = 1,
k_3 = 8 · 10^{-4} and t_0 = 0 to t_f = 7200 secs. (2 hrs). Note the different scales of the sensitivities
with respect to k_1, k_3 and k_2. For example, since k_1 = 10^{-4}, the sensitivities dy_j/dk_1 can be
about 10^4 times larger than y_j, j = 1, . . . , 4. In this case scaling issues have to be dealt with when
solving (7.61). When using a numerical solver such as the Matlab ODE solver ode23s it might be
necessary to choose different absolute tolerances AbsTol for each solution component in (7.61). In
our computations we have used AbsTol = 10−6 for all components (the default). A more sensible
choice might be AbsTol = 10−6 for components 1–4, AbsTol = 10−6 /k 1 for components 5–8,
AbsTol = 10−6 /k 2 for components 9–12, and AbsTol = 10−6 /k3 for components 13–16.
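Such component-wise tolerances can be passed to the solver through odeset; the sketch below assumes a user-written function rhs_sens implementing the right hand side of (7.61):

% Component-wise absolute tolerances for the combined state/sensitivity system (7.61).
% rhs_sens(t,Y) is assumed to implement the 16-dimensional right hand side of (7.61).
k1 = 1e-4;  k2 = 1;  k3 = 8e-4;
abstol = 1e-6 * [ones(4,1); ones(4,1)/k1; ones(4,1)/k2; ones(4,1)/k3];
opts   = odeset('AbsTol', abstol, 'RelTol', 1e-3);
Y0     = [1; 0; 0; 0; zeros(12,1)];        % y(0) and W(0) = 0
[t,Y]  = ode23s(@rhs_sens, [0 7200], Y0, opts);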
(Figures 7.5 and 7.6: plots of the concentrations C_A, C_B, C_AB, C_C over time and of the corresponding sensitivities dC_A/dk_j, dC_B/dk_j, dC_AB/dk_j, dC_C/dk_j, j = 1, 2, 3.)
DASSL and DASPK are two Fortran codes for the solution of ODEs [BCP95]. Actually, DASSL
and DASPK solve differential–algebraic equations (DAEs), which are systems of ODEs coupled
with nonlinear algebraic equations. Both codes have been augmented to solve the DAE and the
corresponding sensitivity equations simultaneously. The original codes DASSL and DASPK and
their augmentations DASSLSO and DASPKSO are available. For details we refer to the paper
[MP96].
Alternatively, the sensitivities can be approximated by finite differences,
(∂/∂p_j) y(t_i; p) ≈ (1/(δp)_j) ( y(t_i; p + (δp)_j e_j) − y(t_i; p) ),
where e_j is the jth unit vector and (δp)_j ∈ R. The scalar (δp)_j ∈ R can and typically does vary with
j. Thus, we can compute a finite difference approximation of R′(p) as follows. For j = 1, . . . , l
choose (δp)_j ∈ R sufficiently small and compute the solution y(·; p + (δp)_j e_j) of (7.50) with p
replaced by p + (δp)_j e_j. The jth column (R′(p))_j of R′(p) is then approximated by
(1/(δp)_j) ( R(p + (δp)_j e_j) − R(p) ).
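A minimal sketch of this column-by-column finite difference approximation; Rfun is assumed to evaluate R(p) by solving (7.50) and assembling (7.52), and dp contains the chosen step sizes:

function Jh = fd_jacobian(Rfun, p, dp)
% Columnwise finite difference approximation of R'(p).
% Rfun(p) returns R(p); dp is a vector of step sizes, one per parameter.
Rp = Rfun(p);
Jh = zeros(length(Rp), length(p));
for j = 1:length(p)
    ej      = zeros(length(p),1);   ej(j) = 1;
    Jh(:,j) = (Rfun(p + dp(j)*ej) - Rp) / dp(j);
end
end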
Figure 7.7: y4 and corresponding sensitivities. The solid curves are the sensitivities computed
using the sensitivity equation method (these are identical to the corresponding plots in Figure 7.6)
and the dashed curves are the sensitivity approximations via finite differences with (δp) j = 10−1 p j ,
j = 1, 2, 3.
As before, the computations were done using the Matlab ODE solver ode23s with the default options. The parameters were k_1 = 10^{-4}, k_2 = 1,
k_3 = 8 · 10^{-4} and t_0 = 0 to t_f = 7200 secs. (2 hrs). If we want to extend the error analysis
of finite difference approximations performed for the function g to y(·; p), then we will obtain a
different constant L_j for each component p_j, with L_j ≈ max_t ||(∂²/∂p_j²) y(t; p)||_2. To compute the finite
difference step size (δp)_j we need an estimate for L_j and an estimate for the error level ε in the
evaluation of y(t; p), t ∈ [0, 7200]. Using our previous sensitivity computations we estimate that
max_t ||(∂/∂p_j) y(t; p)||_2 = O(1/p_j). (This seems to be reasonable for j = 1, 3, but it is too high for
j = 2.) For the second partial derivatives we use the estimate max_t ||(∂²/∂p_j²) y(t; p)||_2 = O(1/p_j²). Thus,
the optimal finite difference step size for the jth parameter is (δp*)_j = 2 √(ε/L_j) = O(p_j √ε). The
default options in the Matlab ODE solver ode23s attempt to compute an approximate solution
that is within AbsTol = 10^{-6} of the true solution. Thus, we estimate that ε ≈ 10^{-6}. This gives
Figure 7.8: y4 and corresponding sensitivities. The solid curves are the sensitivities computed
using the sensitivity equation method (these are identical to the corresponding plots in Figure 7.6)
and the dashed curves are the sensitivity approximations via finite differences with (δp) j = 10−2 p j ,
j = 1, 2, 3.
an estimate (δp*)_j = O(p_j 10^{-3}) for the optimal step size. We see that for j = 1, 3 the best
agreement between the sensitivities computed using the sensitivity equation method and the
finite difference approximations of the sensitivities is achieved for (δp*)_j = O(p_j 10^{-2}), j = 1, 3.
For j = 2, however, the best agreement is achieved for (δp*)_2 = O(p_2 10^{-1}). The calculations
for the solution components y1 − y3 gave similar results. This example indicates how difficult it
is to approximate derivatives of vector valued functions using finite differences, especially if the
variables and functions have different scales and the functions are not computed exactly.
Figure 7.9: y4 and corresponding sensitivities. The solid curves are the sensitivities computed
using the sensitivity equation method (these are identical to the corresponding plots in Figure 7.6)
and the dashed curves are the sensitivity approximations via finite differences with (δp) j = 10−3 p j ,
j = 1, 2, 3.
Another approach to the computation of derivatives of the solution of ODEs with respect to
parameters is automatic differentiation (which is now also known as computational differentiation).
Given the source code of a computer program (or a set of computer programs) for the solution of a
differential equation, automatic differentiation tools take this program and generate source code for
a new program that computes the solution of the ODE as well as the derivatives of the solution with
respect to parameters. Actually, this technique is not limited to computer programs for the solution
of ODEs. Automatic differentiation is based on the observation that inside computer programs
only elementary functions such as +, ∗, sin, log are executed. The derivatives of these elementary
operations are known. A computer program can be viewed as a composition of such elementary
functions. The derivative of a composition of functions is obtained by the chain rule. This is
the basic mathematical observation underlying automatic differentiation. See the paper [Gri03]
by Griewank and the book [GW08] by Griewank and Walther. Of course computer programs
also include operations that are not necessarily differentiable such as max, | · |, and if-then-else
statements. Automatic differentiation techniques will always generate an augmented program that
also generates ‘derivatives’. It is important that the user applies these tools intelligently.
ADIC (for the automatic differentiation of programs written in C/C++) and ADIFOR (for the
automatic differentiation of programs written in Fortran 77) are available from https://wiki.
mcs.anl.gov/autodiff3.
7.6. Problems
Note: It can actually be shown that there exists only one matrix X ∈ Rn×m that satisfies
AX A = A, X AX = X, ( AX )T = AX, (X A)T = X A.
Hence these four identities are also used to define the Moore–Penrose pseudo inverse. For more
details see, e.g., [Bjö96, pp. 15-17] and [BIG74, CM79, Gro77, Nas76].
A† = AT ( A AT ) −1 .
min_x (1/2) ||Ax − b||_2^2 + (µ/2) ||x||_2^2, (7.64)
where A ∈ Rm×n , b ∈ Rm , and µ ≥ 0.
i. Show that for each µ > 0, (7.64) has a unique solution x(µ).
ii. Let µ2 > µ1 > 0 and let x 1 = x(µ1 ), x 2 = x(µ2 ) be the solutions of (7.64) with µ = µ1 and
µ = µ2 , respectively.
Show that
(1/2) ||Ax_1 − b||_2^2 + (µ_1/2) ||x_1||_2^2 ≤ (1/2) ||Ax_2 − b||_2^2 + (µ_2/2) ||x_2||_2^2,
(1/2) ||Ax_1 − b||_2^2 ≤ (1/2) ||Ax_2 − b||_2^2,
||x_2||_2^2 ≤ ||x_1||_2^2.
has to be solved iteratively using, e.g., the conjugate gradient methods described in Sections 3.7.3
and 3.7.3. We assume that the computed step s k satisfies
Formulate and prove an extension of the local convergence Theorem 7.4.2 for this inexact
Gauss-Newton method.
Hint: Revisit Theorem 5.3.1. One convergence result for inexact Gauss-Newton methods is
presented in [Mar87].
where t is the time in days after the maximum luminosity and L(t) is the luminosity relative
to the maximum luminosity. The table in lumi_data.m gives the relative luminosity for the
type I supernovae SN1939A measured in 1939. The peak luminosity occured at day 0.0, but all
measurement before day 7.0 are omitted because the model above cannot account for the luminosity
before and immediately after the maximum.
i. Plot the data. You should notice two distinct regions, and thus two exponentials are required
to provide an adequate fit.
ii. Use the function lsqcurvefit from the Matlab optimization toolbox to fit the data to the
model above, i.e., to determine C1, C2, α1, α2 . Plot the fit along with the data. Also plot the
residuals. Do the residuals look random? Experience plays a role in choosing the starting
values. It is known that the time constants α_1 and α_2 are about 5.0 and 60.0, respectively. Try
different values for C1 and C2 . How sensitive is the resulting fit to these values?
differences. Use the parameter values p_1 = 0.9875, p_2 = 0.2566, p_3 = 0.3323. Evaluate the
sensitivities at t_i = 0, 0.1, 0.2, . . . , 4 (tspan = [0:0.1:4] if you use the Matlab ODE solvers).
For the computation of the finite difference approximations use (δp) j = δp, j = 1, . . . , 3, with
δp = 10−3, 10−6, 10−9 . For each δp plot the error WhS (·; p) − WhFD (·; p) (in log-format).
Problem 7.7 Consider the continuously stirred tank reactor (CSTR) with heating jacket in
Figure 7.10. Reactant A is fed at a flow rate F, molar concentration C Ao , and temperature TA to
the reactor, where the irreversible endothermic reaction A → B occurs. The rate of the reaction is
given by the following relation
r = k_0 exp(−E/(RT)) C_A,
where C A is the molar concentration of A in the reactor holdup, T is the reactor temperature. The
parameters k 0 , E, and R are called the pre-exponential factor, the activation energy, and the gas
constant, respectively. In the CSTR, the product stream is withdrawn at flow rate F and heat is
provided to the reactor through the heating jacket, where the heating fluid is fed at a flow rate F_h
and temperature T_h. We assume that the flow rates F and F_h are constant.
Write a program for the computation of approximate sensitivities WhS (·; p) via the sensitivity
equation method and for the computation of approximate sensitivities WhFD (·; p) via finite
differences.
Use the parameter values specified in Table 7.1 for p in (7.68).
Problem 7.8 In this problem we will determine parameters in a simple model for an electrical
furnace.
The mathematical model for the oven involves the following quantities: t: time (seconds), T:
temperature inside the oven (°C), C: 'heat capacity' of the oven including load (joule/°C), Q:
rate of loss of heat inside the oven to the environment (joule/sec), V: voltage of the source of
electricity (volt), I: intensity of the electric current (amp), R: resistance of the heating of the oven
plus regulation resistance (ohm) (R = ∞ corresponds to the 'open' (disconnected) circuit). Then,
according to the laws of physics:
• I = V/R,
• Q = kT, where k is a constant of loss of heat per second and per degree of temperature
difference between oven and the environment,
• [(V²/R) − kT]/C = rate of increase of temperature of the oven (°C per second).
The temperature of the oven will then evolve according to the differential equation
T′(t) = ( (V²/R(t)) − k T(t) ) / C. (7.70)
We set
u(t) = 1/R(t).
This gives the differential equation
T′(t) = −α T(t) + β u(t), where α = k/C and β = V²/C. (7.72)
The model (7.72) depends on the parameters α, β. These depend on the particular oven (geometry,
material, ...). Our goal is to determine the parameters from measurements using the least squares
formulation.
First, we note that the solution of (7.72) for constant u is given by
T(t) = T(t_0) e^{−α(t−t_0)} + (β u / α) ( 1 − e^{−α(t−t_0)} ). (7.73)
For u ≡ 1, temperature measurements T̂i at times t i , i = 0, . . . , m, are given in Table 7.2.
i    t_i    T̂_i        i    t_i    T̂_i
0 0.0 1.0000 11 1.1 1.6672
1 0.1 1.0953 12 1.2 1.6988
2 0.2 1.1813 13 1.3 1.7275
3 0.3 1.2592 14 1.4 1.7534
4 0.4 1.3298 15 1.5 1.7770
5 0.5 1.3935 16 1.6 1.7982
6 0.6 1.4512 17 1.7 1.8174
7 0.7 1.5034 18 1.8 1.8348
8 0.8 1.5508 19 1.9 1.8505
9 0.9 1.5935 20 2.0 1.8647
10 1.0 1.6322
i.e., determine R.
Note that since none of the temperature measurements above can be assumed to be exact, we
include T_0 = T(t_0) as a variable in the least squares problem.
[Bjö96] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
[BW88] D. M. Bates and D. G. Watts. Nonlinear Regression Analysis and its Applications. John
Wiley and Sons, Inc., Somerset, New Jersey, 1988.
[Deu04] P. Deuflhard. Newton Methods for Nonlinear Problems. Affine Invariance and Adaptive
Algorithms, volume 35 of Springer Series in Computational Mathematics. Springer-
Verlag, Berlin, 2004.
[DH79] P. Deuflhard and G. Heindl. Affine invariant convergence theorems for Newton’s method
and extensions to related methods. SIAM J. Numer. Anal., 16:1–10, 1979.
[DH95] P. Deuflhard and A. Hohmann. Numerical Analysis. A First Course in Scientific Com-
putation. Walter De Gruyter, Berlin, New York, 1995.
[Esp81] J. H. Espenson. Chemical Kinetics and Reaction Mechanisms. Mc Graw Hill, New
York, 1981.
[GL89] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3 of Johns Hopkins
Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD,
second edition, 1989.
[Gro77] C. W. Groetsch. Generalized Inverses of Linear Operators. Marcel Dekker, Inc., New
York, Basel, 1977.
[GW08] A. Griewank and A. Walther. Evaluating Derivatives. Principles and Techniques of Al-
gorithmic Differentiation. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, second edition, 2008. URL: https://doi.org/10.1137/1.
9780898717761, doi:10.1137/1.9780898717761.
[Hei93] M. Heinkenschloss. Mesh independence for nonlinear least squares problems with norm
constraints. SIAM J. Optimization, 3:81–117, 1993. URL: http://dx.doi.org/10.
1137/0803005, doi:10.1137/0803005.
[KMN88] D. Kahaner, C.B. Moler, and S. Nash. Numerical Methods and Software. Prentice Hall,
Englewood Cliffs, NJ, 1988.
[Lev44] K. Levenberg. A method for the solution of certain nonlinear problems in least squares.
Quarterly Applied Mathematics, 2:164–168, 1944.
[Mar87] J. M. Martinez. An algorithm for solving sparse nonlinear least squares problems.
Computing, 39:307–325, 1987. URL: https://doi.org/10.1007/BF02239974,
doi:10.1007/BF02239974.
[MP96] T. Maly and L. R. Petzold. Numerical methods and software for sensitivity analysis of
differential-algebraic systems. Applied Numerical Mathematics, 20:57–79, 1996.
[Nas76] M. Z. Nashed. Generalized Inverses and Applications. Academic Press, Boston, San
Diego, New York, London, 1976.
Chapter
8
Implicit Constraints
8.1. Introduction
This section studies the optimization of functions whose evaluation requires the solution of an
implicit equation. In this section we use u ∈ Rnu to denote the optimization variable. The objective
function we want to minimize is given by
f̂(u) = f(y(u), u), (8.1)
where y(u) ∈ R^{n_y} is the solution of an equation
c(y, u) = 0. (8.2)
Here f : R^{n_y} × R^{n_u} → R and c : R^{n_y} × R^{n_u} → R^{n_y} are given functions.
While this problem structure may seem special, we have seen several examples of it already.
The data assimilation problems in Section 1.5 are one class of examples. The data assimilation
problem (1.54) is an optimization problem in y_0, which plays the role of u. The evaluation of
f̂(y_0) = (1/2) ||Ay_0 − b||_2^2 requires the solution of (1.50), which plays the role of (8.2), to compute
y^T = (y_1^T, . . . , y_{n_t}^T). Another class of problems that fits (abstractly) into the setting of minimizing
(8.1) is parameter identification in ordinary differential equations studied in Section 7.5. The vector
of optimization variables is p, which plays the role of u. To evaluate f̂(p) = (1/2) ||R(p)||_2^2 in (7.51) we
have to solve the ordinary differential equation (7.50) to get (7.52). Here the ordinary differential
equation (7.50) plays the role of (8.2). The solution y in this example is a function, not merely a
vector in R^{n_y}, and therefore this example does not fit precisely into the setting (8.1), (8.2). However,
this setting can be extended to cover this example. We will see additional examples of problems
with the structure (8.1), (8.2) in Sections 8.3 and 8.4 below.
The problem
min_{u ∈ R^{n_u}} f̂(u) (8.3)
is an unconstrained optimization problem and can in principle be solved using any of the optimization
methods studied before. These methods require the computation of the gradient of f̂(u) and,
possibly, Hessian information. We will compute derivatives of f̂ in the next section.
We call (8.3) an implicitly constrained optimization problem because the solution of (8.2) is
invisible to the optimization algorithm. Of course, in principle one can formulate (8.3), (8.1),
(8.2) as an equality constrained optimization problem. In fact, since y is tied to u via the implicit
equation (8.2), we could just include this equation into the problem formulation and reformulate
(8.3), (8.1), (8.2) as
min f (y, u),
(8.4)
s.t. c(y, u) = 0.
In (8.4), the optimization variables are y ∈ Rny and u ∈ Rnu . The formulation (8.4) can have
significant advantages over (8.3), but in many applications the formulation of the optimization
problem as a constrained problem may not be possible, for example, because of the huge size of
y, which in applications can easily be many millions. The solution of constrained optimization
problems is also beyond the scope of this class. Therefore, we focus on (8.3).
Assumption 8.2.1
• There exists an open set D ⊂ Rny ×nu with {(y, u) : u ∈ U, c(y, u) = 0} ⊂ D such that f
and c are twice continuously differentiable on D.
• The inverse cy (y, u) −1 exists for all (y, u) ∈ {(y, u) : u ∈ U, c(y, u) = 0}.
Under these assumptions the implicit function theorem guarantees the existence of a differen-
tiable function
y : R nu → R n y
defined by
c(y(u), u) = 0.
Note that our Assumptions 8.2.1 are stronger than those required in the implicit function theorem.
The standard assumptions of the implicit function theorem, however, only guarantee the local
existence of the implicit function y(·).
To simplify the notation we write c_y(y(u), u) and c_u(y(u), u) instead of c_y(y, u)|_{y=y(u)} and
c_u(y, u)|_{y=y(u)}, respectively. With this notation, differentiating c(y(u), u) = 0 gives
y_u(u) = −c_y(y(u), u)^{-1} c_u(y(u), u). (8.6)
The derivative y_u(u) is also called the sensitivity (of y with respect to u).
Since y(·) is differentiable, the function f̂ is differentiable and its gradient is given by
∇f̂(u) = y_u(u)^T ∇_y f(y(u), u) + ∇_u f(y(u), u) (8.7)
       = −c_u(y(u), u)^T c_y(y(u), u)^{-T} ∇_y f(y(u), u) + ∇_u f(y(u), u).
Note that if we define the matrix
W(y, u) = ( −c_y(y, u)^{-1} c_u(y, u); I ), (8.8)
then
W(y(u), u) = ( y_u(u); I ) (8.9)
and the gradient of f̂ can be written as
∇f̂(u) = W(y(u), u)^T ∇f(y(u), u), (8.10)
where ∇f denotes the gradient of f with respect to (y, u).
The matrix W (y, u) will play a role later.
Equation (8.7) suggests the following (sensitivity equation) method for computing the gradient.
1. Given u, solve c(y, u) = 0 for y (if not done already for the evaluation of f̂(u)).
Denote the solution by y(u).
2. Solve c_y(y(u), u) S = −c_u(y(u), u) for the sensitivity matrix S = y_u(u) ∈ R^{n_y×n_u}.
3. Compute ∇f̂(u) = S^T ∇_y f(y(u), u) + ∇_u f(y(u), u).
The computation of the sensitivity matrix S requires the solution of n_u systems of linear
equations c_y(y(u), u) S = −c_u(y(u), u), all of which have the same system matrix but different
right hand sides. If n_u is large this can be expensive. The gradient computation can be executed
right hand sides. If nu is large this can be expensive. The gradient computation can be executed
more efficiently since for the computation of ∇ fD(u) we do not need S, but only the application
of ST to ∇ y f (y(u), u). If we revisit (8.7), we can define λ(u) = −cy (y(u), u) −T ∇ y f (y(u), u), or,
equivalently, we can define λ(u) ∈ Rny as the solution of
cy (y(u), u)T λ = −∇ y f (y(u), u). (8.11)
In optimization problems (8.1), (8.2) arising from discretized optimal control problems, the system
(8.11) are called the (discrete) adjoint equations and λ(u) is the (discrete) adjoint. We will see soon
(see (8.13)) that λ(u) is the Lagrange multiplier corresponding to the constraint problem (8.4).
With λ(u) the gradient can now be written as
∇ fD(u) = ∇u f (y(u), u) + cu (y(u), u)T λ(u), (8.12)
which suggests the so-called adjoint equation method for computing the gradient.
1. Given u, solve c(y, u) = 0 for y (if not done already for the evaluation of fD(u)).
Denote the solution by y(u).
The gradient computation using the adjoint equation method can also be expressed using the
Lagrangian
L(y, u, λ) = f (y, u) + λT c(y, u) (8.13)
corresponding to the constraint problem (8.4). Using the Lagrangian, the equation (8.11) can be
written as
∇ y L(y, u, λ)| y=y(u),λ=λ(u) = 0. (8.14)
Moreover, (8.12) can be written as
∇f̂(u) = ∇_u L(y, u, λ)|_{y=y(u), λ=λ(u)}. (8.15)
The adjoint equations (8.11) or (8.14) are easy to write down in this abstract setting, but (hand)
generating a code to set up and solve the adjoint equations can be quite a different matter. This
will become somewhat apparent when we discuss a simple optimal control example in Section 8.4.
The following observation can be used to generate some checks that indicate the correctness of the
adjoint code. Assume that we have a code that for given u computes the solution y of c(y, u) = 0.
Often it is not too difficult to derive from this a code that for given r computes the solution s of
cy (y, u)s = r. If λ solves the adjoint equation cy (y, u)T λ = −∇ y f (y, u), then
− sT ∇ y f (y, u) = sT cy (y, u)T λ = r T λ (8.16)
must hold.
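Both (8.16) and the gradient formula (8.12) lead to simple numerical checks of an adjoint code. The sketch below is written in terms of assumed user-supplied routines (solve_state, solve_linearized, solve_adjoint, grad_y_f, gradhat, fhat) and a reference point u0; none of these names come from the text:

% Two consistency checks for an adjoint based gradient code (sketch, placeholders).
% solve_state(u)           : returns y with c(y,u) = 0
% solve_linearized(y,u,r)  : returns s with c_y(y,u) s = r
% solve_adjoint(y,u)       : returns lambda with c_y(y,u)' lambda = -grad_y f(y,u)
% grad_y_f(y,u), gradhat(u), fhat(u) : gradient of f w.r.t. y, adjoint gradient (8.12), objective
u   = u0;
y   = solve_state(u);
lam = solve_adjoint(y, u);
r   = randn(size(y));
s   = solve_linearized(y, u, r);
check1 = abs(-s'*grad_y_f(y,u) - r'*lam);                    % should be ~ 0, cf. (8.16)
v = randn(size(u));   h = 1e-6;
check2 = abs((fhat(u + h*v) - fhat(u))/h - gradhat(u)'*v);   % O(h) agreement with (8.12)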
If we use ∇_{yλ}L(y, u, λ) = c_y(y, u)^T and (8.6) in the previous equation, we find that
λ_u(u) = −c_y(y(u), u)^{-T} ( ∇_{yy}L(y(u), u, λ(u)) y_u(u) + ∇_{yu}L(y(u), u, λ(u)) ). (8.17)
To simplify the expression, we have used the notation ∇_{yy}L(y(u), u, λ(u)) instead of
∇_{yy}L(y, u, λ)|_{y=y(u), λ=λ(u)}, and analogous notation for the other derivatives of L. We will
continue to use this notation in the following.
Now we can compute the Hessian of f̂ by differentiating (8.15),
∇²f̂(u) = ∇_{uy}L(y(u), u, λ(u)) y_u(u) + ∇_{uu}L(y(u), u, λ(u)) + ∇_{uλ}L(y(u), u, λ(u)) λ_u(u). (8.18)
If we insert (8.17) and (8.6) into (8.18) and observe that ∇_{uλ}L(y(u), u, λ(u)) = c_u(y(u), u)^T, the
Hessian can be written as
∇²f̂(u) = c_u(y(u), u)^T c_y(y(u), u)^{-T} ∇_{yy}L(y(u), u, λ(u)) c_y(y(u), u)^{-1} c_u(y(u), u)
        − c_u(y(u), u)^T c_y(y(u), u)^{-T} ∇_{yu}L(y(u), u, λ(u))
        − ∇_{uy}L(y(u), u, λ(u)) c_y(y(u), u)^{-1} c_u(y(u), u) + ∇_{uu}L(y(u), u, λ(u))
      = W(y(u), u)^T ( ∇_{yy}L(y(u), u, λ(u))  ∇_{yu}L(y(u), u, λ(u)); ∇_{uy}L(y(u), u, λ(u))  ∇_{uu}L(y(u), u, λ(u)) ) W(y(u), u). (8.19)
Obviously the identity (8.19) can be used to compute the Hessian. However, in many cases the
computation of the Hessian is too expensive. In that case optimization algorithms that only require
the computation of Hessian-times-vector products ∇²f̂(u)v can be used. The prime example is
the Newton-CG Algorithm, where an approximation of the Newton step s_k is computed using the
CG Algorithm 6.2.2. Using the equality (8.19), Hessian-times-vector products can be computed
as follows.
1. Given u, solve c(y, u) = 0 for y (if not done already). Denote the solution by y(u).
2. Solve the adjoint equation c_y(y(u), u)^T λ = −∇_y f(y(u), u) for λ (if not done already). Denote the solution by λ(u).
3. Solve c_y(y(u), u) w = c_u(y(u), u) v for w.
4. Solve c_y(y(u), u)^T p = ∇_{yy}L(y(u), u, λ(u)) w − ∇_{yu}L(y(u), u, λ(u)) v for p.
5. Compute
∇²f̂(u)v = c_u(y(u), u)^T p − ∇_{uy}L(y(u), u, λ(u)) w + ∇_{uu}L(y(u), u, λ(u)) v.
Hence, if y(u) and λ(u) are already known, then the computation of ∇²f̂(u)v requires the solution
of two linear systems: one similar to the linearized state equation, Step 3, and one similar to the
adjoint equation, Step 4.
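In code, each Hessian-times-vector product amounts to the two linear solves of Steps 3 and 4; the following Matlab sketch uses assumed user-supplied routines (cy_solve, cyT_solve, cu, cuT, Lyy, Lyu, Luy, Luu) that apply the corresponding operators:

function Hv = hessvec(y, u, lam, v)
% Hessian-times-vector product via Steps 1-5 (sketch, all routines are placeholders).
% cy_solve / cyT_solve solve with c_y(y,u) and its transpose; cu / cuT apply c_u and
% its transpose; Lyy, Lyu, Luy, Luu apply the second derivative blocks of the Lagrangian.
w  = cy_solve(y, u, cu(y, u, v));                         % Step 3
p  = cyT_solve(y, u, Lyy(y,u,lam,w) - Lyu(y,u,lam,v));    % Step 4
Hv = cuT(y, u, p) - Luy(y,u,lam,w) + Luu(y,u,lam,v);      % Step 5
end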
We conclude this section with an observation concerning the connection between the Newton
equation ∇²f̂(u) s_u = −∇f̂(u), or a Newton-like equation Ĥ s_u = −∇f̂(u), and the solution of
a quadratic program. These observations also emphasize the connection between the implicitly
constrained problem (8.3) and the nonlinear programming problem (8.4).
Theorem 8.2.5 Let c_y(y(u), u) be invertible and let ∇²f̂(u) be symmetric positive semidefinite.
The vector s_u solves the Newton equation
∇²f̂(u) s_u = −∇f̂(u) (8.20)
if and only if (s_y, s_u) with s_y = −c_y(y(u), u)^{-1} c_u(y(u), u) s_u solves the quadratic program
min ( ∇_y f(y, u); ∇_u f(y, u) )^T ( s_y; s_u ) + (1/2) ( s_y; s_u )^T ( ∇_{yy}L(y, u, λ)  ∇_{yu}L(y, u, λ); ∇_{uy}L(y, u, λ)  ∇_{uu}L(y, u, λ) ) ( s_y; s_u )
s.t. c_y(y, u) s_y + c_u(y, u) s_u = 0, (8.21)
where y = y(u) and λ = λ(u).
Proof: Every feasible point of (8.21) obeys
( s_y; s_u ) = ( −c_y(y(u), u)^{-1} c_u(y(u), u) s_u; s_u ) = W(y(u), u) s_u.
Thus, using (8.10) and (8.19), we see that (8.21) is equivalent to
min_{s_u} s_u^T ∇f̂(u) + (1/2) s_u^T ∇²f̂(u) s_u. (8.22)
The desired result now follows from the equivalence of (8.21) and (8.22).
The computation of first and second order derivatives of the function fD in (8.1) is based on
application of the implicit function theorem and can be complicated and laborious. We will see a
concrete, rather simple example in Section 8.4 below.
There are approaches to compute or approximate these derivatives. They include:
• Finite difference approximations [DS96, Sec. 5.6], [Sal86].
• Algorithmic differentiation (also called automatic differentiation) [Gri03, GW08].
• Approximation of gradients using the complex-step method [MSA03], [ST98].
• Computation of second derivatives via so-called hyper-dual numbers [FA11, FA12].
Figure 8.1: Example of a neural network with n0 = 4 inputs, with two hidden layers with n1 = n2 = 5
neurons each, and with n3 = 2 outputs.
function,
y_i^L = σ_i( b_i^L + Σ_{j=1}^{n_{L−1}} w_{ij}^L y_j^{L−1} ), i = 1, . . . , n_L.
Since
0 < σ_j(z) < 1, j = 1, . . . , n_L, and Σ_{j=1}^{n_L} σ_j(z) = 1,
the outputs y_i^L = σ_i(· · ·) are interpreted as probabilities, e.g., the input into the network
belongs to class i with probability y_i^L = σ_i(· · ·).
If the weights w_{ij}^ℓ and biases b_i^ℓ, i = 1, . . . , n_ℓ, ℓ = 1, . . . , L,
are given, then the outputs y_1^L, . . . , y_{n_L}^L of the neural network for given inputs y_1^0, . . . , y_{n_0}^0 are
computed using Algorithm 8.3.1.
1. For ℓ = 1, . . . , L − 1:
Compute the outputs of hidden layer ℓ,
y_i^ℓ = σ( b_i^ℓ + Σ_{j=1}^{n_{ℓ−1}} w_{ij}^ℓ y_j^{ℓ−1} ), i = 1, . . . , n_ℓ.
2. Compute the network outputs
y_i^L = σ_i( b_i^L + Σ_{j=1}^{n_{L−1}} w_{ij}^L y_j^{L−1} ), i = 1, . . . , n_L.
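Algorithm 8.3.1 takes only a few lines of Matlab if the weights and biases are stored in cell arrays; sigma and sigma_out below stand for the hidden-layer and output-layer activation functions and are assumptions of this sketch:

function yL = forward_pass(W, b, y0, sigma, sigma_out)
% Forward pass of Algorithm 8.3.1.
% W{l} is the n_l x n_{l-1} weight matrix, b{l} the n_l x 1 bias vector,
% sigma and sigma_out are (vectorized) activation functions.
L = numel(W);
y = y0;
for l = 1:L-1
    y = sigma(b{l} + W{l}*y);       % hidden layers
end
yL = sigma_out(b{L} + W{L}*y);      % output layer, e.g. a soft-max
end

For example, sigma = @(z) 1./(1+exp(-z)) and sigma_out = @(z) exp(z)./sum(exp(z)) give logistic hidden layers and a soft-max output layer.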
For the following presentation a more compact notation will be useful. First we define the layer
output vectors
y^ℓ = ( y_1^ℓ, . . . , y_{n_ℓ}^ℓ )^T ∈ R^{n_ℓ}, ℓ = 0, . . . , L.
Then we aggregate the weights w_{ij}^ℓ, i = 1, . . . , n_ℓ, j = 1, . . . , n_{ℓ−1}, and biases b_i^ℓ, i = 1, . . . , n_ℓ, into a vector of parameters
associated with level ℓ,
u^ℓ ∈ R^{(n_{ℓ−1}+1) n_ℓ}, ℓ = 1, . . . , L.
Finally, we define the functions
σ^ℓ( y^{ℓ−1}, u^ℓ ) = ( σ( b_i^ℓ + Σ_{j=1}^{n_{ℓ−1}} w_{ij}^ℓ y_j^{ℓ−1} ) )_{i=1,...,n_ℓ} ∈ R^{n_ℓ}, ℓ = 1, . . . , L − 1,
and
σ^L( y^{L−1}, u^L ) = ( σ_i( b_i^L + Σ_{j=1}^{n_{L−1}} w_{ij}^L y_j^{L−1} ) )_{i=1,...,n_L} ∈ R^{n_L}.
With this notation, given network inputs y^0 ∈ R^{n_0} the corresponding network output vector
y^L ∈ R^{n_L} can be computed recursively using
y^ℓ = σ^ℓ( y^{ℓ−1}, u^ℓ ), ℓ = 1, . . . , L. (8.27)
The system (8.27) has the structure of a so-called discrete-time system. The index ℓ corresponds
to time. The input into the system at 'time' ℓ is u^ℓ, and the state of the system at 'time' ℓ is y^ℓ ∈ R^{n_ℓ}.
The final state, the output of the neural network y^L ∈ R^{n_L}, depends on the inputs y^0 ∈ R^{n_0} and the
network parameters u^1, . . . , u^L,
y^L = y^L(u^1, . . . , u^L; y^0).
So far we have assumed that the network parameters u^1, . . . , u^L are given. Now we describe how
they can be computed. Given a sequence of inputs y_k^0 ∈ R^{n_0}, k = 1, . . . , K, and corresponding desired
outputs ŷ_k^L ∈ R^{n_L}, k = 1, . . . , K, we want to find parameters u^1, . . . , u^L so that the network outputs
generated with these parameters and inputs match the desired outputs. This can be formulated as a
least squares problem
min_{u^1,...,u^L} (1/2) Σ_{k=1}^{K} || y^L(u^1, . . . , u^L; y_k^0) − ŷ_k^L ||_2^2, (8.28)
where y^L(u^1, . . . , u^L; y_k^0) is defined by (8.27) with input y^0 = y_k^0. The data y_k^0 ∈ R^{n_0}, ŷ_k^L ∈ R^{n_L},
k = 1, . . . , K, are also called training data, and solving (8.28) is also called training the neural
network. Instead of a least squares functional, other functionals are possible to quantify 'matching'
of the network outputs y^L(u^1, . . . , u^L; y_k^0) and the desired outputs ŷ_k^L.
The problem (8.28) is an implicitly constrained optimization problem since evaluation of
y^L(u^1, . . . , u^L; y_k^0) requires the solution of (8.27). Alternatively, (8.27) could be entered as a
constraint into the optimization problem and (8.28) can equivalently be formulated as the following
constrained problem in the optimization variables u^1, . . . , u^L and y^1, . . . , y^L:
min (1/2) Σ_{k=1}^{K} || y_k^L − ŷ_k^L ||_2^2 (8.29a)
s.t. y_k^ℓ = σ^ℓ( y_k^{ℓ−1}, u^ℓ ), ℓ = 1, . . . , L, k = 1, . . . , K. (8.29b)
8.3.2. Backpropagation
For the minimization (8.28) we note that the objective function is a sum of functions that all depend
on the same variables u1, . . . , u L . Thus the gradient of the sum is the sum of gradients, etc. For
derivative computations it is therefore sufficient to consider the case K = 1 and drop the subscript
k.
We consider the problem
min_{u^1,...,u^L} (1/2) || y^L(u^1, . . . , u^L; y^0) − ŷ^L ||_2^2. (8.30)
For the gradient computation we need the Jacobian of the map
(u^1, . . . , u^L) → y^L(u^1, . . . , u^L; y^0).
Since y^L is implicitly defined through (8.27), the Jacobian can be computed by applying the implicit
function theorem to (8.27).
Let σy` ∈ Rn` ×n`−1 denote the partial Jacobian of σ ` with respect to y`−1 and let σu` ∈
Rn` ×((n`−1 +1)n` ) denote the partial Jacobian of σ ` with respect to u` . Because of the size of
the Jacobian, it is more convenient to describe the computation of the Jacobian applied to a vector
v^1, . . . , v^L. The recursion (8.27) defines functions
(u^1, . . . , u^ℓ) → y^ℓ(u^1, . . . , u^ℓ; y^0), ℓ = 1, . . . , L.
The directional derivatives
w^ℓ = Dy^ℓ(u^1, . . . , u^ℓ; y^0) ( v^1; . . . ; v^ℓ ), ℓ = 1, . . . , L,
are computed by the implicit function theorem applied to (8.27), i.e., they are computed using
w^0 = 0, (8.31a)
w^ℓ = σ_y^ℓ( y^{ℓ−1}, u^ℓ ) w^{ℓ−1} + σ_u^ℓ( y^{ℓ−1}, u^ℓ ) v^ℓ, ℓ = 1, . . . , L. (8.31b)
The recursion (8.31) allows us to compute the Jacobian Dy L (u1, . . . , u L ; y0 ) applied to a vector.
For optimization we also need the application of the transpose of this Jacobian. To compute
Dy L (u1, . . . , u L ; y0 ) T r for a given vector r ∈ RnL we first note that (8.31) is equivalent to
[ I                                        ] [ w^1 ]   [ σ_u^1(y^0, u^1)                           ] [ v^1 ]
[ −σ_y^2(y^1, u^2)   I                     ] [ w^2 ]   [        σ_u^2(y^1, u^2)                    ] [ v^2 ]
[        ⋱           ⋱                     ] [  ⋮  ] = [                 ⋱                         ] [  ⋮  ]   (8.32)
[            −σ_y^L(y^{L−1}, u^L)   I      ] [ w^L ]   [                        σ_u^L(y^{L−1}, u^L)] [ v^L ]
The lower block bidiagonal matrix on the left is denoted by A and the block diagonal matrix on the right by B.
We have, for any vectors v^1, . . . , v^L,
( Dy^L(u^1, . . . , u^L; y^0)^T r )^T ( v^1; . . . ; v^L ) = r^T Dy^L(u^1, . . . , u^L; y^0) ( v^1; . . . ; v^L ) = r^T w^L
= ( 0; . . . ; 0; r )^T ( w^1; . . . ; w^L ) = ( 0; . . . ; 0; r )^T A^{-1} B ( v^1; . . . ; v^L )
= ( B^T A^{-T} ( 0; . . . ; 0; r ) )^T ( v^1; . . . ; v^L ). (8.33)
Since (8.33) holds for any vector v^1, . . . , v^L,
Dy^L(u^1, . . . , u^L; y^0)^T r = B^T A^{-T} ( 0; . . . ; 0; r ). (8.34)
If we define
p = A^{-T} ( 0; . . . ; 0; r )
and use the definition of A and B in (8.32), the quantity Dy^L(u^1, . . . , u^L; y^0)^T r can be obtained
as follows. Compute
p^L = r, (8.35a)
p^{ℓ−1} = σ_y^ℓ( y^{ℓ−1}, u^ℓ )^T p^ℓ, ℓ = L, . . . , 2, (8.35b)
and set
Dy^L(u^1, . . . , u^L; y^0)^T r = ( σ_u^1( y^0, u^1 )^T p^1; . . . ; σ_u^L( y^{L−1}, u^L )^T p^L ). (8.35c)
In particular, if we define
f(u^1, . . . , u^L) = (1/2) || y^L(u^1, . . . , u^L; y^0) − ŷ^L ||_2^2,
the gradient
∇f(u^1, . . . , u^L) = Dy^L(u^1, . . . , u^L; y^0)^T ( y^L(u^1, . . . , u^L; y^0) − ŷ^L )
can be computed as follows. Compute
p^L = y^L(u^1, . . . , u^L; y^0) − ŷ^L, (8.36a)
p^{ℓ−1} = σ_y^ℓ( y^{ℓ−1}, u^ℓ )^T p^ℓ, ℓ = L, . . . , 2, (8.36b)
and set
∇f(u^1, . . . , u^L) = ( σ_u^1( y^0, u^1 )^T p^1; . . . ; σ_u^L( y^{L−1}, u^L )^T p^L ). (8.36c)
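The recursion (8.36) is the classical backpropagation algorithm. A Matlab sketch, under the assumption that the layer outputs from the forward pass are stored as ystore{l+1} = y^l and that the partial Jacobians are available as function handles Sy{l} and Su{l}:

% Backpropagation (8.36) for f = 0.5*||y^L - yhat||^2 (sketch, placeholders).
% ystore{l+1} = y^l, l = 0,...,L; Sy{l}, Su{l} evaluate the partial Jacobians
% sigma_y^l(y^{l-1},u^l) and sigma_u^l(y^{l-1},u^l); u{l} holds the layer parameters.
p    = ystore{L+1} - yhat;                      % p^L, cf. (8.36a)
grad = cell(L,1);
for l = L:-1:1
    grad{l} = Su{l}(ystore{l}, u{l})' * p;      % block l of the gradient, cf. (8.36c)
    if l > 1
        p = Sy{l}(ystore{l}, u{l})' * p;        % p^{l-1}, cf. (8.36b)
    end
end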
If instead of the least squares functional (8.30) we use another metric φ to quantify the distance
between the output y^L(u^1, . . . , u^L; y^0) and the desired output ŷ^L, we are led to the minimization problem
min_{u^1,...,u^L} φ( y^L(u^1, . . . , u^L; y^0); ŷ^L ).
Its gradient is
∇f(u^1, . . . , u^L) = Dy^L(u^1, . . . , u^L; y^0)^T ∇_{y^L} φ( y^L(u^1, . . . , u^L; y^0); ŷ^L ),
which is computed as in (8.36), now starting the recursion with p^L = ∇_{y^L} φ( y^L(u^1, . . . , u^L; y^0); ŷ^L ); one then sets
∇f(u^1, . . . , u^L) = ( σ_u^1( y^0, u^1 )^T p^1; . . . ; σ_u^L( y^{L−1}, u^L )^T p^L ). (8.38c)
8.3.3. An Example
This example is adapted from [Bri15]. (To be added later.)
where for given function u ∈ L 2 ((0, 1) × (0, T )) the function y(u; ·) is the solution of
(∂/∂t) y(x, t) − ν (∂²/∂x²) y(x, t) + y(x, t) (∂/∂x) y(x, t) = r(x, t) + u(x, t), (x, t) ∈ (0, 1) × (0, T),
y(0, t) = y(1, t) = 0, t ∈ (0, T), (8.39b)
y(x, 0) = y_0(x), x ∈ (0, 1),
where z : (0, 1) × (0, T ) → R, r : (0, 1) × (0, T ) → R, and y0 : (0, 1) → R are given functions
and ω, ν > 0 are given parameters. The parameter ν > 0 is also called the viscosity and the
differential equation (8.39b) is known as the (viscous) Burgers’ equation. The problem (8.39) is
studied, e.g., in [LMT97, Vol01]. As we have mentioned earlier, (8.39) can be viewed as a first step
towards solving optimal control problems governed by the Navier-Stokes equations [Gun03]. More
generally, (8.39) belongs to the class of partial differential equation (PDE) constrained optimization
problems [HPUU09].
In the context of (8.39) the function u is called the control, y is called the state, and (8.39b)
is called the state equation. We do not study the infinite dimensional problem (8.39), but instead
consider a discretization of (8.39).
Now we subdivide the spatial interval $[0,1]$ into $n$ subintervals $[x_{i-1},x_i]$, $i=1,\dots,n$, with $x_i = ih$ and $h = 1/n$. We define the piecewise linear ('hat') functions
$$
\varphi_i(x) = \begin{cases}
h^{-1}\bigl(x-(i-1)h\bigr), & x\in[(i-1)h, ih]\cap[0,1], \\
h^{-1}\bigl(-x+(i+1)h\bigr), & x\in[ih,(i+1)h]\cap[0,1], \\
0, & \text{else},
\end{cases}
\qquad i = 0,\dots,n. \tag{8.41}
$$
[Figure: the hat functions $\varphi_0,\dots,\varphi_5$ on the grid $x_0,\dots,x_5$.]
We approximate the state and the control by
$$ y_h(x,t) = \sum_{j=1}^{n-1} y_j(t)\varphi_j(x) \tag{8.42} $$
and
$$ u_h(x,t) = \sum_{j=0}^{n} u_j(t)\varphi_j(x). \tag{8.43} $$
We set
$$ \vec y(t) = \bigl(y_1(t),\dots,y_{n-1}(t)\bigr)^T \quad\text{and}\quad \vec u(t) = \bigl(u_0(t),\dots,u_n(t)\bigr)^T. $$
If we insert the approximations (8.42), (8.43) into (8.40) and require (8.40) to hold for $\varphi = \varphi_i$, $i = 1,\dots,n-1$, then we obtain the system of ordinary differential equations
$$ M_h\frac{d}{dt}\vec y(t) + A_h\vec y(t) + N_h(\vec y(t)) + B_h\vec u(t) = r_h(t), \qquad t\in(0,T), \tag{8.44} $$
where $M_h, A_h \in \mathbb{R}^{(n-1)\times(n-1)}$, $B_h \in \mathbb{R}^{(n-1)\times(n+1)}$, $r_h(t)\in\mathbb{R}^{n-1}$, and $N_h(\vec y(t))\in\mathbb{R}^{n-1}$ are matrices and vectors determined by the weak form (8.40). Inserting the approximations (8.42), (8.43) into the objective leads to the integrand in (8.45a) below,
where $M_h\in\mathbb{R}^{(n-1)\times(n-1)}$ is defined as before and $Q_h\in\mathbb{R}^{(n+1)\times(n+1)}$, $g_h(t)\in\mathbb{R}^{n-1}$ are a matrix and vector with entries
$$ (Q_h)_{ij} = \int_0^1 \varphi_j(x)\varphi_i(x)\,dx, \qquad (g_h(t))_i = -\int_0^1 z(x,t)\varphi_i(x)\,dx. $$
Thus a semi-discretization of the optimal control problem (8.39) is given by
$$ \min_{\vec u}\ \int_0^T \tfrac12\vec y(t)^T M_h\vec y(t) + (g_h(t))^T\vec y(t) + \tfrac{\omega}{2}\vec u(t)^T Q_h\vec u(t)\,dt, \tag{8.45a} $$
where $\vec y(t)$ is the solution of
$$ M_h\tfrac{d}{dt}\vec y(t) + A_h\vec y(t) + N_h(\vec y(t)) + B_h\vec u(t) = r_h(t), \quad t\in(0,T), \qquad \vec y(0) = \vec y_0, \tag{8.45b} $$
where $\vec y_0 = \bigl(y_0(h),\dots,y_0(1-h)\bigr)^T$.
Using the definition (8.41) of $\varphi_i$, $i=0,\dots,n$, it is easy to compute that
$$
M_h = \frac{h}{6}\begin{pmatrix} 4 & 1 & & & \\ 1 & 4 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 4 & 1 \\ & & & 1 & 4 \end{pmatrix} \in \mathbb{R}^{(n-1)\times(n-1)},
\qquad
A_h = \frac{\nu}{h}\begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix} \in \mathbb{R}^{(n-1)\times(n-1)}.
$$
Later we also need the Jacobian $N_h'(\vec y(t))\in\mathbb{R}^{(n-1)\times(n-1)}$, which is shown in Figure 8.5.
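For concreteness, a possible Matlab sketch for assembling these two matrices (the function name is ours, not from the notes) is:

% Hedged sketch: assemble the tridiagonal mass and stiffness matrices M_h and
% A_h of the piecewise linear basis (8.41) for given n and viscosity nu.
function [Mh, Ah] = assemble_mass_stiffness(n, nu)
  h  = 1/n;
  e  = ones(n-1,1);
  Mh = (h/6)  * spdiags([e 4*e e],   -1:1, n-1, n-1);  % (h/6)*tridiag(1,4,1)
  Ah = (nu/h) * spdiags([-e 2*e -e], -1:1, n-1, n-1);  % (nu/h)*tridiag(-1,2,-1)
end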
To discretize the problem in time, we use the Crank-Nicolson method. We let $0 = t_0 < t_1 < \dots < t_{N+1} = T$ be a partition of the time interval and we define
$$ \Delta t_i = t_{i+1} - t_i, \qquad i = 0,\dots,N. $$
We also introduce $\Delta t_{-1} = \Delta t_{N+1} = 0$.
The fully discretized problem is given by
$$ \min_{\vec u_0,\dots,\vec u_{N+1}}\ \sum_{i=0}^{N+1}\frac{\Delta t_{i-1}+\Delta t_i}{2}\Bigl(\tfrac12\vec y_i^T M_h\vec y_i + (g_h)_i^T\vec y_i + \tfrac{\omega}{2}\vec u_i^T Q_h\vec u_i\Bigr), \tag{8.46a} $$
subject to the Crank-Nicolson discretization (8.46b) of the state equation (8.45b), where $\vec y_0$ is given. We denote the objective function in (8.46a) by $\hat f$ and we set $\mathbf u = (\vec u_0^T,\dots,\vec u_{N+1}^T)^T$.
We call $\mathbf u$ the control, $\mathbf y = (\vec y_1^T,\dots,\vec y_{N+1}^T)^T$ the state, and (8.46b) is called the (discretized) state equation.
As with many applications, the verification that (8.46) satisfies the Assumptions 8.2.1, especially the first and third one, is difficult. If the set $U$ of admissible controls $\mathbf u$ is constrained in a suitable manner and if the parameters $\nu, h, \Delta t_i$ are chosen properly, then it is possible to verify Assumptions 8.2.1. We ignore this issue and continue as if Assumptions 8.2.1 were valid for (8.46). Our numerical experiments indicate that this is fine for our problem setting. We also note that our simple Galerkin finite element method in space produces meaningful results only if the mesh size $h$ is sufficiently small (relative to the viscosity $\nu$ and the size of the solution $y$); otherwise the computed solution exhibits spurious oscillations. Again, for our parameter settings, our discretization is sufficient.
Since the Burgers' equation (8.46b) is quadratic in $\vec y_{i+1}$, the computation of $\vec y_{i+1}$, $i=0,\dots,N$, requires the solution of a system of nonlinear equations. We apply Newton's method to compute the solution $\vec y_{i+1}$ of (8.46b). We use the computed state $\vec y_i$ at the previous time step as the initial iterate in Newton's method.
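A minimal Matlab sketch of this inner Newton iteration (the handle res_and_jac, the tolerance, and the iteration cap are our assumptions, not part of the notes) is:

% Hedged sketch: solve one Crank-Nicolson step (8.46b) for y_{i+1} by
% Newton's method, started from the previous state y_i.
function ynew = cn_step_newton(yold, res_and_jac, tol, maxit)
  ynew = yold;                        % initial iterate: state at previous time step
  for k = 1:maxit
    [F, J] = res_and_jac(ynew);       % residual of (8.46b) and its Jacobian w.r.t. y_{i+1}
    if norm(F) <= tol
      return;
    end
    ynew = ynew - J\F;                % Newton update
  end
end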
The Lagrangian corresponding to (8.46) is
$$
\begin{aligned}
L(\mathbf y,\mathbf u,\boldsymbol\lambda)
= \sum_{i=0}^{N+1}\frac{\Delta t_{i-1}+\Delta t_i}{2}\Bigl(\tfrac12\vec y_i^T M_h\vec y_i + (g_h)_i^T\vec y_i + \tfrac{\omega}{2}\vec u_i^T Q_h\vec u_i\Bigr)
+ \sum_{i=0}^{N}\vec\lambda_{i+1}^T\Bigl[&\Bigl(M_h+\tfrac{\Delta t_i}{2}A_h\Bigr)\vec y_{i+1} + \tfrac{\Delta t_i}{2}N_h(\vec y_{i+1}) + \tfrac{\Delta t_i}{2}B_h\vec u_{i+1} \\
&+\Bigl(-M_h+\tfrac{\Delta t_i}{2}A_h\Bigr)\vec y_i + \tfrac{\Delta t_i}{2}N_h(\vec y_i) + \tfrac{\Delta t_i}{2}B_h\vec u_i
- \tfrac{\Delta t_i}{2}\bigl(r_h(t_i)+r_h(t_{i+1})\bigr)\Bigr].
\end{aligned} \tag{8.47}
$$
The adjoint equations corresponding to (8.11) are obtained by setting the partial derivatives with respect to $\vec y_i$ of the Lagrangian (8.47) to zero and are given by
$$
\begin{aligned}
\Bigl(M_h+\tfrac{\Delta t_N}{2}A_h+\tfrac{\Delta t_N}{2}N_h'(\vec y_{N+1})\Bigr)^T\vec\lambda_{N+1} &= -\tfrac{\Delta t_N}{2}\bigl(M_h\vec y_{N+1}+(g_h)_{N+1}\bigr), \\
\Bigl(M_h+\tfrac{\Delta t_{i-1}}{2}A_h+\tfrac{\Delta t_{i-1}}{2}N_h'(\vec y_i)\Bigr)^T\vec\lambda_i &= -\Bigl(-M_h+\tfrac{\Delta t_i}{2}A_h+\tfrac{\Delta t_i}{2}N_h'(\vec y_i)\Bigr)^T\vec\lambda_{i+1}
- \tfrac{\Delta t_{i-1}+\Delta t_i}{2}\bigl(M_h\vec y_i+(g_h)_i\bigr), \quad i = N,\dots,1,
\end{aligned} \tag{8.48}
$$
where $N_h'(\vec y_i)$ denotes the Jacobian of $N_h(\vec y_i)$. (Recall that $\Delta t_{N+1} = 0$.) Given the solution of (8.48), the gradient of the objective function $\hat f$ can be obtained by computing the partial derivatives with respect to $\vec u_i$ of the Lagrangian (8.47). The gradient is given by
$$
\nabla_{\mathbf u}\hat f(\mathbf u) = \begin{pmatrix}
\omega\frac{\Delta t_0}{2}Q_h\vec u_0 + \frac{\Delta t_0}{2}B_h^T\vec\lambda_1 \\[2pt]
\omega\frac{\Delta t_0+\Delta t_1}{2}Q_h\vec u_1 + B_h^T\bigl(\frac{\Delta t_0}{2}\vec\lambda_1+\frac{\Delta t_1}{2}\vec\lambda_2\bigr) \\[2pt]
\vdots \\[2pt]
\omega\frac{\Delta t_{N-1}+\Delta t_N}{2}Q_h\vec u_N + B_h^T\bigl(\frac{\Delta t_{N-1}}{2}\vec\lambda_N+\frac{\Delta t_N}{2}\vec\lambda_{N+1}\bigr) \\[2pt]
\omega\frac{\Delta t_N}{2}Q_h\vec u_{N+1} + \frac{\Delta t_N}{2}B_h^T\vec\lambda_{N+1}
\end{pmatrix}. \tag{8.49}
$$
(Recall that $\Delta t_{-1} = \Delta t_{N+1} = 0$.)
We summarize the gradient computation using adjoints in the following algorithm.
Of course, if we have computed the solution $\vec y_1,\dots,\vec y_{N+1}$ of the discretized Burgers equation (8.46b) for the given $\vec u_0,\dots,\vec u_{N+1}$ already, then we can skip step 1 in Algorithm 8.4.1. Furthermore, we can assemble the components of the gradient $\nabla_{\mathbf u}\hat f(\mathbf u)$ that depend on $\vec\lambda_{i+1}$ immediately after it has been computed. This way we do not have to store all $\vec\lambda_1,\dots,\vec\lambda_{N+1}$.
We conclude by adapting Algorithm 8.2.4 to our problem. Since the objective function (8.46a) is quadratic and the implicit constraints (8.46b) are quadratic in $\mathbf y$ and linear in $\mathbf u$, most of the second derivative terms are zero. The multiplication of the Hessian $\nabla_{\mathbf u}^2\hat f(\mathbf u)$ times a vector $\mathbf v$ can be performed using Algorithm 8.4.2 below. In step 4 of the following algorithm we use that $N_h(\vec y)$ is quadratic; hence $\frac{d}{d\vec y}\bigl(N_h'(\vec y)^T\vec\lambda\bigr)\vec w = N_h'(\vec w)^T\vec\lambda$.
1. Given $\vec u_1,\dots,\vec u_{N+1}$ and $\vec y_0$, compute $\vec y_1,\dots,\vec y_{N+1}$ as in Step 1 of Algorithm 8.4.1.
3. Compute $\vec w_1,\dots,\vec w_{N+1}$ from
$$ \Bigl(M_h+\tfrac{\Delta t_i}{2}A_h+\tfrac{\Delta t_i}{2}N_h'(\vec y_{i+1})\Bigr)\vec w_{i+1} = \Bigl(M_h-\tfrac{\Delta t_i}{2}A_h-\tfrac{\Delta t_i}{2}N_h'(\vec y_i)\Bigr)\vec w_i + \tfrac{\Delta t_i}{2}B_h(\vec v_i+\vec v_{i+1}), $$
$i = 0,\dots,N$, where $\vec w_0 = 0$.
5. Compute
and $z(x,t) = y_0(x)$, $t\in(0,T)$ (cf. [KV99]). For the discretization we use $n_x = 80$ spatial subintervals and 80 time steps, i.e., $\Delta t = 1/80$.
The solution $y$ of the discretized Burgers' equation (8.46b) with $u(x,t) = 0$ as well as the desired state $z$ are shown in Figure 8.6.
Figure 8.6: Solution of Burgers’ equation with u = 0 (no control) (left) and desired state z (right)
The solution $u^*$ of the optimal control problem (8.39), the corresponding solution $y(u^*)$ of the discretized Burgers' equation (8.46b), and the solution $\lambda(u^*)$ of (8.48) are plotted in Figure 8.7 below.
The convergence history of the Newton-CG method with Armijo line search applied to (8.46a) is shown in Table 8.1. We use the Newton-CG Algorithm with gradient stopping tolerance $\text{gtol} = 10^{-8}$ and compute steps $s_k$ such that $\|\nabla^2\hat f(u_k)s_k + \nabla\hat f(u_k)\| \le \eta_k\|\nabla\hat f(u_k)\|_2$ with $\eta_k = \min\{0.01, \|\nabla\hat f(u_k)\|_2\}$.
Figure 8.7: Optimal control u∗ (upper left), corresponding solution y(u∗ ) of Burgers’ equation
(upper right) and corresponding Lagrange multipliers λ(u∗ ) (bottom)
Table 8.1: Convergence history of a Newton-CG method applied to the solution of (8.39)
8.4.5. Checkpointing
In Algorithm 8.4.1 we note that the state equation is solved forward for the $\vec y_i$'s while the adjoint equation is solved backward for the $\vec\lambda_i$'s. Moreover, the states $\vec y_{N+1},\dots,\vec y_1$ are needed for the computation of the adjoints $\vec\lambda_{N+1},\dots,\vec\lambda_1$. If the size of the state vectors $\vec y_i$ is small enough so that all states $\vec y_1,\dots,\vec y_{N+1}$ can be held in the computer memory, this dependence does not pose a difficulty. However, for many problems, such as flow control problems governed by the unsteady Navier-Stokes equations, the states are too large to hold the entire state history in computer memory. In this case one needs to apply so-called checkpointing techniques.
With checkpointing one trades memory for state re-computations. In a simple scheme one does not keep every state $\vec y_0,\vec y_1,\dots,\vec y_{N+1}$, but only every $M$th state $\vec y_0,\vec y_M,\dots,\vec y_{N+1}$ (here we assume that $N+1$ is an integer multiple of $M$). In the computation of the adjoint variables $\vec\lambda_i$ for $i\in\{kM+1,\dots,(k+1)M-1\}$ and some $k\in\{0,\dots,(N+1)/M\}$ one needs $\vec y_i$, which has not been stored. Therefore, one uses the stored $\vec y_{kM}$ to re-compute $\vec y_{kM+1},\dots,\vec y_{(k+1)M-1}$.
2. Adjoint computation.
   2.1. Compute $\vec\lambda_{N+1}$ by solving
$$ \Bigl(M_h+\tfrac{\Delta t_N}{2}A_h+\tfrac{\Delta t_N}{2}N_h'(\vec y_{N+1})\Bigr)^T\vec\lambda_{N+1} = -\tfrac{\Delta t_N}{2}\bigl(M_h\vec y_{N+1}+(g_h)_{N+1}\bigr). $$
Note that for k = (N +1)/M −1 one really does not need to recompute the states ~y N+2−M , . . . , ~y N
in step 2.2.1, since they are the last states computed in step 1.1. and should be stored there.
Algorithm 8.4.3 requires storage for (N + 1)/M + 1 vectors ~y0, ~y M , . . . , ~y N+1 , for M − 1 vectors
~y k M+1, . . . , ~y(k+1)M−1 computed in step 2.2.1, and for one vector λ~ i .
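To make the bookkeeping concrete, here is a hedged Matlab sketch of this simple checkpointing scheme for an abstract time stepper; the handles forward_step, adjoint_init, and adjoint_step and the assumption that N+1 is a multiple of M are ours, and the gradient accumulation is only indicated.

% Hedged sketch of simple checkpointing: store every M-th state during the
% forward sweep, re-compute the states of one segment at a time during the
% backward (adjoint) sweep.
function adjoint_with_checkpoints(y0, Np1, M, forward_step, adjoint_init, adjoint_step)
  nseg = Np1/M;                              % assumes N+1 is a multiple of M
  chk  = cell(nseg+1,1);  chk{1} = y0;       % checkpoints y_0, y_M, ..., y_{N+1}
  y = y0;
  for i = 1:Np1                              % forward sweep
    y = forward_step(y, i);
    if mod(i, M) == 0, chk{i/M + 1} = y; end
  end
  lam = adjoint_init(chk{nseg+1});           % e.g. lambda_{N+1} from y_{N+1}
  for k = nseg-1:-1:0                        % segments [kM, (k+1)M], last one first
    yseg = cell(M+1,1);  yseg{1} = chk{k+1}; % re-compute y_{kM+1},...,y_{(k+1)M}
    for j = 1:M
      yseg{j+1} = forward_step(yseg{j}, k*M + j);
    end
    for j = M:-1:1                           % adjoint steps within the segment
      lam = adjoint_step(lam, yseg{j}, yseg{j+1}, k*M + j);
      % ...accumulate the gradient contribution of step k*M+j here...
    end
  end
end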
The simple checkpointing scheme used in Algorithm 8.4.3 is not optimal in the sense that given
a certain memory size to store state information it uses too many state re-computations. The issue
of optimal checkpointing is studied in the context of Automatic Differentiation. The so-called
reverse mode automatic differentiation is closely related to gradient computations via the adjoint
method. We refer to [Gri03, Sec. 4] or [GW08] for more details.
8.5. Optimization
In the previous sections we have discussed the computation of gradient and Hessian information for the implicitly constrained optimization problem (8.3), (8.1), (8.2). Thus it seems we should be able to apply a gradient based optimization algorithm, like the Newton-CG Algorithm, to solve the problem. In fact, in the previous section we have used the Newton-CG Algorithm to solve the discretized optimal control problem (8.46). However, there are important issues left to be dealt with. These are perhaps not so obvious when one deals with the algorithms in the previous sections 'on paper', but they become apparent when one actually has to implement the algorithms.
In the $k$-th iteration of the Newton-CG Algorithm we have to compute the gradient $\nabla\hat f(u_k)$, we have to apply the Hessian $\nabla^2\hat f(u_k)$ to a number of vectors, and we have to evaluate the function $\hat f$ at some trial points. In a Matlab implementation of the Newton-CG Algorithm one may require the user to supply three functions
function [f] = fval(u, usr_par)
function [g] = grad(u, usr_par)
function [Hv] = Hessvec(v, u, usr_par)
that evaluate the objective function $\hat f(u)$, evaluate the gradient $\nabla\hat f(u)$, and evaluate the Hessian-times-vector product $\nabla^2\hat f(u)v$, respectively. The last argument usr_par is included to allow the user to pass problem specific parameters to the functions.
Now, if we look at Algorithms 8.2.2, 8.2.3, and 8.2.4 we see that the computation of $\nabla\hat f(u)$ and $\nabla^2\hat f(u)v$ all require the computation of $y(u)$. Furthermore, the computation of $\nabla^2\hat f(u)v$ requires the computation of $\lambda(u)$. Since the computation of $y(u)$ can be expensive, we want to reuse an already computed $y(u)$ rather than recompute $y(u)$ every time fval, grad, or Hessvec is called. Similarly, we want to reuse $\lambda(u)$, which has to be computed as part of the gradient computation in Algorithm 8.2.3, during subsequent calls of Hessvec. Of course, if $u$ changes, we must recompute $y(u)$ and $\lambda(u)$. How can we do this?
If we know precisely what is going on in our optimization algorithm, then $y(u)$ and $\lambda(u)$ can be reused. For example, if we use the Newton-CG Algorithm, then we know that $\hat f(u_k)$ is evaluated before $\nabla\hat f(u_k)$ is computed. Moreover, we know that Hessian-times-vector products $\nabla^2\hat f(u_k)v$ are computed only after $\nabla\hat f(u_k)$ is computed. Thus, in this case, when fval is called, we compute $y(u_k)$ and store it to make it available for reuse in subsequent calls to grad and Hessvec. Similarly, if the gradient is implemented via Algorithm 8.2.3, then when grad is called we compute $\lambda(u_k)$ and store it to make it available for reuse in subsequent calls to Hessvec. This strategy works only because we know that the functions fval, grad, and Hessvec are called in the right order. If the optimization algorithm is changed such that, say, $\nabla\hat f(u_k)$ is computed before $\hat f(u_k)$, the optimization algorithm will fail because it is no longer interfaced correctly with our problem.
We need to find a way that allows us to separate the optimization algorithm (which does not need and should not need to know that the evaluation of our objective function depends on the implicit function $y(u)$) from the particular optimization problem, but allows us to avoid unnecessary recomputations of $y(u)$ and $\lambda(u)$. Such software design issues are extremely important for the efficient implementation of optimization algorithms in which function evaluations may involve expensive simulations. We refer to [BvH04, HV99, PSS09] for more discussion of such issues. In our Matlab implementation we deal with this issue by expanding our interface between optimization algorithm and application slightly.
In our Matlab implementation, we require the user to supply a fourth function, unew. The function unew is called by the optimization algorithm whenever $u$ has been changed and before any of the three functions fval, grad, or Hessvec are called. In our context, whenever unew is called with argument $u$ we compute $y(u)$ and store it to make it available for reuse in subsequent calls to fval, grad, and Hessvec. If the implementer of the optimization algorithm changes the algorithm and, say, requires the computation of $\nabla\hat f(u_k)$ before the computation of $\hat f(u_k)$, then she/he needs to ensure that unew is called with argument $u_k$ before grad is called. This change of the optimization algorithm does not need to be communicated to the user of the optimization algorithm. The interface would still work. This interface is used in a Matlab implementation of the Newton-CG Algorithm and of a limited memory BFGS method which are available at http://www.caam.rice.edu/~heinken/software. The introduction of unew enables us to separate the optimization from the application and to avoid unnecessary recomputations of $y(u)$ and $\lambda(u)$. It is not totally satisfactory, however, since it requires that the optimization algorithm developer implements the use of unew correctly and it requires the application person not to accidentally overwrite information between two calls of unew. These requirements become more difficult to fulfill as the optimization algorithm and applications become more complex. The papers mentioned above discuss other approaches when C++ instead of Matlab is used.
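One hedged way to realize this caching in Matlab is sketched below; the calling convention in which unew returns an updated usr_par, and the helper names solve_state, solve_adjoint, and assemble_gradient, are assumptions for illustration only (each function would live in its own file).

% Hedged sketch of the caching interface: unew refreshes the stored state when
% u changes; grad computes and stores the adjoint for reuse by Hessvec.
function usr_par = unew(u, usr_par)
  usr_par.u      = u;
  usr_par.y      = solve_state(u, usr_par);      % compute y(u) once per new u
  usr_par.lambda = [];                           % adjoint not yet available
end

function [g, usr_par] = grad(u, usr_par)
  if isempty(usr_par.lambda)
    usr_par.lambda = solve_adjoint(usr_par.y, u, usr_par);   % lambda(u)
  end
  g = assemble_gradient(usr_par.y, usr_par.lambda, u, usr_par);
end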
In the simple problem (8.46) we are able to solve the implicit constraints (8.46b) rather accurately. Consequently, even for an optimization stopping tolerance $\text{gtol} = 10^{-8}$ (which arguably is small for our discretization of (8.39)) the Newton-CG Algorithm converges. In other applications the inexactness in the solution of the implicit equation will affect the optimization algorithm even for coarser stopping tolerances gtol.
Table 8.2: Performance of a Newton-CG method with $\text{gtol} = 10^{-12}$ applied to the solution of (8.39). The systems (8.46b) are solved with a residual stopping tolerance of $10^{-2}\min\{h^2,\Delta t^2\}$.
Table 8.3: Performance of a Newton-CG method with $\text{gtol} = 10^{-12}$ applied to the solution of (8.39). The systems (8.46b) are solved with a residual stopping tolerance of $10^{-5}\min\{h^2,\Delta t^2\}$.
The 'hand-tuning' of stopping tolerances for the implicit equation and the optimization algorithm is, of course, very unsatisfactory. Ideally one would like an optimization algorithm that selects these tolerances automatically and allows more inexact and therefore less expensive solves of the implicit equation at the beginning of the optimization iteration. One difficulty is that one cannot compute the error in function and derivative information; one can usually only provide an asymptotic estimate of the form $|\hat f_\varepsilon(u) - \hat f(u)| = O(\varepsilon)$, where $\hat f_\varepsilon$ denotes the objective value obtained with an inexact solve.
There are approaches to handle inexact function and derivative information in optimization algorithms. For example, a general approach to this problem is presented in the book [Pol97]. Additionally, Section 10.6 in [CGT00] and [KHRv14] describe approaches to adjust the accuracy of function values and derivatives in a trust-region method (see also the references in that section). Handling inexactness in optimization algorithms to increase the efficiency of the overall algorithm, by using rough, inexpensive function and derivative information whenever possible while maintaining the robustness of the optimization algorithm, is an important research problem. Although approaches exist, more work remains to be done.
$$
H = \begin{pmatrix}
\nabla_{yy}L(y,u,\lambda) & \nabla_{yu}L(y,u,\lambda) \\
\nabla_{uy}L(y,u,\lambda) & \nabla_{uu}L(y,u,\lambda)
\end{pmatrix}.
$$
The QP (8.50) is almost identical to the QPs (8.21) and (8.24) arising in Newton-type methods for the implicitly constrained problem (8.3), (8.1), (8.2). In the QPs (8.21) and (8.24), $y = y(u)$ and $\lambda = \lambda(u)$, and the right hand side of the constraint is $c(y(u),u) = 0$. This indicates that one step of an SQP method for (8.4) may not be computationally more expensive than one step of a Newton-type method for (8.3), (8.1), (8.2). However, SQP methods profit from the decoupling of the variables $y$ and $u$ and can be significantly more efficient than Newton-type methods for (8.3), (8.1), (8.2), because the latter compute iterates that are on the constraint manifold.
8.6. Problems
Problem 8.1
[Bur40] J. M. Burgers. Application of a model system to illustrate some points of the statistical
theory of free turbulence. Nederl. Akad. Wetensch., Proc., 43:2–12, 1940.
[Dre90] S. E. Dreyfus. Artificial neural networks, back propagation, and the Kelley-Bryson
gradient procedure. J. Guidance Control Dynam., 13(5):926–928, 1990. URL: http:
//dx.doi.org/10.2514/3.25422, doi:10.2514/3.25422.
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/
10.1137/1.9781611971200, doi:10.1137/1.9781611971200.
[FA11] J. A. Fike and J. J. Alonso. The development of hyper-dual numbers for exact second-
derivative calculations. Proceedings, 49th AIAA Aerospace Sciences Meeting including
the New Horizons Forum and Aerospace Exposition. Orlando, Florida, 2011. URL:
https://doi.org/10.2514/6.2011-886, doi:10.2514/6.2011-886.
[FA12] J. A. Fike and J. J. Alonso. Automatic differentiation through the use of hyper-dual num-
bers for second derivatives. In S. Forth, P. Hovland, E. Phipps, J. Utke, and A. Walther,
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org (Accessed April 10, 2017).
[HPUU09] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich. Optimization with PDE Con-
straints, volume 23 of Mathematical Modelling, Theory and Applications. Springer
Verlag, Heidelberg, New York, Berlin, 2009. URL: http://dx.doi.org/10.1007/
978-1-4020-8839-1, doi:10.1007/978-1-4020-8839-1.
[KV99] K. Kunisch and S. Volkwein. Control of Burger’s equation by a reduced order ap-
proach using proper orthogonal decomposition. Journal of Optimization Theory and
Applications, 102:345–371, 1999.
[LMT97] H. V. Ly, K. D. Mease, and E. S. Titi. Distributed and boundary control of the viscous
Burgers’ equation. Numer. Funct. Anal. Optim., 18(1-2):143–188, 1997.
[PSS09] A. D. Padula, S. D. Scott, and W. W. Symes. A software framework for abstract expres-
sion of coordinate-free linear algebra and optimization algorithms. ACM Trans. Math.
Software, 36(2):Art. 8, 36, 2009. URL: https://doi.org/10.1145/1499096.
1499097, doi:10.1145/1499096.1499097.
[Sal86] D. E. Salane. Adaptive routines for forming jacobians numerically. Technical Report
SAND86–1319, Sandia National Laboratories, 1986.
[ST98] W. Squire and G. Trapp. Using complex variables to estimate derivatives of real
functions. SIAM Rev., 40(1):110–112, 1998. URL: https://doi.org/10.1137/
S003614459631241X, doi:10.1137/S003614459631241X.
[Vol01] S. Volkwein. Distributed control problems for the Burgers equation. Comput. Optim.
Appl., 18(2):115–140, 2001.
$$ f(x_k + s) \approx m_k(x_k + s) = f(x_k) + \nabla f(x_k)^T s + \tfrac12 s^T B_k s. $$
We can use the model to compute a new guess $x_{k+1}$. For example, if $B_k$ is symmetric positive definite and if we use a line-search method, then the unique minimizer of the model is $s_k = -B_k^{-1}\nabla f(x_k)$ and the new approximation is $x_{k+1} = x_k + \alpha_k s_k$.
At the new iterate $x_{k+1}$ we build a new model. How should we choose $B_{k+1}$? The matrix $B_{k+1}$ should be symmetric and positive definite, so that the model $m_{k+1}(x_{k+1}+s) = f(x_{k+1}) + \nabla f(x_{k+1})^T s + \tfrac12 s^T B_{k+1}s$ has the unique minimizer $s_{k+1} = -B_{k+1}^{-1}\nabla f(x_{k+1})$. Moreover, we require that the gradients of the model $m_{k+1}$ at $x_k$ and $x_{k+1}$ coincide with the gradients of $f$ at these points. That is, we require
$$ \nabla m_{k+1}(x_{k+1}+s)\big|_{s=x_k-x_{k+1}} = \nabla f(x_k) \quad\text{and}\quad \nabla m_{k+1}(x_{k+1}+s)\big|_{s=0} = \nabla f(x_{k+1}). $$
The latter condition is automatically satisfied for any $B_{k+1}$ by the choice of $m_{k+1}$. The first condition implies
$$ B_{k+1}(x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k). \tag{9.1} $$
The condition (9.1) is known as the secant equation. In the derivation of quasi-Newton methods
the notation
s k = x k+1 − x k , y k = ∇ f (x k+1 ) − ∇ f (x k )
is commonly used and we will adopt it here as well. In this notation the secant equation (9.1) is
given by
Bk+1 s k = y k .
We note that the notation $s_k = x_{k+1} - x_k$ is misleading in the context of line searches: there we have $\alpha_k s_k = x_{k+1} - x_k$, and in this case the formulas below should be applied with $s_k$ replaced by $\alpha_k s_k = x_{k+1} - x_k$.
Since $B_{k+1}\in\mathbb{R}^{n\times n}$ is symmetric, we have $\tfrac12(n+1)n$ entries to determine, but the secant equation $B_{k+1}s_k = y_k$ provides only $n$ equations. There are infinitely many symmetric matrices $B_{k+1}$ that satisfy $B_{k+1}s_k = y_k$. We choose the one that is closest to $B_k$. This leads to the minimization problem (9.2), in which the distance is measured in the weighted Frobenius norm
$$ \||B - B_k\|| = \|M(B - B_k)M\|_F, $$
where $M$ is a symmetric nonsingular matrix and the Frobenius norm of a square matrix $A$ is given by
$$ \|A\|_F = \Bigl(\sum_{i,j=1}^n A_{ij}^2\Bigr)^{1/2}. $$
Theorem 9.2.1 Let $M\in\mathbb{R}^{n\times n}$ be a symmetric nonsingular matrix and let $s, y\in\mathbb{R}^n$ be given vectors with $s\ne0$. If $B\in\mathbb{R}^{n\times n}$ is a symmetric matrix, then the solution of the least change problem (9.3) is given by
$$ B_+ = B + \frac{(y - Bs)c^T + c(y - Bs)^T}{c^T s} - \frac{(y - Bs)^T s}{(c^T s)^2}\,cc^T, \tag{9.4} $$
where $c = M^{-2}s$.
ii. Given $s$ and $c$, there are infinitely many symmetric nonsingular matrices $M$ such that $c = M^{-2}s$. For each of these matrices, the solution of the least change problem (9.3) is given by the same matrix (9.4).
If we use the weighting matrix $M = I$ in (9.2), then application of Theorem 9.2.1 gives the so-called PSB (Powell symmetric Broyden) update
$$ B_{k+1}^{\rm PSB} = B_k + \frac{(y_k - B_k s_k)s_k^T + s_k(y_k - B_k s_k)^T}{s_k^T s_k} - \frac{(y_k - B_k s_k)^T s_k}{(s_k^T s_k)^2}\,s_k s_k^T. \tag{9.5} $$
Choosing instead a weighting matrix $M$ with $M^{-2}s_k = y_k$, i.e., $c = y_k$ in (9.4), gives the DFP (Davidon Fletcher Powell) update
$$ B_{k+1}^{\rm DFP} = B_k + \frac{(y_k - B_k s_k)y_k^T + y_k(y_k - B_k s_k)^T}{y_k^T s_k} - \frac{(y_k - B_k s_k)^T s_k}{(y_k^T s_k)^2}\,y_k y_k^T, \tag{9.6} $$
which can be rewritten as
$$ B_{k+1}^{\rm DFP} = (I - \rho_k y_k s_k^T)B_k(I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T, \tag{9.7a} $$
where
$$ \rho_k = \frac{1}{y_k^T s_k}. \tag{9.7b} $$
Since the model $m_{k+1}$ should have a unique minimizer, we want $B_{k+1}$ to be symmetric positive definite. The following theorem characterizes when the DFP update is symmetric positive definite.
Theorem 9.2.3 Let $B_k\in\mathbb{R}^{n\times n}$ be symmetric positive definite and $s_k\ne0$. The matrix $B_{k+1}^{\rm DFP}$ is symmetric positive definite if and only if $y_k^T s_k > 0$.
Proof: i. Let $B_{k+1}^{\rm DFP}$ be symmetric positive definite. Since $B_{k+1}^{\rm DFP}$ satisfies the secant equation $B_{k+1}^{\rm DFP}s_k = y_k$, we have $0 < s_k^T B_{k+1}^{\rm DFP}s_k = s_k^T y_k$.
ii. Let $y_k^T s_k > 0$. For any $v\ne0$ we have
$$ v^T B_{k+1}^{\rm DFP}v = v^T(I - \rho_k y_k s_k^T)B_k(I - \rho_k s_k y_k^T)v + \rho_k v^T y_k y_k^T v = w^T B_k w + \rho_k(v^T y_k)^2, $$
where $w = v - (\rho_k y_k^T v)s_k$. Since $\rho_k = 1/(y_k^T s_k) > 0$ and $B_k$ is symmetric positive definite, $v^T B_{k+1}^{\rm DFP}v = w^T B_k w + \rho_k(v^T y_k)^2 \ge 0$. The right hand side is zero if and only if $w^T B_k w = 0$ and $v^T y_k = 0$. However, if $v^T y_k = 0$, then $w = v - (\rho_k y_k^T v)s_k = v \ne 0$ and $w^T B_k w = v^T B_k v > 0$. Hence, $v^T B_{k+1}^{\rm DFP}v > 0$ for all $v\ne0$.
If $B_k$ is symmetric positive definite with known inverse and if $y_k^T s_k > 0$, then we can compute the inverse of $B_{k+1}^{\rm DFP}$ using the Sherman-Morrison-Woodbury formula.
i. Let $u, v\in\mathbb{R}^n$ and let $A\in\mathbb{R}^{n\times n}$ be nonsingular. Then $A + uv^T$ is nonsingular if and only if
$$ 1 + v^T A^{-1}u \equiv \sigma \ne 0. $$
In this case
$$ (A + uv^T)^{-1} = A^{-1} - \frac{1}{\sigma}A^{-1}uv^T A^{-1}. $$
ii. Let $U, V\in\mathbb{R}^{n\times m}$, $m < n$, and assume that $A\in\mathbb{R}^{n\times n}$ is nonsingular. Then $A + UV^T$ is nonsingular if and only if
$$ I + V^T A^{-1}U \equiv \Sigma \in\mathbb{R}^{m\times m} $$
is invertible. In this case
$$ (A + UV^T)^{-1} = A^{-1} - A^{-1}U\,\Sigma^{-1}V^T A^{-1}. $$
If $B_k$ is symmetric positive definite with inverse $H_k = B_k^{-1}$ and if $y_k^T s_k > 0$, then the inverse $H_{k+1}^{\rm DFP}$ of the DFP update $B_{k+1}^{\rm DFP}$ in (9.6) is given by
$$ H_{k+1}^{\rm DFP} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}. \tag{9.8} $$
We have developed the DFP update (9.6) using the least change principle applied to the replacement $B_{k+1}$ of the Hessian. If $B_k$ is invertible with inverse $H_k$, then we can try to update the inverse $H_k$ to obtain $H_{k+1}$, a replacement for the inverse of the Hessian at $x_{k+1}$. The matrix $H_{k+1}$ should be symmetric and it should satisfy the secant equation $H_{k+1}y_k = s_k$. In addition, we require that $H_{k+1}$ is close to $H_k$. Thus we update the inverse by solving (9.9).
Of course the problem (9.9) is of the same type as the problem (9.3) and everything that we have derived so far can be applied to solve (9.9). We only have to change the notation $B_k \to H_k$, $s_k \to y_k$, and $y_k \to s_k$. If in (9.9) we use the weighted Frobenius norm with a symmetric nonsingular weighting matrix $M$ such that $M^{-2}y_k = s_k$, the solution leads to the BFGS (Broyden Fletcher Goldfarb Shanno) update
$$ H_{k+1}^{\rm BFGS} = (I - \rho_k s_k y_k^T)H_k(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \tag{9.10a} $$
where
$$ \rho_k = 1/(y_k^T s_k). \tag{9.10b} $$
The following result corresponds to Theorem 9.2.3 and equation (9.8).
Theorem 9.2.5 Let $H_k\in\mathbb{R}^{n\times n}$ be symmetric positive definite and $y_k^T s_k > 0$. The matrix $H_{k+1}^{\rm BFGS}$ is symmetric positive definite, and its inverse $B_{k+1}^{\rm BFGS} = (H_{k+1}^{\rm BFGS})^{-1}$ is given by
$$ B_{k+1}^{\rm BFGS} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. \tag{9.11} $$
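As a small illustration, a hedged Matlab sketch of one application of the inverse update (9.10) (the function name is ours) is:

% Hedged sketch: apply the inverse BFGS update (9.10) to H given the pair (s, y).
% Requires the curvature condition y'*s > 0.
function Hnew = bfgs_inverse_update(H, s, y)
  rho  = 1/(y'*s);
  n    = numel(s);
  V    = eye(n) - rho*(y*s');          % I - rho*y*s'
  Hnew = V'*H*V + rho*(s*s');          % (I-rho*s*y')*H*(I-rho*y*s') + rho*s*s'
end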
9.3.2. Line-Search
One step in a quasi-Newton method with line search is given as follows. Suppose we have given an approximation $x_k$ of a (local) minimizer and a symmetric positive definite matrix $H_k$ that replaces the inverse of the Hessian. Hence, our model of $f(x_k + s)$ is
$$ m_k(x_k + s) = f(x_k) + \nabla f(x_k)^T s + \tfrac12 s^T H_k^{-1}s. $$
Update $B_k$ using
$$ B_{k+1} = B_k - \frac{B_k(x_{k+1}-x_k)(x_{k+1}-x_k)^T B_k}{(x_{k+1}-x_k)^T B_k(x_{k+1}-x_k)} + \frac{r_k r_k^T}{r_k^T(x_{k+1}-x_k)}. \tag{9.12} $$
The update (9.12) is just the standard BFGS update with $y_k$ replaced by $r_k$. When $\theta_k \ne 1$, then
$$ (x_{k+1}-x_k)^T r_k = 0.2\,(x_{k+1}-x_k)^T B_k(x_{k+1}-x_k) > 0. $$
If we have computed
$$ r_k = H_k q_k, $$
which can be done using the same steps as used in the computation of $H_{k+1}q_{k+1}$, then this leads to a recursion for the computation of $r_{k+1} = H_{k+1}q_{k+1}$, which is summarized in the following algorithm.
Algorithm 9.3.1
Compute $r_{k+1} = H_{k+1}q_{k+1}$ for a given $q_{k+1}$.
1. For $i = k,\dots,0$ do
   a. $\alpha_i = \rho_i s_i^T q_{i+1}$.
   b. $q_i = q_{i+1} - \alpha_i y_i$.
2. $r_0 = H_0 q_0$.
3. For $i = 0,\dots,k$ do
   a. $\beta_i = \rho_i y_i^T r_i$.
   b. $r_{i+1} = r_i + (\alpha_i - \beta_i)s_i$.
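A hedged Matlab sketch of this two-loop recursion (cf. [Noc80]); the storage layout with the pairs $(s_i, y_i)$ as matrix columns is our choice:

% Hedged sketch: compute r = H_{k+1}*q from H_0 and stored pairs (s_i, y_i),
% i = 0,...,k, via the two-loop recursion of Algorithm 9.3.1.
function r = lbfgs_two_loop(q, S, Y, H0)
  m     = size(S,2);                   % number of stored pairs
  alpha = zeros(m,1);  rho = zeros(m,1);
  for i = m:-1:1                       % step 1: backward loop
    rho(i)   = 1/(Y(:,i)'*S(:,i));
    alpha(i) = rho(i)*(S(:,i)'*q);
    q        = q - alpha(i)*Y(:,i);
  end
  r = H0*q;                            % step 2
  for i = 1:m                          % step 3: forward loop
    beta = rho(i)*(Y(:,i)'*r);
    r    = r + (alpha(i) - beta)*S(:,i);
  end
end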
9.4. Problems
The problem
$$ \min\ \|B - A\|_F \quad\text{subject to } B = B^T $$
is solved by $B = \tfrac12(A + A^T)$.
References
[DM77] J. E. Dennis, Jr. and J. J. Moré. Quasi–Newton methods, motivation and theory. SIAM
Review, 19:46–89, 1977.
[DS83] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. Prentice-Hall, Englewood Cliffs, N. J, 1983. Republished
as [DS96].
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/10.
1137/1.9781611971200, doi:10.1137/1.9781611971200.
[JS04] F. Jarre and J. Stoer. Optimierung. Springer Verlag, Berlin, Heidelberg, New-York, 2004.
[MS79] H. Matthies and G. Strang. The solution of nonlinear finite element equations. Internat.
J. Numer. Methods Engrg., 14:1613–1626, 1979.
[Noc80] J. Nocedal. Updating quasi-Newton matrices with limited storage. Math. Comp.,
35(151):773–782, 1980.
371
Chapter
10
Newton’s Method
10.1 Derivation of Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
10.2 Local Q-Quadratic Convergence of Newton’s Method . . . . . . . . . . . . . . . . 375
10.3 Modifications of Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . 376
10.3.1 Divided Difference Newton Methods . . . . . . . . . . . . . . . . . . . . 376
10.3.2 The Chord Method and the Shamanskii Method . . . . . . . . . . . . . . . 377
10.3.3 Inexact Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
10.4 Truncation of the Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
10.5 Newton’s Method and Fixed Point Iterations . . . . . . . . . . . . . . . . . . . . . 380
10.6 Kantorovich and Mysovskii Convergence Theorems . . . . . . . . . . . . . . . . . 381
10.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
Proof: Apply the fundamental theorem of calculus to the functions φi (t) = Fi (y + t(x − y)), i =
1, . . . , n, on [0, 1].
and
$$ \frac{1}{2\|F'(x_*)^{-1}\|}\|x - x_*\| \le \|F(x)\| \le 2\|F'(x_*)\|\,\|x - x_*\|. \tag{10.4} $$
$$ \lim_{k\to\infty}x_k = x_*, $$
Proof: Let $\varepsilon_1$ be the parameter determined by Lemma 10.2.2 and let $\sigma\in(0,1)$. We will show by induction that if
$$ \varepsilon = \min\{\varepsilon_1,\ \sigma/(\|F'(x_*)^{-1}\|L)\} $$
and $\|x_0 - x_*\| < \varepsilon$, then
$$ \|x_{k+1} - x_*\| \le L\|F'(x_*)^{-1}\|\,\|x_k - x_*\|^2 < \sigma\|x_k - x_*\| < \varepsilon \tag{10.5} $$
for all iterates $x_k$.
If $\|x_0 - x_*\| < \varepsilon$, then
$$ x_1 - x_* = x_0 - x_* - F'(x_0)^{-1}F(x_0) = F'(x_0)^{-1}\int_0^1\bigl(F'(x_0) - F'(x_* + t(x_0 - x_*))\bigr)(x_0 - x_*)\,dt. $$
Using (10.3) and the Lipschitz continuity of $F'$ we obtain
$$ \|x_1 - x_*\| \le 2L\|F'(x_*)^{-1}\|\,\|x_0 - x_*\|^2/2 = L\|F'(x_*)^{-1}\|\,\|x_0 - x_*\|^2 < \sigma\|x_0 - x_*\| < \varepsilon. $$
This proves (10.5) for $k = 0$. The induction step can be proven analogously and is therefore omitted.
Since $\sigma < 1$ and
$$ \|x_{k+1} - x_*\| < \sigma\|x_k - x_*\| < \dots < \sigma^{k+1}\|x_0 - x_*\|, $$
we find that $\lim_{k\to\infty}x_k = x_*$. The q-quadratic convergence rate follows from (10.5) with $c = L\|F'(x_*)^{-1}\|$.
One can easily modify the results in Section 5.2.3 to analyze Newton’s method for systems of
nonlinear equations with inexact function evaluations.
For k = 0, . . .
Compute and factor F 0 (x km ).
For j = 0, . . . , m − 1
Check truncation criteria.
Solve F 0 (x km )s km+ j = −F (x km+ j ).
Set x km+ j+1 = x km+ j + s km+ j .
End
End
A special case of the Shamanskii Method is the Chord Method. Here the Jacobian $F'(x_k)$ is computed and factored only once in the initial iteration. This case is included in the previous algorithm if we formally set $m = \infty$. For $m = 1$ we obtain Newton's method. The convergence rates for the Shamanskii method are derived in [Pol97, Sec. 1.4.5]. These methods can be viewed as Newton-like iterations of the form
$$ x_{k+1} = x_k - A_k^{-1}F(x_k). $$
For $k = 0,\dots$
  Check truncation criteria.
  Compute $s_k$ such that $\|F'(x_k)s_k + F(x_k)\| \le \eta_k\|F(x_k)\|$.
  Set $x_{k+1} = x_k + s_k$.
End
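A hedged Matlab sketch of this inexact Newton iteration, with GMRES as the inner solver and a simple forcing-term choice that is our assumption rather than the notes':

% Hedged sketch of the inexact Newton method: the Newton system is solved only
% up to the relative residual eta_k (here with unpreconditioned GMRES).
function x = inexact_newton(F, Fprime, x, tol, maxit)
  for k = 1:maxit
    Fx = F(x);
    if norm(Fx) <= tol
      return;
    end
    eta = min(0.5, sqrt(norm(Fx)));            % forcing term (an assumption)
    s   = gmres(Fprime(x), -Fx, [], eta, numel(x));
    x   = x + s;
  end
end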
The following theorem analyzes the convergence of the inexact Newton method. It mirrors Theorem 5.3.1.
Theorem 10.3.3 Let $D\subset\mathbb{R}^n$ be an open set and let $F : D\to\mathbb{R}^n$ be differentiable on $D$ with $F'\in{\rm Lip}_L(D)$. Moreover, let $x_*\in D$ be a root of $F$ and let $F'(x_*)$ be nonsingular.
If $4\eta\|F'(x_*)^{-1}\|\,\|F'(x_*)\| < 1$ and if $\{\eta_k\}$ satisfies $0 < \eta_k \le \eta$, then for all $\sigma\in(4\eta\|F'(x_*)^{-1}\|\,\|F'(x_*)\|, 1)$ there exists an $\varepsilon > 0$ such that the inexact Newton method 10.3.2 with starting point $x_0\in B_\varepsilon(x_*)$ generates iterates $x_k$ which converge to $x_*$,
$$ \lim_{k\to\infty}x_k = x_*, $$
and which obey
$$ \|x_{k+1} - x_*\| \le L\|F'(x_*)^{-1}\|\,\|x_k - x_*\|^2 + 4\eta_k\|F'(x_*)^{-1}\|\,\|F'(x_*)\|\,\|x_k - x_*\| \le \sigma\|x_k - x_*\| $$
for all $k$.
Proof: Let $\varepsilon_1 > 0$ be the parameter given by Lemma 10.2.2. Furthermore, let $\sigma\in(4\eta\|F'(x_*)^{-1}\|\,\|F'(x_*)\|, 1)$ be arbitrary and let
$$ \varepsilon = \min\bigl\{\varepsilon_1,\ 2\bigl(\sigma - 4\eta\|F'(x_*)^{-1}\|\,\|F'(x_*)\|\bigr)/(L\|F'(x_*)^{-1}\|)\bigr\}. $$
We set $r_k = -F(x_k) - F'(x_k)s_k$. If $\|x_k - x_*\| < \varepsilon$, then
$$ x_{k+1} - x_* = x_k - x_* + s_k = x_k - x_* - F'(x_k)^{-1}\bigl[F(x_k) + r_k\bigr]
= x_k - x_* - F'(x_k)^{-1}\bigl[F(x_k) - F(x_*)\bigr] - F'(x_k)^{-1}r_k
= F'(x_k)^{-1}\int_0^1\bigl[F'(x_k) - F'(x_* + t(x_k - x_*))\bigr](x_k - x_*)\,dt - F'(x_k)^{-1}r_k. $$
Taking norms yields
$$ \|x_{k+1} - x_*\| \le \frac{L}{2}\|F'(x_k)^{-1}\|\,\|x_k - x_*\|^2 + \eta_k\|F'(x_k)^{-1}\|\,\|F(x_k)\|
\le L\|F'(x_*)^{-1}\|\,\|x_k - x_*\|^2 + 4\eta_k\|F'(x_*)^{-1}\|\,\|F'(x_*)\|\,\|x_k - x_*\|
\le \bigl(L\|F'(x_*)^{-1}\|\,\|x_k - x_*\| + 4\eta\|F'(x_*)^{-1}\|\,\|F'(x_*)\|\bigr)\|x_k - x_*\|
\le \sigma\|x_k - x_*\|. $$
if we set $G(x) = x - F'(x)^{-1}F(x)$. Convergence of the fixed point iteration (10.8) can be proven if $D\subset\mathbb{R}^n$ and if $G : D\to D$ is a contraction mapping, i.e., if there exists $\gamma < 1$ such that
$$ \|G(x) - G(y)\| \le \gamma\|x - y\| \qquad \forall\, x, y\in D. $$
Theorem 10.5.1 (Banach Fixed Point Theorem) Let $D\subset\mathbb{R}^n$ be closed and let $G : D\to\mathbb{R}^n$ be a contraction mapping on $D$ such that $G(x)\in D$ for all $x\in D$. There exists a unique fixed point $x_*$ of $G$ in $D$, and for all $x_0\in D$ the sequence $\{x_k\}$ generated by the fixed point iteration (10.8) converges q-linearly with q-factor $\gamma$ to the fixed point $x_*$, i.e.,
$$ \|x_{k+1} - x_*\| \le \gamma\|x_k - x_*\|. $$
Proof: Consider the sequence $\{x_k\}$ generated by $x_{k+1} = G(x_k)$ with $x_0\in D$. First note that
$$ \|x_{k+1} - x_k\| = \|G(x_k) - G(x_{k-1})\| \le \gamma\|x_k - x_{k-1}\| \le \dots \le \gamma^k\|x_1 - x_0\| $$
and
$$ \|x_{k+\ell} - x_k\| \le \sum_{i=1}^{\ell}\|x_{k+i} - x_{k+i-1}\| \le \sum_{i=1}^{\ell}\gamma^{k+i-1}\|x_1 - x_0\| = \gamma^k\frac{1-\gamma^\ell}{1-\gamma}\|x_1 - x_0\| \le \gamma^k\frac{1}{1-\gamma}\|x_1 - x_0\| \qquad \forall\, k, \ell. $$
Thus, given any $\epsilon > 0$,
$$ \|x_{k+\ell} - x_k\| < \epsilon \qquad \forall\, k > \ln\Bigl(\frac{(1-\gamma)\epsilon}{\|x_1 - x_0\|}\Bigr)\Big/\ln\gamma,\ \ \ell \ge 1. $$
Thus the sequence $\{x_k\}$ is a Cauchy sequence and therefore has a limit, $\lim_{k\to\infty}x_k = x_*$. By continuity of $G$,
$$ x_* = \lim_{k\to\infty}x_k = G\bigl(\lim_{k\to\infty}x_k\bigr) = G(x_*), $$
which means the limit is a fixed point. The q-linear convergence of the sequence follows from $\|x_{k+1} - x_*\| = \|G(x_k) - G(x_*)\| \le \gamma\|x_k - x_*\|$. If $y_*$ is another fixed point of $G$ in $D$, then
$$ \|x_* - y_*\| = \|G(x_*) - G(y_*)\| \le \gamma\|x_* - y_*\|, $$
which, since $\gamma < 1$, implies $x_* = y_*$, i.e., the fixed point is unique.
Proof: For an arbitrary $\gamma$ with $\|G'(x_*)\| < \gamma < 1$ choose $\varepsilon$ so that $B_\varepsilon(x_*)\subset O$ and
$$ \|G'(x)\| \le \gamma \qquad \forall\, x\in B_\varepsilon(x_*). $$
Application of the Banach Fixed Point Theorem with $D = B_\varepsilon(x_*)$ implies the desired result.
We can use the previous result applied to $G(x) = x - F'(x)^{-1}F(x)$ to prove the local q-superlinear convergence of Newton's method. See Problem 10.9.
$$ \|F'(x_0)^{-1}\| \le \beta, \qquad \|F'(x_0)^{-1}F(x_0)\| \le \eta. $$
Define $\alpha = L\beta\eta$. If $\alpha \le 1/2$ and $r \ge r_0 \equiv \bigl(1 - \sqrt{1-2\alpha}\bigr)/(\beta L)$, then the sequence $\{x_k\}$ produced by Newton's method is well defined and converges to $x_*$, a unique zero of $F$ in $\overline{B_{r_0}(x_0)}$, and
$$ \|x_k - x_*\| \le (2\alpha)^{2^k}\,\frac{\eta}{\alpha}, \qquad k = 0, 1, \dots \tag{10.9} $$
For a proof of the Kantorovich Theorem see e.g. [KA64], [OR00, Sec. 12], or [Den71].
It is important to notice that the Kantorovich Theorem establishes the existence and local uniqueness of a solution of $F(x) = 0$. It only requires smoothness properties of $F$ on $B_r(x_0)$ and estimates of $\|F'(x_0)^{-1}\|$ and $\|F'(x_0)^{-1}F(x_0)\|$ at a single point $x_0$. This aspect of the Kantorovich Theorem is useful in many situations. See also Problem 10.7v. On the other hand, the Kantorovich Theorem only predicts r-quadratic convergence of the iterates, which is inferior to the q-quadratic convergence predicted by Theorem 10.2.3.
An important property of Newton's method is its scale invariance. Consider the differentiable function $F : \mathbb{R}^n\to\mathbb{R}^n$. Suppose we transform the variables $x$ and $F(x)$ by
$$ \hat x = D_x x, \qquad \hat F(\hat x) = D_F F(x), $$
where $D_F, D_x$ are nonsingular matrices. Instead of solving $F(x) = 0$, we solve the equivalent system $\hat F(\hat x) = 0$, where
$$ \hat F(\hat x) = D_F F(D_x^{-1}\hat x) = D_F F(x). $$
Let $\{\hat x_k\}$ be the sequence of Newton iterates for the function $\hat F$ with starting value $\hat x_0 = D_x x_0$. It is not difficult to show, see Problem 10.3, that for all $k$
$$ \hat x_k = D_x x_k. $$
Thus, Newton’s method is invariant with respect to scaling of the function F. Moreover, if the
initial iterate is scaled by the matrix D x , then all subsequent Newton iterates are scaled by the same
matrix. The invariance of Newton’s method with respect to the scaling of the function F is not
reflected in the previous convergence Theorems 10.2.3 or 10.6.1. The Lipschitz constant L depends
on the choice of the scaling matrix DF . Thus, if the scaling matrix DF is varied, the convergence
Theorems 10.2.3 and 10.6.1 predict a different convergence behavior, although the previous analysis
shows that Newton’s method does produce the same iterates, i.e., does not change. Therefore, affine
invariant convergence theorems have been introduced in [DH79] and slightly refined in [Boc88].
See also [Deu04]. The following theorem is due to [DH79], [Boc88] and is an extension of the
Newton–Mysovskii Theorem in, e.g., [KA64], [OR00, Thm. 12.4.6].
If $x_0\in D$ is such that
$$ \|F'(x_0)^{-1}F(x_0)\| \le \alpha $$
and
$$ h \stackrel{\rm def}{=} \alpha\omega/2 < 1, $$
then the Newton iterates $x_k$ remain in the ball of radius $\alpha\sum_{j=0}^{\infty}h^{2^j-1} \le \alpha/(1-h)$ around $x_0$, converge to a zero $x_*$ of $F$, and satisfy
$$ \|x_k - x_*\| \le \sigma_k\,\|x_k - x_{k-1}\|^2, \tag{10.12} $$
where
$$ \sigma_k = (\omega/2)\sum_{j=0}^{\infty}h^{2^k(2^j-1)} \le \frac{\omega/2}{1 - h^{2^k}}. $$
$$ x_{k+1} - x_k = -F'(x_k)^{-1}F(x_k) = -F'(x_k)^{-1}\bigl(F(x_k) - F(x_{k-1}) - F'(x_{k-1})(x_k - x_{k-1})\bigr)
= -F'(x_k)^{-1}\int_0^1\bigl[F'(x_{k-1} + t(x_k - x_{k-1})) - F'(x_{k-1})\bigr](x_k - x_{k-1})\,dt. $$
Next, we prove
$$ \|x_k - x_0\| \le \alpha\sum_{j=0}^{k-1}h^{2^j-1} \tag{10.13} $$
by induction. For $k = 1$,
$$ \|x_1 - x_0\| = \|F'(x_0)^{-1}F(x_0)\| \le \alpha. $$
Now, assume that (10.13) holds for $k$. We have $\|x_{k+1} - x_0\| \le \|x_{k+1} - x_k\| + \|x_k - x_0\|$, where, by the induction hypothesis,
$$ \|x_k - x_0\| \le \alpha\sum_{j=0}^{k-1}h^{2^j-1} $$
and, by (10.11),
$$ \|x_{k+1} - x_k\| \le (\omega/2)\|x_k - x_{k-1}\|^2 \le \prod_{j=0}^{k-1}(\omega/2)^{2^j}\,\underbrace{\|x_1 - x_0\|^{2^k}}_{\le\,\alpha^{2^k}} \le (\omega/2)^{2^k-1}\alpha^{2^k-1}\,\alpha = h^{2^k-1}\alpha. \tag{10.14} $$
Thus,
$$ \|x_{k+1} - x_0\| \le \|x_{k+1} - x_k\| + \|x_k - x_0\| \le \alpha h^{2^k-1} + \alpha\sum_{j=0}^{k-1}h^{2^j-1} = \alpha\sum_{j=0}^{k}h^{2^j-1}, $$
$$ \|x_{k+l} - x_k\| \le \sum_{j=0}^{l-1}\|x_{k+j+1} - x_{k+j}\| \le \sum_{j=0}^{l-1}\frac{\omega}{2}\|x_{k+j} - x_{k+j-1}\|^2. \tag{10.15} $$
Now we apply (10.11) to the term $\|x_{k+j} - x_{k+j-1}\|^2$ and use (10.14) to estimate
$$
\begin{aligned}
\|x_{k+j} - x_{k+j-1}\|^2 &\le (\omega/2)^2\|x_{k+j-1} - x_{k+j-2}\|^4 \le \dots \le \prod_{i=1}^{j}(\omega/2)^{2^i}\,\|x_k - x_{k-1}\|^{2^{j+1}} \\
&= (\omega/2)^{2^{j+1}-2}\,\|x_k - x_{k-1}\|^{2^{j+1}-2}\,\|x_k - x_{k-1}\|^2 \\
&\le (\omega/2)^{2^{j+1}-2}\bigl(h^{2^{k-1}-1}\alpha\bigr)^{2^{j+1}-2}\,\|x_k - x_{k-1}\|^2
= \Bigl(\frac{\alpha\omega}{2}h^{2^{k-1}-1}\Bigr)^{2^{j+1}-2}\|x_k - x_{k-1}\|^2 \\
&= h^{2^{k-1}(2^{j+1}-2)}\,\|x_k - x_{k-1}\|^2 = h^{2^k(2^j-1)}\,\|x_k - x_{k-1}\|^2.
\end{aligned}
$$
This yields $F(x_*) = 0$.
The previous theorem and the Kantorovich theorem provide existence results. Theorem 10.6.2 is affine invariant, but one has to assume invertibility of $F'(x)$ on $D$. In the Kantorovich Theorem 10.6.1 the invertibility of $F'(x_k)$ is one of the results. Note that Theorem 10.6.2 does not state the uniqueness of the zero $x_*$. Like the Kantorovich theorem, the previous theorem establishes the r-quadratic convergence of Newton's method.
10.7. Problems
ii. Let {x k } be the iterates generated by Newton’s method. Show that f (x k ) ≥ 0 for all k ≥ 1.
(Hint: Use Theorem 4.4.3 i.)
iv. x.
$$ x_{k+1} = x_k - A_k^{-1}F(x_k), \tag{10.17} $$
where $A_k\in\mathbb{R}^{n\times n}$ is an invertible matrix which approximates $F'(x_k)$. Prove the following theorem.
and
$$ \|A_k^{-1}\bigl(A_k - F'(x_k)\bigr)\| \le \alpha_k \le \alpha < 1, $$
then there exists an $\varepsilon > 0$ such that the generalized Newton method (10.17) with starting point $x_0\in B_\varepsilon(x_*)$ generates iterates $x_k$ which converge to $x_*$,
$$ \lim_{k\to\infty}x_k = x_*, $$
$$ \hat F(\hat x) = D_F F(D_x^{-1}\hat x) = D_F F(x). $$
Here $D_x$ and $D_F$ are nonsingular matrices. Suppose that for a given starting vector $x_0$ the sequence $\{x_k\}$ of Newton iterates for the function $F$ is well defined. Let $\{\hat x_k\}$ be the sequence of Newton iterates for the function $\hat F$ with starting value $\hat x_0 = D_x x_0$. Show that for all $k$
$$ \hat x_k = D_x x_k. $$
i. Prove that the convergence rate is at least q-linear and derive the q-linear factor
$$ \limsup_{k\to\infty}\frac{\|x_{k+1} - x_*\|}{\|x_k - x_*\|}. $$
ii. Prove that the convergence rate is at least superlinear if and only if $\lim_{k\to\infty}\alpha_k = 1$.
iii. Let $\alpha_k\ne1$ for all $k$. Is it possible for $\{x_k\}$ to converge quadratically? Prove your assertion.
is used to solve exit distribution problems in radiative transfer. The goal is to find a function $H$ such that (10.18) is satisfied.
To solve the problem numerically, we discretize the integrals by the composite mid-point rule,
$$ \int_0^1 g(\nu)\,d\nu \approx \frac1N\sum_{j=1}^N g(\nu_j), $$
i. Solve this system numerically using Newton's method with finite difference Jacobian approximations.
where $H^k$ denotes the $k$th iterate, or if the number of iterations exceeds 20.
– Perform two runs, one with $c = 0.9$ and the other with $c = 0.9999$. In both runs use $N = 100$ subintervals for the discretization of the integral and the starting value $H_j^0 = 1$, $j = 1,\dots,N$.
– Output a table that shows the iteration number $k$, $\|F(H^k)\|_\infty$, $\|F(H^k)\|_\infty/\|F(H^{k-1})\|_\infty$, and $\|H^k - H^{k-1}\|_\infty$.
– Turn in the program source codes, the tables generated by the program, and a plot of the approximate solution $H(\mu)$ of (10.18).
– different starting values ((10.18) and (10.19) have two different solutions, only one of which is physically meaningful)
– different parameters $t$ in the finite difference approximations
Problem 10.6 Let $A\in\mathbb{R}^{n\times n}$ and assume there exist a diagonal matrix $D\in\mathbb{R}^{n\times n}$ and a nonsingular matrix $V\in\mathbb{R}^{n\times n}$ such that $A = VDV^{-1}$.
The problem of finding an eigenvalue $\lambda_*$ and a corresponding eigenvector $v_*$ with unit length of $A$ can be formulated as a root finding problem
$$ F(v,\lambda) = 0, \tag{10.20} $$
where $F : \mathbb{R}^n\times\mathbb{R}\to\mathbb{R}^n\times\mathbb{R}$ is given by
$$ F(v,\lambda) = \begin{pmatrix} Av - \lambda v \\ \tfrac12 v^T v - \tfrac12 \end{pmatrix}. $$
(i) Formulate Newton's method for the computation of $(v_*,\lambda_*)$.
(ii) Show that the Jacobian $F'(v_*,\lambda_*)$ is nonsingular if $\lambda_*$ is a simple eigenvalue.
(iii) Show that the Jacobian $F'(\cdot,\cdot)$ is Lipschitz continuous.
(iv) Prove the local q-quadratic convergence of Newton's method for the solution of (10.20) under the assumption that $\lambda_*$ is a simple eigenvalue.
(v) Apply this method to compute an eigenvalue of
$$ A = \frac{1}{h^2}\begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}\in\mathbb{R}^{n\times n}, \qquad h = 1/(n+1). $$
Use $n = 100$ and starting values $v_0 = (1,\dots,1)^T/\sqrt{n}$, $\lambda_0 = 1$.
Stop the iteration when $\|F(v,\lambda)\|_2 < 10^{-6}/n$.
Output a table that shows the iteration number $k$ and $\|F(v_k,\lambda_k)\|_2$.
What is the eigenvalue $\lambda_k$? What is the eigenvector (plot using x = (h:h:1-h); plot(x,v);)?
See Problem 1.2 for eigenvalues and eigenvectors of symmetric tridiagonal matrices.
Problem 10.7 The implicit Euler method for solving large systems of ordinary differential equations
$$ \frac{d}{dt}y(t) = G(y(t), t), \qquad t\in[0,T], $$
partitions the time interval $[0,T]$ into smaller time intervals $0 = t_0 < t_1 < \dots < t_I = T$ and at each time step $t_{i+1}$ computes $y_{i+1}\approx y(t_{i+1})$ as the solution of
$$ \frac{y_{i+1} - y_i}{\Delta t} = G(y_{i+1}, t_{i+1}), \tag{10.21} $$
where $y_i\approx y(t_i)$ is given from the previous time step and $\Delta t = t_{i+1} - t_i > 0$.
The equation (10.21) has to be solved for $y_{i+1}\in\mathbb{R}^n$. This problem is concerned with Newton's method for the solution of this system. The indices $i, i+1$ refer to the time steps in the implicit Euler method; they do not have anything to do with Newton's method. The Newton iterates are $y_{i+1}^k$, $k = 0, 1,\dots$
ii. Let $G_y(y,t)$ denote the Jacobian of $G(y,t)$ with respect to $y\in\mathbb{R}^n$ and assume that $\|G_y(y,t)\| < M$ for all $y\in\mathbb{R}^n$ and all $t\in\mathbb{R}$.
iii. Assume that $\|G_y(y,t) - G_y(z,t)\| < L_G\|y - z\|$ for all $y, z\in\mathbb{R}^n$ and all $t\in\mathbb{R}$.
Show that the Jacobian $F'$ of $F$ satisfies $\|F'(y) - F'(z)\| < L\|y - z\|$ for all $y, z\in\mathbb{R}^n$. What is $L$?
iv. Under the assumptions in ii. and iii. Newton's method for the solution of $F(y_{i+1}) = 0$ can be shown to converge locally q-quadratically. Moreover, if $y_{i+1}^k$, $k\in\mathbb{N}$, denote the Newton iterates, then one can show that
$$ \|y_{i+1}^{k+1} - y_{i+1}^*\| \le L\,\|(F'(y_{i+1}^*))^{-1}\|\,\|y_{i+1}^k - y_{i+1}^*\|^2 $$
provided $y_{i+1}^k$ is sufficiently close to $y_{i+1}^*$.
Use your results in ii. and iii. to show why Newton's method will perform better the smaller $\Delta t$ is.
v. In iv. we have assumed the existence of $y_{i+1}^*$ such that $F(y_{i+1}^*) = 0$. Use the estimates in ii. and iii. to establish the existence of such a solution using the Kantorovich Theorem 10.6.1. (Hint: The role of $x_0$ in the Kantorovich Theorem is played by $y_i$.)
Hint: Use Banach's lemma, Lemma 5.2.2, to show that $F'(x)$ is invertible for all $x\in B_r(x_*)$ and use the arguments in the proof of Banach's lemma to derive the bound $\|F'(x)^{-1}F'(x_*)\| \le 3$ for all $x\in B_r(x_*)$.
Problem 10.9 Let $F$ be twice differentiable in the open set $D\subset\mathbb{R}^n$ and let $F'(x)$ be invertible for all $x\in D$.
• Use Theorem 10.5.2 to establish the local q-superlinear convergence of Newton's method.
$$ g'(x_*) = \dots = g^{(p-1)}(x_*) = 0, $$
then there exists $\varepsilon > 0$ such that $x_*$ is the only fixed point in $B_\varepsilon(x_*)$ and the fixed point iteration converges to $x_*$ for any $x_0\in B_\varepsilon(x_*)$ with q-order $p$ and
$$ \lim_{k\to\infty}\frac{|x_{k+1} - x_*|}{|x_k - x_*|^p} = \frac{|g^{(p)}(x_*)|}{p!}. $$
$$ f(x_*) = f'(x_*) = \dots = f^{(m-1)}(x_*) = 0, \qquad f^{(m)}(x_*) \ne 0. $$
ii. Suppose that $f'(x)\ne0$ for all $x\in D\setminus\{x_*\}$. Show that the Newton iteration
$$ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} $$
iv. Construct an example which shows that the iteration (10.24) may not converge if the multiplicity $m$ is overestimated.
Problem 10.13 Let $A\in\mathbb{R}^{n\times n}$ be nonsingular and let $\|\cdot\|$, $|||\cdot|||$ be a vector and a matrix norm such that $\|Mv\| \le |||M|||\,\|v\|$ and $|||MN||| \le |||M|||\,|||N|||$ for all $M, N\in\mathbb{R}^{n\times n}$, $v\in\mathbb{R}^n$.
Schulz's method for computing the inverse of $A$ generates a sequence of matrices $\{X_k\}$ via the iteration
$$ X_{k+1} = 2X_k - X_k A X_k. $$
ii. Show that $\{X_k\}$ converges q-quadratically to the inverse of $A$ for all $X_0$ with $|||I - AX_0||| < 1$.
iii. Show that $\{X_k\}$ converges q-quadratically for all $X_0 = \alpha A^T$ with $\alpha\in(0, 2/\lambda_{\max})$, where $\lambda_{\max}$ is the largest eigenvalue of $AA^T$.
Note: The assumption that $A$ is invertible can be relaxed and convergence of $X_k$ to the generalized inverse can be shown. See [BI66].
Problem 10.14 Let $A\in\mathbb{R}^{n\times n}$ be nonsingular and let $G : \mathbb{R}^n\to\mathbb{R}^n$ be Lipschitz continuous with Lipschitz constant $L > 0$, i.e., $\|G(x) - G(y)\|_2 \le L\|x - y\|_2$ for all $x, y\in\mathbb{R}^n$.
Let $x_*$ be a solution of
$$ Ax + G(x) = 0. \tag{10.25} $$
a. Show that if $L\|A^{-1}\|_2 < 1$, then (10.25) has at most one solution.
$$ f(x_k + \alpha_k s_k) \le f(x_k) + c\,\alpha_k\nabla f(x_k)^T s_k $$
is satisfied for $c\in(0,1)$ independent of $k$. Suppose that the step lengths $\alpha_k$ are bounded away from zero, i.e., that $\alpha_k \ge \alpha > 0$ for all $k$. Show that $\lim_{k\to\infty}Ax_k + G(x_k) = 0$.
[Den71] J. E. Dennis, Jr. . Toward a unified convergence theory for Newton–like methods. In L. B.
Rall, editor, Nonlinear Functional and Applications, pages 425–472. Academic Press,
New-York, 1971.
[DES82] R. S. Dembo, S. C. Eisenstat, and T. Steihaug. Inexact Newton methods. SIAM J. Numer.
Anal., 19:400–408, 1982.
[Deu04] P. Deuflhard. Newton Methods for Nonlinear Problems. Affine Invariance and Adaptive
Algorithms, volume 35 of Springer Series in Computational Mathematics. Springer-
Verlag, Berlin, 2004.
[DH79] P. Deuflhard and G. Heindl. Affine invariant convergence theorems for Newton’s method
and extensions to related methods. SIAM J. Numer. Anal., 16:1–10, 1979.
[DS83] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. Prentice-Hall, Englewood Cliffs, N. J, 1983. Republished
as [DS96].
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/10.
1137/1.9781611971200, doi:10.1137/1.9781611971200.
[KA64] L.V. Kantorovich and G.P. Akilov. Functional Analysis in Normed Spaces. Pergamon
Press, New York, 1964.
[Kel95] C. T. Kelley. Iterative methods for linear and nonlinear equations, volume 16 of Fron-
tiers in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1995. URL: https://doi.org/10.1137/1.9781611970944, doi:10.1137/1.9781611970944.
F : Rn → Rn .
In the multidimensional case we can try to generalize this approach as follows: Given two iterates
x k , x k+1 ∈ Rn we try to find a nonsingular matrix Bk+1 ∈ Rn×n which satisfies the so-called secant
equation
Bk+1 (x k+1 − x k ) = F (x k+1 ) − F (x k ). (11.1)
Then we compute the new iterate as follows
In the one-dimensional case $b_{k+1}$ is uniquely determined from the secant equation. In the multidimensional case this is not true. For example, if $n = 2$, $x_{k+1} - x_k = (1,1)^T$, and $F(x_{k+1}) - F(x_k) = (1,2)^T$, then the matrices
$$ \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 1 \\ 2 & 0 \end{pmatrix} $$
satisfy (11.1).
Therefore we choose $B_{k+1}\in\mathbb{R}^{n\times n}$ as the solution of$^1$
$$ \min\ \|B - B_k\| \quad\text{s.t. } B(x_{k+1}-x_k) = F(x_{k+1}) - F(x_k), $$
or, with $s_k = x_{k+1} - x_k$ and $y_k = F(x_{k+1}) - F(x_k)$,
$$ \min\ \|B - B_k\| \quad\text{s.t. } Bs_k = y_k. \tag{11.2} $$
This can be interpreted as follows: $B_{k+1}$ should satisfy the secant equation (11.1) and $B_{k+1}$ should be as close to the old matrix $B_k$ as possible, to preserve as much information contained in $B_k$ as possible.
and
$$ \Bigl\|\frac{vv^T}{v^Tv}\Bigr\| = 1, \qquad v\in\mathbb{R}^n,\ v\ne0. $$
If $s_k\ne0$ then a solution to (11.2) is given by
$$ B_{k+1} = B_k + \frac{(y_k - B_k s_k)s_k^T}{s_k^T s_k}. \tag{11.3} $$
The matrix $B_{k+1}$ is the unique solution to (11.2) if $\|\cdot\|$ is the Frobenius norm.
$^1$Recall that we always assume that the matrix norm is submultiplicative and that it is compatible with the vector norm.
The uniqueness for the case that $\|\cdot\|$ is the Frobenius norm follows from the strict convexity of the Frobenius norm and from the convexity of the set $\{B\in\mathbb{R}^{n\times n}\,|\, Bs_k = y_k\}$.
The pairs $(\|\cdot\|, |||\cdot|||) = (\|\cdot\|_2, \|\cdot\|_2)$ and $(\|\cdot\|, |||\cdot|||) = (\|\cdot\|_2, \|\cdot\|_F)$ satisfy the properties assumed in Lemma 11.1.1.
Lemma 11.1.2 Let $B_k$ be invertible. The matrix $B_{k+1}$ defined in (11.3) is invertible if and only if $s_k^T B_k^{-1}y_k \ne 0$.
Proof: The result follows immediately from the Sherman-Morrison-Woodbury Lemma 9.2.4.
For $k = 0,\dots$
  Solve $B_k s_k = -F(x_k)$ for $s_k$.
  Set $x_{k+1} = x_k + s_k$.
  Evaluate $F(x_{k+1})$.
  Check truncation criteria.
  Set $B_{k+1} = B_k + \dfrac{\bigl([F(x_{k+1}) - F(x_k)] - B_k s_k\bigr)s_k^T}{s_k^T s_k}$.
End
To start Broyden's method we need an initial guess $x_0$ for the root $x_*$ and an initial matrix $B_0\in\mathbb{R}^{n\times n}$. In practice one often chooses
$$ B_0 = F'(x_0), \quad\text{or}\quad B_0 = \gamma I, $$
where $\gamma$ is a suitable scalar. Other choices, for example finite difference approximations to $F'(x_0)$ or choices based on the specific structure of $F'$, are also used.
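A hedged Matlab sketch of this basic iteration with an explicitly stored matrix $B_k$ (suitable only for moderate $n$; the function name is ours):

% Hedged sketch of Broyden's method with the rank-one update (11.3).
function x = broyden(F, x, B, tol, maxit)
  Fx = F(x);
  for k = 1:maxit
    if norm(Fx) <= tol
      return;
    end
    s    = -B\Fx;                        % solve B_k s_k = -F(x_k)
    x    = x + s;                        % x_{k+1} = x_k + s_k
    Fnew = F(x);
    y    = Fnew - Fx;                    % y_k = F(x_{k+1}) - F(x_k)
    B    = B + ((y - B*s)*s')/(s'*s);    % Broyden update (11.3)
    Fx   = Fnew;
  end
end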
k k x k k2 kF (x k )k2 ks k k2
0 0.250000E + 01 0.875017E + 01
1 0.166594E + 01 0.207320E + 01 0.880545E + 00
2 0.147651E + 01 0.873418E + 00 0.192204E + 00
3 0.141033E + 01 0.381251E + 00 0.132189E + 00
4 0.141763E + 01 0.158635E + 00 0.155521E + 00
5 0.142386E + 01 0.429850E − 01 0.962019E − 01
6 0.141585E + 01 0.468140E − 02 0.104304E − 01
7 0.141438E + 01 0.607409E − 03 0.258315E − 02
8 0.141421E + 01 0.405145E − 05 0.628819E − 03
9 0.141421E + 01 0.272411E − 07 0.180577E − 05
10 0.141421E + 01 0.118219E − 10 0.115425E − 07
The convergence of Broyden’s method is characterized in the following theorem.
Theorem 11.1.5 Let the assumptions of Theorem 10.2.3 hold. There exists an $\varepsilon > 0$ such that if
$$ \|x_* - x_0\| \le \varepsilon \quad\text{and}\quad \|F'(x_*) - B_0\| \le \varepsilon, $$
then Broyden's method generates iterates $x_k$ which converge to $x_*$ and the convergence rate is q-superlinear, i.e., there exists a sequence $c_k$ with $\lim_{k\to\infty}c_k = 0$ such that
$$ \|x_* - x_{k+1}\| \le c_k\|x_* - x_k\| $$
for all $k$.
A detailed convergence analysis of Broyden's method can be found e.g. in [DS83] or in [Kel95].
Remark 11.1.6 Under suitable assumptions, the iterates $x_k$ in Broyden's method converge towards a zero $x_*$ of $F$. The Broyden matrices $B_k$, however, generally do not converge to the Jacobian $F'(x_*)$. See e.g. [DS83, Lemma 8.2.7].
Scaling of Variables
Suppose we transform the variables $x$ and $F(x)$ by
$$ \hat x = D_x x, \qquad \hat F(\hat x) = D_F F(D_x^{-1}\hat x). $$
Now, we perform the Broyden update for the scaled problem and then transform it back to the original formulation. This process is equivalent to using the update
$$ B_{k+1} = B_k + \frac{(y_k - B_k s_k)(D_x^T D_x s_k)^T}{s_k^T D_x^T D_x s_k} $$
in the original formulation, provided the starting matrix is $\hat B_0 = D_F B_0 D_x^{-1}$. See Problem 11.1.
This leads to the following lemma.
Lemma 11.2.1 Suppose that $k$ iterations of Broyden's method are applied to the problem $F(x) = 0$ with starting vector $x_0$ and initial matrix $B_0$ and that these iterations generate the Broyden matrices $B_0,\dots,B_k$. Suppose further that $k$ iterations of Broyden's method are applied to the problem $\hat F(x) = A^{-1}F(x) = 0$ with starting vector $x_0$ and initial matrix $\hat B_0 = A^{-1}B_0$ and that these iterations generate the Broyden matrices $\hat B_0,\dots,\hat B_k$. Then
$$ \hat B_i = A^{-1}B_i, \qquad i = 0,\dots,k. $$
This lemma is important, because it says that we can choose the initial Broyden matrix to be the identity matrix if we scale the function $F$ by $A^{-1} = B_0^{-1}$.
$$ B_{k+1} = B_k + \frac{F(x_{k+1})s_k^T}{s_k^T s_k} = B_k + u_k v_k^T, $$
where
$$ u_k = F(x_{k+1})/\|s_k\|_2, \qquad v_k = s_k/\|s_k\|_2. $$
If we set
$$ w_k = B_k^{-1}u_k/(1 + v_k^T B_k^{-1}u_k), $$
then
$$ B_{k+1}^{-1} = (I - w_k v_k^T)B_k^{-1} = (I - w_k v_k^T)(I - w_{k-1}v_{k-1}^T)B_{k-1}^{-1} = \dots = \prod_{i=0}^{k}(I - w_i v_i^T)\,B_0^{-1}. $$
Assuming $B_0 = I$ (cf. Lemma 11.2.1), the next step in Broyden's method is then
$$ s_{k+1} = -B_{k+1}^{-1}F(x_{k+1}) = -\Bigl(I + \frac{s_{k+1}s_k^T}{\|s_k\|_2^2}\Bigr)\underbrace{\prod_{i=0}^{k-1}\Bigl(I + \frac{s_{i+1}s_i^T}{\|s_i\|_2^2}\Bigr)}_{=\,B_k^{-1}}F(x_{k+1}). $$
Solving the previous equation for $s_{k+1}$ yields
$$ s_{k+1} = -\frac{B_k^{-1}F(x_{k+1})}{1 + s_k^T B_k^{-1}F(x_{k+1})/\|s_k\|_2^2}. \tag{11.8} $$
Note that by the Sherman-Morrison-Woodbury Lemma 9.2.4,
$$ 1 + s_k^T B_k^{-1}F(x_{k+1})/\|s_k\|_2^2 = 1 + v_k^T B_k^{-1}u_k \ne 0 $$
if and only if $B_{k+1}$ is nonsingular.
To implement Broyden's method in this form, we need to store the vectors $s_0,\dots,s_k$ to compute $B_k^{-1}F(x_k)$. If storage is limited and we can only store $L$ such vectors, then we have two choices. Either we can restart the Broyden algorithm after iteration $L$, or we can replace the oldest $s_{k-L}$ by $s_k$. Thus, in the second case we use the approximation
$$ B_k^{-1} \approx \prod_{i=k-L+1}^{k-1}\Bigl(I + \frac{s_{i+1}s_i^T}{\|s_i\|_2^2}\Bigr). $$
For $k = 0,\dots$
  Evaluate $F(x_k)$.
  Check truncation criteria.
  Solve $B_k s_k = -F(x_k)$:
    If $k = 0$, then $s_k = -F(x_k)$.
    If $k > 0$, then compute
$$ z = B_{k-1}^{-1}F(x_k) = \prod_{i=k-L}^{k-2}\Bigl(I + \frac{s_{i+1}s_i^T}{\|s_i\|_2^2}\Bigr)F(x_k) $$
    and set $s_k = -z/\bigl(1 + s_{k-1}^T z/\|s_{k-1}\|_2^2\bigr)$.
  Set $x_{k+1} = x_k + s_k$.
End
In the derivation of Algorithm 11.3.1, we have used that the Broyden update
$$ B_{k+1} = B_k + \frac{\bigl(F(x_{k+1}) - F(x_k) - B_k s_k\bigr)s_k^T}{s_k^T s_k} $$
for $s_k$ given by $B_k s_k = -F(x_k)$ is equal to
$$ B_{k+1} = B_k + \frac{F(x_{k+1})s_k^T}{s_k^T s_k}. $$
In some globalizations of Broyden's method, however, we compute the new iterate as $x_{k+1} = x_k + t_k s_k$ with $t_k\in(0,1]$ and use the Broyden update
$$ B_{k+1} = B_k + \frac{\bigl(F(x_{k+1}) - F(x_k) - B_k(x_{k+1}-x_k)\bigr)(x_{k+1}-x_k)^T}{\|x_{k+1}-x_k\|_2^2}
= B_k + \frac{\bigl(F(x_{k+1}) - (1-t_k)F(x_k)\bigr)s_k^T}{t_k\|s_k\|_2^2}. \tag{11.9} $$
In this case, (11.7) does not apply. However, we are still able to reproduce $B_{k+1}^{-1}$ from the vectors $s_0,\dots,s_{k+1}$.
We apply the Sherman-Morrison-Woodbury formula, Lemma 9.2.4, to the Broyden update (11.9),
$$ B_{k+1} = B_k + \frac{\bigl(F(x_{k+1}) - (1-t_k)F(x_k)\bigr)s_k^T}{t_k\|s_k\|_2^2} = B_k + u_k v_k^T, $$
where
$$ u_k = \bigl(F(x_{k+1}) - (1-t_k)F(x_k)\bigr)/(t_k\|s_k\|_2), \qquad v_k = s_k/\|s_k\|_2. $$
If we set
$$ w_k = B_k^{-1}u_k/(1 + v_k^T B_k^{-1}u_k) $$
and assume that $B_0 = I$, then
$$ B_{k+1}^{-1} = \prod_{i=0}^{k}(I - w_i v_i^T). $$
Similar to the previous calculations we find that
$$ B_k^{-1}u_k = \prod_{i=0}^{k-1}(I - w_i v_i^T)\bigl(F(x_{k+1}) - (1-t_k)F(x_k)\bigr)/(t_k\|s_k\|_2)
= \Bigl(\underbrace{\prod_{i=0}^{k-1}(I - w_i v_i^T)F(x_{k+1})}_{=\,z} - (1-t_k)\underbrace{\prod_{i=0}^{k-1}(I - w_i v_i^T)F(x_k)}_{=\,B_k^{-1}F(x_k)\,=\,-s_k}\Bigr)\Big/(t_k\|s_k\|_2)
= \bigl(z + (1-t_k)s_k\bigr)/(t_k\|s_k\|_2) $$
and
$$ w_k = \frac{1}{t_k\|s_k\|_2 + v_k^T z + (1-t_k)v_k^T s_k}\bigl(z + (1-t_k)s_k\bigr) = \frac{1}{\|s_k\|_2 + v_k^T z}\bigl(z + (1-t_k)s_k\bigr). $$
As before, we set
$$ \beta = \frac{1}{\|s_k\|_2 + v_k^T z}. $$
The next step in Broyden's method is given by
$$ s_{k+1} = -B_{k+1}^{-1}F(x_{k+1}) = -(I - w_k v_k^T)\prod_{i=0}^{k-1}(I - w_i v_i^T)F(x_{k+1}) = -(I - w_k v_k^T)z = -(z - w_k v_k^T z)
= -z + \beta\bigl(z + (1-t_k)s_k\bigr)\bigl(\beta^{-1} - \|s_k\|_2\bigr) = (1-t_k)s_k - \|s_k\|_2\, w_k. $$
Hence, for $k \ge 0$,
$$ w_k = -\frac{1}{\|s_k\|_2}\bigl(s_{k+1} - (1-t_k)s_k\bigr) $$
and
$$ B_{k+1}^{-1} = \prod_{i=0}^{k}\Bigl(I + \frac{(s_{i+1} - (1-t_i)s_i)s_i^T}{\|s_i\|_2^2}\Bigr). \tag{11.10} $$
As in the case $t_k = 1$, (11.10) depends on $s_{k+1}$, but $B_{k+1}^{-1}$ also defines $s_{k+1}$. Thus we have to solve
$$ s_{k+1} = -B_{k+1}^{-1}F(x_{k+1}) = -\Bigl(I + \frac{(s_{k+1} - (1-t_k)s_k)s_k^T}{\|s_k\|_2^2}\Bigr)B_k^{-1}F(x_{k+1})
= -B_k^{-1}F(x_{k+1}) - \frac{s_k^T B_k^{-1}F(x_{k+1})}{\|s_k\|_2^2}\,s_{k+1} + (1-t_k)\frac{s_k^T B_k^{-1}F(x_{k+1})}{\|s_k\|_2^2}\,s_k. $$
For $k = 0,\dots$
  Evaluate $F(x_k)$.
  Check truncation criteria.
  Solve $B_k s_k = -F(x_k)$:
    If $k = 0$, then $s_k = -F(x_k)$.
    If $k > 0$, then compute
$$ z = B_{k-1}^{-1}F(x_k) = \prod_{i=k-L}^{k-2}\Bigl(I + \frac{(s_{i+1} - (1-t_i)s_i)s_i^T}{\|s_i\|_2^2}\Bigr)F(x_k) $$
    and set
$$ s_k = -\frac{\|s_{k-1}\|_2^2}{\|s_{k-1}\|_2^2 + s_{k-1}^T z}\Bigl(z - \frac{(1-t_{k-1})s_{k-1}^T z}{\|s_{k-1}\|_2^2}\,s_{k-1}\Bigr). $$
  Compute $t_k\in(0,1]$.
  Set $x_{k+1} = x_k + t_k s_k$.
End
11.4. Problems
Problem 11.1 ([DS83, Problem 8.5.12]) Suppose we transform the variables $x$ and $F(x)$ by
$$ \hat x = D_x x, \qquad \hat F(\hat x) = D_F F(x), $$
where $D_F, D_x$ are nonsingular matrices, perform Broyden's method in the new variable and function space, and then transform back to the original variables. Show that this process is equivalent to using the update
$$ B_{k+1} = B_k + \frac{(y_k - B_k s_k)(D_x^T D_x s_k)^T}{s_k^T D_x^T D_x s_k}, $$
provided the starting matrix is $\hat B_0 = D_F B_0 D_x^{-1}$. Notice that the update is independent of the scaling $D_F$. Notice also that the new Jacobian $\hat F'(\hat x)$ is related to the old Jacobian by $\hat F'(\hat x) = D_F F'(x)D_x^{-1}$. See also Problem 10.3.
$$ F(x) = \begin{pmatrix} A_1 x + b_1 \\ F_2(x) \end{pmatrix}, $$
where $A_1\in\mathbb{R}^{m\times n}$, $m < n$. Suppose that Broyden's method is used to solve $F(x) = 0$, generating a sequence of Broyden matrices $B_0, B_1,\dots$. Let $B_k$ be partitioned into
$$ B_k = \begin{pmatrix} B_k^1 \\ B_k^2 \end{pmatrix}, \qquad B_k^1\in\mathbb{R}^{m\times n},\ B_k^2\in\mathbb{R}^{(n-m)\times n}. $$
[BB88] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J. Numer.
Anal., 8(1):141–148, 1988. URL: http://dx.doi.org/10.1093/imanum/8.1.
141, doi:10.1093/imanum/8.1.141.
[BC99] F. Bouttier and P. Courtier. Data assimilation concepts and methods. Technical report,
European Centre for Medium-Range Weather Forecasts (ECMWF), 1999. https:
//www.ecmwf.int/en/learning/education-material/lecture-notes
(accessed Nov. 23, 2017).
[Ben02] M. Benzi. Preconditioning techniques for large linear systems: a survey. J. Comput.
Phys., 182(2):418–477, 2002. URL: http://dx.doi.org/10.1006/jcph.2002.
7176, doi:10.1006/jcph.2002.7176.
[BGL05] M. Benzi, G. H. Golub, and J. Liesen. Numerical solution of saddle point problems.
In A. Iserles, editor, Acta Numerica 2005, pages 1–137. Cambridge University Press,
Cambridge, London, New York, 2005.
[Bjö96] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
[Bur40] J. M. Burgers. Application of a model system to illustrate some points of the statistical
theory of free turbulence. Nederl. Akad. Wetensch., Proc., 43:2–12, 1940.
[BW88] D. M. Bates and D. G. Watts. Nonlinear Regression Analysis and its Applications.
John Wiley and Sons, Inc., Somerset, New Jersey, 1988.
[Cra55] E. J. Craig. The n-step iteration procedures. Journal of Mathematics and Physics, 34:64–73, 1955.
[Den71] J. E. Dennis, Jr. Toward a unified convergence theory for Newton-like methods. In L. B. Rall, editor, Nonlinear Functional Analysis and Applications, pages 425–472. Academic Press, New York, 1971.
[DES82] R. S. Dembo, S. C. Eisenstat, and T. Steihaug. Inexact Newton methods. SIAM J.
Numer. Anal., 19:400–408, 1982.
[Deu04] P. Deuflhard. Newton Methods for Nonlinear Problems. Affine Invariance and Adaptive
Algorithms, volume 35 of Springer Series in Computational Mathematics. Springer-
Verlag, Berlin, 2004.
[DGE81] J. E. Dennis, D. M. Gay, and R. E. Welsch. Algorithm 573: NL2SOL, an adaptive nonlinear least-squares algorithm. TOMS, 7:369–383, 1981. Fortran code available from http://www.netlib.org/toms/573.
[DGW81] J. E. Dennis, D. M. Gay, and R. E. Welsch. An adaptive nonlinear least-squares
algorithm. TOMS, 7:348–368, 1981.
[DH79] P. Deuflhard and G. Heindl. Affine invariant convergence theorems for Newton’s
method and extensions to related methods. SIAM J. Numer. Anal., 16:1–10, 1979.
[DH95] P. Deuflhard and A. Hohmann. Numerical Analysis. A First Course in Scientific
Computation. Walter De Gruyter, Berlin, New York, 1995.
[DL02] Y.-H. Dai and L.-Z. Liao. R-linear convergence of the Barzilai and Borwein gradient
method. IMA J. Numer. Anal., 22(1):1–10, 2002. URL: http://dx.doi.org/10.
1093/imanum/22.1.1, doi:10.1093/imanum/22.1.1.
[DM77] J. E. Dennis, Jr. and J. J. Moré. Quasi–Newton methods, motivation and theory. SIAM
Review, 19:46–89, 1977.
[Dre90] S. E. Dreyfus. Artificial neural networks, back propagation, and the Kelley-Bryson
gradient procedure. J. Guidance Control Dynam., 13(5):926–928, 1990. URL: http:
//dx.doi.org/10.2514/3.25422, doi:10.2514/3.25422.
[DS83] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and Unconstrained Optimization. Prentice-Hall, Englewood Cliffs, NJ, 1983. Republished as [DS96].
[DS96] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Nonlinear Equations and
Unconstrained Optimization. SIAM, Philadelphia, 1996. URL: https://doi.org/
10.1137/1.9781611971200, doi:10.1137/1.9781611971200.
[Emb03] M. Embree. The tortoise and the hare restart GMRES. SIAM Rev., 45(2):259–266
(electronic), 2003.
[Esp81] J. H. Espenson. Chemical Kinetics and Reaction Mechanisms. McGraw-Hill, New York, 1981.
[ESW05] H. C. Elman, D. J. Silvester, and A. J. Wathen. Finite Elements and Fast Iterative
Solvers with Applications in Incompressible Fluid Dynamics. Oxford University Press,
Oxford, 2005.
[FA11] J. A. Fike and J. J. Alonso. The development of hyper-dual numbers for exact second-
derivative calculations. Proceedings, 49th AIAA Aerospace Sciences Meeting including
the New Horizons Forum and Aerospace Exposition. Orlando, Florida, 2011. URL:
https://doi.org/10.2514/6.2011-886, doi:10.2514/6.2011-886.
[FA12] J. A. Fike and J. J. Alonso. Automatic differentiation through the use of hyper-dual num-
bers for second derivatives. In S. Forth, P. Hovland, E. Phipps, J. Utke, and A. Walther,
editors, Recent advances in algorithmic differentiation, volume 87 of Lect. Notes Com-
put. Sci. Eng., pages 163–173. Springer, Heidelberg, 2012. URL: https://doi.org/
10.1007/978-3-642-30023-3_15, doi:10.1007/978-3-642-30023-3_15.
[Fle05] R. Fletcher. On the Barzilai-Borwein method. In L. Qi, K. Teo, and X. Yang, editors,
Optimization and control with applications, volume 96 of Appl. Optim., pages 235–256.
Springer, New York, 2005. URL: http://dx.doi.org/10.1007/0-387-24255-4_
10, doi:10.1007/0-387-24255-4_10.
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org (Accessed April 10, 2017).
[GL83] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3 of Johns Hopkins
Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD,
1983.
[GL89] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3 of Johns Hopkins
Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD,
second edition, 1989.
[GL96] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Studies in the
Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition,
1996.
[GO89] G. H. Golub and D. P. O’Leary. Some history of the conjugate gradient and Lanczos
algorithms: 1948–1976. SIAM Rev., 31(1):50–102, 1989.
[Gre97] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia, 1997.
[Gro77] C. W. Groetsch. Generalized Inverses of Linear Operators. Marcel Dekker, Inc., New
York, Basel, 1977.
[Heb73] M. D. Hebden. An algorithm for minimization using exact second order derivatives.
Technical Report T.P. 515, Atomic Energy Research Establishment, Harwell, England,
1973.
[Hei93] M. Heinkenschloss. Mesh independence for nonlinear least squares problems with
norm constraints. SIAM J. Optimization, 3:81–117, 1993. URL: http://dx.doi.
org/10.1137/0803005, doi:10.1137/0803005.
[HJ85] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, London, New York, 1985.
[HPUU09] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich. Optimization with PDE Con-
straints, volume 23 of Mathematical Modelling, Theory and Applications. Springer
Verlag, Heidelberg, New York, Berlin, 2009. URL: http://dx.doi.org/10.1007/
978-1-4020-8839-1, doi:10.1007/978-1-4020-8839-1.
[HS52] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49:409–436, 1952.
[Ise96] A. Iserles. A First Course in the Numerical Analysis of Differential Equations. Cambridge University Press, Cambridge, London, New York, 1996.
[JS04] F. Jarre and J. Stoer. Optimierung. Springer Verlag, Berlin, Heidelberg, New York, 2004.
[KA64] L. V. Kantorovich and G. P. Akilov. Functional Analysis in Normed Spaces. Pergamon Press, New York, 1964.
[Kel95] C. T. Kelley. Iterative methods for linear and nonlinear equations, volume 16 of Fron-
tiers in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 1995. URL: https://doi.org/10.1137/1.9781611970944,
doi:10.1137/1.9781611970944.
[MP96] T. Maly and L. R. Petzold. Numerical methods and software for sensitivity analysis of
differential-algebraic systems. Applied Numerical Mathematics, 20:57–79, 1996.
[MS79] H. Matthies and G. Strang. The solution of nonlinear finite element equations. Internat.
J. Numer. Methods Engrg., 14:1613–1626, 1979.
[MS83] J. J. Moré and D. C. Sorensen. Computing a trust region step. SIAM J. Sci. Statist.
Comput., 4(3):553–572, 1983.
[MT94] J. J. Moré and D. J. Thuente. Line search algorithms with guaranteed sufficient decrease.
ACM Transactions on Mathematical Software, 20(3):286–307, 1994.
[Nas76] M. Z. Nashed. Generalized Inverses and Applications. Academic Press, Boston, San Diego, New York, London, 1976.
[Noc80] J. Nocedal. Updating quasi-Newton matrices with limited storage. Math. Comp.,
35(151):773–782, 1980.
[OT14] M. A. Olshanskii and E. E. Tyrtyshnikov. Iterative Methods for Linear Systems: Theory
and Applications. SIAM, Philadelphia, 2014.
[Pot89] F. A. Potra. On Q-order and R-order of convergence. J. Optim. Theory Appl., 63(3):415–
431, 1989.
[PR55] D. W. Peaceman and H. H. Rachford, Jr. The numerical solution of parabolic and
elliptic differential equations. J. Soc. Indust. Appl. Math., 3:28–41, 1955.
[PSS09] A. D. Padula, S. D. Scott, and W. W. Symes. A software framework for abstract expres-
sion of coordinate-free linear algebra and optimization algorithms. ACM Trans. Math.
Software, 36(2):Art. 8, 36, 2009. URL: https://doi.org/10.1145/1499096.
1499097, doi:10.1145/1499096.1499097.
[Ray93] M. Raydan. On the Barzilai and Borwein choice of steplength for the gradient method.
IMA J. Numer. Anal., 13(3):321–326, 1993. URL: http://dx.doi.org/10.1093/
imanum/13.3.321, doi:10.1093/imanum/13.3.321.
[RSS01] M. Rojas, S. A. Santos, and D. C. Sorensen. A new matrix-free algorithm for the large-
scale trust-region subproblem. SIAM J. Optim., 11(3):611–646 (electronic), 2000/01.
[Saa03] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, second
edition, 2003.
[Sal86] D. E. Salane. Adaptive routines for forming Jacobians numerically. Technical Report SAND86–1319, Sandia National Laboratories, 1986.
[SB93] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer Verlag, New
York, Berlin, Heidelberg, London, Paris, second edition, 1993.
[SS86] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comp., 7:856–869, 1986.
[ST98] W. Squire and G. Trapp. Using complex variables to estimate derivatives of real
functions. SIAM Rev., 40(1):110–112, 1998. URL: https://doi.org/10.1137/
S003614459631241X, doi:10.1137/S003614459631241X.
[Ste83] T. Steihaug. The conjugate gradient method and trust regions in large scale optimiza-
tion. SIAM J. Numer. Anal., 20:626–637, 1983.
[Sto83] J. Stoer. Solution of large linear systems of equations by conjugate gradient type methods. In A. Bachem, M. Grötschel, and B. Korte, editors, Mathematical Programming, The State of The Art, pages 540–565. Springer Verlag, Berlin, Heidelberg, New York, 1983.
[Tar05] A. Tarantola. Inverse problem theory and methods for model parameter estimation.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2005.
[TB97] L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[TE05] L. N. Trefethen and M. Embree. Spectra and Pseudospectra. The Behavior of Nonnor-
mal Matrices and Operators. Princeton University Press, Princeton, NJ, 2005.
[Toi81] Ph. L. Toint. Towards an efficient sparsity exploiting Newton method for minimization.
In I. S. Duff, editor, Sparse Matrices and Their Uses, pages 57–87. Academic Press,
New York, 1981.
[Trö10] F. Tröltzsch. Optimal Control of Partial Differential Equations: Theory, Methods and
Applications, volume 112 of Graduate Studies in Mathematics. American Mathemat-
ical Society, Providence, RI, 2010. URL: http://dx.doi.org/10.1090/gsm/112.
[Var62] R. S. Varga. Matrix Iterative Analysis. Prentice Hall, Englewood Cliffs, NJ, 1962.
[Vol01] S. Volkwein. Distributed control problems for the Burgers equation. Comput. Optim.
Appl., 18(2):115–140, 2001.
[Vor03] H. A. van der Vorst. Iterative Krylov Methods for Large Linear Systems, volume 13
of Cambridge Monographs on Applied and Computational Mathematics. Cambridge
University Press, Cambridge, 2003.
[Win80] R. Winther. Some superlinear convergence results for the conjugate gradient method. SIAM J. Numer. Anal., 17:14–17, 1980.
[Wri15] S. J. Wright. Coordinate descent algorithms. Math. Program., 151(1, Ser. B):3–
34, 2015. URL: http://dx.doi.org/10.1007/s10107-015-0892-3, doi:10.
1007/s10107-015-0892-3.
[Xu92] J. Xu. Iterative methods by space decomposition and subspace correction. SIAM
Review, 34:581–613, 1992.
[You71] D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, New York,
1971. Republished by Dover [You03].
[You03] D. M. Young. Iterative Solution of Large Linear Systems. Dover Publications Inc.,
Mineola, NY, 2003. Unabridged republication of the 1971 edition [You71].