Steepest Descent
This project is devoted to an idea for iteratively solving linear systems, i.e., solving
equations of the form
Ax = b (1)
where A is an n × n matrix and b is a vector in R^n. Henceforth we shall assume that A is
a positive definite matrix. Recall that this means that for all non-zero vectors x ∈ R^n,
x^T A x > 0 .
This means, in particular, that the kernel of the matrix A consists of the zero vector only
and hence the matrix is invertible. Thus equation (1) always has a unique solution.
A first thought is to formulate the problem as a minimization problem. To this end
we consider the function
F(x) = \frac{1}{2}\, x^T A x - b^T x .
At the minimum, which we call y0 , the gradient has to vanish which means that Ay0 = b
and hence y0 is the solution.
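For completeness, here is the short computation behind this claim, assuming (as is customary in this context) that the positive definite matrix A is symmetric:
\nabla F(x) = \frac{1}{2}\,(A + A^T)\, x - b = Ax - b ,
so \nabla F(y_0) = 0 is the same statement as Ay_0 = b.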
Example
Let's take A to be the 2 × 2 diagonal matrix
A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}
and
b = \begin{pmatrix} 3 \\ 4 \end{pmatrix} .
In this case the function F is a function of two variables and is given by
F(u, v) = \frac{1}{2}\,(u^2 + 2v^2) - 3u - 4v
which, by completing the squares, can be brought into the form
F(u, v) = \frac{1}{2}\,(u - 3)^2 + (v - 2)^2 - \frac{17}{2} .
We see right away that the level curves of F are ellipses. The gradient of F is given by
\nabla F(u, v) = (u - 3,\, 2v - 4)
and it vanishes precisely at the point (3, 2) which is the solution of the system Ax = b.
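As a quick numerical sanity check of this example, here is a minimal sketch in Python with NumPy (the matrix, vector, and minimizer are the ones above):

    import numpy as np

    # Matrix and right-hand side from the example.
    A = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    b = np.array([3.0, 4.0])

    def F(x):
        """F(x) = 1/2 x^T A x - b^T x."""
        return 0.5 * x @ A @ x - b @ x

    def grad_F(x):
        """Gradient of F, namely Ax - b."""
        return A @ x - b

    x_star = np.array([3.0, 2.0])
    print(grad_F(x_star))   # the zero vector
    print(F(x_star))        # -17/2 = -8.5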
The completion of the square can be carried out quite generally. Expanding the
expression
\frac{1}{2}\,(x - A^{-1}b)^T A\, (x - A^{-1}b) = \frac{1}{2}\, x^T A x - b^T x + \frac{1}{2}\, b^T A^{-1} b
yields
F(x) = \frac{1}{2}\,(x - A^{-1}b)^T A\, (x - A^{-1}b) - \frac{1}{2}\, b^T A^{-1} b .
From this we see right away that the minimal value of F is attained at the unique solution
y_0 and that the value of the function F at this point is -\frac{1}{2}\, b^T A^{-1} b. We also see that the
level surfaces of the function F are ellipsoids which have y_0 as their common center.
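In the example above this is easy to check: with A the diagonal matrix with entries 1 and 2 and b = (3, 4)^T,
-\frac{1}{2}\, b^T A^{-1} b = -\frac{1}{2}\left(\frac{3^2}{1} + \frac{4^2}{2}\right) = -\frac{17}{2} ,
which is exactly the constant obtained above by completing the squares.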
So far so good, but the real question we want to address is how to find y_0 efficiently,
and we would like to acquaint you with an idea of how this can be achieved. Note that the
issue at hand is not to work this out for 2 × 2 or 3 × 3 matrices, but for large matrices,
where computers are needed.
The method, which is in a way the simplest one, is the steepest descent method. Start
at a point x_0 and think of skiing as fast as possible towards the lowest point. Let us assume
that we are not good skiers and cannot turn in a continuous fashion, i.e., we ski first in
a straight line, then stop and turn, and then again ski in a straight line. You would start
in the direction of steepest descent, which is opposite to the gradient direction, i.e., you
would ski first in the direction of -\nabla F(x_0). How far would you ski? First you will go
down and at some point you will go up again. Clearly you want to stop once you have
reached the lowest point in the valley along that direction. Call this point x_1. Then turn
and ski in the direction -\nabla F(x_1) until you hit the bottom of your straight trajectory in
this new direction and then stop. Repeating this obviously gets you closer and closer to the
absolute bottom of the valley.
Now we formalize this picture. Your starting trajectory is
x_0 + t\, d_0
where d_0 = -\nabla F(x_0). Now, by a calculation using general properties of the dot product,
F(x_0 + t\, d_0) = F(x_0) + t\, d_0^T (Ax_0 - b) + \frac{t^2}{2}\, d_0^T A\, d_0 .
Note that d_0 = -(Ax_0 - b) and hence
F(x_0 + t\, d_0) = F(x_0) - t\, |d_0|^2 + \frac{t^2}{2}\, d_0^T A\, d_0 .
According to our plan, we have to find the minimum of this function with respect to the
variable t which is attained at
t_0 = \frac{|d_0|^2}{d_0^T A\, d_0} .
Thus the first stopping point is at
x_1 = x_0 + \frac{|d_0|^2}{d_0^T A\, d_0}\, d_0 .
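To make this concrete, here is a small Python/NumPy sketch of this first step for the example above; the starting point x_0 = (0, 0) is an arbitrary choice of mine:

    import numpy as np

    A = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    b = np.array([3.0, 4.0])

    x0 = np.zeros(2)                    # arbitrary starting point
    d0 = -(A @ x0 - b)                  # d_0 = -grad F(x_0) = b - A x_0
    t0 = (d0 @ d0) / (d0 @ (A @ d0))    # optimal step length t_0
    x1 = x0 + t0 * d0                   # first stopping point
    print(t0, x1)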
Repeating the same argument we get the next stopping point at
x_2 = x_1 + \frac{|d_1|^2}{d_1^T A\, d_1}\, d_1
where
d_1 = -\nabla F(x_1) = -(Ax_1 - b) .
After the k-th step we have
x_k = x_{k-1} + \frac{|d_{k-1}|^2}{d_{k-1}^T A\, d_{k-1}}\, d_{k-1}    (SD1)
where
d_{k-1} = -(Ax_{k-1} - b) .    (SD2)
With every step we get closer to the minimum of the function F .
Notice that this procedure is not exact in the sense that it does not stop after a
finite number of steps. The reason is that in an ellipsoid the gradient direction does not
in general pass through the center of the ellipsoid. Nevertheless, it is clear that we keep
descending towards the minimum although we may never reach it. Thus, we have
to specify a stopping rule, namely at what accuracy we should stop.
The smallness of |d_k| is certainly a measure of how accurately the vector x_k solves
the equation, since d_k = -(Ax_k - b) vanishes exactly when x_k is the solution.
To get an estimate of how far away x_k is from the actual solution we note that
A(y_0 - x_k) = b - Ax_k = d_k, so that
|y_0 - x_k| \le \frac{|d_k|}{\lambda_{\min}} ,
where \lambda_{\min} denotes the smallest eigenvalue of A.
In order to profit from this estimate one has to know, approximately, the smallest
eigenvalue, which might not be so easy to get.
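If one does want to use this estimate numerically, here is a rough sketch (assuming the matrix is small enough that its spectrum can be computed directly; for genuinely large matrices \lambda_{\min} would have to be estimated differently):

    import numpy as np

    A = np.array([[1.001, 0.999],
                  [0.999, 1.001]])      # sample matrix; replace by your own

    # For a symmetric positive definite matrix all eigenvalues are real and positive.
    lam_min = np.linalg.eigvalsh(A).min()

    def error_bound(x_k, b):
        """Upper bound |y_0 - x_k| <= |d_k| / lambda_min from the estimate above."""
        d_k = b - A @ x_k
        return np.linalg.norm(d_k) / lam_min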
Nevertheless, we now have a method for computing the solution of this system of
linear equations via an iterative scheme which can be summarized as follows:
x_k = x_{k-1} + \frac{|d_{k-1}|^2}{d_{k-1}^T A\, d_{k-1}}\, d_{k-1}    (SD1)
where
d_{k-1} = -(Ax_{k-1} - b) .    (SD2)
Fix an accuracy \varepsilon > 0. If |d_k| \ge \varepsilon, repeat (SD1) and (SD2); if |d_k| < \varepsilon, stop.
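Here is a minimal implementation sketch of (SD1), (SD2) and this stopping rule in Python with NumPy; the function name and the maximum-iteration safeguard are additions of mine, not part of the scheme itself:

    import numpy as np

    def steepest_descent(A, b, x0, eps=1e-5, max_iter=100000):
        """Iterate (SD1)/(SD2) until |d_k| < eps (or max_iter steps, as a safeguard).

        A is assumed to be symmetric positive definite.
        Returns the approximate solution and the number of steps taken.
        """
        x = np.asarray(x0, dtype=float)
        for k in range(max_iter):
            d = b - A @ x                   # (SD2): d_k = -(A x_k - b)
            if np.linalg.norm(d) < eps:     # stopping rule
                return x, k
            t = (d @ d) / (d @ (A @ d))     # optimal step length
            x = x + t * d                   # (SD1)
        return x, max_iter

    # Example: the 2 x 2 system from the example above; x should come out close to (3, 2).
    A = np.array([[1.0, 0.0], [0.0, 2.0]])
    b = np.array([3.0, 4.0])
    x, steps = steepest_descent(A, b, x0=np.zeros(2))
    print(x, steps)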
Nice as it looks, there are some limitations to this method. If the level surfaces of
the function F(x), which are ellipsoids, are very elongated, the method can be very slow.
Recall that an elongated ellipsoid means that the ratio of the largest to the smallest
eigenvalue is very large. Thus, it may happen that when starting on the shallow end of
the ellipsoid, one always makes very small steps, and it may take a fairly long time until
you come close to the bottom of the valley. For this algorithm to work well we need, generally
speaking, a good condition number, i.e., the ratio of the largest to the smallest eigenvalue
should not be too big.
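Before running the iteration it is therefore useful to look at this ratio. A small sketch (for a symmetric positive definite matrix NumPy's built-in np.linalg.cond returns the same ratio):

    import numpy as np

    def condition_number(A):
        """Ratio of the largest to the smallest eigenvalue of a symmetric matrix A."""
        lams = np.linalg.eigvalsh(A)
        return lams.max() / lams.min()

    # The second matrix of Problem 1 below is badly conditioned:
    # its eigenvalues are 2 and 0.002, so the ratio is 1000.
    A = np.array([[1.001, 0.999],
                  [0.999, 1.001]])
    print(condition_number(A))
    print(np.linalg.cond(A))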
Problems
1: Implement (i.e., write a program) the steepest descent algorithm and apply it first
to solve simple problems such as
\begin{pmatrix} 5 & 2 \\ 2 & 1 \end{pmatrix} x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\begin{pmatrix} 1.001 & 0.999 \\ 0.999 & 1.001 \end{pmatrix} x = \begin{pmatrix} 1 \\ 2 \end{pmatrix}
Use an accuracy of 10^{-5}. Draw a qualitative picture of the level curves of the corresponding
function F. Based on that, use various starting points x_0 and describe what you observe.
List the number of steps it takes for the various starting points.
2: Pick at random five 10 × 10 positive definite matrices A and vectors b with integer
coefficients and solve the equation
Ax = b .
Use an accuracy of 10^{-5}. Check your answers.
To generate these matrices proceed as follows (a possible implementation is sketched below). Pick at random a 10 × 10 matrix B
with integer coefficients and compute A = B^T B. This matrix is very likely to be positive
definite, since randomly chosen matrices are unlikely to have a nontrivial kernel.
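One possible sketch of this generation step (the matrix size, the integer range, and the use of NumPy's random number generator are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng()

    def random_spd_system(n=10, low=-9, high=10):
        """Return A = B^T B and b with integer entries; A is almost surely positive definite."""
        B = rng.integers(low, high, size=(n, n))
        A = B.T @ B                  # positive semidefinite; positive definite unless B is singular
        b = rng.integers(low, high, size=n)
        return A.astype(float), b.astype(float)

    A, b = random_spd_system()
    print(np.linalg.eigvalsh(A).min() > 0)   # confirms positive definiteness

Such a pair (A, b) can then be fed directly into the steepest_descent sketch above.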