Davidson Diagonalization Methods
Abstract.
∗ This is the report of the course project for Fall 2011 18.335J by Prof. Steven Johnson.
† Department of Mechanical Engineering, Massachusetts Institute of Technology, 77 Massachusetts
Avenue, Cambridge, MA, 02139 (bolin@mit.edu).
Now we would like to minimize the Rayleigh quotient by varying x, but instead of
varying x along a certain direction (as we usually do in the steepest descent and conjugate
gradient methods), we vary one component of x while holding all the other components
fixed. Specifically, if one varies the ith component x_i by an amount δ_i, the optimum
choice of δ_i from
(1.5)    ∂ρ/∂x_i |_{x_i + δ_i} = 0
is just given by
This expression looks similar to (1.3), except that ρ and r_i here are evaluated at
x + δ_i e_i. In this sense the correction vector given by equation (1.3) can also be
interpreted as an approximation of the optimum variation vector that minimizes the
Rayleigh quotient locally. A third interpretation of equation (1.3) has to do with
Rayleigh Quotient Inverse Iteration (RQII)[4]. Consider an RQII step
It was shown in class that RQII exhibits cubic convergence when the approximate
eigenvector approaches the true one. Now if one imposes that the correction in each
step be orthogonal to the previous trial vector, i.e. x_new = (x + δ)/ε, where x^T δ = 0
and ε plays the role of a normalization factor, then equation (1.7) can be rewritten
as (according to Davidson[3], a modified Newton-Raphson equation)
(1.9)    ε = 1 / (x^T (H − ρI)^{-1} x) ≈ λ − ρ
This observation is consistent with the previous discussions. Given ε, equation
(1.8) can be rewritten as
(1.11)    (ρ − H_ii) δ_i ≈ r_i + Σ_{j≠i} H_ij δ_j + ε x_i
From this point of view, equation (1.3) is also an approximate form of an orthog-
onal correction vector in one step of RQII. Davidson[5] and Pulay[6] also pointed out
that equation (1.3) is a form of diagonal-preconditioned (Jacobi-type) gradient of the
Rayleigh quotient, which is why this original Davidson method is also referred to as
the Diagonal-Preconditioned-Residue (DPR) method. From the discussion given above,
especially the approximations made along the way, one can tentatively predict that
this original flavor only works well for diagonally dominant matrices, which will be
verified in the experiments shown in the next two sections.
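To make the DPR recipe concrete, here is a minimal sketch of the resulting Davidson loop for the lowest eigenpair, using the correction of equation (1.3). It is written in Python/NumPy purely for illustration (the report's own implementation is in Matlab); the function name, tolerance and small-denominator guard are assumptions rather than details taken from the report.

# Minimal DPR (original Davidson) sketch for the lowest eigenpair.
import numpy as np

def davidson_dpr(H, v0, tol=1e-10, max_iter=200):
    D = np.diag(H)                          # diagonal used for preconditioning
    V = v0[:, None] / np.linalg.norm(v0)    # subspace basis, orthonormal columns
    for _ in range(max_iter):
        W = H @ V                           # "exact" matrix-vector products
        Hbar = V.T @ W                      # subspace representation of H
        theta, S = np.linalg.eigh(Hbar)
        rho, y = theta[0], S[:, 0]          # lowest Ritz pair
        x = V @ y                           # approximate eigenvector
        r = W @ y - rho * x                 # residual r = (H - rho*I) x
        if np.linalg.norm(r) < tol:
            return rho, x
        # DPR correction, eq. (1.3): delta_i = r_i / (rho - H_ii)
        denom = rho - D
        denom[np.abs(denom) < 1e-12] = 1e-12    # guard against division by ~0
        delta = r / denom
        delta -= V @ (V.T @ delta)          # orthogonalize against current subspace
        nrm = np.linalg.norm(delta)
        if nrm < 1e-12:
            return rho, x                   # correction already lies in the span
        V = np.hstack([V, (delta / nrm)[:, None]])
    return rho, x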
1.2. Improved versions: IIGD, GJD and RQII. Some slightly modified
versions of the Davidson method were proposed in the late 1980s and 1990s; the basic idea
was to add back correction terms that were dropped in the original version while
maintaining an efficient way of evaluating the correction vector. A brief review is
given in the Appendix of reference [11]. Olsen et al.[4] proposed adding the εx
term back, so that the correction vector is given by
(1.13)    ε = [x^T (D − ρI)^{-1} r] / [x^T (D − ρI)^{-1} x]
Because of the resemblance of this correction vector to that from RQII, this method
was named the Inverse-Iteration Generalized Davidson (IIGD) method.
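As a sketch of how the IIGD correction can be evaluated, assuming the residual convention r = (H − ρI)x (a convention not stated explicitly in the report, so signs may differ), the ε of equation (1.13) is exactly what makes the correction orthogonal to x:

# IIGD (Olsen) correction sketch: add the eps*x term back so that x.T @ delta ≈ 0.
import numpy as np

def iigd_correction(D, rho, x, r, guard=1e-12):
    """D: diagonal of H (1-D array); rho: Ritz value; x: Ritz vector; r: residual."""
    denom = D - rho
    denom[np.abs(denom) < guard] = guard   # avoid division by ~0
    Dinv_r = r / denom                     # (D - rho*I)^{-1} r
    Dinv_x = x / denom                     # (D - rho*I)^{-1} x
    eps = (x @ Dinv_r) / (x @ Dinv_x)      # eq. (1.13)
    delta = eps * Dinv_x - Dinv_r          # (D - rho*I)^{-1} (eps*x - r)
    return delta                           # satisfies x @ delta ≈ 0 by construction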
Sleijpen et al.[7][8][9] suggested a further improvement over IIGD, the Generalized
Jacobi Davidson (GJD), which is also explained in detail in the online book Templates
for the Solution of Algebraic Eigenvalue Problems[2]. By applying the projector
(I − xx^T) to both sides of the RQII equation (1.10), ε can be removed explicitly from
the equation, and after reorganizing, the RQII equation can be written in the projected
form
Here H̃ = (I − xx^T)(H − ρI)(I − xx^T) is the matrix projected onto the subspace
orthogonal to x. In the original paper it is suggested to solve equation (1.14)
approximately, for example by a few steps of MINRES. In practice, however, any
efficient iterative linear solver (conjugate gradient, for example) can be used to solve
the GJD equation at each iteration step. Along the same lines, it is also possible to
solve the RQII equation directly at each step with an efficient linear solver. By adding
more correction terms back into the original recipe, these improved versions aim
at improving the performance of the Davidson method when applied to non-diagonally-
dominant matrices. The results and comparisons will be given in subsequent sections.
A sketch of the GJD correction step follows.
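The sketch below solves the projected equation approximately with a few MINRES iterations, as suggested above; SciPy's MINRES is used here only for illustration, and the number of inner steps is an arbitrary choice rather than a recommendation from the report.

# GJD correction sketch: solve (I - x x^T)(H - rho*I)(I - x x^T) delta = -r with delta ⊥ x.
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def gjd_correction(H, rho, x, r, inner_steps=20):
    n = H.shape[0]
    def proj(v):                    # apply (I - x x^T)
        return v - x * (x @ v)
    def matvec(v):                  # apply the projected operator H-tilde
        return proj(H @ proj(v) - rho * proj(v))
    A = LinearOperator((n, n), matvec=matvec, dtype=H.dtype)
    delta, _ = minres(A, -proj(r), maxiter=inner_steps)
    return proj(delta)              # enforce orthogonality to x explicitly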
There are other modifications of the Davidson method, mainly concerned with
optimizing the correction vector and speeding up convergence. Since space
is limited, those minor modifications are not described here; a thorough review
by Leininger et al.[10] is available.
1.3. Subspace Projected Approximate Matrix (SPAM) modification.
Taking into account the fact that when the matrix dimension becomes extremely large
the most time-consuming parts of the iterative algorithms are the matrix-vector prod-
ucts, an extension of the Davidson method called Subspace Projected Approximate Matrix
(SPAM)[11] was designed, aiming at reducing the number of "exact" matrix-vector prod-
ucts as much as possible, in a flexible and adaptive way. Assume that at a certain iteration
step the subspace vectors are given by the columns of a matrix B, and the matrix-vector
products are computed as the columns of a matrix W. Thus the subspace representation of
H is given by H̄ = B^T HB = B^T W. Define the orthogonal projector P = BB^T and
the complementary projector Q = I − P; then the original matrix H can be written
equivalently as
(1.15)    H = (P + Q)H(P + Q) = PHP + PHQ + QHP + QHQ
When computing the matrix-vector product Hy, the first three terms in equation
(1.15) are easy to handle since B and W are available and have low dimensions. The
basic idea of the SPAM algorithm is to approximate H in the fourth term by another
matrix H1 whose matrix-vector products H1y require less effort to compute. The choice
of H1 is flexible and problem-dependent; it can be a sparser matrix than H or
some formal or algebraic approximation to H. Given H1, the original matrix H can
now be approximated by a "SPAM" matrix H_SPAM
One good property of H_SPAM is that for any vector y ∈ span(B), equation (1.16)
indicates that H_SPAM y = Hy, which means that if the column space of B converges to
the eigenspace of H_SPAM, this eigenspace is exactly the eigenspace of H. Thus one
can solve the eigenproblem of H_SPAM instead and use the eigenvectors of H_SPAM
to append to the previous subspace (span(B)). To update W, one "exact" matrix-
vector product involving H is required. To solve the eigenproblem of H_SPAM (which has
the same dimension as the original problem), an iterative Davidson method is used,
which is cheap thanks to another good property of H_SPAM: for any vector
x⊥ orthogonal to the column space of B, the matrix-vector product takes the simple
form
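Whatever the exact form of that product, the action of H_SPAM on an arbitrary vector is cheap to evaluate. The sketch below assumes, as the text above implies, that H_SPAM is obtained from equation (1.15) by replacing the fourth term QHQ with QH1Q; only B, W = HB and a cheap matvec with H1 are needed (Python/NumPy, illustration only, H assumed symmetric).

# SPAM matrix-vector product sketch: H_SPAM = PHP + PHQ + QHP + Q H1 Q, P = B B^T, Q = I - P.
import numpy as np

def spam_matvec(B, W, H1_matvec, y):
    """B: orthonormal basis (n x l); W = H @ B; H1_matvec: cheap approximate matvec for H1."""
    c  = B.T @ y                          # coordinates of P y in the basis B
    qy = y - B @ c                        # Q y
    Hbar = B.T @ W                        # subspace representation of H
    php = B @ (Hbar @ c)                  # P H P y
    phq = B @ (W.T @ qy)                  # P H Q y (uses H^T B = W for symmetric H)
    qhp = W @ c - B @ (B.T @ (W @ c))     # Q H P y
    h1qy = H1_matvec(qy)
    qh1q = h1qy - B @ (B.T @ h1qy)        # Q H1 Q y
    return php + phq + qhp + qh1q         # equals H @ y exactly when y is in span(B)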
(1.18) δ = −r
It is easily seen that the residue vector then lies in an expanding Krylov space, and so
does the correction vector. By generating correction vectors this way, the Davidson method
is reduced to an explicitly orthogonalized Lanczos method. Although the Lanczos method
seems more elegant (only the two latest trial vectors need to be stored, and the subspace-
projected matrix is tridiagonal), it suffers from slow convergence because
it does not selectively converge to the desired eigenpair of interest.
On the other hand, since the correction vector of the Davidson method can
be interpreted as the gradient of the Rayleigh quotient preconditioned in some
way, there are also connections between the Davidson method and gradient-based
methods[1][13], such as the steepest descent (SD) and conjugate gradient (CG) methods.
They all compute the correction vector from the residue in some manner (with certain kinds
of preconditioning) but use it differently: SD and CG use the correction vector as
the search direction for the next step, whereas Davidson methods use it to expand the
subspace. A detailed comparison of convergence performance in a realistic
problem is given in a later section, where the possibility of combining these
methods is also discussed.
1.5. Block Davidson Method and Subspace Collapse. Another remarkable
feature of the Davidson method is that it can easily be extended to compute a few of the lowest
eigenpairs simultaneously. This type of Davidson method is called the Block Davidson or
Davidson-Liu[14] algorithm. The basic idea is that instead of adding one new vector
at each iteration, a few new vectors, corresponding to the residue vectors of different
eigenpairs, are added at each iteration, driving the subspace eigenvectors to converge
at the same time.
Another extension of the Davidson method is the subspace collapse technique[6], sim-
ilar to the restart scheme used in the Lanczos method, which reduces the memory
requirement. The basic idea is to take the best approximate eigenvectors already
obtained and restart with an initial subspace spanned by them, as sketched below.
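A minimal sketch of such a collapse step (Python/NumPy, illustration only; the number of retained vectors and the re-orthonormalization step are assumptions):

# Subspace collapse sketch: restart with the current best Ritz vectors as the new basis.
import numpy as np

def collapse_subspace(V, H, k_keep=4):
    """Collapse the search space V (n x m, orthonormal columns) to the k_keep lowest Ritz vectors."""
    Hbar = V.T @ (H @ V)
    theta, S = np.linalg.eigh(Hbar)
    X = V @ S[:, :k_keep]          # best current approximate eigenvectors
    Q, _ = np.linalg.qr(X)         # re-orthonormalize for safety
    return Q                       # restart Davidson with this smaller basis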
2. Implementation and Performance Test.
2.1. Diagonally Dominant Matrices. First the original DPR Davidson method
[3][15] is implemented and applied to diagonally dominant matrices. For simplicity
and easy comparison, only the lowest eigenpair is solved. The Block Davidson
method, which solves a few of the lowest eigenpairs simultaneously, is demonstrated
separately but is not used to compare the different flavors. One way to generate
such test matrices is sketched below.
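Since the report does not spell out how the random diagonally dominant test matrices are generated, the following shows one plausible construction (small symmetric off-diagonal noise plus a dominant, well-separated diagonal); it is an assumption, not the author's recipe.

# One plausible generator of random, symmetric, strictly diagonally dominant test matrices.
import numpy as np

def random_diag_dominant(n, off_scale=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    A = off_scale * rng.standard_normal((n, n))
    A = 0.5 * (A + A.T)                     # symmetric off-diagonal noise
    np.fill_diagonal(A, 0.0)
    row = np.abs(A).sum(axis=1)             # off-diagonal row sums
    np.fill_diagonal(A, np.arange(1, n + 1) + row)   # dominant, well-spread diagonal
    return A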
The convergence performance of DPR is demonstrated in figure 2.1.
[Figure 2.1: two panels, normalized norm of the residue vs. number of iterations.]
Fig. 2.1: Convergence curves of the DPR method applied to randomly generated
diagonally dominant matrices
Convergence to 10^{-10} is achieved in fewer than 30 iterations, the convergence curves are
smooth, and the number of iterations needed does not seem to increase with the dimension
of the problem, which is quite remarkable. Although these convergence curves demon-
strate the typical "successful" behavior of the Davidson method, further study shows that
the performance is much more complicated, depends on many factors, and can
be very sensitive. A good analysis of the convergence behavior of Davidson from the
perspective of the spectrum of a preconditioned Krylov problem was given by Morgan
and Scott[16]. Consider the operator N(ρ) = (D − ρI)^{-1}(H − ρI); every cor-
rection vector generated during the Davidson iteration is given by N acting on some
vector. If ρ were a constant, the subspace generated by the Davidson itera-
tion would just be a Krylov space generated by powers of N, so the standard tools for
analyzing Krylov space methods can be applied here. Faster convergence of the Arnoldi or
Lanczos method is achieved (as we learned in class) when the gap ratio (relative
separation) of the spectrum of the matrix is large. Of course ρ is not constant
here, but (ideally) converges to a certain eigenvalue of H, so the spectrum of N when
ρ is near that eigenvalue is crucial to the convergence rate of Davidson. An
extreme example is when H is diagonal: all eigenvalues of N are then equal to 1,
so the Davidson method is expected to perform badly (it actually fails in exact arith-
metic, since δ = −x in this case and lies in the previous subspace). As is known from
the analysis of preconditioners for gradient-based methods, (D − ρI)^{-1} tends to compress
the spectrum of (H − ρI), which is a desirable property for gradient-based methods,
whereas for eigenvalue problems an increased gap ratio is the desired property. From this
point of view, the original Davidson method is expected to perform well (or better than
the Lanczos method) only if, after preconditioning (multiplication by (D − ρI)^{-1}), the
gap ratio is increased and the corresponding eigenvalue of N is not clustered with other
eigenvalues. So even in the seemingly simplest case of diagonally dominant matrices,
the convergence behavior of the Davidson method can be rather complicated. Figure
2.2a illustrates one typical situation where Davidson does not do so well: the algo-
rithm seems to converge to other eigenvalues at first (corresponding to the dips in the
norm of the residue) and only later figures out that a smaller eigenvalue exists, adjusting
to it at the end. This behavior may be explained by the analysis given above: the
corresponding eigenvalue of the operator N resides in the interior of the spectrum instead
of being well separated from the other eigenvalues. IIGD is devised to improve on DPR
for diagonally dominant matrices, and the performance of IIGD applied to the same matrix
is displayed in figure 2.2b. Although the total number of iterations is reduced, stronger
oscillations are observed for IIGD, which may be dangerous: if the convergence
threshold is not set small enough, the algorithm may end up converging to a higher
eigenvalue.
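This picture is easy to probe numerically. The short sketch below (illustration only; "H" is any symmetric test matrix and ρ an approximate eigenvalue not equal to a diagonal entry) computes the spectrum of N(ρ) so that its clustering and gap ratio can be inspected directly.

# Inspect the spectrum of N(rho) = (D - rho*I)^{-1} (H - rho*I).
import numpy as np

def spectrum_of_N(H, rho):
    n = H.shape[0]
    D = np.diag(np.diag(H))
    N = np.linalg.solve(D - rho * np.eye(n), H - rho * np.eye(n))
    return np.linalg.eigvals(N)

# For a strictly diagonal H every eigenvalue of N equals 1, so Davidson stalls,
# consistent with the extreme example mentioned above.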
Besides explaining when Davidson does not perform very well, the anal-
ysis given above also sheds light on further improving the Davidson method with other
preconditioners in specific problems, which will be demonstrated in a later
section.
[Figure 2.2: two panels, (a) and (b); normalized norm of the residue vs. number of iterations.]
[Figure 2.3: two panels, DPR and IIGD convergence curves for a non-diagonally-dominant matrix (N=100); normalized norm of the residue vs. number of iterations.]
Fig. 2.3: DPR and IIGD fail in dealing with non-diagonally-dominant matrices
GJD and RQII, in contrast, avoid the diagonal approximation made when solving for the
correction vector. Figure 2.4 demon-
strates how they work when applied to the same matrix. They converge in around 40
steps, but the price to pay is a (much) larger cost at each iteration. Another problem is that
when ρ converges to a certain eigenvalue, the linear system to be solved at each step is
nearly singular, which may cause numerical problems and slow convergence. Figure 2.5 shows that
when applied to a larger non-diagonally-dominant matrix they converge slowly, and
other methods (such as gradient-based methods) may be more advantageous given
the increased cost at each step.
[Figure 2.4: two panels, GJD and RQII convergence curves for a non-diagonally-dominant matrix (N=100); normalized norm of the residue vs. number of iterations.]
Fig. 2.4: GJD and RQII perform better for non diagonally dominant matrices
[Figure 2.5: two panels, GJD and RQII convergence curves for a non-diagonally-dominant matrix (N=1000); normalized norm of the residue vs. number of iterations.]
Fig. 2.5: GJD and RQII do not perform as well for larger non-diagonally-dominant
matrices
2.3. Block Davidson Method. To solve for the lowest few eigenpairs, one can of
course solve for them one by one sequentially using Davidson, starting from the lowest
one, in which case picking the correction vector at each step is a little subtle[12]. A
more powerful approach is to solve for the lowest few eigenpairs simultaneously, using
the Block Davidson algorithm[14]. Suppose we are interested in the lowest k eigen-
pairs. Starting from a guess subspace of dimension l (l ≥ k), the only modification
to the original Davidson method is that at each step all the correction vectors cor-
responding to the first k approximate eigenvectors are computed. The first correction
vector is then orthonormalized against the previous subspace and appended to it.
This process is repeated for each of the other k − 1 correction vectors, neglecting
any new vector whose norm after orthogonalization falls below some threshold. Then
the subspace problem is solved again. This extension is simple and powerful, and keeps
all the properties of the original Davidson method, except that the dimension of the
subspace problem grows faster during the iterations. The expansion step is sketched below.
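The sketch below implements one block expansion step, again in Python/NumPy for illustration (the report's own implementation is in Matlab); the drop tolerance and the use of the DPR correction for each pair are assumptions consistent with the description above, not the author's exact code.

# Block Davidson expansion step: one DPR correction per target eigenpair,
# each orthonormalized against the (growing) basis; near-dependent vectors are dropped.
import numpy as np

def block_expand(H, V, k=4, drop_tol=1e-8):
    Dh = np.diag(H)
    W = H @ V
    Hbar = V.T @ W
    theta, S = np.linalg.eigh(Hbar)
    for j in range(k):                       # lowest k Ritz pairs
        rho, y = theta[j], S[:, j]
        x = V @ y
        r = W @ y - rho * x                  # residual of the j-th pair
        denom = rho - Dh
        denom = np.where(np.abs(denom) < 1e-12, 1e-12, denom)
        delta = r / denom                    # DPR correction, eq. (1.3)
        delta -= V @ (V.T @ delta)           # orthogonalize against the current basis
        nrm = np.linalg.norm(delta)
        if nrm > drop_tol:                   # neglect nearly dependent corrections
            V = np.hstack([V, (delta / nrm)[:, None]])
    return V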
The Block Davidson algorithm is implemented in Matlab, and the simultaneous con-
vergence curves are shown in figure 2.6.
[Figure 2.6: normalized norm of the residue vs. number of iterations.]
Fig. 2.6: The lowest 4 eigenvalues converge uniformly when applying Block Davidson
to a 1000 × 1000 diagonally dominant matrix.
(3.1)    H_ii = (k − G_i)^2 / 2
(3.2)    H_ij = V(G_j − G_i)
where the diagonal elements of H are the kinetic energies of the planewaves in the
basis and V(G) is the Fourier component of the periodic pseudopotential with respect
to the reciprocal lattice vector G.
to maintain the unit norm of the guess eigenvector. The algorithm then fits the en-
ergy functional (equivalent to the Rayleigh quotient to be minimized) to the following
functional form of θ,
which turns out to work magically well for this specific type of problem. Three
function values (and/or derivative values) of the energy functional at special
points are then evaluated to determine the three unknown coefficients in equation (3.4), and
the minimum with respect to θ can be found analytically and used to update the
guess vector via equation (3.3). Readers are referred to the original paper[18] for the
detailed formulae. Compared to the Block Davidson method, the non-linear CG can only
solve eigenpairs one by one sequentially. In this report the comparison is made only
for the case where the lowest eigenpair is of interest, keeping in mind
that since each CG iteration is cheaper than a Davidson iteration (especially
when the Davidson subspace grows relatively large), a more general discussion would be
necessary for a full judgement. Limited by time and space, only the convergence
behavior with respect to the number of iteration steps is discussed in this report.
So far the preconditioning issue has not been discussed. First we notice that
the single-electron Hamiltonian in this problem has a rather special structure, being
only "partially" diagonally dominant: the diagonal elements corresponding to the kinetic
energies of planewaves with large wavenumbers overwhelm the off-diagonal elements,
whereas the ones corresponding to low-energy planewaves are comparable to
or even smaller than the off-diagonal elements. Since the matrix is not diagonally dominant,
the original Davidson method (DPR) is not expected to work particularly well, especially when
the dimension of the problem is large, as confirmed by figure 3.1. In the test,
a Gaussian-type model pseudopotential is used and the calculation is performed at the Γ
point (k = 0). A sketch of how such a test Hamiltonian can be assembled is given below.
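For concreteness, here is a sketch of how a test Hamiltonian of the form (3.1)-(3.2) can be assembled, in one dimension and at the Γ point; the Gaussian model pseudopotential V(G) and its parameters below are assumptions, since their exact form is not given in the report.

# Sketch of the planewave test Hamiltonian of eqs. (3.1)-(3.2) at k = 0 (1-D for simplicity).
import numpy as np

def planewave_hamiltonian(n_g=100, v0=0.5, sigma=2.0):
    G = np.arange(-(n_g // 2), n_g - n_g // 2, dtype=float)  # planewave wavevectors
    def V(G_diff):                                           # assumed Gaussian model pseudopotential
        return -v0 * np.exp(-G_diff**2 / (2.0 * sigma**2))
    H = V(G[None, :] - G[:, None])                           # H_ij = V(G_j - G_i), eq. (3.2)
    np.fill_diagonal(H, 0.5 * G**2)                          # H_ii = (k - G_i)^2 / 2 with k = 0, eq. (3.1)
    return H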
[Figure 3.1: DPR convergence curves for the test Hamiltonian; (a) N=100, (b) N=1000. Normalized norm of the residue vs. number of iterations.]
Fig. 3.1: DPR does not work well when directly applied to the planewave Hamiltonian
with N = 1000
Here the preconditioner comes into play. TPA devised a specific preconditioner for
the planewave Hamiltonian
(3.6) δ = Kr
The result shows that this strategy works magically well. Figure
3.3 shows that, starting with the same initial guess vector, the number of iterations
needed for TPA-preconditioned DPR is around 20, while the original DPR wanders
around for a long time before finally settling down. This example illustrates
the possibility of improving the performance of the Davidson method by preconditioning in
specific problems; a sketch of the preconditioned correction step is given below.
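The sketch below applies the preconditioned correction δ = Kr of equation (3.6). The diagonal kernel K uses the polynomial form commonly quoted for the TPA preconditioner in the planewave literature; since equation (3.5) is not shown above, this specific expression should be treated as an assumption rather than the report's own formula.

# Preconditioned correction delta = K r, eq. (3.6), with an assumed TPA-style kernel K.
import numpy as np

def tpa_preconditioned_correction(G, x, r):
    """G: planewave wavevectors; x: current trial vector; r: residual."""
    ekin_G = 0.5 * G**2                      # kinetic energy of each planewave
    ekin_x = x @ (ekin_G * x)                # kinetic energy of the trial vector
    t = ekin_G / ekin_x
    num = 27.0 + 18.0 * t + 12.0 * t**2 + 8.0 * t**3
    K = num / (num + 16.0 * t**4)            # ~1 for low-energy planewaves, decays for high-energy ones
    return K * r                             # delta = K r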
To conclude this section, a comparison between the preconditioned Davidson method and
non-linear CG applied to the same (larger) Hamiltonian with the same starting
vector is given in figure 3.4, which may not be entirely fair to CG.
4. Summary. This report focuses on understanding the underlying mech-
anism that makes the Davidson method work and on reviewing different modifications and
extensions of the original DPR method. Different flavors of the Davidson algorithm are im-
plemented, as is the Block Davidson method. In a "not-so-realistic" toy problem,
the possibility of improving the Davidson method via preconditioning is explored, and the
result is compared with another state-of-the-art method, the non-linear conjugate gradient
algorithm (in the TPA flavor). Further analysis is required to judge the performance of
[Figure 3.4: two panels, normalized norm of the residue vs. number of iterations.]
Fig. 3.4: Comparison between the convergence curves of preconditioned Davidson and
non-linear CG
these two methods, and the result may well depend on the specific problem. In
realistic applications, the present problem needs to be solved self-consistently, which
is a different story altogether.
Acknowledgements. The author would like to thank Prof. Johnson for
his wonderfully inspiring lectures, desperately hard problem sets and frustratingly
hard quiz, and Dr. Keivan Esfarjani of the Nanoengineering
group at MIT for helpful discussions.
REFERENCES