Notes On Dynamic-Programming Sequence Alignment
Introduction. Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for ‘‘rigorous’’ alignment of DNA and protein
sequences. For a number of useful alignment-scoring schemes, this method is guaranteed to produce an alignment of two given sequences having the highest possible score.
For alignment scores that are popular with molecular biologists, dynamic-programming
alignment of two sequences requires quadratic time, i.e., time proportional to the product of the
two sequence lengths. In particular, this holds for affine gap costs, that is, scoring schemes under
which a gap of length k is penalized g + ek, where g is a fixed ‘‘gap-opening penalty’’ and e is a
‘‘gap-extension penalty’’ (Gotoh, 1982). (More general alignment scores, which are more expensive to optimize, were considered by Waterman et al., 1976, but have not found widespread
use.) Quadratic time is necessitated by the inspection of every pair (i, j), where i is a position in
the first sequence and j is a position in the second sequence. For many applications, e.g.,
database searches, such an exhaustive examination of position pairs may not be worth the effort,
and a number of faster methods have been proposed.
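As a concrete illustration of the affine scheme above, here is a minimal Python sketch; the values g = 10 and e = 1 are made-up example parameters, not recommendations from any particular scoring scheme.

```python
# Affine gap cost: a gap of length k is penalized g + e*k, where g is the
# gap-opening penalty and e is the gap-extension penalty.
# The default values below are illustrative only.
def affine_gap_cost(k, g=10, e=1):
    """Penalty charged for a gap of length k (k >= 1)."""
    return g + e * k

# Opening a gap is expensive; each additional gapped position adds only e.
print(affine_gap_cost(1))  # 11
print(affine_gap_cost(5))  # 15
```

Under this cost, one long gap of length k is cheaper than several short gaps totaling k, since the opening penalty g is paid only once.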
For long sequences, computer memory is another limiting factor, but very space-efficient
versions of dynamic programming are possible. The original formulation (Hirschberg, 1975) was
for an alignment-scoring scheme that is too restrictive to be of general utility in molecular biology, but the basic idea is quite robust and works readily for affine gap penalties (Myers and
Miller, 1988).
We now describe the relationship between maximum-score paths and optimal alignments.
Consider two sequences, A = a_1 a_2 . . . a_M and B = b_1 b_2 . . . b_N. That is, A contains M symbols and B contains N symbols, where the symbols are from an arbitrary ‘‘alphabet’’ that does not contain the dash symbol, ‘‘−’’. The alignment graph for A and B, denoted G_{A,B}, is an edge-labeled directed graph. The nodes of G_{A,B} are the pairs (i, j) where i ∈ [0, M] and j ∈ [0, N]. (We use the notation [p, q] for the set {p, p + 1, . . . , q − 1, q}.) When graphed, these nodes are arrayed in M + 1 rows (row i corresponds to a_i for i ∈ [1, M], with an additional row 0) and N + 1 columns (column j corresponds to b_j for j ∈ [1, N]). The edge set for G_{A,B} consists of the following edges, labeled as indicated.
1. (i − 1, j) → (i, j) for i ∈ [1, M] and j ∈ [0, N], labeled (a_i / −)
2. (i, j − 1) → (i, j) for i ∈ [0, M] and j ∈ [1, N], labeled (− / b_j)
3. (i − 1, j − 1) → (i, j) for i ∈ [1, M] and j ∈ [1, N], labeled (a_i / b_j)
Here (x / y) denotes the aligned pair with top entry x and bottom entry y.
[Figure: the grid of nodes (i, j) with i ∈ [0, 2], j ∈ [0, 3]; horizontal edges are labeled (− / C), (− / T), (− / C), vertical edges in row 1 are labeled (T / −), vertical edges in row 2 are labeled (C / −), and each diagonal edge carries the corresponding pair.]
FIG. 1. Alignment graph G_{A,B} for the sequences A = TC and B = CTC.
It is instructive to look for a path from (0, 0) (the upper left corner of the graph of Fig. 1) to
(2, 3) (the lower right) such that the labels along the path ‘‘spell’’ the alignment:
-TC
CTC
The first aligned pair is (− / C), so the first edge must be horizontal. The second pair is (T / T), so the second edge must be diagonal. The third pair is (C / C), so the third edge must be diagonal. Generally, when a path descends from row i − 1 to row i, it picks up an aligned pair with top entry a_i.
A path from (0, 0) to (M, N ) has zero or more horizontal edges, then a vertical or diagonal edge
to row 1, then zero or more horizontal edges, then an edge to row 2, then . . ., so the top entries of
the labels along the path are a1 , a2 , . . ., possibly with some interspersed dashes. Similarly, the
bottom entries spell B if dashes are ignored, so the aligned pairs spell an alignment of A and B.
Indeed, alignments are in general equivalent to paths, as we now state more precisely.
Fact: Let G_{A,B} be the alignment graph for sequences A and B. With each path from (0, 0) to (M, N) associate the alignment formed by concatenating the edge labels along the path, i.e., the alignment ‘‘spelled’’ by the path. Then every such path determines an alignment of A and B, and every alignment of A and B is determined by a unique path. In other words, there is a one-to-one correspondence between paths in G_{A,B} from (0, 0) to (M, N) and alignments of A and B. Furthermore, if the score σ(π) is assigned to each edge of G_{A,B}, where π is the aligned pair labeling that edge, then a path’s score is exactly the score of the corresponding alignment.
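To make the correspondence concrete in code: with any scoring function σ on aligned pairs, an alignment’s score is simply the sum of σ over its pairs, i.e., over the labels of the corresponding path. A small Python sketch follows; the scoring values (+1 match, −1 mismatch, −2 for a pair containing a dash) are made-up illustrations, not part of the original notes.

```python
# Made-up illustrative scoring function sigma on aligned pairs.
def sigma(top, bottom):
    if top == '-' or bottom == '-':
        return -2                      # pair containing a dash
    return 1 if top == bottom else -1  # match / mismatch

def alignment_score(top_row, bottom_row):
    """Sum sigma over the aligned pairs, i.e., over the path's edge labels."""
    assert len(top_row) == len(bottom_row)
    return sum(sigma(t, b) for t, b in zip(top_row, bottom_row))

# The alignment of A = TC and B = CTC spelled by the path through Fig. 1:
print(alignment_score("-TC", "CTC"))  # -2 + 1 + 1 = 0
```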
At each node, the score is computed from the scores of immediate predecessors and of
entering edges, which are pictured in Fig. 2. The procedure of Fig. 3 computes the maximum
alignment score by considering rows of G_{A,B} in order, sweeping left to right within each row.
S[i, j] denotes the maximum score of a path from (0, 0) to (i, j). Lines 7-10 mirror Fig. 2. In row
0 there is but a single edge entering a node (lines 2-3), and similarly for column 0 (line 5). This is
a quadratic-space procedure since it uses the (M + 1)-by-(N + 1) array S to hold all node-scores.
[Figure: node (i, j) with its three entering edges: vertical from (i − 1, j) scored σ(a_i / −), diagonal from (i − 1, j − 1) scored σ(a_i / b_j), and horizontal from (i, j − 1) scored σ(− / b_j).]
FIG. 2. Edges entering (i, j) and their scores.
1. S[0, 0] ← 0
2. for j ← 1 to N do
3.     S[0, j] ← S[0, j − 1] + σ(− / b_j)
4. for i ← 1 to M do
5.     S[i, 0] ← S[i − 1, 0] + σ(a_i / −)
6.     for j ← 1 to N do
7.         Vertical ← S[i − 1, j] + σ(a_i / −)
8.         Diagonal ← S[i − 1, j − 1] + σ(a_i / b_j)
9.         Horizontal ← S[i, j − 1] + σ(− / b_j)
10.        S[i, j] ← max{Vertical, Diagonal, Horizontal}
11. write "Maximum alignment score is" S[M, N]
FIG. 3. Computing the maximum alignment score in quadratic space.
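The procedure of Fig. 3 translates directly into Python. The sketch below assumes the illustrative scoring function sigma (+1 match, −1 mismatch, −2 for a pair containing a dash); any scoring of aligned pairs could be substituted.

```python
# A Python sketch of the quadratic-space procedure of Fig. 3, with a
# made-up illustrative scoring function sigma.
def sigma(top, bottom):
    if top == '-' or bottom == '-':
        return -2
    return 1 if top == bottom else -1

def global_score_quadratic_space(A, B):
    """Maximum global alignment score, using an (M+1)-by-(N+1) array S."""
    M, N = len(A), len(B)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):                    # row 0: horizontal edges only
        S[0][j] = S[0][j - 1] + sigma('-', B[j - 1])
    for i in range(1, M + 1):
        S[i][0] = S[i - 1][0] + sigma(A[i - 1], '-')  # column 0: vertical only
        for j in range(1, N + 1):
            vertical = S[i - 1][j] + sigma(A[i - 1], '-')
            diagonal = S[i - 1][j - 1] + sigma(A[i - 1], B[j - 1])
            horizontal = S[i][j - 1] + sigma('-', B[j - 1])
            S[i][j] = max(vertical, diagonal, horizontal)
    return S[M][N]

print(global_score_quadratic_space("TC", "CTC"))  # 0
```

On A = TC and B = CTC it returns 0, the score of the alignment -TC over CTC.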
The next step is to see that the optimal alignment score for A and B can be computed in linear space. Indeed, it is apparent that the scores in row i of S depend only on
those in row i − 1. Thus, after treating row i, the space used for values in row i − 1 can be
recycled to hold values in row i + 1. In other words, we can get by with space for two
rows, since all that we ultimately want is the single score S[M, N ].
In fact, a single array, S[0..N], is adequate. S[j] holds the most recently computed
value in column j, so that as values of S are computed, they overwrite old values. There
is a slight conflict in this strategy, since two ‘‘active’’ values are needed in the current
column, necessitating an additional scalar, s, to hold one of them. Fig. 4 shows the grid
locations of values in S and of scalars s and c when (i, j) is reached in the computation.
S[k] holds path scores for row i when k < j, and for row i − 1 when k ≥ j. Fig. 5 is a
direct translation of Fig. 3 using the memory-allocation scheme of Fig. 4.
[Figure: at node (i, j), the scalar s sits at (i − 1, j − 1) and the scalar c at (i, j − 1); S[k] holds row-i scores for k < j and row-(i − 1) scores for k ≥ j.]
FIG. 4. Grid locations of entries of a vector of length N + 1 just before the maximum path score is evaluated at node (i, j). Additionally, a scalar s holds the path score at (i − 1, j − 1) and c holds the score at (i, j − 1).
1. S[0] ← 0
2. for j ← 1 to N do
3.     S[j] ← S[j − 1] + σ(− / b_j)
4. for i ← 1 to M do
5.     s ← S[0]
6.     S[0] ← c ← S[0] + σ(a_i / −)
7.     for j ← 1 to N do
8.         c ← max{S[j] + σ(a_i / −), s + σ(a_i / b_j), c + σ(− / b_j)}
9.         s ← S[j]
10.        S[j] ← c
11. write "Maximum alignment score is" S[N]
FIG. 5. Computing the maximum alignment score in linear space.
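A Python rendering of the linear-space procedure of Fig. 5, again assuming the illustrative sigma; the scalars s and c play exactly the roles shown in Fig. 4.

```python
# A Python sketch of the linear-space procedure of Fig. 5, with a
# made-up illustrative scoring function sigma.
def sigma(top, bottom):
    if top == '-' or bottom == '-':
        return -2
    return 1 if top == bottom else -1

def global_score_linear_space(A, B):
    """Same maximum score as the quadratic-space version, using one row."""
    M, N = len(A), len(B)
    S = [0] * (N + 1)
    for j in range(1, N + 1):
        S[j] = S[j - 1] + sigma('-', B[j - 1])
    for i in range(1, M + 1):
        s = S[0]                                   # score at (i-1, 0)
        S[0] = c = S[0] + sigma(A[i - 1], '-')
        for j in range(1, N + 1):
            c = max(S[j] + sigma(A[i - 1], '-'),   # vertical
                    s + sigma(A[i - 1], B[j - 1]), # diagonal
                    c + sigma('-', B[j - 1]))      # horizontal
            s = S[j]                               # old S[j] is at (i-1, j)
            S[j] = c
    return S[N]

print(global_score_linear_space("TC", "CTC"))  # 0
```

It returns the same value as the quadratic-space version while storing only N + 1 scores and two scalars.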
We will soon need to perform this computation in the reverse direction. Here, the
relevant edges are the ones leaving node (i, j), as pictured in Fig. 6, and the quadratic-
space algorithm is given in Fig. 7. A slight generalization of a linear-space version of
Fig. 7 appears in lines 26-35 of Fig. 9; its derivation is left as an exercise for the reader.
[Figure: node (i, j) with its three leaving edges: vertical to (i + 1, j) scored σ(a_{i+1} / −), diagonal to (i + 1, j + 1) scored σ(a_{i+1} / b_{j+1}), and horizontal to (i, j + 1) scored σ(− / b_{j+1}).]
FIG. 6. Edges leaving (i, j) and their scores.
1. S[M, N] ← 0
2. for j ← N − 1 down to 0 do
3.     S[M, j] ← S[M, j + 1] + σ(− / b_{j+1})
4. for i ← M − 1 down to 0 do
5.     S[i, N] ← S[i + 1, N] + σ(a_{i+1} / −)
6.     for j ← N − 1 down to 0 do
7.         S[i, j] ← max{S[i + 1, j] + σ(a_{i+1} / −), S[i + 1, j + 1] + σ(a_{i+1} / b_{j+1}), S[i, j + 1] + σ(− / b_{j+1})}
8. write "Maximum alignment score is" S[0, 0]
FIG. 7. Computing maximum path scores in the reverse direction: S[i, j] is the maximum score of a path from (i, j) to (M, N).
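For reference, here is a Python sketch of Fig. 7 (illustrative sigma again); S[i][j] ends up holding the best score of a path from (i, j) to (M, N), so S[0][0] equals the forward maximum.

```python
# A Python sketch of the reverse-direction procedure of Fig. 7, with a
# made-up illustrative scoring function sigma.
def sigma(top, bottom):
    if top == '-' or bottom == '-':
        return -2
    return 1 if top == bottom else -1

def reverse_scores_quadratic(A, B):
    """S[i][j] = maximum score of a path from (i, j) to (M, N)."""
    M, N = len(A), len(B)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(N - 1, -1, -1):               # row M: horizontal edges only
        S[M][j] = S[M][j + 1] + sigma('-', B[j])
    for i in range(M - 1, -1, -1):
        S[i][N] = S[i + 1][N] + sigma(A[i], '-')  # column N: vertical only
        for j in range(N - 1, -1, -1):
            S[i][j] = max(S[i + 1][j] + sigma(A[i], '-'),          # vertical
                          S[i + 1][j + 1] + sigma(A[i], B[j]),     # diagonal
                          S[i][j + 1] + sigma('-', B[j]))          # horizontal
    return S

# The reverse score at (0, 0) agrees with the forward computation:
print(reverse_scores_quadratic("TC", "CTC")[0][0])  # 0
```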
Hirschberg’s Insight. We are now ready to describe Hirschberg’s linear-space alignment algorithm; the algorithm delivers an explicit optimal alignment, not merely its score. First, make a ‘‘forward’’ score-only pass (Fig. 5), stopping at the middle row, i.e., row mid = ⌊M/2⌋. Then make a backward score-only
pass (the linear-space version of Fig. 7), again stopping at the middle row. Thus, for each point along the
middle row, we now have the optimal score from (0, 0) to that point and the optimal score from that point to
(M, N ). Adding those numbers gives the optimal score over all paths from (0, 0) to (M, N ) that pass
through that point. A sweep along the middle row, checking those sums, determines a point (mid, j) where
an optimal path crosses the middle row. This reduces the problem to finding an optimal path from (0, 0) to
(mid, j) and an optimal path from (mid, j) to (M, N ), which is done recursively.
Fig. 8A shows the two subproblems and each of their ‘‘subsubproblems’’. Note that regardless of
where the optimal path crosses the middle row, the total of the sizes of the two subproblems is just half the
size of the original problem, where problem size is measured by the number of nodes. Similarly, the total size of all subsubproblems is a fourth of the original size. Letting T be the size of the original, it follows that the total size of all problems, at all levels of recursion, is at most T + ½T + ¼T + · · · = 2T. Since computation time is directly proportional to the problem size, this approach delivers an optimal alignment in about twice the time needed to compute merely its score.
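Putting the pieces together, here is a compact Python sketch of Hirschberg’s strategy. The helper names forward_row, backward_row, and hirschberg are ours, sigma is the same made-up scoring function used throughout, and the M = 1 base case simply tries every placement of the single symbol.

```python
# A Python sketch of Hirschberg's linear-space alignment strategy,
# assuming the made-up illustrative scoring function sigma below.
def sigma(top, bottom):
    if top == '-' or bottom == '-':
        return -2
    return 1 if top == bottom else -1

def forward_row(A, B):
    """Maximum path scores from (0, 0) to every node of the last row."""
    S = [0] * (len(B) + 1)
    for j in range(1, len(B) + 1):
        S[j] = S[j - 1] + sigma('-', B[j - 1])
    for i in range(1, len(A) + 1):
        s = S[0]
        S[0] = c = S[0] + sigma(A[i - 1], '-')
        for j in range(1, len(B) + 1):
            c = max(S[j] + sigma(A[i - 1], '-'),      # vertical
                    s + sigma(A[i - 1], B[j - 1]),    # diagonal
                    c + sigma('-', B[j - 1]))         # horizontal
            s, S[j] = S[j], c
    return S

def backward_row(A, B):
    """Maximum path scores from every node of the first row to (M, N);
    by symmetry, a forward pass over the reversed sequences, read backward."""
    return forward_row(A[::-1], B[::-1])[::-1]

def hirschberg(A, B):
    """Return an optimal global alignment as (top row, bottom row)."""
    if len(A) == 0:
        return '-' * len(B), B
    if len(A) == 1:
        # Base case: the single symbol is aligned with some b_j or a dash.
        base = sum(sigma('-', b) for b in B)       # score with all of B gapped
        best_j, best = 0, base + sigma(A, '-')     # A opposite a dash
        for j in range(1, len(B) + 1):
            sc = base - sigma('-', B[j - 1]) + sigma(A, B[j - 1])
            if sc > best:
                best_j, best = j, sc
        if best_j == 0:
            return A + '-' * len(B), '-' + B
        return '-' * (best_j - 1) + A + '-' * (len(B) - best_j), B
    mid = len(A) // 2
    f = forward_row(A[:mid], B)            # scores (0, 0) -> (mid, j)
    b = backward_row(A[mid:], B)           # scores (mid, j) -> (M, N)
    j = max(range(len(B) + 1), key=lambda k: f[k] + b[k])
    top1, bot1 = hirschberg(A[:mid], B[:j])   # solve the two halves
    top2, bot2 = hirschberg(A[mid:], B[j:])
    return top1 + top2, bot1 + bot2
```

Each level of the recursion touches only two rows of scores at a time, so the space stays linear while the total work over all levels is about twice one forward pass.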
Fig. 8B shows a typical point in the alignment process. The initial portion of an optimal path will
have been determined, and the current problem is to report the aligned pairs along an optimal path from
(i_1, j_1) to (i_2, j_2). Fig. 9 provides detailed pseudo-code for the linear-space alignment algorithm.
[Figure: (A) the grid split at the middle row, showing the two subproblems and their four subsubproblems, with the optimal midpoint marked; (B) a partially determined optimal path from (0, 0), with the remaining problem from (i_1, j_1) to (i_2, j_2) shaded.]
FIG. 8. (A) The two subproblems and four subsubproblems in Hirschberg’s linear-space alignment procedure. (B) Snapshot of the execution of Hirschberg’s algorithm. Shaded areas indicate problems remaining to be solved.
1.  shared strings a_1 a_2 . . . a_M, b_1 b_2 . . . b_N
2.  shared temporary integer arrays S^−[0..N], S^+[0..N]
3.  procedure Align(M, N)
4.      if M = 0 then
5.          for j ← 1 to N do
6.              write (− / b_j)
7.      else
8.          path(0, 0, M, N)
⋮   (lines 9-20 not shown)
21.     for j ← j_1 + 1 to j_2 do
22.         c ← max{S^−[j] + σ(a_i / −), s + σ(a_i / b_j), c + σ(− / b_j)}
23.         s ← S^−[j]
24.         S^−[j] ← c
25.     /* find maximum path scores to (i_2, j_2) */
26.     S^+[j_2] ← 0
27.     for j ← j_2 − 1 down to j_1 do
28.         S^+[j] ← S^+[j + 1] + σ(− / b_{j+1})
29.     for i ← i_2 − 1 down to mid do
30.         s ← S^+[j_2]
31.         S^+[j_2] ← c ← S^+[j_2] + σ(a_{i+1} / −)
⋮   (remaining lines not shown)
FIG. 9. Detailed pseudo-code for the linear-space alignment algorithm.
Local Alignment. In many applications, a global (i.e., end-to-end) alignment of the two given
sequences is inappropriate; instead, a local alignment (i.e., involving only a part of each sequence) is
desired. In other words, one seeks a high-scoring path that need not terminate at the corners of the
dynamic-programming grid (Smith and Waterman, 1981). The highest local alignment score can be computed as follows:

S[i, j] ← maximum of the applicable cases:
    0                                  if 0 ≤ i ≤ M and 0 ≤ j ≤ N
    S[i − 1, j] + σ(a_i / −)           if 1 ≤ i ≤ M and 0 ≤ j ≤ N
    S[i − 1, j − 1] + σ(a_i / b_j)     if 1 ≤ i ≤ M and 1 ≤ j ≤ N
    S[i, j − 1] + σ(− / b_j)           if 0 ≤ i ≤ M and 1 ≤ j ≤ N
A single highest-scoring alignment can be found by locating the alignment’s end points
(which is straightforward to do in linear space), then applying Hirschberg’s strategy to the
two substrings bracketed by those points.
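A Python sketch of that first step, locating the end point of a highest-scoring local path in linear space (illustrative sigma as before; the input strings in the example are made up). The start point can be found symmetrically by running the same scan over the reversed substrings.

```python
# Linear-space scan for the highest local alignment score and its end
# point, following the local recurrence above; sigma is a made-up
# illustrative scoring function.
def sigma(top, bottom):
    if top == '-' or bottom == '-':
        return -2
    return 1 if top == bottom else -1

def best_local_score_and_end(A, B):
    """Highest local alignment score and the grid point where it ends."""
    N = len(B)
    S = [0] * (N + 1)          # S[0] stays 0: boundary cells score 0
    best, end = 0, (0, 0)
    for i in range(1, len(A) + 1):
        s, c = S[0], 0         # s = S[i-1][j-1]; c = S[i][j-1]
        for j in range(1, N + 1):
            c = max(0,
                    S[j] + sigma(A[i - 1], '-'),      # vertical
                    s + sigma(A[i - 1], B[j - 1]),    # diagonal
                    c + sigma('-', B[j - 1]))         # horizontal
            s, S[j] = S[j], c
            if c > best:
                best, end = c, (i, j)
    return best, end

# The shared substring TCA gives the best local score:
print(best_local_score_and_end("GGTCAA", "CCTCAT"))  # (3, (5, 5))
```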
Further complications arise when one seeks k best alignments, where k > 1. For
computing an arbitrary number of non-intersecting and high-scoring local alignments,
Waterman and Eggert (1987) developed a very time-efficient method. Producing a linear-
space variant of their algorithm requires ideas that differ significantly from those presented in previous sections (Huang and Miller, 1991; Huang et al., 1990).
REFERENCES
Gotoh, O. (1982) An improved algorithm for matching biological sequences. J. Mol.
Biol. 162, 705-708.
Hirschberg, D. S. (1975) A linear space algorithm for computing maximal common subsequences. Comm. ACM 18, 341-343.
Huang, X., R. Hardison and W. Miller (1990) A space-efficient algorithm for local similarities. CABIOS 6, 373-381.
Huang, X. and W. Miller (1991) A time-efficient, linear-space local similarity algorithm.
Advances in Applied Mathematics 12, 337-357.
Myers, E. and W. Miller (1988) Optimal alignments in linear space. CABIOS 4, 11-17.
Myers, E. and W. Miller (1989) Approximate matching of regular expressions. Bull.
Math. Biol. 51, 5-37.
Needleman, S. B. and C. D. Wunsch (1970) A general method applicable to the search for
similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 443-453.
Smith, T. F. and M. S. Waterman (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.
Waterman, M. S., T. F. Smith and W. A. Beyer (1976) Some biological sequence metrics.
Adv. Math. 20, 367-387.
Waterman, M. S. and M. Eggert (1987) A new algorithm for best subsequence alignments
with application to tRNA-rRNA comparisons. J. Mol. Biol. 197, 723-728.