Suffix trees are by far the most important data structure in stringology, with a myriad of applications in
fields like bioinformatics and information retrieval. Classical representations of suffix trees require Θ(n log n)
bits of space, for a string of size n. This is considerably more than the n log2 σ bits needed for the string itself,
where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice.
Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra
bits. This is already spectacular, but the linear extra bits are still unsatisfactory when σ is small as in DNA
sequences. In this paper we introduce the first compressed suffix tree representation that breaks this Θ(n)-
bit space barrier. The Fully Compressed Suffix Tree (FCST) representation requires only sublinear space on
top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic
time. This includes extracting arbitrary text substrings, so the FCST replaces the text using almost the
same space as the compressed text. An essential ingredient of FCSTs is the lowest common ancestor (LCA)
operation. We reveal important connections between LCAs and suffix tree navigation. We also describe
how to make FCSTs dynamic, i.e., support updates to the text. The dynamic FCST supports most of the same
operations; in particular, it can build the static FCST within optimal space and polylogarithmic time per
symbol. Our theoretical results are also validated experimentally, showing that FCSTs are very effective in
practice as well.
Categories and Subject Descriptors: E.4 [Coding and Information Theory]: Data Compaction and Compression; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—search process
General Terms: Algorithms, Performance, Theory
Additional Key Words and Phrases: Text processing, Pattern matching, String algorithms, Suffix tree, Data
compression, Compressed index
First and third authors supported by FCT through projects TAGS PTDC/EIA-EIA/112283/2009, HELIX
PTDC/EEA-ELC/113999/2009 and the PIDDAC Program funds (INESC-ID multiannual funding). Second
author partially funded by Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM
P05-001-F, Mideplan, Chile.
Preliminary partial versions of this paper appeared in LATIN 2008, LNCS 4957, pp. 362–373; and CPM
2008, LNCS 5029, pp. 191–203.
Authors’ address: Luís M. S. Russo, Arlindo Oliveira, Instituto de Engenharia de Sistemas e Computadores:
Investigação e Desenvolvimento (INESC-ID), R. Alves Redol 9, 1000-029 LISBON, PORTUGAL, and
Instituto Superior Técnico, Technical University of Lisbon (IST/UTL), Av. Rovisco Pais, 1049-001 LISBON,
PORTUGAL; {lsr,aml}@kdbio.inesc-id.pt.
Gonzalo Navarro, Dept. of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile.
gnavarro@dcc.uchile.cl.
Table I. Comparison between compressed suffix tree representations. The operations are defined
along Section 2.1 and are separated into a first group of general tree navigation and a second group
specific to suffix trees. The instantiation we show assumes σ = O(polylog(n)), and uses different
versions of the CSA of Grossi et al. for the CST and EBST, and the FM-index of Ferragina et al.
for the FCST. The space given holds for any k ≤ α log_σ n and any constant 0 < α < 1. The o(n)
space term in this instantiation is O(n / log log n). CST and EBST times should be multiplied by
a low-degree polynomial of log log n, which we omit for simplicity, as it would be dominated by
using an infinitesimally larger ε.

                     CST                       EBST                      FCST
Space in bits        (1 + 1/ε)nH_k + Θ(n)      (1 + 1/ε)nH_k + o(n)      nH_k + o(n)
ROOT                 O(1)                      O(1)                      O(1)
COUNT                O(1)                      O(1)                      O(1)
ANCESTOR             O(1)                      O(1)                      O(1)
PARENT               O(1)                      O(log^ε n)                O(log n log log n)
FCHILD               O(1)                      O(log^ε n)                O(log n log log n)
NSIB                 O(1)                      O(log^ε n)                O(log n log log n)
LCA                  O(1)                      O(log^ε n)                O(log n log log n)
TDEP                 O(1)                      Not supported             O((log n log log n)²)
TLAQ                 O(1)                      Not supported             O((log n log log n)²)
LETTER(v, i, ℓ)      O(log^ε n + ℓ/log_σ n)    O(log^ε n + ℓ/log_σ n)    O(log n log log n + ℓ)
CHILD                O(log^ε n)                O(log^ε n)                O(log n (log log n)²)
LOCATE               O(log^ε n)                O(log^ε n)                O(log n log log n)
SLINK                O(1)                      O(log^ε n)                O(log n log log n)
SLINK^i              O(log^ε n)                O(log^ε n)                O(log n log log n)
WEINERLINK           O(log n)                  O(log n)                  O(1)
SDEP                 O(log^ε n)                O(log^ε n)                O(log n log log n)
SLAQ                 O(log^{1+ε} n)            O(log^{1+ε} n)            O(log n log log n)
Table I assumes that the CST and EBST use the CSA of Grossi et al. [2003] (which requires
(1 + 1/ε)nH_k + o(n) bits for any constant ε > 0) and that the FCST uses the FM-index [Ferragina et al. 2007]
(which requires nH_k + o(n) bits), so as to take the preferred setting for each. In general the
FCST is slower than the CST, but it requires much less space. Assuming realistically
that for DNA H_k ≈ 2, Sadakane’s CST requires at the very least 8n + o(n) to 13n + o(n)
bits, depending on the CSA variant of Grossi et al. [2003] it uses, whereas the FCST
requires only 2n + o(n) bits (this theoretical prediction is not far from reality, as shown
in Section 7). The FCST space is optimal in the sense that no k-th order compressor
can achieve asymptotically less space to represent T. If the CST used the FM-index,
it would still incur the 6n extra bits, and the O(log^ε n) time complexities would become
O(log n log log n).
Table I also compares the Entropy-Bounded Suffix Tree (EBST) [Fischer et al. 2009;
Fischer 2010], a newer proposal that aims at maintaining the o(n) extra space of the
FCST while reducing navigation times. If it uses another version of the CSA by Grossi
et al. [2003] that requires o(n) extra bits on polylog-sized alphabets, it achieves sublogarithmic
time complexities for most operations. If we force it to use the FM-index to
achieve the least possible space (as the FCST does), its time complexities become
uncompetitive. There are previous incomplete theoretical proposals for compressed suffix trees
[Munro et al. 2001; Foschini et al. 2006]; a brief description is given at the end of
Section 3.
Our results are based on a special kind of sampling of suffix tree nodes. There is
some literature on sampled, or sparse, suffix trees. The pioneering work [Kärkkäinen
and Ukkonen 1996b] indexed evenly spaced suffixes (every k text positions). The resulting
structure required reduced space, O((n/k) log n) + n log σ bits, at the price of
multiplying the suffix tree search time by k and only handling patterns of length k
or more. Replacing the regular sampling with one guided by the Lempel-Ziv parsing
yielded the very first compressed text index [Kärkkäinen and Ukkonen 1996a]. This
index used the Lempel-Ziv properties to handle any pattern length, and later several
self-indexes based on Lempel-Ziv compression followed the same lines [Navarro 2004;
Ferragina and Manzini 2005; Russo and Oliveira 2008]. Sparse indexes that use evenly
spaced suffixes and orthogonal range searching were recently proposed for secondary
memory searching [Chien et al. 2008; Hon et al. 2009]. All these representations support
pattern searches, but not the full suffix tree functionality. Our sampling is different
in the sense that it samples suffix tree nodes, not text positions. This is the key to
achieving good upper bounds for all suffix tree operations.

Table II. Comparison between dynamic compressed suffix tree representations. The operations
are defined along Section 2.1. The same considerations of Table I apply, except that the instantiation
assumes the dynamic FM-index variant of Navarro and Sadakane [2010] as the CSA, for
which the space holds for any k ≤ α log_σ(n) − 1 and any constant 0 < α < 1.

                        Chan et al. [2007] (DCST)          Ours (DFCST)
Space in bits           nH_k + Θ(n)                        nH_k + o(n)
ROOT                    O(1)                               O(1)
COUNT                   O(log n / log log n)               O(1)
ANCESTOR                O(log n / log log n)               O(1)
PARENT                  O(log n / log log n)               O(log² n)
FCHILD                  O(log n / log log n)               O(log² n log log n)
NSIB                    O(log n / log log n)               O(log² n log log n)
LCA                     O(log n / log log n)               O(log² n)
LETTER(v, i, ℓ)         O(log n (log n + ℓ/log log n))     O(log n (log n + ℓ/log log n))
CHILD                   O(log² n log σ)                    O(log² n log log n)
LOCATE                  O(log² n)                          O(log² n)
SLINK                   O(log n / log log n)               O(log² n)
SLINK^i                 O(log² n)                          O(log² n)
WEINERLINK              O(log n / log log n)               O(log n / log log n)
SDEP                    O(log² n)                          O(log² n)
INSERT(T) / DELETE(T)   O(|T| log² n)                      O(|T| log² n)
Albeit very appealing, static FCSTs must be built from the uncompressed suffix tree.
Moreover, they must be rebuilt from scratch upon changes in the text. This severely
limits their applicability, as one needs to have a large main memory, or resort to
secondary memory construction, to end up with an FCST that fits in a reasonable main
memory. CSAs have overcome this limitation, starting with the structure by Chan
et al. [2004]. In its journal version [Chan et al. 2007] the work includes the first
dynamic CST, which builds on the static CST of Sadakane [2007] and retains its Θ(n)
extra space penalty (with constant at least 6). On the other hand, the smallest existing
CSA [Ferragina et al. 2007] was made dynamic within the same space by Navarro
and Sadakane [2010], so as to achieve a sublogarithmic slowdown with respect to the
static version.⁴ In this paper we show how to support dynamic FCSTs, by building on
this latter dynamic CSA. We retain the optimal space complexity and polylogarithmic
time for all the operations.
A comparison between the dynamic CST by Chan et al. [2007] and our dynamic
FCST is given in Table II. Both use the dynamic FM-index of Navarro and Sadakane
[2010], as that of Chan et al. [2007] uses O(σn) space and is not significantly faster.
Again, the FCST is slower but requires much less space (one can realistically predict
25% of Chan et al.’s CST space on DNA).
All these dynamic structures, as well as ours, handle a collection of texts, where
whole texts are added/deleted to/from the collection. Construction in compressed space
is achieved by inserting a text into an empty collection.
⁴ He and Munro [2010] obtained a very similar result, but their o(n) extra space term is larger than the
O(n log log n / log n) term achieved here.
[Fig. 1. Suffix tree T of string abbbab, with the leaves numbered. The arrow shows the SLINK between
node ab and b. Below we show the suffix array, A = 6 4 0 5 3 2 1. The portion of the tree corresponding to
node b and the respective leaves interval is within a dashed box.]

[Fig. 2. Reverse tree T^R. The sampled nodes have bold outlines.]

[Fig. 3. Parentheses representations of trees. On top, the suffix tree: ((0)((1)(2))((3)(4)((5)(6)))).
In the middle, the sampled tree: ( 0 1 2 (3)(4) 5 6 ), with bitmap B = 1 0 0 0 101101 0 0 1. On the
bottom, the sampled tree when b is also sampled: ( 0 1 2 ((3)(4) 5 6 )), with bitmap
B = 1 0 0 0 1101101 0 0 11. The numbers are not part of the representation; they are shown for clarity.
The rows labeled i: give the indexes of the parentheses.]
from v by the edge whose label starts with symbol X, if it exists. The suffix link of
a node v ≠ ROOT of a suffix tree, denoted SLINK(v), is a pointer to node v[1..] (that
is, the longest proper suffix of v; this node always exists). Note that SDEP(v) of a leaf
v identifies the suffix of T$ starting at position LOCATE(v) = n − SDEP(v). In our
example, T[LOCATE(ab$)..] = T[(7 − 3)..] = T[4..] = ab$. The list of LOCATE values
comprises another well-known structure.
Definition 2.2. [Manber and Myers 1993] The suffix array A[0, n − 1] of a text T is
the sequence of starting positions of the suffixes of T$ in lexicographical order. This is
the same as the LOCATE values of the suffix tree leaves, if the children of the nodes
are ordered lexicographically by their branching letters.

Note that A is a permutation, and the inverse permutation A^{-1}[j] gives the lexicographical rank
of T[j..] among all the suffixes of T$.
The suffix tree nodes can be identified with suffix array intervals: each node v corresponds
to the range [vl, vr] of leaves that descend from v (since there are no unary
nodes, there are no two nodes with the same interval). These intervals are also referred
to as lcp-intervals [Abouelhoda et al. 2004]. In our example, node b corresponds
to the interval [3, 6]. We will refer indifferently to nodes v and to their intervals [vl, vr].
Leaves v correspond to [v, v] in this notation; for example, by vl − 1 we refer to the
leaf immediately before vl, i.e., [vl − 1, vl − 1]. With this representation we can solve
COUNT(v) = vr − vl + 1, the number of leaves that descend from node v. In our example,
the number of leaves below b is 4 = 6 − 3 + 1. This is precisely the number of
times the string v occurs in the text T, and thus the pattern search problem for P reduces
to navigating from the ROOT to the point denoting P, then using COUNT to
determine the number of times P occurs in T, and using LOCATE(vl) . . . LOCATE(vr) to
output the occurrence positions.
The representation of ranges lets one trivially compute several other operations of
interest for suffix trees, such as ANCESTOR(v, v′) ⇔ vl ≤ vl′ ≤ vr′ ≤ vr; knowing
whether v follows v′ in T (⇔ vr′ < vl); whether the preorder of v is smaller than that of
v′ (⇔ vl < vl′ ∨ (vl = vl′ ∧ vr > vr′)); whether a node is a leaf (vl = vr); the leftmost leaf of
node v (vl); etc.
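The interval view of nodes makes these operations one-liners. The sketch below (function
names are ours; it reuses suffix_array from the previous sketch) finds the interval of a
pattern by binary search and then applies COUNT and ANCESTOR:

    import bisect

    def count(v):                      # COUNT(v) = vr - vl + 1
        vl, vr = v
        return vr - vl + 1

    def ancestor(v, w):                # ANCESTOR(v, w) <=> vl <= wl <= wr <= vr
        return v[0] <= w[0] and w[1] <= v[1]

    def is_leaf(v):                    # leaves are intervals of the form [v, v]
        return v[0] == v[1]

    def interval(T, A, P):             # suffix array interval [vl, vr] of P
        S = T + "$"
        suf = [S[i:] for i in A]       # conceptual; an index never builds this
        vl = bisect.bisect_left(suf, P)
        vr = bisect.bisect_right(suf, P + "\x7f") - 1  # \x7f > any text symbol
        return (vl, vr)

    T = "abbbab"; A = suffix_array(T)
    b = interval(T, A, "b")            # node b of the example
    print(b, count(b))                 # (3, 6) and COUNT = 4 occurrences
    print(ancestor(b, interval(T, A, "bb")))           # True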
We will also handle general trees of n nodes, which can be represented using 2n + o(n)
bits while supporting a number of traversal operations in constant time. This space
is asymptotically optimal considering all the trees of n nodes. For this paper we are
interested in the following operations: PREORDER(v) (and its inverse), which gives the
preorder position of node v in the tree, starting at zero; PARENT(v); LCA(v, v′); TDEP(v);
and TLAQ(v, d).
A useful tree representation, which will be necessary at some points in the paper,
is based on balanced parentheses: do a preorder traversal and write a ’(’ when you
arrive at a node and a ’)’ when you leave it. This sequence is regarded as a bitmap
supporting RANK and SELECT operations. In addition, the following operations on
the parentheses are supported: FINDMATCH(u) finds the matching parenthesis of u;
ENCLOSE(u) finds the nearest pair of matching parentheses that encloses u; and in
some cases DOUBLEENCLOSE(u, u′), which finds the nearest pair of parentheses that
encloses both u and u′.

These operations on the parentheses support most of the tree operations we need.
If tree node v is identified with the position of its opening parenthesis in the sequence B,
then PREORDER(v) = RANK_( (B, v) − 1, PREORDER^{-1}(i) = SELECT_( (B, i),
TDEP(v) = RANK_( (B, v) − RANK_) (B, v), PARENT(v) = ENCLOSE(v), and LCA(v, v′) =
DOUBLEENCLOSE(v, v′). Only operation TLAQ(v, d) needs special treatment. We will
use a representation that supports all of these operations within optimal space
[Pǎtraşcu and Viola 2010].
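The following sketch (ours) implements the parentheses operations with naive linear scans,
in place of the constant-time, 2n + o(n)-bit structures of Theorem 2.5, and checks the
identities above on the suffix tree parentheses of Fig. 3:

    def rank_open(B, v):               # RANK_( (B, v): number of '(' in B[0..v]
        return B[:v + 1].count("(")

    def select_open(B, i):             # SELECT_( (B, i): position of the i-th '('
        seen = 0
        for p, c in enumerate(B):
            seen += c == "("
            if seen == i:
                return p

    def find_match(B, v):              # FINDMATCH: the ')' matching the '(' at v
        excess = 0
        for p in range(v, len(B)):
            excess += 1 if B[p] == "(" else -1
            if excess == 0:
                return p

    def enclose(B, v):                 # ENCLOSE: nearest '(' strictly enclosing v
        excess = 0
        for p in range(v - 1, -1, -1):
            excess += 1 if B[p] == "(" else -1
            if excess == 1:
                return p

    def double_enclose(B, u, v):       # DOUBLEENCLOSE(u, v), with u < v
        p = enclose(B, v)
        while p is not None and p > u: # climb until u is enclosed as well
            p = enclose(B, p)
        return p

    B = "(()(()())(()()(()())))"       # suffix tree of Fig. 3, digits removed
    v = select_open(B, 3)              # third '(' in preorder
    print(rank_open(B, v) - 1)         # PREORDER(v) = RANK_( (B, v) - 1 = 2
    print(enclose(B, v))               # PARENT(v) = ENCLOSE(v) = 0, the ROOT
    print(double_enclose(B, v, 9))     # LCA with the node opened at 9: the ROOT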
THEOREM 2.5 ([SADAKANE AND NAVARRO 2010]). Let a general tree of n
nodes be represented as a sequence of 2n balanced parentheses. Then there exists a data
structure supporting operations PREORDER(v), PREORDER^{-1}(i), LCA(v, v′), TDEP(v),
TLAQ(v, d), and PARENT(v) on the tree, and RANK(v), SELECT(v), FINDMATCH(v),
ENCLOSE(v), and DOUBLEENCLOSE(v, v′) on the parentheses, in constant time t_tree =
O(c), using 2n + O(n/log^c n) bits of space.
In the dynamic case, we wish to INSERT tree nodes, and DELETE tree leaves
or unary nodes. The update operations are then translated into INSERT(u, u′) and
DELETE(u, u′), which insert or delete matching parentheses located at positions u, u′. On the
other hand, we will not need TLAQ(v, d).
THEOREM 2.6 ([NAVARRO AND SADAKANE 2010]). A sequence of 2n balanced
parentheses can be maintained in 2n + O(n log log n / log n) bits of space while supporting
the same operations of Theorem 2.5 except TLAQ(v, d), plus INSERT(v, v′) and
DELETE(v, v′), in t_tree = O(log n / log log n) worst-case time.
Table III. Operations supported by static compressed suffix arrays (CSAs) and their space and time
complexities, which hold for any 0 < ε < 1, l ≥ 1, k ≤ α log_σ n and constant 0 < α < 1. We give two
variants of Grossi et al. [2003].

                                   [Grossi et al. 2003]    [Grossi et al. 2003]                    [Ferragina et al. 2007]
Space in bits                      (1 + 1/ε)nH_k + Θ(n)    (1 + 1/ε)nH_k                           nH_k + O(n log σ log log n / log n)
                                                           + O(n log log n / log_σ^{ε/(1+ε)} n)    + O((n/l) log n)
ψ(v)                        t_ψ    O(1)                    O(1 + log σ / log log n)                O(1 + log σ / log log n)
A[v], A^{-1}[v], LOCATE(v)  t_SA   O(log_σ^ε n + log σ)    O(t_ψ log_σ^ε n)                        l · t_LF
LF(v), WEINERLINK(v)        t_LF   O(t_ψ log n)            O(t_ψ log n)                            O(1 + log σ / log log n)
v[i..i+ℓ−1], T[i..i+ℓ−1],
LETTER(v, i, ℓ)                    t_SA + ℓ/log_σ n        t_SA + ℓ/log_σ n                        t_SA + t_LF(ℓ−1) = t_LF(l + ℓ − 1)
and t_LF = O(t_ψ log n).⁵ For our results we favor another compressed suffix array,
called the FM-index [Ferragina et al. 2007], which requires nH_k + O(n log σ log log n / log n)
bits, with the same limit on k. Its complexities are⁶ t_ψ = t_LF = O(1 + log σ / log log n) (using
multiary wavelet trees again) and t_SA = O(l · t_LF). This l is a suffix array sampling
parameter, such that we need O((n/l) log n) extra bits of space. For example,
if we set the extra space to O(n / log log n) then we use l = log n log log n and achieve
t_SA = O(log n (log σ + log log n)). Table III summarizes the supported CSA operations
and times.
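A hedged toy of the sampling trade-off t_SA = O(l · t_LF) follows (ours; it reuses
suffix_array from Section 2, and LF is simply tabulated from A here rather than computed
from a wavelet tree): A[v] is stored only when it is a multiple of l, and the remaining
entries are recovered with fewer than l LF-steps:

    def lf_table(T, A):                # LF: rank of S[j..] -> rank of S[j-1..]
        S = T + "$"
        Ainv = [0] * len(S)
        for r, p in enumerate(A):
            Ainv[p] = r
        return [Ainv[(A[r] - 1) % len(S)] for r in range(len(S))]

    def locate(A, LF, v, l):           # A[v] via at most l-1 LF-steps
        sampled = {r: A[r] for r in range(len(A)) if A[r] % l == 0}
        steps = 0
        while v not in sampled:        # each LF-step moves one text position left
            v = LF[v]
            steps += 1
        return sampled[v] + steps

    T = "abbbab"; A = suffix_array(T); LF = lf_table(T, A)
    print([locate(A, LF, v, l=3) for v in range(len(A))])   # equals A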
We remark that, if log σ = Ω(log n / log log n), the extra space of the FM-index includes an
extra Ω(n)-bit term. Although this is still o(n log σ) bits, which can be argued to be
reasonable for a text T whose plain representation takes n log σ bits, the main point of
this paper is to break the Θ(n) space barrier. In this sense, our results are interesting
for log σ = o(log n / log log n), where the FM-index takes nH_k + o(n) bits of space. This is a
reasonable assumption on σ and includes the interesting case σ = O(polylog(n)), on
which the FM-index offers constant t_ψ and t_LF times.
Let us now consider dynamic CSAs. These handle a collection of texts, as if they
formed a single concatenated text. They offer the same functionalities of the static
CSAs on that text, plus insertion/deletion of texts into/from the collection. Two main
dynamic CSAs exist. That of Chan et al. [2007] is a natural extension of the static
CSA of Sadakane [2003]. It requires O(σn) bits of space, and offers complexities
t_ψ = t_LF = O(log n), t_SA = O(log² n), and insertion/deletion of a text T in time
O(|T|(t_ψ + t_LF)). The FM-index has several dynamic versions as well [Ferragina and
Manzini 2000; Mäkinen and Navarro 2008; González and Navarro 2008]. The most
efficient version [Navarro and Sadakane 2010] achieves nH_k + O(n log σ log log n / ((1−ε) log^ε n))
bits of space, for any k ≤ α log_σ(n) − 1 and any constants 0 < α, ε < 1. It offers times
t_ψ = t_LF = O((log n / log log n)(1 + log σ / log log n)), and t_SA = l · t_LF using other
O((n/l) log n) extra bits. Again, we set l = log n log log n to achieve O(n / log log n) extra
bits, which makes t_SA = O(log² n (1 + log σ / log log n)). Insertion and deletion of a text T
takes time O(|T|(t_ψ + t_LF)).⁷
⁵ The complexities of both variants are incorrectly mixed in Fischer et al. [2009], an error that carries over
to Fischer [2010].
⁶ Function ψ(v) can be computed in the same time as LF on the FM-index [Lee and Park 2007].
⁷ This was just O(|T| t_LF) in the original papers [Mäkinen and Navarro 2008; González and Navarro 2008],
but in Section 6 we will modify the deletion procedure to operate in left-to-right order. Thus our times are
O(|T| t_LF) for insertions and O(|T| t_ψ) for deletions.
Table IV. Operations supported by dynamic compressed suffix arrays (CSAs) and their space and
time complexities, which hold for any 0 < ε < 1, l ≥ 1, k ≤ α log_σ(n) − 1 and constant 0 < α < 1.

                                    [Chan et al. 2007]    [Navarro and Sadakane 2010]
Space in bits                       O(σn)                 nH_k + O(n log σ log log n / ((1−ε) log^ε n))
                                                          + O((n/l) log n)
ψ(v)                        t_ψ     O(log n)              O((log n / log log n)(1 + log σ / log log n))
A[v], A^{-1}[v], LOCATE(v)  t_SA    O(log² n)             l · t_LF
LF(v), WEINERLINK(v)        t_LF    O(log n)              O((log n / log log n)(1 + log σ / log log n))
v[i..i+ℓ−1], T[i..i+ℓ−1],
LETTER(v, i, ℓ)                     t_SA + t_ψ(ℓ−1)       t_SA + t_LF(ℓ−1) = (l + ℓ − 1) t_LF
INSERT(T) / DELETE(T)               |T|(t_ψ + t_LF)       |T|(t_ψ + t_LF)
Table IV summarizes these complexities. The dynamic CSA by Chan et al. [2007] is
the one used in their CST representation. In the FCST representation the focus is on minimal,
o(n), space, and therefore we will use the result by Navarro and Sadakane [2010].
A larger, but much faster, dynamic CSA was proposed by Gupta et al. [2007].
Their dynamic CSA requires n log σ + o(n log σ) bits of space and supports queries in
O(log log n) time, and even O(1) time when σ = O(polylog(n)). Updates, however, are much
more expensive: O(n^ε) amortized time, for 0 < ε < 1. The FCST representation may use
this dynamic CSA. However, for this to be useful, one should also use faster dynamic
trees. While there are dynamic tree representations supporting various operations in
O(log log n) time [Raman and Rao 2003; Arroyuelo 2008], none of them supports the
crucial LCA operation.
Finally, let us mention a previous data structure called a “compressed suffix tree”
but which, under the terminology of this paper, offers just compressed suffix array
functionality. Munro et al. [2001] propose what can be considered a predecessor
of Sadakane’s CST, as it uses a suffix array and a compact tree. By using it on top
of an arbitrary CSA, its smallest variant would take |CSA| + o(n) bits plus the text
(which could be compressed to nH_k + o(n log σ) bits with constant-time access
[Ferragina and Venturini 2007]) and find the suffix array interval corresponding to
pattern P[1, m] in time O(m t_SA log σ). The FM-index alone, however, is the smallest
CSA and can do the same without any other structure in time O(m t_LF), which is always
faster. Munro et al. [2001] can also achieve time O(m t_SA), but for this they require
|CSA| + O(n log σ) bits and still do not support any other suffix tree operation. There
exists another previous compressed suffix tree description [Foschini et al. 2006] based
on an interval representation and sampling of the suffix tree. However, the description
is extremely brief and no details or theoretical bounds are given.
This means that if we start at v and follow suffix links successively, i.e., v, SLINK(v),
SLINK(SLINK(v)), . . ., we will find a sampled node in at most δ steps. Note that this
property implies that the ROOT must be sampled, since SLINK(ROOT) is undefined.
We now show that it is possible to δ-sample a suffix tree.
THEOREM 4.2. There exists a δ-sampled tree S for any suffix tree T.

PROOF. We sample the nodes v such that SDEP(v) ≡_{δ/2} 0 and there is another node
v′ such that v = SLINK^{δ/2}(v′). Since SDEP(SLINK^i(v′)) = SDEP(v′) − i, this guarantees
that, for any v′ such that SDEP(v′) ≥ δ/2, the condition SDEP(SLINK^i(v′)) ≡_{δ/2} 0 holds
for exactly two values in the range i ∈ [0, δ − 1]. For the largest of those two i values, the
second sampling condition must hold as well. (If SDEP(v′) < δ/2, then v′ is sufficiently close
to the ROOT, which is sampled.) On the other hand, for each sampled node v ≠ ROOT,
there are at least δ/2 − 1 other non-sampled nodes that point to it via SLINK^i, as their
SDEP is not a multiple of δ/2. Hence there are s ≤ 1 + t/(δ/2) = O(t/δ) = O(n/δ)
sampled nodes.
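To make Theorem 4.2 concrete, the sketch below (ours; nodes are represented by their
path labels, so SLINK(v) = v[1:] and SDEP(v) = |v|) builds the suffix tree node set of the
running example by brute force, applies the sampling rule of the proof, and checks the
δ-sampling property of Definition 4.1:

    def suffix_tree_nodes(T):          # node labels of the suffix tree of T$
        S = T + "$"
        subs = {S[i:j] for i in range(len(S)) for j in range(i, len(S) + 1)}
        def right_exts(w):             # distinct symbols following occurrences of w
            return {S[i + len(w)] for i in range(len(S) - len(w))
                    if S[i:i + len(w)] == w}
        leaves = {S[i:] for i in range(len(S))}
        internal = {w for w in subs if len(right_exts(w)) >= 2}
        return leaves | internal       # internal includes the ROOT, label ""

    def delta_sample(nodes, delta):    # the sampling rule of Theorem 4.2
        h = delta // 2
        sampled = {v for v in nodes if len(v) % h == 0 and
                   any(len(u) == len(v) + h and u[h:] == v for u in nodes)}
        sampled.add("")                # the ROOT is always sampled
        return sampled

    nodes = suffix_tree_nodes("abbbab")
    S = delta_sample(nodes, delta=4)
    # Definition 4.1: every node reaches a sampled one in < delta suffix links
    assert all(any(v[i:] in S for i in range(4)) for v in nodes)
    print(sorted(S, key=len))          # ['', 'b$', 'bab$'] -- O(n/delta) nodes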
We represent the sampled tree S as a sequence of balanced parentheses, using
Theorem 2.5. Operations PREORDER_S, PREORDER_S^{-1}, PARENT_S, TLAQ_S, LCA_S, and
TDEP_S are all supported in constant time and O(n/δ) bits of space. We will also need
to store, in arrays indexed by PREORDER_S(v), the values SDEP(v) and TDEP(v) for
the sampled nodes (do not confuse TDEP(v) = TDEP_T(v), the depth of a sampled node
v in the suffix tree, with TDEP_S(v), the depth of v in the sampled tree). These arrays
require O((n/δ) log n) bits of space.
In the dynamic case we use Theorem 2.6 to represent S with balanced parentheses.
This takes O(n/δ) bits and supports operations PREORDER_S, PREORDER_S^{-1}, PARENT_S,
LCA_S, and TDEP_S, all in O(log n / log log n) time. The structure also supports insertion
and deletion of leaves and unary nodes. The representation also needs to maintain the
SDEP values of the nodes in S, which are handled using a simple dynamic structure such
as that presented by Navarro and Sadakane [2010]: it allows inserting, deleting and
accessing the values in O(log n / log log n) time while using O((n/δ) log n) bits of space.
In order to make effective use of the sampled tree, we need a way to map any node
v to its lowest sampled ancestor, LSA(v). Another important operation is the lowest
common sampled ancestor, LCSA(v, v′) = LSA(LCA(v, v′)), i.e., the lowest common ancestor
in the sampled tree S. In our example, LCSA(3, 4) is the ROOT, whereas LCA(3, 4)
is [3, 6], i.e., the node labeled b. The next lemma shows how the general LCSA and LSA
queries can be answered if LSA for leaves is available, and then we go on to solve that
specific problem. The mapping will also let us compute the range [vl, vr] of a sampled
node v.
LEMMA 4.3. Let v and v′ be nodes of a suffix tree T and S a δ-sampled subtree;
then the following properties always hold:

    v₁ ancestor of v₂ ⇒ LSA(v₁) ancestor of LSA(v₂)          (1)
    LCSA(v, v′) = LCA_S(v, v′), when v and v′ belong to S     (2)
    LCSA(v, v′) = LCSA(LSA(v), LSA(v′))                       (3)
    LCSA(v, v′) = LCA_S(LSA(v), LSA(v′))                      (4)
    LSA(v) = LCA_S(LSA(vl), LSA(vr))                          (5)
PROOF. For (1), LSA(v₁) is transitively an ancestor of v₂ and it is sampled; thus, by
definition of LSA, it is also an ancestor of LSA(v₂).

For the rest of the proof let us define v′′ = LCSA(v, v′) = LSA(LCA(v, v′)). For
Eq. (2) note that v′′ is a node of S and it is an ancestor of both v and v′, since it is
Consider for example the leaf numbered 5 in Figure 3. This leaf is not sampled, but
in the original tree it appears between leaf 4 and the end of the tree, more specifically
between the ’)’ parenthesis of 4 and the ’)’ parenthesis of the ROOT. Thus PRED(5) = 4. In
this case, since the parenthesis we obtain is a ’)’, we know that the LSA is the parent of
that node.
In the opposite direction, we wish to find out the leaf interval [vl, vr] corresponding
to a sampled node identifier v of S. This is not hard to do:

    vl = RANK₀(B, SELECT₁(B, v))
    vr = RANK₀(B, SELECT₁(B, FINDMATCH_S(v))) − 1

Summarizing, we can map from sampled nodes in S to suffix tree nodes [vl, vr], as
well as in the reverse direction with operations LSA and LCSA, all in constant time and using
O((n/δ) log δ) bits of space.
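The sketch below (ours; it reuses find_match from the parentheses sketch of Section 2,
and v is 1-based over the parentheses of S) runs these two formulas on the middle example
of Fig. 3, where B = 1000101101001 interleaves the sampled-tree parentheses (1s) with the
suffix tree leaves (0s):

    B = "1000101101001"                # Fig. 3, middle: ( 0 1 2 (3)(4) 5 6 )
    P = "(()())"                       # the parentheses of S, i.e., the 1s of B

    def rank0(B, p):                   # RANK_0(B, p): number of 0s in B[0..p]
        return B[:p + 1].count("0")

    def select1(B, i):                 # SELECT_1(B, i): position of the i-th 1
        return [p for p, b in enumerate(B) if b == "1"][i - 1]

    def interval_of(v):                # [vl, vr] of the v-th parenthesis of S
        m = find_match(P, v - 1) + 1   # FINDMATCH_S, back to 1-based over B's 1s
        vl = rank0(B, select1(B, v))
        vr = rank0(B, select1(B, m)) - 1
        return (vl, vr)

    print(interval_of(1))              # root of S covers all leaves: (0, 6)
    print(interval_of(2))              # sampled node (3) is the leaf [3, 3]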
In the dynamic case, we use Theorem 2.4 to handle B and Theorem 2.6 to handle
S. This retains the same space, and operations cost O(log n / log log n) time. The update
operations we will need to carry out are: (i) insertion/deletion of leaves in B, when a
leaf appears in / disappears from the suffix tree T, and (ii) insertion/deletion of pairs of
matching parentheses in/from S (and of their corresponding 1s in B), when nodes become
sampled/unsampled in S. Figure 3 illustrates the effect of sampling b = [3, 6] in our
running example.
For (i), we will want to insert a new leaf (that is, a suffix array position) between
leaves v − 1 and v. If v − 1 and v are consecutive in B, i.e., SELECT₀(B, v − 1) + 1 =
SELECT₀(B, v), then we simply do INSERT(B, SELECT₀(B, v), 0). Yet, in general there
could be several sampled nodes containing the leaf. Thus the general procedure is
as follows. The new leaf is a child of some internal node v′ of T. We assume that, in
case v′ had to be sampled due to the update, it is already in S. Before the new leaf
is inserted in B, since v′ cannot be unary, it is an ancestor of leaves v − 1 or v or
both. Let us assume v′ is an ancestor of v − 1; the other case is similar. We compute t =
TDEP_S(LSA(v − 1)) − TDEP_S(LSA(v′)) and run INSERT(B, SELECT₀(B, v − 1) + t + 1, 0).
To remove leaf number v we run DELETE(B, SELECT₀(B, v)).
For (ii), the insertion of a new node v = [vl, vr] in the sampled tree translates into
the insertion of a matching parentheses pair at positions (u, u′) in S. For example, if
the new node encloses the current node v then u = v and u′ = FINDMATCH_S(v); if it
is a leaf first child of v then u = u′ = v + 1; if it is a leaf next sibling of v then
u = u′ = FINDMATCH_S(v) + 1. After carrying out the insertion on S (via INSERT_S(u, u′ +
1)), we must update B. We compute m′ = max(SELECT₁(B, u′), SELECT₀(B, vr))
for INSERT(B, m′ + 1, 1) and then m = max(SELECT₁(B, u), SELECT₀(B, vl)) for
INSERT(B, m, 1). For removing a sampled node v, after DELETE_S(u, u′) for u = v and
u′ = FINDMATCH_S(v), we also update B by DELETE(B, SELECT₁(B, u′)) and then
DELETE(B, SELECT₁(B, u)).
Thus, all the updates required for the dynamic case can be carried out in
O(log n / log log n) time per update to S or to T.
5. SUFFIX TREE NAVIGATION

We start this section by showing in Lemma 5.1 a simple relation between the SLINK
and LCA operations, and we use this relation to obtain an algorithmic way of computing
the SDEP value of non-sampled nodes⁸, in Lemma 5.2. This algorithmic procedure

⁸ A detailed exposition of why these properties are important for representing suffix trees is given in
Appendix 9.
[Fig. 4. Schematic representation of the relation between LCA and SLINK; see Lemma 5.1. Curved
arrows represent SLINK and straight arrows the ψ function.]

[Fig. 5. Schematic representation of the v_{i,j} nodes of the SLAQ operation. The nodes sampled because
of Definition 4.1 are in bold and those sampled because of the TDEP condition are filled.]
and retrieve its stored SDEP. The overall process takes O(t_ψ δ) time. Note that in the
dynamic scenario the rank and tree operations are slower by an O(log n / log log n) factor.
Likewise, SDEP and LCA simplify to

    SDEP(v) = SDEP(LCA(v, v)) = max_{0 ≤ i < δ} { i + SDEP(LCSA(ψ^i(vl), ψ^i(vr))) }
    LCA(v, v′) = LF(v[0..i−1], LCSA(ψ^i(min{vl, vl′}), ψ^i(max{vr, vr′}))),

where i is the value that maximizes the corresponding SDEP expression. Now it is
finally clear that we do not need SLINK to compute LCA. The time to
compute LCA is thus O((t_ψ + t_LF)δ), and that to compute SDEP is O(t_ψ δ). Using
LCA we compute SLINK(v) = LCA(ψ(vl), ψ(vr)) in O((t_ψ + t_LF)δ) time, and SLINK^i(v) =
LCA(ψ^i(vl), ψ^i(vr)) in O(t_SA + (t_ψ + t_LF)δ) time. Note that the arguments to LCSA
do not necessarily correspond to nodes, but the formulas hold in this case too.
THEOREM 5.5. Suffix tree operations SDEP, LCA, SLINK, and SLINK^i can be computed
respectively in O(t_ψ δ), O((t_ψ + t_LF)δ), O((t_ψ + t_LF)δ), and O(t_SA + (t_ψ + t_LF)δ) time,
provided a CSA implements ψ in O(t_ψ), LF in O(t_LF), and A and A^{-1} in O(t_SA) time,
and we have a δ-sampled suffix tree.

⁹ Most CSAs already include such a sampling in one way or another [Navarro and Mäkinen 2007].
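The identity SLINK(v) = LCA(ψ(vl), ψ(vr)) can be checked on the toy example with the
sketch below (ours; it reuses suffix_array and suffix_tree_nodes from the earlier
sketches, with ψ(i) = A^{-1}[A[i] + 1] and a brute-force LCA over node intervals):

    T = "abbbab"; text = T + "$"
    A = suffix_array(T)
    Ainv = [0] * len(text)
    for r, p in enumerate(A):
        Ainv[p] = r
    psi = lambda i: Ainv[(A[i] + 1) % len(text)]

    def node_interval(w):              # suffix array interval of node label w
        ranks = [r for r in range(len(A)) if text[A[r]:].startswith(w)]
        return (min(ranks), max(ranks))

    intervals = sorted({node_interval(w) for w in suffix_tree_nodes(T)},
                       key=lambda iv: iv[1] - iv[0])

    def lca(u, v):                     # smallest node interval containing u and v
        return next(iv for iv in intervals
                    if iv[0] <= min(u, v) and max(u, v) <= iv[1])

    vl, vr = node_interval("ab")       # node ab = [1, 2]
    print(lca(psi(vl), psi(vr)))       # its suffix link: node b = (3, 6)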
The overall process requires time O(t_SA + (t_ψ + t_LF)θ) to access the last letters of v₁
and build D, plus O(log(n/θ)) for binary searching the samples, plus O(t_SA log θ) for
the final binary search. For example, we can just use θ = δ.¹⁰ In this case the time is
O((t_ψ + t_LF)δ + t_SA log δ + log n) and the extra space is O((n/δ) log n) bits. This is the
value used in Tables V and VI.

Yet, regarding our discussion in Section 8, we wish to avoid more extra spaces of
this magnitude. For the particular case of using the FM-index (under which we get our
best results), we can do the following to achieve the same time with less extra space.
Set θ = (t_SA / t_LF) log t_SA, so that the overall time is O(log n + (1 + t_ψ/t_LF) t_SA log t_SA)
(which is O(log n (log log n)²) in Table I), and the extra space for the sampling is O((n/θ) log n).
Recall Table III, where we defined t_SA = l · t_LF and chose l = log n log log n, to have
O(n / log log n) = o(n) extra bits of space for the CSA. Hence O((n/θ) log n) = O(n log n / (l log l)).
This is less than the O((n/l) log n) bits paid by the CSA for its own sampling. For the
value of l we have chosen, it is O(n / (log log n)²).
In a dynamic scenario, we do not store exactly the A[jθ] values; instead we guarantee
that for any k there is a k′ such that k − θ < k′ ≤ k and A[k′] is sampled, and the same for
A^{-1}. Still, the sampled elements of A and the m′ to use can be easily obtained in O(log n)
time. Those sampled sequences are not hard to maintain upon insertions/deletions in
A. For example, Mäkinen and Navarro [2008, Sec. 7.1] describe how to maintain A^{-1}
(called SC there), and essentially how to maintain A (called SA there; the only
missing point is how to maintain approximately spaced samples in A, which can be
done exactly as for A^{-1}). Thus the space remains the same and the O(log n) term in
the complexity becomes O(log² n).
Computing TDEP(v). To compute TDEP we add other O(n/δ) nodes to the sampled
tree S, so as to guarantee that, for any suffix tree node v, PARENT^j(v) is sampled for
some 0 ≤ j < δ. Recall that the TDEP(v) values are stored in S. Since TDEP(v) =
TDEP(LSA(v)) + j, where LSA(v) = PARENT^j(v), TDEP(v) can be computed by reading
TDEP(LSA(v)) and adding the number of nodes between v and LSA(v). The sampling
guarantees that j < δ. Hence, to determine j, we iterate PARENT until reaching LSA(v).
The total cost is O((t_ψ + t_LF)δ²).
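A brute-force check of the TDEP(v) = TDEP(LSA(v)) + j rule (ours, reusing intervals and
node_interval from the previous sketch; the sampled set and its stored depths are computed
directly here instead of via the structures of Section 4):

    def parent(v):                     # smallest interval strictly containing v
        return next(iv for iv in intervals
                    if iv != v and iv[0] <= v[0] and v[1] <= iv[1])

    def tdepth(v):                     # plain tree depth, for reference
        return 0 if v == (0, 6) else 1 + tdepth(parent(v))

    sampled = {(0, 6), (3, 6)}         # assume the ROOT and node b are sampled
    stored = {v: tdepth(v) for v in sampled}

    def tdep(v):                       # iterate PARENT until a sampled ancestor
        j = 0
        while v not in sampled:        # the sampling guarantees j < delta
            v, j = parent(v), j + 1
        return stored[v] + j           # TDEP(v) = TDEP(LSA(v)) + j

    print(tdep(node_interval("bb")))   # depth 2: ROOT -> b -> bb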
To achieve this sampling property, we sample the nodes v such that TDEP(v) ≡_{δ/2} 0
and HEIGHT(v) ≥ δ/2. Since TDEP(PARENT^i(v)) = TDEP(v) − i, the first condition
holds for exactly two i’s in [0, δ − 1] if TDEP(v) ≥ δ/2. Since HEIGHT is strictly increasing
towards the root, the second condition holds for sure for the largest i. On the other hand, since
every sampled node has at least δ/2 descendants that are not sampled, it follows that
we sample O(n/δ) extra nodes with this criterion.

We are unable to maintain either the sampled TDEP values or the sampling property
in the dynamic scenario. Therefore, this operation and the next two are not supported in
the dynamic case.
Computing TLAQ(v, d). We extend the notation PARENT_S(v) to represent LSA(v)
when v is not sampled. Recall that the sampled tree supports constant-time level ancestor
queries. Hence we have any PARENT_S^i(v) in constant time, for any node v and
any i. We binary search PARENT_S^i(v) to find the sampled node v′ with TDEP(v′) ≥ d >
TDEP(PARENT_S(v′)). Notice that this can be computed by evaluating only the second
inequality, which refers to sampled nodes only. Now we iterate the PARENT operation,
from v′, exactly TDEP(v′) − d times. We need the additional sampling introduced for
TDEP to guarantee that TDEP(v′) − d < δ. Hence the total time is O(log n + (t_ψ + t_LF)δ²).
¹⁰ This speedup immediately improves the results of Huynh et al. [2006].
Computing SLAQ(v, d). We start by binary searching for the value m such that
v′ = PARENT_S^m(SLINK^{δ−1}(v)) satisfies SDEP(v′) ≥ d − (δ − 1) > SDEP(PARENT_S(v′)).
Now we scan all the sampled nodes v_{i,j} = PARENT_S^j(LSA(LF(v[i..δ − 1], v′))) with
SDEP(v_{i,j}) ≥ d − i and i, j < δ. This means that we start at node v′, follow LF, reduce
every node found to the sampled tree S, and use PARENT_S until the SDEP of the node
drops below d − i. Our aim is to find the v_{i,j} that minimizes SDEP(v_{i,j}) − (d − i) ≥ 0,
and then apply the LF mapping to it.

The time to perform this operation depends on the number of existing v_{i,j} nodes. For
this operation the sampling must satisfy Definition 4.1 and the condition for computing
TDEP. Each condition contributes at most two sampled nodes for every δ nodes
in one direction (SLINK or PARENT). Therefore, there are at most 4δ nodes v_{i,j} (see
Figure 5), and thus the time is O(log n + (t_ψ + t_LF)δ). Unfortunately, the same trick
does not work for TDEP and TLAQ, because we cannot know which is the “right” node
without bringing all of them back with LF.
Computing FCHILD. To find the first child of v = [vl, vr], where vl ≠ vr,
we simply compute SLAQ(vl, SDEP(v) + 1). Likewise, if we use vr we obtain the
last child. It is possible to avoid the binary search step of SLAQ by choosing
v′ = PARENT_S^m(LSA(SLINK^{δ−1}(vl))) for m = TDEP_S(LSA(SLINK^{δ−1}(vl))) −
TDEP_S(LSA(SLINK^{δ−1}(v))) − 1, if m ≥ 0, and v′ = SLINK^{δ−1}(vl) if m = −1. Thus
the time is O((t_ψ + t_LF)δ).

In the dynamic case we do not have SLAQ. Instead, FCHILD(v) can be determined
by computing X = LETTER(vl, SDEP(v) + 1) and then CHILD(v, X). The time for CHILD
dominates.
Computing NSIB. The next sibling of v = [vl, vr] can be computed as SLAQ(vr +
1, SDEP(PARENT(v)) + 1), for any v ≠ ROOT. Likewise, we can obtain the previous
sibling with vl − 1. We must check that the answer has the same parent as v, to cover
the case where there is no previous/next sibling. We can also skip the binary search.

Again, in the dynamic case NSIB(v) can be computed with CHILD: if PARENT(v) = v′ =
[vl′, vr′] and vr′ > vr, then we compute X = LETTER(vr + 1, SDEP(v′) + 1) and do
CHILD(v′, X).
Table V. Comparison between compressed suffix tree representations. We omit the operations that
are carried out directly on the CSA; see Table III. We simplify the FCST complexities by assuming
δ = ω(log n), as otherwise the extra space is not o(n). We also assume that t_ψ, t_LF, t_SA = Ω(t_tree).
The f of the EBST must be O(log n / log log n) and Ω(log^[r] n) for some constant r ≥ 0, where log^[r]
denotes r applications of log to n. For the EBST, “not supported” means that it needs at least twice
the space to support those operations. Notice that CHILD can, alternatively, be computed using
FCHILD and at most σ times NSIB.

                CST                  EBST                              Ours (FCST)
Space in bits   |CSA| + 6n + o(n)    |CSA| + O(n/f)                    |CSA| + O((n/δ) log n)
ROOT            O(1)                 O(1)                              O(1)
COUNT           O(1)                 O(1)                              O(1)
ANCESTOR        O(1)                 O(1)                              O(1)
PARENT          t_tree               t_SA f log log n                  (t_ψ + t_LF)δ
FCHILD          t_tree               t_SA f log² f                     (t_ψ + t_LF)δ
NSIB            t_tree               t_SA f (log² f + log log n)       (t_ψ + t_LF)δ
LCA             t_tree               t_SA f (log² f + log log n)       (t_ψ + t_LF)δ
TDEP            t_tree               Not supported                     (t_ψ + t_LF)δ²
TLAQ            t_tree               Not supported                     (t_ψ + t_LF)δ²
CHILD           t_SA log σ           t_SA (f log² f + log σ)           t_SA log δ + (t_ψ + t_LF)δ
SLINK           t_ψ                  t_SA f (log² f + log log n)       (t_ψ + t_LF)δ
SLINK^i         t_SA                 t_SA f (log² f + log log n)       t_SA + (t_ψ + t_LF)δ
SDEP            t_SA                 t_SA f log² f                     t_ψ δ
SLAQ            t_SA log n           t_SA f log n log² f               (t_ψ + t_LF)δ
CSA. For a node v represented as [vl, vr], with PARENT(v) represented as [vl′, vr′], if vl ≠ vl′
then vl is the index that represents v; otherwise we use vr. The ROOT is represented
by 0. This identifier is computed in O((t_ψ + t_LF)δ) time, and it guarantees that no index
represents more than one node (as only the highest node of a leftmost/rightmost path
can use the shared vl/vr value), but some indexes may represent no node at all. More
precisely, this scheme yields identifiers in the range [0, n − 1] for the internal nodes,
whereas there are only t − n < n of them.
6. UPDATING THE SUFFIX TREE AND ITS SAMPLING

The static FCST requires that we first build the classical suffix tree and then sample
it. Thus the machine used for construction must have a very large main memory, or
we must resort to secondary memory suffix tree construction. Dynamic FCSTs permit
handling a text collection where queries are interleaved with insertions and deletions
of texts along time, and their space is asymptotically the same as that of their static variant.
In particular, they solve the problem of constructing the static FCST within asymptotically
the same space of the final static FCST: start with an empty text collection,
insert T, and then turn all the data structures into their static equivalents.

Throughout the paper we have given static and dynamic variants of all the data structures
we have introduced. What remains is to explain how to modify our suffix tree representation
to reflect the changes caused by inserting and removing texts T, and how to
maintain our sampling conditions upon updates.
The CSA of Mäkinen and Navarro [2008], on which we build, inserts T in right-to-left
order. It first determines the position of the new terminator and then uses LF to
find the consecutive positions of longer and longer suffixes, until the whole T is inserted.¹¹
This right-to-left method perfectly matches the algorithm by Weiner [1973]
to build the suffix tree of T: it first inserts suffix T[i + 1..] and then suffix

¹¹ This insertion point is arbitrary in that CSA, thus there is no order among the texts. Moreover, all the
terminators are the same in the CSA, yet this can be easily modified to handle different terminators, as
required in some bioinformatic applications [Gusfield 1997].
T[i..], finding the points in the tree where the node associated to the new suffix is
to be created, if it does not already exist. The node is found by using PARENT until the
WEINERLINK operation returns a non-empty interval. This requires one PARENT and
one WEINERLINK amortized operation per symbol of T. This algorithm has the important
invariant that the intermediate data structure is a suffix tree. Hence, by carrying
it out in synchronization with the CSA insertion algorithm and with the insertion of
the new leaves in bitvector B, we can use the current CSA and FCST to implement
PARENT and WEINERLINK.
To maintain the property that the intermediate structure is a suffix tree, the deletion
of a text T must proceed by first locating the node of T that corresponds to T,¹² and
then using SLINKs to remove all the nodes corresponding to its suffixes in T. We must
simultaneously remove the leaves in the CSA and in bitmap B (Mäkinen et al.’s CSA
deletes a text right-to-left, but it is easy to adapt it to use ψ instead of LF and proceed
left-to-right).
6.1. Maintaining the Sampling

We now explain how to update the sampled tree S whenever nodes are inserted into or
deleted from the (virtual) suffix tree T. The sampled tree must maintain, at all times,
the property that for any node v there is an i < δ such that SLINK^i(v) is sampled. The
following concept, from Russo and Oliveira [2008], is useful to explain how to obtain
this result.

Definition 6.1. The reverse tree T^R of a suffix tree T is the minimal labeled tree
that, for every node v of T, contains a node v^R denoting the reverse string of the path
label of v.
We note that we are neither maintaining nor sampling T^R; we just use it as a conceptual device.
Figure 2 shows a reverse tree. Observe that, since there is a node with path label ab
in T, there is a node with path label ba in T^R. We can therefore define a mapping R
that maps every node v to v^R. Observe that for any node v of T, except for the ROOT,
we have that SLINK(v) = R^{-1}(PARENT(R(v))). This mapping is partially shown in
Figures 1 and 2 by the numbers. Hence the reverse tree stores the information of the
suffix links. For a regular sampling we choose the nodes for which TDEP(v^R) ≡_{δ/2} 0
and HEIGHT(v^R) ≥ δ/2. This is equivalent to our sampling rules on T (Theorem 4.2):
since the reverse suffixes form a prefix-closed set, T^R is a non-compact trie, i.e., each
edge is labeled by a single letter. Thus, SDEP(v) = TDEP(v^R). The rule for HEIGHT(v^R)
is obviously related to that on SLINK(v) by R. See Figure 2 for an example of this
sampling.
Notice that inserting or removing a node from a suffix tree never changes the SDEP
of the other nodes in the tree, and hence it does not change any TDEP in T^R. This
means that, whenever the suffix tree is modified, the only nodes that can be inserted
into or deleted from the reverse tree are leaves. In T this means that when a node is
inserted it does not break a chain of suffix links; it is always added at the beginning
of such a chain. Weiner’s algorithm works precisely by appending a new leaf to a node
of T^R.
Assume that we are using Weiner’s algorithm and decide that the node X.v should
be added, where we know the representation of node v. All we need to do to update the
structure of the sampled tree is to verify whether, by adding (X.v)^R as a child of v^R in
T^R, we increase the HEIGHT of some ancestor in T^R that will now become sampled.
Hence we must scan upwards in T^R to verify whether this is the case. Also, we do not need to
maintain HEIGHT values. Instead, if the distance from (X.v)^R to the closest sampled
node (v′)^R is exactly δ/2 and TDEP((v′)^R) ≡_{δ/2} 0, then we know that v′ meets the sampling
condition and we sample it. Operationally, the procedure is as follows: compute
v′ = SLINK^{δ/2}(X.v); if v′ is not in S (v′ ≠ LSA(v′)) and SDEP(v′) ≡_{δ/2} 0, then add v′ to
S.

¹² The dynamic CSA by Mäkinen and Navarro [2008] provides this functionality by returning a handle when
inserting a text T, which can be used later to retrieve the CSA position of its first or last symbol. This requires
O(N log n) extra bits of space when handling a collection of N texts of total length n, which is negligible
unless one has to handle many short texts.
Deleting a node (i.e., a leaf in T^R) is slightly more complex and involves some
reference counting. This time, assume we are deleting node X.v; again we need to
scan upwards, this time to decide whether to make a node non-sampled. However,
SDEP(v) − SDEP(v′) < δ/2 is not enough, as it may be that HEIGHT(v′^R) ≥ δ/2 because
of some other descendant. Therefore, every sampled node v′ counts how many descendants
it has at distance δ/2. A node becomes non-sampled (i.e., we remove it from S)
only when this counter reaches zero. Insertions and deletions of nodes in T must update
these counters, by increasing/decreasing them whenever inserting/deleting a leaf
at distance exactly δ/2 from a sampled node.
As there are O(n/δ) sampled nodes, reference counters count also sampled nodes,
and no sampled node can be counted in two counters, we have that the sum of all
the counters is also O(n/δ). Hence we represent them all using a bitmap C of O(n/δ)
bits. C stores a 1 associated to each PREORDER_S(v), and that 1 is followed by as
many 0s as the value of the counter for v. Hence the value of the counter for v is
retrieved as SELECT₁(C, PREORDER_S(v) + 1) − SELECT₁(C, PREORDER_S(v)) − 1;
increasing the counter for v translates into INSERT(C, SELECT₁(C, PREORDER_S(v)) +
1, 0); and decreasing the counter into DELETE(C, SELECT₁(C, PREORDER_S(v)) +
1). Similarly, the insertion of a new node v into S must be followed by operation
INSERT(C, SELECT₁(C, PREORDER_S(v)), 1), and its deletion must be preceded
by DELETE(C, SELECT₁(C, PREORDER_S(v))). Using Theorem 2.4, structure C takes
O(n/δ) bits and carries out all these operations in O(log n / log log n) time.
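A sketch of the counter bitmap follows (ours; naive list operations stand in for the
dynamic bitmap of Theorem 2.4, and select here is 1-based, so the 1 of the node with
preorder v is the (v+1)-th 1):

    C = list("100101000")              # counters 2, 1, 3 for preorders 0, 1, 2

    def select1(C, i):                 # position of the i-th 1 (1-based)
        return [p for p, b in enumerate(C) if b == "1"][i - 1]

    def counter(v):                    # gap between the 1 of v and the next 1
        nxt = select1(C, v + 2) if v + 2 <= C.count("1") else len(C)
        return nxt - select1(C, v + 1) - 1

    def increase(v):                   # INSERT a 0 right after the 1 of v
        C.insert(select1(C, v + 1) + 1, "0")

    def decrease(v):                   # DELETE the 0 right after the 1 of v
        C.pop(select1(C, v + 1) + 1)

    print([counter(v) for v in range(3)])    # [2, 1, 3]
    increase(1); decrease(0)
    print([counter(v) for v in range(3)])    # [1, 2, 3]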
Hence, to INSERT or DELETE a node requires O((t_ψ + t_LF)δ) time to find out whether
to modify the sampling, plus O(log n / log log n) time to update S and the associated structures
when necessary (S itself, B, C, etc.), plus O(log n) time to modify the sampled
A and A^{-1} arrays. Added to the constant amortized number of calls to PARENT and
WEINERLINK per text symbol, we have an overall time of O(|T|(log n + (t_ψ + t_LF)δ)) for
the insertion or deletion of a whole text T.
The following theorem summarizes our result.
THEOREM 6.2. It is possible to represent the suffix tree of a dynamic text collection
within the space and time bounds given for the DFCST in Table VI, by using any dynamic
compressed suffix array offering the operations and times given in Table IV and inserting
(deleting) texts in right-to-left (left-to-right) order.

Table VI also compares our DFCST with the DCST of Chan et al. [2007]. For the
latter we have used the faster dynamic trees of Theorem 2.6. There exists no dynamic
variant of the EBST.
Table VI. Comparison between dynamic compressed suffix tree representations. The performance
refers to dynamic CSA times and assumes t_ψ, t_LF, t_SA = Ω(t_tree); likewise we assume δ = ω(log n)
as before. We omit the operations that depend solely on the CSA; see Table IV.

                        Chan et al. [2007] (DCST)     Ours (DFCST)
Space in bits           |CSA| + Θ(n)                  |CSA| + O((n/δ) log n)
ROOT                    O(1)                          O(1)
COUNT                   t_tree                        O(1)
ANCESTOR                t_tree                        O(1)
PARENT                  t_tree                        (t_ψ + t_LF)δ
FCHILD                  t_tree                        t_SA log δ + (t_ψ + t_LF)δ + O(log² n)
NSIB                    t_tree                        t_SA log δ + (t_ψ + t_LF)δ + O(log² n)
LCA                     t_tree                        (t_ψ + t_LF)δ
CHILD                   t_SA log σ                    t_SA log δ + (t_ψ + t_LF)δ + O(log² n)
SLINK                   t_ψ                           (t_ψ + t_LF)δ
SLINK^i                 t_SA                          t_SA + (t_ψ + t_LF)δ
SDEP                    t_SA                          t_ψ δ
INSERT(T) / DELETE(T)   |T|(t_SA + log n)             |T|(t_ψ + t_LF)δ
Navarro [2008] that we use for handling the sampling of A[jθ] and A^{-1}[jθ] needed for
CHILD. The dynamic parentheses data structure of Theorem 2.6 that we use to represent
S also allows for changes in ⌈log n⌉, but our mechanism to adapt to changes in δ will
subsume it. We discuss now how to cope with this while retaining the same space and
worst-case time complexities.
We use δ = ⌈log n⌉ · ⌈log⌈log n⌉⌉, which will change whenever ⌈log n⌉ changes (sometimes
it will change by more than 1). Let us write δ = ∆(ℓ) = ℓ⌈log ℓ⌉. We maintain
ℓ = ⌈log n⌉. As S is small enough, we can afford to maintain three copies of it: S sampled
with δ, S⁻ sampled with δ⁻ = ∆(ℓ − 1), and S⁺ sampled with δ⁺ = ∆(ℓ + 1). When ⌈log n⌉
increases (i.e., n doubles), S⁻ is discarded, the current S becomes S⁻, the current S⁺
becomes S, we build a new S⁺ sampled with ∆(ℓ + 2), and ℓ is increased. A symmetric
operation is done when ⌈log n⌉ decreases (i.e., n halves due to deletions), so let us focus
on increases from now on. Note this can occur in the middle of the insertion of a text,
which must then be suspended and later resumed over the new set of sampled trees.
The construction of the new S⁺ can be done by retraversing the whole suffix tree T and
deciding which nodes to sample according to the new δ⁺. An initially empty parentheses
sequence and a bitmap B⁺ initialized with t zeros would give the correct insertion
points from the chosen intervals as both structures are populated. To ensure
that we consider each node of T once, we process the leaves in order (i.e., v = [0, 0] to
v = [n − 1, n − 1]), and for each leaf v we also consider all of its ancestors [vl, vr] (using
PARENT) as long as vr = v. For each node [vl, vr] we consider, we apply SLINK at most
δ⁺ times until we find the first node v′ = SLINK^i([vl, vr]) which either is sampled in
S⁺, or satisfies SDEP(v′) ≡_{δ⁺/2} 0 and i = δ⁺/2. If v′ was not sampled we insert it into S⁺, and
in both cases we increase its reference count (recall Section 6). All the δ⁺ suffix links
SLINK^i([vl, vr]) are computed in O((t_ψ + t_LF)δ⁺) = O((t_ψ + t_LF)δ) time, as they form a
single chain.
Deamortization can be achieved by the classical method of interleaving the normal
operations of the data structure with the construction of the new S⁺. By performing
a constant number of operations on the new S⁺ for each insertion/deletion operation
over the text collection, we can ensure that the new S⁺ will be ready in time. We start
with the creation (split into several operations) of B⁺, formed by t 0s, and then proceed
to traverse T to determine which nodes to insert into S⁺. The challenge is to maintain
the consistency of the traversal of T while texts are inserted/deleted.
As we insert a text, the operations that update T consist of the insertion of leaves, and
possibly the creation of a new parent for them. Assume we are currently at node [vl, vr]
in our traversal of T to update S⁺. If a new node [vl′, vr′] we are inserting is behind
the current node in our traversal order (that is, vr′ < vr, or vr′ = vr and vl′ > vl), then
we consider [vl′, vr′] immediately; otherwise we leave it for the moment when we
reach [vl′, vr′] in our traversal (note that we will reach it, because the
CSA has already been updated). Recall from Section 6 that those new insertions affect
neither the existing SDEPs nor the suffix link paths, and hence cannot affect the decisions
to sample nodes already made in the current traversal. Similarly, deleted nodes that
fall behind the current node are processed immediately, and the others are left for the
traversal to handle later.
If ℓ decreases again while we are still building S⁺, we simply discard it, even before
having completed its construction. This involves freeing the whole B⁺ and S⁺ data
structures, which is also necessary when we abandon the former B⁻ and S⁻ structures.
This deallocation can be done in constant time in this particular case: the maximum
size n the collection can reach while we keep using the current S⁺ and B⁺
structures is n_max = 2^{2+⌈log n⌉}, thus the maximum value for t is t_max = 2n_max and for s
it is s_max = t_max/(δ⁺/2). Hence, we allocate a single chunk of memory of the maximum
possible size, which is still O((n/δ) log δ). A similar preallocation can be done for S⁺,
which needs O(n/δ) bits.
Overall this construction procedure requires O(nσ) time to obtain a static FCST
representation from a static CSA. If the CSA is an FM-index implemented with a
wavelet tree [Mäkinen and Navarro 2008], the time can be reduced to O(n log σ), as
we can determine each existing interval LF(X, v) = v′ in O(log σ) time. The reason is
that the symbols X that produce valid intervals correspond to the distinct symbols in
T^bwt[vl, vr], where T^bwt is the Burrows-Wheeler transform of T [Burrows and Wheeler
1994]; if T^bwt is represented as a wavelet tree, then a so-called range
quantile algorithm lists the distinct values of any substring within O(log σ) time per
delivered value [Gagie et al. 2009].
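The following toy program sketches that distinct-symbol listing. It builds a wavelet tree by recursive alphabet halving and descends only into children whose mapped interval is nonempty, so each reported symbol costs one root-to-leaf path; the linear-scan rank1 is a didactic placeholder for the constant-time rank structures a real index would use, and all names are ours rather than from any library.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy wavelet tree over symbols in [lo,hi]: each internal node keeps
       one bit per position of its subsequence (1 = symbol goes right). */
    typedef struct wt {
        int lo, hi, len;
        unsigned char *bits;              /* NULL at leaves / empty nodes */
        struct wt *left, *right;
    } wt;

    static wt *wt_build(const int *seq, int len, int lo, int hi) {
        wt *t = calloc(1, sizeof *t);
        t->lo = lo; t->hi = hi; t->len = len;
        if (lo == hi || len == 0) return t;
        int mid = (lo + hi) / 2, ln = 0, rn = 0;
        int *lbuf = malloc(len * sizeof *lbuf), *rbuf = malloc(len * sizeof *rbuf);
        t->bits = malloc(len);
        for (int i = 0; i < len; i++) {
            t->bits[i] = seq[i] > mid;
            if (seq[i] > mid) rbuf[rn++] = seq[i]; else lbuf[ln++] = seq[i];
        }
        t->left  = wt_build(lbuf, ln, lo, mid);
        t->right = wt_build(rbuf, rn, mid + 1, hi);
        free(lbuf); free(rbuf);
        return t;
    }

    static int rank1(const wt *t, int pos) {  /* # of 1s in bits[0,pos) */
        int c = 0;
        for (int i = 0; i < pos; i++) c += t->bits[i];
        return c;
    }

    /* Report each distinct symbol of the range [l,r] once, in sorted
       order; one root-to-leaf path per reported symbol, i.e. O(log sigma)
       with constant-time rank [Gagie et al. 2009]. */
    static void wt_distinct(const wt *t, int l, int r) {
        if (l > r) return;
        if (t->lo == t->hi) { printf("%d ", t->lo); return; }
        int z_l = l - rank1(t, l);            /* 0s strictly before l   */
        int z_r = (r + 1) - rank1(t, r + 1);  /* 0s up to and incl. r   */
        wt_distinct(t->left,  z_l, z_r - 1);
        wt_distinct(t->right, l - z_l, r - (z_r - 1) - 1);
    }

    int main(void) {
        int seq[] = {3, 1, 4, 1, 5, 2, 6, 5, 3, 5};
        wt *t = wt_build(seq, 10, 0, 7);
        wt_distinct(t, 2, 7);                 /* distinct in seq[2..7]  */
        printf("\n");
        return 0;
    }

On the sample sequence it reports the distinct symbols of seq[2..7] in sorted order: 1 2 4 5 6.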
Indeed, CSAs can be built from scratch faster than CSTs. For example, the FM-index
can be built within nH_k + o(n log σ) bits and time O(n (log n / log log n)(1 + log σ / log log n))
[Navarro and Sadakane 2010]. After making this CSA static we can build our CST within
negligible o(n) extra space and negligible O(n log σ) extra time. There are algorithms to
build the FM-index that take o(n log σ) time [Okanohara and Sadakane 2009; Hon et al. 2009],
yet they use Ω(n log σ) construction space. Another intermediate solution, by Hon et al.
[2003], uses O(n log n) time and O(nH_0) bits of space.
7. EXPERIMENTAL RESULTS
We implemented a basic version of the static FCST representation. We compare our
prototype with an implementation of Sadakane's CST [Välimäki et al. 2007]. Our
prototype uses a binary wavelet tree where each bitmap is encoded as in Theorem
2.3 (using an easier-to-implement proposal [Raman et al. 2002]); this CSA requires
nH_k + o(n log σ) bits [Mäkinen and Navarro 2008]. The sampling factor δ was chosen
as ⌈log n⌉ · ⌈log⌈log n⌉⌉. We made a simple implementation that uses pointers in the
sampled tree S, since the extra space requirement is still sublinear. To minimize the
amount of information stored in the sampled tree we chose not to support the TDEP,
SLAQ, and TLAQ operations; for the same reason we support only the basic CHILD
operation and not the more elaborate scheme presented in Section 5.2.
One important difference between our implementation and the theory we presented
is that the leaves of T are never part of the sampled tree S. This simplification is
possible because LCA(v, v′) is a leaf only if v = v′, in which case LCA(v, v′) = v and
SDEP(LCA(v, v′)) can be obtained from A[v]. Hence, the sampled tree becomes much
smaller than in theory, as the sampled values of A are already considered part of
the CSA.
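As a toy illustration of this shortcut (our own example, not the prototype's code), consider the suffix array of "banana$": the string depth of a leaf is just the length of its suffix, read off from A[v].

    #include <stdio.h>

    /* Leaves are never sampled: LCA(v, v') is a leaf only when v == v',
       and then its string depth is simply the suffix length N - A[v]. */
    #define N 7
    static const int A[N] = {6, 5, 3, 1, 0, 4, 2};  /* suffix array of "banana$" */

    static int sdep_leaf(int v) { return N - A[v]; }  /* depth of leaf v */

    int main(void) {
        int v = 3;                     /* leaf of suffix "anana$" (A[3] = 1) */
        printf("SDEP(LCA(%d,%d)) = %d\n", v, v, sdep_leaf(v));  /* prints 6 */
        return 0;
    }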
We used the texts from the Pizza&Chili corpus^13, trimmed to at most 100 megabytes
(MB).
— Sources (program source code). This file is formed by C/Java source code obtained
by concatenating all the .c, .h, .C, and .java files of the linux-2.6.11.6 and gcc-4.0.0
distributions.
— Pitches (MIDI pitch values). This file is a sequence of pitch values (bytes in 0–127,
plus a few extra special values) obtained from a myriad of MIDI files freely available
on the Internet. The MIDI files were processed using the semex 1.29 tool by Kjell Lemström,
so as to convert them to IRP format. This is a human-readable tuple format, where
the 5th column is the pitch value. The pitch values were then coded in one byte each
and concatenated.
— Proteins (protein sequences). This file is a sequence of newline-separated protein
sequences (without descriptions, just the bare proteins) obtained from the Swissprot
database. Each of the 20 amino acids is coded as one uppercase letter.
^13 http://pizzachili.dcc.uchile.cl
Table VII. Space requirements, in MB, of FCSTs and CSTs. The space is obtained
by measuring the peak main-memory usage when operating the data structures. Other
related information, such as the number of nodes in the sampled tree S and in T, is also
given, together with the relative space usage of the different components.

                 Sources  Pitches  Proteins     DNA  English     XML
σ                    229      134        26      18      217      98
n/2^20             100.0     53.2      63.7   100.0    100.0   100.0
|T|/2^20           162.7     87.5      98.7   167.4    161.7   147.4
δ                    135      130       130     135      135     135
|T|/|S|            1,368      551     1,304  12,657      541   2,008
FCST (MB)           66.3     45.2      49.8    56.7     73.2    54.9
CST (MB)           407.4    214.3     204.2   287.2    353.7   316.3
CSA_F/FCST          0.90     0.88      0.92    0.92     0.87    0.91
CSA_C/CST           0.29     0.30      0.30    0.21     0.29    0.36
CSA_F/CSA_C         0.50     0.62      0.75    0.86     0.62    0.44
— DNA (DNA sequences). This file is a sequence of newline-separated gene DNA
sequences (without descriptions, just the bare DNA code) obtained from files 01hgp10
to 21hgp10, plus 0xhgp10 and 0yhgp10, from the Gutenberg Project. Each of the 4 bases
is coded as an uppercase letter A, G, C, T, and there are a few occurrences of other
special characters.
— English (English texts). This file is the concatenation of English text files selected
from the etext02 to etext05 collections of the Gutenberg Project. We deleted the headers
related to the project so as to leave just the real text.
— XML (structured text). This file is an XML document that provides bibliographic
information on major computer science journals and proceedings, obtained from
dblp.uni-trier.de.
We built FCSTs and CSTs for each of the previous files. The resulting space usage
and related information is given in Table VII. The line “n/2^20” gives the file size in MB.
We also count the number of nodes in each suffix tree in line “|T|/2^20”. It is interesting
to observe that in practice the sampling is much sparser than one node out of δ
(as several suffix link paths share the same sample). This can be observed in line
“|T|/|S|”: the ratio is usually 5 to 10 times larger than δ, but it reaches 93 times δ
for DNA. The consequence of such a small sampling is that the percentage of our CSA
size (CSA_F) in the overall structure is around 90%; see line “CSA_F/FCST”.
Lines FCST and CST show that our FCST is 4 to 6 times smaller than the CST.
This is a consequence not only of the fact that the size (CSA_F) of our CSA is only
44% to 86% of the size (CSA_C) of the CSA used by the CST implementation (see line
CSA_F/CSA_C), but, more importantly, of the fact that our tree structure occupies a
much smaller portion of the overall space than in the CST; see lines CSA_C/CST and
CSA_F/FCST. Hence, in terms of space we managed to obtain an extremely compact
representation of FCSTs. Moreover, the fact that our implementation uses pointers
increases the overall space by only a negligible amount.
Overall, our structure takes 55% to 85% of the original text size and moreover replaces
it, as the CSA itself can reproduce any substring of the sequence. Thus, our representation
can be regarded as a compressed representation of the sequence which, in
addition, provides suffix tree functionality on it. We now consider how time-efficient
this functionality is.
We tested the time it takes to compute the operations in Theorem 6.2 by choosing
internal nodes, computing the operations during 60 seconds, and obtaining averages
per operation. To select a node we chose a random leaf v and computed LCA(v, v + 1).
We used three ways of generating node sequences from this process, so as to simulate
various types of suffix tree traversals, as outlined in the skeleton below. In the first
case we used a single random node chosen as described (u). In the second case we
chose a random node and iterated SLINK (su) until reaching the root, collecting all
the traversed nodes. In the last case we chose a random node and iterated PARENT
(pu) until reaching the root. The results are shown in Table VIII. Our machine had a
Quad-Core Intel Xeon CPU at 3.20GHz with a 2 MB cache and 3 GB of RAM, and was
running Slackware 12.0.0 with Linux kernel 2.6.21.5. The FCSTs were implemented
in C and compiled with gcc 3.4.6 -O9. The CSTs were implemented in C++ and
compiled with g++ 4.1.2 -O3.
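In outline, the node selection looks as follows (a skeleton with hypothetical FCST entry points, left unimplemented; it is not the actual benchmark code):

    #include <stdlib.h>

    /* Skeleton of the benchmark node selection: each starting node is
       LCA(v, v+1) for a random leaf v (always an internal node for two
       distinct leaves); the su/pu variants then walk up to the root. */
    typedef struct { int l, r; } node_t;

    node_t fcst_lca_leaves(int v, int w);   /* LCA of leaves v and w */
    node_t fcst_slink(node_t v);            /* SLINK                 */
    node_t fcst_parent(node_t v);           /* PARENT                */
    int    fcst_is_root(node_t v);

    node_t random_internal_node(int n) {
        int v = rand() % (n - 1);
        return fcst_lca_leaves(v, v + 1);
    }

    void walk_su(node_t v) {                /* the "su" sequence */
        while (!fcst_is_root(v)) v = fcst_slink(v);
    }

    void walk_pu(node_t v) {                /* the "pu" sequence */
        while (!fcst_is_root(v)) v = fcst_parent(v);
    }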
The results show that the price for our FCST's small space requirements is that it is
much slower than CSTs, yet practical in absolute terms for many applications (i.e.,
a few milliseconds per operation). For some operations, such as LCA, the difference can
reach 3 orders of magnitude. Still, for the CHILD operation, which is the slowest, the
difference is usually 1 order of magnitude. Hence, in any algorithm that uses CHILD,
this operation should dominate the overall time; moreover, it depends essentially on
the underlying CSA. We expect this to be the case in general. Therefore it is possible
to obtain different space/time trade-offs by using other CSAs.
Our implementation aimed at obtaining the smallest possible FCSTs. The resulting
space/time trade-off is interesting because we obtained very small FCSTs that support
the usual operations within a reasonable time per operation. Recently published
experiments [Cánovas and Navarro 2010] comparing the performance of a practical
implementation of the EBST with the CST and the FCST reinforce the conclusion that
our FCST, albeit the slowest of the three, is unparalleled in space requirements,
which makes it possible to fit in main memory suffix trees that no other representation
can handle.
8. LARGER AND FASTER COMPRESSED SUFFIX TREES
The previous discussion raises the question of whether it is possible to obtain better
times from our technique, perhaps using more space. In particular, we note that using a
smaller δ value in our FCST would yield better times. What prevents us from using
values smaller than δ = log n log log n is that we have to spend O((n/δ) log n) =
O(n / log log n) extra bits. However, this space comes only from the storage of the SDEP
and TDEP arrays^14.
Imagine we use both the FM-index and the sublinear-space CSA by Grossi et al.
[2003], for a total space of (2 + 1/ε)nH_k + o(n log σ) bits, so that we have t_ψ = t_LF =
O(1 + log σ / log log n) and t_SA = O(t_ψ log^ε_σ n). Now we could store only the SDEP values at
nodes whose SDEP is a multiple of κ, and at the other sampled nodes v we only store
SDEP(v) mod κ, using log κ bits. The total space for SDEP becomes O((n/κ) log n +
(n/δ) log κ). To retrieve a SDEP(v) value, we read d = SDEP(v) mod κ, and then read
the full c = SDEP(v′), where v′ = SLINK^d(v) has its full SDEP value stored. The answer
is c + d and can be obtained in O(t_SA) time^15. The same idea can be used for TDEP,
which is stored for tree depths multiple of κ and retrieved using v′ = PARENT^i(v) in
time O(log n)^16. Now we can use κ = log n log log n and δ = (log log n)^2 while maintaining
the extra space in O(n / log log n). Although we use a much smaller δ now, each step
requires computing a SDEP value in O(t_SA) time, and thus our usual (t_ψ + t_LF)δ cost
becomes t_SA δ = O(log^ε_σ n (log σ + log log n) log log n), which is O(log^ε n) if σ = polylog(n).
^14 We have been careful throughout the paper to avoid this type of space for the other data structures, which could
otherwise have been handled with classical solutions.
^15 Since we know v′ is a sampled node, we do v′ = LCSA(ψ^i(v_l), ψ^i(v_r)) without resorting to LCA, which
would have implied a circular dependence.
^16 Again, because v′ is sampled it can be binary searched for in PARENT_S^j(v).
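Returning to the κ-sampled SDEP values, the retrieval logic is short enough to sketch; the names are ours and the declared operations stand in for the structures just described:

    /* Sketch of kappa-sampled SDEP retrieval (hypothetical API). Full
       depths are stored only where SDEP is a multiple of kappa; other
       sampled nodes keep SDEP mod kappa in log kappa bits. */
    typedef struct { int l, r; } node_t;

    int    stored_mod_kappa(node_t v);   /* the residue d = SDEP(v) mod kappa */
    node_t slink_pow(node_t v, int d);   /* SLINK^d: d steps of a single chain */
    int    stored_full_sdep(node_t v);   /* stored value, a multiple of kappa  */

    int sdep(node_t v) {
        int d = stored_mod_kappa(v);
        node_t w = slink_pow(v, d);      /* SDEP drops by exactly 1 per link */
        return stored_full_sdep(w) + d;  /* c + d = SDEP(v)                  */
    }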
Table VIII. Time to compute operations over the FCST (F) and CST (C), in seconds.

Operation         Sources  Pitches Proteins      DNA  English      XML
LCA      F  u      6.9e-3   8.2e-3   2.9e-3   2.0e-3   6.3e-3   9.7e-3
         F  su     1.3e-2   2.0e-2   7.7e-3   2.2e-3   1.9e-2   7.6e-3
         F  pu     4.3e-3   9.0e-3   1.0e-3   1.1e-3   1.6e-3   4.0e-3
         C  u      2.5e-6   2.5e-6   2.3e-6   2.2e-6   2.4e-6   2.4e-6
         C  su     2.5e-6   2.1e-6   2.7e-6   4.2e-6   1.9e-6   3.5e-6
         C  pu     5.4e-6   5.1e-6   5.0e-6   5.7e-6   5.6e-6   5.7e-6
LETTER   F  u      1.6e-5   1.6e-5   1.1e-5   8.9e-6   1.3e-5   1.5e-5
         F  su     1.8e-5   1.6e-5   1.1e-5   8.4e-6   1.4e-5   1.5e-5
         F  pu     1.5e-5   1.4e-5   9.9e-6   8.4e-6   1.3e-5   1.4e-5
         C  u      1.6e-4   1.3e-4   1.2e-4   7.1e-5   1.4e-4   1.6e-4
         C  su     7.4e-5   7.1e-5   7.4e-5   5.0e-5   8.2e-5   9.8e-5
         C  pu     1.1e-4   6.8e-5   8.8e-5   5.8e-5   1.3e-4   1.4e-4
SLINK    F  u      6.8e-3   8.2e-3   2.9e-3   2.0e-3   6.2e-3   9.6e-3
         F  su     1.3e-2   2.0e-2   7.7e-3   2.1e-3   1.9e-2   7.6e-3
         F  pu     4.3e-3   9.0e-3   9.6e-4   1.0e-3   1.6e-3   3.9e-3
         C  u      2.2e-4   1.7e-4   1.7e-4   9.6e-5   2.0e-4   2.2e-4
         C  su     1.1e-4   9.6e-5   1.0e-4   6.7e-5   1.1e-4   1.4e-4
         C  pu     1.7e-4   1.0e-4   1.5e-4   8.4e-5   1.9e-4   2.0e-4
LOCATE   F  u      3.2e-3   3.0e-3   1.8e-3   1.6e-3   2.7e-3   3.1e-3
         F  su     3.0e-3   2.7e-3   1.7e-3   1.3e-3   2.6e-3   2.9e-3
         F  pu     2.7e-3   2.3e-3   1.6e-3   1.3e-3   2.4e-3   2.6e-3
         C  u      5.0e-5   3.9e-5   3.8e-5   2.2e-5   4.4e-5   5.0e-5
         C  su     2.2e-5   1.8e-5   2.0e-5   1.5e-5   2.0e-5   2.7e-5
         C  pu     3.7e-5   2.1e-5   3.0e-5   1.9e-5   4.3e-5   4.3e-5
CHILD    F  u      8.2e-3   1.3e-2   3.4e-3   1.6e-3   1.1e-2   8.7e-3
         F  su     1.9e-2   3.2e-2   1.0e-2   3.7e-3   3.8e-2   6.5e-3
         F  pu     6.5e-3   1.6e-2   9.9e-4   7.7e-4   1.9e-3   3.3e-3
         C  u      5.8e-4   5.4e-4   4.2e-4   1.2e-4   5.2e-4   6.4e-4
         C  su     2.0e-4   2.6e-4   2.8e-4   1.2e-4   1.3e-4   1.2e-3
         C  pu     2.5e-3   1.6e-3   9.0e-4   1.9e-4   2.9e-3   2.2e-3
SDEP     F  u      5.4e-3   6.6e-3   2.3e-3   1.6e-3   5.2e-3   7.3e-3
         F  su     1.1e-2   1.8e-2   5.9e-3   1.7e-3   1.7e-2   5.7e-3
         F  pu     4.0e-3   7.8e-3   8.0e-4   8.4e-4   1.3e-3   3.0e-3
         C  u      5.1e-5   4.0e-5   3.8e-5   2.3e-5   4.5e-5   5.0e-5
         C  su     2.1e-5   1.8e-5   2.0e-5   1.5e-5   2.0e-5   2.6e-5
         C  pu     3.6e-5   2.1e-5   3.3e-5   2.2e-5   4.1e-5   4.4e-5
PARENT   F  u      8.3e-3   8.7e-3   2.6e-3   3.4e-3   4.3e-3   1.5e-2
         F  su     1.3e-2   7.6e-3   5.0e-3   2.5e-3   4.5e-3   1.3e-2
         F  pu     6.3e-3   1.3e-2   1.1e-3   1.8e-3   2.1e-3   6.1e-3
         C  u      1.6e-6   1.7e-6   1.7e-6   1.6e-6   1.6e-6   1.7e-6
         C  su     1.5e-6   1.5e-6   1.6e-6   1.6e-6   1.6e-6   1.7e-6
         C  pu     1.5e-6   1.5e-6   1.6e-6   1.6e-6   1.6e-6   1.7e-6
Thus we achieve sublogarithmic times for most operations. Indeed, the times are
similar to those of the EBST and our space is better than that of its original version
[Fischer et al. 2009], though the most recent result [Fischer 2010] achieves better
space.
We can go further and achieve poly-loglog times for the most common operations,
at the expense of higher space. We use the representation of LCP information by
Fischer et al. [2009], which gives constant-time access within 2nH_k(log(1/H_k) + O(1)) + o(n)
bits of space. Recall that LCP(v) = SDEP([v − 1, v]) is the length of the longest common
prefix between leaves v and v − 1. In addition, they show how to compute range minimum
queries RMQ(v, v′) (which give the minimum value in the range LCP(v) . . . LCP(v′)) using,
for example, O(n / log log n) bits of space and O(log log n (log log log n)^2) time. Using this we
can obtain directly SDEP([vl, vr]) = RMQ(vl + 1, vr). The same method can be applied
for TDEP. Now the only limit to decreasing δ is array B, which uses O((n/δ) log δ)
bits, and this is o(n) for any δ = ω(1). Yet, let us restrict ourselves to O(n / log log n) extra space,
so we use δ = log log n log log log n. If we use an FM-index as our CSA, our final CST
size is 2nH_k(log(1/H_k) + O(1)) + o(n) bits, and our usual (t_ψ + t_LF)δ time for most
operations becomes O(log log n (log σ + log log n)(log log log n)^3). This is o((log log n)^3) for
σ = O(polylog(n)).
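As a toy illustration of the identity SDEP([vl, vr]) = RMQ(vl + 1, vr) (our own example; a naive linear-scan RMQ stands in for the compressed poly-loglog structure):

    #include <stdio.h>

    /* LCP[v] is the length of the longest common prefix of the (v-1)-th
       and v-th smallest suffixes; values below are for "banana$". */
    #define N 7
    static const int LCP[N] = {0, 0, 1, 3, 0, 0, 2};

    static int rmq(int i, int j) {           /* naive range minimum */
        int m = LCP[i];
        for (int k = i + 1; k <= j; k++) if (LCP[k] < m) m = LCP[k];
        return m;
    }

    static int sdep(int vl, int vr) { return rmq(vl + 1, vr); }

    int main(void) {
        /* leaves 2..3 are "ana$" and "anana$": their LCA has SDEP 3 */
        printf("%d\n", sdep(2, 3));
        return 0;
    }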
9. CONCLUSIONS
We presented a fully-compressed representation of suffix trees (FCSTs), which breaks
the linear-bit space barrier of previous representations at a reasonable time complex-
ity penalty. Our structure efficiently supports common and not-so-common operations,
including very powerful ones such as lowest common ancestor (LCA) and level ances-
tor (LAQ) queries. Indeed, by building over an FM-index, our FCSTs achieve optimal
asymptotic space under the k-th order entropy model, and support all the navigation
operations in polylogarithmic time. Our representation is largely based on the LCA
operation. Suffix trees have been used in combination with LCAs for a long time, but
our results show new ways to exploit this partnership. We also presented a dynamic
fully-compressed representation of suffix trees. Dynamic FCSTs permit not only man-
aging dynamic collections, but also building static FCSTs within optimal space, at a
logarithmic time penalty factor.
We implemented a static version of the FCSTs and showed that its surprisingly
small space requirements can be achieved in practice while it still supports the usual
operations efficiently. A recent experimental comparison [Cánovas and Navarro 2010]
of compressed suffix trees confirms that the FCST is the smallest representation, albeit
also the slowest. Using a denser sampling on our current implementation does not
give interesting space/time trade-offs, but we are pursuing a new implementation where
such a denser sampling has a better impact on response times.
The research on this topic advances at a very rapid pace. In the last two years,
after the conference publication of our results [Russo et al. 2008b; 2008a], several
new achievements have been presented. The progress was mainly focused on ob-
taining smaller representations of the data structures that support Range Minimum
Queries (RMQs), and the so-called Previous Smaller Value (PSV) and Next Smaller
Value (NSV) queries. The results by Ohlebusch et al. [2009; 2010] reduced the con-
stants associated with the O(n)-bit space term. Although the resulting space is still
Θ(n), they achieve relevant improvements. An implementation of the EBST [Fischer
et al. 2009] also provided new practical techniques to implement RMQ/PSV/NSV op-
erations [Cánovas and Navarro 2010], as well as the mentioned experimental compar-
ison among different prototypes. Fischer [2010] improved the original EBST [Fischer
et al. 2009] by removing the “ugly” space factor associated with the entropy; that is, the
new EBST now requires (1 + 1/ε)nH_k + o(n) bits and retains the same sublogarithmic
time performance (we used this improved complexity in our Table I).
The techniques we introduce in this paper have also proven to be of independent
interest. Recently, Hon et al. [2009] improved the secondary-memory index
proposed by Chien et al. [2008] using, among other techniques, a structure similar to
the bitmap B we presented in Section 4.1.
We believe this fascinating topic is far from closed. In particular, we have exposed
limitations for some operations on FCSTs, which might or might not be fundamental.
For example, we give only a partial answer to the problem of computing the preorder
number of a suffix tree node, which is relevant for associating satellite information with
internal nodes. Another important example is the lack of support for the TDEP, TLAQ,
and SLAQ operations on dynamic FCSTs. This has its roots in our inability to maintain
a properly spaced sampling of the suffix tree while keeping TDEP values up to
date. A third example is the limitation on the alphabet size σ required to achieve
o(n) extra space. Our prototype is also being extended to support the dynamic case
and, as mentioned, denser samplings.
More generally, and especially in light of the combinations of ideas explored in the
previous section, it is not clear how fast we can navigate suffix trees using how much
space, and in general what the space/time lower bound for compressed suffix trees is.
A. APPENDIX
In this appendix we explore some fundamental properties of suffix trees that show how
to use a δ-sampled suffix tree. Section 5 makes use of these properties to provide the
different navigation operations, although it can be read without resorting to this deeper
discussion.
More specifically, we reveal some self-similarity properties of suffix trees. Such
properties have already been studied but, as far as we know, the ones we study here
are novel.
PROOF. Let T1 and T2 denote the subtrees below v and SLINK(v), respectively. The
proof consists of showing that |T1| ≥ |T2|, i.e., T2 has no more nodes than T1. This
implies that SLINK is surjective since, by Lemma 1.2, SLINK is injective.
We denote the number of leaves of a tree T′ by λ(T′). It is easy to prove, by induction,
that for any tree T′ the following property holds:

    |T′| = 2λ(T′) − 1 − Σ_{internal v′ ∈ T′} (−2 + number of children of v′).

Hence, since λ(T1) = λ(T2), all we need to show is that Σ_{v1 ∈ T1}(. . .) ≤ Σ_{v2 ∈ T2}(. . .).
Note that the terms of the sum are always non-negative because T1 and T2 are compact,
i.e., all their internal nodes have at least two children. Since SLINK is injective, the
result can be shown directly by arguing that the number of children of an internal node
v1 in T1 is not larger than the number of children of SLINK(v1) in T2. This is a known
property of suffix trees: if node v1 has a child that branches by letter X ∈ Σ, then
SLINK(v1) must also have a child branching by X. SLINK does not remove these letters
from the path label because v1 descends from v ≠ ROOT.
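One induction-free way to verify the displayed identity (our own short argument): every non-root node of T′ contributes exactly one edge, and every edge leads out of an internal node, so

    Σ_{internal v′ ∈ T′} c_{v′} = |T′| − 1,

where c_{v′} denotes the number of children of v′. Since the internal nodes number |T′| − λ(T′), we get

    Σ_{internal v′ ∈ T′} (c_{v′} − 2) = (|T′| − 1) − 2(|T′| − λ(T′)) = 2λ(T′) − 1 − |T′|,

which rearranges to the stated formula.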
To complete the proof of Lemma 1.1 we still need the following property, whose proof
we postpone to the next subsection, where we will have more algebraic tools.
LEMMA 1.4. If the number of leaves of the subtree below a node v ≠ ROOT is
equal to the number of leaves below SLINK(v), then SLINK(PARENT(v′)) =
PARENT(SLINK(v′)) for any node v′ that descends from v.
The compact DAG data structure [Gusfield 1997] removes the regularity arising
from Lemma 1.1 by storing a pointer from node v to node SLINK(v) whenever
v satisfies the conditions of the lemma.
REFERENCES
Abouelhoda, M., Kurtz, S., and Ohlebusch, E. 2004. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 1, 53–86.
Apostolico, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. NATO ISI Series. Springer-Verlag, 85–96.
Arroyuelo, D. 2008. An improved succinct representation for dynamic k-ary trees. In Proc. 19th International Symposium on Combinatorial Pattern Matching (CPM). LNCS 5029. 277–289.
Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M., and Seiferas, J. 1985. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science 40, 1, 31–55.
Burrows, M. and Wheeler, D. 1994. A block-sorting lossless data compression algorithm. Tech. rep., Digital Equipment Corporation.
Cánovas, R. and Navarro, G. 2010. Practical compressed suffix trees. In Proc. 9th International Symposium on Experimental Algorithms (SEA). LNCS 6049. 94–105.
Chan, H.-L., Hon, W.-K., and Lam, T.-W. 2004. Compressed index for a dynamic collection of texts. In Proc. 15th International Symposium on Combinatorial Pattern Matching (CPM). LNCS 3109. 445–456.
Chan, H.-L., Hon, W.-K., Lam, T.-W., and Sadakane, K. 2007. Compressed indexes for dynamic text collections. ACM Transactions on Algorithms 3, 2, article 21.
Chien, Y.-F., Hon, W.-K., Shah, R., and Vitter, J. S. 2008. Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In Proc. Data Compression Conference (DCC). 252–261.
Crochemore, M. 1986. Transducers and repetitions. Theoretical Computer Science 45, 1, 63–86.
Ferragina, P. and Manzini, G. 2000. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS). 390–398.
Ferragina, P. and Manzini, G. 2005. Indexing compressed text. Journal of the ACM 52, 4, 552–581.
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2007. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms 3, 2, article 20.
Ferragina, P. and Venturini, R. 2007. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science 372, 1, 115–121.
Fischer, J. 2010. Wee LCP. Information Processing Letters 110, 317–320.
Fischer, J., Mäkinen, V., and Navarro, G. 2009. Faster entropy-bounded compressed suffix trees. Theoretical Computer Science 410, 51, 5354–5364.
Foschini, L., Grossi, R., Gupta, A., and Vitter, J. 2006. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms 2, 4, 611–639.
Gagie, T., Puglisi, S. J., and Turpin, A. 2009. Range quantile queries: Another virtue of wavelet trees. In Proc. 16th Symposium on String Processing and Information Retrieval (SPIRE). 1–6.
Giegerich, R., Kurtz, S., and Stoye, J. 2003. Efficient implementation of lazy suffix trees. Software Practice and Experience 33, 11, 1035–1049.
González, R. and Navarro, G. 2008. Rank/select on dynamic compressed sequences and applications. Theoretical Computer Science 410, 4414–4422.
Grossi, R., Gupta, A., and Vitter, J. 2003. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 841–850.
Gupta, A., Hon, W.-K., Shah, R., and Vitter, J. 2007. A framework for dynamizing succinct data structures. In Proc. 34th International Colloquium on Automata, Languages and Programming (ICALP). LNCS 4596. 521–532.
Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press.
He, M. and Munro, I. 2010. Succinct representations of dynamic strings. In Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 6393. 334–346.
Hon, W.-K., Lam, T.-W., Sadakane, K., and Sung, W.-K. 2003. Constructing compressed suffix arrays with large alphabets. In Proc. 14th Annual International Symposium on Algorithms and Computation (ISAAC). 240–249.
Hon, W.-K., Sadakane, K., and Sung, W.-K. 2009. Breaking a time-and-space barrier in constructing full-text indices. SIAM Journal on Computing 38, 6, 2162–2178.
Hon, W.-K., Shah, R., Thankachan, S., and Vitter, J. 2009. On entropy-compressed text indexing in external memory. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 5721. 75–89.
Huynh, T. N. D., Hon, W.-K., Lam, T.-W., and Sung, W.-K. 2006. Approximate string matching using compressed suffix arrays. Theoretical Computer Science 352, 1-3, 240–249.
Kärkkäinen, J. and Ukkonen, E. 1996a. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing. 141–155.
Kärkkäinen, J. and Ukkonen, E. 1996b. Sparse suffix trees. In Computing and Combinatorics. LNCS 1090. 219–230.
Lee, S. and Park, K. 2007. Dynamic rank-select structures with applications to run-length encoded texts. In Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4580. 95–106.
Mäkinen, V. and Navarro, G. 2008. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4, 3, 1–38.
Manber, U. and Myers, E. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22, 5, 935–948.
Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. Journal of the ACM 48, 3, 407–430.
McCreight, E. 1976. A space-economical suffix tree construction algorithm. Journal of the ACM 32, 2, 262–272.
Munro, I., Raman, V., and Rao, S. S. 2001. Space efficient suffix trees. Journal of Algorithms 39, 205–222.
Navarro, G. 2004. Indexing text using the Ziv-Lempel trie. Journal of Discrete Algorithms 2, 1, 87–114.
Navarro, G. and Mäkinen, V. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, article 2.
Navarro, G. and Sadakane, K. 2010. Fully-functional static and dynamic succinct trees. CoRR abs/0905.0768. http://arxiv.org/abs/0905.0768. Version 4.
Ohlebusch, E., Fischer, J., and Gog, S. 2010. CST++. In Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 6393. 322–333.
Ohlebusch, E. and Gog, S. 2009. A compressed enhanced suffix array supporting fast string matching. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 5721. 51–62.
Okanohara, D. and Sadakane, K. 2009. A linear-time Burrows-Wheeler transform using induced sorting. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 5721. 90–101.
Pătraşcu, M. and Viola, E. 2010. Cell-probe lower bounds for succinct partial sums. In Proc. 21st ACM-SIAM Symposium on Discrete Algorithms (SODA). 117–122.
Pătraşcu, M. 2008. Succincter. In Proc. 49th IEEE Annual Symposium on Foundations of Computer Science (FOCS). 305–313.
Raman, R., Raman, V., and Rao, S. S. 2002. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th ACM-SIAM Symposium on Discrete Algorithms (SODA). 233–242.
Raman, R. and Rao, S. S. 2003. Succinct dynamic dictionaries and trees. In Proc. 30th International Colloquium on Automata, Languages and Programming (ICALP). LNCS 2719. 357–368.
Russo, L., Navarro, G., and Oliveira, A. 2008a. Dynamic fully-compressed suffix trees. In Proc. 19th International Symposium on Combinatorial Pattern Matching (CPM). LNCS 5029. 191–203.
Russo, L., Navarro, G., and Oliveira, A. 2008b. Fully-compressed suffix trees. In Proc. 8th Latin American Symposium on Theoretical Informatics (LATIN). LNCS 4957. 362–373.
Russo, L. M. S. and Oliveira, A. L. 2008. A compressed self-index using a Ziv-Lempel dictionary. Information Retrieval 11, 4, 359–388.
Sadakane, K. 2003. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms 48, 2, 294–313.
Sadakane, K. 2007. Compressed suffix trees with full functionality. Theory of Computing Systems 41, 4, 589–607.
Sadakane, K. and Navarro, G. 2010. Fully-functional succinct trees. In Proc. 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 134–149.
Välimäki, N., Gerlach, W., Dixit, K., and Mäkinen, V. 2007. Engineering a compressed suffix tree implementation. In Proc. 6th International Workshop on Efficient and Experimental Algorithms (WEA). LNCS 4525. 217–228.
Weiner, P. 1973. Linear pattern matching algorithms. In Proc. 14th IEEE Annual Symposium on Switching and Automata Theory (SWAT). 1–11.