

Fully-Compressed Suffix Trees


LUÍS M. S. RUSSO , INESC-ID / Instituto Superior Técnico, Tech Univ of Lisbon, Portugal
GONZALO NAVARRO, University of Chile
ARLINDO L. OLIVEIRA , INESC-ID / Instituto Superior Técnico, Tech Univ of Lisbon, Portugal

Suffix trees are by far the most important data structure in stringology, with a myriad of applications in
fields like bioinformatics and information retrieval. Classical representations of suffix trees require Θ(n log n)
bits of space, for a string of size n. This is considerably more than the n log2 σ bits needed for the string itself,
where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice.
Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra
bits. This is already spectacular, but the linear extra bits are still unsatisfactory when σ is small as in DNA
sequences. In this paper we introduce the first compressed suffix tree representation that breaks this Θ(n)-
bit space barrier. The Fully Compressed Suffix Tree (FCST) representation requires only sublinear space on
top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic
time. This includes extracting arbitrary text substrings, so the FCST replaces the text using almost the
same space as the compressed text. An essential ingredient of FCSTs is the lowest common ancestor (LCA)
operation. We reveal important connections between LCAs and suffix tree navigation. We also describe
how to make FCSTs dynamic, i.e., support updates to the text. The dynamic FCST also supports several
operations. In particular it can build the static FCST within optimal space and polylogarithmic time per
symbol. Our theoretical results are also validated experimentally, showing that FCSTs are very effective in
practice as well.
Categories and Subject Descriptors: E.4 [Coding and Information Theory]: Data Compaction and Com-
pression; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—search process
General Terms: Algorithms, Performance, Theory
Additional Key Words and Phrases: Text processing, Pattern matching, String algorithms, Suffix tree, Data
compression, Compressed index
ACM Reference Format:
ACM Trans. Algor. V, N, Article A (January YYYY), 33 pages.
DOI = 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

First and third authors supported by FCT through projects TAGS PTDC/EIA-EIA/112283/2009, HELIX
PTDC/EEA-ELC/113999/2009 and the PIDDAC Program funds (INESC-ID multiannual funding). Second
author partially funded by Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM
P05-001-F, Mideplan, Chile.
Preliminary partial versions of this paper appeared in LATIN 2008, LNCS 4957, pp. 362–373; and CPM
2008, LNCS 5029, pp. 191–203.
Authors’ address: Luís M. S. Russo, Arlindo Oliveira, Instituto de Engenharia de Sistemas e Computadores:
Investigação e Desenvolvimento (INESC-ID), R. Alves Redol 9, 1000-029 LISBON, PORTUGAL
Instituto Superior Técnico Technical University of Lisbon (IST/UTL), Av. Rovisco Pais, 1049-001 LISBON,
PORTUGAL {lsr,aml}@kdbio.inesc-id.pt.
Gonzalo Navarro, Dept. of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile.
gnavarro@dcc.uchile.cl.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights
for components of this work owned by others than ACM must be honored. Abstracting with credit is per-
mitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component
of this work in other works requires prior specific permission and/or a fee. Permissions may be requested
from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© YYYY ACM 1549-6325/YYYY/01-ARTA $10.00
DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000


1. INTRODUCTION AND RELATED WORK


Suffix trees are extremely important for a large number of string processing problems.
Many of their virtues were described by Apostolico [1985] and Gusfield [1997]. The
combinatorial properties of suffix trees have a profound impact in the bioinformatics
field, which needs to analyze long strings of DNA and proteins with no predefined
boundaries. This partnership has produced several important results, but it has also
exposed the main shortcoming of suffix trees: Their large space requirements, together
with their need to operate in main memory, render them inapplicable in the cases
where they would be most useful, that is, on large texts.
There has been much research around the space problem of suffix trees, ranging
from space-engineered representations [Giegerich et al. 2003; Abouelhoda et al. 2004]
to novel data structures simulating them, most notably suffix arrays [Manber and
Myers 1993]. Some of those space-reduced variants give away some functionality in
exchange. For example suffix arrays miss the important suffix link navigational oper-
ation. Still, these classical approaches require Θ(n log n) bits, while the indexed string
requires only n log σ bits1, n being the size of the string and σ the size of the alphabet.
For example, storing the human genome requires about 700 Megabytes, while even a
space-efficient suffix tree of it requires at least 40 Gigabytes [Sadakane 2007], and the
reduced-functionality suffix array requires more than 10 Gigabytes. This problem is
particularly evident in DNA because log σ = 2 is much smaller than log n ≈ 30.
These representations are also much larger than the size of the compressed string.
Recent approaches [Navarro and Mäkinen 2007] combining data compression and suc-
cinct data structures have achieved spectacular results for the pattern search prob-
lem, that is, finding the occ occurrences of a pattern string P in the text. For exam-
ple, Ferragina et al. [2007] presented a compressed suffix array that, for moderate
σ = O(polylog(n)), requires nHk + o(n) bits of space and computes occ in time O(|P|).2
Here nHk denotes the k-th order empirical entropy of the string [Manzini 2001], a
lower bound on the space achieved by any compressor using k-th order modeling. As
that index is also able to reproduce any text substring, its space is asymptotically op-
timal in the sense that no k-th order compressor can achieve asymptotically less space
to represent the text.
It turns out that it is possible to use this kind of data structure, which we will call
compressed suffix arrays (CSAs)3, and, by adding a few extra structures, support all
the operations provided by suffix trees. Sadakane [2007] presented the first such com-
pressed suffix tree (CST), adding 6n bits on top of the CSA.
In this paper we break this Θ(n) extra-bit space barrier. We use a new suffix tree
representation on top of a compressed suffix array, so that we can support all of the
navigational operations within o(n) bits, besides the compressed suffix array, provided
log σ = o(log n/ log log n). Hence we name the data structure Fully Compressed Suffix Tree
(FCST). Our central tools are a particular sampling of suffix tree nodes, its connection
with the suffix link and the lowest common ancestor (LCA) operation, and the interplay
with the compressed suffix array. We exploit the relationship between these actors and
uncover some properties that might be of independent interest.
A comparison between Sadakane’s CST and our FCST is shown in Table I, consid-
ering a moderate alphabet size σ = O(polylog(n)) (there are several more operations
that are trivially supported, see the end of Section 2.1). The table assumes that the
CST uses the CSA of Grossi et al. [2003] (a variant that requires (1 + 1/ǫ)nHk + Θ(n) bits

1 In this paper log stands for log2.


2 For general σ = o(n), the space is nHk + O(n log σ log log n/ log n) and the time is O(|P|(1 + log σ/ log log n)).
3 These are also called compact suffix arrays, FM-indexes, etc. in the literature [Navarro and Mäkinen 2007].


Table I. Comparison between compressed suffix tree representations. The operations are defined
along Section 2.1 and are separated into a first group of general tree navigation and a second
group specific to suffix trees. The instantiation we show assumes σ = O(polylog(n)), and uses
different versions of the CSA of Grossi et al. for the CST and EBST, and the FM-index of
Ferragina et al. for the FCST. The space given holds for any k ≤ α log_σ n and any constant
0 < α < 1. The o(n) space term in this instantiation is O(n/ log log n). CST and EBST times
should be multiplied by a low-degree polynomial of log log n, which we omit for simplicity as it
would be dominated by using an infinitesimally larger ǫ.
                 CST                      EBST                     FCST
Space in bits    (1 + 1/ǫ)nHk + Θ(n)      (1 + 1/ǫ)nHk + o(n)      nHk + o(n)
ROOT             O(1)                     O(1)                     O(1)
COUNT            O(1)                     O(1)                     O(1)
ANCESTOR         O(1)                     O(1)                     O(1)
PARENT           O(1)                     O(log^ǫ n)               O(log n log log n)
FCHILD           O(1)                     O(log^ǫ n)               O(log n log log n)
NSIB             O(1)                     O(log^ǫ n)               O(log n log log n)
LCA              O(1)                     O(log^ǫ n)               O(log n log log n)
TDEP             O(1)                     Not supported            O((log n log log n)²)
TLAQ             O(1)                     Not supported            O((log n log log n)²)
LETTER(v, i, ℓ)  O(log^ǫ n + ℓ/ log_σ n)  O(log^ǫ n + ℓ/ log_σ n)  O(log n log log n + ℓ)
CHILD            O(log^ǫ n)               O(log^ǫ n)               O(log n (log log n)²)
LOCATE           O(log^ǫ n)               O(log^ǫ n)               O(log n log log n)
SLINK            O(1)                     O(log^ǫ n)               O(log n log log n)
SLINK^i          O(log^ǫ n)               O(log^ǫ n)               O(log n log log n)
WEINERLINK       O(log n)                 O(log n)                 O(1)
SDEP             O(log^ǫ n)               O(log^ǫ n)               O(log n log log n)
SLAQ             O(log^{1+ǫ} n)           O(log^{1+ǫ} n)           O(log n log log n)

for any constant ǫ > 0) and that the FCST uses the FM-index [Ferragina et al. 2007]
(which requires nHk + o(n) bits), to take the preferred setting for each. In general the
FCST is slower than the CST, but it requires much less space. Assuming realistically
that for DNA Hk ≈ 2, Sadakane’s CST requires at the very least 8n + o(n) to 13n + o(n)
bits, depending on the CSA variant of Grossi et al. [2003] used, whereas the FCST
requires only 2n + o(n) bits (this theoretical prediction is not far from reality, as shown
in Section 7). The FCST space is optimal in the sense that no k-th order compressor
can achieve asymptotically less space to represent T. If the CST used the FM-index,
it would still have the 6n extra bits, and the O(log^ǫ n) time complexities would become
O(log n log log n).
Table I also compares the Entropy-Bounded Suffix Tree (EBST) [Fischer et al. 2009;
Fischer 2010], a newer proposal that aims at maintaining the o(n) extra space of the
FCST while reducing navigation times. If it uses another version of the CSA by Grossi
et al. [2003] that requires o(n) extra bits on polylog-sized alphabets, it achieves sublog-
arithmic time complexities for most operations. If we force it to use the FM-index to
achieve the least possible space (as the FCST does), its time complexities become un-
competitive. There are previous incomplete theoretical proposals for compressed suffix
trees [Munro et al. 2001; Foschini et al. 2006]; a brief description is given at the end of
Section 3.
Our results are based on a special kind of sampling of suffix tree nodes. There is
some literature on sampled, or sparse, suffix trees. The pioneering work [Kärkkäinen
and Ukkonen 1996b] indexed evenly spaced suffixes (every k text positions). The re-
sulting structure required reduced space, O((n/k) log n) + n log σ bits, at the price of
multiplying the suffix tree search time by k and only handling patterns of length k
or more. Replacing the regular sampling with one guided by the Lempel-Ziv parsing
yielded the very first compressed text index [Kärkkäinen and Ukkonen 1996a]. This
index used the Lempel-Ziv properties to handle any pattern length, and later several


Table II. Comparison between dynamic compressed suffix tree representations. The operations
are defined along Section 2.1. The same considerations of Table I apply, except that the instan-
tiation assumes the dynamic FM-Index variant of Navarro and Sadakane [2010] as the CSA, for
which the space holds for any k ≤ α logσ (n) − 1 and any constant 0 < α < 1.
                        Chan et al. [2007] (DCST)        Ours (DFCST)
Space in bits           nHk + Θ(n)                       nHk + o(n)
ROOT                    O(1)                             O(1)
COUNT                   O(log n/ log log n)              O(1)
ANCESTOR                O(log n/ log log n)              O(1)
PARENT                  O(log n/ log log n)              O(log² n)
FCHILD                  O(log n/ log log n)              O(log² n log log n)
NSIB                    O(log n/ log log n)              O(log² n log log n)
LCA                     O(log n/ log log n)              O(log² n)
LETTER(v, i, ℓ)         O(log n(log n + ℓ/ log log n))   O(log n(log n + ℓ/ log log n))
CHILD                   O(log² n log σ)                  O(log² n log log n)
LOCATE                  O(log² n)                        O(log² n)
SLINK                   O(log n/ log log n)              O(log² n)
SLINK^i                 O(log² n)                        O(log² n)
WEINERLINK              O(log n/ log log n)              O(log n/ log log n)
SDEP                    O(log² n)                        O(log² n)
INSERT(T) / DELETE(T)   O(|T| log² n)                    O(|T| log² n)

self-indexes based on Lempel-Ziv compression followed the same lines [Navarro 2004;
Ferragina and Manzini 2005; Russo and Oliveira 2008]. Sparse indexes that use evenly
spaced suffixes and orthogonal range searching were recently proposed for secondary
memory searching [Chien et al. 2008; Hon et al. 2009]. All these representations sup-
port pattern searches, but not the full suffix tree functionality. Our sampling is differ-
ent in the sense that it samples suffix tree nodes, not text positions. This is the key to
achieve good upper bounds for all suffix tree operations.
Albeit very appealing, static FCSTs must be built from the uncompressed suffix tree.
Moreover, they must be rebuilt from scratch upon changes in the text. This severely
limits their applicability, as one needs to have a large main memory, or resort to sec-
ondary memory construction, to end up with a FCST that fits in a reasonable main
memory. CSAs have overcome this limitation, starting with the structure by Chan
et al. [2004]. In its journal version [Chan et al. 2007] the work includes the first dy-
namic CST, which builds on the static CST of Sadakane [2007] and retains its Θ(n)
extra space penalty (with constant at least 6). On the other hand, the smallest exist-
ing CSA [Ferragina et al. 2007] was made dynamic within the same space by Navarro
and Sadakane [2010] so as to achieve a sublogarithmic slowdown with respect to the
static version4 . In this paper we show how to support dynamic FCSTs, by building on
this latter dynamic CSA. We retain the optimal space complexity and polylogarithmic
time for all the operations.
A comparison between the dynamic CST by Chan et al. [2007] and our dynamic
FCST is given in Table II. Both use the dynamic FM-index of Navarro and Sadakane
[2010], as that of Chan et al. [2007] uses O(σn) space and is not significantly faster.
Again, the FCST is slower but requires much less space (one can realistically predict
25% of Chan et al.’s CST space on DNA).
All these dynamic structures, as well as ours, handle a collection of texts, where
whole texts are added/deleted to/from the collection. Construction in compressed space
is achieved by inserting a text into an empty collection.

4 He and Munro [2010] obtained a very similar result, but their o(n) extra space term is larger than that of
Navarro and Sadakane [2010].


We have implemented the static FCST and compared it with an implementation of
Sadakane’s CST by Välimäki et al. [2007]. Our experiments show that we can obtain
very small FCST representations and still support the usual operations efficiently.
We start with Section 2 by defining basic concepts about suffix trees and compact
data structures, and listing the navigational operations we wish to support. In Sec-
tion 3 we overview compressed suffix arrays (CSAs) and their functionality. Section 4
introduces our sampled suffix tree and shows how to support its navigation and map-
ping from the full (not represented) suffix tree. Section 5 shows how to carry out nav-
igational operations using self-similarity properties of suffix trees; a detailed analysis
of these properties is given in Appendix 9. Section 6 introduces our main technique for
maintaining the sampling up to date upon changes in the text collection, and obtains
the results on dynamic FCSTs. Section 7 shows experimental results. Section 8 con-
siders different sampling factors to obtain larger, but faster representations. Section 9
concludes the paper.
2. BASIC CONCEPTS
In this section we give a brief review of suffix trees, suffix arrays, and compact data
structures. For a more detailed explanation, the reader is referred to the publications
focused on the subject, e.g. [Gusfield 1997; Navarro and Mäkinen 2007]. In particular,
the former reference shows dozens of algorithms relevant in bioinformatics where the
suffix tree navigation operations we are going to describe are of use.
2.1. Strings, Trees, Suffix Trees and Arrays
We denote by T = T [0, n − 1] a string, which is a sequence of length |T | = n over an
alphabet Σ of size σ. We denote by T [i] the symbol at position (i mod n); by T [i..j] the
substring T [i]T [i + 1] . . . T [j], which is a prefix if i = 0 (and can be written T [..j]) and
a suffix if j = n − 1 (and can be written T [i..]). By T.T ′ we denote the concatenation
of T and T ′ . The empty string of length zero is denoted ε.
We make extensive use of rooted trees. The root node is called ROOT. By PARENT(v)
we denote the parent node of node v ≠ ROOT; by TDEP(v) its tree-depth; by HEIGHT(v)
the distance between v and its farthest descendant leaf; by FCHILD(v) its first child,
if v is not a leaf; and by NSIB(v) the next child of the same parent, if it exists. By
ANCESTOR(v, v′) we denote whether v is an ancestor of v′; by LCA(v, v′) the lowest
common ancestor of v and v′; and by TLAQ(v, d) the level-d ancestor of v (that is,
the ancestor of v with tree-depth d).
A compact tree is a tree that has no unary nodes (that is, nodes with only one
child). A labeled tree is a tree that has a nonempty string label for every edge. In a
deterministic tree, the common prefix of any two different edges out of a node is ε.
Definition 2.1. [Weiner 1973; McCreight 1976] The suffix tree T of a text string T
is the deterministic compact labeled tree for which the path labels of the leaves are the
suffixes of T $, where $ is a terminator symbol not belonging to Σ. We will assume n is
the length of T $.
Figure 1 shows a running example that illustrates several concepts of suffix trees,
for T = abbbab. The suffix tree T contains t nodes, and it holds n ≤ t < 2n. In a
deterministic tree the first letters of every edge are referred to as branching letters.
A point p in a labeled tree is either a node or a string position in some edge label.
The path-label of a point p in a labeled tree is the concatenation of the edge labels
from the ROOT down to p. We refer indifferently to nodes v and to their path labels, also
denoted by v. The i-th letter of the path label of node v is denoted LETTER(v, i) = v[i],
and in general we use LETTER(v, i, ℓ) = v[i..i + ℓ − 1]. The string-depth of a node v,
denoted SDEP(v), is the length of its path label. SLAQ(v, d) is the highest ancestor
v′ of node v with SDEP(v′) ≥ d. CHILD(v, X) is the node that results of descending

Fig. 1. Suffix tree T of string abbbab, with the leaves numbered 0 to 6. The arrow shows the
SLINK between node ab and b. Below the tree we show the suffix array, A = 6 4 0 5 3 2 1. The
portion of the tree corresponding to node b and the respective leaf interval is within a dashed
box. The sampled nodes have bold outlines.

Fig. 2. Reverse tree T^R.

Fig. 3. Parentheses representations of trees. On top, the suffix tree: ((0)((1)(2))((3)(4)((5)(6)))).
In the middle, the sampled tree: ( 0 1 2 (3)(4) 5 6 ), with bitmap B = 1000101101001. On the
bottom, the sampled tree when b is also sampled: ( 0 1 2 ((3)(4) 5 6 )), with bitmap
B = 100011011010011. The numbers give the indices of the leaves and parentheses; they are
not part of the representation and are shown for clarity.

from v by the edge whose label starts with symbol X, if it exists. The suffix-link of
a node v ≠ ROOT of a suffix tree, denoted SLINK(v), is a pointer to node v[1..] (that
is, the longest proper suffix of v; this node always exists). Note that SDEP(v) of a leaf
v identifies the suffix of T$ starting at position LOCATE(v) = n − SDEP(v). In our
example, T[LOCATE(ab$)..] = T[(7 − 3)..] = T[4..] = ab$. The list of LOCATE values
comprises another well-known structure.

Definition 2.2. [Manber and Myers 1993] The suffix array A[0, n − 1] of a text T is
the sequence of starting positions of the suffixes of T$ in lexicographical order. This is
the same as the LOCATE values of the suffix tree leaves, if the children of the nodes
are ordered lexicographically by their branching letters.

Note that A is a permutation, and permutation A^{-1}[j] gives the lexicographical rank
of T[j..] among all the suffixes of T$.
The suffix tree nodes can be identified with suffix array intervals: Each node v cor-
responds to the range [vl, vr] of leaves that descend from v (since there are no unary
nodes, there are no two nodes with the same interval). These intervals are also re-
ferred to as lcp-intervals [Abouelhoda et al. 2004]. In our example, node b corresponds
to the interval [3, 6]. We will refer indifferently to nodes v and to their interval [vl, vr].
Leaves v correspond to [v, v] in this notation. For example by vl − 1 we refer to the
leaf immediately before vl, i.e., [vl − 1, vl − 1]. With this representation we can solve
COUNT(v) = vr − vl + 1, the number of leaves that descend from node v. In our ex-
ample, the number of leaves below b is 4 = 6 − 3 + 1. This is precisely the number of
times the string v occurs in the text T, and thus the pattern search problem for P re-
duces to navigating from the ROOT to the point denoting P, and then using COUNT to
determine the number of times P occurs in T, and using LOCATE(vl) . . . LOCATE(vr) to
output the occurrence positions.
The representation of ranges lets one trivially compute several other operations of
interest for suffix trees, such as ANCESTOR(v, v′) ⇔ vl ≤ v′l ≤ v′r ≤ vr, knowing
whether v follows v′ in T (⇔ v′r < vl), whether the preorder of v is smaller than that of
v′ (⇔ vl < v′l ∨ (vl = v′l ∧ vr > v′r)), whether a node is a leaf (vl = vr), the leftmost leaf of
node v (vl), etc.
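As a concrete illustration of this interval view, the following naive Python sketch (ours; the helpers interval, count and ancestor are illustrative names, not from the paper) builds the suffix array of the running example and checks the identities above.

# Interval view of suffix tree nodes on the running example T = abbbab.
T = "abbbab" + "$"              # terminator $, lexicographically smallest
n = len(T)

# Suffix array: starting positions of the suffixes of T$ in lexicographic order.
A = sorted(range(n), key=lambda i: T[i:])
assert A == [6, 4, 0, 5, 3, 2, 1]           # as in Figure 1

def interval(P):
    # Node/point P as the range [vl, vr] of leaves whose suffixes start with P.
    leaves = [k for k in range(n) if T[A[k]:].startswith(P)]
    return (leaves[0], leaves[-1]) if leaves else None

def count(v):                   # COUNT(v) = vr - vl + 1
    return v[1] - v[0] + 1

def ancestor(v, w):             # ANCESTOR(v, w) <=> vl <= wl <= wr <= vr
    return v[0] <= w[0] <= w[1] <= v[1]

b = interval("b")
assert b == (3, 6) and count(b) == 4        # "b" occurs 4 times in T
assert ancestor(b, interval("bb"))          # bb descends from b
assert A[interval("ab$")[0]] == 7 - 3       # LOCATE(ab$) = n - SDEP(ab$) = 4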

ACM Transactions on Algorithms, Vol. V, No. N, Article A, Publication date: January YYYY.
L. Russo, G. Navarro, A. Oliveira A:7

2.2. Compact Data Structures


We make heavy use of compact and compressed data structures for representing
bitmaps and trees. We give now the results we build on along the paper. As already
mentioned, we will develop a static and a dynamic variant of our FCST. Hence we will
give static and dynamic variants of all the data structures we create.
Let B[0, n − 1] be a bitmap of length n. Then we define operations RANKb(B, i) as
the number of occurrences of bit b in B[0, i], and SELECTb(B, j) as the position of the
(j + 1)-th occurrence of bit b in B. We build on the following compressed bitmap rep-
resentations that support RANK and SELECT. Their space is asymptotically optimal
among all the bitmaps with the same number of bits set [Pǎtraşcu and Viola 2010].
THEOREM 2.3 ([Pǎtraşcu 2008]). Let B[0, n − 1] contain m 1s. Then there
exists a data structure that answers RANK and SELECT in constant time O(c) using
m log(n/m) + O(m + n/ log^c n) bits of space.
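To fix the semantics, here is a naive Python sketch of the two operations (ours; linear-time scans, whereas the structures of this section provide the same interface in constant time within compressed space).

def rank(B, b, i):
    # RANK_b(B, i): occurrences of bit b in B[0..i].
    return B[:i + 1].count(b)

def select(B, b, j):
    # SELECT_b(B, j): position of the (j+1)-th occurrence of bit b in B.
    return [p for p, x in enumerate(B) if x == b][j]

B = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
assert rank(B, 1, 4) == 3                   # three 1s in B[0..4]
assert select(B, 1, 2) == 3                 # the third 1 sits at position 3
assert rank(B, 0, select(B, 0, 3)) == 4     # rank and select are inverses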
In the dynamic scenario, we also wish to support operations INSERT(B, i, b), which
inserts bit b between B[i − 1] and B[i], and DELETE(B, i), which deletes B[i] from the
sequence.
THEOREM 2.4 ([Navarro and Sadakane 2010]). Let B[0, n − 1] contain m
1s. Then there exists a data structure that answers RANK and SELECT, and executes
operations INSERT and DELETE, all in O(log n/ log log n) time, using m log(n/m) +
O(m + n log log n/ log n) bits of space.

We will also handle general trees of n nodes, which can be represented using 2n + o(n)
bits while supporting in constant time a number of traversal operations. This space
is asymptotically optimal considering all the trees of n nodes. For this paper we are
interested in the following operations: PREORDER(v) (and its inverse), which gives the
preorder position of node v in the tree starting at zero; PARENT(v); LCA(v, v′); TDEP(v);
and TLAQ(v, d).
A useful tree representation, which will be necessary at some points in the paper,
is based on balanced parentheses: Do a preorder traversal and write a ’(’ when you
arrive at a node and a ’)’ when you leave it. This sequence is regarded as a bitmap
supporting RANK and SELECT operations. In addition, the following operations on
the parentheses are supported: FINDMATCH(u) finds the matching parenthesis of u;
ENCLOSE(u) finds the nearest pair of matching parentheses that encloses u; and in
some cases DOUBLEENCLOSE(u, u′), which finds the nearest pair of parentheses that
encloses both u and u′.
These operations on the parentheses support most of the tree operations we need.
If tree node v is identified with the position of its opening parenthesis in the se-
quence B, then PREORDER(v) = RANK_’(’(B, v) − 1, PREORDER^{-1}(i) = SELECT_’(’(B, i),
TDEP(v) = RANK_’(’(B, v) − RANK_’)’(B, v), PARENT(v) = ENCLOSE(v), and LCA(v, v′) =
DOUBLEENCLOSE(v, v′). Only operation TLAQ(v, d) needs special treatment. We will
use a representation that supports all of these operations within optimal space
[Pǎtraşcu and Viola 2010].
THEOREM 2.5 ([Sadakane and Navarro 2010]). Let a general tree of n
nodes be represented as a sequence of 2n balanced parentheses. Then there exists a data
structure supporting operations PREORDER(v), PREORDER^{-1}(i), LCA(v, v′), TDEP(v),
TLAQ(v, d), and PARENT(v) on the tree, and RANK(v), SELECT(v), FINDMATCH(v),
ENCLOSE(v), and DOUBLEENCLOSE(v, v′) on the parentheses, in constant time ttree =
O(c) using 2n + O(n/ log^c n) bits of space.
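These identities are easy to check with naive linear-time code. The sketch below (ours, in Python; it scans the sequence rather than using the constant-time machinery of Theorem 2.5) implements FINDMATCH and ENCLOSE and derives PREORDER, TDEP and LCA on the sampled tree of the running example.

# Naive parenthesis operations; a tree node is the index of its '(' in S.
S = "(()())"                          # the sampled tree of Figure 3

def find_match(S, u):                 # FINDMATCH: matching ')' of the '(' at u
    depth = 0
    for i in range(u, len(S)):
        depth += 1 if S[i] == '(' else -1
        if depth == 0:
            return i

def enclose(S, u):                    # ENCLOSE: nearest '(' strictly enclosing u
    depth = 0
    for i in range(u - 1, -1, -1):
        depth += 1 if S[i] == ')' else -1
        if depth < 0:
            return i                  # returns None for the root

def preorder(S, v):                   # PREORDER(v) = RANK_'('(B, v) - 1
    return S[:v + 1].count('(') - 1

def tdep(S, v):                       # TDEP(v) = RANK_'('(B, v) - RANK_')'(B, v)
    return S[:v + 1].count('(') - S[:v + 1].count(')')

def lca(S, v, w):                     # LCA via DOUBLEENCLOSE
    if v > w:
        v, w = w, v
    while find_match(S, v) < w:       # climb until v's subtree covers w
        v = enclose(S, v)
    return v

assert [preorder(S, v) for v in (0, 1, 3)] == [0, 1, 2]
assert enclose(S, 1) == 0             # PARENT of the first child is the root
assert lca(S, 1, 3) == 0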


In the dynamic case, we wish to INSERT tree nodes, and DELETE tree leaves
or unary nodes. The update operations are then translated into INSERT(u, u′) and
DELETE(u, u′), which insert or delete matching parentheses located at u, u′. On the
other hand, we will not need TLAQ(v, d).
THEOREM 2.6 ([Navarro and Sadakane 2010]). A sequence of 2n balanced
parentheses can be maintained in 2n + O(n log log n/ log n) bits of space while sup-
porting the same operations of Theorem 2.5 except TLAQ(v, d), plus INSERT(v, v′) and
DELETE(v, v′), in ttree = O(log n/ log log n) worst-case time.

3. USING COMPRESSED SUFFIX ARRAYS


We are interested in compressed suffix arrays (CSAs) because they have very compact
representations and support partial suffix tree functionality (being usually more pow-
erful than the classical suffix arrays [Navarro and Mäkinen 2007]). Apart from the
basic functionality of retrieving A[i] = LOCATE(i), state-of-the-art compressed suffix
arrays support operation SLINK(v) for leaves v. This is called ψ(v) in the literature:
A[ψ(v)] = A[v] + 1, and thus SLINK(v) = ψ(v); let its time complexity be O(tψ). The
iterated version of ψ, denoted ψ^i, can usually be computed faster than O(i tψ) with
compressed indexes. This is achieved with A and A^{-1}: ψ^i(v) = A^{-1}[A[v] + i]. Let O(tSA)
be the time complexity to compute A and A^{-1} (and hence to compute LOCATE). CSAs
also support operation WEINERLINK(X, v), which, for a node v, gives the suffix tree
node with path label X.v. This is called the LF mapping (for leaves) in compressed
suffix arrays, and is a kind of inverse of ψ: LF(X, v) gives the lexicographical rank
of the suffix X.T[A[v]..] among all the suffixes, whether it exists or not. Let O(tLF) be
the time complexity to compute LF. It is easy to extend LF to suffix tree nodes v:
LF(X, v) = LF(X, [vl, vr]) = [LF(X, vl), LF(X, vr)] = WEINERLINK(X, v).
Consider the interval [3, 6] in our example, which represents the leaves whose path
labels start with b. In this case we have that LF(a, [3, 6]) = [1, 2], i.e., by using the LF
mapping with a we obtain the interval of leaves whose path labels start with ab. We also
extend the notation of LF to strings: LF(X.α, v) = LF(X, LF(α, v)).
Compressed suffix arrays are usually self-indexes, meaning that they replace the
text: It is possible to extract any substring, of size ℓ, of the indexed text in O(tSA +
tψ(ℓ − 1)) time. A particularly easy case that is solved in constant time is to extract
T[A[v]] for a suffix array cell v, that is, the first letter of a given suffix. Since suffixes are
lexicographically sorted, one can partition A into at most σ intervals, where suffixes
start with the same letter. Self-indexes store, in some way or another, O(σ log(n/σ)) bits
that allow one to compute in constant time the partition where any v belongs [Navarro
and Mäkinen 2007]. This corresponds to LETTER(v, 0) = v[0], the first letter of the
path label of leaf v. To obtain LETTER(v, 0, ℓ) one repeats the process ℓ times, on v,
ψ(v), ψ(ψ(v)), and so on, in O(1 + tψ(ℓ − 1)) time. The more general LETTER(v, i, ℓ) is
reduced to LETTER(ψ^i(v), 0, ℓ) in O(tSA + tψ(ℓ − 1)) time. Finally, in order to extract
T[i..j] one computes v = A^{-1}[i] and then LETTER(v, 0, j − i + 1), in O(tSA + tψ(j − i))
time.
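The next sketch (ours) pins down these semantics on the running example by computing ψ, LF, WEINERLINK and LETTER directly from a plain suffix array; a real CSA offers the same interface without storing T or A. The interval version of weiner_link below is one straightforward way to realize the node-level mapping.

# CSA operation semantics on the running example, from a plain suffix array.
T = "abbbab" + "$"
n = len(T)
A = sorted(range(n), key=lambda i: T[i:])       # [6, 4, 0, 5, 3, 2, 1]
Ainv = [0] * n
for k, pos in enumerate(A):
    Ainv[pos] = k                               # A^{-1}[j]: rank of suffix T[j..]

def psi(v):                 # SLINK for leaves: A[psi(v)] = A[v] + 1 (mod n)
    return Ainv[(A[v] + 1) % n]

def LF(X, v):               # lex rank of X . T[A[v]..] among the suffixes
    s = X + T[A[v]:]
    return sum(1 for k in range(n) if T[A[k]:] < s)

def weiner_link(X, v):      # interval of X.alpha, given the interval v of alpha
    vl, vr = v              # (a backward-search step over the text positions)
    hits = sorted(Ainv[j] for j in range(n - 1)
                  if T[j] == X and vl <= Ainv[j + 1] <= vr)
    return (hits[0], hits[-1]) if hits else None

def letter(v, i, l):        # LETTER(v, i, l) = v[i..i+l-1] for leaf v
    v = Ainv[(A[v] + i) % n]            # reduce to psi^i(v), then walk psi
    out = []
    for _ in range(l):
        out.append(T[A[v]])             # first letter of the current suffix
        v = psi(v)
    return "".join(out)

assert psi(3) == 0 and A[psi(3)] == A[3] + 1
assert LF("a", 6) == 2                          # rank of a.T[A[6]..] = abbbab$
assert weiner_link("a", (3, 6)) == (1, 2)       # the example in the text
assert letter(2, 0, 4) == "abbb"                # leaf 2 has path label abbbab$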
As anticipated, our compressed suffix tree representation will consist of a sampling
of the suffix tree plus a compressed suffix array representation. A well-known com-
pressed suffix array is that of Grossi et al. [2003]. One variant (their Lemma 3.2)
requires (1 + 1/ǫ)nHk + 2n(1 + log e) + O(n log log n/ log n) bits of space and has times
tψ = O(1), tSA = O(log^ǫ_σ n + log σ), and tLF = O(tψ log n), for any 0 < ǫ < 1, any
k ≤ α log_σ n, and any constant 0 < α < 1. A second variant (their Theorem 4.1) requires
(1 + 1/ǫ)nHk + O(n log σ log log n/ log n) bits and has times tψ = O(1 + log σ/ log log n)
(using multiary wavelet trees [Ferragina et al. 2007] instead of their earlier binary
wavelet trees), tSA = O(tψ log^ǫ_σ n),

Table III. Operations supported by static compressed suffix arrays (CSAs) and their space and time com-
plexities, which hold for any 0 < ǫ < 1, l ≥ 1, k ≤ α log_σ n and constant 0 < α < 1. We give two variants
of Grossi et al. [2003].

                                   [Grossi et al. 2003]    [Grossi et al. 2003]                   [Ferragina et al. 2007]
Space in bits                      (1 + 1/ǫ)nHk + Θ(n)     (1 + 1/ǫ)nHk                           nHk + O(n log σ log log n/ log n)
                                                           + O(n log log n/ log_σ^{ǫ/(1+ǫ)} n)    + O((n/l) log n)
ψ(v)                        tψ     O(1)                    O(1 + log σ/ log log n)                O(1 + log σ/ log log n)
A[v], A^{-1}[v], LOCATE(v)  tSA    O(log^ǫ_σ n + log σ)    O(tψ log^ǫ_σ n)                        O(l tLF)
LF(v), WEINERLINK(v)        tLF    O(tψ log n)             O(tψ log n)                            O(1 + log σ/ log log n)
v[i..i+ℓ−1], T[i..i+ℓ−1],          O(tSA + ℓ/ log_σ n)     O(tSA + ℓ/ log_σ n)                    O(tSA + tLF(ℓ − 1)) = O(tLF(l + ℓ − 1))
LETTER(v, i, ℓ)

and tLF = O(tψ log n).5 For our results we favor another compressed suffix array,
called the FM-index [Ferragina et al. 2007], which requires nHk + O(n log σ log log n/ log n)
bits, with the same limit on k. Its complexities are6 tψ = tLF = O(1 + log σ/ log log n) (us-
ing multiary wavelet trees again) and tSA = O(l tLF). This l is a suffix array sam-
pling parameter, such that we need O((n/l) log n) extra bits of space. For example,
if we set the extra space to O(n/ log log n) then we use l = log n log log n and achieve
tSA = O(log n(log σ + log log n)). Table III summarizes the supported CSA operations
and times.
We remark that, if log σ = Ω(log n/ log log n), the extra space of the FM-index includes an
extra Ω(n)-bit term. Although this is still o(n log σ) bits, which can be argued to be
reasonable for a text T whose plain representation takes n log σ bits, the main point of
this paper is to break the Θ(n) space barrier. In this sense, our results are interesting
for log σ = o(log n/ log log n), where the FM-index takes nHk + o(n) bits of space. This is a
reasonable assumption on σ and includes the interesting case σ = O(polylog(n)), on
which the FM-index offers constant tψ and tLF times.
Let us now consider dynamic CSAs. These handle a collection of texts, as if they
formed a single concatenated text. They offer the same functionalities of the static
CSAs on that text, plus insertion/deletion of texts into/from the collection. Two main
dynamic CSAs exist: That of Chan et al. [2007] is a natural extension of the static
CSA of Sadakane [2003]. It requires O(σn) bits of space, and offers complexities
tψ = tLF = O(log n), tSA = O(log² n), and insertion/deletion of a text T in time
O(|T|(tψ + tLF)). The FM-index has several dynamic versions as well [Ferragina and
Manzini 2000; Mäkinen and Navarro 2008; González and Navarro 2008]. The most
efficient version [Navarro and Sadakane 2010] achieves nHk + O(n log σ log log n/((1 − ǫ) log^ǫ n))
bits of space, for any k ≤ α log_σ(n) − 1 and any constants 0 < α, ǫ < 1. It offers times
tψ = tLF = O((log n/ log log n)(1 + log σ/ log log n)), and tSA = l tLF using other O((n/l) log n)
extra bits. Again, we set l = log n log log n to achieve O(n/ log log n) extra bits, which makes
tSA = O(log² n(1 + log σ/ log log n)). Insertion and deletion of a text T takes time O(|T|(tψ + tLF)).7

5 The complexities of both variants are incorrectly mixed in Fischer et al. [2009], an error that carries over
to Fischer [2010].
6 Function ψ(v) can be computed in the same time of LF on the FM-index [Lee and Park 2007].
7 This was just O(|T| tLF) in the original papers [Mäkinen and Navarro 2008; González and Navarro 2008],
but in Section 6 we will modify the deletion procedure to operate in left-to-right order. Thus our times are
O(|T| tLF) for insertions and O(|T| tψ) for deletions.


Table IV. Operations supported by dynamic compressed suffix arrays (CSAs) and their space and
time complexities, which hold for any 0 < ǫ < 1, l ≥ 1, k ≤ α log_σ(n) − 1 and constant 0 < α < 1.

                                   [Chan et al. 2007]    [Navarro and Sadakane 2010]
Space in bits                      O(σn)                 nHk + O(n log σ/((1 − ǫ) log^ǫ n))
                                                         + O((n/l) log n)
ψ(v)                        tψ     O(log n)              O((log n/ log log n)(1 + log σ/ log log n))
A[v], A^{-1}[v], LOCATE(v)  tSA    O(log² n)             l tLF
LF(v), WEINERLINK(v)        tLF    O(log n)              O((log n/ log log n)(1 + log σ/ log log n))
v[i..i+ℓ−1], T[i..i+ℓ−1],          tSA + tψ(ℓ − 1)       tSA + tLF(ℓ − 1) = (l + ℓ − 1)tLF
LETTER(v, i, ℓ)
INSERT(T) / DELETE(T)              |T|(tψ + tLF)         |T|(tψ + tLF)

Table IV summarizes these complexities. The dynamic CSA by Chan et al. [2007] is
used in their CST representation. In the FCST representation the focus is on minimal,
o(n), space, and therefore we will use the result by Navarro and Sadakane [2010].
A larger, but much faster, dynamic CSA was proposed by Gupta et al. [2007].
Their dynamic CSA requires n log σ + o(n log σ) bits of space and supports queries in
O(log log n) time, and O(1) time when σ = O(polylog(n)). Updates, however, are much
more expensive, O(n^ǫ) amortized time, for 0 < ǫ < 1. The FCST representation may use
this dynamic CSA. However, for this to be useful, one should also use faster dynamic
trees. While there are some supporting various operations in time O(log log n) [Raman
and Rao 2003; Arroyuelo 2008], none of these supports the crucial LCA operation.
Finally, let us mention a previous data structure called a “compressed suffix tree”
but which, under the terminology of this paper, offers just compressed suffix array
functionality. Munro et al. [2001] propose what can be considered as a predecessor
of Sadakane’s CST, as it uses a suffix array and a compact tree. By using it on top
of an arbitrary CSA, its smallest variant would take |CSA| + o(n) bits plus the text
(which could be compressed to nHk + o(n log σ) bits and support constant-time access
[Ferragina and Venturini 2007]) and find the suffix array interval corresponding to
pattern P [1, m] in time O(m tSA log σ). The FM-index alone, however, is the smallest
CSA and can do the same without any other structure in time O(m tLF ), which is always
faster. Munro et al. [2001] can also achieve time O(m tSA ), but for this they require
|CSA| + O(n log σ) bits and still do not support any other suffix tree operation. There
exists another previous compressed suffix tree description [Foschini et al. 2006] based
on an interval representation and sampling of the suffix tree. However, the description
is extremely brief and no details nor theoretical bounds on the result are given.

4. THE SAMPLED SUFFIX TREE


A pointer-based implementation of suffix trees requires Θ(n log n) bits to represent a
suffix tree of t < 2n nodes. As this is too much, we will store only a few sampled nodes.
We denote our sampling factor by δ, so that in total we sample O(n/δ) nodes. Hence,
provided δ = ω(log n), the sampled tree can be represented using o(n) bits. To fix
ideas we can assume δ = ⌈log n log log n⌉. In our running example we use δ = 4.
To illustrate the structure of the sampled tree, Figure 3 shows the balanced paren-
theses representation of the tree of Figure 1. The representation of the sampled tree
is obtained by deleting the parentheses of the non-sampled nodes, as in Figure 3. For
the sampled tree to be representative of the suffix tree it is necessary that every node
is, in some sense, close enough to a sampled node.
Definition 4.1. A δ-sampled tree S of a suffix tree T with t nodes is formed by
choosing s = O(t/δ) nodes of T so that for each node v of T there is an i < δ such that
node SLINK^i(v) is sampled.


This means that if we start at v and follow suffix links successively, i.e., v, SLINK(v),
SLINK(SLINK(v)), . . ., we will find a sampled node in at most δ steps. Note that this
property implies that the ROOT must be sampled, since SLINK(ROOT) is undefined.
We now show that it is possible to δ-sample a suffix tree.
THEOREM 4.2. There exists a δ-sampled tree S for any suffix tree T.

PROOF. We sample the nodes v such that SDEP(v) ≡_{δ/2} 0 and there is another node
v′ such that v = SLINK^{δ/2}(v′). Since SDEP(SLINK^i(v′)) = SDEP(v′) − i, this guarantees
that, for any v′ such that SDEP(v′) ≥ δ/2, the SDEP(SLINK^i(v′)) ≡_{δ/2} 0 condition holds
for exactly two values in the range i ∈ [0, δ − 1]. For the largest of those two i values, the
second sampling condition must hold as well. (If SDEP(v′) < δ/2, v′ is sufficiently close
to the ROOT, which is sampled.) On the other hand, for each sampled node v ≠ ROOT,
there are at least δ/2 − 1 other non-sampled nodes that point to it via SLINK^i, as their
SDEP is not a multiple of δ/2. Hence there are s ≤ 1 + t/(δ/2 − 1) = O(t/δ) = O(n/δ)
sampled nodes.
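The sampling rule used in this proof can be stated operationally. The following Python sketch (ours; nodes are identified by their path labels, and the slink and sdep dictionaries stand for the suffix tree's suffix links and string depths) computes the sampled set of the running example for δ = 4 and recovers exactly the ROOT plus leaves 3 and 4, as in Figure 3.

# Sampling rule of Theorem 4.2 (naive sketch): sample v iff SDEP(v) is a
# multiple of delta/2 and v = SLINK^{delta/2}(v') for some node v'.
def delta_sample(nodes, root, slink, sdep, delta):
    h = delta // 2
    targets = set()                  # nodes reachable by h suffix-link steps
    for v in nodes:
        w, ok = v, True
        for _ in range(h):
            if w == root:            # SLINK is undefined on the ROOT
                ok = False
                break
            w = slink[w]
        if ok:
            targets.add(w)
    return {root} | {v for v in nodes if sdep[v] % h == 0 and v in targets}

# Running example (delta = 4): nodes are given by their path labels, SLINK
# drops the first letter, SDEP is the label length.
nodes = ["", "ab", "b", "bb", "abbbab$", "ab$", "b$",
         "bab$", "bbab$", "bbbab$", "$"]
slink = {v: v[1:] for v in nodes if v != ""}
sdep  = {v: len(v) for v in nodes}
assert sorted(delta_sample(nodes, "", slink, sdep, 4)) == ["", "b$", "bab$"]
# i.e., the ROOT plus leaves 3 (b$) and 4 (bab$), matching Figure 3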
We represent the sampled tree S as a sequence of balanced parentheses, using
Theorem 2.5. Operations PREORDER_S, PREORDER_S^{-1}, PARENT_S, TLAQ_S, LCA_S, and
TDEP_S are all supported in constant time and O(n/δ) bits of space. We will also need
to store, in arrays indexed by PREORDER_S(v), the values SDEP(v) and TDEP(v) for
the sampled nodes (do not confuse TDEP(v) = TDEP_T(v), the depth of a sampled node
v in the suffix tree, with TDEP_S(v), the depth of v in the sampled tree). These arrays
require O((n/δ) log n) bits of space.
In the dynamic case we use Theorem 2.6 to represent S with balanced parentheses.
This takes O(n/δ) bits and supports operations PREORDER_S, PREORDER_S^{-1}, PARENT_S,
LCA_S, and TDEP_S, all in O(log n/ log log n) time. The structure also supports insertion
and deletion of leaves and unary nodes. The representation also needs to maintain the
SDEP values of nodes in S, which are handled using a simple dynamic structure such
as that presented by Navarro and Sadakane [2010]: It allows inserting, deleting and
accessing the values in O(log n/ log log n) time while using O((n/δ) log n) bits of space.
In order to make effective use of the sampled tree, we need a way to map any node
v to its lowest sampled ancestor, LSA(v). Another important operation is the lowest
common sampled ancestor LCSA(v, v′) = LSA(LCA(v, v′)), i.e., the lowest common ances-
tor in the sampled tree S. In our example, LCSA(3, 4) is the ROOT, whereas LCA(3, 4)
is [3, 6], i.e., the node labeled b. The next lemma shows how the general LCSA and LSA
queries can be answered if LSA for leaves is available, and then we go on to solve that
specific problem. The mapping will also let us compute the range [vl, vr] of a sampled
node v.
LEMMA 4.3. Let v and v′ be nodes of a suffix tree T and S a δ-sampled subtree;
then the following properties always hold:
v1 ancestor of v2 ⇒ LSA(v1) ancestor of LSA(v2)    (1)
LCSA(v, v′) = LCA_S(v, v′), when v and v′ belong to S    (2)
LCSA(v, v′) = LCSA(LSA(v), LSA(v′))    (3)
LCSA(v, v′) = LCA_S(LSA(v), LSA(v′))    (4)
LSA(v) = LCA_S(LSA(vl), LSA(vr))    (5)
PROOF. For (1), LSA(v1) is transitively an ancestor of v2 and it is sampled, thus by
definition of LSA it is also an ancestor of LSA(v2).
For the rest of the proof let us define v′′ = LCSA(v, v′) = LSA(LCA(v, v′)). For
Eq. (2) note that v′′ is a node of S and it is an ancestor of both v and v′, since it is
an ancestor of LCA(v, v′). Therefore by the definition of LCA_S we conclude that v′′ is
an ancestor of v′′′ = LCA_S(v, v′). On the other hand v′′′ is an ancestor of v and of v′.
Therefore v′′′ is an ancestor of LCA(v, v′). Taking LSA on both nodes (1) we have that
LSA(v′′′) = v′′′ is an ancestor of v′′.
For Eq. (3) note that since LCA(v, v′) is an ancestor of v, we have by (1)
that v′′ is an ancestor of LSA(v), and likewise of LSA(v′). Hence v′′ is an an-
cestor of LCA(LSA(v), LSA(v′)). Therefore by (1) LSA(v′′) = v′′ is an ancestor
of v* = LSA(LCA(LSA(v), LSA(v′))) = LCSA(LSA(v), LSA(v′)). On the other
hand LSA(v) is an ancestor of v, and likewise LSA(v′) is an ancestor of v′. Therefore
LCA(LSA(v), LSA(v′)) is an ancestor of LCA(v, v′). Hence v* is an ancestor of v′′.
For Eq. (4) observe that LSA(v) and LSA(v′) belong to S, hence we can use Eq. (2)
to conclude that LCSA(LSA(v), LSA(v′)) = LCA_S(LSA(v), LSA(v′)). Using Eq. (3) we
can replace the first term by LCSA(v, v′).
For Eq. (5) note that v = LCA(vl, vr). Therefore using Eq. (4) we have that LSA(v) =
LSA(LCA(vl, vr)) = LCSA(vl, vr) = LCA_S(LSA(vl), LSA(vr)).

4.1. Computing LSA for Leaves


We use the following data structures to provide the mapping LSA between leaves
v = [v, v] and their lowest sampled ancestors in S, and conversely, to obtain the range
[vl , vr ] for sampled nodes v.
(1) We will identify S with its balanced parentheses representation S[0, 2s − 1], so
that we will speak indistinctly of nodes in S and their tree operations, and positions in
S and their parenthesis operations.
(2) A bitmap B[n + 2s] containing 2s ones, which correspond to the parentheses of
S, and n zeros, which correspond to the suffix tree leaves. If leaf v is contained in the
sampled node represented by parentheses S[u] and S[u′], then the 0 bit corresponding
to v must be placed between the (u + 1)-th and the (u′ + 1)-th 1 of B. Since B contains 2s
ones, its representation using Theorem 2.3 requires at most 2s log((n + 2s)/(2s)) + O(s) + o(n +
2s) = O((n/δ) log δ) bits of space, and supports constant-time RANK and SELECT.
In our example S = (()()) and B = 1000101101001, see Figure 3. An operational way
to describe B, which is useful to explain later the dynamic case, is as follows: Initialize
it with n bits all equal to 0. Now, for every sampled node v = [vl, vr], insert a 1 at
SELECT0(B, vl) and another right after SELECT0(B, vr).
To compute LSA we use an auxiliary function defined as follows:
PRED(v) = RANK1(B, SELECT0(B, v)) − 1 = SELECT0(B, v) − v − 1,
which gives the position of the last parenthesis in S preceding leaf v.
LEMMA 4.4. Let p = PRED(v). If S[p] = ’(’ then p is the lowest sampled ancestor of
v; otherwise it is ENCLOSE_S(FINDMATCH_S(p)).

PROOF. If S[p] = ’(’, then FINDMATCH_S(p) closes after leaf v, and by definition
of B, v is contained in p. The descendants of p, if any, start after v and hence do not
contain it. If S[p] = ’)’ then, by definition of B, p′ = FINDMATCH_S(p) does not contain
v (it closes before v) nor does its next sibling, if any (it opens after v), but ENCLOSE_S(p′)
opens before and closes after v and hence it is the lowest node containing v.
Now we are ready to give the formula for LSA(v). Notice that v = [v, v] is a leaf
position, and the answer is an identifier in tree S:
p = PRED(v)
LSA(v) = if S[p] = ’(’ then p else ENCLOSE_S(FINDMATCH_S(p))


Consider for example the leaf numbered 5 in Figure 3. This leaf is not sampled, but
in the original tree it appears between leaf 4 and the end of the tree, more specifically
between the parenthesis ’)’ of 4 and the parenthesis ’)’ of the ROOT. Thus PRED(5) = 4. In
this case, since the parenthesis we obtain is a ’)’, we know that LSA is the parent of
that node.
In the opposite direction, we wish to find out the leaf interval [vl, vr] corresponding
to a sampled node identifier v of S. This is not hard to do:
vl = RANK0(B, SELECT1(B, v))
vr = RANK0(B, SELECT1(B, FINDMATCH_S(v))) − 1
Summarizing, we can map from sampled nodes in S to suffix tree nodes [vl, vr], as
well as the reverse with operations LSA and LCSA, all in constant time and using
O((n/δ) log δ) bits of space.
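All the machinery of this subsection fits in a few lines of naive code. The sketch below (ours; it mirrors the rank/select and parenthesis sketches of Section 2.2 on the running example) implements PRED, LSA, the mapping from sampled nodes to intervals, and LCSA via Eq. (4).

# LSA machinery on the running example. Leaves are suffix array positions
# 0..6; sampled nodes are identified by '(' positions in S.
S = "(()())"                         # sampled tree: ROOT plus leaves 3 and 4
B = [1,0,0,0,1,0,1,1,0,1,0,0,1]      # B = 1000101101001 (Figure 3)

def select(B, b, j):                 # position of the (j+1)-th b in B
    return [p for p, x in enumerate(B) if x == b][j]

def rank(B, b, i):                   # occurrences of b in B[0..i]
    return B[:i + 1].count(b)

def match(S, u):                     # matching parenthesis of position u
    step = 1 if S[u] == '(' else -1
    depth, i = 0, u
    while True:
        depth += 1 if S[i] == '(' else -1
        if depth == 0:
            return i
        i += step

def enclose(S, u):                   # nearest '(' strictly enclosing u
    depth = 0
    for i in range(u - 1, -1, -1):
        depth += 1 if S[i] == ')' else -1
        if depth < 0:
            return i

def pred(v):                         # PRED(v) = SELECT_0(B, v) - v - 1
    return select(B, 0, v) - v - 1

def lsa(v):                          # lowest sampled ancestor of leaf v
    p = pred(v)
    return p if S[p] == '(' else enclose(S, match(S, p))

def interval_of(v):                  # sampled node v -> leaf interval [vl, vr]
    vl = rank(B, 0, select(B, 1, v))
    vr = rank(B, 0, select(B, 1, match(S, v))) - 1
    return (vl, vr)

def lcsa(v, w):                      # Eq. (4): LCA_S(LSA(v), LSA(w))
    a, b = sorted((lsa(v), lsa(w)))
    while match(S, a) < b:           # DOUBLEENCLOSE on the sampled tree
        a = enclose(S, a)
    return a

assert lsa(5) == 0                   # the text's example: PRED(5) = 4, LSA = ROOT
assert lsa(3) == 1 and lsa(4) == 3   # leaves 3 and 4 are themselves sampled
assert interval_of(0) == (0, 6)      # the ROOT covers all leaves
assert lcsa(3, 4) == 0               # LCSA(3, 4) is the ROOT, as stated above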
In the dynamic case, we use Theorem 2.4 to handle B and Theorem 2.6 to handle
S. This retains the same space, and operations cost O(log n/ log log n) time. The update
operations we will need to carry out are: (i) insertion/deletion of leaves in B, when a
leaf appears in / disappears from the suffix tree T, and (ii) insertion/deletion of pairs of
matching parentheses in/from S (and their corresponding 1s in B), when nodes become
sampled/unsampled in S. Figure 3 illustrates the effect of sampling b = [3, 6] in our
running example.
For (i), we will want to insert a new leaf (that is, suffix array position) between
leaves v − 1 and v. If v − 1 and v are consecutive in B, i.e., SELECT0(B, v − 1) + 1 =
SELECT0(B, v), then we simply do INSERT(B, SELECT0(B, v), 0). Yet in general there
could be several sampled nodes containing the leaf. Thus the general procedure is
as follows. The new leaf is a child of some internal node v′ of T. We assume that in
case v′ had to be sampled due to the update, it is already in S. Before the new leaf
is inserted in B, since v′ cannot be unary, it is an ancestor of leaves v − 1 or v or
both. Let us assume v′ is an ancestor of v − 1; the other case is similar. We compute t =
TDEP_S(LSA(v − 1)) − TDEP_S(LSA(v′)) and run INSERT(B, SELECT0(B, v − 1) + t + 1, 0).
To remove leaf number v we run DELETE(B, SELECT0(B, v)).
For (ii), the insertion of a new node v = [vl, vr] in the sampled tree translates into
the insertion of a matching parentheses pair at positions (u, u′) in S. For example if
the new node encloses current node v then u = v and u′ = FINDMATCH_S(v); if it
is a leaf first child of v then u = u′ = v + 1; if it is a leaf next sibling of v then
u = u′ = FINDMATCH_S(v) + 1. After carrying out the insertion on S (via INSERT_S(u, u′ +
1)), we must update B. We compute m′ = max(SELECT1(B, u′), SELECT0(B, vr))
for INSERT(B, m′ + 1, 1) and then m = max(SELECT1(B, u), SELECT0(B, vl)) for
INSERT(B, m, 1). For removing sampled node v, after DELETE_S(u, u′) for u = v and
u′ = FINDMATCH_S(v), we also update B by DELETE(B, SELECT1(B, u′)) and then
DELETE(B, SELECT1(B, u)).
Thus, all the updates required for the dynamic case can be carried out in
O(log n/ log log n) time per update to S or to T.
5. SUFFIX TREE NAVIGATION
We start this section by showing in Lemma 5.1 a simple relation between the SLINK
and LCA operations, and use this relation to obtain an algorithmic way of computing
the SDEP value of non-sampled nodes8, in Lemma 5.2. This algorithmic procedure

8 A detailed exposition on why these properties are important for representing suffix trees is given in Appendix 9.


Fig. 4. Schematic representation of the relation between LCA and SLINK; see Lemma 5.1.
Curved arrows represent SLINK and straight arrows the ψ function.

Fig. 5. Schematic representation of the v_{i,j} nodes of the SLAQ operation. The nodes sampled
because of Definition 4.1 are in bold and those sampled because of the condition on TDEP are
filled.

turns out to be flexible enough to support a complete spectrum of operations, which we
explain throughout this section.
LEMMA 5.1. For any nodes v, v′ such that LCA(v, v′) ≠ ROOT we have that
SLINK(LCA(v, v′)) = LCA(SLINK(v), SLINK(v′)).

PROOF. Assume that the path-labels of v and v′ are respectively X.α.Y.β and
X.α.Z.β′, where Y ≠ Z. According to the definitions of LCA and SLINK, we have
that LCA(v, v′) = X.α and SLINK(LCA(v, v′)) = α. On the other hand the path-
labels of SLINK(v) and SLINK(v′) are respectively α.Y.β and α.Z.β′. Therefore the
path-label of LCA(SLINK(v), SLINK(v′)) is also α. Hence this node must be the same
as SLINK(LCA(v, v′)).

Figure 4 illustrates this lemma; ignore the nodes associated with ψ. The condition
LCA(v, v′) ≠ ROOT is easy to verify, in a suffix tree, by comparing the first letters of
the path-labels of v and v′, i.e., LCA(v, v′) ≠ ROOT iff v[0] = v′[0].
The next lemma shows a fundamental property that relates the kernel operations
LCA, SLINK, and SDEP.

LEMMA 5.2. Let v, v′ be nodes in a δ-sampled suffix tree such that
SLINK^r(LCA(v, v′)) = ROOT, and let d = min(δ, r + 1). Then
SDEP(LCA(v, v′)) = max_{0≤i<d} {i + SDEP(LCSA(SLINK^i(v), SLINK^i(v′)))}

PROOF. The following reasoning holds for any valid i:
SDEP(LCA(v, v′)) = i + SDEP(SLINK^i(LCA(v, v′)))    (6)
                 = i + SDEP(LCA(SLINK^i(v), SLINK^i(v′)))    (7)
                 ≥ i + SDEP(LCSA(SLINK^i(v), SLINK^i(v′)))    (8)
Eq. (6) holds by iterating the fact that SDEP(v′′) = 1 + SDEP(SLINK(v′′)) for any node
v′′ for which SLINK(v′′) is defined. Eq. (7) results from applying Lemma 5.1 repeat-
edly. Inequality (8) comes from the definition of LCSA and the fact that if node v′′′ is
an ancestor of node v′′ then SDEP(v′′) ≥ SDEP(v′′′). Therefore SDEP(LCA(v, v′)) ≥
max_{0≤i<d} {. . .}. On the other hand, from Definition 4.1 we know that for some i < δ
the node SLINK^i(LCA(v, v′)) is sampled. The formula goes only up to d, but d < δ only
if SLINK^{d−1}(LCA(v, v′)) = ROOT, which is also sampled. According to the definition of
LCSA, inequality (8) becomes an equality for that node. Hence SDEP(LCA(v, v′)) ≤
max_{0≤i<d} {. . .}.


To apply Lemma 5.2 we need to support operations LCSA, SDEP, and SLINK.
Operation LCSA is supported in constant time (Section 4). Since SDEP is ap-
plied only to sampled nodes, we have it readily stored in the sampled tree.
Hence the only obstacle is SLINK. Sadakane [2007] showed that SLINK(v) =
LCA(ψ(vl), ψ(vr)) whenever v ≠ ROOT. This is, now, a trivial consequence of
Lemma 5.1, since SLINK(v) = SLINK(LCA(vl, vr)) = LCA(SLINK(vl), SLINK(vr)) =
LCA(ψ(vl), ψ(vr)). This is not necessarily equal to the interval [ψ(vl), ψ(vr)]; see
node X.α in Figure 4, or consider the example of node bb represented by the
interval [5, 6], for which SLINK(v) = [3, 6] and [ψ(5), ψ(6)] = [4, 5]. In gen-
eral SLINK^i(v) = LCA(ψ^i(vl), ψ^i(vr)), which can be shown by induction using
Lemma 5.1: SLINK^{i+1}(v) = SLINK(SLINK^i(v)) = SLINK(LCA(ψ^i(vl), ψ^i(vr))) =
LCA(SLINK(ψ^i(vl)), SLINK(ψ^i(vr))) = LCA(ψ^{i+1}(vl), ψ^{i+1}(vr)).
Now we have the tools to support LCA using Lemma 5.2.

LEMMA 5.3. LCA(v, v′) = LF(v[0..i − 1], LCSA(SLINK^i(v), SLINK^i(v′))) for any
nodes v, v′, where i is the value maximizing the formula of Lemma 5.2.

PROOF. This is direct from Lemma 5.2. Let i be such that SLINK^i(LCA(v, v′)) is
sampled. From Lemma 5.1, SLINK^i(LCA(v, v′)) = LCA(SLINK^i(v), SLINK^i(v′)), which
is also the same as LCSA(SLINK^i(v), SLINK^i(v′)) because it is a sampled node. Note
that for the LF mapping we have that LF(v′′[0], SLINK(v′′)) = v′′. Applying this itera-
tively to SLINK^i(LCA(v, v′)) we obtain the equality in the lemma.

To use this lemma we must know which is the correct i. This is easily determined if we
first compute SDEP(LCA(v, v′)). Accessing the letters to apply LF is not a problem, as
it suffices to obtain the first letter of a path label, SLINK^i(v)[0] = SLINK^i(v′)[0]. But
we are stuck in a circular dependency between LCA and SLINK.
5.1. Solving the Kernel Operations
To get out of this dependency we will handle all the computation over leaves, for which
we can compute SLINK(v) = ψ(v) using the CSA.

LEMMA 5.4. For any two suffix tree nodes v, v′ we have LCA(v, v′) =
LCA(min{vl, v′l}, max{vr, v′r}).

PROOF. Let v′′ and v′′′ be respectively the nodes on the left and on the right of the
equality. Assume that they are represented as [v′′l, v′′r] and [v′′′l, v′′′r] respectively. Hence
v′′l ≤ vl, v′l and v′′r ≥ vr, v′r, since v′′ is an ancestor of v and v′. This means that v′′l ≤
min{vl, v′l} ≤ max{vr, v′r} ≤ v′′r, i.e., v′′ is also an ancestor of min{vl, v′l} and max{vr, v′r}.
Since v′′′ is by definition the lowest common ancestor of these nodes we have that v′′l ≤
v′′′l ≤ v′′′r ≤ v′′r. Using a similar reasoning for v′′′ we conclude that v′′′l ≤ v′′l ≤ v′′r ≤ v′′′r
and hence v′′ = v′′′.
Observe this property in Figure 4; ignore SLink, ψ, and the subtree on the right. Using this property and ψ, the equation in Lemma 5.2 reduces to

    SDep(LCA(v, v′)) = SDep(LCA(min{vl, vl′}, max{vr, vr′}))
                     = max_{0≤i<d} {i + SDep(LCSA(SLink^i(min{vl, vl′}), SLink^i(max{vr, vr′})))}
                     = max_{0≤i<d} {i + SDep(LCSA(ψ^i(min{vl, vl′}), ψ^i(max{vr, vr′})))}

Operationally, this corresponds to iteratively taking the ψ function, δ times or until the Root is reached. At each step we find the LCSA of the two current leaves
and retrieve its stored SDep. The overall process takes O(tψ δ) time. Note that in the dynamic scenario the rank and tree operations are slower by an O(log n / log log n) factor. Likewise, SDep and LCA simplify to

    SDep(v) = SDep(LCA(v, v)) = max_{0≤i<d} {i + SDep(LCSA(ψ^i(vl), ψ^i(vr)))}

    LCA(v, v′) = LF(v[0..i − 1], LCSA(ψ^i(min{vl, vl′}), ψ^i(max{vr, vr′})))

Now it is finally clear that we do not need SLink to compute LCA. The time to compute LCA is thus O((tψ + tLF)δ), and that to compute SDep is O(tψ δ). Using LCA we compute SLink(v) = LCA(ψ(vl), ψ(vr)) in O((tψ + tLF)δ) time, and SLink^i(v) = LCA(ψ^i(vl), ψ^i(vr)) in O(tSA + (tψ + tLF)δ) time. Note that the arguments to LCSA do not necessarily correspond to nodes, but the formulas hold in this case too.
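For concreteness, the following sketch shows how these formulas translate into code. It is a minimal illustration under assumed interfaces, not our implementation: psi, lf, and letter0 stand for the CSA operations ψ, LF, and the first letter of a leaf's suffix, while lcsa and sdep_s stand for the lowest common sampled ancestor and the SDep values stored in the sampled tree. A node is a pair (vl, vr) of leaf ranks, and leaves double as their own intervals.

    # Kernel operations over leaves (a sketch; all helper names are
    # hypothetical stand-ins for the operations described in the text).
    def sdep_of_lca(v, w, delta, psi, letter0, lcsa, sdep_s):
        """Returns (SDep(LCA(v, w)), maximizing i), as in Lemma 5.2."""
        x, y = min(v[0], w[0]), max(v[1], w[1])      # Lemma 5.4
        best, best_i = -1, 0
        for i in range(delta + 1):
            if x != y and letter0(x) != letter0(y):  # SLink^i(LCA) is the Root
                if i > best:
                    best, best_i = i, i              # i + SDep(Root) = i
                break
            cand = i + sdep_s(lcsa(x, y))
            if cand > best:
                best, best_i = cand, i
            x, y = psi(x), psi(y)                    # advance both leaves
        return best, best_i

    def lca(v, w, delta, psi, lf, letter0, lcsa, sdep_s):
        """LCA(v, w) = LF(v[0..i-1], LCSA(psi^i(x), psi^i(y))), Lemma 5.3."""
        _, i = sdep_of_lca(v, w, delta, psi, letter0, lcsa, sdep_s)
        x, y = min(v[0], w[0]), max(v[1], w[1])
        letters = []
        for _ in range(i):                    # record v[0..i-1] while advancing
            letters.append(letter0(x))
            x, y = psi(x), psi(y)
        node = lcsa(x, y)
        for c in reversed(letters):           # bring the sampled node back
            node = lf(c, node)
        return node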
Theorem 5.5. Suffix tree operations SDep, LCA, SLink, and SLink^i can be computed, respectively, in O(tψ δ), O((tψ + tLF)δ), O((tψ + tLF)δ), and O(tSA + (tψ + tLF)δ) time, provided a CSA implements ψ in O(tψ), LF in O(tLF), and A and A^{-1} in O(tSA) time, and we have a δ-sampled suffix tree.
5.2. Further Operations

We now show how other operations can be computed on top of the kernel ones.
Computing Parent(v). For any node v represented as [vl, vr], we have that Parent(v) is either LCA(vl − 1, vl) or LCA(vr, vr + 1), whichever is lowest. This is because suffix trees are compact, and hence Parent([vl, vr]) must contain [vl − 1, vr] or [vl, vr + 1]. Notice that if one of these nodes is undefined, either because vl = 0 or vr = n − 1, then the parent is the other node. If both nodes are undefined then node v is the Root, which has no Parent. The time is O((tψ + tLF)δ).
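A sketch of this rule, under the same hypothetical interface as above (lca and sdep denote the kernel operations):

    # Parent via the kernel LCA (a sketch; leaves are their own intervals).
    def parent(v, n, lca, sdep):
        vl, vr = v
        cands = []
        if vl > 0:
            cands.append(lca((vl - 1, vl - 1), (vl, vl)))
        if vr < n - 1:
            cands.append(lca((vr, vr), (vr + 1, vr + 1)))
        if not cands:
            return None                    # [0, n-1] is the Root: no Parent
        return max(cands, key=sdep)        # the lowest (deepest) of the two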
Computing Child(v, X). We show how Child can be computed in a general and efficient way directly over the CSA. The generalized branching for nodes v1 and v2 consists in determining the node with path label v1.v2, if it exists. A simple solution is to binary search the interval of v1 for the subinterval of the v′'s such that ψ^m(v′) ∈ v2, where m = SDep(v1). This approach requires O(tSA log n) time and was first considered on CSAs by Huynh et al. [2006]. Thus we are able to compute Child(v, X) using, as v2, the subinterval of A where the suffixes start with X. This is easily computed from the CSA as WeinerLink(X, Root).
This general solution can be improved by noticing that we are using SLink^m at arbitrary positions of the CSA for the binary search. Recall that SLink^m is solved via A and A^{-1}, i.e., SLink^m(vl) = ψ^m(vl) = A^{-1}[A[vl] + m]. Thus, we could sample A and A^{-1} regularly so as to store their values explicitly. That is, we explicitly store the values A[jθ] and A^{-1}[jθ] for all j.^9 To solve a generalized branching, we start by building a table of ranges D[0] = v2 and D[i] = LF(v1[m − i..m − 1], v2), for 1 ≤ i < θ. If m < θ the answer is D[m]. Otherwise, we binary search the interval of v1, accessing only the sampled elements of A. To determine the branching we should compute ψ^m(jθ) = A^{-1}[A[jθ] + m] for some jθ values in v1. To use the cheaper sampled A^{-1} as well, we need A[jθ] + m to be divisible by θ; thus we instead compute ψ^{m′} for m′ = ⌊(A[jθ] + m)/θ⌋θ − A[jθ]. Hence, instead of verifying that ψ^m(jθ) ∈ v2, we verify that ψ^{m′}(jθ) ∈ D[m − m′]. After this process we still have to binary search an interval of size O(θ), which is carried out naively.
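The following sketch illustrates the membership test performed at a sampled position jθ during this binary search. The names are ours, not part of any actual implementation: A and Ainv hold only the θ-spaced samples (assumed to be readable in O(1) time), lf is the LF mapping, and letter_of_v1(d) the d-th letter of v1's path label.

    # Generalized branching helpers (a sketch under assumed interfaces).
    def build_D(v2, m, theta, lf, letter_of_v1):
        D = [v2]                                   # D[i] = LF(v1[m-i..m-1], v2)
        for i in range(1, theta):
            D.append(lf(letter_of_v1(m - i), D[-1]))  # pull v2 back one symbol
        return D

    def sampled_in_v2(j_theta, m, theta, A, Ainv, D):
        a = A[j_theta]                             # sampled entry, O(1)
        m_prime = ((a + m) // theta) * theta - a   # m - theta < m' <= m
        lo, hi = D[m - m_prime]                    # range pulled back m - m' steps
        return lo <= Ainv[a + m_prime] <= hi       # a + m' is a multiple of theta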

9 Most CSAs already include such a sampling in one way or another [Navarro and Mäkinen 2007].
The overall process requires time O(tSA + (tψ + tLF)θ) to access the last letters of v1 and build D, plus O(log(n/θ)) for binary searching the samples, plus O(tSA log θ) for the final binary search. For example, we can just use θ = δ.^10 In this case the time is O((tψ + tLF)δ + tSA log δ + log n) and the extra space is O((n/δ) log n) bits. This is the value used in Tables V and VI.
Yet, regarding our discussion in Section 8, we wish to avoid more extra spaces of this magnitude. For the particular case of using the FM-index (under which we get our best results), we can do the following to achieve the same time with less extra space. Set θ = (tSA/tLF) log(tSA/tLF), so that the overall time is O(log n + (1 + tψ/tLF) tSA log(tSA/tLF)) (which is O(log n (log log n)^2) in Table I), and the extra space for the sampling is O((n/θ) log n). Recall Table III, where we defined tSA = l · tLF and chose l = log n log log n, to have O(n / log log n) = o(n) extra bits of space for the CSA. Hence O((n/θ) log n) = O((n/l)(log n / log l)). This is less than the O((n/l) log n) bits paid by the CSA for its own sampling. For the value of l we have chosen, it is O(n/(log log n)^2).
In a dynamic scenario we do not store exactly the A[jθ] values; instead, we guarantee that for any k there is a k′ such that k − θ < k′ ≤ k and A[k′] is sampled, and the same for A^{-1}. Still, the sampled elements of A and the m′ to use can be easily obtained in O(log n) time. Those sampled sequences are not hard to maintain upon insertions/deletions in A. For example, Mäkinen and Navarro [2008, Sec. 7.1] describe how to maintain A^{-1} (called SC there), and essentially how to maintain A (called SA there; the only missing point is how to maintain approximately spaced samples in A, which can be done exactly as for A^{-1}). Thus the space remains the same and the O(log n) term in the complexity becomes O(log^2 n).
Computing TDep(v). To compute TDep we add other O(n/δ) nodes to the sampled tree S, so as to guarantee that, for any suffix tree node v, Parent^j(v) is sampled for some 0 ≤ j < δ. Recall that the TDep(v) values are stored in S. Since TDep(v) = TDep(LSA(v)) + j, where LSA(v) = Parent^j(v), TDep(v) can be computed by reading TDep(LSA(v)) and adding the number of nodes between v and LSA(v). The sampling guarantees that j < δ. Hence, to determine j, we iterate Parent until reaching LSA(v). The total cost is O((tψ + tLF)δ^2).
To achieve this sampling property, we sample the nodes v such that TDep(v) ≡ 0 (mod δ/2) and Height(v) ≥ δ/2. Since TDep(Parent^i(v)) = TDep(v) − i, the first condition holds for exactly two i's in [0, δ − 1] if TDep(v) ≥ δ/2. Since Height is strictly increasing, the second condition holds for sure for the largest such i. On the other hand, since every sampled node has at least δ/2 descendants that are not sampled, it follows that we sample O(n/δ) extra nodes with this criterion.
We are unable to maintain either the sampled TDep values or the sampling property in the dynamic scenario. Therefore, this operation and the next two are not supported in the dynamic case.
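In the static setting, the computation just described can be sketched as follows (lsa, tdep_s, and parent are hypothetical accessors for the lowest sampled ancestor, the tree depths stored in S, and the Parent operation):

    # Static TDep (a sketch): climb to LSA(v), then add its stored depth.
    def tdep(v, lsa, tdep_s, parent):
        anchor, j = lsa(v), 0
        while v != anchor:          # fewer than delta steps, by the sampling
            v = parent(v)
            j += 1
        return tdep_s(anchor) + j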
Computing TLAQ(v, d). We extend the notation Parent_S(v) to represent LSA(v) when v is not sampled. Recall that the sampled tree supports constant-time level ancestor queries. Hence we have any Parent_S^i(v) in constant time, for any node v and any i. We binary search the Parent_S^i(v) to find the sampled node v′ with TDep(v′) ≥ d > TDep(Parent_S(v′)). Notice that this can be computed by evaluating only the second inequality, which refers to sampled nodes only. Now we iterate the Parent operation, from v′, exactly TDep(v′) − d times. We need the additional sampling introduced for TDep to guarantee that TDep(v′) − d < δ. Hence the total time is O(log n + (tψ + tLF)δ^2).
10 This speedup immediately improves the results of Huynh et al. [2006].

Computing SLAQ(v, d). We start by binary searching for the value m such that v′ = Parent_S^m(SLink^{δ−1}(v)) satisfies SDep(v′) ≥ d − (δ − 1) > SDep(Parent_S(v′)). Now we scan all the sampled nodes v_{i,j} = Parent_S^j(LSA(LF(v[i..δ − 1], v′))) with SDep(v_{i,j}) ≥ d − i and i, j < δ. This means that we start at node v′, follow LF, reduce every node found to the sampled tree S, and use Parent_S until the SDep of the node drops below d − i. Our aim is to find the v_{i,j} that minimizes SDep(v_{i,j}) − (d − i) ≥ 0, and then apply the LF mapping to it.

The time to perform this operation depends on the number of existing v_{i,j} nodes. For this operation the sampling must satisfy Definition 4.1 and the condition for computing TDep. Each condition contributes at most two sampled nodes for every δ nodes in one direction (SLink or Parent). Therefore, there are at most 4δ nodes v_{i,j} (see Figure 5), and thus the time is O(log n + (tψ + tLF)δ). Unfortunately, the same trick does not work for TDep and TLAQ, because we cannot know which is the "right" node without bringing all of them back with LF.
Computing FChild. To find the first child of v = [vl, vr], where vl ≠ vr, we simply compute SLAQ(vl, SDep(v) + 1). Likewise, if we use vr we obtain the last child. It is possible to avoid the binary search step of SLAQ by choosing v′ = Parent_S^m(LSA(SLink^{δ−1}(vl))) for m = TDep_S(LSA(SLink^{δ−1}(vl))) − TDep_S(LSA(SLink^{δ−1}(v))) − 1, if m ≥ 0, and v′ = SLink^{δ−1}(vl) if m = −1. Thus the time is O((tψ + tLF)δ).

In the dynamic case we do not have SLAQ. Instead, FChild(v) can be determined by computing X = Letter(vl, SDep(v) + 1) and then Child(v, X). The time for Child dominates.
Computing NSib. The next sibling of v = [vl, vr] can be computed as SLAQ(vr + 1, SDep(Parent(v)) + 1) for any v ≠ Root. Likewise, we can obtain the previous sibling with vl − 1. We must check that the answer has the same parent as v, to cover the case where there is no previous/next sibling. We can also skip the binary search.

Again, in the dynamic case, NSib(v) can be computed with Child: if Parent(v) = [vl′, vr′] and vr′ > vr, then we compute X = Letter(vr + 1, SDep(Parent(v)) + 1) and do Child(Parent(v), X).
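A sketch of the two dynamic computations just described, under the same hypothetical interface (letter(l, d) being the d-th letter, 1-based, of the path label of leaf l, as in the text):

    # Dynamic FChild and NSib via Letter and Child (a sketch).
    def fchild_dyn(v, letter, child, sdep):
        vl, vr = v
        if vl == vr:
            return None                          # leaves have no children
        return child(v, letter(vl, sdep(v) + 1))

    def nsib_dyn(v, parent, letter, child, sdep):
        p = parent(v)
        if p is None or p[1] <= v[1]:
            return None                          # the Root, or the last child
        return child(p, letter(v[1] + 1, sdep(p) + 1))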

We are ready to state our summarizing theorem.

Theorem 5.6. Using a compressed suffix array (CSA) with the properties stated in Table III, it is possible to represent a suffix tree with the properties given in Table V (FCST).
Table V also compares our FCST with the CST [Sadakane 2007] and the EBST [Fischer et al. 2009; Fischer 2010]. The times for the EBST are slightly simplified, as they depend on other parameters. The best choice for the FCST is to use the FM-index [Ferragina et al. 2007], which reaches the minimum space; FCST times do not improve by using other CSAs, because its operations depend on tψ + tLF. The alternatives, instead, depend mostly on tSA and tψ, so they improve significantly by using a slightly larger CSA [Grossi et al. 2003] that offers much better times for tψ and tSA but slower tLF; see Table III.
The operations in Table V provide an extremely functional suffix tree. Yet, not all the potentially interesting operations are supported. A notorious deficiency is the inability to efficiently compute the Preorder_T value of a suffix tree node. This is essential when we need to associate satellite information to nodes.

We propose an alternative scheme for this problem. The technique applies only to internal nodes and not to leaves, which can be indexed separately by their position in the CSA.
Table V. Comparison between compressed suffix tree representations. We omit the operations that are carried out directly on the CSA, see Table III. We simplify the FCST complexities by assuming δ = ω(log n), as otherwise the extra space is not o(n). We also assume that tψ, tLF, tSA = Ω(ttree). The f of the EBST must be O(log n / log log n) and Ω(log^[r] n) for some constant r ≥ 0, where log^[r] denotes r successive applications of log to n. For the EBST, "not supported" means that it needs at least twice the space to support those operations. Notice that Child can, alternatively, be computed using FChild and at most σ times NSib.

                  CST                 EBST                           Ours (FCST)
Space in bits     |CSA| + 6n + o(n)   |CSA| + O(n/f)                 |CSA| + O((n/δ) log n)
Root              O(1)                O(1)                           O(1)
Count             O(1)                O(1)                           O(1)
Ancestor          O(1)                O(1)                           O(1)
Parent            ttree               tSA f log log n                (tψ + tLF)δ
FChild            ttree               tSA f log^2 f                  (tψ + tLF)δ
NSib              ttree               tSA f (log^2 f + log log n)    (tψ + tLF)δ
LCA               ttree               tSA f (log^2 f + log log n)    (tψ + tLF)δ
TDep              ttree               not supported                  (tψ + tLF)δ^2
TLAQ              ttree               not supported                  (tψ + tLF)δ^2
Child             tSA log σ           tSA (f log^2 f + log σ)        tSA log δ + (tψ + tLF)δ
SLink             tψ                  tSA f (log^2 f + log log n)    (tψ + tLF)δ
SLink^i           tSA                 tSA f (log^2 f + log log n)    tSA + (tψ + tLF)δ
SDep              tSA                 tSA f log^2 f                  tψ δ
SLAQ              tSA log n           tSA f log n log^2 f            (tψ + tLF)δ

For a node v represented as [vl, vr], with Parent(v) represented as [vl′, vr′], if vl ≠ vl′ then vl is the index that represents v; otherwise we use vr. The Root is represented by 0. This identifier is computed in O((tψ + tLF)δ) time, and it guarantees that no index represents more than one node (as only the highest node of a leftmost/rightmost path can use the shared vl/vr value), but some indexes may represent no node at all. More precisely, this scheme yields identifiers in the range [0, n − 1] for the internal nodes, whereas there are only t − n < n of them.
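The identifier computation is a one-liner once Parent is available; the sketch below uses the same hypothetical interface as before:

    # Internal-node identifiers (a sketch): vl unless shared with the
    # parent, else vr; the Root gets 0.
    def node_id(v, parent):
        p = parent(v)
        if p is None:
            return 0                           # the Root
        return v[0] if v[0] != p[0] else v[1]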
6. UPDATING THE SUFFIX TREE AND ITS SAMPLING

The static FCST requires that we first build the classical suffix tree and then sample it. Thus the machine used for construction must have a very large main memory, or we must resort to secondary-memory suffix tree construction. Dynamic FCSTs permit handling a text collection where queries are interleaved with insertions and deletions of texts along time, and their space is asymptotically the same as that of their static variant. In particular, they solve the problem of constructing the static FCST within asymptotically the same space as the final static FCST: Start with an empty text collection, insert T, and then turn all the data structures into their static equivalents.

Along the paper we have given static and dynamic variants of all the data structures we have introduced. What remains is to explain how to modify our suffix tree representation to reflect the changes caused by inserting and removing texts T, and how to maintain our sampling conditions upon updates.
The CSA of Mäkinen and Navarro [2008], on which we build, inserts T in right-to-left order. It first determines the position of the new terminator and then uses LF to find the consecutive positions of longer and longer suffixes, until the whole T is inserted.^11 This right-to-left method perfectly matches the algorithm by Weiner [1973] to build the suffix tree of T: It first inserts suffix T[i + 1..] and then suffix T[i..], finding the points in the tree where the node associated to the new suffix is to be created, if it does not already exist. The node is found by using Parent until the WeinerLink operation returns a non-empty interval. This requires one amortized Parent and one amortized WeinerLink operation per symbol of T. This algorithm has the important invariant that the intermediate data structure is a suffix tree. Hence, by carrying it out in synchronization with the CSA insertion algorithm and with the insertion of the new leaves in bitvector B, we can use the current CSA and FCST to implement Parent and WeinerLink.

11 This insertion point is arbitrary in that CSA, thus there is no order among the texts. Moreover, all the terminators are the same in the CSA, yet this can be easily modified to handle different terminators as required in some bioinformatic applications [Gusfield 1997].
To maintain the property that the intermediate structure is a suffix tree, deletion of a text T must proceed by first locating the node of T that corresponds to the text T,^12 and then using SLinks to remove all the nodes corresponding to its suffixes. We must simultaneously remove the leaves in the CSA and in bitmap B (Mäkinen and Navarro's CSA deletes a text right-to-left, but it is easy to adapt it to use ψ instead of LF and delete left-to-right).

12 The dynamic CSA of Mäkinen and Navarro [2008] provides this functionality by returning a handle when inserting a text T, which can be used later to retrieve the CSA position of its first or last symbol. This requires O(N log n) extra bits of space when handling a collection of N texts of total length n, which is negligible unless one has to handle many short texts.
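The synchronization just described can be sketched as follows. The mutator names on csa and fcst are hypothetical stand-ins for the operations of the dynamic CSA and of our structure; the sketch only illustrates the control flow, not any actual API.

    # Text insertion, right-to-left, in lockstep with Weiner's algorithm
    # (a sketch; all methods of csa and fcst are hypothetical).
    def insert_text(T, csa, fcst):
        pos = csa.insert_terminator()            # leaf of the empty suffix
        for i in range(len(T) - 1, -1, -1):      # right-to-left over T
            v = fcst.leaf(pos)                   # leaf of suffix T[i+1..]
            while fcst.weiner_link(T[i], v) is None:
                v = fcst.parent(v)               # amortized O(1) climbs per symbol
            pos = csa.insert_symbol(T[i], pos)   # the CSA now holds T[i..]
            fcst.add_node_and_leaf(T[i], v, pos) # Weiner's step; updates bitmap B
        return pos                               # handle to the inserted text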
6.1. Maintaining the Sampling

We now explain how to update the sampled tree S whenever nodes are inserted into or deleted from the (virtual) suffix tree T. The sampled tree must maintain, at all times, the property that for any node v there is an i < δ such that SLink^i(v) is sampled. The following concept from Russo and Oliveira [2008] is useful to explain how to obtain this result.
Definition 6.1. The reverse tree T^R of a suffix tree T is the minimal labeled tree that, for every node v of T, contains a node v^R denoting the reverse string of the path label of v.

We note that we are neither maintaining nor sampling T^R; we just use it as a conceptual device. Figure 2 shows a reverse tree. Observe that, since there is a node with path label ab in T, there is a node with path label ba in T^R. We can therefore define a mapping R that maps every node v to v^R. Observe that for any node v of T, except for the Root, we have that SLink(v) = R^{-1}(Parent(R(v))). This mapping is partially shown in Figures 1 and 2 by the numbers. Hence the reverse tree stores the information of the suffix links. For a regular sampling we choose the nodes for which TDep(v^R) ≡ 0 (mod δ/2) and Height(v^R) ≥ δ/2. This is equivalent to our sampling rules on T (Theorem 4.2): Since the reverse suffixes form a prefix-closed set, T^R is a non-compact trie, i.e., each edge is labeled by a single letter. Thus, SDep(v) = TDep(v^R). The rule for Height(v^R) is obviously related to that on SLink(v) by R. See Figure 2 for an example of this sampling.
Notice that, whenever a node is inserted into or removed from a suffix tree, this never changes the SDep of the other nodes in the tree, and hence it does not change any TDep in T^R. This means that, whenever the suffix tree is modified, the only nodes that can be inserted into or deleted from the reverse tree are leaves. In T this means that, when a node is inserted, it does not break a chain of suffix links; it is always added at the beginning of such a chain. Weiner's algorithm works precisely by appending a new leaf to a node of T^R.
Assume that we are using Weiner's algorithm and decide that the node X.v should be added, and that we know the representation of node v. All we need to do to update the structure of the sampled tree is to verify whether, by adding (X.v)^R as a child of v^R in T^R, we increase the Height of some ancestor, in T^R, that will now become sampled. Hence we must scan upwards in T^R to verify whether this is the case. Also, we do not need to maintain Height values. Instead, if the distance from (X.v)^R to the closest sampled node (v′)^R is exactly δ/2 and TDep((v′)^R) ≡ 0 (mod δ/2), then we know that v′ meets the sampling condition and we sample it. Operationally, the procedure is as follows: Compute v′ = SLink^{δ/2}(X.v); if v′ is not in S (i.e., v′ ≠ LSA(v′)) and SDep(v′) ≡ 0 (mod δ/2), then add v′ to S.
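This check is short; in the sketch below, slink_pow, lsa, sdep, and sample are hypothetical stand-ins for the operations just named.

    # Sampling check after Weiner's algorithm adds node X.v (a sketch).
    def on_insert(Xv, delta, slink_pow, lsa, sdep, sample):
        vp = slink_pow(Xv, delta // 2)       # candidate at distance delta/2
        if vp is None:
            return                           # the suffix-link chain ended early
        if vp != lsa(vp) and sdep(vp) % (delta // 2) == 0:
            sample(vp)                       # vp now meets the sampling condition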
Deleting a node (i.e., a leaf in T^R) is slightly more complex and involves some reference counting. This time assume we are deleting node X.v; again we need to scan upwards, this time to decide whether to make a node non-sampled. However, SDep(v) − SDep(v′) < δ/2 is not enough, as it may be that Height(v′^R) ≥ δ/2 because of some other descendant. Therefore, every sampled node v′ counts how many descendants it has at distance δ/2. A node becomes non-sampled (i.e., we remove it from S) only when this counter reaches zero. Insertions and deletions of nodes in T must update these counters, by increasing/decreasing them whenever inserting/deleting a leaf at distance exactly δ/2 from a sampled node.
As there are O(n/δ) sampled nodes, as the nodes counted by the reference counters are themselves sampled nodes, and as no sampled node can be counted in two counters, the sum of all the counters is also O(n/δ). Hence we represent them all using a bitmap C of O(n/δ) bits. C stores a 1 associated to each Preorder_S(v), and that 1 is followed by as many 0s as the value of the counter for v. Hence the value of the counter for v is retrieved as Select1(C, Preorder_S(v) + 1) − Select1(C, Preorder_S(v)) − 1; increasing the counter for v translates into Insert(C, Select1(C, Preorder_S(v)) + 1, 0); and decreasing it into Delete(C, Select1(C, Preorder_S(v)) + 1). Similarly, the insertion of a new node v into S must be followed by the operation Insert(C, Select1(C, Preorder_S(v)), 1), and its deletion must be preceded by Delete(C, Select1(C, Preorder_S(v))). Using Theorem 2.4, structure C takes O(n/δ) bits and carries out all these operations in O(log n / log log n) time.
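The counter encoding can be illustrated with a toy version of C, in which a plain Python list stands in for the dynamic bitmap of Theorem 2.4 and preorders are taken to be 1-based:

    # Reference counters packed in a bitmap C (a sketch; a list replaces
    # the dynamic bitmap, so operations here are not O(log n / log log n)).
    class RefCounters:
        def __init__(self):
            self.C = [1]                   # the Root, with counter 0

        def _select1(self, k):             # position of the k-th 1 (1-based)
            seen = 0
            for i, b in enumerate(self.C):
                seen += b
                if seen == k:
                    return i
            return len(self.C)             # one past the end, as a sentinel

        def count(self, pre):              # counter of the pre-th sampled node
            return self._select1(pre + 1) - self._select1(pre) - 1

        def increase(self, pre):
            self.C.insert(self._select1(pre) + 1, 0)

        def decrease(self, pre):
            del self.C[self._select1(pre) + 1]

        def add_node(self, pre):           # new sampled node, counter 0
            self.C.insert(self._select1(pre), 1)

        def remove_node(self, pre):
            del self.C[self._select1(pre)]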
Hence, to Insert or Delete a node requires O((tψ + tLF)δ) time to find out whether to modify the sampling, plus O(log n / log log n) time to update S and its associated structures when necessary (S itself, B, C, etc.), plus O(log n) time to modify the sampled A and A^{-1} arrays. Added to the constant amortized number of calls to Parent and WeinerLink per text symbol, we have an overall time of O(|T|(log n + (tψ + tLF)δ)) for the insertion or deletion of a whole text T.

The following theorem summarizes our result.
Theorem 6.2. It is possible to represent the suffix tree of a dynamic text collection within the space and time bounds given for the DFCST in Table VI, by using any dynamic compressed suffix array offering the operations and times given in Table IV and inserting (deleting) texts in right-to-left (left-to-right) order.
Table VI also compares our DFCST with the DCST of Chan et al. [2007]. For the
latter we have used the faster dynamic trees of Theorem 2.6. There exists no dynamic
variant of the EBST.

6.2. Changing log n

We note that Theorem 6.2 assumes that ⌈log n⌉ is fixed, and so is δ. This assumption is not uncommon in dynamic data structures, even if it affects assertions like that of pointers taking O(log n) bits. The CSAs used in Table IV can handle a varying ⌈log n⌉ within the same worst-case space and complexities, and the same happens with Theorem 2.4, which is used for bitmaps B and C, with other data structures described by Navarro and Sadakane [2010] that we use for storing SDep, and with those of Mäkinen and Navarro [2008] that we use for handling the sampling of A[jθ] and A^{-1}[jθ] needed for Child.
Table VI. Comparison between dynamic compressed suffix tree representations. The performance refers to dynamic CSA times and assumes tψ, tLF, tSA = Ω(ttree); likewise we assume δ = ω(log n) as before. We omit the operations that depend solely on the CSA, see Table IV.

                        Chan et al. [2007] (DCST)   Ours (DFCST)
Space in bits           |CSA| + Θ(n)                |CSA| + O((n/δ) log n)
Root                    O(1)                        O(1)
Count                   ttree                       O(1)
Ancestor                ttree                       O(1)
Parent                  ttree                       (tψ + tLF)δ
FChild                  ttree                       tSA log δ + (tψ + tLF)δ + O(log^2 n)
NSib                    ttree                       tSA log δ + (tψ + tLF)δ + O(log^2 n)
LCA                     ttree                       (tψ + tLF)δ
Child                   tSA log σ                   tSA log δ + (tψ + tLF)δ + O(log^2 n)
SLink                   tψ                          (tψ + tLF)δ
SLink^i                 tSA                         tSA + (tψ + tLF)δ
SDep                    tSA                         tψ δ
Insert(T) / Delete(T)   |T|(tSA + log n)            |T|(tψ + tLF)δ

The dynamic parentheses data structure of Theorem 2.6 that we use to represent S also allows for changes in ⌈log n⌉, but our mechanism to adapt to changes in δ will subsume it. We discuss now how to cope with this while retaining the same space and worst-case time complexities.
We use δ = ⌈log n⌉ · ⌈log⌈log n⌉⌉, which will change whenever ⌈log n⌉ changes (sometimes by more than 1). Let us write δ = ∆(ℓ) = ℓ⌈log ℓ⌉, where we maintain ℓ = ⌈log n⌉. As S is small enough, we can afford to maintain three copies of it: S sampled with δ, S^- sampled with δ^- = ∆(ℓ − 1), and S^+ sampled with δ^+ = ∆(ℓ + 1). When ⌈log n⌉ increases (i.e., n doubles), S^- is discarded, the current S becomes S^-, the current S^+ becomes S, we build a new S^+ sampled with ∆(ℓ + 2), and ℓ is increased. A symmetric operation is done when ⌈log n⌉ decreases (i.e., n halves due to deletions), so let us focus on increases from now on. Note that this can occur in the middle of the insertion of a text, which must be suspended and then resumed over the new set of sampled trees.
The construction of the new S^+ can be done by retraversing the whole suffix tree T and deciding which nodes to sample according to the new δ^+. An initially empty parentheses sequence and a bitmap B^+, initialized with t zeros, give the correct insertion points for the chosen intervals as both structures are populated. To ensure that we consider each node of T once, we process the leaves in order (i.e., v = [0, 0] to v = [n − 1, n − 1]), and for each leaf v we also consider all of its ancestors [vl, vr] (using Parent) as long as vr = v. For each node [vl, vr] we consider, we apply SLink at most δ^+ times, until we find the first node v′ = SLink^i([vl, vr]) which either is sampled in S^+, or satisfies SDep(v′) ≡ 0 (mod δ^+/2) and i = δ^+/2. If v′ was not sampled we insert it into S^+, and in both cases we increase its reference count (recall Section 6.1). All the δ^+ suffix links SLink^i([vl, vr]) are computed in O((tψ + tLF)δ^+) = O((tψ + tLF)δ) total time, as they form a single chain.
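The traversal order can be sketched as a generator; leaf and parent are hypothetical accessors, and each yielded interval is then processed with the δ^+ suffix-link chain as described above:

    # Enumerating each node of T exactly once, in the rebuild order
    # (a sketch; leaf(v) returns the interval [v, v]).
    def rebuild_order(n, leaf, parent):
        for v in range(n):
            node = leaf(v)                        # process the leaf itself
            while node is not None and node[1] == v:
                yield node                        # ancestors [vl, vr] with vr = v
                node = parent(node)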
Deamortization can be achieved by the classical method of interleaving the normal operations of the data structure with the construction of the new S^+. By performing a constant number of operations on the new S^+ for each insertion/deletion operation over the text collection, we can ensure that the new S^+ will be ready in time. We start with the creation (split into several operations) of B^+, formed by t 0s, and then proceed to traverse T to determine which nodes to insert into S^+. The challenge is to maintain the consistency of the traversal of T while texts are inserted/deleted.
As we insert a text, the operations that update T consist of insertions of leaves, and possibly the creation of a new parent for them. Assume we are currently at node [vl, vr] in our traversal of T to update S^+. If a new node [vl′, vr′] we are inserting is behind the current node in our traversal order (that is, vr′ < vr, or vr′ = vr and vl′ > vl), then we consider [vl′, vr′] immediately; otherwise we leave it for the moment when we reach [vl′, vr′] in our traversal (note that we will reach it, because the CSA has already been updated). Recall from Section 6.1 that those new insertions do not affect the existing SDeps nor the suffix link paths, and hence cannot affect the decisions to sample nodes already made in the current traversal. Similarly, deleted nodes that fall behind the current node are processed immediately, and the others are left for the traversal to handle later.
If ℓ decreases again while we are still building S^+, we simply discard it, even before having completed its construction. This involves freeing the whole B^+ and S^+ data structures, which is also necessary when we abandon the former B^- and S^- structures. This deallocation can be done in constant time in this particular case: The maximum size n the collection can have, as long as we keep using the current S^+ and B^+ structures, is n_max = 2^{2+⌈log n⌉}; thus the maximum value for t is t_max = 2 · n_max and that for s is s_max = t_max/(δ^+/2). Hence, we allocate a single chunk of memory of the maximum possible size, which is still O((n/δ) log δ). A similar preallocation can be done for S^+, which needs O(n/δ) bits.

6.3. Construction from a Static CSA

An important side effect of Theorem 6.2 is that operation Insert(T) can be used to construct an FCST representation from scratch. We transform the DFCST into the FCST by changing the dynamic bitmaps (Theorem 2.4) into static ones (Theorem 2.3). This construction procedure works within |CSA| + O((n/δ) log n) bits of space, but it is time consuming: Using the dynamic CST of Table VI, and assuming σ = O(polylog(n)), it takes O(n log^2 n) time.
It is possible to build the FCST faster from a static CSA. We simulate on the CSA a depth-first traversal of the reverse tree T^R: Start at the Root and at each node v compute LF(X, v) for every X ∈ Σ. This process requires O(nσ) time to traverse the reverse tree (we return to this soon). When LF(X, v) produces an interval with only one suffix, the traversal finishes, because it corresponds to a leaf in the suffix tree. During this search we compute SDep incrementally, as SDep(LF(X, v)) = 1 + SDep(v), and we can easily compute the Height of each node v in T^R when we return to it during the recursive traversal. Thus the conditions that define which nodes to sample (Theorem 4.2) are easy to verify on the fly. The nodes to sample are stored in a list as pairs of integers [vl^(i), vr^(i)]; therefore they require O((n/δ) log n) = o(n) bits.
From this list we obtain the Preorder representation of the sampled tree and the bitmap B of Section 4.1. We duplicate every pair [vl^(i), vr^(i)] in the list into a left pair [vl^(i), vr^(i)]_L and a right pair [vl^(i), vr^(i)]_R. The list is then sorted. The sorting key for left pairs is vl^(i); in case of a tie it is vr^(i). The sorting key for right pairs is vr^(i); in case of a tie it is vl^(i). This means that if i corresponds to a left pair and i′ to a right pair we compare vl^(i) with vr^(i′); note that for the crossed comparisons there are no ties. This procedure requires O((n/δ) log(n/δ)) = o(n) time. To obtain Preorder_S we traverse the list and write '(' for every left pair and ')' for every right pair. The bitmap B is obtained sequentially in a similar way: Consider the sequence of numbers o_i obtained by storing vl^(i) for the left pairs and vr^(i) for the right pairs. For every o_i, if o_i = o_{i+1} we skip from o_i to o_{i+1}; otherwise we write a 1 in B followed by o_{i+1} − o_i zeros. Recall that B finishes with a 1, because the Root is always sampled.
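The sorting step can be illustrated as follows. The comparator is one concrete way to realize the left/right keys with the crossed comparisons described above (at equal coordinates, opens precede closes, outer opens come first, and inner closes come first); it is a sketch, not our actual construction code.

    # Sorting the duplicated pairs into the parentheses sequence of S.
    def build_parens(sampled):
        events = []
        for (vl, vr) in sampled:
            events.append((vl, 0, -vr))      # left pair: '(' at vl
            events.append((vr, 1, -vl))      # right pair: ')' at vr
        events.sort()
        return ''.join('(' if kind == 0 else ')' for _, kind, _ in events)

    # Example: build_parens([(0, 9), (0, 4), (5, 9), (5, 5)]) == '(()(()))'.
    # The bitmap B is filled in the same left-to-right pass, as described.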

Overall, this construction procedure requires O(nσ) time to obtain a static FCST representation from a static CSA. If the CSA is an FM-index implemented with a wavelet tree [Mäkinen and Navarro 2008], the time can be reduced to O(n log σ), as we can determine each existing interval LF(X, v) = v′ in O(log σ) time. The reason is that the X symbols that produce valid intervals correspond to the distinct symbols in T^{bwt}[vl, vr], where T^{bwt} is the Burrows-Wheeler transform of T [Burrows and Wheeler 1994]; if T^{bwt} is represented as a wavelet tree, then we can use a so-called range quantile algorithm to list the distinct values in any substring within O(log σ) time per delivered value [Gagie et al. 2009].
Indeed, CSAs can be built from scratch faster than CSTs. For example, the FM-index can be built within nHk + o(n log σ) bits and O(n (log n / log log n)(1 + log σ / log log n)) time [Navarro and Sadakane 2010]. After making this CSA static we can build our CST within negligible o(n) extra space and negligible O(n log σ) extra time. There are algorithms that build the FM-index in o(n log σ) time [Okanohara and Sadakane 2009; Hon et al. 2009], yet using Ω(n log σ) construction space. Another intermediate solution is by Hon et al. [2003], using O(n log n) time and O(nH0) bits of space.

7. EXPERIMENTAL RESULTS

We implemented a basic version of the static FCST representation. We compare our prototype with an implementation of Sadakane's CST [Välimäki et al. 2007]. Our prototype uses a binary wavelet tree where each bitmap is encoded as in Theorem 2.3 (using an easier-to-implement proposal [Raman et al. 2002]); this CSA requires nHk + o(n log σ) bits [Mäkinen and Navarro 2008]. The sampling factor δ was chosen as ⌈log n⌉ · ⌈log⌈log n⌉⌉. We made a simple implementation that uses pointers in the sampled tree S, since the extra space requirement is still sublinear. To minimize the amount of information stored in the sampled tree, we chose not to support the TDep, SLAQ, and TLAQ operations; for the same reason we chose to support only the basic Child operation and not the elaborate scheme presented in Section 5.2.
One important difference between our implementation and the theory we presented is that the leaves of T are never part of the sampled tree S. This simplification is possible because LCA(v, v′) is a leaf only if v = v′, in which case LCA(v, v′) = v and SDep(LCA(v, v′)) can be obtained from A[v]. Hence, the sampled tree becomes much smaller than in theory, as the sampled values of A are already accounted as part of the CSA.

We used the texts from the Pizza&Chili corpus,^13 trimmed to at most 100 megabytes (MB).
— Sources (program source code). This file is formed by C/Java source code obtained by concatenating all the .c, .h, .C and .java files of the linux-2.6.11.6 and gcc-4.0.0 distributions.
— Pitches (MIDI pitch values). This file is a sequence of pitch values (bytes in 0-127, plus a few extra special values) obtained from a myriad of MIDI files freely available on the Internet. The MIDI files were processed using the semex 1.29 tool by Kjell Lemström, so as to convert them to IRP format. This is a human-readable tuple format, where the 5th column is the pitch value. The pitch values were then coded in one byte each and concatenated.
— Proteins (protein sequences). This file is a sequence of newline-separated protein sequences (without descriptions, just the bare proteins) obtained from the Swissprot database. Each of the 20 amino acids is coded as one uppercase letter.

13 http://pizzachili.dcc.uchile.cl

— DNA (DNA sequences). This file is a sequence of newline-separated gene DNA sequences (without descriptions, just the bare DNA code) obtained from files 01hgp10 to 21hgp10, plus 0xhgp10 and 0yhgp10, from the Gutenberg Project. Each of the 4 bases is coded as an uppercase letter A, G, C, T, and there are a few occurrences of other special characters.
— English (English texts). This file is the concatenation of English text files selected from the etext02 to etext05 collections of the Gutenberg Project. We deleted the headers related to the project, so as to leave just the real text.
— XML (structured text). This file is an XML that provides bibliographic information on major computer science journals and proceedings, obtained from dblp.uni-trier.de.

Table VII. Space requirements, in MB, of FCSTs and CSTs. The space is obtained by measuring the peak main-memory usage while operating the data structures. Other related information, such as the number of nodes in the sampled tree S and in T, is also presented, together with data on the relative space usage of the different components.

              Sources   Pitches   Proteins   DNA      English   XML
σ             229       134       26         18       217       98
n/2^20        100.0     53.2      63.7       100.0    100.0     100.0
|T|/2^20      162.7     87.5      98.7       167.4    161.7     147.4
δ             135       130       130        135      135       135
|T|/|S|       1,368     551       1,304      12,657   541       2,008
FCST (MB)     66.3      45.2      49.8       56.7     73.2      54.9
CST (MB)      407.4     214.3     204.2      287.2    353.7     316.3
CSAF/FCST     0.90      0.88      0.92       0.92     0.87      0.91
CSAC/CST      0.29      0.30      0.30       0.21     0.29      0.36
CSAF/CSAC     0.50      0.62      0.75       0.86     0.62      0.44

We built FCSTs and CSTs for each of the previous files. The resulting space usage and related information is given in Table VII. The line "n/2^20" gives the file size in MB. We also count the number of nodes in each suffix tree, in line "|T|/2^20". It is interesting to observe that in practice the internal nodes are sampled much more sparsely than one out of every δ (as several suffix link paths share the same sampled node). This can be observed in line "|T|/|S|": The ratio is usually 5 to 10 times larger than δ, and it reaches 93 times larger for DNA. The consequence of such a sparse sampling is that our CSA (CSAF) accounts for around 90% of the overall structure, see line "CSAF/FCST".
Lines FCST and CST show that our FCST is 4 to 6 times smaller than the CST. This is a consequence not only of the fact that the size (CSAF) of our CSA is only 44% to 86% of the size (CSAC) of the CSA used by the CST implementation (see line CSAF/CSAC), but, more importantly, of the fact that our tree structure occupies a much smaller portion of the overall space than in the CST; see lines CSAC/CST and CSAF/FCST. Hence, in terms of space, we managed to obtain an extremely compact representation of FCSTs. Moreover, the fact that our implementation uses pointers increases the overall space by only a negligible amount.
Overall, our structure takes 55% to 85% of the original text size and, moreover, replaces it, as the CSA itself can reproduce any substring of the sequence. Thus, our representation can be regarded as a compressed representation of the sequence which, in addition, provides suffix tree functionality on it. We now consider how time-efficient this functionality is.
We tested the time it takes to compute the operations in Theorem 6.2 by choosing internal nodes, computing the operations during 60 seconds, and obtaining averages per operation. We used three ways of choosing the nodes on which to test the operations. To select a node we chose a random leaf v and computed LCA(v, v + 1). We used three sequences of random nodes. In the first case we chose only one random node as described (u). In the second case we chose a random node and iterated SLink (su) until reaching the root, collecting all the traversed nodes. In the last case we chose a random node and iterated Parent (pu) until reaching the root. This simulates various types of suffix tree traversals. The results are shown in Table VIII. Our machine had a Quad-Core Intel Xeon CPU at 3.20GHz with a 2 MB cache and 3 GB of RAM, and was running Slackware 12.0.0 with Linux kernel 2.6.21.5. The FCSTs were implemented in C and compiled with gcc 3.4.6 -O9. The CSTs were implemented in C++ and compiled with g++ 4.1.2 -O3.
The results show that the price for our FCST's small space requirements is that it is much slower than CSTs, yet practical in absolute terms for many applications (i.e., a few milliseconds per operation). For some operations, such as LCA, the difference can reach 3 orders of magnitude. Still, for the Child operation, which is the slowest, the difference is usually 1 order of magnitude. Hence, in any algorithm that uses Child, this operation should dominate the overall time; moreover, it depends essentially on the underlying CSA. We expect this to be the case in general. Therefore, it is possible to obtain different space/time trade-offs by using other CSAs.
Our implementation aimed at obtaining the smallest possible FCSTs. The resulting space/time trade-off is interesting, because we obtained very small FCSTs that support the usual operations within a reasonable time per operation. Recently published experiments [Cánovas and Navarro 2010] that compare the performance of a practical implementation of the EBST with the CST and the FCST reinforce the conclusion that our FCST, albeit the slowest of the three, is unparalleled in space requirements, which makes it possible to fit in main memory suffix trees that no other representation can handle.
8. LARGER AND FASTER COMPRESSED SUFFIX TREES

The previous discussion raises the question of whether it is possible to obtain better times from our technique, perhaps using more space. In particular, we note that using a smaller δ value in our FCST would yield better times. What prevents us from using values smaller than δ = log n log log n is that we have to spend O((n/δ) log n) = O(n / log log n) extra bits. However, this space comes only from the storage of the SDep and TDep arrays.^14
Imagine we use both the FM-index and the sublinear-space CSA by Grossi et al. [2003], for a total space of (2 + 1/ǫ)nHk + o(n log σ) bits, so that we have tψ = tLF = O(1 + log σ / log log n) and tSA = O(tψ log^ǫ_σ n). Now we could store only the SDep values at nodes whose SDep is a multiple of κ, and at the other sampled nodes v we only store SDep(v) mod κ, using log κ bits. The total space for SDep becomes O((n/κ) log n + (n/δ) log κ). To retrieve a SDep(v) value, we read d = SDep(v) mod κ, and then read the full c = SDep(v′), where v′ = SLink^d(v) has its full SDep value stored. The answer is c + d and can be obtained in O(tSA) time.^15 The same idea can be used for TDep, which is stored for tree depths that are multiples of κ and retrieved using v′ = Parent^i(v) in O(log n) time.^16 Now, we can use κ = log n log log n and δ = (log log n)^2 while maintaining the extra space in O(n / log log n). Although we use a much smaller δ now, each step requires computing an SDep value in O(tSA) time, and thus our usual (tψ + tLF)δ cost becomes tSA δ = O(log^ǫ_σ n (log σ + log log n) log log n). If σ = polylog(n), this can be written as O(log^ǫ n).

14 We have been careful along the paper to avoid this type of space for the other data structures, which could otherwise have been handled with classical solutions.
15 Since we know v′ is a sampled node, we do v′ = LCSA(ψ^d(vl), ψ^d(vr)) without resorting to LCA, which would have implied a circular dependence.
16 Again, because v′ is sampled, it can be binary searched for among the Parent_S^j(v).

Table VIII. Time to compute operations over the FCST (F) and CST (C), in seconds.
Operation Sources Pitches Proteins DNA English XML
LCA F u 6.9e-3 8.2e-3 2.9e-3 2.0e-3 6.3e-3 9.7e-3
F su 1.3e-2 2.0e-2 7.7e-3 2.2e-3 1.9e-2 7.6e-3
F pu 4.3e-3 9.0e-3 1.0e-3 1.1e-3 1.6e-3 4.0e-3
C u 2.5e-6 2.5e-6 2.3e-6 2.2e-6 2.4e-6 2.4e-6
C su 2.5e-6 2.1e-6 2.7e-6 4.2e-6 1.9e-6 3.5e-6
C pu 5.4e-6 5.1e-6 5.0e-6 5.7e-6 5.6e-6 5.7e-6
Letter F u 1.6e-5 1.6e-5 1.1e-5 8.9e-6 1.3e-5 1.5e-5
F su 1.8e-5 1.6e-5 1.1e-5 8.4e-6 1.4e-5 1.5e-5
F pu 1.5e-5 1.4e-5 9.9e-6 8.4e-6 1.3e-5 1.4e-5
C u 1.6e-4 1.3e-4 1.2e-4 7.1e-5 1.4e-4 1.6e-4
C su 7.4e-5 7.1e-5 7.4e-5 5.0e-5 8.2e-5 9.8e-5
C pu 1.1e-4 6.8e-5 8.8e-5 5.8e-5 1.3e-4 1.4e-4
SLink F u 6.8e-3 8.2e-3 2.9e-3 2.0e-3 6.2e-3 9.6e-3
F su 1.3e-2 2.0e-2 7.7e-3 2.1e-3 1.9e-2 7.6e-3
F pu 4.3e-3 9.0e-3 9.6e-4 1.0e-3 1.6e-3 3.9e-3
C u 2.2e-4 1.7e-4 1.7e-4 9.6e-5 2.0e-4 2.2e-4
C su 1.1e-4 9.6e-5 1.0e-4 6.7e-5 1.1e-4 1.4e-4
C pu 1.7e-4 1.0e-4 1.5e-4 8.4e-5 1.9e-4 2.0e-4
Locate F u 3.2e-3 3.0e-3 1.8e-3 1.6e-3 2.7e-3 3.1e-3
F su 3.0e-3 2.7e-3 1.7e-3 1.3e-3 2.6e-3 2.9e-3
F pu 2.7e-3 2.3e-3 1.6e-3 1.3e-3 2.4e-3 2.6e-3
C u 5.0e-5 3.9e-5 3.8e-5 2.2e-5 4.4e-5 5.0e-5
C su 2.2e-5 1.8e-5 2.0e-5 1.5e-5 2.0e-5 2.7e-5
C pu 3.7e-5 2.1e-5 3.0e-5 1.9e-5 4.3e-5 4.3e-5
Child F u 8.2e-3 1.3e-2 3.4e-3 1.6e-3 1.1e-2 8.7e-3
F su 1.9e-2 3.2e-2 1.0e-2 3.7e-3 3.8e-2 6.5e-3
F pu 6.5e-3 1.6e-2 9.9e-4 7.7e-4 1.9e-3 3.3e-3
C u 5.8e-4 5.4e-4 4.2e-4 1.2e-4 5.2e-4 6.4e-4
C su 2.0e-4 2.6e-4 2.8e-4 1.2e-4 1.3e-4 1.2e-3
C pu 2.5e-3 1.6e-3 9.0e-4 1.9e-4 2.9e-3 2.2e-3
SDep F u 5.4e-3 6.6e-3 2.3e-3 1.6e-3 5.2e-3 7.3e-3
F su 1.1e-2 1.8e-2 5.9e-3 1.7e-3 1.7e-2 5.7e-3
F pu 4.0e-3 7.8e-3 8.0e-4 8.4e-4 1.3e-3 3.0e-3
C u 5.1e-5 4.0e-5 3.8e-5 2.3e-5 4.5e-5 5.0e-5
C su 2.1e-5 1.8e-5 2.0e-5 1.5e-5 2.0e-5 2.6e-5
C pu 3.6e-5 2.1e-5 3.3e-5 2.2e-5 4.1e-5 4.4e-5
Parent F u 8.3e-3 8.7e-3 2.6e-3 3.4e-3 4.3e-3 1.5e-2
F su 1.3e-2 7.6e-3 5.0e-3 2.5e-3 4.5e-3 1.3e-2
F pu 6.3e-3 1.3e-2 1.1e-3 1.8e-3 2.1e-3 6.1e-3
C u 1.6e-6 1.7e-6 1.7e-6 1.6e-6 1.6e-6 1.7e-6
C su 1.5e-6 1.5e-6 1.6e-6 1.6e-6 1.6e-6 1.7e-6
C pu 1.5e-6 1.5e-6 1.6e-6 1.6e-6 1.6e-6 1.7e-6

Thus we achieve sublogarithmic times for most operations. Indeed, the times are similar to those of the EBST, and our space is better than that of its original version [Fischer et al. 2009], yet the most recent result [Fischer 2010] achieves better space.
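The two-level retrieval can be sketched as follows, with hypothetical accessors sdep_mod (the log κ-bit field), sdep_full (a stored full value), and slink_pow (d applications of SLink, computed as LCSA(ψ^d(vl), ψ^d(vr)) as explained in the footnote):

    # Two-level SDep storage (a sketch under assumed interfaces).
    def sdep_two_level(v, sdep_mod, slink_pow, sdep_full):
        d = sdep_mod(v)          # SDep(v) mod kappa, stored in log(kappa) bits
        vp = slink_pow(v, d)     # SDep(vp) is a multiple of kappa
        return sdep_full(vp) + d # the answer c + d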
We can go further and achieve poly-loglog times for the most common operations, at the expense of higher space. We use the representation of LCP information that gives constant-time access within 2nHk(log(1/Hk) + O(1)) + o(n) bits of space [Fischer et al. 2009]. Recall that LCP(v) = SDep([v − 1, v]) is the longest common prefix between leaves v and v − 1. In addition, they show how to compute range minimum queries RMQ(v, v′) (which give the minimum value in the range LCP(v) . . . LCP(v′)) using, for example, O(n / log log n) bits of space and O(log log n (log log log n)^2) time. Using this we can directly obtain SDep([vl, vr]) = RMQ(vl + 1, vr). The same method can be applied for TDep. Now the only limit to decreasing δ is array B, which uses O((n/δ) log δ) bits, and this is o(n) for any δ = ω(1). Yet, let us restrict ourselves to O(n / log log n) extra space,
so we use δ = log log n log log log n. If we use an FM-index as our CSA, our final CST size is 2nHk(log(1/Hk) + O(1)) + o(n) bits, and our usual (tψ + tLF)δ time for most operations becomes O(log log n (log σ + log log n)(log log log n)^3). This is o((log log n)^3) for σ = O(polylog(n)).

9. CONCLUSIONS

We presented a fully-compressed representation of suffix trees (FCSTs), which breaks the linear-bit space barrier of previous representations at a reasonable time-complexity penalty. Our structure efficiently supports common and not-so-common operations, including very powerful ones such as lowest common ancestor (LCA) and level ancestor (LAQ) queries. Indeed, by building over an FM-index, our FCSTs achieve optimal asymptotic space under the k-th order entropy model and support all the navigation operations in polylogarithmic time. Our representation is largely based on the LCA operation. Suffix trees have been used in combination with LCAs for a long time, but our results show new ways to exploit this partnership. We also presented a dynamic fully-compressed representation of suffix trees. Dynamic FCSTs permit not only managing dynamic collections, but also building static FCSTs within optimal space, at a logarithmic time penalty factor.

We implemented a static version of the FCSTs and showed that its surprisingly small space requirements can be reached in practice while still supporting the usual operations efficiently. A recent experimental comparison [Cánovas and Navarro 2010] of compressed suffix trees confirms that the FCST is the smallest representation, albeit also the slowest. Using a denser sampling on our current implementation does not give interesting space/time trade-offs, but we are pursuing a new one where such a denser sampling makes a better impact on response times.
The research on this topic advances at a very rapid pace. In the last two years, after the conference publication of our results [Russo et al. 2008b; 2008a], several new achievements have been presented. The progress has mainly focused on obtaining smaller representations of the data structures that support Range Minimum Queries (RMQs) and the so-called Previous Smaller Value (PSV) and Next Smaller Value (NSV) queries. The results by Ohlebusch et al. [2009; 2010] reduced the constants associated with the O(n)-bit space term. Although the resulting space is still Θ(n), they achieve relevant improvements. An implementation of the EBST [Fischer et al. 2009] also provided new practical techniques to implement the RMQ/PSV/NSV operations [Cánovas and Navarro 2010], as well as the mentioned experimental comparison among different prototypes. Fischer [2010] improved the original EBST [Fischer et al. 2009] by removing the "ugly" space factor associated to the entropy; that is, the new EBST now requires (1 + 1/ǫ)nHk + o(n) bits and retains the same sublogarithmic time performance (we used this improved complexity in our Table I).

The techniques we introduce in this paper have also demonstrated independent interest. Recently, Hon et al. [2009] improved the secondary memory index proposed by Chien et al. [2008] using, among other techniques, a structure similar to the bitmap B we presented in Section 4.1.
We believe this fascinating topic is far from closed. In particular, we have exposed limitations for some operations on FCSTs, which might or might not be fundamental. For example, we give only a partial answer to the problem of computing the preorder number of a suffix tree node, which is relevant to associate satellite information to internal nodes. Another important example is the lack of support for the TDep, TLAQ, and SLAQ operations on dynamic FCSTs. This has its roots in our inability to maintain a properly spaced sampling of the suffix tree and to keep the TDep values up to date. Yet a third example is the limitation on the alphabet size σ required to have o(n) extra space. Our prototype is also being extended to support the dynamic case and, as mentioned, denser samplings.

More generally, and especially in light of the combinations of ideas explored in the previous section, it is not clear how fast we can navigate suffix trees using how much space, and in general which is the space/time lower bound for compressed suffix trees.

A. APPENDIX

In this appendix we explore some fundamental properties of suffix trees that show how to use a δ-sampled suffix tree. Section 5 makes use of these properties to provide the different navigation operations, albeit it can be read without resorting to this deeper discussion.

More specifically, we reveal some self-similarity properties of suffix trees. Such properties have already been studied but, as far as we know, the ones we study here are novel.

A.1. Locally Isomorphic Subtrees

Gusfield [1997, Section 7.7] showed that suffix trees contain isomorphic subtrees. This type of information is useful because a naive representation of a self-similar structure contains redundant information. Therefore, self-similarities can be exploited to remove the redundant information in the representation of suffix trees. Gusfield used this property to define compact DAGs, which are similar, but not identical, to the earlier concepts of DAWGs [Blumer et al. 1985] and CDAWGs [Crochemore 1986].
An isomorphism is a bijective homomorphism. A homomorphism from a tree to another tree is a function that preserves, in some way, the structure of the source tree in the target tree. The type of homomorphism we use depends on the algebraic structure that we consider. For example, we can consider that a homomorphism between trees T1 and T2 is a function f, on the nodes, for which f(Parent(v1)) = Parent(f(v1)) for any node v1 of T1. For this notion, graph-homomorphism, Gusfield presented the following lemma:

Lemma 1.1 ([Gusfield 1997]). If the number of leaves of the subtree below v is equal to the number of leaves below SLink(v), then the two subtrees are graph-isomorphic.
The isomorphism is given by the SLink function. Reproving this lemma will be useful to introduce the new concepts. We only need to show that SLink is injective, surjective, and structure-preserving. It is interesting to notice that the SLink function is not globally injective; for example SLink(ab) = b = SLink(bb). However, by restricting its domain it becomes injective.

Lemma 1.2. The SLink function restricted to the subtree below any node v ≠ Root is injective.

Proof. Since v ≠ Root, the path label of v is X.α, where X ∈ Σ and α ∈ Σ*. Let v′ and v′′ be descendants of v, i.e., with path labels X.α.β′ and X.α.β′′ respectively, such that SLink(v′) = SLink(v′′). This means that α.β′ = α.β′′, and therefore X.α.β′ = X.α.β′′, which means that v′ = v′′.
With an extra condition SLink becomes surjective.

Lemma 1.3. If the number of leaves of the subtree below node v ≠ Root is equal to the number of leaves below SLink(v), then SLink is surjective.

Proof. Let T1 and T2 denote the subtrees below v and SLink(v), respectively. The proof consists in showing that |T1| ≥ |T2|, i.e., that T2 has no more nodes than T1. This implies that SLink is surjective since, by Lemma 1.2, SLink is injective.

We denote the number of leaves of a tree T′ by λ(T′). It is easy to prove, by induction, that for any tree T′ the following property holds:

    |T′| = 2λ(T′) − 1 − Σ_{internal v′ ∈ T′} (−2 + number of children of v′)

Hence, since λ(T1) = λ(T2), all we need to show is that Σ_{v1 ∈ T1}(. . .) ≤ Σ_{v2 ∈ T2}(. . .). Note that the terms of the sum are always non-negative, because T1 and T2 are compact, i.e., every internal node has at least two children. Since SLink is injective, this result can be shown directly by arguing that the number of children of an internal node v1 of T1 is not larger than the number of children of SLink(v1) in T2. This is a known property of suffix trees: If node v1 contains a child node that branches by letter X ∈ Σ, then SLink(v1) must also contain a child branching by X. SLink does not remove these letters from the path label, because v1 descends from v ≠ Root.
To complete the proof of Lemma 1.1 we still need the following property, whose proof we postpone to the next subsection, where we will have more algebraic tools.

Lemma 1.4. If the number of leaves of the subtree below node v ≠ Root is equal to the number of leaves below SLink(v), then it holds that SLink(Parent(v′)) = Parent(SLink(v′)) for any node v′ descendant of v.

The compact DAG data structure [Gusfield 1997] removes the regularity arising from Lemma 1.1 by storing pointers from node v to node SLink(v) whenever v satisfies the conditions of the lemma.

A.2. Locally Monomorphic Subtrees

The problem with the regularities exploited by the DAG approach is that the associated similarity concept is too strict. In this section we consider the relation between the subtrees below v and SLink(v) under less regular conditions, i.e., when the numbers of leaves below v and SLink(v) differ. Obviously, SLink will not be surjective, but the proof of Lemma 1.2 is still valid, and therefore SLink is still locally injective. Moreover, SLink will no longer be a graph homomorphism, although it still preserves the tree structure in some way. This means we need a new notion of homomorphism that is less strict than the one related to Parent. Hence we will now consider trees as partially ordered sets (posets), ordered by the Ancestor relation. A poset homomorphism has the following definition:

Definition 1.5. A tree poset-homomorphism between trees T1 and T2 is a mapping f from the nodes of T1 to the nodes of T2 such that, if Ancestor(v′, v) holds in T1, then Ancestor(f(v′), f(v)) holds in T2.
By reasoning with the path labels it should be obvious that SLink is always a homomorphism in this sense. In fact, SLink is a homomorphism in a slightly more regular way; we presented the poset-homomorphism because it gives a more intuitive description of the structure being preserved. With the LCA operation, trees can also be considered as semilattices, i.e., for any nodes v, v′ and v′′ we have LCA(v, v) = v; LCA(v, v′) = LCA(v′, v); and LCA(v, LCA(v′, v′′)) = LCA(LCA(v, v′), v′′). Poset-homomorphisms are not necessarily semilattice-homomorphisms, although they have the following property:

LEMMA 1.6. If there is a tree poset-homomorphism f between trees T1 and T2, then for any nodes v and v′ of T1 we have that f(LCA(v, v′)) is an ancestor of LCA(f(v), f(v′)).
PROOF. The proof consists in showing that f(LCA(v, v′)) is an ancestor of both f(v) and f(v′), since by the definition of LCA this guarantees that it must be an ancestor of LCA(f(v), f(v′)). Let us consider f(v), without loss of generality. Since LCA(v, v′) is an ancestor of v, we conclude from Definition 1.5 that f(LCA(v, v′)) is an ancestor of f(v).
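The bound in Lemma 1.6 can be strict, which is precisely why poset-homomorphisms are weaker than semilattice-homomorphisms. A minimal sketch (our own toy witness; all node names are hypothetical) where f(LCA(v, v′)) is a proper ancestor of LCA(f(v), f(v′)):

```python
# T1 has two siblings whose LCA is the root; T2 is a chain, so their
# images can have a deeper LCA than the image of the root.
T1 = {"x": "r", "y": "r"}        # r is the root; x, y are its children
T2 = {"q": "p", "s": "q"}        # chain p -> q -> s, rooted at p

def path_to_root(parent, v):
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path

def is_ancestor(parent, a, v):
    return a in path_to_root(parent, v)

def lca(parent, u, w):           # deepest common node on the root paths
    return next(a for a in path_to_root(parent, u)
                if is_ancestor(parent, a, w))

f = {"r": "p", "x": "q", "y": "s"}  # a poset-homomorphism from T1 to T2

assert lca(T1, "x", "y") == "r"
assert lca(T2, f["x"], f["y"]) == "q"
assert is_ancestor(T2, f["r"], "q") and f["r"] != "q"  # strictly above
```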
Lemma 5.1 points out that SLINK is in fact a much more special kind of homomorphism: it is a semilattice-homomorphism, and that is essentially the underlying insight of our suffix tree representation technique. In light of this property it is now very easy to prove Lemma 1.4.
PROOF OF LEMMA 1.4. The proof follows from computing the expression

SLINK(LCA(PARENT(v), SLINK⁻¹(PARENT(SLINK(v))))).

Note that the expression is well defined because, under the conditions of the lemma, SLINK is bijective. First note that since PARENT(SLINK(v)) is an ancestor of SLINK(v), SLINK⁻¹(PARENT(SLINK(v))) is an ancestor of SLINK⁻¹(SLINK(v)) = v. Moreover, since SLINK is bijective and PARENT(SLINK(v)) ≠ SLINK(v), we have that SLINK⁻¹(PARENT(SLINK(v))) ≠ v. This means that SLINK⁻¹(PARENT(SLINK(v))) is a proper ancestor of v, hence an ancestor of PARENT(v), and therefore it is the LCA we are trying to compute. Hence the expression yields SLINK(SLINK⁻¹(PARENT(SLINK(v)))) = PARENT(SLINK(v)).
On the other hand, we can compute the same expression by using Lemma 5.1, thereby obtaining

LCA(SLINK(PARENT(v)), PARENT(SLINK(v))).

Now we use essentially the same reasoning, in a slightly simpler way. As PARENT(v) is an ancestor of v, we have that SLINK(PARENT(v)) is an ancestor of SLINK(v). Moreover, since SLINK is injective and PARENT(v) ≠ v, it holds that SLINK(PARENT(v)) ≠ SLINK(v). Thus SLINK(PARENT(v)) is a proper ancestor of SLINK(v), hence an ancestor of PARENT(SLINK(v)), and therefore it is the LCA we are trying to compute.
Hence we obtain that PARENT(SLINK(v)) = SLINK(PARENT(v)).
Note that, although the first relation in the proof is only true under the conditions of Lemma 1.4, the fact that SLINK(PARENT(v)) is an ancestor of PARENT(SLINK(v)) is always true.
Hence we have shown that, when v ≠ ROOT, the SLINK operation is a monomorphism, i.e., an injective homomorphism, that preserves LCA. This means that suffix trees are locally self-similar as semilattices. This regularity, as abstract as it may seem, is very important, because any representation of suffix trees that ignores it will contain redundant information. The regularity implies, roughly, that the subtree below SLINK(v) contains the subtree below v, and that we can use LCA to recover it. This is the conceptual basis of Lemma 5.2.
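To illustrate this self-similarity concretely, the sketch below (our illustration, not the paper's data structure) builds the suffix tree of a tiny string naively, using the textbook facts that its internal nodes are exactly the pairwise longest common prefixes of the suffixes and that, on path labels, LCA corresponds to the string LCP while SLINK drops the first letter. It checks both the semilattice-homomorphism of Lemma 5.1 and the always-true ancestor relation from the remark above:

```python
from itertools import combinations

text = "abab$"
suffixes = [text[i:] for i in range(len(text))]

def lcp(u, w):
    i = 0
    while i < min(len(u), len(w)) and u[i] == w[i]:
        i += 1
    return u[:i]

# Nodes identified by path label: every suffix (a leaf) plus every
# pairwise LCP (the internal nodes, including the root "").
nodes = set(suffixes) | {lcp(u, w) for u, w in combinations(suffixes, 2)}

# SLINK(LCA(u, w)) = LCA(SLINK(u), SLINK(w)) whenever the LCA is below
# the root, i.e., u and w start with the same letter.
for u, w in combinations(nodes - {""}, 2):
    if u[0] == w[0]:
        assert lcp(u, w)[1:] == lcp(u[1:], w[1:])

# PARENT(v): the longest proper prefix of v that is itself a node.
def parent(v):
    return max((p for p in nodes if p != v and v.startswith(p)), key=len)

# SLINK(PARENT(v)) is an ancestor (a prefix) of PARENT(SLINK(v)),
# treating SLINK(ROOT) as ROOT for the purposes of this check.
for v in nodes - {""}:
    if v[1:] in nodes and v[1:] != "":
        assert parent(v[1:]).startswith(parent(v)[1:])

print("all checks passed")
```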

REFERENCES
ABOUELHODA, M., KURTZ, S., AND OHLEBUSCH, E. 2004. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 1, 53–86.
APOSTOLICO, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. NATO ISI Series. Springer-Verlag, 85–96.
ARROYUELO, D. 2008. An improved succinct representation for dynamic k-ary trees. In Proc. 19th International Symposium on Combinatorial Pattern Matching (CPM). LNCS 5029. 277–289.
BLUMER, A., BLUMER, J., HAUSSLER, D., EHRENFEUCHT, A., CHEN, M., AND SEIFERAS, J. 1985. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science 40, 1, 31–55.
BURROWS, M. AND WHEELER, D. 1994. A block-sorting lossless data compression algorithm. Tech. rep., Digital Equipment Corporation.
CÁNOVAS, R. AND NAVARRO, G. 2010. Practical compressed suffix trees. In Proc. 9th International Symposium on Experimental Algorithms (SEA). LNCS 6049. 94–105.
CHAN, H.-L., HON, W.-K., AND LAM, T.-W. 2004. Compressed index for a dynamic collection of texts. In Proc. 15th International Symposium on Combinatorial Pattern Matching (CPM). LNCS 3109. 445–456.
CHAN, H.-L., HON, W.-K., LAM, T.-W., AND SADAKANE, K. 2007. Compressed indexes for dynamic text collections. ACM Transactions on Algorithms 3, 2, article 21.
CHIEN, Y.-F., HON, W.-K., SHAH, R., AND VITTER, J. S. 2008. Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In Proc. Data Compression Conference (DCC). 252–261.
CROCHEMORE, M. 1986. Transducers and repetitions. Theoretical Computer Science 45, 1, 63–86.
FERRAGINA, P. AND MANZINI, G. 2000. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS). 390–398.
FERRAGINA, P. AND MANZINI, G. 2005. Indexing compressed text. Journal of the ACM 52, 4, 552–581.
FERRAGINA, P., MANZINI, G., MÄKINEN, V., AND NAVARRO, G. 2007. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms 3, 2, article 20.
FERRAGINA, P. AND VENTURINI, R. 2007. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science 372, 1, 115–121.
FISCHER, J. 2010. Wee LCP. Information Processing Letters 110, 317–320.
FISCHER, J., MÄKINEN, V., AND NAVARRO, G. 2009. Faster entropy-bounded compressed suffix trees. Theoretical Computer Science 410, 51, 5354–5364.
FOSCHINI, L., GROSSI, R., GUPTA, A., AND VITTER, J. 2006. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms 2, 4, 611–639.
GAGIE, T., PUGLISI, S. J., AND TURPIN, A. 2009. Range quantile queries: Another virtue of wavelet trees. In Proc. 16th Symposium on String Processing and Information Retrieval (SPIRE). 1–6.
GIEGERICH, R., KURTZ, S., AND STOYE, J. 2003. Efficient implementation of lazy suffix trees. Software Practice and Experience 33, 11, 1035–1049.
GONZÁLEZ, R. AND NAVARRO, G. 2008. Rank/select on dynamic compressed sequences and applications. Theoretical Computer Science 410, 4414–4422.
GROSSI, R., GUPTA, A., AND VITTER, J. 2003. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 841–850.
GUPTA, A., HON, W.-K., SHAH, R., AND VITTER, J. 2007. A framework for dynamizing succinct data structures. In Proc. 34th International Colloquium on Automata, Languages and Programming (ICALP). LNCS 4596. 521–532.
GUSFIELD, D. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press.
HE, M. AND MUNRO, I. 2010. Succinct representations of dynamic strings. In Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 6393. 334–346.
HON, W.-K., LAM, T.-W., SADAKANE, K., AND SUNG, W.-K. 2003. Constructing compressed suffix arrays with large alphabets. In Proc. 14th Annual International Symposium on Algorithms and Computation (ISAAC). 240–249.
HON, W.-K., SADAKANE, K., AND SUNG, W.-K. 2009. Breaking a time-and-space barrier in constructing full-text indices. SIAM Journal on Computing 38, 6, 2162–2178.
HON, W.-K., SHAH, R., THANKACHAN, S., AND VITTER, J. 2009. On entropy-compressed text indexing in external memory. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 5721. 75–89.
HUYNH, T. N. D., HON, W.-K., LAM, T.-W., AND SUNG, W.-K. 2006. Approximate string matching using compressed suffix arrays. Theoretical Computer Science 352, 1–3, 240–249.
KÄRKKÄINEN, J. AND UKKONEN, E. 1996a. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing. 141–155.
KÄRKKÄINEN, J. AND UKKONEN, E. 1996b. Sparse suffix trees. In Computing and Combinatorics. LNCS 1090. 219–230.
LEE, S. AND PARK, K. 2007. Dynamic rank-select structures with applications to run-length encoded texts. In Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM). LNCS 4580. 95–106.
MÄKINEN, V. AND NAVARRO, G. 2008. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4, 3, 1–38.
MANBER, U. AND MYERS, E. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22, 5, 935–948.
MANZINI, G. 2001. An analysis of the Burrows-Wheeler transform. Journal of the ACM 48, 3, 407–430.
MCCREIGHT, E. 1976. A space-economical suffix tree construction algorithm. Journal of the ACM 23, 2, 262–272.
MUNRO, I., RAMAN, V., AND RAO, S. S. 2001. Space efficient suffix trees. Journal of Algorithms 39, 205–222.
NAVARRO, G. 2004. Indexing text using the Ziv-Lempel trie. Journal of Discrete Algorithms 2, 1, 87–114.
NAVARRO, G. AND MÄKINEN, V. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, article 2.
NAVARRO, G. AND SADAKANE, K. 2010. Fully-functional static and dynamic succinct trees. CoRR abs/0905.0768. http://arxiv.org/abs/0905.0768. Version 4.
OHLEBUSCH, E., FISCHER, J., AND GOG, S. 2010. CST++. In Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 6393. 322–333.
OHLEBUSCH, E. AND GOG, S. 2009. A compressed enhanced suffix array supporting fast string matching. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 5721. 51–62.
OKANOHARA, D. AND SADAKANE, K. 2009. A linear-time Burrows-Wheeler transform using induced sorting. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 5721. 90–101.
PǍTRAŞCU, M. AND VIOLA, E. 2010. Cell-probe lower bounds for succinct partial sums. In Proc. 21st ACM-SIAM Symposium on Discrete Algorithms (SODA). 117–122.
PǍTRAŞCU, M. 2008. Succincter. In Proc. 49th IEEE Annual Symposium on Foundations of Computer Science (FOCS). 305–313.
RAMAN, R., RAMAN, V., AND RAO, S. S. 2002. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th ACM-SIAM Symposium on Discrete Algorithms (SODA). 233–242.
RAMAN, R. AND RAO, S. S. 2003. Succinct dynamic dictionaries and trees. In Proc. 30th International Colloquium on Automata, Languages and Programming (ICALP). LNCS 2719. 357–368.
RUSSO, L., NAVARRO, G., AND OLIVEIRA, A. 2008a. Dynamic fully-compressed suffix trees. In Proc. 19th International Symposium on Combinatorial Pattern Matching (CPM). LNCS 5029. 191–203.
RUSSO, L., NAVARRO, G., AND OLIVEIRA, A. 2008b. Fully-compressed suffix trees. In Proc. 8th Latin American Symposium on Theoretical Informatics (LATIN). LNCS 4957. 362–373.
RUSSO, L. M. S. AND OLIVEIRA, A. L. 2008. A compressed self-index using a Ziv-Lempel dictionary. Information Retrieval 11, 4, 359–388.
SADAKANE, K. 2003. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms 48, 2, 294–313.
SADAKANE, K. 2007. Compressed suffix trees with full functionality. Theory of Computing Systems 41, 4, 589–607.
SADAKANE, K. AND NAVARRO, G. 2010. Fully-functional succinct trees. In Proc. 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 134–149.
VÄLIMÄKI, N., GERLACH, W., DIXIT, K., AND MÄKINEN, V. 2007. Engineering a compressed suffix tree implementation. In Proc. 6th International Workshop on Efficient and Experimental Algorithms (WEA). LNCS 4525. 217–228.
WEINER, P. 1973. Linear pattern matching algorithms. In Proc. 14th IEEE Annual Symposium on Switching and Automata Theory (SWAT). 1–11.

Received Month Year; revised Month Year; accepted Month Year
