Abstract
With the first human DNA decoded into a sequence of about 2.8 billion base pairs, much biological research has centered on analyzing this sequence. Theoretically, it is now feasible to hold an index for human DNA in main memory so that any pattern can be located efficiently. This is due to the recent breakthrough in compressed suffix arrays, which reduces the space requirement from O(n log n) bits to O(n) bits. However, constructing a compressed suffix array is still not easy, because the suffix array must be computed first, which requires O(n log n) bits of working memory (i.e., more than 13 gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from the text. The main contribution is a new construction algorithm that uses only O(n) bits of working memory while, more importantly, keeping the time complexity the same as before, i.e., O(n log n).
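To make the objects in the abstract concrete, the following minimal Python sketch builds a suffix array naively and derives from it the Ψ function on which compressed suffix arrays (in the sense of Grossi and Vitter) are based. It is not the construction algorithm of this paper: it is the memory-hungry route through the full suffix array that the paper aims to avoid. The function names, the toy text, the `$` terminator, and the wrap-around convention for the last suffix are all assumptions made for illustration.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text.

    Stores n integers of about log n bits each, i.e. O(n log n) bits.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def psi_function(sa):
    """Psi[i] = rank of the suffix starting one position after SA[i].

    Assumed convention: the suffix at the last text position wraps around
    to position 0.
    """
    n = len(sa)
    rank = [0] * n  # inverse suffix array: rank[pos] = index of pos in sa
    for i, pos in enumerate(sa):
        rank[pos] = i
    return [rank[(pos + 1) % n] for pos in sa]


def increasing_within_blocks(text, sa, psi):
    """Check that Psi increases among suffixes sharing the same first character."""
    for i in range(1, len(sa)):
        same_block = text[sa[i]] == text[sa[i - 1]]
        if same_block and psi[i] <= psi[i - 1]:
            return False
    return True


# Hypothetical toy input; '$' is a unique terminator that sorts before a, c, g, t.
text = "acgtacga$"
sa = suffix_array(text)
psi = psi_function(sa)

print(sa)   # [8, 7, 4, 0, 5, 1, 6, 2, 3]
print(psi)  # [3, 0, 4, 5, 6, 7, 1, 8, 2]
print(increasing_within_blocks(text, sa, psi))  # True
```

Within each block of suffixes that share a first character, Ψ is strictly increasing, so it can be stored as small differences rather than as full log n-bit integers. This is the property that lets a compressed suffix array fit in O(n) bits for a constant-size alphabet such as DNA.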
This research was supported in part by NUS Academic Research Grant R-252-000-119-112.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lam, TW., Sadakane, K., Sung, WK., Yiu, SM. (2002). A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays. In: Ibarra, O.H., Zhang, L. (eds) Computing and Combinatorics. COCOON 2002. Lecture Notes in Computer Science, vol 2387. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45655-4_43
DOI: https://doi.org/10.1007/3-540-45655-4_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43996-7
Online ISBN: 978-3-540-45655-1
eBook Packages: Springer Book Archive