Structural optimization of a full-text n-gram index using relational normalization

Kim, Min-Soo; Whang, Kyu-Young; Lee, Jae-Gil; Lee, Min-Jae

doi:10.1007/s00778-007-0082-x

Regular Paper
Published: 13 December 2007

The VLDB Journal, Volume 17, pages 1485–1507 (2008)

Abstract

As the amount of text data grows explosively, an efficient index structure for large text databases becomes ever more important. The n-gram inverted index (simply, the n-gram index) has been widely used in information retrieval and in approximate string matching because of its two major advantages: it is language-neutral and error-tolerant. Nevertheless, the n-gram index also has drawbacks: its size tends to be very large, and its query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index), which significantly reduces the size and improves the query performance by using relational normalization theory. We first identify that, in the (full-text) n-gram index, there exists redundancy in the position information caused by a non-trivial multivalued dependency. The proposed index eliminates this redundancy by constructing the index in two levels: the front-end index and the back-end index. We formally prove that this two-level construction is identical to the relational normalization process. We call this process structural optimization of the n-gram index. The n-gram/2L index has excellent properties: (1) it significantly reduces the size and improves the performance compared with the n-gram index, with these improvements becoming more marked as the database size gets larger; (2) the query processing time increases only very slightly as the query length gets longer. Experimental results using real databases of 1 GB show that the size of the n-gram/2L index is reduced by up to 1.9–2.4 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram index. We also compare the n-gram/2L index with Makinen’s compact suffix array (CSA) (Proc. 11th Annual Symposium on Combinatorial Pattern Matching pp. 305–319, 2000) stored on disk. Experimental results show that the n-gram/2L index outperforms the CSA when the query length is short (i.e., less than 15–20), and the CSA is similar to or better than the n-gram/2L index when the query length is long (i.e., more than 15–20).
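
The two-level idea can be illustrated with a small Python sketch. The abstract does not say what each level indexes, so the layout below is an assumption: the back-end index maps fixed-length subsequences of the documents to their positions, and the front-end index maps n-grams to their positions inside those subsequences. The function names, the subsequence length `m`, and the overlap of n-1 characters between consecutive subsequences are all hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def ngram_index(docs, n):
    """Plain n-gram inverted index: n-gram -> [(doc_id, offset), ...]."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for off in range(len(text) - n + 1):
            index[text[off:off + n]].append((doc_id, off))
    return index


def two_level_index(docs, n, m):
    """Assumed two-level layout (illustrative only).

    back_end:  subsequence -> [(doc_id, offset in document), ...]
    front_end: n-gram      -> [(subsequence, offset in subsequence), ...]
    """
    if m < n:
        raise ValueError("subsequence length m must be at least n")
    # Consecutive subsequences overlap by n-1 characters, so every n-gram
    # of a document lies entirely inside exactly one subsequence.
    step = m - (n - 1)
    back_end = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for off in range(0, len(text) - n + 1, step):
            back_end[text[off:off + m]].append((doc_id, off))
    # Each distinct subsequence is indexed once, however often it occurs.
    front_end = defaultdict(list)
    for sub in back_end:
        for off in range(len(sub) - n + 1):
            front_end[sub[off:off + n]].append((sub, off))
    return front_end, back_end


def expand(front_end, back_end):
    """Join both levels back into n-gram -> {(doc_id, offset), ...}."""
    joined = defaultdict(set)
    for gram, sub_postings in front_end.items():
        for sub, sub_off in sub_postings:
            for doc_id, doc_off in back_end[sub]:
                joined[gram].add((doc_id, doc_off + sub_off))
    return joined


def count_postings(index):
    """Total number of postings stored in an index."""
    return sum(len(postings) for postings in index.values())


if __name__ == "__main__":
    docs = ["abcdabcdabcd", "abcdxabcd"]
    plain = ngram_index(docs, n=2)
    front_end, back_end = two_level_index(docs, n=2, m=4)
    print("plain postings:    ", count_postings(plain))
    print("two-level postings:",
          count_postings(front_end) + count_postings(back_end))
    # The join of the two levels reproduces the plain index exactly.
    print(expand(front_end, back_end) == {g: set(p) for g, p in plain.items()})
```

In this sketch, a subsequence that occurs many times contributes its n-gram offsets to the front-end index only once, which is where any space saving would come from. The `expand` function plays the role of the join that recovers the original postings, in the spirit of a lossless decomposition. Whether the saving materializes depends on how often subsequences repeat, so a toy input like the one above says nothing about the sizes reported in the abstract.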

References

Baeza-Yates, R., Navarro, G.: A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1(2), (1998)
Baeza-Yates, R., Navarro, G.: Block addressing indices for approximate text retrieval. J. Am. Soc. Inf. Sci. 51(1), 69–82 (2000)
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press (1999)
Barroso, L.A., Dean, J., Holzle, U.: Web search for a planet: the Google cluster architecture. IEEE Micro 23(2), 22–28 (2003)
Cao, X., Li, S.C., Tung, A.K.H.: Indexing DNA sequences using q-grams. In: Proc. Int’l Conf. on Database Systems for Advanced Applications (DASFAA). Beijing, pp. 4–16 (2005)
Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems, 4th edn. Addison Wesley (2003)
Gao, J., Goodman, J., Li, M., Lee, K.: Toward a unified approach to statistical language modeling for Chinese. ACM Trans. Asian Lang. Inf. Process. (TALIP) 1(1), 3–33 (2002)
Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: Proc. 32nd ACM Symposium on Theory of Computing (STOC), pp. 397–406 (2000)
Karkkainen, J., Rao, S.: 7. Full-text indexes in external memory. In: Algorithms for Memory Hierarchies pp. 149–170 (2003)
Karkkainen, J., Sutinen, E.: Lempel-Ziv index for q-grams. Algorithmica 21(1), 137–154 (1998)
Karkkainen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proc. 3rd South American Workshop on String Processing (WSP), pp. 141–155 (1996)
Kim, M., Whang, K., Lee, J.: n-gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching. J. Comput. Systems Sci. Eng. (2007) (to appear)
Kim, M., Whang, K., Lee, J., Lee, M.: n-Gram/2L: a space and time efficient two-level n-gram inverted index structure. In: Proc. 31st Int’l Conf. on Very Large Data Bases (VLDB), Trondheim, pp. 325–336 (2005)
Kukich, K.: Techniques for automatically correcting words in text. ACM Comput Surv 24(4), 377–439 (1992)
Lee, J.H., Ahn, J.S.: Using n-grams for Korean text retrieval. In: Proc. Int’l Conf. on Information Retrieval. ACM SIGIR, Zurich, pp. 216–224 (1996)
Lehtinen, O., Sutinen, E., Tarhio, J.: Experiments on block indexing. In: Proc. 3rd South American Workshop on String Processing pp. 183–193 (1996)
Makinen, V.: Compact suffix array. In: Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 305–319 (2000)
Mayfield, J., McNamee, P.: Single N-gram stemming. In: Proc. Int’l Conf. on Information Retrieval. ACM SIGIR, Toronto, pp. 415–416 (2003)
Melnik, S., Raghavan, S., Yang, B., Garcia-Molina, H.: Building a distributed full-text index for the Web. ACM Trans. Inf. Systems 19(3), 217–241 (2001)
Miller, E., Shen, D., Liu, J., Nicholas, C.: Performance and scalability of a large-scale N-gram based information retrieval system. J. Digital Inf. 1(5), 1–25 (2000)
Navarro, G., Baeza-Yates, R., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng Bull 24(4), 19–27 (2001)
Navarro, G., Makinen, V.: Compressed full-text indexes. Technical report TR/DCC-2006-6, Department of Computer Science, University of Chile, (2006). (accepted to ACM Computing Surveys)
Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing text with approximate q-grams. In: Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 350–363 (2000)
Puglisi, S., Smyth, W., Turpin, A.: Inverted files versus suffix arrays for locating patterns in primary memory. In: Proc. 13th Symposium on String Processing and Information Retrieval (SPIRE), Glasgow, pp. 122–133 (2006)
Ramakrishnan, R.: Database Management Systems. McGraw-Hill, New York (1998)
Scholer, F., Williams, H.E., Yiannis, J., Zobel, J.: Compression of inverted indexes for fast query evaluation. In: Proc. Int’l Conf. on Information Retrieval, ACM SIGIR, Tampere, pp. 222–229 (2002)
Sutinen, E., Tarhio, J.: Filtration with q-samples in approximate string matching. In: Proc. 7th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 50–63 (1996)
Ullman, J.D.: Principles of Database and Knowledge-Base Systems, Vol. I. Computer Science Press, USA (1988)
Whang, K., Lee, M., Lee, J., Kim, M., Han, W.: Odysseus: a high-performance ORDBMS tightly-coupled with IR features. In: Proc. 21st IEEE Int’l Conf. on Data Engineering (ICDE), Tokyo, pp. 1104–1105 (2005) (this paper received the Best Demonstration Award)
Williams, H.E.: Genomic information retrieval. In: Proc. 14th Australasian Database Conferences, Adelaide, pp. 27–35 (2003)
Williams, H.E., Zobel, J.: Indexing and retrieval for genomic databases. IEEE Trans. Knowl. Data Eng. 14(1), 63–78 (2002)
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn., Morgan Kaufmann (1999)
Yasushi, O., Masajirou, I.: A new character-based indexing method using frequency data for Japanese documents. In: Proc. Int’l Conf. on Information Retrieval, pp. 121–129. ACM SIGIR, Seattle (1995)
Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput Surv 38(2), (2006)

Author information

Authors and Affiliations

Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Min-Soo Kim, Kyu-Young Whang, Jae-Gil Lee & Min-Jae Lee

Corresponding author

Correspondence to Min-Soo Kim.

About this article

Cite this article

Kim, MS., Whang, KY., Lee, JG. et al. Structural optimization of a full-text n-gram index using relational normalization. The VLDB Journal 17, 1485–1507 (2008). https://doi.org/10.1007/s00778-007-0082-x

Received: 24 May 2006
Revised: 11 July 2007
Accepted: 13 August 2007
Published: 13 December 2007
Issue Date: November 2008
DOI: https://doi.org/10.1007/s00778-007-0082-x
