Abstract
Relationship tagging of long text documents is a growing need in information science, spurred by the emergence of bibliographic digital libraries holding millions of books. Large digital libraries offer an unprecedented glimpse into cultural history through their collections, but the combination of collection scale and document length complicates their study, given that prior work on large corpora has dealt primarily with much shorter texts. This study presents and evaluates a method for fast retrieval on long texts: a chunk-and-aggregate approach that uses document sub-units to capture nuanced similarity relationships at scales that are not otherwise tractable. The approach is evaluated on book relationships from the HathiTrust Digital Library and shows strong results for relationships beyond exact duplicates. Finally, we argue for the value of approximate nearest neighbor search for narrowing the search space in downstream classification and retrieval contexts.
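The chunk-and-aggregate idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the chunk size, the vector representation, or the index used. The sketch assumes fixed-length word chunks, plain bag-of-words vectors as a stand-in for real embeddings, and a brute-force cosine search as a stand-in for an approximate nearest neighbor index. All function names and parameters (`chunk_text`, `embed`, `build_index`, `nearest_chunks`, `related_books`, `size`, `k`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def chunk_text(text, size=50):
    """Split a document into consecutive chunks of `size` words (assumed unit)."""
    words = text.lower().split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def embed(words):
    """L2-normalized bag-of-words vector; a stand-in for a real embedding."""
    counts = Counter(words)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(word, 0.0) for word, x in u.items())


def build_index(books):
    """Flat list of (book_id, chunk_vector) pairs, one entry per chunk."""
    return [(book_id, embed(chunk))
            for book_id, text in books.items()
            for chunk in chunk_text(text)]


def nearest_chunks(query_vec, index, k):
    """Brute-force top-k search; an ANN index would replace this step at scale."""
    scored = [(cosine(query_vec, vec), book_id) for book_id, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def related_books(query_id, books, index, k=2):
    """Aggregate chunk-level matches into book-level scores for one query book."""
    others = [(b, v) for b, v in index if b != query_id]
    chunks = chunk_text(books[query_id])
    totals = defaultdict(float)
    for chunk in chunks:
        for score, book_id in nearest_chunks(embed(chunk), others, k):
            if score > 0:  # ignore chunks with no shared vocabulary
                totals[book_id] += score
    n = max(len(chunks), 1)
    # Mean similarity per query chunk, highest first.
    return sorted(((b, t / n) for b, t in totals.items()),
                  key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    base = " ".join(f"word{i}" for i in range(200))
    books = {
        "A": base,
        "B": base + " extra appendix words",               # near-duplicate of A
        "C": " ".join(f"other{i}" for i in range(200)),    # unrelated text
    }
    print(related_books("A", books, build_index(books)))
```

Each query chunk retrieves only its closest chunks, and a book's score is accumulated from those hits, so book-level candidates emerge without comparing every pair of full texts. The resulting short candidate list is the kind of narrowed search space that a downstream classifier could then label.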
Acknowledgements
This work was made possible by IMLS grant #LG-86-18-0061-18.
About this article
Cite this article
Organisciak, P., Schmidt, B.M. & Durward, M. Approximate nearest neighbor for long document relationship labeling in digital libraries. Int J Digit Libr 24, 311–325 (2023). https://doi.org/10.1007/s00799-023-00354-5
DOI: https://doi.org/10.1007/s00799-023-00354-5