Approximate nearest neighbor for long document relationship labeling in digital libraries

Organisciak, Peter; Schmidt, Benjamin M.; Durward, Matthew

doi:10.1007/s00799-023-00354-5

Published: 15 April 2023

Volume 24, pages 311–325 (2023)

International Journal on Digital Libraries

Abstract

Relationship tagging of long text documents is a growing need in information science, spurred by the emergence of multi-million book bibliographic digital libraries. Large digital libraries offer an unprecedented glimpse into cultural history through their collections, but the combination of collection scale and document length complicates their study, given that prior work on large corpora has dealt primarily with much shorter texts. This study presents and evaluates an approach for fast retrieval on long texts, which leverages a chunk-and-aggregate approach with document sub-units to capture nuanced similarity relationships at scales which are not otherwise tractable. This approach is evaluated on book relationships from the HathiTrust Digital Library and shows strong results for relationships beyond exact duplicates. Finally, we argue for the value of approximate nearest neighbor search for narrowing the search space for downstream classification and retrieval contexts.
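The chunk-and-aggregate idea described above can be illustrated with a minimal sketch. It is not the authors' implementation (their code is in the repository listed under Notes); the hashed bag-of-words embedding, the exact linear scan standing in for an approximate nearest neighbor index, the mean-similarity aggregation, and every function name and parameter here are assumptions made only for illustration.

```python
import hashlib
import math
from collections import defaultdict

DIM = 256  # hypothetical embedding size, chosen only for this sketch


def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def embed(tokens, dim=DIM):
    """Hashed bag-of-words vector, L2-normalized (stand-in for a real chunk embedding)."""
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(docs, size):
    """Flatten every document into (doc_id, chunk_vector) pairs."""
    return [(doc_id, embed(piece))
            for doc_id, text in docs.items()
            for piece in chunk(text.lower().split(), size)]


def nearest_chunks(query, index, k, exclude):
    """Exact top-k by cosine similarity; an ANN index would replace this scan."""
    scored = [(sum(a * b for a, b in zip(query, vec)), doc_id)
              for doc_id, vec in index if doc_id != exclude]
    return sorted(scored, reverse=True)[:k]


def rank_related(doc_id, docs, index, size, k=1):
    """Sum chunk-level similarities per candidate, averaged over the query's chunks."""
    pieces = chunk(docs[doc_id].lower().split(), size)
    totals = defaultdict(float)
    for piece in pieces:
        for sim, other in nearest_chunks(embed(piece), index, k, doc_id):
            totals[other] += sim
    n = max(len(pieces), 1)
    return sorted(((total / n, other) for other, total in totals.items()),
                  reverse=True)


# Toy corpus: "a" and "b" differ only in their final word; "c" is unrelated.
docs = {
    "a": "the whale surfaced near the ship and the crew shouted as the harpoon flew",
    "b": "the whale surfaced near the ship and the crew shouted as the harpoon missed",
    "c": "quarterly revenue grew while operating costs fell across all regional offices",
}
index = build_index(docs, size=5)
ranked = rank_related("a", docs, index, size=5)
print(ranked)  # list of (score, doc_id) pairs, best candidate first
```

Because matching happens per chunk, two books that share only some passages can still surface as candidates, and the short ranked list can then be handed to a slower downstream classifier. In a collection of millions of volumes, the linear scan in `nearest_chunks` is the step an approximate index would replace.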

Notes

https://github.com/massivetexts/compare-tools.

Acknowledgements

This work was made possible by IMLS grant #LG-86-18-0061-18.

Author information

Authors and Affiliations

University of Denver, Denver, USA
Peter Organisciak
Nomic AI, New York, USA
Benjamin M. Schmidt
University of Canterbury, Christchurch, New Zealand
Matthew Durward

Corresponding author

Correspondence to Peter Organisciak.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Organisciak, P., Schmidt, B.M. & Durward, M. Approximate nearest neighbor for long document relationship labeling in digital libraries. Int J Digit Libr 24, 311–325 (2023). https://doi.org/10.1007/s00799-023-00354-5

Received: 04 May 2022
Revised: 24 February 2023
Accepted: 01 March 2023
Published: 15 April 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s00799-023-00354-5
