Abstract
This study investigates the driving factors of word co-occurrence from the perspective of semantic relations between frequently co-occurring words. Natural sentences in a corpus of news articles were used as co-occurrence windows to extract co-occurring word pairs, with no limit on the distance between the two words. ConceptNet, a semantic knowledge base, was used to annotate the semantic relation between co-occurring words. Because some co-occurring word pairs have no direct semantic relation in ConceptNet, we proposed a relation annotation method that connects them through an intermediate word. Results showed that six semantic relations in ConceptNet (RelatedTo, IsA, Synonym, HasContext, Antonym, and MannerOf) were important factors directly inducing word co-occurrence, and that combinations of some of those relations were important factors indirectly driving it. Syntactic analysis and lexical semantic theories were then combined to analyze the direct and indirect semantic relations. This analysis showed that the factors driving word co-occurrence in sentences can be classified into three relation categories: collocation and modification, hyponymy, and synonym and antonym. These findings can help explain the phenomenon of word co-occurrence and improve the method and application of co-word analysis.
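
The two steps described in the abstract, sentence-level pair extraction and relation annotation through an intermediate word, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The regex-based sentence splitting, the absence of stopword filtering, the function names (`sentence_pairs`, `lookup`, `annotate`), and the `TOY_EDGES` dictionary standing in for ConceptNet are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_pairs(text):
    """Count unordered word pairs that co-occur within the same sentence."""
    counts = Counter()
    # Naive split on ., ! and ? (an assumption; a real tokenizer would be more robust).
    for sentence in re.split(r"[.!?]+", text):
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        # Each pair in a sentence counts once, whatever the distance between the words.
        counts.update(combinations(words, 2))
    return counts


# Hypothetical toy edges standing in for ConceptNet; not real ConceptNet data.
TOY_EDGES = {
    ("dog", "animal"): "IsA",
    ("cat", "animal"): "IsA",
    ("hot", "cold"): "Antonym",
}


def lookup(a, b, edges):
    """Return the relation label between a and b, ignoring edge direction."""
    return edges.get((a, b)) or edges.get((b, a))


def annotate(a, b, edges):
    """Return the direct relation, or else paths through one intermediate word."""
    direct = lookup(a, b, edges)
    if direct is not None:
        return {"direct": direct, "indirect": []}
    # Candidate intermediates: every word in the toy graph except a and b.
    vocab = {w for pair in edges for w in pair} - {a, b}
    indirect = []
    for mid in sorted(vocab):
        r1, r2 = lookup(a, mid, edges), lookup(mid, b, edges)
        if r1 and r2:
            indirect.append((r1, mid, r2))
    return {"direct": None, "indirect": indirect}


pairs = sentence_pairs("The dog chased the cat. The cat slept.")
dog_cat = annotate("dog", "cat", TOY_EDGES)
hot_cold = annotate("hot", "cold", TOY_EDGES)
```

In this toy graph, "dog" and "cat" have no direct edge, so the sketch falls back to the path through "animal", a combination of two IsA relations. The "hot" and "cold" pair is annotated directly as Antonym.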

Acknowledgments
This research is funded by the National Key Research and Development Program of China (2019YFA0707201), the National Natural Science Foundation of China (72274146, 71874130 & 71921002), and the Ministry of Education of the People’s Republic of China (22JJD870004 & 18YJC870026).
Ethics declarations
Conflict of interest
The author declares no conflict of interest.
About this article
Cite this article
Zhao, Y., Yin, J., Zhang, J. et al. Identifying the driving factors of word co-occurrence: a perspective of semantic relations. Scientometrics 128, 6471–6494 (2023). https://doi.org/10.1007/s11192-023-04851-x
DOI: https://doi.org/10.1007/s11192-023-04851-x