A Survey of Chinese Text Similarity Computation

Wang, Xiuhong; Ju, Shiguang; Wu, Shengli

doi:10.1007/978-3-540-68636-1_69

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

Chinese text has no natural delimiter between words. Moreover, Chinese is a semotactic language with complicated structures that focus on semantics. These differences from Western languages make Chinese word segmentation more difficult and Chinese natural language understanding more challenging. Computing Chinese text similarity with high precision, high recall, and low cost is an important but challenging task that researchers have studied for a long time. In this paper, we examine existing Chinese text similarity measures, including measures based on statistics and measures based on semantics. Our work provides insights into the advantages and disadvantages of each method, including tradeoffs between effectiveness and efficiency. Directions for future work are also discussed.
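
As an illustration of what a statistics-based measure can look like when no word delimiter is available, the following is a minimal sketch, not a method taken from the paper. It assumes overlapping character bigrams as features (which avoids word segmentation entirely) and cosine similarity over raw counts; the function names `char_ngrams` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams, ignoring whitespace.

    Assumption: character n-grams stand in for words, so no
    segmenter is needed.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine of the angle between the n-gram count vectors of a and b."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Counter returns 0 for missing keys, so only shared n-grams contribute.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        # A text shorter than n characters has no n-grams.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("中文文本相似度计算", "中文文本相似度研究"))
```

A measure of this kind is cheap to compute but sees only surface overlap; two texts that express the same meaning with different characters score low, which is the gap that semantics-based measures aim to close.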

Author information

Authors and Affiliations

Jiangsu University, Zhenjiang, China
Xiuhong Wang & Shiguang Ju
University of Ulster, Northern Ireland, UK
Shengli Wu

Editor information

Hang Li Ting Liu Wei-Ying Ma Tetsuya Sakai Kam-Fai Wong Guodong Zhou

Cite this paper

Wang, X., Ju, S., Wu, S. (2008). A Survey of Chinese Text Similarity Computation. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_69

DOI: https://doi.org/10.1007/978-3-540-68636-1_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)
