Abstract
Chinese text has no natural delimiter between words. Moreover, Chinese is a semotactic language whose complicated structures focus on semantics. These differences from Western languages make Chinese word segmentation more difficult and Chinese natural language understanding more challenging. Computing Chinese text similarity with high precision and recall at low cost is an important but challenging task that researchers have studied for a long time. In this paper, we examine existing Chinese text similarity measures, including measures based on statistics and measures based on semantics. Our work provides insights into the advantages and disadvantages of each method, including the tradeoffs between effectiveness and efficiency. Directions for future work are also discussed.
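As a rough illustration of the statistics-based family of measures, the sketch below computes cosine similarity over character-bigram counts, which sidesteps word segmentation entirely. It is not a method taken from this paper: the choice of bigrams, the function names `char_bigrams` and `cosine_similarity`, and the sample sentences are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two bigram-count vectors."""
    vec_a, vec_b = char_bigrams(text_a), char_bigrams(text_b)
    # Counter returns 0 for missing keys, so unshared bigrams add nothing.
    dot = sum(count * vec_b[gram] for gram, count in vec_a.items())
    norm_a = sqrt(sum(c * c for c in vec_a.values()))
    norm_b = sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Texts shorter than two characters have no bigrams.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("我喜欢自然语言处理", "我喜欢语言"))
```

A purely statistical measure like this is cheap to compute but blind to synonyms and paraphrase, which is the kind of effectiveness and efficiency tradeoff the survey sets out to compare against semantics-based measures.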
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, X., Ju, S., Wu, S. (2008). A Survey of Chinese Text Similarity Computation. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_69
DOI: https://doi.org/10.1007/978-3-540-68636-1_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)