A Survey of Chinese Text Similarity Computation

  • Conference paper
Information Retrieval Technology (AIRS 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Abstract

Chinese text has no natural delimiter between words. Moreover, Chinese is a semotactic language with complicated structures that focus on semantics. These differences from Western languages make Chinese word segmentation more difficult and Chinese natural language understanding more challenging. Computing Chinese text similarity with high precision and recall at low cost is therefore an important but challenging task that researchers have studied for a long time. In this paper, we examine existing Chinese text similarity measures, including measures based on statistics and measures based on semantics. Our work provides insights into the advantages and disadvantages of each method, including the tradeoffs between effectiveness and efficiency. We also discuss new directions for future work.
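As a rough illustration of what a statistics-based measure can look like, the sketch below computes cosine similarity over overlapping character bigrams, which sidesteps word segmentation entirely. It is not taken from the paper: the choice of bigrams and the function names `char_bigrams` and `cosine_similarity` are assumptions made for this example only.

```python
import math
from collections import Counter


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams, ignoring whitespace.

    Assumption for this sketch: bigrams stand in for words, so no
    word segmentation step is needed.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two bigram count vectors."""
    va, vb = char_bigrams(a), char_bigrams(b)
    # Counter returns 0 for missing keys, so unshared bigrams add nothing.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text shorter than two characters has no bigrams
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("中文文本相似度", "中文文本分类"))
```

A measure of this kind is cheap to compute but sees only surface overlap; two texts that express the same meaning with different characters would score low, which is the sort of gap that semantics-based measures aim to close.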

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, X., Ju, S., Wu, S. (2008). A Survey of Chinese Text Similarity Computation. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_69

  • DOI: https://doi.org/10.1007/978-3-540-68636-1_69

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68633-0

  • Online ISBN: 978-3-540-68636-1

  • eBook Packages: Computer Science, Computer Science (R0)
