Hierarchically Classifying Chinese Web Documents without Dictionary Support and Segmentation Procedure

  • Conference paper
Web-Age Information Management (WAIM 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1846)

Abstract

This paper reports a system that hierarchically classifies Chinese Web documents without dictionary support or a word-segmentation procedure. In our classifier, Web documents are represented by N-grams (N≤4), which are easy to extract. A boosting machine learning approach is applied to classify Chinese Web documents that share a topic hierarchy. The open, modular system architecture makes our classifier extensible. Experimental results show that our system can classify Chinese Web documents effectively and efficiently.
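To make the N-gram representation concrete, the following is a minimal sketch of how character N-grams (N≤4) might be counted from raw Chinese text with no dictionary and no word segmentation. It is an illustration only, not the paper's extraction procedure: the function names `is_cjk` and `char_ngrams`, the restriction to the basic CJK Unified Ideographs block, and the choice to break N-grams at any non-Chinese character are all assumptions made for this example.

```python
from collections import Counter


def is_cjk(ch: str) -> bool:
    """Assumption: treat only the basic CJK Unified Ideographs block as Chinese text."""
    return "\u4e00" <= ch <= "\u9fff"


def char_ngrams(text: str, max_n: int = 4) -> Counter:
    """Count every character N-gram (1 <= N <= max_n) inside runs of Chinese characters.

    Punctuation, spaces, and Latin letters end the current run, so no
    N-gram spans a non-Chinese character. No dictionary is consulted.
    """
    counts = Counter()
    run = []
    # The trailing space is a sentinel that flushes the final run.
    for ch in text + " ":
        if is_cjk(ch):
            run.append(ch)
            continue
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                counts["".join(run[i:i + n])] += 1
        run = []
    return counts


# Hypothetical input: two runs of Chinese characters separated by Latin text.
features = char_ngrams("中文网页分类, web 文档")
```

The resulting counts can serve as a sparse feature vector for a learner such as a boosting classifier. How the original system weights or selects these features is not stated in the abstract.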

This work is supported by the 973 High-Tech Projects Foundation of China and partially supported by a grant (No. 69933010) from NSFC.

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhou, S., Fan, Y., Hu, J., Yu, F., Hu, Y. (2000). Hierarchically Classifying Chinese Web Documents without Dictionary Support and Segmentation Procedure. In: Lu, H., Zhou, A. (eds) Web-Age Information Management. WAIM 2000. Lecture Notes in Computer Science, vol 1846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45151-X_20

  • DOI: https://doi.org/10.1007/3-540-45151-X_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67627-0

  • Online ISBN: 978-3-540-45151-8

  • eBook Packages: Springer Book Archive
