Abstract
This paper reports a system that hierarchically classifies Chinese Web documents without dictionary support or a segmentation procedure. In our classifier, Web documents are represented by N-grams (N≤4), which are easy to extract. A boosting machine learning approach is applied to classify Chinese Web documents that share a topic hierarchy. The open, modular system architecture makes our classifier extensible. Experimental results show that our system can classify Chinese Web documents effectively and efficiently.
This work is supported by the 973 High-Tech Projects Foundation of China and partially supported by a grant (No. 69933010) from NSFC.
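The abstract does not say how the N-grams are extracted, so the following is only a minimal sketch of the general idea: overlapping character N-grams (N≤4) can be counted directly from raw text, with no dictionary and no word segmenter. The function name `char_ngrams`, the restriction to the CJK Unified Ideographs range, and the choice not to form N-grams across punctuation are assumptions made for illustration, not details taken from the paper.

```python
import re
from collections import Counter

# Assumption: only runs of CJK Unified Ideographs (U+4E00..U+9FFF) are used,
# so punctuation, Latin text, and whitespace act as boundaries.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def char_ngrams(text, max_n=4):
    """Count overlapping character N-grams (1 <= N <= max_n) in each CJK run."""
    counts = Counter()
    for run in CJK_RUN.findall(text):
        for n in range(1, max_n + 1):
            # Slide a window of width n over the run.
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    features = char_ngrams("中文网页")
    for gram, freq in sorted(features.items(), key=lambda kv: (len(kv[0]), kv[0])):
        print(gram, freq)
```

Counts of this kind could serve as the document features passed to a learner such as a boosting classifier; how the paper weights or selects them is not stated in the abstract.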
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhou, S., Fan, Y., Hu, J., Yu, F., Hu, Y. (2000). Hierarchically Classifying Chinese Web Documents without Dictionary Support and Segmentation Procedure. In: Lu, H., Zhou, A. (eds) Web-Age Information Management. WAIM 2000. Lecture Notes in Computer Science, vol 1846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45151-X_20
DOI: https://doi.org/10.1007/3-540-45151-X_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67627-0
Online ISBN: 978-3-540-45151-8
eBook Packages: Springer Book Archive