A Hybrid Algorithm for Web Document Clustering Based on Frequent Term Sets and k-Means

  • Conference paper
Advances in Web and Network Technologies, and Information Management (APWeb 2007, WAIM 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4537)

Abstract

To address the major challenges of current web document clustering, namely the huge volume of documents, their high dimensionality, and the understandability of the resulting clusters, we propose a simple hybrid algorithm (SHDC) based on top-k frequent term sets and k-means. The top-k frequent term sets are used to produce k initial means, which are regarded as initial clusters and are further refined by k-means. The final clustering is returned by k-means, while the k frequent term sets provide an understandable description of the clusters. Experimental results on two public datasets indicate that SHDC outperforms two other representative clustering algorithms (farthest-first k-means and randomly initialized k-means) in both efficiency and effectiveness.
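The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' SHDC implementation, whose details are not given on this page. It assumes term sets of a fixed size of two ranked by document support, an initial mean taken as the centroid of the documents that contain a term set, raw term-frequency vectors, and squared Euclidean distance. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_k_term_sets(docs, k, size=2):
    """Return the k term sets of the given size with the highest document support."""
    counts = Counter()
    for doc in docs:
        for combo in combinations(sorted(set(doc)), size):
            counts[combo] += 1
    # Rank by support, breaking ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [terms for terms, _ in ranked[:k]]


def vectorize(docs):
    """Turn each document into a raw term-frequency vector over a sorted vocabulary."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0.0] * len(vocab)
        for t in doc:
            v[index[t]] += 1.0
        vectors.append(v)
    return vectors


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, initial_means, max_iter=20):
    """Plain Lloyd iterations starting from the supplied means."""
    means = [list(m) for m in initial_means]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(means)), key=lambda j: sq_dist(v, means[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for j in range(len(means)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old mean if a cluster becomes empty
                means[j] = mean(members)
    return labels


def hybrid_cluster(docs, k):
    """Seed k-means with centroids of the documents covering each top-k term set."""
    term_sets = top_k_term_sets(docs, k)
    vectors = vectorize(docs)
    seeds = [
        mean([v for v, d in zip(vectors, docs) if set(ts) <= set(d)])
        for ts in term_sets
    ]
    return term_sets, kmeans(vectors, seeds)


docs = [
    ["apple", "fruit", "juice"],
    ["apple", "fruit", "pie"],
    ["apple", "fruit", "tree"],
    ["python", "code", "bug"],
    ["python", "code", "test"],
]
term_sets, labels = hybrid_cluster(docs, 2)
print(term_sets)  # [('apple', 'fruit'), ('code', 'python')]
print(labels)     # [0, 0, 0, 1, 1]
```

In this toy run each term set doubles as a readable label for the cluster it seeded. The sketch does nothing about overlapping term sets, which a practical method would presumably need to handle when the most frequent sets share terms.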

This project is sponsored by the National 863 High Technology Development Foundation (No. 2004AA112020) and the National Grand Fundamental Research 973 Program of China (No. 2005CB321804).

Editor information

Kevin Chen-Chuan Chang, Wei Wang, Lei Chen, Clarence A. Ellis, Ching-Hsien Hsu, Ah Chung Tsoi, Haixun Wang

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, L., Tian, L., Jia, Y., Han, W. (2007). A Hybrid Algorithm for Web Document Clustering Based on Frequent Term Sets and k-Means. In: Chang, K.CC., et al. Advances in Web and Network Technologies, and Information Management. APWeb 2007, WAIM 2007. Lecture Notes in Computer Science, vol 4537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72909-9_20

  • DOI: https://doi.org/10.1007/978-3-540-72909-9_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72908-2

  • Online ISBN: 978-3-540-72909-9

  • eBook Packages: Computer Science, Computer Science (R0)
