Abstract
This paper introduces a new approach to summarizing clusters by finding dense regions and representing each cluster as a Gaussian Mixture Model (GMM). The GMM summarization allows us to summarize a cluster efficiently and then regenerate the original data with high accuracy. Unlike the classical representation of a cluster by a center and a radius, the proposed approach preserves information about the shape of each cluster as well as the distribution of the samples within it. Because a GMM is a parametric model whose main parameter is the number of Gaussian components, we propose a method to determine this number automatically. The approach can summarize clusters generated by any clustering algorithm. Moreover, when a new sample is presented to the GMMs of the clusters, a membership value is calculated for each cluster, and the sample is assigned to the closest cluster according to these values. Employing GMMs to summarize clusters offers several advantages with regard to accuracy, detection rate, memory efficiency, and time complexity. We evaluate the proposed method on a variety of datasets, both synthetic datasets and real datasets from the UCI repository. We examine the quality of the summarized clusters generated by the proposed method in terms of the DUNN, DB, SD, and SSD indexes, and compare them with those of the well-known ABACUS method. We also employ the proposed algorithm in anomaly detection applications, study its performance in terms of false alarm and detection rates, and compare it with Negative Selection, Naïve models, and ABACUS. Furthermore, we compare the memory usage and processing time of the proposed algorithm with those of the other algorithms. The results show that our algorithm outperforms other well-known anomaly detection algorithms in terms of accuracy, detection rate, memory usage, and processing time.
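The membership and regeneration steps described in the abstract can be illustrated with a minimal sketch. It assumes that each cluster has already been summarized as a one-dimensional GMM, that the membership value is the mixture density at the new sample, and that regeneration is done by sampling from the mixture. The function names (`gmm_density`, `assign`, `regenerate`) and all parameter values are hypothetical and chosen for illustration only; the fitting step and the automatic choice of the number of components are not shown, and this is not the authors' implementation.

```python
import math
import random

# A cluster summary is a list of (weight, mean, variance) Gaussian components.
# One-dimensional for brevity; every number below is made up for illustration.

def gmm_density(x, components):
    """Mixture density at x, used here (as an assumption) as the membership value."""
    total = 0.0
    for weight, mean, var in components:
        norm = 1.0 / math.sqrt(2.0 * math.pi * var)
        total += weight * norm * math.exp(-((x - mean) ** 2) / (2.0 * var))
    return total

def assign(x, summaries):
    """Return the cluster with the highest membership, plus all memberships."""
    memberships = {name: gmm_density(x, comps) for name, comps in summaries.items()}
    best = max(memberships, key=memberships.get)
    return best, memberships

def regenerate(components, n, rng):
    """Draw n samples from a summary to approximate the original cluster."""
    weights = [w for w, _, _ in components]
    samples = []
    for _ in range(n):
        # Pick a component in proportion to its weight, then sample from it.
        _, mean, var = rng.choices(components, weights=weights, k=1)[0]
        samples.append(rng.gauss(mean, math.sqrt(var)))
    return samples

# Two hypothetical cluster summaries: "A" has two components, "B" has one.
summaries = {
    "A": [(0.6, 0.0, 1.0), (0.4, 3.0, 0.5)],
    "B": [(1.0, 10.0, 2.0)],
}

label, memberships = assign(2.5, summaries)
regenerated = regenerate(summaries["B"], 1000, random.Random(0))
```

In this sketch only the component parameters need to be stored for each cluster, which is what makes the summary compact; a sample whose membership is low for every cluster could be flagged as a candidate anomaly, although the threshold for doing so is not specified here.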
References
Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD
Wang, W., Yang, J., Muntz, R.R.: Sting: a statistical information grid approach to spatial data mining. San Francisco (1997)
Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques: The Morgan Kaufmann Series in Data Management Systems, 3rd edn. Morgan Kaufmann Publishers, Burlington (2006)
MacQueen, J.B.: Some methods for classification and analysis of multivariate observations (1967)
Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River (1988)
Kaufman, L., Rousseeuw, P.J.: Clustering by means of medoids. In: Dodge, Y. (ed.) Statistical Data Analysis Based on the L1-Norm and Related Methods. North-Holland (1987)
Karypis, G., Han, H.E., Kumar, V.: CHAMELEON: a hierarchical clustering algorithm using dynamic modeling. IEEE Comput. 32(8), 68–75 (1999)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, New York (1990)
Agrawal, J., Gunopulos, D., Raghavan, P.: Automatic sub-space clustering of high dimensional data for data mining applications (1998)
Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multimedia databases with noise (1998)
Guha, S., Meyerson, A., Mishra, N., Motwani, R.: Clustering data streams: theory and practice. IEEE Trans. Knowl. Data Eng. 15(3), 505–528 (2003)
Bifet, A., Holmes, G., Pfahringer, B.: New ensemble methods for evolving data streams (2009)
Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams (2003)
Yang, D., Elke, A., Matthew, O.W.: Summarization and matching of density-based clusters in streaming environments. Proc. VLDB Endowment 5(2), 121–132 (2011)
Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: SIAM Conference on Data Mining (2006)
Chaoji, V., Li, W., Yildirim, H., Zaki, M.: ABACUS: mining arbitrary shaped clusters from large datasets based on backbone identification. In: SIAM/Omnipress (2011)
He, Z., Xu, X., Deng, S.: Discovering cluster-based local outliers. Pattern Recogn. Lett. 24, 1641–1650 (2003)
Gaddam, S., Phoha, V., Balagani, K.: K-means+ID3: a novel method for supervised anomaly detection by cascading k-means clustering and ID3 decision tree learning methods. IEEE Trans. Knowl. Data Eng. 19(3), 345–354 (2007)
Mohammadi, M., Akbari, A., Raahemi, B., Nasersharif, B., Asgharian, H.: A fast anomaly detection system using probabilistic artificial immune algorithm capable of learning new attacks. Evol. Intel. 6(3), 135–156 (2014)
Kersting, K., Wahabzada, M., Thurau, C., Bauckhage, C.: Hierarchical convex NMF for clustering massive data (2010)
Hershberger, J., Shrivastava, N., Suri, S.: Summarizing spatial data streams using ClusterHulls. J. Exp. Algorithmics (JEA) 13 (2009). doi:10.1145/1412228.1412238
Mohammadi, M., Akbari, A., Raahemi, B., Nasersharif, B., Asgharian, H.: A fast anomaly detection system using probabilistic artificial immune algorithm capable of learning new attacks. Evol. Intel. 6(5), 135–156 (2014)
Dunn, J.C.: Well separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95–104 (1974)
Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1(2), 224–227 (1979)
Halkidi, M., Vazirgiannis, M., Batistakis, Y.: Quality scheme assessment in the clustering process. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 265–276. Springer, Heidelberg (2000)
Sande, P.C., Monroe, J.G.: Negative selection of immature b cells by receptor editing or deletion is determined by site of antigen encounter. Immunity 10(3), 289–299 (1999)
Acknowledgement
This research was supported by NSERC Canada, Grant No. RGPIN/341811-2012.
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Bigdeli, E., Mohammadi, M., Raahemi, B., Matwin, S. (2015). Cluster Summarization with Dense Region Detection. In: Fred, A., Dietz, J., Aveiro, D., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2014. Communications in Computer and Information Science, vol 553. Springer, Cham. https://doi.org/10.1007/978-3-319-25840-9_5
DOI: https://doi.org/10.1007/978-3-319-25840-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25839-3
Online ISBN: 978-3-319-25840-9
eBook Packages: Computer Science, Computer Science (R0)