Abstract
While density based clustering algorithms are able to detect clusters of arbitrary shapes, their clustering results usually rely heavily on user-specified parameters. To address this problem, in this paper we propose to combine dominant sets and density based clustering to obtain reliable clustering results. We first use the dominant sets algorithm together with histogram equalization to generate initial clusters, which are usually subsets of the real clusters. In the second step the initial clusters are extended with density based clustering, where the required parameters are determined from the initial clusters. By merging the merits of both algorithms, our approach is able to generate clusters of arbitrary shapes without user-specified parameters as input. In experiments our algorithm performs better than or comparably to several state-of-the-art algorithms that benefit from careful parameter tuning.
1 Introduction
As an important unsupervised learning approach, clustering is widely applied in various domains, including data mining, pattern recognition and image processing. A vast number of clustering algorithms have been proposed in the literature and many of them have found successful applications. In addition to the commonly used k-means approach, density based and spectral clustering algorithms have received much attention in recent decades. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [4] is a typical density based clustering algorithm. Given a neighborhood radius and the minimum number of data in the neighborhood, DBSCAN determines a density threshold and admits into clusters only the data satisfying the density constraint. By extracting the clusters in a sequential manner, DBSCAN is able to determine the number of clusters automatically. In [10] the authors presented a density peak (DP) based clustering algorithm, which uses the local density and the distance to the nearest neighbor with higher density to isolate and identify the cluster centers. The non-center data are then assigned the same labels as their nearest neighbors with higher density. Provided that the cluster centers are identified correctly, this algorithm is reported to generate excellent clustering results in [10]. Spectral clustering, e.g., the normalized cuts (NCuts) algorithm [11], first performs dimensionality reduction by means of the eigenvectors of the pairwise data similarity matrix, and then carries out the clustering in the lower-dimensional space. Like k-means, spectral clustering algorithms require the number of clusters to be specified by the user. Also taking the pairwise data similarity matrix as input, the affinity propagation (AP) [1] and dominant sets (DSets) [9] algorithms accomplish the clustering process with strategies different from spectral clustering. Given the preference value of each data point as a potential cluster center, the affinity propagation algorithm iteratively passes affinity (similarity) messages among the data and gradually identifies the cluster centers and members. With the pairwise similarity matrix as the sole input, the dominant sets algorithm defines a non-parametric concept of a cluster and extracts the clusters sequentially.
One problem with many of the above-mentioned clustering algorithms is that they can only detect spherical clusters. The algorithms afflicted by this problem include k-means, spectral clustering, AP and DSets. While the density based algorithms, e.g., DBSCAN and DP, are able to generate clusters of arbitrary shapes, they involve user-specified parameters and their clustering results rely heavily on these parameters. Unfortunately, the appropriate parameters of these algorithms are usually data dependent and not easy to determine. DBSCAN uses a neighborhood radius Eps and the minimum number MinPts of data in the neighborhood to define a density threshold, which is then used to determine whether a data point can be included in a cluster. Evidently, an inappropriate selection of these two parameters may result in overly large or overly small clusters and degrade the clustering results. The DP algorithm does not involve parameter input in theory. However, the calculation of density involves a cut-off distance, which is usually determined empirically. Furthermore, the automatic identification of cluster centers seems difficult, and additional parameters or even human assistance are often necessary.
In order to obtain clusters of arbitrary shapes reliably, in this paper we present a two-step approach which combines the merits of the DSets algorithm and density based clustering algorithms. Our work is based on the observation that the two kinds of algorithms have complementary properties. The DSets algorithm uses only the pairwise data similarity matrix as input and does not involve any other parameters explicitly. However, it can only generate spherical clusters. In contrast, density based algorithms are able to generate clusters of arbitrary shapes, on condition that appropriate density-related parameters are determined beforehand. Motivated by this observation, we propose to generate initial clusters with the DSets algorithm in the first step, and then perform density based clustering where the required parameters are determined from the initial clusters. Following the DSets algorithm, our algorithm extracts the clusters in a sequential manner and determines the number of clusters automatically. Experiments on data clustering and image segmentation validate the effectiveness of our approach.
The remainder of the paper is organized as follows. In Sect. 2 we introduce the DSets algorithm and two typical density based clustering algorithms, based on which we present our two-step algorithm in detail in Sect. 3. Extensive experiments are conducted to validate the proposed approach in Sect. 4, and finally the conclusions are given in Sect. 5.
2 Related Work
In this part we first introduce the DSets algorithm and discuss its properties. Then two density based clustering algorithms, i.e., DBSCAN and DP, are introduced briefly. This content serves as the basis for deriving our two-step clustering algorithm in Sect. 3.
2.1 Dominant Set
A dominant set is a graph-theoretic concept of a cluster [9]. In other words, a dominant set is regarded as a cluster in the DSets algorithm. In DSets clustering, the dominant sets (clusters) are extracted in a sequential manner. Given the pairwise data similarity matrix, we extract the first cluster from the whole set of data. After removing the data in the first cluster, we continue to extract the second cluster from the remaining data. In this way we accomplish the clustering process and obtain the number of clusters automatically. As a promising clustering approach, the DSets algorithm has been applied successfully in various tasks [2, 8, 12, 15].
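To make the sequential extraction concrete, the sketch below peels off dominant sets one by one; the single-set extraction uses replicator dynamics, the standard optimization scheme of the DSets algorithm [9]. The function names and the convergence and cutoff thresholds are illustrative choices, not part of the original formulation.

```python
import numpy as np

def extract_dominant_set(A, max_iter=1000, tol=1e-6, cutoff=1e-5):
    """Find one dominant set of the similarity matrix A (symmetric, zero
    diagonal) with replicator dynamics and return the member indices."""
    n = A.shape[0]
    if n == 1:
        return np.array([0])
    x = np.full(n, 1.0 / n)                # start from the barycenter
    for _ in range(max_iter):
        x_new = x * (A @ x)
        s = x_new.sum()
        if s == 0:                         # degenerate case: no similarity at all
            break
        x_new /= s
        if np.abs(x_new - x).sum() < tol:
            x = x_new
            break
        x = x_new
    members = np.where(x > cutoff)[0]      # members carry non-negligible weight
    return members if members.size > 0 else np.arange(n)

def dsets_clustering(A):
    """Peel off dominant sets one by one until every data point is assigned."""
    remaining = np.arange(A.shape[0])
    clusters = []
    while remaining.size > 0:
        sub = A[np.ix_(remaining, remaining)]
        members = extract_dominant_set(sub)
        clusters.append(remaining[members])
        remaining = np.delete(remaining, members)
    return clusters
```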
Many clustering algorithms rely on a user-specified number of clusters to partition the data and obtain the clusters as a by-product of the partitioning process. While DBSCAN extracts clusters sequentially in a region growing fashion, it also needs density-related parameters to determine the cluster borders. Different from these parameter-dependent clustering algorithms, the DSets algorithm defines the dominant set as a non-parametric concept of a cluster. The basic idea is to maximize the internal similarity of a cluster by admitting into the cluster only the data that help to increase the internal similarity. For this purpose, we need a measure to evaluate the relationship between a data point and a cluster. Let S be the set of data to be clustered, \(A=(a_{ij})\) the pairwise data similarity matrix, D a non-empty subset of S, and \(i \in D\) and \(j \notin D\) two data points in S. We first define the relationship between j and i by
\[
\phi_D(i,j) = a_{ij} - \frac{1}{|D|}\sum_{k \in D} a_{ik}, \qquad \mathrm{(1)}
\]
where |D| denotes the number of data in D. This quantity compares the similarity between j and i with the average similarity between i and the data in D, thereby providing a connection between the data point j outside D and those inside D. We then define
\[
w_D(i) =
\begin{cases}
1, & \text{if } |D|=1,\\[2pt]
\sum_{j \in D \setminus \{i\}} \phi_{D \setminus \{i\}}(j,i)\, w_{D \setminus \{i\}}(j), & \text{otherwise.}
\end{cases} \qquad \mathrm{(2)}
\]
This key quantity is defined recursively and its meaning is not immediately evident. However, Eq. (2) shows that \(w_D(i)\) can be approximately regarded as a weighted sum of \(\phi_{D \setminus \{i\}}(j,i)\) over all \(j \in D \setminus \{i\}\). Based on Eq. (1) we further see that \(w_D(i)\) is approximately equal to the average similarity between i and the data in \(D \setminus \{i\}\), minus the average pairwise similarity within \(D \setminus \{i\}\), i.e.,
\[
w_D(i) \approx \frac{1}{|D|-1}\sum_{j \in D \setminus \{i\}} a_{ij} \;-\; \overline{a}_{D \setminus \{i\}}, \qquad \mathrm{(3)}
\]
where \(\overline{a}_{D \setminus \{i\}}\) denotes the average pairwise similarity within \(D \setminus \{i\}\).
As the average pairwise similarity is a suitable measure of the internal similarity of \(D \setminus \{i\}\), we see that \(w_D(i)>0\) means that D has a higher internal similarity than \(D \setminus \{i\}\), i.e., i helps to increase the internal similarity of \(D \setminus \{i\}\). In contrast, \(w_D(i)<0\) implies that admitting i into \(D \setminus \{i\}\) will reduce the internal similarity of \(D \setminus \{i\}\).
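To make the two definitions above concrete, the following is a literal, unoptimized transcription into code; it is purely illustrative (the recursion is exponential in |D| and far too slow for real use, where replicator dynamics is used instead).

```python
import numpy as np

def phi(A, D, i, j):
    """phi_D(i, j) of Eq. (1): similarity of j to i, relative to the average
    similarity between i and the members of D (i is in D, j is outside D)."""
    D = list(D)
    return A[i, j] - A[i, D].mean()

def w(A, D, i):
    """w_D(i) of Eq. (2), transcribed literally to make the recursion explicit."""
    D = set(D)
    if len(D) == 1:
        return 1.0
    R = D - {i}
    return sum(phi(A, R, j, i) * w(A, R, j) for j in R)
```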
Now we are ready to present the formal definition of a dominant set. With \(W(D)=\sum _{i \in D}w_D(i)\), we call D a dominant set if
1. \(W(T) > 0\), for all non-empty \(T \subseteq D\);
2. \(w_D(i)>0\), for all \(i \in D\);
3. \(w_{D \cup \{i\}}(i)<0\), for all \(i \notin D\).
From this definition we see that only the data with positive \(w_D(i)\) can be admitted into the dominant set D. In other words, a dominant set accepts only the data that help to increase its internal similarity, and dominant set extraction can be viewed as an internal similarity maximization process.
The dominant set definition requires that each data point in a dominant set is able to increase the internal similarity. This condition is very strict, as it means that each data point in a dominant set must be very similar to all the others in the dominant set, including the closest ones and the farthest ones. As a result, the DSets algorithm tends to generate only clusters of spherical shapes.
2.2 Density Based Clustering Algorithms
DBSCAN is one of the most popular density based clustering approaches. With the user-specified neighborhood radius Eps and the minimum number MinPts of data in the neighborhood, DBSCAN defines the minimum density acceptable in a cluster. The data with density larger than this threshold are called core points. Starting from an arbitrary core point, DBSCAN admits all the data in its Eps-neighborhood into the cluster. Repeating this process for all the core points in the cluster yields the first cluster. In the same way we can obtain the remaining clusters, and the data not included in any cluster are regarded as noise. One distinct advantage of DBSCAN is that it does not require the number of clusters to be specified and is able to generate clusters of arbitrary shapes.
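For reference, a minimal usage sketch with scikit-learn's DBSCAN implementation is given below; the toy data and the particular values of Eps and MinPts are placeholders, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(300, 2)            # toy 2-D data, stands in for a real dataset

# eps and min_samples are the two user-specified parameters discussed above.
labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X)

# Points labeled -1 did not satisfy the density constraint and are treated as noise.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", np.sum(labels == -1), "noise points")
```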
The DP algorithm identifies cluster centers in the first step and then determines the labels of the other data based on the cluster centers. The algorithm uses the local density \(\rho \) and the distance \(\delta \) to the nearest neighbor with higher density to characterize the data. Since cluster centers usually have both high \(\rho \) and high \(\delta \), they are isolated from the other data in the \(\rho \)-\(\delta \) decision graph and can be identified relatively easily. After the cluster centers are identified, the other data are assigned the labels of their nearest neighbors with higher density. On condition that the cluster centers are identified correctly, this algorithm can generate clusters of arbitrary shapes. However, the correct identification of cluster centers usually involves an appropriate selection of the density parameters, and human assistance may be necessary.
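The \(\rho \)-\(\delta \) decision graph can be computed as in the following plain-NumPy sketch with a cutoff kernel; the cut-off distance dc is exactly the empirically chosen parameter mentioned above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def decision_graph(X, dc):
    """Local density rho (cutoff kernel) and distance delta to the nearest
    neighbor of higher density, as used by the DP algorithm."""
    d = squareform(pdist(X))
    rho = (d < dc).sum(axis=1) - 1          # exclude the point itself
    n = X.shape[0]
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta

rho, delta = decision_graph(np.random.rand(200, 2), dc=0.1)
gamma = rho * delta          # candidate centers: the points with the largest gamma
```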
3 Our Approach
In the last section we have seen that the DSets algorithm and density based clustering algorithms have complementary properties. The DSets algorithm uses only the pairwise similarity matrix as input and does not involve user-specified parameters explicitly. However, it generates only clusters of spherical shapes. In contrast, density based clustering algorithms are able to detect clusters of arbitrary shapes, on condition that appropriate density parameters are specified by the user. This observation motivates us to combine both algorithms to make use of their respective merits. Specifically, we first use the DSets algorithm to generate an initial cluster, and then use a density based algorithm to obtain the final cluster, where the parameters needed by density based clustering are determined from the initial cluster. In this way our algorithm is able to generate clusters of arbitrary shapes and the involved parameters can be determined appropriately. Similar to the DSets algorithm, our algorithm extracts the clusters in a sequential manner.
3.1 DSets-histeq
We have seen that the DSets algorithm uses only the pairwise data similarity matrix as input and involves no user-specified parameters. However, in many cases the data are represented as points in a vector space, which means that we need to construct the pairwise similarity matrix from the data represented as vectors. One common measure of the similarity between two data points (vectors) x and y is of the form \(s(x,y)=\exp(-d(x,y)/\sigma )\), where d(x, y) is the distance and \(\sigma \) is a regulating parameter. Given a set of data to be clustered, different \(\sigma \)'s result in different similarity matrices, which then lead to different DSets clustering results, as illustrated in Fig. 1.
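A small sketch of this similarity matrix construction is shown below; the zero diagonal follows the usual DSets convention, and tying \(\sigma \) to the mean pairwise distance \(\overline{d}\) is only one common choice.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def similarity_matrix(X, sigma):
    """Pairwise similarities s(x, y) = exp(-d(x, y) / sigma) with zero diagonal,
    as expected by the DSets algorithm."""
    d = squareform(pdist(X))                 # Euclidean distances
    A = np.exp(-d / sigma)
    np.fill_diagonal(A, 0.0)                 # no self-similarity
    return A

# A common choice ties sigma to the mean pairwise distance, e.g. sigma = 2 * d_bar.
X = np.random.rand(200, 2)
d_bar = pdist(X).mean()
A = similarity_matrix(X, sigma=2 * d_bar)
```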
In this paper we intend to use the DSets algorithm to generate initial clusters, based on which the parameters required by density based clustering can be determined. To serve this purpose, we expect an initial cluster to be a not too small subset of a real cluster, for the following reason. If the initial cluster is too small, it does not contain enough data to estimate the density parameters reliably. In contrast, if the initial cluster is too large, it may be larger than the real cluster and also contain data from other clusters, in which case the estimated density parameters are not accurate. Only when the initial cluster is a not too small subset of the real cluster can the density information captured in the initial cluster be viewed as an approximation of that in the real cluster, and therefore be used to estimate the density parameters. However, Fig. 1 shows that the sizes of the initial clusters vary with \(\sigma \), and large \(\sigma \)'s tend to generate large clusters. If we use a very small \(\sigma \), the generated clusters may be too small to support reliable density parameter estimation. On the contrary, a too large \(\sigma \) may make the generated initial clusters larger than the real clusters. With a medium \(\sigma \), the initial clusters may be smaller than the real ones for some datasets and larger than the real ones for others. In general, it is difficult to find a suitable \(\sigma \) that generates appropriate initial clusters.
In order to solve this problem, we propose to apply a histogram equalization transformation to the similarity matrices before clustering. This transformation is adopted for two reasons. First, by applying histogram equalization to the similarity matrix, the influence of \(\sigma \) on the DSets clustering results can be removed. From \(s(x,y)=\exp(-d(x,y)/\sigma )\) we see that different \(\sigma \)'s only change the absolute similarity values. The relative magnitude of the similarity values is determined solely by the data distances and is invariant to \(\sigma \). In other words, the ordering of the similarity values in the similarity matrix is invariant to \(\sigma \). Since in histogram equalization the new similarity values are determined only by the ordering of the original similarity values, it follows that the transformed similarity matrices are invariant to \(\sigma \). Consequently, the resulting clustering results are no longer influenced by \(\sigma \). In practice, there will be very slight differences caused by the quantization process in histogram equalization. This is illustrated in Fig. 2, where the clustering results with different \(\sigma \)'s are very similar to each other. Second, histogram equalization increases the overall similarity contrast in the similarity matrix. As the dominant set by definition imposes a strict requirement on the internal similarity, this transformation tends to reduce the size of the clusters. On the other hand, as the similarity values are distributed relatively evenly in the range [0, 1] after histogram equalization, the obtained clusters will not be too small, as illustrated in Fig. 2. This means that the initial clusters obtained in this way can be used to estimate the density parameters reliably.
Fig. 2. The DSets clustering results on the Aggregation [6] and Flame [5] datasets with different \(\sigma \), where \(\overline{d}\) is the average of all pairwise distances and the similarity matrices are transformed by histogram equalization. The top and bottom rows correspond to Aggregation and Flame, respectively.
Based on the above observations, we propose to use the DSets algorithm to extract initial clusters, where the similarity matrices are transformed by histogram equalization before clustering. For ease of expression, in the following we use DSets-histeq to denote this algorithm. With DSets-histeq we can use any \(\sigma \) to build the similarity matrix, and the histogram equalization transformation then enables the DSets algorithm to generate not too small initial clusters. The data in these initial clusters are used to estimate the density in the real cluster, and further to estimate the density parameters required by density based clustering algorithms.
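A rank-based sketch of this transformation is given below; it assumes that the equalized values are spread uniformly over (0, 1] according to the ordering of the original off-diagonal similarities, which captures the idea above, though the exact quantization used in the paper may differ slightly.

```python
import numpy as np
from scipy.stats import rankdata

def equalize_similarity(A):
    """Histogram-equalize the off-diagonal similarity values so that they are
    spread roughly uniformly over (0, 1]; only the ordering of the original
    values matters, which removes the influence of sigma."""
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)             # each pair once, A is symmetric
    ranks = rankdata(A[iu], method='average')
    new_vals = ranks / ranks.size            # uniform spread over (0, 1]
    B = np.zeros_like(A)
    B[iu] = new_vals
    B = B + B.T                              # restore symmetry, keep zero diagonal
    return B
```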
In theory, we could use non-parametric similarity measures, e.g., cosine similarity and histogram intersection, to evaluate the similarity among data and build the similarity matrix. However, we have found that with these two measures the DSets algorithm usually generates large clusters which contain data of more than one real cluster, as illustrated in Fig. 3. For this reason, we stick to DSets-histeq to generate the initial clusters.
3.2 Initial Cluster Extension
As our algorithm extracts clusters in a sequential manner and all the clusters are extracted with the same procedure, in this part we describe how one cluster is obtained. With DSets-histeq we are able to extract a not too small initial cluster. This initial cluster is usually a subset of the real cluster and contains important density information which can be used to estimate the density parameters required by density based algorithms. On the other hand, since the initial cluster is a subset of the real cluster, the density based clustering can be regarded as a density based cluster extension process. In the following we show that by making full use of the initial cluster, the cluster extension can be accomplished with very simple methods.
As stated in Sect. 2, in extracting a dominant set we intend to maximize the internal similarity of the dominant set, and only the data helpful to increase the internal similarity can be included. As a consequence of this strict requirement on the internal similarity, the obtained dominant set (initial cluster) is usually the densest part of the real cluster. This observation motivates us to extend the initial cluster with the following two steps.
First, Eq. (3) indicates that in order to be admitted into a dominant set, a data point must be very similar to all the data in the dominant set, including the nearest ones and the farthest ones. In other words, the density of a data point is evaluated by taking all the data in the dominant set into account. This condition is a little too strict and restricts the DSets algorithm to clusters of spherical shapes. Therefore we choose to relax the condition and evaluate the density of a data point with only its nearest neighbors. Specifically, we first use the smallest density in the initial cluster as the density threshold, which denotes the minimum density acceptable in the initial cluster. Then the data neighboring the initial cluster are included if their density is above the threshold.
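The sketch below illustrates this first step under the assumption that the density of a data point is measured from its k nearest neighbors (the inverse of their mean distance); both the density estimate and the notion of "neighboring the cluster" are our own reading of the description and only one plausible implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def knn_density(X, k=5):
    """Local density of each point as the inverse of the mean distance to its
    k nearest neighbors (an assumed, not prescribed, density estimate)."""
    d = squareform(pdist(X))
    np.fill_diagonal(d, np.inf)
    knn_d = np.sort(d, axis=1)[:, :k]
    return 1.0 / knn_d.mean(axis=1)

def extend_by_threshold(X, initial_cluster, k=5):
    """Step 1: admit points outside the initial cluster whose density is not
    below the smallest density found inside the initial cluster."""
    rho = knn_density(X, k)
    threshold = rho[list(initial_cluster)].min()   # minimum density inside
    d = squareform(pdist(X))
    np.fill_diagonal(d, np.inf)
    cluster = set(initial_cluster)
    changed = True
    while changed:
        changed = False
        for i in range(X.shape[0]):
            if i in cluster or rho[i] < threshold:
                continue
            # a point neighbors the cluster if one of its k nearest neighbors is inside
            if any(j in cluster for j in np.argsort(d[i])[:k]):
                cluster.add(i)
                changed = True
    return cluster, rho
```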
In the first step we have included the outside data with density above the threshold. As mentioned above, the initial cluster is usually the densest part of the real cluster, and the density threshold is determined within the initial cluster. This means that some data with density smaller than the threshold may also belong to the real cluster. Simply reducing the density threshold is not a good option, as we do not know to what extent it should be reduced. Therefore we need a different approach. In the DP algorithm, after the cluster centers are identified, the non-center data are assigned the labels of their nearest neighbors with higher density. Motivated by this, we propose the following procedure to extend the initial cluster further. For each data point i in the current cluster, we find its nearest neighbor \(i_{nn}\). If \(i_{nn}\) is outside the cluster and its density is smaller than that of i, we include \(i_{nn}\) into the cluster. Repeating this process until no more data can be included, we accomplish the cluster extension and obtain the final cluster.
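The second step can then be sketched as follows, reusing the density estimate from the first step; again this is only an illustration of the stated rule, not the authors' exact implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def extend_by_nearest_neighbor(X, cluster, rho):
    """Step 2: repeatedly include the nearest neighbor of a cluster member
    when that neighbor lies outside the cluster and has a lower density."""
    d = squareform(pdist(X))
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                  # nearest neighbor of every point
    cluster = set(cluster)
    changed = True
    while changed:
        changed = False
        for i in list(cluster):
            j = nn[i]
            if j not in cluster and rho[j] < rho[i]:
                cluster.add(j)
                changed = True
    return cluster
```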
4 Experiments
4.1 Histogram Equalization
In our algorithm we use the histogram equalization transformation of the similarity matrices to remove the dependence on \(\sigma \) and to generate not too small initial clusters for later extension. In this part we use experiments to validate the effectiveness in these two aspects.
The data clustering experiments are conducted on eight datasets, including Aggregation, Compound [14], R15 [13], Jain [7], Flame and three UCI datasets, namely Thyroid, Iris and Wdbc. We use different \(\sigma \)'s in our algorithm and report the clustering results (F-measure and Jaccard index) in Fig. 4, where it is evident that the influence of \(\sigma \) on the clustering results has been removed effectively.
In our algorithm we use the histogram equalization transformation to generate not too small initial clusters, which are used in the later cluster extension. As small \(\sigma \)'s can also be used to generate small initial clusters, it is interesting to compare these two methods. Specifically, we conduct two sets of experiments with our algorithm, where the only difference is that in one set the similarity matrices are transformed by histogram equalization, and in the other set they are not. The comparison of the average clustering results on the eight datasets is reported in Fig. 5. From the comparison we observe that the histogram equalization transformation performs better than any fixed \(\sigma \). This confirms that it is difficult to find a fixed \(\sigma \) that generates not too small initial clusters for different datasets. In this case, the histogram equalization transformation of the similarity matrices is the better option.
4.2 Data Clustering
In this part we use data clustering experiments to compare our algorithm with some state-of-the-art algorithms, including the original DSets algorithm, k-means, DBSCAN, NCuts, AP, SPRG [16], DP-cutoff (the DP algorithm with the cutoff kernel) and DP-Gaussian (the DP algorithm with the Gaussian kernel). The experimental setups of these algorithms are as follows.
1. With DSets, we manually select the best-performing \(\sigma \) from 0.5\(\overline{d}\), \(\overline{d}\), 2\(\overline{d}\), 5\(\overline{d}\), 10\(\overline{d}\), 20\(\overline{d}\), \(\cdots \), 100\(\overline{d}\), for each dataset separately. This means that the reported results are approximately the best possible results obtainable from the DSets algorithm.
2. As k-means, NCuts and SPRG require the number of clusters as input, we feed them the ground truth numbers of clusters. Considering that their clustering results are influenced by random initial cluster centers, we report the average results of five runs.
3. The AP algorithm requires the preference value of each data point as input, and its authors [1] have published code to calculate the range [\(p_{min},p_{max}\)] of this parameter. In the experiments we manually select the best-performing p from \(p_{min}+step\), \(p_{min}+2step\), \(\cdots \), \(p_{min}+9step\), \(p_{min}+9.1step\), \(p_{min}+9.2step\), \(\cdots \), \(p_{min}+9.9step\), with \(step=(p_{max}-p_{min})/10\), for each dataset separately. As in the case of DSets, the reported results represent approximately the best possible ones from the AP algorithm.
4. With DBSCAN, we manually select the best-performing MinPts from 2, 3, \(\cdots \), 10, for each dataset separately, and Eps is determined from MinPts with the method presented in [3]. The reported results represent approximately the best possible ones from DBSCAN.
5. The original DP algorithm involves manual selection of the cluster centers. In order to avoid the influence of human factors, with DP-cutoff and DP-Gaussian we feed in the ground truth numbers of clusters and select the cluster centers based on \(\gamma =\rho \delta \), where \(\rho \) denotes the local density and \(\delta \) the distance to the nearest neighbor with higher density.
The experimental results on the eight datasets are reported in Tables 1 and 2, where we observe that our algorithm performs best or near-best on four of the eight datasets and produces the best average result. Noticing that the eight algorithms used for comparison are reported with approximately their best possible results or benefit from the ground truth numbers of clusters, we believe these results validate the effectiveness of our algorithm.
5 Conclusions
In this paper we present a new clustering algorithm based on the dominant sets and density based clustering algorithms. By means of a histogram equalization transformation of the similarity matrices, the dominant sets algorithm is used to generate not too small initial clusters. In the next step we propose two methods which make use of the density information in the initial clusters to extend them into the final ones. By merging the merits of the dominant sets algorithm and density based clustering, our algorithm requires no parameter input and is able to generate clusters of arbitrary shapes. In data clustering experiments our algorithm performs comparably to or better than other algorithms that benefit from parameter tuning.
References
Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972–976 (2007)
Bulo, S.R., Torsello, A., Pelillo, M.: A game-theoretic approach to partial clique enumeration. Image Vis. Comput. 27(7), 911–922 (2009)
Daszykowski, M., Walczak, B., Massart, D.L.: Looking for natural patterns in data: Part 1. Density-based approach. Chemom. Intell. Lab. Syst. 56(2), 83–92 (2001)
Ester, M., Kriegel, H.P., Sander, J., Xu, X.W.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: International Conference on Knowledge Discovery and Data Mining, pp. 226–231 (1996)
Fu, L., Medico, E.: Flame, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinform. 8(1), 1–17 (2007)
Gionis, A., Mannila, H., Tsaparas, P.: Clustering aggregation. ACM Trans. Knowl. Discov. Data 1(1), 1–30 (2007)
Jain, A.K., Law, M.H.C.: Data clustering: a user’s dilemma. In: Pal, S.K., Bandyopadhyay, S., Biswas, S. (eds.) PReMI 2005. LNCS, vol. 3776, pp. 1–10. Springer, Heidelberg (2005)
Pavan, M., Pelillo, M.: Efficient out-of-sample extension of dominant-set clusters. In: Advances in Neural Information Processing Systems, pp. 1057–1064 (2005)
Pavan, M., Pelillo, M.: Dominant sets and pairwise clustering. IEEE Trans. Pattern Anal. Mach. Intell. 29(1), 167–172 (2007)
Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344, 1492–1496 (2014)
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Tripodi, R., Pelillo, M.: Document clustering games. In: The 5th International Conference on Pattern Recognition Applications and Methods, pp. 109–118 (2016)
Veenman, C.J., Reinders, M., Backer, E.: A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)
Zahn, C.T.: Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. 20(1), 68–86 (1971)
Zemene, E., Pelillo, M.: Path-based dominant-set clustering. In: Murino, V., Puppo, E. (eds.) ICIAP 2015. LNCS, vol. 9279, pp. 150–160. Springer, Heidelberg (2015)
Zhu, X., Loy, C.C., Gong, S.: Constructing robust affinity graphs for spectral clustering. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1450–1457 (2014)
Acknowledgement
This work is supported in part by the National Natural Science Foundation of China under Grant No. 61473045 and by China Scholarship Council.