Strategies and Algorithms For Clustering Large Datasets: A Review
Javier Béjar
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
bejar@lsi.upc.edu
Abstract
The exploratory nature of data analysis and data mining makes clustering one of the most frequent tasks in these
kinds of projects. Increasingly, these projects come from application areas like biology, text analysis or signal
analysis that involve ever larger datasets, both in the number of examples and in the number of attributes.
Classical methods for clustering data, like K-means or hierarchical clustering, are reaching the limit of their
capability to cope with this increase in dataset size. The limitation of these algorithms comes either from the
need to store all the data in memory or from their computational time complexity.
These problems have opened an area of research into algorithms able to reduce this data overload. Some
solutions come from the side of data preprocessing, either by transforming the data to a lower dimensional manifold
that represents its structure or by summarizing the dataset with a smaller subset of examples that holds
equivalent information.
A different perspective is to modify the classical clustering algorithms, or to derive new ones, able to cluster
larger datasets. This perspective relies on many different strategies. Techniques such as sampling, on-line
processing, summarization, data distribution and efficient data structures have been applied to the problem of
scaling clustering algorithms.
This paper presents a review of different strategies and of clustering algorithms that apply these techniques.
The aim is to cover the range of methodologies applied for clustering data and how they can be scaled.
1 Introduction
According to a recent poll about the most frequent tasks and methods employed in data mining
projects (KDNuggets, 2011), clustering was the third most frequent task. These projects usually
involve areas like astronomy, bioinformatics or finance, that generate large quantities of data.
Also according to a recurrent KDNuggets poll, the most frequent size of the datasets being processed
has shifted from tens of gigabytes in 2011 to terabytes in 2013. It is also common that, in some
of these domains, data is a continuous stream representing a boundless dataset, collected
and processed in batches to incrementally update or refine a previously built model.
The classical methods for clustering (e.g. K-means, hierarchical clustering) are not able to cope
with this increasing amount of data. The reason is mainly either the constraint of maintaining
all the data in main memory or the time complexity of the algorithms. This makes them
impractical for processing these increasingly larger datasets. The need for scalable clustering
methods is therefore a real problem, and in consequence new approaches are being developed.
There are several methodologies that have been used to scale clustering algorithms, some inspired
by methodologies successfully used for supervised machine learning, others specific to this unsupervised
task. For instance, some of these techniques use different kinds of sampling strategies in order to store
in memory only a subset of the data. Others are based on the partition of the whole dataset into several
independent batches that are processed separately, merging the results into a consensus model.
Some methodologies assume that the data is a continuous stream and has to be processed on-line
or in successive batches. These techniques are also integrated in different ways depending on the
model used for the clustering process (prototype based, density based, ...). This large variety
of approaches makes it necessary to define their characteristics and to organize them in a coherent way.
The outline of this paper is as follows. In section 2 some preprocessing techniques available for
dimensionality reduction will be reviewed. In section 3 the different paradigms for clustering data will
be presented, analyzing their capability for processing large quantities of data. Section 4 will discuss
the different strategies used for scaling clustering algorithms. Section 5 describes some algorithms that
use these scalability strategies. Finally, section 6 will outline some conclusions.
2 Preprocessing: Dimensionality reduction
Before applying the specific mining task that has to be performed on a dataset, several preprocessing
steps can be carried out. The first goal of the preprocessing step is to ensure the quality of the data by
reducing the noisy and irrelevant information that it could contain. The second goal is to reduce the
size of the dataset, so the computational cost of the discovery task is also reduced.
There are two dimensions that can be taken into account when reducing the size of the dataset. The
first one is the number of instances. This problem can be addressed by sampling techniques when it
is clear that a smaller subset of the data holds the same information as the whole dataset. This is not
the case in all applications, and sometimes the specific goal of the mining process is to find specific
groups of instances with low frequency but high value. Such data could be discarded by the sampling
process, making it unfruitful. In other applications the data is a stream; this circumstance makes
sampling more difficult, or carries the risk of losing important information if the data distribution
changes over time.
The number of attributes of the dataset can also be addressed, with dimensionality reduction
techniques. There are several areas related to the transformation of a dataset from its original repre-
sentation to a representation with a reduced set of features. The goal is to obtain a new dataset that
preserves, up to a level, the original structure of the data, so its analysis will result in the same or
equivalent patterns as those present in the original data. Broadly, there are two kinds of methods for
reducing the attributes in a dataset: feature selection and feature extraction.
Most of the research on feature selection is related to supervised learning [14]. More recently,
methods for unsupervised learning have been appearing in the literature [4], [11], [24], [26]. These
methods can be divided into filters, which use characteristics of the features to determine their salience
so the more relevant ones can be kept, and wrappers, which explore subsets of features and use a
clustering algorithm to evaluate the quality of the partitions generated with each subset, according
to an internal or external quality criterion. The main advantage of all these methods is that
they preserve the original attributes, so the resulting patterns can be interpreted more easily.
Feature extraction is an area with a large number of methods. The goal is to create a smaller
new set of attributes that maintains the patterns present in the data. This reduction process is
frequently used to visualize the data to help with the discovery process. These methods generate new
attributes that are linear or non-linear combinations of the original attributes. The most popular
method that obtains a linear transformation of the data is Principal Component Analysis [10]. This
transformation results in a set of orthogonal dimensions that account for the variance of the dataset.
It is usual for only a small number of these dimensions to hold most of the variance of the data, so
this subset alone should be enough to discover the patterns in the data. Non-linear feature extraction
methods have been becoming more popular because of their ability to uncover more complex patterns in
the data. Popular examples of these methods include the kernelized version of PCA [19] and methods
based on manifold learning like ISOMAP [21] or Locally Linear Embedding [18]. A major drawback of
these methods is their computational cost. Most of them include some sort of matrix factorization
and scale poorly.
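As an illustration of the linear case, the following minimal sketch (Python/NumPy; the function name, the variance threshold and the toy data are illustrative and not part of any of the cited methods) reduces a dataset by keeping the principal components that retain a chosen fraction of the variance.

```python
import numpy as np

def pca_reduce(X, var_kept=0.95):
    """Project X onto the principal components that retain `var_kept`
    of the total variance (minimal PCA sketch)."""
    Xc = X - X.mean(axis=0)                        # center the data
    # SVD of the centered data; the rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = (S ** 2) / (len(X) - 1)                  # variance captured by each component
    ratio = np.cumsum(var) / var.sum()
    k = int(np.searchsorted(ratio, var_kept)) + 1  # smallest k reaching the threshold
    return Xc @ Vt[:k].T                           # n x k reduced representation

# Toy usage: 500 examples in 10 dimensions whose variance lies mostly in 2 directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(500, 10))
print(pca_reduce(X).shape)                         # typically (500, 2) for this toy data
```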
3 Clustering algorithms
The clustering task can be defined as a process that, using the intrinsic properties of a dataset X,
uncovers a set of partitions that represents its inherent structure. It is, thus, an unsupervised task
that relies on the patterns present in the values of the attributes that describe the dataset. The
partitions can be either nested, representing a hierarchical structure, or disjoint partitions with
or without overlapping.
There are several approaches to obtain a partition from a dataset, depending on the characteristics
of the data or the kind of partition desired. Broadly, these approaches can be divided into:
• Hierarchical algorithms, which result in a nested set of partitions representing the hierarchical
structure of the data. These methods are usually based on a matrix of distances/similarities and
a recursive divisive or agglomerative strategy.
• Partitional algorithms, which result in a set of disjoint or overlapping partitions. There is a wider
variety of methods of this kind, depending on the model used to represent the partitions
or the discovery strategy used. The more representative ones include algorithms based on proto-
types or probabilistic models, on the discovery of dense regions, and on the
partition of the space of examples into a multidimensional grid.
In the following sections the main characteristics of these methods will be described with an outline
of the main representative algorithms.
The most used prototype based clustering algorithm is K-Means [5]. This algorithm assumes that
clusters are defined by their centers (the prototypes) and have spherical shapes. The fitting of these
spheres is done by minimizing the distances from the examples to their centers. Each example is
assigned to only one cluster.
Different optimization criteria can be used to obtain the partition, but the most common one is
the minimization of the sum of the squared Euclidean distances between the examples assigned to a
cluster and the centroid of the cluster. The problem can be formalized as:
\[ \min_{C} \sum_{C_i \in C} \; \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2 \]
The main drawback of these methods comes from the cost of finding the nearest neighbors of an
example. Indexing data structures can be used to reduce the computational time, but these structures
degrade to a linear search as the number of dimensions grows. This makes the computational time of these
algorithms proportional to the square of the number of examples for datasets with a large number of
dimensions.
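For reference, a minimal in-memory Lloyd-style K-Means sketch in Python/NumPy that optimizes the objective above; the full example-to-prototype distance matrix computed at every iteration is precisely the cost that the strategies of the following sections try to avoid (names and the convergence test are illustrative).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's K-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # random initial prototypes
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # all example-to-centroid squared distances: O(n k d) per iteration
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                               # assign to the nearest prototype
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                    # stop when centroids are stable
            break
        centers = new_centers
    return centers, labels
```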
4 Scalability strategies
The strategies used to scale clustering algorithms range from general strategies that can be adapted to
any algorithm to specific strategies that exploit the characteristics of the algorithm in order to reduce
its computational cost.
Some of the strategies are also dependent on the type of data that is used. For instance, only
clustering algorithms that incrementally build the partition can be used for data streams. For this
kind of dataset the scaling strategy has to assume that the data will be processed continuously and
that only one pass through the data will be allowed. For applications where the whole dataset can be
stored in secondary memory, other possibilities are also available.
The different strategies applied for scalability are not disjoint and several of them can be used
in combination. These strategies can be classified into:
One-pass strategies: The constraint assumed is that the data can only be processed once and in a
sequential fashion. A new example is integrated into the model at each iteration. Depending on the
type of algorithm, a data structure can be used to efficiently determine how to perform this
update. This strategy does not only apply to data streams and can actually be used for any
dataset.
Summarization strategies: It is assumed that not all the examples in the dataset are needed to obtain
the clustering, so an initial preprocessing of the data can be used to reduce its size by combining
examples. The preprocessing results in a set of representatives of groups of examples that fits in
memory. The representatives are then processed by the clustering algorithm.
Sampling/batch strategies: It is assumed that processing samples of the dataset that fit in memory
allows obtaining an approximation of the partition of the whole dataset. The clustering algorithm
generates different partitions that are combined iteratively to obtain the final partition.
Approximation strategies: It is assumed that certain computations of the clustering algorithm can be
approximated or reduced. These computations are mainly related to the distances among
examples or between the examples and the cluster prototypes.
Divide and conquer strategies: It is assumed that the whole dataset can be partitioned into roughly inde-
pendent datasets and that the combination/union of the results for each dataset approximates
the true partition.
Such a hierarchical scheme can reduce the computational complexity by using a multi-level clustering
algorithm, or it can serve as an element of a fast indexing structure that reduces the cost of obtaining
the first-level summarization.
5 Algorithms
All these scalability strategies have been implemented in several algorithms that represent the full range
of approaches to clustering. Usually more than one strategy is combined in an algorithm
to take advantage of the cost reduction and scalability properties of each. In this section, a review of a
representative set of algorithms and of the strategies they use is presented.
5.2 Rough-DBSCAN
In [22] a two-step algorithm is presented. The first step applies a one-pass strategy using the leader
algorithm, just like the algorithm in the previous section. The application of this algorithm results in
an approximation of the different densities of the dataset. These densities are used in the second step,
which consists of a variation of the density based algorithm DBSCAN.
This method uses a theoretical result that bounds the maximum number of leaders obtained by
the leader algorithm. Given a radius τ and a closed and bounded region of space determined by the
values of the features of the dataset, the maximum number of leaders k is bounded by:
\[ k \le \frac{V_S}{V_{\tau/2}} \]
where $V_S$ is the volume of the region $S$ and $V_{\tau/2}$ is the volume of a sphere of radius $\tau/2$.
This number is independent of the number of examples in the dataset and of the data distribution.
For the first step, given a radius $\tau$, the result of the leader algorithm is a list of leaders ($L$), their
followers and the count of their followers. The second step applies the DBSCAN algorithm to the set
of leaders, given the $\epsilon$ and $MinPts$ parameters.
The count of followers is used to estimate the number of examples around a leader. Different
estimations can be derived from this count. First, $L_l$ is defined as the set of leaders at a distance
less than or equal to $\epsilon$ from the leader $l$:
\[ L_l = \{\, l_j \in L \mid \lVert l_j - l \rVert \le \epsilon \,\} \]
The sum of the follower counts of the leaders in $L_l$ approximates the number of examples within
distance $\epsilon$ of a leader. Alternative counts can be derived as upper and lower bounds of this
estimate by using $\epsilon + \tau$ (upper) or $\epsilon - \tau$ (lower) as the distance.
Fig. 1: CURE
From these counts it can be determined whether a leader is dense or not. Dense leaders are substituted
by their followers, while non-dense leaders are discarded as outliers. The final result of the algorithm
is the partition of the dataset according to the partition of the leaders.
The computational complexity of the first step is O(nk), where k is the number of leaders, which does
not depend on the number of examples n but on the radius τ and the volume of the region that
contains the examples. For the second step, the complexity of the DBSCAN algorithm is O(k²);
given that the number of leaders will be small for large datasets, the total cost is dominated by
the cost of the first step.
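A minimal sketch of the first (leader) step, assuming Euclidean distance (names are illustrative); it produces the leaders and their follower counts, which the DBSCAN-like second step then uses as density estimates.

```python
import numpy as np

def leader_pass(X, tau):
    """One pass over the data: each example either follows an existing leader
    within distance tau or becomes a new leader itself."""
    leaders, counts = [], []
    for x in X:
        if leaders:
            d = np.linalg.norm(np.asarray(leaders) - x, axis=1)
            j = int(d.argmin())
            if d[j] <= tau:
                counts[j] += 1          # x is a follower of leader j
                continue
        leaders.append(x)               # x starts a new group as its leader
        counts.append(1)
    # counts[i] approximates the number of examples around leaders[i]
    return np.asarray(leaders), np.asarray(counts)
```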
5.3 CURE
CURE [9] is a hierarchical agglomerative clustering algorithm. The main difference with the classical
hierarchical algorithms is that it uses a set of examples to represent each cluster, allowing non-
spherical clusters to be represented. It also uses a parameter that shrinks the representatives towards
the mean of the cluster, reducing the effect of outliers and smoothing the shape of the clusters. Its
computational cost is O(n² log(n)).
The strategy used by this algorithm to attain scalability combines a divide and conquer and a
sampling strategy. The dataset is first reduced by using only a sample of the data. Chernoff bounds
are used to compute the minimum size of the sample so that it represents all the clusters and
approximates their shapes adequately.
In case the minimum size of the sample does not fit in memory, a divide and conquer
strategy is used. The sample is divided into a set of disjoint batches of the same size, and each batch
is clustered until a certain number of clusters is achieved or the distance among clusters falls below a
specified parameter. This step has the effect of a pre-clustering of the data. The cluster representatives
from each batch are merged and the same algorithm is applied until the desired number of clusters is
achieved.
A representation of this strategy appears in figure 1. Once the clusters are obtained, the whole
dataset is labeled according to the nearest cluster. The complexity of the algorithm is
O((n²/p) log(n/p)), where n is the size of the sample and p is the number of batches used.
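The partition-and-merge idea can be sketched as follows. This is a simplification: each batch is pre-clustered with scikit-learn's AgglomerativeClustering and represented only by centroids, whereas CURE keeps several shrunk representatives per cluster and stops on a distance threshold; names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def partitioned_clustering(sample, p, pre_k, final_k):
    """Divide and conquer sketch: pre-cluster p disjoint batches of the sample,
    then cluster the batch representatives to obtain the final clusters."""
    reps = []
    for batch in np.array_split(sample, p):          # each batch must hold >= pre_k examples
        labels = AgglomerativeClustering(n_clusters=pre_k).fit_predict(batch)
        # one centroid per pre-cluster acts as the representative of that group
        reps.extend(batch[labels == c].mean(axis=0) for c in range(pre_k))
    reps = np.vstack(reps)
    rep_labels = AgglomerativeClustering(n_clusters=final_k).fit_predict(reps)
    # the whole dataset would then be labeled by its nearest representative
    return reps, rep_labels
```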
5.4 BIRCH
BIRCH [27] is a multi-stage clustering algorithm that bases its scalability on a first stage that incre-
mentally builds a pre-clustering of the dataset. This first stage combines a one-pass strategy and a
summarization strategy that reduce the actual size of the dataset to a size that fits in memory.
The scalability strategy relies on a data structure named the Clustering Feature tree (CF-tree), which
stores information that summarizes the characteristics of a cluster. Specifically, the information in a
node is the number of examples, the sum of the example values and the sum of their squared values.
From these values other quantities about the individual clusters can be computed, for instance the
centroid, the radius of the cluster and its diameter, as well as quantities relative to pairs of clusters,
such as the inter-cluster distance or the variance increase.
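The clustering feature itself is just a triple of additive sufficient statistics; a minimal sketch following these definitions (class and field names are illustrative):

```python
import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS): count, linear sum and squared norm sum of the examples."""
    def __init__(self, d):
        self.n, self.ls, self.ss = 0, np.zeros(d), 0.0

    def add(self, x):                    # absorb one example (a length-d NumPy vector)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):              # CFs are additive, so whole subtrees can be summarized
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # root of the average squared distance of the examples to the centroid
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))
```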
A CF-tree (figure 2) is a balanced n-ary tree that contains information representing probabilistic
prototypes. Leaves of the tree can contain as many as l prototypes and their radius cannot be more
than t. Each non-terminal node has a fixed branching factor (b), and each of its elements is a prototype
that summarizes its subtree. The choice of these parameters is crucial, because it determines the actual
space available for the first phase. If the parameters are poorly chosen, the CF-tree can be
dynamically compressed by changing the parameter values (basically t). In fact, t determines the
granularity of the final groups.
The first phase of BIRCH inserts the examples sequentially into the CF-tree to obtain a set of clusters
that summarizes the data. For each instance, the tree is traversed following the branch of the nearest
prototype at each level, until a leaf is reached. Once there, the nearest prototype of the leaf to
the example is determined. The example is either absorbed into this prototype or a new prototype
is created, depending on whether the distance is greater or not than the value of the parameter t.
If the current leaf has no space for the new prototype (it already contains l prototypes), the algorithm
creates a new terminal node and distributes the prototypes between the current node and
the new leaf. The distribution is performed by choosing the two most different prototypes and dividing
the rest according to their proximity to these two prototypes. This division creates a new entry in the
parent node. If the new entry exceeds the capacity of the parent, it is split and the process
continues upwards, until the root of the tree is reached if necessary. Additional merge operations
can be performed after this process to compact the tree.
For the next phase, the resulting prototypes from the leaves of the CF-tree represent a coarse
view of the dataset. These prototypes are used as the input of a clustering algorithm. In the original
algorithm, single-link hierarchical clustering is applied, but K-means clustering could also be used.
The last phase involves labeling the whole dataset using the centroids obtained by this clustering
algorithm. Additional scans of the data can be performed to refine the clusters and detect outliers.
The actual computational cost of the first phase of the algorithm depends on the chosen parameters.
Given a threshold t, and considering that s is the maximum number of leaves the CF-tree can contain,
that the height of the tree is log_b(s) and that at each level b nodes have to be considered, the
temporal cost is O(nb log_b(s)). The temporal cost of clustering the leaves of the tree depends on the
algorithm used; for hierarchical clustering it is O(s²). Labeling the dataset has a cost of O(nk), where
k is the number of clusters.
5.5 OptiGrid
OptiGrid [12] presents an algorithm that divides the space of examples into an adaptive multidimensional
grid that determines dense regions. The scalability strategy is based on recursive divide and conquer:
the computation of one level of the grid determines how to divide the space into independent datasets.
These partitions can be divided further until no more partitions are possible.
The main element of the algorithm is the computation of a set of low dimensional projections of
the data that are used to determine the dense areas of examples. These projections can be computed
using PCA or other dimensionality reduction algorithms and can be fixed for all the iterations. For
each projection, a fixed number of orthogonal cutting planes is determined from the maxima and minima
of the density function, computed using kernel density estimation or another density estimation method.
These cutting planes are used to compute a grid. The dense cells of the grid are considered clusters at
the current level and are recursively partitioned until no new cutting planes can be determined given
a quality threshold. A detailed implementation is presented in algorithm 1.
Regarding the computational complexity of this method: if the projections are fixed for all the com-
putations, the first step can be computed separately from the algorithm and added to the total cost.
The actual cost of computing the projections depends on the method used. Assuming axis-parallel
projections, the cost of obtaining k projections for N examples is O(Nk), and O(Ndk) otherwise, where
d is the number of dimensions. Computing the cutting planes for k projections can also be done in
O(Nk). Assigning the examples to the grid depends on the size of the grid and the insertion time of
the data structure used to store it. For q cutting planes and assuming a structure with logarithmic insertion
time, the cost of assigning the examples is O(Nq min(q, log(N))) for axis-parallel projections and
O(Nqd min(q, log(N))) otherwise. The number of recursions of the algorithm is bounded by the number
of clusters in the dataset, which is a constant. Considering that q is also a constant, this gives a total
complexity bounded by O(Nd log(N)).
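The core step, locating cutting points at density minima of a one-dimensional projection, can be sketched with a histogram density estimate; this is a crude stand-in for the kernel density estimation and cut-quality checks of the original algorithm, and all names and thresholds are illustrative.

```python
import numpy as np

def cutting_points(proj, n_bins=50, noise_level=0.05, max_cuts=3):
    """Return up to max_cuts positions of a 1-d projection where the estimated
    density has a local minimum below a noise threshold (candidate cutting planes)."""
    hist, edges = np.histogram(proj, bins=n_bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    noise = noise_level * hist.max()
    cuts = []
    for i in range(1, n_bins - 1):
        # keep local minima of the density estimate that fall below the noise level
        if hist[i] <= hist[i - 1] and hist[i] <= hist[i + 1] and hist[i] <= noise:
            cuts.append((hist[i], centers[i]))
    cuts.sort()                          # prefer the lowest-density positions
    return [position for _, position in cuts[:max_cuts]]
```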
Discarding and compressing part of the new examples allows reducing the amount of data
needed to maintain the model at each iteration.
The algorithm divides the compression of data into two differentiated strategies. The first one is
called primary compression, and aims to detect those examples that can be discarded. Two criteria
are used for this compression: the first one determines the examples that are closer to the cluster
centroid than a threshold; these examples are probably not going to change their assignment in the
future. The second one consists of perturbing the centroid within a confidence interval of its values; if
an example does not change its current cluster assignment, it is assumed that future modifications
of the centroid will still include the example.
The second strategy is called secondary compression, and aims to detect those examples that
cannot be discarded but form a compact subcluster. In this case, all the examples that form these
compact subclusters are summarized using a set of sufficient statistics. The values used to compute
these sufficient statistics are the same as those used by BIRCH to summarize the dataset.
The algorithm used for updating the model is a variation of the K-means algorithm that is able to
treat single instances as well as summaries. The temporal cost of the algorithm depends on the number
of iterations needed until convergence, as in the original K-means, so the computational complexity is
O(kni), where k is the number of clusters, n the size of the sample in memory, and i the total number
of iterations performed over all the updates.
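The update can be viewed as a weighted K-means step in which the discarded and compressed groups take part through their summaries; a minimal sketch of one such iteration, simplified so that every summary is represented only by its centroid and example count (names are illustrative):

```python
import numpy as np

def weighted_kmeans_step(points, summary_centers, summary_counts, centers):
    """One K-means update over the retained points plus the summarized sub-clusters,
    each summary acting as a single point weighted by its number of examples."""
    X = np.vstack([points, summary_centers])
    w = np.concatenate([np.ones(len(points)), summary_counts])
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # distances to centroids
    labels = d2.argmin(axis=1)
    new_centers = np.array([
        np.average(X[labels == j], axis=0, weights=w[labels == j])
        if np.any(labels == j) else centers[j]
        for j in range(len(centers))
    ])
    return new_centers, labels
```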
1. Input the first m points; use the base clustering algorithm to reduce these to at most 2k cluster
centroids. The number of examples assigned to each cluster acts as the weight of the cluster.
2. Repeat the above procedure until m²/(2k) examples have been processed, so that m centroids
have been accumulated.
3. Use the base clustering algorithm to reduce these m first-level centroids to 2k centroids at the
next level.
4. Apply the same criterion at each existing level, so that whenever m centroids have accumulated
at level i, 2k centroids at level i + 1 are computed.
5. After the whole sequence has been seen (or at any time), reduce the 2k centroids at the top level
to k centroids.
The number of centroids to cluster is reduced geometrically with the number of levels, so the
main cost of the algorithm lies in the first level. This makes the time complexity of the algorithm
O(nk log(nk)), while needing only O(m) space.
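A sketch of this geometric reduction scheme is shown below, using scikit-learn's KMeans with sample weights as the base clustering algorithm (the original work uses a k-median style algorithm, and m is assumed to be larger than 2k; names are illustrative).

```python
import numpy as np
from sklearn.cluster import KMeans

def stream_cluster(chunks, k, m):
    """Reduce each chunk of the stream to 2k weighted centroids; whenever a level
    accumulates m centroids, reduce them again to 2k centroids one level up."""
    levels = [[]]                                    # levels[i] holds (centroid, weight) pairs

    def reduce_pairs(pairs, n_out):
        X = np.array([c for c, _ in pairs])
        w = np.array([wt for _, wt in pairs])
        km = KMeans(n_clusters=n_out, n_init=5, random_state=0).fit(X, sample_weight=w)
        return [(c, w[km.labels_ == j].sum()) for j, c in enumerate(km.cluster_centers_)]

    for chunk in chunks:                             # each chunk holds a batch of raw examples
        levels[0].extend(reduce_pairs([(x, 1.0) for x in chunk], 2 * k))
        i = 0
        while len(levels[i]) >= m:                   # geometric reduction across levels
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].extend(reduce_pairs(levels[i], 2 * k))
            levels[i] = []
            i += 1
    remaining = [pair for level in levels for pair in level]
    return reduce_pairs(remaining, k)                # final k weighted centroids
```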
does not depend on the total number of examples (n), but on the volume defined by the attributes
and the value of the parameter T2. With k being this number of canopies, the computational cost is
bounded by O(nk). The cost of the second stage depends on the specific algorithm used, but the number
of distance computations needed within a canopy is reduced, since each canopy contains around n/k
examples; for example, if single-link hierarchical clustering is applied, the total computational cost of
applying this algorithm to the k canopies will be O(n²/k).
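A minimal sketch of canopy construction with the two thresholds T1 > T2 of [16], using plain Euclidean distance as a stand-in for the cheap approximate metric (names are illustrative):

```python
import numpy as np

def canopies(X, t1, t2, seed=0):
    """Build overlapping canopies: every point within t1 of a randomly picked center
    joins its canopy; points within t2 can no longer become centers themselves."""
    assert t1 > t2
    rng = np.random.default_rng(seed)
    remaining = list(range(len(X)))                   # candidate canopy centers
    result = []
    while remaining:
        c = remaining[rng.integers(len(remaining))]   # pick a random remaining point
        d = np.linalg.norm(X - X[c], axis=1)
        result.append(np.flatnonzero(d <= t1))        # members of the new canopy
        remaining = [i for i in remaining if d[i] > t2]
    # an expensive clustering algorithm is then applied inside each canopy
    return result
```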
leaves of the kd-tree and the number of dimensions (d). The computational cost of each iteration is
bounded by O(2^d k log(n)).
The major problem of this algorithm is that, as the dimensionality increases, the benefit of the
kd-tree structure degrades to a linear search. This is a direct effect of the curse of dimensionality,
and the experiments show that for more than about 20 dimensions there are no time savings.
At each iteration, the distances between the examples and the prototypes are computed and the
examples are assigned to the nearest one. After the assignment of the examples to prototypes, the
prototypes are recomputed as the centroid of all their assigned examples.
Further computational improvement can be obtained by calculating the actual bounds of the bins,
using the maximum and minimum values of the attributes of the examples inside a bin. This allows
obtaining more precise maximum and minimum distances from prototypes to bins, reducing the
number of prototypes that can be assigned to a bin.
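For an axis-aligned bin described by per-dimension bounds, the minimum distance reduces to clamping the prototype to the bin's bounding box and the maximum to picking the farthest corner; a small sketch (names are illustrative):

```python
import numpy as np

def bin_distance_bounds(prototype, bin_min, bin_max):
    """Minimum and maximum Euclidean distances from a prototype to an
    axis-aligned bin given by its per-dimension lower and upper bounds."""
    nearest = np.clip(prototype, bin_min, bin_max)    # closest point of the box
    # farthest corner: per dimension, whichever bound lies farther from the prototype
    farthest = np.where(np.abs(bin_min - prototype) > np.abs(bin_max - prototype),
                        bin_min, bin_max)
    return np.linalg.norm(prototype - nearest), np.linalg.norm(prototype - farthest)
```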
It is difficult to calculate the actual complexity of the algorithm because it depends on the quanti-
zation of the dataset and on how separated the clusters are. The initialization step that assigns examples
to bins is O(n). The maximum number of bins is bounded by the number of examples n, so in the
worst case O(kn) computations have to be performed at each iteration. If the data presents
well-separated clusters, a large number of bins will be empty, reducing the actual number of
computations.
6 Conclusion
The scalability of clustering algorithms is a recent issue, arising from the need to solve unsupervised
learning tasks in data mining applications. The commonly used clustering algorithms cannot scale to
the increased size of the datasets due to their time or space complexity. This problem opens the field
for different strategies to adapt the commonly used clustering algorithms to the current needs.
This paper presents a perspective on different strategies used to scale clustering algorithms. The
approaches range from the general divide and conquer scheme to more algorithm-specific strategies.
These strategies are frequently used in combination to obtain the different advantages that they
provide, for instance two-stage clustering algorithms that apply a summarization strategy in a first
stage combined with a one-pass strategy.
Some algorithms that successfully implement different combinations of the presented strategies have
been described in some detail, including their computational time complexity. The algorithms cover
the whole range of clustering paradigms, including hierarchical, model based, density based and grid
based algorithms.
All the discussed solutions show an evident improvement in clustering scalability, but little has
been said about how to adjust the different parameters of these algorithms. In a scenario of very
large datasets this is a challenge, and the usual trial and error does not seem an efficient approach.
Further research into these methods should address this problem.
References
[1] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic
subspace clustering of high dimensional data for data mining applications. In Proceedings of the
ACM SIGMOD International Conference on Management of Data (SIGMOD-98), volume 27,2
of ACM SIGMOD Record, pages 94–105, New York, June 1–4 1998. ACM Press.
[2] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. OPTICS: Ordering
points to identify the clustering structure. In Alex Delis, Christos Faloutsos, and Shahram Ghan-
deharizadeh, editors, Proceedings of the ACM SIGMOD International Conference on Management
of Data (SIGMOD-99), volume 28,2 of SIGMOD Record, pages 49–60, New York, June 1–3 1999.
ACM Press.
[3] Paul S. Bradley, Usama M. Fayyad, and Cory Reina. Scaling clustering algorithms to large
databases. In Rakesh Agrawal, Paul E. Stolorz, and Gregory Piatetsky-Shapiro, editors, KDD,
pages 9–15. AAAI Press, 1998.
[4] M. Dash, K. Choi, P. Scheuermann, and H. Liu. Feature Selection for Clustering - A Filter
Solution. In ICDM, pages 115–122, 2002.
[5] R. Dubes and A. Jain. Algorithms for Clustering Data. PHI Series in Computer Science. Prentice
Hall, 1988.
[6] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for
discovering clusters in large spatial databases with noise. In Evangelos Simoudis, Jia Wei Han,
and Usama Fayyad, editors, Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD-96), page 226. AAAI Press, 1996.
[7] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science,
315(5814):972–976, 2007.
[8] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams:
Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528,
2003.
[9] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An efficient clustering algorithm for
large databases. Inf. Syst, 26(1):35–58, 2001.
[10] Trevor Hastie, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical Learning.
Springer, July 2001.
[11] X. He, D. Cai, and P. Niyogi. Laplacian Score for Feature Selection. In NIPS, 2005.
[12] Alexander Hinneburg and Daniel A. Keim. Optimal grid-clustering: Towards breaking the curse
of dimensionality in high-dimensional clustering. In VLDB, volume 99, pages 506–517, 1999.
[13] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An effi-
cient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 24(7):881–892, 2002.
[14] R. Kohavi. Wrappers for Feature Subset Selection. Art. Intel., 97:273–324, 1997.
[15] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[16] A. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with
application to reference matching. In KDD, 2000.
[17] Bidyut Kr. Patra, Sukumar Nandi, and P. Viswanath. A distance based clustering method for
arbitrary shaped clusters in large datasets. Pattern Recognition, 44(12):2862–2870, 2011.
[18] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding.
Science, 290(5500):2323–2326, December 2000.
[19] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue
problem. Neural Computation, 10:1299–1319, 1998.
[20] D. Sculley. Web-scale k-means clustering. In Proceedings of the 19th international conference on
World wide web, pages 1177–1178. ACM, 2010.
[21] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear
dimensionality reduction. Science, 290(5500):2319–2323, December 2000.
[22] P. Viswanath and V. Suresh Babu. Rough-DBSCAN: A fast hybrid density based clustering method
for large data sets. Pattern Recognition Letters, 30(16):1477–1488, 2009.
[23] Wei Wang, Jiong Yang, and Richard R. Muntz. STING: A statistical information grid approach
to spatial data mining. In Matthias Jarke, Michael J. Carey, Klaus R. Dittrich, Frederick H.
Lochovsky, Pericles Loucopoulos, and Manfred A. Jeusfeld, editors, Twenty-Third International
Conference on Very Large Data Bases, pages 186–195, Athens, Greece, 1997. Morgan Kaufmann.
[24] L. Wolf and A. Shashua. Feature Selection for Unsupervised and Supervised Inference. Journal
of Machine Learning Research, 6:1855–1887, 2005.
[25] Zhiwen Yu and Hau-San Wong. Quantization-based clustering algorithm. Pattern Recognition,
43(8):2698 – 2711, 2010.
[26] H. Zeng and Yiu-ming Cheung. A new feature selection method for Gaussian mixture clustering.
Pattern Recognition, 42(2):243–250, February 2009.
[27] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: A new data clustering algorithm
and its applications. Data Min. Knowl. Discov, 1(2):141–182, 1997.