A Survey of Clustering Techniques
International Journal of Computer Applications (0975 – 8887), Volume 7, No. 12, October 2010
2.2. Center-based clusters
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.

2.3. Contiguous clusters
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

2.4. Density-based clusters
A cluster is a dense region of points, which is separated by low-density regions from other regions of high density. Used when the clusters are irregular or intertwined, and when noise and outliers are present.

2.5. Shared Property or Conceptual Clusters
Finds clusters that share some common property or represent a particular concept.

2.6. Described by an Objective Function
Finds clusters that minimize or maximize an objective function.
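The distinction in Section 2.2 between a centroid (the coordinate-wise mean, which need not be a data point) and a medoid (the most central actual data point) can be made concrete with a small sketch. This example is ours, not from the paper; the toy cluster and helper names are illustrative.

```python
import numpy as np

def centroid(points):
    # The centroid is the coordinate-wise mean; it need not be a data point.
    return points.mean(axis=0)

def medoid(points):
    # The medoid is the data point with the smallest total distance
    # to all other points in the cluster.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(centroid(cluster))  # the mean, pulled toward the outlier (5, 5)
print(medoid(cluster))    # an actual point, more robust to the outlier
```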
Fig.3

Cluster analysis is very useful, but without proper analysis the implementation of a clustering algorithm will not provide good results. Cluster analysis is useful to understand data: to group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations; it also reduces the size of large data sets.

Clustering is equivalent to breaking the graph into connected components, one for each cluster.

A good clustering algorithm should have the following properties:

3.1. Scalability
The ability of the algorithm to perform well with a large number of data objects (tuples).

3.2. Analyze mixture of attribute types
The ability to analyze single as well as mixtures of attribute types.

3.3 Find arbitrary-shaped clusters
The shape usually corresponds to the kinds of clusters an algorithm can find, and we should consider this as a very important point when choosing a method, since we want to be as general as possible. Different types of algorithms will be biased towards finding different types of cluster structures/shapes, and it is not always an easy task to determine the shape or the corresponding bias. Especially when categorical attributes are present, we may not be able to talk about cluster structures.

3.4 Minimum requirements for input parameters
Many clustering algorithms require some user-defined parameters, such as the number of clusters, in order to analyze the data. However, with large datasets and higher dimensionalities, it is desirable that a method require only limited guidance from the user, in order to avoid biasing the result.

3.5 Handling of noise
Clustering algorithms should be able to handle deviations, in order to improve cluster quality. Deviations are defined as data objects that depart from generally accepted norms of behaviour and are also referred to as outliers. Deviation detection is considered a separate problem.

B. The distance of a given point from its nearest and furthest neighbor is almost the same, for a wide variety of distributions and distance functions. Both of the above highly influence the efficiency of a clustering algorithm, since it would need more time to process the data, while at the same time the resulting clusters would be of very poor quality.

3.8 Interpretability and usability
Most of the time, it is expected that clustering algorithms produce usable and interpretable results. But when it comes to comparing the results with preconceived ideas or constraints, some techniques fail to be satisfactory. Therefore, easy-to-understand results are highly desirable.
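The nearest/furthest-neighbor observation above (a symptom of the curse of dimensionality) is easy to reproduce empirically. A minimal sketch of ours, assuming uniformly distributed random data and Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(0)

# As dimensionality grows, the ratio between the nearest and the
# furthest neighbor distance of a query point approaches 1.
for dim in (2, 10, 100, 1000):
    data = rng.random((1000, dim))
    query = rng.random(dim)
    d = np.linalg.norm(data - query, axis=1)
    print(f"dim={dim:5d}  nearest/furthest = {d.min() / d.max():.3f}")
```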
4. CLASSIFICATION OF CLUSTERING
Traditionally, clustering techniques are broadly divided into hierarchical, partitioning, and density-based clustering. Categorization of clustering methods is neither straightforward nor canonical; in reality, the groups described below overlap.

4.1. Hierarchical Methods
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. The basics of hierarchical clustering include the Lance-Williams formula, the idea of conceptual clustering, the now classic algorithms SLINK and COBWEB, as well as newer algorithms such as CURE and CHAMELEON. Hierarchical algorithms build clusters gradually (as crystals are grown). Strategies for hierarchical clustering generally fall into two types. In hierarchical clustering the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects successively into finer groupings. Agglomerative techniques are more commonly used, and this is the method implemented in XLMiner. Hierarchical clustering may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each successive stage of analysis.

Fig.4 Nested cluster diagram
Fig.5 Dendrogram

A divisive algorithm proceeds as follows:
1. Put all objects in one cluster.
2. Repeat until all clusters are singletons:
a) Choose a cluster to split.
b) Replace the chosen cluster with the sub-clusters.

4.1.3 Advantages of hierarchical clustering
1. Embedded flexibility regarding the level of granularity.
2. Ease of handling any form of similarity or distance.
3. Applicability to any attribute type.

4.1.4 Disadvantages of hierarchical clustering
1. Vagueness of termination criteria.
2. Most hierarchical algorithms do not revisit once-constructed clusters with the purpose of improvement.
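For illustration (this example is ours, not from the paper), an agglomerative clustering and the dendrogram describing it can be produced with SciPy; the data are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
# Two loose groups of 2-D points.
points = np.vstack([rng.normal(0, 0.3, (5, 2)),
                    rng.normal(3, 0.3, (5, 2))])

# Agglomerative clustering: each point starts as its own cluster,
# and the closest pair of clusters is fused at every step.
Z = linkage(points, method="single")  # SLINK-style single linkage

# Cut the hierarchy to obtain two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would draw the fusion tree with matplotlib.
```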
4.2. Partitioning Methods
The partitioning methods generally result in a set of M clusters, each object belonging to one cluster. Each cluster may be represented by a centroid or a cluster representative; this is some sort of summary description of all the objects contained in a cluster. The precise form of this description will depend on the type of the objects being clustered. In the case where real-valued data is available, the arithmetic mean of the attribute vectors for all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases, e.g., a cluster of documents can be represented by a list of those keywords that occur in some minimum number of documents within the cluster. If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within a dataset.
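The document-cluster representative described above can be sketched directly. The threshold and toy documents below are illustrative assumptions of ours:

```python
from collections import Counter

def keyword_representative(docs, min_docs=2):
    # Represent a cluster of documents by the keywords that occur
    # in at least `min_docs` distinct documents of the cluster.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    return {w for w, df in doc_freq.items() if df >= min_docs}

cluster = ["clustering groups similar objects",
           "partitioning methods produce flat clustering",
           "hierarchical methods produce nested clusters"]
print(keyword_representative(cluster))  # {'clustering', 'methods', 'produce'}
```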
4.2.1 K-Means Method
1. Select K points as initial centroids.
2. Repeat:
3. Form K clusters by assigning each point to its closest centroid.
4. Recompute the centroid of each cluster, until the centroids do not change.
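A minimal NumPy sketch of these steps (ours, not the paper's code; it assumes Euclidean distance, random initialization, and that no cluster becomes empty):

```python
import numpy as np

def kmeans(points, k, rng=np.random.default_rng(0), max_iter=100):
    # Step 1: select K points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):  # Step 2: repeat.
        # Step 3: assign each point to its closest centroid.
        labels = np.linalg.norm(points[:, None] - centroids[None],
                                axis=2).argmin(axis=1)
        # Step 4: recompute the centroid of each cluster
        # (assumes no cluster ends up empty).
        new_centroids = np.array([points[labels == i].mean(axis=0)
                                  for i in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped changing
            break
        centroids = new_centroids
    return labels, centroids
```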
4.2.2 Bisecting K-Means Method
Split the set of all points into two clusters, select one of them and split, and repeat this process until K clusters have been produced.

4.2.3 K-Medoids Method
A k-medoid is the most appropriate data point within a cluster to represent it. Representation by k-medoids has two advantages. First, it presents no limitations on attribute types, and, second, the choice of medoids is dictated by the location of a predominant fraction of points inside a cluster and, therefore, it is less sensitive to the presence of outliers. When medoids are selected, clusters are defined as subsets of points close to their respective medoids, and the objective function is defined as the averaged distance or another dissimilarity measure between a point and its medoid.
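A rough sketch of a k-medoids-style update (a simplified PAM-like variant of ours, not the paper's algorithm). It takes a precomputed dissimilarity matrix, which is what allows arbitrary attribute types:

```python
import numpy as np

def k_medoids(dissim, k, max_iter=50, rng=np.random.default_rng(0)):
    # dissim: (n, n) matrix of pairwise dissimilarities, e.g. built with
    # scipy.spatial.distance.squareform(pdist(X)) for numeric data.
    n = dissim.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assign every point to its closest medoid.
        labels = dissim[:, medoids].argmin(axis=1)
        # Move each medoid to the cluster member minimizing the total
        # dissimilarity to the rest of the cluster (assumes no empty cluster).
        new = medoids.copy()
        for i in range(k):
            members = np.where(labels == i)[0]
            within = dissim[np.ix_(members, members)].sum(axis=1)
            new[i] = members[within.argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels, medoids
```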
4.3. Density-Based Methods
4.3.1 Density-Based Connectivity
So defined, density-connectivity is a symmetric relation, and all the points reachable from core objects can be factorized into maximal connected components serving as clusters. The points that are not connected to any core point are declared to be outliers (they are not covered by any cluster). The non-core points inside a cluster represent its boundary. Finally, core objects are internal points. Processing is independent of data ordering. So far, nothing requires any limitations on the dimension or attribute types.
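This is the idea behind DBSCAN-style algorithms. A brief illustration with scikit-learn (our example, assuming numeric 2-D data; the parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
# Two dense blobs plus a few scattered noise points.
data = np.vstack([rng.normal(0, 0.2, (30, 2)),
                  rng.normal(4, 0.2, (30, 2)),
                  rng.uniform(-2, 6, (5, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(data)
print(set(db.labels_))  # cluster ids; -1 marks outliers

core = np.zeros(len(data), dtype=bool)
core[db.core_sample_indices_] = True
# Non-core members of a cluster are its boundary points.
boundary = (db.labels_ != -1) & ~core
print(boundary.sum(), "boundary points")
```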
4.3.2 Density Functions Clustering
In this approach a density function is used to compute the density. Overall density is modeled as the sum of the density functions of all objects. Clusters are determined by density attractors, where density attractors are local maxima of the overall density function. The influence function can be an arbitrary one.
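A sketch of the idea (ours, choosing a Gaussian influence function as one arbitrary option): the overall density is the sum of per-object influences, and a point is driven uphill, mean-shift style, toward its density attractor.

```python
import numpy as np

def density(x, data, h=0.5):
    # Overall density at x: sum of Gaussian influence functions of all objects.
    return np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * h * h)).sum()

def attractor(x, data, h=0.5, iters=100):
    # Mean-shift style hill climbing: repeatedly move x to the influence-
    # weighted mean of the data, which ascends the overall density surface.
    for _ in range(iters):
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * h * h))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < 1e-8:
            break
        x = x_new
    return x

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
print(attractor(np.array([0.5, 0.5]), data))  # climbs to an attractor near (0, 0)
```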
4.4. Grid-Based Clustering
These methods focus on spatial data, i.e., data that model the geometric structure of objects in space, their relationships, properties, and operations. This technique quantizes the data set into a number of cells and then works with objects belonging to these cells. They do not relocate points but rather build several hierarchical levels of groups of objects. The merging of grids, and consequently clusters, does not depend on a distance measure; it is determined by a predefined parameter.
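A toy sketch of the quantization step (ours; the cell size plays the role of the predefined parameter): points are mapped to grid cells, and the occupied cells, rather than individual points, become the units of further processing.

```python
import numpy as np
from collections import defaultdict

def grid_cells(points, cell_size=1.0):
    # Quantize the data set into cells; each point is mapped to the
    # integer coordinates of the cell that contains it.
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple(np.floor(p / cell_size).astype(int))].append(i)
    return cells

pts = np.array([[0.1, 0.2], [0.4, 0.9], [5.1, 5.2], [5.3, 5.9]])
for cell, members in grid_cells(pts).items():
    print(cell, members)  # two occupied cells: (0, 0) and (5, 5)
```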
5. CONCLUSIONS
Clustering lies at the heart of data analysis and data mining applications. The ability to discover highly correlated regions of objects when their number becomes very large is highly desirable, as data sets grow and their properties and data interrelationships change. At the same time, it is notable that any clustering "is a division of the objects into groups based on a set of rules – it is neither true nor false".