
International Journal of Computer Applications (0975 – 8887)
Volume 7 – No. 12, October 2010

A Survey of Clustering Techniques

Pradeep Rai, Asst. Prof., CSE Department, Kanpur Institute of Technology, Kanpur-208001 (India)
Shubha Singh, Asst. Prof., MCA Department, Kanpur Institute of Technology, Kanpur-208001 (India)

ABSTRACT
The goal of this survey is to provide a comprehensive review of different clustering techniques in data mining.

1. INTRODUCTION
Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification: many data objects are represented by a few clusters, and hence the data is modeled by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is unsupervised learning of a hidden data concept. Data mining deals with large databases that impose severe additional computational requirements on clustering analysis. These challenges led to the emergence of the powerful, broadly applicable data mining clustering methods surveyed below.

Fig.1 DATA MINING PROCESS

Clustering is often one of the first steps in data mining analysis. It identifies groups of related records that can be used as a starting point for exploring further relationships. This technique supports the development of population segmentation models, such as demographic-based customer segmentation. Additional analyses using standard analytical and other data mining techniques can determine the characteristics of these segments with respect to some desired outcome. For example, the buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.

For example, a company that sells a variety of products may need to know about the sales of all of its products in order to determine which products sell well and which do not. This can be done with data mining techniques, but if the system clusters the products that sell poorly, only that cluster has to be examined rather than comparing the sales figures of all the products; clustering thus facilitates the mining process. Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions. Clustering techniques belong to the group of undirected data mining tools. The goal of undirected data mining is to discover structure in the data as a whole; there is no target variable to be predicted, so no distinction is made between independent and dependent variables.

Fig.2 DATA MINING

Clustering techniques are used for combining observed examples into clusters (groups) which satisfy two main criteria:
1. Each group or cluster is homogeneous; examples that belong to the same group are similar to each other.
2. Each group or cluster should be different from other clusters; that is, examples that belong to one cluster should be different from the examples of other clusters.

Depending on the clustering technique, clusters can be expressed in different ways:
1. Identified clusters may be exclusive, so that any example belongs to only one cluster.
2. They may be overlapping; an example may belong to several clusters.
3. They may be probabilistic, whereby an example belongs to each cluster with a certain probability.

2. GENERAL TYPES OF CLUSTERS

2.1. Well-separated clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.


2.2. Center-based clusters
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of the cluster.

2.3. Contiguous clusters
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

2.4. Density-based clusters
A cluster is a dense region of points which is separated by low-density regions from other regions of high density. This notion is used when the clusters are irregular or intertwined, and when noise and outliers are present.

2.5. Shared Property or Conceptual Clusters
Finds clusters that share some common property or represent a particular concept.

2.6. Described by an Objective Function
Finds clusters that minimize or maximize an objective function.

3. CLUSTER ANALYSIS
Cluster analysis is the task of finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Fig.3

Cluster analysis is very useful, but without proper analysis the implementation of a clustering algorithm will not provide good results. Cluster analysis helps, for example, to group related documents for browsing, to group genes and proteins that have similar functionality, or to group stocks with similar price fluctuations; it also reduces the size of large data sets. Clustering is equivalent to breaking a graph into connected components, one for each cluster.

A good clustering algorithm should have the following properties:

3.1. Scalability
The ability of the algorithm to perform well with a large number of data objects (tuples).

3.2. Analyze mixture of attribute types
The ability to analyze single as well as mixed attribute types.

3.3. Find arbitrary-shaped clusters
The shape usually corresponds to the kinds of clusters an algorithm can find, and this should be considered a very important factor when choosing a method, since we want to be as general as possible. Different types of algorithms are biased towards finding different types of cluster structures/shapes, and it is not always easy to determine the shape or the corresponding bias. Especially when categorical attributes are present, we may not even be able to talk about cluster structures.

3.4. Minimum requirements for input parameters
Many clustering algorithms require some user-defined parameters, such as the number of clusters, in order to analyze the data. However, with large datasets and higher dimensionalities, it is desirable that a method require only limited guidance from the user, in order to avoid biasing the result.

3.5. Handling of noise
Clustering algorithms should be able to handle deviations in order to improve cluster quality. Deviations are defined as data objects that depart from generally accepted norms of behaviour and are also referred to as outliers. Deviation detection is considered a separate problem.

3.6. Sensitivity to the order of input records
The same data set, when presented to certain algorithms in different orders, may produce dramatically different results. The order of input mostly affects algorithms that require a single scan over the data set, leading to locally optimal solutions at every step. Thus, it is crucial that algorithms be insensitive to the order of input.

3.7. High dimensionality of data
The number of attributes/dimensions in many data sets is large, and many clustering algorithms cannot handle more than a small number (eight to ten) of dimensions. It is a challenge to cluster high-dimensional data sets, such as the U.S. census data set, which contains a large number of attributes. The presence of a large number of attributes is often termed the curse of dimensionality. This has to do with the following:
A. As the number of attributes becomes larger, the amount of resources required to store or represent them grows.
B. The distance of a given point from its nearest and furthest neighbor is almost the same, for a wide variety of distributions and distance functions.
Both of the above highly influence the efficiency of a clustering algorithm, since it needs more time to process the data, while at the same time the resulting clusters are of very poor quality.
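
To illustrate the distance-concentration effect described in point B, the following short Python sketch (an illustrative addition, not part of the original survey; the uniform toy data and sample sizes are assumptions) compares nearest- and farthest-neighbour distances as the dimensionality grows:

    import numpy as np

    rng = np.random.default_rng(0)

    # For growing dimensionality, compare the nearest and the farthest neighbour
    # distance of a query point; a ratio close to 1 means the distances
    # concentrate and lose their discriminating power.
    for dim in (2, 10, 100, 1000):
        points = rng.random((1000, dim))          # 1000 uniform random points
        query = rng.random(dim)                   # one query point
        dists = np.linalg.norm(points - query, axis=1)
        print(f"dim={dim:4d}  nearest/farthest = {dists.min() / dists.max():.3f}")

As the dimension grows, the printed ratio approaches 1, i.e., the nearest and farthest neighbours of a point become almost equidistant.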
3.8. Interpretability and usability
Most of the time, clustering algorithms are expected to produce usable and interpretable results. But when it comes to comparing the results with preconceived ideas or constraints, some techniques fail to be satisfactory. Therefore, easy-to-understand results are highly desirable.

4. CLASSIFICATION OF CLUSTERING
Traditionally, clustering techniques are broadly divided into hierarchical, partitioning, and density-based clustering. The categorization of clustering methods is neither straightforward nor canonical; in reality, the groups below overlap.

4.1. Hierarchical Methods
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. The basics of hierarchical clustering include the Lance-Williams formula and the idea of conceptual clustering, the now classic algorithms SLINK and COBWEB, as well as the newer algorithms CURE and CHAMELEON. Hierarchical algorithms build clusters gradually (as crystals are grown). Strategies for hierarchical clustering generally fall into two types. In hierarchical clustering the data are not partitioned into a particular cluster in a single step; instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects successively into finer groupings. Agglomerative techniques are more commonly used (this is, for example, the method implemented in XLMiner). Hierarchical clustering may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each successive stage of the analysis.

Fig.4 Nested cluster diagram

4.1.1. Agglomerative
This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The algorithm forms clusters in a bottom-up manner, as follows:
1. Initially, put each article in its own cluster.
2. Among all current clusters, pick the two clusters with the smallest distance.
3. Replace these two clusters with a new cluster, formed by merging the two original ones.
4. Repeat the above two steps until there is only one remaining cluster in the pool.
Thus, the agglomerative clustering algorithm results in a binary cluster tree with single-article clusters as its leaf nodes and a root node containing all the articles.
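
As an illustration of the agglomerative procedure above (not code from the paper), SciPy's hierarchical-clustering routines perform exactly this sequence of merges and can also draw the dendrogram mentioned earlier; the toy data and the single-linkage choice are assumptions:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    rng = np.random.default_rng(1)
    # Toy data: two loose groups of 2-D points.
    data = np.vstack([rng.normal(0, 0.5, (10, 2)),
                      rng.normal(3, 0.5, (10, 2))])

    # Bottom-up merging: start from singleton clusters and repeatedly fuse the
    # two closest clusters; "single" linkage uses the nearest-neighbour distance.
    merges = linkage(data, method="single")

    # Cut the merge tree into two flat clusters.
    labels = fcluster(merges, t=2, criterion="maxclust")
    print(labels)

    # dendrogram(merges) would draw the merge tree (requires matplotlib).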

4.1.2. Divisive Algorithm
This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy:
1. Put all objects in one cluster.
2. Repeat until all clusters are singletons:
   a) choose a cluster to split;
   b) replace the chosen cluster with the resulting sub-clusters.

4.1.3. Advantages of hierarchical clustering
1. Embedded flexibility regarding the level of granularity.
2. Ease of handling any form of similarity or distance.
3. Applicability to any attribute type.

4.1.4. Disadvantages of hierarchical clustering
1. Vagueness of the termination criteria.
2. Most hierarchical algorithms do not revisit clusters once constructed with the purpose of improving them.

4.2. Partitioning Methods
Partitioning methods generally result in a set of M clusters, with each object belonging to one cluster. Each cluster may be represented by a centroid or a cluster representative; this is some sort of summary description of all the objects contained in the cluster. The precise form of this description depends on the type of object being clustered. Where real-valued data is available, the arithmetic mean of the attribute vectors for all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases, e.g., a cluster of documents can be represented by a list of those keywords that occur in some minimum number of documents within the cluster. If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within the dataset.

Fig.5

There are many methods of partitioning clustering.

4.2.1. K-means Method
In the k-means case a cluster is represented by its centroid, which is the mean (usually a weighted average) of the points within the cluster. This works conveniently only with numerical attributes and can be negatively affected by a single outlier. The k-means algorithm [Hartigan 1975; Hartigan & Wong 1979] is by far the most popular clustering tool used in scientific and industrial applications. The name comes from representing each of the k clusters C by the mean (or weighted average) c of its points, the so-called centroid. While this obviously does not work well with categorical attributes, it makes good geometric and statistical sense for numerical attributes. The sum of discrepancies between each point and its centroid, expressed through an appropriate distance, is used as the objective function. Each point is assigned to the cluster with the closest centroid, and the number of clusters, K, must be specified. The basic algorithm is very simple:


1. Select K points as initial centroids.
2. Repeat steps 3 and 4 until the centroids do not change:
3. Form K clusters by assigning each point to its closest centroid.
4. Recompute the centroid of each cluster.

Fig.6

Limitations of K-means: K-means has problems when clusters are of differing sizes, densities, or non-globular shapes, and when the data contains outliers.
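
A minimal NumPy sketch of the basic algorithm listed above (illustrative only; the random initialisation, the iteration cap and the toy data are assumptions):

    import numpy as np

    def kmeans(points, k, n_iter=100, seed=0):
        # 1. Select K points as initial centroids.
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # 3. Form K clusters by assigning each point to its closest centroid.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 4. Recompute the centroid of each cluster.
            new_centroids = np.array([points[labels == j].mean(axis=0)
                                      if np.any(labels == j) else centroids[j]
                                      for j in range(k)])
            # 2. Stop repeating once the centroids do not change.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    # Toy usage: two well-separated blobs in the plane.
    pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    labels, centers = kmeans(pts, k=2)
    print(centers)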
4.2.2. Bisecting K-means Method
This is an extension of the K-means method. The basic concept is as follows: to obtain k clusters, split the set of all points into two clusters, select one of them to split again, and repeat this process until k clusters have been produced.
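
A rough sketch of this bisecting strategy using scikit-learn's KMeans for each split (illustrative; the rule of always splitting the largest cluster is an assumed heuristic, since the paper does not specify one):

    import numpy as np
    from sklearn.cluster import KMeans

    def bisecting_kmeans(points, k, seed=0):
        clusters = [np.arange(len(points))]      # start: one cluster with all point indices
        while len(clusters) < k:
            # Choose a cluster to split; here simply the largest one (assumed heuristic).
            idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
            chosen = clusters.pop(idx)
            # Split the chosen cluster into two with ordinary 2-means.
            labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(points[chosen])
            clusters.append(chosen[labels == 0])
            clusters.append(chosen[labels == 1])
        return clusters

    pts = np.vstack([np.random.randn(30, 2) + offset for offset in (0, 5, 10)])
    for c in bisecting_kmeans(pts, k=3):
        print(len(c), "points")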
4.2.3. K-medoids Method
A k-medoid is the most appropriate data point within a cluster to represent it. Representation by k-medoids has two advantages: first, it imposes no limitations on attribute types, and, second, the choice of medoids is dictated by the location of a predominant fraction of points inside a cluster, so it is less sensitive to the presence of outliers. When the medoids are selected, clusters are defined as subsets of points close to the respective medoids, and the objective function is defined as the averaged distance or another dissimilarity measure between a point and its medoid. The k-medoids method has two versions:

4.2.3.1. PAM (Partitioning Around Medoids): PAM is an iterative optimization that combines relocation of points between prospective clusters with re-nominating the points as potential medoids. The guiding principle for the process is the effect on an objective function.

4.2.3.2. CLARA (Clustering LARge Applications): CLARA uses several (five) samples, each with 40+2k points, which are each subjected to PAM. The whole dataset is assigned to the resulting medoids, the objective function is computed, and the best system of medoids is retained.
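
A simplified, illustrative PAM-style sketch (not the full PAM algorithm): it greedily swaps medoids with non-medoids as long as the total point-to-medoid distance decreases; the exhaustive swap search and random initialisation are assumptions made for brevity:

    import numpy as np

    def pam(points, k, seed=0):
        # Pairwise distance matrix between all points.
        dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        rng = np.random.default_rng(seed)
        medoids = list(rng.choice(len(points), size=k, replace=False))

        def cost(meds):
            # Objective: total distance of every point to its closest medoid.
            return dist[:, meds].min(axis=1).sum()

        improved = True
        while improved:                   # keep swapping while the objective improves
            improved = False
            for i in range(k):
                for candidate in range(len(points)):
                    if candidate in medoids:
                        continue
                    trial = medoids.copy()
                    trial[i] = candidate
                    if cost(trial) < cost(medoids):
                        medoids, improved = trial, True
        labels = dist[:, medoids].argmin(axis=1)
        return medoids, labels

    pts = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 6])
    medoid_rows, labels = pam(pts, k=2)
    print("medoid indices:", medoid_rows)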
4.2.4. Probabilistic Clustering
In the probabilistic approach, data is considered to be a sample independently drawn from a mixture model of several probability distributions [McLachlan & Basford 1988]. The main assumption is that data points are generated by, first, randomly picking a model j with probability τj, j = 1:K, and, second, drawing a point x from the corresponding distribution. The area around the mean of each (supposedly unimodal) distribution constitutes a natural cluster, so we associate the cluster with the corresponding distribution's parameters such as mean, variance, etc. Each data point carries not only its (observable) attributes, but also a (hidden) cluster ID (a class, in pattern recognition terms). Each point x is assumed to belong to one and only one cluster, and we can estimate the probabilities of the assignment.

Fig.7

Probabilistic clustering has some important features:
1. It can be modified to handle records of complex structure.
2. It can be stopped and resumed with consecutive batches of data, since the clusters have a representation totally different from sets of points.
3. At any stage of the iterative process the intermediate mixture model can be used to assign cases (on-line property).
4. It results in an easily interpretable cluster system.
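
The mixture-model view described above corresponds closely to fitting a Gaussian mixture with EM; a short illustrative example using scikit-learn (the two-component toy data is an assumption) recovers both the soft assignment probabilities and the hard cluster IDs:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    # Sample from a mixture of two Gaussian components, as in the generative story above.
    data = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

    hard_ids = gmm.predict(data)          # hidden cluster ID for each point
    soft_probs = gmm.predict_proba(data)  # probability of belonging to each cluster
    print(gmm.means_)                     # estimated component means
    print(soft_probs[:3].round(3))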

4.3. Density-Based Algorithms
Density-based algorithms are capable of discovering clusters of arbitrary shapes, and this also provides natural protection against outliers. These algorithms group objects according to specific density objective functions. Density is usually defined as the number of objects in a particular neighborhood of a data object. In these approaches a given cluster continues growing as long as the number of objects in the neighborhood exceeds some parameter.

Fig.8

This type of clustering can be of two kinds:

4.3.1. Density-Based Connectivity Clustering
In this clustering technique both density and connectivity are measured in terms of the local distribution of nearest neighbours.


So defined, density-connectivity is a symmetric relation, and all the points reachable from core objects can be factorized into maximal connected components serving as clusters. The points that are not connected to any core point are declared to be outliers (they are not covered by any cluster). The non-core points inside a cluster represent its boundary; the core objects are internal points. Processing is independent of the data ordering, and so far nothing requires any limitations on the dimension or attribute types.
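
DBSCAN is the best-known algorithm built on this core-point/density-connectivity idea; a minimal, illustrative run with scikit-learn (the eps and min_samples values are assumptions chosen for the toy data):

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(3)
    # Two dense blobs plus a few scattered outliers.
    data = np.vstack([rng.normal(0, 0.3, (100, 2)),
                      rng.normal(4, 0.3, (100, 2)),
                      rng.uniform(-3, 7, (10, 2))])

    # eps is the neighbourhood radius, min_samples the count needed to be a core point.
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(data)

    print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
    print("points labelled as noise:", int(np.sum(labels == -1)))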
4.3.2. Density Functions Clustering
In this technique a density function is used to compute the density. The overall density is modeled as the sum of the density (influence) functions of all objects. Clusters are determined by density attractors, where density attractors are the local maxima of the overall density function. The influence function can be an arbitrary one.
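
A small illustrative sketch of this idea, assuming a Gaussian influence function and a coarse grid search for the strongest density attractor (neither of which is prescribed by the paper):

    import numpy as np

    rng = np.random.default_rng(5)
    data = np.vstack([rng.normal(0, 0.4, (80, 2)), rng.normal(3, 0.4, (80, 2))])

    def overall_density(x, points, sigma=0.5):
        # Overall density: sum of Gaussian influence functions of all objects at x.
        sq = ((points - x) ** 2).sum(axis=1)
        return np.exp(-sq / (2 * sigma ** 2)).sum()

    # Evaluate the density on a coarse grid; local maxima of this surface play the
    # role of density attractors. Here we only report the strongest one.
    xs = np.linspace(-2, 5, 50)
    ys = np.linspace(-2, 5, 50)
    dens = np.array([[overall_density(np.array([x, y]), data) for x in xs] for y in ys])
    iy, ix = np.unravel_index(dens.argmax(), dens.shape)
    print("strongest density attractor near:", (round(float(xs[ix]), 2), round(float(ys[iy]), 2)))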
4.4. Grid-Based Clustering
These methods focus on spatial data, i.e., data that model the geometric structure of objects in space, along with their relationships, properties and operations. The technique quantizes the data set into a number of cells and then works with the objects belonging to these cells. It does not relocate points but rather builds several hierarchical levels of groups of objects. The merging of grids, and consequently of clusters, does not depend on a distance measure; it is determined by a predefined parameter.
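
A minimal illustrative sketch of the quantization step, assuming a fixed cell size and a simple occupancy threshold (both are assumptions; a full grid-based method would additionally merge neighbouring dense cells into clusters):

    import numpy as np
    from collections import Counter

    def dense_cells(points, cell_size=1.0, min_count=5):
        # Quantize each 2-D point into an integer grid cell and count cell occupancy.
        cells = Counter(map(tuple, np.floor(points / cell_size).astype(int)))
        # Keep only cells whose object count reaches the predefined threshold.
        return {cell for cell, count in cells.items() if count >= min_count}

    rng = np.random.default_rng(4)
    data = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
    print(sorted(dense_cells(data)))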
5. CONCLUSIONS
Clustering lies at the heart of data analysis and data mining applications. The ability to discover highly correlated regions of objects when their number becomes very large is highly desirable, as data sets grow and their properties and data interrelationships change. At the same time, it is notable that any clustering "is a division of the objects into groups based on a set of rules – it is neither true nor false".
