
International Journal of Scientific and Research Publications, Volume 3, Issue 3, March 2013, ISSN 2250-3153

Agglomerative Hierarchical Clustering Algorithm - A Review

K.Sasirekha, P.Baby

Department of CS, Dr.SNS.Rajalakshmi College of Arts & Science

Abstract- Clustering is the task of assigning a set of objects into groups called clusters. In data mining, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types. Agglomerative: a "bottom up" approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Divisive: a "top down" approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Index Terms- Agglomerative, Divisive

I. INTRODUCTION

Fast and robust clustering algorithms play an important role in extracting useful information from large databases. The aim of cluster analysis is to partition a set of N objects into C clusters such that objects within a cluster are similar to each other and objects in different clusters are dissimilar to each other [1]. Clustering can be used to quantize the available data and to extract a set of cluster prototypes for a compact representation of the dataset in homogeneous subsets. Clustering is a mathematical tool that attempts to discover structures or certain patterns in a dataset, where the objects inside each cluster show a certain degree of similarity. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Cluster analysis is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization. It is often necessary to modify preprocessing and parameters until the result achieves the desired properties.

In clustering, one of the most widely used families of algorithms is the agglomerative algorithms. In general, the merges and splits are determined in a greedy manner, and the results of hierarchical clustering are usually presented in a dendrogram. In the general case, the complexity of agglomerative clustering is $O(n^3)$, which makes it too slow for large data sets. Divisive clustering with an exhaustive search is $O(2^n)$, which is even worse. However, for some special cases, optimal efficient agglomerative methods (of complexity $O(n^2)$) are known: SLINK [1] for single-linkage and CLINK [2] for complete-linkage clustering.
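The procedure just described can be made concrete with a short example. The following is a minimal sketch, assuming Python with the SciPy and Matplotlib libraries (neither is referenced in the paper; the dataset and parameters are purely illustrative), that builds a single-linkage hierarchy over a small synthetic dataset and draws the resulting dendrogram.

```python
# Minimal sketch: single-linkage agglomerative clustering with SciPy.
# The library and data are illustrative assumptions, not part of the paper.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Small synthetic dataset: two loose groups of 2-D points.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Build the hierarchy bottom-up. method="single" corresponds to
# single-linkage (cf. SLINK), method="complete" to complete-linkage (cf. CLINK).
Z = linkage(X, method="single", metric="euclidean")

# Each row of Z records one greedy merge: the two clusters joined,
# the distance at which they were joined, and the size of the new cluster.
print(Z)

# Cut the hierarchy into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# The dendrogram visualises the full merge history.
dendrogram(Z)
plt.show()
```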
II. DISADVANTAGES

1) Very sensitive to good initialization.
2) Coincident clusters may result, because the columns and rows of the typicality matrix are independent of each other. Sometimes this could be advantageous (start with a large value of c and get less distinct clusters).

Cluster dissimilarity: In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

Metric: The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (1,0) and the origin (0,0) is always 1 according to the usual norms, but the distance between the point (1,1) and the origin (0,0) can be 2, $\sqrt{2}$ or 1 under the Manhattan distance, Euclidean distance or maximum distance respectively (a short numerical check of these values follows the list below).

Some commonly used metrics for hierarchical clustering are [3]:

- Euclidean distance: $\|a-b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$
- Squared Euclidean distance: $\|a-b\|_2^2 = \sum_i (a_i - b_i)^2$
- Manhattan distance: $\|a-b\|_1 = \sum_i |a_i - b_i|$
- Maximum distance: $\|a-b\|_\infty = \max_i |a_i - b_i|$
- Mahalanobis distance: $\sqrt{(a-b)^\top S^{-1} (a-b)}$, where S is the covariance matrix
- Cosine similarity: $\frac{a \cdot b}{\|a\| \, \|b\|}$
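As a worked check of the example in the Metric paragraph, the following minimal sketch (assuming Python with SciPy, used purely for illustration) computes the distance between the point (1,1) and the origin under several of the metrics listed above, reproducing the values 2, $\sqrt{2}$ and 1.

```python
# Illustrative sketch: distances between (1, 1) and the origin (0, 0)
# under several of the metrics listed above.
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, chebyshev, sqeuclidean

a = np.array([1.0, 1.0])
o = np.array([0.0, 0.0])

print(cityblock(a, o))    # Manhattan distance  -> 2.0
print(euclidean(a, o))    # Euclidean distance  -> 1.4142... (sqrt(2))
print(chebyshev(a, o))    # maximum distance    -> 1.0
print(sqeuclidean(a, o))  # squared Euclidean   -> 2.0
```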


For text or other non-numeric data, metrics such as the Hamming distance or Levenshtein distance are often used. A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.

The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Some commonly used linkage criteria between two sets of observations A and B are:

- Maximum or complete linkage clustering: $\max \{\, d(a,b) : a \in A,\; b \in B \,\}$
- Minimum or single-linkage clustering: $\min \{\, d(a,b) : a \in A,\; b \in B \,\}$
- Mean or average linkage clustering (UPGMA): $\frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b)$
- Minimum energy clustering: $\frac{2}{nm} \sum_{i,j} \|a_i - b_j\|_2 - \frac{1}{n^2} \sum_{i,j} \|a_i - a_j\|_2 - \frac{1}{m^2} \sum_{i,j} \|b_i - b_j\|_2$

where d is the chosen metric. Other linkage criteria include:

- The sum of all intra-cluster variance.
- The decrease in variance for the cluster being merged (Ward's criterion).

A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).

Suppose we have merged the two closest elements b and c; we now have the clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b, c}, and therefore need to define the distance between two clusters. Usually the distance between two clusters A and B is one of the following (a small numerical sketch follows this list):

- The maximum distance between elements of each cluster (also called complete-linkage clustering): $\max \{\, d(x,y) : x \in A,\; y \in B \,\}$
- The minimum distance between elements of each cluster (also called single-linkage clustering): $\min \{\, d(x,y) : x \in A,\; y \in B \,\}$
- The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA): $\frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x,y)$
- The sum of all intra-cluster variance.
- The increase in variance for the cluster being merged (Ward's method [6]).
- The probability that candidate clusters spawn from the same distribution function (V-linkage).
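These set-to-set definitions can be evaluated directly from the matrix of pairwise distances. The following minimal sketch (assuming Python with NumPy and SciPy; the two clusters are hypothetical) computes the complete-, single- and average-linkage distances between two small clusters.

```python
# Sketch: complete-, single- and average-linkage distances between two
# small clusters, computed from the matrix of all pairwise distances.
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [0.0, 1.0]])   # cluster A
B = np.array([[4.0, 0.0], [5.0, 1.0]])   # cluster B

D = cdist(A, B)      # D[i, j] = d(A[i], B[j]) for the chosen metric (Euclidean here)

print(D.max())       # complete linkage: max d(a, b)
print(D.min())       # single linkage:   min d(a, b)
print(D.mean())      # average linkage (UPGMA): mean d(a, b)
```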

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
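Both stopping rules correspond to cutting the dendrogram at different points. The following minimal sketch (again assuming SciPy, purely for illustration) extracts flat clusters from the same hierarchy once with a distance criterion and once with a number-of-clusters criterion.

```python
# Sketch: the distance criterion versus the number criterion, expressed
# as two different cuts of the same hierarchy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.3, 0.2], [5.0, 5.0], [5.1, 4.8], [9.0, 0.0]])
Z = linkage(X, method="complete")

# Distance criterion: stop merging once clusters are farther apart than t.
labels_by_distance = fcluster(Z, t=2.0, criterion="distance")

# Number criterion: keep merging until at most 3 clusters remain.
labels_by_count = fcluster(Z, t=3, criterion="maxclust")

print(labels_by_distance)
print(labels_by_count)
```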
Divisive Hierarchical Clustering

- A top-down clustering method which is less commonly used. It works in a similar way to agglomerative clustering but in the opposite direction. This method starts with a single cluster containing all objects and then successively splits the resulting clusters until only clusters of individual objects remain (a minimal sketch of this top-down procedure is given after the list below). GeneLinker™ does not support divisive hierarchical clustering.

Disadvantages

- No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage. The result should be examined closely to ensure it makes sense.
- Use of different distance metrics for measuring distances between clusters may generate different results. Performing multiple experiments and comparing the results is recommended to support the veracity of the original results.
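The following is a minimal, hypothetical sketch of the top-down strategy (assuming Python with scikit-learn's KMeans for each binary split; this is not how GeneLinker™ or any particular tool implements it): it starts with one cluster holding every object and repeatedly bisects the largest remaining cluster until only singletons remain.

```python
# Hypothetical sketch of divisive clustering by recursive bisection.
# KMeans is used only as one possible way to perform each binary split.
import numpy as np
from sklearn.cluster import KMeans

def divisive(X):
    clusters = [np.arange(len(X))]                 # start: one cluster with all objects
    while any(len(c) > 1 for c in clusters):
        # pick the largest cluster that can still be split
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # split it into two sub-clusters
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(divisive(X))    # ends with one singleton cluster per object
```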
III. CONCLUSION

Agglomerative hierarchical clustering is a bottom-up clustering method where clusters have sub-clusters, which in turn have sub-clusters, etc. The classic example of this is species taxonomy. Gene expression data might also exhibit this hierarchical quality (e.g. neurotransmitter gene families). Agglomerative hierarchical clustering starts with every single object (gene or sample) in a cluster of its own. Then, in each successive iteration, it agglomerates (merges) the closest pair of clusters by satisfying some similarity criterion, until all of the data is in one cluster.

Advantages: It can produce an ordering of the objects, which may be informative for data display. Smaller clusters are generated, which may be helpful for discovery. The chosen distance measure is used to determine the similarity between prototypes and data points, and it performs well only in certain cases.


IV. FUTURE WORK

This paper was intended to compare two algorithms. Through an extensive search, we were unable to find any study that attempts to compare all of the algorithms under investigation.

As future work, a comparison between these algorithms can be attempted according to factors other than those considered in this paper. Comparing the results of the algorithms on normalized data against non-normalized data will give different results; of course, normalization will affect the performance of the algorithms and the quality of the results.

Another approach may consider using data clustering algorithms in applications such as object and character recognition, or information retrieval, which is concerned with automatic document processing.

REFERENCES

[1] M.S. Yang, "A Survey of hierarchical clustering", Mathl. Comput. Modelling, Vol. 18, No. 11, pp. 1-16, 1993.
[2] A. Vathy-Fogarassy, B. Feil, J. Abonyi, "Minimal Spanning Tree based clustering", Proceedings of World Academy of Science, Engineering & Technology, Vol. 8, pp. 7-12, Oct. 2005.
[3] N.R. Pal, K. Pal, J.M. Keller and J.C. Bezdek, "A Possibilistic Clustering Algorithm", IEEE Transactions on Fuzzy Systems, Vol. 13, No. 4, pp. 517-530, 2005.
[4] R. Krishnapuram and J.M. Keller, "A possibilistic approach to clustering", IEEE Trans. Fuzzy Systems, Vol. 1, pp. 98-110, 1993.
[5] J.C. Dunn, "A Agglomerative Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, Vol. 3, pp. 32-57, 1973.

AUTHORS

First Author – K. Sasirekha, MCA, M.Phil., Assistant Professor, Dr.SNS.Rajalakshmi College of Arts & Science, Chinnavedampatti, Coimbatore. Email: sasirekharamesh1985@gmail.com

Second Author – P. Baby, MCA, M.Phil., Assistant Professor, Dr.SNS.Rajalakshmi College of Arts & Science, Chinnavedampatti, Coimbatore. Email: cb.ridhu@gmail.com
