A COMPARATIVE STUDY OF K-MEANS, DBSCAN AND OPTICS
Hari Krishna Kanagala
Asst. Prof., Department of MCA
Vignan’s Lara Institute of Technology & Sci.
Vadlamudi, Guntur Dist., India
harikanagala@gmail.com

Dr. V.V. Jaya Rama Krishnaiah
Assoc. Prof., Dept. of Computer Science & Engineering
ASN Women’s Engineering College
Tenali, Guntur Dist., India
jkvemula@gmail.com
Abstract - In view of the information available today, recent progress in data mining research has led to the development of various efficient methods for mining interesting patterns in large databases. Data mining plays a vital role in the knowledge discovery process by analyzing huge data from various sources and summarizing it into useful information. It is helpful for analyzing volumes of data in different domains such as marketing, health, and science and technology. Cluster analysis is a widely used approach for noticing trends in volumes of data. In this paper, we evaluate the performance of different clustering approaches, namely K-Means, DBSCAN and OPTICS, in terms of accuracy, outlier formation and cluster size prediction.

Keywords: Clustering, k-means, dbscan, optics
I. INTRODUCTION
Clustering is the process of partitioning a set of objects or data into a set of classes of similar objects, such that there is maximum similarity between the data objects of the same class and minimum similarity between the objects of different classes. The quality of the clustering result depends on the similarity measure used by the method. The similarity measure is expressed in terms of a distance function. The distance functions are very different for interval-scaled, Boolean, categorical, ordinal and ratio variables. The distance measure determines the similarity of two elements and influences the shape of the clusters. Many distance measures are in use, such as the Euclidean distance, the Manhattan distance and the Minkowski distance. This paper presents the clustering algorithms K-Means, DBSCAN and OPTICS, and the performance evaluation of those algorithms.
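The Euclidean and Manhattan distances are the p = 2 and p = 1 special cases of the Minkowski distance. The following minimal Python sketch (the points x and y are made-up examples, not data from this paper) illustrates the three measures:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])

    def minkowski(a, b, p):
        # Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)
        return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

    print(minkowski(x, y, 1))  # Manhattan distance (p = 1): 5.0
    print(minkowski(x, y, 2))  # Euclidean distance (p = 2): ~3.606
    print(minkowski(x, y, 3))  # Minkowski distance, p = 3: ~3.271

Changing p changes which coordinate differences dominate the measure and hence the shape of the resulting clusters.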
II. K-MEANS
The input parameter of the K-Means algorithm is the number of clusters, k. The algorithm partitions a set of n objects into k clusters so as to iteratively minimize the sum of squared error criterion. Cluster similarity is measured using the mean value of the cluster objects. First, k objects, each representing a cluster mean or center, are randomly selected. Each of the remaining objects is then assigned to a cluster based on the distance between the object and the cluster mean. For each cluster, a new mean value is computed. This process iterates until the sum of squared error is minimized. The sum of squared error is defined as

E = Σ_{i=1}^{k} Σ_{x ∈ C_i} |x − m_i|²    (1)

where C_i is the i-th cluster and m_i is the mean of C_i.

K-Means works as follows (a code sketch of these steps is given after the list):
1. The algorithm arbitrarily selects k data objects as the initial cluster means or cluster centers.
2. Compute a distance measurement, such as the Euclidean distance, between each data object and each cluster center, and assign each data object to the closest cluster.
3. Recompute each cluster center as the average of the data objects in that cluster.
4. Repeat steps 2 and 3 until there is no change in the clusters.
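The following minimal NumPy sketch mirrors steps 1 to 4 and the error E of Eq. (1). It is illustrative only: the experiments in Section V use WEKA, and the sketch assumes that no cluster becomes empty during the iterations.

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: arbitrarily select k data objects as initial cluster centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: assign each object to the closest center (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each center as the average of its assigned objects
            # (assumes every cluster keeps at least one object).
            new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            # Step 4: repeat until the centers, and hence the clusters, stop changing.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        sse = float(((X - centers[labels]) ** 2).sum())  # E from Eq. (1)
        return labels, centers, sse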
III. DBSCAN (Density Based Spatial Clustering of Applications with Noise)
Density-based clustering identifies regions of high density that are separated from one another by regions of low density. Density is defined as a minimum number of objects within a certain distance of each other. The DBSCAN approach is to create clusters with a minimum size and density. In the center-based approach, a point is classified as a core point, a border point or a noise point. A core point has more than a specified number of points (MinPts) within a specified radius (Eps). A border point has fewer than MinPts points within Eps but is in the neighborhood of a core point. A noise point is neither a core point nor a border point. The ε-neighborhood of an object consists of the objects within radius ε of that object. An object is called a core object if its ε-neighborhood contains at least a minimum number, MinPts, of objects. An object p is directly density-reachable from an object q if p is within the ε-neighborhood of q and q is a core object. A point is density-reachable from another point if there is a chain of points from one to the other in which each point is directly density-reachable from the previous one. An object p is density-connected to an object q if there is an object o such that both p and q are density-reachable from o. The algorithm defines a cluster as a maximal set of density-connected points. [1]

DBSCAN works as follows (a code sketch follows the list):
1. Label all points as core, border or noise points.
2. Eliminate the noise points.
3. Put an edge between all core points that are within the neighborhood of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign every border point to one of the clusters of its associated core points.
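As a small, hedged illustration of these steps, the sketch below uses scikit-learn’s DBSCAN rather than the WEKA implementation used in Section V; the blob data are made up, and eps and min_samples correspond to Eps and MinPts in the text.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    # Two dense blobs plus uniform background noise (made-up example data).
    X = np.vstack([rng.normal(0.0, 0.2, size=(100, 2)),
                   rng.normal(3.0, 0.2, size=(100, 2)),
                   rng.uniform(-2.0, 5.0, size=(20, 2))])

    db = DBSCAN(eps=0.3, min_samples=5).fit(X)
    labels = db.labels_  # cluster index per point; -1 marks noise points
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(n_clusters, "clusters,", int((labels == -1).sum()), "noise points")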
IV. OPTICS
OPTICS performs density-based clustering by creating an ordering of the points that allows the extraction of clusters for arbitrary values of ε. The parameter ε is a distance: the neighborhood radius. Therefore, in order to generate a set or an ordering of density-based clusters, we provide a set of distance parameter values. To construct the different clusterings simultaneously, the objects should be processed in a specific order. This order always selects an object that is density-reachable with respect to the lowest ε value, so that clusters with higher density (lower ε) are finished first. The generating distance ε is the largest distance considered for clusters; clusters can then be extracted for every εi with 0 ≤ εi ≤ ε. Based on this idea, two values need to be stored for each object: the core distance and the reachability distance. The core distance of an object p is the smallest ε′ value that makes p a core object; if p is not a core object, the core distance of p is undefined. The reachability distance of an object q from an object p is the greater of the core distance of p and the Euclidean distance between p and q; if p is not a core object, the reachability distance between p and q is undefined. [1] A reachability plot for a simple 2-dimensional data set shows how the data are clustered. [1]
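A brief sketch of these ideas with scikit-learn’s OPTICS (an assumption for illustration; the runs in Section V use WEKA). Here max_eps plays the role of the generating distance ε, and the reachability values along the computed ordering are exactly what a reachability plot displays.

    import numpy as np
    from sklearn.cluster import OPTICS, cluster_optics_dbscan

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.2, size=(100, 2)),
                   rng.normal(3.0, 0.2, size=(100, 2))])

    opt = OPTICS(min_samples=5, max_eps=1.0).fit(X)
    order = opt.ordering_             # the cluster ordering of the points
    reach = opt.reachability_[order]  # y-values of the reachability plot
    print(reach[:5])

    # Clusters for any eps_i <= max_eps can be extracted from the stored
    # core and reachability distances without rerunning the algorithm.
    labels = cluster_optics_dbscan(reachability=opt.reachability_,
                                   core_distances=opt.core_distances_,
                                   ordering=opt.ordering_, eps=0.4)
    print(len(set(labels)) - (1 if -1 in labels else 0), "clusters at eps = 0.4")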
V. PERFORMANCE EVALUATION
The results below were obtained by running the clustering techniques on the abalone data set, which consists of 4177 instances and 9 attributes: gender, length, diameter, height, whole weight, shucked weight, viscera weight, shell weight and rings. To run the algorithms we use the Waikato Environment for Knowledge Analysis (WEKA, version 3.6.10), an open source data mining tool developed at the University of Waikato, New Zealand; it is freely available under the GNU General Public License. The experiment was performed on a dual-core machine with a 2.10 GHz CPU and 4 GB RAM. The result for each clustering algorithm is shown and described below.

Table I shows the result of the K-Means clustering algorithm on the abalone data set for different numbers of clusters.
TABLE I. The result of K-Means on the abalone data set (4177 instances, 9 attributes) for different numbers of clusters.

                        k = 10    k = 15    k = 20    k = 25
No. of Iterations       30        37        63        34
Sum of Squared Error    165.36    92.73     71.41     62.48

Clustered Instances:
k = 10: C0-379, C1-296, C2-851, C3-287, C4-78, C5-201, C6-632, C7-373, C8-403, C9-677
k = 15: C0-239, C1-201, C2-200, C3-287, C4-78, C5-201, C6-412, C7-373, C8-403, C9-475, C10-349, C11-188, C12-503, C13-162, C14-106
k = 20: C0-149, C1-42, C2-202, C3-205, C4-49, C5-166, C6-314, C7-315, C8-305, C9-358, C10-219, C11-132, C12-360, C13-139, C14-182, C15-299, C16-302, C17-264, C18-102, C19-73
k = 25: C0-103, C1-42, C2-89, C3-55, C4-46, C5-166, C6-249, C7-301, C8-272, C9-258, C10-222, C11-104, C12-300, C13-127, C14-142, C15-318, C16-189, C17-209, C18-68, C19-39, C20-175, C21-313, C22-182, C23-163, C24-45
When the number of clusters increases, the corresponding sum of squared error decreases; a sketch for reproducing this trend is given below.

Fig. 1 shows the graph which visualizes the cluster assignments in K-Means when the input parameter, the number of clusters, is 10 for the Abalone dataset. Fig. 3 shows the graph which visualizes the cluster assignments in K-Means when the input parameter, the number of clusters, is 20 for the Abalone dataset.
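The following hedged sketch shows one way to reproduce this trend outside WEKA, rerunning K-Means for each k in Table I and printing the sum of squared error (scikit-learn’s inertia_). The file name abalone.csv is hypothetical and stands for a numeric export of the data set.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical numeric export of the abalone data (e.g. gender recoded).
    X = np.loadtxt("abalone.csv", delimiter=",")

    for k in (10, 15, 20, 25):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_)  # SSE decreases as k grows, as in Table I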
Fig. 5 shows the graph which visualizes the cluster assignments in DBSCAN when the input parameters are ε = 0.1 and MinPts = 2 for the Abalone dataset. A further figure shows the cluster assignments in DBSCAN when the input parameters are ε = 0.4 and MinPts = 5 for the Abalone dataset.
Table III. The result of OPTICS on the abalone data set with different ε and MinPts parameters.

                            ε = 0.1     ε = 0.2     ε = 0.3     ε = 0.4     ε = 0.8     ε = 1.0
                            MinPts = 2  MinPts = 2  MinPts = 6  MinPts = 5  MinPts = 5  MinPts = 5
No. of generated clusters   0           0           0           0           0           0
No. of unclustered
instances                   4177        4177        4177        4177        4177        4177
Elapsed Time                25.71       24.38       20.89       30.24       18.32       31.62
Fig. 11 shows the graph which visualizes the cluster assignments in OPTICS when the input parameters are ε = 0.1 and MinPts = 2 for the Abalone dataset. Fig. 15 shows the graph which visualizes the cluster assignments in OPTICS when the input parameters are ε = 0.8 and MinPts = 5 for the Abalone dataset.
REFERENCES
[1] Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”, Elsevier, Second Edition.
[2] Pang-Ning Tan, Vipin Kumar, Michael Steinbach, “Introduction to Data Mining”, Pearson.
[3] Bharat Chaudhari, Manan Parikh, “A Comparative Study of Clustering Algorithms Using Weka Tools”, International Journal of Application or Innovation in Engineering & Management (IJAIEM), Volume 1, Issue 2, October 2012, ISSN 2319-4847, pp. 154-158.
[4] V.V. Jaya Rama Krishnaiah, K. Ramchand H Rao, R. Satya Prasad, “Entropy Based Mean Clustering: An Enhanced Clustering Approach”, The International Journal of Computer Science & Applications (TIJCSA), Volume 1, No. 3, May 2012, ISSN 2278-1080, pp. 1-9.
[5] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, KDD-96 Proceedings, pp. 226-231.
[6] Narendra Sharma, Aman Bajpai, Ratnesh Litoriya, “Comparison the Various Clustering Algorithms of Weka Tools”, International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 5, May 2012, ISSN 2250-2459, pp. 73-80.
[7] Kaushik H. Raviya, Kunjan Dhinoja, “An Empirical Comparison of K-Means and DBSCAN Clustering Algorithm”, Paripex - Indian Journal of Research, Volume 2, ISSN 2250-1991, pp. 153-155.
[8] Pradeep Rai, Shubha Singh, “A Survey of Clustering Techniques”, International Journal of Computer Applications (0975-8887), Volume 7, No. 12, October 2010, pp. 1-5.
[9] P. Indira Priya, D.K. Ghosh, “A Survey on Different Clustering Algorithms in Data Mining Technique”, International Journal of Modern Engineering Research (IJMER), Vol. 3, Issue 1, Jan-Feb 2013, ISSN 2249-6645, pp. 267-274.
[10] B.G. Obula Reddy, Maligela Ussenaiah, “Literature Survey on Clustering Techniques”, IOSR Journal of Computer Engineering (IOSRJCE), ISSN 2278-0661, Volume 3, Issue 1, July-Aug. 2012, pp. 01-12.