
2016 International Conference on Computer Communication and Informatics (ICCCI-2016), Jan. 07-09, 2016, Coimbatore, INDIA

A COMPARATIVE STUDY OF
K-MEANS, DBSCAN AND OPTICS
Hari Krishna Kanagala
Asst. Prof., Department of MCA
Vignan’s Lara Institute of Technology & Sci.
Vadlamudi, Guntur Dist, India
harikanagala@gmail.com

Dr. V.V. Jaya Rama Krishnaiah
Assoc. Prof., Dept. of Computer Science & Engineering
ASN Women’s Engineering College
Tenali, Guntur Dist, India
jkvemula@gmail.com

Abstract - In view of the volume of information available today, recent progress in data mining research has led to the development of various efficient methods for mining interesting patterns in large databases. Data mining plays a vital role in the knowledge discovery process by analyzing huge data from various sources and summarizing it into useful information. It is helpful for analyzing large volumes of data in different domains such as marketing, health, and science and technology. Cluster analysis is a widely used approach to notice trends in such volumes of data. In this paper, we evaluate the performance of different clustering approaches, namely K-Means, DBSCAN, and OPTICS, in terms of accuracy, outlier formation, and cluster size prediction.

Keywords: Clustering, k-means, dbscan, optics

I. INTRODUCTION
Clustering is the process of partitioning a set of objects or data into a set of classes of similar objects, such that there is maximum similarity between the data objects of the same class and minimum similarity between the objects of different classes. The quality of the clustering result depends on the similarity measure used by the method. The similarity measure is expressed in terms of a distance function, and the distance functions differ for interval-scaled, Boolean, categorical, ordinal, and ratio variables. The distance measure determines the similarity of two elements and influences the shape of the clusters. Many distance measures are used, such as the Euclidean distance, the Manhattan distance, and the Minkowski distance; a small sketch of these measures is given below. This paper presents the clustering algorithms K-Means, DBSCAN, and OPTICS and a performance evaluation of those algorithms.
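As an illustration of the distance measures mentioned above, the following NumPy sketch computes the Euclidean, Manhattan, and Minkowski distances between two feature vectors. The sample values are arbitrary and are not taken from the paper.

```python
import numpy as np

def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=3):
    # generalization of the two above: p = 2 gives Euclidean, p = 1 gives Manhattan
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.455, 0.365, 0.095])
b = np.array([0.350, 0.265, 0.090])
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, p=2))
```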
II. K-MEANS
The input parameters of the K-Means algorithm are the number of clusters, k, and a set of n objects; the algorithm iteratively partitions the n objects into k clusters so as to minimize the sum of squared error criterion. Cluster similarity is measured by using the mean value of the cluster objects. First, k objects, each of which represents a cluster mean or center, are randomly selected. Each of the remaining objects is then assigned to a cluster by using the distance between the object and the cluster mean, and for each cluster a new mean value is computed. This process iterates until the sum of squared error is minimized. The sum of squared error is defined as

E = Σ_{i=1}^{k} Σ_{x ∈ C_i} |x − m_i|²    (1)

where C_i is the i-th cluster and m_i is its mean.

K-Means works as follows:
1. The algorithm arbitrarily selects k data objects as the initial cluster means or cluster centers.
2. Compute a distance measure such as the Euclidean distance between each data object and each cluster center, and assign each data object to the closest cluster.
3. Recompute each cluster center as the average of the data objects in that cluster.
4. Repeat steps 2 and 3 until there is no change in the clusters.

A minimal sketch of these steps is given below.
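The following NumPy sketch implements the four steps above and the sum of squared error of Eq. (1). It is an illustrative implementation under simple assumptions (random initialization from the data, a fixed iteration cap, toy data), not the WEKA implementation used in the experiments later in the paper.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily select k data objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the closest center (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each center as the mean of the objects assigned to it
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # Step 4: stop when the cluster centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and the sum of squared error of Eq. (1)
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    sse = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, sse

X = np.random.default_rng(1).random((200, 3))   # toy data, not the Abalone set
labels, centers, sse = kmeans(X, k=10)
print(sse, np.bincount(labels, minlength=10))
```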




III. DBSCAN (Density Based Spatial Clustering of Applications with Noise)
Density-based clustering identifies regions of high density that are separated from one another by regions of low density. The density is defined as a minimum number of objects within a certain distance of each other, and the DBSCAN approach is to create clusters with a minimum size and density. In the center-based approach, a point is classified as a core point, a border point, or a noise point. A core point has more than a specified number of points (MinPts) within a specified radius (Eps). A border point has fewer than MinPts within Eps but is in the neighborhood of a core point. A noise point is neither a core point nor a border point. The ε-neighborhood of an object is the neighborhood within radius ε of that object, and an object is called a core object if its ε-neighborhood contains at least a minimum number, MinPts, of objects. An object p is directly density-reachable from an object q if p is within the ε-neighborhood of q and q is a core object. A point is density-reachable from another point if there is a chain of points from one to the other in which consecutive points are directly density-reachable from each other. An object p is density-connected to an object q if there is an object o such that both p and q are density-reachable from o. The algorithm defines a cluster as a maximal set of density-connected points. [1]

DBSCAN works as follows:
1. Label all points as core, border, or noise points.
2. Eliminate the noise points.
3. Put an edge between all core points that are within the ε-neighborhood of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign every border point to one of the clusters of its associated core points.

A usage sketch of these steps is given below.
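The following is a minimal scikit-learn sketch of a single DBSCAN run and of the quantities reported later in Table II (number of generated clusters, number of unclustered instances, cluster sizes). The feature matrix here is a placeholder; scikit-learn is used only as a convenient stand-in for the WEKA implementation the paper actually uses, so counts will not match the tables.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder features; real runs would use the prepared Abalone attributes
X = np.random.default_rng(0).random((500, 8))

# eps corresponds to the paper's ∈ (neighborhood radius), min_samples to MinPts
labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(X)   # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("generated clusters:", n_clusters)
print("unclustered instances:", n_noise)
for c in range(n_clusters):
    print(f"C{c}-{np.sum(labels == c)}")   # cluster sizes, in the style of Table II
```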
IV. OPTICS
OPTICS is a density-based clustering method that creates an ordering of the points which allows the extraction of clusters for arbitrary values of ε. The parameter ε is a distance, the neighborhood radius. Therefore, in order to generate a set or an ordering of density-based clusterings, we would have to provide a set of distance parameter values. To construct the different clusterings simultaneously, the objects are processed in a specific order: the order always selects an object that is density-reachable with respect to the lowest ε value, so that clusters with higher density (lower ε) are finished first. The generating distance ε is the largest distance considered for clusters; clusters can be extracted for all εi such that 0 ≤ εi ≤ ε. Based on this idea, two values need to be stored for each object: the core-distance and the reachability-distance. The core-distance of an object p is the smallest ε' value that makes p a core object; if p is not a core object, the core-distance of p is undefined. The reachability-distance of an object p with respect to another object o is the greater of the core-distance of o and the Euclidean distance between p and o; if o is not a core object, the reachability-distance of p with respect to o is undefined. [1]
A reachability plot of these values for a simple 2-dimensional data set shows how the data are clustered. [1] A small sketch of these two quantities in code is given below.
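For illustration, the scikit-learn OPTICS implementation exposes these per-object values (core_distances_, reachability_) together with the processing order (ordering_), and cluster_optics_dbscan extracts DBSCAN-like clusters for any εi ≤ ε from that ordering. This is a hedged sketch on placeholder data, not the WEKA OPTICS used in the experiments.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

X = np.random.default_rng(0).random((500, 8))   # placeholder features

# max_eps plays the role of the generating distance ε, min_samples of MinPts
opt = OPTICS(min_samples=5, max_eps=0.8).fit(X)

# Per-object values stored by the algorithm, listed in processing order
order = opt.ordering_
reach = opt.reachability_[order]      # reachability-distances (the reachability plot)
core = opt.core_distances_[order]     # core-distances
print(reach[:5], core[:5])

# Extract clusters for a smaller radius εi ≤ ε without re-running OPTICS
labels_04 = cluster_optics_dbscan(
    reachability=opt.reachability_,
    core_distances=opt.core_distances_,
    ordering=opt.ordering_,
    eps=0.4,
)
print("clusters at eps=0.4:", len(set(labels_04)) - (1 if -1 in labels_04 else 0))
```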
V. PERFORMANCE EVALUATION
The results below were obtained by running the clustering techniques on the Abalone data set, which consists of 4177 instances and 9 attributes: gender, length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, and rings. To run the algorithms we use the Waikato Environment for Knowledge Analysis (WEKA, version 3.6.10), an open-source data mining tool developed at the University of Waikato, New Zealand, and freely available under the GNU General Public License. The experiment is performed on a Core Duo machine with a 2.10 GHz CPU and 4 GB of RAM. The result of each clustering algorithm is shown and described below; a hedged data-preparation sketch follows.
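For readers who want to reproduce a comparable setup outside WEKA, the following sketch loads the UCI Abalone data and prepares a numeric feature matrix, label-encoding the categorical sex attribute and scaling the remaining attributes. The file name and the preprocessing choices are assumptions of this sketch, not details given in the paper.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# UCI Abalone data (assumed local copy of the standard abalone.data file)
cols = ["Sex", "Length", "Diameter", "Height", "Whole weight",
        "Shucked weight", "Viscera weight", "Shell weight", "Rings"]
df = pd.read_csv("abalone.data", header=None, names=cols)

# Encode the categorical attribute and scale everything to [0, 1],
# so that distance-based parameters such as ∈ are on a comparable scale
df["Sex"] = df["Sex"].map({"M": 0, "F": 1, "I": 2})
X = MinMaxScaler().fit_transform(df.values.astype(float))
print(X.shape)   # expected: (4177, 9)
```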
Table I shows the result of the K-Means clustering algorithm on the Abalone data set for different numbers of clusters.

TABLE I. The result of K-MEANS on the Abalone data set (4177 instances, 9 attributes) for different numbers of clusters

No. of clusters:        10       15       20       25
No. of iterations:      30       37       63       34
Sum of squared error:   165.36   92.73    71.41    62.48

Clustered instances:
k = 10: C0-379, C1-296, C2-851, C3-287, C4-78, C5-201, C6-632, C7-373, C8-403, C9-677
k = 15: C0-239, C1-201, C2-200, C3-287, C4-78, C5-201, C6-412, C7-373, C8-403, C9-475, C10-349, C11-188, C12-503, C13-162, C14-106
k = 20: C0-149, C1-42, C2-202, C3-205, C4-49, C5-166, C6-314, C7-315, C8-305, C9-358, C10-219, C11-132, C12-360, C13-139, C14-182, C15-299, C16-302, C17-264, C18-102, C19-73
k = 25: C0-103, C1-42, C2-89, C3-55, C4-46, C5-166, C6-249, C7-301, C8-272, C9-258, C10-222, C11-104, C12-300, C13-127, C14-142, C15-318, C16-189, C17-209, C18-68, C19-39, C20-175, C21-313, C22-182, C23-163, C24-45

When the number of clusters increases, the corresponding sum of squared error decreases. Figs. 1-4 show the graphs which visualize the cluster assignments produced by K-Means on the Abalone dataset when the input parameter, the number of clusters, is 10, 15, 20, and 25, respectively.

Figure 1. Cluster assignments in K-MEANS when the number of clusters is 10
Figure 2. Cluster assignments in K-MEANS when the number of clusters is 15
Figure 3. Cluster assignments in K-MEANS when the number of clusters is 20
Figure 4. Cluster assignments in K-MEANS when the number of clusters is 25

Table II shows the result of DBSCAN on the Abalone data set, which contains 9 attributes and 4177 instances, for different ∈ and MinPts parameters.
TABLE II. The result of DBSCAN on the Abalone data set with different ∈ and MinPts parameters

∈ / MinPts:                 0.1/2    0.2/2    0.3/6    0.4/5    0.8/5    1.0/5
No. of generated clusters:  28       4        3        3        3        3
No. of unclustered
instances:                  179      11       6        1        1        0
Elapsed time:               24.66    20.72    23.27    26.5     25.58    26.38

Clustered instances:
∈ = 0.1, MinPts = 2: C0-1422, C1-1197, C2-1307, C3-4, C4-2, C5-7, C6-6, C7-2, C8-4, C9-3, C10-2, C11-2, C12-2, C13-2, C14-2, C15-2, C16-2, C17-2, C18-2, C19-8, C20-4, C21-2, C22-2, C23-2, C24-2, C25-2, C26-2, C27-2
∈ = 0.2, MinPts = 2: C0-1524, C1-1300, C2-1339, C3-3
∈ = 0.3, MinPts = 6: C0-1527, C1-1304, C2-1340
∈ = 0.4, MinPts = 5: C0-1528, C1-1306, C2-1342
∈ = 0.8, MinPts = 5: C0-1528, C1-1306, C2-1342
∈ = 1.0, MinPts = 5: C0-1528, C1-1307, C2-1342
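The ∈/MinPts sweep of Table II can be expressed compactly in code. The sketch below mirrors the table's parameter settings with scikit-learn's DBSCAN on a prepared feature matrix X (for example, the Abalone features from the earlier data-preparation sketch); it is only an analogue of the WEKA runs, so exact counts and timings will not match the table.

```python
import time
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_sweep(X, settings):
    for eps, min_pts in settings:
        t0 = time.perf_counter()
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        elapsed = time.perf_counter() - t0
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        print(f"eps={eps}, MinPts={min_pts}: "
              f"{n_clusters} clusters, {n_noise} unclustered, {elapsed:.2f} s")

# The parameter settings used in Table II
settings = [(0.1, 2), (0.2, 2), (0.3, 6), (0.4, 5), (0.8, 5), (1.0, 5)]
X = np.random.default_rng(0).random((1000, 9))   # placeholder; use the prepared Abalone features
dbscan_sweep(X, settings)
```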

Figs. 5-10 show the graphs which visualize the cluster assignments in DBSCAN on the Abalone dataset for the input parameters ∈ = 0.1 with MinPts = 2, ∈ = 0.2 with MinPts = 2, ∈ = 0.3 with MinPts = 6, ∈ = 0.4 with MinPts = 5, ∈ = 0.8 with MinPts = 5, and ∈ = 1.0 with MinPts = 5, respectively.

Figure 5. Cluster assignments in DBSCAN when ∈ = 0.1, MinPts = 2 for the Abalone dataset
Figure 6. Cluster assignments in DBSCAN when ∈ = 0.2, MinPts = 2 for the Abalone dataset
Figure 7. Cluster assignments in DBSCAN when ∈ = 0.3, MinPts = 6 for the Abalone dataset
Figure 8. Cluster assignments in DBSCAN when ∈ = 0.4, MinPts = 5 for the Abalone dataset
Figure 9. Cluster assignments in DBSCAN when ∈ = 0.8, MinPts = 5 for the Abalone dataset
Figure 10. Cluster assignments in DBSCAN when ∈ = 1.0, MinPts = 5 for the Abalone dataset

Table III shows the result of OPTICS clustering on the Abalone data set, which consists of 9 attributes and 4177 instances, for different ∈ and MinPts parameters.

TABLE III. The result of OPTICS on the Abalone data set with different ∈ and MinPts parameters

∈ / MinPts:                 0.1/2    0.2/2    0.3/6    0.4/5    0.8/5    1.0/5
No. of generated clusters:  0        0        0        0        0        0
No. of unclustered
instances:                  4177     4177     4177     4177     4177     4177
Elapsed time:               25.71    24.38    20.89    30.24    18.32    31.62

Figs. 11-16 show the graphs which visualize the cluster assignments in OPTICS on the Abalone dataset for the input parameters ∈ = 0.1 with MinPts = 2, ∈ = 0.2 with MinPts = 2, ∈ = 0.3 with MinPts = 6, ∈ = 0.4 with MinPts = 5, ∈ = 0.8 with MinPts = 5, and ∈ = 1.0 with MinPts = 5, respectively.

Figure 11. Cluster assignments in OPTICS when ∈ = 0.1, MinPts = 2 for the Abalone dataset
Figure 12. Cluster assignments in OPTICS when ∈ = 0.2, MinPts = 2 for the Abalone dataset
Figure 13. Cluster assignments in OPTICS when ∈ = 0.3, MinPts = 6 for the Abalone dataset
Figure 14. Cluster assignments in OPTICS when ∈ = 0.4, MinPts = 5 for the Abalone dataset
Figure 15. Cluster assignments in OPTICS when ∈ = 0.8, MinPts = 5 for the Abalone dataset
Figure 16. Cluster assignments in OPTICS when ∈ = 1.0, MinPts = 5 for the Abalone dataset

CONCLUSION
The K-Means algorithm can only be applied when the mean of a cluster is defined. It produces good-quality clusters even on large datasets, but the number of clusters, K, must be specified in advance, and K-Means will not identify outliers. DBSCAN can find clusters of arbitrary shape and can determine what information should be classified as noise or outliers, and it is very fast compared to the other algorithms. In DBSCAN, however, the user has the responsibility of selecting the parameter values (ε and MinPts), and slightly different parameter settings may lead to different clusters. It also has difficulty distinguishing separate clusters that are located too close to each other, even when they have different densities. The OPTICS algorithm was developed to overcome this difficulty. OPTICS ensures good-quality clustering by maintaining the order in which the data objects are processed, i.e., high-density clusters are given priority over lower-density clusters. OPTICS also requires parameters (ε and MinPts) to be specified by the user, and these affect the result. The efficiency of clustering algorithms can be improved by removing the limitations of the clustering techniques.

REFERENCES
[1] Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques", Elsevier, Second Edition.
[2] Pang-Ning Tan, Vipin Kumar, Michael Steinbach, "Introduction to Data Mining", Pearson.
[3] Bharat Chaudhari, Manan Parikh, "A Comparative Study of Clustering Algorithms Using Weka Tools", International Journal of Application or Innovation in Engineering & Management (IJAIEM), Volume 1, Issue 2, October 2012, ISSN 2319-4847, pp. 154-158.
[4] V.V. Jaya Rama Krishnaiah, Dr. K. Ramchand H Rao, Dr. R. Satya Prasad, "Entropy Based Mean Clustering: A Enhanced Clustering Approach", The International Journal of Computer Science & Applications (TIJCSA), Volume 1, No. 3, May 2012, ISSN 2278-1080, pp. 1-9.
[5] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", KDD-96 Proceedings, pp. 226-231.
[6] Narendra Sharma, Aman Bajpai, Ratnesh Litoriya, "Comparison the Various Clustering Algorithms of Weka Tools", International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 5, May 2012, ISSN 2250-2459, pp. 73-80.
[7] Kaushik H. Raviya, Kunjan Dhinoja, "An Empirical Comparison of K-Means and DBSCAN Clustering Algorithm", Paripex - Indian Journal of Research, Volume 2, ISSN 2250-1991, pp. 153-155.
[8] Pradeep Rai, Shubha Singh, "A Survey of Clustering Techniques", International Journal of Computer Applications (0975-8887), Volume 7, No. 12, October 2010, pp. 1-5.
[9] P. Indira Priya, Dr. D.K. Ghosh, "A Survey on Different Clustering Algorithms in Data Mining Technique", International Journal of Modern Engineering Research (IJMER), Vol. 3, Issue 1, Jan-Feb 2013, ISSN 2249-6645, pp. 267-274.
[10] B.G. Obula Reddy, Dr. Maligela Ussenaiah, "Literature Survey on Clustering Techniques", IOSR Journal of Computer Engineering (IOSRJCE), ISSN 2278-0661, Volume 3, Issue 1 (July-Aug. 2012), pp. 01-12.
