Research on the k-Means Clustering Algorithm
Abstract:
Cluster analysis is one of the main analytical methods in data mining, and the
choice of clustering algorithm directly influences the clustering results. This
paper discusses the standard k-means clustering algorithm and analyzes its
shortcomings: in particular, the standard k-means algorithm has to calculate the
distance between each data object and all cluster centers in each iteration, which
makes clustering inefficient. To address this problem, this paper proposes an
improved k-means algorithm that uses a simple data structure to store some
information in every iteration, to be reused in the next iteration. The improved
method avoids repeatedly computing the distance from each data object to the
cluster centers, saving running time. Experimental results show that the improved
method can effectively improve both the speed and accuracy of clustering while
reducing the computational complexity of k-means.
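The improvement described above can be read, on one common interpretation (not necessarily the authors' exact procedure), as keeping a per-point record of its current cluster and its distance to that cluster's centre, so that a full scan over all centres is only needed when a point may have moved closer to a different centre. A minimal NumPy sketch of that bookkeeping idea, with hypothetical names such as `improved_kmeans`:

```python
# A minimal sketch (assumed interpretation) of the per-iteration bookkeeping idea:
# each point stores its current label and its distance to that cluster's centre,
# and a full scan over all centres is done only when that distance has grown.
import numpy as np

def improved_kmeans(X, k, max_iter=100, seed=0):   # hypothetical name
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    labels = np.full(len(X), -1, dtype=int)
    dists = np.full(len(X), np.inf)                # stored distance to own centre

    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):
            if labels[i] >= 0:
                d_own = np.linalg.norm(x - centers[labels[i]])
                if d_own <= dists[i]:
                    # No farther from its centre than before: keep the label and
                    # skip recomputing the distance to every other centre.
                    dists[i] = d_own
                    continue
            # First pass, or the point drifted away from its centre: full scan.
            all_d = np.linalg.norm(centers - x, axis=1)
            j = int(np.argmin(all_d))
            if j != labels[i]:
                changed = True
            labels[i], dists[i] = j, all_d[j]
        # Update each centre as the mean of its current members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
        if not changed:
            break
    return labels, centers
```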
Introduction
Extracting meaningful and tangible information from collected data is the primary
goal of data mining [4]. However, most data are collected in arbitrary forms and
categories, making such data difficult to analyse, especially when the data objects’
features are unknown. Appropriate organization of unlabeled data is an aspect of
data mining handled by cluster analysis. The meaningful grouping of such unlabeled
data is regarded as data clustering: data objects with similar characteristics and
attributes are placed in the same cluster, so that the similarity among objects
within a cluster is higher than their similarity to objects in other clusters. In
other words, cluster analysis classifies unlabeled data to ensure higher
intra-cluster similarity and lower inter-cluster similarity [59]. Clustering can be
viewed as a learning process; since the data are unlabeled, it corresponds to
unsupervised learning [55]. Fig. 1 clearly
illustrates this spectrum of different categories of learning problems of interest in
pattern recognition and machine learning, as discussed in Jain [95].
Cluster analysis has been successfully applied to address data clustering problems
in different domains such as medical science, manufacturing, robotics, the financial
sector, privacy protection, artificial intelligence, urban development, aviation,
industries, sales, and marketing [61], [7], [180], [59], [20], [111], [49]. Extracting
useful information from data in these domains is essential for providing better
services and generating more profits [181], [148], [172]. Real-world data generated
are mostly voluminous, unlabeled, and of varying dimensionality, which makes data
clustering difficult. The number of clusters in a real-world dataset cannot be
identified in advance, so determining the optimal number of clusters in a dataset
characterized by high density and dimensionality is difficult for standard
clustering algorithms. This poses a significant challenge to conventional
clustering algorithms, in which the number of clusters must be specified as an
input to the algorithm.
Algorithms for data clustering are grouped into two major categories [97], [224],
[68], [60], namely, hierarchical clustering algorithms and partitional clustering
algorithms. Hierarchical clustering algorithms partition data objects into clusters in
a hierarchical form either in a bottom-up approach (agglomerative method) or a
top-down approach (divisive method). In the agglomerative method, individual data
objects are merged iteratively based on their similarity. In the divisive method, the
initial dataset is taken as a single cluster and broken down iteratively using data
object similarity until each data object forms a single cluster or a set criterion is
met. The hierarchical clustering algorithm produces a dendrogram of merged
(agglomerative) or split (divisive) data objects depicting the corresponding cluster
hierarchy generated as output for the cluster analysis [60]. The dendrogram is a
pictorial representation of the data objects’ nested grouping showing the similarity
level at which each grouping changes [97].
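As an illustration of the agglomerative (bottom-up) approach described above, the short SciPy sketch below builds a linkage matrix, from which a dendrogram or a flat clustering can be derived; the two-blob toy dataset and parameter choices are illustrative assumptions, not taken from the survey:

```python
# Illustrative only: agglomerative (bottom-up) hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs of 2-D points (assumed toy data).
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])

Z = linkage(X, method="ward")                     # iterative bottom-up merges
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)
```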
In the partitional clustering approach, a single partition of the initial dataset is
produced instead of a nested clustering structure such as a dendrogram. Clusters are
produced heuristically while optimizing a criterion function defined globally on all
the data objects in the set or locally on a subset of the data objects [246], [9],
[189]. Optimizing a criterion function on a set of the data objects using a
combinatorial search of all possible values to get the optimum value is
computationally prohibitive. Therefore, partitional clustering algorithms require the
specification of different k values supplied at different runs to obtain the best
configuration to produce the optimum clusters.
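Because partitional methods such as K-means take k as an input, practitioners commonly run the algorithm for several candidate k values and compare a criterion such as the within-cluster sum of squares (the "elbow" heuristic). The snippet below, which uses scikit-learn and a synthetic dataset purely for illustration, sketches that procedure:

```python
# Illustrative only: run K-means for several candidate k values on a synthetic
# dataset and print the within-cluster sum of squares (inertia) for each run.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  within-cluster SSE={km.inertia_:.1f}")

# The k at which the SSE curve flattens (the "elbow") is taken as a reasonable
# estimate of the number of clusters.
```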
The K-means clustering algorithm was proposed independently by different researchers,
including Steinhaus [203], Lloyd [132], MacQueen [135], and Jancey [98] from
different disciplines during the 1950s and 1960s [171]. These researchers’ various
versions of the algorithms show four common processing steps with differences in
each step [171]. The K-means clustering algorithm generates clusters using the
cluster’s object mean value [197], [34]. In the standard K-means algorithm, the
cluster number is required as a user parameter and is used in the arbitrary cluster
center selection from the dataset. However, the K-means algorithm may converge
to a local minimum because of its greedy nature [95]. Therefore, it requires several
runs for a given k value with different initial cluster center selections to obtain the
optimal cluster result [243], [59], [19]. In addition, the standard algorithm detects
ball-shaped or spherical clusters only because of the use of the Euclidean metric as
its distance measure [95]. A typical K-means clustering process is illustrated in Fig.
2. With a set of input data supplied to the K-means clustering algorithm, the
centroid vector C={c1,c2,...,ck} can easily be identified with K being the number of
centroids defined by the user. Fig. 2a illustrates a data set in 2D space distributed
randomly with -100≤xi,yi≤100, and Fig. 2b presents the K-means clustering result
with the number of centroids set to K=3.
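A rough, assumed reconstruction of the setup described for Fig. 2 (only the data range and K=3 are stated in the text): uniformly distributed 2-D points with -100 ≤ xi, yi ≤ 100 are clustered with K-means restarted from several initial centre selections, keeping the run with the lowest within-cluster sum of squares to reduce the risk of a poor local minimum:

```python
# Illustrative reconstruction of the Fig. 2 setup (assumptions: 500 points,
# uniform distribution); only the range [-100, 100] and K = 3 come from the text.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-100, 100, size=(500, 2))          # -100 <= xi, yi <= 100

# n_init restarts the algorithm from several initial centre selections and keeps
# the run with the lowest within-cluster sum of squares (inertia), mitigating
# convergence to a poor local minimum.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroid vector C:", km.cluster_centers_)
print("within-cluster SSE of the best run:", km.inertia_)
```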
Despite these limitations, the K-means clustering algorithm is credited with
flexibility, efficiency, and ease of implementation. It is also among the top ten
clustering algorithms in data mining [59], [217], [105], [94]. The simplicity and low
computational complexity have given the K-means clustering algorithm a wide
acceptance in many domains for solving clustering problems. Several K-means
clustering algorithm variants have been developed to enhance its performance. This
work presents an overview of the K-means clustering algorithm and its variants with
a proposed taxonomy for the variants. The algorithm’s research progression from its
inception, the current trends, open issues, and challenges with recommended future
research perspectives are also discussed in detail.
In this paper, the following focal research question was proposed to reflect the
purpose of this comprehensive review work:
“What are the existing variants of the K-means algorithm for solving clustering
problems, from its inception to date?”
In providing answers to the main research question, the following sub-research
questions were considered:
a) What research has been conducted to improve on the standard K-means clustering algorithm?
b) What methods have been adopted in the various research found in (a) for improving the performance of the K-means clustering algorithm?
c) What are the performances of the reported K-means clustering algorithm variants?
d) What are the current research progressions involving the K-means clustering algorithm?
This review work is presented from four perspectives: first, a systematic review
of the K-means clustering algorithm and its variants. Second, a presentation of a
proposed novel taxonomy of K-means clustering methods in the literature. Third,
verifications of the findings on all aspects of K-means clustering methods through
an in-depth analysis. Fourth, an outline of open issues and challenges and
recommended future trends. The main idea is to present a comprehensive
systematic review that will provide current researchers and practitioners with a
pathway for future novel research involving the K-means clustering algorithm. The
main contributions of this research work are summarized below:
• A comprehensive review of the K-means algorithm is presented, including a proposed taxonomy of recent variants and trending application areas of the K-means clustering algorithm.
• Open research issues relating to adopting metaheuristic algorithms as automatic cluster number generators to improve the K-means algorithm's performance quality are identified and discussed.
• Finally, research gaps and the future scope of the K-means algorithm in general, particularly in outlining a new perspective for solving the challenges of the K-means clustering algorithm and its variants, are identified.
The rest of the paper is organized as follows: Section 1 introduces the background
work on the proposed review study; Section 2 outlines the methodology approach;
Section 3 presents a proposed taxonomy of K-means clustering methods found in the
literature, followed by a detailed discussion of the review of the K-means algorithm
variants; Section 4 discusses the review findings; Section 5 reports the current
trending areas of application of the K-means algorithm; Section 6 outlines the open
issues and challenges of K-means clustering methods with recommended future
trends; and Section 7 concludes the review.
Research methodology
This study aims to conduct a review of the K-means clustering algorithm
variants. The research methodology adopted for the study is presented in this
section. Kitchenham et al.’s [113] guidelines for a systematic literature review of
computer technology were adopted for the study. Four phases are involved in the
review: planning, study search and selection, data acquisition, and data analysis.
The planning phase, reported in Section 1, includes establishing the problem
statement.
Standard K-means clustering algorithm
The K-means clustering algorithm is categorized as a partitional clustering
algorithm. Partitioning a given dataset into clusters involves minimizing the
squared error between each data point and the mean of its cluster; each data
point is assigned to the cluster centre nearest to it.
Mathematically, given a dataset $X = \{x_i \mid i = 1, 2, \ldots, n\}$ of $n$ $d$-dimensional data points, $X$ is partitioned into $k$ clusters $C = \{c_j \mid j = 1, 2, \ldots, k\}$ such that
$$J(c_k) = \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2$$
where $\mu_k$ is the mean (centre) of cluster $c_k$.
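A compact NumPy sketch of the standard (Lloyd-style) iteration that the objective above describes, assuming random selection of the initial centres: each point is assigned to its nearest centre, each centre is then moved to the mean of its members, and the loop stops when the centres no longer change. Function and variable names here are illustrative only:

```python
# A compact NumPy sketch of the standard K-means loop (illustrative names only).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)].copy()   # initial centres

    for _ in range(max_iter):
        # Squared Euclidean distance of every point to every centre: shape (n, k).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                         # nearest-centre assignment
        new_mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):                        # centres stopped moving
            break
        mu = new_mu

    sse = ((X - mu[labels]) ** 2).sum()                    # sum of squared errors J
    return labels, mu, sse
```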
Discussion
This study presents an extensive literature review of the various improvements to
the K-means algorithm, mainly from 2010 to date. The review has surveyed the
variety of available modifications to the standard K-means algorithm's design and
implementation that are intended to enhance its clustering performance and speed.
The current study found that the improvements span all the major aspects of
algorithm design, including the algorithm's input, processes, output, and concept
modification.
Trending application areas of the K-means algorithm
The K-means clustering algorithm and variants have been applied widely in many
research areas, including: image recognition [12], image processing [151], market
analysis [70], data processing [152], medical image segmentation [153], [151], risk
evaluation [248], medical diagnosis [200], [245], medical services [215], [100], etc.
In the health sector, some parts of the human body have been tested and examined
for tumor detection (e.g., brain tumors and cancers).
Open issues and challenges
The main aim of the K-means algorithm and its variants is to group any given
dataset into k clusters such that the data objects within clusters are similar but
different from the ones in other clusters. Open issues and challenges in the K-means
algorithm and its variants include the challenges common to the generality of
clustering techniques as well as those peculiar to it.
Initialization Problem: The initialization problem in K-means is twofold: defining the
accurate number of clusters to be used and selecting appropriate initial cluster centres.
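For the centre-selection half of the initialization problem, one widely used remedy (k-means++ seeding, mentioned here as a general technique rather than a finding of this review) picks the first centre at random and each subsequent centre with probability proportional to its squared distance from the nearest centre already chosen. A short sketch:

```python
# Illustrative sketch of k-means++ seeding (a general technique, not specific
# to this review): later centres are drawn with probability proportional to
# their squared distance from the nearest centre already chosen.
import numpy as np

def kmeanspp_init(X, k, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                    # first centre: uniform
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                              # D^2 weighting
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```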
Conclusion
The K-means clustering algorithm is known for its simplicity and is applied in
clustering datasets from different domains. Despite this advantage, its performance
is greatly hampered due to some of the problems inherent in its implementation. As
a result, much research has been conducted to improve the algorithm’s general
performance. This review work has been able to identify the various limitations of
the standard algorithm and the numerous variants developed to solve the identified
problems.