
Short Synopsis

For
Ph. D. Programme 2013-14

Title: Design and Development of Efficient Clustering Techniques in Data Mining

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

FACULTY OF ENGINEERING & TECHNOLOGY

Submitted by:
Name: Anupama Luthra
Registration No.: 13/Ph.D./015

Supervisor:

Name: Dr. Suresh Kumar


Designation: Professor
ABSTRACT

K-Means is a widely used partition-based clustering algorithm which organizes an input dataset
into a predefined number of clusters. Simplicity and speed in classifying massive data are
two features that have made K-Means a very popular algorithm. The original K-Means
algorithm clusters numerical data; its extensions K-Modes and K-Prototype work on
categorical and mixed data sets respectively. K-Means has a major limitation: the number
of clusters, K, needs to be pre-specified as an input to the algorithm. In the absence of thorough
domain knowledge, or for a new and unknown dataset, this advance estimation and
specification of the cluster number typically leads to forced clustering of the data, and a proper
classification does not emerge. K-Prototype additionally needs a unified similarity
metric covering both numerical and categorical data.

The author will propose algorithms based on K-Means and its extensions K-Modes and
K-Prototype, but with the added features of intelligent data analysis and automatic
generation of an appropriate number of clusters. The clusters generated automatically and
intelligently by the proposed algorithms will be compared against the results obtained when
an ideally pre-optimized number of clusters is specified in K-Means. This will be done using
many different real data sets. A unified similarity metric that works with mixed data sets
will also be proposed.

Keywords: Clustering, K-Means, K-Modes, K-Prototype, Dependency, Prior input, Number of clusters, Unified similarity metric
CONTENTS

S.No. Description

1 Introduction
2 Literature Review
2.1 Extension of K-Means
2.2 Extension of K-Modes and K-Prototype
2.3 Unified similarity metric for mixed data sets
3 Description of Broad Area
3.1 Introduction to Data Mining
3.2 Data Mining Techniques
3.2.1 Association Rule Analysis
3.2.2 Classification
3.2.3 Clustering
3.3 Clustering Methods
3.3.1 Partitioning Method
3.3.2 Hierarchical Method
3.3.3 Density Based Method
3.3.4 Grid-Based Method
3.4 Introduction to K-Means algorithm
3.5 K-Modes algorithm
3.6 K-Prototype algorithm
4 Objectives of the Study
5 Methodology to be adopted
6 Expected outcome of the research
7 References
1. INTRODUCTION

In today's globalized and increasingly smaller world, markets are barrier-free and the
reach of businesses is expanding beyond cities to entire nations, and even across the globe.
Large organizations in banking, automobile, agriculture, education, etc. use huge amounts of
data and classify it to understand demographics, consumption, usage patterns and so on. These
businesses also generate huge amounts of data themselves, and this data is growing exponentially.
To take the right decisions at the right time, this data needs to be processed
efficiently and accurately so that correct interpretations can be drawn. This entire philosophy of
storing, maintaining, classifying and interpreting data, to find patterns or trends for better
business decisions, is an emerging area of research.

To help organizations take the right decisions at the right time, data mining provides
techniques to process large amounts of data efficiently and present it in the required form.
Data mining is the process of drawing out useful patterns or knowledge from the huge data
collected in information systems and using these patterns to take safe and smart decisions.
The predefined methods and algorithms that are used to extract these useful patterns are
together called data mining techniques. Some popular data mining techniques include
Frequent Pattern Mining, Association Rule Analysis, Classification and Clustering.

Clustering is a technique of segregating objects into partitions such that the objects in
one group are more similar to each other than to the objects in other groups. Clustering has
applications in a variety of domains like psychology, statistics, medicine, engineering and
computer science. For example, in an organization, grouping and identifying the products
which are not in high demand may help in reducing their production to cut losses. Further, in
educational institutions, grouping students according to their academic performance may
help in identifying students with lower grades; these students can be motivated to attend
remedial classes to overcome their difficulties. In the health sector, data mining and clustering
may help in identifying links between disease symptoms. In the banking sector, clustering can be
used to group customers with overdue credit card payments. In market research, it
can be used to identify customers having certain buying patterns. A lot of work is being done
to apply this technique in various other areas as well.

Many clustering algorithms have been proposed in the literature [7, 20]. These clustering
algorithms are broadly classified into two categories: hierarchical and partitional. The
hierarchical algorithms find clusters by arranging them into a hierarchy (top-down or bottom-up);
as a result they are not suitable for large data sets. Partition-based
clustering algorithms, on the other hand, find the clusters independently, so they can easily partition large
datasets. In this method, the given dataset of n objects is partitioned into k groups or clusters,
where k ≤ n, with the constraint that each group must comprise at least one object and
each object is a member of only one cluster. Partition-based clustering is an iterative method in
which the clusters, once created, are further improved by shifting objects from one cluster
to another depending upon the value of some objective function. The K-Means algorithm is one of the
commonly used techniques in this category.

K-Means is a simple algorithm known for its speed. The algorithm is not expensive in terms
of cost and works well with high-dimensional and large data sets. However, it has some
limitations. One major limitation is that the clusters produced are highly
dependent on the objects initially selected as centroids (cluster centers). As the initial
centroids are selected randomly, K-Means may not produce the same result on
different runs over the same data set. A lot of work has been done to overcome this limitation.

Another limitation is the requirement to specify a predefined value of K (the number of clusters)
as input. This value is domain specific, and if the person using the algorithm is not a domain expert,
an incorrect number of clusters may be input, leading to inefficient grouping of the data.
Researchers are still exploring ways to overcome this limitation.

To overcome these limitations of the K-Means algorithm, the author will propose algorithms based
on K-Means and its extensions K-Modes and K-Prototype, for numerical,
categorical and mixed data sets respectively, which do not require the value of K as input. In order to
increase the accuracy of the clusters produced by the K-Prototype algorithm, a similarity
metric that works with mixed data sets will also be proposed.

2. LITERATURE REVIEW

The literature review is divided into three sections. Section 2.1 deals with the
attempts that have been made in the literature to remove the limitation of giving the required number
of clusters as an input for numerical data. Section 2.2 discusses the work done on
removing the limitation of providing the required number of clusters (K) in K-Means
for categorical and mixed data. Section 2.3 discusses the similarity metrics that have
been suggested for mixed data sets.

2.1 The contribution of some authors towards removing the limitation of providing the value of
K initially for numerical data is discussed below:

Pelleg Dan et al. [16] suggested the X-Means algorithm as an extension of K-Means, which
requires the user to input a range representing the lower and upper values of K instead of a
particular value of K. The algorithm initially takes the lower bound of the given range as K and
continues to add centroids until the upper bound is reached. It terminates as soon
as it obtains the centroid set that scores the best. The drawback is that the user
must still supply a range suggesting the lower and upper bounds of K.

Tibshirani R. et al. [18] used the technique of the Gap Statistic. In this technique the quality of the
clusters produced is verified using an appropriate reference distribution. The algorithm works
well with well-separated clusters.

Wagstaff Kiri et al. [19] suggested utilizing information about the problem domain in order
to put some constraints on the data set. During the clustering process it is ensured that none
of the constraints is violated. This algorithm requires domain-specific information,
which is sometimes difficult to obtain.

Cheung Yiu-Ming [5] proposed an extension of the K-Means clustering technique named the
STep-wise Automatic Rival-penalized (STAR) K-Means algorithm to overcome the major
limitations of K-Means. In the first step of the algorithm, cluster centers are provided, and in
the second step the units are adjusted adaptively by a learning rule. The limitation of this
algorithm is the complex computation involved in it.

Shafeeq Ahamed B.M. et al. [3] proposed an algorithm in which the appropriate number of
clusters is found dynamically. The main drawback of this approach is that its computational
time is greater than that of K-Means for larger data sets. Also, the user has to input the value of K
as 2 in the first run.

Leela V. et al. [12] proposed the Y-Means algorithm. Initially, clusters are found by running
K-Means on the data set. A sequence of splitting, deleting and merging the clusters
is then followed to find the optimal number of clusters. The limitation of this algorithm is
that it depends on the K-Means algorithm to find the initial clusters.

Abubaker Mohamed et al. [1] presented an approach based on the K-Nearest Neighbor
method. The only input parameter taken by the algorithm is kn, the number of nearest
neighbors; the drawback of this algorithm is precisely that kn must be supplied as input.

2.2 This section discusses the work that has been done to overcome the limitation of
providing the required number of clusters (K) in K-Means for categorical and mixed
data sets.

San Mar Ohn et al. [17] proposed an algorithm in which a regularization parameter is used to control the
number of clusters during the clustering process. A suitable value of this parameter is
chosen to find the most stable clustering result. The major limitation of the proposed
algorithm is that an input parameter representing the initial cluster centers is required.

Cheung Yiu-ming et al. [6] presented a penalized competitive learning algorithm that requires
some initial value of K, which should not be less than the true value of K. The resulting
clusters are more accurate than those of the original K-Modes and of K-Modes with Ng's dissimilarity
metric proposed by H. Liao and M.K. Ng [14]. However, this algorithm involves heavy
computation.

Liang Jiye et al. [13] extended the K-Prototype algorithm by proposing a new dissimilarity
measure for mixed data sets. Measures of within-cluster entropy and between-cluster
entropy are used to identify the clusters with minimum coherence in a mixed dataset. The
major limitation of this algorithm is that it requires input parameters representing the
minimum and maximum number of clusters that can be generated from the data set.

Ahmad Amir et al. [4] proposed a new cost function for mixed data sets. The authors extended the
K-Means algorithm to work well with mixed datasets by introducing a new distance
measure and a new way of finding the centroids of the clusters. As the clustering process is
based on the significance of each attribute in the dataset, the computation time of the
algorithm increases with the dimensionality of the data.

2.3 Attempts have been made, and are still ongoing, to find a similarity metric that works
with data sets containing mixed attributes. Some of this work is discussed below:

Ahmad Amir et al. [4] proposed a similarity measure for mixed datasets. The distance
measure for categorical attributes is based on the distance between every pair of values
of every attribute. The numerical attributes are normalized and discretized to find an
added weight (wt) to be included in the distance measure. Since in this method a distance is
calculated for each attribute, it is not suitable for noisy and high-dimensional datasets. The
results obtained by this algorithm can be further improved by improving the discretization methods
for numeric-valued attributes.

Cheung Yiu-ming et al. [6] proposed a unified metric for mixed data sets. The authors treated
the categorical and numerical attributes differently: the categorical attributes are
considered one by one, while all the numerical attributes are handled together as a vector in
the clustering process. The use of this unified metric increases computation time.

3. DESCRIPTION OF BROAD AREA

3.1 Data Mining

In today's world, information technology has affected every aspect of life, be it food-preparation
gadgets in the kitchen, financial records in banks, health
records of patients in hospitals, or the academic performance of students in
educational institutes. In the late 1980s a new trend emerged: identifying the meaningful
data collected in information systems and finding useful patterns in it, so that smart and safe decisions
can be based on these patterns. The process of extracting such useful patterns or knowledge is called Data
Mining, or KDD (Knowledge Discovery from Data). The knowledge discovery process in data mining
is shown in Figure 1.

Figure 1: Data Mining as a step in Knowledge Discovery [8]

Simply put, data mining can be defined as the process of automatic detection of relevant patterns
in a database consisting of current and historical data, using predefined approaches and
algorithms (together called data mining techniques). Data mining uses approaches and
methods from multiple disciplines, of which statistics and artificial intelligence (AI)
are the major ones. Statisticians have long used sophisticated techniques to analyze data and
provide business projections, so data mining can be considered a computer-assisted method
of exploring data statistically. Some of the techniques used in data mining are also
borrowed from the field of AI. The major difference is
that data mining techniques deal with huge stores of data, while traditional AI techniques deal with
data sets that fit in the main memory of the computer.

This development of data warehousing and data mining techniques has enabled
organizations to change the nature of their queries. Earlier, the queries of a company were
of the form "What was the total number of sales in City A in the month of April?" Now they
can answer queries such as "Which product should be launched in the coming year
to increase the business of the company?" Most business sectors make use of
data mining techniques for analytical decisions, be it the health sector, agriculture,
the banking sector, educational organizations or any other sector requiring analysis of
current and historical data.

3.2 Data Mining Techniques


This section gives a brief overview of various popular data mining techniques.

3.2.1 Frequent Pattern Mining and Association Rule Analysis

In this technique, patterns that appear frequently in the data set are extracted, and
associations between these frequent patterns are found by generating Association Rules. The
most popular example of this technique is Market Basket Analysis. In this example, the items
that are frequently purchased by customers are found from the data set of their buying
habits; association rules are then generated from these itemsets, which further elaborate
the relations between the frequent item sets.
An association rule is of the form:

buys(X, "bread") => buys(X, "jam") [support = 20%, confidence = 60%]

This rule shows that a person who buys bread tends to buy jam at the same time. Association
rules are measured by their interestingness, which in turn is
represented by two measures, Support and Confidence, expressed as percentages as
shown in the example above. For the above association rule, a support of 20% reflects
that 20% of the customers buy bread and jam together; a confidence of 60% shows that
60% of the customers who have bought bread are likely to buy jam as well. From this
association rule, the manager of a store can reorganize the shelves by putting bread and
jam together in the same place, so that people buying bread are prompted to buy
jam too, which in turn will increase the sale of jam. The Apriori algorithm is one of the most
commonly used techniques for finding frequent item sets in a given data set; from these
item sets the association rules can be generated.
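
To make the two measures concrete, the following minimal Python sketch computes support and confidence over a toy transaction set (the transactions and item names are invented for illustration, not taken from the synopsis):

# Hypothetical transaction data, for illustration only.
transactions = [
    {"bread", "jam", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(antecedent union consequent) / support(antecedent)
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"bread", "jam"}, transactions))       # 0.4 -> 40% support
print(confidence({"bread"}, {"jam"}, transactions))  # 0.5 -> 50% confidence

Algorithms such as Apriori compute exactly these quantities, but prune the search so that not every candidate itemset has to be counted.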

3.2.2 Classification
As the name suggests, this technique is used to divide or classify data into different
groups. Given a dataset with at least one categorical attribute which
represents the class, this technique divides the dataset into groups, where the number of groups
depends on the number of values of that categorical attribute. This process of classification into
groups is done in two steps:

i) Generate Classification Rules
ii) Classify the data

The attributes in the dataset used for generating the Classification Rules are called Input
parameters, and the attribute which represents the class is called the Target attribute.

This can be better understood with the following example:

Table 1: Dataset for Classification


Age Income Applies_for_Loan
Senior Low No
Middle_aged Low Yes
Senior High Yes
Middle_aged High Yes

Here Age and Income are Input parameters and Applies_for_Loan is the Target attribute.
In the first step, two Classification Rules are generated:

If Age = Senior and Income = High Then Applies_for_Loan = Yes

If Age = Middle_aged Then Applies_for_Loan = Yes

In the second step, the classification of data whose class is unknown is performed.
Suppose we add a new object to the dataset with the value of Applies_for_Loan unknown;
then, based on the classification rules generated above, the missing value will be predicted as
shown below:

Table 2: Dataset for predicting the class


Age Income Applies_for_Loan
Senior Low No
Middle_aged Low Yes
Senior High Yes
Middle_aged High Yes
Senior High ?

According to the generated classification rules, the missing value will be predicted as "Yes".
There exist various ways in which classification can be performed. Some of the popular
algorithms are ID3, C4.5 and Fuzzy C4.5.
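
The two-step process can be sketched in a few lines of Python using the rules generated above (the encoding of rules as condition/class pairs and the default class are illustrative assumptions, not the format of any specific algorithm):

# Step one's output, encoded as (condition, class) pairs; illustrative only.
rules = [
    (lambda rec: rec["Age"] == "Senior" and rec["Income"] == "High", "Yes"),
    (lambda rec: rec["Age"] == "Middle_aged", "Yes"),
]

def classify(record, rules, default="No"):
    # Step two: return the class of the first rule whose condition matches.
    for condition, label in rules:
        if condition(record):
            return label
    return default  # assumed fallback when no rule fires

new_record = {"Age": "Senior", "Income": "High"}  # the row with the unknown class
print(classify(new_record, rules))  # Yes

Algorithms like ID3 and C4.5 differ in how they generate the rules (as decision trees), but applying the learned rules to an unseen record follows this same pattern.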

3.2.3 Clustering
Clustering is a technique of segregating objects into partitions such that the objects in a
certain cluster are more similar to each other than to the objects in other clusters. It differs
from classification in that it is unsupervised, while classification is supervised. The
following section discusses clustering methods.

3.3 Clustering Methods

A number of methods have emerged to partition data, and they have been successfully applied to
real-life data mining problems. These methods are discussed below:

3.3.1 Partitioning Method

In this method, the given dataset of n objects is partitioned into k groups or clusters, where
k ≤ n, with the constraint that each group must comprise at least one object and each
object is a member of only one cluster. This is an iterative approach in which the clusters are
refined in every step by moving objects from one cluster to another depending upon the
value of some objective function. Famous algorithms in this category are K-Means
and K-Medoids.

Limitation and Advantage:

The major limitation of this approach is that the algorithms in this category work well
only when the clusters are globular. The method is also sensitive to noise. The advantage is that
the algorithms in this category are scalable.

3.3.2 Hierarchical Method


In this method the objects in the given data set are organized into a tree of clusters. There are
two ways of using this method:

Agglomerative method: Every object starts in a separate cluster, and a
process of merging these unit clusters into larger clusters is repeated until a single cluster
containing all the objects is formed. The AGNES (Agglomerative Nesting) algorithm falls under
this category.

Divisive method: This is the reverse of the agglomerative method. It starts with all the
objects in a single cluster and keeps dividing clusters into smaller ones until unit clusters are
obtained. DIANA (Divisive Analysis) comes under this category.

Limitation and Advantage:

The limitation of this method is that it works well only when the clusters are globular,
and the algorithms in this category are also sensitive to noise and outliers. The advantage of
this method is that any number of clusters desired by the user can be obtained by chopping
the tree of clusters at the appropriate level.
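
A minimal single-linkage sketch of the agglomerative direction in Python (the dist function and the one-dimensional points are illustrative assumptions); stopping the merging loop at k clusters plays the role of chopping the tree mentioned above:

def agglomerative(data, k, dist):
    # AGNES-style bottom-up merging with single linkage:
    # start from unit clusters, repeatedly merge the closest pair.
    clusters = [[p] for p in data]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(dist(p, q) for p in clusters[ab[0]] for q in clusters[ab[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

points = [(0.0,), (0.2,), (5.0,), (5.1,)]
print(agglomerative(points, 2, lambda p, q: abs(p[0] - q[0])))
# [[(0.0,), (0.2,)], [(5.0,), (5.1,)]]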

3.3.3 Density Based Method

This method requires as initial input the minimum number of objects needed to make a cluster
and a neighborhood radius, i.e. the maximum distance between neighboring objects within a cluster.
This method is helpful in handling outliers, since objects that do not fall in any dense region
are left out as noise rather than being forced into a cluster. Famous
algorithms in this category are DBSCAN and OPTICS.

Limitation and Advantage:

The limitation of the algorithms in this category is that they require some input parameters
which are sometimes difficult to predict. The advantage of this approach is that it helps
in detecting outliers, which are left out as noise. Also, this method works well
with clusters of arbitrary shapes.
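
The two inputs mentioned above can be made concrete with the neighborhood test that DBSCAN-style algorithms build on; the following Python sketch (eps, min_pts and the sample points are illustrative assumptions) identifies the dense "core" points and leaves the isolated point out:

import math

def core_points(data, eps, min_pts, dist):
    # A point is "core" if at least min_pts points (itself included)
    # lie within distance eps of it.
    return [p for p in data
            if sum(1 for q in data if dist(p, q) <= eps) >= min_pts]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (9.0, 9.0)]
euclid = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
print(core_points(points, eps=0.2, min_pts=3, dist=euclid))
# [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)] -- (9.0, 9.0) is left out as noise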

3.3.4 Grid-Based Method

As its name suggests, this method uses a multi-resolution grid structure. The objects in
the dataset are divided into a multilevel hierarchical grid structure. The clusters produced by
this method are isothetic, that is, all cluster boundaries are either vertical or horizontal,
with no diagonal boundaries. This constraint can be removed by combining the
density-based and grid-based approaches, as done in the WaveCluster algorithm.

Limitation and Advantage:

The limitation of this method is that the clusters produced are isothetic and hence of low
quality. The major advantage is that this method is efficient in terms of time complexity. A
famous algorithm in this category is STING.

The author will focus on the famous partitioning-based clustering algorithm
K-Means and its extensions, K-Modes and K-Prototype.

3.4 Introduction to K-Means algorithm


K-Means is a partition-based iterative algorithm that uses Euclidean distance in its objective
function. It groups a given data set into a certain number of clusters K (the K in K-Means
represents the number of clusters required). The procedure starts with the selection of K
centroids for K clusters. The clusters and their centroids are then recomputed until all the data
points in each cluster are at minimum distance from their centroid. The
Euclidean distance used is given in Equation 1.

d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (1)

The basic algorithm works as follows [2]:

Algorithm 1: Basic K-Means

Input: Numerical dataset (D), Number of clusters (K)
Output: Elements of the dataset classified into K clusters
1. Select K random points as initial centroids
2. Repeat
3.   Create K clusters by assigning every data point to its closest centroid
4.   Recompute the centroid of each cluster
5. Until the cluster centroids don't change
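
As a concrete illustration of steps 1-5, here is a minimal Python sketch of the basic iteration (this is not the author's planned C# implementation; the small dataset and K = 2 are invented for illustration):

import math
import random

def euclidean(x, y):
    # The Euclidean distance of Equation (1) above.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def k_means(data, k, max_iter=100):
    # Step 1: pick K random points as the initial centroids.
    centroids = random.sample(data, k)
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for point in data:
            idx = min(range(k), key=lambda i: euclidean(point, centroids[i]))
            clusters[idx].append(point)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step 5: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8)]
centroids, clusters = k_means(points, k=2)

The random initialization in step 1 is exactly the source of the first limitation listed below: different runs can start from different centroids and converge to different clusterings.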

The K-Means algorithm has been chosen by the author for the following reasons:
• The algorithm is not very expensive in terms of time.
• It works well with high-dimensional data and large data sets.
• It produces highly cohesive clusters.

Apart from these benefits, it suffers from the following limitations:

• The initial centroids are chosen randomly, so the final clusters produced are highly
sensitive to the initial centroids.
• As each centroid is the mean of the values lying in its cluster, the algorithm is sensitive to
outliers and produces spherical clusters only.
• The value of K is required as input, which is domain specific; if
the person using the algorithm is not a domain expert, this can cause problems.

3.5 K-Modes algorithm
K-Means has gained popularity because of its simplicity and its speed in classifying massive data
rapidly and efficiently. K-Modes extends the K-Means algorithm to cluster categorical
data with the following approach [11]:

• A simple matching dissimilarity function suitable for categorical data is used instead
of Euclidean distance.
• Modes are used as centroids instead of mean values.
• A frequency-based method is used to find the centroids in each iteration of the algorithm.
The basic K-Modes algorithm works as follows [9]:

Algorithm 2: Basic K-Modes

Input: Categorical dataset (D), Number of clusters (K)

Output: Elements of the dataset classified into K clusters

1. Select k initial modes, one for each cluster.
2. Allocate each object to the cluster whose mode is nearest to it according to

   d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j),  where \delta(x_j, y_j) = 0 if x_j = y_j and 1 otherwise,

   and m is the number of categorical attributes.
   Update the mode of the cluster after each allocation.

3. After all objects have been allocated to clusters, retest the dissimilarity of the
objects against the current modes. If an object is found whose nearest
mode belongs to a cluster other than its current one, reallocate the
object to that cluster and update the modes of both clusters.
4. Repeat step 3 until no object has changed clusters after a full cycle through the
whole data set.

Since K-Modes follows the same iterative process as K-Means to produce
clusters, it carries the merits and demerits of K-Means.
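
The two ingredients that distinguish K-Modes from K-Means, the matching dissimilarity and the frequency-based mode, can be sketched in Python as follows (the small categorical cluster is invented for illustration; the surrounding assign/update loop is the same as in the K-Means sketch above):

from collections import Counter

def matching_dissimilarity(x, y):
    # d(X, Y): the number of attributes on which the two objects disagree.
    return sum(1 for xj, yj in zip(x, y) if xj != yj)

def mode_of(cluster):
    # Frequency-based centroid: the most common value of each attribute.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_of(cluster))                                            # ('red', 'small')
print(matching_dissimilarity(("red", "large"), mode_of(cluster)))  # 1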

3.6 K-Prototype algorithm
Earlier clustering techniques were developed with a single type of attribute in mind, either
numerical or categorical. Since mixed data sets are common in real life, techniques need
to be developed to group this type of data. Techniques meant only for numerical data or only for
categorical data cannot be applied directly, because the two kinds of attributes differ in
behavior: numerical data is continuous, whereas categorical attribute values are not only
discontinuous but also unordered.
K-Prototype is a variant of K-Means that can be used with mixed numeric and categorical data sets.
It extends the idea of K-Means by applying Euclidean distance to numeric
attributes and binary distance to categorical attributes. The basic K-Prototype algorithm works as
follows [10]:

Algorithm 3: Basic K-Prototype

Input: Mixed dataset (D), Number of clusters (K)

Output: Elements of the dataset classified into K clusters

1. Select k initial prototypes from the data set D, one for each cluster.

2. Allocate each object in D to the cluster whose prototype is nearest to it
according to the following distance measure:

   d(X_i, Q_l) = \sum_{j=1}^{m_r} (x_{ij}^r - q_{lj}^r)^2 + \gamma_l \sum_{j=1}^{m_c} \delta(x_{ij}^c, q_{lj}^c)

   where \delta(p, q) = 0 for p = q and \delta(p, q) = 1 for p ≠ q.

   Here x_{ij}^r and q_{lj}^r are values of numeric attributes, whereas x_{ij}^c
   and q_{lj}^c are values of categorical attributes, for object i and the
   prototype of cluster l. m_r and m_c are the numbers of numeric and
   categorical attributes, and \gamma_l is a weight for the categorical
   attributes of cluster l.

   Update the prototype of the cluster after each allocation.

3. After all objects have been allocated to a cluster, retest the similarity
of the objects against the current prototypes. If an object is found whose
nearest prototype belongs to a cluster other than its current one, reallocate
the object to that cluster and update the prototypes of both clusters.

4. Repeat step 3 until no object has changed clusters after a full cycle
through D.

However, binary distance for categorical attributes does not represent the real situation, as
categorical values may have some other degree of difference rather than just 0 or 1. So,
various extensions of K-Prototype have been proposed. All these extensions suffer from
the limitation of requiring the number of clusters as input. Since K-Prototype extends the
ideas of K-Means, the same weaknesses of K-Means are retained. The focus of the research work
will be on removing a common limitation borne by the three algorithms discussed above:
inputting the required number of clusters (K) to the algorithms.
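
To make the distance measure above concrete, here is a minimal Python sketch of it (the example objects, the split into numeric and categorical attribute indices, and the gamma value, playing the role of \gamma_l, are all invented for illustration):

def mixed_distance(x, q, num_idx, cat_idx, gamma):
    # Squared Euclidean distance on the numeric attributes plus gamma times
    # the simple matching dissimilarity on the categorical attributes.
    numeric = sum((x[j] - q[j]) ** 2 for j in num_idx)
    categorical = sum(1 for j in cat_idx if x[j] != q[j])
    return numeric + gamma * categorical

# An object and a prototype with two numeric and two categorical attributes.
obj       = (1.8, 70.0, "urban", "owner")
prototype = (1.6, 65.0, "urban", "tenant")
print(mixed_distance(obj, prototype, num_idx=(0, 1), cat_idx=(2, 3), gamma=0.5))
# (0.2)^2 + (5.0)^2 + 0.5 * 1 = 25.54

The weight gamma controls how much the categorical mismatch counts against the numeric spread; choosing it per cluster is exactly the role of \gamma_l in the algorithm above.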

4. OBJECTIVES OF THE STUDY

• To design an algorithm based on K-Means for numerical data sets, to overcome
the limitation of providing the number of clusters at the very beginning based on
anticipation.
• To modify the K-Modes algorithm for clustering categorical data sets, so as to overcome
the limitation of inputting the required number of clusters initially.
• To transform the K-Prototype algorithm for mixed data sets, to overcome the
limitation of providing the required number of clusters initially.
• To compare the accuracy of the clusters produced by the proposed algorithms with
that of the original K-Means, K-Modes and K-Prototype algorithms on different real-world
datasets of different sizes and dimensions from the UCI Machine Learning Repository, using
RapidMiner.
• To develop a suitable unified similarity metric for mixed data sets.
• To use the proposed unified similarity metric in the K-Prototype algorithm.
• To modify the K-Prototype algorithm to overcome the limitation of providing the
number of clusters initially by using this unified similarity metric.

5. METHODOLOGY
• Algorithms will be proposed, based on K-Means and its extensions K-Modes and
K-Prototype, for numerical, categorical and mixed data sets, which divide the input
data set into an appropriate number of clusters without taking the number of clusters K as input,
as was required by the original algorithms.
• A similarity metric that works with mixed data sets will be proposed, and that metric
will be used in the basic K-Prototype and the proposed K-Prototype algorithms.
• The accuracy of the clusters produced will be compared with that of the original
algorithms. To achieve this, the modified algorithms will be implemented in C#
and the accuracy of the clusters will be compared with the original algorithms using
RapidMiner. For this purpose, many real data sets from the UCI Machine Learning
Repository (a website that maintains 300 data sets as a service to the machine
learning community) will be used.

6. EXPECTED OUTCOME OF THE RESEARCH

New algorithms based on K-Means, K-Modes and K-Prototype, but with added features
of intelligent data analysis and automatic generation of an appropriate number of clusters, for
numerical, categorical and mixed data sets. The proposed algorithms are expected to generate
clusters that are more accurate than the clusters generated by the original
algorithms. The research work will also produce a new unified similarity metric that
works for mixed data sets in data clustering. This metric will be implemented in the K-Prototype
algorithm with the aim of improving the accuracy of the clusters.

REFERENCES

1. Abubaker, Mohamed, Ashour, Wesam (2013). Efficient Data Clustering Algorithms: Improvements over K-means. International Journal of Intelligent Systems and Applications, 5(3), 37-49.

2. El Agha, Mohammed, Ashour, Wesam M. (2012). Efficient and Fast Initialization Algorithm for K-means Clustering. I.J. Intelligent Systems and Applications, 4(1), 21-31.

3. Ahamed Shafeeq, B. M., Hareesha, K. S. (2012). Dynamic Clustering of Data with Modified K-Means Algorithm. International Conference on Information and Computer Networks (ICICN 2012), 27, 221-225.

4. Ahmad, A., Dey, L. (2007). A K-Mean Clustering Algorithm for Mixed Numeric and Categorical Data. Data & Knowledge Engineering, 63, 503-527.

5. Cheung, Yiu-Ming (2003). k*-Means: A new generalized k-means clustering algorithm. Pattern Recognition Letters, 24, 2883-2893.

6. Cheung, Yiu-ming, Jia, Hong (2013). Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number. Pattern Recognition, 46, 2228-2238.

7. Gan, Guojun, Ma, Chaoqun, Wu, Jianhong (2007). Data Clustering: Theory, Algorithms, and Applications. SIAM: Society for Industrial and Applied Mathematics.

8. Han, Jiawei, Kamber, Micheline (2006). Data Mining: Concepts and Techniques, 2nd Edition. Morgan Kaufmann Publishers.

9. He, Zengyou, Deng, Shengchun, Xu, Xiaofei (2005). Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode. Computational Intelligence and Security, Lecture Notes in Computer Science, 3801, 157-162.

10. Huang, Zhexue (1997). Clustering large data sets with mixed numeric and categorical values. Pacific-Asia Conference on Knowledge Discovery and Data Mining.

11. Khan, Shehroz S., Ahmad, Amir (2013). Cluster Center Initialization for Categorical Data Using Multiple Attribute Clustering. Expert Systems with Applications, 40(18), 7444-7456.

12. Leela, V., Sakthipriya, K., Manikandan, R. (2013). A comparative analysis between k-mean and y-means algorithms in Fisher's Iris data sets. International Journal of Engineering and Technology, 5(1), 245-249.

13. Liang, Jiye, Zhao, Xingwang, Li, Deyu, Cao, Fuyuan, Dang, Chuangyin (2012). Determining the number of clusters using information entropy for mixed data. Pattern Recognition, 45, 2251-2265.

14. Liao, H., Ng, M. K. (2009). Categorical Data Clustering with Automatic Selection of Cluster Number. Fuzzy Information and Engineering, 1(1), 5-25.

15. Ng, Michael K., Li, Mark Junjie, Huang, Joshua Zhexue, He, Zengyou (2007). On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3), 503-507.

16. Pelleg, Dan, Moore, Andrew W. (2000). X-means: Extending K-means with Efficient Estimation of the Number of Clusters. Proceedings of the Seventeenth International Conference on Machine Learning, 727-734.

17. San, Ohn Mar, Huynh, Van-Nam, Nakamori, Yoshiteru (2004). An Alternative Extension of the k-Means Algorithm for Clustering Categorical Data. International Journal of Applied Mathematics and Computer Science, 14(2), 241-247.

18. Tibshirani, R., Walther, G., Hastie, T. (2000). Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Department of Statistics, Stanford University, California.

19. Wagstaff, Kiri, Cardie, Claire, Rogers, Seth, Schroedl, Stefan (2001). Constrained K-means Clustering with Background Knowledge. Proceedings of the Eighteenth International Conference on Machine Learning, 577-584.

20. Xu, Rui, Wunsch, Donald (2005). Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, 16(3), 645-678.

21. Zhang, Chunfei, Fang, Zhiyi (2013). An Improved K-means Clustering Algorithm. Journal of Information & Computational Science, 10(1), 193-199.
