Conference Paper
Abstract—Clustering is an important technique used to discover data structure. Clustering is applied in many areas, such as customer segmentation, image recognition, social science, and so on. However, most existing clustering methods suffer from two major drawbacks: 1) the susceptibility of the clustering result to the randomly chosen initial centers, and 2) sensitivity to outliers and noisy data. To solve these two problems, this study proposes a new algorithm named the genetic algorithm-based fuzzy c-ordered-means algorithm (GA-FCOM). Herein, the fuzzy c-ordered-means algorithm (FCOM) can deal with noisy data and outliers, while the genetic algorithm is employed to obtain optimal initial centroids efficiently during the clustering process. An experiment is conducted using benchmark datasets collected from the UCI machine learning repository to validate the proposed algorithm. The computational results indicate that the proposed GA-FCOM outperforms the fuzzy c-means algorithm (FCM) and FCOM in terms of both accuracy and objective function values.

The FCOM [6] modifies the distance measure and applies Yager's ordered weighted averaging operator [7] to reduce sensitivity to outliers. However, the results of the FCOM are still very susceptible to the randomly chosen initial centers. Moreover, the FCOM may terminate at a locally optimal solution. Therefore, this study proposes a method named the genetic algorithm-based fuzzy c-ordered-means algorithm (GA-FCOM), which integrates the FCOM with the genetic algorithm (GA) for two purposes:

1) Find better initial centroids.
2) Exploit the globally optimal solution.

The rest of this paper is organized as follows. Section 1 gives the introduction of this research. Section 2 provides the literature review, including FCM, FCOM, and GA for clustering. Section 3 presents the methodology of the proposed GA-FCOM. Section 4 shows the computational results in detail. Finally, Section 5 presents the conclusions and future recommendations of this study.
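To illustrate the OWA-based robustness mentioned above, the following is a minimal sketch of Yager's ordered weighted averaging operator [7], not the paper's code; the residual values and weight vectors are invented for illustration. A decreasing weight vector assigns little weight to the largest residuals, so an outlier barely affects the aggregate:

```python
import numpy as np

def owa(values, weights):
    """Yager's OWA operator: weights are applied to the values
    sorted in descending order (weights should sum to 1)."""
    ordered = np.sort(values)[::-1]          # largest value first
    return float(np.dot(weights, ordered))

# Residuals of five points to a cluster center; 10.0 is an outlier.
residuals = np.array([0.2, 0.3, 0.1, 0.4, 10.0])

# Equal weights reproduce the arithmetic mean, which the outlier dominates.
mean_w = np.full(5, 0.2)

# Weights that vanish for the largest residuals suppress the outlier.
robust_w = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

print(owa(residuals, mean_w))    # 2.2, pulled up by the outlier
print(owa(residuals, robust_w))  # 0.2, outlier suppressed
```

This is the mechanism by which FCOM limits the influence of outlying points on the centroid updates.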
Step 3-7: If ‖v[t] − v[t−1]‖₂² > ξ, then t = t + 1 and go to Step 3-2; else stop. ξ is usually set to 10⁻⁵.
Step 4: Update the parameters f by using Eq. (9).
Step 5: If ‖V(r) − V(r−1)‖_F > ε, then r = r + 1 and go to Step 2; else stop.

C. The Genetic Algorithm for Clustering

The GA was proposed by Holland in 1975 [9]. GA, which simulates the phenomenon of natural evolution, is a search algorithm for solving optimization problems. The main concept of GA is based on Darwin's theory of evolution. The GA begins by encoding the problem into chromosomes. The chromosomes form a population, which produces better offspring through selection, crossover, mutation, and evaluation. The result of this continuous evolution process is the optimal solution to the problem. In addition, several studies have applied the GA to clustering problems [10, 11]. Murthy and Chowdhury proposed a GA-based approach and obtained better results on clustering problems [12]. Krishna and Murty combined GA and K-means to develop a new clustering approach named GKA and proved that globally optimal results could be obtained [13]. GA has also been used to optimize initial centroids and parameters for clustering methods [14, 15].

III. A GENETIC ALGORITHM-BASED FUZZY C-ORDERED-MEANS ALGORITHM

The FCOM may terminate at a locally optimal solution since it is sensitive to the initial centroids. Therefore, the proposed GA-FCOM aims to reduce the impact of the initial centroids by combining GA with FCOM. This study uses a real-coded GA for the proposed GA-FCOM. A chromosome represents a set of alternative centroids. The process of GA-FCOM contains initialization, fitness calculation, selection, crossover, and mutation. The procedures of GA are as follows:

A. The Control of GA

1) Initialization
The initial population is generated by randomly selecting data instances from the dataset to become the initial chromosomes.

2) Fitness calculation
The fitness function is used to evaluate the quality of each chromosome. This study uses the following formula as the fitness:

f_i = 1 / (1 + J(U, V)), (17)

where J(U, V) represents the objective function of the FCOM. A higher fitness value indicates better initial centroids.

3) Selection
Roulette wheel selection is a well-known selection method in GA. In this method, the probability of each chromosome being selected into the mating pool for further genetic operations is proportional to its fitness value. The formula is as follows:

P_i = f_i / Σ_j f_j, (18)

where P_i is the selection probability of chromosome i (1 ≤ i ≤ n), and f_i and f_j are the fitness values of chromosome i and chromosome j (1 ≤ j ≤ n), respectively.

4) Crossover
According to the crossover rate, we determine whether the chromosomes in the mating pool are mated or not. If mating is performed, two selected chromosomes mate to produce offspring. The formulas are as follows [16]:

X_new = αX + (1 − α)Y, (19)
Y_new = βY + (1 − β)X, (20)

where X_new and Y_new are the genes of the offspring after the crossover process, α and β are random numbers between 0 and 1, and X and Y are the genes of the parents selected in the mating pool.

5) Mutation
Mutation is used to avoid reaching a locally optimal solution. The probability of mutation is much lower than that of crossover. If mutation is performed, this study applies a function to change the genes of the chromosomes. The function is as follows [17]:

X_new = X + s × r × a, (21)
a = 2^(−u·k), (22)

where X_new is the gene of the offspring after the mutation process, s ∈ {−1, 1} is chosen uniformly at random, r (at most 0.1) is a specified proportion, u is a random number between 0 and 1, and k ∈ {4, 5, …, 20} is the mutation precision.

B. The Steps of GA-FCOM

GA is a well-known method that can search for optimal solutions efficiently. During the iteration process, the fitness value is used to evaluate the goodness of the chromosomes. Through the selection, crossover, and mutation processes, the chromosomes of the next generation are generated. Finally, after a certain number of generations, the best solution is obtained. The steps of GA-FCOM are as follows:

Step 1: Set the parameters of GA, including population size, crossover rate, mutation rate, and number of generations, and determine the number of clusters c. In addition, the parameters of the FCOM algorithm should also be pre-specified.
Step 2: Generate the initial chromosomes by selecting instances from the dataset.
Step 3: Run FCOM with the generated initial centers.
Step 4: Calculate the fitness values using Eq. (6) and Eq. (17).
Step 5: Update the best chromosome.
Step 6: Selection: select the chromosomes into the crossover pool by Eq. (18).
Step 7: Crossover: use Eq. (19) and Eq. (20) to produce new offspring.
Step 8: Mutation: use Eq. (21) and Eq. (22) to produce new offspring.
Step 9: Stop if the termination condition is reached; otherwise, go back to Step 3.
Step 10: Choose the best chromosome as the initial cluster centroids and run FCOM.
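The GA layer described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the FCOM objective J(U, V) of Eq. (6) is replaced by a simple nearest-centroid distance stand-in so the sketch is self-contained, and the parameter values follow Section IV (population 20, crossover rate 0.85, mutation rate 0.01, 30 generations). Eq. (17) gives the fitness, Eq. (18) the roulette selection, Eqs. (19)-(20) the arithmetic crossover, and Eqs. (21)-(22) the mutation:

```python
import numpy as np

rng = np.random.default_rng(0)

def fcom_objective(centroids, data):
    """Stand-in for the FCOM objective J(U, V) (Eq. 6); here simply the
    total distance of each point to its nearest centroid."""
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return float(d.min(axis=1).sum())

def fitness(chrom, data):
    # Eq. (17): f = 1 / (1 + J(U, V)); higher is better.
    return 1.0 / (1.0 + fcom_objective(chrom, data))

def roulette_select(pop, fits):
    # Eq. (18): P_i = f_i / sum_j f_j.
    p = fits / fits.sum()
    idx = rng.choice(len(pop), size=len(pop), p=p)
    return [pop[i].copy() for i in idx]

def crossover(x, y, rate=0.85):
    # Eqs. (19)-(20): arithmetic crossover with random alpha, beta.
    if rng.random() < rate:
        a, b = rng.random(), rng.random()
        return a * x + (1 - a) * y, b * y + (1 - b) * x
    return x, y

def mutate(x, rate=0.01, r=0.1, k_max=20):
    # Eqs. (21)-(22): X_new = X + s * r * a with a = 2^(-u*k).
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if rng.random() < rate:
                s = rng.choice([-1.0, 1.0])
                u = rng.random()
                k = rng.integers(4, k_max + 1)
                x[i, j] += s * r * 2.0 ** (-u * k)
    return x

def ga_initial_centroids(data, c, pop_size=20, generations=30):
    """Steps 1-10: each chromosome is a set of c candidate centroids
    drawn from the data; the best one is returned for FCOM."""
    pop = [data[rng.choice(len(data), c, replace=False)]
           for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(generations):
        fits = np.array([fitness(ch, data) for ch in pop])
        if fits.max() > best_fit:                       # Step 5
            best_fit, best = fits.max(), pop[int(fits.argmax())].copy()
        pool = roulette_select(pop, fits)               # Step 6
        nxt = []
        for a, b in zip(pool[::2], pool[1::2]):
            ca, cb = crossover(a, b)                    # Step 7
            nxt += [mutate(ca), mutate(cb)]             # Step 8
        pop = nxt
    return best  # Step 10: feed these centroids to FCOM

data = rng.random((60, 2))
centers = ga_initial_centroids(data, c=3)
print(centers.shape)  # (3, 2)
```

In the actual GA-FCOM, Step 3 would run FCOM from each chromosome's centroids before evaluating Eq. (17), so the fitness reflects the FCOM objective rather than the stand-in used here.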
IV. EXPERIMENTAL RESULTS

In this section, the experimental results of the clustering algorithms are presented. The clustering algorithms were coded in Python 3.6.5 and run on a PC with an Intel Core i7-6700 processor. The detailed experimental results are described as follows.

A. Datasets

This study uses three datasets, Glass, Vertebral, and Breast Tissue, from the UCI machine learning repository to evaluate the performance of FCM, FCOM, and GA-FCOM. There are outliers in the Glass and Vertebral datasets. The characteristics of the datasets are shown in Table 1.

TABLE I. THE CHARACTERISTICS OF DATASETS

Dataset | Number of instances | Number of attributes | Number of clusters
Glass | 214 | 9 | 6
Vertebral | 310 | 6 | 3
Breast Tissue | 106 | 9 | 6

B. Performance Measurement

This study uses accuracy to evaluate the performance of the proposed algorithm. The accuracy can be defined as [18]:

Accuracy = (correctly predicted labels) / (total instances), (23)

where the predicted labels represent the results after clustering.

C. Parameter Setting

Parameters are set up for the GA and the clustering algorithms. In the GA, after several trial experiments, the number of chromosomes is set to 20, the crossover rate and mutation rate are set to 0.85 and 0.01, respectively, and the number of generations is set to 30. Besides, the parameter setup for the clustering algorithms is described in Table 2.

TABLE II. PARAMETER SETTINGS FOR ALGORITHMS

Methods | Dataset | m | a | c
FCM | Glass | 2 | - | -
FCM | Vertebral | 2 | - | -
FCM | Breast Tissue | 2 | - | -
FCOM | Glass | 1.2 | 0.2 | 2
FCOM | Vertebral | 2 | 0.4 | 0.5
FCOM | Breast Tissue | 3 | 0.3 | 0.6
GA-FCOM | Glass | 1.2 | 0.2 | 2
GA-FCOM | Vertebral | 2 | 0.4 | 0.5
GA-FCOM | Breast Tissue | 3 | 0.3 | 0.6

D. Computational Results

In this study, the experimental results are obtained by running the algorithms 30 times. All data are also normalized to the range between 0 and 1. Table 3 shows the computational results, including the accuracy and the corresponding standard deviation.

TABLE III. THE COMPUTATION RESULTS OF ACCURACY

Datasets | Accuracy | FCM | FCOM | GA-FCOM
Glass | Average (%) | 55.09 | 55.23 | 56.00
Glass | SD | 0.005 | 0.012 | 0.011
Vertebral | Average (%) | 66.13 | 68.91 | 69.03
Vertebral | SD | 0.000 | 0.002 | 0.000
Breast Tissue | Average (%) | 58.14 | 62.55 | 63.20
Breast Tissue | SD | 0.018 | 0.019 | 0.008

According to Table 3, the proposed GA-FCOM outperforms FCM and FCOM on all datasets. The standard deviation of GA-FCOM is also lower than that of FCOM. Moreover, in order to verify the impact of the initial centers, the objective function values of FCOM and GA-FCOM are compared. Table 4 shows the objective function values for FCOM and GA-FCOM. Fig. 1 shows the evolution of the objective function on the Glass dataset.

TABLE IV. THE COMPUTATION RESULTS OF THE OBJECTIVE FUNCTION

Datasets | Objective function | FCOM | GA-FCOM
Glass | Average | 16.53 | 15.83
Glass | SD | 0.791 | 0.175
Vertebral | Average | 1.42 | 1.42
Vertebral | SD | 0.001 | 0.000
Breast Tissue | Average | 1.36 | 1.24
Breast Tissue | SD | 0.133 | 0.103

Figure 1. Evolution of the objective function on the Glass dataset.

Table 4 shows the computational results of the objective function. For Glass and Breast Tissue, GA-FCOM obtains lower objective function values than FCOM, which indicates the good performance of GA-FCOM. For Vertebral, FCOM and GA-FCOM have the same performance. From Fig. 1, GA-FCOM performs better in the first iteration and converges faster than FCOM.

V. CONCLUSIONS

This study has proposed a new clustering method, the genetic algorithm-based fuzzy c-ordered-means algorithm. Since the performance of clustering is susceptible to the initial centroids, this study used the GA to overcome this problem. Because of the good initial centroids, GA-FCOM can converge faster and obtain better results more efficiently. GA-FCOM also enhances robustness for datasets with outliers; two datasets with outliers were used in this experiment.
The experimental results are compared for three algorithms: FCM, FCOM, and GA-FCOM. GA-FCOM obtains the best clustering performance, in terms of both accuracy and objective function value, on all datasets. In the future, since the parameter setting of FCOM is still based on previous research, it is hoped that meta-heuristics can be used to find good parameter values and initial centroids simultaneously.

ACKNOWLEDGMENT

This research was partially supported by the Ministry of Science and Technology of the Taiwan Government under grant MOST105-2221-E-011-103-MY3. This support is gratefully appreciated.

REFERENCES

[1] Chen, M.-S., Han, J., & Yu, P. S. (1996). Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866-883.
[2] Grira, N., Crucianu, M., & Boujemaa, N. (2004). Unsupervised and semi-supervised clustering: a brief survey. A Review of Machine Learning Techniques for Processing Multimedia Content, Report of the MUSCLE European Network of Excellence, 1001-1030.
[3] Tan, P.-N., Steinbach, M., & Kumar, V. (2013). Data mining cluster analysis: basic concepts and algorithms. Introduction to Data Mining.
[4] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
[5] Bezdek, J. C. (1981). Objective function clustering. In Pattern Recognition with Fuzzy Objective Function Algorithms (pp. 43-93). Springer.
[6] Leski, J. M. (2016). Fuzzy c-ordered-means clustering. Fuzzy Sets and Systems, 286, 114-133.
[7] Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics, 18(1), 183-190.
[8] Fan, J., Han, M., & Wang, J. (2009). Single point iterative weighted fuzzy C-means clustering algorithm for remote sensing image segmentation. Pattern Recognition, 42(11), 2527-2540.
[9] Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press.
[10] Bezdek, J. C., Boggavarapu, S., Hall, L. O., & Bensaid, A. (1994). Genetic algorithm guided clustering. In Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence.
[11] Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33(9), 1455-1465.
[12] Murthy, C. A., & Chowdhury, N. (1996). In search of optimal clusters using genetic algorithms. Pattern Recognition Letters.
[13] Krishna, K., & Murty, M. N. (1999). Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 29(3), 433-439.
[14] Jimenez, J., Cuevas, F., & Carpio, J. (2007). Genetic algorithms applied to clustering problem and data mining. In Proceedings of the 7th WSEAS International Conference on Simulation, Modelling and Optimization.
[15] Khotimah, B. K., Irhamni, F., & Sundarwati, T. (2016). A genetic algorithm for optimized initial centers K-means clustering in SMEs. Journal of Theoretical and Applied Information Technology, 90(1), 23.
[16] Michielssen, E., Ranjithan, S., & Mittra, R. (1992). Optimal multilayer filter design using real coded genetic algorithms. IEE Proceedings J (Optoelectronics), 139(6), 413-420.
[17] Sumathi, S., Hamsapriya, T., & Surekha, P. (2008). Evolutionary Intelligence: An Introduction to Theory and Applications with Matlab. Springer Science & Business Media.
[18] Graves, D., & Pedrycz, W. (2010). Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study. Fuzzy Sets and Systems, 161(4), 522-543.