Data Clustering Using Particle Swarm Optimization
Abstract- This paper proposes two new approaches to using PSO to cluster data. It is shown how PSO can be used to find the centroids of a user-specified number of clusters. The algorithm is then extended to use K-means clustering to seed the initial swarm. This second algorithm basically uses PSO to refine the clusters formed by K-means. The new PSO algorithms are evaluated on six data sets, and compared to the performance of K-means clustering. Results show that both PSO clustering techniques have much potential.

1 Introduction

Data clustering is the process of grouping together similar multi-dimensional data vectors into a number of clusters or bins. Clustering algorithms have been applied to a wide range of problems, including exploratory data analysis, data mining [4], image segmentation [12] and mathematical programming [1, 16]. Clustering techniques have been used successfully to address the scalability problem of machine learning and data mining algorithms, where prior to, and during, training, the training data is clustered and samples from these clusters are selected for training, thereby reducing the computational complexity of the training process and even improving generalization performance [6, 15, 14, 3].

Clustering algorithms can be grouped into two main classes, namely supervised and unsupervised. With supervised clustering, the learning algorithm has an external teacher that indicates the target class to which a data vector should belong. For unsupervised clustering, a teacher does not exist, and data vectors are grouped based on distance from one another. This paper focuses on unsupervised clustering.

Many unsupervised clustering algorithms have been developed. Most of these algorithms group data into clusters independently of the topology of the input space. These algorithms include, among others, K-means [7, 8], ISODATA [2], and learning vector quantizers (LVQ) [5]. The self-organizing feature map (SOM) [11], on the other hand, performs a topological clustering, where the topology of the original input space is maintained. While clustering algorithms are usually supervised or unsupervised, efficient hybrids have been developed that perform both supervised and unsupervised learning, e.g. LVQ-II [5].

Recently, particle swarm optimization (PSO) [9, 10] has been applied to image clustering [13]. This paper explores the applicability of PSO to cluster data vectors. In the process of doing so, the objective of the paper is twofold:

• to show that the standard PSO algorithm can be used to cluster arbitrary data, and
• to develop a new PSO-based clustering algorithm where K-means clustering is used to seed the initial swarm.

The rest of the paper is organized as follows: Section 2 presents an overview of the K-means algorithm. PSO is overviewed in section 3. The two PSO clustering techniques are discussed in section 4. Experimental results are summarized in section 5.

2 K-Means Clustering

One of the most important components of a clustering algorithm is the measure of similarity used to determine how close two patterns are to one another. K-means clustering groups data vectors into a predefined number of clusters, using Euclidean distance as the similarity measure. Data vectors within a cluster have small Euclidean distances from one another, and are associated with one centroid vector, which represents the "midpoint" of that cluster. The centroid vector is the mean of the data vectors that belong to the corresponding cluster.

For the purpose of this paper, define the following symbols:

• N_d denotes the input dimension, i.e. the number of parameters of each data vector
• N_o denotes the number of data vectors to be clustered
• N_c denotes the number of cluster centroids (as provided by the user), i.e. the number of clusters to be formed
• z_p denotes the p-th data vector
• m_j denotes the centroid vector of cluster j
• n_j is the number of data vectors in cluster j
• C_j is the subset of data vectors that form cluster j.

Using the above notation, the standard K-means algorithm is summarized as

1. Randomly initialize the N_c cluster centroid vectors
2. Repeat
   (a) For each data vector, assign the vector to the class with the closest centroid vector, where the distance to the centroid is determined using

       d(z_p, m_j) = \sqrt{ \sum_{k=1}^{N_d} (z_{pk} - m_{jk})^2 }    (1)

       where k subscripts the dimension.
   (b) Recalculate the cluster centroid vectors, using

       m_j = \frac{1}{n_j} \sum_{\forall z_p \in C_j} z_p    (2)

   until a stopping criterion is satisfied.

The K-means clustering process can be stopped when any one of the following criteria is satisfied: when the maximum number of iterations has been exceeded, when there is little change in the centroid vectors over a number of iterations, or when there are no cluster membership changes. For the purposes of this study, the algorithm is stopped when a user-specified number of iterations has been exceeded.
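As a concrete illustration, the procedure above can be rendered as the following minimal sketch, using the maximum-iteration stopping rule adopted in this study. The function and parameter names are ours, not those of the original implementation, and the treatment of empty clusters is an assumption.

```python
import numpy as np

def kmeans(data, n_clusters, max_iter=100, rng=None):
    """Minimal K-means sketch: Euclidean assignment (equation (1)) and
    centroid recomputation (equation (2)), stopped after max_iter iterations."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    # Step 1: randomly initialise the N_c centroids (here: random data vectors).
    centroids = data[rng.choice(len(data), n_clusters, replace=False)].copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # Step 2(a): assign each data vector to the class with the closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2(b): recalculate each centroid as the mean of its member vectors.
        for j in range(n_clusters):
            members = data[labels == j]
            if len(members) > 0:          # assumption: empty clusters keep their centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```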
3 Particle Swarm Optimization

Particle swarm optimization (PSO) is a population-based stochastic search process, modeled after the social behavior of a bird flock [9, 10]. The algorithm maintains a population of particles, where each particle represents a potential solution to an optimization problem.

In the context of PSO, a swarm refers to a number of potential solutions to the optimization problem, where each potential solution is referred to as a particle. The aim of the PSO is to find the particle position that results in the best evaluation of a given fitness (objective) function.

Each particle represents a position in N_d dimensional space, and is "flown" through this multi-dimensional search space, adjusting its position toward both

• the particle's best position found thus far, and
• the best position in the neighborhood of that particle.

Each particle i maintains the following information:

• x_i: The current position of the particle;
• v_i: The current velocity of the particle;
• y_i: The personal best position of the particle.

Using the above notation, a particle's position is adjusted according to

    v_{i,k}(t+1) = w v_{i,k}(t) + c_1 r_{1,k}(t)(y_{i,k}(t) - x_{i,k}(t)) + c_2 r_{2,k}(t)(\hat{y}_k(t) - x_{i,k}(t))    (3)
    x_i(t+1) = x_i(t) + v_i(t+1)    (4)

where w is the inertia weight, c_1 and c_2 are the acceleration constants, r_{1,k}(t), r_{2,k}(t) ~ U(0, 1), and k = 1, ..., N_d. The velocity is thus calculated based on three contributions: (1) a fraction of the previous velocity, (2) the cognitive component, which is a function of the distance of the particle from its personal best position, and (3) the social component, which is a function of the distance of the particle from the best particle found thus far (i.e. the best of the personal bests).

The personal best position of particle i is calculated as

    y_i(t+1) = \begin{cases} y_i(t) & \text{if } f(x_i(t+1)) \geq f(y_i(t)) \\ x_i(t+1) & \text{if } f(x_i(t+1)) < f(y_i(t)) \end{cases}    (5)

Two basic approaches to PSO exist, based on the interpretation of the neighborhood of particles. Equation (3) reflects the gbest version of PSO where, for each particle, the neighborhood is simply the entire swarm. The social component then causes particles to be drawn toward the best particle in the swarm. In the lbest PSO model, the swarm is divided into overlapping neighborhoods, and the best particle of each neighborhood is determined. For the lbest PSO model, the social component of equation (3) changes to

    c_2 r_{2,k}(t)(\hat{y}_{j,k}(t) - x_{i,k}(t))    (6)

where \hat{y}_j is the best particle in the neighborhood of the i-th particle.

The PSO is usually executed with repeated application of equations (3) and (4) until a specified number of iterations has been exceeded. Alternatively, the algorithm can be terminated when the velocity updates are close to zero over a number of iterations.
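To make the update concrete, one gbest iteration of equations (3)-(5) can be sketched as follows. This is our own minimal NumPy helper, assuming a fitness function f that is to be minimized; it is not the authors' implementation.

```python
import numpy as np

def gbest_pso_step(x, v, y, y_hat, f, w=0.72, c1=1.49, c2=1.49, rng=None):
    """One gbest PSO iteration per equations (3)-(5).
    x, v, y: (n_particles, n_dims) positions, velocities and personal bests;
    y_hat: (n_dims,) global best position; f: fitness function (minimized)."""
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(x.shape)                   # r_{1,k}(t) ~ U(0, 1), per dimension
    r2 = rng.random(x.shape)                   # r_{2,k}(t) ~ U(0, 1), per dimension
    v = w * v + c1 * r1 * (y - x) + c2 * r2 * (y_hat - x)   # equation (3)
    x = x + v                                               # equation (4)
    # Equation (5): keep the old personal best unless the new position is better.
    improved = np.array([f(a) < f(b) for a, b in zip(x, y)])
    y = np.where(improved[:, None], x, y)
    # The global best is the best of the personal bests.
    y_hat = y[np.argmin([f(p) for p in y])]
    return x, v, y, y_hat
```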
4 PSO Clustering
In the context of clustering, a single particle represents the N_c cluster centroid vectors. That is, each particle x_i is constructed as follows:

    x_i = (m_{i1}, \cdots, m_{ij}, \cdots, m_{iN_c})    (7)
where m_{ij} refers to the j-th cluster centroid vector of the i-th particle, i.e. to cluster C_{ij}. Therefore, a swarm represents a number of candidate clusterings for the current data vectors. The fitness of particles is easily measured as the quantization error,

    J_e = \frac{ \sum_{j=1}^{N_c} \left[ \sum_{\forall z_p \in C_{ij}} d(z_p, m_j) / |C_{ij}| \right] }{ N_c }    (8)

where d is defined in equation (1), and |C_{ij}| is the number of data vectors belonging to cluster C_{ij}, i.e. the frequency of that cluster.
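A direct sketch of this fitness measure is given below. The helper name is ours, and the handling of empty clusters (which equation (8) leaves undefined) is an assumption: such clusters are simply skipped.

```python
import numpy as np

def quantization_error(centroids, data):
    """Quantization error J_e of equation (8) for one particle's centroids.
    centroids: (N_c, N_d) array; data: (N_o, N_d) array."""
    centroids = np.asarray(centroids, dtype=float)
    data = np.asarray(data, dtype=float)
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)              # each z_p joins its closest centroid
    per_cluster = []
    for j in range(len(centroids)):
        member_dists = dists[labels == j, j]   # d(z_p, m_j) for z_p in C_ij
        if len(member_dists) > 0:              # assumption: empty clusters are skipped
            per_cluster.append(member_dists.mean())
    return float(np.sum(per_cluster)) / len(centroids)   # divide by N_c
```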
This section first presents a standard gbest PSO for clustering data into a given number of clusters in section 4.1, and then shows how K-means and the PSO algorithm can be combined to further improve the performance of the PSO clustering algorithm in section 4.2.

4.1 gbest PSO Cluster Algorithm

Using the standard gbest PSO, data vectors can be clustered as follows:

1. Initialize each particle to contain N_c randomly selected cluster centroids.
2. For t = 1 to t_max do
   (a) For each particle i do
   (b) For each data vector z_p
       i. calculate the Euclidean distance d(z_p, m_{ij}) to all cluster centroids C_{ij}
       ii. assign z_p to cluster C_{ij} such that d(z_p, m_{ij}) = \min_{\forall c = 1, \cdots, N_c} \{ d(z_p, m_{ic}) \}
       iii. calculate the fitness using equation (8)
   (c) Update the global best and local best positions
   (d) Update the cluster centroids using equations (3) and (4).

where t_max is the maximum number of iterations.
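Putting the pieces together, the gbest PSO clustering loop can be sketched as follows. This is a minimal rendering under our own naming, reusing the quantization_error helper sketched earlier; the optional init argument is an addition of ours that anticipates the hybrid of section 4.2.

```python
import numpy as np

def pso_cluster(data, n_clusters, n_particles=10, t_max=100,
                w=0.72, c1=1.49, c2=1.49, rng=None, init=None):
    """gbest PSO clustering sketch: each particle is a flattened (N_c, N_d) set of
    centroids, and its fitness is the quantization error of equation (8)."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    n_d = data.shape[1]

    def fitness(particle):
        return quantization_error(particle.reshape(n_clusters, n_d), data)

    # Step 1: initialise each particle with N_c randomly selected data vectors,
    # unless an initial swarm is supplied (used by the hybrid algorithm).
    if init is None:
        x = np.stack([data[rng.choice(len(data), n_clusters, replace=False)].ravel()
                      for _ in range(n_particles)])
    else:
        x = np.asarray(init, dtype=float).copy()
    v = np.zeros_like(x)
    y = x.copy()                                      # personal best positions
    y_hat = y[np.argmin([fitness(p) for p in y])]     # global best position
    # Step 2: repeat the assignment / fitness / update cycle for t_max iterations.
    for _ in range(t_max):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (y - x) + c2 * r2 * (y_hat - x)   # equation (3)
        x = x + v                                               # equation (4)
        improved = np.array([fitness(a) < fitness(b) for a, b in zip(x, y)])
        y = np.where(improved[:, None], x, y)                   # equation (5)
        y_hat = y[np.argmin([fitness(p) for p in y])]
    return y_hat.reshape(n_clusters, n_d)
```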
The population-based search of the PSO algorithm reduces the effect that initial conditions have, as opposed to the K-means algorithm; the search starts from multiple positions in parallel. Section 5 shows that the PSO algorithm performs better than the K-means algorithm in terms of quantization error.

4.2 Hybrid PSO and K-Means Clustering Algorithm

The K-means algorithm tends to converge faster (after fewer function evaluations) than the PSO, but usually with a less accurate clustering [13]. This section shows that the performance of the PSO clustering algorithm can be further improved by seeding the initial swarm with the result of the K-means algorithm. The hybrid algorithm first executes the K-means algorithm once. In this case the K-means clustering is terminated when (1) the maximum number of iterations is exceeded, or when (2) the average change in centroid vectors is less than 0.0001 (a user-specified parameter). The result of the K-means algorithm is then used as one of the particles, while the rest of the swarm is initialized randomly. The gbest PSO algorithm as presented above is then executed.
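Under the same assumptions as the earlier sketches, and reusing the kmeans and pso_cluster helpers defined above, the seeding step reduces to the following:

```python
import numpy as np

def hybrid_cluster(data, n_clusters, n_particles=10, t_max=100, rng=None):
    """Hybrid sketch: run K-means once, place its centroids in one particle,
    initialise the remaining particles randomly, then run the gbest PSO."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    seed_centroids, _ = kmeans(data, n_clusters, rng=rng)   # K-means result
    swarm = [seed_centroids.ravel()] + [
        data[rng.choice(len(data), n_clusters, replace=False)].ravel()
        for _ in range(n_particles - 1)]                    # rest of the swarm is random
    return pso_cluster(data, n_clusters, n_particles=n_particles,
                       t_max=t_max, rng=rng, init=np.stack(swarm))
```

Note that the kmeans sketch above stops only on its iteration budget; the 0.0001 centroid-change test described in the text would be an additional stopping condition in a fuller implementation.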
5 Experimental Results

This section compares the results of the K-means, PSO and Hybrid clustering algorithms on six classification problems. The main purpose is to compare the quality of the respective clusterings, where quality is measured according to the following three criteria:

• the quantization error as defined in equation (8);
• the intra-cluster distances, i.e. the distance between data vectors within a cluster, where the objective is to minimize the intra-cluster distances;
• the inter-cluster distances, i.e. the distance between the centroids of the clusters, where the objective is to maximize the distance between clusters.

The latter two objectives respectively correspond to crisp, compact clusters that are well separated.
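The section does not pin down how the intra- and inter-cluster distances are aggregated, so the following sketch is only one reasonable reading. Our assumptions: the intra-cluster distance is the mean distance of each vector to its assigned centroid, and the inter-cluster distance is the smallest pairwise distance between centroids.

```python
import numpy as np

def cluster_distances(centroids, data):
    """Illustrative intra-/inter-cluster measures (aggregation assumed, not
    taken from the paper)."""
    centroids = np.asarray(centroids, dtype=float)
    data = np.asarray(data, dtype=float)
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    intra = dists[np.arange(len(data)), labels].mean()             # to be minimized
    pairwise = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    inter = pairwise[np.triu_indices(len(centroids), k=1)].min()   # to be maximized
    return intra, inter
```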
For all the results reported, averages over 30 simulations are given. All algorithms are run for 1000 function evaluations, and the PSO algorithms used 10 particles. For PSO, w = 0.72 and c_1 = c_2 = 1.49. These values were chosen to ensure good convergence [17].

The classification problems used for the purpose of this paper are:

• Artificial problem 1: This problem is defined by the classification rule

      class = \begin{cases} 1 & \text{if } (z_1 \geq 0.7) \text{ or } ((z_1 \leq 0.3) \text{ and } (z_2 \geq -0.2 - z_1)) \\ 0 & \text{otherwise} \end{cases}    (9)

  A total of 400 data vectors were randomly created, with z_1, z_2 ~ U(-1, 1). This problem is illustrated in figure 1 (a generation sketch for both artificial problems is given after this list).

• Artificial problem 2: This is a 2-dimensional problem with 4 unique classes. The problem is interesting in that only one of the inputs is really relevant to the formation of the classes. A total of 600 patterns were drawn from four independent bivariate normal distributions, where classes were distributed according to

      N_2\left( \mu = \begin{pmatrix} m_i \\ 0 \end{pmatrix}, \; \Sigma = \begin{pmatrix} 0.50 & 0.05 \\ 0.05 & 0.50 \end{pmatrix} \right)    (10)

  for i = 1, \cdots, 4, where \mu is the mean vector and \Sigma is the covariance matrix; m_1 = -3, m_2 = 0, m_3 = 3 and m_4 = 6. The problem is illustrated in figure 2.

Figure 1: Artificial rule classification problem defined in equation (9) [plot omitted; scatter of the data over z_1 and z_2].
Figure 2: Four-class artificial classification problem defined in equation (10) [plot omitted; scatter of the data over z_1 and z_2].

• Iris plants database: This is a well-understood database with 4 inputs, 3 classes and 150 data vectors.

• Wine: This is a classification problem with "well behaved" class structures. There are 13 inputs, 3 classes and 178 data vectors.

• Breast cancer: The Wisconsin breast cancer database contains 9 relevant inputs and 2 classes. The objective is to classify each data vector as a benign or malignant tumor.

• Automotives: This is an 11-dimensional data set representing different attributes of more than 500 automobiles from a car selling agent.
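For reference, the two artificial data sets can be generated as in the sketch below. The helper names are hypothetical, and the even 150-per-class split for problem 2 is an assumption, since the text only states the total of 600 patterns.

```python
import numpy as np

def make_artificial_1(n=400, rng=None):
    """Artificial problem 1: z_1, z_2 ~ U(-1, 1), labelled by equation (9)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.uniform(-1.0, 1.0, size=(n, 2))
    labels = ((z[:, 0] >= 0.7) |
              ((z[:, 0] <= 0.3) & (z[:, 1] >= -0.2 - z[:, 0]))).astype(int)
    return z, labels

def make_artificial_2(n_per_class=150, rng=None):
    """Artificial problem 2: four bivariate normals per equation (10), with
    means m_i in {-3, 0, 3, 6} on the first input only."""
    rng = np.random.default_rng() if rng is None else rng
    cov = np.array([[0.50, 0.05], [0.05, 0.50]])
    data, labels = [], []
    for i, m in enumerate([-3.0, 0.0, 3.0, 6.0]):
        data.append(rng.multivariate_normal([m, 0.0], cov, size=n_per_class))
        labels.append(np.full(n_per_class, i))
    return np.vstack(data), np.concatenate(labels)
```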
Table 1 summarizes the results obtained from the three clustering algorithms for the problems above. The values reported are averages over 30 simulations, with standard deviations to indicate the range of values to which the algorithms converge. First, consider the fitness of solutions, i.e. the quantization error. For all the problems except Artificial 2, the Hybrid algorithm had the smallest average quantization error. For the Artificial 2 problem, the PSO clustering algorithm had a better quantization error, but not significantly better than the Hybrid algorithm. It is only for the Wine and Iris problems that the standard K-means clustering is not significantly worse than the PSO and Hybrid algorithms. However, for the Wine problem, both K-means and the PSO algorithms are significantly worse than the Hybrid algorithm.

When considering inter- and intra-cluster distances, the latter ensures compact clusters with little deviation from the cluster centroids, while the former ensures larger separation between the different clusters. With reference to these criteria, the PSO approaches succeeded most in finding clusters with larger separation than the K-means algorithm, with the Hybrid PSO algorithm doing so for 4 of the 6 problems. It is also the PSO approaches that succeeded in forming the more compact clusters: the Hybrid PSO formed the most compact clusters for 4 problems, the standard PSO for 1 problem, and the K-means algorithm for 1 problem.

The results above show a general improvement in performance when the PSO is seeded with the outcome of the K-means algorithm.

Figure 3 summarizes the effect of varying the number of clusters for the different algorithms on the first artificial problem. It is expected that the quantization error should go down with an increase in the number of clusters, as illustrated. Figure 3 also shows that the Hybrid PSO algorithm consistently performs better than the other two approaches as the number of clusters increases.

Figure 4 illustrates the convergence behavior of the algorithms for the first artificial problem. The K-means algorithm exhibited a faster, but premature, convergence to a large quantization error, while the PSO algorithms had slower convergence, but to lower quantization errors. As indicated (refer to the circles) in figure 4, the K-means algorithm converged after 12 function evaluations, the Hybrid
Table 1: Comparison of K-means, PSO and Hybrid clustering algorithms
Problem  Algorithm  Quantization Error  Intra-cluster Distance  Inter-cluster Distance
Artificial 1 K-means 0.984±0.032 3.678±0.085 1.771±0.046
PSO 0.769±0.031 3.826±0.091 1.142±0.052
Hybrid 0.768±0.048 3.823±0.081 1.151±0.043
Artificial 2 K-means 0.264±0.001 0.911±0.027 0.796±0.022
PSO 0.252±0.001 0.873±0.023 0.815±0.019
Hybrid 0.250±0.001 0.869±0.018 0.814±0.011
Iris K-means 0.649±0.146 3.374±0.245 0.887±0.091
PSO 0.774±0.094 3.489±0.186 0.881±0.086
Hybrid 0.633±0.143 3.304±0.204 0.852±0.097
Wine K-means 1.139±0.125 4.202±0.223 1.010±0.146
PSO 1.493±0.095 4.911±0.353 2.977±0.241
Hybrid 1.078±0.085 4.199±0.514 2.799±0.111
Breast-cancer K-means 1.999±0.054 6.599±0.332 1.824±0.251
PSO 2.536±0.197 7.285±0.351 3.545±0.204
Hybrid 1.890±0.125 6.551±0.436 3.335±0.097
Automotive K-means 1030.714±44.69 11032.355±342.2 1037.920±22.14
PSO 971.553±44.11 11675.675±341.1 988.818±22.44
Hybrid 902.414±43.81 11895.797±340.7 952.892±21.55
References

[1] HC Andrews, "Introduction to Mathematical Techniques in Pattern Recognition", John Wiley & Sons, New York, 1972.

[2] G Ball, D Hall, "A Clustering Technique for Summarizing Multivariate Data", Behavioral Science, Vol. 12, pp 153-155, 1967.

[3] AP Engelbrecht, "Sensitivity Analysis of Multilayer Neural Networks", PhD Thesis, Department of Computer Science, University of Stellenbosch, Stellenbosch, South Africa, 1999.

[15] JR Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, San Mateo, 1993.

[16] MR Rao, "Cluster Analysis and Mathematical Programming", Journal of the American Statistical Association, Vol. 22, pp 622-626, 1971.

[17] F van den Bergh, "An Analysis of Particle Swarm Optimizers", PhD Thesis, Department of Computer Science, University of Pretoria, Pretoria, South Africa, 2002.