E-Stream: Evolution-Based Technique for Stream Clustering
K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai
Abstract. Data streams have recently attracted attention for their applicability
to numerous domains including credit fraud detection, network intrusion
detection, and click streams. Stream clustering is a technique that performs
cluster analysis of data streams and is able to monitor the results in real time. A
data stream is a continuously generated sequence of data whose characteristics
evolve over time. A good stream clustering algorithm should recognize such
evolution and yield a cluster model that conforms to the current data. In this
paper, we propose a new technique for stream clustering that supports five types
of evolution: appearance, disappearance, self-evolution, merge, and split.
1 Introduction
Stream clustering is a technique that performs cluster analysis of data streams and is
able to produce results in real time. The ability to process data in a single pass and
summarize it, while using limited memory, is crucial to stream clustering.
Several efficient stream clustering techniques have been presented recently, such as
STREAM [9], CluStream [2], and HPStream [1]. STREAM is a k-median based
algorithm that can achieve a constant factor approximation. CluStream divides the
clustering process into an online and offline process. Data summarization is
performed online, while clustering of the summarized data is performed offline.
Experiments show that CluStream yields better cluster quality than STREAM.
HPStream is the most recent of these stream clustering techniques; it utilizes a fading
concept, a dedicated data representation, and dimension projection, and it achieves
better clustering quality than the above algorithms.
Since the characteristics of the data evolve over time, various types of evolution
should be defined and supported by the algorithm. Most existing algorithms support
only a few types of evolution. The objective of the research reported here is to
improve on existing stream clustering algorithms by supporting five types of evolution
with a suitable new cluster representation and distance function. Experimental results
show that this technique yields better cluster quality than HPStream.
The remainder of the paper is organized as follows. Section 2 introduces basic
concepts and definitions. Section 3 presents our stream clustering algorithm, called
E-Stream. Section 4 compares the performance of E-Stream and HPStream on
synthetic datasets. Conclusions are discussed in Section 5.
2 Basic Concepts and Definitions

Each cluster is represented as a Fading Cluster Structure (FCS) [1] augmented with an
α-bin histogram for each feature of the dataset. We call our cluster representation a
Fading Cluster Structure with Histogram (FCH). Let Ti be the time when data point xi
is retrieved, and let t be the current time; then f(t − Ti) is the fading weight of data
point xi.
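The fading function itself is not reproduced in this excerpt. A form commonly used with fading cluster structures [1], and assumed here for concreteness, is an exponential decay controlled by a decay rate λ > 0:

$f(t) = 2^{-\lambda t}$

so that a point retrieved at time $T_i$ contributes with weight $f(t - T_i)$ at the current time $t$, and its weight halves every $1/\lambda$ time units.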
FC1(t) is a vector of the weighted summation of the data feature values at time t. The
jth dimension is

$FC1_j(t) = \sum_{i=1}^{N} f(t - T_i) \cdot x_{ij}$    (2)
FC2(t) is the weighted sum of squares of each data feature at time t. The jth
dimension is

$FC2_j(t) = \sum_{i=1}^{N} f(t - T_i) \cdot x_{ij}^{2}$    (3)
H(t) is an α-bin histogram of the data values. For the jth feature at time t, the elements
of $H^j$ are

$H_l^j(t) = \sum_{i=1}^{N} f(t - T_i) \cdot x_{ij} \cdot y_{il}^j$    (5)

where

$y_{il} = \begin{cases} 1 & \text{if } l \cdot b + left \le x_i \le (l+1) \cdot b + left \\ 0 & \text{otherwise} \end{cases}$    (6)

$b = \frac{right - left}{\alpha}$    (9)

Here left is the minimum value and right is the maximum value in the cluster, b is the
size of each bin, and $y_{il}$ is the weight of $x_i$ in the lth bin.
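To make the representation concrete, the following Python sketch shows one way the FCH statistics could be maintained incrementally under the exponential fading function assumed above; the class name, the parameter defaults, and the strategy of decaying all statistics before each insertion are illustrative assumptions, not the authors' implementation.

import math

class FCH:
    """Sketch of a Fading Cluster Structure with Histogram.

    Per dimension j it keeps FC1[j] (faded sum, Eq. 2), FC2[j] (faded sum
    of squares, Eq. 3), and an alpha-bin faded histogram (Eq. 5).
    """

    def __init__(self, dim, alpha=10, decay=0.01):
        self.dim = dim
        self.alpha = alpha
        self.decay = decay                      # lambda in f(t) = 2**(-lambda*t)
        self.weight = 0.0                       # faded number of points
        self.fc1 = [0.0] * dim
        self.fc2 = [0.0] * dim
        self.left = [float("inf")] * dim        # per-dimension minimum
        self.right = [float("-inf")] * dim      # per-dimension maximum
        self.hist = [[0.0] * alpha for _ in range(dim)]
        self.last_update = 0.0

    def fade(self, t):
        # Decay every statistic from the last update time to time t.
        factor = 2.0 ** (-self.decay * (t - self.last_update))
        self.weight *= factor
        for j in range(self.dim):
            self.fc1[j] *= factor
            self.fc2[j] *= factor
            self.hist[j] = [h * factor for h in self.hist[j]]
        self.last_update = t

    def add(self, x, t):
        # Insert point x (a list of length dim) arriving at time t.
        self.fade(t)
        self.weight += 1.0                      # f(0) = 1 for the new point
        for j, xj in enumerate(x):
            self.fc1[j] += xj
            self.fc2[j] += xj * xj
            self.left[j] = min(self.left[j], xj)
            self.right[j] = max(self.right[j], xj)
            b = (self.right[j] - self.left[j]) / self.alpha or 1.0
            l = min(int((xj - self.left[j]) / b), self.alpha - 1)
            self.hist[j][l] += xj               # value-weighted bin, as in Eq. (5)
            # (re-binning old contributions when [left, right] grows is
            #  omitted in this sketch)

    def center(self, j):
        return self.fc1[j] / self.weight

    def radius(self, j):
        # Standard deviation of dimension j under the fading weights;
        # using it as the cluster radius is an assumption of this sketch.
        mean = self.center(j)
        var = max(self.fc2[j] / self.weight - mean * mean, 0.0)
        return math.sqrt(var)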
The cluster-point distance is used to find the closest active cluster for an incoming
data point. A cluster with a larger radius yields a lower distance. Let C be an active
cluster and x a data point; the cluster-point distance is

$dist(C, x) = \frac{1}{d} \sum_{j=1}^{d} \frac{\left| center_{Cj} - x_j \right|}{radius_{Cj}}$    (10)
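On top of the FCH sketch above, the cluster-point distance of Eq. (10) can be written directly; the guard against a zero radius is an implementation detail of the sketch rather than part of the definition.

def cluster_point_distance(fch, x):
    # Average over dimensions of |center - x_j| normalized by the
    # cluster radius in that dimension, as in Eq. (10).
    total = 0.0
    for j in range(fch.dim):
        r = fch.radius(j) or 1e-9               # avoid division by zero
        total += abs(fch.center(j) - x[j]) / r
    return total / fch.dim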
3 The Algorithm
In this section, we first describe our idea. Then we present the E-Stream algorithm,
which is composed of a set of sub-algorithms.
Appearance: A new cluster can appear if there is a sufficiently dense group of data
points in one area. Initially, such elements appear as a group of outliers, but (as more
data appears in a neighborhood) they are recognized as a cluster.
Disappearance: Existing clusters can disappear because the influence of their data
diminishes over time. Clusters that contain only old data fade and eventually
disappear because they no longer represent the presence of current data.
Self-evolution: Data can change its behavior, causing the size or position of a cluster
to evolve. This evolution happens faster when old data is allowed to fade.
Merge: A pair of clusters can be merged if their characteristics are very similar. The
merged cluster must cover the behavior of the pair.
Split: A cluster can be split into two smaller clusters if the behavior inside the cluster
is obviously separated.
Algorithm E-Stream
 1 retrieve new data xi
 2 FadingAll
 3 CheckSplit
 4 MergeOverlapCluster
 5 LimitMaximumCluster
 6 FlagActiveCluster
 7 (minDistance, index) ← FindClosestCluster
 8 if minDistance < radius_factor
 9   add xi to FCHindex
10 else
11   create new FCH from xi
12 wait for new data
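The following sketch wires the FCH and distance sketches above into the main loop of the pseudocode; the threshold values, the omission of steps 3 to 6, and the choice to start a new FCH from a single point are simplifying assumptions rather than the paper's exact procedure.

def e_stream_step(x, t, clusters, radius_factor=3.0, remove_threshold=0.1):
    # Step 2: fade all clusters and drop those whose weight is too low.
    for c in clusters:
        c.fade(t)
    clusters[:] = [c for c in clusters if c.weight >= remove_threshold]

    # Steps 3-6 (CheckSplit, MergeOverlapCluster, LimitMaximumCluster,
    # FlagActiveCluster) are omitted in this sketch.

    # Step 7: find the closest cluster to the incoming point.
    best, best_dist = None, float("inf")
    for c in clusters:
        d = cluster_point_distance(c, x)
        if d < best_dist:
            best, best_dist = c, d

    # Steps 8-11: assign to the closest cluster or start a new one.
    if best is not None and best_dist < radius_factor:
        best.add(x, t)
    else:
        new_cluster = FCH(dim=len(x))
        new_cluster.add(x, t)
        clusters.append(new_cluster)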
FadingAll. This step fades all clusters and deletes any cluster whose weight falls
below remove_threshold.
CheckSplit verifies the splitting criterion in each cluster using its histogram. If a
splitting point is found in a cluster, that cluster is split and the index pair of the
resulting clusters is stored in S.
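The exact splitting test is not reproduced in this excerpt; the sketch below uses a simple valley-between-two-peaks heuristic on the per-dimension histograms as a stand-in, so the valley_ratio threshold and the treatment of bin values as non-negative masses are assumptions.

def find_split_point(fch, valley_ratio=0.3):
    # Scan each dimension's histogram for a bin that is much lower than
    # the highest bins on both of its sides; such a valley suggests two
    # separated groups of data inside the cluster.
    for j in range(fch.dim):
        h = fch.hist[j]
        for l in range(1, len(h) - 1):
            left_peak = max(h[:l])
            right_peak = max(h[l + 1:])
            if left_peak > 0 and right_peak > 0 and \
               h[l] < valley_ratio * min(left_peak, right_peak):
                return j, l                     # split dimension and bin
    return None

In this reading, CheckSplit would call such a routine for every cluster and, when a split point is found, divide the histogram (and the other statistics) at that bin into two new FCH structures.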
4 Experimental Results
We tested the algorithm using a synthetic dataset consisting of two dimensions and
8,000 data points. The behavior of the clusters in this data changes over time, and the
stream can be segmented into 8 intervals as follows:
1. Initially, there are 4 clusters in a steady state (data points 1 to 1600).
2. A 5th cluster appears at position (15, 6) (data points 1601 to 2600).
3. The 1st cluster disappears (data points 2601 to 3400).
4. The 4th cluster swells (data points 3401 to 4200).
5. The 2nd and 5th clusters move closer together (data points 4201 to 5000).
6. The 2nd and 5th clusters merge into a bigger cluster (data points 5001 to 5600).
7. A 6th cluster splits from the 3rd cluster (data points 5601 to 6400).
8. Every cluster is in a steady state again (data points 6401 to 8000).
In this experiment, we set the parameters as in Table 1. E-Stream allows the number of
clusters to vary dynamically, constrained only by a limit on the maximum number of
clusters, whereas HPStream requires a fixed number of clusters. Since the synthetic
dataset has at most 5 clusters in each interval, we used 5 as the cluster (group) count in
HPStream and 10 as the cluster limit in E-Stream. HPStream requires initial data for its
initialization process before stream clustering begins; we set this to 100 points.
In the first interval, there are four clusters in a steady state. Both algorithms yield
clusters with little distortion, but E-Stream initially produces a greater number of
clusters: at the beginning E-Stream has no active clusters, so every incoming data
point is treated as an isolated point, and clusters appear only as more data
accumulates. HPStream, on the other hand, uses its initial data (set to 100 points) for
offline clustering, so it exhibits better clustering at the start.
In the second interval, a new cluster appears. HPStream still yields better quality
because it finds all clusters correctly, while E-Stream shows only a little distortion.
In the third interval, an existing cluster disappears. E-Stream yields good clustering,
while HPStream, due to its fixed cluster-count constraint, tries to create a new cluster
from an existing one even though there is no significant difference between them.
In the 4th interval, a cluster swells. In the 5th interval, two clusters move closer
together, an evolution that is supported by both algorithms, although HPStream still
tries to find five clusters.
In the 6th interval, the two close clusters merge. Neither algorithm merges the two
clusters in this interval, even though their current behaviors are indistinguishable.
In the 7th interval, a cluster splits. E-Stream supports this evolution once it receives
enough data, but HPStream cannot detect it.
In the 8th interval, there are four clusters in a steady state again. The E-Stream
algorithm detects the earlier merge and identifies all clusters correctly within this
interval, while HPStream remains confused by the cluster behavior.
Figure: Purity of E-Stream and HPStream over the data stream (x-axis: data stream points; y-axis: purity).
In the purity test, E-Stream always achieves a purity greater than 0.9, whereas
HPStream exhibits a large drop in the seventh interval (the cluster split), because the
algorithm cannot accommodate this evolution.
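For reference, we assume the standard definitions of the two quality metrics reported here; the exact evaluation protocol (for example, the window of recent points over which they are computed) is not given in this excerpt:

$\text{purity} = \frac{1}{N} \sum_{k} \max_{j} \lvert C_k \cap L_j \rvert, \qquad \text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

where the $C_k$ are the produced clusters, the $L_j$ the true classes, and $N$ the number of points evaluated.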
In the F-measure test, E-Stream yields a much better average value than HPStream,
although there are two intervals where E-Stream has a lower F-measure. The first is
the initial interval, where HPStream benefits from its offline process to find the initial
clusters. The second is the second interval (points 1601 to 2600), where E-Stream
incorrectly merged two clusters.
From the efficiency test, we can say that HPStream cannot support evolution of the
number of clusters because the algorithm constrains it. E-Stream supports all the
evolutions in Section 3, even though some evolutions, such as merge, require a lot of
data to detect.
Figure: F-measure of E-Stream and HPStream over the data stream (x-axis: data stream points; y-axis: f-measure).
Figure: Sensitivity to the cluster parameter k (k = 10, 20, 40, 80): purity and F-measure over the data stream.
From this experiment, the purity of both algorithms is not sensitive to this input
parameter. In terms of F-measure, however, HPStream tends to drop if the input
number of clusters differs greatly from the actual number, whereas E-Stream is not
sensitive to this parameter since its number of clusters is not fixed; as long as the
maximum number is not exceeded, E-Stream still yields good results.
To test runtime as a function of the stream length, we use a dataset consisting of
500,000 data points with two dimensions and five clusters.
Figure: Runtime (seconds) of E-Stream and HPStream versus the number of data points processed.
To test the runtime as a function of the number of clusters, we use two dimensions,
100,000 data points, and vary the number of data clusters from 5 to 25 in increments
of 5 clusters.
For the runtime versus number of dimensions test, we use 5 clusters, 100,000 data
points, and vary the number of dimensions from 2 to 20.
Figure 15: Runtime (seconds) of E-Stream and HPStream versus the number of clusters (left, 5 to 25) and the number of dimensions (right, 2 to 20).
The results of both experiments are summarized in Figure 15. HPStream exhibits
linear runtime in both the number of clusters and the number of dimensions. E-Stream
exhibits linear runtime in the number of dimensions but polynomial runtime in the
number of clusters. This is due to the merging procedure, which requires O(k²) time in
the number of clusters k.
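The quadratic cost comes from checking every pair of clusters for overlap; a minimal sketch of that check, with an assumed overlap criterion based on the FCH centers and radii, is:

def find_merge_pair(clusters, merge_threshold=1.25):
    # Examine all O(k^2) pairs of the k current clusters and return the
    # indices of one overlapping pair, or None. The overlap criterion
    # (centers within a multiple of the summed radii in every dimension)
    # is an illustrative assumption.
    k = len(clusters)
    for a in range(k):
        for b in range(a + 1, k):
            if all(abs(clusters[a].center(j) - clusters[b].center(j))
                   <= merge_threshold * (clusters[a].radius(j) + clusters[b].radius(j))
                   for j in range(clusters[a].dim)):
                return a, b
    return None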
5 Conclusions
This paper proposed a new stream clustering technique called E-Stream which can
support five cluster evolutions: appearance, disappearance, self-evolution, merge, and
split. These evolutions can normally occur in an evolving data stream. This technique
outperforms a well-known technique, HPStream. However, the runtime of the new
approach is polynomial with respect to the number of clusters.
References
1. Milenova, B.L., Campos, M.M.: Clustering Large Databases with Numeric and Nominal
Values Using Orthogonal Projections. In: Proceedings of the 29th VLDB Conference
(2003)
2. Aggarwal, C., Han, J., Wang, J., Yu, P.S.: A Framework for Projected Clustering of High
Dimensional Data Streams. In: Proceedings of the 30th VLDB Conference (2004)
3. Aggarwal, C., Han, J., Wang, J., Yu, P.S.: A Framework for Clustering Evolving Data
Streams. In: Proceedings of the 29th VLDB Conference (2003)
4. Barbara, D.: Requirements for Clustering Data Streams. In: SIGKDD Explorations (2002)
5. Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining Data Streams: A Review. In:
SIGMOD Record, vol. 34(2) (June 2005)
6. Oh, S., Kang, J., Byun, Y., Park, G., Byun, S.: Intrusion Detection based on Clustering a
Data Stream. In: Proceedings of the 2005 Third ACIS International Conference on Software
Engineering Research, Management and Applications (2005)
7. Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering Data
Streams: Theory and Practice. TKDE special issue on clustering 15 (2003)
8. Song, M., Wang, H.: Highly Efficient Incremental Estimation of Gaussian Mixture Models
for Online Data Stream Clustering. In: SPIE Conference on Intelligent Computing: Theory
And Application III (2005)
9. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: An Efficient Data Clustering Method for
Very Large Databases. In: Proc. ACM SIGMOD Int. Conf. Management of Data (1996)