0% found this document useful (0 votes)

5 views12 pages

E-Stream Evolution-Based Technique For Stream Clus

The document presents E-Stream, an evolution-based technique for stream clustering that addresses the dynamic nature of data streams by supporting five types of evolution: appearance, disappearance, self-evolution, merge, and split. The proposed algorithm aims to improve existing stream clustering methods by providing better cluster quality through a new cluster representation and distance function. Experimental results demonstrate that E-Stream outperforms previous techniques like HPStream in terms of clustering effectiveness.

Uploaded by

12213023

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views12 pages

E-Stream Evolution-Based Technique For Stream Clus

Uploaded by

12213023

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/221571035

E-Stream: Evolution-Based Technique for Stream

Clustering

Conference Paper in Lecture Notes in Computer Science · August 2007

DOI: 10.1007/978-3-540-73871-8_58 · Source: DBLP

CITATIONS READS
112 1,458

3 authors, including:

Thanawin Rakthanmanon Kitsana Waiyamai

Kasetsart University Kasetsart University
39 PUBLICATIONS 2,442 CITATIONS 54 PUBLICATIONS 416 CITATIONS

SEE PROFILE SEE PROFILE

All content following this page was uploaded by Thanawin Rakthanmanon on 21 March 2014.

The user has requested enhancement of the downloaded file.

E-Stream: Evolution-Based Technique for Stream
Clustering

Komkrit Udommanetanakit, Thanawin Rakthanmanon, and Kitsana Waiyamai

Department of Computer Engineering, Faculty of Engineering

Kasetsart University, Bangkok 10900, Thailand
{fengtwr, fengknw}@ku.ac.th

Abstract. Data streams have recently attracted attention for their applicability
to numerous domains including credit fraud detection, network intrusion
detection, and click streams. Stream clustering is a technique that performs
cluster analysis of data streams that is able to monitor the results in real time. A
data stream is continuously generated sequences of data for which the
characteristics of the data evolve over time. A good stream clustering algorithm
should recognize such evolution and yield a cluster model that conforms to the
current data. In this paper, we propose a new technique for stream clustering
which supports five evolutions that are appearance, disappearance, self-
evolution, merge and split.

1 Introduction
Stream clustering is a technique that performs cluster analysis of data streams that is
able to produce results in real time. The ability to process data in a single pass and
summarize it, while using limited memory, is crucial to stream clustering.
Several efficient stream clustering techniques have been presented recently, such as
STREAM [9], CluStream [2], and HPStream [1]. STREAM is a k-median based
algorithm that can achieve a constant factor approximation. CluStream divides the
clustering process into an online and offline process. Data summarization is
performed online, while clustering of the summarized data is performed offline.
Experiments show that CluStream yields better cluster quality than STREAM.
HPStream is the most recent stream clustering technique, which utilizes a fading
concept, data representation, and dimension projection. HPStream achieves better
clustering quality than the above algorithms.
Since the characteristics of the data evolve over time, various types of evolution
should be defined and supported by the algorithm. Almost of existing algorithms
support few types of evolution. The objective of the research reported here is to
improve existing stream clustering algorithms by supporting 5 evolutions with a new
suitable cluster representation and a distance function. Experimental results show that
this technique yields better cluster quality than HPStream.
The remaining of the paper is organized as follows. Section 2 introduces basic
concepts and definitions. Section 3 presents our stream clustering algorithm called
E-Stream. Section 4 compares the performance of E-Stream and HPStream with
respect to the synthetic dataset. Conclusions are discussed in Section 5.

R. Alhajj et al. (Eds.): ADMA 2007, LNAI 4632, pp. 605–615, 2007.
© Springer-Verlag Berlin Heidelberg 2007
606 K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai

2 Basic Concepts and Definitions

In this section, we introduce some basic concepts and definitions that will be used
subsequently.
The data stream consists of a set of multidimensional records X1…Xk… arriving at
time stamps T1…Tk…. Each data point Xi is a multidimensional record containing d
dimensions, denote by Xi = (xi1…xid).
An isolated data point is a data point that is not a member of any clusters. Isolated
data points remain in the system for cluster appearance computations.
An inactive cluster is a cluster that has a low weight. It can become an active cluster
if its weight is increased.
An active Cluster is a cluster that can assemble incoming data if there is sufficient
similarity score.
A cluster is a collection of data that has been memorized for processing in the
system. It can be an isolated data point, an inactive cluster, or an active cluster.
Fading decreases weight of data over time. In a data stream that has evolving data;
older data should have lesser weight. We decrease weight of every cluster over time
to achieve a fast adaptive cluster model. Let λ be the decay rate and t be elapsed time,
the fading function is
f (t ) = 2 − λ t (1)

Weight of a cluster is the number of data elements in a cluster. Weight is determined

according to the fading function. Initially, each data element has a weight of 1. A
cluster can be increased its weight by assembling incoming data points or merging
with other clusters.

2.1 Fading Cluster Structure with Histogram: FCH

Each cluster is represented as a Fading Cluster Structure (FCS) [1] utilizing a α-bin
histogram for each feature of the dataset. We called our cluster representation Fading
Cluster Structure with Histogram (FCH). Let Ti be the time when data point xi is
retrieved, and suppose t be the current time then f(t-Ti) is the fading weight of data
point xi.
FC1(t) is a vector of weighted sumation of data feature values at time t. The jth
dimension is

( )
N
FC1 j (t ) = ∑ f (t − Ti ) ⋅ xij (2)
i =1

FC2(t) is the weighted sum of squares of each data feature at time t. The jth
dimension is,

( )
N
FC 2 j (t ) = ∑ f (t − Ti ) ⋅ x ij
2
(3)
i =1
E-Stream: Evolution-Based Technique for Stream Clustering 607

W(t) is a sum of all weights of data points in the cluster at time t,

N
W (t ) = ∑ f (t − Ti ) (4)
i =1

H(t) is a α-bin histogram of data values. For the jth feature at time t, the elements of
j
H are
( )( )
N
H l j (t ) = ∑ f (t − Ti ) ⋅ xij ⋅ y ilj (5)
i =1
Where
⎧1 if l ⋅ b + left ≤ x i ≤ (l + 1) ⋅ b + left
y il = ⎨ (6)
⎩0 otherwise

left = min( xij ) (7)

right = max( xij ) (8)

left + right
b= (9)
α
left is a minimum value, right is a maximum value in this cluster, b is a size of each
bin, and yil is a weigth of xi in the lth bin.

2.2 Histogram Management

We utilize a histogram of cluster data values to identify cluster splits. A α-bin

histogram summarizes the distribution of cluster data for each dimension of each
cluster. The range of each bin is calculated as the difference between the maximum
and minimum feature values divided by α. When the maximum or minimum value
changes, we calculate a new range and update the values in each range from the
intersection between the new and old ranges. Each cluster has a histogram of feature
values, but the histogram is utilized only for the split of active clusters. Only an active
cluster can assemble an incoming data point.
Cluster split is based on the distribution of feature values as summarized by the
cluster histogram [1]. If a statistically significant valley is found between two peaks in
histogram values along any dimensions, the cluster is split. If more than one split
valley occurs in the histogram values, the value with the minimum height relative to
the surrounding peaks is chosen. When a cluster is split, the histogram in that
dimension is split and the other dimensions are weighted from the split dimension.
FC1, FC2 and W are recalculated from the new cluster histograms.

2.3 Distance Functions

Cluster-Point distance is a distance from a data point to the center of a cluster,

ormalized by the standard deviation (radius) of the cluster data in each dimension.
608 K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai

Split histogram 1st split histogram 2nd split histogram

Split
Dimension

Other
Dimension

Fig. 1. Histogram management in a split dimensionop and other dimension

This function is used to find the closest active cluster for an incoming data point. A
cluster with a larger radius yields a lower distance. Let C be an active cluster and x be
a data point, the Cluster-Point distance is
1 d centerCj − x j
dist (C , x ) = ⋅∑ (10)
d j =1 radius Cj

Cluster-Cluster distance is a difference between centers of two clusters. It is used to

find the closest pair of cluster. If Ca and Cb are two clusters, the cluster-cluster distance is
1 d
dist (C a , C b ) = ∑ centerCja − centerCjb (11)
d j =1

3 The Algorithm
In this section, we first describe our idea. Then, we present the E-Stream algorithm
composing of a set of sub-algorithms.

3.1 Our Idea

The behavior of data in a data stream can evolve over time. We can classify this
evolution into five categories: appearance, disappearance, self evolution, merge, and
split. A clustering model can start from empty. In the beginning, incoming data are
considered as isolated clusters. A cluster is formed when a sufficiently dense region
appears. The cluster assembles similar data and increases its existence. When a set of
clusters has been identified, incoming data must be assigned to a cluster based on
similarity score, or the datum may be classified as an isolated. In every change of a
cluster we check if any of the following evolutions occur and handle it.

Appearance: A new cluster can appear if there is a sufficiently dense group of data
points in one area. Initially, such elements appear as a group of outliers, but (as more
data appears in a neighborhood) they are recognized as a cluster.
Disappearance: Existing clusters can disappear because the existence of data is
diminished over time. Clusters that contain only old data are faded and eventually
disappear because they do not represent the presence of data.
E-Stream: Evolution-Based Technique for Stream Clustering 609

Self-evolution: Data can change their behaviors, which cause size or position of a
cluster to evolve. Evolution can be done faster if the data can fade.
Merge: A pair of clusters can be merged if their characteristics are very similar.
Merged clusters must cover the behavior of the pair.
Split: A cluster can be split into two smaller clusters if the behavior inside the cluster
is obviously separated.

3.2 E-Stream Algorithm

This section describes E-Stream in details. Following is the list of notations used in
our pseudo-code.
• |FCH| = current number of clusters
•
th
FCHi.W = weight of the i cluster
•
th
FCHi.sd = standard deviation of the i cluster
• S = set of pair of the split cluster
E-Stream is the main algorithm. In line 1, the algorithm starts by retrieving a new data
point. In line 2, it fades all clusters and deletes those having insufficient weight. Line 3
performs a histogram analysis and cluster split. Then line 4 checks for overlap clusters and
merges them. Line 5 checks the number of clusters and merges the closest pairs if the
cluster count exceeds the limit. Line 6 checks all clusters whether their status are active.
Lines 7-10 find the closest cluster to the incoming data point. If the distance is less than
radius_factor then the point is assigned to the cluster, otherwise it is an isolated data point.
The flow of control then returns to the top of the algorithm and waits for a new data point.

Algorithm E-Stream
1 retrieve new data Xi
2 FadingAll
3 CheckSplit
4 MergeOverlapCluster
5 LimitMaximumCluster
6 FlagActiveCluster
7 (minDistance, index) ← FindClosestCluster
8 if minDistance < radius_factor
9 add xi to FCHindex
10 else
11 create new FCH from Xi
12 waiting for new data

Fig. 2. E-Stream, stream clustering algorithm

FadingAll. The algorithm performs fading of all clusters and deletes the clusters
whose weight is less than remove_threshold.
CheckSplit is used to verify the splitting criteria in each cluster using the histogram.
If a splitting point is found in any cluster then it is split. And store the index pairs of
split cluster in S.
610 K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai

CheckMerge is an algorithm for merging pairs of similar clusters. This algorithm

checks every pair of clusters and computes the cluster-cluster distance. If the distance
is less than merge_threshold and the merged pair is not in S then merge the pair.
LimitMaximumCluster is used to limit the number of clusters. This algorithm
checks whether the number of clusters is not greater than maximum_cluster (an input
parameter); if it exceeds then the closest pair of clusters is merged until the number of
remaining clusters is less than or equal to the threshold.
FlagActiveCluster is used to check the current active cluster. If the weight of any
cluster is greater or equal to active_threshold then it is flagged as an active cluster.
Otherwise, the flag is cleared.
FindClosestCluster is used to find the distance and index of the closest active cluster
for an incoming data point.

Algorithm FadingAll Algorithm CheckSplit

for i ← 1 to |FCH| for i ← 1 to |FCH|
fading FCHi for j ← 1 to d
if FCHi.W < fade_threshold if FCHij have split point
delete FCHi split FCHi
S ← S U {(i, |FCH|)}

Fig. 3. FadingAll and CheckSplit algorithms

Algorithm MergeOverlapCluster Algorithm LimitMaximumCluster

for i ← 1 to |FCH| while |FCH| > maximum_cluster
for j ← i + 1 to |FCH| for i ← 1 to |FCH|
overlap[i,j] ← dist(FCHi,FCHj) for j ← i + 1 to |FCH|
m ← merge_threshold dist[i,j] ← dist(FCHi, FCHj)
if overlap[i,j] > m*(FCHi.sd+FCHj.sd) (first, second) ← argmin(i,j)(dist[i,j])
if (i, j) not in S merge(FCHfirst, FCHsecond)
merge(FCHi, FCHj)

Fig. 4. MergeOverlapCluster and LimitMaximumCluster algorithms

Algorithm FlagActiveCluster Algorithm FindClosestCluster

for i ← 1 to |FCH| for i ← 1 to |FCH|
if FCHi.W>= active_threshold if FCHi is active cluster
flag FCHi as active cluster dist[i] ←dist(FCHi, xi)
else (minDistance, i) ← min(dist[i])
remove flag from FCHi return (minDistance, i)

Fig. 5. FlagActiveCluster and FindClosestClusterAlgorithms

E-Stream: Evolution-Based Technique for Stream Clustering 611

4 Experimental Results
We tested the algorithm using a synthetic dataset consisting of two dimensions and
8,000 data points. This data changes the behavior of clusters over time. We can
segment it into 8 intervals as follows
1. Initially, there are 4 clusters in a steady state. Data point from 1 to 1600.
2. The 5th cluster appears at position (15, 6). Data point from 1601 to 2600.
3. The 1st cluster disappears. Data point from 2601 to 3400.
4. The 4th cluster swells. Data point from 3401 to 4200
5. The 2nd and 5th cluster get closer. Data point from 4201 to 5000.
6. The 2nd and 5th are merged into a bigger cluster. Data point from 5001 to 5600.
7. A 6th cluster is split from the 3rd cluster. Data point from 5601 – 6400.
8. Every cluster is in a steady state again. Data point from 6401 – 8000.

Fig. 6. The 8-Step evolution of the Synthetic Dataset

4.1 Efficiency Test

In this experiment, we set the parameters as in table 1. E-Stream allows the number of
clusters to vary dynamically with the constraint of the maximum number of clusters,
but requires a limit on the number of clusters. HPStream requires a fixed number of
clusters. Since the synthetic dataset has at most 5 clusters in each interval, we used 5
as the cluster (group) count in HPStream, and 10 as the cluster limit in E-Stream.
HPStream requires initial data for its initialization process before beginning stream
clustering. We, therefore, set it to 100 points.
612 K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai

Table 1. Parameters of each algorithm

algorithm E-Stream algorithm HPStream

maximum_cluster 10 num_cluster 5
stream_speed 100 stream_speed 100
decay_rate 0.1 decay_rate 0.1
radius_factor 3 radius_factor 3
remove_threshold 0.1
merge_threshold 1.25
active_threshold 5

In the first interval, there are four clusters in a steady-state. Both algorithms yield
clusters with little distortion, but E-Stream has a great number of clusters. Because at
the beginning E-Stream does not have any active clusters, every incoming data point
is considered as an isolated data point. As more data is accumulated, a cluster will
appear. On the other hand, HPStream requires initial data (set to 100 points) for
offline clustering, so HPStream exhibits better initial clustering than E-Stream.
In the second interval, a new cluster appears. HPStream still yields a better quality
because it finds all clusters correctly, but E-Stream yields only little distortion.
In the third interval, an existing cluster disappears. E-Stream yields good clustering
while HPStream tries to create a new cluster from existing cluster even though their
do not have a significant difference, due to the fixed cluster-count constraint.
In the 4th interval, a cluster swells. In the 5th interval, two clusters are closer, an
evolution that is supported by both algorithms. But, HPStream still tries to find five
clusters.
In the 6th interval two close clusters are merged. Neither algorithms merge the two
clusters in this interval even though their current behaviors are undistinguished.
In the 7th interval, a cluster splits. E-Stream can support this evolution when it
receives enough data. But HPStream cannot detect this.
In the 8th interval there are four clusters in a steady state again. E-stream algorithm
detects the previous merged case and identifies all clusters correctly within this
interval. But HPStream still is confusing by the cluster behavior.

Purity
1.05

0.95

0.9
purity

E-Stream
HPStream
0.85

0.8

0.75

0.7
200 1000 1800 2600 3400 4200 5000 5800 6600 7400
data stream (points)

Fig. 7. Purity test between E-Stream and HPStream

E-Stream: Evolution-Based Technique for Stream Clustering 613

In the purity test, E-Stream always has a purity greater than 0.9. But HPStream
exhibits a big drop in the seventh interval (cluster split), because the algorithm cannot
accommodate this evolution.
In the F-measure test, E-Stream yields an average value much better than
HPStream although, there are two intervals where E-Stream has a lower F-measure.
HPStream yields better results due to an initial offline process to find the initial
clusters. The second instance is the second interval (1601-2600), where E-Stream
merged two clusters incorrectly.
From the efficiency test, we can say that HPStream cannot support the evolution of
number of clusters because the algorithm constrains it. E-Stream can support all the evolu-
tions in Section 3, even though some evolutions such as merge require a lot of data to detect.

F-Measure
1.05

0.95
f-measure

0.9
E-Stream
HPStream
0.85

0.8

0.75

0.7
200 1000 1800 2600 3400 4200 5000 5800 6600 7400
data stream (points)

Fig. 8. F-Measure test between E-Stream and HPStream

4.2 Sensitivity with Number of Cluster (Input Parameters)

E-Stream Purity HPStream Purity

1.05 1.05
1 1
k=5 k=5
0.95 0.95
k = 10
purity

k = 10
purity

0.9 0.9
k = 20 k = 20
0.85 0.85
k = 40 k = 40
0.8 0.8
k = 80 k = 80
0.75 0.75
0.7 0.7
200 1400 2600 3800 5000 6200 7400
200 1400 2600 3800 5000 6200 7400
data stream (points) data stream (points)

E-Stream F-measure HPStream F-measure

1.2 1.2
1 1 k=5
k=5
f-measure
f-measure

0.8 k = 10 0.8 k = 10
0.6 k = 20 0.6 k = 20
0.4 k = 40 0.4 k = 40
0.2 k = 80 0.2 k = 80
0 0
200 1400 2600 3800 5000 6200 7400 200 1400 2600 3800 5000 6200 7400
data stream (points) data stream (points)

Fig. 9. Sensitivity with number of cluster (input parameter)

614 K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai

From the experiment, the Purity of both algorithms is not sensitive to this input
parameter. But in F-measure terms, HPStream has a tendency to drop if the input
number of clusters differs greatly from the actual number. E-Stream is not sensitive to
this parameter since the number of clusters is not fixed. As long as the maximum
number is not exceeded, E-Stream still yields good results.

4.3 Runtime with Number of Data

In this experiment we use a dataset consisting of 500,000 data points with two
dimensions and five clusters.

Runtime
180
160
140
runtime (secs)

120
100 E-Stream
80 HPStream

60
40
20
0
0 50000 100000 150000 200000 250000 300000 350000 400000 450000
#data (points)

Fig. 10. Runtime with number of data (points)

Both algorithms exhibit linear runtime in number of data points, which is a

constraint for stream clustering algorithms.

4.4 Runtime as a Function of Clusters and of Dimensions

To test the runtime as a function of the number of clusters, we use two dimensions,
100,000 data points, and vary the number of data clusters from 5 to 25 in increments
of 5 clusters.
For runtime with the number of dimensions test, we use 5 clusters, 100,000 data
points, and vary the number of dimensions from 2 to 20.

Runtime Runtime
400 300
350 250
runtime (secs)

runtime (secs)

300
250 200
E-Stream E-Stream
200 150
150 HPStream HPStream
100
100
50 50
0 0
5 10 15 20 25 2 4 6 8 10 12 14 16 18 20
#cluster #dimension

Fig. 11. Runtime with number of clusters and number of dimensions

E-Stream: Evolution-Based Technique for Stream Clustering 615

The results of both experiments are summarized in Figure 15. HPStream exhibits
linear runtime in both the number of clusters and the number of dimensions. E-Stream
exhibits linear runtime in the number of dimensions but polynomial runtime in the
number of clusters. This is due to the merging procedure, which requires O(k2) time in
the number of clusters.

5 Conclusions
This paper proposed a new stream clustering technique called E-Stream which can
support five cluster evolutions: appearance, disappearance, self-evolution, merge, and
split. These evolutions can normally occur in an evolving data stream. This technique
outperforms a well-known technique, HPStream. However, the runtime of the new
approach is polynomial with respect to the number clusters.

Acknowledgment. Thanks to J. E. Brucker and P. Vateekul for their reading and

comments of this paper.

References
1. Milenova, B.L., Campos, M.M.: Clustering Large Databases with Numeric and Nominial
Values Using Orthogonal Projections. In: Proceedings of the 29th VLDB Conference
(2003)
2. Aggarwal, C., Han, J., Wang, J., Yu, P.S.: A Framework for Projected Clustering of High
Dimensional Data Streams. In: Proceeding of the 30th VLDB conference (2004)
3. Aggarwal, C., Han, J., Wang, J., Yu, P.S.: A Framework for Clustering Evolving Data
Streams. In: Proceeding of the 29th VLDB conference (2003)
4. Barbara, D.: Requirements for Clustering Data Streams. In: SIGKDD Explorations (2002)
5. Gaber, M.M., Zaslavsky, A., Krishnaswmy, S.: Mining Data Streams: A Review. In:
SIGMOD Record, vol. 34(2) (June 2005)
6. Oh, S., Kang, J., Byun, Y., Park, G., Byun, S.: Intrusion Detection based on Clustering a
Data Stream. In: Proceedings of the 2005 Third ACIS International Conference on Software
Engineering Research, Management and Applications (2005)
7. Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering Data
Streams: Theory and Practice. TKDE special issue on clustering 15 (2003)
8. Song, M., Wang, H.: Highly Efficient Incremental Estimation of Gaussian Mixture Models
for Online Data Stream Clustering. In: SPIE Conference on Intelligent Computing: Theory
And Application III (2005)
9. Zhang, T., Ramakhrisnan, R., Livny, M.: BIRCH: An Efficient Data Clustering Method for
Very Large Databases. In: Proc. ACM SIGMOD Int. Conf. Management of Data (1996)

View publication stats

Big Data Unit III
No ratings yet
Big Data Unit III
20 pages
Bonus Tema Grupiranje Tijekovnih Podataka
No ratings yet
Bonus Tema Grupiranje Tijekovnih Podataka
36 pages
GFJHFN
No ratings yet
GFJHFN
21 pages
TGNHBGFN
No ratings yet
TGNHBGFN
25 pages
Sangma2022 Article HierarchicalClusteringForMulti
No ratings yet
Sangma2022 Article HierarchicalClusteringForMulti
26 pages
LNAI 4632 - E-Stream - Evolution-Based Technique For Stream Clustering
No ratings yet
LNAI 4632 - E-Stream - Evolution-Based Technique For Stream Clustering
11 pages
Evolutionary Clustering
No ratings yet
Evolutionary Clustering
7 pages
FHC-NDS Fuzzy Hierarchical Clustering of Multiple Nominal Data Streams
No ratings yet
FHC-NDS Fuzzy Hierarchical Clustering of Multiple Nominal Data Streams
13 pages
Learning With Feature and Distribution Evolvable Streams
No ratings yet
Learning With Feature and Distribution Evolvable Streams
11 pages
DM Clustering UNIT4
No ratings yet
DM Clustering UNIT4
36 pages
Evolutionary Spectral Clustering by Incorporating Temporal Smoothness.
No ratings yet
Evolutionary Spectral Clustering by Incorporating Temporal Smoothness.
10 pages
Adaptive Clustering
No ratings yet
Adaptive Clustering
11 pages
Evolving Fuzzy Model
No ratings yet
Evolving Fuzzy Model
9 pages
Adaptive Clustering For Dynamic IoT Data Streams
No ratings yet
Adaptive Clustering For Dynamic IoT Data Streams
11 pages
DataStreamsCRC Anjaly
No ratings yet
DataStreamsCRC Anjaly
258 pages
BDA Notes Part 2
No ratings yet
BDA Notes Part 2
5 pages
A Novel Drift Detection Algorithm Based
No ratings yet
A Novel Drift Detection Algorithm Based
12 pages
Clustering
No ratings yet
Clustering
5 pages
Applying Temporal Dependence To de
No ratings yet
Applying Temporal Dependence To de
19 pages
5 Tracking - Recurrent - Concept - Drift - in - Streaming - Data - Using - Ensemble - Classifiers
No ratings yet
5 Tracking - Recurrent - Concept - Drift - in - Streaming - Data - Using - Ensemble - Classifiers
6 pages
Mining Techniques For Streaming Data
No ratings yet
Mining Techniques For Streaming Data
14 pages
A Framework For Clustering Evolving Data Streams
No ratings yet
A Framework For Clustering Evolving Data Streams
21 pages
DWDM FINAL6
No ratings yet
DWDM FINAL6
28 pages
Data Mining - Past Present and Future - A Typical
No ratings yet
Data Mining - Past Present and Future - A Typical
10 pages
Clustering Data Streams Theory Practice
No ratings yet
Clustering Data Streams Theory Practice
33 pages
Big Data
No ratings yet
Big Data
37 pages
Data Stream Unit4
No ratings yet
Data Stream Unit4
20 pages
Comp Sci - Ijcse - Improve Frequent Patteren Mining in Data - Himanshu - Opaid
No ratings yet
Comp Sci - Ijcse - Improve Frequent Patteren Mining in Data - Himanshu - Opaid
12 pages
Clustering Data Streams: Chun Wei Dept Computer & Information Technology Advisor: Dr. Sprague
No ratings yet
Clustering Data Streams: Chun Wei Dept Computer & Information Technology Advisor: Dr. Sprague
19 pages
Big Data Notes
No ratings yet
Big Data Notes
37 pages
AOS Rel 6.0 GUI User Guide August 2024
No ratings yet
AOS Rel 6.0 GUI User Guide August 2024
208 pages
Advances in Data Stream Mining
No ratings yet
Advances in Data Stream Mining
7 pages
Unit 4
No ratings yet
Unit 4
10 pages
BDA Unit-2
No ratings yet
BDA Unit-2
12 pages
DM Unit-5 Notes
No ratings yet
DM Unit-5 Notes
16 pages
Review Paper On Concept Drifting Data Stream Mining
No ratings yet
Review Paper On Concept Drifting Data Stream Mining
4 pages
A Framework For Clustering Evolving Data Streams
No ratings yet
A Framework For Clustering Evolving Data Streams
12 pages
Unit II (Big Data)
No ratings yet
Unit II (Big Data)
19 pages
Mining Data Streams
No ratings yet
Mining Data Streams
33 pages
Unit 3
No ratings yet
Unit 3
30 pages
Bda M4
No ratings yet
Bda M4
57 pages
Mining Data Streams
No ratings yet
Mining Data Streams
17 pages
Video Information Reterival Using: Chemeleon Clustering
No ratings yet
Video Information Reterival Using: Chemeleon Clustering
5 pages
Requirements For Clustering Data Streams: Dbarbara@gmu - Edu
No ratings yet
Requirements For Clustering Data Streams: Dbarbara@gmu - Edu
5 pages
Unit-II BDA
No ratings yet
Unit-II BDA
19 pages
Mining&Data Stream Unit-3 - Removed
No ratings yet
Mining&Data Stream Unit-3 - Removed
50 pages
IJRET - in Data Streams Using Classification and Clustering Different Techniques To Find Novel Class
No ratings yet
IJRET - in Data Streams Using Classification and Clustering Different Techniques To Find Novel Class
3 pages
TELSMITH Rotary Drum Scrubber New
100% (4)
TELSMITH Rotary Drum Scrubber New
7 pages
An Efficient Closed Frequent Itemset Miner For The MOA Stream Mining System
No ratings yet
An Efficient Closed Frequent Itemset Miner For The MOA Stream Mining System
10 pages
Unit-II (Big Data)
No ratings yet
Unit-II (Big Data)
20 pages
A
No ratings yet
A
3 pages
Eng-Improve Frequent Pattern Mining in Data Stream-Himanshu Shah
No ratings yet
Eng-Improve Frequent Pattern Mining in Data Stream-Himanshu Shah
10 pages
Stream
No ratings yet
Stream
30 pages
Clustering On Temporal Multidimensional Data by Visualization Technique
No ratings yet
Clustering On Temporal Multidimensional Data by Visualization Technique
4 pages
Overview of Streaming-Data Algorithms
No ratings yet
Overview of Streaming-Data Algorithms
10 pages
Methodologies For Stream Data Processing and Stream Data Systems
No ratings yet
Methodologies For Stream Data Processing and Stream Data Systems
20 pages
Feature-Selected and - Preserved Sampling For
No ratings yet
Feature-Selected and - Preserved Sampling For
6 pages
Frequent Patterns Mining in Time-Sensitive Data Stream
No ratings yet
Frequent Patterns Mining in Time-Sensitive Data Stream
8 pages
Visual Clustering Approaches
No ratings yet
Visual Clustering Approaches
3 pages
Disassembly and Assembly Manual Cat c15 Engine
0% (1)
Disassembly and Assembly Manual Cat c15 Engine
2 pages
Mining Frequent Itemsets Based On CBSW Method: K Jothimani, DR Antony Selvadossthanmani
No ratings yet
Mining Frequent Itemsets Based On CBSW Method: K Jothimani, DR Antony Selvadossthanmani
5 pages
Oracle Leasing and Finance Management
No ratings yet
Oracle Leasing and Finance Management
74 pages
DGTL-BRKACI-2690-How To Extend Your ACI Fabric To Public Cloud
No ratings yet
DGTL-BRKACI-2690-How To Extend Your ACI Fabric To Public Cloud
79 pages
Sagemcom Dgci384
No ratings yet
Sagemcom Dgci384
6 pages
Course Plan - Linux Lab
No ratings yet
Course Plan - Linux Lab
12 pages
Assessment - 6: Question 1: Write The C Program To Implement A Linked List File Allocation Method
No ratings yet
Assessment - 6: Question 1: Write The C Program To Implement A Linked List File Allocation Method
6 pages
5.1 Mining Data Streams
No ratings yet
5.1 Mining Data Streams
16 pages
A Queueing Model With Server Breakdowns Repairs Va
No ratings yet
A Queueing Model With Server Breakdowns Repairs Va
13 pages
LM Chart Cast Alloys Aluminum
0% (1)
LM Chart Cast Alloys Aluminum
2 pages
Boe 1
No ratings yet
Boe 1
9 pages
Middleware MidSem Preparation - PDF
No ratings yet
Middleware MidSem Preparation - PDF
108 pages
Solar Dryer
No ratings yet
Solar Dryer
25 pages
Iot 220112132928
No ratings yet
Iot 220112132928
31 pages
Digital System Design (EC 302) - MCQ (Google Classroom Uploading)
No ratings yet
Digital System Design (EC 302) - MCQ (Google Classroom Uploading)
7 pages
Manual - Pdms Support Design
No ratings yet
Manual - Pdms Support Design
84 pages
Discrete Maths Sle
No ratings yet
Discrete Maths Sle
13 pages
PCS50-630 User Manual 20220509
No ratings yet
PCS50-630 User Manual 20220509
37 pages
9852 2340 01b Manual Cement Unit Boltec M & L RCS 4.5
No ratings yet
9852 2340 01b Manual Cement Unit Boltec M & L RCS 4.5
56 pages
AMF-65 AMS RFT Partnership Range - CULT PDF
No ratings yet
AMF-65 AMS RFT Partnership Range - CULT PDF
46 pages
DSP Seminar
No ratings yet
DSP Seminar
17 pages
Kolom Distilasi Tinjauan Umum
No ratings yet
Kolom Distilasi Tinjauan Umum
22 pages
Nuaire ECO4-AE-1Z-STND Installation and Manual
No ratings yet
Nuaire ECO4-AE-1Z-STND Installation and Manual
4 pages
Ma8551 Algebra and Number Theory
No ratings yet
Ma8551 Algebra and Number Theory
14 pages
Running Head: DATA STRUCTURES 1: Course: Project Name: Student Name: Date
No ratings yet
Running Head: DATA STRUCTURES 1: Course: Project Name: Student Name: Date
7 pages
Which Control The Pitch Angle of The Tail Rotor Blades: by Pressing On The Right Pedal, The Pitch Is
No ratings yet
Which Control The Pitch Angle of The Tail Rotor Blades: by Pressing On The Right Pedal, The Pitch Is
5 pages
Arwa Alrezehi - Shahad Sultan
No ratings yet
Arwa Alrezehi - Shahad Sultan
1 page
Cover Letter Qatar
No ratings yet
Cover Letter Qatar
1 page
Tsarouchas Anastasios Resume
No ratings yet
Tsarouchas Anastasios Resume
1 page
Installer Uninstaller Readme
No ratings yet
Installer Uninstaller Readme
2 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

E-Stream Evolution-Based Technique For Stream Clus

Uploaded by

E-Stream Evolution-Based Technique For Stream Clus

Uploaded by

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

E-Stream: Evolution-Based Technique for Stream

Conference Paper in Lecture Notes in Computer Science · August 2007

Thanawin Rakthanmanon Kitsana Waiyamai

SEE PROFILE SEE PROFILE

The user has requested enhancement of the downloaded file.

Komkrit Udommanetanakit, Thanawin Rakthanmanon, and Kitsana Waiyamai

Department of Computer Engineering, Faculty of Engineering

2 Basic Concepts and Definitions

Weight of a cluster is the number of data elements in a cluster. Weight is determined

2.1 Fading Cluster Structure with Histogram: FCH

W(t) is a sum of all weights of data points in the cluster at time t,

left = min( xij ) (7)

right = max( xij ) (8)

2.2 Histogram Management

We utilize a histogram of cluster data values to identify cluster splits. A α-bin

2.3 Distance Functions

Cluster-Point distance is a distance from a data point to the center of a cluster,

Split histogram 1st split histogram 2nd split histogram

Fig. 1. Histogram management in a split dimensionop and other dimension

Cluster-Cluster distance is a difference between centers of two clusters. It is used to

3.1 Our Idea

3.2 E-Stream Algorithm

Fig. 2. E-Stream, stream clustering algorithm

CheckMerge is an algorithm for merging pairs of similar clusters. This algorithm

Algorithm FadingAll Algorithm CheckSplit

Fig. 3. FadingAll and CheckSplit algorithms

Algorithm MergeOverlapCluster Algorithm LimitMaximumCluster

Fig. 4. MergeOverlapCluster and LimitMaximumCluster algorithms

Algorithm FlagActiveCluster Algorithm FindClosestCluster

Fig. 5. FlagActiveCluster and FindClosestClusterAlgorithms

Fig. 6. The 8-Step evolution of the Synthetic Dataset

4.1 Efficiency Test

Table 1. Parameters of each algorithm

algorithm E-Stream algorithm HPStream

Fig. 7. Purity test between E-Stream and HPStream

Fig. 8. F-Measure test between E-Stream and HPStream

4.2 Sensitivity with Number of Cluster (Input Parameters)

E-Stream Purity HPStream Purity

E-Stream F-measure HPStream F-measure

Fig. 9. Sensitivity with number of cluster (input parameter)

4.3 Runtime with Number of Data

Fig. 10. Runtime with number of data (points)

Both algorithms exhibit linear runtime in number of data points, which is a

4.4 Runtime as a Function of Clusters and of Dimensions

Fig. 11. Runtime with number of clusters and number of dimensions

Acknowledgment. Thanks to J. E. Brucker and P. Vateekul for their reading and

View publication stats

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.