Stream Data
Concepts and Techniques
— Chapter 8 —
8.1. Mining data streams
Seven chapters (Chapters 1-7) are covered in the Fall semester
Four chapters (Chapters 8-11) are covered in the Spring semester
[Figure: stream query processing architecture. Labels: Continuous Query, Results, Multiple streams, Stream Query Processor, Scratch Space (main memory and/or disk)]
Data Mining: Concepts and Techniques, May 25, 2025
Challenges of Stream Data Processing
Methodology
Synopses (trade-off between accuracy and storage)
Histograms
Sliding windows
Multi-resolution model
Sketches
Randomized algorithms
Stream Data Processing Methods (1)
Random sampling (but without knowing the total length in advance)
Reservoir sampling: maintain a set of s candidates in the reservoir,
which form a true random sample of the elements seen so far in the
stream. As the data stream flows, every new element has a certain
probability (s/N) of replacing an old element in the reservoir,
where N is the number of elements seen so far.
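The reservoir update can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the function name `reservoir_sample` and the optional `rng` parameter are assumptions introduced here.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Return a uniform random sample of up to s items from an iterable
    whose total length is not known in advance."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            # The first s elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # Draw j uniformly from [0, n); j < s happens with probability s/n,
            # in which case the new item replaces the old item in slot j.
            j = rng.randrange(n)
            if j < s:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(1000), 10)` returns 10 distinct values from the stream while storing only 10 elements at any time.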
Sliding windows
Make decisions based only on recent data of sliding window size w
An element arriving at time t expires at time t + w
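A sliding window with this expiry rule might look like the sketch below. The class name `SlidingWindow` and its methods are hypothetical, and the sketch assumes elements arrive in non-decreasing time order.

```python
from collections import deque


class SlidingWindow:
    """Keep only the elements that have not yet expired (illustrative sketch)."""

    def __init__(self, w):
        self.w = w
        self.items = deque()  # (arrival_time, value) pairs, oldest first

    def add(self, t, value):
        """Record a value arriving at time t, then drop expired elements."""
        self.items.append((t, value))
        self.expire(t)

    def expire(self, now):
        # An element that arrived at time t expires at time t + w.
        while self.items and self.items[0][0] + self.w <= now:
            self.items.popleft()

    def values(self):
        return [v for _, v in self.items]
```

Any decision (a count, an average, a model update) is then computed over `values()` only, so memory is bounded by the number of arrivals within one window.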
Histograms
Approximate the frequency distribution of element values in a
stream
Partition data into a set of contiguous buckets
Equal-width (equal value range for buckets) vs. V-optimal
(minimizing frequency variance within each bucket)
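An equal-width histogram can be maintained in one pass when the value range is known up front. The sketch below assumes such a known range `[lo, hi)` and clamps out-of-range values into the end buckets; the class name is hypothetical, and the V-optimal variant is not shown.

```python
class EqualWidthHistogram:
    """Streaming equal-width histogram over a known value range (sketch)."""

    def __init__(self, lo, hi, num_buckets):
        if hi <= lo or num_buckets <= 0:
            raise ValueError("need hi > lo and num_buckets > 0")
        self.lo = lo
        self.width = (hi - lo) / num_buckets  # same value range for every bucket
        self.counts = [0] * num_buckets

    def add(self, x):
        """Increment the count of the bucket whose range contains x."""
        idx = int((x - self.lo) / self.width)
        # Assumption: values outside [lo, hi) are clamped to the end buckets.
        idx = min(max(idx, 0), len(self.counts) - 1)
        self.counts[idx] += 1
```

The bucket counts approximate the frequency distribution using storage proportional to the number of buckets rather than to the stream length.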
Multi-resolution models
Popular models: balanced binary trees, micro-clusters, and wavelets
Stream Data Processing Methods (2)
Sketches
Histograms and wavelets require multiple passes over the data, but
sketches can operate in a single pass
$F_k = \sum_{i=1}^{v} m_i^k$
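To make the quantity F_k concrete, the sketch below computes it exactly, reading m_i as the number of times the i-th distinct value occurs in the stream. This is deliberately not a sketch data structure: it stores one counter per distinct value, which is the cost that sketches avoid by approximating. The function name `frequency_moment` is an assumption made for illustration.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment: the sum of m_i ** k over distinct values,
    where m_i is the number of occurrences of value i."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())
```

With this definition, k = 0 gives the number of distinct values, k = 1 gives the stream length, and k = 2 grows with the skew of the distribution.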
But often requires less “precision”, e.g., no join,
grouping, sorting
Patterns are hidden and more general than querying
Not necessarily continuous queries
Stream data mining tasks
Multi-dimensional on-line analysis of streams
[Figure: tilted time frame along the time axis: 12 months, 31 days, 24 hours, 4 qtrs]
Logarithmic tilted time frame:
Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, …
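One possible reading of the logarithmic frame is that slot widths double as they move further into the past. The sketch below follows that reading; the function name `log_tilted_slots` and the representation of slots as (start, end) offsets in minutes before the present are assumptions.

```python
def log_tilted_slots(horizon):
    """Split the last `horizon` minutes into slots of width 1, 2, 4, 8, ...
    Each slot is a (start, end) pair of offsets in minutes before now;
    the oldest slot is truncated at the horizon."""
    slots = []
    start, width = 0, 1
    while start < horizon:
        end = min(start + width, horizon)
        slots.append((start, end))
        start = end
        width *= 2  # older data is kept at coarser granularity
    return slots
```

Under this scheme the number of slots grows only logarithmically with the horizon: a full year of minutes (525,600) fits in 20 slots.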
(A1, *, C1)
(A1, *, C2) (A1, B1, C1) (A2, *, C1)
(A1, B1, C2) (A1, B2, C1) (A2, *, C2) (A2, B1, C1)
(A1, B2, C2) (A2, B1, C2) (A2, B2, C1)
(A2, B2, C2)
An H-Tree Cubing Structure
[Figure: H-tree with a root, an observation layer, and city nodes Chicago, Urbana, Springfield]
[Figure: Empty (summary) + item counts]
Itemset ( ) is deleted.
That’s why we choose a large number of buckets – delete more
Pruning Itemsets – Apriori Rule
$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
Hoeffding Tree Algorithm
Hoeffding Tree Input
S: sequence of examples
X: attributes
G( ): evaluation function
δ: desired accuracy
Hoeffding Tree Algorithm
for each example in S
    retrieve G(Xa) and G(Xb) // two highest G(Xi)
    if ( G(Xa) - G(Xb) > ε )
        split on Xa
        recurse to next node
        break
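The split test can be illustrated with a small sketch that computes the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)) and compares it with the gap between the two best attributes. Here R is taken to be the range of the evaluation function G, δ the allowed probability of choosing the wrong attribute, and n the number of examples seen at the node. The function names are hypothetical, and the bookkeeping that produces the G values is omitted.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(g_best, g_second, value_range, delta, n):
    """Split on the best attribute once its lead over the runner-up
    exceeds the Hoeffding bound for the n examples seen so far."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)
```

Because ε shrinks as n grows, a node that cannot separate its two best attributes yet simply waits for more examples; attributes whose G values are nearly tied can keep the node waiting for a long time.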
Decision-Tree Induction with Data Streams
[Figure: a decision tree growing as the data stream arrives, with test nodes Packets > 10, Protocol = http, and Bytes > 60K on yes/no branches]
Strengths
Scales better than traditional methods
Sublinear with sampling
Very small memory utilization
Incremental
Make class predictions in parallel
New examples are added as they come
Weakness
Could spend a lot of time with ties
G computed every nmin
Deactivates certain leaves to save memory
curve)
Compare to Hoeffding Tree: Better time and memory
Compare to traditional decision tree
Similar accuracy
21 minutes for VFDT
24 hours for C4.5
Still does not handle concept drift
CVFDT (Concept-adapting VFDT)
Concept Drift
Time-changing data streams
CVFDT
Increments count with new example
Sliding window
Nodes assigned monotonically increasing IDs
Grows alternate subtrees
[Figure: hierarchy from data points up to level-i medians and level-(i+1) medians]
Hierarchical Tree and Drawbacks
Method:
maintain at most m level-i medians
On seeing m of them, generate O(k) level-(i+1)
medians of weight equal to the sum of the
weights of the intermediate medians assigned to
them
Drawbacks:
Low quality for evolving data streams (register
only k centers)
Limited functionality in discovering and
exploring clusters over different portions of the
stream over time
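The bookkeeping of the hierarchical method (hold at most m level-i medians, then promote weighted level-(i+1) medians) can be sketched as follows. The clustering step here is a crude stand-in chosen for brevity, not the k-median routine the method relies on: it sorts 1-D values, cuts them into k contiguous groups, and takes each group's weighted median. All names are hypothetical, and the sketch assumes k < m.

```python
def weighted_medians(points, k):
    """Crude 1-D stand-in for k-median: sort (value, weight) pairs, cut them
    into k contiguous groups, return (weighted median, total weight) per group."""
    pts = sorted(points)
    size = -(-len(pts) // k)  # ceiling division
    result = []
    for i in range(0, len(pts), size):
        group = pts[i:i + size]
        total = sum(w for _, w in group)
        acc = 0
        for x, w in group:
            acc += w
            if 2 * acc >= total:
                # The new median carries the summed weight of its group.
                result.append((x, total))
                break
    return result


class HierarchicalMedians:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.levels = [[]]  # levels[i] holds (value, weight) level-i medians

    def add(self, x):
        """Treat each incoming data point as a level-0 median of weight 1."""
        self._insert(0, (x, 1))

    def _insert(self, i, median):
        if i == len(self.levels):
            self.levels.append([])
        self.levels[i].append(median)
        if len(self.levels[i]) == self.m:
            # On seeing m level-i medians, replace them by level-(i+1) medians.
            promoted = weighted_medians(self.levels[i], self.k)
            self.levels[i] = []
            for p in promoted:
                self._insert(i + 1, p)
```

Since every level holds fewer than m entries after each insertion, memory grows with the number of levels rather than with the stream length, and the total weight across levels always equals the number of points seen.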
Clustering for Mining Stream Dynamics
Network intrusion detection: one example
Detect bursts of activities or abrupt changes in real time by
on-line clustering
Our methodology (C. Aggarwal, J. Han, J. Wang, P.S. Yu, VLDB’03)
Tilted time framework: otherwise, dynamic changes cannot be found
Micro-clustering: better quality than k-means/k-median
(incremental, online processing and maintenance)
Two stages: micro-clustering and macro-clustering
With limited “overhead” to achieve high efficiency, scalability,
quality of results and power of evolution/change detection