Swe2011 Bda - III

Introduction to Stream Computing

1
4 Vs of Big Data

2
Infinite Data

[Figure: map of big-data techniques and applications]
• High-dimensional data: locality-sensitive hashing, clustering, dimensionality reduction
• Graph data: PageRank/SimRank, community detection, spam detection
• Infinite data: filtering data streams, queries on streams, web advertising
• Machine learning: SVM, decision trees, perceptron, kNN
• Apps: recommender systems, association rules, duplicate document detection
3
Data at Rest Vs. Data in Motion
• Data at Rest
– Data that is placed in storage rather than used
in real time
• Data in Motion
– Data that is moving across a network or in
memory for processing in real time

4
Data at rest vs Data in motion
• Data at Rest
– Data has been collected from various sources and is analyzed after the event occurs.
– The point where the data is analyzed and the point where action is taken on it occur at two separate times.
– Example: a retailer analyzes a previous month’s sales data and uses it to make strategic decisions about the present month’s business activities. The action takes place after the data-creating event has occurred.
– This data is meaningful to the retailer and allows them to create marketing campaigns and send customized coupons based on customer purchasing behavior and other variables.
– Batch processing method
• Data in Motion
– The collection process is similar to that of data at rest; however, the difference lies in the analytics, which occur in real time as the event happens.
– Example: a theme park uses wristbands to collect data about its guests. The wristbands constantly record data about each guest’s activities, and the park can use this information to personalize the guest visit with special surprises or suggested activities based on behavior.
– This allows the business to customize the guest experience during the visit.
– Real-time processing method

5
Data Streams - Terms
• Data Tsunami
• A data stream is a (potentially unbounded) sequence of tuples
• Each tuple consists of a set of attributes, similar to a row in a database table
– Continuous input, often in high-volume
– Does not end
– Impossible to process / analyze in real-time with traditional
relational database systems
• Transactional data streams: log interactions between entities
– Credit card: purchases by consumers from merchants
– Telecommunications: phone calls by callers to dialed parties
– Web: accesses by clients of resources at servers
• Measurement data streams: monitor evolution of entity states
– Sensor networks: physical phenomena, road traffic
– IP network: traffic at router interfaces
– Earth climate: temperature, moisture at weather stations
6
Data Streams: Characteristics
The characteristic of continually arriving data points introduces an important
property of data streams which also poses the greatest challenge: the size of a
data stream is potentially unbounded.

This leads to the following requirements for data stream processing algorithms:
• Bounded storage: The algorithm can only store a very limited amount of
data to summarize the data stream.
• Single pass: The incoming data points cannot be permanently stored and
need to be processed at once in the arriving order.
• Real-time: The algorithm has to process data points on average at least as
fast as the data is arriving.
• Concept drift: The algorithm has to be able to deal with a data generating
process which evolves over time (e.g., distributions change or new
structure in the data appears).

7
Why Data Stream Analysis?
• Must analyze the massive data:
– Scientific research (monitor environment, species)
– System management (spot faults, drops, failures)
– Business intelligence (marketing rules, new offers)
– Revenue protection (phone fraud, service abuse)

8
Data Stream Management System
[Figure: architecture of a data stream management system]
• Any number of streams enter the system; each stream is composed of elements/tuples arriving over time, e.g.:
. . . 1, 5, 2, 7, 0, 9, 3
. . . a, r, v, t, y, h, b
. . . 0, 0, 1, 0, 1, 1, 0
• A stream processor answers standing queries and ad-hoc queries, producing output.
• The system relies on a limited working storage, backed by an archival storage.
9
Analytics With Data-In-Motion
[Figure: raw bits are ingested as they arrive ("opportunity cost starts here"); the incoming data is used to bootstrap and enrich an adaptive analytics model, which then forecasts/nowcasts on the live stream]
10
DBMS Vs. DSMS
DBMS → DSMS
• Persistent relations → Transient streams
• One-time queries → Continuous queries
• Random access → Sequential access
• “Unbounded” disk store → Bounded main memory
• Only current state matters → History/arrival order is critical
• Passive repository → Active stores
• Relatively low update rate → Possibly multi-GB arrival rate
• No real-time services → Real-time requirements
• Assume precise data → Data stale/imprecise
• Access plan determined by query processor and physical DB design → Unpredictable/variable data arrival and characteristics

11
Real Time Data Streams
• Sensors gathering information, e.g. climate, traffic
• Social media: posts, pictures and videos
• Digital satellite images
• Purchase transaction records
• Mobile phone GPS signals
• High-volume administrative & transactional records

12
Real time Data Streams
• Sensor Data Streams
• Streaming trending topics on Twitter
• Share Market Streams
• Streaming video in TV Shows
• Blog Data
• Telecommunication calling records
• Credit card transaction flows
• Network monitoring and traffic engineering
• Web logs and Web page click streams
• Satellite data flow

13
Applications (1)
• Mining query streams
– Google wants to know what queries are
more frequent today than yesterday

• Mining click streams


– Yahoo wants to know which of its pages are getting
an unusual number of hits in the past hour

• Mining social network news feeds


– E.g., look for trending topics on Twitter, Facebook

Dr.SMK, AVCCE 14
Applications (2)
• Sensor Networks
– Many sensors feeding into a central controller
• Telephone call records
– Data feeds into customer bills as well as
settlements between telephone companies
• IP packets monitored at a switch
– Gather information for optimal routing
– Detect denial-of-service attacks

15
Issues in Stream Processing
• Streams often deliver elements very rapidly.
– Process elements in real time
– The stream-processing algorithm is executed in main
memory, without access to secondary storage or
with only rare accesses to secondary storage.
– Even when streams are “slow,” Process should be fast
– Even if each stream by itself can be processed using a
small amount of main memory, the requirements
of all the streams together can easily exceed the
amount of available main memory.

16
Points to Ponder….
1. Process an example at a time, and inspect it only once (at most)
2. Use a limited amount of memory
3. Work in a limited amount of time
4. Be ready to predict at any point

17
Managing Data Streams

18
Kinds of Stream Processing Techniques
• Sampling data in a Stream


– To create a sample of a stream that is usable for a class of queries
• Filtering data streams
– To accept only a particular set of elements as the stream arrives
• Counting distinct elements in a Stream
– To estimate the number of different elements appearing in a stream
• Estimating moments
– Involves the distribution of frequencies of different elements in a
stream
• Counting Ones in a Window
– Counting the number of 1’s in the binary stream

19
Sampling Streams
• Stream sampling is the process of collecting a representative
sample of the elements of a data stream.

• Since we cannot store the entire stream, one obvious approach is to store a sample.

• The sample is usually much smaller than the entire stream, but can
be designed to retain many important characteristics of the stream.

• One can select the subset of the stream in such a way that answers to queries about the selected subset are statistically representative of the stream as a whole.

• Unlike sampling from a stored data set, stream sampling must be performed online, as the data arrives.

20
Sampling Streams
• Two different approaches
– (1) Sample a fixed proportion of elements in the stream (say 1 in 10)
– (2) Maintain a random sample of fixed size over a potentially infinite stream
• At any “time” k we would like a random sample of s elements
– What is the property of the sample we want to maintain?
For all time steps k, each of the k elements seen so far has equal probability of being sampled
21
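Approach (2), keeping a fixed-size uniform sample over an unbounded stream, is usually implemented as reservoir sampling. A minimal Python sketch (the function name and interface are illustrative):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform random sample of s elements from a stream.

    After k elements have arrived (k >= s), every element seen so far
    is in the sample with probability s/k.
    """
    sample = []
    for k, element in enumerate(stream, start=1):
        if k <= s:
            sample.append(element)       # fill the reservoir first
        else:
            j = rng.randint(1, k)        # pick a slot in 1..k
            if j <= s:
                sample[j - 1] = element  # replace a random reservoir slot
    return sample
```

For example, `reservoir_sample(range(10**6), 10)` keeps only 10 elements in memory no matter how long the stream runs.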
Sampling Data Stream
• Inputs:
– Sample size k
– Window size n >> k (alternatively, time duration ‘m’)
– Stream of data elements that arrive online
• Output:
– k elements chosen uniformly at random from the last n elements (alternatively, from all elements that have arrived in the last m time units)
• Goal:
– maintain a data structure that can produce the
desired output at any time upon request
22
Two Types of Sliding Windows
• Sequence-Based
– The most recent n elements from the data stream
– Assumes a (possibly implicit) sequence number for
each element
• Timestamp-Based
– All elements from the data stream that arrived in the last t units of time (e.g. the last 1 week)
– Assumes a (possibly implicit) arrival timestamp for each element
• Sequence-based windows are the focus for most of this discussion
23
An example problem
• A search engine receives a stream of queries, and it would
like to study the behavior of a typical user.
• The stream consists of tuples (user, query, time). Suppose one would like to answer the query:
“What fraction of the typical user’s queries were repeated over the past month?”
• Suppose a user has issued s search queries exactly once in the past month, d queries exactly twice, and no queries more than twice.

• The correct answer: d/(s+d)

24
Sampling based Approach
• Suppose we sample 1/10th of the stream. Of the s queries issued once, an expected s/10 appear in the sample.
• Of the d queries issued twice, an expected d/100 appear twice in the sample:
d/100 = (1/10 ∙ 1/10) ∙ d
• An expected 18d/100 of them appear exactly once: one of the two occurrences falls in the selected 1/10th of the stream, while the other falls in the unselected 9/10th:
18d/100 = ((1/10 ∙ 9/10) + (9/10 ∙ 1/10)) ∙ d
• Hence the sample-based answer is
(d/100) / (s/10 + d/100 + 18d/100) = d / (10s + 19d)
which differs from the correct answer d/(s+d).
25
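The bias can be checked with plain arithmetic; the values s = 100 and d = 50 below are illustrative:

```python
def true_fraction(s, d):
    # Fraction of the user's distinct queries that were repeated.
    return d / (s + d)

def sampled_fraction(s, d):
    # Expected answer computed from a 1/10th sample:
    # d/100 doubletons among s/10 + 18d/100 singletons.
    doubletons = d / 100
    singletons = s / 10 + 18 * d / 100
    return doubletons / (singletons + doubletons)

# With s = 100 one-time queries and d = 50 twice-issued queries,
# the true answer is 1/3, but the sample-based answer is far smaller.
print(true_fraction(100, 50))     # 0.333...
print(sampled_fraction(100, 50))  # d/(10s+19d) = 50/1950 ≈ 0.0256
```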
Generalized Solution
• Stream of tuples with keys:
– Key is some subset of each tuple’s components
• e.g., tuple is (user, search, time); key is user
– Choice of key depends on application
• To get a sample of an a/b fraction of the stream:
– Hash each tuple’s key uniformly into b buckets (values 0 to b−1)
– Pick the tuple if its hash value falls in the first a buckets (i.e., is less than a)

How to generate a 30% sample?
Hash into b = 10 buckets, and take the tuple if it hashes to one of the first 3 buckets (values 0, 1, 2) 26
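A sketch of key-based sampling in Python; md5 is used here only as a convenient stand-in for a uniform hash function, and the names are illustrative:

```python
import hashlib

def in_sample(key, a, b):
    """Keep a tuple iff its key hashes into one of the first a of b buckets.

    Every tuple with the same key gets the same decision, so all of a
    user's queries are either in the sample or out of it together.
    """
    digest = hashlib.md5(str(key).encode()).digest()
    bucket = int.from_bytes(digest, "big") % b
    return bucket < a

# A ~30% sample of users: hash into b = 10 buckets, keep buckets 0-2.
stream = [("alice", "q1"), ("bob", "q2"), ("alice", "q3")]
sampled = [t for t in stream if in_sample(t[0], a=3, b=10)]
```

Because the decision depends only on the key, the sample is suitable for per-user queries like the repeated-search fraction above.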
Filtering Data Streams
• A common process on streams is selection,
or filtering.
• We want to accept those tuples in the stream that
meet a criterion.
• Accepted tuples are passed to another process as a
stream, while other tuples are dropped.
• The problem becomes harder when the criterion involves a lookup for membership in a set.
• It is especially hard when the set is too large to store in main memory.

27
Applications
• Email spam filtering
– We know 1 billion “good” email addresses
– If an email comes from one of these, it is NOT spam

• Publish-subscribe systems
– You are collecting lots of messages (news articles)
– People express interest in certain sets of keywords
– Determine whether each message matches
user’s interest

28
Filtering Data Streams : Problem
• Suppose we have a set S of one billion allowed email addresses (valid senders, not considered spam).
• Each email address occupies at least 20 bytes, so storing S needs 20 GB or more.
• Assume that we have 1 GB of main memory.
• Hence we are unable to store the entire S in main memory, and would need disk accesses too.
• This prompts the need for a method that performs the filtering in main memory alone.
• The technique is Bloom filtering.

29
Bloom Filtering
• The underlying concept is to utilize main memory as a bit array.
• With 1 GB of main memory, we have room for 8 billion bits.
• Devise a hash function h and hash each member of S to a bit, setting that bit to 1. All the other bits of the array remain 0.
• Since there are 1 billion members of S, approximately 1/8th of the bits will be 1.
• The exact fraction of bits set to 1 will be slightly less than 1/8th, because two members of S may hash to the same bit.

30
First Cut Solution
• Given a set of keys S that we want to filter
• Create a bit array B of n bits, initially all 0s
• Choose a hash function h with range [0,n)
• Hash each member of s S to one of
n buckets, and set that bit to 1, i.e., B[h(s)]=1
• Hash each element a of the stream and output
only those that hash to bit that was set to 1
– Output a if B[h(a)] == 1

31
First Cut Solution
[Figure: an item is hashed by hash function h into bit array B, e.g. 0010001011000]
• If the item hashes to a bucket that at least one of the items in S hashed to (bit = 1), output the item, since it may be in S.
• If the item hashes to a bucket set to 0, drop it: it is surely not in S.

• Creates false positives but no false negatives
– If the item is in S we surely output it; if not, we may still output it. 32
First Cut Solution
◾ |S| = 1 billion email addresses, |B| = 1 GB = 8 billion bits
◾ If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives)
◾ Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives)
 Actually, less than 1/8th, because more than one address might hash to the same bit
Analysis: Throwing Darts
• More accurate analysis for the number of
false positives

• Consider: If we throw m darts into n equally


likely targets, what is the probability that
a target gets at least one dart?

• In our case:
– Targets = bits/buckets
– Darts = hash values of items

34
Analysis: Throwing Darts
• We have m darts, n targets
• What is the probability that a given target gets at least one dart?

Probability a given target is not hit by any dart:
(1 − 1/n)^m = ((1 − 1/n)^n)^(m/n) → e^(−m/n) as n → ∞
(using (1 − 1/n)^n → 1/e)

Probability at least one dart hits the target:
1 − (1 − 1/n)^m ≈ 1 − e^(−m/n)
35
Analysis: Throwing Darts
• Fraction of 1s in the array B = probability of a false positive = 1 − e^(−m/n)

• Example: m = 10^9 darts, n = 8∙10^9 targets
– Fraction of 1s in B = 1 − e^(−1/8) ≈ 0.1175
• Compare with our earlier estimate: 1/8 = 0.125

36
Counting Distinct Problem
• Data stream consists of a universe
of elements chosen from a set of size N
– Maintain a count of the number of
distinct elements seen so far
• Maintain the set of elements seen so far
– That is, keep a hash table of all the
distinct elements seen so far
– Hashing and a variety of algorithms are to be used

37
Applications
• A Web site gathering statistics on how many
unique users it has seen in each given
month.
– The universal set is the set of logins for that site,
and a stream element is generated each time
someone logs in.
– This measure is appropriate for a site like
Amazon, where the typical user logs in with
their unique login name.

38
• Web site like Google that does not
require login to issue a search query
– may be able to identify users only by the IP
address from which they send the query.

– There are about 4 billion IP addresses; sequences of four 8-bit bytes will serve as the universal set in this case.

39
Solution
• The obvious way to solve the problem is to keep in main
memory a list of all the elements seen so far in the
stream.
• Adopt an efficient search structure such as a hash table or
search tree, so one can quickly add new elements
and check whether or not the element that just arrived
on the stream was already seen.
• As long as the number of distinct elements is not too great,
this structure can fit in main memory and there is
little problem obtaining an exact answer to the
question how many distinct elements appear in the
stream.
• Approach : Flajolet-Martin Algorithm

40
The Flajolet-Martin Algorithm
• Used to estimate the number of distinct elements by
hashing the elements of the universal set to a bit-
string
• Pick many different hash functions and hash each
element of the stream using these hash functions.
• The important property of a hash function is that
when applied to the same element, it always
produces the same result
• The length of the bit-string must be sufficient that
there are more possible results of the hash function
than there are elements of the universal set.

41
• Whenever we apply a hash function h to a
stream element a, the bit string h(a)
will end in some number of 0’s.
– Call this number the tail length for a and h.
• Let R be the maximum tail length of any a
seen so far in the stream.
• Estimate the number of distinct elements seen in the stream as 2^R.

42
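A compact Flajolet-Martin sketch in Python. Real implementations draw from a proper family of independent hash functions; xoring one base hash with random masks, as below, is a simplification for illustration, and the names are assumptions:

```python
import hashlib
import random
import statistics

def fm_estimate(stream, n_hashes=64, n_bits=32, seed=42):
    """Flajolet-Martin: estimate the number of distinct elements as the
    median of 2^R over several hash functions, where R is the longest
    tail of trailing zeros seen for that hash."""
    rng = random.Random(seed)
    # Stand-in for a hash family: one base hash xored with random masks.
    masks = [rng.getrandbits(n_bits) for _ in range(n_hashes)]
    max_tail = [0] * n_hashes
    for element in stream:
        base = int.from_bytes(
            hashlib.sha256(str(element).encode()).digest()[:n_bits // 8], "big")
        for i, mask in enumerate(masks):
            h = base ^ mask
            # Tail length: number of trailing 0 bits of h.
            tail = n_bits if h == 0 else (h & -h).bit_length() - 1
            max_tail[i] = max(max_tail[i], tail)
    return statistics.median(2 ** r for r in max_tail)
```

Because each 2^R is a power of 2, practical variants combine the estimates as a median of group means rather than a plain median, which allows values between powers of 2.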
Why It Works: Intuition
• Very very rough and heuristic intuition why
Flajolet-Martin works:
– h(a) hashes a with equal probability to any of N values
– Then h(a) is a sequence of log₂ N bits, where a 2^(−r) fraction of all a’s have a tail of r zeros
• About 50% of a’s hash to ***0
• About 25% of a’s hash to **00
• So, if the longest tail seen is r = 2 (i.e., some item hash ends in *100), then we have probably seen about 4 distinct items so far
– That is, it takes about 2^r items hashed before we see one with a zero-suffix of length r 43
Example
[Figure: worked example of the Flajolet-Martin estimate]
44
Estimating Moments
• A generalization of the problem of counting
distinct elements in a stream.
– The problem, called computing “moments,”
• Involves the distribution of frequencies of
different elements in the stream.
• We shall define moments of all orders and
concentrate on computing second
moments, from which the general
algorithm for all moments is a simple
extension.
45
Definition of Moments
• Suppose a stream consists of
elements chosen from a universal set.
• Assume the universal set is ordered so
we can speak of the ith element for any i.
• Let m_i be the number of occurrences of the ith element for any i.
• Then the kth-order moment (or just kth moment) of the stream is the sum over all i of (m_i)^k:

kth moment = Σᵢ (m_i)^k
46
Computing Different Moments
• 0th moment - Count the number of different
elements in the stream.
• 1st moment = sum of the numbers of
elements in the stream (length of the
stream)
• 2nd moment = surprise number (a measure
of how uneven the distribution is)

47
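When the counts fit in memory, these moments can be computed exactly; streaming algorithms such as AMS exist precisely for when they do not. An illustrative baseline:

```python
from collections import Counter

def kth_moment(stream, k):
    """kth moment = sum over all distinct elements i of (m_i)^k,
    where m_i is the number of occurrences of element i."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

stream = ["a", "b", "a", "c", "a", "b"]   # m_a = 3, m_b = 2, m_c = 1
print(kth_moment(stream, 0))  # 3  = number of distinct elements
print(kth_moment(stream, 1))  # 6  = length of the stream
print(kth_moment(stream, 2))  # 14 = surprise number: 3² + 2² + 1²
```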
Alon-Matias-Szegedy (AMS) Method
• The AMS method works for all moments
• Gives an unbiased estimate
• We will concentrate on the 2nd moment S
• We pick and keep track of many variables X:
– For each variable X we store X.el and X.val
• X.el corresponds to the item i
• X.val corresponds to the count of item i
– Note this requires a count in main memory, so the number of Xs is limited
• Our goal is to compute S = Σᵢ (mᵢ)²
48
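A sketch of the AMS second-moment estimate for a stream held in a list (in a real stream the position is chosen on the fly and X.val is updated incrementally; the names are illustrative). A useful identity for checking it: averaging the estimate over every position t gives exactly S.

```python
def ams_estimate_at(stream, t):
    """One AMS variable X placed at position t:
    X.el = stream[t], X.val = occurrences of X.el from position t onward.
    The estimate of the 2nd moment from this variable is n(2·X.val − 1)."""
    n = len(stream)
    x_el = stream[t]
    x_val = stream[t:].count(x_el)
    return n * (2 * x_val - 1)

def ams_estimate(stream, positions):
    # Average several variables; the expectation equals the true S.
    return sum(ams_estimate_at(stream, t) for t in positions) / len(positions)

stream = ["a", "a", "b", "b", "b", "a"]   # m_a = 3, m_b = 3, so S = 18
print(ams_estimate(stream, range(len(stream))))  # 18.0 over all positions
```

In practice the positions are drawn uniformly at random, and only a few variables are kept in memory.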
Example: Surprise Number

• Consider a stream of length 100 with 11 distinct elements whose frequencies mᵢ are: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9

Surprise number S = (10)² + 10 × (9)²
= 100 + 810
= 910

Compute the surprise number for a stream whose element frequencies are:
90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
49
Counting Ones
• Consider a window of length N on a binary stream.
• We focus on the situation where we cannot afford to store the entire window.
• We want at all times to be able to answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N.
• A solution is proposed through the Datar-Gionis-Indyk-Motwani (DGIM) algorithm.

50
DGIM Algorithm
• Each bit of the stream has a timestamp, the position in which it arrives.
• The first bit has timestamp 1, the second has timestamp 2, and so on.
• Divide the window into buckets, each consisting of:
1. The timestamp of its right (most recent) end.
2. The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number of 1’s as the size of the bucket.

51
DGIM Algorithm
• There are five rules that must be followed when representing a stream by buckets:
– The right end of a bucket is always a position with a 1.
– No position is in more than one bucket.
– There are one or two buckets of any given size, up to some maximum size.
– All sizes must be a power of 2.
– Buckets cannot decrease in size as we move to the left (back in time).
52
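A sketch of DGIM bucket maintenance and querying in Python (class and method names are illustrative; a production version would also enforce the maximum-size cap and store buckets in O(log N) space explicitly):

```python
class DGIM:
    """Approximate count of 1's in the last k <= N bits of a binary stream.
    Buckets are (right_end_timestamp, size) pairs, newest first; sizes are
    powers of 2 with at most two buckets of each size."""

    def __init__(self, window_size):
        self.N = window_size
        self.t = 0                # timestamp of the most recent bit
        self.buckets = []         # newest bucket at index 0

    def add(self, bit):
        self.t += 1
        # Drop buckets whose right end has slid out of the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            # Whenever three buckets share a size, merge the two oldest;
            # the merged bucket keeps the more recent right end.
            i = 0
            while i + 2 < len(self.buckets):
                if (self.buckets[i][1] == self.buckets[i + 1][1]
                        == self.buckets[i + 2][1]):
                    ts, size = self.buckets[i + 1]
                    self.buckets[i + 1:i + 3] = [(ts, 2 * size)]
                else:
                    i += 1

    def count(self, k):
        """Estimate the 1's among the last k bits: sum the sizes of buckets
        whose right end is in range, counting only half of the oldest one."""
        total, oldest = 0, 0
        for ts, size in self.buckets:
            if ts > self.t - k:
                total += size
                oldest = size
            else:
                break
        return total - oldest // 2
```

Counting only half of the oldest qualifying bucket is what gives DGIM its guarantee that the estimate is off by at most 50%.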