DSBD Unit-II 3
Big Data
Universal Hashing, Streaming Models, Bloom Filter,
Flajolet Martin Algorithm
Pairwise Independence & Universal
Hash Functions
A more limited notion of independence that proves
useful in many contexts: k-wise independence.
Hash functions
The goal - map elements from a large domain to a small one.
Typically, to obtain the required guarantees, we would need
not just one function, but a family of functions, where we
would use randomness to sample a hash function from this
family.
Let H = {h : U -> R} be a family of functions.
Ideally, we would like this family to have certain "random-like"
properties, while keeping its size small.
This will be important for applications in data structures and
streaming algorithms.
Universal Hashing
Fixed hashing is vulnerable to bad worst-case behavior, so
we improve this behavior by choosing our hash function
randomly, in a way that is independent of the keys.
This approach is called Universal Hashing and can produce
good performance on average independent of key choice.
Desired properties: The main desired properties for
a good hashing scheme are:
1. The keys are nicely spread out so that we do not have too
many collisions, since collisions affect the time to perform
lookups and deletes.
2. M = O(N): in particular, we would like our scheme to
achieve property (1) without needing the table size M to be
much larger than the number of elements N.
3. The function h is fast to compute. Ideally, h(x) can be
computed in constant time.
Definition: A randomized algorithm H for constructing hash
functions h : U → {1, . . . , M} is universal if for all x ≠ y in U,
we have Pr_{h←H}[h(x) = h(y)] ≤ 1/M.
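The definition can be made concrete with a small sketch. It assumes one classic universal family, h(x) = ((a·x + b) mod P) mod M with P prime, which is not named in the slides; the names `P`, `sample_hash` and `collision_rate` are illustrative, and the range here is {0, . . . , M−1} rather than {1, . . . , M}. For a fixed pair of distinct keys, the fraction of sampled functions that collide should come out at roughly 1/M or below.

```python
import random

# Assumption: keys are integers in [0, P), with P the Mersenne prime 2^31 - 1.
P = 2_147_483_647

def sample_hash(M, rng=random):
    """Draw one function h(x) = ((a*x + b) mod P) mod M at random from the family."""
    a = rng.randrange(1, P)   # a must be nonzero
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % M

def collision_rate(x, y, M, trials=20_000, seed=0):
    """Estimate Pr[h(x) == h(y)] over random draws of h, for fixed keys x != y."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = sample_hash(M, rng)
        if h(x) == h(y):
            hits += 1
    return hits / trials

# Two arbitrary distinct keys and a table of size M = 100.
rate = collision_rate(12345, 67890, M=100)
print(f"empirical collision rate: {rate:.4f} (bound 1/M = {1/100:.4f})")
```

The randomness is over the choice of h, not over the keys: the two keys are fixed in advance, which is the sense in which the guarantee is independent of key choice.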
Pairwise independence
Streaming models
The problem
A search engine receives a stream of queries and it would
like to study the behavior of typical users.
We assume the stream consists of tuples (user, query,
time).
Suppose that we want to answer queries like:
“What fraction of the typical user’s queries were
repeated over the past month?”
We only wish to store 1/10th of the elements.
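The slides do not say how the 1/10th sample is chosen. One common approach, sketched here as an assumption, is to sample by user, not by individual tuple: hash each user to one of 10 buckets and keep every tuple of the users that land in bucket 0. That way a sampled user's repeated queries are all retained. The helper names `keep_user` and `sample_stream` and the use of SHA-256 are illustrative choices.

```python
import hashlib

def keep_user(user, buckets=10):
    """Return True if this user hashes to the sampled bucket (bucket 0)."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets == 0

def sample_stream(stream, buckets=10):
    """Yield only the (user, query, time) tuples whose user is in the sample."""
    for user, query, time in stream:
        if keep_user(user, buckets):
            yield (user, query, time)

# Tiny illustrative stream: 1000 users, two queries each.
stream = [(f"u{i}", f"q{j}", 2 * i + j) for i in range(1000) for j in range(2)]
sampled = list(sample_stream(stream))
print(len(sampled), "of", len(stream), "tuples kept")
```

No per-user state is needed: the decision depends only on the hash of the user, so it is the same every time that user appears.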
What are Streaming algorithms?
Algorithms for processing data streams
Input is presented as a sequence of items
Can be examined in only a few passes (typically
just one)
Limited working memory
STREAMING MODELS
Sampling Algorithms
Reservoir Sampling
Priority Sampling
Sketch Algorithms
Bloom Filter
Counting Distinct Elements
Flajolet-Martin Algorithm
Bloom Filter
Suppose we have a set S of one billion allowed email
addresses.
The stream consists of pairs: an email address and the email
itself.
Since the typical email address is 20 bytes or more, it is not
reasonable to store S in main memory (20 B × 10^9 = 20 GB).
We have one gigabyte (1 GB) of available main memory.
In the technique known as Bloom filtering, we use that main
memory as a bit array.
In this case, we have room for eight billion bits (8 × 10^9 bits).
Devise a hash function h from email addresses to eight billion
buckets.
Hash each member of S to a bit, and set that bit to 1.
All other bits of the array remain 0.
Since there are one billion members of S, approximately 1/8th
of the bits will be 1.
The exact fraction of bits set to 1 will be slightly less than
1/8th, because it is possible that two members of S hash to the
same bit.
When a stream element arrives, we hash its email address.
If the bit to which that email address hashes is 1, then we let
the email through.
But if the email address hashes to a 0, we are certain that the
address is not in S, so we can drop this stream element.
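The "slightly less than 1/8th" figure can be checked with a short calculation. The sketch assumes the hash function spreads the one billion addresses uniformly and independently over the eight billion bits; the variable names are illustrative. The fraction of bits set is also the chance that an address not in S is let through.

```python
import math

n = 8 * 10**9   # bits in the array
m = 10**9       # members of S, each hashed to one bit

# Probability that a given bit is still 0 after m uniform, independent hashes:
# (1 - 1/n)^m, computed via log1p for numerical accuracy.
p_zero_exact = math.exp(m * math.log1p(-1 / n))
p_zero_approx = math.exp(-m / n)          # the usual e^(-m/n) approximation

fraction_set = 1 - p_zero_exact
print(f"fraction of bits set to 1: {fraction_set:.4f}  (1/8 = {1/8:.4f})")
```

Under these assumptions the fraction comes out a little under 0.12, compared with 1/8 = 0.125.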
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash
function maps “key” values to n buckets, corresponding to
the n bits of the bit-array.
3. A set S of m key values.
Working-
To initialize the bit array, begin with all bits 0.
Take each key value in S and hash it using each of the k hash
functions.
Set to 1 each bit that is hi(K) for some hash function hi and
some key value K in S.
To test a key K that arrives in the stream, check that all of
h1(K), h2(K), . . . , hk(K) are 1’s in the bit-array.
If all are 1’s, then let the stream element through.
If one or more of these bits are 0, then K could not be in S, so
reject the stream element.
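The working described above can be sketched as a small class. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, its method names, and the way the k hash functions are derived (SHA-256 salted with the index i) are all assumptions made for the example.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)   # all bits start at 0

    def _positions(self, key):
        # Hypothetical choice: h_i(key) = SHA-256("i:key") reduced mod n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key):
        """Set to 1 every bit h_i(key), for i = 1..k."""
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """True if all k bits are 1 (key may be in S); False means definitely not in S."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Usage: build the filter from S, then test arriving stream elements.
allowed = [f"user{i}@example.com" for i in range(100)]
bf = BloomFilter(n_bits=10_000, k=3)
for address in allowed:
    bf.add(address)
print(bf.might_contain("user7@example.com"))   # a member of S always passes
```

A member of S is never rejected, because all of its bits were set during initialization; a non-member is let through only if every one of its k bits happens to have been set by other keys.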
Bloom Filters: cons
Small false positive probability
No deletions
Cannot store associated objects
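How small the false positive probability is depends on the array size n, the number of keys m, and the number of hash functions k. A commonly used approximation, assumed here and valid for independent, uniform hash functions, is (1 − e^{−km/n})^k. The function name `false_positive_rate` is illustrative.

```python
import math

def false_positive_rate(n, m, k):
    """Approximate probability that a key not in S finds all k of its bits set."""
    return (1 - math.exp(-k * m / n)) ** k

# The email example: 8 billion bits, 1 billion keys.
n, m = 8 * 10**9, 10**9
rates = {k: false_positive_rate(n, m, k) for k in range(1, 9)}
best_k = min(rates, key=rates.get)

for k, rate in rates.items():
    print(f"k = {k}: false positive rate ~ {rate:.4f}")
print("best k among 1..8:", best_k)
```

With these numbers the rate is about 0.1175 for k = 1 and about 0.049 for k = 2. Adding hash functions helps only up to a point, since each extra function also sets more bits in the array.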
The Count-Distinct Example –
Consider a Web site gathering statistics on how many unique
users it has seen in each given month.
The universal set is the set of logins for that site, and a stream
element is generated each time someone logs in.
This measure is appropriate for a site like Amazon, where the
typical user logs in with their unique login name.
It is possible to estimate the number of distinct elements by
hashing the elements of the universal set to a bit-string that is
sufficiently long.
The length of the bit-string must be sufficient that there are
more possible results of the hash function than there are
elements of the universal set.
Flajolet Martin Algorithm
Whenever we apply a hash function h to a stream element a,
the bit string h(a) will end in some number of 0’s, possibly
none.
Call this number the tail length for a and h.
Let R be the maximum tail length of any a seen so far in the
stream.
Then we shall use the estimate 2^R for the number of distinct
elements seen in the stream.
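A minimal sketch of this estimator follows. The hash function (the first 32 bits of SHA-256) and the names `tail_length`, `h` and `fm_estimate` are assumptions made for illustration. A hash value of 0 is given tail length 0 here, which is a simplifying convention.

```python
import hashlib

def tail_length(value):
    """Number of trailing 0 bits in value (a value of 0 is counted as 0 here)."""
    if value == 0:
        return 0
    return (value & -value).bit_length() - 1   # index of the lowest set bit

def h(element):
    """Hypothetical hash: the first 32 bits of SHA-256 of the element's string form."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def fm_estimate(stream, hash_fn=h):
    """Return 2^R, where R is the maximum tail length seen over the stream."""
    R = 0
    for a in stream:
        R = max(R, tail_length(hash_fn(a)))
    return 2 ** R

# Repeated elements hash to the same value, so they never change R.
logins = [f"user{i % 500}" for i in range(5000)]   # 500 distinct logins
print(fm_estimate(logins))
```

Only R is stored, so memory use does not grow with the stream. The result from a single hash function is always a power of 2 and can be far off, so estimates from several hash functions are usually combined.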
Analysis-
The probability that a given stream element a has h(a) ending
in at least r 0’s is 2^{-r}.
Suppose there are m distinct elements in the stream.
Then the probability that none of them has tail length at least
r is (1 − 2^{-r})^m.
We can rewrite it as ((1 − 2^{-r})^{2^r})^{m 2^{-r}}, which is
approximately e^{-m 2^{-r}}, since (1 − ε)^{1/ε} ≈ 1/e for small ε.
Example
Q. Input stream: 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
Hash function: h(x) = (6x + 1) mod 5
Find the number of distinct elements using the FM algorithm.
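One way to work the example is sketched below. It makes two assumptions: the hash is read as (6x + 1) mod 5, and a hash value of 0 (binary 000) is given tail length 0, a convention often used in worked examples. If the three zero bits of 000 were counted instead, R would be 3 and the estimate 8.

```python
def h(x):
    # Assumed reading of the hash function: (6x + 1) mod 5.
    return (6 * x + 1) % 5

def tail_length(value):
    """Trailing 0 bits of value; a value of 0 is counted as 0 (assumed convention)."""
    if value == 0:
        return 0
    return (value & -value).bit_length() - 1

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]

# Show h(x), its 3-bit binary form, and its tail length for each distinct element.
for x in sorted(set(stream)):
    print(x, h(x), format(h(x), "03b"), tail_length(h(x)))

R = max(tail_length(h(x)) for x in stream)
estimate = 2 ** R
print("R =", R, "estimate =", estimate)
```

The hash values are h(1) = 2 (010), h(2) = 3 (011), h(3) = 4 (100) and h(4) = 0 (000), so the maximum tail length is R = 2 and the estimate is 2^2 = 4. This happens to match the true count of 4 distinct elements.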
Home work questions
1. Only 1 in 1000 people has a rare disease. The test has a true positive
rate of 0.9 and a false positive rate of 0.02. If a randomly tested
individual tests positive, what is the probability that they have the disease?
2.
3. Assume that a man's profession can be classified as professional,
skilled labourer or unskilled labourer. Assume that of the sons of
professional men, 80 percent are professional, 10 percent are skilled
labourers, and 10 percent are unskilled labourers. In the case of sons
of skilled labourers, 60 percent are skilled labourers, 20 percent are
professional and 20 percent are unskilled. Finally, in the case of
unskilled labourers, 50 percent of the sons are unskilled labourers,
and 25 percent each are in the other two categories. Assume that
every man has at least one son, and form a Markov chain by
following the profession of a randomly chosen son of a given family
through several generations. Set up the matrix of transition
probabilities. Find the probability that a randomly chosen grandson
of an unskilled labourer is a professional man.
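One possible way to set up question 3 is sketched below, with rows indexed by the father's profession and columns by the son's. The state order and the helper `matmul` are illustrative choices. Two generations correspond to the square of the transition matrix.

```python
STATES = ["professional", "skilled", "unskilled"]

# Row = father's profession, column = son's profession.
P = [
    [0.80, 0.10, 0.10],   # sons of professional men
    [0.20, 0.60, 0.20],   # sons of skilled labourers
    [0.25, 0.25, 0.50],   # sons of unskilled labourers
]

def matmul(A, B):
    """Plain-Python matrix product of two square matrices of the same size."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

# Two-step transition probabilities: grandfather -> grandson.
P2 = matmul(P, P)
grandson_prob = P2[STATES.index("unskilled")][STATES.index("professional")]
print(grandson_prob)
```

The required entry is 0.25·0.80 + 0.25·0.20 + 0.50·0.25 = 0.375.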