
DSBD Unit-II 3

The document discusses mathematical foundations of big data, focusing on concepts such as universal hashing, streaming models, and Bloom filters. It explains the importance of hash functions in mapping large data sets to smaller ones while minimizing collisions and ensuring fast computation. Additionally, it covers streaming algorithms and specific techniques like the Flajolet-Martin algorithm for estimating distinct elements in data streams.

Uploaded by

Lavanya Zute
Copyright © All Rights Reserved

Unit-II: Mathematical Foundation of Big Data
Universal Hashing, Streaming Models, Bloom Filter, Flajolet-Martin Algorithm

Pairwise Independence & Universal Hash Functions
 A more limited notion of independence that proves
useful in many contexts: k-wise independence.
Hash functions
 The goal - map elements from a large domain to a small one.
 Typically, to obtain the required guarantees, we would need
not just one function, but a family of functions, where we
would use randomness to sample a hash function from this
family.
 Let H = {h : U → R} be a family of functions from a large universe U to a small range R.
 Ideally, we would like to obtain certain “random-like"
properties of this family, while keeping its size small.
 This will be important for applications in data structures and
streaming algorithms.

Universal Hashing
 Fixed hashing is vulnerable to bad worst-case behavior, so
we improve this behavior by choosing our hash function
randomly, in a way that is independent of the keys.
 This approach is called Universal Hashing and can produce
good performance on average independent of key choice.

 Desired properties: The main desired properties for
a good hashing scheme are:
 1. The keys are nicely spread out so that we do not have too
many collisions, since collisions affect the time to perform
lookups and deletes.
 2. M = O(N): in particular, we would like our scheme to
achieve property (1) without needing the table size M to be
much larger than the number of elements N.
 3. The function h is fast to compute. Ideally, h(x) can be
computed in constant time.

 Definition: A randomized algorithm H for constructing hash
functions h : U → {1, . . . , M} is universal if for all x ≠ y in U,
we have

$$\Pr_{h \leftarrow H}[h(x) = h(y)] \le \frac{1}{M}$$

 We also say that a set H of hash functions is a universal hash
function family if the procedure “choose h ∈ H at random” is
universal.
 Theorem - If H is universal, then for any set S ⊆ U of size N,
for any x ∈ U (e.g., that we might want to lookup), if we
construct h at random according to H, the expected number of
collisions between x and other elements in S is at most N/M.
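
As an illustration of the definition and theorem above, here is a minimal sketch of one well-known universal family, h(x) = ((a·x + b) mod p) mod M with p prime. The slides do not name a specific family, so the choice of family, the prime `P`, and the helper names `make_hash` and `collision_probability` are assumptions made for this example. The range here is {0, . . . , M−1} rather than {1, . . . , M}.

```python
import random

# Assumption: keys are integers in [0, P). P is the Mersenne prime 2**31 - 1.
P = 2_147_483_647


def make_hash(m: int, p: int = P, rng=random):
    """Sample h(x) = ((a*x + b) mod p) mod m with random a in [1, p) and b in [0, p)."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)

    def h(x: int) -> int:
        return ((a * x + b) % p) % m

    return h


def collision_probability(x: int, y: int, m: int, p: int) -> float:
    """Exact Pr[h(x) == h(y)] over all (a, b); only practical for a tiny prime p."""
    collisions = sum(
        1
        for a in range(1, p)
        for b in range(p)
        if ((a * x + b) % p) % m == ((a * y + b) % p) % m
    )
    return collisions / (p * (p - 1))


if __name__ == "__main__":
    h = make_hash(m=10)
    print(h(42), h(43))  # two bucket indices in 0..9; values depend on the random draw
    # Brute-force check of the universal property on a toy family (p = 17, M = 4).
    print(collision_probability(3, 11, m=4, p=17) <= 1 / 4)
```

Because the function is drawn at random after the keys are fixed, no single set of keys can be bad for every draw, which is the point of the theorem's N/M bound.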

Pairwise independence

Streaming models

The problem
 A search engine receives a stream of queries and it would
like to study the behavior of typical users.
 We assume the stream consists of tuples (user, query,
time).
 Suppose that we want to answer queries like:
 “What fraction of the typical user’s queries were
repeated over the past month?”
 We only wish to store 1/10th of the elements.
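
One common way to keep about 1/10th of such a stream, which this slide does not spell out, is to sample users rather than individual queries: hash each user to one of 10 buckets and keep every tuple from users in one chosen bucket. The sketch below assumes that approach. The function names and the use of SHA-256 as the hash are illustrative choices, not part of the slides.

```python
import hashlib


def keep_user(user: str, buckets: int = 10) -> bool:
    """Return True for roughly 1/buckets of all users, consistently per user."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets == 0


def sample_stream(stream):
    """Keep every (user, query, time) tuple whose user falls in the kept bucket."""
    return [item for item in stream if keep_user(item[0])]


if __name__ == "__main__":
    stream = [(f"user{i}", f"query{j}", 100 * i + j)
              for i in range(50) for j in range(3)]
    sample = sample_stream(stream)
    print(len(sample), "of", len(stream), "tuples kept")
```

Because the decision depends only on the user, a sampled user's queries are either all kept or all dropped, so per-user statistics such as the fraction of repeated queries remain meaningful on the sample.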

What are Streaming algorithms?
 Algorithms for processing data streams
 Input is presented as a sequence of items
 Can be examined in only a few passes (typically
just one)
 Limited working memory

STREAMING MODELS
 Sampling Algorithms
 Reservoir Sampling
 Priority Sampling
 Sketch Algorithms
 Bloom Filter
 Counting Distinct Elements
 Flajolet-Martin Algorithm

Bloom Filter
 Suppose we have a set S of one billion allowed email
addresses.
 The stream consists of pairs: an email address and the email
itself.
 Since the typical email address is 20 bytes or more, it is not
reasonable to store S in main memory (20 B × 10^9 = 20 GB).
 We have one gigabyte (1 GB) of available main memory.
 In the technique known as Bloom filtering, we use that main
memory as a bit array.
 In this case, we have room for eight billion bits (8 × 10^9 bits).
 Devise a hash function h from email addresses to eight billion
buckets.
 Hash each member of S to a bit, and set that bit to 1.
 All other bits of the array remain 0.
 Since there are one billion members of S, approximately 1/8th
of the bits will be 1.
 The exact fraction of bits set to 1 will be slightly less than
1/8th, because it is possible that two members of S hash to the
same bit.
 When a stream element arrives, we hash its email address.
 If the bit to which that email address hashes is 1, then we let
the email through.
 But if the email address hashes to a 0, we are certain that the
address is not in S, so we can drop this stream element.
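
The "slightly less than 1/8th" claim can be made concrete with a short calculation. This is a sketch that assumes the hash function behaves like a uniformly random choice among the n = 8 × 10^9 bits for each of the m = 10^9 members of S; the symbols n and m follow the general Bloom filter description.

```latex
% Probability that a given bit is still 0 after hashing all m members of S:
%   (1 - 1/n)^m \approx e^{-m/n}
\[
  \Pr[\text{bit} = 1]
    = 1 - \left(1 - \frac{1}{n}\right)^{m}
    \approx 1 - e^{-m/n}
    = 1 - e^{-1/8}
    \approx 0.1175
    < \frac{1}{8} = 0.125
\]
```

Under the same assumption, this is also the chance that an address outside S hashes to a 1 bit and is let through by mistake.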
A Bloom filter consists of:
 1. An array of n bits, initially all 0’s.
 2. A collection of hash functions h1, h2, . . . , hk. Each hash
function maps “key” values to n buckets, corresponding to
the n bits of the bit-array.
 3. A set S of m key values.

 The purpose of the Bloom filter is to allow through all
stream elements whose keys are in S, while rejecting most of
the stream elements whose keys are not in S.

 Working-
 To initialize the bit array, begin with all bits 0.
 Take each key value in S and hash it using each of the k hash
functions.
 Set to 1 each bit that is hi(K) for some hash function hi and
some key value K in S.
 To test a key K that arrives in the stream, check that all of
h1(K), h2(K), . . . , hk(K) are 1’s in the bit-array.
 If all are 1’s, then let the stream element through.
 If one or more of these bits are 0, then K could not be in S, so
reject the stream element.
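
The initialization and test steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the k hash functions are simulated by salting SHA-256 with the index i, and the class and method names are assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Bit array of n bits with k hash functions h_1..h_k (simulated by salting)."""

    def __init__(self, n_bits: int, k: int):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)  # all bits start at 0

    def _positions(self, key: str):
        # Assumption: h_i(key) is SHA-256 of "i:key", reduced modulo n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key: str) -> None:
        """Set to 1 every bit h_i(key) for i = 1..k."""
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        """True if all k bits are 1 (key may be in S); False means definitely not in S."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


if __name__ == "__main__":
    allowed = BloomFilter(n_bits=8000, k=3)
    for address in ["alice@example.com", "bob@example.com"]:
        allowed.add(address)
    print(allowed.might_contain("alice@example.com"))  # True: members always pass
```

A key that was added always passes the test, so the filter never rejects a member of S. A key that was not added can still pass if all k of its bits happen to have been set by other keys.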

Bloom Filters: cons
 False positives are possible (with small probability)
 No deletions
 Cannot store associated objects
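
To put a number on the false positive probability, the sketch below uses the standard approximation (1 − e^(−km/n))^k for a filter with n bits, m keys, and k hash functions. The formula does not appear in the surviving slide text and assumes the hash functions are independent and uniformly random; the function name is an illustrative choice.

```python
import math


def false_positive_rate(n_bits: float, m_keys: float, k_hashes: int) -> float:
    """Approximate probability that a key not in S passes all k bit checks."""
    fraction_of_ones = 1.0 - math.exp(-k_hashes * m_keys / n_bits)
    return fraction_of_ones ** k_hashes


if __name__ == "__main__":
    n, m = 8e9, 1e9  # the email example: 8 billion bits, 1 billion addresses
    for k in (1, 2):
        print(k, round(false_positive_rate(n, m, k), 4))
```

For the email example, going from one hash function to two lowers the approximate rate from about 0.12 to about 0.05.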

The Count-Distinct Example –
 Consider a Web site gathering statistics on how many unique
users it has seen in each given month.
 The universal set is the set of logins for that site, and a stream
element is generated each time someone logs in.
 This measure is appropriate for a site like Amazon, where the
typical user logs in with their unique login name.
 It is possible to estimate the number of distinct elements by
hashing the elements of the universal set to a bit-string that is
sufficiently long.
 The length of the bit-string must be sufficient that there are
more possible results of the hash function than there are
elements of the universal set.
Flajolet Martin Algorithm
 Whenever we apply a hash function h to a stream element a,
the bit string h(a) will end in some number of 0’s, possibly
none.
 Call this number the tail length for a and h.
 Let R be the maximum tail length of any a seen so far in the
stream.
 Then we shall use the estimate $2^R$ for the number of distinct
elements seen in the stream.
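
The procedure above can be sketched with a single hash function as follows. The 64-bit hash built from SHA-256, the function names, and the treatment of an all-zero hash value are assumptions made for this illustration.

```python
import hashlib


def tail_length(value: int, width: int = 64) -> int:
    """Number of trailing 0 bits; an all-zero value is counted as `width` zeros."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit; its position is the tail length.
    return (value & -value).bit_length() - 1


def hash64(item) -> int:
    """Hypothetical hash h: first 64 bits of SHA-256 of the item's string form."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fm_estimate(stream) -> int:
    """Track R, the maximum tail length seen, and return the estimate 2**R."""
    max_tail = 0
    for item in stream:
        max_tail = max(max_tail, tail_length(hash64(item)))
    return 2 ** max_tail


if __name__ == "__main__":
    logins = [f"user{i % 500}" for i in range(5000)]  # 500 distinct logins
    print(fm_estimate(logins))  # some power of two; a rough estimate only
```

Only the integer R is stored, and repeated elements hash to the same value, so duplicates never change the result. A single hash function gives only a power-of-two estimate, so the result is coarse.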

 Analysis-
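 The probability that a given stream element a has h(a) ending
in at least r 0’s is $2^{-r}$.
 Suppose there are m distinct elements in the stream.
 Then the probability that none of them has tail length at least
r is $(1 - 2^{-r})^m$.
 We can rewrite it as

$$\left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}}$$

 The inner expression is of the form $(1 - n)^{1/n}$ with n = $2^{-r}$, which for
small n is approximately 1/e.
 Thus, the probability of not finding a stream element with as
many as r 0’s at the end of its hash value is

$$e^{-m 2^{-r}}$$

 If m is much larger than $2^r$, this probability is close to 0, and if m is
much smaller than $2^r$, it is close to 1, so $2^R$ is unlikely to be far too
high or far too low.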

Example
 Q. Input stream: 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
 Hash function: h(x) = (6x + 1) mod 5
 Find the number of distinct elements using the FM algorithm.
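
One way to work this example is sketched below. It reads the hash function as (6x + 1) mod 5 and assumes the convention that a hash value of 0 has tail length 0. Both are assumptions, since the slide does not state them, and a different convention for 0 would change the answer.

```python
def tail_length(value: int) -> int:
    """Trailing 0 bits of value; assumed convention: hash value 0 has tail length 0."""
    if value == 0:
        return 0
    return (value & -value).bit_length() - 1


def h(x: int) -> int:
    """Hash function from the example, read as (6x + 1) mod 5."""
    return (6 * x + 1) % 5


stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]

# Tail length for each distinct element:
#   h(1) = 2 = 010 -> 1,  h(2) = 3 = 011 -> 0,
#   h(3) = 4 = 100 -> 2,  h(4) = 0 = 000 -> 0 (by the assumed convention)
tails = {x: tail_length(h(x)) for x in sorted(set(stream))}
R = max(tail_length(h(x)) for x in stream)
estimate = 2 ** R

print(tails)        # {1: 1, 2: 0, 3: 2, 4: 0}
print(R, estimate)  # 2 4
```

With R = 2 the estimate is 2^2 = 4, which here matches the true number of distinct elements (1, 2, 3, 4).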

Homework questions
1. Only 1 in 1000 people has a rare disease. The test has a true positive
rate of 0.9 and a false positive rate of 0.02. If a randomly tested
individual tests positive, what is the probability that they have the disease?
2.

3. Assume that a man's profession can be classified as professional,
skilled labourer or unskilled labourer. Assume that of the sons of
professional men, 80 percent are professional, 10 percent are skilled
labourers, and 10 percent are unskilled labourers. In the case of sons
of skilled labourers, 60 percent are skilled labourers, 20 percent are
professional and 20 percent are unskilled. Finally, in the case of
unskilled labourers, 50 percent of the sons are unskilled labourers,
and 25 percent each are in the other two categories. Assume that
every man has at least one son, and form a Markov chain by
following the profession of a randomly chosen son of a given family
through several generations. Set up the matrix of transition
probabilities. Find the probability that a randomly chosen grandson
of an unskilled labourer is a professional man.
