DSBD Unit-II 3

The document discusses mathematical foundations of big data, focusing on concepts such as universal hashing, streaming models, and Bloom filters. It explains the importance of hash functions in mapping large data sets to smaller ones while minimizing collisions and ensuring fast computation. Additionally, it covers streaming algorithms and specific techniques like the Flajolet-Martin algorithm for estimating distinct elements in data streams.

Uploaded by

Lavanya Zute


Unit-II: Mathematical Foundation of Big Data
Universal Hashing, Streaming Models, Bloom Filter, Flajolet-Martin Algorithm

Pairwise Independence & Universal Hash Functions

 A more limited notion of independence that proves
useful in many contexts: k-wise independence.
Hash functions
 The goal - map elements from a large domain to a small one.
 Typically, to obtain the required guarantees, we would need
not just one function, but a family of functions, where we
would use randomness to sample a hash function from this
family.
 Let H = {h : U → R} be a family of functions.
 Ideally, we would like this family to have certain “random-like”
properties, while keeping its size small.
 This will be important for applications in data structures and
streaming algorithms.

Universal Hashing
 Fixed hashing is vulnerable to bad worst-case behavior, so
we improve this behavior by choosing our hash function
randomly, in a way that is independent of the keys.
 This approach is called Universal Hashing. It gives good
average-case performance regardless of which keys are chosen.

 Desired properties: The main desired properties for
a good hashing scheme are:
 1. The keys are nicely spread out so that we do not have too
many collisions, since collisions affect the time to perform
lookups and deletes.
 2. M = O(N): in particular, we would like our scheme to
achieve property (1) without needing the table size M to be
much larger than the number of elements N.
 3. The function h is fast to compute. Ideally, h(x) can be
computed in constant time.

 Definition: A randomized algorithm H for constructing hash
functions h : U → {1, . . . , M} is universal if for all x ≠ y in U,
we have
$\Pr_{h \leftarrow H}[h(x) = h(y)] \le 1/M$.
 We also say that a set H of hash functions is a universal hash
function family if the procedure “choose h ∈ H at random” is
universal.
 Theorem - If H is universal, then for any set S ⊆ U of size N,
for any x ∈ U (e.g., that we might want to lookup), if we
construct h at random according to H, the expected number of
collisions between x and other elements in S is at most N/M.
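
One standard way to build such a family, not given in these slides, is the Carter-Wegman construction h(x) = ((a·x + b) mod p) mod M, where p is a prime larger than every key. The sketch below assumes non-negative integer keys, uses the hypothetical names `P`, `make_universal_hash` and `collision_rate`, and maps into {0, ..., M−1} rather than {1, ..., M}. It estimates the collision probability of one fixed pair of keys empirically and compares it with the 1/M bound.

```python
import random

# Assumption: keys are non-negative integers smaller than the prime P.
P = 2_147_483_647  # the Mersenne prime 2**31 - 1


def make_universal_hash(M, rng=random):
    """Sample h(x) = ((a*x + b) mod P) mod M from the Carter-Wegman family."""
    a = rng.randrange(1, P)  # a is drawn from 1..P-1, never 0
    b = rng.randrange(0, P)  # b is drawn from 0..P-1
    return lambda x: ((a * x + b) % P) % M


def collision_rate(x, y, M, trials, rng):
    """Fraction of freshly sampled hash functions under which x and y collide."""
    hits = 0
    for _ in range(trials):
        h = make_universal_hash(M, rng)
        if h(x) == h(y):
            hits += 1
    return hits / trials


rng = random.Random(0)  # fixed seed so the run is repeatable
M = 10
rate = collision_rate(12345, 67890, M, trials=20_000, rng=rng)
print(f"empirical Pr[h(x) = h(y)] = {rate:.4f}, bound 1/M = {1 / M:.4f}")
```

The randomness is over the choice of (a, b), not over the keys, which is what makes the guarantee hold for every fixed pair x ≠ y.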

Pairwise independence

Streaming models

The problem
 A search engine receives a stream of queries and it would
like to study the behavior of typical users.
 We assume the stream consists of tuples (user, query,
time).
 Suppose that we want to answer queries like:
 “What fraction of the typical user’s queries were
repeated over the past month?”
 We only wish to store 1/10th of the elements.

What are Streaming algorithms?
 Algorithms for processing data streams
 Input is presented as a sequence of items
 Can be examined in only a few passes (typically
just one)
 Limited working memory

STREAMING MODELS
 Sampling Algorithms
 Reservoir Sampling
 Priority Sampling
 Sketch Algorithms
 Bloom Filter
 Counting Distinct Elements
 Flajolet-Martin Algorithm

Bloom Filter
 Suppose we have a set S of one billion allowed email
addresses.
 The stream consists of pairs: an email address and the email
itself.
 Since the typical email address is 20 bytes or more, it is not
reasonable to store S in main memory (20 bytes × $10^9$ = 20 GB).
 We have one gigabyte (1 GB) of available main memory.
 In the technique known as Bloom filtering, we use that main
memory as a bit array.
 In this case, we have room for eight billion bits ($8 \times 10^9$ bits).
 Devise a hash function h from email addresses to eight billion
buckets.
 Hash each member of S to a bit, and set that bit to 1.
 All other bits of the array remain 0.
 Since there are one billion members of S, approximately 1/8th
of the bits will be 1.
 The exact fraction of bits set to 1 will be slightly less than
1/8th, because it is possible that two members of S hash to the
same bit.
 When a stream element arrives, we hash its email address.
 If the bit to which that email address hashes is 1, then we let
the email through.
 But if the email address hashes to a 0, we are certain that the
address is not in S, so we can drop this stream element.
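
The "slightly less than 1/8th" figure can be checked numerically. Assuming the hash function spreads addresses uniformly over the bits (an idealization), a given bit is still 0 after all m members are hashed with probability (1 − 1/n)^m, which is close to e^{−m/n}. The sketch below evaluates this for m = 10^9 addresses and n = 8 × 10^9 bits. The same fraction is the chance that an address outside S happens to hash to a 1 and is let through by mistake.

```python
import math

m = 10**9        # members of S
n = 8 * 10**9    # bits in the array

# Probability that a particular bit is still 0 after hashing all m members.
# log1p keeps precision, because 1/n is tiny.
p_zero = math.exp(m * math.log1p(-1 / n))
fraction_ones = 1 - p_zero
approx = 1 - math.exp(-m / n)

print(f"fraction of 1 bits: {fraction_ones:.4f}")
print(f"1 - e^(-1/8):       {approx:.4f}")
print(f"naive 1/8:          {1 / 8:.4f}")
```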
A Bloom filter consists of:
 1. An array of n bits, initially all 0’s.
 2. A collection of hash functions h_1, h_2, . . . , h_k. Each hash
function maps “key” values to n buckets, corresponding to
the n bits of the bit-array.
 3. A set S of m key values.

 The purpose of the Bloom filter is to allow through all
stream elements whose keys are in S, while rejecting most of
the stream elements whose keys are not in S.

 Working:
 To initialize the bit array, begin with all bits 0.
 Take each key value in S and hash it using each of the k hash
functions.
 Set to 1 each bit that is h_i(K) for some hash function h_i and
some key value K in S.
 To test a key K that arrives in the stream, check that all of
h_1(K), h_2(K), . . . , h_k(K) are 1’s in the bit-array.
 If all are 1’s, then let the stream element through.
 If one or more of these bits are 0, then K could not be in S, so
reject the stream element.
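
The working steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter`, the method names, and the choice to derive the i-th hash function from SHA-256 of the string "i:key" are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bit array of n bits with k hash functions (illustrative sketch)."""

    def __init__(self, n, k):
        self.n = n
        self.k = k
        self.bits = bytearray((n + 7) // 8)  # all bits start at 0

    def _positions(self, key):
        # Assumption: the i-th hash function is SHA-256 of "i:key", reduced mod n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key):
        # Set the bit chosen by each of the k hash functions.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True only if every one of the k bits is 1.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


allowed = ["alice@example.com", "bob@example.com", "carol@example.com"]
bf = BloomFilter(n=1024, k=3)
for address in allowed:
    bf.add(address)

print([bf.might_contain(a) for a in allowed])   # members always pass
print(bf.might_contain("mallory@example.com"))  # usually False, but can be a false positive
```

A key in S can never be rejected, because its k bits were all set during initialization. A key outside S is let through only if all k of its bits happen to have been set by other keys.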

Bloom Filters: cons
 Small false positive probability
 No deletions
 Cannot store associated objects

The Count-Distinct Example
 Consider a Web site gathering statistics on how many unique
users it has seen in each given month.
 The universal set is the set of logins for that site, and a stream
element is generated each time someone logs in.
 This measure is appropriate for a site like Amazon, where the
typical user logs in with their unique login name.
 It is possible to estimate the number of distinct elements by
hashing the elements of the universal set to a bit-string that is
sufficiently long.
 The length of the bit-string must be sufficient that there are
more possible results of the hash function than there are
elements of the universal set.
Flajolet Martin Algorithm
 Whenever we apply a hash function h to a stream element a,
the bit string h(a) will end in some number of 0’s, possibly
none.
 Call this number the tail length for a and h.
 Let R be the maximum tail length of any a seen so far in the
stream.
 Then we shall use the estimate $2^R$ for the number of distinct
elements seen in the stream.
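
A minimal sketch of this procedure follows. The names `tail_length`, `hash32` and `fm_estimate` are hypothetical, the hash is assumed to be the first 32 bits of SHA-256, and an all-zero hash value is treated as tail length 0 (a convention chosen here, since the slides do not say how to handle it).

```python
import hashlib


def tail_length(value):
    """Number of trailing 0 bits in value (0 is treated as tail length 0 here)."""
    if value == 0:
        return 0  # assumption: ignore the all-zero hash rather than count every bit
    count = 0
    while value & 1 == 0:
        value >>= 1
        count += 1
    return count


def hash32(element):
    """Hypothetical hash: first 4 bytes of SHA-256, read as a 32-bit integer."""
    digest = hashlib.sha256(str(element).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def fm_estimate(stream, h=hash32):
    """One pass over the stream, storing only R, the maximum tail length."""
    R = 0
    for a in stream:
        R = max(R, tail_length(h(a)))
    return 2 ** R


stream = [f"user{i % 500}" for i in range(10_000)]  # 500 distinct logins
print(fm_estimate(stream))  # a power of 2; a single hash gives only a rough estimate
```

Repeated elements hash to the same value, so they never change R. That is why the estimate depends only on the set of distinct elements and needs only a few bits of memory.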

 Analysis:
 The probability that a given stream element a has h(a) ending
in at least r 0’s is $2^{-r}$.
 Suppose there are m distinct elements in the stream.
 Then the probability that none of them has tail length at least
r is $(1 - 2^{-r})^m$.
 We can rewrite it as $\left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}}$.
 The inner expression is of the form $(1 - n)^{1/n}$ with $n = 2^{-r}$,
which is approximately 1/e when n is small.
 Thus, the probability of not finding a stream element with as
many as r 0’s at the end of its hash value is $e^{-m 2^{-r}}$.

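The quality of the 1/e approximation in this analysis can be checked numerically. The sketch below compares the exact probability (1 − 2^{−r})^m with e^{−m·2^{−r}} for a few values of r; the value m = 1000 is an assumption chosen only for illustration.

```python
import math

m = 1000  # assumed number of distinct elements, chosen for illustration

for r in (5, 8, 10, 12, 15):
    exact = (1 - 2.0 ** -r) ** m       # probability that no element has tail length >= r
    approx = math.exp(-m * 2.0 ** -r)  # the e^(-m * 2^(-r)) approximation
    print(f"r={r:2d}  exact={exact:.4f}  approx={approx:.4f}")
```

In this run the probability is close to 0 when 2^r is much smaller than m and close to 1 when 2^r is much larger than m, which suggests why 2^R is unlikely to be far too low or far too high.
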
Example
 Q. Input stream: 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
 Hash function: h(x) = (6x + 1) mod 5
 Find the number of distinct elements using the FM algorithm.
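
One way to work through this example is sketched below. It assumes hash values are written as 3-bit strings and that an all-zero hash value counts as tail length 0; the slides do not state either convention, so both are assumptions.

```python
def tail_length(value):
    """Trailing 0 bits of value; an all-zero hash counts as 0 (assumed convention)."""
    if value == 0:
        return 0
    # value & -value isolates the lowest set bit; its position is the tail length.
    return (value & -value).bit_length() - 1


def h(x):
    return (6 * x + 1) % 5


stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]

# R is the maximum tail length over every element of the stream.
R = max(tail_length(h(x)) for x in stream)

# Repeats hash identically, so the table lists each distinct value once.
for x in sorted(set(stream)):
    print(f"x={x}  h(x)={h(x)}  binary={h(x):03b}  tail length={tail_length(h(x))}")

print("R =", R, " estimate 2^R =", 2 ** R)
```

Under these conventions the hash values are 2, 3, 4 and 0 for x = 1, 2, 3 and 4, so R = 2 and the estimate is 4. If the all-zero value were instead counted as three trailing zeros, R would be 3 and the estimate 8.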

Homework questions
1. Only 1 in 1000 people has a rare disease. A test for it has a true positive
rate of 0.9 and a false positive rate of 0.02. If a randomly tested individual
tests positive, what is the probability that they have the disease?
2.

Homework questions
3. Assume that a man's profession can be classified as professional,
skilled labourer or unskilled labourer. Assume that of the sons of
professional men, 80 percent are professional, 10 percent are skilled
labourers, and 10 percent are unskilled labourers. In the case of sons
of skilled labourers, 60 percent are skilled labourers, 20 percent are
professional and 20 percent are unskilled. Finally, in the case of
unskilled labourers, 50 percent of the sons are unskilled labourers,
and 25 percent each are in the other two categories. Assume that
every man has at least one son, and form a Markov chain by
following the profession of a randomly chosen son of a given family
through several generations. Set up the matrix of transition
probabilities. Find the probability that a randomly chosen grandson
of an unskilled labourer is a professional man.
