0% found this document useful (0 votes)
105 views2 pages

Homework Assignment 2: Total Points 80

This homework assignment involves answering questions about sampling from data streams to estimate statistics, using Bloom filters to estimate set membership, estimating frequencies of elements in a data stream using the Frequency Moment (FM) algorithm, and estimating frequencies of elements in a sliding window using the Dynamic Global Iceberg Monitoring (DGIM) algorithm. The document provides details of exercises to complete for each algorithm involving calculating statistics from samples, estimating false positive rates, determining tail lengths, and estimating frequencies. Solutions must show all steps and be explained briefly in own words. Copying from others will result in no points.

Uploaded by

samriddhi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
105 views2 pages

Homework Assignment 2: Total Points 80

This homework assignment involves answering questions about sampling from data streams to estimate statistics, using Bloom filters to estimate set membership, estimating frequencies of elements in a data stream using the Frequency Moment (FM) algorithm, and estimating frequencies of elements in a sliding window using the Dynamic Global Iceberg Monitoring (DGIM) algorithm. The document provides details of exercises to complete for each algorithm involving calculating statistics from samples, estimating false positive rates, determining tail lengths, and estimating frequencies. Solutions must show all steps and be explained briefly in own words. Copying from others will result in no points.

Uploaded by

samriddhi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 2

Homework Assignment 2

From the course book Mining Massive Datasets, chapter 4.


http://infolab.stanford.edu/~ullman/mmds/ch4.pdf
Use your own words. No cut-and-paste from the web or from class mates. Copying from other
sources will be detected and result in 0 points. If assignments by multiple students seem too
similar to be independent work, all students will receive 0 points.
It is great to work on solutions in groups! Just prepare the homework report in your own words.
Show all steps of your solution and calculations and explain them briefly in your own words. If
you write just the answer to the question without solution details, it will be 0 points.
No long answers, just brief and clear explanations for each step of your solution are required.

Total points 80
1. Sampling

(20 points) Exercise 4.2.1 : Suppose we have a stream of tuples with the schema Grades(university,
courseID, studentID, grade) Assume universities are unique, but a courseID is unique only within a
university (i.e., different universities may have different courses with the same ID, e.g., “CS101”) and
likewise, studentID’s are unique only within a university (different universities may assign the same ID to
different students). Suppose we want to answer certain queries approximately from a 1/15th sample of
the data. For each of the queries below, indicate how you would construct the sample. That is, tell what
the key attributes should be.

(a) Estimate the average number of courses per university.

(b) Estimate the fraction of students who have a GPA of 3.7 or more.

Explain briefly but clearly how you will create the sample and why.

2. Bloom Filter

(15 points) Exercise 4.3.1 : For the situation of our running example (8 billion bits, 1 billion members of
the set S), calculate the false-positive rate if we use 3 and 5 hash functions. Briefly explain each step in
your solution.

3. FM Algorithm

(10 points) Exercise 4.4.1 : Suppose our stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Our hash
functions will all be of the form h(x) = ax+ b mod 32 for some a and b. You should treat the result as a 5-
bit binary integer. Determine the tail length for each stream element and the resulting estimate of the
number of distinct elements if the hash function is:

(a) h(x) = 2x + 1 mod 32.

(b) h(x) = 3x + 7 mod 32.

Briefly explain each step in your solution.


4. DGIM Algorithm

1) (15 points) Exercise 4.6.1 : Suppose the window is as shown in Fig. 4.2. Estimate the number of 1’s the
last k positions, for k =

(a) 5

(b) 15

In each case, how far off the correct value is your estimate?
2) (20 points) Study the example in section 4.6.7 Extensions to the Counting of Ones. Use the technique of
Section 4.6.6 to estimate the total error. Show that if each ci has fractional error at most e, then the
estimate of the true sum has error at most e.
Briefly explain each step in your solution.

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy