Estimating Moments

The document discusses estimating frequency moments of a data stream. The 0th moment is the number of distinct elements, and the 1st moment is the stream length. The 2nd moment, the sum of the squared frequencies, measures how unevenly the elements are distributed. To estimate it, the method samples random positions in the stream, counts how often each sampled element occurs from its position onward, and converts each count into an estimate. This works because the expected value of that estimate equals the sum of the squares of the actual frequencies. Reservoir sampling extends the method to streams of unbounded length.

Uploaded by

4241 DAYANA SRI VARSHA

Data Streams:

Estimating Moments

Sources
● Mining of Massive Datasets (2014) by Leskovec
et al. (chapter 4)

Estimating moments
Estimating moments is a generalization of the problem of
counting distinct elements in a stream. The problem, called
computing "moments," involves the distribution of
frequencies of different elements in the stream.

Moments of order k
● Suppose a stream contains A distinct elements, and the i-th element occurs $m_i$ times (its frequency)
● The kth order moment of the stream is $\sum_{i=1}^{A} (m_i)^k$
● The 0th order moment is the number of distinct elements in the stream
● The 1st order moment is the length of the stream

Moments of order k (cont.)
● The kth order moment of the stream is $\sum_{i=1}^{A} (m_i)^k$
● The 2nd order moment is also known as the “surprise number” of a stream (large values = more uneven distribution)
m_i   i=1  i=2  i=3  i=4  i=5  i=6  i=7  i=8  i=9  i=10  i=11  2nd moment
Seq1   10    9    9    9    9    9    9    9    9     9     9         910
Seq2   90    1    1    1    1    1    1    1    1     1     1        8110
● Seq1: $10^2 + 10 \times 9^2 = 100 + 810 = 910$
● Seq2: $90^2 + 10 \times 1^2 = 8100 + 10 = 8110$
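The two surprise numbers in the table can be checked by counting every element exactly, which is feasible only because these example streams are tiny. The sketch below is illustrative: `frequency_moment`, `seq1`, and `seq2` are hypothetical names, and the element labels are made up.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact kth moment: the sum of m_i ** k over the distinct elements."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

# Seq1: one element appears 10 times, ten others appear 9 times each.
seq1 = ["x1"] * 10 + [f"x{i}" for i in range(2, 12) for _ in range(9)]
# Seq2: one element appears 90 times, ten others appear once each.
seq2 = ["x1"] * 90 + [f"x{i}" for i in range(2, 12)]

for name, seq in (("Seq1", seq1), ("Seq2", seq2)):
    # Moments of order 0, 1, 2: distinct elements, length, surprise number.
    print(name, [frequency_moment(seq, k) for k in (0, 1, 2)])
```

Both sequences have 11 distinct elements and length 100; only the second moment (910 versus 8110) tells them apart.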

Method for second moment
● Assume that we know n, the length of the stream
● We sample s random positions in the stream, with one variable X per sampled position
● Each variable X stores X.element and X.count
● When the stream reaches a sampled position: X.element ← the element in that position, X.count ← 1
● Each time we see X.element again later in the stream: X.count ← X.count + 1
● Each variable estimates the second moment as n(2 × X.count - 1); the final estimate is the average over the s variables
Alon, N., Matias, Y., & Szegedy, M. (1999). The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1), 137-147.
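The procedure above can be sketched as a single pass over a stream of known length. This is a minimal illustration, not the authors' implementation: the function name `ams_second_moment`, the `seed` parameter, and the choice to draw the s positions independently (so a position may be drawn twice) are all assumptions made for the example.

```python
import random

def ams_second_moment(stream, s, seed=None):
    """Estimate the second moment of a stream of known length from s sampled positions."""
    n = len(stream)
    if n == 0 or s < 1:
        raise ValueError("need a non-empty stream and at least one sample")
    rng = random.Random(seed)
    # Assumption: positions are drawn independently and uniformly (0-based).
    positions = sorted(rng.randrange(n) for _ in range(s))

    variables = []   # each entry is [X.element, X.count]
    next_pos = 0
    for i, element in enumerate(stream):
        # Existing variables count later occurrences of their element.
        for var in variables:
            if var[0] == element:
                var[1] += 1
        # Start a new variable for every sample that falls on this position.
        while next_pos < s and positions[next_pos] == i:
            variables.append([element, 1])
            next_pos += 1

    # Each variable yields n * (2 * X.count - 1); average the s estimates.
    return sum(n * (2 * count - 1) for _, count in variables) / s

stream = "a,b,c,b,d,a,c,d,a,b,d,c,a,a,b".split(",")
print(ams_second_moment(stream, s=3, seed=42))
```

The result depends on which positions are drawn, so individual runs can land well above or below the true value; a larger s reduces the spread.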
Method for second moment (cont.)
● Example: a,b,c,b,d,a,c,d,a,b,d,c,a,a,b
● $m_a = 5$, $m_b = 4$, $m_c = 3$, $m_d = 3$
● Second moment = $5^2 + 4^2 + 3^2 + 3^2 = 59$
● Suppose we sample s=3 variables X1, X2, X3
● Suppose we pick the 3rd, 8th, and 13th position at random
● X1.element=c, X2.element=d, X3.element=a
● X1.count=3, X2.count=2, X3.count=2 (we count forwards only!)
● Estimate n(2 × X.count - 1):
● first estimate = 15(6-1) = 75
● second estimate = 15(4-1) = 45
● third estimate = 15(4-1) = 45
● average of estimates = 55 ≃ 59
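The worked example can be replayed with the three positions fixed rather than drawn at random. This is a small sketch for checking the arithmetic; `forward_count` is a hypothetical helper, and positions are 1-based to match the slides.

```python
def forward_count(stream, position):
    """Occurrences of the element at a 1-based position, from that position to the end."""
    element = stream[position - 1]
    return stream[position - 1:].count(element)

stream = "a,b,c,b,d,a,c,d,a,b,d,c,a,a,b".split(",")
n = len(stream)
positions = [3, 8, 13]          # the positions picked in the example

elements = [stream[p - 1] for p in positions]
counts = [forward_count(stream, p) for p in positions]
estimates = [n * (2 * v - 1) for v in counts]
average = sum(estimates) / len(estimates)

print(elements)   # ['c', 'd', 'a']
print(counts)     # [3, 2, 2]
print(estimates)  # [75, 45, 45]
print(average)    # 55.0
```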

● The 0th moment is the sum of 1 for each $m_i$ that is greater than 0, i.e., the 0th moment is a count of the number of distinct elements in the stream.
● The 1st moment is the sum of the $m_i$'s, which must be the length of the stream. Thus, first moments are especially easy to compute: just count the length of the stream seen so far.

● The second moment is the sum of the squares of the $m_i$'s. It is sometimes called the surprise number, since it measures how uneven the distribution of elements in the stream is.

● To see the distinction, suppose we have a stream of length 100, in which eleven different elements appear. The most even distribution of these eleven elements would have one appearing 10 times and the other ten appearing 9 times each. In this case, the surprise number is $10^2 + 10 \times 9^2 = 910$. At the other extreme, one of the eleven elements could appear 90 times and the other ten appear 1 time each. Then, the surprise number would be $90^2 + 10 \times 1^2 = 8110$.
Why this method works?
● Let e(i) be the element in position i of the stream
● Let c(i) be the number of times e(i) appears in positions i, i+1, i+2, …, n
● Example: a,b,c,b,d,a,c,d,a,b,d,c,a,a,b
c(6) = 4 (remember: we count forwards only!)
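The forward counts c(i) for every position can be computed in one backward scan, which makes it easy to confirm c(6) = 4. This is an illustrative sketch; `forward_counts` is a hypothetical name, and the returned list is 0-based, so c(6) is at index 5.

```python
from collections import defaultdict

def forward_counts(stream):
    """Return c(i) for every position: occurrences of e(i) at positions i..n."""
    seen = defaultdict(int)
    counts = [0] * len(stream)
    # Scan from the end so that seen[x] is the number of x's at or after i.
    for i in range(len(stream) - 1, -1, -1):
        seen[stream[i]] += 1
        counts[i] = seen[stream[i]]
    return counts

stream = "a,b,c,b,d,a,c,d,a,b,d,c,a,a,b".split(",")
c = forward_counts(stream)
print(c[5])  # c(6): position 6 holds 'a', which appears at positions 6, 9, 13, 14
```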

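Why this method works? (cont.)
● c(i) is the number of times e(i) appears in positions i, i+1, i+2, …, n
● Every position is equally likely to be sampled, so E[n(2 × X.count - 1)] is the average of n(2c(i) - 1) over all positions i=1...n:
$E[n(2 \times X.\mathrm{count} - 1)] = \frac{1}{n} \sum_{i=1}^{n} n\,(2c(i) - 1) = \sum_{i=1}^{n} (2c(i) - 1)$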

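Why this method works? (cont.)

● Now focus on an element a that appears $m_a$ times in the stream
– The last time a appears, this term is 2c(i) - 1 = 2×1 - 1 = 1
– Just before that, 2c(i) - 1 = 2×2 - 1 = 3
– …
– Until $2m_a - 1$ for the first time a appears
● Hence, since the first $m_a$ odd numbers sum to $(m_a)^2$,
$E[n(2 \times X.\mathrm{count} - 1)] = \sum_{a} \left(1 + 3 + 5 + \cdots + (2m_a - 1)\right) = \sum_{a} (m_a)^2$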
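The argument can be checked numerically on the running example: averaging n(2c(i) - 1) over every position should reproduce the exact second moment. The sketch below is only a verification aid; `expected_estimate` and `exact_second_moment` are hypothetical names.

```python
from collections import Counter

def expected_estimate(stream):
    """Average of n * (2 * c(i) - 1) over all positions i of the stream."""
    n = len(stream)
    remaining = Counter(stream)   # occurrences at or after the current position
    total = 0
    for element in stream:
        c_i = remaining[element]  # this is c(i) for the current position
        total += n * (2 * c_i - 1)
        remaining[element] -= 1   # the current occurrence is now behind us
    return total / n

def exact_second_moment(stream):
    """Sum of squared frequencies, computed by exact counting."""
    return sum(m * m for m in Counter(stream).values())

stream = "a,b,c,b,d,a,c,d,a,b,d,c,a,a,b".split(",")
print(expected_estimate(stream), exact_second_moment(stream))
```

For this stream both values are 59, so the estimator is unbiased here even though a single sample of three positions gave 55.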

For higher order moments
(v = X.count)
● For second order moment
– We use $n(2v - 1) = n(v^2 - (v-1)^2)$
● For third order moment
– We use $n(3v^2 - 3v + 1) = n(v^3 - (v-1)^3)$
● For kth order moment
– We use $n(v^k - (v-1)^k)$
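The general term n(v^k - (v-1)^k) telescopes in the same way as the second-moment term, which can be confirmed on the example stream by averaging it over all positions. This is a hedged sketch with hypothetical names (`moment_term`, `mean_term_over_positions`), meant only to check the formulas.

```python
from collections import Counter

def moment_term(n, v, k):
    """Single-variable estimate of the kth moment from a count v."""
    return n * (v ** k - (v - 1) ** k)

def mean_term_over_positions(stream, k):
    """Average of moment_term over every position, i.e. the expected estimate."""
    n = len(stream)
    remaining = Counter(stream)   # occurrences at or after the current position
    total = 0
    for element in stream:
        total += moment_term(n, remaining[element], k)
        remaining[element] -= 1
    return total / n

stream = "a,b,c,b,d,a,c,d,a,b,d,c,a,a,b".split(",")
for k in (1, 2, 3):
    exact = sum(m ** k for m in Counter(stream).values())
    print(k, mean_term_over_positions(stream, k), exact)
```

For k = 2 the term reduces to n(2v - 1), so a count of v = 3 with n = 15 gives 75, matching the first estimate in the worked example.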

For infinite streams
● Use a reservoir sampling strategy
● If we want s samples
– Pick the first s elements of the stream, setting X_i.element ← e(i) and X_i.count ← 1 for i=1...s
– When element n+1 arrives
● Pick position n+1 with probability s/(n+1); if it is picked, evict one of the existing variables at random and replace it with a new X, setting X.element ← e(n+1) and X.count ← 1
● As before, every position seen so far is in the sample with the same probability: s/n after n elements
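The reservoir strategy can be sketched as follows for a stream whose length is not known in advance. This is one possible reading of the slide, not a reference implementation: the function name `streaming_second_moment`, the `seed` parameter, and the decision to update existing counts before admitting a new variable are assumptions.

```python
import random

def streaming_second_moment(stream, s, seed=None):
    """Estimate the second moment in one pass, keeping at most s variables."""
    rng = random.Random(seed)
    variables = []   # each entry is [X.element, X.count]
    n = 0            # number of elements seen so far
    for element in stream:
        n += 1
        # Variables already in the reservoir count this occurrence.
        for var in variables:
            if var[0] == element:
                var[1] += 1
        if n <= s:
            # The first s positions are always kept.
            variables.append([element, 1])
        elif rng.random() < s / n:
            # Keep position n with probability s/n, evicting a random variable.
            variables[rng.randrange(s)] = [element, 1]
    if not variables:
        return 0.0
    # Average the per-variable estimates n * (2 * X.count - 1).
    return sum(n * (2 * count - 1) for _, count in variables) / len(variables)

stream = "a,b,c,b,d,a,c,d,a,b,d,c,a,a,b".split(",")
print(streaming_second_moment(iter(stream), s=3, seed=1))
```

When s is at least the stream length, every position is kept and the estimate equals the exact second moment, which is a convenient way to test the code.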
Summary

Things to remember
● kth order moments of a stream

Exercises for TT22-T26
● Mining of Massive Datasets (2014) by Leskovec et al.
– Exercises 4.2.5
– Exercises 4.3.4
– Exercises 4.4.5
– Exercises 4.5.6
