
Estimating Moments

The document discusses estimating moments of data streams. The 0th moment is the number of distinct elements, and the 1st moment is the stream length. The 2nd moment (the "surprise number") measures how unevenly elements are distributed. To estimate the 2nd moment, the method samples random positions in the stream, counts how often the element at each sampled position occurs from that position onward, and averages n(2 × count - 1) over the samples. This works because the expected value of n(2 × count - 1) equals the sum of the squared frequencies. Reservoir sampling extends the method to streams whose length is not known in advance.

Data Streams:

Estimating Moments

Sources
● Mining of Massive Datasets (2014) by Leskovec
et al. (chapter 4)

Estimating moments
Estimating moments is a generalization of the problem of
counting distinct elements in a stream. The problem, called
computing "moments," involves the distribution of
frequencies of different elements in the stream.

Moments of order k
● Suppose a stream has A distinct elements, and
the i-th element has frequency $m_i$
● The kth order moment of the stream is $\sum_{i=1}^{A} m_i^k$
● The 0th order moment is the number of
distinct elements in the stream
● The 1st order moment is the length of the stream
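
The definition above can be checked on a small in-memory stream. The following is a minimal sketch, not a streaming algorithm: it stores every frequency exactly, which is what the estimation methods try to avoid. The function name `frequency_moment` and the sample stream are assumptions made for illustration.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th order moment: the sum of m_i**k over the distinct elements."""
    counts = Counter(stream)  # m_i for every distinct element
    return sum(m ** k for m in counts.values())

# Hypothetical sample stream with 4 distinct elements and 15 positions.
stream = list("abcbdacdabdcaab")
print(frequency_moment(stream, 0))  # number of distinct elements
print(frequency_moment(stream, 1))  # length of the stream
print(frequency_moment(stream, 2))  # second moment (surprise number)
```

Because every $m_i$ counted here is at least 1, `m ** 0` contributes exactly 1 per distinct element, so k = 0 yields the distinct count.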

Moments of order k (cont.)
● The kth order moment of the stream is $\sum_{i=1}^{A} m_i^k$
● The 2nd order moment is also known as the
“surprise number” of a stream (large values =
more uneven distribution)

$m_i$   i=1  i=2  i=3  i=4  i=5  i=6  i=7  i=8  i=9  i=10  i=11  2nd moment
Seq1     10    9    9    9    9    9    9    9    9     9     9         910
Seq2     90    1    1    1    1    1    1    1    1     1     1        8110

Seq1: $10^2 + 10 \times 9^2 = 100 + 810 = 910$
Seq2: $90^2 + 10 \times 1^2 = 8100 + 10 = 8110$

Method for second moment
● Assume that we know n, the length of the stream
● We sample s positions in the stream, chosen uniformly at random
● For each sampled position we keep a variable X with X.element and X.count
– X.element = element in that position
– X.count ← 1
– When we see X.element again, X.count ← X.count + 1
● Each variable estimates the second moment as n(2 × X.count - 1)
Alon, N., Matias, Y., & Szegedy, M. (1999). The space complexity of approximating the frequency moments. Journal of Computer and system sciences, 58(1), 137-147.
Method for second moment (cont.)
● Example: a,b,c,b,d,a,c,d,a,b,d,c,a,a,b (n = 15)
$m_a = 5$, $m_b = 4$, $m_c = 3$, $m_d = 3$
second moment = $5^2 + 4^2 + 3^2 + 3^2 = 59$
● Suppose we sample s=3 variables X1, X2, X3
● Suppose we pick the 3rd, 8th, and 13th position at random
● X1.element=c, X2.element=d, X3.element=a
● X1.count=3, X2.count=2, X3.count=2 (we count forwards only!)
● Estimate n(2 × X.count – 1)
● first estimate = 15(6-1) = 75
● second estimate = 15(4-1) = 45
● third estimate = 15(4-1) = 45
● average of estimates = (75 + 45 + 45)/3 = 55 ≃ 59
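
The worked example can be reproduced with a short single-pass sketch. The function name `ams_second_moment` is an assumption for illustration, and the sampled positions are passed in explicitly (1-based and assumed distinct) so that the result matches the slide; in practice they would be drawn at random.

```python
def ams_second_moment(stream, positions):
    """Return the per-variable estimates n * (2 * X.count - 1) and their average."""
    n = len(stream)
    wanted = set(positions)
    variables = {}  # sampled position -> [X.element, X.count]
    for i, x in enumerate(stream, start=1):
        # Every existing variable counts later occurrences of its element.
        for var in variables.values():
            if var[0] == x:
                var[1] += 1
        # A newly sampled position starts its own variable with count 1.
        if i in wanted:
            variables[i] = [x, 1]
    estimates = [n * (2 * variables[p][1] - 1) for p in positions]
    return estimates, sum(estimates) / len(estimates)

stream = "a,b,c,b,d,a,c,d,a,b,d,c,a,a,b".split(",")
estimates, average = ams_second_moment(stream, [3, 8, 13])
print(estimates, average)  # [75, 45, 45] 55.0
```

Counting starts at the sampled position, so occurrences before it are ignored, which is why X3.count is 2 even though a appears 5 times in total.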

● The 0th moment is the sum of 1 for each $m_i$ that is greater than 0,
i.e., the 0th moment is a count of the number of distinct elements in the
stream.
● The 1st moment is the sum of the $m_i$'s, which must be the length of the
stream. Thus, first moments are especially easy to compute: just count
the length of the stream seen so far.

● The second moment is the sum of the squares of the $m_i$'s. It is sometimes
called the surprise number, since it measures how uneven the distribution
of elements in the stream is.

● To see the distinction, suppose we have a stream of length 100, in which
eleven different elements appear. The most even distribution of these
eleven elements would have one appearing 10 times and the other ten
appearing 9 times each.
● In this case, the surprise number is $10^2 + 10 \times 9^2 = 910$. At the other
extreme, one of the eleven elements could appear 90 times and the other
ten appear 1 time each. Then, the surprise number would be
$90^2 + 10 \times 1^2 = 8110$.

Why does this method work?
● Let e(i) be the element in position i of the
stream
● Let c(i) be the number of times e(i) appears in
positions i, i+1, i+2, …, n
● Example: a,b,c,b,d,a,c,d,a,b,d,c,a,a,b
e(6) = a, which appears in positions 6, 9, 13, and 14, so
c(6) = 4 (remember: we count forwards only!)

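Why does this method work? (cont.)
● c(i) is the number of times e(i) appears in
positions i, i+1, i+2, …, n
● Each position is sampled with probability 1/n, so
E[n (2 × X.count – 1)] is the average of
n (2 c(i) – 1) over all positions i=1...n:
$E[n(2 \times X.count - 1)] = \frac{1}{n} \sum_{i=1}^{n} n(2c(i) - 1) = \sum_{i=1}^{n} (2c(i) - 1)$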

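Why does this method work? (cont.)
● Now focus on element a that appears $m_a$ times in the stream
– The last time a appears this term is 2c(i) – 1 = 2×1 – 1 = 1
– Just before that, 2c(i) – 1 = 2×2 – 1 = 3
– …
– Until $2m_a - 1$ for the first time a appears
● Hence $E[n(2 \times X.count - 1)] = \sum_a (1 + 3 + 5 + \dots + (2m_a - 1)) = \sum_a m_a^2$,
which is the second moment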
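
The argument can be checked numerically by averaging n(2c(i) – 1) over every position instead of a random sample. This is a sketch for verification only; the helper name `forward_counts` is an assumption, and it reads the stream backwards, which a real stream would not allow.

```python
from collections import Counter

def forward_counts(stream):
    """c(i) for each position i: occurrences of e(i) in positions i..n."""
    seen = Counter()
    counts = []
    for x in reversed(stream):  # walk from the end so each count looks forwards
        seen[x] += 1
        counts.append(seen[x])
    counts.reverse()
    return counts

stream = "a,b,c,b,d,a,c,d,a,b,d,c,a,a,b".split(",")
n = len(stream)
c = forward_counts(stream)

# Expected value of the estimate: the average over all n positions.
expected_estimate = sum(n * (2 * ci - 1) for ci in c) / n
second_moment = sum(m ** 2 for m in Counter(stream).values())
print(c[5], expected_estimate, second_moment)  # 4 59.0 59
```

The first printed value is c(6) (index 5 in 0-based Python), and the other two agree, as the sum of odd numbers predicts.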

For higher order moments
(v = X.count)
● For second order moment
– We use $n(2v - 1) = n(v^2 - (v-1)^2)$
● For third order moment
– We use $n(3v^2 - 3v + 1) = n(v^3 - (v-1)^3)$
● For kth order moment
– We use $n(v^k - (v-1)^k)$
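
A small sketch of the general estimator follows. The names `kth_moment_estimate` and `average_over_all_positions` are assumptions for illustration. Averaging the estimate over every position, rather than a random sample, gives its expected value, which should equal the exact kth moment because the terms $v^k - (v-1)^k$ telescope for each element.

```python
from collections import Counter

def kth_moment_estimate(n, v, k):
    """Estimate from one variable with v = X.count: n * (v**k - (v-1)**k)."""
    return n * (v ** k - (v - 1) ** k)

def average_over_all_positions(stream, k):
    """Expected value of the estimate when every position is equally likely."""
    n = len(stream)
    seen = Counter()
    total = 0
    for x in reversed(stream):  # seen[x] is then the forward count c(i)
        seen[x] += 1
        total += kth_moment_estimate(n, seen[x], k)
    return total / n

stream = "a,b,c,b,d,a,c,d,a,b,d,c,a,a,b".split(",")
print(average_over_all_positions(stream, 2))  # 59.0
print(average_over_all_positions(stream, 3))  # 243.0
```

For this stream the exact third moment is $5^3 + 4^3 + 3^3 + 3^3 = 243$, matching the second printed value.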

For infinite streams
● Use a reservoir sampling strategy
● If we want s samples
– Pick the first s elements of the stream, setting Xi.element ← e(i)
and Xi.count ← 1 for i=1...s
– When element n+1 arrives, pick position n+1 with probability s/(n+1),
evicting one of the existing variables at random and setting
X.element ← e(n+1) and X.count ← 1
● As before, after n elements every position is in the sample with probability s/n
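
The reservoir strategy can be sketched as follows. The function name `streaming_second_moment` and the optional `rng` parameter are assumptions for illustration; the sketch consumes any iterable once and reports the estimate for the elements seen so far, using the current length n in n(2 × X.count – 1).

```python
import random

def streaming_second_moment(stream, s, rng=None):
    """Estimate the second moment with at most s variables kept in a reservoir."""
    rng = rng or random.Random()
    variables = []  # each entry is [X.element, X.count]
    n = 0           # number of elements seen so far
    for x in stream:
        n += 1
        # Existing variables keep counting occurrences of their element.
        for var in variables:
            if var[0] == x:
                var[1] += 1
        if n <= s:
            # The first s positions are always sampled.
            variables.append([x, 1])
        elif rng.random() < s / n:
            # Keep the new position with probability s/n, evicting a random variable.
            variables[rng.randrange(s)] = [x, 1]
    if not variables:
        return 0.0
    return sum(n * (2 * count - 1) for _, count in variables) / len(variables)

stream = "a,b,c,b,d,a,c,d,a,b,d,c,a,a,b".split(",")
print(streaming_second_moment(stream, s=3, rng=random.Random(0)))
print(streaming_second_moment(stream, s=len(stream)))  # every position kept
```

When s is at least the stream length, no eviction happens and the result is the average over all positions, which equals the exact second moment (59 for this stream). Smaller values of s trade accuracy for memory.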
Summary

Things to remember
● kth order moments of a stream

Exercises for TT22-T26
● Mining of Massive Datasets (2014) by Leskovec et al.
– Exercises 4.2.5
– Exercises 4.3.4
– Exercises 4.4.5
– Exercises 4.5.6
