Estimating Moments
Sources
● Mining of Massive Datasets (2014) by Leskovec
et al. (chapter 4)
Estimating moments
Estimating moments is a generalization of the problem of
counting distinct elements in a stream. The problem, called
computing "moments," involves the distribution of
frequencies of different elements in the stream.
Moments of order k
● Suppose a stream has A distinct elements, and
the i-th element has frequency $m_i$
● The kth order moment of the stream is $\sum_{i=1}^{A} m_i^k$
● The 0th order moment is the number of
distinct elements in the stream
● The 1st order moment is the length of the stream
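As a minimal sketch of the definition above, the exact moments of a small stream can be computed by counting every element first. The function name `frequency_moment` and the toy stream are assumptions made for illustration; exact counting like this is only feasible when all frequencies fit in memory.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th order moment: sum of m_i**k over the distinct elements."""
    counts = Counter(stream)  # m_i for every distinct element
    return sum(m ** k for m in counts.values())

# Toy stream chosen for illustration (15 elements, 4 distinct).
stream = list("abcbdacdabdcaab")

distinct = frequency_moment(stream, 0)  # 0th moment: number of distinct elements
length = frequency_moment(stream, 1)    # 1st moment: length of the stream
surprise = frequency_moment(stream, 2)  # 2nd moment: sum of squared frequencies
print(distinct, length, surprise)
```

For this stream the frequencies are a=5, b=4, c=3, d=3, so the three values are 4, 15, and 25 + 16 + 9 + 9 = 59.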
Moments of order k (cont.)
● The kth order moment of the stream is $\sum_{i=1}^{A} m_i^k$
● The 2nd order moment is also known as the
“surprise number” of a stream (large values =
more uneven distribution)
m_i    i=1  i=2  i=3  i=4  i=5  i=6  i=7  i=8  i=9  i=10  i=11  2nd moment
Seq1    10    9    9    9    9    9    9    9    9     9     9         910
Seq2    90    1    1    1    1    1    1    1    1     1     1        8110
Seq1: $10^2 + 10 \times 9^2 = 100 + 810 = 910$
Seq2: $90^2 + 10 \times 1^2 = 8100 + 10 = 8110$
Method for second moment
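● Assume that we know n, the length of the stream
● We sample s random positions in the stream
● For each sampled position we keep a variable X with two fields, X.element and X.count
– X.element = element in that position
– X.count = number of times X.element appears from that position to the end of the stream
● Each sample gives the estimate n(2 × X.count – 1); the final estimate is the average over the s samples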
Alon, N., Matias, Y., & Szegedy, M. (1999). The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1), 137-147.
Method for second moment (cont.)
● Example: a,b,c,b,d,a,c,d,a,b,d,c,a,a,b
● Suppose we pick the 3rd, 8th, and 13th positions
at random
● X1.element=c, X2.element=d, X3.element=a
● X1.count=3, X2.count=2, X3.count=2
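The example can be replayed in a few lines, using the per-sample estimate n(2 × X.count – 1) from the method of Alon et al. This is only a sketch: the helper name `ams_second_moment` is an assumption, and it rescans the stored stream for clarity, which a real one-pass implementation would not do.

```python
def ams_second_moment(stream, positions):
    """Average of n * (2 * X.count - 1) over the sampled 1-based positions."""
    n = len(stream)
    estimates = []
    for pos in positions:
        element = stream[pos - 1]                # X.element
        count = stream[pos - 1:].count(element)  # X.count: occurrences from pos onward
        estimates.append(n * (2 * count - 1))
    return sum(estimates) / len(estimates), estimates

stream = list("abcbdacdabdcaab")
average, estimates = ams_second_moment(stream, [3, 8, 13])
print(estimates, average)
```

With counts 3, 2, and 2 the three estimates are 75, 45, and 45, and their average is 55. The exact second moment of this stream is 59 (frequencies a=5, b=4, c=3, d=3).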
● The 0th moment is the sum of 1 for each $m_i$ that is greater than 0,
i.e., the 0th moment is the number of distinct elements in the
stream.
● The 1st moment is the sum of the $m_i$'s, which must be the length of the
stream. Thus, first moments are especially easy to compute: just count
the length of the stream seen so far.
● The 2nd moment is the sum of the squares of the $m_i$'s. It is sometimes
called the surprise number, since it measures how uneven the distribution
of elements in the stream is.
Why this method works?
● Let e(i) be the element in position i of the
stream
● Let c(i) be the number of times e(i) appears in
positions i, i+1, i+2, …, n
● Example: a,b,c,b,d,a,c,d,a,b,d,c,a,a,b
e(6) = a, which appears in positions 6, 9, 13, and 14, so
c(6) = 4 (remember: we count forwards only!)
Why this method works? (cont.)
● c(i) is the number of times e(i) appears in
positions i, i+1, i+2, …, n
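● E[n (2 × X.count – 1)] is the average of
n (2 c(i) – 1) over all positions i = 1...n, since each position is
sampled with equal probability 1/n:
$E[n(2 \cdot \text{X.count} - 1)] = \frac{1}{n} \sum_{i=1}^{n} n(2c(i) - 1) = \sum_{i=1}^{n} (2c(i) - 1)$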
Why this method works? (cont.)
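● Now focus on an element a that appears $m_a$ times in the stream
– The last time a appears, this term is 2c(i) – 1 = 2 × 1 – 1 = 1
– Just before that, 2c(i) – 1 = 2 × 2 – 1 = 3
– …
– Until $2m_a - 1$ for the first time a appears
● Hence $E[n(2 \cdot \text{X.count} - 1)] = \sum_a \left(1 + 3 + 5 + \dots + (2m_a - 1)\right) = \sum_a m_a^2$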
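The argument can be checked numerically on the running example: computing c(i) for every position and averaging n(2c(i) – 1) over all n positions should give exactly the second moment. The helper `suffix_counts` is a name assumed for this sketch.

```python
def suffix_counts(stream):
    """c(i) for every position: occurrences of e(i) in positions i..n."""
    seen = {}
    counts = [0] * len(stream)
    # Walk backwards so that seen[x] is the number of x's from here to the end.
    for i in range(len(stream) - 1, -1, -1):
        seen[stream[i]] = seen.get(stream[i], 0) + 1
        counts[i] = seen[stream[i]]
    return counts

stream = list("abcbdacdabdcaab")
n = len(stream)
c = suffix_counts(stream)

# Expected value of the estimator: every position is sampled with probability 1/n.
expected_estimate = sum(n * (2 * ci - 1) for ci in c) / n
print(c[5], expected_estimate)  # c(6) is stored at index 5
```

Here c(6) = 4, and the expectation equals 59, the sum of squared frequencies 25 + 16 + 9 + 9.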
For higher order moments
(v = X.count)
● For second order moment
– We use $n(2v - 1) = n(v^2 - (v-1)^2)$
● For third order moment
– We use $n(3v^2 - 3v + 1) = n(v^3 - (v-1)^3)$
● For kth order moment
– We use $n(v^k - (v-1)^k)$
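A hedged sketch of the general estimator, with v = X.count and the per-sample estimate n(v^k – (v-1)^k). The function name `ams_kth_moment` is an assumption, and the stream is stored in a list purely to keep the example short. Sampling every position once recovers the exact moment, because the differences telescope per element.

```python
def ams_kth_moment(stream, positions, k):
    """Average of n * (v**k - (v - 1)**k) over the sampled 1-based positions."""
    n = len(stream)
    total = 0
    for pos in positions:
        v = stream[pos - 1:].count(stream[pos - 1])  # v = X.count
        total += n * (v ** k - (v - 1) ** k)
    return total / len(positions)

stream = list("abcbdacdabdcaab")
all_positions = range(1, len(stream) + 1)

second_from_sample = ams_kth_moment(stream, [3, 8, 13], 2)  # same as n(2v - 1)
third_from_sample = ams_kth_moment(stream, [3, 8, 13], 3)
third_exact = ams_kth_moment(stream, all_positions, 3)      # telescopes to sum of m_a**3
print(second_from_sample, third_from_sample, third_exact)
```

For this stream the exact third moment is 125 + 64 + 27 + 27 = 243, while the three-position sample gives 165, which illustrates how noisy a small sample can be.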
For infinite streams
● Use a reservoir sampling strategy
● If we want s samples
– Pick the first s elements of the stream, setting Xi.element ← e(i)
and Xi.count ← 1 for i = 1...s
– When element n+1 arrives
● With probability s/(n+1), pick position n+1: evict one of the
existing samples uniformly at random, and set X.element ← e(n+1) and X.count ← 1
for the new sample
● For every sample that is kept, add 1 to X.count if X.element = e(n+1)
● As before, each position is in the sample with probability s/n
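The reservoir strategy can be sketched as a small one-pass class. The class name `StreamingSecondMoment`, its method names, and the injectable `rng` parameter are assumptions made for this example. Note that every kept sample whose element matches the arriving element has its count incremented.

```python
import random

class StreamingSecondMoment:
    """One-pass second-moment estimate with s reservoir-sampled positions."""

    def __init__(self, s, rng=None):
        self.s = s
        self.n = 0          # number of elements seen so far
        self.samples = []   # each entry is [X.element, X.count]
        self.rng = rng or random.Random()

    def add(self, element):
        self.n += 1
        # Existing samples that match the new element see one more occurrence.
        for sample in self.samples:
            if sample[0] == element:
                sample[1] += 1
        if len(self.samples) < self.s:
            # The first s positions are always sampled.
            self.samples.append([element, 1])
        elif self.rng.random() < self.s / self.n:
            # Keep the new position with probability s/n, evicting a random sample.
            self.samples[self.rng.randrange(self.s)] = [element, 1]

    def estimate(self):
        """Average of n * (2 * X.count - 1); requires at least one element seen."""
        total = sum(self.n * (2 * count - 1) for _, count in self.samples)
        return total / len(self.samples)

stream = list("abcbdacdabdcaab")

full = StreamingSecondMoment(s=len(stream))  # reservoir large enough for every position
small = StreamingSecondMoment(s=3, rng=random.Random(0))
for x in stream:
    full.add(x)
    small.add(x)
print(full.estimate(), small.estimate())
```

When s is at least the stream length, every position is kept and the estimate equals the exact second moment (59 here). With s = 3 the result depends on which positions survive in the reservoir.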
Summary
Things to remember
● kth order moments of a stream
Exercises for TT22-T26
● Mining of Massive Datasets (2014) by Leskovec et al.
– Exercises 4.2.5
– Exercises 4.3.4
– Exercises 4.4.5
– Exercises 4.5.6