From 289c47759d9edf7152e28dc7adeac1446b626043 Mon Sep 17 00:00:00 2001
From: Sam Cutler
Date: Thu, 17 Jul 2025 20:01:13 +0100
Subject: [PATCH 1/3] add basic intuition for data structure

---
 crates/geo_filters/README.md | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/crates/geo_filters/README.md b/crates/geo_filters/README.md
index 69cf4f3..0091048 100644
--- a/crates/geo_filters/README.md
+++ b/crates/geo_filters/README.md
@@ -9,6 +9,37 @@ Two variants are implemented, which differ in the way new elements are added to
 Supports estimating the size of the union of two sets with a precision related to the estimated size.
 It has some similar properties as related filters like HyperLogLog, MinHash, etc, but uses less space.

+

Data Structure Analogy

+If you're not familiar with probabilistic data structures then getting an intuition for how this data structure works
+using a real world example might be helpful.
+
+Imagine you wanted to count how many rain drops fall from the sky and landing in certain area. A single rain drop will
+fall randomly hitting the ground in the area you wish to analyze. Trying to count every single drop that hits the
+concrete would be very difficult, there are simply far too many.
+
+Perhaps you can estimate the number of rain drops?
+
+Instead of counting each and every drop of rain you could instead lay out a grid of buckets and then count how many buckets
+have *any* rain drops in them at all. For this thought experiment we're not considering how _much_ water is in them, only if
+there is a non-zero amount of water. Uniformly sized buckets might work ok for a small shower, but you'd quickly run
+into an issue where most of your buckets have some amount rain in them. Because of this, you would not be able to differentiate between a gentle shower and a downpour; either way most of the buckets have _some_ water in them.
+
+By varying the size of the buckets you reduce the probability that a rain drop will land in the smaller ones. You can
+then estimate the number of droplets by adding up the probabilities that a given bucket has a rain drop in it. Smaller
+ones are much less likely to have a droplet in so if you've got a lot of smaller buckets with drop lets in, that would imply
+that there was a lot of rain. If those buckets are mostly dry, then it would imply that there was only a small amount
+of drizzle. You still need a wide range of bucket sizes to be able to tell the difference between having no rain and a small
+amount of rain.
+
+You can estimate the difference in the amount of rain fall on two areas by counting the number of buckets where the matching
+bucket size has rain in it in one area but not the other.
+
+This data structure works in a similar way. Items are hashed to produce a "random" number which we assign to a bucket. The
+bucket "sizes" are arranged to follow a geometric distribution to allow us to calculate an estimate of the number of items
+using well known formulas.
+

+
 ## Usage

 Add this to your `Cargo.toml`:

From 223cc49cbe92ac7e4ae027c7ec08bae549dcaeda Mon Sep 17 00:00:00 2001
From: Sam Cutler
Date: Thu, 17 Jul 2025 20:03:21 +0100
Subject: [PATCH 2/3] better summary title

---
 crates/geo_filters/README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/crates/geo_filters/README.md b/crates/geo_filters/README.md
index 0091048..b262b6b 100644
--- a/crates/geo_filters/README.md
+++ b/crates/geo_filters/README.md
@@ -10,7 +10,11 @@ Two variants are implemented, which differ in the way new elements are added to
 It has some similar properties as related filters like HyperLogLog, MinHash, etc, but uses less space.

Data Structure Analogy

+ +### Data Structure Analogy + +

 If you're not familiar with probabilistic data structures then getting an intuition for how this data structure works
 using a real world example might be helpful.

From d2109876983e5a01572487df9f4c97ca45b2ab17 Mon Sep 17 00:00:00 2001
From: Sam Cutler
Date: Fri, 18 Jul 2025 11:36:41 +0100
Subject: [PATCH 3/3] move to docs dir

---
 crates/geo_filters/README.md | 35 ------------------------
 crates/geo_filters/docs/01-intuitions.md | 29 ++++++++++++++++++++
 2 files changed, 29 insertions(+), 35 deletions(-)
 create mode 100644 crates/geo_filters/docs/01-intuitions.md

diff --git a/crates/geo_filters/README.md b/crates/geo_filters/README.md
index b262b6b..69cf4f3 100644
--- a/crates/geo_filters/README.md
+++ b/crates/geo_filters/README.md
@@ -9,41 +9,6 @@ Two variants are implemented, which differ in the way new elements are added to
 Supports estimating the size of the union of two sets with a precision related to the estimated size.
 It has some similar properties as related filters like HyperLogLog, MinHash, etc, but uses less space.

-

- -### Data Structure Analogy - -

-If you're not familiar with probabilistic data structures then getting an intuition for how this data structure works
-using a real world example might be helpful.
-
-Imagine you wanted to count how many rain drops fall from the sky and landing in certain area. A single rain drop will
-fall randomly hitting the ground in the area you wish to analyze. Trying to count every single drop that hits the
-concrete would be very difficult, there are simply far too many.
-
-Perhaps you can estimate the number of rain drops?
-
-Instead of counting each and every drop of rain you could instead lay out a grid of buckets and then count how many buckets
-have *any* rain drops in them at all. For this thought experiment we're not considering how _much_ water is in them, only if
-there is a non-zero amount of water. Uniformly sized buckets might work ok for a small shower, but you'd quickly run
-into an issue where most of your buckets have some amount rain in them. Because of this, you would not be able to differentiate between a gentle shower and a downpour; either way most of the buckets have _some_ water in them.
-
-By varying the size of the buckets you reduce the probability that a rain drop will land in the smaller ones. You can
-then estimate the number of droplets by adding up the probabilities that a given bucket has a rain drop in it. Smaller
-ones are much less likely to have a droplet in so if you've got a lot of smaller buckets with drop lets in, that would imply
-that there was a lot of rain. If those buckets are mostly dry, then it would imply that there was only a small amount
-of drizzle. You still need a wide range of bucket sizes to be able to tell the difference between having no rain and a small
-amount of rain.
-
-You can estimate the difference in the amount of rain fall on two areas by counting the number of buckets where the matching
-bucket size has rain in it in one area but not the other.
-
-This data structure works in a similar way. Items are hashed to produce a "random" number which we assign to a bucket. The
-bucket "sizes" are arranged to follow a geometric distribution to allow us to calculate an estimate of the number of items
-using well known formulas.
-

-
 ## Usage

 Add this to your `Cargo.toml`:
diff --git a/crates/geo_filters/docs/01-intuitions.md b/crates/geo_filters/docs/01-intuitions.md
new file mode 100644
index 0000000..5f546ec
--- /dev/null
+++ b/crates/geo_filters/docs/01-intuitions.md
@@ -0,0 +1,29 @@
+# Intuition
+
+If you're not familiar with probabilistic data structures then getting an intuition for how this data structure works
+using a real world example might be helpful.
+
+Imagine you wanted to count how many raindrops fall from the sky and landing in certain area, let's say one square meter.
+A single raindrop will fall randomly hitting the ground in the area you wish to analyze. Trying to count every single drop
+that hits the ground would be very difficult, there are simply far too many.
+
+Perhaps you can estimate the number of raindrops?
+
+Instead of counting each and every drop of rain you could instead lay out a grid of buckets and then count how many buckets
+have *any* raindrops in them at all. For this thought experiment we're not considering how _much_ water is in them, only if
+there is a non-zero amount of water. Uniformly sized buckets might work ok for a small shower, but you'd quickly run
+into an issue where most of your buckets have some amount rain in them. Because of this, you would not be able to differentiate between a gentle shower and a downpour; either way most of the buckets have _some_ water in them.
+
+By varying the size of the buckets you reduce the probability that a raindrop will land in the smaller ones. You can
+then estimate the number of droplets by adding up the probabilities that a given bucket has a raindrop in it. Smaller
+ones are much less likely to have a droplet in so if you've got a lot of smaller buckets with drop lets in, that would imply
+that there was a lot of rain. If those buckets are mostly dry, then it would imply that there was only a small amount
+of drizzle. You still need a wide range of bucket sizes to be able to tell the difference between having no rain and a small
+amount of rain.
+
+You can estimate the difference in the amount of rain fall on two areas by counting the number of buckets where the matching
+bucket size has rain in it in one area but not the other.
+
+This data structure works in a similar way. Items are hashed to produce a "random" number which we assign to a bucket. The
+bucket "sizes" are arranged to follow a geometric distribution to allow us to calculate an estimate of the number of items
+using well known formulas.
\ No newline at end of file
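
The closing paragraph of `01-intuitions.md` (hash each item, drop it into a geometrically sized bucket, then estimate from which buckets are wet) could be sketched roughly as below. This is only a toy under stated assumptions, not the `geo_filters` API or its estimator: the names `ToyFilter`, `bucket_of`, and `expected_occupied` are hypothetical, the bucket ratio is fixed at 1/2 for simplicity, and the estimate simply inverts the expected number of non-empty buckets.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy filter (hypothetical, not the crate's type): one bit per bucket, 64 buckets.
/// Bucket `i` is hit with probability of about 2^-(i+1), so every bucket is
/// half the "size" of the one before it (a geometric distribution).
#[derive(Clone, Copy, Default, PartialEq, Eq, Debug)]
struct ToyFilter {
    bits: u64,
}

/// Hash the item and use the number of leading zero bits as its bucket,
/// capped at 63 so it always fits in the 64-bit mask.
fn bucket_of<T: Hash>(item: &T) -> u32 {
    let mut hasher = DefaultHasher::new();
    item.hash(&mut hasher);
    hasher.finish().leading_zeros().min(63)
}

/// Expected number of non-empty buckets after `n` distinct items:
/// the sum over buckets of 1 - (1 - p_i)^n, computed via ln_1p/exp_m1
/// to stay accurate for very small p_i.
fn expected_occupied(n: f64) -> f64 {
    (0..64)
        .map(|i| {
            let p = 0.5f64.powi(i + 1);
            -(n * (-p).ln_1p()).exp_m1()
        })
        .sum()
}

impl ToyFilter {
    /// Mark the item's bucket as "wet". Re-inserting an item changes nothing.
    fn insert<T: Hash>(&mut self, item: &T) {
        self.bits |= 1u64 << bucket_of(item);
    }

    /// How many buckets contain at least one item.
    fn occupied(&self) -> u32 {
        self.bits.count_ones()
    }

    /// Buckets that are wet in exactly one of the two filters.
    fn differing_buckets(&self, other: &ToyFilter) -> u32 {
        (self.bits ^ other.bits).count_ones()
    }

    /// Toy estimate: binary-search the `n` whose expected number of
    /// non-empty buckets matches what was observed.
    fn estimate(&self) -> f64 {
        let observed = self.occupied() as f64;
        if observed == 0.0 {
            return 0.0;
        }
        let (mut lo, mut hi) = (0.0f64, 1e30f64);
        for _ in 0..200 {
            let mid = (lo + hi) / 2.0;
            if expected_occupied(mid) < observed {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        hi
    }
}
```

With a ratio of 1/2 the estimate is very coarse, since only a handful of buckets carry information for any given count. A ratio closer to 1 would give finer-grained bucket sizes at the cost of more buckets, which is the trade-off the "wide range of bucket sizes" sentence hints at.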
