CS168: The Modern Algorithmic Toolbox Lecture #3: Similarity Metrics and Kd-Trees
© 2015–2022, Tim Roughgarden and Gregory Valiant. Not to be sold, published, or distributed without the authors' consent.
1 Similarity Search
We begin with the basic problem of how to organize/represent a dataset so that similar items
can be found quickly. There are two slightly different settings in which one might want to
consider this question:
1. All the data is present, and one wants to find pairs or sets of similar items from within
the dataset.
2. There is a reference dataset that one has plenty of time to process cleverly, but when
we are given a new datapoint, we want to very quickly return a similar datapoint from
the reference dataset. This setting is sometimes referred to as the “nearest neighbor
search” setting.
In general, similar techniques/approaches are used for the above two settings, though it
is worth being aware of the different objectives.
There are many real-world applications of similarity search:
• Collaborative filtering: find similar products (based on whether the same set of peo-
ple purchased them) or individuals (based on purchase history, demographics, web
browsing behavior, etc.).
• Machine learning/classification: If we find two similar datapoints, they might have the
same label....
• Combining datasets: e.g., in astronomy, different telescopes take pictures of the same
portions of the sky, possibly in different wavelengths; it is very useful to automatically
aggregate these datasets.
• Super-fast labeling: e.g., at CERN, which smashes particles together and needs to very
quickly figure out which new particle traces/trajectories are “interesting” and worth
saving, and which trajectories correspond to boring/common particles.
2 Measures of Similarity
Before talking about algorithms for finding similar objects, we should begin by considering
several quantifications of what “similarity” means.
A natural first measure is the Jaccard similarity: for two sets S and T, it is the size of
their intersection divided by the size of their union,
$$J(S, T) = \frac{|S \cap T|}{|S \cup T|}.$$
Equivalently, if we represent our sets (or multisets) S, T as vectors v_S, v_T, with the ith
coordinate v_S(i) equal to the number of times that the ith element appears in S, the above
definition becomes
$$J(S, T) = J(v_S, v_T) = \frac{\sum_i \min(v_S(i), v_T(i))}{\sum_i \max(v_S(i), v_T(i))}.$$
This expression is undefined if S and T are both the empty (multi)set, in which case we can
define the Jaccard distance 1 − J(S, T) to be 0 (equivalently, the similarity to be 1).
Jaccard similarity works quite well in practice, especially for sparse data. For example,
if we represent documents in terms of the multiset of words they contain, then the Jaccard
similarity between two documents is often a reasonable measure of their similarity. Simi-
larly, to estimate similarity between individuals, an online marketplace like Amazon might
represent people as multisets of items purchased, movies reviewed, etc., and use Jaccard
similarity.
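As an illustration, here is a short Python sketch of the multiset Jaccard similarity applied
to two toy documents represented as bags of words (the documents and the helper name are
made up for this example; the convention of returning 1 for two empty multisets matches the
choice above):

    from collections import Counter

    def jaccard(s, t):
        """Multiset Jaccard similarity of two iterables (e.g. the words of two documents)."""
        vs, vt = Counter(s), Counter(t)
        keys = set(vs) | set(vt)
        num = sum(min(vs[k], vt[k]) for k in keys)
        den = sum(max(vs[k], vt[k]) for k in keys)
        return 1.0 if den == 0 else num / den   # convention for two empty (multi)sets

    doc1 = "the cat sat on the mat".split()
    doc2 = "the cat sat on the hat".split()
    print(jaccard(doc1, doc2))   # min-counts sum to 5, max-counts sum to 7: 5/7 ≈ 0.714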
More generally, we can define a whole family of distance measures for points in R^d: for
any p ≥ 1, the ℓ_p distance is defined as
$$\|x - y\|_p = \left( \sum_{i=1}^{d} |x(i) - y(i)|^p \right)^{1/p}.$$
If p = 1 we get the “Manhattan” distance, and p = 2 gives the usual Euclidean distance. As
p grows, ‖x − y‖_p depends more and more on the coordinate with the maximal difference,
with the ℓ_∞ distance simply being defined as max_i |x(i) − y(i)|.
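For instance, here is a small numpy illustration (the specific vectors are arbitrary) of how,
as p grows, the ℓ_p distance approaches the largest coordinate-wise difference:

    import numpy as np

    x = np.array([1.0, 3.0, -2.0])
    y = np.array([4.0, 3.0,  2.0])    # coordinate-wise differences: 3, 0, 4

    for p in [1, 2, 3, np.inf]:       # ord=np.inf computes max_i |x(i) - y(i)|
        print(p, np.linalg.norm(x - y, ord=p))
    # prints 7.0, 5.0, ~4.50, and 4.0: approaching the maximal difference of 4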
Note that the ℓ_2 (Euclidean) distance is rotationally invariant, whereas ℓ_p for p ≠ 2 is not
invariant to rotations of the space. One consequence of this fact is that if you are using the
ℓ_p distance with p ≠ 2, you should make sure that the coordinates of your space actually
mean something: it does not make much sense to use a distance metric that depends on your
choice of coordinate axes if that choice of axes is arbitrary.
3 kd-Trees
A kd-tree, originally proposed by Bentley (a Stanford undergrad) in 1975, is a space
partitioning datastructure that allows one to very quickly find the closest point in a dataset
to a given “query” point. kd-trees perform extremely well when the dimensionality (or
effective dimensionality) of the data is not too large—usually people say that it works well
if the dimensionality of the space is less than 20, or if the number of points is at least 2^d,
where d is the dimensionality of the points. There are many variants of kd-trees, and you
should think of this as a general framework for designing such a datastructure, though the
specifics can be fruitfully tailored to individual datasets and applications.
The high-level idea is to build a binary search tree that partitions space. Edges of
the tree will correspond to subsets of space, and each node v in the tree will have two
data fields: the index of some dimension i_v, and a value m_v. Let S_v denote the subset of
space corresponding to the edge going into node v, and let S_<, S_> denote the subsets
of space corresponding to the two outgoing edges of v. These subsets are defined as
S_< = {x ∈ S_v : x(i_v) < m_v} and S_> = {x ∈ S_v : x(i_v) ≥ m_v}.
We build this tree as follows:
Algorithm 1: kd-Tree Construction
Given a pair [S, v], where S = x_1, . . . , x_n is a set of points with x_i ∈ R^d,
corresponding to node v in a partially built kd-tree:
• If n = 1, then store that point in the node v; v will now be a leaf of the tree.
• Otherwise, choose a splitting dimension i_v (for example, by cycling through the
dimensions according to the depth of v), set m_v to be the median of the i_v-th
coordinates of the points in S, store (i_v, m_v) in v, and partition S into S_< and S_>
as defined above.
• Make two children v_< and v_> of v, and recurse on [S_<, v_<] and [S_>, v_>].
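For concreteness, here is a minimal Python sketch of this construction. The Node class is
hypothetical, and two details are choices the text leaves open: the splitting dimension i_v
cycles through the coordinates by depth, and the split is taken at the median rank (so ties
on the splitting coordinate may land in either child, which does not affect the search
described below).

    import numpy as np

    class Node:
        """Either a leaf storing a single point, or an internal node storing a
        splitting dimension `dim` and threshold `value`, with two children."""
        def __init__(self, point=None, dim=None, value=None, left=None, right=None):
            self.point, self.dim, self.value = point, dim, value
            self.left, self.right = left, right

    def build_kdtree(points, depth=0):
        """Recursively build a kd-tree over an (n x d) array of points."""
        points = np.asarray(points, dtype=float)
        n, d = points.shape
        if n == 1:                                   # base case: a leaf holding the point
            return Node(point=points[0])
        dim = depth % d                              # i_v: cycle through coordinates by depth
        points = points[np.argsort(points[:, dim])]  # sort along that coordinate ...
        mid = n // 2
        value = points[mid, dim]                     # ... and split at the median rank (m_v)
        return Node(dim=dim, value=value,
                    left=build_kdtree(points[:mid], depth + 1),
                    right=build_kdtree(points[mid:], depth + 1))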
Note that the size of the datastructure is linear in the size of the initial point set. To add
a new point x to the structure, one simply goes down the tree, comparing the relevant
coordinates of x to the stored medians until one reaches a leaf, at which point one splits
that leaf into two children. The tree will initially be balanced (because we are using the
medians of the coordinate values), and hence will have depth O(log n).
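Continuing the hypothetical sketch above (the Node class is the one defined there), adding a
point might look as follows; splitting the leaf along the next coordinate in the cycle is
again just one possible choice:

    import numpy as np  # Node is the class from the sketch after Algorithm 1

    def insert(root, x):
        """Add a new point x to an existing kd-tree of Node objects (illustrative sketch)."""
        x = np.asarray(x, dtype=float)
        node, depth = root, 0
        while node.point is None:                  # walk down to the leaf where x belongs
            node = node.right if x[node.dim] >= node.value else node.left
            depth += 1
        p = node.point                             # the single point stored at this leaf
        dim = depth % len(x)                       # split the leaf along the next coordinate
        node.dim, node.value = dim, max(p[dim], x[dim])
        small, large = (p, x) if p[dim] <= x[dim] else (x, p)
        node.left, node.right = Node(point=small), Node(point=large)
        node.point = None                          # the old leaf is now an internal node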
Given a query point q, if we want to find the closest point to q in our kd-tree, we will
first go down the tree and find the leaf in which q would end up. We then go back up the
tree, at each juncture asking “is it possible that the closest point to q would have ended up
down the other path?”. In low dimensions, the answer to this question will often be “no”,
and the search will be efficient. For example, in 1 dimension, the leaf node corresponding
to q will always contain the closest point to q. [Think about why this is the case!] In high
dimensions, we might end up needing to explore many/all leaves of the tree, which is why
kd-trees are ill-suited to very high dimensional data.
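A recursive implementation of this search, continuing the hypothetical sketch above (it
reuses the Node class and build_kdtree, and assumes the ℓ_2 distance), might look like:

    import numpy as np  # Node and build_kdtree as in the sketch after Algorithm 1

    def nearest(node, q, best=None, best_dist=float("inf")):
        """Return (closest point to q found so far, its distance)."""
        if node.point is not None:                       # leaf: compare q to the stored point
            dist = np.linalg.norm(node.point - q)
            return (node.point, dist) if dist < best_dist else (best, best_dist)
        # First descend into the side of the split that q itself falls into ...
        near, far = ((node.left, node.right) if q[node.dim] < node.value
                     else (node.right, node.left))
        best, best_dist = nearest(near, q, best, best_dist)
        # ... then, on the way back up, check whether a closer point could lie on
        # the other side of the splitting hyperplane x(dim) = value.
        if abs(q[node.dim] - node.value) < best_dist:
            best, best_dist = nearest(far, q, best, best_dist)
        return best, best_dist

    # Sanity check against brute force on random data.
    rng = np.random.default_rng(0)
    pts, q = rng.standard_normal((1000, 3)), rng.standard_normal(3)
    tree = build_kdtree(pts)
    p, dist = nearest(tree, q)
    assert np.isclose(dist, np.linalg.norm(pts - q, axis=1).min())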
4 The Curse of Dimensionality
Example 4.1 What is the largest number of points that can fit in d-dimensional space with
the property that all pairwise distances lie in the interval [0.75, 1]?
• d = 1: At most 2 points have this property...if you try to fit a third point, at least one
of the 3 pairwise distances will be off.
• d = 2: At most 3 points have this property...if you try to fit a fourth point, at least
one of the 6 pairwise distances will be off.
• In general, you will be able to fit an exponential number of points (a quick calculation
shows that a random set of exp(√d) points will satisfy this property with high probability);
see the numerical sketch below.
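As a small numerical sanity check of this last claim (not the full exp(√d) construction), the
following sketch samples a modest number of random points on a sphere in high dimension;
the dimension, number of points, and radius are choices made for illustration. In high
dimensions the pairwise distances concentrate sharply around a single value, so they all
land in [0.75, 1].

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    d, n = 500, 200                    # dimension and number of points (illustrative choices)

    # Points drawn uniformly from the sphere of radius 0.875 / sqrt(2): for two such random
    # points the distance concentrates around 0.875, the midpoint of [0.75, 1].
    X = rng.standard_normal((n, d))
    X *= (0.875 / np.sqrt(2)) / np.linalg.norm(X, axis=1, keepdims=True)

    dists = pdist(X)                   # all n*(n-1)/2 pairwise Euclidean distances
    print(dists.min(), dists.max())    # both comfortably inside [0.75, 1] for these choices
    assert np.all((0.75 <= dists) & (dists <= 1.0))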
5 Take-Home Messages
Below are the highest-level take-home points from this lecture.
• Similarity search is a fundamental problem with two common variants: 1) given a set of
points and a distance metric, how can we quickly find pairs or sets of similar datapoints?
2) Given a set of points S and a distance metric, how do we store the points so that, given
a new “query” point q, we can efficiently find the closest point (or a close point) to q in
the set S?
• There are many different useful distance metrics: Jaccard similarity, Euclidean distance,
ℓ_1/Manhattan distance, cosine similarity, edit distance, etc. Different distance metrics
have very different properties. In any application, it is worth thinking about which distance
metric is most appropriate.
• One datastructure that enables an efficient similarity search is a kd-tree: a binary tree
that partitions space.
• kd-trees work well for low-dimensional settings. The usual rule of thumb is that if the
dimension d is less than 20, or if the number of points is larger than 2^d, then kd-trees
will be a reasonable datastructure.
• (Might cover during lecture 4) The curse of dimensionality: high dimensional spaces
often have strange properties (e.g. kissing number exponential in dimension), and many
geometric algorithms have a runtime that scales exponentially with the dimension.