CS168: The Modern Algorithmic Toolbox Lecture #3: Similarity Metrics and Kd-Trees
© 2015–2022, Tim Roughgarden and Gregory Valiant. Not to be sold, published, or distributed without the authors' consent.
1 Similarity Search
We begin with the basic problem of how to organize/represent a dataset so that similar items
can be found quickly. There are two slightly different settings in which one might want to
consider this question:
1. All the data is present, and one wants to find pairs or sets of similar items from within
the dataset.
2. There is a reference dataset that one has plenty of time to process cleverly, but when
we are given a new datapoint, we want to very quickly return a similar datapoint from
the reference dataset. This setting is sometimes referred to as the “nearest neighbor
search” setting.
In general, similar techniques/approaches are used for the above two settings, though it
is worth being aware of the different objectives.
There are many real-world applications of similarity search:
• Collaborative filtering: find similar products (based on whether the same set of peo-
ple purchased them) or individuals (based on purchase history, demographics, web
browsing behavior, etc.).
• Machine learning/classification: If we find two similar datapoints, they might have the
same label....
• Combining datasets: e.g., in astronomy, different telescopes take pictures of the same
portions of the sky, possibly in different wavelengths; it is very useful to automatically
aggregate these datasets.
• Super-fast labeling: e.g., at CERN, which smashes particles together and needs to very
quickly figure out which new particle traces/trajectories are “interesting” and worth
saving, and which trajectories correspond to boring/common particles.
2 Measures of Similarity
Before talking about algorithms for finding similar objects, we should begin by considering
several quantifications of what “similarity” means.
A natural first measure is the Jaccard similarity: for two sets S and T, it is the size of
their intersection divided by the size of their union,
$$J(S, T) = \frac{|S \cap T|}{|S \cup T|}.$$
Equivalently, if we represent our sets (or multisets) S, T as vectors v_S, v_T, with the ith
coordinate v_S(i) equal to the number of times that the ith element appears in S, the above
definition becomes
$$J(S, T) = J(v_S, v_T) = \frac{\sum_i \min(v_S(i), v_T(i))}{\sum_i \max(v_S(i), v_T(i))}.$$
This expression is undefined if S and T are both the empty (multi)set, in which case we can
define the Jaccard distance 1 − J(S, T) to be 0 (equivalently, the similarity to be 1).
Jaccard similarity works quite well in practice, especially for sparse data. For example,
if we represent documents in terms of the multiset of words they contain, then the Jaccard
similarity between two documents is often a reasonable measure of their similarity. Simi-
larly, to estimate similarity between individuals, an online marketplace like Amazon might
represent people as multisets of items purchased, movies reviewed, etc., and use Jaccard
similarity.
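As an illustration, here is a short Python sketch of the multiset Jaccard similarity applied
to two toy documents represented as bags of words (the documents and the helper name are
made up for this example; the convention of returning 1 for two empty multisets matches the
choice above):

    from collections import Counter

    def jaccard(s, t):
        """Multiset Jaccard similarity of two iterables (e.g. the words of two documents)."""
        vs, vt = Counter(s), Counter(t)
        keys = set(vs) | set(vt)
        num = sum(min(vs[k], vt[k]) for k in keys)
        den = sum(max(vs[k], vt[k]) for k in keys)
        return 1.0 if den == 0 else num / den   # convention for two empty (multi)sets

    doc1 = "the cat sat on the mat".split()
    doc2 = "the cat sat on the hat".split()
    print(jaccard(doc1, doc2))   # min-counts sum to 5, max-counts sum to 7: 5/7 ≈ 0.714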
More generally, we can define a whole family of distance measures for points in R^d: for
any p ≥ 1, the ℓ_p distance is defined as
$$\|x - y\|_p = \left( \sum_{i=1}^{d} |x(i) - y(i)|^p \right)^{1/p}.$$
If p = 1 we get the “Manhattan” distance, and p = 2 gives the usual Euclidean distance. As
p grows, ‖x − y‖_p depends more and more on the coordinate with the maximal difference,
with the ℓ_∞ distance simply being defined as max_i |x(i) − y(i)|.
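For instance, here is a small numpy illustration (the specific vectors are arbitrary) of how,
as p grows, the ℓ_p distance approaches the largest coordinate-wise difference:

    import numpy as np

    x = np.array([1.0, 3.0, -2.0])
    y = np.array([4.0, 3.0,  2.0])    # coordinate-wise differences: 3, 0, 4

    for p in [1, 2, 3, np.inf]:       # ord=np.inf computes max_i |x(i) - y(i)|
        print(p, np.linalg.norm(x - y, ord=p))
    # prints 7.0, 5.0, ~4.50, and 4.0: approaching the maximal difference of 4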
Note that the ℓ_2 (Euclidean) distance is rotationally invariant, whereas ℓ_p for p ≠ 2 is not
invariant to rotations of the space. One consequence of this fact is that if you are using the
ℓ_p distance with p ≠ 2, you should make sure that the coordinates of your space actually
mean something: it does not make much sense to use a distance metric that depends on your
choice of coordinate axes if that choice of axes is arbitrary.
3 kd-Trees
A kd-tree, originally proposed by Bentley (a Stanford undergrad) in 1975, is a space
partitioning datastructure that allows one to very quickly find the closest point in a dataset
to a given “query” point. kd-trees perform extremely well when the dimensionality (or
effective dimensionality) of the data is not too large—usually people say that it works well
if the dimensionality of the space is less than 20, or if the number of points is at least 2^d,
where d is the dimensionality of the points. There are many variants of kd-trees, and you
should think of this as a general framework for designing such a datastructure, though the
specifics can be fruitfully tailored to individual datasets and applications.
The high-level idea is to build a binary search tree that partitions space. Edges of
the tree will correspond to subsets of space, and each node v in the tree will have two
data fields: the index of some dimension i_v, and a value m_v. Let S_v denote the subset of
space corresponding to the edge going into node v, and let S_<, S_> denote the subsets
of space corresponding to the two outgoing edges of v. These subsets are defined as
S_< = {x ∈ S_v : x(i_v) < m_v} and S_> = {x ∈ S_v : x(i_v) ≥ m_v}.
We build this tree as follows:
Algorithm 1: kd-Tree Construction
Given a pair [S, v], where S = x_1, . . . , x_n is a set of points with x_i ∈ R^d,
corresponding to node v in a partially built kd-tree:
• If n = 1, then store that point in the node v; v will now be a leaf of the tree.
• Otherwise, choose a splitting dimension i_v (for example, by cycling through the
dimensions according to the depth of v), set m_v to be the median of the i_v-th
coordinates of the points in S, store (i_v, m_v) in v, and partition S into S_< and S_>
as defined above.
• Make two children v_< and v_> of v, and recurse on [S_<, v_<] and [S_>, v_>].
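For concreteness, here is a minimal Python sketch of this construction. The Node class is
hypothetical, and two details are choices the text leaves open: the splitting dimension i_v
cycles through the coordinates by depth, and the split is taken at the median rank (so ties
on the splitting coordinate may land in either child, which does not affect the search
described below).

    import numpy as np

    class Node:
        """Either a leaf storing a single point, or an internal node storing a
        splitting dimension `dim` and threshold `value`, with two children."""
        def __init__(self, point=None, dim=None, value=None, left=None, right=None):
            self.point, self.dim, self.value = point, dim, value
            self.left, self.right = left, right

    def build_kdtree(points, depth=0):
        """Recursively build a kd-tree over an (n x d) array of points."""
        points = np.asarray(points, dtype=float)
        n, d = points.shape
        if n == 1:                                   # base case: a leaf holding the point
            return Node(point=points[0])
        dim = depth % d                              # i_v: cycle through coordinates by depth
        points = points[np.argsort(points[:, dim])]  # sort along that coordinate ...
        mid = n // 2
        value = points[mid, dim]                     # ... and split at the median rank (m_v)
        return Node(dim=dim, value=value,
                    left=build_kdtree(points[:mid], depth + 1),
                    right=build_kdtree(points[mid:], depth + 1))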
Note that the size of the datastructure is linear in the size of the initial point set. To add
a new point x to the structure, one simply goes down the tree, comparing the relevant
coordinates of x to the stored medians until one reaches a leaf, at which point one splits
that leaf into two children. The tree will initially be balanced (because we are using the
medians of the coordinate values), and hence will have depth O(log n).
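Continuing the hypothetical sketch above (the Node class is the one defined there), adding a
point might look as follows; splitting the leaf along the next coordinate in the cycle is
again just one possible choice:

    import numpy as np  # Node is the class from the sketch after Algorithm 1

    def insert(root, x):
        """Add a new point x to an existing kd-tree of Node objects (illustrative sketch)."""
        x = np.asarray(x, dtype=float)
        node, depth = root, 0
        while node.point is None:                  # walk down to the leaf where x belongs
            node = node.right if x[node.dim] >= node.value else node.left
            depth += 1
        p = node.point                             # the single point stored at this leaf
        dim = depth % len(x)                       # split the leaf along the next coordinate
        node.dim, node.value = dim, max(p[dim], x[dim])
        small, large = (p, x) if p[dim] <= x[dim] else (x, p)
        node.left, node.right = Node(point=small), Node(point=large)
        node.point = None                          # the old leaf is now an internal node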
Given a query point q, if we want to find the closest point to q in our kd-tree, we will
first go down the tree and find the leaf in which q would end up. We then go back up the
tree, at each juncture asking “is it possible that the closest point to q would have ended up
down the other path?”. In low dimensions, the answer to this question will often be “no”,
and the search will be efficient. For example, in 1 dimension, the leaf node corresponding
to q will always contain the closest point to q. [Think about why this is the case!] In high
dimensions, we might end up needing to explore many/all leaves of the tree, which is why
kd-trees are ill-suited to very high dimensional data.
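A recursive implementation of this search, continuing the hypothetical sketch above (it
reuses the Node class and build_kdtree, and assumes the ℓ_2 distance), might look like:

    import numpy as np  # Node and build_kdtree as in the sketch after Algorithm 1

    def nearest(node, q, best=None, best_dist=float("inf")):
        """Return (closest point to q found so far, its distance)."""
        if node.point is not None:                       # leaf: compare q to the stored point
            dist = np.linalg.norm(node.point - q)
            return (node.point, dist) if dist < best_dist else (best, best_dist)
        # First descend into the side of the split that q itself falls into ...
        near, far = ((node.left, node.right) if q[node.dim] < node.value
                     else (node.right, node.left))
        best, best_dist = nearest(near, q, best, best_dist)
        # ... then, on the way back up, check whether a closer point could lie on
        # the other side of the splitting hyperplane x(dim) = value.
        if abs(q[node.dim] - node.value) < best_dist:
            best, best_dist = nearest(far, q, best, best_dist)
        return best, best_dist

    # Sanity check against brute force on random data.
    rng = np.random.default_rng(0)
    pts, q = rng.standard_normal((1000, 3)), rng.standard_normal(3)
    tree = build_kdtree(pts)
    p, dist = nearest(tree, q)
    assert np.isclose(dist, np.linalg.norm(pts - q, axis=1).min())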
4 The Curse of Dimensionality
Example 4.1 What is the largest number of points that can fit in d-dimensional space with
the property that all pairwise distances lie in the interval [0.75, 1]?
• d = 1: At most 2 points have this property...if you try to fit a third point, at least one
of the 3 pairwise distances will be off.
• d = 2: At most 3 points have this property...if you try to fit a fourth point, at least
one of the 6 pairwise distances will be off.
• In general, you will be able to fit an exponential number of points (a quick calculation
shows that a random set of exp(√d) points will satisfy this property with high probability);
see the numerical sketch below.
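As a small numerical sanity check of this last claim (not the full exp(√d) construction), the
following sketch samples a modest number of random points on a sphere in high dimension;
the dimension, number of points, and radius are choices made for illustration. In high
dimensions the pairwise distances concentrate sharply around a single value, so they all
land in [0.75, 1].

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    d, n = 500, 200                    # dimension and number of points (illustrative choices)

    # Points drawn uniformly from the sphere of radius 0.875 / sqrt(2): for two such random
    # points the distance concentrates around 0.875, the midpoint of [0.75, 1].
    X = rng.standard_normal((n, d))
    X *= (0.875 / np.sqrt(2)) / np.linalg.norm(X, axis=1, keepdims=True)

    dists = pdist(X)                   # all n*(n-1)/2 pairwise Euclidean distances
    print(dists.min(), dists.max())    # both comfortably inside [0.75, 1] for these choices
    assert np.all((0.75 <= dists) & (dists <= 1.0))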
5 Take-Home Messages
Below are the highest-level take-home points from this lecture.
• Similarity search is a fundamental problem with two common variants: 1) given a set of
points and a distance metric, how can we quickly find pairs or sets of similar datapoints?
2) Given a set of points S and a distance metric, how do we store the points so that, given
a new “query” point q, we can efficiently find the closest point (or a close point) to q in
the set S?
• There are many different useful distance metrics: Jaccard similarity, Euclidean distance,
ℓ_1/Manhattan distance, cosine similarity, edit distance, etc. Different distance metrics
have very different properties. In any application, it is worth thinking about which distance
metric is most appropriate.
• One datastructure that enables an efficient similarity search is a kd-tree: a binary tree
that partitions space.
• kd-trees work well for low-dimensional settings. The usual rule of thumb is that if the
dimension d is less than 20, or if the number of points is larger than 2^d, then kd-trees
will be a reasonable datastructure.
• (Might cover during lecture 4) The curse of dimensionality: high dimensional spaces
often have strange properties (e.g. kissing number exponential in dimension), and many
geometric algorithms have a runtime that scales exponentially with the dimension.