(PR 2024) Lec14 Unsupervised Learning II
Email: dkhattab@eelu.edu.eg
Given the following three documents d1, d2, and d3, calculate the TF-IDF feature
vector for each document:
• d1 - Music is a universal language
• d2 - Music is a miracle
• d3 - Music is a universal feature of the human experience
Then, based on the cosine-similarity metric, show which two documents are the
most similar.
Solution:
• TF = raw count of the word in the document
• IDF = log10(#documents / #documents containing the word)
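Written out (a restatement of the definitions above, plus the cosine similarity the question asks for but the slides never define; the log is base 10, matching the table values):

```latex
% tf-idf weight of word w in document d, over N documents,
% with df(w) = number of documents containing w:
\mathrm{tfidf}(w,d) = \mathrm{tf}(w,d)\cdot\log_{10}\frac{N}{\mathrm{df}(w)}
% cosine similarity between tf-idf vectors x and y:
\cos(\mathbf{x},\mathbf{y}) = \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert}
```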
Word        TF d1  TF d2  TF d3  IDF    TF-IDF d1  TF-IDF d2  TF-IDF d3
music       1      1      1      0      0          0          0
is          1      1      1      0      0          0          0
a           1      1      1      0      0          0          0
universal   1      0      1      0.176  0.176      0          0.176
language    1      0      0      0.477  0.477      0          0
miracle     0      1      0      0.477  0          0.477      0
feature     0      0      1      0.477  0          0          0.477
of          0      0      1      0.477  0          0          0.477
the         0      0      1      0.477  0          0          0.477
human       0      0      1      0.477  0          0          0.477
experience  0      0      1      0.477  0          0          0.477
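The slides stop before the cosine-similarity step, so here is a minimal sketch in Python (not from the lecture; all names are illustrative) that rebuilds the table and compares the three pairs:

```python
import math

docs = {
    "d1": "music is a universal language".split(),
    "d2": "music is a miracle".split(),
    "d3": "music is a universal feature of the human experience".split(),
}

N = len(docs)
vocab = sorted({w for words in docs.values() for w in words})

def idf(word):
    # log10(#documents / #documents containing the word)
    df = sum(word in words for words in docs.values())
    return math.log10(N / df)

def tfidf(words):
    # TF = raw word count, weighted by IDF; one entry per vocabulary word
    return [words.count(w) * idf(w) for w in vocab]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

vecs = {name: tfidf(words) for name, words in docs.items()}
for a, b in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]:
    print(a, b, round(cosine(vecs[a], vecs[b]), 3))
```

Only d1 and d3 share a non-zero weighted term ("universal"), giving cos(d1, d3) ≈ 0.056 while the other two pairs score 0, so d1 and d3 are the most similar.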
1-NN search for retrieval
• Space of all articles, organized by similarity of text:
– Compute distances to all docs
– Retrieve the “nearest neighbor”
[Figure sequence: a query article placed in the space of all articles, distances computed to every document, and the nearest neighbor highlighted]
Complexity of brute-force search
• Given a query point, scan through every point (a sketch follows below):
– O(N) distance computations per 1-NN query!
– O(N log k) per k-NN query!
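A minimal sketch of the brute-force scan, assuming Euclidean distance (the function name is illustrative):

```python
import math

def nearest_neighbor(query, points):
    """Brute-force 1-NN: one distance computation per point, O(N) total."""
    best, best_dist = None, math.inf
    for p in points:
        d = math.dist(query, p)  # Euclidean distance
        if d < best_dist:
            best, best_dist = p, d
    return best, best_dist
```

The O(N log k) bound for k-NN comes from the same scan while maintaining a heap of the k closest points seen so far, at O(log k) per heap update.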
KD-trees
• Structured organization of documents
– Recursively partitions points into axis-aligned
boxes
• Enables more efficient pruning of the search space
• Works “well” in low-to-medium dimensions
KD-tree construction
• Start with a list of d-dimensional points
• Split the points into 2 groups
• Recurse on each group separately
• Continue splitting points at each set
– Creates a binary tree structure
• Each leaf node contains a list of points
KD-tree construction choices
• Use heuristics to make splitting decisions (see the sketch below):
– Which dimension do we split along?
The widest one (the one with the highest variance)
– When do we stop?
When fewer than m points are left, or the box hits a minimum width
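A minimal construction sketch under those heuristics (build_kdtree and the dict-based node layout are illustrative, not from the lecture; the split value is the median along the widest dimension):

```python
def build_kdtree(points, m=3):
    """Recursively partition points into axis-aligned boxes.

    Splits along the widest dimension at the median point and stops once
    a box holds at most m points (the minimum-width test is omitted here).
    """
    if len(points) <= m:
        return {"leaf": True, "points": points}
    dims = len(points[0])
    # Widest dimension = largest spread of coordinate values.
    spreads = [max(p[d] for p in points) - min(p[d] for p in points)
               for d in range(dims)]
    axis = max(range(dims), key=lambda d: spreads[d])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "leaf": False,
        "axis": axis,                # splitting dimension
        "value": points[mid][axis],  # splitting threshold
        "left": build_kdtree(points[:mid], m),
        "right": build_kdtree(points[mid:], m),
    }
```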
Nearest neighbor with KD-trees
• Traverse the tree looking for the nearest neighbor to the query point:
1. Start by exploring the leaf node containing the query point
2. Compute the distance to each other point at the leaf node
3. Backtrack and try the other branch at each node visited
• Use the distance bound and the bounding box of each node to prune parts of the tree that cannot include the nearest neighbor (a sketch follows below)
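A minimal search sketch over the build_kdtree nodes above (again with illustrative names; it uses the query's distance to the splitting plane as the pruning bound, a simpler stand-in for the full bounding-box test):

```python
import math

def kdtree_nn(node, query, best=None, best_dist=math.inf):
    """1-NN search: descend to the query's leaf first, then backtrack."""
    if node["leaf"]:
        for p in node["points"]:
            d = math.dist(query, p)
            if d < best_dist:
                best, best_dist = p, d
        return best, best_dist
    axis, value = node["axis"], node["value"]
    near, far = ((node["left"], node["right"]) if query[axis] < value
                 else (node["right"], node["left"]))
    best, best_dist = kdtree_nn(near, query, best, best_dist)
    # Prune: only cross the splitting plane if the far side could
    # still contain a point closer than the current best.
    if abs(query[axis] - value) < best_dist:
        best, best_dist = kdtree_nn(far, query, best, best_dist)
    return best, best_dist

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(kdtree_nn(tree, (9, 2)))  # -> ((8, 1), 1.414...): the far branch is pruned
```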
Complexity
For (nearly) balanced binary trees...
• Construction
– Size: 2N−1 nodes if 1 data point at each leaf → O(N)
– Depth: O(log N)
– Construction time: O(N log N)
• 1-NN query
– Traverse down the tree to the starting leaf: O(log N)
– Maximum backtrack and traverse: O(N) in the worst case
– Complexity ranges from O(log N) to O(N) (i.e., brute force)
k-NN with KD-trees
• Exactly the same algorithm, but maintain the distance
to the furthest of the current k nearest neighbors (a sketch follows below)
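A sketch of that one change, assuming the same leaf layout as above (knn_leaf_scan is an illustrative name): the distance to the furthest of the current k neighbors replaces best_dist as the pruning bound.

```python
import heapq, math

def knn_leaf_scan(points, query, heap, k):
    """Scan a leaf, keeping a max-heap of the k nearest points seen so far.

    heap holds (-distance, point) pairs, so -heap[0][0] is the distance to
    the furthest of the current k neighbors: the k-NN pruning bound.
    """
    for p in points:
        d = math.dist(query, p)
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, p))
```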
Credit: “Machine Learning Specialization” (2015) by Emily Fox
& Carlos Guestrin, University of Washington.