
Pattern Recognition

Lecture 14: Unsupervised Learning II


(Document Retrieval)
Instructor: Dr. Dina Khattab
Faculty of Computer & Information Sciences (FCIS) - Ain Shams University
Email: dkhattab@eelu.edu.eg
Office Hours: Wednesday 7:00 PM to 9:00 PM


Agenda
• Document retrieval
  – Practice on document similarity
  – Search complexity of nearest neighbor (NN)
• KD-tree
Given the following three documents d1, d2 and d3, calculate the TF-IDF feature
vector for each document:
• d1 - Music is a universal language
• d2 - Music is a miracle
• d3 - Music is a universal feature of the human experience
Then, based on the cosine similarity metric, show which two documents are the
most similar.
Solution:
• TF = word count in the document
• IDF = log10(# docs / # docs containing the word)

Word        TF d1  TF d2  TF d3  IDF    TF-IDF d1  TF-IDF d2  TF-IDF d3
music       1      1      1      0      0          0          0
is          1      1      1      0      0          0          0
a           1      1      1      0      0          0          0
universal   1      0      1      0.176  0.176      0          0.176
language    1      0      0      0.477  0.477      0          0
miracle     0      1      0      0.477  0          0.477      0
feature     0      0      1      0.477  0          0          0.477
of          0      0      1      0.477  0          0          0.477
the         0      0      1      0.477  0          0          0.477
human       0      0      1      0.477  0          0          0.477
experience  0      0      1      0.477  0          0          0.477

• Cosine similarity (d1, d2) = 0
• Cosine similarity (d1, d3) = 0.031 / (0.508 × 1.081) ≈ 0.0563
• Cosine similarity (d2, d3) = 0
  (d1–d2 and d2–d3 overlap only in words with IDF 0, so their dot products vanish.)
• So d1 and d3 are the most similar documents.
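The same computation can be scripted. Below is a minimal Python sketch of the
worked example, assuming the lecture's conventions (TF = raw word count, IDF in
base-10 log, which matches the 0.477 = log10(3) entries in the table); all
variable names are illustrative.

import math

docs = {
    "d1": "music is a universal language",
    "d2": "music is a miracle",
    "d3": "music is a universal feature of the human experience",
}

tokenized = {name: text.split() for name, text in docs.items()}
vocab = sorted({w for words in tokenized.values() for w in words})
n_docs = len(docs)

def tfidf(words):
    vec = []
    for w in vocab:
        tf = words.count(w)                                   # TF = raw word count
        df = sum(w in other for other in tokenized.values())  # docs containing w
        vec.append(tf * math.log10(n_docs / df))              # IDF = log10(N / df)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

vectors = {name: tfidf(words) for name, words in tokenized.items()}
print(cosine(vectors["d1"], vectors["d3"]))   # ≈ 0.056: d1, d3 most similar
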
Document retrieval
• You are currently reading an article you like
• Goal: find similar articles
1-NN search for retrieval
• Space of all articles, organized by similarity of text
• Compute distances to all docs
• Retrieve the “nearest neighbor”
Complexity of brute-force search
• Given a query point, scan through each point
  – O(N) distance computations per 1-NN query!
  – O(N log k) per k-NN query!
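For reference, a brute-force 1-NN query is just a linear scan; the short Python
sketch below (illustrative names, Euclidean distance assumed) makes the O(N)
cost explicit: one distance computation per stored document.

import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def brute_force_1nn(query, corpus):
    # corpus: list of (doc_id, feature_vector) pairs; one distance per item -> O(N)
    return min(corpus, key=lambda item: euclidean(query, item[1]))
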
KD-trees
• Structured organization of documents
  – Recursively partitions points into axis-aligned boxes
• Enables more efficient pruning of the search space
• Works “well” in “low-medium” dimensions
KD-tree construction
• Start with a list of d-dimensional points
• Split points into 2 groups
• Recurse on each group separately
• Continue splitting points at each set
  – Creates a binary tree structure
• Each leaf node contains a list of points
KD-tree construction choices
• Use heuristics to make splitting decisions (see the sketch below):
  – Which dimension do we split along?
    The widest one (the one with the highest variance)
  – Which value do we split at?
    The median, or the center of the box
  – When do we stop?
    When fewer than m points are left, or when the box hits a minimum width
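A minimal Python sketch of these heuristics, assuming a simple list-of-tuples
point representation: split along the widest dimension, split at the median,
and stop when fewer than m points remain (the minimum-width check is omitted
for brevity). The Node layout is an illustration, not taken from the slides.

class Node:
    def __init__(self, points=None, dim=None, value=None, left=None, right=None):
        self.points = points               # only set at leaf nodes
        self.dim, self.value = dim, value  # splitting dimension and value
        self.left, self.right = left, right

def build_kdtree(points, m=3):
    if len(points) < m:                    # stop: fewer than m points left
        return Node(points=points)
    n_dims = len(points[0])
    # widest dimension = largest spread of coordinate values
    dim = max(range(n_dims),
              key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    points = sorted(points, key=lambda p: p[dim])
    mid = len(points) // 2                 # split at the median
    return Node(dim=dim, value=points[mid][dim],
                left=build_kdtree(points[:mid], m),
                right=build_kdtree(points[mid:], m))
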
Nearest neighbor with KD-trees
• Traverse the tree looking for the nearest neighbor to the query point:
  1. Start by exploring the leaf node containing the query point
  2. Compute the distance to each other point at the leaf node
  3. Backtrack and try the other branch at each node visited
• Use the distance bound and the bounding box of each node to prune parts of
  the tree that cannot include the nearest neighbor (see the sketch below)
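Putting the three steps and the pruning rule together: a minimal Python sketch
that builds on the Node/build_kdtree sketch above. For pruning it uses the
distance to the splitting plane, a common simplification of the full
bounding-box test described in the slides.

import math

def nearest_neighbor(node, query, best=None):
    # best is a (distance, point) pair carried through the recursion
    if node.points is not None:                   # leaf: scan its points
        for p in node.points:
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p)))
            if best is None or d < best[0]:
                best = (d, p)
        return best
    # step 1: descend first into the side of the split holding the query
    near, far = ((node.left, node.right) if query[node.dim] <= node.value
                 else (node.right, node.left))
    best = nearest_neighbor(near, query, best)
    # step 3: backtrack, crossing the split only if it could hold a closer point
    if abs(query[node.dim] - node.value) < best[0]:
        best = nearest_neighbor(far, query, best)
    return best
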
Complexity
For (nearly) balanced binary trees...
• Construction
  – Size: 2N-1 nodes if 1 data point at each leaf → O(N)
  – Depth: O(log N)
  – Construction time: O(N log N)
• 1-NN query
  – Traverse down tree to starting point: O(log N)
  – Maximum backtrack and traverse: O(N) in the worst case
  – Complexity range: O(log N) → O(N) (brute force)
• Search time can grow exponentially as the feature dimension increases
k-NN with KD-trees
• Exactly the same algorithm, but maintain the distance to the furthest of the
  current k nearest neighbors
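A small sketch of that bookkeeping, assuming Python's heapq: keep a size-k
max-heap (via negated distances) so the furthest of the current k neighbors is
always at the root and available as the pruning bound.

import heapq, math

def update_k_best(heap, point, query, k):
    # heap holds (-distance, point); the root is the furthest of the k best
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point)))
    if len(heap) < k:
        heapq.heappush(heap, (-d, point))
    elif d < -heap[0][0]:                     # closer than the current furthest
        heapq.heapreplace(heap, (-d, point))  # evict the furthest, keep size k
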
Credits
“Machine Learning Specialization” (2015) by Emily Fox & Carlos Guestrin –
University of Washington.
“Machine Learning” by Andrew Ng – Stanford University.