
Natural Language Processing

Vector Space Model and Information Retrieval

Felipe Bravo-Marquez

March 22, 2020


Motivation

• How does a search engine such as DuckDuckGo or Google retrieve relevant documents for a given query?
• How can a company process the complaints left by its users on its Web portals?
These problems are studied in the following fields:
• Information Retrieval: science of searching for information
in document collections.
• Text Mining: automatic extraction of knowledge from text.
Both of them are closely related to NLP! (the borders between
these fields are unclear).
Tokens and Types
Tokenization: the task of splitting a sentence or document into pieces called
tokens.
Additional transformations can be employed such as the removal of special
characters (e.g., punctuation), lowercasing, etc. [Manning et al., 2008].
Example
Input: I like human languages and programming languages.
Tokens: [I] [like] [human] [languages] [and] [programming] [languages]
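A minimal sketch of such a tokenizer in Python (a simple regular expression; real tokenizers such as those in NLTK or spaCy handle many more cases, e.g., contractions and hyphenation):

import re

def tokenize(text):
    # Keep runs of letters and apostrophes; drop punctuation such as the final period.
    return re.findall(r"[A-Za-z']+", text)

print(tokenize("I like human languages and programming languages."))
# ['I', 'like', 'human', 'languages', 'and', 'programming', 'languages']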

Types
• A type is the class of all tokens containing the same sequence of characters.
• They are obtained by identifying unique tokens within the document.

Types for the previous sentence: [I] [like] [human] [languages] [and]
[programming]

The token languages was repeated in the sentence.


Vocabulary Extraction
• A term is a normalized type.
• Normalization is the process of creating equivalence classes of different
types. This will become clear in the following slides.
• The vocabulary V is the set of terms (normalized unique tokens) within a collection of documents or corpus D.

Stopwords removal
• In order to reduce the size of the vocabulary and eliminate terms that do
not provide much information, terms that occur with high frequency in
the corpus are eliminated.
• These terms are called stopwords and include articles, pronouns,
prepositions and conjunctions.
Example: [a, an, and, any, has, do, don’t, did, the, on].1

The removal of stopwords can be inconvenient in many NLP tasks!!


Example: I don’t like pizza => pizza ( “I”, “don’t”, and “like” were removed)

1 Related concepts: function words, closed-class words.
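As a rough illustration, stopword removal can be a simple set lookup. The list below is a sketch that extends the example list above with “i” and “like” so the pizza example can be reproduced; real systems usually rely on a longer standard list (e.g., NLTK's stopword corpus):

# "i" and "like" are added here only to mimic the pizza example above.
STOPWORDS = {"a", "an", "and", "any", "has", "do", "don't", "did", "the", "on",
             "i", "like"}

def remove_stopwords(tokens):
    # Comparison is done on the lowercased token.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["I", "don't", "like", "pizza"]))  # ['pizza']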
Stemming
A term normalization process in which terms are transformed into their roots in order to reduce the size of the vocabulary. It is carried out by applying word-reduction rules.
Example: Porter’s Algorithm.

Example: d = I like human languages and programming languages => I like human languag and program languag 2
The vocabulary of document d after removing stopwords and performing
stemming:

termId value
t1 human
t2 languag
t3 program

2 http://9ol.es/porter_js_demo.html
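The footnote above links a JavaScript demo of Porter's algorithm; the following sketch uses NLTK's Python implementation instead (assuming nltk is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ["I", "like", "human", "languages", "and", "programming", "languages"]
print([stemmer.stem(t) for t in tokens])
# ['i', 'like', 'human', 'languag', 'and', 'program', 'languag']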
Lemmatization

• Another term normalization strategy.
• It also transforms words into their roots.
• It performs a morphological analysis using reference dictionaries (lookup tables) to create equivalence classes between types.
• For example, for the token studies, a stemming rule would return the term studi, while through lemmatization we would get the term study.3

3 https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/
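A small sketch contrasting the two normalizations with NLTK (assuming nltk is installed and the WordNet data has been downloaded via nltk.download("wordnet")):

from nltk.stem import PorterStemmer, WordNetLemmatizer

print(PorterStemmer().stem("studies"))           # 'studi'  (rule-based stemming)
print(WordNetLemmatizer().lemmatize("studies"))  # 'study'  (dictionary lookup)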
Zipf’s law [1]

• Zipf’s law, proposed by George Kingsley Zipf in [Zipf, 1935], is an empirical law about the frequency of terms within a collection of documents (corpus).
• It states that the frequency f of a term in a corpus is inversely proportional to its rank r in a sorted frequency table:

f = cf / r^β    (1)

• Where cf is a constant dependent on the collection and β > 0 is a decay factor.
• If β = 1, then f follows exactly Zipf’s law, otherwise it follows a Zipf-like
distribution.
• The law relates to the principle of least effort: we tend to use a small number of words to express most of our ideas.
• Zipf’s law is a type of power-law distribution (a long-tail distribution).
Zipf’s law [2]

Figure: Zipf’s law

• If we plot a log-log graph, we obtain a straight line with slope −β.


• Listing the most frequent words of a corpus can be used to build a
stopwords list.
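A sketch of how the rank-frequency relation can be inspected on a corpus (corpus.txt below is only a placeholder for a large text collection):

import math
from collections import Counter

text = open("corpus.txt", encoding="utf-8").read()   # placeholder corpus file
freqs = Counter(text.lower().split())

# Rank terms by decreasing frequency; rank 1 is the most frequent term.
for rank, (term, f) in enumerate(freqs.most_common(20), start=1):
    print(rank, term, f, round(math.log10(rank), 2), round(math.log10(f), 2))

# Plotting log10(frequency) against log10(rank) should give a roughly straight
# line with slope -beta; the top of this list is a natural stopword candidate list.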
Posting Lists and the Inverted Index
Let D be a collection of documents and V the vocabulary of all terms
extracted from the collection:
• The posting list of a term is the list of all documents where the term
appears at least once. Documents are identified by their ids.
• An inverted index is a dictionary-type data structure mapping terms
ti ∈ V into their corresponding posting lists.

<term> → <docId>*

Figure: Inverted Index
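A minimal in-memory sketch of such an index (a real index would also store term frequencies and positions, and would live on disk):

from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the sorted list of ids of documents containing it (its posting list).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "human languages", 2: "programming languages", 3: "human programming"}
index = build_inverted_index(docs)
print(index["languages"])   # [1, 2]  <- posting list of the term "languages"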


Web Search Engines [1]
A search engine is an information retrieval system designed for searching
information on the Web (solving information needs) [Manning et al., 2008]. Its
basic components are:

• Crawler: a robot that navigates the Web according to a defined strategy. It usually starts by browsing a set of seed websites and continues by following their hyperlinks.
• Indexer: in charge of maintaining an inverted index with the content of
the pages traversed by the Crawler.
• Query processor: in charge of processing user queries and searching
the index for the documents most relevant to a query.
• Ranking function: the function used by the query processor to rank
documents indexed in the collection by relevance according to a query.
• User interface: receives the query as input and returns the documents
ranked by relevancy.
Web Search Engines [2]

Figure: The various components of a web search engine [Manning et al., 2008].
The Vector Space Model

• In order to rank documents for a query, or to measure the similarity between two documents, we need a similarity metric.
• Documents can be represented as vectors of terms, where each term is
a vector dimension [Salton et al., 1975].
• Documents with different words and lengths will reside in the same
vector space.
• These types of representations are called Bag of Words.
• In bag-of-words representations, the order of words and the linguistic structure of a sentence are lost.
• The value of each dimension is a weight that represents the relevance
of the term ti in the document d.


dj → d⃗j = (w(t1, dj), ..., w(t|V|, dj))    (2)
• How can we model how informative a term is for a document?
Term Frequency - Inverted Document Frequency [1]

• Let tfi,j be the frequency of term ti in document dj .


• A term that occurs 10 times should provide more information than one
that occurs once.
• What happens when we have documents that are much longer than the
others?
• We can normalize by the maximum term frequency in the document.

ntfi,j = tfi,j / maxi(tfi,j)

• Does a term that occurs in very few documents provide more or less information than one that occurs in many documents?
• For example, consider the document The respected mayor of Pelotillehue. The term Pelotillehue occurs in fewer documents than the term mayor, so it should be more descriptive.
Term Frequency - Inverted Document Frequency [2]
• Let N be the number of documents in the collection and ni the number
of documents containing term ti , we define idf of ti as follows:
idfti = log10(N / ni)
• A term that appears in all documents would have idf = 0 and one that
appears in 10% of the documents would have idf = 1.
• The tf-idf scoring model combines the tf and idf scores, resulting in the following weight w for a term in a document:

w(ti, dj) = tfi,j × log10(N / ni)
• Search engine queries can also be modeled as vectors. However, queries contain only 2 to 3 terms on average. To avoid having too many null dimensions, query vectors can be smoothed as follows:
w(ti, dj) = (0.5 + 0.5 × tfi,j) × log10(N / ni)
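A sketch of these weighting functions on a toy collection (the three short documents below are illustrative only, not from the slides):

import math
from collections import Counter

docs = {
    "d1": "the respected mayor of pelotillehue".split(),
    "d2": "the mayor gave a speech".split(),
    "d3": "a speech by the mayor".split(),
}
N = len(docs)
df = Counter(t for tokens in docs.values() for t in set(tokens))   # n_i per term

def w(term, tokens):
    # Raw tf-idf weight: tf_{i,j} * log10(N / n_i).
    return tokens.count(term) * math.log10(N / df[term])

def w_query(term, query_tokens):
    # Smoothed weight for short queries: (0.5 + 0.5 * tf) * log10(N / n_i).
    return (0.5 + 0.5 * query_tokens.count(term)) * math.log10(N / df[term])

print(round(w("pelotillehue", docs["d1"]), 3))   # 0.477: appears in 1 of 3 documents
print(round(w("mayor", docs["d1"]), 3))          # 0.0:   appears in every document
print(round(w_query("pelotillehue", "mayor of pelotillehue".split()), 3))  # 0.477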
Similarity between Vectors

• Representing queries and documents as vectors allows calculating their similarity.
• One approach would be to use the Euclidean distance.
• The common approach is to calculate the cosine of the angle between
the two vectors.
• If both documents are the same, the angle would be 0 and its cosine
would be 1. On the other hand, if they are orthogonal the cosine is 0.
• The cosine similarity is calculated as follows:
cos(d⃗1, d⃗2) = (d⃗1 · d⃗2) / (||d⃗1|| × ||d⃗2||)
            = Σ_{i=1..|V|} w(ti, d1) × w(ti, d2) / ( √(Σ_{i=1..|V|} w(ti, d1)²) × √(Σ_{i=1..|V|} w(ti, d2)²) )

• This is often wrongly called the cosine distance; it is actually a similarity metric.


• Notice that the cosine similarity normalizes each vector by its Euclidean norm ||d⃗||2.
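A sketch of the cosine similarity on sparse {term: weight} vectors (the two toy vectors below are illustrative):

import math

def cosine_similarity(v1, v2):
    # Dot product over the shared terms, divided by the product of the Euclidean norms.
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 > 0 and norm2 > 0 else 0.0

a = {"human": 0.2, "languag": 0.4, "program": 0.1}
b = {"languag": 0.3, "program": 0.5}
print(round(cosine_similarity(a, b), 3))   # 0.636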
Cosine Similarity

Figure: Cosine Similarity.


Exercise

• Suppose we have 3 documents formed from the following sequences of terms:
d1 → t4 t3 t1 t4
d2 → t5 t4 t2 t3 t5
d3 → t2 t1 t4 t4
• Build a term-document matrix of 5 × 3 dimensions using simple tf-idf weights (without normalization).
• We recommend you first build a list with the number of documents in which each term appears (useful for calculating the idf scores).
• Then calculate the idf score of each term.
• Fill the cells of the matrix with the tf-idf values.
• Which is the closest document to d1 ?
Result

Table: tf-idf Matrix


d1 d2 d3
t1 0.176 0.000 0.176
t2 0.000 0.176 0.176
t3 0.176 0.176 0.000
t4 0.000 0.000 0.000
t5 0.000 0.954 0.000
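A short sketch that reproduces this matrix (raw tf, no normalization):

import math
from collections import Counter

docs = {"d1": ["t4", "t3", "t1", "t4"],
        "d2": ["t5", "t4", "t2", "t3", "t5"],
        "d3": ["t2", "t1", "t4", "t4"]}
N = len(docs)
df = Counter(t for tokens in docs.values() for t in set(tokens))   # documents containing each term

for term in sorted(df):
    idf = math.log10(N / df[term])
    print(term, [round(tokens.count(term) * idf, 3) for tokens in docs.values()])
# t1 [0.176, 0.0, 0.176]
# t2 [0.0, 0.176, 0.176]
# t3 [0.176, 0.176, 0.0]
# t4 [0.0, 0.0, 0.0]
# t5 [0.0, 0.954, 0.0]
# Computing cosine similarities on these columns shows that d3 is the closest
# document to d1 (cosine 0.5, versus roughly 0.13 for d2).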
Document Clustering [1]
• How can we group documents that are similar to each other?
• Clustering is the process of grouping documents that are similar to each other.
• Each group of documents is called a cluster.
• In clustering we try to identify groups of documents in which the
similarity between documents in the same cluster is maximized and the
similarity of documents in different clusters is minimized.

Figure: Set of documents where the clusters can be clearly identified.


Document Clustering [2]

• Document clustering allows identifying topics in a corpus and reducing the search space in a search engine, i.e., the inverted index is organized according to the clusters.
• K-means is a simple clustering algorithm that receives the number of
clusters k as a parameter.
• The algorithm relies on the idea of centroid, which is the average vector
of documents belonging to the same cluster.
• Let S be a set of 2-dimensional vectors {3, 6}, {1, 2}, {5, 1}; the centroid of S is {(3 + 1 + 5)/3, (6 + 2 + 1)/3} = {3, 3}.
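The same centroid computation as a short sketch:

def centroid(vectors):
    # Component-wise mean of the vectors in a cluster.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

print(centroid([[3, 6], [1, 2], [5, 1]]))   # [3.0, 3.0]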
K-Means

1. We start with k random centroids.


2. We calculate the similarity between each document and each centroid.
3. We assign each document to its closest centroid forming a cluster.
4. The centroids are recalculated according to the documents assigned to
them.
5. This process is repeated until convergence.
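A toy sketch of these steps (using squared Euclidean distance on dense vectors for simplicity; a document-clustering setup would typically use cosine similarity on tf-idf vectors, and would stop when assignments no longer change rather than after a fixed number of iterations):

import random

def kmeans(vectors, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)                    # 1. k random centroids
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:                                  # 2-3. assign each vector to its closest centroid
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        for j, cluster in enumerate(clusters):             # 4. recompute each centroid
            if cluster:
                centroids[j] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return centroids, clusters                             # 5. here we simply run a fixed number of iterations

points = [[1, 1], [1, 2], [0, 1], [8, 8], [9, 8], [9, 9]]
centroids, clusters = kmeans(points, k=2)
print(centroids)   # two centroids, one per group of nearby points
print(clusters)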
K-means

Figure: K-means algorithm


Conclusions and Additional Concepts

• Representing documents as vectors is essential for calculating similarities between document pairs.
• Bag of words vectors lack linguistic structure.
• Bag of words vectors are high-dimensional and sparse.
• Word n-grams can help capture multi-word expressions (e.g., New York => new york).
• Modern information retrieval systems go beyond vector similarity
(PageRank, Relevance Feedback, Query log mining, Google
Knowledge Graph, Machine Learning).
• Information retrieval and text mining are less concerned with linguistic
structure, and more interested in producing fast and scalable algorithms
[Eisenstein, 2018].
References I

Eisenstein, J. (2018).
Natural language processing.
Technical report, Georgia Tech.
Manning, C. D., Raghavan, P., and Schütze, H. (2008).
Introduction to Information Retrieval.
Cambridge University Press, New York, NY, USA.
Salton, G., Wong, A., and Yang, C.-S. (1975).
A vector space model for automatic indexing.
Communications of the ACM, 18(11):613–620.
Zipf, G. K. (1935).
The Psychobiology of Language.
Houghton-Mifflin, New York, NY, USA.
