NLP: Information Retrieval
Felipe Bravo-Marquez
Types
• A type is a class of tokens sharing the same sequence of characters.
• Types are obtained by identifying the unique tokens within the document.
Types for the previous sentence: [I] [like] [human] [languages] [and]
[programming]
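Extracting the types from a list of tokens can be sketched in a few lines of Python (a minimal illustration, not taken from the slides):

```python
def extract_types(tokens):
    """Return the unique tokens (types) in first-occurrence order."""
    # dict.fromkeys keeps insertion order and drops duplicates
    return list(dict.fromkeys(tokens))

print(extract_types(["to", "be", "or", "not", "to", "be"]))
# ['to', 'be', 'or', 'not']
```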
Stopwords removal
• To reduce the size of the vocabulary, terms that occur with very high
frequency in the corpus, and hence provide little discriminative
information, are removed.
• These terms are called stopwords and include articles, pronouns,
prepositions and conjunctions.
Example: [a, an, and, any, has, do, don’t, did, the, on]
(Related concepts: function words, closed-class words.)
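Stopword removal amounts to filtering tokens against a fixed list; a minimal sketch using the stopword list above:

```python
# Stopword list taken from the example above
STOPWORDS = {"a", "an", "and", "any", "has", "do", "don't", "did", "the", "on"}

def remove_stopwords(tokens):
    """Drop tokens whose lowercased form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["I", "like", "human", "languages", "and", "programming"]))
# ['I', 'like', 'human', 'languages', 'programming']
```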
Stemming
A term normalization process in which terms are transformed to their root in
order to reduce the size of the vocabulary. It is carried out by applying
word-reduction rules.
Example: Porter’s Algorithm (demo: http://9ol.es/porter_js_demo.html).

termId  value
t1      human
t2      languag
t3      program
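The idea behind suffix-stripping stemmers can be sketched with a toy rule set; this is NOT the full Porter algorithm, just an illustration of rule-based word reduction:

```python
def simple_stem(word):
    """Toy suffix-stripping stemmer (illustrative only; Porter's algorithm
    uses a much richer, multi-step rule set)."""
    word = word.lower()
    # apply the first matching suffix rule, keeping stems of length >= 3
    for suffix, repl in [("sses", "ss"), ("ies", "i"), ("ing", ""),
                         ("ed", ""), ("es", ""), ("s", "")]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:len(word) - len(suffix)] + repl
            break
    # undo a doubled final consonant left by stripping ("programm" -> "program")
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

print([simple_stem(w) for w in ["human", "languages", "programming"]])
# ['human', 'languag', 'program']
```

Note that the output matches the stem table above: "languages" becomes "languag", not the dictionary word "language".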
Lemmatization
A normalization process that maps each word to its dictionary form (lemma)
using vocabulary and morphological analysis, rather than crude suffix
stripping (e.g., “better” → “good”).
See: https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/
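Unlike a stemmer, a lemmatizer needs dictionary knowledge. A toy lookup-based sketch (the table is a hypothetical fragment; real lemmatizers use full morphological dictionaries such as WordNet):

```python
# Hypothetical dictionary fragment mapping inflected forms to lemmas
LEMMAS = {"better": "good", "was": "be", "mice": "mouse", "languages": "language"}

def lemmatize(word):
    """Look the word up in the lemma dictionary; fall back to the word itself."""
    return LEMMAS.get(word.lower(), word.lower())

print(lemmatize("languages"))  # 'language' (contrast with the stem 'languag')
```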
Zipf’s law [1]
The normalized term frequency of term t_i in document d_j:

ntf_{i,j} = tf_{i,j} / max_i(tf_{i,j})
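The normalization above divides each raw count by the largest count in the document; a minimal sketch:

```python
def normalized_tf(tf_counts):
    """ntf_{i,j} = tf_{i,j} / max_i(tf_{i,j}) for one document,
    given a {term: raw_count} dictionary."""
    max_tf = max(tf_counts.values())
    return {term: tf / max_tf for term, tf in tf_counts.items()}

print(normalized_tf({"the": 4, "mayor": 2, "pelotillehue": 1}))
# {'the': 1.0, 'mayor': 0.5, 'pelotillehue': 0.25}
```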
• Does a term that occurs in very few documents provide more or less
information than one that occurs in many documents?
• For example, consider the document The respected mayor of Pelotillehue.
The term Pelotillehue occurs in fewer documents than the term mayor, so it
should be more descriptive.
Term Frequency - Inverse Document Frequency [2]
• Let N be the number of documents in the collection and n_i the number
of documents containing term t_i. We define the idf of t_i as follows:

idf_{t_i} = log10(N / n_i)
• A term that appears in all documents would have idf = 0 and one that
appears in 10% of the documents would have idf = 1.
• The tf-idf scoring model combines the tf and idf scores, resulting in the
following weight w for a term in a document:

w(t_i, d_j) = tf_{i,j} × log10(N / n_i)
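The tf-idf weighting above can be sketched for a toy corpus of tokenized documents (a minimal illustration; real systems use inverted indexes and sparse vectors):

```python
import math

def tf_idf_weights(corpus):
    """Compute w(t_i, d_j) = tf_{i,j} * log10(N / n_i) for every term in
    every document. `corpus` is a list of token lists; returns one
    {term: weight} dict per document."""
    N = len(corpus)
    n = {}  # n_i: number of documents containing term t_i
    for doc in corpus:
        for term in set(doc):
            n[term] = n.get(term, 0) + 1
    weights = []
    for doc in corpus:
        w = {term: doc.count(term) * math.log10(N / n[term])
             for term in set(doc)}
        weights.append(w)
    return weights

corpus = [["a", "b", "a"], ["b", "c"]]
W = tf_idf_weights(corpus)
print(W[0]["b"])  # 0.0 -- "b" appears in all documents, so idf = 0
```

As the bullets above note, a term occurring in every document contributes nothing (idf = 0), regardless of its tf.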
• Search engine queries can also be modeled as vectors. However, queries
contain only 2 to 3 terms on average. To avoid having too many null
dimensions, query term weights can be smoothed as follows:

w(t_i, d_j) = (0.5 + 0.5 × tf_{i,j}) × log10(N / n_i)
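The smoothed query weighting is a one-line computation; a sketch (function name and the example numbers are illustrative):

```python
import math

def smoothed_query_weight(tf, N, n_i):
    """Smoothed weight for a query term: (0.5 + 0.5 * tf) * log10(N / n_i)."""
    return (0.5 + 0.5 * tf) * math.log10(N / n_i)

# A term appearing once in the query and in 10% of a 1000-document collection:
print(smoothed_query_weight(1, 1000, 100))  # 1.0
```

The 0.5 floor keeps every query term's weight nonzero even when its tf is small, which matters for very short queries.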
Similarity between Vectors
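In the vector space model [3], the similarity between two term-weight vectors is typically measured by the cosine of the angle between them; a minimal sketch (assuming the slide's missing content covers cosine similarity):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]))  # 1.0 (identical)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
```

Because tf-idf weights are nonnegative, cosine similarity between documents ranges from 0 (no shared terms) to 1 (identical direction).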
References

Eisenstein, J. (2018). Natural Language Processing. Technical report, Georgia Tech.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Salton, G., Wong, A., and Yang, C.-S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.
Zipf, G. K. (1935). The Psychobiology of Language. Houghton-Mifflin, New York, NY, USA.