NLP Lecture 2: Text Pre-Processing
Text Processing: Tokenization
What is Tokenization?
Tokenization is the process of segmenting a string of characters into words.
Sentence Segmentation
Challenges Involved
• While ‘!’ and ‘?’ are quite unambiguous,
• the period “.” is ambiguous: besides marking the end of a sentence, it is additionally used for
• Abbreviations (Dr., Mr., m.p.h.)
• Numbers (2.4%, 4.3)
A sentence splitter therefore cannot simply break on every period (see the sketch below).
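NLTK’s sent_tokenize (a Punkt-based splitter) handles many of these ambiguities out of the box; a minimal sketch, assuming the punkt model has been downloaded:
from nltk.tokenize import sent_tokenize
# Requires: nltk.download('punkt')

text = "Mr. Brown met Dr. Smith. Inflation rose 2.4% last year! Was that expected?"
for sentence in sent_tokenize(text):
    print(sentence)
# Prints three sentences: the periods in "Mr.", "Dr." and "2.4"
# are not treated as sentence boundaries.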
Word Tokenization
Example sentence: “I saw the film and she saw the play last Friday night.”
Word Token
An occurrence of a word: the sentence above contains 12 word tokens.
Word Type
A distinct word form: the sentence above contains only 10 word types, since “saw” and “the” each occur twice. The counts can be checked with a few lines of code, as below.
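A quick check of the token/type distinction, sketched with NLTK’s word_tokenize (a plain str.split would behave the same on this sentence):
from nltk.tokenize import word_tokenize

sentence = "I saw the film and she saw the play last Friday night"
tokens = word_tokenize(sentence)
print(len(tokens))       # 12 word tokens (every occurrence counts)
print(len(set(tokens)))  # 10 word types (distinct words only)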
Popular Python packages for NLP
➢ NLTK (Natural Language Toolkit): NLTK is one of the oldest and most comprehensive
libraries for NLP tasks. It provides tools for tasks such as tokenization, stemming,
lemmatization, part-of-speech tagging, parsing, and more.
➢ spaCy: spaCy is a modern NLP library that's designed to be fast and efficient. It offers
features like tokenization, POS tagging, named entity recognition (NER), dependency
parsing, and sentence segmentation.
➢ TextBlob: TextBlob is built on top of NLTK and provides a simpler interface for common
NLP tasks such as tokenization, POS tagging, noun phrase extraction, sentiment analysis,
and more.
➢ Gensim: Gensim is primarily focused on topic modeling and document similarity analysis,
but it also offers functionality for tasks like text preprocessing, word embedding, and
similarity queries.
➢ scikit-learn: While scikit-learn is a general-purpose machine learning library, it also
includes utilities for text preprocessing, such as CountVectorizer and TfidfVectorizer for
converting text data into numerical feature vectors.
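As a taste of the scikit-learn utilities mentioned above, a minimal sketch that turns a few toy documents into count vectors with CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog barked", "the dog and the cat", "a cat slept"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # vocabulary learned from the documents
print(X.toarray())                         # one row of term counts per document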
Issues in Tokenization
• Finland’s → Finland? Finland’s? Finland ’s?
• What’re, I’m, shouldn’t → What are, I am, should not?
• San Francisco → one token or two?
Different tokenizers resolve these differently, as the sketch below shows.
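For instance, NLTK’s Treebank-style word_tokenize splits clitics off as separate tokens (a sketch):
from nltk.tokenize import word_tokenize

print(word_tokenize("Finland's capital"))  # ['Finland', "'s", 'capital']
print(word_tokenize("shouldn't"))          # ['should', "n't"]
print(word_tokenize("San Francisco"))      # ['San', 'Francisco'] - two tokens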
Normalization
Why “normalize”?
• Indexed text and query terms must have the same form.
Example: “U.S.A.” and “USA” should be matched.
Lemmatization is one such normalization: it reduces inflected forms to a base form (the lemma), as in the sketch after this list.
✓ am, are, is → be
✓ car, cars, car’s, cars’ → car
✓ eat, ate, eaten → eat
✓ write, wrote, written → write
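A minimal lemmatization sketch with spaCy, assuming the small English model en_core_web_sm is installed:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dogs are barking. I am reading a book.")
print([(token.text, token.lemma_) for token in doc if not token.is_punct])
# [('The', 'the'), ('dogs', 'dog'), ('are', 'be'), ('barking', 'bark'),
#  ('I', 'I'), ('am', 'be'), ('reading', 'read'), ('a', 'a'), ('book', 'book')]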
Morphology studies the internal structure of words: how words are built up from smaller meaningful units called morphemes.
Example: “unhappiness” = un- (prefix) + happy (stem) + -ness (suffix).
Porter’s Algorithm for Stemming
Step 1a
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → φ (cats → cat)
Step 1b
(*v*)ing → φ (walking → walk, but king → king)
(*v*)ed → φ (played → play)
Here (*v*) means the stem before the suffix must contain a vowel, which is why “king” is left unchanged.
Step 2
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
Step 3
al → φ (revival → reviv)
able → φ (adjustable → adjust)
ate → φ (activate → activ)
Python code for Stemming
import nltk
from nltk.stem import PorterStemmer

# Initialize Porter stemmer
stemmer = PorterStemmer()

# Stem a few of the example words from the rules above
words = ["caresses", "ponies", "walking", "played", "relational"]
print([stemmer.stem(w) for w in words])
# ['caress', 'poni', 'walk', 'play', 'relat']
Python code for pre-processing using spaCy
import spacy

# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The dogs are barking loudly outside. I am reading a book."
doc = nlp(text)

# Perform various preprocessing tasks
cleaned_tokens = []
for token in doc:
    # Remove stop words and punctuation
    if not token.is_stop and not token.is_punct:
        # Lemmatize each token
        lemma = token.lemma_
        # Lowercase each token
        cleaned_tokens.append(lemma.lower())

# Join the cleaned tokens back into a string
cleaned_text = " ".join(cleaned_tokens)

# Print the preprocessed text
print("Original text:", text)
print("Preprocessed text:", cleaned_text)

Output:
Original text: The dogs are barking loudly outside. I am reading a book.
Preprocessed text: dog bark loudly outside read book
Python code for Pre-processing using TextBlob
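A minimal sketch reconstructed to match the output below (the sample sentence about Barack Obama is assumed from that output):
from textblob import TextBlob
# Requires: pip install textblob, plus its NLTK corpora
# (python -m textblob.download_corpora)

text = ("Barack Obama was born in Hawaii on August 4, 1961. "
        "He served as the 44th President of the United States.")
blob = TextBlob(text)

# Tokenization: blob.words drops punctuation
print("Tokens:", blob.words)

# Part-of-speech tagging with Penn Treebank tags
print("POS tags:", blob.tags)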
Output:
Tokens: ['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii', 'on', 'August', '4', '1961', 'He', 'served', 'as', 'the', '44th', 'President', 'of', 'the', 'United', 'States']
POS tags: [('Barack', 'NNP'), ('Obama', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('Hawaii', 'NNP'), ('on', 'IN'), ('August', 'NNP'), ('4', 'CD'), ('1961', 'CD'), ('He', 'PRP'), ('served', 'VBD'), ('as', 'IN'), ('the', 'DT'), ('44th', 'CD'), ('President', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS')]
Minimum Edit as Search
How to navigate?
• The space of all edit sequences is huge.
• Many distinct paths end up at the same state, so we don’t have to keep track of all of them.
• Keep track of the shortest path to each state.
Defining Minimum Edit Distance Matrix
We define D(i, j) as the edit distance between X[1..i] and Y[1..j], i.e., between the first i characters of X and the first j characters of Y.
Dynamic Programming
A tabular computation of D(n, m): solving problems by combining solutions to subproblems.
Bottom-up:
• Compute D(i, j) for small i, j
• Compute larger D(i, j) based on previously computed smaller values
• Continue for all i and j until reaching D(n, m)
Example
Edit distance from ‘intention’ to ‘execution’
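A minimal sketch of the tabular computation, assuming unit costs for insertion, deletion, and substitution. With D(i, 0) = i and D(0, j) = j, the recurrence is D(i, j) = min(D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + (0 if X[i] = Y[j] else 1)):
def min_edit_distance(x, y):
    n, m = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # i deletions to turn x[:i] into ""
    for j in range(1, m + 1):
        D[0][j] = j  # j insertions to turn "" into y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution or match
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 5
With unit costs the distance is 5; some textbook formulations charge 2 for a substitution, which would give 8 instead.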
Time: O(nm)
Backtrace: O(n + m)