
19CSE453

Natural Language Processing

Lecture 2
Text processing: tokenization

What is Tokenization?
Tokenization is the process of segmenting a string of characters into words.

Depending on the application at hand, you might have to perform sentence
segmentation as well.
Sentence Segmentation

The problem of deciding where sentences begin and end.

Challenges Involved
• ‘!’ and ‘?’ are quite unambiguous
• Period “.” is quite ambiguous, since it is additionally used for
  • Abbreviations (Dr., Mr., m.p.h.)
  • Numbers (2.4%, 4.3)

Approach: build a binary classifier

For each “.”
• Decide EndOfSentence/NotEndOfSentence
• Classifiers can be: hand-written rules, regular expressions, or
  machine learning
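In practice, off-the-shelf segmenters implement such a classifier. As a minimal sketch (assuming NLTK and its Punkt model are available; the sample text is made up), NLTK's sent_tokenize handles the ambiguous periods above:

import nltk
nltk.download("punkt")  # one-time download of the Punkt sentence model
from nltk.tokenize import sent_tokenize

text = "Dr. Smith drove at 60 m.p.h. to the clinic. Rates rose by 2.4%! Why?"
for sentence in sent_tokenize(text):
    print(sentence)
# Expected: three sentences; the periods in "Dr." and "m.p.h." and the
# decimal point in "2.4%" are not treated as sentence boundaries.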
Sentence Segmentation: Decision Tree Example

Decision Tree: Is this word the end-of-sentence (E-O-S)?
Other Important Features likely to influence sentence segmentation

• Case of word with “.”: Upper, Lower, Cap, Number
• Case of word after “.”: Upper, Lower, Cap, Number
• Numeric features
  ➢ Length of word with “.”
  ➢ Probability (word with “.” occurs at end-of-sentence)
  ➢ Probability (word after “.” occurs at beginning-of-sentence)
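For illustration, a feature extractor for a candidate “.” might look like the following (a hypothetical sketch; the function and feature names are made up):

def eos_features(word_with_dot, next_word):
    # Case features used by the decision tree: Upper, Lower, Cap, Number
    def case(w):
        if w.isdigit(): return "Number"
        if w.isupper(): return "Upper"
        if w.islower(): return "Lower"
        if w[:1].isupper(): return "Cap"
        return "Other"
    stem = word_with_dot.rstrip(".")
    return {
        "case_of_word_with_dot": case(stem),
        "case_of_word_after_dot": case(next_word),
        "length_of_word_with_dot": len(stem),
    }

print(eos_features("Dr.", "Smith"))
# {'case_of_word_with_dot': 'Cap', 'case_of_word_after_dot': 'Cap', 'length_of_word_with_dot': 2}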
Word Tokenization

What is Tokenization?
Tokenization is the process of segmenting a string of characters into words.

I have a can opener; but I can not open these cans.

Word Token
An occurrence of a word.
The above sentence has 12 word tokens.

Word Type
A distinct word, counted once no matter how often it recurs.
The above sentence has 10 word types (“I” and “can” each occur twice).
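These counts are easy to verify (a minimal sketch; the crude split/strip below stands in for a real tokenizer):

text = "I have a can opener; but I can not open these cans."
tokens = [w.strip(";.").lower() for w in text.split()]

print(len(tokens))       # 12 word tokens
print(len(set(tokens)))  # 10 word types ("i" and "can" each occur twice)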
Word Tokenization

Issues in Tokenization
Finland’s → Finland? Finland’s? Finland ’s?
What’re, I’m, shouldn’t → What are, I am, should not?
San Francisco → one token or two?
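For comparison, NLTK's word_tokenize (one possible choice; other tokenizers resolve these cases differently) handles the examples above as follows:

from nltk.tokenize import word_tokenize

print(word_tokenize("Finland's capital"))  # ['Finland', "'s", 'capital']
print(word_tokenize("I shouldn't go"))     # ['I', 'should', "n't", 'go']
print(word_tokenize("San Francisco"))      # ['San', 'Francisco'], i.e. two tokens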
Normalization

Why “normalize”?
• Indexed text and query terms must have the same form.
  Example: U.S.A. and USA should be matched.

We implicitly define equivalence classes of terms.

Three forms of normalization:
• Case folding
• Stemming
• Lemmatization
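As a toy illustration of such an equivalence class (a sketch; real IR systems apply more careful rules):

def normalize(term):
    # Map U.S.A., USA, usa, ... to a single canonical form
    return term.replace(".", "").lower()

print(normalize("U.S.A.") == normalize("USA"))  # True: both map to "usa"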
Case Folding

• Reduce all letters to lower case

• Possible exceptions (task dependent):

  ➢ Upper case in mid-sentence may point to named entities
    (e.g., General Motors)
  ➢ Words in ALL CAPS may convey emphasis
    (e.g., US vs. us, I REALLY MEAN IT)
  ➢ Applications: sentiment analysis, information extraction, etc.
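A minimal sketch of task-dependent case folding (the keep-ALL-CAPS rule here is an illustrative assumption, not a standard recipe):

def case_fold(tokens):
    # Lowercase everything except multi-letter ALL-CAPS tokens,
    # which may be acronyms or carry emphasis (US vs. us)
    return [t if t.isupper() and len(t) > 1 else t.lower() for t in tokens]

print(case_fold(["I", "REALLY", "mean", "it", "US", "citizens"]))
# ['i', 'REALLY', 'mean', 'it', 'US', 'citizens']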
Lemmatization

• Reduce inflections or variant forms to the base form:

  ✓ am, are, is → be
  ✓ car, cars, car’s, cars’ → car
  ✓ eat, ate, eaten → eat
  ✓ write, wrote, written → write

• Must find the correct dictionary headword form (the lemma)

Lemmatization learns from Morphology

Morphology studies the internal structure of words: how words are built up
from smaller meaningful units called morphemes.

Morphemes are divided into two categories:

Stems: the core meaning-bearing units
Affixes: bits and pieces adhering to stems to change their meanings and
grammatical functions
• Prefix: un-, anti-, etc.
• Suffix: -ation, -ity, -en, -ed, etc.

*Lemmatization algorithms take input from morphology to convert tokens
to their root form.
Python Code for Lemmatization
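The code on the original slide is an image; a minimal equivalent sketch using NLTK's WordNetLemmatizer (assuming the WordNet data has been downloaded):

import nltk
nltk.download("wordnet")  # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cars"))              # car  (pos defaults to 'n', noun)
print(lemmatizer.lemmatize("ate", pos="v"))      # eat
print(lemmatizer.lemmatize("written", pos="v"))  # write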

*Note: The second parameter of the lemmatize function is the part of
speech; it defaults to noun if not provided.
Stemming

• Reducing terms to their stems; widely used in information retrieval

• Crude chopping of affixes
  ➢ language dependent
  ➢ automate(s), automatic, automation all reduced to automat

Example:
Porter’s algorithm for Stemming

Step 1a
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → φ (cats → cat)

Step 1b
(*v*)ing → φ (walking → walk, king → king)
(*v*)ed → φ (played → play)

*Note: v represents a vowel; (*v*) means the stem contains a vowel

Step 2
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)

Step 3
al → φ (revival → reviv)
able → φ (adjustable → adjust)
ate → φ (activate → activ)
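Step 1a can be written as a handful of ordered regular-expression rules (a toy sketch of this one step, not the full algorithm):

import re

def porter_step_1a(word):
    # Apply only the first rule whose suffix matches (rules are ordered)
    for pattern, repl in [("sses$", "ss"), ("ies$", "i"), ("ss$", "ss"), ("s$", "")]:
        if re.search(pattern, word):
            return re.sub(pattern, repl, word)
    return word

print(porter_step_1a("caresses"))  # caress
print(porter_step_1a("ponies"))    # poni
print(porter_step_1a("caress"))    # caress
print(porter_step_1a("cats"))      # cat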
Python code for Stemming

import nltk
from nltk.stem import PorterStemmer

# Initialize Porter stemmer
stemmer = PorterStemmer()

# Stem each token and print the result
print(stemmer.stem("cats"))        # cat
print(stemmer.stem("played"))      # play
print(stemmer.stem("playing"))     # play
print(stemmer.stem("welcomes"))    # welcom
print(stemmer.stem("persual"))     # persual (no rule applies; left unchanged)
print(stemmer.stem("ideologies"))  # ideolog

*Note: SnowballStemmer and LancasterStemmer are other stemmers available in NLTK.


Popular Python packages for NLP

➢ NLTK (Natural Language Toolkit): NLTK is one of the oldest and most
comprehensive libraries for NLP tasks. It provides tools for tasks such as
tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and more.
➢ spaCy: spaCy is a modern NLP library that's designed to be fast and efficient. It
offers features like tokenization, POS tagging, named entity recognition (NER),
dependency parsing, and sentence segmentation.
➢ TextBlob: TextBlob is built on top of NLTK and provides a simpler interface for
common NLP tasks such as tokenization, POS tagging, noun phrase extraction,
sentiment analysis, and more.
➢ Gensim: Gensim is primarily focused on topic modeling and document similarity
analysis, but it also offers functionality for tasks like text preprocessing, word
embedding, and similarity queries.
➢ scikit-learn: While scikit-learn is a general-purpose machine learning library, it also
includes utilities for text preprocessing, such as CountVectorizer and TfidfVectorizer
for converting text data into numerical feature vectors.
Python code for pre-processing using spaCy

import spacy

# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The dogs are barking loudly outside. I am reading a book."
doc = nlp(text)

# Perform various preprocessing tasks
cleaned_text = []
for token in doc:
    # Remove stop words and punctuation
    if not token.is_stop and not token.is_punct:
        # Lemmatize each token
        lemma = token.lemma_
        # Lowercase each token
        cleaned_text.append(lemma.lower())

# Join the cleaned tokens back into a string
cleaned_text = " ".join(cleaned_text)

# Print the preprocessed text
print("Original text:", text)
print("Preprocessed text:", cleaned_text)

Output:
Original text: The dogs are barking loudly outside. I am reading a book.
Preprocessed text: dog bark loudly outside read book
Python code for Pre-processing using TextBlob

from textblob import TextBlob

# Sample text
text = ("Barack Obama was born in Hawaii on August 4, 1961. "
        "He served as the 44th President of the United States.")

# Tokenization
blob = TextBlob(text)
tokens = blob.words

# POS tagging
pos_tags = blob.tags

# "NER" tagging (TextBlob offers noun phrase extraction rather than true NER)
ner_tags = blob.noun_phrases

print("Tokens:", tokens)
print("POS tags:", pos_tags)
print("NER tags:", ner_tags)
Output

Tokens: ['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii', 'on', 'August', '4', '1961', 'He',
'served', 'as', 'the', '44th', 'President', 'of', 'the', 'United', 'States']

POS tags: [('Barack', 'NNP'), ('Obama', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in',
'IN'), ('Hawaii', 'NNP'), ('on', 'IN'), ('August', 'NNP'), ('4', 'CD'), ('1961', 'CD'), ('He',
'PRP'), ('served', 'VBD'), ('as', 'IN'), ('the', 'DT'), ('44th', 'CD'), ('President', 'NNP'),
('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS')]

NER tags: ['barack obama', 'hawaii', 'august', '44th president']


Spelling Correction: Edit Distance

Spelling Correction

I am writing this email on behaf of ...

The user typed ‘behaf’. Which are some close words?

behalf
behave
...

Isolated word error correction
Pick the candidate that is closest to ‘behaf’.
How do we define ‘closest’? We need a distance metric.
The simplest metric: edit distance.
Edit Distance

• The minimum edit distance between two strings is the minimum number
  of editing operations needed to transform one into the other:
  • Insert
  • Delete
  • Substitute

For example, ‘behaf’ → ‘behalf’ takes one insertion (‘l’), while
‘behaf’ → ‘behave’ takes a substitution (‘f’ → ‘v’) plus an insertion
(‘e’), so ‘behalf’ is the closer word.
How to find the Minimum Edit Distance?

Searching for a path (sequence of edits) from the start string to
the final string:
• Initial state: the word we are transforming
• Operators: insert, delete, substitute
• Goal state: the word we are trying to get to
• Path cost: what we want to minimize, namely the number of edits
Minimum Edit as Search

How to navigate?
• The space of all edit sequences is huge
• Many distinct paths end up at the same state, so we don't have to
  keep track of all of them
• Keep track of the shortest path to each state

Defining the Minimum Edit Distance Matrix

For two strings
• X of length n
• Y of length m

we define D(i, j) as the edit distance between X[1..i] and Y[1..j],
i.e., between the first i characters of X and the first j characters of Y.

Thus, the edit distance between X and Y is D(n, m).


Computing Minimum Edit Distance

Dynamic Programming
A tabular computation of D(n, m): solve the problem by combining
solutions to subproblems, bottom-up:
• Compute D(i, j) for small i, j
• Compute larger D(i, j) from the previously computed smaller values
• Continue for all i and j until you reach D(n, m)


Dynamic Programming Algorithm
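The recurrence on the original slide is an image; the standard formulation (unit costs assumed here; some variants, e.g. Levenshtein's, charge 2 for a substitution) is:

D(i, 0) = i, \qquad D(0, j) = j

D(i, j) = \min \begin{cases}
  D(i-1, j) + 1 & \text{(deletion)} \\
  D(i, j-1) + 1 & \text{(insertion)} \\
  D(i-1, j-1) + [\, X[i] \neq Y[j] \,] & \text{(substitution; cost 0 on a match)}
\end{cases}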


The Edit Distance Table

[The filled-in D(i, j) table on the original slides is an image and is not
reproduced here.]
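As a sketch of how the table is filled (unit costs, matching the recurrence above):

def min_edit_distance(x, y):
    # Build the full D(i, j) table for strings x and y (unit costs)
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # i deletions to reach the empty string
    for j in range(m + 1):
        d[0][j] = j  # j insertions from the empty string
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d

table = min_edit_distance("intention", "execution")
print(table[-1][-1])  # 5 with unit costs (8 if substitutions cost 2)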


Computing Alignments

➢ Computing the edit distance alone may not be sufficient for some applications

➢ We often need to align the characters of the two strings with each other;
  this is recovered by a backtrace through the table (see the sketch below)


Minimum Edit Distance

Example
Edit distance from ‘intention’ to ‘execution’
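One optimal alignment for this example can be recovered by backtracing through the table built by min_edit_distance above (a minimal sketch that re-derives each move instead of storing pointers):

def backtrace(x, y, d):
    # Walk from D(n, m) back to D(0, 0), emitting one optimal edit script
    i, j = len(x), len(y)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            ops.append("match " + x[i - 1] if x[i - 1] == y[j - 1]
                       else "substitute " + x[i - 1] + " -> " + y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("delete " + x[i - 1])
            i -= 1
        else:
            ops.append("insert " + y[j - 1])
            j -= 1
    return list(reversed(ops))

print(backtrace("intention", "execution", table))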


Performance

Time: O(nm)
Backtrace: O(n + m)
