NLP Lecture 2: Text Pre-Processing
Text Processing: Tokenization
What is Tokenization?
Tokenization is the process of segmenting a string of characters into words.
Sentence Segmentation
Challenges Involved
• While ‘!’ and ‘?’ are quite unambiguous,
• the period “.” is ambiguous: besides marking the end of a sentence, it is additionally used for
• Abbreviations (Dr., Mr., m.p.h.)
• Numbers (2.4%, 4.3)
A sentence splitter therefore cannot simply break on every period (see the sketch below).
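NLTK’s sent_tokenize (a Punkt-based splitter) handles many of these ambiguities out of the box; a minimal sketch, assuming the punkt model has been downloaded:
from nltk.tokenize import sent_tokenize
# Requires: nltk.download('punkt')

text = "Mr. Brown met Dr. Smith. Inflation rose 2.4% last year! Was that expected?"
for sentence in sent_tokenize(text):
    print(sentence)
# Prints three sentences: the periods in "Mr.", "Dr." and "2.4"
# are not treated as sentence boundaries.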
Word Tokenization
Example sentence: “I saw the film and she saw the play last Friday night.”
Word Token
An occurrence of a word: the sentence above contains 12 word tokens.
Word Type
A distinct word form: the sentence above contains only 10 word types, since “saw” and “the” each occur twice. The counts can be checked with a few lines of code, as below.
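A quick check of the token/type distinction, sketched with NLTK’s word_tokenize (a plain str.split would behave the same on this sentence):
from nltk.tokenize import word_tokenize

sentence = "I saw the film and she saw the play last Friday night"
tokens = word_tokenize(sentence)
print(len(tokens))       # 12 word tokens (every occurrence counts)
print(len(set(tokens)))  # 10 word types (distinct words only)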
Popular Python packages for NLP
➢ NLTK (Natural Language Toolkit): NLTK is one of the oldest and most comprehensive
libraries for NLP tasks. It provides tools for tasks such as tokenization, stemming,
lemmatization, part-of-speech tagging, parsing, and more.
➢ spaCy: spaCy is a modern NLP library that's designed to be fast and efficient. It offers
features like tokenization, POS tagging, named entity recognition (NER), dependency
parsing, and sentence segmentation.
➢ TextBlob: TextBlob is built on top of NLTK and provides a simpler interface for common
NLP tasks such as tokenization, POS tagging, noun phrase extraction, sentiment analysis,
and more.
➢ Gensim: Gensim is primarily focused on topic modeling and document similarity analysis,
but it also offers functionality for tasks like text preprocessing, word embedding, and
similarity queries.
➢ scikit-learn: While scikit-learn is a general-purpose machine learning library, it also
includes utilities for text preprocessing, such as CountVectorizer and TfidfVectorizer for
converting text data into numerical feature vectors.
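As a taste of the scikit-learn utilities mentioned above, a minimal sketch that turns a few toy documents into count vectors with CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog barked", "the dog and the cat", "a cat slept"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # vocabulary learned from the documents
print(X.toarray())                         # one row of term counts per document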
Issues in Tokenization
• Finland’s → Finland? Finland’s? Finland ’s?
• What’re, I’m, shouldn’t → What are, I am, should not?
• San Francisco → one token or two?
Different tokenizers resolve these differently, as the sketch below shows.
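For instance, NLTK’s Treebank-style word_tokenize splits clitics off as separate tokens (a sketch):
from nltk.tokenize import word_tokenize

print(word_tokenize("Finland's capital"))  # ['Finland', "'s", 'capital']
print(word_tokenize("shouldn't"))          # ['should', "n't"]
print(word_tokenize("San Francisco"))      # ['San', 'Francisco'] - two tokens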
Normalization
Why “normalize”?
• Indexed text and query terms must have the same form.
Example: “U.S.A.” and “USA” should be matched.
Lemmatization is one such normalization: it reduces inflected forms to a base form (the lemma), as in the sketch after this list.
✓ am, are, is → be
✓ car, cars, car’s, cars’ → car
✓ eat, ate, eaten → eat
✓ write, wrote, written → write
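A minimal lemmatization sketch with spaCy, assuming the small English model en_core_web_sm is installed:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dogs are barking. I am reading a book.")
print([(token.text, token.lemma_) for token in doc if not token.is_punct])
# [('The', 'the'), ('dogs', 'dog'), ('are', 'be'), ('barking', 'bark'),
#  ('I', 'I'), ('am', 'be'), ('reading', 'read'), ('a', 'a'), ('book', 'book')]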
Morphology studies the internal structure of words: how words are built up from smaller meaningful units called morphemes.
Example: “unhappiness” = un- (prefix) + happy (stem) + -ness (suffix).
Porter’s Algorithm for Stemming
Step 1a
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → φ (cats → cat)
Step 1b
(*v*)ing → φ (walking → walk, but king → king)
(*v*)ed → φ (played → play)
Here (*v*) means the stem before the suffix must contain a vowel, which is why “king” is left unchanged.
Step 2
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
Step 3
al → φ (revival → reviv)
able → φ (adjustable → adjust)
ate → φ (activate → activ)
Python code for Stemming
import nltk
from nltk.stem import PorterStemmer

# Initialize Porter stemmer
stemmer = PorterStemmer()

# Stem a few of the example words from the rules above
words = ["caresses", "ponies", "walking", "played", "relational"]
print([stemmer.stem(w) for w in words])
# ['caress', 'poni', 'walk', 'play', 'relat']
Python code for pre-processing using spaCy
import spacy

# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The dogs are barking loudly outside. I am reading a book."
doc = nlp(text)

# Perform various preprocessing tasks
cleaned_tokens = []
for token in doc:
    # Remove stop words and punctuation
    if not token.is_stop and not token.is_punct:
        # Lemmatize each token
        lemma = token.lemma_
        # Lowercase each token
        cleaned_tokens.append(lemma.lower())

# Join the cleaned tokens back into a string
cleaned_text = " ".join(cleaned_tokens)

# Print the preprocessed text
print("Original text:", text)
print("Preprocessed text:", cleaned_text)

Output:
Original text: The dogs are barking loudly outside. I am reading a book.
Preprocessed text: dog bark loudly outside read book
Python code for Pre-processing using TextBlob
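A minimal sketch reconstructed to match the output below (the sample sentence about Barack Obama is assumed from that output):
from textblob import TextBlob
# Requires: pip install textblob, plus its NLTK corpora
# (python -m textblob.download_corpora)

text = ("Barack Obama was born in Hawaii on August 4, 1961. "
        "He served as the 44th President of the United States.")
blob = TextBlob(text)

# Tokenization: blob.words drops punctuation
print("Tokens:", blob.words)

# Part-of-speech tagging with Penn Treebank tags
print("POS tags:", blob.tags)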
Output:
Tokens: ['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii', 'on', 'August', '4', '1961', 'He', 'served', 'as', 'the', '44th', 'President', 'of', 'the', 'United', 'States']
POS tags: [('Barack', 'NNP'), ('Obama', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('Hawaii', 'NNP'), ('on', 'IN'), ('August', 'NNP'), ('4', 'CD'), ('1961', 'CD'), ('He', 'PRP'), ('served', 'VBD'), ('as', 'IN'), ('the', 'DT'), ('44th', 'CD'), ('President', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS')]
Minimum Edit as Search
How to navigate?
• The space of all edit sequences is huge.
• Many distinct paths end up at the same state, so we don’t have to keep track of all of them.
• Keep track of the shortest path to each state.
Defining Minimum Edit Distance Matrix
We define D(i, j) as the edit distance between X[1..i] and Y[1..j], i.e., between the first i characters of X and the first j characters of Y.
Dynamic Programming
A tabular computation of D(n, m): solving problems by combining solutions to subproblems.
Bottom-up:
• Compute D(i, j) for small i, j
• Compute larger D(i, j) based on previously computed smaller values
• Continue for all i and j until reaching D(n, m)
Example
Edit distance from ‘intention’ to ‘execution’
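A minimal sketch of the tabular computation, assuming unit costs for insertion, deletion, and substitution. With D(i, 0) = i and D(0, j) = j, the recurrence is D(i, j) = min(D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + (0 if X[i] = Y[j] else 1)):
def min_edit_distance(x, y):
    n, m = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # i deletions to turn x[:i] into ""
    for j in range(1, m + 1):
        D[0][j] = j  # j insertions to turn "" into y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution or match
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 5
With unit costs the distance is 5; some textbook formulations charge 2 for a substitution, which would give 8 instead.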
Time: O(nm)
Backtrace: O(n + m)