Week 3
Introduction
Tokenizing (segmenting) words
Normalizing word formats
Segmenting sentences
The practical
Introduction
Text normalization is the process of transforming text into a
single canonical form before almost any natural language processing
is applied. It is needed in applications such as:
• predictive text and handwriting recognition
• web search engines
• machine translation, and text analysis to detect sentiment in tweets and blogs
At least three tasks are commonly applied as part of any
normalization process:
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
How many words?
N = number of tokens (running words in the document)
V = vocabulary = the set of types; |V| is the size of the vocabulary
Heaps' Law (also known as Herdan's Law): |V| = kN^β, where often 0.67 < β < 0.75
i.e., vocabulary size grows faster than the square root of the number of word tokens
Corpus                            Tokens = N     Types = |V|
Switchboard phone conversations   2.4 million    20 thousand
Shakespeare                       884,000        31 thousand
COCA                              440 million    2 million
Google N-grams                    1 trillion     13+ million
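A minimal sketch of how Heaps' law can be checked on a corpus (Python; the whitespace tokenizer, the function name heaps_check, and the default k = 10 are illustrative assumptions, since k is corpus-dependent and typically falls between 10 and 100):

import math

def heaps_check(text, k=10):
    # N = number of tokens (naive whitespace tokenization)
    tokens = text.lower().split()
    # V = set of types (distinct word forms)
    types = set(tokens)
    n, v = len(tokens), len(types)
    # Solve |V| = k * N**beta for beta; the estimate is sensitive to k
    beta = math.log(v / k) / math.log(n)
    return n, v, beta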
Tokenizing (segmenting) words
Tokenization separates a chunk of continuous text into individual words.
Tokenization is intimately tied up with named entity recognition, since
names, dates, and organizations often span several tokens.
Space-based tokenization (segmenting off a token between instances of
spaces) works for languages that use space characters between words.
Tokenization needs to run before any other language processing, so it
must be fast.
The standard method is therefore to use deterministic algorithms based
on regular expressions.
Word tokenization is more complex in languages that do not use spaces
to mark potential word boundaries, such as Chinese, Japanese, and Thai.
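As a sketch of such deterministic tokenizers (the regex pattern below is illustrative, not a standard one):

import re

def space_tokenize(text):
    # Space-based tokenization: split on runs of whitespace
    return text.split()

def regex_tokenize(text):
    # One token per run of word characters, or per punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(space_tokenize("O'Neill said: don't."))
# ["O'Neill", 'said:', "don't."]
print(regex_tokenize("O'Neill said: don't."))
# ['O', "'", 'Neill', 'said', ':', 'don', "'", 't', '.']

Note that both treatments of clitics and punctuation are unsatisfactory here, which motivates the issues on the next slide.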
CS3TM20 © XH 4
Issues in Tokenization
• Can't just blindly remove punctuation:
• m.p.h., Ph.D., AT&T, cap’n
• prices ($45.55)
• dates (01/02/06)
• URLs (http://www.stanford.edu)
• hashtags (#nlproc)
• email addresses (someone@cs.colorado.edu)
• Clitic: a part of a word that can't stand on its own
• "are" in we're, French "je" in j'ai, "le" in l'honneur
• When should multiword expressions (MWEs) be treated as single
words?
• New York, rock ’n’ roll
Tokenization in NLTK
Bird, Loper and Klein (2009), Natural Language Processing with Python. O’Reilly
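A short sketch along the lines of the book's chapter 3 (the verbose regex pattern is adapted from the book and is only one possible design; word_tokenize requires a one-time model download):

import nltk
nltk.download('punkt')  # tokenizer models ('punkt_tab' in newer NLTK versions)

text = "That U.S.A. poster-print costs $12.40..."

# Pretrained word tokenizer
print(nltk.word_tokenize(text))

# Regex-based tokenizer; (?x) allows whitespace and comments in the pattern
pattern = r'''(?x)
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():_`-]        # single-character tokens
'''
print(nltk.regexp_tokenize(text, pattern))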
Word Normalization
Case folding
For applications like IR (information retrieval), reduce all letters to
lower case, since users tend to type lower-case queries.
Possible exception: keep upper case that appears mid-sentence, e.g.:
• General Motors
• Fed vs. fed
• SAIL vs. sail
For sentiment analysis, MT, and information extraction, case is
helpful (US versus us is important).
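A minimal illustration (Python; str.casefold is a more aggressive fold than str.lower, handling cases such as German ß):

query = "SAIL researchers in the US"
print(query.lower())        # 'sail researchers in the us' - case distinctions lost
print("Straße".casefold())  # 'strasse' - aggressive folding for matching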
Stemming
Reduce terms to stems, chopping off affixes crudely.
Stemming a word or sentence may produce forms that are not actual
words, as in this example:
Original:
This was not the map we found in Billy Bones's chest, but an accurate
copy, complete in all things-names and heights and soundings-with the
single exception of the red crosses and the written notes.

Stemmed:
Thi wa not the map we found in Billi Bone s chest but an accur copi
complet in all thing name and height and sound with the singl except
of the red cross and the written note .
Porter Stemmer
Based on a series of rewrite rules run in series, as a cascade: the
output of each pass is fed as input to the next pass.
Some sample rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε, if the stem contains a vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
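NLTK ships an implementation of the Porter stemmer; a minimal sketch:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["relational", "motoring", "grasses", "copy", "accurate"]:
    print(word, "->", stemmer.stem(word))
# e.g. motoring -> motor, grasses -> grass, copy -> copi (cf. the passage above)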
Lemmatization
Represent words by their lemma, the shared dictionary headword form:
am, are, is → be; dinner, dinners → dinner.
https://www.nltk.org/book/ch03.html
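The NLTK chapter linked above demonstrates the WordNet lemmatizer; a minimal sketch (the 'wordnet' data is a one-time download):

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("women"))            # woman
print(wnl.lemmatize("better", pos="a"))  # good - needs the part-of-speech hint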
Slides adapted from Jure Leskovec