2 TextProc 2023

Uploaded by

KAUSHIK KADIUM

Words and Corpora

Basic Text Processing
How many words in a sentence?
"I do uh main- mainly business data processing"
◦ Fragments, filled pauses
"Seuss’s cat in the hat is different from other cats!"
◦ Lemma: same stem, part of speech, rough word sense
◦ cat and cats = same lemma
◦ Wordform: the full inflected surface form
◦ cat and cats = different wordforms
How many words in a sentence?

they lay back on the San Francisco grass and looked at the stars
and their

Type: an element of the vocabulary.

Token: an instance of that type in running text.
How many?
◦ 15 tokens (or 14)
◦ 13 types (or 12) (or 11?)
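The counts above can be checked with a minimal sketch, assuming the simplest possible tokenizer (splitting on whitespace) and treating each distinct string as a type. The variable names are illustrative only.

```python
# Whitespace tokenization of the example sentence from the slide.
sentence = ("they lay back on the San Francisco grass "
            "and looked at the stars and their")

tokens = sentence.split()   # every running word is a token
types = set(tokens)         # distinct strings are types

print(len(tokens))  # number of tokens
print(len(types))   # number of types ("the" and "and" each occur twice)
```

Counting "San Francisco" as one word, or merging "they" and "their" into one lemma, gives the lower alternative counts listed on the slide.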
How many words in a corpus?
N = number of tokens
V = vocabulary = set of types, |V| is size of vocabulary
Heaps' Law = Herdan's Law: |V| = kN^β, where often .67 < β < .75
i.e., vocabulary size grows faster than the square root of the number of word tokens

This relationship between the number of types |V| and the number of tokens N is called Herdan's Law (Herdan, 1960) or Heaps' Law, after its discoverers (in linguistics and information retrieval respectively); k and β are positive constants, and 0 < β < 1.

Corpus                             Tokens = N     Types = |V|
Switchboard phone conversations    2.4 million    20 thousand
Shakespeare                        884,000        31 thousand
COCA                               440 million    2 million
Google N-grams                     1 trillion     13+ million
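To make the Heaps/Herdan relationship concrete, here is a small sketch. The function names and the constants k = 30 and β = 0.7 are assumptions chosen for illustration, not values fitted to any of the corpora above.

```python
def heaps_vocab_size(n_tokens, k, beta):
    """Predicted vocabulary size |V| = k * N**beta."""
    return k * n_tokens ** beta

def vocab_growth(tokens):
    """Return (N, |V|) after each token, i.e. the empirical growth curve."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve

# Hypothetical constants, for illustration only.
K, BETA = 30, 0.7

# Doubling the corpus multiplies the predicted vocabulary by 2**beta,
# which is more than sqrt(2) but less than 2 when 0.5 < beta < 1.
ratio = heaps_vocab_size(2000, K, BETA) / heaps_vocab_size(1000, K, BETA)
print(ratio)

print(vocab_growth("a b a c".split()))
```

Plotting the output of vocab_growth on log-log axes for a real corpus should give a roughly straight line whose slope estimates β.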
Corpora
Words don't appear out of nowhere!
A text is produced by
• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific function.
Corpora vary along dimensions like
◦ Language: 7097 languages in the world
◦ Variety, like African American Language varieties.
◦ AAE Twitter posts might include forms like "iont" (I don't)
◦ Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]
H/E: dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
◦ Genre: newswire, fiction, scientific articles, Wikipedia
◦ Author Demographics: writer's age, gender, ethnicity, SES
Corpus datasheets
Gebru et al. (2020), Bender and Friedman (2018)

Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample, how was it sampled? Was there consent? Pre-processing?
+Annotation process, language variety, demographics, etc.
Word tokenization
Basic Text Processing
Text Normalization

Every NLP task requires text normalization:

1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
Space-based tokenization
A very simple way to tokenize
◦ For languages that use space characters between words
◦ Arabic, Cyrillic, Greek, Latin, etc., based writing systems
◦ Segment off a token between instances of spaces
Unix tools for space-based tokenization
◦ The "tr" command
◦ Inspired by Ken Church's UNIX for Poets
◦ Given a text file, output the word tokens and their frequencies
Simple Tokenization in UNIX
tr -sc 'A-Za-z' '\n' < shakes.txt |   # Change all non-alpha to newlines
  sort |                              # Sort in alphabetical order
  uniq -c                             # Merge and count each type

1945 A
  72 AARON
  19 ABBESS
   5 ABBOT
 ...
  25 Aaron
   6 Abate
   1 Abates
   5 Abbess
   6 Abbey
   3 Abbot
 ...
The first step: tokenizing
tr -sc 'A-Za-z' '\n' < shakes.txt | head

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
A
A
A
A
A
A
A
A
...
More counting
Merging upper and lower case
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c

Sorting the counts

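tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
 8954 d

What happened here? The apostrophe is not alphabetic, so forms like bless'd and I'd are split, leaving d as a token of its own.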
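For readers without a UNIX shell, the same pipeline can be approximated in Python. This is a sketch of the idea rather than an exact port of tr, sort and uniq; the function name word_frequencies is made up for this example.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, keep maximal runs of ASCII letters as tokens,
    and return (token, count) pairs sorted by descending count."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(tokens).most_common()

sample = "The cat and the hat. I'd say the cat!"
freqs = word_frequencies(sample)
for word, count in freqs[:3]:
    print(count, word)
```

As with the tr version, the apostrophe in "I'd" is treated as a separator, so "d" shows up as a token.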
Issues in Tokenization
Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses (someone@cs.colorado.edu)
Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
Tokenization in NLTK
Bird, Loper and Klein (2009), Natural Language Processing with Python. O'Reilly

A basic regular expression can be used to tokenize with the nltk.regexp_tokenize function of the Python-based Natural Language Toolkit (NLTK) (Bird et al. 2009; http://www.nltk.org).

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*            # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                  # ellipsis
...   | [][.,;"'?():_`-]        # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Figure 2.12 A Python trace of regular expression tokenization in the NLTK Python-based natural language processing toolkit (Bird et al., 2009), commented for readability; the (?x) verbose flag tells Python to strip comments and whitespace. Figure from Chapter 3 of Bird et al. (2009).

Note: the groups are written as non-capturing, (?:...), because the tokenizer returns group contents rather than whole matches when a pattern contains capturing groups. The hyphen is placed last in the character class so that it is read as a literal hyphen and not as a range.
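The same kind of pattern can be tried without installing NLTK. The sketch below uses only the standard-library re module with re.findall. The helper name tokenize is an assumption for this example, and the groups are written as non-capturing so that findall returns whole matches.

```python
import re

# Verbose-mode pattern: whitespace and comments inside it are ignored.
PATTERN = re.compile(r"""
      (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*           # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?     # currency and percentages, e.g. $12.40
    | \.\.\.                 # ellipsis
    | [\]\[.,;"'?():_`-]     # punctuation marks as separate tokens
""", re.VERBOSE)

def tokenize(text):
    """Return all non-overlapping matches of PATTERN, left to right."""
    return PATTERN.findall(text)

tokens = tokenize("That U.S.A. poster-print costs $12.40...")
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

The order of the alternatives matters: the regex engine takes the first alternative that matches at each position, so abbreviations are tried before plain words.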
Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't
use spaces to separate words!

How do we decide where the token boundaries should be?
Word tokenization in Chinese
Chinese words are composed of characters called
"hanzi" (or sometimes just "zi")
Each one represents a meaning unit called a morpheme.
Each word has on average 2.4 of them.
But deciding what counts as a word is complex and not
agreed upon.
How to do word tokenization in Chinese?

姚明进入总决赛 “Yao Ming reaches the finals”

3 words?
姚明 进入 总决赛
YaoMing reaches finals
5 words?
姚 明 进入 总 决赛
Yao Ming reaches overall finals
7 characters? (don't use words at all):
姚 明 进 入 总 决 赛
Yao Ming enter enter overall decision game
Word tokenization / segmentation
So in Chinese it's common to just treat each character
(zi) as a token.
• So the segmentation step is very simple
In other languages (like Thai and Japanese), more
complex word segmentation is required.
• The standard algorithms are neural sequence models
trained by supervised machine learning.
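Character-level tokenization, as described for Chinese above, needs almost no code. A minimal sketch, assuming the input is a Python string and that whitespace should be dropped (the function name is illustrative):

```python
def char_tokenize(text):
    """Treat every non-whitespace character as one token."""
    return [ch for ch in text if not ch.isspace()]

tokens = char_tokenize("姚明进入总决赛")
print(tokens)
print(len(tokens))  # one token per character
```

Word-level segmentation for languages such as Thai or Japanese cannot be done this way and needs a trained model, as the slide notes.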
Word Normalization and other issues
Basic Text Processing
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Case folding
Applications like IR: reduce all letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
◦ Fed vs. fed
◦ SAIL vs. sail

For sentiment analysis, MT, Information extraction

◦ Case is helpful (US versus us is important)
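The trade-off can be seen in a tiny sketch. The token list is made up for illustration.

```python
def case_fold(tokens):
    """Map every token to lower case."""
    return [t.lower() for t in tokens]

tokens = ["US", "us", "Fed", "fed"]

print(len(set(tokens)))             # distinct types before folding
print(len(set(case_fold(tokens))))  # distinct types after folding
```

Folding halves the vocabulary here, which helps matching in IR, but it also erases the US/us and Fed/fed distinctions that sentiment analysis, MT, or information extraction may need.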
Lemmatization

Represent all words as their lemma, their shared root = dictionary headword form:
◦ am, are, is → be
◦ car, cars, car's, cars' → car
◦ Spanish quiero (‘I want’), quieres (‘you want’) → querer ‘want’
◦ He is reading detective stories → He be read detective story
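As a rough illustration of the mapping only, the sketch below lemmatizes by table lookup. The LEMMAS table is a hypothetical hand-built dictionary covering just the slide's examples. Real lemmatizers rely on morphological parsing, not a fixed list.

```python
# Hypothetical lookup table: wordform -> lemma (slide examples only).
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "reading": "read", "stories": "story",
}

def lemmatize(tokens):
    """Replace each token by its lemma if known, else leave it unchanged."""
    return [LEMMAS.get(tok.lower(), tok) for tok in tokens]

result = lemmatize("He is reading detective stories".split())
print(" ".join(result))
```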
Lemmatization is done by Morphological Parsing
Morphemes:
◦ The small meaningful units that make up words
◦ Stems: The core meaning-bearing units
◦ Affixes: Parts that adhere to stems, often with grammatical
functions
Morphological Parsers:
◦ Parse cats into two morphemes cat and s
◦ Parse Spanish amaren (‘if in the future they would love’) into
morpheme amar ‘to love’, and the morphological features
3PL and future subjunctive.
Stemming
Reduce terms to stems, chopping off affixes crudely

Original text:
This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes.

Stemmed output:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note

Porter Stemmer
Based on a series of rewrite rules run in series
◦ A cascade, in which output of each pass fed to next pass
Some sample rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)

Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.) can be found on Martin Porter's homepage; see also the original paper (Porter, 1980).
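The cascade idea can be sketched with just the three sample rules. This is a toy, not the Porter algorithm: the real stemmer has many more rules and conditions, and the vowel test here (a, e, i, o, u only) is a simplifying assumption.

```python
import re

VOWEL = re.compile(r"[aeiou]")  # simplification: y is never a vowel

def toy_stem(word):
    """Apply three sample rewrite rules in sequence; each pass sees
    the output of the previous one."""
    w = word.lower()
    # Pass 1: SSES -> SS
    if w.endswith("sses"):
        w = w[:-2]
    # Pass 2: ING -> (empty) if the remaining stem contains a vowel
    if w.endswith("ing") and VOWEL.search(w[:-3]):
        w = w[:-3]
    # Pass 3: ATIONAL -> ATE
    if w.endswith("ational"):
        w = w[:-7] + "ate"
    return w

for w in ["grasses", "motoring", "relational", "sing"]:
    print(w, "->", toy_stem(w))
```

The word "sing" is left alone because stripping "ing" would leave a stem with no vowel, which is the condition attached to the ING rule.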
Simple stemmers can be useful in cases where we need to collapse across differ-
Dealing with complex morphology is necessary
for many languages
◦ e.g., the Turkish word:
◦ Uygarlastiramadiklarimizdanmissinizcasina
◦ ‘(behaving) as if you are among those whom we could not civilize’
◦ Uygar ‘civilized’ + las ‘become’
+ tir ‘cause’ + ama ‘not able’
+ dik ‘past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Sentence Segmentation
!, ? mostly unambiguous but period “.” is very ambiguous
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: Tokenize first: use rules or ML to
classify a period as either (a) part of the word or (b) a
sentence-boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules
based on this tokenization.
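A minimal rule-based sketch of this approach follows. It assumes whitespace tokenization and a tiny hand-made abbreviation set. ABBREVIATIONS and split_sentences are illustrative names, not part of any library.

```python
# Hypothetical abbreviation dictionary (lower-cased, with final period).
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "e.g.", "i.e."}

def split_sentences(text):
    """Split on tokens ending in ! or ?, and on tokens ending in a
    period unless the token is a known abbreviation."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith(("!", "?")):
            boundary = True
        elif tok.endswith("."):
            boundary = tok.lower() not in ABBREVIATIONS
        else:
            boundary = False
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith paid 4.3 dollars to Acme Inc. yesterday. Was it worth it? Yes!"
sentences = split_sentences(text)
for s in sentences:
    print(s)
```

Numbers such as 4.3 are safe here only because the period is token-internal. An abbreviation that also ends a sentence would be missed, which is one reason a learned classifier is often used instead of rules alone.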