Week 3
Introduction
Tokenizing (segmenting) words
Normalizing word formats
Segmenting sentences
The practical
Introduction
Text normalization is the process of transforming text into a
single canonical form before almost any natural language processing
is applied. It is needed in applications such as:
• predictive text and handwriting recognition
• web search engines
• machine translation, and text analysis to detect sentiment in tweets and blogs
At least three tasks are commonly applied as part of any
normalization process:
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
How many words?
N = number of tokens (running words in the document)
V = vocabulary = the set of types; |V| is the size of the vocabulary
Heaps' Law (also known as Herdan's Law): |V| = kN^β, where often 0.67 < β < 0.75
i.e., vocabulary size grows faster than the square root of the number of word tokens
Corpus                            Tokens = N     Types = |V|
Switchboard phone conversations   2.4 million    20 thousand
Shakespeare                       884,000        31 thousand
COCA                              440 million    2 million
Google N-grams                    1 trillion     13+ million
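A minimal sketch of how Heaps' law can be checked on a corpus (Python; the whitespace tokenizer, the function name heaps_check, and the default k = 10 are illustrative assumptions, since k is corpus-dependent and typically falls between 10 and 100):

import math

def heaps_check(text, k=10):
    # N = number of tokens (naive whitespace tokenization)
    tokens = text.lower().split()
    # V = set of types (distinct word forms)
    types = set(tokens)
    n, v = len(tokens), len(types)
    # Solve |V| = k * N**beta for beta; the estimate is sensitive to k
    beta = math.log(v / k) / math.log(n)
    return n, v, beta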
Tokenizing (segmenting) words
Tokenization separates a chunk of continuous text into individual words.
Tokenization is intimately tied up with named entity recognition, since
names, dates, and organizations often span several tokens.
Space-based tokenization (segmenting off a token between instances of
spaces) works for languages that use space characters between words.
Tokenization needs to run before any other language processing, so it
must be fast.
The standard method is therefore to use deterministic algorithms based
on regular expressions.
Word tokenization is more complex in languages that do not use spaces
to mark potential word boundaries, such as Chinese, Japanese, and Thai.
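As a sketch of such deterministic tokenizers (the regex pattern below is illustrative, not a standard one):

import re

def space_tokenize(text):
    # Space-based tokenization: split on runs of whitespace
    return text.split()

def regex_tokenize(text):
    # One token per run of word characters, or per punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(space_tokenize("O'Neill said: don't."))
# ["O'Neill", 'said:', "don't."]
print(regex_tokenize("O'Neill said: don't."))
# ['O', "'", 'Neill', 'said', ':', 'don', "'", 't', '.']

Note that both treatments of clitics and punctuation are unsatisfactory here, which motivates the issues on the next slide.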
CS3TM20 © XH 4
Issues in Tokenization
• Can't just blindly remove punctuation:
• m.p.h., Ph.D., AT&T, cap’n
• prices ($45.55)
• dates (01/02/06)
• URLs (http://www.stanford.edu)
• hashtags (#nlproc)
• email addresses (someone@cs.colorado.edu)
• Clitic: a part of a word that can't stand on its own
• "are" in we're, French "je" in j'ai, "le" in l'honneur
• When should multiword expressions (MWEs) be treated as single
words?
• New York, rock ’n’ roll
Tokenization in NLTK
Bird, Loper and Klein (2009), Natural Language Processing with Python. O’Reilly
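A short sketch along the lines of the book's chapter 3 (the verbose regex pattern is adapted from the book and is only one possible design; word_tokenize requires a one-time model download):

import nltk
nltk.download('punkt')  # tokenizer models ('punkt_tab' in newer NLTK versions)

text = "That U.S.A. poster-print costs $12.40..."

# Pretrained word tokenizer
print(nltk.word_tokenize(text))

# Regex-based tokenizer; (?x) allows whitespace and comments in the pattern
pattern = r'''(?x)
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():_`-]        # single-character tokens
'''
print(nltk.regexp_tokenize(text, pattern))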
Word Normalization
Case folding
For applications like IR (information retrieval), reduce all letters to
lower case, since users tend to type lower-case queries.
Possible exception: keep upper case that appears mid-sentence, e.g.:
• General Motors
• Fed vs. fed
• SAIL vs. sail
For sentiment analysis, MT, and information extraction, case is
helpful (US versus us is important).
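A minimal illustration (Python; str.casefold is a more aggressive fold than str.lower, handling cases such as German ß):

query = "SAIL researchers in the US"
print(query.lower())        # 'sail researchers in the us' - case distinctions lost
print("Straße".casefold())  # 'strasse' - aggressive folding for matching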
Stemming
Reduce terms to stems, chopping off affixes crudely.
Stemming a word or sentence may produce forms that are not actual
words, as in this example:
Original:
This was not the map we found in Billy Bones's chest, but an accurate
copy, complete in all things-names and heights and soundings-with the
single exception of the red crosses and the written notes.

Stemmed:
Thi wa not the map we found in Billi Bone s chest but an accur copi
complet in all thing name and height and sound with the singl except
of the red cross and the written note .
Porter Stemmer
Based on a series of rewrite rules run in series, as a cascade: the
output of each pass is fed as input to the next pass.
Some sample rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε, if the stem contains a vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
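NLTK ships an implementation of the Porter stemmer; a minimal sketch:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["relational", "motoring", "grasses", "copy", "accurate"]:
    print(word, "->", stemmer.stem(word))
# e.g. motoring -> motor, grasses -> grass, copy -> copi (cf. the passage above)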
Lemmatization
Represent words by their lemma, the shared dictionary headword form:
am, are, is → be; dinner, dinners → dinner.
https://www.nltk.org/book/ch03.html
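The NLTK chapter linked above demonstrates the WordNet lemmatizer; a minimal sketch (the 'wordnet' data is a one-time download):

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("women"))            # woman
print(wnl.lemmatize("better", pos="a"))  # good - needs the part-of-speech hint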
Slides adapted from Jure Leskovec