02 Text Processing
Regular Expressions: Ranges
• Ranges [A-Z]

Pattern   Description               Example match
[A-Z]     An upper case letter      Drenched Blossoms
[a-z]     A lower case letter       my beans were impatient
[0-9]     A single digit            Chapter 1: Down the Rabbit Hole
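A quick sketch of the range patterns above using Python's `re` module (the example strings are the ones from the table):

```python
import re

# Character ranges match any single character in the given span.
assert re.search(r'[A-Z]', 'Drenched Blossoms').group() == 'D'
assert re.search(r'[a-z]', 'my beans were impatient').group() == 'm'
assert re.search(r'[0-9]', 'Chapter 1: Down the Rabbit Hole').group() == '1'
```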
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
  – Caret means negation only when first in []

Pattern examples:
  groundhog|woodchuck
  a|b|c  =  [abc]
  [gG]roundhog|[Ww]oodchuck
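The negation and disjunction patterns above, sketched in Python (the test strings are illustrative):

```python
import re

# [^Ss]: caret negates only when it is the first character inside brackets
assert re.search(r'[^Ss]', 'Ssa').group() == 'a'

# Pipe disjunction and the equivalent bracket expression match the same things
assert bool(re.search(r'groundhog|woodchuck', 'a woodchuck chucks wood'))
assert re.findall(r'a|b|c', 'cab') == re.findall(r'[abc]', 'cab')

# Brackets handle the capitalized variants compactly
assert bool(re.search(r'[gG]roundhog|[Ww]oodchuck', 'Woodchuck!'))
```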
Regular Expressions: ? * + .

Pattern   Matches
colou?r   color colour              (? = optional previous char)
oo*h!     oh! ooh! oooh!            (Kleene *: zero or more of previous char)
o+h!      oh! ooh! oooh!            (Kleene +: one or more of previous char)
beg.n     begin begun began         (. = any character)
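The same operators exercised in Python, using full-string matches:

```python
import re

# ? makes the previous char optional; * and + are the Kleene operators; . is any char
assert re.fullmatch(r'colou?r', 'color') and re.fullmatch(r'colou?r', 'colour')
assert re.fullmatch(r'oo*h!', 'oh!')    # Kleene *: zero or more extra 'o's
assert re.fullmatch(r'o+h!', 'oooh!')   # Kleene +: one or more 'o's
assert re.fullmatch(r'beg.n', 'begun')  # . matches any single character
```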
Regular Expressions: Anchors ^ $
• ^ matches at the start of a line; $ matches at the end

Pattern       Matches
^[^A-Za-z]    1 “Hello”
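A small Python check of the anchor behavior (example strings are illustrative):

```python
import re

# ^ anchors the match at the start; here: a line beginning with a non-letter
assert re.search(r'^[^A-Za-z]', '1 "Hello"').group() == '1'
assert re.search(r'^[^A-Za-z]', 'Hello') is None

# $ anchors at the end of the string
assert bool(re.search(r'dog$', 'hot dog'))
```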
Basic Text Processing
Word tokenization
Text Normalization
they lay back on the San Francisco grass and looked at the stars and their
• N = number of tokens
• V = vocabulary = set of types
  |V| is the size of the vocabulary
• Church and Gale (1990): |V| > O(N½)
23243  the
22225  i
18618  and
16339  to
15687  of
12780  a
12163  you
10839  my
10005  in
 8954  d        ← What happened here?
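A frequency list like the one above can be sketched with `collections.Counter`; the text here is a tiny stand-in for the corpus, not the actual data, but it reproduces the problem: splitting on non-letters turns "kill'd" into "kill" plus a stray token "d" — the mystery entry in the list above.

```python
import re
from collections import Counter

# Stand-in text; splitting on non-letter characters strands the "d" of "kill'd"
text = "I am kill'd! He hath kill'd me, and I am slain."
tokens = re.findall(r'[A-Za-z]+', text.lower())
counts = Counter(tokens)

assert counts['d'] == 2      # the orphaned clitic shows up as its own "word"
assert counts['kill'] == 2
```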
Issues in Tokenization
• French
  – L'ensemble: one token or two?
    • L ? L' ? Le ?
    • Want l'ensemble to match with un ensemble
• Morphemes:
  – The small meaningful units that make up words
  – Stems: the core meaning-bearing units
  – Affixes: bits and pieces that adhere to stems
    • Often with grammatical functions
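One way to handle the clitic issue is to split the elided article off as its own token; `tokenize_fr` below is a hypothetical toy tokenizer, not a production solution:

```python
import re

def tokenize_fr(text):
    """Hypothetical French tokenizer: separate elided articles like l'/d'/j'
    from the following word, so l'ensemble can match un ensemble on 'ensemble'."""
    text = re.sub(r"\b([ldjmnst])'", r"\1' ", text, flags=re.IGNORECASE)
    return text.split()

assert tokenize_fr("l'ensemble") == ["l'", "ensemble"]
assert "ensemble" in tokenize_fr("un ensemble")
```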
Stemming
• Reduce terms to their stems in information retrieval
• Stemming is crude chopping of affixes
– language dependent
– e.g., automate(s), automatic, automation all reduced to
automat.
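A toy illustration of crude suffix-chopping; `crude_stem` is a made-up sketch, not the Porter stemmer real systems use:

```python
def crude_stem(word):
    """Crude, hypothetical stemmer: chop a few common English suffixes.
    Order matters: try longer suffixes first."""
    word = word.lower()
    for suffix, repl in [('ations', 'at'), ('ation', 'at'),
                         ('ic', ''), ('es', ''), ('ed', ''),
                         ('s', ''), ('e', '')]:
        if word.endswith(suffix):
            return word[:-len(suffix)] + repl
    return word

# The slide's example: all three collapse to "automat"
for w in ['automates', 'automatic', 'automation']:
    assert crude_stem(w) == 'automat'
```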
tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr

 1312 King
  548 being
  541 nothing
  388 king
  375 bring
  358 thing
  307 ring
  152 something
  145 coming
  130 morning

tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr

  548 being
  541 nothing
  152 something
  145 coming
  130 morning
  122 having
  120 living
  117 loving
  116 Being
  102 going
Dealing with complex morphology is sometimes necessary
Sentence Segmentation
and Decision Trees
Sentence Segmentation
• Numeric features
  – Length of word with “.”
  – Probability(word with “.” occurs at end-of-sentence)
  – Probability(word after “.” occurs at beginning-of-sentence)
Implementing Decision Trees
• A decision tree is just an if-then-else statement
• The interesting research is choosing the features
• Setting up the structure is often too hard to do by hand
– Hand-building only possible for very simple features, domains
• For numeric features, it’s too hard to pick each threshold
– Instead, structure usually learned by machine learning from a
training corpus
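A hand-built tree of the kind described above really is just nested if-then-else. The features and the abbreviation list here are illustrative assumptions; in practice the structure and thresholds would be learned from a training corpus:

```python
# Hypothetical hand-built decision tree for sentence-boundary detection.
ABBREVS = {'dr.', 'mr.', 'mrs.', 'etc.'}  # toy abbreviation list (assumed)

def is_sentence_boundary(word, next_word):
    # Root split: does the token even end in a period?
    if not word.endswith('.'):
        return False
    # Known abbreviation: usually not a boundary
    if word.lower() in ABBREVS:
        return False
    # Next word capitalized: likely the start of a new sentence
    return next_word[:1].isupper()

assert is_sentence_boundary('stars.', 'The')
assert not is_sentence_boundary('Dr.', 'Smith')
assert not is_sentence_boundary('grass', 'and')
```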
Decision Trees and other classifiers