2. Basic Text Processing
Katrien Beuls
Artificial Intelligence Laboratory
Vrije Universiteit Brussel
BASIC TEXT PROCESSING
Regular Expressions, Text Normalization, Edit Distance
ELIZA
REGULAR EXPRESSIONS
DISJUNCTIONS
▶ Ranges: [A-Z]

Pattern   Description            Matches
[A-Z]     an upper case letter   Drenched Blossoms
[a-z]     a lower case letter    my beans were impatient
[0-9]     a single digit         Chapter 1: Down ...
REGULAR EXPRESSIONS
NEGATION IN DISJUNCTION
▶ Negations: [^Ss]
  The caret ^ negates only when it is the first symbol inside the brackets.

Pattern   Description                Matches
[^A-Z]    not an upper case letter   Oyfn pripetchik
[^Ss]     neither 'S' nor 's'        I have no reason
[^\.]     not a period               our resident Djinn
a^b       the pattern 'a^b'          look up a^b now
REGULAR EXPRESSIONS
MORE DISJUNCTION
Pattern               Matches
groundhog|woodchuck   groundhog, woodchuck
gupp(y|ies)           guppy, guppies

a|b|c  =  [abc]
[gG]roundhog|[Ww]oodchuck
REGULAR EXPRESSIONS
? * + .
Pattern   Description                  Matches
colou?r   optional previous char       color colour
oo*h!     0 or more of previous char   oh! ooh! oooh! ooooh!
o+h!      1 or more of previous char   oh! ooh! oooh! ooooh!
baa+      1 or more of previous char   baa baaa baaaa baaaaa
beg.n     . matches any one char       begin begun began beg3n
REGULAR EXPRESSIONS
ANCHORS ^ $
Pattern      Description                        Matches
^[A-Z]       upper case letter at line start    Palo Alto
^[^A-Za-z]   non-letter at line start           1 “Hello”
\.$          period at line end                 The end.
.$           any character at line end          The end? The end!
REGULAR EXPRESSIONS
EXERCISE
▶ /the/
  Misses capitalised examples
▶ /[tT]he/
  Incorrectly matches other or theology
▶ /[^a-zA-Z][tT]he[^a-zA-Z]/
  Does not match “the” when it begins a line
▶ /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
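A quick sketch in Python of how each refinement of the pattern behaves; the test sentence is invented for illustration:

```python
import re

# The successively refined patterns from the exercise.
patterns = [
    r"the",
    r"[tT]he",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",
]

text = "The theology student said the answer is in the book."

for p in patterns:
    # findall returns group tuples when the pattern has groups,
    # so count the matches with finditer instead
    n = len(list(re.finditer(p, text)))
    print(p, "->", n, "match(es)")
```

The first pattern misses the capitalised The, the second also matches inside theology, the third drops the line-initial The again, and the final pattern finds exactly the three real occurrences.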
REGULAR EXPRESSIONS
OPERATOR PRECEDENCE HIERARCHY
Parenthesis             ()
Counters                * + ? {}
Sequences and anchors   the ^my end$
Disjunction             |
REGULAR EXPRESSIONS
SUBSTITUTIONS
REGULAR EXPRESSIONS
CAPTURE GROUPS
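Substitutions and capture groups go together in Python's `re.sub`: parentheses capture part of the match, and `\1` in the replacement refers back to what the first group matched. The examples below are our own illustrations:

```python
import re

# s/colour/color/ -- a plain substitution
print(re.sub(r"colour", "color", "My favourite colour is blue"))

# Capture groups: \1 copies whatever the first (...) matched.
# Here we put angle brackets around every integer:
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes held 12 items"))
```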
REGULAR EXPRESSIONS
ELIZA OR A SIMPLE CHATBOT
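ELIZA works by cascading exactly this kind of regular expression substitution: a ranked list of (pattern, response) rules, where the first matching rule fires and capture groups copy the user's words into the response. A minimal sketch follows; the rules are invented for illustration and are not Weizenbaum's original script:

```python
import re

# A few ELIZA-style rules (illustrative, not the original ELIZA script).
# Each rule is (pattern, response template); \1 copies a captured group.
rules = [
    (r".*\bI[' ]?a?m (depressed|sad)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_respond(sentence):
    # Apply the first rule whose pattern matches the input.
    for pattern, template in rules:
        if re.match(pattern, sentence, flags=re.IGNORECASE):
            return re.sub(pattern, template, sentence,
                          flags=re.IGNORECASE).upper()
    return "PLEASE GO ON"

print(eliza_respond("I am sad"))            # I AM SORRY TO HEAR YOU ARE SAD
print(eliza_respond("They are all alike"))  # IN WHAT WAY
```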
WORDS
WHAT COUNTS AS A WORD?
WORDS
HOW MANY WORDS ARE THERE IN ENGLISH?
WORDS
POPULAR ENGLISH LANGUAGE CORPORA
Herdan's Law (Heaps' Law):  |V| = kN^β
where |V| is the vocabulary size, N the number of tokens in the corpus, and
typically k ∈ [10, 100] and 0.67 < β < 0.75.
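The vocabulary growth predicted by Heaps' Law can be sketched in a few lines. The values of k and β below are illustrative choices within the commonly cited ranges, not constants from any particular corpus:

```python
# Herdan/Heaps' law sketch: |V| = k * N**beta.
# k and beta are corpus-dependent; the values below are illustrative,
# chosen inside the commonly cited ranges (k in [10, 100], beta in [0.67, 0.75]).
k, beta = 50, 0.7

for n_tokens in [1_000_000, 100_000_000, 1_000_000_000]:
    vocab = k * n_tokens ** beta
    print(f"N = {n_tokens:>13,}  ->  |V| ~ {vocab:,.0f}")
```

Note that the vocabulary keeps growing with corpus size: a billion-token corpus still yields new word types.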
WORDS
DICTIONARY ENTRIES
TEXT NORMALIZATION
FIRST NLP TASK
TEXT NORMALIZATION
WORD TOKENIZATION
Main challenges:
▶ Break off punctuation as a separate token, but preserve it when it occurs
  word-internally (Ph.D., AT&T, ...)
▶ Keep special characters and numbers in prices ($45.55) and dates (15/02/2019)
▶ Expand clitic contractions that are marked by apostrophes (we’re → we are)
▶ Tokenize multiword expressions like New York or rock ’n’ roll as a single token
Some languages, by contrast, use a comma to mark the decimal point, and spaces
(or sometimes periods) where English puts commas, for example, 555 500,50.

TEXT NORMALIZATION
PENN TREEBANK TOKENIZATION

A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, for example, converting what’re to the two tokens what are, and
we’re to we are. A clitic is a part of a word that can’t stand on its own, and
can only occur when it is attached to another word. Some such contractions occur
in other alphabetic languages, including articles and pronouns in French (j’ai,
l’homme).

Depending on the application, tokenization algorithms may also tokenize
multiword expressions like New York or rock ’n’ roll as a single token, which
requires a multiword expression dictionary of some sort. Tokenization is thus
intimately tied up with named entity detection, the task of detecting names,
dates, and organizations (Chapter 17).

One commonly used tokenization standard is known as the Penn Treebank
tokenization standard, used for the parsed corpora (treebanks) released by the
Linguistic Data Consortium (LDC), the source of many useful datasets. This
standard separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated
words together, and separates out all punctuation:

Input:  “The San Francisco-based restaurant,” they said, “doesn’t charge $10”.
Output: “ The San Francisco-based restaurant , ” they said , “ does n’t charge $ 10 ” .

Tokens can also be normalized, in which a single normalized form is chosen for
words with multiple forms like USA and US or uh-huh and uhhuh. This
standardization may be valuable, despite the spelling information that is lost in
the normalization process. For information retrieval, we might want a query for
US to match a document that has USA; for information extraction we might want to
extract coherent information that is consistent across differently-spelled
instances.
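A toy tokenizer handling some of the challenges discussed above can be written as a single verbose regular expression, in the spirit of the well-known NLTK `regexp_tokenize` example. This is a sketch, not the actual Penn Treebank tokenizer:

```python
import re

# A toy regex tokenizer: abbreviations, prices and dates stay together,
# remaining punctuation becomes separate tokens.
pattern = r"""(?x)           # verbose mode: whitespace and comments ignored
      (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
    | \$?\d+(?:[./]\d+)*%?   # prices, percentages, dates: $12.40, 82%, 15/02/2019
    | \w+(?:-\w+)*           # words with optional internal hyphens
    | \.\.\.                 # ellipsis
    | [][.,;"'?():_`-]       # remaining punctuation as separate tokens
"""

text = 'That U.S.A. poster-print costs $12.40 on 15/02/2019...'
print(re.findall(pattern, text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', 'on', '15/02/2019', '...']
```

Note what a single pattern cannot do: clitic expansion (doesn’t → does n’t) and multiword expressions need extra rules or a dictionary on top.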
TEXT NORMALIZATION
NORMALIZING TOKENS
TEXT NORMALIZATION
CASE FOLDING
TEXT NORMALIZATION
EFFICIENCY
TEXT NORMALIZATION
COLLAPSING WORDS
TEXT NORMALIZATION
COLLAPSING WORDS

The most sophisticated methods for lemmatization involve complete morphological
parsing of the word. Morphology is the study of the way words are built up from
smaller meaning-bearing units called morphemes. Two broad classes of morphemes
can be distinguished: stems (the central morpheme of the word, supplying the
main meaning) and affixes (adding “additional” meanings of various kinds). So,
for example, the word fox consists of one morpheme (the morpheme fox) and the
word cats consists of two: the morpheme cat and the morpheme -s. A morphological
parser takes a word like cats and parses it into the two morphemes cat and s, or
a Spanish word like amaren (‘if in the future they would love’) into the
morphemes amar ‘to love’, 3PL, and future subjunctive.
TEXT NORMALIZATION
SENTENCE SEGMENTATION
MINIMUM EDIT DISTANCE
STRING SIMILARITY
MINIMUM EDIT DISTANCE
EXAMPLE ALIGNMENT

I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s

The final row is an operation list for converting the top string into the bottom
string: d for deletion, s for substitution, i for insertion.
MINIMUM EDIT DISTANCE
LEVENSHTEIN DISTANCE
MINIMUM EDIT DISTANCE
ALGORITHM
MINIMUM EDIT DISTANCE
DYNAMIC PROGRAMMING
Dynamic programming is a table-driven method to solve problems by combining
solutions to sub-problems. Some of the most commonly used algorithms in natural
language processing make use of dynamic programming, such as the Viterbi
algorithm (Chapter 8) and the CKY algorithm for parsing (Chapter 11).

MINIMUM EDIT DISTANCE
SHORTEST PATH

The intuition of a dynamic programming problem is that a large problem can be
solved by properly combining the solutions to various sub-problems. Consider the
shortest path of transformed words that represents the minimum edit distance
between the strings intention and execution shown in Fig. 2.15.

i n t e n t i o n
    delete i
  n t e n t i o n
    substitute n by e
  e t e n t i o n
    substitute t by x
  e x e n t i o n
    insert u
  e x e n u t i o n
    substitute n by c
  e x e c u t i o n

Figure 2.15 Path from intention to execution.

Imagine some string (perhaps it is exention) that is in this optimal path
(whatever it is). The intuition of dynamic programming is that if exention is in
the optimal operation list, then the optimal sequence must also include the
optimal path from intention to exention.
MINIMUM EDIT DISTANCE
ALGORITHM

function MIN-EDIT-DISTANCE(source, target) returns min-distance

 n ← LENGTH(source)
 m ← LENGTH(target)
 Create a distance matrix distance[n+1, m+1]

 # Initialization: the zeroth row and column is the distance from the empty string
 D[0,0] = 0
 for each row i from 1 to n do
     D[i,0] ← D[i-1,0] + del-cost(source[i])
 for each column j from 1 to m do
     D[0,j] ← D[0,j-1] + ins-cost(target[j])

 # Recurrence relation:
 for each row i from 1 to n do
     for each column j from 1 to m do
         D[i,j] ← MIN( D[i-1,j] + del-cost(source[i]),
                       D[i-1,j-1] + sub-cost(source[i], target[j]),
                       D[i,j-1] + ins-cost(target[j]) )

 # Termination
 return D[n,m]

Figure 2.16 The minimum edit distance algorithm, an example of the class of
dynamic programming algorithms. The various costs can either be fixed (e.g., ∀x,
ins-cost(x) = 1) or can be specific to the letter (to model the fact that some
letters are more likely to be inserted than others). We assume that there is no
cost for substituting a letter for itself (i.e., sub-cost(x, x) = 0).
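The pseudocode of Fig. 2.16 translates almost line for line into Python. The sketch below uses the Levenshtein costs assumed in Fig. 2.17: insertion and deletion cost 1, substitution cost 2 (and 0 when the letters match):

```python
def min_edit_distance(source, target, del_cost=1, ins_cost=1, sub_cost=2):
    """Minimum edit distance (Fig. 2.16) with Levenshtein costs by default."""
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: the zeroth row and column is the distance
    # from the empty string
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost

    # Recurrence relation
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,   # deletion
                          D[i - 1][j - 1] + sub,    # substitution (or match)
                          D[i][j - 1] + ins_cost)   # insertion

    # Termination
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8
```

The result, 8, matches the lower-right cell of the matrix in Fig. 2.17.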
MINIMUM EDIT DISTANCE
THE EDIT DISTANCE MATRIX
Src\Tar # e x e c u t i o n
# 0 1 2 3 4 5 6 7 8 9
i 1 2 3 4 5 6 7 6 7 8
n 2 3 4 5 6 7 8 7 8 7
t 3 4 5 6 7 8 7 8 9 8
e 4 3 4 5 6 7 8 9 10 9
n 5 4 5 6 7 8 9 10 11 10
t 6 5 6 7 8 9 8 9 10 11
i 7 6 7 8 9 10 9 8 9 10
o 8 7 8 9 10 11 10 9 8 9
n 9 8 9 10 11 12 11 10 9 8
Figure 2.17 Computation of minimum edit distance between intention and execution with
the algorithm of Fig. 2.16, using Levenshtein distance with cost of 1 for
insertions or deletions, 2 for substitutions.
MINIMUM EDIT DISTANCE
PRODUCING AN ALIGNMENT

Producing an alignment requires only a slight change to the minimum edit
distance algorithm: store pointers in each cell and compute the backtrace to
output an alignment.
      #     e     x     e     c     u     t     i     o     n
 #    0    ←1    ←2    ←3    ←4    ←5    ←6    ←7    ←8    ←9
 i   ↑1  ↖←↑2  ↖←↑3  ↖←↑4  ↖←↑5  ↖←↑6  ↖←↑7    ↖6    ←7    ←8
 n   ↑2  ↖←↑3  ↖←↑4  ↖←↑5  ↖←↑6  ↖←↑7  ↖←↑8    ↑7  ↖←↑8    ↖7
 t   ↑3  ↖←↑4  ↖←↑5  ↖←↑6  ↖←↑7  ↖←↑8    ↖7   ←↑8  ↖←↑9    ↑8
 e   ↑4    ↖3    ←4   ↖←5    ←6    ←7   ←↑8  ↖←↑9 ↖←↑10    ↑9
 n   ↑5    ↑4  ↖←↑5  ↖←↑6  ↖←↑7  ↖←↑8  ↖←↑9 ↖←↑10 ↖←↑11  ↖↑10
 t   ↑6    ↑5  ↖←↑6  ↖←↑7  ↖←↑8  ↖←↑9    ↖8    ←9   ←10  ←↑11
 i   ↑7    ↑6  ↖←↑7  ↖←↑8  ↖←↑9 ↖←↑10    ↑9    ↖8    ←9   ←10
 o   ↑8    ↑7  ↖←↑8  ↖←↑9 ↖←↑10 ↖←↑11   ↑10    ↑9    ↖8    ←9
 n   ↑9    ↑8  ↖←↑9 ↖←↑10 ↖←↑11 ↖←↑12   ↑11   ↑10    ↑9    ↖8
Figure 2.18 When entering a value in each cell, we mark which of the three neighboring
cells we came from with up to three arrows. After the table is full we compute an alignment
(minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and
following the arrows back. The sequence of bold cells represents one possible minimum cost
alignment between the two strings. Diagram design after Gusfield (1997).
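The backtrace idea of Fig. 2.18 can be sketched by storing, for each cell, one pointer to the neighbour it came from (rather than the full set of arrows), then walking back from the lower-right corner. The deletion symbol `*` follows the alignment notation used earlier:

```python
def min_edit_alignment(source, target):
    """Minimum edit distance with backpointers; returns one minimum-cost
    alignment as a pair of strings, using * for insertions/deletions."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]  # 'diag', 'up' or 'left'

    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "up"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            # keep only one best predecessor per cell
            D[i][j], ptr[i][j] = min([(D[i - 1][j - 1] + sub, "diag"),
                                      (D[i - 1][j] + 1, "up"),
                                      (D[i][j - 1] + 1, "left")])

    # Backtrace from the lower-right corner, following the pointers
    aligned_src, aligned_tgt = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if ptr[i][j] == "diag":             # match or substitution
            aligned_src.append(source[i - 1])
            aligned_tgt.append(target[j - 1])
            i, j = i - 1, j - 1
        elif ptr[i][j] == "up":             # deletion from source
            aligned_src.append(source[i - 1])
            aligned_tgt.append("*")
            i -= 1
        else:                               # 'left': insertion into source
            aligned_src.append("*")
            aligned_tgt.append(target[j - 1])
            j -= 1
    return "".join(reversed(aligned_src)), "".join(reversed(aligned_tgt))

src, tgt = min_edit_alignment("intention", "execution")
print(src)
print(tgt)
```

Ties between predecessors are broken arbitrarily here, so this prints one of the several possible minimum-cost alignments between intention and execution.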
MINIMUM EDIT DISTANCE
EXTENSIONS
SUMMARY