Regular Expression and BPE
Regular Expression and BPE
Processing
Regular Expressions
Regular expressions
A formal language for specifying text strings
How can we search for any of these?
◦ woodchuck
◦ woodchucks
◦ Woodchuck
◦ Woodchucks
Regular Expressions: Disjunctions
Letters inside square brackets []
Pattern Matches
[wW]oodchuck Woodchuck, woodchuck
[1234567890] Any digit
Ranges [A-Z]
Pattern Matches
[A-Z] An upper case letter Drenched Blossoms
[a-z] A lower case letter my beans were impatient
[0-9] A single digit Chapter 1: Down the Rabbit Hole
Regular Expressions: Negation in Disjunction
Negations [^Ss]
◦ Carat means negation only when first in []
Pattern Matches
[^A-Z] Not an upper case Oyfn pripetchik
letter
[^Ss] Neither ‘S’ nor ‘s’ I have no exquisite reason”
[^e^] Neither e nor ^ Look here
a^b The pattern a carat b Look up a^b now
Regular Expressions: More Disjunction
Woodchuck is another name for groundhog!
The pipe | for disjunction
Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
Regular Expressions: ? *+.
Pattern Matches
colou?r Optional color colour
previous char
oo*h! 0 or more of oh! ooh! oooh! ooooh!
previous char
o+h! 1 or more of oh! ooh! oooh! ooooh!
previous char
baa+ baa baaa baaaa baaaaa
Stephen C Kleene
beg.n begin begun begun beg3n
Kleene *, Kleene +
Regular Expressions: Anchors ^ $
Pattern Matches
^[A-Z] Palo Alto
^[^A-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!
Example
Find me all instances of the word “the” in a text.
the
Misses capitalized examples
[tT]he
Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
Errors
The process we just went through was based on
fixing two kinds of errors:
11
Basic Text
Processing
Regular Expressions
More Regular Expressions:
Substitutions and ELIZA
Basic Text
Processing
Substitutions
Substitution in Python and UNIX commands:
s/regexp1/pattern/
e.g.:
s/colour/color/
Capture Groups
• Say we want to put angles around all numbers:
the 35 boxes the <35> boxes
• Use parens () to "capture" a pattern into a
numbered register (1, 2, 3…)
• Use \1 to refer to the contents of the register
s/([0-9]+)/<\1>/
Capture groups: multiple registers
/the (.*)er they (.*), the \1er we \2/
Matches
the faster they ran, the faster we ran
But not
the faster they ran, the faster we ate
But suppose we don't want to capture?
Parentheses have a double function: grouping terms, and
capturing
Non-capturing groups: add a ?: after paren:
/(?:some|a few) (people|cats) like some \1/
matches
◦ some cats like some cats
but not
◦ some cats like some some
Lookahead assertions
(?= pattern) is true if pattern matches, but is
zero-width; doesn't advance character pointer
(?! pattern) true if a pattern does not match
How to match, at the beginning of a line, any single
word that doesn’t start with “Volcano”:
/ˆ(?!Volcano)[A-Za-z]+/
Simple Application: ELIZA
Early NLP system that imitated a Rogerian
psychotherapist
◦ Joseph Weizenbaum, 1966.
they lay back on the San Francisco grass and looked at the stars
and their
Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample how was it sampled? Was
there consent? Pre-processing?
+Annotation process, language variety, demographics, etc.
Words and Corpora
Basic Text
Processing
Word tokenization
Basic Text
Processing
Text Normalization
1945 A
72 AARON
19 ABBESS
25 Aaron
5 ABBOT
6 Abate
... ... 1 Abates
5 Abbess
6 Abbey
3 Abbot
.... …
Issues in Tokenization
Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses (someone@cs.colorado.edu)
Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
Tokenization in NLTK
Bird, Loper and Klein (2009), Natural Language Processing with Python. O’Reilly
Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't
use spaces to separate words!
Merge e r to er
BPE
Merge er _ to er_
BPE
Merge n e to ne
BPE