3 Regular Expression
What exactly are regular expressions?
Regular Expression
• A regular expression, or RegEx for short, is a way to search for a pattern in a large body of text.
• RegEx is also used for data validation, such as checking for a secure password or a correct email format, but its use extends well beyond that.
• A few more examples: extracting all hashtags from a tweet, or getting email IDs or phone numbers from large unstructured text.

Regular Expression Examples
• Sometimes, we want to identify the different components of an email address.
• Simply put, a regular expression is an "instruction" given to a function on what and how to match, search or replace in a set of strings.
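The two use cases named above (pulling hashtags out of a tweet and splitting an email address into its components) can be sketched with Python's standard `re` module. The sample tweet, the address and both patterns are assumptions made up for illustration; the email pattern is deliberately simple and is not a full validator.

```python
import re

# Hypothetical sample text; the address and hashtags are invented.
tweet = "Loving the #NLP course! Questions to jane.doe@example.com #regex"

# A hashtag: '#' followed by one or more word characters.
hashtags = re.findall(r"#\w+", tweet)

# An email address split into two groups: user name and domain.
email = re.search(r"([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)", tweet)
user, domain = email.group(1), email.group(2)

print(hashtags)
print(user, domain)
```

Here the pattern is the "instruction", and `re.findall` / `re.search` are the functions that receive it.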
Properties of Regular Expressions
• Some of the important properties of regular expressions are as follows:
1. The regular expression language was formalized by the American mathematician Stephen Cole Kleene.
2. A regular expression (RE) is a formula in a special language that specifies simple classes of strings, where a string is a sequence of symbols. [A regular expression is an algebraic notation for characterizing a set of strings.]
3. A regular expression search requires two things: the pattern that we want to search for, and a corpus of text or a string in which to search for it.

Regular Expression in NLP
• In NLP, we can use regular expressions in many places, such as:
1. To validate data fields
   • Example: dates, email addresses, URLs, abbreviations, etc.
2. To filter a particular text from the whole corpus
   • Example: spam, disallowed websites, etc.
3. To identify particular strings in a text
   • Example: token boundaries
4. To convert the output of one processing component into the format required for a second component
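As a minimal sketch of the "pattern plus corpus" idea and of data-field validation, the Python snippet below searches a text for dates and then validates a single field. The `YYYY-MM-DD` format, the sample corpus and the helper name `is_valid_date_field` are assumptions chosen for illustration; the check covers the format only, not whether the date exists.

```python
import re

# The pattern: four digits, dash, two digits, dash, two digits.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# The corpus: the text in which we search for the pattern.
corpus = "Submitted on 2023-05-17, revised on 2023-06-02."

# Searching: collect every date-like substring in the corpus.
dates = date_pattern.findall(corpus)

def is_valid_date_field(field):
    """Validation: the whole field must match the pattern (hypothetical helper)."""
    return date_pattern.fullmatch(field) is not None

print(dates)
print(is_valid_date_field("2023-05-17"), is_valid_date_field("17/05/2023"))
```

Searching uses `findall` because a match may sit anywhere in the text, while validation uses `fullmatch` because the entire field has to fit the pattern.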
Regular Expressions: Disjunctions
• Letters inside square brackets []: the pattern matches any one character listed inside the brackets.

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[wW]            Either lower case ‘w’ or upper case ‘W’
[abc]           ‘a’, ‘b’ or ‘c’
[1234567890]    Any digit

• Now that was kind of annoying to write
  – Instead, we have little ranges
• Dash (-) ranges: [A-Z] matches any character in that range

Pattern    Matches                 Example text
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
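The bracket and range patterns above can be tried directly in Python. This is a small sketch using the standard `re` module; the variable names are illustrative, and the "Woodchuck or woodchuck?" sentence is made up, while the other sample strings come from the table.

```python
import re

# [wW] accepts either case for the first letter.
woodchucks = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")

# re.search returns the first (leftmost) match of a range in each text.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

print(woodchucks, first_upper, first_lower, first_digit)
```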
Regular Expressions: Disjunctions
[Slides showing matches highlighted in sample text for the patterns [Ww], [em], [A-Za-z] and [ !]]
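Regular Expressions: Negation in Disjunction
• A caret (^) placed first inside square brackets negates the set. Anywhere else inside the brackets it is an ordinary character.

Pattern      Matches
[^A-Za-z]    Any character that is not a letter
[^e]         Any character other than ‘e’
[^e^]        Any character other than ‘e’ or ‘^’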
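A short Python sketch of the three negated sets, using the standard `re` module. The sample strings are invented for illustration.

```python
import re

# Everything that is not a letter: punctuation, spaces, digits, the caret.
non_letters = re.findall(r"[^A-Za-z]", "Oh, 2 e^")

# Every character except 'e'.
not_e = re.findall(r"[^e]", "eel")

# The second caret is literal, so both 'e' and '^' are excluded.
not_e_or_caret = re.findall(r"[^e^]", "e^x")

print(non_letters, not_e, not_e_or_caret)
```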
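Regular Expressions: More Disjunction
• The pipe (|) expresses disjunction between whole patterns.

Pattern    Matches
at|ook     ‘at’ or ‘ook’
[a|b|c]    ‘a’, ‘b’, ‘c’ or a literal ‘|’ (inside square brackets the pipe is an ordinary character)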
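The difference between the pipe outside and inside square brackets is easy to check in Python. This is a minimal sketch with made-up sample strings.

```python
import re

# Outside brackets, | separates alternatives: 'at' or 'ook'.
alternatives = re.findall(r"at|ook", "look at that book")

# Inside brackets, | is just another character in the set.
in_brackets = re.findall(r"[a|b|c]", "a|d")

# Without brackets, the same text yields only the letter.
without_brackets = re.findall(r"a|b|c", "a|d")

print(alternatives, in_brackets, without_brackets)
```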
Kleene (pronounced "cleany") operators
• They let us specify and control how many times a character or pattern should occur in the given text.

Character    Meaning
.            Any single character, including a space or a digit (in most engines, not a newline by default)
?            Zero or one of the preceding character
*            Zero or more of the preceding character (Kleene *)
+            One or more of the preceding character (Kleene +)
^            Negation or complement (when it is the first character inside square brackets)

Pattern        Meaning                       Matches
colou?r        Optional previous char        color colour
maths?                                       math or maths
woodchucks?                                  woodchuck or woodchucks
oo*h!          0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!           1 or more of previous char    oh! ooh! oooh! ooooh!
baa+                                         baa baaa baaaa baaaaa
beg.n          Matches any character         begin begun beg3n

The * and + operators are named after Stephen C. Kleene.
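The operators in the tables above can be checked against a few candidate strings. The sketch below uses Python's `re.fullmatch`, so a candidate counts only if the whole string fits the pattern; the helper name `matching` and the non-matching candidates are assumptions added for contrast.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

optional_u = matching(r"colou?r", ["color", "colour", "colouur"])
zero_or_more = matching(r"oo*h!", ["h!", "oh!", "ooh!", "oooh!"])
one_or_more = matching(r"o+h!", ["h!", "oh!", "ooh!", "oooh!"])
sheep = matching(r"baa+", ["ba", "baa", "baaa"])
wildcard = matching(r"beg.n", ["begin", "begun", "beg3n", "begn"])

print(optional_u, zero_or_more, one_or_more, sheep, wildcard)
```

Note that `oo*h!` and `o+h!` accept the same strings: one required ‘o’ followed by zero or more is the same as one or more.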
Regular Expression: Quantifiers
• Example:

Pattern      Matches
abc*         ‘ab’ followed by zero or more ‘c’
abc+         ‘ab’ followed by one or more ‘c’
abc?         ‘ab’ followed by zero or one ‘c’
abc{2}       ‘ab’ followed by exactly 2 ‘c’
abc{2,}      ‘ab’ followed by 2 or more ‘c’
abc{2,5}     ‘ab’ followed by 2 up to 5 ‘c’
a(bc)*       ‘a’ followed by zero or more copies of the sequence ‘bc’

Sample text on the slides (matches were highlighted): ab abcccc abc abccc bca cba

Regular Expressions: ? * + .
[Slide showing matches for the pattern O+ highlighted in sample text]
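To see how the quantifiers differ, the following Python sketch tests each pattern against the same list of tokens. The token list is an assumption (a variation on the slide's sample text, extended so that every pattern has at least one hit), and `re.fullmatch` is used so that a token counts only when the whole token fits. The counted forms are written without spaces inside the braces, since Python treats `{2, 5}` as literal text rather than a quantifier.

```python
import re

# Hypothetical tokens, chosen so each quantifier accepts a different subset.
tokens = ["ab", "abc", "abcc", "abccc", "abcccccc", "abcbc", "bca"]

patterns = [r"abc*", r"abc+", r"abc?", r"abc{2}", r"abc{2,}", r"abc{2,5}", r"a(bc)*"]

# Map each pattern to the tokens it matches in full.
results = {
    p: [t for t in tokens if re.fullmatch(p, t)]
    for p in patterns
}

for pattern, hits in results.items():
    print(pattern, hits)
```

The last pattern shows the effect of grouping: in `a(bc)*` the star applies to the whole sequence ‘bc’, not just to ‘c’.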
Regular Expressions: ? * + .
• . matches any character and is known as the wildcard character.

Regular Expressions: Anchors ^ $
• Anchors are used to assert something about the string or the matching process.
• As such, they are not tied to a specific word or character, but help with more general queries.
• ^ matches the beginning of the line.

Pattern       Meaning                                         Matches
^[A-Z]        An upper case letter at the start of the string    Palo Alto
^[^A-Za-z]    Any character except a letter at the start         1, “Hello”
.$            Any character at the end of the string             The end.
^The end$     Exact string match (starts and ends with ‘The end’)
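The anchor patterns in the table can be tried out as follows. This is a sketch using Python's `re` module; the variable names are illustrative, and the final line relies on the `re.MULTILINE` flag, which is a Python detail the slides do not mention.

```python
import re

starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()
starts_non_letter = re.search(r"^[^A-Za-z]", '1, "Hello"').group()
last_char = re.search(r".$", "The end.").group()

# ^...$ together force the pattern to cover the whole string.
exact = re.search(r"^The end$", "The end") is not None
not_exact = re.search(r"^The end$", "The end.") is not None

# With re.MULTILINE, ^ matches at the start of every line, not only the string.
line_starts = re.findall(r"^[A-Z]", "Palo\nAlto", re.MULTILINE)

print(starts_upper, starts_non_letter, last_char, exact, not_exact, line_starts)
```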
Regular Expressions: Anchors ^ $
• $ matches the end of the line.

Regular Expression: Whitespace
• Whitespace can be a single space, multiple spaces, a tab or a newline character (also known as a vertical space). A whitespace character in a pattern matches the corresponding whitespace in the string.
• Example:
  – ‘ +’, i.e., a space followed by a plus sign, will match one or more spaces.
  – Sample text: abc abcccccc abc abccccc bca cba!
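A brief Python sketch of whitespace matching. The first call uses the ‘ +’ pattern from the example to collapse runs of spaces; the second uses `\s`, Python's shorthand for any whitespace character (space, tab, newline), which is an addition not found on the slides. The input strings are made up.

```python
import re

# ' +' matches one or more literal spaces; replace each run with a single space.
collapsed = re.sub(r" +", " ", "abc   abcccccc  bca cba!")

# '\s+' also covers tabs and newlines, so it can split on any whitespace run.
tokens = re.split(r"\s+", "abc\tabcccccc\n bca")

print(collapsed)
print(tokens)
```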
Regular Expression: Character Sets
• Example:

Pattern     Matches
[a-z]ed     Strings such as ‘ted’, ‘bed’ and ‘red’, because the first character of each string (‘t’, ‘b’ and ‘r’) falls inside the range of the character set.
[a-z]+ed    One or more lower case letters followed by ‘ed’, such as ‘watched’, ‘baked’, ‘jammed’ and ‘educated’.
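The two character-set patterns behave as follows in Python. The sample sentences are invented; the last line adds the word-boundary marker `\b`, which the slides do not cover, to show one way of restricting the match to whole words that end in ‘ed’.

```python
import re

# Exactly one lower case letter before 'ed'; 'Red' fails because 'R' is upper case.
three_letter = re.findall(r"[a-z]ed", "ted fed bed Red")

# One or more lower case letters before 'ed'.
ed_words = re.findall(r"[a-z]+ed", "we watched and baked")

# Without anchors the pattern can also match inside a longer word ('need' in 'needle').
inside_word = re.findall(r"[a-z]+ed", "educated needle")

# Assumed refinement: \b on both sides keeps only whole words ending in 'ed'.
whole_words = re.findall(r"\b[a-z]+ed\b", "educated needle")

print(three_letter, ed_words, inside_word, whole_words)
```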
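Example
• Successive patterns for finding the word ‘the’:

Pattern                     Notes
the                         Misses the capitalized ‘The’
[Tt]he                      Also matches inside longer words such as ‘other’
[Tt]he[^A-Za-z]             Requires a non-letter after the word
[^A-Za-z][Tt]he[^A-Za-z]    Requires a non-letter on both sides of the word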
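Running these patterns over one sentence shows why each refinement is needed. This is a sketch with a made-up sample sentence; the variable names are illustrative.

```python
import re

text = "The other day the cat saw them there."

# Lower case only: misses 'The' and also hits 'other', 'them', 'there'.
plain = re.findall(r"the", text)

# Either case: now finds 'The' too, but still matches inside longer words.
either_case = re.findall(r"[Tt]he", text)

# Non-letter required after the word: drops 'them' and 'there', keeps 'other'.
right_bounded = re.findall(r"[Tt]he[^A-Za-z]", text)

# Non-letter required on both sides: only the free-standing ' the ' is left.
both_bounded = re.findall(r"[^A-Za-z][Tt]he[^A-Za-z]", text)

print(len(plain), len(either_case), right_bounded, both_bounded)
```

The final pattern loses the sentence-initial ‘The’, because no character precedes it; the surrounding non-letters also become part of each match.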
Summary
• Regular expressions play a surprisingly large role
– Sophisticated sequences of regular expressions are often the first
model for any text processing task
• For many hard tasks, we use machine learning classifiers
– But regular expressions are used as features in the classifiers
– Can be very useful in capturing generalizations