
3 Regular Expression

Uploaded by

Atharva Nagore
Copyright
© All Rights Reserved

Natural Language Processing

REGULAR EXPRESSION

What exactly are regular expressions?

• A regular expression is a sequence of characters, or pattern, used to find substrings in a given string.
• Regular expressions allow us to match patterns in other strings.
• Applications of regular expressions, for example:
  – Finding dates and times in a log file
  – Parsing email addresses and removing or replacing unwanted characters
  – Finding all web links in a document
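The applications listed above can be sketched in Python with the standard `re` module. The log lines, the document text and both patterns below are invented for illustration and are deliberately loose; they are not taken from the slides.

```python
import re

# Hypothetical inputs, made up for this sketch.
LOG = (
    "2024-03-01 09:15:02 INFO service started\n"
    "2024-03-01 09:17:45 ERROR disk full\n"
)
DOC = "See https://example.com/docs and http://example.org for details."

# YYYY-MM-DD, one space, then HH:MM:SS.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
# A loose link pattern: http or https, then any run of non-space characters.
LINK = re.compile(r"https?://\S+")

timestamps = TIMESTAMP.findall(LOG)
links = LINK.findall(DOC)

print(timestamps)  # ['2024-03-01 09:15:02', '2024-03-01 09:17:45']
print(links)       # ['https://example.com/docs', 'http://example.org']
```

The quantifiers and shorthand classes used here (`{4}`, `\d`, `\S`, `?`) are introduced later in the deck.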
Regular Expression

• A regular expression, or RegEx for short, is a way to search for a pattern in a large body of text.
• RegEx is also used for data validation, such as checking password strength or email format, but its usage can be extended.
• A few more examples: extracting all hashtags from a tweet, or getting email IDs or phone numbers from large unstructured text.

Regular Expression Examples

• Sometimes, we want to identify the different components of an email address.
• Simply put, a regular expression is an "instruction" given to a function on what and how to match, search or replace in a set of strings.
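As a rough illustration of the two examples above, the sketch below pulls hashtags out of a tweet and splits an email address into components. The tweet, the address and the simplified email pattern are assumptions made for this example; real email validation is considerably more involved.

```python
import re

# Hypothetical tweet, made up for this sketch.
tweet = "Learning #NLP and #regex with @someone, so much #fun!"

# A '#' followed by one or more word characters.
hashtags = re.findall(r"#\w+", tweet)

# Simplified email shape: user @ domain . top-level-domain
EMAIL = re.compile(r"(?P<user>[^@\s]+)@(?P<domain>[^@\s]+)\.(?P<tld>[A-Za-z]+)")
parts = EMAIL.fullmatch("jane.doe@mail.example.com").groupdict()

print(hashtags)  # ['#NLP', '#regex', '#fun']
print(parts)     # {'user': 'jane.doe', 'domain': 'mail.example', 'tld': 'com'}
```

Named groups (`(?P<name>...)`) are a Python convenience for labelling the components; the slides do not cover them.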

Regular Expression Examples

• Regular expressions are used in various tasks, such as:
  – Data pre-processing
  – Rule-based information mining systems
  – Pattern matching
  – Text feature engineering
  – Web scraping
  – Data extraction, etc.

How Regular Expressions Work

• Consider the following list of students of a school:
  – Names: Sunil, Shyam, Ankit, Surjeet, Sumit, Subhi, Surbhi, Siddharth, Sujan
• Our goal is to select only those names from the list that match a certain pattern, such as S u _ _ _ ('S' and 'u' followed by exactly 3 more characters).
  – Extracted names: Sunil, Sumit, Subhi, Sujan
• What we have done here is take a pattern and a list of student names, and find the names that match the pattern. That is exactly how regular expressions work.
• In RegEx, we have different types of patterns to recognize different strings of characters.
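The S u _ _ _ selection can be reproduced with a minimal Python sketch, where each blank becomes the wildcard `.` and `fullmatch` requires the whole name to fit the pattern. Note that Subhi also has five letters starting with "Su", so it is selected along with Sunil, Sumit and Sujan.

```python
import re

names = ["Sunil", "Shyam", "Ankit", "Surjeet", "Sumit",
         "Subhi", "Surbhi", "Siddharth", "Sujan"]

# 'S', 'u', then exactly three more characters of any kind.
pattern = re.compile(r"Su...")

# fullmatch succeeds only when the entire name matches the pattern.
extracted = [name for name in names if pattern.fullmatch(name)]

print(extracted)  # ['Sunil', 'Sumit', 'Subhi', 'Sujan']
```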
Properties of Regular Expressions

• Some of the important properties of regular expressions are as follows:
  1. The regular expression language was formalized by the American mathematician Stephen Cole Kleene.
  2. A regular expression (RE) is a formula in a special language that is used to specify simple classes of strings, where a string is a sequence of symbols. [A regular expression is an algebraic notation for characterizing a set of strings.]
  3. A regular expression search requires two things: the pattern that we want to search for, and a corpus of text or a string in which to search for it.

Regular Expression in NLP

• In NLP, we can use regular expressions in many places, such as:
  1. To validate data fields
     • Example: dates, email addresses, URLs, abbreviations, etc.
  2. To filter a particular text from the whole corpus
     • Example: spam, disallowed websites, etc.
  3. To identify particular strings in a text
     • Example: token boundaries
  4. To convert the output of one processing component into the format required for a second component
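Two of the uses above, validating a data field and converting output into another format, might look like the following sketch. The DD/MM/YYYY input format and both helper names are hypothetical choices for this example, and the check only tests the shape of the field, not whether the date exists on the calendar.

```python
import re

# Shape check only: two digits, slash, two digits, slash, four digits.
DATE_FIELD = re.compile(r"\d{2}/\d{2}/\d{4}")


def is_valid_date_field(value):
    """Return True when the whole value looks like DD/MM/YYYY."""
    return DATE_FIELD.fullmatch(value) is not None


def to_iso(text):
    """Rewrite every DD/MM/YYYY date in text as YYYY-MM-DD."""
    return re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1", text)


print(is_valid_date_field("01/03/2024"))  # True
print(is_valid_date_field("1/3/24"))      # False
print(to_iso("Due 01/03/2024"))           # Due 2024-03-01
```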

Regular expressions

• A formal language for specifying text strings
• How can we search for any of these?
  – woodchuck
  – woodchucks
  – Woodchuck
  – Woodchucks

Regular Expression

• Brackets ([])
  – They are used to specify a disjunction of characters.
Regular Expressions: Disjunctions

• Letters inside square brackets []: matches any one character inside the brackets

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit
[abc]           'a', 'b' or 'c'

• [wW] matches either lower case 'w' or upper case 'W'

Regular Expressions: Disjunctions

• Now that was kind of annoying to write
  – Instead, we have little ranges
• Dash (-) → Ranges: [A-Z] matches any character in that range

Pattern   Matches                 Example
[A-Z]     An upper case letter    Drenched Blossoms
[a-z]     A lower case letter     my beans were impatient
[0-9]     A single digit          Chapter 1: Down the Rabbit Hole
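The bracket and range patterns in these tables can be tried directly in Python. This is a small sketch using the example strings from the slides; the woodchuck sentence is made up.

```python
import re

# [wW] accepts either case for the first letter.
both_cases = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")

# re.search returns the first match, so each call yields the first
# character in the string that falls inside the range.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

print(both_cases)   # ['Woodchuck', 'woodchuck']
print(first_upper)  # D
print(first_lower)  # m
print(first_digit)  # 1
```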

Regular Expressions: Disjunctions

Demonstration slides (matched-text screenshots not recoverable) for the patterns: [We], [Ww], [em], [A-Z], [a-z], [A-Za-z], [ !]

Regular Expressions: Negation in Disjunction

• Caret (^) → Negation, or just a literal ^
  – Caret means negation only when it is first inside []

Pattern   Matches                     Example
[^A-Z]    Not an upper case letter    Oyfn pripetchik
[^a-z]    Not a lower case letter     Oyfn pripetchik
[^Ss]     Neither 'S' nor 's'         I have no exquisite reason
[e^]      Either 'e' or '^'           Look here
a^b       The pattern a^b             Look up a^b now
[^.]      Not a period                I have no exquisite reason.

Demonstration slide (matched-text screenshot not recoverable) for the pattern: [^A-Z]
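A short Python sketch of the negation table follows. One caveat that applies to Python's `re` module (and most practical engines): outside brackets, `^` is an anchor, so the literal text a^b has to be written as `a\^b`.

```python
import re

text = "Oyfn pripetchik"

# First character that is NOT an upper case letter, then NOT a lower case one.
first_non_upper = re.search(r"[^A-Z]", text).group()
first_non_lower = re.search(r"[^a-z]", text).group()

# A caret that is not first inside the brackets is an ordinary character.
e_or_caret = re.findall(r"[e^]", "Look here")

# Escaped caret matches the literal text; unescaped, it is a start anchor.
literal = re.search(r"a\^b", "Look up a^b now").group()
unescaped = re.search(r"a^b", "Look up a^b now")

print(first_non_upper)  # y
print(first_non_lower)  # O
print(e_or_caret)       # ['e', 'e']
print(literal)          # a^b
print(unescaped)        # None
```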
Regular Expressions: Negation in Disjunction

Demonstration slides (matched-text screenshots not recoverable) for the patterns: [^A-Za-z], [^e], [^e^], [^Ss], [\^]
Regular Expressions: More Disjunction

• Woodchuck is another name for groundhog!
• The pipe | for disjunction

Pattern                      Matches
groundhog|woodchuck          Either groundhog or woodchuck
yours|mine                   yours, mine
a|b|c                        Same as [abc]
[gG]roundhog|[Ww]oodchuck    Groundhog, groundhog, Woodchuck or woodchuck

Demonstration slides (matched-text screenshots not recoverable) for the patterns: looked|step, at|ook, [a|b|c]
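The pipe can be tried out with a few lines of Python. The sentences below are made up for this sketch. The last two calls show why `a|b|c` and `[a|b|c]` are not the same thing: inside brackets, the pipe is an ordinary character.

```python
import re

animals = re.findall(r"[gG]roundhog|[Ww]oodchuck", "A Woodchuck is a groundhog.")
fragments = re.findall(r"at|ook", "I looked at the cat")

# Outside brackets '|' separates alternatives; inside brackets it is literal.
alternation = re.findall(r"a|b|c", "a|b")
bracketed = re.findall(r"[a|b|c]", "a|b")

print(animals)      # ['Woodchuck', 'groundhog']
print(fragments)    # ['ook', 'at', 'at']
print(alternation)  # ['a', 'b']
print(bracketed)    # ['a', '|', 'b']
```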
Regular Expression: Quantifiers

• Some common quantifiers are *, +, ? and {}.
• They let us control how many times a specific character or pattern should occur in the given text.

Character   Meaning
.           Any single character, including whitespace or digits (in most engines, not a newline by default)
?           Zero or one of the preceding character
*           Zero or more of the preceding character
+           One or more of the preceding character
^           Negation or complement (when first inside [])

Regular Expressions: ? * + .

• ? makes the previous character optional, e.g., match the word color with or without 'u'.
• * and + are the Kleene (or "cleany") operators, Kleene * and Kleene +, named after Stephen C. Kleene.

Pattern        Matches                       Examples
colou?r        Optional previous char        color colour
maths?         Optional previous char        math or maths
woodchucks?    Optional previous char        woodchuck or woodchucks
oo*h!          0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!           1 or more of previous char    oh! ooh! oooh! ooooh!
baa+           1 or more of previous char    baa baaa baaaa baaaaa
beg.n          . matches any character       begin begun beg3n
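The rows of the ? * + . table can be checked mechanically. In this sketch, `ACCEPTED` holds words from the slide that each pattern should match in full, and `REJECTED` holds a few counter-examples chosen for illustration.

```python
import re

# Words each pattern should accept in full (taken from the table).
ACCEPTED = {
    r"colou?r": ["color", "colour"],
    r"woodchucks?": ["woodchuck", "woodchucks"],
    r"oo*h!": ["oh!", "ooh!", "oooh!"],
    r"o+h!": ["oh!", "ooh!", "oooh!"],
    r"baa+": ["baa", "baaa", "baaaa"],
    r"beg.n": ["begin", "begun", "beg3n"],
}

# Counter-examples chosen for this sketch.
REJECTED = {
    r"colou?r": ["colouur"],   # '?' allows at most one 'u'
    r"oo*h!": ["h!"],          # the first 'o' is still required
    r"baa+": ["ba"],           # '+' needs at least one more 'a' after 'ba'
    r"beg.n": ["begn"],        # '.' must consume exactly one character
}

all_accepted = all(
    re.fullmatch(p, w) for p, words in ACCEPTED.items() for w in words
)
none_rejected = not any(
    re.fullmatch(p, w) for p, words in REJECTED.items() for w in words
)

print(all_accepted, none_rejected)  # True True
```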

Regular Expression: Quantifiers

• {m,n} quantifier
  – There are four variants of the {m,n} quantifier.
    1. {m,n} → matches the preceding character from 'm' to 'n' times.
    2. {m,} → matches the preceding character 'm' or more times, so there is no upper limit on the number of occurrences.
    3. {,n} → matches the preceding character from zero to 'n' times, so the upper limit is fixed. (Python supports this form; some engines require {0,n}.)
    4. {n} → matches if and only if the preceding character occurs exactly 'n' times.

Regular Expression: Quantifiers

• Each of the quantifiers mentioned in the previous slide can be written in the form of a {m,n} quantifier:
  – '?' is equivalent to zero or one time, or {0,1}
  – '*' is equivalent to zero or more times, or {0,}
  – '+' is equivalent to one or more times, or {1,}
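The equivalences between ?, *, + and the brace forms can be checked with a small sketch. The helper `accepted` and the sample strings are made up for this example; it simply lists which samples a pattern matches in full.

```python
import re

SAMPLES = ["", "a", "aa", "aaa", "aaaa"]


def accepted(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]


# Each shorthand quantifier and its brace form accept the same samples.
pairs = [(r"a?", r"a{0,1}"), (r"a*", r"a{0,}"), (r"a+", r"a{1,}")]
equivalent = all(accepted(short) == accepted(braces) for short, braces in pairs)

print(equivalent)          # True
print(accepted(r"a{2,3}"))  # ['aa', 'aaa']
print(accepted(r"a{2,}"))   # ['aa', 'aaa', 'aaaa']
print(accepted(r"a{,2}"))   # ['', 'a', 'aa']
print(accepted(r"a{3}"))    # ['aaa']
```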
Regular Expression: Quantifiers

• Example (the highlighting that marked the matches in the sample text was lost in extraction):

Pattern     Matches                                                        Sample text
abc*        'ab' followed by zero or more 'c'                              abc abcccc abc abccc bca cba
abc+        'ab' followed by one or more 'c'                               ab abcccc abc abccc bca cba
abc?        'ab' followed by zero or one 'c'                               ab abcccc abc abccc bca cba
abc{2}      'ab' followed by exactly 2 'c'                                 ab abcccc abc abcccc bca cba
abc{2,}     'ab' followed by 2 or more 'c'                                 ab abcccc abc abcccc bca cba
abc{2,5}    'ab' followed by 2 up to 5 'c'                                 ab abcccccc abc abcccccc bca cba
a(bc)*      'a' followed by zero or more copies of the sequence 'bc'       ab abccccc abc abccccc bca cba

Regular Expressions: ? * + .

Demonstration slide (matched-text screenshot not recoverable) for the pattern: O+
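Since the highlighting is lost, the matches for each row can be regenerated with Python. This sketch uses one variant of the sample text from the table. Two details are specific to Python's `re` module: the brace quantifiers must be written without spaces (`{2,}`, not `{2, }`), and `findall` returns group contents when a pattern has a capturing group, so the last pattern uses a non-capturing group `(?:bc)`.

```python
import re

SAMPLE = "ab abcccc abc abccc bca cba"

PATTERNS = [
    r"abc*", r"abc+", r"abc?", r"abc{2}", r"abc{2,}", r"abc{2,5}", r"a(?:bc)*",
]

# Map each pattern to the list of substrings it matches in SAMPLE.
MATCHES = {p: re.findall(p, SAMPLE) for p in PATTERNS}

for pattern, found in MATCHES.items():
    print(f"{pattern:10} {found}")

# abc*       ['ab', 'abcccc', 'abc', 'abccc']
# abc+       ['abcccc', 'abc', 'abccc']
# abc?       ['ab', 'abc', 'abc', 'abc']
# abc{2}     ['abcc', 'abcc']
# abc{2,}    ['abcccc', 'abccc']
# abc{2,5}   ['abcccc', 'abccc']
# a(?:bc)*   ['a', 'abc', 'abc', 'abc', 'a', 'a']
```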

Regular Expression: Escaping Special Characters

• Special characters such as '?', '*', '+', '(', ')', '{', '.', etc. can also appear in the input text.
• In such cases, to match these specific characters, we have to use escape sequences.
• The escape character '\' is used to remove the special meaning of a special character.
  – \* → matches a literal star
  – \. → matches a literal dot
  – \+ → matches a literal plus sign
  – \? → matches a literal question mark

Regular Expressions: ? * + .

Demonstration slide (matched-text screenshot not recoverable) for the pattern: \.
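Escaping can be illustrated with a short Python sketch. The sample sentence is made up. `re.escape` is a standard-library helper that adds the backslashes automatically when a whole literal string needs to be matched.

```python
import re

text = "Is 2+2=4? Yes. 3*3=9."

question_marks = re.findall(r"\?", text)
dots = re.findall(r"\.", text)
sums = re.findall(r"\d\+\d", text)      # digit, literal '+', digit
products = re.findall(r"\d\*\d", text)  # digit, literal '*', digit

# re.escape builds a pattern that matches the given string literally.
literal = re.search(re.escape("2+2=4?"), text).group()

print(question_marks)  # ['?']
print(dots)            # ['.', '.']
print(sums)            # ['2+2']
print(products)        # ['3*3']
print(literal)         # 2+2=4?
```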
Regular Expressions: ? * + .

• . matches any character and is known as the wildcard character.

Regular Expressions: Anchors ^ $

• Anchors are used to assert something about the string or the matching process.
• As such, they are not tied to a specific word or character but help with more general queries.
• ^ matches the beginning of the line.

Pattern       Matches                                                   Example
^[A-Z]        An upper case letter at the start of the string           Palo Alto
^[^A-Za-z]    Any character except a letter, at the start of the string 1, "Hello"
.$            Any character at the end of the string                    The end.

Regular Expressions: Anchors ^ $

Pattern      Matches
^The         A string that starts with 'The'
end$         A string that ends with 'end'
^The end$    Exact string match (the whole string is 'The end')
roar         A string that has the text 'roar' in it

Regular Expressions: Anchors ^ $

• ^ → Matches beginning of the line
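The anchor table can be exercised with a minimal Python sketch. The candidate strings are invented for illustration. By default `^` and `$` in Python refer to the start and end of the whole string; with the `re.MULTILINE` flag they also match at the start and end of each line.

```python
import re

LINES = ["The end", "The beginning", "At the end", "uproar"]


def matching(pattern):
    """Return the candidate strings in which the pattern is found."""
    return [s for s in LINES if re.search(pattern, s)]


print(matching(r"^The"))       # ['The end', 'The beginning']
print(matching(r"end$"))       # ['The end', 'At the end']
print(matching(r"^The end$"))  # ['The end']
print(matching(r"roar"))       # ['uproar']

# Line-by-line anchoring needs the MULTILINE flag.
two_lines = "one two\nthree four"
first_words_default = re.findall(r"^\w+", two_lines)
first_words_multiline = re.findall(r"^\w+", two_lines, flags=re.MULTILINE)

print(first_words_default)    # ['one']
print(first_words_multiline)  # ['one', 'three']
```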
Regular Expressions: Anchors ^ $

• $ → Matches end of the line

Regular Expression: Whitespace

• Whitespace can include a single space, multiple spaces, a tab or a newline character (also known as a vertical space). A whitespace character in a pattern matches the corresponding whitespace in the string.
• Example:
  – ' +', i.e., a space followed by a plus sign, will match one or more spaces.
  – Sample text: abc abcccccc abc abccccc bca cba!
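The ' +' pattern is commonly used to find or collapse runs of spaces. Below is a small sketch with a made-up string in which the spacing is deliberately uneven.

```python
import re

messy = "abc   abcccccc abc  abccccc"

# Every run of one or more spaces.
runs = re.findall(r" +", messy)

# Replace each run with a single space.
collapsed = re.sub(r" +", " ", messy)

print(runs)       # ['   ', ' ', '  ']
print(collapsed)  # abc abcccccc abc abccccc
```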

Regular Expressions: Meta Sequences

• Meta sequences are a shorthand way to write commonly used character sets. The commonly used meta sequences are as follows (equivalences shown for ASCII text):

Special Sequences
\b    Word boundary (zero width)
\d    Any decimal digit (equivalent to [0-9])
\D    Any non-digit character (equivalent to [^0-9])
\s    Any whitespace character (equivalent to [ \t\n\r\f\v])
\S    Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w    Any alphanumeric character or underscore (equivalent to [a-zA-Z0-9_])
\W    Any character that is not alphanumeric or underscore (equivalent to [^a-zA-Z0-9_])

Regular Expression: Character Sets

• Character sets provide a lot more flexibility than just typing a wildcard or the literal characters. These are groups of characters specified inside square brackets.
• Character sets can be specified with or without a quantifier.
• When no quantifier follows the character set, it matches exactly one character, and the match succeeds only if that character in the string is one of the characters inside the character set.
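The meta sequences can be tried on a short sample. The text and the phone-like string below are invented for this sketch. Note that in Python `\w` also matches the underscore, which is why `Order_42` comes back as a single token.

```python
import re

TEXT = "Order_42 shipped on 2024-03-01\tto Pune."

digits = re.findall(r"\d+", TEXT)
words = re.findall(r"\w+", TEXT)
spaces = re.findall(r"\s", TEXT)

# \b keeps 'on' from matching inside 'once' or 'Bonn'.
whole_on = re.findall(r"\bon\b", "on and on, not once or Bonn")

# \D removes everything that is not a digit.
digits_only = re.sub(r"\D", "", "+91 (020) 555-0100")

print(digits)       # ['42', '2024', '03', '01']
print(words)        # ['Order_42', 'shipped', 'on', '2024', '03', '01', 'to', 'Pune']
print(spaces)       # [' ', ' ', ' ', '\t', ' ']
print(whole_on)     # ['on', 'on']
print(digits_only)  # 910205550100
```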
Regular Expression: Character Sets

• Example:

Pattern     Matches
[a-z]ed     Strings such as 'ted', 'bed' and 'red', because the first character of each string ('t', 'b' and 'r') is within the range of the character set.
[a-z]+ed    One or more lower case letters followed by 'ed', i.e., words that end with 'ed' such as 'watched', 'baked', 'jammed' and 'educated'.
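The difference between a character set with and without a quantifier shows up clearly when whole words are tested. This sketch uses the words from the table plus two made-up counter-examples ('Ed' and 'ed').

```python
import re

WORDS = ["ted", "bed", "red", "Ed", "watched", "baked", "jammed", "educated", "ed"]

# Exactly one lower case letter before 'ed'.
one_letter = [w for w in WORDS if re.fullmatch(r"[a-z]ed", w)]

# One or more lower case letters before 'ed'.
many_letters = [w for w in WORDS if re.fullmatch(r"[a-z]+ed", w)]

# Without fullmatch, the unquantified set finds a 3-character piece of a word.
piece = re.search(r"[a-z]ed", "watched").group()

print(one_letter)    # ['ted', 'bed', 'red']
print(many_letters)  # ['ted', 'bed', 'red', 'watched', 'baked', 'jammed', 'educated']
print(piece)         # hed
```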

Example

Demonstration slides (matched-text screenshots not recoverable) for the successive patterns:
– the
– [Tt]he
– [Tt]he[^A-Za-z]
– [^A-Za-z][Tt]he[^A-Za-z]

Errors

• The process we just went through was based on fixing two kinds of errors:
  – Matching strings that we should not have matched (there, then, other)
    • False positives (Type I)
  – Not matching things that we should have matched (The)
    • False negatives (Type II)

Errors cont.

• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  – Increasing accuracy or precision (minimizing false positives)
  – Increasing coverage or recall (minimizing false negatives)
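The two kinds of errors can be counted for each pattern tried on the word "the". Everything in this sketch is an assumption made for illustration: the sentence, the gold standard (whole words equal to "the", ignoring case) and the helper `evaluate`. The final pattern, `\b[Tt]he\b`, does not appear on the slides; it is included only to show one way the word-boundary meta sequence removes the remaining miss.

```python
import re

TEXT = "The other day the cat sat there to breathe, then the dog barked."

# Gold standard: spans of whole words equal to "the", ignoring case.
GOLD = {
    m.span()
    for m in re.finditer(r"[A-Za-z]+", TEXT)
    if m.group().lower() == "the"
}


def evaluate(pattern):
    """Return (false_positives, false_negatives) for pattern on TEXT."""
    found = set()
    for m in re.finditer(pattern, TEXT):
        # Ignore context characters and keep only the span of 'the'/'The'.
        core = re.search(r"[Tt]he", m.group())
        found.add((m.start() + core.start(), m.start() + core.end()))
    return len(found - GOLD), len(GOLD - found)


PATTERNS = [
    r"the",
    r"[Tt]he",
    r"[Tt]he[^A-Za-z]",
    r"[^A-Za-z][Tt]he[^A-Za-z]",
    r"\b[Tt]he\b",
]
RESULTS = {p: evaluate(p) for p in PATTERNS}

for pattern, (fp, fn) in RESULTS.items():
    print(f"{pattern:28} false positives={fp} false negatives={fn}")
```

On this sentence the plain pattern over-matches inside other, there, breathe and then and misses the capitalized The. Requiring a non-letter on both sides removes the false positives but misses The at the very start of the string, where there is no preceding character.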
Summary
• Regular expressions play a surprisingly large role
– Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
– But regular expressions are used as features in the classifiers
– Can be very useful in capturing generalizations
