Morphological Analysis
◦ Content words (lexical morphemes): words whose meaning can be visualized
◦ Nouns - Glass
◦ Adjectives - Black
◦ Verbs - Dance
◦ Adverbs - Beautifully
◦ An 'open' class of morphemes, because we can easily add new words to the language
◦ E.g. girl, play, google, e-mail, blog
Grammatical Morpheme
◦ Conjunctions- And/Or
◦ Prepositions-On/Under/At
◦ Articles-An/The
◦ Pronouns- She/He.
◦ A 'closed' class: the set of words is limited
Derivational Morpheme
◦ Class Changing
◦ Care (Verb) -> Carelessly (Adverb)
◦ Class Maintaining
◦ Child (Noun) -> Childhood (Noun)
Inflectional Morpheme
◦ Eg.
◦ Dance+ed->Danced
◦ Walk+ed->Walked
◦ Boy+s->Boys
Irregular Inflectional Morpheme
◦ Eg.
◦ Go+ed->Went
◦ Good+er->Better
◦ am+ed->was
Identify Inflectional or Derivational Morphology
How many matches does the pattern a{1,3} give on the string aabbaaaa?
• 3
• 2
• 1
• 4
What is the output of the code below?
print(re.match('On', 'one'))
• 'e'
• Match Object
• 'n'
• None
[1-4]
(1-3)
[1234]
Both a and c
• What is the output of the code below?
re.sub('a', 'u', 'aeiou!')
• 'ueiou!'
• 'eiou!'
• 'eio!'
• None of the above
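The Python quiz items above can be checked directly. The following is a minimal sketch using only the standard `re` module; it assumes "number of matches" means the non-overlapping matches that `re.findall` returns.

```python
import re

# Non-overlapping, greedy matches of a{1,3}, scanned left to right.
matches = re.findall(r"a{1,3}", "aabbaaaa")
print(matches, len(matches))

# re.match only tries the start of the string and is case-sensitive by default.
match_result = re.match("On", "one")
print(match_result)

# re.sub replaces every non-overlapping occurrence of the pattern.
substituted = re.sub("a", "u", "aeiou!")
print(substituted)
```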
What is Regular Expression?
A Formal Language for specifying text search strings
• Ranges [A-Z]
Pattern    Matches                 Example
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
• Caret means negation only when first in []
Pattern    Matches                     Example
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither 'S' nor 's'         I have no exquisite reason
[^e^]      Neither e nor ^             Look here
a^b        The pattern a caret b       Look up a^b now
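The range and negation patterns above can be tried with Python's `re.search`, which returns the first match. This is a small sketch; the helper name `first_match` is made up for illustration. One difference worth noting: in Python a caret outside brackets is an anchor, so the literal pattern "a caret b" has to be written `a\^b`.

```python
import re

def first_match(pattern, text):
    """Return the first substring matching pattern, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match(r"[A-Z]", "Drenched Blossoms"))
print(first_match(r"[a-z]", "my beans were impatient"))
print(first_match(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))
print(first_match(r"[^A-Z]", "Oyfn pripetchik"))
print(first_match(r"[^Ss]", "I have no exquisite reason"))
print(first_match(r"[^e^]", "Look here"))

# Unescaped, ^ is a start-of-string anchor in Python, so this finds nothing.
print(first_match(r"a^b", "Look up a^b now"))
# Escaped, it matches the literal three-character string.
print(first_match(r"a\^b", "Look up a^b now"))
```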
Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction
Pattern                        Matches
groundhog|woodchuck            woodchuck
yours|mine                     yours
a|b|c                          = [abc]
[gG]roundhog|[Ww]oodchuck      Woodchuck
Regular Expressions: ? * + .
Pattern    Matches                        Examples
colou?r    Optional previous char         color colour
oo*h!      0 or more of previous char     oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char     oh! ooh! oooh! ooooh!
baa+                                      baa baaa baaaa baaaaa
beg.n      Any single char in place of .  begin begun beg3n
• Kleene *, Kleene + (named after Stephen C Kleene)
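The quantifier rows above can be verified with `re.fullmatch`, which requires the whole candidate string to match. A minimal sketch; the helper `accepted` and the extra negative candidates are added for illustration only.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(accepted(r"colou?r", ["color", "colour", "colouur"]))
print(accepted(r"oo*h!", ["h!", "oh!", "oooh!"]))
print(accepted(r"o+h!", ["h!", "oh!", "oooh!"]))
print(accepted(r"baa+", ["ba", "baa", "baaaa"]))
print(accepted(r"beg.n", ["begin", "begun", "beg3n", "begn"]))
```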
Regular Expressions: Anchors ^ $
Pattern Matches
^[A-Z] Palo Alto
^[^A-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!
• Word boundary \b: given the strings 993 and 99, the pattern 99 matches inside both of them. To match only the standalone 99, use \b99\b.
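A short sketch of the anchors and the word boundary in Python. The test strings are the ones from the table; putting "993, 99" into a single string is an assumption made to show both cases at once.

```python
import re

print(re.search(r"^[A-Z]", "Palo Alto").group())       # first char, must be upper case
print(re.search(r"^[^A-Za-z]", '1 "Hello"').group())   # first char, must not be a letter
print(re.search(r"\.$", "The end.") is not None)       # literal period at the end
print(re.search(r"\.$", "The end?") is not None)       # no literal period at the end
print(re.search(r".$", "The end?").group())            # any last character

text = "993, 99"
plain_starts = [m.start() for m in re.finditer(r"99", text)]
bounded_starts = [m.start() for m in re.finditer(r"\b99\b", text)]
print(plain_starts, bounded_starts)
```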
Example
• Find me all instances of the word “the” in a text.
the
Misses capitalized examples
[tT]he
Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
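The three attempts above can be compared on a sample sentence. This is a sketch under the assumption of a made-up test sentence; it also shows that the last pattern still misses a "The" at the very start of the text, which a word-boundary version avoids.

```python
import re

# Hypothetical test sentence containing "The", "other", "the" and "theology".
text = "The other day the theology student read the book."

lowercase_only = re.findall(r"the", text)
either_case = re.findall(r"[tT]he", text)
non_letter_context = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)
word_boundary = re.findall(r"\b[tT]he\b", text)

print(len(lowercase_only), len(either_case))
print(non_letter_context)
print(word_boundary)
```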
Errors
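• The process we just went through was based on fixing two kinds of errors:
• Matching strings that we should not have matched (other, theology): false positives
• Not matching things that we should have matched (The): false negatives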
More Regular Expressions: Substitutions and ELIZA
Substitutions
• Substitution in Python and UNIX commands:
• s/regexp1/pattern/
• e.g.:
• s/colour/color/
Capture Groups
• Say we want to put angles around all numbers:
the 35 boxes -> the <35> boxes
• Use parens () to "capture" a pattern into a numbered register (1, 2, 3…)
• Use \1 to refer to the contents of the register
s/([0-9]+)/<\1>/
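In Python the same substitution is written with `re.sub`, where `\1` in the replacement refers to the first capture group. A minimal sketch:

```python
import re

# Wrap every run of digits in angle brackets, reusing the captured digits via \1.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")
print(bracketed)
```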
Capture groups: multiple registers
• /the (.*)er they (.*), the \1er we \2/
• Matches
• the faster they ran, the faster we ran
• But not
• the faster they ran, the faster we ate
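The two-register pattern above behaves the same way in Python, where `\1` and `\2` inside the pattern must repeat exactly what groups 1 and 2 captured. A short sketch:

```python
import re

pattern = r"the (.*)er they (.*), the \1er we \2"

same = re.fullmatch(pattern, "the faster they ran, the faster we ran")
different = re.fullmatch(pattern, "the faster they ran, the faster we ate")

print(same.groups())   # contents of registers 1 and 2
print(different)       # register 2 holds "ran", so "ate" cannot match
```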
But suppose we don't want to capture?
Parentheses have a double function: grouping terms, and capturing
Non-capturing groups: add a ?: after paren:
• /(?:some|a few) (people|cats) like some \1/
• matches
• some cats like some cats
• but not
• some cats like some some
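A quick check of the non-capturing group in Python: because `(?:some|a few)` does not fill a register, `\1` refers to `(people|cats)`. A minimal sketch:

```python
import re

pattern = r"(?:some|a few) (people|cats) like some \1"

ok = re.fullmatch(pattern, "some cats like some cats")
bad = re.fullmatch(pattern, "some cats like some some")

print(ok.group(1))  # the only numbered register
print(bad)
```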
Lookahead assertions
• (?= pattern) is true if pattern matches, but is zero-width; it doesn't advance the character pointer
• (?! pattern) is true if pattern does not match
• How to match, at the beginning of a line, any single word that doesn’t start with “Volcano”:
• /^(?!Volcano)[A-Za-z]+/
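The negative lookahead above can be tried in Python as follows. The sample lines are made up for illustration.

```python
import re

pattern = r"^(?!Volcano)[A-Za-z]+"

allowed = re.match(pattern, "Mountain erupts")
blocked = re.match(pattern, "Volcano erupts")
blocked_prefix = re.match(pattern, "Volcanoes rise")

print(allowed.group())          # first word of a line not starting with "Volcano"
print(blocked, blocked_prefix)  # the lookahead rejects both lines
```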
Simple Application: ELIZA
• Early NLP system that imitated a Rogerian psychotherapist
• Joseph Weizenbaum, 1966.
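An ELIZA-style responder can be sketched as an ordered list of regular-expression substitutions. The rules and replies below are hypothetical examples written for this sketch; they are not Weizenbaum's original script.

```python
import re

# Hypothetical (pattern, reply template) rules; \1 reuses the captured word.
RULES = [
    (r".*\bI am (depressed|sad)\b.*", r"I am sorry to hear you are \1."),
    (r".*\bmy (mother|father)\b.*", r"Tell me more about your \1."),
]

def respond(utterance):
    """Return the reply of the first rule that matches, else a default prompt."""
    for pattern, template in RULES:
        m = re.match(pattern, utterance, flags=re.IGNORECASE)
        if m:
            return m.expand(template)
    return "Please go on."

print(respond("I am sad today"))
print(respond("Well, my mother hates me"))
print(respond("Hello"))
```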
Write the regular expression for the following
(In the answers, + between two expressions denotes union, ε is the empty string, and * is the Kleene star.)
• Write the regular expression that matches a string that has an a followed by zero or more b's. R = ab*
• Write the regular expression for the language L = {ε, a, aa, b, bb, ab, ba, aba, bab, ...}, any combination of a and b. R = (a + b)*
• Write the regular expression for the language L = {a, aba, aab, aba, aaa, abab, ...}. R = (a + ab)*
• Write the regular expression for the language L = {a, aa, aaa, ...}. R = a+ (one or more a's, i.e. aa*)
• Write the regular expression for the language L = {ε, 0, 1, 00, 11, 10, 100, ...}. R = 1*0*
• Write the regular expression for the language accepting all strings that start with 1 and end with 0, over ∑ = {0, 1}. R = 1(0 + 1)*0
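Some of these answers can be checked in Python once the formal notation is translated: the union written + in the answers becomes | in Python's `re` syntax. A sketch with a hypothetical helper `in_language`:

```python
import re

def in_language(pattern, string):
    """True if the whole string is generated by the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, string) is not None

# ab* : an a followed by zero or more b's
print(in_language(r"ab*", "abbb"), in_language(r"ab*", "ba"))

# (a + b)* becomes (a|b)* : any combination of a and b, including the empty string
print(in_language(r"(a|b)*", ""), in_language(r"(a|b)*", "abab"))

# 1(0 + 1)*0 becomes 1(0|1)*0 : starts with 1 and ends with 0
print(in_language(r"1(0|1)*0", "1010"), in_language(r"1(0|1)*0", "0110"))
```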
Finite State Automata
• A finite automaton, or finite state machine, is an abstract machine defined by five elements (a 5-tuple).
• It has a set of states and rules for moving from one state to another, depending on the applied input symbol.
• When the input string has been processed and the automaton has reached a final state, the string is accepted.
• The above figure shows the following features of automata:
• Input
• Output
• States of automata
• State relation
• Output relation
A finite automaton consists of the following:
Q : Finite set of states.
Σ : Set of input symbols.
q0 : Initial state.
F : Set of final states.
δ : Transition function.
Deterministic Finite Automata (DFA):
• In a DFA, for a particular input character, the machine goes to one state only.
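A DFA can be simulated with a dictionary that maps each (state, symbol) pair to exactly one next state. The machine below is a hypothetical example over {0, 1} that accepts strings starting with 1 and ending with 0; the state names are assumptions made for this sketch.

```python
# Transition function δ: one next state per (state, input symbol) pair.
TRANSITIONS = {
    ("q0", "1"): "q1", ("q0", "0"): "dead",
    ("q1", "0"): "q2", ("q1", "1"): "q1",
    ("q2", "0"): "q2", ("q2", "1"): "q1",
    ("dead", "0"): "dead", ("dead", "1"): "dead",
}
START = "q0"      # initial state
FINALS = {"q2"}   # set of final states F

def accepts(string):
    """Run the DFA over the string and report whether it ends in a final state."""
    state = START
    for symbol in string:
        state = TRANSITIONS[(state, symbol)]
    return state in FINALS

for s in ["10", "1010", "1", "01", ""]:
    print(repr(s), accepts(s))
```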
Surface Form    Lexical Form
cats            cat +N +PLU
cat             cat +N +SG
goose           goose +N +SG
geese           goose +N +PLU
catch           catch +V
caught          catch +V +PAST
Parts of a Morphological Processor
• Lexicon: the list of stems with basic information about
  categories (noun, verb, adjective, …)
  sub-categories (regular noun, irregular noun, …)
• The simplest way to create a morphological parser is to put all possible words (together with their inflections) into a lexicon.
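The "list every word form" approach amounts to a dictionary lookup. Below is a minimal sketch using the surface and lexical forms given earlier; the names `LEXICON` and `parse` are made up for illustration. Any word form that is not listed simply cannot be analysed.

```python
# Full-form lexicon: every surface form is listed with its lexical form.
LEXICON = {
    "cats": "cat +N +PLU",
    "cat": "cat +N +SG",
    "goose": "goose +N +SG",
    "geese": "goose +N +PLU",
    "catch": "catch +V",
    "caught": "catch +V +PAST",
}

def parse(surface):
    """Look up a surface form; returns None for words missing from the lexicon."""
    return LEXICON.get(surface.lower())

print(parse("geese"))
print(parse("Caught"))
print(parse("dogs"))  # not listed, so no analysis is available
```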
Morphotactics
We cannot find, say, what the lemma is for the word "Boys".
Finite state transducers
• A finite state transducer is essentially a finite state automaton that works on two (or more) tapes. The most common way to think about transducers is as a kind of "translating machine".
• They read from one of the tapes and write onto the other. This, for instance, is a transducer that translates a's into b's:
• A finite state transducer (FST) is a finite state machine whose transitions are conditioned on a pair of symbols
• The machine moves between states based on the input symbol, while it outputs the corresponding output symbol
• An FST encodes a relation, a mapping from one set to another. The relation defined by an FST is called a regular (or rational) relation
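A transducer that rewrites a's as b's can be sketched with a transition table keyed on (state, input symbol) that yields (next state, output symbol). The single-state machine and the function name `transduce` are assumptions made for this illustration.

```python
# One state q0, one transition: read "a" from the input tape, write "b" to the output tape.
A_TO_B = {("q0", "a"): ("q0", "b")}

def transduce(transitions, start, finals, tape):
    """Return the output tape, or None if the input is rejected."""
    state, output = start, []
    for symbol in tape:
        if (state, symbol) not in transitions:
            return None  # no transition defined for this input symbol
        state, out_symbol = transitions[(state, symbol)]
        output.append(out_symbol)
    return "".join(output) if state in finals else None

print(transduce(A_TO_B, "q0", {"q0"}, "aaa"))
print(transduce(A_TO_B, "q0", {"q0"}, "aba"))
```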
Two-Level Morphology
Two-level morphology represents the correspondence between lexical and
surface levels.
• We use a finite-state transducer to find the mapping between these two levels.
• An FST is a two-tape automaton:
  - It reads from one tape and writes to the other.
• For morphological processing, one tape holds the lexical representation and the other holds the surface form of a word.
Morphological segmentation (or Stemming)
• Taking a surface input and breaking it down into its morphemes
• cat - cat+N+SG
• cats - cat+N+PL
• plays - play+V+3Singular
• played - play+V+PastParticiple
• foxes - fox+N+PL
Orthographic Rules
• For each spelling rule we will have an FST, and these FSTs run in parallel.
• We represent these rules using two-level morphology rules:
a => b / c __ d
(rewrite a as b when it occurs between c and d)
• E.g.:
• Playing - Play
• Boys - Boy
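A spelling rule of the form a => b / c __ d can be approximated with a regular-expression substitution. The sketch below implements the textbook e-insertion rule (insert e between a stem ending in x, s, z, ch or sh and the plural s). The intermediate notation, with ^ as morpheme boundary and # as end of word, is an assumption of this sketch, as is the function name.

```python
import re

def apply_e_insertion(intermediate):
    """Map an intermediate form such as 'fox^s#' to its surface spelling."""
    # Insert "e" after x, s, z, ch or sh when a morpheme boundary and final s follow.
    with_e = re.sub(r"(x|s|z|ch|sh)\^s#", r"\1es#", intermediate)
    # Remove any remaining boundary markers.
    return with_e.replace("^", "").replace("#", "")

print(apply_e_insertion("fox^s#"))
print(apply_e_insertion("church^s#"))
print(apply_e_insertion("cat^s#"))
```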
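• The steps below are the suffix rules of the Porter stemmer. Each rule has the form (condition) S1 -> S2: if the word ends in suffix S1 and the stem before S1 satisfies the condition, S1 is replaced by S2.
• Within each set, if more than one of the rules can apply, only the one with the longest matching suffix (S1) is followed.
• In the conditions, m is the measure of the stem (the number of vowel-consonant sequences in it), *v* means the stem contains a vowel, *d means it ends in a double consonant, and *o means it ends consonant-vowel-consonant where the last consonant is not w, x or y.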
Step 1a :
• sses -> ss (Example : caresses -> caress)
• ies -> i (Example : ponies -> poni ; ties -> ti)
• ss -> ss (Example : caress -> caress)
• s -> ε (Example : cats -> cat)
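Step 1a can be sketched as an ordered list of suffix rules in which the first (longest) matching suffix wins. This covers only the four rules listed above; the names `STEP_1A` and `step_1a` are illustrative.

```python
# (S1, S2) pairs ordered so that longer suffixes are tried first.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    """Apply the first matching Step 1a rule; leave the word unchanged otherwise."""
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step_1a(w))
```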
Step 1b :
• (m>0) eed -> ee (Example : agreed -> agree ; feed -> feed)
• (*v*) ed -> ε (Example : plastered -> plaster ; bled -> bled)
• (*v*) ing -> ε (Example : motoring -> motor ; sing -> sing)
If the second or third of the rules in Step 1b is successful, the following cleanup step is done:
• at -> ate (Example : conflat(ed) -> conflate)
• bl -> ble (Example : troubl(ed) -> trouble)
• iz -> ize (Example : siz(ed) -> size)
• (*d and not (*l or *s or *z)) -> single letter (Example : hopp(ing) -> hop ; tann(ed) -> tan ; fall(ing) -> fall ; hiss(ing) -> hiss ; fizz(ed) -> fizz)
• (m=1 and *o) -> e (Example : fil(ing) -> file ; fail(ing) -> fail)
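The conditions in Step 1b depend on the measure m and on whether the stem contains a vowel. The sketch below assumes the usual Porter definitions (a, e, i, o, u are vowels, y is a vowel when it follows a consonant, and m counts vowel-consonant sequences) and implements only the three main rules, not the cleanup step. All function names are illustrative.

```python
import re

def is_consonant(word, i):
    """Porter's definition: y counts as a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m = number of vowel-run followed by consonant-run sequences in the stem."""
    shape = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return len(re.findall(r"v+c+", shape))

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def step_1b(word):
    """Apply (m>0) eed -> ee, (*v*) ed -> '', (*v*) ing -> '' (cleanup step omitted)."""
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if contains_vowel(stem) else word
    return word

for w in ["agreed", "feed", "plastered", "bled", "motoring", "sing"]:
    print(w, "->", step_1b(w))
```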
Step 1c : Y Elimination
• (*v*) y -> i (Example : happy -> happi ; sky -> sky)
Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward.
Step 5a :
• (m>1) e -> ε (Example : probate -> probat ; rate -> rate)
• (m=1 and not *o) e -> ε (Example : cease -> ceas)
Step 5b :
• (m>1 and *d and *l) -> single letter (Example : controll -> control ; roll -> roll)
• This indiscriminate cutting succeeds on some occasions and fails on others
• E.g.
• Studies - Studi
• Giving - Giv
• Intelligence - Intelligen
• The stem produced may not be a meaningful word
Stemming                                      Lemmatization
• Cutting off suffixes to extract the stem    • Cutting off suffixes to extract the lemma
• E.g.                                        • E.g.
• Studies - Studi                             • Studies - Study
• Giving - Giv                                • Giving - Give
• Caring - Car                                • Caring - Care
• The resulting representation of the         • The resulting representation of the
  word may not have any meaning                 word has a meaning
Lemmatization
• Same as stemming, but the resulting representation is a meaningful word
• Stemming and lemmatization are both used to normalize a given word by removing affixes
Stemming                                      Lemmatization
Requires less computational power             Requires more computational power
Not used to make a dictionary or a            Used to make a dictionary or a
WordNet kind of dictionary                    WordNet kind of dictionary
E.g. Caring - Car                             E.g. Caring - Care
Application: Sentiment Analysis, Document     Application: Sentiment Analysis, Document
Clustering, Machine Translation               Clustering, Machine Translation
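The contrast can be illustrated with a toy example: a crude suffix stripper against a dictionary lookup. Both are simplified stand-ins written for this sketch (the suffix list and the `LEMMAS` table are made up); a real system would use a full stemmer and a lexical resource such as WordNet.

```python
def crude_stem(word):
    """Toy stemmer: strip a few suffixes without checking the result is a word."""
    word = word.lower()
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

# Hypothetical lemma dictionary standing in for a real lexical resource.
LEMMAS = {"studies": "study", "giving": "give", "caring": "care"}

def lemmatize(word):
    """Toy lemmatizer: look the word up, fall back to the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())

for w in ["Studies", "Giving", "Caring"]:
    print(w, "stem:", crude_stem(w), "lemma:", lemmatize(w))
```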
N-Gram Language Model
• Predicting the Nth word from the previous N-1 words
• Predicting the 3rd word from the previous 2 words: the model is called a trigram
• Predicting the 2nd word from the previous 1 word: the model is called a bigram
1. He is going to _____ : predicting the 5th word from the last four words is a 5-gram language model
2. going to _____ : predicting the 3rd word from the last two words is a trigram language model
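A bigram model can be sketched by counting, for each word, which words follow it in a corpus. The three-sentence corpus and the function names below are made up for illustration; the probabilities are plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus.
corpus = [
    "he is going to school",
    "she is going to work",
    "he is going to school today",
]

# bigram_counts[prev][nxt] = number of times nxt follows prev.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def predict_next(prev):
    """Most frequent follower of prev, or None if prev was never seen."""
    followers = bigram_counts.get(prev)
    return followers.most_common(1)[0][0] if followers else None

def bigram_prob(prev, nxt):
    """Relative-frequency estimate of P(nxt | prev)."""
    followers = bigram_counts.get(prev)
    if not followers:
        return 0.0
    return followers[nxt] / sum(followers.values())

print(predict_next("to"))
print(bigram_prob("to", "school"))
print(bigram_prob("is", "going"))
```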
Application of N-Gram
1. Optical Character Recognition
• Image -> Text
• Words that are missing or unclear can be predicted
2. Grammar correction
• If the spelling is not correct, a correction can be suggested based on context
• If the spelling is right but the word does not fit the context, it can still be corrected
• E.g. "Deer sir" instead of "Dear sir" can be corrected, as "sir" is normally preceded by "Dear", not "Deer"
3. Speech to Text
• Words with the same pronunciation, e.g. "Eye am fine" and "I am fine"
4. Translation
• Multiple synonyms of words
• E.g. "He is biggest minister of Pakistan" instead of "He is Prime Minister of Pakistan"
5. Suggestions while typing in mobile text prediction mode
• He is going to school -> meaningful
P(w1, w2, ..., wn)
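The joint probability of a word sequence is usually factored with the chain rule and then approximated by conditioning each word only on the previous N-1 words. For a bigram model (N = 2), a standard way to write this is:

```latex
P(w_1, w_2, \ldots, w_n)
  = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```

Here the first factor, P(w_1 | w_0), is read as the probability of w_1 at the start of the sentence; the start symbol w_0 is a notational assumption.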