Morphological Analysis

Morphology

• Morphology is the study of words

• It studies the internal structure of words

• How words change their form to generate new words

• The different roles words play in a sentence, strictly following linguistic rules


Morphemes

• Words are built from smaller meaningful grammatical units called morphemes.
• dogs
• 2 morphemes, ‘dog’ and ‘s’
• ‘s’ is a plural marker on nouns
Morpheme

• The smallest unit that carries meaning

• While analyzing a word, we split the given word into parts.
• Each part is called a morpheme.
• E.g.:
• uneducated -> un + educate + d
• It has three morphemes.


Free Morpheme
◦ Can appear as a word by itself; often can combine with other morphemes too.

◦ house (house-s), walk (walk-ed), of, the, or

◦ Independent/can stand by themselves as single words.

◦ When combined with bound morphemes, free morphemes are called stems/roots.


Lexical Morpheme

◦ Content words (words whose meaning can be visualized)
◦ Nouns - Glass
◦ Adjectives - Black
◦ Verbs - Dance
◦ Adverbs - Beautifully
◦ An 'open' class of morphemes, because we can add new words to the language easily
◦ E.g. girl, play, google, e-mail, blog
Grammatical Morpheme

◦ Also called functional morphemes

◦ Consist of function (grammar) words

◦ Conjunctions - And/Or
◦ Prepositions - On/Under/At
◦ Articles - An/The
◦ Pronouns - She/He

◦ A limited set of words

◦ This is a 'closed' class of morphemes


Bound Morpheme
◦ Cannot appear as a word by itself.
◦ -s (dog-s), -ly (quick-ly), -ed (walk-ed)
◦ cannot stand alone
◦ Not complete in themselves
◦ Depends on free Morpheme
◦ Affixes are bound morpheme
◦ typically attached to free morpheme.
◦ They can be prefixes, infixes, or suffixes (re-, un-, dis-, pre-, -ness, -less, -ly).
Prefix
◦ Attached before lexical morpheme
◦ Be-come
◦ Un-happy
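Infix
◦ Inserted inside the root rather than before or after it; true infixes are rare in English
• unladylike (has a prefix and a suffix, but no infix)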
• 3 morphemes
• un- ‘not’
• lady ‘well-behaved woman’
• -like ‘having the characteristic of’
Suffix

◦ Attached after the root


◦ Teach-er
◦ Use-less
Suffix
◦ 8 inflectional suffixes
◦ Noun
1. Plural - Book -> Books
2. Possessive - Ram -> Ram's
◦ Adjective
1. -er - Tall -> Taller
2. -est - Tall -> Tallest
◦ Verb
1. -s - Play -> Plays
2. -ed - Play -> Played
3. -ing - Play -> Playing
4. -en - Break -> Broken
Derivational Morpheme
◦ Make new words from a stem, often of a different grammatical category
◦ Class changing
◦ Care (verb) -> Careful (adjective)
◦ Care (verb) -> Carelessly (adverb)
◦ Class maintaining
◦ Child (noun) -> Childhood (noun)
Inflectional Morpheme

◦ A morpheme that, when attached to a root, doesn't change the root's class

◦ There are eight inflectional morphemes in English.

◦ They are all suffixes.


Regular Inflectional Morpheme

◦ Refers to inflections that follow a standard pattern

◦ Eg.

◦ Dance+ed->Danced
◦ Walk+ed->Walked
◦ Boy+s->Boys
Irregular Inflectional Morpheme

◦ Refers to inflections that do not follow a standard pattern


◦ Eg.
◦ Wife+s - > Wives
◦ Child+s -> Children
◦ Mouse+s -> Mice
Suppletion Inflectional Morpheme

◦ The form of the morpheme changes completely

◦ A phonemically unrelated form takes its place

◦ Eg.
◦ Go+ed->Went
◦ Good+er->Better
◦ am+ed->was
Identify Inflectional or Derivational
Morphology

• I jumped into the puddle this morning
• This is unbelievable
• I have john’s umbrella
• Emma goes to school
• She is working carelessly
Identify Inflectional or Derivational
Morphology (answers)

• I jumped into the puddle this morning - Inflectional Morphology
• This is unbelievable - Derivational Morphology
• I have John's umbrella - Inflectional Morphology
• Emma goes to school - Inflectional Morphology
• She is working carelessly - Derivational Morphology
Regular Expressions
Some questions on regular expressions
What is a regular expression?

What does the pattern ab+c search for?


• ac,abc,abbc, and so on
• ab,abc,abcc and so on
• abc,abbc,abbbc and so on
• None of the above

How many matches does the pattern a{1,3} give for the string aabbaaaa?
• 3
• 2
• 1
• 4
What is the output of the below code?
print(re.match('On', 'one'))

• 'e'
• Match Object
• ‘n’
• None

• What does the sequence \D match?

a) Decimal digits
b) Non-digit characters
c) Division
d) None of the above

• Which of the following patterns matches any one of 1, 2, 3, 4?

a) [1-4]
b) (1-3)
c) [1234]
d) Both a and c
• What is the output of the below code?
re.sub('a', 'u', 'aeiou!')
• 'ueiou!'
• 'eiou!'
• 'eio!'
• None of the above
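One way to check the quiz answers is to run the patterns through Python's `re` module. The snippet below is only a sketch; the variable names and the sample strings (other than the ones quoted in the questions) are chosen here for illustration.

```python
import re

# ab+c: an 'a', then one or more 'b', then a 'c' (the sample string is illustrative)
ab_matches = re.findall(r"ab+c", "ac abc abbc abbbc")

# a{1,3}: greedy runs of one to three 'a' characters
quiz_matches = re.findall(r"a{1,3}", "aabbaaaa")

# re.match is case sensitive and anchored at the start of the string
match_result = re.match("On", "one")

# \D matches any character that is not a decimal digit
non_digits = re.findall(r"\D", "a1b2")

# [1-4] and [1234] describe the same character class
range_hits = re.findall(r"[1-4]", "0123456")
list_hits = re.findall(r"[1234]", "0123456")

# re.sub replaces every occurrence of the pattern
substituted = re.sub("a", "u", "aeiou!")

print(ab_matches)               # ['abc', 'abbc', 'abbbc']
print(len(quiz_matches))        # 3
print(match_result)             # None
print(non_digits)               # ['a', 'b']
print(range_hits == list_hits)  # True
print(substituted)              # ueiou!
```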
What is a Regular Expression?
A formal language for specifying text search strings

A regular expression is an algebraic notation for characterizing a set of strings.

A regular expression search function searches through the corpus, returning all texts that match the pattern or only the first match.

The corpus can be a single document or a collection.


Regular expressions
• The simplest kind of regular expression is a sequence of simple characters

• A formal language for specifying text strings

• How can we search for any of these?


• woodchuck
• woodchucks
• Woodchuck
• Woodchucks

• Regular expressions are case sensitive

• Each of the above is therefore a different regular expression

Regular Expressions: Disjunctions
• Letters inside square brackets []

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit

• Ranges [A-Z]

Pattern   Matches                 Example
[A-Z]     An upper case letter    Drenched Blossoms
[a-z]     A lower case letter     my beans were impatient
[0-9]     A single digit          Chapter 1: Down the Rabbit Hole
Regular Expressions: Negation in Disjunction

• Negations [^Ss]
• Caret means negation only when first in []

Pattern   Matches                     Example
[^A-Z]    Not an upper case letter    Oyfn pripetchik
[^Ss]     Neither 'S' nor 's'         "I have no exquisite reason"
[^e^]     Neither e nor ^             Look here
a^b       The pattern a caret b       Look up a^b now
Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction

Pattern                      Matches
groundhog|woodchuck          woodchuck
yours|mine                   yours
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck    Woodchuck
Regular Expressions: ? * + .
Pattern   Meaning                       Matches
colou?r   Optional previous char        color colour
oo*h!     0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!      1 or more of previous char    oh! ooh! oooh! ooooh!
baa+                                    baa baaa baaaa baaaaa
beg.n                                   begin begun beg3n
The * and + operators are called Kleene * and Kleene +, after Stephen C Kleene.
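The quantifiers and the wildcard in the table can be tried out with Python's `re` module. This is a small sketch; the helper `matching` and the candidate word list are illustrative assumptions, not part of the slides.

```python
import re

# Candidate strings to test each pattern against (chosen for illustration)
CANDIDATES = ["color", "colour", "oh!", "ooh!", "oooh!",
              "ba", "baa", "baaaa", "begin", "begun", "beg3n"]

def matching(pattern, candidates=CANDIDATES):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(matching(r"colou?r"))  # ? makes the previous character optional
print(matching(r"oo*h!"))    # * allows zero or more of the previous character
print(matching(r"o+h!"))     # + requires one or more of the previous character
print(matching(r"baa+"))     # 'ba' is rejected: at least two a's are needed
print(matching(r"beg.n"))    # . stands for any single character
```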
Regular Expressions: Anchors ^ $
Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1 "Hello"
\.$           The end.
.$            The end? The end!

Consider the strings 993 and 99. The pattern 99 matches inside both of them; to match only the standalone 99, use word boundaries: \b99\b
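The anchor and word-boundary behaviour described above can be checked with a short Python sketch. The variable names are illustrative.

```python
import re

# ^ anchors the match at the start of the string, $ at the end
starts_upper = re.search(r"^[A-Z]", "Palo Alto") is not None
ends_with_period = re.search(r"\.$", "The end.") is not None
question_as_period = re.search(r"\.$", "The end?") is not None
# An unescaped . matches any character, so .$ accepts any final character
ends_with_anything = re.search(r".$", "The end?") is not None

# Without boundaries, 99 is also found inside 993
plain = re.findall(r"99", "993, 99")
# \b requires a word boundary on both sides
bounded = re.findall(r"\b99\b", "993, 99")

print(starts_upper, ends_with_period, question_as_period, ends_with_anything)
print(plain)    # ['99', '99']
print(bounded)  # ['99']
```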
Example
• Find me all instances of the word “the” in a text.
the
Misses capitalized examples
[tT]he
Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
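The three refinements can be compared on a sample sentence. The sketch below uses an invented sentence; it also shows that the last pattern still misses a sentence-initial "The" because it demands a character on both sides, which a word-boundary version avoids.

```python
import re

# Sample text invented for illustration
text = "The other day the theology student saw the cat."

naive = re.findall(r"the", text)
with_capital = re.findall(r"[tT]he", text)
delimited = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)
bounded = re.findall(r"\b[tT]he\b", text)

print(len(naive))         # 4: misses 'The', but hits 'other' and 'theology'
print(len(with_capital))  # 5: now includes 'The', still hits 'other' and 'theology'
print(delimited)          # [' the ', ' the ']: the initial 'The' has nothing before it
print(bounded)            # ['The', 'the', 'the']
```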
Errors
• The process we just went through was based on fixing two kinds of errors:

1. Matching strings that we should not have matched (there, then, other)
False positives (Type I errors)

2. Not matching things that we should have matched (The)
False negatives (Type II errors)
Errors cont.
• In NLP we are always dealing with these kinds of
errors.
• Reducing the error rate for an application often
involves two antagonistic efforts:
• Increasing accuracy or precision (minimizing false
positives)
• Increasing coverage or recall (minimizing false negatives).
Summary
• Regular expressions play a surprisingly large role
• Sophisticated sequences of regular expressions are often
the first model for any text processing task
• For hard tasks, we use machine learning classifiers
• But regular expressions are still used for pre-processing,
or as features in the classifiers
• Can be very useful in capturing generalizations

More Regular Expressions:
Substitutions and ELIZA
Substitutions
• Substitution in Python and UNIX commands:

• s/regexp1/pattern/
• e.g.:
• s/colour/color/
Capture Groups
• Say we want to put angles around all numbers:
the 35 boxes -> the <35> boxes
• Use parens () to "capture" a pattern into a numbered register (1, 2, 3…)
• Use \1 to refer to the contents of the register
s/([0-9]+)/<\1>/
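In Python the same substitution can be written with `re.sub`, where `\1` in the replacement refers to the first capture group. A minimal sketch:

```python
import re

# ([0-9]+) captures a run of digits into register 1; \1 re-inserts it between angle brackets
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")
print(bracketed)  # the <35> boxes
```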
Capture groups: multiple registers
• /the (.*)er they (.*), the \1er we \2/
• Matches
• the faster they ran, the faster we ran
• But not
• the faster they ran, the faster we ate
But suppose we don't want to capture?
Parentheses have a double function: grouping terms, and capturing
Non-capturing groups: add a ?: after paren:
• /(?:some|a few) (people|cats) like some \1/
• matches
• some cats like some cats
• but not
• some cats like some some
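Both the multi-register pattern and the non-capturing group can be tested in Python. The sketch below reuses the slide patterns unchanged; the helper name `found` is an assumption made for illustration.

```python
import re

def found(pattern, text):
    """True if the pattern occurs somewhere in the text."""
    return re.search(pattern, text) is not None

# Two registers: \1 must repeat the first group, \2 the second
TWO_REGISTERS = r"the (.*)er they (.*), the \1er we \2"
print(found(TWO_REGISTERS, "the faster they ran, the faster we ran"))  # True
print(found(TWO_REGISTERS, "the faster they ran, the faster we ate"))  # False

# (?:...) groups without capturing, so (people|cats) is register 1
NON_CAPTURING = r"(?:some|a few) (people|cats) like some \1"
print(found(NON_CAPTURING, "some cats like some cats"))  # True
print(found(NON_CAPTURING, "some cats like some some"))  # False
```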
Lookahead assertions
• (?= pattern) is true if pattern matches, but is zero-width; doesn't advance character pointer
• (?! pattern) is true if pattern does not match
• How to match, at the beginning of a line, any single word that doesn't start with "Volcano":
• /^(?!Volcano)[A-Za-z]+/
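The negative lookahead can be tried in Python as follows. The helper `first_word` and the sample lines are illustrative assumptions.

```python
import re

# (?!Volcano) checks, without consuming characters, that the line does not start with "Volcano"
PATTERN = re.compile(r"^(?!Volcano)[A-Za-z]+")

def first_word(line):
    """Return the leading word unless the line starts with 'Volcano'."""
    match = PATTERN.match(line)
    return match.group() if match else None

print(first_word("Mountains rise slowly"))  # Mountains
print(first_word("Volcanoes erupt"))        # None
print(first_word("Volcanic ash falls"))     # Volcanic
```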
Simple Application: ELIZA
• Early NLP system that imitated a Rogerian psychotherapist
• Joseph Weizenbaum, 1966.

• Uses pattern matching to match, e.g.:
• "I need X"
and translates them into, e.g.
• "What would it mean to you if you got X?"
Simple Application: ELIZA
Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
How ELIZA works
• s/.* I'M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
• s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
• s/.* all .*/IN WHAT WAY?/
• s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE?/
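The four substitution rules can be turned into a tiny responder in Python. This is only a sketch of the idea, not Weizenbaum's program: the function name `respond`, the case-insensitive matching, the uppercasing of the reply, and the fallback reply are all assumptions made here.

```python
import re

# (pattern, reply template) pairs, tried in order; taken from the rules above
RULES = [
    (r".* I'M (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I AM (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY?"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE?"),
]

def respond(utterance):
    """Return the reply of the first rule whose pattern matches."""
    for pattern, template in RULES:
        match = re.search(pattern, utterance, flags=re.IGNORECASE)
        if match:
            # expand() fills \1 with the captured word; upper() mimics ELIZA's style
            return match.expand(template).upper()
    return "PLEASE GO ON"  # hypothetical fallback, not in the slides

print(respond("Men are all alike."))
print(respond("They're always bugging us about something or other."))
print(respond("He says I'm depressed much of the time."))
```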
Write the regular expression for the following

• Write the regular expression that matches a string that has an a followed by zero or more b's

• Write the regular expression for the language L = {ε, a, aa, b, bb, ab, ba, aba, bab, .....}, any combination of a and b.

• Write the regular expression for the language L = {a, aba, aab, aba, aaa, abab, .....}

• Write the regular expression for the language L = {a, aa, aaa, ....}

• Write the regular expression for the language L = {ε, 0, 1, 00, 11, 10, 100, .....}

• Write the regular expression for the language accepting all strings that start with 1 and end with 0, over ∑ = {0, 1}.
Write the regular expression for the following (answers)

In these answers, + between two expressions denotes union (written | in practical regex syntax), while a+ on its own denotes one or more a's.

• A string that has an a followed by zero or more b's: R = ab*
• L = {ε, a, aa, b, bb, ab, ba, aba, bab, .....}, any combination of a and b: R = (a + b)*
• L = {a, aba, aab, aba, aaa, abab, .....}: R = (a + ab)*
• L = {a, aa, aaa, ....}: R = a+
• L = {ε, 0, 1, 00, 11, 10, 100, .....}: R = 1*0*
• All strings starting with 1 and ending with 0, over ∑ = {0, 1}: R = 1 (0+1)* 0
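The textbook answers can be checked by translating them into Python `re` syntax, where union is written with `|` or a character class. The dictionary keys and the helper `accepts` below are illustrative names, and `re.fullmatch` is used so that the whole string has to belong to the language.

```python
import re

# Formal answers rewritten in Python regex syntax
ANSWERS = {
    "a then zero or more b": r"ab*",
    "any mix of a and b": r"[ab]*",
    "a or ab repeated": r"(a|ab)*",
    "one or more a": r"a+",
    "ones then zeros": r"1*0*",
    "starts with 1, ends with 0": r"1[01]*0",
}

def accepts(name, string):
    """True if the whole string belongs to the language of the named answer."""
    return re.fullmatch(ANSWERS[name], string) is not None

print(accepts("a then zero or more b", "abbb"))     # True
print(accepts("any mix of a and b", ""))            # True: the empty string is included
print(accepts("a or ab repeated", "abab"))          # True
print(accepts("a or ab repeated", "abb"))           # False: two b's in a row
print(accepts("ones then zeros", "100"))            # True
print(accepts("starts with 1, ends with 0", "01"))  # False
```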
Finite State Automata

• A finite automaton (FA) is the simplest machine to recognize patterns.

• A finite automaton, or finite state machine, is an abstract machine defined by five elements (a 5-tuple).

• It has a set of states and rules for moving from one state to another, depending on the applied input symbol.

• Basically, it is an abstract model of a digital computer.

• A run of a finite automaton has two possible outcomes: accept or reject.

• When the input string has been processed and the automaton has reached a final state, it accepts the string.
• The figure (not reproduced here) shows the following features of automata:

• Input
• Output
• States of automata
• State relation
• Output relation
A Finite Automata consists of the following:
Q : Finite set of states.
Σ : set of Input Symbols.
q : Initial state.
F : set of Final States.
δ : Transition Function.
Deterministic Finite Automata (DFA):

A DFA consists of a 5-tuple {Q, Σ, q, F, δ}.

Q : set of all states.
Σ : set of input symbols (symbols which the machine takes as input).
q : initial state (starting state of the machine).
F : set of final states.
δ : transition function, defined as δ : Q × Σ → Q.

• In a DFA, for a particular input character, the machine goes to one state only.

• A transition function is defined on every state for every input symbol.


For example, a DFA with Σ = {0, 1} can accept all strings ending with 0.
[Figure: state transition diagram, not reproduced here]
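Since the diagram is not available, here is a minimal sketch of such a DFA as a Python transition table. The state names q0 and q1 and the function name are assumptions; the slide's own diagram may label the states differently.

```python
# 5-tuple pieces of a DFA accepting strings over {0, 1} that end with 0
STATES = {"q0", "q1"}      # Q
ALPHABET = {"0", "1"}      # Σ
START = "q0"               # q: initial state
FINAL = {"q1"}             # F: q1 means "the last symbol read was 0"
DELTA = {                  # δ : Q × Σ -> Q, defined for every state and symbol
    ("q0", "0"): "q1",
    ("q0", "1"): "q0",
    ("q1", "0"): "q1",
    ("q1", "1"): "q0",
}

def accepts_ending_in_zero(string):
    """Run the DFA over the string and report whether it halts in a final state."""
    state = START
    for symbol in string:
        if symbol not in ALPHABET:
            return False   # symbol outside Σ: reject
        state = DELTA[(state, symbol)]
    return state in FINAL

print(accepts_ending_in_zero("1010"))  # True
print(accepts_ending_in_zero("101"))   # False
print(accepts_ending_in_zero(""))      # False
```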
Draw a deterministic finite automaton that accepts strings over {0, 1} ending in 00 or 11, e.g., 01010100 but not 000111010.
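One possible solution to this exercise, sketched as a transition table, is shown below. The five state names are hypothetical labels that record the last one or two symbols read; other equivalent automata exist.

```python
# States remember the tail of the input: start, last symbol 0, last two 00, last symbol 1, last two 11
START = "start"
FINAL = {"zero_zero", "one_one"}
DELTA = {
    ("start", "0"): "zero",          ("start", "1"): "one",
    ("zero", "0"): "zero_zero",      ("zero", "1"): "one",
    ("zero_zero", "0"): "zero_zero", ("zero_zero", "1"): "one",
    ("one", "0"): "zero",            ("one", "1"): "one_one",
    ("one_one", "0"): "zero",        ("one_one", "1"): "one_one",
}

def accepts_double_ending(string):
    """Accept strings over {0, 1} whose last two symbols are 00 or 11."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False  # symbol outside the alphabet
        state = DELTA[(state, symbol)]
    return state in FINAL

print(accepts_double_ending("01010100"))   # True
print(accepts_double_ending("000111010"))  # False
```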
• Finite-state machines can capture the generalization here:
• E.g.
• I eat Sushi
• Ram likes Mango
• Noun+ Verb Noun+
Language is recursive
§ the ball
§ the big ball
§ the big, red ball
§ the big, red, heavy ball

§ the ball
§ the ball in the garden
§ the ball in the garden behind the house
§ the ball in the garden behind the house next to school
Morphological Parsing
Morphological parsing is to find the lexical form of a word from its surface form.
Stem ----- Prefix+Stem+Suffix

Surface Form    Lexical Form
cats            cat +N +PLU
cat             cat +N +SG
goose           goose +N +SG
geese           goose +N +PLU
catch           catch +V
caught          catch +V +PAST
Parts of A Morphological
Processor
•Lexicon: The list of stems with basic information about
categories (noun, verb, adjective, …)
sub-categories (regular noun, irregular noun, …)

•Morphotactics: Explains ordering: which classes of morphemes can follow other classes of morphemes inside a word.

•Orthographic Rules (Spelling Rules): These spelling rules are used to model changes that occur in a word (normally when two morphemes combine).
Lexicon
•A lexicon is a repository for words (stems).

•They are grouped according to their main categories.
–noun, verb, adjective, adverb, …

•They may also be divided into sub-categories.
–regular nouns, irregular singular nouns, irregular plural nouns

•The simplest way to create a morphological parser is to put all possible words (together with their inflections) into a lexicon.
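The "list every word" approach amounts to a lookup table from surface forms to lexical forms. The sketch below uses only the six surface/lexical pairs given under Morphological Parsing; the names `FULL_FORM_LEXICON` and `parse` are illustrative.

```python
# Full-form lexicon: every inflected word is listed with its analysis
FULL_FORM_LEXICON = {
    "cats": "cat +N +PLU",
    "cat": "cat +N +SG",
    "goose": "goose +N +SG",
    "geese": "goose +N +PLU",
    "catch": "catch +V",
    "caught": "catch +V +PAST",
}

def parse(surface_form):
    """Look up the lexical form; returns None for any word not listed."""
    return FULL_FORM_LEXICON.get(surface_form.lower())

print(parse("geese"))   # goose +N +PLU
print(parse("Caught"))  # catch +V +PAST
print(parse("dogs"))    # None: unlisted words cannot be analyzed
```

The last call shows the weakness of this design: every inflected form has to be entered by hand, which is why morphotactics and spelling rules are added to a stem lexicon instead.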
Morphotactics
A lexicon of stems alone cannot tell us, say, what the lemma is for the word Boys.
Finite state transducers
• A finite state transducer essentially is a finite state automaton that works on two (or more) tapes. The most common way to think about transducers is as a kind of "translating machine".

• They read from one of the tapes and write onto the other. For instance, a transducer can translate a's into b's.
Finite state transducers
• A finite state transducer (FST) is a finite state machine where transitions are conditioned on a
pair of symbols
• The machine moves between the states based on input symbol, while it outputs
the corresponding output symbol
• An FST encodes a relation, a mapping from one set to another. The relation defined by an FST
is called a regular (or rational) relation
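A transducer that translates a's into b's can be sketched as a transition table whose arcs carry an input:output pair. The single state name q0 and the function name are assumptions, since the original figure is not available.

```python
# Each arc maps (state, input symbol) to (next state, output symbol)
START = "q0"
FINAL = {"q0"}
ARCS = {
    ("q0", "a"): ("q0", "b"),  # the a:b arc loops on the only state
}

def transduce(input_tape):
    """Read the input tape and write the output tape; None if the input is rejected."""
    state = START
    output = []
    for symbol in input_tape:
        if (state, symbol) not in ARCS:
            return None  # no arc for this symbol: the pair is not in the relation
        state, out_symbol = ARCS[(state, symbol)]
        output.append(out_symbol)
    return "".join(output) if state in FINAL else None

print(transduce("aaa"))  # bbb
print(transduce("ab"))   # None
```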
Two-Level Morphology
Two-level morphology represents the correspondence between lexical and
surface levels.
•We use a finite-state transducer to find mapping between these two levels.
•An FST is a two-tape automaton:
–Reads from one tape and writes to the other one.
•For morphological processing, one tape holds lexical representation, the second
one holds the surface form of a word.
Morphological segmentation (or Stemming)
• Taking a surface input and breaking it down into its morphemes
• cat - cat+N+SG
• cats - cat+N+PL
• Plays - Play+V+3Singular
• Played - Play+V+PastParticiple
• foxes - fox+N+PL
Orthographic Rules
•For each spelling rule we will have an FST, and these FSTs run in parallel.
•We represent these rules using two-level morphology rules:

a => b / c __ d

rewrite a as b when it occurs between c and d.

English Spelling Rules:

– E insertion: e is added after s, z, x, ch, sh before s (watch/watches)
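The E-insertion rule can be approximated with a regular expression over the end of the stem. This is a simplified sketch, not a two-level FST: the function name is hypothetical and the other English spelling rules are ignored.

```python
import re

def add_s_suffix(stem):
    """Attach the -s suffix, inserting e after s, z, x, ch or sh."""
    if re.search(r"(s|z|x|ch|sh)$", stem):
        return stem + "es"
    return stem + "s"

print(add_s_suffix("watch"))  # watches
print(add_s_suffix("fox"))    # foxes
print(add_s_suffix("cat"))    # cats
```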
Stemming
• Stemming is a suffix-stripping operation
• Process of reducing a word to its base form (root form/stem form)
• This is achieved by cutting off the beginning or end of the word

• Eg:
• Playing- Play
• Boys- Boy

• Popular Algorithm is Porter Stemmer


Porter Stemmer

– Five sets of rules, applied in order

– Within each set, if more than one of the rules can apply, only the one with the longest matching suffix (S1) is followed

Advantage: easy to understand, easy to implement.


Convention

• Consonant (C): a letter other than A, E, I, O or U, and other than Y preceded by a consonant.

• So in TOY the consonants are T and Y

• Vowel (V): any other letter.

• Any word in English has the form:
• [C](VC)^m[V]
• [] denotes optional content
• C - a sequence of one or more consonants
• V - a sequence of one or more vowels
• m is called the measure of any word or word part
•m=0 TR, EE, TREE, Y, BY.
•m=1 TROUBLE, OATS, TREES, IVY.
•m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
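The measure m can be computed by labelling each letter as consonant or vowel, collapsing runs, and counting VC pairs. The sketch below follows the convention stated above; the function names are illustrative and it is not taken from any particular stemmer library.

```python
def is_consonant(word, i):
    """Apply the convention: Y is a consonant unless it is preceded by a consonant."""
    letter = word[i]
    if letter in "aeiou":
        return False
    if letter == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Return m in the form [C](VC)^m[V]."""
    word = word.lower()
    collapsed = ""
    for i in range(len(word)):
        label = "C" if is_consonant(word, i) else "V"
        # Collapse runs of consonants or vowels into a single C or V
        if not collapsed or collapsed[-1] != label:
            collapsed += label
    return collapsed.count("VC")

for example in ["TREE", "BY", "TROUBLE", "IVY", "PRIVATE", "ORRERY"]:
    print(example, measure(example))
```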
◦RULES
◦ The rules for removing a suffix will be given in the form

(condition) S1 -> S2

If a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2.
◦ e.g. (m > 1) EMENT -> Є

Here S1 is 'EMENT' and S2 is null.

◦ This maps REPLACEMENT to REPLAC,
◦ since REPLAC is a word part for which m = 2.
Condition part   Meaning
*S     the stem ends with S (and similarly for the other letters).
*v*    the stem contains a vowel.
*d     the stem ends with a double consonant (e.g. -TT, -SS).
*o     the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).

Conditions can be combined, e.g. (m>1 or *S)


PORTER STEMMER

Step 1a :
•sses -> ss (Example : caresses -> caress)
•ies -> i (Example : ponies -> poni ; ties -> ti)
•ss -> ss (Example : caress -> caress)
•s ->Є (Example : cats -> cat)
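Step 1a can be sketched as an ordered list of suffix rewrites where the first (longest) matching suffix wins. The names `STEP_1A_RULES` and `step_1a` are illustrative; this covers only Step 1a, not the full stemmer.

```python
# (S1, S2) pairs, ordered so that the longest matching suffix is tried first
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]

def step_1a(word):
    """Apply the single Step 1a rule with the longest matching suffix."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for example in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(example, "->", step_1a(example))
```

The ss -> ss rule looks redundant but matters: it stops the shorter s rule from turning caress into cares.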
Step 1b :
•(m>0) eed -> ee (Example : agreed -> agree ; feed -> feed)
•(*v*) ed -> є (Example : plastered -> plaster ; bled -> bled)
•(*v*) ing -> є (Example : motoring -> motor ; sing -> sing)
If the second or third of the rules in Step 1b is successful, the
following is done: Cleaning Step
•at -> ate (Example : conflat(ed) -> conflate)
•bl -> ble (Example : troubl(ed) -> trouble)
•iz -> ize (Example : siz(ed) -> size)
•(*d and not (*L or *S or *Z)) -> single letter
• (Example : hopp(ing) -> hop ; tann(ed) -> tan ; fall(ing) -> fall ; hiss(ing) -> hiss ; fizz(ed) -> fizz)
•(m=1 and *o) -> e
• (Example : fil(ing) -> file ; fail(ing) -> fail)
Step 1c : Y Elimination
•(*v*) y -> i (Example : happy -> happi ; sky -> sky)
Step 1 deals with plurals and past participles. The subsequent
steps are much more straightforward.

Step 2 : Derivational Morphology-I

•(m>0) ational -> ate (Example : relational -> relate)
•(m>0) ization -> ize (Example : generalization -> generalize)
•(m>0) biliti -> ble (Example : sensibiliti -> sensible)
Step 3 :

•(m>0) icate -> ic (Example : triplicate -> triplic)
•(m>0) ful -> є (Example : hopeful -> hope)
•(m>0) ness -> є (Example : goodness -> good)
Step 4 : Derivational Morphology-II

•(m>1) ance -> є (Example : allowance -> allow)
•(m>1) ment -> є (Example : adjustment -> adjust)
•(m>1) ent -> є (Example : dependent -> depend)
•(m>1) ive -> є (Example : effective -> effect)
The suffixes are now removed. All that remains is a little tidying up.

Step 5a :
•(m>1) e -> є (Example : probate -> probat ; rate -> rate)
•(m=1 and not *o) e -> є (Example : cease -> ceas)
Step 5b :
•(m > 1 and *d and *L) -> single letter (Example : controll -> control ; roll -> roll)


Disadvantage

• This indiscriminate cutting succeeds on some occasions and fails on others

• E.g.
• Studies - Studi
• Giving - Giv
• Intelligence - Intelligen
• The resulting representation of the word may not have any meaning
Stemming
• Cutting off suffixes to extract the stem
• E.g.
• Studies - Studi
• Giving - Giv
• Caring - Car
• The resulting representation of the word may not have any meaning

Lemmatization
• Cutting off suffixes to extract the lemma
• E.g.
• Studies - Study
• Giving - Give
• Caring - Care
• The resulting representation of the word has meaning
Lemmatization
• Same as stemming, but the resulting representation has meaning
• Stemming and lemmatization are both used to normalize a given word by removing affixes
Stemming                                                 Lemmatization
Root word is called the stem                             Root word is called the lemma
Does not need part-of-speech (POS) tagging               Needs POS tagging
Does not require knowledge of the context                Requires the context of the word in the sentence
Requires less computational power                        Requires more computational power
Not used to make a dictionary or WordNet-like resource   Used to make a dictionary or WordNet-like resource
E.g. Caring - car                                        E.g. Caring - care
Application (both): Sentiment Analysis, Document Clustering, Machine Translation
N-Gram Language Model
• Predicting the Nth word from the previous N-1 words
• Predicting the 3rd word from the previous 2 words: the model is called a trigram model
• Predicting the 2nd word from the previous 1 word: the model is called a bigram model

1. He is going to _____ -> Predicting the 5th word from the last four words is a 5-gram language model
2. going to _____ -> Predicting the 3rd word from the last two words is a trigram language model
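Extracting the N-grams of a sentence is a sliding window over its tokens. A minimal sketch, with whitespace tokenization assumed for simplicity and an illustrative function name:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "He is going to school".split()
print(ngrams(tokens, 2))  # bigrams: pairs of adjacent words
print(ngrams(tokens, 3))  # trigrams: two words of context plus the predicted word
```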
Application of N-Gram
1. Optical Character Recognition
• Image -> Text
• Words that are missing or unclear can be predicted
2. Grammar correction
• If the spelling is not correct, a correction can be suggested based on context
• Even if the spelling is right, the word may be wrong in its context
• E.g. "Deer sir" instead of "Dear sir" can be corrected, as "sir" is normally preceded by "Dear", not "Deer"
3. Speech to Text
• Words with the same pronunciation but different spelling, e.g. "Eye am fine" & "I am fine"
4. Translation
• Multiple synonyms of words
• E.g., "He is biggest minister of Pakistan" -> "Prime"
5. Suggestion while typing in mobile text prediction mode.
• He is going to school - Meaningful

• He school is to going - Doesn't make sense
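For N-gram models we want the joint probability of a word sequence:

$P(w_1, w_2, \dots, w_n)$

By the Chain Rule we can decompose a joint probability, as follows:

$P(w_1, w_2, \dots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \dots w_{n-2} w_{n-1})$

Joint probability of a sentence:

$P(\text{He is going to school}) = P(\text{He})\, P(\text{is} \mid \text{He})\, P(\text{going} \mid \text{He is})\, P(\text{to} \mid \text{He is going})\, P(\text{school} \mid \text{He is going to})$

This requires looking back over the whole history of each word. An N-gram model instead conditions each word on only the previous N-1 words.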
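A bigram model shortens the history to a single previous word and estimates each factor from counts. The sketch below uses a three-sentence toy corpus invented for illustration, a start-of-sentence marker `<s>`, and unsmoothed maximum-likelihood estimates; all of these are assumptions, not part of the slides.

```python
from collections import Counter

# Toy training corpus (invented for illustration)
CORPUS = [
    "he is going to school",
    "he is going to work",
    "she is going to school",
]

bigram_counts = Counter()
history_counts = Counter()
for sentence in CORPUS:
    tokens = ["<s>"] + sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        history_counts[prev] += 1

def sentence_probability(sentence):
    """Multiply P(word | previous word) over the sentence, without smoothing."""
    tokens = ["<s>"] + sentence.lower().split()
    probability = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if history_counts[prev] == 0:
            return 0.0  # history never seen in training
        probability *= bigram_counts[(prev, word)] / history_counts[prev]
    return probability

print(sentence_probability("He is going to school"))  # about 0.44
print(sentence_probability("He school is to going"))  # 0.0
```

The scrambled sentence gets probability zero because the pair (he, school) never occurs in the toy corpus; a practical model would add smoothing so that unseen pairs are not ruled out entirely.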
