Words
Language Processing
Lecture 2: Words and Morphology
Linguistic Morphology
The shape of Words to Come
What? Linguistics?
• Roots
• The central morphemes in words, which carry the main
meaning
• Affixes
• Prefixes
• pre-nuptial, ir-regular
• Suffixes
• determin-ize, iterat-or
• Infixes
• Pennsyl-f**kin-vanian
• Circumfixes
• ge-sammel-t
Nonconcatenative Morphology
• Umlaut
• foot : feet :: tooth : teeth
• Ablaut
• sing, sang, sung
• Root-and-pattern or templatic morphology
• Common in Arabic, Hebrew, and other Afroasiatic languages
• Roots made of consonants, into which vowels are shoved
• Infixation
• Gr-um-adwet
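As a rough illustration of root-and-pattern morphology, the sketch below slots the consonants of a root into a vowel template. The function name interdigitate and the "C" slot notation are assumptions made for this example, and the Arabic stems are simplified (no affixes, no vowel length).

```python
def interdigitate(root: str, template: str) -> str:
    """Fill each 'C' slot in the template with the next consonant of the root."""
    consonants = iter(root)
    out = []
    for symbol in template:
        # 'C' marks a consonant slot; every other symbol is a vowel copied as-is.
        out.append(next(consonants) if symbol == "C" else symbol)
    return "".join(out)


# Root k-t-b ("write") combined with two simplified vowel patterns.
print(interdigitate("ktb", "CaCaC"))  # katab
print(interdigitate("ktb", "CuCiC"))  # kutib
```

The root contributes only consonants and the template contributes only vowels, so neither piece is a contiguous substring of the resulting word. That is what makes the process nonconcatenative.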
Functional Differences in Morphology
• Inflectional morphology
• Adds information to a word consistent with its context within a sentence
• Examples
• Number (singular versus plural)
automaton → automata
walk → walks
• Case (nominative versus accusative versus…)
he, him, his, …
• Derivational morphology
• Creates new words with new meanings (and often with new parts of
speech)
• Examples
• parse → parser
• repulse → repulsive
Irregularity
• Formal irregularity
• Sometimes, inflectional marking differs depending on the root/base
• walk : walked : walked :: sing : sang : sung
• Semantic irregularity/unpredictability
• The same derivational morpheme may have different meanings/functions
depending on the base it attaches to
• a kind-ly old man
• *a slow-ly old man
The Problem and Promise of Morphology
• To teach about finite state machines, we often trace our way from
state to state, consuming symbols from the input tape, until we
reach the final state
• While this is not wrong, it can lead to the wrong idea
• What are we actually asking when we ask whether an FSM accepts a
string? Is there a path through the network that…
• Starts at the initial state
• Consumes each of the symbols on the tape
• Arrives at a final state, coincident with the end of the tape
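The path-existence view of acceptance can be sketched as follows. This is a minimal sketch, assuming a machine without epsilon transitions; the function accepts, the transition-table layout, and the toy SHEEP machine (for the pattern baa+!) are all hypothetical names chosen for illustration.

```python
def accepts(transitions, initial, finals, tape):
    """Return True if some path from `initial` consumes every symbol of
    `tape` and ends in a state belonging to `finals`."""
    current = {initial}  # all states reachable after the symbols read so far
    for symbol in tape:
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, symbol), ())
        }
        if not current:  # every path has died before the tape ended
            return False
    return bool(current & finals)


# Toy machine: (state, symbol) -> set of next states.
SHEEP = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, "a"): {3},
    (3, "!"): {4},
}

print(accepts(SHEEP, 0, {4}, "baaa!"))  # True
print(accepts(SHEEP, 0, {4}, "ba!"))    # False
```

Tracking a set of states rather than a single state means the question answered is "does any path exist", which also covers nondeterministic machines.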
Formal Languages
Input: a word
Output: the word’s stem(s)/lemmas and features
expressed by other morphemes.
1. Table
2. Trie
3. Finite-state transducer
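The first two options can be sketched in a few lines. The entries and the feature tags (+V, +N, +PL, and so on) are hypothetical, chosen only to show the input/output shape; a real analyzer would need a far larger lexicon.

```python
# Option 1: a table mapping each surface form to its analyses.
TABLE = {
    "walks": [("walk", "+V+3SG"), ("walk", "+N+PL")],
    "sang": [("sing", "+V+PAST")],
    "automata": [("automaton", "+N+PL")],
}

END = ""  # end-of-word key; cannot collide with a one-character key


def build_trie(table):
    """Option 2: store the same entries in a character trie (nested dicts)."""
    root = {}
    for word, analyses in table.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = analyses
    return root


def lookup(trie, word):
    """Follow `word` character by character; return its analyses or []."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return []
    return node.get(END, [])


TRIE = build_trie(TABLE)
print(lookup(TRIE, "walks"))
print(lookup(TRIE, "walk"))  # [] because "walk" is only a prefix of an entry
```

Both structures only recognize forms that were listed in advance. A finite-state transducer can instead capture productive patterns, so it can analyze forms that were never listed one by one.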
Finite State Transducers
... qi --(s:t)--> qj ...
s ∈ Σ* and t ∈ Δ*
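An arc from state qi to state qj labeled s:t reads the input string s and writes the output string t. The sketch below is one way to run such arcs, not a toolkit API: the tuple layout (source, s, t, destination), the function transduce, and the toy arcs are assumptions. As a simplification, arcs with an empty input string are skipped so that the search always terminates.

```python
def transduce(arcs, initial, finals, tape):
    """Return the output of every path that consumes all of `tape`
    and ends in a final state. Each arc is (source, s, t, destination)."""
    results = []
    stack = [(initial, 0, "")]  # (state, tape position, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(tape) and state in finals:
            results.append(out)
        for src, s, t, dst in arcs:
            # Simplification: require non-empty input on every arc.
            if src == state and s and tape.startswith(s, pos):
                stack.append((dst, pos + len(s), out + t))
    return results


# Toy analyzer: surface form in, stem plus hypothetical feature tags out.
ARCS = [
    (0, "cat", "cat+N", 1),
    (1, "s", "+PL", 2),
]

print(transduce(ARCS, 0, {1, 2}, "cats"))  # ['cat+N+PL']
print(transduce(ARCS, 0, {1, 2}, "cat"))   # ['cat+N']
```

Returning a list rather than a single string reflects that a transducer defines a relation: an ambiguous input may map to several outputs, and an unrecognized input maps to none.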
Turkish Example
uygarlaştıramadıklarımızdanmışsınızcasına
“(behaving) as if you are among those whom we were not able to civilize”
uygar “civilized”
+laş “become”
+tır “cause to”
+ama “not able”
+dık past participle
+lar plural
+ımız first person plural possessive (“our”)
+dan ablative case (“from/among”)
+mış past
+sınız second person plural (“y’all”)
+casına finite verb → adverb (“as if”)
Morphological Parsing with FSTs
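[Rewrite rule lost in extraction. Surviving fragments: e, a rewrite arrow, the context separator /, the morpheme-boundary symbol ^, and a braced set of alternatives.]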
FST in Theory, Rule in Practice
• There are a number of FST toolkits (XFST, HFST, Foma, etc.) that
allow you to compile rewrite rules into FSTs
• Rather than manually constructing an FST to handle orthographic
alternations, you would be more likely to write rules in a notation
similar to the rule on the preceding slide.
• Cascades of such rules can then be compiled into an FST and
composed with other FSTs
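The idea of a rule cascade can be imitated with ordinary regular-expression substitution. This is only a stand-in: it does not use XFST, HFST, or Foma, and it compiles nothing into an FST. The e-insertion rule, the boundary symbols (^ for a morpheme boundary, # for a word boundary), and the helper names rule and cascade are hypothetical, chosen for illustration.

```python
import re
from functools import reduce


def rule(pattern, replacement):
    """Turn one context-dependent rewrite rule into a string function."""
    return lambda s: re.sub(pattern, replacement, s)


def cascade(*rules):
    """Compose rules so that each one rewrites the previous rule's output."""
    return lambda s: reduce(lambda acc, r: r(acc), rules, s)


# Insert e at a morpheme boundary that follows x, s, or z and precedes
# the suffix s at the end of the word: fox^s# becomes foxe^s#.
e_insertion = rule(r"(?<=[xsz])\^(?=s#)", "e^")

# Delete the boundary symbols to obtain the surface spelling.
strip_boundaries = rule(r"[\^#]", "")

generate = cascade(e_insertion, strip_boundaries)

print(generate("fox^s#"))  # foxes
print(generate("cat^s#"))  # cats
```

Order matters here: if the boundaries were stripped first, the insertion rule would no longer find its context. The same holds when a cascade of rules is composed into a single transducer.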
Combining FSTs
parse
generate
Operations on FSTs