Words

Uploaded by

s90505656


Algorithms for Natural Language Processing
Lecture 2: Words and Morphology
Linguistic Morphology
The shape of Words to Come
What? Linguistics?

• One common complaint we receive in this course goes something like the following:
“I’m not a linguist, I’m a computer scientist! Why do you keep
talking to me about linguistics?”
• NLP is not just P; it’s also NL
• Just as you would need to know something about biology in
order to do computational biology, you need to know
something about natural language to do NLP
• If you were linguists, we wouldn’t have to talk much about
natural language because you would already know about it
What is Morphology?

• Words are not atoms

• They have internal structure
• They are composed (to a first approximation) of morphemes
• It is easy to forget this if you are working with English or Chinese, since they
are simpler, morphologically speaking, than most languages.
• But...
• mis-understand-ing-s
• 同志们 tongzhi-men ‘comrades’
Kinds of Morphemes

• Roots
• The central morphemes in words, which carry the main
meaning
• Affixes
• Prefixes
• pre-nuptial, ir-regular
• Suffixes
• determin-ize, iterat-or
• Infixes
• Pennsyl-f**kin-vanian
• Circumfixes
• ge-sammel-t
Nonconcatenative Morphology

• Umlaut
• foot : feet :: tooth : teeth
• Ablaut
• sing, sang, sung
• Root-and-pattern or templatic morphology
• Common in Arabic, Hebrew, and other Afroasiatic languages
• Roots made of consonants, into which vowels are shoved
• Infixation
• Gr-um-adwet
Functional Differences in Morphology

• Inflectional morphology
• Adds information to a word consistent with its context within a sentence
• Examples
• Number (singular versus plural)
• automaton → automata
• walk → walks
• Case (nominative versus accusative versus…)
• he, him, his, …
• Derivational morphology
• Creates new words with new meanings (and often with new parts of
speech)
• Examples
• parse → parser
• repulse → repulsive
Irregularity

• Formal irregularity
• Sometimes, inflectional marking differs depending on the root/base
• walk : walked : walked :: sing : sang : sung
• Semantic irregularity/unpredictability
• The same derivational morpheme may have different meanings/functions
depending on the base it attaches to
• a kind-ly old man
• *a slow-ly old man
The Problem and Promise of Morphology

• Inflectional morphology (especially) makes instances of the same
word appear to be different words
• Problematic in information extraction, information retrieval
• Morphology encodes information that can be useful (or even
essential) in NLP tasks
• Machine translation
• Natural language understanding
• Semantic role labeling
Morphology in NLP

• The processing of morphology is largely a solved problem in NLP

• A rule-based solution to morphology: finite state methods
• Other solutions
• Supervised, sequence-to-sequence models
• Unsupervised models
Levels of Analysis
Level                 hugging        panicked         foxes
Lexical form          hug +V +Prog   panic +V +Past   fox +N +Pl
                                                      fox +V +Sg
Morphemic form        hug^ing#       panic^ed#        fox^s#
(intermediate form)
Orthographic form     hugging        panicked         foxes
(surface form)

• In morphological analysis, map from orthographic form to lexical form (using
the morphemic form as intermediate representation)
• In morphological generation, map from lexical form to orthographic form (using
the morphemic form as intermediate representation)
Morphological Analysis and Generation:
How?
• Finite-state transducers (FSTs)
• Define regular relations between strings
• “foxes”ℜ“fox +V +3p +Sg +Pres”
• “foxes”ℜ“fox +N +Pl”
• Widely used in practice, not just for morphological analysis and generation,
but also in speech applications, surface syntactic parsing, etc.
• Once compiled, run in linear time (proportional to the length of the input)
• To understand FSTs, we will first learn about their simpler relative,
the FSA or FSM
• Should be familiar from theoretical computer science
• FSAs can tell you whether a word is morphologically “well-formed” but
cannot do analysis or generation
Finite State Automata
Accept them!
Finite-State Automaton

• Q: a finite set of states

• q0 ∈ Q: a special start state
• F ⊆ Q: a set of final states
• Σ: a finite alphabet
• Transitions: qi → qj, each labeled with a string s ∈ Σ*
• Encodes a set of strings that can be recognized
by following paths from q0 to some state in F.
A “baaaaa!”d Example of an FSA
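The diagram for this slide did not survive extraction. The title suggests the familiar "sheep language" (a b, then two or more a's, then !), so the following is a minimal sketch under that assumption. The state numbering and the transition table are illustrative, not taken from the slide.

```python
# Hypothetical transition table for the sheep language: b, then aa, then
# any number of further a's, then "!".
# States are integers; 0 is the start state and 4 is the only final state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any number of further a's
    (3, "!"): 4,
}
START, FINALS = 0, {4}

def accepts(word: str) -> bool:
    """Return True if the automaton is in a final state after reading word."""
    state = START
    for symbol in word:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False              # no transition available: reject
        state = TRANSITIONS[key]
    return state in FINALS

print(accepts("baaaaa!"))
print(accepts("ba!"))
```

Because this machine is deterministic, there is at most one path per input, so a single `state` variable is enough to track it.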
Don’t Let Pedagogy Lead You Astray

• To teach about finite state machines, we often trace our way from
state to state, consuming symbols from the input tape, until we
reach the final state
• While this is not wrong, it can lead to the wrong idea
• What are we actually asking when we ask whether a FSM accepts a
string? Is there a path through the network that…
• Starts at the initial state
• Consumes each of the symbols on the tape
• Arrives at a final state, coincident with the end of the tape
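The "is there a path" view can be made concrete by tracking every state that some path could have reached so far. This is a sketch only: the example machine, its state names, and the function signature are assumptions made for illustration, and epsilon transitions are left out.

```python
def accepts(transitions, start, finals, tape):
    """True iff some path from start consumes all of tape and ends in finals.

    transitions maps (state, symbol) to a set of next states, so the
    machine may be nondeterministic.
    """
    current = {start}                      # every state reachable so far
    for symbol in tape:
        current = {nxt
                   for state in current
                   for nxt in transitions.get((state, symbol), set())}
        if not current:                    # no path survives this symbol
            return False
    # Accept only if a surviving path ends in a final state at end of tape.
    return bool(current & finals)

# Hypothetical nondeterministic machine for strings over {a, b} ending in "ab".
NFA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
print(accepts(NFA, "q0", {"q2"}, "babab"))
print(accepts(NFA, "q0", {"q2"}, "aba"))
```

The function never commits to one path; it answers the existence question directly, which is the point of the slide.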
Formal Languages

• A formal language is a set of strings, typically one that
can be generated/recognized by an automaton
• A formal language is therefore potentially quite different
from a natural language
• However, a lot of NLP and CL involves treating natural
languages like formal languages
• The languages that can be recognized by FSAs are
called regular languages
• Conveniently, (most) natural language morphologies
belong to the set of regular languages
FSAs and Regular Expressions

• The languages that can be characterized by FSAs
are called “regular” as in “regular expression”
• Regular expressions, as you may know, are a fairly
convenient and standard way to represent something
equivalent to a finite state machine
• The equivalence is pretty intuitive (see the book)
• There is also an elegant proof (not in the book)
• Note that “regular expression” implementations in
programming languages like Perl and Python often go
beyond true regular expressions
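One way to see the gap is a backreference, which Python's `re` module supports. The pattern below is an illustrative choice, not one from the lecture: it matches exactly the strings made of n a's, a b, and the same number of a's again, a language that no finite-state automaton can recognize.

```python
import re

# \1 must repeat whatever group 1 matched, so this pattern describes
# strings a^n b a^n (n >= 1), which is not a regular language.
pattern = re.compile(r"(a+)b\1")

print(bool(pattern.fullmatch("aaabaaa")))   # equal counts on both sides
print(bool(pattern.fullmatch("aaabaa")))    # unequal counts
```

Patterns restricted to concatenation, alternation, and Kleene star stay within the regular languages and can be compiled to an FSA.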
FSA for English Nouns
FSA for English Adjectives
FSA for English Derivational Morphology
Finite State Transducers
I am no longer accepting the things I cannot change; I am changing the things that
I cannot accept
Morphological Parsing/Analysis

Input: a word
Output: the word’s stem(s)/lemmas and features
expressed by other morphemes.

Example: geese  → {goose +N +Pl}
         gooses → {goose +V +3P +Sg}
         dog    → {dog +N +Sg, dog +V}
         leaves → {leaf +N +Pl, leave +V +3P +Sg}
Three Solutions

1. Table
2. Trie
3. Finite-state transducer
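The first solution, a table, can be sketched as a plain dictionary lookup. The entries reuse the examples from the morphological parsing slide; the names `ANALYSES` and `analyze` are made up for this illustration.

```python
# A toy analysis table holding the examples from the parsing slide.
ANALYSES = {
    "geese":  ["goose +N +Pl"],
    "gooses": ["goose +V +3P +Sg"],
    "dog":    ["dog +N +Sg", "dog +V"],
    "leaves": ["leaf +N +Pl", "leave +V +3P +Sg"],
}

def analyze(word: str) -> list[str]:
    """Return every stored analysis of word, or an empty list if unknown."""
    return ANALYSES.get(word.lower(), [])

print(analyze("leaves"))
print(analyze("dogs"))      # not in the table, so no analysis is found
```

The table has to list every inflected form explicitly, so an unlisted form such as "dogs" gets no analysis even though its stem is present. That limitation is one motivation for the other two solutions.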
Finite State Transducers

• Q: a finite set of states

• q0 ∈ Q: a special start state
• F ⊆ Q: a set of final states
• Σ and Δ: two finite alphabets
• Transitions: qi → qj, each labeled s:t, where s ∈ Σ* and t ∈ Δ*
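The definition can be exercised with a small path search that collects the output of every accepting path. This is a sketch under stated assumptions: the arc representation, the state names, and the toy "foxes" machine are invented for illustration, and the search assumes there is no cycle of arcs with empty input.

```python
def transduce(arcs, start, finals, tape):
    """Return the set of outputs of all accepting paths that consume tape.

    arcs maps a state to a list of (input, output, next_state) triples;
    input and output are strings and may be empty (epsilon).
    Assumes there is no cycle of empty-input arcs.
    """
    results = set()
    stack = [(start, 0, "")]       # (state, position in tape, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(tape) and state in finals:
            results.add(out)
        for inp, outp, nxt in arcs.get(state, []):
            if tape.startswith(inp, pos):
                stack.append((nxt, pos + len(inp), out + outp))
    return results

# Hypothetical toy transducer from the surface form "foxes" to lexical forms.
ARCS = {
    "q0": [("fox", "fox", "q1")],
    "q1": [("es", " +N +Pl", "q2"),
           ("es", " +V +3P +Sg", "q2")],
}
print(transduce(ARCS, "q0", {"q2"}, "foxes"))
```

The result is a set because a transducer encodes a relation, not a function: one input may be paired with several outputs, as with the noun and verb readings here.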
Turkish Example
uygarlaştıramadıklarımızdanmışsınızcasına
“(behaving) as if you are among those whom we were not able to civilize”

uygar     “civilized”
+laş      “become”
+tır      “cause to”
+ama      “not able”
+dık      past participle
+lar      plural
+ımız     first person plural possessive (“our”)
+dan      ablative case (“from/among”)
+mış      past
+sınız    second person plural (“y’all”)
+casına   finite verb → adverb (“as if”)
Morphological Parsing with FSTs

• Note “same symbol” shorthand.

• ^ denotes a morpheme boundary.
• # denotes a word boundary.
• ^ and # are not there
automatically—they must be
inserted.
English Spelling
The E Insertion Rule as a FST

ε → e / {x, s, z} ^ __ s #
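As a rough illustration, the rule can be imitated with a regular-expression substitution. The sketch assumes the common textbook formulation of E insertion (insert e after a morpheme-final x, s, or z when the suffix s follows); the function name is hypothetical, and other spelling rules such as consonant doubling are not handled.

```python
import re

def e_insertion(morphemic: str) -> str:
    """Map a morphemic form such as 'fox^s#' to its surface spelling."""
    # Insert e after x, s, or z when a morpheme boundary and the suffix s follow.
    with_e = re.sub(r"([xsz])\^(s#)", r"\1e^\2", morphemic)
    # Boundary symbols do not appear in the orthographic form.
    return with_e.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))
print(e_insertion("dog^s#"))
```

A substitution like this is a stand-in for one rule in a cascade; an FST toolkit would compile the rule itself into a transducer.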
FST in Theory, Rule in Practice

• There are a number of FST toolkits (XFST, HFST, Foma, etc.) that
allow you to compile rewrite rules into FSTs
• Rather than manually constructing an FST to handle orthographic
alternations, you would be more likely to write rules in a notation
similar to the rule on the preceding slide.
• Cascades of such rules can then be compiled into an FST and
composed with other FSTs
Combining FSTs

parse

generate
Operations on FSTs

• There are a number of operations that can be performed on FSTs:
• intersection: Given transducers T and S, the relation T ∩ S is defined by
x[T ∩ S]y iff x[T]y and x[S]y. FSTs are not closed under intersection, so T ∩ S
cannot always be expressed as a transducer.
• union: Given transducers T and S, there exists a transducer T ∪ S such that
x[T ∪ S]y iff x[T]y or x[S]y. FSTs are closed under union.
• concatenation: Given transducers T and S, there exists a transducer
T · S such that x1x2[T · S]y1y2 iff x1[T]y1 and x2[S]y2.
• Kleene closure: Given a transducer T, there exists a transducer T* such that
ϵ[T*]ϵ, and if w[T*]y and x[T]z then wx[T*]yz; x[T*]y holds only if one of these
two conditions holds.
• composition: Given transducers T and S, there exists a transducer T ∘ S such that
x[T ∘ S]z iff x[T]y and y[S]z; effectively equivalent to feeding an input to T,
collecting the output from T, feeding this output to S and collecting the output
from S.
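The definitions of union, concatenation, and composition can be checked on small examples by treating a transducer as the relation it encodes. In this sketch, finite sets of (input, output) pairs stand in for those relations; this is an assumption made for illustration, since real toolkits operate on the automata themselves and the relations may be infinite.

```python
def compose(t, s):
    """x[T ∘ S]z iff there is some y with x[T]y and y[S]z."""
    return {(x, z) for (x, y1) in t for (y2, z) in s if y1 == y2}

def union(t, s):
    """x[T ∪ S]y iff x[T]y or x[S]y."""
    return t | s

def concatenate(t, s):
    """x1x2[T · S]y1y2 iff x1[T]y1 and x2[S]y2."""
    return {(x1 + x2, y1 + y2) for (x1, y1) in t for (x2, y2) in s}

# Toy relations: lexical -> morphemic, then morphemic -> orthographic.
LEX_TO_MORPH = {("fox +N +Pl", "fox^s#"), ("hug +V +Prog", "hug^ing#")}
MORPH_TO_ORTH = {("fox^s#", "foxes"), ("hug^ing#", "hugging")}

generate = compose(LEX_TO_MORPH, MORPH_TO_ORTH)
print(sorted(generate))
```

Composing the two toy relations yields a single lexical-to-orthographic relation with the intermediate morphemic forms hidden.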
FST Operations
A Word to the Wise

• You will be asked to create FSTs in a homework assignment and on
an exam
• Sometimes, you will need to draw multiple FSTs and then combine
them using FST operations
• The most common of these is composition
• If you catch yourself saying “The output of FST A is the input to FST
B,” stop yourself and instead say “Compose FST A with FST B” or
simply “A ∘ B”
Operations on FSTs (cont.)

• FSTs are not closed under determinization, which is nevertheless
an important operation
• Given a transducer T, construct an equivalent transducer Tʹ in
which no two transitions leaving the same state have the same
label
• There are algorithms for determinizing FSTs, but they don’t
always halt (see powerset construction) and they often result in
much larger networks
• There are also algorithms for determining whether an FST can be
determinized (whether powerset construction will halt)
ML and Morphology

• Morphology is one area where, in practice, you may
want to use hand-engineered rules rather than
machine learning
• ML solutions for morphology do exist, including
interesting unsupervised methods
• However, unsupervised methods typically give you only
the parse of the word into morphemes (prefixes, root,
suffixes) rather than lemmas and inflectional features,
which may not be suitable for some applications
STEMMING → STEM
Stemming (“Poor Man’s Morphology”)
Input: a word
Output: the word’s stem (approximately)

Examples from the Porter stemmer:

• -sses → -ss
• -ies → -i
• -ss → -ss
Word        Stem
no          no
noah        noah
nob         nob
nobility    nobil
nobis       nobi
noble       nobl
nobleman    nobleman
noblemen    noblemen
nobleness   nobl
nobler      nobler
nobles      nobl
noblesse    nobless
noblest     noblest
nobly       nobli
nobody      nobodi
noces       noce
nod         nod
nodded      nod
nodding     nod
noddle      noddl
noddles     noddl
noddy       noddi
nods        nod
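The suffix rules listed above belong to step 1a of the Porter stemmer, which can be sketched in a few lines. This covers only that one step, including its final rule that drops a lone trailing s; the full algorithm has further steps, which is why some stems in the listing (for example nobles → nobl) are shorter than what this function returns. The function name is made up for the example.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter stemmer (plural-like suffixes)."""
    # Rules are tried in order; the first matching suffix wins.
    if word.endswith("sses"):
        return word[:-2]      # -sses -> -ss      (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]      # -ies  -> -i       (ponies -> poni)
    if word.endswith("ss"):
        return word           # -ss   -> -ss      (caress -> caress)
    if word.endswith("s"):
        return word[:-1]      # -s    -> nothing  (nods -> nod)
    return word

for w in ["caresses", "ponies", "caress", "nods", "noces"]:
    print(w, porter_step_1a(w))
```

The rules only strip or rewrite suffixes, so the result is a stem-like string (poni, noce) and not necessarily a dictionary word.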
Tokenization
Tokenization

Input: raw text

Output: sequence of tokens normalized for easier processing.
“Tokenization is easy, they said!
Just split on whitespace, they
said!”*

*Provided you’re working in English so words are (mostly)
whitespace-delimited, but even then…
The Challenge

Dr. Mortensen said tokenization of English is “harder than you’ve
thought.” When in New York, he paid $12.00 a day for lunch and
wondered what it would be like to work for AT&T or Google, Inc.
Finite State Tokenization

• How can finite state techniques be used to tokenize text?
• Why might they be useful?
• Can you think of other potential tokenization techniques?
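One possible answer to the first question is a tokenizer written as a single regular expression. The pattern below uses only alternation, concatenation, and repetition (no backreferences or lookaround), so it stays within what a finite-state machine can do. The abbreviation list and the token classes are assumptions chosen to handle the example passage, not a general solution.

```python
import re

# Alternatives are tried left to right, so the special cases come first.
TOKEN = re.compile(r"""
    (?:Dr|Mr|Mrs|Ms|Inc)\.      # a small, hypothetical abbreviation list
  | \$?\d+(?:\.\d+)?            # numbers and prices such as $12.00
  | \w+(?:[&'’]\w+)*            # words, allowing internal & or apostrophe
  | [^\w\s]                     # any other single punctuation character
""", re.VERBOSE)

def tokenize(text: str) -> list[str]:
    """Return the tokens of text in order; whitespace is discarded."""
    return TOKEN.findall(text)

print(tokenize("Dr. Mortensen paid $12.00 a day."))
print(tokenize("work for AT&T or Google, Inc."))
```

Splitting on whitespace would instead leave "day." and "Google," with punctuation attached. Whether "Inc." at the end of a sentence should also yield a separate period token is a design decision this sketch does not address.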
