
First-order logic (FOL)

Propositional logic, while powerful and widely used, does have some drawbacks:
1. Limited Expressiveness: Propositional logic deals only with propositions (statements that
are either true or false) and does not capture the complexity of relationships between
objects or concepts. It lacks the ability to represent quantifiers like "for all" (∀) and "there
exists" (∃), which are essential in predicate logic.
2. Inability to Handle Ambiguity: Propositional logic cannot handle ambiguous statements
effectively. Ambiguity arises when a statement can be interpreted in multiple ways, and
propositional logic lacks the capacity to resolve such ambiguity.
3. Explosion of Complexity: In certain cases, especially when dealing with a large number
of variables or propositions, the number of possible combinations grows exponentially,
leading to what is known as the "combinatorial explosion." This explosion makes reasoning
and computation in propositional logic impractical for complex systems.
4. No Representation of Relationships: Propositional logic treats propositions as atomic
units and does not provide a way to represent relationships between propositions. For
example, it cannot express the concept of implication within a single proposition.
5. No Handling of Uncertainty or Probability: Propositional logic is deterministic,
meaning it assumes propositions are either true or false with no uncertainty. It lacks the
ability to model probabilistic relationships, which are crucial in many real-world scenarios.
6. Limited Scope in Real-world Applications: While propositional logic is useful for
representing simple relationships in domains like computer science and mathematics, it
often falls short in modeling the complexities of real-world scenarios where uncertainty,
ambiguity, and relationships are prevalent.

First-order logic (FOL), also known as predicate logic, offers several advantages over
propositional logic:
1. Expressiveness: FOL allows for the representation of complex relationships and structures
by introducing variables, quantifiers (such as ∀ for "for all" and ∃ for "there exists"), and
predicates. This expressiveness enables FOL to capture the richness of natural language
and real-world scenarios more effectively than propositional logic.
2. Quantification: FOL includes quantifiers, which allow statements to be made about entire
classes of objects or individuals, not just specific instances. This feature enables FOL to
reason about general properties and make universal statements, whereas propositional logic
can only make assertions about specific propositions.
3. Predicates and Functions: FOL allows the use of predicates to express relationships
between objects and functions to represent operations or transformations. This capability
makes FOL suitable for modeling a wide range of domains, including mathematics,
linguistics, and artificial intelligence.
4. Modularity and Reusability: FOL facilitates the modular representation of knowledge by
allowing the definition of reusable predicates and functions. This modularity enhances the
clarity and maintainability of logical systems, as well as the ability to reuse components
across different contexts.
5. Ability to Capture Complex Relationships: FOL can express complex relationships
between objects, including hierarchical structures, temporal relationships, and
dependencies. This capability enables FOL to represent and reason about real-world
phenomena more accurately and comprehensively than propositional logic.
6. Resolution of Ambiguity: FOL provides mechanisms for disambiguating statements
through the use of variables and quantifiers. Unlike propositional logic, which struggles
with ambiguity, FOL can handle more nuanced and context-dependent interpretations of
statements.
7. Soundness and Completeness: FOL has well-defined semantics and inference rules that
ensure soundness (correctness) and completeness (ability to derive all valid conclusions)
of reasoning processes. This property makes FOL a reliable framework for formalizing and
reasoning about knowledge.
Overall, first-order logic offers a significant advancement over propositional logic in terms of
expressiveness, representational power, and reasoning capabilities, making it a foundational tool
in various fields such as mathematics, computer science, philosophy, and linguistics.
ELEMENTS OF FIRST ORDER LOGIC
First-order logic (FOL), also known as predicate logic, consists of several fundamental elements:
1. Variables: Variables in FOL represent placeholders for objects or individuals in the
domain of discourse. They are typically denoted by lowercase letters such as x, y, z, etc.,
and can be quantified over using quantifiers like ∀ (for all) and ∃ (there exists).
2. Constants: Constants are specific objects or individuals in the domain of discourse. They
are represented by symbols and typically denoted by lowercase or uppercase letters like a,
b, c, etc. Unlike variables, constants do not vary and represent fixed elements.
3. Predicates: Predicates in FOL represent properties or relations that can be true or false of
objects in the domain. They are denoted by uppercase letters followed by a list of variables
or constants enclosed in parentheses. For example, P(x), Q(x,y), and R(a,b,c) are predicates.
4. Functions: Functions in FOL represent operations or transformations that produce an output value based on input values. They are denoted by lowercase or uppercase letters followed by a list of variables or constants enclosed in parentheses. For example, f(x), g(x,y), and h(a,b,c) are functions.
5. Connectives: FOL includes logical connectives that allow the construction of compound
formulas from simpler ones. The main connectives in FOL are:
• Conjunction (∧): Represents logical AND.
• Disjunction (∨): Represents logical OR.
• Negation (¬): Represents logical NOT.
• Implication (→): Represents logical implication.
• Biconditional (↔): Represents logical equivalence.
6. Quantifiers: FOL includes quantifiers that specify the scope of variables and predicates.
The main quantifiers in FOL are:
• Universal quantifier (∀): Represents "for all" or "for every".
• Existential quantifier (∃): Represents "there exists" or "there is".
7. Equality: FOL includes an equality symbol (=) to express that two terms are equal. It is
used to assert that two objects or expressions refer to the same thing.
8. Brackets and Parentheses: FOL uses brackets and parentheses to indicate the scope and
grouping of subformulas within larger formulas. They are used to disambiguate the order
of operations and clarify the structure of complex formulas.
These elements form the basic building blocks of first-order logic, which provides a rich and
expressive framework for representing and reasoning about relationships, properties, and
structures within a given domain of discourse.
Example:
1. In first-order logic (FOL), the statement "All birds fly" can be expressed using quantifiers,
predicates, and variables. Here's how you can represent it:
All birds fly.
In this question the predicate is "fly(bird)."
And since all birds fly, it will be represented as follows:
∀x bird(x) → fly(x).
In another way,
Let's define:
• B(x) represents the predicate "x is a bird."
• F(x) represents the predicate "x can fly."
Using quantifiers:
• ∀x denotes "for all x" or "for every x," indicating that the statement applies to all
individuals.
• → denotes implication, indicating "if...then" relationship.
So, the statement "All birds fly" can be expressed as: ∀x (B(x)→F(x))
This statement reads as: "For every x, if x is a bird (B(x)), then x can fly (F(x))."

2. Every man respects his parent.


In this question, the predicate is "respect(x, y)," where x = man and y = parent. Since the statement is about every man, we will use ∀, and it will be represented as follows:
∀x man(x) → respects(x, parent).
Or
To express the statement "Every man respects his parent" in first-order logic (FOL), we can use
predicates, quantifiers, and variables. Here's how we can represent it:
Let's define:
• M(x) represents the predicate "x is a man."
• R(x,y) represents the predicate "x respects y."
• P(x) represents the predicate "x is a parent of someone."
Using quantifiers:
• ∀x denotes "for all x" or "for every x," indicating that the statement applies to all
individuals.
So, the statement "Every man respects his parent" can be expressed as:
∀x (M(x)→∃y (P(y)∧R(x,y)))
This statement reads as: "For every x, if x is a man (M(x)), then there exists a y such that y is a
parent (P(y)) and x respects y (R(x, y))."
3. Some boys play cricket.
In this question, the predicate is "play(x, y)," where x = boys and y = game. Since there are some boys, we will use ∃, and it will be represented as:
∃x boys(x) → play(x, cricket).
Or
To express the statement "Some boys play cricket" in first-order logic (FOL), we can use
predicates, quantifiers, and variables. Here's how we can represent it:
Let's define:
• B(x) represents the predicate "x is a boy."
• P(x) represents the predicate "x plays cricket."
Using quantifiers:
• ∃x denotes "there exists an x," indicating that the statement applies to at least one
individual.
So, the statement "Some boys play cricket" can be expressed as: ∃x (B(x)∧P(x))
This statement reads as: "There exists an x such that x is a boy (B(x)) and x plays cricket (P(x))."
4. Not all students like both Mathematics and Science.
In this question, the predicate is "like(x, y)," where x = student and y = subject.
Since not all students are involved, we will use ∀ together with negation, giving the following representation:
¬∀x [student(x) → (like(x, Mathematics) ∧ like(x, Science))].
Or
To express the statement "Not all students like both Mathematics and Science" in first-order
logic (FOL), we can use predicates, quantifiers, and logical operators. Here's how we can
represent it:
Let's define:
• S(x) represents the predicate "x is a student."
• L(x,y) represents the predicate "x likes y," where y is a subject.
Using quantifiers:
• ¬ denotes negation, indicating "not."
So, the statement "Not all students like both Mathematics and Science" can be expressed as:
¬∀x (S(x)→(L(x,Mathematics)∧L(x,Science)))
This statement reads as: "It is not the case that for every x, if x is a student (S(x)), then x likes
Mathematics and x likes Science."
5. Only one student failed in Mathematics.
In this question, the predicate is "failed(x, y)," where x = student and y = subject.
Since exactly one student failed in Mathematics, we can use the following representation:
∃x [student(x) ∧ failed(x, Mathematics) ∧ ∀y ((student(y) ∧ failed(y, Mathematics)) → y = x)].
Or
To express the statement "Only one student failed in Mathematics" in first-order logic (FOL), we
can use predicates, quantifiers, and logical operators. Here's how we can represent it:
Let's define:
• S(x) represents the predicate "x is a student."
• F(x) represents the predicate "x failed in Mathematics."
Using quantifiers:
• ∃! denotes "there exists a unique x," indicating that the statement applies to exactly one individual; ∃!x φ(x) is shorthand for ∃x (φ(x) ∧ ∀y (φ(y) → y = x)).
So, the statement "Only one student failed in Mathematics" can be expressed as: ∃!x (S(x)∧F(x))
This statement reads as: "There exists a unique x such that x is a student (S(x)) and x failed in Mathematics (F(x))."
CFG specification of the syntax of FOPC representations:
Lexicalized and Probabilistic Parsing
5.1 Probabilistic Context-Free grammar (PCFG)
5.1.1 The definition of PCFG:
The Probabilistic Context-Free Grammar (PCFG) is also known as the Stochastic Context-Free Grammar (SCFG).
A context-free grammar G is defined by four parameters {N, Σ, P, S}:
a set of non-terminal symbols N;
a set of terminal symbols Σ (disjoint from N);
a set of productions P, each of the form A → β, where A is a non-terminal symbol and β is a string of symbols drawn from the infinite set of strings (Σ ∪ N)∗;
a designated start symbol S.
A PCFG augments each rule in P with a conditional probability:
A → β [p]
A PCFG is thus a 5-tuple G = {N, Σ, P, S, D}, where D is a function assigning probabilities to each rule in P. This function expresses the probability p that the given non-terminal A will be expanded to the sequence β; it is often referred to as
P(A → β)
or as
P(A → β | A)
Formally, this is the conditional probability of a given expansion given the left-hand-side non-terminal A. If we consider all the possible expansions of a non-terminal, the sum of their probabilities must be 1.
For example, the rules of a PCFG for our small CFG in Chapter 3 can be:

S → NP VP [.80]
S → Aux NP VP [.15]
S → VP [.05]
NP → Det Nominal [.20]
NP → Proper-Noun [.35]
NP → Nominal [.05]
NP → Pronoun [.40]
Nominal → Noun [.75]
Nominal → Noun Nominal [.20]
Nominal → Proper-Noun Nominal [.05]
VP → Verb [.55]
VP → Verb NP [.40]
VP → Verb NP NP [.05]
Det → that [.05] | this [.80] | a [.15]
Noun → book [.10] | flight [.50] | meat [.40]
Verb → book [.30] | include [.30] | want [.40]
Aux → can [.40] | does [.30] | do [.30]
Proper-Noun → Houston [.10] | ASIANA [.20] | KOREAN AIR [.30] | CAAC [.25] | Dragon Air [.15]
Pronoun → you [.40] | I [.60]

These probabilities are not based on a corpus; they were made up merely for expository purposes.
Note that the probabilities of all the expansions of a non-terminal sum to 1.
A PCFG assigns a probability to each parse-tree T of a sentence S. The probability of a particular
parse T is defined as the product of the probabilities of all the rules r used to expand each node n
in the parse tree:

P(T) = ∏ p(r(n)), where the product is taken over all nodes n ∈ T
For example, the sentence "Can you book ASIANA flights" is ambiguous: one meaning is "Can you book flights on behalf of ASIANA", the other is "Can you book flights run by ASIANA". The trees are respectively as follows:
Left tree:  [S [Aux can] [NP [Pro you]] [VP [V book] [NP [Pnoun ASIANA]] [NP [Nom [Noun flights]]]]]
Right tree: [S [Aux can] [NP [Pro you]] [VP [V book] [NP [Nom [Pnoun ASIANA] [Nom [Noun flights]]]]]]
The probabilities of each rule in the left tree are:
S → Aux NP VP [.15]
NP → Pro [.40]
VP → V NP NP [.05]
NP → Nom [.05]
NP → Pnoun [.35]
Nom → Noun [.75]
Aux → can [.40]
NP → Pro [.40]
Pro → you [.40]
Verb → book [.30]
Pnoun → ASIANA [.40]
Noun → flights [.50]

The probabilities of each rule in the right tree are:
S → Aux NP VP [.15]
NP → Pro [.40]
VP → V NP [.40]
NP → Nom [.05]
Nom → Pnoun Nom [.05]
Nom → Noun [.75]
Aux → can [.40]
NP → Pro [.40]
Pro → you [.40]
Verb → book [.30]
Pnoun → ASIANA [.40]
Noun → flights [.50]

The probability P(Tl) of the left tree is:
P(Tl) = .15 × .40 × .05 × .05 × .35 × .75 × .40 × .40 × .40 × .30 × .40 × .50 ≈ 1.5 × 10⁻⁷
The probability P(Tr) of the right tree is:
P(Tr) = .15 × .40 × .40 × .05 × .05 × .75 × .40 × .40 × .40 × .30 × .40 × .50 ≈ 1.7 × 10⁻⁷
We can see that the right tree has the higher probability, so this parse will be chosen as the correct result.
The disambiguation algorithm picks the best tree for a sentence S out of the set of parse trees for S (which we shall call τ(S)). We want the parse tree T which is most likely given the sentence S. Formally, the tree T(S) with the highest probability, taken over T ∈ τ(S), is

T(S) = argmax P(T)
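To make the computation concrete, here is a minimal Python sketch (not part of the original text) that multiplies the rule probabilities listed above for the two parses of "Can you book ASIANA flights" and keeps the more probable one. The rule/probability pairs are copied from the lists above; the function and variable names are illustrative.

from math import prod

# Rule/probability pairs used by each parse, as listed above.
left_tree_rules = [
    ("S -> Aux NP VP", .15), ("NP -> Pro", .40), ("VP -> V NP NP", .05),
    ("NP -> Nom", .05), ("NP -> Pnoun", .35), ("Nom -> Noun", .75),
    ("Aux -> can", .40), ("NP -> Pro", .40), ("Pro -> you", .40),
    ("Verb -> book", .30), ("Pnoun -> ASIANA", .40), ("Noun -> flights", .50),
]
right_tree_rules = [
    ("S -> Aux NP VP", .15), ("NP -> Pro", .40), ("VP -> V NP", .40),
    ("NP -> Nom", .05), ("Nom -> Pnoun Nom", .05), ("Nom -> Noun", .75),
    ("Aux -> can", .40), ("NP -> Pro", .40), ("Pro -> you", .40),
    ("Verb -> book", .30), ("Pnoun -> ASIANA", .40), ("Noun -> flights", .50),
]

def tree_probability(rules):
    """P(T) is the product of the probabilities of all rules used in the tree."""
    return prod(p for _, p in rules)

candidates = {"left": tree_probability(left_tree_rules),
              "right": tree_probability(right_tree_rules)}
best = max(candidates, key=candidates.get)   # T(S) = argmax over candidate trees of P(T)
print(candidates, "best:", best)             # roughly 1.5e-07 vs 1.7e-07, best: right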

5.1.2 Probabilistic CYK algorithm


How can we use a PCFG in a parsing algorithm? Luckily, the algorithms for computing the most-likely parse are simple extensions of the standard parsing algorithms.
Now we shall present the probabilistic CYK (Cocke-Younger-Kasami) algorithm. The probabilistic CYK parsing algorithm was first described by Ney (1991).
Assume that the PCFG is in Chomsky Normal Form (CNF). The grammar is in CNF if it is ε-free and if, in addition, each production is either of the form A → B C or A → a. In Chapter 3 we introduced the CYK algorithm; now we shall give a formal description of the probabilistic CYK algorithm.
Formally, the probabilistic CYK algorithm assumes the following input, output, and data structure:
Input.
1. A Chomsky-normal-form PCFG G = {N, Σ, P, S, D}. Assume that the |N| non-terminals have indices 1, 2, …, |N|, and that the start symbol S has index 1.
2. n words w1 … wn.
Data structure.
A dynamic programming array π[i, j, a] holds the maximum probability for a constituent with non-terminal index a spanning words i … j.
Output.
The maximum-probability parse will be π[1, n, S]: the parse tree whose root is S and which spans the entire string of words w1 … wn.
The CYK algorithm fills out the probability array by induction. Here we use wij to mean the string of words from word i to word j. The induction is as follows:
Base case: consider input strings of length one (individual words wi). In CNF, the probability of a given non-terminal A expanding to a single word wi can come only from the rule A → wi.
Recursive case: for strings of words of length > 1, A ⇒* wij if and only if there is at least one rule A → B C and some k, 1 ≤ k < j, such that B derives the first k symbols of wij and C derives the last j − k symbols of wij; their probabilities will already be stored in the matrix π. We compute the probability of wij by multiplying together the probabilities of these two pieces. But there may be multiple parses of wij, so we need to take the max over all possible divisions of wij (over all values of k and over all possible rules).
(Diagram: for a rule A → B C with A spanning wij, A corresponds to π[i, j, A]; B covers the first k words (π[i, k, B]) and C covers the remaining j − k words (π[i+k, j−k, C]), so the length of A is j = k + (j − k).)

For the rule A → B C, the probability of A spanning wij is the product of the rule probability and the probabilities of B and C, so we can compute the probability P(T) of each sub-tree. Then we can compute the highest-probability parse T(S) of the sentence S; it is equal to argmax P(T):

T(S) = argmax P(T)
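The following Python sketch (an illustration added here, not code from the text) implements the probabilistic CYK recurrence just described. It uses 0-based (start, end) spans rather than the (start, length) indexing used above, and it assumes the CNF grammar is supplied as plain Python containers.

from collections import defaultdict

def probabilistic_cyk(words, lexical_rules, binary_rules, start="S"):
    """Probabilistic CYK for a PCFG in Chomsky Normal Form.
    lexical_rules: dict word -> list of (A, prob) for rules A -> word
    binary_rules:  list of (A, B, C, prob) for rules A -> B C
    Returns the best probability of a start-symbol parse and a back-pointer table."""
    n = len(words)
    pi = defaultdict(float)      # pi[(i, j, A)] = max probability of A spanning words[i:j]
    back = {}

    # Base case: A -> w_i
    for i, w in enumerate(words):
        for A, p in lexical_rules.get(w, []):
            if p > pi[(i, i + 1, A)]:
                pi[(i, i + 1, A)] = p
                back[(i, i + 1, A)] = w

    # Recursive case: A -> B C, maximizing over split points k and over rules
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for A, B, C, p in binary_rules:
                    cand = p * pi[(i, k, B)] * pi[(k, j, C)]
                    if cand > pi[(i, j, A)]:
                        pi[(i, j, A)] = cand
                        back[(i, j, A)] = (k, B, C)

    return pi[(0, n, start)], back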

5.1.3 Learning PCFG Probabilities


The PCFG probabilities come from a corpus of already-parsed sentences: a tree-bank. The Penn Tree-bank contains parse trees for the Brown Corpus, for one million words of the Wall Street Journal, and for parts of the Switchboard corpus.
Given a tree-bank, the probability of each expansion of a non-terminal can be computed by counting the number of times that expansion occurs and then normalizing:
P(α → β | α) = Count(α → β) / Σγ Count(α → γ) = Count(α → β) / Count(α)

Here Σγ Count(α → γ) is the total count of all rules with α as their left-hand side, which is simply Count(α).

When a tree-bank is unavailable, the counts needed for computing PCFG probabilities can be generated by first parsing a corpus.
If the sentences were unambiguous, it would be very simple: parse the corpus, increment a counter for every rule used in each parse, and then normalize to get probabilities.
If the sentences are ambiguous, we need to keep a separate count for each parse of a sentence and weight each partial count by the probability of the parse it appears in.
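A minimal sketch of the relative-frequency estimate above, assuming each tree-bank parse is given as a nested (label, child, ...) tuple with words as string leaves; the tree encoding and function name are illustrative, not from the text.

from collections import Counter

def pcfg_probabilities(treebank):
    """P(A -> beta | A) = Count(A -> beta) / Count(A), estimated by counting and normalizing."""
    rule_counts = Counter()
    lhs_counts = Counter()

    def collect(node):
        label, *children = node
        # The right-hand side is the sequence of child labels (or the word itself for a leaf).
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1
        for c in children:
            if not isinstance(c, str):
                collect(c)

    for tree in treebank:
        collect(tree)
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}

# Tiny hand-written example tree (illustrative only, not from a real corpus):
tree = ("S", ("NP", ("Pro", "I")), ("VP", ("Verb", "book"), ("NP", ("Nom", ("Noun", "flights")))))
print(pcfg_probabilities([tree]))   # e.g. P(NP -> Pro | NP) = 0.5 in this one-tree corpus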

5.2 Lexicalized PCFG


5.2.1 Problems with PCFG
Problems with structural dependency
A CFG assumes that the expansion of any one non-terminal is independent of the expansion of any other non-terminal. This independence assumption is carried over into the PCFG: each PCFG rule is assumed to be independent of each other rule, and thus the rule probabilities are multiplied together. But in English, the choice of how a node expands depends on the location of the node in the parse tree. For example, there is a strong tendency for the syntactic subject of a sentence to be a pronoun. This tendency is caused by the use of subject position to realize the topic, or old information; pronouns are a way to talk about old information, while non-pronominal lexical noun phrases are often used to introduce new referents. According to the investigation of Francis (1999), of the 31,021 subjects of declarative sentences in the Switchboard corpus, 91% are pronouns and only 9% are lexical. By contrast, of 7,498 direct objects, only 34% are pronouns and 66% are lexical.
Subject: She is able to take her baby to work with her.
         My wife worked until we had a family.
Object: Some laws absolutely prohibit it.
        All the people signed applications.
These dependencies could be captured if the probability of expanding an NP as a pronoun (via the rule NP → Pronoun) versus a lexical NP (via the rule NP → Det Noun) were dependent on whether the NP was a subject or an object. However, this is just the kind of probabilistic dependency that a PCFG does not allow.

Problems with lexical dependency
In a PCFG, lexical information can enter only through the probabilities with which pre-terminal nodes are expanded lexically. But there are a number of other kinds of lexical and other dependencies that are important in modeling syntactic probabilities.
-- PP-attachment: Lexical information plays an important role in selecting the correct parse of an ambiguous prepositional-phrase attachment.
For example, in the sentence "Washington sent more than 10,000 soldiers into Afghanistan", the PP "into Afghanistan" can be attached either to the NP (more than 10,000 soldiers) or to the verb (sent).
In a PCFG, the attachment choice comes down to the choice between two rules:
NP → NP PP (NP-attachment)
and VP → VP PP (VP-attachment)

The probability of these two rules depends on the training corpus:

Corpus                                    NP-attachment    VP-attachment
AP Newswire (13 million words)            67%              33%
Wall Street Journal & IBM manuals         52%              48%

Whether the preference is 67% or 52%, in a PCFG this preference is purely structural and must be the same for all verbs.

However, the correct attachment here is to the verb: the verb "send" subcategorizes for a destination, which can be expressed with the preposition "into". This is a lexical dependency, and a PCFG cannot capture it.

-- Coordination ambiguities:
Coordination ambiguities are another case where lexical information is the key to choosing the proper parse. The phrase "dogs in houses and cats" is ambiguous:
Left tree:  [NP [NP [NP [Noun dogs]] [PP [Prep in] [NP [Noun houses]]]] [Conj and] [NP [Noun cats]]]
Right tree: [NP [NP [Noun dogs]] [PP [Prep in] [NP [NP [Noun houses]] [Conj and] [NP [Noun cats]]]]]
Although the left tree is intuitively the correct one, the PCFG will assign the two trees identical probabilities, because both structures use exactly the same rules:
NP → NP Conj NP
NP → NP PP
NP → Noun
PP → Prep NP
Noun → dogs | houses | cats
Prep → in
Conj → and
In this case, the PCFG assigns the two trees the same probability.
The PCFG thus has a number of inadequacies as a probabilistic model of syntax; we shall augment it to deal with these problems.
5.2.2 Probabilistic Lexicalized CFG
Charniak (1997) proposed the approach of probabilistic representation of lexical heads. It is a kind of lexicalized grammar: in this probabilistic representation, each non-terminal in a parse tree is annotated with a single word, its lexical head.
E.g. "Workers dumped sacks into a bin" can be represented as follows:
[S(dumped) [NP(workers) [NNS workers]] [VP(dumped) [VBD dumped] [NP(sacks) [NNS sacks]] [PP(into) [P into] [NP(bin) [DT a] [NN bin]]]]]
Fig. Lexicalized tree

If we treat a probabilistic lexicalized CFG like a normal but huge PCFG, then we store a probability for each rule/head combination, e.g.:
VP(dumped) → VBD(dumped) NP(sacks) PP(into) [3 × 10⁻¹⁰]
VP(dumped) → VBD(dumped) NP(cats) PP(into) [8 × 10⁻¹¹]
VP(dumped) → VBD(dumped) NP(hats) PP(into) [4 × 10⁻¹⁰]
VP(dumped) → VBD(dumped) NP(sacks) PP(above) [1 × 10⁻¹²]

This sentence can also be parsed into another tree:
[S(dumped) [NP(workers) [NNS workers]] [VP(dumped) [VBD dumped] [NP(sacks) [NP(sacks) [NNS sacks]] [PP(into) [P into] [NP(bin) [DT a] [NN bin]]]]]]
Fig. Incorrect parse tree
If VP(dumped) expands to VBD NP PP, then the tree is the correct one; if VP(dumped) expands to VBD NP, then the tree is the incorrect one.
Let us compute both of these probabilities by counting in the Brown Corpus portion of the Penn Tree-bank. The first rule is quite likely:
P(VP → VBD NP PP | VP, dumped) = C(VP(dumped) → VBD NP PP) / Σβ C(VP(dumped) → β) = 6/9 = .67
The second rule never occurs in the Brown Corpus. This is not surprising, since "dump" is a verb of caused motion into a new location:
P(VP → VBD NP | VP, dumped) = C(VP(dumped) → VBD NP) / Σβ C(VP(dumped) → β) = 0/9 = 0
In practice this zero value would be smoothed somehow, but for now we simply note that the first rule is preferred.
The head probabilities can be counted in the same way.
In the correct parse, the PP node whose mother's head (X) is "dumped" has the head "into"; in the incorrect parse, the PP node whose mother's head (X) is "sacks" has the head "into". We can use counts from the Brown portion of the Penn Tree-bank, where X stands for the mother's head:
P(into | PP, dumped) = C(X(dumped) → … PP(into) …) / Σβ C(X(dumped) → … PP …) = 2/9 = .22
P(into | PP, sacks) = C(X(sacks) → … PP(into) …) / Σβ C(X(sacks) → … PP …) = 0/0 = undefined
Once again, the head probabilities correctly predict that “dumped” is more likely to be modified
by “into” than is “sacks”.
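The lexicalized probability P(VP → VBD NP PP | VP, dumped) above is just a relative frequency over the expansions of a particular (category, head) pair. Here is a small Python sketch of that computation; the counts are made up for illustration and are not real Penn Tree-bank data.

from collections import Counter

# A lexicalized rule occurrence is (parent_cat, parent_head, rhs_cats), e.g.
# ("VP", "dumped", ("VBD", "NP", "PP")).  The occurrences below are invented.
observed_rules = [
    ("VP", "dumped", ("VBD", "NP", "PP")),
    ("VP", "dumped", ("VBD", "NP", "PP")),
    ("VP", "dumped", ("VBD", "PP")),
]

rule_counts = Counter(observed_rules)
lhs_counts = Counter((cat, head) for cat, head, _ in observed_rules)

def p_rule_given_head(cat, head, rhs):
    """P(cat -> rhs | cat, head) = C(cat(head) -> rhs) / sum over beta of C(cat(head) -> beta)."""
    denom = lhs_counts[(cat, head)]
    return rule_counts[(cat, head, rhs)] / denom if denom else 0.0

print(p_rule_given_head("VP", "dumped", ("VBD", "NP", "PP")))   # 2/3 with these invented counts
print(p_rule_given_head("VP", "dumped", ("VBD", "NP")))         # 0.0 (unseen; would be smoothed)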
5.3 Human Parsing
In the last 20 years we have learned a lot about human parsing. Here we shall give a brief
overview of some recent results.
5.3.1 Ambiguity resolution in human parsing:
The human sentence processor is sensitive to lexical sub-categorization preferences. For example, scientists can ask people to read an ambiguous sentence and check off a box indicating which of the two interpretations they got first. The results are shown after each sentence.
“The women kept the dogs on the beach”
The women kept the dogs which were on the beach. 5%
The women kept them (the dogs) on the beach. 95%
“The women discussed the dogs on the beach”
The women discussed the dogs which were on the beach. 90%
The women discussed them (the dogs) while on the beach. 10%
The results were that people preferred VP-attachment with "keep" and NP-attachment with "discuss".
This suggests that "keep" has a sub-categorization preference for a VP with three constituents (VP → V NP PP), while "discuss" has a sub-categorization preference for a VP with two constituents (VP → V NP), although both verbs still allow both sub-categorizations.
5.3.2 Garden-path sentences:
A garden-path sentence is a member of a specific class of temporarily ambiguous sentences.
For example, "The horse raced past the barn fell" ("barn" is a farm building for storing crops and food for animals).
Fig. Garden-path sentence 1: (a) the initially preferred but incorrect parse, [S [NP The horse] [VP [V raced] [PP past [NP the barn]]]], which leaves "fell" unattached; (b) the correct parse, in which "raced past the barn" is a reduced relative clause modifying "the horse" and "fell" is the main verb.
Garden-path sentences are sentences that are cleverly constructed to have three properties which combine to make them very difficult for people to parse:
they are temporarily ambiguous: the sentence as a whole is not ambiguous, but its initial portion is;
one of the two or three parses of the initial portion is preferable to the human parsing mechanism;
but the dispreferred parse is the correct one for the sentence.
The result of these three properties is that people are "led down the garden path" toward the incorrect parse, and are then confused when they realize it is the wrong one.
More examples
“The complex houses married and single students and their families.”
Fig. Garden-path sentence 2: (a) the initially preferred analysis, with [NP [Det The] [Adj complex] [N houses]]; (b) the correct analysis, with [NP [Det the] [N complex]] as the subject and "houses" as the verb.
In this sentence, readers often mis-parse "complex" as an adjective and "houses" as a noun, but the correct parse treats "complex" as a noun and "houses" as a verb, even though that parse is dispreferred.

“The student forgot the solution was in the back of the book.”
Fig. Garden-path sentence 3: (a) the initially preferred parse, with "the solution" as the direct object of "forgot"; (b) the correct parse, with "the solution was …" as an embedded sentence (sentential complement) of "forgot".


In this sentence, readers often mis-parse "the solution" as the direct object of "forgot" rather than as the subject of an embedded sentence. This is another sub-categorization preference difference: "forgot" prefers a direct object (VP → V NP) to a sentential complement (VP → V S).
The garden-path effect is caused by the sub-categorization preferences of the verb.
In the sentence "The horse raced past the barn fell", the verb "raced" is strongly preferred as a Main Verb (MV) and dispreferred as a Reduced Relative (RR), but the correct reading is exactly the dispreferred one (RR); its MV/RR ratio is 387.
In the sentence "The horse found in the barn died", since the verb "found" is transitive, the reduced-relative (RR) interpretation is much more likely than it was for "raced". Its MV/RR probabilistic ratio (lower than 5) is much smaller than the MV/RR ratio of "raced" (387), so this sentence does not become a garden-path sentence.
The MV/RR probability ratio for "raced" is much greater than the MV/RR ratio for "found". Perhaps this is the explanation for garden-path sentences.

Fig. MV/RR probabilistic ratio: on a log(MV/RR) scale, "raced" (MV/RR = 387) lies far above the pruning threshold (MV/RR = 5), while "found" lies below it.
The model assumes that people are unable to maintain very many interpretations at one time. Whether because of memory limitations, or simply because of a strong desire to come up with a single interpretation, they prune away low-ranking interpretations. An interpretation is pruned if its probability is 5 times lower than that of the most probable interpretation. The result is that people sometimes prune away the correct interpretation, leaving a higher-ranked but incorrect one. This is what happens with the low-probability (but correct) reduced-relative (RR) interpretation of the sentence "The horse raced past the barn fell".
In the figure above, the MV/RR ratio for "raced" falls above the threshold, so the RR interpretation is pruned; for "found", the RR interpretation remains active in the disambiguating region.
Feature and Unification
4.1 Feature Structures in grammar
4.1.1 Attribute-value matrix
From a reductionist perspective, the history of the natural sciences over the last few hundred
years can be seen as an attempt to explain the behavior of larger structures by the combined action
of smaller primitives.
Biology: cell action → gene action → DNA action
Physics: molecule → atom → subatomic particles
This is called "reductionism".
In NLP, we are also influenced by this reductionism.
E.g., in Chapter 2 we proposed the following rule:
S → Aux NP VP
It can be replaced by two rules of the following form:
S → 3sgAux 3sgNP VP
S → Non3sgAux Non3sgNP VP
Lexicon rules:
3sgAux → does | has | can | …
Non3sgAux → do | have | can | …
We attempt to combine the actions of the smaller structures to explain the action of the larger structures.
We shall use feature structures to describe this reductionism in NLP.
Feature structures are simply sets of feature-value pairs, where features are unanalyzable atomic symbols drawn from some finite set, and values are either atomic symbols or feature structures.
A feature structure is illustrated with an attribute-value matrix (AVM) as follows:
[FEATURE1 VALUE1
 FEATURE2 VALUE2
 …
 FEATUREn VALUEn]
E.g. 3sgNP can be illustrated by the following AVM:
[cat    NP
 num    sing
 person 3]
3sgAux can be illustrated by the following AVM:
[cat Aux
 num sing
 per 3]

In feature structures, features are not limited to atomic symbols as their values; they can also have other feature structures as their values.
This is very useful when we wish to bundle a set of feature-value pairs together for similar treatment. E.g., the features "num" and "per" are often lumped together, since grammatical subjects must agree with their predicates in both number and person. This lumping together introduces a feature "agreement" that takes as its value a feature structure consisting of the number and person feature-value pairs.
The feature structure of 3sgNP with the feature "agreement" can be illustrated by the following AVM:
[cat       NP
 agreement [num sing
            per 3]]

4.1.2 Feature path and reentrant structure


We can also use a DAG (directed acyclic graph) to represent the attribute-value pairs.
E.g., the AVM above can be represented by a DAG in which a "cat" arc leads to the value NP and an "agreement" arc leads to a node whose "per" arc leads to 3 and whose "num" arc leads to sing.
Fig. 1 DAG for a feature structure

In a DAG, a feature path is a list of features through a feature structure leading to a particular value. For example, in Fig. 1 the <agreement num> path leads to the value sing, and the <agreement per> path leads to the value 3.
If a feature structure is shared, it is referred to as a reentrant structure. In a reentrant structure, two feature paths actually lead to the same node in the structure.

Fig. 2 (a feature structure with shared values) shows a DAG in which, under the feature "head", the <head subject agreement> path and the <head agreement> path lead to the same location: they share the feature structure
[per 3
 num sing]
The shared structure is denoted in the AVM by adding numerical indexes that signal the values to be shared:
[cat  s
 head [agreement ① [num sing
                    per 3]
       subject   [agreement ①]]]
Reentrant structures give us the ability to express linguistic knowledge in elegant ways.
4.1.3 Unification of feature structures
To compute with feature structures we use unification. There are two principal operations in unification:
merging the information content of two structures that are compatible;
rejecting the merger of structures that are incompatible.
The following are examples (the symbol ∪ denotes unification):
(1) Compatible:
[num sing] ∪ [num sing] = [num sing]
(The same unifications can also be drawn as DAGs; only the AVM forms are shown here.)
(2) Incompatible:
[num sing] ∪ [num plur] = fails!
(3) The symbol []:
[num sing] ∪ [num []] = [num sing]
A feature with the value [] can be successfully matched to any value.
(4) Merger:
[num sing] ∪ [per 3] = [num sing
                        per 3]
(5) A reentrant structure:
[agreement ① [num sing
              per 3]
 subject   [agreement ①]]
∪ [subject [agreement [per 3
                       num sing]]]
= [agreement ① [num sing
                per 3]
   subject   [agreement ①]]
(6) The copying capability of unification:
[agreement ①
 subject   [agreement ①]]
∪ [subject [agreement [per 3
                       num sing]]]
= [agreement ①
   subject   [agreement ① [per 3
                           num sing]]]
(7) Features that merely have similar values:
In the following example there is no sharing index linking the "agreement" feature and the <subject agreement> path, so the information [per 3] is not added to the value of the "agreement" feature.
[agreement [num sing]
 subject   [agreement [num sing]]]
∪ [subject [agreement [per 3
                       num sing]]]
= [agreement [num sing]
   subject   [agreement [num sing
                         per 3]]]
In the result, [per 3] is added only at the end of the <subject agreement> path; it is not added to the value of the top-level "agreement" feature (the first line in the result AVM). Therefore the value of "agreement" is only [num sing], without [per 3].
(8) The failure of unification:
[agreement ① [num sing
              per 3]
 subject   [agreement ①]]
∪ [agreement [num sing
              per 3]
   subject   [agreement [num plur
                         per 3]]]
= fails!

Feature structures are a way of representing partial information about some linguistic object, or of placing informational constraints on what the object can be. Unification can be seen as a way of merging the information in two feature structures, or of describing objects that satisfy both sets of constraints.
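A minimal Python sketch of these unification cases, assuming feature structures are encoded as nested dicts with string atomic values and the empty structure [] encoded as {}. It reproduces cases (1)-(4) above; it copies structures and does not model reentrancy (the full pointer-based algorithm is given in Section 4.3).

FAIL = object()          # sentinel signalling failure of unification

def unify(f, g):
    """Naive unification of feature structures encoded as nested dicts (no reentrancy)."""
    if f == {}:                                   # [] unifies with anything
        return g
    if g == {}:
        return f
    if isinstance(f, str) or isinstance(g, str):  # atomic values must be identical
        return f if f == g else FAIL
    result = dict(f)
    for feature, g_value in g.items():            # merge g's features into a copy of f
        merged = unify(result.get(feature, {}), g_value)
        if merged is FAIL:
            return FAIL
        result[feature] = merged
    return result

print(unify({"num": "sing"}, {"num": "sing"}))           # (1) {'num': 'sing'}
print(unify({"num": "sing"}, {"num": "plur"}) is FAIL)   # (2) True: incompatible
print(unify({"num": "sing"}, {"num": {}}))               # (3) {'num': 'sing'}: [] matches anything
print(unify({"num": "sing"}, {"per": "3"}))              # (4) {'num': 'sing', 'per': '3'}: merger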

4.1.4 Subsumption
Intuitively, unifying two feature structures produces a new feature structure that is more specific (has more information) than, or is identical to, either of the input feature structures. We say that a less specific (more abstract) feature structure subsumes an equally or more specific one.
Formally, a feature structure F subsumes a feature structure G if and only if:
for every feature x in F, F(x) subsumes G(x) (where F(x) means "the value of the feature x of feature structure F"); and
for all paths p and q in F such that F(p) = F(q), it is also the case that G(p) = G(q).

E.g.
(1) [num sing]
(2) [per 3]
(3) [num sing
     per 3]
We have: (1) subsumes (3), and (2) subsumes (3).
(4) [cat VP
     agreement ①
     subject   [agreement ①]]
(5) [cat VP
     agreement ①
     subject   [agreement [per 3
                           num sing]]]
(6) [cat VP
     agreement ①
     subject   [agreement ① [per 3
                             num sing]]]
We have: (3) subsumes (5), (4) subsumes (5), (5) subsumes (6), and (4) and (5) subsume (6).

Subsumption is a partial ordering: there are pairs of feature structures that neither subsume nor are
subsumed by each other:
(1) does not subsume (2),
(2) does not subsume (1),
(3) does not subsume (4),
(4) does not subsume (3).
Since every feature structure is subsumed by the empty structure [], the relation among feature
structures can be defined as a semi-lattice. The semi-lattice is often represented pictorially with the
most general feature [ ] at the top and the subsumption relation represented by lines between
feature structures.
Fig. 4 Subsumption represented as a semi-lattice: the empty structure [] (least information) is at the top; (1) and (2) lie below it, (3) and (4) below them, then (5), and finally (6) (most information) at the bottom.
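A small Python sketch of the subsumption test on the same nested-dict encoding used in the unification sketch above; the reentrancy condition of the formal definition (shared paths in F must also be shared in G) is not modelled here.

def subsumes(f, g):
    """True if feature structure f subsumes (is at least as general as) g.
    Structures are nested dicts with string atomic values; {} encodes []."""
    if f == {}:                                   # [] subsumes every structure
        return True
    if isinstance(f, str) or isinstance(g, str):
        return f == g                             # an atomic value only subsumes itself
    return all(feature in g and subsumes(value, g[feature])
               for feature, value in f.items())

fs1 = {"num": "sing"}
fs2 = {"per": "3"}
fs3 = {"num": "sing", "per": "3"}
print(subsumes(fs1, fs3), subsumes(fs2, fs3))     # True True:   (1) and (2) subsume (3)
print(subsumes(fs1, fs2), subsumes(fs2, fs1))     # False False: neither subsumes the other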


4.1.5 Formal definition of Unification
Unification can be formally defined in terms of the subsumption semi-lattice as follows:
Given two feature structures F and G, the unification F ∪ G is defined as the most general feature structure H such that F subsumes H and G subsumes H.
Since the information ordering defined by subsumption is a semi-lattice, the unification operation is monotonic. This means:
if some description is true of a feature structure, unifying it with another feature structure results in a feature structure that still satisfies the original description;
the unification operation is order-independent: given a set of feature structures to unify, we can unify them in any order and get the same result.
Unification is a way of implementing the integration of knowledge from different constraints: given two compatible feature structures as input, it produces a new feature structure that contains all the information in the inputs; given two incompatible feature structures, it fails.

4.2 Feature structures in the Grammar


4.2.1 Augmentation of CFG rules with feature structures
The goals of the augmentation are:
to associate complex feature structures with both lexical items and instances of grammatical categories;
to guide the composition of feature structures for larger grammatical constituents based on the feature structures of their component parts;
to enforce compatibility constraints between specified parts of grammatical constructions.
Formally, we use the following notation to denote the grammar augmentation:
β0 → β1 … βn
(set of constraints)
The specified constraints have one of the following forms:
(βi feature path) = Atomic value
(βi feature path) = (βj feature path)
The notation (βi feature path) denotes a feature path through the feature structure associated with the βi component of the CFG rule.
For example, the rule
S → NP VP
can be augmented with a feature-structure constraint for number agreement as follows:
S → NP VP
(NP num) = (VP num)
With this augmentation, the simple generative nature of the CFG rule has been fundamentally changed in two ways:
the elements of CFG rules have feature-based constraints associated with them, reflecting a shift from atomic grammatical categories to more complex categories with properties;
the constraints associated with individual rules can refer to the feature structures associated with the parts of the rule to which they are attached.
4.2.2 Agreement
There are two kinds of agreement in English.
Subject-verb agreement:
S → NP VP
(NP agreement) = (VP agreement)
E.g. This flight serves breakfast. / These flights serve breakfast.
S → Aux NP VP
(Aux agreement) = (NP agreement)
E.g. Does this flight serve breakfast? / Do these flights serve breakfast?
Determiner-nominal agreement:
NP → Det Nominal
(Det agreement) = (Nominal agreement)
(NP agreement) = (Nominal agreement)
E.g. This flight. / These flights.

The constraints involve both lexical and non-lexical constituents.
The constraints on lexical constituents can be written directly in the lexicon:
Aux → do
(Aux agreement num) = plur
(Aux agreement per) = 3
Aux → does
(Aux agreement num) = sing
(Aux agreement per) = 3
Determiner → this
(Det agreement num) = sing
Determiner → these
(Det agreement num) = plur
Verb → serves
(Verb agreement num) = sing
Verb → serve
(Verb agreement num) = plur
Noun → flight
(Noun agreement num) = sing
Noun → flights
(Noun agreement num) = plur
Non-lexical constituents can acquire values for at least some of their features from their component constituents:
VP → Verb NP
(VP agreement) = (Verb agreement)
The constraints on the VP come from the constraints on the Verb.
Nominal → Noun
(Nominal agreement) = (Noun agreement)
The constraints on the Nominal come from the Noun.
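One possible encoding (illustrative only, not from the text) of an augmented rule and its path equations. For brevity the constraint is checked by simple equality of the two path values; a real unification parser would unify the values instead.

rule = {
    "lhs": "S",
    "rhs": ["NP", "VP"],
    # (NP agreement) = (VP agreement); each path starts with the index of the constituent
    # it refers to (0 = left-hand side, 1..n = right-hand-side constituents).
    "constraints": [((1, "agreement"), (2, "agreement"))],
}

def get_path(fs, path):
    """Follow a feature path (a tuple of feature names) through nested dicts."""
    for feature in path:
        fs = fs.get(feature, {})
    return fs

def constraints_hold(rule, feature_structures):
    """feature_structures[i] is the feature structure of constituent i (0 = LHS)."""
    for (i, *p), (j, *q) in rule["constraints"]:
        if get_path(feature_structures[i], tuple(p)) != get_path(feature_structures[j], tuple(q)):
            return False
    return True

np_fs = {"agreement": {"num": "sing", "per": "3"}}   # "this flight"
vp_fs = {"agreement": {"num": "sing", "per": "3"}}   # "serves breakfast"
print(constraints_hold(rule, [{}, np_fs, vp_fs]))                              # True
print(constraints_hold(rule, [{}, np_fs, {"agreement": {"num": "plur"}}]))     # False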

4.2.3 Head features


The features of most grammatical categories are copied from one of the children to the parent. The child that provides the features is called the head of the phrase, and the features copied are called head features.
In the following rules,
VP → Verb NP
(VP agreement) = (Verb agreement)
NP → Det Nominal
(Det agreement) = (Nominal agreement)
(NP agreement) = (Nominal agreement)
Nominal → Noun
(Nominal agreement) = (Noun agreement)
the verb is the head of the VP, the nominal is the head of the NP, and the noun is the head of the nominal. In these rules, the constituent providing the agreement feature structure to its parent is the head of the phrase, so we can say that the agreement feature structure is a head feature.
We can rewrite our rules by placing the agreement feature structure under a HEAD feature and then copying that feature upward:

VP → Verb NP
(VP head) = (Verb head)
NP → Det Nominal
(Det head agreement) = (Nominal head agreement)
(Det and Nominal are at the same level, so their <head agreement> values are equated.)
(NP head) = (Nominal head)
Nominal → Noun
(Nominal head) = (Noun head)
The lexical rules can be rewritten as follows:
Verb → serves
(Verb head agreement num) = sing
Verb → serve
(Verb head agreement num) = plur
Noun → flight
(Noun head agreement num) = sing
Noun → flights
(Noun head agreement num) = plur
The concept of a head is very significant in grammar, because it provides a way for a syntactic rule to be linked to a particular word.

4.2.4 Sub-categorization
4.2.4.1 An atomic feature SUBCAT:
The following is a rule with complex features:
Verb-with-S-comp → think
VP → Verb-with-S-comp S
We have to subcategorize verbs into subcategories, so we need an atomic feature called SUBCAT.
Opaque approach:
Lexicon:
Verb → serves
<Verb head agreement num> = sing
<Verb head subcat> = trans
Rules:
VP → Verb
<VP head> = <Verb head>
<VP head subcat> = intrans
VP → Verb NP
<VP head> = <Verb head>
<VP head subcat> = trans
VP → Verb NP NP
<VP head> = <Verb head>
<VP head subcat> = ditrans
In these rules, the value of SUBCAT is unanalyzable: it does not directly encode either the number or the type of the arguments that the verb expects to take. This approach is somewhat opaque and not very clear.

Elegant approach:
A more elegant approach makes better use of the expressive power of feature structures and allows verb entries to directly specify the order and category type of the arguments they require. The verb's subcat feature expresses a list of its objects and complements.
Lexicon:
Verb → serves
<Verb head agreement num> = sing
<Verb head subcat first cat> = NP
<Verb head subcat second> = end
Verb → leaves
<Verb head agreement num> = sing
<Verb head subcat first cat> = NP
<Verb head subcat second cat> = PP
<Verb head subcat third> = end
E.g. "We leave Seoul in the morning."
Rules:
VP → Verb NP
<VP head> = <Verb head>
<VP head subcat first cat> = <NP cat>
<VP head subcat second> = end
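A hypothetical encoding of the "elegant" subcat feature as an ordered Python list standing in for the first/second/.../end chain above, with a trivial check that the complements found after a verb match its subcat list; the lexicon entries follow the examples above, everything else is illustrative.

lexicon = {
    "serves": {"head": {"agreement": {"num": "sing"}, "subcat": ["NP"]}},
    "leaves": {"head": {"agreement": {"num": "sing"}, "subcat": ["NP", "PP"]}},
}

def complements_satisfied(verb, complement_cats):
    """Check that the complement categories found after the verb match its subcat list in order."""
    return lexicon[verb]["head"]["subcat"] == list(complement_cats)

print(complements_satisfied("serves", ["NP"]))         # True:  "serves breakfast"
print(complements_satisfied("leaves", ["NP", "PP"]))   # True:  "we leave Seoul in the morning"
print(complements_satisfied("serves", ["NP", "NP"]))   # False: wrong sub-categorization frame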
4.2.4.2 Sub-categorization frame
A sub-categorization frame can be composed of many different phrase types.
Sub-categorization of verbs:
Each verb allows many different sub-categorization frames. For example, the verb "ask" allows the following sub-categorization frames:
Subcat:   Example
Quo       asked [Quo "What was it like?"]
NP        asking [NP a question]
Swh       asked [Swh what trades you're interested in]
Sto       asked [Sto him to tell you]
PP        that means asking [PP at home]
Vto       asked [Vto to see a girl called Sabina]
NP Sif    asked [NP him] [Sif whether he could make]
NP NP     asked [NP myself] [NP a question]
NP Swh    asked [NP him] [Swh why he took time off]
A number of comprehensive sub-categorization-frame tagsets exist, for example COMLEX (Macleod, 1998) and ACQUILEX (Sanfilippo, 1993).
Sub-categorization of Adjective
Subcat: Example

Sfin It was apparent [Sfin that the kitchen was the only room…]
PP It was apparent [PP from the way she rested her hand over his]
Swheth It is unimportant [Swheth whether only a little bit is accepted]

Sub-categorization of noun
Subcat: Example

Sfin the assumption [Sfin that wasteful methods have been employed]
Swheth the question [Swheth whether the authorities might have decided]
4.2.5 Long-Distance Dependencies
Sometimes a constituent subcategorized for by the verb is not locally instantiated, but is in a long-distance relationship with its predicate. For example, in the sentence
Which flight do you want me to have the travel agent book?
"which flight" is the object of "book"; there is a long-distance dependency between them.
The representation of such long-distance dependencies is a very difficult problem, because the verb whose subcategorization requirement is being filled can be quite distant from the filler.
Many solutions to representing long-distance dependencies have been proposed in unification grammars. One solution is called the "gap list". The gap list implements a list as a feature GAP, which is passed up from phrase to phrase in the parse tree. The filler (e.g. "which flight") is put on the gap list and must eventually be unified with the subcategorization frame of some verb.

4.3 Implementing unification


4.3.1 Unification data structures
The unification operator takes two feature structures as input and returns a single merged feature structure if successful, or a failure signal if the two inputs are not compatible. The implementation of the operator is a relatively straightforward recursive graph-matching algorithm: the algorithm loops through the features in one input and attempts to find a corresponding feature in the other. If all of the features match, the unification is successful; if any single feature causes a mismatch, the unification fails.
The feature structures are represented as DAGs with additional fields. Each feature structure consists of two fields:
a content field;
a pointer field.
The content field may be null or may contain an ordinary feature structure. Similarly, the pointer field may be null or may contain a pointer to another feature structure.
The interpretation is as follows:
if the pointer field of the DAG is null, then the content field of the DAG contains the actual feature structure to be processed;
if the pointer field is non-null, then the destination of the pointer represents the actual feature structure to be processed.
The merger aspect of unification is achieved by altering the pointer fields of DAGs during processing.

For example, the feature structure
[num sing
 per 3]
has the following extended representation: the top-level node has a null POINTER field and a CONTENT field containing the features num and per; the num feature leads to a node with CONTENT sing and POINTER null, and the per feature leads to a node with CONTENT 3 and POINTER null.
Fig. 5 An extended DAG notation

An example of the unification of feature structures is:
[num sing] ∪ [per 3] = [num sing
                        per 3]
The original arguments are two extended DAGs, one with a num feature whose value is sing and one with a per feature whose value is 3 (Fig. 6, the original arguments).
The unification results in the creation of a new structure containing the union of the information from the two original arguments:
a "per" feature is added to the first argument;
it is assigned a value by filling its PTR field with a pointer to the appropriate location in the second argument (Fig. 7, adding a "per" feature);
finally, the pointer field of the second argument is set to point at the first one (Fig. 8, the final result of unification).

More complex examples (AVM forms; the corresponding extended-DAG drawings are omitted here):
(1) A reentrant structure:
[cat  s
 head [agreement ① [num sing
                    per 3]
       subject   [agreement ①]]]
In its DAG, both the <head agreement> arc and the <head subject agreement> arc lead to the same node, whose per value is 3 and whose num value is sing.
(2) Compatible feature structures:
[num sing] ∪ [num sing] = [num sing]
In the resulting DAGs, the pointer field of the first argument is set to point at the second, so both now share the same content.
(3) Incompatible:
[num sing] ∪ [num plur] = fails!
(4) The symbol []:
[num sing] ∪ [num []] = [num sing]
The [] value unifies with anything; in the result, the node holding [] is pointed at the sing node.
(5) Merger:
[num sing] ∪ [per 3] = [num sing
                        per 3]
A per feature is added to the first DAG and its value is pointed at the 3 node of the second.

(6) A reentrant structure:
[agreement ① [num sing
              per 3]
 subject   [agreement ①]]
∪ [subject [agreement [per 3
                       num sing]]]
= [agreement ① [num sing
                per 3]
   subject   [agreement ①]]
In the DAGs, the <subject agreement> node of the second argument ends up pointing at the shared agreement node ① of the first argument.
(7) The copying capability of unification:
[agreement ①
 subject   [agreement ①]]
∪ [subject [agreement [per 3
                       num sing]]]
= [agreement ①
   subject   [agreement ① [per 3
                           num sing]]]
Because the agreement value ① is shared, the information [per 3, num sing] unified in along the <subject agreement> path also becomes the value of the top-level agreement feature.
(8) Features that merely have similar values:
[agreement [num sing]
 subject   [agreement [num sing]]]
∪ [subject [agreement [per 3
                       num sing]]]
= [agreement [num sing]
   subject   [agreement [num sing
                         per 3]]]
Here there is no sharing index, so [per 3] is added only at the end of the <subject agreement> path and the top-level agreement value remains [num sing].
(9) The failure of unification:
[agreement ① [num sing
              per 3]
 subject   [agreement ①]]
∪ [agreement [num sing
              per 3]
   subject   [agreement [num plur
                         per 3]]]
= fails!
The shared node ① of the first argument forces the subject's agreement to have num sing, while the second argument requires num plur, so unification fails.

4.3.2 The unification Algorithm

The unification algorithm is as follows:

function UNIFY(f1, f2) returns fstructure or failure
  f1-real ← real contents of f1
  f2-real ← real contents of f2
  if f1-real is null then
    f1.pointer ← f2
    return f2
  else if f2-real is null then
    f2.pointer ← f1
    return f1
  else if f1-real and f2-real are identical then
    f1.pointer ← f2
    return f2
  else if both f1-real and f2-real are complex feature structures then
    f2.pointer ← f1
    for each feature in f2-real do
      other-feature ← find or create a feature corresponding to feature in f1-real
      if UNIFY(feature.value, other-feature.value) returns failure then
        return failure
    return f1
  else return failure

"←" means "be changed to point to" or "be set to".


First step: To acquire the true contents of both of the arguments.. The valuables f1-real and
f2-real are the result of this pointer following process.
Second step: To test for the various base cases of the recursion. There are three possible base
cases:
1. One or both of the arguments has a null value;
2. The arguments are identical;
3. The arguments are non-complex and non-identical.
In the case where either of the arguments is null, the pointer field for the null argument is
changed to point to the other argument, which is then returned. The result is the both
structures now point at the same value.
If the structure are identical, then the pointer of the first is set to the second and the second is
returned.
If neither of the preceding tests is true, then there are two possibilities: they are non-identical
atomic values, or they are non-identical complex structures. The former case signals an
incompatibility in the arguments that leads the algorithm to return a failure signal. In the latter
case, a recursive call is needed to ensure that the component parts of the complex structures
are compatible. In this implementation, the key to the recursion is a loop over all the features
of the second argument (f2). This loop attempts to unify the value of each feature in f2 with
the corresponding feature in f1. In this loop, if a feature is encountered in f2 that is missing
from f1, a feature is added to f1 and given the value NULL. Processing then continues as if
the feature had been there to begin with. If every one of these unifications succeeds, then the
pointer field of f2 is set to f1 completing the unification of the structures and f1 is returned as
the value of the unification.
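The algorithm can be sketched in Python. This is a minimal illustration under our own representation (a Node class with pointer and content fields, where the content is None for null, a string for an atomic value, or a dict for a complex feature structure), not a full implementation of the formalism:

# A minimal sketch of the destructive UNIFY function described above.
class Node:
    def __init__(self, content=None):
        self.pointer = None       # forwarding pointer, set during unification
        self.content = content    # None (null), an atomic string, or a dict {feature: Node}

def deref(f):
    # follow pointers to reach the real contents of a feature structure
    while f.pointer is not None:
        f = f.pointer
    return f

def unify(f1, f2):
    f1, f2 = deref(f1), deref(f2)
    if f1.content is None:                    # f1-real is null
        f1.pointer = f2
        return f2
    if f2.content is None:                    # f2-real is null
        f2.pointer = f1
        return f1
    if f1.content == f2.content:              # identical values
        f1.pointer = f2
        return f2
    if isinstance(f1.content, dict) and isinstance(f2.content, dict):
        f2.pointer = f1
        for feat, value in f2.content.items():
            other = f1.content.setdefault(feat, Node())   # add a null feature if missing
            if unify(value, other) is None:
                return None                   # failure propagates upward
        return f1
    return None                               # non-identical atomic values: failure

# Example: [agreement [num sing]] ∪ [agreement [per 3]]
f1 = Node({"agreement": Node({"num": Node("sing")})})
f2 = Node({"agreement": Node({"per": Node("3")})})
result = unify(f1, f2)
# dereferencing the agreement value of the result now shows both num sing and per 3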
An example: Unify following feature structure.
[ agreement ① [num sing]
  subject   [agreement ①] ]
∪ [ subject [agreement [per 3]] ]

The extended DAGs f1 and f2:
[Figure: f1 and f2 drawn as extended DAGs with PTR and CT fields. In f1 the agreement node (num sing) is shared with the subject's agreement value; in f2 the subject's agreement carries only per 3. All pointer fields are null.]
These original arguments are neither identical, nor atomic, nor null, so the main loop is entered.
Looping over the features of f2, the algorithm is led to a recursive attempt to unify the values of
the corresponding “subject” features of f1 and f2:
[agreement ①] ∪ [agreement [per 3]]
These arguments are also non-identical, non-atomic and non-null, so the loop is entered again,
leading to a recursive check of the values of the “agreement” features:
[num sing] ∪ [per 3]
In looping over the features of the second argument, the fact that the first argument
lacks a “per” feature is discovered. A “per” feature initialized with a “null”
value is added to the first argument. This changes the previous unification to the
following:
[num sing
 per null] ∪ [per 3]
After adding this new “per” feature, the next recursive call leads to the
unification of the “null” value of the new feature in the first argument with the
3 value of the second argument. This recursive call results in the pointer field of
that null value being set to point at the 3 value in f2.
[Figure: the DAGs after this step; the new per feature in f1 has a null content, and its pointer now points at the node carrying 3 in f2.]

Since there are no further features to check in the f2 argument at any level of the recursion, each
recursive call in turn sets the pointer of its f2 argument to point at its f1 argument and returns f1.
The result of the whole unification is as follows:
[Figure: the final DAGs; at every level the f2 nodes point at the corresponding f1 nodes, so both arguments now share a single structure in which the agreement value is [num sing, per 3] and is reentrant with the subject's agreement value.]

4.3.3 Parsing with unification constraints


The CFG rule with unification constraints is as follows:
S → NP VP
  <NP head agreement> = <VP head agreement>
  <S head> = <VP head>
Its AVM is:

[ S  [head ①]
  NP [head [agreement ②]]
  VP [head ① [agreement ②]] ]

This AVM can be represented by a DAG, so we can use the AVM notation to stand for the DAG.

In the Earley parser, we can attach the DAG to each state:

S → . NP VP, [0,0], [], Dag
[] means that the parsing has just started: no sub-constituents have been completed yet. The dot marks the position reached in the rule.
Dag =
[ S  [head ①]
  NP [head [agreement ②]]
  VP [head ① [agreement ②]] ]

In the chart, this is an active edge.

NP → Det . Nominal, [0,1], [Sdet], Dag1

Dag1 =
[ NP      [head ①]
  Det     [head [agreement ② [num sing]]]
  Nominal [head ① [agreement ②]] ]

It is also an active edge.

Nominal → Noun ., [1,2], [Snoun], Dag2

Dag2 =
[ Nominal [head ①]
  Noun    [head ① [agreement [num sing]]] ]

It is an inactive edge.

By this means, we can integrate unification into the Earley parser.
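As a toy illustration of how such a check might work (our own simplification, ignoring the reentrancy tags ① and ② and the pointer machinery of Section 4.3.2): feature structures are plain nested dicts, unification is non-destructive, and an edge built with S → NP VP is accepted only if the NP and VP agreement values unify.

def unify_dicts(a, b):
    # non-destructive unification of plain nested dicts / atoms
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for k, v in b.items():
            if k in out:
                u = unify_dicts(out[k], v)
                if u is None:
                    return None
                out[k] = u
            else:
                out[k] = v
        return out
    return None                       # incompatible atomic values

np_head = {"agreement": {"num": "sing", "per": "3"}}
vp_head = {"agreement": {"num": "sing"}}

# Constraint <NP head agreement> = <VP head agreement>:
check = unify_dicts(np_head["agreement"], vp_head["agreement"])
print("edge allowed" if check is not None else "edge blocked")   # edge allowed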

4.4 Types and Inheritance


The basic feature structures have two problems that have led to extensions of the formalism.
First problem: there is no way to place a constraint on what can be the value of a feature.
For example, in our current system, there is nothing to stop “num” from taking values such as
3rd or feminine:

[num feminine]

This problem has led many unification-based grammatical theories to add mechanisms that try to
constrain the possible values of a feature, e.g. FUG (Functional Unification Grammar, Kay, 1979),
LFG (Lexical Functional Grammar, Bresnan, 1982), GPSG (Generalized Phrase Structure Grammar,
Gazdar et al., 1985), and HPSG (Head-Driven Phrase Structure Grammar, Pollard et al., 1994).
Second problem: in the feature structure formalism, there is no way to capture generalizations
across structures. For example, the many types of English verb phrases share many features, as do
the many kinds of subcategorization frames for verbs.
A general solution to both of these problems is the use of types.
A type system for unification grammars has the following characteristics:
Each feature structure is labeled by a type.
Each type has appropriateness conditions expressing which features are appropriate for it.
The types are organized into a type hierarchy, in which more specific types inherit the properties
of more abstract ones.
The unification operation is modified to unify the types of feature structures in addition to
unifying the attributes and values.
In such typed feature structure systems, types are a new class of objects, just like attributes and
values for standard feature structures.
There are two kinds of types:
1. Simple types (atomic types): a simple type is an atomic symbol like sg or pl, and replaces the
simple atomic values used in standard feature structures.
All types are organized into a multiple-inheritance type hierarchy (a partial order or lattice).
Following is a type hierarchy for the new type agr, which will be the type of the kind of atomic
object that can be the value of an AGREEMENT feature.
[Type hierarchy for agr: agr at the top; below it 1st, 3rd, sg and pl; below those the cross-classified types 1-sg, 3-sg, 1-pl and 3-pl; and below 3-sg the subtypes 3sg-masc, 3sg-fem and 3sg-neut.]

In this hierarchy, 3rd is a subtype of agr, and 3-sg is a subtype of both 3rd and sg.

The unification of any two types is the most specific type that is compatible with (i.e. a subtype of)
both input types; a sketch of this computation is given after the typed unification example below. Thus
3rd ∪ sg = 3-sg
1st ∪ pl = 1-pl
1st ∪ agr = 1st
3rd ∪ 1st = ⊥ (undefined; the fail type)
2. Complex types: The complex types specify:
A set of features that are appropriate for that type.
Restrictions on the values of those features (expressed in terms of types).
Equality constraints between the values.
For example, the complex type verb represents agreement and verb morphological form
information.
A definition of verb would define two appropriate features:
AGREE: takes values of type agr, defined above.
VFORM: takes values of type vform, which subsumes the seven subtypes finite, infinitive,
gerund, base, present-participle, past-participle and passive-participle.
Thus verb would be defined as follows:
[ verb
  AGREE agr
  VFORM vform ]
The type noun might be defined with the AGREE feature, but without the VFORM feature:
[ noun
  AGREE agr ]
The unification of typed feature structures:
[ verb            [ verb            [ verb
  AGREE 1st    ∪    AGREE sg    =     AGREE 1-sg
  VFORM gerund ]    VFORM gerund ]    VFORM gerund ]
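A sketch of simple-type unification over the agr hierarchy above (our own encoding; unifying two types means finding their most general common subtype, the greatest lower bound, and failing when none exists):

# Each type lists its immediate parents in the agr hierarchy.
parents = {
    "agr": [],
    "1st": ["agr"], "3rd": ["agr"], "sg": ["agr"], "pl": ["agr"],
    "1-sg": ["1st", "sg"], "3-sg": ["3rd", "sg"],
    "1-pl": ["1st", "pl"], "3-pl": ["3rd", "pl"],
    "3sg-masc": ["3-sg"], "3sg-fem": ["3-sg"], "3sg-neut": ["3-sg"],
}

def supertypes(t):
    # t together with everything it inherits from
    seen, stack = {t}, [t]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def unify_types(t1, t2):
    # common subtypes of t1 and t2
    common = [t for t in parents if {t1, t2} <= supertypes(t)]
    if not common:
        return None                           # the fail type
    # the result is the common subtype from which all other common subtypes inherit
    for c in common:
        if all(c in supertypes(d) for d in common):
            return c
    return None                               # no unique greatest lower bound in this sketch

print(unify_types("3rd", "sg"))    # 3-sg
print(unify_types("1st", "agr"))   # 1st
print(unify_types("3rd", "1st"))   # None (fails)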
Complex types are also part of the type hierarchy. Subtypes of complex types inherit all the
features of their parents, together with the constraints on their values. Following is a small part of
this hierarchy for the sentential complements of verbs (Sanfilippo, 1993):
[Type hierarchy fragment: comp-cat splits into trans-comp-cat and intrans-comp-cat, which cross-classify with swh-comp-cat, sfin-comp-cat, sbase-comp-cat and sinf-comp-cat to give subtypes such as tr-swh-comp-cat, tr-fin-comp-cat, tr-sbase-comp-cat, intr-swh-comp-cat, intr-sfin-comp-cat, intr-sbase-comp-cat and intr-sinf-comp-cat.]
Ex:
tr-swh-comp-cat: “Ask yourself whether you have become better informed.”
intr-swh-comp-cat: “Monsieur asked whether I wanted to ride.”
It is possible to represent the whole phrase structure rule as a type. Sag and Wasow (1999) take a
type phrase which has a feature called DTRS (daughters), whose value is a list of phrases. The
phrase “I love Seoul” could have the following representation (showing only the daughter
feature):

[ phrase
  DTRS ⟨ [ CAT PRO
           ORTH I ],
         [ phrase
           CAT VP
           DTRS ⟨ [ CAT V
                    ORTH LOVE ],
                  [ CAT NP
                    ORTH SEOUL ] ⟩ ] ⟩ ]

The resulting typed feature structures place constraints on which type of values a given feature
can take, and can also be organized into a type hierarchy. In this case, the feature structures can be
well typed.
Parsing with PSG
3.1 Bottom-Up Parsing

3.1.1 Definition of parsing:


Parsing means taking an input and producing some sort of structure for it. Parsing is a
general concept, used not only in linguistics but also in programming.
In NLP, parsing is a combination of recognizing an input string and assigning some
structure to it.
Syntactic parsing is the task of recognizing a sentence and assigning a syntactic structure
(e.g. a tree or chart) to it.

3.1.2 Parsing as search


In syntactic parsing, the parser can be viewed as searching through the space of all possible
parse trees to find the correct parse tree for the given sentence.

E.g. If we have a small PSG for English:

1. S → NP VP
2. S → Aux NP VP
3. S → VP
4. NP → Det Nominal
5. Nominal → Noun
6. Nominal → Noun Nominal
7. Nominal → Nominal PP
8. NP → Proper Noun
9. VP → Verb
10. VP → Verb NP
11. Det → that | this | a
12. Noun → book | flight | meat | money
13. Verb → book | include | prefer
14. Aux → does
15. Prep → from | to | on
16. Proper Noun → Houston | ASIANA | KOREAN AIR | CAAC | Dragon Air

Using this PSG to parse sentence “Book that flight”, the correct parse tree that would be
assigned to this sentence is as follows:
[Parse tree: (S (VP (Verb book) (NP (Det that) (Nominal (Noun flight)))))]
Fig. 1 Parse tree
Regardless of the search algorithm we choose, there are two kinds of constraints that should
help guide the search.
Constraint coming from the data: the final parse tree must have three leaves, the three
words of the input sentence: “book”, “that”, “flight”.
Constraint coming from the grammar: the final parse tree must have one root, the start
symbol S.
These two constraints give rise to the two search strategies:
Bottom-up search (or data-directed search)
Top-down search (or goal-directed search)
3.1.3 Bottom-Up Parsing
In bottom-up parsing, the parser starts with the words of the input and tries to build trees from the
words up. The parsing is successful if the parser succeeds in building a tree rooted in the start
symbol S that covers all of the input.
Example:
We use the above small PSG to parse (bottom-up) the sentence “Book that flight”.
[Figure (Fig. 2, Bottom-Up parser): the successive plies of the bottom-up search. Ply 1 is the input “Book that flight”; ply 2 assigns the possible parts of speech (Noun Det Noun and Verb Det Noun); ply 3 builds Nominal over the nouns; ply 4 builds NP over Det Nominal and VP over the verb; ply 5 extends the VP analyses, and the trees that cannot be continued fail; ply 6 yields the tree rooted in S, built from VP → Verb NP, that covers the whole input.]

In the sixth ply, the root S covers all of the input, so the bottom-up parse succeeds.
We can use the shift-reduce algorithm to do this parsing.
In the shift-reduce algorithm, a stack is used to hold the partial analysis. The operations are
shift, reduce, refuse and accept. A shift moves the next input symbol onto the top of the stack.
A reduce replaces the symbols on top of the stack that match the right-hand side (RHS) of a
grammar rule with that rule's left-hand side (LHS). If the whole input string has been processed
and the symbol on top of the stack is S (the start symbol of the grammar), the input string is
accepted; otherwise it is refused.
Following is the shift-reduce process of sentence “Book that flight”
Stack Operation the rest part of input string
Book that flight
++Book shift that flight
Noun reduce by rule12 that flight
Noun that shift flight
Noun Det reduce by rule 11 flight
Noun Det flight shift φ
Noun Det Noun reduce by rule 12 φ
Noun Det Nominal reduce by rule 5 φ
Noun NP reduce by rule 4 φ
[Backtracking to ++]
+++ Verb reduce by rule 13 that flight
VP reduce by rule 9 that flight
VP that shift flight
VP Det reduce by rule 11 flight
VP Det flight shift φ
VP Det Noun reduce by rule 12 φ
VP Det Nominal reduce by rule 5 φ
VP NP reduce by rule 4 φ
[Backtracking to +++]
Verb that shift flight
Verb Det reduce by rule 11 flight
Verb Det flight shift φ
Verb Det Noun reduce by rule 12 φ
Verb Det Nominal reduce by rule 5 φ
Verb NP reduce by rule 4 φ
VP reduce by rule 10 φ
S reduce by rule 3 φ
[Success !]
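The same search can be sketched compactly in Python. This is a recursive variant of the shift-reduce idea (our own simplification: backtracking is handled by the recursion rather than by the explicit backtracking markers ++ and +++ used in the trace above); the grammar and lexicon follow the small PSG, with Proper Noun written as ProperNoun.

RULES = [
    ("S", ("NP", "VP")), ("S", ("Aux", "NP", "VP")), ("S", ("VP",)),
    ("NP", ("Det", "Nominal")), ("Nominal", ("Noun",)),
    ("Nominal", ("Noun", "Nominal")), ("Nominal", ("Nominal", "PP")),
    ("NP", ("ProperNoun",)), ("VP", ("Verb",)), ("VP", ("Verb", "NP")),
]
LEXICON = {"book": ["Noun", "Verb"], "that": ["Det"], "flight": ["Noun"]}

def parse(stack, words):
    # accept when all input is consumed and the stack holds exactly the start symbol
    if not words and stack == ["S"]:
        return True
    # try every reduce whose RHS matches the top of the stack
    for lhs, rhs in RULES:
        n = len(rhs)
        if n <= len(stack) and tuple(stack[-n:]) == rhs:
            if parse(stack[:-n] + [lhs], words):
                return True
    # otherwise shift the next word, trying each of its possible parts of speech
    if words:
        for pos in LEXICON[words[0]]:
            if parse(stack + [pos], words[1:]):
                return True
    return False

print(parse([], ["book", "that", "flight"]))   # True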
3.2 Top-Down Parsing
3.2.1 The process of Top-Down Parsing
A top-down parser searches for a parse tree by trying to build from the root node S down to the
leaves. The algorithm starts with the start symbol S. The next step is to find the tops of all trees
which can start with S, then to expand the constituents in the new trees, and so on.
If we use the above small PSG to parse (top-down) the sentence “Book that flight”, the first three
plies are as follows:
[Figure: ply 1 is the bare start symbol S; ply 2 expands S into its three alternatives (S → NP VP, S → Aux NP VP, S → VP); ply 3 expands the daughters of those trees in turn (Det Nominal, Proper Noun, Aux, Verb, Verb NP, …), and most of the resulting trees fail against the input.]
In this case, only the fifth parse tree will match the input sentence.
[Figure (Fig. 3, Top-Down parsing): the surviving tree S → VP with VP → Verb NP is expanded ply by ply: Verb is matched against “Book”; the NP is expanded as Det Nominal after the Proper Noun alternative fails; Det is matched against “that”; the alternatives Nominal → Nominal PP and Nominal → Noun Nominal fail; and Nominal → Noun finally matches “flight”. Success!]
The search process for the sentence “book that flight”:
Searching goal Rule The rest part of input string
++S Book that flight
+NP VP 1 Book that flight
Det Nom VP 4 Book that flight
[backtracking to +]
PropN VP 8 Book that flight
[backtracking to ++]
Aux NP VP 2 Book that flight
[backtracking to ++] Book that flight
+++VP 3 Book that flight
Verb 9 Book that flight
φ that flight
[backtracking to +++] Book that flight
++++Verb NP 10 Book that flight
PropN 8 that flight
[backtracking to ++++]
Det Nominal 4 that flight
+++++ Nominal flight
++++++Nominal PP 7 flight
Noun Nominal PP 6 flight
Nominal PP φ
[backtracking to ++++++]
Noun PP 5 flight
PP φ
[backtracking to +++++]
Noun Nominal 6 flight
Nominal φ
[backtracking to +++++]
Noun 5 flight
φ φ
[Success ! ]
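The goal-directed search above can be sketched as a recursive-descent recognizer (a simplification of our own: a crude depth bound stands in for the more careful treatment of left-recursion discussed in Section 3.3.1 below).

RULES = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["ProperNoun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"], ["Nominal", "PP"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "PP": [["Prep", "NP"]],
}
LEXICON = {"book": {"Noun", "Verb"}, "that": {"Det"}, "flight": {"Noun"}}

def expand(sym, i, words, depth=0):
    # yield every position j such that sym can derive words[i:j]
    if depth > 2 * len(words) + 4:        # crude guard against runaway (left-recursive) expansion
        return
    if sym not in RULES:                  # a part-of-speech symbol: match one word
        if i < len(words) and sym in LEXICON[words[i]]:
            yield i + 1
        return
    for rhs in RULES[sym]:
        yield from expand_seq(rhs, i, words, depth + 1)

def expand_seq(symbols, i, words, depth):
    # match a sequence of symbols left to right, yielding each reachable end position
    if not symbols:
        yield i
        return
    for j in expand(symbols[0], i, words, depth):
        yield from expand_seq(symbols[1:], j, words, depth)

words = "book that flight".split()
print(any(j == len(words) for j in expand("S", 0, words)))   # True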
3.2.2 Comparing Top-Down and Bottom-Up Parsing
The top-down strategy never wastes time exploring trees that cannot result in an S, since it
begins by generating just those trees. This means it never explores subtrees that cannot
find a place in some S-rooted tree. By contrast, in the bottom-up strategy, trees that have
no hope of leading to an S are generated with wild abandon, so effort is wasted.
The top-down strategy spends considerable effort on S trees that are not consistent with the
input, since it can generate trees before ever examining the input. The bottom-up strategy
never suggests trees that are not locally grounded in the actual input.
Neither of these approaches adequately exploits the constraints presented by the grammar and the
input words.
3.3.3 A basic Top-Down parser
3.3.3.1 left-corner: We call the first word along the left edge of a derivation the left-corner of the
tree.
[Figure (Fig. 4, left-corner): two parse trees for the VP “prefer a morning flight”; in both, the leftmost branch runs VP – Verb – prefer.]
In Fig. 4, the node “Verb” and the word “prefer” are both left-corners of the VP.
Formally, we can say that for non-terminals A and B, B is a left-corner of A if the following
relation holds:
A ⇒* B α
In other words, B can be a left-corner of A if there is a derivation of A that begins
with a B.
The parser should not consider any grammar rule if the current input word cannot serve as the
first word along the left edge of some derivation from the rule.
3.2.3 Bottom-Up Filtering
We can set up a table that lists all the valid left-corner categories (parts of speech, POS) for each
non-terminal (e.g. S, VP, NP, etc.) in the grammar. When a rule is considered, the table entry for
the category that starts the right-hand side of the rule is consulted. If it fails to contain any of the
parts of speech associated with the current input word, the rule is eliminated from consideration.
In this way, the table acts as a bottom-up filter.
For our small Grammar, the left-corner table is as follows:
Non-terminal Left-corner

S Det, Proper Noun, Aux, Verb


NP Det, Proper Noun
Nominal Noun
VP Verb
Fig. 5 left-corner table
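A sketch of how such a table could be computed from the grammar (our own helper names): take the closure of the “can be the first daughter” relation, then keep only the part-of-speech categories.

RULES = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["ProperNoun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"], ["Nominal", "PP"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "PP": [["Prep", "NP"]],
}
POS = {"Det", "Noun", "Verb", "Aux", "Prep", "ProperNoun"}

def left_corners(sym, seen=None):
    # all categories that can appear as the left corner of a derivation from sym
    seen = seen or set()
    if sym in seen:
        return set()
    seen.add(sym)
    result = {sym}
    for rhs in RULES.get(sym, []):
        result |= left_corners(rhs[0], seen)
    return result

for nt in ["S", "NP", "Nominal", "VP"]:
    print(nt, sorted(left_corners(nt) & POS))
# S        ['Aux', 'Det', 'ProperNoun', 'Verb']
# NP       ['Det', 'ProperNoun']
# Nominal  ['Noun']
# VP       ['Verb']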
Using this left-corner table, the parse of the sentence “book that flight” becomes simple and
quick. The process is as follows:
First ply:
[Figure: of the three S expansions, only the tree whose left corner is Verb (S → VP, VP → Verb NP) survives; the expansions starting with Det and Aux cannot match “Book” and fail.]
Verb is a left-corner of S; “Det” and “Aux” cannot match “Book”.
Second ply:
[Figure (Fig. 6): the NP is expanded as Det Nominal, and Det is matched against “that”.]
Det is the left-corner of NP.
Third ply:
[Figure (Fig. 7): the Nominal is expanded as Noun, and Noun is matched against “flight”.]
Noun is the left-corner of Nominal.

The Top-Down parsing process using left-corner filter is as follows:


Searching goal Rule The rest part of input string
S Book that flight
+ VP 3 Book that flight
Verb 9 Book that flight
φ that flight
[backtracking to +]
Verb NP 10 Book that flight
NP that flight
Det Nominal 4 that flight
Nominal flight
Noun 5 flight
φ φ
[Success !]

3.3 Problems with Top-Down Parser

3.3.1 Left-recursion:
A top-down, depth-first, left-to-right parser may dive down an infinitely deep path and never
return to explore the rest of the search space if the grammar is left-recursive.

Formally, a grammar is left-recursive if it contains at least one non-terminal A such that

A ⇒* α A β, for some α and β with α ⇒* ε.
In other words, a grammar is left-recursive if it contains a non-terminal category that has a
derivation that includes itself anywhere along its leftmost branch.

A more obvious and common case of left-recursion in natural language grammars involves
immediately left-recursive rules, rules of the form A → A β, where the first constituent of the RHS
is identical to the LHS.
E.g. NP → NP PP
     VP → VP PP
     S → S and S

A left-recursive non-terminal can lead a top-down, depth-first, left-to-right parser to recursively
expand the same non-terminal over and over again in exactly the same way, leading to an infinite
expansion of the trees.

E.g. if we place the left-recursive rule NP → NP PP first in our small grammar, we may get an
infinite search, as sketched below:
[Figure (Fig. 8, infinite search): S is expanded as NP VP, the NP is expanded as NP PP, that NP is again expanded as NP PP, and so on without end.]
3.3.2 Structure ambiguity
Structure ambiguity occurs when the grammar assigns more than one possible parse to a sentence.
Three common kinds of structure ambiguities are attachment ambiguity, coordination ambiguity
and noun-phrase bracketing ambiguity.
3.3.2.1 Attachment ambiguity:
3.3.2.1.1 PP attachment ambiguity:
E.g.
1) They made a report about the ship.
On the ship, they made a report.
They made a report on the ship.
[Figure (Fig. 9, PP attachment ambiguity): two parse trees for “They made a report on the ship”. In the first, the PP “on the ship” attaches to the VP, i.e. it modifies the verb “made”; in the second, it attaches inside the NP, i.e. it modifies the Nominal “report”.]
2) They made a decision concerning the boat.
On the boat, they made a decision.
They made a decision on the boat.
3) He drove the car which was near the post office.
Near the post office, he drove the car.
He drove the car near the post office.
4) They are walking around the lake which is situated in the park.
In the park, they are walking around the lake.
They are walking around the lake in the park.
5) He shot at the man who was with a gun.
With a gun, he shot at the man.
He shot at the man with a gun.
6) The policeman arrested the thief who was in the room.
In the room, the policeman arrested the thief.
The policeman arrested the thief in the room.
Church and Patil (1982) showed that the number of parses for sentences of this type
grows at the same rate as the number of parenthesizations of arithmetic expressions.
Such parenthesization (insertion) problems are known to grow exponentially,
in accordance with what are called the Catalan numbers:

C(n) = (1 / (n + 1)) · C(2n, n) = (2n)! / ((n + 1)! · n!)

where C(2n, n) is the binomial coefficient “2n choose n”.
The following table shows the number of parses for a simple noun phrase as
a function of the number of trailing prepositional phrases. We can see that this
kind of ambiguity can very quickly make it imprudent to keep every possible parse
around.

Number of PPs    Number of NP parses

2                2
3                5
4                14
5                42
6                132
7                429
8                1430
9                4862
Fig. 10
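The values in the table are simply the Catalan numbers C(2) … C(9); a quick check (math.comb requires Python 3.8 or later):

from math import comb

def catalan(n):
    # C(n) = (1/(n+1)) * binomial(2n, n)
    return comb(2 * n, n) // (n + 1)

for n in range(2, 10):
    print(n, catalan(n))
# 2 2, 3 5, 4 14, 5 42, 6 132, 7 429, 8 1430, 9 4862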
3.3.2.1.2 Gerundive attachment ambiguity:
E.g. We saw the Eiffel Tower flying to Paris.
The gerundive phrase “flying to Paris” can modify “saw” as an adverbial, or it can
be the predicate of the small clause “the Eiffel Tower flying to Paris”.
3.3.2.1.3 Local ambiguity
Local ambiguity occurs when some part of a sentence is ambiguous even if the whole
sentence is not. E.g. the sentence “book that flight” is unambiguous, but
when the parser sees the first word “book”, it cannot know whether it is a verb or a
noun until later. Thus it must use backtracking or parallelism to consider both
possible parses.

3.3.2.2 Coordination ambiguity (Ambiguity of ‘and’)


E.g.
1) She takes care of old men and old women.
She takes care of women and old men.
She takes care of old men and women.
2) Mr. John is a scientist of great fame and a professor of great fame.
Mr. John is a professor of great fame and a scientist.
Mr. John is a scientist and a professor of great fame.
3) Someone tells me he’s cheating, and I can’t do anything about it.
Someone tells me that he’s cheating and that I can’t do anything about it.
Someone tells me he’s cheating and I can’t do anything about it.
4) John will go, or Dick and Tom will go.
John or Dick will go, and Tom will go.
John or Dick and Tom will go.
3.3.2.3 Noun-phrase bracketing ambiguity:
ADJ + N1 + N2 can be bracketed in two ways:
[NP ADJ [NP N1 N2]]   (the adjective modifies the compound N1 N2)
[NP [NP ADJ N1] N2]   (the adjective modifies N1 only)
E.g.
1) The salesman who sells old cars is busy.
The old salesman who sells cars is busy.
The old car salesman is busy.
2) He is a Department Head, who is from England.
He is Head of the English Department.
He is an English Department Head.
3.3.3 Inefficient re-parsing of sub-trees
The parser often builds valid trees for portions of the input, then discards them during
backtracking, only to find that it has to rebuild them again. This re-parsing of sub-trees is
inefficient.
E.g. for the noun phrase “a flight from Beijing to Seoul on ASIANA”, the top-down parsing
process is as follows:

[Figure (Fig. 11, reduplication of effort): four successive parses of “a flight from Beijing to Seoul on ASIANA”. The parser first builds just the NP “a flight”, then rebuilds it with the PP “from Beijing” attached, then again with “to Seoul”, and finally with “on ASIANA”, re-deriving the same sub-trees each time.]

Because of the way the rules are consulted in our top-down parsing, the parser is led first to
small parse trees that fail because they do not cover all of the input. These successive failures
trigger backtracking events which lead to parses that incrementally cover more and more of the
input. During this backtracking, the same work is duplicated many times: except for its topmost
component, every part of the final tree is derived more than once.
Component reduplication times

A flight 4
From Beijing 3
To Seoul 2
On ASIANA 1
A flight from Beijing 3
A flight from Beijing to Seoul 2
A flight from Beijing to Seoul on ASIANA 1

Similar examples of wasted effort also exist in bottom-up parsing.


3.4 Some Algorithms
3.4.1 Earley algorithm
In order to solve these problems in parsing, Earley (1970) proposed the Earley algorithm.
3.4.1.1 Chart and dotted rule
The core of the Earley algorithm is the chart. For each word position in the sentence, the chart
contains a list of states representing the partial parse trees that have been generated so far. By the
end of the sentence, the chart compactly encodes all the possible parses of the input.
Each state within the chart contains three kinds of information:
a sub-tree corresponding to a single grammar rule;
information about the progress made in completing this sub-tree;
information about the position of the sub-tree with respect to the input.
This information is represented by a dotted rule. In a dotted rule, a state's position with respect
to the input is represented by two numbers indicating where the state begins and where its dot lies.
E.g.
The three rules used to parse “book that flight” are as follows:
S → VP
NP → Det Nominal
VP → Verb NP
Some dotted rules over these three rules can be represented as follows:
S → . VP, [0,0]
NP → Det . Nominal, [1,2]
VP → Verb NP ., [0,3]
The states represented by these dotted rules can be shown on the following chart:

[Chart (Fig. 11): the three words are separated by positions 0, 1, 2, 3 (● book ● that ● flight ●); the edge S → . VP spans [0,0], the edge NP → Det . Nominal spans [1,2], and the edge VP → Verb NP . spans [0,3].]
This chart is a directed acyclic graph (DAG).
3.4.1.2 Three operators in the Earley algorithm
Predictor
The job of the predictor is to create new states representing top-down expectations generated during
the parsing process. The predictor is applied to any state that has a non-terminal to the right of the
dot. This application results in the creation of one new state for each alternative expansion of that
non-terminal provided by the grammar. These new states are placed into the same chart entry as
the generating state; they begin and end at the point in the input where the generating state ends.
E.g. applying the predictor to the state S → . VP, [0,0] results in adding the states VP → . Verb,
[0,0] and VP → . Verb NP, [0,0] to the first chart entry.
Scanner
When a state has a part-of-speech category to the right of the dot, the scanner is called to examine
the input and incorporate a state corresponding to the predicted part of speech into the chart. This
is accomplished by creating a new state from the input state with the dot advanced over the
predicted input category.
E.g. when the state VP → . Verb NP, [0,0] is processed, the scanner consults the current word in
the input, since the category following the dot is a part of speech. The scanner then notes that “book”
can be a verb, matching the expectation in the current state. This results in the creation of the new
state VP → Verb . NP, [0,1]. The new state is then added to the chart entry that follows the one
currently being processed.
Completer
The completer is applied to a state when its dot has reached the right end of the rule. Intuitively,
the presence of such a state represents the fact that the parser has successfully discovered a
particular grammatical category over some span of the input. The purpose of the completer is to
find and advance all previously created states that were looking for this grammatical category at
this position in the input. New states are created by copying the old state, advancing the dot
over the expected category, and installing the new state in the current chart entry.
E.g. when the state NP → Det Nominal ., [1,3] is processed, the completer looks for states ending
at 1 that expect an NP. In the current example, it will find the state VP → Verb . NP, [0,1] created
by the scanner. This results in the addition of the new completed state VP → Verb NP ., [0,3].
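The three operators can be sketched together as a compact Earley recognizer in Python (our own simplified version: each state is a tuple (LHS, RHS, dot, start), the scanner advances directly over part-of-speech categories by consulting a small lexicon, and each chart entry is a set):

GRAMMAR = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["ProperNoun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"], ["Nominal", "PP"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "PP": [["Prep", "NP"]],
}
LEXICON = {"book": {"Noun", "Verb"}, "that": {"Det"}, "flight": {"Noun"}}
POS = {"Det", "Noun", "Verb", "Aux", "Prep", "ProperNoun"}

def earley(words):
    chart = [set() for _ in range(len(words) + 1)]
    chart[0].add(("gamma", ("S",), 0, 0))                 # dummy start state gamma -> .S, [0,0]
    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, start = agenda.pop()
            if dot < len(rhs) and rhs[dot] not in POS:    # Predictor
                for expansion in GRAMMAR[rhs[dot]]:
                    new = (rhs[dot], tuple(expansion), 0, i)
                    if new not in chart[i]:
                        chart[i].add(new)
                        agenda.append(new)
            elif dot < len(rhs):                          # Scanner
                if i < len(words) and rhs[dot] in LEXICON[words[i]]:
                    chart[i + 1].add((lhs, rhs, dot + 1, start))
            else:                                         # Completer
                for l2, r2, d2, s2 in list(chart[start]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, s2)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
    return ("gamma", ("S",), 1, 0) in chart[len(words)]

print(earley("book that flight".split()))   # True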
Martin Kay improved the Earley algorithm and proposed the fundamental rule of chart parsing.
Strictly, the fundamental rule of chart parsing is as follows:
If the chart contains edges <A → W1 . B W2, [i,j]> and <B → W3 ., [j,k]>, where A and B are
categories and W1, W2 and W3 are (possibly empty) sequences of categories or words, then
add the edge <A → W1 B . W2, [i,k]> to the chart.
This fundamental rule can be represented on a chart:
[Figure (Fig. 12, fundamental rule of chart parsing): over positions i, j, k, the active edge <A → W1 . B W2, [i,j]> and the inactive edge <B → W3 ., [j,k]> combine into the new edge <A → W1 B . W2, [i,k]>.]

3.4.1.3 An example of the Earley algorithm

The state sequence in the chart while parsing “book that flight” using our small grammar:
Chart [0]
γ .S [0,0] Dummy start state
S .NP VP [0,0] Predictor
NP .Det Nominal [0,0] Predictor
NP .Proper-Noun [0,0] Predictor
S .Aux NP VP [0,0] Predictor
S .VP [0,0] Predictor
VP .Verb [0,0] Predictor
VP .Verb NP [0,0] Predictor

Chart [1]
Verb book. [0,1] Scanner
VP Verb. [0,1] Completer
S VP. [0,1] Completer
VP Verb. NP [0,1] Completer
NP .Det Nominal [1,1] Predictor
NP .Proper-Noun [1,1] Predictor

Chart [2]
Det that. [1,2] Scanner
NP Det. Nominal [1,2] Completer
Nominal .Noun [2,2] Predictor
Nominal .Noun Nominal [2,2] Predictor

Chart [3]
Noun flight. [2,3] Scanner
Nominal Noun. [2,3] Completer
Nominal Noun. Nominal [2,3] Completer
NP Det Nominal. [1,3] Completer
VP Verb NP. [0,3] Completer
S VP. [0,3] Completer
Nominal .Noun [3,3] Predictor
Nominal .Noun Nominal [3,3] Predictor

In chart [3], the presence of the state representing “flight” leads to the completion of the NP, the
transitive VP, and S. The presence of the state S → VP., [0,3] in the last chart entry means that the
parse has succeeded.

3.4.1.4 Retrieving parse trees from the chart


The Earley algorithm just described is actually a recognizer, not a parser. After processing, a valid
sentence leaves the state S → α., [0,N] in the chart (N is the number of words in the sentence).
To turn this algorithm into a parser, we must be able to extract individual parses from the chart.
To do this, the representation of each state must be augmented with an additional field to store
information about the completed states that generated its constituents.
Recall that the completer creates new states by advancing older incomplete ones when the
constituent following the dot is discovered. The only change necessary is to have the completer add
a pointer to the older state onto the list of previous states of the new state. Retrieving a parse tree
from the chart is then merely a recursive retrieval starting with the state (or states) representing a
complete S in the final chart entry. Following is the chart produced by such an updated completer.
Chart [0]
S0 γ .S [0,0] [] Dummy start state
S1 S .NP VP [0,0] [] Predictor
S2 NP .Det Nominal [0,0] [] Predictor
S3 NP .Proper-Noun [0,0] [] Predictor
S4 S .Aux NP VP [0,0] [] Predictor
S5 S .VP [0,0] [] Predictor
S6 VP .Verb [0,0] [] Predictor
S7 VP .Verb NP [0,0] [] Predictor

Chart [1]
S8 Verb book. [0,1] [] Scanner
S9 VP Verb. [0,1] [S8] Completer
S10 S VP. [0,1] [S9] Completer
S11 VP Verb. NP [0,1] [S8] Completer
S12 NP .Det Nominal [1,1] [] Predictor
S13 NP .Proper-Noun [1,1] [] Predictor

Chart [2]
S14 Det that. [1,2] [] Scanner
S15 NP Det. Nominal [1,2] [S14] Completer
S16 Nominal .Noun [2,2] [] Predictor
S17 Nominal .Noun Nominal [2,2] [] Predictor

Chart [3]
S18 Noun flight. [2,3] [] Scanner
S19 Nominal Noun. [2,3] [S18] Completer
S20 Nominal Noun. Nominal [2,3] [S18] Completer
S21 NP Det Nominal. [1,3] [S14, S19] Completer
S22 VP Verb NP. [0,3] [S8, S21] Completer
S23 S VP. [0,3] [S22] Completer
S24 Nominal .Noun [3,3] [] Predictor
S25 Nominal .Noun Nominal [3,3] [] Predictor
The parsing process can be summarized as follows:
S8 Verb book. [0,1] [] Scanner
S9 VP Verb. [0,1] [S8] Completer
S10 S VP. [0,1] [S9] Completer
S11 VP Verb. NP [0,1] [S8] Completer
S14 Det that. [1,2] [] Scanner
S15 NP Det. Nominal [1,2] [S14] Completer
S18 Noun flight. [2,3] [] Scanner
S19 Nominal Noun. [2,3] [S18] Completer
S20 Nominal Noun. Nominal [2,3] [S18] Completer
S21 NP Det Nominal. [1,3] [S14, S19] Completer
S22 VP Verb NP. [0,3] [S8, S21] Completer
S23 S VP. [0,3] [S22] Completer

The chart (a DAG) representing the parse is as follows:

[Figure (Fig. 13): the final chart over positions 0–3; the completed edges Verb → book., Det → that., Noun → flight., Nominal → Noun., NP → Det Nominal., VP → Verb NP. and S → VP. span the input, with S → VP. covering [0,3].]
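A sketch of the recursive retrieval, using the completed states and backpointer lists listed above (hand-encoded here for illustration; scanner states are treated as leaves carrying their word):

# Each entry: state id -> (rule, list of backpointers to the completed daughter states).
STATES = {
    "S8":  ("Verb -> book",       []),
    "S14": ("Det -> that",        []),
    "S18": ("Noun -> flight",     []),
    "S19": ("Nominal -> Noun",    ["S18"]),
    "S21": ("NP -> Det Nominal",  ["S14", "S19"]),
    "S22": ("VP -> Verb NP",      ["S8", "S21"]),
    "S23": ("S -> VP",            ["S22"]),
}

def tree(state_id):
    # recursively build a bracketed tree from a completed state
    rule, kids = STATES[state_id]
    lhs, rhs = rule.split(" -> ")
    if not kids:                      # a scanned word: leaf
        return f"({lhs} {rhs})"
    return "(" + lhs + " " + " ".join(tree(k) for k in kids) + ")"

print(tree("S23"))
# (S (VP (Verb book) (NP (Det that) (Nominal (Noun flight)))))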
3.4.1.5 Further examples:
Example 1: Using the Earley algorithm to parse the sentence “Does KA 852 have a first class
section?”
In this sentence, “first” is an ordinal (Ord), so we add a new rule to our small grammar:
NP → Ord Nom
The word positions are as follows:
● Does ● KA 852 ● have ● first ● class ● section ●
0      1         2      3       4       5         6
The state sequence in chart:
Chart [0]
γ .S [0,0] Dummy start state
S .NP VP [0,0] Predictor
NP .Ord Nom [0,0] Predictor
NP .PrN [0,0] Predictor
S .Aux NP VP [0,0] Predictor
S .VP [0,0] Predictor
VP .V [0,0] Predictor
VP .V NP [0,0] Predictor

Chart [1]
Aux does. [0,1] Scanner
S Aux. NP VP [0,1] Completer
NP .Ord Nom [1,1] Predictor
NP .PrN [1,1] Predictor

Chart [2]
PrN KA 852. [1,2] Scanner
NP PrN. [1,2] Completer
S Aux NP. VP [0,2] Completer
VP .V [2,2] Predictor
VP .V NP [2,2] Predictor

Chart [3]
V have. [2,3] Scanner
VP V. [2,3] Completer
VP V. NP [2,3] Completer
NP .Ord Nom [3,3] Predictor

Chart [4]
Ord first. [3,4] Scanner
NP Ord. Nom [3,4] Completer
Nom .N Nom [4,4] Predictor
Nom .N [4,4] Predictor
Nom .N PP [4,4] Predictor

Chart [5]
N class. [4,5] Scanner
Nom N. [4,5] Completer
NP Ord Nom. [3,5] Completer
VP V NP. [2,5] Completer
S Aux NP VP. [0,5] Completer (S’ span is 5, 5 < 6)
Nom N. Nom [4,5] Completer
Nom .N [5,5] Predictor
Chart [6]
N section. [5,6] Scanner
Nom N. [5,6] Completer
Nom N Nom. [4,6] Completer
NP Ord Nom. [3,6] Completer
VP V NP. [2,6] Completer
S Aux NP VP. [0,6] Completer
[Success !]
The parsing process:
Aux does. [0,1] Scanner
S Aux. NP VP [0,1] Completer
PrN KA 852. [1,2] Scanner
NP PrN. [1,2] Completer
S Aux NP. VP [0,2] Completer
V have. [2,3] Scanner
VP V. [2,3] Completer
VP V. NP [2,3] Completer
Ord first. [3,4] Scanner
NP Ord. Nom [3,4] Completer
N class. [4,5] Scanner
N section. [5,6] Scanner
Nom N. [5,6] Completer
Nom N Nom. [4,6] Completer
NP Ord Nom. [3,6] Completer
VP V NP. [2,6] Completer
S Aux NP VP. [0,6] Completer
[Success !]
[Figure (Fig. 14): the final chart for “Does KA 852 have a first class section”; the completed edges include Aux → does., PrN → KA 852., V → have., Ord → first., N → class., N → section., Nom → N., Nom → N Nom., NP → PrN., NP → Ord Nom., VP → V NP. and S → Aux NP VP. spanning [0,6].]
Example 2: Using the Earley algorithm to parse the sentence “It is a flight from Beijing to Seoul on
ASIANA”.
The word positions:
● it ● is ● a ● flight ● from ● Beijing ● to ● Seoul ● on ● ASIANA ●
0    1    2   3        4      5         6    7       8    9         10
“It” is a pronoun, so we need to add two new rules to our grammar:
NP → Pron
PP → Prep NP
The state sequence is as follows:
Chart [0]
γ .S [0,0] Dummy start state
S .NP VP [0,0] Predictor
NP .Pron [0,0] Predictor
NP .PrN [0,0] Predictor
S .Aux NP VP [0,0] Predictor
S .VP [0,0] Predicator
VP .V [0,0] Predicator
VP .V NP [0,0] Predicator

Chart [1]
Pron it [0,1] Scanner
NP Pron. [0,1,] Completer
S NP. VP [0,1] Completer
VP .V [1,1] Predictor
VP .V NP [1,1] Predictor

Chart [2]
V is. [1,2] Scanner
VP V. [1,2] Completer
S NP VP. [0,2] Completer (S’ span is 2 < 10)
VP V. NP [1,2] Completer
NP .Det Nom [2,2] Predictor

Chart [3]
Det a. [2,3] Scanner
NP Det. Nom [2,3] Completer
Nom N [3,3] Predictor
Nom .N Nom [3,3] Predictor
Nom .Nom PP [3,3] Predictor

Chart [4]
N flight. [3,4] Scanner
Nom N. [3,4] Completer
NP Det Nom. [2,4] Completer
VP V NP.. [1,4] Completer
S NP VP. [0,4] Completer (S’ span is 4 < 10)
Nom N. Nom [3,4] Completer
Note: no Nominal follows this Noun, so the process continues with the following state:
PP .Prep NP [4,4] Predictor

Chart [5]
Prep from. [4,5] Scanner
PP Prep. NP [4,5] Completer
NP .PrN [5,5] Predictor

Chart [6]
PrN Beijing. [5,6] Scanner
NP PrN. [5,6] Completer
PP Prep NP [4,6] Completer
Nom Nom PP. [3,6] Completer
Note: the dot is after the PP (here PP = “from Beijing”), so this is an inactive (complete) edge.
Nom Nom. PP [3,6] Completer
Note: the dot is now in front of a PP (the next PP, “to Seoul”), so this is an active edge.
PP .Prep NP [6,6] Predictor

Chart [7]
Prep to. [6,7] Scanner
PP Prep. NP [6,7] Completer
NP .PrN [7,7] Predictor

Chart [8]
PrN Seoul. [7,8] Scanner
NP PrN. [7,8] Completer
PP Prep NP. [6,8] Completer
Nom Nom PP. [3,8] Completer
Note: the dot is after the PP (here PP = “to Seoul”), so this is an inactive (complete) edge.
Nom Nom. PP [3,8] Completer
Note: the dot is now in front of a PP (the next PP, “on ASIANA”), so this is an active edge.
PP .Prep NP [8,8] Predictor

Chart [9]
Prep on. [8,9] Scanner
PP Prep. NP [8,9] Completer
NP .PrN [9,9] Predictor

Chart [10]
PrN ASIANA. [9,10] Scanner
NP PrN. [9,10] Completer
PP Prep NP. [8,10] Completer
Nom Nom PP. [3,10] Completer
NP Det Nom. [2,10] Completer
VP V NP [1,10] Completer
S NP VP. [0,10] Completer
[Success !]

The parsing process:

Pron it [0,1] Scanner


NP Pron. [0,1,] Completer
S NP. VP [0,1] Completer
V is. [1,2] Scanner
VP V. NP [1,2] Completer
Det a. [2,3] Scanner
NP Det. Nom [2,3] Completer
N flight. [3,4] Scanner
Nom N. [3,4] Completer
NP Det Nom. [2,4] Completer
Nom Nom. PP [3,4] Completer
Prep from. [4,5] Scanner
PP Prep. NP [4,5] Completer
PrN Beijing. [5,6] Scanner
NP PrN. [5,6] Completer
PP Prep NP [4,6] Completer
Nom Nom PP. [3,6] Completer
Note: the dot is after the PP (here PP = “from Beijing”), so this is an inactive (complete) edge.
Nom Nom. PP [3,6] Completer
Note: the dot is now in front of a PP (the next PP, “to Seoul”), so this is an active edge.
Prep to. [6,7] Scanner
PP Prep. NP [6,7] Completer
PrN Seoul. [7,8] Scanner
NP PrN. [7,8] Completer
PP Prep NP. [6,8] Completer
Nom Nom PP. [3,8] Completer
Note: the dot is after the PP (here PP = “to Seoul”), so this is an inactive (complete) edge.
Nom Nom. PP [3,8] Completer
Note: the dot is now in front of a PP (the next PP, “on ASIANA”), so this is an active edge.
Prep on. [8,9] Scanner
PP Prep. NP [8,9] Completer
PrN ASIANA. [9,10] Scanner
NP PrN . [9,10] Completer
PP Prep NP . [8,10] Completer
Nom Nom PP. [3,10] Completer
NP Det Nom. [2,10] Completer
VP V NP [1,10] Completer
S NP VP. [0,10] Completer
[Success !]
The chart:
[Figure (Fig. 15): the final chart over positions 0–10; the completed edges include Pron → it., V → is., Det → a., N → flight., the prepositions and proper nouns, NP → Pron., Nom → N., NP → PrN. (three times), PP → P NP. (three times), the successively larger Nom → Nom PP. edges, NP → Det Nom. over [2,10], VP → V NP. over [1,10] and S → NP VP. over [0,10].]
In this parsing process there is no backtracking, unlike in the plain top-down parsing. The
advantage of the Earley algorithm is obvious.
3.4.2 CYK approach:
CYK is an abbreviation of the Cocke-Younger-Kasami approach. It is a bottom-up parsing
algorithm based on dynamic programming: all analyses of each span are built side by side,
with no backtracking.

3.4.2.1 Table and boxes in the CYK approach

Suppose we have a CFG as follows:

S → NP VP
NP → Det N
VP → V NP

Obviously, this CFG is in Chomsky Normal Form (CNF), because all the rules have the form A → BC.
The following table expresses the result of CYK parsing for the sentence “the boy hits a dog”:

 5   S
 3               VP
 2   NP                     NP
 1   Det    N     V    Det   N

     1      2     3     4    5
     the   boy  hits    a   dog
Fig. 16
In this table, the column number (1–5) gives the position of a word in the sentence, and the row
number gives the number of words covered by the grammatical category (e.g. N, V, NP, VP, S).
Every grammatical category is located in a box of the table: b_i j denotes the box for a category
that starts at word i and covers j words.
“Det belongs to b1 1” means: Det starts at word 1 and covers 1 word.
“N belongs to b2 1” means: N starts at word 2 and covers 1 word.
“V belongs to b3 1” means: V starts at word 3 and covers 1 word.
“Det belongs to b4 1” means: Det starts at word 4 and covers 1 word.
“N belongs to b5 1” means: N starts at word 5 and covers 1 word.
For the same reason,
the location of NP (the boy) is b1 2 (it covers 2 words),
the location of NP (a dog) is b4 2,
the location of VP (hits a dog) is b3 3,
the location of S (the boy hits a dog) is b1 5.
Obviously, the table and the boxes b_i j describe the structure of the sentence: for every category
in b_i j, i gives its starting position in the sentence and j gives the number of words it covers.
If we can fill in the table and its boxes, the parsing is completed.

3.4.2.2 CYK description of Chomsky Normal Form


For a rule in Chomsky Normal Form, A → BC:
if B belongs to b_i k and C belongs to b_i+k j-k,
then A must belong to b_i j.
That is, if starting from the i-th word of the sentence we build a sub-tree B covering k words,
and then starting from the (i+k)-th word we build a sub-tree C covering j−k words,
the tree for A can be drawn as follows:

[Figure (Fig. 17): A (in b_i j) dominates B (in b_i k) and C (in b_i+k j-k); B covers the k words from position i to i+k−1, C covers the j−k words from position i+k to i+j−1, and A covers all j words.]
For example, in Fig. 16, NP belongs to b1 2, Det belongs to b1 1 and N belongs to b2 1;
this represents the Chomsky normal form rule NP → Det N with i=1, k=1, j=2.
Therefore, if we know the starting position i of B, the length k of B and the length j
of A, we can calculate the locations of A, B and C in the CYK table: A belongs
to b_i j, B belongs to b_i k, C belongs to b_i+k j-k.

In the CYK approach, the important problem is how to calculate the location of A. The
starting position of A is always the same as that of B, so if the starting position of B is i, then the
starting position of A must also be i. The length of A (= j) equals the sum of the length of B (= k)
and the length of C (= j−k): j = k + (j − k).
Therefore, if we know the location of B and the location of C, it is easy to calculate
the location of A.

If the length of the input sentence is n, the CYK algorithm can be divided into two steps:
First step: starting from i = 1, for every word Wi of the input sentence we have a rewriting
rule A → Wi, so we write the non-terminal symbol A of every word Wi into a box of the table
and give the box its location number b_i 1. E.g. for the sentence “The boy hits a dog”, the
location numbers of the words are: b1 1 (for Det, the non-terminal of ‘the’),
b2 1 (for N, ‘boy’), b3 1 (for V, ‘hits’), b4 1 (for the second Det, ‘a’), b5 1 (for the second N, ‘dog’).
Second step: for 2 ≤ j ≤ n and all i, create b_i j. The non-terminal set b_i j can
be defined as follows:
b_i j = { A | there exists k with 1 ≤ k < j such that B belongs to b_i k, C belongs to b_i+k j-k,
and there is a grammar rule A → BC }.
If the box b1 n includes the initial symbol S, then the input sentence is accepted and the
analysis succeeds.
E.g. for the rule NP → Det N, with Det in b1 1 and N in b2 1, we can confirm
that NP belongs to b1 2;
for the rule NP → Det N, with Det in b4 1 and N in b5 1, we can confirm
that NP belongs to b4 2;
for the rule VP → V NP, with V in b3 1 and NP in b4 2, we can confirm that VP
belongs to b3 3;
for the rule S → NP VP, with NP in b1 2 and VP in b3 3, we can confirm that
S belongs to b1 5. For our input sentence n = 5, so the sentence is accepted.
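A minimal CYK recognizer in the b_i j notation used here (a sketch under our own encoding: i is the 1-based start position, j the span length, and a small lexicon supplies the first step):

RULES = [("S", "NP", "VP"), ("NP", "Det", "N"), ("VP", "V", "NP")]
LEXICON = {"the": {"Det"}, "boy": {"N"}, "hits": {"V"}, "a": {"Det"}, "dog": {"N"}}

def cyk(words):
    n = len(words)
    table = {(i, j): set() for i in range(1, n + 1) for j in range(1, n + 1)}
    for i, w in enumerate(words, start=1):          # first step: POS of each word into b_i 1
        table[(i, 1)] = set(LEXICON[w])
    for j in range(2, n + 1):                       # span length
        for i in range(1, n - j + 2):               # start position
            for k in range(1, j):                   # split: B covers k words, C covers j-k
                for (A, B, C) in RULES:
                    if B in table[(i, k)] and C in table[(i + k, j - k)]:
                        table[(i, j)].add(A)
    return "S" in table[(1, n)]

print(cyk("the boy hits a dog".split()))   # True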
3.4.2.3 A complex example for the CYK algorithm
Suppose the PSG grammar is as follows:
S → NP VP
NP → PrN
NP → DET N
NP → N WH VP
NP → DET N WH VP
VP → V
VP → V NP
VP → V that S
Use the CYK approach to analyze the sentence “The table that lacks a leg hits Jack”.
Transformation of the rewriting rules into Chomsky normal form:
S → NP VP
NP → PrN            (not CNF; it must be transformed to:)
NP → Jack | John | Maria
NP → DET N
NP → N WH VP        (it must be transformed to:)
NP → N CL
CL → WH VP
NP → DET N WH VP    (it must be transformed to:)
NP → NP CL
NP → DET N
CL → WH VP
Here CL is a WH-clause: CL = (that + VP).
VP → V              (not CNF; it must be transformed to:)
VP → cough | walk | …
VP → V NP
VP → V that S       (it must be transformed to:)
VP → V TH
TH → WH S
Here TH is a that-clause: TH = (that + S).

Calculation of the b_i j of the non-terminal symbols:

First, arrange the POS non-terminal symbols and calculate their b_i j:
“The    table   that   lacks   a      leg    hits   Jack”
 DET    N       WH     V       DET    N      V      PrN (NP)
 b1 1   b2 1    b3 1   b4 1    b5 1   b6 1   b7 1   b8 1
Then calculate the b_i j of the phrase non-terminal symbols:

[Figure (Fig. 18): the CYK pyramid for “The table that lacks a leg hits Jack”. NP1 (NP → DET N) is in b1 2, NP2 (NP → DET N) in b5 2, VP1 (VP → V NP) in b7 2, VP2 (VP → V NP) in b4 3, CL (CL → WH VP) in b3 4, NP3 (NP → NP CL) in b1 6, and S (S → NP VP) in b1 8.]
b_i j (NP1): i=1, j=1+1=2
b_i j (NP2): i=5, j=1+1=2
b_i j (VP1): i=7, j=1+1=2
b_i j (VP2): i=4, j=1+2=3
b_i j (CL):  i=3, j=1+3=4
b_i j (NP3): i=1, j=2+4=6
b_i j (S):   i=1, j=2+6=8
The length of this sentence is 8, and the span of S is also 8, so the sentence is recognized.
By the CYK approach we can build the pyramid in Fig. 18; this pyramid is also a tree
graph.
3.4.2.4 Another example
Now we use CYK to parse the sentence “book that flight”.
The rules of our CFG are the ones used above to parse this sentence:
1. S → VP
2. VP → Verb NP
3. NP → Det Nominal
4. Nominal → Noun
Rule 1 is not in CNF because its RHS contains only a single non-terminal (VP). Therefore we
fold rule 1 into rule 2, which is in CNF; rules 1 and 2 become the CNF rule:
S → Verb NP
Rule 4 is not in CNF for the same reason, so we fold rule 4 into rule 3; rules 3 and 4 become the
CNF rule:
NP → Det Noun
Now the CNF rules are:
S → Verb NP
NP → Det Noun
The CYK result for this sentence can be represented in the following table:

[Figure (Fig. 19): Verb, Det and Noun occupy b1 1, b2 1 and b3 1; NP (NP → Det Noun) occupies b2 2; S (S → Verb NP) occupies b1 3.]
b_i j (NP): i=2, j=1+1=2
b_i j (S):  i=1, j=1+2=3

We can again build the pyramid in the CYK table; this pyramid is similar to a tree graph.
We can see that the CYK algorithm is simple and effective.
Speech and Language
Processing

Constituency Grammars
Chapter 11
Review
▪ POS Decoding
▪ What does this mean?
▪ What representation do we use?
▪ What algorithm do we use, and why?

▪ Fish sleep example


▪ Posted on syllabus
▪ Next page
Hidden Markov Models
▪ States Q = q1, q2 … qN
▪ Observations O = o1, o2 … oN
▪ Each observation is a symbol from a vocabulary V = {v1, v2, … vV}
▪ Transition probabilities
▪ Transition probability matrix A = {aij}
  aij = P(qt = j | qt−1 = i), 1 ≤ i, j ≤ N
▪ Observation likelihoods
▪ Output probability matrix B = {bi(k)}
  bi(k) = P(Xt = ok | qt = i)
▪ Special initial probability vector π
  πi = P(q1 = i), 1 ≤ i ≤ N
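(For concreteness, a toy encoding of these parameters in Python for a hypothetical two-state tagger over the “fish sleep” example mentioned above; the numbers are made up purely for illustration.)

states = ["N", "V"]
vocab = ["fish", "sleep"]
A  = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}                 # a_ij = P(q_t = j | q_t-1 = i)
B  = {"N": {"fish": 0.8, "sleep": 0.2}, "V": {"fish": 0.4, "sleep": 0.6}}   # b_i(k)
pi = {"N": 0.6, "V": 0.4}                                                   # initial probabilities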
Today
▪ Formal Grammars
▪ Context-free grammar
▪ Grammars for English
▪ Treebanks
▪ Dependency grammars



Simple View of Linguistic
Analysis

Phonology:   /waddyasai/
Morphology:  /waddyasai/ → what did you say
Syntax:      what did you say → a tree for “say” with subject “you” and object “what”
Semantics:   that tree → P[ λx. say(you, x) ]
Syntax
▪ Grammars (and parsing) are key
components in many applications
▪ Grammar checkers
▪ Dialogue management
▪ Question answering
▪ Information extraction
▪ Machine translation



Syntax
▪ Key notions that we’ll cover
▪ Constituency
▪ Grammatical relations and Dependency
▪ Heads
▪ Key formalism
▪ Context-free grammars
▪ Resources
▪ Treebanks



Types of Linguistic Theories
▪ Prescriptive theories: how people ought
to talk
▪ Descriptive theories: how people actually
talk
▪ Most appropriate for NLP applications
Constituency
▪ The basic idea here is that groups of
words within utterances can be shown to
act as single units.
▪ And in a given language, these units form
coherent classes that can be shown to
behave in similar ways
▪ With respect to their internal structure
▪ And with respect to other units in the
language



Constituency
▪ Internal structure
▪ We can describe an internal structure to the
class (might have to use disjunctions of
somewhat unlike sub-classes to do this).
▪ External behavior
▪ For example, we can say that noun phrases
can come before verbs



Constituency
▪ For example, it makes sense to say
that the following are all noun phrases in
English...

▪ Why? One piece of evidence is that they


can all precede verbs.
▪ This is external evidence



Grammars and Constituency
▪ Of course, there’s nothing easy or obvious about
how we come up with right set of constituents
and the rules that govern how they combine...
▪ That’s why there are so many different theories
of grammar and competing analyses of the
same data.
▪ The approach to grammar, and the analyses,
adopted here are very generic (and don’t
correspond to any modern linguistic theory of
grammar).



Context-Free Grammars
▪ Context-free grammars (CFGs)
▪ Also known as
▪ Phrase structure grammars
▪ Backus-Naur form
▪ Consist of
▪ Rules
▪ Terminals
▪ Non-terminals



Context-Free Grammars
▪ Terminals
▪ We’ll take these to be words (for now)
▪ Non-Terminals
▪ The constituents in a language
▪ Like noun phrase, verb phrase and sentence
▪ Rules
▪ Rules are equations that consist of a single
non-terminal on the left and any number of
terminals and non-terminals on the right.



Some NP Rules
▪ Here are some rules for our noun phrases

▪ Together, these describe two kinds of NPs.


▪ One that consists of a determiner followed by a nominal
▪ And another that says that proper names are NPs.
▪ The third rule illustrates two things
▪ An explicit disjunction
▪ Two kinds of nominals
▪ A recursive definition
▪ Same non-terminal on the right and left-side of the rule
L0 Grammar



Generativity
▪ As with n-grams, you can view these rules
as either analysis or synthesis machines
▪ Generate strings in the language
▪ Reject strings not in the language
▪ Impose structures (trees) on strings in the
language



Derivations
▪ A derivation is a
sequence of rules
applied to a string
that accounts for
that string
▪ Covers all the
elements in the
string
▪ Covers only the
elements in the
string



Definition
▪ More formally, a CFG consists of



Parsing
▪ Parsing is the process of taking a string
and a grammar and returning a
(multiple?) parse tree(s) for that string
▪ There are languages we can capture with CFGs
that we can’t capture with regular expressions
▪ There are properties that we can capture that we
can’t capture with n-grams



Phonetics:
The Sounds of Language
Sound Segments
• Knowing a language includes knowing the sounds of that
language

• Phonetics is the study of speech sounds

• We are able to segment a continuous stream of speech


into distinct parts and recognize the parts in other words

• Everyone who knows a language knows how to segment


sentences into words and words into sounds
Identity of Speech Sounds
• Our linguistic knowledge allows us to ignore
nonlinguistic differences in speech (such as
individual pitch levels, rates of speed, coughs)

• We are capable of making sounds that are not


speech sounds in English but are in other
languages

– The click tsk that signals disapproval in English is a


speech sound in languages such as Xhosa and Zulu
where it is combined with other sounds just like t or k
is in English
Identity of Speech Sounds
• The science of phonetics aims to describe all the
sounds of all the world’s languages

– Acoustic phonetics: focuses on the physical


properties of the sounds of language

– Auditory phonetics: focuses on how listeners


perceive the sounds of language

– Articulatory phonetics: focuses on how the vocal


tract produces the sounds of language
The Phonetic Alphabet
• Spelling, or orthography, does not consistently represent the
sounds of language

• Some problems with ordinary spelling:

– 1. The same sound may be represented by many letters or combination


of letters:
he people key
believe seize machine
Caesar seas
see amoeba

– 2. The same letter may represent a variety of sounds:


father village
badly made
many
The Phonetic Alphabet
– 3. A combination of letters may represent a
single sound
shoot character Thomas
either physics rough
coat deal

– 4. A single letter may represent a combination


of sounds
xerox
The Phonetic Alphabet
– 5. Some letters in a word may not be
pronounced at all
autumn sword resign
pterodactyl lamb corps
psychology write knot

– 6. There may be no letter to represent a


sound that occurs in a word
cute
use
The Phonetic Alphabet
• In 1888 the International Phonetic
Alphabet (IPA) was invented in order to
have a system in which there was a one-to-one
correspondence between each
sound in language and each phonetic
symbol

• Someone who knows the IPA knows how


to pronounce any word in any language
The Phonetic Alphabet
• Dialectal and individual differences affect
pronunciation, but the sounds of English
are:
The Phonetic Alphabet
• Using IPA symbols, we can now represent
the pronunciation of words
unambiguously:
Articulatory Phonetics
• Most speech sounds are produced by pushing air
through the vocal cords

– Glottis = the opening between the vocal cords

– Larynx = ‘voice box’

– Pharynx = tubular part of the throat above the larynx

– Oral cavity = mouth

– Nasal cavity = nose and the passages connecting it to the throat


and sinuses
Consonants: Place of Articulation
• Consonants are sounds produced with some
restriction or closure in the vocal tract

• Consonants are classified based in part on


where in the vocal tract the airflow is being
restricted (the place of articulation)

• The major places of articulation are:


bilabial, labiodental, interdental, alveolar, palatal,
velar, uvular, and glottal
Consonants: Place of Articulation

[Figure: the vocal tract and the places of articulation. © Cengage Learning]
Consonants: Place of Articulation
• Bilabials: [p] [b] [m]
– Produced by bringing both lips together

• Labiodentals: [f] [v]

– Produced by touching the bottom lip to the upper teeth

• Interdentals: [θ] [ð]

– Produced by putting the tip of the tongue between the
teeth
Consonants: Place of Articulation
• Alveolars: [t] [d] [n] [s] [z] [l] [r]
– All of these are produced by raising the tongue to the alveolar
ridge in some way

• [t, d, n]: produced by the tip of the tongue touching the alveolar
ridge (or just in front of it)

• [s, z]: produced with the sides of the front of the tongue raised but
the tip lowered to allow air to escape

• [l]: the tongue tip is raised while the rest of the tongue remains down
so air can escape over the sides of the tongue (thus [l] is a lateral
sound)

• [r]: air escapes through the central part of the mouth; either the tip
of the tongue is curled back behind the alveolar ridge or the top of
the tongue is bunched up behind the alveolar ridge
Consonants: Place of Articulation
• Palatals: [ʃ] [ʒ] [ʧ] [ʤ] [ʝ]
– Produced by raising the front part of the tongue to the palate

• Velars: [k] [g] [ŋ]

– Produced by raising the back of the tongue to the soft palate or velum

• Uvulars: [ʀ] [q] [ɢ]

– Produced by raising the back of the tongue to the uvula

• Glottals: [h] [Ɂ]

– Produced by restricting the airflow through the open glottis ([h]) or by
stopping the air completely at the glottis (a glottal stop: [Ɂ])
Consonants: Manner of Articulation
• The manner of articulation is the way the
airstream is affected as it flows from the lungs
and out of the mouth and nose

• Voiceless sounds are those produced with the


vocal cords apart so the air flows freely through
the glottis

• Voiced sounds are those produced when the


vocal cords are together and vibrate as air
passes through
Consonants: Manner of Articulation
• The voiced/voiceless distinction is important in
English because it helps us distinguish words like:
  rope/robe          fine/vine          seal/zeal
  [rop]/[rob]        [faɪn]/[vaɪn]      [sil]/[zil]

• But some voiceless sounds can be further

distinguished as aspirated or unaspirated
   aspirated          unaspirated
   pool [phul]        spool [spul]
   tale [thel]        stale [stel]
   kale [khel]        scale [skel]
Consonants: Manner of Articulation
• Oral sounds are those produced with the velum raised
to prevent air from escaping out the nose

• Nasal sounds are those produced with the velum


lowered to allow air to escape out the nose

• So far we have three ways of classifying sounds based


on phonetic features: by voicing, by place of
articulation, and by nasalization

– [p] is a voiceless, bilabial, oral sound


– [n] is a voiced, alveolar, nasal sound
Consonants: Manner of Articulation

• Stops: [p] [b] [m] [t] [d] [n] [k] [g] [ŋ] [ʧ] [ʤ] [Ɂ]
– Produced by completely stopping the air flow in
the oral cavity for a fraction of a second

• All other sounds are continuants, meaning that the

airflow is continuous through the oral cavity

• Fricatives: [f] [v] [θ] [ð] [s] [z] [ʃ] [ʒ] [x] [ɣ] [h]
– Produced by severely obstructing the airflow so as
to cause friction
Consonants: Manner of Articulation
• Affricates: [ʧ] [ʤ]
– Produced by a stop closure that is released with a lot of
friction

• Liquids: [l] [r]

– Produced by causing some obstruction of the airstream in
the mouth, but not enough to cause any real friction

• Glides: [j] [w]

– Produced with very little obstruction of the airstream and
are always followed by a vowel
Consonants: Manner of Articulation
• Approximants: [w] [j] [r] [l]
– Sometimes liquids and glides are put together into one category because the
articulators approximate a frictional closeness but do not actually cause
friction

• Trills and flaps: [r]* [ɾ]
– Trills are produced by rapidly vibrating an articulator
– Flaps are produced by a flick of the tongue against the alveolar ridge

• Clicks:
– Produced by moving air in the mouth between various articulators
– The disapproving sound tsk in English is a consonant in Zulu and some other
southern African languages
– The lateral click used to encourage a horse in English is a consonant in Xhosa

*The textbook uses [r] to represent the central liquid as in the word ready rather than as
a trill
Vowels
• Vowels are classified by how high or low the tongue is, if the
tongue is in the front or back of the mouth, and whether or
not the lips are rounded

• High vowels: [i] [ɪ] [u] [ʊ]
• Mid vowels: [e] [ɛ] [o] [ə] [ʌ] [ɔ]
• Low vowels: [æ] [a]

• Front vowels: [i] [ɪ] [e] [ɛ] [æ]
• Central vowels: [ə] [ʌ]
• Back vowels: [u] [ʊ] [o] [ɔ] [a]
Vowels

© Cengage Learning
Vowels
• Round vowels: [u] [ʊ] [o] [ɔ]
– Produced by rounding the lips
– English has only back round vowels, but other languages such as French and
Swedish have front round vowels

• Diphthongs: [aɪ] [aʊ] [ɔɪ]
– A sequence of two vowel sounds (as opposed to the monophthongs we have
looked at so far)

• Nasalization:
– Vowels can also be pronounced with a lowered velum, allowing air to pass
through the nose
– In English, speakers nasalize vowels before a nasal sound, such as in the words
beam, bean, and bingo
– The nasalization is represented by a diacritic, an extra mark placed with the
symbol (a tilde above the vowel)
Vowels
• Tense vowels:
– Are produced with
greater tension in the
tongue
– May occur at the end of
words

• Lax vowels:
– Are produced with less
tongue tension
– May not occur at the end
of words
Vowels
Major Phonetic Classes
• Noncontinuants: the airstream is totally obstructed in
the oral cavity
– Stops and affricates

• Continuants: the airstream flows continuously out of the


mouth
– All other consonants and vowels

• Obstruents: the airstream has partial or full obstruction


– Non-nasal stops, fricatives, and affricates

• Sonorants: air resonates in the nasal or oral cavities


– Vowels, nasal stops, liquids, and glides
Major Phonetic Classes: Consonantal
• Consonantal: there is some restriction of the airflow
during articulation
– All consonants except glides

• Consonantal sounds can be further subdivided:

– Labials: [p] [b] [m] [f] [v] [w] [ʍ]
• Articulated with the lips

– Coronals: [θ] [ð] [t] [d] [n] [s] [z] [ʃ] [ʒ] [ʧ] [ʤ] [l] [r]
• Articulated by raising the tongue blade
Major Phonetic Classes
• Consonantal categories cont.:
– Anteriors: [p] [b] [m] [f] [v] [θ] [ð] [t] [d] [n] [s] [z]
• Produced in the front part of the mouth (from the alveolar area
forward)

– Sibilants: [s] [z] [ʃ] [ʒ] [ʧ] [ʤ]
• Produced with a lot of friction that causes a hissing sound, which is
a mixture of high-frequency sounds

• Syllabic Sounds: sounds that can function as the core
of a syllable
– Vowels, liquids, and nasals
Prosodic Features
• Prosodic, or suprasegmental, features of sounds,
such as length, stress and pitch, are features above
the segmental values such as place and manner of
articulation

• Length: in some languages, such as Japanese, the
length of a consonant or a vowel can change the
meaning of a word:

– biru [biru] “building”    biiru [biːru] “beer”
– saki [saki] “ahead”    sakki [sakːi] “before”
Prosodic Features
• Stress: stressed syllables are louder, slightly
higher in pitch, and somewhat longer than
unstressed syllables

– The noun digest has the stress on the first syllable

– The verb digest has the stress on the second syllable

– English is a stress-timed language, meaning that at


least one syllable is stressed in an English word
• French functions differently, so when English speakers learn
French they put stress on certain syllables which contributes
to their foreign accent
Tone and Intonation
• Tone languages are languages that use pitch
to contrast the meaning of words

• For example, in Thai, the string of sounds [naː] can
be said with 5 different pitches and can thus have
5 different meanings:
Tone and Intonation
• Intonation languages (like English) have
varied pitch contour across an utterance,
but pitch is not used to distinguish words

– However, intonation may affect the meaning of a


whole sentence:

• John is here said with falling intonation is a statement


• John is here said with rising intonation is a question
Phonetics of Signed Languages
• Signs can be broken down into segmental
features similar to the phonetic features of
speech sounds (such as place and manner of
articulation)
– And just like spoken languages, signed languages of
the world vary in these features

– Signs are formed by three major features:


• 1. The configuration of the hand (handshape)
• 2. The movement of the hand and arm towards or away from
the body
• 3. The location of the hand in signing space
Phonetics of Signed Languages
• The configuration of the hand (handshape)
• The movement of the hand and arm
• The location of the hand in signing space
Speech and Language
Processing

Chapter 9 of SLP
Automatic Speech Recognition
Outline for ASR

▪ ASR Architecture
▪ The Noisy Channel Model
▪ Five easy pieces of an ASR system
1) Feature Extraction
2) Acoustic Model
3) Language Model
4) Lexicon/Pronunciation Model
(Introduction to HMMs again)
5) Decoder
• Evaluation



Speech Recognition
▪ Applications of Speech Recognition (ASR)
▪ Dictation
▪ Telephone-based Information (directions, air
travel, banking, etc)
▪ Hands-free (in car)
▪ Speaker Identification
▪ Language Identification
▪ Second language ('L2') (accent reduction)
▪ Audio archive searching



LVCSR
▪ Large Vocabulary Continuous Speech
Recognition
▪ ~20,000-64,000 words
▪ Speaker independent (vs. speaker-
dependent)
▪ Continuous speech (vs isolated-word)



Current error rates

Ballpark numbers; exact numbers depend very much on the specific corpus

Task Vocabulary Error Rate%


Digits 11 0.5
WSJ read speech 5K 3
WSJ read speech 20K 3
Broadcast news 64,000+ 10
Conversational Telephone 64,000+ 20



HSR versus ASR

Task Vocab ASR Hum SR


Continuous digits 11 .5 .009
WSJ 1995 clean 5K 3 0.9
WSJ 1995 w/noise 5K 9 1.1
SWBD 2004 65K 20 4

▪ Conclusions:
▪ Machines about 5 times worse than humans
▪ Gap increases with noisy speech
▪ These numbers are rough; take them with a grain of salt
Why is conversational speech
harder?
▪ A piece of an utterance without context

▪ The same utterance with more context



LVCSR Design Intuition
• Build a statistical model of the speech-to-
words process
• Collect lots and lots of speech, and
transcribe all the words.
• Train the model on the labeled speech
• Paradigm: Supervised Machine Learning +
Search



Speech Recognition
Architecture



The Noisy Channel Model

▪ Search through space of all possible


sentences.
▪ Pick the one that is most probable given the waveform.
The Noisy Channel Model (II)
▪ What is the most likely sentence out of all
sentences in the language L given some
acoustic input O?
▪ Treat acoustic input O as sequence of
individual observations
▪ O = o1,o2,o3,…,ot
▪ Define a sentence as a sequence of
words:
▪ W = w1,w2,w3,…,wn



Noisy Channel Model (III)

▪ Probabilistic implication: Pick the highest-probability sentence W:

      Ŵ = argmax_{W ∈ L} P(W | O)

▪ We can use Bayes rule to rewrite this:

      Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)

▪ Since the denominator is the same for each candidate
sentence W, we can ignore it for the argmax:

      Ŵ = argmax_{W ∈ L} P(O | W) P(W)
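A minimal sketch of this decision rule, assuming we already had toy tables for the acoustic likelihood P(O|W) and the language-model prior P(W); the candidate sentences and all numbers below are invented for illustration:

import math

# Invented toy scores for a handful of candidate sentences.
acoustic = {"recognize speech": 1e-5, "wreck a nice beach": 3e-5}   # P(O | W)
language = {"recognize speech": 1e-6, "wreck a nice beach": 1e-9}   # P(W)

def best_sentence(candidates):
    """Pick argmax_W P(O|W) * P(W), working in log space to avoid underflow."""
    return max(candidates,
               key=lambda w: math.log(acoustic[w]) + math.log(language[w]))

print(best_sentence(acoustic.keys()))   # -> "recognize speech"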
Noisy channel model

      Ŵ = argmax_{W ∈ L} P(O | W) P(W)
                          likelihood   prior



The noisy channel model
▪ Ignoring the denominator leaves us with
two factors: P(Source) and
P(Signal|Source)



Speech Architecture meets
Noisy Channel



Architecture: Five easy pieces
(only 3-4 for today)
▪ HMMs, Lexicons, and Pronunciation
▪ Feature extraction
▪ Acoustic Modeling
▪ Decoding
▪ Language Modeling (seen this already)



Lexicon
▪ A list of words
▪ Each one with a pronunciation in terms of
phones
▪ We get these from an on-line pronunciation
dictionary
▪ CMU dictionary: 127K words
▪ http://www.speech.cs.cmu.edu/cgi-
bin/cmudict
▪ We’ll represent the lexicon as an HMM



HMMs for speech: the word
“six”



Phones are not homogeneous!

(Spectrogram figure: the phones [ay] and [k] between roughly 0.48 s and 0.94 s; frequency axis 0–5000 Hz, time in seconds)



Each phone has 3 subphones



Resulting HMM word model
for “six” with their subphones



HMM for the digit
recognition task



Detecting Phones
▪ Two stages
▪ Feature extraction
▪ Basically a slice of a spectrogram
▪ Building a phone classifier (using GMM
classifier)



MFCC: Mel-Frequency Cepstral
Coefficients



MFCC process: windowing



MFCC process: windowing



Hamming window on the signal,
and then computing the spectrum



Final Feature Vector

▪ 39 Features per 10 ms frame:


▪ 12 MFCC features
▪ 12 Delta MFCC features
▪ 12 Delta-Delta MFCC features
▪ 1 (log) frame energy
▪ 1 Delta (log) frame energy
▪ 1 Delta-Delta (log) frame energy
▪ So each frame represented by a 39D
vector
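One way to produce something close to this 39-dimensional representation is sketched below with the librosa library. The 25 ms window / 10 ms hop, the use of the first cepstral coefficient in place of a separate log-energy term, and the filename "utterance.wav" are all assumptions for illustration; real ASR front ends differ in details:

import numpy as np
import librosa

# "utterance.wav" is a placeholder filename.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 cepstral coefficients per 10 ms frame (25 ms analysis window).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

delta = librosa.feature.delta(mfcc)            # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order differences

# Stack into a 39-dimensional feature vector per frame
# (here c0 plays the role of the log-energy term on the slide).
features = np.vstack([mfcc, delta, delta2])
print(features.shape)   # (39, number_of_frames)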
Acoustic Modeling
(= Phone detection)
▪ Given a 39-dimensional vector
corresponding to the observation of one
frame oi
▪ And given a phone q we want to detect
▪ Compute p(oi|q)
▪ Most popular method:
▪ GMM (Gaussian mixture models)
▪ Other methods
▪ Neural nets, CRFs, SVM, etc



Gaussian Mixture Models
▪ Also called “fully-continuous HMMs”
▪ P(o|q) computed by a Gaussian:

      p(o | q) = (1 / (σ √(2π))) · exp( −(o − μ)² / (2σ²) )


Gaussians for Acoustic
Modeling
A Gaussian is parameterized by a mean and
a variance:
(Figure: Gaussian curves with different means; P(o|q) is highest for observations at the mean and low for observations far from the mean)


Training Gaussians

▪ A (single) Gaussian is characterized by a mean and a variance


▪ Imagine that we had some training data in which each phone was
labeled
▪ And imagine that we were just computing 1 single spectral value (real
valued number) as our acoustic observation
▪ We could just compute the mean and variance from the data:

      μ_i = (1/T) Σ_{t=1}^{T} o_t          s.t. o_t is phone i

      σ_i² = (1/T) Σ_{t=1}^{T} (o_t − μ_i)²    s.t. o_t is phone i

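In code, these maximum-likelihood estimates are just a per-phone mean and variance. A minimal numpy sketch, with a tiny invented set of labeled single-valued frames:

import numpy as np

# Invented training data: one spectral value per frame, with its phone label.
frames = np.array([2.1, 2.4, 1.9, 7.0, 6.7, 7.3])
labels = np.array(["iy", "iy", "iy", "aa", "aa", "aa"])

def gaussian_params(phone):
    """Mean and (1/T) variance of the frames labeled with this phone."""
    obs = frames[labels == phone]
    mu = obs.mean()
    var = ((obs - mu) ** 2).mean()
    return mu, var

print(gaussian_params("iy"))
print(gaussian_params("aa"))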
But we need 39 gaussians,
not 1!
▪ The observation o is really a vector of
length 39
▪ So need a vector of Gaussians:

      p(o | q) = ( 1 / ( (2π)^(D/2) · √( Π_{d=1}^{D} σ²[d] ) ) )
                 · exp( − ½ Σ_{d=1}^{D} (o[d] − μ[d])² / σ²[d] )



Actually, mixture of gaussians

Phone A

Phone B

▪ Each phone is modeled by a sum of


different gaussians
▪ Hence able to model complex facts about
the data
Gaussians acoustic modeling
▪ Summary: each phone is represented by a
GMM parameterized by
▪ M mixture weights
▪ M mean vectors
▪ M covariance matrices
▪ Usually assume covariance matrix is diagonal
▪ I.e. just keep separate variance for each cepstral
feature

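To make the GMM acoustic score concrete, here is a hedged sketch of log p(o|q) for a diagonal-covariance mixture; the mixture parameters below are random placeholders, not trained values:

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
D, M = 39, 8                      # feature dimension, number of mixture components

# Placeholder GMM parameters for one phone state q.
weights = np.full(M, 1.0 / M)     # mixture weights (sum to 1)
means = rng.normal(size=(M, D))   # M mean vectors
variances = np.ones((M, D))       # diagonal covariance: one variance per dimension

def gmm_log_likelihood(o):
    """log p(o|q) = logsumexp_m [ log w_m + log N(o; mu_m, diag(var_m)) ]."""
    diff = o - means                                   # (M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=1)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    return logsumexp(np.log(weights) + log_gauss)

frame = rng.normal(size=D)        # one 39-dimensional MFCC frame
print(gmm_log_likelihood(frame))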


Where we are
▪ Given: A wave file
▪ Goal: output a string of words
▪ What we know: the acoustic model
▪ How to turn the wavefile into a sequence of acoustic feature
vectors, one every 10 ms
▪ If we had a complete phonetic labeling of the training set, we
know how to train a gaussian “phone detector” for each phone.
▪ We also know how to represent each word as a sequence of
phones
▪ What we knew from Chapter 4: the language model
▪ Next time:
▪ Seeing all this back in the context of HMMs
▪ Search: how to combine the language model and the acoustic
model to produce a sequence of words
Decoding
▪ In principle:

▪ In practice:
Why is ASR decoding hard?
HMMs for speech
HMM for digit recognition task
The Evaluation (forward)
problem for speech
▪ The observation sequence O is a series of
MFCC vectors
▪ The hidden states W are the phones and
words
▪ For a given phone/word string W, our job
is to evaluate P(O|W)
▪ Intuition: how likely is the input to have
been generated by just that word string W
Evaluation for speech: Summing
over all different paths!
▪ f ay ay ay ay v v v v
▪ f f ay ay ay ay v v v
▪ f f f f ay ay ay ay v
▪ f f ay ay ay ay ay ay v
▪ f f ay ay ay ay ay ay ay ay v
▪ f f ay v v v v v v v
The forward lattice for “five”
The forward trellis for “five”
Viterbi trellis for “five”
Viterbi trellis for “five”
Search space with bigrams
Viterbi trellis
Viterbi backtrace
Training
Evaluation
▪ How to evaluate the word string output by
a speech recognizer?
Word Error Rate

▪ Word Error Rate =


100 × (Insertions + Substitutions + Deletions)
----------------------------------------------
Total Words in Correct Transcript

Alignment example:
REF: portable **** PHONE UPSTAIRS last night so
HYP: portable FORM OF STORES last night so
Eval I S S
WER = 100 (1+2+0)/6 = 50%
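A self-contained sketch of WER via the standard word-level edit distance (substitutions, insertions, and deletions all cost 1); this mirrors what sclite computes, though sclite's alignment details differ:

def wer(ref, hyp):
    """Word error rate: 100 * (S + D + I) / len(ref), via Levenshtein distance on words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and the first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("portable phone upstairs last night so",
          "portable form of stores last night so"))   # 50.0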
NIST sctk-1.3 scoring software:
Computing WER with sclite
▪ http://www.nist.gov/speech/tools/
▪ Sclite aligns a hypothesized text (HYP) (from the recognizer) with a
correct or reference text (REF) (human transcribed)
id: (2347-b-013)
Scores: (#C #S #D #I) 9 3 1 2
REF: was an engineer SO I i was always with **** **** MEN UM and they
HYP: was an engineer ** AND i was always with THEM THEY ALL THAT and they
Eval: D S I I S S
Sclite output for error
analysis
CONFUSION PAIRS Total (972)
With >= 1 occurrences (972)

1: 6 -> (%hesitation) ==> on


2: 6 -> the ==> that
3: 5 -> but ==> that
4: 4 -> a ==> the
5: 4 -> four ==> for
6: 4 -> in ==> and
7: 4 -> there ==> that
8: 3 -> (%hesitation) ==> and
9: 3 -> (%hesitation) ==> the
10: 3 -> (a-) ==> i
11: 3 -> and ==> i
12: 3 -> and ==> in
13: 3 -> are ==> there
14: 3 -> as ==> is
15: 3 -> have ==> that
16: 3 -> is ==> this
Sclite output for error
analysis
17: 3 -> it ==> that
18: 3 -> mouse ==> most
19: 3 -> was ==> is
20: 3 -> was ==> this
21: 3 -> you ==> we
22: 2 -> (%hesitation) ==> it
23: 2 -> (%hesitation) ==> that
24: 2 -> (%hesitation) ==> to
25: 2 -> (%hesitation) ==> yeah
26: 2 -> a ==> all
27: 2 -> a ==> know
28: 2 -> a ==> you
29: 2 -> along ==> well
30: 2 -> and ==> it
31: 2 -> and ==> we
32: 2 -> and ==> you
33: 2 -> are ==> i
34: 2 -> are ==> were
Better metrics than WER?
▪ WER has been useful
▪ But should we be more concerned with meaning
(“semantic error rate”)?
▪ Good idea, but hard to agree on
▪ Has been applied in dialogue systems, where desired
semantic output is more clear
Summary: ASR Architecture
▪ Five easy pieces: ASR Noisy Channel architecture
1) Feature Extraction:
39 “MFCC” features
2) Acoustic Model:
Gaussians for computing p(o|q)
3) Lexicon/Pronunciation Model
• HMM: what phones can follow each other
4) Language Model
• N-grams for computing p(wi|wi-1)
5) Decoder
• Viterbi algorithm: dynamic programming for combining all these to get
word sequence from speech!
Accents: An experiment
▪ A word by itself

▪ The word in context



Summary
▪ ASR Architecture
▪ The Noisy Channel Model
▪ Phonetics Background
▪ Five easy pieces of an ASR system
1) Lexicon/Pronunciation Model
(Introduction to HMMs again)
2) Feature Extraction
3) Acoustic Model
4) Language Model
5) Decoder
• Evaluation



Speech synthesis

1
Speech synthesis
• What is the task?
– Generating natural sounding speech on the fly,
usually from text
• What are the main difficulties?
– What to say and how to say it
• How is it approached?
– Two main approaches, both with pros and cons
• How good is it?
– Excellent, almost unnoticeable at its best
• How much better could it be?
– marginally

2
Input type
• Concept-to-speech vs text-to-speech
• In CTS, content of message is determined
from internal representation, not by
reading out text
– E.g. database query system
– No problem of text interpretation

3
Text-to-speech
• What to say: text-to-phoneme conversion
is not straightforward
– Dr Smith lives on Marine Dr in Chicago IL. He got his
PhD from MIT. He earns $70,000 p.a.
– Have you read that book? No, I’m still reading it. I live
in Reading.
• How to say it: not just choice of
phonemes, but allophones, coarticulation
effects, as well as prosodic features (pitch,
loudness, length)
4
Architecture of TTS systems
(Diagram: text input feeds the text-to-phoneme module — normalization (with an abbreviation lexicon), grapheme-to-phoneme conversion (with an exceptions lexicon and orthographic rules), and prosodic modelling (with grammar rules and a prosodic model) — producing a phoneme string with prosodic annotation; the phoneme-to-speech module then performs acoustic synthesis by various methods to produce synthetic speech output.)
5
Text normalization
• Any text that has a special pronunciation
should be stored in a lexicon
– Abbreviations (Mr, Dr, Rd, St, Middx)
– Acronyms (UN but UNESCO)
– Special symbols (&, %)
– Particular conventions (£5, $5 million, 12°C)
– Numbers are especially difficult
• 1995 2001 1,995 236 3017 233 4488

6
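A toy sketch of lexicon-based normalization follows; the abbreviation lexicon and the digit-by-digit number expansion are tiny invented examples, nothing like the coverage (or the disambiguation of cases like Dr = Doctor vs. Drive) that a real TTS front end needs:

import re

# Tiny invented abbreviation lexicon.
abbreviations = {"Dr": "Doctor", "St": "Street", "Mr": "Mister", "&": "and"}

units = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

def expand_number(match):
    """Naive digit-by-digit reading, e.g. '2001' -> 'two zero zero one'."""
    return " ".join(units[int(d)] for d in match.group())

def normalize(text):
    words = [abbreviations.get(w, w) for w in text.split()]
    return re.sub(r"\d+", expand_number, " ".join(words))

print(normalize("Dr Smith lives at 221 Baker St"))
# -> "Doctor Smith lives at two two one Baker Street"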
Grapheme-to-phoneme conversion
• English spelling is complex but largely regular,
other languages more (or less) so
• Gross exceptions must be in lexicon
• Lexicon or rules?
– If look-up is quick, may as well store them
– But you need rules anyway for unknown words
• MANY words have multiple pronunciations
– Free variation (eg controversy, either)
– Conditioned variation (eg record, import, weak forms)
– Genuine homographs

7
Grapheme-to-phoneme conversion
• Much easier for some languages
(Spanish, Italian, Welsh, Czech, Korean)
• Much harder for others (English, French)
• Especially if writing system is only partially
alphabetic (Arabic, Urdu)
• Or not alphabetic at all (Chinese,
Japanese)

8
Syntactic (etc.) analysis
• Homograph disambiguation requires
syntactic analysis
– He makes a record of everything they record.
– I read a lot. What have you read recently?
• Analysis also essential to determine
appropriate prosodic features

9
Architecture of TTS systems
(Diagram repeated: the TTS architecture shown earlier — the text-to-phoneme module followed by the phoneme-to-speech module.)
10
Prosody modelling
• Pitch, length, loudness
• Intonation (pitch)
– essential to avoid monotonous robot-like voice
– linked to basic syntax (eg statement vs question), but
also to thematization (stress)
– Pitch range is a sensitive issue
• Rhythm (length)
– Has to do with pace (natural tendency to slow down
at end of utterance)
– Also need to pause at appropriate place
– Linked (with pitch and loudness) to stress
11
Acoustic synthesis
• Alternative methods:
– Articulatory synthesis
– Formant synthesis
– Concatenative synthesis
– Unit selection synthesis

12
Articulatory synthesis
• Simulation of physical processes of human
articulation
• Wolfgang von Kempelen (1734-1804) and
others used bellows, reeds and tubes to
construct mechanical speaking machines
• Modern versions simulate electronically
the effect of articulator positions, vocal
tract shape, etc.
• Too much like hard work
13
Formant synthesis
• Reproduce the relevant characteristics of the
acoustic signal
• In particular, amplitude and frequency of
formants
• But also other resonances and noise, eg for
nasals, laterals, fricatives etc.
• Values of acoustic parameters are derived by
rule from phonetic transcription
• Result is intelligible, but too “pure” and sounds
synthetic

14
Formant synthesis
• Demo:
– In control panel select
“Speech” icon
– Type in your text and
Preview voice
– You may have a choice
of voices

15
Concatenative synthesis
• Concatenate segments of pre-recorded
natural human speech
• Requires database of previously recorded
human speech covering all the possible
segments to be synthesised
• Segment might be phoneme, syllable,
word, phrase, or any combination
• Or, something else more clever ...
16
Diphone synthesis
• Most important for natural
sounding speech is to get the
transitions right (allophonic
variation, coarticulation
effects)
• These are found at the
boundary between phoneme
segments
• “diphones” are fragments of
speech signal cutting across
phoneme boundaries
• If a language has P phones,
then the number of diphones is
~P² (some combinations are
impossible) – e.g. 800 for
Spanish, 1200 for French,
2500 for German
(Figure: the diphones in “my number”)
17
Diphone synthesis
• Most systems use diphones because they are
– Manageable in number
– Can be automatically extracted from recordings of
human speech
– Capture most inter-allophonic variants
• But they do not capture all coarticulatory effects,
so some systems include triphones, as well as
fixed phrases and other larger units (= USS)

18
Concatenative synthesis
• Input is phonemic representation +
prosodic features
• Diphone segments can be digitally
manipulated for length, pitch and loudness
• Segment boundaries need to be smoothed
to avoid distortion

19
Unit selection synthesis (USS)
• Same idea as concatenative synthesis, but
database contains bigger variety of “units”
• Multiple examples of phonemes (under
different prosodic conditions) are recorded
• Selection of appropriate unit therefore
becomes more complex, as there are in
the database competing candidates for
selection

20
Speech synthesis demo

21
Speech synthesis demo

22
Chapter 8. Word Classes and
Part-of-Speech Tagging
From: Chapter 8 of An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition, by Daniel Jurafsky and James H. Martin
Background

• Part of speech:
– Noun, verb, pronoun, preposition, adverb, conjunction, particle, and article
• Recent lists of POS (also know as word classes, morphological class, or
lexical tags) have much larger numbers of word classes.
– 45 for Penn Treebank
– 87 for the Brown corpus, and
– 146 for the C7 tagset
• The significance of the POS for language processing is that it gives a
significant amount of information about the word and its neighbors.
• POS can be used in stemming for IR, since
– Knowing a word’s POS can help tell us which morphological affixes it can take.
– They can help an IR application by helping select out nouns or other important
words from a document.



8.1 English Word Classes

• Give a more complete definition of the classes of POS.


– Traditionally, the definition of POS has been based on morphological and
syntactic function.
– While it has tendencies toward semantic coherence (e.g., nouns describe
“people, places, or things” and adjectives describe properties), this is not
necessarily the case.
• Two broad subcategories of POS:
1. Closed class
2. Open class



8.1 English Word Classes

1. Closed class
– Having relatively fixed membership, e.g., prepositions
– Function words:
– Grammatical words like of, and, or you, which tend to be very short, occur
frequently, and play an important role in grammar.
2. Open class
• Four major open classes occurring in the languages of the world: nouns,
verbs, adjectives, and adverbs.
– Many languages have no adjectives, e.g., the native American language Lakhota,
and Chinese



8.1 English Word Classes
Open Class: Noun
• Noun
– The name given to the lexical class in which the words for most people, places, or
things occur
– Since lexical classes like noun are defined functionally (morphological and
syntactically) rather than semantically,
• some words for people, places, or things may not be nouns, and conversely
• some nouns may not be words for people, places, or things.
– Thus, nouns include
• Concrete terms, like ship, and chair,
• Abstractions like bandwidth and relationship, and
• Verb-like terms like pacing
– Noun in English
• Things to occur with determiners (a goat, its bandwidth, Plato’s Republic),
• To take possessives (IBM’s annual revenue), and
• To occur in the plural form (goats, abaci)



8.1 English Word Classes
Open Class: Noun
• Nouns are traditionally grouped into proper nouns and common
nouns.
– Proper nouns:
• Regina, Colorado, and IBM
• Not preceded by articles, e.g., the book is upstairs, but Regina is upstairs.
– Common nouns
• Count nouns:
– Allow grammatical enumeration, i.e., both singular and plural (goat/goats), and can
be counted (one goat/ two goats)
• Mass nouns:
– Something is conceptualized as a homogeneous group, snow, salt, and communism.
– Appear without articles where singular nouns cannot (Snow is white but not *Goat
is white)



8.1 English Word Classes
Open Class: Verb
• Verbs
– Most of the words referring to actions and processes including main verbs
like draw, provide, differ, and go.
– A number of morphological forms: non-3rd-person-sg (eat), 3rd-person-
sg(eats), progressive (eating), past participle (eaten)
– A subclass: auxiliaries (discussed in closed class)



8.1 English Word Classes
Open Class: Adjectives
• Adjectives
– Terms describing properties or qualities
– Most languages have adjectives for the concepts of color (white, black),
age (old, young), and value (good, bad), but
– There are languages without adjectives, e.g., Chinese.



8.1 English Word Classes
Open Class: Adverbs
• Adverbs
– Words viewed as modifying something (often verbs)
• Directional (or locative) adverbs: specify the direction or location of some
action, home, here, downhill
• Degree adverbs: specify the extent of some action, process, or property,
extremely, very, somewhat
• Manner adverb: describe the manner of some action or process, slowly,
slinkily, delicately
• Temporal adverbs: describe the time that some action or event took place,
yesterday, Monday



8.1 English Word Classes
Closed Classes

• Some important closed classes in English


– Prepositions: on, under, over, near, by, at, from, to, with
– Determiners: a, an, the
– Pronouns: she, who, I, others
– Conjunctions: and, but, or, as, if, when
– Auxiliary verbs: can, may, should, are
– Particles: up, down, on, off, in, out, at, by
– Numerals: one, two, three, first, second, third



8.1 English Word Classes
Closed Classes: Prepositions
• Prepositions occur before nouns, semantically they are relational
– Indicating spatial or temporal relations, whether literal (on it, before then, by the
house) or metaphorical (on time, with gusto, beside herself)
– Other relations as well

Preposition (and particles) of English from CELEX


8.1 English Word Classes
Closed Classes: Particles
• A particle is a word that resembles a preposition or an adverb, and that often
combines with a verb to form a larger unit called a phrasal verb
So I went on for some days cutting and hewing timber …
Moral reform is the effort to throw off sleep …

English single-word particles from Quirk, et al (1985)



8.1 English Word Classes
Closed Classes: Articles and Conjunctions

• English has three articles: a, an, and the


– Articles begin a noun phrase.
– Articles are frequent in English.
• Conjunctions are used to join two phrases, clauses, or sentences.
– and, or, but
– Subordinating conjunctions are used when one of the elements is of some
sort of embedded status: I thought that you might like some
milk (here that is a complementizer)



Coordinating and subordinating conjunctions of English
From the CELEX on-line dictionary.



8.1 English Word Classes
Closed Classes: Pronouns

• Pronouns act as a kind of shorthand for referring to some noun phrase


or entity or event.
– Personal pronouns: persons or entities (you, she, I, it, me, etc)
– Possessive pronouns: forms of personal pronouns indicating actual
possession or just an abstract relation between the person and some
objects.
– Wh-pronouns: used in certain question forms, or may act as
complementizer.



Pronouns of English from the
CELEX on-line dictionary.



8.1 English Word Classes
Closed Classes: Auxiliary Verbs
• Auxiliary verbs: mark certain semantic feature of a main verb, including
– whether an action takes place in the present, past or future (tense),
– whether it is completed (aspect),
– whether it is negated (polarity), and
– whether an action is necessary, possible, suggested, desired, etc (mood).
– Including copula verb be, the two verbs do and have along with their inflection
forms, as well as a class of modal verbs.

English modal verbs from


the CELEX on-line dictionary.



8.1 English Word Classes
Closed Classes: Others

• Interjections: oh, ah, hey, man, alas


• Negatives: no, not
• Politeness markers: please, thank you
• Greetings: hello, goodbye
• Existential there: there are two on the table



8.2 Tagsets for English

• There are a small number of popular tagsets for English, many of


which evolved from the 87-tag tagset used for the Brown corpus.
– Three commonly used
• The small 45-tag Penn Treebank tagset
• The medium-sized 61-tag C5 tagset used by the Lancaster UCREL project’s
CLAWS tagger to tag the British National Corpus, and
• The larger 146-tag C7 tagset



Penn Treebank POS tags
8.2 Tagsets for English

The/DT grand/JJ jury/NN commented/VBD on/IN a /DT number/NN


of/IN other/JJ topics/NNS ./.

• Certain syntactic distinctions were not marked in the Penn Treebank


tagset because
– Treebank sentences were parsed, not merely tagged, and
– So some syntactic information is represented in the phrase structure.
• For example, prepositions and subordinating conjunctions were
combined into the single tag IN, since the tree-structure of the sentence
disambiguated them.



8.3 Part-of-Speech Tagging

• POS tagging (tagging)


– The process of assigning a POS or other lexical marker to each word in a
corpus.
– Also applied to punctuation marks
– Thus, tagging for NL is the same process as tokenization for computer
language, although tags for NL are much more ambiguous.
– Taggers play an increasingly important role in speech recognition, NL
parsing and IR



8.3 Part-of-Speech Tagging
• The input to a tagging algorithm is a string of words and a specified
tagset of the kind described previously.
VB DT NN .
Book that flight .

VBZ DT NN VB NN ?
Does that flight serve dinner ?

• Automatically assigning a tag to a word is not trivial


– For example, book is ambiguous: it can be a verb or a noun
– Similarly, that can be a determiner, or a complementizer
• The problem of POS-tagging is to resolve the ambiguities, choosing
the proper tag for the context.



8.3 Part-of-Speech Tagging
• How hard is the tagging problem?

The number of word types


in Brown corpus by degree
of ambiguity.

• Many of the 40% ambiguous tokens are easy to disambiguate, because


– The various tags associated with a word are not equally likely.



8.3 Part-of-Speech Tagging

• Many tagging algorithms fall into two classes:


– Rule-based taggers
• Involve a large database of hand-written disambiguation rules specifying, for
example, that an ambiguous word is a noun rather than a verb if it follows a
determiner.
– Stochastic taggers
• Resolve tagging ambiguities by using a training corpus to count the
probability of a given word having a given tag in a given context.
• The Brill tagger, called the transformation-based tagger, shares
features of both tagging architecture.



8.4 Rule-Based Part-of-Speech Tagging

• The earliest algorithms for automatically assigning POS were based on


a two-stage architecture
– First, use a dictionary to assign each word a list of potential POS.
– Second, use large lists of hand-written disambiguation rules to winnow
down this list to a single POS for each word
• The ENGTWOL tagger (1995) is based on the same two stage
architecture, with much more sophisticated lexicon and disambiguation
rules than before.
– Lexicon:
• 56000 entries
• A word with multiple POS is counted as separate entries



Sample lexical entries from the ENGTWOL lexicon.



8.4 Rule-Based Part-of-Speech Tagging
• In the first stage of tagger,
– each word is run through the two-level lexicon transducer and
– the entries for all possible POS are returned.
• A set of about 1,100 constraints are then applied to the input sentences
to rule out incorrect POS.
Pavlov PAVLOV N NOM SG PROPER
had HAVE V PAST VFIN SVO
HAVE PCP2 SVO
shown SHOW PCP2 SVOO SVO SV
that ADV
PRON DEM SG
DET CENTRAL DEM SG
CS
salivation N NOM SG

8.4 Rule-Based Part-of-Speech Tagging

• A simplified version of the constraint:

ADVERBIAL-THAT RULE
Given input: “that”
if
(+1 A/ADV/QUANT); /* if next word is adj, adverb, or quantifier */
(+2 SENT-LIM); /* and following which is a sentence boundary, */
(NOT -1 SVOC/A); /* and the previous word is not a verb like */
/* ‘consider’ which allows adj as object complements */
then eliminate non-ADV tags
else eliminate ADV tags



8.5 HMM Part-of-Speech Tagging
• We are given a sentence, for example, like
Secretariat is expected to race tomorrow.
– What is the best sequence of tags which corresponds to this sequence of words?
• We want: out of all sequences of n tags t_1^n, the single tag sequence such that
P(t_1^n | w_1^n) is highest:

      t̂_1^n = argmax_{t_1^n} P(t_1^n | w_1^n)

• The hat (ˆ) means “our estimate of the correct tag sequence”.

• It is not clear how to make the equation operational
– that is, for a given tag sequence t_1^n and word sequence w_1^n, we don’t know how to
directly compute P(t_1^n | w_1^n).



8.5 HMM Part-of-Speech Tagging
• Bayes’ rule:

      P(x | y) = P(y | x) P(x) / P(y)

• Applying it to the tagging problem:

      t̂_1^n = argmax_{t_1^n} P(w_1^n | t_1^n) P(t_1^n) / P(w_1^n)

• But P(w_1^n) doesn’t change for each tag sequence, so:

      t̂_1^n = argmax_{t_1^n} P(w_1^n | t_1^n) P(t_1^n)
                              (likelihood)    (prior)


8.5 HMM Part-of-Speech Tagging
Assumption 1: the probability of a word appearing is dependent only on its
own part-of-speech tag:

      P(w_1^n | t_1^n) ≈ Π_{i=1}^{n} P(w_i | t_i)

Assumption 2: the probability of a tag appearing is dependent only on
the previous tag:

      P(t_1^n) ≈ Π_{i=1}^{n} P(t_i | t_{i-1})


8.5 HMM Part-of-Speech Tagging

      t̂_1^n = argmax_{t_1^n} P(t_1^n | w_1^n) ≈ argmax_{t_1^n} Π_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})

• This equation contains two kinds of probabilities,
– tag transition probabilities and
– word likelihoods.

• The tag transition probabilities:

      P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})

      P(NN | DT) = C(DT, NN) / C(DT) = 56,509 / 116,454 = .49

• The word likelihood probabilities:

      P(w_i | t_i) = C(t_i, w_i) / C(t_i)

      P(is | VBZ) = C(VBZ, is) / C(VBZ) = 10,073 / 21,627 = .47
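As a hedged illustration of how both kinds of probabilities can be estimated from a tagged corpus by relative frequency, here is a minimal Python sketch; the tiny example corpus is invented:

from collections import defaultdict

# Tiny invented tagged corpus: list of (word, tag) sentences.
corpus = [
    [("the", "DT"), ("flight", "NN"), ("leaves", "VBZ")],
    [("book", "VB"), ("that", "DT"), ("flight", "NN")],
]

trans = defaultdict(int)       # C(t_{i-1}, t_i)
emit = defaultdict(int)        # C(t_i, w_i)
tag_count = defaultdict(int)   # C(t)

for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        trans[(prev, tag)] += 1
        emit[(tag, word)] += 1
        tag_count[tag] += 1
        prev = tag

start_count = len(corpus)      # number of sentence starts, i.e. C(<s>)

def p_trans(prev, tag):
    """P(tag | prev) = C(prev, tag) / C(prev)."""
    denom = start_count if prev == "<s>" else tag_count[prev]
    return trans[(prev, tag)] / denom if denom else 0.0

def p_emit(tag, word):
    """P(word | tag) = C(tag, word) / C(tag)."""
    return emit[(tag, word)] / tag_count[tag] if tag_count[tag] else 0.0

print(p_trans("DT", "NN"))     # analogous to P(NN | DT) on the slide
print(p_emit("NN", "flight"))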
8.5 HMM Part-of-Speech Tagging
Computing the most-likely tag sequence: A motivating example

(8.36)Secretariat/NNP is/BEZ expected/VBN to/TO race/VB tomorrow/NR


(8.37)People/NNS continue/VB to/TO inquire/VB the/AT reason/NN for/IN
the/AT race/NN for/IN outer/JJ space/NN

• Let’s look at how race can be


correctly tagged as a VB
instead of an NN in (8.36).



8.5 HMM Part-of-Speech Tagging
Computing the most-likely tag sequence: A motivating example

P(NN | TO) = .00047
P(VB | TO) = .83
P(race | NN) = .00057
P(race | VB) = .00012
P(NR | VB) = .0027
P(NR | NN) = .0012
P(VB | TO) P(NR | VB) P(race | VB) = .00000027
P(NN | TO) P(NR | NN) P(race | NN) = .00000000032

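A compact Viterbi sketch for this kind of bigram HMM tagger is shown below. Only the race-example probabilities above come from the slides; the start and emission probabilities of 1.0 for the unambiguous words are assumptions made so the example runs, so treat this as a sketch of the dynamic program rather than a full tagger:

import math

def viterbi(words, tags, p_trans, p_emit):
    """Most probable tag sequence under a bigram HMM.
    p_trans[(prev, tag)] and p_emit[(tag, word)] hold probabilities;
    missing entries are treated as a small floor value."""
    floor = 1e-12
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = best log-prob of any tag sequence ending in tag t at word i
    best = [{t: lp(p_trans, ("<s>", t)) + lp(p_emit, (t, words[0])) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev_t, score = max(
                ((pt, best[i - 1][pt] + lp(p_trans, (pt, t))) for pt in tags),
                key=lambda x: x[1],
            )
            best[i][t] = score + lp(p_emit, (t, words[i]))
            back[i][t] = prev_t
    # Backtrace from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))

tags = ["TO", "VB", "NN", "NR"]
p_trans = {("<s>", "TO"): 1.0, ("TO", "VB"): 0.83, ("TO", "NN"): 0.00047,
           ("VB", "NR"): 0.0027, ("NN", "NR"): 0.0012}
p_emit = {("TO", "to"): 1.0, ("VB", "race"): 0.00012, ("NN", "race"): 0.00057,
          ("NR", "tomorrow"): 1.0}
print(viterbi(["to", "race", "tomorrow"], tags, p_trans, p_emit))  # ['TO', 'VB', 'NR']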


Hidden Markov and Maximum Entropy Models
 Introduction
 Markov Chains
- Observed Markov model
- Weighted finite-state automata
- Probabilistic graphical model
 Hidden Markov model
- Transition probability matrix
- Observation likelihood
- Emission probability
- Left-to-right (Bakis) HMM
 Maximum entropy models
- Log-linear classifiers
- Linear regression
- Logistic regression
- Hyperplane
 Maximum entropy Markov models
- MaxEnt model
- HMM tagging model
- MEMM tagging model
1. Introduction (Hidden Markov and Maximum Entropy Models )
 Two important classes of statistical models for processing text & speech;
(1) Hidden Markov model (HMM),
(2) Maximum entropy model (MaxEnt), and particularly a Markov-related
variant of MaxEnt called the maximum entropy Markov model (MEMM).

HMMs and MEMMs are both sequence classifiers.


- A sequence classifier or sequence labeller is a model whose job is to assign
some label or class to each unit in a sequence.
- compute a probability distribution over possible labels and choose the best
label sequence.
2. Markov Chains
 Markov chains and hidden Markov models are both extensions of finite
automata.
 A finite automaton is defined by a set of states and a set of transitions between
states.
 A Markov chain is a special case of a weighted automaton in which the input
sequence uniquely determines which states the automaton will go through.
 Because it can’t represent inherently ambiguous problems, a Markov chain is only
useful for assigning probabilities to unambiguous sequences.
2. Markov Chains (Cont…)
 Figure 6.1 (a) shows
- a Markov chain (one state per weather event) for assigning a probability to a sequence of weather
events, for which the vocabulary consists of HOT, COLD and RAINY.
 Figure 6.1 (b) shows
- another simple example of a Markov chain (one state per word) for
assigning a probability to a sequence of words w1, w2, …, wn.
2. Markov Chains (Alternative representation)
 An alternative representation that is sometimes used for Markov chains doesn’t rely on a
start or end state,
- instead representing the distribution over initial states and accepting states explicitly.
 Examples: compute the probability of each of the following sequences (using the initial and
transition probabilities of Figure 6.1);
o hot hot hot hot => π = .5 * .5 * .5 * .5 = 0.0625
o cold hot cold hot => π = .5 * .2 * .5 * .2 = 0.01

 COLD-> COLD->WARM->WARM->WARM-> HOT-> COLD

 HOT->COLD->HOT->HOT->WARM->COLD->COLD->WARM

 WARM->HOT->COLD->WARM->COLD->HOT->WARM->HOT

 HOT->COLD->COLD->WARM->WARM->HOT->COLD->WARM

 COLD->HOT->WARM->WARM->COLD->HOT->WARM->HOT

 WARM->COLD->HOT->WARM->COLD->COLD->HOT->WARM
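A sketch of scoring a state sequence under a Markov chain follows; the initial and transition probabilities are placeholders standing in for the values read off Figure 6.1:

# Placeholder parameters for a 3-state weather Markov chain.
initial = {"HOT": 0.5, "COLD": 0.3, "RAINY": 0.2}
transition = {
    "HOT":   {"HOT": 0.5, "COLD": 0.2, "RAINY": 0.3},
    "COLD":  {"HOT": 0.2, "COLD": 0.5, "RAINY": 0.3},
    "RAINY": {"HOT": 0.3, "COLD": 0.3, "RAINY": 0.4},
}

def sequence_probability(states):
    """P(s1..sn) = pi(s1) * product of the transition probabilities."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p

print(sequence_probability(["HOT", "HOT", "HOT", "HOT"]))    # 0.0625, as on the slide
print(sequence_probability(["COLD", "HOT", "COLD", "HOT"]))  # value depends on the figure's numbers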
2. Markov Chains (Class Participation)
 How to compute the probabilities of each of the following sentences by using
7-states problems of;
(a) Students did their assignment well at time (* high probability likelihood).
(b) did their assignment student well at time (*2nd best probability likelihood).
(c) At student assignment well did time their ( worst probability likelihood).

How to compute the probabilities of each of the following sentences by using


5-states problems of;
(a) Weather is hot and dry (* high probability likelihood).
(b) and is hot weather dry (*2nd best probability likelihood).
(c) hot weather and dry is ( worst probability likelihood).
3. Hidden Markov Model (HMM)
A Hidden Markov Model (HMM) allows us to talk about both
- observed events (like words that we see in the input) and hidden
events (like part-of-speech tags) that we think of as causal factors in our
probabilistic model.
A formal definition of a Hidden Markov Model focuses on how it
differs from a Markov chain.
- An HMM doesn’t rely on a start or end state,
- instead representing the distribution over initial and accepting states explicitly.
3. Hidden Markov Model (HMM) (Cont…)
A first-order hidden Markov model instantiates two simplifying assumptions;
First, the probability of a particular state depends only on the previous state:

Second, the probability of an output observation oi


- depends only on the state that produced the observation qi and
- not on any other states or any other observations:
3. Hidden Markov Model (HMM) [Example]
In Figure;
 Two states : ‘Low’ and ‘High’ atmospheric
pressure.
Two observations : ‘Rain’ and ‘Dry’.
Transition probabilities: P(‘Low’|‘Low’)=0.3 ,
P(‘High’|‘Low’)=0.7 , P(‘Low’|‘High’)=0.2,
P(‘High’|‘High’)=0.8
Observation probabilities : P(‘Rain’|‘Low’)=0.6 ,
P(‘Dry’|‘Low’)=0.4 , P(‘Rain’|‘High’)=0.4 ,
P(‘Dry’|‘High’)=0.6 .
Initial probabilities: say P(‘Low’)=0.4 ,
P(‘High’)=0.6 .
3. Hidden Markov Model (HMM) [Example-1] (Cont…)
Calculation of observation sequence probability:
Suppose we want to calculate the probability of a sequence of
observations in our example, {‘Dry’,’Rain’}, using the transition,
observation, and initial probabilities given above.
Consider all possible hidden state sequences:
P({‘Dry’,’Rain’}) = P({‘Dry’,’Rain’}, {‘Low’,’Low’}) +
P({‘Dry’,’Rain’}, {‘Low’,’High’}) + P({‘Dry’,’Rain’}, {‘High’,’Low’}) +
P({‘Dry’,’Rain’}, {‘High’,’High’})

where the first term is:
P({‘Dry’,’Rain’}, {‘Low’,’Low’}) =
P(‘Low’) P(‘Dry’|’Low’) P(‘Low’|’Low’) P(‘Rain’|’Low’)
= 0.4 * 0.4 * 0.3 * 0.6 = 0.0288
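A brute-force sketch of this sum over hidden state sequences, using the example's probabilities (with P('Dry'|'High') taken as 0.6 so that each state's observation probabilities sum to 1 — an assumption on my part):

from itertools import product

states = ["Low", "High"]
initial = {"Low": 0.4, "High": 0.6}
trans = {("Low", "Low"): 0.3, ("Low", "High"): 0.7,
         ("High", "Low"): 0.2, ("High", "High"): 0.8}
emit = {("Low", "Rain"): 0.6, ("Low", "Dry"): 0.4,
        ("High", "Rain"): 0.4, ("High", "Dry"): 0.6}   # Dry|High assumed 0.6

def observation_probability(observations):
    """P(O) = sum over all hidden state sequences Q of P(O, Q)."""
    total = 0.0
    for seq in product(states, repeat=len(observations)):
        p = initial[seq[0]] * emit[(seq[0], observations[0])]
        for t in range(1, len(observations)):
            p *= trans[(seq[t - 1], seq[t])] * emit[(seq[t], observations[t])]
        total += p
    return total

print(observation_probability(["Dry", "Rain"]))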
3. Hidden Markov Model (HMM) [Example-2] (Cont…)
Typed word recognition, assume all characters are separated.

Character recognizer outputs probability of the image being particular


character, P(image|character).
3. Hidden Markov Model (HMM) [Example-3] (Cont…)
We can construct a single HMM for all words.
Hidden states = all characters in the alphabet.
Transition probabilities and initial probabilities are calculated from language
model.
Observations and observation probabilities are as before.

Here we have to determine the best sequence of hidden states, the one that
most likely produced word image.
This is an application of Decoding problem.
4. Left-to-right (Bakis) HMM
 In left-to-right (also called Bakis)
HMMs, the state transitions proceed from
left to right.

In a Bakis HMM,


- no transitions go from a higher-
numbered state to a lower-numbered state.

This includes HMMs ranging from one state to many states, as shown:


4. Left-to-right (Bakis) HMM (Home Assignment)
 Draw a model of left-to-right HMM of 2-states, 3-states and 4-states
problems of;

(a) (b)

Tennis posture detection

(c)
5. Maximum Entropy Models
A second probabilistic machine learning framework is maximum entropy modelling.
- MaxEnt is more widely known as multinomial logistic regression.

MaxEnt belongs to the family of classifiers known as the exponential or log-linear


classifiers.
- MaxEnt works by extracting some set of features from the input,
- combining them linearly (meaning that each feature is multiplied by a weight and then
added up), and this sum becomes the exponent.

Example-1:
In text classification,
- need to decide whether a particular email should be classified as spam.
- determine whether a particular sentence or document expresses a positive or negative
opinion.
5. Maximum Entropy Models (Cont…)
Example-2: Assume that we have some input x (perhaps it is a word that needs to be tagged or
a document that needs to be classified).
- From input x, we extract some features fi.
- A feature for tagging might be this word ends in –ing.
- For each such features fi, we have some weight wi.

Given the features and weights, our goal is to choose a class for a word.
- the probability of a particular class c given the observation x is

      P(c|x) = (1/Z) · exp( Σ_i w_i f_i )

where Z is a normalization factor, used to make the probability correctly sum to 1.


Finally, in actual MaxEnt model,
- the feature f and weights w both depend on the class c (i.e., we’ll have different features and
weights for different classes);
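A minimal sketch of this classification rule with class-specific features and weights (both invented), computing P(c|x) with a softmax over the weighted feature sums:

import math

def maxent_probs(feature_values, weights):
    """P(c|x) = exp(sum_i w_{c,i} f_i(c,x)) / Z, with Z summing over all classes."""
    scores = {c: sum(w * f for w, f in zip(weights[c], feature_values[c]))
              for c in weights}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

# Invented binary features for tagging the word "race" as VB vs. NN:
# f1 = previous tag is TO, f2 = word ends in -e.
features = {"VB": [1, 1], "NN": [1, 1]}
weights = {"VB": [2.0, 0.1], "NN": [-1.5, 0.3]}
print(maxent_probs(features, weights))   # P(VB|x) comes out much higher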
5.1 Linear Regression
In linear regression, we are given a set of observations;
- each observation associated with some features,
- and we want to predict some real-valued outcome for each observation.

Example; predicting housing prices.


Levitt and Dubner showed that the words used in a real estate ad can be a good predictor of
- whether a house will sell for more or less than its asking price.
- e.g., houses whose real estate ads have words like
fantastic, cute, or charming tend to sell for
lower prices,
- e.g., while houses whose ads have words like
maple and granite tend to sell for
higher prices.
5.1 Linear Regression (Cont…)
 The figure shows a graph of these points, with the feature (number of adjectives) on the
x-axis and the price on the y-axis, together with the fitted regression line.
- The equation of any line is y = mx + b; as shown on the graph, the slope of this line is
m = −4900, while the intercept b = 16550.

 Suppose the weight vector that we had previously learned for this task was
w = (w0, w1, w2, w3) = (18000, −5000, −3000, −1.8).

 Then the predicted value for this house would be computed by multiplying each feature by
its weight.
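To make the prediction step concrete, a minimal numpy sketch of the dot product w · x; the feature values, and the interpretation of the three features, are invented for illustration:

import numpy as np

# Weight vector from the slide: intercept plus three feature weights.
w = np.array([18000.0, -5000.0, -3000.0, -1.8])

# Invented feature values for one house: [1 (intercept term), feature 1,
# feature 2, feature 3]; the choice of features is an assumption here.
x = np.array([1.0, 2.0, 0.5, 100.0])

predicted_value = w.dot(x)   # 18000 - 10000 - 1500 - 180
print(predicted_value)

# Fitting a one-feature regression line (like y = mx + b on the graph)
# can be done with least squares, e.g. np.polyfit(x_values, y_values, 1).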
5.1 Linear Regression (Class Participation)
 Example; Global warming may be reducing average
snowfall in your town and you are asked to predict how
much snow you think will fall this year.
Looking at the following table you might guess somewhere
around 10-20 inches. That’s a good guess, but you could make
a better guess, by using regression.
- Find out linear regression for 2014, 2015, 2016, 2017 and
2018?

Hint :
- regression also gives you a useful equation, which for this
chart is: y = -2.2923x + 4624.4.
- For example, 2005:
y = -2.2923(2005) + 4624.4 = 28.3385 inches, which is
pretty close to the actual figure of 30 inches for that year.
5.2 Logistic Regression
In logistic regression, we classify whether some observation x is in the class (true) or not in
the class (false).

Example; we are assigning a part-of-speech tag to the word “race”.


Secretariat/NNP is/BEZ expected/VBN to/TO race/?? tomorrow/
- we are just doing classification, not sequence classification, so let’s consider just this single
word.
- We would like to know whether to assign the class VB to race (or instead assign some other
class like NN)

Case 1: We can thus add a binary feature that is true if this is the case:

Case 2: Another feature would be whether the previous word “to” the tag TO;
5.3 Maximum Entropy Markov Models
 Previously, the HMM tagging model is based on probabilities of the form
P(tag|tag) and P(word|tag).
- That means that if we want to include some source of knowledge into the tagging process,
we must find a way to encode the knowledge into one of these two probabilities.
- But many knowledge sources are hard to fit into these models.

For example; tagging unknown words


- useful features include capitalization, the presence of hyphens, word endings, and so on.
- There is no easy way to fit probabilities like P(capitalization|tag), P(hyphen|tag),
P(suffix|tag), and so on into a HMM-style model.
- For an HMM to model the most probable part-of-speech tag sequence, we rely on Bayes’ rule,
5.3 Maximum Entropy Markov Models (Cont…)
 In an MEMM, we break down the probabilities as follows;

Fig. The dependency graph for a traditional HMM (left).


The dependency graph for a Maximum Entropy Markov
Model (right).
In the case of the HMM, its parameters are used to maximize the likelihood of the observation
sequence (see the figure at left).

In the MEMM, by contrast, the current state St is conditioned on both the previous state St-1 and
the current observation Ot (see the figure at right).
5.3 Maximum Entropy Markov Models [Example-1] (Cont…)

 More formally, in the HMM, we compute the


probability of the state sequence given the
observations
as;

In the MEMM, we compute the probability of the


state sequence given the observation as;
(Class Presentation)
 Design case study with proper examples for;
 Linear Regression,
 Logistic Regression,
 Maximum Entropy Markov Models.
6. HMM Vs Maximum Entropy Markov Models [Example]
 Text classification: Asia or Europe

Europe Training Data Asia


Monaco Monaco Monaco Monaco
Hong Monaco
Monaco Monaco Hong
Kong
Kong
Monaco

HMM FACTORS: PREDICTIONS MEMM:


NB Model • P(A) = P(E) = • P(A,M) =
Class • P(M|A) = • P(E,M) =
• P(M|E) = • P(A|M) =
X1=M • P(E|M) =
6. HMM Vs Maximum Entropy Markov Models [Example] (Cont…)
 Text classification: Asia or Europe

Europe Training Data Asia


Monaco Monaco Monaco
Hong Monaco Hong
Monaco Hong
Kong Kong
Kong
Monaco

NB Model HMM FACTORS: PREDICTIONS MEMM:


• P(A) = P(E) = • P(A,H,K) =
Class
• P(H|A) = P(K|A) = • P(E,H,K) =
• P(H|E) = P(K|E) = • P(A|H,K) =
X1=H X2=K
• P(E|H,K) =
6. HMM Vs Maximum Entropy Markov Models [Example] (Cont…)
 Text classification: Asia or Europe

Europe Training Data Asia


Monaco Monaco Monaco
Hong Monaco Hong
Monaco Hong
Kong Kong
Kong
Monaco

NB Model HMM FACTORS: PREDICTIONS MEMM:


• P(A) = P(E) = • P(A,H,K,M) =
Class • P(H|A) = P(K|A) = • P(E,H,K,M) =
• P(H|E) = P(K|E) =
• P(A|H,K,M) =
H K M • P(M|A) =
• P(M|E) =
• P(E|H,K,M) =
6. HMM vs. Maximum Entropy Markov Models [Example] (Cont…)
 NLP relevance: we often have overlapping features….

HMM models multi-count correlated evidence


• Each feature is multiplied in, even when you have multiple features telling you the same thing

Maximum Entropy models (pretty much) solve this problem


• As we will see, this is done by weighting features so that model expectations match the observed
(empirical) expectations.
Finite-State Automata

• An RE is one way of describing an FSA.


• An RE is one way of characterizing a particular kind of
formal language called a regular language.
• Both Regular Expressions & Finite State Automata can
be used to describe regular languages.
• The relation among these three theoretical
constructions is sketched out in the figure on the next slide,
which was suggested by Martin Kay.

4
Finite-State Automata

5
Using an FSA to Recognize Sheeptalk
• As we defined the sheep language in Part 1 it is any string
from the following (infinite) set:
baa!
baaa!
baaaa!
baaaaa!
baaaaaa!
...

• The regular expression for this kind of ‘sheep talk’ is


/baa+!/. Figure 2.10 shows an automaton for modeling
this regular expression.
6
Using an FSA to Recognize Sheeptalk
• The automaton (i.e. machine, also called finite automaton,
finite-state automaton, or FSA) recognizes a set of strings, in
this case the strings characterizing sheep talk, in the same
way that a regular expression does.
• We represent the automaton as a directed graph: a finite
set of vertices (also called nodes), together with a set of
directed links between pairs of vertices called arcs.
• We’ll represent vertices with circles and arcs with arrows.
• The automaton has five states, which are represented by
nodes in the graph.
• State 0 is the start state which we represent by the incoming
arrow.
• State 4 is the final state or accepting state, which we
represent by the double circle. It also has four transitions,
which we represent by arcs in the graph.
7
Using an FSA to Recognize Sheeptalk

• The FSA can be used for recognizing (we also say


accepting) strings in the following way.
• First, think of the input as being written on a long tape
broken up into cells, with one symbol written in each
cell of the tape, as in Figure 2.11.

8
Using an FSA to Recognize Sheeptalk

• The machine starts in the start state (q0), and iterates the
following process:
▪ Check the next letter of the input. If it matches the symbol on an arc
leaving the current state, then cross that arc, move to the next state,
and also advance one symbol in the input.
▪ If we are in the accepting state (q4) when we run out of input, the
machine has successfully recognized an instance of sheeptalk.
▪ If the machine never gets to the final state, either because it runs out
of input, or it gets some input that doesn’t match an arc (as in Figure
2.11), or if it just happens to get stuck in some non-final state, we say
the machine rejects or fails to accept an input.

9
Using an FSA to Recognize Sheeptalk
• We can also represent an automaton with a state-transition
table. As in the graph notation, the state-transition table
represents the start state, the accepting states, and what
transitions leave each state with which symbols.
• Here’s the state-transition table for the FSA of Figure 2.10.

10
Using an FSA to Recognize Sheeptalk
• If we’re in state 0 and we see the input b, we must go to state 1. If we’re in state 0 and
we see the input a or !, we fail.

• More formally, a finite automaton is defined by the following 5


parameters:
▪ Q: a finite set of N states q0, q1, …, qN
▪ Σ: a finite input alphabet of symbols
▪ q0: the start state
▪ F: the set of final states, F ⊆ Q
▪ δ(q,i): the transition function or transition matrix between states. Given a
state q ∈ Q and input symbol i ∈ Σ, δ(q,i) returns a new state q’ ∈ Q. δ is
thus a relation from Q × Σ to Q;

11
Using an FSA to Recognize Sheeptalk
• For the sheeptalk automaton in Figure 2.10, Q = {q0, q1, q2, q3, q4},
Σ = {a, b, !}, F = {q4}, and δ(q,i) is defined by the transition table in
Figure 2.12.

• Figure 2.13 presents an algorithm for recognizing a string using a


state transition table. The algorithm is called D-RECOGNIZE for
‘deterministic recognizer’.

• A deterministic algorithm is one that has no choice points; the


algorithm always knows what to do for any input.
• But non-deterministic automata that must make decisions about
which states to move to.

12
Using an FSA to Recognize Sheeptalk

13
Using an FSA to Recognize Sheeptalk
• D-RECOGNIZE takes as input a tape and an automaton. It returns accept
if the string it is pointing to on the tape is accepted by the automaton,
and reject otherwise.

• Note that since D-RECOGNIZE assumes it is already pointing at the string


to be checked, its task is only a subpart of the general problem that we
often use regular expressions for, finding a string in a corpus.

• D-RECOGNIZE begins by initializing the variables index and currentstate


to the beginning of the tape and the machine’s initial state.

• D-RECOGNIZE then enters a loop that drives the rest of the algorithm.
• It first checks whether it has reached the end of its input. If so, it either
accepts the input (if the current state is an accept state) or rejects the
input (if not).
14
Using an FSA to Recognize Sheeptalk
• If there is input left on the tape, D-RECOGNIZE looks at the transition table
to decide which state to move to.

• The variable current-state indicates which row of the table to consult,
while the current symbol on the tape indicates which column of the table
to consult.

• The resulting transition-table cell is used to update the variable current-state,
and index is incremented to move forward on the tape.

• If the transition-table cell is empty then the machine has nowhere to go
and must reject the input.
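
The loop just described is compact enough to sketch directly. The following is a minimal Python sketch of the D-RECOGNIZE idea (not the book’s exact pseudocode); the dictionary-based transition table and the function name are illustrative assumptions:

# A minimal sketch of D-RECOGNIZE, assuming the transition table is a dict
# mapping (state, symbol) -> next state; a missing cell means "nowhere to go".
def d_recognize(tape, table, start_state, accept_states):
    index, current_state = 0, start_state
    while True:
        if index == len(tape):                       # end of input reached
            return current_state in accept_states    # accept iff in an accept state
        next_state = table.get((current_state, tape[index]))
        if next_state is None:                       # empty transition-table cell
            return False
        current_state, index = next_state, index + 1

# Sheeptalk FSA of Figure 2.10 (states 0-4, accept state 4)
sheep = {(0, 'b'): 1, (1, 'a'): 2, (2, 'a'): 3, (3, 'a'): 3, (3, '!'): 4}
print(d_recognize("baaa!", sheep, 0, {4}))   # True
print(d_recognize("abc",   sheep, 0, {4}))   # False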

15
Using an FSA to Recognize Sheeptalk
• Figure 2.14 traces the execution of this algorithm on the sheep language
FSA given the sample input string baaa!.

16
Using an FSA to Recognize Sheeptalk

• Before examining the beginning of the tape, the machine is in state q0.

• Finding a b on the input tape, it changes to state q1 as indicated by the
contents of transition-table[q0,b] in Figure 2.12 on slide 10.

• It then finds an a and switches to state q2, another a puts it in state q3,
and a third a leaves it in state q3,
• where it reads the ‘!’ and switches to state q4.

• Since there is no more input, the End of input condition at the beginning
of the loop is satisfied for the first time and the machine halts in q4.

• State q4 is an accepting state, and so the machine has accepted the
string baaa! as a sentence in the sheep language.
17
Using an FSA to Recognize Sheeptalk

• The algorithm will fail whenever there is no legal transition for a given
combination of state and input.
• The input abc will fail to be recognized since there is no legal transition
out of state q0 on the input a (i.e., this entry of the transition table in
Figure 2.12 on slide 10 has a ∅).
• Even if the automaton had allowed an initial a, it would certainly have
failed on c, since c isn’t even in the sheeptalk alphabet!
• We can think of these ‘empty’ elements in the table as if they all pointed
at one ‘empty’ state, which we might call the fail state or sink state.
• In a sense then, we could view any machine with empty transitions as if
we had augmented it with a fail state, and drawn in all the extra arcs, so
we always had somewhere to go from any state on any possible input.
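
This augmentation is easy to picture concretely. Below is a minimal Python sketch (the dictionary representation and the state name 'qF' are assumptions for illustration) that fills every missing cell of the sheeptalk table with an explicit fail/sink state:

# Fill every empty (state, symbol) cell with an explicit sink state 'qF',
# so the machine always has somewhere to go on any input.
alphabet = ['a', 'b', '!']
states = [0, 1, 2, 3, 4, 'qF']
table = {(0, 'b'): 1, (1, 'a'): 2, (2, 'a'): 3, (3, 'a'): 3, (3, '!'): 4}
for q in states:
    for s in alphabet:
        table.setdefault((q, s), 'qF')   # all formerly empty cells now point at qF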
18
Using an FSA to Recognize Sheeptalk
• Just for completeness, Figure 2.15 shows the FSA from Figure 2.10 with the
fail state qF filled in.

19
Formal Languages
• We can use the same graph in Figure 2.10 as an automaton for
GENERATING sheeptalk.
• If we do, we would say that the automaton starts at state q0, and crosses
arcs to new states, printing out the symbols that label each arc it follows.

• When the automaton gets to the final state it stops. Notice that at state 3,
the automaton has to choose between printing out a ! and going to state
4, or printing out an a and returning to state 3.
• Let’s say for now that we don’t care how the machine makes this
decision; maybe it flips a coin.
• For now, we don’t care which exact string of sheeptalk we generate, as
long as it’s a string captured by the regular expression for sheeptalk
above.
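
As a concrete illustration, here is a minimal Python sketch of the same FSA used as a generator; random.choice stands in for the coin flip, and the arc representation is an assumption for illustration, not the book’s notation:

import random

def generate_sheeptalk():
    # arcs: state -> list of (symbol, next_state); state 4 is the accepting state
    arcs = {0: [('b', 1)], 1: [('a', 2)], 2: [('a', 3)],
            3: [('a', 3), ('!', 4)]}
    state, output = 0, []
    while state != 4:
        symbol, state = random.choice(arcs[state])   # "flip a coin" at choice points
        output.append(symbol)
    return ''.join(output)

print(generate_sheeptalk())   # e.g. 'baaa!' -- always a string matched by /baa+!/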
20
Formal Languages
• Key concept #1: Formal Language: A model which can both generate
and recognize all and only the strings of a formal language acts as a
definition of the formal language.

• A formal language is a set of strings, each string composed of symbols
from a finite symbol set called an alphabet (the same alphabet used above
for defining an automaton!).

• The alphabet for the sheep language is the set Σ = {a, b, !}.
• Given a model m (such as a particular FSA), we can use L(m) to mean
“the formal language characterized by m”.
• So the formal language defined by our sheeptalk automaton m in Figure
2.10 (and Figure 2.12) is the infinite set:

L(m) = {baa!, baaa!, baaaa!, baaaaa!, baaaaaa!...}


21
Formal Languages
• The usefulness of an automaton for defining a language is that it can
express an infinite set in a closed form.
• A formal language may bear no resemblance at all to a real language
(natural language), but
▪ We often use a formal language to model part of a natural language,
such as parts of the phonology, morphology, or syntax.

• The term generative grammar is used in linguistics to mean a grammar of


a formal language; the origin of the term is this use of an automaton to
define a language by generating all possible strings.

22
Another Example

23
Another Example

24
Non-Deterministic FSAs
• Consider the sheeptalk automaton in Figure 2.18, which is much like our
first automaton in Figure 2.10:

• The only difference between this automaton and the previous one is that
here in Figure 2.18 the self-loop is on state 2 instead of state 3.
25
Non-Deterministic FSAs
• Consider using this network as an automaton for recognizing sheeptalk.
• When we get to state 2, if we see an a we don’t know whether to remain
in state 2 or go on to state 3.

• Automata with decision points like this are called non-deterministic FSAs
(or NFSAs).

• Recall by contrast that Figure 2.10 specified a deterministic automaton,
i.e. one whose behavior during recognition is fully determined by the
state it is in and the symbol it is looking at.

• A deterministic automaton can be referred to as a DFSA.

• That is not true for the machine in Figure 2.18 (NFSA #1).
26
Non-Deterministic FSAs
• There is another common type of non-determinism, which can be
caused by arcs that have no symbols on them (called 𝜀-transitions).

• The automaton in Figure 2.19 defines the exact same language as the
last one, or our first one, but it does it with an 𝜀 -transition.

27
Non-Deterministic FSAs
• We interpret this new arc as follows: if we are in state 3, we are
allowed to move to state 2 without looking at the input, or
advancing our input pointer.

• So this introduces another kind of non-determinism – we might not


know whether to follow the 𝜀 -transition or the ! arc.

28
Using an NFSA to Accept Strings

• There are three standard solutions to the problem of choice in
non-deterministic models:

1. Backup: Whenever we come to a choice point, we could put a
marker to mark where we were in the input, and what state the
automaton was in. Then if it turns out that we took the wrong choice,
we could back up and try another path.

2. Look-ahead: We could look ahead in the input to help us decide
which path to take.

3. Parallelism: Whenever we come to a choice point, we could look at
every alternative path in parallel.
29
Using an NFSA to Accept Strings
• The backup approach suggests that we should blithely make choices
that might lead to deadends, knowing that we can always return to
unexplored alternative choices.

• There are two keys to this approach: we need to remember all the
alternatives for each choice point, and we need to store sufficient
information about each alternative so that we can return to it when
necessary.

• When a backup algorithm reaches a point in its processing where no


progress can be made (because it runs out of input, or has no legal
transitions), it returns to a previous choice point, selects one of the
unexplored alternatives, and continues from there.

30
Using an NFSA to Accept Strings
• Applying this notion to our nondeterministic recognizer, we need only
remember two things for each choice point: the state, or node, of the
machine that we can go to and the corresponding position on the tape.

• We will call the combination of the node and position the search-state of
the recognition algorithm.

• To avoid confusion, we will refer to the state of the automaton (as


opposed to the state of the search) as a node or a machine-state.

• Figure 2.21 presents a recognition algorithm based on this approach.

31
Using an NFSA to Accept Strings
• Before going on to describe the main part of this algorithm, we should
note two changes to the transition table that drives it.
• First, in order to represent nodes that have outgoing 𝜀-transitions, we add
a new 𝜀 -column to the transition table. If a node has an 𝜀-transition, we
list the destination node in the 𝜀 -column for that node’s row.
• The second addition is needed to account for multiple transitions to
different nodes from the same input symbol.
• We let each cell entry consist of a list of destination nodes rather than a
single node.

• Figure 2.20 shows the transition table for the machine in Figure 2.18 (NFSA
#1).
• While it has no 𝜀 -transitions, it does show that in machine-state q2 the
input a can lead back to q2 or on to q3.
32
Using an NFSA to Accept Strings

33
Using an NFSA to Accept Strings

• Figure 2.21 shows the algorithm for using a non-deterministic FSA to


recognize an input string.

• The function ND-RECOGNIZE uses the variable agenda to keep track of


all the currently unexplored choices generated during the course of
processing.

• Each choice (search state) is a tuple consisting of a node (state) of the


machine and a position on the tape.

• The variable current-search-state represents the branch choice being


currently explored.

34
Using an NFSA to Accept Strings
• ND-RECOGNIZE begins by creating an initial search-state and placing it
on the agenda.

• For now we don’t specify what order the search-states are placed on the
agenda.

• This search-state consists of the initial machine-state of the machine and


a pointer to the beginning of the tape.

• The function NEXT is then called to retrieve an item from the agenda and
assign it to the variable current-search-state.

35
Using an NFSA to Accept Strings
• As with D-RECOGNIZE, the first task of the main loop is to determine if the
entire contents of the tape have been successfully recognized.

• This is done via a call to ACCEPT-STATE?, which returns accept if the


current search-state contains both an accepting machine-state and a
pointer to the end of the tape.

• If we’re not done, the machine generates a set of possible next steps by
calling GENERATE-NEW-STATES, which creates search-states for any 𝜀-
transitions and any normal input-symbol transitions from the transition
table.

• All of these search-state tuples are then added to the current agenda.

36
Using an NFSA to Accept Strings
• Finally, we attempt to get a new search-state to process from the
agenda.

• If the agenda is empty we’ve run out of options and have to reject the
input.

• Otherwise, an unexplored option is selected and the loop continues.

• It is important to understand why ND-RECOGNIZE returns a value of reject
only when the agenda is found to be empty.

• Unlike D-RECOGNIZE, it does not return reject when it reaches the end of
the tape in a non-accept machine-state or when it finds itself unable to
advance the tape from some machine-state.
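
The agenda mechanism just described can be sketched compactly. The following is a minimal Python sketch of the ND-RECOGNIZE idea (not the book’s exact pseudocode); the table representation, the use of '' for ε, and the absence of ε-cycles are assumptions for illustration:

def nd_recognize(tape, table, start, accept_states):
    # table maps (node, symbol) -> list of destination nodes; '' is the epsilon column
    agenda = [(start, 0)]                      # search-states: (machine node, tape index)
    while agenda:
        node, index = agenda.pop()             # pop from the end = stack = depth-first order
        if index == len(tape) and node in accept_states:
            return True                        # ACCEPT-STATE?: accepting node at end of tape
        for nxt in table.get((node, ''), []):  # epsilon-transitions consume no input
            agenda.append((nxt, index))
        if index < len(tape):                  # normal transitions on the current symbol
            for nxt in table.get((node, tape[index]), []):
                agenda.append((nxt, index + 1))
    return False                               # agenda empty: reject

# NFSA #1 of Figure 2.18: the self-loop on a is on state 2
nfsa1 = {(0, 'b'): [1], (1, 'a'): [2], (2, 'a'): [2, 3], (3, '!'): [4]}
print(nd_recognize("baaa!", nfsa1, 0, {4}))    # True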
37
Using an NFSA to Accept Strings

• This is because, in the non-deterministic case, such roadblocks


only indicate failure down a given path, not overall failure.

• We can only be sure we can reject a string when all possible


choices have been examined and found lacking.

• Figure 2.22 illustrates the progress of ND-RECOGNIZE as it


attempts to handle the input baaa!.

38
Using an NFSA to Accept Strings

• Each strip illustrates the state of the algorithm at a given point


in its processing.

• The current-search-state variable is captured by the solid


bubbles representing the machine-state along with the arrow
representing progress on the tape.

39
Using an NFSA to Accept Strings
• Each strip lower down in the figure represents progress from one current-
search-state to the next.

• Little of interest happens until the algorithm finds itself in state q2 while
looking at the second a on the tape.

• An examination of the entry for transition-table[q2 ,a] returns both q2 and


q3.

• Search states are created for each of these choices and placed on the
agenda.

• Unfortunately, our algorithm chooses to move to state q3, a move that


results in neither an accept state nor any new states since the entry for
transition-table[q3 , a] is empty.
40
Using an NFSA to Accept Strings

• At this point, the algorithm simply asks the agenda for a new
state to pursue.

• Since the choice of returning to q2 from q2 is the only


unexamined choice on the agenda it is returned with the tape
pointer advanced to the next a.

• Somewhat diabolically, ND-RECOGNIZE finds itself faced with


the same choice.

41
Using an NFSA to Accept Strings

• The entry for transition-table[q2 ,a] still indicates that looping


back to q2 or advancing to q3 are valid choices.

• As before, states representing both are placed on the


agenda. These search states are not the same as the previous
ones since their tape index values have advanced.

• This time the agenda provides the move to q3 as the next


move. The move to q4, and success, is then uniquely
determined by the tape and the transition-table.

42
Using an NFSA to Accept Strings

43
Recognition as Search
• ND-RECOGNIZE accomplishes the task of recognizing strings in a regular
language by providing a way to systematically explore all the possible
paths through a machine.

• If this exploration yields a path ending in an accept state, it accepts the


string, otherwise it rejects it.

• This systematic exploration is made possible by the agenda mechanism,


which on each iteration selects a partial path to explore and keeps track
of any remaining, as yet unexplored, partial paths.

• Algorithms such as ND-RECOGNIZE, which operate by systematically


searching for solutions, are known as state-space-search algorithms.

44
Recognition as Search
• In such algorithms, the problem definition creates a space of possible
solutions; the goal is to explore this space, returning an answer when one
is found or rejecting the input when the space has been exhaustively
explored.

• In ND-RECOGNIZE, search states consist of pairings of machine-states with


positions on the input tape.

• The state-space consists of all the pairings of machine-state and tape


positions that are possible given the machine in question.

• The goal of the search is to navigate through this space from one state to
another looking for a pairing of an accept state with an end of tape
position.
45
Recognition as Search

• The key to the effectiveness of such programs is often the order in which
the states in the space are considered.

• A poor ordering of states may lead to the examination of a large number


of unfruitful states before a successful solution is discovered.

• Unfortunately, it is typically not possible to tell a good choice from a bad
one, and often the best we can do is to ensure that each possible solution
is eventually considered.

46
Recognition as Search
• You may have noticed that the ordering of states in ND-RECOGNIZE
has been left unspecified.

• We know only that unexplored states are added to the agenda as


they are created and that the (undefined) function NEXT returns an
unexplored state from the agenda when asked.

• How should the function NEXT be defined?

47
Recognition as Search
• Consider an ordering strategy where the states that are considered
next are the most recently created ones.

• Such a policy can be implemented by placing newly created


states at the front of the agenda and having NEXT return the state
at the front of the agenda when called.

• Thus the agenda is implemented by a stack.

• This is commonly referred to as a depth-first search or Last In First Out


(LIFO) strategy.

48
Recognition as Search

• Such a strategy dives into the search space following newly


developed leads as they are generated.

• It will only return to consider earlier options when progress


along a current lead has been blocked.

• The trace of the execution of ND-RECOGNIZE on the string


baaa! as shown in Figure 2.22 illustrates a depth-first search.

49
Recognition as Search

• The algorithm hits the first choice point after seeing ba when it
has to decide whether to stay in q2 or advance to state q3.

• At this point, it chooses one alternative and follows it until it is


sure it’s wrong.
• The algorithm then backs up and tries another older
alternative.

50
Recognition as Search

• The second way to order the states in the search space is to


consider states in the order in which they are created.

• Such a policy can be implemented by placing newly created


states at the back of the agenda and still have NEXT return the
state at the front of the agenda.

• Thus the agenda is implemented via a queue.

• This is commonly referred to as a breadth-first-search or First In


First Out (FIFO) strategy.
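
A minimal sketch of the two agenda policies, assuming the (node, tape index) search-states used above; newly created states are appended to the agenda, and only where NEXT takes the next state from differs:

from collections import deque

agenda = deque([('q0', 0), ('q2', 2)])    # ('q2', 2) is the most recently created state

dfs_next = agenda[-1]   # depth-first: NEXT returns the newest state (stack / LIFO)
bfs_next = agenda[0]    # breadth-first: NEXT returns the oldest state (queue / FIFO)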
51
Recognition as Search

• Consider a different trace of the execution of ND-RECOGNIZE


on the string baaa! as shown in Figure 2.23.

• Again, the algorithm hits its first choice point after seeing ba
when it had to decide whether to stay in q2 or advance to
state q3.

• But now rather than picking one choice and following it up, we
imagine examining all possible choices, expanding one ply of
the search tree at a time.
52
Recognition as Search

53
Recognition as Search

• Both algorithms (depth-first search and breadth-first search) have
their own disadvantages; for example, each can enter an
infinite loop under certain circumstances. Depth-first search
is normally preferred for its more efficient use of
memory.

• For larger problems, more complex search techniques such as


dynamic programming or A* must be used.

54
Regular Languages and FSAs
• The class of languages that are definable by regular expressions is exactly
the same as the class of languages that are characterizable by FSA (D or
ND).
▪ These languages are called regular languages.
• The class of regular languages over Σ is formally defined as follows:
1. ∅ is an RL
2. ∀a ∈ Σ ∪ {ε}, {a} is an RL
3. If L1 and L2 are RLs, then so are:
a) L1 · L2 = {xy | x ∈ L1 and y ∈ L2}, the concatenation of L1 and L2
b) L1 ∪ L2, the union or disjunction of L1 and L2
c) L1*, the Kleene closure of L1

• All and only the sets of languages which meet the above properties are
regular languages.
55
Regular Languages and FSAs
• Regular languages are also closed under the following operations (where
Σ ∗ means the infinite set of all possible strings formed from the alphabet Σ):

❖ Intersection: if L1 and L2 are regular languages, then so is L1 ∩ L2, the


language consisting of the set of strings that are in both L1 and L2.

❖ Difference: if L1 and L2 are regular languages, then so is L1 − L2, the


language consisting of the set of strings that are in L1 but not L2.

❖ Complementation: If L1 is a regular language, then so is Σ ∗ − L1, the set


of all possible strings that aren’t in L1.

❖ Reversal: If L1 is a regular language, then so is L1^R, the language
consisting of the set of reversals of all the strings in L1.
56
Regular Languages and FSAs

57
Regular Languages and FSAs

58
Regular Languages and FSAs

59
Chapter 3. Morphology and
Finite-State Transducers
From: Chapter 3 of An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition, by Daniel Jurafsky and James H. Martin
Background
• The problem of recognizing that foxes breaks down into the two
morphemes fox and -es is called morphological parsing.
• Similar problem in the information retrieval domain: stemming
• Given the surface or input form going, we might want to produce the
parsed form: VERB-go + GERUND-ing
• In this chapter
– morphological knowledge and
– The finite-state transducer
• It is quite inefficient to list all forms of noun and verb in the dictionary
because of the productivity of the forms.
• Morphological parsing is needed not only for IR, but also for
– Machine translation
– Spell checking

2
Survey of (Mostly) English Morphology

• Morphology is the study of the way words are built up from smaller
meaning-bearing units, morphemes.
• Two broad classes of morphemes:
– The stems: the “main” morpheme of the word, supplying the main
meaning, while
– The affixes: add “additional” meaning of various kinds.
• Affixes are further divided into prefixes, suffixes, infixes, and
circumfixes.
– Suffix: eat-s
– Prefix: un-buckle
– Circumfix: ge-sag-t (said) sagen (to say) (in German)
– Infix: hingi (borrow) → humingi (the agent of an action) (in the Philippine
language Tagalog)

3
Survey of (Mostly) English Morphology

• Prefixes and suffixes are often called concatenative morphology.


• A number of languages have extensive non-concatenative
morphology
– The Tagalog infixation example
– Templatic morphology or root-and-pattern morphology, common in
Arabic, Hebrew, and other Semitic languages
• Two broad classes of ways to form words from morphemes:
– Inflection: the combination of a word stem with a grammatical
morpheme, usually resulting in a word of the same class as the original
stem, and usually filling some syntactic function like agreement, and
– Derivation: the combination of a word stem with a grammatical
morpheme, usually resulting in a word of a different class, often with a
meaning hard to predict exactly.

4
Survey of (Mostly) English Morphology
Inflectional Morphology

• In English, only nouns, verbs, and sometimes adjectives can be


inflected, and the number of affixes is quite small.
• Inflections of nouns in English:
– An affix marking plural,
• cat(-s), thrush(-es), ox (oxen), mouse (mice)
• ibis(-es), waltz(-es), finch(-es), box(-es),
butterfly(-lies)
– An affix marking possessive
• llama’s, children’s, llamas’, Euripides’ comedies

5
Survey of (Mostly) English Morphology
Inflectional Morphology
• Verbal inflection is more complicated than nominal inflection.
– English has three kinds of verbs:
• Main verbs, eat, sleep, impeach
• Modal verbs, can, will, should
• Primary verbs, be, have, do
– Morphological forms of regular verbs
stem walk merge try map
-s form walks merges tries maps
-ing participle walking merging trying mapping
Past form or –ed participle walked merged tried mapped

– These regular verbs and forms are significant in the morphology of


English because of their majority and being productive.

6
Survey of (Mostly) English Morphology
Inflectional Morphology

– Morphological forms of irregular verbs

stem eat catch cut


-s form eats catches cuts
-ing participle eating catching cutting
Past form ate caught cut
–ed participle eaten caught cut

7
Survey of (Mostly) English Morphology
Derivational Morphology
• Nominalization in English:
– The formation of new nouns, often from verbs or adjectives
Suffix Base Verb/Adjective Derived Noun
-ation computerize (V) computerization
-ee appoint (V) appointee
-er kill (V) killer
-ness fuzzy (A) fuzziness

– Adjectives derived from nouns or verbs

Suffix Base Noun/Verb Derived Adjective


-al computation (N) computational
-able embrace (V) embraceable
-less clue (N) clueless
8
Survey of (Mostly) English Morphology
Derivational Morphology
• Derivation in English is more complex than inflection because
– Generally less productive
• A nominalizing affix like –ation cannot be added to absolutely every verb,
e.g. *eatation.
– There are subtle and complex meaning differences among nominalizing
suffixes. For example, sincerity has a subtle difference in meaning from
sincereness.

9
Finite-State Morphological Parsing
• Parsing English morphology

Input Morphological parsed output


cats cat +N +PL
cat cat +N +SG
cities city +N +PL
geese goose +N +PL
goose (goose +N +SG) or (goose +V)
gooses goose +V +3SG
merging merge +V +PRES-PART
caught (caught +V +PAST-PART) or (catch +V +PAST)

10
Finite-State Morphological Parsing

• We need at least the following to build a morphological parser:


1. Lexicon: the list of stems and affixes, together with basic information
about them (Noun stem or Verb stem, etc.)
2. Morphotactics: the model of morpheme ordering that explains which
classes of morphemes can follow other classes of morphemes inside a
word. E.g., the rule that English plural morpheme follows the noun rather
than preceding it.
3. Orthographic rules: these spelling rules are used to model the changes
that occur in a word, usually when two morphemes combine (e.g., the
y→ie spelling rule changes city + -s to cities).

11
Finite-State Morphological Parsing
The Lexicon and Morphotactics
• A lexicon is a repository for words.
– The simplest one would consist of an explicit list of every word of the language:
inconvenient or impossible!
– Computational lexicons are usually structured with
• a list of each of the stems and
• Affixes of the language together with a representation of morphotactics telling us how
they can fit together.
– The most common way of modeling morphotactics is the finite-state automaton.

Reg-noun Irreg-pl-noun Irreg-sg-noun plural

fox geese goose -s


fat sheep sheep
fog mice mouse
aardvark

An FSA for English nominal inflection


12
Finite-State Morphological Parsing
The Lexicon and Morphotactics

An FSA for English verbal inflection

Reg-verb-stem Irreg-verb-stem Irreg-past-verb past Past-part Pres-part 3sg

walk cut caught -ed -ed -ing -s


fry speak ate
talk sing eaten
impeach sang
spoken

13
Finite-State Morphological Parsing
The Lexicon and Morphotactics
• English derivational morphology is more complex than English
inflectional morphology, and so automata modeling English
derivation tend to be quite complex.
– Some are even based on CFGs
• A small part of morphosyntactics of English adjectives

big, bigger, biggest


cool, cooler, coolest, coolly
red, redder, reddest
clear, clearer, clearest, clearly, unclear, unclearly
happy, happier, happiest, happily
unhappy, unhappier, unhappiest, unhappily
real, unreal, really

An FSA for a fragment of English adjective morphology #1

14
Finite-State Morphological Parsing
• The FSA #1 recognizes all the listed adjectives, but it also recognizes
ungrammatical forms like unbig, redly, and realest.
• Thus #1 is revised to become #2.
• This kind of complexity is to be expected for English derivation.

An FSA for a fragment of English adjective


Morphology #2
15
Finite-State Morphological Parsing

An FSA for another fragment of English derivational morphology

16
Finite-State Morphological Parsing
• We can now use these FSAs to
solve the problem of
morphological recognition:
– Determining whether an input
string of letters makes up a
legitimate English word or not
– We do this by taking the
morphotactic FSAs, and plugging
in each “sub-lexicon” into the FSA.
– The resulting FSA can then be
defined at the level of the
individual letter.

17
Finite-State Morphological Parsing
Morphological Parsing with FST

• Given the input, for example, cats, we would like to produce cat +N +PL.
• Two-level morphology, by Koskenniemi (1983)
– Representing a word as a correspondence between a lexical level
• Representing a simple concatenation of morphemes making up a word, and
– The surface level
• Representing the actual spelling of the final word.
• Morphological parsing is implemented by building mapping rules that map
letter sequences like cats on the surface level into morpheme and feature
sequences like cat +N +PL on the lexical level.
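
Before introducing the FST machinery, the desired input/output behaviour can be pictured as a plain lookup (a toy illustration only, not how the parser is actually built; the entries are taken from the parsing table earlier in this chapter):

# Surface form -> possible lexical-level parses (several parses = ambiguity)
surface_to_lexical = {
    "cats":   ["cat +N +PL"],
    "geese":  ["goose +N +PL"],
    "goose":  ["goose +N +SG", "goose +V"],
    "caught": ["catch +V +PAST", "caught +V +PAST-PART"],
}
print(surface_to_lexical["goose"])   # two lexical parses for one surface form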

18
Finite-State Morphological Parsing
Morphological Parsing with FST

• The automaton we use for performing the mapping between these two
levels is the finite-state transducer or FST.
– A transducer maps between one set of symbols and another;
– An FST does this via a finite automaton.
• Thus an FST can be seen as a two-tape automaton which recognizes or
generates pairs of strings.
• The FST has a more general function than an FSA:
– An FSA defines a formal language
– An FST defines a relation between sets of strings.
• Another view of an FST:
– A machine reads one string and generates another.

19
Finite-State Morphological Parsing
Morphological Parsing with FST

• FST as recognizer:
– a transducer that takes a pair of strings as input and outputs accept if the
string-pair is in the string-pair language, and reject if it is not.
• FST as generator:
– a machine that outputs pairs of strings of the language. Thus the output is
a yes or no, and a pair of output strings.
• FST as transducer:
– A machine that reads a string and outputs another string.
• FST as set relater:
– A machine that computes relation between sets.

20
Finite-State Morphological Parsing
Morphological Parsing with FST

• A formal definition of FST (based on the Mealy machine extension to


a simple FSA):
– Q: a finite set of N states q0, q1,…, qN
– Σ: a finite alphabet of complex symbols. Each complex symbol is
composed of an input-output pair i : o; one symbol i from an input
alphabet I, and one symbol o from an output alphabet O, thus Σ ⊆ I×O. I
and O may each also include the epsilon symbol ε.
– q0: the start state
– F: the set of final states, F ⊆ Q
– δ(q, i:o): the transition function or transition matrix between states. Given
a state q ∈ Q and complex symbol i:o ∈ Σ, δ(q, i:o) returns a new state q’
∈ Q. δ is thus a relation from Q × Σ to Q.
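
To make the two-tape view concrete, here is a minimal Python sketch of running an FST as a transducer; the arc representation, the use of '' for ε, and the toy goose/geese machine are assumptions for illustration, not one of the book’s figures:

def transduce(arcs, start, finals, inp):
    # arcs: dict (state, input_symbol) -> list of (output_symbol, next_state); '' = epsilon
    results, stack = [], [(start, 0, "")]
    while stack:
        state, i, out = stack.pop()
        if i == len(inp) and state in finals:
            results.append(out)                   # whole input consumed in a final state
        for o, nxt in arcs.get((state, ""), []):  # epsilon on the input side
            stack.append((nxt, i, out + o))
        if i < len(inp):
            for o, nxt in arcs.get((state, inp[i]), []):
                stack.append((nxt, i + 1, out + o))
    return results

# Toy FST mapping lexical 'goose' to surface 'geese' via two o:e pairs
goose = {(0, 'g'): [('g', 1)], (1, 'o'): [('e', 2)], (2, 'o'): [('e', 3)],
         (3, 's'): [('s', 4)], (4, 'e'): [('e', 5)]}
print(transduce(goose, 0, {5}, "goose"))   # ['geese']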

21
Finite-State Morphological Parsing
Morphological Parsing with FST

• FSAs are isomorphic to regular languages, FSTs are isomorphic to


regular relations.
• Regular relations are sets of pairs of strings, a natural extension of the
regular language, which are sets of strings.
• FSTs are closed under union, but generally they are not closed under
difference, complementation, and intersection.
• Two useful closure properties of FSTs:
– Inversion: If T maps from I to O, then the inverse of T, T⁻¹, maps from O
to I.
– Composition: If T1 is a transducer from I1 to O1 and T2 a transducer from
O1 to O2, then T1 ∘ T2 maps from I1 to O2.

22
Finite-State Morphological Parsing
Morphological Parsing with FST

• Inversion is useful because it makes it easy to convert an FST-as-parser into an
FST-as-generator.
• Composition is useful because it allows us to take two transducers that run in series
and replace them with one more complex transducer.
– (T1 ∘ T2)(S) = T2(T1(S))
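
A minimal sketch of this "run in series" view, modelling each transducer here simply as a Python function on strings (an analogy only; real FST composition is performed on the machines themselves):

def compose(t1, t2):
    # (T1 o T2)(s) = T2(T1(s)): feed the output of the first machine to the second
    return lambda s: t2(t1(s))

shout = compose(str.strip, str.upper)   # toy stand-ins for two cascaded transducers
print(shout("  cats "))                 # 'CATS'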

Reg-noun Irreg-pl-noun Irreg-sg-noun

fox g o:e o:e s e goose
fat sheep sheep
fog m o:i u:ε s:c e mouse
aardvark

A transducer for English nominal number inflection Tnum

23
Finite-State Morphological Parsing
Morphological Parsing with FST

The transducer Tstems, which maps roots to their root-class

24
Finite-State Morphological Parsing
Morphological Parsing with FST

^: morpheme boundary
#: word boundary

A fleshed-out English nominal inflection FST


Tlex = Tnum ∘ Tstems
25
Finite-State Morphological Parsing
Orthographic Rules and FSTs
• Spelling rules (or orthographic rules)

Name Description of Rule Example


Consonant doubling 1-letter consonant doubled before -ing/-ed beg/begging
E deletion Silent e dropped before -ing and -ed make/making
E insertion e added after -s, -z, -x, -ch, -sh, before -s watch/watches
Y replacement -y changes to -ie before -s, -i before -ed try/tries
K insertion Verb ending with vowel + -c add -k panic/panicked

– These spelling changes can be thought of as taking as input a simple concatenation of
morphemes and producing as output a slightly-modified concatenation of
morphemes.

26
Finite-State Morphological Parsing
Orthographic Rules and FSTs

• “insert an e on the surface tape just when the lexical tape has a
morpheme ending in x (or z, etc.) and the next morpheme is -s”:

ε → e / {x, s, z} ^ ___ s #

• The general rule format a → b / c ___ d reads “rewrite a to b when it
occurs between c and d”.
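
As a rough illustration of what this rule does (a simplification in plain string rewriting, not the FST of the next figure, using the '^' morpheme-boundary and '#' word-boundary conventions introduced earlier in this chapter):

import re

def e_insertion(lexical):
    # insert e between a stem ending in x, s, or z and the -s suffix
    surface = re.sub(r'(x|s|z)\^s#', r'\1es#', lexical)
    return surface.replace('^', '').rstrip('#')

print(e_insertion("fox^s#"))   # foxes
print(e_insertion("cat^s#"))   # cats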

27
Finite-State Morphological Parsing
Orthographic Rules and FSTs

The transducer for the E-insertion rule

28
Combining FST Lexicon and Rules

29
Combining FST Lexicon and Rules

30
Combining FST Lexicon and Rules

• The power of FSTs is that the exact same cascade with the same state
sequences is used
– when machine is generating the surface form from the lexical tape, or
– When it is parsing the lexical tape from the surface tape.
• Parsing can be slightly more complicated than generation, because of
the problem of ambiguity.
– For example, foxes could be fox +V +3SG as well as fox +N +PL

31
Lexicon-Free FSTs: the Porter Stemmer

• Information retrieval
• One of the most widely used stemming algorithms is the simple
and efficient Porter (1980) algorithm, which is based on a series of
simple cascaded rewrite rules.
– ATIONAL → ATE (e.g., relational → relate)
– ING → ε if the stem contains a vowel (e.g., motoring → motor)
• Problem:
– Not perfect: errors of commission and omission
• Experiments have been made:
– Some improvement with smaller documents
– Any improvement is quite small
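
A tiny sketch of the cascaded-rewrite-rule idea using just the two rules quoted above (this is not the full Porter algorithm, and the regular expressions are an illustrative approximation):

import re

def toy_stem(word):
    word = re.sub(r'ational$', 'ate', word)      # ATIONAL -> ATE
    if re.search(r'[aeiou].*ing$', word):        # ING -> epsilon if the stem has a vowel
        word = re.sub(r'ing$', '', word)
    return word

print(toy_stem("relational"))   # relate
print(toy_stem("motoring"))     # motor
print(toy_stem("sing"))         # sing (no vowel before -ing, so it is left alone)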

32
Minimum Edit Distance

Natural Language Processing 1


Definition of Minimum Edit Distance
• Many NLP tasks are concerned with measuring how similar two strings are.

• Spell correction:
– The user typed “graffe”
– Which is closest: graf, grail, or giraffe?
• The word giraffe, which differs by only one letter from graffe, seems intuitively
more similar than, say, grail or graf.

• The minimum edit distance between two strings is defined as the minimum number
of editing operations (insertion, deletion, substitution) needed to transform one
string into another.

Natural Language Processing 2


Minimum Edit Distance: Alignment
• The minimum edit distance between intention and execution can be
visualized using their alignment.

• Given two sequences, an alignment is a correspondence between substrings of the two


sequences.

Natural Language Processing 3


Minimum Edit Distance

• If each operation has cost of 1


– Distance between them is 5

• If substitutions cost 2 (Levenshtein Distance)


– Distance between them is 8

Natural Language Processing 4


Other uses of Edit Distance in NLP
• Evaluating Machine Translation and speech recognition

R Spokesman confirms senior government adviser was shot


H Spokesman said the senior adviser was shot dead
S I D I

• Named Entity Extraction and Entity Coreference


– IBM Inc. announced today
– IBM profits
– Stanford President John Hennessy announced yesterday
– for Stanford University President John Hennessy

Natural Language Processing 5


The Minimum Edit Distance Algorithm
• How do we find the minimum edit distance?
– We can think of this as a search task, in which we are searching for the shortest path—a
sequence of edits—from one string to another.

• The space of all possible edits is enormous, so we can’t search naively.


– Most distinct edit paths end up in the same state, so rather than recomputing all those
paths, we could just remember the shortest path to a state each time we saw it.
– We can do this by using dynamic programming.
– Dynamic programming is the name for a class of algorithms that apply a table-driven
method to solve problems by combining solutions to sub-problems.

Natural Language Processing 6


Minimum Edit Distance between Two Strings
• For two strings
– the source string X of length n
– the target string Y of length m
• We define D(i,j) as the edit distance between X[1..i] and Y[1..j]
• i.e., the first i characters of X and the first j characters of Y
• The edit distance between X and Y is thus D(n,m)

Natural Language Processing 7


Dynamic Programming for
Computing Minimum Edit Distance
• We will compute D(n,m) bottom up, combining solutions to subproblems.
• Compute base cases first:
– D(i,0) = i
• a source substring of length i and an empty target string requires i deletes.
– D(0,j) = j
• a target substring of length j and an empty source string requires j inserts.
• Having computed D(i,j) for small i, j we then compute larger D(i,j) based on
previously computed smaller values.
• The value of D(i, j) is computed by taking the minimum of the three possible paths
through the matrix which arrive there:

Natural Language Processing 8


Dynamic Programming for
Computing Minimum Edit Distance

• If we assume the version of Levenshtein distance in which the insertions and
deletions each have a cost of 1, and substitutions have a cost of 2 (except that
substitution of identical letters has zero cost), the computation for D(i,j) becomes:

D(i,j) = min( D(i-1,j) + 1,  D(i,j-1) + 1,  D(i-1,j-1) + (2 if X(i) ≠ Y(j) else 0) )

Natural Language Processing 9


Minimum Edit Distance Algorithm

Natural Language Processing 10


Computation of Minimum Edit Distance between
intention and execution

Natural Language Processing 11


Computation of Minimum Edit Distance between
intention and execution

deletion
insertion
substitution

Natural Language Processing 12


Computation of Minimum Edit Distance between
intention and execution

Natural Language Processing 13


Computing Alignments
• Edit distance isn’t sufficient
– We often need to align each character of the two strings to each other

• We do this by keeping a “backtrace”

• Every time we enter a cell, remember where we came from

• When we reach the end,


– Trace back the path from the upper right corner to read off the alignment

Natural Language Processing 14


MinEdit with Backtrace

deletion
insertion
substitution

Natural Language Processing 15


MinEdit with Backtrace

Natural Language Processing 16


Adding Backtrace to
Minimum Edit Distance
• Base conditions:  D(i,0) = i   D(0,j) = j        Termination: D(N,M) is the distance

• Recurrence Relation:
  For each i = 1 … N
    For each j = 1 … M
                    D(i-1,j) + 1                                        (deletion)
      D(i,j) = min  D(i,j-1) + 1                                        (insertion)
                    D(i-1,j-1) + 2 if X(i) ≠ Y(j), + 0 if X(i) = Y(j)   (substitution)

                    LEFT  (insertion)
      ptr(i,j) =    DOWN  (deletion)
                    DIAG  (substitution)
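
The recurrence above translates directly into a small dynamic-programming routine. Here is a minimal Python sketch using the Levenshtein costs from these slides (backtrace pointers are omitted for brevity):

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                                  # base case: i deletions
    for j in range(1, m + 1):
        D[0][j] = j                                  # base case: j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,           # deletion
                          D[i][j - 1] + 1,           # insertion
                          D[i - 1][j - 1] + sub)     # substitution (0 if letters match)
    return D[n][m]

print(min_edit_distance("intention", "execution"))   # 8, as computed in these slides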

Natural Language Processing 17


Performance of
Minimum Edit Distance Algorithm

• Time: O(nm)

• Space: O(nm)

• Backtrace: O(n+m)

Natural Language Processing 18
