First-Order Logic (FOL): Limited Expressiveness
Propositional logic, while powerful and widely used, does have some drawbacks:
1. Limited Expressiveness: Propositional logic deals only with propositions (statements that
are either true or false) and does not capture the complexity of relationships between
objects or concepts. It lacks the ability to represent quantifiers like "for all" (∀) and "there
exists" (∃), which are essential in predicate logic.
2. Inability to Handle Ambiguity: Propositional logic cannot handle ambiguous statements
effectively. Ambiguity arises when a statement can be interpreted in multiple ways, and
propositional logic lacks the capacity to resolve such ambiguity.
3. Explosion of Complexity: In certain cases, especially when dealing with a large number
of variables or propositions, the number of possible combinations grows exponentially,
leading to what is known as the "combinatorial explosion." This explosion makes reasoning
and computation in propositional logic impractical for complex systems.
4. No Representation of Relationships: Propositional logic treats propositions as atomic
units and does not provide a way to represent relationships between propositions. For
example, it cannot express the concept of implication within a single proposition.
5. No Handling of Uncertainty or Probability: Propositional logic is deterministic,
meaning it assumes propositions are either true or false with no uncertainty. It lacks the
ability to model probabilistic relationships, which are crucial in many real-world scenarios.
6. Limited Scope in Real-world Applications: While propositional logic is useful for
representing simple relationships in domains like computer science and mathematics, it
often falls short in modeling the complexities of real-world scenarios where uncertainty,
ambiguity, and relationships are prevalent.
First-order logic (FOL), also known as predicate logic, offers several advantages over
propositional logic:
1. Expressiveness: FOL allows for the representation of complex relationships and structures
by introducing variables, quantifiers (such as ∀ for "for all" and ∃ for "there exists"), and
predicates. This expressiveness enables FOL to capture the richness of natural language
and real-world scenarios more effectively than propositional logic.
2. Quantification: FOL includes quantifiers, which allow statements to be made about entire
classes of objects or individuals, not just specific instances. This feature enables FOL to
reason about general properties and make universal statements, whereas propositional logic
can only make assertions about specific propositions.
3. Predicates and Functions: FOL allows the use of predicates to express relationships
between objects and functions to represent operations or transformations. This capability
makes FOL suitable for modeling a wide range of domains, including mathematics,
linguistics, and artificial intelligence.
4. Modularity and Reusability: FOL facilitates the modular representation of knowledge by
allowing the definition of reusable predicates and functions. This modularity enhances the
clarity and maintainability of logical systems, as well as the ability to reuse components
across different contexts.
5. Ability to Capture Complex Relationships: FOL can express complex relationships
between objects, including hierarchical structures, temporal relationships, and
dependencies. This capability enables FOL to represent and reason about real-world
phenomena more accurately and comprehensively than propositional logic.
6. Resolution of Ambiguity: FOL provides mechanisms for disambiguating statements
through the use of variables and quantifiers. Unlike propositional logic, which struggles
with ambiguity, FOL can handle more nuanced and context-dependent interpretations of
statements.
7. Soundness and Completeness: FOL has well-defined semantics and inference rules that
ensure soundness (correctness) and completeness (ability to derive all valid conclusions)
of reasoning processes. This property makes FOL a reliable framework for formalizing and
reasoning about knowledge.
Overall, first-order logic offers a significant advancement over propositional logic in terms of
expressiveness, representational power, and reasoning capabilities, making it a foundational tool
in various fields such as mathematics, computer science, philosophy, and linguistics.
ELEMENTS OF FIRST ORDER LOGIC
First-order logic (FOL), also known as predicate logic, consists of several fundamental elements:
1. Variables: Variables in FOL represent placeholders for objects or individuals in the
domain of discourse. They are typically denoted by lowercase letters such as x, y, z, etc.,
and can be quantified over using quantifiers like ∀ (for all) and ∃ (there exists).
2. Constants: Constants are specific objects or individuals in the domain of discourse. They
are represented by symbols and typically denoted by lowercase or uppercase letters like a,
b, c, etc. Unlike variables, constants do not vary and represent fixed elements.
3. Predicates: Predicates in FOL represent properties or relations that can be true or false of
objects in the domain. They are denoted by uppercase letters followed by a list of variables
or constants enclosed in parentheses. For example, P(x), Q(x,y), and R(a,b,c) are predicates.
S → NP VP [.80]
S → Aux NP VP [.15]
S → VP [.05]
NP → Det Nominal [.20]
NP → Proper-Noun [.35]
NP → Nominal [.05]
NP → Pronoun [.40]
Nominal → Noun [.75]
Nominal → Noun Nominal [.20]
Nominal → Proper-Noun Nominal [.05]
VP → Verb [.55]
VP → Verb NP [.40]
VP → Verb NP NP [.05]
Det → that [.05] | this [.80] | a [.15]
Noun → book [.10] | flight [.50] | meat [.40]
Verb → book [.30] | include [.30] | want [.40]
Aux → can [.40] | does [.30] | do [.30]
Proper-Noun → Houston [.10] | ASIANA [.20] | KOREAN AIR [.30] | CAAC [.25] | Dragon Air [.15]
Pronoun → you [.40] | I [.60]
These probabilities are not based on a corpus; they were made up merely for expository purposes.
Note that the probabilities of all of the expansions of a non-terminal sum to 1.
A PCFG assigns a probability to each parse-tree T of a sentence S. The probability of a particular
parse T is defined as the product of the probabilities of all the rules r used to expand each node n
in the parse tree:
P(T) = ∏_{n ∈ T} p(r(n))
For example, the sentence "Can you book ASIANA flights" is ambiguous:
one meaning is "Can you book flights on behalf of ASIANA", the other meaning is "Can you book flights run by ASIANA". The two parse trees are, respectively, as follows:
[Figure: the two parse trees. In the left tree, VP → Verb NP NP, treating "ASIANA" and "flights" as two separate NPs; in the right tree, VP → Verb NP, with "ASIANA flights" as a single NP.]
The probabilities of each rule in the left tree are:
S → Aux NP VP [.15]
NP → Pro [.40]
VP → V NP NP [.05]
NP → Nom [.05]
NP → Pnoun [.35]
Nom → Noun [.75]
Aux → can [.40]
NP → Pro [.40]
Pro → you [.40]
Verb → book [.30]
Pnoun → ASIANA [.40]
Noun → flights [.50]
… …
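To make the computation of P(T) concrete, here is a minimal Python sketch (our own illustration, not part of the text); the tree encoding and the rule probabilities simply restate the toy values listed above.

```python
# Rule probabilities for the left tree, as listed above (toy values).
rule_prob = {
    ("S", ("Aux", "NP", "VP")): 0.15,
    ("NP", ("Pro",)): 0.40,
    ("VP", ("Verb", "NP", "NP")): 0.05,
    ("NP", ("Nom",)): 0.05,
    ("NP", ("Pnoun",)): 0.35,
    ("Nom", ("Noun",)): 0.75,
    ("Aux", ("can",)): 0.40,
    ("Pro", ("you",)): 0.40,
    ("Verb", ("book",)): 0.30,
    ("Pnoun", ("ASIANA",)): 0.40,
    ("Noun", ("flights",)): 0.50,
}

def tree_prob(tree):
    """P(T) = product over every non-terminal node n of p(r(n)).

    A tree is a tuple (label, child1, child2, ...); a leaf child is a bare string.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

# The left tree for "Can you book ASIANA flights":
left_tree = ("S",
             ("Aux", "can"),
             ("NP", ("Pro", "you")),
             ("VP", ("Verb", "book"),
                    ("NP", ("Pnoun", "ASIANA")),
                    ("NP", ("Nom", ("Noun", "flights")))))
print(tree_prob(left_tree))   # the product of the eleven rule probabilities above
```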
In a rule A → B C, suppose A spans j words starting at position i, B spans the first k of these words, and C spans the remaining j − k words. Then the probability of A over its span is the product of the rule probability and the probabilities of B and C over their sub-spans. In this way we can compute the probability P(T) for each sub-tree, and the most probable parse of the sentence S is the tree that maximizes it:
T̂(S) = argmax_T P(T)
When a tree-bank is unavailable, the counts needed for computing PCFG probabilities can be generated by first parsing a corpus.
If the sentences were unambiguous, it would be very simple: parse the corpus, increment a counter for every rule used in each parse, and then normalize to get probabilities.
If the sentences are ambiguous, we need to keep a separate count for each parse of a sentence and weight each partial count by the probability of the parse it appears in.
Whether the preference is 67% or 52%, in a PCFG this attachment preference is purely structural and must be the same for all verbs.
However, the correct attachment here is to the verb. The verb "send" subcategorizes for a destination, which can be expressed with the preposition "into". This is a lexical dependency, and a PCFG cannot capture lexical dependencies.
--Coordination ambiguities:
Coordination ambiguities are another case in which information beyond the bare rules is the key to choosing the proper parse.
The phrase "dogs in houses and cats" is ambiguous: it can be bracketed as [dogs in houses] and [cats], or as dogs in [houses and cats].
[Figure: the two NP parses. Left tree: NP → NP Conj NP, coordinating "dogs in houses" with "cats". Right tree: NP → NP PP, with "houses and cats" coordinated inside the PP.]
Although the left tree is intuitively the correct one, a PCFG will assign the two trees identical probabilities, because both structures use exactly the same rules:
NP → NP Conj NP
NP → NP PP
NP → Noun
PP → Prep NP
Noun → dogs | houses | cats
Prep → in
Conj → and
In this case the PCFG assigns the two trees the same probability.
A PCFG thus has a number of inadequacies as a probabilistic model of syntax; we shall augment the PCFG to deal with these problems.
5.2.2 Probabilistic Lexicalized CFG
Charniak (1997) proposed a probabilistic representation of lexical heads; it is a kind of lexicalized grammar. In this representation, each non-terminal in a parse tree is annotated with a single word, its lexical head.
E.g., "Workers dumped sacks into a bin" can be represented as follows:
[Figure: lexicalized parse tree, with S(dumped) dominating NP(workers) and VP(dumped), down to the PP "into a bin".]
In this approach we treat a probabilistic lexicalized CFG like a normal, but very large, PCFG: we store a probability for each rule/head combination. E.g.
VP(dumped) → VBD(dumped) NP(sacks) PP(into) [3×10^-10]
VP(dumped) → VBD(dumped) NP(cats) PP(into) [8×10^-11]
VP(dumped) → VBD(dumped) NP(hats) PP(into) [4×10^-10]
VP(dumped) → VBD(dumped) NP(sacks) PP(above) [1×10^-12]
[Figure: an incorrect parse tree for "Workers dumped sacks into a bin", in which the PP "into a bin" attaches to NP(sacks) instead of to VP(dumped).]
If VP(dumped) expands to VBD NP PP, the tree is correct. If VP(dumped) expands to VBD NP (leaving the PP to attach to "sacks"), the tree is incorrect.
Let us compute both of these by counting in the Brown Corpus portion of the Penn Treebank.
The first rule is quite likely:
P(VP → VBD NP PP | VP, dumped) = C(VP(dumped) → VBD NP PP) / Σ_β C(VP(dumped) → β) = 6/9 = .67
The second rule never happens in the Brown Corpus. This is not surprising, since "dump" is a verb of caused motion into a new location:
P(VP → VBD NP | VP, dumped) = C(VP(dumped) → VBD NP) / Σ_β C(VP(dumped) → β) = 0/9 = 0
In practice this zero value would be smoothed somehow, but for now we just note that the first rule is preferred.
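A hedged sketch of how such a lexicalized rule probability could be estimated from expansion counts follows. The Counter values other than the 6 for VBD NP PP are made-up placeholders chosen only so that the total of 9 matches the figures quoted above; they are not real Treebank counts.

```python
from collections import Counter

# Hypothetical counts of VP(dumped) -> beta expansions (not real Treebank counts;
# only the total of 9 and the count of 6 for VBD NP PP follow the text).
vp_expansions = Counter({
    ("dumped", ("VBD", "NP", "PP")): 6,
    ("dumped", ("VBD", "PP")): 2,
    ("dumped", ("VBD",)): 1,
    ("dumped", ("VBD", "NP")): 0,
})

def p_rule_given_head(counts, head, rhs):
    """Maximum-likelihood estimate of P(VP -> rhs | VP, head)."""
    total = sum(c for (h, _), c in counts.items() if h == head)
    if total == 0:
        return 0.0                      # would be smoothed in practice
    return counts[(head, rhs)] / total

print(p_rule_given_head(vp_expansions, "dumped", ("VBD", "NP", "PP")))  # 6/9 = 0.67
print(p_rule_given_head(vp_expansions, "dumped", ("VBD", "NP")))        # 0/9 = 0.0
```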
The head probabilities can be counted in the same way.
In the correct parse, the PP node whose mother's head (X) is "dumped" has the head "into". In the incorrect parse, the PP node whose mother's head (X) is "sacks" has the head "into". We can use counts from the Brown portion of the Penn Treebank, where X stands for the mother's head:
P(into | PP, dumped) = C(X(dumped) → … PP(into) …) / Σ_β C(X(dumped) → … PP …) = 2/9 = .22
P(into | PP, sacks) = C(X(sacks) → … PP(into) …) / Σ_β C(X(sacks) → … PP …) = 0/0 = ?
Once again, the head probabilities correctly predict that “dumped” is more likely to be modified
by “into” than is “sacks”.
5.3 Human Parsing
In the last 20 years we have learned a lot about human parsing. Here we shall give a brief
overview of some recent results.
5.3.1 Ambiguity resolution in human parsing:
The human sentence processor is sensitive to lexical subcategorization preferences. For example, experimenters can ask people to read an ambiguous sentence and check off a box indicating which of the two interpretations they got first. The results are given after each sentence.
“The women kept the dogs on the beach”
The women kept the dogs which were on the beach. 5%
The women kept them (the dogs) on the beach. 95%
“The women discussed the dogs on the beach”
The women discussed the dogs which were on the beach. 90%
The women discussed them (the dogs) while on the beach. 10%
The results were that people preferred VP-attachment with "keep" and NP-attachment with "discuss".
This suggests that "keep" has a subcategorization preference for a VP with three constituents, (VP → V NP PP), while "discuss" has a subcategorization preference for a VP with two constituents, (VP → V NP), although both verbs still allow both subcategorizations.
5.3.2 Garden-path sentences:
The garden-path sentence is a specific class of temporarily ambiguous sentences.
For example, "The horse raced past the barn fell" ("barn" is a farm building for storing crops and food for animals).
[Fig.: Garden-path sentence 1. Parse (a) treats "raced" as the main verb of "the horse raced past the barn"; parse (b), the correct one, treats "raced past the barn" as a reduced relative clause modifying "the horse", with "fell" as the main verb.]
The garden-path sentences are the sentences which are cleverly constructed to have three
properties that combine to make them very difficult for people to parse:
They are temporarily ambiguous: the sentence is not ambiguous, but its initial portion is
ambiguous.
One of the two or three parses in the initial portion is somehow preferable to the human
parsing mechanism.
But the dispreferred parse is the correct one for the sentence.
The result of these three properties is that people are “led down the garden path” toward the
incorrect parse, and then are confused when they realize it is the wrong way.
More examples
“The complex houses married and single students and their families.”
[Figure: the two parses. In the garden-path reading "the complex houses" is taken as an NP; in the correct parse "the complex" is the subject NP and "houses" is the main verb.]
“The student forgot the solution was in the back of the book.”
[Figure: the two parses. In (a), "the solution" is taken as the direct object of "forgot"; in (b), the correct parse, "the solution was in the back of the book" is a sentential complement of "forgot".]
[Figure: log(MV/RR), the ratio of the main-verb to the reduced-relative reading. For "raced" the ratio (MV/RR = 387) is far above the threshold (MV/RR = 5), so the main-verb reading is pursued and the sentence garden-paths; for "found" the ratio is much lower.]
A feature structure is a set of feature-value pairs, written as an attribute-value matrix (AVM): [FEATURE1 VALUE1, …, FEATUREn VALUEn]. For example, 3sgNP can be illustrated by the AVM:
[cat NP, num sing, person 3]
3sgAux can be illustrated by the following AVM:
[cat Aux, num sing, per 3]
In feature structures, features are not limited to atomic symbols as their values; they can also have other feature structures as their values.
This is very useful when we wish to bundle a set of feature-value pairs together for similar treatment.
E.g., the features "num" and "per" are often lumped together, since grammatical subjects must agree with their predicates in both number and person. This lumping can be expressed by introducing a feature "agreement" that takes as its value a feature structure consisting of the number and person feature-value pairs.
The feature structure of 3sgNP with the feature "agreement" can be illustrated by the following AVM:
[cat NP, agreement [num sing, per 3]]
[Fig. 1: DAG for this feature structure. From the root, an arc labelled cat leads to NP and an arc labelled agreement leads to a node with per → 3 and num → sing.]
[Fig. 2: a feature structure with shared values. The <head subject agreement> path and the <head agreement> path lead to the same node [per 3, num sing].]
In Fig. 2, the <head subject agreement> path and the <head agreement> path lead to the same location; they share the feature structure [per 3, num sing].
Shared structure is denoted in the AVM by adding numerical indexes that signal the values to be shared:
[cat S, head [agreement ① [num sing, per 3], subject [agreement ①]]]
Such reentrant structures give us the ability to express linguistic knowledge in elegant ways.
4.1.3 Unification of feature structures
Feature structures are combined by unification. There are two principal operations in unification:
merging the information content of two structures that are compatible;
rejecting the merger of structures that are incompatible.
Following are the examples (symbol ∪ means unification):
(1) Compatible:
[num sing] ∪ [num sing] = [num sing]
(2) Incompatible:
[num sing] ∪ [num plur] = fails!
(3) Unification with an unspecified value:
[num sing] ∪ [num [ ]] = [num sing]
(4) Merger:
[num sing] ∪ [per 3] = [num sing, per 3]
(5) The reentrant structure:
[agreement ① [num sing, per 3], subject [agreement ①]] ∪ [subject [agreement [per 3, num sing]]]
= [agreement ① [num sing, per 3], subject [agreement ①]]
Because the agreement value is shared (①) between the <agreement> path and the <subject agreement> path, the second argument's information is compatible with the shared value and the result keeps the sharing.
[Figure: DAG representation of the two arguments and the result.]
(6) The copying capability of unification:
[agreement ①, subject [agreement ①]] ∪ [subject [agreement [per 3, num sing]]]
= [agreement ① [per 3, num sing], subject [agreement ①]]
Here the shared value ① is initially empty; unifying the subject's agreement with [per 3, num sing] copies that information into the shared value, so it appears under both the <agreement> path and the <subject agreement> path.
[Figure: DAG representation of the two arguments and the result.]
(7) Features that merely have similar values:
In the following example there is no sharing index linking the "agreement" feature and <subject agreement>, so the information [per 3] is not added to the value of the "agreement" feature:
[agreement [num sing], subject [agreement [num sing]]] ∪ [subject [agreement [per 3, num sing]]]
= [agreement [num sing], subject [agreement [num sing, per 3]]]
In the result, [per 3] is added only at the end of the <subject agreement> path; it is not added at the end of the <agreement> path (the first line in the AVM of the result). Therefore the value of "agreement" is only [num sing], without [per 3].
[Figure: DAG representation of the two arguments and the result.]
(8) The failure of unification:
[agreement ① [num sing, per 3], subject [agreement ①]] ∪ [agreement [num sing, per 3], subject [agreement [num plur, per 3]]] = fails!
The first argument requires <agreement> and <subject agreement> to be the same structure [num sing, per 3], but the second argument's subject agreement is [num plur, per 3]; sing and plur cannot be unified, so the whole unification fails.
[Figure: DAG representation of the two arguments.]
Feature structures are a way of representing partial information about some linguistic object, or of placing informational constraints on what the object can be. Unification can be seen as a way of merging the information in two feature structures, or of describing objects that satisfy both sets of constraints.
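As a rough illustration of the merge/fail behaviour described above, here is a minimal Python sketch of unification over feature structures encoded as nested dicts (atomic values as strings). It is our own simplification: it does not model reentrancy (the shared ① values), which the DAG-based algorithm below handles with pointer fields.

```python
def unify(f1, f2):
    """Return the unification of two feature structures, or None on failure.

    Feature structures are nested dicts; atomic values are strings; the empty
    dict {} plays the role of the unspecified value [ ].
    """
    if f1 == f2:
        return f1
    if f1 == {}:            # [ ] carries no information
        return f2
    if f2 == {}:
        return f1
    if isinstance(f1, str) or isinstance(f2, str):
        return None         # two different atoms, or an atom against a complex value
    result = dict(f1)
    for feat, val in f2.items():
        if feat in result:
            sub = unify(result[feat], val)
            if sub is None:
                return None                 # incompatible values: unification fails
            result[feat] = sub
        else:
            result[feat] = val              # merge in the extra feature
    return result

print(unify({"num": "sing"}, {"num": "sing"}))   # {'num': 'sing'}
print(unify({"num": "sing"}, {"num": "plur"}))   # None  (fails)
print(unify({"num": "sing"}, {"num": {}}))       # {'num': 'sing'}
print(unify({"num": "sing"}, {"per": "3"}))      # {'num': 'sing', 'per': '3'}
```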
4.1.4 Subsumption
Intuitively, unifying two feature structures produces a new feature structure that is more
specific (has more information) than, or is identical to, either of the input feature structure. We say
that a less specific (more abstract) feature structure subsumes an equally or more specific one.
Formally, a feature structure F subsumes a feature structure G if and only if:
for every feature x in F, G also contains x and F(x) subsumes G(x) (where F(x) means "the value of the feature x of feature structure F");
for all paths p and q in F such that F(p) = F(q), it is also the case that G(p) = G(q).
E.g.:
(2) [per 3]
    [agreement ①, subject [agreement ①]]
(5) [cat VP, agreement ①]
(6) [cat VP, agreement ①, …]
We have: (3) subsumes (5), (4) subsumes (5), (5) subsumes (6), and (4) and (5) subsume (6).
Subsumption is a partial ordering: there are pairs of feature structures that neither subsume nor are
subsumed by each other:
(1) does not subsume (2),
(2) does not subsume (1),
(3) does not subsume (4),
(4) does not subsume (3).
Since every feature structure is subsumed by the empty structure [], the relation among feature
structures can be defined as a semi-lattice. The semi-lattice is often represented pictorially with the
most general feature [ ] at the top and the subsumption relation represented by lines between
feature structures.
[Figure: the subsumption semi-lattice, with the most general structure [ ] at the top, (3) and (4) below it, (5) below them, and (6) at the bottom; moving downward adds information.]
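Under the same simplified dict encoding used in the unification sketch above (and again ignoring the reentrancy clause of the definition), subsumption can be checked as follows; this is our own illustration, not the book's algorithm.

```python
def subsumes(f, g):
    """True if feature structure f subsumes g, i.e. g contains at least the
    information in f. Structures are nested dicts with string atoms; the
    path-equality (reentrancy) clause of the formal definition is not modelled.
    """
    if isinstance(f, str) or isinstance(g, str):
        return f == g
    return all(feat in g and subsumes(val, g[feat]) for feat, val in f.items())

print(subsumes({"num": "sing"}, {"num": "sing", "per": "3"}))   # True: less specific subsumes more specific
print(subsumes({"num": "sing", "per": "3"}, {"num": "sing"}))   # False
print(subsumes({"num": "sing"}, {"per": "3"}))                  # False: neither subsumes the other
print(subsumes({}, {"num": "sing"}))                            # True: [ ] subsumes every structure
```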
Aux → do
(Aux agreement num) = plur
(Aux agreement per) = 3
Aux → does
(Aux agreement num) = sing
(Aux agreement per) = 3
Determiner → this
(Det agreement num) = sing
Determiner → these
(Det agreement num) = plur
Verb → serves
(Verb agreement num) = sing
Verb → serve
(Verb agreement num) = plur
Noun → flight
(Noun agreement num) = sing
Noun → flights
(Noun agreement num) = plur
Non-lexical constituents can acquire values for at least some of their features from their component constituents.
VP → Verb NP
(VP agreement) = (Verb agreement)
The constraints on "VP" come from the constraints on "Verb".
Nominal → Noun
(Nominal agreement) = (Noun agreement)
The constraints on "Nominal" come from the "Noun".
NP → Det Nominal
(Det agreement) = (Nominal agreement)
(NP agreement) = (Nominal agreement)
Nominal → Noun
(Nominal agreement) = (Noun agreement)
The verb is the head of the VP, the nominal is the head of the NP, and the noun is the head of the nominal.
In these rules, the constituent providing the agreement feature structure to the parent is the head of the phrase. We can say that the agreement feature structure is a head feature.
We can rewrite our rules by placing the agreement feature structure under a HEAD feature and then copying that feature upward:
VP → Verb NP
(VP head) = (Verb head)
NP → Det Nominal
(Det head agreement) = (Nominal head agreement)
Det and Nominal are at the same level; their <head agreement> values are equated.
(NP head) = (Nominal head)
Nominal → Noun
(Nominal head) = (Noun head)
The concept of a head is very significant in grammar, because it provides a way for a syntactic rule to be linked to a particular word.
4.2.4 Sub-categorization
4.2.4.1 An atomic feature SUBCAT:
The following is a rule with complex features:
Verb-with-S-comp → think
VP → Verb-with-S-comp S
We have to subcategorize verbs into subcategories, so we need an atomic feature called SUBCAT.
Opaque approach:
Lexicon:
Verb → serves
<Verb head agreement num> = sing
<Verb head subcat> = trans
Rules:
VP → Verb
<VP head> = <Verb head>
<VP head subcat> = intrans
VP → Verb NP
<VP head> = <Verb head>
<VP head subcat> = trans
VP → Verb NP NP
<VP head> = <Verb head>
<VP head subcat> = ditrans
In these rules, the value of SUBCAT is unanalyzable: it does not directly encode either the number or the type of the arguments that the verb expects to take. This approach is therefore somewhat opaque.
Elegant approach:
A more elegant approach makes better use of the expressive power of feature structures and allows the verb entries to directly specify the order and category type of the arguments they require.
The verb's SUBCAT feature expresses a list of its objects and complements.
Lexicon:
Verb → serves
<Verb head agreement num> = sing
<Verb head subcat first cat> = NP
<Verb head subcat second> = end
Verb → leaves
<Verb head agreement num> = sing
<Verb head subcat first cat> = NP
<Verb head subcat second cat> = PP
<Verb head subcat third> = end
E.g., "We leave Seoul in the morning."
Rules:
VP → Verb NP
<VP head> = <Verb head>
<VP head subcat first cat> = <NP cat>
<VP head subcat second> = end
4.2.4.2 Sub-categorization frame
The sub-categorization frame can be composed of many different phrase types.
Sub-categorization of verbs:
Each verb allows many different subcategorization frames. For example, the verb "ask" allows the following subcategorization frames:
Subcat: Example
Sfin It was apparent [Sfin that the kitchen was the only room…]
PP It was apparent [PP from the way she rested her hand over his]
Swheth It is unimportant [Swheth whether only a little bit is accepted]
Sub-categorization of noun
Subcat: Example
Sfin the assumption [Sfin that wasteful methods have been employed]
Swheth the question [Swheth whether the authorities might have decided]
4.2.5 Long-Distance Dependencies
Sometimes a constituent subcategorized for by the verb is not locally instantiated, but is in a long-distance relationship with its predicate.
For example, in the following sentence:
Which flight do you want me to have the travel agent book?
"which flight" is the object of "book"; there is a long-distance dependency between them.
Representing such long-distance dependencies is a very difficult problem, because the verb whose subcategorization requirement is being filled can be quite distant from the filler.
Many solutions for representing long-distance dependencies have been proposed in unification grammars.
One solution is called the "gap list". The gap list implements a list as a feature GAP, which is passed up from phrase to phrase in the parse tree. The filler (e.g. "which flight") is put on the gap list and must eventually be unified with the subcategorization frame of some verb.
In the extended DAG representation used by the unification algorithm, each DAG node has two fields: a content field and a pointer field. The content field may be null or contain an ordinary feature structure; similarly, the pointer field may be null or contain a pointer to another feature structure.
[Fig. 5: the feature structure [num sing, per 3] in the extended DAG notation. Every node carries a PTR (pointer) field, initially null, and a CT (content) field.]
[Figures: step-by-step extended-DAG pictures of a unification. To merge the arguments destructively, the algorithm sets the pointer field of the second argument to point at the first one, so that both arguments come to share the same merged content.]
[Figure: the extended-DAG representation of a larger feature structure with cat S and a head containing subject and agreement features, where the agreement value [num sing, per 3] is shared (①).]
The earlier unification examples can be drawn in the extended DAG notation:
(2) Compatible feature structures: [num sing] ∪ [num sing] = [num sing]
[Figure: the original arguments and the result as extended DAGs.]
(3) Incompatible: [num sing] ∪ [num plur] = fails!
[Figure: the two arguments; the atomic contents sing and plur cannot be merged.]
(4) Unification with an unspecified value: [num sing] ∪ [num [ ]] = [num sing]
[Figure: the two arguments and the result.]
(5) Merger: [num sing] ∪ [per 3] = [num sing, per 3]
[Figure: the two arguments and the result.]
(6) The reentrant structure in extended DAG notation:
[agreement ① [num sing, per 3], subject [agreement ①]] ∪ [subject [agreement [per 3, num sing]]]
[Figure: the two arguments and the result as extended DAGs; the shared node is reached from both the <agreement> and the <subject agreement> paths.]
(7) The copying capability of unification:
[agreement ①, subject [agreement ①]] ∪ [subject [agreement [per 3, num sing]]] = [agreement ① [per 3, num sing], subject [agreement ①]]
[Figure: the two arguments and the result; the information copied into the shared node becomes visible from both paths.]
(8) Features that merely have similar values, in extended DAG notation:
[agreement [num sing], subject [agreement [num sing]]] ∪ [subject [agreement [per 3, num sing]]]
[Figure: the two arguments and the result; because there is no shared node, [per 3] ends up only under the <subject agreement> path.]
(9) The failure of unification, in extended DAG notation:
[Figure: the two arguments. The shared agreement node [num sing, per 3] of the first argument must unify with a subject agreement containing [num plur, per 3] in the second, and the attempt to unify sing with plur fails.]
= fails!
These original arguments are neither identical, nor atomic, nor null, so the main loop is entered. Looping over the features of f2, the algorithm is led to a recursive attempt to unify the values of the corresponding "subject" features of f1 and f2:
[agreement ①] ∪ [agreement [per 3]]
These arguments are also non-identical, non-atomic and non-null, so the loop is entered again, leading to a recursive check of the values of the "agreement" features:
[num sing] ∪ [per 3]
In looping over the features of the second argument, the fact that the first argument lacks a "per" feature is discovered. A "per" feature initialized with a null value is added to the first argument. This changes the previous unification to the following:
[num sing, per null] ∪ [per 3]
After adding this new "per" feature, the next recursive call leads to the unification of the null value of the new feature in the first argument with the 3 value of the second argument. This recursive call results in the assignment of the pointer field of the first argument's null value to the 3 value in f2.
[Figure: the extended DAGs after this pointer assignment.]
Since there are no further features to check in the f2 argument at any level of recursion, each recursive call in turn sets the pointer field of its f2 argument to point at its f1 argument and returns. The result is:
[Figure: the final extended DAG, in which the f2 nodes forward to the corresponding f1 nodes via their pointer fields, and the merged structure is [agreement ① [num sing, per 3], subject [agreement ①]].]
A rule such as S → NP VP together with its unification constraints can itself be written as a single AVM:
[S [head ①], NP [head [agreement ②]], VP [head ① [agreement ②]]]
This AVM can be represented by a DAG, so we can use the AVM to stand for the DAG.
[Figure: Dag1, an NP edge whose head is ①; an edge of this kind whose rule is complete is an inactive edge.]
First problem: nothing in the formalism prevents ill-formed values such as [num feminine] for an English noun; there is no way to constrain the possible values of a feature.
This problem has caused many unification-based grammatical theories to add various mechanisms to try to constrain the possible values of a feature, e.g.
FUG (Functional Unification Grammar, Kay, 1979), LFG (Lexical Functional Grammar, Bresnan, 1982), GPSG (Generalized Phrase Structure Grammar, Gazdar et al., 1985), and HPSG (Head-Driven Phrase Structure Grammar, Pollard et al., 1994).
Second problem: in plain feature structures there is no way to capture generalizations across them. For example, the many types of English verb phrases share many features, as do the many kinds of subcategorization frames for verbs.
A general solution to both of these problems is the use of types.
A type system for unification grammar has the following characteristics:
Each feature structure is labeled by a type.
Each type has appropriateness conditions expressing which features are appropriate for it.
The types are organized into a type hierarchy, in which more specific types inherit the properties of more abstract ones.
The unification operation is modified to unify the types of feature structures in addition to unifying the attributes and values.
In such typed feature structure systems, types are a new class of objects, just like attributes and
values for standard feature structures.
There are two kinds of types:
1. Simple types (atomic types): It is an atomic symbol like sg or pl, and replaces the simple
atomic values used in standard feature structures.
All types are organized into a multiple-inheritance type hierarchy (a partial order or lattice).
The following is the type hierarchy for a new type agr, which will be the type of the kind of atomic object that can be the value of an AGREEMENT feature:
[Figure: the agr type hierarchy. 1st, 3rd, sg and pl are subtypes of agr, and types such as 3sg and 1pl are their most specific common subtypes.]
The unification of two types is their most general common subtype, a type at least as specific as both inputs. Thus:
3rd ∪ sg = 3sg
1st ∪ pl = 1pl
1st ∪ agr = 1st
3rd ∪ 1st = ⊥ (undefined, the fail type)
2. Complex types: The complex types specify:
A set of features that are appropriate for that type.
Restrictions on the values of those features (expressed in terms of types).
Equality constraints between the values.
For example, the complex type verb represents agreement and verb morphological-form information.
A definition of verb would declare two appropriate features:
AGREE, which takes values of the type agr defined above;
VFORM, which takes values of type vform, which subsumes seven subtypes: finite, infinitive, gerund, base, present-participle, past-participle, passive-participle.
Thus verb would be defined as follows:
[verb, AGREE agr, VFORM vform]
The type noun might be defined with the AGREE feature, but without the VFORM feature:
[noun, AGREE agr]
The unification of typed feature structures:
[verb, AGREE 1st, VFORM gerund] ∪ [verb, AGREE sg, VFORM gerund] = [verb, AGREE 1sg, VFORM gerund]
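A small sketch of type unification over a hierarchy like the agr hierarchy above. The concrete subtypes 3sg, 1sg, 3pl and 1pl are assumptions (the figure shows only agr and its four immediate subtypes), added so that results such as 3rd ∪ sg = 3sg can be reproduced; the names and code are ours.

```python
# parent links of an assumed agr hierarchy (agr at the top)
PARENTS = {"agr": set(),
           "1st": {"agr"}, "3rd": {"agr"}, "sg": {"agr"}, "pl": {"agr"},
           "1sg": {"1st", "sg"}, "1pl": {"1st", "pl"},
           "3sg": {"3rd", "sg"}, "3pl": {"3rd", "pl"}}

def supertypes(t):
    """All supertypes of t, including t itself."""
    result = {t}
    for p in PARENTS[t]:
        result |= supertypes(p)
    return result

def unify_types(t1, t2):
    """The most general common subtype of t1 and t2, or None (the fail type)."""
    common = [t for t in PARENTS if t1 in supertypes(t) and t2 in supertypes(t)]
    for c in common:
        if all(c in supertypes(other) for other in common):
            return c
    return None

print(unify_types("3rd", "sg"))    # 3sg
print(unify_types("1st", "pl"))    # 1pl
print(unify_types("1st", "agr"))   # 1st
print(unify_types("3rd", "1st"))   # None  (the fail type)
```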
Complex types are also part of the type hierarchy. Subtypes of complex types inherit all the features of their parents, together with the constraints on the values. The following is a small part of this hierarchy for the sentential complements of verbs (Sanfilippo, 1993):
[Figure: part of the hierarchy. comp-cat splits into trans-comp-cat and intrans-comp-cat, with subtypes such as tr-sfin-comp-cat, tr-swh-comp-cat, tr-sbase-comp-cat, intr-sfin-comp-cat, intr-swh-comp-cat, intr-sbase-comp-cat, intr-sinf-comp-cat, sinf-comp-cat, sfin-comp-cat, swh-comp-cat and sbase-comp-cat.]
Ex:
tr-swh-comp-cat: "Ask yourself whether you have become better informed."
intr-swh-comp-cat: "Monsieur asked whether I wanted to ride."
It is possible to represent the whole phrase structure rule as a type. Sag and Wasow (1999) take a
type phrase which has a feature called DTRS (daughters), whose value is a list of phrases. The
phrase “I love Seoul” could have the following representation (showing only the daughter
feature):
[phrase,
 DTRS ⟨ [CAT PRO, ORTH I],
        [phrase, CAT VP,
         DTRS ⟨ [CAT V, ORTH LOVE], [CAT NP, ORTH SEOUL] ⟩ ] ⟩ ]
The resulting typed feature structures place constraints on which type of values a given feature
can take, and can also be organized into a type hierarchy. In this case, the feature structures can be
well typed.
Parsing with PSG
3.1 Bottom-Up Parsing
1. S → NP VP
2. S → Aux NP VP
3. S → VP
4. NP → Det Nominal
5. Nominal → Noun
6. Nominal → Noun Nominal
7. Nominal → Nominal PP
8. NP → Proper-Noun
9. VP → Verb
10. VP → Verb NP
11. Det → that | this | a
12. Noun → book | flight | meat | money
13. Verb → book | include | prefer
14. Aux → does
15. Prep → from | to | on
16. Proper-Noun → Houston | ASIANA | KOREAN AIR | CAAC | Dragon Air
Using this PSG to parse the sentence "Book that flight", the correct parse tree that would be assigned to this sentence is as follows:
[Fig. 1: the parse tree. S dominates VP; VP → Verb NP with Verb = "Book"; NP → Det Nominal with Det = "that" and Nominal → Noun = "flight".]
Regardless of the search algorithm we choose, there are two kinds of constraints that should
help guide the search.
Constraint coming from the data: the final parse tree must have three leaves, the three words of the input sentence: "book", "that", "flight".
Constraint coming from the grammar: the final parse tree must have one root: S (start
symbol).
These two constraints give rise to the two search strategies:
Bottom-up search (or data-directed search)
Top-down search (or goal-directed search)
3.1.3 Bottom-Up Parsing
In bottom-up parsing, the parser starts with the words of the input and tries to build trees from the words up. The parsing is successful if the parser succeeds in building a tree rooted in the start symbol S that covers all of the input.
Example:
We use the small PSG above to parse the sentence "Book that flight" bottom-up.
First ply: the words of the input: Book that flight.
Second ply: each word is assigned its possible lexical categories ("book" is ambiguous between Noun and Verb): Noun Det Noun, and Verb Det Noun.
The following plies build Nominal, NP and VP constituents over these categories; in the sixth ply a root S covers all of the input, and the bottom-up parse succeeds.
We can use the shift-reduce algorithm to do this parsing.
The shift-reduce algorithm uses a stack. Its operations are shift, reduce, reject and accept. In a shift, the next input symbol is moved onto the top of the stack. In a reduce, the symbols on top of the stack are replaced by the LHS of a grammar rule whose RHS matches them. If the whole input string has been processed and the symbol on top of the stack is S (the start symbol of the grammar), the input string is accepted; otherwise it is rejected.
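A minimal backtracking shift-reduce recognizer in Python, as a sketch of the procedure just described (our own illustration; it assumes the grammar has no unary rule cycles and makes no attempt to be efficient).

```python
def shift_reduce(words, rules, start="S"):
    """Backtracking shift-reduce recognition.

    rules: list of (LHS, RHS-tuple) pairs; lexical rules use the word itself
    as the RHS, e.g. ("Verb", ("book",)) or ("VP", ("Verb", "NP")).
    The input is accepted when it is consumed and the stack is exactly [start].
    """
    def search(stack, remaining):
        if not remaining and stack == [start]:
            return True                              # accept
        # try every reduction whose RHS matches the top of the stack
        for lhs, rhs in rules:
            k = len(rhs)
            if k <= len(stack) and tuple(stack[-k:]) == rhs:
                if search(stack[:-k] + [lhs], remaining):
                    return True
        # otherwise (or if all reductions lead to dead ends) shift the next word
        return bool(remaining) and search(stack + [remaining[0]], remaining[1:])
    return search([], list(words))

# the relevant part of the small PSG above, restated as (LHS, RHS) pairs
grammar = [("S", ("VP",)), ("S", ("NP", "VP")),
           ("NP", ("Det", "Nominal")), ("Nominal", ("Noun",)),
           ("VP", ("Verb",)), ("VP", ("Verb", "NP")),
           ("Det", ("that",)), ("Noun", ("flight",)),
           ("Verb", ("book",)), ("Noun", ("book",))]
print(shift_reduce(["book", "that", "flight"], grammar))   # True
```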
Following is the shift-reduce process of sentence “Book that flight”
Stack Operation the rest part of input string
Book that flight
++Book shift that flight
Noun reduce by rule12 that flight
Noun that shift flight
Noun Det reduce by rule 11 flight
Noun Det flight shift φ
Noun Det Noun reduce by rule 12 φ
Noun Det Nominal reduce by rule 5 φ
Noun NP reduce by rule 4 φ
[Backtracking to ++]
+++ Verb reduce by rule 13 that flight
VP reduce by rule 9 that flight
VP that shift flight
VP Det reduce by rule 11 flight
VP Det flight shift φ
VP Det Noun reduce by rule 12 φ
VP Det Nominal reduce by rule 5 φ
VP NP reduce by rule 4 φ
[Backtracking to +++]
Verb that shift flight
Verb Det reduce by rule 11 flight
Verb Det flight shift φ
Verb Det Noun reduce by rule 12 φ
Verb Det Nominal reduce by rule 5 φ
Verb NP reduce by rule 4 φ
VP reduce by rule 10 φ
S reduce by rule 3 φ
[Success !]
3.2 Top-Down Parsing
3.2.1 The process of Top-Down Parsing
A top-down parser searches for a parse tree by trying to build it from the root node S down to the leaves. The algorithm starts with the symbol S. The next step is to find the tops of all trees that can start with S, then expand the constituents in the new trees, and so on.
If we use the small PSG above to parse the sentence "Book that flight" top-down, the search proceeds ply by ply:
[Fig. 3: top-down parsing, ply by ply. First ply: S. Second ply: S is expanded to NP VP, Aux NP VP, and VP. Third ply: each of these is expanded further. The branch S → VP with VP → Verb NP eventually matches Verb = "Book", Det = "that" and Nominal → Noun = "flight", and the parse succeeds.]
The search process of the sentence ”book that flight”:
Searching goal Rule The rest part of input string
++S Book that flight
+NP VP 1 Book that flight
Det Nom VP 4 Book that flight
[backtracking to +]
PropN VP 8 Book that flight
[backtracking to ++]
Aux NP VP 2 Book that flight
[backtracking to ++] Book that flight
+++VP 3 Book that flight
Verb 9 Book that flight
φ that flight
[backtracking to +++] Book that flight
++++Verb NP 10 Book that flight
PropN 8 that flight
[backtracking to ++++]
Det Nominal 4 that flight
+++++ Nominal flight
++++++Nominal PP 7 flight
Noun Nominal PP 6 flight
Nominal PP φ
[backtracking to ++++++]
Noun PP 5 flight
PP φ
[backtracking to +++++]
Noun Nominal 6 flight
Nominal φ
[backtracking to +++++]
Noun 5 flight
φ φ
[Success ! ]
3.2.2 Comparing Top-Down and Bottom-Up Parsing
The top-down strategy never wastes time exploring trees that cannot result in an S, since it begins by generating just those trees; it never explores subtrees that cannot find a place in some S-rooted tree. By contrast, in the bottom-up strategy trees that have no hope of leading to an S are generated with wild abandon, wasting effort.
The top-down strategy, however, spends considerable effort on S trees that are not consistent with the input; it can generate trees before ever examining the input. The bottom-up strategy never suggests trees that are not locally grounded in the actual input.
Neither approach by itself adequately exploits the constraints presented by the grammar and the input words.
3.3.3 A basic Top-Down parser
3.3.3.1 left-corner: We call the first word along the left edge of a derivation the left-corner of the
tree.
[Fig. 5: filtering the expansions of S by their left-corner. Only S → VP (via VP → Verb NP, Verb → book) can have "Book" as its left-corner; the expansions S → NP VP (left-corner Det) and S → Aux NP VP (left-corner Aux) fail.]
Verb is the left-corner of S for this input; "Det" and "Aux" cannot match "Book".
Second ply:
[Fig. 6: VP → Verb NP, with NP expanded via its left-corner; Det is the left-corner of NP and matches "that".]
Third ply:
[Fig. 7: NP → Det Nominal; Noun is the left-corner of Nominal and matches "flight".]
3.3.1 Left-recursion:
A top-down, depth-first, left-to-right parser may dive down an infinitely deep search path and never return to visit the rest of the search space if it uses a left-recursive grammar.
The most obvious and common case of left-recursion in natural-language grammars involves immediately left-recursive rules: rules of the form A → A β, where the first constituent of the RHS is identical to the LHS.
E.g.:
NP → NP PP
VP → VP PP
S → S and S
E.g., if we add the left-recursive rule NP → NP PP as the first rule of our small grammar, we may get an infinite search:
[Fig. 8: the infinite expansion S → NP VP, NP → NP PP, NP → NP PP, … which never consumes any input.]
3.3.2 Structure ambiguity
Structural ambiguity occurs when the grammar assigns more than one possible parse to a sentence.
Three common kinds of structural ambiguity are attachment ambiguity, coordination ambiguity and noun-phrase bracketing ambiguity.
3.3.2.1 Attachment ambiguity:
3.3.2.1.1 PP attachment ambiguity:
E.g.
1) They made a report about the ship.
On the ship, they made a report.
They made a report on the ship.
[Fig. 9: PP attachment ambiguity for "They made a report on the ship". In the left tree the PP "on the ship" attaches to the verb (VP → V NP PP), i.e. the PP modifies V; in the right tree it attaches inside the object NP, i.e. the PP modifies the Nominal.]
2) They made a decision concerning the boat.
On the boat, they made a decision.
They made a decision on the boat.
3) He drove the car which was near the post office.
Near the post office, he drove the car.
He drove the car near the post office.
4) They are walking around the lake which is situated in the park.
In the park, they are walking around the lake.
They are walking around the lake in the park.
5) He shot at the man who was with a gun.
With a gun, he shot at the man.
He shot at the man with a gun.
6) The policeman arrested the thief who was in the room.
In the room, the policeman arrested the thief.
The policeman arrested the thief in the room.
Church and Patil (1982) showed that the number of parses for sentences of this type grows at the same rate as the number of parenthesizations of arithmetic expressions. Such parenthesization problems are known to grow exponentially, in accordance with what are called the Catalan numbers:
C(n) = (1/(n+1)) · (2n choose n) = (1/(n+1)) · [2n(2n−1)…(n+1)] / n!
The following table shows the number of parses for a simple noun phrase as a function of the number of trailing prepositional phrases. We can see that this kind of ambiguity can very quickly make it imprudent to keep every possible parse around.
[Fig. 10: number of parses vs. number of trailing PPs.]
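A quick way to reproduce the Catalan numbers referenced above (our own illustration):

```python
from math import comb

def catalan(n):
    """C(n) = (1/(n+1)) * (2n choose n): e.g. the number of ways to fully
    bracket an expression of n+1 terms."""
    return comb(2 * n, n) // (n + 1)

for n in range(1, 9):
    print(n, catalan(n))   # 1, 2, 5, 14, 42, 132, 429, 1430
```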
3.3.2.1.2 Gerundive attachment ambiguity:
E.g. "We saw the Eiffel tower flying to Paris."
The gerundive phrase "flying to Paris" can modify "saw" as an adverbial, or it can be the predicate of the small clause "the Eiffel tower flying to Paris".
3.3.2.1.3 local ambiguity
Local ambiguity occurs when some part of a sentence is ambiguous even though the whole sentence is not. E.g., the sentence "Book that flight" is unambiguous, but when the parser sees the first word "book" it cannot know whether it is a verb or a noun until later. Thus it must use backtracking or parallelism to consider both possible parses.
Noun-phrase bracketing ambiguity: [Adj [N1 N2]] vs. [[Adj N1] N2].
E.g.
1) The salesman who sells old cars is busy.
The old salesman who sells cars is busy.
The old car salesman is busy.
2) He is a Department Head, who is from England.
He is Head of the English Department.
He is an English Department Head.
3.3.3 Inefficient re-parsing of subtrees
The parser often builds valid trees for portions of the input and then discards them during backtracking, only to find that it has to rebuild them again. This re-parsing of subtrees is inefficient.
E.g., for the noun phrase "a flight from Beijing to Seoul on ASIANA", the top-down parse proceeds as follows:
[Figure: the successive partial parses. First NP → Det Nominal covering only "a flight", then NP → NP PP covering "a flight from Beijing", then NP → NP PP PP, and finally NP → NP PP PP PP covering the whole phrase.]
Because of the way the rules are consulted in our top-down parsing, the parser is led first to small parse trees that fail because they do not cover all of the input. These successive failures trigger backtracking events which lead to parses that incrementally cover more and more of the input. During the backtracking, work is duplicated many times: except for its topmost component, every part of the final tree is derived more than once.
Component reduplication times
A flight 4
From Beijing 3
To Seoul 2
On ASIANA 1
A flight from Beijing 3
A flight from Beijing to Seoul 2
A flight from Beijing to Seoul on ASIANA 1
[Fig. 12: the fundamental rule of chart parsing. An active edge spanning [i, j] (e.g. NP → Det . Nominal) combines with a completed (inactive) edge spanning [j, k] for the category after its dot, yielding a new edge spanning [i, k].]
The state sequence in the chart while parsing "book that flight" with our small grammar:
Chart [0]
γ → .S [0,0] Dummy start state
S → .NP VP [0,0] Predictor
NP → .Det Nominal [0,0] Predictor
NP → .Proper-Noun [0,0] Predictor
S → .Aux NP VP [0,0] Predictor
S → .VP [0,0] Predictor
VP → .Verb [0,0] Predictor
VP → .Verb NP [0,0] Predictor
Chart [1]
Verb → book. [0,1] Scanner
VP → Verb. [0,1] Completer
S → VP. [0,1] Completer
VP → Verb. NP [0,1] Completer
NP → .Det Nominal [1,1] Predictor
NP → .Proper-Noun [1,1] Predictor
Chart [2]
Det → that. [1,2] Scanner
NP → Det. Nominal [1,2] Completer
Nominal → .Noun [2,2] Predictor
Nominal → .Noun Nominal [2,2] Predictor
Chart [3]
Noun → flight. [2,3] Scanner
Nominal → Noun. [2,3] Completer
Nominal → Noun. Nominal [2,3] Completer
NP → Det Nominal. [1,3] Completer
VP → Verb NP. [0,3] Completer
S → VP. [0,3] Completer
Nominal → .Noun [3,3] Predictor
Nominal → .Noun Nominal [3,3] Predictor
In chart [3], the presence of the state representing "flight" leads to the completion of NP, the transitive VP, and S. The presence of the state S → VP. [0,3] in the last chart entry means that the parse has succeeded.
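The chart construction above can be sketched in a few dozen lines of Python. This is our own simplified recognizer (it keeps no back-pointers, so it accepts or rejects but does not build parse trees); the grammar and lexicon encodings are assumptions for illustration.

```python
from collections import namedtuple

# A chart state is a dotted rule spanning [start, end].
State = namedtuple("State", "lhs rhs dot start end")

def earley_recognize(words, grammar, lexicon, start_symbol="S"):
    """Earley recognition: True if `words` derives from start_symbol.

    grammar: LHS -> list of RHS tuples of non-terminals, e.g. {"S": [("VP",), ...]}
    lexicon: word -> set of pre-terminal categories, e.g. {"book": {"Verb", "Noun"}}
    """
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    chart[0].add(State("GAMMA", (start_symbol,), 0, 0, 0))      # dummy start state

    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            st = agenda.pop()
            if st.dot < len(st.rhs):
                nxt = st.rhs[st.dot]
                if nxt in grammar:                              # Predictor
                    for rhs in grammar[nxt]:
                        new = State(nxt, tuple(rhs), 0, i, i)
                        if new not in chart[i]:
                            chart[i].add(new); agenda.append(new)
                elif i < n and nxt in lexicon.get(words[i], ()):  # Scanner
                    chart[i + 1].add(State(nxt, (words[i],), 1, i, i + 1))
            else:                                               # Completer
                for old in list(chart[st.start]):
                    if old.dot < len(old.rhs) and old.rhs[old.dot] == st.lhs:
                        new = State(old.lhs, old.rhs, old.dot + 1, old.start, i)
                        if new not in chart[i]:
                            chart[i].add(new); agenda.append(new)
    return any(s.lhs == "GAMMA" and s.dot == 1 for s in chart[n])

grammar = {"S": [("NP", "VP"), ("Aux", "NP", "VP"), ("VP",)],
           "NP": [("Det", "Nominal"), ("Proper-Noun",)],
           "Nominal": [("Noun",), ("Noun", "Nominal")],
           "VP": [("Verb",), ("Verb", "NP")]}
lexicon = {"book": {"Verb", "Noun"}, "that": {"Det"}, "flight": {"Noun"}}
print(earley_recognize(["book", "that", "flight"], grammar, lexicon))   # True
```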
Chart [1]
S8 Verb book. [0.1] [] Scanner
S9 VP Verb. [0,1] [S8] Completer
S10 S VP. [0,1] [S9] Completer
S11 VP Verb. NP [0,1] [S8] Completer
S12 NP .Det Nominal [1,1] [] Predictor
S13 NP .Proper-Noun [1,1] [] Predictor
Chart [2]
S14 Det that. [1,2] [] Scanner
S15 NP Det. Nominal [1,2] [S14] Completer
S16 Nominal .Noun [2,2] [] Predictor
S17 Nominal .Noun Nominal [2,2] [] Predictor
Chart [3]
S18 Noun flight. [2,3] [] Scanner
S19 Nominal Noun. [2,3] [S18] Completer
S20 Nominal Noun. Nominal [2,3] [S18] Completer
S21 NP Det Nominal. [1,3]] [S14, S19] Completer
S22 VP Verb NP. [0,3] [S8, S21] Completer
S23 S VP. [0,3] [S22] Completer
S24 Nominal .Noun [3,3] [] Predictor
S25 Nominal .Noun Nominal [3,3] [] Predictor
The parsing process can be summarized as follows:
S8 Verb book. [0.1] [] Scanner
S9 VP Verb. [0,1] [S8] Completer
S10 S VP. [0,1] [S9] Completer
S11 VP Verb. NP [0,1] [S8] Completer
S14 Det that. [1,2] [] Scanner
S15 NP Det. Nominal [1,2] [S14] Completer
S18 Noun flight. [2,3] [] Scanner
S19 Nominal Noun. [2,3] [S18] Completer
S20 Nominal Noun. Nominal [2,3] [S18] Completer
S21 NP Det Nominal. [1,3]] [S14, S19] Completer
S22 VP Verb NP. [0,3] [S8, S21] Completer
S23 S VP. [0,3] [S22] Completer
[Figure: the edges of the final chart drawn over the input, including the inactive edges S → VP., VP → Verb NP., NP → Det Nominal., Nominal → Noun., and active edges such as VP → Verb. NP and Nominal → Noun. Nominal.]
Chart [1]
Aux does. [0,1] Scanner
S Aux. NP VP [0,1] Completer
NP .Ord Nom [1,1] Predictor
NP .PrN [1,1] Predictor
Chart [2]
PrN KA 852. [1,2] Scanner
NP PrN. [1,2] Completer
S Aux NP. VP [0,2] Completer
VP .V [2,2] Predictor
VP .V NP [2,2] Predictor
Chart [3]
V have. [2,3] Scanner
VP V. [2,3] Completer
VP V. NP [2,3] Completer
NP ..Ord Nom [3,3] Predictor
Chart [4]
Ord first. [3,4] Scanner
NP Ord. Nom [3,4] Completer
Nom .N Nom [4,4] Predictor
Nom .N. [4,4] Predictor
Nom .N PP [4,4] Predictor
Chart [5]
N class. [4,5] Scanner
Nom N. [4,5] Completer
NP Ord Nom. [3,5] Completer
VP V NP. [2,5] Completer
S Aux NP VP. [0,5] Completer (S’ span is 5, 5 < 6)
Nom N. Nom [4,5] Completer
Nom .N [5,5] Predictor
Chart [6]
N section. [5,6] Scanner
Nom N. [5,6] Completer
Nom N Nom. [4,6] Completer
NP Ord Nom. [3,6] Completer
VP V NP. [2,6] Completer
S Aux NP VP. [0,6] Completer
[Success !]
The parsing process:
Aux does. [0,1] Scanner
S Aux. NP VP [0,1] Completer
PrN KA 852. [1,2] Scanner
NP PrN. [1,2] Completer
S Aux NP. VP [0,2] Completer
V have. [2,3] Scanner
VP V. [2,3] Completer
VP V. NP [2,3] Completer
Ord first. [3,4] Scanner
NP Ord. Nom [3,4] Completer
N class. [4,5] Scanner
N section. [5,6] Scanner
Nom N. [5,6] Completer
Nom N Nom. [4,6] Completer
NP Ord Nom. [3,6] Completer
VP V NP. [2,6] Completer
S Aux NP VP. [0,6] Completer
[Success !]
S Aux NP VP.
VP V NP.
NP Ord Nom.
Nom N Nom.
NP PrN.
Nom N.
Chart [1]
Pron it [0,1] Scanner
NP Pron. [0,1,] Completer
S NP. VP [0,1] Completer
VP .V [1,1] Predictor
VP .V NP [1,1] Predictor
Chart [2]
V is. [1,2] Scanner
VP V. [1,2] Completer
S NP VP. [0,2] Completer (S’ span is 2 < 10)
VP V. NP [1,2] Completer
NP .Det Nom [2,2] Predictor
Chart [3]
Det a. [2,3] Scanner
NP Det. Nom [2,3] Completer
Nom N [3,3] Predictor
Nom .N Nom [3,3] Predictor
Nom .Nom PP [3,3] Predictor
Chart [4]
N flight. [3,4] Scanner
Nom N. [3,4] Completer
NP Det Nom. [2,4] Completer
VP V NP.. [1,4] Completer
S NP VP. [0,4] Completer (S’ span is 4 < 10)
Nom N. Nom [3,4] Completer
Attention: behind N, no Nom. So the process turns to following state:
Nom Nom. PP [3,4] Completer
PP .Prep NP [4,4] Predictor
Chart [5]
Prep from. [4,5] Scanner
PP Prep. NP [4,5] Completer
NP .PrN [5,5] Predictor
Chart [6]
PrN Beijing. [5,6] Scanner
NP PrN. [5,6] Completer
PP Prep NP [4,6] Completer
Nom Nom PP. [3,6] Completer
Attention: The dot behind PP (this PP = “from Beijing”), it is inactive edge.
Nom Nom. PP [3,6] Completer
Attention: The dot in front of PP (this PP = “to Seoul”), it is active edge.
PP .Prep NP [6,6] Predictor
Chart [7]
Prep to. [6,7] Scanner
PP Prep. NP [6,7] Completer
NP .PrN [7,7] Predictor
Chart [8]
PrN Seoul. [7,8] Scanner
NP PrN. [7,8] Completer
PP Prep NP. [6,8] Completer
Nom Nom PP. [3,8] Completer
Attention: The dot behind PP (this PP = “to Seoul”), it is inactive edge.
Nom Nom. PP [3,8] Completer
Attention: The dot in front of PP (this PP = “on ASIANA”), it is active edged.
PP .Prep NP [8,8] Predictor
Chart [9]
Prep on. [8,9] Scanner
PP Prep. NP [8,9] Completer
NP .PrN [9,9] Predictor
Chart [10]
PrN ASIANA. [9,10] Scanner
NP PrN. [9,10] Completer
PP Prep NP. [8,10] Completer
Nom Nom PP. [3,10] Completer
NP Det Nom. [2,10] Completer
VP V NP [1,10] Completer
S NP VP. [0,10] Completer
[Success !]
VP V NP.
NP Det Nom.
Pron it. V is. Det a. N flight P from. PrN Beijing. P to. PrN Seoul.P on. PrN ASI
● ● ● ● ● ● ● ● ● ● ●
0 1 2 3 4 5 6 7 8 9 10
Fig. 15
In this parsing process there is no backtracking, as there is in top-down parsing. The advantage of the Earley algorithm is obvious.
3.4.2. CYK approach:
The CYK approach (named for Cocke, Younger and Kasami) is a tabular (dynamic-programming) parsing algorithm.
S → NP VP
NP → Det N
VP → V NP
Obviously this is a CFG in Chomsky Normal Form, because all the rules have the form A → B C.
The following table expresses the result of CYK parsing for the sentence "The boy hits a dog":
5   S
3             VP
2   NP                 NP
1   Det   N    V    Det    N
    the   boy  hits  a     dog
     1     2    3    4      5
Fig. 16
In this table, the row number expresses the location of the word in the sentence, and the line number expresses how many words are included in the grammatical category (e.g. N, V, NP, VP, S, etc.). Each category is located in a box of the table, and bij denotes the box located at row i and line j; every grammatical category in the table can be addressed by its bij.
‘Det belongs to b1 1’ means : Det is located in row 1 and line 1.
‘N belongs to b2 1’ means : N is located in row 2 and line 1.
‘V belongs to b3 1’ means: V is located in row 3 and line 1.
‘Det belongs to b4 1’ means; Det is located in row 4 and line 1.
‘N belongs to b5 1’ means; N is located in row 5 and line 1.
By this reason,
The location of NP (the boy) is b1 2 (including 2 words),
The location of NP (a dog) is b4 2 (including 2 words),
The location of VP (hits a dog) is b3 3 (including 3 words),
The location of S (the boy hits a dog) is b1 5 (including 5 words).
Obviously the table and the bij entries in it describe the structure of the sentence: for every category in box bij, i gives its location in the sentence and j gives the number of words it covers. If we can build the table and fill in the bij, the parsing is complete.
[Fig. 17: the rule A → B C laid over the CYK table. A covers j words starting at position i; B covers the first k of them and C covers the remaining j − k.]
For example, in Fig. 17, NP belongs to b12, Det belongs to b11 and N belongs to b21; this represents the Chomsky-normal-form rule NP → Det N, with i = 1, k = 1, j = 2.
Therefore, if we know the starting position i of B, the length k of B and the length j of A, we can calculate the locations of A, B and C in the CYK table: A belongs to bij, B belongs to bik, and C belongs to b(i+k)(j−k).
In the CYK approach the important problem is how to calculate the location of A. The row number of A is always the same as that of B, so if the row number of B is i, the row number of A must also be i. The line number of A (= j) equals the sum of the line number of B (= k) and the line number of C (= j − k): j = k + (j − k).
Therefore, if we know the location of B and the location of C, it is easy to calculate the location of A.
If the length of the input sentence is n, the CYK algorithm can be divided into two steps:
First step: starting from i = 1, for every word Wi of the input sentence (of length n) there is a rewriting rule A → Wi, so we write the non-terminal symbol A of every word Wi into a box of the table and give it the location number bi1. E.g., for the sentence "The boy hits a dog" the boxes are: b11 (Det, for "the"), b21 (N, for "boy"), b31 (V, for "hits"), b41 (Det, for "a"), b51 (N, for "dog").
Second step: for every span length j from 2 to n and all i, create bij. The non-terminal set bij can be defined as follows:
bij = {A | for some 1 ≤ k < j, B is included in bik, C is included in b(i+k)(j−k), and there exists a grammar rule A → B C}.
If the box b1n includes the initial symbol S, the input sentence is accepted and the analysis succeeds.
E.g., for the rule NP → Det N, with Det in b11 and N in b21, we can confirm that NP belongs to b12;
for the rule NP → Det N, with Det in b41 and N in b51, we can confirm that NP belongs to b42;
for the rule VP → V NP, with V in b31 and NP in b42, we can confirm that VP belongs to b33;
for the rule S → NP VP, with NP in b12 and VP in b33, we can confirm that S belongs to b15. For our input sentence n = 5, so the sentence is accepted.
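The two steps above translate directly into a small dynamic-programming recognizer. This is a Python sketch with 0-based indexing (the text's bij boxes become table[i][j] with i counted from 0); the encodings are our own.

```python
def cyk_recognize(words, lexical, binary, start="S"):
    """CYK recognition for a grammar in Chomsky Normal Form.

    lexical: word -> set of non-terminals A with A -> word
    binary:  (B, C) -> set of non-terminals A with A -> B C
    table[i][j] holds the categories covering the j words starting at position i.
    """
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):                      # first step: the lexical row
        table[i][1] = set(lexical.get(w, ()))
    for j in range(2, n + 1):                          # span length of A
        for i in range(n - j + 1):                     # start position of A
            for k in range(1, j):                      # B covers k words, C covers j-k
                for B in table[i][k]:
                    for C in table[i + k][j - k]:
                        table[i][j] |= binary.get((B, C), set())
    return start in table[0][n]

lexical = {"the": {"Det"}, "a": {"Det"}, "boy": {"N"}, "dog": {"N"}, "hits": {"V"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
print(cyk_recognize("the boy hits a dog".split(), lexical, binary))   # True
```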
3.4.2.3 A complex example for CYK algorithm
If the PSG grammar is as follows:
S → NP VP
NP → PrN
NP → DET N
NP → N WH VP
NP → DET N WH VP
VP → V
VP → V NP
VP → V that S
Use the CYK approach to analyze the sentence "The table that lacks a leg hits Jack".
Transformation of the rewriting rules into Chomsky Normal Form:
S → NP VP
NP → PrN — it is not CNF and must be transformed to:
NP → Jack | John | Maria
NP → DET N
NP → N WH VP — it must be transformed to:
NP → N CL
CL → WH VP
NP → DET N WH VP — it must be transformed to:
NP → NP CL
NP → DET N
CL → WH VP
Here CL is a WH-clause (= WH + VP).
VP → V — it is not CNF and must be transformed to:
VP → cough | walk | …
VP → V NP
VP → V that S — it must be transformed to:
VP → V TH
TH → WH S
Here TH is a that-clause (= that + S).
[Figure: the CYK table for "The table that lacks a leg hits Jack". The lexical row b11 … b81 contains DET, N, WH, V, DET, N, V, NP; CL (CL → WH VP) occupies b34; and S (S → NP VP) spans the whole sentence.]
We can also draw the pyramid of constituents inside the CYK table; this pyramid is essentially a tree graph. We can see that the CYK algorithm is simple and effective.
Speech and Language Processing: Constituency Grammars (Chapter 11)
Review
▪ POS Decoding
▪ What does this mean?
▪ What representation do we use?
▪ What algorithm do we use, and why?
[Slide: levels of analysis for an utterance like /waddyasay/ ("What do you say?"): phonology, syntax (say with subject "you" and object "what"), and semantics (roughly λx. say(you, x), with subj and obj relations).]
Syntax
▪ Grammars (and parsing) are key components in many applications
▪ Grammar checkers
▪ Dialogue management
▪ Question answering
▪ Information extraction
▪ Machine translation
Consonants: Place of Articulation
• Bilabials: [p] [b] [m] – produced by bringing both lips together
• [t, d, n]: produced by the tip of the tongue touching the alveolar
ridge (or just in front of it)
• [s, z]: produced with the sides of the front of the tongue raised but
the tip lowered to allow air to escape
• [l]: the tongue tip is raised while the rest of the tongue remains down
so air can escape over the sides of the tongue (thus [l] is a lateral
sound)
• [r]: air escapes through the central part of the mouth; either the tip
of the tongue is curled back behind the alveolar ridge or the top of
the tongue is bunched up behind the alveolar ridge
Consonants: Place of Articulation
• Palatals: [ʃ] [ʒ] [ʧ] [ʤ] [ʝ] – produced by raising the front part of the tongue to the palate
• Stops: [p] [b] [m] [t] [d] [n] [k] [g] [ŋ] [ʧ] [ʤ] [Ɂ] – produced by completely stopping the air flow in the oral cavity for a fraction of a second
• Fricatives: [f] [v] [θ] [ð] [s] [z] [ʃ] [ʒ] [x] [ɣ] [h] – produced by severely obstructing the airflow so as to cause friction
Consonants: Manner of Articulation
• Affricates: [ʧ] [ʤ] – produced by a stop closure that is released with a lot of friction
• Clicks:
– produced by moving air in the mouth between various articulators
– the disapproving sound tsk in English is a consonant in Zulu and some other southern African languages
– the lateral click used to encourage a horse in English is a consonant in Xhosa
*The textbook uses [r] to represent the central liquid as in the word ready rather than as
a trill
Vowels
• Vowels are classified by how high or low the tongue is, whether the tongue is in the front or back of the mouth, and whether or not the lips are rounded
Vowels
• Round vowels: [u] [ʊ] [o] [ɔ]
– produced by rounding the lips
– English has only back round vowels, but other languages such as French and Swedish have front round vowels
• Nasalization:
– vowels can also be pronounced with a lowered velum, allowing air to pass through the nose
– in English, speakers nasalize vowels before a nasal sound, such as in the words beam, bean, and bingo
– the nasalization is represented by a diacritic, an extra mark placed with the symbol
Vowels
• Tense vowels:
– Are produced with
greater tension in the
tongue
– May occur at the end of
words
• Lax vowels:
– Are produced with less
tongue tension
– May not occur at the end
of words
Vowels
Major Phonetic Classes
• Noncontinuants: the airstream is totally obstructed in the oral cavity
– Stops and affricates
– Coronals: [θ] [ð] [t] [d] [n] [s] [z] [ʃ] [ʒ] [ʧ] [ʤ] [l] [r]
• articulated by raising the tongue blade
Major Phonetic Classes
• Consonantal categories cont.:
– Anteriors: [p] [b] [m] [f] [v] [θ] [ð] [t] [d] [n] [s] [z]
• produced in the front part of the mouth (from the alveolar area forward)
– Sibilants: [s] [z] [ʃ] [ʒ] [ʧ] [ʤ]
• produced with a lot of friction that causes a hissing sound, which is a mixture of high-frequency sounds
• Syllabic sounds: sounds that can function as the core of a syllable
– vowels, liquids, and nasals
Prosodic Features
• Prosodic, or suprasegmental, features of sounds, such as length, stress, and pitch, are features above the segmental values such as place and manner of articulation
• For example, in Thai, the string of sounds [naː] can be said with 5 different pitches and can thus have 5 different meanings:
Tone and Intonation
• Intonation languages (like English) have
varied pitch contour across an utterance,
but pitch is not used to distinguish words
Chapter 9 of SLP
Automatic Speech Recognition
Outline for ASR
▪ ASR Architecture
▪ The Noisy Channel Model
▪ Five easy pieces of an ASR system
1) Feature Extraction
2) Acoustic Model
3) Language Model
4) Lexicon/Pronunciation Model
(Introduction to HMMs again)
5) Decoder
• Evaluation
Ballpark numbers; exact numbers depend very much on the specific corpus
▪ Conclusions:
▪ Machines about 5 times worse than humans
▪ Gap increases with noisy speech
▪ These numbers are rough; take them with a grain of salt
Why is conversational speech
harder?
▪ A piece of an utterance without context
[Figure: waveform/spectrogram of the fragment "ay k" shown without context, with its acoustic likelihood and language-model prior scores (0.48152, 0.937203)]
p(o | q) = (1 / √(2π σ²)) · exp( −(o − μ)² / (2σ²) )
Gaussians for Acoustic
Modeling
A Gaussian is parameterized by a mean and
a variance:
[Figure: Gaussians with different means]
Mean and variance for phone i:
μ_i = (1/T) Σ_{t=1..T} o_t          s.t. o_t is phone i
σ_i² = (1/T) Σ_{t=1..T} (o_t − μ_i)²    s.t. o_t is phone i
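As a quick sketch of these estimates, the following Python fragment computes the mean, the variance, and the Gaussian likelihood for a single phone from a handful of made-up one-dimensional observations (the numbers are hypothetical):

import math

# Hypothetical 1-D acoustic observations for frames labeled with phone i.
frames = [2.1, 1.8, 2.4, 2.0, 1.7, 2.2]

T = len(frames)
mu = sum(frames) / T                              # mean over the frames
var = sum((o - mu) ** 2 for o in frames) / T      # variance over the frames

def gaussian_likelihood(o, mu, var):
    # p(o | q) for a single univariate Gaussian acoustic model
    return (1.0 / math.sqrt(2 * math.pi * var)) * math.exp(-((o - mu) ** 2) / (2 * var))

print(mu, var, gaussian_likelihood(2.0, mu, var))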
But we need 39 gaussians,
not 1!
▪ The observation o is really a vector of
length 39
▪ So need a vector of Gaussians:
p(o | q) = Π_{d=1..D} (1 / √(2π σ[d]²)) · exp( −(o[d] − μ[d])² / (2 σ[d]²) )
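Numerically, the product above is usually computed as a sum of log terms over the D = 39 dimensions. A minimal sketch of that diagonal-covariance computation follows; the means, variances, and the observation vector are random placeholders rather than real MFCC statistics:

import math
import random

D = 39
random.seed(0)
# Placeholder per-dimension parameters for one HMM state, plus one observation vector.
mu  = [random.uniform(-1.0, 1.0) for _ in range(D)]
var = [random.uniform(0.5, 2.0) for _ in range(D)]
o   = [random.uniform(-1.0, 1.0) for _ in range(D)]

def diag_gaussian_log_likelihood(o, mu, var):
    # log p(o | q) for a diagonal-covariance Gaussian over D dimensions
    log_p = 0.0
    for d in range(D):
        log_p += -0.5 * math.log(2 * math.pi * var[d]) - (o[d] - mu[d]) ** 2 / (2 * var[d])
    return log_p

print(diag_gaussian_log_likelihood(o, mu, var))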
[Figure: Gaussian likelihood curves for Phone A and Phone B]
▪ In practice:
Why is ASR decoding hard?
HMMs for speech
HMM for digit recognition task
The Evaluation (forward)
problem for speech
▪ The observation sequence O is a series of
MFCC vectors
▪ The hidden states W are the phones and
words
▪ For a given phone/word string W, our job
is to evaluate P(O|W)
▪ Intuition: how likely is the input to have
been generated by just that word string W
Evaluation for speech: Summing
over all different paths!
▪ f ay ay ay ay v v v v
▪ f f ay ay ay ay v v v
▪ f f f f ay ay ay ay v
▪ f f ay ay ay ay ay ay v
▪ f f ay ay ay ay ay ay ay ay v
▪ f f ay v v v v v v v
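Summing over all such state alignments is exactly what the forward algorithm does, in O(N²T) time rather than by enumerating paths. A minimal sketch with a hypothetical three-state model and made-up transition and emission probabilities (a toy discrete observation alphabet stands in for MFCC vectors):

# Forward algorithm sketch for P(O | W): sum over all state paths.
states = ["f", "ay", "v"]
start = {"f": 1.0, "ay": 0.0, "v": 0.0}
trans = {                       # P(next state | state), including self-loops
    "f":  {"f": 0.6, "ay": 0.4, "v": 0.0},
    "ay": {"f": 0.0, "ay": 0.7, "v": 0.3},
    "v":  {"f": 0.0, "ay": 0.0, "v": 1.0},
}
emit = {                        # P(observation | state), toy discrete symbols
    "f":  {"o1": 0.7, "o2": 0.2, "o3": 0.1},
    "ay": {"o1": 0.1, "o2": 0.8, "o3": 0.1},
    "v":  {"o1": 0.1, "o2": 0.2, "o3": 0.7},
}

def forward(observations):
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for o in observations[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return alpha["v"]           # probability of ending in the final state "v"

print(forward(["o1", "o2", "o2", "o3"]))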
The forward lattice for “five”
The forward trellis for “five”
Viterbi trellis for “five”
Viterbi trellis for “five”
Search space with bigrams
Viterbi trellis
Viterbi backtrace
Training
Evaluation
▪ How to evaluate the word string output by
a speech recognizer?
Word Error Rate
1
Speech synthesis
• What is the task?
– Generating natural sounding speech on the fly,
usually from text
• What are the main difficulties?
– What to say and how to say it
• How is it approached?
– Two main approaches, both with pros and cons
• How good is it?
– Excellent, almost unnoticeable at its best
• How much better could it be?
– marginally
2
Input type
• Concept-to-speech vs text-to-speech
• In CTS, content of message is determined
from internal representation, not by
reading out text
– E.g. database query system
– No problem of text interpretation
3
Text-to-speech
• What to say: text-to-phoneme conversion
is not straightforward
– Dr Smith lives on Marine Dr in Chicago IL. He got his
PhD from MIT. He earns $70,000 p.a.
– Have you read that book? No, I’m still reading it. I live in Reading.
• How to say it: not just choice of
phonemes, but allophones, coarticulation
effects, as well as prosodic features (pitch,
loudness, length)
4
Architecture of TTS systems
[Figure: TTS architecture. Text input feeds the text-to-phoneme module: normalization (using an abbreviation lexicon), grapheme-to-phoneme conversion (using orthographic rules), and prosodic modelling (using grammar rules and a prosodic model), producing a phoneme string with prosodic annotation; this feeds the phoneme-to-speech module, which produces the synthetic speech output.]
5
Text normalization
• Any text that has a special pronunciation
should be stored in a lexicon
– Abbreviations (Mr, Dr, Rd, St, Middx)
– Acronyms (UN but UNESCO)
– Special symbols (&, %)
– Particular conventions (£5, $5 million, 12°C)
– Numbers are especially difficult
• 1995 2001 1,995 236 3017 233 4488
6
Grapheme-to-phoneme conversion
• English spelling is complex but largely regular,
other languages more (or less) so
• Gross exceptions must be in lexicon
• Lexicon or rules?
– If look-up is quick, may as well store them
– But you need rules anyway for unknown words
• MANY words have multiple pronunciations
– Free variation (eg controversy, either)
– Conditioned variation (eg record, import, weak forms)
– Genuine homographs
7
Grapheme-to-phoneme conversion
• Much easier for some languages
(Spanish, Italian, Welsh, Czech, Korean)
• Much harder for others (English, French)
• Especially if writing system is only partially
alphabetic (Arabic, Urdu)
• Or not alphabetic at all (Chinese,
Japanese)
8
Syntactic (etc.) analysis
• Homograph disambiguation requires
syntactic analysis
– He makes a record of everything they record.
– I read a lot. What have you read recently?
• Analysis also essential to determine
appropriate prosodic features
9
Architecture of TTS systems
[Figure: TTS architecture, repeated from the earlier slide]
10
Prosody modelling
• Pitch, length, loudness
• Intonation (pitch)
– essential to avoid monotonous robot-like voice
– linked to basic syntax (eg statement vs question), but
also to thematization (stress)
– Pitch range is a sensitive issue
• Rhythm (length)
– Has to do with pace (natural tendency to slow down
at end of utterance)
– Also need to pause at appropriate place
– Linked (with pitch and loudness) to stress
11
Acoustic synthesis
• Alternative methods:
– Articulatory synthesis
– Formant synthesis
– Concatenative synthesis
– Unit selection synthesis
12
Articulatory synthesis
• Simulation of physical processes of human
articulation
• Wolfgang von Kempelen (1734-1804) and
others used bellows, reeds and tubes to
construct mechanical speaking machines
• Modern versions simulate electronically
the effect of articulator positions, vocal
tract shape, etc.
• Too much like hard work
13
Formant synthesis
• Reproduce the relevant characteristics of the
acoustic signal
• In particular, amplitude and frequency of
formants
• But also other resonances and noise, eg for
nasals, laterals, fricatives etc.
• Values of acoustic parameters are derived by
rule from phonetic transcription
• Result is intelligible, but too “pure” and sounds
synthetic
14
Formant synthesis
• Demo:
– In control panel select
“Speech” icon
– Type in your text and
Preview voice
– You may have a choice
of voices
15
Concatenative synthesis
• Concatenate segments of pre-recorded
natural human speech
• Requires database of previously recorded
human speech covering all the possible
segments to be synthesised
• Segment might be phoneme, syllable,
word, phrase, or any combination
• Or, something else more clever ...
16
Diphone synthesis
• Most important for natural
sounding speech is to get the
transitions right (allophonic
variation, coarticulation
effects)
• These are found at the
boundary between phoneme
segments
• “diphones” are fragments of
speech signal cutting across
phoneme boundaries
• If a language has P phones, then the number of diphones is ~P² (some combinations impossible) – eg 800 for Spanish, 1200 for French, 2500 for German
[Figure: the phrase "my number" segmented into diphones]
17
Diphone synthesis
• Most systems use diphones because they are
– Manageable in number
– Can be automatically extracted from recordings of
human speech
– Capture most inter-allophonic variants
• But they do not capture all coarticulatory effects,
so some systems include triphones, as well as
fixed phrases and other larger units (= USS)
18
Concatenative synthesis
• Input is phonemic representation +
prosodic features
• Diphone segments can be digitally
manipulated for length, pitch and loudness
• Segment boundaries need to be smoothed
to avoid distortion
19
Unit selection synthesis (USS)
• Same idea as concatenative synthesis, but
database contains bigger variety of “units”
• Multiple examples of phonemes (under
different prosodic conditions) are recorded
• Selection of appropriate unit therefore
becomes more complex, as there are in
the database competing candidates for
selection
20
Speech synthesis demo
21
Speech synthesis demo
22
Chapter 8. Word Classes and
Part-of-Speech Tagging
From: Chapter 8 of An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition, by Daniel Jurafsky and James H. Martin
Background
• Part of speech:
– Noun, verb, pronoun, preposition, adverb, conjunction, particle, and article
• Recent lists of POS (also known as word classes, morphological classes, or lexical tags) have much larger numbers of word classes.
– 45 for Penn Treebank
– 87 for the Brown corpus, and
– 146 for the C7 tagset
• The significance of the POS for language processing is that it gives a
significant amount of information about the word and its neighbors.
• POS can be used in stemming for IR, since
– Knowing a word’s POS can help tell us which morphological affixes it can take.
– They can help an IR application by helping select out nouns or other important
words from a document.
1. Closed class
– Having relatively fixed membership, e.g., prepositions
– Function words:
– Grammatical words like of, and, or you, which tend to be very short, occur
frequently, and play an important role in grammar.
2. Open class
• Four major open classes occurring in the languages of the world: nouns,
verbs, adjectives, and adverbs.
– Many languages have no adjectives, e.g., the Native American language Lakhota and Chinese
VBZ DT NN VB NN ?
Does that flight serve dinner ?
ADVERBIAL-THAT RULE
Given input: “that”
if
(+1 A/ADV/QUANT); /* if next word is adj, adverb, or quantifier */
(+2 SENT-LIM); /* and following which is a sentence boundary, */
(NOT -1 SVOC/A); /* and the previous word is not a verb like */
/* ‘consider’ which allows adj as object complements */
then eliminate non-ADV tags
else eliminate ADV tags
t̂_1^n = argmax over t_1^n of P(t_1^n | w_1^n) = argmax P(w_1^n | t_1^n) P(t_1^n) / P(w_1^n) ≈ argmax Π_{i=1..n} P(w_i | t_i) · P(t_i | t_{i−1})
• The tag transition probabilities: P(t_i | t_{i−1}) = C(t_{i−1}, t_i) / C(t_{i−1}), e.g.
P(NN | DT) = C(DT, NN) / C(DT) = 56,509 / 116,454 = .49
• The word likelihood probabilities: P(w_i | t_i) = C(t_i, w_i) / C(t_i), e.g.
P(is | VBZ) = C(VBZ, is) / C(VBZ) = 10,073 / 21,627 = .47
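These maximum-likelihood estimates are just relative frequencies over a tagged corpus. A minimal counting sketch (the tiny tagged corpus below is invented purely for illustration):

from collections import Counter

# Invented toy tagged corpus: (word, tag) pairs.
corpus = [("the", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ"),
          ("the", "DT"), ("dinner", "NN"), ("is", "VBZ"), ("served", "VBN")]

tag_count = Counter(tag for _, tag in corpus)
word_tag_count = Counter((tag, word) for word, tag in corpus)
tag_bigram_count = Counter((corpus[i][1], corpus[i + 1][1]) for i in range(len(corpus) - 1))

def transition_prob(prev_tag, tag):
    # P(tag | prev_tag) = C(prev_tag, tag) / C(prev_tag)
    return tag_bigram_count[(prev_tag, tag)] / tag_count[prev_tag]

def likelihood_prob(word, tag):
    # P(word | tag) = C(tag, word) / C(tag)
    return word_tag_count[(tag, word)] / tag_count[tag]

print(transition_prob("DT", "NN"))    # 1.0 on this toy corpus (cf. .49 from the counts quoted above)
print(likelihood_prob("is", "VBZ"))   # 1.0 on this toy corpus (cf. .47 above)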
8.5 HMM Part-of-Speech Tagging
Computing the most-likely tag sequence: A motivating example
P ( NN | TO ) = .00047
P (VB | TO ) = .83
P ( race | NN ) = .00057
P ( race | VB) = .00012
P ( NR | VB) = .0027
P ( NR | NN ) = .0012
P (VB | TO ) P ( NR | VB) P( race | VB) = .00000027
P ( NN | TO ) P ( NR | NN ) P ( race | NN ) = .00000000032
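A one-line check of the arithmetic above, comparing the two candidate taggings of race in the classic "to/TO race/?? tomorrow/NR" example:

p_vb = 0.83 * 0.0027 * 0.00012       # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057    # P(NN|TO) * P(NR|NN) * P(race|NN)
print(p_vb, p_nn, p_vb > p_nn)       # ~2.7e-07 vs ~3.2e-10: the VB tagging wins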
HOT->COLD->HOT->HOT->WARM->COLD->COLD->WARM
WARM->HOT->COLD->WARM->COLD->HOT->WARM->HOT
HOT->COLD->COLD->WARM->WARM->HOT->COLD->WARM
COLD->HOT->WARM->WARM->COLD->HOT->WARM->HOT
WARM->COLD->HOT->WARM->COLD->COLD->HOT->WARM
2. Markov Chains (Class Participation)
How do we compute the probability of each of the following sentences using a 7-state Markov chain?
(a) Students did their assignment well at time (highest probability/likelihood).
(b) did their assignment student well at time (2nd-best probability/likelihood).
(c) At student assignment well did time their (worst probability/likelihood).
Here we have to determine the best sequence of hidden states, the one that most likely produced the word image.
This is an application of the decoding problem.
4. Left-to-right (Bakis) HMM
In a left-to-right (also called Bakis) HMM, the state transitions proceed from left to right.
[Figure: (a)–(c) examples of left-to-right HMM topologies]
5. Maximum Entropy Models
A second probabilistic machine-learning framework is maximum entropy modelling.
- MaxEnt is more widely known as multinomial logistic regression.
Example 1: In text classification, we may need to
- decide whether a particular email should be classified as spam, or
- determine whether a particular sentence or document expresses a positive or negative opinion.
5. Maximum Entropy Models (Cont…)
Example 2: Assume that we have some input x (perhaps a word that needs to be tagged or a document that needs to be classified).
- From input x, we extract some features fi.
- A feature for tagging might be "this word ends in -ing".
- For each such feature fi, we have some weight wi.
Given the features and weights, our goal is to choose a class for the word:
- the probability of a particular class c given the observation x is
P(c | x) = exp( Σ_i w_i f_i(c, x) ) / Σ_{c′ ∈ C} exp( Σ_i w_i f_i(c′, x) )
Suppose the weight vector that we had previously learned for this task was
w = (w0,w1,w2,w3) = (18000,−5000,−3000,−1.8).
Then the predicted value for this house would be computed by multiplying each feature by
its weight:
The equation of any line is y = mx + b; as we show on the graph, the slope of this line is m = −4900, while the intercept is b = 16550.
Features (in this case x = the number of adjectives)
5.1 Linear Regression (Class Participation)
Example: Global warming may be reducing average
snowfall in your town and you are asked to predict how
much snow you think will fall this year.
Looking at the following table you might guess somewhere
around 10-20 inches. That’s a good guess, but you could make
a better guess, by using regression.
- Find out linear regression for 2014, 2015, 2016, 2017 and
2018?
Hint :
- regression also gives you a useful equation, which for this
chart is: y = -2.2923x + 4624.4.
- For example, 2005:
y = -2.2923(2005) + 4624.4 = 28.3385 inches, which is
pretty close to the actual figure of 30 inches for that year.
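A quick check of this calculation in Python, applying the regression equation quoted in the hint (the year-by-year snowfall table itself is not reproduced here):

# Apply the regression line y = -2.2923x + 4624.4 from the hint above.
def predicted_snowfall(year):
    return -2.2923 * year + 4624.4

for year in [2005, 2014, 2015, 2016, 2017, 2018]:
    print(year, round(predicted_snowfall(year), 4))
# 2005 -> 28.3385 inches, close to the stated actual figure of 30 inches.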
5.2 Logistic Regression
In logistic regression, we classify whether some observation x is in the class (true) or not in
the class (false).
Case 1: We can thus add a binary feature that is true if this is the case.
Case 2: Another feature would be whether the previous word has the tag TO.
5.3 Maximum Entropy Markov Models
Previously, the HMM tagging model is based on probabilities of the form
P(tag|tag) and P(word|tag).
- That means that if we want to include some source of knowledge into the tagging process,
we must find a way to encode the knowledge into one of these two probabilities.
- But many knowledge sources are hard to fit into these two probabilities.
In the MEMM, by contrast, the current state St is conditioned on both the previous state St−1 and the current observation Ot.
5.3 Maximum Entropy Markov Models [Example-1] (Cont…)
4
Finite-State Automata
5
Using an FSA to Recognize Sheeptalk
• As we defined the sheep language in Part 1 it is any string
from the following (infinite) set:
baa!
baaa!
baaaa!
baaaaa!
baaaaaa!
...
8
Using an FSA to Recognize Sheeptalk
• The machine starts in the start state (q0), and iterates the
following process:
▪ Check the next letter of the input. If it matches the symbol on an arc
leaving the current state, then cross that arc, move to the next state,
and also advance one symbol in the input.
▪ If we are in the accepting state (q4) when we run out of input, the
machine has successfully recognized an instance of sheeptalk.
▪ If the machine never gets to the final state, either because it runs out
of input, or it gets some input that doesn’t match an arc (as in Figure
2.11), or if it just happens to get stuck in some non-final state, we say
the machine rejects or fails to accept an input.
9
Using an FSA to Recognize Sheeptalk
• We can also represent an automaton with a state-transition
table. As in the graph notation, the state-transition table
represents the start state, the accepting states, and what
transitions leave each state with which symbols.
• Here’s the state-transition table for the FSA of Figure 2.10.
10
Using an FSA to Recognize Sheeptalk
• If we’re in state 0 and we see the input b, we must go to state 1. If we’re in state 0 and we see the input a or !, we fail.
11
Using an FSA to Recognize Sheeptalk
• For the sheeptalk automaton in Figure 2.10, Q = {q0, q1, q2, q3, q4}, Σ = {a, b, !}, F = {q4}, and δ(q, i) is defined by the transition table in Figure 2.12.
12
Using an FSA to Recognize Sheeptalk
13
Using an FSA to Recognize Sheeptalk
• D-RECOGNIZE takes as input a tape and an automaton. It returns accept
if the string it is pointing to on the tape is accepted by the automaton,
and reject otherwise.
• D-RECOGNIZE then enters a loop that drives the rest of the algorithm.
• It first checks whether it has reached the end of its input. If so, it either
accepts the input (if the current state is an accept state) or rejects the
input (if not).
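A compact Python rendering of D-RECOGNIZE for the sheeptalk FSA, with the transition table of Figure 2.12 encoded as a dictionary (the dictionary encoding is just one convenient way to store the table):

# Deterministic recognizer for sheeptalk: b a a a* !
delta = {            # missing entries correspond to the empty cells (the fail state)
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
accept_states = {4}

def d_recognize(tape):
    state = 0
    for symbol in tape:
        if (state, symbol) not in delta:   # no legal transition: reject
            return False
        state = delta[(state, symbol)]
    return state in accept_states          # end of input: accept iff in an accept state

print(d_recognize("baaa!"))   # True
print(d_recognize("abc"))     # False: no transition out of q0 on a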
14
Using an FSA to Recognize Sheeptalk
• If there is input left on the tape, D-RECOGNIZE looks at the transition table
to decide which state to move to.
15
Using an FSA to Recognize Sheeptalk
• Figure 2.14 traces the execution of this algorithm on the sheep language
FSA given the sample input string baaa!.
16
Using an FSA to Recognize Sheeptalk
• Before examining the beginning of the tape, the machine is in state q0.
• It then finds a b and switches to state q1; an a switches it to state q2, another a puts it in state q3, and a third a leaves it in state q3,
• where it reads the ‘!’ and switches to state q4.
• Since there is no more input, the End of input condition at the beginning
of the loop is satisfied for the first time and the machine halts in q4.
• The algorithm will fail whenever there is no legal transition for a given
combination of state and input.
• The input abc will fail to be recognized since there is no legal transition
out of state q0 on the input a, (i.e. this entry of the transition table in
Figure 2.12 on slide 10 has a ∅).
• Even if the automaton had allowed an initial a, it would certainly have failed on c, since c isn’t even in the sheeptalk alphabet!
• We can think of these ‘empty’ elements in the table as if they all pointed
at one ‘empty’ state, which we might call the fail state or sink state.
• In a sense then, we could view any machine with empty transitions as if
we had augmented it with a fail state, and drawn in all the extra arcs, so
we always had somewhere to go from any state on any possible input.
18
Using an FSA to Recognize Sheeptalk
• Just for completeness, Figure 2.15 shows the FSA from Figure 2.10 with the
fail state qF filled in.
19
Formal Languages
• We can use the same graph in Figure 2.10 as an automaton for
GENERATING sheeptalk.
• If we do, we would say that the automaton starts at state q0, and crosses
arcs to new states, printing out the symbols that label each arc it follows.
• When the automaton gets to the final state it stops. Notice that at state 3, the automaton has to choose between printing out a ! and going to state 4, or printing out an a and returning to state 3.
• Let’s say for now that we don’t care how the machine makes this
decision; maybe it flips a coin.
• For now, we don’t care which exact string of sheeptalk we generate, as
long as it’s a string captured by the regular expression for sheeptalk
above.
20
Formal Languages
• Key concept #1: Formal Language: A model which can both generate
and recognize all and only the strings of a formal language acts as a
definition of the formal language.
• The alphabet for the sheep language is the set Σ = {a, b, !}.
• Given a model m (such as a particular FSA), we can use L(m) to mean
“the formal language characterized by m”.
• So the formal language defined by our sheeptalk automaton m in Figure
2.10 (and Figure 2.12) is the infinite set:
22
Another Example
23
Another Example
24
Non-Deterministic FSAs
• Consider the sheeptalk automaton in Figure 2.18, which is much like our
first automaton in Figure 2.10:
• The only difference between this automaton and the previous one is that
here in Figure 2.18 the self-loop is on state 2 instead of state 3.
25
Non-Deterministic FSAs
• Consider using this network as an automaton for recognizing sheeptalk.
• When we get to state 2, if we see an a we don’t know whether to remain
in state 2 or go on to state 3.
• Automata with decision points like this are called non-deterministic FSAs
(or NFSAs).
• That is not true for a DFSA; the machine in Figure 2.18 is non-deterministic (NFSA #1).
26
Non-Deterministic FSAs
• There is another common type of non-determinism, which can be
caused by arcs that have no symbols on them (called 𝜀-transitions).
• The automaton in Figure 2.19 defines the exact same language as the
last one, or our first one, but it does it with an 𝜀 -transition.
27
Non-Deterministic FSAs
• We interpret this new arc as follows: if we are in state 3, we are
allowed to move to state 2 without looking at the input, or
advancing our input pointer.
28
Using an NFSA to Accept Strings
• There are two keys to this approach: we need to remember all the
alternatives for each choice point, and we need to store sufficient
information about each alternative so that we can return to it when
necessary.
30
Using an NFSA to Accept Strings
• Applying this notion to our nondeterministic recognizer, we need only
remember two things for each choice point: the state, or node, of the
machine that we can go to and the corresponding position on the tape.
• We will call the combination of the node and position the search-state of
the recognition algorithm.
31
Using an NFSA to Accept Strings
• Before going on to describe the main part of this algorithm, we should
note two changes to the transition table that drives it.
• First, in order to represent nodes that have outgoing 𝜀-transitions, we add
a new 𝜀 -column to the transition table. If a node has an 𝜀-transition, we
list the destination node in the 𝜀 -column for that node’s row.
• The second addition is needed to account for multiple transitions to
different nodes from the same input symbol.
• We let each cell entry consist of a list of destination nodes rather than a
single node.
• Figure 2.20 shows the transition table for the machine in Figure 2.18 (NFSA
#1).
• While it has no 𝜀 -transitions, it does show that in machine-state q2 the
input a can lead back to q2 or on to q3.
32
Using an NFSA to Accept Strings
33
Using an NFSA to Accept Strings
34
Using an NFSA to Accept Strings
• ND-RECOGNIZE begins by creating an initial search-state and placing it
on the agenda.
• For now we don’t specify what order the search-states are placed on the
agenda.
• The function NEXT is then called to retrieve an item from the agenda and
assign it to the variable current-search-state.
35
Using an NFSA to Accept Strings
• As with D-RECOGNIZE, the first task of the main loop is to determine if the
entire contents of the tape have been successfully recognized.
• If we’re not done, the machine generates a set of possible next steps by
calling GENERATE-NEW-STATES, which creates search-states for any 𝜀-
transitions and any normal input-symbol transitions from the transition
table.
• All of these search-state tuples are then added to the current agenda.
36
Using an NFSA to Accept Strings
• Finally, we attempt to get a new search-state to process from the
agenda.
• If the agenda is empty we’ve run out of options and have to reject the
input.
• Unlike D-RECOGNIZE, it does not return reject when it reaches the end of
the tape in an non-accept machine-state or when it finds itself unable to
advance the tape from some machine-state.
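An agenda-based sketch of ND-RECOGNIZE in Python for NFSA #1, where each search-state is a (machine-state, tape-position) pair; the agenda here is a plain stack (depth-first order), one of the orderings the text deliberately leaves open:

# Non-deterministic recognizer driven by an agenda of search-states.
delta = {                # transition table for NFSA #1: the self-loop is on state 2
    (0, "b"): [1],
    (1, "a"): [2],
    (2, "a"): [2, 3],    # the non-deterministic choice point
    (3, "!"): [4],
}
epsilon = {}             # NFSA #1 has no epsilon-transitions
accept_states = {4}

def nd_recognize(tape):
    agenda = [(0, 0)]                        # initial search-state
    while agenda:
        state, pos = agenda.pop()            # NEXT: here, last-in first-out
        if pos == len(tape) and state in accept_states:
            return True                      # end of tape in an accept state
        # GENERATE-NEW-STATES: epsilon moves plus normal input-symbol moves
        for dest in epsilon.get(state, []):
            agenda.append((dest, pos))
        if pos < len(tape):
            for dest in delta.get((state, tape[pos]), []):
                agenda.append((dest, pos + 1))
    return False                             # agenda empty: reject the input

print(nd_recognize("baaa!"))   # True
print(nd_recognize("baa"))     # False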
37
Using an NFSA to Accept Strings
38
Using an NFSA to Accept Strings
39
Using an NFSA to Accept Strings
• Each strip lower down in the figure represents progress from one current-
search-state to the next.
• Little of interest happens until the algorithm finds itself in state q2 while
looking at the second a on the tape.
• Search states are created for each of these choices and placed on the
agenda.
• At this point, the algorithm simply asks the agenda for a new
state to pursue.
41
Using an NFSA to Accept Strings
42
Using an NFSA to Accept Strings
43
Recognition as Search
• ND-RECOGNIZE accomplishes the task of recognizing strings in a regular
language by providing a way to systematically explore all the possible
paths through a machine.
44
Recognition as Search
• In such algorithms, the problem definition creates a space of possible
solutions; the goal is to explore this space, returning an answer when one
is found or rejecting the input when the space has been exhaustively
explored.
• The goal of the search is to navigate through this space from one state to
another looking for a pairing of an accept state with an end of tape
position.
45
Recognition as Search
• The key to the effectiveness of such programs is often the order in which
the states in the space are considered.
46
Recognition as Search
• You may have noticed that the ordering of states in ND-RECOGNIZE
has been left unspecified.
47
Recognition as Search
• Consider an ordering strategy where the states that are considered
next are the most recently created ones.
48
Recognition as Search
49
Recognition as Search
• The algorithm hits the first choice point after seeing ba when it
has to decide whether to stay in q2 or advance to state q3.
50
Recognition as Search
• Again, the algorithm hits its first choice point after seeing ba
when it had to decide whether to stay in q2 or advance to
state q3.
• But now rather than picking one choice and following it up, we
imagine examining all possible choices, expanding one ply of
the search tree at a time.
52
Recognition as Search
53
Recognition as Search
54
Regular Languages and FSAs
• The class of languages that are definable by regular expressions is exactly
the same as the class of languages that are characterizable by FSA (D or
ND).
▪ These languages are called regular languages.
• The class of regular languages over Σ is formally defined as follows:
1. ∅ is an RL
2. ∀a ∈ Σ ∪ {ε}, {a} is an RL
3. If L1 and L2 are RLs, then so are:
a) L1 · L2 = { xy | x ∈ L1 and y ∈ L2 }, the concatenation of L1 and L2
b) L1 ∪ L2, the union or disjunction of L1 and L2
c) L1*, the Kleene closure of L1
• All and only the sets of languages that satisfy the above properties are regular languages.
55
Regular Languages and FSAs
• Regular languages are also closed under the following operations (where
Σ ∗ means the infinite set of all possible strings formed from the alphabet Σ):
57
Regular Languages and FSAs
58
Regular Languages and FSAs
59
Chapter 3. Morphology and
Finite-State Transducers
From: Chapter 3 of An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition, by Daniel Jurafsky and James H. Martin
Background
• The problem of recognizing that foxes breaks down into the two
morphemes fox and -es is called morphological parsing.
• Similar problem in the information retrieval domain: stemming
• Given the surface or input form going, we might want to produce the
parsed form: VERB-go + GERUND-ing
• In this chapter
– morphological knowledge and
– The finite-state transducer
• It is quite inefficient to list all forms of nouns and verbs in the dictionary because of the productivity of these forms.
• Morphological parsing is needed not only for IR, but also for
– Machine translation
– Spell checking
2
Survey of (Mostly) English Morphology
• Morphology is the study of the way words are built up from smaller
meaning-bearing units, morphemes.
• Two broad classes of morphemes:
– The stems: the “main” morpheme of the word, supplying the main
meaning, while
– The affixes: add “additional” meaning of various kinds.
• Affixes are further divided into prefixes, suffixes, infixes, and
circumfixes.
– Suffix: eat-s
– Prefix: un-buckle
– Circumfix: ge-sag-t (said) sagen (to say) (in German)
– Infix: hingi (borrow), humingi (the agent of an action) (in the Philippine language Tagalog)
3
Survey of (Mostly) English Morphology
4
Survey of (Mostly) English Morphology
Inflectional Morphology
5
Survey of (Mostly) English Morphology
Inflectional Morphology
• Verbal inflection is more complicated than nominal inflection.
– English has three kinds of verbs:
• Main verbs, eat, sleep, impeach
• Modal verbs, can, will, should
• Primary verbs, be, have, do
– Morphological forms of regular verbs
stem: walk, merge, try, map
-s form: walks, merges, tries, maps
-ing participle: walking, merging, trying, mapping
past form or -ed participle: walked, merged, tried, mapped
6
Survey of (Mostly) English Morphology
Inflectional Morphology
7
Survey of (Mostly) English Morphology
Derivational Morphology
• Nominalization in English:
– The formation of new nouns, often from verbs or adjectives
Suffix    Base Verb/Adjective    Derived Noun
-ation    computerize (V)        computerization
-ee       appoint (V)            appointee
-er       kill (V)               killer
-ness     fuzzy (A)              fuzziness
9
Finite-State Morphological Parsing
• Parsing English morphology
10
Finite-State Morphological Parsing
11
Finite-State Morphological Parsing
The Lexicon and Morphotactics
• A lexicon is a repository for words.
– The simplest one would consist of an explicit list of every word of the language. Inconvenient or impossible!
– Computational lexicons are usually structured with
• a list of each of the stems and
• Affixes of the language together with a representation of morphotactics telling us how
they can fit together.
– The most common way of modeling morphotactics is the finite-state automaton.
13
Finite-State Morphological Parsing
The Lexicon and Morphotactics
• English derivational morphology is more complex than English inflectional morphology, and so automata for modeling English derivation tend to be quite complex.
– Some even based on CFG
• A small part of morphosyntactics of English adjectives
14
Finite-State Morphological Parsing
• FSA #1 recognizes all the listed adjectives, but also ungrammatical forms like unbig, redly, and realest.
• Thus #1 is revised to become #2.
• The complexity is expected from English derivation.
16
Finite-State Morphological Parsing
• We can now use these FSAs to
solve the problem of
morphological recognition:
– Determining whether an input
string of letters makes up a
legitimate English word or not
– We do this by taking the
morphotactic FSAs, and plugging
in each “sub-lexicon” into the FSA.
– The resulting FSA can then be defined at the level of the individual letter.
17
Finite-State Morphological Parsing
Morphological Parsing with FST
• Given the input, for example, cats, we would like to produce cat +N +PL.
• Two-level morphology, by Koskenniemi (1983)
– Representing a word as a correspondence between a lexical level
• Representing a simple concatenation of morphemes making up a word, and
– The surface level
• Representing the actual spelling of the final word.
• Morphological parsing is implemented by building mapping rules that maps
letter sequences like cats on the surface level into morpheme and features
sequence like cat +N +PL on the lexical level.
18
Finite-State Morphological Parsing
Morphological Parsing with FST
• The automaton we use for performing the mapping between these two
levels is the finite-state transducer or FST.
– A transducer maps between one set of symbols and another;
– An FST does this via a finite automaton.
• Thus an FST can be seen as a two-tape automaton which recognizes or
generates pairs of strings.
• The FST has a more general function than an FSA:
– An FSA defines a formal language
– An FST defines a relation between sets of strings.
• Another view of an FST:
– A machine reads one string and generates another.
19
Finite-State Morphological Parsing
Morphological Parsing with FST
• FST as recognizer:
– a transducer that takes a pair of strings as input and outputs accept if the string-pair is in the string-pair language, and reject if it is not.
• FST as generator:
– a machine that outputs pairs of strings of the language. Thus the output is
a yes or no, and a pair of output strings.
• FST as transducer:
– A machine that reads a string and outputs another string.
• FST as set relater:
– A machine that computes relation between sets.
20
Finite-State Morphological Parsing
Morphological Parsing with FST
21
Finite-State Morphological Parsing
Morphological Parsing with FST
22
Finite-State Morphological Parsing
Morphological Parsing with FST
23
Finite-State Morphological Parsing
Morphological Parsing with FST
24
Finite-State Morphological Parsing
Morphological Parsing with FST
^: morpheme boundary
#: word boundary
26
Finite-State Morphological Parsing
Orthographic Rules and FSTs
• “Insert an e on the surface tape just when the lexical tape has a morpheme ending in x (or z, s, etc.) and the next morpheme is -s”
ε → e / { x, s, z } ^ __ s #
• The general rule format is a → b / c __ d (rewrite a as b when it occurs between c and d)
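The rule is normally compiled into an FST, but its effect on the intermediate tape can be approximated with a regular expression. A rough sketch, using the ^ and # boundary conventions above (the regex merely stands in for the real transducer):

import re

def e_insertion(intermediate):
    # Insert e when a morpheme ends in x, s, or z and the next morpheme is -s
    # (^ = morpheme boundary, # = word boundary).
    return re.sub(r"([xsz])\^s#", r"\1^es#", intermediate)

print(e_insertion("fox^s#"))   # fox^es#  -> surface "foxes" once boundaries are removed
print(e_insertion("cat^s#"))   # cat^s#   -> unchanged; surface "cats"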
27
Finite-State Morphological Parsing
Orthographic Rules and FSTs
28
Combining FST Lexicon and Rules
29
Combining FST Lexicon and Rules
30
Combining FST Lexicon and Rules
• The power of FSTs is that the exact same cascade with the same state
sequences is used
– when the machine is generating the surface form from the lexical tape, or
– When it is parsing the lexical tape from the surface tape.
• Parsing can be slightly more complicated than generation, because of
the problem of ambiguity.
– For example, foxes could be fox +V +3SG as well as fox +N +PL
31
Lexicon-Free FSTs: the Porter Stemmer
• Information retrieval
• One of the most widely used stemming algorithms is the simple and efficient Porter (1980) algorithm, which is based on a series of simple cascaded rewrite rules.
– ATIONAL → ATE (e.g., relational → relate)
– ING → ε if the stem contains a vowel (e.g., motoring → motor)
• Problem:
– Not perfect: errors of commission and omission
• Experiments have been made
– Some improvement with smaller documents
– Any improvement is quite small
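A toy illustration of the cascaded-rewrite idea using just the two rules quoted above; this only sketches the flavor of the approach and is not the actual Porter stemmer, which has many more rule groups and measure conditions:

import re

def has_vowel(stem):
    return re.search(r"[aeiou]", stem) is not None

def toy_stem(word):
    # Rule 1: ATIONAL -> ATE   (relational -> relate)
    if word.endswith("ational"):
        return word[:-len("ational")] + "ate"
    # Rule 2: ING -> epsilon, if the remaining stem contains a vowel (motoring -> motor)
    if word.endswith("ing") and has_vowel(word[:-3]):
        return word[:-3]
    return word

for w in ["relational", "motoring", "sing"]:
    print(w, "->", toy_stem(w))   # relate, motor, sing (unchanged: "s" has no vowel)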
32
Minimum Edit Distance
• Spell correction:
– The user typed “graffe”
– Which is closest? : graf grail giraffe
• the word giraffe, which differs by only one letter from graffe, seems intuitively to be
more similar than, say grail or graf,
• The minimum edit distance between two strings is defined as the minimum number
of editing operations (insertion, deletion, substitution) needed to transform one
string into another.
• Recurrence Relation:
For each i = 1…M
  For each j = 1…N
    D(i, j) = min of:
      D(i−1, j) + 1                                      (deletion)
      D(i, j−1) + 1                                      (insertion)
      D(i−1, j−1) + 2 if X(i) ≠ Y(j), + 0 if X(i) = Y(j)  (substitution)
ptr(i, j) = LEFT (insertion), DOWN (deletion), or DIAG (substitution)
• Time: O(nm)
• Space: O(nm)
• Backtrace: O(n+m)
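A direct implementation of the recurrence above (insertion and deletion cost 1, substitution cost 2, or 0 on a match), which runs in the stated O(nm) time and space:

def min_edit_distance(source, target):
    M, N = len(source), len(target)
    D = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        D[i][0] = i                           # delete i characters
    for j in range(1, N + 1):
        D[0][j] = j                           # insert j characters
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution (or match)
    return D[M][N]

print(min_edit_distance("graffe", "giraffe"))  # 1: a single insertion
print(min_edit_distance("graffe", "graf"))     # 2: two deletions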