NLP Notes
Unit-I
• Typology
Tokens
Will you read the newspaper? Will you read it? I won’t read it.
• Here we see two words newspaper and won’t.
• In writing, newspaper and its associated concepts are very clear but
in speech there are a few issues.
• When it comes to the word won’t, linguists prefer to analyze it as two
words or tokens: will and not.
• This type of analysis is called tokenization and normalization.
• In Arabic or Hebrew certain tokens are concatenated in writing
with the preceding or the following words, possibly changing their
forms.
• Tokens of this type are called clitics (I’m, we’ve).
• In the writing systems of Chinese, Japanese, and Thai, white space
is not used to separate words.
• In Korean, character strings are called eojeol ‘word segment’ and
correspond to speech or cognitive units that are usually larger
than words and smaller than clauses.
Lexemes
• Example in Telugu:
vAlYlYu aMxamEna wotalo neVmmaxigA naduswunnAru.
They beautiful garden slowly walking
Morphemes
Typology
Irregularity
Ambiguity
Productivity
• A few examples:
Example in Arabic
Ambiguity
Productivity
• Dictionary Lookup
• Finite-State Morphology
• Unification-Based Morphology
• Functional Morphology
• Morphology Induction
Morphological Models-Introduction
Finite-State Morphology
• Let ∑ be a set of symbols, and let [∑] denote the set of all
sequences over ∑. The domain and range of a relation R are
subsets of [∑].
• Now R is a function mapping an input string to a set of output
strings:
R: [∑] → {[∑]}, which can be written as
R: string → {string}
• Morphological operations and processes in human languages
can be expressed in finite-state terms.
• A theoretical limitation of finite state models of morphology is
the problem of reduplication of words found in many natural
languages.
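As a minimal sketch of the mapping R: string → {string} in Python (the tiny lexicon and the tag strings below are invented for illustration, not drawn from any real analyzer):

# A minimal sketch of a morphological relation R: string -> {string}.
# The lexicon entries and analysis tags are hypothetical.
ANALYSES = {
    "walks":  {"walk+V+3sg", "walk+N+pl"},
    "walked": {"walk+V+past"},
}

def analyze(surface):
    # R maps an input string to a set of output (analysis) strings.
    return ANALYSES.get(surface, set())

print(analyze("walks"))   # {'walk+V+3sg', 'walk+N+pl'}
print(analyze("ran"))     # set(): no analysis in this toy lexicon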
Finite-State Morphology
Unification-Based Morphology
Functional-Morphology
• Introduction
• Methods
• Complexity of the Approaches
• Performances of the Approaches
• Sentence Boundary Detection
• Topic Boundary detection
Introduction
• As we all know words form sentences.
• Sentences can be related to each other by explicit discourse
connectives such as therefore.
• Sentences form paragraphs.
• Paragraphs are self-contained units of discourse about a particular
point or idea.
• Automatic extraction of the structure of documents helps in:
• Parsing
• Machine Translation
• Semantic Role Labelling
• Sentence boundary annotation is important for human readability
of the output of the automatic speech recognition (ASR) system.
• Chunking the input text or speech into topically coherent blocks
provides better organization and indexing of the data.
• For simplicity we consider sentences and groups of sentences related
to a topic as the structural elements.
• Here we discuss two topics:
• We assume that:
• Example:
COMPLEXITY OF THE APPROACHES
• All the approaches have advantages and disadvantages.
• These approaches can be rated in terms of:
• Training and prediction algorithms
• Performance on real world data set
• Training discriminative approaches is more complex than
training generative ones, because they require multiple passes
over the training data to adjust their feature weights.
• Generative models such as hidden-event language models (HELMs)
can handle training sets that are multiple orders of magnitude
larger and can benefit from old transcripts.
• They work only with a few features and do not cope well with
unseen events.
• Discriminative classifiers allow for a wider variety of features
and perform better on smaller training sets.
• Predicting with discriminative classifiers is slower because it is
dominated by the cost of extracting more features.
• In comparison to local approaches, sequence approaches have to
handle additional complexity.
• They need to find the best sequence of decisions, which in
principle requires evaluating all possible decision sequences.
• The assumption of conditional independence in generative
sequence algorithms allows the use of dynamic programming to
trade time for memory and decode in polynomial time (see the
sketch after this list).
• This complexity is measured in:
• The number of boundary candidates processed together
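To make the dynamic-programming point concrete, here is a minimal Viterbi-style decoding sketch for a two-label boundary tagger; all probabilities below are invented for illustration, and a real system would use trained model scores:

import math

states = ["B", "NB"]   # boundary / no boundary
trans = {("B", "B"): 0.1, ("B", "NB"): 0.9,
         ("NB", "B"): 0.3, ("NB", "NB"): 0.7}
# One toy emission distribution per word position:
emit = [{"B": 0.2, "NB": 0.8}, {"B": 0.6, "NB": 0.4}, {"B": 0.1, "NB": 0.9}]

def viterbi(emit, trans, states):
    # V[t][s] is the log probability of the best label sequence
    # ending in state s at position t.
    V = [{s: math.log(emit[0][s]) for s in states}]
    back = []
    for t in range(1, len(emit)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t-1][p] + math.log(trans[(p, s)]))
            V[t][s] = V[t-1][prev] + math.log(trans[(prev, s)]) + math.log(emit[t][s])
            back[t-1][s] = prev
    # Trace back the best sequence instead of enumerating all sequences.
    best = max(states, key=lambda s: V[-1][s])
    seq = [best]
    for b in reversed(back):
        seq.append(b[seq[-1]])
    return list(reversed(seq))

print(viterbi(emit, trans, states))   # prints the most probable label sequence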
Syntax-Introduction
• For the input “natural language processing book” only one out
of the five parses obtained using the above CFG is correct:
• This is the second knowledge acquisition problem- We need to
know not only the rules but also which analysis is most
plausible for a given input sentence.
• The construction of a tree bank is a data driven approach to
syntax analysis that allows us to address both the knowledge
acquisition bottlenecks in one stroke.
• A treebank is a collection of sentences where each sentence is
provided a complete syntax analysis.
• The syntax analysis for each sentence has been judged by a
human expert.
• A set of annotation guidelines is written before the annotation
process to ensure a consistent scheme of annotation
throughout the tree bank.
• No set of syntactic rules is provided by a treebank.
• No exhaustive set of rules is assumed to exist, even though
assumptions about syntax are implicit in a treebank.
• The consistency of syntax analysis in a treebank is measured
using interannotator agreement by having approximately 10%
overlapped material annotated by more than one annotator.
• Treebanks provide annotations of syntactic structure for a
large sample of sentences.
• A supervised machine learning method can be used to train the
parser.
• Treebanks solve the first knowledge acquisition problem of
finding the grammar underlying the syntax analysis because
the analysis is directly given instead of a grammar.
• The second problem of knowledge acquisition is also solved
by treebanks.
• Each sentence in a treebank has been given its most plausible
syntactic analysis.
• English has very few cases in a treebank that need such a non-
projective analysis.
• In languages like Czech, Turkish, and Telugu, the number of
non-projective dependencies is much higher.
• Let us look at a multilingual comparison of crossing
dependencies across a few languages:
• Ar=Arabic;Ba=Basque;Ca=Catalan;Ch=Chinese;Cz=Czech;En
=English; Gr=Greek;Hu=Hungarian;It=Italian;Tu=Turkish
• Dependency graphs in treebanks do not explicitly distinguish
between projective and non-projective dependency tree
analyses.
• Parsing algorithms are sometimes forced to distinguish
between projective and non-projective dependencies.
• Let us try to setup dependency links in a CFG.
• In the second case the missing subject for “take back ” is the subject
of the verb promised.
• The dependency analysis for “persuaded” and “promised” do not
make such a distinction.
The dependency analysis for the two sentences is as follows:
Parsing Algorithms
• CFGs in the worst case need backtracking and have worst-case
parsing algorithms that run in O(n³), where n is the size of the
input.
• Variants of this algorithm are used in statistical parsers that
attempt to search the space of possible parse trees without the
limitation of left-to-right parsing.
Our example CFG G is rewritten as a new CFG Gc that contains at most
two non-terminals on the right-hand side.
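A minimal sketch of the resulting O(n³) chart-parsing (CKY-style) recognition over a binarized toy grammar; the grammar and sentence are illustrative only:

from collections import defaultdict

# Binarized toy grammar: rules are either A -> B C or A -> word.
binary = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
lexical = {"the": "Det", "cat": "N", "dog": "N", "saw": "V"}

def cky(words):
    n = len(words)
    table = defaultdict(set)            # table[(i, j)] = non-terminals over span i..j
    for i, w in enumerate(words):
        table[(i, i + 1)].add(lexical[w])
    for span in range(2, n + 1):        # O(n^3): all spans and split points
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for B in table[(i, k)]:
                    for C in table[(k, j)]:
                        if (B, C) in binary:
                            table[(i, j)].add(binary[(B, C)])
    return "S" in table[(0, n)]

print(cky("the cat saw the dog".split()))   # True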
• Now let us look at the steps the parser has to take to construct a
specialized CFG.
• Let us consider the rules that generate only lexical items:
We can use a maximum spanning tree (MST) algorithm to find the most
likely parse for this graph. One popular algorithm for this is the
Chu-Liu/Edmonds algorithm:
1. We first remove all self-loops and multiple edges in the graph. This
is because a valid dependency tree must be acyclic and have only one
edge between any two nodes.
2. We then choose a node to be the root of the tree. In this example,
we can choose "chased" to be the root since it is the main verb of the
sentence.
3. We then compute the scores for each edge in the graph based on a
scoring function that takes into account the probability of each edge
being a valid dependency. The scoring function can be based on various
linguistic features, such as part-of-speech tags or word embeddings.
4. We use the MST algorithm to find the tree that maximizes the total
score of its edges. The algorithm starts with a set of edges that
connect the root node to each of its immediate dependents, and
iteratively adds edges that connect other nodes to the tree. At each
iteration, we select the edge with the highest score that does not
create a cycle in the tree.
5. Once the MST algorithm has constructed the tree, we can assign a
label to each edge based on the type of dependency it represents
(e.g., subject, object, etc.).
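As a sketch of step 4, the networkx library provides an implementation of Edmonds' algorithm; this assumes networkx is installed, and the sentence and edge scores below are invented for illustration:

import networkx as nx

# Toy scored dependency graph for "the cat chased the dog";
# the scores are made up, and a ROOT node points to the main verb.
G = nx.DiGraph()
edges = [("ROOT", "chased", 10), ("chased", "cat", 8), ("chased", "dog", 7),
         ("cat", "the1", 5), ("dog", "the2", 5), ("dog", "cat", 2)]
for head, dep, score in edges:
    G.add_edge(head, dep, weight=score)

# Edmonds' (Chu-Liu/Edmonds) algorithm finds the maximum spanning
# arborescence, i.e. the highest-scoring directed tree.
tree = nx.maximum_spanning_arborescence(G)
print(sorted(tree.edges(data="weight")))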
Here is an example of a PCFG for the sentence "the cat saw the dog":
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]
In this PCFG, each production rule is annotated with a probability.
For example, the rule NP -> Det N [0.6] has a probability of 0.6,
indicating that a noun phrase is generated by a determiner followed
by a noun with probability 0.6.
To parse the sentence "the cat saw the dog" using this PCFG, we can
use the CKY algorithm to generate all possible parse trees and
compute their probabilities.
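Since this grammar matches the format accepted by NLTK's PCFG reader, a hedged usage sketch (assuming nltk is installed and its ViterbiParser behaves as below) is:

import nltk

grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]
""")

parser = nltk.ViterbiParser(grammar)   # returns the most probable parse
for tree in parser.parse("the cat saw the dog".split()):
    tree.pretty_print()
    print(tree.prob())   # product of the probabilities of the rules used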
In the prediction stage, the parser predicts which constituents may
start at the current position. For example, if the grammar has a rule
S -> NP VP, the parser would predict the presence of an S symbol in
the current span of the sentence by adding a new item to the chart
that indicates that an S symbol can be generated by an NP symbol
followed by a VP symbol.
In the scanning stage, the parser checks whether a word in the
sentence can be assigned to a non-terminal symbol in the chart. For
example, if the parser has predicted an NP symbol in the current span
of the sentence, and the word "dog" appears in that span, the parser
would add a new item to the chart that indicates that the NP symbol
can be generated by the word "dog".
In the completion stage, the parser combines items in the chart that
have the same end position and can be combined according to the
grammar rules. For example, if the parser has added an item to the
chart that indicates that an NP symbol can be generated by the word
"dog", and another item that indicates that a VP symbol can be
generated by the word "saw" and an NP symbol, the parser would add a
new item to the chart that indicates that an S symbol can be
generated by an NP symbol followed by a VP symbol.
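A hedged usage sketch of an Earley chart parser via NLTK (assuming nltk exposes EarleyChartParser as below; with trace=1 it prints the predictor, scanner, and completer operations on the chart):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | N
VP -> V NP | V
Det -> 'the' | 'a'
N -> 'cat' | 'dog'
V -> 'saw'
""")

parser = nltk.parse.EarleyChartParser(grammar, trace=1)
for tree in parser.parse("the cat saw the dog".split()):
    print(tree)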
Here is an example of a probabilistic Earley parser applied to the
sentence "the cat saw the dog":
Grammar:
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]
Initial chart:
0: [S -> * NP VP [1.0], 0, 0]
(The intermediate chart entries produced during parsing are omitted here.)
Possible parse trees and their scores:
S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"
Score: 5.7
S -> NP VP
- NP -> N
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"
Score: 4.9
S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V
- - V -> "saw"
- NP -> Det N
- - Det -> "the"
- - N -> "dog"
Score: 3.5
Selected parse tree:
S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"
The selected parse tree corresponds to the correct parse for the
sentence.
6. Multilingual Issues:
In natural language processing (NLP), a token is a sequence of
characters that
represents a single unit of meaning. In other words, it is a word or a
piece of a word that has a specific meaning within a language. The
process of splitting a text into individual tokens is called
tokenization.
However, the definition of what constitutes a token can vary
depending on the language being analyzed. This is because
different languages have different rules for how words are
constructed, how they are written, and how they are used in
context.
For example, in English, words are typically separated by spaces,
making it relatively easy to tokenize a sentence into individual
words. However, in some languages, such as Chinese or Japanese, there
are no spaces between words, and the text must be segmented into
individual units of meaning based on other cues, such as syntax or
context.
Furthermore, even within a single language, there can be variation in
how words are spelled or written. For example, in English, words can
be spelled with or without hyphens or apostrophes, and there can be
differences in spelling between American English and British English.
Multilingual issues in tokenization arise because different languages
can have different character sets, which means that the same sequence
of characters can represent different words in different languages.
Additionally, some languages have complex morphology, which means
that a single word can have many different forms that represent
different grammatical features or meanings.
To address these issues, NLP researchers have developed multilingual
tokenization techniques that take the specific properties of each
language into account.
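As a small illustration of this language dependence in Python (the regular expression below is one plausible heuristic, not a standard tokenizer):

import re

english = "I can't believe it's not butter-free."
# Naive whitespace tokenization is a reasonable first cut for English:
print(english.split())
# Clitics such as "can't" can be handled with a smarter rule:
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", english))

chinese = "我喜欢自然语言处理"   # "I like natural language processing"
# There are no spaces, so whitespace splitting yields a single "token";
# a dedicated word segmenter is needed for such scripts.
print(chinese.split())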
6.3 Morphology:
Morphology is the study of the structure of words and how they are
formed from
smaller units called morphemes. Morphological analysis is important
in many natural language processing tasks, such as machine
translation and speech
recognition, because it helps to identify the underlying structure of
words and to disambiguate their meanings.
UNIT -3
Semantic Parsing
INTRODUCTION
• A central goal of research in language understanding is the
identification of a meaning representation that is detailed enough
to allow reasoning systems to make deductions.
• At the same time, the representation should be general enough that
it can be used across many domains with little to no adaptation.
• Two approaches have emerged in natural language processing for
language understanding.
• In the first approach a specific rich meaning representation is
created for a limited domain for use by applications that are
restricted to that domain.
• Example: air travel reservations, football game simulations,
querying a geographic database.
• In the second approach a related set of intermediate meaning
representations is created from low level, to midlevel and the
bigger understanding task is divided into multiple smaller pieces
that are more manageable such as word sense disambiguation.
• In the first approach the meaning representations are tied to a
specific domain.
• In the second approach the meaning representations cover the
overall meaning.
• We do not yet have a detailed overall representation that would
cover across domains.
• Here we treat the world as though it has exactly two types of
meaning representations:
• A domain-dependent, deeper representation (deep semantic parsing)
• A domain-independent, shallower representation (shallow semantic parsing)
• Structural Ambiguity
• Word Sense
• Entity and Event Resolution
• Predicate-Argument structure
• Meaning Representation
Semantic parsing is considered part of a larger process of semantic
interpretation.
Word Sense
• In any given language the same word type or word lemma is
used in different contexts and with different morphological
variants to represent different entities or concepts in the world.
• For example, the word nail is used to represent a part of the human
anatomy and also to represent an iron object.
• Humans can easily identify the use of nail in the following
sentences:
• He nailed the loose arm of the chair with a hammer.
• He bought a box of nails from the hardware store.
• He went to the beauty salon to get his nails clipped
• He went to get a manicure. His nails had grown very long.
• Resolving the sense of words in a discourse, therefore,
constitutes one of the steps in the process of semantic
interpretation.
Entity and Event Resolution
• Coreference resolution
Predicate-Argument Structure
Predicate-Argument Structure
Meaning Representation
• Scope
• Domain Dependent: These systems are specific to certain
domains such as air travel reservations or simulated
football coaching.
• Domain Independent: These systems are general enough
that the techniques can be applied to multiple domains
with little or no change.
• Coverage
• Shallow: These systems tend to produce an intermediate
representation that can then be converted to one that a
machine can base its actions on.
• Deep: These systems usually create a terminal
representation that is directly consumed by a machine or
application.
Word Sense
• Resources
• Systems
• Software
• In a compositional approach, the semantics of a discourse is
composed from the meanings of its parts.
• The smallest parts in textual discourse are the words themselves:
• Tokens that appear in the text
• Lemmatized parts of the tokens
• It is not clear whether it is possible to identify a finite set of senses
that each word in a language exhibits in various contexts.
• Attempts to solve this problem range from rule based, knowledge
based to completely unsupervised, supervised and semi-supervised
learning.
• The early systems were either rule based or knowledge based and
used dictionary definitions of senses of words.
• Unsupervised word sense disambiguation techniques induce the
sense of a word as it appears in various corpora.
• Rule based
• Supervised
• Unsupervised
• Algorithms motivated by Cross-linguistic Evidence
Semi-Supervised
Let us discuss some sense disambiguation systems.
We can classify sense disambiguation systems into four main
categories:
• Rule based or knowledge based
• Supervised
• Unsupervised
• Semi-supervised
Rule Based
• The first generation of word sense disambiguation systems
were primarily based on dictionary sense definitions.
• Access to the exact rules and the systems was very limited.
• Here we look at a few techniques and algorithms which are
accessible.
• The first generation word sense disambiguation algorithms
were mostly based on computerized dictionaries.
• We now have a look at Lesk algorithm which can be used as a
baseline for comparing word sense disambiguation
performance.
• The algorithm has two parts: an initialization step and an
iterative step, in which the algorithm attempts to disambiguate
all the words in the context iteratively.
• Its performance is very close to that of supervised algorithms
and surpasses the best unsupervised algorithms.
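Before the graph example, here is a hedged usage sketch of the simplified Lesk implementation that ships with NLTK (this assumes the WordNet data has been downloaded, e.g., via nltk.download('wordnet')):

from nltk.wsd import lesk

# Simplified Lesk: picks the WordNet sense whose gloss overlaps
# most with the context words.
context = "He bought a box of nails from the hardware store .".split()
sense = lesk(context, "nails", pos="n")
if sense:
    print(sense.name(), "-", sense.definition())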
• We now look at an example semantic graph for the two senses
of the term bus.
• The first one is the vehicle sense
The second one is the connector sense
• Notation:
• T (the lexical context) is the list of terms in the context of the
term t to be disambiguated: T = [t1, t2, …, tn].
• S1t, S2t, …, Snt are structural specifications of the possible
concepts (or senses) of t.
• Brown and colleagues were among the first to use supervised
learning, exploiting information in the form of parallel corpora.
• Yarowsky used a rich set of features and decision lists to tackle
the word sense problem.
• A few researchers like Ng and Lee have come up with several
variations including different levels of context and granularity.
• Now we look at the different classifiers and features that are
relatively easy to obtain.
Supervised-Classifiers
• The collocation C1,1 would be “from” and C1,3 would be the string
“from the hardware”.
Stop words and punctuation are not removed before creating the
collocations, and at boundary positions a null word is used (as in
the sketch below).
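A sketch of extracting such collocation strings; the offset convention Ci,j is interpreted here as the token window from offset i to offset j relative to the target word, which is an assumption about the notation:

def collocation(tokens, target, i, j):
    # Ci,j: the string of tokens from offset i to offset j around the
    # target position; out-of-range positions yield a null word.
    window = ["<NULL>" if k < 0 or k >= len(tokens) else tokens[k]
              for k in range(target + i, target + j + 1)]
    return " ".join(window)

tokens = "He bought a box of nails from the hardware store .".split()
t = tokens.index("nails")
print(collocation(tokens, t, 1, 1))   # from
print(collocation(tokens, t, 1, 3))   # from the hardware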
• Topic Features- The broad topic or domain of the article that the
word belongs to is also a good indicator of what sense of the word
might be most frequent.
• The sense that falls in the subhierarchy with the highest conceptual
density is chosen to be the correct sense.
• In the figure in the previous slide sense 2 is the one with the
highest conceptual density and is therefore the chosen one.
Resnik observed that selectional constraints and word senses are
related and defined a measure for computing the sense of a word on
the basis of predicate-argument statistics.
• where ai is the score for sense si. The sense si with the largest
value of ai is chosen as the sense for the word. Ties are broken by
random choice.
• Leacock, Miller, and Chodorow provide another algorithm that makes
use of corpus statistics and WordNet relations, and show that
monosemous relatives can be exploited for disambiguating words.
Evidence
• There are a family of unsupervised algorithms based on
crosslinguistic information or evidence.
• Brown and others were the first to make use of this information
for purposes of sense disambiguation.
• They were interested in sense differences that required
translating into other languages in addition to sense
disambiguation.
• They provide a method to use the context information for a
given word to identify its most likely translation in the target
language.
Evidence
• Dagan and Itai used a bilingual lexicon paired with a monolingual
corpus to acquire statistics on word senses automatically.
• They also proposed that syntactic relations, along with word
co-occurrence statistics, provide a good source of evidence for
resolving lexical ambiguity.
SEMI SUPERVISED
The next category of algorithms we look at are those that start from a
small seed of examples and use an iterative algorithm to identify more
training examples with a classifier.
• This additional automatically labeled data can be used to
augment the training data of the classifier to provide better
predictions for the next cycle.
• The Yarowsky algorithm is such an algorithm; it introduced
semi-supervised methods to the word sense disambiguation
problem.
• The algorithm is based on the assumption that corpora exhibit two
strong properties: one sense per collocation and one sense per
discourse.
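A schematic sketch of such a bootstrapping (self-training) loop; the classifier choice, threshold, and toy data are assumptions for illustration, not the original decision-list implementation:

import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap(X_seed, y_seed, X_pool, rounds=5, threshold=0.9):
    X, y = list(X_seed), list(y_seed)
    pool = list(X_pool)
    clf = LogisticRegression().fit(X, y)
    for _ in range(rounds):
        if not pool:
            break
        probs = clf.predict_proba(pool)
        keep = []
        for x, p in zip(pool, probs):
            if p.max() >= threshold:
                # Confidently labeled example joins the training data.
                X.append(x)
                y.append(clf.classes_[p.argmax()])
            else:
                keep.append(x)
        if len(keep) == len(pool):
            break   # nothing new was labeled; stop early
        pool = keep
        clf = LogisticRegression().fit(X, y)
    return clf

X_seed = np.array([[0.0, 1.0], [1.0, 0.0]])
y_seed = np.array([0, 1])
X_pool = np.array([[0.1, 0.9], [0.9, 0.2], [0.5, 0.5]])
clf = bootstrap(X_seed, y_seed, X_pool)
print(clf.predict(X_pool))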
• There are two types of links: a simple link such as [[bar]] or a piped
link such as [[musical_notation|bar]].
• Filter those links that point to a disambiguation page. This means that
we need further information to disambiguate the word. If the word
does not point to a disambiguation page the word itself can be the
label.
• For all piped links the string before the pipe serves as the label.
• Collect all the labels associated with the word and then map them to
possible WordNet senses.
They might all map to the same sense, essentially making the verb
monosemous and therefore not useful for this purpose.
UNIT - IV
Predicate-Argument Structure
• Resources
• Systems
Software
Predicate-Argument Structure
Resources
• FrameNet
• PropBank
Other Resources
Resources
• We have two important corpora that are semantically tagged. One
is FrameNet and the other is PropBank.
• These resources have helped the field transform from rule-based
approaches to more data-oriented approaches.
• These approaches focus on transforming linguistic insights into
features.
• FrameNet is based on the theory of frame semantics where a
given predicate invokes a semantic frame initiating some or all of
the possible semantic roles belonging to that frame.
PropBank is based on Dowty’s prototype theory and takes a more
linguistically neutral view. Each predicate has a set of core
arguments that are predicate dependent, and all predicates share a
set of noncore or adjunctive arguments.
FRAME NET
PROPBANK
OTHER RESOURCES
SYSTEMS
• Syntactic Representations
• Classification Paradigms
• Overcoming the Independence Assumptions
• Feature Performance
• Feature Salience
• Feature Selection
• Size of Training Data
• Overcoming Parsing Errors
• Noun Arguments
Multilingual Issues
Syntactic Representation
• Verb Clustering:
• This predicate is one of the most salient features in predicting the
argument class.
• Gildea and Jurafsky used a distance function for clustering that is
based on the intuition that verbs with similar semantics will tend to
have similar direct objects.
• For example:
• Verbs such as eat, devour and savor will occur with direct objects
describing food.
• The clustering algorithm uses a database of verb-direct object
relations.
• The verbs were clustered into 64 classes using the probabilistic co-
occurrence model.
Surdeanu suggested the following features:
Content Word: Since in some cases head words are not very
informative, a different set of rules was used to identify a so-called
content word instead of using the head-word finding rules. The rules
that they used are:
• POS of Head Word and Content Word: Adding the POS of the head
word and the content word of a constituent as features helps
generalize in the task of argument identification and gave a
performance boost to their decision-tree-based systems.
• Named Entity of the Content Word: Certain roles, such as ARGM-
TMP and ARGM-LOC, tend to contain time or place named entities. This
information was added as a set of binary-valued features.
• Boolean Named Entity Flags: Named entity information was also
added as a feature. They created indicator functions for each of the
seven named entity types: PERSON, PLACE, TIME, DATE,
MONEY, PERCENT, ORGANIZATION.
• Phrasal Verb Collocations: This feature comprises frequency
statistics related to the verb and the immediately following
preposition.
• Fleischman, Kwon, and Hovy added the following features to their
system:
• Logical function: This feature takes three values (external
argument, object argument, and other argument) and is computed using
heuristics on the syntax tree.
• The path is one of the most salient features for argument
identification.
• This is also the most data-sparse feature.
• To overcome this problem the path was generalized in several
different ways
• Clause-based path variations: The position of clause nodes
(S, SBAR) seems to be an important feature in argument
identification. Experiments were done with four clause-based
path feature variations:
• Replacing all the nodes in the path other than clause nodes with
an asterisk (*).
• Retaining only the clause nodes in the path.
• Partial path: Using only that part of the path from the
constituent to the lowest common ancestor of the predicate and
the constituent.
• Predicate context: This feature captures predicate sense
variations. Two words before and two words after were added
as features. The POS of the words were also added as features.
• Punctuations: For some adjunctive arguments, punctuation
plays an important role. This set of features captures whether
punctuation appears immediately before and after the
constituent.
• Feature context:
• Features of constituents that are parent or siblings of the
constituent being classified were found useful.
• There is a complex interaction between the types and number of
arguments that a constituent can have.
This feature uses the feature values of the other constituents that
are likely to be non-null as added context.
• The figure in the previous slide shows how the arguments of the
predicate “kick” map to the nodes in a phrase structure grammar
tree as well as the nodes in a Minipar parse tree.
• The nodes that represent head words of constituents are the targets
of classification.
They used the features in the following slide
The following table lists the features used by the semantic chunker
Classification Paradigms
• A few researchers have adopted the same basic paradigm but added
a simple postprocessor that removes implausible analyses, such as
two overlapping arguments.
• A few more complicated approaches augment the post processing
step to use argument specific language models or frame element
group statistics.
• There are more sophisticated approaches that perform joint decoding
of all the arguments, trying to capture the interdependence among
arguments.
• These sophisticated approaches have yielded only slight gains
because the performance of a pure classifier followed by a simple
postprocessor is already quite high.
• Here we concentrate on a current high-performance approach that
is very effective.
• Let us look at the process of the SRL algorithm designed by Gildea
and Jurafsky. It involves two steps:
• In the first step:
• It calculates the maximum likelihood probabilities that a
constituent is an argument based on two features:
• P(argument | Path, Predicate)
• P(argument | Head, Predicate)
• It interpolates them to generate the probability that the constituent
under consideration represents an argument.
• In the second step:
• It assigns each constituent that has a nonzero probability of being
an argument a normalized probability, calculated by interpolating
distributions conditioned on various sets of features.
• It then selects the most probable argument sequence.
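A hedged sketch of the identification step's interpolation; the count structure, the lambda, and the numbers are illustrative assumptions, not the original implementation, which backs off over several feature sets:

def p_argument(path, head, pred, counts, lam=0.5):
    # Interpolate P(arg | path, pred) and P(arg | head, pred).
    # `counts` maps (feature, pred) -> (argument_count, total_count).
    def mle(key):
        arg, total = counts.get(key, (0, 0))
        return arg / total if total else 0.0
    return lam * mle((path, pred)) + (1 - lam) * mle((head, pred))

# Hypothetical counts gathered from a treebank:
counts = {("NP^S", "kick"): (80, 100), ("ball", "kick"): (45, 50)}
print(p_argument("NP^S", "ball", "kick", counts))   # 0.5*0.8 + 0.5*0.9 = 0.85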
Feature Performance
Feature Selection
• Adding the named entity feature to the null/non-null classifier has
little effect on the performance of the argument identification
task.
• The same feature set showed significant improvement to the
argument classification task.
• This indicates that a feature selection strategy would be very
useful.
• One strategy is to leave one feature at a time and check the
performance.
• Depending on the performance the feature is kept or pruned out.
• Another solution for feature selection is to convert the scores
output by SVMs into probabilities by fitting a sigmoid.
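A hedged sketch of this idea using scikit-learn, which packages sigmoid (Platt) scaling as CalibratedClassifierCV; the toy data below is invented:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]] * 10)
y = np.array([0, 0, 1, 1] * 10)

# method="sigmoid" fits P(y=1 | score) = 1 / (1 + exp(A*score + B))
# on held-out folds of the training data.
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
clf.fit(X, y)
print(clf.predict_proba([[1., 0.5]]))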
Multiple Views
Broader Search
• One important problem with all these approaches is that all the
parsers are trained on the same Penn Treebank which when
evaluated on sources other than WSJ seems to degrade in
performance.
• It has been shown that when we train the system on WSJ data and
test on the Brown propositions, classification performance and
identification performance are affected to the same degree.
• This shows that more lexical semantic features are needed to
bridge the performance gap across genres.
Zapirain showed that incorporating features based on selectional
preferences provides one way of achieving more lexico-semantic
generalization.
Software
• Resources
• Systems
• Software
• Now we look at the activity which takes natural language
input and transforms it into an unambiguous representation
that a machine can act on.
• This form is understood by machines more readily than by human
beings.
• It is easier to generate such a form for programming languages
because they impose syntactic and semantic restrictions on programs,
whereas such restrictions cannot be imposed on natural language.
• Techniques developed so far work within specific domains
and are not scalable.
• This is called deep semantic parsing as opposed to shallow
semantic parsing.
Resources
• GeoQuery
• Robocup:CLang
ATIS
• Rule Based
• Supervised
Rule Based
Rule Based
SUPERVISED
• The following are a few problems with rule-based systems:
• They need some effort upfront to create the rules
• The time and specificity required to write rules restrict the
development to systems that operate in limited domains
• They are hard to maintain and scale up as the problems become more
complex and more domain independent
• They tend to be brittle
• As an alternative, statistical models derived from hand-annotated
data can be used.
• Unless some hand-annotated data is available, statistical models
cannot deal with unknown phenomena.
• During the ATIS evaluations some data was hand-tagged for semantic
information.
• Schwartz used that information to create the first end-to-end
supervised statistical learning system for the ATIS domain.
• They had four components in their system:
• Semantic parse
• Semantic frame
• Discourse
• Backend
• This system used a supervised learning approach, with quick training
augmentation through a human-in-the-loop corrective approach to
generate more (albeit lower-quality) data for improved supervision.
• This research is now known as natural language interface for
databases (NLIDB).
• Zelle and Mooney tackled the task of retrieving answers from a
Prolog database by converting natural language questions into Prolog
queries, in the domain of GeoQuery.
• The CHILL (Constructive Heuristics Induction for Language
Learning) system uses a shift-reduce parser to map the input sentence
into parses expressed as a Prolog program.
• A representation closer to formal logic than SQL is preferred for
CHILL because it can be translated into other equivalent
representations.
• It took CHILL 175 training queries to match the performance of
Geobase.
• After advances in machine learning, new approaches were
identified and existing ones were refined.
• CHILL
Discourse Processing
Introduction
Discourse processing in natural language processing (NLP) refers to
the study and analysis of text beyond the level of individual
sentences. It focuses on understanding the connections,
relationships, and coherence between sentences and larger units of
text, such as paragraphs and documents.
Discourse processing aims to capture the overall meaning, structure,
and flow of a piece of text, taking into account various linguistic
and contextual factors.
The main goal of discourse processing is to extract meaningful
information and infer the intended meaning from a text, which
can be useful in a variety of NLP applications, such as text
summarization, sentiment analysis, question answering, and
information extraction.
It involves several cohesion subtasks, including:
1. Reference cohesion:
It involves the use of pronouns or other referring expressions that
point back to entities mentioned earlier in the text.
Example:
Text: "John bought a car. He loves it."
In this example, "He" refers to John and "it" refers to the car.
2. Substitution cohesion:
It involves replacing a word or phrase with a substitute form such as
"one" or "do".
Example:
Text: "I wanted a large coffee, but they only had small ones."
In this example, "ones" substitutes for "coffees".
3. Ellipsis cohesion:
It involves the omission of words or phrases that can be inferred from
the context.
Example:
Text: "John went to the store, and Mary to the library."
In this example, the verb "went" is omitted in the second part of the
sentence, but it can be inferred from the previous context.
4.Lexical cohesion:
It is based on the use of related words or synonyms across sentences
or paragraphs.
Example:
Text: "The weather was hot. The sun was shining brightly. People
were enjoying the beach."
In this example, the words "weather," "sun," and "beach" are used to
establish lexical cohesion.
5.Conjunction cohesion:
It involves the use of conjunctions or connectors to link sentences or
ideas together.
Example:
Text: "I bought some groceries. In addition, I need to do laundry."
In this example, the conjunction "In addition" establishes cohesion
between the two sentences.
Cohesion plays a crucial role in enhancing the clarity and
understanding of a text.
By creating connections between different parts of a text, cohesion
helps readers or language processing systems comprehend the
relationships and follow the flow of information smoothly.
Reference Resolution
Language Modeling
• Introduction
• n-Gram Models
• Language Model Evaluation
• Parameter Estimation
• Language Model Adaptation
n-Gram Models
• The assumption is that all the preceding words except the n-1
words directly preceding the current word are irrelevant for
predicting the current word.
• Hence P(W) is approximated as:
P(W) ≈ ∏i P(wi | wi-n+1, …, wi-1)
Parameter Estimation
• The parameters can be estimated from relative frequencies:
P(wi | wi-2, wi-1) = c(wi-2, wi-1, wi) / c(wi-2, wi-1)
• where c(wi-2, wi-1, wi) is the count of the trigram wi-2 wi-1 wi in
the training data.
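A small sketch of estimating these relative frequencies from a toy corpus:

from collections import Counter

corpus = "the cat sat on the mat . the cat ran .".split()
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))   # c(w_{i-2}, w_{i-1}, w_i)
bi = Counter(zip(corpus, corpus[1:]))                # c(w_{i-2}, w_{i-1})

def p_mle(w2, w1, w):
    return tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0

print(p_mle("the", "cat", "sat"))   # 0.5: "the cat" is followed by sat, ran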
Smoothing
• This method fails to assign nonzero probabilities to word
sequences that have not been observed in the training data.
• The probability of sequences that were observed might also be
overestimated.
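As one common remedy, add-one (Laplace) smoothing gives every trigram a pseudo-count of one; a hedged sketch over the same kind of counts (the toy corpus is illustrative):

from collections import Counter

corpus = "the cat sat on the mat . the cat ran .".split()
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
bi = Counter(zip(corpus, corpus[1:]))
V = len(set(corpus))                  # vocabulary size

def p_add1(w2, w1, w):
    # Add-one smoothing: every trigram gets a pseudo-count of 1.
    return (tri[(w2, w1, w)] + 1) / (bi[(w2, w1)] + V)

print(p_add1("the", "cat", "sat"))    # seen event, discounted below its MLE
print(p_add1("the", "cat", "mat"))    # unseen event now gets nonzero mass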