Natural Language Processing Tools and Approaches
Natural Language Processing
Natural Language Processing (NLP) refers to the branch of computer science (and more specifically, the branch of artificial intelligence) concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.
Natural Language Processing (contd)
• NLP drives computer programs that translate text from one
language to another, respond to spoken commands, and
summarize large volumes of text rapidly—even in real time.
1) Python and the Natural Language Toolkit (NLTK)
• The NLTK includes libraries for many NLP tasks, plus libraries for subtasks such as sentence parsing, word segmentation, stemming and lemmatization (methods of trimming words down to their roots), and tokenization (breaking phrases, sentences, paragraphs, and passages into tokens that help the computer better understand the text).
With NLP, machines can make sense of written or spoken text and perform tasks including speech recognition, sentiment analysis, and automatic text summarization.
Why is NLP important?
1. Large volumes of textual data
1. TOKENIZATION
Tokenization is the first step in any NLP pipeline, and it has an important effect on the rest of the pipeline.
Tokenization is the process of splitting a string of text into a list of tokens. A tokenizer breaks unstructured natural language text into chunks of information that can be treated as discrete elements. The token occurrences in a document can be used directly as a vector representing that document.
This immediately turns an unstructured string (text document) into a numerical data structure suitable for machine learning. Tokens can also be used directly by a computer to trigger useful actions and responses, or serve in a machine learning pipeline as features that trigger more complex decisions or behavior.
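The idea above can be sketched in a few lines of Python: a regex-based tokenizer and a bag-of-words count that turns a document into a numerical representation. This is a minimal illustration using only the standard library, not the tokenizer of any particular toolkit.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens using a simple regex."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def bag_of_words(text):
    """Token counts: a minimal numerical representation of a document."""
    return Counter(tokenize(text))

tokens = tokenize("The cat sat on the mat.")
# tokens == ['the', 'cat', 'sat', 'on', 'the', 'mat']
vector = bag_of_words("The cat sat on the mat.")
# vector['the'] == 2
```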
Lexical Analysis
• Lexical analysis separates tokens from language statements based on the "pattern matching" principle.
• A token (a string of characters) is the smallest unit with some logical meaning, which cannot be further decomposed.
• Regular expressions are used to specify text strings in all sorts of text-processing and information-extraction applications.
• "Finite State Automata" are a basic mathematical foundation used extensively in computational linguistics.
• Variations / different forms of finite state automata: finite-state transducers, HMMs, N-grams.
Regular Expression (RE)
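The slide's regular-expression examples are not reproduced here; as a minimal illustration of REs for text extraction, the sketch below uses Python's `re` module (the sample text and both patterns are invented for this illustration):

```python
import re

text = "Flight AI-202 departs at 09:45 and flight BA7 departs at 18:30."

# Match flight codes: two capital letters, an optional hyphen, then digits.
flights = re.findall(r"\b[A-Z]{2}-?\d+\b", text)
# flights == ['AI-202', 'BA7']

# Match 24-hour clock times (non-capturing group so findall returns the
# whole match).
times = re.findall(r"\b(?:[01]\d|2[0-3]):[0-5]\d\b", text)
# times == ['09:45', '18:30']
```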
2. Morphological Analysis / Parsing
Morphological Analysis / Parsing
• What is a "morpheme"?
  A minimal meaning-bearing unit in a language.
  e.g. dog  -> 1 morpheme: dog
       cats -> 2 morphemes: cat + s
• The problem of recognizing that a word breaks down into component morphemes and building a structural representation of this fact is known as "Morphological Analysis / Parsing".
  e.g. doctors -> doctor + s
       foxes   -> fox + es
       talking -> talk + ing
  (surface / input form) -> (stem) + (suffix)
• Different surface forms can have the same stem
  e.g. sings, sang, sung -> sing
  (surface forms) -> (stem / common lemma)

Morphemes fall into two classes: stems and affixes. Affixes are further classified by position:
  - Prefix: precedes the stem
  - Suffix: follows the stem
  - Infix: inserted inside the stem
  - Circumfix: precedes and follows the stem
Methods / Approaches to combine morphemes to form words
Verbs
Derivational Morphology
Cliticization
Finite state morphological analysis / parsing
• What is expected in morphological analysis?
  cities  -> city + N + Pl
  caught  -> catch + V + Past  or  catch + V + PastParticiple   (ambiguity)
  dog     -> dog + N + SG
  playing -> play + V + PresPart
How to build a Morphological parser?
• Lexicon
  A lexicon is a list of stems and affixes along with basic information about them (e.g. whether a stem is a noun or a verb).
• Morphotactics
  These are the guiding rules for the arrangement of morphemes from different classes, i.e. morphemes of which class can follow morphemes of another class.
  e.g. boy -> boys, girl -> girls: here the suffixes 'es' and 's' follow the noun; they cannot precede it.
• Orthographic rules
  Spelling rules: they describe the changes occurring in a word stem while combining with other morphemes, e.g. city -> cities, not citys.
Exercise
Give the morphological parse of each input:
  patents
  balloons
  watched
  animal
  going
• Morphological parse
  patents  -> patent + N + Pl
  balloons -> balloon + N + Pl
  watched  -> watch + V + Past
  animal   -> animal + N + SG
  going    -> go + V + PresPart
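The exercise answers above can be reproduced by a toy suffix-stripping parser. Real systems use finite-state transducers; this sketch hard-codes a few suffix rules and a small lexicon of stems, all invented for this example.

```python
# Toy lexicon: stem -> word class.
LEXICON = {"patent": "N", "balloon": "N", "watch": "V",
           "animal": "N", "go": "V"}

# Suffix rules: (suffix, tag features), tried longest-first.
SUFFIX_RULES = [
    ("ing", "+V +PresPart"),
    ("ed",  "+V +Past"),
    ("s",   "+N +Pl"),
]

def parse(word):
    """Return a morphological parse like 'patent +N +Pl', or None."""
    for suffix, features in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in LEXICON:
                return f"{stem} {features}"
    if word in LEXICON:                     # bare stem, no affix
        tag = LEXICON[word]
        return f"{word} +{tag} +SG" if tag == "N" else f"{word} +{tag}"
    return None

print(parse("going"))    # go +V +PresPart
print(parse("patents"))  # patent +N +Pl
```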
Construction of a Finite-State lexicon
E.g. cat, dog, ...: the majority of nouns fall under the reg-noun category. (Plurals with -es are ignored here.)
English verbal inflection uses three verb stem classes:
  - regular verb stem
  - irregular verb stem
  - irregular past verb form
and an affix class:
  - -ed (past)
  - -ed (participle)
  - -ing (participle)
  - -s (third person singular)
Exercise
Finite-State Morphological Parsing
Morphological recognition: plug the sub-lexicon into the FSA, i.e. expand each arc (such as reg-noun, irreg-pl-noun, etc.) into the words it covers.
Finite-State Transducer (FST)
An FST has transitions labelled with input:output pairs (such as a:b, b:a, b:ε, b:b, b:ba) between states such as q0 and q1. An FST can be viewed in four ways:
  1. FST as recognizer
  2. FST as generator
  3. FST as translator
  4. FST as set relator
FST representation
Morphological Parsing
It is implemented by building mapping rules that map letter sequences from the lexical level to the surface level, e.g.
  Lexical: c a t +N +Pl   (alphabet Σ)
  Surface: c a t s        (alphabet Δ)
Together the paired symbols form a new alphabet Σ'.
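A tiny transducer in this spirit can be sketched as a Python dictionary. The representation (state -> {input char: (output string, next state)}) and the end-of-input ε-transition are modelling choices for this sketch, not a standard FST library API; it maps the surface forms "cats" and "cat" to lexical forms.

```python
# Toy FST: state -> {input symbol: (output string, next state)}.
# "" marks an end-of-input (epsilon) transition.
FST = {
    "q0": {"c": ("c", "q1")},
    "q1": {"a": ("a", "q2")},
    "q2": {"t": ("t", "q3")},
    "q3": {"s": (" +N +Pl", "qf"),   # surface 's' -> plural features
           "":  (" +N +SG", "qf")},  # no suffix -> singular features
}

def transduce(word):
    """Run the FST over word; return the output string or None if rejected."""
    state, output = "q0", []
    for ch in word:
        if ch not in FST.get(state, {}):
            return None                      # no transition: reject
        out, state = FST[state][ch]
        output.append(out)
    if state in FST and "" in FST[state]:    # take the end-of-input arc
        out, state = FST[state][""]
        output.append(out)
    return "".join(output) if state == "qf" else None

print(transduce("cats"))  # cat +N +Pl
print(transduce("cat"))   # cat +N +SG
```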
Morphological noun parser
^ - indicates a morpheme boundary
# - indicates a word boundary
Transducers and Orthographic rules
Part of Speech Tagging
• Part-of-speech tagging is the process of assigning a PoS or other syntactic class marker to each word in a corpus.
  (A corpus is a computer-readable collection of text or speech.)
• PoS tags are useful for subsequent syntactic parsing and word sense disambiguation.
• The basic 8 parts of speech:
  noun, verb, pronoun, preposition, adverb, conjunction, participle, article
• PoS tags fall into two groups: closed class and open class.
Part of speech is the grammatical category of a word.
• A verb asserts something about the subject of the sentence and expresses actions, events, or states of being. Ex. walk, write, stand (main form, 3rd person SG, past participle, ...)
• An adverb describes a verb, a phrase, or a clause. An adverb indicates manner, time, place, cause, ... Ex. slowly, gently (directional, locative, degree, manner, temporal)
• A pronoun replaces a noun or another pronoun and is essentially used to make sentences less cumbersome and less repetitive. Ex. he, she, it
• Prepositions, conjunctions, determiners, ...
Closed classes in English
• Prepositions – on, under, over, from, to
• Determiners – a, an, the
• Pronouns – she, I, who
• Conjunctions – and, but, or, if
• Auxiliary verbs – can, may, should
• Particles – up, down, on, off, in, out, at, by
• Numerals – one, two, first, fourth
TagSets
The process of classifying words into their parts of speech and labeling
them accordingly is known as part-of-speech tagging, POS-tagging, or
simply tagging. Parts of speech are also known as word classes or lexical
categories. The collection of tags used for a particular task is known as a
tagset.
Tagsets for English
• The origin of all tagsets for the English language is the "87-tag tagset" used for the Brown Corpus in 1979 – research at Brown University.
  The Brown Corpus is a million-word collection of samples from 500 written texts from different genres (newspapers, novels, academic texts, non-fiction, etc.)
Small 45-tag Penn Treebank Tagset
Exercise
A part-of-speech tagger is a piece of software that reads text in some language and assigns a part-of-speech category to each word.
A /DT
Part-Of-Speech/NNP
Tagger/NNP
is/VBZ
a/DT
piece/NN
of/IN
software/NN
that/WDT
reads/VBZ
text/NN
in/IN
some/DT
language/NN
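The simplest tagging strategy behind an output like the one above is a lookup (unigram) tagger: each word gets its single most likely tag. Real taggers use trained statistical models; the table below is hand-made for this sketch and falls back to NN for unknown words.

```python
# Hand-made word -> Penn Treebank tag table (illustrative only).
TAG_TABLE = {
    "a": "DT", "some": "DT", "is": "VBZ", "reads": "VBZ",
    "piece": "NN", "software": "NN", "text": "NN", "language": "NN",
    "of": "IN", "in": "IN", "that": "WDT",
}

def tag(words):
    """Tag each word by table lookup; unknown words default to NN."""
    return [(w, TAG_TABLE.get(w.lower(), "NN")) for w in words]

tagged = tag("a piece of software".split())
# [('a', 'DT'), ('piece', 'NN'), ('of', 'IN'), ('software', 'NN')]
```

A lookup tagger cannot resolve ambiguity (e.g. "book" as NN vs. VB); that is exactly what the statistical algorithms discussed next address.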
Algorithms for POS tagging
[Table: the amount of tag ambiguity for word types in the Brown corpus]
Named Entities
67
Named Entity Recognition (NER), also known as (named) entity identification, entity chunking, and entity extraction, is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
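A minimal NER sketch can be built from gazetteers (entity name lists). Production NER systems use statistical sequence models over context; the entity lists and labels below are invented for illustration.

```python
# Tiny hand-made gazetteers: label -> set of lowercase entity names.
GAZETTEER = {
    "PERSON": {"guna", "collins"},
    "ORG": {"ibm", "iitb"},
    "LOC": {"chennai", "mumbai", "pune"},
}

def recognize(text):
    """Return (surface form, label) pairs for words found in a gazetteer."""
    entities = []
    for word in text.split():
        surface = word.strip(".,?!")
        for label, names in GAZETTEER.items():
            if surface.lower() in names:
                entities.append((surface, label))
    return entities

ents = recognize("Guna travelled from Chennai to Mumbai.")
# [('Guna', 'PERSON'), ('Chennai', 'LOC'), ('Mumbai', 'LOC')]
```

A pure lookup approach misses unseen names and cannot use context ("Washington" the person vs. the location), which is why statistical NER models dominate in practice.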
Example of NER
Research at IITB
3. Syntax Analysis
Parsing / Syntax Analysis
• Given a string of terminals and a CFG, determine if the string can be
generated by the CFG.
– Also return a parse tree for the string
– return all possible parse trees for the string
• Must search the space of derivations for one that derives the given string.
  – Top-Down Parsing: start searching the space of derivations from the start symbol.
  – Bottom-Up Parsing: start searching the space of reverse derivations from the terminal symbols in the string.
Parsing
• Syntactic Parsing - Produce the correct syntactic parse tree for a sentence.
Example of CFG

Grammar:
  S → NP VP
  S → Aux NP VP
  S → VP
  NP → Pronoun
  NP → Proper-Noun
  NP → Det Nominal
  Nominal → Noun
  Nominal → Nominal Noun
  Nominal → Nominal PP
  VP → Verb
  VP → Verb NP
  VP → VP PP
  PP → Prep NP

Lexicon:
  Det → the | a | that | this
  Noun → book | flight | money
  Verb → book | prefer | include
  Pronoun → I | he | she | me
  Proper-Noun → Pune | Mumbai
  Aux → does
  Prep → from | to | on | near | through
• Sentences are generated by recursively rewriting the start symbol using the productions until only terminal symbols remain.
• Ex. "book the flight from Mumbai"

  VP
    Verb: book
    NP
      Det: the
      Nominal
        Nominal
          Noun: flight
        PP
          Prep: from
          NP
            Proper-Noun: Mumbai
The miniature English grammar and lexicon

Grammar:
  S → NP VP
  S → Aux NP VP
  S → VP
  NP → Pronoun
  NP → Proper-Noun
  NP → Det Nominal
  Nominal → Noun
  Nominal → Nominal Noun
  Nominal → Nominal PP
  VP → Verb
  VP → Verb NP
  VP → Verb NP PP
  VP → Verb PP
  VP → VP PP
  PP → Preposition NP

Lexicon:
  Det → the | a | that | this
  Noun → book | flight | money
  Verb → book | prefer | include
  Pronoun → I | he | she | me | we
  Proper-Noun → Pune | Mumbai | Hyderabad
  Aux → does
  Preposition → from | to | on | near | through

Exercise: parse the following sentences
  "Does American airlines serves a meal?" (top-down)
  "She plays in the garden." (bottom-up)
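The miniature grammar above can be turned into a small recognizer: a memoized top-down search over spans of the input, which decides whether a sentence is derivable from S. This is a sketch, not an efficient parser (the dynamic-programming algorithms discussed later are the practical approach); note the exercise sentences contain words outside the lexicon, so the demo uses lexicon words only.

```python
from functools import lru_cache

# The miniature grammar, encoded as non-terminal -> right-hand sides.
GRAMMAR = {
    "S": (("NP", "VP"), ("Aux", "NP", "VP"), ("VP",)),
    "NP": (("Pronoun",), ("Proper-Noun",), ("Det", "Nominal")),
    "Nominal": (("Noun",), ("Nominal", "Noun"), ("Nominal", "PP")),
    "VP": (("Verb",), ("Verb", "NP"), ("Verb", "NP", "PP"),
           ("Verb", "PP"), ("VP", "PP")),
    "PP": (("Preposition", "NP"),),
}
LEXICON = {
    "Det": {"the", "a", "that", "this"},
    "Noun": {"book", "flight", "money"},
    "Verb": {"book", "prefer", "include"},
    "Pronoun": {"i", "he", "she", "me", "we"},
    "Proper-Noun": {"pune", "mumbai", "hyderabad"},
    "Aux": {"does"},
    "Preposition": {"from", "to", "on", "near", "through"},
}

def recognizes(sentence):
    words = sentence.lower().split()
    n = len(words)

    @lru_cache(maxsize=None)
    def derives(sym, i, j):
        """Can sym derive words[i:j]?"""
        if sym in LEXICON:                    # pre-terminal covers one word
            return j - i == 1 and words[i] in LEXICON[sym]
        return any(match(rhs, i, j) for rhs in GRAMMAR.get(sym, ()))

    @lru_cache(maxsize=None)
    def match(seq, i, j):
        """Can the symbol sequence seq derive words[i:j]?"""
        if len(seq) == 1:
            return derives(seq[0], i, j)
        head, rest = seq[0], seq[1:]
        # each remaining symbol must cover at least one word
        return any(derives(head, i, k) and match(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return n > 0 and derives("S", 0, n)

print(recognizes("book the flight from Mumbai"))  # True
print(recognizes("flight the book"))              # False
```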
"Does American airlines serves a meal?" (top-down parse)
[Parse tree: S → Aux NP VP; Aux = Does, NP = Proper-Noun (American airlines), VP covering "serves a meal"]
"She plays in the garden." (bottom-up parse)
[Parse tree built bottom-up: the words are first grouped into NP, VP, Nominal, and PP constituents, which finally combine into S]
Ambiguity
One of the possible parse trees
Attachment ambiguity
"Guna ate an ice cream with fruits from Chennai"
In this sentence, we have two prepositional phrases: "with fruits" and "from Chennai".
The possible meanings are as follows:
1. Guna, who is from Chennai, ate an ice cream filled with fruits.
2. Guna ate an ice cream filled with fruits, and the ice cream was brought from Chennai.
3. Guna, who is from Chennai, ate the ice cream with the help of fruits.
4. Guna, with the help of fruits, ate the ice cream, which was brought from Chennai.
Attachment ambiguity arises from uncertainty about attaching a phrase or clause to a part of a sentence.
Coordination ambiguity
“delicious cookies and milk”
could mean
ⓐ “delicious cookies” + “milk”
Or
ⓑ “delicious cookies” + “delicious milk”
• Ex. "Show me the meal on Flight UA 386 from San Francisco to Denver." There are 14 parses in total for this statement. [Parse tree showing one reasonable parse not reproduced.]
Disambiguation
• Explore all possible parse trees in parallel: unrealistic because of memory requirements.
• Use a backtracking strategy: expand the search-space tree by exploring one state at a time and backtrack if required. Inefficient, as many subtrees have to be rebuilt.
EFFICIENT PARSING FOR CONTEXT FREE GRAMMAR
• Dynamic programming parsing methods
  In parsing with dynamic programming, the subtrees are computed once, stored (in tables), and reused.
  - CKY parsing: uses a CNF grammar
  - The Earley algorithm: any CFG
  - Chart parsing: any CFG
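A CKY recognizer can be sketched compactly. The grammar below is a toy CNF grammar invented for this illustration (every rule is A -> B C or A -> terminal); the miniature lecture grammar would first have to be converted to CNF before CKY applies.

```python
from itertools import product

# Toy CNF grammar: (B, C) -> A means rule A -> B C.
BINARY_RULES = {
    ("NP", "VP"): "S",
    ("Det", "Noun"): "NP",
    ("Verb", "NP"): "VP",
}
LEXICON = {"the": {"Det"}, "dog": {"Noun"}, "cat": {"Noun"},
           "chased": {"Verb"}}

def cky_recognize(words):
    n = len(words)
    # table[i][j] holds the set of non-terminals deriving words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):                 # length-1 spans
        table[i][i + 1] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):                  # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # every split point
                for b, c in product(table[i][k], table[k][j]):
                    if (b, c) in BINARY_RULES:
                        table[i][j].add(BINARY_RULES[(b, c)])
    return "S" in table[0][n]

print(cky_recognize("the dog chased the cat".split()))  # True
```

Each cell is filled exactly once and reused by every larger span that contains it, which is the dynamic-programming saving over naive backtracking.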
What is CKY in NLP?
• When we have loads of descriptions or modifications around a particular word or phrase of interest, we use chunking to grab the required phrase alone, ignoring the rest around it.
Statistical Parsing
• Statistical /Probabilistic parsing helps to solve the problem of disambiguation
• Compute the probability of each possible interpretation and choose the most probable
interpretation
• Most of the modern parsers in NLP are probabilistic
• Probabilistic parsers use Probabilistic / Stochastic CFG (PCFG)
- augmenting CFG with probabilities
• In a PCFG, each production rule is of the form
  A → β [p]
  where p is the probability that the non-terminal A will be expanded to β; p is a number between 0 and 1.
PCFG
Disambiguation with PCFGs
• A PCFG assigns a probability to each parse tree (i.e. each derivation) of a sentence S.
• The probability of a particular parse T is defined as the product of the probabilities of all the n rules used to expand each of the non-terminal nodes in the parse tree T, where each rule i can be expressed as LHS_i → RHS_i:

  P(T, S) = ∏_{i=1}^{n} P(RHS_i | LHS_i)
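The product formula is straightforward to compute. The rule probabilities below are hypothetical and the rule set is deliberately incomplete; in a real PCFG, the probabilities of all rules expanding the same non-terminal sum to 1.

```python
from math import prod

# Hypothetical rule probabilities: (LHS, RHS) -> p.
PCFG = {
    ("S",  ("NP", "VP")):    0.8,
    ("NP", ("Det", "Noun")): 0.3,
    ("VP", ("Verb", "NP")):  0.2,
}

def tree_probability(rules_used):
    """P(T, S): product of the probabilities of all rules in the derivation."""
    return prod(PCFG[rule] for rule in rules_used)

p = tree_probability([
    ("S",  ("NP", "VP")),
    ("NP", ("Det", "Noun")),
    ("VP", ("Verb", "NP")),
    ("NP", ("Det", "Noun")),
])
# p = 0.8 * 0.3 * 0.2 * 0.3 ≈ 0.0144
```

Disambiguation then amounts to computing this probability for every candidate parse and keeping the most probable one.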
PoS Tagging using
Hidden Markov Model (HMM)
1) An easy introduction to Hidden Markov Model (HMM) - Part 1
https://www.youtube.com/watch?v=YlL0YARYK-o
2) https://www.youtube.com/watch?v=7ak1_zDUgEg
TreeBank: A data-driven approach to syntax
What is the meaning of "treebank"?
A treebank (plural: treebanks) is a database of sentences which are annotated with syntactic information, often in the form of a tree.
In the SUSANNE Corpus (Sampson 1995), each token is represented by one line, with the part-of-speech (including morphosyntactic features) in the first column, the actual token in the second column, and the lemma in the third column.
Treebank development
Tools and Standards
• Well-known examples of syntactic parsers used in
treebank development are the deterministic Fidditch
parser (Hindle 1994), used in the development of
the Penn Treebank, and the statistical parser of
Collins et al. (1999), used for the Prague
Dependency Treebank.
References
https://www.ibm.com/in-en/topics/natural-language-processing
END OF UNIT 2