Natural Language Processing Tools and Approaches

Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand and interpret human language through various techniques, including statistical and machine learning models. NLP is utilized in applications such as translation, voice recognition, and chatbots, and plays a significant role in enhancing business operations. Key components of NLP include tokenization, morphological analysis, and part-of-speech tagging, which help structure and analyze large volumes of unstructured language data.


Unit 2

1
Natural Language Processing
Natural Language Processing (NLP) refers to the branch of
computer science—and more specifically, the branch of artificial
intelligence concerned with giving computers the ability to
understand text and spoken words in much the same way human
beings can.

NLP combines computational linguistics—rule-based modeling
of human language—with statistical, machine learning, and deep
learning models. Together, these technologies enable computers
to process human language in the form of text or voice data and
to ‘understand’ its full meaning, complete with the speaker or
writer’s intent and sentiment.

2
Natural Language Processing (contd)
• NLP drives computer programs that translate text from one
language to another, respond to spoken commands, and
summarize large volumes of text rapidly—even in real time.

• One may have interacted with NLP in the form of
• voice-operated GPS systems,
• digital assistants,
• speech-to-text dictation software,
• customer service chatbots, and
• other consumer conveniences.

• But NLP also plays a growing role in enterprise solutions that
help streamline business operations, increase employee
productivity, and simplify mission-critical business processes.
NLP tools and approaches

4
1) Python and the Natural Language Toolkit (NLTK)

• The Python programming language provides a wide range of tools
and libraries for attacking specific NLP tasks. Many of these are
found in the Natural Language Toolkit, or NLTK, an open
source collection of libraries, programs, and education resources
for building NLP programs.

• The NLTK includes libraries for many of the NLP tasks listed
above, plus libraries for subtasks, such as sentence parsing, word
segmentation, stemming and lemmatization (methods of
trimming words down to their roots), and tokenization (for
breaking phrases, sentences, paragraphs and passages into tokens
that help the computer better understand the text).

• It also includes libraries for implementing capabilities such as
semantic reasoning, the ability to reach logical conclusions based
on facts extracted from text.
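As a small taste of the NLTK subtask libraries mentioned above, the sketch below tokenizes a sentence and stems each token. This is a minimal sketch, assuming only that the `nltk` package is installed; it uses a regular-expression tokenizer and the Porter stemmer so that no corpus downloads are required.

```python
# Minimal NLTK sketch: rule-based tokenization followed by stemming.
# Assumes the `nltk` package is installed; no corpus downloads needed.
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")   # split on runs of word characters
stemmer = PorterStemmer()

text = "The runners were running through the cities"
tokens = tokenizer.tokenize(text.lower())
stems = [stemmer.stem(t) for t in tokens]
print(stems)   # e.g. 'running' -> 'run', 'cities' -> 'citi'
```

Note that a stem ('citi') need not be a dictionary word; that is the difference from lemmatization discussed later in this unit.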
2) Statistical NLP, machine learning, and deep learning
• The earliest NLP applications were hand-coded, rules-based
systems that could perform certain NLP tasks, but couldn't
easily scale to accommodate a seemingly endless stream of
exceptions or the increasing volumes of text and voice data.

• Statistical NLP combines computer algorithms with machine
learning and deep learning models to automatically extract,
classify, and label elements of text and voice data and then
assign a statistical likelihood to each possible meaning of those
elements.

• Today, deep learning models and learning techniques based on
convolutional neural networks (CNNs) and recurrent neural
networks (RNNs) enable NLP systems that 'learn' as they work
and extract ever more accurate meaning from huge volumes of
raw, unstructured, and unlabeled text and voice data sets.
Role of NLP in CS

Natural Language Processing is a form of AI that gives machines
the ability to not just read, but to understand and interpret
human language.

With NLP, machines can make sense of written or spoken text and
perform tasks including speech recognition, sentiment analysis,
and automatic text summarization.

8
Why is NLP important?
1. Large volumes of textual data

Natural language processing helps computers communicate with
humans in their own language and scales other language-related
tasks. For example, NLP makes it possible for computers to read
text, hear speech, interpret it, measure sentiment and determine
which parts are important.

Today’s machines can analyze more language-based data than
humans, without fatigue and in a consistent, unbiased way.
Considering the staggering amount of unstructured data that’s
generated every day, from medical records to social media,
automation will be critical to fully analyze text and speech data
efficiently.
9
2. Structuring a highly unstructured data source
• Human language is astoundingly complex and diverse. We
express ourselves in infinite ways, both verbally and in writing.
Not only are there hundreds of languages and dialects, but
within each language is a unique set of grammar and syntax
rules, terms and slang. When we write, we often misspell or
abbreviate words, or omit punctuation. When we speak, we
have regional accents, and we mumble, stutter and borrow
terms from other languages.
• While supervised and unsupervised learning, and specifically
deep learning, are now widely used for modeling human
language, there’s also a need for syntactic and semantic
understanding and domain expertise that are not necessarily
present in these machine learning approaches.
• NLP is important because it helps resolve ambiguity in
language and adds useful numeric structure to the data for
many downstream applications, such as speech recognition or
text analytics.
NLP Pipeline

11

12
1. TOKENIZATION

Tokenization is the first step in any NLP pipeline. It has an important effect on the rest of
your pipeline.
Tokenization is the process of tokenizing or splitting a string, text into a list of tokens.

A tokenizer breaks unstructured data and natural language text into chunks of
information that can be considered as discrete elements. The token occurrences in a
document can be used directly as a vector representing that document.
This immediately turns an unstructured string (text document) into a numerical data
structure suitable for machine learning. They can also be used directly by a computer to
trigger useful actions and responses. Or they might be used in a machine learning
pipeline as features that trigger more complex decisions or behavior.
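The idea above, that token occurrences can serve directly as a numerical vector representing the document, can be sketched in a few lines. This is a hedged illustration (a naive whitespace tokenizer and raw counts), not a production vectorizer:

```python
# Turn raw strings into token-count vectors (a minimal bag-of-words sketch).
from collections import Counter

def tokenize(text):
    # naive whitespace tokenizer, for illustration only
    return text.lower().split()

docs = ["the cat sat", "the cat sat on the mat"]
vocab = sorted({tok for d in docs for tok in tokenize(d)})
vectors = [[Counter(tokenize(d))[w] for w in vocab] for d in docs]
print(vocab)     # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors)   # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Each position in a vector is the count of one vocabulary word, which is exactly the "numerical data structure suitable for machine learning" described above.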

13
Lexical Analysis
• Lexical analysis separates tokens from the language statements
based on “Pattern Matching Principle”.
• A token (string of characters) is the smallest unit with some
logical meaning, which cannot be further decomposed.
• Regular expression is used to specify text strings in all sorts
of text processing and information extraction applications.
• “Finite State Automata” is a basic mathematical
foundation used extensively in Computational Linguistics.
• Variations / different forms of Finite state automata – state
transducers, HMM, N grams

14
Regular Expression (RE)

• RE is a Language for specifying text search strings
• RE is an algebraic notation for characterizing a set of strings / sequence of symbols
• RE can be used to define a Language in a formal way

Eg. Variable name in a programming language
L(L|D|S)*      where L -> {a..zA..Z}, D -> {0..9}, S -> {$|_}
e.g. a, abc, a10, min_count, ….

Eg. A person’s name in a Natural language
L1L*           where L1 -> {A..Z}, L -> {a..z}
e.g. M, Vrushali, John, …..
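The variable-name pattern L(L|D|S)* maps directly onto a regular expression, for example in Python's `re` module. The character classes below are a sketch following the sets L, D, S defined above:

```python
# RE for the variable-name language L(L|D|S)*:
#   L = letters, D = digits, S = {$, _}
import re

pattern = re.compile(r"^[A-Za-z][A-Za-z0-9$_]*$")

for s in ["a", "abc", "a10", "min_count", "10abc"]:
    print(s, bool(pattern.match(s)))
# '10abc' is rejected because it does not start with a letter
```

The `^` and `$` anchors force the whole string, not just a substring, to match the pattern.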

15
2. Morphological Analysis / Parsing

25
Morphological Analysis / Parsing

• What is a “morpheme”?
A minimal meaning-bearing unit in a language
eg. Dog -> 1 morpheme ‘dog’
cats -> 2 morphemes ‘cat’ + ‘s’

• The problem of recognizing that a word breaks down into component morphemes and
building a structural representation of this fact is known as “Morphological Analysis /
Parsing”
eg. doctors – doctor + s
foxes – fox + es
talking – talk + ing
(surface / input forms) (stem) (suffix)

26
• Different surfaces can have same stem
eg. sings, sang, sung - sing
surfaces stem / common lemma

• Mapping from surfaces to stem / lemma is also called “Lemmatization”

• Stemming – simple to implement, useful in applications like IR and Web search
• Lemmatization – complex to implement, useful in NLP applications which
rely on meaning

• Can the problem of morphological analysis be solved by Lookup?
- may be possible for a small subset but not for the entire language
- 1 verb – various forms
- 1 noun – various forms
- Some languages are morphologically complex – Russian, Turkish
Morphemes

Morphemes divide into Stems and Affixes; affixes fall into four classes:
• Prefix – precedes the stem
• Suffix – follows the stem
• Infix – inserted inside the stem
• Circumfix – precedes + follows the stem

More than one affix in the same word is possible
Eg. rewrites, preprocessing

28
Methods / Approaches to combine morphemes to form words

• Inflection – combination of a word stem with a grammatical
morpheme, commonly plural with ‘s’ or past tense with ‘ed’,
resulting in a word of same class
eg. Papers, walked

• Derivation – combination of a word stem with a grammatical morpheme
resulting in a word of different class
eg. fish → fishery, computerize -> computerization

• Compounding – combination of multiple word stems together
eg. battlefield, doghouse
• Cliticization – combination of a word stem with a “clitic”
eg. I’ve, I’ll, I’m
A clitic is a morpheme that acts syntactically like a word but is reduced in
form and attached to another word
Inflectional Morphology

• In the English language, only nouns, verbs and adjectives are inflected
• Nouns can be inflected in two ways –
- affix used for plural
- affix used for possessive
Regular nouns Irregular nouns
Singular cat box mouse ox
Plural cats boxes mice oxen

Possessive suffix is realized by apostrophe + s

Eg. for singular noun – child’s
for plural – students’

30
Verbs

• Verbal inflection in the English language is more complex
• Regular and irregular classes of verbs
• Majority of the verbs are in regular class

• While adding suffixes, spelling changes take place at morpheme boundaries.
• Eg beg – begged
toss – tosses

31
Derivational Morphology

• It is a combination of a word stem with a grammatical morpheme, resulting in a word of
different class.
• A common derivation in English is formation of a new noun from a verb or adjective – the
process is called “nominalization”

Suffix   Base verb / adj   Derived noun
-ee      Appoint (v)       Appointee
-er      Kill (v)          Killer
-ness    Fuzzy (a)         Fuzziness
Adjectives can be derived from verbs or nouns
Suffix   Base verb / noun   Derived Adj
-able    Suit (v)           Suitable
-less    Clue (n)           Clueless

32
Cliticization

• Clitics preceding a word are Proclitics
• Clitics following a word are Enclitics
• These are rarely found in English

33
Finite state morphological analysis / parsing
• What is expected in morphological analysis?

Input     Morphological Parse
Cats      Cat + N + Pl
Cities    City + N + Pl
Caught    Catch + V + Past, Catch + V + PastParticiple   (ambiguity)
Dog       Dog + N + SG
Playing   Play + V + PresPart

34
How to build a Morphological parser?
• Lexicon
Lexicon is a list of stems and affixes along with basic information (whether a
stem is a noun or verb)

• Morphotactics
These are guiding rules for arrangement of morphemes from
different classes eg. Morphemes of which class can follow
morphemes of other class
boy – boys
girl – girls
Here ‘es’ and ‘s’ follow the nouns; they cannot precede the noun.

• Orthographic rules
Spelling rules – they describe the changes occurring in a word stem
while combining with other morphemes eg. city – cities …… not citys
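The three knowledge sources above (lexicon, morphotactics, orthographic rules) can be combined into a toy noun parser. This is a deliberately simplified sketch with a hypothetical four-word lexicon, not a full finite-state implementation:

```python
# Toy morphological parser for noun plurals, combining:
#   lexicon        - known stems and their class
#   morphotactics  - the plural suffix follows the stem, never precedes it
#   orthographic   - 'city' + s -> 'cities' (y -> ies), 'fox' + s -> 'foxes'
LEXICON = {"cat": "N", "fox": "N", "city": "N", "dog": "N"}

def parse_noun(word):
    if word in LEXICON:                        # bare stem, singular
        return f"{word} +N +Sg"
    if word.endswith("ies") and word[:-3] + "y" in LEXICON:
        return f"{word[:-3] + 'y'} +N +Pl"     # orthographic rule: ies -> y
    if word.endswith("es") and word[:-2] in LEXICON:
        return f"{word[:-2]} +N +Pl"           # orthographic rule: es after x, s, z
    if word.endswith("s") and word[:-1] in LEXICON:
        return f"{word[:-1]} +N +Pl"           # plain plural -s
    return None                                # not covered by this toy lexicon

for w in ["cats", "foxes", "cities", "dog"]:
    print(w, "->", parse_noun(w))
```

A real system expresses the same three knowledge sources as composed finite-state transducers, as the following slides describe.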
Exercise

• Represent morphological parse of following words.

Input      Morphological Parse
patents
balloons
watched
animal
going

36
• Morphological parse

Input      Morphological Parse
patents patent + N + Pl
balloons balloon + N + Pl
watched watch + V + Past
animal animal + N + SG
going go +V + PresPart

37
Construction of a Finite-State lexicon

• Simplest possible Lexicon –
every possible word in a language + proper names
{apple, boxes, cat, ……, go, run, stop, running, gone…..Ameya,
Bharati, ……….. }
Impossible task
Hence, Lexicons are built with the list of each of the stems and
affixes of the language together with the representation of
morphotactics

38
Eg. Cat, dog, …. majority of nouns fall under the reg-noun category.
Ignore plurals with -es.

Reg-noun   Irreg-pl-noun   Irreg-sg-noun   Plural
cat        geese           goose           -s
dog        sheep           sheep
girl       mice            mouse

39
3 forms of verb stem class

English verbal Inflection –
Regular verb stem
Irregular verb stem
Irregular past verb form

Affix class –
-ed past
-ed participle
-ing participle
-s Third singulars

Reg-verb-stem   Irreg-verb-stem   Irreg-past-stem
walk            cut               caught
fly             speak             ate
talk            sing              sang

Affix class: Past -ed | PastPart -ed | PresPart -ing | Third-singular -s
40
Derivational Morphology

Simple ex. for morphotactics of English adjectives:
Big – bigger, biggest
Happy – happier, happiest, happily, unhappy, unhappily
Cool – cooler, coolest, coolly

41
Exercise

42
Finite-State Morphological Parsing

43
44
Morphological recognition – plug sub-lexicons into the FSA,
i.e. expand each arc like reg-noun, irreg-pl-noun etc.

45
Finite-State Transducer (FST)

• A finite-state transducer (FST) is a finite-state
machine with two memory tapes, following the
terminology for Turing machines: an input tape
and an output tape.
• This contrasts with an ordinary finite-state
automaton, which has a single tape.
• An FST is a type of finite-state automaton
(FSA) that maps between two sets of symbols.
• An FST is more general than an FSA.
• An FSA defines a formal language by defining a
set of accepted strings, while an FST defines
relations between sets of strings.
Finite State Transducer (FST)
• FST is a type of finite automata which maps between two sets of symbols
• In FST, each arc is labeled by an input and output string separated by colon

(transition diagram: two states q0, q1 with arcs labeled aa:b, b:ε, b:a, b:b, b:ba)

Four views of an FST:
1. FST as recognizer
2. FST as generator
3. FST as translator
4. FST as set relator

FST as a morphological parser will take as input a string of letters and
generate as output a string of morphemes
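How an FST translates an input string via input:output arc labels can be sketched with a small transition table. The machine below is hypothetical, chosen for illustration rather than taken from the diagram above:

```python
# A tiny FST as a transition table: (state, input symbol) -> (next state, output).
# Hypothetical machine: copies 'a', rewrites 'b' as 'x'; both states accept.
DELTA = {
    ("q0", "a"): ("q0", "a"),   # arc a:a
    ("q0", "b"): ("q1", "x"),   # arc b:x
    ("q1", "a"): ("q0", "a"),   # arc a:a
    ("q1", "b"): ("q1", "x"),   # arc b:x
}
ACCEPT = {"q0", "q1"}

def transduce(s):
    state, out = "q0", []
    for ch in s:
        if (state, ch) not in DELTA:
            return None             # input rejected: no arc for this symbol
        state, o = DELTA[(state, ch)]
        out.append(o)
    return "".join(out) if state in ACCEPT else None

print(transduce("abba"))   # -> 'axxa'
```

Used one way the same table recognizes or translates input; read in reverse it generates, which is why one FST can serve as recognizer, generator, translator, and set relator.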

47
FST representation

48
Morphological Parsing
It is implemented by building mapping rules that
map letter sequences like ‘cats’ on the surface level
into morpheme-and-feature sequences like ‘cat +N +PL’
on the lexical level.
49
FST - for morphological parsing
• For given input cats – cat + N +Pl
• Here a word is represented as a correspondence between two levels – lexical level and
surface level

Lexical c a t +N +Pl Σ

Surface c a t s Δ

New alphabet Σ’

50
Morphological noun parser
^ - indicates a morpheme boundary
# - indicates a word boundary

51
52
Transducers and Orthographic rules

53
Part of Speech Tagging

54
Part of Speech Tagging

• Part of Speech tagging is a process of assigning a PoS or other syntactic class marker to each word
in a corpus.
(A corpus is a computer readable collection of text or speech)

• Eg. Book that flight.
Book/VB that/DT flight/NN.
• Here (Penn Treebank PoS tags),
VB – verb, base form
DT – determiner
NN – noun, singular

Popular corpora (English): Brown Corpus (1M words),
British National Corpus – BNC (100M words),
Wall Street Journal Corpus (30M words)

• POS is useful for subsequent syntactic parsing and word sense disambiguation

55
• Basic 8 parts-of-speech
noun, verb, pronoun, preposition, adverb, conjunction, participle, article

• Penn Treebank (1993) – 45 PoS tags
• Brown Corpus (1979) – 87 PoS tags

• Parts of speech are also known as word classes, morphological classes,
lexical tags
• PoS tags are very useful in syntactic parsing, WSD, IR ,…

56
• PoS tags - closed class and open class

PoS tags

Closed class – relatively fixed membership
Eg. prepositions
Open class – new members can be added
Eg. nouns, verbs (new members: fax, mute)

57
Part of speech is the grammatical category of a word.
• Verb asserts something about the subject of the sentence and expresses actions, events, or
states of being. Ex. walk, write, stand (main form, 3rd person SG, past participle, ….)

• Noun is used to name a person, an animal, a place, a thing, or an abstract idea.
Ex. Rama, book, John (common noun, proper noun)

• Adjective describes nouns and pronouns by some property or quality.
Ex. beautiful, noisy

• Adverb describes a verb, a phrase, or a clause. An adverb indicates manner, time, place,
cause, ... Ex. Slowly, gently (directional, locative, degree, manner, temporal)
• Pronoun replaces nouns or other pronouns and is essentially used to make sentences
less cumbersome and less repetitive. Ex. he, she, it
• Prepositions, Conjunctions, Determiners ...

58
Closed classes in English
• Prepositions – on, under, over, from, to
• Determiners – a, an, the
• Pronouns – she, I, who
• Conjunctions – and, but, or, if
• Auxiliary verbs – can, may, should
• Particles – up, down, on, off, in, out, at, by
• Numerals – one, two, first, fourth

59
TagSets
The process of classifying words into their parts of speech and labeling
them accordingly is known as part-of-speech tagging, POS-tagging, or
simply tagging. Parts of speech are also known as word classes or lexical
categories. The collection of tags used for a particular task is known as a
tagset.

60
Tagsets for English
• Origin of all Tagsets for English language is “87-tag tagset” used for Brown
Corpus in 1979 – research at Brown university
It is a million-word collection of samples from 500 written texts from different genres
(newspapers, novels, academic writing, non-fiction, etc)

61
Small 45-tag Penn Treebank Tagset

62
Exercise
A part-of-speech tagger is a piece of software that reads text in some
language and assigns a part-of-speech category to each word.

63
A /DT
Part-Of-Speech/NNP
Tagger/NNP
is/VBZ
a/DT
piece/NN
of/IN
software/NN
that/WDT
reads/VBZ
text/NN
in/IN
some/DT
language/NN

64
Algorithms for POS tagging

• Rule based algorithms – work with a database of hand-written disambiguation rules
-Stage 1 – assigns a potential pos tag to each word using a dictionary
uses 2-level transducer to return all possible pos of a word
-Stage 2 – uses a list of hand-written disambiguation rules to narrow down this list to a single pos to
each word

• Probabilistic / Stochastic taggers – use a training corpus to compute the probability of
a given word having a given tag in a given context
• [The simplest stochastic taggers disambiguate words based solely on the probability that a
word occurs with a particular tag. ]
eg. HMM tagger
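The "simplest stochastic tagger" described above, which picks for each word the tag it occurs with most often in the training corpus, can be sketched directly. The toy hand-tagged corpus below is invented for illustration:

```python
# Simplest stochastic tagger: choose argmax P(tag | word), estimated
# from tag counts in a tiny, hand-made training corpus.
from collections import Counter, defaultdict

training = [("book", "VB"), ("that", "DT"), ("flight", "NN"),
            ("the", "DT"), ("book", "NN"), ("book", "VB")]

counts = defaultdict(Counter)
for word, tag_ in training:
    counts[word][tag_] += 1

def tag(word):
    if word not in counts:
        return "NN"                          # naive default for unknown words
    return counts[word].most_common(1)[0][0] # most frequent tag for this word

print(tag("book"))    # 'VB' (seen twice as VB, once as NN)
print(tag("flight"))  # 'NN'
```

An HMM tagger improves on this by also scoring tag-to-tag transitions, so the context of a word influences its tag.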

65
The table below shows the amount of tag ambiguity for word types in the Brown corpus

66
Named Entities

67
Named Entity Recognition (NER) (also known
as (named) entity identification, entity chunking,
and entity extraction) is a subtask of information
extraction that seeks to locate and classify named
entities mentioned in unstructured text into pre-defined
categories such as
person names,
organizations,
locations,
medical codes,
time expressions,
quantities,
monetary values,
percentages, etc.
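For the pattern-like categories in this list (time expressions, monetary values, percentages), a first taste of entity extraction can be given with regular expressions. This is a crude sketch; real NER for person names and organizations needs statistical models:

```python
# Pattern-based extraction for three NER categories with regular surface
# forms. A crude sketch, not a statistical NER system.
import re

PATTERNS = {
    "MONEY":   r"\$\d+(?:\.\d+)?(?:\s?(?:million|billion))?",
    "PERCENT": r"\d+(?:\.\d+)?%",
    "TIME":    r"\b\d{1,2}:\d{2}\b",
}

def extract(text):
    found = []
    for label, pat in PATTERNS.items():
        for m in re.finditer(pat, text):
            found.append((m.group(), label))
    return found

text = "Shares rose 4.5% to $3.2 billion by 16:30."
print(extract(text))
```

Each match is returned as a (span, category) pair, mirroring how NER output is usually represented.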
NER

69
Example of NER

70
Research at IITB

71
72
3. Syntax Analysis

73
Parsing / Syntax Analysis
• Given a string of terminals and a CFG, determine if the string can be
generated by the CFG.
– Also return a parse tree for the string
– return all possible parse trees for the string

• Must search space of derivations for one that derives the given string.
– Top-Down Parsing: Start searching space of derivations for the start symbol.
– Bottom-up Parsing: Start searching the space of reverse derivations from the terminal
symbols in the string.

74
Parsing
• Syntactic Parsing - Produce the correct syntactic parse tree for a sentence.

Context Free Grammars (CFG) = {N, Σ, R, S}

• N a set of non-terminal symbols (or variables)
• Σ a set of terminal symbols (disjoint from N)
• R a set of productions or rules of the form A→β,
where A is a non-terminal and β is a string of symbols from (Σ ∪ N)*
• S, a designated non-terminal called the start symbol

75
Example of CFG
Grammar
S → NP VP
S → Aux NP VP
S → VP
NP → Pronoun
NP → Proper-Noun
NP → Det Nominal
Nominal → Noun
Nominal → Nominal Noun
Nominal → Nominal PP
VP → Verb
VP → Verb NP
VP → VP PP
PP → Prep NP

Lexicon
Det -> the | a | that | this
Noun -> book | flight | money
Verb -> book | prefer | include
Pronoun -> I | he | she | me
Proper-Noun -> Pune | Mumbai
Aux -> does
Prep -> from | to | on | near | through

76
• Sentences are generated by recursively rewriting the start symbol using the
productions until only terminals symbols remain.
• Ex. - book the flight from Mumbai

(VP (Verb book)
    (NP (Det the)
        (Nominal (Nominal (Noun flight))
                 (PP (Prep from)
                     (NP (Proper-Noun Mumbai))))))
77
The miniature English grammar and lexicon
Grammar
S → NP VP
S → Aux NP VP
S → VP
NP → Pronoun
NP → Proper-Noun
NP → Det Nominal
Nominal → Noun
Nominal → Nominal Noun
Nominal → Nominal PP
VP → Verb
VP → Verb NP
VP → Verb NP PP
VP → Verb PP
VP → VP PP
PP → Preposition NP

Lexicon
Det -> the | a | that | this
Noun -> book | flight | money
Verb -> book | prefer | include
Pronoun -> I | he | she | me | we
Proper-Noun -> Pune | Mumbai | Hyderabad
Aux -> does
Preposition -> from | to | on | near | through

Exercise – parse the following sentences:
Does American airlines serves a meal? (top-down)
She plays in the garden. (bottom-up)

78
Does American airlines serves a meal? (top-down parse)

(S (Aux Does)
   (NP (Proper-Noun American airlines))
   (VP (Verb serves)
       (NP (Det a) (Nominal (Noun meal)))))

79
She plays in the garden. (bottom-up parse)

Step 1 – assign lexical categories to the words:
She/Pronoun plays/Verb in/Preposition the/Det garden/Noun

Step 2 – build constituents bottom-up:
Pronoun → NP, Noun → Nominal, Det Nominal → NP, Preposition NP → PP

Step 3 – complete the tree:
(S (NP (Pronoun She))
   (VP (VP (Verb plays))
       (PP (Preposition in)
           (NP (Det the) (Nominal (Noun garden))))))
82
Ambiguity

• Present at various phases
• Lexical level – pos tagging
ex. Book that flight. Book as Noun or Verb – Rule based algo, WSD
• Syntactic level –
ambiguity arises in syntactic structures – structural ambiguity:
multiple parse trees are possible for the same sentence
• Syntactic ambiguity
- attachment ambiguity
- coordination ambiguity
Attachment ambiguity
• This ambiguity exists if a particular constituent can be
attached to the parse tree at more than one places.

Ex. I saw the man with a telescope.

84
One of the possible parse trees

85
86
“Guna ate an ice cream with fruits from Chennai”
In this sentence, we have two prepositional phrases “with fruits”
and “from Chennai”.
Here the possible meanings are as follows;
1. Guna who is from Chennai ate an ice cream filled with fruits.
2. Guna ate an ice cream filled with fruits and the ice cream is
brought from Chennai.
3. Guna who is from Chennai ate the ice cream with the help of
fruits.
4. Guna with the help of fruits ate the ice cream which is brought
from Chennai.

Here we got four possibilities due to two prepositional phrases.
Each one arises from how we attach the prepositional phrases
“with fruits” and “from Chennai” to either “Guna” or the “ice
cream”.
87
I saw the man on the hill with a telescope.

88
Attachment ambiguity arises from uncertainty
of attaching a phrase or clause to a part of a
sentence

89
Coordination ambiguity

• This ambiguity exists when different sets of phrases are
combined by a conjunction like “and”
Ex. old men and women, young boys and girls
(old men) and (women)
old (men and women)

90
“delicious cookies and milk”

could mean
ⓐ “delicious cookies” + “milk”
Or
ⓑ “delicious cookies” + “delicious milk”

91
• Ex. Show me the meal on Flight UA 386 from San Francisco to
Denver. Total 14 parses for this statement – the tree below shows a
reasonable parse

92
Disambiguation -
• Explore all possible parse trees in parallel …. Unrealistic based on memory requirement
• Use backtracking strategy – expand the search-space tree by exploring one state at a time and backtrack if
required. Inefficient as many subtrees have to be rebuilt.

93
EFFICIENT PARSING FOR CONTEXT FREE GRAMMAR

94
• Dynamic programming parsing methods

Dynamic programming is based on the Principle of Optimality. It systematically solves
sub-problems and stores intermediate results – no work duplication

In parsing with Dynamic programming, the subtrees are computed once, stored
(in tables) and reused.
- CKY parsing – uses CNF grammar
- The Earley algorithm – CFG
- Chart parsing – CFG

95
What is CKY in NLP?

• CKY means Cocke-Kasami-Younger.

• It is one of the earliest recognition and parsing algorithms.


• The standard version of CKY can only recognize languages
defined by context-free grammars in Chomsky Normal Form
(CNF).

• In its simplest form, the CKY (also written CYK) algorithm solves the
recognition problem; it determines whether a string can be
derived from a grammar.

• In other words, the algorithm takes a sentence and a context-
free grammar and returns TRUE if there is a valid parse tree or
FALSE otherwise
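The recognition problem described above is solved by filling a dynamic-programming table: cell (i, j) holds every non-terminal that derives words i..j. A minimal sketch, with a tiny hand-made CNF grammar invented for illustration:

```python
# CKY recognition: table[i][j] holds the non-terminals deriving words i..j.
# Grammar is in Chomsky Normal Form: A -> B C or A -> 'terminal'.
from itertools import product

BINARY = {("Det", "Noun"): {"NP"}, ("Verb", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
LEXICAL = {"the": {"Det"}, "dog": {"Noun", "NP"}, "cat": {"Noun"}, "saw": {"Verb"}}

def cky_recognize(words, start="S"):
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(LEXICAL.get(w, ()))      # width-1 spans
    for span in range(2, n + 1):                        # widths 2..n
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                   # every split point
                for B, C in product(table[i][k], table[k][j]):
                    table[i][j] |= BINARY.get((B, C), set())
    return start in table[0][n]                         # TRUE / FALSE

print(cky_recognize("the dog saw the cat".split()))   # True
```

Storing back-pointers alongside each entry turns this recognizer into a full parser, which is the reuse-of-subtrees idea from the previous slide.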
Partial Parsing

• Many language processing tasks do not require
complete parse trees to be generated
….. generating a complete parse tree is a complex process
• Partial parse or shallow parse is generated
• Ex. Information extraction, Information retrieval
• Chunking – one of the ways of performing partial
parsing
-Finite state Rule based chunking
-Machine Learning-based chunking

97
• When we have loads of descriptions or modifications
around a particular word or the phrase of our interest, we
use chunking to grab the required phrase alone, ignoring the
rest around it.

• Hence, chunking paves a way to group the required phrases
and exclude all the modifiers around them which are not
necessary for our analysis. Summing up, chunking helps us
extract the important words alone from lengthy
descriptions. Thus, chunking is a step in information
extraction.
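Finite-state, rule-based chunking can be demonstrated with NLTK's `RegexpParser`: a single tag pattern grabs noun phrases and ignores everything around them. The sketch below assumes nltk is installed; the POS-tagged input is supplied by hand so no tagger model is needed:

```python
# Finite-state (rule-based) NP chunking with one tag pattern.
# Assumes nltk is installed; POS tags are supplied by hand.
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"     # optional determiner, adjectives, noun
chunker = nltk.RegexpParser(grammar)

tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
          ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
          ("the", "DT"), ("cat", "NN")]

tree = chunker.parse(tagged)
print(tree)   # NP chunks: (the little yellow dog), (the cat)
```

The verb and preposition stay outside any chunk, which is exactly the "grab the required phrase, ignore the rest" behavior described above.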

98
99
Statistical Parsing
• Statistical /Probabilistic parsing helps to solve the problem of disambiguation
• Compute the probability of each possible interpretation and choose the most probable
interpretation
• Most of the modern parsers in NLP are probabilistic
• Probabilistic parsers use Probabilistic / Stochastic CFG (PCFG)
- augmenting CFG with probabilities
• In PCFG, each production rule will be of the form
A -> β [p] ……………… p is prob that NT A will be expanded to β
p is a number between 0 and 1

100
PCFG

101
Disambiguation with PCFGs

• PCFG assigns probability to each parse tree (ie. Each derivation) of a sentence S
• The probability of a particular parse T is defined as a product of the probabilities of all
the n rules used to expand each of the non-terminal nodes in the parse tree T,
where each rule i can be expressed as LHSi → RHSi

P(T, S) = ∏_{i=1..n} P(RHSi | LHSi)

Refer figure for ex statement “Book the dinner flight”

102
103
PoS Tagging using
Hidden Markov Model (HMM)

104
1) An easy introduction to Hidden Markov Model (HMM) - Part 1

https://www.youtube.com/watch?v=YlL0YARYK-o

2) https://www.youtube.com/watch?v=7ak1_zDUgEg
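The core of HMM tagging, which the videos above introduce, is the Viterbi algorithm: choose the tag sequence that maximizes the product of transition probabilities P(tag_i | tag_{i-1}) and emission probabilities P(word_i | tag_i). A compact sketch with invented, hand-set probabilities for a two-tag model:

```python
# Viterbi decoding for a two-tag HMM (toy, hand-set probabilities).
TAGS = ["N", "V"]
START = {"N": 0.7, "V": 0.3}                       # P(tag at position 0)
TRANS = {("N", "N"): 0.3, ("N", "V"): 0.7,         # P(next tag | prev tag)
         ("V", "N"): 0.6, ("V", "V"): 0.4}
EMIT = {("N", "fish"): 0.5, ("V", "fish"): 0.4,    # P(word | tag)
        ("N", "sleep"): 0.2, ("V", "sleep"): 0.6}

def viterbi(words):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (START[t] * EMIT.get((t, words[0]), 0.0), [t]) for t in TAGS}
    for w in words[1:]:
        best = {
            t: max(
                ((p * TRANS[(prev, t)] * EMIT.get((t, w), 0.0), path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in TAGS
        }
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["fish", "sleep"]))   # -> ['N', 'V']
```

"fish" alone is most likely a noun here, but the high N-to-V transition probability makes ['N', 'V'] the best path for "fish sleep", showing how context changes the chosen tags.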

105
TreeBank: A data driven approach to
Syntax
What is the meaning of treebank?
treebank (plural treebanks) A database of
sentences which are annotated with syntactic
information, often in the form of a tree.

A treebank can be defined as a linguistically
annotated corpus that includes some
grammatical analysis beyond the part-of-
speech level
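Treebank annotations are conventionally written as bracketed strings, which nltk can read into tree objects. A sketch assuming nltk is installed; the sentence and labels are hand-written in Penn Treebank style for illustration:

```python
# Read a Penn-Treebank-style bracketed annotation into a tree object.
# Assumes nltk is installed; the example annotation is hand-written.
from nltk import Tree

annotation = "(S (NP (PRP She)) (VP (VBZ plays) (PP (IN in) (NP (DT the) (NN garden)))))"
tree = Tree.fromstring(annotation)

print(tree.label())      # 'S'
print(tree.leaves())     # ['She', 'plays', 'in', 'the', 'garden']
tree.pretty_print()      # draws the tree in ASCII
```

Each node label is a syntactic category and each leaf a token, so a treebank is exactly a corpus of such trees.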
106
What is treebank dataset?

English Web Treebank is a dataset containing 254,830
word-level tokens and 16,624 sentence-level tokens of
webtext in 1174 files annotated for sentence- and word-
level tokenization, part-of-speech, and syntactic structure

107
SUSANNE Corpus (Sampson 1995), where each
token is represented by one line, with part-of-speech
(including morphosyntactic features) in the first
column, the actual token in the second column, and
the lemma in the third column

108
109
Treebank development

The methods and tools for treebank development
have evolved considerably from the very first
treebank projects, where all annotation was done
manually, to the present-day situation, which is
characterized by a more or less elaborate
combination of manual work and automatic
processing, supported by emerging standards and
customized software tools

110
Tools and Standards

Many of the software tools that are used in treebank
development are tools that are needed in the development
of any annotated corpus, such as tokenizers and
part-of-speech taggers.

Tools that are specific to treebank development are
primarily tools for syntactic preprocessing and
specialized annotation tools.

111
• Well-known examples of syntactic parsers used in
treebank development are the deterministic Fidditch
parser (Hindle 1994), used in the development of
the Penn Treebank, and the statistical parser of
Collins et al. (1999), used for the Prague
Dependency Treebank.

• It is also common to use partial parsers (or
chunkers) for syntactic preprocessing, since partial
parsing can be performed with higher accuracy than
full parsing
112
4. SEMANTIC ANALYSIS
• It refers to understanding what
text means.

• Semantic analysis analyzes the
grammatical format of sentences,
including the arrangement of
words, phrases, and clauses, to
determine relationships between
independent terms in a specific
context.

• It is the driving force behind machine
learning tools like chatbots, search
engines, and text analysis.
5. DISCOURSE ANALYSIS

114
References

https://www.ibm.com/in-en/topics/natural-language-processing

115
END OF UNIT 2

116
