Natural Language Processing Tools and Approaches
Natural Language Processing
Natural Language Processing (NLP) refers to the branch of computer science (and more specifically, the branch of artificial intelligence) concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.
Natural Language Processing (contd)
• NLP drives computer programs that translate text from one
language to another, respond to spoken commands, and
summarize large volumes of text rapidly—even in real time.
1) Python and the Natural Language Toolkit (NLTK)
• The NLTK includes libraries for many NLP tasks, plus libraries for subtasks such as sentence parsing, word segmentation, stemming and lemmatization (methods of trimming words down to their roots), and tokenization (breaking phrases, sentences, paragraphs, and passages into tokens that help the computer better understand the text).
With NLP, machines can make sense of written or spoken text and perform tasks including speech recognition, sentiment analysis, and automatic text summarization.
Why is NLP important?
1. Large volumes of textual data
1. TOKENIZATION
Tokenization is the first step in any NLP pipeline, and it has an important effect on the rest of the pipeline.
Tokenization is the process of splitting a string of text into a list of tokens. A tokenizer breaks unstructured natural language text into chunks of information that can be treated as discrete elements. The token occurrences in a document can be used directly as a vector representing that document.
This immediately turns an unstructured string (text document) into a numerical data structure suitable for machine learning. Tokens can also be used directly by a computer to trigger useful actions and responses, or serve in a machine learning pipeline as features that trigger more complex decisions or behavior.
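The idea above can be sketched in a few lines of Python: a regex-based tokenizer and a bag-of-words count that turns a document into a numerical representation. This is a minimal illustration using only the standard library, not the tokenizer of any particular toolkit.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens using a simple regex."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def bag_of_words(text):
    """Token counts: a minimal numerical representation of a document."""
    return Counter(tokenize(text))

tokens = tokenize("The cat sat on the mat.")
# tokens == ['the', 'cat', 'sat', 'on', 'the', 'mat']
vector = bag_of_words("The cat sat on the mat.")
# vector['the'] == 2
```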
Lexical Analysis
• Lexical analysis separates tokens from language statements based on the "pattern matching" principle.
• A token (a string of characters) is the smallest unit with some logical meaning, which cannot be further decomposed.
• Regular expressions are used to specify text strings in all sorts of text-processing and information-extraction applications.
• "Finite State Automata" are a basic mathematical foundation used extensively in computational linguistics.
• Variations / different forms of finite state automata: finite-state transducers, HMMs, N-grams.
Regular Expression (RE)
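The slide's regular-expression examples are not reproduced here; as a minimal illustration of REs for text extraction, the sketch below uses Python's `re` module (the sample text and both patterns are invented for this illustration):

```python
import re

text = "Flight AI-202 departs at 09:45 and flight BA7 departs at 18:30."

# Match flight codes: two capital letters, an optional hyphen, then digits.
flights = re.findall(r"\b[A-Z]{2}-?\d+\b", text)
# flights == ['AI-202', 'BA7']

# Match 24-hour clock times (non-capturing group so findall returns the
# whole match).
times = re.findall(r"\b(?:[01]\d|2[0-3]):[0-5]\d\b", text)
# times == ['09:45', '18:30']
```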
2. Morphological Analysis / Parsing
Morphological Analysis / Parsing
• What is a "morpheme"?
  A minimal meaning-bearing unit in a language.
  e.g. dog  -> 1 morpheme: dog
       cats -> 2 morphemes: cat + s
• The problem of recognizing that a word breaks down into component morphemes and building a structural representation of this fact is known as "Morphological Analysis / Parsing".
  e.g. doctors -> doctor + s
       foxes   -> fox + es
       talking -> talk + ing
  (surface / input form) -> (stem) + (suffix)
• Different surface forms can have the same stem
  e.g. sings, sang, sung -> sing
  (surface forms) -> (stem / common lemma)

Morphemes fall into two classes: stems and affixes. Affixes are further classified by position:
  - Prefix: precedes the stem
  - Suffix: follows the stem
  - Infix: inserted inside the stem
  - Circumfix: precedes and follows the stem
Methods / Approaches to combine morphemes to form words
Verbs
Derivational Morphology
Cliticization
Finite state morphological analysis / parsing
• What is expected in morphological analysis?
  cities  -> city + N + Pl
  caught  -> catch + V + Past  or  catch + V + PastParticiple   (ambiguity)
  dog     -> dog + N + SG
  playing -> play + V + PresPart
How to build a Morphological parser?
• Lexicon
  A lexicon is a list of stems and affixes along with basic information about them (e.g. whether a stem is a noun or a verb).
• Morphotactics
  These are the guiding rules for the arrangement of morphemes from different classes, i.e. morphemes of which class can follow morphemes of another class.
  e.g. boy -> boys, girl -> girls: here the suffixes 'es' and 's' follow the noun; they cannot precede it.
• Orthographic rules
  Spelling rules: they describe the changes occurring in a word stem while combining with other morphemes, e.g. city -> cities, not citys.
Exercise
Give the morphological parse of each input:
  patents
  balloons
  watched
  animal
  going
• Morphological parse
  patents  -> patent + N + Pl
  balloons -> balloon + N + Pl
  watched  -> watch + V + Past
  animal   -> animal + N + SG
  going    -> go + V + PresPart
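The exercise answers above can be reproduced by a toy suffix-stripping parser. Real systems use finite-state transducers; this sketch hard-codes a few suffix rules and a small lexicon of stems, all invented for this example.

```python
# Toy lexicon: stem -> word class.
LEXICON = {"patent": "N", "balloon": "N", "watch": "V",
           "animal": "N", "go": "V"}

# Suffix rules: (suffix, tag features), tried longest-first.
SUFFIX_RULES = [
    ("ing", "+V +PresPart"),
    ("ed",  "+V +Past"),
    ("s",   "+N +Pl"),
]

def parse(word):
    """Return a morphological parse like 'patent +N +Pl', or None."""
    for suffix, features in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in LEXICON:
                return f"{stem} {features}"
    if word in LEXICON:                     # bare stem, no affix
        tag = LEXICON[word]
        return f"{word} +{tag} +SG" if tag == "N" else f"{word} +{tag}"
    return None

print(parse("going"))    # go +V +PresPart
print(parse("patents"))  # patent +N +Pl
```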
Construction of a Finite-State lexicon
E.g. cat, dog, ...: the majority of nouns fall under the reg-noun category. (Plurals with -es are ignored here.)
English verbal inflection uses three verb stem classes:
  - regular verb stem
  - irregular verb stem
  - irregular past verb form
and an affix class:
  - -ed (past)
  - -ed (participle)
  - -ing (participle)
  - -s (third person singular)
Exercise
Finite-State Morphological Parsing
Morphological recognition: plug the sub-lexicon into the FSA, i.e. expand each arc (such as reg-noun, irreg-pl-noun, etc.) into the words it covers.
Finite-State Transducer (FST)
An FST has transitions labelled with input:output pairs (such as a:b, b:a, b:ε, b:b, b:ba) between states such as q0 and q1. An FST can be viewed in four ways:
  1. FST as recognizer
  2. FST as generator
  3. FST as translator
  4. FST as set relator
FST representation
Morphological Parsing
It is implemented by building mapping rules that map letter sequences from the lexical level to the surface level, e.g.
  Lexical: c a t +N +Pl   (alphabet Σ)
  Surface: c a t s        (alphabet Δ)
Together the paired symbols form a new alphabet Σ'.
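A tiny transducer in this spirit can be sketched as a Python dictionary. The representation (state -> {input char: (output string, next state)}) and the end-of-input ε-transition are modelling choices for this sketch, not a standard FST library API; it maps the surface forms "cats" and "cat" to lexical forms.

```python
# Toy FST: state -> {input symbol: (output string, next state)}.
# "" marks an end-of-input (epsilon) transition.
FST = {
    "q0": {"c": ("c", "q1")},
    "q1": {"a": ("a", "q2")},
    "q2": {"t": ("t", "q3")},
    "q3": {"s": (" +N +Pl", "qf"),   # surface 's' -> plural features
           "":  (" +N +SG", "qf")},  # no suffix -> singular features
}

def transduce(word):
    """Run the FST over word; return the output string or None if rejected."""
    state, output = "q0", []
    for ch in word:
        if ch not in FST.get(state, {}):
            return None                      # no transition: reject
        out, state = FST[state][ch]
        output.append(out)
    if state in FST and "" in FST[state]:    # take the end-of-input arc
        out, state = FST[state][""]
        output.append(out)
    return "".join(output) if state == "qf" else None

print(transduce("cats"))  # cat +N +Pl
print(transduce("cat"))   # cat +N +SG
```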
Morphological noun parser
^ - indicates a morpheme boundary
# - indicates a word boundary
Transducers and Orthographic rules
Part of Speech Tagging
• Part-of-speech tagging is the process of assigning a PoS or other syntactic class marker to each word in a corpus.
  (A corpus is a computer-readable collection of text or speech.)
• PoS tags are useful for subsequent syntactic parsing and word sense disambiguation.
• The basic 8 parts of speech:
  noun, verb, pronoun, preposition, adverb, conjunction, participle, article
• PoS tags fall into two groups: closed class and open class.
Part of speech is the grammatical category of a word.
• A verb asserts something about the subject of the sentence and expresses actions, events, or states of being. Ex. walk, write, stand (main form, 3rd person SG, past participle, ...)
• An adverb describes a verb, a phrase, or a clause. An adverb indicates manner, time, place, cause, ... Ex. slowly, gently (directional, locative, degree, manner, temporal)
• A pronoun replaces a noun or another pronoun and is essentially used to make sentences less cumbersome and less repetitive. Ex. he, she, it
• Prepositions, conjunctions, determiners, ...
Closed classes in English
• Prepositions – on, under, over, from, to
• Determiners – a, an, the
• Pronouns – she, I, who
• Conjunctions – and, but, or, if
• Auxiliary verbs – can, may, should
• Particles – up, down, on, off, in, out, at, by
• Numerals – one, two, first, fourth
TagSets
The process of classifying words into their parts of speech and labeling
them accordingly is known as part-of-speech tagging, POS-tagging, or
simply tagging. Parts of speech are also known as word classes or lexical
categories. The collection of tags used for a particular task is known as a
tagset.
Tagsets for English
• The origin of all tagsets for the English language is the "87-tag tagset" used for the Brown Corpus in 1979 – research at Brown University.
  The Brown Corpus is a million-word collection of samples from 500 written texts from different genres (newspapers, novels, academic texts, non-fiction, etc.)
Small 45-tag Penn Treebank Tagset
Exercise
A part-of-speech tagger is a piece of software that reads text in some language and assigns a part-of-speech category to each word.
A /DT
Part-Of-Speech/NNP
Tagger/NNP
is/VBZ
a/DT
piece/NN
of/IN
software/NN
that/WDT
reads/VBZ
text/NN
in/IN
some/DT
language/NN
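The simplest tagging strategy behind an output like the one above is a lookup (unigram) tagger: each word gets its single most likely tag. Real taggers use trained statistical models; the table below is hand-made for this sketch and falls back to NN for unknown words.

```python
# Hand-made word -> Penn Treebank tag table (illustrative only).
TAG_TABLE = {
    "a": "DT", "some": "DT", "is": "VBZ", "reads": "VBZ",
    "piece": "NN", "software": "NN", "text": "NN", "language": "NN",
    "of": "IN", "in": "IN", "that": "WDT",
}

def tag(words):
    """Tag each word by table lookup; unknown words default to NN."""
    return [(w, TAG_TABLE.get(w.lower(), "NN")) for w in words]

tagged = tag("a piece of software".split())
# [('a', 'DT'), ('piece', 'NN'), ('of', 'IN'), ('software', 'NN')]
```

A lookup tagger cannot resolve ambiguity (e.g. "book" as NN vs. VB); that is exactly what the statistical algorithms discussed next address.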
Algorithms for POS tagging
[Table: the amount of tag ambiguity for word types in the Brown corpus]
Named Entities
67
Named Entity Recognition (NER), also known as (named) entity identification, entity chunking, and entity extraction, is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
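A minimal NER sketch can be built from gazetteers (entity name lists). Production NER systems use statistical sequence models over context; the entity lists and labels below are invented for illustration.

```python
# Tiny hand-made gazetteers: label -> set of lowercase entity names.
GAZETTEER = {
    "PERSON": {"guna", "collins"},
    "ORG": {"ibm", "iitb"},
    "LOC": {"chennai", "mumbai", "pune"},
}

def recognize(text):
    """Return (surface form, label) pairs for words found in a gazetteer."""
    entities = []
    for word in text.split():
        surface = word.strip(".,?!")
        for label, names in GAZETTEER.items():
            if surface.lower() in names:
                entities.append((surface, label))
    return entities

ents = recognize("Guna travelled from Chennai to Mumbai.")
# [('Guna', 'PERSON'), ('Chennai', 'LOC'), ('Mumbai', 'LOC')]
```

A pure lookup approach misses unseen names and cannot use context ("Washington" the person vs. the location), which is why statistical NER models dominate in practice.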
Example of NER
Research at IITB
3. Syntax Analysis
Parsing / Syntax Analysis
• Given a string of terminals and a CFG, determine if the string can be
generated by the CFG.
– Also return a parse tree for the string
– return all possible parse trees for the string
• Must search the space of derivations for one that derives the given string.
  – Top-Down Parsing: start searching the space of derivations from the start symbol.
  – Bottom-Up Parsing: start searching the space of reverse derivations from the terminal symbols in the string.
Parsing
• Syntactic Parsing - Produce the correct syntactic parse tree for a sentence.
Example of CFG

Grammar:
  S → NP VP
  S → Aux NP VP
  S → VP
  NP → Pronoun
  NP → Proper-Noun
  NP → Det Nominal
  Nominal → Noun
  Nominal → Nominal Noun
  Nominal → Nominal PP
  VP → Verb
  VP → Verb NP
  VP → VP PP
  PP → Prep NP

Lexicon:
  Det → the | a | that | this
  Noun → book | flight | money
  Verb → book | prefer | include
  Pronoun → I | he | she | me
  Proper-Noun → Pune | Mumbai
  Aux → does
  Prep → from | to | on | near | through
• Sentences are generated by recursively rewriting the start symbol using the productions until only terminal symbols remain.
• Ex. "book the flight from Mumbai"

  VP
    Verb: book
    NP
      Det: the
      Nominal
        Nominal
          Noun: flight
        PP
          Prep: from
          NP
            Proper-Noun: Mumbai
The miniature English grammar and lexicon

Grammar:
  S → NP VP
  S → Aux NP VP
  S → VP
  NP → Pronoun
  NP → Proper-Noun
  NP → Det Nominal
  Nominal → Noun
  Nominal → Nominal Noun
  Nominal → Nominal PP
  VP → Verb
  VP → Verb NP
  VP → Verb NP PP
  VP → Verb PP
  VP → VP PP
  PP → Preposition NP

Lexicon:
  Det → the | a | that | this
  Noun → book | flight | money
  Verb → book | prefer | include
  Pronoun → I | he | she | me | we
  Proper-Noun → Pune | Mumbai | Hyderabad
  Aux → does
  Preposition → from | to | on | near | through

Exercise: parse the following sentences
  "Does American airlines serves a meal?" (top-down)
  "She plays in the garden." (bottom-up)
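The miniature grammar above can be turned into a small recognizer: a memoized top-down search over spans of the input, which decides whether a sentence is derivable from S. This is a sketch, not an efficient parser (the dynamic-programming algorithms discussed later are the practical approach); note the exercise sentences contain words outside the lexicon, so the demo uses lexicon words only.

```python
from functools import lru_cache

# The miniature grammar, encoded as non-terminal -> right-hand sides.
GRAMMAR = {
    "S": (("NP", "VP"), ("Aux", "NP", "VP"), ("VP",)),
    "NP": (("Pronoun",), ("Proper-Noun",), ("Det", "Nominal")),
    "Nominal": (("Noun",), ("Nominal", "Noun"), ("Nominal", "PP")),
    "VP": (("Verb",), ("Verb", "NP"), ("Verb", "NP", "PP"),
           ("Verb", "PP"), ("VP", "PP")),
    "PP": (("Preposition", "NP"),),
}
LEXICON = {
    "Det": {"the", "a", "that", "this"},
    "Noun": {"book", "flight", "money"},
    "Verb": {"book", "prefer", "include"},
    "Pronoun": {"i", "he", "she", "me", "we"},
    "Proper-Noun": {"pune", "mumbai", "hyderabad"},
    "Aux": {"does"},
    "Preposition": {"from", "to", "on", "near", "through"},
}

def recognizes(sentence):
    words = sentence.lower().split()
    n = len(words)

    @lru_cache(maxsize=None)
    def derives(sym, i, j):
        """Can sym derive words[i:j]?"""
        if sym in LEXICON:                    # pre-terminal covers one word
            return j - i == 1 and words[i] in LEXICON[sym]
        return any(match(rhs, i, j) for rhs in GRAMMAR.get(sym, ()))

    @lru_cache(maxsize=None)
    def match(seq, i, j):
        """Can the symbol sequence seq derive words[i:j]?"""
        if len(seq) == 1:
            return derives(seq[0], i, j)
        head, rest = seq[0], seq[1:]
        # each remaining symbol must cover at least one word
        return any(derives(head, i, k) and match(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return n > 0 and derives("S", 0, n)

print(recognizes("book the flight from Mumbai"))  # True
print(recognizes("flight the book"))              # False
```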
"Does American airlines serves a meal?" (top-down parse)
[Parse tree: S → Aux NP VP; Aux = Does, NP = Proper-Noun (American airlines), VP covering "serves a meal"]
"She plays in the garden." (bottom-up parse)
[Parse tree built bottom-up: the words are first grouped into NP, VP, Nominal, and PP constituents, which finally combine into S]
Ambiguity
One of the possible parse trees
Attachment ambiguity
"Guna ate an ice cream with fruits from Chennai"
In this sentence, we have two prepositional phrases: "with fruits" and "from Chennai".
The possible meanings are as follows:
1. Guna, who is from Chennai, ate an ice cream filled with fruits.
2. Guna ate an ice cream filled with fruits, and the ice cream was brought from Chennai.
3. Guna, who is from Chennai, ate the ice cream with the help of fruits.
4. Guna, with the help of fruits, ate the ice cream, which was brought from Chennai.
Attachment ambiguity arises from uncertainty about attaching a phrase or clause to a part of a sentence.
Coordination ambiguity
“delicious cookies and milk”
could mean
ⓐ “delicious cookies” + “milk”
Or
ⓑ “delicious cookies” + “delicious milk”
• Ex. "Show me the meal on Flight UA 386 from San Francisco to Denver." There are 14 parses in total for this statement. [Parse tree showing one reasonable parse not reproduced.]
Disambiguation
• Explore all possible parse trees in parallel: unrealistic because of memory requirements.
• Use a backtracking strategy: expand the search-space tree by exploring one state at a time and backtrack if required. Inefficient, as many subtrees have to be rebuilt.
EFFICIENT PARSING FOR CONTEXT FREE GRAMMAR
• Dynamic programming parsing methods
  In parsing with dynamic programming, the subtrees are computed once, stored (in tables), and reused.
  - CKY parsing: uses a CNF grammar
  - The Earley algorithm: any CFG
  - Chart parsing: any CFG
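A CKY recognizer can be sketched compactly. The grammar below is a toy CNF grammar invented for this illustration (every rule is A -> B C or A -> terminal); the miniature lecture grammar would first have to be converted to CNF before CKY applies.

```python
from itertools import product

# Toy CNF grammar: (B, C) -> A means rule A -> B C.
BINARY_RULES = {
    ("NP", "VP"): "S",
    ("Det", "Noun"): "NP",
    ("Verb", "NP"): "VP",
}
LEXICON = {"the": {"Det"}, "dog": {"Noun"}, "cat": {"Noun"},
           "chased": {"Verb"}}

def cky_recognize(words):
    n = len(words)
    # table[i][j] holds the set of non-terminals deriving words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):                 # length-1 spans
        table[i][i + 1] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):                  # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # every split point
                for b, c in product(table[i][k], table[k][j]):
                    if (b, c) in BINARY_RULES:
                        table[i][j].add(BINARY_RULES[(b, c)])
    return "S" in table[0][n]

print(cky_recognize("the dog chased the cat".split()))  # True
```

Each cell is filled exactly once and reused by every larger span that contains it, which is the dynamic-programming saving over naive backtracking.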
What is CKY in NLP?
• When we have loads of descriptions or modifications around a particular word or phrase of interest, we use chunking to grab the required phrase alone, ignoring the rest around it.
Statistical Parsing
• Statistical /Probabilistic parsing helps to solve the problem of disambiguation
• Compute the probability of each possible interpretation and choose the most probable
interpretation
• Most of the modern parsers in NLP are probabilistic
• Probabilistic parsers use Probabilistic / Stochastic CFG (PCFG)
- augmenting CFG with probabilities
• In a PCFG, each production rule is of the form
  A → β [p]
  where p is the probability that the non-terminal A will be expanded to β; p is a number between 0 and 1.
PCFG
Disambiguation with PCFGs
• A PCFG assigns a probability to each parse tree (i.e. each derivation) of a sentence S.
• The probability of a particular parse T is defined as the product of the probabilities of all the n rules used to expand each of the non-terminal nodes in the parse tree T, where each rule i can be expressed as LHS_i → RHS_i:

  P(T, S) = ∏_{i=1}^{n} P(RHS_i | LHS_i)
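The product formula is straightforward to compute. The rule probabilities below are hypothetical and the rule set is deliberately incomplete; in a real PCFG, the probabilities of all rules expanding the same non-terminal sum to 1.

```python
from math import prod

# Hypothetical rule probabilities: (LHS, RHS) -> p.
PCFG = {
    ("S",  ("NP", "VP")):    0.8,
    ("NP", ("Det", "Noun")): 0.3,
    ("VP", ("Verb", "NP")):  0.2,
}

def tree_probability(rules_used):
    """P(T, S): product of the probabilities of all rules in the derivation."""
    return prod(PCFG[rule] for rule in rules_used)

p = tree_probability([
    ("S",  ("NP", "VP")),
    ("NP", ("Det", "Noun")),
    ("VP", ("Verb", "NP")),
    ("NP", ("Det", "Noun")),
])
# p = 0.8 * 0.3 * 0.2 * 0.3 ≈ 0.0144
```

Disambiguation then amounts to computing this probability for every candidate parse and keeping the most probable one.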
PoS Tagging using
Hidden Markov Model (HMM)
1) An easy introduction to Hidden Markov Model (HMM) - Part 1
https://www.youtube.com/watch?v=YlL0YARYK-o
2) https://www.youtube.com/watch?v=7ak1_zDUgEg
TreeBank: A data-driven approach to syntax
What is the meaning of "treebank"?
A treebank (plural: treebanks) is a database of sentences which are annotated with syntactic information, often in the form of a tree.
In the SUSANNE Corpus (Sampson 1995), each token is represented by one line, with the part-of-speech (including morphosyntactic features) in the first column, the actual token in the second column, and the lemma in the third column.
Treebank development
Tools and Standards
• Well-known examples of syntactic parsers used in
treebank development are the deterministic Fidditch
parser (Hindle 1994), used in the development of
the Penn Treebank, and the statistical parser of
Collins et al. (1999), used for the Prague
Dependency Treebank.
References
https://www.ibm.com/in-en/topics/natural-language-processing
END OF UNIT 2