NLP Lab File
I. Introduction
NLP has heavily benefited from recent advances in machine learning, especially from
deep learning techniques. The field is broadly divided into three parts:
II. Why NLP is Difficult?
Another remarkable thing about human language is that it is all about symbols.
According to Chris Manning, a machine learning professor at Stanford, it is a discrete,
symbolic, categorical signaling system. This means we can convey the same meaning
in different ways (e.g., speech, gesture, signs, etc.). The encoding by the human brain is
a continuous pattern of activation, and the symbols are transmitted via continuous
signals of sound and vision.
Understanding human language is considered a difficult task due to its complexity. For
example, there is an infinite number of different ways to arrange words in a sentence.
Also, words can have several meanings and contextual information is necessary to
correctly interpret sentences. Every language is more or less unique and ambiguous.
Just take a look at the following newspaper headline "The Pope’s baby steps on gays."
This sentence clearly has two very different interpretations, which is a pretty good
example of the challenges in NLP.
Fig 1.3 : Evolution of NLP
III. Syntactic and Semantic Analysis
Syntactic analysis (syntax) and semantic analysis (semantics) are the two primary
techniques that lead to the understanding of natural language. Language is a set of valid
sentences, but what makes a sentence valid? Syntax and semantics.
Syntax is the grammatical structure of the text, whereas semantics is the meaning being
conveyed. A sentence that is syntactically correct, however, is not always semantically
correct. For example, “cows flow supremely” is grammatically valid (subject — verb —
adverb) but it doesn't make any sense.
Syntactic Analysis
Syntactic analysis, also referred to as syntax analysis or parsing, is the process of
analyzing natural language with the rules of a formal grammar. Grammatical rules are
applied to categories and groups of words, not individual words. Syntactic analysis
essentially assigns a syntactic structure to text.
For example, a sentence includes a subject and a predicate where the subject is a noun
phrase and the predicate is a verb phrase. Take a look at the following sentence: “The
dog (noun phrase) went away (verb phrase).” Note how we can combine every noun
phrase with a verb phrase. Again, it's important to reiterate that a sentence can be
syntactically correct but not make sense.
Semantic Analysis
The way we understand what someone has said is an unconscious process relying on
our intuition and knowledge about language itself. In other words, the way we
understand language is heavily based on meaning and context. Computers need a
different approach, however. The word “semantic” is a linguistic term and means
"related to meaning or logic."
Experiment No: 2
I. Grammars
Grammar is essential for describing the syntactic structure of well-formed programs. In
the literary sense, grammars denote syntactical rules for conversation in natural
languages. Linguists have attempted to define grammars since the inception of natural
languages like English, Hindi, etc.
The theory of formal languages is also applicable in the fields of Computer Science
mainly in programming languages and data structure. For example, in ‘C’ language, the
precise grammar rules state how functions are made from lists and statements.
Formally, a grammar can be written as a 4-tuple (VN, Σ, P, S), where VN is the set of
non-terminals, Σ is the set of terminals, P is the set of production rules and S is the
start symbol. P denotes the production rules for terminals as well as non-terminals.
Each rule has the form α → β, where α and β are strings over VN ∪ Σ and at least one
symbol of α belongs to VN.
Context-Free Grammar (CFG)
CFG consists of a finite set of grammar rules having the following four components
Set of Non-Terminals
Set of Terminals
Set of Productions
Start Symbol
Set of Non-terminals
It is represented by V. The non-terminals are syntactic variables that denote the sets of
strings, which helps in defining the language that is generated with the help of
grammar.
Set of Terminals
Terminals are also known as tokens and are represented by Σ. Strings are formed from
the basic symbols of the terminals.
Set of Productions
It is represented by P. The set gives an idea about how the terminals and nonterminals
can be combined. Every production consists of the following components:
Non-terminals,
Arrow,
Terminals (the sequence of terminals).
The left-hand side of a production consists of non-terminals, while the right-hand side
consists of terminals (possibly together with non-terminals).
Start Symbol
The production begins from the start symbol. It is represented by the symbol S. The
start symbol is always a non-terminal symbol.
Before diving deeper into the discussion of constituency grammar (CG), let us see some
fundamental points about constituency grammar and the constituency relation.
All the related frameworks view the sentence structure in terms of constituency relation.
To derive the constituency relation, we take the help of subject-predicate division of
Latin as well as Greek grammar.
Here we study the clause structure in terms of noun phrase NP and verb phrase VP.
“The dogs are barking in the park” “They are eating happily”
“The horses are running since the morning”
Another view of constituency grammar is to define the grammar in terms of
part-of-speech tags, as sketched in the example below.
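As a small illustration of this, the sketch below writes a toy constituency grammar over
part-of-speech categories as an NLTK context-free grammar and parses the first example
sentence with it; the rule set is our own illustrative choice, not a grammar from the lab.

import nltk

# A toy constituency grammar (illustrative only) covering
# "The dogs are barking in the park"
grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> DT NNS | DT NN
VP  -> AUX VBG PP
PP  -> IN NP
DT  -> 'The' | 'the'
NNS -> 'dogs'
NN  -> 'park'
AUX -> 'are'
VBG -> 'barking'
IN  -> 'in'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("The dogs are barking in the park".split()):
    print(tree)
    # (S (NP (DT The) (NNS dogs))
    #    (VP (AUX are) (VBG barking) (PP (IN in) (NP (DT the) (NN park)))))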
II. Parsing
Simply speaking, parsing in NLP is the process of determining the syntactic structure
of a text by analyzing its constituent words based on an underlying grammar (of the
language).
See this example grammar below, where each line indicates a rule of the grammar to be
applied to an example sentence “Tom ate an apple”.
Example Grammar
Then, the outcome of the parsing process would be a parse tree like the following, where
sentence is the root, intermediate nodes such as noun-phrase, verb-phrase etc. have
children - hence they are called non-terminals and finally, the leaves of the tree ‘Tom’,
‘ate’, ‘an’, ‘apple’ are called terminals.
Parse Tree
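Since the example grammar and the parse tree are shown as figures, the sketch below
reproduces the same idea in code: a toy NLTK grammar of our own (a stand-in for the
example grammar, not the original figure) is used to parse "Tom ate an apple" and print
the resulting tree.

import nltk

# A stand-in grammar for the example above (illustrative only)
grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> NNP | DT NN
VP  -> VBD NP
NNP -> 'Tom'
VBD -> 'ate'
DT  -> 'an'
NN  -> 'apple'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Tom ate an apple".split()):
    print(tree)    # (S (NP (NNP Tom)) (VP (VBD ate) (NP (DT an) (NN apple))))
    # tree.draw()  # opens a window drawing the parse tree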
Existing parsing approaches are basically statistical, probabilistic, and machine
learning-based. Some notable tools for parsing are the Stanford parser (The Stanford
Natural Language Processing Group) and OpenNLP (Apache OpenNLP Developer
Documentation).
Now, if we talk about Part-of-Speech (PoS) tagging, then it may be defined as the
process of assigning one of the parts of speech to the given word. It is generally called
POS tagging. In simple words, we can say that POS tagging is a task of labelling each
word in a sentence with its appropriate part of speech. We already know that parts of
speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their
sub-categories.
Most POS tagging approaches fall under rule-based POS tagging, stochastic POS tagging
and transformation-based tagging.
Rule-based POS Tagging
One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers
use a dictionary or lexicon to get the possible tags for each word. If a word has more
than one possible tag, then rule-based taggers use hand-written rules to identify the
correct tag.
First stage − In the first stage, it uses a dictionary to assign each word a list of potential
parts-of-speech.
Second stage − In the second stage, it uses large lists of hand-written disambiguation
rules to narrow the list down to a single part-of-speech for each word.
Stochastic POS Tagging
Another technique of tagging is Stochastic POS Tagging. Now, the question that arises
here is which model can be stochastic. The model that includes frequency or probability
(statistics) can be called stochastic. Any approach to part-of-speech tagging that uses
frequency or probability in this way can be referred to as a stochastic tagger.
The simplest stochastic taggers apply the following approaches to POS tagging:
Word frequency approach − In this approach, the stochastic tagger disambiguates words
based on the probability that a word occurs with a particular tag. In other words, the tag
encountered most frequently with the word in the training set is the one assigned to an
ambiguous instance of that word. The main issue with this approach is that it may yield
an inadmissible sequence of tags.
Tag sequence probabilities (n-gram approach) − In this approach, the tagger calculates
the probability of a given sequence of tags occurring. It is called the n-gram approach
because the best tag for a given word is determined by the probability with which it
occurs with the n previous tags.
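A minimal sketch of the word frequency approach using NLTK's UnigramTagger, which
assigns each word the tag it occurs with most often in the training data (the corpus slice
and the test sentence are illustrative choices of our own):

import nltk
from nltk.corpus import treebank

# nltk.download('treebank'); nltk.download('punkt')  # needed on the first run

train_sents = treebank.tagged_sents()[:3000]   # a slice of the Penn Treebank sample
unigram_tagger = nltk.UnigramTagger(train_sents)

print(unigram_tagger.tag(nltk.word_tokenize("They can fish")))
# Each word receives its most frequent tag from the training data;
# words never seen in training are tagged None, which is exactly the kind of
# gap that tag-sequence models and smoothing later address.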
Transformation-based Tagging
Transformation-based tagging is also called Brill tagging. It is an instance of
transformation-based learning (TBL), which is a rule-based algorithm for automatic tagging of
POS to the given text. TBL allows us to have linguistic knowledge in a readable form and
transforms one state to another state by using transformation rules.
It draws inspiration from both of the previously explained taggers − rule-based and stochastic.
Like rule-based tagging, it is based on rules that specify what tags need to be assigned to what
words. Like stochastic tagging, it is a machine learning technique in which rules are
automatically induced from data.
Start with the solution − TBL usually starts with some solution to the
problem and works in cycles.
Most beneficial transformation chosen − In each cycle, TBL will choose the most
beneficial transformation.
Apply to the problem − The transformation chosen in the last step will be
applied to the problem.
The algorithm will stop when the transformation selected in step 2 no longer adds
value or there are no more transformations to be selected. Such kind of learning
is best suited to classification tasks.
HMM (Hidden Markov Model) is a Stochastic technique for POS tagging. Hidden
Markov models are known for their applications to reinforcement learning and temporal
pattern recognition such as speech, handwriting, gesture recognition, musical score
following, partial discharges, and bioinformatics.
Let us consider an example proposed by Dr. Luis Serrano and find out how an HMM
selects an appropriate tag sequence for a sentence.
Fig 2.8 : Transition Probability
An HMM may be defined as a doubly-embedded stochastic model, where the
underlying stochastic process is hidden. This hidden process can only be observed
through another set of stochastic processes that produces the sequence of
observations.
A POS tagger takes in a phrase or sentence and assigns the most probable part-of-speech
tag to each word. In practice, input is often pre-processed. One common pre-processing
task is to tokenize the input so that the tagger sees a sequence of words and punctuation
marks. Other tasks such as stop word removal, punctuation removal and lemmatization
may be done before tagging.
The set of predefined tags is called the tagset. This is essential information that the
tagger must be given. Example tags are NNS for a plural noun, VBD for a past tense
verb, or JJ for an adjective. A tagset can also include punctuations.
Rather than design our own tagset, the common practice is to use well-known tagsets:
87-tag Brown tagset, 45-tag Penn Treebank tagset, 61-tag C5 tagset, or 146-tag C7
tagset. In the architecture diagram, we have shown the 45-tag Penn Treebank tagset.
Sketch Engine is a place to download tagsets.
Experiment No : 3
Introduction to NLTK
NLTK stands for Natural Language Toolkit. It is a powerful, leading platform for
building Python programs that work with human language data. It consists of several
packages that help machines understand human language and reply with an
appropriate response.
NLTK (Natural Language Toolkit) is the go-to API for NLP (Natural Language
Processing) with Python. It is a really powerful tool to preprocess text data for further
analysis like with ML models for instance. It helps convert text into numbers, which the
model can then easily work with. This is the first part of a basic introduction to NLTK
for getting our feet wet and assumes some basic knowledge of Python.
Importing dataset from corpus
Some simple operations with NLTK
Tokenizing
Filtering Stop Words
Stemming
Tagging Parts of Speech
Tokenizing
By tokenizing, you can conveniently split up text by word or by sentence. This will
allow you to work with smaller pieces of text that are still relatively coherent and
meaningful even outside of the context of the rest of the text. It’s your first step in
turning unstructured data into structured data, which is easier to analyze.
When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence.
Here’s what both types of tokenization bring to the table:
Tokenizing by word: Words are like the atoms of natural language. They’re the
smallest unit of meaning that still makes sense on its own. Tokenizing your text by word
allows you to identify words that come up particularly often. For example, if you were
analyzing a group of job ads, then you might find that the word “Python” comes up
often. That could suggest high demand for Python knowledge, but you’d need to look
deeper to know more.
Tokenizing by sentence: When you tokenize by sentence, you can analyze how those
words relate to one another and see more context. Are there a lot of negative words
around the word “Python” because the hiring manager doesn’t like Python? Are there
more terms from the domain of herpetology than the domain of software development,
suggesting that you may be dealing with an entirely different kind of python than you
were expecting?
Here’s how to import the relevant parts of NLTK so you can tokenize by word and by
sentence:
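Here is a minimal sketch of those imports together with a small usage example (the
sample string is our own):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # tokenizer models, needed on the first run

example_string = "Python is great for NLP. NLTK makes tokenizing easy."  # sample text

print(sent_tokenize(example_string))
# ['Python is great for NLP.', 'NLTK makes tokenizing easy.']
print(word_tokenize(example_string))
# ['Python', 'is', 'great', 'for', 'NLP', '.', 'NLTK', 'makes', 'tokenizing', 'easy', '.']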
Filtering Stop Words
Stop words are words that you want to ignore, so you filter them out of your text when
you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop
words since they don’t add a lot of meaning to a text in and of themselves.
Here’s how to import the relevant parts of NLTK in order to filter out stop words:
>>> import nltk
>>> nltk.download("stopwords")
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize
Stemming
Stemming is a text processing task in which you reduce words to their root, which is
the core part of a word. For example, the words “helping” and “helper” share the root
“help.” Stemming allows you to zero in on the basic meaning of a word rather than all
the details of how it’s being used. NLTK has more than one stemmer, but you’ll be
using the Porter stemmer.
Here’s how to import the relevant parts of NLTK in order to start stemming:
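A minimal sketch of the Porter stemmer in use (the sample sentence is our own):

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
words = word_tokenize("The helpers were helping the helpless")  # sample sentence
print([stemmer.stem(w) for w in words])
# roughly: ['the', 'helper', 'were', 'help', 'the', 'helpless']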
Tagging Parts of Speech
Part of speech is a grammatical term that deals with the roles words play when you
use them together in sentences. Tagging parts of speech, or POS tagging, is the task of
labeling the words in your text according to their part of speech.
Part of speech    Role                            Examples
Noun              Is a person, place, or thing    mountain, bagel, Poland
Some sources also include the category articles (like “a” or “the”) in the list of parts of
speech, but other sources consider them to be adjectives. NLTK uses the word
determiner to refer to articles.
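A small sketch of POS tagging with NLTK (the sentence is our own; note the DT tag that
NLTK's tagger uses for the article "The"):

import nltk

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # first run only
words = nltk.word_tokenize("The dog went away")   # sample sentence
print(nltk.pos_tag(words))
# approximately: [('The', 'DT'), ('dog', 'NN'), ('went', 'VBD'), ('away', 'RB')]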
Experiment No : 4
Objective:
The process of converting data to something a computer can understand is referred to
as pre-processing. One of the major forms of pre-processing is to filter out useless data.
In natural language processing, useless words (data) are referred to as stop words.
Introduction:
Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that
a search engine has been programmed to ignore, both when indexing entries for
searching and when retrieving them as the result of a search query.
If we have a task of text classification or sentiment analysis, then we should remove stop
words, as they do not provide any information to our model; in other words, we keep
unwanted words out of our corpus. But if we have a task of language translation, then
stop words are useful, as they have to be translated along with the other words.
There is no hard and fast rule on when to remove stop words. But I would suggest
removing stop words if the task to be performed is one of language classification,
spam filtering, caption generation, auto-tag generation, sentiment analysis, or
something else related to text classification.
Code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# import nltk; nltk.download('stopwords'); nltk.download('punkt')  # needed on the first run

example_sent = "This is a sample sentence, showing off the stop words filtration."  # sample input

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print("Word Tokens : ", word_tokens)
print("Filtered Sentence : ", filtered_sentence)
Output :
Experiment No : 5
Write a Python Program to generate “tokens” and assign “PoS tags” for a
given text using NLTK package.
Objective:
Generate “Tokens” and assign “PoS” tags for a given text using NLTK package
Introduction:
It is a process of converting a sentence into forms: a list of words, or a list of tuples
(where each tuple has the form (word, tag)). The tag in this case is a part-of-speech tag
and signifies whether the word is a noun, adjective, verb, and so on.
Default tagging is a basic step for part-of-speech tagging. It is performed using the
DefaultTagger class. The DefaultTagger class takes 'tag' as a single argument. NN is
the tag for a singular noun. DefaultTagger is most useful when it has to work with the
most common part-of-speech tag; that is why a noun tag is recommended.
Code :
from nltk import pos_tag, RegexpParser, word_tokenize
# import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # first run only

text = word_tokenize("learn php from guru99 and make study easy")  # sample input sentence

tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)
output = chunker.parse(tokens_tag)
print("After Chunking", output)
Output :
Experiment No : 6
What is a WordCloud?
A WordCloud / Word Cloud (also known as a tag cloud or word art) is a simple
visualisation of data, in which words are shown in varying sizes depending on how often
they appear in your text/data.
There are many free word cloud generators online that can help you perform text
analysis and spot trends and patterns at a glance. Python is not the only tool capable of
creating such visuals, so if you need to make a word cloud visualisation quickly and
you are not working with your data in Python, then this tutorial is not for you.
WordCloud install
In order to create a WordCloud viz in Python you will need to install the packages below:
numpy
pandas
matplotlib
os
pillow
wordcloud
The first four packages are data analytics staples, so they don't require an introduction.
The pillow library is a package that enables image reading. You can find a tutorial for
pillow here. Pillow is a wrapper for PIL, the Python Imaging Library. You will need this
library to read in an image as the mask for the WordCloud.
The wordcloud library is the one responsible for creating WordClouds. It can be a little
tricky to install. If you only need it for plotting a basic WordCloud, then running one of
the commands below would be sufficient.
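A typical way to install it, depending on your environment, is one of the following
commands (a hedged suggestion, not an official requirement of this lab):

pip install wordcloud
conda install -c conda-forge wordcloud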
Code :
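(A minimal sketch of generating a basic WordCloud; the sample text is our own
illustration, and any real text or corpus could be substituted.)

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Sample text of our own; in practice this would be the text/data being analysed
text = ("Natural language processing makes it possible for computers to read text, "
        "hear speech and interpret it. NLP draws on linguistics and machine learning.")

wordcloud = WordCloud(width=800, height=400,
                      background_color='white',
                      stopwords=set(STOPWORDS)).generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()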
Output :
Experiment No : 7
Perform an experiment to learn about morphological features of a word by
analyzing it.
Objective:
The objective of the experiment is to learn about morphological features of a word by
analysing it.
Introduction:
A word can be simple or complex. For example, the word 'cat' is simple because one
cannot further decompose the word into smaller parts. On the other hand, the word 'cats'
is complex, because the word is made up of two parts: the root 'cat' and the plural suffix '-s'.
Definition
Morphemes are considered the smallest meaningful units of language. A morpheme can
either be a root word (play) or an affix (-ed). The combination of these morphemes is
called a morphological process. So, the word "played" is made out of two morphemes,
"play" and "-ed". Finding all parts of a word (its morphemes) and thereby describing the
properties of the word is called "morphological analysis". For example, "played" carries
the information verb "play" and "past tense", so the given word is the past tense form of
the verb "play".
Analysis of a word :
           singular            plural
direct     बच्चा (bachchaa)     बच्चे (bachche)
oblique    बच्चे (bachche)      बच्चों (bachchoM)
Types of Morphology
Morphology is of two types:
1. Inflectional morphology
Deals with word forms of a root, where there is no change in lexical category. For
example, 'played' is an inflection of the root word 'play'. Here, both 'played' and 'play'
are verbs.
2. Derivational morphology
Deals with word forms of a root, where there is a change in the lexical category. For
example, the word form 'happiness' is a derivation of the word 'happy'. Here, 'happiness'
is a derived noun form of the adjective 'happy'.
Morphological Features:
All words will have their lexical category attested during morphological analysis.
A noun and pronoun can take suffixes of the following features: gender, number,
person, case.
For example, morphological analysis of a few words is given below:
Language    input: word    output: analysis
'rt' stands for root. 'cat' stands for lexical category. The value of lexical category can
be noun, verb, adjective, pronoun, adverb or preposition. 'gen' stands for gender. The
value of gender can be masculine or feminine.
'num' stands for number. The value of number can be singular (sg) or plural (pl). 'per'
stands for person. The value of person can be 1, 2 or 3
The value of tense can be present, past or future. This feature is applicable for verbs.
The value of aspect can be perfect (pft), continuous (cont) or habitual (hab). This feature
is also applicable for verbs.
'case' can be direct or oblique. This feature is applicable for nouns. A case is oblique
when a postposition occurs after the noun; if no postposition can occur after the noun,
then the case is direct. This is applicable for Hindi but not for English, as English does
not have postpositions. Some of the postpositions in Hindi are: का (kaa), की (kii),
के (ke), को (ko), में (meM).
Procedure
STEP1: Select the language.
OUTPUT: Drop down for selecting words will appear.
Experiment :
Eg-1
Select a word from the dropbox below and do a morphological analysis on that word.
Select the correct morphological analysis for the above word using the dropboxes
(NOTE: na = not applicable).
WORD      train
ROOT      train
CATEGORY  verb
GENDER    na
NUMBER    singular
PERSON    third
CASE      na
TENSE     simple-future

Check: Right answer!!!
Eg - 2
STEP1: Select the language.
OUTPUT: Drop-downs for selecting the root and other features will appear.
STEP3: After selecting all the features, select the word corresponding to the features
selected above.
STEP4: Click the check button to see whether the right word is selected or not.
Experiment No : 8
Objective:
The objective of the experiment is to generate word forms from root and suffix
information.
Introduction:
A word can be simple or complex. For example, the word 'cat' is simple because one
cannot further decompose the word into smaller parts. On the other hand, the word
'cats' is complex, because the word is made up of two parts: the root 'cat' and the plural
suffix '-s'.
Theory:
Given the root and suffix information, a word can be generated. For example,
- Analysis may involve non-determinism, since more than one analysis may be possible.
- Generation is a deterministic process. If a language allows spelling variation, then to
that extent generation would also involve non-determinism.
Procedure:
Eg-1
STEP3: Select the features.
Experiment No : 9
Objective:
Understanding the morphology of a word by the use of Add-Delete table
Introduction:
Morphology is the study of the way words are built up from smaller meaning bearing
units i.e., morphemes. A morpheme is the smallest meaningful linguistic unit. For eg:
• बच्चों (bachchoM) consists of two morphemes: बच्चा (bachchaa) carries the
information of the root word, the noun "बच्चा" (bachchaa), and ओं (oM) carries the
information of plural and oblique case.
• played has two morphemes, play and -ed, carrying the information verb "play" and
"past tense", so the given word is the past tense form of the verb "play".
Words can be analysed morphologically if we know all variants of a given root word.
We can use an 'Add-Delete' table for this analysis.
Theory:
Morph Analyser
Definition
Analysis of a word :
A linguistic paradigm is the complete set of variants of a given lexeme. These variants
can be classified according to shared inflectional categories (e.g., number, case, etc.)
and arranged into tables.
Paradigm Class:
Words in the same paradigm class behave similarly. For example, a noun in the same
paradigm class as बच्चा (bachchaa) would take the same endings (बच्चे, बच्चों), since
the two words share the same paradigm class.
Procedure:
STEP1: Select a word root.
STEP2: Fill the add-delete table and submit.
STEP3: If wrong, see the correct answer or repeat STEP1.
Experiment:
STEP1: Select a word root.
Experiment No : 10
Perform an experiment to learn to calculate bigrams from a given corpus
and calculate probability of a sentence.
Objective:
The objective of this experiment is to learn to calculate bigrams from a given
corpus and calculate probability of a sentence.
Introduction:
Probability of a sentence can be calculated from the probability of the sequence of
words occurring in it. We can use the Markov assumption, that the probability of a word
in a sentence depends on the probability of the word occurring just before it. Such a
model is called a first-order Markov model or the bigram model.
Here, wn refers to the word token corresponding to the nth word in a sequence.
Theory:
One easy way to handle such unacceptable sentences is by assigning probabilities to the
strings of words i.e, how likely the sentence is.
Probability of a sentence
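The probability of a sentence is given by the chain rule of probability (a standard
formulation, written here in the document's w-notation):

P(w1 w2 ... wn) = P(w1) * P(w2 | w1) * P(w3 | w1 w2) * ... * P(wn | w1 w2 ... wn-1)

Estimating the long conditional probabilities at the end of this product directly from a
corpus is impractical, which is what motivates the bigram approximation described next.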
Bigrams
We can avoid this very long calculation by approximating that the probability of a given
word depends only on the probability of its previous word. This assumption is called the
Markov assumption, and such a model is called a Markov model; with bigrams, we look
one word into the past. Bigrams can be generalized to the n-gram, which looks at (n-1)
words in the past. A bigram is a first-order Markov model.
Therefore, under the Markov assumption the probability of the whole sentence reduces
to a product of bigram probabilities.
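In the standard notation (a sketch of the usual bigram formulation, consistent with the
counts used below):

P(w1 w2 ... wn) ≈ P(w1 | eos) * P(w2 | w1) * ... * P(wn | wn-1),
where each bigram probability is estimated from counts as
P(wn | wn-1) = C(wn-1 wn) / C(wn-1).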
A bigram table for a given corpus can be generated and used as a lookup table for
calculating probability of sentences.
Eg: Corpus – (eos) You book a flight (eos) I read a book (eos) You read (eos)
Bigram Table:
          eos    you    book   a      flight   I      read
a         0      0      0.5    0      0.5      0      0
flight    1      0      0      0      0        0      0
I         0      0      0      0      0        0      1
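As a concrete check of the table above, the sketch below computes the same bigram
probabilities from the example corpus (the token handling and the test sentence are our
own illustrative choices):

from collections import Counter

corpus = "(eos) You book a flight (eos) I read a book (eos) You read (eos)"
tokens = [t.strip('()').lower() for t in corpus.split()]   # ['eos', 'you', 'book', 'a', ...]

unigram_counts = Counter(tokens[:-1])                      # counts of the first word of each bigram
bigram_counts = Counter(zip(tokens, tokens[1:]))           # counts of adjacent word pairs

def bigram_prob(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob('a', 'flight'))    # 0.5, matching the table
print(bigram_prob('flight', 'eos'))  # 1.0

# Probability of the sentence "(eos) You read a book (eos)" under the bigram model
sent = ['eos', 'you', 'read', 'a', 'book', 'eos']
p = 1.0
for w1, w2 in zip(sent, sent[1:]):
    p *= bigram_prob(w1, w2)
print(p)   # 2/3 * 1/2 * 1/2 * 1/2 * 1/2 ≈ 0.0417

Note that any unseen bigram makes the estimated probability zero; handling such cases
is what smoothing, covered in a later experiment, addresses.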
Procedure:
STEP1: Select a corpus and click on
STEP2: Fill up the table that is generated and hit
STEP3: If incorrect (red), see the correct answer by clicking on show answer or
repeat Step 2.
STEP4: If correct (green), click on take a quiz and fill the correct answer
Experiment:
STEP1: Select a corpus and click on
STEP3: If incorrect (red), see the correct answer by clicking on show answer or
repeat Step 2.
STEP4: If correct (green), click on take a quiz and fill the correct answer
Experiment No : 11
Objective:
The objective of this experiment is to learn how to apply add-one smoothing on sparse
bigram table
Introduction:
One major problem with standard N-gram models is that they must be trained from
some corpus, and because any particular training corpus is finite, some perfectly
acceptable N-grams are bound to be missing from it. We can see that the bigram matrix
for any given training corpus is sparse: there are a large number of zero-probability
bigrams that should really have some non-zero probability. This method tends to
underestimate the probability of strings that happen not to have occurred in the
training corpus.
There are some techniques that can be used for assigning a non-zero probability to
these 'zero-probability bigrams'. This task of re-evaluating some of the zero-probability
and low-probability N-grams, and assigning them non-zero values, is called smoothing.
Theory:
The standard N-gram models are trained from some corpus. The finiteness of the
training corpus leads to the absence of some perfectly acceptable N-grams. This results
in sparse bigram matrices. This method tends to underestimate the probability of strings
that do not occur in the training corpus.
There are some techniques that can be used for assigning a non-zero probability to these
'zero-probability bigrams'. This task of re-evaluating some of the zero-probability and
low-probability N-grams, and assigning them non-zero values, is called smoothing.
Some of the techniques are: Add-One Smoothing, Witten-Bell Discounting and
Good-Turing Discounting.
Add-One Smoothing
In add-one smoothing, we add one to all the bigram counts before normalizing them into
probabilities.
Application on unigrams
The unsmoothed maximum likelihood estimate of the unigram probability can be
computed by dividing the count of the word by the total number of word tokens N:
P(wx) = c(wx) / Σi c(wi) = c(wx) / N
Application on bigrams
Normal bigram probabilities are computed by normalizing each row of counts by the
unigram count:
P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
For add-one smoothed bigram counts, we need to augment the unigram count by the
total number of word types in the vocabulary V:
P*(wn | wn-1) = ( C(wn-1 wn) + 1 ) / ( C(wn-1) + V )
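A small sketch of this calculation on a toy corpus (the corpus of the previous
experiment; the probe bigrams are illustrative):

from collections import Counter

corpus = "(eos) You book a flight (eos) I read a book (eos) You read (eos)"
tokens = [t.strip('()').lower() for t in corpus.split()]

unigram_counts = Counter(tokens[:-1])              # counts of the first word of each bigram
bigram_counts = Counter(zip(tokens, tokens[1:]))   # counts of adjacent word pairs
V = len(set(tokens))                               # vocabulary size (word types), here 7

def p_mle(w1, w2):        # unsmoothed estimate
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def p_add_one(w1, w2):    # add-one smoothed estimate
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(p_mle('a', 'flight'), p_add_one('a', 'flight'))   # 0.5  vs  (1+1)/(2+7) ≈ 0.2222
print(p_mle('a', 'you'), p_add_one('a', 'you'))         # 0.0  vs  (0+1)/(2+7) ≈ 0.1111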
Procedure
STEP2: Apply add-one smoothing and calculate the bigram probabilities using the
given bigram counts, N and V. Fill the table and hit Submit.
STEP3: If incorrect (red), see the correct answer by clicking on show answer or
repeat Step 2
Experiment
N=V=
Fill in the bigram probabilities after add-one smoothing (up to 4 decimal places):
Right Answer
Experiment No : 12
Objective:
The objective of the experiment is to calculate emission and transition matrix which will
be helpful for tagging Parts of Speech using Hidden Markov Model.
Introduction:
POS tagging or part-of-speech tagging is the procedure of assigning a grammatical
category like noun, verb, adjective etc. to a word. In this process both the lexical
information and the context play an important role as the same lexical form can behave
differently in a different context.
For example the word "Park" can have two different lexical categories based on the
context.
Assigning part of speech to words by hand is a common exercise one can find in an
elementary grammar class. But here we wish to build an automated tool which can assign
the appropriate part-of-speech tag to the words of a given sentence. One can think of
creating hand crafted rules by observing patterns in the language, but this would limit
the system's performance to the quality and number of patterns identified by the rule
crafter. Thus, this approach is not practically adopted for building POS Tagger. Instead,
a large corpus annotated with correct POS tags for each word is given to the computer
and algorithms then learn the patterns automatically from the data and store them in the
form of a trained model. Later this model can be used to POS tag new sentences.
In this experiment we will explore how such a model can be learned from the data.
Theory
A Hidden Markov Model (HMM) is a statistical Markov model in which the system
being modeled is assumed to be a Markov process with unobserved (hidden) states. In a
regular Markov model (Ref: http://en.wikipedia.org/wiki/Markov_model), the state is
directly visible to the observer, and therefore the state transition probabilities are the
only parameters. In a hidden Markov model, the state is not directly visible, but the
output, which is dependent on the state, is visible.
For POS tagging, it is assumed that POS tags are generated as a random process, and
each process randomly generates a word. Hence, the transition matrix denotes the
transition probability from one POS to another and the emission matrix denotes the
probability that a given word can have a particular POS. Words act as the observations.
Some of the basic assumptions are:
Calculating Emission Probability Matrix
Consider the following tagged corpus:
EOS/eos They/pronoun cut/verb the/determiner paper/noun EOS/eos
He/pronoun asked/verb for/preposition his/pronoun cut/noun EOS/eos
Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun EOS/eos
Counting the occurrences of the word "cut" with each tag:
count(cut,verb) = 1
count(cut,noun) = 2
count(cut,determiner) = 0
count(cut) = 3
P(cut/verb) = count(cut,verb) / count(cut) = 1/3 = 0.33
Similarly, the probability to be filled in the cell at the intersection of cut and determiner is
P(cut/determiner) = count(cut,determiner) / count(cut) = 0/3 = 0
Repeat the same for all the word-tag combinations and fill in the emission matrix.
Calculating Transition Probability Matrix
Count the no. of times a specific tag comes after other POS tags in the corpus. Here,
say for "determiner"
count(verb,determiner)=2
count(preposition,determiner)=1
count(determiner,determiner)=0
count(eos,determiner)=0
count(noun,determiner)=0
... and so on zero for other tags too.
count(determiner) = total count of tag 'determiner' = 3
P(determiner/verb) = count(verb,determiner) / count(determiner) = 2/3 = 0.66
Similarly, the probability to be filled in the cell at the intersection of determiner (in the
column) and noun (in the row) is
P(determiner/noun) = count(noun,determiner) / count(determiner) = 0/3 = 0
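The same counts can be reproduced programmatically. The sketch below recomputes the
emission and transition entries worked out above from the tagged corpus, following the
text's own formulas (the variable names are ours):

from collections import Counter

# The tagged corpus from the worked example above, written as word/tag pairs
tagged = ("EOS/eos They/pronoun cut/verb the/determiner paper/noun EOS/eos "
          "He/pronoun asked/verb for/preposition his/pronoun cut/noun EOS/eos "
          "Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun EOS/eos")

pairs = [wt.rsplit('/', 1) for wt in tagged.split()]
words = [w.lower() for w, t in pairs]
tags = [t for w, t in pairs]

word_count = Counter(words)                  # count(word)
tag_count = Counter(tags)                    # count(tag)
emission = Counter(zip(words, tags))         # count(word, tag)
transition = Counter(zip(tags, tags[1:]))    # count(previous tag, next tag)

# Emission entry: P(cut/verb) = count(cut, verb) / count(cut)
print(emission[('cut', 'verb')] / word_count['cut'])                 # 1/3 ≈ 0.33
# Transition entry, as in the text: count(verb, determiner) / count(determiner)
print(transition[('verb', 'determiner')] / tag_count['determiner'])  # 2/3 ≈ 0.66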
Experiment:
Emission Matrix
book park car is in a the
determiner 0 0 0 0 0 0 0
noun 0 0 0 0 0 0 0
verb 0 0 0 0 0 0 0
preposition 0 0 0 0 0 0 0
Transition Matrix
eos determiner noun verb preposition
eos 0 0 0 0 0
determiner 0 0 0 0 0
noun 0 0 0 0 0
verb 0 0 0 0 0
preposition 0 0 0 0 0
Right Answer
Emission Matrix
book park car is in a the
determiner 0 0 0 0 0 1 1
noun 0.5 0.5 1 0 0 0 0
verb 0.5 0.5 0 1 0 0 0
preposition 0 0 0 0 1 0 0
Transition Matrix
eos determiner noun verb preposition
eos 0 0.33 0 0.5 0
determiner 0 0 1 0 0
noun 1 0 0 0.5 0
verb 0 0.33 0 0 1
preposition 0 0.33 0 0 0
Experiment No: 13
Introduction:
In the previous experiment you calculated the transition and emission matrices; in this
experiment they will be used to find the POS tag sequence for a given sentence.
When we have the emission and transition matrices, various algorithms can be applied
to find the POS tags for the words. Some possible algorithms are: the backward
algorithm, the forward algorithm and the Viterbi algorithm. Here, in this experiment,
you can get familiar with Viterbi decoding.
In the mid 1980s, researchers in Europe began to use Hidden Markov models (HMMs)
to disambiguate parts of speech. HMMs involve counting cases, and making a table of
the probabilities of certain sequences. For example, once you've seen an article such as
'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number
20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be
a noun than a verb or a modal. The same method can of course be used to benefit from
knowledge about following words.
More advanced ("higher order") HMMs learn the probabilities not only of pairs, but
triples or even larger sequences. So, for example, if you've just seen an article and a
verb, the next item may be very likely a preposition, article, or noun, but much less
likely another verb.
When several ambiguous words occur together, the possibilities multiply. However, it
is easy to enumerate every combination and to assign a relative probability to each one,
by multiplying together the probabilities of each choice in turn.
It is worth remembering, as Eugene Charniak points out in Statistical techniques for
natural language parsing, that merely assigning the most common tag to each known
word and the tag "proper noun" to all unknowns, will approach 90% accuracy because
many words are unambiguous.
HMMs underlie the functioning of stochastic taggers and are used in various algorithms.
Accuracies for one such algorithm (TnT) on various training data are shown here.
Conditional Random Field
Conditional random fields (CRFs) are a class of statistical modelling methods often
applied in machine learning, where they are used for structured prediction. Whereas an
ordinary classifier predicts a label for a single sample without regard to "neighboring"
samples, a CRF can take context into account. Since it can consider context, a CRF can
be used in natural language processing; hence parts-of-speech tagging is also possible.
It predicts the POS using the lexicons as the context.
Theory:
Viterbi decoding is based on dynamic programming. The algorithm takes the emission
and transition matrices as input. The emission matrix gives us information about the
probabilities of a POS tag for a given word, and the transition matrix gives the
probability of transition from one POS tag to another. The algorithm observes the
sequence of words and returns the state sequence of POS tags along with its probability.
Here "s" denotes words and "t" denotes tags. "a" is the transition matrix and "b" is the
emission matrix.
Using the above algorithm, we fill the Viterbi table column by column, as sketched below.
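A minimal sketch of the decoding step is given below; the tagset, the matrices and the
sentence are small made-up values for illustration, not the lab's actual corpus:

import numpy as np

tags = ['noun', 'verb']
start = np.array([0.7, 0.3])             # P(tag | eos), used for the first word
trans = np.array([[0.4, 0.6],            # transitions from 'noun'
                  [0.8, 0.2]])           # transitions from 'verb'
emit = {'they': np.array([0.6, 0.1]),    # P(word | tag) for each tag, toy values
        'fish': np.array([0.3, 0.5])}

sentence = ['they', 'fish']
n, T = len(tags), len(sentence)

viterbi = np.zeros((n, T))               # best path probability ending in tag j at time t
backptr = np.zeros((n, T), dtype=int)    # best previous tag, for backtracking
viterbi[:, 0] = start * emit[sentence[0]]
for t in range(1, T):
    for j in range(n):
        scores = viterbi[:, t - 1] * trans[:, j] * emit[sentence[t]][j]
        viterbi[j, t] = scores.max()
        backptr[j, t] = scores.argmax()

# Backtrack the most probable tag sequence
best = [int(viterbi[:, T - 1].argmax())]
for t in range(T - 1, 0, -1):
    best.insert(0, int(backptr[best[0], t]))
print([tags[i] for i in best])   # ['noun', 'verb'] for these toy numbers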
Objective:
The objective of this experiment is to find POS tags of words in a sentence using
Viterbi decoding.
Procedure
Experiment:
STEP1: Select the corpus.
OUTPUT: The emission and transition matrices will appear.
STEP2: Fill the column with the probability of the possible POS tags given the word
(i.e., form the Viterbi matrix by filling in the column for each observation). Answers
submitted are rounded off to 3 digits after the decimal and are then checked.
STEP4: Repeat steps 2 and 3 until all words of the sentence are covered.
STEP5: At last, check the POS tag for each word obtained from backtracking.
Experiment No : 14
Objective:
The objective of this experiment is to understand the concept of chunking and get
familiar with the basic chunk tagset.
Introduction:
Chunking of text involves dividing a text into syntactically correlated words. For
example, the sentence 'He ate an apple.' can be divided as follows:
Each chunk has an open boundary and close boundary that delimit the word groups as a
minimal non-recursive unit. This can be formally expressed by using IOB prefixes.
Theory:
The basic types of chunks in English are:

Chunk Type        Tag Name
1. Noun           NP
2. Verb           VP
3. Adverb         ADVP
4. Adjectival     ADJP
5. Prepositional  PP
Eg:
'this' 'book' 'in'
((in/IN the/DT big/ADJ room/NN))NP
Verb Chunks
The verb chunks are marked as VP for English; however, they would be of several types
for Indian languages. A verb group will include the main verb and its auxiliaries, if any.
For English:
I (will/MD be/VB loved/VBD)VP
The types of verb chunks and their tags are described below.
VGNF: A non-finite verb chunk will be tagged as VGNF. (Example glosses: 'apple'
'eating' 'PROG' 'boy'; 'go' 'PROG' 'is'.)
VGNN: A gerund verb chunk will be tagged as VGNN.
ADJP: Adjectival Chunk
Eg: The fruit is (ripe/JJ)ADJP
Note: Adjectives appearing before a noun will be grouped together within the
noun chunk.
RBP/ADVP: Adverb Chunk
PP: Prepositional Chunk
Eg: (with/IN)PP a pen
IOB prefixes
Each chunk has an open boundary and close boundary that delimit the word groups as
a minimal non-recursive unit. This can be formally expressed by using IOB prefixes:
B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the
chunk. Here is an example of the file
format:
He PRP B-NP
ate VBD B-VP
an DT B-NP
apple NN I-NP
to TO B-VP
satiate VB I-VP
his PRP$ B-NP
hunger NN I-NP
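As a small sketch, NLTK's conlltags2tree helper can turn exactly this IOB-tagged format
into a chunk tree (assuming NLTK is installed; the triples below copy the example above):

from nltk.chunk import conlltags2tree

conll_tags = [('He', 'PRP', 'B-NP'), ('ate', 'VBD', 'B-VP'),
              ('an', 'DT', 'B-NP'), ('apple', 'NN', 'I-NP'),
              ('to', 'TO', 'B-VP'), ('satiate', 'VB', 'I-VP'),
              ('his', 'PRP$', 'B-NP'), ('hunger', 'NN', 'I-NP')]

tree = conlltags2tree(conll_tags)
print(tree)
# roughly: (S (NP He/PRP) (VP ate/VBD) (NP an/DT apple/NN)
#             (VP to/TO satiate/VB) (NP his/PRP$ hunger/NN))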
Procedure:
STEP1: Select a language
STEP2: Select a sentence
STEP3: Select the corresponding chunk-tag for each
word in the sentence and click the Submit button.
Experiment:
Experiment No : 15
Introduction:
Millions of web pages and websites exist on the Internet today. Going through this vast
amount of content to extract information on a certain topic becomes very difficult.
Google will filter the search results and give you the top ten results, but often you are
still unable to find the right content that you need. There is a lot of redundant and
overlapping data in the articles, which leads to a lot of wasted time. A better way to
deal with this problem is to summarize the large amount of available text data into a
shorter form.
Text Summarization
Text summarization is an NLP technique that extracts text from a large amount of data.
It helps in creating a shorter version of the large text available.
It is important because :
Reduces reading time
Helps in better research work
Increases the amount of information that can fit in an area
There are two approaches for text summarization: NLP based techniques and deep
learning techniques.
In this article, we will go through an NLP based technique which will make use of the
NLTK library.
Obtain Data
Text Preprocessing
Convert paragraphs to sentences
Tokenizing the sentences
Find weighted frequency of occurrence
Replace words by weighted frequency in sentences
Sort sentences in descending order of weights
Summarizing the Article.
Obtain Data for Summarization
If you wish to summarize a Wikipedia Article, obtain the URL for the article that you
wish to summarize. We will obtain data from the URL using the concept of Web
scraping. Now, to use web scraping you will need to install the beautifulsoup library
in Python. This library will be used to fetch the data on the web page within the
various HTML tags.
Code:
import bs4 as bs
import urllib.request
import re
import heapq
import nltk
# nltk.download('punkt'); nltk.download('stopwords')  # needed on the first run

# Fetch the Wikipedia article and extract the text of its paragraphs
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language')
article = scraped_data.read()
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')

article_text = ""
for p in paragraphs:
    article_text += p.text

# Removing square brackets (citation markers) and extra spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

# Letters-only copy used for the word frequencies (assumed pre-processing step;
# the original sentences in article_text are kept for the summary itself)
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

sentence_list = nltk.sent_tokenize(article_text)
stopwords = nltk.corpus.stopwords.words('english')

# Weighted frequency of occurrence for every non-stopword
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

# Score each sentence shorter than 30 words by the weighted frequencies of its words
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

# Pick the 7 highest-scoring sentences as the summary
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = '\n'.join(summary_sentences)
print(summary)
Summary
The sentence_scores dictionary consists of the sentences along with their scores. Now,
top N sentences can be used to form the summary of the article.
Here the heapq library has been used to pick the top 7 sentences to summarize the
article.
Output :