NLP Lab File

Natural language processing (NLP) is the intersection of computer science, linguistics, and machine learning that focuses on communication between computers and humans in natural language. NLP has benefited from advances in machine learning and deep learning. It is divided into speech recognition, natural language understanding, and natural language generation. Understanding human language is difficult due to its complexity, ambiguity, and infinite ways words can be arranged. Syntax analyzes grammar structure while semantics analyzes meaning. Grammars and parsing are important for describing syntactic structure in languages.


Experiment No: 1

Introduction to Natural Language Processing

I. Introduction

Natural language processing (NLP) is the intersection of computer science, linguistics and machine learning. The field focuses on communication between computers and humans in natural language; NLP is all about making computers understand and generate human language. Applications of NLP techniques include voice assistants like Amazon's Alexa and Apple's Siri, but also things like machine translation and text filtering.

NLP has heavily benefited from recent advances in machine learning, especially from deep learning techniques. The field is divided into three parts:

Speech Recognition — The translation of spoken language into text.
Natural Language Understanding — The computer's ability to understand what we say.
Natural Language Generation — The generation of natural language by a computer.

Fig 1.1 : NLP Overview

II. Why is NLP Difficult?

Human language is special for several reasons. It is specifically constructed to convey the speaker/writer's meaning. It is a complex system, although little children can learn it pretty quickly.

Another remarkable thing about human language is that it is all about symbols. According to Chris Manning, a machine learning professor at Stanford, it is a discrete, symbolic, categorical signaling system. This means we can convey the same meaning in different ways (e.g., speech, gesture, signs, etc.). The encoding by the human brain is a continuous pattern of activation by which the symbols are transmitted via continuous signals of sound and vision.

Understanding human language is considered a difficult task due to its complexity. For
example, there is an infinite number of different ways to arrange words in a sentence.
Also, words can have several meanings and contextual information is necessary to
correctly interpret sentences. Every language is more or less unique and ambiguous.
Just take a look at the following newspaper headline "The Pope’s baby steps on gays."
This sentence clearly has two very different interpretations, which is a pretty good
example of the challenges in NLP.

Fig 1.2 : NLP Understanding

Fig 1.3 : Evolution of NLP

Natural language processing (NLP) describes the interaction between human language and computers. It's a technology that many people use daily and has been around for years, but it is often taken for granted.
A few examples of NLP that people use every day are:
• Spell check
• Autocomplete
• Voice text messaging
• Spam filters
• Related keywords on search engines
• Siri, Alexa, or Google Assistant
In any case, the computer is able to identify the appropriate word, phrase, or response
by using context clues, the same way that any human would. Conceptually, it’s a fairly
straightforward technology.
Where NLP outperforms humans is in the amount of language and data it’s able to
process. Therefore, its potential uses go beyond the examples above and make possible
tasks that would’ve otherwise taken employees months or years to accomplish.

Why Should Businesses Use Natural Language Processing?


Human interaction is the driving force of most businesses. Whether it’s a brick-and-
mortar store with inventory or a large SaaS brand with hundreds of employees,
customers and companies need to communicate before, during, and after a sale.
That means that there are countless opportunities for NLP to step in and improve how
a company operates. This is especially true of large businesses that want to keep track
of, facilitate, and analyze thousands of customer interactions in order to improve their
product or service.
It would be nearly impossible for employees to log and interpret all that data on their
own, but technologies integrated with NLP can help do it all and more.

III. Syntactic and Semantics Analysis
Syntactic analysis (syntax) and semantic analysis (semantics) are the two primary techniques that lead to the understanding of natural language. Language is a set of valid sentences, but what makes a sentence valid? Syntax and semantics.

Syntax is the grammatical structure of the text, whereas semantics is the meaning being
conveyed. A sentence that is syntactically correct, however, is not always semantically
correct. For example, “cows flow supremely” is grammatically valid (subject — verb —
 adverb) but it doesn't make any sense.

Syntactic Analysis
Syntactic analysis, also referred to as syntax analysis or parsing, is the process of analyzing natural language with the rules of a formal grammar. Grammatical rules are applied to categories and groups of words, not individual words. Syntactic analysis basically assigns a syntactic structure to text.

For example, a sentence includes a subject and a predicate where the subject is a noun
phrase and the predicate is a verb phrase. Take a look at the following sentence: “The
dog (noun phrase) went away (verb phrase).” Note how we can combine every noun
phrase with a verb phrase. Again, it's important to reiterate that a sentence can be
syntactically correct but not make sense.

Fig 1.4 : Syntactic Analysis

Semantics Analysis
The way we understand what someone has said is an unconscious process relying on
our intuition and knowledge about language itself. In other words, the way we
understand language is heavily based on meaning and context. Computers need a
different approach, however. The word “semantic” is a linguistic term and means
"related to meaning or logic."

Semantic analysis is the process of understanding the meaning and interpretation of words, signs and sentence structure. This lets computers partly understand natural language the way humans do. I say partly because semantic analysis is one of the toughest parts of NLP and it's not fully solved yet.

Fig 1.5 : Semantics Analysis

Experiment No: 2

Introduction to Grammars, Parsing and PoS tags

I. Grammars

Grammar is essential and important for describing the syntactic structure of well-formed programs. In the literary sense, grammars denote syntactical rules for conversation in natural languages. Linguists have attempted to define grammars since the inception of natural languages like English, Hindi, etc.

The theory of formal languages is also applicable in the fields of Computer Science
mainly in programming languages and data structure. For example, in ‘C’ language, the
precise grammar rules state how functions are made from lists and statements.

A mathematical model of grammar was given by Noam Chomsky in 1956, which is effective for writing computer languages.

Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P) where −

N or VN = set of non-terminal symbols, i.e., variables.

T or ∑ = set of terminal symbols.

S = Start symbol where S ∈ N

P denotes the Production rules for terminals as well as non-terminals. It has the form α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.

Fig 2.1 : Grammar
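As an illustration (not part of the original lab sheet), the four components N, T, S and P can be written down directly with NLTK's CFG class; the toy rules below are assumptions chosen only to show how the 4-tuple maps to code.

import nltk

# A toy grammar G = (N, T, S, P); the rules here are illustrative assumptions.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VB NP
DT -> 'the'
NN -> 'dog' | 'ball'
VB -> 'chased'
""")

print(grammar.start())        # S: the start symbol
print(grammar.productions())  # P: the production rules
# N (non-terminals) = {S, NP, VP, DT, NN, VB}; T (terminals) = {'the', 'dog', 'ball', 'chased'}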

Context-Free Grammar (CFG)

A context-free grammar, represented in short as CFG, is a notation used for describing languages and is a superset of Regular grammar, as you can see from the following diagram:

Fig 2.2 : Context Free Grammar

CFG consists of a finite set of grammar rules having the following four components

Set of Non-Terminals
Set of Terminals
Set of Productions
Start Symbol

Set of Non-terminals

It is represented by V. The non-terminals are syntactic variables that denote sets of strings, which help in defining the language that is generated by the grammar.

Set of Terminals

They are also known as tokens and are represented by Σ. Strings are formed from the basic symbols of terminals.

Set of Productions

It is represented by P. The set gives an idea about how the terminals and nonterminals
can be combined. Every production consists of the following components:

Non-terminals,
Arrow,
Terminals (the sequence of terminals).
The left side of production is called non-terminals while the right side of production is
called terminals.

Start Symbol

The production begins from the start symbol. It is represented by the symbol S. The start symbol is always one of the non-terminal symbols.

Constituency Grammar (CG)

It is also known as Phrase structure grammar. It is called Constituency Grammar as it is based on the constituency relation. It is the opposite of dependency grammar.

Before diving deep into the discussion of CG, let's see some fundamental points about constituency grammar and the constituency relation.

All the related frameworks view the sentence structure in terms of constituency relation.
To derive the constituency relation, we take the help of subject-predicate division of
Latin as well as Greek grammar.
Here we study the clause structure in terms of noun phrase NP and verb phrase VP.

For example, if we have the following constituents:

<subject> The horses / The dogs / They
<context> are running / are barking / are eating
<object> in the park / happily / since the morning

Example sentences that can be generated with the help of the above constituents are:

"The dogs are barking in the park"
"They are eating happily"
"The horses are running since the morning"
Now, let's look at another view of constituency grammar, which defines the grammar in terms of part of speech tags.

Say a grammar structure containing
[determiner, noun] [adjective, verb] [preposition, determiner, noun]
corresponds to the same sentence – "The dogs are barking in the park".

Another view (using part of speech tags):
<DT NN> <JJ VB> <PRP DT NN>
The dogs are barking in the park

II. Parsing
Simply speaking, parsing in NLP is the process of determining the syntactic structure
of a text by analyzing its constituent words based on an underlying grammar (of the
language).

See this example grammar below, where each line indicates a rule of the grammar to be
applied to an example sentence “Tom ate an apple”.
Example Grammar

Fig 2.3 : Grammar

Then, the outcome of the parsing process would be a parse tree like the following, where
sentence is the root, intermediate nodes such as noun-phrase, verb-phrase etc. have
children - hence they are called non-terminals and finally, the leaves of the tree ‘Tom’,
‘ate’, ‘an’, ‘apple’ are called terminals.

Parse Tree

Fig 2.4 : Parse Tree
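A minimal sketch of how such a parse tree can be produced with NLTK's chart parser is shown below; the grammar rules are an assumed reconstruction of the example grammar for "Tom ate an apple", not the exact rule set of Fig 2.3.

import nltk

# Assumed grammar for the example sentence "Tom ate an apple"
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'Tom' | DT N
VP -> V NP
DT -> 'an'
N -> 'apple'
V -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Tom ate an apple".split()):
    tree.pretty_print()   # prints the tree: S at the root, the words as terminal leaves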

Existing parsing approaches are basically statistical, probabilistic, and machine
learning-based. Some notable tools to use for parsing are: Stanford parser (The Stanford
Natural Language Processing Group), OpenNLP (Apache OpenNLP Developer
Documentation) etc

III. PoS Tagging

Part-of-speech (POS) tagging is a popular Natural Language Processing process which refers to categorizing words in a text (corpus) in correspondence with a particular part of speech, depending on the definition of the word and its context.


Fig 2.5 : POS Tagging

Now, if we talk about Part-of-Speech (PoS) tagging, then it may be defined as the
process of assigning one of the parts of speech to the given word. It is generally called
POS tagging. In simple words, we can say that POS tagging is a task of labelling each
word in a sentence with its appropriate part of speech. We already know that parts of
speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories.

Most POS tagging approaches fall under Rule-based POS tagging, Stochastic POS tagging and Transformation-based tagging.
Rule-based POS Tagging

One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to get the possible tags for each word. If a word has more than one possible tag, then rule-based taggers use hand-written rules to identify the correct tag.

We can also understand Rule-based POS tagging by its two-stage architecture −

First stage − In the first stage, it uses a dictionary to assign each word a list of potential
parts-of-speech.

Second stage − In the second stage, it uses large lists of hand-written disambiguation
rules to sort down the list to a single part-of-speech for each word.

Fig 2.6 : Rule Based POS Tagging

Stochastic POS Tagging
Another technique of tagging is Stochastic POS Tagging. Now, the question that arises
here is which model can be stochastic. The model that includes frequency or probability
(statistics) can be called stochastic. Any number of different approaches to the problem
of part-of-speech tagging can be referred to as stochastic tagger.

The simplest stochastic tagger applies the following approaches for POS tagging –

Word Frequency Approach

In this approach, the stochastic taggers disambiguate the words based on the probability
that a word occurs with a particular tag. We can also say that the tag encountered most
frequently with the word in the training set is the one assigned to an ambiguous instance
of that word. The main issue with this approach is that it may yield inadmissible
sequence of tags.

Tag Sequence Probabilities

It is another approach of stochastic tagging, where the tagger calculates the probability
of a given sequence of tags occurring. It is also called n-gram approach. It is called so
because the best tag for a given word is determined by the probability at which it occurs
with the n previous tags.

Fig 2.7 : Stochastic POS Tagging

Transformation-based Tagging
Transformation-based tagging is also called Brill tagging. It is an instance of transformation-based learning (TBL), which is a rule-based algorithm for automatic tagging of POS to the given text. TBL allows us to have linguistic knowledge in a readable form; it transforms one state to another state by using transformation rules.

It draws inspiration from both of the previously explained taggers − rule-based and stochastic. If we see a similarity between the rule-based and transformation tagger, then, like rule-based, it is also based on rules that specify what tags need to be assigned to what words. On the other hand, if we see a similarity between the stochastic and transformation tagger, then, like stochastic, it is a machine learning technique in which rules are automatically induced from data.

Working of Transformation Based Learning (TBL)

In order to understand the working and concept of transformation-based taggers, we need to understand the working of transformation-based learning. Consider the following steps to understand the working of TBL −

Start with the solution − TBL usually starts with some solution to the problem and works in cycles.

Most beneficial transformation chosen − In each cycle, TBL will choose the most beneficial transformation.

Apply to the problem − The transformation chosen in the last step will be applied to the problem.

The algorithm stops when the transformation selected in step 2 no longer adds value or there are no more transformations to be selected. Such kind of learning is best suited for classification tasks.

POS tagging with Hidden Markov Model

HMM (Hidden Markov Model) is a Stochastic technique for POS tagging. Hidden
Markov models are known for their applications to reinforcement learning and temporal
pattern recognition such as speech, handwriting, gesture recognition, musical score
following, partial discharges, and bioinformatics.

Let us consider an example proposed by Dr. Luis Serrano and find out how HMM selects an appropriate tag sequence for a sentence.

Fig 2.8 : Transition Probability

An HMM model may be defined as the doubly-embedded stochastic model, where the
underlying stochastic process is hidden. This hidden stochastic process can only be
observed through another set of stochastic processes that produces the sequence of
observations.

Architecture diagram of POS tagging. Source: Devopedia 2019.

A POS tagger takes in a phrase or sentence and assigns the most probable part-of-speech
tag to each word. In practice, input is often pre-processed. One common pre-processing
task is to tokenize the input so that the tagger sees a sequence of words and punctuations.
Other tasks such as stop word removals, punctuation removals and lemmatization may
be done before tagging.
The set of predefined tags is called the tagset. This is essential information that the
tagger must be given. Example tags are NNS for a plural noun, VBD for a past tense
verb, or JJ for an adjective. A tagset can also include punctuations.
Rather than design our own tagset, the common practice is to use well-known tagsets:
87-tag Brown tagset, 45-tag Penn Treebank tagset, 61-tag C5 tagset, or 146-tag C7
tagset. In the architecture diagram, we have shown the 45-tag Penn Treebank tagset.
Sketch Engine is a place to download tagsets.

Experiment No : 3

Introduction to NLTK

NLTK stands for Natural Language Toolkit. It is a powerful, leading platform for building Python programs to work with human language data; it consists of several packages that help machines understand human language and reply to it with an appropriate response.

NLTK (Natural Language Toolkit) is the go-to API for NLP (Natural Language Processing) with Python. It is a really powerful tool for preprocessing text data for further analysis, for instance with ML models. It helps convert text into numbers, which a model can then easily work with. This is the first part of a basic introduction to NLTK for getting our feet wet and assumes some basic knowledge of Python.

It helps practitioners by providing easy-to-use interfaces to over 50 corpora and lexical resources, along with text processing libraries for classification, tokenization, tagging, stemming, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries. The corpora consist of data from various applications over the internet, which we can use for text analytics. By analyzing tweets on Twitter, we can find trending news and people's reactions to a particular event. Amazon can understand user feedback or reviews on a specific product. BookMyShow can discover people's reviews about a movie, which can be either positive or negative.

Getting started with NLTK


Download NLTK – Install the NLTK package (e.g., pip install nltk) and download its data packages with nltk.download().
Import NLTK
You can import NLTK directly using import.
import nltk

Importing dataset from corpus
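For example (a small sketch, using the Gutenberg corpus bundled with NLTK as an assumed choice), a corpus can be imported and inspected like this:

import nltk
nltk.download('gutenberg')                  # fetch the corpus data (needed only once)
from nltk.corpus import gutenberg

print(gutenberg.fileids())                  # the text files available in the corpus
emma = gutenberg.words('austen-emma.txt')   # one text as a list of word tokens
print(len(emma), emma[:10])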
Some simple operations with NLTK
• Tokenizing
• Filtering Stop Words
• Stemming
• Tagging Parts of Speech

Tokenizing
By tokenizing, you can conveniently split up text by word or by sentence. This will
allow you to work with smaller pieces of text that are still relatively coherent and
meaningful even outside of the context of the rest of the text. It’s your first step in
turning unstructured data into structured data, which is easier to analyze.

When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence.
Here’s what both types of tokenization bring to the table:

Tokenizing by word: Words are like the atoms of natural language. They’re the
smallest unit of meaning that still makes sense on its own. Tokenizing your text by word
allows you to identify words that come up particularly often. For example, if you were
analyzing a group of job ads, then you might find that the word “Python” comes up
often. That could suggest high demand for Python knowledge, but you’d need to look
deeper to know more.

Tokenizing by sentence: When you tokenize by sentence, you can analyze how those
words relate to one another and see more context. Are there a lot of negative words
around the word “Python” because the hiring manager doesn’t like Python? Are there
more terms from the domain of herpetology than the domain of software development,
suggesting that you may be dealing with an entirely different kind of python than you
were expecting?

Here’s how to import the relevant parts of NLTK so you can tokenize by word and by
sentence:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
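A short usage sketch follows (the example sentence is made up; the 'punkt' tokenizer data must have been downloaded with nltk.download('punkt')):

>>> example = "Python is great for NLP. It has many useful libraries!"
>>> sent_tokenize(example)
['Python is great for NLP.', 'It has many useful libraries!']
>>> word_tokenize(example)
['Python', 'is', 'great', 'for', 'NLP', '.', 'It', 'has', 'many', 'useful', 'libraries', '!']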

Filtering Stop Words

Stop words are words that you want to ignore, so you filter them out of your text when
you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop
words since they don’t add a lot of meaning to a text in and of themselves.

Here’s how to import the relevant parts of NLTK in order to filter out stop words:

>>> nltk.download("stopwords")
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize

Stemming

Stemming is a text processing task in which you reduce words to their root, which is
the core part of a word. For example, the words “helping” and “helper” share the root
“help.” Stemming allows you to zero in on the basic meaning of a word rather than all
the details of how it’s being used. NLTK has more than one stemmer, but you’ll be
using the Porter stemmer.
Here’s how to import the relevant parts of NLTK in order to start stemming:

>>> from nltk.stem import PorterStemmer
>>> from nltk.tokenize import word_tokenize
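A brief usage sketch of the Porter stemmer (the example words are made up for illustration):

>>> stemmer = PorterStemmer()
>>> for w in word_tokenize("helping helpers helped happily"):
...     print(w, "->", stemmer.stem(w))
helping -> help
helpers -> helper
helped -> help
happily -> happili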

Tagging Parts of Speech

Part of speech is a grammatical term that deals with the roles words play when you
use them together in sentences. Tagging parts of speech, or POS tagging, is the task of
labeling the words in your text according to their part of speech.

In English, there are eight parts of speech:

Part of speech   Role                                                               Examples
Noun             Is a person, place, or thing                                       mountain, bagel, Poland
Pronoun          Replaces a noun                                                    you, she, we
Adjective        Gives information about what a noun is like                        efficient, windy, colorful
Verb             Is an action or a state of being                                   learn, is, go
Adverb           Gives information about a verb, an adjective, or another adverb    efficiently, always, very
Preposition      Gives information about how a noun or pronoun is connected to another word   from, about, at
Conjunction      Connects two other words or phrases                                so, because, and
Interjection     Is an exclamation                                                  yay, ow, wow

Some sources also include the category articles (like “a” or “the”) in the list of parts of
speech, but other sources consider them to be adjectives. NLTK uses the word
determiner to refer to articles.

Experiment No : 4

Write a Python Program to remove “stopwords” from a given text and generate word tokens and filtered text.

Objective:
The process of converting data to something a computer can understand is referred to
as pre-processing. One of the major forms of pre-processing is to filter out useless data.
In natural language processing, useless words (data), are referred to as stop words.

Introduction:
Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that
a search engine has been programmed to ignore, both when indexing entries for
searching and when retrieving them as the result of a search query.

When to remove stop words?

If we have a task of text classification or sentiment analysis, then we should remove stop words, as they do not provide any information to our model; i.e., we keep unwanted words out of our corpus. But if we have a task of language translation, then stop words are useful, as they have to be translated along with the other words.
There is no hard and fast rule on when to remove stop words. But I would suggest removing stop words if the task to be performed is one of Language Classification, Spam Filtering, Caption Generation, Auto-Tag Generation, Sentiment Analysis, or something that is related to text classification.

Code:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the NLTK data packages 'stopwords' and 'punkt'
# (run nltk.download('stopwords') and nltk.download('punkt') once beforehand).

example_sent = """This is a sample sentence,
showing off the stop words filtration."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# Case-insensitive filtering with a list comprehension
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]

# Equivalent filtering with an explicit loop (case-sensitive)
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print("Word Tokens : ", word_tokens)
print("Filtered Sentence : ", filtered_sentence)

Output :

Experiment No : 5

Write a Python Program to generate “tokens” and assign “PoS tags” for a
given text using NLTK package.

Objective:
Generate “Tokens” and assign “PoS” tags for a given text using NLTK package

Introduction:
It is a process of converting a sentence to forms – a list of words, or a list of tuples (where each tuple has the form (word, tag)). The tag in this case is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on.

Default tagging is a basic step for part-of-speech tagging. It is performed using the DefaultTagger class. The DefaultTagger class takes 'tag' as a single argument. NN is the tag for a singular noun. DefaultTagger is most useful when it gets to work with the most common part-of-speech tag; that's why a noun tag is recommended.

Code :

from nltk import pos_tag
from nltk import RegexpParser

# Requires the NLTK data package 'averaged_perceptron_tagger'.
text = "Learn Python for a better future in Machine Learning and AI".split()
print("After Split:", text)

tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)

patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)

output = chunker.parse(tokens_tag)
print("After Chunking", output)

Output :

After Split: ['Learn', 'Python', 'for', 'a', 'better', 'future', 'in', 'Machine', 'Learning', 'and', 'AI']
After Token: [('Learn', 'NNP'), ('Python', 'NNP'), ('for', 'IN'), ('a', 'DT'), ('better', 'JJR'), ('future', 'NN'), ('in', 'IN'), ('Machine', 'NNP'), ('Learning', 'NNP'), ('and', 'CC'), ('AI', 'NNP')]
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
<ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
(mychunk Learn/NNP Python/NNP)
for/IN
a/DT
(mychunk better/JJR)
(mychunk future/NN)
in/IN
(mychunk Machine/NNP Learning/NNP and/CC)
(mychunk AI/NNP))

Experiment No : 6

Write a Python Program to generate a “wordcloud” with maximum words used = 100, in different shapes, and save it as a .png file for a given txt file.

What is a WordCloud?
A WorldCloud /Word Cloud (also known as a tag cloud or word art) is a simple
visualisation of data, in which words are shown in varying sizes depending on how often
they appear in your text/data.

There are many free word cloud generators online that can help you perform text
analysis, and spot trends and patterns at a glance. Python is not the only tool capable of
creating such visuals. So if you need to make a word cloud visualisation quickly and
you are not working with your data in Python, then this tutorial is not for you.

WordCloud install
In order to create a WordCloud visualisation in Python you will need to install the below packages:

numpy
pandas
matplotlib
os
pillow
wordcloud
First four packages are data analytics staples, so don't require an introduction.

The pillow library is a package that enables image reading. You can find a tutorial for
pillow here. Pillow is a wrapper for PIL - Python Imaging Library. You will need this
library to read in an image as the mask for the WordCloud.

The wordcloud library is the one responsible for creating WordClouds. It can be a little tricky to install. If you only need it for plotting a basic WordCloud, then running one of the commands below would be sufficient.

pip install wordcloud

or conda install -c conda-forge wordcloud for Anaconda-Navigator.
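Before the full program below, here is a minimal sketch that matches the stated objective (a word cloud from a given .txt file, limited to 100 words, shaped by a mask image, and saved as a .png). The file names input.txt and cloud_shape.png are placeholders, not files from the lab.

import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

text = open("input.txt", encoding="utf-8").read()   # placeholder text file
mask = np.array(Image.open("cloud_shape.png"))      # the mask image defines the cloud's shape

wc = WordCloud(background_color="white",
               max_words=100,                       # limit the cloud to 100 words
               mask=mask,
               stopwords=set(STOPWORDS))
wc.generate(text)
wc.to_file("wordcloud.png")                         # save the visualisation as a .png file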

Code :

from textblob import TextBlob
import sys
import tweepy
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
import nltk
import pycountry
import re
import string
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from langdetect import detect
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Twitter API credentials (placeholders - replace with your own keys)
consumerKey = "Secret"
consumerSecret = "Secret"
accessToken = "Secret"
accessTokenSecret = "Secret"

auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, accessTokenSecret)
api = tweepy.API(auth)

nltk.download('vader_lexicon')

def percentage(part, whole):
    return 100 * float(part) / float(whole)

keyword = input("Please enter keyword or hashtag to search: ")
noOfTweet = int(input("Please enter how many tweets to analyze: "))
tweets = tweepy.Cursor(api.search, q=keyword).items(noOfTweet)

positive = 0
negative = 0
neutral = 0
polarity = 0
tweet_list = []
neutral_list = []
negative_list = []
positive_list = []

# Classify each tweet as positive, negative or neutral using VADER scores
for tweet in tweets:
    print(tweet.text)
    tweet_list.append(tweet.text)
    analysis = TextBlob(tweet.text)
    score = SentimentIntensityAnalyzer().polarity_scores(tweet.text)
    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    comp = score['compound']
    polarity += analysis.sentiment.polarity
    if neg > pos:
        negative_list.append(tweet.text)
        negative += 1
    elif pos > neg:
        positive += 1
    elif pos == neg:
        neutral_list.append(tweet.text)
        neutral += 1

positive = percentage(positive, noOfTweet)
negative = percentage(negative, noOfTweet)
neutral = percentage(neutral, noOfTweet)
polarity = percentage(polarity, noOfTweet)
positive = format(positive, '.1f')
negative = format(negative, '.1f')
neutral = format(neutral, '.1f')

# Function to create the WordCloud
from IPython.display import display

def create_wordcloud(text):
    mask = np.array(Image.open(r"C:\Users\saurabh\Desktop\BigDataAnalytics\cloud.png"))
    stopwords = set(STOPWORDS)
    wc = WordCloud(background_color="white",
                   mask=mask,
                   max_words=3000,
                   stopwords=stopwords,
                   repeat=True)
    wc.generate(str(text))
    wc.to_file(r"C:\Users\saurabh\Desktop\BigDataAnalytics\wc.png")
    print("Word Cloud Saved Successfully")
    path = r"C:\Users\saurabh\Desktop\BigDataAnalytics\wc.png"
    display(Image.open(path))

# Collect the tweets into a DataFrame and generate the word cloud from their text
tw_list = pd.DataFrame(tweet_list, columns=["text"])
create_wordcloud(tw_list["text"].values)

Output :

Experiment No : 7
Perform an experiment to learn about morphological features of a word by
analyzing it.

Objective:
The objective of the experiment is to learn about morphological features of a word by
analysing it.

Introduction:
A word can be simple or complex. For example, the word 'cat' is simple because one cannot further decompose the word into smaller parts. On the other hand, the word 'cats' is complex, because the word is made up of two parts: the root 'cat' and the plural suffix '-s'.

Definition
Morphemes are considered the smallest meaningful units of language. These morphemes can either be a root word (play) or an affix (-ed). The combination of these morphemes is called a morphological process. So, the word "played" is made out of 2 morphemes, "play" and "-ed". Finding all parts of a word (its morphemes) and thus describing the properties of a word is called "Morphological Analysis". For example, "played" has the information of the verb "play" and "past tense", so the given word is the past tense form of the verb "play".

Analysis of a word :

बच्चों (bachchoM) = बच्चा (bachchaa) (root) + ओं (oM) (suffix)
(ओं = plural oblique)
A linguistic paradigm is the complete set of variants of a given lexeme. These variants
can be classified according to shared inflectional categories (eg: number, case etc) and
arranged into tables.
Paradigm for बच्चा (bachchaa)

case/num   singular            plural
direct     बच्चा (bachchaa)     बच्चे (bachche)
oblique    बच्चे (bachche)      बच्चों (bachchoM)

Types of Morphology
Morphology is of two types:

1. Inflectional morphology

Deals with word forms of a root, where there is no change in lexical category. For
example, 'played' is an inflection
of the root word 'play'. Here, both 'played' and 'play' are verbs.

2. Derivational morphology

Deals with word forms of a root, where there is a change in the lexical category. For
example, the word form
'happiness' is a derivation of the word 'happy'. Here, 'happiness' is a derived noun form
of the adjective 'happy'.

Morphological Features:

All words will have their lexical category attested during morphological analysis.
A noun and pronoun can take suffixes of the following features: gender, number,
person, case.
For example, morphological analysis of a few words is given below:

Language   input: word   output: analysis
English    boy           rt=boy, cat=n, gen=m, num=sg
English    boys          rt=boy, cat=n, gen=m, num=pl

A verb can take suffixes of the following features: tense, aspect, modality, gender, number, person.

Language   input: word   output: analysis
English    toys          rt=toy, cat=n, num=pl, per=3

'rt' stands for root. 'cat' stands for lexical category. The value of lexical category can be noun, verb, adjective, pronoun, adverb, or preposition. 'gen' stands for gender. The value of gender can be masculine or feminine.

'num' stands for number. The value of number can be singular (sg) or plural (pl). 'per'
stands for person. The value of person can be 1, 2 or 3

The value of tense can be present, past or future. This feature is applicable for verbs.

The value of aspect can be perfect (pft), continuous (cont) or habitual (hab). This feature is applicable for verbs.

'case' can be direct or oblique. This feature is applicable for nouns. A case is an oblique case when a postposition occurs after the noun. If no postposition can occur after the noun, then the case is a direct case. This is applicable for Hindi but not English, as English doesn't have any postpositions. Some of the postpositions in Hindi are: का (kaa), की (kii), के (ke), को (ko), में (meM).

Procedure
STEP1: Select the language.
OUTPUT: Drop down for selecting words will appear.

STEP2: Select the word.


OUTPUT: Drop down for selecting features will appear.

STEP3: Select the features.

STEP4: Click "Check" button to check your answer.


OUTPUT: Right features are marked by tick and wrong features are marked by cross.

Experiment :

Eg-1

Select a Language which you know better: English

Select a word from the below dropbox and do a morphological analysis on that word.

Select the correct morphological analysis for the above word using the dropboxes (NOTE: na = not applicable)

WORD: train
ROOT: train
CATEGORY: verb
GENDER: na
NUMBER: singular
PERSON: third
CASE: na
TENSE: simple-future

Check
Right answer!!!

Eg - 2
STEP1: Select the language.
OUTPUT: Drop downs for selecting root and other features will appear.

STEP2: Select the root and other features.

STEP3: After selecting all the features, select the word corresponding to the features selected above.

STEP4: Click the check button to see whether right word is selected or not.

OUTPUT: Output tells whether the word selected is right or wrong.

Experiment No : 8

Perform an experiment to generate word forms from root and suffix information.

Objective:
The objective of the experiment is to generate word forms from root and suffix
information.

Introduction:
A word can be simple or complex. For example, the word 'cat' is simple because one cannot further decompose the word into smaller parts. On the other hand, the word 'cats' is complex, because the word is made up of two parts: the root 'cat' and the plural suffix '-s'.

Theory:

Given the root and suffix information, a word can be generated. For example,

Language   input: analysis                                          output: word
Hindi      rt=लड़का (ladakaa), cat=n, gen=m, num=sg, case=obl        लड़के (ladake)
Hindi      rt=लड़का (ladakaa), cat=n, gen=m, num=pl, case=dir        लड़के (ladake)
English    rt=boy, cat=n, num=pl                                    boys
English    rt=play, cat=v, num=sg, per=3, tense=pr                  plays

- Morphological analysis and generation: inverse processes.
- Analysis may involve non-determinism, since more than one analysis is possible.
- Generation is a deterministic process. In case a language allows spelling variation, then to that extent, generation would also involve non-determinism.
Procedure:

STEP1: Select the language.


OUTPUT: Drop downs for selecting root and other features will appear.
STEP2: Select the root and other features.
STEP3: After selecting all the features, select the word corresponding to the features selected above.
STEP4: Click the check button to see whether the right word is selected or not.
OUTPUT: Output tells whether the word selected is right or wrong.

Eg-1

Step 1: We have selected “English” as language.

STEP2: Select the word.


OUTPUT: Drop down for selecting features will appear.

We have selected “playing”.

STEP3: Select the features.

STEP4: Click "Check" button to check your answer.


OUTPUT: Right features are marked by tick and wrong features are marked by cross.

Experiment No : 9

Perform an experiment to understand the morphology of a word by the use of an Add-Delete table.

Objective:
Understanding the morphology of a word by the use of Add-Delete table

Introduction:
Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e., morphemes. A morpheme is the smallest meaningful linguistic unit. For example:
• बच्चों (bachchoM) consists of two morphemes: बच्चा (bachchaa) has the information of the root word noun "बच्चा" (bachchaa), and ओं (oM) has the information of plural and oblique case.
• played has two morphemes, play and -ed, having the information of the verb "play" and "past tense", so the given word is the past tense form of the verb "play".

Words can be analysed morphologically if we know all variants of a given root word.
We can use an 'Add-Delete' table for this analysis.

Theory:

Morph Analyser

Definition

Morphemes are considered the smallest meaningful units of language. These morphemes can either be a root word (play) or an affix (-ed). The combination of these morphemes is called a morphological process. So, the word "played" is made out of 2 morphemes, "play" and "-ed". Finding all parts of a word (its morphemes) and thus describing the properties of a word is called "Morphological Analysis". For example, "played" has the information of the verb "play" and "past tense", so the given word is the past tense form of the verb "play".

Analysis of a word:

बच्चों (bachchoM) = बच्चा (bachchaa) (root) + ओं (oM) (suffix)
(ओं = plural oblique)

A linguistic paradigm is the complete set of variants of a given lexeme. These variants
can be classified according to shared inflectional categories (eg: number, case etc) and
arranged into tables.

Paradigm for बच्चा (bachchaa)

case/num   singular            plural
direct     बच्चा (bachchaa)     बच्चे (bachche)
oblique    बच्चे (bachche)      बच्चों (bachchoM)

Algorithm to get बच्चों (bachchoM) from बच्चा (bachchaa):

1. Take the root बच्चा = बच्च (bachch) + आ (aa)
2. Delete आ (aa)
3. Output बच्च (bachch)
4. Add ओं (oM) to the output
5. Return बच्चों (bachchoM)

Therefore आ (aa) is deleted and ओं (oM) is added to get बच्चों (bachchoM).

Add-Delete table for बच्चा (bachchaa)

Paradigm Class:

Words in the same paradigm class behave similarly. For example, if a word is in the same paradigm class as बच्चा (bachchaa), its forms would behave similarly to the forms of बच्चा, as they share the same paradigm class.
Objective:
Understanding the morphology of a word by the use of Add-Delete table
Procedure:
STEP1: Select a word root.
STEP2: Fill the add-delete table and submit.
STEP3: If wrong, see the correct answer or repeat STEP1.
Experiment:
STEP1: Select a word root.

STEP2: Fill the add-delete table and submit.

STEP3: If wrong, see the correct answer or repeat STEP1.

Experiment No : 10
Perform an experiment to learn to calculate bigrams from a given corpus
and calculate probability of a sentence.

Objective:
The objective of this experiment is to learn to calculate bigrams from a given
corpus and calculate probability of a sentence.

Introduction:
Probability of a sentence can be calculated from the probability of the sequence of words occurring in it. We can use the Markov assumption, that the probability of a word in a sentence depends on the probability of the word occurring just before it. Such a model is called a first order Markov model or the bigram model.

Here, Wn refers to the word token corresponding to the nth word in a sequence.

Theory:

A combination of words forms a sentence. However, such a formation is meaningful only when the words are arranged in some order.

Eg: Sit I car in the

Such a sentence is not grammatically acceptable. However some perfectly grammatical sentences can be nonsensical too!

Eg: Colorless green ideas sleep furiously

One easy way to handle such unacceptable sentences is by assigning probabilities to the
strings of words i.e, how likely the sentence is.

Probability of a sentence

If we consider each word occurring in its correct location as an independent event, the probability of the sentence is:
P(w(1), w(2), ..., w(n-1), w(n))

Using the chain rule:
= P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1)w(2)) ... P(w(n) | w(1)w(2)...w(n-1))

Bigrams
We can avoid this very long calculation by approximating that the probability of a given word depends only on the probability of its previous word. This assumption is called the Markov assumption and such a model is called a Markov model – bigrams. Bigrams can be generalized to the n-gram, which looks at (n-1) words in the past. A bigram is a first-order Markov model.
Therefore,

P(w(1), w(2), ..., w(n-1), w(n)) = P(w(2)|w(1)) P(w(3)|w(2)) .... P(w(n)|w(n-1))

We use (eos) tag to mark the beginning and end of a sentence.

A bigram table for a given corpus can be generated and used as a lookup table for
calculating probability of sentences.

Eg: Corpus – (eos) You book a flight (eos) I read a book (eos) You read (eos)
Bigram Table:

         (eos)   you    book   a      flight   I      read
(eos)    0       0.33   0      0      0        0.25   0
you      0       0      0.5    0      0        0      0.5
book     0.5     0      0      0.5    0        0      0
a        0       0      0.5    0      0.5      0      0
flight   1       0      0      0      0        0      0
I        0       0      0      0      0        0      1
read     0.5     0      0      0.5    0        0      0

P((eos) you read a book (eos))
= P(you|eos) * P(read|you) * P(a|read) * P(book|a) * P(eos|book)
= 0.33 * 0.5 * 0.5 * 0.5 * 0.5
= 0.020625
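A small sketch of the same idea in Python is given below. It estimates bigram probabilities from the toy corpus by maximum likelihood, P(w | w_prev) = count(w_prev w) / count(w_prev); note that the values in the (eos) row depend on how the boundary markers are counted, so the final number can differ slightly from the table above.

from collections import Counter

tokens = "(eos) You book a flight (eos) I read a book (eos) You read (eos)".lower().split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    # Maximum-likelihood estimate P(w | w_prev) = count(w_prev w) / count(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

sentence = "(eos) you read a book (eos)".split()
prob = 1.0
for w_prev, w in zip(sentence, sentence[1:]):
    prob *= bigram_prob(w_prev, w)
print(prob)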

Objective:
The objective of this experiment is to learn to calculate bigrams from a given corpus and calculate the probability of a sentence.

Procedure:
STEP1: Select a corpus and click on
STEP2: Fill up the table that is generated and hit
STEP3: If incorrect (red), see the correct answer by clicking on show answer or
repeat Step 2.
STEP4: If correct (green), click on take a quiz and fill the correct answer

Experiment:
STEP1: Select a corpus and click on

STEP2: Fill up the table that is generated and hit

STEP3: If incorrect (red), see the correct answer by clicking on show answer or
repeat Step 2.

STEP4: If correct (green), click on take a quiz and fill the correct answer

Experiment No : 11

Perform an experiment to learn how to apply add-one smoothing on a sparse bigram table.

Objective:
The objective of this experiment is to learn how to apply add-one smoothing on sparse
bigram table

Introduction:
One major problem with standard N-gram models is that they must be trained from some corpus, and because any particular training corpus is finite, some perfectly acceptable N-grams are bound to be missing from it. We can see that the bigram matrix for any given training corpus is sparse. There are a large number of bigrams with zero probability that should really have some non-zero probability. This method tends to underestimate the probability of strings that happen not to have occurred nearby in the training corpus.

There are some techniques that can be used for assigning a non-zero probability to these 'zero probability bigrams'. This task of reevaluating some of the zero-probability and low-probability N-grams, and assigning them non-zero values, is called smoothing.

Theory:
The standard N-gram models are trained from some corpus. The finiteness of the training corpus leads to the absence of some perfectly acceptable N-grams. This results in sparse bigram matrices. This method tends to underestimate the probability of strings that do not occur in the training corpus.

There are some techniques that can be used for assigning a non-zero probability to these 'zero probability bigrams'. This task of reevaluating some of the zero-probability and low-probability N-grams, and assigning them non-zero values, is called smoothing. Some of the techniques are: Add-One Smoothing, Witten-Bell Discounting, and Good-Turing Discounting.
Add-One Smoothing
In add-one smoothing, we add one to all the bigram counts before normalizing them into probabilities.

Application on unigrams
The unsmoothed maximum likelihood estimate of the unigram probability can be computed by dividing the count of the word by the total number of word tokens N:

P(wx) = C(wx) / Σi C(wi) = C(wx) / N

Application on bigrams

Normal bigram probabilities are computed by normalizing each row of counts by the unigram count:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

For add-one smoothed bigram counts, we need to augment the unigram count by the number of total word types in the vocabulary V:

P*(wn | wn-1) = ( C(wn-1 wn) + 1 ) / ( C(wn-1) + V )
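A minimal sketch of add-one smoothing applied to bigram counts is shown below; it reuses the toy corpus from the previous experiment, an assumption made only for illustration.

from collections import Counter

tokens = "(eos) You book a flight (eos) I read a book (eos) You read (eos)".lower().split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)    # number of word types in the vocabulary

def smoothed_bigram_prob(w_prev, w):
    # Add-one (Laplace) smoothing: P*(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(round(smoothed_bigram_prob("you", "book"), 4))     # a bigram seen in the corpus
print(round(smoothed_bigram_prob("book", "flight"), 4))  # an unseen bigram now gets non-zero probability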

Procedure

STEP1: Select a corpus

STEP2: Apply add one smoothing and calculate bigram probabilities using the
given bigram counts,N and V. Fill the table and hit
Submit

STEP3: If incorrect (red), see the correct answer by clicking on show answer or
repeat Step 2

Experiment

Bigram counts for the corpus:

N=V=

Fill the bigram probabilities after add-one smoothing: (Upto 4 decimal places)
0

Right Answer

Experiment No : 12

Perform an experiment to calculate emission and transition matrix which will be helpful for tagging Parts of Speech using Hidden Markov Model.

Objective:
The objective of the experiment is to calculate emission and transition matrix which will
be helpful for tagging Parts of Speech using Hidden Markov Model.

Introduction:
POS tagging or part-of-speech tagging is the procedure of assigning a grammatical
category like noun, verb, adjective etc. to a word. In this process both the lexical
information and the context play an important role as the same lexical form can behave
differently in a different context.

For example the word "Park" can have two different lexical categories based on the
context.

1. The boy is playing in the park. ('Park' is Noun)


2. Park the car. ('Park' is Verb)

Assigning part of speech to words by hand is a common exercise one can find in an
elementary grammar class. But here we wish to build an automated tool which can assign
the appropriate part-of-speech tag to the words of a given sentence. One can think of
creating hand crafted rules by observing patterns in the language, but this would limit
the system's performance to the quality and number of patterns identified by the rule
crafter. Thus, this approach is not practically adopted for building a POS tagger. Instead,
a large corpus annotated with correct POS tags for each word is given to the computer
and algorithms then learn the patterns automatically from the data and store them in form
of a trained model. Later this model can be used to POS tag new sentences.
In this experiment we will explore how such a model can be learned from the data.

Theory

A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. In a regular Markov model (Ref: http://en.wikipedia.org/wiki/Markov_model), the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but the output, dependent on the state, is visible.

A Hidden Markov Model has two important components −

Transition Probabilities: The one-step transition probability is the probability of transitioning from one state to another in a single step.

Emission Probabilities: The output probabilities for an observation from a state.
Emission probabilities B = { bi,k = bi(ok) = P(ok | qi) }, where ok is an observation.
Informally, B is the probability that the output is ok given that the current state is qi.

For POS tagging, it is assumed that POS tags are generated as a random process, and each process randomly generates a word. Hence, the transition matrix denotes the transition probability from one POS to another and the emission matrix denotes the probability that a given word can have a particular POS. Words act as the observations. Some of the basic assumptions are:

1. First-order (bigram) Markov assumptions:
   a. Limited Horizon: Tag depends only on the previous tag
      P(ti+1 = tk | t1 = tj1, ..., ti = tji) = P(ti+1 = tk | ti = tj)
   b. Time invariance: No change over time
      P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj -> tk)
2. Output probabilities:
   The probability of getting word wk for tag tj, P(wk | tj), is independent of other tags or words.

Calculating Emission Probability Matrix

Count the no. of times a specific word occurs with a specific POS tag in the corpus. Here, say for "cut":

count(cut,verb)=1
count(cut,noun)=2
count(cut,determiner)=0

... and so on zero for other tags too

count(cut) = total count of cut = 3


Calculating the Probabilities

Consider the given toy corpus:

EOS/eos They/pronoun cut/verb the/determiner paper/noun
EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun
EOS/eos Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun
EOS/eos

Now, calculating the probability:

Probability to be filled in the matrix cell at the intersection of cut and verb:
P(cut/verb) = count(cut,verb) / count(cut) = 1/3 = 0.33

Similarly, probability to be filled in the cell at the intersection of cut and determiner:
P(cut/determiner) = count(cut,determiner) / count(cut) = 0/3 = 0

Repeat the same for all the word-tag combinations and fill the emission matrix.

Calculating Transition Probability Matrix

Count the no. of times a specific tag comes after other POS tags in the corpus. Here, say for "determiner":

count(verb,determiner)=2
count(preposition,determiner)=1
count(determiner,determiner)=0
count(eos,determiner)=0
count(noun,determiner)=0
... and so on zero for other tags too.

count(determiner) = total count of tag 'determiner' = 3

Now, calculating the probability:

Probability to be filled in the cell at the intersection of determiner (in the column) and verb (in the row):
P(determiner/verb) = count(verb,determiner) / count(determiner) = 2/3 = 0.66

Similarly, probability to be filled in the cell at the intersection of determiner (in the column) and noun (in the row):
P(determiner/noun) = count(noun,determiner) / count(determiner) = 0/3 = 0

Repeat the same for all the tags
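The counting described above can be reproduced with a short script. This is only a sketch; it follows the conventions used in the text, P(word/tag) = count(word,tag)/count(word) for emission and P(tag2/tag1) = count(tag1,tag2)/count(tag2) for transition.

from collections import Counter

tagged = ("EOS/eos They/pronoun cut/verb the/determiner paper/noun "
          "EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun "
          "EOS/eos Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun "
          "EOS/eos").split()

pairs = [tok.rsplit("/", 1) for tok in tagged]
words = [w.lower() for w, t in pairs]
tags = [t for w, t in pairs]

word_counts = Counter(words)
tag_counts = Counter(tags)
word_tag_counts = Counter(zip(words, tags))
tag_bigram_counts = Counter(zip(tags, tags[1:]))

# Emission probability as used in the text: P(cut/verb) = count(cut,verb)/count(cut)
print(word_tag_counts[("cut", "verb")] / word_counts["cut"])                  # 1/3 = 0.33
# Transition probability as used in the text: P(determiner/verb) = count(verb,determiner)/count(determiner)
print(tag_bigram_counts[("verb", "determiner")] / tag_counts["determiner"])  # 2/3 = 0.66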

Note: EOS/eos is a special marker which represents End Of Sentence.

Objective:

The objective of the experiment is to calculate the emission and transition matrix which will be helpful for tagging Parts of Speech using a Hidden Markov Model.
Procedure:

STEP1: Select the corpus.


STEP2: For the given corpus fill the emission and transition matrix. Answers are
rounded to 2 decimal digits.
STEP3: Press Check to check your answer. Wrong answers are indicated by the red
cell.

Experiment:

EOS/eos Book/verb a/determiner car/noun
EOS/eos Park/verb a/determiner car/noun
EOS/eos The/determiner book/noun is/verb in/preposition the/determiner car/noun
EOS/eos The/determiner car/noun is/verb in/preposition a/determiner park/noun
EOS/eos

Emission Matrix
             book  park  car  is  in  a  the
determiner   0     0     0    0   0   0  0
noun         0     0     0    0   0   0  0
verb         0     0     0    0   0   0  0
preposition  0     0     0    0   0   0  0

Transition Matrix
             eos  determiner  noun  verb  preposition
eos          0    0           0     0     0
determiner   0    0           0     0     0
noun         0    0           0     0     0
verb         0    0           0     0     0
preposition  0    0           0     0     0

Check
Wrong Emission and Transition Matrix!!!
Right Answer

Emission Matrix
             book  park  car  is  in  a  the
determiner   0     0     0    0   0   1  1
noun         0.5   0.5   1    0   0   0  0
verb         0.5   0.5   0    1   0   0  0
preposition  0     0     0    0   1   0  0

Transition Matrix
             eos  determiner  noun  verb  preposition
eos          0    0.33        0     0.5   0
determiner   0    0           1     0     0
noun         1    0           0     0.5   0
verb         0    0.33        0     0     1
preposition  0    0.33        0     0     0

Experiment No: 13

Perform an experiment to know the importance of context and size of training corpus in learning Parts of Speech.

Introduction:
In the previous experiment you calculated the transition and emission matrix; in this experiment it will be used to find the POS tag sequence for a given sentence. When we have the emission and transition matrix, various algorithms can be applied to find out the POS tags for words. Some of the possible algorithms are: the backward algorithm, the forward algorithm and the Viterbi algorithm. Here, in this experiment, you can get familiar with Viterbi decoding.

Hidden Markov Model

In the mid 1980s, researchers in Europe began to use Hidden Markov models (HMMs)
to disambiguate parts of speech. HMMs involve counting cases, and making a table of
the probabilities of certain sequences. For example, once you've seen an article such as
'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number
20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be
a noun than a verb or a modal. The same method can of course be used to benefit from
knowledge about following words.

More advanced ("higher order") HMMs learn the probabilities not only of pairs, but
triples or even larger sequences. So, for example, if you've just seen an article and a
verb, the next item may be very likely a preposition, article, or noun, but much less
likely another verb.
When several ambiguous words occur together, the possibilities multiply. However, it
is easy to enumerate every combination and to assign a relative probability to each one,
by multiplying together the probabilities of each choice in turn.
It is worth remembering, as Eugene Charniak points out in Statistical techniques for
natural language parsing, that merely assigning the most common tag to each known
word and the tag "proper noun" to all unknowns, will approach 90% accuracy because
many words are unambiguous.
HMMs underlie the functioning of stochastic taggers and are used in various algorithms.
Accuracies for one such algorithm (TnT) on various training data is shown here.

Conditional Random Field

Conditional random fields (CRFs) are a class of statistical modelling method often
applied in machine learning, where they are used for structured prediction. Whereas an
ordinary classifier predicts a label for a single sample without regard to "neighboring"
samples, a CRF can take context into account. Since it can consider context, therefore
CRF can be used in Natural Language Processing. Hence, Parts of Speech tagging is
also possible. It predicts the POS using the lexicons as the context.

Theory:

Viterbi decoding is based on dynamic programming. This algorithm takes the emission and transition matrix as the input. The emission matrix gives us information about the probabilities of a POS tag for a given word and the transition matrix gives the probability of transition from one POS tag to another POS tag. It observes the sequence of words and returns the state sequence of POS tags along with its probability.

Here "s" denotes words and "t" denotes tags. "a" is the transition matrix and "b" is the emission matrix.
Using the above algorithm, we have to fill the Viterbi table column by column.
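A minimal sketch of Viterbi decoding over dictionaries of transition ("a") and emission ("b") probabilities is shown below. The tag set, probabilities and example sentence are illustrative assumptions, not the values used by the virtual lab.

def viterbi(observations, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the best tag sequence that ends in state s at position t
    V = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            prev, prob = max(
                ((p, V[t - 1][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(observations[t], 0.0))
                 for p in states),
                key=lambda x: x[1])
            V[t][s] = prob
            back[t][s] = prev
    # Backtrack from the most probable final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path, V[-1][last]

# Illustrative matrices (assumed values, for demonstration only)
states = ["determiner", "noun", "verb"]
start_p = {"determiner": 0.5, "noun": 0.2, "verb": 0.3}
trans_p = {"determiner": {"noun": 0.9, "verb": 0.05, "determiner": 0.05},
           "noun": {"verb": 0.6, "noun": 0.2, "determiner": 0.2},
           "verb": {"determiner": 0.5, "noun": 0.4, "verb": 0.1}}
emit_p = {"determiner": {"the": 0.9},
          "noun": {"book": 0.6, "flies": 0.4},
          "verb": {"book": 0.3, "flies": 0.7}}

print(viterbi(["the", "book", "flies"], states, start_p, trans_p, emit_p))
# tag sequence: ['determiner', 'noun', 'verb']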

Objective:
The objective of this experiment is to find POS tags of words in a sentence using
Viterbi decoding.

Procedure

STEP1: Select the language.
OUTPUT: Drop down to select size of corpus, algorithm and features will appear.
STEP2: Fill the column with the probability of possible POS tags given the word (i.e. form the Viterbi matrix by filling the column for each observation). Answers submitted are rounded off to 3 digits after the decimal and are then checked.
STEP3: Check the column.
Wrong answers are indicated by a red background in a cell.
If answers are right, then go to step 2.
STEP4: Repeat steps 2 and 3 until all words of a sentence are covered.
STEP5: At last check the POS tag for each word obtained from backtracking.

Experiment:
STEP1: Select the corpus.
OUTPUT: The emission and transition matrices will appear.

STEP2: Fill the column with the probability of each possible POS tag given the word (i.e. form the Viterbi matrix by filling one column per observation). Submitted answers are rounded off to 3 digits after the decimal point and then checked.

STEP3: Check the column.
Wrong answers are indicated by a red background in a cell.
If the answers are right, go back to STEP2 for the next word.

STEP4: Repeat STEP2 and STEP3 until all words of the sentence are covered.

STEP5: Finally, check the POS tag of each word obtained by backtracking.

Experiment No : 14

Perform an experiment to understand the concept of chunking and get familiar with the basic chunk tagset.

Objective:
The objective of this experiment is to understand the concept of chunking and get
familiar with the basic chunk tagset.

Introduction:
Chunking of text involves dividing a text into syntactically correlated groups of words. For example, the sentence 'He ate an apple.' can be divided as follows:

[NP He ] [VP ate ] [NP an apple ]

Each chunk has an open boundary and a close boundary that delimit the word group as a minimal non-recursive unit. This can be formally expressed by using IOB prefixes.

Theory:

Chunking of text involves dividing a text into syntactically correlated words.

Eg: He ate an apple to satiate his hunger.
[NP He ] [VP ate ] [NP an apple ] [VP to satiate ] [NP his hunger ]
Chunk Types
The chunk types are based on the syntactic category of the chunk. Besides the head, a chunk also contains modifiers (like determiners, adjectives, and postpositions in NPs).

The basic types of chunks in English are:

Chunk Type        Tag Name
1. Noun           NP
2. Verb           VP
3. Adverb         ADVP
4. Adjectival     ADJP
5. Prepositional  PP

The basic Chunk Tag Set for Indian Languages


Sl. No   Chunk Type              Tag Name
1        Noun Chunk              NP
2.1      Finite Verb Chunk       VGF
2.2      Non-finite Verb Chunk   VGNF
2.3      Verb Chunk (Gerund)     VGNN
3        Adjectival Chunk        JJP
4        Adverb Chunk            RBP


NP Noun Chunks
Noun chunks are given the tag NP and include non-recursive noun phrases, along with the postposition for Indian languages or the preposition for English. Determiners, adjectives and other modifiers are part of the noun chunk.

Eg:
Indian language (word-by-word gloss): 'this' 'book' 'in'
English: ((in/IN the/DT big/ADJ room/NN))NP
Verb Chunks
The verb chunks are marked as VP for English; however, they are of several types for Indian languages. A verb group includes the main verb and its auxiliaries, if any.

For English:
I (will/MD be/VB loved/VBN)VP

The types of verb chunks and their tags are described below.

1. VGF Finite Verb Chunk

The auxiliaries in the verb group mark the finiteness of the verb at the chunk level. Thus, any verb group which is finite will be tagged as VGF.
Eg (Indian language, word-by-word gloss): 'I-erg' 'home' 'at' 'meal' 'ate'

2. VGNF Non-finite Verb Chunk

A non-finite verb chunk will be tagged as VGNF.
Eg (Indian language, word-by-word gloss): 'apple' 'eating' 'PROG' 'boy' 'go' 'PROG' 'is'

3. VGNN Gerunds

A verb chunk having a gerund will be annotated as VGNN.
Eg (Indian language, word-by-word gloss): 'liquor' 'drinking' 'health' 'for' 'harmful' 'is'

JJP/ADJP Adjectival Chunk


An adjectival chunk will be tagged as ADJP for English and JJP for Indian languages. This chunk type covers all adjectival phrases, including predicative adjectives.

Eg:
The fruit is (ripe/JJ)ADJP
Note: Adjectives appearing before a noun will be grouped together within the
noun chunk.

60
RBP/ADVP Adverb Chunk

This chunk will include all pure adverbial phrases.

Eg:
Indian language (word-by-word gloss): 'he' 'slowly' 'walk' 'PROG' 'was'
English: He walks (slowly/ADV)ADVP

PP Prepositional Chunk
This chunk type is present only for English and not for Indian languages. It consists of only the preposition and not the NP argument.

Eg:
(with/IN)PP a pen

IOB prefixes
Each chunk has an open boundary and close boundary that delimit the word groups as
a minimal non-recursive unit. This can be formally expressed by using IOB prefixes:

B-CHUNK is used for the first word of the chunk and I-CHUNK for every other word in the chunk (words outside any chunk are tagged O). Here is an example of the file format:

Tokens POS Chunk-Tags

He PRP B-NP
ate VBD B-VP
an DT B-NP
apple NN I-NP
to TO B-VP
satiate VB I-VP
his PRP$ B-NP
hunger NN I-NP
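As an illustration of how such IOB chunk tags can be produced automatically, the sketch below uses NLTK's RegexpParser with a small hand-written grammar (the two grammar rules are a simplified assumption, not the full chunking guidelines) and converts the resulting chunk tree into (token, POS, chunk-tag) triples. Running it on the tagged example sentence reproduces the IOB column shown above.

# Rule-based chunking sketch with NLTK (simplified grammar, for illustration).
import nltk
from nltk.chunk import tree2conlltags

grammar = r"""
  NP: {<DT|PRP\$>?<JJ>*<NN.*|PRP>}   # noun chunk
  VP: {<MD>?<TO>?<VB.*>+}            # verb chunk
"""
chunker = nltk.RegexpParser(grammar)

tagged = [('He', 'PRP'), ('ate', 'VBD'), ('an', 'DT'), ('apple', 'NN'),
          ('to', 'TO'), ('satiate', 'VB'), ('his', 'PRP$'), ('hunger', 'NN')]

tree = chunker.parse(tagged)                     # nested chunk tree
for token, pos, chunk in tree2conlltags(tree):   # flatten to IOB triples
    print(token, pos, chunk)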

Procedure:
STEP1: Select a language
STEP2: Select a sentence
STEP3: Select the corresponding chunk-tag for each
word in the sentence and click the Submit button.

OUTPUT1: The submitted answer will be checked.

Click on Get Answer button for the correct answer.

Experiment:

STEP1: Select a language

STEP2: Select a sentence

STEP3: Select the corresponding chunk-tag for each word in the sentence and click the Submit button.

OUTPUT1: The submitted answer will be checked.

Click on Get Answer button for the correct answer.

Experiment No : 15

Texts Summarization using Python

Introduction:
Millions of web pages and websites exist on the Internet today, and going through such a vast amount of content to extract information on a particular topic is very difficult. Google filters the search results and shows you the top ten pages, but often you are still unable to find the exact content you need. There is a lot of redundant and overlapping material in articles, which leads to a lot of wasted time. A better way to deal with this problem is to summarize the text data, which is available in large amounts, into a shorter form.

Text Summarization
Text summarization is an NLP technique that extracts the most important content from a large amount of text data. It helps in creating a shorter version of the large text available.
It is important because it:
Reduces reading time
Helps in better research work
Increases the amount of information that can fit in an area
There are two approaches to text summarization: NLP-based techniques and deep learning techniques.
In this experiment, we will go through an NLP-based technique which makes use of the NLTK library.

Text Summarization steps

• Obtain Data
• Text Preprocessing
• Convert paragraphs to sentences
• Tokenizing the sentences
• Find weighted frequency of occurrence
• Replace words by weighted frequency in sentences
• Sort sentences in descending order of weights
• Summarizing the Article

Obtain Data for Summarization
If you wish to summarize a Wikipedia article, obtain the URL of the article that you wish to summarize. We will obtain the data from the URL using the concept of web scraping. To use web scraping you will need to install the beautifulsoup library (and the lxml parser) in Python. This library will be used to fetch the data within the various HTML tags of the web page.

Code:

import bs4 as bs
import urllib.request
import re
import nltk
import heapq

# The NLTK sentence tokenizer and stopword list must be available,
# e.g. via nltk.download('punkt') and nltk.download('stopwords').

scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')

article_text = ""
for p in paragraphs:
    article_text += p.text

# Removing square-bracketed citation markers and extra spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

# Split the original text into sentences
sentence_list = nltk.sent_tokenize(article_text)

# Count word frequencies over the cleaned text, ignoring stopwords
stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

# Weighted frequency: divide each count by the maximum count
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

# Score each sentence (shorter than 30 words) by the weighted
# frequencies of the words it contains
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

# Pick the top 7 sentences as the summary
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = '\n'.join(summary_sentences)
print(summary)

Summary
The sentence_scores dictionary consists of the sentences along with their scores. Now,
top N sentences can be used to form the summary of the article.
Here the heapq library has been used to pick the top 7 sentences to summarize the
article.

Output :

