AI&ML BAI601 NLP Lab Manual
LAB MANUAL
(Effective from the academic year 2024-2025 under 2022 CBCS scheme)
STEPS:-
1. Install ‘nltk’ Python library (Natural Language Toolkit) using ‘pip’ command.
CODE:-
!pip install nltk
OUTPUT:-
2. Download the required NLTK data packages (tokenizers, stopwords, WordNet).
CODE:-
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
OUTPUT:-
3. Perform text preprocessing tasks.
CODE:-
# Import all the required libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
import re
OUTPUT:-
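The manual lists only the imports for this step, so a short end-to-end sketch may help; the sample sentence and the exact order of operations below are illustrative choices, not the manual's official solution.
# A minimal preprocessing pipeline (sample sentence is illustrative)
import re, string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The children are running quickly towards the 3 big parks!"
text = re.sub(r'\d+', '', text.lower())                                   # lowercase and drop digits
text = text.translate(str.maketrans('', '', string.punctuation))          # strip punctuation
tokens = word_tokenize(text)                                              # tokenization
tokens = [w for w in tokens if w not in set(stopwords.words('english'))]  # stopword removal
stems = [PorterStemmer().stem(w) for w in tokens]                         # stemming
lemmas = [WordNetLemmatizer().lemmatize(w) for w in tokens]               # lemmatization
print(tokens, stems, lemmas, sep="\n")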
2. Write a Python program to demonstrate the N-gram modeling to analyze and
establish the probability distribution across sentences and explore the utilization
of unigrams, bigrams, and trigrams in diverse English sentences to illustrate the
impact of varying n-gram orders on the calculated probabilities.
CODE:-
import nltk
from nltk.util import ngrams
from collections import Counter
# Sample corpus
sentences = [
"I love programming.",
"I love learning new things.",
"Programming is fun." ]
def ngram_model(sentences, n):
    # Tokenize the corpus into one list of lower-cased tokens
    tokens = [w.lower() for s in sentences for w in nltk.word_tokenize(s)]
    # Generate n-grams and count them
    n_gram_freq = Counter(ngrams(tokens, n))
    # Calculate the frequency of the first n-1 words for the conditional probability
    n_1_gram_freq = Counter(ngrams(tokens, n - 1)) if n > 1 else Counter()
    # P(w_n | first n-1 words) = count(n-gram) / count(its (n-1)-gram prefix);
    # unigram probabilities are divided by the total token count instead
    ngram_probabilities = {
        gram: freq / (n_1_gram_freq[gram[:-1]] if n > 1 else len(tokens))
        for gram, freq in n_gram_freq.items()
    }
    return ngram_probabilities

# Compare how the probabilities change with the n-gram order
for n in (1, 2, 3):
    print(f"\n{n}-gram probabilities:")
    for gram, prob in sorted(ngram_model(sentences, n).items()):
        print(gram, round(prob, 3))
OUTPUT:-
3. Investigate the Minimum Edit Distance (MED) algorithm and its application in
string comparison. The goal is to understand how the algorithm efficiently
computes the minimum number of edit operations required to transform one
string into another.
● Test the algorithm on strings with different types of variations (e.g., typos,
substitutions, insertions, deletions).
● Evaluate its adaptability to different types of input variations.
CODE:-
# Import the required libraries
import numpy as np

def min_edit_distance(str1, str2):
    m, n = len(str1), len(str2)
    # Initialize a (m+1) x (n+1) 2D array; row 0 and column 0 hold the base cases
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:, 0] = np.arange(m + 1)   # cost of deleting every character of str1
    dp[0, :] = np.arange(n + 1)   # cost of inserting every character of str2
    # MED Algorithm
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i-1] == str2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = min(dp[i-1][j] + 1,      # deletion
                               dp[i][j-1] + 1,      # insertion
                               dp[i-1][j-1] + 1)    # substitution
    return dp[m][n]

# Test on strings with different variations (typos, substitutions, insertions, deletions)
for a, b in [("intention", "execution"), ("sunday", "saturday"), ("kitten", "sitting")]:
    print(f"MED('{a}', '{b}') =", min_edit_distance(a, b))
OUTPUT:-
4. Write a program to implement top-down and bottom-up parser using
appropriate context free grammar.
CODE:-
# Import the CFG library
import nltk
from nltk import CFG

# Define a simple context-free grammar and the sentence to parse
grammar = CFG.fromstring("""
  S -> NP VP
  NP -> Det N
  VP -> V NP
  Det -> 'the' | 'a'
  N -> 'dog' | 'cat'
  V -> 'saw' | 'chased'
""")
sentence = "the dog chased a cat".split()

# Top-down parser
print("\nTop-Down Parsing(ChartParser):")
top_down_parser=nltk.ChartParser(grammar)
for tree in top_down_parser.parse(sentence):
print(tree)
tree.pretty_print()
# Bottom-up parser
print("\nBottom-up Parsing(Shift-Reduce Parser):")
bottom_up_parser=nltk.ShiftReduceParser(grammar)
for tree in bottom_up_parser.parse(sentence):
print(tree)
tree.pretty_print()
OUTPUT:-
5. Given the following short movie reviews, each labeled with a genre, either
comedy or action:
● fun, couple, love, love - comedy
● fast, furious, shoot - action
● couple, fly, fast, fun, fun - comedy
● furious, shoot, shoot, fun - action
● fly, fast, shoot, love - action
A new document D: fast, couple, shoot, fly.
Compute the most likely class for D. Assume a Naïve Bayes classifier and use
add 1 smoothing for the likelihoods.
CODE:-
# Import the required libraries
from collections import Counter
# Document to classify
doc = ['fast', 'couple', 'shoot', 'fly']
for w in doc:
    comedy_doc_prob *= comedy_probabilities.get(w, 1 / (comedy_word_count + size))
    action_doc_prob *= action_probabilities.get(w, 1 / (action_word_count + size))
OUTPUT:-
Total no. of documents in training data: 5
Total 'comedy' documents: 2
Total 'action' documents: 3
Probability of 'comedy' documents: 0.4
Probability of 'action' documents: 0.6
Unique words in all documents: {'couple', 'fast', 'furious', 'fly',
'love', 'fun', 'shoot'}
Size of unique words: 7
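The fragment above shows only the final likelihood loop. For reference, a complete, self-contained version of the add-1 computation is sketched below; the variable names and the structure of the training list are our own choices, not part of the manual.
from collections import Counter

# Training documents from the exercise, as (word list, label) pairs
train = [
    (['fun', 'couple', 'love', 'love'], 'comedy'),
    (['fast', 'furious', 'shoot'], 'action'),
    (['couple', 'fly', 'fast', 'fun', 'fun'], 'comedy'),
    (['furious', 'shoot', 'shoot', 'fun'], 'action'),
    (['fly', 'fast', 'shoot', 'love'], 'action'),
]
doc = ['fast', 'couple', 'shoot', 'fly']

vocab = {w for words, _ in train for w in words}
priors = Counter(label for _, label in train)
scores = {}
for label in priors:
    class_words = [w for words, l in train if l == label for w in words]
    counts = Counter(class_words)
    prob = priors[label] / len(train)                        # P(class)
    for w in doc:                                            # add-1 smoothed P(w | class)
        prob *= (counts[w] + 1) / (len(class_words) + len(vocab))
    scores[label] = prob
print(scores)
print("Most likely class for D:", max(scores, key=scores.get))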
CODE:-
# Install and import the required libraries
import nltk
import matplotlib.pyplot as plt
from nltk.corpus import brown, inaugural, reuters, udhr
from nltk import FreqDist, ConditionalFreqDist, word_tokenize, pos_tag
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.tag import UnigramTagger, DefaultTagger, RegexpTagger
import string
nltk.download('brown')
nltk.download('inaugural')
nltk.download('reuters')
nltk.download('udhr')
OUTPUT:-
CODE:-
# Brown
print("Brown Categories:", brown.categories())
print("Brown File IDs:", brown.fileids()[:5])
print("Brown Words (News):", brown.words(categories='news')[:10])
print("Brown Sentences:", brown.sents(categories='news')[:2])
# Inaugural Address
print("\nInaugural File IDs:", inaugural.fileids()[:5])
print("Inaugural Words (2009):", inaugural.words('2009-Obama.txt')[:10])
# Reuters
print("\nReuters Categories:", reuters.categories()[:5])
print("Reuters File IDs:", reuters.fileids()[:5])
print("Reuters Words:", reuters.words(reuters.fileids()[0])[:10])
# UDHR
print("\nUDHR Languages:", udhr.fileids()[:5])
print("UDHR Words (English):", udhr.words('English-Latin1')[:10])
OUTPUT:-
CODE:-
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:\\Users\\Student\\Desktop\\My Corpus'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()
OUTPUT:-
CODE:-
# Conditional frequency distribution for Brown corpus
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
OUTPUT:-
CODE:-
# Conditional frequency distribution for Inaugural Address corpus
from nltk.corpus import inaugural
inaugural.fileids()
[fileid[:4] for fileid in inaugural.fileids()]
cfd = nltk.ConditionalFreqDist(
(target, fileid[:4])
for fileid in inaugural.fileids()
for w in inaugural.words(fileid)
for target in ['america', 'citizen']
if w.lower().startswith(target))
cfd.plot()
OUTPUT:-
CODE:-
# Conditional frequency distribution for UDHR corpus
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch',
'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
(lang, len(word))
for lang in languages
for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)
OUTPUT:-
CODE:-
# Tagged words and sentences from Brown corpus
tagged_words = brown.tagged_words(categories='news')
tagged_sents = brown.tagged_sents(categories='news')
print("Tagged words:", tagged_words[:10])
print("Tagged sentences:", tagged_sents[:1])
OUTPUT:-
5. Write a program to find the most frequent noun tags.
CODE:-
# Noun tags usually start with 'NN'
nouns = [word for word, tag in tagged_words if tag.startswith('NN')]
fd_nouns = FreqDist(nouns)
print("Most Common Nouns:", fd_nouns.most_common(10))
OUTPUT:-
CODE:-
word_props = {
'dog': {'type': 'noun', 'sentiment': 'neutral'},
'run': {'type': 'verb', 'sentiment': 'positive'},
'hate': {'type': 'verb', 'sentiment': 'negative'}
}
print(word_props['run']['sentiment'])
OUTPUT:-
CODE:-
# Unigram Tagger
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])
OUTPUT:-
CODE:-
# Rule-based Tagger
patterns = [
(r'.*ing$', 'VBG'), # gerunds
(r'.*ed$', 'VBD'), # simple past
(r'.*es$', 'VBZ'), # 3rd singular present
(r'.*ould$', 'MD'), # modals
(r'.*\'s$', 'NN$'), # possessive nouns
(r'.*s$', 'NNS'), # plural nouns
(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
(r'.*', 'NN') # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
tagged=regexp_tagger.tag(brown_sents[1])
print(tagged)
OUTPUT:-
8. Find different words from a given plain text without any space.
CODE:-
!pip install wordninja
OUTPUT:-
CODE:-
import wordninja
wordninja.split('thisisabeautifulday')
OUTPUT:-
7. Write a Python program to find synonyms and antonyms of the word ‘active’
using WordNet.
CODE:-
# Import the required libraries
import nltk
from nltk.corpus import wordnet
# Download these modules only if they are not present, else skip it
nltk.download('wordnet')
nltk.download('omw-1.4')
# Collect synonyms and antonyms of a word from WordNet synsets
def get_syn_ant(word):
    synonyms, antonyms = set(), set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
            for ant in lemma.antonyms():
                antonyms.add(ant.name())
    return sorted(synonyms), sorted(antonyms)
# Declare the given word and print its synonyms and antonyms
word = 'active'
syno, anto = get_syn_ant(word)
print(f'Synonyms of {word} : {syno}')
print(f'Antonyms of {word} : {anto}')
OUTPUT:-
8. Implement the machine translation application of NLP where it needs to train a
machine translation model for a language with limited parallel corpora.
Investigate and incorporate techniques to improve performance in low-resource
scenarios.
CODE:-
# Install and import the required modules and libraries
!pip install transformers
!pip install sentencepiece
!pip install torch
from collections import Counter
import nltk
from nltk.util import ngrams
from transformers import MarianMTModel, MarianTokenizer
import torch
import sentencepiece as spm
import random
nltk.download('punkt')
OUTPUT:-
CODE:-
# Define the corpus
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_path = r"C:\Users\Student\Documents\en-hi.txt"   # folder containing the parallel files
corpus = PlaintextCorpusReader(corpus_path, r".*")
english_sentences = corpus.words("tldr-pages.en-hi.en")
hindi_sentences = corpus.words("tldr-pages.en-hi.hi")
# Preprocessing
!pip install sacremoses
def tokenize_sentences(corpus, lang):
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
    if isinstance(corpus, list):
        # Parallel data supplied as (source, target) pairs
        if lang == 'src':
            return [tokenizer(sentence, return_tensors='pt', padding=True,
                              truncation=True) for sentence, _ in corpus]
        else:
            return [tokenizer(sentence, return_tensors='pt', padding=True,
                              truncation=True) for _, sentence in corpus]
    else:
        # Corpus reader output: join the words back into a single string
        sentences = [' '.join(corpus)]
        return [tokenizer(sentence, return_tensors='pt', padding=True,
                          truncation=True) for sentence in sentences]
OUTPUT:-
CODE:-
import sys
print(sys.version)
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
OUTPUT:-
CODE:-
# Train the machine translation model
def train_model(model, src_tokens, tgt_tokens, epochs=5):
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
    model.train()   # put the model into training mode
    for epoch in range(epochs):
        total_loss = 0
        for src_token, tgt_token in zip(src_tokens, tgt_tokens):
            optimizer.zero_grad()
            # The model computes its own loss when labels are supplied
            output = model(**src_token, labels=tgt_token['input_ids'])
            loss = output.loss
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(src_tokens)}")
OUTPUT:-
CODE:-
def translate_text(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors='pt')
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
OUTPUT:-
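The exercise also asks for techniques that help in low-resource settings. One widely used option is back-translation: monolingual Hindi text is translated back into English with a reverse model to create extra synthetic sentence pairs, which can then be added to the real parallel data before calling train_model. The sketch below is illustrative; the reverse checkpoint name and the monolingual sentences are assumptions.
from transformers import MarianMTModel, MarianTokenizer

rev_name = "Helsinki-NLP/opus-mt-hi-en"            # reverse (Hindi -> English) model, assumed available
rev_tok = MarianTokenizer.from_pretrained(rev_name)
rev_model = MarianMTModel.from_pretrained(rev_name)

monolingual_hi = ["यह एक उदाहरण वाक्य है।"]          # monolingual Hindi text (illustrative)
synthetic_pairs = []
for hi in monolingual_hi:
    batch = rev_tok(hi, return_tensors='pt')
    en = rev_tok.decode(rev_model.generate(**batch)[0], skip_special_tokens=True)
    synthetic_pairs.append((en, hi))               # synthetic (English, Hindi) training pair
print(synthetic_pairs)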
EXTRA PROGRAMS
CODE:-
# Install dependencies
!pip install pandas scikit-learn nltk
# NLTK data
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
OUTPUT:-
2. Develop a Python program to implement a typical resume keyword matcher.
CODE:-
1. Install the required libraries (the 'fitz' module is provided by the PyMuPDF package).
pip install pymupdf nltk scikit-learn pdfminer.six
# Imports and shared objects used by the helper functions below
import re
import fitz                      # PyMuPDF
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def extract_text(pdf_file):
    return " ".join(page.get_text() for page in fitz.open(pdf_file))

def preprocess(text):
    text = re.sub(r'[^\w\s]', '', text.lower())
    return " ".join(lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words)
def extract_keywords(text):
return Counter(text.split())
# Main
resume_text = preprocess(extract_text('resume_sample.pdf'))
job_text = preprocess(extract_text('job_description_sample.pdf'))
resume_kw = extract_keywords(resume_text)
job_kw = extract_keywords(job_text)
OUTPUT:-
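The snippet stops after building the two keyword counters. One simple way to finish the matcher, continuing with the resume_kw and job_kw variables above, is an overlap-based score; the scoring rule below is an assumed illustration, not the manual's.
# Overlap-based matching score (assumed scoring rule, for illustration)
common = set(resume_kw) & set(job_kw)
match_score = 100 * len(common) / len(set(job_kw)) if job_kw else 0
print("Matching keywords:", sorted(common))
print(f"Resume covers {match_score:.1f}% of the job-description keywords")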
VIVA-VOCE
What is NLP?
Natural language processing (NLP) is a field of computer science and a subfield of artificial
intelligence that aims to make computers understand human language.
(Refer: Natural Language Processing (NLP) - Overview - GeeksforGeeks)
NLP Tools
1. Tokenization and Segmentation: Dividing text into its elementary units, such as words or
sentences.
2. POS Tagging (Part-of-Speech Tagging): Assigning grammatical categories to words, like
nouns, verbs, or adjectives.
3. Named Entity Recognition (NER): Identifying proper nouns or specific names in text.
4. Lemmatization and Stemming: Reducing words to their root form or a common base.
5. Word Sense Disambiguation: Determining the correct meaning of a word with multiple
interpretations based on the context.
6. Parsing: Structurally analyzing sentences and establishing dependencies between
words.
7. Sentiment Analysis: Assessing emotions or opinions expressed in text.
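Several of the tools listed above can be tried in a few lines; the sketch below uses spaCy (assuming the en_core_web_sm model is installed) and an example sentence of our own to show tokenization, POS tagging, lemmatization, and NER together.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii and served as President of the United States.")
for token in doc:
    print(token.text, token.pos_, token.lemma_)    # tokens, POS tags, lemmas
for ent in doc.ents:
    print(ent.text, ent.label_)                    # named entities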
Applications of NLP
● Text Classification: Automates tasks like email sorting and news categorization.
● Sentiment Analysis: Tracks public opinion on social media and product reviews.
● Machine Translation: Powers platforms like Google Translate.
● Chatbots and Virtual Assistants: Enables automated text or voice interactions.
● Voice Recognition: Facilitates speech-to-text and smart speakers.
● Search and Recommendation Systems: Enhances user experience on websites and
apps.
What do you understand about the terms 'corpus', 'tokenization', and 'stopwords'
in NLP?
Corpus:
● A corpus is a large, structured collection of texts used to build and evaluate NLP models; NLTK ships with ready-made corpora such as Brown, Reuters, and Inaugural.
Tokenization:
● Tokenization is the process of breaking longer text into discrete units, or tokens, which
could be words, n-grams, or characters. Common tokenization strategies include splitting
by whitespace, punctuation, or specific vocabularies. Tokenization is a foundational step
for most text-processing tasks, including training neural networks for NLP.
Stopwords:
● Stopwords are words that are often removed from texts during processing because they
carry little meaning on their own (e.g., 'is', 'the', 'and') in bag-of-words models. By
eliminating stopwords, we can focus on content-carrying words and reduce data
dimensionality, thus enhancing computational efficiency and, in some applications,
improving the accuracy of text classification or clustering.
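As a quick illustration of the last two terms, the snippet below (the sample sentence is our own) tokenizes a sentence with NLTK and then removes English stopwords.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())                                        # tokenization
filtered = [w for w in tokens if w not in set(stopwords.words('english'))]  # stopword removal
print(tokens)
print(filtered)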
Distinguish between morphology and syntax in the context of NLP.
Morphology deals with the structure and formation of individual words, while syntax deals with
their arrangement into sentences. Morphology looks at individual words; syntax considers
sentences and the relationships between words.
Consider the sentence: "Riders are riding the horses riding wildly."
Stemming:
● A Porter stemmer reduces each word to a crude root, e.g. "riding" → "ride" and "horses" → "hors", so the stemmed sentence contains forms that are not always real words.
POS Challenges:
● Ambiguity: Many words can serve as different parts of speech, depending on their use or
context.
● Multiple Tags: Some words can have more than one POS, such as the word "well" which
can be an adverb, adjective, or noun.
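To see this ambiguity in practice, the short sketch below (using spaCy with the en_core_web_sm model, as elsewhere in this manual) tags "well" in two sentences of our own; the tag it receives typically differs with the context.
import spacy

nlp = spacy.load("en_core_web_sm")
for sent in ("She sings well.", "They drew water from the well."):
    print([(token.text, token.tag_) for token in nlp(sent)])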
Describe lemmatization and stemming. When would you use one over the other?
Both lemmatization and stemming are methods for reducing inflected words to their root forms.
Stemming:
● Definition: Stemming uses an algorithmic, rule-based approach to cut off word endings,
producing the stem. This process can sometimes result in non-real words, known as
"raw" stems.
● Example: The stem of "running" is "run."
● Code Example (Using NLTK):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stem = stemmer.stem("running")   # 'run'
Lemmatization:
● Definition: Lemmatization, on the other hand, uses a linguistic approach that considers
the word's context. It maps inflected forms to their base or dictionary form (lemma).
● Example: The lemma of "running" is "run".
● Code Example (Using NLTK):
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemma = lemmatizer.lemmatize("running", pos="v")   # the POS hint is needed to obtain 'run'
When to Use Which:
● Stemming: Useful for tasks like text classification when speed is a priority. The technique
is simpler and quicker than lemmatization, but it may sacrifice precision and produce
non-real words.
● Lemmatization: Ideal when semantic accuracy is crucial, such as in question-answering
systems or topic modeling. It ensures that the root form retains its existing meaning in
the text, potentially leading to better results in tasks that require understanding and
interpretation of the text.
What is a 'named entity' and how is Named Entity Recognition (NER) useful in
NLP tasks?
● In Natural Language Processing, a named entity (NE) refers to real-world objects, such
as persons, dates, or locations, that are assigned proper names.
● Named Entity Recognition (NER) utilizes machine learning techniques, such as
sequence labeling and deep learning, to identify and categorize named entities within
larger bodies of text.
● NER recognizes entities like:
a. Person Names: E.g., "John Doe"
b. Locations: E.g., "New York"
c. Organizations: E.g., "Google"
d. Dates: E.g., "January 1, 2022"
e. Numeric References: E.g., "$10,000"
f. Product Names: E.g., "iPhone"
g. Time Notations: E.g., "4 PM"
Practical Applications:
● Information Retrieval and Summarization: Identifying entities aids in summarizing
content and retrieving specific information.
● Question Answering Systems: Helps to understand what or who a question is about.
● Relation Extraction: Can provide insight into relationships between recognizable entities.
● Sentiment and Opinion Analysis: Understanding the context in which named entities
appear can guide sentiment analysis.
● Geotagging: Identifying place names facilitates geographically tagging content.
● Recommendation Systems: Identifying products, organizations, or other named entities
enhances the power of recommendation systems.
● Competitive Intelligence: Identifying and categorizing company names and other
organizations can provide valuable insights for businesses.
● Legal and Regulatory Compliance Monitoring: For tasks like contract analysis, identifying
named entities can be crucial.
import spacy
nlp = spacy.load('en_core_web_sm')
# Sample text (illustrative)
text = "Apple is looking at buying a U.K. startup for $1 billion."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
What is sentiment analysis?
Sentiment analysis (SA), also known as opinion mining, is the computational study of people's
emotions, attitudes, and opinions from text data. Its core goal is determining whether a piece of
writing is positive, negative, or neutral.
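A quick way to try this is NLTK's built-in VADER analyzer; the two example sentences below are our own illustrations.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The movie was absolutely wonderful!"))
print(sia.polarity_scores("The plot was dull and the acting was terrible."))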
How does a ‘dependency parser’ work, and what information does it provide?
Core Functionalities:
1. Word-Level Classification: Each word is classified based on its relationship with others.
Examples of classifications are Subject (nsubj), Object (obj), and Modifiers (e.g. amod
for adjectival modification).
2. Arc Representation: These are labeled directed edges, or arcs, between words. They
represent a grammatical relationship and provide the basis for constructing the parsing
tree.
Parsing Procedure:
1. Initial Assignment: The parser begins by adding a universal "root" node that serves as the
starting point of the parse tree.
2. Iterative Classification: For every word, the algorithm assigns both a word-level
classification and a directed arc to another word, specifying the relationship from the first
to the second.
3. Tree Check: Throughout the process, the parser ensures that the set of classified arcs
forms a single, non-looping tree, known as a "projective parse tree."
4. Recursive Structure: The tree starts from the "root" node and recursively accounts for
word and arc classifications to create dependencies covering the entire sentence.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_, list(token.children))
This example uses the spaCy library to perform dependency parsing on the sentence: "The quick
brown fox jumps over the lazy dog." The output displays each word, its word-level classification
(dep_), the head word it is connected to, the head word's part-of-speech tag (pos_), and the
word's children.
N-grams are sequential word or character sets, with "n" indicating the number of elements in a
particular set. They play a crucial role in understanding context and text prediction, especially in
statistical language models.
Types of N-grams:
Unigrams (n = 1), bigrams (n = 2), and trigrams (n = 3) are the most commonly used orders:
from nltk import word_tokenize
from nltk.util import ngrams
text = "N-grams capture local word context"   # sample sentence (illustrative)
tokenized_text = word_tokenize(text.lower())
unigrams = list(ngrams(tokenized_text, 1))
bigrams = list(ngrams(tokenized_text, 2))
trigrams = list(ngrams(tokenized_text, 3))
The Bag of Words model, or BoW, is a fundamental technique in Natural Language Processing
(NLP). This model disregards the word order and syntax within a text, focusing instead on the
presence and frequency of words.
Working Mechanism:
● Text Collection: Gather a set of documents or a corpus.
● Tokenization: Split the text into individual words, known as tokens.
● Vocabulary Building: Identify unique tokens, constituting the vocabulary.
● Vectorization: Represent each document as a numerical vector, where each element
reflects word presence or frequency in the vocabulary.
# Sample data (illustrative documents)
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Visualize outputs
print(vectorizer.get_feature_names_out())
print(X.toarray())
Limitations:
● Loss of Word Order: Disregarding word dependencies and contextual meanings can
hinder performance.
● Lacking Context: Assigning the same weight to identical words across different
documents can lead to skewed representations.
● Dimensionality: The vector's length equals the vocabulary size, getting unwieldy with
large corpora.
● Word Sense Ambiguity: Fails to distinguish meanings of polysemous words (words with
multiple meanings) or homonyms.
● Non-linguistic Information: Ignores parts of speech, negation, and any linguistic
subtleties.
● Out-of-Vocabulary Words: Struggles to handle new words or spelling variations.
The Naive Bayes classifier is a popular choice for text classification tasks in Natural Language
Processing (NLP). It is preferred for its simplicity, speed, and effectiveness. Naive Bayes makes
use of Bag of Words techniques, treating the order of words as irrelevant. It calculates the
probability of a document belonging to a specific category based on the probability of words
occurring within that category.
● Bag of Words: Represents text as an unordered set of words, simplifying the text and
data representation.
● Conditional Independence Assumption: This core assumption of Naive Bayes states that
the presence of a word in a category is independent of the presence of other words.
Advantages:
● Simple to implement and very fast to train, even on large, high-dimensional text data.
● Works reasonably well with relatively small amounts of labeled training data.
● The conditional independence assumption keeps the number of parameters manageable.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Tiny illustrative dataset of (text, label) pairs
emails = [("Win a free prize now", "spam"), ("Claim your reward today", "spam"),
          ("Meeting agenda for Monday", "ham"), ("Lunch at noon tomorrow?", "ham")]
# More emails...
X, y = zip(*emails)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vec, y_train)
accuracy = accuracy_score(y_test, nb_classifier.predict(X_test_vec))
print(f"Accuracy: {accuracy:.2f}")
Limitations:
● Out-of-Vocabulary Words: The model may struggle with words it has not seen before.
Techniques like smoothing can reduce this issue.
● Word Sense Disambiguation: The model assumes that words have consistent meanings,
ignoring nuances. This can be problematic for words with multiple meanings.
● Zipf's Law: The classifier might be influenced by the frequency of words, prioritizing
common words and potentially overlooking rare, but valuable, ones.
● Contextual Information: Naive Bayes doesn't consider word context or word order,
making it less effective in tasks that require understanding of such nuances (like
sentiment analysis or contextual disambiguation).
How are Hidden Markov Models (HMMs) used for POS tagging, and what are their limitations?
An HMM tagger treats POS tags as hidden states and words as observations; transition and emission probabilities are learned from a tagged corpus, and the Viterbi algorithm recovers the most likely tag sequence for a new sentence. Limitations include:
● Lack of Context: HMMs are limited to local data, which can lead to suboptimal
performance, especially in complex tasks requiring global context.
● Scalability Concerns: With growing datasets and evolving language use, constant model
retraining and capacity to encompass lexicons become necessary.
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm
# Get data
nltk.download('treebank')
data = treebank.tagged_sents()[:3000]
# Split data into 80% training and 20% testing
split = int(len(data) * 0.8)
train_data = data[:split]
test_data = data[split:]
# Train an HMM-based POS tagger
tagger = hmm.HiddenMarkovModelTrainer().train(train_data)
# Evaluate accuracy
accuracy = tagger.evaluate(test_data)
print(f"Accuracy: {accuracy:.2%}")