
CMR INSTITUTE OF TECHNOLOGY

(Affiliated to VTU, Approved by AICTE, Accredited by NBA and NAAC with “A++” Grade)
ITPL MAIN ROAD, BROOKFIELD, BENGALURU-560037, KARNATAKA, INDIA

Department of Artificial Intelligence and Machine Learning

LAB MANUAL
(Effective from the academic year 2024-2025 under 2022 CBCS scheme)

Subject: Natural Language Processing Laboratory (Integrated)


Subject Code: BAI601
Semester: 6
1. Write a Python program for the following preprocessing of text in NLP:
●​ Tokenization
●​ Filtration
●​ Script Validation
●​ Stopword Removal
●​ Stemming

STEPS:-
1. Install ‘nltk’ Python library (Natural Language Toolkit) using ‘pip’ command.

CODE:-
!pip install nltk

OUTPUT:-

2. Download the required modules/corpora from the ‘nltk’ library.

CODE:-
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

OUTPUT:-
3. Perform text preprocessing tasks.

CODE:-
# Import all the required libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
import re

# Function to perform text preprocessing
def preprocess_text(text):
    # Tokenization: Break the large text into smaller units.
    tokens = word_tokenize(text, language='english', preserve_line=True)
    print(f'Tokens: {tokens}')

    # Filtration: Remove punctuation and special characters.
    fil_tokens = [word for word in tokens if word.isalnum()]
    print(f'Filtered tokens: {fil_tokens}')

    # Script Validation: Keep only tokens made up of English alphabetic characters.
    valid_tokens = [word for word in tokens if re.match(r'^[a-zA-Z]+$', word)]
    print(f'Script validated tokens: {valid_tokens}')

    # Stopword Removal: Remove commonly occurring words that carry little meaning.
    stop_words = set(stopwords.words('english'))
    new_tokens = [word for word in valid_tokens if word.lower() not in stop_words]
    print(f'Tokens after stopword removal: {new_tokens}')

    # Stemming: Cut off word endings to obtain the root form.
    stemmer = PorterStemmer()
    stem_tokens = [stemmer.stem(word) for word in new_tokens]
    print(f'Stemmed tokens: {stem_tokens}')

    # Lemmatization: Reduce words to their dictionary base form.
    lemmatizer = WordNetLemmatizer()
    lemma_tokens = [lemmatizer.lemmatize(word) for word in new_tokens]
    print(f'Lemmatized tokens: {lemma_tokens}')

# Declare the text for preprocessing
text = "Natural language processing (NLP) is an interesting subject."
preprocess_text(text)

OUTPUT:-
2. Write a Python program to demonstrate the N-gram modeling to analyze and
establish the probability distribution across sentences and explore the utilization
of unigrams, bigrams, and trigrams in diverse English sentences to illustrate the
impact of varying n-gram orders on the calculated probabilities.

CODE:-
import nltk
from nltk.util import ngrams
from collections import Counter

# Download the necessary NLTK data files


nltk.download('punkt')
nltk.download('punkt_tab')

# Sample corpus
sentences = [
"I love programming.",
"I love learning new things.",
"Programming is fun." ]

# Preprocess the sentences: Tokenize and clean the text


tokenized = [nltk.word_tokenize(s.lower()) for s in sentences]

# Function to calculate probabilities for n-grams


def calculate_ngram(sentences, n):
    # Flatten the list of tokenized sentences
    tokens = [token for s in sentences for token in s]

    # Generate n-grams and count their frequencies
    n_grams = list(ngrams(tokens, n))
    ngram_freq = Counter(n_grams)

    # Frequencies of the first n-1 words, needed for the conditional probability
    n_1_gram_freq = Counter(ngrams(tokens, n - 1)) if n > 1 else None

    # Calculate probabilities P(w_n | w_1, ..., w_{n-1})
    ngram_probabilities = {}
    for ngram, count in ngram_freq.items():
        if n == 1:
            # Unigram probability P(w) = count(w) / total number of tokens
            prob = count / len(tokens)
        else:
            prefix = ngram[:-1]                   # First n-1 words
            prefix_count = n_1_gram_freq[prefix]  # Count of the first n-1 words
            # Conditional probability P(w_n | w_1, ..., w_{n-1}) = count(ngram) / count(prefix)
            prob = count / prefix_count if prefix_count > 0 else 0
        ngram_probabilities[ngram] = prob

    return ngram_probabilities

# Function to display probabilities in a readable format


def display_ngram(probabilities, n):
    print(f"\n{n}-Gram Probabilities:")
    for ngram, prob in probabilities.items():
        print(f" P({ngram}) = {prob:.4f}")

# Calculate and display probabilities for unigrams, bigrams, and trigrams


for n in [1, 2, 3]:
probabilities = calculate_ngram(tokenized, n)
display_ngram(probabilities, n)

OUTPUT:-
3. Investigate the Minimum Edit Distance (MED) algorithm and its application in
string comparison and the goal is to understand how the algorithm efficiently
computes the minimum number of edit operations required to transform one
string into another.
●​ Test the algorithm on strings with different types of variations (e.g., typos,
substitutions, insertions, deletions).
●​ Evaluate its adaptability to different types of input variations.

CODE:-
# Import the required libraries
import numpy as np

# Function to compute minimum edit distance


def med(str1, str2):
m = len(str1) + 1
n = len(str2) + 1

# Initialize a 2D Array
dp = np.zeros((m,n), dtype=int)

# Use dynamic programming technique to compute MED


for i in range(m):
dp[i][0] = i
for j in range(n):
dp[0][j] = j

# MED Algorithm
for i in range(1,m):
for j in range(1,n):
if str1[i-1] == str2[j-1]:
dp[i][j] = dp[i-1][j-1]
else:
dp[i][j] = min(dp[i-1][j]+1, dp[i][j-1]+1, dp[i-1][j-1]+1)

return dp[m-1][n-1]

# Give 2 words to compute MED and print it


s1 = input("Enter 1st word:")
s2 = input("Enter 2nd word:")
print(f"Minimum edit distance required to convert '{s1}' to '{s2}' is: {med(s1, s2)}")

OUTPUT:-
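To exercise the bullets in the problem statement (typos, substitutions, insertions, deletions), the same med() function can also be run over a small batch of word pairs. A minimal sketch, assuming the med() function defined above is available; the word pairs are illustrative choices:

# Exercise med() on different variation types (illustrative pairs)
test_pairs = [
    ("recieve", "receive"),     # typo (transposed letters)
    ("cat", "cart"),            # single insertion
    ("flaw", "law"),            # single deletion
    ("kitten", "sitting"),      # substitutions plus an insertion
    ("intention", "execution")  # mixed edits (classic textbook example)
]

for a, b in test_pairs:
    print(f"MED('{a}', '{b}') = {med(a, b)}")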
4. Write a program to implement top-down and bottom-up parser using
appropriate context free grammar.

CODE:-
# Import the CFG library
import nltk
from nltk import CFG

# Declare appropriate CFG rules


grammar=CFG.fromstring("""
S ->NP VP
NP ->Det N|Det N PP|'John'|'Alice'
VP ->V NP|V NP PP
PP ->P NP
Det ->'the'|'a'
N ->'cat'|'dog'|'park'
V ->'saw'|'chased'
P ->'in'|'with'
""")
sentence=['John','saw','the','dog','in','the','park']

# Top-down parser
print("\nTop-Down Parsing(ChartParser):")
top_down_parser=nltk.ChartParser(grammar)
for tree in top_down_parser.parse(sentence):
print(tree)
tree.pretty_print()

# Bottom-up parser
print("\nBottom-up Parsing(Shift-Reduce Parser):")
bottom_up_parser=nltk.ShiftReduceParser(grammar)
for tree in bottom_up_parser.parse(sentence):
print(tree)
tree.pretty_print()

OUTPUT:-
5. Given the following short movie reviews, each labeled with a genre, either
comedy or action:
●​ fun, couple, love, love - comedy
●​ fast, furious, shoot - action
●​ couple, fly, fast, fun, fun - comedy
●​ furious, shoot, shoot, fun - action
●​ fly, fast, shoot, love - action
A new document D: fast, couple, shoot, fly.
Compute the most likely class for D. Assume a Naïve Bayes classifier and use
add 1 smoothing for the likelihoods.

CODE:-
# Import the required libraries
from collections import Counter

# Training dataset with words and their respective classes (comedy or action)
training_data = [
(['fun', 'couple', 'love', 'love'], 'comedy'),
(['fast', 'furious', 'shoot'], 'action'),
(['couple', 'fly', 'fast', 'fun', 'fun'], 'comedy'),
(['furious', 'shoot', 'shoot', 'fun'], 'action'),
(['fly', 'fast', 'shoot', 'love'], 'action')
]

# Document to classify
doc = ['fast', 'couple', 'shoot', 'fly']

# Count the number of documents in training data


total_docs = len(training_data)
print("Total no. of documents in training data:", total_docs)

# Count the number of documents in each class


total_comedy, total_action = 0, 0
for d in training_data:
if d[1] == 'comedy':
total_comedy += 1
elif d[1] == 'action':
total_action += 1
print("Total 'comedy' documents:", total_comedy)
print("Total 'action' documents:", total_action)

# Compute prior probabilities (P(Class))


comedy_prob = total_comedy / total_docs
action_prob = total_action / total_docs
print("Probability of 'comedy' documents:", comedy_prob)
print("Probability of 'action' documents:", action_prob)

# Find unique words in the training data


unique = set()
for d in training_data:
for w in d[0]:
unique.add(w)

print("\nUnique words in all documents:", unique)


size = len(unique)
print("Size of unique words:", size)

# Initialize word frequency dictionaries for both classes


comedy_dict = {w: 0 for w in unique}
action_dict = {w: 0 for w in unique}

# Count occurrences of words in each class


comedy_word_count = 0
action_word_count = 0
for d in training_data:
if d[1] == 'comedy':
for w in d[0]:
comedy_dict[w] += 1
comedy_word_count += 1
elif d[1] == 'action':
for w in d[0]:
action_dict[w] += 1
action_word_count += 1

print("\nWord occurrences in 'comedy':", comedy_dict)


print("Word occurrences in 'action':", action_dict)

print("No. of word occurrences in 'comedy':", comedy_word_count)


print("No. of word occurrences in 'action':", action_word_count)

# Compute word probabilities using Laplace (Add 1) smoothing


comedy_probabilities = {w: (comedy_dict[w] + 1) / (comedy_word_count + size) for w in unique}
action_probabilities = {w: (action_dict[w] + 1) / (action_word_count + size) for w in unique}

print("\n'Comedy' word probabilities:", comedy_probabilities)


print("'Action' word probabilities:", action_probabilities)

# Compute Naïve Bayes probabilities for the given document


comedy_doc_prob = comedy_prob
action_doc_prob = action_prob

for w in doc:
    comedy_doc_prob *= comedy_probabilities.get(w, 1 / (comedy_word_count + size))
    action_doc_prob *= action_probabilities.get(w, 1 / (action_word_count + size))

print(f"\nProbability of 'comedy' for given document: {comedy_doc_prob:.8f}")
print(f"Probability of 'action' for given document: {action_doc_prob:.8f}")

# Predict the most likely class
predicted_class = "COMEDY" if comedy_doc_prob > action_doc_prob else "ACTION"
print(f"\nPredicted class for the document: '{predicted_class}'")

OUTPUT:-
Total no. of documents in training data: 5
Total 'comedy' documents: 2
Total 'action' documents: 3
Probability of 'comedy' documents: 0.4
Probability of 'action' documents: 0.6
Unique words in all documents: {'couple', 'fast', 'furious', 'fly', 'love', 'fun', 'shoot'}
Size of unique words: 7

Word occurrences in 'comedy': {'couple': 2, 'fast': 1, 'furious': 0, 'fly': 1, 'love': 2, 'fun': 3, 'shoot': 0}
Word occurrences in 'action': {'couple': 0, 'fast': 2, 'furious': 2, 'fly': 1, 'love': 1, 'fun': 1, 'shoot': 4}
No. of word occurrences in 'comedy': 9
No. of word occurrences in 'action': 11

'Comedy' word probabilities: {'couple': 0.1875, 'fast': 0.125, 'furious': 0.0625, 'fly': 0.125, 'love': 0.1875, 'fun': 0.25, 'shoot': 0.0625}
'Action' word probabilities: {'couple': 0.05555555555555555, 'fast': 0.16666666666666666, 'furious': 0.16666666666666666, 'fly': 0.1111111111111111, 'love': 0.1111111111111111, 'fun': 0.1111111111111111, 'shoot': 0.2777777777777778}

Probability of 'comedy' for given document: 0.00007324
Probability of 'action' for given document: 0.00017147

Predicted class for the document: 'ACTION'


6. Demonstrate the following using appropriate programming tools which
illustrates the use of information retrieval in NLP.
●​ Study various corpora (Brown, Inaugural Address, Reuters, udhr) with
various methods like fileid, raw, words, sents, categories.
●​ Create and use your own corpus (plain text, categorical).
●​ Study conditional frequency distributions.
●​ Study of tagged corpora with methods like tagged_sents, tagged_words.
●​ Write a program to find the most frequent noun tags.
●​ Map words to properties using python dictionaries.
●​ Study rule based tagger and unigram tagger.
Find different words from a given plain text without any space by comparing this
text with a given corpus of words. Also find the score of the words.

1. Study various corpora.

CODE:-
# Install and import the required libraries
import nltk
import matplotlib.pyplot as plt
from nltk.corpus import brown, inaugural, reuters, udhr
from nltk import FreqDist, ConditionalFreqDist, word_tokenize, pos_tag
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.tag import UnigramTagger, DefaultTagger, RegexpTagger
import string

nltk.download('brown')
nltk.download('inaugural')
nltk.download('reuters')
nltk.download('udhr')

OUTPUT:-
CODE:-
# Brown
print("Brown Categories:", brown.categories())
print("Brown File IDs:", brown.fileids()[:5])
print("Brown Words (News):", brown.words(categories='news')[:10])
print("Brown Sentences:", brown.sents(categories='news')[:2])

# Inaugural Address
print("\nInaugural File IDs:", inaugural.fileids()[:5])
print("Inaugural Words (2009):", inaugural.words('2009-Obama.txt')[:10])

# Reuters
print("\nReuters Categories:", reuters.categories()[:5])
print("Reuters File IDs:", reuters.fileids()[:5])
print("Reuters Words:", reuters.words(reuters.fileids()[0])[:10])

# UDHR
print("\nUDHR Languages:", udhr.fileids()[:5])
print("UDHR Words (English):", udhr.words('English-Latin1')[:10])

OUTPUT:-

2. Create and use your own corpus.


●​ On your desktop, create a folder that contains multiple files. Input some data into those
text files.

CODE:-
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:\\Users\\Student\\Desktop\\My Corpus'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()

OUTPUT:-

3. Study conditional frequency distributions.

CODE:-
# Conditional frequency distribution for Brown corpus
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

OUTPUT:-

CODE:-
# Conditional frequency distribution for Inaugural Address corpus
from nltk.corpus import inaugural
inaugural.fileids()
[fileid[:4] for fileid in inaugural.fileids()]
cfd = nltk.ConditionalFreqDist(
(target, fileid[:4])
for fileid in inaugural.fileids()
for w in inaugural.words(fileid)
for target in ['america', 'citizen']
if w.lower().startswith(target))
cfd.plot()

OUTPUT:-

CODE:-
# Conditional frequency distribution for UDHR corpus
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch',
'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
(lang, len(word))
for lang in languages
for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)
OUTPUT:-

4. Study of tagged corpora with methods like tagged_sents, tagged_words.

CODE:-
# Tagged words and sentences from Brown corpus
tagged_words = brown.tagged_words(categories='news')
tagged_sents = brown.tagged_sents(categories='news')

print("Tagged Words:", tagged_words[:10])


print("Tagged Sentences:", tagged_sents[:2])

OUTPUT:-
5. Write a program to find the most frequent noun tags.

CODE:-
# Noun tags usually start with 'NN'
nouns = [word for word, tag in tagged_words if tag.startswith('NN')]
fd_nouns = FreqDist(nouns)
print("Most Common Nouns:", fd_nouns.most_common(10))

OUTPUT:-

6. Map words to properties using python dictionaries.

CODE:-
word_props = {
'dog': {'type': 'noun', 'sentiment': 'neutral'},
'run': {'type': 'verb', 'sentiment': 'positive'},
'hate': {'type': 'verb', 'sentiment': 'negative'}
}
print(word_props['run']['sentiment'])

OUTPUT:-

7. Study rule based tagger and unigram tagger.

CODE:-
# Unigram Tagger
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])

OUTPUT:-
CODE:-
# Rule-based Tagger
patterns = [
(r'.*ing$', 'VBG'), # gerunds
(r'.*ed$', 'VBD'), # simple past
(r'.*es$', 'VBZ'), # 3rd singular present
(r'.*ould$', 'MD'), # modals
(r'.*\'s$', 'NN$'), # possessive nouns
(r'.*s$', 'NNS'), # plural nouns
(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
(r'.*', 'NN') # nouns (default)
]

regexp_tagger = nltk.RegexpTagger(patterns)
tagged=regexp_tagger.tag(brown_sents[1])
print(tagged)

OUTPUT:-
8. Find different words from a given plain text without any space.

CODE:-
!pip install wordninja

OUTPUT:-

CODE:-
import wordninja
wordninja.split('thisisabeautifulday')

OUTPUT:-
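wordninja does not expose a score for each recovered word, while the task also asks for one. A minimal sketch, under the assumption that a word's relative frequency in a reference corpus (here the Brown corpus loaded earlier) is an acceptable score:

import wordninja
from nltk import FreqDist
from nltk.corpus import brown

# Segment the space-free text into candidate words
words = wordninja.split('thisisabeautifulday')

# Score each word by its relative frequency in the Brown corpus
fd = FreqDist(w.lower() for w in brown.words())
total = fd.N()
for w in words:
    score = fd[w.lower()] / total   # 0.0 if the word never occurs in the corpus
    print(f"{w}: {score:.6f}")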
7. Write a Python program to find synonyms and antonyms of the word ‘active’
using WordNet.

CODE:-
# Import the required libraries
import nltk
from nltk.corpus import wordnet

# Download these modules only if they are not present, else skip it
nltk.download('wordnet')
nltk.download('omw-1.4')

# Function to find synonyms and antonyms of a word


def get_syn_ant(word):
syno = set()
anto = set()
for s in wordnet.synsets(word):
for lemma in s.lemmas():
syno.add(lemma.name())
if lemma.antonyms():
anto.add(lemma.antonyms()[0].name())
return syno, anto

# Declare the given word and print its synonyms and antonyms
word = 'active'
syno, anto = get_syn_ant(word)
print(f'Synonyms of {word} : {syno}')
print(f'Antonyms of {word} : {anto}')

OUTPUT:-
8. Implement the machine translation application of NLP where it needs to train a
machine translation model for a language with limited parallel corpora.
Investigate and incorporate techniques to improve performance in low-resource
scenarios.

CODE:-
# Install and import the required modules and libraries
!pip install transformers
!pip install sentencepiece
!pip install torch
from collections import Counter
import nltk
from nltk.util import ngrams
from transformers import MarianMTModel, MarianTokenizer
import torch
import sentencepiece as spm
import random
nltk.download('punkt')

OUTPUT:-

CODE:-
# Define the corpus
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_path = r"C:\Users\Student\Documents\en-hi.txt"
corpus = PlaintextCorpusReader(corpus_path,"/en-hi.txt")
english_sentences = corpus.words(corpus_path + "/tldr-pages.en-hi.en")
hindi_sentences = corpus.words(corpus_path + "/tldr-pages.en-hi.hi")

# Preprocessing
!pip install sacremoses
def tokenize_sentences(corpus, lang):
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
    if isinstance(corpus, list):
        if lang == 'src':
            return [tokenizer(sentence, return_tensors='pt', padding=True,
                              truncation=True) for sentence, _ in corpus]
        else:
            return [tokenizer(sentence, return_tensors='pt', padding=True,
                              truncation=True) for _, sentence in corpus]
    else:
        sentences = [''.join(corpus)]
        return [tokenizer(sentence, return_tensors='pt', padding=True,
                          truncation=True) for sentence in sentences]

src_tokens = tokenize_sentences(english_sentences, 'src')


tgt_tokens = tokenize_sentences(hindi_sentences, 'tgt')

OUTPUT:-
CODE:-
import sys
print(sys.version)
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-hi")

OUTPUT:-

CODE:-
# Train the machine translation model
def train_model(model, src_tokens, tgt_tokens, epochs=5):
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
    loss_fn = torch.nn.CrossEntropyLoss()  # defined but unused; the model returns its own loss

    model.train()   # put the model into training mode
    for epoch in range(epochs):
        total_loss = 0
        for src_token, tgt_token in zip(src_tokens, tgt_tokens):
            optimizer.zero_grad()
            output = model(**src_token, labels=tgt_token['input_ids'])
            loss = output.loss
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(src_tokens)}")

train_model(model, src_tokens, tgt_tokens)

OUTPUT:-
CODE:-
def translate_text(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors='pt')
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the model


example_sentence = 'I love machine learning'
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
translated_text = translate_text(model, tokenizer, example_sentence)
print(f"Original: {example_sentence}\nTranslated: {translated_text}")

OUTPUT:-
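One widely used way to improve performance in low-resource scenarios, as the problem statement asks, is back-translation: monolingual Hindi text is translated into English with a reverse model, and the resulting synthetic (English, Hindi) pairs are added to the small parallel corpus before training. A minimal sketch, assuming the pretrained reverse model "Helsinki-NLP/opus-mt-hi-en" and an illustrative list of monolingual Hindi sentences:

# Back-translation for data augmentation (sketch)
from transformers import MarianMTModel, MarianTokenizer

reverse_name = "Helsinki-NLP/opus-mt-hi-en"
reverse_tokenizer = MarianTokenizer.from_pretrained(reverse_name)
reverse_model = MarianMTModel.from_pretrained(reverse_name)

def back_translate(hindi_sentences):
    pairs = []
    for hi in hindi_sentences:
        inputs = reverse_tokenizer(hi, return_tensors='pt', truncation=True)
        outputs = reverse_model.generate(**inputs)
        synthetic_en = reverse_tokenizer.decode(outputs[0], skip_special_tokens=True)
        # (synthetic English, genuine Hindi) becomes an extra training pair
        pairs.append((synthetic_en, hi))
    return pairs

# Illustrative placeholder list of monolingual Hindi sentences
monolingual_hi = ["मुझे मशीन लर्निंग पसंद है।"]
augmented_pairs = back_translate(monolingual_hi)
print(augmented_pairs)

The augmented pairs can then be tokenized with tokenize_sentences() and passed to train_model() along with the original data.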
EXTRA PROGRAMS

1. Develop a Python program to perform sentiment analysis on movie reviews using an appropriate dataset.

CODE:-
# Install dependencies
!pip install pandas scikit-learn nltk

# Imports and setup


import pandas as pd, re, nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

# Load and preprocess


df = pd.read_csv('IMDB Dataset.csv')
lemmatizer, stop_words = WordNetLemmatizer(), set(stopwords.words('english'))

def preprocess(text):
    return ' '.join(lemmatizer.lemmatize(w) for w in
                    re.sub(r'[^a-z]', ' ', text.lower()).split() if w not in stop_words)

df['review'] = df['review'].apply(preprocess)

# Vectorize and encode


X = TfidfVectorizer(max_features=10000).fit_transform(df['review'])
y = df['sentiment'].map({'positive': 1, 'negative': 0})

# Split, train, predict


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

OUTPUT:-
2. Develop a Python program to implement a typical resume keyword matcher.

●​ Create a sample resume (resume_sample.pdf) and job description (job_description_sample.pdf) files on your desktop. You can download samples from the internet and save them as PDF files using the filenames provided.

CODE:-
1. Install the required libraries.
pip install PyMuPDF nltk scikit-learn pdfminer.six

2. Develop a resume keyword matcher.


import fitz
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Preload stopwords and lemmatizer


stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def extract_text(pdf_file):
return " ".join(page.get_text() for page in fitz.open(pdf_file))

def preprocess(text):
    text = re.sub(r'[^\w\s]', '', text.lower())
    return " ".join(lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words)

def extract_keywords(text):
return Counter(text.split())

def compare_keywords(resume_kw, job_kw):


common = resume_kw.keys() & job_kw.keys()
missing = job_kw.keys() - resume_kw.keys()
match = (len(common) / len(job_kw) * 100) if job_kw else 0
return match, common, missing

# Main
resume_text = preprocess(extract_text('resume_sample.pdf'))
job_text = preprocess(extract_text('job_description_sample.pdf'))

resume_kw = extract_keywords(resume_text)
job_kw = extract_keywords(job_text)

match, common, missing = compare_keywords(resume_kw, job_kw)

print(f"Match Percentage: {match:.2f}%")


print(f"Common Keywords: {common}")
print(f"Missing Keywords: {missing}")

OUTPUT:-
VIVA-VOCE

What is NLP?
Natural language processing (NLP) is a field of computer science and a subfield of artificial
intelligence that aims to make computers understand human language.
(Refer: Natural Language Processing (NLP) - Overview - GeeksforGeeks)

NLP Tools
1.​ Tokenization and Segmentation: Dividing text into its elementary units, such as words or
sentences.
2.​ POS Tagging (Part-of-Speech Tagging): Assigning grammatical categories to words, like
nouns, verbs, or adjectives.
3.​ Named Entity Recognition (NER): Identifying proper nouns or specific names in text.
4.​ Lemmatization and Stemming: Reducing words to their root form or a common base.
5.​ Word Sense Disambiguation: Determining the correct meaning of a word with multiple
interpretations based on the context.
6.​ Parsing: Structurally analyzing sentences and establishing dependencies between
words.
7.​ Sentiment Analysis: Assessing emotions or opinions expressed in text.

Challenges in NLP

●​ Ambiguity: Language is inherently ambiguous, with words or phrases having multiple interpretations.
●​ Context Sensitivity: The meaning of a word may vary depending on the context in which
it's used.
●​ Variability: Linguistic variations, including dialects or slang, pose challenges for NLP
models.
●​ Complex Sentences: Understanding intricate sentence structures, especially in literature
or legal documents, can be demanding.
●​ Negation and Irony: Recognizing negated statements or sarcasm is still a hurdle for
many NLP models.

History and Key Milestones


1.​ 1950s: Alan Turing introduces the Turing Test.
2.​ 1957: Noam Chomsky lays the foundation for formal language theory.
3.​ 1966: ELIZA, the first chatbot, demonstrates NLP capabilities.
4.​ 1978: SHRDLU, an early NLP system, interprets natural language commands in a block
world environment.
5.​ 1983: Chomsky's theories are integrated into practical models with the development of
HPSG (Head-driven Phrase Structure Grammar).
6.​ 1990s: Probabilistic models gain prominence in NLP.
7.​ Early 2000s: Machine learning, especially neural networks, becomes increasingly
influential in NLP.
8.​ 2010s: The deep learning revolution significantly advances NLP.

The Importance of NLP in Industry

●​ Text Classification: Automates tasks like email sorting and news categorization.
●​ Sentiment Analysis: Tracks public opinion on social media and product reviews.
●​ Machine Translation: Powers platforms like Google Translate.
●​ Chatbots and Virtual Assistants: Enables automated text or voice interactions.
●​ Voice Recognition: Facilitates speech-to-text and smart speakers.
●​ Search and Recommendation Systems: Enhances user experience on websites and
apps.

What do you understand about the terms 'corpus', 'tokenization', and 'stopwords'
in NLP?

Corpus:

●​ A corpus (plural: corpora) is a structured collection of text, often serving as the foundation for building a language model. Corpora can be domain-specific, say, for legal or medical texts, or general, covering a range of topics.
●​ It acts as a textual dataset for tasks like language model training, sentiment analysis, and more.

Tokenization:

●​ Tokenization is the process of breaking longer text into discrete units, or tokens, which could be words, n-grams, or characters. Common tokenization strategies include splitting by whitespace, punctuation, or specific vocabularies. Tokenization is a foundational step for most text processing tasks and is crucial for approaches such as neural networks for NLP.

Stopwords:

●​ Stopwords are words that are often removed from texts during processing because they carry little meaning on their own (e.g., 'is', 'the', 'and') in bag-of-words models. By eliminating stopwords, we can focus on content-carrying words and reduce data dimensionality, enhancing computational efficiency and, in some applications, improving the accuracy of text classification or clustering. The short sketch below ties these three terms together.
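A minimal NLTK sketch, assuming the Brown corpus is used as the example corpus and the punkt and stopwords resources have been downloaded:

import nltk
from nltk.corpus import brown, stopwords
from nltk.tokenize import word_tokenize

nltk.download('brown')
nltk.download('punkt')
nltk.download('stopwords')

# Corpus: a structured collection of text
print("Corpus size (words):", len(brown.words()))

# Tokenization: break raw text into tokens
text = "NLP breaks text into tokens and removes stopwords."
tokens = word_tokenize(text.lower())
print("Tokens:", tokens)

# Stopwords: frequent, low-content words that are often removed
stop_words = set(stopwords.words('english'))
content_tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
print("Content tokens:", content_tokens)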
Distinguish between morphology and syntax in the context of NLP.

Morphology deals with the structure and formation of words; syntax deals with their arrangement in sentences. Morphology looks at individual words, while syntax considers sentences and the relationships between words. A short sketch contrasting the two follows the task lists below.

●​ Morphology: Words, prefixes, suffixes, root forms, infixes, and more.


●​ Syntax: The syntactic structures and relationships between words. For example, the
subject, verb, and object in a sentence or the noun phrases and verb phrases within it.
●​ Morphology: The smallest grammatical units within the word (e.g., the morphemes “un-” and “happy” in “unhappy”).
●​ Syntax: The combination of words within a sentence to form meaningful structures or
phrases.
●​ Morphology is key for tasks such as stemming (reducing inflected words to their base or
root form) and lemmatization (reducing words to a common base or lemma).
●​ Syntax is essential for grammar checking, part-of-speech tagging, and more
sophisticated tasks such as natural language understanding and generation.

Morphological analysis helps with tasks such as:

●​ Stemming and Lemmatization: Reducing words to their basic form improves computational efficiency and information retrieval accuracy.
●​ Morphological Generation: Constructing words and their variations is useful in text
generation and inflected language processing.
●​ Morphological Tagging: Identifying morphological properties of words contributes to
accurate part-of-speech tagging, which in turn supports tasks like information extraction
and machine translation.

Syntactic analysis is crucial for tasks such as:

●​ Parsing: Uncovering grammatical structures in sentences supports semantic interpretation and knowledge extraction.
●​ Sentence Boundary Detection: Identifies sentence boundaries, aiding in various
processing tasks, such as summarization and text segmentation.
●​ Gross Syntactic Tasks: Such as identifying subjects and objects, verb clustering, and
maintaining grammatical accuracy in tasks like language generation and style transfer.
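A small sketch contrasting the two levels of analysis with NLTK: a stemmer for the morphological (word-internal) view, and POS tagging plus a simple noun-phrase chunker for the syntactic (sentence-level) view. The chunk grammar is an illustrative assumption:

import nltk
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The unhappy students reread the difficult chapters."
tokens = nltk.word_tokenize(sentence)

# Morphology: looks inside individual words (prefixes, suffixes, roots)
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# Syntax: looks at how words combine into phrases and sentences
tagged = nltk.pos_tag(tokens)
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")   # illustrative NP rule
print(chunker.parse(tagged))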

Stemming v/s Lemmatization

Consider the sentence: "Riders are riding the horses riding wildly."

Stemming:

●​ Tokenized Version: ['Riders', 'are', 'riding', 'the', 'horses', 'riding', 'wildly']


●​ Stemmed Version: ['Rider', 'ar', 'ride', 'the', 'hors', 'ride', 'wild']
Lemmatization:

●​ Tokenized Version: ['Riders', 'are', 'riding', 'the', 'horses', 'riding', 'wildly']


●​ Lemmatized Version: ['Rider', 'be', 'ride', 'the', 'horse', 'ride', 'wildly']
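A sketch that reproduces the comparison above with NLTK; the exact stemmer and lemmatizer outputs may differ slightly from the illustrative lists (for example, the Porter stemmer lowercases tokens and yields stems such as 'wildli'):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')

sentence = "Riders are riding the horses riding wildly."
tokens = nltk.word_tokenize(sentence)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print("Tokenized: ", tokens)
print("Stemmed:   ", [stemmer.stem(t) for t in tokens])
# Lemmatizing with pos='v' maps 'are' to 'be' and 'riding' to 'ride'
print("Lemmatized:", [lemmatizer.lemmatize(t, pos='v') for t in tokens])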

Explain the significance of Part-of-Speech (POS) tagging in NLP.

●​ Part-of-Speech (POS) Tagging plays a fundamental role in natural language processing by identifying the grammatical components of text, such as words and phrases, and labeling them with their corresponding parts of speech.
●​ POS tagging is often the initial step in more advanced syntactic parsing tasks, such as
chunking or full parsing, that help uncover the broader grammatical structure of a
sentence.
●​ This provides the semantic context necessary for understanding the subtle nuances and
deeper meanings within the text.
●​ POS tags are used to extract and identify key pieces of information from a body of text.
This function is essential for tools that aim to summarize or extract structured
information, such as named-entity recognition and relation extraction.
●​ In some cases, the grammatical form of a word, as captured by its POS tag, can be the
clue needed to discern its semantic meaning. For instance, the same word might
function as a noun or verb, with vastly different interpretations: consider the word "sink."
●​ POS tagging aids in identifying the base or root form of a word. This is an essential task
for a variety of NLP applications, like search engines or systems monitoring sentiment,
as analyzing a word's structure can reveal more about its context and significance in a
given text.
●​ In the domain of customer service, businesses can use POS tagging to understand the
intention behind customer queries. This capability can drive automation strategies like
chatbots, where customer requests can be tagged with important grammatical
information to inform proper responses.
●​ In media monitoring and sentiment analysis, POS tagging is used to identify the key
components of sentences, phrases, or paragraphs, which in turn can help determine
sentiment or extract useful data.

POS Challenges:

●​ Ambiguity: Many words can serve as different parts of speech, depending on their use or
context.
●​ Multiple Tags: Some words can have more than one POS, such as the word "well" which
can be an adverb, adjective, or noun.
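A short NLTK sketch illustrating the "sink" example above; with the averaged perceptron tagger downloaded, the same surface word should receive different tags depending on its context:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# 'sink' as a noun vs. 'sink' as a verb
print(nltk.pos_tag(nltk.word_tokenize("The kitchen sink is full.")))
print(nltk.pos_tag(nltk.word_tokenize("Heavy stones sink quickly.")))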

Describe lemmatization and stemming. When would you use one over the other?

Both lemmatization and stemming are methods for reducing inflected words to their root forms.
Stemming:

●​ Definition: Stemming uses an algorithmic, rule-based approach to cut off word endings,
producing the stem. This process can sometimes result in non-real words, known as
"raw" stems.
●​ Example: The stem of "running" is "run."
●​ Code Example (Using NLTK):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stem = stemmer.stem("running")

Lemmatization:

●​ Definition: Lemmatization, on the other hand, uses a linguistic approach that considers
the word's context. It maps inflected forms to their base or dictionary form (lemma).
●​ Example: The lemma of "running" is "run".
●​ Code Example (Using NLTK):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemma = lemmatizer.lemmatize("running", pos="v")  # Need to specify the part of speech (pos)

When to Choose Each:

●​ Stemming: Useful for tasks like text classification when speed is a priority. The technique
is simpler and quicker than lemmatization, but it may sacrifice precision and produce
non-real words.
●​ Lemmatization: Ideal when semantic accuracy is crucial, such as in question-answering
systems or topic modeling. It ensures that the root form retains its existing meaning in
the text, potentially leading to better results in tasks that require understanding and
interpretation of the text.

What is a 'named entity' and how is Named Entity Recognition (NER) useful in
NLP tasks?

●​ In Natural Language Processing, a named entity (NE) refers to real-world objects, such
as persons, dates, or locations, that are assigned proper names.
●​ Named Entity Recognition (NER) utilizes machine learning techniques, such as
sequence labeling and deep learning, to identify and categorize named entities within
larger bodies of text.
●​ NER recognizes entities like:
a.​ Person Names: E.g., "John Doe"
b.​ Locations: E.g., "New York"
c.​ Organizations: E.g., "Google"
d.​ Dates: E.g., "January 1, 2022"
e.​ Numeric References: E.g., "$10,000"
f.​ Product Names: E.g., "iPhone"
g.​ Time Notations: E.g., "4 PM"

Practical Applications:
●​ Information Retrieval and Summarization: Identifying entities aids in summarizing
content and retrieving specific information.
●​ Question Answering Systems: Helps to understand what or who a question is about.
●​ Relation Extraction: Can provide insight into relationships between recognizable entities.
●​ Sentiment and Opinion Analysis: Understanding the context in which named entities
appear can guide sentiment analysis.
●​ Geotagging: Identifying place names facilitates geographically tagging content.
●​ Recommendation Systems: Identifying products, organizations, or other named entities
enhances the power of recommendation systems.
●​ Competitive Intelligence: Identifying and categorizing company names and other
organizations can provide valuable insights for businesses.
●​ Legal and Regulatory Compliance Monitoring: For tasks like contract analysis, identifying
named entities can be crucial.

Code Example: NER with spaCy

import spacy

# Load the English NER model

nlp = spacy.load('en_core_web_sm')

# Sample text

text = "Apple is looking at buying a U.K. startup for $1 billion."

# Process the text

doc = nlp(text)

# Extract and display entity labels


for ent in doc.ents:
    print(ent.text, ent.label_)

Output: "Apple" ORG, "U.K." GPE, and "$1 billion" MONEY.

Define 'sentiment analysis' and discuss its applications.

Sentiment analysis (SA), also known as opinion mining, is the computational study of people's
emotions, attitudes, and opinions from text data. Its core goal is determining whether a piece of
writing is positive, negative, or neutral.

Applications of Sentiment Analysis:

●​ Business: SA can streamline customer feedback analysis, brand management, and product development.
●​ Marketing: Identifying trends and insights from social and digital media, understanding
customer desires and pain points, developing targeted advertising campaigns.
●​ Customer Service: Quick and automated identification of customer needs and moods,
routing priority or potentially negative feedback to relevant support channels.
●​ Politics and Social Sciences: Tracking public opinion, election forecasting, and analyzing
the impact of policies and events on public sentiment.
●​ Healthcare: Monitoring mental health trends and identifying potential outbreaks of
diseases by processing texts from online forums, review platforms, and social media.
●​ News and Media: Understand reader/viewer views, opinions, and feedback, and track
trends in public sentiment related to news topics.
●​ Legal and Regulatory Compliance: Analyzing large volumes of text data to identify
compliance issues, legal risks, and reputation-related risks.
●​ Market Research: Gather and analyze consumer comments, reviews, and feedback to
inform product/service development, branding decisions, and more.
●​ Education: Assessing student engagement and learning by analyzing their online posts
about course materials or studying experiences.
●​ Customer Feedback Surveys: Automating the analysis of feedback from surveys, focus
groups, or comment sections. For example, hotel reviews on travel websites help
travelers make informed decisions.
●​ Voice of the Customer (VOC): Interpreting and identifying customer feelings and insights
across multiple communication channels: calls, chat, emails, and social media.
●​ Text-Based Searches: Ranking results based on sentiment with some search engines or
social platforms.
●​ Automated Content Moderation: Identifying and flagging inappropriate or harmful content
on online platforms, including hate speech, bullying, or adult content.
●​ Financial Services: Assessing investor sentiments, measuring market reactions to
financial news, and gauging public opinions on specific companies through social or
news media.
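A minimal sketch of rule-based sentiment scoring with NLTK's VADER analyzer (assuming the vader_lexicon resource is available); it labels a text as positive, negative, or neutral from the compound score:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
reviews = [
    "The movie was absolutely wonderful!",
    "The plot was dull and the acting was terrible.",
]
for review in reviews:
    scores = sia.polarity_scores(review)
    # compound ranges from -1 (most negative) to +1 (most positive)
    if scores['compound'] > 0.05:
        label = "positive"
    elif scores['compound'] < -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(review, "->", label, scores)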

How does a ‘dependency parser’ work, and what information does it provide?

A dependency parser is a tool used in natural language processing (NLP) to extract grammatical structure from free-form text.

Core Functionalities:
1.​ Word-Level Classification: Each word is classified based on its relationship with others.
Examples of classifications are Subject (nsubj), Object (obj), and Modifiers (e.g. amod
for adjectival modification).
2.​ Arc Representation: These are labeled directed edges, or arcs, between words. They
represent a grammatical relationship and provide the basis for constructing the parsing
tree.

Parsing Procedure:
1.​ Initial Assignment: The parser begins by giving each word a universal "root" node to
define the starting point of the parse tree.
2.​ Iterative Classification: For every word, the algorithm assigns both a word-level
classification and a directed arc to another word, specifying the relationship from the first
to the second.
3.​ Tree Check: Throughout the process, the parser ensures that the set of classified arcs
forms a single, non-looping tree, known as a "projective parse tree."
4.​ Recursive Structure: The tree starts from the "root" node and recursively accounts for
word and arc classifications to create dependencies covering the entire sentence.

Code Example: Dependency Parsing

import spacy

# Load English tokenizer, tagger, parser, NER, and word vectors

nlp = spacy.load("en_core_web_sm")

# Process a sentence using the dependency parser

doc = nlp("The quick brown fox jumps over the lazy dog.")

# Print the dependencies of each token in the sentence

for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

This example uses the spaCy library to perform dependency parsing on the sentence "The quick brown fox jumps over the lazy dog." The output displays each word, its word-level classification (dep_), the head word it is attached to, the head word's part-of-speech tag (pos_), and the word's children.

What are n-grams, and how do they contribute to language modeling?

N-grams are sequential word or character sets, with "n" indicating the number of elements in a
particular set. They play a crucial role in understanding context and text prediction, especially in
statistical language models.

Types of N-grams:

●​ Unigrams: Single words w_i
●​ Bigrams: Pairs of words (w_{i-1}, w_i)
●​ Trigrams: Three-word sequences (w_{i-2}, w_{i-1}, w_i)

Applications in Language Modeling:


●​ Text Prediction: Using contextual cues to predict next words.
●​ Speech Recognition: Relating phonemes to known word sequences.
●​ Machine Translation: Contextual understanding for accurate translations.
●​ Optical Character Recognition (OCR): Correcting recognition errors based on
surrounding text.
●​ Spelling Correction: Matching misspelled words to known N-grams.

Code Example: n-gram Modelling

from nltk import ngrams, word_tokenize

# Define input text

text = "This is a simple example for generating n-grams using NLTK."

# Tokenize the text into words

tokenized_text = word_tokenize(text.lower())

# Generate different types of N-grams

unigrams = ngrams(tokenized_text, 1)

bigrams = ngrams(tokenized_text, 2)
trigrams = ngrams(tokenized_text, 3)

# Print the generated n-grams

print("Unigrams:", [gram for gram in unigrams])

print("Bigrams:", [gram for gram in bigrams])

print("Trigrams:", [gram for gram in trigrams])

Describe what a 'bag of words' model is and its limitations.

The Bag of Words model, or BoW, is a fundamental technique in Natural Language Processing
(NLP). This model disregards the word order and syntax within a text, focusing instead on the
presence and frequency of words.

Working Mechanism:
●​ Text Collection: Gather a set of documents or a corpus.
●​ Tokenization: Split the text into individual words, known as tokens.
●​ Vocabulary Building: Identify unique tokens, constituting the vocabulary.
●​ Vectorization: Represent each document as a numerical vector, where each element
reflects word presence or frequency in the vocabulary.

Code Example: BoW Model

from sklearn.feature_extraction.text import CountVectorizer

# Sample data

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
]

# Create BoW model

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(corpus)

# Visualize outputs
print(vectorizer.get_feature_names_out())

print(X.toarray())

Limitations:
●​ Loss of Word Order: Disregarding word dependencies and contextual meanings can
hinder performance.
●​ Lacking Context: Assigning the same weight to identical words across different
documents can lead to skewed representations.
●​ Dimensionality: The vector's length equals the vocabulary size, getting unwieldy with
large corpora.
●​ Word Sense Ambiguity: Fails to distinguish meanings of polysemous words (words with
multiple meanings) or homonyms.
●​ Non-linguistic Information: Ignores parts of speech, negation, and any linguistic
subtleties.
●​ Out-of-Vocabulary Words: Struggles to handle new words or spelling variations.

Explain how the Naive Bayes classifier is used in NLP.

The Naive Bayes classifier is a popular choice for text classification tasks in Natural Language Processing (NLP). It is preferred for its simplicity, speed, and effectiveness. Naive Bayes makes use of Bag of Words techniques, treating the order of words as irrelevant. It calculates the probability of a document belonging to a specific category based on the probability of words occurring within that category.

●​ Bag of Words: Represents text as an unordered set of words, simplifying the text and
data representation.
●​ Conditional Independence Assumption: This core assumption of Naive Bayes states that
the presence of a word in a category is independent of the presence of other words.

Advantages:

●​ Efficiency: It's computationally lightweight and doesn't require extensive tuning.


●​ Simplicity: Easy to implement and understand.
●​ Low Data Requirements: Can be effective even in situations with smaller training
datasets.

Code Example: Naive Bayes Classification

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split


# Sample Data: Emails labeled as either Spam or Ham.

emails = [
    ("Win a free iPhone", "spam"),
    ("Meeting at 3 pm", "ham"),
    ("Verify your account", "spam"),
    # More emails...
]

# Separate into features and target

X, y = zip(*emails)

# Split into training and testing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text to a numerical format using CountVectorizer

vectorizer = CountVectorizer()

X_train_vec = vectorizer.fit_transform(X_train)

X_test_vec = vectorizer.transform(X_test)

# Fit Naive Bayes to the training data

nb_classifier = MultinomialNB()

nb_classifier.fit(X_train_vec, y_train)

# Evaluate on test data

accuracy = nb_classifier.score(X_test_vec, y_test)

print(f"Accuracy: {accuracy:.2f}")

Limitations:

●​ Out-of-Vocabulary Words: The model may struggle with words it has not seen before.
Techniques like smoothing can reduce this issue.
●​ Word Sense Disambiguation: The model assumes that words have consistent meanings,
ignoring nuances. This can be problematic for words with multiple meanings.
●​ Zipf's Law: The classifier might be influenced by the frequency of words, prioritizing
common words and potentially overlooking rare, but valuable, ones.
●​ Contextual Information: Naive Bayes doesn't consider word context or word order,
making it less effective in tasks that require understanding of such nuances (like
sentiment analysis or contextual disambiguation).

How are Hidden Markov Models (HMMs) applied in NLP tasks?

●​ Hidden Markov Models (HMMs) have a long-standing role in Natural Language Processing (NLP) tasks due to their ability to handle sequential data like text.
●​ From the early days of POS tagging to modern speech recognition and language
translation, HMMs have been instrumental in numerous tasks.

Specific Applications in NLP:

●​ Task: POS Tagging


○​ Description: Determines the part of speech for each word in a sentence.
○​ HMM Role: The most common application of HMMs in NLP. Each POS tag is a
state, and the observed word is the corresponding observed output.
●​ Task: Named Entity Recognition (NER)
○​ Description: Identifies entities in text, such as names of persons, locations, or
organizations.
○​ HMM Role: Useful in sequence modeling tasks, where the existence of one entity
influences the presence of another (e.g., "United States President").
●​ Task: Coreference Resolution
○​ Description: Links an entity, usually a nominal phrase, such as a proper name or
a pronoun, to its previous references.
○​ HMM Role: Helps in making coreference decisions based on chains of
references and broader context.
●​ Task: Language Translation
○​ Description: Translates text from one language to another.
○​ HMM Role: In the pre-sequence to sequence model era, HMMs were used for
alignment between two sequences, i.e., source and target sentences.

HMM Limitations in NLP:

●​ Lack of Context: HMMs are limited to local data, which can lead to suboptimal
performance, especially in complex tasks requiring global context.
●​ Scalability Concerns: With growing datasets and evolving language use, constant model
retraining and capacity to encompass lexicons become necessary.

Code Example: HMM

import nltk
# Get data

nltk.download('treebank')

nltk.download('maxent_treebank_pos_tagger')

from nltk.corpus import treebank

data = treebank.tagged_sents()[:3000]

# Split data

split = int(0.9 * len(data))

train_data = data[:split]

test_data = data[split:]

# Train the HMM POS tagger

from nltk.tag import hmm

tagger = hmm.HiddenMarkovModelTrainer().train(train_data)

# Evaluate accuracy

accuracy = tagger.evaluate(test_data)

print(f"Accuracy: {accuracy:.2%}")
