NLP Record 2

The document provides Python code examples for various text processing techniques in NLP, including Tokenization, One-Hot Encoding, Count Vectorization, and TF-IDF. It explains how to convert text into numerical representations suitable for machine learning tasks, detailing the process of building vocabularies and creating matrices. Additionally, it introduces the Word2Vec technique for word embeddings, illustrating how words are represented as vectors based on their contextual relationships.

Program 13: Write a python code for Tokenization & One-Hot Encoding using CountVectorizer

Description:

Tokenization
Before representing text numerically, it must be broken down into smaller units called tokens.
In the code, tokenizer=lambda x: x.split() splits each sentence by whitespace.
Example:
"apple and banana" → ["apple", "and", "banana"]
The vectorizer builds a vocabulary of all unique tokens in the corpus.
One-Hot Encoding (Binary Representation)
This method represents whether each token from the vocabulary is present (1) or absent (0) in a sentence.
• binary=True in CountVectorizer ensures that we use a binary vector, not raw counts.
• Each vector’s length = size of the vocabulary.
• Each sentence becomes a vector indicating the presence of each word.
Example Corpus:
1. "apple and banana"
2. "banana and orange"
3. "grape apple banana"
Vocabulary: ['and', 'apple', 'banana', 'grape', 'orange']
Binary Matrix Output:
• [1, 1, 1, 0, 0] # "apple and banana"
• [1, 0, 1, 0, 1] # "banana and orange"
• [0, 1, 1, 1, 0] # "grape apple banana"

This matrix numerically represents the text and is suitable for downstream ML tasks.
Code:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "apple and banana",
    "banana and orange",
    "grape apple banana"
]

# binary=True records presence/absence instead of raw counts,
# and the whitespace tokenizer mirrors the description above
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), binary=True)
X = vectorizer.fit_transform(corpus).toarray()
print("Vocabulary:", vectorizer.get_feature_names_out())
print("One-Hot Encoded Matrix:\n", X)
Output:
Vocabulary: ['and' 'apple' 'banana' 'grape' 'orange']
One-Hot Encoded Matrix:
[[1 1 1 0 0]
[1 0 1 0 1]
[0 1 1 1 0]]

Program 14: Write a python code for demonstrating Count Vectorization, also known as the Bag-of-Words (BoW) model
— a foundational text representation technique in NLP.

Description:
This code demonstrates Count Vectorization, also known as the Bag-of-Words (BoW) model — a foundational text
representation technique in NLP.
What is Count Vectorization?
• CountVectorizer converts a collection of text documents into a matrix of token counts.
• It breaks each sentence into tokens (typically words), builds a vocabulary of all unique tokens, and creates vectors
indicating the frequency of each word in a sentence.
How it works:
1. Tokenization:
The text is split into words (tokens) using default rules (like splitting by spaces and removing punctuation).
2. Vocabulary Creation:
A set of all unique words across the corpus is created.
3. Vectorization:
Each sentence is converted into a numerical vector where:
o Each dimension corresponds to a word in the vocabulary.
o The value represents how many times that word appears in the sentence.
Code:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus (can be sentences, paragraphs, documents)
corpus = [
    "Natural language processing is a field of artificial intelligence and language processing.",
    "Machine learning and deep learning are parts of AI.",
    "Natural language techniques are used in chatbots and translation.",
    "The future of AI depends on advances in NLP."
]

# Initialize the vectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Convert result to array and get feature names
count_matrix = X.toarray()
vocab = vectorizer.get_feature_names_out()

# Print vocabulary and count matrix
print("📚 Vocabulary:")
print(vocab)
print("\n🧮 Count Vector Matrix:")
for i, row in enumerate(count_matrix):
    print(f"Sentence {i+1}: {row}")

Output:
📚 Vocabulary:
['advances' 'ai' 'and' 'are' 'artificial' 'chatbots' 'deep' 'depends'
'field' 'future' 'in' 'intelligence' 'is' 'language' 'learning' 'machine'
'natural' 'nlp' 'of' 'on' 'parts' 'processing' 'techniques' 'the'
'translation' 'used']

🧮 Count Vector Matrix:


Sentence 1: [0 0 1 0 1 0 0 0 1 0 0 1 1 2 0 0 1 0 1 0 0 2 0 0 0 0]
Sentence 2: [0 1 1 1 0 0 1 0 0 0 0 0 0 0 2 1 0 0 1 0 1 0 0 0 0 0]
Sentence 3: [0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1]
Sentence 4: [1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0]
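
As a quick check of the matrix against the vocabulary: in Sentence 1 the columns for 'language' and 'processing' both hold 2, matching their two occurrences in the first sentence, while every other word in that sentence occurs once.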

Program 14 A: Write a python code for TF-IDF (Term Frequency-Inverse Document Frequency) to understand which
words are important, not just frequent.

Description:
1. Count Vectorizer (Bag of Words)
• Converts each document into a vector based on word frequency.
• Ignores grammar and word order; only counts occurrences of words.
• Commonly used for basic text classification and NLP tasks.
Limitation:
Frequent but less meaningful words (like "data" or "systems") may dominate the representation, even if they don’t carry much
information.

2. TF-IDF (Term Frequency–Inverse Document Frequency)
• Goes beyond raw word counts by evaluating how important a word is in a document relative to the entire corpus.
How it works:
• TF (Term Frequency): Measures how often a word appears in a single document.
• IDF (Inverse Document Frequency): Downweights words that appear in many documents, as they are less informative.
 Final TF-IDF score = TF × IDF
• High TF-IDF → word is important in that document but rare in others.
• Low TF-IDF → word is common across documents (less informative).
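
For intuition, here is a minimal sketch of the scoring rule scikit-learn applies with its defaults (smooth_idf=True and L2 normalization); the document and term counts in it are hypothetical placeholders, not taken from the corpus below:

import math

# Hypothetical counts, for illustration only
n_docs = 6    # total documents in the corpus
df_term = 2   # documents containing the term
tf = 1        # raw count of the term in one document

# scikit-learn's smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = math.log((1 + n_docs) / (1 + df_term)) + 1
print(f"IDF = {idf:.4f}, raw TF-IDF = {tf * idf:.4f}")
# TfidfVectorizer then L2-normalizes each document's vector,
# which is why the scores in the output below are all below 1.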

Code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

corpus = [
    "Artificial Intelligence is transforming industries and daily life through automation and smart systems.",
    "Machine Learning, as a subset of AI, enables systems to learn from data without being explicitly programmed.",
    "Deep Learning techniques use neural networks with many layers to model complex patterns in data such as images and speech.",
    "Applications of AI include self-driving cars, medical diagnosis, financial forecasting, and personalized recommendations.",
    "Natural Language Processing helps computers understand, interpret, and generate human language using linguistic and statistical techniques.",
    "With rapid advancements in computing power and data availability, the future of AI continues to grow exponentially."
]

# Count Vectorizer
count_vectorizer = CountVectorizer(stop_words='english')
count_matrix = count_vectorizer.fit_transform(corpus)
count_df = pd.DataFrame(count_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display
print("📊 Count Vector (Bag of Words):")
print(count_df)
print("\n🌟 TF-IDF Vector:")
print(tfidf_df)
Output:
📊 Count Vector (Bag of Words):
advancements ai applications artificial automation availability cars \
0 0 0 0 1 1 0 0
1 0 1 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 1 1 0 0 0 1
4 0 0 0 0 0 0 0
5 1 1 0 0 0 1 0

complex computers computing ... smart speech statistical subset \


0 0 0 0 ... 1 0 0 0

1 0 0 0 ... 0 0 0 1
2 1 0 0 ... 0 1 0 0
3 0 0 0 ... 0 0 0 0
4 0 1 0 ... 0 0 1 0
5 0 0 1 ... 0 0 0 0

systems techniques transforming understand use using


0 1 0 1 0 0 0
1 1 0 0 0 0 0
2 0 1 0 0 1 0
3 0 0 0 0 0 0
4 0 1 0 1 0 1
5 0 0 0 0 0 0

[6 rows x 61 columns]

🌟 TF-IDF Vector:
advancements ai applications artificial automation availability \
0 0.000000 0.000000 0.00000 0.33957 0.33957 0.000000
1 0.000000 0.240255 0.00000 0.00000 0.00000 0.000000
2 0.000000 0.000000 0.00000 0.00000 0.00000 0.000000
3 0.000000 0.204336 0.29515 0.00000 0.00000 0.000000
4 0.000000 0.000000 0.00000 0.00000 0.00000 0.000000
5 0.316885 0.219383 0.00000 0.00000 0.00000 0.316885

cars complex computers computing ... smart speech \


0 0.00000 0.000000 0.000000 0.000000 ... 0.33957 0.000000
1 0.00000 0.000000 0.000000 0.000000 ... 0.00000 0.000000
2 0.00000 0.290814 0.000000 0.000000 ... 0.00000 0.290814
3 0.29515 0.000000 0.000000 0.000000 ... 0.00000 0.000000
4 0.00000 0.000000 0.252599 0.000000 ... 0.00000 0.000000
5 0.00000 0.000000 0.000000 0.316885 ... 0.00000 0.000000

statistical subset systems techniques transforming understand \


0 0.000000 0.000000 0.278453 0.000000 0.33957 0.000000
1 0.000000 0.347033 0.284572 0.000000 0.00000 0.000000
2 0.000000 0.000000 0.000000 0.238472 0.00000 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
4 0.252599 0.000000 0.000000 0.207135 0.00000 0.252599
5 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000

use using
0 0.000000 0.000000
1 0.000000 0.000000
2 0.290814 0.000000
3 0.000000 0.000000
4 0.000000 0.252599
5 0.000000 0.000000
[6 rows x 61 columns]

Program 15: Write a python code to implement word2vec word-embedding technique

Description: Word2Vec represents a word as a high-dimensional vector of numbers that captures relationships between words. In particular, words that appear in similar contexts are mapped to vectors that are nearby as measured by cosine similarity, which indicates the level of semantic similarity between the words: for example, the vectors for "walk" and "ran" are nearby, as are those for "but" and "however", and "Berlin" and "Germany".

Code:
import gensim
import pandas as pd

# Load the reviews dataset (one JSON object per line)
df = pd.read_json("Sports_and_Outdoors_5.json", lines=True)

# Lowercase and tokenize each review
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
# df.reviewText.loc[0]  # inspect the first raw review (useful in a notebook)

model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)
model.build_vocab(review_text, progress_per=1000)
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)
model.save("./word2vec-outdoor-reviews-short.model")
print(model.wv.most_similar("awful"))

Output:
[('terrible', 0.7352169156074524),
('horrible', 0.6891771554946899),
('overwhelming', 0.6227911710739136),
('impossibility', 0.5835400819778442),
('horrendous', 0.5827057957649231),
('enormous', 0.5721909999847412),
('ugly', 0.567825436592102),
('unusual', 0.566750705242157),
('isolated', 0.5588798522949219),
('unfortunate', 0.5560564994812012)]
• model.wv.similarity(w1="good", w2="great")
 output: 0.7870506

• model.wv.similarity(w1="slow", w2="steady")

 output: 0.3472042

Program 16 A: Write a python program to create a sample list of at least 5 words with ambiguous senses

Description:

A word sense is a specific meaning of a word, especially when the word has multiple meanings depending on the context. This
concept is central to understanding and processing natural language correctly.

Example of Word Senses

1. Bat (animal): A flying nocturnal mammal.
"The bat flew out of the cave."
2. Bat (sports equipment): A stick used to hit a ball in games like cricket or baseball.
"He swung the bat and hit a home run."

Each of these is a different word sense of the word bat.

Word Sense in Lexical Databases

Tools like WordNet organize word senses systematically. For example:

• Word: "bank"
o bank.n.01: financial institution
o bank.n.02: sloping land beside a river
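
Note that this listing is simplified; in NLTK's WordNet the river-bank sense is actually bank.n.01 and the financial institution is named depository_financial_institution.n.01, as the program's output below confirms. A quick check:

from nltk.corpus import wordnet as wn

print(wn.synset('bank.n.01').definition())
print(wn.synset('depository_financial_institution.n.01').definition())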

Code:

import nltk
from nltk.corpus import wordnet as wn

# Download WordNet data if you don't have it
nltk.download('wordnet')

# Define a function to resolve word sense based on context
def wsd(word, context):
    # Look up the different senses (meanings) of the word in WordNet
    senses = wn.synsets(word)
    # Print the senses of the word
    print(f"Senses of the word '{word}':")
    for i, sense in enumerate(senses):
        print(f"{i+1}. {sense.name()}: {sense.definition()}")
    # A simple approach to disambiguate: check whether any context word
    # appears in one of the sense's example sentences
    for i, sense in enumerate(senses):
        for example in sense.examples():
            if any(context_word in example for context_word in context):
                print(f"\nContext matched with sense {i+1}: {sense.name()}")
                print(f"Example: {example}")
                return sense.name()  # Return the sense name based on context
    return "No match found"

# Sample sentences (context for ambiguous words)
contexts = {
    'bank': ["money", "deposit", "river", "side", "finance"],
    'bark': ["tree", "rough", "dog", "loud"],
    'bat': ["fly", "animal", "hit", "game"],
    'bow': ["tie", "gift", "respect", "gesture"],
    'lead': ["guide", "direct", "metal", "pipes"]
}

# Run WSD on each ambiguous word with its context
for word in contexts:
    print(f"\nWSD for the word: {word}")
    word_context = contexts[word]
    sense = wsd(word, word_context)
    print(f"Disambiguated sense of '{word}': {sense}")
Output:
WSD for the word: bank
Senses of the word 'bank':
1. bank.n.01: sloping land (especially the slope beside a body of water)
2. depository_financial_institution.n.01: a financial institution that accepts deposits and channels the money into lending
activities
3. bank.n.03: a long ridge or pile
4. bank.n.04: an arrangement of similar objects in a row or in tiers
5. bank.n.05: a supply or stock held in reserve for future use (especially in emergencies)
6. bank.n.06: the funds held by a gambling house or the dealer in some gambling games
7. bank.n.07: a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal
force
8. savings_bank.n.02: a container (usually with a slot in the top) for keeping money at home
9. bank.n.09: a building in which the business of banking transacted
10. bank.n.10: a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
11. bank.v.01: tip laterally
12. bank.v.02: enclose with a bank
13. bank.v.03: do business with a bank or keep an account at a bank
14. bank.v.04: act as the banker in a game or in gambling
15. bank.v.05: be in the banking business
16. deposit.v.02: put into a bank account
17. bank.v.07: cover with ashes so to control the rate of burning
18. trust.v.01: have confidence or faith in

Context matched with sense 1: bank.n.01


Example: he sat on the bank of the river and watched the currents
Disambiguated sense of 'bank': bank.n.01

WSD for the word: bark


Senses of the word 'bark':
1. bark.n.01: tough protective covering of the woody stems and roots of trees and other woody plants
2. bark.n.02: a noise resembling the bark of a dog
3. bark.n.03: a sailing ship with 3 (or more) masts
4. bark.n.04: the sound made by a dog
5. bark.v.01: speak in an unfriendly tone
6. bark.v.02: cover with bark
7. bark.v.03: remove the bark of a tree
8. bark.v.04: make barking sounds
9. bark.v.05: tan (a skin) with bark tannins

Context matched with sense 8: bark.v.04


Example: The dogs barked at the stranger
Disambiguated sense of 'bark': bark.v.04

WSD for the word: bat


Senses of the word 'bat':
1. bat.n.01: nocturnal mouselike mammal with forelimbs modified to form membranous wings and anatomical adaptations for
echolocation by which they navigate
2. bat.n.02: (baseball) a turn trying to get a hit
3. squash_racket.n.01: a small racket with a long handle used for playing squash
4. cricket_bat.n.01: the club used in playing cricket
5. bat.n.05: a club used for hitting a ball in various games
6. bat.v.01: strike with, or as if with a baseball bat
7. bat.v.02: wink briefly
8. bat.v.03: have a turn at bat
9. bat.v.04: use a bat
10. cream.v.02: beat thoroughly and conclusively in a competition or fight

Context matched with sense 2: bat.n.02
Example: he got four hits in four at-bats
Disambiguated sense of 'bat': bat.n.02

WSD for the word: bow


Senses of the word 'bow':
1. bow.n.01: a knot with two loops and loose ends; used to tie shoelaces
2. bow.n.02: a slightly curved piece of resilient wood with taut horsehair strands; used in playing certain stringed instruments
3. bow.n.03: front part of a vessel or aircraft
4. bow.n.04: a weapon for shooting arrows, composed of a curved piece of resilient wood with a taut cord to propel the arrow
5. bow.n.05: something curved in shape
6. bow.n.06: bending the head or body or knee as a sign of reverence or submission or shame or greeting
7. bow.n.07: an appearance by actors or performers at the end of the concert or play in order to acknowledge the applause of the
audience
8. bow.n.08: a decorative interlacing of ribbons
9. bow.n.09: a stroke with a curved piece of wood with taut horsehair strands that is used in playing stringed instruments
10. bow.v.01: bend one's knee or body, or lower one's head
11. submit.v.06: yield to another's wish or opinion
12. bow.v.03: bend the head or the upper part of the body in a gesture of respect or greeting
13. crouch.v.01: bend one's back forward from the waist on down
14. bow.v.05: play on a string instrument with a bow
Disambiguated sense of 'bow': No match found

WSD for the word: lead


Senses of the word 'lead':
1. lead.n.01: an advantage held by a competitor in a race
2. lead.n.02: a soft heavy toxic malleable metallic element; bluish white when freshly cut but tarnishes readily to dull grey
3. lead.n.03: evidence pointing to a possible solution
4. lead.n.04: a position of leadership (especially in the phrase `take the lead')
5. lead.n.05: the angle between the direction a gun is aimed and the position of a moving target (correcting for the flight time of
the missile)
6. lead.n.06: the introductory section of a story
7. lead.n.07: (sports) the score by which a team or individual is winning
8. star.n.04: an actor who plays a principal role
9. lead.n.09: (baseball) the position taken by a base runner preparing to advance to the next base
10. tip.n.03: an indication of potential opportunity
11. lead.n.11: a news story of major importance
12. spark_advance.n.01: the timing of ignition relative to the position of the piston in an internal-combustion engine
13. leash.n.01: restraint consisting of a rope (or light chain) used to restrain an animal
14. lead.n.14: thin strip of metal used to separate lines of type in printing
15. lead.n.15: mixture of graphite with clay in different degrees of hardness; the marking substance in a pencil
16. jumper_cable.n.01: a jumper that consists of a short piece of wire
17. lead.n.17: the playing of a card to start a trick in bridge
18. lead.v.01: take somebody somewhere
19. leave.v.07: have as a result or residue
20. lead.v.03: tend to or result in
21. lead.v.04: travel in front of; go in advance of others
22. lead.v.05: cause to undertake a certain action
23. run.v.03: stretch out over a distance, space, time, or scope; run or extend between two points or beyond a certain point
24. head.v.02: be in charge of
25. lead.v.08: be ahead of others; be the first
26. contribute.v.03: be conducive to
27. conduct.v.02: lead, as in the performance of a composition
28. go.v.25: lead, extend, or afford access
29. precede.v.04: move ahead (of others) in time or space
30. run.v.23: cause something to pass or lead somewhere
31. moderate.v.01: preside over
Disambiguated sense of 'lead': No match found
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!

Program 16 B: Write a python program to implement Lesk’s algorithm for word sense disambiguation.

Description:

Word Sense Disambiguation (WSD) is the process of identifying the correct meaning (sense) of a word based on its context,
especially when the word has multiple meanings.

Lesk’s algorithm is a knowledge-based method for WSD that disambiguates a word by comparing the dictionary definitions
(glosses) of its possible senses with the context in which the word appears.

Code:
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

def lesk_algorithm(word, context):
    senses = wn.synsets(word)
    if not senses:
        return None
    max_overlap = 0
    best_sense = None
    context_tokens = set(word_tokenize(' '.join(context).lower()))
    for sense in senses:
        # Get the definition and examples of the sense
        definition = set(word_tokenize(sense.definition().lower()))
        examples = set()
        for example in sense.examples():
            examples.update(word_tokenize(example.lower()))
        # Count how many context tokens overlap with the gloss and examples
        overlap = len(context_tokens.intersection(definition.union(examples)))
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense
    return best_sense

# Example usage:
context = ["The bark of the tree is rough and textured."]
word = "bark"
best_sense = lesk_algorithm(word, context)
if best_sense:
    print(f"Disambiguated sense for '{word}': {best_sense.name()}")
    print(f"Definition: {best_sense.definition()}")
else:
    print(f"No sense found for '{word}'")

context = ["The dog started to bark loudly in the yard."]
word = "bark"
best_sense = lesk_algorithm(word, context)
if best_sense:
    print(f"Disambiguated sense for '{word}': {best_sense.name()}")
    print(f"Definition: {best_sense.definition()}")
else:
    print(f"No sense found for '{word}'")
Output:
Disambiguated sense for 'bark': bark.v.03
Definition: remove the bark of a tree
Disambiguated sense for 'bark': bark.n.02
Definition: a noise resembling the bark of a dog
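
NLTK also ships a ready-made Lesk implementation in nltk.wsd, which can serve as a cross-check of the hand-rolled version above (it may pick a different sense, since it scores overlap against glosses only):

from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sent = word_tokenize("The dog started to bark loudly in the yard.")
sense = lesk(sent, 'bark')
print(sense.name(), '-', sense.definition())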

Program 17: Write a python program using NLTK package to convert audio file to text and text to audio files

Description:

NLTK does not handle audio files directly. We therefore use the speech_recognition and gTTS libraries for the audio steps, and NLTK to process the text in between.

1. speech_recognition (convert audio to text):
speech_recognition is a Python library that helps you convert spoken audio into written text using speech recognition engines.
2. gTTS (convert text to audio):
gTTS stands for Google Text-to-Speech — it's a Python library and CLI tool that lets you convert text into spoken audio using the Google Text-to-Speech API. It takes in text (a string), sends it to Google's Text-to-Speech engine, and returns a spoken audio file in MP3 format.

Install the following packages:

1. !pip install SpeechRecognition pydub
2. !pip install gTTS

Code:

import speech_recognition as sr
from gtts import gTTS

def audio_to_text(audio_file_path):
    """Transcribe a WAV file using the Google Web Speech API."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_file_path) as source:
        audio_data = recognizer.record(source)
    try:
        text = recognizer.recognize_google(audio_data)
        print("Recognized Text:\n", text)
        return text
    except sr.UnknownValueError:
        print("Speech Recognition could not understand the audio")
    except sr.RequestError:
        print("Could not request results from Google Speech Recognition service")

def text_to_audio(text, output_audio_path):
    """Convert a text string to an MP3 file with gTTS."""
    tts = gTTS(text)
    tts.save(output_audio_path)
    print(f"Saved audio to {output_audio_path}")

audio_path = "Sports.wav"
output_audio_path = "output.mp3"
text = audio_to_text(audio_path)

import nltk
from nltk.tokenize import word_tokenize
if text:
    tokens = word_tokenize(text)
    print("Tokenized Text:\n", tokens)

text1 = "hello welcome to class"
text_to_audio(text1, output_audio_path)

Output:

Recognized Text:
good evening ladies and gentlemen we like to welcome you to play the new videos Broadcast
Tokenized Text:
['good', 'evening', 'ladies', 'and', 'gentlemen', 'we', 'like', 'to', 'welcome', 'you', 'to', 'play', 'the', 'new', 'videos', 'Broadcast']

Saved audio to output.mp3

Program 18: Write a python program using NLTK package to explore FrameNet frames, frame elements (FEs), and lexical units (LUs)

Description:

FrameNet is a linguistic database that organizes words based on the situations (called frames) they describe, showing how words
are connected to roles and events in real-world experiences.

1. Helps computers understand not just words, but meanings and relationships.

2. Useful in AI, language learning, chatbots, machine translation, and semantic search.

The key elements of FrameNet are Frames, Frame Elements (FEs), and Lexical Units (LUs)

Code:

Program to print the different FrameNet frames

import nltk
from nltk.corpus import framenet as fn

# Download FrameNet data (only needed once)
nltk.download('framenet_v17')

# Get all frames
frames = fn.frames()

# Print some frame names
print("Total number of frames:", len(frames))
print("\nSample frames in FrameNet:\n")
for i, frame in enumerate(frames[:20]):  # Change 20 to show more
    print(f"{i+1}. {frame.name}")

Output:

[nltk_data] Downloading package framenet_v17 to /root/nltk_data...


[nltk_data] Unzipping corpora/framenet_v17.zip.
Total number of frames: 1221

Sample frames in FrameNet:


1. Abandonment
2. Abounding_with
3. Absorb_heat
4. Abundance
5. Abusing
6. Access_scenario
7. Accompaniment
8. Accomplishment
9. Accoutrements
10. Accuracy
11. Achieving_first
12. Active_substance
13. Activity
14. Activity_abandoned_state
15. Activity_done_state
16. Activity_finish
17. Activity_ongoing
18. Activity_pause
19. Activity_paused_state
20. Activity_prepare

Program to print the details of a particular frame

import nltk
from nltk.corpus import framenet as fn

# Download if not already done
nltk.download('framenet_v17')

# Choose a frame by name, e.g., "Awareness"
frame = fn.frame('Awareness')

# Print basic info
print("📌 Frame Name:", frame.name)
print("📝 Definition:", frame.definition)

# Print Lexical Units (words that evoke this frame)
print("\n🔤 Lexical Units (LUs):")
for lu in frame.lexUnit:
    print(f"- {lu}")

# Print Frame Elements (semantic roles)
print("\n🧩 Frame Elements (FEs):")
for fe_name, fe in frame.FE.items():
    print(f"- {fe_name}: {fe.coreType} — {fe.definition}")

Output:
[nltk_data] Downloading package framenet_v17 to /root/nltk_data...
📌 Frame Name: Awareness
📝 Definition: A Cognizer has a piece of Content in their model of the world. The Content is not necessarily present due to
immediate perception, but usually, rather, due to deduction from perceivables. In some cases, the deduction of the Content is
implicitly based on confidence in sources of information (believe), in some cases based on logic (think), and in other cases the
source of the deduction is deprofiled (know). 'Your boss is aware of your commitment.' '' Note that this frame is undergoing
some degree of reconsideration. Many of the targets will be moved to the Opinion frame.' In the uses that will remain in the
Awareness frame, however, the Content is presupposed. '' This frame is also distinct from the Certainty frame, in that it does not
profile the relationship of the Cognizer to the Content, but rather presupposes it. In Certainty, the Degree of confidence or
certainty is expressible as a separate frame element, as in the following: 'She absolutely knew that he would be there .'

🔤 Lexical Units (LUs):


- aware.a
- awareness.n
- believe.v
- comprehend.v
- comprehension.n
- conceive.v
- conception.n
- conscious.a
- hunch.n
- imagine.v
- know.v
- knowledge.n
- knowledgeable.a
- presume.v
- understand.v
- understanding.n
- ignorance.n
- consciousness.n
- cognizant.a
- unknown.a
- idea.n

🧩 Frame Elements (FEs):
- Cognizer: Core — The Cognizer is the person whose awareness of phenomena is at question. With a target verb or adjective the
Cognizer is generally expressed as an External Argument with the Content expressed as an Object or Complement. 'Your boss is
aware of your commitment.' 'The students do not know the answer.'
- Content: Core — The Content is the object of the Cognizer's awareness. Content can be expressed as a direct object or in a PP
Complement. 'Your boss is aware of your commitment.' 'The students do not know of your commitment.' 'The students do not
know how committed you are.' '
- Evidence: Peripheral — The source of awareness or knowledge which can be expressed in a PP Complement: 'The sailors knew
from the look of the sky that a storm was coming.' 'I knew from experience that Jo would be late.'
- Topic: Core — Some verbs in this frame allow a Topic to be expressed in about-PPs. 'Kim knows about first aid.' However, a
number of nouns and adjectives in this frame which cannot take about-phrases allow Topic to be expressed as an adjectival or
adverbial modifier. 'Kim is politically aware. ' ' Environmental consciousness is increasing.'
- Degree: Peripheral — This FE identifies the Degree to which an event occurs.
- Manner: Peripheral — This FE identifies the Manner in which the Cognizer knows or thinks something.
- Expressor: Core — Expressor is the body part that reveals the Cognizer's state to the observer. 'Bob's eyes were overly aware'
- Role: Peripheral — Role is the category within which an element of the Content is considered. 'He understood her remark as an
insult.'
- Paradigm: Extra-Thematic — This frame element identifies the Paradigm which serves as the basis for the Cognizer's
awareness. 'The formation of black holes should be understood in astrophysic terms.'
- Time: Peripheral — The time interval during which the Cognizer is aware of the Content. 'Yet there is no evidence that Mr.
Parrish was cognizant at the time of the signing of the notes that the clauses in issue were present.'
- Explanation: Extra-Thematic — The reason why or how it came to be that the Cognizer has awareness of the Topic or Content.
[nltk_data] Package framenet_v17 is already up-to-date!

Program to invoke a particular Frame based on Lexical unit in the given sentence

import nltk
import spacy
from nltk.corpus import framenet as fn

nltk.download('framenet_v17')
nlp = spacy.load("en_core_web_sm")

def find_frames_for_sentence(sentence):
    doc = nlp(sentence)
    frames_found = {}
    for token in doc:
        if token.pos_ in {"VERB", "NOUN"}:  # Likely to evoke frames
            lemma = token.lemma_
            frames = fn.frames_by_lemma(lemma)
            if frames:
                frames_found[token.text] = [frame.name for frame in frames]
    return frames_found

sentence = "We believe it is a fair and generous price."
frames_invoked = find_frames_for_sentence(sentence)
print(f"Sentence: {sentence}")
print("\nInvoked FrameNet Frames:")
for word, frames in frames_invoked.items():
    print(f"- {word}: {', '.join(frames)}")

Output:

[nltk_data] Downloading package framenet_v17 to /root/nltk_data...

[nltk_data] Package framenet_v17 is already up-to-date!


Sentence: We believe it is a fair and generous price.

Invoked FrameNet Frames:


- believe: Awareness, Certainty, Opinion, Religious_belief, Taking_sides, Trust
- price: Commerce_scenario, Expensiveness

Program 19: Write a python program using NLTK package to explore WordNet synsets and their semantic relations (synonyms, hypernyms, hyponyms, antonyms, and similarity)

Description: WordNet is a lexical database of the English language, widely used in Natural Language Processing (NLP) tasks. It groups words into sets of synonyms (synsets) and organizes them into a network based on semantic relationships like hyponyms (more specific), hypernyms (more general), meronyms (part-whole), and holonyms (whole-part).

Code:

1. Finding Synonyms of a given word:

from nltk.corpus import wordnet as wn

# Get synsets for the word 'car'
synsets = wn.synsets('car')
for synset in synsets:
    print(f"Synset: {synset.name()}")
    print(f"Definition: {synset.definition()}")
    print(f"Examples: {synset.examples()}")
    print()

Output:
Synset: car.n.01
Definition: a motor vehicle with four wheels; usually propelled by an internal combustion engine
Examples: ['he needs a car to get to work']
Synset: car.n.02
Definition: a wheeled vehicle adapted to the rails of railroad
Examples: ['three cars had jumped the rails']
Synset: car.n.03
Definition: the compartment that is suspended from an airship and that carries personnel and the cargo and the power
plant
Examples: []
Synset: car.n.04
Definition: where passengers ride up and down
Examples: ['the car was on the top floor']
Synset: cable_car.n.01
Definition: a conveyance for passengers or freight on a cable railway
Examples: ['they took a cable car to the top of the mountain']

2. Finding Hypernyms (More General Terms)

car_synset = wn.synset('car.n.03')  # Third synset of 'car' (the airship compartment)
hypernyms = car_synset.hypernyms()
for hypernym in hypernyms:
    print(f"Hypernym: {hypernym.name()}")

Output: Hypernym: compartment.n.02

3. Finding Hyponyms (More Specific Terms)

hyponyms = car_synset.hyponyms()
for hyponym in hyponyms:
    print(f"Hyponym: {hyponym.name()}")
4. Finding Antonyms

synsets = wn.synsets('happy')
for synset in synsets:
    for lemma in synset.lemmas():
        if lemma.antonyms():
            print(f"Antonym: {lemma.antonyms()[0].name()}")

Output: Antonym: unhappy
5. Word Similarity

word1 = wn.synset('dog.n.01')
word2 = wn.synset('cat.n.01')
similarity = word1.path_similarity(word2)
print(f"Similarity between 'dog' and 'cat': {similarity}")

Output: Similarity between 'dog' and 'cat': 0.2
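
6. Finding Meronyms and Holonyms

The description above also mentions meronyms (part-whole) and holonyms (whole-part); a short sketch of those relations (the exact synsets returned depend on the WordNet version):

tree = wn.synset('tree.n.01')
print("Parts of a tree:", [m.name() for m in tree.part_meronyms()])
print("'trunk' is part of:", [h.name() for h in wn.synset('trunk.n.01').part_holonyms()])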

Program 20: Write a Python code to generate n-grams using NLTK n-gram Library

Description:

An n-gram is a continuous sequence of n items (typically words or characters) from a given text or speech. It's a
fundamental concept in Natural Language Processing (NLP) and is used in many tasks, including language modeling, text
analysis, speech recognition, and machine translation. n represents the number of items (usually words) in the sequence.

Code:
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

def get_ngrams(text, n):
    """Generate n-grams from the given text."""
    tokens = word_tokenize(text.lower())
    return list(ngrams(tokens, n))

text = "Sample list of words"

print("List of Unigram")
for ngram in get_ngrams(text, 1):
    print(ngram)

print("List of Bigrams")
for ngram in get_ngrams(text, 2):
    print(ngram)

print("List of Trigrams")
for ngram in get_ngrams(text, 3):
    print(ngram)

Output:
List of Unigram
('sample',)
('list',)
('of',)
('words',)

List of Bigrams
('sample', 'list')
('list', 'of')
('of', 'words')

List of Trigrams
('sample', 'list', 'of')
('list', 'of', 'words')

Program 21: Write a python program to train a bi-gram model on a given corpus of text to predict the next probable word given the previous word of a sentence.
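
Description: A bigram language model counts, for each word in the corpus, how often every other word immediately follows it. Normalizing these counts gives the conditional probability P(next word | previous word), and prediction simply picks the most frequent follower of the last word in the given context.

Code: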

import nltk
from nltk.util import ngrams
from collections import defaultdict

nltk.download('punkt_tab')  # tokenizer models (download message in the output below)

# Step 1: Prepare the corpus (tokenize sentences)
corpus = [
    "The weather is beautiful today",
    "I am learning natural language processing",
    "I am studying ML",
    "Machine learning is fascinating",
    "Language models are powerful tools",
    "I love IITM",
    "I am a Coder",
    "My passion is developing real world problem solving applications"
]
tokenized_corpus = [nltk.word_tokenize(sentence.lower()) for sentence in corpus]

# Step 2: Train the bigram model
def train_bigram_model(corpus):
    model = defaultdict(lambda: defaultdict(int))  # Bigram counts
    for sentence in corpus:
        # Generate bigrams for each sentence
        for w1, w2 in ngrams(sentence, 2):
            model[w1][w2] += 1  # Increment the count for each bigram
    return model

bigram_model = train_bigram_model(tokenized_corpus)

# Step 3: Calculate word probabilities (bigram probabilities)
def calculate_word_probabilities(model):
    probabilities = defaultdict(lambda: defaultdict(float))
    for w1 in model:
        total_count = float(sum(model[w1].values()))
        for w2 in model[w1]:
            probabilities[w1][w2] = model[w1][w2] / total_count  # P(w2 | w1)
    return probabilities

word_probabilities = calculate_word_probabilities(bigram_model)

# Step 4: Predict the next word given a context (uses only the last word)
def predict_next_word(model, context):
    if context[-1] in model:
        next_word_probs = model[context[-1]]
        return max(next_word_probs, key=next_word_probs.get)
    return None

user_input = input("Enter a sentence or context (e.g., 'The bank'): ")
context = user_input.lower().split()  # lowercase to match the tokenized corpus
predicted_word = predict_next_word(bigram_model, context)
print("\nPredicted next word given context '{}':".format(" ".join(context)), predicted_word)

Output:

[nltk_data] Downloading package punkt_tab to /root/nltk_data...


[nltk_data] Package punkt_tab is already up-to-date!
Enter a sentence or context (e.g., 'The bank'): I love

Predicted next word given context 'I love': iitm

Program 22: Write a python program to implement the n-gram smoothing technique.

Aim: To write a python program to implement the n-gram smoothing technique

Description: Incorporating smoothing into a bigram model helps handle cases where some bigrams never appear in the training corpus. This program uses Simple Additive Smoothing (also known as Laplace Smoothing): it builds n-grams from a given text and applies smoothing to avoid zero probabilities.
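
For a bigram (w1, w2), the smoothed probability is P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size. In the text below, count('this', 'is') = 2, count('this') = 2, and V = 11 unique words, so P(is | this) = (2 + 1) / (2 + 11) = 3/13 ≈ 0.230769, which matches the first line of the output.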

Code:
import collections
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

def get_ngrams(text, n):
    """Generate n-grams from the given text."""
    tokens = word_tokenize(text.lower())
    return list(ngrams(tokens, n))

def count_ngrams(ngram_list):
    return collections.Counter(ngram_list)

def laplace_smoothing(ngram_counts, unigram_counts, vocab_size, n):
    """Add-one smoothing: (count + 1) / (context_count + V)."""
    smoothed_probs = {}
    for ngram in ngram_counts:
        context = ngram[:-1]
        context_count = unigram_counts[context] if n > 1 else sum(unigram_counts.values())
        smoothed_probs[ngram] = (ngram_counts[ngram] + 1) / (context_count + vocab_size)
    return smoothed_probs

def build_vocabulary(text):
    """Build a vocabulary from the given text."""
    tokens = word_tokenize(text.lower())
    return set(tokens)

text = "this is a sample text with several words this is another sample text with some different words"
n = 2
ngrams_list = get_ngrams(text, n)
unigrams_list = get_ngrams(text, 1)
ngram_counts = count_ngrams(ngrams_list)
unigram_counts = count_ngrams(unigrams_list)
vocab = build_vocabulary(text)
vocab_size = len(vocab)
smoothed_probs = laplace_smoothing(ngram_counts, unigram_counts, vocab_size, n)

print("N-grams with their smoothed probabilities:")
for ngram, prob in smoothed_probs.items():
    print(f"{ngram}: {prob:.6f}")

Output:
N-grams with their smoothed probabilities:
('this', 'is'): 0.230769
('is', 'a'): 0.153846
('a', 'sample'): 0.166667
('sample', 'text'): 0.230769
('text', 'with'): 0.230769
('with', 'several'): 0.153846
('several', 'words'): 0.166667
('words', 'this'): 0.153846
('is', 'another'): 0.153846
('another', 'sample'): 0.166667
('with', 'some'): 0.153846
('some', 'different'): 0.166667
('different', 'words'): 0.166667

