NLP Record 2
Description:
Tokenization
Before representing text numerically, it must be broken down into smaller units called tokens.
In the code, tokenizer=lambda x: x.split() splits each sentence by whitespace.
Example:
"apple and banana" → ["apple", "and", "banana"]
The vectorizer builds a vocabulary of all unique tokens in the corpus.
One-Hot Encoding (Binary Representation)
This method represents whether each token from the vocabulary is present (1) or absent (0) in a sentence.
• binary=True in CountVectorizer ensures that we use a binary vector, not raw counts.
• Each vector’s length = size of the vocabulary.
• Each sentence becomes a vector indicating the presence of each word.
Example Corpus:
1. "apple and banana"
2. "banana and orange"
3. "grape apple banana"
Vocabulary: ['and', 'apple', 'banana', 'grape', 'orange']
Binary Matrix Output:
• [1, 1, 1, 0, 0] # "apple and banana"
• [1, 0, 1, 0, 1] # "banana and orange"
• [0, 1, 1, 1, 0] # "grape apple banana"
This matrix numerically represents the text and is suitable for downstream ML tasks.
Code:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"apple and banana",
"banana and orange",
"grape apple banana"
]
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), binary=True)
X = vectorizer.fit_transform(corpus).toarray()
print("Vocabulary:", vectorizer.get_feature_names_out())
print("One-Hot Encoded Matrix:\n", X)
Output:
Vocabulary: ['and' 'apple' 'banana' 'grape' 'orange']
One-Hot Encoded Matrix:
[[1 1 1 0 0]
[1 0 1 0 1]
[0 1 1 1 0]]
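Once fitted, the same vectorizer can encode unseen sentences against the learned vocabulary; a minimal continuation of the code above (the new sentence is an assumed example):
new_sentence = ["orange and kiwi"]
# Words outside the learned vocabulary (here "kiwi") are simply ignored
print(vectorizer.transform(new_sentence).toarray())   # [[1 0 0 0 1]]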
Program 14: Write a python code for demonstrating Count Vectorization, also known as the Bag-of-Words (BoW) model
— a foundational text representation technique in NLP.
Description:
This code demonstrates Count Vectorization, also known as the Bag-of-Words (BoW) model — a foundational text
representation technique in NLP.
What is Count Vectorization?
• CountVectorizer converts a collection of text documents into a matrix of token counts.
• It breaks each sentence into tokens (typically words), builds a vocabulary of all unique tokens, and creates vectors
indicating the frequency of each word in a sentence.
How it works:
1. Tokenization:
The text is split into words (tokens) using default rules (like splitting by spaces and removing punctuation).
2. Vocabulary Creation:
A set of all unique words across the corpus is created.
3. Vectorization:
Each sentence is converted into a numerical vector where:
o Each dimension corresponds to a word in the vocabulary.
o The value represents how many times that word appears in the sentence.
Code:
from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus (can be sentences, paragraphs, documents)
corpus = [
"Natural language processing is a field of artificial intelligence and language processing.",
"Machine learning and deep learning are parts of AI.",
"Natural language techniques are used in chatbots and translation.",
"The future of AI depends on advances in NLP."
]
# Initialize the vectorizer
vectorizer = CountVectorizer()
# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)
# Convert result to array and get feature names
count_matrix = X.toarray()
vocab = vectorizer.get_feature_names_out()
# Print vocabulary and count matrix
print("📚 Vocabulary:")
print(vocab)
print("\n🧮 Count Vector Matrix:")
for i, row in enumerate(count_matrix):
print(f"Sentence {i+1}: {row}")
Output:
📚 Vocabulary:
['advances' 'ai' 'and' 'are' 'artificial' 'chatbots' 'deep' 'depends'
'field' 'future' 'in' 'intelligence' 'is' 'language' 'learning' 'machine'
'natural' 'nlp' 'of' 'on' 'parts' 'processing' 'techniques' 'the'
'translation' 'used']
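To read any row of the count matrix back as (word, count) pairs, the vocabulary array can be zipped with that row; a short continuation of the code above:
# Show only the non-zero counts of the first sentence
for word, count in zip(vocab, count_matrix[0]):
    if count > 0:
        print(word, count)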
Program 14 A: Write a python code for TF-IDF (Term Frequency-Inverse Document Frequency) to understand which
words are important, not just frequent.
Description:
1. Count Vectorizer (Bag of Words)
• Converts each document into a vector based on word frequency.
• Ignores grammar and word order; only counts occurrences of words.
• Commonly used for basic text classification and NLP tasks.
Limitation:
Frequent but less meaningful words (like "data" or "systems") may dominate the representation, even if they don’t carry much
information.
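Under scikit-learn's default settings (smooth_idf=True, norm='l2'), TfidfVectorizer weights each term as tf(t, d) * idf(t) with idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t, and then L2-normalises each row vector. A quick illustrative check for the term "ai" in the corpus below (not part of the original record):
import numpy as np
n_docs, df_ai = 6, 3                 # "ai" appears in 3 of the 6 documents
idf_ai = np.log((1 + n_docs) / (1 + df_ai)) + 1
print(round(idf_ai, 4))              # ~1.5596 before L2 row normalisation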
Code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
corpus = [
"Artificial Intelligence is transforming industries and daily life through automation and smart systems.",
"Machine Learning, as a subset of AI, enables systems to learn from data without being explicitly programmed.",
"Deep Learning techniques use neural networks with many layers to model complex patterns in data such as images and
speech.",
"Applications of AI include self-driving cars, medical diagnosis, financial forecasting, and personalized recommendations.",
"Natural Language Processing helps computers understand, interpret, and generate human language using linguistic and
statistical techniques.",
"With rapid advancements in computing power and data availability, the future of AI continues to grow exponentially."
]
# Count Vectorizer
count_vectorizer = CountVectorizer(stop_words='english')
count_matrix = count_vectorizer.fit_transform(corpus)
count_df = pd.DataFrame(count_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())
# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
# Display
print("📊 Count Vector (Bag of Words):")
print(count_df)
print("\n🌟 TF-IDF Vector:")
print(tfidf_df)
Output:
📊 Count Vector (Bag of Words):
advancements ai applications artificial automation availability cars \
0 0 0 0 1 1 0 0
1 0 1 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 1 1 0 0 0 1
4 0 0 0 0 0 0 0
5 1 1 0 0 0 1 0
...
[6 rows x 61 columns]
🌟 TF-IDF Vector:
advancements ai applications artificial automation availability \
0 0.000000 0.000000 0.00000 0.33957 0.33957 0.000000
1 0.000000 0.240255 0.00000 0.00000 0.00000 0.000000
2 0.000000 0.000000 0.00000 0.00000 0.00000 0.000000
3 0.000000 0.204336 0.29515 0.00000 0.00000 0.000000
4 0.000000 0.000000 0.00000 0.00000 0.00000 0.000000
5 0.316885 0.219383 0.00000 0.00000 0.00000 0.316885
use using
0 0.000000 0.000000
1 0.000000 0.000000
2 0.290814 0.000000
3 0.000000 0.000000
4 0.000000 0.252599
5 0.000000 0.000000
[6 rows x 61 columns]
Program 15: Write a python code to implement word2vec word-embedding technique
Description: Word2vec represents a word as a high-dimensional vector of numbers that captures relationships between words. In particular, words which appear in similar contexts are mapped to vectors which are nearby as measured by cosine similarity. This indicates the level of semantic similarity between the words: for example, the vectors for "walk" and "ran" are nearby, as are those for "but" and "however", and "Berlin" and "Germany".
Code:
import gensim
import pandas as pd
# Load the Amazon Sports & Outdoors reviews and tokenize each review
df = pd.read_json("Sports_and_Outdoors_5.json", lines=True)
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
print(df.reviewText.loc[0])  # inspect one raw review
model = gensim.models.Word2Vec(
window=10,
min_count=2,
workers=4,
)
model.build_vocab(review_text, progress_per=1000)
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)
model.save("./word2vec-outdoor-reviews-short.model")
print(model.wv.most_similar("awful"))
Output:
[('terrible', 0.7352169156074524),
('horrible', 0.6891771554946899),
('overwhelming', 0.6227911710739136),
('impossibility', 0.5835400819778442),
('horrendous', 0.5827057957649231),
('enormous', 0.5721909999847412),
('ugly', 0.567825436592102),
('unusual', 0.566750705242157),
('isolated', 0.5588798522949219),
('unfortunate', 0.5560564994812012)]
• model.wv.similarity(w1="good", w2="great")
output: 0.7870506
• model.wv.similarity(w1="slow", w2="steady")
output: 0.3472042
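The saved model can be reloaded later and individual word vectors inspected; a short sketch using standard gensim calls and the file name saved above:
from gensim.models import Word2Vec
loaded = Word2Vec.load("./word2vec-outdoor-reviews-short.model")
print(loaded.wv["awful"].shape)                         # 100-dimensional vector (gensim default)
print(loaded.wv.doesnt_match(["good", "great", "awful"]))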
Program 16 A: Write a python program to create a sample list for at least 5 words with ambiguous sense
Description:
A word sense is a specific meaning of a word, especially when the word has multiple meanings depending on the context. This
concept is central to understanding and processing natural language correctly.
• Word: "bank"
o bank.n.01: sloping land beside a body of water
o depository_financial_institution.n.01: a financial institution that accepts deposits
Code:
import nltk
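from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
nltk.download('wordnet')
nltk.download('punkt')

# The record omits the definitions of contexts and wsd used in the loop below;
# the following is a minimal reconstruction (assumed), not the original listing.
# Sample contexts for five words with ambiguous senses:
contexts = {
    "bank": "He deposited his salary in the bank near the river",
    "bat": "he got four hits in four at-bats",
    "spring": "The flowers bloom in spring after the rains",
    "bark": "The dog began to bark at the stranger",
    "light": "She switched on the light in the dark room",
}

def wsd(word, context):
    """List every WordNet sense of the word, then pick the sense that best fits the context."""
    senses = wn.synsets(word)
    print(f"Senses of the word '{word}':")
    for i, sense in enumerate(senses, start=1):
        print(f"{i}. {sense.name()}: {sense.definition()}")
    best = lesk(nltk.word_tokenize(context), word)    # Lesk-based disambiguation
    if best is None:
        return None
    print(f"Context matched with sense {senses.index(best) + 1}: {best.name()}")
    if best.examples():
        print(f"Example: {best.examples()[0]}")
    return best.name()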
for word in contexts:
print(f"\nWSD for the word: {word}")
word_context = contexts[word]
sense = wsd(word, word_context)
print(f"Disambiguated sense of '{word}': {sense}")
Output:
WSD for the word: bank
Senses of the word 'bank':
1. bank.n.01: sloping land (especially the slope beside a body of water)
2. depository_financial_institution.n.01: a financial institution that accepts deposits and channels the money into lending
activities
3. bank.n.03: a long ridge or pile
4. bank.n.04: an arrangement of similar objects in a row or in tiers
5. bank.n.05: a supply or stock held in reserve for future use (especially in emergencies)
6. bank.n.06: the funds held by a gambling house or the dealer in some gambling games
7. bank.n.07: a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal
force
8. savings_bank.n.02: a container (usually with a slot in the top) for keeping money at home
9. bank.n.09: a building in which the business of banking transacted
10. bank.n.10: a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
11. bank.v.01: tip laterally
12. bank.v.02: enclose with a bank
13. bank.v.03: do business with a bank or keep an account at a bank
14. bank.v.04: act as the banker in a game or in gambling
15. bank.v.05: be in the banking business
16. deposit.v.02: put into a bank account
17. bank.v.07: cover with ashes so to control the rate of burning
18. trust.v.01: have confidence or faith in
Context matched with sense 2: bat.n.02
Example: he got four hits in four at-bats
Disambiguated sense of 'bat': bat.n.02
Program 16 B: Write a python program to implement Lesk’s algorithm for word sense disambiguation.
Description:
Word Sense Disambiguation (WSD) is the process of identifying the correct meaning (sense) of a word based on its context,
especially when the word has multiple meanings.
Lesk’s algorithm is a knowledge-based method for WSD that disambiguates a word by comparing the dictionary definitions
(glosses) of its possible senses with the context in which the word appears.
Code:
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
def lesk_algorithm(word, context):
senses = wn.synsets(word)
if not senses:
return None
max_overlap = 0
best_sense = None
context_tokens = set(word_tokenize(' '.join(context).lower()))
for sense in senses:
# Get the definition and examples of the sense
definition = set(word_tokenize(sense.definition().lower()))
examples = set()
for example in sense.examples():
examples.update(word_tokenize(example.lower()))
overlap = len(context_tokens.intersection(definition.union(examples)))
if overlap > max_overlap:
max_overlap = overlap
best_sense = sense
return best_sense
# Example usage:
context = ["The bark of the tree is rough and textured."]
word = "bark"
best_sense = lesk_algorithm(word, context)
if best_sense:
print(f"Disambiguated sense for '{word}': {best_sense.name()}")
print(f"Definition: {best_sense.definition()}")
else:
print(f"No sense found for '{word}'")
context = ["The dog started to bark loudly in the yard."]
word = "bark"
best_sense = lesk_algorithm(word, context)
if best_sense:
print(f"Disambiguated sense for '{word}': {best_sense.name()}")
print(f"Definition: {best_sense.definition()}")
else:
print(f"No sense found for '{word}'")
Output:
Disambiguated sense for 'bark': bark.v.03
Definition: remove the bark of a tree
Disambiguated sense for 'bark': bark.n.02
Definition: a noise resembling the bark of a dog
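NLTK also ships a ready-made Lesk implementation in nltk.wsd.lesk, which can be used to cross-check the hand-written version above; a brief sketch:
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
tokens = word_tokenize("The dog started to bark loudly in the yard.")
sense = lesk(tokens, "bark")
print(sense.name(), "-", sense.definition())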
Program 17: Write a python program using NLTK package to convert audio file to text and text to audio files
Description:
NLTK does not handle audio files directly, so we will use the speech_recognition and gTTS libraries for the audio side and NLTK to process the text in between.
1. Speech_recognition:
speech_recognition is a Python library that helps you convert spoken audio into written text using speech
recognition engines.
2. gTTS library (convert text to audio)
gTTS stands for Google Text-to-Speech — it's a Python library and CLI tool that lets you convert text into spoken
audio using the Google Text-to-Speech API. It takes in text (string), sends it to Google's Text-to-Speech engine, and
returns a spoken audio file in MP3 format.
Code:
import speech_recognition as sr
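from gtts import gTTS

# Helper functions (assumed reconstruction, not shown in the original listing),
# built on the speech_recognition and gTTS APIs.
def audio_to_text(audio_path):
    """Transcribe a WAV file to text with Google's free speech recognition service."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)
    try:
        text = recognizer.recognize_google(audio)
        print("Recognized Text:\n", text)
        return text
    except sr.UnknownValueError:
        print("Could not understand the audio")
        return None

def text_to_audio(text, output_path):
    """Synthesise speech from text and save it as an MP3 file."""
    gTTS(text=text, lang="en").save(output_path)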
audio_path = "Sports.wav"
output_audio_path = "output.mp3"
text = audio_to_text(audio_path)
import nltk
from nltk.tokenize import word_tokenize
if text:
tokens = word_tokenize(text)
print("Tokenized Text:\n", tokens)
text1="hello welcome to class"
text_to_audio(text1, output_audio_path)
Output:
Recognized Text:
good evening ladies and gentlemen we like to welcome you to play the new videos Broadcast
Tokenized Text:
['good', 'evening', 'ladies', 'and', 'gentlemen', 'we', 'like', 'to', 'welcome', 'you', 'to', 'play', 'the', 'new', 'videos', 'Broadcast']
Program 18: Write a python program using NLTK package to explore the FrameNet lexical database: frames, frame elements (FEs), and lexical units (LUs)
Description:
FrameNet is a linguistic database that organizes words based on the situations (called frames) they describe, showing how words
are connected to roles and events in real-world experiences.
1. Helps computers understand not just words, but meanings and relationships.
2. Useful in AI, language learning, chatbots, machine translation, and semantic search.
The key elements of FrameNet are Frames, Frame Elements (FEs), and Lexical Units (LUs).
Code:
import nltk
from nltk.corpus import framenet as fn
nltk.download('framenet_v17')
# List the names of all frames in FrameNet
frames = fn.frames()
for i, frame in enumerate(frames):
    print(f"{i+1}. {frame.name}")
Output:
Program to print the details of a particular frame
import nltk
from nltk.corpus import framenet as fn
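# The body of this listing is missing from the record; a minimal reconstruction
# (assumed) that prints one frame's name, definition, and frame elements.
nltk.download('framenet_v17')
frame = fn.frame('Awareness')
print("📌 Frame Name:", frame.name)
print("📝 Definition:", frame.definition)
print("🧩 Frame Elements (FEs):")
for fe_name, fe in frame.FE.items():
    print(f"- {fe_name}: {fe.coreType} - {fe.definition}")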
Output:
[nltk_data] Downloading package framenet_v17 to /root/nltk_data...
📌 Frame Name: Awareness
📝 Definition: A Cognizer has a piece of Content in their model of the world. The Content is not necessarily present due to
immediate perception, but usually, rather, due to deduction from perceivables. In some cases, the deduction of the Content is
implicitly based on confidence in sources of information (believe), in some cases based on logic (think), and in other cases the
source of the deduction is deprofiled (know). 'Your boss is aware of your commitment.' '' Note that this frame is undergoing
some degree of reconsideration. Many of the targets will be moved to the Opinion frame.' In the uses that will remain in the
Awareness frame, however, the Content is presupposed. '' This frame is also distinct from the Certainty frame, in that it does not
profile the relationship of the Cognizer to the Content, but rather presupposes it. In Certainty, the Degree of confidence or
certainty is expressible as a separate frame element, as in the following: 'She absolutely knew that he would be there .'
🧩 Frame Elements (FEs):
- Cognizer: Core — The Cognizer is the person whose awareness of phenomena is at question. With a target verb or adjective the
Cognizer is generally expressed as an External Argument with the Content expressed as an Object or Complement. 'Your boss is
aware of your commitment.' 'The students do not know the answer.'
- Content: Core — The Content is the object of the Cognizer's awareness. Content can be expressed as a direct object or in a PP
Complement. 'Your boss is aware of your commitment.' 'The students do not know of your commitment.' 'The students do not
know how committed you are.' '
- Evidence: Peripheral — The source of awareness or knowledge which can be expressed in a PP Complement: 'The sailors knew
from the look of the sky that a storm was coming.' 'I knew from experience that Jo would be late.'
- Topic: Core — Some verbs in this frame allow a Topic to be expressed in about-PPs. 'Kim knows about first aid.' However, a
number of nouns and adjectives in this frame which cannot take about-phrases allow Topic to be expressed as an adjectival or
adverbial modifier. 'Kim is politically aware. ' ' Environmental consciousness is increasing.'
- Degree: Peripheral — This FE identifies the Degree to which an event occurs.
- Manner: Peripheral — This FE identifies the Manner in which the Cognizer knows or thinks something.
- Expressor: Core — Expressor is the body part that reveals the Cognizer's state to the observer. 'Bob's eyes were overly aware'
- Role: Peripheral — Role is the category within which an element of the Content is considered. 'He understood her remark as an
insult.'
- Paradigm: Extra-Thematic — This frame element identifies the Paradigm which serves as the basis for the Cognizer's
awareness. 'The formation of black holes should be understood in astrophysic terms.'
- Time: Peripheral — The time interval during which the Cognizer is aware of the Content. 'Yet there is no evidence that Mr.
Parrish was cognizant at the time of the signing of the notes that the clauses in issue were present.'
- Explanation: Extra-Thematic — The reason why or how it came to be that the Cognizer has awareness of the Topic or Content.
[nltk_data] Package framenet_v17 is already up-to-date!
Program to invoke a particular Frame based on Lexical unit in the given sentence
import nltk
import spacy
from nltk.corpus import framenet as fn
nltk.download('framenet_v17')
nlp = spacy.load("en_core_web_sm")
def find_frames_for_sentence(sentence):
doc = nlp(sentence)
frames_found = {}
for token in doc:
if token.pos_ in {"VERB", "NOUN"}: # Likely to evoke frames
lemma = token.lemma_
frames = fn.frames_by_lemma(lemma)
if frames:
frames_found[token.text] = [frame.name for frame in frames]
return frames_found
sentence = "We believe it is a fair and generous price."
frames_invoked = find_frames_for_sentence(sentence)
print(f"Sentence: {sentence}")
print("\nInvoked FrameNet Frames:")
for word, frames in frames_invoked.items():
print(f"- {word}: {', '.join(frames)}")
Output:
Program 19: Write a python program using NLTK package to explore WordNet synsets, their definitions and examples, and compute semantic similarity between words
Description: WordNet is a lexical database of the English language, widely used in Natural Language Processing (NLP) tasks.
It groups words into sets of synonyms (synsets) and organizes them into a network based on their semantic relationships like
hyponyms (more specific), hypernyms (more general), meronyms (part-whole), and holonyms (whole-part).
Code:
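# The code listing is missing from the record; a minimal reconstruction (assumed)
# that produces the synset listing and similarity score shown in the output below.
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

# Print every WordNet synset of "car" with its definition and usage examples
for synset in wn.synsets('car'):
    print("Synset:", synset.name())
    print("Definition:", synset.definition())
    print("Examples:", synset.examples())

# Compare two concepts with path similarity (closer to 1.0 = more similar)
word1 = wn.synset('dog.n.01')
word2 = wn.synset('cat.n.01')
similarity = word1.path_similarity(word2)
print(f"Similarity between 'dog' and 'cat': {similarity}")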
Output:
Synset: car.n.01
Definition: a motor vehicle with four wheels; usually propelled by an internal combustion engine
Examples: ['he needs a car to get to work']
Synset: car.n.02
Definition: a wheeled vehicle adapted to the rails of railroad
Examples: ['three cars had jumped the rails']
Synset: car.n.03
Definition: the compartment that is suspended from an airship and that carries personnel and the cargo and the power
plant
Examples: []
Synset: car.n.04
Definition: where passengers ride up and down
Examples: ['the car was on the top floor']
Synset: cable_car.n.01
Definition: a conveyance for passengers or freight on a cable railway
Examples: ['they took a cable car to the top of the mountain']
similarity = word1.path_similarity(word2)
print(f"Similarity between 'dog' and 'cat': {similarity}")
Output: Similarity between 'dog' and 'cat': 0.2
Program 20: Write a Python code to generate n-grams using NLTK n-gram Library
Description:
An n-gram is a continuous sequence of n items (typically words or characters) from a given text or speech. It's a
fundamental concept in Natural Language Processing (NLP) and is used in many tasks, including language modeling, text
analysis, speech recognition, and machine translation. n represents the number of items (usually words) in the sequence.
Code:
import collections
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
def get_ngrams(text, n):
"""Generate n-grams from the given text."""
tokens = word_tokenize(text.lower())
n_grams = ngrams(tokens, n)
return list(n_grams)
text1 = "this is a sample text with several words this is another sample text with some different words"
text = "Sample list of words"
print("List of Unigram")
ngrams_list = get_ngrams(text, 1)
for ngram in ngrams_list:
print(ngram)
print("List of Bigrams")
ngrams_list = get_ngrams(text, 2)
for ngram in ngrams_list:
print(ngram)
print("List of Trigrams")
ngrams_list = get_ngrams(text, 3)
for ngram in ngrams_list:
print(ngram)
Output:
List of Unigram
('sample',)
('list',)
('of',)
('words',)
List of Bigrams
('sample', 'list')
('list', 'of')
('of', 'words')
List of Trigrams
('sample', 'list', 'of')
('list', 'of', 'words')
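The unused import collections at the top of the listing points to a natural next step: counting how often each n-gram occurs. A short continuation using the longer text1 string already defined in the code (illustrative, not part of the recorded output):
bigram_counts = collections.Counter(get_ngrams(text1, 2))
print(bigram_counts.most_common(3))   # e.g. ('this', 'is') occurs twice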
Program 21: Write a python program to train bi-gram model for a given corpus of text to predict the next probable word
given the previous two words of a sentence.
Code:
import nltk
from nltk.util import ngrams
from collections import defaultdict
# Step 1: Prepare the corpus (tokenize sentences)
corpus = [
"The weather is beautiful today",
"I am learning natural language processing",
"I am studing ML",
"Machine learning is fascinating",
"Language models are powerful tools",
"I love IITM”
"I am a Coder"
"My passion is developing real world problem solving applications"
]
tokenized_corpus = [nltk.word_tokenize(sentence.lower()) for sentence in corpus]
# Step 2: Train the bigram model
def train_bigram_model(corpus):
model = defaultdict(lambda: defaultdict(int)) # Bigram model
for sentence in corpus:
# Generate bigrams for each sentence
bigrams = ngrams(sentence, 2)
for w1, w2 in bigrams:
model[w1][w2] += 1 # Increment the count for each bigram
return model
bigram_model = train_bigram_model(tokenized_corpus)
# Step 3: Calculate word probabilities (bigram probabilities)
def calculate_word_probabilities(model):
probabilities = defaultdict(lambda: defaultdict(float))
for w1 in model:
total_count = float(sum(model[w1].values()))
for w2 in model[w1]:
probabilities[w1][w2] = model[w1][w2] / total_count # Probability of w2 after w1
return probabilities
word_probabilities = calculate_word_probabilities(bigram_model)
# Step 4: Predict the next word given a context (bigram prediction)
def predict_next_word(model, context):
if context[-1] in model:
next_word_probs = model[context[-1]]
next_word = max(next_word_probs, key=next_word_probs.get)
return next_word
else:
return None
user_input = input("Enter a sentence or context (e.g., 'The bank'): ")
context = user_input.split()
predicted_word = predict_next_word(bigram_model, context)
print("\nPredicted next word given context '{}':".format(" ".join(context)), predicted_word)
Output:
Program 22: Write a python program to train a bi-gram model with simple additive (Laplace) smoothing for a given corpus of text.
Description: Incorporating smoothing into a bi-gram model helps to handle cases where some bi-grams do not appear in the training corpus. Here we use the Simple Additive Smoothing (also known as Laplace Smoothing) method. This program builds n-grams from the given text and applies smoothing to handle zero probabilities.
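With add-one smoothing the bigram probability becomes P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size. For the text used below, V = 11, "this is" occurs twice and "this" occurs twice, so P(is | this) = (2 + 1) / (2 + 11) ≈ 0.230769, which matches the first line of the output.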
Code:
import collections
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
def get_ngrams(text, n):
tokens = word_tokenize(text.lower())
n_grams = ngrams(tokens, n)
return list(n_grams)
def count_ngrams(ngrams):
ngram_counts = collections.Counter(ngrams)
return ngram_counts
def laplace_smoothing(ngram_counts, unigram_counts, vocab_size, n):
smoothed_probs = {}
for ngram in ngram_counts:
context = ngram[:-1]
context_count = unigram_counts[context] if n > 1 else sum(unigram_counts.values())
smoothed_probs[ngram] = (ngram_counts[ngram] + 1) / (context_count + vocab_size)
return smoothed_probs
def build_vocabulary(text):
"""Build a vocabulary from the given text."""
tokens = word_tokenize(text.lower())
vocab = set(tokens)
return vocab
text = "this is a sample text with several words this is another sample text with some different words"
n=2
ngrams_list = get_ngrams(text, n)
unigrams_list = get_ngrams(text, 1)
ngram_counts = count_ngrams(ngrams_list)
unigram_counts = count_ngrams(unigrams_list)
vocab = build_vocabulary(text)
vocab_size = len(vocab)
smoothed_probs = laplace_smoothing(ngram_counts, unigram_counts, vocab_size, n)
print("N-grams with their smoothed probabilities:")
for ngram, prob in smoothed_probs.items():
print(f"{ngram}: {prob:.6f}")
Output:
N-grams with their smoothed probabilities:
('this', 'is'): 0.230769
('is', 'a'): 0.153846
('a', 'sample'): 0.166667
('sample', 'text'): 0.230769
('text', 'with'): 0.230769
('with', 'several'): 0.153846
('several', 'words'): 0.166667
('words', 'this'): 0.153846
('is', 'another'): 0.153846
('another', 'sample'): 0.166667
('with', 'some'): 0.153846
('some', 'different'): 0.166667
('different', 'words'): 0.166667