
Bansilal Ramnath Agarwal Charitable Trust's

Vishwakarma Institute of Information Technology

Department of
Artificial Intelligence and Data Science

Student Name: Vaishnavi Lawate

Class: TY Division: A Roll No: 371032

Semester: 6th Academic Year: 2024-25

Subject Name & Code: Natural Language Processing


Title of Assignment: 8) Develop a language model to predict next best word.

Aim : Develop a language model to predict next best word.


Background Theory :

Introduction to Language Modeling


A language model (LM) is a probabilistic model that predicts the likelihood of a sequence of
words. In simple terms, it learns how words typically follow one another in a language and
can be used to predict the next word in a sentence. This is a fundamental concept in Natural
Language Processing (NLP), useful in applications like text generation, autocomplete,
translation, and speech recognition.

What Does “Next Best Word Prediction” Mean?


Given a sequence of words like:
“The weather is”
A language model attempts to predict the most probable word that should come next:
“nice”, “sunny”, “rainy”, etc.
The model evaluates these candidates based on statistical probability or learned patterns.
Types of Language Models
1. Statistical Language Models
These are based on probability and statistics, especially n-gram models.
N-gram Models
• An n-gram is a sequence of ‘n’ words.
• The probability of the next word is based on the previous n−1 words.
• Formula:
P(wᵢ | w₁, ..., wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁)
• For a bigram model (n=2):
P(the weather is nice) ≈ P(the) × P(weather | the) × P(is | weather) × P(nice | is)
Advantages:
• Simple and fast.
• Easy to implement.
Limitations:
• Cannot capture long-term dependencies.
• Requires smoothing to handle unseen word combinations (illustrated in the sketch below).
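
To make the chain rule and the smoothing issue concrete, here is a small, self-contained sketch. The counts, corpus size, and vocabulary size are made up purely for illustration (they are not taken from the assignment's corpus); it scores the sentence "the weather is nice" with a bigram model using add-one (Laplace) smoothing.

from math import prod

# Hypothetical counts, corpus size, and vocabulary size (for illustration only)
unigram = {"the": 10, "weather": 4, "is": 12, "nice": 3}
bigram = {("the", "weather"): 3, ("weather", "is"): 4, ("is", "nice"): 2}
total_words = 100
V = 50

def p_bigram(prev, word):
    # Add-one (Laplace) smoothing: (count(prev, word) + 1) / (count(prev) + V)
    return (bigram.get((prev, word), 0) + 1) / (unigram.get(prev, 0) + V)

sentence = ["the", "weather", "is", "nice"]
p = (unigram["the"] / total_words) * prod(
    p_bigram(prev, word) for prev, word in zip(sentence, sentence[1:]))
print(p)  # probability of the whole sentence under the bigram model

Without the +1 in the numerator and +V in the denominator, any unseen word pair would drive the whole sentence probability to zero, which is why smoothing is needed.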

2. Neural Language Models


These models use deep learning techniques, and they learn context better than statistical
models.
Examples:
• Feedforward Neural Networks
• Recurrent Neural Networks (RNNs)
• Long Short-Term Memory (LSTM)
• Gated Recurrent Units (GRU)
• Transformers (e.g., GPT, BERT)
How They Work:
• Words are converted into vectors (word embeddings).
• The model processes sequences of these embeddings.
• It learns patterns and relationships between words over time.
• It then predicts the word with the highest probability as the next best word (a minimal sketch follows at the end of this section).
Advantages:
• Can capture long-range dependencies.
• Context-aware and dynamic.
Limitations:
• Requires large datasets and computational power.
• Slower training process than n-gram models.
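
The sketch below is a minimal illustration of the neural approach, not the assignment's implementation. It assumes PyTorch is installed, trains a tiny LSTM on a toy corpus, and uses a context of only one previous word; it is meant only to show the embedding → LSTM → softmax pipeline with categorical cross-entropy.

import torch
import torch.nn as nn

# Toy corpus; a real model would train on a large text collection
text = ("artificial intelligence is a branch of computer science "
        "ai is widely used in healthcare finance and education").split()
vocab = sorted(set(text))
stoi = {w: i for i, w in enumerate(vocab)}

# (previous word -> next word) training pairs; real models use longer contexts
X = torch.tensor([stoi[w] for w in text[:-1]])
y = torch.tensor([stoi[w] for w in text[1:]])

class NextWordLSTM(nn.Module):
    def __init__(self, vocab_size, emb=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        e = self.emb(x).unsqueeze(1)   # (batch, seq_len=1, emb)
        h, _ = self.lstm(e)
        return self.out(h[:, -1])      # logits over the vocabulary

model = NextWordLSTM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()        # categorical cross-entropy

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# Predict the most probable word after "of"
logits = model(torch.tensor([stoi["of"]]))
print(vocab[int(logits.argmax())])     # likely "computer" after training

The same structure scales up by feeding longer context windows, a much larger vocabulary, and far more training data, which is where the data and compute requirements noted above come from.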

How to Develop a Language Model for Next Word Prediction


Steps:
1. Data Collection:
Gather a large corpus (text dataset) such as news articles, Wikipedia, books, or movie
scripts.
2. Text Preprocessing:
Tokenization, lowercasing, removing punctuation, special characters, etc.
3. Model Building:
Choose model type: N-gram or Neural (RNN, LSTM, Transformer)
For N-gram: Count frequencies and compute probabilities.
For Neural: Use sequences of words as input to a deep learning model.
4. Training:
Train the model on large amounts of text data.
For neural networks, use loss functions like categorical cross-entropy.
5. Prediction:
Given a partial sentence, feed it to the model.
Output the word with the highest probability as the next best word.
6. Evaluation:
Use metrics like perplexity for language models (see the sketch below).
You can also do human evaluation or BLEU scores (for generation tasks).
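
As a rough illustration of step 6, the sketch below computes perplexity over a held-out word list. It assumes a helper prob(prev, word) that returns the model's smoothed bigram probability (for example, one built from the smoothed_bigrams dictionary in the code that follows); the helper name is hypothetical.

import math

def perplexity(test_words, prob):
    # Sum of log probabilities of each word given its predecessor
    log_prob = sum(math.log(prob(prev, word))
                   for prev, word in zip(test_words, test_words[1:]))
    n = len(test_words) - 1            # number of predicted words
    return math.exp(-log_prob / n)     # lower perplexity = better model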

CODE AND OUTPUT:


import re
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize

# Download tokenizer
nltk.download('punkt')

# Sample corpus (Different from original)
corpus = """
Artificial Intelligence is a branch of computer science.
It focuses on building smart machines capable of performing tasks that typically require human intelligence.
AI is widely used in various fields including healthcare, finance, and education.
"""

# Preprocessing
corpus = corpus.lower()
corpus = re.sub(r'[^\w\s]', '', corpus)  # Remove punctuation
words = word_tokenize(corpus)

# Unigram, Bigram, and Trigram counts
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
trigrams = Counter(zip(words, words[1:], words[2:]))

# Vocabulary size
vocab_size = len(set(words))

# Smoothed bigram probabilities (add-one / Laplace smoothing)
smoothed_bigrams = {
    bigram: (count + 1) / (unigrams[bigram[0]] + vocab_size)
    for bigram, count in bigrams.items()
}

# Smoothed trigram probabilities (add-one / Laplace smoothing)
smoothed_trigrams = {
    trigram: (count + 1) / (bigrams[(trigram[0], trigram[1])] + vocab_size)
    for trigram, count in trigrams.items()
    if (trigram[0], trigram[1]) in bigrams
}

# Prediction function
def predict_next_word(previous_text, n=3):
    previous_text = previous_text.lower()
    previous_text = re.sub(r'[^\w\s]', '', previous_text)
    words = word_tokenize(previous_text)

    if len(words) < n - 1:
        return f"Please provide at least {n - 1} words."

    if n == 2:
        candidates = [(word, smoothed_bigrams.get((words[-1], word), 1 / vocab_size))
                      for word in unigrams]
    elif n == 3:
        candidates = [(word, smoothed_trigrams.get((words[-2], words[-1], word), 1 / vocab_size))
                      for word in unigrams]
    else:
        return "n can only be 2 or 3."

    candidates.sort(key=lambda x: x[1], reverse=True)

    return candidates[0][0] if candidates else "No prediction available"

# Test cases
input_bigram = "smart"
predicted_bigram = predict_next_word(input_bigram, 2)
print(f"Next word after '{input_bigram}': {predicted_bigram}")

input_trigram = "a branch of"
predicted_trigram = predict_next_word(input_trigram, 3)
print(f"Next word after '{input_trigram}': {predicted_trigram}")

Next word after 'smart': machines


Next word after 'a branch of': computer

Conclusion:
Developing a language model to predict the next best word is a core NLP task. Traditional n-gram
models provide a statistical foundation, while modern neural models enable context-aware,
powerful predictions. Such models form the backbone of intelligent applications like chatbots,
virtual assistants, and content generators.
