
Bansilal Ramnath Agarwal Charitable Trust's

Vishwakarma Institute of Information Technology

Department of
Artificial Intelligence and Data Science

Student Name: Vaishnavi Lawate

Class: TY Division: A Roll No: 371032

Semester: 6th Academic Year: 2024-25

Subject Name & Code: Natural Language Processing


Title of Assignment: 8) Develop a language model to predict next best word.

Aim : Develop a language model to predict next best word.


Background Theory :

Introduction to Language Modeling


A language model (LM) is a probabilistic model that predicts the likelihood of a sequence of
words. In simple terms, it learns how words typically follow one another in a language and
can be used to predict the next word in a sentence. This is a fundamental concept in Natural
Language Processing (NLP), useful in applications like text generation, autocomplete,
translation, and speech recognition.

What Does “Next Best Word Prediction” Mean?


Given a sequence of words like:
“The weather is”
A language model attempts to predict the most probable word that should come next:
“nice”, “sunny”, “rainy”, etc.
The model evaluates these candidates based on statistical probability or learned patterns.
Types of Language Models
1. Statistical Language Models
These are based on probability and statistics, especially n-gram models.
N-gram Models
• An n-gram is a sequence of ‘n’ words.
• The probability of the next word is based on the previous n−1 words.
• Formula:
P(wᵢ | w₁, ..., wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁)
• For a bigram model (n=2):
P(the weather is nice) ≈ P(the) × P(weather | the) × P(is | weather) × P(nice | is)
Advantages:
• Simple and fast.
• Easy to implement.
Limitations:
• Cannot capture long-term dependencies.
• Requires smoothing to handle unseen word combinations (illustrated in the sketch below).
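
To make the chain rule and the smoothing issue concrete, here is a small, self-contained sketch. The counts, corpus size, and vocabulary size are made up purely for illustration (they are not taken from the assignment's corpus); it scores the sentence "the weather is nice" with a bigram model using add-one (Laplace) smoothing.

from math import prod

# Hypothetical counts, corpus size, and vocabulary size (for illustration only)
unigram = {"the": 10, "weather": 4, "is": 12, "nice": 3}
bigram = {("the", "weather"): 3, ("weather", "is"): 4, ("is", "nice"): 2}
total_words = 100
V = 50

def p_bigram(prev, word):
    # Add-one (Laplace) smoothing: (count(prev, word) + 1) / (count(prev) + V)
    return (bigram.get((prev, word), 0) + 1) / (unigram.get(prev, 0) + V)

sentence = ["the", "weather", "is", "nice"]
p = (unigram["the"] / total_words) * prod(
    p_bigram(prev, word) for prev, word in zip(sentence, sentence[1:]))
print(p)  # probability of the whole sentence under the bigram model

Without the +1 in the numerator and +V in the denominator, any unseen word pair would drive the whole sentence probability to zero, which is why smoothing is needed.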

2. Neural Language Models


These models use deep learning techniques, and they learn context better than statistical
models.
Examples:
• Feedforward Neural Networks
• Recurrent Neural Networks (RNNs)
• Long Short-Term Memory (LSTM)
• Gated Recurrent Units (GRU)
• Transformers (e.g., GPT, BERT)
How They Work:
• Words are converted into vectors (word embeddings).
• The model processes sequences of these embeddings.
• It learns patterns and relationships between words over time.
• It then predicts the word with the highest probability as the next best word (a minimal sketch follows at the end of this section).
Advantages:
• Can capture long-range dependencies.
• Context-aware and dynamic.
Limitations:
• Requires large datasets and computational power.
• Slower training process than n-gram models.
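
The sketch below is a minimal illustration of the neural approach, not the assignment's implementation. It assumes PyTorch is installed, trains a tiny LSTM on a toy corpus, and uses a context of only one previous word; it is meant only to show the embedding → LSTM → softmax pipeline with categorical cross-entropy.

import torch
import torch.nn as nn

# Toy corpus; a real model would train on a large text collection
text = ("artificial intelligence is a branch of computer science "
        "ai is widely used in healthcare finance and education").split()
vocab = sorted(set(text))
stoi = {w: i for i, w in enumerate(vocab)}

# (previous word -> next word) training pairs; real models use longer contexts
X = torch.tensor([stoi[w] for w in text[:-1]])
y = torch.tensor([stoi[w] for w in text[1:]])

class NextWordLSTM(nn.Module):
    def __init__(self, vocab_size, emb=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        e = self.emb(x).unsqueeze(1)   # (batch, seq_len=1, emb)
        h, _ = self.lstm(e)
        return self.out(h[:, -1])      # logits over the vocabulary

model = NextWordLSTM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()        # categorical cross-entropy

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# Predict the most probable word after "of"
logits = model(torch.tensor([stoi["of"]]))
print(vocab[int(logits.argmax())])     # likely "computer" after training

The same structure scales up by feeding longer context windows, a much larger vocabulary, and far more training data, which is where the data and compute requirements noted above come from.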

How to Develop a Language Model for Next Word Prediction


Steps:
1. Data Collection:
Gather a large corpus (text dataset) such as news articles, Wikipedia, books, or movie
scripts.
2. Text Preprocessing:
Tokenization, lowercasing, removing punctuation, special characters, etc.
3. Model Building:
Choose model type: N-gram or Neural (RNN, LSTM, Transformer)
For N-gram: Count frequencies and compute probabilities.
For Neural: Use sequences of words as input to a deep learning model.
4. Training:
Train the model on large amounts of text data.
For neural networks, use loss functions like categorical cross-entropy.
5. Prediction:
Given a partial sentence, feed it to the model.
Output the word with the highest probability as the next best word.
6. Evaluation:
Use metrics like perplexity for language models (see the sketch below).
You can also do human evaluation or BLEU scores (for generation tasks).
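
As a rough illustration of step 6, the sketch below computes perplexity over a held-out word list. It assumes a helper prob(prev, word) that returns the model's smoothed bigram probability (for example, one built from the smoothed_bigrams dictionary in the code that follows); the helper name is hypothetical.

import math

def perplexity(test_words, prob):
    # Sum of log probabilities of each word given its predecessor
    log_prob = sum(math.log(prob(prev, word))
                   for prev, word in zip(test_words, test_words[1:]))
    n = len(test_words) - 1            # number of predicted words
    return math.exp(-log_prob / n)     # lower perplexity = better model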

CODE AND OUTPUT:


import re
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize

# Download tokenizer
nltk.download('punkt')

# Sample corpus (Different from original)
corpus = """
Artificial Intelligence is a branch of computer science.
It focuses on building smart machines capable of performing tasks that typically require human intelligence.
AI is widely used in various fields including healthcare, finance, and education.
"""

# Preprocessing
corpus = corpus.lower()
corpus = re.sub(r'[^\w\s]', '', corpus)  # Remove punctuation
words = word_tokenize(corpus)

# Unigram, Bigram, and Trigram counts
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
trigrams = Counter(zip(words, words[1:], words[2:]))

# Vocabulary size
vocab_size = len(set(words))

# Smoothed bigram probabilities (add-one / Laplace smoothing)
smoothed_bigrams = {
    bigram: (count + 1) / (unigrams[bigram[0]] + vocab_size)
    for bigram, count in bigrams.items()
}

# Smoothed trigram probabilities (add-one / Laplace smoothing)
smoothed_trigrams = {
    trigram: (count + 1) / (bigrams[(trigram[0], trigram[1])] + vocab_size)
    for trigram, count in trigrams.items()
    if (trigram[0], trigram[1]) in bigrams
}

# Prediction function
def predict_next_word(previous_text, n=3):
    previous_text = previous_text.lower()
    previous_text = re.sub(r'[^\w\s]', '', previous_text)
    words = word_tokenize(previous_text)

    if len(words) < n - 1:
        return f"Please provide at least {n - 1} words."

    if n == 2:
        candidates = [(word, smoothed_bigrams.get((words[-1], word), 1 / vocab_size))
                      for word in unigrams]
    elif n == 3:
        candidates = [(word, smoothed_trigrams.get((words[-2], words[-1], word), 1 / vocab_size))
                      for word in unigrams]
    else:
        return "n can only be 2 or 3."

    candidates.sort(key=lambda x: x[1], reverse=True)

    return candidates[0][0] if candidates else "No prediction available"

# Test cases
input_bigram = "smart"
predicted_bigram = predict_next_word(input_bigram, 2)
print(f"Next word after '{input_bigram}': {predicted_bigram}")

input_trigram = "a branch of"
predicted_trigram = predict_next_word(input_trigram, 3)
print(f"Next word after '{input_trigram}': {predicted_trigram}")

Next word after 'smart': machines


Next word after 'a branch of': computer

Conclusion:
Developing a language model to predict the next best word is a core NLP task. Traditional n-gram
models provide a statistical foundation, while modern neural models enable context-aware,
powerful predictions. Such models form the backbone of intelligent applications like chatbots,
virtual assistants, and content generators.
