Report - Minor Project
MASTER OF TECHNOLOGY
in
COMPUTATIONAL AND DATA SCIENCE
by
BHARGAV RATHOD
(222CD022/2220539)
NOVEMBER, 2023
DECLARATION
I hereby declare that the Report of the P.G. Project Work entitled "NEXT WORD PREDICTION USING LSTM", which is being submitted to the National Institute of Technology Karnataka, Surathkal, in partial fulfillment of the requirements for the award of the Degree of Master of Technology in Computational and Data Science in the Department of Mathematical and Computational Sciences (MACS), is a bonafide report of the work carried out by me. The material contained in this Report has not been submitted to any University or Institution for the award of any degree.
CERTIFICATE
This is to certify that the P.G. Project Work Report entitled "NEXT WORD PREDICTION USING LSTM", submitted by BHARGAV RATHOD (Register number: 2220539) as the record of the work carried out by him, is accepted as the P.G. Project Work Report submission in partial fulfillment of the requirements for the award of the degree of Master of Technology in Computational and Data Science in the Department of Mathematical and Computational Sciences (MACS).
DR. V. MURUGAN
Project Guide                          Head of MACS
Dept. of MACS                          Dept. of MACS
NITK Surathkal, Mangalore              NITK Surathkal, Mangalore
ACKNOWLEDGEMENT
BHARGAV RATHOD
ABSTRACT
Writing long sentences can be tedious, but text prediction built into keyboards has made it much easier. Next Word Prediction is also referred to as Language Modeling: the task of predicting which word comes next. It is one of the fundamental tasks of natural language processing and has numerous applications. A Long Short-Term Memory (LSTM) model can learn from past text and predict words that help the user frame sentences. The method presented here predicts the next word from the words already present in the sentence; a minimum input of five words is required, because the prediction is based on the previous five words.
CONTENTS
List of Tables
1 Introduction
  1.1 Motivation
2 Literature Review
  2.1 Background and Related Works
  2.2 Outcome of Literature Review
  2.3 Problem Statement
  2.4 Objectives
3 Methodology
  3.1 Data Preprocessing
    3.1.1 Word Embedding
    3.1.2 Building the LSTM Model
    3.1.3 Training the Model
    3.1.4 Saving the Model and Training History
    3.1.5 Text Generation Functions
    3.1.6 Sample Predictions
    3.1.7 Code and Libraries
4 Work Done
  4.1 Handling Out-of-Dictionary Words
  4.2 Handling Stop Words
  4.3 Evaluation
5 Conclusion and Future Work
Bibliography
LIST OF TABLES
2.1.1 Summary of Literature Survey
Chapter 1
Introduction
I developed this next word prediction model mainly for email writing and general text entry; it can also support speech assistance and help individuals who type slowly. In this project, a language-model-based framework for fast electronic communication is described, which predicts the probable next word given a group of current words. Word prediction performs the task of guessing the upcoming word that is most likely to follow a few initial text fragments. My goal is to facilitate instant electronic communication by suggesting relevant words to the user.
1.1 Motivation
The motivation behind embarking on a Next Word Prediction (NWP) project using Long Short-Term Memory (LSTM) technology stems from the ever-expanding influence of natural language processing in our digital interactions. In our increasingly connected world, where human-computer interaction relies heavily on textual communication, the ability to enhance the predictive power of text input holds immense practical value. NWP, an essential component of text generation systems, significantly impacts user experience in various applications, from messaging platforms to content creation tools.
Traditional text prediction methods often fall short when it comes to understanding long sentences. LSTM, as a sophisticated variant of recurrent neural networks,
excels in capturing long-term dependencies within textual data. Its unique memory-
retention capabilities make it well-suited for understanding contextual relationships
between words, enabling it to generate more accurate and contextually relevant word
predictions. By delving into LSTM-based NWP, this project aspires to address the
limitations of conventional prediction models and elevate the precision and fluency of
computer-generated text.
Furthermore, as the demand for intelligent virtual assistants, chatbots, and smart
content creation tools continues to rise, a robust NWP system becomes indispensable.
Improving the accuracy of word predictions not only enhances user satisfaction but
also contributes to the development of more effective communication technologies.
Ultimately, this project aims to push the boundaries of natural language processing,
fostering innovation in human-computer interaction and paving the way for more
intuitive and seamless digital experiences in diverse domains.
Chapter 2
Literature Review
is a feat in itself.
Sourabh Ambulgekar (2021) reported that a plain RNN model was less accurate, so an LSTM was used instead: the input was encoded as a 3-D tensor and the output as a 2-D tensor, fed through an LSTM layer with 128 hidden units, reaching an accuracy of around 56 percent within 5 epochs.
the LSTM and the structure of the RNN, which help in understanding the working of this algorithm.
2.4 Objectives
1. Predict the most probable next word in a sequence of words, based on the context provided by the preceding words.
2. Utilize the memory and sequential modeling capabilities of an LSTM to predict the next word.
Chapter 3
Methodology
3.1 Data Preprocessing
• Tokenization: The NLTK library is used to tokenize the text into words.
• Stopword Removal: Common English stopwords are removed from the tok-
enized words.
• Sequence Generation: A sliding window of the previous words (five in this case) is used to predict the next word; each sequence is paired with its corresponding next word.
3.1.2 Building the LSTM Model
• The model has an LSTM layer with 128 units, followed by a Dense layer with softmax activation to predict the next word. A minimal sketch of these preprocessing and model-building steps is given below.
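The sketch assumes NLTK and TensorFlow/Keras; the corpus filename, the embedding dimension, and the sparse categorical cross-entropy loss are illustrative assumptions rather than details taken from the project code:

import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

nltk.download("punkt")  # tokenizer model used by word_tokenize

SEQ_LEN = 5  # five previous words are used to predict the next one

# Tokenize the corpus into lowercase words (filename is hypothetical).
text = open("corpus.txt", encoding="utf-8").read().lower()
words = word_tokenize(text)

# Build the vocabulary and a word-to-index mapping.
vocab = sorted(set(words))
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Each window of five word indices is paired with the index of the next word.
X, y = [], []
for i in range(len(words) - SEQ_LEN):
    X.append([word_to_idx[w] for w in words[i:i + SEQ_LEN]])
    y.append(word_to_idx[words[i + SEQ_LEN]])
X, y = np.array(X), np.array(y)

# LSTM layer with 128 units, then a softmax Dense layer over the vocabulary.
model = Sequential([
    Embedding(input_dim=len(vocab), output_dim=64, input_length=SEQ_LEN),
    LSTM(128),
    Dense(len(vocab), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])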
3.1.4 Saving the Model and Training History
• The trained model and its training history (loss and accuracy) are saved to files
using pickle for future use.
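A sketch of this step, reusing the model and the (X, y) arrays from the earlier sketch; the file names, validation split, and batch size are illustrative:

import pickle

# Train for 100 epochs; fit() returns a History object holding the
# per-epoch loss and accuracy curves.
history = model.fit(X, y, validation_split=0.1, epochs=100, batch_size=128)

# Persist the trained model and its training history for future use.
model.save("next_word_model.h5")
with open("training_history.pkl", "wb") as f:
    pickle.dump(history.history, f)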
3.1.5 Text Generation Functions
• prepare_input converts input text into a format suitable for the model.
• predict_completions generates a list of possible next words given an input sequence; a sketch of both functions follows.
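The sketch reuses SEQ_LEN, word_to_idx, vocab, and model from the earlier snippet; mapping unknown words to index 0 is a placeholder that anticipates the OOV handling of Section 4.1:

import numpy as np
from nltk.tokenize import word_tokenize

def prepare_input(text):
    # Convert the last SEQ_LEN words of the input into a 1 x SEQ_LEN index array.
    tokens = word_tokenize(text.lower())[-SEQ_LEN:]
    return np.array([[word_to_idx.get(w, 0) for w in tokens]])

def predict_completions(text, n=3):
    # Return the n most probable next words for the given input sequence.
    probs = model.predict(prepare_input(text), verbose=0)[0]
    top = probs.argsort()[-n:][::-1]
    return [vocab[i] for i in top]

# Example: suggest three possible continuations for a five-word prompt.
print(predict_completions("i think that he is", n=3))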
Figure 3.1.1: Architecture of LSTM
Chapter 4
Work Done
4.1 Handling Out-of-Dictionary Words
In the course of developing our project, one significant challenge we encountered was
dealing with out-of-dictionary words, often referred to as "OOV" (Out of Vocabulary)
words. These words are not present in the pre-defined vocabulary of the language
model, which posed a potential stumbling block in our natural language processing
tasks.
To address this issue, we implemented a robust solution. Our approach involved identifying instances where "OOV" words occurred in the input data. For every input text, we checked if the word existed within our predefined vocabulary. The code snippet below illustrates how we managed out-of-dictionary words:
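This is a minimal sketch of that check, reusing vocab from Section 3.1 and assuming a dedicated "OOV" token that receives its own column in the embedding, as Figure 4.1.2 indicates; the token name and the choice of index 0 are illustrative:

OOV_TOKEN = "OOV"

# Reserve index 0 (and hence one embedding column) for out-of-vocabulary words;
# every in-dictionary word keeps its own index.
word_to_idx = {OOV_TOKEN: 0}
for w in vocab:
    word_to_idx[w] = len(word_to_idx)

def encode(tokens):
    # Fall back to the OOV index whenever a word is not in the vocabulary.
    return [word_to_idx.get(w, word_to_idx[OOV_TOKEN]) for w in tokens]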
Figure 4.1.2: In the word embedding, an "OOV" column is reserved for words not in the dictionary
4.2 Handling Stop Words
In natural language processing, stop words are common words like "and," "the," "is," etc., that are often filtered out from text data because they occur frequently and do not carry significant meaning in the context of the analysis. These words, while important to the structure of a sentence, do not provide valuable insights for tasks like text classification, sentiment analysis, or topic modeling. In our project, we implemented a preprocessing step to remove stop words from the textual data before further analysis. This step involved the removal of words that were present in a predefined list of stop words. For example, the list of stop words we used included words such as 'had,' 'hasn't,' 'with,' 'no,' and many others. Removing these words was essential to focus on the more meaningful content of the text and improve the accuracy of our natural language processing tasks.
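A sketch of this preprocessing step, assuming the predefined list is NLTK's English stop-word list (which contains 'had,' 'hasn't,' 'with,' 'no,' and the other words mentioned above):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")  # one-time download of the stop-word list
nltk.download("punkt")

stop_words = set(stopwords.words("english"))

# Filter the tokenized text before building training sequences.
tokens = word_tokenize("he had no doubt about the case".lower())
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # ['doubt', 'case']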
4.3 Evaluation
The model was trained for 100 epochs, and the training process was monitored using both training and validation data. After 100 iterations, the model achieved an
accuracy of 62 percent on the validation data. The accuracy metric indicates the
percentage of correctly predicted next words from the validation dataset.
During the evaluation process, the model utilized a validation dataset that was not
seen during training. This approach ensured an unbiased assessment of the model’s
performance on unseen data. The evaluation process followed these steps:
Model Prediction: The trained model was utilized to make predictions on the
validation dataset.
Metric Calculation: Using the predicted labels and the actual labels from the validation dataset, we calculated accuracy, precision, recall, and F1-score, and generated a confusion matrix; a sketch of this step is given below.
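The sketch assumes held-out arrays X_val and y_val and scikit-learn; the macro averaging is an illustrative choice:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# The predicted label is the index of the highest-probability word.
y_pred = model.predict(X_val, verbose=0).argmax(axis=1)

accuracy = accuracy_score(y_val, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_val, y_pred, average="macro", zero_division=0)
cm = confusion_matrix(y_val, y_pred)

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  f1={f1:.3f}")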
Analysis and Iteration: Based on the evaluation metrics, the model’s performance
was analyzed. If the results were not satisfactory, iterative refinement of the model
architecture, fine-tuning of hyperparameters, or additional preprocessing steps were
performed to enhance performance.
Figure 4.3.4: Accuracy at the 98th, 99th, and 100th epochs
Chapter 5
Conclusion and Future Work
5.0.1 Future Work
• Dataset Diversification: Training the model on a larger and more diverse dataset could improve its ability to predict a wider range of words and phrases. Incorporating multiple genres or sources of text might enhance the model's language understanding.
5.0.2 Conclusion
• In this study, we successfully implemented a Next Word Prediction Model using
LSTM neural networks. The model was trained on a corpus of text from "The Adventures of Sherlock Holmes" by Arthur Conan Doyle. After 100 epochs of training, the model achieved an accuracy of approximately 62 percent on the validation data. The implemented LSTM model, coupled with the SeqSelfAttention layer for attention mechanisms, demonstrated the ability to learn
sequential patterns and generate probable next words given a sequence of words.
Bibliography
Afika Rianti, Suprih Widodo, A. D. A. F. B. H. (2022). Next word prediction using LSTM. Journal of Information Technology and Its Utilization, 5(13):443-454.
Hajj-Ahmad, A., Baudry, S., Chupeau, B., Doërr, G., and Wu, M. (2016). Flicker forensics for camcorder piracy. IEEE Transactions on Information Forensics and Security, 12(1):89-100.
Keerthana N, Harikrishnan S, K. B. M. J. J. B. (2021). Next word prediction. International Journal of Creative Research Thoughts, 9(9):754-757.
Sourabh Ambulgekar, Sanket Malewadikar, R. G. D. B. J. (2021). Next words prediction using recurrent neural networks. In International Conference on Advances in Computing and Communications (ICACC), pages 1-4.