
NEXT WORD PREDICTION USING LSTM

Submitted in partial fulfillment of the requirements for the degree of

MASTER OF TECHNOLOGY
in
COMPUTATIONAL AND DATA SCIENCE

by
BHARGAV RATHOD
(222CD022/2220539)

under the guidance of


DR. V. MURUGAN

DEPARTMENT OF MATHEMATICAL AND COMPUTATIONAL
SCIENCE (MACS)
NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA
SURATHKAL, MANGALORE - 575025

NOVEMBER, 2023
DECLARATION

I hereby declare that the Report of the P.G. Project Work entitled “NEXT
WORD PREDICTION USING LSTM”, which is being submitted to the National
Institute of Technology Karnataka, Surathkal, in partial fulfillment of the
requirements for the award of the Degree of Master of Technology in Computational
and Data Science in the Department of Mathematical and Computational
Science (MACS), is a bonafide report of the work carried out by me. The material
contained in this Report has not been submitted to any University or Institution
for the award of any degree.

Place: NITK, Surathkal                    BHARGAV RATHOD
Date: 09-11-2023                          (222CD009)
CERTIFICATE
(for internal)

This is to certify that the P.G. Project Work Report entitled “NEXT WORD
PREDICTION USING LSTM”, submitted by BHARGAV RATHOD (Register
number: 2220539) as the record of the work carried out by him, is accepted as the
P.G. Project Work report submission in partial fulfillment of the requirements for
the award of the degree of Master of Technology in Computational and Data Science
in the Department of Mathematical and Computational Science (MACS).

DR. V. MURUGAN
Project Guide                             Head of MACS
Dept. of MACS                             Dept. of MACS
NITK Surathkal, Mangalore                 NITK Surathkal, Mangalore
ACKNOWLEDGEMENT

I want to express my sincere gratitude to my thesis supervisor (project guide),
DR. V. MURUGAN, for his invaluable guidance, suggestions, and comments
throughout this work. I would also like to thank Dr. R. MADHUSUDHAN,
Head of MACS, for his advice and encouragement. I would like to thank all staff
members and friends at the National Institute of Technology Karnataka, Surathkal,
for their support and cooperation in completing this report. Last but not least, I would
like to thank my parents for their love, moral support, and remarkable encouragement
during this work.

BHARGAV RATHOD
ABSTRACT
Writing long sentences can be tedious, but text prediction built into keyboard
technology has made it easy. Next Word Prediction is also referred to as Language
Modeling: the task of predicting which word comes next. It is one of the key tasks
of natural language processing and has numerous applications. A Long Short-Term
Memory (LSTM) model can learn from past text and predict words, which helps the
user frame sentences. This system predicts the next word based on the preceding
words in the sentence; the user must provide at least 5 words of input, because the
prediction is based on the previous 5 words.

Keywords— NLP, LSTM, Next Word


CONTENTS
LIST OF FIGURES i

LIST OF TABLES ii

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Literature Review 3
2.1 Background and Related Works . . . . . . . . . . . . . . . . . . . . . 3
2.2 Outcome of Literature Review . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3 Methodology 6
3.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.1 Word Embedding . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.2 Building the LSTM Model . . . . . . . . . . . . . . . . . . . . 6
3.1.3 Training the Model . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.4 Saving the Model and Training History . . . . . . . . . . . . . 7
3.1.5 Text Generation Functions . . . . . . . . . . . . . . . . . . . . 7
3.1.6 Sample Predictions . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1.7 Code and Libraries . . . . . . . . . . . . . . . . . . . . . . . . 7

4 Work Done 9
4.1 Handling out-of-dictionary words . . . . . . . . . . . . . . . . . . . 9
4.2 Handling Stop Words . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

5 Conclusion and Future Work 13


5.0.1 Future scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5.0.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
LIST OF FIGURES
3.1.1 Architecture of LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4.1.1 Appending words to the "OOV" column . . . . . . . . . . . . . . . . 9

4.1.2 Word embedding: using the "OOV" column for words outside the
dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.2.1 Example of stop words . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.3.1 Parameters during model training . . . . . . . . . . . . . . . . . . . 11
4.3.2 Accuracy for epochs 1 and 2 . . . . . . . . . . . . . . . . . . . . . . 11
4.3.3 Accuracy for epochs 50, 51 and 52 . . . . . . . . . . . . . . . . . . . 11
4.3.4 Accuracy for epochs 98, 99 and 100 . . . . . . . . . . . . . . . . . . 12

LIST OF TABLES
2.1.1 Summary of Literature Survey . . . . . . . . . . . . . . . . . . . . . . 4

Chapter 1
Introduction

I develop a next word prediction model mainly for email writing and word typing;
it can also facilitate speech and assist individuals who write slowly. In this project,
a language-model-based framework for fast electronic communication is described,
which predicts the probable next word given a group of current words. A word
prediction technique performs the task of guessing the succeeding word that is most
likely to continue a few initial text fragments. My goal is to facilitate the task of
instant electronic communication by suggesting relevant words to the user.

1.1 Motivation
The motivation behind embarking on a Next Word Prediction (NWP) project using
Long Short-Term Memory (LSTM) technology stems from the ever-expanding influ-
ence of natural language processing in our digital interactions. In our increasingly
connected world, where human-computer interaction relies heavily on textual com-
munication, the ability to enhance the predictive power of text input holds immense
practical value. NWP, an essential component of text generation systems, signifi-
cantly impacts user experience in various applications, from messaging platforms to
content creation tools.
Traditional text prediction methods often fall short when it comes to understanding
long sentences. LSTM, as a sophisticated variant of recurrent neural networks,
excels in capturing long-term dependencies within textual data. Its unique memory-
retention capabilities make it well-suited for understanding contextual relationships
between words, enabling it to generate more accurate and contextually relevant word
predictions. By delving into LSTM-based NWP, this project aspires to address the
limitations of conventional prediction models and elevate the precision and fluency of
computer-generated text.

Furthermore, as the demand for intelligent virtual assistants, chatbots, and smart
content creation tools continues to rise, a robust NWP system becomes indispensable.
Improving the accuracy of word predictions not only enhances user satisfaction but
also contributes to the development of more effective communication technologies.
Ultimately, this project aims to push the boundaries of natural language processing,
fostering innovation in human-computer interaction and paving the way for more
intuitive and seamless digital experiences in diverse domains.

Chapter 2
Literature Review

2.1 Background and Related Works


This review covers how to predict the next word using LSTM and RNN models: the
procedure and steps for predicting the next word with an LSTM, and the various
techniques and methodologies that have been used for next word prediction in
document typing and email writing, with further applications in language learning
and writing.
Afika Rianti (2022) presented next word prediction as one of the NLP fields, since
it is about mining text. The researchers used an LSTM model to make the prediction,
trained for 200 epochs. The results showed an accuracy of 75 percent with a loss of
55 percent. Based on that result, the accuracy can be considered good enough; it was
also better than the two other studies cited there, which used different models. The
model could be used to predict the next word given the input of the destination.
Keerthana N (2021) reported that the next word prediction model they developed
is fairly accurate on the provided dataset. NLP requires applying various types of
pattern discovery approaches aimed at eliminating noisy data. The loss was
considerably reduced within about a hundred epochs. Files or datasets that are large
to process still need some optimizations; however, certain preprocessing steps and
certain changes to the model can be made to boost its predictions.
Hajj-Ahmad et al. (2016) used an LSTM model trained for 500 iterations. The
researchers developed a next word prediction model for the Bodhi language, which
is used in the area around Nepal and the surrounding region, including Nepal and
Tibet. From the results, the accuracy was 50 percent, which is sufficiently high. The
model used the three previous words to predict the sentence. It works on a language
for which even Google Translate does not provide a service, which is a feat in itself.
Sourabh Ambulgekar (2021) presented that an RNN model became less exact, so
they used an LSTM instead: a 3D vector layer of input and a 2D vector layer for
output were fed into an LSTM layer with 128 hidden units, which managed to reach
an accuracy of around 56 percent within 5 epochs.

Table 2.1.1: Summary of Literature Survey

Authors/Year | Title | Observations
Afika Rianti (2022) | Next Word Prediction Using LSTM | The researchers used an LSTM model to make the prediction, trained for 200 epochs. The result showed an accuracy of 75 percent with a loss of 55 percent.
Keerthana N (2021) | Next Word Prediction | The next word prediction model that was developed is fairly accurate on the provided dataset. The loss was considerably reduced within about a hundred epochs.
Hajj-Ahmad et al. (2016) | Next Word Prediction in Bodhi Language Using LSTM-Based Approach | Predicted the next word in Bodhi, a rare language for which not even Google provides a model.
Sourabh Ambulgekar (2021) | Next Words Prediction Using Recurrent Neural Networks | An LSTM layer with 128 hidden units managed to reach an accuracy of around 56 percent within 5 epochs.

2.2 Outcome of Literature Review


The literature survey provides a comprehensive understanding of the different
methodologies and developments made in the field of next word prediction, and
introduces innovative approaches and techniques for implementing such a model. The
literature gave an idea of the working methodology of a next word prediction model,
which algorithms are best suited, and which will not predict the correct outcome. It
also covers the structure of LSTM and RNN networks, which helps in understanding
how these algorithms work.

2.3 Problem Statement


To build a next word prediction model using LSTM that can handle out-of-dictionary words.

2.4 Objectives
1. Predict the most probable next word in a sequence of words, based on the
context provided by the preceding words.

2. Utilize the memory and sequential modeling capabilities of LSTM for predicting
the next word.

Chapter 3
Methodology
3.1 Data Preprocessing
• Tokenization: The NLTK library is used to tokenize the text into words.

• Stopword Removal: Common English stopwords are removed from the tok-
enized words.

• Sequence Generation: The previous words (5 of them in this case) are used to
predict the next word; sequences are created together with their corresponding
next words. A sketch of this pipeline follows the list below.
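A minimal sketch of this preprocessing pipeline is shown below. It assumes the
training corpus has already been loaded into a string named text; the variable names
are illustrative and need not match the original code.

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")
    nltk.download("stopwords")

    SEQUENCE_LENGTH = 5  # number of previous words used to predict the next word

    # Tokenize the corpus and drop punctuation and common English stop words.
    words = word_tokenize(text.lower())
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w.isalpha() and w not in stop_words]

    # Build (previous words, next word) training pairs.
    prev_words, next_words = [], []
    for i in range(len(words) - SEQUENCE_LENGTH):
        prev_words.append(words[i:i + SEQUENCE_LENGTH])
        next_words.append(words[i + SEQUENCE_LENGTH])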

3.1.1 Word Embedding


• The code creates input sequences (prev_words) and corresponding output words
(next_words) for training the LSTM model.

• Input sequences are one-hot encoded and prepared as a 3D array.

• Output words are one-hot encoded and prepared as a 2D array.
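Continuing the preprocessing sketch above, the one-hot encoding could look as
follows (again with illustrative names):

    import numpy as np

    unique_words = sorted(set(words))
    word_index = {w: i for i, w in enumerate(unique_words)}

    # Inputs: 3D boolean array of shape (samples, sequence length, vocabulary size).
    X = np.zeros((len(prev_words), SEQUENCE_LENGTH, len(unique_words)), dtype=bool)
    # Outputs: 2D boolean array of shape (samples, vocabulary size).
    y = np.zeros((len(next_words), len(unique_words)), dtype=bool)

    for i, sequence in enumerate(prev_words):
        for j, w in enumerate(sequence):
            X[i, j, word_index[w]] = 1
        y[i, word_index[next_words[i]]] = 1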

3.1.2 Building the LSTM Model


• An LSTM model is defined using Keras Sequential API.

• The model has an LSTM layer with 128 units, followed by a Dense layer with
softmax activation to predict the next word.

• Categorical cross-entropy is used as the loss function, and RMSprop is used as
the optimizer.
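A plausible reconstruction of such a model using the Keras API (imported here
through TensorFlow) is sketched below; the layer sizes and the loss and optimizer
choices follow the description above, while the learning rate is an assumed value.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.optimizers import RMSprop

    model = Sequential()
    # LSTM layer with 128 units over 5-word one-hot input sequences.
    model.add(LSTM(128, input_shape=(SEQUENCE_LENGTH, len(unique_words))))
    # Dense softmax layer producing a probability for every vocabulary word.
    model.add(Dense(len(unique_words), activation="softmax"))

    model.compile(loss="categorical_crossentropy",
                  optimizer=RMSprop(learning_rate=0.01),  # learning rate is assumed
                  metrics=["accuracy"])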

3.1.3 Training the Model


• The model is trained for 100 epochs with a batch size of 128.

• Training and validation loss, as well as accuracy, are monitored.
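Following the sketch above, the training call might look like this; the validation
split is an assumed value, since the report does not state it:

    # 100 epochs, batch size 128; the 5% validation split is assumed.
    history = model.fit(X, y,
                        validation_split=0.05,
                        batch_size=128,
                        epochs=100,
                        shuffle=True)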

3.1.4 Saving the Model and Training History
• The trained model and its training history (loss and accuracy) are saved to files
using pickle for future use.
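A sketch of this step, with illustrative file names:

    import pickle

    model.save("next_word_model.h5")  # file names are illustrative
    with open("training_history.p", "wb") as f:
        pickle.dump(history.history, f)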

3.1.5 Text Generation Functions


• Two functions are defined for text generation: prepare_input and
predict_completions.

• prepare_input converts input text into a format suitable for the model.

• predict_completions generates a list of possible next words given an input
sequence; a sketch of both functions is given below.
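The report does not reproduce these functions verbatim; a minimal sketch consistent
with the description, reusing names from the earlier sketches, might be:

    def prepare_input(text):
        """One-hot encode the last SEQUENCE_LENGTH words of the input text."""
        x = np.zeros((1, SEQUENCE_LENGTH, len(unique_words)), dtype=bool)
        for j, w in enumerate(text.lower().split()[-SEQUENCE_LENGTH:]):
            if w in word_index:
                x[0, j, word_index[w]] = 1
        return x

    def predict_completions(text, n=5):
        """Return the n most probable next words for the given input text."""
        preds = model.predict(prepare_input(text), verbose=0)[0]
        top = preds.argsort()[-n:][::-1]  # indices of the n highest probabilities
        return [unique_words[i] for i in top]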

3.1.6 Sample Predictions


• The model is tested with different input sequences, and the top 5 predicted
words are displayed.
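For example (the input sentence is illustrative, not taken from the report):

    # Print the five most probable next words for an input phrase.
    print(predict_completions("holmes looked at the door with", n=5))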

3.1.7 Code and Libraries


• Include the code snippets used for preprocessing, model construction, training,
and prediction.

• Specify the libraries used, such as NLTK and Keras.

Figure 3.1.1: Architecture of LSTM

Chapter 4
Work Done
4.1 Handling out-of-dictionary words
In the course of developing our project, one significant challenge we encountered was
dealing with out-of-dictionary words, often referred to as "OOV" (Out of Vocabulary)
words. These words are not present in the pre-defined vocabulary of the language
model, which posed a potential stumbling block in our natural language processing
tasks.
To address this issue, we implemented a robust solution. Our approach involved
identifying instances where "OOV" words occurred in the input data. For every
input text, we checked whether each word existed within our predefined vocabulary.
The figures below illustrate how we managed out-of-dictionary words:

Figure 4.1.1: Appending words to the "OOV" column

Figure 4.1.2: Word embedding: using the "OOV" column for words outside the
dictionary
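The original code appears only as screenshots in the figures above; a minimal sketch
of the underlying idea, assuming the word_index mapping from the methodology
sketches plus a reserved "OOV" entry, could be:

    OOV_TOKEN = "OOV"

    # Reserve one extra vocabulary column for out-of-dictionary words.
    if OOV_TOKEN not in word_index:
        word_index[OOV_TOKEN] = len(word_index)

    def encode_word(w):
        """Map a word to its vocabulary index, falling back to the OOV column."""
        return word_index.get(w, word_index[OOV_TOKEN])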

4.2 Handling Stop Words
In natural language processing, stop words are common words like "and," "the,"
"is," etc., that are often filtered out from text data because they occur frequently
and do not carry significant meaning in the context of the analysis. These words,
while important to the structure of a sentence, do not provide valuable insights for
tasks like text classification, sentiment analysis, or topic modeling. In our project, we
implemented a preprocessing step to remove stop words from the textual data before
further analysis. This step involved removing words that were present in a predefined
list of stop words; for example, the list we used included words such as 'had,' 'hasn't,'
'with,' 'no,' and many others. Removing these words was essential to focus on the
more meaningful content of the text and improve the accuracy of our natural language
processing tasks. A sketch of this filtering step is given after the figure below.

Figure 4.2.1: Example of stop words
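As in the preprocessing sketch in Chapter 3, the filtering itself reduces to a short
function; the use of NLTK's stop word list here is an assumption consistent with the
examples given above:

    from nltk.corpus import stopwords

    # NLTK's English list includes 'had', "hasn't", 'with', 'no', and many others.
    stop_words = set(stopwords.words("english"))

    def remove_stop_words(tokens):
        """Keep only the content-bearing words from a list of tokens."""
        return [w for w in tokens if w.lower() not in stop_words]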

4.3 Evaluation
The model was trained for 100 epochs, and the training process was monitored us-
ing both training and validation data. After 100 iterations, the model achieved an
accuracy of 62 percent on the validation data. The accuracy metric indicates the
percentage of correctly predicted next words from the validation dataset.
During the evaluation process, the model utilized a validation dataset that was not
seen during training. This approach ensured an unbiased assessment of the model’s
performance on unseen data. The evaluation process followed these steps:
• Model Prediction: The trained model was used to make predictions on the
validation dataset.

• Metric Calculation: Using the predicted labels and the actual labels from the
validation dataset, we calculated accuracy, precision, recall, and F1-score, and
generated a confusion matrix.

• Analysis and Iteration: Based on the evaluation metrics, the model's performance
was analyzed. If the results were not satisfactory, we iteratively refined the model
architecture, fine-tuned hyperparameters, or performed additional preprocessing
steps to enhance performance.

A sketch of the metric calculation is given below.
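This sketch uses scikit-learn and assumes one-hot validation arrays named X_val and
y_val (illustrative names, since the report does not show this code):

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)

    # Collapse one-hot rows to class indices.
    y_pred = model.predict(X_val, verbose=0).argmax(axis=1)
    y_true = y_val.argmax(axis=1)

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
    print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
    print("f1-score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
    print(confusion_matrix(y_true, y_pred))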

Figure 4.3.1: Parameters during model training

Figure 4.3.2: Accuracy for epochs 1 and 2

Figure 4.3.3: Accuracy for epochs 50, 51 and 52

Figure 4.3.4: Accuracy for epochs 98, 99 and 100

Chapter 5
Conclusion and Future Work

5.0.1 Future scope


• Advanced Architectures: Exploring advanced LSTM architectures or other neu-
ral network architectures, such as transformers, could be beneficial. Transform-
ers, especially models like GPT-3, have demonstrated exceptional performance
in natural language processing tasks.

• Dataset Diversification: Training the model on a larger and more diverse dataset
could improve its ability to predict a wider range of words and phrases. Incor-
porating multiple genres or sources of text might enhance the model’s language
understanding.

• Fine-Tuning Pretrained Models: Fine-tuning pretrained language models like
GPT-3 or BERT on the specific task of next word prediction might yield superior
results, leveraging the knowledge learned from vast amounts of text data.

5.0.2 Conclusion
• In this study, we successfully implemented a Next Word Prediction Model using
LSTM neural networks. The model was trained on a corpus of text from ”The
Adventures of Sherlock Holmes” by Arthur Conan Doyle. After 100 epochs
of training, the model achieved an accuracy of approximately 62 percent on
the validation data. The implemented LSTM model, coupled with the SeqSelfAttention
layer for attention mechanisms, demonstrated the ability to learn sequential patterns
and generate probable next words given a sequence of words.

Bibliography
Afika Rianti, Suprih Widodo, A. D. A. F. B. H. (2022). Next word prediction using
LSTM. Journal of Information Technology and Its Utilization, 5(13):443–454.
Hajj-Ahmad, A., Baudry, S., Chupeau, B., Doërr, G., and Wu, M. (2016). Flicker
forensics for camcorder piracy. IEEE Transactions on Information Forensics and
Security, 12(1):89–100.
Keerthana N, Harikrishnan S, K. B. M. J. J. B. (2021). Next word prediction. Inter-
national Journal of Creative Research Thoughts, 9(9):754–757.
Sourabh Ambulgekar, Sanket Malewadikar, R. G. D. B. J. (2021). Next words predic-
tion using recurrent neural networks. In International Conference on Advances in
Computing and Communications, pages 1–4. ICACC.

