NLP Report
Bachelor of Technology
in
Computer Science and Engineering
by
ANURAG RAJ
(15BCE0951)
SHIVAM SRIVASTAVA
(15BCE0575)
KUMAR SAURAV DASH
(15BCE0320)
We hereby declare that this report, submitted in partial fulfilment of the requirements for the degree of Bachelor of Technology in Computer Science and Engineering to VIT, is a record of bonafide work carried out by us under the supervision of Arivoli A.
We further declare that the work reported here has not been submitted, and will not be submitted, either in part or in full, for the award of any other degree or diploma in this or any other institute or university.
CONTENTS

Acknowledgement
1. INTRODUCTION
   1.1. Objective
   1.2. Motivation
   1.3. Related Works
3. TECHNICAL SPECIFICATIONS
   3.1. Introduction
   3.2. Requirement Analysis
      3.2.1. Functional Requirements
      3.2.2. Non-functional Requirements
      3.2.3. System Requirements
4. DESIGN
   4.1. Design Approach / Methodology
   4.2. Detailed Design
      4.2.1. Use Case Diagram
      4.2.2. Architecture Diagram
5. SCHEDULE, TASKS AND TIMELINES
6. PROJECT DEMONSTRATION
7. RESULTS AND DISCUSSION
8. CONCLUSION
9. FUTURE SCOPE
10. REFERENCES
1. INTRODUCTION
Transliteration renders the words of one script in another script while preserving their pronunciation; for example, the Devanagari word पाठशाला may be written in Roman script as 'paathshaala'. With expanding globalization and the rapid growth of the web, a large amount of information is available to every user. However, the vast majority of this information exists only in specific languages, which creates a barrier to the transfer of knowledge among different linguistic groups. To lower this language barrier, we need powerful tools such as transliteration for free knowledge transfer.
In our project, we convert a word from Roman script to Devanagari script and vice versa. Transliteration is one of the main subtasks of Cross-Lingual Information Retrieval (CLIR), and its accuracy strongly affects the quality of the retrieval output.
1.1. OBJECTIVE
• Transliteration is the conversion of words from one script to another; most commonly the words involved are proper nouns.
• Sometimes it also means mapping the sounds of one language into another. For example:
Figure 1.1 Example of transliteration
• This system is designed to help retrieve old Sanskrit documents and manuscripts using different information retrieval techniques.
1.2. MOTIVATION
• Taraka Rama and Karthik Gali proposed modelling transliteration as a translation problem. They used a phrase-based statistical machine translation system, built with GIZA++ and a beam-search decoder, and achieved an accuracy of 46.3% on their test set.
• It has been observed that the research works mentioned above focus solely on statistical machine-learning approaches.
• To our knowledge, no prior research has applied a deep learning method to Hindi (Devanagari) or Sanskrit texts for this task.
• Statistical techniques give good results in transliteration, and they do not require deep linguistic knowledge of the source and target languages.
The way vowels are pronounced in a language affects the quality of transliterated results, and the origin of a word also plays an important role in transliteration. In the papers discussed here, the main sources of error are that word origin is not taken into account, vowel pronunciation is not modelled, and the systems perform poorly on unseen data and abbreviations. Here, we use a deep learning methodology to implement our idea, a method that, to our knowledge, has not previously been applied to this task.
As noted above, transliteration is one of the main subtasks of Cross-Lingual Information Retrieval (CLIR). By using deep learning for transliteration instead of conventional machine learning techniques, we intend to increase the accuracy of the output produced. We use the Seq2Seq (sequence-to-sequence) model of the Keras library, which uses LSTM networks (a special type of Recurrent Neural Network, RNN) to train on the data.
Our major goal is to be able to convert large paragraphs and pages of English (Roman-script) text to Hindi transliterated output with the greatest possible accuracy.
3. TECHNICAL SPECIFICATIONS
3.1. INTRODUCTION
The first major task was to gather enough data so that our model could be trained effectively.
• To accomplish this, we first converted Sanskrit text into ITRANS notation.
• ITRANS ("Indian languages TRANSliteration") is an ASCII transliteration scheme used for Devanagari script (a type of Indic script).
• The ITRANS notation was then converted to Roman script by generating every possible mapping from ITRANS to Roman characters; the mapping table was created manually, based on the phoneme each ITRANS symbol captures.
• The main advantage of this technique is that our data set was enriched with multiple ways of writing a given word in Roman script. Finally, we trained our model on 152,000 words, cross-validated it on 38,000 words to fine-tune the parameters, and tested it on 10,000 words.
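The data-enrichment step above can be sketched in Python. The mapping tables here are tiny illustrative fragments, not the full ITRANS tables built for the project, so every entry in this snippet is a stand-in:

```python
# Toy sketch of the data-preparation step: Devanagari -> ITRANS,
# then ITRANS -> every plausible Roman spelling.
# The mapping tables are small illustrative fragments only.

DEV_TO_ITRANS = {
    "श": "sha", "ल": "la", "प": "pa", "ठ": "Tha", "ा": "A",
}

# Each ITRANS phoneme may be written several ways in everyday Roman text.
ITRANS_TO_ROMAN = {
    "sha": ["sha", "sa"], "la": ["la"], "pa": ["pa"],
    "Tha": ["tha", "thha"], "A": ["a", "aa"],
}

def romanizations(dev_word):
    """Return every Roman spelling reachable from a Devanagari word."""
    spellings = [""]
    for ch in dev_word:
        variants = ITRANS_TO_ROMAN[DEV_TO_ITRANS[ch]]
        # Cartesian product: each partial spelling extends with each variant.
        spellings = [s + v for s in spellings for v in variants]
    return spellings
```

Enumerating all variants is what enriches the training set with the multiple Roman spellings a real user might type.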
To build our model we used the Seq2Seq approach; decoding an unknown input sequence then follows a slightly different process:
• Encode the input sequence into state vectors.
• Initialize a target sequence of size 1 containing the start-of-sequence character.
• Feed the current target sequence, together with the state vectors, to the decoder, which predicts the next output.
• From these predictions, sample the next (succeeding) character using argmax.
• Append the sampled character to the target sequence.
• Repeat this process until the end of the word, sentence, or document is reached.
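The decoding loop above can be sketched as follows. `predict_next` is a hypothetical stub standing in for the trained decoder: a real model would return a probability distribution over the output alphabet, and argmax over it would pick the next character.

```python
# Sketch of the greedy (argmax) decoding loop described above.

END = "</s>"  # end-of-sequence marker

def predict_next(state_vectors, target_so_far):
    # Deterministic toy predictor that spells out "paa" and stops;
    # it stands in for "argmax over the decoder's distribution".
    demo = {"": "p", "p": "a", "pa": "a", "paa": END}
    return demo.get("".join(target_so_far), END)

def greedy_decode(input_seq, max_len=20):
    state_vectors = input_seq  # stands in for the encoder's state vectors
    target = []                # the empty prefix plays the role of the start token
    while len(target) < max_len:
        ch = predict_next(state_vectors, target)  # decoder prediction (argmax)
        if ch == END:
            break
        target.append(ch)      # append the sampled character to the target
    return "".join(target)
```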
For the Devanagari-to-Roman direction we used the Harvard-Kyoto mapping scheme, which is based on the phonemes of the Devanagari characters.
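For illustration, a small subset of the Harvard-Kyoto convention can be written as a lookup table. This is only a fragment: the real scheme also covers vowel signs (matras), conjuncts, and diacritics, and this subset writes each consonant with its inherent 'a'.

```python
# Illustrative fragment of the Harvard-Kyoto convention
# for the Devanagari -> Roman direction.

HARVARD_KYOTO = {
    "अ": "a", "आ": "A", "इ": "i", "ई": "I",
    "क": "ka", "ख": "kha", "ग": "ga",
    "श": "za", "ष": "Sa", "स": "sa",
}

def to_hk(word):
    """Map a Devanagari word character by character to Harvard-Kyoto."""
    return "".join(HARVARD_KYOTO[ch] for ch in word)
```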
3.2. REQUIREMENT ANALYSIS
1. Functional Requirements
Functional requirements describe the features the product must provide and identify the people who will benefit from the software.

Product features
• The system can serve as a basis for most large information retrieval projects.
• The user can customize the output and store the configuration for future use.
• The result can be viewed easily, either as a document or as a terminal display.
• The system can also be incorporated into a website, which would help users retrieve various sets of documents and manuscripts from the Internet very easily.
• A website form of the transliteration system could help solve the major problem of retrieving old manuscripts and books written in Hindi or Sanskrit through ordinary English input.
• Along with the features mentioned above, the most significant feature of the system is to help the general public gain easy access to various Hindu religious texts.

The intended users include:
• Teachers
• Foreign Students
The users mentioned above will benefit greatly from the developed system, and each of them has his or her own set of requirements for personal usage.

Assumptions
• The user is well-versed in English as a language.
• We expect lower accuracy with long sets of input text files.
Every stakeholder has a distinct set of requirements from his or her own point of view.
2. System Requirements
H/W Requirements
RNNs, or Recurrent Neural Networks, are a type of neural network that contains loops, allowing information from earlier steps to persist; they are of particular interest because of the need to address the issue of long-term dependencies. For example, suppose you wish to guess the last word in the sentence "the clouds are in the sky." In this sentence the last word is fairly obvious without the need for any further context. But consider the sentence "I grew up in France… I speak fluent French." The recent words suggest that the next word is probably the name of a language, but narrowing it down to French requires the much earlier context of France, and traditional neural networks handle such gaps poorly.
The sequence-to-sequence model consists of a pair of networks, of which one is an encoder that processes the input given to it, and the other is a decoder that generates the output. The encoder and decoder can each be a multi-layer recurrent network.
The boxes in the picture represent a cell of the recurrent neural network.
Figure 4.4 Multi-layer network using sequence-to-sequence
In this model, the output produced by the decoder at any time t is fed back and becomes the input to the decoder at time t+1. This is done at test time. During training of the model, the correct input is given at each step even if the decoder produced a wrong output at the previous step.
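The two feeding regimes described above (the model's own output fed back at test time; the correct input supplied during training, commonly called teacher forcing) can be contrasted with a toy one-step decoder. `step` is a hypothetical stub that always mispredicts after 'a', so the effect of each regime on error propagation is visible.

```python
# Toy contrast of free-running decoding vs teacher forcing.

START = "<s>"

def step(prev_char):
    # Hypothetical one-step decoder: correct everywhere except that it
    # always outputs "X" after 'a'.
    table = {START: "p", "p": "a", "a": "X", "X": "X"}
    return table[prev_char]

def free_running(length):
    # Test time: the model's own output is fed back as the next input,
    # so one mistake corrupts everything that follows.
    out, prev = [], START
    for _ in range(length):
        prev = step(prev)
        out.append(prev)
    return "".join(out)

def teacher_forced(truth):
    # Training time: the correct character is fed at every step,
    # even when the previous prediction was wrong.
    out, prev = [], START
    for gold in truth:
        out.append(step(prev))
        prev = gold
    return "".join(out)
```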
LSTM Network
One of the appeals of RNNs is the possibility that they can associate old information with the ongoing task; an RNN can learn to use past data. A special type of RNN, called Long Short Term Memory (LSTM), is capable of remembering information over long intervals through its internal cell-state representation.
4.2. DETAILED DESIGN
4.2.1 USE CASE DIAGRAM
Figure 4.5 USE CASE diagram for the system
4.2.2 ARCHITECTURAL DIAGRAM
Figure 4.6 Architectural diagram for the system
5. SCHEDULE, TASKS AND TIMELINES
The entire initial approach consisted of research work and coding in python
with a vision of making everything in house. Later on the importance of proper
research and time constraints were realized. This led us to channelize our efforts in
achieving the final and desired result rather than building individual systems.
The Project plan, shown in Figure 5.1, has 6 stages of how the entire idea was
realized. The time frames comprised of separate code testing simultaneously with
coding and algorithm development.
7. RESULTS

Test Set        BLEU Score    Accuracy    Word-Error Rate
10,000 words    80.34         47.11%      37.43%
DISCUSSION
The data set we used was generated automatically, so it cannot fully capture human writing habits. Since every user has a different way of writing a particular word, the system also needs some manually created data that captures these different possibilities.
8. CONCLUSION
We achieved an accuracy of 47.11% and a BLEU score of 80.34. The accuracy is lower than that of existing tools, but we believe that training the model on a better data set offers good hope of improving it. The accuracy is low partly because it is calculated by exact matching of whole words, which is not a good measure of output quality. For a better measure we therefore also calculated the BLEU score and the word-error rate; the word-error rate was found to be 37.43%.
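For reference, the word-error rate quoted here is conventionally computed as the Levenshtein (edit) distance between the hypothesis and reference word sequences, divided by the reference length; a minimal sketch:

```python
# Word-error rate as normalized Levenshtein distance over words.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```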
9. FUTURE SCOPE
We plan to extend our model to cross-lingual information retrieval, applying information retrieval techniques to both the transliterated and the original query in order to retrieve the set of relevant documents in both the source and the target language.
10. REFERENCES
1. Xu, J., Zhai, F., & Xue, Z. (2017, August). Cross-Lingual Information Retrieval in
Sogou Search. In Proceedings of the 40th International ACM SIGIR Conference
on Research and Development in Information Retrieval (pp. 1361-1361). ACM.
2. Zhang, L., Färber, M., & Rettinger, A. (2016, October). Xknowsearch!: exploiting
knowledge bases for entity-based cross-lingual information retrieval.
In Proceedings of the 25th ACM International on Conference on Information and
Knowledge Management (pp. 2425-2428). ACM.
3. Sasaki, S., Sun, S., Schamoni, S., Duh, K., & Inui, K. (2018). Cross-lingual
Learning-to-Rank with Shared Representations. In Proceedings of the 2018
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 2 (Short Papers)(Vol. 2, pp.
458-463).
4. Ravishankar, V. (2017). Finite-State Back-Transliteration for Marathi. The Prague
Bulletin of Mathematical Linguistics, 108(1), 319-329.
5. Rama, T., & Gali, K. (2009). Modeling machine transliteration as a phrase based statistical machine translation problem. Language Technologies Research Centre, IIIT Hyderabad.
6. Das, A., Ekbal, A., Mandal, T., & Bandyopadhyay, S. (2009). English to Hindi machine transliteration system at NEWS. In Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, Suntec, Singapore.
7. Schamoni, S., & Riezler, S. (2015, August). Combining orthogonal information in
large-scale cross-language information retrieval. In Proceedings of the 38th
International ACM SIGIR Conference on Research and Development in
Information Retrieval (pp. 943-946). ACM.
8. Niyogi, M., Ghosh, K., & Bhattacharya, A. (2018). Learning Multilingual
Embeddings for Cross-Lingual Information Retrieval in the Presence of Topically
Aligned Corpora. arXiv preprint arXiv:1804.04475.
9. Merhav, Y., & Ash, S. (2018). Design Challenges in Named Entity
Transliteration. arXiv preprint arXiv:1808.02563.