NLP Report

The document discusses a project report on cross-lingual information retrieval. It describes transliterating text from English to Hindi and vice versa using deep learning techniques. The goal is to increase accuracy for converting large paragraphs and pages of text between the languages.

Cross-Lingual Information Retrieval

An NLP Project Report

Submitted in partial fulfillment of the requirements for the degree of

Bachelor of Technology
in
Computer Science and Engineering

by
ANURAG RAJ
(15BCE0951)
SHIVAM SRIVASTAVA
(15BCE0575)
KUMAR SAURAV DASH
(15BCE0320)

Under the guidance of


Prof. ARIVOLI A.
School of Computer Science and Engineering
VIT, Vellore.
DECLARATION

I hereby declare that the thesis entitled “Cross-Lingual Information Retrieval”, submitted by me for the award of the degree of Bachelor of Technology in Computer Science and Engineering to VIT, is a record of bonafide work carried out by me under the supervision of Arivoli A.

I further declare that the work reported in this thesis has not been submitted and will not be submitted, either in part or in full, for the award of any other degree or diploma in this institute or any other institute or university.

Signature of the Candidate


ACKNOWLEDGEMENTS

I express my sincere gratitude and special thanks to my project guide, Prof. Arivoli A, School of Computer Science and Engineering, Vellore Institute of Technology, for his guidance and support throughout this project, for keeping me on the right path, and for thus enabling me to complete the thesis. It would not have been possible without his valuable guidance. I am also grateful to the management and staff of VIT for providing me with all the facilities needed to complete this project. I also take this opportunity to express my gratitude to all my teachers, my parents, my batch mates and my friends for providing me with both moral and technical support.

CONTENTS

Acknowledgement

Table of Contents

1. INTRODUCTION
   1. Objective
   2. Motivation
   3. Related Works

2. PROJECT DESCRIPTION AND GOALS

3. TECHNICAL SPECIFICATIONS
   1. Introduction
   2. Requirement Analysis
      2.1. Functional Requirements
      2.2. Non-functional Requirements
      2.3. System Requirements

4. DESIGN APPROACH AND DETAILS
   1. Design Approach-Methodology
   2. Detailed Design
      2.1. USE CASE Diagram
      2.2. Architecture Diagram

5. SCHEDULES, TASKS AND TIMELINES

6. PROJECT DEMONSTRATION

7. RESULTS AND DISCUSSION

8. CONCLUSION

9. FUTURE SCOPE

10. REFERENCES
1. INTRODUCTION

Transliteration is a type of conversion used in cross-lingual information retrieval, where a user enters a query in one language and retrieves a collection of documents in another language. In transliteration, we convert a word from one script to another without losing its phonological properties. For example, transliterating the English word “school” yields the Hindi word “स्कूल”. This differs from translation, in which the word “school” would map to पाठशाला ('paathshaala'). With increasing globalization and the rapid growth of the web, a great deal of data is accessible to every user. However, most of this data is available only in specific languages, which hinders the transfer of knowledge among linguistic groups. To lower this language barrier, we need a powerful tool such as transliteration for the free transfer of knowledge.
In our project, we convert a word from Roman script to Devanagari script and vice versa. Transliteration is one of the main subtasks of Cross-Lingual Information Retrieval (CLIR), and doing it well contributes to higher accuracy in the retrieved output.

1.1. OBJECTIVE

• To develop a transliteration mechanism from Roman to Devanagari script and vice versa.

• Transliteration means converting words from one script to another; most commonly, the words are proper nouns.

• Sometimes it also means mapping sounds from one language to another. For example:


Figure 1.1 Example of transliteration

• The system is designed to help retrieve old Sanskrit documents and manuscripts using information retrieval techniques.


1.2. MOTIVATION

In the coming years, with an ever-growing population, a large number of people will want access to the vast amount of foreign-language information and to understand more about the world. However, the language barrier is always an obstacle for them.
To break this language barrier and connect people to the information available in different languages across the world, we aim to build a cross-lingual information retrieval framework, CLIR, which enables people to search and read foreign-language content.
Given a Sanskrit/Hindi query, a CLIR system first translates the query into English, then searches the Internet, and finally translates the search results into Hindi so that users can understand them better.
With this system, people can read and understand information from the English-speaking world without knowing the language. The main idea behind choosing English to Hindi/Sanskrit transliteration is also to preserve and retrieve very old Sanskrit manuscripts and texts.
1.3. RELATED WORKS

• Taraka Rama and Karthik Gali posed the transliteration problem as a translation problem. They deployed a phrase-based statistical machine translation system for transliteration, using GIZA++ and a beam-search-based decoder. They achieved an accuracy of 46.3% on their test set.

• Amitava Das, Asif Ekbal, Tapabrata Mandal and Sivaji Bandyopadhyay developed another transliteration framework based on the NEWS 2009 Machine Transliteration Shared Task training data sets. They used a modified joint source-channel model along with two other methods to produce Hindi transliterations from English (generating more spelling variations of Hindi names), and devised a set of rules to remove errors. In the standard run of the system, they obtained a word precision of 0.471 and a mean F-score of 0.831. The non-standard runs gave word precision and mean F-score values of 0.389 and 0.831 respectively in the first run, and 0.384 and 0.823 respectively in the second run.

1.4. SUMMARY /GAPS IDENTIFIED IN THE SURVEY

• The research works mentioned above focus solely on machine learning approaches.
• To our knowledge, no prior work has applied a deep learning method to Hindi (Devanagari) or Sanskrit texts.
• Statistical techniques give good results in transliteration and do not require deep linguistic knowledge of the source and target languages. However, the way vowels are pronounced in a language affects the quality of the transliterated results, and the origin of a word also plays an important role. In the papers discussed here, the errors arise because the origin of words and the pronunciation of vowels are not taken into account, and because the systems perform poorly on unseen data and abbreviations.

2. PROJECT DESCRIPTION AND GOALS

This project is designed to transliterate English words, sentences and paragraphs into Hindi.
In this project, we trained the model on over 1.5 lakh (150,000) words so as to get more precise and accurate transliterated text as output. When the user inputs a Sanskrit/Hindi query, the system first translates the query into English, then searches the Internet, and finally translates the search results into Hindi so that users can understand them better.
With our system, people can read and browse information from the English-speaking world without actually knowing English. The main idea behind choosing Sanskrit to English transliteration is also to preserve and retrieve very old Sanskrit manuscripts and texts.

We have used a deep learning methodology to implement our idea, which, to our knowledge, has not previously been applied to this task.
Transliteration is one of the main subtasks of Cross-Lingual Information Retrieval (CLIR), and doing it well contributes to higher accuracy in the retrieved output. By using deep learning for transliteration instead of conventional machine learning techniques, we intend to increase the accuracy of the output. We used a Seq2Seq model built with the Keras library, which uses LSTM networks (a special type of Recurrent Neural Network (RNN)) to learn from the training data.

Our major goal is to convert large paragraphs and pages of English text to Hindi transliterated output with the greatest possible accuracy.
3. TECHNICAL SPECIFICATION

3.1. INTRODUCTION

The main prerequisite for the system was gathering enough data to train our model effectively.
• To accomplish this, we first converted Sanskrit text to ITRANS notation.
• ITRANS (“Indian languages TRANSliteration”) is an ASCII transliteration scheme for Indic scripts such as Devanagari.
• The ITRANS notation was then converted to Roman script using a manually created table of all possible mappings from ITRANS to Roman characters, based on the phoneme that each ITRANS symbol captures.
• The main advantage of this technique was that our data set was enriched with multiple ways of writing a given word in Roman script. Finally, we trained our model on 1,52,000 (152,000) words, cross-validated it on 38,000 words to fine-tune the parameters, and tested it on 10,000 words.
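The mapping step above can be sketched as follows. This is a minimal illustration only: the table `ITRANS_TO_ROMAN` and the functions `tokenize` and `roman_variants` are hypothetical names, and the handful of entries shown is an assumed sample, not the project's actual hand-built mapping.

```python
from itertools import product

# Hypothetical sample table: each ITRANS token maps to one or more plausible
# Roman spellings of the same phoneme. This is NOT the report's real mapping.
ITRANS_TO_ROMAN = {
    "aa": ["a", "aa"],
    "ii": ["i", "ee"],
    "uu": ["u", "oo"],
    "a": ["a"],
    "k": ["k"],
    "l": ["l"],
    "m": ["m"],
    "r": ["r"],
    "s": ["s"],
}


def tokenize(itrans_word):
    """Split an ITRANS string into tokens, preferring two-character tokens."""
    tokens, i = [], 0
    while i < len(itrans_word):
        pair = itrans_word[i:i + 2]
        if len(pair) == 2 and pair in ITRANS_TO_ROMAN:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(itrans_word[i])
            i += 1
    return tokens


def roman_variants(itrans_word):
    """Return every Roman spelling the table allows for one ITRANS word."""
    # Tokens missing from the table are passed through unchanged.
    options = [ITRANS_TO_ROMAN.get(tok, [tok]) for tok in tokenize(itrans_word)]
    return sorted({"".join(choice) for choice in product(*options)})


print(roman_variants("raama"))  # ['raama', 'rama']
```

Each generated spelling can be paired with the same Devanagari word, which is how one source word can contribute several training examples.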
We built our model as a Seq2Seq model. Decoding an unknown input sequence follows a slightly different process from training:
• Encode the input sequence into state vectors.
• Initialize the target sequence as a sequence of size 1 containing only the start-of-sequence character.
• Feed the target sequence from the previous step, along with the state vectors, to the decoder, which predicts the next character.
• Use these predictions to sample the next character (using argmax).
• Append the sampled character to the target sequence.
• Repeat this process until the end of the word, sentence or document is reached.
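The decoding steps above can be sketched as a short loop. The report does not show its Keras code, so the trained encoder and decoder are replaced here by toy stand-ins (`toy_encode` and `toy_decode_step`, both hypothetical) that simply upper-case the input; the start and end markers and the `max_len` cap are also assumptions. Only `greedy_decode` illustrates the procedure itself.

```python
START, END = "\t", "\n"  # assumed start- and end-of-sequence markers
TARGET_CHARS = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [END]


def toy_encode(source):
    """Stand-in encoder: the 'state' is the source plus a read position."""
    return (source, 0)


def toy_decode_step(prev_char, state):
    """Stand-in decoder: return scores over TARGET_CHARS and the new state."""
    source, pos = state
    wanted = source[pos].upper() if pos < len(source) else END
    low = 0.1 / (len(TARGET_CHARS) - 1)
    probs = [0.9 if ch == wanted else low for ch in TARGET_CHARS]
    return probs, (source, pos + 1)


def greedy_decode(encode, decode_step, target_chars, source, max_len=50):
    state = encode(source)            # encode the input into state vectors
    target = START                    # target sequence of size 1
    decoded = []
    while len(decoded) < max_len:
        probs, state = decode_step(target, state)              # predict next character
        best = max(range(len(probs)), key=probs.__getitem__)   # argmax
        char = target_chars[best]
        if char == END:               # stop at the end-of-sequence marker
            break
        decoded.append(char)          # append the sampled character
        target = char                 # feed it back in at the next step
    return "".join(decoded)


print(greedy_decode(toy_encode, toy_decode_step, TARGET_CHARS, "ram"))  # RAM
```

In a real system the two stand-ins would be replaced by the trained encoder and decoder models, with `probs` being the decoder's softmax output over the target character set.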
For Devanagari-to-Roman conversion we used the Harvard-Kyoto mapping scheme, which maps each Devanagari character to ASCII according to its phoneme.
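A character-level mapping of this kind can be sketched as below. The tables cover only part of the Harvard-Kyoto convention, and the function name `devanagari_to_hk` is hypothetical; this is an illustration of the idea under those assumptions, not the project's converter.

```python
# Partial Harvard-Kyoto tables (an assumed sample, not the full scheme).
VOWELS = {"अ": "a", "आ": "A", "इ": "i", "ई": "I", "उ": "u", "ऊ": "U",
          "ए": "e", "ओ": "o"}
CONSONANTS = {"क": "k", "ग": "g", "त": "t", "द": "d", "न": "n", "प": "p",
              "म": "m", "य": "y", "र": "r", "ल": "l", "व": "v", "श": "z",
              "ष": "S", "स": "s", "ह": "h"}
# Dependent vowel signs (matras), written as escapes because they are
# combining marks that are hard to read in source code.
MATRAS = {"\u093e": "A", "\u093f": "i", "\u0940": "I", "\u0941": "u",
          "\u0942": "U", "\u0947": "e", "\u0948": "ai", "\u094b": "o",
          "\u094c": "au"}
SIGNS = {"\u0902": "M", "\u0903": "H"}  # anusvara, visarga
VIRAMA = "\u094d"                        # suppresses the inherent vowel


def devanagari_to_hk(text):
    """Transliterate Devanagari text to Harvard-Kyoto using the tables above."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # A consonant carries an inherent 'a' unless a vowel sign
            # or a virama follows it.
            if nxt not in MATRAS and nxt != VIRAMA:
                out.append("a")
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch in SIGNS:
            out.append(SIGNS[ch])
        elif ch == VIRAMA:
            continue
        else:
            out.append(ch)  # pass unknown characters through unchanged
    return "".join(out)


word = "\u0938\u094d\u0915\u0942\u0932"  # the word for "school" from the introduction
print(devanagari_to_hk(word))  # skUla
```

Note that the scheme keeps the inherent final "a" of the last consonant, which follows Sanskrit convention; everyday Hindi romanization usually drops it ("skul" or "school").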
3.2. REQUIREMENT ANALYSIS

1. Functional Requirements

The functional requirements phase helped us establish a basic understanding of the English-to-Hindi transliterator system. It also helped us identify the people who will benefit from the software. Most importantly, it established preliminary communication between the system and its stakeholders.

Product features

The system offers a range of functionality to the user.

• The system can serve as a basis for most large information retrieval projects.
• The user can customize the output and store the configuration for future use.
• Results are easy to view, either as a document or on the terminal.
• The product provides word-to-word transliteration of any text file supplied as input by the user.
• The same functionality can be incorporated into a website, which can help users retrieve various sets of documents and manuscripts from the Internet very easily.
• The website form of transliteration can help solve the major problem of retrieving old manuscripts and books written in Hindi or Sanskrit using ordinary English.
• In addition to the features above, the most significant feature of the system is to help the general public get easy access to various Hindu religious books, stories and prayers written in Hindi or Sanskrit.


User characteristics

There are several types of users for our system:

• Linguists with a technical background
• Linguists with no theoretical or technical knowledge
• Teachers
• Educated users who are comfortable with English
• Foreign students
• Priests seeking religious knowledge
• General users seeking knowledge of a language

These are the types of users who will benefit most from the developed system. Each of them has his or her own set of requirements for personal use.

Assumption, Dependencies & Constraints

• It is assumed that the user is looking specifically for English to Hindi transliterations.
• The user is also assumed to be well versed in English.
• We expect lower accuracy on long input text files because of the variability in the training data set.

User Requirements and Product Specific System Requirements

Every stakeholder has a set of requirements from his or her own point of view. Common requirements include:

• User-friendly interface
• Easy to use
• Low cost and able to process double-sided writing
• Fast, with a minimal error rate
• Full control of the system
• Can work with a scanner
• Users with no technical knowledge will be able to use templates for running the system.

2. System Requirements

H/W Requirements

• GPU: A graphics processing unit is a specialized electronic circuit designed to rapidly manipulate and alter memory to speed up the creation of images for output to a display device such as a monitor. GPUs are widely used in embedded systems, and modern GPUs are very efficient at computer graphics and image processing. Their highly parallel structure also makes them well suited to training neural networks such as the one used in this project.

• Any computer, preferably a laptop or PC


4. DESIGN APPROACH AND DETAILS

4.1. METHODOLOGY (THEORETICAL BACKGROUND)

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a type of neural network that contain loops, which allow information to persist from one step to the next. RNNs are designed to use earlier context when processing the current input. For example, suppose we wish to predict the last word in the sentence “the clouds are in the sky.”

Figure 4.1 Looping in RNN

In this sentence the last word is fairly obvious without any further context. But consider the text “I grew up in France... I speak fluent French.” The recent words suggest that the next word is the name of a language, but to determine which language, the network needs to remember the earlier context of France. Plain RNNs struggle as this gap grows, which is the problem of long-term dependencies. A Recurrent Neural Network can be viewed as multiple copies of the same network, each passing information to the next in line.

Figure 4.2 Unrolled RNN
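The looping and unrolling described above can be illustrated with a one-unit RNN. This is a minimal sketch for intuition only; the weights `w_x`, `w_h` and `b` are arbitrary assumed values, and the function name `rnn_forward` is hypothetical.

```python
import math


def rnn_forward(inputs, w_x=1.0, w_h=0.5, b=0.0):
    """Run a one-unit RNN over a sequence and return every hidden state."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The same weights are reused at every step ("unrolling" the loop);
        # the hidden state h is what carries past information forward.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# Only the first input is non-zero, yet later states are still non-zero:
# the network "remembers" the first input through its hidden state.
states = rnn_forward([1.0, 0.0, 0.0])
print([round(s, 4) for s in states])
```

In this toy run the trace of the first input shrinks at every step, which hints at why long gaps between relevant pieces of context are hard for a plain RNN.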


Sequence-to-Sequence Model

The backbone of our project is a sequence-to-sequence model built with the Keras library. A sequence-to-sequence model consists of a pair of Recurrent Neural Networks (RNNs): an encoder that processes the input and a decoder that generates the output. The encoder and decoder can share weights. The architecture is shown in Figure 4.3.

Figure 4.3 Encoder-Decoder Networks

Each box in the figure represents a cell of the recurrent neural network.
Figure 4.4 multi-layer network using sequence-to-sequence

In this model, the output produced by the decoder at time t is fed back and becomes the decoder's input at time t+1. This is done at test time. During training, the decoder is instead given the correct previous character as input at every step, even if it produced a wrong output at the previous step.
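The training-time arrangement described above is commonly called teacher forcing. A minimal sketch of how the decoder's input and target sequences line up for one word follows; the start and end markers and the function name `teacher_forcing_pairs` are assumptions made for illustration.

```python
START, END = "\t", "\n"  # assumed start- and end-of-sequence markers


def teacher_forcing_pairs(target_word):
    """Return (decoder input, expected output) pairs for one training word."""
    decoder_input = START + target_word   # the correct previous characters
    decoder_target = target_word + END    # the same sequence shifted by one step
    return list(zip(decoder_input, decoder_target))


for given, expected in teacher_forcing_pairs("rAma"):
    print(repr(given), "->", repr(expected))
```

At every step the decoder sees the correct previous character rather than its own earlier prediction, so one early mistake does not derail the rest of the training example.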
LSTM Network

One appeal of RNNs is that they may be able to connect earlier information to the current task; an RNN can learn to use past data.

Long Short-Term Memory (LSTM) networks are a special type of RNN capable of learning long-term dependencies. They were introduced by Hochreiter and Schmidhuber (1997) and were refined and popularized by many researchers in subsequent work.

LSTMs have given very good results on problems involving long-term dependencies. A commonly used property of the LSTM is that it maps a variable-length input sequence to a fixed-dimensional vector representation.
4.2. DETAILED DESIGN

4.2.1 USE CASE DIAGRAM

Figure 4.5 USE CASE diagram for the system
4.2.2 ARCHITECTURAL DIAGRAM

Figure 4.6 Architectural diagram for the system
5. SCHEDULE, TASKS AND TIMELINES

The initial approach consisted of research work and coding in Python, with a vision of building everything in-house. Later, we recognized the importance of proper research and the time constraints involved. This led us to focus our efforts on achieving the desired final result rather than building individual systems.
The project plan, shown in Table 5.1, has six stages showing how the idea was realized. Within each time frame, code testing ran alongside coding and algorithm development.

TIME PERIOD    TASKS

Week 4         Finalizing the project
Week 4         Intensive research in similar areas of work
Week 3         Literature surveys and developing algorithm
Week 3         Developing the code for the designed algorithm
Week 2         Integration of sub-systems, final code
Week 1         Full integration, testing and documentation

Table 5.1 Project progress timeline


7. RESULTS AND DISCUSSION

We evaluated our model on several test sets. BLEU score, word error rate and accuracy were the metrics we used to measure the quality of the system.

➢ BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. Its output is a number between 0 and 1 that indicates the similarity between the candidate text and the reference text; values closer to 1 indicate more similar texts. We used it to measure the closeness of transliterations, since transliteration also takes a sequence as input and produces a sequence as output.
➢ Word error rate (WER) is a common metric for the performance of a machine translation system. The general difficulty in measuring performance is that the candidate sequence can differ in length from the reference sequence (assumed to be the correct one).
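Error rates of this kind are usually computed from the edit distance, which handles sequences of different lengths by counting substitutions, deletions and insertions and dividing by the reference length. The sketch below shows that calculation alongside exact-match accuracy. The report does not state whether its error rate was computed over words or over characters, so applying it to the characters of each word, as in the usage lines, is an assumption; all function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(references, hypotheses):
    """Total edit operations divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return errors / sum(len(r) for r in references)


def exact_match_accuracy(references, hypotheses):
    """Fraction of outputs that match their reference exactly."""
    matches = sum(r == h for r, h in zip(references, hypotheses))
    return matches / len(references)


refs = ["rAma", "skUla"]
hyps = ["rAma", "skula"]  # one wrong character in the second word
print(exact_match_accuracy(refs, hyps))    # 0.5
print(round(error_rate(refs, hyps), 4))    # 0.1111
```

The usage lines show why exact-match accuracy is a harsh measure: a single wrong character costs the whole word, while the edit-distance rate still credits the characters that were right.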

Table 7.1 Results of test

Test Set    BLEU Score    Accuracy    Word-Error Rate

A           0.8167        0.4882      0.3808
B           0.7959        0.4548      0.3729
C           0.7980        0.4699      0.3712
D           0.8065        0.4730      0.3755
E           0.8001        0.4699      0.3728
Figure 7.1 BLEU Score Vs Test set

Figure 7.2 Accuracy Vs Test set


Figure 7.3 Word Error Rate Vs Test Set

DISCUSSION

The data set we used was generated automatically, so it cannot capture human cognition. Since every user has a different way of writing a particular word, we need to feed the system some manually created data that captures these different possibilities.
8. CONCLUSION

We achieved an accuracy of 47.11% and a BLEU score of 80.34. The accuracy is lower than that of existing tools, but we believe that training the model on a better data set would improve it. The accuracy is low partly because it is calculated by exact matching of whole words, which is a strict measure of output quality. For a better measure, we also calculated the BLEU score and the word error rate; the word error rate was 37.43%.
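As a cross-check, the overall figures can be compared with the unweighted means of the five rows of Table 7.1. The sketch below assumes simple averaging; the report does not say how its overall figures were aggregated, and the function name `column_means` is hypothetical.

```python
# Rows of Table 7.1: test set -> (BLEU score, accuracy, word-error rate)
results = {
    "A": (0.8167, 0.4882, 0.3808),
    "B": (0.7959, 0.4548, 0.3729),
    "C": (0.7980, 0.4699, 0.3712),
    "D": (0.8065, 0.4730, 0.3755),
    "E": (0.8001, 0.4699, 0.3728),
}


def column_means(rows):
    """Unweighted mean of each metric column across all test sets."""
    n = len(rows)
    return tuple(sum(column) / n for column in zip(*rows.values()))


bleu, accuracy, wer = column_means(results)
print(f"BLEU {bleu:.4f}  accuracy {accuracy:.4f}  WER {wer:.4f}")
# BLEU 0.8034  accuracy 0.4712  WER 0.3746
```

The mean BLEU agrees with the reported 80.34. The mean accuracy and word error rate are close to, but not exactly, the reported 47.11% and 37.43%; the small differences may reflect rounding or weighting by test-set size.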

9. FUTURE SCOPE

We plan to extend our model to cross-lingual information retrieval, applying information retrieval techniques to both the transliterated query and the original query in order to retrieve the relevant documents in both the source and target languages.

10. REFERENCES
1. Xu, J., Zhai, F., & Xue, Z. (2017, August). Cross-Lingual Information Retrieve in
Sogou Search. In Proceedings of the 40th International ACM SIGIR Conference
on Research and Development in Information Retrieval (pp. 1361-1361). ACM.
2. Zhang, L., Färber, M., & Rettinger, A. (2016, October). Xknowsearch!: exploiting
knowledge bases for entity-based cross-lingual information retrieval.
In Proceedings of the 25th ACM International on Conference on Information and
Knowledge Management (pp. 2425-2428). ACM.
3. Sasaki, S., Sun, S., Schamoni, S., Duh, K., & Inui, K. (2018). Cross-lingual
Learning-to-Rank with Shared Representations. In Proceedings of the 2018
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 2 (Short Papers)(Vol. 2, pp.
458-463).
4. Ravishankar, V. (2017). Finite-State Back-Transliteration for Marathi. The Prague
Bulletin of Mathematical Linguistics, 108(1), 319-329.
5. Taraka Rama and Karthik Gali, “Modeling machine transliteration as a phrase based statistical machine translation problem,” in Language Technologies Research Centre, IIIT, Hyderabad, 2009.
6. Amitava Das, Asif Ekbal, Tapabrata Mandal and Sivaji Bandyopadhyay, “English to Hindi machine transliteration system at NEWS,” in Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, Suntec, Singapore, 2009.
7. Schamoni, S., & Riezler, S. (2015, August). Combining orthogonal information in
large-scale cross-language information retrieval. In Proceedings of the 38th
International ACM SIGIR Conference on Research and Development in
Information Retrieval (pp. 943-946). ACM.
8. Niyogi, M., Ghosh, K., & Bhattacharya, A. (2018). Learning Multilingual
Embeddings for Cross-Lingual Information Retrieval in the Presence of Topically
Aligned Corpora. arXiv preprint arXiv:1804.04475.
9. Merhav, Y., & Ash, S. (2018). Design Challenges in Named Entity
Transliteration. arXiv preprint arXiv:1808.02563.
