


Journal of Autonomous Intelligence (2019) Volume 2 Issue 4.
doi:10.32629/jai.v2i4.82

Original Article
A Study of Neural Machine Translation from Chinese to Urdu
Zeeshan Khan1*, Muhammad Zakira1, Wushour Slamu1, Nady Slam1
1School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang, China

ABSTRACT
Machine Translation (MT) produces a translation from a source language into a target language. Simply substituting words from one language into another is not sufficient to yield a good translation, because whole expressions and their closest counterparts in the target language must be identified. Neural Machine Translation (NMT) is one of the most widely used machine translation methods and has made great progress in recent years, especially for non-universal languages. However, local-language translation software for such languages remains limited and needs improvement. In this paper, Chinese is translated into Urdu with the help of Open Neural Machine Translation (OpenNMT) in deep learning. First, a Chinese-to-Urdu parallel sentence dataset of seven million sentences was established. These datasets were then trained using the OpenNMT method. Finally, the translation output was compared with the desired translation using the BLEU score method.
Keywords: Machine Translation; Neural Machine Translation; Non-Universal Languages; Chinese; Urdu; Deep Learning
ARTICLE INFO
Received: Mar 10, 2020
Accepted: Apr 27, 2020
Available online: May 6, 2020

*CORRESPONDING AUTHOR
Zeeshan Khan, School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang, China; zeeshan@uswat.edu.pk

CITATION
Zeeshan Khan, Muhammad Zakira, Wushour Slamu, Nady Slam. A Study of Neural Machine Translation from Chinese to Urdu. Journal of Autonomous Intelligence 2019; 2(4): 29-36. doi: 10.32629/jai.v2i4.82

COPYRIGHT
Copyright © 2019 by author(s) and Frontier Scientific Publishing. This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). https://creativecommons.org/licenses/by-nc/4.0

1. Introduction

In the era of globalization, communication, interaction, and cooperation among countries have become more common than ever before. Failing to understand foreign languages can impede the communication process, and machine translation (MT) is a helpful and efficient way to overcome barriers in communication between different languages. Machine translation has made great progress in recent years, especially in translation between universal languages such as English and German, English and French, and English and Chinese[1]. However, the number of local-language translation systems for non-universal languages is limited. Meanwhile, with the promotion of the "One Belt One Road" policy, many Chinese companies have initiated steps to reinforce mutual assistance among Asian countries, and China and Pakistan have established cooperative relations in social infrastructure and trade. Therefore, the demand for translation between Chinese and Urdu is also increasing. With the help of machine translation, the difficulties that translators and interpreters face can be reduced and some obstacles to cross-language and cross-cultural cooperation can be eliminated, which contributes to promoting political, economic, and cultural exchanges between the two countries[2]. Machine translation is a subfield of computational linguistics that uses software to translate text from one natural language (NL) into another.

Machine translation performs a simple substitution of words from one language into another, but this process alone cannot produce an accurate translation, because whole phrases and their closest equivalents must be identified. To resolve this difficulty, corpus-based, statistical, and neural techniques are the leading approaches for handling differences in linguistic typology, isolating anomalies, and building translation systems. There are several types of machine translation (MT) systems; one of them is the Example-Based Machine Translation (EBMT) approach[3].

In recent years, Neural Machine Translation (NMT) has become one of the most popular machine translation methods.

1.1 Machine translation

Besides simply substituting words according to a model, an MT system also takes complex linguistic knowledge into account: morphology, meaning, and grammar. The standard metric used to evaluate MT systems is the BLEU score.

Bilingual Evaluation Understudy (BLEU) is an algorithm for assessing the quality of machine-translated text: it compares machine-translated output with human-generated reference translations, and the closer the machine translation is to the human translation, the better the BLEU score. BLEU measures the n-gram overlap between the system translation and the reference translation[4], as shown in Eq. (1):

    BLEU = BP * exp( sum_{n=1..N} w_n * log p_n ),
    BP = 1 if c > r, else exp(1 - r/c)                                  (1)

where p_n is the modified n-gram precision, w_n the weight for each n-gram order (typically w_n = 1/N with N = 4), BP the brevity penalty, c the candidate length, and r the reference length.
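A minimal single-reference version of Eq. (1) can be sketched as follows; this is an illustration of the metric, not the exact implementation used in the paper's evaluation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """BLEU as in Eq. (1): brevity penalty times the geometric mean
    of modified n-gram precisions (single reference, uniform weights)."""
    if not candidate or not reference:
        return 0.0
    n_max = min(max_n, len(candidate), len(reference))
    precisions = []
    for n in range(1, n_max + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        # Floor at a tiny value to avoid log(0); real toolkits use smoothing.
        precisions.append(max(overlap, 1e-9) / sum(cand_counts.values()))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / n_max)
```

A perfect match scores 1.0; any missing n-grams or a short candidate pull the score down, which is the behavior the data-tests in Section 4 measure.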

1.1.1 Direct machine translation (DMT)

Direct translation systems are basically bilingual and uni-directional. The direct translation approach needs only a little syntactic and semantic analysis; source-language (SL) analysis is oriented specifically toward producing representations appropriate for one particular target language (TL). DMT is a word-by-word translation approach with some simple grammatical adjustments[5].

1.1.2 Rule-Based machine translation (RBMT)

Rule-based machine translation uses hand-written linguistic rules for both languages in its translation method. It requires a great deal of human effort to define the rules, and adjusting the rules is generally very expensive[6]. It has three different sorts: (1) direct; (2) Interlingua; (3) transfer-based, as shown in Figure 1.

In direct machine translation, the source language is immediately transformed into the target language without intermediate steps. In Interlingua machine translation, there are intermediate steps that include all the information necessary for producing target-language text; the Interlingua representation is generally designed so that it is suitable for all pairs of languages. In transfer-based translation, there is a bilingual representation of both languages in intermediate steps, and these intermediate steps are language-dependent[7].

Figure 1. Types of rule-based machine translation.

1.1.3 Interlingua-Based translation (IBT)

In IBT, the input language (IL) is converted into an Interlingua form. The parser and analyzer of the IL do not depend on a generator for the output language (OL), so a complete resolution of ambiguity in the IL text is required.

1.1.4 Knowledge-Based machine translation (KBMT)

In KBMT, semantic and linguistic information about the meaning of words and their combinations in the source text is analyzed before translation into the target text. It is implemented over an Interlingua architecture.
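As a minimal illustration of the word-by-word direct approach from Section 1.1.1, with one "simple grammatical adjustment", consider the sketch below. The three-entry English-to-Urdu (romanized) lexicon and the single SVO-to-SOV reordering rule are toy assumptions, not a real DMT system:

```python
# Toy direct MT: bilingual dictionary lookup plus one reordering rule.
LEXICON = {"i": "main", "water": "pani", "drink": "pita hoon"}

def direct_translate(sentence):
    words = sentence.lower().split()
    # "Grammatical adjustment": move the verb to the end (English SVO
    # to Urdu SOV), assuming a three-word subject-verb-object input.
    if len(words) == 3:
        words = [words[0], words[2], words[1]]
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(LEXICON.get(w, w) for w in words)
```

Here `direct_translate("I drink water")` yields `"main pani pita hoon"`; everything outside the tiny lexicon and the single reordering rule fails, which is exactly the brittleness of DMT that motivates the richer approaches below.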

1.1.5 Statistical-Based machine translation (SBMT)

SBMT works on bilingual data. Its translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The output translation is obtained by selecting the highest-probability candidate[8], as shown in Eq. (2):

    e* = argmax_e P(e | f) = argmax_e P(f | e) * P(e)                   (2)

where f is the source sentence, e a candidate target sentence, P(f | e) the translation model, and P(e) the language model.
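The selection in Eq. (2) can be illustrated with invented numbers; the candidate strings and probabilities below are hypothetical, not taken from the paper's data:

```python
# Toy noisy-channel selection: choose the target sentence e that
# maximizes P(f | e) * P(e), as in Eq. (2).
def best_translation(source, candidates, tm, lm):
    """tm[(f, e)] = P(f | e) (translation model); lm[e] = P(e) (language model)."""
    return max(candidates, key=lambda e: tm[(source, e)] * lm[e])

# Hypothetical scores: e2 fits the source slightly better,
# but e1 is considered more fluent by the language model.
translation_model = {("f1", "e1"): 0.6, ("f1", "e2"): 0.7}
language_model = {"e1": 0.5, "e2": 0.3}
```

With these numbers, `best_translation("f1", ["e1", "e2"], translation_model, language_model)` returns `"e1"` (0.6 * 0.5 = 0.30 beats 0.7 * 0.3 = 0.21): the language model can override a slightly better translation-model score, which is the point of the product in Eq. (2).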

1.1.6 Example-Based machine translation (EBMT)

Example-based machine translation works by decomposing the source sentence into fragments, translating these fragments into the target language, and then re-composing the translated fragments into a long sentence[9].

1.1.7 Hybrid-Based machine translation (HBMT)

In hybrid machine translation, a combination of two or more machine translation techniques is used to overcome the limitations of each individual technique and to enhance the quality of translation[10].

1.1.8 Neural machine translation (NMT)

This approach uses neural networks and deep learning. NMT models require only a fraction of the memory needed by older SMT systems. All parts of the NMT model are trained jointly, i.e., end to end, to maximize translation performance. NMT models rely on a sequential encoder and decoder[11,12] without any explicit modeling of the syntactic structure of sentences. With this motivation, researchers have tried to extend the translation model by modeling the hierarchical structure of language. Eriguchi first proposed a tree-based attentive NMT model[13], which was further extended by Yang[14] and Chen[15] via a bidirectional encoding mechanism. All of the above tree-based models applied constituent tree structure and met the same difficulties. Other studies try to improve NMT by modeling syntax on the target side[16-19]. Besides enhancing the decoder, their success also demonstrated the need for, and the effectiveness of, modeling syntactic information in NMT systems.

2. Related Work

Machine translation for language translation started, broadly speaking, in the 1990s, and several methods and approaches have since been developed in this field[3]. According to recent studies, baseline systems translating text in Indian languages (Bengali, Hindi, Malayalam, Punjabi, Tamil, Telugu, Gujarati, and Urdu) into English achieve an average of about 10% correctness over all language pairs[20].

In 2013, Kalchbrenner proposed recurrent continuous translation models for machine translation[21]. This model uses a Convolutional Neural Network (CNN) to encode a given piece of input text into a continuous vector, and then uses a Recurrent Neural Network (RNN) as a decoder to convert the vector into the output language. In 2014, Long Short-Term Memory (LSTM) was introduced into NMT[11]. To solve the problem of encoders producing a single fixed-length vector, attention mechanisms were introduced into NMT[12]. The attention mechanism allows the neural network to pay more attention to the relevant parts of the input and to discard unrelated parts. Since then, the performance of neural machine translation has improved significantly.

In Sutskever's work, a multilayer LSTM is used to encode the input sentence into a fixed-size vector, which is then decoded into the output by another LSTM. The use of LSTMs effectively alleviates the vanishing-gradient problem, which allows the model to capture dependencies over long spans of a sentence. Muhammad Bilal uses three classification models for text classification with the Waikato Environment for Knowledge Analysis (WEKA). Blogs written in Roman Urdu and English are treated as documents for the training dataset, labeled examples, and test data. The three models were tested and the results examined in each case: Naive Bayesian outperformed Decision Tree and KNN in terms of

accuracy, precision, recall, and F-measure[22]. Mehreen Alam addresses the Roman-Urdu problem by casting Roman-Urdu-to-Urdu transliteration as a sequence-to-sequence learning task; an Urdu corpus was created, and the resulting neural model handles sentences up to length 10 while achieving a good BLEU score[23]. Neelam Mukhtar describes Urdu as a resource-poor language that has mostly been ignored by the research community. After collecting data from many blogs across about 14 different genres, the data was annotated with the help of human annotators. Three well-known machine learning algorithms were used for testing and comparison: Support Vector Machine, Decision Tree, and k-Nearest Neighbor (k-NN). The results show that k-NN performs better than Support Vector Machine and Decision Tree in terms of accuracy, precision, recall, and F-measure[24].

Muhammad Usman likewise applies five well-known classification techniques to an Urdu-language corpus. The corpus contains 21,769 news documents in seven categories (Business, Entertainment, Culture, Health, Sports, and Weird). After preprocessing, 93,400 features are extracted from the data, and the machine learning algorithms achieve up to 94% precision[25]. In Yang and Dahl's work, word embeddings are first trained on a huge monolingual corpus and then refined bilingually in a context-dependent DNN-HMM framework; capturing lexical translation information and modeling context information improve word alignment performance. Unfortunately, although better word alignments were obtained, they did not yield significant gains on an end-to-end SMT evaluation task[26].

To improve SMT performance directly, Auli enhanced a neural network language model to use both source-side and target-side information; in that work, not only the target word embedding but also the current source context is used as input to the network[27]. Liu suggests an additive neural network for SMT decoding[28]; Mikolov's recurrent neural network language model is first used to generate the source and target word embeddings, which feed a one-hidden-layer neural network that produces a translation confidence score[29]. The main factor that motivates the present study is the absence of academic work on a Chinese-Urdu sentence-to-sentence translation model. This translation project is modeled as neural machine translation and aims to make a significant contribution to the development of today's technological age.

3. OpenNMT

OpenNMT is an open-source neural machine translation system built upon the Torch/PyTorch deep learning toolkit. The tool is designed to be user-friendly and easily accessible while also providing high translation accuracy. It delivers a general-purpose interface that needs only source and target data, with speed as well as memory optimizations. OpenNMT has active industrial and academic contributors. A diagram of a neural machine translation system is shown in Figure 2: the red source words are mapped to word vectors and fed into a recurrent neural network (RNN).

Figure 2. The view of NMT.
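The encoder-decoder loop depicted in Figure 2 (words mapped to vectors, an RNN state per step, a softmax prediction fed back into the target RNN) can be sketched with a toy greedy decoder. The scoring table below is an invented stand-in for the trained network, not part of OpenNMT:

```python
import math

def softmax(scores):
    """Convert raw decoder scores into a probability distribution."""
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

def greedy_decode(score_fn, max_len=10):
    """Pick the most probable word at each step and feed it back,
    stopping when <eos> is predicted."""
    output, prev = [], "<sos>"
    for _ in range(max_len):
        probs = softmax(score_fn(prev, output))
        word = max(probs, key=probs.get)
        if word == "<eos>":
            break
        output.append(word)
        prev = word
    return output

# Invented score table standing in for the trained decoder network.
def toy_scores(prev, output):
    table = {"<sos>": {"a": 2.0, "<eos>": 0.0},
             "a": {"b": 3.0, "<eos>": 1.0},
             "b": {"<eos>": 5.0, "b": 0.0}}
    return table[prev]
```

Here `greedy_decode(toy_scores)` yields `["a", "b"]`: each predicted word becomes the next input, mirroring the feedback arrow in Figure 2.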


After the end-of-sentence symbol <eos> is found at the end of the source sentence, the final step initializes the target (blue) RNN. At each step, the target hidden state is compared with the source RNN states and a prediction is made, as stated in Eq. (3):

    p(w_t | w_1, ..., w_{t-1}, x) = exp(s(h_t, w_t)) / sum_{w'} exp(s(h_t, w'))        (3)

where x is the source sentence, h_t the current target hidden state, and s(h_t, w) the score the network assigns to vocabulary word w at step t. The prediction is then fed back into the target RNN.

The model was trained with OpenNMT Torch/PyTorch. There was no previous work on translation from Chinese to Urdu; this study was conducted to reduce barriers to communication between the two countries in the fields of business and cultural promotion. First, a Chinese-Urdu parallel sentence dataset with more than a million sentences was established. These datasets were then trained using the Neural Machine Translation (NMT) method and traditional statistical machine translation.

3.1 Parallel corpuses

The amount and quality of parallel corpus play an important role in translation quality. For low-resource languages like Urdu, it is extremely difficult to find a sufficient parallel corpus for training, validating, and testing a translation engine. The dataset of this project consists of a two-million-sentence Chinese-Urdu parallel corpus derived from the combination of the datasets defined below.

(1) Monolingual Corpus: the Urdu corpus comprises around 95.4 million tokens distributed across different websites, of which we have used 2.5 million for our approach. This corpus mixes sources such as news, religion, blogs, literature, science, and education[30].

(2) IPC: the Indic Parallel Corpus is a collection of Wikipedia documents in six Indian-subcontinent languages translated into English through crowdsourcing on the Amazon Mechanical Turk (MTurk) platform[31].

(3) UCBT: the UCBT dataset is an Urdu-Chinese parallel corpus; UrduChineseCorp contains 5 million parallel sentences. It is not freely available for open research.

(4) CTUS: the Chinese-to-Urdu Sentence dataset is a collection of sentences of different categories derived from the internet, news, and books. In addition, the dataset contains 50,000 manually written sentences derived from UNHD (Urdu Nastaliq Handwritten Dataset) to address the Chinese-Urdu parallel-data deficit. Its statistics are shown in Table 1.

Table 1. Chinese to Urdu dataset

Number of Words | Number of Nouns | Number of Verbs | Number of Particles | Punctuation | Number of sentences
2553053         | 387957          | 436759          | 268950              | 178923      | 700000
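Statistics like the word and sentence counts in Table 1 can be gathered with a simple whitespace token count, sketched below; real counts of nouns, verbs, and particles would additionally require a part-of-speech tagger, which is outside this sketch:

```python
from collections import Counter

def corpus_stats(sentences):
    """Count sentences, running words, and vocabulary size for a list
    of raw sentences, using whitespace tokenization only."""
    words = Counter()
    for s in sentences:
        words.update(s.split())
    return {"sentences": len(sentences),
            "words": sum(words.values()),
            "vocabulary": len(words)}
```

For example, `corpus_stats(["a b c", "a b"])` reports 2 sentences, 5 running words, and a vocabulary of 3 types; the same count over the full parallel corpus yields figures of the kind shown in Table 1.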

3.2 Methodology

Several methods of the OpenNMT tool are explained in the following subsections.

3.2.1 Normalization

This method applies a smoothing transformation to the source sequences in order to classify certain specific sequences and collapse them into a single representation for the translation process.

3.2.2 Tokenization

Tokenization is the process of splitting a sentence into pieces; each piece of a sentence is called a token. OpenNMT uses a space-separated technique for tokenization.

3.2.3 Byte pair encoding (BPE)

BPE, or byte pair encoding, is a text compression method that works by pattern substitution. In this work, the BPE model is built on the tokenized source data. For languages sharing an alphabet, learning BPE on two or more of the involved languages increases the consistency of segmentation and decreases the problem of inserted or deleted characters when copying data[32].

3.2.4 Rearranged for tokenization

Due to the BPE model, the previously tokenized sentences were rearranged, using the case feature and joiner annotation.
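A minimal version of the BPE merge loop of Sennrich et al.[32] can be sketched as follows; this illustrates the algorithm from Section 3.2.3 and is not OpenNMT's own implementation:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {"s y m b o l s": freq} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
    return vocab
```

Starting from `{"l o w": 5, "l o w e r": 2}`, two merges produce the subword `low`, so the rarer word `lower` is segmented as `low e r`: frequent character sequences become units while rare words stay decomposable.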

The case feature adds extra features to the encoder, which are optimized for each label and then fed as extra source input alongside the word. On the target side, these features are predicted by the network, so the decoder is able to decode a sentence and annotate each word. In addition, by activating the joiner annotation symbol, the tokenization becomes reversible.

3.2.5 Preprocessing

In this step, the data passes through preprocessing, which generates the word vocabularies and balances the data size used for training.

3.2.6 Data training

In the data training process, the default OpenNMT encoder and decoder, LSTM layers, and RNN are used. The research is based on an open-source codebase[33] written in Python using PyTorch, an open-source software library. For the NMT model, a single-layer LSTM network is used to train the translation model.

3.2.7 Data translation

In data translation, the default OpenNMT binary translation method is used to create an output translation file from the source and target language datasets.

4. Results and Discussions

In the OpenNMT model, the Chinese-Urdu language dataset with seven million parallel sentences is trained. The validation parts of the source and target were taken as 25% of the training corpus. To test the model, 15k randomly selected sentences from the corpus were identified, and these sentences underwent nine different tests. The results of each data-test were collected and compared with both the manual translation and the translation model outputs. Additionally, the BLEU score for each data-test was calculated. The details of each data-test are shown in Figure 3 and Table 2.

Table 2. Data-test results represented with BLEU score

Data-test | BLEU Score | System Information
Test1     | 0.0678     | Intel(R) Core(TM) i5-4700 CPU @ 2.70 GHz
Test2     | 0.0847     |
Test3     | 0.0855     |
Test4     | 0.0889     |
Test5     | 0.0924     |
Test6     | 0.0929     |
Test7     | 0.0987     |
Test8     | 0.1156     |
Test9     | 0.1887     |

Figure 3. Histogram representation of the BLEU scores of the different data-tests.

5. Conclusions

This study trained data from several sources made up of a variety of sentences. The training part of the method was conducted as a series of data-tests, and this proved practical, as the BLEU score increased with the number of data-tests; the accuracy obtained after the ninth data-test compares suitably with other machine translation systems. The BLEU score of the Chinese-to-Urdu translation system can be further improved by applying additional techniques to generate the best translation model.


References

1. Slocum J. A survey of machine translation: Its history, current status, and future prospects. 1985; 11(1): 1-17.
2. Bai L, Liu W. A practice on neural machine translation from Indonesian to Chinese. Recent Trends in Intelligent Computing, Communication and Devices 2020; 33-38.
3. Godase A, Govilkar S. Machine translation development for Indian languages and its approaches. Behavioral & Brain Sciences 2015; 4(2): 55-74.
4. Papineni K, Roukos S, Ward T, et al. BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics 2002; 311-318.
5. Okpor M. Machine translation approaches: Issues and challenges. International Journal of Computer Science Issues 2014; 11(5): 159.
6. Mall S, Jaiswal U. Survey: Machine translation for Indian language. 2018; 13(1): 202-209.
7. Hutchins WJ, Somers HL. An introduction to machine translation. Academic Press London 1992; Vol. 362.
8. Marcu D, Wong D. A phrase-based, joint probability model for statistical machine translation. Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing 2002.
9. Zafar M, Masood A. Interactive English to Urdu machine translation using example-based approach. International Journal on Computer Science & Engineering 2009; 1(3): 275-282.
10. Pathak AK, Acharya P, Balabantaray RC. A case study of Hindi-English example-based machine translation. Innovations in Soft Computing and Information Technology 2019; 7-16.
11. Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 2014.
12. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. Computer Science 2014.
13. Eriguchi A, Hashimoto K, Tsuruoka Y. Tree-to-sequence attentional neural machine translation. 2016.
14. Yang B, Wong DF, Xiao T, et al. Towards bidirectional hierarchical representations for attention-based neural machine translation. 2017.
15. Chen H, Huang S, Chiang D, et al. Improved neural machine translation with a syntax-aware encoder and decoder. 2017.
16. Wu S, Zhang D, Yang N, et al. Sequence-to-dependency neural machine translation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics 2017; Vol. 1.
17. Eriguchi A, Tsuruoka Y, Cho K. Learning to parse and translate improves neural machine translation. 2017.
18. Aharoni R, Goldberg Y. Towards string-to-tree neural machine translation. 2017.
19. Du W, Black AW. Top-down structurally-constrained neural response generation with lexicalized probabilistic context-free grammar. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2019; Vol. 1.
20. Khan NJ, Anwar W, Durrani N. Machine translation approaches and survey for Indian languages. 2017.
21. Kalchbrenner N, Blunsom P. Recurrent continuous translation models. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing 2013.
22. Bilal M, Israr H, Shahid M, et al. Sentiment classification of Roman-Urdu opinions using Naive Bayesian, Decision Tree and KNN classification techniques. Journal of King Saud University Computer & Information Sciences 2016; 28(3): 330-344.
23. Alam M, Hussain S. Sequence to sequence networks for Roman-Urdu to Urdu transliteration. International Multi-topic Conference (INMIC) 2017. IEEE.
24. Mukhtar N, Khan MA. Urdu sentiment analysis using supervised machine learning approach. International Journal of Pattern Recognition and Artificial Intelligence 2018; 32(2): 1851001.
25. Usman M. Urdu text classification using majority voting. 2016; 7(8): 265-273.
26. Yang N, Liu S, Li M, et al. Word alignment modeling with context dependent deep neural network. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics 2013; Vol. 1.
27. Auli M, Galley M, Quirk C, et al. Joint language and translation modeling with recurrent neural networks. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing 2013; 1044-1054.
28. Liu L, Taro W, Eiichiro S, et al. Additive neural networks for statistical machine translation. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics 2013; Vol. 1.
29. Mikolov T, Karafiat M, Burget L, et al. Recurrent neural network based language model. Eleventh Annual Conference of the International Speech Communication Association 2010.
30. Post M, Callison-Burch C, Osborne M. Constructing parallel corpora for six Indian languages via crowdsourcing. Proceedings of the Seventh Workshop on Statistical Machine Translation 2012; 401-409.
31. Baker P, Hardie A, McEnery T, et al. EMILLE, a 67-million word corpus of Indic languages: data collection, mark-up and harmonisation. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02) 2002.
32. Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. 2015.
33. Luong T, Brevdo E, Zhao R. Neural machine translation (seq2seq) tutorial. 2017.
