Original Article
A Study of Neural Machine Translation from Chinese to Urdu
Zeeshan Khan1*, Muhammad Zakira1, Wushour Slamu1, Nady Slam1
School of Information Science and Engineering, Xinjiang University Urumqi, Xinjiang, China
ABSTRACT
Machine Translation (MT) produces a translation from a source language into a target language. Simply translating text or speech word by word is not sufficient to produce a perfect translation, because whole expressions and their direct counterparts must be identified. Neural Machine Translation (NMT) is one of the most widely used machine translation methods and has made great progress in recent years, especially for non-universal languages. However, local translation software for such languages remains limited and needs improvement. In this paper, Chinese is translated into Urdu with the help of the Open Neural Machine Translation (OpenNMT) toolkit and deep learning. Firstly, a Chinese-to-Urdu parallel sentence dataset of seven million sentences was established. These datasets were then trained using the OpenNMT method. Finally, the output translations were compared against the desired translations using the BLEU score method.
Keywords: Machine Translation; Neural Machine Translation; Non-Universal Languages; Chinese; Urdu; Deep
Learning
ARTICLE INFO
Received: Mar 10, 2020
Accepted: Apr 27, 2020
Available online: May 6, 2020

*CORRESPONDING AUTHOR
Zeeshan Khan, School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang, China; zeeshan@uswat.edu.pk

CITATION
Zeeshan Khan, Muhammad Zakira, Wushour Slamu, Nady Slam. A Study of Neural Machine Translation from Chinese to Urdu. Journal of Autonomous Intelligence 2019; 2(4): 29-36. doi: 10.32629/jai.v2i4.82

COPYRIGHT
Copyright © 2019 by author(s) and Frontier Scientific Publishing. This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). https://creativecommons.org/licenses/by-nc/4.0

1. Introduction

In the era of globalization, communication, interaction, and cooperation among countries have become more common than ever before. Failing to understand foreign languages can impede the communication process. Machine translation (MT) is a helpful and efficient way to overcome barriers in communication between different languages. Machine translation has made great progress in recent years, especially in translation between universal languages such as English, German, French, and Chinese[1]. However, the amount of translation software for non-universal languages is limited. Meanwhile, with the promotion of the "One Belt One Road" policy, many Chinese companies have taken steps to reinforce mutual assistance among Asian countries. Pakistan, one of the largest economies in South Asia, has established cooperative relations with China in social infrastructure and trade. Therefore, the demand for translation between Chinese and Urdu is also increasing. With the help of machine translation, the difficulties that translators and interpreters face can be reduced, some obstacles in cross-language and cross-cultural cooperation can be eliminated, and political, economic, and cultural exchanges between the two countries can be promoted[2]. Machine translation (MT) is a subfield of computational linguistics that uses software to translate text from one natural language (NL) into another.
Machine translation performs a simple translation of words from one language to another, but this alone cannot produce an accurate translation of a text, because whole phrases and their direct equivalents must be identified. To address this difficulty, corpus-based, statistical, and neural techniques are the leading approaches for handling differences in linguistic typology, isolating anomalies, and building translation systems. There are several types of machine translation (MT) systems; one of them is the Example-Based Machine Translation (EBMT) approach[3]. In recent years, Neural Machine Translation (NMT) has become one of the most popular machine translation methods.

1.1 Machine translation

An MT system, rather than merely substituting words, also takes complex linguistic knowledge (morphology, meaning, and grammar) into account. The standard metric used for evaluating MT systems is the BLEU score. Bilingual Evaluation Understudy (BLEU) is an algorithm for assessing the quality of machine-translated text by comparing the machine output against human-generated reference translations; the closer the machine translation is to the human translation, the higher the BLEU score. The BLEU score measures the n-gram overlap between the machine translation and the reference translation[4], as shown in (Eq. 1):

BLEU = BP · exp(∑_{n=1}^{N} w_n log p_n),  BP = min(1, e^{1 − r/c})  (1)

where p_n is the clipped n-gram precision, w_n the weight of each n-gram order, c the candidate length, and r the reference length.
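As a concrete illustration of Eq. (1), the clipped n-gram precisions and brevity penalty can be computed with a short pure-Python sketch. This is a simplified single-reference version without smoothing; the function names are ours, not from the paper, and production implementations (e.g. sacrebleu) add smoothing and corpus-level aggregation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty (uniform weights w_n = 1/N)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

hyp = "the cat sat on the mat".split()
ref = "the cat sat on the mat".split()
print(round(bleu(hyp, ref), 4))  # identical sentences score 1.0
```

A perfect match scores 1.0 and a candidate sharing no n-grams with the reference scores 0.0, matching the intuition that higher overlap means a better translation.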
1.1.1 Direct machine translation (DMT)

1.1.2 Rule-Based machine translation (RBMT)

These intermediate steps are language-structured[7].

Figure 1. Types of rule-based machine translation.

1.1.5 Statistical-Based machine translation (SBMT)

SBMT works on bilingual data. SBMT systems are built on statistical models whose parameters are derived from the analysis of bilingual text corpora. The standard translation is established by selecting the candidate with the highest probability[8], as shown in (Eq. 2):

ê = argmax_e P(e | f) = argmax_e P(f | e) · P(e)  (2)
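The selection rule in Eq. (2) can be illustrated with a toy example over an enumerated candidate set; the candidate sentences and probabilities below are invented purely for illustration, not taken from any real system.

```python
# Toy noisy-channel selection, as in Eq. (2): choose the target
# sentence e maximizing P(f|e) * P(e). "lm" plays the role of the
# language model P(e), "tm" the translation model P(f|e).
# All numbers here are made up for illustration.
candidates = {
    "he goes home":  {"lm": 0.020, "tm": 0.30},
    "him go house":  {"lm": 0.001, "tm": 0.45},
    "he walks home": {"lm": 0.015, "tm": 0.10},
}

def decode(cands):
    """argmax_e P(f|e) * P(e) over an enumerated candidate set."""
    return max(cands, key=lambda e: cands[e]["tm"] * cands[e]["lm"])

print(decode(candidates))  # "he goes home": 0.30 * 0.020 = 0.006 wins
```

Note how the fluent candidate wins even though a disfluent one has a higher translation-model score: the language model P(e) penalizes "him go house" heavily, which is exactly the effect the factorization in Eq. (2) is designed to capture.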
1.1.6 Example-Based machine translation (EBMT)

Example-based machine translation works by decomposing the source sentence into fragments, translating these fragments into the target language, and then re-composing the translated fragments into a complete sentence[9].

1.1.7 Hybrid-Based machine translation (HBMT)

In hybrid machine translation, a combination of two or more machine translation techniques is used to overcome the limitations of each individual technique and to enhance the quality of translation[10].

1.1.8 Neural machine translation (NMT)

This approach uses neural networks and deep learning. NMT models require only a fraction of the memory needed by older SMT models, and all parts of an NMT model are trained jointly, i.e., end to end, to maximize translation performance. NMT models rely on a sequential encoder and decoder[11,12] without any explicit modeling of the syntactic structure of sentences. With this motivation, researchers have tried to extend the translation model by modeling the hierarchical structure of language. Eriguchi first proposed a tree-based attentive NMT model[13], which was further extended by Yang[14] and Chen[15] via a bidirectional encoding mechanism. All of the above tree-based models applied constituent tree structures and encountered the same difficulties. Other studies try to improve NMT by modeling the syntax on the target side[16-19]. Beyond enhancing the decoder, their success also demonstrated the need for and the efficiency of modeling syntactic information in NMT systems.

2. Related Work

Machine translation research began in earnest around 1990, and several methods and approaches have since been developed in this field[3]. According to recent evaluations, baseline systems translating Indian-language text (Bengali, Hindi, Malayalam, Punjabi, Tamil, Telugu, Gujarati, and Urdu) into English achieve an average of 10% correctness across all language pairs[20].

In 2013, Kalchbrenner proposed recurrent continuous translation models for machine translation[21]. This model uses a Convolutional Neural Network (CNN) to encode a given input text into a continuous vector and then uses a Recurrent Neural Network (RNN) as a decoder to convert the vector into the output language. In 2014, Long Short-Term Memory (LSTM) was introduced into NMT[11]. To solve the problem of fixed-length encoder vectors, attention mechanisms were introduced into NMT[12]. The attention mechanism allows the neural network to pay more attention to the relevant parts of the input and to discard unrelated parts. Since then, the performance of neural machine translation has improved significantly.

In Sutskever's work, a multilayer LSTM encodes the input sentence into a fixed-size vector, which is then decoded into the output by another LSTM. The use of LSTM efficiently mitigates the vanishing gradient problem, which allows the model to capture long-range dependencies within a sentence. Muhammad Bilal uses three classification models for text classification with the Waikato Environment for Knowledge Analysis (WEKA). Blogs written in Roman Urdu and English are treated as documents for the training dataset, labeled examples, and test data. The three models were tested and the results examined in each case. The results show that Naive Bayes outperformed Decision Tree and KNN in terms of
accuracy, precision, recall, and F-measure[22]. Mehreen Alam addresses this difficulty by casting Roman-Urdu to Urdu transliteration as a sequence-to-sequence learning problem; an Urdu corpus was created and fed to a neural machine translation model that handles sentences up to length 10 while achieving a good BLEU score[23]. Neelam Mukhtar describes Urdu as a resource-poor language that has mostly been ignored by the research community. After collecting data from many blogs across about 14 different genres, the data was annotated with the help of human annotators. Three well-known machine learning algorithms were used for testing and comparison: Support Vector Machine, Decision Tree, and k-Nearest Neighbor (k-NN). The results show that k-NN performs better than the Support Vector Machine and Decision Tree in terms of accuracy, precision, recall, and F-measure[24].

Muhammad Usman likewise evaluates five well-known classification techniques on an Urdu-language corpus. The corpus contains 21,769 news documents in seven categories (Business, Entertainment, Culture, Health, Sports, and Weird). After preprocessing, 93,400 features were extracted from the data, and the applied machine learning algorithms reached up to 94% precision[25]. In Yang and Dahl's work, word embeddings are first trained on a huge monolingual corpus and then refined bilingually in a context-dependent DNN-HMM framework; capturing lexical translation information and modeling context information improve word alignment performance. Unfortunately, although better word alignments were generated, they did not yield significant gains in an end-to-end SMT evaluation task[26].

To improve SMT performance directly, Auli enhanced the neural network language model in order to use both source- and target-side information; in that work, not only the target word embedding but also the current target word is used as input to the network[27]. Liu suggests an additional neural network for SMT decoding[28]. Mikolov's method is first used to generate the source and target word embeddings, on which a one-hidden-layer neural network produces a translation confidence score[29]. The main factor underscoring the importance of the present study is the absence of academic work on a Chinese-Urdu sentence-to-sentence translation model. This translation project is modeled as neural machine translation and aims to make a significant contribution to the development of today's technological age.

3. OpenNMT

OpenNMT is an open-source neural machine translation system built upon the Torch/PyTorch deep learning toolkit. The tool is designed to be user-friendly and easily accessible while also providing high translation accuracy. It offers a general-purpose interface that requires only source and target data, along with speed and memory optimizations. OpenNMT has active, open industrial and academic contributions. A diagram of neural machine translation is shown in Figure 2. The red source words are mapped to word vectors and fed into a recurrent neural network (RNN). After finding the end-of-sentence symbol <eos>,
the final step initializes the target (blue) RNN. At each step, the target RNN's current hidden state is combined with the source RNN states to produce a prediction of the next target word, as given in (Eq. 3); this prediction is then fed back into the target RNN:

p(y_t | y_{<t}, x) = softmax(W h_t + b)  (3)

The model was trained with OpenNMT Torch/PyTorch. There has been no previous work on translation from Chinese to Urdu. This study was conducted to reduce barriers in the communication process between the two countries in the fields of business and cultural promotion. Firstly, a Chinese-Urdu parallel sentence dataset with more than a million sentences was established. These datasets were then trained using the Neural Machine Translation (NMT) method and traditional statistical machine translation.

3.1 Parallel corpora

The amount and quality of parallel corpus play an important role in translation quality. For low-resource languages like Urdu, it is extremely difficult to find a sufficient parallel corpus for training, validation, and testing of a translation engine. The dataset of this project consists of a two-million-sentence Chinese-Urdu parallel corpus derived from the combination of the datasets defined below.

(1) Monolingual Corpus: the Urdu corpus comprises around 95.4 million tokens distributed across different websites, of which we have used 2.5 million for our approach. This corpus mixes sources such as news, religion, blogs, literature, science, and education[30].

(2) IPC: the Indic Parallel Corpus is a collection of Wikipedia documents of six Indian-subcontinent languages translated into English through crowdsourcing on the Amazon Mechanical Turk (MTurk) platform[31].

(3) UCBT: the UCBT dataset is an Urdu-Chinese parallel corpus. UrduChineseCorp contains 5 million parallel sentences. It is not freely available for open research.

(4) CTUS: the Chinese to Urdu Sentence dataset is a collection of sentences of different categories derived from the internet, news, and books. In addition, the dataset contains 50,000 sentences written manually and derived from UNHD (Urdu Nastaliq Handwritten Dataset) to address the Chinese-Urdu parallel-data deficit, as shown in Table 1.
Table 1. Chinese to Urdu dataset

| Number of Words | Number of Nouns | Number of Verbs | Number of Particles | Punctuation | Number of sentences |
| 2553053 | 387957 | 436759 | 268950 | 178923 | 700000 |
3.2 Methodology

Several methods of the OpenNMT tool are explained in the following subsections.

3.2.1 Normalization

This method applies a smooth transformation to the source sequences in order to identify certain specific sequences and keep them in a single representation for the translation process.

3.2.2 Tokenization

Tokenization is the process of splitting a sentence into pieces; each piece of a sentence is called a token. OpenNMT uses a space-separated technique for tokenization.

3.2.3 Byte pair encoding (BPE)

BPE, or byte pair encoding, is a text compression method that works by pattern substitution. In this work, the BPE model is built on the basis of tokenized source data. For languages sharing an alphabet, learning BPE on two or more involved languages increases the consistency of segmentation and reduces the problem of insertion or deletion of characters when copying data[32].

3.2.4 Rearranged tokenization

Due to the BPE model, the previously tokenized sentences were rearranged. There is an overview of the
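As an illustration of the pair-merging idea behind BPE, the following is a simplified sketch of the standard merge-learning loop, not the paper's exact implementation: the most frequent adjacent symbol pair in the corpus is repeatedly fused into a new symbol.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn byte-pair merges: repeatedly fuse the most frequent
    adjacent symbol pair across the space-tokenized corpus."""
    vocab = Counter(tuple(w) for w in words)  # word -> freq, as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = ["low", "low", "lower", "newest", "newest", "widest"]
print(learn_bpe(corpus, 3))
```

Applying the learned merges to new text segments rare words into known subword units, which is what makes the segmentation consistent across languages that share an alphabet.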
case feature and joiner annotation. The case feature adds extra features to the encoder, which are optimized for each label and then fed as extra source input alongside the word. On the target side, these features are predicted by the network, so the decoder is able to decode a sentence and annotate each word. On the other hand, by activating the joiner annotation symbol, the tokenization becomes reversible.

3.2.5 Preprocessing

In this method, the data passes through preprocessing, which generates the word vocabularies and balances the data size used for training.

3.2.6 Data training

In the data training process, the default OpenNMT encoder and decoder, LSTM layers, and RNN are used. The research is based on an open-source codebase[33] written in Python using PyTorch, an open-source software library. For the NMT model, a single-layer LSTM network is used to train the translation model.

Additionally, the BLEU score for each data-test is calculated. The details of each data-test are shown in Figure 3 and Table 2.

Table 2. Data-test results represented with BLEU score

| Data-test | BLEU Score |
| Test1 | 0.0678 |
| Test2 | 0.0847 |
| Test3 | 0.0855 |
| Test4 | 0.0889 |
| Test5 | 0.0924 |
| Test6 | 0.0929 |
| Test7 | 0.0987 |
| Test8 | 0.1156 |
| Test9 | 0.1887 |

System information: Intel(R) Core(TM) i5-4700 CPU @ 2.70 GHz.
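The vocabulary-generation part of the preprocessing step described in 3.2.5 can be sketched as follows. This is a minimal illustration; the function name build_vocab, the special tokens, and the size cap are our assumptions, not OpenNMT's exact interface.

```python
from collections import Counter

def build_vocab(sentences, max_size=50000, specials=("<unk>", "<s>", "</s>")):
    """Build a frequency-ordered word vocabulary with special tokens
    reserved at the lowest indices, as a preprocessing step that
    produces the token-to-index mapping used during training."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    # Keep only the most frequent words, leaving room for the specials.
    words = [w for w, _ in counts.most_common(max_size - len(specials))]
    return {tok: i for i, tok in enumerate(list(specials) + words)}

vocab = build_vocab(["the cat sat", "the dog sat"])
print(vocab["<unk>"], vocab["the"])  # specials first, then words by frequency
```

Capping the vocabulary size is what balances the data size: out-of-vocabulary words map to <unk>, which keeps the model's embedding and softmax layers at a fixed, trainable size.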