Problem Statement:
Implementing the pre-trained model BERT for Machine Translation from English to Telugu.
Introduction:
Machine translation has been studied for a very long time, and many approaches have been developed, such as Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). Statistical MT uses predictive algorithms to teach a computer how to translate text. NMT is loosely modelled on the neural networks of the human brain, with information being passed through different "layers" to be processed before producing an output. Statistical machine translation usually works less well for language pairs with significantly different word order, whereas neural networks make it possible to translate longer sentences into the required language. Previous work in neural machine translation uses models such as the Transformer, attention mechanisms and encoder-decoder architectures. In this proposal we would like to introduce a fine-tuned model, BERT (Bidirectional Encoder Representations from Transformers). BERT's key technical innovation is applying the bidirectional training of the Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts, which looked at a text sequence either from left to right or combined left-to-right and right-to-left training. The researchers detail a novel technique named Masked LM (MLM) which allows bidirectional training in models. Here we are going to implement BERT for translating one language to another (English to Telugu).
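To make the Masked LM idea concrete, here is a minimal sketch, assuming the Hugging Face transformers library is installed; the public bert-base-multilingual-cased checkpoint is our choice for illustration, not fixed by this proposal. The pipeline predicts the masked token using context from both sides of the sentence, which is what bidirectional training enables.

from transformers import pipeline

# fill-mask predicts the token hidden behind [MASK] using context
# from BOTH sides of the sentence.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

for prediction in fill_mask("Machine translation converts text from one [MASK] to another."):
    print(prediction["token_str"], round(prediction["score"], 3))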
RNN
The idea behind a recurrent neural network is that the input is a sequence and the network feeds information from one step forward into the next. In other words, a recurrent network uses an internal state (memory) to process sequences of inputs, whereas a feedforward network takes the whole input at once and makes a decision. RNNs have shown great success in many NLP tasks. The most commonly used type of RNN is the LSTM, which is much better at capturing long-term dependencies than a vanilla RNN. LSTMs are essentially the same as RNNs; they simply compute the hidden state in a different way. LSTMs are described in more detail below.
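As a minimal sketch of the recurrent idea above (assuming PyTorch; the sizes are illustrative only), the hidden state returned at each step is what carries the "memory" forward:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(4, 10, 16)          # batch of 4 sequences, 10 steps, 16 features
h0 = torch.zeros(1, 4, 32)          # initial hidden state (the "memory")

outputs, hn = rnn(x, h0)            # outputs: hidden state at every time step
print(outputs.shape)                # torch.Size([4, 10, 32])
print(hn.shape)                     # torch.Size([1, 4, 32]) -- final hidden state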
LSTMs
LSTM networks are quite popular these days, and we briefly mentioned them above. LSTMs do not have a fundamentally different architecture from RNNs, but they use a different function to compute the hidden state. The memory in an LSTM is called a cell, and we can think of cells as black boxes that take the previous state and the current input as inputs. Internally, these cells decide what to keep in (and what to erase from) memory. They then combine the previous state, the current memory and the input. It turns out that these types of units are very efficient at capturing long-term dependencies.
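A minimal sketch of an LSTM as such a black box, again assuming PyTorch with illustrative sizes; the cell state is the memory that the unit decides to keep or erase:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(4, 10, 16)      # 4 sequences, 10 steps, 16 features
h0 = torch.zeros(1, 4, 32)      # initial hidden state
c0 = torch.zeros(1, 4, 32)      # initial cell state (the memory)

outputs, (hn, cn) = lstm(x, (h0, c0))
print(outputs.shape, hn.shape, cn.shape)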
Bidirectional RNNs
These are based on the idea that the output at time t may depend not only on the previous elements in the sequence but also on future elements. For example, to predict a missing word in a sequence you want to look at both the left and the right context. Bidirectional RNNs are quite simple: they are just two RNNs stacked on top of each other, one reading the sequence forwards and one reading it backwards. The output is then computed based on the hidden states of both RNNs.
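A minimal sketch of a bidirectional recurrent layer (assuming PyTorch): setting bidirectional=True runs one RNN left-to-right and another right-to-left, and their per-step outputs are concatenated.

import torch
import torch.nn as nn

birnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 16)
outputs, _ = birnn(x)
print(outputs.shape)   # torch.Size([4, 10, 64]) -- 32 forward + 32 backward features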
BERT
BERT uses a multi-layer bidirectional Transformer encoder, whose self-attention layers attend in both directions. Google has released two variants of the model: BERT-Base and BERT-Large.
Pretrained Language Models (LMs) such as ELMo and BERT (Peters et al., 2018; Devlin et al.,
2018) have turned out to significantly improve the quality of several Natural Language
Processing (NLP) tasks by transferring the prior knowledge learned from data-rich
monolingual corpora to data-poor NLP tasks such as question answering, bio-medical
information extraction and standard benchmarks (Wang et al., 2018; Lee et al., 2019). In
addition, it was shown that these representations contain syntactic and semantic information in
different layers of the network (Tenney et al., 2019). Therefore, using such pretrained LMs for
Neural Machine Translation (NMT) is appealing.
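As a sketch of why this is appealing for NMT (assuming the Hugging Face transformers library; the multilingual checkpoint name is our assumption for illustration), a pretrained BERT encoder can be loaded and its contextual token representations extracted, which an NMT decoder could then attend over:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = BertModel.from_pretrained("bert-base-multilingual-cased")

inputs = tokenizer("How are you?", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One vector per (sub)word token; a decoder would attend over these states.
print(outputs.last_hidden_state.shape)   # [1, num_tokens, 768]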
Transformer Model
• The Transformer model in NLP has truly changed the way we work with text data
• Transformer is behind the recent NLP developments, including Google’s BERT
• The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence
tasks while handling long-range dependencies with ease.
• The encoder block has a Multi-Head Attention layer followed by a Feed-Forward Neural Network layer. The decoder, on the other hand, has an extra Masked Multi-Head Attention layer.
• The encoder and decoder blocks are actually multiple identical encoders and decoders
stacked on top of each other. Both the encoder stack and the decoder stack have the
same number of units.
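A minimal sketch of this stacked encoder-decoder structure, assuming a recent PyTorch version; the layer counts and dimensions are illustrative only:

import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6,   # encoder stack
    num_decoder_layers=6,   # decoder stack of the same depth
    batch_first=True,
)

src = torch.randn(2, 12, 512)   # source-token embeddings (batch, tokens, d_model)
tgt = torch.randn(2, 9, 512)    # shifted target embeddings
tgt_mask = model.generate_square_subsequent_mask(9)   # masked self-attention in the decoder

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)                # torch.Size([2, 9, 512])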
Transformer is undoubtedly a huge improvement over RNN-based seq2seq models, but it comes with its own share of limitations:
• Attention can only deal with fixed-length text strings. The text has to be split into a
certain number of segments or chunks before being fed into the system as input
• This chunking of text causes context fragmentation. For example, if a sentence is
split from the middle, then a significant amount of context is lost. In other words, the
text is split without respecting the sentence or any other semantic boundary
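A small sketch of the fixed-length chunking described above (plain Python; the helper name is ours): splitting purely by token count can cut a sentence in half, so later chunks lose the earlier context.

def chunk_tokens(tokens, chunk_size):
    # split a token list into fixed-size chunks, ignoring sentence boundaries
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

tokens = "the model was trained on parallel data and it performed well".split()
for chunk in chunk_tokens(tokens, chunk_size=5):
    print(chunk)   # the sentence is split without respecting any semantic boundary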
Text Preprocessing:
It is the process of modifying the initial dataset into a form that is acceptable to machine learning algorithms, so that they produce output with maximum accuracy. Preprocessing is necessary for an NMT dataset, in which tokenization followed by normalization is performed to produce better output and the best accuracy.
Tokenization:
It is the process of splitting the given dataset into tokens.
Normalization:
It is the process in which the dataset is checked for redundancy and repeated sentences, which affect the training of the system and decrease accuracy.
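A minimal preprocessing sketch in plain Python (the helper names and the tiny example pair are ours, purely for illustration): whitespace tokenization of each sentence, plus a normalization pass that drops exact duplicate sentence pairs.

def tokenize(sentence):
    # simple whitespace tokenization; real systems may use subword tokenizers
    return sentence.strip().split()

def normalize(parallel_pairs):
    # drop repeated (source, target) pairs so duplicates do not skew training
    seen, cleaned = set(), []
    for src, tgt in parallel_pairs:
        key = (src.strip(), tgt.strip())
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned

pairs = [("How are you ?", "మీరు ఎలా ఉన్నారు ?"),
         ("How are you ?", "మీరు ఎలా ఉన్నారు ?")]   # duplicate pair to be removed
print([(tokenize(s), tokenize(t)) for s, t in normalize(pairs)])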
Below is the literature survey on preprocessing in NMT for different languages.
Literature Survey:
1. There are many challenges in machine translation for Indian languages; the small size of parallel corpora and differences amongst languages, mainly word-order differences due to syntactic divergence, are two of the major ones. Indian languages suffer from both of these problems, especially when they are translated from English. The authors applied NMT to the English-Tamil language pair and proposed a novel neural machine translation technique using word embeddings along with Byte-Pair Encoding (BPE) to develop an efficient translation system that overcomes the OOV (Out-Of-Vocabulary) problem for languages which do not have many translations available online (a brief BPE sketch is given after this survey).
2. Neural systems are compared with traditional phrase-based systems using various parallel corpora including UN, ISI and Ummah. The UN corpus is an obvious choice for many researchers and is used in these experiments; it is composed of parliamentary documents of the United Nations since 1990 for Arabic, Chinese, English, French, Russian and Spanish. In these experiments, Arabic preprocessing, which includes normalization and tokenization, was applied to see its impact on both SMT and NMT systems, using Farasa (Abdelali et al., 2016), a fast Arabic segmenter.
3. A study on word-level machine translation for English-German translation is carried out. Unlike traditional systems, Neural Machine Translation (NMT) systems learn the parameters of the model and require only minimal preprocessing. Memory and time constraints allow only a fixed number of words to be taken into account, which leads to the out-of-vocabulary (OOV) problem.
4. In this system description paper, the authors report details of training neural machine translation with multi-source Romance languages using the Transformer model, in the evaluation frame of the biomedical WMT 2018 task. Using multi-source languages from the same family allows improvements of over 6 BLEU points.
5. The authors build Neural Machine Translation (NMT) systems for English-Hindi, Bengali-Hindi and Gujarati-Hindi with two different units of translation, i.e. word and subword, and present a comparative study of subword NMT and word-level NMT systems, along with strong results and case studies. They train an attention-based encoder-decoder model at the word level and use Byte-Pair Encoding (BPE) in subword NMT for word segmentation, and conduct case studies to examine the effects of BPE.
6. In this paper, the authors used a novel sequence encoding model named feedforward sequential memory networks (FSMNs) [16][17][18] to replace the RNN model in both the encoder and decoder modules of the end-to-end framework. The FSMN model is a standard feedforward neural network with single or multiple memory blocks in its hidden layers and can learn long-term dependencies in language modelling [16] as well as speech recognition [17]. On account of its ability to memorize the context information of a word, the FSMN is used as the encoder model. They also modified the attention module so that the FSMN can be employed as the decoder model and generate output symbols simultaneously during training. The FSMN-based encoder-decoder model can be trained quickly because it has no recurrent connections. (2017-govind)
7. The authors proposed a hierarchical RNN encoder that models the input sentence in a hierarchical structure. This is the first attempt to explore the source-side word-clause-sentence hierarchical structure for NMT. Based on the hierarchical structure modelled by the encoder, they incorporate two types of attention mechanism into the decoder, and thus different scopes of context can be distinguished and exploited for NMT. Experimental results show that the model significantly outperforms several state-of-the-art models in machine translation. (2018 - govind)
8. The authors use sentence-level context to model source topic information and design a topic attention to integrate the learned latent topic representations into the existing NMT architecture for improving target-word prediction. They first represent source topic information over a source sentence as latent topic representations (LTRs) by a variant of convolutional neural networks (CNNs) [11]. A topic attention, which is then learned based on word context and topic context, is used to compute an additional topic context vector for predicting target words. Compared with the approach of Zhang et al. [6], the main contributions are: 1) this work focuses on learning a novel independent sequence of sentence-level context representations to capture latent topic information for predicting target words; 2) the learned latent topic representations can be estimated together with the existing NMT architecture by an encoder with a convolutional neural network instead of being pretrained. (2019 - govind)
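As referenced in item 1 above, here is a minimal sketch of learning Byte-Pair Encoding subwords, assuming the sentencepiece library and a hypothetical corpus file train.en (one sentence per line); subword segmentation lets rare words be built from known pieces instead of becoming unknown tokens.

import sentencepiece as spm

# learn a BPE subword vocabulary from a plain-text corpus (the path is illustrative)
spm.SentencePieceTrainer.train(
    input="train.en",
    model_prefix="bpe_en",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_en.model")
# a rare word is broken into known subword pieces instead of becoming <unk>
print(sp.encode("untranslatable", out_type=str))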