Seq2Seq
Content
Machine Translation
What is Neural Machine Translation?
Encoder-decoder Framework
[Figure: an Encoder (RNN/LSTM) maps the source sentence to a vector, and a Decoder (RNN/LSTM) generates the target sentence from it]
"Sequence to Sequence Learning with Neural Networks", 2014
Encoder-decoder Framework
The encoder summarizes the source sentence into a single K-dimensional context vector.
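In symbols (a standard formulation; the symbols $h_i$, $x_i$, $c$ are ours, and the slide does not commit to a particular RNN cell):

$h_i = \mathrm{RNN}(h_{i-1}, x_i), \quad i = 1, \dots, N$
$c = h_N \in \mathbb{R}^{K}$

The decoder is initialized from the context vector $c$ and generates the target sentence from it.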
Encoder-decoder Framework
All inputs are aggregated in this single vector.
[Figure: the Encoder RNN reads "les pauvres sont démunis"; starting from <START>, the Decoder RNN emits "the poor don't have any money", taking the argmax over the output distribution at every step]
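The figure is greedy decoding of the standard seq2seq factorization (a textbook formulation; the notation is ours rather than the slide's):

$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)$
$\hat{y}_t = \arg\max_{w} P(w \mid \hat{y}_{<t}, x)$

At every step the decoder conditions on the context vector and the words produced so far, and greedy decoding simply keeps the single most probable word until <END> is generated.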
Training of NMT system
The whole encoder-decoder model is trained end-to-end on a parallel corpus: the decoder is fed the reference target words, and the loss is the sum (usually averaged over the T target words) of the per-step negative log-probabilities of those words,

$J = \frac{1}{T}\sum_{t=1}^{T} J_t$, where $J_t = -\log P(y_t \mid y_{<t}, x)$

so in the example the terms are the negative log prob of "the", ..., the negative log prob of "have", ..., the negative log prob of <END>.
[Figure: Encoder RNN over "les pauvres sont démunis"; Decoder RNN fed "<START> the poor don't have any money", with the per-step losses $J_t$ summed into $J$]
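A tiny numerical illustration of this loss; the per-step probabilities below are made up, not taken from the slides:

```python
import numpy as np

# Probability the model assigns to each correct target word of
# "the poor don't have any money <END>" at its time step (toy numbers).
p_correct = np.array([0.60, 0.45, 0.30, 0.52, 0.41, 0.66, 0.70])

J_t = -np.log(p_correct)   # per-step negative log-probabilities J_t
J = J_t.mean()             # average over the T target words
print(J_t.round(3), "J =", round(float(J), 3))
```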
Decoder based on Beam search
Instead of greedily committing to the single most probable word at each step, beam search keeps the k best partial hypotheses (here k = beam size = 2) and extends each of them with its most probable next words.
Example (beam size = 2):
[Figure: beam-search tree starting from <START>; the two best partial hypotheses are kept at every step, e.g. "the" / "a", then "the poor" / "the people", "the poor are" / "the poor don't", "the poor don't have", ..., with candidate completions "... any/enough money/funds" and the best hypothesis "the poor don't have any money"]
Beam search: stopping criterion
• In greedy decoding, we usually decode until the model produces an <END> token
  Example: <START> he hit me with a pie <END>
• In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
  • When a hypothesis produces <END>, that hypothesis is complete
  • Place it aside and continue exploring other hypotheses via beam search
• Usually we continue beam search until:
  • We reach timestep T (where T is a pre-defined cutoff), or
  • We have at least n completed hypotheses (where n is a pre-defined cutoff)
A minimal sketch of such a beam-search decoder follows.
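This is a self-contained sketch of beam-search decoding with these stopping rules. The toy `next_log_probs` function below stands in for the real decoder (a real system would run the decoder RNN on the prefix); the vocabulary, probabilities, and cutoffs are all illustrative:

```python
import math

VOCAB = ["the", "poor", "don't", "have", "any", "money", "<END>"]

def next_log_probs(prefix):
    """Toy stand-in for the decoder: returns {word: log P(word | prefix, x)}."""
    favoured = {0: "the", 1: "poor", 2: "don't", 3: "have", 4: "any",
                5: "money", 6: "<END>"}[min(len(prefix), 6)]
    probs = {w: (0.6 if w == favoured else 0.4 / (len(VOCAB) - 1)) for w in VOCAB}
    return {w: math.log(p) for w, p in probs.items()}

def beam_search(beam_size=2, max_len=10, n_complete=2):
    beams = [([], 0.0)]          # (prefix, cumulative log-probability)
    completed = []
    for _ in range(max_len):     # stop at timestep T = max_len
        candidates = []
        for prefix, score in beams:
            for word, lp in next_log_probs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:
            if prefix[-1] == "<END>":
                completed.append((prefix, score))   # set finished hypothesis aside
            else:
                beams.append((prefix, score))
            if len(beams) == beam_size:
                break
        if len(completed) >= n_complete or not beams:
            break
    # Compare hypotheses by per-word score so longer ones are not unfairly penalized
    return max(completed or beams, key=lambda c: c[1] / len(c[0]))

print(beam_search())   # -> (['the', 'poor', "don't", 'have', 'any', 'money', '<END>'], ...)
```

Note the length normalization when picking the final hypothesis: without it, beam search systematically prefers shorter outputs, since every extra word adds another negative log-probability.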
Advantages of NMT
Compared to SMT, NMT has many advantages:
• Better performance
  • More fluent
  • Better use of context
  • Better use of phrase similarities
How to evaluate an MT system?
BLEU (Bilingual Evaluation Understudy)
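BLEU scores a candidate translation by its n-gram overlap with one or more human reference translations, with a brevity penalty for overly short outputs. A minimal sketch using NLTK's implementation (the sentences are made up; assumes the `nltk` package is installed):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the poor don't have any money".split()
candidate = "the poor don't have money".split()

# sentence_bleu takes a list of references; smoothing avoids zero scores
# when some higher-order n-grams have no match on short sentences.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```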
[Figure: the encoder's final hidden state is the encoding of the source sentence "les pauvres sont démunis"; the Decoder RNN generates the target sentence (output) "the poor don't have any money" from it]
Visualizing sentence embeddings
Attention
Encoder-Decoder with Attention
[Figure: at each decoder step, dot products between the decoder hidden state and every encoder hidden state give the attention scores; Encoder RNN over "les pauvres sont démunis", Decoder RNN generating "the poor don't have any ..."]
• We take the softmax of the attention scores to get the attention distribution for this step (this is a probability distribution and sums to 1)
• We use it to take a weighted sum of the encoder hidden states to get the attention output
• Finally we concatenate the attention output with the decoder hidden state and proceed as in the non-attention seq2seq model, as summarized in the equations below
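Written out (standard dot-product attention notation; $s_t$ denotes the decoder hidden state and $h_1, \dots, h_N$ the encoder hidden states; the symbols are ours, not the slide's):

$e_t = [\,s_t^{\top} h_1, \dots, s_t^{\top} h_N\,]$  (attention scores)
$\alpha_t = \mathrm{softmax}(e_t)$  (attention distribution)
$a_t = \sum_{i=1}^{N} \alpha_{t,i}\, h_i$  (attention output)

The concatenation $[a_t ; s_t]$ is then used to predict the next word, as in the non-attention model.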
Attention scoring function
• q is the query (e.g. the decoder state) and k is the key (e.g. an encoder state)
• Multi-layer Perceptron (Bahdanau et al. 2015)
  • Flexible, often very good with large data
A small numerical sketch of the dot-product and MLP scoring functions follows.
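A self-contained NumPy sketch of two common scoring functions, the plain dot product and the MLP (additive) score of Bahdanau et al. 2015; the dimensions and random weights are illustrative only:

```python
import numpy as np
rng = np.random.default_rng(0)

d = 4                        # hidden size (illustrative)
q = rng.normal(size=d)       # query, e.g. the decoder state s_t
K = rng.normal(size=(3, d))  # keys, e.g. the encoder states h_1..h_3

# 1) Dot-product score: a(q, k) = q . k
dot_scores = K @ q

# 2) MLP / additive score (Bahdanau et al. 2015): a(q, k) = v^T tanh(W1 q + W2 k)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
mlp_scores = np.tanh(q @ W1.T + K @ W2.T) @ v

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

print("dot-product attention distribution:", softmax(dot_scores).round(3))
print("MLP attention distribution:        ", softmax(mlp_scores).round(3))
```

The dot product has no parameters but requires the query and key to have the same dimension; the MLP form is more flexible and, as the slide notes, often works very well with large data.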
Attention is a general Deep Learning technique
• The same mechanism is at the core of large language models such as GPT-3 and ChatGPT
Evolution of MT systems over time
[Figure: Edinburgh En-De WMT newstest2013 cased BLEU; NMT 2015 from U. Montréal]
Source: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf
NMT: the first big story of NLP Deep Learning
Neural Machine Translation went from a fringe research activity in 2014 to the leading standard method in 2016
• 2016: Google Translate switches from SMT to NMT, and by 2018 everyone has switched
• This is amazing!
  • SMT systems, built by hundreds of engineers over many years, were outperformed by NMT systems trained by small groups of engineers in a few months
MT solved?
• Nope!
• Many difficulties remain:
  • Out-of-vocabulary (unknown) words
  • Domain mismatch between train and test data
  • Maintaining context over longer text
  • Low-resource language pairs (hallucinations)
MT solved?
• Nope!
• Using common sense is still hard
Seq2seq is flexible and efficient!
• Seq2Seq is useful not only for Machine Translation
• Many NLP tasks can be phrased as sequence-to-sequence:
  • Summarization (long text → short text)
  • Dialogue (previous utterances → next utterance)
  • Parsing (input text → output parse as sequence)
  • Code generation (natural language → Python code)
  • OCR (image of characters → text sequence)
  • ASR (acoustic sequence → text sequence)
Machine Translation Problem
• Automatic translation from English sentences to Vietnamese sentences
• Training and testing data: IWSLT 2015
Example:
  Input: I like a blue book
  Output: Tôi thích quyển sách màu xanh
Apply seq2seq + attention for NMT
Goals to be achieved:
1. Train a seq2seq model for the English-Vietnamese translation problem.
2. Evaluate the translation system with the BLEU score.
3. Inspect the attention score matrix to better understand the model.
The tasks to be implemented (a module skeleton is sketched below):
1. Data preprocessing
2. Create the training data
3. Write the encoder, decoder, and attention modules
4. Train the model
5. Translate new source sentences
6. Show the attention matrix and the BLEU score
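A minimal PyTorch sketch of the encoder, dot-product attention, and decoder modules the tasks call for; this is an illustrative skeleton under assumed class names, GRU cells, and sizes, not the lab's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embed(src))
        return outputs, hidden                     # outputs: (batch, src_len, hid)

class DotAttention(nn.Module):
    def forward(self, dec_state, enc_outputs):     # dec_state: (batch, hid)
        scores = torch.bmm(enc_outputs, dec_state.unsqueeze(2)).squeeze(2)
        alpha = F.softmax(scores, dim=1)           # attention distribution
        context = torch.bmm(alpha.unsqueeze(1), enc_outputs).squeeze(1)
        return context, alpha                      # attention output + weights

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.attn = DotAttention()
        self.out = nn.Linear(hid_dim * 2, vocab_size)

    def forward(self, prev_token, hidden, enc_outputs):  # one decoding step
        emb = self.embed(prev_token).unsqueeze(1)         # (batch, 1, emb)
        output, hidden = self.rnn(emb, hidden)
        dec_state = output.squeeze(1)                     # (batch, hid)
        context, alpha = self.attn(dec_state, enc_outputs)
        logits = self.out(torch.cat([context, dec_state], dim=1))
        return logits, hidden, alpha

# Tiny smoke test with random token ids (toy vocabulary sizes)
enc, dec = Encoder(100), Decoder(120)
src = torch.randint(0, 100, (2, 7))
enc_out, hidden = enc(src)
logits, hidden, alpha = dec(torch.tensor([1, 1]), hidden, enc_out)
print(logits.shape, alpha.shape)   # torch.Size([2, 120]) torch.Size([2, 7])
```

Training would feed the reference target words to `Decoder.forward` one step at a time and accumulate the per-step cross-entropy loss described earlier; translation reuses the same step function with greedy or beam-search decoding, and the returned `alpha` values form the attention matrix to visualize.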
Conclusion
References