
Seq2seq model and application for machine translation

Nguyen Van Vinh
UET, VNU-Hanoi

1
Content

• Introduction to Machine Translation
• The seq2seq model
• Attention mechanism
• Practice: machine translation with the seq2seq model

2
Machine Translation

Machine Translation (MT) is the task of translating a sentence x from one language (the source language) to a sentence y in another language (the target language).

x: L'homme est né libre, et partout il est dans les fers
y: Man is born free, but everywhere he is in chains

3
What is Neural Machine Translation?

• Neural Machine Translation (NMT) is a way to do Machine Translation with a single neural network

• The neural network architecture is called sequence-to-sequence (aka seq2seq) and it involves two RNNs (LSTMs)

4
Encoder-decoder Framework
[Figure: an Encoder (RNN/LSTM) feeds its encoding to a Decoder (RNN/LSTM).]

Source: "Sequence to Sequence Learning with Neural Networks", 2014

5
Encoder-decoder Framework

[Figure: the encoder compresses the source sentence into a K-dimensional context vector, which conditions the words generated by the translation system.]

Source: "Sequence to Sequence Learning with Neural Networks", 2014

6
Encoder-decoder Framework

7
Encoder-decoder Framework
[Figure: all inputs are aggregated in this single vector.]

In seq2seq, the decoder's state depends only on the previous state and the previous output.
8
Neural Machine Translation
The sequence-to-sequence model

[Figure: the Encoder RNN reads the source sentence "les pauvres sont démunis" (input) and produces an encoding of the source sentence, which provides the initial hidden state for the Decoder RNN. The Decoder RNN is a Language Model that generates the target sentence "the poor don't have any money <END>" (output) conditioned on the encoding, taking the argmax over the vocabulary at each step. Note: this diagram shows test-time behavior: the decoder output is fed in as the next step's input.]
9
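To make the diagram concrete, here is a minimal PyTorch sketch of the two RNNs. It is illustrative only, not the exact model from the slides: the GRU cells, the embedding and hidden sizes, and the class names (EncoderRNN, DecoderRNN) are assumptions.

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """Reads the source sentence and returns its final hidden state (the encoding)."""
    def __init__(self, src_vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(src_vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_ids):                 # src_ids: (batch, src_len)
        emb = self.embed(src_ids)
        outputs, hidden = self.rnn(emb)         # hidden: (1, batch, hid_dim)
        return outputs, hidden                  # hidden initializes the decoder

class DecoderRNN(nn.Module):
    """A conditional language model: predicts the next target word given the previous ones."""
    def __init__(self, tgt_vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, tgt_vocab_size)

    def forward(self, prev_ids, hidden):        # prev_ids: (batch, tgt_len)
        emb = self.embed(prev_ids)
        outputs, hidden = self.rnn(emb, hidden)
        logits = self.proj(outputs)             # (batch, tgt_len, tgt_vocab_size)
        return logits, hidden
```

At test time, the decoder starts from <START> with the encoder's final hidden state and feeds each argmax prediction back in as the next input, exactly as the diagram shows.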
Training of NMT system
• As in other RNN models, we can train by minimizing the loss between what we predict at each step and its ground-truth value.

10
Training of NMT system

11
12
Training of NMT system

[Figure: the Encoder RNN reads the source sentence "les pauvres sont démunis" (from corpus) and the Decoder RNN is fed the target sentence "<START> the poor don't have any money" (from corpus). The per-step losses are the negative log probabilities of the gold words, $J_1 = -\log P(\text{"the"}), \dots, J_7 = -\log P(\text{<END>})$, and the total loss is their average: $J = \frac{1}{T}\sum_{t=1}^{T} J_t$, with $J_t = -\log P(y_t \mid y_{<t}, x)$.]

Seq2seq is optimized as a single system.
Backpropagation operates "end to end".
13
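Below is a hedged sketch of one training step under this loss, reusing the hypothetical EncoderRNN/DecoderRNN classes from the earlier sketch. Teacher forcing (feeding the gold previous word rather than the model's prediction) and the choice of optimizer are assumptions; the slides only state that the per-step negative log likelihoods are averaged and backpropagated end to end.

```python
import torch
import torch.nn as nn

def training_step(encoder, decoder, optimizer, src_ids, tgt_ids, pad_id):
    """One end-to-end update: J = (1/T) * sum_t -log P(y_t | y_<t, x)."""
    optimizer.zero_grad()
    _, hidden = encoder(src_ids)                     # encode the source sentence
    decoder_input = tgt_ids[:, :-1]                  # <START> the poor ... any
    gold_output = tgt_ids[:, 1:]                     # the poor ... money <END>
    logits, _ = decoder(decoder_input, hidden)       # teacher forcing
    loss = nn.functional.cross_entropy(              # mean of per-step -log probs
        logits.reshape(-1, logits.size(-1)),
        gold_output.reshape(-1),
        ignore_index=pad_id,                         # don't count padding positions
    )
    loss.backward()                                  # backprop "end to end"
    optimizer.step()
    return loss.item()
```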
Better-than-greedy decoding?

• Greedy decoding has no way to undo decisions! (a decoding sketch follows after this slide)
  • les pauvres sont démunis (the poor don't have any money)
  • → the ____
  • → the poor ____
  • → the poor are ____
• Better option: use beam search (a search algorithm) to explore several hypotheses and select the best one

14
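A minimal sketch of greedy decoding, to show why it cannot undo decisions: each step commits to the single argmax word and only that prefix is ever extended. The decoder_step(prev_id, state) callable, returning per-word log probabilities and the next decoder state, is a hypothetical interface, not an API from the slides.

```python
import torch

def greedy_decode(decoder_step, init_state, start_id, end_id, max_len=50):
    """Commit to the argmax word at every step; earlier choices are never revisited."""
    state, prev_id = init_state, start_id
    output = []
    for _ in range(max_len):
        log_probs, state = decoder_step(prev_id, state)   # (vocab,) log P(y_t | y_<t, x)
        prev_id = int(torch.argmax(log_probs))            # single best word, no backtracking
        if prev_id == end_id:
            break
        output.append(prev_id)
    return output
```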
Decoder based on Beam search

15
Decoder based on Beam search: Example
Beam size = 2

[Figure, built up over slides 16–23: beam search keeps the 2 highest-scoring partial hypotheses at every step. From <START> the beam keeps "the" and "a"; expanding them with "poor"/"people" and "poor"/"person" leaves "the poor" and "a poor" as the best two; later steps consider continuations such as "are"/"don't" and "person"/"but", then "always"/"not" and "have"/"take", then "in"/"with" and "any"/"enough", and finally "money"/"funds". The highest-scoring complete hypothesis is "the poor don't have any money".]
Beam search: stopping criterion
• In greedy decoding, we usually decode until the model produces an <END>
  Example: <START> he hit me with a pie <END>
• In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
  • When a hypothesis produces <END>, that hypothesis is complete.
  • Place it aside and continue exploring other hypotheses via beam search.
• Usually we continue beam search until:
  • We reach timestep T (where T is some pre-defined cutoff), or
  • We have at least n completed hypotheses (where n is a pre-defined cutoff); a beam search sketch follows below

24
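Here is a hedged sketch of beam search with the stopping criteria above: hypotheses that emit <END> are set aside as complete, and search stops at the step cutoff or once enough hypotheses have completed. The decoder_step(prev_id, state) interface is the same hypothetical one as in the greedy sketch, and the length normalization applied when comparing finished hypotheses is a common convention that this slide does not discuss.

```python
import torch

def beam_search(decoder_step, init_state, start_id, end_id,
                beam_size=2, max_steps=50, n_complete=2):
    # Each hypothesis: (score = sum of log probs, token list, decoder state)
    beams = [(0.0, [start_id], init_state)]
    completed = []
    for _ in range(max_steps):
        candidates = []
        for score, tokens, state in beams:
            log_probs, new_state = decoder_step(tokens[-1], state)
            top_lp, top_id = torch.topk(log_probs, beam_size)      # expand each hypothesis
            for lp, idx in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((score + lp, tokens + [idx], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)           # best candidates first
        beams = []
        for score, tokens, state in candidates[:beam_size * 2]:
            if tokens[-1] == end_id:
                completed.append((score / len(tokens), tokens))     # set aside, length-normalized
            elif len(beams) < beam_size:
                beams.append((score, tokens, state))                # keep beam_size open hypotheses
        if len(completed) >= n_complete or not beams:
            break
    if not completed:                                               # fall back to best open hypothesis
        completed = [(score / len(tokens), tokens) for score, tokens, _ in beams]
    return max(completed, key=lambda c: c[0])[1]
```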
Advantage of NMT
Compared to SMT, NMT has many advantages:
• Better performance
  • More fluent
  • Better use of context
  • Better use of phrase similarities
• A single neural network to be optimized end-to-end
  • No subcomponents to be individually optimized
• Requires much less human engineering effort
  • No feature engineering
  • Same method for all language pairs
25
Weakness of NMT?
Compared to SMT:

• NMT is less interpretable
  • Hard to debug
• NMT is difficult to control
  • For example, can't easily specify rules or guidelines for translation
  • Safety concerns!

26
How to evaluate an MT system?
BLEU (Bilingual Evaluation Understudy)

• BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on:
  • n-gram precision (usually up to 3- or 4-grams)
  • Penalty for too-short system translations

• BLEU is useful but imperfect (a small usage sketch follows after this slide)
  • There are many valid ways to translate a sentence
  • So a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation ☹
27
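For reference, a small sketch of computing corpus BLEU with the sacrebleu package; the package choice and the toy sentences are assumptions, not part of the slides.

```python
import sacrebleu

# System outputs and one reference translation per sentence (toy example).
hypotheses = ["the poor don't have any money"]
references = [["the poor are completely destitute"]]   # list of reference streams

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")   # low score despite a reasonable translation:
                                    # little n-gram overlap with the single reference
```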
Sequence-to-sequence: the bottleneck problem

[Figure: the Encoder RNN encodes the source sentence "les pauvres sont démunis" (input) into a single encoding, and the Decoder RNN generates the target sentence "the poor don't have any money <END>" (output) from it.]

Problems with this architecture?

28
Sequence-to-sequence: the bottleneck problem
[Figure: same architecture as the previous slide. The single encoding of the source sentence needs to capture all information about the source sentence. Information bottleneck!]

29
Visualizing sentence embeddings
Attention

• Attention: a solution to the bottleneck problem.

• Main idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence

31
Encoder-Decoder with Attention (sequence-to-sequence with attention)

[Figure, built up over slides 32–43: the Encoder RNN reads the source sentence "les pauvres sont démunis" (input); the Decoder RNN starts from <START> and generates "the poor don't have any money" one word at a time.]

• Attention scores: computed as the dot product of the decoder hidden state with each encoder hidden state. On the first decoder timestep, we're mostly focusing on the first encoder hidden state ("les").
• Attention distribution: take a softmax to turn the scores into a probability distribution.
• Attention output: use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.
• The attention output is combined with the decoder hidden state to predict the next word; repeating this at each timestep generates "the", "poor", "don't", "have", "any", "money" in turn.


"Neural Machine Translation by Jointly Learning to Align and Translate"

[Figure-only slides]

Source: Bahdanau et al., ICLR 2015, https://arxiv.org/abs/1409.0473


Attention: Formula

• We have encoder hidden states (values) $h_1, \dots, h_N \in \mathbb{R}^h$
• On timestep $t$, we have the decoder hidden state (query) $s_t \in \mathbb{R}^h$
• We get the attention scores for this step (there are multiple ways to do this): $e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N$
• We take softmax to get the attention distribution for this step (this is a probability distribution and sums to 1): $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$
• We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output: $a_t = \sum_{i=1}^{N} \alpha_i^t h_i \in \mathbb{R}^h$
• Finally we concatenate the attention output with the decoder hidden state $[a_t; s_t]$ and proceed as in the non-attention seq2seq model (a code sketch follows below)
46
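A minimal sketch of one decoder timestep with dot-product attention, following the formula above; the tensor shapes and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attention_step(s_t, encoder_states):
    """s_t: (batch, h) decoder state (query); encoder_states: (batch, N, h) values."""
    scores = torch.bmm(encoder_states, s_t.unsqueeze(2)).squeeze(2)   # e^t = [s_t . h_i], (batch, N)
    alpha = F.softmax(scores, dim=-1)                                 # attention distribution, sums to 1
    a_t = torch.bmm(alpha.unsqueeze(1), encoder_states).squeeze(1)    # weighted sum of values, (batch, h)
    return torch.cat([a_t, s_t], dim=-1), alpha                       # [a_t; s_t] and the distribution

# Example: batch of 1, source length N=4 ("les pauvres sont démunis"), hidden size h=512
s_t = torch.randn(1, 512)
H = torch.randn(1, 4, 512)
context, alpha = attention_step(s_t, H)
print(context.shape, alpha.shape)   # torch.Size([1, 1024]) torch.Size([1, 4])
```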
Attention scoring function
• q is the query and k is the key
• Multi-layer Perceptron (Bahdanau et al. 2015): $a(q, k) = w_2^\top \tanh(W_1 [q; k])$
  • Flexible, often very good with large data
• Bilinear (Luong et al. 2015): $a(q, k) = q^\top W k$
• Dot Product (Luong et al. 2015): $a(q, k) = q^\top k$
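The three scoring functions written out as a small sketch; the dimensions and parameter names are assumptions for illustration.

```python
import torch
import torch.nn as nn

h = 512
W1 = nn.Linear(2 * h, h)                        # MLP layer 1
w2 = nn.Linear(h, 1, bias=False)                # MLP layer 2
W = nn.Parameter(torch.randn(h, h) / h ** 0.5)  # bilinear matrix

def score_mlp(q, k):       # Multi-layer Perceptron (Bahdanau et al. 2015)
    return w2(torch.tanh(W1(torch.cat([q, k], dim=-1)))).squeeze(-1)

def score_bilinear(q, k):  # Bilinear (Luong et al. 2015)
    return (q @ W * k).sum(dim=-1)

def score_dot(q, k):       # Dot product (Luong et al. 2015)
    return (q * k).sum(dim=-1)
```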


Attention is so great (Bahdanau et al., 16,054 citations)
• Attention significantly improves NMT performance
  • It's very useful to allow the decoder to focus on certain parts of the source
• Attention provides a more "human-like" model of the MT process
  • You can look back at the source sentence while translating, rather than needing to remember it all
• Attention solves the bottleneck problem
  • Attention allows the decoder to look directly at the source; bypass the bottleneck
• Attention helps with the vanishing gradient problem
• Attention provides some interpretability
  • By inspecting the attention distribution, we can see what the decoder was focusing on
  • We get alignment for free!
  • This is cool because we never explicitly trained an alignment system
  • The network just learned alignment by itself
48
Key developments in attention

[Figure: key developments in attention, including GPT-3 and ChatGPT.]

Attention is a general Deep Learning technique

50
Evolution of the MT system over time
[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]

Source: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf
51
NMT: the first big story of NLP Deep Learning
Neural Machine Translation went from a fringe research activity in 2014 to the leading standard method in 2016

• 2014: First seq2seq paper published [Sutskever et al., 2014]

• 2016: Google Translate switches from SMT to NMT, and by 2018 everyone has

• This is amazing!
  • SMT systems, built by hundreds of engineers over many years, were outperformed by NMT systems trained by a small group of engineers in a few months
52
MT solved?

• Nope!
• Many difficulties remain:
• Out-of-vocabulary words (unknown words)
• Domain mismatch between train and test data
• Maintaining context over longer text
• Low-resource language pairs (hallucinations)

53
MT solved?

• Nope!
• Using common sense is still hard

54
Seq2seq is flexible and efficient!
• Seq2Seq is useful not only for Machine Translation
• Many NLP tasks can be phrased as sequence-to-sequence:
  • Summarization (long text → short text)
  • Dialogue (previous utterances → next utterance)
  • Parsing (input text → output parse as sequence)
  • Code generation (natural language → Python code)
  • OCR (image of characters → text sequence)
  • ASR (acoustic sequence → text sequence)

55
Machine Translation Problem
• Automatic translation from English sentences to Vietnamese sentences
• Training and testing data: IWSLT 2015
Example:
  Input: I like a blue book
  Output: Tôi thích quyển sách màu xanh
Apply seq2seq + attention for NMT

Goals to be achieved:
1. Train a seq2seq model for the English-Vietnamese translation problem.
2. Evaluate the translation system with the BLEU score.
3. View the attention score matrix to better understand the model.
The tasks to be implemented (a pipeline sketch follows below):
1. Data preprocessing
2. Create training data
3. Write the encoder, decoder, and attention modules
4. Train the model
5. Translate new source sentences
6. Show attention and BLEU score
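A hedged skeleton of the first pipeline steps above. The file names, the whitespace tokenizer, and the vocabulary cutoff are assumptions for illustration; IWSLT 2015 data would normally be tokenized more carefully (e.g. with a subword tokenizer).

```python
from collections import Counter

def read_parallel(src_path, tgt_path):
    """Steps 1-2: read and lightly preprocess a parallel corpus (hypothetical file names)."""
    with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
        pairs = [(s.strip().lower().split(), t.strip().lower().split())
                 for s, t in zip(f_src, f_tgt) if s.strip() and t.strip()]
    return pairs

def build_vocab(sentences, min_freq=2):
    """Map frequent tokens to ids; reserve ids for special symbols."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {"<pad>": 0, "<unk>": 1, "<start>": 2, "<end>": 3}
    for tok, c in counts.most_common():
        if c >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def numericalize(sent, vocab):
    """Wrap a sentence in <start>/<end> and convert tokens to ids."""
    return [vocab["<start>"]] + [vocab.get(t, vocab["<unk>"]) for t in sent] + [vocab["<end>"]]

# pairs = read_parallel("train.en", "train.vi")   # hypothetical IWSLT 2015 file names
# src_vocab = build_vocab([s for s, _ in pairs])
# tgt_vocab = build_vocab([t for _, t in pairs])
# Steps 3-6: feed the ids to the encoder/decoder/attention sketches above,
# train, translate new sentences, and report BLEU and attention matrices.
```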
Conclusion

• Sequence-to-sequence is the architecture behind many current NLP problems such as NMT, text generation, ...

• Attention is a great way to focus on particular parts of the input
  • It improved the Seq2Seq model a lot!
  • It is the foundation of the Transformer model (now dominant!)

58
References

• Speech and Language Processing, 2023 (https://web.stanford.edu/~jurafsky/slp3/)
• Machine Translation and Sequence-to-Sequence Models, Neubig, 2019, CMU
• Some slides from Stanford University (2023), MIT, …
