XCS224N Module5 Slides
Christopher Manning
Lecture 7: Machine Translation, Sequence-to-Sequence and Attention
Lecture Plan
Today we will:
1. Introduce a new task: Machine Translation [15 mins], which is a major use-case of
2. A new neural architecture: sequence-to-sequence [45 mins], which is improved by
3. A new neural technique: attention [20 mins]
• Announcements
• Assignment 3 is due today – I hope your dependency parsers are parsing text!
• Assignment 4 out today – covered in this lecture, you get 9 days for it (!), due Thu
• Get started early! It’s bigger and harder than the previous assignments 😰
• Thursday’s lecture about choosing final projects
2
Section 1: Pre-Neural Machine Translation
3
Machine Translation
Machine Translation (MT) is the task of translating a sentence x from one language (the
source language) to a sentence y in another language (the target language).
[Example: a French source sentence x and its English translation y – Rousseau]
4
The early history of MT: 1950s
• Machine translation research began in the early 1950s on machines less
powerful than high school calculators
• Foundational work on automata, formal languages, probabilities, and
information theory
• MT heavily funded by military, but basically just simple rule-based
systems doing word substitution
• Human language is more complicated than that, and varies more across
languages!
• Little understanding of natural language syntax, semantics, pragmatics
• Problem soon appeared intractable
1 minute video showing 1954 MT:
https://youtu.be/K-HfpsHPmvw
1990s-2010s: Statistical Machine Translation
• Core idea: Learn a probabilistic model from data
• Suppose we’re translating French → English.
• We want to find best English sentence y, given French sentence x
• Use Bayes Rule to break this down into two components to be learned
separately:
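Written out, this is the standard noisy-channel decomposition the slide refers to:

$$ y^{*} = \operatorname*{argmax}_{y} P(y \mid x) \;=\; \operatorname*{argmax}_{y}\; \underbrace{P(x \mid y)}_{\text{Translation Model}}\;\underbrace{P(y)}_{\text{Language Model}} $$

where the translation model $P(x \mid y)$ is learned from parallel data and the language model $P(y)$ from monolingual target-language data.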
[Image: the Rosetta Stone, whose Demotic and Ancient Greek inscriptions form an early parallel text]
7
Learning alignment for SMT
• Question: How to learn the translation model P(x|y) from the parallel corpus?
8
What is alignment?
Alignment is the correspondence between particular words in the translated sentence pair.
Examples from: “The Mathematics of Statistical Machine Translation: Parameter Estimation”, Brown et al., 1993. http://www.aclweb.org/anthology/J93-2003
Alignment is complex
Alignment can be many-to-one
Alignment is complex
Alignment can be one-to-many
Alignment is complex
Alignment can be many-to-many (phrase-level)
Learning alignment for SMT
• We learn P(x, a | y) as a combination of many factors, including:
• Probability of particular words aligning (also depends on position in the sentence)
• Probability of particular words having a particular fertility (number of corresponding
words)
• etc.
• Alignments a are latent variables: They aren’t explicitly specified in the data!
• Requires the use of special learning algorithms (like Expectation-Maximization) for
learning the parameters of distributions with latent variables – see the sketch below
• In older days, we used to do a lot of that in CS 224N, but now see CS 228!
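For intuition, here is a toy sketch of EM for IBM Model 1, one classic choice of translation model with latent word alignments; the tiny corpus and variable names are made up for illustration.

```python
from collections import defaultdict

# Toy (English, French) parallel corpus, purely for illustration.
corpus = [("the house".split(), "la maison".split()),
          ("the book".split(), "le livre".split()),
          ("a book".split(), "un livre".split())]

NULL = "<null>"                       # lets a French word align to nothing
t = defaultdict(lambda: 1.0)          # t[(f, e)] = t(f|e); any uniform start works

for _ in range(10):                   # EM iterations
    count = defaultdict(float)        # expected count of (f, e) alignments
    total = defaultdict(float)        # expected count of e being aligned to
    for eng, fra in corpus:
        eng = [NULL] + eng
        for f in fra:
            z = sum(t[(f, e)] for e in eng)          # normalizer for this f
            for e in eng:
                p = t[(f, e)] / z                    # E-step: posterior that f aligns to e
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():                  # M-step: re-estimate t(f|e)
        t[(f, e)] = c / total[e]

print(round(t[("maison", "house")], 3))   # grows toward 1 as alignments sharpen
```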
13
Decoding for SMT
$$ \operatorname*{argmax}_{y}\; \underbrace{P(x \mid y)}_{\text{Translation Model}}\; \underbrace{P(y)}_{\text{Language Model}} $$
Question: How to compute this argmax?
14
Decoding for SMT: Translation Options
[Figure: word- and phrase-level translation options for the German sentence "er geht ja nicht nach hause" – e.g., "he" / "it" for "er", "is" / "goes" for "geht", "not" / "does not" / "is not" for "(ja) nicht", "house" / "home" / "return home" for "(nach) hause" – and the search lattice used in "Decoding: Find Best Path" through these options (figure labeled "Chapter 6: Decoding").]
16
Section 2: Neural Machine Translation
17
2014
(dramatic reenactment)
18
2014: Neural Machine Translation sweeps over MT research
(dramatic reenactment)
19
What is Neural Machine Translation?
• Neural Machine Translation (NMT) is a way to do Machine Translation with a single
end-to-end neural network
20
Neural Machine Translation (NMT)
The sequence-to-sequence model
[Figure: the encoder RNN reads the source sentence "il a m’ entarté"; the encoding of the source sentence (the encoder's final hidden state) provides the initial hidden state for the decoder RNN, which generates the target sentence (output) "he hit me with a pie <END>" one word at a time, taking an argmax at each step and feeding the result back in as the next input, starting from <START>.]
22
Neural Machine Translation (NMT)
• The sequence-to-sequence model is an example of a Conditional Language Model
• Language Model because the decoder is predicting the
next word of the target sentence y
• Conditional because its predictions are also conditioned on the source sentence x
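Concretely, NMT directly calculates $P(y \mid x)$, factored word by word:

$$ P(y \mid x) = P(y_1 \mid x)\, P(y_2 \mid y_1, x)\, \cdots\, P(y_T \mid y_1, \dots, y_{T-1}, x) $$

where each factor is the probability of the next target word, given the target words so far and the source sentence x.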
23
Training a Neural Machine Translation system
$$ J = \frac{1}{T}\sum_{t=1}^{T} J_t = \frac{1}{T}\,\big(J_1 + J_2 + \dots + J_T\big) $$
where $J_t$ is the negative log probability of the correct next target word at step $t$ – e.g., $J_1$ = negative log prob of “he”, …, $J_T$ = negative log prob of <END> (here $T = 7$) – computed from the decoder RNN's predictions given the encoder RNN's output.
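A minimal PyTorch sketch of this training objective with teacher forcing; the vocabulary sizes, dimensions, and names here are illustrative assumptions, not the assignment's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SRC_V, TGT_V, EMB, HID, PAD = 8000, 8000, 256, 512, 0   # illustrative sizes

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_V, EMB, padding_idx=PAD)
        self.tgt_emb = nn.Embedding(TGT_V, EMB, padding_idx=PAD)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_V)

    def forward(self, src, tgt_in):
        # Encode the source; the final hidden state initializes the decoder.
        _, h = self.encoder(self.src_emb(src))             # h: (1, batch, HID)
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_states)                        # (batch, T, TGT_V)

def loss_fn(model, src, tgt):
    # Teacher forcing: feed the gold prefix tgt[:, :-1], predict tgt[:, 1:].
    logits = model(src, tgt[:, :-1])
    # J = average negative log probability of each correct next word.
    return F.cross_entropy(logits.reshape(-1, TGT_V),
                           tgt[:, 1:].reshape(-1), ignore_index=PAD)
```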
Multi-layer RNNs
• RNNs are already “deep” in one dimension (they unroll over many timesteps)
• We can also make them “deep” in another dimension by applying multiple RNNs
– this is a multi-layer RNN.
25
Multi-layer deep encoder-decoder machine translation net
[Sutskever et al. 2014; Luong et al. 2015]
The hidden states from RNN layer i
are the inputs to RNN layer i+1
Encoder: builds up sentence meaning. Decoder: generates the translation, feeding in the last generated word at each step.
[Figure: a multi-layer encoder-decoder network translating the source sentence "Die Proteste waren am Wochenende eskaliert <EOS>" into "The protests escalated over the weekend <EOS>"; each position carries a stack of hidden-state vectors. Conditioning the decoder on the encoder's final states is the bottleneck.]
26
Multi-layer RNNs in practice
• High-performing RNNs are usually multi-layer (but aren’t as deep as convolutional or
feed-forward networks)
• For example: In a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to
4 layers is best for the encoder RNN, and 4 layers is best for the decoder RNN
• Often 2 layers is a lot better than 1, and 3 might be a little better than 2
• Usually, skip-connections/dense-connections are needed to train deeper RNNs
(e.g., 8 layers)
“Massive Exploration of Neural Machine Translation Architectures”, Britz et al., 2017. https://arxiv.org/pdf/1703.03906.pdf
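In PyTorch, for example, stacking layers is just a constructor argument (a sketch with illustrative sizes):

```python
import torch.nn as nn

# 2-layer encoder LSTM: layer i's hidden states are the inputs to layer i+1.
encoder = nn.LSTM(input_size=256, hidden_size=512, num_layers=2, batch_first=True)
# 4-layer decoder LSTM, matching the Britz et al. finding quoted above.
decoder = nn.LSTM(input_size=256, hidden_size=512, num_layers=4, batch_first=True)
```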
27
Greedy decoding
• We saw how to generate (or “decode”) the target sentence by taking argmax on each
step of the decoder – this is greedy decoding (take the most probable word on each step)
[Figure: the decoder producing "he hit me with a pie <END>", taking the argmax at each step.]
28
Problems with greedy decoding
• Greedy decoding has no way to undo decisions!
• Input: il a m’entarté (he hit me with a pie)
• → he ____
• → he hit ____
• → he hit a ____ (whoops! no going back now…)
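A minimal sketch of greedy decoding; `step` is a hypothetical function (not from the lecture) that consumes the last token and the decoder state and returns log-probabilities over the target vocabulary plus the updated state.

```python
def greedy_decode(step, init_state, start_id, end_id, max_len=50):
    tokens, state = [start_id], init_state
    for _ in range(max_len):
        logprobs, state = step(tokens[-1], state)   # condition on the last word
        tokens.append(int(logprobs.argmax()))       # take the single best word
        if tokens[-1] == end_id:                    # stop at <END>
            break
    return tokens[1:]
```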
29
Exhaustive search decoding
• Ideally, we want to find a (length T) translation y that maximizes $P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)$
• We could try computing all possible sequences y, but tracking all $V^t$ partial translations at step t (V = vocab size) is far too expensive!
30
Beam search decoding
• Core idea: On each step of decoder, keep track of the k most probable partial
translations (which we call hypotheses)
• k is the beam size (in practice around 5 to 10)
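A minimal sketch of beam search under the same hypothetical `step` interface as the greedy decoder above (scores are cumulative log probabilities; no length normalization here – that comes up below).

```python
import heapq

def beam_search(step, init_state, start_id, end_id, k=5, max_len=50):
    beams = [(0.0, [start_id], init_state)]   # (cumulative log prob, tokens, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens, state in beams:
            logprobs, new_state = step(tokens[-1], state)      # numpy log-probs
            # For each hypothesis, expand with its top k next words.
            for tok in logprobs.argsort()[-k:]:
                candidates.append((score + float(logprobs[tok]),
                                   tokens + [int(tok)], new_state))
        # Of these (up to) k^2 candidates, keep only the k with highest scores.
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        still_open = []
        for score, tokens, state in beams:
            (finished if tokens[-1] == end_id else still_open).append((score, tokens, state))
        beams = still_open
        if not beams:
            break
    finished = finished or beams
    return max(finished, key=lambda c: c[0])[1]   # highest-scoring hypothesis
```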
31
Beam search decoding: example
Beam size = k = 2. Blue numbers = the score of a hypothesis, i.e. its cumulative log probability (e.g., −0.9 = log P_LM(I | <START>)).
[Figure, built up over several slides: starting from <START>, on each step we calculate the probability distribution of the next word for each of the k hypotheses, find the top k next words for each and calculate their scores, and of these k² hypotheses keep only the k with the highest scores. The candidate words in the example include he / I, hit / struck / was / got, a / me, with / on, a / one, and pie / tart; the highest-scoring complete hypothesis is "he hit me with a pie" with score −4.3, recovered by backtracking through the search tree.]
45
Beam search decoding: finishing up
• We have our list of completed hypotheses.
• How to select the top one with the highest score?
• Problem: hypotheses of different lengths aren't directly comparable – longer hypotheses have lower (more negative) cumulative scores – so we normalize by length and pick the hypothesis with the highest normalized score.
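The length-normalized score used to compare completed hypotheses:

$$ \frac{1}{t} \sum_{i=1}^{t} \log P_{\mathrm{LM}}(y_i \mid y_1, \dots, y_{i-1}, x) $$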
46
Advantages of NMT
Compared to SMT, NMT has many advantages:
• Better performance
• More fluent
• Better use of context
• Better use of phrase similarities
48
How do we evaluate Machine Translation?
BLEU (Bilingual Evaluation Understudy) – you’ll see BLEU in detail in Assignment 4!
Source: “BLEU: a Method for Automatic Evaluation of Machine Translation”, Papineni et al., 2002. http://aclweb.org/anthology/P02-1040
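From the cited Papineni et al. (2002) paper: BLEU combines modified n-gram precisions $p_n$ (typically up to $N = 4$, with uniform weights $w_n = 1/N$) with a brevity penalty for candidate translations shorter than the reference:

$$ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big( \sum_{n=1}^{N} w_n \log p_n \Big), \qquad \mathrm{BP} = \min\!\big(1,\ e^{\,1 - r/c}\big) $$

where $r$ is the reference length and $c$ the candidate (system) translation length.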
MT progress over time
[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal; NMT 2019 FAIR on newstest2019]
[Bar chart: Cased BLEU (y-axis 0–45) by year, 2013–2019, comparing phrase-based SMT, syntax-based SMT, and neural MT.]
Sources: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf & http://matrix.statmt.org/
50
NMT: perhaps the biggest success story of NLP Deep Learning?
Neural Machine Translation went from a fringe research attempt in 2014 to the leading
standard method in 2016
• 2016: Google Translate switches from SMT to NMT – and by 2018 everyone has
• This is amazing!
• SMT systems, built by hundreds of engineers over many years, outperformed by
NMT systems trained by a small group of engineers in a few months
51
So, is Machine Translation solved?
• Nope!
• Many difficulties remain:
• Out-of-vocabulary words
• Domain mismatch between train and test data
• Maintaining context over longer text
• Low-resource language pairs
• Failures to accurately capture sentence meaning
• Pronoun (or zero pronoun) resolution errors
• Morphological agreement errors
53
So is Machine Translation solved?
• Nope!
• NMT picks up biases in training data
Source: https://hackernoon.com/bias-sexist-or-this-is-the-way-it-should-be-ce1f7c8c683c
54
So is Machine Translation solved?
• Nope!
• Uninterpretable systems do strange things
• (But I think this problem has been fixed in Google Translate by 2021?)
• NMT research has pioneered many of the recent innovations of NLP Deep Learning
• But we’ll present in a minute one improvement so integral that it is the new vanilla…
ATTENTION
56
Section 3: Attention
59
Sequence-to-sequence: the bottleneck problem
[Figure: the same seq2seq model. The encoding of the source sentence – the single vector passed from the encoder RNN to the decoder RNN – needs to capture all information about the source sentence "il a m’ entarté" before the target sentence (output) "he hit me with a pie <END>" is generated. Information bottleneck!]
61
Attention
• Attention provides a solution to the bottleneck problem.
• Core idea: on each step of the decoder, use direct connection to the encoder to focus
on a particular part of the source sequence
• First, we will show via diagram (no equations), then we will show with equations
62
Sequence-to-sequence with attention
[Figure, built up over several slides, for the source sentence "il a m’ entarté" (input to the encoder RNN) with the decoder RNN starting from <START>: on each decoder step, take the dot product of the decoder hidden state with each encoder hidden state to get the attention scores; take a softmax over the scores to get the attention distribution; use the attention distribution to take a weighted sum of the encoder hidden states, giving the attention output; concatenate the attention output with the decoder hidden state, then compute the next word ŷ_t as before. Step by step, the decoder produces "he", "hit", "me", "with", "a", "pie".]
74
Attention: in equations
• We have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$
• On timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^h$
• We get the attention scores $e^t$ for this step: $e^t = [\,s_t^\top h_1, \dots, s_t^\top h_N\,] \in \mathbb{R}^N$
• We take softmax to get the attention distribution $\alpha^t$ for this step (this is a probability distribution and sums to 1): $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$
• We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t$: $a_t = \sum_{i=1}^{N} \alpha^t_i\, h_i \in \mathbb{R}^h$
• Finally we concatenate the attention output with the decoder hidden state, $[a_t; s_t] \in \mathbb{R}^{2h}$, and proceed as in the non-attention seq2seq model
75
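A minimal NumPy sketch of these equations (basic dot-product attention for a single decoder step):

```python
import numpy as np

def softmax(x):
    x = x - x.max()                # for numerical stability
    e = np.exp(x)
    return e / e.sum()

def dot_product_attention(s_t, H):
    """s_t: decoder hidden state, shape (h,); H: encoder hidden states, shape (N, h)."""
    e_t = H @ s_t                  # attention scores e^t, shape (N,)
    alpha_t = softmax(e_t)         # attention distribution (sums to 1)
    a_t = alpha_t @ H              # weighted sum of encoder states, shape (h,)
    return a_t, alpha_t

# Example: 5 source positions, hidden size 4.
H = np.random.randn(5, 4)
s = np.random.randn(4)
a, alpha = dot_product_attention(s, H)   # a is concatenated with s downstream
```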
Attention is great
• Attention significantly improves NMT performance
• It’s very useful to allow decoder to focus on certain parts of the source
• Attention solves the bottleneck problem
• Attention allows decoder to look directly at source; bypass bottleneck
• Attention helps with vanishing gradient problem
• Provides shortcut to faraway states
• Attention provides some interpretability
• By inspecting the attention distribution, we can see what the decoder was focusing on
[Figure: attention distribution matrix aligning the source words "il a m’ entarté" with the generated words "he hit me with a pie".]
76
Attention is a general Deep Learning technique
• We’ve seen that attention is a great way to improve the sequence-to-sequence model
for Machine Translation.
• However: You can use attention in many architectures
(not just seq2seq) and many tasks (not just MT)
77
Attention is a general Deep Learning technique
More general definition of attention:
Given a set of vector values, and a vector query, attention is a
technique to compute a weighted sum of the values, dependent on
the query.
Intuition:
• The weighted sum is a selective summary of the information
contained in the values, where the query determines which
values to focus on.
• Attention is a way to obtain a fixed-size representation of an
arbitrary set of representations (the values), dependent on
some other representation (the query).
78
There are several attention variants
• We have some values $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and a query $s \in \mathbb{R}^{d_2}$
• Attention always involves: (1) computing the attention scores $e \in \mathbb{R}^N$ (there are multiple ways to do this), (2) taking softmax to get the attention distribution $\alpha$, and (3) using $\alpha$ to take a weighted sum of the values, thus obtaining the attention output $a$ (sometimes called the context vector)
79
Attention variants
You’ll think about the relative advantages/disadvantages of these in Assignment 4!
• Multiplicative attention: $e_i = s^\top W h_i \in \mathbb{R}$
• Where $W \in \mathbb{R}^{d_2 \times d_1}$ is a weight matrix
• Additive attention: $e_i = v^\top \tanh(W_1 h_i + W_2 s) \in \mathbb{R}$
• Where $W_1 \in \mathbb{R}^{d_3 \times d_1}$ and $W_2 \in \mathbb{R}^{d_3 \times d_2}$ are weight matrices and $v \in \mathbb{R}^{d_3}$
is a weight vector.
• $d_3$ (the attention dimensionality) is a hyperparameter
More information: “Deep Learning for NLP Best Practices”, Ruder, 2017. http://ruder.io/deep-learning-nlp-best-practices/index.html#attention
“Massive Exploration of Neural Machine Translation Architectures”, Britz et al, 2017, https://arxiv.org/pdf/1703.03906.pdf
80
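A minimal NumPy sketch of the two scoring functions, using the dimensions given above; the weight matrices would normally be learned, and here are just arguments for illustration.

```python
import numpy as np

def multiplicative_scores(s, H, W):
    """s: query, shape (d2,); H: values, shape (N, d1); W: (d2, d1). Returns e, shape (N,)."""
    return H @ (W.T @ s)                       # e_i = s^T W h_i

def additive_scores(s, H, W1, W2, v):
    """W1: (d3, d1), W2: (d3, d2), v: (d3,). Returns e, shape (N,)."""
    return np.tanh(H @ W1.T + W2 @ s) @ v      # e_i = v^T tanh(W1 h_i + W2 s)
```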
Attention variants
You’ll think about the relative advantages/disadvantages of these in Assignment 4!
Intuition:
• The weighted sum is a selective summary of the information contained in the values,
where the query determines which values to focus on.
• Attention is a way to obtain a fixed-size representation of an arbitrary set of
representations (the values), dependent on some other representation (the query).
Upshot:
• Attention has become the powerful, flexible, general way to do pointer and memory
manipulation in deep learning models. A new idea from after 2010!
9