XCS224N Module5 Slides
Christopher Manning
Lecture 7: Machine Translation, Sequence-to-Sequence and Attention
Lecture Plan
Today we will:
1. Introduce a new task: Machine Translation [15 mins], which is a major use-case of
2. A new neural architecture: sequence-to-sequence [45 mins], which is improved by
3. A new neural technique: attention [20 mins]
• Announcements
• Assignment 3 is due today – I hope your dependency parsers are parsing text!
• Assignment 4 out today – covered in this lecture, you get 9 days for it (!), due Thu
• Get started early! It’s bigger and harder than the previous assignments 😰
• Thursday’s lecture about choosing final projects
2
Section 1: Pre-Neural Machine Translation
3
Machine Translation
Machine Translation (MT) is the task of translating a sentence x from one language (the
source language) to a sentence y in another language (the target language).
[Example: a French source sentence x and its English translation y – Rousseau]
4
The early history of MT: 1950s
• Machine translation research began in the early 1950s on machines less
powerful than high school calculators
• Foundational work on automata, formal languages, probabilities, and
information theory
• MT heavily funded by military, but basically just simple rule-based
systems doing word substitution
• Human language is more complicated than that, and varies more across
languages!
• Little understanding of natural language syntax, semantics, pragmatics
• Problem soon appeared intractable
1 minute video showing 1954 MT:
https://youtu.be/K-HfpsHPmvw
1990s-2010s: Statistical Machine Translation
• Core idea: Learn a probabilistic model from data
• Suppose we’re translating French → English.
• We want to find best English sentence y, given French sentence x
• Use Bayes Rule to break this down into two components to be learned
separately:
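Written out, this is the standard noisy-channel decomposition the slide refers to:

$$ y^{*} = \operatorname*{argmax}_{y} P(y \mid x) \;=\; \operatorname*{argmax}_{y}\; \underbrace{P(x \mid y)}_{\text{Translation Model}}\;\underbrace{P(y)}_{\text{Language Model}} $$

where the translation model $P(x \mid y)$ is learned from parallel data and the language model $P(y)$ from monolingual target-language data.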
[Image: the Rosetta Stone, whose Demotic and Ancient Greek inscriptions form an early parallel text]
7
Learning alignment for SMT
• Question: How to learn the translation model P(x|y) from the parallel corpus?
8
What is alignment?
Alignment is the correspondence between particular words in the translated sentence pair.
Examples from: “The Mathematics of Statistical Machine Translation: Parameter Estimation”, Brown et al., 1993. http://www.aclweb.org/anthology/J93-2003
Alignment is complex
Alignment can be many-to-one
Alignment is complex
Alignment can be one-to-many
Alignment is complex
Alignment can be many-to-many (phrase-level)
Learning alignment for SMT
• We learn P(x, a | y) as a combination of many factors, including:
• Probability of particular words aligning (also depends on position in the sentence)
• Probability of particular words having a particular fertility (number of corresponding
words)
• etc.
• Alignments a are latent variables: They aren’t explicitly specified in the data!
• Requires the use of special learning algorithms (like Expectation-Maximization) for
learning the parameters of distributions with latent variables – see the sketch below
• In older days, we used to do a lot of that in CS 224N, but now see CS 228!
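For intuition, here is a toy sketch of EM for IBM Model 1, one classic choice of translation model with latent word alignments; the tiny corpus and variable names are made up for illustration.

```python
from collections import defaultdict

# Toy (English, French) parallel corpus, purely for illustration.
corpus = [("the house".split(), "la maison".split()),
          ("the book".split(), "le livre".split()),
          ("a book".split(), "un livre".split())]

NULL = "<null>"                       # lets a French word align to nothing
t = defaultdict(lambda: 1.0)          # t[(f, e)] = t(f|e); any uniform start works

for _ in range(10):                   # EM iterations
    count = defaultdict(float)        # expected count of (f, e) alignments
    total = defaultdict(float)        # expected count of e being aligned to
    for eng, fra in corpus:
        eng = [NULL] + eng
        for f in fra:
            z = sum(t[(f, e)] for e in eng)          # normalizer for this f
            for e in eng:
                p = t[(f, e)] / z                    # E-step: posterior that f aligns to e
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():                  # M-step: re-estimate t(f|e)
        t[(f, e)] = c / total[e]

print(round(t[("maison", "house")], 3))   # grows toward 1 as alignments sharpen
```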
13
Decoding for SMT
$$ \operatorname*{argmax}_{y}\; \underbrace{P(x \mid y)}_{\text{Translation Model}}\; \underbrace{P(y)}_{\text{Language Model}} $$
Question: How to compute this argmax?
14
Decoding for SMT: Translation Options
[Figure: word- and phrase-level translation options for the German sentence "er geht ja nicht nach hause" – e.g., "he" / "it" for "er", "is" / "goes" for "geht", "not" / "does not" / "is not" for "(ja) nicht", "house" / "home" / "return home" for "(nach) hause" – and the search lattice used in "Decoding: Find Best Path" through these options (figure labeled "Chapter 6: Decoding").]
16
Section 2: Neural Machine Translation
17
2014
(dramatic reenactment)
18
2014: Neural Machine Translation sweeps over MT research
(dramatic reenactment)
19
What is Neural Machine Translation?
• Neural Machine Translation (NMT) is a way to do Machine Translation with a single
end-to-end neural network
20
Neural Machine Translation (NMT)
The sequence-to-sequence model
[Figure: the encoder RNN reads the source sentence "il a m’ entarté"; the encoding of the source sentence (the encoder's final hidden state) provides the initial hidden state for the decoder RNN, which generates the target sentence (output) "he hit me with a pie <END>" one word at a time, taking an argmax at each step and feeding the result back in as the next input, starting from <START>.]
22
Neural Machine Translation (NMT)
• The sequence-to-sequence model is an example of a Conditional Language Model
• Language Model because the decoder is predicting the
next word of the target sentence y
• Conditional because its predictions are also conditioned on the source sentence x
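Concretely, NMT directly calculates $P(y \mid x)$, factored word by word:

$$ P(y \mid x) = P(y_1 \mid x)\, P(y_2 \mid y_1, x)\, \cdots\, P(y_T \mid y_1, \dots, y_{T-1}, x) $$

where each factor is the probability of the next target word, given the target words so far and the source sentence x.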
23
Training a Neural Machine Translation system
$$ J = \frac{1}{T}\sum_{t=1}^{T} J_t = \frac{1}{T}\,\big(J_1 + J_2 + \dots + J_T\big) $$
where $J_t$ is the negative log probability of the correct next target word at step $t$ – e.g., $J_1$ = negative log prob of “he”, …, $J_T$ = negative log prob of <END> (here $T = 7$) – computed from the decoder RNN's predictions given the encoder RNN's output.
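A minimal PyTorch sketch of this training objective with teacher forcing; the vocabulary sizes, dimensions, and names here are illustrative assumptions, not the assignment's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SRC_V, TGT_V, EMB, HID, PAD = 8000, 8000, 256, 512, 0   # illustrative sizes

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_V, EMB, padding_idx=PAD)
        self.tgt_emb = nn.Embedding(TGT_V, EMB, padding_idx=PAD)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_V)

    def forward(self, src, tgt_in):
        # Encode the source; the final hidden state initializes the decoder.
        _, h = self.encoder(self.src_emb(src))             # h: (1, batch, HID)
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_states)                        # (batch, T, TGT_V)

def loss_fn(model, src, tgt):
    # Teacher forcing: feed the gold prefix tgt[:, :-1], predict tgt[:, 1:].
    logits = model(src, tgt[:, :-1])
    # J = average negative log probability of each correct next word.
    return F.cross_entropy(logits.reshape(-1, TGT_V),
                           tgt[:, 1:].reshape(-1), ignore_index=PAD)
```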
Multi-layer RNNs
• RNNs are already “deep” in one dimension (they unroll over many timesteps)
• We can also make them “deep” in another dimension by applying multiple RNNs
– this is a multi-layer RNN.
25
Multi-layer deep encoder-decoder machine translation net
[Sutskever et al. 2014; Luong et al. 2015]
The hidden states from RNN layer i
are the inputs to RNN layer i+1
Encoder: builds up sentence meaning. Decoder: generates the translation, feeding in the last generated word at each step.
[Figure: a multi-layer encoder-decoder network translating the source sentence "Die Proteste waren am Wochenende eskaliert <EOS>" into "The protests escalated over the weekend <EOS>"; each position carries a stack of hidden-state vectors. Conditioning the decoder on the encoder's final states is the bottleneck.]
26
Multi-layer RNNs in practice
• High-performing RNNs are usually multi-layer (but aren’t as deep as convolutional or
feed-forward networks)
• For example: In a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to
4 layers is best for the encoder RNN, and 4 layers is best for the decoder RNN
• Often 2 layers is a lot better than 1, and 3 might be a little better than 2
• Usually, skip-connections/dense-connections are needed to train deeper RNNs
(e.g., 8 layers)
“Massive Exploration of Neural Machine Translation Architectures”, Britz et al., 2017. https://arxiv.org/pdf/1703.03906.pdf
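In PyTorch, for example, stacking layers is just a constructor argument (a sketch with illustrative sizes):

```python
import torch.nn as nn

# 2-layer encoder LSTM: layer i's hidden states are the inputs to layer i+1.
encoder = nn.LSTM(input_size=256, hidden_size=512, num_layers=2, batch_first=True)
# 4-layer decoder LSTM, matching the Britz et al. finding quoted above.
decoder = nn.LSTM(input_size=256, hidden_size=512, num_layers=4, batch_first=True)
```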
27
Greedy decoding
• We saw how to generate (or “decode”) the target sentence by taking argmax on each
step of the decoder – this is greedy decoding (take the most probable word on each step)
[Figure: the decoder producing "he hit me with a pie <END>", taking the argmax at each step.]
28
Problems with greedy decoding
• Greedy decoding has no way to undo decisions!
• Input: il a m’entarté (he hit me with a pie)
• → he ____
• → he hit ____
• → he hit a ____ (whoops! no going back now…)
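A minimal sketch of greedy decoding; `step` is a hypothetical function (not from the lecture) that consumes the last token and the decoder state and returns log-probabilities over the target vocabulary plus the updated state.

```python
def greedy_decode(step, init_state, start_id, end_id, max_len=50):
    tokens, state = [start_id], init_state
    for _ in range(max_len):
        logprobs, state = step(tokens[-1], state)   # condition on the last word
        tokens.append(int(logprobs.argmax()))       # take the single best word
        if tokens[-1] == end_id:                    # stop at <END>
            break
    return tokens[1:]
```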
29
Exhaustive search decoding
• Ideally, we want to find a (length T) translation y that maximizes $P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)$
• We could try computing all possible sequences y, but tracking all $V^t$ partial translations at step t (V = vocab size) is far too expensive!
30
Beam search decoding
• Core idea: On each step of decoder, keep track of the k most probable partial
translations (which we call hypotheses)
• k is the beam size (in practice around 5 to 10)
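A minimal sketch of beam search under the same hypothetical `step` interface as the greedy decoder above (scores are cumulative log probabilities; no length normalization here – that comes up below).

```python
import heapq

def beam_search(step, init_state, start_id, end_id, k=5, max_len=50):
    beams = [(0.0, [start_id], init_state)]   # (cumulative log prob, tokens, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens, state in beams:
            logprobs, new_state = step(tokens[-1], state)      # numpy log-probs
            # For each hypothesis, expand with its top k next words.
            for tok in logprobs.argsort()[-k:]:
                candidates.append((score + float(logprobs[tok]),
                                   tokens + [int(tok)], new_state))
        # Of these (up to) k^2 candidates, keep only the k with highest scores.
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        still_open = []
        for score, tokens, state in beams:
            (finished if tokens[-1] == end_id else still_open).append((score, tokens, state))
        beams = still_open
        if not beams:
            break
    finished = finished or beams
    return max(finished, key=lambda c: c[0])[1]   # highest-scoring hypothesis
```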
31
Beam search decoding: example
Beam size = k = 2. Blue numbers = the score of a hypothesis, i.e. its cumulative log probability (e.g., −0.9 = log P_LM(I | <START>)).
[Figure, built up over several slides: starting from <START>, on each step we calculate the probability distribution of the next word for each of the k hypotheses, find the top k next words for each and calculate their scores, and of these k² hypotheses keep only the k with the highest scores. The candidate words in the example include he / I, hit / struck / was / got, a / me, with / on, a / one, and pie / tart; the highest-scoring complete hypothesis is "he hit me with a pie" with score −4.3, recovered by backtracking through the search tree.]
45
Beam search decoding: finishing up
• We have our list of completed hypotheses.
• How to select the top one with the highest score?
• Problem: hypotheses of different lengths aren't directly comparable – longer hypotheses have lower (more negative) cumulative scores – so we normalize by length and pick the hypothesis with the highest normalized score.
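The length-normalized score used to compare completed hypotheses:

$$ \frac{1}{t} \sum_{i=1}^{t} \log P_{\mathrm{LM}}(y_i \mid y_1, \dots, y_{i-1}, x) $$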
46
Advantages of NMT
Compared to SMT, NMT has many advantages:
• Better performance
• More fluent
• Better use of context
• Better use of phrase similarities
48
How do we evaluate Machine Translation?
BLEU (Bilingual Evaluation Understudy) – you’ll see BLEU in detail in Assignment 4!
Source: “BLEU: a Method for Automatic Evaluation of Machine Translation”, Papineni et al., 2002. http://aclweb.org/anthology/P02-1040
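From the cited Papineni et al. (2002) paper: BLEU combines modified n-gram precisions $p_n$ (typically up to $N = 4$, with uniform weights $w_n = 1/N$) with a brevity penalty for candidate translations shorter than the reference:

$$ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big( \sum_{n=1}^{N} w_n \log p_n \Big), \qquad \mathrm{BP} = \min\!\big(1,\ e^{\,1 - r/c}\big) $$

where $r$ is the reference length and $c$ the candidate (system) translation length.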
MT progress over time
[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal; NMT 2019 FAIR on newstest2019]
[Bar chart: Cased BLEU (y-axis 0–45) by year, 2013–2019, comparing phrase-based SMT, syntax-based SMT, and neural MT.]
Sources: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf & http://matrix.statmt.org/
50
NMT: perhaps the biggest success story of NLP Deep Learning?
Neural Machine Translation went from a fringe research attempt in 2014 to the leading
standard method in 2016
• 2016: Google Translate switches from SMT to NMT – and by 2018 everyone has
• This is amazing!
• SMT systems, built by hundreds of engineers over many years, outperformed by
NMT systems trained by a small group of engineers in a few months
51
So, is Machine Translation solved?
• Nope!
• Many difficulties remain:
• Out-of-vocabulary words
• Domain mismatch between train and test data
• Maintaining context over longer text
• Low-resource language pairs
• Failures to accurately capture sentence meaning
• Pronoun (or zero pronoun) resolution errors
• Morphological agreement errors
53
So is Machine Translation solved?
• Nope!
• NMT picks up biases in training data
Source: https://hackernoon.com/bias-sexist-or-this-is-the-way-it-should-be-ce1f7c8c683c
54
So is Machine Translation solved?
• Nope!
• Uninterpretable systems do strange things
• (But I think this problem has been fixed in Google Translate by 2021?)
• NMT research has pioneered many of the recent innovations of NLP Deep Learning
• But we’ll present in a minute one improvement so integral that it is the new vanilla…
ATTENTION
56
Section 3: Attention
59
Sequence-to-sequence: the bottleneck problem
[Figure: the same seq2seq model. The encoding of the source sentence – the single vector passed from the encoder RNN to the decoder RNN – needs to capture all information about the source sentence "il a m’ entarté" before the target sentence (output) "he hit me with a pie <END>" is generated. Information bottleneck!]
61
Attention
• Attention provides a solution to the bottleneck problem.
• Core idea: on each step of the decoder, use direct connection to the encoder to focus
on a particular part of the source sequence
• First, we will show via diagram (no equations), then we will show with equations
62
Sequence-to-sequence with attention
[Figure, built up over several slides, for the source sentence "il a m’ entarté" (input to the encoder RNN) with the decoder RNN starting from <START>: on each decoder step, take the dot product of the decoder hidden state with each encoder hidden state to get the attention scores; take a softmax over the scores to get the attention distribution; use the attention distribution to take a weighted sum of the encoder hidden states, giving the attention output; concatenate the attention output with the decoder hidden state, then compute the next word ŷ_t as before. Step by step, the decoder produces "he", "hit", "me", "with", "a", "pie".]
74
Attention: in equations
• We have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$
• On timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^h$
• We get the attention scores $e^t$ for this step: $e^t = [\,s_t^\top h_1, \dots, s_t^\top h_N\,] \in \mathbb{R}^N$
• We take softmax to get the attention distribution $\alpha^t$ for this step (this is a probability distribution and sums to 1): $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$
• We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t$: $a_t = \sum_{i=1}^{N} \alpha^t_i\, h_i \in \mathbb{R}^h$
• Finally we concatenate the attention output with the decoder hidden state, $[a_t; s_t] \in \mathbb{R}^{2h}$, and proceed as in the non-attention seq2seq model
75
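A minimal NumPy sketch of these equations (basic dot-product attention for a single decoder step):

```python
import numpy as np

def softmax(x):
    x = x - x.max()                # for numerical stability
    e = np.exp(x)
    return e / e.sum()

def dot_product_attention(s_t, H):
    """s_t: decoder hidden state, shape (h,); H: encoder hidden states, shape (N, h)."""
    e_t = H @ s_t                  # attention scores e^t, shape (N,)
    alpha_t = softmax(e_t)         # attention distribution (sums to 1)
    a_t = alpha_t @ H              # weighted sum of encoder states, shape (h,)
    return a_t, alpha_t

# Example: 5 source positions, hidden size 4.
H = np.random.randn(5, 4)
s = np.random.randn(4)
a, alpha = dot_product_attention(s, H)   # a is concatenated with s downstream
```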
Attention is great
• Attention significantly improves NMT performance
• It’s very useful to allow decoder to focus on certain parts of the source
• Attention solves the bottleneck problem
• Attention allows decoder to look directly at source; bypass bottleneck
• Attention helps with vanishing gradient problem
• Provides shortcut to faraway states
• Attention provides some interpretability
• By inspecting the attention distribution, we can see what the decoder was focusing on
[Figure: attention distribution matrix aligning the source words "il a m’ entarté" with the generated words "he hit me with a pie".]
76
Attention is a general Deep Learning technique
• We’ve seen that attention is a great way to improve the sequence-to-sequence model
for Machine Translation.
• However: You can use attention in many architectures
(not just seq2seq) and many tasks (not just MT)
77
Attention is a general Deep Learning technique
More general definition of attention:
Given a set of vector values, and a vector query, attention is a
technique to compute a weighted sum of the values, dependent on
the query.
Intuition:
• The weighted sum is a selective summary of the information
contained in the values, where the query determines which
values to focus on.
• Attention is a way to obtain a fixed-size representation of an
arbitrary set of representations (the values), dependent on
some other representation (the query).
78
There are several attention variants
• We have some values $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and a query $s \in \mathbb{R}^{d_2}$
• Attention always involves: (1) computing the attention scores $e \in \mathbb{R}^N$ (there are multiple ways to do this), (2) taking softmax to get the attention distribution $\alpha$, and (3) using $\alpha$ to take a weighted sum of the values, thus obtaining the attention output $a$ (sometimes called the context vector)
79
Attention variants
You’ll think about the relative advantages/disadvantages of these in Assignment 4!
• Multiplicative attention: $e_i = s^\top W h_i \in \mathbb{R}$
• Where $W \in \mathbb{R}^{d_2 \times d_1}$ is a weight matrix
• Additive attention: $e_i = v^\top \tanh(W_1 h_i + W_2 s) \in \mathbb{R}$
• Where $W_1 \in \mathbb{R}^{d_3 \times d_1}$ and $W_2 \in \mathbb{R}^{d_3 \times d_2}$ are weight matrices and $v \in \mathbb{R}^{d_3}$
is a weight vector.
• $d_3$ (the attention dimensionality) is a hyperparameter
More information: “Deep Learning for NLP Best Practices”, Ruder, 2017. http://ruder.io/deep-learning-nlp-best-practices/index.html#attention
“Massive Exploration of Neural Machine Translation Architectures”, Britz et al, 2017, https://arxiv.org/pdf/1703.03906.pdf
80
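A minimal NumPy sketch of the two scoring functions, using the dimensions given above; the weight matrices would normally be learned, and here are just arguments for illustration.

```python
import numpy as np

def multiplicative_scores(s, H, W):
    """s: query, shape (d2,); H: values, shape (N, d1); W: (d2, d1). Returns e, shape (N,)."""
    return H @ (W.T @ s)                       # e_i = s^T W h_i

def additive_scores(s, H, W1, W2, v):
    """W1: (d3, d1), W2: (d3, d2), v: (d3,). Returns e, shape (N,)."""
    return np.tanh(H @ W1.T + W2 @ s) @ v      # e_i = v^T tanh(W1 h_i + W2 s)
```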
Attention variants
You’ll think about the relative advantages/disadvantages of these in Assignment 4!
Intuition:
• The weighted sum is a selective summary of the information contained in the values,
where the query determines which values to focus on.
• Attention is a way to obtain a fixed-size representation of an arbitrary set of
representations (the values), dependent on some other representation (the query).
Upshot:
• Attention has become the powerful, flexible, general way to do pointer and memory
manipulation in deep learning models. A new idea from after 2010!
9