
Attention

The document discusses problems with traditional encoder-decoder recurrent neural networks (RNNs) for tasks like machine translation. It introduces the attention mechanism as a solution. With attention, the encoder outputs are connected to the decoder via a learned attention layer, allowing the decoder to focus on relevant parts of the input at each step. This addresses the problem of having to compress all input information into a single fixed-length vector. The mathematics of calculating attention weights at each decoder step are also described.


Why does the attention mechanism work while training Encoder-Decoder Recurrent Neural Networks?

Traditional encoder-decoder model for Recurrent Neural Networks
Problems in this approach
 The encoder is asked to learn everything about the input (e.g. what was spoken); all of that is captured in the activation of the last input cell. Thereafter, using only this activation value and a start token (which is usually zero), we want to produce the output sequence. [NOTE: A SINGLE RNN LAYER WITH INPUT AND OUTPUT WILL NOT BE POSSIBLE HERE BECAUSE THE LENGTH OF THE INPUT SEQUENCE CAN BE DIFFERENT FROM THE LENGTH OF THE OUTPUT SEQUENCE!] Thus the entire information from the input is represented by a SINGLE vector: the activation of the last encoder cell.
Problems in this approach (continued)
 This approach is passable for very short input sequences, but for long input sequences it performs very poorly: translation then relies on reading a complete sentence and compressing all of its information into a single fixed-length vector. As we can imagine, a sentence with hundreds of words squeezed into one such vector will surely suffer information loss, inadequate translation, etc. (Vanishing gradients are not the only reason: even if the RNNs are replaced by LSTMs, the problem persists. Making a single vector model the entire information of a sequence is too much of an ask, and this method does not scale.)
 What is the solution? -> the ATTENTION learning mechanism
Attention Mechanism
 I will explain with the example of a Machine Translation task.
 The input is represented as x<t'>, where t' indexes an input time step.
 The encoder is a bi-directional LSTM with activations called a<t'>.
 The decoder is a simple LSTM with activations called s<t>. It is important to note that I represent the input sequence with the time label t' and the output sequence with the time label t (which again highlights the important observation that, in the general case, the length of the output differs from the length of the input).
 The encoder and decoder are connected to each other via an attention layer, as described in the next slide. Observe the difference in architecture between the previous and current systems.
 Due to the poor support for mathematical notation in PowerPoint Online, I am forced to draw the structure by hand :P
The overall model :
Decoder
|
Attention
|
Encoder
The mathematics of the interconnection of encoder to decoder via the attention layer
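
The equations on this slide were drawn by hand and are not reproduced in the extracted text, so the following is a sketch of the standard Bahdanau-style interconnection written in the notation introduced earlier; the context vector c<t> and the input length T_x are labels I am adding here, not taken from the slides.

% Encoder activation at input step t': concatenation of the forward and backward
% activations of the bi-directional LSTM
a^{\langle t' \rangle} = \big[\, \overrightarrow{a}^{\langle t' \rangle} ;\ \overleftarrow{a}^{\langle t' \rangle} \,\big]

% Context vector fed to the decoder at output step t: a weighted sum of all
% encoder activations, weighted by the attention weights \alpha^{\langle t, t' \rangle}
c^{\langle t \rangle} = \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle}\, a^{\langle t' \rangle}

% Decoder update: the LSTM consumes its previous activation, the previous output
% word and the context vector
s^{\langle t \rangle} = \mathrm{LSTM}\!\big( s^{\langle t-1 \rangle},\ \big[\, y^{\langle t-1 \rangle} ;\ c^{\langle t \rangle} \,\big] \big)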
Mathematics of the attention weights
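
The derivation on this slide is likewise hand-drawn, so here is a sketch of the usual formulation in the same notation: the attention weights alpha<t,t'> are a softmax over alignment scores e<t,t'>, which are produced by a small learned feed-forward network applied to the previous decoder activation and each encoder activation. The parameters W_s, W_a and v are names I am introducing for that network; they are not from the slides.

% Alignment score: how relevant input step t' is for producing output step t
e^{\langle t, t' \rangle} = v^{\top} \tanh\!\big( W_s\, s^{\langle t-1 \rangle} + W_a\, a^{\langle t' \rangle} \big)

% Attention weights: softmax over input positions, so that for every output
% step t the weights \alpha^{\langle t, t' \rangle} are positive and sum to 1
\alpha^{\langle t, t' \rangle} = \frac{\exp\!\big( e^{\langle t, t' \rangle} \big)}{\sum_{\tau=1}^{T_x} \exp\!\big( e^{\langle t, \tau \rangle} \big)}

Because everything above (tanh, softmax, weighted sums) is differentiable, the whole attention layer can be trained jointly with the encoder and decoder by backpropagation, which is exactly what the next slide relies on.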
 Implement the whole model as discussed in the last few slides and train it with gradient descent (backprop). [This is straightforward because the gradients are easy to find: all variables have been neatly expressed as differentiable mathematical functions of the remaining variables of the model.]
 Finally, while producing the output for a particular time step, the network pays attention to the appropriate parts of the input, as learnt by the attention layer.
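
As a concrete illustration of why the gradients are easy to compute, here is a minimal NumPy sketch of one decoder step of the additive attention described above. This is my own sketch, not code from the slides; the function name attention_context, the parameter names W_a, W_s, v and all the sizes are assumptions.

import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_context(A, s_prev, W_a, W_s, v):
    """One decoder step: return the context vector c<t> and the weights alpha<t,t'>."""
    # Alignment scores e<t,t'> = v^T tanh(W_s s<t-1> + W_a a<t'>), one per input step t'
    scores = np.tanh(A @ W_a.T + s_prev @ W_s.T) @ v   # shape (Tx,)
    alphas = softmax(scores)                           # attention weights, sum to 1
    context = alphas @ A                               # weighted sum of encoder activations
    return context, alphas

# Toy usage with random numbers, just to show the shapes involved.
Tx, n_a, n_s, n_e = 7, 4, 5, 10           # input length, encoder/decoder/alignment sizes
rng = np.random.default_rng(0)
A      = rng.normal(size=(Tx, 2 * n_a))   # bi-directional encoder activations a<t'>
s_prev = rng.normal(size=(n_s,))          # previous decoder activation s<t-1>
W_a    = rng.normal(size=(n_e, 2 * n_a))  # parameters of the small alignment network
W_s    = rng.normal(size=(n_e, n_s))
v      = rng.normal(size=(n_e,))

c, alphas = attention_context(A, s_prev, W_a, W_s, v)
print(c.shape, alphas.shape, round(alphas.sum(), 3))   # (8,) (7,) 1.0

In a full model this step is repeated for every output time step t, the resulting context is concatenated with the previous output word and fed to the decoder LSTM, and all parameters are updated together by backpropagation.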
Advantage of attention
 A big advantage of attention is that it gives us the ability to interpret and visualize what the model is doing. For example, by visualizing the attention weight matrix when a sentence is translated, we can understand how the model is translating (as shown for the task of machine translation).
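
As an illustration of such a visualization (my own sketch, not from the slides, using made-up sentences and random stand-in weights), the matrix of alpha<t,t'> can be plotted directly with matplotlib; bright cells show which input words the model attends to while emitting each output word.

import numpy as np
import matplotlib.pyplot as plt

src = ["the", "cat", "sat", "on", "the", "mat"]               # input sentence (made up)
tgt = ["le", "chat", "s'est", "assis", "sur", "le", "tapis"]  # output sentence (made up)

# Stand-in attention weights: each row sums to 1, as the real alpha<t,t'> would.
alphas = np.random.default_rng(0).dirichlet(np.ones(len(src)), size=len(tgt))

fig, ax = plt.subplots()
ax.imshow(alphas, cmap="gray")          # rows: output words t, columns: input words t'
ax.set_xticks(range(len(src)))
ax.set_xticklabels(src, rotation=45)
ax.set_yticks(range(len(tgt)))
ax.set_yticklabels(tgt)
ax.set_xlabel("input words t'")
ax.set_ylabel("output words t")
plt.tight_layout()
plt.show()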
Places where attention is used
 Image captioning (this idea was first introduced in this domain)
 Machine translation
 Speech recognition
Image Captioning Example:
