The document discusses problems with traditional encoder-decoder recurrent neural networks (RNNs) for tasks like machine translation. It introduces the attention mechanism as a solution. With attention, the encoder outputs are connected to the decoder via a learned attention layer, allowing the decoder to focus on relevant parts of the input at each step. This addresses the problem of having to compress all input information into a single fixed-length vector. The mathematics of calculating attention weights at each decoder step are also described.
Why does the attention mechanism work when training encoder-decoder recurrent neural networks?

Traditional encoder-decoder model for recurrent neural networks

Problems in this approach
We ask the encoder to learn what was spoken (the input data); that information is captured in the activation of the last input cell. Thereafter, using only this activation value and a start sequence (which is usually zero), we want to produce the output sequence. [NOTE: A SINGLE RNN LAYER WITH INPUT AND OUTPUT WILL NOT BE POSSIBLE HERE, BECAUSE THE LENGTH OF THE INPUT SEQUENCE CAN DIFFER FROM THE LENGTH OF THE OUTPUT SEQUENCE!] Thus the entire information from the input is represented by a SINGLE vector: the activation of the last encoder cell.

Problems in this approach (continued)
This approach is passable for very short input sequences, but for long input sequences it performs very poorly: translation relies on reading a complete sentence and compressing all of its information into a fixed-length vector. As we can imagine, a sentence with hundreds of words represented by a single vector will surely lead to information loss, inadequate translation, and so on. (Vanishing gradients are not the only reason; even when the RNNs are replaced by LSTMs, the problem persists. Making a single vector model the entire information of a sequence is too much to ask, and this method does not scale.)
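To make this bottleneck concrete, here is a minimal sketch of the traditional setup in PyTorch (this code is not from the slides; all sizes and variable names are illustrative): the decoder receives the source sequence only through the encoder's final state.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the point is the interface, not the numbers.
emb, hid, T_in, T_out = 32, 64, 50, 8

encoder = nn.LSTM(emb, hid, batch_first=True)
decoder = nn.LSTM(emb, hid, batch_first=True)

src = torch.randn(1, T_in, emb)       # embedded input sequence (batch of 1, 50 time steps)
_, (h_n, c_n) = encoder(src)          # all that survives of the input: one (h, c) pair

# The decoder sees the source ONLY through this single fixed-size state,
# whether the source had 5 words or 500.
start = torch.zeros(1, T_out, emb)    # start sequence / previous-output embeddings
dec_out, _ = decoder(start, (h_n, c_n))
```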
What is the solution? The ATTENTION learning mechanism.

Attention Mechanism
I will explain with the example of a machine translation task. The input is represented as x<t'>, where t' denotes an input time step. The encoder is a bi-directional LSTM with activations called a<t'>. The decoder is a simple LSTM with activations called s<t>. It is important to note that I label the input sequence with time index t' and the output sequence with time index t, which again highlights the observation that, in the general case, the length of the output differs from the length of the input. The encoder and decoder are connected to each other via an attention layer, as described in the next slide. Observe the difference in architecture between the previous and current systems. Due to poor support for mathematical notation in PowerPoint Online, I am forced to draw the structure by hand :P

The overall model: Decoder | Attention | Encoder

The mathematics of the interconnection of encoder to decoder via the attention layer

Mathematics of the attention weights
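The equations on these last slides are drawn by hand and are not captured in the text. The sketch below assumes the standard additive (Bahdanau-style) formulation, which fits the notation used here: at decoder step t, an alignment score e<t,t'> = v_a · tanh(W_a s<t-1> + U_a a<t'>) is computed for every input step t', the attention weights alpha<t,t'> are the softmax of these scores over t', and the context vector c<t> = sum over t' of alpha<t,t'> a<t'> is fed into the decoder LSTM at step t. The parameter names W_a, U_a, v_a and all sizes are illustrative.

```python
import torch

# Illustrative sizes: source length, encoder (BiLSTM) activation size, decoder size, score size
T_in, n_a, n_s, n_e = 7, 2 * 64, 128, 32

# Attention parameters (hypothetical names); in the full model they are learned jointly with the LSTMs.
W_a = torch.randn(n_s, n_e)
U_a = torch.randn(n_a, n_e)
v_a = torch.randn(n_e)

a = torch.randn(T_in, n_a)   # encoder activations a<t'>, one row per input step t'
s_prev = torch.randn(n_s)    # previous decoder activation s<t-1>

# Alignment scores e<t, t'> = v_a . tanh(W_a s<t-1> + U_a a<t'>), one score per input step
e = torch.tanh(s_prev @ W_a + a @ U_a) @ v_a   # shape (T_in,)

# Attention weights: softmax over the input positions t', so they are positive and sum to 1
alpha = torch.softmax(e, dim=0)

# Context vector c<t> = sum_t' alpha<t, t'> * a<t'>, fed to the decoder at output step t
context = alpha @ a                            # shape (n_a,)
```

Repeating this step for every output position t gives a T_out x T_in matrix of weights alpha<t, t'>, which is the matrix visualized under "Advantage of attention" below. Because every quantity here is a differentiable function of the model's parameters, the attention layer trains jointly with the encoder and decoder by ordinary backpropagation.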
Implement the whole model as discussed in the last few slides and train it with gradient descent (backpropagation). [This will be straightforward to do, because the gradients are easy to find: all variables have been neatly expressed as differentiable mathematical functions of the remaining variables of the model.] Finally, while working out the output for a particular time step, the network pays attention to the appropriate parts of the input, as learnt by the attention layer.

Advantage of attention
A big advantage of attention is that it gives us the ability to interpret and visualize what the model is doing. For example, by visualizing the attention weight matrix while a sentence is translated, we can understand how the model is translating (as shown for the task of machine translation).
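As an illustration, here is a small sketch of how such an attention weight matrix could be plotted. The tokens and weights are made up, and matplotlib is assumed purely for display: each row is an output step t, each column an input step t', and the weights in each row sum to 1.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical source/target tokens and random attention weights, purely for illustration.
src_tokens = ["je", "suis", "étudiant", "<eos>"]
tgt_tokens = ["i", "am", "a", "student", "<eos>"]
alpha = np.random.dirichlet(np.ones(len(src_tokens)), size=len(tgt_tokens))  # each row sums to 1

plt.imshow(alpha, cmap="gray")
plt.xticks(range(len(src_tokens)), src_tokens, rotation=45)
plt.yticks(range(len(tgt_tokens)), tgt_tokens)
plt.xlabel("input (source) time step t'")
plt.ylabel("output (target) time step t")
plt.colorbar(label="attention weight")
plt.show()
```

In a trained translation model, the high weights typically trace out the alignment between source and target words, which is what makes the matrix readable as an explanation of how the model is translating.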
Places where attention is used
Image captioning (this idea was first introduced in this domain)