
Recurrent Neural Networks (RNNs):

A Gentle Introduction and Overview

Robin M. Schmidt
Department of Computer Science
Eberhard-Karls-University Tübingen
Tübingen, Germany
rob.schmidt@student.uni-tuebingen.de

arXiv:1912.05911v1 [cs.LG] 23 Nov 2019

Abstract

State-of-the-art solutions in the areas of “Language Modelling & Generating Text”,


“Speech Recognition”, “Generating Image Descriptions” or “Video Tagging” have
been using Recurrent Neural Networks as the foundation for their approaches.
Understanding the underlying concepts is therefore of tremendous importance if
we want to keep up with recent or upcoming publications in those areas. In this
work, we give a short overview of some of the most important concepts in the
realm of Recurrent Neural Networks, which enables readers to easily understand
fundamentals such as, but not limited to, “Backpropagation Through Time” or
“Long Short-Term Memory Units”, as well as some of the more recent advances like
the “Attention Mechanism” or “Pointer Networks”. We also give recommendations
for further reading on more complex topics where necessary.

1 Introduction & Notation


Recurrent Neural Networks (RNNs) are a type of neural network architecture which is mainly used
to detect patterns in a sequence of data. Such data can be handwriting, genomes, text or numerical
time series which are often produced in industry settings (e.g. stock markets or sensors) [7, 12].
However, they are also applicable to images if these are decomposed into a series of
patches and treated as a sequence [12]. On a higher level, RNNs find applications in Language
Modelling & Generating Text, Speech Recognition, Generating Image Descriptions or Video Tagging.
What differentiates Recurrent Neural Networks from Feedforward Neural Networks, also known as Multi-Layer Perceptrons (MLPs), is how information gets passed through the network. While Feedforward Networks pass information through the network without cycles, the RNN has cycles and transmits information back into itself. This enables it to extend the functionality of Feedforward Networks by also taking into account previous inputs X_{0:t−1} and not only the current input X_t. This difference is visualised on a high level in Figure 1. Note that here the option of having multiple hidden layers is aggregated into one Hidden Layer block H. This block can obviously be extended to multiple hidden layers.
We can describe this process of passing information from the previous iteration to the hidden layer with the mathematical notation proposed in [24]. For that, we denote the hidden state and the input at time step t respectively as H_t ∈ R^{n×h} and X_t ∈ R^{n×d}, where n is the number of samples, d is the number of inputs of each sample and h is the number of hidden units. Further, we use a weight matrix W_{xh} ∈ R^{d×h}, a hidden-state-to-hidden-state matrix W_{hh} ∈ R^{h×h} and a bias parameter b_h ∈ R^{1×h}. Lastly, all this information is passed to an activation function φ, which is usually a logistic sigmoid or tanh function, to keep the gradients well-behaved for backpropagation. Putting all these notations together yields Equation 1 for the hidden variable and Equation 2 for the output variable.
H_t = φ_h(X_t W_{xh} + H_{t−1} W_{hh} + b_h)    (1)

Figure 1: Visualisation of the differences between Feedforward NNs and Recurrent NNs

O_t = φ_o(H_t W_{ho} + b_o)    (2)


Since H_t recursively includes H_{t−1} and this process occurs at every time step, the RNN includes traces of all hidden states that preceded H_{t−1}, as well as H_{t−1} itself.
If we compare this notation for RNNs with the similar notation for Feedforward Neural Networks, we can clearly see the difference we described earlier. Equation 3 shows the computation of the hidden variable while Equation 4 shows the output variable.
H = φ_h(X W_{xh} + b_h)    (3)

O = φ_o(H W_{ho} + b_o)    (4)
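To make the difference between Equations 1–2 and Equations 3–4 concrete, the following is a minimal NumPy sketch (not from the original paper; shapes, initialisation and the tanh output activation are illustrative assumptions) of a single feedforward layer next to a recurrent layer unrolled over a sequence.

```python
import numpy as np

n, d, h, o, T = 4, 3, 5, 2, 10   # samples, inputs, hidden units, outputs, time steps
rng = np.random.default_rng(0)

# Parameters (shared across all time steps in the recurrent case)
W_xh, W_hh = rng.normal(size=(d, h)), rng.normal(size=(h, h))
W_ho = rng.normal(size=(h, o))
b_h, b_o = np.zeros((1, h)), np.zeros((1, o))

def feedforward(X):
    """Equations 3 and 4: no dependence on previous inputs."""
    H = np.tanh(X @ W_xh + b_h)
    return np.tanh(H @ W_ho + b_o)

def rnn_forward(X_seq):
    """Equations 1 and 2: the hidden state H carries information over time."""
    H = np.zeros((n, h))
    outputs = []
    for X_t in X_seq:                                  # X_seq has shape (T, n, d)
        H = np.tanh(X_t @ W_xh + H @ W_hh + b_h)       # Equation 1
        outputs.append(np.tanh(H @ W_ho + b_o))        # Equation 2
    return np.stack(outputs), H

X_seq = rng.normal(size=(T, n, d))
O_seq, H_T = rnn_forward(X_seq)
print(O_seq.shape)                   # (T, n, o)
print(feedforward(X_seq[0]).shape)   # (n, o) -- no memory of earlier inputs
```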
If you are familiar with training techniques for Feedforward Neural Networks such as backpropagation, one question which might arise is how to properly backpropagate the error through an RNN. Here, a technique called Backpropagation Through Time (BPTT) is used, which is described in detail in the next section.

2 Backpropagation Through Time (BPTT) & Truncated BPTT


Backpropagation Through Time (BPTT) is the adaptation of the backpropagation algorithm for RNNs [24]. In theory, this unfolds the RNN to construct a traditional Feedforward Neural Network to which we can apply backpropagation. For that, we use the same notation for the RNN as proposed before. When we forward pass our input X_t through the network, we compute the hidden state H_t and the output O_t one step at a time. We can then define a loss function L(O, Y) to describe the difference between all outputs O_t and target values Y_t, as shown in Equation 5. It simply sums up the loss term ℓ_t of every time step. This loss term ℓ_t can have different definitions based on the specific problem (e.g. Mean Squared Error, Hinge Loss, Cross Entropy Loss, etc.).
L(O, Y) = Σ_{t=1}^{T} ℓ_t(O_t, Y_t)    (5)
Since we have three weight matrices W_{xh}, W_{hh} and W_{ho}, we need to compute the partial derivative w.r.t. each of these weight matrices. With the chain rule, which is also used in normal backpropagation, we get the result for W_{ho} shown in Equation 6.
∂L/∂W_{ho} = Σ_{t=1}^{T} ∂ℓ_t/∂O_t · ∂O_t/∂φ_o · ∂φ_o/∂W_{ho} = Σ_{t=1}^{T} ∂ℓ_t/∂O_t · ∂O_t/∂φ_o · H_t    (6)

For the partial derivative with respect to W_{hh} we get the result shown in Equation 7.
∂L/∂W_{hh} = Σ_{t=1}^{T} ∂ℓ_t/∂O_t · ∂O_t/∂φ_o · ∂φ_o/∂H_t · ∂H_t/∂φ_h · ∂φ_h/∂W_{hh} = Σ_{t=1}^{T} ∂ℓ_t/∂O_t · ∂O_t/∂φ_o · W_{ho} · ∂H_t/∂φ_h · ∂φ_h/∂W_{hh}    (7)

For the partial derivative with respect to W_{xh} we get the result shown in Equation 8.
∂L/∂W_{xh} = Σ_{t=1}^{T} ∂ℓ_t/∂O_t · ∂O_t/∂φ_o · ∂φ_o/∂H_t · ∂H_t/∂φ_h · ∂φ_h/∂W_{xh} = Σ_{t=1}^{T} ∂ℓ_t/∂O_t · ∂O_t/∂φ_o · W_{ho} · ∂H_t/∂φ_h · ∂φ_h/∂W_{xh}    (8)

Since each H_t depends on the previous time step, we can substitute the last part of the above equations to get Equation 9 and Equation 10.
∂L/∂W_{hh} = Σ_{t=1}^{T} ∂ℓ_t/∂O_t · ∂O_t/∂φ_o · W_{ho} · Σ_{k=1}^{t} ∂H_t/∂H_k · ∂H_k/∂W_{hh}    (9)

∂L/∂W_{xh} = Σ_{t=1}^{T} ∂ℓ_t/∂O_t · ∂O_t/∂φ_o · W_{ho} · Σ_{k=1}^{t} ∂H_t/∂H_k · ∂H_k/∂W_{xh}    (10)

The substituted part can then be written further as shown in Equation 11 and Equation 12.
∂L/∂W_{hh} = Σ_{t=1}^{T} ∂ℓ_t/∂O_t · ∂O_t/∂φ_o · W_{ho} · Σ_{k=1}^{t} (W_{hh}^⊤)^{t−k} · H_k    (11)

∂L/∂W_{xh} = Σ_{t=1}^{T} ∂ℓ_t/∂O_t · ∂O_t/∂φ_o · W_{ho} · Σ_{k=1}^{t} (W_{hh}^⊤)^{t−k} · X_k    (12)

From here, we can see that we need to store powers of W_{hh} as we proceed through each loss term ℓ_t of the overall loss function L, which can become very large. For these large values this method becomes numerically unstable, since eigenvalues smaller than 1 vanish and eigenvalues larger than 1 diverge [5]. One method of solving this problem is to truncate the sum at a computationally convenient size [24]. Doing so is known as Truncated BPTT [22]. This basically establishes an upper bound for the number of time steps the gradient can flow back to [15]. One can think of this upper bound as a moving window of past time steps which the RNN considers. Anything before the cut-off time step does not get taken into account. Since BPTT basically unfolds the RNN to create a new layer for each time step, we can also think of this procedure as limiting the number of hidden layers.
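The following PyTorch sketch is not from the paper; it illustrates the moving-window idea under the assumption of a simple per-step regression task. The long sequence is processed in chunks of k steps and the carried-over hidden state is detached between chunks, so gradients flow back through at most k time steps.

```python
import torch
import torch.nn as nn

d, h, k = 3, 5, 20                     # input size, hidden size, truncation window
rnn = nn.RNN(input_size=d, hidden_size=h, batch_first=True)
head = nn.Linear(h, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

X = torch.randn(1, 1000, d)            # one long sequence of 1000 steps
Y = torch.randn(1, 1000, 1)            # per-step regression targets (placeholder data)

H = torch.zeros(1, 1, h)               # initial hidden state
for start in range(0, X.size(1), k):
    x_chunk = X[:, start:start + k]
    y_chunk = Y[:, start:start + k]

    out, H = rnn(x_chunk, H)           # forward through one window of k steps
    loss = ((head(out) - y_chunk) ** 2).mean()

    opt.zero_grad()
    loss.backward()                    # gradients stop at the window boundary ...
    opt.step()

    H = H.detach()                     # ... because the carried-over state is detached
```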

3 Problems of RNNs: Vanishing & Exploding Gradients


As in most neural networks, vanishing or exploding gradients are a key problem of RNNs [12]. In Equation 9 and Equation 10 the term ∂H_t/∂H_k introduces a repeated matrix multiplication over the (potentially very long) sequence. If there are small values (< 1) in this matrix multiplication, the gradient decreases with each layer (or time step) and finally vanishes [6]. This basically stops the contribution of states that happened far earlier than the current time step towards the current time step [6]. Similarly, this can happen in the opposite direction: large values (> 1) during the matrix multiplication cause an exploding gradient, which in turn values each weight too much and changes it heavily [6].
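To see this effect numerically, the following small NumPy sketch (arbitrary random matrices, purely illustrative) backpropagates a gradient-like vector through 50 time steps for a W_{hh} whose spectral radius is set slightly below and slightly above 1.

```python
import numpy as np

rng = np.random.default_rng(0)
h, steps = 5, 50

def norm_after_backprop(scale):
    """Norm of a gradient-like vector after `steps` multiplications with W_hh^T."""
    W_hh = rng.normal(size=(h, h))
    W_hh *= scale / np.max(np.abs(np.linalg.eigvals(W_hh)))  # set the spectral radius
    g = np.ones(h)
    for _ in range(steps):
        g = W_hh.T @ g
    return np.linalg.norm(g)

print(norm_after_backprop(0.9))   # close to 0: the gradient vanishes
print(norm_after_backprop(1.1))   # very large: the gradient explodes
```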
This problem motivated the introduction of Long Short-Term Memory units (LSTMs), which particularly handle the vanishing gradient problem. This approach was able to outperform traditional RNNs on a variety of tasks [6]. In the next section, we take a closer look at the proposed structure of LSTMs.

4 Long Short-Term Memory Units (LSTMs)


Long Short-Term Memory Units (LSTMs) [9] were designed to properly handle the vanishing gradient problem. Since they enforce a more constant error flow, they allow RNNs to learn over far more time steps (well over 1000) [12]. To achieve that, LSTMs store more information outside of the traditional neural network flow in structures called gated cells [6, 12]. To make things work in an LSTM, we use an output gate O_t to read entries of the cell, an input gate I_t to read data into the cell and a forget gate F_t to reset the content of the cell. The computations for these gates are shown in Equation 13, Equation 14 and Equation 15. For a more visual approach please see Figure 8 in Appendix A.

O_t = σ(X_t W_{xo} + H_{t−1} W_{ho} + b_o)    (13)

I_t = σ(X_t W_{xi} + H_{t−1} W_{hi} + b_i)    (14)

F_t = σ(X_t W_{xf} + H_{t−1} W_{hf} + b_f)    (15)

These equations use W_{xi}, W_{xf}, W_{xo} ∈ R^{d×h} and W_{hi}, W_{hf}, W_{ho} ∈ R^{h×h} as weight matrices, while b_i, b_f, b_o ∈ R^{1×h} are their respective biases. Further, they use the sigmoid activation function σ, so each gate is a vector with entries in (0, 1).
Next, we need a candidate memory cell C̃_t ∈ R^{n×h}, which has a similar computation to the previously mentioned gates but instead uses a tanh activation function so its output lies in (−1, 1). Further, it again has its own weights W_{xc} ∈ R^{d×h}, W_{hc} ∈ R^{h×h} and bias b_c ∈ R^{1×h}. The respective computation is shown in Equation 16. See Figure 9 in Appendix A for a visualisation of this enhancement.

C̃_t = tanh(X_t W_{xc} + H_{t−1} W_{hc} + b_c)    (16)

To plug things together, we introduce the old memory content C_{t−1} ∈ R^{n×h}, which together with the introduced gates controls how much of the old memory content we want to preserve to get to the new memory content C_t. This is shown in Equation 17, where ⊙ denotes element-wise multiplication. The structure so far can be seen in Figure 10 in Appendix A.

C_t = F_t ⊙ C_{t−1} + I_t ⊙ C̃_t    (17)

The last step is to introduce the computation of the hidden state H_t ∈ R^{n×h} into the framework. This can be seen in Equation 18.

H_t = O_t ⊙ tanh(C_t)    (18)

With the tanh function we ensure that each element of H_t lies in (−1, 1). The full LSTM framework can be seen in Figure 11 in Appendix A.
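Putting Equations 13–18 together, a single LSTM step can be sketched as follows (a minimal NumPy illustration with assumed shapes and random placeholder weights, not an optimised implementation).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X_t, H_prev, C_prev, p):
    """One LSTM time step following Equations 13-18; p holds the parameters."""
    O_t = sigmoid(X_t @ p["W_xo"] + H_prev @ p["W_ho"] + p["b_o"])      # output gate (13)
    I_t = sigmoid(X_t @ p["W_xi"] + H_prev @ p["W_hi"] + p["b_i"])      # input gate  (14)
    F_t = sigmoid(X_t @ p["W_xf"] + H_prev @ p["W_hf"] + p["b_f"])      # forget gate (15)
    C_tilde = np.tanh(X_t @ p["W_xc"] + H_prev @ p["W_hc"] + p["b_c"])  # candidate   (16)
    C_t = F_t * C_prev + I_t * C_tilde                                  # new memory  (17)
    H_t = O_t * np.tanh(C_t)                                            # hidden state(18)
    return H_t, C_t

n, d, h = 4, 3, 5
rng = np.random.default_rng(0)
p = {name: rng.normal(size=(d, h)) for name in ["W_xo", "W_xi", "W_xf", "W_xc"]}
p.update({name: rng.normal(size=(h, h)) for name in ["W_ho", "W_hi", "W_hf", "W_hc"]})
p.update({name: np.zeros((1, h)) for name in ["b_o", "b_i", "b_f", "b_c"]})

H, C = np.zeros((n, h)), np.zeros((n, h))
H, C = lstm_step(rng.normal(size=(n, d)), H, C, p)
print(H.shape, C.shape)   # (4, 5) (4, 5)
```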

5 Deep Recurrent Neural Networks (DRNNs)

Deep Recurrent Neural Networks (DRNNs) are in theory a really easy concept. To construct a deep RNN with L hidden layers, we simply stack ordinary RNNs of any type on top of each other. Each hidden state H_t^{(ℓ)} ∈ R^{n×h} is passed to the next time step of the current layer, H_{t+1}^{(ℓ)}, as well as to the current time step of the next layer, H_t^{(ℓ+1)}. For the first layer we compute the hidden state as proposed in the previous models, shown in Equation 19, while for the subsequent layers we use Equation 20, where the hidden state from the previous layer is treated as input.
 
H_t^{(1)} = φ_1(X_t, H_{t−1}^{(1)})    (19)

H_t^{(ℓ)} = φ_ℓ(H_t^{(ℓ−1)}, H_{t−1}^{(ℓ)})    (20)

The output O_t ∈ R^{n×o}, where o is the number of outputs, is then computed as shown in Equation 21, where we only use the hidden state of the last layer L.

O_t = φ_o(H_t^{(L)} W_{ho} + b_o)    (21)
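A minimal sketch (assumed shapes, plain tanh RNN layers and random placeholder weights) of how the stacking in Equations 19–21 could look:

```python
import numpy as np

n, d, h, o, L, T = 4, 3, 5, 2, 3, 8   # samples, inputs, hidden, outputs, layers, steps
rng = np.random.default_rng(0)

# Layer 1 maps inputs to hidden size, layers 2..L map hidden to hidden (Eq. 19 and 20)
W_in = [rng.normal(size=(d if l == 0 else h, h)) for l in range(L)]
W_hh = [rng.normal(size=(h, h)) for _ in range(L)]
b_h = [np.zeros((1, h)) for _ in range(L)]
W_ho, b_o = rng.normal(size=(h, o)), np.zeros((1, o))

H = [np.zeros((n, h)) for _ in range(L)]      # one hidden state per layer
for t in range(T):
    layer_input = rng.normal(size=(n, d))     # X_t (random data for illustration)
    for l in range(L):
        H[l] = np.tanh(layer_input @ W_in[l] + H[l] @ W_hh[l] + b_h[l])
        layer_input = H[l]                    # hidden state feeds the next layer
    O_t = np.tanh(H[-1] @ W_ho + b_o)         # output from the last layer (Eq. 21)
print(O_t.shape)   # (4, 2)
```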

6 Bidirectional Recurrent Neural Networks (BRNNs)
Let's take language modelling as an example for now. Based on our current models, we are able to reliably predict the next sequence element (i.e. the next word) based on what we have seen so far. However, there are scenarios where we might want to fill in a gap in a sentence, and the part of the sentence after the gap conveys significant information. Taking this information into account is necessary to perform well on this kind of task [24]. On a more generalised level, we want to incorporate a look-ahead property for sequences.

Figure 2: Architecture of a bidirectional recurrent neural network

To achieve this look-ahead property, Bidirectional Recurrent Neural Networks (BRNNs) [14] were introduced, which basically add another hidden layer that runs through the sequence backwards, starting from the last element [24]. An architectural overview is visualised in Figure 2. Here, we introduce a forward hidden state →H_t ∈ R^{n×h} and a backward hidden state ←H_t ∈ R^{n×h}. Their respective calculations are shown in Equation 22 and Equation 23.

→H_t = φ(X_t W_{xh}^{(f)} + →H_{t−1} W_{hh}^{(f)} + b_h^{(f)})    (22)

←H_t = φ(X_t W_{xh}^{(b)} + ←H_{t+1} W_{hh}^{(b)} + b_h^{(b)})    (23)

For that, we have similar weight matrices as in the definitions before, but now they are separated into two sets. One set of weight matrices is for the forward hidden states, W_{xh}^{(f)} ∈ R^{d×h} and W_{hh}^{(f)} ∈ R^{h×h}, while the other one is for the backward hidden states, W_{xh}^{(b)} ∈ R^{d×h} and W_{hh}^{(b)} ∈ R^{h×h}. They also have their respective biases b_h^{(f)} ∈ R^{1×h} and b_h^{(b)} ∈ R^{1×h}. With that, we can compute the output O_t ∈ R^{n×o}, with o being the number of outputs and ⌢ denoting the concatenation of the two matrices along the feature dimension (yielding a matrix in R^{n×2h}).

O_t = φ([→H_t ⌢ ←H_t] W_{ho} + b_o)    (24)

Again, we have a weight matrix W_{ho} ∈ R^{2h×o} and bias parameter b_o ∈ R^{1×o}. Keep in mind that the two directions can have different numbers of hidden units.
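The following NumPy sketch (assumed shapes, plain tanh cells, random placeholder weights) shows the two passes of Equations 22–24: one RNN reads the sequence forwards, one backwards, and the output uses the concatenated hidden states.

```python
import numpy as np

n, d, h, o, T = 2, 3, 4, 2, 6
rng = np.random.default_rng(0)

def make_params():
    return rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros((1, h))

W_xh_f, W_hh_f, b_f = make_params()           # forward direction (Eq. 22)
W_xh_b, W_hh_b, b_b = make_params()           # backward direction (Eq. 23)
W_ho, b_o = rng.normal(size=(2 * h, o)), np.zeros((1, o))

X_seq = rng.normal(size=(T, n, d))

H_fwd, H = [], np.zeros((n, h))
for t in range(T):                             # left to right
    H = np.tanh(X_seq[t] @ W_xh_f + H @ W_hh_f + b_f)
    H_fwd.append(H)

H_bwd, H = [None] * T, np.zeros((n, h))
for t in reversed(range(T)):                   # right to left
    H = np.tanh(X_seq[t] @ W_xh_b + H @ W_hh_b + b_b)
    H_bwd[t] = H

# Concatenate both directions per time step and compute the output (Eq. 24)
O_seq = [np.tanh(np.concatenate([H_fwd[t], H_bwd[t]], axis=1) @ W_ho + b_o)
         for t in range(T)]
print(O_seq[0].shape)   # (2, 2)
```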

7 Encoder-Decoder Architecture & Sequence to Sequence (seq2seq)


The Encoder-Decoder architecture is a type of neural network architecture where the network is twofold. It consists of an encoder network and a decoder network, whose respective roles are to encode the input into a state and to decode the state into an output. This state usually has the shape of a vector or a tensor [24]. A visualisation of this structure is shown in Figure 3.
Based on this Encoder-Decoder architecture, a model called Sequence to Sequence (seq2seq) [16] was proposed for generating a sequence output based on a sequence input. This model uses RNNs for the encoder as well as the decoder, where the hidden state of the encoder gets passed to the hidden state of the decoder. Common applications of the model are Google Translate [16, 23], voice-enabled devices [13] or labeling video data [18].

Input → Encoder → State → Decoder → Output

Figure 3: Encoder-Decoder Architecture Overview, adapted from: [24]

It mainly focuses on mapping a fixed-length input sequence
of length n to a fixed-length output sequence of length m, where n ≠ m is possible but not a necessity. A more detailed visualisation of the proposed architecture is shown in Figure 4. Here, we have an encoder which consists of an RNN accepting a single element of the sequence X_t, where t is the position of the sequence element. These RNNs can be LSTMs or Gated Recurrent Units (GRUs) to further improve performance [16]. Further, the hidden states H_t are computed according to the definition of the hidden states in the used RNN type (e.g. LSTM or GRU). The Encoder Vector (context) is a representation of the last hidden state of the encoder network, which aims to aggregate the information from all previous input elements. It functions as the initial hidden state of the decoder network and enables the decoder to make accurate predictions. The decoder network again is built of an RNN which predicts an output Y_t at time step t. The produced output is again a sequence where each Y_t is a sequence element with position t. At each time step, the RNN accepts the hidden state from the previous unit and itself produces an output as well as a new hidden state.

Figure 4: Visualisation of the Sequence to Sequence (seq2seq) Model
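To make the data flow of Figure 4 concrete, here is a minimal greedy seq2seq sketch (plain tanh RNN cells, made-up shapes and random placeholder weights; a real model would use LSTM/GRU cells and learned embeddings as described above).

```python
import numpy as np

d, h, m = 3, 5, 4                      # feature size, hidden size, output length
rng = np.random.default_rng(0)

def rnn_cell(x, H, W_x, W_h, b):
    return np.tanh(x @ W_x + H @ W_h + b)

# Separate parameters for encoder and decoder
enc = (rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros((1, h)))
dec = (rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros((1, h)))
W_out, b_out = rng.normal(size=(h, d)), np.zeros((1, d))

# Encoder: fold the whole input sequence into one context vector (last hidden state)
X_seq = rng.normal(size=(7, 1, d))     # input sequence of length n = 7
H = np.zeros((1, h))
for X_t in X_seq:
    H = rnn_cell(X_t, H, *enc)
context = H                            # the "Encoder Vector"

# Decoder: start from the context and feed each prediction back in as the next input
y = np.zeros((1, d))                   # stand-in for a start-of-sequence token
S, outputs = context, []
for _ in range(m):
    S = rnn_cell(y, S, *dec)
    y = np.tanh(S @ W_out + b_out)     # next predicted sequence element
    outputs.append(y)
print(len(outputs), outputs[0].shape)  # 4 (1, 3)
```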

The Encoder Vector (context) was shown to be a bottleneck for these types of models, since it needs to contain all the necessary information of a source sentence in a fixed-length vector, which is particularly problematic for long sequences. There have been approaches to solve this problem by introducing Attention, for example in [4] or [10]. In the next section, we take a closer look at the proposed solutions.

8 Attention Mechanism & Transformer


The Attention Mechanism for RNNs is partly motivated by human visual focus and peripheral perception [21]. It allows humans to focus on a certain region to achieve high resolution while adjacent objects are perceived with a rather low resolution. Based on these focus points and the adjacent perception, we can make inferences about what we expect to perceive when shifting our focus point. Similarly, we can transfer this method to a sequence of words, where we are able to perform inference based on observed words. For example, if we perceive the word eating in the sequence “She is eating a green apple”, we expect to observe a food object in the near future [21].
Generally, Attention takes two sentences and transforms them into a matrix where the sequence elements (i.e. words) of one sentence correspond to the rows and those of the other to the columns. Based on this matrix layout, we can fill in the entries to identify relevant context or correlations between them. An example of this process can be seen in Figure 5, where white denotes high correlation while black denotes low correlation. This method isn't limited to two sentences of different languages as seen in the example, but can also be applied to a sentence paired with itself, which is then called self-attention.

Figure 5: Example of an Alignment matrix of “L’accord sur la zone économique européen a été signé
en août 1992” (French) and its English translation “The agreement on the European Economic Area
was signed in August 1992”: [4]

8.1 Definition

To help the seq2seq model better deal with long sequences, the attention mechanism was introduced. Instead of constructing the Encoder Vector out of the last hidden state of the encoder network, attention introduces shortcuts between the context vector and the entire source input. A visualisation of this process can be seen in Figure 6. Here, we have a source sequence X of length n and try to output a target sequence Y of length m. In that regard, the formulation is rather similar to the one we described before in Section 7. We have an overall hidden state H_{t'}, which is the concatenation of the forward and backward pass as shown in Equation 25. Also, the hidden state of the decoder network is denoted as S_t, while the encoder vector (context vector) is denoted as C_t. Both of these are shown in Equation 26 and Equation 27 respectively.
H_{t'} = [→H_{t'} ⌢ ←H_{t'}]    (25)

S_t = φ_d(S_{t−1}, Y_{t−1}, C_t)    (26)

The context vector C_t is a sum of the hidden states of the input sequence, each weighted with an alignment score α_{t,t'} where Σ_{t'=1}^{T} α_{t,t'} = 1. This is shown in Equation 27 as well as Equation 28.
C_t = Σ_{t'=1}^{T} α_{t,t'} · H_{t'}    (27)

α_{t,t'} = align(Y_t, X_{t'}) = exp(score(S_{t−1}, H_{t'})) / Σ_{t'=1}^{T} exp(score(S_{t−1}, H_{t'}))    (28)

The alignment α_{t,t'} assigns a score to the pair of the input at position t' and the output at position t, based on how well the two match [21]. The set of all alignment scores defines how much each source hidden state should be considered for each output [21]. Please see Appendix B for a simpler, more visual explanation of the attention mechanism in the seq2seq model.
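A minimal sketch of Equations 27 and 28 (assumed shapes, random placeholder states; a dot-product score stands in for the pluggable score functions listed in Section 8.2):

```python
import numpy as np

T, h = 6, 4                                   # source length, hidden size
rng = np.random.default_rng(0)

H_enc = rng.normal(size=(T, h))               # encoder hidden states H_{t'}
S_prev = rng.normal(size=(h,))                # previous decoder state S_{t-1}

def softmax(x):
    x = x - x.max()                           # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

scores = H_enc @ S_prev                       # score(S_{t-1}, H_{t'}) via dot product
alpha = softmax(scores)                       # alignment weights, Equation 28
C_t = alpha @ H_enc                           # context vector, Equation 27

print(alpha.sum())    # 1.0 -- the weights form a distribution over source positions
print(C_t.shape)      # (4,)
```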

8.2 Different types of score functions

Generally, there are different implementations of this score function which have been used in various works. Table 1 gives an overview of their respective names, equations and usage in publications. Here, we have two trainable parameters in the alignment model, denoted as v_a and W_a.

Figure 6: Encoder-Decoder architecture with additive attention mechanism, adapted from: [4]

Name                 Equation for score(S_t, H_{t'})       Used In
Content-base         cosine[S_t, H_{t'}]                   [8]
Additive             v_a^⊤ tanh(W_a [S_t; H_{t'}])         [3]
Location-Base        softmax(W_a S_t)                      [11]
General              S_t^⊤ W_a H_{t'}                      [11]
Dot-Product          S_t^⊤ H_{t'}                          [11]
Scaled Dot-Product   S_t^⊤ H_{t'} / √(n_source)            [17]

Table 1: Different score functions with their respective equations and usage, adapted from: [21]

The Scaled Dot-Product used in [17] scales the dot-product by the square root of the dimension of the source hidden state (n_source in Table 1). This is motivated by the problem that when the inputs to the softmax are large, the softmax function may have an extremely small gradient, which is a problem for efficient learning.
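For reference, the score functions of Table 1 can be written down directly as a sketch; v_a, W_a and the other matrices would be trained parameters, here they are random placeholders with assumed shapes.

```python
import numpy as np

h, T = 4, 6                                    # hidden size, source length (assumed)
rng = np.random.default_rng(0)
S_t, H_s = rng.normal(size=(h,)), rng.normal(size=(h,))        # decoder / encoder state
v_a, W_a = rng.normal(size=(2 * h,)), rng.normal(size=(2 * h, 2 * h))
W_g = rng.normal(size=(h, h))                  # placeholder for the "General" variant
W_loc = rng.normal(size=(T, h))                # placeholder for the "Location-Base" variant

content_base = S_t @ H_s / (np.linalg.norm(S_t) * np.linalg.norm(H_s))  # cosine similarity
additive     = v_a @ np.tanh(W_a @ np.concatenate([S_t, H_s]))
general      = S_t @ W_g @ H_s
dot_product  = S_t @ H_s
scaled_dot   = dot_product / np.sqrt(h)        # divide by sqrt of the hidden dimension

# Location-based scoring produces alignment weights from the decoder state alone
loc = np.exp(W_loc @ S_t)
location_base = loc / loc.sum()
```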

8.3 Transformer

By incorporating this Attention Mechanism, the Transformer [17] was introduced, which achieves parallelisation by capturing the recurrence of the sequence with attention while at the same time encoding each item's position in the sequence, all based on the encoder-decoder architecture [24]. In fact, it doesn't use any recurrent network units at all and relies entirely on the self-attention mechanism. The encoding part of the architecture is made up of several stacked encoders (e.g. six encoders in [17]), while the decoding part consists of the same number of stacked decoders. A general overview of the architecture is illustrated in Figure 7.
Here, each encoder component consists of two sub-layers: Self-Attention and a Feed Forward Neural Network. Similarly, those two sub-layers are found in each decoder component, but with an Encoder-Decoder Attention sub-layer in between them, which works similarly to the Attention used in the seq2seq model. The deployed Attention layers are not ordinary attention layers but use a method called Multi-Headed Attention, which improves the performance of the attention layer. This allows the model to jointly attend to information from different representation subspaces at different positions, which in easier terms means running different chunks in parallel and concatenating the results [17].

Figure 7: Model Architecture of the Transformer: [17]

Unfortunately, explaining the design choices and mathematical formulations contained in multi-headed attention would be too much detail at this point. Please refer to the original paper [17] for more information. The architecture shown in Figure 7 also deploys skip connections and layer normalisation for each sub-layer of the encoder as well as the decoder. One thing to note is that the inputs as well as the outputs get embedded, and a positional encoding is applied which represents the proximity of sequence elements (see Appendix C).
The final linear and softmax layers turn the vector of floats that is the output of the decoder stack into a word. This is done by transforming the vector through the linear layer into a much larger vector called a logits vector [1]. This logits vector has the size of the learned vocabulary from the training dataset, where each cell corresponds to the score of a unique word [1]. By applying a softmax function, we turn those scores into probabilities which sum up to 1, and therefore we can choose the cell (i.e. the word) with the highest probability as the output for this particular time step.
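The final projection and softmax described above can be sketched in a few lines (the vocabulary size and the decoder output are made-up placeholders).

```python
import numpy as np

d_model, vocab_size = 8, 1000
rng = np.random.default_rng(0)

decoder_output = rng.normal(size=(d_model,))        # vector of floats from the decoder stack
W_linear = rng.normal(size=(d_model, vocab_size))   # learned projection to the vocabulary

logits = decoder_output @ W_linear                  # one score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax: scores -> probabilities

predicted_word_id = int(np.argmax(probs))           # pick the most probable word
print(predicted_word_id, probs.sum())               # index into the vocabulary, 1.0
```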

9 Pointer Networks (Ptr-Nets)

Pointer Networks (Ptr-Nets) [19] adapt the seq2seq model with attention to improve it by not fixing
the discrete categories (i.e. elements) of the output dictionary a priori. Instead of yielding an output
sequence generated from an input sequence, a pointer network creates a succession of pointers to the
elements of the input series [25]. In [19] they show that by using Pointer Networks they can solve
combinatorial optimization problems such as computing planar convex hulls, Delaunay triangulations
and the symmetric planar Travelling Salesman Problem (TSP).

Generally, we apply additive attention (from Table 1) between the decoder state and the encoder states and then normalise it with the softmax function to model the output conditional probability, as seen in Equation 29.

Y_t = softmax(score(S_t, H_{t'})) = softmax(v_a^⊤ tanh(W_a [S_t; H_{t'}]))    (29)

The attention mechanism is simplified, as Ptr-Net does not blend the encoder states into the output
with attention weights. In this way, the output only responds to the positions but not the input content
[21].
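A sketch of Equation 29 (random placeholder parameters, assumed shapes): the softmax is taken over the input positions themselves, so the "output" is a distribution pointing back into the input sequence.

```python
import numpy as np

T, h = 5, 4                                   # input length, hidden size
rng = np.random.default_rng(0)

H_enc = rng.normal(size=(T, h))               # encoder states, one per input element
S_t = rng.normal(size=(h,))                   # current decoder state
v_a, W_a = rng.normal(size=(h,)), rng.normal(size=(h, 2 * h))

# Additive score between the decoder state and every encoder state (Equation 29)
scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([S_t, H_enc[i]]))
                   for i in range(T)])
probs = np.exp(scores - scores.max())
probs /= probs.sum()

pointer = int(np.argmax(probs))               # index of the input element pointed to
print(pointer, probs.round(2))
```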

10 Conclusion & Outlook


In this work, we gave an introduction to the fundamentals of Recurrent Neural Networks (RNNs). This includes the general framework for RNNs, Backpropagation Through Time, problems of traditional RNNs, LSTMs, Deep and Bidirectional RNNs, as well as more recent advances such as the Encoder-Decoder architecture, the seq2seq model, Attention, the Transformer and Pointer Networks. Most topics are only covered conceptually and we don't go too deep into implementation specifics. To get a broader understanding of the covered topics, we recommend looking into some of the cited original papers. Additionally, most recent publications use some of the presented concepts, so we recommend taking a look at such papers.
One recent publication which uses many of the presented concepts is “Grandmaster level in StarCraft II using multi-agent reinforcement learning” by Vinyals et al. [20]. Here, they present their approach to train agents to play the real-time strategy game StarCraft II with great success. If the presented concepts were a little too theoretical for you, we recommend reading that paper to see LSTMs, the Transformer or Pointer Networks deployed in a more practical setting.

References
[1] Jay Alammar. The Illustrated Transformer. 2018.
[2] Jay Alammar. Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models
With Attention). 2018.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by
Jointly Learning to Align and Translate. 2014.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Translation by
Jointly Learning to Align and Translate”. In: 3rd International Conference on Learning Repre-
sentations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Ed. by Yoshua Bengio and Yann LeCun. 2015.
[5] Y. Bengio, Patrice Simard, and Paolo Frasconi. “Learning long-term dependencies with
gradient descent is difficult”. In: IEEE Transactions on Neural Networks 5.2 (1994),
pp. 157–166.
[6] Gang Chen. A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation.
2016.
[7] Junyoung Chung et al. “Gated Feedback Recurrent Neural Networks”. In: Proceedings of the
32nd International Conference on Machine Learning - Volume 37. ICML’15. Lille, France:
JMLR.org, 2015, pp. 2067–2075.
[8] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. 2014.
[9] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Memory”. In: Neural computation
9 (Dec. 1997), pp. 1735–80.
[10] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to
Attention-based Neural Machine Translation. 2015.
[11] Thang Luong, Hieu Pham, and Christopher D. Manning. “Effective Approaches to Attention-
based Neural Machine Translation”. In: Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational
Linguistics, Sept. 2015, pp. 1412–1421.
[12] Chris Nicholson. A Beginner’s Guide to LSTMs and Recurrent Neural Networks. https://skymind.ai/wiki/lstm. Accessed: 06 November 2019. 2019.

[13] Rohit Prabhavalkar et al. “A Comparison of Sequence-to-Sequence Models for Speech Recog-
nition”. In: INTERSPEECH. 2017.
[14] Mike Schuster and Kuldip K. Paliwal. “Bidirectional recurrent neural networks”. In: IEEE
Trans. Signal Processing 45 (1997), pp. 2673–2681.
[15] Ilya Sutskever. “Training Recurrent Neural Networks”. AAINS22066. PhD thesis. Toronto,
Ont., Canada, 2013.
[16] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to Sequence Learning with Neural
Networks”. In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani
et al. Curran Associates, Inc., 2014, pp. 3104–3112.
[17] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in Neural Information
Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 5998–6008.
[18] S. Venugopalan et al. “Sequence to Sequence – Video to Text”. In: 2015 IEEE International
Conference on Computer Vision (ICCV). 2015, pp. 4534–4542.
[19] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. “Pointer Networks”. In: Advances in
Neural Information Processing Systems 28. Ed. by C. Cortes et al. Curran Associates, Inc.,
2015, pp. 2692–2700.
[20] Oriol Vinyals et al. “Grandmaster level in StarCraft II using multi-agent reinforcement learn-
ing”. In: Nature 575.7782 (Oct. 2019), pp. 350–354.
[21] Lilian Weng. “Attention? Attention!” In: lilianweng.github.io/lil-log (2018). Accessed: 09
November 2019.
[22] R. J. Williams and J. Peng. “An Efficient Gradient-Based Algorithm for On-Line Training of
Recurrent Network Trajectories”. In: Neural Computation 2.4 (1990), pp. 490–501.
[23] Yonghui Wu et al. Google’s Neural Machine Translation System: Bridging the Gap between
Human and Machine Translation. 2016.
[24] Aston Zhang et al. Dive into Deep Learning. http://www.d2l.ai. 2019.
[25] Z. Zygmunt. Introduction to pointer networks. Accessed: 22 November 2019. 2017.

Appendices
A Visual Representation of LSTMs

In this section we consecutively construct the full architecture of Long Short-Term Memory Units (LSTMs) explained in Section 4. For a description of what changes between each step, please read Section 4 or refer to the source of the illustrations [24].

Figure 8: Calculation of input, forget, and output gates in an LSTM: [24]

Figure 9: Computation of candidate memory cells in LSTM: [24]

Figure 10: Computation of memory cells in an LSTM: [24]

Figure 11: Computation of the hidden state in an LSTM: [24]

B Visual Representation of seq2seq with Attention
The seq2seq model with attention passes a lot more data from the encoder to the decoder than the regular seq2seq model. Instead of passing only the last hidden state of the encoding stage, the encoder passes all of its hidden states to the decoder. The first step of the decoder part in the seq2seq model with attention is illustrated in Figure 12, where we pass “I am a student” to the encoder and expect a translation to French producing “je suis un étudiant”. Here, all the hidden states of the encoder H1, H2, H3 are passed to the attention decoder, as well as the embedding of the <END> token and an initial decoder hidden state Hinit.

Figure 12: Seq2Seq Model with Attention Mechanism Step 1, adapted from: [2]

Next, we produce an output and a new hidden state vector H4 . However, the output is discarded.
This can be seen in Figure 13.


Figure 13: Seq2Seq Model with Attention Mechanism Step 2, adapted from: [2]

For the attention step, we use this produced hidden state vector H4 and the hidden states from the encoder H1, H2, H3 to produce a context vector C4 (blue). This process can be seen in Figure 14. Each encoder hidden state is most associated with a certain word in the input sentence [2]. When we give these hidden states scores and apply a softmax to them, we generate probability values. These probabilities are represented by the three-element pink vector, where light values stand for high probabilities while dark values denote low probabilities. Next, we multiply each hidden state vector H1, H2, H3 by its softmaxed score, which amplifies hidden states with high scores and suppresses hidden states with low scores. This is visualised by graying out the hidden states H2 and H3 while keeping H1 in solid color.


Figure 14: Seq2Seq Model with Attention Mechanism Step 4, adapted from: [2]

After that, we concatenate this produced context vector C4 with the produced hidden state H4 . One
can see this process in Figure 15. This process just stacks the two vectors on top of each other.


Figure 15: Seq2Seq Model with Attention Mechanism Step 5, adapted from: [2]

This concatenated version of hidden state H4 and context vector C4 is then passed into a jointly
trained Feedforward Neural Network. This network is visualised by the red box with round edges in
Figure 16. The output of this network then represents the output of the current time step t, which in this case is the word “I”. This basically concludes all the steps needed at each decoding step.
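The whole per-step procedure of this appendix can be condensed into a short sketch (toy dimensions, random placeholder weights): score the encoder states against the current decoder state, build the context vector, concatenate it with the decoder state and push the result through the jointly trained feedforward network.

```python
import numpy as np

h, vocab_size = 4, 10
rng = np.random.default_rng(0)

H_enc = rng.normal(size=(3, h))               # H1, H2, H3 from the encoding stage
H4 = rng.normal(size=(h,))                    # new decoder hidden state
W_ffn = rng.normal(size=(2 * h, vocab_size))  # jointly trained feedforward layer

scores = H_enc @ H4                           # score each encoder state against H4
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmaxed scores (the pink vector)

C4 = weights @ H_enc                          # context vector: weighted sum of H1..H3
concat = np.concatenate([H4, C4])             # stack hidden state and context vector

word_scores = concat @ W_ffn                  # feedforward output for this time step
print(int(np.argmax(word_scores)))            # index of the produced word (e.g. "I")
```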


Figure 16: Seq2Seq Model with Attention Mechanism Step 6, adapted from: [2]

If we take a look at the next iteration step in Figure 17, we can see that the output from the previous step (produced from hidden state H4) is passed to the decoder instead of the <END> token. All the other steps are the same as in the previous iteration. However, we can see that the hidden state H2 now has the best score during the attention stage. Again, this is represented by the lightest shade of pink in the score vector. By multiplying the scores with the hidden states, we obtain two reduced hidden states H1 and H3 while keeping H2 as the most active hidden state. This results in the word “am” being produced as the output of the Feedforward Neural Network for this time step.


Figure 17: Seq2Seq Model with Attention Mechanism Step 7, adapted from: [2]

Obviously, there are still two more attention decoder time steps, which are omitted here for illustration purposes. The functionality of each of those steps, however, is equivalent to the time steps already shown.

C Visual Representation of Positional Encodings used in the Transformer


One example of a positional encoding used inside the Transformer is the application of trigonometric functions, as seen in Figure 18. Here, we have multiple trigonometric functions with different frequencies, and we show the encoding for three words, i.e. X1, X2, X3.

Figure 18: Positional Encoding Example based on trigonometric functions

In principle, the encoding for X1 is therefore high for the first curve (blue), mid for the second curve (red) and low for the last curve (green). Similarly, this applies to the other words as well. What we can see here is that close words have closer encodings while distant words have more dissimilar encodings. Generally, this is a method for encoding the position within a given sequence, similar in spirit to a binary encoding.
The choice of such a positional encoding algorithm definitely is not the main contribution of [17] but
it is a relevant concept to at least understand in theory since this boosts performance.
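For completeness, the sinusoidal encoding proposed in [17] can be written as a short function; it follows the formula PE(pos, 2i) = sin(pos / 10000^{2i/d}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d}), and the sequence length and model dimension below are arbitrary example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as proposed in [17]."""
    pos = np.arange(seq_len)[:, None]                  # positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]                    # embedding dimensions
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
# Nearby positions get similar rows; the rows are simply added to the input embeddings.
print(pe.shape)        # (50, 16)
```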

