Recurrent Neural Networks (RNNs): A Gentle Introduction and Overview
Robin M. Schmidt
Department of Computer Science
Eberhard-Karls-University Tübingen
Tübingen, Germany
rob.schmidt@student.uni-tuebingen.de
Abstract
$$O = \phi_o\left(H W_{ho} + b_o\right) \qquad (4)$$
If you are familiar with training techniques for Feedforward Neural Networks such as backpropagation, one question which might arise is how to properly backpropagate the error through an RNN. Here, a technique called Backpropagation Through Time (BPTT) is used, which is described in detail in the next section.
For the partial derivative with respect to $W_{hh}$ we get the result shown in Equation 7.
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial O_t} \cdot \frac{\partial O_t}{\partial \phi_o} \cdot \frac{\partial \phi_o}{\partial H_t} \cdot \frac{\partial H_t}{\partial \phi_h} \cdot \frac{\partial \phi_h}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial O_t} \cdot \frac{\partial O_t}{\partial \phi_o} \cdot W_{ho} \cdot \frac{\partial H_t}{\partial \phi_h} \cdot \frac{\partial \phi_h}{\partial W_{hh}} \qquad (7)$$
For the partial derivative with respect to $W_{xh}$ we get the result shown in Equation 8.
$$\frac{\partial L}{\partial W_{xh}} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial O_t} \cdot \frac{\partial O_t}{\partial \phi_o} \cdot \frac{\partial \phi_o}{\partial H_t} \cdot \frac{\partial H_t}{\partial \phi_h} \cdot \frac{\partial \phi_h}{\partial W_{xh}} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial O_t} \cdot \frac{\partial O_t}{\partial \phi_o} \cdot W_{ho} \cdot \frac{\partial H_t}{\partial \phi_h} \cdot \frac{\partial \phi_h}{\partial W_{xh}} \qquad (8)$$
Since each $H_t$ depends on the previous time step, we can substitute the last part of the above equations to get Equation 9 and Equation 10.
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial O_t} \cdot \frac{\partial O_t}{\partial \phi_o} \cdot W_{ho} \cdot \sum_{k=1}^{t} \frac{\partial H_t}{\partial H_k} \cdot \frac{\partial H_k}{\partial W_{hh}} \qquad (9)$$
$$\frac{\partial L}{\partial W_{xh}} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial O_t} \cdot \frac{\partial O_t}{\partial \phi_o} \cdot W_{ho} \cdot \sum_{k=1}^{t} \frac{\partial H_t}{\partial H_k} \cdot \frac{\partial H_k}{\partial W_{xh}} \qquad (10)$$
The adapted part can then further be written as shown in Equation 11 and Equation 12.
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial O_t} \cdot \frac{\partial O_t}{\partial \phi_o} \cdot \sum_{k=1}^{t} W_{ho} \left(W_{hh}^{\top}\right)^{t-k} \cdot H_k \qquad (11)$$
$$\frac{\partial L}{\partial W_{xh}} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial O_t} \cdot \frac{\partial O_t}{\partial \phi_o} \cdot \sum_{k=1}^{t} W_{ho} \left(W_{hh}^{\top}\right)^{t-k} \cdot X_k \qquad (12)$$
From here, we can see that we need to store powers of $W_{hh}$ as we proceed through each loss term $\ell_t$ of the overall loss function $L$, which can become very large. For these large values this method becomes numerically unstable since eigenvalues smaller than 1 vanish and eigenvalues larger than 1 diverge [5]. One method of solving this problem is to truncate the sum at a computationally convenient size [24]. Doing so is known as Truncated Backpropagation Through Time (Truncated BPTT) [22]. This establishes an upper bound for the number of time steps the gradient can flow back [15]. One can think of this upper bound as a moving window of past time steps which the RNN considers. Anything before the cut-off time step is not taken into account. Since BPTT basically unfolds the RNN to create a new layer for each time step, we can also think of this procedure as limiting the number of hidden layers.
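To make the moving-window idea concrete, the following minimal sketch implements a per-loss-term truncation for a vanilla tanh RNN in NumPy. It is an illustrative example rather than the procedure of any particular reference; the window size `k`, the squared-error loss and all parameter names are assumptions chosen for this sketch.

```python
import numpy as np

def truncated_bptt_step(params, xs, h0, targets, k=5, lr=1e-2):
    """One parameter update on a subsequence, backpropagating each loss term at most k steps.

    params: dict with Wxh (d x h), Whh (h x h), Who (h x o), bh (1 x h), bo (1 x o)
    xs: list of input row vectors (1 x d); targets: list of target row vectors (1 x o)
    """
    Wxh, Whh, Who = params["Wxh"], params["Whh"], params["Who"]
    bh, bo = params["bh"], params["bo"]

    # Forward pass: store all hidden states so we can backpropagate through them later.
    hs = {-1: h0}
    loss = 0.0
    for t, x in enumerate(xs):
        hs[t] = np.tanh(x @ Wxh + hs[t - 1] @ Whh + bh)      # hidden state H_t
        y = hs[t] @ Who + bo                                  # output O_t (identity phi_o)
        loss += 0.5 * np.sum((y - targets[t]) ** 2)           # squared-error loss term l_t

    grads = {name: np.zeros_like(p) for name, p in params.items()}
    for t in reversed(range(len(xs))):
        dy = hs[t] @ Who + bo - targets[t]                    # dl_t / dO_t
        grads["Who"] += hs[t].T @ dy
        grads["bo"] += dy
        dh = dy @ Who.T                                       # gradient flowing into H_t
        # Truncation: follow the recurrence back at most k steps from t.
        for step in range(t, max(t - k, -1), -1):
            draw = dh * (1.0 - hs[step] ** 2)                 # through the tanh nonlinearity
            grads["Wxh"] += xs[step].T @ draw
            grads["Whh"] += hs[step - 1].T @ draw
            grads["bh"] += draw
            dh = draw @ Whh.T                                 # pass gradient to H_{step-1}

    for name in params:                                       # plain gradient descent update
        params[name] -= lr * grads[name]
    return loss
```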
gate $F_t$ to reset the content of the cell. The computations for these gates are shown in Equation 13, Equation 14 and Equation 15. For a more visual approach please see Figure 8 in Appendix A.

$$I_t = \sigma\left(X_t W_{xi} + H_{t-1} W_{hi} + b_i\right) \qquad (13)$$
$$F_t = \sigma\left(X_t W_{xf} + H_{t-1} W_{hf} + b_f\right) \qquad (14)$$
$$O_t = \sigma\left(X_t W_{xo} + H_{t-1} W_{ho} + b_o\right) \qquad (15)$$

The shown equations use $W_{xi}, W_{xf}, W_{xo} \in \mathbb{R}^{d \times h}$ and $W_{hi}, W_{hf}, W_{ho} \in \mathbb{R}^{h \times h}$ as weight matrices, while $b_i, b_f, b_o \in \mathbb{R}^{1 \times h}$ are their respective biases. Further, they use the sigmoid activation function $\sigma$, so that each gate results in a vector with entries in $(0, 1)$.
Next, we need a candidate memory cell $\tilde{C}_t \in \mathbb{R}^{n \times h}$, which has a similar computation to the previously mentioned gates but instead uses a $\tanh$ activation function so that its output lies in $(-1, 1)$. Further, it again has its own weights $W_{xc} \in \mathbb{R}^{d \times h}$, $W_{hc} \in \mathbb{R}^{h \times h}$ and bias $b_c \in \mathbb{R}^{1 \times h}$. The respective computation is shown in Equation 16. See Figure 9 in Appendix A for a visualisation of this enhancement.
$$\tilde{C}_t = \tanh\left(X_t W_{xc} + H_{t-1} W_{hc} + b_c\right) \qquad (16)$$
To plug things together we introduce the old memory content $C_{t-1} \in \mathbb{R}^{n \times h}$, which together with the introduced gates controls how much of the old memory content we want to preserve to get to the new memory content $C_t$. This is shown in Equation 17, where $\odot$ denotes element-wise multiplication. The structure so far can be seen in Figure 10 in Appendix A.

$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \qquad (17)$$
The last step is to introduce the computation for the hidden states $H_t \in \mathbb{R}^{n \times h}$ into the framework. This can be seen in Equation 18.

$$H_t = O_t \odot \tanh\left(C_t\right) \qquad (18)$$

With the $\tanh$ function we ensure that each element of $H_t$ lies in $(-1, 1)$. The full LSTM framework can be seen in Figure 11 in Appendix A.
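As a compact illustration of Equations 13 to 18, a single LSTM forward step could be sketched as follows. This is a toy example under the notation above, not reference code; the random initialisation and the chosen dimensions are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(X_t, H_prev, C_prev, p):
    """One LSTM time step following Equations 13-18.

    X_t: (n, d) input batch, H_prev/C_prev: (n, h) previous hidden/memory state,
    p: dict of weights Wx* (d, h), Wh* (h, h) and biases b* (1, h).
    """
    I_t = sigmoid(X_t @ p["Wxi"] + H_prev @ p["Whi"] + p["bi"])      # input gate,   Eq. (13)
    F_t = sigmoid(X_t @ p["Wxf"] + H_prev @ p["Whf"] + p["bf"])      # forget gate,  Eq. (14)
    O_t = sigmoid(X_t @ p["Wxo"] + H_prev @ p["Who"] + p["bo"])      # output gate,  Eq. (15)
    C_tilde = np.tanh(X_t @ p["Wxc"] + H_prev @ p["Whc"] + p["bc"])  # candidate,    Eq. (16)
    C_t = F_t * C_prev + I_t * C_tilde                               # new memory,   Eq. (17)
    H_t = O_t * np.tanh(C_t)                                         # new hidden,   Eq. (18)
    return H_t, C_t

# Tiny usage example with arbitrary sizes (n=2 samples, d=4 inputs, h=3 hidden units).
rng = np.random.default_rng(0)
n, d, h = 2, 4, 3
p = {f"Wx{g}": rng.normal(size=(d, h)) * 0.1 for g in "ifoc"}
p.update({f"Wh{g}": rng.normal(size=(h, h)) * 0.1 for g in "ifoc"})
p.update({f"b{g}": np.zeros((1, h)) for g in "ifoc"})
H, C = lstm_step(rng.normal(size=(n, d)), np.zeros((n, h)), np.zeros((n, h)), p)
```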
5 Deep Recurrent Neural Networks (DRNNs)
Deep Recurrent Neural Networks (DRNNs) are conceptually simple. To construct a deep RNN with $L$ hidden layers we simply stack ordinary RNNs of any type on top of each other. Each hidden state $H_t^{(\ell)} \in \mathbb{R}^{n \times h}$ is passed to the next time step of the current layer $H_{t+1}^{(\ell)}$ as well as to the current time step of the next layer $H_t^{(\ell+1)}$. For the first layer we compute the hidden state as proposed in the previous models, shown in Equation 19, while for the subsequent layers we use Equation 20, where the hidden state from the previous layer is treated as input.
$$H_t^{(1)} = \phi_1\left(X_t, H_{t-1}^{(1)}\right) \qquad (19)$$
$$H_t^{(\ell)} = \phi_\ell\left(H_t^{(\ell-1)}, H_{t-1}^{(\ell)}\right) \qquad (20)$$
The output $O_t \in \mathbb{R}^{n \times o}$, where $o$ is the number of outputs, is then computed as shown in Equation 21, where we only use the hidden state of layer $L$.
$$O_t = \phi_o\left(H_t^{(L)} W_{ho} + b_o\right) \qquad (21)$$
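A minimal sketch of this stacking, assuming a simple tanh RNN cell in every layer (the helper names and dimensions are illustrative and not taken from the text):

```python
import numpy as np

def deep_rnn_forward(X_seq, layers):
    """Run a stack of simple tanh RNN layers over a sequence, in the spirit of Eqs. (19)-(21).

    X_seq: array of shape (T, n, d); layers: list of dicts with Wxh, Whh, bh.
    Returns the top-layer hidden states of shape (T, n, h).
    """
    T, n, _ = X_seq.shape
    h = layers[0]["Whh"].shape[0]
    H = [np.zeros((n, h)) for _ in layers]        # one hidden state per layer
    top_states = []
    for t in range(T):
        inp = X_seq[t]
        for l, p in enumerate(layers):            # layer l receives the output of layer l-1
            H[l] = np.tanh(inp @ p["Wxh"] + H[l] @ p["Whh"] + p["bh"])
            inp = H[l]
        top_states.append(H[-1])                  # only layer L feeds the output, Eq. (21)
    return np.stack(top_states)

# Toy usage: two stacked layers, sequence length 5, batch of 2.
rng = np.random.default_rng(1)
T, n, d, h = 5, 2, 4, 3
layers = [{"Wxh": rng.normal(size=(d if l == 0 else h, h)) * 0.1,
           "Whh": rng.normal(size=(h, h)) * 0.1,
           "bh": np.zeros((1, h))} for l in range(2)]
H_top = deep_rnn_forward(rng.normal(size=(T, n, d)), layers)
```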
6 Bidirectional Recurrent Neural Networks (BRNNs)
Let us take language modelling as an example for now. Based on our current models we are able to reliably predict the next sequence element (i.e. the next word) based on what we have seen so far. However, there are scenarios where we might want to fill in a gap in a sentence and the part of the sentence after the gap conveys significant information. Taking this information into account is necessary to perform well on this kind of task [24]. On a more general level, we want to incorporate a look-ahead property for sequences.
To achieve this look-ahead property, Bidirectional Recurrent Neural Networks (BRNNs) [14] were introduced, which add another hidden layer that runs through the sequence backwards, starting from the last element [24]. An architectural overview is visualised in Figure 2. Here, we introduce a forward hidden state $\overrightarrow{H}_t \in \mathbb{R}^{n \times h}$ and a backward hidden state $\overleftarrow{H}_t \in \mathbb{R}^{n \times h}$. Their respective calculations are shown in Equation 22 and Equation 23.
$$\overrightarrow{H}_t = \phi\left(X_t W_{xh}^{(f)} + \overrightarrow{H}_{t-1} W_{hh}^{(f)} + b_h^{(f)}\right) \qquad (22)$$
$$\overleftarrow{H}_t = \phi\left(X_t W_{xh}^{(b)} + \overleftarrow{H}_{t+1} W_{hh}^{(b)} + b_h^{(b)}\right) \qquad (23)$$
For that, we have similar weight matrices as in the definitions before, but now they are separated into two sets. One set of weight matrices is for the forward hidden states, $W_{xh}^{(f)} \in \mathbb{R}^{d \times h}$ and $W_{hh}^{(f)} \in \mathbb{R}^{h \times h}$, while the other one is for the backward hidden states, $W_{xh}^{(b)} \in \mathbb{R}^{d \times h}$ and $W_{hh}^{(b)} \in \mathbb{R}^{h \times h}$. They also have their respective biases $b_h^{(f)} \in \mathbb{R}^{1 \times h}$ and $b_h^{(b)} \in \mathbb{R}^{1 \times h}$. With that, we can compute the output $O_t \in \mathbb{R}^{n \times o}$, with $o$ being the number of outputs and $\frown$ denoting the concatenation of the two hidden state matrices along the hidden dimension (yielding an $n \times 2h$ matrix).
$$O_t = \phi\left(\left[\overrightarrow{H}_t \frown \overleftarrow{H}_t\right] W_{ho} + b_o\right) \qquad (24)$$
Again, we have a weight matrix $W_{ho} \in \mathbb{R}^{2h \times o}$ and bias parameters $b_o \in \mathbb{R}^{1 \times o}$. Keep in mind that the two directions can have different numbers of hidden units.
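The bidirectional computation of Equations 22 to 24 can be sketched as follows, again assuming a simple tanh cell; all names and sizes are illustrative.

```python
import numpy as np

def birnn_forward(X_seq, fwd, bwd, Who, bo):
    """Bidirectional RNN following Equations 22-24.

    X_seq: (T, n, d); fwd/bwd: dicts with Wxh (d, h), Whh (h, h), bh (1, h);
    Who: (2h, o), bo: (1, o). Returns outputs of shape (T, n, o).
    """
    T, n, _ = X_seq.shape
    h = fwd["Whh"].shape[0]
    Hf, Hb = [None] * T, [None] * T
    state = np.zeros((n, h))
    for t in range(T):                                    # forward pass, Eq. (22)
        state = np.tanh(X_seq[t] @ fwd["Wxh"] + state @ fwd["Whh"] + fwd["bh"])
        Hf[t] = state
    state = np.zeros((n, h))
    for t in reversed(range(T)):                          # backward pass, Eq. (23)
        state = np.tanh(X_seq[t] @ bwd["Wxh"] + state @ bwd["Whh"] + bwd["bh"])
        Hb[t] = state
    # Concatenate both directions along the hidden dimension and project, Eq. (24).
    return np.stack([np.concatenate([Hf[t], Hb[t]], axis=1) @ Who + bo for t in range(T)])
```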
7 Encoder-Decoder Architecture (seq2seq)
The Encoder-Decoder or sequence-to-sequence (seq2seq) architecture is used for tasks such as speech recognition on mobile devices [13] or labeling video data [18]. It mainly focuses on mapping a fixed-length input sequence of size $n$ to a fixed-length output sequence of size $m$, where $n \neq m$ can be true but is not a necessity.
A visualisation of the proposed architecture is shown in Figure 4. Here, we have an encoder which consists of an RNN accepting a single element of the sequence $X_t$, where $t$ is the order of the sequence element. These RNNs can be LSTMs or Gated Recurrent Units (GRUs) to further improve performance [16]. Further, the hidden states $H_t$ are computed according to the definition of the hidden states in the used RNN type (e.g. LSTM or GRU). The Encoder Vector (context) is a representation of the last hidden state of the encoder network which aims to aggregate the information from all previous input elements. It functions as the initial hidden state of the decoder network and enables the decoder to make accurate predictions. The decoder network again is built of an RNN which predicts an output $Y_t$ at time step $t$. The produced output is again a sequence where each $Y_t$ is a sequence element with order $t$. At each time step the RNN accepts a hidden state from the previous unit and itself produces an output as well as a new hidden state.
Figure 4: Encoder-Decoder (seq2seq) architecture, showing the encoder RNNs, the Encoder Vector, and the decoder RNNs.
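To make the described data flow concrete, here is a heavily simplified encoder-decoder sketch. It assumes plain tanh RNN cells (instead of the LSTMs or GRUs mentioned above) and greedy decoding; none of the names are taken from the cited works.

```python
import numpy as np

def rnn_cell(x, h, p):
    # Simple tanh RNN cell; in practice this would be an LSTM or GRU cell.
    return np.tanh(x @ p["Wxh"] + h @ p["Whh"] + p["bh"])

def seq2seq_generate(X_seq, enc, dec, Who, bo, embed, start_vec, m):
    """Encode X_seq into a context vector, then decode m output steps greedily."""
    h = np.zeros((1, enc["Whh"].shape[0]))
    for x in X_seq:                       # encoder: consume the whole input sequence
        h = rnn_cell(x, h, enc)
    context = h                           # Encoder Vector: last encoder hidden state
    outputs = []
    y_in, s = start_vec, context          # decoder starts from the context vector
    for _ in range(m):
        s = rnn_cell(y_in, s, dec)        # decoder hidden state
        logits = s @ Who + bo
        idx = int(np.argmax(logits))      # greedy choice of the next output element
        outputs.append(idx)
        y_in = embed[idx][None, :]        # feed the chosen element back in
    return outputs

# Toy usage with random parameters and a vocabulary of 5 output elements.
rng = np.random.default_rng(7)
d, h, vocab = 4, 3, 5
enc = {"Wxh": rng.normal(size=(d, h)), "Whh": rng.normal(size=(h, h)), "bh": np.zeros((1, h))}
dec = {"Wxh": rng.normal(size=(d, h)), "Whh": rng.normal(size=(h, h)), "bh": np.zeros((1, h))}
out = seq2seq_generate([rng.normal(size=(1, d)) for _ in range(6)], enc, dec,
                       rng.normal(size=(h, vocab)), np.zeros(vocab),
                       rng.normal(size=(vocab, d)), rng.normal(size=(1, d)), m=4)
```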
The Encoder Vector (context) was shown to be a bottleneck for this type of model since it needs to contain all the necessary information of a source sentence in a fixed-length vector, which is particularly problematic for long sequences. There have been approaches to solve this problem by introducing Attention, for example in [4] or [10]. In the next section, we take a closer look at the proposed solutions.
8 Attention
An example of such an alignment between a source sentence and its translation can be seen in Figure 5, where white denotes high correlation while black denotes low correlation. This method isn't limited to two sentences of different languages as seen in the example, but can also be applied to the same sentence, which is then called self-attention.
Figure 5: Example of an alignment matrix for “L’accord sur la zone économique européenne a été signé en août 1992” (French) and its English translation “The agreement on the European Economic Area was signed in August 1992”: [4]
8.1 Definition
To help the seq2seq model better deal with long sequences, the attention mechanism was introduced. Instead of constructing the Encoder Vector out of the last hidden state of the encoder network, attention introduces shortcuts between the context vector and the entire source input. A visualisation of this process can be seen in Figure 6. Here, we have a source sequence $X$ of length $n$ and try to output a target sequence $Y$ of length $m$. In that regard the formulation is rather similar to the one we described before in Section 7. We have an overall hidden state $H_{t'}$ which is the concatenated version of the forward and backward pass as shown in Equation 25. Also, the hidden state of the decoder network is denoted as $S_t$ while the encoder vector (context vector) is denoted as $C_t$. Both of these are shown in Equation 26 and Equation 27 respectively.
$$H_{t'} = \left[\overrightarrow{H}_{t'} \frown \overleftarrow{H}_{t'}\right] \qquad (25)$$
The context vector $C_t$ is a sum of hidden states of the input sequence, each weighted with an alignment score $\alpha_{t,t'}$ where $\sum_{t'=1}^{T} \alpha_{t,t'} = 1$. This is shown in Equation 27 as well as Equation 28.
$$C_t = \sum_{t'=1}^{T} \alpha_{t,t'} \cdot H_{t'} \qquad (27)$$
The alignment $\alpha_{t,t'}$ assigns an alignment score to the pair consisting of the input at position $t'$ and the output at position $t$. This score is based on how well this pair matches [21]. The set of all alignment scores defines how much each source hidden state should be considered for each output [21]. Please see Appendix B for a more intuitive and visual explanation of the attention mechanism in the seq2seq model.
Generally, there are different implementations of this score function which have been used in various works. Table 1 gives an overview of their respective names, equations and usage in publications. Here, we have two trainable weight matrices in the alignment model, denoted as $v_a$ and $W_a$.
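To illustrate how such score functions plug into the context-vector computation of Equation 27, consider the following sketch. The additive and dot-product scores follow the common formulations associated with [4] and [11]; the array names and sizes are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_score(s_t, H, v_a, W_a):
    # score(S_t, H_t') = v_a^T tanh(W_a [S_t ; H_t']) for every encoder position t'.
    concat = np.concatenate([np.repeat(s_t[None, :], len(H), axis=0), H], axis=1)
    return np.tanh(concat @ W_a.T) @ v_a

def dot_score(s_t, H):
    # score(S_t, H_t') = S_t^T H_t' (requires matching dimensions).
    return H @ s_t

def context_vector(s_t, H, score_fn):
    alpha = softmax(score_fn(s_t, H))     # alignment weights, summing to 1
    return alpha @ H, alpha               # weighted sum of encoder states, Eq. (27)

# Toy usage: 4 encoder states of size 3, one decoder state of size 3.
rng = np.random.default_rng(2)
H = rng.normal(size=(4, 3))
s_t = rng.normal(size=3)
W_a, v_a = rng.normal(size=(5, 6)), rng.normal(size=5)
C_add, _ = context_vector(s_t, H, lambda s, Hs: additive_score(s, Hs, v_a, W_a))
C_dot, _ = context_vector(s_t, H, dot_score)
```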
Figure 6: Encoder-Decoder architecture with additive attention mechanism adapted from: [4]
Table 1: Different score functions with their respective equations and usage adapted from: [21]
The Scaled Dot-Product attention used in [17] scales the dot-product by the square root of the dimension of the key vectors. This is motivated by the problem that for large dimensions the dot products grow large in magnitude, pushing the softmax function into regions where it has an extremely small gradient, which is a problem for efficient learning.
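A minimal sketch of scaled dot-product attention as described in [17], restricted to a single head and without masking; the Q/K/V naming convention is adopted from that paper, everything else is illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # scaled pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted sum of the values

# Toy usage: 5 query positions attending over 7 key/value positions of size 8.
rng = np.random.default_rng(3)
out = scaled_dot_product_attention(rng.normal(size=(5, 8)),
                                   rng.normal(size=(7, 8)),
                                   rng.normal(size=(7, 8)))
```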
8.3 Transformer
By incorporating this attention mechanism, the Transformer [17] was introduced, which, based on the encoder-decoder architecture, achieves parallelisation by capturing recurrence with attention while at the same time encoding each item's position in the sequence [24]. In fact, it does not use any recurrent network units at all and relies entirely on the self-attention mechanism to improve performance. The encoding part of the architecture is made up of several encoders (e.g. six encoders in [17]) while the decoding part consists of the same number of decoders. A general overview of the architecture is illustrated in Figure 7.
Here, each encoder component consists of two sub-layers: Self-Attention and a Feed-Forward Neural Network. Similarly, those two sub-layers are found in each decoder component, but with an Encoder-Decoder Attention sub-layer in between them which works similarly to the attention used in the seq2seq model. The deployed attention layers are not ordinary attention layers but use a method called Multi-Headed Attention, which improves the performance of the attention layer. This allows the model to jointly attend to information from different representation subspaces at different positions, which in simpler terms means running several attention computations in parallel and concatenating the results [17].
Figure 7: Model Architecture of the Transformer: [17]
Unfortunately, explaining the design choices and mathematical formulations contained in multi-headed attention would be too much detail at this point. Please refer to the original paper [17] for more information. The architecture shown in Figure 7 also deploys skip connections and layer normalisation for each sub-layer of the encoder as well as the decoder. One thing to note is that the input as well as the output get embedded and a positional encoding is applied which represents the proximity of sequence elements (see Appendix C).
The final linear and softmax layers turn the vector of floats which is the output of the decoder stack into a word. This is done by transforming the vector through the linear layer into a much larger vector called the logits vector [1]. This logits vector has the size of the learned vocabulary from the training dataset, where each cell corresponds to the score of a unique word [1]. By applying a softmax function we turn those scores into probabilities which sum up to 1, and therefore we can choose the cell (i.e. the word) with the highest probability as the output for this particular time step.
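A small sketch of this final projection step; the toy vocabulary, the vector sizes and the greedy selection are illustrative assumptions.

```python
import numpy as np

def decode_token(decoder_output, W_vocab, b_vocab, vocab):
    """Project a decoder output vector to vocabulary logits and pick the best word."""
    logits = decoder_output @ W_vocab + b_vocab        # one score per vocabulary entry
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                        # softmax: probabilities summing to 1
    return vocab[int(np.argmax(probs))], probs

vocab = ["I", "am", "a", "student", "<END>"]
rng = np.random.default_rng(4)
word, probs = decode_token(rng.normal(size=6), rng.normal(size=(6, len(vocab))),
                           np.zeros(len(vocab)), vocab)
```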
Pointer Networks (Ptr-Nets) [19] adapt the seq2seq model with attention to improve it by not fixing
the discrete categories (i.e. elements) of the output dictionary a priori. Instead of yielding an output
sequence generated from an input sequence, a pointer network creates a succession of pointers to the
elements of the input series [25]. In [19] they show that by using Pointer Networks they can solve
combinatorial optimization problems such as computing planar convex hulls, Delaunay triangulations
and the symmetric planar Travelling Salesman Problem (TSP).
Generally, we apply additive attention (from Table 1) between states and then normalize it by applying
the softmax function to model the output conditional probability as seen in Equation 29.
$$Y_t = \mathrm{softmax}\left(\mathrm{score}\left(S_t, H_{t'}\right)\right) = \mathrm{softmax}\left(v_a^{\top} \tanh\left(W_a \left[S_t; H_{t'}\right]\right)\right) \qquad (29)$$
The attention mechanism is simplified, as Ptr-Net does not blend the encoder states into the output
with attention weights. In this way, the output only responds to the positions but not the input content
[21].
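A sketch of this pointing step in the spirit of Equation 29, reusing the additive scoring idea from Section 8.1; note that the softmax output is used directly as a distribution over input positions instead of being blended into a context vector. All names are illustrative.

```python
import numpy as np

def pointer_step(S_t, H_enc, W_a, v_a):
    """Return a probability distribution over the input positions, in the spirit of Eq. (29)."""
    n = H_enc.shape[0]
    concat = np.concatenate([np.repeat(S_t[None, :], n, axis=0), H_enc], axis=1)
    scores = np.tanh(concat @ W_a.T) @ v_a            # one additive score per input position
    e = np.exp(scores - scores.max())
    return e / e.sum()                                # softmax over positions, not content

rng = np.random.default_rng(5)
probs = pointer_step(rng.normal(size=3), rng.normal(size=(6, 3)),
                     rng.normal(size=(4, 6)), rng.normal(size=4))
pointed_position = int(np.argmax(probs))              # "pointer" to an element of the input
```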
References
[1] Jay Alammar. The Illustrated Transformer. 2018.
[2] Jay Alammar. Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models
With Attention). 2018.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by
Jointly Learning to Align and Translate. 2014.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Translation by
Jointly Learning to Align and Translate”. In: 3rd International Conference on Learning Repre-
sentations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Ed. by Yoshua Bengio and Yann LeCun. 2015.
[5] Y. Bengio, Patrice Simard, and Paolo Frasconi. “Learning long-term dependencies with gradient descent is difficult”. In: IEEE Transactions on Neural Networks 5.2 (1994), pp. 157–166.
[6] Gang Chen. A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation.
2016.
[7] Junyoung Chung et al. “Gated Feedback Recurrent Neural Networks”. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37. ICML’15. Lille, France: JMLR.org, 2015, pp. 2067–2075.
[8] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. 2014.
[9] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Computation 9.8 (1997), pp. 1735–1780.
[10] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to
Attention-based Neural Machine Translation. 2015.
[11] Thang Luong, Hieu Pham, and Christopher D. Manning. “Effective Approaches to Attention-
based Neural Machine Translation”. In: Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational
Linguistics, Sept. 2015, pp. 1412–1421.
[12] Chris Nicholson. A Beginner’s Guide to LSTMs and Recurrent Neural Networks. https://skymind.ai/wiki/lstm. Accessed: 06 November 2019. 2019.
[13] Rohit Prabhavalkar et al. “A Comparison of Sequence-to-Sequence Models for Speech Recog-
nition”. In: INTERSPEECH. 2017.
[14] Mike Schuster and Kuldip K. Paliwal. “Bidirectional recurrent neural networks”. In: IEEE
Trans. Signal Processing 45 (1997), pp. 2673–2681.
[15] Ilya Sutskever. “Training Recurrent Neural Networks”. AAINS22066. PhD thesis. Toronto,
Ont., Canada, 2013.
[16] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to Sequence Learning with Neural
Networks”. In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani
et al. Curran Associates, Inc., 2014, pp. 3104–3112.
[17] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in Neural Information
Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 5998–6008.
[18] S. Venugopalan et al. “Sequence to Sequence – Video to Text”. In: 2015 IEEE International
Conference on Computer Vision (ICCV). 2015, pp. 4534–4542.
[19] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. “Pointer Networks”. In: Advances in
Neural Information Processing Systems 28. Ed. by C. Cortes et al. Curran Associates, Inc.,
2015, pp. 2692–2700.
[20] Oriol Vinyals et al. “Grandmaster level in StarCraft II using multi-agent reinforcement learn-
ing”. In: Nature 575.7782 (Oct. 2019), pp. 350–354.
[21] Lilian Weng. “Attention? Attention!” In: lilianweng.github.io/lil-log (2018). Accessed: 09
November 2019.
[22] R. J. Williams and J. Peng. “An Efficient Gradient-Based Algorithm for On-Line Training of
Recurrent Network Trajectories”. In: Neural Computation 2.4 (1990), pp. 490–501.
[23] Yonghui Wu et al. Google’s Neural Machine Translation System: Bridging the Gap between
Human and Machine Translation. 2016.
[24] Aston Zhang et al. Dive into Deep Learning. http://www.d2l.ai. 2019.
[25] Z. Zygmunt. Introduction to pointer networks. Accessed: 22 November 2019. 2017.
Appendices
A Visual Representation of LSTMs
In this section we consecutively construct the full architecture of Long Short-Term Memory Units (LSTMs) as explained in Section 4. For a description of what changes between each step, please read Section 4 or refer to the source of the illustrations [24].
Figure 9: Computation of candidate memory cells in LSTM: [24]
B Visual Representation of seq2seq with Attention
The seq2seq model with attention passes a lot more data from the encoder to the decoder than the regular seq2seq model. Instead of passing only the last hidden state of the encoding stage, the encoder passes all of its hidden states to the decoder. The first step of the decoder part in the seq2seq model with attention is illustrated in Figure 12, where we pass “I am a student” to the encoder and expect a translation to French producing “je suis un étudiant”. Here, all the hidden states of the encoder $H_1$, $H_2$, $H_3$ are passed to the attention decoder, as well as the embedding of the <END> token and an initial decoder hidden state $H_{init}$.
Figure 12: Seq2Seq Model with Attention Mechanism Step 1 adapted from: [2]
Next, we produce an output and a new hidden state vector H4 . However, the output is discarded.
This can be seen in Figure 13.
Figure 13: Seq2Seq Model with Attention Mechanism Step 2 adapted from: [2]
For the attention step we use this produced hidden state vector $H_4$ and the hidden states from the encoder $H_1$, $H_2$, $H_3$ to produce a context vector $C_4$ (blue). This process can be seen in Figure 14. Each encoder hidden state is most associated with a certain word in the input sentence [2]. When we give these hidden states scores and apply a softmax to them, we generate probability values. These probabilities are represented by the three-element pink vector, where light values stand for high probabilities while dark values denote low probabilities.
Next, we multiply each hidden state vector $H_1$, $H_2$, $H_3$ by its softmaxed score, which amplifies hidden states with high scores and diminishes hidden states with low scores. This is visualised by graying out the hidden states $H_2$ and $H_3$ while keeping $H_1$ in solid color.
Figure 14: Seq2Seq Model with Attention Mechanism Step 4 adapted from: [2]
After that, we concatenate this produced context vector C4 with the produced hidden state H4 . One
can see this process in Figure 15. This process just stacks the two vectors on top of each other.
Figure 15: Seq2Seq Model with Attention Mechanism Step 5 adapted from: [2]
This concatenated version of hidden state H4 and context vector C4 is then passed into a jointly
trained Feedforward Neural Network. This network is visualised by the red box with round edges in
Figure 16. The output of this network then represents the output of the current time step t which in
this case represents the word “I”. This basically concludes all the steps needed at each iteration step.
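Putting the steps of Figures 12 to 16 together, one decoder iteration could be sketched as follows. This is a toy illustration consistent with the description above; the cell, scoring function and feed-forward layer are stand-ins, not the components used in [2].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_decoder_step(enc_states, h_prev, y_embed, cell, score, ffn):
    """One step of the attention decoder walked through in Figures 12-16."""
    h_new = cell(y_embed, h_prev)                      # e.g. H_4 from the <END> embedding
    weights = softmax(np.array([score(h_new, h) for h in enc_states]))
    context = sum(w * h for w, h in zip(weights, enc_states))   # C_4: weighted encoder states
    concat = np.concatenate([h_new, context])          # stack hidden state and context vector
    return ffn(concat), h_new                          # feed-forward net yields the output word

# Toy usage with stand-in components.
rng = np.random.default_rng(6)
enc_states = [rng.normal(size=4) for _ in range(3)]           # H_1, H_2, H_3
cell = lambda x, h: np.tanh(x + h)                            # placeholder RNN cell
score = lambda s, h: float(s @ h)                             # dot-product score
W_out = rng.normal(size=(8, 5))
ffn = lambda v: int(np.argmax(v @ W_out))                     # index of the output word
word_idx, h4 = attention_decoder_step(enc_states, rng.normal(size=4),
                                      rng.normal(size=4), cell, score, ffn)
```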
Figure 16: Seq2Seq Model with Attention Mechanism Step 6 adapted from: [2]
If we take a look at the next iteration step in Figure 17, we can see that the output from the previous step is passed to the decoder together with the previous hidden state $H_4$ instead of the <END> token. All the other steps are the same as in the previous iteration. However, we can see that the hidden state $H_2$ now has the best score during the attention stage. Again, this is represented by the lightest shade of pink in the score vector. By multiplying the scores with the hidden states we obtain two reduced hidden states $H_1$ and $H_3$ while keeping $H_2$ as the most active hidden state. This results in the word “am” being produced as the output of the Feedforward Neural Network for this time step.
Figure 17: Seq2Seq Model with Attention Mechanism Step 7 adapted from: [2]
Obviously, there are still two more attention decoder time steps which are omitted here for illustration purposes. The functionality of each of those steps, however, is equivalent to the time steps already seen.
15
C Positional Encoding
What we can see here is that close words have closer encodings while distant words have more different encodings. Generally, this is a method for encoding the position of a given sequence element, similar in spirit to a binary encoding.
The choice of such a positional encoding algorithm is certainly not the main contribution of [17], but it is a relevant concept to understand at least in theory, since it boosts performance.
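Since the appendix only sketches the idea, here is a minimal version of the sinusoidal positional encoding used in [17]; the function and array names are illustrative and the visualisation itself is omitted.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from [17]:
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                       # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]                   # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Nearby positions receive similar encodings; distant positions differ more.
pe = positional_encoding(max_len=50, d_model=16)
```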