DLAI4 Networks Recurrent
Contents
1 Introduction
2 Dynamical systems
6 Applications
1 Introduction
So far, we have only really looked at feedforward neural networks, characterised by an architecture in which we start with inputs, progress through layers, and reach an output.
In this lecture we depart from this architecture a little and look at recurrent neural networks: neural networks specialised to model sequential data.
2 Dynamical systems
We shall start by considering a deterministic dynamical system.
Definition 1. A deterministic discrete dynamical system {ht}, t ∈ {0, 1, 2, . . . }, is a sequence of random variables given by

ht = f(ht−1; θ)    (1)

for t ≥ 1, where f (the 'transition function') is a Borel-measurable function and θ is a set of parameters.

We assume the states ht are random variables, although all randomness comes from the initial state h0. We denote by H(t) the σ-algebra generated by ht, and note that H(t) ⊆ H(t − 1).
Let’s quickly establish a simple property of these objects (example 17.1.1, Calin [2020]):
Theorem 1. Suppose {ht} is a deterministic discrete dynamical system whose transition function f satisfies ||∂f/∂h||_2 ≤ λ < 1. Then we have

P(lim_{t→∞} ht = c) = 1    (2)

for some constant c; that is, ht → c almost surely.
Proof. By the mean-value theorem, we have

||f(h; θ) − f(h′; θ)|| ≤ sup_h ||∂f/∂h|| ||h − h′|| ≤ λ||h − h′||,
that is, f is a contraction. Recalling that the ht are random variables, suppose we consider two possible values of the initial state h0. Let ω, ω′ be the corresponding points in the sample space, so the values taken are h0(ω), h0(ω′). The recursion then gives h1(ω) = f(h0(ω), θ), h1(ω′) = f(h0(ω′), θ), h2(ω) = f(h1(ω), θ), h2(ω′) = f(h1(ω′), θ), and so on. We have

||ht(ω) − ht(ω′)|| = ||f(ht−1(ω), θ) − f(ht−1(ω′), θ)|| ≤ λ||ht−1(ω) − ht−1(ω′)||
≤ λ^2 ||ht−2(ω) − ht−2(ω′)|| ≤ . . . ≤ λ^t ||h0(ω) − h0(ω′)||
→ 0    (3)

for every pair ω, ω′. Moreover, since f is a contraction on a complete space, the Banach fixed-point theorem gives a unique point c with f(c; θ) = c, and the same argument with h0(ω′) replaced by c shows ||ht(ω) − c|| ≤ λ^t ||h0(ω) − c|| → 0 for every ω. Hence P({ω : lim_{t→∞} ht(ω) = c}) = 1, as needed.
In this case, the sequence H(t) of σ-algebras gradually loses all of its information: the information I(ht) tends to 0. This is analogous to the vanishing gradient problem which we encountered in the chapter on training of neural networks.
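To see Theorem 1 in action, here is a minimal numerical sketch; the particular contraction f(h) = 0.5 tanh(h) + 1 and the spread of initial states are arbitrary illustrative choices, not anything fixed by the notes. Trajectories started from several different values of h0 all collapse onto the same constant c.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(h):
    # A contraction on R: |f'(h)| = 0.5 * (1 - tanh(h)^2) <= 0.5 = lambda < 1.
    return 0.5 * np.tanh(h) + 1.0

h = rng.normal(scale=10.0, size=5)   # five different initial states h0

for t in range(50):
    h = f(h)                         # h_t = f(h_{t-1})

print(h)  # all five trajectories have (numerically) converged to the same constant c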
We now consider a more general dynamical system, in which we inject new randomness at each t, governed by a random process Xt (for our purposes, a random process is just a sequence of random variables, not necessarily independent, but defined on the same probability space).

Definition 2. A discrete dynamical system {ht}, t ∈ {0, 1, 2, . . . }, is a sequence of random variables given by

ht = f(ht−1, Xt; θ)    (4)

for t ≥ 1, where f is a Borel-measurable function, Xt is a random process, and θ is a set of parameters.
Now our sequence of σ-algebras is a little more complicated. Recalling our notation S(x) for the σ-algebra generated by x, so that H(t − 1) = S(ht−1), and denoting It = S(Xt), we have

S({ht−1, Xt}) = S(S(ht−1) ∪ S(Xt)) = S(H(t − 1) ∪ It)

and hence

S(ht) ⊆ S({ht−1, Xt}) = S(H(t − 1) ∪ It).
If we consider each block Xt → ht → Yt separately, and model it as a neural network with input Xt, hidden layer ht and output Yt, then we get what is called a recurrent neural network.
Making this setup a little more formal, let us say that the sample space of ht is R^k and the sample space of Xt is R^p. We set this up as

ht = ϕ(W ht−1 + U Xt + b)
Yt = V ht + c

where ϕ is an activation function applied elementwise, W, U and V are weight matrices, and b and c are bias vectors.
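Below is a minimal NumPy sketch of this forward recursion; the dimensions, the choice ϕ = tanh, the one-dimensional output and the random parameter values are illustrative assumptions rather than anything fixed by the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
k, p, T = 3, 2, 10          # hidden dimension, input dimension, sequence length

# Parameters: W maps hidden -> hidden, U maps input -> hidden, V maps hidden -> output.
W = rng.normal(scale=0.5, size=(k, k))
U = rng.normal(scale=0.5, size=(k, p))
b = np.zeros(k)
V = rng.normal(scale=0.5, size=(1, k))
c = np.zeros(1)

phi = np.tanh               # activation function

X = rng.normal(size=(T, p)) # an input sequence X_1, ..., X_T
h = np.zeros(k)             # initial hidden state h_0

Y = []
for t in range(T):
    h = phi(W @ h + U @ X[t] + b)   # h_t = phi(W h_{t-1} + U X_t + b)
    Y.append(V @ h + c)             # Y_t = V h_t + c
Y = np.array(Y)
print(Y.shape)   # (T, 1): one output per time step
```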
The essence of the ‘sequence’ of data comes from the matrix W (or the function f in our dynamical system
formulation).
Restating the recursion with index i running over the elements of the sequence:

hi = ϕ(W hi−1 + U Xi + b)
Yi = V hi + c
Suppose we attach a loss Li to each output Yi and write the total loss as L = Σi Li. Differentiating with respect to the output bias c, we have
∂L/∂c = Σi ∂Li/∂c
      = Σi (∂Li/∂Yi)(∂Yi/∂c)
      = Σi (∂Li/∂Yi) ∂(V hi + c)/∂c
      = Σi ∂Li/∂Yi.
Computing the gradient of L with respect to W is harder. The value L1 depends on W only through h1, but L2 depends on W through both h1 and h2 (see figure 2), and in general Li depends on W through all of the values h1, h2, . . . , hi. To make matters worse, hi also depends on hi−1. Let us start by considering just L2: by the chain rule, the total derivative is

dh2/dW = ∂h2/∂W + (∂h2/∂h1)(∂h1/∂W),

so that ∂L2/∂W = (∂L2/∂h2)(dh2/dW),
and, finally,

∂L/∂W = Σi ∂Li/∂W
      = Σi (∂Li/∂hi)(dhi/dW)
      = Σi (∂Li/∂Yi) V (dhi/dW),

where dhi/dW is the total derivative, unrolling through all of the earlier hidden states (exercise 2 makes this recursion explicit).
Similar derivations hold for ∂L/∂U and ∂L/∂b, which are left as exercises.
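To make the derivation concrete, the following sketch implements backpropagation through time for the network above in NumPy, under the assumption of a squared-error loss Li = ½||Yi − target_i||² (the loss, the dimensions and the random parameters are illustrative choices, not taken from the notes). It computes ∂L/∂c as Σi ∂Li/∂Yi, accumulates ∂L/∂W, ∂L/∂U and ∂L/∂b by the backward recursion, and checks one entry of ∂L/∂W against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(2)
k, p, T = 3, 2, 8

# Random parameters and data (placeholder values).
W = rng.normal(scale=0.5, size=(k, k)); U = rng.normal(scale=0.5, size=(k, p))
b = np.zeros(k); V = rng.normal(scale=0.5, size=(1, k)); c = np.zeros(1)
X = rng.normal(size=(T, p)); target = rng.normal(size=(T, 1))

def forward(W, U, b, V, c):
    h, hs, Ys = np.zeros(k), [np.zeros(k)], []
    for t in range(T):
        h = np.tanh(W @ h + U @ X[t] + b)
        hs.append(h); Ys.append(V @ h + c)
    return np.array(hs), np.array(Ys)

hs, Ys = forward(W, U, b, V, c)           # hs[t] is h_t (hs[0] = h_0)
dY = Ys - target                          # dL_t/dY_t for L_t = 0.5 ||Y_t - target_t||^2

# Gradients that do not involve the recursion:
dV = sum(np.outer(dY[t], hs[t + 1]) for t in range(T))
dc = dY.sum(axis=0)                       # = sum_t dL_t/dY_t, as derived above

# Backpropagation through time for W, U, b.
dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
g = np.zeros(k)                           # dL/dh_t accumulated from later time steps
for t in range(T - 1, -1, -1):
    g = V.T @ dY[t] + g                   # add the direct contribution through Y_t
    ga = g * (1 - hs[t + 1] ** 2)         # backprop through tanh: dL/da_t
    dW += np.outer(ga, hs[t]); dU += np.outer(ga, X[t]); db += ga
    g = W.T @ ga                          # pass to h_{t-1} for the next iteration

# Finite-difference check of one entry of dW.
def loss(W_):
    _, Ys_ = forward(W_, U, b, V, c)
    return 0.5 * np.sum((Ys_ - target) ** 2)

eps = 1e-6
Wp = W.copy(); Wp[0, 1] += eps
print(dW[0, 1], (loss(Wp) - loss(W)) / eps)   # should agree closely
```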
4.2 Problems
This does not look great: there are many V's and W's being multiplied together, and indeed recurrent neural networks are very susceptible to both exploding gradients and vanishing gradients. If the eigenvalues of W all have magnitude less than 1, products of many factors of W shrink and we are very susceptible to vanishing gradients; if any eigenvalue has magnitude greater than 1, those products can grow rapidly and we are susceptible to exploding gradients.
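A quick numerical illustration of this claim (the matrix here is arbitrary, and the sketch ignores the activation derivatives, which can only shrink the product further): repeatedly multiplying a gradient by W^T, as the backward recursion does, scales its norm roughly like ρ(W)^t, where ρ(W) is the spectral radius.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))

for rho in (0.9, 1.1):
    # Rescale A so that its spectral radius (largest |eigenvalue|) equals rho.
    W = rho * A / max(abs(np.linalg.eigvals(A)))
    g = np.ones(4)
    for t in range(100):
        g = W.T @ g                  # 100 backward steps through the recursion
    print(rho, np.linalg.norm(g))    # shrinks towards 0 when rho < 1, grows large when rho > 1
```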
A recurrent network can be thought of as having long-term 'memory' encoded by its weights, which are updated during training and change slowly. The activations of the network (from calculating a forward pass) constitute short-term memory. In a sense, the vanishing gradient problem arises because long-term memory is too salient, and the network cannot react to new information quickly. Long short-term memory networks (LSTMs; Hochreiter and Schmidhuber [1997]) introduce a new type of memory encoded by a particular type of neuron.
To each ht we attribute a memory Ct, sometimes called an internal state. We may control whether the internal state:
• should be affected by a given input,
• should affect the output,
• should be changed.
Figure 3: Information flow through a long short-term memory network. Adapted from Zhang et al. [2021].
The flow of information consists of a few steps. Data passes through a series of 'gates': a 'forget' gate, an 'input' gate and an 'output' gate. In detail (adapted from Zhang et al. [2021]):
1. We begin with what we previously had: at time t, data Xt and a hidden state ht−1 from the previous time step, and we will output a current hidden state ht and an output Yt. We also have a 'memory' Ct−1 from the previous time step, and we will output Ct.
2. We first compute four functions F, I, O and C̃ of ht−1 and Xt. All are neural networks, parametrised by weight matrices WF, WI, WO, WC̃ for ht−1 and UF, UI, UO, UC̃ for Xt respectively, and biases bF, bI, bO, bC̃. The first three use the logistic activation function ϕ(x) = σ(x) = (1 + exp(−x))^−1, so take values in (0, 1), and C̃ uses ϕ(x) = tanh(x), so takes values in (−1, 1).
The function F is short for 'forget', I for 'input' and O for 'output'. The 'forget' function tells us how much of the old memory Ct−1 we will forget, and the 'input' function tells us how much of the new candidate memory C̃ we will retain. The dimensions of F, I, O and C̃ are all the same as those of Ct.
3. We combine F with the previous memory Ct−1 using a Hadamard product ⊙ (element-wise multiplication), and combine I and C̃ in the same way. We add these together to get the new memory state:
Ct = F ⊙ Ct−1 + I ⊙ C̃.
4. Finally, we take the Hadamard product of O(ht−1, Xt) and tanh(Ct) to get the new hidden state:
ht = O ⊙ tanh(Ct).
Why does this help? In a sense, it allows resetting of the hidden state, and the network learns when to do this, in
particular learning to skip irrelevant observations.
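Here is a minimal NumPy sketch of one step of the cell described above; the dimensions, random weights and the dictionary-of-gates layout are illustrative choices, and a practical implementation (see Zhang et al. [2021]) would add batching and an output layer.

```python
import numpy as np

rng = np.random.default_rng(4)
k, p = 3, 2                                     # hidden/memory dimension, input dimension

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))      # logistic activation for the gates

# One weight matrix pair and bias per function F, I, O, C~.
Ws = {g: rng.normal(scale=0.5, size=(k, k)) for g in "FIOC"}
Us = {g: rng.normal(scale=0.5, size=(k, p)) for g in "FIOC"}
bs = {g: np.zeros(k) for g in "FIOC"}

def lstm_step(h_prev, C_prev, x):
    F = sigma(Ws["F"] @ h_prev + Us["F"] @ x + bs["F"])          # forget gate, in (0, 1)
    I = sigma(Ws["I"] @ h_prev + Us["I"] @ x + bs["I"])          # input gate, in (0, 1)
    O = sigma(Ws["O"] @ h_prev + Us["O"] @ x + bs["O"])          # output gate, in (0, 1)
    C_tilde = np.tanh(Ws["C"] @ h_prev + Us["C"] @ x + bs["C"])  # candidate memory, in (-1, 1)
    C = F * C_prev + I * C_tilde                                 # step 3: new memory state
    h = O * np.tanh(C)                                           # step 4: new hidden state
    return h, C

h, C = np.zeros(k), np.zeros(k)
for x in rng.normal(size=(10, p)):               # run over a 10-step input sequence
    h, C = lstm_step(h, C, x)
print(h, C)
```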
6 Applications
Exercises
1. For a deterministic discrete dynamical process {ht}, show that H(t) = H(t − 1) for all random variables h0 of appropriate dimension if and only if the transition function f(x; θ) is invertible in x.
2. Let αi = dhi/dW, βi = ∂hi/∂W, and γi = ∂hi/∂hi−1. Find a recursion for αi, and show that for n > 1

αn = βn + Σ_{i=1}^{n−1} ( Π_{j=i+1}^{n} γj ) βi.    (11)
3. Derive the backpropagation-through-time formula for ∂L/∂U.
References
Ovidiu Calin. Deep Learning Architectures: A Mathematical Approach. Springer, 2020.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Aston Zhang, Zachary C Lipton, Mu Li, and Alexander J Smola. Dive into deep learning. arXiv preprint
arXiv:2106.11342, 2021. URL https://d2l.ai/index.html.