
Deep Learning and Artificial Intelligence Epiphany 2024

Lecture 14: Recurrent neural networks


James Liley

Reading list and references

[Calin, 2020, Chapter 17]


[Zhang et al., 2021, Chapters 9,10]

Contents

1 Introduction
2 Dynamical systems
3 Recurrent neural networks
4 Training recurrent neural networks
    4.1 Backpropagation through time
    4.2 Problems
5 Long short-term memory networks (LSTMs)
    5.1 LSTM setup
6 Applications

1 Introduction
So far, we have only really looked at feedforward neural networks, characterised by an architecture in which we
start with inputs, progress through layers, and get to an output.
In this lecture we will look at recurrent neural networks, and depart from this architecture a bit. Recurrent
neural networks are neural networks specialised to model sequential data.

2 Dynamical systems
We shall start by considering a deterministic dynamical system.
Definition 1. A deterministic discrete dynamical system {ht}, t ∈ {0, 1, 2, . . . }, is a sequence of random variables given by

ht = f(ht−1; θ)    (1)

for t ≥ 1, where f (the ‘transition function’) is a Borel-measurable function and θ is a set of parameters.
We treat the states ht as random variables, although all randomness comes from the initial state h0. We denote by H(t) the σ-algebra generated by ht, and note that H(t) ⊆ H(t − 1).
Let’s quickly establish a simple property of these objects (example 17.1.1, Calin [2020]):
Theorem 1. Suppose {ht} is a deterministic discrete dynamical system with transition function f, and that f satisfies ||∂f/∂h||_2 ≤ λ < 1. Then we have

P(lim_{t→∞} ht = c) = 1    (2)

for some constant c; that is, ht → c almost surely.
Proof. By the mean-value theorem, we have

||f(h; θ) − f(h′; θ)|| ≤ sup_h ||∂f/∂h|| · ||h − h′|| ≤ λ||h − h′||;

that is, f is a contraction. In particular, along any fixed trajectory we have ||ht+1 − ht|| ≤ λ^{t−1}||h2 − h1||, so the trajectory is Cauchy and converges to the unique fixed point c of f(·; θ), which does not depend on the starting point.
Recalling that the ht are random variables, suppose we consider two values taken by h0. Let ω, ω′ be the corresponding points in the sample space of h0, so the values taken are h0(ω), h0(ω′). Consider the values h1(ω) = f(h0(ω); θ), h1(ω′) = f(h0(ω′); θ), h2(ω) = f(h1(ω); θ), h2(ω′) = f(h1(ω′); θ), . . . . We have

||ht(ω) − ht(ω′)|| = ||f(ht−1(ω); θ) − f(ht−1(ω′); θ)|| ≤ λ||ht−1(ω) − ht−1(ω′)||
                  ≤ λ^2 ||ht−2(ω) − ht−2(ω′)|| ≤ . . . ≤ λ^{t−1} ||h1(ω) − h1(ω′)|| → 0    (3)

for all ω, ω′. Hence P({ω : lim_{t→∞} ht(ω) = c}) = 1, as needed.

In this case, the sequence H(t) of σ-algebras gradually loses all its information: the information I(ht) tends to 0. This is analogous to the vanishing gradient problem which we encountered in the chapter on training of neural networks.
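To make Theorem 1 concrete, here is a minimal numerical sketch in Python (an illustration, not from the lecture; the map, sizes and seed are assumptions): we iterate a contraction f(h) = tanh(W h + b) with ||W||_2 = 0.5 < 1 from two different initial states and watch the trajectories merge.

import numpy as np

# Minimal sketch of Theorem 1 (illustrative; W, b and sizes are assumptions).
rng = np.random.default_rng(0)
W = 0.5 * np.eye(3)            # ||W||_2 = lambda = 0.5 < 1
b = rng.normal(size=3)

def f(h):
    # Transition function; |tanh'| <= 1, so ||df/dh||_2 <= ||W||_2 < 1.
    return np.tanh(W @ h + b)

h, h_prime = rng.normal(size=3), rng.normal(size=3)   # two draws of h0
for t in range(1, 31):
    h, h_prime = f(h), f(h_prime)
    if t % 10 == 0:
        print(t, np.linalg.norm(h - h_prime))  # decays roughly like 0.5**t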
We now consider a more general dynamical system, in which we inject new randomness at each t, governed by
a random process Xt (for our purposes, a random process is just a sequence of random variables, not necessarily
independent, but defined on the same probability space).
Definition 2. A discrete dynamical system {ht}, t ∈ N, is a sequence of random variables given by

ht = f(ht−1, Xt; θ)    (4)

for t ≥ 1, where f is a Borel-measurable function, Xt is a random process, and θ is a set of parameters.
Now our sequence of σ-algebras is a little more complicated. Recalling our notation of S(x) for the σ-algebra generated by x, and denoting It = S(Xt), we have

S({ht−1, Xt}) = S(S(ht−1) ∪ S(Xt)) = S(H(t − 1) ∪ It)

and hence

S(ht) ⊆ S({ht−1, Xt}) = S(H(t − 1) ∪ It)
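The contrast with the deterministic case can again be sketched numerically (an illustration under assumed choices: i.i.d. Gaussian Xt, which Definition 2 does not require). Trajectories from different initial states still merge, since f remains a contraction in h, but the common trajectory now keeps moving with the input process rather than settling at a constant.

import numpy as np

# Driven version of the earlier contraction: h_t = f(h_{t-1}, X_t; theta).
rng = np.random.default_rng(1)
W, U = 0.5 * np.eye(3), np.eye(3)   # contraction in h; assumed parameters

def f(h, x):
    return np.tanh(W @ h + U @ x)

h, h_prime = rng.normal(size=3), rng.normal(size=3)   # two draws of h0
for t in range(1, 31):
    x = rng.normal(size=3)          # fresh randomness X_t at each step
    h, h_prime = f(h, x), f(h_prime, x)
    if t % 10 == 0:
        # ||h - h'|| still vanishes, but h itself does not settle to a constant.
        print(t, np.linalg.norm(h - h_prime), np.linalg.norm(h))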


3 Recurrent neural networks


Let’s suppose that we draw a discrete dynamical system as in figure 1, and add an ‘output’ from each ht as an
outcome Yt as in figure 2.

Figure 1: Illustration of a dynamical system with random process Xt

Figure 2: Illustration of a dynamical system with random process Xt and output Yt

If we consider each block Xt → ht → Yt separately, and model this as a neural network with output Yt, hidden layer ht and input Xt, then we get what is called a recurrent neural network.
Making this setup a bit more formal, let us say that the sample space of ht is R^k, and the sample space of Xt is R^p. We set this up as:

ht = ϕ(W ht−1 + U Xt + b)
Yt = V ht + c

with appropriately sized matrices W, U and V, and bias vectors b and c. We immediately note a few points:


1. Each ht depends also on ht−1: this is what distinguishes a recurrent neural network from a series of feedforward networks.
2. The matrices W, U, and V do not depend on t. This corresponds to the function f being fixed over t.
3. The values ht are vectors with all elements in the range of ϕ().
4. The parameter θ of the transition function f is (W, U, b).
5. We can quickly see the resemblance to a hidden Markov model.
6. Recurrent neural networks typically use the activation function ϕ() = tanh().


The essence of the ‘sequence’ of data comes from the matrix W (or the function f in our dynamical system
formulation).
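As a concrete sketch of this forward pass (a minimal illustration; the dimensions, random initialisation and sequence length are assumptions, not from the lecture):

import numpy as np

# Forward pass h_t = phi(W h_{t-1} + U X_t + b), Y_t = V h_t + c, phi = tanh.
rng = np.random.default_rng(2)
k, p, m, T = 4, 3, 2, 5        # hidden, input, output sizes; sequence length

W, U, b = rng.normal(size=(k, k)), rng.normal(size=(k, p)), rng.normal(size=k)
V, c = rng.normal(size=(m, k)), rng.normal(size=m)

h = np.zeros(k)                # initial hidden state h_0
X = rng.normal(size=(T, p))    # input sequence X_1, ..., X_T

for t in range(T):
    h = np.tanh(W @ h + U @ X[t] + b)   # same W, U, b at every time step
    Y = V @ h + c                       # output Y_t
    print(t + 1, Y)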

4 Training recurrent neural networks


Recurrent neural networks encounter training examples Xt, Zt sequentially, updating as they go. This does not lend itself directly to the backpropagation algorithm we have studied for feedforward networks.
Suppose our network has input Xt, output Yt, and target output Zt, over T time steps in total. We have a series of losses Lt measuring how close each Yt is to the corresponding Zt, with total loss L given by

L = ∑_{t=1}^{T} Lt

As usual, we have a range of choices for the per-step loss Lt.

4.1 Backpropagation through time


We will consider backpropagation for T = 2 with activation function ϕ. We have:

hi = ϕ(W hi−1 + U Xi + b)
Yi = V hi + c

for i ∈ {1, 2}. We will use the shorthand:

ai = W hi−1 + U Xi + b    (5)

so that hi = ϕ(ai).
In order to perform gradient descent on the loss function, we need to compute five gradients:

∇L = (∂L/∂W, ∂L/∂V, ∂L/∂U, ∂L/∂b, ∂L/∂c)    (6)

We start with

∂hi/∂W = ∂ϕ(ai)/∂W = ϕ′(ai) ∂ai/∂W = ϕ′(ai) hi−1

We similarly have

∂hi/∂b = ϕ′(ai)    (7)
The value Li = Loss(Yi, Zi) depends on hi only through Yi. Let us write γi = ∂Li/∂Yi. Now we have

∂Li/∂hi = (∂Li/∂Yi)(∂Yi/∂hi) = γi V

and

∂hi/∂hi−1 = ∂ϕ(ai)/∂hi−1 = ϕ′(ai) ∂ai/∂hi−1 = ϕ′(ai) W

and

∂hi/∂U = ∂ϕ(ai)/∂U = ϕ′(ai) ∂ai/∂U = ϕ′(ai) Xi
Now we can compute the gradients we need. Since Li depends on V only through Yi, we have

∂L/∂V = ∑_i ∂Li/∂V
      = ∑_i (∂Li/∂Yi)(∂Yi/∂V)
      = ∑_i (∂Li/∂Yi) ∂(V hi + c)/∂V
      = ∑_i γi hi

and

∂L/∂c = ∑_i ∂Li/∂c
      = ∑_i (∂Li/∂Yi)(∂Yi/∂c)
      = ∑_i (∂Li/∂Yi) ∂(V hi + c)/∂c
      = ∑_i ∂Li/∂Yi = ∑_i γi

Computing the gradient of L with respect to W gets difficult. The value L1 depends on W only through h1, but L2 depends on W through both h1 and h2 (see figure 2), and in general Li depends on W through all the values h1, h2, . . . , hi. To make it more difficult, hi also depends on hi−1. Let us start by considering just L2:

∂L2/∂W = (∂L2/∂h2)(dh2/dW)
       = (∂L2/∂h2) · d/dW ϕ(W h1 + U X2 + b)
       = (∂L2/∂h2) (∂h2/∂W + (∂h2/∂h1)(dh1/dW))
       = γ2 V (ϕ′(a2) h1 + ϕ′(a2) W ϕ′(a1) h0)    (8)
To calculate ∂Li/∂W, let us denote

di = dhi/dW    (9)

so d1 = ϕ′(a1) h0, and d2 = ϕ′(a2) h1 + ϕ′(a2) W ϕ′(a1) h0 = ϕ′(a2) h1 + ϕ′(a2) W d1. Now, working recursively:

di = dhi/dW = d/dW ϕ(W hi−1 + U Xi + b) = ∂hi/∂W + (∂hi/∂hi−1)(dhi−1/dW) = ϕ′(ai) hi−1 + ϕ′(ai) W di−1

and, finally,

∂L/∂W = ∑_i ∂Li/∂W
      = ∑_i (∂Li/∂hi)(dhi/dW)
      = ∑_i γi V di

Similar derivations hold for ∂L/∂U and ∂L/∂b, which are left as exercises.
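As a sanity check on this derivation, here is a minimal scalar sketch (k = p = 1; the squared-error loss Li = (Yi − Zi)^2/2, so that γi = Yi − Zi, is an assumption for illustration): it compares ∑_i γi V di with a finite-difference estimate of ∂L/∂W for T = 2.

import numpy as np

# Scalar check of backpropagation through time for T = 2 (illustrative).
rng = np.random.default_rng(3)
W, U, b, V, c = rng.normal(size=5)
h0 = rng.normal()
X = rng.normal(size=2)               # inputs X_1, X_2
Z = rng.normal(size=2)               # targets Z_1, Z_2

def total_loss(W):
    # L = sum_i L_i with L_i = (Y_i - Z_i)^2 / 2 (assumed loss).
    h, L = h0, 0.0
    for i in range(2):
        h = np.tanh(W * h + U * X[i] + b)
        Y = V * h + c
        L += 0.5 * (Y - Z[i]) ** 2
    return L

# Forward pass, storing pre-activations a_i and states h_i.
a1 = W * h0 + U * X[0] + b; h1 = np.tanh(a1)
a2 = W * h1 + U * X[1] + b; h2 = np.tanh(a2)
dphi = lambda a: 1 - np.tanh(a) ** 2          # phi'(a) for phi = tanh

d1 = dphi(a1) * h0                            # d_i = dh_i/dW, recursively
d2 = dphi(a2) * h1 + dphi(a2) * W * d1
g1, g2 = V * h1 + c - Z[0], V * h2 + c - Z[1] # gamma_i = dL_i/dY_i

eps = 1e-6
print(g1 * V * d1 + g2 * V * d2)              # sum_i gamma_i V d_i
print((total_loss(W + eps) - total_loss(W - eps)) / (2 * eps))  # numerical check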

4.2 Problems
This does not look great: there are lots of ϕ′(ai) and W factors getting multiplied together, and indeed recurrent neural networks are very susceptible to both exploding gradients and vanishing gradients. If the eigenvalues of W are all less than 1 in magnitude, we are very susceptible to vanishing gradients; if any of them exceed 1 in magnitude, we are susceptible to exploding gradients.
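A quick numerical illustration of this (assuming, for simplicity, a linear activation so that ϕ′ = 1 and the repeated factor in di is just W): the T-step product W^T shrinks or explodes geometrically with the eigenvalues of W.

import numpy as np

# Gradient scale after 50 time steps under the recursion d_i ~ W d_{i-1}.
for rho in (0.9, 1.1):
    W = rho * np.eye(2)                  # eigenvalues all equal to rho
    P = np.linalg.matrix_power(W, 50)    # product of 50 copies of W
    print(rho, np.linalg.norm(P, 2))     # ~0.9**50 = 5e-3 vs ~1.1**50 = 117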

5 Long short-term memory networks (LSTMs)


Not examinable
A tidy way to (partly) overcome the vanishing gradient problem intrinsic to recurrent neural networks is the long short-term memory network (LSTM; Hochreiter and Schmidhuber [1997]).


A recurrent network can be thought of as having long-term ‘memory’ encoded by weights, which are updated
during training, and change slowly. The activations of the network (from calculating a forward pass) constitute
short-term memory. In a sense, the vanishing gradient problem arises because long-term memory is too salient,
and the network cannot react to new information quickly. LSTMs introduce a new type of memory encoded by a
particular type of neuron.
To each hi we attribute a memory Ci , sometimes called an internal state. We may control whether the internal
state:
• should be affected by a given input,
• should affect the output,
• should be changed.

This is easier if we actually formalise it.

5.1 LSTM setup


The flow of information through an LSTM is illustrated in figure 3.

Figure 3: Information flow through a long short-term memory network. Adapted from Zhang et al. [2021].

The flow of information consists of a few steps. Data passes through a series of ‘gates’: ‘forget’, ‘input’ and ‘output’.
In detail (adapted from Zhang et al. [2021]):
1. We begin with what we previously had: at time t, data Xt and a hidden state ht−1 from the previous time step; we will output a current hidden state ht and an output Yt. We also have a ‘memory’ Ct−1 from the previous time step, and we will output Ct.

2. We begin by computing four functions F, I, O and C̃ of ht−1 and Xt. All are neural networks, parametrised by weight matrices WF, WI, WO, WC̃ for ht−1 and UF, UI, UO, UC̃ for Xt respectively, and biases bF, bI, bO, bC̃. The first three use logistic activation functions ϕ(x) = σ(x) = (1 + exp(−x))^{−1}, so have values in (0, 1), and C̃ uses ϕ(x) = tanh(x), so has values in (−1, 1).

F(ht−1, Xt) = σ(WF ht−1 + UF Xt + bF)
I(ht−1, Xt) = σ(WI ht−1 + UI Xt + bI)
O(ht−1, Xt) = σ(WO ht−1 + UO Xt + bO)
C̃(ht−1, Xt) = tanh(WC̃ ht−1 + UC̃ Xt + bC̃)    (10)

The function F is short for ‘forget’, I for ‘input’ and O for ‘output’. The ‘forget’ function tells us how much of the old memory Ct−1 we will forget, and the ‘input’ function tells us how much of the new input we will retain. The dimensions of F, I, O and C̃ are the same as those of Ct.


3. We combine F with the previous memory Ct−1 using a Hadamard product ⊙ (element-wise multiplication),
and combine I and C̃ the same way. We add these together to get the new memory state Ct :

Ct = F (ht−1 , Xt ) ⊙ Ct−1 + I(ht−1 , Xt ) ⊙ C̃(ht−1 , Xt )

4. Finally, we take the Hadamard product of O(ht−1, Xt) and tanh(Ct) to get the new hidden state ht:

   ht = O(ht−1, Xt) ⊙ tanh(Ct)

Why does this help? In a sense, it allows resetting of the hidden state, and the network learns when to do this, in
particular learning to skip irrelevant observations.
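For concreteness, here is a minimal sketch of a single LSTM step following equation (10) and steps 3–4 (the dimensions, random initialisation and omission of the output Yt are assumptions for illustration):

import numpy as np

# One LSTM step: gates F, I, O use the logistic function; C-tilde uses tanh.
rng = np.random.default_rng(4)
k, p = 4, 3                                # sizes of h_t / C_t and of X_t

sigma = lambda x: 1 / (1 + np.exp(-x))     # logistic function, values in (0, 1)
Ws = {g: rng.normal(size=(k, k)) for g in "FIOC"}   # W_F, W_I, W_O, W_Ctilde
Us = {g: rng.normal(size=(k, p)) for g in "FIOC"}   # U_F, U_I, U_O, U_Ctilde
bs = {g: np.zeros(k) for g in "FIOC"}               # biases

def lstm_step(h_prev, C_prev, x):
    F = sigma(Ws["F"] @ h_prev + Us["F"] @ x + bs["F"])    # forget gate
    I = sigma(Ws["I"] @ h_prev + Us["I"] @ x + bs["I"])    # input gate
    O = sigma(Ws["O"] @ h_prev + Us["O"] @ x + bs["O"])    # output gate
    C_tilde = np.tanh(Ws["C"] @ h_prev + Us["C"] @ x + bs["C"])
    C = F * C_prev + I * C_tilde       # step 3: Hadamard products, new memory
    h = O * np.tanh(C)                 # step 4: new hidden state
    return h, C

h, C = np.zeros(k), np.zeros(k)
for x in rng.normal(size=(5, p)):      # a length-5 input sequence
    h, C = lstm_step(h, C, x)
print(h, C)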

6 Applications
Exercises

1. For a deterministic discrete dynamical system {ht}, show that H(t) = H(t − 1) for all random variables h1 of appropriate dimension if and only if the transition function f(x; θ) is invertible in x.
2. Let αi = dhi/dW, βi = ∂hi/∂W, and γi = ∂hi/∂hi−1. Find a recursion for αi, and show that for n > 1

   αn = βn + ∑_{i=1}^{n−1} ( ∏_{j=i+1}^{n} γj ) βi    (11)

3. Derive the backpropagation-through-time formula for ∂L/∂U.

4. What are the consequences in an LSTM if F = 0? What if F = 1?


5. What are the consequences in an LSTM if I = 0? What if I = 1?

References
Ovidiu Calin. Deep Learning Architectures: A Mathematical Approach. Springer, 2020.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. Dive into Deep Learning. arXiv preprint arXiv:2106.11342, 2021. URL https://d2l.ai/index.html.
