6b. Recurrent Neural Networks
2023-10-13
What is a Neural Network?
Neural networks used in deep learning consist of different layers connected to
each other and are modeled on the structure and function of the human brain. They learn
from huge volumes of data and use complex algorithms to train the network.
Popular Neural Networks
Issues in Neural Networks
In a feed-forward network, information flows only in the forward direction: from the input nodes, through
the hidden layers (if any), to the output nodes. There are no cycles or loops in the network.
[Figure: feed-forward examples: predicting fraud from transaction features (amount, out of country?) and word-by-word translation of "On Sunday, I ate sushi" into Urdu]
Recurrent Neural Network
RNNs work on the principle of saving the output of a layer and feeding it back to the input in order to predict the output of
the layer.
Applications of RNNs
• Autocomplete
• Translation
• Sentiment analysis
What does an RNN look like?
[Figure: a rolled RNN cell and its unrolled view over time: at each time step t the input x(t) enters the hidden state h(t) through weights B, the previous hidden state h(t-1) enters through weights C, and the output is produced through weights A; the hidden states h(t-2), h(t-1), h(t), h(t+1), h(t+2) unfold along the time axis with the same A, B, C at every step.]
How does an RNN work?
• An RNN is recurrent in nature
• It processes a sequence of vectors by applying a recurrence formula at every time step
• To make a decision, it considers the current input together with the output it has learned from the previous input.
h(t) = f( C · h(t-1) + B · x(t) ),  where f is a tanh or sigmoid activation
y(t) = A · h(t)
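As a rough illustration, here is a minimal NumPy sketch of this recurrence, taking f = tanh; the matrix names A, B, C follow the diagram, while the sizes and random values are illustrative assumptions.

```python
# Minimal sketch of h(t) = tanh(C·h(t-1) + B·x(t)), y(t) = A·h(t)
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 8, 16, 4

B = rng.standard_normal((hidden_size, input_size)) * 0.1   # input -> hidden
C = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden -> hidden
A = rng.standard_normal((output_size, hidden_size)) * 0.1  # hidden -> output

def rnn_step(x_t, h_prev):
    """One time step: combine the new input with the previous hidden state."""
    h_t = np.tanh(C @ h_prev + B @ x_t)   # h(t) = tanh(C·h(t-1) + B·x(t))
    y_t = A @ h_t                         # y(t) = A·h(t)
    return h_t, y_t

# Run the same cell (same A, B, C) over a sequence of 5 input vectors.
h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):
    h, y = rnn_step(x, h)
```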
How does an RNN work?
• The way RNNs do this is by taking the output of each neuron and feeding it back to it as an input
• The input nodes are fed into a hidden layer with sigmoid or tanh activations
• By doing this,
• the network not only receives new pieces of information at every time step,
• but it also adds to these new pieces of information a weighted version of the previous output
• As you can see, the hidden layer outputs are passed through a conceptual delay block to allow the input of h(t-1) into the hidden layer.
• What is the point of this?
• Simply, the point is that we can now model time- or sequence-dependent data.
• This gives these neurons a kind of "memory" of the previous inputs they have had,
• as the previous inputs are quantified by the output being fed back into the neuron.
• A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.
How does an RNN work?
• A particularly good example of this is in predicting text sequences.
• Consider the following text string: “A girl walked into a bar, and she said ‘Can I have a drink please?’. The bartender said ‘Certainly
{ }”.
• There are many options for what could fill in the { } symbol in the above string, for instance, “miss”, “ma’am” and so on.
• However, other words could also fit, such as “sir”, “Mister” etc.
• In order to get the correct gender of the noun, the neural network needs to “recall” that two previous words designating the likely
gender (i.e. “girl” and “she”) were used.
• We supply the word vector for "A" to the network F; the output of the nodes in F is fed into the "next" network and also acts as a stand-alone output (h₀).
• The next network F at time t=1 takes the next word vector, for "girl", and the previous output h₀ into its hidden nodes, producing the next output h₁, and so on.
• NOTE: Although the diagram shows it this way for ease of explanation,
• the words themselves, i.e. "A", "girl" etc., are not input directly into the neural network.
• Neither are their one-hot vector representations; rather, an embedding word vector (Word2Vec) is used for each word.
• One last thing to note:
• the weights of the connections between time steps are shared, i.e. there is not a different set of weights for each time step,
• BECAUSE we have the same single RNN cell looped to itself (see the sketch below)
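A minimal sketch of the weight-sharing idea for the text example above. The random embedding vectors stand in for real Word2Vec embeddings, and the names W_xh and W_hh are illustrative assumptions.

```python
# The *same* W_xh and W_hh are reused at every time step; each word enters as
# an embedding vector (random stand-ins for Word2Vec here).
import numpy as np

rng = np.random.default_rng(1)
emb_dim, hidden = 8, 16
sentence = "a girl walked into a bar".split()
embeddings = {w: rng.standard_normal(emb_dim) for w in set(sentence)}

W_xh = rng.standard_normal((hidden, emb_dim)) * 0.1   # shared across all time steps
W_hh = rng.standard_normal((hidden, hidden)) * 0.1    # shared across all time steps

h = np.zeros(hidden)
outputs = []
for word in sentence:                      # t = 0, 1, 2, ...
    x_t = embeddings[word]                 # embedding vector, not a one-hot vector
    h = np.tanh(W_hh @ h + W_xh @ x_t)     # same cell, same weights, every step
    outputs.append(h)                      # h_0, h_1, ... ("messages" to the successor)
```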
Types of RNNs
One to one
• A one-to-one network takes a single input and produces a single output
Types of RNNs
One to many
• A one-to-many network generates a sequence of outputs from a single input
• Example: image captioning
Types of RNNs
Many to one
• A many-to-one network takes a sequence of inputs and generates a single output
• Example: sentiment analysis,
• where a given sentence can be classified as expressing positive or negative sentiment
• Sequence of words → sentiment
Types of RNNs
Many to many
• A many-to-many network takes a sequence of inputs and generates a sequence of outputs
• Example: machine translation,
• where a given sentence is translated into its corresponding equivalent in the target language
• Sequence of words → sequence of words (see the sketch below)
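A hedged PyTorch sketch of how the same recurrent core covers these cases: keep only the final hidden state for many to one, or read out every time step for many to many. The layer sizes and output heads are illustrative assumptions.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(1, 7, 10)                  # a batch of 1 sequence with 7 time steps

outputs, h_last = rnn(x)                   # outputs: (1, 7, 20), h_last: (1, 1, 20)

to_sentiment = nn.Linear(20, 2)            # head for a single output
to_tags = nn.Linear(20, 5)                 # head applied at every time step

# Many to one (e.g. sentiment): classify from the final hidden state only.
sentiment_logits = to_sentiment(h_last.squeeze(0))   # shape (1, 2)

# Many to many (e.g. translation-style outputs): map every time step.
per_step_logits = to_tags(outputs)                   # shape (1, 7, 5)
```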
Multilayer RNNs
• So far, we have only seen RNNs with just one layer
• However, we are not limited to single-layer architectures
• One of the ways RNNs are used today is in a more complex manner:
• RNNs can be stacked together in multiple layers
• This gives more depth, and empirically deeper architectures tend to work better
• Here, three RNNs are stacked on top of each other,
• each with its own set of weights
• The input of the second RNN is the hidden state vector of the first RNN
• All stacked RNNs can be trained jointly (see the sketch below)
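A minimal sketch of stacking, assuming PyTorch's nn.RNN with num_layers=3 (the sizes are illustrative): each layer has its own weights, and the hidden-state sequence of one layer is the input of the next.

```python
import torch
import torch.nn as nn

stacked = nn.RNN(input_size=10, hidden_size=20, num_layers=3, batch_first=True)
x = torch.randn(4, 15, 10)            # 4 sequences, 15 time steps, 10 features each

outputs, h_n = stacked(x)             # outputs: (4, 15, 20), taken from the top layer
print(h_n.shape)                      # (3, 4, 20): final hidden state of each of the 3 layers
# Backpropagating through the whole stack trains all three layers jointly.
```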
RNN example as Character-level language
model
• One of the simplest ways in which we can use an RNN is
• a character-level language model
• We input a sequence of characters into the RNN,
• and at every single time step the RNN will predict the next character
• h_t = tanh( W_hh · h_{t-1} + W_xh · x_t )
• Example training sequence: "hello"
• Vocabulary: [h, e, l, o]
RNN example as Character-level language
model
• All characters are encoded using a one-hot vector representation,
• where only one unique bit of the vector is turned on for each unique character
• The RNN incorrectly suggests that "o" should come next, as its score of 4.1 is the highest
• However, we of course know that in this training sequence "e" should follow "h", so in fact the score of 2.2 is the correct answer,
• and we want that score to be high and all other scores to be low
• At every single time step we have a target for which character should come next in the sequence;
• therefore, the error signal is backpropagated as a gradient of the loss function through the connections
RNN example as Character-level language
model
• A softmax classifier is used as the loss function,
• so all the losses flow down from the top backwards to calculate the gradients on all the weight matrices,
• which tell us how to shift the matrices so that the correct probabilities come out of the RNN
• At test time, sample one character at a time and feed it back into the model
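Putting these slides together, here is a compact NumPy sketch of the character-level model, assuming the "hello" example with vocabulary [h, e, l, o]; the hidden size, weight values, and the cross-entropy form of the loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["h", "e", "l", "o"]
char_to_ix = {c: i for i, c in enumerate(vocab)}
V, H = len(vocab), 3

W_xh = rng.standard_normal((H, V)) * 0.1   # input -> hidden
W_hh = rng.standard_normal((H, H)) * 0.1   # hidden -> hidden
W_hy = rng.standard_normal((V, H)) * 0.1   # hidden -> output scores

def one_hot(c):
    v = np.zeros(V)
    v[char_to_ix[c]] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Forward pass over the training sequence, accumulating cross-entropy loss.
seq, targets = "hell", "ello"              # at each step, predict the next character
h, loss = np.zeros(H), 0.0
for c_in, c_out in zip(seq, targets):
    h = np.tanh(W_hh @ h + W_xh @ one_hot(c_in))   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    p = softmax(W_hy @ h)                          # scores -> probabilities
    loss += -np.log(p[char_to_ix[c_out]])          # want the correct character to be likely

# Test-time sampling: feed each sampled character back in as the next input.
h, c = np.zeros(H), "h"
for _ in range(4):
    h = np.tanh(W_hh @ h + W_xh @ one_hot(c))
    p = softmax(W_hy @ h)
    c = vocab[rng.choice(V, p=p)]
```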
Backpropagation Through Time (BPTT)
• Backpropagation through time (BPTT)
• Forward through entire sequence to compute loss, then backward through
entire sequence to compute gradient
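A hedged illustration of BPTT using autograd (PyTorch, with illustrative sizes and a made-up target sequence): sum the per-step losses over the whole sequence in the forward pass, then a single backward call propagates the gradient back through every time step.

```python
import torch
import torch.nn as nn

cell, readout = nn.RNNCell(10, 20), nn.Linear(20, 5)
criterion = nn.CrossEntropyLoss()

xs = torch.randn(15, 1, 10)                 # 15 time steps, batch of 1
targets = torch.randint(0, 5, (15, 1))      # one class label per time step

h, loss = torch.zeros(1, 20), 0.0
for t in range(xs.shape[0]):                # forward through the entire sequence
    h = cell(xs[t], h)
    loss = loss + criterion(readout(h), targets[t].view(1))

loss.backward()                             # backward through the entire sequence
```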
Limitations of RNNs
• A problem with RNNs is that as time passes and they get fed more and more new data, they start to "forget" the previous data they have seen,
• as it gets diluted by the new data, the transformation from the activation function, and the weight multiplication
• This means they have a good short-term memory, but a problem when trying to remember things that happened a while ago,
• i.e. data they have seen many time steps in the past
• While training an RNN, the slope (gradient) can become either too small or very large, and this makes training difficult
• The more time steps we have, the more chance we have of the back-propagation gradients either accumulating and exploding or vanishing down to nothing
• When the slope is too small, the problem is known as the vanishing gradient problem, whereas, when the slope tends to grow exponentially instead of decaying, it is called the exploding gradient problem
• Vanishing gradient problem
• The magnitude of the gradients shrinks exponentially as we backpropagate through many layers,
• since typical activation functions such as sigmoid or tanh are bounded and have small derivatives (at most 0.25 for sigmoid and 1 for tanh)
• Why do gradients vanish?
• Think of a simplified 3-layer neural network
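As a hedged reconstruction of the derivation this refers to, assume a simplified 3-layer network with scalar weights w_1, w_2, w_3, pre-activations z_i = w_i a_{i-1} (with a_0 = x), sigmoid activations a_i = σ(z_i), and loss L. The chain rule then gives

$$
\frac{\partial L}{\partial w_1}
= \frac{\partial L}{\partial a_3}\,\sigma'(z_3)\,w_3\,\sigma'(z_2)\,w_2\,\sigma'(z_1)\,x
$$

Since σ'(z) ≤ 0.25, every additional layer multiplies the gradient by another small factor, so the gradient on the earliest weights shrinks roughly exponentially with depth.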
Vanishing gradient problem
• How about calculating the gradient of the loss with respect to the weights?
• The equation above is only a rough approximation of what is going on during back-propagation through time
• Each of these gradients will involve calculating the gradient of the sigmoid function
• The problem with the sigmoid function occurs when the input values are such that the output
is close to either 0 or 1
• at this point, the gradient is very small (saturating)
• For instance, say the value decreases like 0.863 → 0.532 → 0.356 → 0.192 → 0.117 → 0.086 → 0.023 → 0.019…
• you can see that there is not much change in the last few values
• It means that when you multiply many sigmoid gradients together you are multiplying many values which are potentially much less than one,
• and this leads to the vanishing gradient problem (see the numeric sketch below)
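A small numeric sketch of this point (not tied to any particular network; the pre-activation values are illustrative): the sigmoid derivative never exceeds 0.25, so a product of many such factors collapses towards zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # maximum value 0.25, reached at z = 0

zs = np.linspace(-3, 3, 10)         # pre-activations at 10 successive layers / time steps
grads = sigmoid_grad(zs)
print(grads.round(3))               # every factor is <= 0.25
print(np.prod(grads))               # the product is vanishingly small
```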
Vanishing Gradient Over Time
• This is more problematic in a vanilla RNN (with tanh/sigmoid activation)
• when trying to handle long temporal dependencies
• Similar to the previous example, the gradient vanishes over time
Vanishing gradient problem
• The vanishing gradient problem is critical in training neural networks
• Can we just use an activation function that has gradients > 1?
• Not really.
• It will cause another problem, the so-called exploding gradient problem
• Let's consider what happens if we use an exponential activation function:
• The magnitude of the gradient is always larger than 1 when the input > 0
• If the outputs of the network are positive, then the gradients used for the updates will explode (see the sketch below)
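A matching numeric sketch for the exploding case, assuming an exponential activation (the values are illustrative): each factor exp(z) exceeds 1 for z > 0, so the product blows up instead of vanishing.

```python
import numpy as np

zs = np.linspace(0.5, 2.0, 10)      # positive pre-activations at 10 layers / time steps
grads = np.exp(zs)                  # d/dz exp(z) = exp(z) > 1 whenever z > 0
print(grads.round(2))
print(np.prod(grads))               # the product grows exponentially (explodes)
```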