
Recurrent Neural Networks

2023-10-13
What is a Neural Network?
Neural networks, as used in deep learning, consist of layers of connected units and are loosely modeled on the structure and function of the human brain. They learn from large volumes of data, and training algorithms are used to fit the network's weights.
Popular Neural Networks
Issues in Neural Networks
In a feed-forward network, information flows only in the forward direction: from the input nodes, through the hidden layers (if any), to the output nodes. There are no cycles or loops in the network.

Issue #1: Context is important in sequential data

• Decisions are based only on the current input
• No memory of past inputs
• No way to look ahead to future inputs
Issues in Neural Networks
Issue #2: Fixed number of neurons in a layer

Sentences have different lengths, so a layer with a fixed number of input or output neurons cannot fit them all:

"how are you?" → "آپ کیسے ہو؟"
"did you eat dinner?" → "کیا آپنے کھانا کھایا؟"
Issues in Neural Networks
Issue #3: Too much computation

With one-hot encoding, every word becomes a sparse vector as long as the vocabulary (e.g. 25,000 words for English, 42,000 words for Urdu):

how  → [0,0,0,0,0,…,1,0,0,…,0]        آپ   → [0,0,0,0,0,…,1,0,0,…,0]
are  → [0,1,0,0,0,…,0,0,0,…,0]        کیسے → [0,1,0,0,0,…,0,0,0,…,0]
you? → [0,0,0,0,0,…,0,0,0,1,0]        ہو؟  → [0,0,0,0,0,…,1,0,0,…,0]

Multiplying such huge, mostly-zero vectors at every step is a lot of wasted computation (a sketch follows below).
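To make the cost concrete, here is a minimal NumPy sketch of one-hot encoding against a 25,000-word vocabulary; the word index and hidden-layer size are invented for the example.

import numpy as np

VOCAB_SIZE = 25_000            # e.g. the English vocabulary above
word_index = 17                # hypothetical index of the word "how"

# One-hot encoding: 25,000 entries, only one of them set to 1.
one_hot = np.zeros(VOCAB_SIZE)
one_hot[word_index] = 1.0

# A single dense layer with 512 hidden units already needs a
# 25,000 x 512 weight matrix (~12.8 million parameters) just to read one word.
W = np.random.randn(VOCAB_SIZE, 512) * 0.01
hidden = one_hot @ W           # most of this multiplication is against zeros
print(one_hot.sum(), hidden.shape)   # 1.0 (512,)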
Issues in Neural Networks
Issue #4: Parameters are not shared

Both word orders below translate to the same Urdu sentence, but a feed-forward network has separate weights for every input position, so what it learns about a word in one position is not reused in another:

"on sunday I ate sushi." → "مینے اتوار کو سوشی کھائی۔"
"I ate sushi on sunday." → "مینے اتوار کو سوشی کھائی۔"
Issues in Neural Networks
Issue #5: The order of the sequence is important

In a task like fraud detection, the inputs (amount, out-of-country transaction, SSN correct?) can be treated as an unordered set of features when predicting "Fraud?". In sequential data such as the sentence "مینے اتوار کو سوشی کھائی۔" ("I ate sushi on Sunday."), the order of the words carries the meaning.
Recurrent Neural Network
RNNs work on the principle of saving a layer's output and feeding it back to the input in order to predict the layer's output at the next step.

An RNN is basically a generalization of a feed-forward neural network that has an internal memory.

• Can handle sequential data
• Considers the current input as well as the previously received inputs
• Can handle sequences of different size and structure
• What was the advantage of the CNN over the feed-forward neural network?
• The CNN models spatial invariance information
• Recurrent Neural Network (RNN)
• Models temporal information
• The hidden state is a function of the current input and the previous time step:
h(t) = f_C(h(t−1), x(t))
Recurrent Neural Network
• RNNs are designed to deal effectively with sequential or temporal data:
• Time series: a sequence of values of some parameter over a certain period of time
• Text documents: a sequence of words
• Audio: a sequence of sound frequencies over time
• Unlike feed-forward neural networks, where all the inputs are independent of one another,
• RNNs can use their internal state (memory) to process sequences of inputs.
Recurrent Neural
Networks (RNNs)
• How does Google's autocomplete feature predict the rest of the words a user is typing?
Applications of RNN
• Autocomplete
• Translation
• Named Entity Recognition
• Sentiment analysis
Applications of RNN
What does an RNN look like?

[Figure: a rolled RNN and the same network unrolled over time. At each time step t, the input layer x(t) feeds the hidden layer h(t) through weights B, the hidden state feeds itself forward in time through weights C, and the hidden layer feeds the output layer y(t) through weights A.]
How does a RNN work?
• An RNN is recurrent in nature
• It processes a sequence of vectors by applying the same recurrence formula at every time step
• For making a decision, it considers the current input and the output it has learned from the previous input:

h(t) = f_C(h(t−1), x(t))

where h(t) is the new state, h(t−1) the old state, x(t) the input vector at time step t, and f_C a function parameterized by C.
How does a RNN work?
h(t) = f_C(h(t−1), x(t))
h(t) = tanh(C · h(t−1) + B · x(t))
y(t) = A · h(t)

where B is the input-to-hidden weight matrix, C the hidden-to-hidden (recurrent) weight matrix, and A the hidden-to-output weight matrix.
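As a concrete illustration, here is a minimal NumPy sketch of one RNN step implementing the two formulas above; the dimensions and random weights are placeholders for this example.

import numpy as np

hidden_size, input_size, output_size = 4, 3, 2

# Weight matrices from the formulas: B (input->hidden), C (hidden->hidden), A (hidden->output).
rng = np.random.default_rng(0)
B = rng.standard_normal((hidden_size, input_size)) * 0.1
C = rng.standard_normal((hidden_size, hidden_size)) * 0.1
A = rng.standard_normal((output_size, hidden_size)) * 0.1

def rnn_step(h_prev, x):
    """One time step: h(t) = tanh(C.h(t-1) + B.x(t)),  y(t) = A.h(t)."""
    h = np.tanh(C @ h_prev + B @ x)
    y = A @ h
    return h, y

# Process a short sequence with the *same* weights at every step.
h = np.zeros(hidden_size)
for x in [rng.standard_normal(input_size) for _ in range(5)]:
    h, y = rnn_step(h, x)
    print(y)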
How does a RNN work?
• The way RNNs do this is by taking the output of each neuron and feeding it back to it as an input
• The input nodes are fed into a hidden layer with sigmoid or tanh activations
• By doing this,
• the network does not only receive new pieces of information at every time step,
• it also adds to these new pieces of information a weighted version of the previous output
• As you can see, the hidden layer outputs are passed through a conceptual delay block to allow the input of h(t−1) into the hidden layer.
• What is the point of this?
• Simply, the point is that we can now model time- or sequence-dependent data.
• This gives the neurons a kind of "memory" of the previous inputs they have seen,
• as those inputs are, in a sense, quantified by the output being fed back into the neuron.
• A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.
How does a RNN work?
• A particularly good example of this is in predicting text sequences.
• Consider the following text string: “A girl walked into a bar, and she said ‘Can I have a drink please?’. The bartender said ‘Certainly
{ }”.
• There are many options for what could fill in the { } symbol in the above string, for instance, “miss”, “ma’am” and so on.
• However, other words could also fit, such as “sir”, “Mister” etc.
• In order to get the correct gender of the noun, the neural network needs to “recall” that two previous words designating the likely
gender (i.e. “girl” and “she”) were used.
• We supply the word vector for "A" to the network F; the outputs of the nodes in F are fed into the "next" network and also act as a stand-alone output (h₀).
• The next network F at time t=1 takes the word vector for "girl" and the previous output h₀ into its hidden nodes, producing the next output h₁, and so on.
• NOTE: although the diagram shows the words for ease of explanation,
• the words themselves, i.e. "A", "girl", etc., are not input directly into the neural network.
• Neither are their one-hot vector representations; rather, an embedding word vector (e.g. Word2Vec) is used for each word.
• One last thing to note:
• the weights of the connections between time steps are shared, i.e. there is not a different set of weights for each time step,
• BECAUSE we have the same single RNN cell looped back to itself (see the sketch below).
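A minimal sketch of that idea: the vocabulary, embedding matrix, and dimensions below are invented purely for illustration (a trained system would use learned embeddings such as Word2Vec), but the loop shows the same B and C weights being reused at every time step.

import numpy as np

rng = np.random.default_rng(1)

# Toy vocabulary and embedding matrix (hypothetical; real systems use learned
# embeddings rather than one-hot vectors).
vocab = {"a": 0, "girl": 1, "walked": 2, "into": 3, "bar": 4}
embed_dim, hidden_dim = 8, 16
E = rng.standard_normal((len(vocab), embed_dim)) * 0.1   # one row per word

# A single set of RNN weights, shared by every time step.
B = rng.standard_normal((hidden_dim, embed_dim)) * 0.1   # input -> hidden
C = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1  # hidden -> hidden

def rnn_step(h_prev, x):
    return np.tanh(C @ h_prev + B @ x)

h = np.zeros(hidden_dim)
for word in ["a", "girl", "walked", "into", "a", "bar"]:
    x = E[vocab[word]]       # look up the word's embedding vector
    h = rnn_step(h, x)       # the SAME B and C are reused at every step
print(h.shape)               # (16,) -- the final hidden state summarizes the phrase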
Types of RNNs
One to one
• A single-input, single-output network is known as a vanilla neural network
• Traditional neural networks
• Used for regular machine learning problems
Types of RNNs
One to many
• A one-to-many network takes a single input and generates a sequence of outputs
• Image captioning
• Image → sequence of words
Types of RNNs
Many to one
• A many-to-one network takes a sequence of inputs and generates a single output
• Sentiment analysis
• where a given sentence is classified as expressing a positive or negative sentiment (e.g. "Good paper or not?")
• Sequence of words → sentiment
Types of RNNs
Many to many
• A many-to-many network takes a sequence of inputs and generates a sequence of outputs
• Video classification at the frame level
• where each frame in the given video is labeled with its corresponding class
• Sequence of frames → sequence of labels
Types of RNNs
Many to many
• A many-to-many network takes a sequence of inputs and generates a sequence of outputs
• Machine translation
• where a given sentence is translated into its equivalent in the target language
• Sequence of words → sequence of words
Multilayer RNNs
• So far, we have only seen RNNs with a single layer
• However, we are not limited to single-layer architectures
• One of the ways RNNs are used today, in a more complex manner, is
• stacking RNNs together in multiple layers
• This gives more depth, and empirically deeper architectures tend to work better
• Example: three RNNs stacked on top of each other,
• each with its own set of weights
• the input of the second RNN is the hidden state vector of the first RNN
• All stacked RNNs can be trained jointly (a sketch follows below)
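A minimal sketch of a two-layer stacked RNN under the same tanh-cell assumptions as the earlier examples; all sizes and weights are illustrative.

import numpy as np

rng = np.random.default_rng(2)
input_dim, hidden_dim = 8, 16

def make_cell(in_dim, hid_dim):
    """Each layer gets its own (B, C) weights."""
    B = rng.standard_normal((hid_dim, in_dim)) * 0.1
    C = rng.standard_normal((hid_dim, hid_dim)) * 0.1
    return B, C

layer1 = make_cell(input_dim, hidden_dim)
layer2 = make_cell(hidden_dim, hidden_dim)   # its input is layer 1's hidden state

def step(weights, h_prev, x):
    B, C = weights
    return np.tanh(C @ h_prev + B @ x)

h1 = np.zeros(hidden_dim)
h2 = np.zeros(hidden_dim)
for x in rng.standard_normal((5, input_dim)):   # a sequence of 5 input vectors
    h1 = step(layer1, h1, x)    # first RNN layer reads the input
    h2 = step(layer2, h2, h1)   # second RNN layer reads the first layer's hidden state
print(h2.shape)                 # (16,)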
RNN example as Character-level language
model
• One of the simplest ways in which we can use an RNN is as a
• character-level language model
• We input a sequence of characters into the RNN: h(t) = tanh(W_hh · h(t−1) + W_xh · x(t))
• and at every single time step the RNN predicts the next character
• Example training sequence: "hello"
• Vocabulary: [h, e, l, o]
RNN example as Character-level language
model
• All characters are encoded as one-hot vectors
• one-hot vector: only one unique bit of the vector is turned on for each unique character
• The recurrence formula is then applied at every single time step
RNN example as Character-level language
model
• The prediction of the RNN takes the form of a score distribution over the characters in the vocabulary
• In the very first time step we fed in "h", and the RNN, with its current setting of weights, computed a vector of logits
• The RNN incorrectly suggests that "o" should come next, as its score of 4.1 is the highest
• However, we know that in this training sequence "e" should follow "h", so in fact the character with the score of 2.2 is the correct answer
• and we want that score to be high and all other scores to be low
• At every single time step we have a target for which character should come next in the sequence
• therefore, the error signal is backpropagated as a gradient of the loss function through the connections
RNN example as Character-level language
model
• A softmax classifier is used as the loss function,
• so that all the losses flowing down from the top can be backpropagated to compute the gradients on all the weight matrices, which tell us how to shift the matrices so that the correct probabilities come out of the RNN
• At test time, we sample one character at a time and feed it back into the model (a sketch follows below)
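A minimal sketch of the test-time loop: turn the logits into probabilities with a softmax, sample a character, and feed it back in as the next input. The weights here are random placeholders, so the sampled text is meaningless; only the loop structure matters.

import numpy as np

rng = np.random.default_rng(3)
vocab = ["h", "e", "l", "o"]
V, H = len(vocab), 10

# Placeholder weights (a trained model would have learned these).
Wxh = rng.standard_normal((H, V)) * 0.1
Whh = rng.standard_normal((H, H)) * 0.1
Why = rng.standard_normal((V, H)) * 0.1

def softmax(z):
    z = z - z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

h = np.zeros(H)
x = np.eye(V)[vocab.index("h")]          # start by feeding in "h" (one-hot)
out = ["h"]
for _ in range(10):
    h = np.tanh(Whh @ h + Wxh @ x)       # recurrence formula
    probs = softmax(Why @ h)             # scores -> probabilities
    idx = rng.choice(V, p=probs)         # sample the next character
    out.append(vocab[idx])
    x = np.eye(V)[idx]                   # feed the sample back as the next input
print("".join(out))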
Backpropagation Through Time (BPTT)
• Backpropagation through time (BPTT):
• forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient (a sketch follows below)
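A minimal sketch of BPTT for the character-level model above, under the same tanh-recurrence and softmax-loss assumptions; the weights are untrained placeholders and the parameter update itself is omitted.

import numpy as np

rng = np.random.default_rng(4)
vocab = ["h", "e", "l", "o"]
V, H = len(vocab), 10
Wxh = rng.standard_normal((H, V)) * 0.1
Whh = rng.standard_normal((H, H)) * 0.1
Why = rng.standard_normal((V, H)) * 0.1

def one_hot(i):
    return np.eye(V)[i]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Training pair for "hello": each character's target is the next character.
inputs  = [vocab.index(c) for c in "hell"]
targets = [vocab.index(c) for c in "ello"]

# ---- forward through the entire sequence, storing intermediate states ----
hs, ps = {-1: np.zeros(H)}, {}
loss = 0.0
for t, (i, tgt) in enumerate(zip(inputs, targets)):
    hs[t] = np.tanh(Whh @ hs[t - 1] + Wxh @ one_hot(i))
    ps[t] = softmax(Why @ hs[t])
    loss += -np.log(ps[t][tgt])          # cross-entropy (softmax) loss

# ---- backward through the entire sequence, accumulating gradients ----
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dh_next = np.zeros(H)
for t in reversed(range(len(inputs))):
    dy = ps[t].copy()
    dy[targets[t]] -= 1                  # gradient of the softmax loss w.r.t. the logits
    dWhy += np.outer(dy, hs[t])
    dh = Why.T @ dy + dh_next            # gradient flowing into h(t)
    dz = (1 - hs[t] ** 2) * dh           # through the tanh nonlinearity
    dWxh += np.outer(dz, one_hot(inputs[t]))
    dWhh += np.outer(dz, hs[t - 1])
    dh_next = Whh.T @ dz                 # pass gradient to the previous time step

print(f"loss = {loss:.3f}, |dWhh| = {np.linalg.norm(dWhh):.3f}")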
Limitations of RNNs
• The problem with RNNs is that, as time passes and they are fed more and more new data, they start to "forget" the earlier data they have seen,
• as it gets diluted by the new data, the transformation through the activation function, and the weight multiplications
• This means they have a good short-term memory, but a problem when trying to remember things that happened a while ago
• i.e. data they have seen many time steps in the past
• While training an RNN, the gradient (slope) can become either too small or too large, and this makes training difficult
• The more time steps we have, the more chance there is of the backpropagated gradients either accumulating and exploding, or vanishing down to nothing
• When the gradient becomes too small, the problem is known as the vanishing gradient; when it grows exponentially instead of decaying, it is called the exploding gradient
• Vanishing gradient problem
• The magnitude of the gradients shrinks exponentially as we backpropagate through many layers,
• since typical activation functions such as sigmoid or tanh are bounded
• Why do gradients vanish?
• Think of a simplified 3-layer neural network
Vanishing gradient problem

• First, let's update the weights of the layer closest to the output
• Calculate the gradient of the loss with respect to those weights
• The resulting expression is only a rough approximation of what is going on during back-propagation through time
Vanishing gradient problem

• How about the weights of an earlier layer?
• Calculate the gradient of the loss with respect to those weights; by the chain rule this becomes a product of the gradients of all the later layers
• Again, this is only a rough approximation of what is going on during back-propagation through time
• Each factor in this product involves calculating the gradient of the sigmoid function
• The problem with the sigmoid function occurs when the input values are such that the output is close to either 0 or 1
• at this point, the gradient is very small (the function saturates)
• For instance, say the value decreases like 0.863 → 0.532 → 0.356 → 0.192 → 0.117 → 0.086 → 0.023 → 0.019 …
• you can see that there is not much change in the last few steps
• It means that when you multiply many sigmoid gradients together, you are multiplying many values that are potentially much less than one
• and this leads to the vanishing gradient problem
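A quick numerical illustration of that last point: the derivative of the sigmoid is at most 0.25, so a chain-rule product of many such factors collapses towards zero (the pre-activation values below are arbitrary).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # maximum value is 0.25, at z = 0

# Arbitrary pre-activation values at 20 successive layers / time steps.
zs = np.linspace(-3.0, 3.0, 20)
factors = sigmoid_grad(zs)        # each factor is <= 0.25

product = 1.0
for i, f in enumerate(factors, start=1):
    product *= f
    if i % 5 == 0:
        print(f"after {i:2d} factors: gradient scale ~ {product:.3e}")
# The product shrinks exponentially -- the gradient "vanishes".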
Vanishing Gradient Over Time
• This is more problematic in a vanilla RNN (with tanh/sigmoid activation)
• when trying to handle long temporal dependencies
• Similar to the previous example, the gradient vanishes over time
Vanishing gradient problem
• The vanishing gradient problem is critical in training neural networks
• Can we just use an activation function whose gradient is > 1?
• Not really.
• It will cause another problem, the so-called exploding gradient
• Consider using an exponential activation function:
• the magnitude of its gradient is always larger than 1 when the input is > 0
• If the outputs of the network are positive, then the gradients used for the updates will explode
Gradients > 1

• This makes training very unstable
• The weights get updated by very large amounts, eventually resulting in NaN values
• A very critical problem in training neural networks
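The mirror image of the sigmoid demo above: with an exponential activation, each chain-rule factor exceeds 1 for positive inputs, so the product blows up (the pre-activation values are again arbitrary).

import numpy as np

# With f(z) = exp(z), the derivative is exp(z), which is > 1 whenever z > 0.
zs = np.linspace(0.5, 3.0, 20)    # arbitrary positive pre-activations
factors = np.exp(zs)              # every factor is > 1

product = 1.0
for i, f in enumerate(factors, start=1):
    product *= f
    if i % 5 == 0:
        print(f"after {i:2d} factors: gradient scale ~ {product:.3e}")
# The product grows exponentially; with enough factors it overflows,
# and the resulting huge weight updates eventually produce NaN values.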
How to resolve the vanishing gradient problem?
• Possible solutions:
• Activation functions
• CNN: Residual networks [He et al., 2016]
• RNN: LSTM (Long Short-Term Memory)
Solving Vanishing Gradient: Activation
Functions
• Use different activation functions that are not bounded
• Recent work largely uses ReLU or its variants
• No saturation, easy to optimize
Solving Vanishing Gradient: Residual Networks
• Residual networks (ResNet [He et al., 2016])
• Feed-forward NNs with "shortcut connections"
• The shortcuts preserve gradient flow throughout the entire depth of the network
• It is possible to train networks of more than 100 layers by simply stacking residual blocks (see the sketch below)
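A minimal sketch of why the shortcut helps, using a plain fully-connected residual block with illustrative shapes and weights: because each block outputs x + F(x), the gradient of the output with respect to x is the identity plus the block's own Jacobian, so there is always a direct path for the gradient.

import numpy as np

rng = np.random.default_rng(5)
dim = 8

def make_block():
    """A residual block: two small linear layers with a ReLU, plus a shortcut."""
    W1 = rng.standard_normal((dim, dim)) * 0.1
    W2 = rng.standard_normal((dim, dim)) * 0.1
    def block(x):
        return x + W2 @ np.maximum(0.0, W1 @ x)   # shortcut: x is added back in
    return block

blocks = [make_block() for _ in range(50)]         # a deep stack of residual blocks

x = rng.standard_normal(dim)
out = x
for block in blocks:
    out = block(out)
print(out.shape)   # (8,) -- even 50 blocks deep, the input signal has a direct path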
Solving Vanishing Gradient: LSTM and GRU
• LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units)
• Specially designed RNNs which can remember information for much longer periods
• 3 main steps:
• Forget irrelevant parts of the previous state
• Selectively update the cell state based on the new input
• Selectively decide what part of the cell state to output as the new hidden state
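A minimal sketch of a single LSTM cell step showing those three steps as the forget, input, and output gates; it follows the standard LSTM formulation with biases omitted for brevity, and the sizes and weights are illustrative.

import numpy as np

rng = np.random.default_rng(6)
input_dim, hidden_dim = 4, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on the concatenated [h_prev, x].
def gate_weights():
    return rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1

Wf, Wi, Wo, Wc = (gate_weights() for _ in range(4))

def lstm_step(h_prev, c_prev, x):
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z)            # 1) forget gate: drop irrelevant parts of the old cell state
    i = sigmoid(Wi @ z)            # 2) input gate: decide what new information to write
    c_tilde = np.tanh(Wc @ z)      #    candidate values for the cell state
    c = f * c_prev + i * c_tilde   #    selectively update the cell state
    o = sigmoid(Wo @ z)            # 3) output gate: decide what part of the cell state to expose
    h = o * np.tanh(c)             #    new hidden state
    return h, c

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x in rng.standard_normal((5, input_dim)):
    h, c = lstm_step(h, c, x)
print(h.shape, c.shape)   # (8,) (8,)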
