module5

Recurrent Neural Networks (RNNs) differ from traditional neural networks by allowing information to be fed back into the system, enabling them to remember past inputs and better predict future outputs. They are particularly suited for sequential data tasks, such as language translation and speech recognition, and include various types like LSTMs and GRUs to address challenges like the vanishing gradient problem. Key components of RNNs include recurrent neurons, which maintain hidden states, and the unfolding process that facilitates backpropagation through time for learning dependencies across sequences.


Introduction to RNN

Recurrent Neural Networks (RNNs) work a bit differently from regular neural networks. In a feed-forward neural network, information flows in one direction, from input to output. In an RNN, however, information is fed back into the system after each step. Think of it like reading a sentence: when you are trying to predict the next word, you don't just look at the current word, you also need to remember the words that came before to make an accurate guess.
RNNs allow the network to "remember" past information by feeding the output from one step into the next step. This helps the network understand the context of what has already happened and make better predictions based on that. For example, when predicting the next word in a sentence, the RNN uses the previous words to help decide which word is most likely to come next.

Difference between Feed-Forward Networks and RNNs:

Feed-Forward Neural Networks vs Recurrent Neural Networks


The table below provides a quick comparison between feed-forward neural networks and recurrent neural networks.

Comparison Attribute | Feed-forward Neural Networks | Recurrent Neural Networks
Signal flow direction | Forward only | Bidirectional
Delay introduced | No | Yes
Complexity | Low | High
Neuron independence in the same layer | Yes | No
Speed | High | Slow
Commonly used for | Pattern recognition, speech recognition, and character recognition | Language translation, speech-to-text conversion, and robotic control

Feed-Forward Neural Networks


The feed-forward neural network is one of the most basic artificial neural networks. In this ANN, the input data travels in a single direction: it enters through the input layer and exits through the output layer, while hidden layers may or may not exist in between. The feed-forward neural network therefore has only a forward-propagated signal and no feedback connections.

Recurrent Neural Networks


The Recurrent Neural Network saves the output of a layer and feeds this output back to the input in order to better predict the outcome of the layer. The first layer of an RNN behaves much like a feed-forward neural network, and the recurrence begins once the output of the first layer is computed. After this layer, each unit remembers some information from the previous step, so it can act as a memory cell when performing computations.

How Does an RNN Differ from Feedforward Neural Networks?


Feedforward Neural Networks (FNNs) process data in one direction, from input to output, without retaining information from previous inputs. This makes them suitable for tasks with independent inputs, like image classification. However, FNNs struggle with sequential data since they lack memory.
Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow information from previous steps to be fed back into the network. This feedback enables RNNs to remember prior inputs, making them ideal for tasks where context is important.

Key Components of RNNs


1. Recurrent Neurons
The fundamental processing unit in an RNN is the recurrent unit. Recurrent units hold a hidden state that maintains information about previous inputs in a sequence. Recurrent units can "remember" information from prior steps by feeding back their hidden state, allowing them to capture dependencies across time.
2. RNN Unfolding
RNN unfolding, or unrolling, is the process of expanding the recurrent structure over time steps. During unfolding, each step of the sequence is represented as a separate layer in a series, illustrating how information flows across each time step.
This unrolling enables backpropagation through time (BPTT), a learning process in which errors are propagated across time steps to adjust the network's weights, enhancing the RNN's ability to learn dependencies within sequential data. A minimal sketch of an unrolled forward pass follows below.
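
As a minimal sketch of these two ideas (a NumPy illustration under assumed weight names and sizes, not code from the original text), the same recurrent unit is applied at every time step, and the loop below is exactly the unrolled structure that BPTT backpropagates through:

import numpy as np

# Minimal sketch: a recurrent unit unrolled over a toy sequence. The same weights
# are reused at every time step, and the hidden state carries information forward.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                                   # assumed toy sizes
Wx = rng.normal(scale=0.1, size=(n_hidden, n_in))       # input-to-hidden weights
Wh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hidden)

def rnn_forward(xs):
    h = np.zeros(n_hidden)                   # initial hidden state
    hidden_states = []
    for x_t in xs:                           # one "unrolled layer" per time step
        h = np.tanh(Wx @ x_t + Wh @ h + b)   # recurrent update
        hidden_states.append(h)
    return hidden_states                     # BPTT would propagate errors back through this list

xs = rng.normal(size=(6, n_in))              # a toy input sequence of 6 time steps
print(len(rnn_forward(xs)))                  # 6 hidden states, one per step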
Types Of Recurrent Neural Networks
There are four types of RNNs based on the number of inputs and outputs in the network:
1. One-to-One RNN
This is the simplest type of neural network architecture where there is a single input and a
single output. It is used for straightforward classification tasks such as binary classification
where no sequential data is involved.

2. One-to-Many RNN
In a One-to-Many RNN the network processes a single input to produce multiple outputs over
time. This is useful in tasks where one input triggers a sequence of predictions (outputs). For
example in image captioning a single image can be used as input to generate a sequence of
words as a caption.

3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single output. This
type is useful when the overall context of the input sequence is needed to make one
prediction. In sentiment analysis, the model receives a sequence of words (like a sentence)
and produces a single output like positive, negative or neutral.
4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a sequence of outputs. In a language translation task, a sequence of words in one language is given as input, and a corresponding sequence in another language is generated as output (a small sketch contrasting these patterns is shown below).
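
As a hedged, purely illustrative sketch (my own toy NumPy example, not from the text), the difference between the many-to-many and many-to-one patterns often comes down to which outputs of the recurrent loop are kept:

import numpy as np

# Illustrative sketch: the same recurrent step serves different input/output patterns
# depending on which outputs are kept (toy weight names and sizes are assumptions).
rng = np.random.default_rng(0)
Wx, Wh, Wy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))

def run(xs):
    h, outputs = np.zeros(4), []
    for x_t in xs:
        h = np.tanh(Wx @ x_t + Wh @ h)   # shared recurrent update
        outputs.append(Wy @ h)           # an output at every step
    return outputs

xs = rng.normal(size=(5, 3))             # a toy input sequence of 5 steps
many_to_many = run(xs)                   # keep every output, e.g. sequence tagging / translation
many_to_one = run(xs)[-1]                # keep only the final output, e.g. sentiment analysis
print(len(many_to_many), many_to_one.shape)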
Variants of Recurrent Neural Networks (RNNs)
There are several variations of RNNs, each designed to address specific challenges or
optimize for certain tasks:
1. Vanilla RNN
This simplest form of RNN consists of a single hidden layer where weights are shared across
time steps. Vanilla RNNs are suitable for learning short-term dependencies but are limited by
the vanishing gradient problem, which hampers long-sequence learning.
2. Bidirectional RNNs
Bidirectional RNNs process inputs in both forward and backward directions, capturing both
past and future context for each time step. This architecture is ideal for tasks where the entire
sequence is available, such as named entity recognition and question answering.
3. Long Short-Term Memory Networks (LSTMs)
Long Short-Term Memory Networks (LSTMs) introduce a memory mechanism to overcome
the vanishing gradient problem. Each LSTM cell has three gates:
• Input Gate: Controls how much new information should be added to the cell state.
• Forget Gate: Decides what past information should be discarded.
• Output Gate: Regulates what information should be output at the current step.
This selective memory enables LSTMs to handle long-term dependencies, making them ideal for tasks where earlier context is critical.
4. Gated Recurrent Units (GRUs)
Gated Recurrent Units (GRUs) simplify LSTMs by combining the input and forget gates into
a single update gate and streamlining the output mechanism. This design is computationally
efficient, often performing similarly to LSTMs, and is useful in tasks where simplicity and
faster training are beneficial.
Backpropagation Through Time:
Recurrent Neural Networks are networks that deal with sequential data. They predict outputs using not only the current inputs but also by taking into consideration those that occurred before them. In other words, the current output depends on the current input as well as a memory element (which takes into account the past inputs). To train such networks, we use good old backpropagation, but with a slight twist. We don't train the system only at a specific time t; we train it at time t as well as for all that has happened before time t, such as t-1, t-2, t-3. Consider the following representation of an RNN.

S1, S2, S3 are the hidden states or memory units at time t1, t2, t3 respectively, and Ws is the
weight matrix associated with it. X1, X2, X3 are the inputs at time t1, t2, t3 respectively,
and Wx is the weight matrix associated with it. Y1, Y2, Y3 are the outputs at time t1, t2,
t3 respectively, and Wy is the weight matrix associated with it. For any time t, we have the following two equations:
st = g1(Wx·xt + Ws·st−1)
yt = g2(Wy·st)
where g1 and g2 are activation functions. Let us now perform backpropagation at time t = 3. Let the error function be the squared error
Et = (dt − yt)²
so, at t = 3,
E3 = (d3 − y3)²
where d3 is the desired output at time t = 3. To perform backpropagation, we have to adjust the weights associated with the inputs (Wx), the memory units (Ws) and the outputs (Wy).
Adjusting Wy: For better understanding, note that E3 depends on Wy only through y3, so the chain rule gives this gradient directly.
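
A hedged numerical sketch of this step (scalar toy values, assuming g1 = tanh and g2 = identity; this is my own illustration, not the document's missing figure), comparing the chain-rule gradient dE3/dWy = 2(y3 − d3)·s3 against a finite-difference check:

import numpy as np

# Toy scalar RNN run for three steps, then the gradient of E3 with respect to Wy.
Wx, Ws, Wy = 0.5, 0.8, 1.2            # assumed toy scalar weights
x = [0.3, -0.1, 0.7]                  # inputs x1, x2, x3
d3 = 0.5                              # desired output at t = 3

def forward(wy):
    s = 0.0
    for xt in x:
        s = np.tanh(Wx * xt + Ws * s)  # st = g1(Wx·xt + Ws·st−1) with g1 = tanh
    return wy * s, s                   # y3 = g2(Wy·s3) with g2 = identity

y3, s3 = forward(Wy)
grad_analytic = 2 * (y3 - d3) * s3     # dE3/dWy = dE3/dy3 · dy3/dWy

eps = 1e-6                             # finite-difference check of the same quantity
e_plus = (forward(Wy + eps)[0] - d3) ** 2
e_minus = (forward(Wy - eps)[0] - d3) ** 2
print(grad_analytic, (e_plus - e_minus) / (2 * eps))   # the two values should agree closely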

What is Vanishing Gradient?


The vanishing gradient problem is a challenge that emerges during backpropagation when the
derivatives or slopes of the activation functions become progressively smaller as we move
backward through the layers of a neural network. This phenomenon is particularly prominent
in deep networks with many layers, hindering the effective training of the model. When the weight updates become extremely tiny, or even exponentially small, training time can be significantly prolonged and, in the worst-case scenario, training can halt altogether.
Why the Problem Occurs?
As the gradients propagate back through the layers of the network during backpropagation, they decrease significantly. This means that as they travel from the output layer back toward the input layer, the gradients become progressively smaller. As a result, the weights of the initial layers, which receive these small gradients, are updated little or not at all at each iteration of the optimization process.
The vanishing gradient problem is particularly associated with the sigmoid and hyperbolic tangent (tanh) activation functions because their derivatives fall within the ranges 0 to 0.25 and 0 to 1, respectively. Consequently, the weight updates become very small, causing the updated weights to closely resemble the original ones. This persistence of small updates contributes to the vanishing gradient issue.
The sigmoid and tanh functions squash their inputs into the ranges [0, 1] and [-1, 1] respectively, so they saturate at 0 or 1 for sigmoid and at -1 or 1 for tanh. In these saturated regions, especially when the inputs are very small or very large, the derivatives are very close to zero. While this may not be a major concern in shallow networks with a few layers, it is a more pronounced issue in deep networks. When the inputs fall in saturated regions, the gradients approach zero, resulting in little update to the weights of the earlier layers. In simple networks this does not pose much of a problem, but as more layers are added, these small gradients multiply between layers and decay significantly; consequently, the first layers learn very slowly, which hinders overall model performance and can lead to convergence failure. A small numerical sketch of this effect follows below.
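
A minimal numerical sketch of the effect (my own illustration, assuming a saturated sigmoid unit at every layer): repeatedly multiplying by the derivative of a saturating activation shrinks the gradient signal layer by layer.

import numpy as np

# Walk a gradient of 1.0 back through ten layers of saturated sigmoid units.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

grad = 1.0                 # gradient arriving at the output layer
z = 2.5                    # a pre-activation value in the saturated region
for layer in range(10):
    grad *= sigmoid(z) * (1 - sigmoid(z))          # sigmoid derivative is at most 0.25
    print(f"after layer {layer + 1}: gradient ≈ {grad:.2e}")
# the printed values shrink roughly geometrically, which is the vanishing gradient in miniature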
How can we identify?
Identifying the vanishing gradient problem typically involves monitoring the training
dynamics of a deep neural network.
• One key indicator is observing model weights converging to 0 or stagnation in the
improvement of the model's performance metrics over training epochs.
• During training, if the loss function fails to decrease significantly, or if there is
erratic behavior in the learning curves, it suggests that the gradients may be vanishing.
• Additionally, examining the gradients themselves during backpropagation can provide
insights. Visualization techniques, such as gradient histograms or norms, can aid in
assessing the distribution of gradients throughout the network.
How can we solve the issue?
• Batch Normalization : Batch normalization normalizes the inputs of each layer,
reducing internal covariate shift. This can help stabilize and accelerate the training
process, allowing for more consistent gradient flow.
• Activation function: An activation function like the Rectified Linear Unit (ReLU) can be used. With ReLU, the gradient is 0 for negative (and zero) inputs and 1 for positive inputs, which helps alleviate the vanishing gradient issue: ReLU replaces negative input values with 0 and leaves positive input values unchanged.
• Skip Connections and Residual Networks (ResNets): Skip connections, as seen in
ResNets, allow the gradient to bypass certain layers during backpropagation. This
facilitates the flow of information through the network, preventing gradients from
vanishing.
• Long Short-Term Memory Networks (LSTMs) and Gated Recurrent Units (GRUs): In the context of recurrent neural networks (RNNs), architectures like LSTMs and GRUs are designed to address the vanishing gradient problem in sequences by incorporating gating mechanisms.
• Gradient Clipping: Gradient clipping imposes a threshold on the gradients during backpropagation. Limiting the magnitude of the gradients prevents them from exploding, which can also hinder learning (a minimal clipping sketch follows below).
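
A minimal sketch of clipping by global norm (an illustrative assumption about how the thresholding might be implemented, not the text's own code):

import numpy as np

# Rescale all gradients by a common factor if their combined norm exceeds a threshold.
def clip_by_global_norm(grads, max_norm=5.0):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm       # shrink every gradient by the same factor
        grads = [g * scale for g in grads]
    return grads

rng = np.random.default_rng(0)
grads = [rng.normal(size=(4, 3)) * 10, rng.normal(size=4) * 10]   # toy, deliberately large gradients
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))              # now at most 5.0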

What is LSTM – Long Short Term Memory?


Long Short-Term Memory (LSTM) is an enhanced version of the Recurrent Neural Network (RNN) designed by Hochreiter & Schmidhuber. LSTMs can capture long-term dependencies in sequential data, making them ideal for tasks like language translation, speech recognition and time series forecasting.
Unlike traditional RNNs, which use a single hidden state passed through time, LSTMs introduce a memory cell that holds information over extended periods, addressing the challenge of learning long-term dependencies.
Problem with Long-Term Dependencies in RNN
Recurrent Neural Networks (RNNs) are designed to handle sequential data by maintaining a
hidden state that captures information from previous time steps. However, they often face challenges in learning long-term dependencies, where information from distant time steps becomes crucial for making accurate predictions about the current state. This problem is known as the vanishing gradient or exploding gradient problem.
• Vanishing Gradient: When training a model over time, the gradients (which help the
model learn) can shrink as they pass through many steps. This makes it hard for the
model to learn long-term patterns since earlier information becomes almost irrelevant.
• Exploding Gradient: Sometimes, gradients can grow too large, causing instability.
This makes it difficult for the model to learn properly, as the updates to the model
become erratic and unpredictable.
Both of these issues make it challenging for standard RNNs to effectively capture long-term
dependencies in sequential data.
LSTM Architecture
The LSTM architecture involves a memory cell which is controlled by three gates: the input gate, the forget gate and the output gate. These gates decide what information to add to, remove from and output from the memory cell.
• Input gate: Controls what information is added to the memory cell.
• Forget gate: Determines what information is removed from the memory cell.
• Output gate: Controls what information is output from the memory cell.
This allows LSTM networks to selectively retain or discard information as it flows through the network, enabling them to learn long-term dependencies. The network also has a hidden state, which acts as its short-term memory. This memory is updated using the current input, the previous hidden state and the current state of the memory cell.
Working of LSTM

Information is retained by the cells and the memory manipulations are done by
the gates. There are three gates –
Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, xt (the input at the current time step) and ht−1 (the previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which gives an output between 0 and 1. If for a particular cell state the output is close to 0, the piece of information is forgotten; for an output close to 1, the information is retained for future use.
The equation for the forget gate is:
ft = σ(Wf·[ht−1, xt] + bf)
Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered (similar to the forget gate) using the inputs ht−1 and xt. Then, a vector is created using the tanh function, which gives an output from -1 to +1 and contains the candidate values derived from ht−1 and xt. Finally, the values of the vector and the regulated values are multiplied to obtain the useful information. The equations for the input gate and the candidate values are:
it = σ(Wi·[ht−1, xt] + bi)
Ct′ = tanh(WC·[ht−1, xt] + bC)
We multiply the previous cell state by ft, discarding the information we had previously chosen to forget. Next, we add it ⊙ Ct′, the candidate values scaled by how much we chose to update each state value:
Ct = ft ⊙ Ct−1 + it ⊙ Ct′
where
• ⊙ denotes element-wise multiplication
• tanh is the tanh activation function

Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be remembered using the inputs ht−1 and xt. Finally, the values of the vector and the regulated values are multiplied and sent as the output of the current cell and as input to the next cell. The equations for the output gate are:
ot = σ(Wo·[ht−1, xt] + bo)
ht = ot ⊙ tanh(Ct)
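
A minimal NumPy sketch of a single LSTM cell step following the gate equations above (weight names, sizes and the toy sequence are my own illustrative assumptions):

import numpy as np

# One LSTM step: forget, input and output gates plus the cell-state update.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden = 3, 4
rng = np.random.default_rng(0)
Wf, Wi, Wc, Wo = (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)) for _ in range(4))
bf = bi = bc = bo = np.zeros(n_hidden)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])       # [ht−1, xt]
    f_t = sigmoid(Wf @ z + bf)              # forget gate
    i_t = sigmoid(Wi @ z + bi)              # input gate
    c_cand = np.tanh(Wc @ z + bc)           # candidate cell state Ct′
    c_t = f_t * c_prev + i_t * c_cand       # Ct = ft ⊙ Ct−1 + it ⊙ Ct′
    o_t = sigmoid(Wo @ z + bo)              # output gate
    h_t = o_t * np.tanh(c_t)                # ht = ot ⊙ tanh(Ct)
    return h_t, c_t

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):      # run the cell over a toy sequence of 5 steps
    h, c = lstm_step(x_t, h, c)
print(h)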
Bidirectional LSTM Model
Bidirectional LSTM (Bi LSTM/ BLSTM) is a variation of normal LSTM which processes
sequential data in both forward and backward directions. This allows Bi LSTM to learn
longer-range dependencies in sequential data than traditional LSTMs which can only process
sequential data in one direction.
• Bi LSTMs are made up of two LSTM networks one that processes the input sequence
in the forward direction and one that processes the input sequence in the backward
direction.
• The outputs of the two LSTM networks are then combined to produce the final
output.
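
As a hedged sketch of how the two directions are combined (using plain recurrent cells for brevity; a Bi-LSTM would use the LSTM step above in both directions, and all names and sizes here are illustrative assumptions):

import numpy as np

# Run one recurrent pass forward and one backward, then concatenate per time step.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
Wx_f, Wh_f = rng.normal(scale=0.1, size=(n_hidden, n_in)), rng.normal(scale=0.1, size=(n_hidden, n_hidden))
Wx_b, Wh_b = rng.normal(scale=0.1, size=(n_hidden, n_in)), rng.normal(scale=0.1, size=(n_hidden, n_hidden))

def run(xs, Wx, Wh):
    h, states = np.zeros(n_hidden), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return states

xs = list(rng.normal(size=(6, n_in)))               # toy input sequence of length 6
forward_states = run(xs, Wx_f, Wh_f)                # left-to-right pass
backward_states = run(xs[::-1], Wx_b, Wh_b)[::-1]   # right-to-left pass, realigned to time order
combined = [np.concatenate([f, b]) for f, b in zip(forward_states, backward_states)]
print(combined[0].shape)                            # each step now carries past and future context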
LSTM models including Bi LSTMs have demonstrated state-of-the-art performance across
various tasks such as machine translation, speech recognition and text summarization.
LSTM networks can be stacked to form deeper models allowing them to learn more complex
patterns in data. Each layer in the stack captures different levels of information and time-
based relationships in the input.
Applications of LSTM
Some of the well-known applications of LSTM include:
• Language Modeling: Used in tasks like language modeling, machine translation and
text summarization. These networks learn the dependencies between words in a
sentence to generate coherent and grammatically correct sentences.
• Speech Recognition: Used in transcribing speech to text and recognizing spoken
commands. By learning speech patterns they can match spoken words to
corresponding text.
• Time Series Forecasting: Used for predicting stock prices, weather and energy
consumption. They learn patterns in time series data to predict future events.
• Anomaly Detection: Used for detecting fraud or network intrusions. These networks
can identify patterns in data that deviate drastically and flag them as potential
anomalies.
• Recommender Systems: In recommendation tasks like suggesting movies, music and
books. They learn user behavior patterns to provide personalized suggestions.
• Video Analysis: Applied in tasks such as object detection, activity recognition and
action classification. When combined with Convolutional Neural Networks
(CNNs) they help analyze video data and extract useful information.

Truncated BPTT:
Truncated Backpropagation Through Time (Truncated BPTT) is a modified version of
Backpropagation Through Time (BPTT) used to train recurrent neural networks (RNNs),
particularly for long sequences, by limiting the number of time steps back through which
gradients are calculated. This approach reduces computational cost and memory usage compared to standard BPTT, but can introduce bias by neglecting long-term dependencies.

What is Backpropagation Through Time (BPTT)?


• BPTT is a standard algorithm used to train RNNs by propagating error gradients
through time, allowing the network to learn from sequential data.
• It unfolds the RNN's recurrent connections over time, creating a chain of
computations that can be used to calculate the gradient of the loss function with
respect to the network's parameters.
Why is Truncated BPTT Needed?
• For very long sequences, BPTT can become computationally expensive, requiring a
large amount of memory and computation to backpropagate through all time steps.
• Truncated BPTT addresses this by truncating the backpropagation process after a
fixed number of time steps, effectively creating a shorter chain of computations.
How Truncated BPTT Works:
• Instead of backpropagating through the entire sequence, truncated BPTT only
backpropagates through a limited number of time steps (the truncation length).
• This significantly reduces the computational cost, especially for long sequences.
• However, it also introduces a bias in the gradient estimate, as the network is not able
to fully learn from the entire sequence.
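
A hedged sketch of the chunking pattern described above (my own illustration of how the windows and the carried hidden state might look; the weights, sizes and update rule are toy assumptions):

import numpy as np

# Split a long sequence into windows of k steps; carry the hidden state across windows,
# but only backpropagate within the current window.
rng = np.random.default_rng(0)
k = 5                                        # truncation length
sequence = rng.normal(size=(23, 3))          # a toy sequence of 23 time steps
Wx = rng.normal(scale=0.1, size=(4, 3))
Wh = rng.normal(scale=0.1, size=(4, 4))
h = np.zeros(4)                              # hidden state carried across windows

for start in range(0, len(sequence), k):
    window = sequence[start:start + k]       # at most k consecutive time steps
    # In an autodiff framework the incoming h would be detached here, so gradients
    # stop at the window boundary and BPTT only runs over these k steps.
    for x_t in window:
        h = np.tanh(Wx @ x_t + Wh @ h)       # forward pass over this window only
    # ...the backward pass and weight update would use just this window's computations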
Advantages of Truncated BPTT:
• Reduced computational cost:
Truncated BPTT is significantly faster and requires less memory than standard BPTT,
especially for long sequences.
• Scalability:
It allows for training RNNs with longer sequences, which would be impossible with standard
BPTT.
Disadvantages of Truncated BPTT:
• Bias in gradient estimates:
The truncation introduces a bias in the gradient estimate, which can lead to suboptimal
performance.
• Difficulty in learning long-term dependencies:
The truncated backpropagation can make it difficult for the network to learn long-term
dependencies in the data.
Alternatives and Solutions:
• Adaptive Truncation:
Some methods adapt the truncation length during training, allowing the network to learn from
longer sequences when necessary.
• Anticipated Reweighted Truncated Backpropagation (ARTBP):
ARTBP aims to unbias truncated BPTT by using variable truncation lengths and carefully
chosen compensation factors.
• Other RNN Architectures:
Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are designed
to better handle long-term dependencies and may be a better choice than standard RNNs
when dealing with long sequences.

GRU vs LSTM
GRUs are more computationally efficient because they combine the forget and input gates into a single update gate. GRUs do not maintain an internal cell state as LSTMs do; instead, they store information directly in the hidden state, making them simpler and faster.

Feature | LSTM (Long Short-Term Memory) | GRU (Gated Recurrent Unit)
Gates | 3 (Input, Forget, Output) | 2 (Update, Reset)
Cell State | Yes, it has a cell state | No (hidden state only)
Training Speed | Slower due to complexity | Faster due to simpler architecture
Computational Load | Higher due to more gates and parameters | Lower due to fewer gates and parameters
Performance | Often better in tasks requiring long-term memory | Performs similarly in many tasks with less complexity

Gated Recurrent Units (GRUs):


The core idea behind GRUs is to use gating mechanisms to selectively update the hidden state at each time step, allowing them to remember important information while discarding irrelevant details. GRUs aim to simplify the LSTM architecture by merging some of its components and focusing on just two main gates: the update gate and the reset gate.

The GRU consists of two main gates:


1. Update Gate (zt): This gate decides how much information from the previous hidden state should be retained for the next time step.
2. Reset Gate (rt): This gate determines how much of the past hidden state should be forgotten.
These gates allow the GRU to control the flow of information in a more efficient manner compared to traditional RNNs, which rely solely on the hidden state.
Equations for GRU Operations
The internal workings of a GRU can be described using the following equations:
1. Reset gate:
rt = σ(Wr·[ht−1, xt])
The reset gate determines how much of the previous hidden state ht−1 should be forgotten.
2. Update gate:
zt = σ(Wz·[ht−1, xt])
The update gate determines how much of the new information should be carried into the next hidden state.
3. Candidate hidden state:
ht′ = tanh(Wh·[rt ⊙ ht−1, xt])
This is the potential new hidden state, calculated based on the current input and the previous hidden state.
4. Hidden state:
ht = (1 − zt) ⊙ ht−1 + zt ⊙ ht′
The final hidden state is a weighted average of the previous hidden state ht−1 and the candidate hidden state ht′, based on the update gate zt.
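
A minimal NumPy sketch of a single GRU step following the equations above (weight names, sizes and the toy sequence are my own illustrative assumptions):

import numpy as np

# One GRU step: reset gate, update gate, candidate hidden state, then the blend.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden = 3, 4
rng = np.random.default_rng(0)
Wr, Wz, Wh = (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)) for _ in range(3))

def gru_step(x_t, h_prev):
    z_in = np.concatenate([h_prev, x_t])                          # [ht−1, xt]
    r_t = sigmoid(Wr @ z_in)                                      # reset gate
    z_t = sigmoid(Wz @ z_in)                                      # update gate
    h_cand = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))    # candidate hidden state ht′
    return (1 - z_t) * h_prev + z_t * h_cand                      # blend old state with the candidate

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):      # run over a toy sequence of 5 steps
    h = gru_step(x_t, h)
print(h)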
