Module 5
Recurrent Neural Networks (RNNs) work a bit differently from regular neural networks. In a feed-forward neural network, information flows in one direction, from input to output. In an RNN, however, information is fed back into the system after each step. Think of it like reading a sentence: when you try to predict the next word, you don't just look at the current word, you also need to remember the words that came before to make an accurate guess.
RNNs allow the network to “remember” past information by feeding the output from one step into the next step. This helps the network understand the context of what has already happened and make better predictions based on it. For example, when predicting the next word in a sentence, the RNN uses the previous words to help decide which word is most likely to come next.
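To make the recurrence concrete, here is a minimal sketch of a single vanilla RNN step in NumPy. The weight names (W_x, W_s, W_y) mirror the Wx, Ws, Wy notation used later in this module; the layer sizes and random initialisation are illustrative assumptions, not a full implementation.

import numpy as np

def rnn_step(x_t, s_prev, W_x, W_s, W_y, b_s, b_y):
    """One vanilla RNN step: the new state mixes the current input
    with the previous state, so past information is carried forward."""
    s_t = np.tanh(W_x @ x_t + W_s @ s_prev + b_s)   # hidden state ("memory")
    y_t = W_y @ s_t + b_y                           # output at this step
    return s_t, y_t

# Illustrative sizes: 4-dimensional inputs, 8-dimensional hidden state, 3 outputs
rng = np.random.default_rng(0)
W_x, W_s, W_y = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
b_s, b_y = np.zeros(8), np.zeros(3)

s = np.zeros(8)                      # the memory starts out empty
for x in rng.normal(size=(5, 4)):    # a toy sequence of 5 inputs
    s, y = rnn_step(x, s, W_x, W_s, W_y, b_s, b_y)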
Comparison Attribute                      Feed-forward Neural Networks    Recurrent Neural Networks
Neuron independence in the same layer     Yes                             No
Types of Recurrent Neural Networks
2. One-to-Many RNN
In a One-to-Many RNN, the network processes a single input to produce multiple outputs over time. This is useful in tasks where one input triggers a sequence of predictions (outputs). For example, in image captioning, a single image is used as input to generate a sequence of words as a caption.
3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single output. This type is useful when the overall context of the input sequence is needed to make one prediction. In sentiment analysis, for example, the model receives a sequence of words (such as a sentence) and produces a single output such as positive, negative or neutral.
4. Many-to-Many RNN
The Many-to-Many RNN processes a sequence of inputs and generates a sequence of outputs. In a language translation task, a sequence of words in one language is given as input, and a corresponding sequence in another language is generated as output.
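As a rough illustration of how these input/output patterns differ in practice, the sketch below uses PyTorch's nn.RNN: a many-to-one model keeps only the final hidden state, while a many-to-many model keeps the output at every time step. The layer sizes and heads are arbitrary assumptions for illustration.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(2, 10, 16)            # batch of 2 sequences, 10 time steps each

outputs, h_n = rnn(x)                 # outputs: (2, 10, 32), h_n: (1, 2, 32)

# Many-to-one (e.g. sentiment analysis): use only the final hidden state
sentiment_logits = nn.Linear(32, 3)(h_n[-1])    # shape (2, 3): one prediction per sequence

# Many-to-many (e.g. tagging / translation-style): use the output at every step
per_step_logits = nn.Linear(32, 5)(outputs)     # shape (2, 10, 5): one prediction per time step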
Variants of Recurrent Neural Networks (RNNs)
There are several variations of RNNs, each designed to address specific challenges or
optimize for certain tasks:
1. Vanilla RNN
This simplest form of RNN consists of a single hidden layer where weights are shared across
time steps. Vanilla RNNs are suitable for learning short-term dependencies but are limited by
the vanishing gradient problem, which hampers long-sequence learning.
2. Bidirectional RNNs
Bidirectional RNNs process inputs in both forward and backward directions, capturing both
past and future context for each time step. This architecture is ideal for tasks where the entire
sequence is available, such as named entity recognition and question answering.
3. Long Short-Term Memory Networks (LSTMs)
Long Short-Term Memory Networks (LSTMs) introduce a memory mechanism to overcome
the vanishing gradient problem. Each LSTM cell has three gates:
• Input Gate: Controls how much new information should be added to the cell state.
• Forget Gate: Decides what past information should be discarded.
• Output Gate: Regulates what information should be output at the current step.
This selective memory enables LSTMs to handle long-term dependencies, making them ideal for tasks where earlier context is critical.
4. Gated Recurrent Units (GRUs)
Gated Recurrent Units (GRUs) simplify LSTMs by combining the input and forget gates into
a single update gate and streamlining the output mechanism. This design is computationally
efficient, often performing similarly to LSTMs, and is useful in tasks where simplicity and
faster training are beneficial.
Backpropagation Through Time (BPTT):
Recurrent Neural Networks are networks that deal with sequential data. They predict outputs using not only the current inputs but also by taking into consideration those that occurred before. In other words, the current output depends on the current input as well as on a memory element (which takes into account the past inputs). To train such networks, we use good old backpropagation, but with a slight twist. We don’t train the system only at a specific time “t”; we train it at time “t” as well as at everything that has happened before time “t”, such as t-1, t-2, t-3. Consider the following representation of an RNN:
S1, S2, S3 are the hidden states or memory units at time t1, t2, t3 respectively, and Ws is the
weight matrix associated with it. X1, X2, X3 are the inputs at time t1, t2, t3 respectively,
and Wx is the weight matrix associated with it. Y1, Y2, Y3 are the outputs at time t1, t2,
t3 respectively, and Wy is the weight matrix associated with it. For any time, t, we have the
following two equations:
St = g1(Wx⋅Xt + Ws⋅St−1)
Yt = g2(Wy⋅St)
where g1 and g2 are activation functions. Let us now perform backpropagation at time t = 3.
Let the error function be:
Et = (dt − Yt)²
so at t = 3,
E3 = (d3 − Y3)²
We are using the squared error here, where d3 is the desired output at time t = 3. To perform backpropagation, we have to adjust the weights associated with the inputs, the memory units and the outputs.
Adjusting Wy: For better understanding, let us consider the following representation:
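Since E3 depends on Wy only through the output Y3, the weight update for Wy at t = 3 follows from the chain rule (written here in the notation defined above; this is the standard BPTT step rather than a formula taken from the original figure):
∂E3/∂Wy = (∂E3/∂Y3)⋅(∂Y3/∂Wy)
Adjusting Ws and Wx is more involved, because S3 also depends on S2 and S1, so their gradients must be accumulated backwards through all the earlier time steps.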
Long Short-Term Memory (LSTM) Structure
In an LSTM, information is retained by the cells and the memory manipulations are done by the gates. There are three gates:
Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, xt (the input at the current time step) and ht−1 (the previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which gives an output between 0 and 1. If the output for a particular cell state is close to 0, the piece of information is forgotten; for an output close to 1, the information is retained for future use.
The equation for the forget gate is:
ft = σ(Wf⋅[ht−1, xt] + bf)
Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered, similar to the forget gate, using the inputs ht−1 and xt. Then, a vector is created using the tanh function, which gives an output from -1 to +1 and contains all the possible values from ht−1 and xt. Finally, the values of the vector and the regulated values are multiplied to obtain the useful information.
The equation for the input gate is:
it = σ(Wi⋅[ht−1, xt] + bi)
and the candidate vector produced by the tanh function is:
C̃t = tanh(WC⋅[ht−1, xt] + bC)
We multiply the previous cell state Ct−1 by ft, disregarding the information we had previously chosen to ignore. Next, we add it ⊙ C̃t, the updated candidate values scaled by how much we chose to update each state value:
Ct = ft ⊙ Ct−1 + it ⊙ C̃t
where
• ⊙ denotes element-wise multiplication
• tanh is the tanh activation function
Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be remembered using the inputs ht−1 and xt. Finally, the values of the vector and the regulated values are multiplied and sent as the output of the current cell and as an input to the next cell. The equations for the output gate are:
ot = σ(Wo⋅[ht−1, xt] + bo)
ht = ot ⊙ tanh(Ct)
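Putting the three gates together, here is a minimal NumPy sketch of a single LSTM cell step following the equations above. The weight shapes and the concatenation of ht−1 and xt into one vector are illustrative assumptions; framework implementations are heavily optimised versions of the same idea.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step following the forget/input/output gate equations above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_C @ z + b_C)         # candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state / output
    return h_t, c_t

# Illustrative sizes: 4-dimensional input, 8-dimensional hidden state
rng = np.random.default_rng(0)
W = lambda: rng.normal(scale=0.1, size=(8, 12))   # 12 = 8 (hidden) + 4 (input)
h, c = np.zeros(8), np.zeros(8)
h, c = lstm_step(rng.normal(size=4), h, c, W(), W(), W(), W(), *[np.zeros(8)] * 4)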
Bidirectional LSTM Model
Bidirectional LSTM (Bi LSTM / BLSTM) is a variation of the standard LSTM which processes sequential data in both forward and backward directions. This gives a Bi LSTM access to both past and future context at every time step, unlike a traditional LSTM, which can only process sequential data in one direction.
• Bi LSTMs are made up of two LSTM networks: one that processes the input sequence in the forward direction and one that processes it in the backward direction.
• The outputs of the two LSTM networks are then combined to produce the final
output.
LSTM models including Bi LSTMs have demonstrated state-of-the-art performance across
various tasks such as machine translation, speech recognition and text summarization.
LSTM networks can be stacked to form deeper models allowing them to learn more complex
patterns in data. Each layer in the stack captures different levels of information and time-
based relationships in the input.
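As a small illustration, the sketch below builds a stacked (2-layer) bidirectional LSTM with PyTorch's nn.LSTM; the layer sizes are arbitrary assumptions. Note that the output feature dimension doubles because the forward and backward outputs are concatenated.

import torch
import torch.nn as nn

# 2 stacked layers, processing the sequence in both directions
bilstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2,
                 batch_first=True, bidirectional=True)

x = torch.randn(4, 20, 16)             # batch of 4 sequences, 20 steps each
outputs, (h_n, c_n) = bilstm(x)

print(outputs.shape)   # (4, 20, 64): forward and backward outputs concatenated
print(h_n.shape)       # (4, 4, 32): (num_layers * num_directions, batch, hidden_size)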
Applications of LSTM
Some of the well-known applications of LSTMs include:
• Language Modeling: Used in tasks like language modeling, machine translation and
text summarization. These networks learn the dependencies between words in a
sentence to generate coherent and grammatically correct sentences.
• Speech Recognition: Used in transcribing speech to text and recognizing spoken
commands. By learning speech patterns they can match spoken words to
corresponding text.
• Time Series Forecasting: Used for predicting stock prices, weather and energy
consumption. They learn patterns in time series data to predict future events.
• Anomaly Detection: Used for detecting fraud or network intrusions. These networks
can identify patterns in data that deviate drastically and flag them as potential
anomalies.
• Recommender Systems: In recommendation tasks like suggesting movies, music and
books. They learn user behavior patterns to provide personalized suggestions.
• Video Analysis: Applied in tasks such as object detection, activity recognition and
action classification. When combined with Convolutional Neural Networks
(CNNs) they help analyze video data and extract useful information.
Truncated BPTT:
Truncated Backpropagation Through Time (Truncated BPTT) is a modified version of
Backpropagation Through Time (BPTT) used to train recurrent neural networks (RNNs),
particularly for long sequences, by limiting the number of time steps back through which
gradients are calculated. This approach reduces computational cost and memory usage compared to standard BPTT, but it can introduce bias by neglecting long-term dependencies that reach further back than the truncation window.
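A common way to implement truncated BPTT in practice is to detach the hidden state between chunks, so gradients only flow back through a fixed window of time steps. The sketch below is a hedged illustration using PyTorch; the model, chunk length and data are placeholder assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

data = torch.randn(2, 100, 8)        # one long sequence per batch element
targets = torch.randn(2, 100, 1)
k = 20                               # truncation window: backprop through at most 20 steps

h = torch.zeros(1, 2, 16)
for start in range(0, 100, k):
    chunk = data[:, start:start + k]
    out, h = rnn(chunk, h)
    loss = F.mse_loss(head(out), targets[:, start:start + k])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    h = h.detach()                   # cut the graph: gradients do not flow past this chunk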
GRU vs LSTM
GRUs are more computationally efficient because they combine the forget and input gates into a single update gate. GRUs do not maintain an internal cell state as LSTMs do; instead, they store information directly in the hidden state, making them simpler and faster.
Feature              LSTM (Long Short-Term Memory)              GRU (Gated Recurrent Unit)
Computational Load   Higher due to more gates and parameters    Lower due to fewer gates and parameters
2. Reset gate:
The reset gate determines how much of the previous hidden state ht−1 should be forgotten.
3. Update gate:
The update gate determines how much of the new information should be used to update the hidden state.
4. Candidate hidden state:
This is the potential new hidden state, calculated based on the current input and the previous hidden state.
5. Hidden state:
The final hidden state is a weighted average of the previous hidden state ht−1 and the candidate hidden state ht′, based on the update gate zt.
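To tie these components together, here is a minimal NumPy sketch of one GRU step implementing the reset gate, update gate, candidate hidden state and final hidden state described above. The weight shapes and the convention used for weighting with zt are assumptions for illustration (some frameworks use the opposite convention).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step following the components described above."""
    z_in = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ z_in + b_r)                                     # reset gate
    z_t = sigmoid(W_z @ z_in + b_z)                                     # update gate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)   # candidate hidden state
    h_t = (1 - z_t) * h_prev + z_t * h_cand                             # weighted average of old and candidate
    return h_t

# Illustrative sizes: 4-dimensional input, 8-dimensional hidden state
rng = np.random.default_rng(0)
W = lambda: rng.normal(scale=0.1, size=(8, 12))                         # 12 = 8 (hidden) + 4 (input)
h = np.zeros(8)
for x in rng.normal(size=(5, 4)):
    h = gru_step(x, h, W(), W(), W(), np.zeros(8), np.zeros(8), np.zeros(8))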