DL-UNIT_5
Recurrent Neural Networks introduce a mechanism where the output from one step is fed back
as input to the next, allowing them to retain information from previous inputs. This design
makes RNNs well-suited for tasks where context from earlier steps is essential, such as
predicting the next word in a sentence.
The defining feature of RNNs is their hidden state—also called the memory state—which
preserves essential information from previous inputs in the sequence. By using the same
parameters across all steps, RNNs perform consistently across inputs, reducing parameter
complexity compared to traditional neural networks. This capability makes RNNs highly
effective for sequential tasks.
Feedforward Neural Networks (FNNs) process data in one direction, from input to output, without retaining information from previous inputs. This makes them suitable for tasks with independent inputs, like image classification. However, FNNs struggle with sequential data since they lack memory.
Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow information
from previous steps to be fed back into the network. This feedback enables RNNs to remember
prior inputs, making them ideal for tasks where context is important.
1. Recurrent Neurons
The fundamental processing unit in a Recurrent Neural Network (RNN) is the recurrent unit (it is not formally called a "recurrent neuron"). Recurrent units hold a hidden state that maintains information about previous inputs in a sequence. By feeding their hidden state back into themselves, recurrent units can "remember" information from prior steps, allowing them to capture dependencies across time.
2. RNN Unfolding
RNN unfolding, or “unrolling,” is the process of expanding the recurrent structure over time
steps. During unfolding, each step of the sequence is represented as a separate layer in a series,
illustrating how information flows across each time step. This unrolling enables
backpropagation through time (BPTT), a learning process where errors are propagated across
time steps to adjust the network’s weights, enhancing the RNN’s ability to learn dependencies
within sequential data.
Types of Recurrent Neural Networks
There are four types of RNNs based on the number of inputs and outputs in the network:
1. One-to-One RNN
A One-to-One RNN behaves like a vanilla (feed-forward) neural network and is the simplest type of architecture. In this setup, there is a single input and a single output. It is commonly used for straightforward classification tasks where input data points do not depend on previous elements.
2. One-to-Many RNN
In a One-to-Many RNN, the network processes a single input to produce multiple outputs over
time. This setup is beneficial when a single input element should generate a sequence of
predictions.
For example, in an image captioning task, a single image is given as input and the model predicts a sequence of words as the caption.
3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single output. This type is
useful when the overall context of the input sequence is needed to make one prediction.
In sentiment analysis, the model receives a sequence of words (like a sentence) and produces a
single output, which is the sentiment of the sentence (positive, negative, or neutral).
4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a sequence of
outputs. This configuration is ideal for tasks where the input and output sequences need to
align over time, often in a one-to-one or many-to-many mapping.
In language translation task, a sequence of words in one language is given as input, and a
corresponding sequence in another language is generated as output.
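As a rough illustration of the many-to-one and many-to-many patterns above, the sketch below (assuming PyTorch and arbitrary layer sizes) contrasts keeping only the final hidden state with keeping the output at every time step:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # illustrative sizes
x = torch.randn(2, 5, 8)          # batch of 2 sequences, 5 time steps, 8 features each

outputs, h_n = rnn(x)             # outputs: (2, 5, 16), h_n: (1, 2, 16)

# Many-to-one (e.g. sentiment): use only the last hidden state for one prediction
sentiment_logits = nn.Linear(16, 3)(h_n.squeeze(0))      # shape (2, 3)

# Many-to-many (e.g. tagging or translation-style alignment): predict at every step
per_step_logits = nn.Linear(16, 10)(outputs)             # shape (2, 5, 10)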
Recurrent Neural Networks are networks that deal with sequential data. They can predict outputs based not only on the current input but also on the inputs that came before it. The output at the present time step depends on the current input and on the memory element (which encodes the previous inputs).
To train these networks, we make use of traditional backpropagation with an added twist: we don't train the system only at the exact time "t", but at time "t" together with everything that has occurred prior to it, i.e. t-1, t-2, t-3, and so on.
S1, S2, and S3 are the hidden states (memory units) at times t1, t2, and t3, respectively, and Ws is the weight matrix associated with them.
X1, X2, and X3 are the inputs at times t1, t2, and t3, respectively, and Wx is the weight matrix associated with them.
Y1, Y2, and Y3 are the outputs at times t1, t2, and t3, respectively, and Wy is the weight matrix associated with them.
St = g1(Wx xt + Ws St-1)
Yt = g2(Wy St)
Et = (dt - Yt)^2
Here, we employ the squared error, in which dt is the desired output at time t (for example, d3 is the desired output at t = 3).
In order to do backpropagation, it is necessary to change the weights that are associated with
inputs, memory units, and outputs.
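The update equations above can be written out directly as a short forward pass. The sketch below is a NumPy illustration under stated assumptions (tanh for g1, the identity for g2, and random weights and targets); backpropagation through time would then differentiate the accumulated error with respect to Wx, Ws, and Wy:

import numpy as np

rng = np.random.default_rng(1)
W_x = rng.normal(size=(3, 2))   # input weights (hidden_size x input_size)
W_s = rng.normal(size=(3, 3))   # recurrent weights
W_y = rng.normal(size=(1, 3))   # output weights

X = rng.normal(size=(3, 2))     # inputs x1, x2, x3
d = rng.normal(size=(3, 1))     # desired outputs d1, d2, d3

S = np.zeros(3)                 # initial memory state S0
total_error = 0.0
for t in range(3):
    S = np.tanh(W_x @ X[t] + W_s @ S)       # St = g1(Wx xt + Ws St-1)
    Y = W_y @ S                             # Yt = g2(Wy St), with g2 = identity here
    total_error += float((d[t] - Y) ** 2)   # Et = (dt - Yt)^2

# BPTT would differentiate total_error with respect to W_x, W_s, and W_y,
# propagating the error backwards through all three time steps.
print(total_error)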
Encoder-Decoder Model:
In deep learning, the encoder-decoder architecture is a type of neural network most widely
associated with the transformer architecture and used in sequence-to-sequence learning.
Literature thus refers to encoder-decoders at times as a form of sequence-to-sequence model
(seq2seq model). Much machine learning research focuses on encoder-decoder models for
natural language processing (NLP) tasks involving large language models (LLMs).
Encoder-decoder models are used to handle sequential data, specifically mapping input
sequences to output sequences of different lengths, such as neural machine translation, text
summarization, image captioning and speech recognition. In such tasks, mapping a token in the
input to one in the output is often indirect. For example, take machine translation: in some
languages, the verb appears near the beginning of the sentence (as in English), in others at the
end (such as German) and in some, the location of the verb may be more variable (for example,
Latin). An encoder-decoder network generates variable length yet contextually appropriate
output sequences to correspond to a given input sequence.
As may be inferred from their respective names, the encoder encodes a given input into a
vector representation, and the decoder decodes this vector into the same data type as the
original input dataset.
Both the encoder and decoder are separate neural networks. They may be recurrent neural networks (RNNs) and their variants, long short-term memory (LSTM) networks and gated recurrent units (GRUs), or convolutional neural networks (CNNs), as well as transformer models. An encoder-decoder model typically contains several encoders and several decoders.
Each encoder consists of two layers: the self-attention layer (or self-attention mechanism) and
the feed-forward neural network. The first layer guides the encoder in surveying and focusing
on other related words in a given input as it encodes one specific word therein. The feed-
forward neural network further processes encodings so they are acceptable for subsequent
encoder or decoder layers.
The decoder part also consists of a self-attention layer and feed-forward neural network, as
well as an additional third layer: the encoder-decoder attention layer. This layer focuses
network attention on specific parts of the output of the encoder. The multi-head attention
layer thereby maps tokens from two different sequences.
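A compact sketch of the encoder-decoder idea with recurrent networks follows (the GRU sizes, vocabulary sizes, start-token id, and greedy decoding loop are all illustrative assumptions; transformer-based encoder-decoders replace the GRUs with the attention layers described above):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab=100, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, hidden, batch_first=True)

    def forward(self, src):
        _, h = self.gru(self.embed(src))   # h: final hidden state summarising the input
        return h

class Decoder(nn.Module):
    def __init__(self, vocab=120, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, prev_token, h):
        o, h = self.gru(self.embed(prev_token), h)
        return self.out(o), h              # logits over the target vocabulary

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, 100, (1, 7))          # one source sequence of length 7
h = encoder(src)                             # encode the input into a vector representation

token = torch.zeros(1, 1, dtype=torch.long)  # assumed start-of-sequence token id 0
generated = []
for _ in range(10):                          # generate the output sequence step by step
    logits, h = decoder(token, h)
    token = logits.argmax(-1)                # greedy choice of the next token
    generated.append(int(token))
print(generated)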
Attention Mechanism:
Let’s take a look at hearing and a case study of selective attention in the context of a crowded
cocktail party. Assume you’re at a social gathering with a large number of people speaking at
the same time. You’re also talking with a friend, and the background noise barely registers.
You are only paying attention to your friend’s voice and grasping their words while filtering out
background noise. In this scenario, our auditory system employs selective attention to focus on
the relevant auditory information. The neurological system of our brain improves the
representation of speech by prioritizing relevant sounds and ignoring background noises.
In deep learning, the attention mechanism is a computational method for prioritizing specific information in a given context. In natural language processing, attention is used during translation or question-answering tasks to align pertinent portions of the source sentence with the output. Without necessarily relying on reinforcement learning, attention mechanisms allow
neural networks to give various weights to various input items, boosting their ability to capture
crucial information and improve performance in a variety of tasks. Google Streetview’s house
number identification is an example of an attention mechanism in Computer vision that enables
models to systematically identify particular portions of an image for processing.
Attention Mechanism
Recurrent models of visual attention use reinforcement learning to focus attention on key areas
of the image. A recurrent neural network governs the glimpse network, which dynamically selects particular locations for exploration over time. In some classification tasks, this method can outperform convolutional neural networks. Additionally, this framework goes beyond image identification
and may be used for a variety of visual reinforcement learning applications, such as helping
robots choose behaviours to accomplish particular goals. Although the most basic use of this
strategy is supervised learning, the use of reinforcement learning permits more adaptable and
flexible decision-making based on feedback from past glances and rewards earned throughout
the learning process.
The application of attention mechanisms to image captioning has substantially enhanced the
quality and accuracy of generated captions. By incorporating attention, the model learns to
focus on pertinent image regions while creating each caption word. The model can synchronize
the visual and textual modalities by paying attention to various areas of the image at each time
step thanks to the attention mechanism. By focusing on important objects or areas in the
image, the model is able to produce captions that are more detailed and contextually
appropriate. Attention-based image captioning models have proven to perform better at capturing minute details, managing complicated scenes, and delivering cohesive and informative captions that closely match the visual material.
The attention mechanism is a technique used in machine learning and natural language
processing to increase model accuracy by focusing on relevant data. It enables the model to
focus on certain areas of the input data, giving more weight to crucial features and disregarding
unimportant ones. To accomplish this, each input feature is given a weight based on how important it is to the output. The performance of tasks that make use of the
attention mechanism has significantly improved in areas including speech recognition, image
captioning, and machine translation.
An attention mechanism in a neural network model typically consists of the following steps:
Input Encoding: The input sequence of data is represented or embedded using a collection of
representations. This step transforms the input into a format that can be processed by the
attention mechanism.
Query Generation: A query vector is generated based on the current state or context of the
model. This query vector represents the information the model wants to focus on or retrieve
from the input.
Key-Value Pair Creation: The input representations are split into key-value pairs. The keys
capture the information that will be used to determine the importance or relevance, while the
values contain the actual data or information.
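After the key-value pairs are formed, the remaining steps are standard: each query is compared against the keys to produce attention scores, the scores are normalised with a softmax into weights, and a weighted sum of the values gives the output. A minimal NumPy sketch of this scaled dot-product form (the dimensions below are arbitrary assumptions):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                    # compare queries with keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V, weights                                # weighted sum of the values

rng = np.random.default_rng(2)
Q = rng.normal(size=(1, 4))    # one query vector
K = rng.normal(size=(6, 4))    # keys for 6 input positions
V = rng.normal(size=(6, 8))    # values for the same positions
context, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))           # attention weights sum to 1 across the 6 input positions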
Here are the main attention mechanisms used for image processing:
1. Spatial Attention
Commonly used in Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs).
Example:
In object detection, spatial attention helps the model focus on the main object while ignoring
the background.
2. Channel Attention
Weights each feature channel according to its importance, as in squeeze-and-excitation networks (see the sketch after this list).
Example:
In image classification, channel attention helps the model emphasize informative features like object edges rather than less useful ones.
3. Self-Attention (Transformers)
The Vision Transformer (ViT) divides an image into patches and applies self-attention to learn
global dependencies.
4. Cross-Attention
Used when multiple images or modalities (e.g., text and image) are involved.
Example:
In image captioning or visual question answering, the text attends to image features so that each generated word or answer is aligned with the relevant image regions.
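As a concrete illustration of channel attention (item 2 above), a squeeze-and-excitation style block can be sketched as follows; the channel count, reduction ratio, and feature-map sizes are illustrative assumptions:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Squeeze-and-excitation style channel attention: weight each feature
    # channel by a learned importance score between 0 and 1.
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (batch, channels, H, W)
        squeezed = x.mean(dim=(2, 3))         # squeeze: global average pool per channel
        weights = self.fc(squeezed)           # excitation: per-channel importance scores
        return x * weights[:, :, None, None]  # reweight the channels of the feature map

features = torch.randn(2, 64, 8, 8)           # an illustrative CNN feature map
out = ChannelAttention()(features)            # same shape, channels rescaled by importance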