Unit 4: Deep Learning
Recurrent Neural Networks: Backpropagation through time, Long Short Term
Memory, Gated Recurrent Units, Bidirectional LSTMs, Bidirectional RNNs.
Convolutional Neural Networks: LeNet, AlexNet. Generative models: Restricted
Boltzmann Machines (RBMs), Introduction to MCMC and Gibbs Sampling, gradient
computations in RBMs, Deep Boltzmann Machines.
Summary of the BPTT Process:
Forward pass through the unrolled network.
Compute loss at each time step using softmax + cross-entropy.
Backpropagate gradients through the unrolled network treating each time-step’s weights as distinct.
Aggregate gradients across all time steps for shared weights.
Update the shared weights using the aggregated gradients.
Why This Matters:
BPTT enables learning temporal dependencies by training RNNs with gradient descent.
TBPTT makes training feasible on long sequences and large datasets.
Understanding these mechanisms is crucial when working with sequence models such as RNNs, LSTMs, or GRUs; a short sketch of truncated BPTT follows below.
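A minimal sketch of truncated BPTT, assuming PyTorch, a plain `nn.RNN` with a linear readout, and a chunk length of 20 time steps (all names and sizes here are illustrative, not from the notes). The key idea is detaching the hidden state at each chunk boundary so gradients only flow back through the current chunk, while the loss at every time step still uses softmax + cross-entropy and the shared weights receive gradients aggregated over all steps in the chunk.

```python
import torch
import torch.nn as nn

# Illustrative model: vanilla RNN + linear readout; CrossEntropyLoss applies softmax internally.
rnn = nn.RNN(input_size=8, hidden_size=32, batch_first=True)
readout = nn.Linear(32, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)

def tbptt_step(x_chunk, y_chunk, hidden):
    """One truncated-BPTT update on a chunk of k time steps."""
    hidden = hidden.detach()                     # truncation: stop backprop at the chunk boundary
    out, hidden = rnn(x_chunk, hidden)           # forward pass through the unrolled chunk
    logits = readout(out)                        # (batch, k, num_classes)
    loss = criterion(logits.reshape(-1, 10), y_chunk.reshape(-1))  # loss at each time step
    optimizer.zero_grad()
    loss.backward()                              # gradients aggregated over the shared weights
    optimizer.step()                             # update the shared weights
    return hidden, loss.item()

# Usage on random data: a 100-step sequence processed in chunks of k = 20.
x = torch.randn(4, 100, 8)
y = torch.randint(0, 10, (4, 100))
h = torch.zeros(1, 4, 32)
for t in range(0, 100, 20):
    h, chunk_loss = tbptt_step(x[:, t:t + 20], y[:, t:t + 20], h)
```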
Long Short-Term Memory:
Long Short-Term Memory (LSTM) networks introduce a gated memory cell as their key innovation to address the vanishing gradient problem that arises when training Recurrent Neural Networks (RNNs).
Dynamic Time Scales:
A key refinement introduced by Gers et al. (2000) is that the self-loop weight is not fixed but is controlled (gated) by the network itself, conditioned on the context (i.e., the input and hidden state).
This dynamically adjusts how long the network should remember information.
Even though the LSTM's weights are fixed after training, the gates produce time-varying memory behavior based on the current input sequence.
🧠 Interpretations and Benefits:
Internal Recurrence: The cell state s_t has its own self-loop, enabling persistent memory unaffected by short-term noise.
Additive Memory Update: The cell state update is additive, not multiplicative—this is critical for stable gradient flow.
Contextual Forgetting: Gates let the network decide when to forget or remember, dynamically adjusting over time.
Gradient Stability: The self-loop allows the gradient to flow backward without vanishing as it does in vanilla RNNs.
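For reference, a standard formulation of the LSTM updates (the weight names W, U, b are illustrative, and biases and peephole terms vary across variants). The forget gate f_t gates the self-loop on the cell state s_t, and the additive update in the fifth line is what keeps gradient flow stable:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{s}_t &= \tanh(W_s x_t + U_s h_{t-1} + b_s) && \text{(candidate state)}\\
s_t &= f_t \odot s_{t-1} + i_t \odot \tilde{s}_t && \text{(additive cell update)}\\
h_t &= o_t \odot \tanh(s_t) && \text{(hidden output)}
\end{aligned}
```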
Applications:
LSTMs have been hugely successful across a wide range of sequence tasks:
Handwriting recognition and generation
Speech recognition
Machine translation
Image captioning
Syntactic parsing
🧪 Extra Notes:
Some models allow the cell state s_t to influence the gates directly, introducing additional learnable parameters.
Bias Initialization: Forget gates are often initialized with a positive bias to encourage retention early in training.
Variants: Many variations exist (e.g., peephole connections, coupled input-forget gates, GRUs).
Gated Recurrent Units:
The Gated Recurrent Unit (GRU) is a simplified alternative to the LSTM, introduced to retain the
benefits of learning long-term dependencies while using fewer parameters and simpler
computations.
Summary:
GRU simplifies the LSTM by removing the cell state and combining gates.
It’s faster and easier to train, often with similar or slightly worse performance than LSTM depending on the task.
It provides stable gradient flow and is ideal when computational resources or data are limited.
Not a subset of LSTM: It’s a different architecture, although conceptually related.
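As a reference, the standard GRU equations under one common convention (the weight names W, U are illustrative, and some texts swap the roles of z_t and 1 - z_t). Note there is no separate cell state; the hidden state itself carries the memory:

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\!\big(W_h x_t + U_h (r_t \odot h_{t-1})\big) && \text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(interpolated update)}
\end{aligned}
```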
Bidirectional LSTMs:
Convolutional Neural Networks:
The purpose of LeNet was to recognize handwritten and machine-printed characters, and it was used by many banks to read the handwritten numbers on cheques. Because it was used for banking automation, specifically for the automated reading of cheques, the accuracy demanded was naturally quite high.
LeNet-5 has 5 layers of operations and could deliver an error rate as low as 0.95 percent on the test data; that is, the accuracy on the test data was more than 99 percent.
The input to this convolutional neural network was a grayscale image of size 32 × 32. If the input image was larger than this, it had to be scaled down to 32 × 32; similarly, if it was smaller, it had to be scaled up to that size.
The first convolution layer, shown here as layer C1, uses 6 kernels, each of size 5 × 5, and the convolution is performed with stride equal to 1. Because there are 6 kernels, this convolution layer generates 6 different feature maps, and since the kernel size is 5 × 5 with stride one, every feature map is of size 28 × 28.
If we want the size of the feature map to stay the same as the input, we have to add extra rows and columns, known as padding. No padding was used for this convolutional layer.
The output of this convolution layer passes through a non-linearity, which here is ReLU, and then we have the pooling layer, or sub-sampling layer.
The pooling used in this case is average pooling, not max pooling, with a window size of 2 × 2 and stride equal to 2. After pooling, the size of each feature map becomes 14 × 14, which is exactly half of the input feature map size. Pooling reduces the dimension of the feature map while collecting a local neighborhood statistic; the number of channels remains the same, which is 6. These 6 feature maps of size 14 × 14 are then passed to a second convolution layer.
The second convolution layer, layer C3, has 16 kernels, each of size 5 × 5, with stride equal to 1.
The size of the feature maps in this case is 10 × 10. The way the feature maps are generated by this convolution layer is a bit asymmetric. The first reason for making it asymmetric is to break the symmetry in the network; the second is that, because of this asymmetry, the number of parameters and connections is kept within a reasonable bound. When the 16 kernels are used to generate the 16 feature maps, not all 6 feature maps from the previous layer are fed to every kernel; instead, an asymmetric connection scheme is used. The connection table in the original LeNet diagram specifies how these connections are made.
After these 16 feature maps we again have a pooling, or subsampling, layer. Here again the pooling window is 2 × 2 with stride 2, which gives 16 feature maps, each of size 5 × 5. Then we have a fully connected network, or fully connected layers, whose purpose is to classify, or recognize, the input data.
We have 3 such fully connected layers. For the fully connected layers C5 and F6, the non-linearity used is the hyperbolic tangent (tanh), whereas the output layer uses a SoftMax non-linearity, i.e., it is a SoftMax classifier.
The number of nodes in the output layer is 10, because this network was used to recognize the numerals on handwritten cheques. The numerals run from 0 to 9, so there are 10 of them, and hence the output layer has ten nodes, one for each numeral from 0 to 9. This is the connectivity of LeNet-5.
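A minimal PyTorch sketch of the LeNet-5 layout described above. The 120/84 sizes of the C5/F6 fully connected layers are the standard LeNet-5 values and are assumed here, since the notes do not state them; the asymmetric C3 connection table is not replicated (full connectivity is used instead); and ReLU is used after the convolutions as the notes describe, although the original network used tanh-like non-linearities throughout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):
    """LeNet-5 layout as described in the notes above (full C3 connectivity assumed)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.c1 = nn.Conv2d(1, 6, kernel_size=5, stride=1)    # 1 x 32x32 -> 6 x 28x28
        self.s2 = nn.AvgPool2d(kernel_size=2, stride=2)       # -> 6 x 14x14
        self.c3 = nn.Conv2d(6, 16, kernel_size=5, stride=1)   # -> 16 x 10x10
        self.s4 = nn.AvgPool2d(kernel_size=2, stride=2)       # -> 16 x 5x5
        self.c5 = nn.Linear(16 * 5 * 5, 120)                  # fully connected C5 (assumed size)
        self.f6 = nn.Linear(120, 84)                          # fully connected F6 (assumed size)
        self.out = nn.Linear(84, num_classes)                 # 10 output nodes for digits 0-9

    def forward(self, x):
        x = F.relu(self.c1(x))          # ReLU after convolution, as in the notes
        x = self.s2(x)
        x = F.relu(self.c3(x))
        x = self.s4(x)
        x = torch.flatten(x, 1)
        x = torch.tanh(self.c5(x))      # tanh in C5 and F6, as described
        x = torch.tanh(self.f6(x))
        return self.out(x)              # logits; SoftMax is applied inside the loss

logits = LeNet5()(torch.randn(1, 1, 32, 32))   # -> shape (1, 10)
```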
AlexNet
The AlexNet architecture:
The max pooling window is of size 3 × 3, but the stride is 2; that means the max pooling is done over overlapping windows. After max pooling, the size of every feature map is reduced to 27 × 27, and the number of feature channels remains 96. Then you have the second convolution layer, where the convolution kernel size is 5 × 5.
Padding is used so that the output of the convolution layer stays the same size as its input. So, the size of the feature maps generated by the second convolutional layer remains the same as the input feature map size.
The number of kernels in this case is 256, so that means, from this convolution layer output
you get 256 different channels or different feature maps and every feature map is of size
27 × 27.
Then again you have an overlapping max pool layer, where max pooling is again done over a window of size 3 × 3 with stride equal to 2. So, again, max pooling is done over overlapping windows, and the output is a set of 13 × 13 feature maps with 256 channels, because max pooling does not change the number of channels.
After this you have three consecutive convolution layers. The first of these has a kernel size of 3 × 3 with padding equal to 1 and 384 kernels, which gives 384 feature maps of size 13 × 13. These pass through the next convolution layer, whose kernel size is again 3 × 3 with padding equal to 1. This layer also has 384 kernels, so its output again has 384 channels, or 384 feature maps, and every feature map is of size 13 × 13. This is because a padding of 1 is used with a 3 × 3 kernel, so the feature map size at the output of this convolution layer remains the same as the size of the feature maps input to it.
This again passes through another convolution layer, where the kernel size is again 3 × 3 with padding equal to 1, so the output feature maps keep the same size of 13 × 13. Here AlexNet uses 256 kernels, which means the 384 input channels are converted into 256 output channels, i.e., 256 feature maps, each of size 13 × 13.
This is followed by the next overlapping max pool layer; again, max pooling is done over a 3 × 3 window with stride equal to 2, which gives the output feature maps. The number of channels remains the same, 256, and the size of every feature map is 6 × 6.
After this we have fully connected layers, which are the same as the multilayer perceptron discussed earlier. The first two fully connected layers have 4096 nodes each. At the output of the last max pool layer we have 6 × 6 × 256 = 9216 nodes, or features, and each of them is connected to each node in the first fully connected layer.
So, the number of connections, or parameters, in this case is 9216 × 4096. Then every node of this first fully connected layer provides input to every node in the second fully connected layer.
So, here the number of connections is 4096 × 4096, because the number of nodes in the second fully connected layer is also 4096. Finally, we have the output layer, which has 1000 SoftMax nodes. The number of connections from the last fully connected layer to this output layer is 4096 × 1000. This is the overall architecture, or functional diagram, of AlexNet.
What is AlexNet?
AlexNet is a deep convolutional neural network (CNN) that revolutionized image
classification by winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
2012 with a top-5 error rate of 15.3%.
Architecture Overview
AlexNet consists of:
5 Convolutional Layers
3 Max-Pooling Layers (all overlapping with stride 2)
3 Fully Connected (FC) Layers
Final SoftMax Output Layer with 1000 nodes
Input:
RGB image of size 227 × 227 × 3 (cropped from 256 × 256 during training)
If grayscale, the image is replicated across RGB channels
Layer-by-Layer Breakdown
Conv Layer 1:
Kernel: 11×11, Stride: 4, 96 filters
Output: 96 feature maps, each 55×55
Followed by: Max Pooling (3×3, Stride 2) → Output: 27×27×96
Conv Layer 2:
Kernel: 5×5, Padding: yes, 256 filters
Output: 27×27×256
Followed by: Max Pooling (3×3, Stride 2) → Output: 13×13×256
Conv Layer 3:
Kernel: 3×3, Padding: 1, 384 filters
Output: 13×13×384
Conv Layer 4:
Kernel: 3×3, Padding: 1, 384 filters
Output: 13×13×384
Conv Layer 5:
Kernel: 3×3, Padding: 1, 256 filters
Output: 13×13×256
Followed by: Max Pooling (3×3, Stride 2) → Output: 6×6×256
Fully Connected Layers:
FC1:
Input: 6×6×256 = 9216 nodes
Output: 4096 nodes
FC2:
Input: 4096
Output: 4096
FC3 (Output Layer):
Input: 4096
Output: 1000 classes (SoftMax)
Implementation & Training Details
Total Parameters: ~60 million
Neurons: ~650,000
Trained on 2 GPUs (network split into two parallel pipelines)
Training Duration: ~1 week
Optimizer: Stochastic Gradient Descent with Momentum
Loss Function: Cross-entropy (SoftMax output)
✅ Key Feature
Top-5 Error Rate: 15.3%
If the correct label is not among the 5 highest probability predictions, it's counted as an error.
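A minimal PyTorch sketch following the layer-by-layer breakdown above. This is a single-pipeline version: the original network was split across two GPUs and used local response normalization and dropout, which are omitted here for brevity, so this is not a faithful reproduction of the published model.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """AlexNet layout following the breakdown above (single pipeline, no LRN/dropout)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),     # 227x227x3 -> 55x55x96
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # -> 27x27x96 (overlapping pooling)
            nn.Conv2d(96, 256, kernel_size=5, padding=2),   # -> 27x27x256
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # -> 13x13x256
            nn.Conv2d(256, 384, kernel_size=3, padding=1),  # -> 13x13x384
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),  # -> 13x13x384
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),  # -> 13x13x256
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # -> 6x6x256
        )
        self.classifier = nn.Sequential(
            nn.Linear(6 * 6 * 256, 4096),   # 9216 x 4096 connections
            nn.ReLU(inplace=True),
            nn.Linear(4096, 4096),          # 4096 x 4096 connections
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),   # 4096 x 1000; SoftMax applied inside the loss
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = AlexNetSketch()(torch.randn(1, 3, 227, 227))   # -> shape (1, 1000)
```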
Restricted Boltzmann Machines (RBMs):
A Restricted Boltzmann Machine is a generative stochastic neural network that learns a probability distribution over its set of inputs.
It is an undirected probabilistic graphical model with:
One layer of visible (observed) units v
One layer of hidden (latent) units h
No intra-layer connections (i.e., no visible-to-visible or hidden-to-hidden connections)
Also called Harmonium (Smolensky, 1986)
Structure
The network is a bipartite graph:
Every visible unit connects to every hidden unit
No connections among visible units
No connections among hidden units
Conditional Distributions Are Tractable:
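Because the graph is bipartite, the conditionals factorize over units. For a binary RBM with energy E(v, h) = -b^T v - c^T h - v^T W h (notation as in the Deep Learning book), each conditional is a product of independent sigmoids:

```latex
P(h \mid v) = \prod_j P(h_j \mid v), \qquad
P(h_j = 1 \mid v) = \sigma\!\Big(c_j + \sum_i v_i W_{ij}\Big)
```

```latex
P(v \mid h) = \prod_i P(v_i \mid h), \qquad
P(v_i = 1 \mid h) = \sigma\!\Big(b_i + \sum_j W_{ij} h_j\Big)
```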
Figure 20.1 (Deep Learning book): examples of models that may be built with restricted Boltzmann machines. Panel (a): the RBM itself is an undirected graphical model based on a bipartite graph, with visible units in one part of the graph and hidden units in the other. There are no connections among the visible units, nor among the hidden units. Typically every visible unit is connected to every hidden unit, but it is possible to construct sparsely connected RBMs such as convolutional RBMs.
Training RBMs
RBMs can be trained using techniques suited for models with intractable partition functions:
Common Algorithms:
Contrastive Divergence (CD)
Stochastic Maximum Likelihood (SML) / Persistent CD
Ratio Matching, etc.
Why Training is Efficient for RBMs:
Sampling from P(h | v) and P(v | h) is easy due to the factorial structure
Inference is exact for conditionals, unlike in deeper models (e.g., DBMs)
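A minimal numpy sketch of one Contrastive Divergence step with a single Gibbs step (CD-1) for a binary RBM. The variable names, sizes, and learning rate are illustrative; the update uses the positive statistics from the data minus the negative statistics from the one-step reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.01):
    """One CD-1 update for a binary RBM.
    W: (n_visible, n_hidden) weights, b: visible biases, c: hidden biases."""
    # Positive phase: P(h = 1 | v0) and a sample h0.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # One Gibbs step: reconstruct v, then recompute hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)

    # Gradient estimate: positive statistics minus negative statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)

# Usage on random binary data (4 visible units, 3 hidden units, batch of 8).
W = 0.01 * rng.standard_normal((4, 3))
b = np.zeros(4)
c = np.zeros(3)
v0 = (rng.random((8, 4)) < 0.5).astype(float)
cd1_update(v0, W, b, c)
```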
Stacking RBMs:
RBMs are building blocks for:
Deep Belief Networks (DBNs): hybrid of directed and undirected connections
Deep Boltzmann Machines (DBMs): fully undirected, multiple hidden layers
Each higher layer in DBNs/DBMs typically learns from the latent representation of the lower layer.
Deep Boltzmann Machine (DBM)—a powerful generative model that builds on ideas
from Restricted Boltzmann Machines (RBMs) and Deep Belief Networks (DBNs)
Summary:
DBM = Deep + Undirected + Energy-based
Unlike DBNs, all connections are undirected.
Unlike RBMs, it has more than one hidden layer.
Like RBMs, each layer’s units are conditionally independent given adjacent layers, making Gibbs
sampling tractable for inference.
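For a DBM with one visible layer v and two hidden layers h^(1), h^(2), the energy function (omitting bias terms, following the Deep Learning book) takes the form:

```latex
E\big(v, h^{(1)}, h^{(2)}; \theta\big)
  = -\, v^{\top} W^{(1)} h^{(1)} \;-\; {h^{(1)}}^{\top} W^{(2)} h^{(2)}
```

Because only adjacent layers interact in the energy, the units within each layer are conditionally independent given the neighbouring layers, which is what makes block Gibbs sampling convenient.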
Key Characteristics
Binary Units: DBMs commonly use binary stochastic units (e.g., taking values {0,1}), though they can be extended to real-
valued visible units for more flexible data modeling (e.g., continuous-valued inputs like pixels).
Layer-Wise Conditional Independence: Even though the full model is undirected and complex, within each layer,
units are conditionally independent given adjacent layers, which simplifies block Gibbs sampling.
Applications: DBMs have been applied to tasks such as:
Document modeling
Image recognition
Representation learning
Interesting Properties of Deep Boltzmann Machines:
Deep Boltzmann machines have many interesting properties. This part discusses the unique strengths and challenges of Deep Boltzmann Machines (DBMs) compared to Deep Belief Networks (DBNs).
DBM Mean-Field Inference:
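Mean-field inference approximates the posterior P(h^(1), h^(2) | v) with a fully factorial distribution Q whose parameters are the unit activation probabilities ĥ^(1), ĥ^(2). These are found by iterating fixed-point equations of roughly the following form for a two-hidden-layer binary DBM (biases omitted; this is a sketch of the standard updates, not a verbatim quotation of the book's equations):

```latex
\hat{h}_j^{(1)} = \sigma\!\Big(\sum_i v_i\, W_{ij}^{(1)} + \sum_k W_{jk}^{(2)}\, \hat{h}_k^{(2)}\Big),
\qquad
\hat{h}_k^{(2)} = \sigma\!\Big(\sum_j \hat{h}_j^{(1)}\, W_{jk}^{(2)}\Big)
```

The two equations are iterated until the ĥ values stop changing, giving a variational estimate of the hidden-unit activations for a given v.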
DBM Parameter Learning:
Parameter learning in Deep Boltzmann Machines (DBMs) is based on variational stochastic maximum likelihood (variational SML). Variational stochastic maximum likelihood as applied to the DBM is given in Algorithm 20.1 of the Deep Learning book.
Jointly Training Deep Boltzmann Machines:
A detailed overview of two modern approaches to jointly train Deep Boltzmann Machines (DBMs), as
alternatives to the classic (greedy layer-wise) training procedure
❌ Limitation:
Still not great at classification compared to regularized MLPs
Gradient Computations in Restricted Boltzmann Machines (RBMs):
Restricted Boltzmann Machines (RBMs) are energy-based models that learn a probability distribution
over inputs. Their training is based on maximizing the log-likelihood of observed data, typically
using stochastic gradient ascent.
1. Objective Function
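The objective is the average log-likelihood of the observed data. Its gradient with respect to each weight splits into a data-driven positive phase and a model-driven negative phase; the negative phase involves an expectation under the model distribution, which is intractable and is approximated by methods such as CD or SML:

```latex
\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log P\big(v^{(n)}; \theta\big),
\qquad
\frac{\partial \log P(v)}{\partial W_{ij}}
  = \mathbb{E}_{P(h \mid v)}\big[v_i h_j\big] \;-\; \mathbb{E}_{P(v, h)}\big[v_i h_j\big]
```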