III-II CSM (Ar 20) DL 5 Units Question Answers
Introduction to Deep Learning – deep learning – why deep learning – difference between deep
learning and machine learning – Neural Networks basics – Gradient Descent, Derivatives, Computation
graph, Vectorization, Vectorizing logistic regression – Shallow neural networks: Activation functions,
non-linear activation functions, Back propagation, Data classification with a hidden layer – Linear
Classifiers, Linear Machines with Hinge Loss – Optimization Techniques – Batch Optimization.
1. Better performance: Deep learning algorithms have been shown to achieve state-of-the-
art performance on a variety of tasks, including image and speech recognition, natural
language processing, and playing games like Go and chess. This has led to significant
improvements in areas such as healthcare, finance, and transportation.
2. Scalability: Deep learning algorithms are highly scalable and can handle large datasets
with ease. This has enabled the development of AI systems that can process and analyze
vast amounts of data in real-time.
3. Autonomous learning: Deep learning algorithms are designed to learn and improve on
their own, without the need for human intervention. This allows AI systems to adapt and
improve over time as they encounter new data.
4. Flexibility: Deep learning algorithms can be applied to a wide range of applications and
domains, from self-driving cars to drug discovery. This flexibility has led to the
development of new and innovative AI applications that were previously impossible.
Overall, deep learning has revolutionized the field of AI by enabling machines to learn
from data in a way that is similar to how humans learn. This has led to significant
advances in a wide range of applications and has the potential to transform many
industries in the future.
2 What is the difference between deep learning and machine learning? Provide examples to illustrate your answer. (4M)
ANS: Deep learning is a subset of machine learning, and the main difference between the
two is the complexity of the models that they use to learn from data. Machine learning
algorithms typically use simpler models and focus on finding patterns in data that can be
used to make predictions or decisions. Deep learning algorithms, on the other hand, use
more complex models that are composed of multiple layers of artificial neurons. These
neural networks are designed to learn hierarchical representations of data, which can lead
to more accurate predictions and better performance on complex tasks.
To illustrate the difference between deep learning and machine learning, consider a few
examples. For filtering spam e-mail, a classical machine learning approach might use
logistic regression or a decision tree on hand-engineered features such as word counts.
For recognizing objects in images or transcribing speech, a deep learning approach uses a
multi-layer neural network (for example, a convolutional or recurrent network) that learns
the relevant features directly from the raw pixels or audio, rather than relying on
features designed by hand.
In summary, while machine learning algorithms focus on simpler models and patterns,
deep learning algorithms use more complex models that allow them to learn more
abstract and hierarchical representations of data. As a result, deep learning is typically
used in applications where the data is complex and the task requires a higher level of
accuracy and performance.
Computation graphs organize the computation of a specific function. Consider the below
example:
J(a,b,c) = 3(a+bc)
We have to calculate J given a, b, and c. We can divide this into three steps:
1. u = bc
2. v = a+u
3. J = 3v
This is the forward propagation step where we have calculated the output, i.e., J. We can
also use computation graphs for backward propagation where we update the parameters,
a,b and c in the above example.
Now let's see how we can calculate derivatives with the help of a computation graph.
Suppose we have to calculate dJ/da. The steps will be:
1. Since J = 3v, dJ/dv = 3.
2. Since v = a + u, dv/da = 1.
3. By the chain rule, dJ/da = (dJ/dv)·(dv/da) = 3 × 1 = 3.
Now we will take the concept of computation graphs and gradient descent together and
see how the parameters of logistic regression can be updated.
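As a small illustration, the forward and backward passes over this graph can be written in a few lines of Python (the input values are arbitrary and chosen only for the sketch):

# Forward pass through the computation graph J(a, b, c) = 3(a + b*c)
a, b, c = 5.0, 3.0, 2.0        # arbitrary example values
u = b * c                      # step 1
v = a + u                      # step 2
J = 3 * v                      # step 3 -> J = 33

# Backward pass: apply the chain rule from J back to the inputs
dJ_dv = 3.0                    # J = 3v
dv_da, dv_du = 1.0, 1.0        # v = a + u
du_db, du_dc = c, b            # u = b*c
dJ_da = dJ_dv * dv_da          # = 3
dJ_db = dJ_dv * dv_du * du_db  # = 3*c = 6
dJ_dc = dJ_dv * dv_du * du_dc  # = 3*b = 9
print(J, dJ_da, dJ_db, dJ_dc)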
4 What is vectorization in deep learning, and how does it help improve the efficiency of computation? (4M)
ANS: Vectorization is a technique used in deep learning to perform mathematical
operations on entire arrays or matrices of data at once, rather than looping over each
element of the array one at a time. This can greatly improve the efficiency of
computation, especially when dealing with large datasets and complex models.
Without vectorization, the same operation has to be applied to one element at a time inside
explicit loops. Vectorization, in contrast, allows us to perform the operation on the entire
matrix at once, using optimized linear algebra libraries that can take advantage of hardware
acceleration (such as GPUs) to perform the computation in parallel. This can be orders of
magnitude faster than using loops.
Here's an example of how vectorization can be used to improve the efficiency of a neural
network:
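The following is a minimal NumPy sketch of the two approaches; the function names, shapes, and random data are illustrative assumptions:

import numpy as np

def forward_loops(X, W):
    # Non-vectorized: nested loops over examples, output units and features
    m, n = X.shape             # m examples, n features
    k = W.shape[1]             # k output units; W has shape (n, k)
    Y = np.zeros((m, k))
    for i in range(m):
        for j in range(k):
            for p in range(n):
                Y[i, j] += X[i, p] * W[p, j]
    return Y

def forward_vectorized(X, W):
    # Vectorized: one matrix multiplication over the whole batch
    Y = X.dot(W)
    return Y

X = np.random.randn(1000, 100)
W = np.random.randn(100, 10)
assert np.allclose(forward_loops(X, W), forward_vectorized(X, W))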
In the non-vectorized implementation, we use nested loops to iterate over each element of
the input matrix X and weight matrix W, and compute the dot product of each pair of
rows and columns. In the vectorized implementation, we use the dot product operator to
compute the dot product of the entire matrix X with the entire matrix W at once, without
the need for loops.
5 What are the activation functions used in neural networks, and how do they differ from each other? (4M)
ANS: Activation functions are mathematical functions that are applied to the output of a
neuron in a neural network. They are used to introduce non-linearity into the network,
allowing it to learn complex patterns and relationships in the data. There are several
activation functions that are commonly used in neural networks, each with its own
strengths and weaknesses. Some of the most common activation functions are:
1. Sigmoid function: The sigmoid function is defined as f(x) = 1 / (1 + exp(-x)). It maps any
input value to a range between 0 and 1, making it useful for binary classification
problems where the output is a probability estimate. However, the sigmoid function can
suffer from the problem of vanishing gradients, where the gradients of the function
become very small as the input values become very large or very small.
2. ReLU function: The ReLU (Rectified Linear Unit) function is defined as f(x) = max(0,
x). It returns the input value if it is positive, and 0 otherwise. The ReLU function is
simple to compute and has been found to work well in practice, especially in deep neural
networks. However, it can suffer from the problem of dead neurons, where the neurons
can become permanently inactive if their input values are always negative.
3. Leaky ReLU function: The Leaky ReLU function is a variant of the ReLU function that
is defined as f(x) = max(0.1x, x). It is similar to the ReLU function, but introduces a
small slope for negative input values, which can help to prevent the problem of dead
neurons.
4. Tanh function: The Tanh (hyperbolic tangent) function is defined as f(x) = (exp(x) - exp(-
x)) / (exp(x) + exp(-x)). It maps any input value to a range between -1 and 1, making it
useful for classification problems with outputs that can take negative values. However,
like the sigmoid function, it can suffer from the problem of vanishing gradients.
5. Softmax function: The Softmax function is a generalization of the sigmoid function that
is used for multi-class classification problems. It maps any input vector to a probability
distribution over multiple classes, ensuring that the sum of the probabilities adds up to 1.
It is defined as f(x_i) = exp(x_i) / sum(exp(x_j)), where i ranges over all the classes and j
ranges over all the inputs.
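These functions are straightforward to express in NumPy. The sketch below follows the definitions given above (the leaky-ReLU slope of 0.1 and the test vector are illustrative):

import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def relu(x):       return np.maximum(0.0, x)
def leaky_relu(x): return np.maximum(0.1 * x, x)
def tanh(x):       return np.tanh(x)
def softmax(x):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), leaky_relu(z), tanh(z), softmax(z))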
Overall, the choice of activation function depends on the specific requirements of the
problem and the characteristics of the data. It is often necessary to experiment with
different activation functions and architectures to find the optimal combination for a
given task.
6 Explain the process of gradient descent and its role in optimizing the weights and biases of a neural network. How does it help in improving the accuracy of a deep learning model? (10M)
ANS: Gradient descent is an optimization algorithm used to find the optimal values for
the weights and biases in a neural network. The goal of the algorithm is to minimize the
loss function, which measures the difference between the predicted outputs of the
network and the actual outputs.
The gradient descent algorithm works by computing the gradient of the loss function with
respect to each weight and bias in the network. The gradient indicates the direction of the
steepest increase in the loss function, so by moving in the opposite direction, we can
move towards the minimum of the loss function. This process is repeated iteratively, with
each iteration updating the values of the weights and biases to move towards the
minimum of the loss function.
How does Gradient Descent work?
Before looking at the working principle of gradient descent, it helps to recall how the slope
of a line appears in simple linear regression. The equation for simple linear regression is:
Y = mX + c
Where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.
The starting point is chosen arbitrarily and is used only to evaluate the initial
performance. At this starting point we compute the first derivative (the slope) of the cost
function; the tangent line at that point tells us how steep the slope is, and this slope
informs the updates to the parameters (weights and bias).
The slope is steep at the arbitrary starting point, but as new parameter values are
generated the steepness gradually reduces, until the lowest point of the curve is
approached; this lowest point is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, i.e., the error
between the expected and actual output. To minimize the cost function, two factors are
required: the direction of steepest descent (given by the gradient) and the learning rate
(the step size). These two factors determine the partial-derivative calculation for the next
iteration and carry the parameters towards the point of convergence, i.e., a local or
global minimum. Let's discuss the learning rate in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum (lowest point). It is typically a
small value, and it is evaluated and updated based on the behavior of the cost function. If
the learning rate is high, the steps are larger, but there is a risk of overshooting the
minimum. A low learning rate means small step sizes, which compromises overall efficiency
but gives the advantage of more precision.
There are two main types of gradient descent: batch gradient descent and stochastic
gradient descent. In batch gradient descent, the gradient is computed over the entire
training dataset, while in stochastic gradient descent, the gradient is computed for each
individual training example. Stochastic gradient descent is typically faster and more
efficient for large datasets.
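A minimal sketch of batch gradient descent fitting the line Y = mX + c from the earlier example by minimizing the mean squared error (the synthetic data, learning rate, and number of epochs are illustrative choices):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 2.0 * X + 1.0                      # synthetic data for y = 2x + 1
m, c = 0.0, 0.0                        # arbitrary starting point
lr = 0.01                              # learning rate (step size)

for epoch in range(5000):
    Y_pred = m * X + c
    error = Y_pred - Y
    # gradients of the MSE loss w.r.t. m and c
    dm = 2.0 * np.mean(error * X)
    dc = 2.0 * np.mean(error)
    # move against the gradient (direction of steepest descent)
    m -= lr * dm
    c -= lr * dc

print(m, c)   # approaches 2 and 1 as the loss converges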
Gradient descent plays a crucial role in optimizing the weights and biases of a neural
network, as it allows the network to learn from the data and improve its accuracy. By
iteratively updating the weights and biases to minimize the loss function, the network can
adjust its parameters to better fit the patterns in the data. This can lead to significant
improvements in accuracy, particularly in complex tasks such as image recognition or
natural language processing.
However, gradient descent can also suffer from several problems, such as getting stuck in
local minima or taking a long time to converge to the optimal solution. To address these
issues, various techniques have been developed, such as momentum, learning rate
schedules, and adaptive learning rates, which help to improve the efficiency and stability
of the optimization process.
The backpropagation algorithm works by propagating the error backwards through the
network, from the output layer to the input layer. It does this by computing the partial
derivatives of the loss function with respect to each weight and bias in the network. These
partial derivatives are then used to update the weights and biases, so that the network can
learn from the data and improve its accuracy.
The backpropagation algorithm can be broken down into several steps:
1. Forward propagation: The input is fed forward through the network, and the output is
computed for each neuron in each layer. The output of the final layer is compared to the
target output, and the difference is used to calculate the loss function.
2. Backward propagation: The error is propagated backwards through the network, starting
from the output layer. The partial derivatives of the loss function with respect to each
weight and bias in the network are computed using the chain rule of differentiation.
These partial derivatives are used to update the weights and biases of the network.
3. Weight and bias updates: The weights and biases of the network are updated using the
partial derivatives computed during the backward propagation step. This update is
performed using an optimization algorithm such as gradient descent, which moves the
weights and biases towards the direction of the minimum of the loss function.
4. Repeat: Steps 1-3 are repeated for each batch of training examples, until the loss function
converges to a minimum.
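The four steps above can be sketched for a tiny one-hidden-layer network with sigmoid activations and a mean-squared-error loss; the shapes, random data, and learning rate are illustrative assumptions, not a prescribed architecture:

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # 8 examples, 3 features
T = rng.random((8, 1))                   # target outputs in [0, 1]
W1, b1 = rng.standard_normal((3, 4)) * 0.5, np.zeros((1, 4))
W2, b2 = rng.standard_normal((4, 1)) * 0.5, np.zeros((1, 1))
lr = 0.1

for step in range(1000):
    # 1. Forward propagation
    H = sigmoid(X @ W1 + b1)             # hidden layer
    Y = sigmoid(H @ W2 + b2)             # output layer
    loss = np.mean((Y - T) ** 2)
    # 2. Backward propagation (chain rule)
    dY = 2 * (Y - T) / len(X)            # dLoss/dY
    dZ2 = dY * Y * (1 - Y)               # through the output sigmoid
    dW2, db2 = H.T @ dZ2, dZ2.sum(0, keepdims=True)
    dH = dZ2 @ W2.T
    dZ1 = dH * H * (1 - H)               # through the hidden sigmoid
    dW1, db1 = X.T @ dZ1, dZ1.sum(0, keepdims=True)
    # 3. Weight and bias updates (gradient descent)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(loss)                              # decreases over the iterations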
Linear classifiers work by computing a weighted sum of the input features, and applying
a threshold to determine the predicted class. In the case of binary classification, where
there are only two possible classes, the predicted class is determined by whether the
weighted sum of the features is greater or less than a certain threshold. In the case of
multi-class classification, where there are more than two possible classes, the predicted
class is determined by the class with the highest weighted sum of features.
The weights in a linear classifier are typically learned through an optimization process,
such as gradient descent. During training, the weights are adjusted to minimize the loss
function, which measures the difference between the predicted output and the actual
output.
One of the main limitations of linear classifiers is that they are only able to classify data
that can be separated by a linear boundary. This means that they may not be suitable for
complex classification tasks, where the decision boundary between different classes may
be non-linear. In addition, linear classifiers may not be robust to noisy data, as they are
sensitive to outliers and may produce inaccurate predictions in the presence of noise.
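A sketch of a binary linear classifier trained with the hinge loss (the "Linear Machines with Hinge Loss" topic from the syllabus); the synthetic data, labels in {-1, +1}, and hyperparameters are illustrative:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # linearly separable labels in {-1, +1}

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(100):
    for xi, yi in zip(X, y):
        margin = yi * (w @ xi + b)
        # hinge loss max(0, 1 - margin): update only on margin violations
        if margin < 1:
            w += lr * yi * xi
            b += lr * yi

pred = np.where(X @ w + b > 0, 1, -1)        # threshold the weighted sum
print((pred == y).mean())                    # training accuracy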
9 What are the optimization techniques used in deep learning? Discuss their advantages and disadvantages, with examples. (10M)
ANS: In deep learning, we have the concept of loss, which tells us how poorly the model
is performing at the current instant. We need to use this loss to train our network so
that it performs better. Essentially, we take the loss and try to minimize it, because a
lower loss means our model will perform better. The process of minimizing (or maximizing)
such a mathematical expression is called optimization.
Optimizers are algorithms or methods used to change the attributes of the neural network,
such as the weights and the learning rate, in order to reduce the losses. In other words,
optimizers are used to solve optimization problems by minimizing the loss function.
For a useful mental model, you can think of a hiker trying to get down a mountain with a
blindfold on. It’s impossible to know which direction to go in, but there’s one thing she
can know: if she’s going down (making progress) or going up (losing progress).
Eventually, if she keeps taking steps that lead her downwards, she’ll reach the base.
Similarly, it’s impossible to know what your model’s weights should be right from the
start. But with some trial and error based on the loss function (whether the hiker is
descending), you can end up getting there eventually.
How you should change the weights or learning rate of your neural network to reduce the
losses is defined by the optimizer you use. Optimization algorithms are responsible for
reducing the losses and for providing the most accurate results possible.
Various optimizers have been proposed over the last few years, each having its advantages
and disadvantages. Below, we look at the different types of optimizers and how they work
to minimize the loss function:
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Gradient Descent (MB-SGD)
4. SGD with Momentum
5. AdaGrad
6. RMSprop
7. Adam
Gradient Descent
Gradient descent is the most basic optimization technique used while training a deep
learning model. It is based on a convex function and tweaks its parameters iteratively to
minimize a given cost function to its local minimum. In other words, it is used to find the
values of a function's parameters (coefficients) that minimize the cost function, and it
uses calculus to iteratively adjust those values.
The weights are initialized using some initialization strategy and are updated with each
epoch according to the update rule:
θ = θ − α⋅∂J(θ)/∂θ
The above equation computes the gradient of the cost function J(θ) w.r.t. the parameters θ
and takes a step of size α in the opposite direction. Our aim is to get to the bottom of the
graph (cost vs. weights), i.e., to a point where the cost can no longer be reduced.
"A gradient measures how much the output of a function changes if you change the inputs a
little bit." How far we move at each step is determined by the learning rate, which decides
how fast or slow we move towards the optimal weights.
For gradient descent to reach the local minimum we must set the learning rate to an
appropriate value, which is neither too low nor too high. This is important because if the
steps it takes are too big, it may never reach the local minimum: it keeps bouncing back and
forth across the convex cost function. If we set the learning rate to a very small value,
gradient descent will eventually reach the local minimum, but it may take a very long time.
So, the learning rate should never be too high or too low for this reason.
For a pre-defined number of epochs, we first compute the gradient vector params_grad of
the loss function for the whole dataset w.r.t. our parameter vector params, and then update
params in the opposite direction of the gradient.
Advantages:
1. Easy computation.
2. Easy to implement.
3. Easy to understand.
Disadvantages:
1. May get trapped at local minima.
2. Weights are changed only after the gradient has been calculated on the whole dataset, so
if the dataset is too large it may take a very long time to converge to the minima.
3. Requires a large amount of memory to compute the gradient over the whole dataset.
Stochastic Gradient Descent (SGD)
The SGD algorithm is an extension of Gradient Descent that overcomes some of its
disadvantages. The GD algorithm requires a lot of memory to load the entire dataset of
n points at a time in order to compute the derivative of the loss function. In the SGD
algorithm, the derivative is computed taking one point at a time.
SGD performs a parameter update for each training example x(i) and label y(i):
θ = θ − α⋅∂(J(θ;x(i),y(i)))/∂θ
To make the training even faster we take a Gradient Descent step for each training
example. Let's see what the implications would be in the image below.
1. On the left, we have Stochastic Gradient Descent (where m = 1 per step): we take a
gradient descent step for each individual example. On the right is (batch) Gradient Descent,
which takes one step per pass over the entire training set.
2. SGD seems to be quite noisy; at the same time it is much faster, but it may not converge
exactly to a minimum.
3. Typically, to get the best of both worlds, we use Mini-batch gradient descent (MGD),
which looks at a smaller number of training-set examples at once. It is smoother than
Stochastic Gradient Descent (SGD) but does have oscillations, as gradient steps are taken in
the direction of a sample of the training set and not the entire set as in BGD.
It is observed that in SGD the updates take more number iterations compared to gradient
descent to reach minima. On the right, the Gradient Descent takes fewer steps to reach
minima but the SGD algorithm is noisier and takes more iterations.
Advantage:
1. The memory requirement is less compared to the GD algorithm, since the derivative is
computed one point at a time.
Disadvantages:
1. The updates are noisy, so SGD may keep oscillating around the minimum and needs more
iterations to reach it.
2. It loses the speed-up from vectorization, since only one example is processed at a time.
Mini-Batch Gradient Descent (MB-SGD)
The MB-SGD algorithm is an extension of the SGD algorithm, and it overcomes the problem of
the large time complexity of SGD. MB-SGD takes a batch (a subset) of points from the dataset
to compute the derivative of the loss function; after some number of iterations, this
derivative is almost the same as the derivative computed by GD, but the number of iterations
needed to achieve the minima is larger for MB-SGD than for GD, while the cost of each update
is smaller.
The update of the weights depends on the derivative of the loss for a batch of points. The
updates in the case of MB-SGD are noisier, because the derivative does not always point
towards the minima.
MB-SGD divides the dataset into various batches and after every batch, the parameters
are updated.
θ = θ − α⋅∂(J(θ;B(i)))/∂θ
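A sketch of the mini-batch update loop implied by the equation above; the model, loss, batch size, and the helper grad_fn (any function returning ∂J/∂θ on a batch) are illustrative assumptions:

import numpy as np

def minibatch_sgd(theta, X, Y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    n = len(X)
    for epoch in range(epochs):
        idx = np.random.permutation(n)          # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # θ = θ − α · ∂J(θ; B(i)) / ∂θ
            theta = theta - lr * grad_fn(theta, X[batch], Y[batch])
    return theta

# Example: least-squares gradient for a linear model Y ≈ X·θ
grad_fn = lambda th, Xb, Yb: 2 * Xb.T @ (Xb @ th - Yb) / len(Xb)
X = np.random.randn(500, 3); true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta
print(minibatch_sgd(np.zeros(3), X, Y, grad_fn, epochs=50))   # ≈ true_theta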
Advantages:
1. Less time complexity to converge compared to standard SGD, and less memory required than
batch GD.
Disadvantages:
1. The update of MB-SGD is noisier compared to the update of the GD algorithm.
2. It takes more iterations to converge than GD and may get stuck at local minima.
SGD with Momentum
SGD with momentum overcomes this disadvantage by denoising the gradients. Updates of the
weights depend on a noisy derivative, and if we can somehow denoise the derivatives, the
time to converge decreases. The idea is to denoise the derivative using an exponentially
weighted average, that is, to give more weightage to recent updates than to older ones.
It accelerates the convergence towards the relevant direction and reduces the fluctuation
in the irrelevant direction. One more hyperparameter is used in this method, known as
momentum, usually denoted by γ.
Momentum at time 't' is computed using all previous updates, giving more weightage to
recent updates compared to older ones. This speeds up the convergence.
Essentially, when using momentum, we push a ball down a hill. The ball accumulates
momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its
terminal velocity if there is air resistance, i.e. γ<1). The same thing happens to our
parameter updates: The momentum term increases for dimensions whose gradients point
in the same directions and reduces updates for dimensions whose gradients change direction.
As a result, we gain faster convergence and reduced oscillation.
Advantages:
1. Converges faster than plain SGD and dampens the oscillations in the updates.
Disadvantages:
1. One more hyperparameter (the momentum coefficient γ) has to be chosen manually and tuned.
AdaGrad (Adaptive Gradient)
For all the previously discussed algorithms, the learning rate remains constant. The key
idea of AdaGrad is to have an adaptive learning rate for each of the weights.
It performs smaller updates for parameters associated with frequently occurring features,
and larger updates for parameters associated with infrequently occurring features.
For brevity, we use gt to denote the gradient at time step t, and gt,i the partial
derivative of the objective function w.r.t. the parameter θi at time step t; η is the
initial learning rate, and a small constant ε in the denominator avoids division by zero.
One benefit of AdaGrad is that it eliminates the need to manually tune the learning rate;
most implementations simply leave it at a default value.
Its main weakness is the accumulation of the squared gradients (Gt) in the denominator.
Since every added term is positive, the accumulated sum keeps growing during training,
causing the learning rate to shrink and become infinitesimally small, until the algorithm
is effectively unable to learn any further.
Advantage:
No need to update the learning rate manually as it changes adaptively with iterations.
Disadvantage:
As the number of iterations becomes very large, the learning rate decreases to a very small
value, which leads to slow convergence.
RMSprop
RMSprop and Adadelta were both developed independently around the same time, stemming from
the need to resolve AdaGrad's radically diminishing learning rates; RMSprop is in fact
identical to the first update vector of Adadelta. RMSprop likewise divides the learning rate
by an exponentially decaying average of squared gradients. Hinton suggests γ be set to 0.9,
while a good default value for the learning rate η is 0.001. (Adam, discussed next, can be
viewed as RMSprop combined with momentum.)
Adam
Adam computes adaptive learning rates for each parameter. In addition to storing an
exponentially decaying average of past squared gradients vt like Adadelta and RMSprop,
Adam also keeps an exponentially decaying average of past gradients mt, similar to
momentum. Whereas momentum can be seen as a ball running down a slope, Adam
behaves like a heavy ball with friction, which thus prefers flat minima in the error
surface.
Hyper-parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving
averages. mt and vt are estimates of the first moment (the mean) and the second moment (the
uncentered variance) of the gradients respectively, hence the name of the method (Adaptive
Moment Estimation).
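The update rules discussed above can be summarized in a few lines of NumPy. This is a sketch: grad stands for the gradient of the loss at the current weights w, and the hyperparameter defaults are common choices rather than prescribed values:

import numpy as np

def sgd_momentum(w, grad, v, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad                  # accumulate past gradients
    return w - v, v

def adagrad(w, grad, G, lr=0.01, eps=1e-8):
    G = G + grad ** 2                          # accumulated squared gradients
    return w - lr * grad / (np.sqrt(G) + eps), G

def rmsprop(w, grad, Eg2, lr=0.001, rho=0.9, eps=1e-8):
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2    # exponentially decaying average
    return w - lr * grad / (np.sqrt(Eg2) + eps), Eg2

def adam(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # t is the time step starting at 1 (needed for bias correction)
    m = b1 * m + (1 - b1) * grad               # first moment (mean)
    v = b2 * v + (1 - b2) * grad ** 2          # second moment (uncentered variance)
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v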
10 Compare and contrast shallow neural networks and deep neural networks in terms of their architecture, performance, and training. What are the factors that influence their effectiveness? (10M)
ANS:
Shallow neural networks typically have one or two hidden layers, while deep neural
networks have more than two hidden layers. Here are the comparisons and contrasts
between the two types of neural networks:
1. Architecture: Shallow neural networks have a simple architecture with fewer layers,
while deep neural networks have a complex architecture with multiple layers. Deep
neural networks can extract hierarchical features from data, which enables them to
capture more complex patterns.
2. Performance: Deep neural networks generally outperform shallow neural networks in
tasks such as image recognition, speech recognition, and natural language processing.
This is because deep neural networks can learn more abstract and complex features than
shallow neural networks.
3. Training: Deep neural networks require more computational resources and longer training
time than shallow neural networks. This is because the training process involves
optimizing the parameters of multiple layers, which requires more iterations and more
memory. However, deep neural networks can be trained using techniques such as transfer
learning and pretraining, which can reduce the training time and improve the
performance.
Hence, deep neural networks are generally more effective than shallow neural networks
due to their ability to extract more abstract and complex features. However, the choice of
architecture, training data, optimization algorithm, and regularization techniques can
significantly impact the performance of both shallow and deep neural networks.
UNIT-II
Single-layer perceptron is a type of artificial neural network that consists of a single layer
of perceptrons. In contrast, the multilayer perceptron (MLP) is an artificial neural
network that consists of multiple layers of perceptrons.
The main differences stem from the limitations of the single-layer perceptron, which are as
follows:
1. Limited to linearly separable problems: Perceptrons can only classify data that is linearly
separable. If the data is not linearly separable, then the perceptron will not be able to
classify it correctly.
2. Inability to learn complex patterns: Perceptrons are not capable of learning complex
patterns that require multiple layers of processing. They can only learn simple patterns
that can be represented by a linear combination of the input features.
3. Sensitivity to input data: Perceptrons are highly sensitive to input data. If the input data is
noisy or contains outliers, the perceptron may produce inaccurate results.
4. Lack of probabilistic outputs: Perceptrons produce binary outputs, which means they
cannot estimate the probability of a given input belonging to a particular class.
5. Requirement for labeled data: Perceptrons require labeled data for training, which can be
time-consuming and costly to obtain in some cases.
6. Difficulty in handling non-stationary data: Perceptrons are not designed to handle non-
stationary data, where the statistical properties of the input data change over time. They
require retraining on new data to adapt to the new distribution.
3 What is the XOR problem, and how can it be solved using an artificial neural network? (4M)
ANS: The XOR problem is a classic example of a problem that cannot be solved by a
single-layer perceptron. It involves classifying inputs into two classes based on their
logical XOR (exclusive or) relationship. XOR is a logical operation that returns true only
if both inputs are different; otherwise, it returns false. The XOR problem arises when we
try to classify inputs based on their XOR relationship.
A single-layer perceptron cannot solve the XOR problem because the problem is not
linearly separable. However, an artificial neural network with at least one hidden layer,
such as a multilayer perceptron (MLP), can solve the XOR problem.
The MLP can solve the XOR problem by learning to represent the data in a higher-
dimensional space where the data becomes linearly separable. The hidden layer of the
MLP acts as a feature extractor, transforming the input data into a higher-dimensional
space, where a linear classifier can separate the data. The output layer of the MLP then
performs the classification based on the transformed data.
In the case of the XOR problem, the MLP with one hidden layer can solve the problem
by learning to represent the data in a 2D space where a linear classifier can separate the
data. The hidden layer can have two neurons, and the output layer can have one neuron,
representing the binary classification problem. The activation function used in the hidden
layer can be a non-linear function such as the sigmoid function or the hyperbolic tangent
function, allowing the MLP to learn the non-linear relationship between the input
features. By training the MLP using backpropagation, it can learn to classify the inputs
correctly, even when the inputs have a non-linear relationship such as the XOR problem.
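A small sketch makes this concrete. Here the hidden-layer weights are chosen by hand rather than learned (a deliberate simplification): two hidden units compute OR and NAND, which makes XOR linearly separable for the output unit:

import numpy as np

def step(z):                 # hard-threshold activation
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: h1 = OR(x1, x2), h2 = NAND(x1, x2)
W1 = np.array([[1.0, -1.0],
               [1.0, -1.0]])
b1 = np.array([-0.5, 1.5])
# Output layer: y = AND(h1, h2) = XOR(x1, x2)
W2 = np.array([1.0, 1.0])
b2 = -1.5

H = step(X @ W1 + b1)
Y = step(H @ W2 + b2)
print(Y)                     # [0 1 1 0]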
4 What are Radial Basis Function Networks, and how do they differ from other types of artificial neural networks? (4M)
ANS:
Radial Basis Function (RBF) Networks are a type of artificial neural network that uses
radial basis functions as activation functions. They differ from other types of artificial
neural networks, such as multilayer perceptrons (MLPs) and convolutional neural
networks (CNNs), in their architecture and training method.
The architecture of an RBF network typically consists of three layers: an input layer, a
hidden layer, and an output layer. The input layer receives the input data, which is then
passed through the hidden layer, where the radial basis functions are applied. The output
of the hidden layer is then passed through the output layer, which produces the final
output of the network. The hidden layer in an RBF network is often called the radial basis
layer, as it uses radial basis functions as activation functions.
Unlike MLPs and CNNs, RBF networks do not use backpropagation to train the network.
Instead, RBF networks use a two-step training method called clustering and least squares
estimation. In the clustering step, the input data is clustered using a clustering algorithm,
such as k-means clustering, to find representative centers for the radial basis functions. In
the least squares estimation step, the weights of the output layer are calculated using
linear regression to approximate the target outputs.
RBF networks have several advantages over other types of artificial neural networks.
First, they are good at handling non-linear problems, as the radial basis functions can
model non-linear relationships between the input features. Second, they require less
training data than MLPs and CNNs, as the clustering step reduces the number of
parameters that need to be trained. Third, they can handle missing or noisy data, as the
clustering step is robust to missing or noisy data. Finally, RBF networks can be used for
both regression and classification tasks.
However, RBF networks also have some limitations. They can be computationally
expensive, as the clustering step requires the input data to be processed multiple times.
They are also not well-suited for handling high-dimensional data, as the number of radial
basis functions required to model the data can become prohibitively large.
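A sketch of the two-step training described above, using scikit-learn's KMeans for the clustering step and linear least squares for the output weights; the dataset, number of centers, and width γ are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])                        # a non-linear target function

# Step 1: clustering to find representative centers for the radial basis functions
k, gamma = 10, 1.0
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

def rbf_features(X, centers, gamma):
    # Gaussian radial basis functions around each center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Step 2: least-squares estimation of the output-layer weights
Phi = rbf_features(X, centers, gamma)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = rbf_features(X, centers, gamma) @ w
print(np.mean((y_hat - y) ** 2))           # small training error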
5 What are the differences between Momentum Optimizer, RMSProp, and Adam, in the context of deep neural network training? (4M)
ANS:
Momentum Optimizer, RMSProp, and Adam are all optimization algorithms commonly
used in deep neural network training. While they are all designed to improve the training
speed and convergence of neural networks, they differ in their approaches to achieving
this goal. Here are the key differences between these optimization algorithms:
1. Momentum Optimizer: This algorithm adds a momentum term to the gradient update in
order to accelerate the convergence of the optimization process. The momentum term
takes into account the previous update direction and magnitude, which helps the
algorithm to overcome local minima and plateaus in the optimization landscape.
However, Momentum Optimizer can overshoot the optimum and lead to oscillations
around the optimum.
2. RMSProp: This algorithm adapts the learning rate for each parameter based on the
magnitude of its gradients. It divides the learning rate by an exponentially decaying
average of the squared gradients, which has the effect of reducing the learning rate for
parameters with large gradients and increasing it for parameters with small gradients.
This makes the algorithm more robust to noisy gradients and to changes in the
optimization landscape. However, RMSProp can still converge slowly or get stuck in
saddle points.
3. Adam: This algorithm combines the ideas of Momentum Optimizer and RMSProp. It
uses an exponentially decaying average of past gradients to update the momentum term
and an exponentially decaying average of past squared gradients to adapt the learning
rate. Additionally, Adam introduces bias-correction terms to account for the fact that the
initial estimates of the averages are biased towards zero. This makes Adam more efficient
than Momentum Optimizer and RMSProp individually, as it can overcome the limitations
of both algorithms. However, Adam can still converge slowly or get stuck in saddle
points for certain types of optimization landscapes.
In summary, Momentum Optimizer, RMSProp, and Adam are all powerful optimization
algorithms that can improve the training speed and convergence of deep neural networks.
While they differ in their approaches, they are all designed to address the limitations of
traditional stochastic gradient descent, which can converge slowly or get stuck in
suboptimal local minima. The choice of which algorithm to use depends on the specific
optimization landscape and the requirements of the task at hand.
6 Explain the structure and learning process of a single-layer perceptron, and discuss its limitations as a pattern classifier. (10M)
ANS:
A single-layer perceptron is a type of artificial neural network that is used for binary
classification tasks. It consists of a layer of input neurons, each connected to a single
output neuron. The input neurons receive the features of the input data and apply a
weighted sum of the inputs to the output neuron. The output neuron then applies an
activation function to the weighted sum to produce the final output of the network, which
is a binary decision.
The learning process of a single-layer perceptron involves the adjustment of the weights
of the connections between the input neurons and the output neuron. During training, the
perceptron is presented with a set of input-output pairs, and the weights are adjusted
based on the error between the predicted output and the true output. The weights are
updated using the delta rule, which is a form of gradient descent. The delta rule adjusts
the weights in the direction that reduces the error between the predicted output and the
true output, until the error reaches a minimum.
Despite its simplicity and usefulness, the single-layer perceptron has several limitations
as a pattern classifier. First, it can only classify linearly separable patterns, which limits
its ability to model complex relationships between the input features. Second, it can be
sensitive to the choice of initial weights and the order in which the input patterns are
presented during training, which can affect the convergence and the quality of the
solution. Third, it can suffer from the problem of overfitting, where it memorizes the
training data and fails to generalize to new data. Finally, it cannot handle multi-class
classification tasks, which require the use of more sophisticated models such as
multilayer perceptrons or convolutional neural networks.
In summary, the single-layer perceptron is a simple and useful model for binary
classification tasks, but it has several limitations that make it unsuitable for more
complex tasks. Its learning process involves the adjustment of weights using the delta
rule, and its structure consists of a layer of input neurons connected to a single output
neuron. Despite its limitations, the single-layer perceptron paved the way for the
development of more sophisticated artificial neural networks, which have revolutionized
the field of machine learning.
The perceptron model is also treated as one of the best and simplest types of artificial
neural networks; it is a supervised learning algorithm for binary classifiers.
Step-1
In the first step first, multiply all input values with corresponding weight values and then
add them to determine the weighted sum. Mathematically, we can calculate the weighted
sum as follows:
Add a special term called bias 'b' to this weighted sum to improve the model's
performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted
sum, which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
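The two steps can be written directly in Python (a sketch; the example inputs, weights, bias, and the hard-limit activation are illustrative):

import numpy as np

def perceptron_output(x, w, b):
    weighted_sum = np.dot(w, x) + b        # Step 1: ∑ wi*xi + b
    return 1 if weighted_sum > 0 else 0    # Step 2: hard-limit activation f

x = np.array([1.0, 0.0, 1.0])
w = np.array([0.4, -0.2, 0.7])
b = -0.5
print(perceptron_output(x, w, b))          # 1, since 0.4 + 0.7 - 0.5 > 0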
Based on the layers, Perceptron models are divided into two types. These are as follows:
This is one of the easiest Artificial neural networks (ANN) types. A single-layered
perceptron model consists feed-forward network and also includes a threshold transfer
function inside the model. The main objective of the single-layer perceptron model is to
analyze the linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm has no previously recorded data, so it
begins with randomly (arbitrarily) allocated values for the weight parameters. It then sums
up all the weighted inputs; if the total sum is more than a pre-determined threshold value,
the model is activated and shows the output value as +1.
If the outcome matches the pre-determined (desired) value, the performance of the model is
considered satisfactory and the weights are not changed. However, discrepancies arise when
the weighted inputs do not yet produce the desired output; hence, to reach the desired
output and minimize the errors, the weights have to be adjusted.
Like a single-layer perceptron model, a multi-layer perceptron model also has the same
model structure but has a greater number of hidden layers.
The multi-layer perceptron model is also known as the Backpropagation algorithm, which
executes in two stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward
stage and terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as
per the model's requirement. In this stage, the error between the actual output and
the desired output is propagated backwards, starting at the output layer and ending at
the input layer.
A multi-layer perceptron model has greater processing power and can process linear and
non-linear patterns. Further, it can also implement logic gates such as AND, OR, XOR,
NAND, NOT, XNOR, NOR.
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input 'x' with the learned
weight coefficient 'w' and adding the bias 'b':
f(x) = 1 if w⋅x + b > 0;
otherwise, f(x) = 0
Characteristics of Perceptron
o The output of a perceptron can only be a binary number (0 or 1) due to the hard
limit transfer function.
o Perceptron can only be used to classify the linearly separable sets of input vectors.
If input vectors are non-linear, it is not easy to classify them properly.
Future of Perceptron
The future of the Perceptron model is bright and significant, as it helps to interpret
data by building intuitive patterns and applying them in the future. Machine learning is a
rapidly growing technology of Artificial Intelligence that is continuously evolving and in
the developing phase; hence the future of perceptron technology will continue to support
and facilitate analytical behavior in machines that will, in turn, add to the efficiency of
computers.
The perceptron model is continuously becoming more advanced and working efficiently
on complex problems with the help of artificial neurons.
7 Discuss the Bayesian approach to pattern classification, and explain how Bayes' classifiers can be used in artificial neural networks. (10M)
ANS:
The Bayesian approach to pattern classification is a probabilistic framework that uses
Bayes' theorem to classify new input data based on a prior knowledge of the statistical
properties of the data. In this approach, the classification decision is based on the
posterior probability of the class given the input features, which is computed by
combining the prior probability of the class and the likelihood of the features given the
class.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A)·P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed
event B.
P(B|A) is the Likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the Prior probability of the hypothesis, and P(B) is the Marginal probability of
the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
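As a small illustration (the toy weather dataset and its counts are made up for this sketch), Bayes' theorem can be applied to decide whether to play when the outlook is "Sunny":

# Toy dataset of (outlook, play) observations -- values are illustrative
data = [("Sunny", "Yes"), ("Sunny", "No"), ("Overcast", "Yes"),
        ("Rainy", "Yes"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Sunny", "No"), ("Rainy", "Yes"),
        ("Sunny", "Yes")]

def posterior(outlook, label):
    n = len(data)
    prior = sum(1 for _, p in data if p == label) / n                # P(label)
    likelihood = (sum(1 for o, p in data if o == outlook and p == label)
                  / sum(1 for _, p in data if p == label))           # P(outlook|label)
    evidence = sum(1 for o, _ in data if o == outlook) / n           # P(outlook)
    return likelihood * prior / evidence                             # Bayes' theorem

print(posterior("Sunny", "Yes"), posterior("Sunny", "No"))
# The class with the higher posterior probability is predicted.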
Bayes' classifiers can be used in artificial neural networks (ANNs) to improve the
accuracy and robustness of the classification model. This is done by incorporating
Bayesian reasoning into the learning process of the ANNs, either by explicitly modeling
the prior probabilities of the classes, or by using Bayesian regularization techniques to
control the complexity of the network and prevent overfitting.
One example of a Bayesian neural network is the Bayesian neural network with
backpropagation (BNNBP), which combines the backpropagation algorithm with
Bayesian regularization to train the network. In this approach, the network weights are
assigned prior probability distributions, and the likelihood of the training data given the
weights is computed using the standard backpropagation algorithm. The posterior
distribution of the weights is then computed using Bayes' theorem, and the classification
decision is based on the maximum posterior probability of the class given the input
features.
Bayesian ANNs have several advantages over traditional ANNs, including the ability to
handle uncertainty and noise in the input data, and the ability to provide probabilistic
estimates of the class probabilities. They are also more robust to overfitting, since the
prior probabilities can act as a regularizer that constrains the complexity of the model.
However, they can be computationally expensive to train and evaluate, especially for
large datasets and complex models, and require careful selection of the prior distributions
and hyperparameters.
8 Explain the structure and learning process of a multilayer perceptron, and explain how it can be used to solve complex pattern recognition problems. (10M)
ANS: A multilayer perceptron (MLP) is a type of artificial neural network (ANN) that is
commonly used for solving complex pattern recognition problems. It is a feedforward
neural network, which means that data flows only in one direction through the network,
from the input layer to the output layer, with no feedback loops.
The structure of an MLP consists of multiple layers of artificial neurons, each layer
connected to the next in a series of directed edges, forming a directed graph. The input
layer is the first layer, which receives the raw data as input. The output layer is the final
layer, which produces the output of the network. In between, there can be one or more
hidden layers, which contain a varying number of neurons.
Each neuron in an MLP receives input from the previous layer, performs a weighted sum
of these inputs, applies a non-linear activation function, and passes the output to the next
layer. The output of the MLP is obtained by passing the input through the layers of
neurons until it reaches the output layer.
The learning process of an MLP involves adjusting the weights and biases of the
connections between neurons to minimize the error between the actual output of the
network and the desired output. This process is called backpropagation, and it involves
computing the gradient of the error function with respect to the weights and biases, and
then using this gradient to update the weights and biases using an optimization algorithm
such as gradient descent.
MLP can be used to solve complex pattern recognition problems by learning the
underlying relationships between inputs and outputs through training on a dataset. This
means that the network can learn to recognize patterns in the data and generalize its
learning to new, unseen data.
Consider, for example, an MLP in which the input layer has three neurons, the hidden layer
has four neurons, and the output layer has two neurons. Each connection between neurons
carries an associated weight.
The output of each neuron is computed as y = f(∑ wi·xi + b), for i = 1 … n, where wi are the
weights, xi are the inputs, b is the bias, f is the activation function, and n is the number
of inputs.
The activation function introduces nonlinearity into the network, allowing it to learn
complex patterns in the data. Common activation functions include sigmoid, ReLU, and
tanh.
Multilayer perceptron (MLP) can be used to solve complex pattern recognition problems
such as image classification, speech recognition, and natural language processing by
learning a nonlinear mapping between the input features and the output labels. Here is a
step-by-step guide on how MLP can be used to solve a pattern recognition problem:
1. Data Preparation: The first step is to prepare the dataset by splitting it into training,
validation, and test sets. The training set is used to train the MLP, the validation set is
used to tune the hyperparameters, and the test set is used to evaluate the performance of
the MLP.
2. Input Encoding: The input data is encoded as a feature vector of size N, where N is the
number of features. Each feature represents a specific aspect of the input data, such as the
pixel intensity of an image or the frequency components of a speech signal.
3. Output Encoding: The output labels are encoded as a one-hot vector of size C, where C is
the number of classes. Each element of the vector represents a specific class, and the
element corresponding to the true class is set to 1, while all other elements are set to 0.
4. MLP Architecture: The MLP architecture is designed based on the complexity of the
problem and the size of the dataset. It consists of an input layer, one or more hidden
layers, and an output layer. The number of neurons in the input layer is equal to the
number of features, and the number of neurons in the output layer is equal to the number
of classes.
5. Forward Propagation: During forward propagation, the input features are passed through
the MLP, and the output probabilities are computed using the softmax function.
6. Loss Function: The loss function measures the difference between the predicted output
probabilities and the true output probabilities. The most commonly used loss function for
multiclass classification problems is the cross-entropy loss function.
7. Backpropagation: During backpropagation, the error is propagated backward through the
MLP, and the weights and biases are updated using the gradient descent algorithm.
8. Training: The MLP is trained by iteratively updating the weights and biases using the
backpropagation algorithm on the training set until the loss function converges.
9. Hyperparameter Tuning: The hyperparameters of the MLP, such as the number of hidden
layers, the number of neurons in each layer, the learning rate, and the regularization
parameters, are tuned using the validation set to optimize the performance of the MLP.
10. Testing: The performance of the MLP is evaluated on the test set by computing the
accuracy, precision, recall, and F1-score.
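These ten steps map almost directly onto a library call. A sketch with scikit-learn on its built-in digits dataset; the architecture, activation, and hyperparameters are illustrative choices, not prescribed values:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Steps 1-3: prepare the data and split it into training and test sets
X, y = load_digits(return_X_y=True)          # 64 pixel features, 10 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 4-8: define the MLP architecture and train it with backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="logistic",
                    learning_rate_init=0.001, max_iter=500, random_state=0)
mlp.fit(X_train, y_train)

# Step 10: evaluate accuracy, precision, recall and F1-score on the test set
print(classification_report(y_test, mlp.predict(X_test)))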
The input features are passed through the input layer, and the output of each neuron in the
hidden layer is computed using the sigmoid function. The output probabilities are
computed using the softmax function in the output layer.
The forward propagation and backpropagation equations for the MLP are as described above.
The weights and biases are updated using the gradient descent algorithm as follows:
w(i+1) = w(i) − η·∂L/∂w(i)
b(i+1) = b(i) − η·∂L/∂b(i)
where η is the learning rate and L is the loss function.
MLP networks are used in a supervised learning setting. The typical learning algorithm for
MLP networks is the backpropagation algorithm.
A multilayer perceptron (MLP) is a feed-forward artificial neural network that generates a
set of outputs from a set of inputs. An MLP is characterized by several layers of nodes
connected as a directed graph between the input and output layers. MLP uses backpropagation
for training the network and is a basic deep learning method.
In summary, MLPs are a powerful tool for solving complex pattern recognition problems
by learning the underlying relationships between inputs and outputs through training on a
dataset. The learning process involves adjusting the weights and biases of the connections
between neurons using backpropagation, and the structure of the network consists of
multiple layers of interconnected nodes or neurons.
9 Explain the concept of regularization in deep neural network training, and discuss how early stopping, dropout, and batch normalization can be used to improve the performance of a deep neural network. (10M)
ANS:
Regularization is a technique used to prevent overfitting of a model during training. In
the context of deep neural network training, overfitting occurs when a model becomes too
complex and starts to memorize the training data, resulting in poor generalization
performance on unseen data.
Regularization methods add constraints to the model to reduce its complexity and prevent
overfitting. There are several types of regularization techniques used in deep learning,
including:
1. L1 and L2 regularization: These methods add a penalty term to the loss function that
forces the model to use smaller weights. L1 regularization encourages the model to use
sparse weights, while L2 regularization encourages the model to use smaller but non-zero
weights.
2. Dropout: Dropout is a regularization technique that randomly drops out a portion of the
nodes in a layer during training. This forces the model to learn more robust features that
are not dependent on any one node.
L2 Regularization
L2 regularization is the most common of all regularization techniques and is also known as
weight decay. The full mathematical explanation of why it reduces overfitting is quite long
and complex; here we focus on the intuition behind the technique and on how it changes
training, so that it can be used to address the overfitting problem.
During L2 regularization, the loss function of the neural network is extended by a so-called
regularization term Ω. The regularization term Ω is defined as the Euclidean norm (or L2
norm) of the weight matrices, which is the sum over all squared weight values of a weight
matrix. The regularization term is weighted by the scalar alpha divided by two and added to
the regular loss function that is chosen for the current task. This leads to a new
expression for the loss function:
L_reg = L + (α/2)·∑ w²
Alpha is an additional hyperparameter that we introduce into the neural network; simply
speaking, alpha determines how much we regularize the model.
In the next step we can compute the gradient of the new loss function and put the gradient
into the update rule for the weights. Some reformulation of the update rule leads to an
expression that looks very much like the update rule for the weights during regular
gradient descent:
w ← (1 − η·α)·w − η·∂L/∂w
The only difference is that by adding the regularization term we introduce an additional
shrinkage factor (1 − η·α) on the current weights. In other words, independent of the
gradient of the loss function, we make our weights a little bit smaller at each update step.
L1 Regularization
In the case of L1 regularization (also known as Lasso regression), we simply use another
regularization term Ω: this term is the sum of the absolute values of the weight parameters
in a weight matrix. As in the previous case, we multiply the regularization term by alpha
and add the entire term to the loss function:
L_reg = L + α·∑ |w|
The derivative of the new loss function is the sum of the gradient of the old loss function
and the sign of each weight value times alpha:
∂L_reg/∂w = ∂L/∂w + α·sign(w)
Comparing the plots of the squared and absolute-value penalty functions explains the
different behaviour. In the case of L2 regularization, the weight parameters decrease but do
not necessarily become zero, since the curve becomes flat near zero. During L1
regularization, on the other hand, the weights are always forced all the way towards zero.
Basically the introduced equations for L1 and L2 regularizations are constraint functions,
The left image shows the constraint function (green area) for the L1 regularization and the
right image shows the constraint function for the L2 regularization. The red ellipses are
contours of the loss function that is used during the gradient descent. In the center of the
contours there is a set of optimal weights for which the loss function has a global
minimum.
In the case of L1 and L2 regularization, the estimates of W1 and W2 are given by the first
point where the ellipse intersects with the green constraint area.
Since L2 regularization has a circular constraint area, the intersection will not generally
occur on an axis, and thus the estimates for W1 and W2 will typically both be non-zero.
In the case of L1, the constraint area has a diamond shape with corners, and thus the
contours of the loss function will often intersect the constraint region at an axis; at such
a corner, one of the weight estimates is exactly zero.
In a high dimensional space, many of the weight parameters will equal zero
simultaneously.
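The effect of the two penalties on the weight update can be seen in a couple of lines (a sketch; grad stands for the gradient of the unregularized loss and alpha for the regularization rate):

import numpy as np

def l2_update(w, grad, lr=0.01, alpha=0.1):
    # weight decay: every step shrinks the weights by the factor (1 - lr*alpha)
    return w - lr * (grad + alpha * w)

def l1_update(w, grad, lr=0.01, alpha=0.1):
    # the penalty gradient is alpha * sign(w): weights are pushed towards exactly zero
    return w - lr * (grad + alpha * np.sign(w))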
3. Early stopping: Early stopping is a simple regularization technique that stops training
when the model's performance on a validation set stops improving. This prevents the
model from continuing to memorize the training data and allows it to generalize better to
unseen data.
4. Batch normalization: Batch normalization is a technique that normalizes the activations
of each layer to prevent the distribution of activations from shifting too much. This helps
to reduce the internal covariate shift and makes the model more stable during training.
Regularization is an essential part of deep neural network training and is often necessary
to achieve good generalization performance. By adding constraints to the model,
regularization techniques help to reduce overfitting and improve the model's ability to
generalize to new data.
Early stopping, dropout, and batch normalization are all regularization techniques that
can be used to improve the performance of a deep neural network.
1. Early stopping: Early stopping is a simple but effective regularization technique that
involves monitoring the model's performance on a validation set during training. When
the performance of the model on the validation set stops improving, training is stopped,
and the model with the best performance on the validation set is used as the final model.
Early stopping helps prevent overfitting by stopping the model from memorizing the
training data and allows it to generalize better to new data.
2. Dropout: Dropout is a regularization technique that involves randomly dropping out a
portion of the nodes in a layer during training. This helps to prevent the model from
overfitting by encouraging it to learn more robust features that are not dependent on any
one node. Dropout has been shown to be effective in reducing overfitting and improving
the generalization performance of deep neural networks.
3. Batch normalization: Batch normalization is a technique that involves normalizing the
activations of each layer to prevent the distribution of activations from shifting too much
during training. This helps to reduce the internal covariate shift and makes the model
more stable during training. Batch normalization has been shown to improve the
generalization performance of deep neural networks, reduce the number of training
iterations required, and make the model more resistant to overfitting.
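The three techniques just listed appear as one-liners in a typical Keras model. This is a sketch under the assumption that TensorFlow/Keras is used; the layer sizes, dropout rate, patience, and the placeholder training arrays x_train/y_train are illustrative:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.BatchNormalization(),   # normalize the layer's activations
    tf.keras.layers.Dropout(0.5),           # randomly drop half the units while training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# x_train / y_train stand for the training data (not defined in this sketch)
# model.fit(x_train, y_train, epochs=100,
#           validation_split=0.2, callbacks=[early_stop])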
Overall, early stopping, dropout, and batch normalization are all powerful techniques that
can be used to improve the performance of deep neural networks. By reducing overfitting
and improving generalization performance, these regularization techniques help to make
deep neural networks more effective at solving complex problems.
10 Compare and contrast different optimization algorithms used in deep neural network training, such as Momentum Optimizer, RMSProp, and Adam. (10M)
ANS:
Momentum
A very popular technique that is used along with SGD is called momentum. Instead of using
only the gradient of the current step to guide the search, momentum also accumulates the
gradients of past steps to determine the direction to go. The update equations are:
v(t) = γ·v(t−1) + η·∇J(w)
w = w − v(t)
If we set the initial value of v to 0 and choose the coefficient γ as 0.9, then each
subsequent update is a weighted combination of all the previous gradients. We see that the
previous gradients are also included in subsequent updates, but the weightage of the most
recent gradients is higher than that of the less recent ones. (For the mathematically
inclined, we are taking an exponential moving average of the gradient steps.)
How does this help our case? Consider a loss surface with pathological curvature, where most of the
gradient updates zig-zag across a narrow valley. Each gradient update can be resolved into
components along the w1 and w2 directions. If we individually sum these vectors up, their
components along the direction w1 cancel out, while the components along w2 add up.
For an update, this adds to the component along w2, while zeroing out the component in the w1
direction. This helps us move more quickly towards the minima. For this reason, momentum is said
to dampen the oscillations in our search. It also builds up speed and quickens convergence, but you
may want to combine it with simulated annealing (a decaying learning rate) to avoid overshooting
the minima.
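A minimal NumPy sketch of the momentum update described above. The toy objective, starting point, and hyperparameters are illustrative assumptions; w1 is the steep (oscillating) direction and w2 the shallow one, matching the discussion.

```python
import numpy as np

def sgd_momentum(grad_fn, w, lr=0.02, rho=0.9, steps=100):
    """Gradient descent with momentum: v accumulates an exponential
    average of past gradients and the parameters move along v."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)      # gradient at the current point
        v = rho * v + g     # accumulate past gradients (momentum)
        w = w - lr * v      # step along the accumulated direction
    return w

# Toy objective f(w) = 10*w1^2 + 0.5*w2^2: gradients along w1 are much
# larger than along w2 (pathological curvature).
grad = lambda w: np.array([20.0 * w[0], 1.0 * w[1]])
print(sgd_momentum(grad, np.array([5.0, 5.0])))   # approaches the minimum at (0, 0)
```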
RMSprop, or Root Mean Square Propagation, has an interesting history. It was devised by
the legendary Geoffrey Hinton, while suggesting a random idea during a Coursera class.
RMSProp also tries to dampen the oscillations, but in a different way than momentum.
RMSProp also takes away the need to manually tune the learning rate, and does it automatically.
More so, RMSProp chooses a different learning rate for each parameter.
In RMSProp, each update is done according to the equations described below. First, we maintain an
exponential average of the squares of the gradients:
v_t = ρ · v_{t-1} + (1 - ρ) · (∇w_t)²    ...(1)
To do that, we multiply the exponential average computed till the last update with a
hyperparameter ρ (rho), multiply the square of the current gradient with (1 - ρ), and then add
them together to get the new exponential average.
The reason why we use an exponential average is because, as we saw in the momentum example, it
helps us weigh the more recent gradient updates more than the less recent ones. In fact, the name
"exponential" comes from the fact that the weightage of previous terms falls exponentially (the
most recent term is weighted as ρ, the next one as ρ², the one before that as ρ³, and so on).
Recall the pathological-curvature example: the components of the gradients along w1 are much
larger than the ones along w2. Since we are squaring and adding them, they don't cancel out, and
the exponential average is large for the w1 updates.
Then, in the second equation, we decide our step size. We move in the direction of the gradient,
but our step size is scaled by the exponential average:
Δw_t = - (η / √(v_t + ε)) · ∇w_t    ...(2)
We choose an initial learning rate η, and then divide it by the square root of the average. In our
case, since the average for w1 is much, much larger than that for w2, the learning step taken along
w1 is much smaller than that along w2. Hence, this helps us avoid bouncing between the ridges and
move towards the minima.
The third equation, w_t = w_{t-1} + Δw_t, is just the update step. The hyperparameter ρ is
generally chosen to be 0.9, but you might have to tune it. The epsilon in equation 2 is there to
ensure that we do not end up dividing by zero, and is generally chosen to be a tiny value such as 1e-10.
It's also to be noted that RMSProp implicitly performs a kind of simulated annealing. Suppose we
are heading towards the minima and want to slow down so as not to overshoot it. When the gradient
steps become too large (large steps make us prone to overshooting), RMSProp automatically decreases
the size of the steps taken towards the minima.
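A minimal NumPy sketch of the RMSProp update from the equations above, applied to the same toy objective; all values are illustrative assumptions.

```python
import numpy as np

def rmsprop(grad_fn, w, lr=0.05, rho=0.9, eps=1e-10, steps=300):
    """RMSProp: keep an exponential average of squared gradients and
    divide the learning rate by its square root, per parameter."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        v = rho * v + (1 - rho) * g ** 2       # exponential average of g^2 (Eq 1)
        w = w - lr * g / (np.sqrt(v) + eps)    # per-parameter step size (Eqs 2 and 3)
    return w

grad = lambda w: np.array([20.0 * w[0], 1.0 * w[1]])
print(rmsprop(grad, np.array([5.0, 5.0])))   # ends up close to the minimum at (0, 0)
```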
Adam
So far, we've seen RMSProp and Momentum take contrasting approaches. While
momentum accelerates our search in direction of minima, RMSProp impedes our search
in direction of oscillations.
Adam, or Adaptive Moment Estimation, combines the heuristics of both. Here, we compute the
exponential average of the gradient as well as of the squares of the gradient for each parameter
(Eq 1 and Eq 2):
m_t = β1 · m_{t-1} + (1 - β1) · ∇w_t    ...(Eq 1)
v_t = β2 · v_{t-1} + (1 - β2) · (∇w_t)²    ...(Eq 2)
To decide our learning step, we multiply our learning rate by the average of the gradient (as was
the case with momentum) and divide it by the root mean square of the exponential average of the
squares of the gradients (as was the case with RMSProp):
w_t = w_{t-1} - (η / √(v_t + ε)) · m_t
The original Adam paper additionally applies a bias correction to m_t and v_t during the first few
iterations. The hyperparameter β1 is generally kept around 0.9, while β2 is kept around 0.99
(0.999 in the original paper). Epsilon is generally chosen to be a very small value such as 1e-8 or 1e-10.
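A minimal NumPy sketch of the Adam update, combining the two averages above; the hyperparameters and the toy problem are illustrative assumptions.

```python
import numpy as np

def adam(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.99, eps=1e-10, steps=200):
    """Adam: a momentum-style average of gradients (m) combined with an
    RMSProp-style average of squared gradients (v), with bias correction."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # Eq 1: average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2     # Eq 2: average of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

grad = lambda w: np.array([20.0 * w[0], 1.0 * w[1]])
print(adam(grad, np.array([5.0, 5.0])))   # ends up close to the minimum at (0, 0)
```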
UNIT III
DEEP LEARNING: working principle of CNN – RNN – Auto encoders and decoders - Unsupervised
Learning with Deep Network - Convolution Neural Network, Building blocks of CNN, Transfer
Learning - Image Classification using Transfer Learning.
1 What is the main difference between a convolutional neural network (CNN) and a 4M
recurrent neural network (RNN)?
ANS:
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are two
types of neural networks that are commonly used in deep learning. The main difference
between CNNs and RNNs is the type of data they are designed to process.
CNNs are typically used for processing data that has a grid-like structure, such as images
and videos. CNNs use convolutional layers to extract features from the input data, which
are then passed through one or more fully connected layers to generate the final output.
The convolutional layers of a CNN use filters to scan over the input data and detect local
patterns and features.
On the other hand, RNNs are designed to process sequential data, such as natural
language text or time-series data. RNNs use recurrent layers that allow the network to
maintain an internal state or memory of past inputs. This memory allows RNNs to
process sequential data by considering the context of previous inputs when making
predictions about future inputs.
In summary, CNNs are designed to process grid-like data such as images and videos,
while RNNs are designed to process sequential data such as text and time-series data.
CNNs use convolutional layers to extract features from the input data, while RNNs use
recurrent layers to maintain memory of past inputs.
Typical applications: CNNs – image and video recognition, object detection; RNNs – speech
recognition, language modeling, time series prediction.
In general, CNNs are used for processing 2D or 3D data such as images and videos,
whereas RNNs are used for processing sequences of variable length such as text, speech,
and time series data. While CNNs are feedforward networks that use convolutional and
pooling layers to extract features, RNNs are feedback networks that use recurrent layers
to maintain memory of past inputs. Additionally, CNNs share filter parameters across the spatial
locations of the input, which makes them efficient to train and apply, while RNNs share parameters
across time steps, which makes them flexible and adaptable to inputs of different lengths.
2 What is the working principle of autoencoders and decoders, and how are they used 4M
in deep learning applications?
ANS:
Autoencoders are a type of neural network that can learn to represent data in a lower-
dimensional space, known as a latent space, by compressing the input data into a smaller
representation and then reconstructing the original input from the compressed
representation. Autoencoders consist of two main components: an encoder and a decoder.
Autoencoders can be used for a variety of tasks, such as data compression, image
denoising, and data generation. They are often used in deep learning applications because
they can learn to extract meaningful features from the input data without requiring
manual feature engineering.
In addition to standard autoencoders, several variants have been developed, including denoising
autoencoders and variational autoencoders (VAEs); a related class of generative models is the
generative adversarial network (GAN).
Denoising autoencoders are designed to learn to remove noise from the input data.
During training, the autoencoder is fed corrupted input data and learns to reconstruct the
original, uncorrupted data.
VAEs are a type of generative model that can generate new data that is similar to the
training data. They learn a probability distribution over the latent space, allowing for the
generation of new data points by sampling from this distribution.
GANs are another type of generative model that consist of two neural networks: a
generator network and a discriminator network. The generator network learns to generate
new data points that are similar to the training data, while the discriminator network
learns to distinguish between real and generated data. The two networks are trained
together in an adversarial process, with the generator trying to produce data that can fool
the discriminator into thinking it is real.
In summary, autoencoders and decoders are used in deep learning applications to learn
compressed representations of data, which can then be used for tasks such as data
compression, image denoising, and data generation. There are several types of
autoencoders, including denoising autoencoders, VAEs, and GANs, which have different
applications and architectures.
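The encoder/decoder structure described above can be sketched in a few lines of Keras. This is a hedged illustration only: the 784-dimensional input (e.g. a flattened 28x28 image), the 32-dimensional latent space, and the commented-out x_train array are assumptions, not values from the original answer.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim, latent_dim = 784, 32   # e.g. flattened 28x28 images, 32-dim latent code

encoder = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(input_dim,)),
    layers.Dense(latent_dim, activation="relu"),    # compressed representation
])
decoder = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(input_dim, activation="sigmoid"),  # reconstruction of the input
])
autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# The autoencoder is trained to reproduce its own input, so no labels are needed:
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=128)
```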
3 What is unsupervised learning, and how can deep neural networks be used for 4M
unsupervised learning tasks?
ANS:
Unsupervised learning is a type of machine learning where the algorithm is trained on a
dataset without explicit labels or targets. The goal of unsupervised learning is to discover
the underlying structure or patterns in the data without the need for human supervision or
intervention. In contrast to supervised learning, where the algorithm is trained to predict a
target variable based on labeled examples, unsupervised learning is used to identify
relationships or groupings in the data that are not necessarily apparent to the human eye.
Deep neural networks can be used for unsupervised learning tasks by leveraging
architectures such as autoencoders, restricted Boltzmann machines (RBMs), and
generative adversarial networks (GANs). These architectures can learn useful
representations of the input data that can be used for downstream tasks such as
classification, regression, or clustering.
Autoencoders are neural networks that are trained to reconstruct the input data by
compressing it into a lower-dimensional representation (the latent space) and then
decoding it back to the original dimensionality. Autoencoders are a form of unsupervised
learning because they do not require labeled examples during training. By learning to
compress the input data into a lower-dimensional representation, autoencoders can
discover meaningful features and patterns in the data.
RBMs are a type of generative model that can be used for unsupervised learning tasks
such as dimensionality reduction, feature learning, and anomaly detection. They are
trained using a contrastive divergence algorithm that minimizes the energy function of
the network.
GANs are a type of generative model that consists of two neural networks: a generator
network and a discriminator network. The generator network learns to generate new data
points that are similar to the training data, while the discriminator network learns to
distinguish between real and generated data. The two networks are trained together in an
adversarial process, with the generator trying to produce data that can fool the
discriminator into thinking it is real.
1. Convolutional layers: These layers apply a set of filters to the input data, which slide over
the input data and perform a dot product between the filter and a small region of the input
data, known as the receptive field. The filters capture local patterns and features in the
input data, such as edges, corners, and textures.
2. Pooling layers: These layers downsample the output of the convolutional layers by
summarizing small regions of the input data. The most common type of pooling operation
is max pooling, which takes the maximum value in each small region. This reduces the
size of the feature maps and helps to make the model more robust to small variations in
the input data.
3. Activation functions: These functions introduce nonlinearity into the model, allowing it
to learn more complex patterns and relationships in the data. The most common
activation function used in CNNs is the rectified linear unit (ReLU), which sets negative
values to zero and leaves positive values unchanged.
4. Fully connected layers: These layers are similar to the layers in a standard neural network
and connect every neuron in one layer to every neuron in the next layer. Fully connected
layers are typically used at the end of a CNN to make predictions based on the features
learned by the earlier layers.
CNNs work by applying a sequence of convolutional and pooling layers to the input data,
gradually extracting more abstract and high-level features. The first convolutional layers
capture low-level features such as edges and textures, while later layers capture more
complex patterns and object parts. The fully connected layers at the end of the network
combine these features to make predictions based on the input data.
There are two main approaches to using transfer learning with CNNs:
1. Fine-tuning: In fine-tuning, a pre-trained CNN is used as a starting point, and the weights
of the model are adjusted on a new dataset. Typically, the last few layers of the pre-
trained model are replaced with new layers, and only the weights of the new layers are
updated during training. This approach is useful when the new dataset is similar to the
dataset used to train the pre-trained model, as the lower layers of the pre-trained model
can still be used to extract useful features.
2. Feature extraction: In feature extraction, the pre-trained CNN is used as a fixed feature
extractor, and the extracted features are used as input to a new classifier. The pre-trained
model is used to extract features from the new dataset, and these features are then fed into
a new classifier, such as a fully connected layer or a support vector machine (SVM). This
approach is useful when the new dataset is significantly different from the dataset used to
train the pre-trained model, as the lower layers of the pre-trained model can still be used
to extract useful features, but the upper layers may need to be discarded or frozen.
Transfer learning can be beneficial for image classification tasks in several ways. First,
pre-trained models are often trained on large, diverse datasets and have learned to
recognize many general features of images, such as edges, shapes, and textures. This
makes them useful starting points for many image classification tasks. Second, by starting
with a pre-trained model, the amount of training data required to achieve good
performance can be reduced, as the model has already learned useful features. Finally,
transfer learning can help to reduce the risk of overfitting, as the pre-trained model has
already been trained on a large dataset, and its weights have been regularized to prevent
overfitting.
In summary, transfer learning can be used to improve the performance of a CNN for
image classification tasks by leveraging the knowledge learned from a pre-trained model
on a similar task. Fine-tuning and feature extraction are two approaches to using transfer
learning with CNNs, and they can help to reduce the amount of training data required and
reduce the risk of overfitting.
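As a hedged sketch of the fine-tuning approach described above, the snippet below starts from a pre-trained VGG16, freezes its convolutional base, and trains a new classification head. The choice of VGG16, the input size, num_classes, and the commented-out training arrays are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

num_classes = 10
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False   # freeze the pre-trained convolutional layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),            # new layers for the new task
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(new_train_images, new_train_labels, epochs=5)
# Optionally unfreeze the top few layers of `base` afterwards and continue
# training with a small learning rate (fine-tuning proper).
```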
6 Explain the working principle of a convolutional neural network, and discuss the 10M
different building blocks used to extract features from input data.
ANS:
Convolutional Neural Network is a specialized neural network designed for visual data, such as
images & videos. But CNNs also work well for non-image data (especially in natural language
processing & text classification).
Although vanilla neural networks (MLPs) can learn highly complex functions, their architecture does
not exploit what we know about how the brain reads & processes images. For this reason, MLPs
haven't been able to achieve any significant breakthroughs in image-processing tasks.
The architecture of the Convolutional Neural Network, by contrast, uses many of the working
principles of the animal visual system & it has been able to achieve extraordinary results in
image-related learning tasks.
What is Convolution?
Let us consider an image 'X' & a filter 'Y' (more about filters will be covered later). Both of
them, i.e. X & Y, are matrices (image X is expressed in terms of its pixel values). When we
convolve the image 'X' using the filter 'Y', we produce the output in a matrix, say 'Z'.
Finally, we compute the sum of all the elements in ‘Z’ to get a scalar number, i.e.
3+4+0+6+0+0+0+45+2 = 60
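A minimal NumPy sketch of that single convolution step: multiply the patch element-wise by the filter and sum the result. The example values below are illustrative, not the matrices from the original notes.

```python
import numpy as np

X = np.array([[1, 2, 0],    # a 3x3 patch of the image
              [3, 0, 1],
              [5, 9, 2]])
Y = np.array([[3, 2, 1],    # a 3x3 filter
              [2, 1, 0],
              [9, 5, 1]])

Z = X * Y                   # element-wise product
print(Z.sum())              # one scalar value of the output feature map
```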
Now that we are familiar with the idea behind CNN let us dig deeper into the topic
to understand the building blocks of CNN. Following is the outline of our journey:
A filter provides a measure for how close a patch or a region of the input resembles a feature. A
feature may be any prominent aspect of the image – a vertical edge, a horizontal edge, a curve, and
so on. A filter acts as a single template or pattern, which, when convolved across the input, finds
similarities between the stored template & different locations/regions in the input image.
Consider the classic vertical-edge-detection example: a 6×6 input image convolved with a 3×3
vertical-edge filter gives a 4×4 output. Each value of the 4×4 output matrix looks at exactly three
columns & three rows of the input (the receptive field of the filter as it moves over the input
image). The values in the output matrix represent the change in intensity along the horizontal
direction.
The output image has the value 0 in the 1st & last columns, which means there is no change in
intensity across the first three columns & the last three columns of the input image. On the other
hand, the output is 30 in the 2nd & 3rd columns, indicating a sharp change in intensity, i.e. a
vertical edge, in the middle of the input image.
If the input image size is 'n x n' & the filter size is 'f x f', then after convolution the size of
the output image is (n - f + 1) x (n - f + 1).
Why do we do Padding?
Every time we apply a convolution operator, our image shrinks (in the above example, it shrank
from 6 x 6 to 4 x 4). If we convolve the output again with a filter, the image shrinks further.
This is the 1st downside of convolution.
During convolution, the pixels in the corners & the edges are considered only once. This
is the 2nd downside of convolution. If we consider any pixel in the middle, many (fxf)
regions overlap the pixel (we shift the filter & observe the image through it, i.e.
convolve). Thus, the pixels on the corners or the edges are used much less in the output.
So, we throw away a lot of information near the edge of the image.
Let p be the padding. In this example, p = 1 because we padded all around the 6 x 6 input image
with one pixel, making the padded image 8 x 8; convolving it with a 3 x 3 filter then gives a
6 x 6 output, the same size as the original input.
Types of Padding
There are two common choices for padding: Valid convolutions & the Same
convolutions.
1. Valid convolutions: This essentially means no padding (p = 0). We have an (n x n) image
convolved with an (f x f) filter & this gives us an output of size (n - f + 1) x (n - f + 1).
2. Same convolutions: In this case, padding is such that the output size is the same as the
input image size. When we do padding by 'p' pixels, the padded image is of size
(n + 2p) x (n + 2p), and the output after convolution is of size
(n + 2p - f + 1) x (n + 2p - f + 1).
The amount of padding to be done should be such that the output image size equals the original
input image size. Let p = padding. To avoid shrinkage of the original input image, we require
n + 2p - f + 1 = n    ...(1)
which gives p = (f - 1) / 2.
⸫ Output size after convolving the padded image = original input image size
How is the Filter Size Decided?
By convention, the value of 'f', i.e. the filter size, is usually odd in computer vision. This is
linked to the padding formula: if the value of 'f' is even, we may need asymmetric padding (refer
to eqn. 1 above). Let us say that the size of the filter, i.e. 'f', is 6. Then by using equation 1,
we get p = 2.5, which is not a whole number, so the image would have to be padded unevenly.
The 2nd reason for choosing an odd-size filter such as a 3×3 or a 5×5 filter is that we get a
central position (a central pixel), which makes it easy to describe where the filter sits.
What is Stride?
The stride indicates the pace by which the filter moves horizontally & vertically over the pixels
of the input image during convolution.
Stride depends on what we expect in our output image. We prefer a smaller stride size if
we expect several fine-grained features to reflect in our output. On the other hand, if we
are only interested in the macro-level of features, we choose a larger stride size.
To understand the concept of stride in more detail, let’s consider an example. Let’s say
we are interested in classifying the input image between landscape & portrait. We do not
need minute details for this task, such as the number of mountain peaks, trees, etc. So we
can choose a higher value for stride. On the other hand, if we want to classify an image
between dog & cat, we need to focus or capture very minute details or features of the
input image to type the image correctly. In this case, we prefer a smaller stride size.
Let s = stride size. With padding p and stride s, the output size after convolution becomes
((n + 2p - f)/s + 1) x ((n + 2p - f)/s + 1).
Consider an RGB image of size 6×6. Since it's an RGB image, its dimension is 6x6x3, where the
three corresponds to the three colour channels: Red, Green & Blue. We cannot convolve it with the
2-D filter used earlier. For 3-D images, we need 3-D filters, i.e. the filter itself will also have
three layers corresponding to the red, green & blue channels, similar to that of the input RGB image.
This 3-D filter is convolved over the 3-D image. We 1st place the 3x3x3 filter in the upper
left-most position. This filter has 27 numbers (9 parameters in each channel). We take each of
these 27 numbers & multiply them with the corresponding numbers from the image's red, green & blue
channels. Then we add up all those numbers & this gives us the 1st number in the output image.
To compute the next output value, we take this filter & slide/stride it over by 1 (or whatever
stride number we consider) &, again, do the 27 multiplications and add up the 27 numbers.
We can use multiple filters to detect various features simultaneously. Let us consider the
following example in which we see vertical edge & curve in the input RGB image. We
will have to use two different filters for this task, and the output image will thus have two
feature maps.
1. The filters are learned during training (i.e. during backpropagation). Hence, the
individual values of the filters are often called the weights of CNN.
2. A neuron is a filter whose weights are learned during training. E.g., a (3,3,3) filter (or
neuron) has 27 units. Each neuron looks at a particular region in the input with the same weights.
All neurons in a feature map extract the same feature (but from different regions of the input).
It is called a 'feature map' because it maps where a particular feature occurs in the input image.
What is Pooling?
A pooling layer is another essential building block of CNN. It tries to figure out whether
a particular region in the image has the feature we are interested in or not.
The actual dictionary meaning of pooling is the act of sharing or combining two or more things. In
CNN, the pooling layer does a similar job. It summarizes the feature map so that the model does not
need to be trained on precisely positioned features, which makes the model more tolerant to small
shifts in the position of a feature.
The pooling layer looks at larger regions (having multiple patches) of the image & captures an
aggregate statistic (min, max, average, global) for each region. In other words, it reduces each
region of the feature map to a single summary value.
The two most popular aggregate functions used in pooling are ‘max’ & ‘average’:
1. Max pooling – If any of the patches say something firmly about the
presence of a particular feature, then the pooling layer counts that feature
as ‘detected’.
2. Average pooling – If one patch says something very firmly, but the other
ones disagree, the average pooling takes the average to find out.
Pooling has the advantage of making the representation more compact by reducing the
spatial size of the feature maps, thereby reducing the number of parameters to be learnt.
Pooling reduces only the height & width of the feature map, not the depth (i.e. the number of
channels). For example, if we have 'm' feature maps each of size (c, c), a standard 2x2 pooling
produces 'm' outputs each of size (c/2, c/2).
On the other hand, pooling also loses a lot of information, which is often considered a
potential disadvantage.
The pooling layer has ‘NO PARAMETERS’ i.e. ‘ZERO TRAINABLE PARAMETERS’.
The pooling layer computes the aggregate of the input. E.g. in max pooling, it takes a
maximum over a group of pixels. We do not need any adjustments in any parameters.
As a rule of thumb, max pooling works well when we only care whether a feature is present
somewhere in a region, while average pooling works better when the overall strength of the feature
across the region matters.
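A tiny NumPy sketch of 2x2 max pooling on a 4x4 feature map (illustrative values), showing that the spatial size is halved and no trainable parameters are involved.

```python
import numpy as np

fmap = np.arange(16.0).reshape(4, 4)                 # a 4x4 feature map
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))   # max over each 2x2 block
print(pooled.shape)                                  # (2, 2): height & width halved
```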
A typical CNN has the following sequence of layers:
1. The input image is convolved with multiple filters to create various feature maps.
2. Each feature map of size (C, C) is pooled to generate a (C/2, C/2) output (for a standard
2×2 pooling).
The above pattern is referred to as one Convolutional Neural Network layer or one unit. Multiple
such CNN layers are stacked on top of each other to create deep Convolutional Neural Networks.
The output of the final convolution layer contains the extracted features, and these features are
fed into a fully connected (dense) network that produces the final prediction.
Convolutional neural networks (CNNs) are a type of neural network that are commonly
used for image recognition and computer vision tasks. They are based on the idea of
convolution, which is a mathematical operation that is used to extract features from data.
The working principle of a CNN can be broken down into several building blocks that
work together to extract features from input data:
1. Convolutional layers: Convolutional layers are the main building block of a CNN. They
use a set of filters to extract features from the input data. The filters are small 2D matrices
that slide over the input data, computing a dot product with each region of the data they
cover. This results in a set of output feature maps, which represent different learned
features.
2. Pooling layers: Pooling layers are used to downsample the output of the convolutional
layers. They reduce the spatial dimensions of the feature maps, while retaining the most
important information. The most common pooling operation is max pooling, which
selects the maximum value from each region of the feature map.
3. Activation functions: Activation functions introduce nonlinearity into the model. They
are applied element-wise to the output of each neuron in the network, allowing the
network to learn more complex patterns and relationships in the data. The most common
activation function used in CNNs is the rectified linear unit (ReLU), which sets negative
values to zero and leaves positive values unchanged.
4. Fully connected layers: Fully connected layers are used at the end of the network to make
predictions based on the features learned by the earlier layers. They connect every neuron
in one layer to every neuron in the next layer, allowing the network to learn complex
mappings from the input to the output.
During training, the weights of the CNN are adjusted to minimize a loss function, which
measures how well the network is performing on the task at hand. The weights are
updated using backpropagation, which calculates the gradient of the loss function with
respect to each weight in the network.
In summary, a convolutional neural network extracts features from input data using
convolutional layers, pooling layers, activation functions, and fully connected layers.
Convolutional layers extract local features from the input data, while pooling layers downsample
the output and reduce the spatial dimensions of the feature maps. Activation
functions introduce nonlinearity into the model, and fully connected layers are used to
make predictions based on the learned features.
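The four building blocks above can be combined into a small image classifier. The Keras sketch below is a hedged illustration; the 28x28 grayscale input, the layer sizes, and the 10 output classes are assumptions, not part of the original answer.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                                            # pooling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),        # fully connected layer
    layers.Dense(10, activation="softmax"),     # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```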
7 Discuss the working principle of recurrent neural networks, and explain how they 10M
can be used to process sequential data.
ANS:
Recurrent neural networks (RNNs) are a type of neural network that are commonly used
for processing sequential data. They are designed to work with input sequences of
arbitrary length, where the current output depends not only on the current input but also
on previous inputs.
The working principle of an RNN can be explained in terms of its recurrent nature. An
RNN processes input sequences one element at a time, and at each time step, it produces
an output and updates its internal state. The output and internal state are then fed back
into the network at the next time step, along with the next element in the input sequence.
This allows the network to maintain a memory of previous inputs and use that
information to generate more accurate predictions.
The key component of an RNN is the recurrent connection, which allows information to
flow from one time step to the next. The internal state of the RNN at time step t, denoted
as h_t, is computed as a function of the current input x_t and the previous state h_{t-1}.
This can be expressed mathematically as:
h_t = f(W_x · x_t + W_h · h_{t-1} + b)
where f is a nonlinear function (such as tanh) that is applied element-wise, W_x and W_h are
weight matrices applied to the current input and the previous state, and b is a bias vector.
The output of the RNN at each time step is computed based on the current internal state,
and can be used for various tasks such as classification or prediction. The output can also
be fed back into the network as an additional input at the next time step.
In order to train an RNN, the weights of the network are adjusted to minimize a loss
function that measures the difference between the predicted output and the true output.
This is done using a variant of backpropagation called backpropagation through time
(BPTT), which propagates the error gradient back through the recurrent connections.
RNNs can be used for a variety of tasks that involve sequential data, such as natural
language processing, speech recognition, and time series prediction. They are particularly
well-suited for tasks where the length of the input sequence is variable or unknown, as
they can process input sequences of arbitrary length. However, they can be difficult to
train and can suffer from the vanishing gradient problem, where the gradient of the loss
function becomes very small over long sequences, making it difficult to learn long-term
dependencies. This has led to the development of variants of RNNs such as long short-
term memory (LSTM) networks and gated recurrent units (GRUs), which are designed to
address these issues.
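A minimal NumPy sketch of the recurrence h_t = f(W_x·x_t + W_h·h_{t-1} + b) applied over a toy input sequence; the sizes and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5

W_x = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))        # a toy input sequence
h = np.zeros(hidden_dim)                          # initial internal state

for x_t in xs:                                    # process one element at a time
    h = np.tanh(W_x @ x_t + W_h @ h + b)          # update the internal state
print(h)                                          # the final state summarizes the sequence
```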
8 Explain the structure and working principle of autoencoders and decoders, and 10M
explain how they can be used for unsupervised learning tasks, such as
dimensionality reduction and anomaly detection.
ANS:
Autoencoders are neural networks that are designed to learn a compressed representation
of input data. They consist of two main components: an encoder and a decoder. The
encoder takes the input data and compresses it into a low-dimensional representation,
while the decoder takes this compressed representation and attempts to reconstruct the
original input data.
In anomaly detection, the goal is to identify instances of data that are significantly
different from the norm. Autoencoders can be used for this task by training the model on
a dataset of normal data, and then using the model to reconstruct new instances of data. If
the reconstruction error is significantly higher than normal, then the instance of data may
be an anomaly. This approach is particularly useful when anomalies are rare or when it is
difficult to define explicit rules for what constitutes an anomaly.
Overall, autoencoders are a powerful tool for unsupervised learning tasks, and they can
be used for a wide range of applications including dimensionality reduction, anomaly
detection, and generative modeling.
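A hedged sketch of the anomaly-detection recipe above: score new inputs by their reconstruction error and flag the ones that reconstruct poorly. The `autoencoder` model, the `normal_data`/`new_data` arrays, and the 3-sigma threshold are all illustrative assumptions.

```python
import numpy as np

def reconstruction_error(model, x):
    """Per-sample mean squared error between the input and its reconstruction."""
    x_hat = model.predict(x, verbose=0)
    return np.mean((x - x_hat) ** 2, axis=1)

# errors_normal = reconstruction_error(autoencoder, normal_data)
# threshold = errors_normal.mean() + 3 * errors_normal.std()   # one common heuristic
# is_anomaly = reconstruction_error(autoencoder, new_data) > threshold
```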
9 Discuss the concept of transfer learning in deep neural networks, and explain how it can 10M
be used to improve the performance of a convolutional neural network for image
classification tasks.
ANS:
The reuse of a previously learned model on a new problem is known as transfer learning.
It’s particularly popular in deep learning right now since it can train deep neural networks
with a small amount of data. This is particularly valuable in the field of data science, as
most real-world problems do not come with the millions of labelled data points needed to train
such complicated models from scratch.
Transfer learning is essentially the transfer of knowledge within machine learning. A machine uses
the knowledge learned from a prior assignment to improve its predictions on a new task. For
example, if you trained a simple classifier to predict whether an image contains a backpack, you
could use the knowledge the model gained during that training to help it recognise other objects,
such as sunglasses.
With transfer learning, we basically try to use what we've learned in one task to better understand
the concepts in another. The weights that a network learned while performing "task A" are
transferred to a network that must perform a new "task B".
Because of the massive amount of CPU power required, transfer learning is typically
applied in computer vision and natural language processing tasks like sentiment analysis.
In computer vision, neural networks typically aim to detect edges in the first layer, forms
in the middle layer, and task-specific features in the latter layers. The early and central
layers are employed in transfer learning, and the latter layers are only retrained. It makes
use of the labelled data from the task it was trained on.
Let's return to the example of a model that was intended to identify a backpack in an image and
will now be used to detect sunglasses. Because the model has already learned to recognise objects
in its earlier layers, we will simply retrain the later layers so that the model learns what
distinguishes sunglasses from other objects.
Transfer learning offers a number of advantages, the most important of which are reduced training
time, improved neural network performance (in most circumstances), and the ability to get by
without a large amount of data.
To train a neural model from scratch, a lot of data is typically needed, but access to that data
isn't always possible – this is when transfer learning comes in handy. Because the model has
already been pre-trained, a good machine learning model can be generated with fairly little
training data using transfer learning. This is especially useful in natural language processing,
where huge labelled datasets require a lot of expert knowledge to create. Additionally, training a
deep neural network from scratch on a complex task can take days or even weeks.
Transfer learning is worth considering when we don't have enough annotated data to train our model
with, and when there is a pre-trained model that has been trained on similar data and tasks. If you
used TensorFlow to train the original model, you might simply restore it and retrain some layers
for your job. Transfer learning, on the other hand, only works if the features learnt in the first
task are general, meaning they can be applied to the new activity. Furthermore, the model's input
must be the same size as it was when it was first trained; if it isn't, add a step to resize your
input to the required size.
1. TRAINING A MODEL TO REUSE IT
Consider the situation in which you wish to tackle task A but lack the necessary data to train a
deep neural network. Finding a related task B with plenty of data is one way around this. Utilize
the deep neural network to train on task B and then use the model as a starting point to solve
task A. The problem you're seeking to solve will decide whether you need to employ the entire model
or only a few layers.
If the input in both jobs is the same, you might reapply the model and make predictions for your
new input. Alternatively, changing and retraining the distinct task-specific layers and the output
layer is a method worth exploring.
2. USING A PRE-TRAINED MODEL
The second option is to employ a model that has already been trained. There are a number of these
models out there, so do some research beforehand. The number of layers to reuse and the number to
retrain depend on the problem.
Keras, for example, ships with nine pre-trained models that can be used for transfer learning,
prediction, and fine-tuning. These models, as well as some quick lessons on how to utilise them,
may be found in the Keras documentation. Many research institutions also make trained models
accessible.
This form of transfer learning is the one most commonly used throughout deep learning.
3. EXTRACTION OF FEATURES
Another option is to utilise deep learning to identify the optimum representation of your
problem, which comprises identifying the key features. This method is known as
representation learning, and it can often produce significantly better results than hand-
designed representations.
Feature creation in machine learning is mainly done by hand by researchers and domain specialists.
Fortunately, deep learning can extract features automatically. Of course, this does not diminish
the importance of feature engineering and domain knowledge; you still have to decide which data you
feed into your network. Neural networks, however, have the ability to learn which features are
critical and which aren't, even for complicated tasks that would otherwise necessitate a lot of
manual effort.
The learned representation can then be applied to a variety of other challenges. Simply utilise the
initial layers to find the appropriate feature representation, but avoid using the network's final
output because it is too task-specific. Instead, send data into your network and take the output of
one of the intermediate layers.
This method is commonly used in computer vision since it can shrink your dataset,
reducing computation time and making it more suited for classical algorithms.
There are a number of popular pre-trained machine learning models available. The Inception-v3
model, which was developed for the ImageNet "Large Visual Recognition Challenge", is one of them.
Participants in this challenge had to classify images into 1,000 categories.
The basic idea behind transfer learning is that a neural network pre-trained on a large
dataset, such as ImageNet, has learned to recognize general features and patterns that are
useful for many different tasks, including tasks that the network was not specifically
trained for. By reusing the pre-trained network, we can save time and computational
resources that would otherwise be required to train a new network from scratch.
Additionally, we can leverage the knowledge learned from the pre-trained network to
improve the performance of the new network on the new task.
In the context of image classification tasks, transfer learning can be used to improve the
performance of a convolutional neural network (CNN) by reusing a pre-trained CNN as a
starting point. The pre-trained CNN can be used as a feature extractor, where the weights
of the pre-trained layers are frozen and only the weights of the new layers are trained on
the new dataset. This approach is particularly useful when the new dataset is small, as the
pre-trained network has already learned general features and patterns that can be reused
for the new task, allowing the new network to learn from a smaller number of examples.
There are several pre-trained CNN models available, such as VGG, ResNet, and
Inception, that have been trained on large datasets such as ImageNet. These pre-trained
models can be downloaded and used as a starting point for a new image classification
task. For example, the pre-trained CNN can be used to extract features from the images in
the new dataset, and then these features can be fed into a new classifier, such as a fully
connected layer or a support vector machine, to classify the images.
Overall, transfer learning is a powerful technique for improving the performance of deep
neural networks, and it can be particularly useful for tasks with limited training data or
computational resources. By leveraging the knowledge learned from pre-trained models,
we can improve the performance of new models and accelerate the development of new
machine learning applications.
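As a hedged sketch of the feature-extraction variant described above, the snippet below uses a frozen pre-trained ResNet50 to turn images into feature vectors and trains an SVM on them. The choice of ResNet50, the input size, and the commented-out `images`/`labels` arrays are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from sklearn.svm import SVC

# Frozen feature extractor: global average pooling yields one vector per image.
extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                     input_shape=(224, 224, 3))

# features = extractor.predict(images)              # shape: (num_images, 2048)
# clf = SVC(kernel="rbf").fit(features, labels)     # classical classifier on deep features
# predictions = clf.predict(extractor.predict(new_images))
```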
10 Explain the working principle of unsupervised learning with deep neural networks, 10M
and discuss the different types of unsupervised learning tasks that can be performed
with deep neural networks.
ANS:
Unsupervised learning with deep neural networks is a type of machine learning where the
model is trained on input data without explicit labels or targets. Instead, the model is
trained to find patterns or structure in the data through self-organization, and to learn
useful representations of the input data without any explicit guidance.
The working principle of unsupervised learning with deep neural networks involves
optimizing the model's parameters to minimize some objective function that captures a
notion of similarity or regularity in the input data. There are different types of objective
functions used for unsupervised learning, depending on the specific task and the type of
neural network used.
One common type of unsupervised learning task is clustering, which involves grouping
similar data points together based on some measure of similarity or distance. This can be
performed with deep neural networks using techniques such as k-means clustering, where
the network is trained to identify clusters of similar data points by minimizing the sum of
distances between each data point and its nearest cluster center.
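A small hedged sketch of this idea: run k-means on lower-dimensional codes produced by an encoder. The random array below merely stands in for the encoder's output; in practice it would come from something like `encoder.predict(data)`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for latent codes produced by a trained encoder (e.g. encoder.predict(data)).
latent_codes = np.random.default_rng(0).normal(size=(1000, 32))

cluster_ids = KMeans(n_clusters=10, n_init=10).fit_predict(latent_codes)
print(np.bincount(cluster_ids))   # number of points assigned to each cluster
```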
Generative modeling is another important type of unsupervised learning task, where the
goal is to learn a probabilistic model of the input data that can be used to generate new
samples that are similar to the original data. This can be performed with deep neural
networks using techniques such as variational autoencoders (VAEs) or generative
adversarial networks (GANs), where the network is trained to generate new samples by
sampling from a learned latent space that captures the underlying structure of the data.
Overall, unsupervised learning with deep neural networks is a powerful tool for
discovering patterns and structure in complex data without the need for explicit labels or
targets. By leveraging the power of deep learning, we can learn useful representations of
data that can be used for a wide range of applications including clustering, dimensionality
reduction, and generative modeling.
UNIT IV
LSTM Networks - Generative Modeling with DL –Variational Auto encoder- Generative Adversarial
Network Revisiting Gradient Descent - Momentum Optimizer – RMSProp – Adam.
1 What is the working principle of LSTM networks, and how are they used in deep 4M
learning applications?
ANS:
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network
(RNN) that are designed to handle the issue of vanishing gradients that can occur in
traditional RNNs. LSTMs use a special type of memory cell that can selectively
remember or forget information over time, allowing the network to capture long-term
dependencies in sequential data.
LSTM networks are widely used in deep learning applications that involve sequential
data, such as natural language processing, speech recognition, and time series analysis. In
natural language processing, LSTMs can be used for tasks such as language modeling,
machine translation, and sentiment analysis. In speech recognition, LSTMs can be used
for tasks such as speech-to-text transcription and speaker identification. In time series
analysis, LSTMs can be used for tasks such as anomaly detection and forecasting.
Overall, LSTM networks are a powerful tool for handling sequential data and capturing
long-term dependencies, and they have enabled significant advances in a wide range of
deep learning applications.
2 How do generative models differ from discriminative models, and what are the 4M
applications of generative modeling in deep learning?
ANS:
Generative models and discriminative models are two types of models used in machine
learning.
Generative models are designed to learn the underlying probability distribution of the
input data and can be used to generate new samples that are similar to the training data. In
other words, they model the joint probability distribution of both the input data and the
labels. Examples of generative models include Gaussian Mixture Models (GMMs),
Hidden Markov Models (HMMs), and Variational Autoencoders (VAEs).
On the other hand, discriminative models are designed to learn the decision boundary
between classes directly. They only model the conditional probability distribution of the
labels given the input data. Examples of discriminative models include logistic
regression, Support Vector Machines (SVMs), and Neural Networks (NNs).
In terms of applications in deep learning, generative models are used in a variety of tasks,
such as image synthesis, text generation, and anomaly detection. They are also used in
unsupervised learning, where the labels are not provided, and the model must learn the
underlying structure of the data on its own.
One popular generative model is the Generative Adversarial Network (GAN), which is
designed to learn the distribution of the input data and can generate new samples that are
visually similar to the training data. Another popular generative model is the Variational
Autoencoder (VAE), which learns a lower-dimensional representation of the input data
and can generate new samples by sampling from the learned latent space.
In summary, generative models learn the probability distribution of the input data and can
be used to generate new samples, while discriminative models learn the decision
boundary between classes directly. Generative models are used in a variety of deep
learning applications, including image synthesis, text generation, and unsupervised
learning.
3 What is the working principle of variational autoencoders, and how are they used in 4M
generative modeling?
ANS:
Variational autoencoders (VAEs) are a type of generative model that use neural networks
to learn a lower-dimensional representation of the input data, called a latent space. VAEs
are trained using unsupervised learning, which means that they do not require labeled
data to learn the underlying distribution of the input data.
The basic working principle of a VAE is to learn two neural networks: an encoder and a
decoder. The encoder takes an input data point and maps it to a lower-dimensional latent
representation. The decoder then takes the latent representation and maps it back to the
original data space. During training, the VAE learns to minimize the difference between
the original data point and the reconstructed data point.
However, unlike traditional autoencoders, VAEs are trained to generate new data samples
by sampling from the learned latent space. During training, VAEs learn to optimize the
latent space to have a simple, normally distributed structure. This allows the VAE to
generate new data samples by randomly sampling from the latent space and mapping
them back to the original data space using the decoder.
The loss function used to train VAEs includes two terms: a reconstruction loss and a KL
divergence loss. The reconstruction loss measures the difference between the original
input data and the reconstructed data, while the KL divergence loss measures the
difference between the learned latent distribution and a simple, normally distributed prior
distribution. The KL divergence loss encourages the VAE to learn a simple, structured
latent space that can be easily sampled to generate new data samples.
In terms of generative modeling, VAEs are used to generate new data samples that are
similar to the training data. The learned latent space can be used to interpolate between
existing data samples, generating new data samples that lie along the same distribution as
the training data. VAEs have been used in a variety of applications, such as image
synthesis, text generation, and anomaly detection.
4 What is the main idea behind generative adversarial networks, and how are they 4M
used in generative modeling?
ANS:
Generative Adversarial Networks (GANs) are a type of generative model that use two
neural networks to play a two-player minimax game. GANs are designed to learn the
underlying distribution of the input data and can generate new data samples that are
similar to the training data.
The main idea behind GANs is to have one neural network, called the generator, generate
new data samples that are similar to the training data, while another neural network,
called the discriminator, attempts to distinguish between the generated data samples and
the real data samples. The generator is trained to generate data samples that can fool the
discriminator, while the discriminator is trained to distinguish between real and fake data
samples.
During training, the generator takes random noise as input and generates a new data
sample. The discriminator then takes the generated data sample and a real data sample
and predicts which one is real and which one is fake. The generator is then updated to
generate data samples that are more difficult for the discriminator to distinguish from real
data samples, and the discriminator is updated to better distinguish between real and fake
data samples. This process continues until the generator is able to generate new data
samples that are indistinguishable from real data samples.
The loss function used to train GANs is a combination of two terms: a generator loss and
a discriminator loss. The generator loss measures how well the generated data samples
fool the discriminator, while the discriminator loss measures how well the discriminator
is able to distinguish between real and fake data samples. The generator and discriminator
are updated iteratively based on the gradients of the loss function.
In terms of generative modeling, GANs are used to generate new data samples that are
similar to the training data. The generator can be used to generate new data samples that
are visually similar to the training data, while the discriminator can be used to evaluate
the quality of the generated data samples. GANs have been used in a variety of
applications, such as image synthesis, text generation, and video synthesis.
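A hedged Keras sketch of the two-player setup described above, for 28x28 images and a 100-dimensional noise vector (both assumptions). The generator is trained through a combined model in which the discriminator's weights are frozen, which is one common way to wire up a simple GAN in Keras.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

latent_dim = 100

generator = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(28 * 28, activation="sigmoid"),
    layers.Reshape((28, 28)),
])

discriminator = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # probability that the input is real
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

discriminator.trainable = False              # freeze D inside the combined model
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_images):
    batch = real_images.shape[0]
    noise = np.random.normal(size=(batch, latent_dim))
    fake_images = generator.predict(noise, verbose=0)
    # Discriminator step: real images labelled 1, generated images labelled 0.
    discriminator.train_on_batch(real_images, np.ones((batch, 1)))
    discriminator.train_on_batch(fake_images, np.zeros((batch, 1)))
    # Generator step: try to make the discriminator output 1 for generated images.
    gan.train_on_batch(noise, np.ones((batch, 1)))
```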
LSTM is a type of neural network used in the field of deep learning. LSTM stands
for Long Short-Term Memory. LSTM is an improved version of the RNN (Recurrent
Neural Network). LSTM is mainly used in Time series and Sequence data because RNN
doesn't provide an efficient performance as the gap length rises. LSTM differs from
conventional Feedforward Networks as it uses previous data and its output to affect the
current predictions. LSTM is also better at retaining information for longer periods when
compared with RNN. Long Short Term Memory uses Gated Cells to remember or forget
previous information.
LSTM was designed by Hochreiter & Schmidhuber. It's not an easy task to get your head
around LSTM as it belongs to the complex area of Deep Learning. LSTM deals with
algorithms to uncover the underlying relationships in the given sequential data.
The chain structure LSTM contains four neural networks and different memory blocks
called cells. The LSTM may keep information for a long time by default and information
is retained by the cells and the three gates do the memory manipulations.
Need of LSTM
LSTM was introduced to tackle the problems and challenges faced in Recurrent Neural
Networks. Slightly touching on the topic of RNN. RNN is a type of Neural Network that
stores the previous output to help improve its future predictions. Vanilla RNN has a
“short-term” memory. The input at the beginning of the sequence doesn’t affect the
output of the Network after a while, maybe 3 or 4 inputs. This is called a long-term
dependency issue.
Example: consider the sentence "The sun rises in the ____." An RNN could easily return the correct
output ("East") as all the necessary information is nearby.
Let's take another example: "I grew up in Japan ... I speak fluent ____." In this sentence, the RNN
would not be able to return the correct output ("Japanese") as it requires remembering the word
Japan for a long duration. Since RNN only has a "short-term" memory, it doesn't work well here.
LSTM solves this problem by enabling the network to remember long-term dependencies.
The other problem in RNN is the Vanishing Gradient and Exploding Gradient. It
arises during the Backpropagation of the Neural Network. Suppose the gradient of each
layer is contained between 0 and 1. As the value gets multiplied in each layer, it gets
smaller and smaller, ultimately, a value very close to 0. This is the Vanishing gradient
problem. The converse, when the values are greater than 1, exploding gradient problem
occurs, where the value gets really big, disrupting the training of the Network. These
problems are tackled in LSTMs.
Structure of LSTM
LSTM is a cell that consists of 3 gates. A forget gate, input gate, and output gate. The
gates decide which information is important and which information can be forgotten. The
cell has two states Cell State and Hidden State. They are continuously updated and
carry the information from the previous time steps to the current time steps. The cell state
acts as the “long-term” memory, while the hidden state acts as the “short-term” memory.
Now let’s look at each gate in detail.
Forget Gate:
The forget gate is responsible for deciding what information should be removed from the cell
state. It takes in the hidden state of the previous time step and the current input, and passes
them to a sigmoid activation function, which outputs a value between 0 and 1, where 0 means forget
and 1 means keep.
Input Gate:
The input gate considers the current input and the hidden state of the previous time step. It is
used to update the cell state value, and it has two parts. The first part is a sigmoid activation
function; its purpose is to decide what percentage of the new information is required. The second
part passes the same two values to a tanh activation function, whose purpose is to map the data
between -1 and 1. To obtain the relevant information, we multiply the output of the tanh by the
output of the sigmoid. This product is the output of the input gate, which updates the cell state.
Output Gate:
The output gate produces the hidden state for the next time step. It has two parts. The first part
is a sigmoid function, which serves the same purpose as in the other two gates: to decide the
percentage of relevant information required. Then the newly updated cell state is passed through a
tanh function and multiplied by the output of the sigmoid function. This result is the new hidden
state.
Cell State:
The forget gate and input gate update the cell state. The cell state of the previous state is
multiplied by the output of the forget gate. The output of this state is then summed with
the output of the input gate. This value is then used in the calculation of hidden state in
the output gate.
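Putting the gates together, one LSTM time step can be sketched in a few lines of NumPy. This follows the standard LSTM formulation; the weight matrices, sizes, and random inputs are illustrative placeholders, not values from the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update; W maps [h_prev, x_t] to each gate's pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate: what to drop from the cell state
    i = sigmoid(W["i"] @ z + b["i"])   # input gate: how much new information to let in
    g = np.tanh(W["g"] @ z + b["g"])   # candidate values, mapped to [-1, 1]
    o = sigmoid(W["o"] @ z + b["o"])   # output gate: what to expose as the hidden state
    c = f * c_prev + i * g             # new cell state (long-term memory)
    h = o * np.tanh(c)                 # new hidden state (short-term memory)
    return h, c

hidden, inp = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden, hidden + inp)) for k in "figo"}
b = {k: np.zeros(hidden) for k in "figo"}
h, c = lstm_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden), W, b)
```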
The LSTM architecture is similar to that of an RNN, but instead of a simple feedback loop, each
recurrent unit is an LSTM cell. The sequence of LSTM cells in each layer is fed with the output of
the previous cell, which enables each cell to receive information about the previous inputs in the
sequence. A cyclic set of steps happens in each LSTM cell: the forget gate decides what to discard
from the cell state, the input gate decides what new information to add, the cell state is updated,
and the output gate produces the new hidden state.
This series of steps occurs in every LSTM cell. The intuition behind LSTM is that the cell state
and the hidden state carry the previous information and pass it on to future time steps. The cell
state aggregates information from all of the past data and is the long-term information retainer,
while the hidden state carries the output of the previous cell, i.e. the short-term memory. This
combination of long-term and short-term memory enables LSTMs to perform really well on time series
and sequence data.
Applications of LSTM
Language Modeling: LSTMs have been used to build language models that can generate
natural language text, such as in machine translation systems or chatbots.
Time series prediction: LSTMs have been used to model time series data and
make predictions about future values in the series. For example, LSTMs have
been used to predict stock prices or traffic patterns.
Sentiment analysis: LSTMs have been used to analyze the sentiment of text,
such as in social media posts or customer reviews.
Speech recognition: LSTMs have been used to build speech recognition systems
that can transcribe spoken language into text.
Image captioning: LSTMs have been used to generate descriptive captions for
images, such as in image search engines or automated image annotation systems.
LSTMs can be used for processing sequential data, such as natural language and speech,
by treating each word or phoneme as a separate input. The LSTM network then updates
the memory cell and generates an output for each input. The output can be used to predict
the next word or phoneme in the sequence, or to generate a new sequence altogether.
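For completeness, a minimal hedged Keras sketch of an LSTM applied to sequence data (here, next-value prediction on a univariate time series); the sequence length of 30, the 64 units, and the commented-out training arrays are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.LSTM(64, input_shape=(30, 1)),   # 30 time steps, 1 feature per step
    layers.Dense(1),                        # predict the next value in the series
])
model.compile(optimizer="adam", loss="mse")
# model.fit(x_sequences, y_next_values, epochs=10)
```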
7 Explain the concept of generative modeling in deep learning, and discuss the 10M
different types of generative models, such as variational autoencoders and
generative adversarial networks.
ANS:
Generative modeling in deep learning refers to the process of training a model to generate
new data samples that are similar to the original data distribution. In other words,
generative models learn the underlying patterns and structures of the data and use them to
generate new samples that are statistically similar to the original data.
There are different types of generative models in deep learning, but two of the most
popular ones are Variational Autoencoders (VAEs) and Generative Adversarial Networks
(GANs).
Generative models have numerous applications in deep learning, including image and
video synthesis, natural language processing, music generation, and data augmentation.
They are also used for anomaly detection, where the generative model is trained on
normal data and can then be used to detect anomalies or outliers in new data.
In summary, generative modeling in deep learning refers to the process of training a
model to generate new data samples that are statistically similar to the original data.
VAEs and GANs are two popular types of generative models that are used for different
types of data and applications. Generative models have numerous applications in deep
learning, including image and video synthesis, natural language processing, and data
augmentation.
8 Explain the working principle of variational autoencoders, and explain how they 10M
can be used for generative modeling tasks, such as image generation and data
compression.
ANS:
Variational Autoencoders (VAEs) are a type of autoencoder that can be used for
generative modeling tasks, such as image generation and data compression. VAEs work
by learning a compressed representation of the input data in a lower-dimensional latent
space, which can be used to generate new data samples that are similar to the original
data.
1. Encoding: The input data is first encoded into a lower-dimensional latent space using an
encoder network. The encoder network maps the input data to the mean and standard
deviation of a multivariate normal distribution that represents the latent space.
2. Sampling: A sample is drawn from the latent space using the mean and standard
deviation generated by the encoder network. This sample is then used as the input to the
decoder network.
3. Decoding: The decoder network then maps the sample from the latent space back to the
original input space. The output of the decoder network is a reconstruction of the input
data.
4. Loss function: The VAE is trained to minimize the difference between the input data and
the reconstructed data, as well as to minimize the difference between the latent space
distribution and a standard normal distribution. This is done by minimizing a loss
function that consists of two terms: the reconstruction loss and the KL-divergence loss.
The reconstruction loss measures the difference between the input data and the
reconstructed data, while the KL-divergence loss measures the difference between the
latent space distribution and a standard normal distribution. The KL-divergence loss
encourages the VAE to learn a smooth and regular latent space that can be easily sampled
to generate new data samples.
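To make the four steps concrete, the following is a minimal VAE sketch in PyTorch; the flattened 28x28 input size, layer widths, and latent dimension are assumptions made only for illustration:

# Minimal VAE: encoder -> (mu, logvar), reparameterized sample, decoder, combined loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)        # mean of the latent distribution
        self.logvar = nn.Linear(h_dim, z_dim)    # log-variance of the latent distribution
        self.dec1 = nn.Linear(z_dim, h_dim)
        self.dec2 = nn.Linear(h_dim, x_dim)

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def decode(self, z):
        return torch.sigmoid(self.dec2(F.relu(self.dec1(z))))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample
        return self.decode(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    rec = F.binary_cross_entropy(x_rec, x, reduction="sum")        # reconstruction loss
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL to a standard normal
    return rec + kl

x = torch.rand(16, 784)                    # stand-in batch of flattened images in [0, 1]
x_rec, mu, logvar = VAE()(x)
loss = vae_loss(x, x_rec, mu, logvar)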
VAEs can be used for generative modeling tasks, such as image generation and data
compression, by sampling from the learned latent space and decoding the samples into
new data samples. To generate new data samples, the VAE can be trained to learn the
distribution of the original data and generate new samples from the learned distribution.
In image generation, the VAE can be trained on a dataset of images and then used to
generate new images by sampling from the learned latent space and decoding the samples
into new images. The VAE can also be used for data compression by encoding the input
data into the learned latent space and then decoding the samples back into the original
input space. The compressed data can be stored in a lower-dimensional space, which can
save storage and computational resources.
In summary, VAEs are a type of autoencoder that can be used for generative modeling
tasks, such as image generation and data compression. VAEs learn a compressed
representation of the input data in a lower-dimensional latent space, which can be used to
generate new data samples that are similar to the original data. VAEs are trained to
minimize a loss function that consists of two terms: the reconstruction loss and the KL-
divergence loss.
9 Discuss the working principle of generative adversarial networks, and explain how 10M
they can be used for generative modeling tasks, such as image generation and data
synthesis.
ANS:
Generative Adversarial Networks (GANs) are a type of generative model that can be used
for generating new data samples, such as images or text. The basic idea behind GANs is
to train two neural networks, a generator and a discriminator, in a two-player minimax
game.
The generator network is trained to produce fake data samples that are similar to the real
data samples, while the discriminator network is trained to distinguish between the real
and fake data samples. The two networks are trained simultaneously, with the generator
network attempting to fool the discriminator network, and the discriminator network
trying to correctly classify the real and fake data samples.
1. Generator network: The generator network takes random noise as input and generates a
fake data sample, such as an image or a text sequence.
2. Discriminator network: The discriminator network takes a data sample, either real or
fake, as input and produces a binary output indicating whether the sample is real or fake.
3. Minimax game: The generator and discriminator networks are trained in a two-player
minimax game, where the generator tries to produce fake data samples that can fool the
discriminator, and the discriminator tries to correctly distinguish between real and fake
data samples.
4. Loss function: The generator and discriminator networks are trained to minimize
different loss functions. The generator network is trained to maximize the probability of
the discriminator network making a mistake, while the discriminator network is trained to
maximize the probability of correctly classifying real and fake data samples.
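The following is a minimal PyTorch sketch of one training step that follows the four points above; the fully connected architectures, learning rates, and flattened image size are illustrative assumptions rather than a prescribed design:

# One GAN training step: update the discriminator, then the generator.
import torch
import torch.nn as nn

x_dim, z_dim = 784, 100
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                                   # real: (batch, x_dim) real samples
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: classify real samples as 1 and generated samples as 0.
    fake = G(torch.randn(batch, z_dim)).detach()
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator output 1 on generated samples.
    fake = G(torch.randn(batch, z_dim))
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

real_batch = torch.rand(32, x_dim) * 2 - 1              # stand-in batch of normalized images
d_loss, g_loss = train_step(real_batch)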
GANs can be used for generative modeling tasks such as image generation and data
synthesis by training the generator network to produce new data samples that are similar
to the real data samples. After training, the generator network can be used to generate
new data samples by sampling random noise as input and producing fake data samples as
output.
In image generation, GANs have been used to generate realistic images that can be used
for various applications such as data augmentation, style transfer, and image editing.
GANs can also be used for data synthesis, where the generator network can be trained to
generate new data samples in a dataset with limited samples.
One of the challenges with GANs is the training instability due to the non-convex nature
of the minimax game. The generator network may produce samples that are too similar,
leading to a mode collapse, or too dissimilar, leading to poor quality samples. Various
techniques have been proposed to stabilize the training of GANs, such as using different
loss functions, regularization techniques, and architecture modifications.
In summary, GANs are a type of generative model that can be used for generating new
data samples, such as images or text. GANs train a generator and a discriminator network
in a two-player minimax game, where the generator network tries to produce fake data
samples that can fool the discriminator network, and the discriminator network tries to
correctly distinguish between real and fake data samples. GANs can be used for
generative modeling tasks such as image generation and data synthesis, but their training
stability can be challenging.
10 Compare and contrast the momentum optimizer, RMSProp, and Adam 10M
optimization algorithms in deep learning.
ANS:
Momentum
The gradient descent with momentum algorithm (or Momentum for short) borrows the
idea from physics. Imagine rolling a ball down the inside of a frictionless bowl. Instead of
stopping at the bottom, the momentum it has accumulated pushes it forward, and the ball
keeps rolling back and forth.
We can apply the concept of momentum to our vanilla gradient descent algorithm. In each
step, in addition to the regular gradient, it also adds on the movement from the previous
step. Mathematically, it is commonly expressed as:
delta = -learning_rate * gradient + previous_delta * decay_rate   (eq. 1)
theta += delta   (eq. 2)
I found it more intuitive if I massage this equation a little and keep track of the (decayed)
cumulative sum of gradients instead. This will also make things easier when we introduce
the Adam algorithm later:
sum_of_gradient = gradient + previous_sum_of_gradient * decay_rate   (eq. 3)
delta = -learning_rate * sum_of_gradient   (eq. 4)
theta += delta   (eq. 5)
(What I did was factoring out -learning_rate. To see the mathematical equivalence, you
can substitute delta with -learning_rate * sum_of_gradient in eq. 1 to get eq. 3.)
Let’s consider two extreme cases to understand this decay rate parameter better. If the
decay rate is 0, then it is exactly the same as (vanilla) gradient descent. If the decay rate is
1 (and provided that the learning rate is reasonably small), then it rocks back and forth
endlessly like the frictionless bowl analogy we mentioned in the beginning; you do not
want that. Typically the decay rate is chosen around 0.8–0.9; it is like a surface with a little
bit of friction, so the ball slows down and eventually stops.
Compare Momentum with vanilla gradient descent on a surface with a global minimum (a
deep well) and a local minimum (a shallow well). In what ways is Momentum better than
vanilla gradient descent?
1. Momentum simply moves faster, because of all the momentum it accumulates.
2. Momentum has a shot at escaping local minima, because the momentum it has built up
may propel it out of a shallow local minimum; in a similar vein, it also powers through
plateaus better.
RMSProp
AdaGrad adapts the step size for each parameter by dividing the learning rate by the square
root of the accumulated sum of gradient squared, so a parameter that has already received
large gradients takes smaller steps. The problem of AdaGrad, however, is that it becomes
incredibly slow, because the sum of gradient squared only grows and never shrinks.
RMSProp (for Root Mean Square Propagation) fixes this issue by adding a decay factor:
sum_of_gradient_squared = previous_sum_of_gradient_squared * decay_rate + gradient² * (1 - decay_rate)
delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
theta += delta
More precisely, the sum of gradient squared is actually the decayed sum of gradient
squared. The decay rate is saying that only the recent gradient² matters, and the ones from
long ago are basically forgotten. As a side note, the term “decay rate” is a bit of a misnomer.
Unlike the decay rate we saw in momentum, in addition to decaying, the decay rate here
also has a scaling effect: it scales down the whole term by a factor of (1 - decay_rate). In
other words, if the decay_rate is set at 0.99, the sum of gradient squared is scaled down by a
factor of (1 - 0.99) = 0.01 relative to AdaGrad; its square root, which divides the gradient in
the update, is then about sqrt(0.01) = 0.1 of AdaGrad's, and thus the step is on the order of
10x larger for the same learning rate.
To see the effect of the decay, compare AdaGrad and RMSProp head to head with tuned
learning rates and decay rates. AdaGrad keeps up with RMSProp initially, but its sums of
gradient squared accumulate so fast that they soon become enormous; they take a heavy toll
on the step size, and eventually AdaGrad practically stops moving. RMSProp, on the other
hand, keeps the sums of gradient squared at a manageable size the whole time, thanks to the
decay factor, and so it keeps making steady progress.
Adam
Last but not least, Adam (short for Adaptive Moment Estimation) takes the best of both
worlds of Momentum and RMSProp. Adam empirically works well, and in recent years it
has become a common default optimizer for deep learning problems:
sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 - beta1)   [Momentum]
sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1 - beta2)   [RMSProp]
delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)
theta += delta
Beta1 is the decay rate for the first moment, the sum of gradient (aka momentum), commonly
set at 0.9. Beta2 is the decay rate for the second moment, the sum of gradient squared, and it
is commonly set at 0.999.
Adam gets the speed from momentum and the ability to adapt gradients in different
directions from RMSProp. The combination of the two makes it powerful.
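Putting the three update rules side by side, the NumPy sketch below implements them in the same sum-of-gradient notation used above; the toy objective and hyperparameters are only for illustration, and Adam's bias-correction terms are omitted for brevity:

# Momentum, RMSProp, and Adam written as single update steps.
import numpy as np

def momentum_step(theta, grad, state, lr=0.01, decay=0.9):
    state["sum_g"] = grad + decay * state.get("sum_g", 0.0)            # decayed sum of gradients
    return theta - lr * state["sum_g"]

def rmsprop_step(theta, grad, state, lr=0.01, decay=0.99, eps=1e-8):
    state["sum_g2"] = decay * state.get("sum_g2", 0.0) + (1 - decay) * grad**2
    return theta - lr * grad / (np.sqrt(state["sum_g2"]) + eps)        # per-parameter scaling

def adam_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad      # first moment (momentum)
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad**2   # second moment (RMSProp)
    return theta - lr * state["m"] / (np.sqrt(state["v"]) + eps)       # bias correction omitted

theta, state = np.array([5.0]), {}
for _ in range(100):
    grad = 2 * theta                    # gradient of the toy objective f(theta) = theta**2
    theta = adam_step(theta, grad, state)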
The momentum optimizer, RMSProp, and Adam optimization algorithms are all
commonly used optimization techniques in deep learning. They differ in their update
rules and the information they use to update the parameters of the model. Here is a
comparison of the three algorithms:
1. Momentum optimizer: The momentum optimizer adds a fraction of the previous update
to the current update in order to accelerate the learning process. It helps to smooth out
oscillations and reduce the noise in the gradients. The momentum optimizer keeps track
of an exponentially decaying moving average of past gradients and uses this information
to update the parameters. However, it does not take into account the magnitude of the
gradients, which can lead to overshooting the minimum.
2. RMSProp: RMSProp is an optimization algorithm that adjusts the learning rate for each
parameter based on the average of the magnitudes of recent gradients for that parameter.
It helps to adjust the learning rate to the scale of the gradients and to avoid overshooting
the minimum. It uses a moving average of squared gradients to normalize the gradient
updates. However, it may suffer from slow convergence when the gradients have a high
variance.
3. Adam: Adam is an adaptive learning rate optimization algorithm that combines the ideas
of both momentum and RMSProp. It keeps track of both the first and second moments of
the gradient and adapts the learning rate based on the variance of the gradients. It uses a
biased estimate of the first moment and a biased estimate of the second moment of the
gradients to calculate the update. It helps to adjust the learning rate to the scale of the
gradients and reduces the noise in the gradients. It has been shown to work well in
practice and is widely used in deep learning applications.
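In practice, most frameworks provide these optimizers out of the box. The snippet below shows how they are typically selected in PyTorch; the tiny model, data, and learning rates are placeholders rather than recommendations:

# Choosing between momentum SGD, RMSProp, and Adam in PyTorch.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # tiny stand-in model
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)          # momentum
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)     # RMSProp
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))  # Adam

loss = nn.MSELoss()(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()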
In summary, the momentum optimizer, RMSProp, and Adam optimization algorithms all
aim to improve the convergence of the gradient descent optimization process. The
momentum optimizer uses past gradients to accelerate learning, but does not take into
account the magnitude of the gradients. RMSProp adjusts the learning rate for each
parameter based on the average of the magnitudes of recent gradients, while Adam
combines both momentum and RMSProp to adapt the learning rate based on the variance
of the gradients. In general, Adam is considered to be the most effective optimization
algorithm in deep learning, but the best choice of optimizer can depend on the specific
problem and model being trained.
UNIT V
Image segmentation – Object Detection – Automatic Image Captioning – Image generation with
Generative adversarial networks – Video to Text with LSTM models – Attention models for Computer
Vision – Self-Organizing Maps - Restricted Boltzmann Machines - Long Short-Term Memory
Networks (LSTM) and Deep Reinforcement Learning.
Case Study: Named Entity Recognition – Opinion Mining using Recurrent Neural Networks – Parsing
and Sentiment Analysis using Recursive Neural Networks – Sentence Classification using Convolutional
Neural Networks – Dialogue Generation with LSTMs.
1 What is image segmentation, and how is it used in computer vision applications? 4M
ANS:
Image segmentation is the process of dividing an image into multiple segments or
regions, each of which corresponds to a different object or part of the image. The goal of
image segmentation is to simplify or change the representation of an image into
something that is more meaningful and easier to analyze. Image segmentation is a
fundamental task in computer vision and is used in a variety of applications such as
object recognition, scene understanding, medical image analysis, and autonomous
vehicles.
The segmentation process typically involves two main steps: 1) grouping similar pixels
together to form a region, and 2) assigning a label to each region. The grouping of pixels
can be based on a number of criteria such as color, texture, brightness, or a combination
of these features. Once the regions are identified, a label is assigned to each region based
on its characteristics, such as the object it belongs to or the background of the image.
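As a simple illustration of grouping similar pixels, the sketch below clusters pixels by colour with k-means (assuming scikit-learn is available; the random image and the choice of four clusters are stand-ins for a real image and a task-specific number of regions):

# Colour-based segmentation by clustering pixels.
import numpy as np
from sklearn.cluster import KMeans

image = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)   # stand-in for a real RGB image
pixels = image.reshape(-1, 3).astype(float)                      # one row per pixel (colour features)

kmeans = KMeans(n_clusters=4, n_init=10).fit(pixels)             # group pixels into 4 colour clusters
labels = kmeans.labels_.reshape(image.shape[:2])                 # per-pixel region label (the segments)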
2 What is the difference between object detection and image segmentation? 4M
ANS:
Object detection differs from image segmentation in that it aims to identify individual
objects within an image, rather than dividing the image into regions or segments based on
their properties. Object detection involves two main tasks: object localization and object
classification.
Object localization involves determining the location of an object within an image. This
is typically done by identifying the boundaries of the object using a bounding box or a
contour. Object classification involves assigning a label or class to the object based on its
characteristics, such as its shape or color.
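As an illustration of localization plus classification, the sketch below runs a pre-trained Faster R-CNN detector from torchvision (assuming a recent torchvision version; the random tensor stands in for a real photograph):

# Off-the-shelf object detection: bounding boxes (localization) and class labels (classification).
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)               # stand-in RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])[0]           # one dict of predictions per input image

print(predictions["boxes"].shape)             # bounding boxes (localization)
print(predictions["labels"][:5])              # class indices (classification)
print(predictions["scores"][:5])              # confidence scores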
In summary, object detection and image segmentation are both important computer vision
tasks, but they differ in their goals and methods. Object detection is focused on
identifying and localizing individual objects within an image, while image segmentation
is focused on dividing an image into regions or segments based on their properties.
3 How is automatic image captioning achieved using deep learning models, and what 4M
are some potential applications of this technology?
ANS:
Automatic image captioning is a task in which a deep learning model is trained to
generate a natural language description of an image. The goal of automatic image
captioning is to produce a caption that accurately reflects the content of the image, and is
grammatically and semantically correct.
A common approach is to combine a convolutional neural network (CNN), which extracts a
feature representation of the image, with a recurrent neural network (RNN) that generates
the caption. The RNN is typically a type of neural network known as a long short-term memory
(LSTM) network, which is capable of modeling sequences of variable length. The LSTM
network generates the caption one word at a time, using the feature representation from
the CNN as input to each step of the sequence.
To train the model, a dataset of images and corresponding captions is typically used. The
model is trained to predict the caption given the image, and the parameters of the model
are adjusted to minimize the difference between the predicted caption and the actual
caption.
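A minimal sketch of this encoder-decoder idea in PyTorch is shown below; the ResNet-18 backbone, vocabulary size, and dimensions are assumptions for illustration, and in a real system the captions would come from a labelled dataset:

# CNN encoder + LSTM decoder for image captioning (illustrative sizes only).
import torch
import torch.nn as nn
import torchvision

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)   # image -> feature vector
        self.cnn = backbone
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.cnn(images).unsqueeze(1)                  # (batch, 1, embed_dim)
        words = self.embed(captions)                           # (batch, T, embed_dim)
        h, _ = self.lstm(torch.cat([feats, words], dim=1))     # image feature starts the sequence
        return self.out(h)                                     # logits over the vocabulary

model = CaptionModel(vocab_size=5000)
logits = model(torch.rand(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))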
Automatic image captioning has a variety of potential applications, such as image and
video indexing, assistive technology for the visually impaired, and content-based image
retrieval. For example, image and video indexing could be used to automatically tag and
organize large collections of images and videos, making them easier to search and
browse. Assistive technology for the visually impaired could use image captions to
provide descriptions of visual content, allowing blind and visually impaired individuals to
better understand and interact with the world around them. Content-based image retrieval
could use image captions to improve the accuracy and relevance of search results, by
allowing users to search for images based on their content, rather than just their file
names or metadata.
4 What is the working principle of generative adversarial networks, and how are they 4M
used for image generation tasks?
ANS:
Generative Adversarial Networks (GANs) are a type of deep learning model that can
generate new data, such as images, by learning from a training dataset. GANs consist of
two neural networks: a generator and a discriminator. The generator is trained to generate
new data that resembles the training data, while the discriminator is trained to distinguish
between the generated data and the real data.
The generator takes a random input, typically noise, and generates a new sample that
resembles the training data. The discriminator then takes both the generated sample and a
real sample from the training data, and tries to distinguish between them. The
discriminator provides feedback to the generator on how well its generated samples
resemble the real data. The generator then uses this feedback to improve its samples, and
the process is repeated until the generated data is indistinguishable from the real data.
The training process of GANs can be difficult, as the generator and discriminator are
constantly trying to outsmart each other. However, when trained successfully, GANs can
generate highly realistic images that are difficult to distinguish from real images.
GANs can be used for a variety of image generation tasks, such as creating realistic
images of faces, landscapes, or objects. They can also be used for tasks such as image
inpainting, where missing parts of an image are filled in based on the surrounding
context, or image super-resolution, where low-resolution images are transformed into
higher-resolution versions.
The working principle of LSTM models is similar to that of RNNs, but they are designed
to handle the vanishing gradient problem that occurs in RNNs, which can make it
difficult for the network to learn long-term dependencies. LSTM models use a system of
gates that regulate the flow of information through the network, allowing it to selectively
retain or forget information over time.
Recurrent neural networks (RNNs) are a type of neural network that are well-suited for
processing sequential data, such as text. One popular approach to NER using RNNs is the
use of a bi-directional LSTM (BLSTM) model. In a BLSTM model, the input text is first
encoded into a sequence of feature vectors, which are then fed into two separate LSTM
layers. One LSTM layer processes the input sequence in the forward direction, while the
other processes it in the backward direction. The outputs of the two LSTM layers are then
concatenated to produce a final output sequence, which is fed into a softmax layer for
classification.
During training, the BLSTM model is trained on a dataset of labeled examples, where
each example consists of a sequence of words and their corresponding labels. The model
is trained to predict the label for each word in the input sequence, based on its context
within the sentence.
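The sketch below is a minimal PyTorch version of such a BLSTM tagger; the vocabulary size, the nine-label tag set, and the dimensions are hypothetical:

# Bi-directional LSTM tagger: one label prediction per word.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)                # forward + backward LSTMs
        self.out = nn.Linear(2 * hidden_dim, num_tags)         # concatenated directions -> tag scores

    def forward(self, tokens):                                 # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)                                     # per-word logits over the tag set

model = BiLSTMTagger(vocab_size=20000, num_tags=9)             # e.g. BIO tags for 4 entity types + O
tokens = torch.randint(0, 20000, (4, 15))
tags = torch.randint(0, 9, (4, 15))
loss = nn.CrossEntropyLoss()(model(tokens).reshape(-1, 9), tags.reshape(-1))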
Overall, NER is a fundamental task in natural language processing, and the use of RNNs,
especially BLSTMs, has proven to be effective in achieving state-of-the-art performance
on various NER benchmarks.
7 Discuss the use of recursive neural networks in parsing and sentiment analysis tasks 10M
with examples.
ANS:
Recursive neural networks (RecNNs) are a class of neural networks that can process tree-
structured data, such as sentences or parse trees. They can be used for various natural
language processing tasks, including parsing and sentiment analysis.
In parsing, RecNNs can be used to generate parse trees for input sentences. Parse trees
represent the syntactic structure of sentences, and they are useful for many downstream
NLP tasks, such as information extraction and question answering. RecNNs can be
trained on annotated datasets to predict the parse trees for new sentences. The basic idea
is to use a recursive function to compute representations for sub-trees and combine them
to obtain representations for larger trees. A popular RecNN architecture for parsing is the
Recursive Autoencoder (RAE), which uses an encoder-decoder framework to reconstruct
the input tree.
In sentiment analysis, RecNNs can be used to classify the sentiment of input sentences or
documents. Sentiment analysis is the task of determining the polarity of a text, i.e.,
whether it expresses a positive, negative, or neutral sentiment. RecNNs can be trained on
annotated datasets to predict the sentiment labels for new sentences. The basic idea is to
use a recursive function to compute representations for sub-sentences and combine them
to obtain representations for larger sentences. A popular RecNN architecture for
sentiment analysis is the Recursive Neural Tensor Network (RNTN), which uses a
tensor-based composition function to combine the representations of sub-sentences.
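The core recursive composition can be sketched as follows (a simplified W[left; right] composition rather than the full RNTN tensor; the dimensions, embeddings, and the tiny example tree are assumptions for illustration):

# Recursively compute a vector for every node of a binary parse tree.
import torch
import torch.nn as nn

d = 50                                              # dimensionality of node vectors
W = nn.Linear(2 * d, d)                             # composition weights (learned in practice)
word_vec = nn.Embedding(10000, d)                   # hypothetical word embeddings

def node_vector(tree):
    # tree is either a word index (leaf) or a (left, right) pair of subtrees.
    if isinstance(tree, int):                       # leaf: look up the word vector
        return word_vec(torch.tensor([tree])).squeeze(0)
    left, right = tree
    children = torch.cat([node_vector(left), node_vector(right)])
    return torch.tanh(W(children))                  # parent vector from its two children

# ("not", ("very", "good")) written as word indices -> a single phrase vector
sentence_vec = node_vector((1, (2, 3)))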
Here are some examples of how RecNNs can be used in parsing and sentiment analysis:
1. Parsing: A recursive-network variant of the Stanford Parser (the compositional vector
grammar parser) uses RecNNs to score candidate parse trees for English sentences. It is
trained on the Penn Treebank dataset, which contains annotated parse trees for a large
corpus of sentences.
2. Sentiment analysis: The Recursive Neural Tensor Network (RNTN) has been used for
sentiment analysis on various datasets, including the Stanford Sentiment Treebank and
the Movie Review dataset, where it achieved state-of-the-art performance at the time of its
introduction.
In summary, RecNNs are a powerful tool for processing tree-structured data in NLP
tasks, and they have been successfully applied to parsing and sentiment analysis. They
can be trained on annotated datasets to learn to generate parse trees or classify the
sentiment of sentences, and they have achieved state-of-the-art performance on various
benchmarks.
8 Explain the concept of sentence classification, and discuss how it can be achieved 10M
using convolutional neural networks.
ANS:
Sentence classification is the task of assigning a label or category to a given sentence
based on its content. This task is commonly used in natural language processing
applications such as sentiment analysis, topic classification, and spam detection.
Convolutional neural networks (CNNs) have been shown to be effective for sentence
classification tasks. The key idea behind using CNNs for sentence classification is to treat
the sentence as a one-dimensional signal, with each word in the sentence being a single
element of the signal. This allows us to use the same convolutional filters across the
entire sentence, capturing local patterns and features in the sentence.
One popular variant of the basic CNN architecture for sentence classification is the Kim
model, named after its creator, Yoon Kim. The Kim model uses multiple filters of
different sizes, capturing both local and global features of the sentence. It also uses
dropout regularization to prevent overfitting.
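A minimal PyTorch sketch of this Kim-style classifier is given below; the vocabulary size, filter counts and widths, and the two-class output are illustrative assumptions:

# CNN sentence classifier: parallel convolutions over word embeddings, max-pooled over time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=100, n_filters=64, widths=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, w) for w in widths])    # one conv per filter width
        self.drop = nn.Dropout(0.5)                                   # dropout regularization
        self.fc = nn.Linear(n_filters * len(widths), num_classes)

    def forward(self, tokens):                           # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)           # (batch, embed_dim, seq_len) for Conv1d
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]   # max over time
        return self.fc(self.drop(torch.cat(pooled, dim=1)))

model = KimCNN(vocab_size=20000, num_classes=2)          # e.g. positive vs. negative sentiment
logits = model(torch.randint(0, 20000, (8, 30)))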
9 Discuss the working principle of dialogue generation with LSTMs, and explain how 10M
LSTMs can be used to generate natural language responses in dialogue systems.
ANS:
Dialogue generation is the task of generating natural language responses in a
conversational setting, typically in response to user input. Long Short-Term Memory
(LSTM) networks are a type of recurrent neural network (RNN) that have been shown to
be effective for generating natural language responses in dialogue systems. In this
answer, we will discuss the working principle of dialogue generation with LSTMs and
explain how LSTMs can be used to generate natural language responses in dialogue
systems.
LSTMs are a type of RNN that can capture long-term dependencies in sequential data,
making them well-suited for natural language processing tasks such as dialogue
generation. The basic idea behind LSTMs is to use a set of learnable gates to control the
flow of information through the network, allowing it to selectively remember or forget
information at each time step. The key components of an LSTM network include an input
gate, a forget gate, a memory cell, and an output gate.
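The sketch below shows one LSTM time step in NumPy, with the gates written out explicitly; the random weight matrices stand in for learned parameters, and biases are omitted for brevity:

# One LSTM time step: the gates control what is written to, erased from, and read out of the cell.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h = 10, 20                                      # input and hidden sizes (illustrative)
rng = np.random.default_rng(0)
Wi, Wf, Wo, Wc = (rng.normal(size=(d_h, d_in + d_h)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])                     # combined input and previous hidden state
    i = sigmoid(Wi @ z)                                 # input gate: what to write
    f = sigmoid(Wf @ z)                                 # forget gate: what to erase
    o = sigmoid(Wo @ z)                                 # output gate: what to reveal
    c = f * c_prev + i * np.tanh(Wc @ z)                # updated memory cell
    h = o * np.tanh(c)                                  # new hidden state / output
    return h, c

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h))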
In a dialogue system, LSTMs are typically arranged in a sequence-to-sequence (seq2seq)
architecture: an encoder LSTM reads the user input and summarizes it into a fixed-length
vector, and a decoder LSTM generates the response one word at a time, conditioned on that
vector. During training, the model is trained to maximize the likelihood of generating the
correct response given the user input. This is done by minimizing the cross-entropy loss
between the predicted and actual responses. During inference, the model uses beam search
or sampling to generate a response based on the learned distribution of responses.
However, there are also some challenges associated with LSTM-based dialogue
generation. For example, the generated responses may lack coherence or may be
inappropriate or offensive. To address these issues, researchers have proposed various
techniques such as reinforcement learning, adversarial training, and human-in-the-loop
approaches.
In conclusion, LSTMs are an effective tool for generating natural language responses in
dialogue systems. By using an LSTM-based seq2seq model, we can encode the user input
into a fixed-length vector and use it to generate a response. Although there are some
challenges associated with LSTM-based dialogue generation, it has the potential to
improve the user experience in conversational settings and has applications in various
domains.
10 Discuss the use of deep reinforcement learning in robotics applications. Explain the 10M
working principle of deep reinforcement learning, and provide examples of its
applications in robotics.
ANS:
Deep reinforcement learning (DRL) is a subfield of machine learning that combines the
principles of deep learning and reinforcement learning to enable artificial agents to learn
and adapt to complex environments through trial and error. In robotics applications, DRL
can be used to train robots to perform complex tasks, such as object manipulation,
navigation, and assembly, without explicit programming.
The working principle of DRL is based on the concept of reinforcement learning, where
an agent learns to maximize a reward signal by taking actions in an environment. In
DRL, the agent is a deep neural network that takes input from the environment, such as
sensor data or images, and produces actions based on its learned policy. The policy is
optimized by backpropagating the gradients of an objective derived from the reward signal
(for example, a policy-gradient loss), where the reward is obtained by evaluating the
performance of the agent in the environment.
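The sketch below illustrates this loop with a REINFORCE-style policy-gradient update on a hypothetical one-dimensional "reach the target" environment; in a real robotics setting the toy environment would be replaced by sensor observations and motor commands:

# A neural policy trained by policy gradient on a toy environment (illustrative only).
import torch
import torch.nn as nn

class ToyEnv:
    # The agent sits at a 1-D position and should move towards 0; reward grows as it gets closer.
    def reset(self):
        self.pos = torch.randn(1)
        return self.pos
    def step(self, action):                       # action 0 = move left, 1 = move right
        self.pos = self.pos + (0.1 if action == 1 else -0.1)
        return self.pos, float(-abs(self.pos))    # next state, reward

policy = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))   # state -> action logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
env = ToyEnv()

for episode in range(200):
    state, log_probs, rewards = env.reset(), [], []
    for t in range(20):                           # roll out one episode by trial and error
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward = env.step(action.item())
        rewards.append(reward)
    episode_return = sum(rewards)                 # undiscounted return, for brevity
    loss = -episode_return * torch.stack(log_probs).sum()   # REINFORCE: reinforce by return
    opt.zero_grad()
    loss.backward()
    opt.step()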
One example of DRL in robotics is the training of robots to grasp objects. In this
application, the robot arm is equipped with sensors that provide it with information about
the position and orientation of the objects. The DRL agent, which is implemented as a
deep neural network, takes this information as input and outputs a set of joint angles that
move the robot arm towards the target object. The reward signal is based on the success
of the grasp, which is determined by the force sensors attached to the robot arm.
Another example of DRL in robotics is the training of autonomous robots for navigation.
In this application, the robot is equipped with sensors that provide it with information
about the environment, such as lidar or camera data. The DRL agent takes this
information as input and outputs a set of actions, such as turning, moving forward, or
stopping. The reward signal is based on the success of the navigation task, which is
determined by the distance travelled or the time taken to reach the goal.
In conclusion, DRL has shown great potential in robotics applications, enabling robots to
learn and adapt to complex environments through trial and error. By combining the
principles of deep learning and reinforcement learning, DRL can help to overcome the
limitations of traditional programming approaches and enable the development of more
flexible and intelligent robots.