III-II CSM (Ar 20) DL 5 Units Question Answers
Introduction to Deep Learning – deep learning – why deep learning – difference between deep
learning and machine learning – Neural Networks basics – Gradient Descent, Derivatives, Computation
graph, Vectorization, Vectorizing logistic regression – Shallow neural networks: Activation functions,
non-linear activation functions, Back propagation, Data classification with a hidden layer – Linear
Classifiers, Linear Machines with Hinge Loss – Optimization Techniques – Batch Optimization.
1. Better performance: Deep learning algorithms have been shown to achieve state-of-the-
art performance on a variety of tasks, including image and speech recognition, natural
language processing, and playing games like Go and chess. This has led to significant
improvements in areas such as healthcare, finance, and transportation.
2. Scalability: Deep learning algorithms are highly scalable and can handle large datasets
with ease. This has enabled the development of AI systems that can process and analyze
vast amounts of data in real-time.
3. Autonomous learning: Deep learning algorithms are designed to learn and improve on
their own, without the need for human intervention. This allows AI systems to adapt and
improve over time as they encounter new data.
4. Flexibility: Deep learning algorithms can be applied to a wide range of applications and
domains, from self-driving cars to drug discovery. This flexibility has led to the
development of new and innovative AI applications that were previously impossible.
Overall, deep learning has revolutionized the field of AI by enabling machines to learn
from data in a way that is similar to how humans learn. This has led to significant
advances in a wide range of applications and has the potential to transform many
industries in the future.
2 What is the difference between deep learning and machine learning? Provide examples to illustrate your answer. (4M)
ANS: Deep learning is a subset of machine learning, and the main difference between the
two is the complexity of the models that they use to learn from data. Machine learning
algorithms typically use simpler models and focus on finding patterns in data that can be
used to make predictions or decisions. Deep learning algorithms, on the other hand, use
more complex models that are composed of multiple layers of artificial neurons. These
neural networks are designed to learn hierarchical representations of data, which can lead
to more accurate predictions and better performance on complex tasks.
To illustrate the difference between deep learning and machine learning, consider a few
examples. For filtering spam e-mail, a classical machine learning approach might use
logistic regression or a decision tree on hand-engineered features such as word counts.
For recognizing objects in images or transcribing speech, a deep learning approach uses a
multi-layer neural network (for example, a convolutional or recurrent network) that learns
the relevant features directly from the raw pixels or audio, rather than relying on
features designed by hand.
In summary, while machine learning algorithms focus on simpler models and patterns,
deep learning algorithms use more complex models that allow them to learn more
abstract and hierarchical representations of data. As a result, deep learning is typically
used in applications where the data is complex and the task requires a higher level of
accuracy and performance.
Computation graphs organize the computation of a specific function. Consider the below
example:
J(a,b,c) = 3(a+bc)
We have to calculate J given a, b, and c. We can divide this into three steps:
1. u = bc
2. v = a+u
3. J = 3v
This is the forward propagation step where we have calculated the output, i.e., J. We can
also use computation graphs for backward propagation where we update the parameters,
a,b and c in the above example.
Now let's see how we can calculate derivatives with the help of a computation graph.
Suppose we have to calculate dJ/da. The steps will be:
1. Since J = 3v, dJ/dv = 3.
2. Since v = a + u, dv/da = 1.
3. By the chain rule, dJ/da = (dJ/dv)·(dv/da) = 3 × 1 = 3.
Now we will take the concept of computation graphs and gradient descent together and
see how the parameters of logistic regression can be updated.
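As a small illustration, the forward and backward passes over this graph can be written in a few lines of Python (the input values are arbitrary and chosen only for the sketch):

# Forward pass through the computation graph J(a, b, c) = 3(a + b*c)
a, b, c = 5.0, 3.0, 2.0        # arbitrary example values
u = b * c                      # step 1
v = a + u                      # step 2
J = 3 * v                      # step 3 -> J = 33

# Backward pass: apply the chain rule from J back to the inputs
dJ_dv = 3.0                    # J = 3v
dv_da, dv_du = 1.0, 1.0        # v = a + u
du_db, du_dc = c, b            # u = b*c
dJ_da = dJ_dv * dv_da          # = 3
dJ_db = dJ_dv * dv_du * du_db  # = 3*c = 6
dJ_dc = dJ_dv * dv_du * du_dc  # = 3*b = 9
print(J, dJ_da, dJ_db, dJ_dc)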
4 What is vectorization in deep learning, and how does it help improve the efficiency of computation? (4M)
ANS: Vectorization is a technique used in deep learning to perform mathematical
operations on entire arrays or matrices of data at once, rather than looping over each
element of the array one at a time. This can greatly improve the efficiency of
computation, especially when dealing with large datasets and complex models.
Without vectorization, the same operation has to be applied to one element at a time inside
explicit loops. Vectorization, in contrast, allows us to perform the operation on the entire
matrix at once, using optimized linear algebra libraries that can take advantage of hardware
acceleration (such as GPUs) to perform the computation in parallel. This can be orders of
magnitude faster than using loops.
Here's an example of how vectorization can be used to improve the efficiency of a neural
network:
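The following is a minimal NumPy sketch of the two approaches; the function names, shapes, and random data are illustrative assumptions:

import numpy as np

def forward_loops(X, W):
    # Non-vectorized: nested loops over examples, output units and features
    m, n = X.shape             # m examples, n features
    k = W.shape[1]             # k output units; W has shape (n, k)
    Y = np.zeros((m, k))
    for i in range(m):
        for j in range(k):
            for p in range(n):
                Y[i, j] += X[i, p] * W[p, j]
    return Y

def forward_vectorized(X, W):
    # Vectorized: one matrix multiplication over the whole batch
    Y = X.dot(W)
    return Y

X = np.random.randn(1000, 100)
W = np.random.randn(100, 10)
assert np.allclose(forward_loops(X, W), forward_vectorized(X, W))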
In the non-vectorized implementation, we use nested loops to iterate over each element of
the input matrix X and weight matrix W, and compute the dot product of each pair of
rows and columns. In the vectorized implementation, we use the dot product operator to
compute the dot product of the entire matrix X with the entire matrix W at once, without
the need for loops.
5 What are the activation functions used in neural networks, and how do they differ from each other? (4M)
ANS: Activation functions are mathematical functions that are applied to the output of a
neuron in a neural network. They are used to introduce non-linearity into the network,
allowing it to learn complex patterns and relationships in the data. There are several
activation functions that are commonly used in neural networks, each with its own
strengths and weaknesses. Some of the most common activation functions are:
1. Sigmoid function: The sigmoid function is defined as f(x) = 1 / (1 + exp(-x)). It maps any
input value to a range between 0 and 1, making it useful for binary classification
problems where the output is a probability estimate. However, the sigmoid function can
suffer from the problem of vanishing gradients, where the gradients of the function
become very small as the input values become very large or very small.
2. ReLU function: The ReLU (Rectified Linear Unit) function is defined as f(x) = max(0,
x). It returns the input value if it is positive, and 0 otherwise. The ReLU function is
simple to compute and has been found to work well in practice, especially in deep neural
networks. However, it can suffer from the problem of dead neurons, where the neurons
can become permanently inactive if their input values are always negative.
3. Leaky ReLU function: The Leaky ReLU function is a variant of the ReLU function that
is defined as f(x) = max(0.1x, x). It is similar to the ReLU function, but introduces a
small slope for negative input values, which can help to prevent the problem of dead
neurons.
4. Tanh function: The Tanh (hyperbolic tangent) function is defined as f(x) = (exp(x) - exp(-
x)) / (exp(x) + exp(-x)). It maps any input value to a range between -1 and 1, making it
useful for classification problems with outputs that can take negative values. However,
like the sigmoid function, it can suffer from the problem of vanishing gradients.
5. Softmax function: The Softmax function is a generalization of the sigmoid function that
is used for multi-class classification problems. It maps any input vector to a probability
distribution over multiple classes, ensuring that the sum of the probabilities adds up to 1.
It is defined as f(x_i) = exp(x_i) / sum(exp(x_j)), where i ranges over all the classes and j
ranges over all the inputs.
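These functions are straightforward to express in NumPy. The sketch below follows the definitions given above (the leaky-ReLU slope of 0.1 and the test vector are illustrative):

import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def relu(x):       return np.maximum(0.0, x)
def leaky_relu(x): return np.maximum(0.1 * x, x)
def tanh(x):       return np.tanh(x)
def softmax(x):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), leaky_relu(z), tanh(z), softmax(z))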
Overall, the choice of activation function depends on the specific requirements of the
problem and the characteristics of the data. It is often necessary to experiment with
different activation functions and architectures to find the optimal combination for a
given task.
6 Explain the process of gradient descent and its role in optimizing the weights and biases of a neural network. How does it help in improving the accuracy of a deep learning model? (10M)
ANS: Gradient descent is an optimization algorithm used to find the optimal values for
the weights and biases in a neural network. The goal of the algorithm is to minimize the
loss function, which measures the difference between the predicted outputs of the
network and the actual outputs.
The gradient descent algorithm works by computing the gradient of the loss function with
respect to each weight and bias in the network. The gradient indicates the direction of the
steepest increase in the loss function, so by moving in the opposite direction, we can
move towards the minimum of the loss function. This process is repeated iteratively, with
each iteration updating the values of the weights and biases to move towards the
minimum of the loss function.
How does Gradient Descent work?
Before looking at the working principle of gradient descent, it helps to recall how the slope
of a line appears in simple linear regression. The equation for simple linear regression is:
Y = mX + c
Where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.
The starting point is chosen arbitrarily and is used only to evaluate the initial
performance. At this starting point we compute the first derivative (the slope) of the cost
function; the tangent line at that point tells us how steep the slope is, and this slope
informs the updates to the parameters (weights and bias).
The slope is steep at the arbitrary starting point, but as new parameter values are
generated the steepness gradually reduces, until the lowest point of the curve is
approached; this lowest point is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, i.e., the error
between the expected and actual output. To minimize the cost function, two factors are
required: the direction of steepest descent (given by the gradient) and the learning rate
(the step size). These two factors determine the partial-derivative calculation for the next
iteration and carry the parameters towards the point of convergence, i.e., a local or
global minimum. Let's discuss the learning rate in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum (lowest point). It is typically a
small value, and it is evaluated and updated based on the behavior of the cost function. If
the learning rate is high, the steps are larger, but there is a risk of overshooting the
minimum. A low learning rate means small step sizes, which compromises overall efficiency
but gives the advantage of more precision.
There are two main types of gradient descent: batch gradient descent and stochastic
gradient descent. In batch gradient descent, the gradient is computed over the entire
training dataset, while in stochastic gradient descent, the gradient is computed for each
individual training example. Stochastic gradient descent is typically faster and more
efficient for large datasets.
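A minimal sketch of batch gradient descent fitting the line Y = mX + c from the earlier example by minimizing the mean squared error (the synthetic data, learning rate, and number of epochs are illustrative choices):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 2.0 * X + 1.0                      # synthetic data for y = 2x + 1
m, c = 0.0, 0.0                        # arbitrary starting point
lr = 0.01                              # learning rate (step size)

for epoch in range(5000):
    Y_pred = m * X + c
    error = Y_pred - Y
    # gradients of the MSE loss w.r.t. m and c
    dm = 2.0 * np.mean(error * X)
    dc = 2.0 * np.mean(error)
    # move against the gradient (direction of steepest descent)
    m -= lr * dm
    c -= lr * dc

print(m, c)   # approaches 2 and 1 as the loss converges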
Gradient descent plays a crucial role in optimizing the weights and biases of a neural
network, as it allows the network to learn from the data and improve its accuracy. By
iteratively updating the weights and biases to minimize the loss function, the network can
adjust its parameters to better fit the patterns in the data. This can lead to significant
improvements in accuracy, particularly in complex tasks such as image recognition or
natural language processing.
However, gradient descent can also suffer from several problems, such as getting stuck in
local minima or taking a long time to converge to the optimal solution. To address these
issues, various techniques have been developed, such as momentum, learning rate
schedules, and adaptive learning rates, which help to improve the efficiency and stability
of the optimization process.
The backpropagation algorithm works by propagating the error backwards through the
network, from the output layer to the input layer. It does this by computing the partial
derivatives of the loss function with respect to each weight and bias in the network. These
partial derivatives are then used to update the weights and biases, so that the network can
learn from the data and improve its accuracy.
The backpropagation algorithm can be broken down into several steps:
1. Forward propagation: The input is fed forward through the network, and the output is
computed for each neuron in each layer. The output of the final layer is compared to the
target output, and the difference is used to calculate the loss function.
2. Backward propagation: The error is propagated backwards through the network, starting
from the output layer. The partial derivatives of the loss function with respect to each
weight and bias in the network are computed using the chain rule of differentiation.
These partial derivatives are used to update the weights and biases of the network.
3. Weight and bias updates: The weights and biases of the network are updated using the
partial derivatives computed during the backward propagation step. This update is
performed using an optimization algorithm such as gradient descent, which moves the
weights and biases towards the direction of the minimum of the loss function.
4. Repeat: Steps 1-3 are repeated for each batch of training examples, until the loss function
converges to a minimum.
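The four steps above can be sketched for a tiny one-hidden-layer network with sigmoid activations and a mean-squared-error loss; the shapes, random data, and learning rate are illustrative assumptions, not a prescribed architecture:

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # 8 examples, 3 features
T = rng.random((8, 1))                   # target outputs in [0, 1]
W1, b1 = rng.standard_normal((3, 4)) * 0.5, np.zeros((1, 4))
W2, b2 = rng.standard_normal((4, 1)) * 0.5, np.zeros((1, 1))
lr = 0.1

for step in range(1000):
    # 1. Forward propagation
    H = sigmoid(X @ W1 + b1)             # hidden layer
    Y = sigmoid(H @ W2 + b2)             # output layer
    loss = np.mean((Y - T) ** 2)
    # 2. Backward propagation (chain rule)
    dY = 2 * (Y - T) / len(X)            # dLoss/dY
    dZ2 = dY * Y * (1 - Y)               # through the output sigmoid
    dW2, db2 = H.T @ dZ2, dZ2.sum(0, keepdims=True)
    dH = dZ2 @ W2.T
    dZ1 = dH * H * (1 - H)               # through the hidden sigmoid
    dW1, db1 = X.T @ dZ1, dZ1.sum(0, keepdims=True)
    # 3. Weight and bias updates (gradient descent)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(loss)                              # decreases over the iterations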
Linear classifiers work by computing a weighted sum of the input features, and applying
a threshold to determine the predicted class. In the case of binary classification, where
there are only two possible classes, the predicted class is determined by whether the
weighted sum of the features is greater or less than a certain threshold. In the case of
multi-class classification, where there are more than two possible classes, the predicted
class is determined by the class with the highest weighted sum of features.
The weights in a linear classifier are typically learned through an optimization process,
such as gradient descent. During training, the weights are adjusted to minimize the loss
function, which measures the difference between the predicted output and the actual
output.
One of the main limitations of linear classifiers is that they are only able to classify data
that can be separated by a linear boundary. This means that they may not be suitable for
complex classification tasks, where the decision boundary between different classes may
be non-linear. In addition, linear classifiers may not be robust to noisy data, as they are
sensitive to outliers and may produce inaccurate predictions in the presence of noise.
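A sketch of a binary linear classifier trained with the hinge loss (the "Linear Machines with Hinge Loss" topic from the syllabus); the synthetic data, labels in {-1, +1}, and hyperparameters are illustrative:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # linearly separable labels in {-1, +1}

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(100):
    for xi, yi in zip(X, y):
        margin = yi * (w @ xi + b)
        # hinge loss max(0, 1 - margin): update only on margin violations
        if margin < 1:
            w += lr * yi * xi
            b += lr * yi

pred = np.where(X @ w + b > 0, 1, -1)        # threshold the weighted sum
print((pred == y).mean())                    # training accuracy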
9 What are the optimization techniques used in deep learning? Discuss their advantages and disadvantages, with examples. (10M)
ANS: In deep learning, we have the concept of loss, which tells us how poorly the model
is performing at the current instant. We need to use this loss to train our network so
that it performs better. Essentially, we take the loss and try to minimize it, because a
lower loss means our model will perform better. The process of minimizing (or maximizing)
such a mathematical expression is called optimization.
Optimizers are algorithms or methods used to change the attributes of the neural network,
such as the weights and the learning rate, in order to reduce the losses. In other words,
optimizers are used to solve optimization problems by minimizing the loss function.
For a useful mental model, you can think of a hiker trying to get down a mountain with a
blindfold on. It’s impossible to know which direction to go in, but there’s one thing she
can know: if she’s going down (making progress) or going up (losing progress).
Eventually, if she keeps taking steps that lead her downwards, she’ll reach the base.
Similarly, it’s impossible to know what your model’s weights should be right from the
start. But with some trial and error based on the loss function (whether the hiker is
descending), you can end up getting there eventually.
How you should change the weights or learning rate of your neural network to reduce the
losses is defined by the optimizer you use. Optimization algorithms are responsible for
reducing the losses and for providing the most accurate results possible.
Various optimizers have been proposed over the last few years, each having its advantages
and disadvantages. Below, we look at the different types of optimizers and how they work
to minimize the loss function:
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Gradient Descent (MB-SGD)
4. SGD with Momentum
5. AdaGrad
6. RMSprop
7. Adam
Gradient Descent
Gradient descent is the most basic optimization technique used while training a deep
learning model. It is based on a convex function and tweaks its parameters iteratively to
minimize a given cost function to its local minimum. In other words, it is used to find the
values of a function's parameters (coefficients) that minimize the cost function, and it
uses calculus to iteratively adjust those values.
The weights are initialized using some initialization strategy and are updated with each
epoch according to the update rule:
θ = θ − α⋅∂J(θ)/∂θ
The above equation computes the gradient of the cost function J(θ) w.r.t. the parameters θ
and takes a step of size α in the opposite direction. Our aim is to get to the bottom of the
graph (cost vs. weights), i.e., to a point where the cost can no longer be reduced.
"A gradient measures how much the output of a function changes if you change the inputs a
little bit." How far we move at each step is determined by the learning rate, which decides
how fast or slow we move towards the optimal weights.
For gradient descent to reach the local minimum we must set the learning rate to an
appropriate value, which is neither too low nor too high. This is important because if the
steps it takes are too big, it may never reach the local minimum: it keeps bouncing back and
forth across the convex cost function. If we set the learning rate to a very small value,
gradient descent will eventually reach the local minimum, but it may take a very long time.
So, the learning rate should never be too high or too low for this reason.
For a pre-defined number of epochs, we first compute the gradient vector params_grad of
the loss function for the whole dataset w.r.t. our parameter vector params, and then update
params in the opposite direction of the gradient.
Advantages:
1. Easy computation.
2. Easy to implement.
3. Easy to understand.
Disadvantages:
1. May get trapped at local minima.
2. Weights are changed only after the gradient has been calculated on the whole dataset, so
if the dataset is too large it may take a very long time to converge to the minima.
3. Requires a large amount of memory to compute the gradient over the whole dataset.
Stochastic Gradient Descent (SGD)
The SGD algorithm is an extension of Gradient Descent that overcomes some of its
disadvantages. The GD algorithm requires a lot of memory to load the entire dataset of
n points at a time in order to compute the derivative of the loss function. In the SGD
algorithm, the derivative is computed taking one point at a time.
SGD performs a parameter update for each training example x(i) and label y(i):
θ = θ − α⋅∂(J(θ;x(i),y(i)))/∂θ
To make the training even faster we take a Gradient Descent step for each training
example. Let's see what the implications would be in the image below.
1. On the left, we have Stochastic Gradient Descent (where m = 1 per step): we take a
gradient descent step for each individual example. On the right is (batch) Gradient Descent,
which takes one step per pass over the entire training set.
2. SGD seems to be quite noisy; at the same time it is much faster, but it may not converge
exactly to a minimum.
3. Typically, to get the best of both worlds, we use Mini-batch gradient descent (MGD),
which looks at a smaller number of training-set examples at once. It is smoother than
Stochastic Gradient Descent (SGD) but does have oscillations, as gradient steps are taken in
the direction of a sample of the training set and not the entire set as in BGD.
It is observed that in SGD the updates take more number iterations compared to gradient
descent to reach minima. On the right, the Gradient Descent takes fewer steps to reach
minima but the SGD algorithm is noisier and takes more iterations.
Advantage:
1. The memory requirement is less compared to the GD algorithm, since the derivative is
computed one point at a time.
Disadvantages:
1. The updates are noisy, so SGD may keep oscillating around the minimum and needs more
iterations to reach it.
2. It loses the speed-up from vectorization, since only one example is processed at a time.
Mini-Batch Gradient Descent (MB-SGD)
The MB-SGD algorithm is an extension of the SGD algorithm, and it overcomes the problem of
the large time complexity of SGD. MB-SGD takes a batch (a subset) of points from the dataset
to compute the derivative of the loss function; after some number of iterations, this
derivative is almost the same as the derivative computed by GD, but the number of iterations
needed to achieve the minima is larger for MB-SGD than for GD, while the cost of each update
is smaller.
The update of the weights depends on the derivative of the loss for a batch of points. The
updates in the case of MB-SGD are noisier, because the derivative does not always point
towards the minima.
MB-SGD divides the dataset into various batches and after every batch, the parameters
are updated.
θ = θ − α⋅∂(J(θ;B(i)))/∂θ
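A sketch of the mini-batch update loop implied by the equation above; the model, loss, batch size, and the helper grad_fn (any function returning ∂J/∂θ on a batch) are illustrative assumptions:

import numpy as np

def minibatch_sgd(theta, X, Y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    n = len(X)
    for epoch in range(epochs):
        idx = np.random.permutation(n)          # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # θ = θ − α · ∂J(θ; B(i)) / ∂θ
            theta = theta - lr * grad_fn(theta, X[batch], Y[batch])
    return theta

# Example: least-squares gradient for a linear model Y ≈ X·θ
grad_fn = lambda th, Xb, Yb: 2 * Xb.T @ (Xb @ th - Yb) / len(Xb)
X = np.random.randn(500, 3); true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta
print(minibatch_sgd(np.zeros(3), X, Y, grad_fn, epochs=50))   # ≈ true_theta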
Advantages:
1. Less time complexity to converge compared to standard SGD, and less memory required than
batch GD.
Disadvantages:
1. The update of MB-SGD is noisier compared to the update of the GD algorithm.
2. It takes more iterations to converge than GD and may get stuck at local minima.
SGD with Momentum
SGD with momentum overcomes this disadvantage by denoising the gradients. Updates of the
weights depend on a noisy derivative, and if we can somehow denoise the derivatives, the
time to converge decreases. The idea is to denoise the derivative using an exponentially
weighted average, that is, to give more weightage to recent updates than to older ones.
It accelerates the convergence towards the relevant direction and reduces the fluctuation
in the irrelevant direction. One more hyperparameter is used in this method, known as
momentum, usually denoted by γ.
Momentum at time 't' is computed using all previous updates, giving more weightage to
recent updates compared to older ones. This speeds up the convergence.
Essentially, when using momentum, we push a ball down a hill. The ball accumulates
momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its
terminal velocity if there is air resistance, i.e. γ<1). The same thing happens to our
parameter updates: The momentum term increases for dimensions whose gradients point
in the same directions and reduces updates for dimensions whose gradients change direction.
As a result, we gain faster convergence and reduced oscillation.
Advantages:
1. Converges faster than plain SGD and dampens the oscillations in the updates.
Disadvantages:
1. One more hyperparameter (the momentum coefficient γ) has to be chosen manually and tuned.
AdaGrad (Adaptive Gradient)
For all the previously discussed algorithms, the learning rate remains constant. The key
idea of AdaGrad is to have an adaptive learning rate for each of the weights.
It performs smaller updates for parameters associated with frequently occurring features,
and larger updates for parameters associated with infrequently occurring features.
For brevity, we use gt to denote the gradient at time step t, and gt,i the partial
derivative of the objective function w.r.t. the parameter θi at time step t; η is the
initial learning rate, and a small constant ε in the denominator avoids division by zero.
One benefit of AdaGrad is that it eliminates the need to manually tune the learning rate;
most implementations simply leave it at a default value.
Its main weakness is the accumulation of the squared gradients (Gt) in the denominator.
Since every added term is positive, the accumulated sum keeps growing during training,
causing the learning rate to shrink and become infinitesimally small, until the algorithm
is effectively unable to learn any further.
Advantage:
No need to update the learning rate manually as it changes adaptively with iterations.
Disadvantage:
As the number of iterations becomes very large, the learning rate decreases to a very small
value, which leads to slow convergence.
RMSprop
RMSprop and Adadelta were both developed independently around the same time, stemming from
the need to resolve AdaGrad's radically diminishing learning rates; RMSprop is in fact
identical to the first update vector of Adadelta. RMSprop likewise divides the learning rate
by an exponentially decaying average of squared gradients. Hinton suggests γ be set to 0.9,
while a good default value for the learning rate η is 0.001. (Adam, discussed next, can be
viewed as RMSprop combined with momentum.)
Adam
Adam computes adaptive learning rates for each parameter. In addition to storing an
exponentially decaying average of past squared gradients vt like Adadelta and RMSprop,
Adam also keeps an exponentially decaying average of past gradients mt, similar to
momentum. Whereas momentum can be seen as a ball running down a slope, Adam
behaves like a heavy ball with friction, which thus prefers flat minima in the error
surface.
Hyper-parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving
averages. mt and vt are estimates of the first moment (the mean) and the second moment (the
uncentered variance) of the gradients respectively, hence the name of the method (Adaptive
Moment Estimation).
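The update rules discussed above can be summarized in a few lines of NumPy. This is a sketch: grad stands for the gradient of the loss at the current weights w, and the hyperparameter defaults are common choices rather than prescribed values:

import numpy as np

def sgd_momentum(w, grad, v, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad                  # accumulate past gradients
    return w - v, v

def adagrad(w, grad, G, lr=0.01, eps=1e-8):
    G = G + grad ** 2                          # accumulated squared gradients
    return w - lr * grad / (np.sqrt(G) + eps), G

def rmsprop(w, grad, Eg2, lr=0.001, rho=0.9, eps=1e-8):
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2    # exponentially decaying average
    return w - lr * grad / (np.sqrt(Eg2) + eps), Eg2

def adam(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # t is the time step starting at 1 (needed for bias correction)
    m = b1 * m + (1 - b1) * grad               # first moment (mean)
    v = b2 * v + (1 - b2) * grad ** 2          # second moment (uncentered variance)
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v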
10 Compare and contrast shallow neural networks and deep neural networks in terms of their architecture, performance, and training. What are the factors that influence their effectiveness? (10M)
ANS:
Shallow neural networks typically have one or two hidden layers, while deep neural
networks have more than two hidden layers. Here are the comparisons and contrasts
between the two types of neural networks:
1. Architecture: Shallow neural networks have a simple architecture with fewer layers,
while deep neural networks have a complex architecture with multiple layers. Deep
neural networks can extract hierarchical features from data, which enables them to
capture more complex patterns.
2. Performance: Deep neural networks generally outperform shallow neural networks in
tasks such as image recognition, speech recognition, and natural language processing.
This is because deep neural networks can learn more abstract and complex features than
shallow neural networks.
3. Training: Deep neural networks require more computational resources and longer training
time than shallow neural networks. This is because the training process involves
optimizing the parameters of multiple layers, which requires more iterations and more
memory. However, deep neural networks can be trained using techniques such as transfer
learning and pretraining, which can reduce the training time and improve the
performance.
Hence, deep neural networks are generally more effective than shallow neural networks
due to their ability to extract more abstract and complex features. However, the choice of
architecture, training data, optimization algorithm, and regularization techniques can
significantly impact the performance of both shallow and deep neural networks.
UNIT-II
Single-layer perceptron is a type of artificial neural network that consists of a single layer
of perceptrons. In contrast, the multilayer perceptron (MLP) is an artificial neural
network that consists of multiple layers of perceptrons.
The main differences stem from the limitations of the single-layer perceptron, which are as
follows:
1. Limited to linearly separable problems: Perceptrons can only classify data that is linearly
separable. If the data is not linearly separable, then the perceptron will not be able to
classify it correctly.
2. Inability to learn complex patterns: Perceptrons are not capable of learning complex
patterns that require multiple layers of processing. They can only learn simple patterns
that can be represented by a linear combination of the input features.
3. Sensitivity to input data: Perceptrons are highly sensitive to input data. If the input data is
noisy or contains outliers, the perceptron may produce inaccurate results.
4. Lack of probabilistic outputs: Perceptrons produce binary outputs, which means they
cannot estimate the probability of a given input belonging to a particular class.
5. Requirement for labeled data: Perceptrons require labeled data for training, which can be
time-consuming and costly to obtain in some cases.
6. Difficulty in handling non-stationary data: Perceptrons are not designed to handle non-
stationary data, where the statistical properties of the input data change over time. They
require retraining on new data to adapt to the new distribution.
3 What is the XOR problem, and how can it be solved using an artificial neural network? (4M)
ANS: The XOR problem is a classic example of a problem that cannot be solved by a
single-layer perceptron. It involves classifying inputs into two classes based on their
logical XOR (exclusive or) relationship. XOR is a logical operation that returns true only
if both inputs are different; otherwise, it returns false. The XOR problem arises when we
try to classify inputs based on their XOR relationship.
A single-layer perceptron cannot solve the XOR problem because the problem is not
linearly separable. However, an artificial neural network with at least one hidden layer,
such as a multilayer perceptron (MLP), can solve the XOR problem.
The MLP can solve the XOR problem by learning to represent the data in a higher-
dimensional space where the data becomes linearly separable. The hidden layer of the
MLP acts as a feature extractor, transforming the input data into a higher-dimensional
space, where a linear classifier can separate the data. The output layer of the MLP then
performs the classification based on the transformed data.
In the case of the XOR problem, the MLP with one hidden layer can solve the problem
by learning to represent the data in a 2D space where a linear classifier can separate the
data. The hidden layer can have two neurons, and the output layer can have one neuron,
representing the binary classification problem. The activation function used in the hidden
layer can be a non-linear function such as the sigmoid function or the hyperbolic tangent
function, allowing the MLP to learn the non-linear relationship between the input
features. By training the MLP using backpropagation, it can learn to classify the inputs
correctly, even when the inputs have a non-linear relationship such as the XOR problem.
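A small sketch makes this concrete. Here the hidden-layer weights are chosen by hand rather than learned (a deliberate simplification): two hidden units compute OR and NAND, which makes XOR linearly separable for the output unit:

import numpy as np

def step(z):                 # hard-threshold activation
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: h1 = OR(x1, x2), h2 = NAND(x1, x2)
W1 = np.array([[1.0, -1.0],
               [1.0, -1.0]])
b1 = np.array([-0.5, 1.5])
# Output layer: y = AND(h1, h2) = XOR(x1, x2)
W2 = np.array([1.0, 1.0])
b2 = -1.5

H = step(X @ W1 + b1)
Y = step(H @ W2 + b2)
print(Y)                     # [0 1 1 0]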
4 What are Radial Basis Function Networks, and how do they differ from other types of artificial neural networks? (4M)
ANS:
Radial Basis Function (RBF) Networks are a type of artificial neural network that uses
radial basis functions as activation functions. They differ from other types of artificial
neural networks, such as multilayer perceptrons (MLPs) and convolutional neural
networks (CNNs), in their architecture and training method.
The architecture of an RBF network typically consists of three layers: an input layer, a
hidden layer, and an output layer. The input layer receives the input data, which is then
passed through the hidden layer, where the radial basis functions are applied. The output
of the hidden layer is then passed through the output layer, which produces the final
output of the network. The hidden layer in an RBF network is often called the radial basis
layer, as it uses radial basis functions as activation functions.
Unlike MLPs and CNNs, RBF networks do not use backpropagation to train the network.
Instead, RBF networks use a two-step training method called clustering and least squares
estimation. In the clustering step, the input data is clustered using a clustering algorithm,
such as k-means clustering, to find representative centers for the radial basis functions. In
the least squares estimation step, the weights of the output layer are calculated using
linear regression to approximate the target outputs.
RBF networks have several advantages over other types of artificial neural networks.
First, they are good at handling non-linear problems, as the radial basis functions can
model non-linear relationships between the input features. Second, they require less
training data than MLPs and CNNs, as the clustering step reduces the number of
parameters that need to be trained. Third, they can handle missing or noisy data, as the
clustering step is robust to missing or noisy data. Finally, RBF networks can be used for
both regression and classification tasks.
However, RBF networks also have some limitations. They can be computationally
expensive, as the clustering step requires the input data to be processed multiple times.
They are also not well-suited for handling high-dimensional data, as the number of radial
basis functions required to model the data can become prohibitively large.
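A sketch of the two-step training described above, using scikit-learn's KMeans for the clustering step and linear least squares for the output weights; the dataset, number of centers, and width γ are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])                        # a non-linear target function

# Step 1: clustering to find representative centers for the radial basis functions
k, gamma = 10, 1.0
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

def rbf_features(X, centers, gamma):
    # Gaussian radial basis functions around each center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Step 2: least-squares estimation of the output-layer weights
Phi = rbf_features(X, centers, gamma)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = rbf_features(X, centers, gamma) @ w
print(np.mean((y_hat - y) ** 2))           # small training error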
5 What are the differences between Momentum Optimizer, RMSProp, and Adam, in the context of deep neural network training? (4M)
ANS:
Momentum Optimizer, RMSProp, and Adam are all optimization algorithms commonly
used in deep neural network training. While they are all designed to improve the training
speed and convergence of neural networks, they differ in their approaches to achieving
this goal. Here are the key differences between these optimization algorithms:
1. Momentum Optimizer: This algorithm adds a momentum term to the gradient update in
order to accelerate the convergence of the optimization process. The momentum term
takes into account the previous update direction and magnitude, which helps the
algorithm to overcome local minima and plateaus in the optimization landscape.
However, Momentum Optimizer can overshoot the optimum and lead to oscillations
around the optimum.
2. RMSProp: This algorithm adapts the learning rate for each parameter based on the
magnitude of its gradients. It divides the learning rate by an exponentially decaying
average of the squared gradients, which has the effect of reducing the learning rate for
parameters with large gradients and increasing it for parameters with small gradients.
This makes the algorithm more robust to noisy gradients and to changes in the
optimization landscape. However, RMSProp can still converge slowly or get stuck in
saddle points.
3. Adam: This algorithm combines the ideas of Momentum Optimizer and RMSProp. It
uses an exponentially decaying average of past gradients to update the momentum term
and an exponentially decaying average of past squared gradients to adapt the learning
rate. Additionally, Adam introduces bias-correction terms to account for the fact that the
initial estimates of the averages are biased towards zero. This makes Adam more efficient
than Momentum Optimizer and RMSProp individually, as it can overcome the limitations
of both algorithms. However, Adam can still converge slowly or get stuck in saddle
points for certain types of optimization landscapes.
In summary, Momentum Optimizer, RMSProp, and Adam are all powerful optimization
algorithms that can improve the training speed and convergence of deep neural networks.
While they differ in their approaches, they are all designed to address the limitations of
traditional stochastic gradient descent, which can converge slowly or get stuck in
suboptimal local minima. The choice of which algorithm to use depends on the specific
optimization landscape and the requirements of the task at hand.
6 Explain the structure and learning process of a single-layer perceptron, and discuss its limitations as a pattern classifier. (10M)
ANS:
A single-layer perceptron is a type of artificial neural network that is used for binary
classification tasks. It consists of a layer of input neurons, each connected to a single
output neuron. The input neurons receive the features of the input data and apply a
weighted sum of the inputs to the output neuron. The output neuron then applies an
activation function to the weighted sum to produce the final output of the network, which
is a binary decision.
The learning process of a single-layer perceptron involves the adjustment of the weights
of the connections between the input neurons and the output neuron. During training, the
perceptron is presented with a set of input-output pairs, and the weights are adjusted
based on the error between the predicted output and the true output. The weights are
updated using the delta rule, which is a form of gradient descent. The delta rule adjusts
the weights in the direction that reduces the error between the predicted output and the
true output, until the error reaches a minimum.
Despite its simplicity and usefulness, the single-layer perceptron has several limitations
as a pattern classifier. First, it can only classify linearly separable patterns, which limits
its ability to model complex relationships between the input features. Second, it can be
sensitive to the choice of initial weights and the order in which the input patterns are
presented during training, which can affect the convergence and the quality of the
solution. Third, it can suffer from the problem of overfitting, where it memorizes the
training data and fails to generalize to new data. Finally, it cannot handle multi-class
classification tasks, which require the use of more sophisticated models such as
multilayer perceptrons or convolutional neural networks.
In summary, the single-layer perceptron is a simple and useful model for binary
classification tasks, but it has several limitations that make it unsuitable for more
complex tasks. Its learning process involves the adjustment of weights using the delta
rule, and its structure consists of a layer of input neurons connected to a single output
neuron. Despite its limitations, the single-layer perceptron paved the way for the
development of more sophisticated artificial neural networks, which have revolutionized
the field of machine learning.
The perceptron model is also treated as one of the best and simplest types of artificial
neural networks; it is a supervised learning algorithm for binary classifiers.
Step-1
In the first step first, multiply all input values with corresponding weight values and then
add them to determine the weighted sum. Mathematically, we can calculate the weighted
sum as follows:
Add a special term called bias 'b' to this weighted sum to improve the model's
performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted
sum, which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
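The two steps can be written directly in Python (a sketch; the example inputs, weights, bias, and the hard-limit activation are illustrative):

import numpy as np

def perceptron_output(x, w, b):
    weighted_sum = np.dot(w, x) + b        # Step 1: ∑ wi*xi + b
    return 1 if weighted_sum > 0 else 0    # Step 2: hard-limit activation f

x = np.array([1.0, 0.0, 1.0])
w = np.array([0.4, -0.2, 0.7])
b = -0.5
print(perceptron_output(x, w, b))          # 1, since 0.4 + 0.7 - 0.5 > 0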
Based on the layers, Perceptron models are divided into two types. These are as follows:
This is one of the easiest Artificial neural networks (ANN) types. A single-layered
perceptron model consists feed-forward network and also includes a threshold transfer
function inside the model. The main objective of the single-layer perceptron model is to
analyze the linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm has no previously recorded data, so it
begins with randomly (arbitrarily) allocated values for the weight parameters. It then sums
up all the weighted inputs; if the total sum is more than a pre-determined threshold value,
the model is activated and shows the output value as +1.
If the outcome matches the pre-determined (desired) value, the performance of the model is
considered satisfactory and the weights are not changed. However, discrepancies arise when
the weighted inputs do not yet produce the desired output; hence, to reach the desired
output and minimize the errors, the weights have to be adjusted.
Like a single-layer perceptron model, a multi-layer perceptron model also has the same
model structure but has a greater number of hidden layers.
The multi-layer perceptron model is also known as the Backpropagation algorithm, which
executes in two stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward
stage and terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as
per the model's requirement. In this stage, the error between the actual output and
the desired output is propagated backwards, starting at the output layer and ending at
the input layer.
A multi-layer perceptron model has greater processing power and can process linear and
non-linear patterns. Further, it can also implement logic gates such as AND, OR, XOR,
NAND, NOT, XNOR, NOR.
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input 'x' with the learned
weight coefficient 'w' and adding the bias 'b':
f(x) = 1 if w⋅x + b > 0;
otherwise, f(x) = 0
Characteristics of Perceptron
o The output of a perceptron can only be a binary number (0 or 1) due to the hard
limit transfer function.
o Perceptron can only be used to classify the linearly separable sets of input vectors.
If input vectors are non-linear, it is not easy to classify them properly.
Future of Perceptron
The future of the Perceptron model is bright and significant, as it helps to interpret
data by building intuitive patterns and applying them in the future. Machine learning is a
rapidly growing technology of Artificial Intelligence that is continuously evolving and in
the developing phase; hence the future of perceptron technology will continue to support
and facilitate analytical behavior in machines that will, in turn, add to the efficiency of
computers.
The perceptron model is continuously becoming more advanced and working efficiently
on complex problems with the help of artificial neurons.
7 Discuss the Bayesian approach to pattern classification, and explain how Bayes' classifiers can be used in artificial neural networks. (10M)
ANS:
The Bayesian approach to pattern classification is a probabilistic framework that uses
Bayes' theorem to classify new input data based on a prior knowledge of the statistical
properties of the data. In this approach, the classification decision is based on the
posterior probability of the class given the input features, which is computed by
combining the prior probability of the class and the likelihood of the features given the
class.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A)·P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed
event B.
P(B|A) is the Likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the Prior probability of the hypothesis, and P(B) is the Marginal probability of
the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
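As a small illustration (the toy weather dataset and its counts are made up for this sketch), Bayes' theorem can be applied to decide whether to play when the outlook is "Sunny":

# Toy dataset of (outlook, play) observations -- values are illustrative
data = [("Sunny", "Yes"), ("Sunny", "No"), ("Overcast", "Yes"),
        ("Rainy", "Yes"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Sunny", "No"), ("Rainy", "Yes"),
        ("Sunny", "Yes")]

def posterior(outlook, label):
    n = len(data)
    prior = sum(1 for _, p in data if p == label) / n                # P(label)
    likelihood = (sum(1 for o, p in data if o == outlook and p == label)
                  / sum(1 for _, p in data if p == label))           # P(outlook|label)
    evidence = sum(1 for o, _ in data if o == outlook) / n           # P(outlook)
    return likelihood * prior / evidence                             # Bayes' theorem

print(posterior("Sunny", "Yes"), posterior("Sunny", "No"))
# The class with the higher posterior probability is predicted.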
Bayes' classifiers can be used in artificial neural networks (ANNs) to improve the
accuracy and robustness of the classification model. This is done by incorporating
Bayesian reasoning into the learning process of the ANNs, either by explicitly modeling
the prior probabilities of the classes, or by using Bayesian regularization techniques to
control the complexity of the network and prevent overfitting.
One example of a Bayesian neural network is the Bayesian neural network with
backpropagation (BNNBP), which combines the backpropagation algorithm with
Bayesian regularization to train the network. In this approach, the network weights are
assigned prior probability distributions, and the likelihood of the training data given the
weights is computed using the standard backpropagation algorithm. The posterior
distribution of the weights is then computed using Bayes' theorem, and the classification
decision is based on the maximum posterior probability of the class given the input
features.
Bayesian ANNs have several advantages over traditional ANNs, including the ability to
handle uncertainty and noise in the input data, and the ability to provide probabilistic
estimates of the class probabilities. They are also more robust to overfitting, since the
prior probabilities can act as a regularizer that constrains the complexity of the model.
However, they can be computationally expensive to train and evaluate, especially for
large datasets and complex models, and require careful selection of the prior distributions
and hyperparameters.
8 Explain the structure and learning process of a multilayer perceptron, and explain how it can be used to solve complex pattern recognition problems. (10M)
ANS: A multilayer perceptron (MLP) is a type of artificial neural network (ANN) that is
commonly used for solving complex pattern recognition problems. It is a feedforward
neural network, which means that data flows only in one direction through the network,
from the input layer to the output layer, with no feedback loops.
The structure of an MLP consists of multiple layers of artificial neurons, each layer
connected to the next in a series of directed edges, forming a directed graph. The input
layer is the first layer, which receives the raw data as input. The output layer is the final
layer, which produces the output of the network. In between, there can be one or more
hidden layers, which contain a varying number of neurons.
Each neuron in an MLP receives input from the previous layer, performs a weighted sum
of these inputs, applies a non-linear activation function, and passes the output to the next
layer. The output of the MLP is obtained by passing the input through the layers of
neurons until it reaches the output layer.
The learning process of an MLP involves adjusting the weights and biases of the
connections between neurons to minimize the error between the actual output of the
network and the desired output. This process is called backpropagation, and it involves
computing the gradient of the error function with respect to the weights and biases, and
then using this gradient to update the weights and biases using an optimization algorithm
such as gradient descent.
MLP can be used to solve complex pattern recognition problems by learning the
underlying relationships between inputs and outputs through training on a dataset. This
means that the network can learn to recognize patterns in the data and generalize its
learning to new, unseen data.
Consider, for example, an MLP in which the input layer has three neurons, the hidden layer
has four neurons, and the output layer has two neurons. Each connection between neurons
carries an associated weight.
The output of each neuron is computed as y = f(∑ wi·xi + b), for i = 1 … n, where wi are the
weights, xi are the inputs, b is the bias, f is the activation function, and n is the number
of inputs.
The activation function introduces nonlinearity into the network, allowing it to learn
complex patterns in the data. Common activation functions include sigmoid, ReLU, and
tanh.
Multilayer perceptron (MLP) can be used to solve complex pattern recognition problems
such as image classification, speech recognition, and natural language processing by
learning a nonlinear mapping between the input features and the output labels. Here is a
step-by-step guide on how MLP can be used to solve a pattern recognition problem:
1. Data Preparation: The first step is to prepare the dataset by splitting it into training,
validation, and test sets. The training set is used to train the MLP, the validation set is
used to tune the hyperparameters, and the test set is used to evaluate the performance of
the MLP.
2. Input Encoding: The input data is encoded as a feature vector of size N, where N is the
number of features. Each feature represents a specific aspect of the input data, such as the
pixel intensity of an image or the frequency components of a speech signal.
3. Output Encoding: The output labels are encoded as a one-hot vector of size C, where C is
the number of classes. Each element of the vector represents a specific class, and the
element corresponding to the true class is set to 1, while all other elements are set to 0.
4. MLP Architecture: The MLP architecture is designed based on the complexity of the
problem and the size of the dataset. It consists of an input layer, one or more hidden
layers, and an output layer. The number of neurons in the input layer is equal to the
number of features, and the number of neurons in the output layer is equal to the number
of classes.
5. Forward Propagation: During forward propagation, the input features are passed through
the MLP, and the output probabilities are computed using the softmax function.
6. Loss Function: The loss function measures the difference between the predicted output
probabilities and the true output probabilities. The most commonly used loss function for
multiclass classification problems is the cross-entropy loss function.
7. Backpropagation: During backpropagation, the error is propagated backward through the
MLP, and the weights and biases are updated using the gradient descent algorithm.
8. Training: The MLP is trained by iteratively updating the weights and biases using the
backpropagation algorithm on the training set until the loss function converges.
9. Hyperparameter Tuning: The hyperparameters of the MLP, such as the number of hidden
layers, the number of neurons in each layer, the learning rate, and the regularization
parameters, are tuned using the validation set to optimize the performance of the MLP.
10. Testing: The performance of the MLP is evaluated on the test set by computing the
accuracy, precision, recall, and F1-score.
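These ten steps map almost directly onto a library call. A sketch with scikit-learn on its built-in digits dataset; the architecture, activation, and hyperparameters are illustrative choices, not prescribed values:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Steps 1-3: prepare the data and split it into training and test sets
X, y = load_digits(return_X_y=True)          # 64 pixel features, 10 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 4-8: define the MLP architecture and train it with backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="logistic",
                    learning_rate_init=0.001, max_iter=500, random_state=0)
mlp.fit(X_train, y_train)

# Step 10: evaluate accuracy, precision, recall and F1-score on the test set
print(classification_report(y_test, mlp.predict(X_test)))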
The input features are passed through the input layer, and the output of each neuron in the
hidden layer is computed using the sigmoid function. The output probabilities are
computed using the softmax function in the output layer.
The forward propagation and backpropagation equations for the MLP are as described above.
The weights and biases are updated using the gradient descent algorithm as follows:
w(i+1) = w(i) − η·∂L/∂w(i)
b(i+1) = b(i) − η·∂L/∂b(i)
where η is the learning rate and L is the loss function.
MLP networks are used in a supervised learning setting. The typical learning algorithm for
MLP networks is the backpropagation algorithm.
A multilayer perceptron (MLP) is a feed-forward artificial neural network that generates a
set of outputs from a set of inputs. An MLP is characterized by several layers of nodes
connected as a directed graph between the input and output layers. MLP uses backpropagation
for training the network and is a basic deep learning method.
In summary, MLPs are a powerful tool for solving complex pattern recognition problems
by learning the underlying relationships between inputs and outputs through training on a
dataset. The learning process involves adjusting the weights and biases of the connections
between neurons using backpropagation, and the structure of the network consists of
multiple layers of interconnected nodes or neurons.
9 Explain the concept of regularization in deep neural network training, and discuss how early stopping, dropout, and batch normalization can be used to improve the performance of a deep neural network. (10M)
ANS:
Regularization is a technique used to prevent overfitting of a model during training. In
the context of deep neural network training, overfitting occurs when a model becomes too
complex and starts to memorize the training data, resulting in poor generalization
performance on unseen data.
Regularization methods add constraints to the model to reduce its complexity and prevent
overfitting. There are several types of regularization techniques used in deep learning,
including:
1. L1 and L2 regularization: These methods add a penalty term to the loss function that
forces the model to use smaller weights. L1 regularization encourages the model to use
sparse weights, while L2 regularization encourages the model to use smaller but non-zero
weights.
2. Dropout: Dropout is a regularization technique that randomly drops out a portion of the
nodes in a layer during training. This forces the model to learn more robust features that
are not dependent on any one node.
L2 Regularization
L2 regularization is the most common of all regularization techniques and is also known as
weight decay. The full mathematical explanation of why it reduces overfitting is quite long
and complex; here we focus on the intuition behind the technique and on how it changes
training, so that it can be used to address the overfitting problem.
During L2 regularization, the loss function of the neural network is extended by a so-called
regularization term Ω. The regularization term Ω is defined as the Euclidean norm (or L2
norm) of the weight matrices, which is the sum over all squared weight values of a weight
matrix. The regularization term is weighted by the scalar alpha divided by two and added to
the regular loss function that is chosen for the current task. This leads to a new
expression for the loss function:
L_reg = L + (α/2)·∑ w²
Alpha is an additional hyperparameter that we introduce into the neural network; simply
speaking, alpha determines how much we regularize the model.
In the next step we can compute the gradient of the new loss function and put the gradient
into the update rule for the weights. Some reformulation of the update rule leads to an
expression that looks very much like the update rule for the weights during regular
gradient descent:
w ← (1 − η·α)·w − η·∂L/∂w
The only difference is that by adding the regularization term we introduce an additional
shrinkage factor (1 − η·α) on the current weights. In other words, independent of the
gradient of the loss function, we make our weights a little bit smaller at each update step.
L1 Regularization
In the case of L1 regularization (also known as Lasso regression), we simply use another
regularization term Ω: this term is the sum of the absolute values of the weight parameters
in a weight matrix. As in the previous case, we multiply the regularization term by alpha
and add the entire term to the loss function:
L_reg = L + α·∑ |w|
The derivative of the new loss function is the sum of the gradient of the old loss function
and the sign of each weight value times alpha:
∂L_reg/∂w = ∂L/∂w + α·sign(w)
Comparing the plots of the squared and absolute-value penalty functions explains the
different behaviour. In the case of L2 regularization, the weight parameters decrease but do
not necessarily become zero, since the curve becomes flat near zero. During L1
regularization, on the other hand, the weights are always forced all the way towards zero.
Basically the introduced equations for L1 and L2 regularizations are constraint functions,
The left image shows the constraint function (green area) for the L1 regularization and the
right image shows the constraint function for the L2 regularization. The red ellipses are
contours of the loss function that is used during the gradient descent. In the center of the
contours there is a set of optimal weights for which the loss function has a global
minimum.
In the case of L1 and L2 regularization, the estimates of W1 and W2 are given by the first
point where the ellipse intersects with the green constraint area.
Since L2 regularization has a circular constraint area, the intersection will not generally
occur on an axis, and thus the estimates for W1 and W2 will typically both be non-zero.
In the case of L1, the constraint area has a diamond shape with corners, and thus the
contours of the loss function will often intersect the constraint region at an axis; at such
a corner, one of the weight estimates is exactly zero.
In a high dimensional space, many of the weight parameters will equal zero
simultaneously.
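The effect of the two penalties on the weight update can be seen in a couple of lines (a sketch; grad stands for the gradient of the unregularized loss and alpha for the regularization rate):

import numpy as np

def l2_update(w, grad, lr=0.01, alpha=0.1):
    # weight decay: every step shrinks the weights by the factor (1 - lr*alpha)
    return w - lr * (grad + alpha * w)

def l1_update(w, grad, lr=0.01, alpha=0.1):
    # the penalty gradient is alpha * sign(w): weights are pushed towards exactly zero
    return w - lr * (grad + alpha * np.sign(w))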
3. Early stopping: Early stopping is a simple regularization technique that stops training
when the model's performance on a validation set stops improving. This prevents the
model from continuing to memorize the training data and allows it to generalize better to
unseen data.
4. Batch normalization: Batch normalization is a technique that normalizes the activations
of each layer to prevent the distribution of activations from shifting too much. This helps
to reduce the internal covariate shift and makes the model more stable during training.
Regularization is an essential part of deep neural network training and is often necessary
to achieve good generalization performance. By adding constraints to the model,
regularization techniques help to reduce overfitting and improve the model's ability to
generalize to new data.
Early stopping, dropout, and batch normalization are all regularization techniques that
can be used to improve the performance of a deep neural network.
1. Early stopping: Early stopping is a simple but effective regularization technique that
involves monitoring the model's performance on a validation set during training. When
the performance of the model on the validation set stops improving, training is stopped,
and the model with the best performance on the validation set is used as the final model.
Early stopping helps prevent overfitting by stopping the model from memorizing the
training data and allows it to generalize better to new data.
2. Dropout: Dropout is a regularization technique that involves randomly dropping out a
portion of the nodes in a layer during training. This helps to prevent the model from
overfitting by encouraging it to learn more robust features that are not dependent on any
one node. Dropout has been shown to be effective in reducing overfitting and improving
the generalization performance of deep neural networks.
3. Batch normalization: Batch normalization is a technique that involves normalizing the
activations of each layer to prevent the distribution of activations from shifting too much
during training. This helps to reduce the internal covariate shift and makes the model
more stable during training. Batch normalization has been shown to improve the
generalization performance of deep neural networks, reduce the number of training
iterations required, and make the model more resistant to overfitting.
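The three techniques just listed appear as one-liners in a typical Keras model. This is a sketch under the assumption that TensorFlow/Keras is used; the layer sizes, dropout rate, patience, and the placeholder training arrays x_train/y_train are illustrative:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.BatchNormalization(),   # normalize the layer's activations
    tf.keras.layers.Dropout(0.5),           # randomly drop half the units while training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# x_train / y_train stand for the training data (not defined in this sketch)
# model.fit(x_train, y_train, epochs=100,
#           validation_split=0.2, callbacks=[early_stop])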
Overall, early stopping, dropout, and batch normalization are all powerful techniques that
can be used to improve the performance of deep neural networks. By reducing overfitting
and improving generalization performance, these regularization techniques help to make
deep neural networks more effective at solving complex problems.
10 Compare and contrast different optimization algorithms used in deep neural network training, such as Momentum Optimizer, RMSProp, and Adam. (10M)
ANS:
Momentum
A very popular technique that is used along with SGD is called momentum. Instead of using
only the gradient of the current step to guide the search, momentum also accumulates the
gradients of past steps to determine the direction to go. The update equations are:
v(t) = γ·v(t−1) + η·∇J(w)
w = w − v(t)
If we set the initial value of v to 0 and choose the coefficient γ as 0.9, then each
subsequent update is a weighted combination of all the previous gradients. We see that the
previous gradients are also included in subsequent updates, but the weightage of the most
recent gradients is higher than that of the less recent ones. (For the mathematically
inclined, we are taking an exponential moving average of the gradient steps.)
How does this help our case? Consider a loss surface with pathological curvature, where most of the
gradient updates zig-zag across a narrow valley. Each gradient update can be resolved into
components along the w1 and w2 directions. If we individually sum these vectors up, their
components along the direction w1 cancel out, while the components along w2 add up.
For an update, this adds to the component along w2, while zeroing out the component in the w1
direction. This helps us move more quickly towards the minima. For this reason, momentum is said
to dampen the oscillations in our search. It also builds up speed and quickens convergence, but you
may want to combine it with simulated annealing (a decaying learning rate) to avoid overshooting
the minima.
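A minimal NumPy sketch of the momentum update described above. The toy objective, starting point, and hyperparameters are illustrative assumptions; w1 is the steep (oscillating) direction and w2 the shallow one, matching the discussion.

```python
import numpy as np

def sgd_momentum(grad_fn, w, lr=0.02, rho=0.9, steps=100):
    """Gradient descent with momentum: v accumulates an exponential
    average of past gradients and the parameters move along v."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)      # gradient at the current point
        v = rho * v + g     # accumulate past gradients (momentum)
        w = w - lr * v      # step along the accumulated direction
    return w

# Toy objective f(w) = 10*w1^2 + 0.5*w2^2: gradients along w1 are much
# larger than along w2 (pathological curvature).
grad = lambda w: np.array([20.0 * w[0], 1.0 * w[1]])
print(sgd_momentum(grad, np.array([5.0, 5.0])))   # approaches the minimum at (0, 0)
```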
RMSprop, or Root Mean Square Propagation, has an interesting history. It was devised by
the legendary Geoffrey Hinton, while suggesting a random idea during a Coursera class.
RMSProp also tries to dampen the oscillations, but in a different way than momentum.
RMSProp also takes away the need to manually tune the learning rate, and does it automatically.
More so, RMSProp chooses a different learning rate for each parameter.
In RMSProp, each update is done according to the equations described below. First, we maintain an
exponential average of the squares of the gradients:
v_t = ρ · v_{t-1} + (1 - ρ) · (∇w_t)²    ...(1)
To do that, we multiply the exponential average computed till the last update with a
hyperparameter ρ (rho), multiply the square of the current gradient with (1 - ρ), and then add
them together to get the new exponential average.
The reason why we use an exponential average is because, as we saw in the momentum example, it
helps us weigh the more recent gradient updates more than the less recent ones. In fact, the name
"exponential" comes from the fact that the weightage of previous terms falls exponentially (the
most recent term is weighted as ρ, the next one as ρ², the one before that as ρ³, and so on).
Recall the pathological-curvature example: the components of the gradients along w1 are much
larger than the ones along w2. Since we are squaring and adding them, they don't cancel out, and
the exponential average is large for the w1 updates.
Then, in the second equation, we decide our step size. We move in the direction of the gradient,
but our step size is scaled by the exponential average:
Δw_t = - (η / √(v_t + ε)) · ∇w_t    ...(2)
We choose an initial learning rate η, and then divide it by the square root of the average. In our
case, since the average for w1 is much, much larger than that for w2, the learning step taken along
w1 is much smaller than that along w2. Hence, this helps us avoid bouncing between the ridges and
move towards the minima.
The third equation, w_t = w_{t-1} + Δw_t, is just the update step. The hyperparameter ρ is
generally chosen to be 0.9, but you might have to tune it. The epsilon in equation 2 is there to
ensure that we do not end up dividing by zero, and is generally chosen to be a tiny value such as 1e-10.
It's also to be noted that RMSProp implicitly performs a kind of simulated annealing. Suppose we
are heading towards the minima and want to slow down so as not to overshoot it. When the gradient
steps become too large (large steps make us prone to overshooting), RMSProp automatically decreases
the size of the steps taken towards the minima.
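A minimal NumPy sketch of the RMSProp update from the equations above, applied to the same toy objective; all values are illustrative assumptions.

```python
import numpy as np

def rmsprop(grad_fn, w, lr=0.05, rho=0.9, eps=1e-10, steps=300):
    """RMSProp: keep an exponential average of squared gradients and
    divide the learning rate by its square root, per parameter."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        v = rho * v + (1 - rho) * g ** 2       # exponential average of g^2 (Eq 1)
        w = w - lr * g / (np.sqrt(v) + eps)    # per-parameter step size (Eqs 2 and 3)
    return w

grad = lambda w: np.array([20.0 * w[0], 1.0 * w[1]])
print(rmsprop(grad, np.array([5.0, 5.0])))   # ends up close to the minimum at (0, 0)
```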
Adam
So far, we've seen RMSProp and Momentum take contrasting approaches. While
momentum accelerates our search in direction of minima, RMSProp impedes our search
in direction of oscillations.
Adam, or Adaptive Moment Estimation, combines the heuristics of both. Here, we compute the
exponential average of the gradient as well as of the squares of the gradient for each parameter
(Eq 1 and Eq 2):
m_t = β1 · m_{t-1} + (1 - β1) · ∇w_t    ...(Eq 1)
v_t = β2 · v_{t-1} + (1 - β2) · (∇w_t)²    ...(Eq 2)
To decide our learning step, we multiply our learning rate by the average of the gradient (as was
the case with momentum) and divide it by the root mean square of the exponential average of the
squares of the gradients (as was the case with RMSProp):
w_t = w_{t-1} - (η / √(v_t + ε)) · m_t
The original Adam paper additionally applies a bias correction to m_t and v_t during the first few
iterations. The hyperparameter β1 is generally kept around 0.9, while β2 is kept around 0.99
(0.999 in the original paper). Epsilon is generally chosen to be a very small value such as 1e-8 or 1e-10.
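A minimal NumPy sketch of the Adam update, combining the two averages above; the hyperparameters and the toy problem are illustrative assumptions.

```python
import numpy as np

def adam(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.99, eps=1e-10, steps=200):
    """Adam: a momentum-style average of gradients (m) combined with an
    RMSProp-style average of squared gradients (v), with bias correction."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # Eq 1: average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2     # Eq 2: average of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

grad = lambda w: np.array([20.0 * w[0], 1.0 * w[1]])
print(adam(grad, np.array([5.0, 5.0])))   # ends up close to the minimum at (0, 0)
```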
UNIT III
DEEP LEARNING: working principle of CNN – RNN – Auto encoders and decoders - Unsupervised
Learning with Deep Network - Convolution Neural Network, Building blocks of CNN, Transfer
Learning - Image Classification using Transfer Learning.
1 What is the main difference between a convolutional neural network (CNN) and a 4M
recurrent neural network (RNN)?
ANS:
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are two
types of neural networks that are commonly used in deep learning. The main difference
between CNNs and RNNs is the type of data they are designed to process.
CNNs are typically used for processing data that has a grid-like structure, such as images
and videos. CNNs use convolutional layers to extract features from the input data, which
are then passed through one or more fully connected layers to generate the final output.
The convolutional layers of a CNN use filters to scan over the input data and detect local
patterns and features.
On the other hand, RNNs are designed to process sequential data, such as natural
language text or time-series data. RNNs use recurrent layers that allow the network to
maintain an internal state or memory of past inputs. This memory allows RNNs to
process sequential data by considering the context of previous inputs when making
predictions about future inputs.
In summary, CNNs are designed to process grid-like data such as images and videos,
while RNNs are designed to process sequential data such as text and time-series data.
CNNs use convolutional layers to extract features from the input data, while RNNs use
recurrent layers to maintain memory of past inputs.
Typical applications: CNNs – image and video recognition, object detection; RNNs – speech
recognition, language modeling, time series prediction.
In general, CNNs are used for processing 2D or 3D data such as images and videos,
whereas RNNs are used for processing sequences of variable length such as text, speech,
and time series data. While CNNs are feedforward networks that use convolutional and
pooling layers to extract features, RNNs are feedback networks that use recurrent layers
to maintain memory of past inputs. Additionally, CNNs share filter parameters across the spatial
locations of the input, which makes them efficient to train and apply, while RNNs share parameters
across time steps, which makes them flexible and adaptable to inputs of different lengths.
2 What is the working principle of autoencoders and decoders, and how are they used 4M
in deep learning applications?
ANS:
Autoencoders are a type of neural network that can learn to represent data in a lower-
dimensional space, known as a latent space, by compressing the input data into a smaller
representation and then reconstructing the original input from the compressed
representation. Autoencoders consist of two main components: an encoder and a decoder.
Autoencoders can be used for a variety of tasks, such as data compression, image
denoising, and data generation. They are often used in deep learning applications because
they can learn to extract meaningful features from the input data without requiring
manual feature engineering.
In addition to standard autoencoders, several variants have been developed, including denoising
autoencoders and variational autoencoders (VAEs); a related class of generative models is the
generative adversarial network (GAN).
Denoising autoencoders are designed to learn to remove noise from the input data.
During training, the autoencoder is fed corrupted input data and learns to reconstruct the
original, uncorrupted data.
VAEs are a type of generative model that can generate new data that is similar to the
training data. They learn a probability distribution over the latent space, allowing for the
generation of new data points by sampling from this distribution.
GANs are another type of generative model that consist of two neural networks: a
generator network and a discriminator network. The generator network learns to generate
new data points that are similar to the training data, while the discriminator network
learns to distinguish between real and generated data. The two networks are trained
together in an adversarial process, with the generator trying to produce data that can fool
the discriminator into thinking it is real.
In summary, autoencoders and decoders are used in deep learning applications to learn
compressed representations of data, which can then be used for tasks such as data
compression, image denoising, and data generation. There are several types of
autoencoders, including denoising autoencoders, VAEs, and GANs, which have different
applications and architectures.
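The encoder/decoder structure described above can be sketched in a few lines of Keras. This is a hedged illustration only: the 784-dimensional input (e.g. a flattened 28x28 image), the 32-dimensional latent space, and the commented-out x_train array are assumptions, not values from the original answer.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim, latent_dim = 784, 32   # e.g. flattened 28x28 images, 32-dim latent code

encoder = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(input_dim,)),
    layers.Dense(latent_dim, activation="relu"),    # compressed representation
])
decoder = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(input_dim, activation="sigmoid"),  # reconstruction of the input
])
autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# The autoencoder is trained to reproduce its own input, so no labels are needed:
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=128)
```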
3 What is unsupervised learning, and how can deep neural networks be used for 4M
unsupervised learning tasks?
ANS:
Unsupervised learning is a type of machine learning where the algorithm is trained on a
dataset without explicit labels or targets. The goal of unsupervised learning is to discover
the underlying structure or patterns in the data without the need for human supervision or
intervention. In contrast to supervised learning, where the algorithm is trained to predict a
target variable based on labeled examples, unsupervised learning is used to identify
relationships or groupings in the data that are not necessarily apparent to the human eye.
Deep neural networks can be used for unsupervised learning tasks by leveraging
architectures such as autoencoders, restricted Boltzmann machines (RBMs), and
generative adversarial networks (GANs). These architectures can learn useful
representations of the input data that can be used for downstream tasks such as
classification, regression, or clustering.
Autoencoders are neural networks that are trained to reconstruct the input data by
compressing it into a lower-dimensional representation (the latent space) and then
decoding it back to the original dimensionality. Autoencoders are a form of unsupervised
learning because they do not require labeled examples during training. By learning to
compress the input data into a lower-dimensional representation, autoencoders can
discover meaningful features and patterns in the data.
RBMs are a type of generative model that can be used for unsupervised learning tasks
such as dimensionality reduction, feature learning, and anomaly detection. They are
trained using a contrastive divergence algorithm that minimizes the energy function of
the network.
GANs are a type of generative model that consists of two neural networks: a generator
network and a discriminator network. The generator network learns to generate new data
points that are similar to the training data, while the discriminator network learns to
distinguish between real and generated data. The two networks are trained together in an
adversarial process, with the generator trying to produce data that can fool the
discriminator into thinking it is real.
1. Convolutional layers: These layers apply a set of filters to the input data, which slide over
the input data and perform a dot product between the filter and a small region of the input
data, known as the receptive field. The filters capture local patterns and features in the
input data, such as edges, corners, and textures.
2. Pooling layers: These layers downsample the output of the convolutional layers by
summarizing small regions of the input data. The most common type of pooling operation
is max pooling, which takes the maximum value in each small region. This reduces the
size of the feature maps and helps to make the model more robust to small variations in
the input data.
3. Activation functions: These functions introduce nonlinearity into the model, allowing it
to learn more complex patterns and relationships in the data. The most common
activation function used in CNNs is the rectified linear unit (ReLU), which sets negative
values to zero and leaves positive values unchanged.
4. Fully connected layers: These layers are similar to the layers in a standard neural network
and connect every neuron in one layer to every neuron in the next layer. Fully connected
layers are typically used at the end of a CNN to make predictions based on the features
learned by the earlier layers.
CNNs work by applying a sequence of convolutional and pooling layers to the input data,
gradually extracting more abstract and high-level features. The first convolutional layers
capture low-level features such as edges and textures, while later layers capture more
complex patterns and object parts. The fully connected layers at the end of the network
combine these features to make predictions based on the input data.
There are two main approaches to using transfer learning with CNNs:
1. Fine-tuning: In fine-tuning, a pre-trained CNN is used as a starting point, and the weights
of the model are adjusted on a new dataset. Typically, the last few layers of the pre-
trained model are replaced with new layers, and only the weights of the new layers are
updated during training. This approach is useful when the new dataset is similar to the
dataset used to train the pre-trained model, as the lower layers of the pre-trained model
can still be used to extract useful features.
2. Feature extraction: In feature extraction, the pre-trained CNN is used as a fixed feature
extractor, and the extracted features are used as input to a new classifier. The pre-trained
model is used to extract features from the new dataset, and these features are then fed into
a new classifier, such as a fully connected layer or a support vector machine (SVM). This
approach is useful when the new dataset is significantly different from the dataset used to
train the pre-trained model, as the lower layers of the pre-trained model can still be used
to extract useful features, but the upper layers may need to be discarded or frozen.
Transfer learning can be beneficial for image classification tasks in several ways. First,
pre-trained models are often trained on large, diverse datasets and have learned to
recognize many general features of images, such as edges, shapes, and textures. This
makes them useful starting points for many image classification tasks. Second, by starting
with a pre-trained model, the amount of training data required to achieve good
performance can be reduced, as the model has already learned useful features. Finally,
transfer learning can help to reduce the risk of overfitting, as the pre-trained model has
already been trained on a large dataset, and its weights have been regularized to prevent
overfitting.
In summary, transfer learning can be used to improve the performance of a CNN for
image classification tasks by leveraging the knowledge learned from a pre-trained model
on a similar task. Fine-tuning and feature extraction are two approaches to using transfer
learning with CNNs, and they can help to reduce the amount of training data required and
reduce the risk of overfitting.
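As a hedged sketch of the fine-tuning approach described above, the snippet below starts from a pre-trained VGG16, freezes its convolutional base, and trains a new classification head. The choice of VGG16, the input size, num_classes, and the commented-out training arrays are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

num_classes = 10
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False   # freeze the pre-trained convolutional layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),            # new layers for the new task
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(new_train_images, new_train_labels, epochs=5)
# Optionally unfreeze the top few layers of `base` afterwards and continue
# training with a small learning rate (fine-tuning proper).
```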
6 Explain the working principle of a convolutional neural network, and discuss the 10M
different building blocks used to extract features from input data.
ANS:
Convolutional Neural Network is a specialized neural network designed for visual data, such as
images & videos. But CNNs also work well for non-image data (especially in natural language
processing & text classification).
Although vanilla neural networks (MLPs) can learn highly complex functions, their architecture does
not exploit what we know about how the brain reads & processes images. For this reason, MLPs
haven't been able to achieve any significant breakthroughs in image-processing tasks.
The architecture of the Convolutional Neural Network, by contrast, uses many of the working
principles of the animal visual system & it has been able to achieve extraordinary results in
image-related learning tasks.
What is Convolution?
Let us consider an image 'X' & a filter 'Y' (more about filters will be covered later). Both of
them, i.e. X & Y, are matrices (image X is expressed in terms of its pixel values). When we
convolve the image 'X' using the filter 'Y', we produce the output in a matrix, say 'Z'.
Finally, we compute the sum of all the elements in ‘Z’ to get a scalar number, i.e.
3+4+0+6+0+0+0+45+2 = 60
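A minimal NumPy sketch of that single convolution step: multiply the patch element-wise by the filter and sum the result. The example values below are illustrative, not the matrices from the original notes.

```python
import numpy as np

X = np.array([[1, 2, 0],    # a 3x3 patch of the image
              [3, 0, 1],
              [5, 9, 2]])
Y = np.array([[3, 2, 1],    # a 3x3 filter
              [2, 1, 0],
              [9, 5, 1]])

Z = X * Y                   # element-wise product
print(Z.sum())              # one scalar value of the output feature map
```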
Now that we are familiar with the idea behind CNN let us dig deeper into the topic
to understand the building blocks of CNN. Following is the outline of our journey:
A filter provides a measure for how close a patch or a region of the input resembles a feature. A
feature may be any prominent aspect of the image – a vertical edge, a horizontal edge, a curve, and
so on. A filter acts as a single template or pattern, which, when convolved across the input, finds
similarities between the stored template & different locations/regions in the input image.
Consider the classic vertical-edge-detection example: a 6×6 input image convolved with a 3×3
vertical-edge filter gives a 4×4 output. Each value of the 4×4 output matrix looks at exactly three
columns & three rows of the input (the receptive field of the filter as it moves over the input
image). The values in the output matrix represent the change in intensity along the horizontal
direction.
The output image has the value 0 in the 1st & last columns, which means there is no change in
intensity across the first three columns & the last three columns of the input image. On the other
hand, the output is 30 in the 2nd & 3rd columns, indicating a sharp change in intensity, i.e. a
vertical edge, in the middle of the input image.
If the input image size is 'n x n' & the filter size is 'f x f', then after convolution the size of
the output image is (n - f + 1) x (n - f + 1).
Why do we do Padding?
Every time we apply a convolution operator, our image shrinks (in the above example, it shrank
from 6 x 6 to 4 x 4). If we convolve the output again with a filter, the image shrinks further.
This is the 1st downside of convolution.
During convolution, the pixels in the corners & the edges are considered only once. This
is the 2nd downside of convolution. If we consider any pixel in the middle, many (fxf)
regions overlap the pixel (we shift the filter & observe the image through it, i.e.
convolve). Thus, the pixels on the corners or the edges are used much less in the output.
So, we throw away a lot of information near the edge of the image.
Let p be the padding. In this example, p = 1 because we padded all around the 6 x 6 input image
with one pixel, making the padded image 8 x 8; convolving it with a 3 x 3 filter then gives a
6 x 6 output, the same size as the original input.
Types of Padding
There are two common choices for padding: Valid convolutions & the Same
convolutions.
1. Valid convolutions: This essentially means no padding (p = 0). We have an (n x n) image
convolved with an (f x f) filter & this gives us an output of size (n - f + 1) x (n - f + 1).
2. Same convolutions: In this case, padding is such that the output size is the same as the
input image size. When we do padding by 'p' pixels, the padded image is of size
(n + 2p) x (n + 2p), and the output after convolution is of size
(n + 2p - f + 1) x (n + 2p - f + 1).
The amount of padding to be done should be such that the output image size equals the original
input image size. Let p = padding. To avoid shrinkage of the original input image, we require
n + 2p - f + 1 = n    ...(1)
which gives p = (f - 1) / 2.
⸫ Output size after convolving the padded image = original input image size
How is the Filter Size Decided?
By convention, the value of 'f', i.e. the filter size, is usually odd in computer vision. This is
linked to the padding formula: if the value of 'f' is even, we may need asymmetric padding (refer
to eqn. 1 above). Let us say that the size of the filter, i.e. 'f', is 6. Then by using equation 1,
we get p = 2.5, which is not a whole number, so the image would have to be padded unevenly.
The 2nd reason for choosing an odd-size filter such as a 3×3 or a 5×5 filter is that we get a
central position (a central pixel), which makes it easy to describe where the filter sits.
What is Stride?
The stride indicates the pace by which the filter moves horizontally & vertically over the pixels
of the input image during convolution.
Stride depends on what we expect in our output image. We prefer a smaller stride size if
we expect several fine-grained features to reflect in our output. On the other hand, if we
are only interested in the macro-level of features, we choose a larger stride size.
To understand the concept of stride in more detail, let’s consider an example. Let’s say
we are interested in classifying the input image between landscape & portrait. We do not
need minute details for this task, such as the number of mountain peaks, trees, etc. So we
can choose a higher value for stride. On the other hand, if we want to classify an image
between dog & cat, we need to focus or capture very minute details or features of the
input image to type the image correctly. In this case, we prefer a smaller stride size.
Let s = stride size. With padding p and stride s, the output size after convolution becomes
((n + 2p - f)/s + 1) x ((n + 2p - f)/s + 1).
Consider an RGB image of size 6×6. Since it's an RGB image, its dimension is 6x6x3, where the
three corresponds to the three colour channels: Red, Green & Blue. We cannot convolve it with the
2-D filter used earlier. For 3-D images, we need 3-D filters, i.e. the filter itself will also have
three layers corresponding to the red, green & blue channels, similar to that of the input RGB image.
This 3-D filter is convolved over the 3-D image. We 1st place the 3x3x3 filter in the upper
left-most position. This filter has 27 numbers (9 parameters in each channel). We take each of
these 27 numbers & multiply them with the corresponding numbers from the image's red, green & blue
channels. Then we add up all those numbers & this gives us the 1st number in the output image.
To compute the next output value, we take this filter & slide/stride it over by 1 (or whatever
stride number we consider) &, again, do the 27 multiplications and add up the 27 numbers.
We can use multiple filters to detect various features simultaneously. Let us consider the
following example in which we see vertical edge & curve in the input RGB image. We
will have to use two different filters for this task, and the output image will thus have two
feature maps.
1. The filters are learned during training (i.e. during backpropagation). Hence, the
individual values of the filters are often called the weights of CNN.
2. A neuron is a filter whose weights are learned during training. E.g., a (3,3,3) filter (or
neuron) has 27 units. Each neuron looks at a particular region in the input with the same weights.
All neurons in a feature map extract the same feature (but from different regions of the input).
It is called a 'feature map' because it maps where a particular feature occurs in the input image.
What is Pooling?
A pooling layer is another essential building block of CNN. It tries to figure out whether
a particular region in the image has the feature we are interested in or not.
The actual dictionary meaning of pooling is the act of sharing or combining two or more things. In
CNN, the pooling layer does a similar job. It summarizes the feature map so that the model does not
need to be trained on precisely positioned features, which makes the model more tolerant to small
shifts in the position of a feature.
The pooling layer looks at larger regions (having multiple patches) of the image & captures an
aggregate statistic (min, max, average, global) for each region. In other words, it reduces each
region of the feature map to a single summary value.
The two most popular aggregate functions used in pooling are ‘max’ & ‘average’:
1. Max pooling – If any of the patches say something firmly about the
presence of a particular feature, then the pooling layer counts that feature
as ‘detected’.
2. Average pooling – If one patch says something very firmly, but the other
ones disagree, the average pooling takes the average to find out.
Pooling has the advantage of making the representation more compact by reducing the
spatial size of the feature maps, thereby reducing the number of parameters to be learnt.
Pooling reduces only the height & width of the feature map, not the depth (i.e. the number of
channels). For example, if we have 'm' feature maps each of size (c, c), a standard 2x2 pooling
produces 'm' outputs each of size (c/2, c/2).
On the other hand, pooling also loses a lot of information, which is often considered a
potential disadvantage.
The pooling layer has ‘NO PARAMETERS’ i.e. ‘ZERO TRAINABLE PARAMETERS’.
The pooling layer computes the aggregate of the input. E.g. in max pooling, it takes a
maximum over a group of pixels. We do not need any adjustments in any parameters.
As a rule of thumb, max pooling works well when we only care whether a feature is present
somewhere in a region, while average pooling works better when the overall strength of the feature
across the region matters.
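A tiny NumPy sketch of 2x2 max pooling on a 4x4 feature map (illustrative values), showing that the spatial size is halved and no trainable parameters are involved.

```python
import numpy as np

fmap = np.arange(16.0).reshape(4, 4)                 # a 4x4 feature map
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))   # max over each 2x2 block
print(pooled.shape)                                  # (2, 2): height & width halved
```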
A typical CNN has the following sequence of layers:
1. The input image is convolved with multiple filters to create various feature maps.
2. Each feature map of size (C, C) is pooled to generate a (C/2, C/2) output (for a standard
2×2 pooling).
The above pattern is referred to as one Convolutional Neural Network layer or one unit. Multiple
such CNN layers are stacked on top of each other to create deep Convolutional Neural Networks.
The output of the final convolution layer contains the extracted features, and these features are
fed into a fully connected (dense) network that produces the final prediction.
Convolutional neural networks (CNNs) are a type of neural network that are commonly
used for image recognition and computer vision tasks. They are based on the idea of
convolution, which is a mathematical operation that is used to extract features from data.
The working principle of a CNN can be broken down into several building blocks that
work together to extract features from input data:
1. Convolutional layers: Convolutional layers are the main building block of a CNN. They
use a set of filters to extract features from the input data. The filters are small 2D matrices
that slide over the input data, computing a dot product with each region of the data they
cover. This results in a set of output feature maps, which represent different learned
features.
2. Pooling layers: Pooling layers are used to downsample the output of the convolutional
layers. They reduce the spatial dimensions of the feature maps, while retaining the most
important information. The most common pooling operation is max pooling, which
selects the maximum value from each region of the feature map.
3. Activation functions: Activation functions introduce nonlinearity into the model. They
are applied element-wise to the output of each neuron in the network, allowing the
network to learn more complex patterns and relationships in the data. The most common
activation function used in CNNs is the rectified linear unit (ReLU), which sets negative
values to zero and leaves positive values unchanged.
4. Fully connected layers: Fully connected layers are used at the end of the network to make
predictions based on the features learned by the earlier layers. They connect every neuron
in one layer to every neuron in the next layer, allowing the network to learn complex
mappings from the input to the output.
During training, the weights of the CNN are adjusted to minimize a loss function, which
measures how well the network is performing on the task at hand. The weights are
updated using backpropagation, which calculates the gradient of the loss function with
respect to each weight in the network.
In summary, a convolutional neural network extracts features from input data using
convolutional layers, pooling layers, activation functions, and fully connected layers.
Convolutional layers extract local features from the input data, while pooling layers downsample
the output and reduce the spatial dimensions of the feature maps. Activation
functions introduce nonlinearity into the model, and fully connected layers are used to
make predictions based on the learned features.
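The four building blocks above can be combined into a small image classifier. The Keras sketch below is a hedged illustration; the 28x28 grayscale input, the layer sizes, and the 10 output classes are assumptions, not part of the original answer.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                                            # pooling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),        # fully connected layer
    layers.Dense(10, activation="softmax"),     # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```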
7 Discuss the working principle of recurrent neural networks, and explain how they 10M
can be used to process sequential data.
ANS:
Recurrent neural networks (RNNs) are a type of neural network that are commonly used
for processing sequential data. They are designed to work with input sequences of
arbitrary length, where the current output depends not only on the current input but also
on previous inputs.
The working principle of an RNN can be explained in terms of its recurrent nature. An
RNN processes input sequences one element at a time, and at each time step, it produces
an output and updates its internal state. The output and internal state are then fed back
into the network at the next time step, along with the next element in the input sequence.
This allows the network to maintain a memory of previous inputs and use that
information to generate more accurate predictions.
The key component of an RNN is the recurrent connection, which allows information to
flow from one time step to the next. The internal state of the RNN at time step t, denoted
as h_t, is computed as a function of the current input x_t and the previous state h_{t-1}.
This can be expressed mathematically as:
h_t = f(W_x · x_t + W_h · h_{t-1} + b)
where f is a nonlinear function (such as tanh) that is applied element-wise, W_x and W_h are
weight matrices applied to the current input and the previous state, and b is a bias vector.
The output of the RNN at each time step is computed based on the current internal state,
and can be used for various tasks such as classification or prediction. The output can also
be fed back into the network as an additional input at the next time step.
In order to train an RNN, the weights of the network are adjusted to minimize a loss
function that measures the difference between the predicted output and the true output.
This is done using a variant of backpropagation called backpropagation through time
(BPTT), which propagates the error gradient back through the recurrent connections.
RNNs can be used for a variety of tasks that involve sequential data, such as natural
language processing, speech recognition, and time series prediction. They are particularly
well-suited for tasks where the length of the input sequence is variable or unknown, as
they can process input sequences of arbitrary length. However, they can be difficult to
train and can suffer from the vanishing gradient problem, where the gradient of the loss
function becomes very small over long sequences, making it difficult to learn long-term
dependencies. This has led to the development of variants of RNNs such as long short-
term memory (LSTM) networks and gated recurrent units (GRUs), which are designed to
address these issues.
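A minimal NumPy sketch of the recurrence h_t = f(W_x·x_t + W_h·h_{t-1} + b) applied over a toy input sequence; the sizes and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5

W_x = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))        # a toy input sequence
h = np.zeros(hidden_dim)                          # initial internal state

for x_t in xs:                                    # process one element at a time
    h = np.tanh(W_x @ x_t + W_h @ h + b)          # update the internal state
print(h)                                          # the final state summarizes the sequence
```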
8 Explain the structure and working principle of autoencoders and decoders, and 10M
explain how they can be used for unsupervised learning tasks, such as
dimensionality reduction and anomaly detection.
ANS:
Autoencoders are neural networks that are designed to learn a compressed representation
of input data. They consist of two main components: an encoder and a decoder. The
encoder takes the input data and compresses it into a low-dimensional representation,
while the decoder takes this compressed representation and attempts to reconstruct the
original input data.
In anomaly detection, the goal is to identify instances of data that are significantly
different from the norm. Autoencoders can be used for this task by training the model on
a dataset of normal data, and then using the model to reconstruct new instances of data. If
the reconstruction error is significantly higher than normal, then the instance of data may
be an anomaly. This approach is particularly useful when anomalies are rare or when it is
difficult to define explicit rules for what constitutes an anomaly.
Overall, autoencoders are a powerful tool for unsupervised learning tasks, and they can
be used for a wide range of applications including dimensionality reduction, anomaly
detection, and generative modeling.
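A hedged sketch of the anomaly-detection recipe above: score new inputs by their reconstruction error and flag the ones that reconstruct poorly. The `autoencoder` model, the `normal_data`/`new_data` arrays, and the 3-sigma threshold are all illustrative assumptions.

```python
import numpy as np

def reconstruction_error(model, x):
    """Per-sample mean squared error between the input and its reconstruction."""
    x_hat = model.predict(x, verbose=0)
    return np.mean((x - x_hat) ** 2, axis=1)

# errors_normal = reconstruction_error(autoencoder, normal_data)
# threshold = errors_normal.mean() + 3 * errors_normal.std()   # one common heuristic
# is_anomaly = reconstruction_error(autoencoder, new_data) > threshold
```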
9 Discuss the concept of transfer learning in deep neural networks, and explain how it can 10M
be used to improve the performance of a convolutional neural network for image
classification tasks.
ANS:
The reuse of a previously learned model on a new problem is known as transfer learning.
It’s particularly popular in deep learning right now since it can train deep neural networks
with a small amount of data. This is particularly valuable in the field of data science, as
most real-world problems do not come with the millions of labelled data points needed to train
such complicated models from scratch.
Transfer learning is essentially the transfer of knowledge within machine learning. A machine uses
the knowledge learned from a prior assignment to improve its predictions on a new task. For
example, if you trained a simple classifier to predict whether an image contains a backpack, you
could use the knowledge the model gained during that training to help it recognise other objects,
such as sunglasses.
With transfer learning, we basically try to use what we've learned in one task to better understand
the concepts in another. The weights that a network learned while performing "task A" are
transferred to a network that must perform a new "task B".
Because of the massive amount of CPU power required, transfer learning is typically
applied in computer vision and natural language processing tasks like sentiment analysis.
In computer vision, neural networks typically aim to detect edges in the first layer, forms
in the middle layer, and task-specific features in the latter layers. The early and central
layers are employed in transfer learning, and the latter layers are only retrained. It makes
use of the labelled data from the task it was trained on.
Let's return to the example of a model that was intended to identify a backpack in an image and
will now be used to detect sunglasses. Because the model has already learned to recognise objects
in its earlier layers, we will simply retrain the later layers so that the model learns what
distinguishes sunglasses from other objects.
Transfer learning offers a number of advantages, the most important of which are reduced training
time, improved neural network performance (in most circumstances), and the ability to get by
without a large amount of data.
To train a neural model from scratch, a lot of data is typically needed, but access to that data
isn't always possible – this is when transfer learning comes in handy. Because the model has
already been pre-trained, a good machine learning model can be generated with fairly little
training data using transfer learning. This is especially useful in natural language processing,
where huge labelled datasets require a lot of expert knowledge to create. Additionally, training a
deep neural network from scratch on a complex task can take days or even weeks.
Transfer learning is worth considering when we don't have enough annotated data to train our model
with, and when there is a pre-trained model that has been trained on similar data and tasks. If you
used TensorFlow to train the original model, you might simply restore it and retrain some layers
for your job. Transfer learning, on the other hand, only works if the features learnt in the first
task are general, meaning they can be applied to the new activity. Furthermore, the model's input
must be the same size as it was when it was first trained; if it isn't, add a step to resize your
input to the required size.
1. TRAINING A MODEL TO REUSE IT
Consider the situation in which you wish to tackle task A but lack the necessary data to train a
deep neural network. Finding a related task B with plenty of data is one way around this. Utilize
the deep neural network to train on task B and then use the model as a starting point to solve
task A. The problem you're seeking to solve will decide whether you need to employ the entire model
or only a few layers.
If the input in both jobs is the same, you might reapply the model and make predictions for your
new input. Alternatively, changing and retraining the distinct task-specific layers and the output
layer is a method worth exploring.
2. USING A PRE-TRAINED MODEL
The second option is to employ a model that has already been trained. There are a number of these
models out there, so do some research beforehand. The number of layers to reuse and the number to
retrain depend on the problem.
Keras, for example, ships with nine pre-trained models that can be used for transfer learning,
prediction, and fine-tuning. These models, as well as some quick lessons on how to utilise them,
may be found in the Keras documentation. Many research institutions also make trained models
accessible.
This form of transfer learning is the one most commonly used throughout deep learning.
3. EXTRACTION OF FEATURES
Another option is to utilise deep learning to identify the optimum representation of your
problem, which comprises identifying the key features. This method is known as
representation learning, and it can often produce significantly better results than hand-
designed representations.
Feature creation in machine learning is mainly done by hand by researchers and domain specialists.
Fortunately, deep learning can extract features automatically. Of course, this does not diminish
the importance of feature engineering and domain knowledge; you still have to decide which data you
feed into your network. Neural networks, however, have the ability to learn which features are
critical and which aren't, even for complicated tasks that would otherwise necessitate a lot of
manual effort.
The learned representation can then be applied to a variety of other challenges. Simply utilise the
initial layers to find the appropriate feature representation, but avoid using the network's final
output because it is too task-specific. Instead, send data into your network and take the output of
one of the intermediate layers.
This method is commonly used in computer vision since it can shrink your dataset,
reducing computation time and making it more suited for classical algorithms.
There are a number of popular pre-trained machine learning models available. The Inception-v3
model, which was developed for the ImageNet "Large Visual Recognition Challenge", is one of them.
Participants in this challenge had to classify images into 1,000 categories.
The basic idea behind transfer learning is that a neural network pre-trained on a large
dataset, such as ImageNet, has learned to recognize general features and patterns that are
useful for many different tasks, including tasks that the network was not specifically
trained for. By reusing the pre-trained network, we can save time and computational
resources that would otherwise be required to train a new network from scratch.
Additionally, we can leverage the knowledge learned from the pre-trained network to
improve the performance of the new network on the new task.
In the context of image classification tasks, transfer learning can be used to improve the
performance of a convolutional neural network (CNN) by reusing a pre-trained CNN as a
starting point. The pre-trained CNN can be used as a feature extractor, where the weights
of the pre-trained layers are frozen and only the weights of the new layers are trained on
the new dataset. This approach is particularly useful when the new dataset is small, as the
pre-trained network has already learned general features and patterns that can be reused
for the new task, allowing the new network to learn from a smaller number of examples.
There are several pre-trained CNN models available, such as VGG, ResNet, and
Inception, that have been trained on large datasets such as ImageNet. These pre-trained
models can be downloaded and used as a starting point for a new image classification
task. For example, the pre-trained CNN can be used to extract features from the images in
the new dataset, and then these features can be fed into a new classifier, such as a fully
connected layer or a support vector machine, to classify the images.
Overall, transfer learning is a powerful technique for improving the performance of deep
neural networks, and it can be particularly useful for tasks with limited training data or
computational resources. By leveraging the knowledge learned from pre-trained models,
we can improve the performance of new models and accelerate the development of new
machine learning applications.
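As a hedged sketch of the feature-extraction variant described above, the snippet below uses a frozen pre-trained ResNet50 to turn images into feature vectors and trains an SVM on them. The choice of ResNet50, the input size, and the commented-out `images`/`labels` arrays are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from sklearn.svm import SVC

# Frozen feature extractor: global average pooling yields one vector per image.
extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                     input_shape=(224, 224, 3))

# features = extractor.predict(images)              # shape: (num_images, 2048)
# clf = SVC(kernel="rbf").fit(features, labels)     # classical classifier on deep features
# predictions = clf.predict(extractor.predict(new_images))
```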
10 Explain the working principle of unsupervised learning with deep neural networks, 10M
and discuss the different types of unsupervised learning tasks that can be performed
with deep neural networks.
ANS:
Unsupervised learning with deep neural networks is a type of machine learning where the
model is trained on input data without explicit labels or targets. Instead, the model is
trained to find patterns or structure in the data through self-organization, and to learn
useful representations of the input data without any explicit guidance.
The working principle of unsupervised learning with deep neural networks involves
optimizing the model's parameters to minimize some objective function that captures a
notion of similarity or regularity in the input data. There are different types of objective
functions used for unsupervised learning, depending on the specific task and the type of
neural network used.
One common type of unsupervised learning task is clustering, which involves grouping
similar data points together based on some measure of similarity or distance. This can be
performed with deep neural networks using techniques such as k-means clustering, where
the network is trained to identify clusters of similar data points by minimizing the sum of
distances between each data point and its nearest cluster center.
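A small hedged sketch of this idea: run k-means on lower-dimensional codes produced by an encoder. The random array below merely stands in for the encoder's output; in practice it would come from something like `encoder.predict(data)`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for latent codes produced by a trained encoder (e.g. encoder.predict(data)).
latent_codes = np.random.default_rng(0).normal(size=(1000, 32))

cluster_ids = KMeans(n_clusters=10, n_init=10).fit_predict(latent_codes)
print(np.bincount(cluster_ids))   # number of points assigned to each cluster
```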
Generative modeling is another important type of unsupervised learning task, where the
goal is to learn a probabilistic model of the input data that can be used to generate new
samples that are similar to the original data. This can be performed with deep neural
networks using techniques such as variational autoencoders (VAEs) or generative
adversarial networks (GANs), where the network is trained to generate new samples by
sampling from a learned latent space that captures the underlying structure of the data.
Overall, unsupervised learning with deep neural networks is a powerful tool for
discovering patterns and structure in complex data without the need for explicit labels or
targets. By leveraging the power of deep learning, we can learn useful representations of
data that can be used for a wide range of applications including clustering, dimensionality
reduction, and generative modeling.
UNIT IV
LSTM Networks - Generative Modeling with DL –Variational Auto encoder- Generative Adversarial
Network Revisiting Gradient Descent - Momentum Optimizer – RMSProp – Adam.
1 What is the working principle of LSTM networks, and how are they used in deep 4M
learning applications?
ANS:
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network
(RNN) that are designed to handle the issue of vanishing gradients that can occur in
traditional RNNs. LSTMs use a special type of memory cell that can selectively
remember or forget information over time, allowing the network to capture long-term
dependencies in sequential data.
LSTM networks are widely used in deep learning applications that involve sequential
data, such as natural language processing, speech recognition, and time series analysis. In
natural language processing, LSTMs can be used for tasks such as language modeling,
machine translation, and sentiment analysis. In speech recognition, LSTMs can be used
for tasks such as speech-to-text transcription and speaker identification. In time series
analysis, LSTMs can be used for tasks such as anomaly detection and forecasting.
Overall, LSTM networks are a powerful tool for handling sequential data and capturing
long-term dependencies, and they have enabled significant advances in a wide range of
deep learning applications.
2 How do generative models differ from discriminative models, and what are the 4M
applications of generative modeling in deep learning?
ANS:
Generative models and discriminative models are two types of models used in machine
learning.
Generative models are designed to learn the underlying probability distribution of the
input data and can be used to generate new samples that are similar to the training data. In
other words, they model the joint probability distribution of both the input data and the
labels. Examples of generative models include Gaussian Mixture Models (GMMs),
Hidden Markov Models (HMMs), and Variational Autoencoders (VAEs).
On the other hand, discriminative models are designed to learn the decision boundary
between classes directly. They only model the conditional probability distribution of the
labels given the input data. Examples of discriminative models include logistic
regression, Support Vector Machines (SVMs), and Neural Networks (NNs).
In terms of applications in deep learning, generative models are used in a variety of tasks,
such as image synthesis, text generation, and anomaly detection. They are also used in
unsupervised learning, where the labels are not provided, and the model must learn the
underlying structure of the data on its own.
One popular generative model is the Generative Adversarial Network (GAN), which is
designed to learn the distribution of the input data and can generate new samples that are
visually similar to the training data. Another popular generative model is the Variational
Autoencoder (VAE), which learns a lower-dimensional representation of the input data
and can generate new samples by sampling from the learned latent space.
In summary, generative models learn the probability distribution of the input data and can
be used to generate new samples, while discriminative models learn the decision
boundary between classes directly. Generative models are used in a variety of deep
learning applications, including image synthesis, text generation, and unsupervised
learning.
3 What is the working principle of variational autoencoders, and how are they used in 4M
generative modeling?
ANS:
Variational autoencoders (VAEs) are a type of generative model that use neural networks
to learn a lower-dimensional representation of the input data, called a latent space. VAEs
are trained using unsupervised learning, which means that they do not require labeled
data to learn the underlying distribution of the input data.
The basic working principle of a VAE is to learn two neural networks: an encoder and a
decoder. The encoder takes an input data point and maps it to a lower-dimensional latent
representation. The decoder then takes the latent representation and maps it back to the
original data space. During training, the VAE learns to minimize the difference between
the original data point and the reconstructed data point.
However, unlike traditional autoencoders, VAEs are trained to generate new data samples
by sampling from the learned latent space. During training, VAEs learn to optimize the
latent space to have a simple, normally distributed structure. This allows the VAE to
generate new data samples by randomly sampling from the latent space and mapping
them back to the original data space using the decoder.
The loss function used to train VAEs includes two terms: a reconstruction loss and a KL
divergence loss. The reconstruction loss measures the difference between the original
input data and the reconstructed data, while the KL divergence loss measures the
difference between the learned latent distribution and a simple, normally distributed prior
distribution. The KL divergence loss encourages the VAE to learn a simple, structured
latent space that can be easily sampled to generate new data samples.
In terms of generative modeling, VAEs are used to generate new data samples that are
similar to the training data. The learned latent space can be used to interpolate between
existing data samples, generating new data samples that lie along the same distribution as
the training data. VAEs have been used in a variety of applications, such as image
synthesis, text generation, and anomaly detection.
4 What is the main idea behind generative adversarial networks, and how are they 4M
used in generative modeling?
ANS:
Generative Adversarial Networks (GANs) are a type of generative model that use two
neural networks to play a two-player minimax game. GANs are designed to learn the
underlying distribution of the input data and can generate new data samples that are
similar to the training data.
The main idea behind GANs is to have one neural network, called the generator, generate
new data samples that are similar to the training data, while another neural network,
called the discriminator, attempts to distinguish between the generated data samples and
the real data samples. The generator is trained to generate data samples that can fool the
discriminator, while the discriminator is trained to distinguish between real and fake data
samples.
During training, the generator takes random noise as input and generates a new data
sample. The discriminator then takes the generated data sample and a real data sample
and predicts which one is real and which one is fake. The generator is then updated to
generate data samples that are more difficult for the discriminator to distinguish from real
data samples, and the discriminator is updated to better distinguish between real and fake
data samples. This process continues until the generator is able to generate new data
samples that are indistinguishable from real data samples.
The loss function used to train GANs is a combination of two terms: a generator loss and
a discriminator loss. The generator loss measures how well the generated data samples
fool the discriminator, while the discriminator loss measures how well the discriminator
is able to distinguish between real and fake data samples. The generator and discriminator
are updated iteratively based on the gradients of the loss function.
In terms of generative modeling, GANs are used to generate new data samples that are
similar to the training data. The generator can be used to generate new data samples that
are visually similar to the training data, while the discriminator can be used to evaluate
the quality of the generated data samples. GANs have been used in a variety of
applications, such as image synthesis, text generation, and video synthesis.
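A hedged Keras sketch of the two-player setup described above, for 28x28 images and a 100-dimensional noise vector (both assumptions). The generator is trained through a combined model in which the discriminator's weights are frozen, which is one common way to wire up a simple GAN in Keras.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

latent_dim = 100

generator = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(28 * 28, activation="sigmoid"),
    layers.Reshape((28, 28)),
])

discriminator = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # probability that the input is real
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

discriminator.trainable = False              # freeze D inside the combined model
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_images):
    batch = real_images.shape[0]
    noise = np.random.normal(size=(batch, latent_dim))
    fake_images = generator.predict(noise, verbose=0)
    # Discriminator step: real images labelled 1, generated images labelled 0.
    discriminator.train_on_batch(real_images, np.ones((batch, 1)))
    discriminator.train_on_batch(fake_images, np.zeros((batch, 1)))
    # Generator step: try to make the discriminator output 1 for generated images.
    gan.train_on_batch(noise, np.ones((batch, 1)))
```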
LSTM is a type of neural network used in the field of deep learning. LSTM stands
for Long Short-Term Memory. LSTM is an improved version of the RNN (Recurrent
Neural Network). LSTM is mainly used in Time series and Sequence data because RNN
doesn't provide an efficient performance as the gap length rises. LSTM differs from
conventional Feedforward Networks as it uses previous data and its output to affect the
current predictions. LSTM is also better at retaining information for longer periods when
compared with RNN. Long Short Term Memory uses Gated Cells to remember or forget
previous information.
LSTM was designed by Hochreiter & Schmidhuber. It's not an easy task to get your head
around LSTM as it belongs to the complex area of Deep Learning. LSTM deals with
algorithms to uncover the underlying relationships in the given sequential data.
The chain structure LSTM contains four neural networks and different memory blocks
called cells. The LSTM may keep information for a long time by default and information
is retained by the cells and the three gates do the memory manipulations.
Need of LSTM
LSTM was introduced to tackle the problems and challenges faced in Recurrent Neural
Networks. Slightly touching on the topic of RNN. RNN is a type of Neural Network that
stores the previous output to help improve its future predictions. Vanilla RNN has a
“short-term” memory. The input at the beginning of the sequence doesn’t affect the
output of the Network after a while, maybe 3 or 4 inputs. This is called a long-term
dependency issue.
Example: consider the sentence "The sun rises in the ____." An RNN could easily return the correct
output ("East") as all the necessary information is nearby.
Let's take another example: "I grew up in Japan ... I speak fluent ____." In this sentence, the RNN
would not be able to return the correct output ("Japanese") as it requires remembering the word
Japan for a long duration. Since RNN only has a "short-term" memory, it doesn't work well here.
LSTM solves this problem by enabling the network to remember long-term dependencies.
The other problem in RNN is the Vanishing Gradient and Exploding Gradient. It
arises during the Backpropagation of the Neural Network. Suppose the gradient of each
layer is contained between 0 and 1. As the value gets multiplied in each layer, it gets
smaller and smaller, ultimately, a value very close to 0. This is the Vanishing gradient
problem. The converse, when the values are greater than 1, exploding gradient problem
occurs, where the value gets really big, disrupting the training of the Network. These
problems are tackled in LSTMs.
Structure of LSTM
LSTM is a cell that consists of 3 gates. A forget gate, input gate, and output gate. The
gates decide which information is important and which information can be forgotten. The
cell has two states Cell State and Hidden State. They are continuously updated and
carry the information from the previous time steps to the current time steps. The cell state
acts as the “long-term” memory, while the hidden state acts as the “short-term” memory.
Now let’s look at each gate in detail.
Forget Gate:
The forget gate is responsible for deciding what information should be removed from the cell
state. It takes in the hidden state of the previous time step and the current input, and passes
them to a sigmoid activation function, which outputs a value between 0 and 1, where 0 means forget
and 1 means keep.
Input Gate:
The input gate considers the current input and the hidden state of the previous time step. It is
used to update the cell state value, and it has two parts. The first part is a sigmoid activation
function; its purpose is to decide what percentage of the new information is required. The second
part passes the same two values to a tanh activation function, whose purpose is to map the data
between -1 and 1. To obtain the relevant information, we multiply the output of the tanh by the
output of the sigmoid. This product is the output of the input gate, which updates the cell state.
Output Gate:
The output gate produces the hidden state for the next time step. It has two parts. The first part
is a sigmoid function, which serves the same purpose as in the other two gates: to decide the
percentage of relevant information required. Then the newly updated cell state is passed through a
tanh function and multiplied by the output of the sigmoid function. This result is the new hidden
state.
Cell State:
The forget gate and input gate update the cell state. The cell state of the previous state is
multiplied by the output of the forget gate. The output of this state is then summed with
the output of the input gate. This value is then used in the calculation of hidden state in
the output gate.
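Putting the gates together, one LSTM time step can be sketched in a few lines of NumPy. This follows the standard LSTM formulation; the weight matrices, sizes, and random inputs are illustrative placeholders, not values from the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update; W maps [h_prev, x_t] to each gate's pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate: what to drop from the cell state
    i = sigmoid(W["i"] @ z + b["i"])   # input gate: how much new information to let in
    g = np.tanh(W["g"] @ z + b["g"])   # candidate values, mapped to [-1, 1]
    o = sigmoid(W["o"] @ z + b["o"])   # output gate: what to expose as the hidden state
    c = f * c_prev + i * g             # new cell state (long-term memory)
    h = o * np.tanh(c)                 # new hidden state (short-term memory)
    return h, c

hidden, inp = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden, hidden + inp)) for k in "figo"}
b = {k: np.zeros(hidden) for k in "figo"}
h, c = lstm_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden), W, b)
```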
The LSTM architecture is similar to that of an RNN, but instead of a simple feedback loop, each
recurrent unit is an LSTM cell. The sequence of LSTM cells in each layer is fed with the output of
the previous cell, which enables each cell to receive information about the previous inputs in the
sequence. A cyclic set of steps happens in each LSTM cell: the forget gate decides what to discard
from the cell state, the input gate decides what new information to add, the cell state is updated,
and the output gate produces the new hidden state.
This series of steps occurs in every LSTM cell. The intuition behind LSTM is that the cell state
and the hidden state carry the previous information and pass it on to future time steps. The cell
state aggregates information from all of the past data and is the long-term information retainer,
while the hidden state carries the output of the previous cell, i.e. the short-term memory. This
combination of long-term and short-term memory enables LSTMs to perform really well on time series
and sequence data.
Applications of LSTM
Language Modeling: LSTMs have been used to build language models that can generate
natural language text, such as in machine translation systems or chatbots.
Time series prediction: LSTMs have been used to model time series data and
make predictions about future values in the series. For example, LSTMs have
been used to predict stock prices or traffic patterns.
Sentiment analysis: LSTMs have been used to analyze the sentiment of text,
such as in social media posts or customer reviews.
Speech recognition: LSTMs have been used to build speech recognition systems
that can transcribe spoken language into text.
Image captioning: LSTMs have been used to generate descriptive captions for
images, such as in image search engines or automated image annotation systems.
LSTMs can be used for processing sequential data, such as natural language and speech,
by treating each word or phoneme as a separate input. The LSTM network then updates
the memory cell and generates an output for each input. The output can be used to predict
the next word or phoneme in the sequence, or to generate a new sequence altogether.
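For completeness, a minimal hedged Keras sketch of an LSTM applied to sequence data (here, next-value prediction on a univariate time series); the sequence length of 30, the 64 units, and the commented-out training arrays are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.LSTM(64, input_shape=(30, 1)),   # 30 time steps, 1 feature per step
    layers.Dense(1),                        # predict the next value in the series
])
model.compile(optimizer="adam", loss="mse")
# model.fit(x_sequences, y_next_values, epochs=10)
```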
7 Explain the concept of generative modeling in deep learning, and discuss the 10M
different types of generative models, such as variational autoencoders and
generative adversarial networks.
ANS:
Generative modeling in deep learning refers to the process of training a model to generate
new data samples that are similar to the original data distribution. In other words,
generative models learn the underlying patterns and structures of the data and use them to
generate new samples that are statistically similar to the original data.
There are different types of generative models in deep learning, but two of the most
popular ones are Variational Autoencoders (VAEs) and Generative Adversarial Networks
(GANs).
Generative models have numerous applications in deep learning, including image and
video synthesis, natural language processing, music generation, and data augmentation.
They are also used for anomaly detection, where the generative model is trained on
normal data and can then be used to detect anomalies or outliers in new data.
In summary, generative modeling in deep learning refers to the process of training a
model to generate new data samples that are statistically similar to the original data.
VAEs and GANs are two popular types of generative models that are used for different
types of data and applications. Generative models have numerous applications in deep
learning, including image and video synthesis, natural language processing, and data
augmentation.
8 Explain the working principle of variational autoencoders, and explain how they 10M
can be used for generative modeling tasks, such as image generation and data
compression.
ANS:
Variational Autoencoders (VAEs) are a type of autoencoder that can be used for
generative modeling tasks, such as image generation and data compression. VAEs work
by learning a compressed representation of the input data in a lower-dimensional latent
space, which can be used to generate new data samples that are similar to the original
data.
1. Encoding: The input data is first encoded into a lower-dimensional latent space using an
encoder network. The encoder network maps the input data to the mean and standard
deviation of a multivariate normal distribution that represents the latent space.
2. Sampling: A sample is drawn from the latent space using the mean and standard
deviation generated by the encoder network. This sample is then used as the input to the
decoder network.
3. Decoding: The decoder network then maps the sample from the latent space back to the
original input space. The output of the decoder network is a reconstruction of the input
data.
4. Loss function: The VAE is trained to minimize the difference between the input data and
the reconstructed data, as well as to minimize the difference between the latent space
distribution and a standard normal distribution. This is done by minimizing a loss
function that consists of two terms: the reconstruction loss and the KL-divergence loss.
The reconstruction loss measures the difference between the input data and the
reconstructed data, while the KL-divergence loss measures the difference between the
latent space distribution and a standard normal distribution. The KL-divergence loss
encourages the VAE to learn a smooth and regular latent space that can be easily sampled
to generate new data samples.
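To make the four steps concrete, the following is a minimal VAE sketch in PyTorch; the flattened 28x28 input size, layer widths, and latent dimension are assumptions made only for illustration:

# Minimal VAE: encoder -> (mu, logvar), reparameterized sample, decoder, combined loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)        # mean of the latent distribution
        self.logvar = nn.Linear(h_dim, z_dim)    # log-variance of the latent distribution
        self.dec1 = nn.Linear(z_dim, h_dim)
        self.dec2 = nn.Linear(h_dim, x_dim)

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def decode(self, z):
        return torch.sigmoid(self.dec2(F.relu(self.dec1(z))))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample
        return self.decode(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    rec = F.binary_cross_entropy(x_rec, x, reduction="sum")        # reconstruction loss
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL to a standard normal
    return rec + kl

x = torch.rand(16, 784)                    # stand-in batch of flattened images in [0, 1]
x_rec, mu, logvar = VAE()(x)
loss = vae_loss(x, x_rec, mu, logvar)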
VAEs can be used for generative modeling tasks, such as image generation and data
compression, by sampling from the learned latent space and decoding the samples into
new data samples. To generate new data samples, the VAE can be trained to learn the
distribution of the original data and generate new samples from the learned distribution.
In image generation, the VAE can be trained on a dataset of images and then used to
generate new images by sampling from the learned latent space and decoding the samples
into new images. The VAE can also be used for data compression by encoding the input
data into the learned latent space and then decoding the samples back into the original
input space. The compressed data can be stored in a lower-dimensional space, which can
save storage and computational resources.
In summary, VAEs are a type of autoencoder that can be used for generative modeling
tasks, such as image generation and data compression. VAEs learn a compressed
representation of the input data in a lower-dimensional latent space, which can be used to
generate new data samples that are similar to the original data. VAEs are trained to
minimize a loss function that consists of two terms: the reconstruction loss and the KL-
divergence loss.
9 Discuss the working principle of generative adversarial networks, and explain how 10M
they can be used for generative modeling tasks, such as image generation and data
synthesis.
ANS:
Generative Adversarial Networks (GANs) are a type of generative model that can be used
for generating new data samples, such as images or text. The basic idea behind GANs is
to train two neural networks, a generator and a discriminator, in a two-player minimax
game.
The generator network is trained to produce fake data samples that are similar to the real
data samples, while the discriminator network is trained to distinguish between the real
and fake data samples. The two networks are trained simultaneously, with the generator
network attempting to fool the discriminator network, and the discriminator network
trying to correctly classify the real and fake data samples.
1. Generator network: The generator network takes random noise as input and generates a
fake data sample, such as an image or a text sequence.
2. Discriminator network: The discriminator network takes a data sample, either real or
fake, as input and produces a binary output indicating whether the sample is real or fake.
3. Minimax game: The generator and discriminator networks are trained in a two-player
minimax game, where the generator tries to produce fake data samples that can fool the
discriminator, and the discriminator tries to correctly distinguish between real and fake
data samples.
4. Loss function: The generator and discriminator networks are trained to minimize
different loss functions. The generator network is trained to maximize the probability of
the discriminator network making a mistake, while the discriminator network is trained to
maximize the probability of correctly classifying real and fake data samples.
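The following is a minimal PyTorch sketch of one training step that follows the four points above; the fully connected architectures, learning rates, and flattened image size are illustrative assumptions rather than a prescribed design:

# One GAN training step: update the discriminator, then the generator.
import torch
import torch.nn as nn

x_dim, z_dim = 784, 100
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                                   # real: (batch, x_dim) real samples
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: classify real samples as 1 and generated samples as 0.
    fake = G(torch.randn(batch, z_dim)).detach()
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator output 1 on generated samples.
    fake = G(torch.randn(batch, z_dim))
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

real_batch = torch.rand(32, x_dim) * 2 - 1              # stand-in batch of normalized images
d_loss, g_loss = train_step(real_batch)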
GANs can be used for generative modeling tasks such as image generation and data
synthesis by training the generator network to produce new data samples that are similar
to the real data samples. After training, the generator network can be used to generate
new data samples by sampling random noise as input and producing fake data samples as
output.
In image generation, GANs have been used to generate realistic images that can be used
for various applications such as data augmentation, style transfer, and image editing.
GANs can also be used for data synthesis, where the generator network can be trained to
generate new data samples in a dataset with limited samples.
One of the challenges with GANs is the training instability due to the non-convex nature
of the minimax game. The generator network may produce samples that are too similar,
leading to a mode collapse, or too dissimilar, leading to poor quality samples. Various
techniques have been proposed to stabilize the training of GANs, such as using different
loss functions, regularization techniques, and architecture modifications.
In summary, GANs are a type of generative model that can be used for generating new
data samples, such as images or text. GANs train a generator and a discriminator network
in a two-player minimax game, where the generator network tries to produce fake data
samples that can fool the discriminator network, and the discriminator network tries to
correctly distinguish between real and fake data samples. GANs can be used for
generative modeling tasks such as image generation and data synthesis, but their training
stability can be challenging.
10 Compare and contrast the momentum optimizer, RMSProp, and Adam 10M
optimization algorithms in deep learning.
ANS:
Momentum
The gradient descent with momentum algorithm (or Momentum for short) borrows the
idea from physics. Imagine rolling a ball down the inside of a frictionless bowl. Instead of
stopping at the bottom, the momentum it has accumulated pushes it forward, and the ball
keeps rolling back and forth.
We can apply the concept of momentum to our vanilla gradient descent algorithm. In each
step, in addition to the regular gradient, it also adds on the movement from the previous
step. Mathematically, it is commonly expressed as:
delta = -learning_rate * gradient + previous_delta * decay_rate   (eq. 1)
theta += delta   (eq. 2)
I found it more intuitive if I massage this equation a little and keep track of the (decayed)
cumulative sum of gradients instead. This will also make things easier when we introduce
the Adam algorithm later:
sum_of_gradient = gradient + previous_sum_of_gradient * decay_rate   (eq. 3)
delta = -learning_rate * sum_of_gradient   (eq. 4)
theta += delta   (eq. 5)
(What I did was factoring out -learning_rate. To see the mathematical equivalence, you
can substitute delta with -learning_rate * sum_of_gradient in eq. 1 to get eq. 3.)
Let’s consider two extreme cases to understand this decay rate parameter better. If the
decay rate is 0, then it is exactly the same as (vanilla) gradient descent. If the decay rate is
1 (and provided that the learning rate is reasonably small), then it rocks back and forth
endlessly like the frictionless bowl analogy we mentioned in the beginning; you do not
want that. Typically the decay rate is chosen around 0.8–0.9; it is like a surface with a little
bit of friction, so the ball slows down and eventually stops.
Compare Momentum with vanilla gradient descent on a surface with a global minimum (a
deep well) and a local minimum (a shallow well). In what ways is Momentum better than
vanilla gradient descent?
1. Momentum simply moves faster, because of all the momentum it accumulates.
2. Momentum has a shot at escaping local minima, because the momentum it has built up
may propel it out of a shallow local minimum; in a similar vein, it also powers through
plateaus better.
RMSProp
AdaGrad adapts the step size for each parameter by dividing the learning rate by the square
root of the accumulated sum of gradient squared, so a parameter that has already received
large gradients takes smaller steps. The problem of AdaGrad, however, is that it becomes
incredibly slow, because the sum of gradient squared only grows and never shrinks.
RMSProp (for Root Mean Square Propagation) fixes this issue by adding a decay factor:
sum_of_gradient_squared = previous_sum_of_gradient_squared * decay_rate + gradient² * (1 - decay_rate)
delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
theta += delta
More precisely, the sum of gradient squared is actually the decayed sum of gradient
squared. The decay rate is saying that only the recent gradient² matters, and the ones from
long ago are basically forgotten. As a side note, the term “decay rate” is a bit of a misnomer.
Unlike the decay rate we saw in momentum, in addition to decaying, the decay rate here
also has a scaling effect: it scales down the whole term by a factor of (1 - decay_rate). In
other words, if the decay_rate is set at 0.99, the sum of gradient squared is scaled down by a
factor of (1 - 0.99) = 0.01 relative to AdaGrad; its square root, which divides the gradient in
the update, is then about sqrt(0.01) = 0.1 of AdaGrad's, and thus the step is on the order of
10x larger for the same learning rate.
To see the effect of the decay, compare AdaGrad and RMSProp head to head with tuned
learning rates and decay rates. AdaGrad keeps up with RMSProp initially, but its sums of
gradient squared accumulate so fast that they soon become enormous; they take a heavy toll
on the step size, and eventually AdaGrad practically stops moving. RMSProp, on the other
hand, keeps the sums of gradient squared at a manageable size the whole time, thanks to the
decay factor, and so it keeps making steady progress.
Adam
Last but not least, Adam (short for Adaptive Moment Estimation) takes the best of both
worlds of Momentum and RMSProp. Adam empirically works well, and in recent years it
has become a common default optimizer for deep learning problems:
sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 - beta1)   [Momentum]
sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1 - beta2)   [RMSProp]
delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)
theta += delta
Beta1 is the decay rate for the first moment, the sum of gradient (aka momentum), commonly
set at 0.9. Beta2 is the decay rate for the second moment, the sum of gradient squared, and it
is commonly set at 0.999.
Adam gets the speed from momentum and the ability to adapt gradients in different
directions from RMSProp. The combination of the two makes it powerful.
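Putting the three update rules side by side, the NumPy sketch below implements them in the same sum-of-gradient notation used above; the toy objective and hyperparameters are only for illustration, and Adam's bias-correction terms are omitted for brevity:

# Momentum, RMSProp, and Adam written as single update steps.
import numpy as np

def momentum_step(theta, grad, state, lr=0.01, decay=0.9):
    state["sum_g"] = grad + decay * state.get("sum_g", 0.0)            # decayed sum of gradients
    return theta - lr * state["sum_g"]

def rmsprop_step(theta, grad, state, lr=0.01, decay=0.99, eps=1e-8):
    state["sum_g2"] = decay * state.get("sum_g2", 0.0) + (1 - decay) * grad**2
    return theta - lr * grad / (np.sqrt(state["sum_g2"]) + eps)        # per-parameter scaling

def adam_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad      # first moment (momentum)
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad**2   # second moment (RMSProp)
    return theta - lr * state["m"] / (np.sqrt(state["v"]) + eps)       # bias correction omitted

theta, state = np.array([5.0]), {}
for _ in range(100):
    grad = 2 * theta                    # gradient of the toy objective f(theta) = theta**2
    theta = adam_step(theta, grad, state)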
The momentum optimizer, RMSProp, and Adam optimization algorithms are all
commonly used optimization techniques in deep learning. They differ in their update
rules and the information they use to update the parameters of the model. Here is a
comparison of the three algorithms:
1. Momentum optimizer: The momentum optimizer adds a fraction of the previous update
to the current update in order to accelerate the learning process. It helps to smooth out
oscillations and reduce the noise in the gradients. The momentum optimizer keeps track
of an exponentially decaying moving average of past gradients and uses this information
to update the parameters. However, it does not take into account the magnitude of the
gradients, which can lead to overshooting the minimum.
2. RMSProp: RMSProp is an optimization algorithm that adjusts the learning rate for each
parameter based on the average of the magnitudes of recent gradients for that parameter.
It helps to adjust the learning rate to the scale of the gradients and to avoid overshooting
the minimum. It uses a moving average of squared gradients to normalize the gradient
updates. However, it may suffer from slow convergence when the gradients have a high
variance.
3. Adam: Adam is an adaptive learning rate optimization algorithm that combines the ideas
of both momentum and RMSProp. It keeps track of both the first and second moments of
the gradient and adapts the learning rate based on the variance of the gradients. It uses a
biased estimate of the first moment and a biased estimate of the second moment of the
gradients to calculate the update. It helps to adjust the learning rate to the scale of the
gradients and reduces the noise in the gradients. It has been shown to work well in
practice and is widely used in deep learning applications.
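In practice, most frameworks provide these optimizers out of the box. The snippet below shows how they are typically selected in PyTorch; the tiny model, data, and learning rates are placeholders rather than recommendations:

# Choosing between momentum SGD, RMSProp, and Adam in PyTorch.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # tiny stand-in model
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)          # momentum
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)     # RMSProp
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))  # Adam

loss = nn.MSELoss()(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()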
In summary, the momentum optimizer, RMSProp, and Adam optimization algorithms all
aim to improve the convergence of the gradient descent optimization process. The
momentum optimizer uses past gradients to accelerate learning, but does not take into
account the magnitude of the gradients. RMSProp adjusts the learning rate for each
parameter based on the average of the magnitudes of recent gradients, while Adam
combines both momentum and RMSProp to adapt the learning rate based on the variance
of the gradients. In general, Adam is considered to be the most effective optimization
algorithm in deep learning, but the best choice of optimizer can depend on the specific
problem and model being trained.
UNIT V
Image segmentation – Object Detection – Automatic Image Captioning – Image generation with
Generative adversarial networks – Video to Text with LSTM models – Attention models for Computer
Vision – Self-Organizing Maps - Restricted Boltzmann Machines - Long Short-Term Memory
Networks (LSTM) and Deep Reinforcement Learning.
Case Study: Named Entity Recognition – Opinion Mining using Recurrent Neural Networks – Parsing
and Sentiment Analysis using Recursive Neural Networks – Sentence Classification using Convolutional
Neural Networks – Dialogue Generation with LSTMs.
1 What is image segmentation, and how is it used in computer vision applications? 4M
ANS:
Image segmentation is the process of dividing an image into multiple segments or
regions, each of which corresponds to a different object or part of the image. The goal of
image segmentation is to simplify or change the representation of an image into
something that is more meaningful and easier to analyze. Image segmentation is a
fundamental task in computer vision and is used in a variety of applications such as
object recognition, scene understanding, medical image analysis, and autonomous
vehicles.
The segmentation process typically involves two main steps: 1) grouping similar pixels
together to form a region, and 2) assigning a label to each region. The grouping of pixels
can be based on a number of criteria such as color, texture, brightness, or a combination
of these features. Once the regions are identified, a label is assigned to each region based
on its characteristics, such as the object it belongs to or the background of the image.
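As a simple illustration of grouping similar pixels, the sketch below clusters pixels by colour with k-means (assuming scikit-learn is available; the random image and the choice of four clusters are stand-ins for a real image and a task-specific number of regions):

# Colour-based segmentation by clustering pixels.
import numpy as np
from sklearn.cluster import KMeans

image = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)   # stand-in for a real RGB image
pixels = image.reshape(-1, 3).astype(float)                      # one row per pixel (colour features)

kmeans = KMeans(n_clusters=4, n_init=10).fit(pixels)             # group pixels into 4 colour clusters
labels = kmeans.labels_.reshape(image.shape[:2])                 # per-pixel region label (the segments)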
2 What is the difference between object detection and image segmentation? 4M
ANS:
Object detection differs from image segmentation in that it aims to identify individual
objects within an image, rather than dividing the image into regions or segments based on
their properties. Object detection involves two main tasks: object localization and object
classification.
Object localization involves determining the location of an object within an image. This
is typically done by identifying the boundaries of the object using a bounding box or a
contour. Object classification involves assigning a label or class to the object based on its
characteristics, such as its shape or color.
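As an illustration of localization plus classification, the sketch below runs a pre-trained Faster R-CNN detector from torchvision (assuming a recent torchvision version; the random tensor stands in for a real photograph):

# Off-the-shelf object detection: bounding boxes (localization) and class labels (classification).
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)               # stand-in RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])[0]           # one dict of predictions per input image

print(predictions["boxes"].shape)             # bounding boxes (localization)
print(predictions["labels"][:5])              # class indices (classification)
print(predictions["scores"][:5])              # confidence scores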
In summary, object detection and image segmentation are both important computer vision
tasks, but they differ in their goals and methods. Object detection is focused on
identifying and localizing individual objects within an image, while image segmentation
is focused on dividing an image into regions or segments based on their properties.
3 How is automatic image captioning achieved using deep learning models, and what 4M
are some potential applications of this technology?
ANS:
Automatic image captioning is a task in which a deep learning model is trained to
generate a natural language description of an image. The goal of automatic image
captioning is to produce a caption that accurately reflects the content of the image, and is
grammatically and semantically correct.
A common approach is to combine a convolutional neural network (CNN), which extracts a
feature representation of the image, with a recurrent neural network (RNN) that generates
the caption. The RNN is typically a type of neural network known as a long short-term memory
(LSTM) network, which is capable of modeling sequences of variable length. The LSTM
network generates the caption one word at a time, using the feature representation from
the CNN as input to each step of the sequence.
To train the model, a dataset of images and corresponding captions is typically used. The
model is trained to predict the caption given the image, and the parameters of the model
are adjusted to minimize the difference between the predicted caption and the actual
caption.
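A minimal sketch of this encoder-decoder idea in PyTorch is shown below; the ResNet-18 backbone, vocabulary size, and dimensions are assumptions for illustration, and in a real system the captions would come from a labelled dataset:

# CNN encoder + LSTM decoder for image captioning (illustrative sizes only).
import torch
import torch.nn as nn
import torchvision

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)   # image -> feature vector
        self.cnn = backbone
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.cnn(images).unsqueeze(1)                  # (batch, 1, embed_dim)
        words = self.embed(captions)                           # (batch, T, embed_dim)
        h, _ = self.lstm(torch.cat([feats, words], dim=1))     # image feature starts the sequence
        return self.out(h)                                     # logits over the vocabulary

model = CaptionModel(vocab_size=5000)
logits = model(torch.rand(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))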
Automatic image captioning has a variety of potential applications, such as image and
video indexing, assistive technology for the visually impaired, and content-based image
retrieval. For example, image and video indexing could be used to automatically tag and
organize large collections of images and videos, making them easier to search and
browse. Assistive technology for the visually impaired could use image captions to
provide descriptions of visual content, allowing blind and visually impaired individuals to
better understand and interact with the world around them. Content-based image retrieval
could use image captions to improve the accuracy and relevance of search results, by
allowing users to search for images based on their content, rather than just their file
names or metadata.
4 What is the working principle of generative adversarial networks, and how are they 4M
used for image generation tasks?
ANS:
Generative Adversarial Networks (GANs) are a type of deep learning model that can
generate new data, such as images, by learning from a training dataset. GANs consist of
two neural networks: a generator and a discriminator. The generator is trained to generate
new data that resembles the training data, while the discriminator is trained to distinguish
between the generated data and the real data.
The generator takes a random input, typically noise, and generates a new sample that
resembles the training data. The discriminator then takes both the generated sample and a
real sample from the training data, and tries to distinguish between them. The
discriminator provides feedback to the generator on how well its generated samples
resemble the real data. The generator then uses this feedback to improve its samples, and
the process is repeated until the generated data is indistinguishable from the real data.
The training process of GANs can be difficult, as the generator and discriminator are
constantly trying to outsmart each other. However, when trained successfully, GANs can
generate highly realistic images that are difficult to distinguish from real images.
GANs can be used for a variety of image generation tasks, such as creating realistic
images of faces, landscapes, or objects. They can also be used for tasks such as image
inpainting, where missing parts of an image are filled in based on the surrounding
context, or image super-resolution, where low-resolution images are transformed into
higher-resolution versions.
The working principle of LSTM models is similar to that of RNNs, but they are designed
to handle the vanishing gradient problem that occurs in RNNs, which can make it
difficult for the network to learn long-term dependencies. LSTM models use a system of
gates that regulate the flow of information through the network, allowing it to selectively
retain or forget information over time.
Recurrent neural networks (RNNs) are a type of neural network that are well-suited for
processing sequential data, such as text. One popular approach to NER using RNNs is the
use of a bi-directional LSTM (BLSTM) model. In a BLSTM model, the input text is first
encoded into a sequence of feature vectors, which are then fed into two separate LSTM
layers. One LSTM layer processes the input sequence in the forward direction, while the
other processes it in the backward direction. The outputs of the two LSTM layers are then
concatenated to produce a final output sequence, which is fed into a softmax layer for
classification.
During training, the BLSTM model is trained on a dataset of labeled examples, where
each example consists of a sequence of words and their corresponding labels. The model
is trained to predict the label for each word in the input sequence, based on its context
within the sentence.
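The sketch below is a minimal PyTorch version of such a BLSTM tagger; the vocabulary size, the nine-label tag set, and the dimensions are hypothetical:

# Bi-directional LSTM tagger: one label prediction per word.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)                # forward + backward LSTMs
        self.out = nn.Linear(2 * hidden_dim, num_tags)         # concatenated directions -> tag scores

    def forward(self, tokens):                                 # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)                                     # per-word logits over the tag set

model = BiLSTMTagger(vocab_size=20000, num_tags=9)             # e.g. BIO tags for 4 entity types + O
tokens = torch.randint(0, 20000, (4, 15))
tags = torch.randint(0, 9, (4, 15))
loss = nn.CrossEntropyLoss()(model(tokens).reshape(-1, 9), tags.reshape(-1))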
Overall, NER is a fundamental task in natural language processing, and the use of RNNs,
especially BLSTMs, has proven to be effective in achieving state-of-the-art performance
on various NER benchmarks.
7 Discuss the use of recursive neural networks in parsing and sentiment analysis tasks 10M
with examples.
ANS:
Recursive neural networks (RecNNs) are a class of neural networks that can process tree-
structured data, such as sentences or parse trees. They can be used for various natural
language processing tasks, including parsing and sentiment analysis.
In parsing, RecNNs can be used to generate parse trees for input sentences. Parse trees
represent the syntactic structure of sentences, and they are useful for many downstream
NLP tasks, such as information extraction and question answering. RecNNs can be
trained on annotated datasets to predict the parse trees for new sentences. The basic idea
is to use a recursive function to compute representations for sub-trees and combine them
to obtain representations for larger trees. A popular RecNN architecture for parsing is the
Recursive Autoencoder (RAE), which uses an encoder-decoder framework to reconstruct
the input tree.
In sentiment analysis, RecNNs can be used to classify the sentiment of input sentences or
documents. Sentiment analysis is the task of determining the polarity of a text, i.e.,
whether it expresses a positive, negative, or neutral sentiment. RecNNs can be trained on
annotated datasets to predict the sentiment labels for new sentences. The basic idea is to
use a recursive function to compute representations for sub-sentences and combine them
to obtain representations for larger sentences. A popular RecNN architecture for
sentiment analysis is the Recursive Neural Tensor Network (RNTN), which uses a
tensor-based composition function to combine the representations of sub-sentences.
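The core recursive composition can be sketched as follows (a simplified W[left; right] composition rather than the full RNTN tensor; the dimensions, embeddings, and the tiny example tree are assumptions for illustration):

# Recursively compute a vector for every node of a binary parse tree.
import torch
import torch.nn as nn

d = 50                                              # dimensionality of node vectors
W = nn.Linear(2 * d, d)                             # composition weights (learned in practice)
word_vec = nn.Embedding(10000, d)                   # hypothetical word embeddings

def node_vector(tree):
    # tree is either a word index (leaf) or a (left, right) pair of subtrees.
    if isinstance(tree, int):                       # leaf: look up the word vector
        return word_vec(torch.tensor([tree])).squeeze(0)
    left, right = tree
    children = torch.cat([node_vector(left), node_vector(right)])
    return torch.tanh(W(children))                  # parent vector from its two children

# ("not", ("very", "good")) written as word indices -> a single phrase vector
sentence_vec = node_vector((1, (2, 3)))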
Here are some examples of how RecNNs can be used in parsing and sentiment analysis:
1. Parsing: A recursive-network variant of the Stanford Parser (the compositional vector
grammar parser) uses RecNNs to score candidate parse trees for English sentences. It is
trained on the Penn Treebank dataset, which contains annotated parse trees for a large
corpus of sentences.
2. Sentiment analysis: The Recursive Neural Tensor Network (RNTN) has been used for
sentiment analysis on various datasets, including the Stanford Sentiment Treebank and
the Movie Review dataset, where it achieved state-of-the-art performance at the time of its
introduction.
In summary, RecNNs are a powerful tool for processing tree-structured data in NLP
tasks, and they have been successfully applied to parsing and sentiment analysis. They
can be trained on annotated datasets to learn to generate parse trees or classify the
sentiment of sentences, and they have achieved state-of-the-art performance on various
benchmarks.
8 Explain the concept of sentence classification, and discuss how it can be achieved 10M
using convolutional neural networks.
ANS:
Sentence classification is the task of assigning a label or category to a given sentence
based on its content. This task is commonly used in natural language processing
applications such as sentiment analysis, topic classification, and spam detection.
Convolutional neural networks (CNNs) have been shown to be effective for sentence
classification tasks. The key idea behind using CNNs for sentence classification is to treat
the sentence as a one-dimensional signal, with each word in the sentence being a single
element of the signal. This allows us to use the same convolutional filters across the
entire sentence, capturing local patterns and features in the sentence.
One popular variant of the basic CNN architecture for sentence classification is the Kim
model, named after its creator, Yoon Kim. The Kim model uses multiple filters of
different sizes, capturing both local and global features of the sentence. It also uses
dropout regularization to prevent overfitting.
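A minimal PyTorch sketch of this Kim-style classifier is given below; the vocabulary size, filter counts and widths, and the two-class output are illustrative assumptions:

# CNN sentence classifier: parallel convolutions over word embeddings, max-pooled over time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=100, n_filters=64, widths=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, w) for w in widths])    # one conv per filter width
        self.drop = nn.Dropout(0.5)                                   # dropout regularization
        self.fc = nn.Linear(n_filters * len(widths), num_classes)

    def forward(self, tokens):                           # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)           # (batch, embed_dim, seq_len) for Conv1d
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]   # max over time
        return self.fc(self.drop(torch.cat(pooled, dim=1)))

model = KimCNN(vocab_size=20000, num_classes=2)          # e.g. positive vs. negative sentiment
logits = model(torch.randint(0, 20000, (8, 30)))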
9 Discuss the working principle of dialogue generation with LSTMs, and explain how 10M
LSTMs can be used to generate natural language responses in dialogue systems.
ANS:
Dialogue generation is the task of generating natural language responses in a
conversational setting, typically in response to user input. Long Short-Term Memory
(LSTM) networks are a type of recurrent neural network (RNN) that have been shown to
be effective for generating natural language responses in dialogue systems. In this
answer, we will discuss the working principle of dialogue generation with LSTMs and
explain how LSTMs can be used to generate natural language responses in dialogue
systems.
LSTMs are a type of RNN that can capture long-term dependencies in sequential data,
making them well-suited for natural language processing tasks such as dialogue
generation. The basic idea behind LSTMs is to use a set of learnable gates to control the
flow of information through the network, allowing it to selectively remember or forget
information at each time step. The key components of an LSTM network include an input
gate, a forget gate, a memory cell, and an output gate.
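The sketch below shows one LSTM time step in NumPy, with the gates written out explicitly; the random weight matrices stand in for learned parameters, and biases are omitted for brevity:

# One LSTM time step: the gates control what is written to, erased from, and read out of the cell.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h = 10, 20                                      # input and hidden sizes (illustrative)
rng = np.random.default_rng(0)
Wi, Wf, Wo, Wc = (rng.normal(size=(d_h, d_in + d_h)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])                     # combined input and previous hidden state
    i = sigmoid(Wi @ z)                                 # input gate: what to write
    f = sigmoid(Wf @ z)                                 # forget gate: what to erase
    o = sigmoid(Wo @ z)                                 # output gate: what to reveal
    c = f * c_prev + i * np.tanh(Wc @ z)                # updated memory cell
    h = o * np.tanh(c)                                  # new hidden state / output
    return h, c

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h))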
In a dialogue system, LSTMs are typically arranged in a sequence-to-sequence (seq2seq)
architecture: an encoder LSTM reads the user input and summarizes it into a fixed-length
vector, and a decoder LSTM generates the response one word at a time, conditioned on that
vector. During training, the model is trained to maximize the likelihood of generating the
correct response given the user input. This is done by minimizing the cross-entropy loss
between the predicted and actual responses. During inference, the model uses beam search
or sampling to generate a response based on the learned distribution of responses.
However, there are also some challenges associated with LSTM-based dialogue
generation. For example, the generated responses may lack coherence or may be
inappropriate or offensive. To address these issues, researchers have proposed various
techniques such as reinforcement learning, adversarial training, and human-in-the-loop
approaches.
In conclusion, LSTMs are an effective tool for generating natural language responses in
dialogue systems. By using an LSTM-based seq2seq model, we can encode the user input
into a fixed-length vector and use it to generate a response. Although there are some
challenges associated with LSTM-based dialogue generation, it has the potential to
improve the user experience in conversational settings and has applications in various
domains.
10 Discuss the use of deep reinforcement learning in robotics applications. Explain the 10M
working principle of deep reinforcement learning, and provide examples of its
applications in robotics.
ANS:
Deep reinforcement learning (DRL) is a subfield of machine learning that combines the
principles of deep learning and reinforcement learning to enable artificial agents to learn
and adapt to complex environments through trial and error. In robotics applications, DRL
can be used to train robots to perform complex tasks, such as object manipulation,
navigation, and assembly, without explicit programming.
The working principle of DRL is based on the concept of reinforcement learning, where
an agent learns to maximize a reward signal by taking actions in an environment. In
DRL, the agent is a deep neural network that takes input from the environment, such as
sensor data or images, and produces actions based on its learned policy. The policy is
optimized by backpropagating the gradients of an objective derived from the reward signal
(for example, a policy-gradient loss), where the reward is obtained by evaluating the
performance of the agent in the environment.
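The sketch below illustrates this loop with a REINFORCE-style policy-gradient update on a hypothetical one-dimensional "reach the target" environment; in a real robotics setting the toy environment would be replaced by sensor observations and motor commands:

# A neural policy trained by policy gradient on a toy environment (illustrative only).
import torch
import torch.nn as nn

class ToyEnv:
    # The agent sits at a 1-D position and should move towards 0; reward grows as it gets closer.
    def reset(self):
        self.pos = torch.randn(1)
        return self.pos
    def step(self, action):                       # action 0 = move left, 1 = move right
        self.pos = self.pos + (0.1 if action == 1 else -0.1)
        return self.pos, float(-abs(self.pos))    # next state, reward

policy = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))   # state -> action logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
env = ToyEnv()

for episode in range(200):
    state, log_probs, rewards = env.reset(), [], []
    for t in range(20):                           # roll out one episode by trial and error
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward = env.step(action.item())
        rewards.append(reward)
    episode_return = sum(rewards)                 # undiscounted return, for brevity
    loss = -episode_return * torch.stack(log_probs).sum()   # REINFORCE: reinforce by return
    opt.zero_grad()
    loss.backward()
    opt.step()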
One example of DRL in robotics is the training of robots to grasp objects. In this
application, the robot arm is equipped with sensors that provide it with information about
the position and orientation of the objects. The DRL agent, which is implemented as a
deep neural network, takes this information as input and outputs a set of joint angles that
move the robot arm towards the target object. The reward signal is based on the success
of the grasp, which is determined by the force sensors attached to the robot arm.
Another example of DRL in robotics is the training of autonomous robots for navigation.
In this application, the robot is equipped with sensors that provide it with information
about the environment, such as lidar or camera data. The DRL agent takes this
information as input and outputs a set of actions, such as turning, moving forward, or
stopping. The reward signal is based on the success of the navigation task, which is
determined by the distance travelled or the time taken to reach the goal.
In conclusion, DRL has shown great potential in robotics applications, enabling robots to
learn and adapt to complex environments through trial and error. By combining the
principles of deep learning and reinforcement learning, DRL can help to overcome the
limitations of traditional programming approaches and enable the development of more
flexible and intelligent robots.