Deep Learning c1
● Data Analysts
● Statisticians
● Deep Learning Introduction
Artificial intelligence is the ability of a machine to mimic intelligent human behavior. Machine learning enables a system to learn automatically from experience. Deep learning is a type of machine learning that trains models using deep neural networks and sophisticated algorithms.
Why deep learning
Deep learning excels at identifying intricate patterns in large amounts of data, which is why it performs so well in these settings.
It can handle tasks like speech and image recognition and language translation, and it can play chess and Go at superhuman level.
Deep learning can sometimes surpass conventional machine learning techniques.
Use traditional machine learning when you have a clear understanding of the features that matter or only a limited dataset.
Use deep learning when you have a lot of data and want the model to identify the key patterns and features automatically. Deep learning is especially effective for tasks like speech and image recognition.
Human Neural network
● Synapses: Synapses allow neurons to communicate with one another. Synapses are intersections where one
neuron's axon joins with another's dendrite or cell body. Chemical molecules called neurotransmitters pass
signals between these synapses. Synapses can vary over time in terms of their potency and effectiveness, which
is essential for memory and learning.
● Neural circuits: Neural circuits or networks are made up of neurons. These circuits are in charge of processing
information, carrying out several tasks, and managing body processes. Specific brain functions are linked to
various brain areas. For instance, the occipital lobe is in charge of processing visual data, whereas the frontal
lobe is connected to higher-order cognitive abilities.
● Learning and Memory: Learning and memory are essential capabilities of the human brain's networks. They involve changes in the strength and connectivity of synapses. For long-term memories to form, specific neural pathways often must be stimulated repeatedly and consistently.
Deep Neural network
Input Layer:
Each neuron in the input layer represents a feature or input variable; the input layer is the network's point of entry for the data it processes. The dimensionality of the input data determines the number of input neurons.
● In image classification tasks, each neuron in the input layer might represent a pixel in an image.
● In natural language processing tasks, each neuron might represent a word, a character, or a word embedding.
● In tabular data tasks, each neuron could correspond to a feature or attribute.
Components- Hidden layer
Hidden Layer:
There may be one or more hidden layers between the input layer and the output layer, each made up of many neurons (nodes). These layers are responsible for discovering intricate patterns and features in the incoming data; it is the presence of hidden layers that allows the network to approximate complex, non-linear functions.
Output Layer:
The output layer produces the network's predictions or outputs. The number of neurons in the output layer depends on the nature of the task:
● For binary classification, there is typically one output neuron, with the output representing the probability of
belonging to one class.
● For multi-class classification, there are as many output neurons as there are classes, often using softmax
activation to produce class probabilities.
● For regression tasks, there is typically one output neuron per continuous target variable.
Components- Weights
Note: The vanishing gradient problem is a challenge that arises during the training of deep neural networks, particularly those with many layers. It occurs when the gradients of the
loss function with respect to the model's parameters (weights and biases) become very small as they are propagated backward through the network during the training process.
When gradients become vanishingly small, it can lead to slow convergence and difficulty in training deep networks effectively.
Components- Backpropagation
Backpropagation computes the gradients of the loss function with respect to every weight and bias by applying the chain rule backward through the network. Gradient optimization algorithms, also known as optimizers, then use these gradients to train machine learning models, including neural networks: they aim to find good values of the model's parameters (weights and biases) by iteratively updating them based on the gradients of the loss function with respect to those parameters. Common optimizers include:
● Gradient Descent
● Stochastic Gradient Descent (SGD)
● SGD with Momentum
● Mini-batch Gradient Descent
● Adagrad (Adaptive Gradient Algorithm)
● RMSprop (Root Mean Square Propagation)
● Adam (Adaptive Moment Estimation)
Components- Gradient optimization- GD
Gradient descent is an optimization technique used to update model parameters. It operates by repeatedly moving the parameters in the direction opposite to the gradient, which is the direction of steepest descent of the loss. The objective is to find the set of parameters that minimizes the loss function. Gradient descent is widely used to train neural networks and other machine learning models: starting from an initial guess, it iteratively adjusts the parameters until the function reaches a (local) minimum.
The learning rate is a hyperparameter of gradient descent that governs how large each parameter update is. Computing the gradients is expensive when the dataset is huge. Gradient descent works well for convex functions, but for non-convex functions it can get stuck in local minima and it is not obvious how far to travel along the gradient.
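A minimal sketch of plain (batch) gradient descent for a toy linear-regression problem in NumPy; the data, learning rate, and number of steps are illustrative assumptions rather than values from the slides.

import numpy as np

# toy regression data: y ≈ 3x plus noise (illustrative)
X = np.random.rand(100, 1)
y = 3 * X[:, 0] + 0.1 * np.random.randn(100)

w, b = 0.0, 0.0     # model parameters
lr = 0.1            # learning rate (hyperparameter)

for step in range(500):
    error = (w * X[:, 0] + b) - y
    # gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * X[:, 0])
    grad_b = 2 * np.mean(error)
    # step in the direction opposite to the gradient (steepest descent)
    w -= lr * grad_w
    b -= lr * grad_b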
Components- Gradient optimization- SGD
Unlike classical gradient descent, which computes the average gradient over the whole dataset, stochastic gradient
descent adjusts the model parameters at each iteration using random examples or subsets of the training data. This
stochastic feature adds variability to SGD, allowing it to avoid local minima and find better solutions in large datasets.One
of the primary benefits of stochastic gradient descent is its ability to handle large-scale datasets.
Because each iteration uses only a portion of the dataset rather than the whole thing, the path the algorithm takes is noisier than that of full gradient descent, so SGD needs more iterations to reach a local minimum. Total computation time grows as the number of iterations rises, but even after increasing the number of iterations the overall cost is usually still lower than that of the full gradient descent optimizer. As a result, stochastic gradient descent should be chosen over the gradient descent algorithm when the data is large and processing time is a crucial consideration.
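A minimal sketch of how the update loop changes for SGD, continuing the toy linear-regression example from the gradient descent sketch above; the epoch count is an illustrative assumption. Each update uses a single randomly chosen example instead of the full dataset.

for epoch in range(20):
    for i in np.random.permutation(len(y)):      # visit examples in a new random order each epoch
        error_i = (w * X[i, 0] + b) - y[i]
        # noisy gradient estimated from one example; cheap but adds variability
        w -= lr * 2 * error_i * X[i, 0]
        b -= lr * 2 * error_i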
Components- Gradient optimization- SGD with momentum
Because SGD needs many iterations to reach a good minimum, convergence can be slow. Stochastic gradient descent with momentum addresses this: the momentum term helps the loss function converge faster. Plain stochastic gradient descent oscillates between gradient directions as it updates the weights; adding a fraction of the previous update to the current update damps these oscillations and speeds up the process. One thing to keep in mind when using this algorithm is that the learning rate should usually be reduced when a strong momentum term is used.
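A minimal sketch of updates with momentum, again continuing the toy example above; the momentum coefficient 0.9 is a common choice used here as an illustrative assumption.

beta = 0.9
v_w, v_b = 0.0, 0.0                 # "velocity": running mix of past updates

for step in range(500):
    error = (w * X[:, 0] + b) - y
    grad_w = 2 * np.mean(error * X[:, 0])
    grad_b = 2 * np.mean(error)
    # add a fraction of the previous update to the current one
    v_w = beta * v_w + lr * grad_w
    v_b = beta * v_b + lr * grad_b
    w -= v_w
    b -= v_b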
Components- Gradient optimization- Mini batch GD
Mini-batch gradient descent computes the gradient over small batches of training examples rather than a single example (as in SGD) or the entire dataset (as in batch gradient descent). This balances the stability of full-batch updates with the speed and memory efficiency of SGD, and it is the default behaviour in most deep learning frameworks.
Components- Gradient optimization- RMSprop
RMSprop is an adaptive learning rate optimization algorithm designed to get around some of the drawbacks of Adagrad and basic gradient descent. It adapts the learning rate for each parameter separately, based on past gradient information. The fundamental concept is to divide the learning rate by a moving average of the root mean square of previous gradients, so parameters with consistently large gradients take smaller effective steps.
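A minimal sketch of the RMSprop per-parameter update, continuing the same toy example; the decay rate 0.9, epsilon, and learning rate are conventional values used as illustrative assumptions.

lr_rms, rho, eps = 0.01, 0.9, 1e-8
s_w, s_b = 0.0, 0.0                 # moving averages of squared gradients

for step in range(500):
    error = (w * X[:, 0] + b) - y
    grad_w = 2 * np.mean(error * X[:, 0])
    grad_b = 2 * np.mean(error)
    s_w = rho * s_w + (1 - rho) * grad_w ** 2
    s_b = rho * s_b + (1 - rho) * grad_b ** 2
    # divide the learning rate by the root mean square of past gradients
    w -= lr_rms * grad_w / (np.sqrt(s_w) + eps)
    b -= lr_rms * grad_b / (np.sqrt(s_b) + eps)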
Components- Gradient optimization- ADAM
Adaptive Moment Estimation (Adam) is a gradient descent optimization algorithm. It is quite effective on complex problems involving many variables or large amounts of data, works well in practice, and uses minimal memory. Conceptually, it combines the 'gradient descent with momentum' algorithm with the RMSprop algorithm.
Each of those two approaches has its own merits, and Adam relies on both strengths to produce a better-behaved gradient descent. The step size is regulated so that the optimizer can pass over local-minima obstacles with sufficiently large steps while approaching the minimum with as little oscillation as possible. In this way Adam exploits the advantages of the two techniques above to reach a good minimum efficiently.
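In practice these optimizers are rarely hand-coded; a minimal sketch of selecting Adam (with an illustrative learning rate) for a Keras model:

from tensorflow import keras

# Adam combines momentum (first moment) and RMSprop-style scaling (second moment)
optimizer = keras.optimizers.Adam(learning_rate=0.001)
# model.compile(optimizer=optimizer, loss='mse')   # attach it to any Keras model when compiling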
Components - Loss function
A loss function is a mathematical function that measures how well a model performs on a given dataset. In deep learning, models are trained by minimizing this loss. Mean squared error (MSE) and cross-entropy loss are the two most commonly used loss functions in deep learning.
Binary Cross-entropy: Binary cross-entropy computes the cross-entropy between the true labels and the predicted outputs. It is used for two-class problems, such as cat vs. dog classification (labels 1 or 0).
Categorical Cross-entropy: The categorical cross-entropy loss function computes the loss between true labels and predicted labels. It is mainly used for multi-class classification problems.
Components - Loss function
Sparse Categorical Cross-entropy: Used when there are two or more classes in the classification task, just like categorical cross-entropy. The one minor difference is that with sparse categorical cross-entropy the labels are expected to be provided as integers rather than one-hot vectors.
Mean Squared Error: MSE measures how close the predictions are to the actual values in a regression task. It is computed by taking the difference between each predicted and actual value and squaring it; squaring removes the sign so that positive and negative errors do not cancel out.
Mean Absolute Error: MAE is computed by taking the absolute difference between each predicted and actual value. Because the errors are not squared, MAE is less sensitive to outliers than MSE, so it is a reasonable choice when the data contains outliers.
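A minimal sketch of how these losses are selected when compiling a Keras model; the layer sizes and input shape are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

# binary classification: sigmoid output, binary cross-entropy
model = keras.Sequential([layers.Dense(8, activation='relu', input_shape=(4,)),
                          layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# other tasks only change the output layer and the loss string:
#   'categorical_crossentropy'        multi-class, one-hot labels, softmax output
#   'sparse_categorical_crossentropy' multi-class, integer labels, softmax output
#   'mse' or 'mae'                    regression, linear output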
ANN
An Artificial Neural Network (ANN) is a computational model inspired by the human brain. It is made up of interconnected nodes (neurons) that process and evaluate data. ANNs learn and make predictions by finding patterns and relationships in the data, and they are frequently employed in tasks like image recognition, language processing, and others that call for the recognition of complicated patterns. During training the network adjusts the weights of its connections, improving over time at making predictions or classifying data. ANNs are essentially a mechanism for computers to simulate human learning and decision-making processes, although in a much more simplified manner.
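The Keras snippets that follow assume TensorFlow's Keras API; the imports below are not shown on the original slides and are needed to run them, and the layer sizes and input shapes are illustrative.

from tensorflow import keras
from tensorflow.keras import layers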
Typical ANN for regression
model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(5,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)   # single linear output for a continuous target
])
ANN CODE FOR BINARY CLASSIFICATION
model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(5,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')   # probability of the positive class
])
Typical ANN for MULTICLASS classification
model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(5,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(3, activation='softmax')   # one probability per class (3 classes here)
])
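A minimal sketch of compiling and training the multi-class model above; the feature matrix, integer class labels, and training settings are illustrative assumptions.

import numpy as np

X_train = np.random.rand(200, 5)                 # 200 samples, 5 features (illustrative)
y_train = np.random.randint(0, 3, size=200)      # integer labels for 3 classes (illustrative)

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)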
CNN
Convolutional Neural Networks (CNNs), also known as ConvNets, are a subset of deep learning neural networks that
are specifically made for tasks involving the interpretation of spatial and visual data. CNNs are particularly good at
identifying patterns and features in visual data since they are inspired by the human visual system.
The convolution operation is the central process of a convolutional layer. It applies a small filter (also called a kernel) to the incoming data; the region of the input that the filter covers at each position is its receptive field. The filter is a small matrix of weights, frequently 3x3 or 5x5. The convolution operation slides this filter over the input and computes a dot product between the filter weights and each local patch of the input, which is what allows the layer to recognize local patterns.
Stride and Padding: Convolutional Layers can have parameters like "stride" and "padding" that control
how the filter moves over the input data. Stride determines how much the filter shifts with each
operation, while padding adds extra pixels around the input data to control the size of the output
feature maps.
CNN Layer
Pooling layer
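A minimal sketch of a small convolutional stack illustrating the filter size, stride, padding, and pooling ideas described above; the 28x28 grayscale input, layer sizes, and 10 output classes are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, kernel_size=(3, 3), strides=1, padding='same',
                  activation='relu', input_shape=(28, 28, 1)),   # 3x3 filters slide over the image
    layers.MaxPooling2D(pool_size=(2, 2)),                       # pooling downsamples the feature maps
    layers.Conv2D(64, kernel_size=(3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')                       # e.g. 10 output classes
])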
RNN
● Neural networks act like the human brain for AI, machine learning, and deep learning, helping
computers recognize patterns and solve problems.
● Recurrent Neural Networks (RNNs) are a type of neural network designed for sequence data,
like predicting the next word in a sentence. They work like our brain's memory.
● In regular neural networks, each input or output is independent. But for tasks where the order of
data matters, like in language, we need to remember previous data. That's where RNNs come
in.
● RNNs use a hidden layer to remember past information. The crucial part is the "Hidden State,"
which stores specific information about a sequence.
● RNNs have a "Memory" that keeps track of all calculations. It uses the same rules for each input,
making it handy for tasks where you need to consider what came before.
RNN ARCHITECTURE
One to One: A single input maps to a single output; traditional (feed-forward) neural networks use a one-to-one architecture.
One to Many: In a one-to-many network, a single input produces a sequence of outputs. For instance, music generation uses one-to-many networks.
Many To One: In this case, many inputs from various time steps are combined to produce a single output. Such networks are used for sentiment analysis and emotion recognition, where a sequence of words determines a single class label.
Many To Many: Both the input and the output are sequences, and their lengths may differ (for example, two inputs producing three outputs). Many-to-many networks are used by machine translation systems, such as those that translate from English to French or vice versa.
RNN
model = keras.Sequential()
model.add(layers.SimpleRNN(units=64, activation='relu', input_shape=(10, 3)))  # input shape: (timesteps, features)
BACKPROPAGATION IN RNN
Backpropagation through time (BPTT) is the term used when backpropagation is applied to an RNN whose input is time series data. In a typical RNN, one input is provided at each time step and one output is produced, but backpropagation must take into account both past and present inputs. Conceptually, the network is unrolled across the time steps of the sequence: once it has processed a sequence and produced its outputs, the errors are calculated and accumulated across all time steps, and the network is then rolled back up while the shared weights are updated to take those errors into account.
VANISHING AND EXPLODING GRADIENT
VANISHING GRADIENT:
Vanishing gradients occur when the gradients of the model's parameters become extremely small during training. This can happen when the derivatives of the loss function with respect to the parameters are small and keep shrinking as they are propagated backward through the layers.
EXPLODING GRADIENT:
Exploding gradients happen when the gradients of the model's parameters become extremely large during training. This can occur when those derivatives are large and are compounded through many layers.
For recurrent networks, consider using Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures, which are designed to mitigate
vanishing gradient issues.
LSTM
Long Short-Term Memory (LSTM) networks are sequential deep learning networks that allow information to persist over time. They are a specific variety of recurrent neural network that addresses the vanishing gradient issue that RNNs encounter.
RNNs retain prior knowledge and use it when processing the input at hand, but because of the diminishing gradient they have the drawback of being unable to recall long-term dependencies. LSTMs are specifically designed to avoid this long-term dependency problem.
CELL STATE
The cell state, C(t), is the first component of the LSTM and runs through the entire LSTM unit. It is somewhat comparable to a conveyor belt.
The cell state is responsible for remembering and forgetting; what happens depends on the context of the input.
FORGET GATE
The LSTM network architecture consists of three parts, as shown in the image below, and each part performs an
individual function.
Imagine you have a special machine that's like a smart librarian. This librarian keeps track of information from the past
and uses it to make predictions about the future. This machine is called an LSTM.
FORGET GATE: The LSTM has a gate called the "Forget Gate." It decides whether old information from the past is important to remember or is no longer needed and can be forgotten. It is like deciding whether old library books are still useful.
A sigmoid layer makes this decision; it is called the "forget gate layer."
INPUT GATE
INPUT GATE: The LSTM has another gate called the "Input Gate." This gate helps it learn new information from
what's happening right now. It's like reading a new book and adding its knowledge to the library. The input gate
provides the LSTM with fresh data and makes the decision of whether or not to store that data in the cell state.
A sigmoid layer, called the "input gate layer," decides which values will be updated.
A layer with a tanh activation function generates a vector of candidate new values for the state, C̃(t).
The cell state is then updated using the product of these two outputs, i(t) * C̃(t).
The outputs of the forget and input gates are combined to create the new cell state: C(t) = f(t) * C(t-1) + i(t) * C̃(t).
OUTPUT GATE
OUTPUT GATE: The last gate is the "Output Gate." It takes all the remembered and newly learned information and passes it on to the future. It's like sharing the library's knowledge with others.
First, a sigmoid layer decides what parts of the cell state we're going to output. Then, a tanh layer is used on the cell state to squash the values between -1 and 1, which is finally multiplied by the sigmoid gate output.
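A minimal sketch of an LSTM model in Keras for a sequence task; the sequence length of 10, the 3 features per step, and the single regression output are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.LSTM(units=64, input_shape=(10, 3)),   # 10 time steps, 3 features; the gates are handled internally
    layers.Dense(1)                               # e.g. predict the next value of the series
])
model.compile(optimizer='adam', loss='mse')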
GRU
A Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) architecture that is designed to address the
vanishing gradient problem while efficiently capturing dependencies in sequential data. It is similar to the more
complex Long Short-Term Memory (LSTM) network but has a simplified structure with fewer gating mechanisms.
Hidden State: Like an LSTM, a GRU maintains a hidden state, which acts as the network's memory and carries
information from one time step to the next. This hidden state can capture dependencies in the data.
Reset Gate: The reset gate in a GRU controls what information from the previous time step should be forgotten
or reset. It helps the network determine which past information to consider in the current time step.
Update Gate: The update gate determines what new information should be added to the current hidden state. It
influences how much of the new input data should be remembered and incorporated into the hidden state.
GRU
model = keras.Sequential()
model.add(layers.GRU(units=64, input_shape=(10, 3)))  # input shape: (sequence_length, number_of_features)
TRANSFER LEARNING
Transfer learning is a powerful technique in deep learning. By reusing existing models and the expertise they have acquired, transfer learning makes it possible to train deep neural networks for new tasks even with little data. This matters in practice because real applications frequently lack enough labeled data.
Through transfer learning, the expertise of a machine learning model that has already been trained is applied to a different but closely related problem. For instance, if you trained a straightforward classifier to determine whether a picture contains a backpack, you could reuse that knowledge to help recognize other objects, such as sunglasses.
Because training from scratch requires a massive amount of computational power, transfer learning is typically applied in computer vision and natural language processing tasks like sentiment analysis.
TRANSFER LEARNING
In transfer learning, you begin with a pre-trained model that has already been trained on a sizable dataset for a particular task, usually a generic or related one. These pre-trained models are typically deep neural networks, such as convolutional neural networks (CNNs) for image tasks or recurrent neural networks (RNNs) for text tasks.
FEATURE EXTRACTION:
The first stage of transfer learning uses the pre-trained model as a feature extractor. You remove the output layer, which produced the predictions for the original task, and keep the remaining layers. You are then left with a model that can extract useful features from data.
TRANSFER LEARNING
FINE TUNING:
Next, you add a fresh output layer to the pre-trained model. This new output layer is specific to the task you want to perform: if you are using a pre-trained image model for a new image classification task, the output layer contains the new class labels you wish to predict. This new layer is initialized randomly.
TRAINING:
The target dataset, which is typically smaller than the original dataset used to train the pre-trained model, is
then utilized to train the modified model. The theory behind this is that the attributes that the pre-trained
model acquired and found useful for the initial task may also prove effective for your current work. While
maintaining the pre-trained weights fixed or fine-tuning them with a slower learning rate, the training process
modifies the parameters of the new output layer.
TRANSFER LEARNING
KNOWLEDGE TRANSFER:
The pre-trained model applies its understanding of general aspects to your particular assignment during
training. If you have little data for the new task, this can considerably speed up training and enhance the
performance of your model.
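A minimal sketch of feature extraction plus a new output layer in Keras, using MobileNetV2 as an example pre-trained base; the input size and the 5 target classes are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.MobileNetV2(weights='imagenet', include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False                        # freeze the pre-trained weights (feature extraction)

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation='softmax')     # new, randomly initialized output layer (5 classes assumed)
])
model.compile(optimizer=keras.optimizers.Adam(1e-4), loss='sparse_categorical_crossentropy')
# For fine-tuning, later set base.trainable = True and re-compile with an even lower learning rate.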
SSD (Single Shot Detector)
Dividing the Image: Imagine your image is divided into grids of different sizes, like squares on a checkerboard. Each square is called a "cell."
Predictions in Each Cell: In each cell, SSD tries to make predictions. It wants to figure out if there's an object in that cell and, if so, where the
object is and what it is. To do this, it predicts:
● The likelihood that there's an object in the cell.
● The coordinates of a box that tightly surrounds the object (bounding box).
● The class of the object (e.g., cat, dog, car).
Different Box Sizes: SSD doesn't just predict one size of the bounding box. It predicts multiple bounding box sizes in each cell. This helps it find
objects of different sizes.
Confidence Score: For each predicted bounding box, SSD also gives a "confidence score." This score tells us how sure it is that the box contains
an object.
Non-Maximum Suppression: SSD looks at all the predicted boxes across all the cells and removes those with low confidence scores. It keeps
only the boxes that are very likely to contain an object.
Final Output: The remaining bounding boxes, along with their associated class labels and confidence scores, are the final output of the SSD
algorithm. These are the objects it found in the image.
SSD
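A minimal sketch of the non-maximum suppression step described above, in plain NumPy; the box format, thresholds, and scores are illustrative assumptions rather than the output of a real SSD model.

import numpy as np

def non_max_suppression(boxes, scores, iou_thresh=0.5, score_thresh=0.4):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence scores."""
    keep = []
    idxs = np.argsort(scores)[::-1]                  # highest confidence first
    idxs = idxs[scores[idxs] > score_thresh]         # drop low-confidence boxes
    while len(idxs) > 0:
        current = idxs[0]
        keep.append(current)
        rest = idxs[1:]
        # intersection-over-union between the kept box and the remaining boxes
        x1 = np.maximum(boxes[current, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[current, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[current, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[current, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_c = (boxes[current, 2] - boxes[current, 0]) * (boxes[current, 3] - boxes[current, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_c + area_r - inter)
        idxs = rest[iou < iou_thresh]                # discard boxes that overlap the kept box too much
    return keep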
Autoencoder
Context Vector:
The context vector encodes the information from the whole input sequence and acts as the decoder's starting point.
It captures the context and meaning of the input sequence, enabling the decoder to produce a meaningful output.
Autoencoder
Decoder (based on LSTM):
Another LSTM network serves as the decoder, using the context vector to produce the output sequence (for example, a translation into a different language).
Like the encoder, it can include one or more LSTM layers.
It generates the output step by step, producing each token from its previous hidden state and the previously predicted token.
Training:
Training involves feeding the encoder the input sequence and training the decoder to generate the corresponding
output sequence.
The goal of training is to minimize the difference (the loss) between the predicted sequence and the target sequence.
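A minimal sketch of an LSTM encoder-decoder of the kind described above, using the Keras functional API; the vocabulary sizes and latent dimension are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 64
num_encoder_tokens, num_decoder_tokens = 100, 120     # assumed vocabulary sizes

# Encoder: its final hidden and cell states act as the context vector
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)
context = [state_h, state_c]

# Decoder: starts from the context vector and emits the output sequence token by token
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
decoder_seq = layers.LSTM(latent_dim, return_sequences=True)(decoder_inputs, initial_state=context)
decoder_outputs = layers.Dense(num_decoder_tokens, activation='softmax')(decoder_seq)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')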
How to choose models for regression
Linear Regression:
● Use when there is a linear relationship between the independent and dependent variables.
● Suitable for simple regression tasks with continuous numerical data.
● Provides interpretable coefficients for each predictor.
Polynomial Regression:
● Appropriate when the relationship between variables is not strictly linear.
● It extends linear regression by including polynomial terms to capture curvature in the data.
Ridge and Lasso Regression:
● Useful for addressing multicollinearity (high correlations between predictors) and preventing overfitting.
● Ridge adds a penalty term to the linear regression loss function, while Lasso adds a penalty and performs variable selection.
Decision Trees and Random Forests:
● Suitable for non-linear relationships and when the relationship between variables is complex.
● Decision trees are easy to understand, while random forests are ensembles of decision trees that improve predictive
performance.
Support Vector Regression (SVR):
● Appropriate when there is non-linearity and you want to minimize the prediction error within a certain margin.
● Useful when you need to handle outliers effectively.
How to choose model for regression
Gradient Boosting (e.g., XGBoost, LightGBM):
● Effective for improving prediction accuracy through boosting techniques.
● Often used in Kaggle competitions and other data science competitions.
Neural Networks (Deep Learning):
● Suitable for complex, high-dimensional data with non-linear relationships.
● Particularly effective when dealing with unstructured data, like images, text, or time series.
Elastic Net Regression:
● Combines L1 (Lasso) and L2 (Ridge) regularization, providing a balance between feature selection and handling
multicollinearity.
Bayesian Regression:
● Useful when you have prior knowledge about the relationships between variables and want to incorporate that into the
modeling process.
K-Nearest Neighbors (KNN) Regression:
● Appropriate when you want to predict a continuous outcome based on the average of the values of its k-nearest neighbors.
The choice of a regression model should be based on a combination of factors, including the nature of your data, the relationships you expect, the complexity of the problem, and
the interpretability of the model. It's often a good practice to start with a simple model (e.g., linear regression) and then experiment with more complex models if needed.
Cross-validation and model evaluation are also important to determine which model performs best for your specific problem.
How to choose model for classification
Logistic Regression:
● Suitable for binary classification problems.
● Provides probabilities of class membership.
● Interpretable, easy to understand, and a good starting point.
K-Nearest Neighbors (KNN):
● Effective for simple and flexible classification.
● Good for small to medium-sized datasets.
● Works well when decision boundaries are nonlinear or not well-defined.
Decision Trees and Random Forests:
● Decision trees are simple and interpretable but can overfit.
● Random forests are ensembles of decision trees that reduce overfitting and improve performance.
● Useful when interpretability is important.
How to choose model for classification
Support Vector Machine (SVM):
● Effective for binary and multi-class classification.
● Works well with high-dimensional data.
● Good for well-separated classes and complex decision boundaries.
Naive Bayes:
● Particularly useful for text classification (e.g., spam detection, sentiment analysis).
● Assumes that features are conditionally independent within each class.
Gradient Boosting (e.g., XGBoost, LightGBM):
● Powerful ensemble techniques that perform well in many scenarios.
● Often used in data science competitions for their high accuracy.
Neural Networks (Deep Learning):
● Effective for complex, high-dimensional data and large datasets.
● Can handle image and text data, among others.
● Require substantial computational resources and data.
How to choose model for classification
Ensemble Methods:
● Combining multiple base models (e.g., bagging, boosting, stacking) often improves classification performance.
● Can be used with a variety of base classifiers.
Nearest Centroid Classifier:
● Simple classifier that assigns data points to the class with the nearest centroid.
● Suitable for situations where features have a Gaussian distribution.
Multi-Layer Perceptron (MLP):
● A type of neural network that can be used for both binary and multi-class classification.
● Allows for deep learning with multiple hidden layers.
The choice of a classification model should consider factors such as the nature of the data, the number of classes, the presence of class
imbalance, interpretability, and computational resources. It's often a good practice to start with simpler models like logistic regression and
decision trees and then explore more complex models as needed. Proper model evaluation, including cross-validation, is essential to determine which model performs best for your specific problem.
Metrics for regression
Explained Variance Score:
● This metric measures the proportion of the variance in the dependent variable that the model accounts for.
● It's similar to R² and can range from 0 to 1.
Max Error:
● Max Error measures the largest absolute error between predicted and actual values.
● It is useful for identifying the worst-case scenario in terms of prediction error.
Median Absolute Error:
● Median Absolute Error calculates the median of the absolute differences between predicted and actual
values.
● It is robust to outliers and can provide a more stable measure of central tendency than mean-based
metrics.
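A minimal sketch of computing these regression metrics with scikit-learn; y_true and y_pred are illustrative placeholder values.

from sklearn.metrics import explained_variance_score, max_error, median_absolute_error

y_true = [3.0, 5.0, 2.5, 7.0]      # actual values (illustrative)
y_pred = [2.8, 5.4, 2.9, 6.1]      # model predictions (illustrative)

print(explained_variance_score(y_true, y_pred))   # proportion of variance explained
print(max_error(y_true, y_pred))                  # worst-case absolute error
print(median_absolute_error(y_true, y_pred))      # robust to outliers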
Metrics for classification
Accuracy:
● Accuracy measures the proportion of correctly classified instances out of the total instances.
● Suitable for balanced datasets where all classes have similar importance.
● May not be suitable for imbalanced datasets, as high accuracy can be achieved by simply predicting the majority
class.
Precision:
● Precision measures the proportion of true positive predictions (correctly predicted positive instances) out of all
positive predictions (true positives + false positives).
● Useful when minimizing false positives is a priority (e.g., spam email detection).
Recall (Sensitivity or True Positive Rate):
● Recall measures the proportion of true positive predictions out of all actual positive instances (true positives +
false negatives).
● Relevant when minimizing false negatives is important (e.g., medical diagnosis).
F1-Score:
● The F1-Score is the harmonic mean of precision and recall and provides a balance between the two metrics.
● Particularly useful when precision and recall need to be considered together.
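A minimal sketch of computing these classification metrics with scikit-learn; the labels are illustrative placeholder values.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]        # actual class labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1]        # predicted class labels (illustrative)

print(accuracy_score(y_true, y_pred))    # fraction of correct predictions
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall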
Metrics for classification