DL Mod 1
MODULE 1
1.1 INTRODUCTION
• All neurons have three different parts – dendrites, cell body and axon.
• Parts of Neuron
• Following are the different parts of a neuron:
• Dendrites
• These are branch-like structures that receive messages from other
neurons and allow the transmission of messages to the cell body.
• Cell Body
• Each neuron has a cell body with a nucleus, Golgi body, endoplasmic
reticulum, mitochondria and other components.
• Axon
• The axon is a tube-like structure that carries electrical impulses from the
cell body to the axon terminals, which pass the impulse to another
neuron.
Synapse
• It is the chemical junction between the terminal of one neuron and the
dendrites of another neuron.
Correspondence between a biological neuron and an artificial neuron:
• Dendrites → Inputs
• Synapse → Weights
• Axon → Output
• Input Layer:
• As the name suggests, it accepts inputs in several different formats provided
by the programmer.
• Hidden Layer:
• The hidden layer lies between the input and output layers. It performs all
the calculations to find hidden features and patterns.
• Output Layer:
• The input goes through a series of transformations using the hidden layer,
which finally results in output that is conveyed using this layer.
• The artificial neural network takes input and computes the weighted sum of
the inputs and includes a bias. This computation is represented in the form of
a transfer function.
• This weighted total is then passed as input to an activation function to
produce the output.
• The activation function decides whether a node should fire or not.
• Only the nodes that fire contribute to the output layer.
• Different activation functions are available, and the choice depends on the
sort of task we are performing.
• Feedback ANN
In this type of ANN, the output is fed back into the network so that the result can be
refined internally.
Feedback networks feed information back into themselves and are well suited to
solving optimization problems.
Internal system error correction also makes use of feedback ANNs.
Long short-term memory network (LSTM) - LSTM adds extra structures, or
gates, to an RNN to improve its memory capabilities.
Echo state network (ESN) - A type of RNN whose hidden layers are sparsely
connected.
Convolutional neural networks
Convolutional neural networks (CNNs) are a type of feed-forward network that are
used for image analysis and language processing. There are hidden convolutional
layers that form ConvNets and detect patterns. CNNs use features such as edges,
shapes, and textures to detect patterns.
Generative adversarial networks (GANs)
A generative adversarial network pairs two networks against each other:
Generator - creates synthetic data from the learning phase of the model. It
will take random datasets and generate a transformed image.
Discriminator - decides whether the images produced are fake or
genuine.
GANs are used to help predict what the next frame in a video might be, for text-to-
image generation, or for image-to-image translation.
PERCEPTRON
• It is one of the oldest and first introduced neural networks.
• It was proposed by Frank Rosenblatt in 1958.
• The perceptron is the simplest form of an artificial neural network.
• The perceptron is mainly used to compute logical gates such as AND, OR, and
NOR, which have binary inputs and binary outputs.
• Perceptron is a building block of an Artificial Neural Network.
• Perceptron is a linear machine learning algorithm used for supervised
learning of binary classifiers. The algorithm enables the neuron to learn from
training examples, processing them one at a time.
The main functionality of the perceptron is to:
• Take inputs from the input layer.
• Weight the inputs and sum them up.
• Pass the sum through an activation function to produce the output.
Step-1
In the first step, multiply all input values with their corresponding weight values
and then add them to determine the weighted sum. Mathematically, we can
calculate the weighted sum as follows:
∑ wi*xi = x1*w1 + x2*w2 + … + xn*wn
Add a special term called bias 'b' to this weighted sum to improve the model's
performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied to the above-mentioned
weighted sum, which gives us an output either in binary form or as a continuous
value, as follows:
Y = f(∑wi*xi + b)
• Repeat the same procedure for all the other hidden layers, continuing until
the last set of weights is reached.
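A minimal sketch of these two steps in Python (NumPy), using a unit-step activation and hand-picked illustrative weights that realize a 2-input AND gate:

```python
import numpy as np

def perceptron(x, w, b):
    """Step 1: weighted sum; Step 2: unit-step activation."""
    weighted_sum = np.dot(w, x) + b          # sum(wi * xi) + b
    return 1 if weighted_sum >= 0 else 0     # binary output

# Illustrative weights and bias that realize a 2-input AND gate.
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x, dtype=float), w, b))
# prints 0, 0, 0, 1 -- the AND of the two binary inputs
```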
ACTIVATION FUNCTION
• It is the function used to get the output of a node. It is also known as the
Transfer Function.
• The primary role of the Activation Function is to transform the summed
weighted input from the node into an output value to be fed to the next
hidden layer or as output.
• It is used in a neural network to determine the output of the network, such as
yes or no. It maps the resulting values into a range such as 0 to 1 or -1 to 1
(depending upon the function).
• The Activation Functions can be basically divided into 2 types-
1. Linear Activation Function
2. Non-linear Activation Functions
• A linear activation function doesn’t help with the complexity or various
parameters of the usual data fed to the network.
• A non-linear activation function makes it easy for the model to generalize or
adapt to a variety of data.
• The main reason why we use the sigmoid function is that its output lies
between 0 and 1.
• Therefore, it is especially used for models where we have to predict the
probability as an output.
• Since probability of anything exists only between the range of 0 and 1,
sigmoid is the right choice.
• The function is differentiable. That means we can find the slope of the
sigmoid curve at any point.
• The function is monotonic but function’s derivative is not.
• The ReLU is half rectified (from the bottom): f(z) is zero when z is less
than zero, and f(z) is equal to z when z is greater than or equal to zero.
• Range: [ 0 to infinity)
• The function and its derivative both are monotonic.
• But the issue is that all the negative values become zero immediately, which
decreases the ability of the model to fit or train from the data properly.
• That means any negative input given to the ReLU activation function turns
the value into zero immediately, which in turn affects the resulting graph by
not mapping the negative values appropriately.
4. Leaky ReLU
• It is an attempt to solve the dying ReLU problem
• The leak helps to increase the range of the ReLU function. Usually, the value
of a is 0.01 or so.
• When a is not 0.01 then it is called Randomized ReLU.
• Therefore the range of the Leaky ReLU is (-infinity to infinity).
• Both Leaky and Randomized ReLU functions are monotonic in nature, and so
are their derivatives.
Leaky ReLU: f(x) = max(0.01x, x)
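A minimal sketch in Python (NumPy) of the sigmoid, ReLU, and Leaky ReLU functions described above; the input vector is illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)            # 0 for x < 0, x otherwise

def leaky_relu(x, a=0.01):
    return np.maximum(a * x, x)          # small slope a for negative inputs

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x))
print(relu(x))
print(leaky_relu(x))
```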
5. Softmax activation function
• The softmax activation function takes in a vector of raw outputs of the
neural network and returns a vector of probability scores.
• In the vector z of raw outputs, the maximum value is 1.23, which on
applying softmax activation maps to 0.664: the largest entry in the softmax
output vector. Likewise, 0.25 and -0.8 map to 0.249 and 0.087: the second
and the third largest entries in the softmax output respectively. Thus,
applying softmax preserves the relative ordering of scores.
• All entries in the softmax output vector are between 0 and 1.
• In a multiclass classification problem, where the classes are mutually
exclusive, notice how the entries of the softmax output sum up to 1: 0.664 +
0.249 + 0.087 = 1.
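A minimal sketch in Python (NumPy) reproducing the example above; the raw output vector z = [1.23, 0.25, -0.8] is taken from the text:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([1.23, 0.25, -0.8])    # the raw outputs used in the example above
p = softmax(z)
print(np.round(p, 3))               # [0.664 0.249 0.087]
print(p.sum())                      # 1.0 -- the entries sum to one
```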
• The hard tanh activation function is a modified version of the tanh function
that applies a threshold to the output to produce an output between -1 and 1.
• The hard tanh function is faster to compute than the tanh function and is
commonly used in embedded systems and real-time applications.
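A minimal sketch of the hard tanh function in Python (NumPy); clipping to [-1, 1] is one common way to implement the threshold described above:

```python
import numpy as np

def hard_tanh(x):
    # -1 for x < -1, x for -1 <= x <= 1, and 1 for x > 1
    return np.clip(x, -1.0, 1.0)

print(hard_tanh(np.array([-3.0, -0.4, 0.7, 2.5])))   # [-1.  -0.4  0.7  1. ]
```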
LOSS FUNCTION
Neural networks are a set of algorithms that are designed to recognize
trends/relationships in a given set of training data. These algorithms are
based on the way human neurons process information.
At each layer, the neural network transforms its input according to an equation of
the form
output = f(wT · x + b),
and the final layer produces the predicted output value.
To train — the process by which the model maps the relationship between
the training data and the outputs — the neural network updates its
parameters, the weights, wT, and biases, b, to satisfy the equation
above.
Each training input is loaded into the neural network in a process called
forward propagation. Once the model has produced an output, this predicted
output is compared against the given target output in a process called
backpropagation — the parameters of the model are then adjusted so
that it outputs a result closer to the target output.
• A loss function is a function that compares the target and predicted output
values; it measures how well the neural network models the training data.
When training, we aim to minimize this loss between the predicted and
target outputs.
• The parameters are adjusted to minimize the average loss — we find
the weights, wT, and biases, b, that minimize the value of J (the average loss).
• Types of Loss Functions
• In supervised learning, there are two main types of loss functions :
regression and classification loss functions
• Regression Loss Functions — used in regression neural
networks; given an input value, the model predicts a
corresponding output value (rather than pre-selected labels); Ex.
Mean Squared Error, Mean Absolute Error
• Classification Loss Functions — used in classification neural
networks; given an input, the neural network produces a vector
of probabilities of the input belonging to various pre-set
categories — can then select the category with the highest
probability of belonging; Ex. Binary Cross-Entropy, Categorical
Cross-Entropy
• Mean Squared Error (MSE)
• One of the most popular loss functions, MSE finds the average of the
squared differences between the target and the predicted outputs.
• Mean Absolute Error (MAE)
• MAE finds the average of the absolute differences between the target and the
predicted outputs.
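A minimal sketch in Python (NumPy) of both loss functions; the target and predicted values are illustrative:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)       # average squared difference

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))      # average absolute difference

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse(y_true, y_pred))   # 0.375
print(mae(y_true, y_pred))   # 0.5
```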
• Inputs X arrive through the preconnected path.
• Input is modeled using real weights W. The weights are usually randomly
selected.
• Calculate the output for every neuron from the input layer, to the hidden
layers, to the output layer.
• Calculate the error in the output
• ERROR = TARGET − ACTUAL
• Travel back from the output layer to the hidden layer to adjust the weights
such that the error is decreased.
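A minimal sketch in Python (NumPy) of these steps for a tiny two-layer network with sigmoid units; the XOR-style dataset, layer sizes, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # targets

W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))   # input -> hidden weights (random)
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))   # hidden -> output weights (random)
lr = 0.5                                             # learning rate

for _ in range(5000):
    # Forward pass: input layer -> hidden layer -> output layer.
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # Error in the output: error = target - actual.
    error = T - Y

    # Travel back from the output layer to adjust the weights so the error decreases.
    delta_out = error * Y * (1 - Y)                   # sigmoid derivative at the output
    delta_hidden = (delta_out @ W2.T) * H * (1 - H)   # error propagated to the hidden layer
    W2 += lr * H.T @ delta_out
    b2 += lr * delta_out.sum(axis=0, keepdims=True)
    W1 += lr * X.T @ delta_hidden
    b1 += lr * delta_hidden.sum(axis=0, keepdims=True)

print(np.round(Y, 2))   # predictions move toward the targets as training proceeds
```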
Practical Issues in Neural Network
Training
• I.The Problem of Overfitting
• II. The Vanishing and Exploding Gradient Problems
• III.Difficulties in Convergence
• IV.Local and Spurious Optima
• V. Computational Challenges
• Overfitting during training can be spotted when the error on training data
decreases to a very small value but the error on the new data or test data
increases to a large value.
• The error vs iteration graph shows how a deep neural network overfits on
training data.
• The blue curve indicates the error on training data & the red curve the error
on test data.
• The point where the green line intersects is the instance the network begins
to overfit.
• As you can see, the error on test data increases sharply while error on
training data decreases.
• A new set of data points will result in the model/network performing poorly,
because the model has fit itself very closely to the training points, including
noise and outliers.
• The error on the training points is minimum or very small but the error on
the new data points will be high.
• One of the main reasons for the network to overfit is if the size of the
training dataset is small.
• When the network tries to learn from a small dataset it will tend to have
greater control over the dataset & will make sure to satisfy all the datapoints
exactly.
• In order to understand this point, consider a simple single-layer neural
network on a data set with five attributes, where we use the identity
activation to learn a real-valued target variable.
• Consider a situation in which the observed target value is real and is always
twice the value of the first attribute, whereas other attributes are completely
unrelated to the target. However, we have only four training instances, which
is one less than the number of features (free parameters).
The correct parameter vector in this case is W = [2, 0, 0, 0, 0] based on the
known relationship between the first feature and target.
The training data also provides zero error with this solution, although the
relationship needs to be learned from the given instances
However, the problem is that the number of training points is fewer than the
number of parameters, and it is possible to find an infinite number of solutions with
zero error.
For example, the parameter set [0, 2, 4, 6, 8] also provides zero error on the
training data.
However, if we used this solution on unseen test data, it is likely to provide very
poor performance because the learned parameters are spuriously inferred and are
unlikely to generalize well to new points in which the target is twice the first
attribute (and other attributes are random).
As a result, the solution does not generalize well to unseen test data.
Underfitting
Underfitting happens when the network can model neither the training data nor the
test data, which results in overall bad performance.
In the corresponding graph, the model doesn’t cover all the data points and has a
high error on both training and test data.
The reason for underfitting can be because of the limited capacity of the network, a
limited number of features provided as input to the network, noisy data etc.
• It represents the inability of the model to learn the training data effectively,
resulting in poor performance on both the training and testing data.
• In simple terms, an underfit model’s predictions are inaccurate, especially when
applied to new, unseen examples.
• It mainly happens when we use a very simple model with overly simplified
assumptions.
• To address the underfitting problem, we need to use more complex
models, with enhanced feature representation and less regularization.
• Note: The underfitting model has High bias and low variance
There is no general rule as to how many layers should be removed or how many
neurons a layer must contain before the network stops overfitting.
Popular approaches for reducing network complexity are:
--Grid search can be applied to find the number of neurons and/or layers
that reduce or remove overfitting.
--The overfit model can be pruned (trimmed) by removing nodes or
connections until it reaches suitable performance on test data.
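A minimal sketch of applying grid search over the number of hidden units, assuming scikit-learn and a small synthetic dataset (illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# A small synthetic dataset stands in for real training data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Candidate architectures: fewer/more hidden units, one or two hidden layers.
param_grid = {"hidden_layer_sizes": [(8,), (16,), (32,), (16, 8)]}

search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)   # architecture with the best cross-validated score
```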
2. Data Augmentation
One of the best strategies to avoid overfitting is to increase the size of the
training dataset.
As discussed, when the size of the training data is small the network tends
to have greater control over the training data.
But in real-world scenarios gathering of large amounts of data is a tedious
& time-consuming task, hence the collection of new data is not a viable option.
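A minimal sketch of augmenting image data, assuming images are stored as NumPy arrays of shape (H, W, C); the flips and shift used here are illustrative transformations:

```python
import numpy as np

def augment(image):
    """Return a few transformed copies of one training image."""
    flipped_lr = np.fliplr(image)              # mirror horizontally
    flipped_ud = np.flipud(image)              # mirror vertically
    shifted = np.roll(image, shift=2, axis=1)  # shift two pixels to the right
    return [flipped_lr, flipped_ud, shifted]

image = np.random.rand(32, 32, 3)              # a dummy 32x32 RGB image
extra = augment(image)
print(len(extra), "extra samples generated from one original image")
```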
4. Dropouts
Dropout randomly deactivates a fraction of the neurons during each training
step, so the network cannot rely too heavily on any single neuron; this helps
reduce overfitting.
Model Parameters
o Model parameters are configuration variables that are internal to the model, and a model
learns them on its own. Examples include the weights or coefficients of independent variables
in a linear regression model or in an SVM, the weights and biases of a neural network, and the
cluster centroids in clustering.
Model Hyperparameters:
Hyperparameters are those parameters that are explicitly defined by the user to control the learning
process. Some key points for model hyperparameters are as follows:
Categories of Hyperparameters
Broadly, hyperparameters can be divided into two categories: those that control the optimization
process (such as the learning rate, batch size, and number of epochs) and those that define the
network itself (such as the number of hidden units). The main ones are given below:
• Learning Rate: The learning rate is the hyperparameter in optimization algorithms that
controls how much the model needs to change in response to the estimated error each
time the model's weights are updated. It is one of the crucial parameters when
building a neural network, and it also determines the frequency of cross-checking with
model parameters. Selecting the optimal learning rate is a challenging task: if the
learning rate is very small, it may slow down the training process; on the other hand, if
the learning rate is too large, it may not optimize the model properly.
• Note: Learning rate is a crucial hyperparameter for optimizing the model, so if there
is a requirement of tuning only a single hyperparameter, it is suggested to tune the
learning rate.
• Batch Size: To enhance the speed of the learning process, the training set is divided
into different subsets, which are known as batches.
• Number of Epochs: An epoch can be defined as one complete cycle through the
training data. Epochs represent an iterative learning process. The number
of epochs varies from model to model, and various models are created with more than
one epoch. To determine the right number of epochs, the validation error is taken into
account.
• The number of epochs is increased as long as the validation error keeps
decreasing. If the validation error does not improve for several consecutive
epochs, that is a signal to stop increasing the number of epochs.
o A number of Hidden Units: Hidden units are part of neural networks, which refer to
the components comprising the layers of processors between input and output units
in a neural network.
It is important to specify the number of hidden units hyperparameter for the neural network.
It should be between the size of the input layer and the size of the output layer. More
specifically, the number of hidden units should be 2/3 of the size of the input layer, plus the
size of the output layer.
For complex functions, more hidden units may be necessary, but the number should not be so
large that the model overfits.
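For example, with 12 units in the input layer and 3 units in the output layer, this rule of thumb
suggests roughly (2/3) × 12 + 3 = 11 hidden units.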
Validation Set
Training set: The data you will use to train your model. This will be fed
into an algorithm that generates a model. It maps inputs to outputs.
Validation set: This is smaller than the training set, and is used to evaluate
the performance of models with different hyperparameter values. It's also
used to detect overfitting during the training stages.
Test set: This set is used to get an idea of the final performance of a model
after hyperparameter tuning. It's also useful to get an idea of how different
models (SVMs, Neural Networks, Random forests...) perform against each
other.
The validation and test sets are usually much smaller than the training
set.
The validation and test sets are put aside at the beginning of the project
and are not used for training.
The validation set is used to fine-tune the hyperparameters of the model and is
considered a part of the training of the model. The model only sees this data for
evaluation but does not learn from it.
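A minimal sketch of putting the test and validation sets aside, assuming scikit-learn and illustrative array sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)                  # dummy features
y = np.random.randint(0, 2, size=1000)        # dummy binary labels

# Put the test set aside first; it is never used for training or tuning.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15,
                                                  random_state=42)
# Split the remainder into the (larger) training set and the validation set.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest,
                                                  test_size=0.15,
                                                  random_state=42)
print(len(X_train), len(X_val), len(X_test))  # roughly 722 / 128 / 150
```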
In each epoch, the same training data is fed to the neural network architecture repeatedly,
and the model continues to learn the features of the data.
The training set should have a diversified set of inputs so that the model is trained in all
scenarios and can predict any unseen data sample that may appear in the future.
The Validation Set
The validation set is a set of data, separate from the training set, that is used to validate our
model performance during training.
This validation process gives information that helps us tune the model’s hyperparameters
and configurations accordingly. It is like a critic telling us whether the training is moving in
the right direction or not.
The model is trained on the training set, and, simultaneously, the model evaluation is
performed on the validation set after every epoch.
The main idea of splitting the dataset into a validation set is to prevent our model from
overfitting i.e., the model becomes really good at classifying the samples in the training set
but cannot generalize and make accurate classifications on the data it has not seen before.
The test set, in turn, provides an unbiased final model performance metric in terms of accuracy,
precision, etc. To put it simply, it answers the question of "How well does the model perform?"
VARIANCE
Variance is the measure of spread in data from its mean position.
In machine learning variance is the amount by which the performance
of a predictive model changes when it is trained on different subsets
of the training data.
More specifically, variance is the variability of the model: how sensitive it is
to another subset of the training dataset, i.e. how much its predictions change
when it is fit on a new subset of the training data.
Let Y be the actual values of the target variable, and Y’ be the
predicted values of the target variable.
Then the variance of a model can be measured as the expected value
of the square of the difference between predicted values and the
expected value of the predicted values.
Variance = E[(Y’ - E[ Y’])^2]
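A minimal sketch of estimating this quantity, assuming a scikit-learn regression model trained on different random subsets of an illustrative dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))                     # dummy inputs
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)     # noisy targets
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)

# Collect predictions Y' from models trained on different random subsets.
preds = []
for _ in range(20):
    idx = rng.choice(len(X), size=150, replace=False)
    model = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds.append(model.predict(X_test))
preds = np.array(preds)                  # shape: (20 models, 50 test points)

# Variance = E[(Y' - E[Y'])^2], averaged over the test points.
variance = np.mean((preds - preds.mean(axis=0)) ** 2)
print(f"estimated model variance: {variance:.3f}")
```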
Ways to Reduce Variance in Machine
Learning:
Cross-validation: By splitting the data into training and testing
sets multiple times, cross-validation can help identify if a model is
overfitting or underfitting and can be used to tune hyperparameters to
reduce variance.
Feature selection: Choosing only the relevant features will
decrease the model’s complexity and can reduce the variance error.
Regularization: We can use L1 or L2 regularization to reduce
variance in machine learning models
Ensemble methods: It will combine multiple models to improve
generalization performance. Bagging, boosting, and stacking are
common ensemble methods that can help reduce variance and improve
generalization performance.
Simplifying the model: Reducing the complexity of the model,
such as decreasing the number of parameters or layers in a neural
network, can also help reduce variance and improve generalization
performance.
Early stopping: Early stopping is a technique used to prevent
overfitting by stopping the training of the deep learning model when the
performance on the validation set stops improving.
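A minimal sketch of early stopping; train_one_epoch and evaluate are hypothetical helper functions standing in for a real training loop:

```python
def train_with_early_stopping(model, train_data, val_data,
                              max_epochs=100, patience=5):
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)     # hypothetical helper: one pass over the data
        val_loss = evaluate(model, val_data)   # hypothetical helper: loss on the validation set
        if val_loss < best_val_loss:
            best_val_loss = val_loss            # validation performance improved
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1     # no improvement this epoch
        if epochs_without_improvement >= patience:
            break                               # stop training before the model overfits
    return model
```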
Deep Learning
Deep learning is a method in artificial intelligence (AI) that
teaches computers to process data in a way that is inspired by
the human brain. Deep learning models can recognize complex
patterns in pictures, text, sounds, and other data to produce
accurate insights and predictions.
Machine Learning vs. Deep Learning
Machine Learning: A model is created from relevant features which are manually
extracted from images to detect an object in the image.
Deep Learning: Relevant features are automatically extracted from images. It is an
end-to-end learning process.
The neural network can compare the outputs of its nodes with
the desired values using a property known as the delta rule,
allowing the network to alter its weights through training to
create more accurate output values. This training and learning
procedure is a form of gradient descent. The technique of updating
weights in multi-layered perceptrons is virtually the same;
however, the process is referred to as back-propagation. In such
circumstances, the output values provided by the final layer are
used to alter the weights of each hidden layer inside the network.
A Feed Forward Neural Network is an artificial neural network
in which the connections between nodes do not form a cycle.
The feed forward model is the simplest form of neural network
as information is only processed in one direction. While the data
may pass through multiple hidden nodes, it always moves in one
direction and never backwards.
The structure of a deep feed-forward network (DFF) is very similar to that of a
feed-forward network (FF). The major difference between them is the number of
hidden layers. Currently, people refer to a neural network with one hidden
layer as a “shallow” network or simply a Feed-Forward network.
Feedforward neural networks perform well when solving
basic problems like identifying simple patterns or classifying
information. However, they will struggle with more complex
tasks. On the other hand, deep learning algorithms can process
and analyze vast data volumes due to several hidden layers of
abstraction.