
DEEP LEARNING

MODULE 1
1.1 INTRODUCTION

Artificial Intelligence – the broad field of building systems that mimic the human brain.


Machine Learning – a branch of Artificial Intelligence. It focuses on the
use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy.
Deep Learning – a subset of machine learning that is essentially a
neural network with three or more layers. These neural networks attempt to
simulate the behavior of the human brain, allowing them to "learn" from large
amounts of data. While a neural network with a single layer can still make
approximate predictions, additional hidden layers help to optimize and
refine the results for accuracy.
NEURON
Neurons are the fundamental units of the nervous system, specialized to
transmit information to different parts of the body.
Neurons are the building blocks of the nervous system.
They receive and transmit signals to different parts of the body.
This is carried out in both chemical and electrical forms.
There are several different types of neurons that facilitate the transmission
of information.

 Sensory neurons carry information from the sensory receptor cells present
throughout the body to the brain.
 Motor neurons transmit information from the brain to the muscles.
 Interneurons transmit information between different neurons in the body.

• All neurons have three different parts – dendrites, cell body and axon.
• Parts of Neuron
• Following are the different parts of a neuron:
• Dendrites
• These are branch-like structures that receive messages from other
neurons and allow the transmission of messages to the cell body.
• Cell Body
• Each neuron has a cell body with a nucleus, Golgi body, endoplasmic
reticulum, mitochondria and other components.
• Axon
• Axon is a tube-like structure that carries electrical impulses from the
cell body to the axon terminals, which pass the impulse to another
neuron.
Synapse
• It is the chemical junction between the terminal of one neuron and the
dendrites of another neuron.

Artificial Neural Network


An Artificial Neural Network is a computational network inspired by the
biological neural networks that make up the structure of the human brain.
Just as the human brain has neurons interconnected with one another, an
artificial neural network has neurons linked to each other across the various layers
of the network.
These neurons are known as nodes.

Dendrites from the biological neural network represent inputs in the artificial
neural network, the cell nucleus represents nodes, the synapse represents weights,
and the axon represents the output.

Biological Neural Network    Artificial Neural Network

Dendrites                    Inputs

Cell nucleus                 Nodes

Synapse                      Weights

Axon                         Output

• Artificial Neural Network primarily consists of three layers:

• Input Layer:
• As the name suggests, it accepts inputs in several different formats provided
by the programmer.
• Hidden Layer:
• The hidden layer is present between the input and output layers. It performs all
the calculations to find hidden features and patterns.
• Output Layer:
• The input goes through a series of transformations using the hidden layer,
which finally results in output that is conveyed using this layer.
• The artificial neural network takes the inputs, computes the weighted sum of
the inputs and includes a bias. This computation is represented in the form of
a transfer function.
• The weighted total is then passed as an input to an activation function to
produce the output.
• Activation functions decide whether a node should fire or not.
• Only the nodes that fire make it to the output layer.
• Different activation functions are available and can be chosen depending on
the sort of task we are performing.

Advantages of Artificial Neural Network (ANN)


1. Parallel processing capability:
Artificial neural networks distribute computation across many nodes, so
they can perform more than one task simultaneously.
2. Storing data on the entire network:
Information is stored on the whole network rather than in a single
location, so the loss of a few pieces of data does not stop the network from
working.
3. Capability to work with incomplete knowledge:
After ANN training, the network may produce output even with
inadequate data. The loss of performance here depends on the significance of
the missing data.
4. Having a memory distribution:
For an ANN to be able to adapt, it is important to choose suitable
examples and to train the network on them according to the desired output.
If the event cannot be shown to the network in all its aspects, the network can
produce false output.
5. Having fault tolerance:
Corruption of one or more cells of the ANN does not prevent it from
generating output, and this feature makes the network fault-tolerant.

Disadvantages of Artificial Neural Network:


1. Assurance of proper network structure:
There is no particular guideline for determining the structure of
artificial neural networks. The appropriate network structure is accomplished
through experience, trial, and error.
2. Unrecognized behavior of the network:
It is the most significant issue of ANN. When ANN produces a testing
solution, it does not provide insight concerning why and how. It decreases
trust in the network.
3. Hardware dependence:
Artificial neural networks need processors with parallel processing
power, in accordance with their structure. Therefore, suitable hardware is
required to realize the network.
4. Difficulty of showing the issue to the network:
ANNs can work only with numerical data. Problems must be converted
into numerical values before being introduced to the ANN. The representation
mechanism chosen here will directly impact the performance of the
network, and it relies on the user's abilities.
5. The duration of the network is unknown:
Training is stopped when the error is reduced to a specific value, but this
value does not guarantee optimum results.
How do artificial neural networks work?
Artificial Neural Network can be best represented as a weighted directed graph,
where the artificial neurons form the nodes.
 The association between the neurons outputs and neuron inputs can be viewed
as the directed edges with weights.
The Artificial Neural Network receives the input signal from the external source
in the form of a pattern or image, represented as a vector.
These inputs are then mathematically denoted by the notation x(n) for each of the
n inputs.
Afterward, each input is multiplied by its corresponding weight (these
weights are the details utilized by the artificial neural network to solve a specific
problem).
In general terms, these weights represent the strength of the
interconnection between neurons inside the artificial neural network.
All the weighted inputs are summed inside the computing unit.
If the weighted sum is equal to zero, a bias is added to make the output non-
zero, or to otherwise scale up the system's response.
The bias can be thought of as an extra input whose value is fixed at 1 and whose
weight is adjustable.
Here the total of weighted inputs can be in the range of 0 to positive infinity.
The activation function refers to the set of transfer functions used to achieve the
desired output.
There are different kinds of activation functions, but they are primarily either
linear or non-linear sets of functions.
Some of the commonly used activation functions are the binary, linear,
and tan hyperbolic sigmoidal activation functions.
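
A minimal sketch of the computation described above (weighted sum of the inputs plus a bias, passed through an activation function), assuming NumPy is available; the input, weight, and bias values are hypothetical examples:

```python
import numpy as np

def sigmoid(z):
    # logistic activation: squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical example values: three inputs, three weights, one bias
x = np.array([0.5, -1.2, 3.0])   # input vector x(n)
w = np.array([0.4, 0.1, -0.6])   # corresponding weights
b = 0.05                         # bias term

z = np.dot(w, x) + b             # weighted sum plus bias
y = sigmoid(z)                   # activation function produces the node's output
print(y)
```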

Types of Artificial Neural Network


• Feedforward ANN
Its flow is uni-directional, meaning that the information in the model flows in
only one direction—forward—from the input nodes, through the hidden nodes (if
any) and to the output nodes, without any cycles or loops.

• Feedback ANN
In this type of ANN, the output is fed back into the network to achieve the best-
evolved results internally.
Feedback networks feed information back into themselves and are well suited to
solving optimization problems.
Internal system error corrections also make use of feedback ANNs.

Types of Neural Network Architectures


Neural networks are an efficient way to solve machine learning problems and can
be used in various situations. Neural networks offer precision and accuracy.
Finding the correct neural network for each project can increase efficiency.

Standard neural networks

 Perceptron - A neural network that applies a mathematical operation to an
input value, providing an output variable.
 Feed-Forward Networks - A multi-layered neural network where the
information moves from left to right, or in other words, in a forward direction.
The input values pass through a series of hidden layers on their way to the
output layer.
 Residual Networks (ResNet) - A deep feed-forward network with hundreds of
layers.

Recurrent neural networks

Recurrent neural networks (RNNs) remember previously learned predictions to
help make future predictions with accuracy.

 Long short term memory network (LSTM) - LSTM adds extra structures, or
gates, to an RNN to improve memory capabilities.
 Echo state network (ESN) - A type of RNN whose hidden layers are sparsely
connected.
Convolutional neural networks

Convolutional neural networks (CNNs) are a type of feed-forward network that are
used for image analysis and language processing. There are hidden convolutional
layers that form ConvNets and detect patterns. CNNs use features such as edges,
shapes, and textures to detect patterns. Examples of CNNs include:

 AlexNet - Contains multiple convolutional layers designed for image
recognition.
 Visual geometry group (VGG) - VGG is similar to AlexNet, but has more
layers of narrow convolutions.
 Capsule networks - Contain nested capsules (groups of neurons) to create a
more powerful CNN.

Generative adversarial networks

Generative adversarial networks (GANs) are a type of unsupervised learning where
data is generated from patterns that were discovered in the input data. GANs
have two main parts that compete against one another:

 Generator - creates synthetic data from the learning phase of the model. It
will take random datasets and generate a transformed image.
 Discriminator - decides whether or not the images produced are fake or
genuine.

GANs are used to help predict what the next frame in a video might be, text to
image generation, or image to image translation.

Transformer neural networks

Unlike RNNs, transformer neural networks do not have a concept of timestamps.
This enables them to process multiple inputs at once, making them a more
efficient way to process data.

PERCEPTRON
• It is one of the oldest and first introduced neural networks.
• It was proposed by Frank Rosenblatt in 1958.
• Perceptron is also known as an artificial neuron.
• Perceptron is mainly used to compute logic gates like AND, OR, and
NOR, which have binary inputs and binary outputs.
• Perceptron is a building block of an Artificial Neural Network.
• Perceptron is a linear Machine Learning algorithm used for supervised
learning of various binary classifiers. The algorithm enables the neuron to
learn from training elements and process them one by one during training.
The main functionality of the perceptron is to:
• Take input from the input layer
• Weight the inputs and sum them up.
• Pass the sum through a function to produce the output.

Activation functions can be anything like sigmoid, tanh, relu


• Input Nodes or Input Layer:
• This is the primary component of Perceptron which accepts the initial data
into the system for further processing. Each input node contains a real
numerical value.
• Weight and Bias:
• The weight parameter represents the strength of the connection between units.
• A weight is directly proportional to the strength of the associated input neuron
in deciding the output.
• Bias can be considered as the intercept term in a linear equation.
• Activation Function:
• These are the final and important components that help to determine whether
the neuron will fire or not. Activation Function can be considered primarily
as a step function.
• Types of Activation functions:
• Sign function
• Step function, and
• Sigmoid function

How does Perceptron work?


• In Machine Learning, Perceptron is considered as a single-layer neural
network that consists of four main parameters named input values (Input
nodes), weights and Bias, net sum, and an activation function.
• The perceptron model begins with the multiplication of all input values and
their weights, then adds these values together to create the weighted sum.
• Then this weighted sum is applied to the activation function 'f' to obtain the
desired output.
• This activation function is also known as the step function and is represented
by 'f'.

Perceptron model works in two important steps as
follows:

Step-1
In the first step first, multiply all input values with corresponding weight values
and then add them to determine the weighted sum. Mathematically, we can
calculate the weighted sum as follows:
∑ wi*xi = w1*x1 + w2*x2 + … + wn*xn
Add a special term called bias 'b' to this weighted sum to improve the model's
performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned
weighted sum, which gives us output either in binary form or a continuous value as
follows:
Y = f(∑wi*xi + b)
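
A minimal sketch of these two steps in Python, assuming NumPy is available; the weights and bias below are hypothetical example values chosen to realize a two-input AND gate:

```python
import numpy as np

def step(z):
    # step activation: fire (1) if the weighted sum exceeds 0, otherwise 0
    return 1 if z > 0 else 0

def perceptron_output(x, w, b):
    # Step 1: weighted sum of inputs plus bias
    weighted_sum = np.dot(w, x) + b
    # Step 2: apply the activation (step) function
    return step(weighted_sum)

# hypothetical weights/bias implementing a two-input AND gate
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(np.array(x), w, b))   # fires only for (1, 1)
```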

Types of Perceptron Models


• Based on the layers, Perceptron models are divided into two types. These are
as follows:
• Single-layer Perceptron Model
• Multi-layer Perceptron model

Single-layer Perceptron Model


• This is one of the easiest types of Artificial Neural Network (ANN).
• A single-layer perceptron model consists of a feed-forward network and also
includes a threshold transfer function inside the model.
• The main objective of the single-layer perceptron model is to analyze
linearly separable objects with binary outcomes.
• In a single-layer perceptron model, the algorithm does not rely on previously
recorded data, so it begins with randomly allocated weight parameters.
• Further, it sums up all the weighted inputs.
• After adding all inputs, if the total sum is more than a pre-
determined threshold value, the model gets activated and shows the output value as
+1.
• If the outcome matches the pre-determined threshold value, the
performance of the model is considered satisfactory, and the weights are not
changed.
• Otherwise, to reach the desired output and minimize errors, the input weights
must be adjusted.
• "Single-layer perceptron can learn only linearly separable patterns."
Q. Explain the limitation of single layer perceptron.
• One of the main disadvantages of using a single-layer
perceptron is its limited expressive power and generalization
ability.
• It cannot learn to classify non-linearly separable patterns, such
as XOR, circles, or spirals.
• It is also prone to overfitting and noise, as it tries to fit a
straight line to the data.
• It does not have any hidden layers that can introduce non-
linearity and flexibility to the model.
• A "single-layer" perceptron can't implement XOR. The reason
is because the classes in XOR are not linearly separable. You
cannot draw a straight line to separate the points (0,0),(1,1)
from the points (0,1),(1,0).
• Truth table of XOR:
  x1  x2  |  x1 XOR x2
   0   0  |      0
   0   1  |      1
   1   0  |      1
   1   1  |      0

Here, when we apply w1*x1 + w2*x2 and compare it against a threshold t:

Case 1
w1*0 + w2*0 = 0, and the actual output must be below the threshold, so the
neuron does not fire (hence t > 0).
Case 2
• 0*w1 + 1*w2 >= t, so the neuron must fire.
Case 3
• 1*w1 + 0*w2 >= t, so the neuron must fire.
• Case 4
• 1*w1 + 1*w2 must not cause a fire, so w1 + w2 < t.
Here
• w1 >= t
• w2 >= t
• w1 + w2 < t
• Contradiction: since t > 0, the first two conditions give w1 + w2 >= 2t > t,
which violates the third. Hence a single-layer perceptron cannot realize XOR.

Multi-Layered Perceptron Model:


• Like a single-layer perceptron model, a multi-layer perceptron model also
has the same model structure but has a greater number of hidden layers.
• The multi-layer perceptron model is also known as the Backpropagation
algorithm, which executes in two stages as follows:
• Forward Stage: Activation functions start from the input layer in the
forward stage and terminate on the output layer.
• Backward Stage: In the backward stage, the weight and bias values are modified
as per the model's requirement. The error between the actual and the desired
output is propagated backward, starting at the output layer and ending at the
input layer.
• Hence, a multi-layer perceptron model can be considered as multiple
artificial neurons arranged in several layers, in which the activation function
does not remain linear as in a single-layer perceptron model. Instead of a
linear function, the activation can be sigmoid, TanH, ReLU, etc., for
deployment.
• A multi-layer perceptron model has greater processing power and can
process linear and non-linear patterns.
• Further, it can also implement logic gates such as AND, OR, XOR, NAND,
NOT, XNOR, and NOR (a small sketch of XOR follows below).
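
As a brief illustration of this point, a multi-layer perceptron with one hidden layer can learn XOR, which a single-layer perceptron cannot. This is only a sketch and assumes scikit-learn is installed; the hidden-layer size, activation, and solver here are arbitrary choices, not prescribed by the text:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# one hidden layer with a non-linear activation makes XOR learnable
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation='tanh',
                    solver='lbfgs', random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))   # typically [0 1 1 0]
```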
Advantages of Multi-Layer Perceptron:
• A multi-layered perceptron model can be used to solve complex non-linear
problems.
• It works well with both small and large input data.
• It helps us to obtain quick predictions after the training.
• It helps to obtain the same accuracy ratio with large as well as small data

MULTI LAYER NEURAL NETWORK


• Multilayer neural networks contain more than one computational layer.
• The additional intermediate layers (between input and output) are referred to
as hidden layers, because the computations performed there are not visible to the
user.
The specific architecture of multilayer neural networks is referred to as feed-
forward networks, because successive layers feed into one another in the forward
direction from input to output.
• The default architecture of feed-forward networks assumes that all nodes in
one layer are connected to those of the next layer. Therefore, the architecture
of the neural network is almost fully defined, once the number of layers and
the number/type of nodes in each layer have been defined.
• The only remaining detail is the loss function that is optimized in the output
layer.
• The loss function is the function that computes the distance between the
current output of the algorithm and the expected output. It’s a method to
evaluate how your algorithm models the data.
• The number of units in each layer is referred to as the dimensionality of that
layer.

To be accurate, a fully connected Multi-Layered Neural Network is known as a
Multi-Layer Perceptron.
A Multi-Layered Neural Network consists of multiple layers of artificial neurons
or nodes.
• Suppose we have n inputs (x1, x2, …, xn) and a bias unit. Let the weights
applied be w1, w2, …, wn. Then the summation r is found by performing the dot
product between inputs and weights and adding the bias:
r = w1*x1 + w2*x2 + … + wn*xn + b
• On feeding r into the activation function F(r) we find the output for the
hidden layer. For the first hidden-layer neuron, the output is h1 = F(r).
• For all the other hidden layers, repeat the same procedure. Keep repeating the
process until the last weight set is reached.
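
A compact sketch of this layer-by-layer forward pass, assuming NumPy; the layer sizes and the randomly drawn weights are purely illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)

# illustrative sizes: 3 inputs -> 4 hidden units -> 4 hidden units -> 1 output
x  = np.array([0.5, -1.2, 3.0])
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

h1 = relu(W1 @ x + b1)    # r = dot(weights, inputs) + bias, then F(r) for hidden layer 1
h2 = relu(W2 @ h1 + b2)   # repeat the same procedure for the next hidden layer
y  = W3 @ h2 + b3         # final weight set produces the output
print(y)
```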

ACTIVATION FUNCTION
• It’s a function that we use to get the output of node. It is also known as
Transfer Function.
• The primary role of the Activation Function is to transform the summed
weighted input from the node into an output value to be fed to the next
hidden layer or as output.
• It is used in neural network to determine the output of neural network like
yes or no. It maps the resulting values in between 0 to 1 or -1 to 1 etc.
(depending upon the function).
• The Activation Functions can be basically divided into 2 types:
1. Linear Activation Function — it does not help with the complexity of the
usual data fed to the network; the output is simply proportional to the input.
2. Non-linear Activation Functions — they make it easy for the model to
generalize or adapt to a variety of data and to differentiate between the outputs.

• The main terminologies needed to understand for nonlinear functions are:


• Derivative or Differential: Change in y-axis w.r.t. change in x-axis.It
is also known as slope.
• Monotonic function: A function which is either entirely non-
increasing or non-decreasing.
1. Sigmoid or Logistic Activation Function
• The Sigmoid Function curve looks like a S-shape.


• The main reason why we use sigmoid function is because it exists between
(0 to 1).
• Therefore, it is especially used for models where we have to predict the
probability as an output.
• Since probability of anything exists only between the range of 0 and 1,
sigmoid is the right choice.
• The function is differentiable. That means we can find the slope of the
sigmoid curve at any point.
• The function is monotonic, but the function's derivative is not.

2. Tanh or hyperbolic tangent Activation Function


tanh is also like logistic sigmoid but better. The range of the tanh function is
from (-1 to 1). tanh is also sigmoidal (s - shaped).

• The function is differentiable.


• The function is monotonic while its derivative is not monotonic.
• The tanh function is mainly used for classification between two classes.
• Both tanh and logistic sigmoid activation functions are used in feed-forward
nets.

3. ReLU (Rectified Linear Unit) Activation Function


• The ReLU is the most used activation function in the world right now, since
it is used in almost all convolutional neural networks and deep learning models.

• As you can see, the ReLU is half rectified (from bottom). f(z) is zero when z
is less than zero and f(z) is equal to z when z is above or equal to zero.
• Range: [ 0 to infinity)
• The function and its derivative both are monotonic.
• But the issue is that all the negative values become zero immediately which
decreases the ability of the model to fit or train from the data properly.
• That means any negative input given to the ReLU activation function turns
the value into zero immediately in the graph, which in turns affects the
resulting graph by not mapping the negative values appropriately.
4. Leaky ReLU
• It is an attempt to solve the dying ReLU problem

• The leak helps to increase the range of the ReLU function. Usually, the value
of a is 0.01 or so.
• When a is not 0.01 then it is called Randomized ReLU.
• Therefore the range of the Leaky ReLU is (-infinity to infinity).
• Both Leaky and Randomized ReLU functions are monotonic in nature, and
their derivatives are also monotonic in nature.

LEAKY ReLU
f(x) = max(0.01x, x)
5. Softmax activation function
• The softmax activation function takes in a vector of raw outputs of the
neural network and returns a vector of probability scores.
• In the vector z of raw outputs, the maximum value is 1.23, which on
applying softmax activation maps to 0.664: the largest entry in the softmax
output vector. Likewise, 0.25 and -0.8 map to 0.249 and 0.087: the second
and the third largest entries in the softmax output respectively. Thus,
applying softmax preserves the relative ordering of scores.
• All entries in the softmax output vector are between 0 and 1.
• In a multiclass classification problem, where the classes are mutually
exclusive, notice how the entries of the softmax output sum up to 1: 0.664 +
0.249 + 0.087 = 1.

6. Hardtanh Activation Function


• Hardtanh is an activation function used for neural networks:


• The hard tanh activation function is a modified version of the tanh function
that applies a threshold to the output to produce an output between -1 and 1.
• The hard tanh function is faster to compute than the tanh function and is
commonly used in embedded systems and real-time applications.
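
A minimal sketch of the activation functions discussed above, implemented with NumPy (hardtanh here is simply a clip to [-1, 1]); the input vector reuses the raw outputs from the softmax example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # output in (0, 1)

def tanh(z):
    return np.tanh(z)                    # output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)            # zero for negative inputs

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)     # small slope "a" for negative inputs

def hardtanh(z):
    return np.clip(z, -1.0, 1.0)         # tanh approximated by clipping

def softmax(z):
    e = np.exp(z - np.max(z))            # subtract max for numerical stability
    return e / e.sum()

z = np.array([1.23, 0.25, -0.8])         # raw outputs from the softmax example above
print(softmax(z))                        # approximately [0.664, 0.249, 0.087]
```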

LOSS FUNCTION
Neural networks are a set of algorithms that are designed to recognize
trends/relationships in a given set of training data. These algorithms are
based on the way human neurons process information.
A neural network processes the input data at each layer and eventually produces
a predicted output value; for a single layer this can be written as ŷ = f(wᵀx + b).

To train — the process by which the model maps the relationship between
the training data and the outputs — the neural network updates its
parameters, the weights wᵀ and the biases b, to satisfy the equation
above.
Each training input is loaded into the neural network in a process called
forward propagation. Once the model has produced an output, this predicted
output is compared against the given target output, and the error is propagated
back through the network in a process called backpropagation — the parameters
of the model are then adjusted so that it outputs a result closer to the target output.
• A loss function is a function that compares the target and predicted output
values; it measures how well the neural network models the training data.
When training, we aim to minimize this loss between the predicted and
target outputs.
• The parameters are adjusted to minimize the average loss — we find
the weights, wᵀ, and biases, b, that minimize the value of J (the average loss).


• Types of Loss Functions
• In supervised learning, there are two main types of loss functions :
regression and classification loss functions
• Regression Loss Functions — used in regression neural
networks; given an input value, the model predicts a
corresponding output value (rather than pre-selected labels); Ex.
Mean Squared Error, Mean Absolute Error
• Classification Loss Functions — used in classification neural
networks; given an input, the neural network produces a vector
of probabilities of the input belonging to various pre-set
categories — can then select the category with the highest
probability of belonging; Ex. Binary Cross-Entropy, Categorical
Cross-Entropy
• Mean Squared Error (MSE)
• One of the most popular loss functions, MSE finds the average of the
squared differences between the target and the predicted outputs


• Mean Absolute Error (MAE)
• MAE finds the average of the absolute differences between the target and the
predicted outputs.

• Binary Cross-Entropy/Log Loss


• This is the loss function used in binary classification models — where the
model takes in an input and has to classify it into one of two pre-set
categories.

• Categorical Cross-Entropy Loss


• In cases where the number of classes is greater than two, we utilize
categorical cross-entropy — this follows a very similar process to binary
cross-entropy.
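
As a small illustrative sketch, assuming NumPy, the four loss functions above can be written as follows (the example values at the end are hypothetical):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average of absolute differences
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true in {0, 1}; y_pred is the predicted probability of class 1
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot encoded; y_pred holds class probabilities per sample
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))

# tiny usage example with hypothetical values
print(mse(np.array([1.0, 2.0]), np.array([0.5, 2.5])))   # 0.25
```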

TRAINING A NEURAL NETWORK WITH


BACKPROPAGATION
• Backpropagation is the essence of neural network training.
• It is the method of fine-tuning the weights of a neural network based on the
error rate obtained in the previous epoch (i.e., iteration).
• Proper tuning of the weights allows you to reduce error rates and make the
model reliable by increasing its generalization.
• The Back propagation algorithm in neural network computes the gradient of
the loss function for a single weight by the chain rule.
• It efficiently computes one layer at a time
• It contains two main phases, referred to as the forward and backward phases.


• Inputs X, arrive through the preconnected path
• Input is modeled using real weights W. The weights are usually randomly
selected.
• Calculate the output for every neuron from the input layer, to the hidden
layers, to the output layer.
• Calculate the error in the output
• ERROR = TARGET - ACTUAL
• Travel back from the output layer to the hidden layer to adjust the weights
such that the error is decreased.
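
A minimal sketch of these forward and backward phases for a single sigmoid neuron trained with gradient descent on squared error (NumPy assumed; the data, learning rate, and epoch count are hypothetical choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical training data: 2 inputs per sample, binary targets (an OR gate)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 1], dtype=float)

rng = np.random.default_rng(0)
w = rng.normal(size=2)    # weights are usually randomly initialized
b = 0.0
lr = 0.5                  # learning rate

for epoch in range(2000):
    # forward phase: calculate the output for every sample
    y = sigmoid(X @ w + b)
    # error between target and actual output
    error = t - y
    # backward phase: chain rule gives the gradient of the squared error
    grad = error * y * (1 - y)         # negative of dLoss/dz for each sample
    w += lr * X.T @ grad / len(X)      # adjust weights so the error decreases
    b += lr * grad.mean()

print(np.round(sigmoid(X @ w + b), 2))  # should approach [0, 1, 1, 1]
```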

Practical Issues in Neural Network Training
• I. The Problem of Overfitting
• II. The Vanishing and Exploding Gradient Problems
• III. Difficulties in Convergence
• IV. Local and Spurious Optima
• V. Computational Challenges

• I. The Problem of Overfitting


• The primary objective in deep learning is to have a network that performs its
best on both the training data and the test data/new data it hasn't seen before.
• Overfitting and underfitting are commonly encountered problems.
• There is always a gap between the training and test data performance, which
is particularly large when the models are complex and the data set is small.
• When the network tries to learn too many details in the training data, along
with the noise, the result is poor performance on the unseen or test dataset.
When this happens the network fails to generalize the features/patterns found
in the training data.
• Error vs. iteration graph


• Overfitting during training can be spotted when the error on training data
decreases to a very small value but the error on the new data or test data
increases to a large value.
• The error vs iteration graph shows how a deep neural network overfits on
training data.
• The blue curve indicates the error on training data & the red curve the error
on test data.
• The point where the green line intersects is the instance the network begins
to overfit.
• As you can see, the error on test data increases sharply while error on
training data decreases.
• A new set of data points will result in the model/network performing poorly
as it is very close to all the training points which are noise & outliers.
• The error on the training points is minimum or very small but the error on
the new data points will be high.
• One of the main reasons for the network to overfit is if the size of the
training dataset is small.
• When the network tries to learn from a small dataset it will tend to have
greater control over the dataset & will make sure to satisfy all the datapoints
exactly.


• In order to understand this point, consider a simple single-layer neural
network on a data set with five attributes, where we use the identity
activation to learn a real-valued target variable.
• Consider a situation in which the observed target value is real and is always
twice the value of the first attribute, whereas other attributes are completely
unrelated to the target. However, we have only four training instances, which
is one less than the number of features (free parameters). For example, the
training instances could be as follows:
The correct parameter vector in this case is W = [2, 0, 0, 0, 0] based on the
known relationship between the first feature and target.
The training data also provides zero error with this solution, although the
relationship needs to be learned from the given instances
However, the problem is that the number of training points is fewer than the
number of parameters, and it is possible to find an infinite number of solutions with
zero error.
For example, the parameter set [0, 2, 4, 6, 8] also provides zero error on the
training data.
However, if we used this solution on unseen test data, it is likely to provide very
poor performance because the learned parameters are spuriously inferred and are
unlikely to generalize well to new points in which the target is twice the first
attribute (and other attributes are random).
As a result,the solution does not generalize well to unseen test data.

Underfitting
Underfitting happens when the network can model neither the training nor the test
data, which results in overall bad performance.
Looking at the graph, the model doesn't cover all the data points and has a high
error on both training and test data.
The reason for underfitting can be the limited capacity of the network, a
limited number of features provided as input to the network, noisy data, etc.
• It represents the inability of the model to learn the training data effectively,
resulting in poor performance on both the training and testing data.
• In simple terms, an underfit model is inaccurate, especially when applied
to new, unseen examples.
• It mainly happens when we use a very simple model with overly simplified
assumptions.
• To address the underfitting problem, we need to use more complex
models, with enhanced feature representation, and less regularization.
• Note: An underfitting model has high bias and low variance.

Reasons for Underfitting


The model is too simple, so it may not be capable of representing the
complexities in the data.
The input features used to train the model are not an adequate
representation of the underlying factors influencing the target variable.
The size of the training dataset used is not enough.
Excessive regularization is used to prevent overfitting, which constrains
the model from capturing the data well.
Features are not scaled.
Techniques to Reduce Underfitting
Increase model complexity.
Increase the number of features, performing feature engineering.
Remove noise from the data.
Increase the number of epochs or increase the duration of training to get
better results.
Measures to prevent overfitting
1. Decrease the network complexity
Deep neural networks like CNNs are prone to overfitting because of the millions
or billions of parameters they enclose.
By removing certain layers or decreasing the number of neurons (filters in a CNN)
the network becomes less prone to overfitting, as the neurons contributing to
overfitting are removed or deactivated.
 The network also has a reduced number of parameters because of which it
cannot memorize all the data points & will be forced to generalize.

There is no general rule as to how many layers should be removed or how many
neurons a layer should contain so that the network does not overfit.
The popular approach for reducing the network complexity is
--Grid search can be applied to find out the number of neurons and/or layers
to reduce or remove overfitting.
--The overfit model can be pruned (trimmed) by removing nodes or
connections until it reaches suitable performance on test data.
2. Data Augmentation
One of the best strategies to avoid overfitting is to increase the size of the
training dataset.
As discussed, when the size of the training data is small the network tends
to have greater control over the training data.
But in real-world scenarios gathering of large amounts of data is a tedious
& time-consuming task, hence the collection of new data is not a viable option.

Data augmentation provides techniques to increase the size of existing


training data without any external addition.
If our training data consists of images, image augmentation techniques
like rotation, horizontal & vertical flipping, translation, increasing or decreasing
the brightness or adding noise, cutouts etc can be applied to the existing training
images to increase the number of instances.
By applying the above-mentioned data augmentation strategies, the
network is trained on multiple instances of the same class of object in different
perspectives.
An augmented result of a lion’s photograph will have an instance of a
lion being viewed in a rotated manner, a lion being viewed up-side-down or cutting
out the portion of an image which encloses the mane of a lion.
By applying the last augmentation (cutout) the network learns to associate
the feature that male lions have a mane with its class.
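
A brief sketch of such an augmentation pipeline, assuming the torchvision library is available; the specific transforms and parameter values below are illustrative choices, not prescribed by the text:

```python
from torchvision import transforms

# illustrative pipeline: rotation, flips, brightness jitter, and a cutout-style erase
augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),      # rotate the image
    transforms.RandomHorizontalFlip(p=0.5),     # horizontal flip
    transforms.RandomVerticalFlip(p=0.5),       # vertical flip
    transforms.ColorJitter(brightness=0.3),     # increase or decrease brightness
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),            # cutout: erase a random patch
])

# usage: augmented = augment(pil_image)  # pil_image is an existing training image
```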
3. Weight Regularization
Weight regularization is a technique which aims to stabilize an overfitted
network by penalizing the large value of weights in the network.
An overfitted network usually presents with problems with a large value
of weights as a small change in the input can lead to large changes in the output.
For instance, when the network is given new or test data, it results in
incorrect predictions.
Weight regularization penalizes the network's large weights and forces the
optimization algorithm to reduce the larger weight values to smaller ones, and
this leads to stability of the network and good performance.
In weight regularization, the network configuration remains unchanged
only modifying the value of weights.
Weight Regularization reduces overfitting by penalizing or adding a
constraint to the loss function.
Regularization terms are constraints the optimization algorithm (like
Stochastic Gradient Descent) must adhere to when minimizing loss function apart
from minimizing the error between predicted value & actual value.

The two standard forms of weight regularization, L1 and L2, add a penalty term
to the loss:
L1: Loss = Error(y, ŷ) + λ Σ |wi|
L2: Loss = Error(y, ŷ) + λ Σ wi²
There are two parts to each equation: the first part is the error between the
actual target and the predicted target (the loss function).
 The second part is the weight penalty or the regularization term.
A regression model that uses the L1 regularization technique is called Lasso
Regression, and a model which uses L2 is called Ridge Regression.
The key difference between these two is the penalty term.
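
A small sketch of adding an L2 penalty to a loss in NumPy; the λ value is an arbitrary example:

```python
import numpy as np

def l2_regularized_loss(y_true, y_pred, weights, lam=0.01):
    # first part: error between actual and predicted targets (MSE here)
    data_loss = np.mean((y_true - y_pred) ** 2)
    # second part: weight penalty, lambda * sum of squared weights
    penalty = lam * np.sum(weights ** 2)
    return data_loss + penalty

# the gradient of the penalty w.r.t. each weight is 2 * lam * w, so the update
# rule shrinks large weights toward zero ("weight decay").
```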

4. Dropouts

Dropout is a regularization strategy that prevents deep neural networks from
overfitting.
It deactivates a certain number of neurons at a layer, preventing them from firing
during training.
At each iteration a different set of neurons is deactivated, and this results in a
different set of results.
Many deep learning frameworks implement dropouts as a layer which receives
inputs from the previous layer, the dropout layer randomly selects neurons which
are not fired to the next layer. By deactivating certain neurons which might
contribute to overfitting the performance of the network on test data improves.
Dropouts reduce overfitting in a variety of problems like image classification,
image segmentation, word embedding etc.
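
A minimal sketch of how a dropout layer can be implemented during training ("inverted" dropout, NumPy assumed; the drop probability is an example value):

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    # during training, randomly deactivate a fraction p_drop of the neurons
    if not training:
        return activations            # at test time all neurons stay active
    mask = (np.random.rand(*activations.shape) >= p_drop)
    # scale the survivors so the expected activation magnitude is unchanged
    return activations * mask / (1.0 - p_drop)

h = np.array([0.2, 1.5, -0.7, 3.1])
print(dropout(h, p_drop=0.5))         # a different subset is zeroed on every call
```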
5. Early Stopping
While training a neural network using an optimization algorithm like Gradient
Descent, the model parameters (weights) are updated to reduce the training error.
At the end of each forward propagation, the network parameters are updated to
reduce error in the next iteration.
Too much training can result in network overfitting on the training data.
Early stopping provides guidance as to how many iterations can be run before
the network begins to overfit.
The above graph indicates the point after which the network begins to overfit.
The network parameters at the point of early termination are the best fit for the
model.
To decrease the test error beyond the point of early termination, one can try:
Decreasing the learning rate (applying a learning rate scheduler algorithm
would be recommended).
Applying a different optimization algorithm.
Applying regularization.
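
A schematic of an early-stopping loop with a patience counter; the train_one_epoch and validation_loss callables are hypothetical placeholders supplied by the caller, not part of any particular library:

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop training once the validation error has not improved for `patience` epochs."""
    best_val = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one pass of forward/backward updates
        val_loss = validation_loss()           # error on the held-out validation set
        if val_loss < best_val:
            best_val = val_loss
            epochs_without_improvement = 0     # improvement: reset the patience counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # the network has begun to overfit: stop
    return best_val
```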
6. Neural Architecture and Parameter Sharing
The most effective way of building a neural network is by constructing the
architecture of the neural network after giving some thought to the
underlying data domain.
For example, the successive words in a sentence are often related to one
another, whereas the nearby pixels in an image are typically related.
These types of insights are used to create specialized architectures for text
and image data with fewer parameters.
Furthermore, many of the parameters might be shared. For example, a
convolutional neural network uses the same set of parameters to learn the
characteristics of a local block of the image.
7. Trading Off Breadth for Depth
Networks with more layers (i.e., greater depth) tend to require far fewer
units per layer, because the composition functions created by successive
layers make the neural network more powerful.
Increased depth is a form of regularization, as the features in later layers
are forced to obey a particular type of structure imposed by the earlier layers.
The number of units in each layer can typically be reduced to such an
extent that a deep network often has far fewer parameters, even when added
up over the greater number of layers.
8. Ensemble Methods
 A variety of ensemble methods like bagging are used in order to increase
the generalization power of the model.
These methods are applicable not just to neural networks but to any type of
machine learning algorithm.
 However, in recent years, a number of ensemble methods that are
specifically focused on neural networks have also been proposed.
Two such methods include Dropout and DropConnect.
These methods can be combined with many neural network architectures
to obtain an additional accuracy improvement of about 2% in many real
settings.
However, the precise improvement depends on the type of data and the
nature of the underlying training.

II. The Vanishing and Exploding Gradient Problems


While increasing depth often reduces the number of parameters of the
network, it leads to different types of practical issues.
Propagating backwards using the chain rule has its drawbacks in networks
with a large number of layers in terms of the stability of the updates.
In particular, the updates in earlier layers can either be negligibly small
(vanishing gradient) or they can be increasingly large (exploding gradient) in
certain types of neural network architectures.

III. Difficulties in Convergence


Sufficiently fast convergence of the optimization process is difficult to
achieve with very deep networks, as depth leads to increased resistance to
the training process in terms of
letting the gradients smoothly flow through the network.
This problem is somewhat related to the vanishing gradient problem, but has
its own unique characteristics.
IV. Local and Spurious Optima
When the parameter space is large, and there are many local optima, it
makes sense to spend some effort in picking good initialization points.
One such method for improving neural network initialization is referred to as
pretraining.
The basic idea is to use either supervised or unsupervised training on
shallow sub-networks of the original network in order to create the initial
weights.
This type of pretraining is done in a greedy and layerwise fashion in which a
single layer of the network is trained at one time in order to learn the
initialization points of that layer.
This type of approach provides initialization points that ignore drastically
irrelevant parts of the parameter space to begin with.
Furthermore, unsupervised pretraining often tends to avoid problems
associated with overfitting
V Computational Challenges
A significant challenge in neural network design is the running time
required to train the network.
It is not uncommon to require weeks to train neural networks in the text and
image domains.
In recent years, advances in hardware technology such as Graphics
Processor Units (GPUs) have helped to a significant extent.
GPUs are specialized hardware processors that can significantly speed up
the kinds of operations commonly used in neural networks.
In this sense, some algorithmic frameworks like Torch are particularly
convenient because they have GPU support tightly integrated into the
platform.
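
For illustration, with the PyTorch framework (a successor of the Torch library mentioned above), moving a model and its data onto a GPU is a one-line change; this sketch assumes PyTorch is installed and uses an illustrative model and batch:

```python
import torch

# use a GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# an illustrative model and batch of data
model = torch.nn.Linear(10, 2).to(device)   # move the model's parameters to the device
x = torch.randn(32, 10, device=device)      # create the batch directly on the device
y = model(x)                                 # the forward pass runs on the GPU if present
print(device, y.shape)
```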

HYPERPARAMETERS AND VALIDATION SETS


• Hyperparameters in Machine learning are those parameters that are
explicitly defined by the user to control the learning process.
• These hyperparameters are used to improve the learning of the model, and
their values are set before starting the learning process of the model.
• Here the prefix "hyper" suggests that the parameters are top-level parameters
that are used in controlling the learning process.
• The value of the Hyperparameter is selected and set by the machine learning
engineer before the learning algorithm begins training the model.
• Hence, these are external to the model, and their values cannot be changed
during the training process.

Some examples of Hyperparameters in Machine Learning

o The k in kNN or K-Nearest Neighbour algorithm


o Learning rate for training a neural network
o Train-test split ratio
o Batch Size
o Number of Epochs
o Branches in Decision Tree
o Number of clusters in Clustering Algorithm

Model Parameters

o Model parameters are configuration variables that are internal to the model, and a model
learns them on its own. Examples include the weights or coefficients of independent variables
in a linear regression model or an SVM, the weights and biases of a neural network, and the
cluster centroids in clustering.

o They are used by the model for making predictions.


o They are learned by the model from the data itself
o These are usually not set manually.
o These are the part of the model and key to a machine learning Algorithm.

Model Hyperparameters:
Hyperparameters are those parameters that are explicitly defined by the user to control the learning
process. Some key points for model hyperparameters are as follows:

o These are usually defined manually by the machine learning engineer.


o One cannot know the exact best value for hyperparameters for the given problem. The best
value can be determined either by the rule of thumb or by trial and error.
o Some examples of Hyperparameters are the learning rate for training a neural network,
K in the KNN algorithm

Categories of Hyperparameters
Broadly hyperparameters can be divided into two categories, which are given below:

1. Hyperparameter for Optimization


2. Hyperparameter for Specific Models

Hyperparameter for Optimization


The process of selecting the best hyperparameters to use is known as hyperparameter tuning, and
the tuning process is also known as hyperparameter optimization. Optimization parameters are
used for optimizing the model.

• Learning Rate: The learning rate is the hyperparameter in optimization algorithms that
controls how much the model needs to change in response to the estimated error for each
time when the model's weights are updated. It is one of the crucial parameters while
building a neural network, and also it determines the frequency of cross-checking with
model parameters. Selecting the optimized learning rate is a challenging task because if the
learning rate is very small, then it may slow down the training process. On the other hand, if
the learning rate is too large, then it may not optimize the model properly.
• Note: Learning rate is a crucial hyperparameter for optimizing the model, so if there
is a requirement of tuning only a single hyperparameter, it is suggested to tune the
learning rate.
• Batch Size: To enhance the speed of the learning process, the training set is divided
into different subsets, which are known as a batch.
• Number of Epochs: An epoch can be defined as the complete cycle for training the
machine learning model. Epoch represents an iterative learning process. The number
of epochs varies from model to model, and various models are created with more than
one epoch. To determine the right number of epochs, a validation error is taken into
account.
• The number of epochs is increased until there is a reduction in a validation
error. If there is no improvement in reduction error for the consecutive
epochs, then it indicates to stop increasing the number of epochs.

Hyperparameter for Specific Models


Hyperparameters that are involved in the structure of the model are known as
hyperparameters for specific models. These are given below:

o A number of Hidden Units: Hidden units are part of neural networks, which refer to
the components comprising the layers of processors between input and output units
in a neural network.

It is important to specify the number of hidden units hyperparameter for the neural network.
It should be between the size of the input layer and the size of the output layer. More
specifically, the number of hidden units should be 2/3 of the size of the input layer, plus the
size of the output layer.

For complex functions, it is necessary to specify the number of hidden units, but it should not
overfit the model.

o Number of Layers: A neural network is made up of vertically arranged components,


which are called layers. There are mainly input layers, hidden layers, and output
layers. A 3-layered neural network gives a better performance than a 2-layered
network. For a Convolutional Neural network, a greater number of layers make a
better model.

Validation Set
 Training set: The data you will use to train your model. This will be fed
into an algorithm that generates a model. It maps inputs to outputs.
 Validation set: This is smaller than the training set, and is used to evaluate
the performance of models with different hyperparameter values. It's also
used to detect overfitting during the training stages.
 Test set: This set is used to get an idea of the final performance of a model
after hyperparameter tuning. It's also useful to get an idea of how different
models (SVMs, Neural Networks, Random forests...) perform against each
other.

The validation and test sets are usually much smaller than the training
set.

The validation and test sets are put aside at the beginning of the project
and are not used for training.

The validation set is used to fine-tune the hyperparameters of the model and is
considered a part of the training of the model. The model only sees this data for
evaluation but does not learn from this data

Train vs. Validation vs. Test set


For training and testing purposes of our model, we should have our data broken down into
three distinct dataset splits.

The Training Set


It is the set of data that is used to train and make the model learn the hidden
features/patterns in the data.

In each epoch, the same training data is fed to the neural network architecture repeatedly,
and the model continues to learn the features of the data.

The training set should have a diversified set of inputs so that the model is trained in all
scenarios and can predict any unseen data sample that may appear in the future.
The Validation Set
The validation set is a set of data, separate from the training set, that is used to validate our
model performance during training.

This validation process gives information that helps us tune the model’s hyperparameters
and configurations accordingly. It is like a critic telling us whether the training is moving in
the right direction or not.

The model is trained on the training set, and, simultaneously, the model evaluation is
performed on the validation set after every epoch.

The main idea of splitting the dataset into a validation set is to prevent our model from
overfitting i.e., the model becomes really good at classifying the samples in the training set
but cannot generalize and make accurate classifications on the data it has not seen before.

The Test Set


The test set is a separate set of data used to test the model after completing the training.

It provides an unbiased final model performance metric in terms of accuracy, precision, etc.
To put it simply, it answers the question of "How well does the model perform?"
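
A small sketch of this three-way split and of using the validation set to pick a hyperparameter, assuming scikit-learn is available; the random dataset, model, and candidate learning rates are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# illustrative random dataset: 1000 samples, 20 features, binary labels
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype(int)

# first split off the test set, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_lr, best_score = None, -1.0
for lr in [0.001, 0.01, 0.1]:                      # candidate learning rates (hyperparameter)
    model = MLPClassifier(hidden_layer_sizes=(16,), learning_rate_init=lr,
                          max_iter=500, random_state=0)
    model.fit(X_train, y_train)                    # the model learns only from the training set
    score = model.score(X_val, y_val)              # the validation set guides the choice
    if score > best_score:
        best_lr, best_score = lr, score

final = MLPClassifier(hidden_layer_sizes=(16,), learning_rate_init=best_lr,
                      max_iter=500, random_state=0).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))   # unbiased final estimate
```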

Estimators Bias and Variance


BIAS
Bias is simply defined as the inability of the model to capture the true relationship,
because of which there is some difference or error between the model's predicted
values and the actual values.

These differences between the actual or expected values and the predicted
values are known as the error, bias error, or error due to bias.
Bias is a systematic error that occurs due to wrong assumptions in the
machine learning process.

Let Y be the true value of a parameter, and let Y' be an estimator of Y
based on a sample of data. Then, the bias of the estimator Y' is given by:
• Bias(Y') = E(Y') - Y
• where E(Y') is the expected value of the estimator Y'. It measures how well
the model fits the data.

VARIANCE
Variance is the measure of spread in data from its mean position.
In machine learning variance is the amount by which the performance
of a predictive model changes when it is trained on different subsets
of the training data.
More specifically, variance is the variability of the model: how sensitive it
is to another subset of the training dataset, i.e., how much it adjusts when
trained on a new subset of the training data.
Let Y be the actual values of the target variable, and Y’ be the
predicted values of the target variable.
Then the variance of a model can be measured as the expected value
of the square of the difference between predicted values and the
expected value of the predicted values.
Variance = E[(Y’ - E[ Y’])^2]
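
A tiny numerical sketch of these two formulas, estimating the bias and variance of the sample-mean estimator over repeated samples (NumPy assumed; the true value, noise level, and sample size are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
true_Y = 5.0                                  # the true value of the parameter

# draw many samples and compute the estimator Y' (the sample mean) for each
estimates = np.array([rng.normal(loc=true_Y, scale=2.0, size=30).mean()
                      for _ in range(10_000)])

bias = estimates.mean() - true_Y                           # Bias(Y') = E(Y') - Y
variance = np.mean((estimates - estimates.mean()) ** 2)    # E[(Y' - E[Y'])^2]
print(round(bias, 3), round(variance, 3))     # bias ~ 0, variance ~ 2^2 / 30 ≈ 0.133
```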
Ways to Reduce Variance in Machine
Learning:
Cross-validation: By splitting the data into training and testing
sets multiple times, cross-validation can help identify if a model is
overfitting or underfitting and can be used to tune hyperparameters to
reduce variance.
Feature selection: Choosing only the relevant features will
decrease the model's complexity, and it can reduce the variance error.
Regularization: We can use L1 or L2 regularization to reduce
variance in machine learning models
Ensemble methods: It will combine multiple models to improve
generalization performance. Bagging, boosting, and stacking are
common ensemble methods that can help reduce variance and improve
generalization performance.
Simplifying the model: Reducing the complexity of the model,
such as decreasing the number of parameters or layers in a neural
network, can also help reduce variance and improve generalization
performance.
Early stopping: Early stopping is a technique used to prevent
overfitting by stopping the training of the deep learning model when the
performance on the validation set stops improving.

Deep Learning
Deep learning is a method in artificial intelligence (AI) that
teaches computers to process data in a way that is inspired by
the human brain. Deep learning models can recognize complex
patterns in pictures, text, sounds, and other data to produce
accurate insights and predictions.
Machine Learning vs. Deep Learning

Machine Learning: Applies statistical algorithms to learn the hidden patterns and relationships in the dataset.
Deep Learning:    Uses artificial neural network architectures to learn the hidden patterns and relationships in the dataset.

Machine Learning: Can work with a smaller amount of data.
Deep Learning:    Requires a larger volume of data compared to machine learning.

Machine Learning: Better for low-label tasks.
Deep Learning:    Better for complex tasks like image processing, natural language processing, etc.

Machine Learning: Takes less time to train the model.
Deep Learning:    Takes more time to train the model.

Machine Learning: A model is created from relevant features which are manually extracted from images to detect an object in the image.
Deep Learning:    Relevant features are automatically extracted from images; it is an end-to-end learning process.

Machine Learning: Less complex and easy to interpret the result.
Deep Learning:    More complex; it works like a black box, and interpretations of the result are not easy.

Machine Learning: Can work on a CPU, or requires less computing power compared to deep learning.
Deep Learning:    Requires a high-performance computer with a GPU.

Deep learning is the branch of machine learning which is


based on artificial neural network architecture. An artificial
neural network or ANN uses layers of interconnected nodes
called neurons that work together to process and learn from the
input data.
Deep Learning is a subfield of Machine Learning that
involves the use of neural networks to model and solve complex
problems. Neural networks are modeled after the structure and
function of the human brain and consist of layers of
interconnected nodes that process and transform data.
The key characteristic of Deep Learning is the use of deep
neural networks, which have multiple layers of interconnected
nodes. These networks can learn complex representations of data
by discovering hierarchical patterns and features in the data.
Deep Learning algorithms can automatically learn and improve
from data without the need for manual feature engineering.
 Deep Learning has achieved significant success in various fields,
including image recognition, natural language processing,
speech recognition, and recommendation systems. Some of the
popular Deep Learning architectures include Convolutional
Neural Networks (CNNs), Recurrent Neural Networks (RNNs),
and Deep Belief Networks (DBNs).
 Training deep neural networks typically requires a large amount of
data and computational resources. However, the availability of
cloud computing and the development of specialized hardware,
such as Graphics Processing Units (GPUs), has made it easier to
train deep neural networks.

Applications of deep learning


Computer vision
Reinforcement Learning
NLP
Challenges in Deep Learning
Data availability: It requires large amounts of data to learn
from. For using deep learning it’s a big concern to gather as
much data for training.
Computational Resources: For training the deep learning
model, it is computationally expensive because it requires
specialized hardware like GPUs and TPUs.
Time-consuming: When working on sequential data, training can take a very
long time, even days or months, depending on the computational resources.
Interpretability: Deep learning models are complex, so it is hard to
interpret how they arrive at their results.
Overfitting: when the model is trained again and again, it
becomes too specialized for the training data, leading to
overfitting and poor performance on new data.
Advantages of Deep Learning:
1. High accuracy: Deep Learning algorithms can achieve
state-of-the-art performance in various tasks, such as image
recognition and natural language processing.
2. Automated feature engineering: Deep Learning algorithms
can automatically discover and learn relevant features from
data without the need for manual feature engineering.
3. Scalability: Deep Learning models can scale to handle large
and complex datasets, and can learn from massive amounts
of data.
4. Flexibility: Deep Learning models can be applied to a wide
range of tasks and can handle various types of data, such as
images, text, and speech.
5. Continual improvement: Deep Learning models can
continually improve their performance as more data
becomes available.
Disadvantages of Deep Learning:
1. High computational requirements: Deep Learning models
require large amounts of data and computational resources
to train and optimize.
2. Requires large amounts of labeled data: Deep Learning
models often require a large amount of labeled data for
training, which can be expensive and time-consuming to
acquire.
3. Interpretability: Deep Learning models can be challenging
to interpret, making it difficult to understand how they
make decisions.
4. Overfitting: Deep Learning models can sometimes overfit
to the training data, resulting in poor performance on new
and unseen data.
5. Black-box nature: Deep Learning models are often treated
as black boxes, making it difficult to understand how they
work and how they arrived at their predictions.
In summary, while Deep Learning offers many advantages,
including high accuracy and scalability, it also has some
disadvantages, such as high computational requirements,
the need for large amounts of labeled data, and
interpretability challenges. These limitations need to be
carefully considered when deciding whether to use Deep
Learning for a specific task.

DEEP FEED FORWARD NETWORK (DFF)


In its simplest form, a Feed-Forward Neural Network is a single-layer
perceptron. A sequence of inputs enters the layer and is multiplied by the
weights in this model. The weighted input values are then
summed together to form a total. If the sum of the values is more
than a predetermined threshold, which is normally set at zero,
the output value is usually 1, and if the sum is less than the
threshold, the output value is usually -1. The single-layer
perceptron is a popular feed-forward neural network model that
is frequently used for classification. Single-layer perceptrons can
also contain machine learning features.

The neural network can compare the outputs of its nodes with
the desired values using a property known as the delta rule,
allowing the network to alter its weights through training to
create more accurate output values. This training and learning
procedure results in gradient descent. The technique of updating
weights in multi-layered perceptrons is virtually the same,
however, the process is referred to as back-propagation. In such
circumstances, the output values provided by the final layer are
used to alter each hidden layer inside the network.
A Feed Forward Neural Network is an artificial neural network
in which the connections between nodes do not form a cycle.
The feed forward model is the simplest form of neural network,
as information is only processed in one direction. While the data
may pass through multiple hidden nodes, it always moves in one
direction and never backwards.
The structure of a DFF is very similar to that of an FF. The
major difference between them is the number of hidden layers.
Currently, people refer to a Neural Network with one hidden
layer as a “shallow” network or simply a Feed-Forward network.
Feedforward neural networks perform well when solving
basic problems like identifying simple patterns or classifying
information. However, they will struggle with more complex
tasks. On the other hand, deep learning algorithms can process
and analyze vast data volumes due to several hidden layers of
abstraction.
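
As a closing sketch, a deep feed-forward network of the kind described above (several hidden layers, information flowing only forward) could be defined like this with PyTorch, assuming it is installed; the layer sizes are illustrative choices:

```python
import torch
import torch.nn as nn

# a deep feed-forward network: input -> three hidden layers -> output, no cycles
dff = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),     # hidden layer 1
    nn.Linear(64, 64), nn.ReLU(),     # hidden layer 2
    nn.Linear(64, 32), nn.ReLU(),     # hidden layer 3
    nn.Linear(32, 2),                 # output layer (e.g., two classes)
)

x = torch.randn(8, 20)                # a batch of 8 samples, 20 features each
print(dff(x).shape)                   # torch.Size([8, 2]); data only ever moves forward
```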
