Unit V
Neural Networks and Deep Learning
The term "Artificial Neural Network" is derived from Biological neural networks
that develop the structure of a human brain. Similar to the human brain that has
neurons interconnected to one another, artificial neural networks also have neurons
that are interconnected to one another in various layers of the networks. These
neurons are known as nodes.
The typical Artificial Neural Network is modelled on the Biological Neural Network; the correspondence between their components is as follows:
• Dendrites -> Inputs
• Synapse -> Weights
• Axon -> Output
There are around 1000 billion neurons in the human brain. Each neuron has an association point somewhere in the range of 1,000 to 100,000. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data in parallel from our memory when necessary. We can say that the human brain is made up of incredibly amazing parallel processors.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the calculations needed to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a transfer
function.
Artificial neural networks can perform more than one task simultaneously.
Unlike traditional programming, where data is stored in a database, in an ANN the information is stored on the whole network. The disappearance of a couple of pieces of data in one place does not prevent the network from working.
After ANN training, the network can produce output even with inadequate data. The loss of performance here depends on the significance of the missing data.
The corruption of one or more cells of the ANN does not prevent it from generating output, and this feature makes the network fault-tolerant.
The unexplained ("black box") behaviour of the network is the most significant issue with ANNs: when an ANN produces a solution, it does not provide insight concerning why and how, which decreases trust in the network.
Hardware dependence:
Artificial neural networks need processors with parallel processing power, as per their structure; the realization of the network therefore depends on suitable hardware.
ANNs can work only with numerical data, so problems must be converted into numerical values before being introduced to the ANN. The representation mechanism chosen here directly impacts the performance of the network, and it relies on the user's abilities.
Training is stopped when the network error is reduced to a specific value, and this value does not necessarily give us optimum results.
If the weighted sum is equal to zero, then a bias is added to make the output non-zero, or to otherwise scale up the system's response. The bias has an input of 1 with its own associated weight. The total of the weighted inputs can lie anywhere in the range from 0 to positive infinity. To keep the response within the limits of the desired value, a certain maximum value is benchmarked, and the total of the weighted inputs is passed through the activation function.
The activation function refers to the set of transfer functions used to achieve the desired output. There are different kinds of activation functions, but they are primarily either linear or non-linear sets of functions. Some of the commonly used activation functions are the binary, linear and tan hyperbolic sigmoidal activation functions. Let us take a look at each of them in detail:
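As a quick illustration of these three activation functions applied to a neuron's weighted sum, here is a minimal Python sketch; the input values, weights and bias used are purely illustrative.

```python
import numpy as np

def binary_step(z):
    # Binary (threshold) activation: outputs 1 if the weighted sum crosses 0
    return 1.0 if z >= 0 else 0.0

def linear(z):
    # Linear activation: output is directly proportional to the input
    return z

def tanh_sigmoid(z):
    # Tan hyperbolic sigmoidal activation: squashes z into the range (-1, 1)
    return np.tanh(z)

# Illustrative inputs, weights and bias for a single neuron
x = np.array([0.5, 0.2, 0.1])
w = np.array([0.4, 0.3, 0.9])
b = 0.1

z = np.dot(w, x) + b    # weighted sum of the inputs plus bias
print(binary_step(z), linear(z), tanh_sigmoid(z))
```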
Definition of NN
Neural networks are a set of algorithms that try to recognize the patterns, relationships and information in data through a process that is inspired by, and works like, the human brain.
In the above image, the Artificial Neural Network consists of four layers interconnected with each other:
1. In the first step, input units are passed, i.e., the data is passed to the hidden layer with some weights attached to it. We can have any number of hidden layers. In the above image, the inputs x1, x2, x3, …, xn are passed.
2. Each hidden layer consists of neurons. All the inputs are connected to each
neuron.
3. After passing on the inputs, all the computation is performed in the
hidden layer (Blue oval in the picture)
The computation performed in the hidden layers is done in two steps, as follows (a short sketch of these two steps is given after this list):
• First of all, all the inputs are multiplied by their weights. A weight is the gradient or coefficient of each variable; it shows the strength of the particular input. After assigning the weights, a bias variable is added. Bias is a constant that helps the model fit in the best way possible. W1, W2, W3, W4, W5 are the weights assigned to the inputs In1, In2, In3, In4, In5, and b is the bias.
• Then, in the second step, the activation function is applied to the linear equation Z1. The activation function is a nonlinear transformation that is applied to the input before sending it to the next layer of neurons. The importance of the activation function is to introduce nonlinearity into the model.
4. The result of the hidden-layer computations is passed on to the output layer, which produces the predictions.
5. After getting the predictions from the output layer, the error is calculated, i.e., the difference between the actual and the predicted output.
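Taken together, the two-step hidden-layer computation described above can be sketched as follows; the layer sizes, inputs and weights here are only illustrative.

```python
import numpy as np

def relu(z):
    # Nonlinear activation applied to the linear combination
    return np.maximum(0.0, z)

# Illustrative inputs In1..In5 and one hidden layer of 3 neurons
x = np.array([0.5, 0.1, 0.4, 0.7, 0.2])
W = np.random.randn(3, 5) * 0.1   # weights for each hidden neuron
b = np.zeros(3)                   # bias added to each neuron

Z1 = W @ x + b                    # step 1: weighted sum of inputs plus bias
A1 = relu(Z1)                     # step 2: activation function applied to Z1
print(A1)                         # values passed on towards the output layer
```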
If the error is large, then the steps are taken to minimize the error and for the same
purpose, Back Propagation is performed.
Back Propagation is the process of updating and finding the optimal values of the weights or coefficients, which helps the model minimize the error, i.e., the difference between the actual and predicted values.
In the image below, the curve is our cost function curve, and our aim is to minimize the error so that Jmin, i.e., the global minimum, is reached.
1. First, the weights are initialized randomly, i.e., random values of the weights and intercepts are assigned to the model; forward propagation is performed and the errors are calculated after all the computation (as discussed above).
2. Then the gradient is calculated, i.e., the derivative of the error w.r.t. the current weights.
3. Then the new weights are calculated using the below formula, where α is the learning rate, a parameter also known as the step size, which controls the speed or size of the steps of backpropagation. It gives additional control over how fast we want to move along the curve to reach the global minimum.
4. This process of calculating the new weights, then the errors from the new weights, and then updating the weights continues till we reach the global minimum and the loss is minimized.
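The weight-update formula referred to in step 3 is not reproduced above; written in the same plain notation as the rest of these notes, the standard gradient-descent update is:

w_new = w_old − α × (∂E/∂w_old)

where α is the learning rate and ∂E/∂w_old is the gradient of the error with respect to the current weight.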
A point to note here is that the learning rate, i.e., α in our weight-update equation, should be chosen wisely. The learning rate is the amount of change, or step size, taken towards reaching the global minimum. It should not be very small, as training will then take a long time to converge, and it should not be very large either, or the network may never reach the global minimum at all. Therefore, the learning rate is a hyperparameter that we have to choose based on the model.
Perceptron
Threshold T represents the activation function. If the weighted sum of the inputs is
greater than zero the neuron outputs the value 1, otherwise the output value is zero.
With this discrete output, controlled by the activation function, the perceptron can
be used as a binary classification model, defining a linear decision boundary. It
finds the separating hyperplane that minimizes the distance between misclassified
points and the decision boundary.
The last piece that the Perceptron needs is the activation function, the function that determines whether the neuron will fire or not.
The sigmoid function maps any real input to a value between 0 and 1, and encodes a non-linear function.
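A minimal sketch of a perceptron with the threshold activation described above; the weights, bias and input points are made up for illustration.

```python
import numpy as np

def perceptron(x, w, b):
    # Weighted sum of the inputs plus bias
    z = np.dot(w, x) + b
    # Threshold activation: output 1 if the weighted sum is greater than zero
    return 1 if z > 0 else 0

# Illustrative weights and bias defining a linear decision boundary
w = np.array([0.6, -0.4])
b = -0.1

print(perceptron(np.array([1.0, 0.5]), w, b))  # point on the positive side -> 1
print(perceptron(np.array([0.1, 0.9]), w, b))  # point on the negative side -> 0
```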
Multilayer Perceptron
A Multilayer Perceptron has input and output layers, and one or more hidden
layers with many neurons stacked together. And while in the Perceptron the neuron
must have an activation function that imposes a threshold, like ReLU or sigmoid,
neurons in a Multilayer Perceptron can use any arbitrary activation function.
Multilayer Perceptron.
Each layer feeds the next one with the result of its computation, its internal representation of the data. This goes all the way through the hidden layers to the output layer. But there is more to it.
If the algorithm only computed the weighted sums in each neuron, propagated the results to the output layer, and stopped there, it wouldn't be able to learn the weights that minimize the cost function. If the algorithm only computed one iteration, there would be no actual learning. For learning, we need backpropagation.
Feed Forward Process in Deep Neural Network
Now we know how combining lines with different weights and biases can result in non-linear models. How does a neural network know what weight and bias values to have in each layer? It is no different from how we did it for the single perceptron model.
We are still making use of a gradient descent optimization algorithm which acts to
minimize the error of our model by iteratively moving in the direction with the
steepest descent, the direction which updates the parameters of our model while
ensuring the minimal error. It updates the weight of every model in every single
layer. We will talk more about optimization algorithms and backpropagation later.
"The process of receiving an input to produce some kind of output to make some
kind of prediction is known as Feed Forward." Feed Forward neural network is
the core of many other important neural networks such as convolution neural
network.
In the feed-forward neural network, there are no feedback loops or connections in the network. There is simply an input layer, a hidden layer, and an output layer. There can be multiple hidden layers, depending on what kind of data you are dealing with. The number of hidden layers is known as the depth of the neural network. A deeper neural network can learn more complex functions. The input layer first provides the neural network with data, and the output layer then makes predictions on that data based on a series of functions. The ReLU function is the most commonly used activation function in deep neural networks.
1) The first input is fed to the network, represented as the matrix [x1, x2, 1], where 1 is the bias value.
2) Each input is multiplied by weight with respect to the first and second model to
obtain their probability of being in the positive region in each model.
So, we will multiply our inputs by a matrix of weight using matrix multiplication.
3) After that, we take the sigmoid of our scores, which gives us the probability of the point being in the positive region in both models.
4) We multiply the probabilities which we have obtained from the previous step with the second set of weights. We always include a bias of one whenever taking a combination of inputs.
And as we know to obtain the probability of the point being in the positive region of
this model, we take the sigmoid and thus producing our final output in a feed-forward
process.
Let us take the neural network which we had previously, with the following linear models and the hidden layer, which combine to form the non-linear model in the output layer.
So, what we will do is use our non-linear model to produce an output that describes the probability of the point being in the positive region. The point is represented by (2, 2). Along with the bias, we will represent the input as (2, 2, 1).
Recall the first linear model in the hidden layer and the equation that defined it: in the first layer, to obtain the linear combination, the inputs are multiplied by -4, -1 and the bias value is multiplied by twelve.
The inputs are multiplied by the weights -1/5, 1, and the bias is multiplied by three to obtain the linear combination of that same point in our second model.
Now, to obtain the probability that the point is in the positive region relative to both models, we apply the sigmoid to both linear combinations.
The second layer contains the weights which dictate the combination of the linear models in the first layer to obtain the non-linear model in the second layer. The weights are 1.5, 1, and a bias value of 0.5.
Now, we have to multiply our probabilities from the first layer with the second set of weights.
This is the complete math behind the feed-forward process, in which the inputs from the input layer traverse the entire depth of the neural network. In this example, there is only one hidden layer. Whether there is one hidden layer or twenty, the computational process is the same for all hidden layers.
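A minimal sketch of this worked example, using the point (2, 2) and the weights quoted above; the final probability it prints (roughly 0.94) follows from those numbers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input point (2, 2) with a bias input of 1
x = np.array([2.0, 2.0, 1.0])

# First layer: two linear models (weights -4, -1, bias 12 and weights -1/5, 1, bias 3)
W1 = np.array([[-4.0, -1.0, 12.0],
               [-0.2,  1.0,  3.0]])
h = sigmoid(W1 @ x)               # probabilities from the two linear models

# Second layer: weights 1.5, 1 and bias 0.5 combine the two probabilities
W2 = np.array([1.5, 1.0, 0.5])
output = sigmoid(W2 @ np.append(h, 1.0))
print(output)                     # probability of being in the positive region
```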
The main features of backpropagation are that it is an iterative, recursive and efficient method for calculating the updated weights that improve the network, until the network is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.
Input values
x1=0.05
x2=0.10
Initial weights
w1=0.15, w5=0.40
w2=0.20, w6=0.45
w3=0.25, w7=0.50
w4=0.30, w8=0.55
Bias values
b1=0.35, b2=0.60
Target values
T1=0.01
T2=0.99
Forward Pass
To find the value of H1, we first multiply the input values by the corresponding weights as
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
To obtain the final outputs of H1 and H2, we pass these values through the sigmoid activation function: out H1 = 1/(1+e^(-0.3775)) = 0.593269992 and out H2 = 1/(1+e^(-0.3925)) = 0.596884378.
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2, using these sigmoid outputs.
To find the value of y1, we first multiply the input values, i.e., the outputs of H1 and H2, by the weights as
y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597
y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214
Passing y1 and y2 through the sigmoid function gives the final outputs 0.75136507 and 0.772928465. Our target values are 0.01 and 0.99, so our y1 and y2 values do not match the target values T1 and T2.
Now, we will find the total error, which is simply the difference between the outputs and the target outputs. The total error is calculated as shown in the sketch below.
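The error formula itself is not reproduced above; a minimal sketch of this forward pass and of the halved squared-error total, using the values listed earlier, is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer outputs (sigmoid of H1 = 0.3775 and H2 = 0.3925)
out_h1, out_h2 = sigmoid(0.3775), sigmoid(0.3925)   # 0.593269992, 0.596884378

# Output layer (weights w5..w8 and bias b2 from above), then sigmoid
out_y1 = sigmoid(out_h1 * 0.40 + out_h2 * 0.45 + 0.60)   # 0.75136507
out_y2 = sigmoid(out_h1 * 0.50 + out_h2 * 0.55 + 0.60)   # 0.772928465

# Total error: sum over both outputs of half the squared difference from the target
E_total = 0.5 * (0.01 - out_y1) ** 2 + 0.5 * (0.99 - out_y2) ** 2
print(E_total)   # ~0.298371109, the total error used in the backward pass
```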
Now, we will backpropagate this error to update the weights using a backward pass.
To update a weight, we calculate the error corresponding to that weight with the help of the total error. The error on weight w is calculated by differentiating the total error with respect to w.
From equation two, it is clear that we cannot partially differentiate it with respect to w5, because there is no w5 term in it. We therefore split equation one into multiple terms so that we can easily differentiate it with respect to w5 as
Now, we calculate each term one by one to differentiate E total with respect to w5 as
Now, we will calculate the updated weight w5 new with the help of the following
formula
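The chain-rule terms and the update formula are not reproduced above. Written in the same notation as the rest of the example, and assuming a learning rate of 0.5 (this value is not stated in the notes, but it reproduces the updated weights listed below), they work out to:

∂Etotal/∂w5 = ∂Etotal/∂out y1 × ∂out y1/∂y1 × ∂y1/∂w5
            = (out y1 − T1) × out y1 × (1 − out y1) × out H1
            = 0.74136507 × 0.186815602 × 0.593269992 = 0.082167041

w5new = w5 − η × ∂Etotal/∂w5 = 0.40 − 0.5 × 0.082167041 = 0.35891648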
In the same way, we calculate w6new, w7new, and w8new, and this gives us the following values:
w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121
Now, we will backpropagate to our hidden layer and update the weight w1, w2, w3,
and w4 as we have done with w5, w6, w7, and w8 weights.
From equation (2), it is clear that we cannot partially differentiate it with respect to w1, because there is no w1 term in it. We split equation (1) into multiple terms so that we can easily differentiate it with respect to w1 as
Now, we calculate each term one by one to differentiate E total with respect to w1 as
We again split these terms, because there are no y1 and y2 terms in E1 and E2.
Now, we find their values by putting values into equations (18) and (19) as
We calculate the partial derivative of the total net input to H1 with respect to w1 the
same as we did for the output neuron:
Now, we will calculate the updated weight w1 new with the help of the following
formula
In the same way, we calculate w2new, w3new, and w4new, and this gives us the following values:
w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229
We have now updated all the weights. We found the error 0.298371109 on the network when we fed forward the 0.05 and 0.1 inputs. After the first round of backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At this point, the output neurons generate 0.015912196 and 0.984065734, i.e., values close to our targets, when we feed forward the 0.05 and 0.1 inputs.
Activation Functions
The activation function of a neuron defines its output given its inputs. We will be talking about 4 popular activation functions:
1. Sigmoid Function:
The sigmoid squashes its input into the range (0, 1), which makes it convenient for producing a probability as the final output. Also, sigmoid outputs are not zero-centered (they are centred around 0.5), which leads to undesirable zig-zagging dynamics in the gradient updates for the weights.
2. Tanh Function:
3. Softmax Function:
4. ReLU Function:
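The plots and formulas for these functions are not reproduced here, so the following minimal NumPy sketch shows all four:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Zero-centered squashing into the range (-1, 1)
    return np.tanh(z)

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def relu(z):
    # Keeps positive values and sets negative values to 0
    return np.maximum(0.0, z)

z = np.array([-1.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), softmax(z), relu(z), sep="\n")
```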
Loss Functions
The other key aspect in setting up the neural network infrastructure is selecting the
right loss functions. With neural networks, we seek to minimize the error
(difference between actual and predicted value) which is calculated by the loss
function. We will be discussing 3 popular loss functions:
Description: MSE loss is used for regression tasks. As the name suggests, this loss
is calculated by taking the mean of squared differences between actual(target) and
predicted values.
Formula: MSE = (1/m) × Σ (y − ŷ)², where y is the actual (target) value, ŷ is the predicted value, and m is the number of records.
Range: (0,inf)
Pros: Preferred loss function if the distribution of the target variable is Gaussian as
it has good derivatives and helps the model converge quickly
Cons: Is not robust to outliers in the data (unlike loss functions like Mean Absolute
Error) and penalizes high and low predictions exponentially (unlike loss functions
like Mean Squared Logarithmic Error Loss)
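A minimal NumPy sketch of the MSE formula, using made-up target and predicted values:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Mean of the squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

print(mse_loss(np.array([3.0, -0.5, 2.0]), np.array([2.5, 0.0, 2.0])))  # 0.1666...
```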
Description: BCE loss is the default loss function used for binary classification tasks. It requires a single output node to classify the data into one of two classes, and the range of the output is (0–1), i.e., the sigmoid activation function should be used.
Formula: BCE = −(1/m) × Σ [ y × log(ŷ) + (1 − y) × log(1 − ŷ) ]
where y is the actual label, ŷ is the classifier’s predicted probability distributions for
predicting one class and m is the number of records.
Range: (0,inf)
Pros: The continuous nature of the loss function helps the training process converge well.
Cons: Can only be used with the sigmoid activation function. Other loss functions, like Hinge or Squared Hinge Loss, can work with the tanh activation function.
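A corresponding sketch of the BCE formula, with illustrative labels and predicted probabilities:

```python
import numpy as np

def bce_loss(y_true, y_pred):
    # Average negative log-likelihood over m records for binary labels
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(bce_loss(np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.7])))  # ~0.228
```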
Description: Categorical Cross Entropy (CCE) loss is used for multi-class classification tasks. It requires the final layer to go through a softmax activation so that each output node has a probability value between (0–1).
Formula: CCE = −(1/m) × Σ Σj ( yj × log(pj) )
where y is the actual label and p is the classifier’s predicted probability distributions
for predicting the class j
Range: (0,inf)
Pros: Similar to Binary Cross Entropy, the continuous nature of the loss function helps the training process converge well.
Cons: May require a one-hot encoded vector with many zero values if there are many classes, requiring significant memory (Sparse Categorical Crossentropy should be used in this case).
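A sketch of the categorical cross-entropy formula with one-hot labels; the label and probability values are illustrative:

```python
import numpy as np

def cce_loss(y_true, y_pred):
    # y_true is one-hot encoded; only the log-probability of the true class counts
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1, 0, 0], [0, 0, 1]])              # one-hot labels for 2 records
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])  # softmax outputs
print(cce_loss(y_true, y_pred))                        # ~0.434
```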
Deep learning
Deep learning is a subfield of machine learning that deals with algorithms inspired
by the structure and function of the brain. Deep learning is a subset of machine
learning, which is a part of artificial intelligence (AI).
Torch
Torch was developed in the Lua language with an implementation in C. Torch's Python implementation is called PyTorch.
Keras
Keras is a Python framework for deep learning. Its USP is reusability of code for
CPU and GPU.
TensorFlow
TensorFlow is an open-source deep-learning library developed by Google. It’s
developed in C++ and has its implementation in Python. Keras can now be run on
top of TensorFlow.
DL4J
Deep Learning for Java (DL4J) is the first deep learning library written for Java
and Scala. It’s integrated with Hadoop and Apache Spark.
Google’s TensorFlow is currently the most popular learning library in the world.
It’s based on the concept of tensors, which are vectors or matrices of n dimensions.
Introduction to CNN
Yann LeCun, director of Facebook’s AI Research Group, is the pioneer of
convolutional neural networks. He built the first convolutional neural network called
LeNet in 1988. LeNet was used for character recognition tasks like reading zip codes
and digits.
Have you ever wondered how facial recognition works on social media, or how
object detection helps in building self-driving cars, or how disease detection is done
using visual imagery in healthcare? It’s all possible thanks to convolutional neural
networks (CNN). Here’s an example of convolutional neural networks that
illustrates how they work:
Imagine there’s an image of a bird, and you want to identify whether it’s really a
bird or some other object. The first thing you do is feed the pixels of the image in
the form of arrays to the input layer of the neural network (multi-layer networks used
to classify things). The hidden layers carry out feature extraction by performing
different calculations and manipulations. There are multiple hidden layers like the
convolution layer, the ReLU layer, and pooling layer, that perform feature
extraction from the image. Finally, there’s a fully connected layer that identifies the
object in the image.
The convolution operation forms the basis of any convolutional neural network.
Let’s understand the convolution operation using two matrices, a and b, of 1
dimension.
a = [5,3,7,5,9,7]
b = [1,2,3]
In convolution operation, the arrays are multiplied element-wise, and the product is
summed to create a new array, which represents a*b.
The first three elements of the matrix a are multiplied with the elements of matrix b.
The product is summed to get the result.
The next three elements from the matrix a are multiplied by the elements in matrix b,
and the product is summed up.
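A minimal sketch of this sliding multiply-and-sum, using the arrays a and b above:

```python
import numpy as np

a = np.array([5, 3, 7, 5, 9, 7])
b = np.array([1, 2, 3])

# Slide b over a, multiplying element-wise and summing at each position
result = [int(np.dot(a[i:i + len(b)], b)) for i in range(len(a) - len(b) + 1)]
print(result)   # [32, 32, 44, 44]
```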
This process continues until the convolution operation is complete.
How Does CNN Recognize Images?
Consider an image drawn on a grid of boxes, where a coloured box represents a pixel value of 1 and an uncoloured box a value of 0. When the image is processed, only the positions holding a value of 1 are lit up; it is this matrix of 0s and 1s that the network works on. A convolutional neural network extracts features from such an image using the following layers:
1. Convolution layer
2. ReLU layer
3. Pooling layer
4. Fully connected layer
Convolution Layer
This is the first step in the process of extracting valuable features from an image. A
convolution layer has several filters that perform the convolution operation. Every
image is considered as a matrix of pixel values.
Consider the following 5x5 image whose pixel values are either 0 or 1. There’s also
a filter matrix with a dimension of 3x3. Slide the filter matrix over the image and
compute the dot product to get the convolved feature matrix.
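A minimal sketch of this sliding dot product for a 5x5 binary image and a 3x3 filter; the particular pixel and filter values here are only illustrative.

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

filt = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]])

# Slide the 3x3 filter over the 5x5 image, taking the dot product at each position
out = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * filt)

print(out)   # the 3x3 convolved feature matrix
```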
ReLU layer
ReLU stands for the rectified linear unit. Once the feature maps are extracted, the
next step is to move them to a ReLU layer.
ReLU performs an element-wise operation and sets all the negative pixels to 0. It
introduces non-linearity to the network, and the generated output is a rectified
feature map. Below is the graph of a ReLU function:
The original image is scanned with multiple convolutions and ReLU layers for
locating the features.
Pooling Layer
Pooling is a down-sampling operation that reduces the dimensionality of the feature
map. The rectified feature map now goes through a pooling layer to generate a
pooled feature map.
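A minimal sketch of 2x2 max pooling applied to a hypothetical 4x4 rectified feature map:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [3, 1, 0, 4],
                        [2, 8, 7, 1]])

# Take the maximum value in each non-overlapping 2x2 window (stride 2)
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6, 5], [8, 7]]
```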
The pooling layer uses various filters to identify different parts of the image like
edges, corners, body, feathers, eyes, and beak.
Here’s how the structure of the convolution neural network looks so far:
The next step in the process is called flattening. Flattening is used to convert all the
resultant 2-Dimensional arrays from pooled feature maps into a single long
continuous linear vector.
The flattened matrix is fed as input to the fully connected layer to classify the image.
• The pixels from the image are fed to the convolutional layer that performs
the convolution operation
• It results in a convolved map
• The convolved map is applied to a ReLU function to generate a rectified
feature map
• The image is processed with multiple convolutions and ReLU layers for
locating the features
• Different pooling layers with various filters are used to identify specific
parts of the image
• The pooled feature map is flattened and fed to a fully connected layer to
get the final output
You can use the formula [(W − K + 2P)/S] + 1 to calculate the size of the output dimensions in a CNN, where W is the input size, K is the kernel (filter) size, P is the padding and S is the stride. For example, a 5x5 input with a 3x3 filter, no padding and stride 1 gives [(5 − 3 + 0)/1] + 1 = 3, i.e., a 3x3 output.
RNNs have a "memory" which remembers all the information about what has been calculated. An RNN uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the complexity of the parameters, unlike other neural networks.
How RNN works
The working of a RNN can be understood with the help of below example:
Example:
Suppose there is a deeper network with one input layer, three hidden layers and one output layer. Then, like other neural networks, each hidden layer will have its own set of weights and biases; let's say the weights and biases are (w1, b1) for hidden layer 1, (w2, b2) for the second hidden layer and (w3, b3) for the third hidden layer. This means that each of these layers is independent of the others, i.e., they do not memorize the previous outputs.
• Hence, these three layers can be joined together into a single recurrent layer, such that the weights and biases of all the hidden layers are the same.
The output at any time step t is calculated from the current hidden state as
Yt = Why × ht
where:
Yt -> output
Why -> weight at the output layer
ht -> current hidden state
5. Once all the time steps are completed, the final current state is used to calculate the output.
6. The output is then compared to the actual output, i.e., the target output, and the error is generated.
7. The error is then back-propagated to the network to update the weights, and hence the network (RNN) is trained.
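A minimal sketch of a single recurrent layer applying the same weights at every time step; the layer sizes and random values are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

# The same parameters are reused at every time step
Wxh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input -> hidden
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden -> hidden ("memory")
Why = rng.standard_normal((output_size, hidden_size)) * 0.1  # hidden -> output
bh = np.zeros(hidden_size)

h = np.zeros(hidden_size)                        # initial hidden state
inputs = rng.standard_normal((5, input_size))    # 5 time steps of input

for x_t in inputs:
    # The current state combines the previous state and the current input
    h = np.tanh(Whh @ h + Wxh @ x_t + bh)

# Once all time steps are completed, the final state is used to calculate the output
y = Why @ h
print(y)
```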