
Unit V
Neural Networks and Deep Learning

Artificial Neural Network

The term "Artificial neural network" refers to a biologically inspired sub-field of


artificial intelligence modeled after the brain. An Artificial neural network is
usually a computational network based on biological neural networks that construct
the structure of the human brain.

What is Artificial Neural Network?

The term "Artificial Neural Network" is derived from Biological neural networks
that develop the structure of a human brain. Similar to the human brain that has
neurons interconnected to one another, artificial neural networks also have neurons
that are interconnected to one another in various layers of the networks. These
neurons are known as nodes.


The given figure illustrates the typical diagram of Biological Neural Network.

The typical Artificial Neural Network looks something like the given figure.

Dendrites from the Biological Neural Network represent inputs in Artificial Neural Networks, the cell nucleus represents Nodes, synapses represent Weights, and the Axon represents the Output.

Relationship between Biological neural network and artificial neural network:


Biological Neural Network     Artificial Neural Network
Dendrites                     Inputs
Cell nucleus                  Nodes
Synapse                       Weights
Axon                          Output

An Artificial Neural Network attempts to mimic the network of neurons that makes up the human brain, so that computers have an option to understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave simply like interconnected brain cells.

There are around 1000 billion neurons in the human brain. Each neuron has an association point somewhere in the range of 1,000 to 100,000. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data when necessary from our memory in parallel. We can say that the human brain is made up of incredibly amazing parallel processors.

We can understand the artificial neural network with an example. Consider a digital logic gate that takes an input and gives an output, such as an "OR" gate, which takes two inputs. If one or both inputs are "On," the output is "On." If both inputs are "Off," the output is "Off." Here the output depends only on the input. Our brain does not perform the same task: the relationship between outputs and inputs keeps changing, because the neurons in our brain are "learning."


The architecture of an artificial neural network:

To understand the architecture of an artificial neural network, we have to understand what a neural network consists of. A neural network consists of a large number of artificial neurons, termed units, arranged in a sequence of layers. Let us look at the various types of layers available in an artificial neural network.

Artificial Neural Network primarily consists of three layers:

Input Layer:

As the name suggests, it accepts inputs in several different formats provided by the
programmer.

Hidden Layer:

The hidden layer sits between the input and output layers. It performs all the calculations to find hidden features and patterns.

Output Layer:

The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.

The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a transfer
function.

The weighted total is then passed as an input to an activation function to produce the output. Activation functions decide whether a node should fire or not. Only those that fire make it to the output layer. There are distinct activation functions available that can be applied depending on the sort of task we are performing.
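To make this concrete, here is a minimal Python sketch of a single artificial neuron: the inputs are multiplied by their weights, a bias is added, and the weighted total is passed through an activation function (a sigmoid here). The input, weight, and bias values are purely illustrative.

import numpy as np

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias (the transfer function)
    z = np.dot(inputs, weights) + bias
    # Sigmoid activation decides the strength of the node's output
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.3, 0.2])        # illustrative inputs
w = np.array([0.4, 0.7, -0.2])       # illustrative weights
print(neuron_output(x, w, bias=0.1))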

Advantages of Artificial Neural Network (ANN)

Parallel processing capability:

Artificial neural networks perform many computations in parallel, so they can work on more than one task simultaneously.

Storing data on the entire network:

Data is stored across the whole network rather than in a single database, as in traditional programming. The disappearance of a couple of pieces of data in one place doesn't prevent the network from working.


Capability to work with incomplete knowledge:

After training, an ANN may produce output even with incomplete or inadequate data. The loss of performance here depends on the importance of the missing data.

Having a memory distribution:

For an ANN to be able to adapt, it is important to determine the examples and to train the network according to the desired output by demonstrating these examples to the network. The success of the network is directly proportional to the chosen instances, and if the event cannot be shown to the network in all its aspects, the network can produce false output.

Having fault tolerance:

Corruption of one or more cells of the ANN does not prevent it from generating output, and this feature makes the network fault-tolerant.

Disadvantages of Artificial Neural Network:

Assurance of proper network structure:

There is no particular guideline for determining the structure of an artificial neural network. An appropriate network structure is achieved through experience and trial and error.

Unrecognized behavior of the network:

This is the most significant issue with ANNs. When an ANN produces a solution, it does not provide insight into why and how, which decreases trust in the network.


Hardware dependence:

Artificial neural networks need processors with parallel processing power, as per their structure. The realization of the network is therefore dependent on suitable hardware.

Difficulty of showing the issue to the network:

ANNs can work only with numerical data. Problems must be converted into numerical values before being introduced to the ANN. The representation mechanism chosen here will directly impact the performance of the network, and it relies on the user's abilities.

The duration of training the network is unknown: The network is reduced to a specific value of the error, and this value does not guarantee optimum results.

How do artificial neural networks work?

An Artificial Neural Network can best be represented as a weighted directed graph, where the artificial neurons form the nodes and the connections between neuron outputs and neuron inputs can be viewed as directed edges with weights. The Artificial Neural Network receives its input signal from an external source in the form of a pattern or image as a vector. These inputs are then mathematically denoted by the notation x(n) for each of the n inputs.


Afterward, each input is multiplied by its corresponding weight (these weights are the details utilized by the artificial neural network to solve a specific problem). In general terms, these weights represent the strength of the interconnection between neurons inside the artificial neural network. All the weighted inputs are summed inside the computing unit.

If the weighted sum is zero, a bias is added to make the output non-zero, or to otherwise scale up the system's response. The bias has an input fixed at 1 with its own weight. Here the total of weighted inputs can be in the range of 0 to positive infinity. To keep the response within the limits of the desired value, a certain maximum value is benchmarked, and the total of the weighted inputs is passed through the activation function.


The activation function refers to the set of transfer functions used to achieve the desired output. There are different kinds of activation functions, but they are primarily either linear or non-linear sets of functions. Some of the commonly used activation functions are the binary, linear, and tan hyperbolic (sigmoidal) activation functions; these are discussed in more detail in the Activation Functions section later in this unit.

Definition of NN

Neural Networks are a set of algorithms that try to recognize patterns, relationships, and information in data through a process that is inspired by, and works like, the human brain.

Perceptron and Multi-Layer Perceptron

A Perceptron is a simple form of neural network and consists of a single layer where all the mathematical computations are performed.

A Multilayer Perceptron, also known as an Artificial Neural Network, consists of more than one perceptron grouped together to form a multi-layer neural network.

In the above image, The Artificial Neural Network consists of four layers
interconnected with each other:

• An input layer, with 6 input nodes


• Hidden Layer 1, with 4 hidden nodes/4 perceptrons
• Hidden layer 2, with 4 hidden nodes
• Output layer with 1 output node

Step by Step Working of the Artificial Neural Network


1. In the first step, the input units are passed, i.e. data is passed with some weights attached to it to the hidden layer. We can have any number of hidden layers. In the above image, the inputs x1, x2, x3, …, xn are passed.
2. Each hidden layer consists of neurons. All the inputs are connected to each neuron.
3. After passing on the inputs, all the computation is performed in the hidden layer (blue oval in the picture).

Computation performed in hidden layers are done in two steps which are as
follows :

• First of all, all the inputs are multiplied by their weights. A weight is the gradient or coefficient of each variable; it shows the strength of the particular input. After assigning the weights, a bias variable is added. Bias is a constant that helps the model fit in the best way possible.

Z1 = W1*In1 + W2*In2 + W3*In3 + W4*In4 + W5*In5 + b

W1, W2, W3, W4, W5 are the weights assigned to the inputs In1, In2, In3, In4, In5, and b is the bias.

• Then in the second step, the activation function is applied to the linear
equation Z1. The activation function is a nonlinear transformation that is
applied to the input before sending it to the next layer of neurons. The
importance of the activation function is to inculcate nonlinearity in the
model.


4. The whole process described in point 3 is performed in each hidden layer. After passing through every hidden layer, we move to the last layer, i.e. our output layer, which gives us the final output.

The process explained above is known as forward propagation (a short code sketch of this forward pass is shown after these steps).

5. After getting the predictions from the output layer, the error is calculated, i.e. the difference between the actual and the predicted output.

If the error is large, then steps are taken to minimize the error, and for this purpose Back Propagation is performed.
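The following short Python sketch illustrates the forward pass described in the steps above for one hidden layer and one output node; the sizes, weights, and target value are illustrative assumptions, not values from the text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.4, 0.1, 0.5, 0.3])     # In1 .. In5
W_hidden = np.full((4, 5), 0.1)             # 4 hidden neurons, 5 weights each
b_hidden = np.zeros(4)                      # bias for each hidden neuron
W_out = np.full((1, 4), 0.1)                # output layer weights
b_out = np.zeros(1)

z_hidden = W_hidden @ x + b_hidden          # Z = W1*In1 + ... + W5*In5 + b
a_hidden = sigmoid(z_hidden)                # activation applied to each Z
y_pred = sigmoid(W_out @ a_hidden + b_out)  # final output of the network

y_actual = 1.0                              # illustrative target
error = y_actual - y_pred                   # difference between actual and predicted
print(y_pred, error)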

What is Back Propagation and How Does it Work?

Back Propagation is the process of updating and finding the optimal values of
weights or coefficients which helps the model to minimize the error i.e
difference between the actual and predicted values.

Back Propagation with Gradient Descent


Gradient Descent is one of the optimizers which helps in calculating the new
weights. Let’s understand step by step how Gradient Descent optimizes the cost
function.

In the image below, the curve is our cost function curve, and our aim is to minimize the error such that Jmin, i.e. the global minimum, is achieved.


Steps to achieve the global minimum:

1. First, the weights are initialized randomly, i.e. random values of the weights and intercepts are assigned to the model during forward propagation, and the errors are calculated after all the computation (as discussed above).
2. Then the gradient is calculated, i.e. the derivative of the error with respect to the current weights.
3. Then the new weights are calculated using the formula below, where a is the learning rate, a parameter (also known as step size) that controls the speed or steps of the backpropagation. It gives additional control over how fast we want to move along the curve to reach the global minimum.

w_new = w_old - a * (dE/dw)


4. This process of calculating the new weights, then the errors from the new weights, and then updating the weights continues until we reach the global minimum and the loss is minimized.

A point to note here is that the learning rate, i.e. a in our weight update equation, should be chosen wisely. The learning rate is the amount of change or step size taken towards reaching the global minimum. It should not be very small, as that will take a long time to converge, and it should not be so large that the global minimum is never reached at all. Therefore, the learning rate is a hyperparameter that we have to choose based on the model.
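A minimal sketch of the weight update used by gradient descent, assuming the gradients dE/dw have already been computed during backpropagation (the numbers are illustrative):

def update_weights(weights, gradients, learning_rate=0.01):
    # w_new = w_old - a * (dE/dw), applied to every weight
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

weights = [0.5, -0.3]
gradients = [0.1, -0.2]                      # dE/dw for each weight
weights = update_weights(weights, gradients)
print(weights)                               # [0.499, -0.298]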

Perceptron

Although today the Perceptron is widely recognized as an algorithm, it was


initially intended as an image recognition machine. It gets its name from
performing the human-like function of perception, seeing and recognizing images.


Rosenblatt’s perceptron machine relied on a basic unit of computation, the neuron.


Just like in previous models, each neuron has a cell that receives a series of pairs of
inputs and weights.

The major difference in Rosenblatt’s model is that inputs are combined in


a weighted sum and, if the weighted sum exceeds a predefined threshold, the
neuron fires and produces an output.

Perceptron's neuron model (left) and threshold logic (right).

Threshold T represents the activation function. If the weighted sum of the inputs is
greater than zero the neuron outputs the value 1, otherwise the output value is zero.

Perceptron for Binary Classification

With this discrete output, controlled by the activation function, the perceptron can
be used as a binary classification model, defining a linear decision boundary. It
finds the separating hyperplane that minimizes the distance between misclassified
points and the decision boundary.


Perceptron’s loss function.

To minimize this distance, the Perceptron uses Stochastic Gradient Descent as the optimization function. If the data is linearly separable, it is guaranteed that Stochastic Gradient Descent will converge in a finite number of steps.

The last piece that Perceptron needs is the activation function, the function that
determines if the neuron will fire or not.

The sigmoid function maps any real input to a value between 0 and 1, and encodes a non-linear function.
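Below is a minimal sketch of the perceptron described above: prediction uses the threshold logic (output 1 if the weighted sum exceeds zero, otherwise 0), and training uses the classic perceptron update rule, a simple form of stochastic gradient descent. The toy data is an illustrative, linearly separable set.

import numpy as np

def perceptron_predict(x, w, b):
    # Threshold logic: fire (output 1) if the weighted sum is greater than zero
    return 1 if np.dot(w, x) + b > 0 else 0

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
w, b, lr = np.zeros(2), 0.0, 0.1

for epoch in range(10):
    for xi, yi in zip(X, y):
        err = yi - perceptron_predict(xi, w, b)   # 0 when correctly classified
        w += lr * err * xi                        # nudge the boundary toward misclassified points
        b += lr * err
print(w, b)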

Multilayer Perceptron

The Multilayer Perceptron was developed to tackle this limitation of the Perceptron, namely that it can only learn a linear decision boundary. It is a neural network where the mapping between inputs and output is non-linear.

A Multilayer Perceptron has input and output layers, and one or more hidden
layers with many neurons stacked together. And while in the Perceptron the neuron


must have an activation function that imposes a threshold, like ReLU or sigmoid,
neurons in a Multilayer Perceptron can use any arbitrary activation function.

Multilayer Perceptron.

Multilayer Perceptron falls under the category of feedforward algorithms, because


inputs are combined with the initial weights in a weighted sum and subjected to the
activation function, just like in the Perceptron. But the difference is that each linear
combination is propagated to the next layer.

Each layer is feeding the next one with the result of their computation, their internal
representation of the data. This goes all the way through the hidden layers to the
output layer. But it has more to it.


If the algorithm only computed the weighted sums in each neuron, propagated the results to the output layer, and stopped there, it wouldn't be able to learn the weights that minimize the cost function. If the algorithm only computed one iteration, there would be no actual learning. For learning we need back propagation.

Feed Forward Process in Deep Neural Network

Now we know how the combination of lines with different weights and biases can result in non-linear models. How does a neural network know what weight and bias values to have in each layer? It is no different from how we did it for the single perceptron model.

We are still making use of a gradient descent optimization algorithm which acts to
minimize the error of our model by iteratively moving in the direction with the
steepest descent, the direction which updates the parameters of our model while
ensuring the minimal error. It updates the weight of every model in every single
layer. We will talk more about optimization algorithms and backpropagation later.

It is important to recognize what the subsequent training of our neural network achieves: recognition is done by dividing our data samples with some decision boundary.

"The process of receiving an input to produce some kind of output to make some
kind of prediction is known as Feed Forward." Feed Forward neural network is
the core of many other important neural networks such as convolution neural
network.

In the feed-forward neural network, there are not any feedback loops or connections
in the network. Here is simply an input layer, a hidden layer, and an output layer.


There can be multiple hidden layers, depending on what kind of data you are dealing with. The number of hidden layers is known as the depth of the neural network. A deeper neural network can learn more complex functions. The input layer first provides the neural network with data, and the output layer then makes predictions on that data based on a series of functions. The ReLU function is the most commonly used activation function in deep neural networks.

To gain a solid understanding of the feed-forward process, let's see this


mathematically.

1) The first input is fed to the network, represented as the matrix [x1, x2, 1], where 1 is the bias value.


2) Each input is multiplied by its weight with respect to the first and second model to obtain its probability of being in the positive region in each model.

So, we will multiply our inputs by a matrix of weight using matrix multiplication.

3) After that, we take the sigmoid of our scores, which gives us the probability of the point being in the positive region in both models.

4) We multiply the probabilities which we obtained from the previous step with the second set of weights. We always include a bias of one whenever taking a combination of inputs.

And as we know to obtain the probability of the point being in the positive region of
this model, we take the sigmoid and thus producing our final output in a feed-forward
process.


Let's take the neural network we had previously, with the linear models in the hidden layer that combine to form the non-linear model in the output layer.

We use our non-linear model to produce an output that describes the probability of the point being in the positive region. The point is (2, 2); along with the bias, we represent the input as [2, 2, 1].

Recall the equation that defined the first linear model in the hidden layer: in the first layer, to obtain the linear combination, the inputs are multiplied by -4 and -1, and the bias value is multiplied by 12.

The inputs are multiplied by the weights -1/5 and 1, and the bias is multiplied by 3, to obtain the linear combination of that same point in our second model.

Now, to obtain the probability that the point is in the positive region relative to both models, we apply the sigmoid to both scores.

The second layer contains the weights that dictate the combination of the linear models in the first layer to obtain the non-linear model in the second layer. The weights are 1.5 and 1, with a bias value of 0.5.

Now, we multiply our probabilities from the first layer by the second set of weights.


Now, we take the sigmoid of our final score.

That is the complete math behind the feed-forward process, where the inputs traverse the entire depth of the neural network. In this example, there is only one hidden layer. Whether there is one hidden layer or twenty, the computational process is the same for all hidden layers.
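The following Python sketch reproduces the feed-forward calculation above for the point (2, 2), using the weights given in the text: the first linear model (-4, -1 with bias 12), the second linear model (-1/5, 1 with bias 3), and the output layer (1.5, 1 with bias 0.5).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 2.0])                       # the point (2, 2)

z1 = -4.0 * x[0] - 1.0 * x[1] + 12.0           # first linear model
z2 = (-1.0 / 5.0) * x[0] + 1.0 * x[1] + 3.0    # second linear model
h = sigmoid(np.array([z1, z2]))                # probabilities from the hidden layer

z_out = 1.5 * h[0] + 1.0 * h[1] + 0.5          # combine with the second set of weights
y = sigmoid(z_out)                             # final output of the feed-forward pass
print(h, y)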

Backpropagation Process in Deep Neural Network

Backpropagation is one of the important concepts of a neural network. Our task is to classify our data as well as possible. For this, we have to update the weights and biases, but how can we do that in a deep neural network? In the linear regression model, we use gradient descent to optimize the parameters. Similarly, here we also use a gradient descent algorithm, via Backpropagation.

For a single training example, Backpropagation algorithm calculates the gradient


of the error function. Backpropagation can be written as a function of the neural
network. Backpropagation algorithms are a set of methods used to efficiently train
artificial neural networks following a gradient descent approach which exploits the
chain rule.

The main features of Backpropagation are that it is an iterative, recursive and efficient method for calculating the updated weights to improve the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation function to be known at network design time.

Now, how is the error function used in Backpropagation, and how does Backpropagation work? Let's start with an example and do it mathematically to understand exactly how the weights are updated using Backpropagation.

Input values

X1=0.05
X2=0.10

Initial weight

W1=0.15,w5=0.40
W2=0.20,w6=0.45
W3=0.25,w7=0.50
W4=0.30,w8=0.55

Bias Values

b1=0.35 b2=0.60


Target Values

T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass

To find the value of H1 we first multiply the input value from the weights as

H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775

To calculate the final result of H1, we apply the sigmoid function:

H1final = 1 / (1 + e^-H1) = 1 / (1 + e^-0.3775) = 0.593269992

We will calculate the value of H2 in the same way as H1

H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925

To calculate the final result of H2, we apply the sigmoid function:

H2final = 1 / (1 + e^-H2) = 1 / (1 + e^-0.3925) = 0.596884378


Now, we calculate the values of y1 and y2 in the same way as we calculate the H1
and H2.

To find the value of y1, we first multiply the input value i.e., the outcome of H1 and
H2 from the weights as

y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597

To calculate the final result of y1, we apply the sigmoid function:

y1final = 1 / (1 + e^-y1) = 1 / (1 + e^-1.10590597) = 0.75136507

We will calculate the value of y2 in the same way as y1


y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214

To calculate the final result of y2, we apply the sigmoid function:

y2final = 1 / (1 + e^-y2) = 1 / (1 + e^-1.2249214) = 0.772928465

Our target values are 0.01 and 0.99. Our y1 and y2 values do not match the target values T1 and T2.

Now, we find the total error, which is simply the squared difference between the outputs and the target outputs, summed over the output neurons:

Etotal = Σ ½ (target − output)²

So, the total error is

Etotal = ½ (0.01 − 0.75136507)² + ½ (0.99 − 0.772928465)² = 0.274811083 + 0.023560026 = 0.298371109


Now, we will backpropagate this error to update the weights using a backward pass.

Backward pass at the output layer

To update a weight, we calculate the error corresponding to that weight with the help of the total error. The error on weight w is calculated by differentiating the total error with respect to w.

We perform the backward pass, so first consider the last weight w5:

From equation (2), it is clear that we cannot partially differentiate it with respect to w5, because there is no w5 in it. We split equation (1) into multiple terms so that we can easily differentiate it with respect to w5:

Now, we calculate each term one by one to differentiate Etotal with respect to w5:


Putting the value of e^-y in equation (5):


So, we put these values into equation (3) to find the final result.

Now, we calculate the updated weight w5new with the help of the following formula:

w5new = w5 − η × (∂Etotal/∂w5), where η is the learning rate.


In the same way, we calculate w6new, w7new, and w8new, and this gives us the following values:

w5new = 0.35891648
w6new = 0.408666186
w7new = 0.511301270
w8new = 0.561370121

Backward pass at Hidden layer

Now, we will backpropagate to our hidden layer and update the weight w1, w2, w3,
and w4 as we have done with w5, w6, w7, and w8 weights.

We will calculate the error at w1 as

From equation (2), it is clear that we cannot partially differentiate it with respect to w1, because there is no w1 in it. We split equation (1) into multiple terms so that we can easily differentiate it with respect to w1:

Now, we calculate each term one by one to differentiate Etotal with respect to w1:


We split this again because there is no H1final term in Etotal:

We will split again because there is no H1 term in E1 and E2. The splitting is done as:

We split both again because there are no y1 and y2 terms in E1 and E2. We split them as:

Now, we find the value by putting values into equations (18) and (19):

From equation (18)


From equation (8)

From equation (19)


Putting the value of e^-y2 in equation (23):

From equation (21)


Now from equation (16) and (17)

Put this value into equation (15):


With this value, we next need to figure out the derivative of H1 with respect to its net input:

Putting the value of e^-H1 in equation (30):


We calculate the partial derivative of the total net input to H1 with respect to w1 the same way as we did for the output neuron:

So, we put these values into equation (13) to find the final result.

Now, we calculate the updated weight w1new with the help of the following formula:

w1new = w1 − η × (∂Etotal/∂w1), where η is the learning rate.

In the same way, we calculate w2new, w3new, and w4new, and this gives us the following values:


w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229

We have updated all the weights. We found an error of 0.298371109 on the network when we fed forward the 0.05 and 0.1 inputs. After the first round of Backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error falls to 0.0000351085. At this point, the output neurons generate 0.015912196 and 0.984065734, i.e. close to our target values, when we feed forward the 0.05 and 0.1 inputs.
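The entire worked example can be reproduced with the short Python sketch below. The inputs, weights, biases, and targets are taken from the text; the learning rate is not stated in the text, but an assumed value of 0.5 reproduces the updated weights shown above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])                     # X1, X2
W1 = np.array([[0.15, 0.20],                   # w1, w2 -> H1
               [0.25, 0.30]])                  # w3, w4 -> H2
b1 = 0.35
W2 = np.array([[0.40, 0.45],                   # w5, w6 -> y1
               [0.50, 0.55]])                  # w7, w8 -> y2
b2 = 0.60
target = np.array([0.01, 0.99])                # T1, T2
lr = 0.5                                       # assumed learning rate

# Forward pass
h = sigmoid(W1 @ x + b1)                       # H1final, H2final = 0.5933, 0.5969
y = sigmoid(W2 @ h + b2)                       # y1final, y2final = 0.7514, 0.7729
E_total = 0.5 * np.sum((target - y) ** 2)      # 0.298371109

# Backward pass (chain rule)
delta_out = (y - target) * y * (1 - y)         # dEtotal / d(net y)
dW2 = np.outer(delta_out, h)                   # gradients for w5..w8
delta_hidden = (W2.T @ delta_out) * h * (1 - h)
dW1 = np.outer(delta_hidden, x)                # gradients for w1..w4

# Weight updates
W2 -= lr * dW2                                 # w5new = 0.35891648, ...
W1 -= lr * dW1                                 # w1new = 0.149780716, ...
print(E_total, W2, W1)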

Activation Functions and Loss Functions for neural networks

Activation Functions

The activation function of a neuron defines its output given its inputs. We will be talking about 4 popular activation functions:

1.Sigmoid Function:

Description: Takes a real-valued number and scales it between 0 and 1. Large


negative numbers become 0 and large positive numbers become 1
Formula: 1 /(1 + e^-x)
Range: (0,1)
Pros: As it’s range is between 0 and 1, it is ideal for situations where we need to
predict the probability of an event as an output.
Cons: The gradient values are significant in the range -3 to 3 but become much closer to zero beyond this range, which almost kills the impact of the neuron on the
final output. Also, sigmoid outputs are not zero-centered (it is centred around 0.5)
which leads to undesirable zig-zagging dynamics in the gradient updates for the
weights
Plot:

2. Tanh Function:

Description: Similar to sigmoid but takes a real-valued number and scales it


between -1 and 1.It is better than sigmoid as it is centred around 0 which leads to
better convergence
Formula: (e^x - e^-x) / (e^x + e^-x)
Range: (-1,1)
Pros: The derivatives of the tanh are larger than the derivatives of the sigmoid
which help us minimize the cost function faster
Cons: Similar to sigmoid, the gradient values become close to zero for wide range
of values (this is known as vanishing gradient problem). Thus, the network refuses
to learn or keeps learning at a very small rate.


Plot:

3. Softmax Function:

Description: The Softmax function can be imagined as a combination of multiple sigmoids; it returns the probability of a data point belonging to each individual class in a multiclass classification problem
Formula: softmax(x)_i = e^(x_i) / Σ_j e^(x_j)

Range: (0,1), sum of output = 1


Pros: Can handle multiple classes and give the probability of belonging to each
class


Cons: Should not be used in hidden layers as we want the neurons to be


independent. If we apply it then they will be linearly dependent.

Plot: Not Applicable

4. ReLU Function:

Description: The rectified linear activation function or ReLU for short is a


piecewise linear function that will output the input directly if it is positive,
otherwise, it will output zero. This is the default function but modifying default
parameters allows us to use non-zero thresholds and to use a non-zero multiple of
the input for values below the threshold (called Leaky ReLU).
Formula: max(0, x)
Range: [0, inf)
Pros: Although ReLU looks and acts like a linear function, it is a nonlinear function, allowing complex relationships to be learned, and it allows learning through all the hidden layers in a deep network by having large derivatives.
Cons: It should not be used in the final output layer for either classification or regression tasks
Plot:
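For reference, here is a compact Python sketch of the four activation functions above (numpy is assumed):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))         # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                        # squashes to (-1, 1), zero-centred

def softmax(x):
    e = np.exp(x - np.max(x))                # shift for numerical stability
    return e / e.sum()                       # outputs are positive and sum to 1

def relu(x):
    return np.maximum(0, x)                  # zero for negatives, identity otherwise

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), softmax(z), relu(z))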


Loss Functions

The other key aspect in setting up the neural network infrastructure is selecting the
right loss functions. With neural networks, we seek to minimize the error
(difference between actual and predicted value) which is calculated by the loss
function. We will be discussing 3 popular loss functions:

1. Mean Squared Error, L2 Loss

Description: MSE loss is used for regression tasks. As the name suggests, this loss
is calculated by taking the mean of squared differences between actual(target) and
predicted values.
Formula: MSE = (1/m) * Σ (y − ŷ)²
Range: [0, inf)
Pros: Preferred loss function if the distribution of the target variable is Gaussian as
it has good derivatives and helps the model converge quickly
Cons: Is not robust to outliers in the data (unlike loss functions like Mean Absolute Error) and penalizes high and low predictions exponentially (unlike loss functions like Mean Squared Logarithmic Error Loss)

2. Binary Cross Entropy

Description: BCE loss is the default loss function used for the binary classification
tasks. It requires one output layer to classify the data into two classes and the range
of output is (0–1) i.e. should use the sigmoid function.
Formula: BCE = −(1/m) * Σ [y · log(ŷ) + (1 − y) · log(1 − ŷ)]

where y is the actual label, ŷ is the classifier's predicted probability for the positive class, and m is the number of records.
Range: [0, inf)
Pros: The continuous nature of the loss function helps the training process converge well
Cons: Can only be used with the sigmoid activation function. Other loss functions like Hinge or Squared Hinge Loss can work with the tanh activation function

3. Categorical Cross Entropy

Description: It is the default loss function when we have a multi-class classification task. It requires the same number of output nodes as there are classes, with the final layer going through a softmax activation so that each output node has a probability value between 0 and 1.
Formula: CCE = −(1/m) * Σ_i Σ_j y_ij · log(p_ij)

where y is the actual label and p is the classifier's predicted probability distribution for class j
Range: [0, inf)
Pros: Similar to Binary Cross Entropy, the continuous nature of the loss function helps the training process converge well
Cons: May require a one-hot encoded vector with many zero values if there are many classes, requiring significant memory (use Sparse Categorical Crossentropy in this case)
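A minimal Python sketch of the three loss functions above; the sample labels and predictions are illustrative:

import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error (L2 loss) for regression
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # BCE for binary classification; y_pred are sigmoid probabilities
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # CCE for multi-class classification; y_true is one-hot, y_pred is softmax output
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

print(mse(np.array([1.0, 2.0]), np.array([1.1, 1.8])))
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
print(categorical_cross_entropy(np.array([[0, 1, 0]]), np.array([[0.1, 0.8, 0.1]])))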


Deep learning
Deep learning is a subfield of machine learning that deals with algorithms inspired
by the structure and function of the brain. Deep learning is a subset of machine
learning, which is a part of artificial intelligence (AI).

Artificial intelligence is the ability of a machine to imitate intelligent human


behavior. Machine learning allows a system to learn and improve from experience
automatically. Deep learning is an application of machine learning that uses
complex algorithms and deep neural nets to train a model.

Applications of Deep Learning


Let's look at some of the top applications of deep learning. Deep learning is widely used to make predictions about rain, earthquakes, and tsunamis, which helps in taking the necessary precautions. With deep learning, machines can comprehend speech and provide the required output. It enables machines to recognize people and objects in the images fed to them. Deep learning models also help advertisers leverage data to perform real-time bidding and targeted display advertising. The next section covers the need for and importance of deep learning.

Importance of Deep Learning

• Machine learning works only with sets of structured and semi-structured


data, while deep learning works with both structured and unstructured
data
• Deep learning algorithms can perform complex operations efficiently,
while machine learning algorithms cannot
• Machine learning algorithms use labeled sample data to extract patterns,
while deep learning accepts large volumes of data as input and analyzes
the input data to extract features out of an object
• The performance of traditional machine learning algorithms stops improving as the amount of data increases; so to maintain the performance of the model, we need deep learning

Deep Learning Platforms


In the following section of the introduction to deep learning, you will learn about
several deep learning platforms and when they are used.

Torch
Torch was developed using the Lua language with an implementation in C. Torch's Python implementation is called PyTorch.


Keras
Keras is a Python framework for deep learning. Its USP is reusability of code for
CPU and GPU.

TensorFlow
TensorFlow is an open-source deep-learning library developed by Google. It’s
developed in C++ and has its implementation in Python. Keras can now be run on
top of TensorFlow.

DL4J
Deep Learning for Java (DL4J) is the first deep learning library written for Java
and Scala. It’s integrated with Hadoop and Apache Spark.
Google’s TensorFlow is currently the most popular learning library in the world.
It’s based on the concept of tensors, which are vectors or matrices of n dimensions.

Introduction to CNN
Yann LeCun, director of Facebook’s AI Research Group, is the pioneer of
convolutional neural networks. He built the first convolutional neural network called
LeNet in 1988. LeNet was used for character recognition tasks like reading zip codes
and digits.
Have you ever wondered how facial recognition works on social media, or how
object detection helps in building self-driving cars, or how disease detection is done
using visual imagery in healthcare? It’s all possible thanks to convolutional neural
networks (CNN). Here’s an example of convolutional neural networks that
illustrates how they work:


Imagine there’s an image of a bird, and you want to identify whether it’s really a
bird or some other object. The first thing you do is feed the pixels of the image in
the form of arrays to the input layer of the neural network (multi-layer networks used
to classify things). The hidden layers carry out feature extraction by performing
different calculations and manipulations. There are multiple hidden layers like the
convolution layer, the ReLU layer, and pooling layer, that perform feature
extraction from the image. Finally, there’s a fully connected layer that identifies the
object in the image.

Fig: Convolutional Neural Network to identify the image of a bird


What is Convolutional Neural Network?


A convolutional neural network is a feed-forward neural network that is generally
used to analyze visual images by processing data with grid-like topology. It’s also
known as a ConvNet. A convolutional neural network is used to detect and classify
objects in an image.
Below is a neural network that identifies two types of flowers: Orchid and Rose.

In CNN, every image is represented in the form of an array of pixel values.

The convolution operation forms the basis of any convolutional neural network.
Let’s understand the convolution operation using two matrices, a and b, of 1
dimension.
a = [5,3,7,5,9,7]
b = [1,2,3]


In convolution operation, the arrays are multiplied element-wise, and the product is
summed to create a new array, which represents a*b.
The first three elements of the matrix a are multiplied with the elements of matrix b.
The product is summed to get the result.

The next three elements from the matrix a are multiplied by the elements in matrix b,
and the product is summed up.

This process continues until the convolution operation is complete.
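The same sliding multiply-and-sum can be written in a few lines of Python using the arrays a and b from above:

import numpy as np

a = np.array([5, 3, 7, 5, 9, 7])
b = np.array([1, 2, 3])

# Slide b over a; multiply each window element-wise by b and sum the products
result = np.array([np.sum(a[i:i + len(b)] * b) for i in range(len(a) - len(b) + 1)])
print(result)   # [32 32 44 44]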
How Does CNN Recognize Images?

Consider the following images:


The boxes that are colored represent a pixel value of 1, and 0 if not colored.
When you press backslash (\), the below image gets processed.

When you press forward-slash (/), the below image is processed:

Here is another example to depict how CNN recognizes an image:

As you can see from the above diagram, only those values are lit that have a value
of 1.


Layers in a Convolutional Neural Network


A convolution neural network has multiple hidden layers that help in extracting
information from an image. The four important layers in CNN are:

1. Convolution layer
2. ReLU layer
3. Pooling layer
4. Fully connected layer

Convolution Layer
This is the first step in the process of extracting valuable features from an image. A
convolution layer has several filters that perform the convolution operation. Every
image is considered as a matrix of pixel values.
Consider the following 5x5 image whose pixel values are either 0 or 1. There’s also
a filter matrix with a dimension of 3x3. Slide the filter matrix over the image and
compute the dot product to get the convolved feature matrix.
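A minimal Python sketch of this sliding-window operation; the 5x5 binary image and 3x3 filter values are illustrative, not taken from the figure:

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image and take the dot product at each position
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(image, kernel))   # 3x3 convolved feature map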

ReLU layer
ReLU stands for the rectified linear unit. Once the feature maps are extracted, the
next step is to move them to a ReLU layer.


ReLU performs an element-wise operation and sets all the negative pixels to 0. It
introduces non-linearity to the network, and the generated output is a rectified
feature map. Below is the graph of a ReLU function:

The original image is scanned with multiple convolutions and ReLU layers for
locating the features.


Pooling Layer
Pooling is a down-sampling operation that reduces the dimensionality of the feature
map. The rectified feature map now goes through a pooling layer to generate a
pooled feature map.
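Max pooling, the most common form of pooling, can be sketched as follows; the feature map values are illustrative:

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep only the largest value in each window (down-sampling)
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

rectified = np.array([[1, 3, 2, 1],
                      [4, 6, 5, 2],
                      [3, 1, 1, 0],
                      [1, 2, 2, 4]])
print(max_pool(rectified))   # [[6. 5.] [3. 4.]]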


The pooling layer uses various filters to identify different parts of the image like
edges, corners, body, feathers, eyes, and beak.

Here’s how the structure of the convolution neural network looks so far:

The next step in the process is called flattening. Flattening is used to convert all the
resultant 2-Dimensional arrays from pooled feature maps into a single long
continuous linear vector.


The flattened matrix is fed as input to the fully connected layer to classify the image.


Here’s how exactly CNN recognizes a bird:

• The pixels from the image are fed to the convolutional layer that performs
the convolution operation
• It results in a convolved map
• The convolved map is applied to a ReLU function to generate a rectified
feature map
• The image is processed with multiple convolutions and ReLU layers for
locating the features
• Different pooling layers with various filters are used to identify specific
parts of the image
• The pooled feature map is flattened and fed to a fully connected layer to
get the final output


You can use the formula [(W − K + 2P) / S] + 1 to calculate the size of the output dimensions in a CNN, where:

• W is the input volume - for example, 128
• K is the kernel size - for example, 5
• P is the padding - for example, 0
• S is the stride
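For example, with W = 128, K = 5, P = 0 and an assumed stride S = 1 (the stride is not given above), the formula gives an output size of 124:

def conv_output_size(W, K, P, S):
    # Output dimension of a convolution: [(W - K + 2P) / S] + 1
    return (W - K + 2 * P) // S + 1

print(conv_output_size(128, 5, 0, 1))   # 124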

Introduction to Recurrent Neural Network


A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases where it is required to predict the next word of a sentence, the previous words are required, and hence there is a need to remember the previous words. Thus the RNN came into existence, which solved this issue with the help of a Hidden Layer. The main and most important feature of an RNN is the Hidden state, which remembers some information about a sequence.


RNNs have a "memory" which remembers all information about what has been calculated. An RNN uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the number of parameters, unlike other neural networks.
How RNN works

The working of a RNN can be understood with the help of below example:
Example:
Suppose there is a deeper network with one input layer, three hidden layers and one output layer. Then, like other neural networks, each hidden layer will have its own set of weights and biases; let's say, for hidden layer 1 the weights and biases are (w1, b1), (w2, b2) for the second hidden layer and (w3, b3) for the third hidden layer. This means


that each of these layers is independent of the others, i.e. they do not memorize the previous outputs.

Now the RNN will do the following:

• An RNN converts the independent activations into dependent activations by providing the same weights and biases to all the layers, thus reducing the complexity of increasing parameters, and memorizes each previous output by giving each output as input to the next hidden layer.


• Hence these three layers can be joined together into a single recurrent layer, such that the weights and biases of all the hidden layers are the same.

Formula for calculating the current state:

ht = f(ht-1, xt)

where:

ht -> current state


ht-1 -> previous state
xt -> input state
Formula for applying the activation function (tanh):

ht = tanh(whh·ht-1 + wxh·xt)
where:

whh -> weight at recurrent neuron


wxh -> weight at input neuron
Formula for calculating the output:

yt = why·ht

where:

yt -> output
why -> weight at output layer
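The three formulas above can be put together in a short Python sketch of a single RNN time step; the dimensions and weight values are illustrative assumptions:

import numpy as np

def rnn_step(x_t, h_prev, w_xh, w_hh, w_hy):
    # Current state from the previous state and the current input
    h_t = np.tanh(w_hh @ h_prev + w_xh @ x_t)
    # Output computed from the current state
    y_t = w_hy @ h_t
    return h_t, y_t

rng = np.random.default_rng(0)
w_xh = rng.normal(size=(4, 3))        # weights at the input neuron
w_hh = rng.normal(size=(4, 4))        # weights at the recurrent neuron
w_hy = rng.normal(size=(2, 4))        # weights at the output layer

h = np.zeros(4)                       # initial hidden state
for x in rng.normal(size=(5, 3)):     # a sequence of 5 time steps
    h, y = rnn_step(x, h, w_xh, w_hh, w_hy)
print(h, y)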

Training through RNN

1. A single time step of the input is provided to the network.


2. Then calculate its current state using set of current input and the previous
state.
3. The current ht becomes ht-1 for the next time step.
4. One can go through as many time steps as the problem requires and join the information from all the previous states.


5. Once all the time steps are completed the final current state is used to
calculate the output.
6. The output is then compared to the actual output i.e the target output and
the error is generated.
7. The error is then back-propagated to the network to update the weights and
hence the network (RNN) is trained.

Advantages of Recurrent Neural Network

1. An RNN remembers information through time. It is useful in time series prediction because of its ability to remember previous inputs as well. This is called Long Short Term Memory.
2. Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood.
Disadvantages of Recurrent Neural Network
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation
function.
