
ANNs

Artificial Neural Networks (ANN) are computational models inspired by biological neural networks, consisting of layers of interconnected neurons that learn from data. They utilize activation functions to introduce non-linearity, enabling the model to handle complex tasks and improve performance through backpropagation. Key components include selecting appropriate weights, hidden layers, and neurons, as well as understanding various activation functions like ReLU and Sigmoid, which impact the learning process and efficiency of the network.

Uploaded by

gagan gupta

Artificial Neural Networks (ANN)
Artificial Neural Network (ANN)
• ANN is a computational model that is inspired by the way biological neural networks
in the human brain process information.
• It is a method of computing, based on the interaction of multiple connected processing
elements.

• A powerful technique to solve many real-world problems.

• The ability to learn from experience in order to improve their performance.

• Ability to deal with incomplete information

• Comprised of one or more layers of neurons


Biological Neuron vs ANN
Artificial Neural Network (ANN)
• A neural network (NN) model consists of fully connected stacked layers of
artificial neurons – an input layer, one or more hidden layers, and an output layer,
to learn data representation.
• Each layer of a deep neural network (DNN) consists of computational nodes called neurons.
• Each neuron is connected to other neurons through edges, just as synapses
connect real neurons.
• Each edge has a weight indicating the strength of the
connection.

• The depth of the model generally specifies the number of hidden layers.
• The learning of a model is indeed the process of finding the appropriate weights
through backpropagation.
Artificial Neural Network (ANN)
• Deep learning (DL) is an umbrella term for recent advances in neural networks (NN).
• DL loosely mimics the human brain and processes information at multiple levels of
abstraction from high-dimensional raw data.
• Deep neural networks (DNNs) aim to mimic the efficiency and robustness with which the human brain processes massive
data.
A Single Neuron
• The basic unit of computation in a
neural network is the neuron, often
called a node or unit.
• It receives input from some other
nodes, or from an external source
and computes an output.
• Each input has an
associated weight (w), which is
assigned on the basis of its relative
importance to other inputs.
• The node applies a function f to the
weighted sum of its inputs.
A Single Neuron
• This network takes numerical inputs X1 and X2 and has
weights w1 and w2 associated with those inputs. Additionally, there is another
input 1 with weight b (called the Bias) associated with it.
• The output Y from the neuron is computed using the function f called
the Activation Function. The purpose of the activation function (non-linear) is to
introduce non-linearity into the output of a neuron.
• This is important because most real-world data is non-linear and we want
neurons to learn these non-linear representations.
• Every activation function (or non-linearity) takes a single number and performs a
certain fixed mathematical operation on it. There are several activation functions.
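
The single-neuron computation described above can be sketched in a few lines of Python. This is only an illustration: the input values, the weights, and the choice of sigmoid as f are assumptions made for the example, not values from the slides.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x1, x2, w1, w2, b, f=sigmoid):
    """Return Y = f(w1*x1 + w2*x2 + b) for a two-input neuron."""
    # The bias b acts as the weight on a constant input of 1.
    weighted_sum = w1 * x1 + w2 * x2 + b * 1.0
    return f(weighted_sum)

# Hypothetical inputs and weights, chosen only for illustration.
y = neuron_output(x1=0.5, x2=-1.0, w1=0.8, w2=0.2, b=0.1)
print(y)
```

Swapping f for another activation function changes only the last step; the weighted sum is computed the same way.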
Selecting Weights
• The initialization of the parameters considerably affects the convergence of
backpropagation during the training of DNN.
• Too small or large values of weights lead to slow learning and divergence respectively.
• The rule of thumb is that the mean of activation* should be zero and the variance**
should remain same across the layers.
• Applicable to both forward (i.e. activation) as well as backward propagation (i.e. for
gradient of cost w.r.t activations)

* If activations are centred around zero, gradient updates are more balanced, preventing bias in weight
updates.
** If variance shrinks, the signal weakens, leading to vanishing gradients. If variance explodes, activations
become large, leading to unstable training.
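
As a rough illustration of the zero-mean, stable-variance rule of thumb, the sketch below draws weights from a zero-mean Gaussian whose variance is 1/n_in, where n_in is the number of inputs to the layer. This scaling is one common scheme and is an assumption here; the slides do not name a specific initialization method. The layer sizes and the seed are also hypothetical.

```python
import math
import random

def init_weights(n_in, n_out, rng):
    """Draw an n_out x n_in weight matrix from a Gaussian with mean 0 and variance 1/n_in."""
    std = math.sqrt(1.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
W = init_weights(n_in=100, n_out=100, rng=rng)

# Check the empirical mean and variance of the sampled weights.
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
var = sum((w - mean) ** 2 for w in flat) / len(flat)
print(mean, var)  # mean should be near 0, variance near 1/100
```

Scaling the variance by the number of inputs keeps the weighted sum of a layer from growing or shrinking simply because the layer is wider.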
Selecting Hidden Layers
• It is difficult to find the appropriate number of hidden layers and neurons per hidden
layer.
• Start with one or two hidden layers and then gradually add layers
depending on the complexity of the problem, until the model starts
to overfit.
• A model with few layers but a large number of neurons takes more time to learn
than a model with more layers but fewer neurons.
• In general, it is advisable to increase the number of layers rather than the number
of neurons per layer.
Selecting Neurons
• The number of neurons in the input and output layers depends on the size of the
input feature vector and on the classification problem (for example, the number of classes).
• For hidden layers, it is common practice to choose the number of neurons so
that they form a funnel, i.e., placing fewer and fewer
neurons in succeeding layers.
• A model with few layers but a large number of neurons takes more time to learn
than a model with more layers but fewer neurons.
• It is good practice to start with fewer neurons and gradually increase their count
until the architecture starts to overfit.
Perceptron
• One of the simplest ANN architectures, primarily used for binary classification,
introduced by Frank Rosenblatt (1957).

• A simple neuron that is used to classify its input into one of two categories.

• A simple processing unit of a neural network. It uses a step function (which creates a
sharp jump at x = 0).

• Logistic regression (LR) uses the sigmoid function, whereas the perceptron uses a step function.
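
A minimal sketch of a perceptron's forward computation with a step activation follows. The weights and bias are hypothetical values picked by hand so that the unit behaves like a logical AND; they are not learned and do not come from the slides.

```python
def step(z):
    """Step function: 1 if z >= 0, otherwise 0 (sharp jump at z = 0)."""
    return 1 if z >= 0 else 0

def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

# Hypothetical hand-picked parameters that implement logical AND.
weights, bias = [1.0, 1.0], -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, weights, bias))
```

Only the input (1, 1) produces a weighted sum above zero, so it is the only input placed in category 1.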
Perceptron
• At x = 0, there is an abrupt jump (discontinuity).

• The function is not differentiable at x = 0 because it is discontinuous there; everywhere else its derivative is 0.

• Since the derivative is undefined at x = 0 and zero elsewhere, the step function provides no useful gradient and cannot be used in gradient-based
learning methods like backpropagation in deep learning.

• Instead, functions like sigmoid, ReLU, and tanh are used in modern neural
networks.
Activation Functions
• The activation function decides whether a neuron should be activated or not, based on
the weighted sum of its inputs plus the bias.

• The purpose of the activation function is to introduce non-linearity into the
output of a neuron.

• A neural network without an activation function is essentially just a linear
regression model.

• The activation function applies a non-linear transformation to the input, making the network
capable of learning and performing more complex tasks. It is also known as a Transfer
Function.
Activation Functions
• Suppose we have a neural network working without activation functions. The
network then behaves like a linear model, which significantly limits its capabilities.
• Useless deep layers: Every neuron only performs a linear transformation on the
inputs using the weights and biases. It does not matter how many hidden
layers we attach to the neural network; all layers behave in the same way because
the composition of two linear functions is itself a linear function.
• Although the neural network becomes simpler, learning any complex task is impossible,
and our model would be just a linear regression model.
• The Activation Functions can be divided into 2 types:
• Linear Activation Function
• Non-linear Activation Functions
Linear Activation Function
Range: −∞ to +∞

No matter how many layers we have, if all are linear in
nature, the final activation of the last layer is
nothing but a linear function of the input to the first
layer. The entire network behaves like a single linear
layer, no matter how many layers we add.

Uses: The linear activation function is used in just one
place, i.e., the output layer, commonly in regression-based
NNs.
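
The claim that stacked linear layers collapse into one linear function can be checked directly. The sketch below uses two scalar "layers" with hypothetical weights and biases; the same algebra applies to matrices.

```python
def linear(w, b):
    """Return the function x -> w*x + b."""
    return lambda x: w * x + b

# Two hypothetical scalar layers with linear activation.
layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)

# Composition: layer2(layer1(x)) = -3*(2x + 1) + 0.5 = -6x - 2.5,
# which is again a single linear function.
collapsed = linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

for x in [-2.0, 0.0, 1.5]:
    assert abs(layer2(layer1(x)) - collapsed(x)) < 1e-12
```

However many such layers are stacked, the result can always be rewritten as one weight and one bias, which is why depth adds nothing without a non-linear activation.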
Non-Linear Activation Functions | Sigmoid
• Takes a real-valued input and squashes it to the range between 0 and 1.

σ(x) = 1 / (1 + exp(−x))

σ′(x) = σ(x) · (1 − σ(x))
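
A small sketch of the sigmoid and its derivative shows why gradients vanish: the derivative peaks at 0.25 (at x = 0) and shrinks quickly as |x| grows. The sample points are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    """Sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is largest at 0 and decays towards 0 as x grows.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))

# Even in the best case (0.25 per layer), the product over 10 layers is tiny.
print(0.25 ** 10)
```

Because backpropagation multiplies one such factor per layer, the product shrinks geometrically with depth.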


Derivative of Sigmoid Function | Vanishing Gradient
Sigmoid Function | Vanishing Gradient Effects in DNN

• Slow Learning: Weight updates become insignificant, making training very slow.

• Early Layers Stop Learning: The initial layers of the network stop updating
effectively, leading to poor feature learning.

• Poor Performance: Networks with many layers struggle to train efficiently.


Tanh (Hyperbolic Tangent) | NLAF
• Takes a real-valued input and squashes
it to the range (−1, 1).

• S-shaped (sigmoid-like), but centered at 0.

• Benefits
• Zero-centered output: Unlike Sigmoid, which
outputs in the range (0, 1), Tanh outputs in
(−1, 1), making it easier for weights to update
symmetrically.

• Stronger gradients: Tanh has a steeper slope
than Sigmoid, so it can train faster.
Derivative of Tanh
• The problem of vanishing gradients persists even
with the Tanh function.

• For large values of |x| (positive or negative), Tanh
saturates, which means gradients become very small.
This causes vanishing gradients, making deep
networks hard to train.

• Still better than Sigmoid, but ReLU is preferred for
deep learning.
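
The saturation of Tanh can be seen from its derivative, 1 − tanh²(x). The sketch below evaluates it at a few arbitrary points chosen for illustration.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2

# The gradient is 1 at x = 0 (steeper than sigmoid's 0.25) but decays for large |x|.
for x in [0.0, 2.0, 5.0, -5.0]:
    print(x, tanh_grad(x))
```

The peak gradient of 1 explains the "stronger gradients" benefit, while the near-zero values at ±5 show that the vanishing gradient problem remains.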
Non-Linear Activation Functions | ReLU
ReLU stands for Rectified Linear Unit, f(x) = max(0, x). It is one of the most widely used activation functions (AF), with range [0, ∞).
Non-Linear Activation Functions | ReLU
Benefits
• Mitigates the vanishing gradient problem: Unlike Sigmoid and Tanh, ReLU does not saturate for positive
values.
• Computationally fast: Only requires a simple max operation and no expensive exponentials,
so it is cheaper than sigmoid or tanh.
• Sparse activation: If x ≤ 0, the output is 0, so only some neurons are active at a time, making the network efficient.

Limitation

When a neuron's input is negative, the gradient of the function is zero, so no error signal flows back through
that neuron and its weights are not updated. This is known as the Dying ReLU problem.

Neurons that output 0 for most inputs stop learning, because the gradient is 0 for x ≤ 0.
Non-Linear Activation Functions | ReLU
An activation function saturates when its output reaches a limit, causing its derivative to
become very small (close to zero).
• Sigmoid saturates at 0 and 1
• Tanh saturates at -1 and 1

Why Does ReLU Not Saturate for Positive Inputs?
• ReLU does not have an upper bound like Sigmoid or Tanh.
• As x → ∞, ReLU keeps increasing linearly instead of flattening.
• The derivative of ReLU for x > 0 is always 1, meaning it does not saturate.
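
A minimal sketch of ReLU and its gradient makes both the non-saturation and the Dying ReLU behaviour concrete. The gradient value used at exactly x = 0 is a convention chosen here (0), since the function is not differentiable at that point.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for x > 0, 0 otherwise (value at x == 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

# The gradient stays 1 however large x becomes, and is 0 for negative x.
for x in [-3.0, -0.1, 2.5, 1000.0]:
    print(x, relu(x), relu_grad(x))
```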
Non-Linear Activation Functions | Leaky ReLU
• Leaky ReLU is the same as ReLU for positive inputs, but for negative inputs it has a
small constant slope (0.01 here) instead of 0.

f(x) = max(0.01x, x)

f′(x) = 0.01 if x < 0 and 1 if x ≥ 0

• Prevents the Dying ReLU problem, i.e., prevents neurons from becoming completely inactive.
• This variation of ReLU has a small positive slope in the negative region, so it enables
backpropagation even for negative input values.
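
The Leaky ReLU formulas above translate directly into code. In this sketch the slope is exposed as a parameter named alpha (a hypothetical name) with the slide's value of 0.01 as the default.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: max(alpha * x, x), valid for 0 < alpha < 1."""
    return max(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient: alpha for x < 0, 1 for x >= 0 (same convention as the slide)."""
    return 1.0 if x >= 0 else alpha

# Negative inputs still produce a small, non-zero gradient.
for x in [-10.0, -1.0, 0.0, 3.0]:
    print(x, leaky_relu(x), leaky_relu_grad(x))
```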
Bias
• The main function of the bias is to provide every node with a trainable constant value
(in addition to the normal inputs that the node receives).

• It is like the intercept in a linear equation. It is an additional parameter in
the NN that shifts the weighted sum of the
inputs to the neuron. The bias therefore helps the model
fit the given data better.

Output = Sum (Weights * Inputs) + Bias


Feedforward Neural Network
• It is the simplest type of ANN. It contains multiple neurons (nodes)
arranged in layers. Nodes from adjacent layers
have connections or edges between them. All these connections
have weights associated with them.

• In a feedforward network, the
information moves in only one direction, forward,
from the input nodes,
through the hidden nodes (if any), and
to the output nodes.
Feedforward Neural Network
• Input Nodes
The Input nodes provide information from the outside world to the network and are
together referred to as the “Input Layer”. No computation is performed in any of the
input nodes - they just pass on the information to the hidden nodes.
• Hidden Nodes
The Hidden nodes have no direct connection with the outside world (hence the
name “hidden”). They perform computations and transfer information from the input
nodes to the output nodes. A collection of hidden nodes forms a “Hidden Layer”. While
a feedforward network will only have a single input layer and a single output layer, it can
have zero or multiple hidden layers.
• Output Nodes
The Output nodes are collectively referred to as the “Output Layer” and are responsible for
computations and transferring information from the network to the outside world.
Examples of Feedforward NN
• Single Layer Perceptron – This is the simplest feedforward neural network and
does not contain any hidden layer.
• Multi Layer Perceptron – A Multi Layer Perceptron (MLP) contains one or more
hidden layers (apart from one input and one output layer).
MLP with a Single Hidden Layer
• Note that all connections have weights
associated with them, but only three weights
(w0, w1, w2) are shown in the figure and only
a single calculation is shown.

• Given a set of features X = (x1, x2, …) and a
target Y, an MLP can learn the
relationship between the features and the
target, for either classification or regression.
Training MLP: The Backpropagation Algorithm
• The process by which an MLP learns is called the Backpropagation algorithm. It
allows the network to learn by minimizing errors and adjusting weights
efficiently.

• Backward Propagation of Errors is one of several ways in which an ANN can
be trained. It is a supervised training scheme.

• An ANN consists of nodes in different layers: an input layer, hidden layer(s), and an
output layer. The connections between nodes of adjacent layers have “weights”
associated with them.

• The goal of learning is to assign correct weights for these edges. Given an input
vector, these weights determine what the output vector is.
Backpropagation Algorithm
• Initially, all the edge weights are randomly assigned.

• Forward Propagation: Compute the output using the current weights.

• Error Calculation: Compare the predicted output with the actual output.

• Backward Propagation: Adjust weights using gradient descent to minimize error.

• Repeat Until Convergence: Continue iterating until the error is sufficiently small.
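
The loop below is a deliberately simplified sketch of these steps for a single sigmoid neuron, so the backward step reduces to one application of the chain rule rather than full backpropagation through hidden layers. The toy data (logical OR), the learning rate, the number of epochs, the log-loss gradient, and the fixed zero starting weights (used instead of random ones for repeatability) are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy training data (assumption): logical OR of two binary inputs.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w = [0.0, 0.0]  # weights; fixed start instead of random values, for repeatability
b = 0.0         # bias
lr = 0.5        # learning rate (hypothetical value)

for epoch in range(2000):
    for (x1, x2), y in data:
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)  # forward propagation
        err = p - y                              # error calculation
        # Backward step: gradient of the log-loss w.r.t. each parameter,
        # followed by a gradient descent update.
        w[0] -= lr * err * x1
        w[1] -= lr * err * x2
        b -= lr * err

predictions = [round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for (x1, x2), _ in data]
print(predictions)
```

In a multi-layer network the same four steps apply, but the error is passed back layer by layer to obtain the gradient for every weight.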
MLP Example
• Suppose we have the following student-marks dataset:

• Now, suppose, we want to predict whether a student studying 25 hours and having 70
marks in the mid-term will pass the final term.

• This is a binary classification problem where MLP can learn from the given examples
(training data) and make an informed prediction given a new data point.
Neural Network Architecture
• Input Layer: Two neurons (Hours Studied, Mid Term Marks)

• Hidden Layer: Two neurons

• Output Layer: Two neurons (Probabilities for Pass and Fail)

• Activation Functions:
• Hidden Layer: ReLU (Rectified Linear Unit)

• Output Layer: Softmax


Step 1: Forward Propagation
• All weights in the network are randomly assigned.

• Let's consider the hidden layer node marked V. Assume the weights of the
connections from the inputs to that node are w1, w2 and w3.

• The network then takes the first training example as input i.e. [35, 67]

• Desired output from the network (target) = [1, 0]


Step 1: Forward Propagation
Initially, the weights are randomly initialized. Then output V from the node in
consideration can be calculated as below (f is an activation function):

V = f (1*w1 + 35*w2 + 67*w3)

where:
w1 is the bias weight
w2 is the weight for Hours Studied
w3 is the weight for Mid Term Marks
f is the activation function (ReLU in the hidden layer)

Similarly, the output from the other node in the hidden layer is also calculated. The outputs
of the two nodes in the hidden layer act as inputs to the two nodes in the output layer. This
enables us to calculate output probabilities from the two nodes in the output layer.
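
The calculation of V can be sketched as follows. The three weight values are hypothetical stand-ins for the randomly assigned weights, since the slides do not give actual numbers; only the inputs [35, 67] and the ReLU activation come from the example.

```python
def relu(z):
    """ReLU activation used in the hidden layer."""
    return max(0.0, z)

# Hypothetical starting weights; the slides only say they are random.
w1, w2, w3 = 0.1, 0.02, -0.01  # bias weight, Hours Studied, Mid Term Marks

hours, marks = 35, 67  # first training example

# V = f(1*w1 + 35*w2 + 67*w3)
V = relu(1 * w1 + hours * w2 + marks * w3)
print(V)
```

With these particular weights the weighted sum is positive, so ReLU passes it through unchanged; a negative sum would give V = 0.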
Step 1: Forward Propagation (Computation of Outputs)
• Compute outputs from hidden layer neurons
• Each hidden neuron receives inputs, applies a weighted sum, and passes it through an
activation function.

• Compute outputs from output layer neurons


• The hidden layer’s outputs act as inputs to the output layer.
• Apply Softmax Activation to get probabilities of Pass and Fail.

For the given input [35, 67], suppose the predicted probabilities are: [0.4, 0.6]
This means the network predicts: Pass probability = 0.4 & Fail probability = 0.6
Step 1: Forward Propagation (Computation of Outputs)
However, the actual label is [1, 0] (meaning the student should Pass). Thus, we have an
incorrect prediction with a high error.
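
To make the output step concrete, the sketch below applies Softmax to two raw output scores and compares the result with the target. The raw scores are hypothetical: they were chosen only so that Softmax reproduces the probabilities [0.4, 0.6] from the example, and the error is taken as target minus prediction, which is an assumed convention.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for (Pass, Fail), chosen to give [0.4, 0.6].
logits = [0.0, math.log(1.5)]
probs = softmax(logits)

target = [1, 0]  # the student should Pass
errors = [t - p for t, p in zip(target, probs)]
print(probs, errors)
```

Because the Softmax outputs always sum to 1 and the target is one-hot, the two errors are equal in size and opposite in sign.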
Step 2: Backpropagation and Weight Update
• We calculate the total error at the output nodes and propagate these errors
back through the network using Backpropagation to calculate the gradients.

• Then we use an optimization method such as Gradient Descent to
adjust all weights in the network to reduce the error at the output layer.

• Suppose that the new weights associated with the node in consideration are
w4, w5 and w6 (after backpropagation and adjusting weights).
Step 2: Backpropagation and Weight Update
If we now input the same example to the network again, the network should perform
better than before since the weights have now been adjusted to minimize the error in
prediction.

As shown in the figure, the errors at the output nodes (target minus prediction) now reduce to [0.2, -0.2], compared
to [0.6, -0.6] earlier. This means that our network has learnt to correctly classify our first
training example.

Step 2: Backpropagation and Weight Update
• We repeat this process with all other training examples in our dataset. Then, our
network is said to have learnt those examples.

• If we now want to predict whether a student studying 25 hours and having 70
marks in the mid term will pass the final term, we go through the forward
propagation step and find the output probabilities for Pass and Fail.
Summary
1. Forward Pass: Compute outputs from input to output layer.

2. Loss Calculation: Compare predicted and actual output.

3. Backward Pass (Backpropagation):
a) Compute gradients.
b) Adjust weights using Gradient Descent.

4. Repeat for all training examples.

5. Final Prediction on test data.


Gradient Descent
• Gradient Descent is one of the most
commonly used iterative optimization algorithms
for training machine
learning and deep learning models. It helps in
finding a local minimum of a function.

• The idea is to pass the training set through
the layers of the neural network and
then update the parameters of the layers using
gradients computed from the training
samples.

Gradient Descent algorithm
• Step 1: Initialize the weights (a & b) with random values and calculate the error (SSE, the sum of squared errors).

• Step 2: Calculate the gradient, i.e., the change in SSE when the weights (a & b) are changed by a very
small value from their original randomly initialized value. This helps us move the values of a & b in
the direction in which SSE is minimized.

• Step 3: Adjust the weights with the gradients to reach the optimal values where SSE is minimized.
a = a – r * ∂(SSE)/∂a, b = b – r * ∂(SSE)/∂b, where r is the learning rate.

• Step 4: Use the new weights for prediction and to calculate the new SSE.

• Step 5: Repeat steps 2 to 4 until further adjustments to the weights no longer significantly reduce the
error.
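
The five steps can be sketched for a straight-line model y = a + b·x. The data points, the learning rate, the fixed number of iterations, and the zero starting values (used instead of random ones so the run is repeatable) are assumptions made for this example.

```python
# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def sse(a, b):
    """Sum of squared errors of the line y = a + b*x on the data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = 0.0, 0.0  # Step 1 (fixed start here instead of random values)
r = 0.01         # learning rate
initial_error = sse(a, b)

for _ in range(5000):  # Step 5: a fixed iteration count stands in for a convergence test
    # Step 2: analytic gradients of SSE with respect to a and b.
    grad_a = sum(-2 * (y - (a + b * x)) for x, y in zip(xs, ys))
    grad_b = sum(-2 * x * (y - (a + b * x)) for x, y in zip(xs, ys))
    # Step 3: move against the gradient.
    a -= r * grad_a
    b -= r * grad_b

final_error = sse(a, b)  # Step 4: error with the new weights
print(a, b, final_error)
```

On this noise-free data the loop recovers values of a and b close to 1 and 2, the line that generated the points.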
Learning Rate
The size of the steps taken to reach the optimal point determines the rate of gradient descent. It is often
referred to as the 'learning rate'.

Too big
gradient descent bounces back and forth across the convex function and may not reach the local minimum.

Too small
gradient descent will eventually reach the local minimum, but it will take too much time.

Just right
gradient descent reaches the local minimum in a reasonable number of steps.
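
The three regimes can be reproduced on the simple convex function f(x) = x², whose gradient is 2x. The function, the starting point, the step count, and the three learning rates are all choices made for this illustration.

```python
def descend(r, steps=50, x0=1.0):
    """Run gradient descent on f(x) = x**2 (gradient 2x) and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - r * 2 * x
    return x

too_big = descend(r=1.1)      # each step multiplies x by -1.2: overshoots and grows
too_small = descend(r=0.001)  # each step multiplies x by 0.998: barely moves
just_right = descend(r=0.3)   # each step multiplies x by 0.4: converges quickly

print(too_big, too_small, just_right)
```

After 50 steps the large rate has moved far away from the minimum at 0, the small rate is still close to the starting point, and the moderate rate is essentially at the minimum.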

Batch Gradient Descent
In Batch Gradient Descent, all the training data is taken into consideration to take a single
step. We take the average of the gradients of all the training examples and then use that mean
gradient to update our parameters. So that’s just one step of gradient descent in one epoch.

How it Works:
• Computes the gradient of the loss function using entire dataset.
• Updates the parameters (weights and biases) in the direction that minimizes the loss.
• Repeats until convergence.
Batch Gradient Descent
But what if our dataset is very large? Suppose our dataset has 5 million examples;
then just to take one step the model has to calculate the gradients of all 5
million examples. This is not efficient. To tackle this problem we
have Stochastic Gradient Descent.
Batch Gradient Descent

• The graph of cost vs epochs is
also quite smooth because we are
averaging over all the gradients of
training data for a single step. The
cost keeps on decreasing over the
epochs.
Epochs
An epoch in machine learning refers to one complete pass of the entire training dataset through
the model.

Example:
If you have 1000 samples and a batch size of 100, then:
• 1 epoch = 10 iterations (since 1000/100 = 10 batches).
• If training runs for 10 epochs, the model sees each data point 10 times.
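
The arithmetic in this example can be written as a small helper. The function name is hypothetical, and rounding up for a final partial batch is an assumption about how leftover samples are handled.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

per_epoch = iterations_per_epoch(1000, 100)
total_updates = per_epoch * 10  # 10 epochs
print(per_epoch, total_updates)
```

For 1000 samples and a batch size of 100 this gives 10 iterations per epoch and 100 parameter updates over 10 epochs.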

Why Multiple Epochs?


• One epoch might not be enough to learn all patterns.
• Too many epochs can cause overfitting.
Stochastic Gradient Descent
Stochastic Gradient Descent is an optimization algorithm used to minimize the loss
function by updating model parameters one random sample at a time.

How It Works:
• Instead of computing the gradient using the entire dataset (like Batch Gradient
Descent), SGD picks one random sample per iteration.
• The model updates its parameters immediately after computing the gradient for that
single sample.
• This process continues until all samples have been used, completing one epoch.
Stochastic Gradient Descent
• The cost decreases with fluctuations.
• Because the cost fluctuates so much, it may
never settle exactly at the minimum but will keep
dancing around it.
• SGD can be used for larger datasets. It
converges faster when the dataset is large
because it updates the parameters
more frequently.
• Since SGD uses only one example at a
time, we cannot use a vectorized
implementation across examples. This can slow down
the computations.
Mini Batch Gradient Descent
• Mini-Batch Gradient Descent is a compromise between Batch Gradient Descent (BGD)
and Stochastic Gradient Descent (SGD). Instead of updating weights after every sample
(SGD) or after the entire dataset (BGD), MBGD updates weights after processing a small
batch of samples.

• Steps involved in one epoch:
1. Pick a mini-batch.
2. Feed it to the neural network.
3. Calculate the mean gradient of the mini-batch.
4. Use the mean gradient calculated in step 3 to update the weights.
5. Repeat steps 1–4 for the remaining mini-batches.
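
The steps above can be sketched with a one-parameter model y = w·x trained on mean squared error. Setting the batch size to 1 gives Stochastic Gradient Descent and setting it to the dataset size gives Batch Gradient Descent. The data, the learning rate, the epoch count, and the function name are assumptions for this example.

```python
import random

def minibatch_gd(xs, ys, batch_size, epochs=200, lr=0.05, seed=0):
    """Fit y = w*x with mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]  # 1. pick a mini-batch
            # 2-3. forward pass and mean gradient over the mini-batch
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad                             # 4. update the weight
    return w

# Hypothetical noise-free data on the line y = 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w_sgd = minibatch_gd(xs, ys, batch_size=1)    # stochastic
w_mini = minibatch_gd(xs, ys, batch_size=2)   # mini-batch
w_batch = minibatch_gd(xs, ys, batch_size=4)  # full batch
print(w_sgd, w_mini, w_batch)
```

On this small noise-free dataset all three variants reach the same weight; the differences in smoothness and speed described in the slides show up on larger, noisier data.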
Mini Batch Gradient Descent
• Just like SGD, the cost over the epochs in mini-batch gradient
descent fluctuates because each step averages over only a small number of examples.

• With mini-batch gradient descent we update our
parameters frequently and can also use a vectorized implementation for
faster computations.
Differences Between Batch, Stochastic, and Mini-Batch Gradient Descent
Key Takeaways: Batch, Stochastic, and Mini-Batch Gradient Descent
• Batch GD: Stable but slow, inefficient for large datasets.

• Stochastic GD: Faster, but noisy and less stable.

• Mini-Batch GD: Best balance between speed and stability, commonly used in
deep learning.
How a Neural Network is Trained
