Artificial Neural Network (ANN)
• ANN is a computational model that is inspired by the way biological neural networks
in the human brain process information.
• It is a method of computing, based on the interaction of multiple connected processing
elements.
• The depth of the model generally specifies the number of hidden layers.
• The learning of a model is the process of finding appropriate weights, typically through backpropagation.
Artificial Neural Network (ANN)
• Deep Learning (DL) is an umbrella term representing recent advances in neural networks (NNs).
• DL mimics the human brain, processing information at multiple levels of abstraction from high-dimensional raw data.
• DNNs mimic the efficiency and robustness of the human brain in processing massive data.
A Single Neuron
• The basic unit of computation in a
neural network is the neuron, often
called a node or unit.
• It receives input from some other
nodes, or from an external source
and computes an output.
• Each input has an
associated weight (w), which is
assigned on the basis of its relative
importance to other inputs.
• The node applies a function f to the
weighted sum of its inputs.
A Single Neuron
• This network takes numerical inputs X1 and X2 and has
weights w1 and w2 associated with those inputs. Additionally, there is another
input 1 with weight b (called the Bias) associated with it.
• The output Y from the neuron is computed using the function f called
the Activation Function. The purpose of the activation function (non-linear) is to
introduce non-linearity into the output of a neuron.
• This is important because most real-world data is non-linear and we want
neurons to learn these non-linear representations.
• Every activation function (or non-linearity) takes a single number and performs a
certain fixed mathematical operation on it. There are several activation functions.
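A minimal sketch of a single neuron in Python (the input values, weights, and the choice of sigmoid here are illustrative assumptions, not values from the slides):

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs, weights, and bias.
x = np.array([0.5, -1.2])   # inputs X1, X2
w = np.array([0.8, 0.3])    # weights w1, w2
b = 0.1                     # bias weight b on the constant input 1

# Y = f(w1*X1 + w2*X2 + b), with f = sigmoid here.
y = sigmoid(np.dot(w, x) + b)
print(y)
```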
Selecting Weights
• The initialization of the parameters considerably affects the convergence of
backpropagation during the training of DNN.
• Too small or large values of weights lead to slow learning and divergence respectively.
• The rule of thumb is that the mean of the activations* should be zero and the variance** should remain the same across layers.
• This applies to both forward propagation (i.e., the activations) and backward propagation (i.e., the gradients of the cost w.r.t. the activations).
* If activations are centred around zero, gradient updates are more balanced, preventing bias in weight
updates.
** If variance shrinks, the signal weakens, leading to vanishing gradients. If variance explodes, activations
become large, leading to unstable training.
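As a sketch of these rules, two widely used initialization schemes (Xavier/Glorot for sigmoid/tanh, He for ReLU) scale the random weights by the layer's fan-in and fan-out; the layer sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot: keeps activation variance roughly constant
    # across layers (suited to sigmoid/tanh).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # He: variance 2/fan_in compensates for ReLU zeroing
    # half of the activations (suited to ReLU).
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W1 = he_init(784, 256)      # hidden layer with ReLU (sizes illustrative)
W2 = xavier_init(256, 10)   # output layer with sigmoid/softmax
```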
Selecting Hidden Layers
• It is difficult to find the appropriate number of hidden layers and of neurons per hidden layer.
• Start with one or two hidden layers and thereafter, gradually increase the layers
depending upon the complexity of the problem, until the model starts
experiencing overfitting.
• A model with few layers but a large number of neurons takes more time to learn than a model with more layers but fewer neurons.
• In general, it is advisable to increase the number of layers rather than the number of neurons per layer.
Selecting Neurons
• The number of neurons in the input and output layers depends upon the size of the input feature vector and the classification problem.
• For hidden layers, it is a common practice to choose the number of neurons so that they form a funnel, i.e., placing fewer and fewer neurons in succeeding layers (see the sketch below).
• A good practice is to start with fewer neurons and gradually increase their count until the architecture starts to overfit.
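A sketch of such a funnel architecture in Keras (the input size and layer widths are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Funnel: each hidden layer has fewer neurons than the previous one.
inputs = keras.Input(shape=(20,))                   # assumed 20 input features
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dense(32, activation="relu")(x)
x = layers.Dense(16, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary classification
model = keras.Model(inputs, outputs)
model.summary()
```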
Perceptron
• One of the simplest ANN architectures, primarily used for binary classification; introduced by Frank Rosenblatt (1957).
• A simple neuron that is used to classify its input into one of two categories.
• Logistic regression (LR) uses the sigmoid function, whereas the perceptron uses the step function.
Perceptron
• At x = 0, the step function has an abrupt jump (discontinuity).
• The function is not differentiable at x = 0, as the left and right derivatives are not equal.
• Instead, functions like sigmoid, ReLU, and tanh are used in modern neural
networks.
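A minimal sketch of Rosenblatt's perceptron with the step function, trained here on the AND function as a toy linearly separable problem (the data, learning rate, and epoch count are illustrative):

```python
import numpy as np

def step(z):
    # Step function: 1 if z >= 0 else 0 (not differentiable at z = 0).
    return np.where(z >= 0, 1, 0)

def perceptron_train(X, y, lr=0.1, epochs=10):
    # Perceptron learning rule: nudge weights by the prediction error.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = step(np.dot(w, xi) + b)
            w += lr * (yi - y_hat) * xi
            b += lr * (yi - y_hat)
    return w, b

# Toy example: learn the AND function (linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print(step(X @ w + b))   # expected: [0 0 0 1]
```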
Activation Functions
• The activation function decides whether a neuron should be activated by calculating the weighted sum of its inputs and adding the bias to it.
• The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks. It is also known as the Transfer Function.
Activation Functions
• Suppose we have a neural network without activation functions: the network behaves like a linear model, which significantly limits its capabilities.
• Useless deep layers: Every neuron will only be performing a linear transformation on the
inputs using the weights and biases. It’s because it doesn’t matter how many hidden
layers we attach in the neural network; all layers will behave in the same way because
the composition of two linear functions is a linear function itself.
• Although the neural network becomes simpler, learning any complex task is impossible,
and our model would be just a linear regression model.
• The Activation Functions can be divided into 2 types:
• Linear Activation Function
• Non-linear Activation Functions
Linear Activation Function
• Range: −∞ to +∞
• The output is proportional to the input, so stacking linear layers still yields a linear mapping.
Non-Linear Activation Functions | Sigmoid
• σ(x) = 1 / (1 + exp(−x)), Range: 0 to 1
• Limitation: for large positive or negative inputs, the gradient of the sigmoid approaches zero (vanishing gradient), which causes:
• Slow Learning: Weight updates become insignificant, making training very slow.
• Early Layers Stop Learning: The initial layers of the network stop updating effectively, leading to poor feature learning.
Non-Linear Activation Functions | Tanh
• Benefits
• Zero-centered output: Unlike Sigmoid, which outputs in the range (0, 1), Tanh outputs in (−1, 1), making it easier for weights to update symmetrically.
Non-Linear Activation Functions | ReLU
• ReLU(x) = max(0, x)
• Limitation
• When inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation through those units and cannot learn. This is known as the Dying ReLU problem.
• If too many neurons output 0, they stop learning because the gradient is 0 for x ≤ 0.
Saturation
• An activation function saturates when its output reaches a limit, causing its derivative to become very small (close to zero).
• Sigmoid saturates at 0 and 1.
• Tanh saturates at −1 and 1.
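A sketch of these activation derivatives in numpy, showing saturation and the dying-ReLU behavior numerically (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)             # ~0 for large |x|: saturation

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # ~0 for large |x|: saturation

def relu_grad(x):
    return (x > 0).astype(float)   # exactly 0 for x <= 0: dying ReLU

xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(xs))   # tiny at +/-10 (saturated)
print(tanh_grad(xs))      # tiny at +/-10 (saturated)
print(relu_grad(xs))      # zero for all non-positive inputs
```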
Backpropagation Algorithm
• Backward Propagation of Errors is one of several ways in which an ANN can be trained. It is a supervised training scheme.
• An ANN consists of nodes in different layers; input layer, hidden layer(s) and the
output layer. The connections between nodes of adjacent layers have “weights”
associated with them.
• The goal of learning is to assign correct weights for these edges. Given an input
vector, these weights determine what the output vector is.
Backpropagation Algorithm
• Initialization: Initially, all the edge weights are randomly assigned.
• Forward Propagation: Compute the network's output for a training input using the current weights.
• Error Calculation: Compare the predicted output with the actual output.
• Weight Update: Propagate the error backwards through the network and adjust the weights to reduce it.
• Repeat Until Convergence: Continue iterating until the error is sufficiently small.
MLP Example
• Suppose we have the following student-marks dataset:
• Now suppose we want to predict whether a student studying 25 hours and having 70 marks in the mid-term will pass the final term.
• This is a binary classification problem where MLP can learn from the given examples
(training data) and make an informed prediction given a new data point.
Neural Network Architecture
• Input Layer: Two neurons (Hours Studied, Mid Term Marks)
• Activation Functions:
• Hidden Layer: ReLU (Rectified Linear Unit)
• Output Layer: Softmax (so the two output nodes produce class probabilities)
• Let's consider the hidden layer node marked V. Assume the weights of the connections from the inputs to that node are w1, w2 and w3.
• The network then takes the first training example as input, i.e., [35, 67], and computes the node's output as V = f(1·w1 + 35·w2 + 67·w3), where:
w1 is the bias weight
w2 is the weight for Hours Studied
w3 is the weight for Mid Term Marks
f is the activation function (ReLU in the hidden layer)
Similarly, the output from the other node/neuron in the hidden layer is also calculated. The outputs of the two nodes in the hidden layer act as inputs to the two nodes in the output layer, which lets us compute output probabilities from the two output nodes.
Step 1: Forward Propagation (Computation of Outputs)
• Compute outputs from hidden layer neurons
• Each hidden neuron receives inputs, applies a weighted sum, and passes it through an
activation function.
For the given input [35, 67], suppose the predicted probabilities are: [0.4, 0.6]
This means the network predicts: Pass probability = 0.4 & Fail probability = 0.6
Step 1: Forward Propagation (Computation of Outputs)
However, the actual label is [1, 0] (meaning the student should Pass). Thus, we have an
incorrect prediction with a high error.
Step 2: Backpropagation and Weight Updation
• We calculate the total error at the output nodes and propagate these errors
back through the network using Backpropagation to calculate the gradients.
• Suppose that the new weights associated with the node in consideration are
w4, w5 and w6 (after backpropagation and adjusting weights).
Step 2: Backpropagation and Weight Updation
If we now input the same example to the network again, the network should perform
better than before since the weights have now been adjusted to minimize the error in
prediction.
As shown in figure, the errors at the output nodes now reduce to [0.2, -0.2] as compared
to [0.6, -0.4] earlier. This means that our network has learnt to correctly classify our first
training example.
Step 2: Backpropagation and Weight Updation
• We repeat this process with all other training examples in our dataset. Then, our
network is said to have learnt those examples.
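A minimal numpy sketch of this forward propagation / backpropagation / weight update loop for a 2-2-2 network on the first training example. The input scaling, initial weights, learning rate, and the softmax/cross-entropy pairing are illustrative assumptions; the slides do not give concrete values:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# 2 inputs -> 2 hidden (ReLU) -> 2 outputs (Pass/Fail probabilities).
W1, b1 = rng.normal(0, 0.5, (2, 2)), np.zeros(2)   # random initialization
W2, b2 = rng.normal(0, 0.5, (2, 2)), np.zeros(2)

x = np.array([35.0, 67.0]) / 100.0   # first example, scaled (our assumption)
t = np.array([1.0, 0.0])             # actual label: [Pass, Fail]

lr = 0.5
for _ in range(100):
    # Step 1: forward propagation
    z1 = W1 @ x + b1
    h = relu(z1)
    y = softmax(W2 @ h + b2)
    # Step 2: backpropagation (softmax + cross-entropy gradient is y - t)
    d2 = y - t
    dW2, db2 = np.outer(d2, h), d2
    d1 = (W2.T @ d2) * (z1 > 0)      # gradient through ReLU
    dW1, db1 = np.outer(d1, x), d1
    # Weight update by gradient descent
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(y)   # the predicted probabilities move toward the target [1, 0]
```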
Gradient Descent algorithm
• Step 1: Initialize the weights (a & b) with random values and calculate Error (SSE)
• Step 2: Calculate the gradient, i.e., change in SSE when the weights (a & b) are changed by a very
small value from their original randomly initialized value. This helps us move the values of a & b in
the direction in which SSE is minimized.
• Step 3: Adjust the weights with the gradients to reach the optimal values where SSE is minimized.
a = a – r * ∂(SSE)/∂a, b = b – r * ∂(SSE)/∂b
• Step 4: Use the new weights for prediction and to calculate the new SSE
• Step 5: Repeat steps 2–4 until further adjustments to the weights no longer significantly reduce the error.
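A sketch of these five steps for a simple linear model y = a·x + b with SSE loss (the toy data and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, roughly y = 2x + 1.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.1, 4.9, 7.2, 8.8])

a, b = rng.standard_normal(2)   # Step 1: random initialization
r = 0.01                        # learning rate

for _ in range(5000):
    err = (a * X + b) - Y          # SSE = sum(err**2)
    grad_a = 2 * np.sum(err * X)   # Step 2: ∂(SSE)/∂a
    grad_b = 2 * np.sum(err)       #         ∂(SSE)/∂b
    a -= r * grad_a                # Step 3: adjust the weights
    b -= r * grad_b                # Steps 4-5: new SSE on next iteration

print(a, b)   # should approach roughly 2 and 1
```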
Learning Rate
The size of the steps taken toward the optimal point determines the rate of gradient descent. This step size is referred to as the 'learning rate'.
Too big
The updates bounce back and forth across the valley of the convex function and may never reach the local minimum.
Too small
Gradient descent will eventually reach the local minimum, but it will take too much time.
Just right
Gradient descent reaches the local minimum in a reasonable number of steps.
Batch Gradient Descent
In Batch Gradient Descent, all the training data is taken into consideration to take a single
step. We take the average of the gradients of all the training examples and then use that mean
gradient to update our parameters. So that’s just one step of gradient descent in one epoch.
How it Works:
• Computes the gradient of the loss function using the entire dataset.
• Updates the parameters (weights and biases) in the direction that minimizes the loss.
• Repeats until convergence.
Batch Gradient Descent
But what if our dataset is very large? Suppose our dataset has 5 million examples; then, just to take one step, the model has to calculate the gradients of all 5 million examples. This is not efficient. To tackle this problem we have Stochastic Gradient Descent.
Epochs and Iterations
Example:
If you have 1000 samples and a batch size of 100, then:
• 1 epoch = 10 iterations (since 1000/100 = 10 batches).
• If training runs for 10 epochs, the model sees each data point 10 times.
Stochastic Gradient Descent
How It Works:
• Instead of computing the gradient using the entire dataset (like Batch Gradient Descent), SGD picks one random sample per iteration.
• The model updates its parameters immediately after computing the gradient for that single sample.
• This process continues until all samples have been used, completing one epoch.
• The cost decreases with fluctuations.
• Because the cost fluctuates, it may never settle exactly at the minimum; instead it keeps oscillating around it.
• SGD can be used for larger datasets. It
converges faster when the dataset is large
as it causes updates to the parameters
more frequently.
• Since SGD uses only one example at a time, we cannot vectorize the computation across examples. This can slow down the computations.
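A minimal SGD sketch for linear regression, updating the weights after every single sample (the function name and parameters are illustrative):

```python
import numpy as np

def sgd(X, Y, lr=0.01, epochs=10, seed=0):
    # Stochastic gradient descent: one random sample per update,
    # so updates are frequent but noisy.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):   # visit samples in random order
            err = X[i] @ w - Y[i]
            w -= lr * 2 * err * X[i]   # single-sample gradient step
    return w
```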
Mini Batch Gradient Descent
• Mini-Batch Gradient Descent is a compromise between Batch Gradient Descent (BGD)
and Stochastic Gradient Descent (SGD). Instead of updating weights after every sample
(SGD) or after the entire dataset (BGD), MBGD updates weights after processing a small
batch of samples.
• So, when using mini-batch gradient descent, we update our parameters frequently and can also use a vectorized implementation for faster computation.
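A mini-batch sketch in the same style; note that batch_size = n recovers Batch GD and batch_size = 1 recovers SGD (the parameters are illustrative):

```python
import numpy as np

def minibatch_gd(X, Y, batch_size=100, lr=0.01, epochs=10, seed=0):
    # Mini-batch gradient descent: average the gradient over a small
    # batch, so each update is vectorized yet still frequent.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            err = X[batch] @ w - Y[batch]
            grad = 2 * (X[batch].T @ err) / len(batch)   # mean batch gradient
            w -= lr * grad                       # one update per mini-batch
    return w
```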
Differences Between Batch, Stochastic, and Mini-Batch
Gradient Descent
Key Takeaways: Batch, Stochastic, and Mini-Batch
Gradient Descent
• Batch GD: Stable but slow, inefficient for large datasets.
• Stochastic GD: Fast, frequent updates, but noisy; the cost fluctuates around the minimum.
• Mini-Batch GD: Best balance between speed and stability, commonly used in deep learning.
How a Neural Network is Trained