Neural Network - Optimization DRAFT 3.11
Presented by:
Mahesh S (24PHD0300)
Peter A (24PHD0345)
Ronald Vincent (24PHD0368)
Gaurav Gadhiya (24PHD0377)
Devisri Eriki (24PHD0325)
Neural Networks
A method of computing, based on the interaction of multiple
connected processing elements.
A powerful technique to solve many real-world problems.
The ability to learn from experience in order to improve their
performance.
Ability to deal with incomplete information
An artificial neural network (ANN) may be defined as an
information-processing model that is inspired by the way biological
nervous systems, such as the brain, process information.
This model tries to replicate only the most basic functions of the
brain.
An ANN is composed of a large number of highly interconnected
processing units (neurons) working in unison to solve
specific problems.
Basics Of Neural Network
• Biological approach to AI
• Developed in 1943
• Composed of one or more layers of neurons
• Several types exist; we'll focus on feed-forward and feedback networks
Neurons
(Figure: a biological neuron alongside an artificial neuron)
Each neuron is connected to the others by connection links.
Each connection link is associated with a weight that carries information about the input signal.
This information is used by the neural network to solve a particular problem.
An ANN's collective behavior is characterized by its ability to learn, recall and generalize
training patterns or data, similar to that of the human brain.
ANNs have the capability to model networks of biological neurons as found in the brain.
Thus, the ANN processing elements are called neurons or artificial neurons.
Neural Network Neurons
• Receives n inputs
• Multiplies each input by its weight
• Applies an activation function to the sum of the results
• Outputs the result
Activation function
A non-linear function helps the network capture complex relationships between variables.
Neural network applications
Pattern recognition
Investment analysis
Mobile computing
(Figure: a single artificial neuron with inputs X1, X2, X3, weights W1, W2, W3, a summation unit Σ and output Y)
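A minimal sketch of the computation in the figure: each input is multiplied by its weight, the products are summed, and an activation function is applied to the sum. The sigmoid used here is an assumption (activation functions are discussed below); the numeric inputs and weights are purely illustrative.

import math

def neuron(inputs, weights):
    # Weighted sum of the inputs, followed by a sigmoid activation
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-s))

# Three inputs x1..x3 and three weights w1..w3 (illustrative values)
y = neuron([0.5, 0.2, 0.1], [0.4, 0.3, 0.9])
print(y)  # a single output between 0 and 1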
Activation Function
• An activation function in a neural network defines how the weighted sum of the inputs is transformed into an
output from a node or nodes in a layer of the network.
• It transforms the input signal into a non-linear output, allowing the network to learn complex relationships
between inputs and outputs.
• Activation functions determine whether a neuron should activate based on the input it receives.
• A sigmoid function, also known as a logistic function, is a mathematical function that outputs a number between
0 and 1. It is commonly used in machine learning, especially in neural networks, as an activation function, with
the formula:
σ(x) = 1 / (1 + e^(-x))
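As a quick check of the formula, here is a small Python sketch of the sigmoid and of its derivative σ(x)(1 - σ(x)), which the back-propagation example below relies on (the function names are ours, not from the slides).

import math

def sigmoid(x):
    # Logistic function: maps any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), used when training by back-propagation
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))   # 0.5
print(sigmoid(0.15))  # ≈ 0.5374, a value that appears in the worked example below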
Worked Example: Back-Propagation

Given:
Input pattern x_h = [0.1, 0.3]^T
Target T_q = 0.8
Weights between input and hidden layer:
[w_hp.j] = [0.2  0.3
            0.1  0.4]
Weights between output and hidden layer:
[w_pq.k] = [0.2
            0.3]

Forward pass:
I_p.j = [w_hp.j]^T x_h = [0.2  0.1; 0.3  0.4] [0.1; 0.3] = [0.05; 0.15]

O_p.j = 1 / (1 + exp(-I_p.j)) = [1 / (1 + exp(-0.05)); 1 / (1 + exp(-0.15))] = [0.5125; 0.5374]

I_q.k = [w_pq.k]^T O_p.j = [0.2  0.3] [0.5125; 0.5374] = 0.2637

Output O_q.k = 1 / (1 + exp(-I_q.k)) = 1 / (1 + exp(-0.2637)) = 0.5655

Error E = (T_q - O_q.k)^2 = (0.8 - 0.5655)^2 = 0.0550

Modification of weights between output and hidden layer
Let the learning rate η = 0.6.

δ_q.k = 2 (T_q - O_q.k) O_q.k (1 - O_q.k) = 2 (0.8 - 0.5655)(0.5655)(1 - 0.5655) = 0.1152

Δw_pq.k = η δ_q.k O_p.j = 0.6 × 0.1152 × [0.5125; 0.5374] = [0.0354; 0.0371]

w_pq.k(N+1) = w_pq.k(N) - Δw_pq.k = [0.2 - 0.0354; 0.3 - 0.0371] = [0.1646; 0.2629]

Modification of weights between input and hidden layer

Δw_hp.j = η Σ_{q=1..r} [2 (T_q - O_q.k) O_q.k (1 - O_q.k) w_pq.k] O_p.j (1 - O_p.j) x_h

First combine the output-layer error with the hidden-to-output weights:
w_pq.k δ_q.k = [0.2; 0.3] × 0.1152 = [0.0230; 0.0346]

Let DD = [w_pq.k δ_q.k] O_p.j (1 - O_p.j)
       = [0.0230 × 0.5125 × (1 - 0.5125); 0.0346 × 0.5374 × (1 - 0.5374)]
       = [0.0057; 0.0086]

Let HH = x_h DD^T = [0.1; 0.3] [0.0057  0.0086]
       = [0.0006  0.0009
          0.0017  0.0026]

Δw_hp.j = η HH = 0.6 × HH = [0.0003  0.0005
                             0.0010  0.0015]

w_hp.j(N+1) = w_hp.j(N) - Δw_hp.j = [0.2 - 0.0003  0.3 - 0.0005
                                     0.1 - 0.0010  0.4 - 0.0015]
            = [0.1997  0.2995
               0.0990  0.3985]
With the updated weights, the error is calculated again. Iterations are carried out until the error is reduced to an acceptable minimum.
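The same arithmetic can be reproduced with a few lines of NumPy. This is a minimal sketch, not the slides' code: the array layout and variable names are assumptions chosen so the matrix products match the values shown above, and only the forward pass, error and weight-correction terms are computed.

import numpy as np

# Data and initial weights from the worked example above
x    = np.array([0.1, 0.3])            # input pattern x_h
t_q  = 0.8                             # target T_q
w_hp = np.array([[0.2, 0.3],           # input -> hidden weights [w_hp.j]
                 [0.1, 0.4]])
w_pq = np.array([0.2, 0.3])            # hidden -> output weights [w_pq.k]
eta  = 0.6                             # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass
i_p = w_hp.T @ x                       # [0.05, 0.15]
o_p = sigmoid(i_p)                     # ≈ [0.5125, 0.5374]
i_q = w_pq @ o_p                       # ≈ 0.2637
o_q = sigmoid(i_q)                     # ≈ 0.5655
error = (t_q - o_q) ** 2               # ≈ 0.0550

# Weight-correction terms (backward pass)
delta_q = 2 * (t_q - o_q) * o_q * (1 - o_q)    # ≈ 0.1152
dw_pq   = eta * delta_q * o_p                  # ≈ [0.0354, 0.0371]
dd      = (w_pq * delta_q) * o_p * (1 - o_p)   # error term pushed back to the hidden layer
dw_hp   = eta * np.outer(x, dd)                # ≈ [[0.0003, 0.0005], [0.0010, 0.0015]]

print(error, dw_pq, dw_hp)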
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSprop
9. Adam
LINEAR REGRESSION
Imagine we have data containing the heights and weights of thousands of people. We want to use this data to
create a Machine Learning model that takes the height of a person as input and predicts the weight of the
person.
Linear regression performs the task of predicting a dependent variable value (y) based on a given independent
variable (x); hence the name Linear Regression.
This kind of relationship between the input feature(height) and output feature(weight) can be captured by a
linear regression model that tries to fit a straight line on this data.
The line is Y = mx + c, where Y is the output feature (weight), m is the slope of the line, x is the input
feature (height) and c is the intercept (weight equals c when height is 0, since Y = m(0) + c = c).
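As a small illustration, the fitted line is used for prediction like this (the slope and intercept values are made up, not fitted to any dataset):

def predict_weight(height_cm, m, c):
    # Linear model: predicted weight = slope * height + intercept
    return m * height_cm + c

# Illustrative parameters, not fitted values
print(predict_weight(170, 0.9, -90))   # about 63 kg for a 170 cm person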
COST FUNCTION – MEAN SQUARE ERROR - 1
We need a cost function so we can start optimizing our weights.
Let’s use MSE(L2) as our cost function. MSE measures the average squared difference between an
observation’s actual and predicted values. The output is a single number representing the cost, or score,
associated with our current set of weights. Our goal is to minimize MSE to improve the accuracy of our
model.
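A minimal sketch of MSE as described above (the function name and the sample values are illustrative assumptions):

def mse(y_true, y_pred):
    # Average squared difference between actual and predicted values
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse([63, 70, 55], [61, 72, 58]))   # a single number: the cost of the current weights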
COST FUNCTION – MEAN SQUARE ERROR - 2
Gradient descent works by iteratively adjusting the weights or parameters of the model in the direction of the
negative gradient of the cost function until the minimum of the cost function is reached.
The cost function evaluates the difference between the actual and predicted outputs.
It trains machine learning models by minimizing errors between predicted and actual results.
The goal of gradient descent is to minimize the cost function, or the error between predicted and actual y. In
order to do this, it requires two data points—a direction and a learning rate. These factors determine the partial
derivative calculations of future iterations, allowing it to gradually arrive at the local or global minimum (i.e.
point of convergence).
The cost function of linear regression (MSE) is a convex function, i.e. it has only one minimum across the range of
values of the slope 'm' and the constant 'c'.
Learning rate (also referred to as step size or the alpha) is the size of the steps that are taken to reach the
minimum. This is typically a small value, and it is evaluated and updated based on the behavior of the cost
function. High learning rates result in larger steps but risk overshooting the minimum. Conversely, a low
learning rate takes small steps. While this gives more precision, it compromises overall efficiency, since
reaching the minimum takes more iterations, time and computation.
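Putting the pieces together, here is a sketch of batch gradient descent for the height-weight line y = m*x + c with an MSE cost. The learning rate, epoch count and data are illustrative assumptions; the small learning rate keeps the steps stable for unscaled heights.

def gradient_descent(x, y, lr=0.00001, epochs=5000):
    # Batch gradient descent: every step uses the full dataset to compute the gradients
    m, c, n = 0.0, 0.0, len(x)
    for _ in range(epochs):
        y_pred = [m * xi + c for xi in x]
        # Partial derivatives of MSE with respect to m and c
        dm = (-2.0 / n) * sum(xi * (yi - yp) for xi, yi, yp in zip(x, y, y_pred))
        dc = (-2.0 / n) * sum(yi - yp for yi, yp in zip(y, y_pred))
        m -= lr * dm   # step in the direction of the negative gradient
        c -= lr * dc
    return m, c

# Illustrative heights (cm) and weights (kg)
m, c = gradient_descent([150, 160, 170, 180], [50, 57, 64, 71])
print(m, c)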
For big data (huge datasets), batch gradient descent is slow. Hence, in most scenarios, SGD is preferred over Batch
Gradient Descent for optimizing a learning algorithm: it can reach the neighbourhood of the minimum in a short time,
since each update is far cheaper to compute.
Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm that is used for optimizing
machine learning models. It addresses the computational inefficiency of traditional Gradient Descent methods
when dealing with large datasets in machine learning projects.
In SGD, instead of using the entire dataset for each iteration, only a single random training example (or a small
batch) is selected to calculate the gradient and update the model parameters.
This random selection introduces randomness into the optimization process, hence the term "stochastic" in
Stochastic Gradient Descent.
The advantage of using SGD is its computational efficiency, especially when dealing with large datasets. By
using a single example or a small batch, the computational cost per iteration is significantly reduced compared
to traditional Gradient Descent methods that require processing the entire dataset.
Stochastic Gradient Descent randomly picks one sample for each step; here the number of terms calculated is
reduced by a factor of 3. Stochastic Gradient Descent is used when a redundant data set is available.
(Figure: weight plotted against height for the sample data)
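A sketch of a single SGD step for the same height-weight line: the gradient is estimated from one randomly chosen sample rather than the whole dataset. The names and the learning rate are illustrative assumptions.

import random

def sgd_step(m, c, x, y, lr=0.00001):
    # Pick one random (height, weight) observation and update m and c from it alone
    i = random.randrange(len(x))
    xi, yi = x[i], y[i]
    err = yi - (m * xi + c)
    m += lr * 2 * err * xi   # move against the gradient of this sample's squared error
    c += lr * 2 * err
    return m, c

# One noisy update; in practice many such steps are taken over the shuffled data
m, c = sgd_step(0.0, 0.0, [150, 160, 170, 180], [50, 57, 64, 71])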
Aspect | Stochastic Gradient Descent (SGD) | Batch Gradient Descent
Dataset Usage | Uses a single random sample or a small batch of samples at each iteration. Less accurate. | Uses the entire dataset (batch) at each iteration. More accurate.
Computational Efficiency | Computationally less expensive per iteration, as it processes fewer data points. | Computationally more expensive per iteration, as it processes the entire dataset.
Noise in Updates | High noise due to frequent updates with a single or few samples. | Low noise, as it updates parameters using all data points.
Memory Requirement | Requires less memory, as it processes fewer data points at a time. | Requires more memory to hold the entire dataset in memory.
Update Frequency | Frequent updates make it suitable for online learning and large datasets. | Less frequent updates make it suitable for smaller datasets.
Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic
gradient descent. It splits the training dataset into small batch sizes and performs updates on each of
those batches. This approach strikes a balance between the computational efficiency of batch gradient
descent and the speed of stochastic gradient descent.
Let's say there are a total of 'm' observations in a data set. If we use all of these observations to calculate the
cost function J, this is known as Batch Gradient Descent.
In Stochastic Gradient Descent, we instead take one observation at a time: the cost and the parameter updates are
computed for the first observation, then for the second observation, and so on. This is repeated until all
observations have been passed through the network and the parameters have been updated.
In Mini-Batch Gradient Descent, assume that the batch size is 2: the observations are processed two at a time, and
if 'm' is odd the final iteration is left with only the last single observation (as in the sketch below).
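A sketch of how the dataset can be split into mini-batches for MB-SGD, with batch size 2 as in the example above (the helper name is ours):

import random

def minibatch_indices(n_samples, batch_size=2):
    # Shuffle the sample indices and split them into mini-batches;
    # the final batch may hold only the last single observation
    idx = list(range(n_samples))
    random.shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, n_samples, batch_size)]

print(minibatch_indices(5))   # e.g. [[3, 0], [4, 1], [2]]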
IMAGE GRADIENT – IMAGE PROCESSING APPLICATION
Gradient of an image = a measure of change in the image function F(x, y) along x (rows) and y (columns), as sketched below.
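One simple way to approximate the image gradient is by finite differences between neighbouring pixels; this NumPy sketch (our own, not from the slides) returns the change along rows and along columns.

import numpy as np

def image_gradient(img):
    # Finite-difference approximation of the change in F(x, y)
    img = img.astype(float)
    gx = np.diff(img, axis=0)   # change between adjacent rows (x direction)
    gy = np.diff(img, axis=1)   # change between adjacent columns (y direction)
    return gx, gy

img = np.array([[10, 10, 50],
                [10, 20, 80],
                [10, 30, 90]])
gx, gy = image_gradient(img)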
Case study: Medical Science
Does a person have prostate cancer or not?
Healthy/Cancer | PSA (prostate-specific antigen)
Cancer | 3.8
Healthy | 2.5
• Advantage: Faster than batch gradient descent due to updates after each
sample, allowing more responsive performance on hardware accelerators.
• Limitation: May oscillate due to noisy updates, but this can be managed with
small learning rates or other techniques (like momentum).
3. Mini-Batch Stochastic Gradient Descent (MB-SGD) Application:
• Limitation: Requires careful tuning of batch size and learning rate, which
can impact memory usage on FPGA or ASIC hardware.
4. SGD with Momentum Application:
• Suitable for biomedical imaging tasks that require high accuracy and smooth
convergence, like tumor boundary identification in MRI scans.
• Limitation: Requires more memory for storing squared gradients and delta
terms, which can be demanding on hardware resources.
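For reference, one common formulation of the SGD-with-momentum update mentioned above (the parameter values are typical defaults, not taken from the slides):

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # Blend the current gradient with an exponentially decaying average of past updates
    velocity = beta * velocity - lr * grad
    w = w + velocity   # the accumulated "velocity" smooths the descent
    return w, velocity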
9. Adam (Adaptive Moment Estimation) Application:
• Widely used for deep learning tasks in biomedical image processing, such as convolutional
neural networks for cancer detection in histopathology images or segmentation in
radiology.
• Advantage: Combines the benefits of RMSprop and SGD with momentum, offering fast
convergence and good accuracy, making it ideal for high-performance hardware
applications.
• Limitation: Computationally intensive, requiring more complex logic and memory storage
for moment terms, so it works best on advanced hardware like GPUs or high-capacity
FPGAs.
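For reference, a sketch of the Adam update for a single parameter, combining a momentum-like first moment with an RMSprop-like second moment (the hyper-parameter values are the usual defaults, not taken from the slides):

import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: running mean of gradients (momentum-like)
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: running mean of squared gradients (RMSprop-like)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the first steps (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v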
THANK YOU