
Seminar on

NEURAL NETWORKS BASED OPTIMIZATIONS


School of Electronics Engineering (SENSE)
Vellore Institute of Technology, Vellore

Presented by:
Mahesh S (24PHD0300)
Peter A (24PHD0345)
Ronald Vincent (24PHD0368)
Gaurav Gadhiya (24PHD0377)
Devisri Eriki (24PHD0325)
Neural Networks
• A method of computing based on the interaction of multiple connected processing elements.
• A powerful technique for solving many real-world problems.
• The ability to learn from experience in order to improve performance.
• The ability to deal with incomplete information.
• An artificial neural network (ANN) may be defined as an information-processing model inspired by the way biological nervous systems, such as the brain, process information.
• This model tries to replicate only the most basic functions of the brain.
• An ANN is composed of a large number of highly interconnected processing units (neurons) working in unison to solve specific problems.
Basics Of Neural Network
• Biological approach to AI
• Developed in 1943
• Composed of one or more layers of neurons
• Several types exist; we'll focus on feed-forward and feedback networks
Neurons

Figure: biological neuron vs. artificial neuron.

• Each neuron is connected to the others by connection links.
• Each connection link is associated with a weight, which carries information about the input signal.
• This information is used by the neural network to solve a particular problem.
• An ANN's collective behaviour is characterized by its ability to learn, recall and generalize training patterns or data, similar to that of the human brain.
• ANNs have the capability to model networks of the original neurons as found in the brain.
• Thus, the ANN processing elements are called neurons or artificial neurons.
Neural Network Neurons
• Receives n-inputs
• Multiplies each input by its weight
• Applies activation function to the sum of results
• Outputs result
Activation function
• An activation function in a neural network defines how the weighted sum of the inputs is transformed into an output from a node or nodes in a layer of the network.
• An activation function introduces non-linearity into the model.
Linear and non-linear functions
• A linear activation function limits the network: its gradient is constant and stacked linear layers collapse into a single linear mapping.
• Non-linear functions help the network find complex relationships between variables.
Neural network applications
• Pattern recognition
• Investment analysis
• Control systems and monitoring
• Mobile computing
• Marketing and financial applications
• Forecasting: sales, market research, meteorology
Structure of Neural Network

Figure: inputs X1, X2, X3 are multiplied by weights W1, W2, W3, summed (Σ), passed through the activation function, and produce the output Y.
Activation Function
• An activation function in a neural network defines how the weighted sum of the inputs is transformed into an output from a node or nodes in a layer of the network.

• It transforms the input signal into a non-linear output, allowing the network to learn complex relationships between inputs and outputs.

• The activation function determines whether a neuron should activate based on the input it receives.

• A sigmoid function, also known as a logistic function, is a mathematical function that outputs a number between 0 and 1. It is commonly used in machine learning, especially in neural networks, as an activation function with the formula:

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

• It is used to model the probability of an event occurring.


Activation Function

Fig2: Sigmoid function graph
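To make the formula above concrete, here is a minimal NumPy sketch of the sigmoid activation (the array values are arbitrary examples):

```python
import numpy as np

def sigmoid(x):
    """Logistic activation: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Weighted sums of a node's inputs, then the activation transform
weighted_sum = np.array([-2.0, 0.0, 0.2637, 3.0])
print(sigmoid(weighted_sum))   # approx. [0.119, 0.5, 0.566, 0.953]
```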


Example of Neural Network

Figure: a 2-2-1 feed-forward network with two inputs, two hidden neurons and one output neuron; the connection weights and the target $T_q = 0.8$ are marked on the diagram.

• Given input pattern: $x_h = \begin{bmatrix} 0.3 \\ 0.1 \end{bmatrix}$, target $T_q = 0.8$

• Weights between the input and hidden layer: $w_{hp,j} = \begin{bmatrix} 0.1 & 0.4 \\ 0.2 & 0.3 \end{bmatrix}$

• Weights between the hidden and output layer: $w_{pq,k} = \begin{bmatrix} 0.2 \\ 0.3 \end{bmatrix}$

Forward pass:

$I_{p,j} = [w_{hp,j}]^{T} x_h = \begin{bmatrix} 0.1 & 0.2 \\ 0.4 & 0.3 \end{bmatrix} \begin{bmatrix} 0.3 \\ 0.1 \end{bmatrix} = \begin{bmatrix} 0.05 \\ 0.15 \end{bmatrix}$

$\phi_{p,j} = \dfrac{1}{1 + \exp(-I_{p,j})} = \begin{bmatrix} \dfrac{1}{1 + \exp(-0.05)} \\ \dfrac{1}{1 + \exp(-0.15)} \end{bmatrix} = \begin{bmatrix} 0.5125 \\ 0.5374 \end{bmatrix}$

$I_{q,k} = [w_{pq,k}]^{T} \phi_{p,j} = \begin{bmatrix} 0.2 & 0.3 \end{bmatrix} \begin{bmatrix} 0.5125 \\ 0.5374 \end{bmatrix} = 0.2637$

Output: $\phi_{q,k} = \dfrac{1}{1 + \exp(-0.2637)} = 0.5655$

The squared error signal:

$E = (T_q - \phi_{q,k})^{2} = (0.8 - 0.5655)^{2} = 0.0550$
Modification of the weights between the output and hidden layer

Let the sigmoid gain $\lambda = 1$ and the learning rate $\eta = 0.6$.

$\delta_{pq,k} = 2\,(T_q - \phi_{q,k})\,\phi_{q,k}\,(1 - \phi_{q,k}) = 2\,(0.8 - 0.5655)(0.5655)(1 - 0.5655) = 0.1152$

$\Delta w_{pq,k} = -\eta\,\delta_{pq,k}\,\phi_{p,j} = -0.6 \times 0.1152 \times \begin{bmatrix} 0.5125 \\ 0.5374 \end{bmatrix} = \begin{bmatrix} -0.0354 \\ -0.0371 \end{bmatrix}$

$w_{pq,k}(N+1) = w_{pq,k}(N) + \Delta w_{pq,k} = \begin{bmatrix} 0.2 \\ 0.3 \end{bmatrix} + \begin{bmatrix} -0.0354 \\ -0.0371 \end{bmatrix} = \begin{bmatrix} 0.1646 \\ 0.2629 \end{bmatrix}$
Modification of the weights between the input and hidden layer

The hidden-layer correction back-propagates the error over the $r$ output neurons:

$\Delta w_{hp,j} = -\eta\, x_h\, DD^{T}, \qquad DD = \Big[\sum_{k=1}^{r} \delta_{pq,k}\, w_{pq,k}\Big] \circ \phi_{p,j} \circ (1 - \phi_{p,j})$

Step by step:

$w_{pq,k}\,\delta_{pq,k} = \begin{bmatrix} 0.2 \\ 0.3 \end{bmatrix} \times 0.1152 = \begin{bmatrix} 0.0230 \\ 0.0346 \end{bmatrix}$

$DD = \begin{bmatrix} 0.0230 \times 0.5125 \times (1 - 0.5125) \\ 0.0346 \times 0.5374 \times (1 - 0.5374) \end{bmatrix} = \begin{bmatrix} 0.0057 \\ 0.0086 \end{bmatrix}$

$HH = x_h\, DD^{T} = \begin{bmatrix} 0.3 \\ 0.1 \end{bmatrix} \begin{bmatrix} 0.0057 & 0.0086 \end{bmatrix} = \begin{bmatrix} 0.0017 & 0.0026 \\ 0.0006 & 0.0009 \end{bmatrix}$

$\Delta w_{hp,j} = -\eta\, HH = \begin{bmatrix} -0.0010 & -0.0015 \\ -0.0003 & -0.0005 \end{bmatrix}$

$w_{hp,j}(N+1) = w_{hp,j}(N) + \Delta w_{hp,j} = \begin{bmatrix} 0.1 & 0.4 \\ 0.2 & 0.3 \end{bmatrix} + \begin{bmatrix} -0.0010 & -0.0015 \\ -0.0003 & -0.0005 \end{bmatrix} = \begin{bmatrix} 0.0990 & 0.3985 \\ 0.1997 & 0.2995 \end{bmatrix}$

• With the updated weights, the error is calculated again. Iterations are carried out until the error falls below the tolerance.
• Once the weights have been adjusted, the network is trained. The process continues until the sum-squared error is less than the tolerance value or the maximum number of iterations is reached.
• In this example the number of iterations is taken as 100.
• The resulting output is 0.7912, which is close to the target value 0.8, and the sum-squared error is 0.0088.
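The single training step worked out above can be reproduced in a few lines of NumPy. This is a minimal sketch that follows the slide's values and its update convention; the variable names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x_h  = np.array([0.3, 0.1])          # input pattern
T_q  = 0.8                           # target
W_hp = np.array([[0.1, 0.4],         # input -> hidden weights
                 [0.2, 0.3]])
w_pq = np.array([0.2, 0.3])          # hidden -> output weights
eta  = 0.6                           # learning rate

# Forward pass
I_p   = W_hp.T @ x_h                 # [0.05, 0.15]
phi_p = sigmoid(I_p)                 # [0.5125, 0.5374]
I_q   = w_pq @ phi_p                 # 0.2637
phi_q = sigmoid(I_q)                 # 0.5655
print("squared error:", round((T_q - phi_q) ** 2, 4))   # 0.055

# Backward pass (same sign convention as the worked example)
delta_q = 2 * (T_q - phi_q) * phi_q * (1 - phi_q)    # 0.1152
dW_pq   = -eta * delta_q * phi_p                     # [-0.0354, -0.0371]
dd      = (w_pq * delta_q) * phi_p * (1 - phi_p)     # [0.0057, 0.0086]
dW_hp   = -eta * np.outer(x_h, dd)                   # hidden-layer correction

print(w_pq + dW_pq)   # approx. [0.1646, 0.2629]
print(W_hp + dW_hp)   # approx. [[0.0990, 0.3985], [0.1997, 0.2995]]
```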
HOW DO OPTIMIZERS WORK?

• For a useful mental model, think of a hiker trying to get down a mountain blindfolded. She cannot tell which direction to go, but she can tell one thing: whether she is going down (making progress) or up (losing progress). If she keeps taking steps that lead her downwards, she will eventually reach the base.
DIFFERENT TYPES OF OPTIMIZERS

1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with Momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSprop
9. Adam
LINEAR REGRESSION

• Imagine we have data containing the heights and weights of thousands of people. We want to use this data to build a machine learning model that takes the height of a person as input and predicts the weight of that person.

• Linear regression predicts a dependent variable value (y) based on a given independent variable (x); hence the name linear regression.

• This kind of relationship between the input feature (height) and the output feature (weight) can be captured by a linear regression model that tries to fit a straight line to the data.

• The equation of the line in a simple linear regression model is: Y = mx + c

• Y is the output feature (weight), m is the slope of the line, x is the input feature (height) and c is the intercept (the weight when height is 0): Y = m(0) + c = c.
COST FUNCTION – MEAN SQUARE ERROR - 1

• We need a cost function so we can start optimizing our weights.

• Let's use MSE (L2) as our cost function. MSE measures the average squared difference between an observation's actual and predicted values:

$MSE = \dfrac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

The output is a single number representing the cost, or score, associated with our current set of weights. Our goal is to minimize MSE to improve the accuracy of our model.
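A minimal sketch of this cost function for the straight-line model Y = mx + c; the height/weight numbers are made up for illustration:

```python
import numpy as np

def predict(m, c, x):
    """Simple linear model: predicted weight from height."""
    return m * x + c

def mse(m, c, x, y):
    """Mean squared error between actual and predicted values."""
    y_hat = predict(m, c, x)
    return np.mean((y - y_hat) ** 2)

heights = np.array([150.0, 160.0, 170.0, 180.0])   # example inputs (cm)
weights = np.array([52.0, 60.0, 68.0, 76.0])       # example targets (kg)
print(mse(0.8, -70.0, heights, weights))           # 4.0 for this candidate line
```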
COST FUNCTION – MEAN SQUARE ERROR - 2

Error: Difference between actual and predicted


WHAT IS GRADIENT DESCENT?
• Gradient Descent (GD) is a widely used optimization algorithm in machine learning and deep learning that minimizes the cost function of a neural network model during training. It is used to find the optimal parameters of a neural network (weights and biases) by minimizing a defined cost function.

• It works by iteratively adjusting the weights or parameters of the model in the direction of the negative gradient of the cost function until the minimum of the cost function is reached.

• The cost function evaluates the difference between the actual and predicted outputs.

• It trains machine learning models by minimizing the error between predicted and actual results.

• The goal of gradient descent is to minimize the cost function, i.e. the error between the predicted and actual y. To do this it requires two quantities: a direction and a learning rate. These determine the partial-derivative calculations of future iterations, allowing the algorithm to gradually arrive at the local or global minimum (the point of convergence).

• The cost function of linear regression (MSE) is a convex function, i.e. it has only one minimum across the range of values of the slope m and the constant c.

The learning rate (also referred to as the step size or alpha) is the size of the steps taken to reach the minimum. This is typically a small value, and it is evaluated and updated based on the behaviour of the cost function. A high learning rate results in larger steps but risks overshooting the minimum. Conversely, a low learning rate takes small steps; while it has the advantage of more precision, the larger number of iterations compromises overall efficiency, since it takes more time and computation to reach the minimum.

Gradient descent is used to get to the minimum value of the cost function. Intuitively, gradient descent finds the slope of the cost function at every step and travels down the valley to reach the lowest point (the minimum of the cost function).
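A minimal gradient-descent sketch for the linear-regression cost above. The gradients are the partial derivatives of MSE with respect to m and c; the data and learning rate are illustrative:

```python
import numpy as np

x = np.array([150.0, 160.0, 170.0, 180.0])   # heights
y = np.array([55.0, 62.0, 66.0, 75.0])       # weights

m, c = 0.0, 0.0          # start from an arbitrary point
lr = 0.00001             # learning rate (step size)

for step in range(1000):
    y_hat = m * x + c
    # Partial derivatives of MSE = mean((y - y_hat)^2)
    dm = -2 * np.mean((y - y_hat) * x)
    dc = -2 * np.mean(y - y_hat)
    # Step in the direction of the negative gradient
    m -= lr * dm
    c -= lr * dc

print(m, c)   # slope and intercept after 1000 updates
# (in practice the inputs would be normalised, or many more steps run,
#  for the intercept to settle)
```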
STOCHASTIC GRADIENT DESCENT (SGD):

• For big data (huge datasets), gradient descent is slow. Hence, in most scenarios, SGD is preferred over batch gradient descent for optimizing a learning algorithm: it can approach the minimum in a much shorter time because each update is far cheaper.

• Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm used for optimizing machine learning models. It addresses the computational inefficiency of traditional gradient descent methods when dealing with large datasets in machine learning projects.

• In SGD, instead of using the entire dataset for each iteration, only a single random training example (or a small batch) is selected to calculate the gradient and update the model parameters.

• This random selection introduces randomness into the optimization process, hence the term "stochastic" in Stochastic Gradient Descent.

• The advantage of SGD is its computational efficiency, especially when dealing with large datasets. By using a single example or a small batch, the computational cost per iteration is significantly reduced compared to traditional gradient descent methods that require processing the entire dataset.
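The same kind of fit as before, but updating the parameters from one randomly chosen sample per step, is sketched below (illustrative data and learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([150.0, 160.0, 170.0, 180.0])   # heights
y = np.array([55.0, 62.0, 66.0, 75.0])       # weights

m, c = 0.0, 0.0
lr = 0.00001

for step in range(5000):
    i = rng.integers(len(x))         # pick ONE random training example
    err = y[i] - (m * x[i] + c)
    m -= lr * (-2 * err * x[i])      # gradient from a single sample
    c -= lr * (-2 * err)

print(m, c)
```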
• Stochastic Gradient Descent randomly picks one sample for each step, so the number of terms calculated per update is reduced (by a factor of 3 in this illustration).
• Stochastic Gradient Descent is useful when a redundant dataset is available.

(Figure: height vs. weight scatter plot.)
Comparison: Stochastic Gradient Descent (SGD) vs. Batch Gradient Descent

• Dataset usage: SGD uses a single random sample or a small batch of samples at each iteration (less accurate); Batch GD uses the entire dataset at each iteration (more accurate).
• Computational efficiency: SGD is computationally less expensive per iteration, as it processes fewer data points; Batch GD is more expensive per iteration, as it processes the entire dataset.
• Convergence: SGD shows faster processing and convergence due to frequent updates; Batch GD is slower due to less frequent updates.
• Noise in updates: SGD has high noise because it updates with a single or few samples; Batch GD has low noise as it updates parameters using all data points.
• Stability: SGD is less stable and may oscillate around the optimal solution; Batch GD is more stable and converges smoothly towards the optimum.
• Memory requirement: SGD requires less memory, as it processes fewer data points at a time; Batch GD requires more memory to hold the entire dataset.
• Update frequency: SGD's frequent updates make it suitable for online learning and large datasets; Batch GD's less frequent updates make it better suited to smaller datasets.
• Initialization sensitivity: SGD is less sensitive to initial parameter values due to frequent updates; Batch GD is more sensitive to initial parameter values.
MINI-BATCH GRADIENT DESCENT

• Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic gradient descent. It splits the training dataset into small batches and performs an update on each of those batches. This approach strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.

• In practice it is most common to select a small subset of the data, or mini-batch, for each step.
BATCH GRADIENT DESCENT

• If there are a total of m observations in the dataset and we use all of them to calculate the cost function J, this is known as batch gradient descent.

• With 5 observations, the parameters are updated once per pass, using all 5 observations at once.

STOCHASTIC GRADIENT DESCENT (SGD)
• Take the first observation, pass it through the neural network, calculate the error and then update the parameters.

• Then take the second observation and perform similar steps with it. This is repeated until all observations have been passed through the network and the parameters have been updated.

• Since we have 5 observations, the parameters are updated 5 times, i.e. there are 5 iterations per pass over the data.
MINI-BATCH GRADIENT DESCENT

• Assume the batch size is 2. With 5 observations this gives 3 updates per pass; the final iteration uses the single remaining observation, as shown in the sketch below.
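A minimal sketch of how the three variants split the same 5 observations into parameter updates per epoch (batch size 2 for the mini-batch case, as in the slide):

```python
import numpy as np

data = np.arange(5)          # indices of the 5 observations
batch_size = 2

# Batch GD: one update per epoch, using all 5 observations
batch_updates = [data]

# SGD: one update per observation -> 5 updates per epoch
sgd_updates = [data[i:i + 1] for i in range(len(data))]

# Mini-batch GD: ceil(5 / 2) = 3 updates; the last batch holds the single leftover
mini_batch_updates = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

print(len(batch_updates), len(sgd_updates), len(mini_batch_updates))   # 1 5 3
print(mini_batch_updates[-1])                                          # [4]
```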
IMAGE GRADIENT – IMAGE PROCESSING APPLICATION
• The gradient of an image is a measure of the change in the image function F(x, y) along x (rows) and y (columns).
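As a rough illustration, the gradient of a small grayscale image can be computed with finite differences; the tiny array below is made up:

```python
import numpy as np

# A tiny grayscale "image" F(x, y)
F = np.array([[10, 10, 20, 30],
              [10, 15, 25, 35],
              [10, 20, 30, 40]], dtype=float)

# np.gradient returns the rate of change along rows and columns
dF_rows, dF_cols = np.gradient(F)

# Gradient magnitude: how strongly the intensity changes at each pixel
magnitude = np.sqrt(dF_rows**2 + dF_cols**2)
print(magnitude)
```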
Case study (medical science): does a person have prostate cancer or not?

Training data: PSA (prostate-specific antigen) level for each person.

• Cancer: PSA = 3.8, 3.4, 2.9, 2.8, 2.7, 2.1, 1.6
• Healthy: PSA = 2.5, 2.0, 1.7, 1.4, 1.2, 0.9, 0.8
Graphical Representation (figures)
Activation Functions (figure)
Training a Neural Network (figures)
Evaluation

Network output (predicted probability of cancer) for each training sample:

• Cancer: PSA 3.8 → 0.991, 3.4 → 0.973, 2.9 → 0.902, 2.8 → 0.875, 2.7 → 0.842, 2.1 → 0.506, 1.6 → 0.206
• Healthy: PSA 2.5 → 0.755, 2.0 → 0.438, 1.7 → 0.254, 1.4 → 0.130, 1.2 → 0.079, 0.9 → 0.036, 0.8 → 0.028

Using a 0.5 threshold on the predicted probability, 12 of the 14 samples are classified correctly (the cancer case with PSA 1.6 and the healthy case with PSA 2.5 are misclassified):
Evaluation = 12/14 × 100 ≈ 86%
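A minimal sketch of this evaluation, assuming the usual 0.5 threshold on the predicted probability of cancer (the values are copied from the table above):

```python
import numpy as np

# Actual class (1 = cancer, 0 = healthy) and predicted probability of cancer
actual    = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
predicted = np.array([0.991, 0.973, 0.902, 0.875, 0.842, 0.506, 0.206,
                      0.755, 0.438, 0.254, 0.130, 0.079, 0.036, 0.028])

labels   = (predicted >= 0.5).astype(int)     # threshold at 0.5
accuracy = np.mean(labels == actual)
print(f"{accuracy:.0%}")                      # 86% (12 of 14 correct)
```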


Prediction (figure)

Where do the values for the weights come from?
• For calculating the bias we generally use ordinary least squares.

Hidden Layers (figure)
Graphical Representation (figure)
Prediction (figure)
Comparison (figure)
Applications in Biomedical Image Processing

1. Gradient Descent
• Application: Primarily used for simpler models or batch-processing tasks in hardware accelerators, when computational resources are abundant and high accuracy is needed.
• Advantage: Offers stable convergence, suitable for high-resolution biomedical image analysis where precision is essential, such as in MRI or CT scan interpretation.
• Limitation: Slow convergence, making it less ideal for real-time applications.

2. Stochastic Gradient Descent (SGD)
• Application: Commonly used in real-time applications where rapid responses are needed, like edge detection in ultrasound or real-time diagnostics.
• Advantage: Faster than batch gradient descent due to updates after each sample, allowing more responsive performance on hardware accelerators.
• Limitation: May oscillate due to noisy updates, but this can be managed with small learning rates or other techniques (like momentum).
3. Mini-Batch Stochastic Gradient Descent (MB-SGD)
• Application: Ideal for moderate-sized biomedical image datasets, like those used in pathology for cell segmentation or tissue classification.
• Advantage: Balances stability and speed, providing better convergence compared to SGD while retaining some computational efficiency on accelerators.
• Limitation: Requires careful tuning of batch size and learning rate, which can impact memory usage on FPGA or ASIC hardware.

4. SGD with Momentum
• Application: Used in applications with high-dimensional data and complex patterns, such as 3D medical image reconstruction or deep tissue analysis.
• Advantage: Accelerates convergence by accounting for past gradients, reducing oscillations in training and achieving faster, smoother results (see the sketch below).
• Limitation: Needs additional memory resources for storing the momentum term, slightly increasing hardware demands.
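A minimal sketch of the momentum idea described above, on a toy quadratic loss (the gradient function and hyperparameters are illustrative placeholders, not the slide's model):

```python
import numpy as np

def grad(w):
    """Hypothetical gradient of a loss; here the loss is ||w||^2."""
    return 2 * w

w        = np.array([1.0, -2.0])   # parameters
velocity = np.zeros_like(w)        # accumulated past gradients (the momentum term)
lr, beta = 0.1, 0.9                # learning rate and momentum coefficient

for step in range(100):
    velocity = beta * velocity + grad(w)   # accumulate past gradients
    w = w - lr * velocity                  # step using the accumulated direction

print(w)   # approaches the minimum at the origin
```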
5. Nesterov Accelerated Gradient (NAG)
• Application: Suitable for biomedical imaging tasks that require high accuracy and smooth convergence, like tumor boundary identification in MRI scans.
• Advantage: Provides better foresight in parameter updates, improving convergence speed and potentially enhancing interpretability in high-stakes diagnostic applications.
• Limitation: Complexity in calculations increases hardware demand, which might require powerful accelerators.

6. Adaptive Gradient (AdaGrad)
• Application: Effective for sparse datasets and rare-event detection, which is common in biomedical imaging for anomaly detection in complex images, like identifying rare cell types.
• Advantage: Adjusts the learning rate dynamically for each parameter, improving performance on the sparse and unbalanced data common in medical imaging datasets (see the sketch below).
• Limitation: The learning rate decreases over time, potentially causing premature convergence; it can increase the computational burden on hardware over prolonged training.
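A minimal sketch of AdaGrad's per-parameter learning-rate adaptation, again on a toy gradient (illustrative placeholders only):

```python
import numpy as np

def grad(w):
    """Hypothetical gradient; one parameter gets much larger gradients than the other."""
    return np.array([10.0 * w[0], 0.1 * w[1]])

w = np.array([1.0, 1.0])
G = np.zeros_like(w)        # running sum of squared gradients, per parameter
lr, eps = 0.5, 1e-8

for step in range(200):
    g = grad(w)
    G += g ** 2                          # accumulate squared gradients
    w -= lr * g / (np.sqrt(G) + eps)     # effective step shrinks where gradients were large

print(w)
```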
7. AdaDelta
• Application: Useful in imaging tasks with varying data distributions, such as multi-modal imaging analysis where contrast and intensity vary between modalities (e.g., MRI, PET).
• Advantage: Adaptive to the data without requiring a manually set learning rate, improving robustness in biomedical applications with heterogeneous data.
• Limitation: Requires more memory for storing squared gradients and delta terms, which can be demanding on hardware resources.

9. Adam (Adaptive Moment Estimation)
• Application: Widely used for deep learning tasks in biomedical image processing, such as convolutional neural networks for cancer detection in histopathology images or segmentation in radiology.
• Advantage: Combines the benefits of RMSprop and SGD with momentum, offering fast convergence and good accuracy, making it ideal for high-performance hardware applications (see the sketch below).
• Limitation: Computationally intensive, requiring more complex logic and memory storage for the moment terms, so it works best on advanced hardware like GPUs or high-capacity FPGAs.
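A minimal sketch of the Adam update, combining a momentum-style first moment with an RMSprop-style second moment; the toy gradient and the commonly cited default hyperparameters are used purely for illustration:

```python
import numpy as np

def grad(w):
    """Hypothetical gradient of a loss; here the loss is ||w||^2."""
    return 2 * w

w = np.array([1.0, -2.0])
m = np.zeros_like(w)                    # first moment (mean of gradients)
v = np.zeros_like(w)                    # second moment (mean of squared gradients)
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)        # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)   # approaches the minimum at the origin
```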
THANK YOU
