Advanced Supervised Learning

The document provides an overview of advanced supervised learning techniques, focusing on neural networks and perceptrons. It explains the structure and function of neurons, the importance of activation functions, and the training algorithms used in neural networks, including backpropagation and stochastic gradient descent. Additionally, it discusses support vector machines (SVM) and their role in classification problems, highlighting the concept of hyperplanes and their significance in separating data points.

Uploaded by

shyammm53

ADVANCED SUPERVISED LEARNING:

Neural Networks
A biologically motivated approach to machine learning.
• Similarity with biological networks
• The fundamental processing element of a neural network is a neuron, which:
1. Receives inputs from other sources
2. Combines them in some way
3. Performs a generally nonlinear operation on the result
4. Outputs the final result
A human brain has roughly 100 billion neurons.

What is a Neural Network?
A Neural Network is a set of connected input/output units, where each connection has a weight associated with it.
• Neural Network learning is also called Connectionist learning due to the connections between units.
• It is a case of supervised, inductive or classification learning.
• A Neural Network learns by adjusting the weights so that it correctly classifies the training data and hence, after the testing phase, can classify unknown data.
• A Neural Network needs a long time for training.
• A Neural Network has a high tolerance to noisy and incomplete data.

One Neuron as a Network:
Here x1 and x2 are normalized attribute values of the data, and y is the output of the neuron, i.e. the class label. The values x1 and x2, multiplied by the weights w1 and w2, are the inputs to the neuron.
Given that:
w1 = 0.5 and w2 = 0.5; x1 = 0.3 and x2 = 0.8
the weighted sum is:
w1 * x1 + w2 * x2 = 0.5 * 0.3 + 0.5 * 0.8 = 0.55
The neuron receives the weighted sum x as input and calculates the output as a function of that input:
y = f(x), where f(x) is defined as
f(x) = 0 { when x < 0.5 }
f(x) = 1 { when x >= 0.5 }
• For example, x (the weighted sum) is 0.55, so y = 1. Therefore, the corresponding input attribute values are classified in class 1.
• If for other input values x = 0.45, then f(x) = 0, so the input values are classified in class 0.

Bias of a Neuron:
We need a bias value to be added to the weighted sum ∑wixi so that the decision boundary can be shifted away from the origin.
v = ∑wixi + b, where b is the bias
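The single-neuron example (w1 = w2 = 0.5, x1 = 0.3, x2 = 0.8, threshold 0.5) can be reproduced with a few lines of Python. This is only a sketch: the function name `neuron_output` and the optional `bias` argument are illustrative choices, not part of the notes.

```python
def neuron_output(inputs, weights, bias=0.0, threshold=0.5):
    """Return (weighted sum, class label) for one threshold neuron."""
    # v = sum(w_i * x_i) + b, as in the bias formula of the notes
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    # f(v) = 1 when v >= threshold, otherwise 0
    label = 1 if v >= threshold else 0
    return v, label


# Values taken from the worked example in the notes
v, y = neuron_output(inputs=[0.3, 0.8], weights=[0.5, 0.5])
print(round(v, 2), y)
```

With these values the weighted sum is 0.55, which is at or above the threshold, so the input is assigned to class 1.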
THE PERCEPTRON
Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work from Warren McCulloch and Walter Pitts. While today we use other models of artificial neurons, they follow the general principles set by the perceptron.

Model of an artificial neuron
fig 4.2
• As you can see, the network of nodes sends signals in one direction. This is called a feed-forward network.
• The figure depicts a neuron connected with n other neurons, which thus receives n inputs (x1, x2, ..., xn). This configuration is called a Perceptron.

A perceptron is a computational representation of a neuron, capturing its basic functionality of receiving weighted inputs, summing them up, and producing an output based on a threshold, while a real neuron is a biological cell in the brain that transmits electrical signals through complex mechanisms.

Basic Components of Perceptron

1. Input Layer: The input layer consists of one or more input neurons, which receive input signals from the external world or from other layers of the neural network.

2. Weights: Each input neuron is associated with a weight, which represents the strength of the connection between the input neuron and the output neuron.

3. Bias: A bias term is added to the weighted sum of the inputs to provide the perceptron with additional flexibility in modeling complex patterns in the input data.

4. Activation Function: The activation function determines the output of the perceptron based on the weighted sum of the inputs and the bias term. Common activation functions used in perceptrons include the step function, sigmoid function, and ReLU function.
An activation function is a mathematical function that determines whether a neuron in an artificial neural network should be activated or not. It introduces non-linearity to the model, allowing it to learn complex patterns.
Why is it needed?
1. Non-linearity: Without an activation function, a neural network would just perform linear transformations, making it unable to handle complex relationships.
2. Decision Making: It helps decide whether a neuron should "fire" (activate) based on the weighted sum of inputs.
3. Gradient-based Learning: It enables backpropagation by allowing gradients to be computed.

Types of Activation Functions:

1. Step Function (Threshold Function)
• Formula: f(x) = 1 if x ≥ 0; f(x) = 0 if x < 0
• Use Case: Used in perceptrons for binary classification.
• Limitation: Not differentiable at 0 and flat everywhere else, so it cannot be used in gradient-based learning.

2. Sigmoid Function (Logistic Function)
• Formula: f(x) = 1 / (1 + e^(−x))
• Range: (0, 1)
• Use Case: Binary classification problems.
• Limitation: Causes the vanishing gradient problem; slow learning for deep networks.

3. ReLU (Rectified Linear Unit)
• Formula: f(x) = max(0, x)
• Range: [0, ∞)
• Use Case: Most commonly used in deep learning.
• Advantage: Computationally efficient, helps mitigate the vanishing gradient problem.

5. Output: The output of the perceptron is a single binary value, either 0 or 1, which indicates the class or category to which the input data belongs.

6. Training Algorithm: The perceptron is typically trained using a supervised learning algorithm such as the perceptron learning algorithm or backpropagation. During training, the weights and biases of the perceptron are adjusted to minimize the error between the predicted output and the true output for a given set of training examples.

7. Overall, the perceptron is a simple yet powerful algorithm that can be used to perform binary classification tasks.

Types of Perceptron:
Single layer: A single-layer perceptron can learn only linearly separable patterns.
Multilayer: A multilayer perceptron has two or more layers and therefore greater processing power.

Advantages:
• A multi-layered perceptron model can solve complex non-linear problems.
• It works well with both small and large input data.
• Helps us to obtain quick predictions after the training.
• Helps us obtain the same accuracy ratio with big and small data.

Disadvantages:
• In a multi-layered perceptron model, computations are time-consuming and complex.
• It is tough to predict how much each independent variable affects the dependent variable.
• The model functioning depends on the quality of training.

Activation function:
The activation function applies a step rule (converting the numerical output into +1 or -1) to check whether the output of the weighting function is greater than zero or not.
• Step function gets triggered above a certain value of the neuron output; else it outputs zero.
• Sign function outputs +1 or -1 depending on whether the neuron output is greater than zero or not.
• Sigmoid is the S-curve and outputs a value between 0 and 1.

A Multilayer Feed-Forward Network:
The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units.
• A network containing two hidden layers is called a three-layer neural network, and so on.
• The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
• INPUT: records without the class attribute, with normalized attribute values.
• INPUT VECTOR: X = {x1, x2, ..., xn}, where n is the number of (non-class) attributes.
• INPUT LAYER: there are as many nodes as non-class attributes, i.e. as the length of the input vector.
• HIDDEN LAYER: the number of nodes in the hidden layer and the number of hidden layers depend on the implementation.
• OUTPUT LAYER: corresponds to the class attribute. There are as many nodes as classes (values of the class attribute).
• The network is fully connected, i.e. each unit provides input to each unit in the next forward layer.

What is Staged Prediction?
Staged prediction is an approach where predictions are made step by step, rather than all at once.
• In single-layer perceptrons, predictions are direct.
• In multi-layer perceptrons (MLPs), predictions are often made in stages:
1. First, hidden layers process raw inputs and pass transformed values to the next layer.
2. Finally, the output layer produces the prediction.

This layer-wise transformation is a form of staged prediction: each layer refines the information before passing it forward.

Why Do We Need Multi-Layer Perceptrons (MLPs)?
• A single-layer perceptron can only classify data that is linearly separable.
• For non-linearly separable problems like XOR, we need a multi-layer neural network.
• A hidden layer with non-linear activation functions (e.g., ReLU, sigmoid) allows the network to transform the input space into a higher-dimensional representation where the classes become separable.

SGD AND BACKPROPAGATION:
SSGD (stochastic subgradient descent) is an extension of Stochastic Gradient Descent (SGD) used when the loss function is non-differentiable at certain points. Instead of using the true gradient (which may not exist at non-differentiable points), it uses a subgradient, which is a generalization of the gradient for non-smooth functions.

Steps in the Algorithm:

1. Given a Training Set:
D = {(x, y)}
• The dataset consists of input-output pairs (x, y).
• Each x is a feature vector, and y is the corresponding label.

2. Initialize Weights w:
w ← 0 ∈ R^n
The weight vector w (also called θ in the formula) is initialized to zero. R^n means w is an n-dimensional vector.

3. Iterate Over Epochs (Full Passes Over Data):
For epoch 1…T:
• The algorithm runs for T epochs.
• Each epoch represents one full pass over the dataset.

4. Iterate Over Each Training Example (x, y):
For (x, y) in D:
• The algorithm processes one data point at a time (stochastic approach).

5. Update Rule (Weight Adjustment):
w ← w − η∇f(θ)
• w is updated using the subgradient ∇f(θ).
• η is the learning rate, controlling step size.
• If the function is differentiable, ∇f(θ) is the usual gradient.
• If non-differentiable, a subgradient is used.

6. Return the Final Parameters:
Return θ
The optimized weight vector is returned after T epochs.
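The six steps of the subgradient algorithm can be sketched in Python as follows. The notes do not name a loss function, so the hinge loss max(0, 1 − y·(w·x)) with labels in {+1, −1} is an assumption made here purely for illustration; the function names, the toy data, and the per-epoch shuffling are likewise hypothetical.

```python
import random


def hinge_subgradient(w, x, y):
    """One subgradient of max(0, 1 - y * (w . x)) with respect to w."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin < 1:
        return [-y * xi for xi in x]   # loss is active: slope is -y * x
    return [0.0] * len(w)              # loss is flat: zero is a valid subgradient


def ssgd(data, n_features, epochs=20, lr=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * n_features             # step 2: w <- 0 in R^n
    for _ in range(epochs):            # step 3: T full passes over the data
        examples = list(data)
        rng.shuffle(examples)          # assumption: visit examples in random order
        for x, y in examples:          # step 4: one example at a time
            g = hinge_subgradient(w, x, y)
            w = [wi - lr * gi for wi, gi in zip(w, g)]  # step 5: w <- w - eta * g
    return w                           # step 6: return the final parameters


# Step 1: a tiny, made-up, linearly separable training set D = {(x, y)}
data = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w = ssgd(data, n_features=2)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1 for x, _ in data]
print(w, predictions)
```

On this toy set every update pushes w in a direction that agrees with all four examples, so the learned weights classify the training points correctly.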


Classification by Backpropagation:
Backpropagation learns by iteratively processing a set of training data (samples).
• For each sample, weights are modified to minimize the error between the network’s classification and the actual classification.
• The delta rule is used for calculating the gradient that is used for updating the weights.
• We will try to minimize the following error:
E = ½ Σi (ti − oi)^2
where ti is the target output and oi is the network’s output for unit i.
• For a new training example X = (x1, x2, …, xn), update each weight according to this rule:
wi = wi + Δwi
where Δwi = −η * ∂E/∂wi and η is the learning rate.

Steps in Backpropagation Algorithm:
1. Initialize the weights and biases.
• The weights in the network are initialized to random numbers from the interval [-1,1].
• Each unit has a BIAS associated with it.
• The biases are similarly initialized to random numbers from the interval [-1,1].
2. Feed the training sample.
3. Propagate the inputs forward: we compute the net input and output of each unit in the hidden and output layers.
4. Backpropagate the error.
5. Update weights and biases to reflect the propagated errors.
6. Check the terminating conditions.

Support Vector Machine (SVM):
The SVM is a machine learning algorithm which
• solves classification problems
• uses a flexible representation of the class boundaries
• implements automatic complexity control to reduce overfitting
• has a single global minimum which can be found in polynomial time
• It is popular because
o It can be easy to use
o It often has good generalization performance
o The same algorithm solves a variety of problems with little tuning

Types of Support Vector Machine (SVM) Algorithms:

Linear SVM: Linear SVM can be used only when the data is perfectly linearly separable. Perfectly linearly separable means that the data points can be classified into 2 classes by using a single straight line (if 2D).

Non-Linear SVM: When the data is not linearly separable, we can use Non-Linear SVM. This means that when the data points cannot be separated into 2 classes by using a straight line (if 2D), we use advanced techniques like the kernel trick to classify them. In most real-world applications we do not find linearly separable data points, hence we use the kernel trick to solve them.

What is a Hyperplane?
A hyperplane is a decision boundary in machine learning, particularly in classification problems. It is a subspace of one dimension less than the space it exists in.

Understanding the Concept:
• In 2D space, a hyperplane is a line that separates points.
• In 3D space, a hyperplane is a plane that divides space into two parts.
• In higher dimensions (n-dimensional space), a hyperplane is an (n-1)-dimensional subspace that separates the data.

Hyperplane in Classification (SVM Example)
• In classification problems (like in the image), a hyperplane is used to separate two classes.
• The optimal hyperplane is the one that maximizes the margin (distance) between the closest points from both classes. These closest points are called support vectors.
• Linear classifiers (like logistic regression) may find a separating hyperplane, but Support Vector Machines (SVMs) focus on the max margin plane, ensuring the best separation.

How to Pick the Right Hyperplane?
• Many possible hyperplanes can separate data.
• Linear Regression & Neural Networks consider all data points.
• Support Vector Machines (SVMs) only consider the most difficult points (support vectors) close to the boundary.
Optimal Separating Hyperplane:
Margin:

The Non-separable Case: Soft Margin Hyperplane
• If the data is not linearly separable, then there is no hyperplane that separates the two classes.
Solution: Identify a hyperplane that incurs the least error.
Slack variables, ξ^t ≥ 0, are defined, which store the deviation from the margin.
Two types of deviations:
• An instance may lie on the wrong side of the hyperplane and be misclassified.
• Or, it may be on the right side but lie in the margin, namely, not sufficiently far from the hyperplane.
Hence r^t (w^T x^t + w0) ≥ 1 − ξ^t
• We allow an “error” ξi in classification; it is based on the output of the discriminant function w^T x + b (b plays the role of w0 above).
• The sum of the ξi approximates the number of misclassified samples.

Soft Margin Hyperplane

Strengths of SVMs:
• Good generalization in theory
• Good generalization in practice
• Work well with few training instances
• Find globally best model
• Efficient algorithms
• Amenable to the kernel trick
KERNEL METHODS:
Extension to Non-linear Decision Boundary:

Transforming the Data:
Computation in the feature space can be costly because it is high dimensional.
The feature space may even be infinite-dimensional!
The kernel trick helps in this context: a kernel function computes the inner product in the feature space directly from the original inputs, without constructing the transformed vectors.
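A small numeric check can illustrate why the kernel trick saves work. The sketch assumes a degree-2 polynomial kernel k(x, z) = (x·z)^2 on 2-D inputs, whose explicit feature map is φ(x) = (x1², √2·x1·x2, x2²). This kernel and the sample vectors are chosen for illustration and do not come from the notes.

```python
import math


def poly_kernel(x, z):
    """k(x, z) = (x . z)^2, computed in the original 2-D input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map whose inner product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


x, z = (1.0, 2.0), (3.0, 4.0)
k_value = poly_kernel(x, z)                              # one dot product, then a square
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))    # dot product in the 3-D feature space
print(k_value, explicit)
```

Both routes give the same value (up to floating-point rounding), but the kernel never builds φ(x); for kernels whose feature space is very large or infinite, that is the only practical route.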

METRICS & ERRORS

Evaluation metrics
Evaluation metrics are quantitative measures used to assess the performance and effectiveness of a statistical or machine learning model.
• These metrics provide insights into how well the model is performing and help in comparing different models or algorithms.
• When evaluating a machine learning model, it is crucial to assess its predictive ability, generalization capability, and overall quality.
• Evaluation metrics provide objective criteria to measure these aspects.
• The choice of evaluation metrics depends on the specific problem domain, the type of data, and the desired outcome.

PERFORMANCE METRICS FOR CLASSIFICATION
Performance Metrics for Classification
In a classification problem, the category or class of the data is identified based on training data.
The model learns from the given dataset and then classifies new data into classes or groups based on the training.
It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam, etc.
To evaluate the performance of a classification model, different metrics are used, and some of them are as follows:
• Accuracy
• Confusion Matrix
• Precision
• Recall
• F-Score
• AUC (Area Under the Curve)-ROC

1. Accuracy
The accuracy metric is one of the simplest classification metrics to implement, and it can be determined as the ratio of the number of correct predictions to the total number of predictions.
Accuracy = (number of correct predictions) / (total number of predictions)

2. Confusion Matrix
A confusion matrix is a tabular representation of the prediction outcomes of a classifier, which is used to describe the performance of the classification model on a set of test data when the true values are known.
A confusion matrix is an N x N matrix, where N is the number of classes.
For a binary classifier, the confusion matrix is a table with 4 different combinations of predicted and actual values.
It is extremely useful for measuring precision, recall, specificity, accuracy, and most importantly, AUC-ROC curves.
True Positive: You predicted positive, and it’s true.
True Negative: You predicted negative, and it’s true.
False Positive (Type 1 Error): You predicted positive, and it’s false.
False Negative (Type 2 Error): You predicted negative, and it’s false.

3. Precision
The precision metric is used to overcome the limitation of accuracy.
The precision determines the proportion of positive predictions that were actually correct.
It can be calculated as the ratio of True Positives (predictions that are actually true) to the total positive predictions (True Positives and False Positives).
Precision = TP / (TP + FP)

4. Recall or sensitivity
It aims to calculate the proportion of actual positives that were identified correctly.
It can be calculated as the ratio of True Positives (predictions that are actually true) to the total number of actual positives, either correctly predicted as positive or incorrectly predicted as negative (True Positives and False Negatives).
Recall = TP / (TP + FN)

5. F-Score
F-score or F1 Score is a metric to evaluate a binary classification model on the basis of predictions that are made for the positive class.
It is usually better to compare models by means of one number only.
It is a single score that represents both Precision and Recall. So, the F1 Score can be calculated as the harmonic mean of Precision and Recall, assigning equal weight to each of them.
F1 = 2 * Precision * Recall / (Precision + Recall)

6. AUC-ROC
If we need to visualize the performance of the classification model on charts, then we can use the AUC-ROC curve.
It is one of the popular and important metrics for evaluating the performance of the classification model.
ROC stands for Receiver Operating Characteristic curve.
ROC represents a graph to show the performance of a classification model at different threshold levels.
The curve is plotted between two parameters, which are:
• True Positive Rate (TPR)
• False Positive Rate (FPR)
AUC stands for Area Under the ROC Curve.
AUC calculates the performance across all the thresholds and provides an aggregate measure.
The value of AUC ranges from 0 to 1.
It means a model with 100% wrong predictions will have an AUC of 0.0, whereas a model with 100% correct predictions will have an AUC of 1.0.
The AUC of a classifier is defined as the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example.

Specificity = TN / (TN + FP)
False alarm rate = 1 − Specificity
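The classification metrics above can be computed directly from the four confusion-matrix counts, and the probabilistic definition of AUC can be evaluated by comparing every positive score with every negative score. The sketch below uses made-up counts and scores; the function names are illustrative, and counting tied scores as half a win is a common convention assumed here rather than something the notes specify.

```python
def classification_metrics(tp, fp, fn, tn):
    """Metrics derived from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": specificity,
        "false_alarm_rate": 1 - specificity,
    }


def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and classifier scores
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
auc = pairwise_auc(pos_scores=[0.9, 0.8, 0.4], neg_scores=[0.7, 0.3])
print(metrics, auc)
```

The pairwise loop is quadratic in the number of examples, so it is suitable only for small illustrations like this one.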
PERFORMANCE METRICS FOR REGRESSION
Regression is a supervised learning technique that aims to find the relationships between the dependent and independent variables.
A predictive regression model predicts a numeric (continuous) value.
The metrics used for regression are different from the classification metrics.
It means we cannot use the Accuracy metric (explained above) to evaluate a regression model; instead, the performance of a regression model is reported as errors in the prediction.
• Mean Absolute Error
• Mean Squared Error
• Root Mean Square Error (RMSE)

1. Mean Absolute Error
Mean Absolute Error or MAE is one of the simplest metrics, which measures the average absolute difference between actual and predicted values, where absolute means taking a number as positive.
MAE = (1/N) Σ |Y − Y'|
Here, Y is the actual outcome, Y' is the predicted outcome, and N is the total number of data points.
MAE is much more robust to outliers.
One of the limitations of MAE is that it is not differentiable at zero error, so gradient-based optimizers such as Gradient Descent need special handling (for example, a subgradient).

2. Mean Squared Error
Mean Squared Error or MSE is one of the most suitable metrics for regression evaluation.
It measures the average of the squared difference between the values predicted by the model and the actual values.
MSE = (1/N) Σ (Y − Y')^2
Since errors are squared in MSE, it only takes non-negative values, and it is usually positive and non-zero.
Moreover, due to the squared differences, it penalizes even small errors and penalizes large errors disproportionately, which can lead to over-estimation of how bad the model is.
MSE is a much-preferred metric compared to other regression metrics as it is differentiable and hence easier to optimize.

3. Root Mean Square Error
RMSE is a metric that can be obtained by just taking the square root of the MSE value.
RMSE = sqrt(MSE)
As we know, the MSE metric is not robust to outliers, and neither are the RMSE values.
This gives higher weightage to the large errors in predictions.
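The three regression metrics are short enough to compute by hand. The sketch below follows the definitions in this section, with Y as the actual values and Y' as the predictions; the sample numbers and the function name are made up for illustration.

```python
import math


def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equally long sequences."""
    n = len(actual)
    diffs = [y - y_hat for y, y_hat in zip(actual, predicted)]
    mae = sum(abs(d) for d in diffs) / n      # mean absolute error
    mse = sum(d * d for d in diffs) / n       # mean squared error
    rmse = math.sqrt(mse)                     # root of the MSE
    return mae, mse, rmse


# Hypothetical actual values Y and predictions Y'
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, rmse)
```

Here the single error of size 1 contributes two thirds of the MSE but only half of the MAE, which shows how squaring gives large errors more weight.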
Performance Measures - Summary

Confusion Matrix for Multi-Class Classification (from N * N matrix to 2 * 2 matrix)
FN: The False-negative value for a class is the sum of the values in the corresponding row except for the TP value.
FP: The False-positive value for a class is the sum of the values in the corresponding column except for the TP value.
TN: The True-negative value for a class is the sum of the values of all columns and rows except the row and column of the class that we are calculating the values for.
TP: The True-positive value is where the actual value and predicted value are the same.

Two-class example

STOCHASTIC GRADIENT DESCENT
Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
It is done by a random selection of coefficient values, which are then iteratively updated to reach the minimum of the cost function.

Gradient Descent
It’s an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient.
Imagine that you’re at the top of a mountain, and your goal is to reach the lowest point. Gradient Descent helps you find the best path down the hill.

The beauty of Gradient Descent is its simplicity and elegance.

You start with a random point on the function you’re trying to minimize, for example a random starting point on the mountain. Then, you calculate the gradient (slope) of the function at that point.

In the mountain analogy, this is like looking around you to find the steepest slope. Once you know the direction, you take a step downhill in that direction, and then you calculate the gradient again.

Repeat this process until you reach the bottom. The size of each step is determined by the learning rate. If the learning rate is too small, it might take a long time to reach the bottom. If it’s too large, you might overshoot the lowest point.

Finding the right balance is key to the success of the algorithm. One of the most appealing aspects of Gradient Descent is its generality. It can be applied to almost any differentiable function, especially those where an analytical solution is not feasible. This makes it incredibly versatile in solving various types of problems in machine learning, from simple linear regression to complex neural networks.

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) adds a twist to the traditional gradient descent approach. The term ‘stochastic’ refers to a system or process that is linked with a random probability.
Therefore, this randomness is introduced in the way the gradient is calculated, which significantly alters its behavior and efficiency compared to the standard gradient descent.

In traditional batch gradient descent, you calculate the gradient of the loss function with respect to the parameters for the entire training set.

As you can imagine, for large datasets, this can be quite computationally intensive and time-consuming.

This is where SGD comes into play. Instead of using the entire dataset to calculate the gradient, SGD randomly selects just one data point (or a few data points) to compute the gradient in each iteration.

Think of this process as if you were again descending a mountain, but this time in thick fog with limited visibility. Rather than viewing the entire landscape to decide your next step, you make your decision based on where your foot lands next.

This step is small and random, but it’s repeated many times, each time adjusting your path slightly in response to the immediate terrain under your feet.
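The effect of the learning rate on plain gradient descent can be seen on a toy problem. The sketch below minimizes f(x) = (x − 3)^2, whose gradient is 2(x − 3), with three different step sizes; the function, the starting point, and the specific learning rates are illustrative assumptions, not values from the notes.

```python
def gradient_descent(grad, x0, lr, steps):
    """Repeatedly step against the gradient: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


def grad_f(x):
    # Gradient of f(x) = (x - 3)^2, which has its minimum at x = 3
    return 2.0 * (x - 3.0)


x_good = gradient_descent(grad_f, x0=0.0, lr=0.1, steps=50)    # reasonable step size
x_slow = gradient_descent(grad_f, x0=0.0, lr=0.01, steps=50)   # too small: still far away
x_big = gradient_descent(grad_f, x0=0.0, lr=1.1, steps=50)     # too large: overshoots and diverges
print(x_good, x_slow, x_big)
```

After the same number of steps, the moderate rate is essentially at the minimum, the small rate has covered only part of the distance, and the large rate has bounced further away on every step.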
The Algorithm

1. Initialization
First, you initialize the parameters (weights) of your model. This can be done randomly or by some other initialization technique. The starting point for SGD matters because it influences the path the algorithm will take.

2. Random Selection
In each iteration of the training process, SGD randomly selects a single data point (or a small batch of data points) from the entire dataset. This randomness is what makes it ‘stochastic’.

3. Computing the Gradient
Calculate the gradient of the loss function, but only for the randomly selected data point(s). The gradient is a vector that points in the direction of the steepest increase of the loss function. In the context of SGD, it tells you how to tweak the parameters to make the model more accurate for that particular data point.

Gradient Formula:
∇f(x) = (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn)
Here, ∇f(x) represents the gradient of the loss function f(x) with respect to the parameters x.
This gradient is a vector of partial derivatives, where each component of the vector is the partial derivative of the loss function with respect to the corresponding parameter in x.

4. Update the Parameters
Adjust the model parameters in the opposite direction of the gradient. Here’s where the learning rate η plays a crucial role. The formula for updating each parameter is:
x_new = x − η ∇f(x)
where:
• x_new represents the updated parameters.
• x represents the current parameters before the update.
• η is the learning rate, a positive scalar determining the size of the step in the direction of the negative gradient.
• ∇f(x) is the gradient of the loss function f(x) with respect to the parameters x.
• The learning rate determines the size of the steps you take towards the minimum. If it’s too small, the algorithm will be slow; if it’s too large, you might overshoot the minimum.

5. Repeat until convergence
Repeat steps 2 to 4 for a set number of iterations or until the model performance stops improving. Each iteration provides a slightly updated model. Ideally, after many iterations, SGD converges to a set of parameters that minimize the loss function, although due to its stochastic nature, the path to convergence is not as smooth and may oscillate around the minimum.

Learning Rate Scheduling
Learning rate scheduling involves adjusting the learning rate over time. Common strategies include:
• Time-Based Decay: The learning rate decreases with each update.
• Step Decay: Reduce the learning rate by some factor after a certain number of epochs.
• Exponential Decay: Decrease the learning rate exponentially.
• Adaptive Learning Rate: Methods like AdaGrad, RMSProp, and Adam adjust the learning rate automatically during training.

Advantages of SGD
• Speed: Each SGD update is cheaper than an update in Batch Gradient Descent or Mini-Batch Gradient Descent, since it uses only one example to update the parameters.
• Memory Efficiency: Since SGD updates the parameters for each training example one at a time, it is memory-efficient and can handle large datasets that cannot fit into memory.
• Avoidance of Local Minima: Due to the noisy updates in SGD, it can escape from local minima, although convergence to a global minimum is not guaranteed.

Disadvantages of SGD
• Noisy updates: The updates in SGD are noisy and have a high variance, which can make the optimization process less stable and lead to oscillations around the minimum.
• Slow Convergence: SGD may require more iterations to converge to the minimum since it updates the parameters for each training example one at a time.
• Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD, since a high learning rate can cause the algorithm to overshoot the minimum, while a low learning rate can make the algorithm converge slowly.
• Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and can result in a suboptimal solution. This can be mitigated by using techniques such as learning rate scheduling and momentum-based updates.
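The five steps of the SGD algorithm, combined with a time-based decay schedule, can be sketched on a one-variable linear regression. Everything specific here is an assumption for illustration: the model y = w·x + b, the per-example loss ½(prediction − y)², the schedule η_t = η0 / (1 + decay·t), the zero initialization, and the noise-free toy data generated from y = 2x + 1.

```python
import random


def sgd_linear_regression(data, lr0=0.5, decay=0.001, epochs=200, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0                         # step 1: initialize the parameters
    t = 0
    for _ in range(epochs * len(data)):
        x, y = rng.choice(data)             # step 2: pick one example at random
        error = (w * x + b) - y
        grad_w, grad_b = error * x, error   # step 3: gradient of 0.5 * error^2
        lr = lr0 / (1.0 + decay * t)        # time-based decay of the learning rate
        w -= lr * grad_w                    # step 4: move against the gradient
        b -= lr * grad_b
        t += 1                              # step 5: repeat
    return w, b


# Hypothetical noise-free data drawn from y = 2x + 1
data = [(x, 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
w, b = sgd_linear_regression(data)
print(round(w, 3), round(b, 3))
```

Because the toy data contain no noise, every per-example gradient vanishes at the true parameters, so the run settles close to w = 2 and b = 1; on noisy data the iterates would instead hover around the minimum, which is where a decaying learning rate helps.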
