Advanced Supervised Learning
Neural Networks
Biologically motivated approach to machine learning
Similarity with biological neural networks

The fundamental processing element of a neural network is a neuron. A neuron:
1. Receives inputs from other sources
2. Combines them in some way
3. Performs a generally nonlinear operation on the result
4. Outputs the final result

A human brain has 100 billion neurons, each of which transmits electrical signals through complex mechanisms.

What is a Neural Network?
A Neural Network is a set of connected input/output units, where each connection has a weight associated with it.
Neural Network learning is also called Connectionist learning because of the connections between units.
It is a case of supervised, inductive or classification learning.
A Neural Network learns by adjusting the weights so as to correctly classify the training data and hence, after the testing phase, to classify unknown data.
Neural Networks need a long time for training.
Neural Networks have a high tolerance to noisy and incomplete data.

One Neuron as a Network:
Here x1 and x2 are normalized attribute values of the data, and y is the output of the neuron, i.e. the class label. The values x1 and x2, multiplied by the weights w1 and w2, are the input to the neuron x.

Given that w1 = 0.5 and w2 = 0.5; x1 = 0.3 and x2 = 0.8, the weighted sum is:
w1 * x1 + w2 * x2 = 0.5 * 0.3 + 0.5 * 0.8 = 0.55

The neuron receives the weighted sum as input and calculates the output as a function of the input:
y = f(x), where f(x) is defined as
f(x) = 0 when x < 0.5
f(x) = 1 when x >= 0.5

For example, if x (the weighted sum) is 0.55, then y = 1, so the corresponding input attribute values are classified in class 1. If for another input x = 0.45, then f(x) = 0, and the input values are classified to class 0.

Bias of a Neuron:
We need a bias value added to the weighted sum ∑ wi xi so that the decision function can be shifted away from the origin:
v = ∑ wi xi + b, where b is the bias.
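A minimal sketch of the single-neuron example above in Python; the function names (step, neuron_output) and the second input pair are illustrative and not from the original notes:

def step(x):
    # Threshold activation from the example: class 1 if x >= 0.5, else class 0
    return 1 if x >= 0.5 else 0

def neuron_output(inputs, weights, bias=0.0):
    # Weighted sum of inputs plus bias, passed through the step function
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(v)

# Worked example from the text: 0.5*0.3 + 0.5*0.8 = 0.55, so the output is class 1
print(neuron_output([0.3, 0.8], [0.5, 0.5]))   # prints 1
print(neuron_output([0.2, 0.7], [0.5, 0.5]))   # weighted sum 0.45, prints 0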
In a multi-layer network, the prediction is made in stages:
1. First, hidden layers process raw inputs and pass transformed values to the next layer.
2. Finally, the output layer produces the prediction.

This layer-wise transformation is a form of staged prediction: each layer refines the information before passing it forward.

Why Do We Need Multi-Layer Perceptrons (MLPs)?
A single-layer perceptron can only classify data that is linearly separable.
For non-linearly separable problems like XOR, we need a multi-layer neural network.
A hidden layer with non-linear activation functions (e.g., ReLU, sigmoid) allows the network to transform the input space into a higher-dimensional representation where the classes become separable.

The following walkthrough explains the stochastic weight-update rule step by step (a code sketch follows the list):

1. Training Data D = {(x, y)}:
The dataset consists of input-output pairs (x, y). Each x is a feature vector, and y is the corresponding label.

2. Initialize Weights w:
w ← 0 ∈ R^n
The weight vector w (also called θ in the formula) is initialized to zero. R^n means w is an n-dimensional vector.

3. Iterate Over Epochs (Full Passes Over Data):
For epoch 1…T:
The algorithm runs for T epochs. Each epoch represents one full pass over the dataset.

4. Iterate Over Each Training Example (x, y):
For (x, y) in D:
The algorithm processes one data point at a time (stochastic approach).

5. Update Rule (Weight Adjustment):
w ← w − η ∇f(θ)
w is updated using the sub-gradient ∇f(θ).
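A minimal Python sketch of the loop above. The notes do not specify the loss f, so a hinge-style loss is assumed here purely to give the sub-gradient a concrete form; the function name sgd_train and the tiny dataset are illustrative:

import random

def sgd_train(D, eta=0.1, T=10, n_features=2):
    # Step 2: initialize the weight vector w to the zero vector in R^n
    w = [0.0] * n_features
    for epoch in range(T):                 # Step 3: T full passes over the data
        random.shuffle(D)
        for x, y in D:                     # Step 4: one (x, y) pair at a time
            # Assumed hinge loss max(0, 1 - y*<w, x>) with labels y in {-1, +1};
            # its sub-gradient is -y*x when the margin is violated, 0 otherwise.
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            grad = [-y * xi for xi in x] if margin < 1 else [0.0] * n_features
            # Step 5: w <- w - eta * sub-gradient
            w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w

# Tiny illustrative dataset: two linearly separable points
print(sgd_train([([1.0, 2.0], 1), ([-1.0, -2.0], -1)]))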
Classification by Backpropagation:
Backpropagation learns by iteratively processing a set of training data (samples). For each sample, the weights are modified so as to minimize the error.

Steps in Backpropagation Algorithm:
1. Initialize the weights and biases.
The weights in the network are initialized to random numbers from the interval [-1, 1].
Each unit has a bias associated with it. The biases are similarly initialized to random numbers from the interval [-1, 1].
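As a small illustration of step 1 (initialization), here is a Python sketch; the layer sizes and the name init_layer are illustrative, not from the notes:

import random

def init_layer(n_inputs, n_units):
    # Each unit gets one weight per input and one bias,
    # all drawn uniformly from the interval [-1, 1]
    weights = [[random.uniform(-1.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_units)]
    biases = [random.uniform(-1.0, 1.0) for _ in range(n_units)]
    return weights, biases

# Example: a hidden layer with 3 units, each receiving 2 inputs
hidden_w, hidden_b = init_layer(n_inputs=2, n_units=3)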
What is a Hyperplane?
A hyperplane is a decision boundary
in machine learning, particularly in
classification problems. It is a
subspace of one dimension less than
the space it exists in.
Understanding the Concept:
In 2D space, a hyperplane is a line that separates points.
In 3D space, a hyperplane is a plane that divides space into two parts.
In higher dimensions (n-dimensional space), a hyperplane is an (n-1)-dimensional subspace that separates the data.

The SVM is a machine learning algorithm which:
Solves classification problems
Uses a flexible representation of the class boundaries
Implements automatic complexity control to reduce overfitting
Has a single global minimum which can be found in polynomial time

It is popular because:
o It can be easy to use
o It often has good generalization performance
o The same algorithm solves a variety of problems with little tuning

Hyperplane in Classification (SVM Example):
In classification problems (like in the image), a hyperplane is used to separate two classes.
The optimal hyperplane is the one that maximizes the margin (distance) between the closest points from both classes. These closest points are called support vectors.
Linear classifiers (like logistic regression) may find a separating hyperplane, but Support Vector Machines (SVMs) focus on the maximum-margin plane, ensuring the best separation.

How to Pick the Right Hyperplane?
Many possible hyperplanes can separate the data.
Linear Regression & Neural Networks consider all data points.
Support Vector Machines (SVMs) only consider the most difficult points (support vectors) close to the boundary.

Types of Support Vector Machine (SVM) Algorithms:

Linear SVM: When the data is perfectly linearly separable, only then can we use Linear SVM. Perfectly linearly separable means that the data points can be classified into 2 classes by using a single straight line (if 2D).

Non-Linear SVM: When the data is not linearly separable, we can use Non-Linear SVM: when the data points cannot be separated into 2 classes by a straight line (if 2D), we use advanced techniques such as the kernel trick to classify them. In most real-world applications we do not find linearly separable data points, hence we use the kernel trick to solve them.
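A brief sketch of the linear vs. non-linear distinction, assuming scikit-learn is available; the four data points and the C and gamma values are made-up illustrations, not recommendations from the notes:

from sklearn.svm import SVC

# Four 2D points: the first two belong to class 0, the last two to class 1
X = [[0.0, 0.0], [0.5, 0.2], [2.0, 2.0], [2.5, 1.8]]
y = [0, 0, 1, 1]

# Linear SVM: finds a single straight-line (hyperplane) boundary
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Non-linear SVM: the RBF kernel trick lets the boundary curve in input space
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(linear_svm.predict([[1.0, 1.0]]), rbf_svm.predict([[1.0, 1.0]]))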
Optimal Separating Hyperplane:

The Non-separable Case: Soft Margin Hyperplane
If the data is not linearly separable, then there is no hyperplane that separates the instances into the two classes.
Solution: identify the hyperplane that incurs the least error.
Slack variables ξ_t ≥ 0 are defined, which store the deviation from the margin.
Two types of deviations:
An instance may lie on the wrong side of the hyperplane and be misclassified.
Or, it may be on the right side but lie within the margin, namely, not sufficiently far away from the hyperplane.
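For reference, the standard soft-margin formulation that these slack variables enter is usually written as follows; it is added here for completeness and does not appear explicitly in the notes above:

\min_{w, b, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{t} \xi_t
\quad \text{subject to} \quad y_t \left( w^{\top} x_t + b \right) \ge 1 - \xi_t, \qquad \xi_t \ge 0 \text{ for all } t

Here C controls the trade-off between maximizing the margin and minimizing the total slack (training error).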
Strengths of SVMs:
Good generalization in theory
Good generalization in practice
Work well with few training instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick
KERNEL METHODS:
Extension to Non-linear Decision Boundary:

Evaluation Metrics:
When evaluating a machine learning model, it is crucial to assess its predictive ability, generalization capability, and overall quality. Evaluation metrics provide objective criteria to measure these aspects. The choice of evaluation metrics depends on the specific problem domain, the type of data, and the desired outcome.
2. Confusion Matrix
A confusion matrix is a tabular representation of the prediction outcomes of any binary classifier, which is used to describe the performance of the classification model.

3. Precision
The precision metric is used to overcome the limitation of Accuracy. Precision determines the proportion of positive predictions that were actually correct.

4. Recall or Sensitivity
Recall aims to calculate the proportion of actual positives that were identified correctly. It can be calculated as the number of true positive predictions divided by the total number of actual positives, whether they were correctly predicted as positive or incorrectly predicted as negative (true positives plus false negatives).

5. F-Score
The F-score or F1 Score is a metric to evaluate a binary classification model on the basis of the predictions made for the positive class. It is usually better to compare models by means of one number only. The F1 Score is a single score that represents both Precision and Recall; it can be calculated as the harmonic mean of Precision and Recall, assigning equal weight to each of them.

The value of AUC ranges from 0 to 1.
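The formulas behind these metrics are not spelled out above, so here is a small sketch that computes them from the four confusion-matrix counts; the variable names (tp, fp, fn) and the example counts are illustrative only:

def precision(tp, fp):
    # Proportion of positive predictions that were actually correct
    return tp / (tp + fp)

def recall(tp, fn):
    # Proportion of actual positives that were identified correctly
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example confusion-matrix counts: 40 TP, 10 FP, 5 FN
print(precision(40, 10), recall(40, 5), f1_score(40, 10, 5))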
Gradient Descent:
You start with a random point on the function you’re trying to minimize, for example a random starting point on the mountain. Then, you calculate the gradient (slope) of the function at that point.

In the mountain analogy, this is like looking around you to find the steepest slope. Once you know the direction, you take a step downhill in that direction, and then you calculate the gradient again.

Repeat this process until you reach the bottom. The size of each step is determined by the learning rate. However, if the learning rate is too small, it might take a long time to reach the bottom. If it’s too large, you might overshoot the lowest point.

Finding the right balance is key to the success of the algorithm. One of the most appealing aspects of Gradient Descent is its generality. It can be applied to almost any function, especially those where an analytical solution is not feasible.

In traditional batch gradient descent, you calculate the gradient of the loss function with respect to the parameters for the entire training set. As you can imagine, for large datasets, this can be quite computationally intensive and time-consuming.

This is where SGD comes into play. Instead of using the entire dataset to calculate the gradient, SGD randomly selects just one data point (or a few data points) to compute the gradient in each iteration.

Think of this process as if you were again descending a mountain, but this time in thick fog with limited visibility. Rather than viewing the entire landscape to decide your next step, you make your decision based on where your foot lands next. This step is small and random, but it’s repeated many times, each time adjusting your path slightly in response to the immediate terrain under your feet.
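To make the batch-versus-stochastic distinction concrete, here is a tiny sketch contrasting the two gradient computations for a squared-error loss; the one-dimensional data and the function names are made up for illustration:

import random

# Model: predict y as w*x with a single parameter w; loss per point = (w*x - y)**2
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (x, y) pairs

def point_gradient(w, x, y):
    # Gradient of (w*x - y)**2 with respect to w, for a single data point
    return 2 * (w * x - y) * x

def batch_gradient(w, data):
    # Batch gradient descent averages the gradient over the entire training set
    return sum(point_gradient(w, x, y) for x, y in data) / len(data)

def stochastic_gradient(w, data):
    # SGD uses the gradient of one randomly selected data point only
    x, y = random.choice(data)
    return point_gradient(w, x, y)

w = 0.0
print(batch_gradient(w, data), stochastic_gradient(w, data))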
The Algorithm

1. Initialization
First, you initialize the parameters (weights) of your model. This can be done randomly or by some other initialization technique. The starting point for SGD is crucial as it influences the path the algorithm will take.

2. Random Selection
In each iteration of the training process, SGD randomly selects a single data point (or a small batch of data points) from the entire dataset. This randomness is what makes it ‘stochastic’.

3. Computing the Gradient
Calculate the gradient of the loss function, but only for the randomly selected data point(s). The gradient is a vector that points in the direction of the steepest increase of the loss function. In the context of SGD, it tells you how to tweak the parameters to make the model more accurate for that particular data point.

Gradient Formula:
∇f(x) = (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn)
Here, ∇f(x) represents the gradient of the loss function f(x) with respect to the parameters x. This gradient is a vector of partial derivatives, where each component of the vector is the partial derivative of the loss function with respect to the corresponding parameter in x.

4. Update the Parameters
Adjust the model parameters in the opposite direction of the gradient. Here’s where the learning rate η plays a crucial role. The formula for updating each parameter is:
x_new = x − η ∇f(x)
where:
x_new represents the updated parameters.
x represents the current parameters before the update.
η is the learning rate, a positive scalar determining the size of the step in the direction of the negative gradient.
∇f(x) is the gradient of the loss function f(x) with respect to the parameters x.
The learning rate determines the size of the steps you take towards the minimum. If it’s too small, the algorithm will be slow; if it’s too large, you might overshoot the minimum.

5. Repeat until Convergence
Repeat steps 2 to 4 for a set number of iterations or until the model performance stops improving. Each iteration provides a slightly updated model. Ideally, after many iterations, SGD converges to a set of parameters that minimize the loss function, although due to its stochastic nature, the path to convergence is not as smooth and may oscillate around the minimum.

Learning Rate Scheduling
Learning rate scheduling involves adjusting the learning rate over time. Common strategies include:
Time-Based Decay: The learning rate decreases over each update.
Step Decay: Reduce the learning rate by some factor after a certain number of epochs.
Exponential Decay: Decrease the learning rate exponentially.
Adaptive Learning Rate: Methods like AdaGrad, RMSProp, and Adam adjust the learning rate automatically during training.

Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD, since a high learning rate can cause the algorithm to overshoot the minimum, while a low learning rate can make the algorithm converge slowly.