Advanced Supervised Learning
Neural Networks
Biologically motivated approach to machine learning
Similarity with biological neural networks

The fundamental processing element of a neural network is a neuron. A neuron:
1. Receives inputs from other sources
2. Combines them in some way
3. Performs a generally nonlinear operation on the result
4. Outputs the final result

A human brain has 100 billion neurons, each of which transmits electrical signals through complex mechanisms.

What is a Neural Network?
A Neural Network is a set of connected input/output units, where each connection has a weight associated with it.
Neural Network learning is also called Connectionist learning because of the connections between units.
It is a case of supervised, inductive or classification learning.
A Neural Network learns by adjusting the weights so as to correctly classify the training data and hence, after the testing phase, to classify unknown data.
Neural Networks need a long time for training.
Neural Networks have a high tolerance to noisy and incomplete data.

One Neuron as a Network:
Here x1 and x2 are normalized attribute values of the data, and y is the output of the neuron, i.e. the class label. The values x1 and x2, multiplied by the weights w1 and w2, are the input to the neuron x.

Given that w1 = 0.5 and w2 = 0.5; x1 = 0.3 and x2 = 0.8, the weighted sum is:
w1 * x1 + w2 * x2 = 0.5 * 0.3 + 0.5 * 0.8 = 0.55

The neuron receives the weighted sum as input and calculates the output as a function of the input:
y = f(x), where f(x) is defined as
f(x) = 0 when x < 0.5
f(x) = 1 when x >= 0.5

For example, if x (the weighted sum) is 0.55, then y = 1, so the corresponding input attribute values are classified in class 1. If for another input x = 0.45, then f(x) = 0, and the input values are classified to class 0.

Bias of a Neuron:
We need a bias value added to the weighted sum ∑ wi xi so that the decision function can be shifted away from the origin:
v = ∑ wi xi + b, where b is the bias.
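A minimal sketch of the single-neuron example above in Python; the function names (step, neuron_output) and the second input pair are illustrative and not from the original notes:

def step(x):
    # Threshold activation from the example: class 1 if x >= 0.5, else class 0
    return 1 if x >= 0.5 else 0

def neuron_output(inputs, weights, bias=0.0):
    # Weighted sum of inputs plus bias, passed through the step function
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(v)

# Worked example from the text: 0.5*0.3 + 0.5*0.8 = 0.55, so the output is class 1
print(neuron_output([0.3, 0.8], [0.5, 0.5]))   # prints 1
print(neuron_output([0.2, 0.7], [0.5, 0.5]))   # weighted sum 0.45, prints 0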
In a multi-layer network, the prediction is made in stages:
1. First, hidden layers process raw inputs and pass transformed values to the next layer.
2. Finally, the output layer produces the prediction.

This layer-wise transformation is a form of staged prediction: each layer refines the information before passing it forward.

Why Do We Need Multi-Layer Perceptrons (MLPs)?
A single-layer perceptron can only classify data that is linearly separable.
For non-linearly separable problems like XOR, we need a multi-layer neural network.
A hidden layer with non-linear activation functions (e.g., ReLU, sigmoid) allows the network to transform the input space into a higher-dimensional representation where the classes become separable.

The following walkthrough explains the stochastic weight-update rule step by step (a code sketch follows the list):

1. Training Data D = {(x, y)}:
The dataset consists of input-output pairs (x, y). Each x is a feature vector, and y is the corresponding label.

2. Initialize Weights w:
w ← 0 ∈ R^n
The weight vector w (also called θ in the formula) is initialized to zero. R^n means w is an n-dimensional vector.

3. Iterate Over Epochs (Full Passes Over Data):
For epoch 1…T:
The algorithm runs for T epochs. Each epoch represents one full pass over the dataset.

4. Iterate Over Each Training Example (x, y):
For (x, y) in D:
The algorithm processes one data point at a time (stochastic approach).

5. Update Rule (Weight Adjustment):
w ← w − η ∇f(θ)
w is updated using the sub-gradient ∇f(θ).
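A minimal Python sketch of the loop above. The notes do not specify the loss f, so a hinge-style loss is assumed here purely to give the sub-gradient a concrete form; the function name sgd_train and the tiny dataset are illustrative:

import random

def sgd_train(D, eta=0.1, T=10, n_features=2):
    # Step 2: initialize the weight vector w to the zero vector in R^n
    w = [0.0] * n_features
    for epoch in range(T):                 # Step 3: T full passes over the data
        random.shuffle(D)
        for x, y in D:                     # Step 4: one (x, y) pair at a time
            # Assumed hinge loss max(0, 1 - y*<w, x>) with labels y in {-1, +1};
            # its sub-gradient is -y*x when the margin is violated, 0 otherwise.
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            grad = [-y * xi for xi in x] if margin < 1 else [0.0] * n_features
            # Step 5: w <- w - eta * sub-gradient
            w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w

# Tiny illustrative dataset: two linearly separable points
print(sgd_train([([1.0, 2.0], 1), ([-1.0, -2.0], -1)]))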
Classification by Backpropagation:
Backpropagation learns by iteratively processing a set of training data (samples). For each sample, the weights are modified so as to minimize the error.

Steps in Backpropagation Algorithm:
1. Initialize the weights and biases.
The weights in the network are initialized to random numbers from the interval [-1, 1].
Each unit has a bias associated with it. The biases are similarly initialized to random numbers from the interval [-1, 1].
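As a small illustration of step 1 (initialization), here is a Python sketch; the layer sizes and the name init_layer are illustrative, not from the notes:

import random

def init_layer(n_inputs, n_units):
    # Each unit gets one weight per input and one bias,
    # all drawn uniformly from the interval [-1, 1]
    weights = [[random.uniform(-1.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_units)]
    biases = [random.uniform(-1.0, 1.0) for _ in range(n_units)]
    return weights, biases

# Example: a hidden layer with 3 units, each receiving 2 inputs
hidden_w, hidden_b = init_layer(n_inputs=2, n_units=3)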
What is a Hyperplane?
A hyperplane is a decision boundary
in machine learning, particularly in
classification problems. It is a
subspace of one dimension less than
the space it exists in.
Understanding the Concept:
In 2D space, a hyperplane is a line that separates points.
In 3D space, a hyperplane is a plane that divides space into two parts.
In higher dimensions (n-dimensional space), a hyperplane is an (n-1)-dimensional subspace that separates the data.

The SVM is a machine learning algorithm which:
Solves classification problems
Uses a flexible representation of the class boundaries
Implements automatic complexity control to reduce overfitting
Has a single global minimum which can be found in polynomial time

It is popular because:
o It can be easy to use
o It often has good generalization performance
o The same algorithm solves a variety of problems with little tuning

Hyperplane in Classification (SVM Example):
In classification problems (like in the image), a hyperplane is used to separate two classes.
The optimal hyperplane is the one that maximizes the margin (distance) between the closest points from both classes. These closest points are called support vectors.
Linear classifiers (like logistic regression) may find a separating hyperplane, but Support Vector Machines (SVMs) focus on the maximum-margin plane, ensuring the best separation.

How to Pick the Right Hyperplane?
Many possible hyperplanes can separate the data.
Linear Regression & Neural Networks consider all data points.
Support Vector Machines (SVMs) only consider the most difficult points (support vectors) close to the boundary.

Types of Support Vector Machine (SVM) Algorithms:

Linear SVM: When the data is perfectly linearly separable, only then can we use Linear SVM. Perfectly linearly separable means that the data points can be classified into 2 classes by using a single straight line (if 2D).

Non-Linear SVM: When the data is not linearly separable, we can use Non-Linear SVM: when the data points cannot be separated into 2 classes by a straight line (if 2D), we use advanced techniques such as the kernel trick to classify them. In most real-world applications we do not find linearly separable data points, hence we use the kernel trick to solve them.
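A brief sketch of the linear vs. non-linear distinction, assuming scikit-learn is available; the four data points and the C and gamma values are made-up illustrations, not recommendations from the notes:

from sklearn.svm import SVC

# Four 2D points: the first two belong to class 0, the last two to class 1
X = [[0.0, 0.0], [0.5, 0.2], [2.0, 2.0], [2.5, 1.8]]
y = [0, 0, 1, 1]

# Linear SVM: finds a single straight-line (hyperplane) boundary
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Non-linear SVM: the RBF kernel trick lets the boundary curve in input space
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(linear_svm.predict([[1.0, 1.0]]), rbf_svm.predict([[1.0, 1.0]]))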
Optimal Separating Hyperplane:

The Non-separable Case: Soft Margin Hyperplane
If the data is not linearly separable, then there is no hyperplane that separates the instances into the two classes.
Solution: identify the hyperplane that incurs the least error.
Slack variables ξ_t ≥ 0 are defined, which store the deviation from the margin.
Two types of deviations:
An instance may lie on the wrong side of the hyperplane and be misclassified.
Or, it may be on the right side but lie within the margin, namely, not sufficiently far away from the hyperplane.
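For reference, the standard soft-margin formulation that these slack variables enter is usually written as follows; it is added here for completeness and does not appear explicitly in the notes above:

\min_{w, b, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{t} \xi_t
\quad \text{subject to} \quad y_t \left( w^{\top} x_t + b \right) \ge 1 - \xi_t, \qquad \xi_t \ge 0 \text{ for all } t

Here C controls the trade-off between maximizing the margin and minimizing the total slack (training error).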
Strengths of SVMs:
Good generalization in theory
Good generalization in practice
Work well with few training instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick
KERNEL METHODS:
Extension to Non-linear Decision Boundary:

Evaluation Metrics:
When evaluating a machine learning model, it is crucial to assess its predictive ability, generalization capability, and overall quality. Evaluation metrics provide objective criteria to measure these aspects. The choice of evaluation metrics depends on the specific problem domain, the type of data, and the desired outcome.
2. Confusion Matrix
A confusion matrix is a tabular representation of the prediction outcomes of any binary classifier, which is used to describe the performance of the classification model.

3. Precision
The precision metric is used to overcome the limitation of Accuracy. Precision determines the proportion of positive predictions that were actually correct.

4. Recall or Sensitivity
Recall aims to calculate the proportion of actual positives that were identified correctly. It can be calculated as the number of true positive predictions divided by the total number of actual positives, whether they were correctly predicted as positive or incorrectly predicted as negative (true positives plus false negatives).

5. F-Score
The F-score or F1 Score is a metric to evaluate a binary classification model on the basis of the predictions made for the positive class. It is usually better to compare models by means of one number only. The F1 Score is a single score that represents both Precision and Recall; it can be calculated as the harmonic mean of Precision and Recall, assigning equal weight to each of them.

The value of AUC ranges from 0 to 1.
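The formulas behind these metrics are not spelled out above, so here is a small sketch that computes them from the four confusion-matrix counts; the variable names (tp, fp, fn) and the example counts are illustrative only:

def precision(tp, fp):
    # Proportion of positive predictions that were actually correct
    return tp / (tp + fp)

def recall(tp, fn):
    # Proportion of actual positives that were identified correctly
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example confusion-matrix counts: 40 TP, 10 FP, 5 FN
print(precision(40, 10), recall(40, 5), f1_score(40, 10, 5))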
Gradient Descent:
You start with a random point on the function you’re trying to minimize, for example a random starting point on the mountain. Then, you calculate the gradient (slope) of the function at that point.

In the mountain analogy, this is like looking around you to find the steepest slope. Once you know the direction, you take a step downhill in that direction, and then you calculate the gradient again.

Repeat this process until you reach the bottom. The size of each step is determined by the learning rate. However, if the learning rate is too small, it might take a long time to reach the bottom. If it’s too large, you might overshoot the lowest point.

Finding the right balance is key to the success of the algorithm. One of the most appealing aspects of Gradient Descent is its generality. It can be applied to almost any function, especially those where an analytical solution is not feasible.

In traditional batch gradient descent, you calculate the gradient of the loss function with respect to the parameters for the entire training set. As you can imagine, for large datasets, this can be quite computationally intensive and time-consuming.

This is where SGD comes into play. Instead of using the entire dataset to calculate the gradient, SGD randomly selects just one data point (or a few data points) to compute the gradient in each iteration.

Think of this process as if you were again descending a mountain, but this time in thick fog with limited visibility. Rather than viewing the entire landscape to decide your next step, you make your decision based on where your foot lands next. This step is small and random, but it’s repeated many times, each time adjusting your path slightly in response to the immediate terrain under your feet.
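To make the batch-versus-stochastic distinction concrete, here is a tiny sketch contrasting the two gradient computations for a squared-error loss; the one-dimensional data and the function names are made up for illustration:

import random

# Model: predict y as w*x with a single parameter w; loss per point = (w*x - y)**2
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (x, y) pairs

def point_gradient(w, x, y):
    # Gradient of (w*x - y)**2 with respect to w, for a single data point
    return 2 * (w * x - y) * x

def batch_gradient(w, data):
    # Batch gradient descent averages the gradient over the entire training set
    return sum(point_gradient(w, x, y) for x, y in data) / len(data)

def stochastic_gradient(w, data):
    # SGD uses the gradient of one randomly selected data point only
    x, y = random.choice(data)
    return point_gradient(w, x, y)

w = 0.0
print(batch_gradient(w, data), stochastic_gradient(w, data))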
The Algorithm

1. Initialization
First, you initialize the parameters (weights) of your model. This can be done randomly or by some other initialization technique. The starting point for SGD is crucial as it influences the path the algorithm will take.

2. Random Selection
In each iteration of the training process, SGD randomly selects a single data point (or a small batch of data points) from the entire dataset. This randomness is what makes it ‘stochastic’.

3. Computing the Gradient
Calculate the gradient of the loss function, but only for the randomly selected data point(s). The gradient is a vector that points in the direction of the steepest increase of the loss function. In the context of SGD, it tells you how to tweak the parameters to make the model more accurate for that particular data point.

Gradient Formula:
∇f(x) = (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn)
Here, ∇f(x) represents the gradient of the loss function f(x) with respect to the parameters x. This gradient is a vector of partial derivatives, where each component of the vector is the partial derivative of the loss function with respect to the corresponding parameter in x.

4. Update the Parameters
Adjust the model parameters in the opposite direction of the gradient. Here’s where the learning rate η plays a crucial role. The formula for updating each parameter is:
x_new = x − η ∇f(x)
where:
x_new represents the updated parameters.
x represents the current parameters before the update.
η is the learning rate, a positive scalar determining the size of the step in the direction of the negative gradient.
∇f(x) is the gradient of the loss function f(x) with respect to the parameters x.
The learning rate determines the size of the steps you take towards the minimum. If it’s too small, the algorithm will be slow; if it’s too large, you might overshoot the minimum.

5. Repeat until Convergence
Repeat steps 2 to 4 for a set number of iterations or until the model performance stops improving. Each iteration provides a slightly updated model. Ideally, after many iterations, SGD converges to a set of parameters that minimize the loss function, although due to its stochastic nature, the path to convergence is not as smooth and may oscillate around the minimum.

Learning Rate Scheduling
Learning rate scheduling involves adjusting the learning rate over time. Common strategies include:
Time-Based Decay: The learning rate decreases over each update.
Step Decay: Reduce the learning rate by some factor after a certain number of epochs.
Exponential Decay: Decrease the learning rate exponentially.
Adaptive Learning Rate: Methods like AdaGrad, RMSProp, and Adam adjust the learning rate automatically during training.

Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD, since a high learning rate can cause the algorithm to overshoot the minimum, while a low learning rate can make the algorithm converge slowly.