ML3 Unit 4-3
1. Perceptron
Inputs are summed and passed through a nonlinear function to produce the output.
Single layer: a single-layer perceptron can learn only linearly separable patterns.
The Perceptron algorithm learns the weights for the input signals in order to draw a linear
decision boundary.
This enables you to distinguish between the two linearly separable classes +1 and -1.
Note: Supervised Learning is a type of Machine Learning used to learn models from labeled
training data. It enables output prediction for future or unseen data. Let us focus on the
Perceptron Learning Rule in the next section.
1.1. Perceptron Learning Rule
The Perceptron Learning Rule states that the algorithm automatically learns the optimal weight coefficients. The input features are then multiplied by these weights to determine whether the neuron fires or not.
The perceptron receives multiple input signals; if the weighted sum of the input signals exceeds a certain threshold, it outputs a signal, and otherwise it does not. In the context of supervised learning and classification, this can be used to predict the class of a sample.
A perceptron is a neural network unit (an artificial neuron) that performs certain computations to detect features in the input data.
A perceptron is a function that maps its input vector x, multiplied by the learned weight coefficients w, to an output value f(x):

f(x) = 1 if w · x + b > 0, else 0

where b is the bias, an element that shifts the decision boundary away from the origin without any dependence on the input values. The output can be represented as 1 or 0; it can also be represented as +1 or -1, depending on which activation function is used.
The perceptron is an algorithm for supervised learning of single-layer binary linear classifiers:
- Weights are multiplied by the input features, and a decision is made as to whether the neuron fires or not.
- The activation function applies a step rule to check whether the output of the weighting function is greater than zero.
- A linear decision boundary is drawn, enabling the distinction between the two linearly separable classes +1 and -1.
- If the sum of the input signals exceeds a certain threshold, it outputs a signal; otherwise, there is no output.
A perceptron is a neural network unit that performs a precise computation to detect features in the input data. The perceptron is mainly used to classify data into two parts; therefore, it is also known as a Linear Binary Classifier.
The perceptron uses a step function that returns +1 if the weighted sum of its inputs is greater than or equal to 0, and -1 otherwise. The activation function is used to map the input onto the required values, such as (0, 1) or (-1, 1).
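As an illustration of the learning rule above, here is a minimal perceptron training sketch in NumPy (the function name, data, and hyperparameters are illustrative, not from the notes):

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    # Learn weights w and bias b for labels y in {+1, -1}.
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            # Step activation: +1 if the weighted sum exceeds 0, else -1
            prediction = 1 if np.dot(w, xi) + b > 0 else -1
            # Update the weights only when the prediction is wrong
            if prediction != target:
                w += lr * target * xi
                b += lr * target
    return w, b

# Example: learn an AND-like linearly separable pattern
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print(w, b)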
2. GRADIENT DESCENT
2.1 WHAT IS GRADIENT DESCENT?
Imagine the image below illustrates our hill from a top-down view, with the red arrows showing the steps of our climber. Think of a gradient in this context as a vector that contains both the direction of the steepest step a blindfolded climber can take and how long that step should be.
Note that the gradient from X0 to X1 is much longer than the one from X3 to X4. This is because the steepness (slope) of the hill, which determines the length of the vector, is smaller there.
This matches the hill example, since the hill gets less steep the higher it is climbed. Therefore, a reduced gradient goes along with a reduced slope and a reduced step size for the hill climber.
Instead of climbing up a hill, think of gradient descent as hiking down to the bottom of a
valley. This is a better analogy because it is a minimization algorithm that minimizes a given
function.
The equation below describes what gradient descent does: b is the next position of our climber, while a represents his current position:

b = a - γ ∇f(a)

The minus sign refers to the minimization part of gradient descent, gamma (γ) is a weighting factor (the learning rate), and the gradient term ∇f(a) gives the direction of steepest ascent, so subtracting it moves us in the direction of steepest descent.
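A minimal sketch of this update rule in Python, applied to the simple function f(x) = x^2, whose gradient is 2x (the example and names here are illustrative, not from the notes):

# Minimal gradient descent sketch: repeatedly apply b = a - gamma * grad_f(a).
def gradient_descent(grad_f, start, gamma=0.1, steps=50):
    a = start
    for _ in range(steps):
        a = a - gamma * grad_f(a)  # step against the gradient
    return a

# Minimizing f(x) = x**2, whose gradient is 2x; the result approaches 0.
minimum = gradient_descent(lambda x: 2 * x, start=5.0)
print(minimum)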
For gradient descent to reach the local minimum, we must set the learning rate to an appropriate value, neither too low nor too high. This is important because if the steps it takes are too big, it may never reach the local minimum, bouncing back and forth across the valley of the convex function (see the left image below). If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but that may take a long time (see the right image).
So, the learning rate should never be too high or too low for this reason. You can check whether your learning rate is doing well by plotting the cost against the number of iterations.
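To make this trade-off concrete, here is an illustrative comparison of learning rates on f(x) = x^2, reusing the gradient_descent sketch from above (assumed to be in scope):

# Compare learning rates on f(x) = x**2 (gradient 2x).
for gamma in (1.1, 0.5, 0.001):
    result = gradient_descent(lambda x: 2 * x, start=5.0, gamma=gamma, steps=50)
    print(gamma, result)
# gamma = 1.1   -> diverges: each step overshoots and the iterates grow
# gamma = 0.5   -> converges in one step for this particular quadratic
# gamma = 0.001 -> converges, but very slowly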
There are three popular types of gradient descent, which mainly differ in the amount of data they use: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
2.4.1 BATCH GRADIENT DESCENT
Batch gradient descent computes the gradient over the entire training dataset and performs a single update per pass. Some advantages of batch gradient descent are that it is computationally efficient and produces a stable error gradient and stable convergence. Some disadvantages are that the stable error gradient can sometimes result in a state of convergence that isn't the best the model can achieve, and that it requires the entire training dataset to be in memory and available to the algorithm.
2.4.2 STOCHASTIC GRADIENT DESCENT
Stochastic gradient descent (SGD) updates the parameters for each training example, one at a time. These frequent updates, however, are more computationally expensive than the batch gradient descent approach. Additionally, the frequency of those updates can result in noisy gradients, which may cause the error rate to jump around instead of decreasing smoothly.
2.4.3 MINI-BATCH GRADIENT DESCENT
Mini-batch gradient descent is the go-to method since it’s a combination of the concepts of
SGD and batch gradient descent. It simply splits the training dataset into small batches and
performs an update for each of those batches. This creates a balance between the robustness
of stochastic gradient descent and the efficiency of batch gradient descent.
Common mini-batch sizes range between 50 and 256, but like any other machine learning
technique, there is no clear rule because it varies for different applications. This is the go-to
algorithm when training a neural network and it is the most common type of gradient descent
within deep learning.
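A rough sketch of mini-batch gradient descent for linear regression with mean squared error (the dataset, names, and hyperparameters are assumptions for illustration). Setting batch_size equal to the dataset size recovers batch gradient descent, and batch_size=1 recovers SGD:

import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.01, epochs=100):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient
            w -= lr * grad                        # one update per mini-batch
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)
print(minibatch_gd(X, y))  # approximately recovers w_true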
3. Delta Rule
3.1 What Does Delta Rule Mean?
The Delta Rule, in machine learning and neural network environments, is a specific type of back-propagation that helps refine connectionist ML/AI networks, making connections between inputs and outputs with layers of artificial neurons.
In general, back-propagation has to do with recalculating input weights for artificial neurons using a gradient method. Delta learning does this using the difference between a target activation and the actually obtained activation. Using a linear activation function, the network connections are adjusted.
Another way to explain the Delta rule is that it uses an error function to perform gradient
descent learning.
The Delta Rule is an interesting mechanism for searching the hypothesis space. In fact, the Delta Rule uses one of the most popular, if not the most popular, search techniques in the hypothesis space: Gradient Descent.
Using Gradient Descent, the Delta Rule strives to find the best-fitting model.
The Delta Rule uses gradient descent as an optimization technique: it tries different values for the weights in a neural network and, depending on how accurate the output of the network is (i.e., how close it is to the ground truth), makes adjustments to certain weights (i.e., increases some and decreases others). During training, it adjusts the weights in a way that makes the error of the output go down.
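As a sketch, the delta rule for a single linear unit can be written as Δw = η (t − o) x, where t is the target activation, o the obtained activation, and η the learning rate. A minimal illustrative implementation (names assumed, not from the notes):

import numpy as np

def delta_rule_epoch(X, t, w, eta=0.05):
    # One pass of delta-rule updates for a single linear unit.
    for xi, ti in zip(X, t):
        o = np.dot(w, xi)            # linear activation (actual output)
        w = w + eta * (ti - o) * xi  # move weights to reduce squared error
    return w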
Gradient descent is the very foundation of the back-propagation algorithm that helps us train neural networks.
4. Back-propagation algorithm
The algorithm is used to train a neural network effectively through the chain rule of calculus.
In simple terms, after each forward pass through a network, back-propagation performs a backward pass while adjusting the model's parameters (weights and biases).
The error is a measure of the difference between the actual output vector of the net and the desired output vector.
The ability to create useful new features distinguishes back-propagation from earlier, simpler methods.
In other words, back-propagation aims to minimize the cost function by adjusting the network's weights and biases. The level of adjustment is determined by the gradients of the cost function with respect to those parameters.
During training, the algorithm travels back from the output layer to the hidden layers, adjusting the weights such that the error is decreased.
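To make the backward pass concrete, here is an illustrative sketch for a tiny one-hidden-layer network with sigmoid units and squared error (all names, shapes, and the learning rate are assumptions, not from the notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, W1, W2, lr=0.5):
    # Forward pass
    h = sigmoid(W1 @ x)                           # hidden activations
    y = sigmoid(W2 @ h)                           # output activations
    # Backward pass: apply the chain rule layer by layer
    delta_out = (y - target) * y * (1 - y)        # output-layer error term
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # error propagated to hidden layer
    # Gradient-descent updates to the weights
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return W1, W2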
There are two types of back-propagation networks:
- Static back-propagation
- Recurrent back-propagation
Static back-propagation:
It is a kind of back-propagation network that produces a mapping from a static input to a static output. It is useful for solving static classification problems such as optical character recognition.
Recurrent back-propagation:
In recurrent back-propagation, the activations are fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.
5. Self-Organizing Map
A SOM has two layers: an input layer and an output layer. The architecture of a Self-Organizing Map with two clusters and n input features per sample is shown below:
Algorithm
Consider input data of size (m, n), where m is the number of training examples and n is the number of features in each example.
First, the algorithm initializes the weights of size (n, C), where C is the number of clusters.
Then, iterating over the input data, for each training example it updates the winning vector (the weight vector with the shortest distance, e.g. Euclidean distance, from the training example). The weight update rule is given by:

wij(new) = wij(old) + alpha(t) * (xik - wij(old))

where alpha(t) is the learning rate at iteration t; the update moves the winning vector toward the training example.
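A minimal sketch of this algorithm in Python (the learning-rate schedule, data, and random initialization are illustrative assumptions):

import numpy as np

def train_som(X, C=2, epochs=10, alpha0=0.5):
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, C))                 # weights of size (n, C)
    for t in range(epochs):
        alpha = alpha0 * (1 - t / epochs)  # decaying learning rate alpha(t)
        for x in X:
            # Winning vector: the column of W closest to x (Euclidean distance)
            j = np.argmin(np.linalg.norm(W - x[:, None], axis=0))
            # Move the winner toward the training example
            W[:, j] += alpha * (x - W[:, j])
    return W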