
ML3 Unit 4-3

The document provides information about artificial neural networks and the Perceptron algorithm. It discusses the characteristics of artificial neurons and how they are modeled after biological neurons. It then describes the Perceptron learning algorithm, including single layer and multilayer Perceptrons. The Perceptron learning rule is explained, which allows the algorithm to automatically learn optimal weight coefficients. Gradient descent is also summarized as an optimization algorithm for minimizing cost functions by taking steps proportional to the negative of the gradient. The document outlines batch, stochastic, and mini-batch gradient descent methods.

Uploaded by

ISHAN SRIVASTAVA
Copyright
© All Rights Reserved

UNIT 4

ARTIFICIAL NEURAL NETWORKS


The artificial neuron has the following characteristics:

 A neuron is a mathematical function modelled on the working of biological neurons

 It is an elementary unit in an artificial neural network

 One or more inputs are separately weighted

 Inputs are summed and passed through a nonlinear function to produce output

 Every neuron holds an internal state called activation signal

 Each connection link carries information about the input signal

 Every neuron is connected to another neuron via connection link
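
The characteristics listed above can be sketched in a few lines of Python. This is only an illustration: the sigmoid is one possible choice of nonlinear function, and the input values, weights and bias are made-up numbers, not values from these notes.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid nonlinearity."""
    # Each input is weighted separately, then everything is summed.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The sigmoid squashes the sum into (0, 1); the result is the activation signal.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs, weights and bias chosen only for illustration.
activation = neuron(inputs=[1.0, 0.5], weights=[0.4, -0.2], bias=0.1)
print(round(activation, 4))
```

The returned activation is what the neuron would pass along its connection links to the next neurons.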

1. Perceptron

A Perceptron is an algorithm for supervised learning of binary classifiers. This algorithm
enables neurons to learn and processes elements in the training set one at a time.

There are two types of Perceptron: Single layer and Multilayer.

 Single layer - A single layer perceptron can learn only linearly separable patterns

 Multilayer - A multilayer perceptron, or feed-forward neural network with two or more
layers, has greater processing power

The Perceptron algorithm learns the weights for the input signals in order to draw a linear
decision boundary.

This enables you to distinguish between the two linearly separable classes +1 and -1.

Note: Supervised Learning is a type of Machine Learning used to learn models from labeled
training data. It enables output prediction for future or unseen data. Let us focus on the
Perceptron Learning Rule in the next section.
1.1 Perceptron Learning Rule

The Perceptron Learning Rule states that the algorithm automatically learns the optimal
weight coefficients. The input features are then multiplied by these weights to determine
whether a neuron fires or not.

The Perceptron receives multiple input signals. If the sum of the input signals exceeds a
certain threshold, it outputs a signal; otherwise, it does not return an output. In the context of
supervised learning and classification, this can then be used to predict the class of a sample.

A perceptron is a neural network unit (an artificial neuron) that performs computations to
detect features in the input data. The following sections describe the Perceptron function and
its activation functions.

1.2 Perceptron Function

The Perceptron is a function that maps its input “x”, multiplied by the learned weight
coefficients, to an output value “f(x)”:

f(x) = 1 if w · x + b > 0, otherwise 0

where w · x = w1*x1 + w2*x2 + ... + wm*xm. In the equation given above:

 “w” = vector of real-valued weights

 “b” = bias (an element that adjusts the boundary away from origin without any
dependence on the input value)

 “x” = vector of input x values


 “m” = number of inputs to the Perceptron

The output can be represented as “1” or “0.” It can also be represented as “1” or “-1”
depending on which activation function is used.
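
As a rough illustration of the Perceptron function described in this section, the sketch below computes the weighted sum of the inputs plus the bias and applies a step rule with outputs 1 and 0. The weights and bias are hypothetical values picked by hand so that the unit behaves like a logical AND; they are not learned here.

```python
def perceptron_output(x, w, b):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if weighted_sum > 0 else 0

# Hypothetical weights and bias that make the unit act like a logical AND.
w, b = [1.0, 1.0], -1.5
outputs = [perceptron_output(x, w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 0, 0, 1]
```

Changing only the bias to -0.5 would turn the same unit into a logical OR, which shows how the bias shifts the decision boundary away from the origin.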

1.3 Perceptron has the following characteristics:

 Perceptron is an algorithm for Supervised Learning of single layer binary linear classifiers.

 Optimal weight coefficients are automatically learned.

 Weights are multiplied with the input features and decision is made if the neuron is fired
or not.

 Activation function applies a step rule to check if the output of the weighting function is
greater than zero.

 Linear decision boundary is drawn enabling the distinction between the two linearly
separable classes +1 and -1.

 If the sum of the input signals exceeds a certain threshold, it outputs a signal; otherwise,
there is no output.
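
A minimal sketch of how the weights could be learned automatically is given below. It assumes the classic perceptron update (add the learning rate times the prediction error times the input to each weight), uses the classes +1 and -1, and trains on a made-up, linearly separable dataset (logical AND). The learning rate and epoch limit are arbitrary choices.

```python
def predict(x, w, b):
    """Step rule: +1 if the weighted sum is greater than zero, else -1."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else -1

def train_perceptron(samples, labels, lr=0.1, epochs=100):
    """Process the training set one example at a time and correct mistakes."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            update = lr * (target - predict(x, w, b))  # zero when the prediction is right
            if update != 0.0:
                w = [wi + update * xi for wi, xi in zip(w, x)]
                b += update
                errors += 1
        if errors == 0:  # every example classified correctly, stop early
            break
    return w, b

# Hypothetical training data: logical AND with classes +1 and -1.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, -1, -1, 1]
w, b = train_perceptron(samples, labels)
predictions = [predict(x, w, b) for x in samples]
print(predictions)
```

Because the two classes are linearly separable, the loop stops once a full pass over the data produces no mistakes.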

1.4 Single Layer Perceptron

A perceptron is a neural network unit that performs a precise computation to detect features in the
input data. The Perceptron is mainly used to classify data into two classes. Therefore, it is also
known as a Linear Binary Classifier.
The Perceptron uses a step function that returns +1 if the weighted sum of its inputs is greater
than 0 and -1 otherwise.

The activation function is used to map the input to the required range of values, such as (0, 1) or (-1,
1).

1.5 Multi-Layer perceptron

 A Multi-Layer perceptron is a more complex architecture of artificial neural
networks than the single layer perceptron. It is formed from multiple layers of perceptrons.
 The diagrammatic representation of multi-layer perceptron learning is as shown
below −
 MLP networks are usually used in a supervised learning setting. A typical learning
algorithm for MLP networks is the back-propagation algorithm.
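
To illustrate why stacking layers adds processing power, the sketch below builds a tiny two-layer network of step units that computes XOR, a pattern that is not linearly separable and so cannot be represented by a single layer perceptron. The weights are hand-picked assumptions for illustration; nothing is learned by back-propagation here.

```python
def step(z):
    return 1 if z > 0 else 0

def unit(inputs, weights, bias):
    """One perceptron unit: weighted sum plus bias, then a step activation."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_mlp(x1, x2):
    # Hidden layer: one unit acts as OR, the other as NAND (weights chosen by hand).
    h_or = unit((x1, x2), (1.0, 1.0), -0.5)
    h_nand = unit((x1, x2), (-1.0, -1.0), 1.5)
    # Output layer: AND of the two hidden units gives XOR of the inputs.
    return unit((h_or, h_nand), (1.0, 1.0), -1.5)

xor_outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0, 1, 1, 0]
```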

2. GRADIENT DESCENT
2.1 WHAT IS GRADIENT DESCENT?

Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable
function. In machine learning, gradient descent is used to find the values of a
function's parameters (coefficients) that minimize a cost function as far as possible.
A gradient measures the change in the error with regard to a change in each of the weights. You
can also think of a gradient as the slope of a function. The higher the gradient, the steeper the
slope and the faster a model can learn. But if the slope is zero, the model stops learning. In
mathematical terms, a gradient is the vector of partial derivatives of a function with respect to its inputs.
Imagine a blindfolded man who wants to climb to the top of a hill in as few steps as
possible. He might start climbing the hill by taking really big steps in the steepest
direction, which he can do as long as he is not close to the top. As he comes closer to the top,
however, his steps will get smaller and smaller to avoid overshooting it. This process can be
described mathematically using the gradient.

Imagine the image below illustrates our hill from a top-down view and the red arrows are the
steps of our climber. Think of a gradient in this context as a vector that contains the direction
of the steepest step the blindfolded man can take and also how long that step should be.
Note that the gradient ranging from X0 to X1 is much longer than the one reaching from X3
to X4. This is because the steepness/slope of the hill, which determines the length of the
vector, is less between X3 and X4 than between X0 and X1.

This perfectly represents the example of the hill because the hill is getting less steep the
higher it's climbed. Therefore a reduced gradient goes along with a reduced slope and a
reduced step size for the hill climber.
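
The idea that the gradient is a vector of partial derivatives whose length shrinks as the slope flattens can be checked numerically. The sketch below is an assumption-laden illustration: it approximates the partial derivatives by central differences, and the surface f(x, y) = x^2 + y^2 is a made-up example, not the hill from the figure.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate each partial derivative of f at point by central differences."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad

def bowl(p):
    # Hypothetical example surface: f(x, y) = x^2 + y^2
    return p[0] ** 2 + p[1] ** 2

far = numerical_gradient(bowl, [3.0, 4.0])   # steep region, far from the flat point
near = numerical_gradient(bowl, [0.3, 0.4])  # flatter region, close to it
print(far, near)
```

The gradient at the far point is roughly ten times longer than at the near point, matching the shrinking arrows described above.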

2.2 How Gradient Descent works?

Instead of climbing up a hill, think of gradient descent as hiking down to the bottom of a
valley. This is a better analogy because it is a minimization algorithm that minimizes a given
function.

The equation below describes what gradient descent does: b is the next position of our
climber, while a represents his current position. The minus sign refers to the minimization
part of gradient descent. The gamma (γ) in the middle is a weighting factor (the learning rate) and the gradient term
( ∇f(a) ) points in the direction of the steepest ascent, so subtracting it moves in the direction of the steepest descent.

b = a - γ ∇f(a)
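
A minimal sketch of this update, assuming a one-dimensional function whose gradient is known in closed form, might look as follows. The function f(x) = (x - 3)^2, the starting point, the value of gamma and the number of steps are all illustrative choices.

```python
def gradient_descent(grad, start, gamma=0.1, steps=100):
    """Repeat the update b = a - gamma * grad(a) a fixed number of times."""
    a = start
    for _ in range(steps):
        b = a - gamma * grad(a)  # next position from the current position
        a = b
    return a

# Hypothetical example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

Each step shrinks the distance to the minimum by a constant factor here, so the steps get smaller as the slope flattens.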

2.3 Importance of the Learning Rate

For gradient descent to reach the local minimum we must set the learning rate to an
appropriate value, which is neither too low nor too high. This is important because if the steps
it takes are too big, it may not reach the local minimum because it bounces back and forth
across the convex function (see left image below). If we set the learning
rate to a very small value, gradient descent will eventually reach the local minimum but that
may take a while (see the right image).

So, the learning rate should be neither too high nor too low. You can check whether
your learning rate is doing well by plotting the value of the cost function against the number of iterations.
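
The effect of the learning rate can be seen in a small experiment. The sketch below assumes the cost function f(x) = x^2 and three arbitrary learning rates; it records the cost after every step, which is the kind of curve one would plot to judge a learning rate.

```python
def run(gamma, steps=50, start=5.0):
    """Gradient descent on f(x) = x^2; returns the cost after each step."""
    x = start
    costs = []
    for _ in range(steps):
        x = x - gamma * 2 * x  # the gradient of x^2 is 2x
        costs.append(x * x)
    return costs

too_small = run(gamma=0.001)   # converges, but very slowly
reasonable = run(gamma=0.1)    # converges quickly
too_large = run(gamma=1.1)     # overshoots further on every step and diverges

for name, costs in [("too small", too_small), ("reasonable", reasonable), ("too large", too_large)]:
    print(name, costs[-1])
```

With the largest rate the cost grows instead of shrinking, which corresponds to bouncing back and forth across the valley with ever larger jumps.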

2.4 Types of Gradient Descent

There are three popular types of gradient descent that mainly differ in the amount of data they
use:

2.4.1 BATCH GRADIENT DESCENT


Batch gradient descent, also called vanilla gradient descent, calculates the error for each
example within the training dataset, but only after all training examples have been evaluated
does the model get updated. One such pass over the whole training dataset is called a training epoch.

Some advantages of batch gradient descent are that it is computationally efficient and that it produces a
stable error gradient and a stable convergence. One disadvantage is that the stable error
gradient can sometimes result in a state of convergence that isn’t the best the model can
achieve. It also requires the entire training dataset to be in memory and available to the
algorithm.

2.4.2 STOCHASTIC GRADIENT DESCENT


By contrast, stochastic gradient descent (SGD) calculates the error and updates the parameters
for each training example within the dataset, one example at a time. Depending
on the problem, this can make SGD faster than batch gradient descent. One advantage is that the
frequent updates give us a fairly detailed picture of the rate of improvement.

The frequent updates, however, are more computationally expensive than the batch gradient
descent approach. Additionally, the frequency of those updates can result in noisy gradients,
which may cause the error rate to jump around instead of slowly decreasing.
2.4.3 MINI-BATCH GRADIENT DESCENT
Mini-batch gradient descent is the go-to method since it’s a combination of the concepts of
SGD and batch gradient descent. It simply splits the training dataset into small batches and
performs an update for each of those batches. This creates a balance between the robustness
of stochastic gradient descent and the efficiency of batch gradient descent.

Common mini-batch sizes range between 50 and 256, but like any other machine learning
technique, there is no clear rule because it varies for different applications. This is the go-to
algorithm when training a neural network and it is the most common type of gradient descent
within deep learning.
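
The three variants differ only in how many examples feed each parameter update, which the sketch below expresses through a single batch_size argument. It is an illustration under simplifying assumptions: a one-parameter model y = w * x, mean squared error as the cost, and noise-free made-up data, so all three variants end up at the same weight.

```python
import random

def train(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) over the batch with respect to w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # one update per batch
    return w

# Hypothetical noise-free data generated by y = 2x.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [2 * x for x in xs]

w_batch = train(xs, ys, batch_size=len(xs))  # batch: one update per epoch
w_sgd = train(xs, ys, batch_size=1)          # stochastic: one update per example
w_mini = train(xs, ys, batch_size=4)         # mini-batch: one update per small batch
print(w_batch, w_sgd, w_mini)
```

On real, noisy data the per-example updates of SGD would make the weight jump around more than the batch version, which is the trade-off mini-batches are meant to balance.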

3. Delta Rule
3.1 What Does Delta Rule Mean?

The Delta rule in machine learning and neural network environments is a specific type of
back propagation that helps to refine connectionist ML/AI networks, making connections
between inputs and outputs with layers of artificial neurons.

The Delta rule is also known as the Delta learning rule.

In general, backpropagation has to do with recalculating input weights for artificial neurons
using a gradient method. Delta learning does this using the difference between a target
activation and an actual obtained activation. Using a linear activation function, network
connections are adjusted.

Another way to explain the Delta rule is that it uses an error function to perform gradient
descent learning.

The Delta Rule is a mechanism for searching the hypothesis space. It uses Gradient Descent, one of the most
popular search techniques in the hypothesis space.

Using Gradient Descent, the Delta Rule strives to find the best-fitting model.
The Delta Rule uses gradient descent as an optimization technique and tries different
values for the weights in a neural network. Depending on how accurate the output of the
network is (i.e., how close to the ground truth), it makes adjustments to certain
weights (i.e., increases some and decreases others). It tries to increase and decrease the
weights in such a way that the error of the output goes down during training.
Gradient descent is the very foundation of the back-propagation algorithm that helps
us train neural networks.
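
A minimal sketch of the Delta rule for a single unit with a linear activation is shown below. It assumes the common form of the update, in which each weight changes by the learning rate times the difference between target and obtained activation times the corresponding input. The data, learning rate and number of passes are made up for illustration.

```python
def delta_rule_epoch(samples, targets, w, eta=0.05):
    """One pass of the delta rule for a single linear unit (no bias term)."""
    for x, t in zip(samples, targets):
        o = sum(wi * xi for wi, xi in zip(w, x))  # obtained (linear) activation
        # Adjust each weight in proportion to the error (t - o) and its input.
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical data generated by t = 1*x1 + 2*x2.
samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
targets = [1.0, 2.0, 3.0, 4.0]

w = [0.0, 0.0]
for _ in range(500):
    w = delta_rule_epoch(samples, targets, w)
print(w)
```

The weights approach the values that generated the targets, since each update is a small gradient descent step on the squared error of one example.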

4. Back-propagation algorithm

 The back-propagation algorithm is probably the most fundamental building block in a
neural network.
 The algorithm trains a neural network effectively by applying the chain rule of calculus.
 In simple terms, after each forward pass through a network, back-propagation
performs a backward pass while adjusting the model’s parameters (weights and
biases).
 It repeatedly adjusts the weights of the connections in the network so as to minimize a
measure of the difference between the actual output vector of the net and the desired
output vector.
 The ability to create useful new features distinguishes back-propagation from earlier,
simpler methods.
 In other words, back-propagation aims to minimize the cost function by adjusting the
network’s weights and biases. The level of adjustment is determined by the gradients
of the cost function with respect to those parameters.


How Back-propagation Algorithm Works
The back-propagation algorithm in a neural network computes the gradient of
the loss function with respect to each weight using the chain rule. It efficiently computes
the gradient one layer at a time, unlike a naive direct computation. It computes the
gradient, but it does not define how the gradient is used. It generalizes the
computation in the delta rule.

Consider the following back-propagation neural network example diagram
to understand:

1. Inputs X arrive through the preconnected path.
2. The input is modelled using real weights W. The weights are usually
randomly selected.
3. Calculate the output for every neuron from the input layer, to the
hidden layers, to the output layer.
4. Calculate the error in the outputs:
Error = Actual Output – Desired Output
5. Travel back from the output layer to the hidden layer to adjust the
weights such that the error is decreased.
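
The steps above can be sketched on a deliberately tiny network with one input, one sigmoid hidden unit and one linear output unit. Everything concrete here is an assumption made for illustration: the squared-error loss, the starting weights, the learning rate and the single training example. Gradient descent is used to apply the gradients that back-propagation computes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, p):
    h = sigmoid(p["w1"] * x + p["b1"])  # hidden layer
    y = p["w2"] * h + p["b2"]           # output layer (linear)
    return h, y

def backprop_step(x, target, p, lr=0.5):
    h, y = forward(x, p)                       # forward pass
    error = y - target                         # Error = Actual Output - Desired Output
    # Backward pass: chain rule from the output layer back to the hidden layer.
    grad_w2 = error * h
    grad_b2 = error
    delta_h = error * p["w2"] * h * (1.0 - h)  # the sigmoid derivative is h * (1 - h)
    grad_w1 = delta_h * x
    grad_b1 = delta_h
    new_p = {
        "w1": p["w1"] - lr * grad_w1,
        "b1": p["b1"] - lr * grad_b1,
        "w2": p["w2"] - lr * grad_w2,
        "b2": p["b2"] - lr * grad_b2,
    }
    return new_p, 0.5 * error ** 2             # updated parameters and the loss before the update

params = {"w1": 0.5, "b1": 0.0, "w2": -0.3, "b2": 0.1}  # arbitrary starting weights
losses = []
for _ in range(200):
    params, loss = backprop_step(x=1.0, target=0.8, p=params)
    losses.append(loss)
print(losses[0], losses[-1])
```

The loss recorded at each step falls as the backward pass repeatedly nudges the weights and biases against their gradients.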

Why We Need Back-propagation?


Most prominent advantages of Back-propagation are:
 Back-propagation is fast, simple and easy to program
 It has no parameters to tune apart from the numbers of input
 It is a flexible method as it does not require prior knowledge about
the network
 It is a standard method that generally works well
 It does not need any special mention of the features of the function to
be learned.

Types of Back-propagation Networks


Two Types of Back-propagation Networks are:

 Static Back-propagation
 Recurrent Back-propagation

Static back-propagation:
It is one kind of back-propagation network which produces a mapping of a
static input for static output. It is useful to solve static classification issues
like optical character recognition.

Recurrent Back-propagation:
In recurrent back-propagation, activations are fed forward until a fixed value
is achieved. After that, the error is computed and propagated backward.

Disadvantages of using Back-propagation


 The actual performance of backpropagation on a specific problem is
dependent on the input data.
 Back propagation algorithm in data mining can be quite sensitive to
noisy data
 You need to use the matrix-based approach for backpropagation
instead of mini-batch.

Self-Organizing Map

A Self-Organizing Map (or Kohonen Map or SOM) is a type of Artificial
Neural Network inspired by biological models of neural
systems from the 1970s.
It follows an unsupervised learning approach and trains its network
through a competitive learning algorithm.

SOM is used for clustering and mapping (or dimensionality reduction):
it maps multidimensional data onto a lower-dimensional space,
which allows people to reduce complex problems for easier
interpretation.

SOM has two layers: the Input layer and the
Output layer.

The architecture of the Self Organizing Map with two clusters and n
input features of any sample is given below:

Algorithm

The steps involved are:

1. Weight initialization
2. For 1 to N number of epochs
3. Select a training example
4. Compute the winning vector
5. Update the winning vector
6. Repeat steps 3, 4, 5 for all training examples.
7. Clustering the test sample

How SOM works?

Suppose the input data has size (m, n), where m is the number of training
examples and n is the number of features in each example.
First, the algorithm initializes weights of size (n, C), where C is the number of clusters.
Then, iterating over the input data, for each training example it updates the
winning vector (the weight vector with the shortest distance, e.g. Euclidean
distance, from the training example). The weight updating rule is given by:
wij = wij(old) + alpha(t) * (xik - wij(old))

where alpha is the learning rate at time t, j denotes the winning vector, i
denotes the ith feature of the training example and k denotes the kth training
example from the input data. The plus sign moves the winning vector toward the training example.
After training the SOM network, the trained weights are used for clustering new
examples. A new example falls in the cluster of its winning vector.
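
The training loop described in this section can be sketched as below. The sketch makes several assumptions that are not in the notes: the weights are stored as a list of C vectors of length n (the transpose of the (n, C) layout), alpha is halved after every epoch, the update moves the winning vector toward the training example, and the data and starting weights are made up. Only the winning vector is updated, as in the steps above.

```python
def winner_index(weights, x):
    """Index of the weight vector closest to x (squared Euclidean distance)."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))

def train_som(data, weights, alpha=0.5, epochs=10):
    weights = [list(w) for w in weights]
    for _ in range(epochs):
        for x in data:
            j = winner_index(weights, x)
            # Move only the winning vector toward the training example.
            weights[j] = [w + alpha * (xi - w) for w, xi in zip(weights[j], x)]
        alpha *= 0.5  # assumed decay schedule for alpha(t)
    return weights

# Hypothetical data with two obvious groups, n = 2 features and C = 2 clusters.
data = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
initial = [(0.2, 0.2), (0.8, 0.8)]  # hypothetical starting weights
weights = train_som(data, initial)
clusters = [winner_index(weights, x) for x in data]
print(clusters)  # [0, 0, 1, 1]
```

A new example would be assigned to a cluster the same way, by calling winner_index with the trained weights.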
