Unit 5 Learning
"Learning denotes changes in the system that are adaptive in the sense that they enable the
system to do the same task (or tasks drawn from the same population) more effectively the next
time." -- Herbert Simon
Types of Learning:
The strategies for learning can be classified according to the amount of inference the system has
to perform on its training data. In increasing order we have
1. Rote learning – the new knowledge is implanted directly with no inference at all, e.g. simple
memorization of past events, or a knowledge engineer’s direct programming of rules elicited
from a human expert into an expert system.
2. Supervised learning – the system is supplied with a set of training examples consisting of
inputs and corresponding outputs, and is required to discover the relation or mapping between
them, e.g. as a series of rules, or a neural network.
3. Unsupervised learning – the system is supplied with a set of training examples consisting
only of inputs and is required to discover for itself what appropriate outputs should be, e.g. a
Kohonen Network or Self Organizing Map.
Early expert systems relied on rote learning, but for modern AI systems we are generally
interested in the supervised learning of various levels of rules.
As with many other types of AI system, it is much more efficient to give the system enough
knowledge to get it started, and then leave it to learn the rest for itself. We may even end up with
a system that learns to be better than a human expert.
The general learning approach is to generate potential improvements, test them, and discard
those which do not work. Naturally, there are many ways we might generate the potential
improvements, and many ways we can test their usefulness. At one extreme, there are model
driven (top-down) generators of potential improvements, guided by an understanding of how the
problem domain works. At the other, there are data driven (bottom-up) generators, guided by
patterns in some set of training data.
As regards machines, we might say, very broadly, that a machine learns whenever it changes its
structure, program, or data (based on its inputs or in response to external information) in such a
manner that its expected future performance improves. Some of these changes, such as the
addition of a record to a database, fall comfortably within the province of other disciplines and
are not necessarily better understood for being called learning. But, for example, when the
performance of a speech-recognition machine improves after hearing several samples of a
person's speech, we feel quite justified in saying that the machine has learned.
Machine learning usually refers to the changes in systems that perform tasks associated
with artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot
control, prediction, etc. The changes might be either enhancements to already performing
systems or synthesis of new systems.
Supervised Learning
In supervised learning, you use input/output pairs (labeled data) to train the machine. You show
the algorithm both input variables (x) and an output variable (Y) and then have the algorithm
infer the mapping function from the input to the output.
The main goal here is getting the machine to produce a function that approximates the true
mapping well enough to predict outputs for new, unseen inputs.
Supervised learning problems can be further grouped into regression problems and classification
problems.
A regression problem is when the output is a real-valued, continuous quantity (e.g. a price or a
temperature), whereas a classification problem, as the name suggests, is when the outputs are
categories (e.g. spam or not spam).
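As a concrete illustration, here is a minimal sketch contrasting the two problem types; it
assumes scikit-learn is available, and the tiny dataset is invented purely for illustration:

# Minimal sketch contrasting regression and classification.
# Assumes scikit-learn is installed; the toy data is invented for illustration.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]

# Regression: the target is a continuous quantity (e.g. a price).
y_continuous = [1.5, 3.1, 4.4, 6.2]
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[5]]))   # a real-valued prediction

# Classification: the target is a category (e.g. spam / not spam).
y_categorical = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y_categorical)
print(clf.predict([[5]]))   # a class label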
Minimizing any function means finding the deepest valley in that function. Keep in mind that
the cost function is used to monitor the error in the predictions of an ML model, so minimizing
it basically means getting to the lowest error value possible, or equivalently increasing the
accuracy of the model. Gradient descent is best used when the parameters cannot be calculated
analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
A random position on the surface of the bowl is the cost of the current values of the coefficients
(cost).
The bottom of the bowl is the cost of the best set of coefficients, the minimum of the function.
The goal is to continue to try different values for the coefficients, evaluate their cost and select
new coefficients that have a slightly better (lower) cost.
Repeating this process enough times will lead to the bottom of the bowl and you will know the
values of the coefficients that result in the minimum cost.
The procedure starts with an initial value for the coefficient, e.g. zero or a small random value:
coefficient = 0.0
The cost of the coefficients is evaluated by plugging them into the function and calculating the
cost:
cost = f(coefficient)
or
cost = evaluate(f(coefficient))
The derivative of the cost is calculated. The derivative is a concept from calculus and refers to
the slope of the function at a given point. We need to know the slope so that we know the
direction (sign) to move the coefficient values in order to get a lower cost on the next iteration.
delta = derivative(cost)
Now that we know from the derivative which direction is downhill, we can update the
coefficient values. A learning rate parameter (alpha) must be specified that controls how much
the coefficients can change on each update:
coefficient = coefficient - (alpha * delta)
This process is repeated until the cost of the coefficients (cost) is 0.0 or close enough to zero to
be good enough.
You can see how simple gradient descent is. It does require you to know the gradient of your
cost function or the function you are optimizing, but besides that, it’s very straightforward. Next
we will see how we can use this in machine learning algorithms.
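To make the procedure concrete, here is a minimal Python sketch of gradient descent on a
one-coefficient problem. The cost function f(x) = x^2, the starting point, and the learning rate
are illustrative choices, not part of the text above:

# Gradient descent on a single coefficient.
# The cost f(x) = x**2 (minimum at x = 0) and alpha are illustrative choices.

def cost(x):
    return x ** 2

def derivative(x):
    # Slope of the cost function at x
    return 2 * x

coefficient = 5.0   # arbitrary starting value
alpha = 0.1         # learning rate: how far to step on each update

for i in range(50):
    delta = derivative(coefficient)            # which way is downhill?
    coefficient = coefficient - alpha * delta  # step downhill
print(coefficient, cost(coefficient))          # both approach 0.0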
The LMS algorithm was introduced by Widrow and Hoff in 1959. It has several names,
including the Widrow-Hoff rule and the Delta rule. LMS is an example of a supervised learning
algorithm in neural networks, similar to the perceptron learning algorithm. The perceptron
learning algorithm trains the perceptron until it correctly classifies every example of the training
set, but LMS uses a different termination criterion: instead of training the perceptron until a
solution is found, training continues while the Mean-Square Error (MSE) is greater than a
certain value. This is the basis for the LMS algorithm.
In order to train the perceptron using LMS, we iterate over the training set, taking a set of
inputs, computing the output, and then using the error to adjust the weights. This can be done
either by selecting examples at random or by taking each example of the set in succession. The
learning rule of LMS is given as:
w_i = w_i + ρ * E * x_i, where E = (R − C)
The learning rule adjusts each weight based on the error E (R − C, or expected output minus
actual output). Once the error is calculated, the weights are adjusted by a small amount (the
learning rate ρ) in the direction of the input x_i. This has the effect of adjusting the weights to
reduce the output error.
The implementation of LMS is very simple. Initially, the weights vector is initialized with small
random weights. The main loop then randomly selects a training example, calculates the output
of the neuron, and then calculates the error. Using the error, the learning rule is applied to
each weight in the vector. The loop then checks the MSE to see if it has reached an acceptable
value, and if so, exits and emits the computed truth table for the neuron.
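As a rough sketch of that loop (the bipolar AND truth table, the learning rate rho, and the MSE
threshold are illustrative assumptions, not given in the text above):

# A minimal LMS (Widrow-Hoff) sketch for a single linear neuron.
# The bipolar AND data, rho, and the MSE threshold are illustrative choices.
import random

# Training set: (inputs including a bias input, expected output R)
tests = [([ 1,  1, 1],  1),
         ([ 1, -1, 1], -1),
         ([-1,  1, 1], -1),
         ([-1, -1, 1], -1)]

weights = [random.uniform(-0.5, 0.5) for _ in range(3)]
rho = 0.05  # learning rate

def output(x):
    # Linear combination of inputs and weights
    return sum(w * xi for w, xi in zip(weights, x))

while True:
    x, R = random.choice(tests)    # randomly select a training example
    C = output(x)                  # actual output
    error = R - C                  # expected minus actual
    for i in range(len(weights)):  # w_i = w_i + rho * (R - C) * x_i
        weights[i] += rho * error * x[i]
    mse = sum((R - output(x)) ** 2 for x, R in tests) / len(tests)
    if mse < 0.3:                  # a linear neuron cannot reach zero error
        break                      # on AND, so the threshold is tolerant

# Emit the computed truth table (sign of the linear output).
for x, R in tests:
    print(x[:2], "->", 1 if output(x) > 0 else -1, "(target", R, ")")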
The least mean square algorithm uses a technique called the "method of steepest descent" and
continuously estimates results by updating filter weights. Through the principle of algorithm
convergence, the least mean square algorithm provides particular learning curves useful in
machine learning theory and implementation. Many of these ideas are part of dedicated work on
refining machine learning models, matching inputs to outputs, making training and test processes
more effective, and generally pursuing "convergence", where the iterative learning process
resolves into a coherent (reasonable or logical) final result instead of getting off track.
As the algorithm's name implies, the errors (and therefore the learning) propagate backwards
from the output nodes to the inner nodes. So technically speaking, back propagation is used to
calculate the gradient of the error of the network with respect to the network's modifiable
weights. This gradient is almost always then used in a simple stochastic gradient descent
algorithm, a general optimization algorithm that is typically used to fit the parameters of a
machine learning model by finding weights that minimize the error. Often the term "back
propagation" is used in a more general sense, to refer to the entire procedure encompassing both
the calculation of the gradient and its use in stochastic gradient descent. Back propagation
usually allows quick convergence on satisfactory local minima for error in the kind of networks
to which it is suited.
Back propagation networks are necessarily multilayer perceptrons (usually with one input, one
hidden, and one output layer). In order for the hidden layer to serve any useful function,
multilayer networks must have non-linear activation functions for the multiple layers: a
multilayer network using only linear activation functions is equivalent to some single layer,
linear network.
The back propagation algorithm provides a computationally efficient method for training multi-
layer networks. In outline:
1. Present a training sample to the network.
2. Compare the network's output to the desired output for that sample, and calculate the error
in each output neuron.
3. For each output neuron, calculate how much higher or lower its output must be to match the
desired output. This is the local error.
4. Adjust the weights of each neuron to lower its local error.
5. Assign "blame" for the local error to neurons at the previous level, giving greater
responsibility to neurons connected by stronger weights.
6. Repeat from step 3 on the neurons at the previous level, using each one's "blame" as its error.
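To make these steps concrete, here is a minimal back propagation sketch: a small network with
one hidden layer learning XOR. The architecture, random seed, learning rate, and epoch count
are illustrative choices, and convergence depends on the random initialization:

# Back propagation on a 2-3-1 sigmoid network learning XOR.
# Architecture and hyperparameters are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR training data: inputs X, targets T
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 3)); b1 = np.zeros(3)   # input -> hidden
W2 = rng.normal(0, 1, (3, 1)); b2 = np.zeros(1)   # hidden -> output
alpha = 0.5                                        # learning rate

for epoch in range(10000):
    H = sigmoid(X @ W1 + b1)          # forward pass: hidden layer
    Y = sigmoid(H @ W2 + b2)          # forward pass: output layer
    dY = (Y - T) * Y * (1 - Y)        # output "blame": error * slope
    dH = (dY @ W2.T) * H * (1 - H)    # blame propagated backwards
    W2 -= alpha * H.T @ dY; b2 -= alpha * dY.sum(axis=0)
    W1 -= alpha * X.T @ dH; b1 -= alpha * dH.sum(axis=0)

print(np.round(Y.ravel(), 2))   # should approach [0, 1, 1, 0]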
Because one iteration of the gradient descent algorithm requires a prediction for each instance in
the training dataset, it can take a long time when you have many millions of instances.
In situations when you have large amounts of data, you can use a variation of gradient descent
called stochastic gradient descent.
In this variation, the gradient descent procedure described above is run but the update to the
coefficients is performed for each training instance, rather than at the end of the batch of
instances.
The first step of the procedure requires that the order of the training dataset is randomized. This
is to mix up the order in which updates are made to the coefficients. Because the coefficients are
updated after every training instance, the updates will be noisy, jumping all over the place, and so
will the corresponding cost function.
The update procedure for the coefficients is the same as that above, except the cost is not
summed over all training patterns, but instead calculated for one training pattern.
The learning can be much faster with stochastic gradient descent for very large training
datasets and often you only need a small number of passes through the dataset to reach a good or
good enough set of coefficients, e.g. 1-to-10 passes through the dataset.
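A minimal sketch of stochastic gradient descent for simple linear regression (the toy data and
hyperparameters are illustrative choices); note that the coefficients are updated after every single
instance, not at the end of the batch:

# Stochastic gradient descent for simple linear regression: y ≈ b0 + b1*x.
# The toy data and hyperparameters are illustrative choices.
import random

data = [(1, 1.2), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1)]
b0, b1 = 0.0, 0.0
alpha = 0.01

for epoch in range(100):
    random.shuffle(data)        # randomize the order on each pass
    for x, y in data:
        pred = b0 + b1 * x
        error = pred - y        # update after EVERY instance,
        b0 -= alpha * error     # not at the end of the batch
        b1 -= alpha * error * x

print(b0, b1)   # should approach an intercept near 0 and a slope near 1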
Stochastic gradient descent is a popular algorithm for training a wide range of models in machine
learning, including (linear) support vector machines, logistic regression and graphical
models. When combined with the backpropagation algorithm, it is the de facto standard
algorithm for training artificial neural networks.
Unsupervised learning
In unsupervised learning, an AI system is presented with unlabeled, uncategorized data and the
system's algorithms act on the data without prior training. The output is dependent upon the
coded algorithms. Subjecting a system to unsupervised learning is one way of testing AI.
Unsupervised learning algorithms can perform more complex processing tasks than supervised
learning systems. However, unsupervised learning can be more unpredictable than the
supervised alternative.
Hebbian Learning:
The oldest and most famous of all learning rules is Hebb’s postulate of learning:
"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes
part in firing it, some growth process or metabolic change takes place in one or both cells such
that A's efficiency, as one of the cells firing B, is increased."
From the point of view of artificial neurons and artificial neural networks, Hebb's principle can
be described as a method of determining how to alter the weights between model neurons. The
weight between two neurons increases if the two neurons activate simultaneously—and reduces
if they activate separately. Nodes that tend to be either both positive or both negative at the same
time have strong positive weights, while those that tend to be opposite have strong negative
weights.
Hebb’s Algorithm:
1. Initialize all weights and the bias to zero.
2. For each training pair (input x, target t), update each weight as w_i = w_i + x_i * t, and the
bias as b = b + t.
The algorithm can be illustrated on the bipolar AND function, whose training set is:

X1    X2    Bias    Target
 1     1     1        1
 1    -1     1       -1
-1     1     1       -1
-1    -1     1       -1
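A minimal sketch of Hebb training on this table (the sign activation used to check the result is
an illustrative choice):

# Hebb-rule training on the bipolar AND data above.
# One pass over the four patterns; sign() serves as the activation.

patterns = [([ 1,  1],  1),
            ([ 1, -1], -1),
            ([-1,  1], -1),
            ([-1, -1], -1)]

w = [0.0, 0.0]   # weights, initialized to zero
b = 0.0          # bias weight

for x, t in patterns:
    # Hebb rule: strengthen weights when input and target agree
    w[0] += x[0] * t
    w[1] += x[1] * t
    b += t

for x, t in patterns:
    net = w[0] * x[0] + w[1] * x[1] + b
    out = 1 if net > 0 else -1
    print(x, "->", out, "(target", t, ")")
print("weights:", w, "bias:", b)   # ends with w = [2, 2], b = -2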
Competitive learning
Competitive learning is a form of unsupervised learning in artificial Neural Networks. The nodes
compete for the right to respond to a subset of the input data. Competitive learning works by
increasing the specialization of each node in the network. It is well suited to finding clusters
within data.
There are three basic elements to a competitive learning rule:
A set of neurons that are all the same, except for some randomly distributed synaptic
weights, and which therefore respond differently to a given set of input patterns.
A limit imposed on the "strength" of each neuron.
A mechanism that permits the neurons to compete for the right to respond to a given
subset of inputs, such that only one output neuron (or only one neuron per group) is
active (i.e. "on") at a time. The neuron that wins the competition is called a "winner-
take-all" neuron.
By recoding sets of correlated inputs to one of a few output neurons, competitive networks
essentially remove the redundancy in the representation.
Competitive learning is usually implemented with neural networks that contain a hidden layer
commonly known as the "competitive layer".
For every input vector, the competitive neurons "compete" with each other to see which one of
them is the most similar to that particular input vector. The winner neuron m sets its output to
O_m = 1, while all the other competitive neurons set their outputs to 0.
Usually, in order to measure similarity, the inverse of the Euclidean distance between the input
vector and the neuron's weight vector is used.
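A minimal winner-take-all sketch of this process (the two-cluster toy data, the number of
neurons, and the learning rate eta are illustrative assumptions):

# Winner-take-all competitive learning: two neurons cluster 2-D points.
# The toy data and learning rate are illustrative choices.
import random
import math

points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),   # cluster A
          (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]   # cluster B

weights = [[random.random(), random.random()] for _ in range(2)]
eta = 0.1   # learning rate

for _ in range(100):
    x = random.choice(points)
    # Competition: the neuron closest to x wins (output 1, all others 0)
    m = min(range(len(weights)), key=lambda j: math.dist(weights[j], x))
    # Only the winner's weights move toward the input
    weights[m][0] += eta * (x[0] - weights[m][0])
    weights[m][1] += eta * (x[1] - weights[m][1])

print(weights)   # each weight vector should settle near one cluster centre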
Genetic Algorithm
Algorithm:
1. Initialize a population of candidate solutions at random.
2. Evaluate the fitness of each individual.
3. Select the fitter individuals as parents.
4. Produce offspring by crossover (recombining parents) and mutation (random changes).
5. Replace the population with the offspring and repeat from step 2 until a good enough
solution is found.
Advantages:
GA works well when the problem does not have any mathematical model for the solution.
Disadvantages:
GA is less efficient in terms of speed of convergence.
GA has a tendency to get stuck in local maxima rather than reaching the global maximum.
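As an illustration, here is a minimal GA sketch on the classic OneMax problem (maximize the
number of 1-bits in a string); the problem, population size, and rates are illustrative choices:

# A minimal genetic algorithm on OneMax. All parameters are illustrative.
import random

LENGTH, POP, GENS, MUT = 20, 30, 50, 0.02

def fitness(ind):
    return sum(ind)                 # number of 1-bits

def select(pop):
    # Tournament selection: pick two at random, keep the fitter one
    a, b = random.sample(pop, 2)
    return a if fitness(a) > fitness(b) else b

def crossover(a, b):
    point = random.randrange(1, LENGTH)   # single-point crossover
    return a[:point] + b[point:]

def mutate(ind):
    return [1 - g if random.random() < MUT else g for g in ind]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]

for gen in range(GENS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print(fitness(best), best)   # fitness should approach LENGTH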
Reinforcement Learning
In reinforcement learning, an agent learns by interacting with its environment and receiving
rewards or punishments for its actions.
Example: We have an agent and a reward, with many hurdles in between. The agent is supposed
to find the best possible path to reach the reward. Consider a robot, a diamond, and fire: the goal
of the robot is to get the reward, the diamond, while avoiding the hurdles, the fire. The robot
learns by trying all the possible paths and then choosing the path which gives it the reward with
the fewest hurdles. Each right step gives the robot a reward and each wrong step subtracts from
the robot's reward. The total reward is calculated when it reaches the final reward, the diamond.
1. Input: The input should be an initial state from which the model will start.
2. Output: There are many possible outputs, as there are a variety of solutions to a particular
problem.
3. Training: The training is based upon the input; the model will return a state and the user
will decide whether to reward or punish the model based on its output.
4. The model continues to learn.
5. The best solution is decided based on the maximum reward.
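A minimal sketch of this idea using Q-learning, one common reinforcement learning algorithm;
the 1-D track of states, the rewards, and the hyperparameters are illustrative assumptions, not
part of the example above:

# Q-learning on a 1-D track: the agent learns the path to the "diamond".
# States, rewards, and hyperparameters are illustrative choices.
import random

N_STATES = 6          # state 5 holds the reward (the "diamond")
ACTIONS = [-1, +1]    # step left / step right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: Q[(s, a)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else -0.1   # reward, or small penalty
        # Q-learning update: move Q toward reward + discounted future value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                              - Q[(s, a)])
        s = s2

print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])
# should print all +1 steps: the learned path heads straight to the reward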
Advantages:
Maximizes performance.
Sustains change for a long period of time.
Disadvantages:
Too much reinforcement can lead to an overload of states, which can diminish the results.
Increases behavior.
Provides defiance to a minimum standard of performance.