CS3491-Artificial Intelligence and Machine Learning

UNIT III: ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING (9)
Combining multiple learners: Model combination schemes, Voting, Ensemble Learning - bagging, boosting, stacking; Unsupervised learning: K-means, Instance Based Learning: KNN, Gaussian mixture models and Expectation maximization.

At the end of the course, students will be able to
CO4: Build ensembling and unsupervised models

1. Ensembling Techniques

Ensemble Learning refers to the use of ML algorithms jointly to solve classification and/or regression problems. These algorithms can be of the same type (homogeneous ensemble learning) or of different types (heterogeneous ensemble learning). Ensemble learning performs a strategic combination of various experts or ML models in order to improve the effectiveness obtained using a single weak model.

1. The idea of ensemble learning methods is to select a whole collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions.
2. For example, we might generate a hundred different decision trees from the same training set and have them vote on the best classification for a new example.
3. The motivation for ensemble learning is simple.
4. Consider an ensemble of M = 5 hypotheses and suppose that we combine their predictions using simple majority voting.
5. For the ensemble to misclassify a new example, at least three of the five hypotheses have to misclassify it.
6. The hope is that this is much less likely than a misclassification by a single hypothesis.
7. Suppose we assume that each hypothesis hi in the ensemble has an error of p; that is, the probability that a randomly chosen example is misclassified by hi is p.
8. Furthermore, suppose we assume that the errors made by each hypothesis are independent.
9. In that case, if p is small, then the probability of a large number of misclassifications occurring is minuscule.
10. But if the hypotheses are at least a little bit different, thereby reducing the correlation between their errors, then ensemble learning can be very useful.

Figure 1: Illustration of the increased expressive power obtained by ensemble learning. We take three linear threshold hypotheses, each of which classifies positively on the non-shaded side, and classify as positive any example classified positively by all three. The resulting triangular region is a hypothesis not expressible in the original hypothesis space.

There are different types of ensemble learning techniques, which differ mainly by the type of models used (homogeneous or heterogeneous

models), the data sampling (with or without replacement, k-fold, etc.) and the decision function (voting, average, meta-model, etc.). Therefore, ensemble learning techniques can be classified as:
• Voting
• Bagging
• Boosting
• Stacking

Figure 1: Ensemble learning view

Which components to combine?
• Different learning algorithms
• Same learning algorithm trained in different ways
• Same learning algorithm trained the same way

There are two steps in ensemble learning:
1. Multiple machine learning models are generated using the same or different machine learning algorithms. These are called "base models".
2. The prediction is performed on the basis of the base models.

Techniques/Methods in ensemble learning
Voting, Error-Correcting Output Codes, Bagging: Random Forest Trees, Boosting: AdaBoost, Stacking. There are also different ways the multiple base-learners are combined to generate the final output.

Figure 2: General Idea - Combining Multiple Learners

Multiexpert combination
Multiexpert combination methods have base-learners that work in parallel. These methods can in turn be divided into two:

• In the global approach, also called learner fusion, given an input, all base-learners generate an output and all these outputs are

used. Examples are voting and stacking.

• In the local approach, or learner selection, for example in mixture of experts, there is a gating model, which looks at the input and chooses one (or very few) of the learners as responsible for generating the output.

Multistage combination
Multistage combination methods use a serial approach where the next base-learner is trained with or tested on only the instances where the previous base-learners are not accurate enough. The idea is that the base-learners (or the different representations they use) are sorted in increasing complexity so that a complex base-learner is not used (or its complex representation is not extracted) unless the preceding simpler base-learners are not confident. An example is cascading.

Let us say that we have L base-learners. We denote by dj(x) the prediction of base-learner Mj given the arbitrary dimensional input x. In the case of multiple representations, each Mj uses a different input representation xj. The final prediction is calculated from the predictions of the base-learners:

y = f(d1, d2, ..., dL | Φ)

where f(·) is the combining function with Φ denoting its parameters.

Figure 1: Base-learners are dj and their outputs are combined using f(·). This is for a single output; in the case of classification, each base-learner has K outputs that are separately used to calculate yi, and then we choose the maximum. Note that here all learners observe the same input; it may be the case that different learners observe different representations of the same input object or event.

When there are K outputs, for each learner there are dji(x), i = 1,...,K, j = 1,...,L, and, combining them, we also generate K values, yi, i = 1,...,K. Then, for example in classification, we choose the class Ci with the maximum yi value.

1.1 Voting
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear combination of the learners; refer to the figure of base-learners above. This is also known as ensembles and linear opinion pools. In

the simplest case, all learners are given equal weight and we have simple voting that corresponds to taking an average. Still, taking a (weighted) sum is only one of the possibilities and there are also other combination rules, as shown in Table 1. If the outputs are not posterior probabilities, these rules require that the outputs be normalized to the same scale.

Table 1: Classifier combination rules

An example of the use of these rules is shown in Table 2, which demonstrates the effects of different rules. The sum rule is the most intuitive and is the most widely used in practice. The median rule is more robust to outliers; the minimum and maximum rules are pessimistic and optimistic, respectively. With the product rule, each learner has veto power: regardless of the other ones, if one learner has an output of 0, the overall output goes to 0. Note that after the combination rules, the yi do not necessarily sum up to 1.

Table 2: Example of combination rules on three learners and three classes

In the weighted sum, yi = Σj wj dji, where dji is the vote of learner j for class Ci and wj is the weight of its vote. Simple voting is a special case where all voters have equal weight, namely wj = 1/L. In classification, this is called plurality voting, where the class having the maximum number of votes is the winner. When there are two classes, this is majority voting, where the winning class gets more than half of the votes. If the voters can also supply the additional information of how much they vote for each class (e.g., by the posterior probability), then after normalization these can be used as weights in a weighted voting scheme. Equivalently, if the dji are the class posterior probabilities, P(Ci | x, Mj), then we can just sum them up (wj = 1/L) and choose the class with maximum yi.

In the case of regression, simple or weighted averaging or the median can be used to fuse the outputs of the base-regressors. The median is more robust to noise than the average.

Another possible way to find wj is to assess the accuracies of the learners (regressor or classifier) on a separate validation set and use that information to compute the weights, so that we give more weight to more accurate learners.

Voting schemes can be seen as approximations under a Bayesian framework, with weights approximating prior model probabilities and model decisions approximating model-conditional likelihoods. Simple voting corresponds to a uniform prior. If we have a prior distribution preferring simpler models, this would give larger weights to them. We cannot integrate over all models; we only choose a subset for which we believe P(Mj) is high, or we can have another Bayesian step, calculate P(Ci | x, Mj), the probability of a model given the sample, and sample highly probable models from this density.

Let us assume that the dj are i.i.d. with expected value E[dj] and variance Var(dj). Then, when we take a simple average with wj = 1/L, the expected value and variance of the output are

E[y] = E[(1/L) Σj dj] = E[dj],    Var(y) = Var((1/L) Σj dj) = Var(dj)/L,

so the expected value does not change, while the variance (and hence the error) decreases as the number of independent voters L increases.

1.2 Bagging

Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble using a randomly drawn subset of the training set. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.

The simplest method of combining classifiers is known as bagging, which stands for bootstrap aggregating, the statistical description of the method. This is fine if you know what a bootstrap is, but fairly useless if you don't. A bootstrap sample is a sample taken from the original dataset with replacement, so that we may get some data points several times and others not at all. The bootstrap sample is the same size as the original, and lots and lots of these samples are taken: B of them, where B is at least 50 and could even be in the thousands. The name bootstrap is more popular in computer science than anywhere else, since there is also a bootstrap loader, which is the first program to run when a computer is turned on. It comes from the nonsensical idea of 'picking yourself up by your bootstraps,' which means lifting yourself up by your shoelaces, and is meant to imply starting from nothing.

Bootstrap sampling seems like a very strange thing to do. We've taken a perfectly good dataset and mucked it up by sampling from it, which might be good if we had made a smaller dataset (since it would be faster), but we still ended up with a dataset of the same size. Worse, we've done it lots of times. Surely this is just a way to burn up computer time without gaining anything? The benefit is that we will get lots of learners that

perform slightly differently, which is exactly what we want for an ensemble method. Another benefit is that estimates of the accuracy of the classification function can be made without complicated analytic work, by throwing computer resources at the problem (technically, bagging is a variance-reducing algorithm; the meaning of this will become clearer when we talk about bias and variance). Having taken a set of bootstrap samples, the bagging method simply requires that we fit a model to each dataset, and then combine them by taking the output to be the majority vote of all the classifiers. A NumPy implementation is shown next, and then we will look at a simple example.

import numpy as np

# Compute bootstrap samples: an (nPoints x nSamples) array of indices drawn with replacement
samplePoints = np.random.randint(0, nPoints, (nPoints, nSamples))
classifiers = []

for i in range(nSamples):
    sample = []
    sampleTarget = []
    for j in range(nPoints):
        sample.append(data[samplePoints[j, i]])
        sampleTarget.append(targets[samplePoints[j, i]])
    # Train a classifier (here, a decision tree) on each bootstrap sample
    classifiers.append(self.tree.make_tree(sample, sampleTarget, features))

The example consists of taking the party data that was used to demonstrate the decision tree, and restricting the trees to stumps, so that they can make a classification based on just one variable. When we want to construct the decision tree to decide what to do in the evening, we start by listing everything that we have done for the past few days to get a suitable dataset (here, the last ten days).

The output of a decision tree that uses the whole dataset for this is not surprising: it takes the two largest classes and separates them. However, using just stumps of trees and 20 samples, bagging can separate the data perfectly.
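The fragment above comes from inside a larger bagging class and relies on its own decision-tree learner (self.tree), so it is not runnable on its own. As a self-contained sketch of the same idea, the following assumes scikit-learn is available, bags decision stumps on random illustrative data, and combines them by majority vote; it is not the party-data example from the text.

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumption: scikit-learn is available

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # illustrative data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

nSamples = 20                                    # number of bootstrap samples (B)
stumps = []
for _ in range(nSamples):
    idx = rng.integers(0, len(X), len(X))        # bootstrap: indices drawn with replacement
    stump = DecisionTreeClassifier(max_depth=1)  # restrict each tree to a stump
    stumps.append(stump.fit(X[idx], y[idx]))

# Majority vote over all the bagged stumps
votes = np.array([s.predict(X) for s in stumps])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the bagged ensemble:", (majority == y).mean())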

1.3 Boosting

• Boosting: train the next learner on the mistakes made by the previous learner(s).

In bagging, generating complementary base-learners is left to chance and to the instability of the learning method. In boosting, we actively try to generate complementary base-learners by training the next learner on the mistakes of the previous learners. The original boosting algorithm combines three weak learners to generate a strong learner. A weak learner has error probability less than 1/2, which makes it better than random guessing on a two-class problem, and a strong learner has arbitrarily small error probability.

Original Boosting Concept

Given a large training set, we randomly divide it into three. We use X1 and train d1. We then take X2 and feed it to d1. We take all instances misclassified by d1 and also as many instances on which d1 is correct from X2, and these together form the training set of d2. We then take X3 and feed it to d1 and d2. The instances on which d1 and d2 disagree form the training set of d3. During testing, given an instance, we give it to d1 and d2; if they agree, that is the response, otherwise the response of d3 is taken as the output.

1. Split data X into {X1, X2, X3}
2. Train d1 on X1
   • Test d1 on X2
3. Train d2 on d1's mistakes on X2 (plus some instances it got right)
   • Test d1 and d2 on X3
4. Train d3 on the disagreements between d1 and d2
   • Testing: apply d1 and d2; if they disagree, use d3
   • Drawback: needs a large X

This overall system has a reduced error rate, and the error rate can be arbitrarily reduced by using such systems recursively, that is, a boosting system of three models used as dj in a higher system.

Though it is quite successful, the disadvantage of the original boosting method is that it requires a very large training sample. The sample should be divided into three, and furthermore, the second and third classifiers are only trained on a subset on which the previous ones err. So unless one has a quite large training set, d2 and d3 will not have training sets of reasonable size.

AdaBoost

Freund and Schapire (1996) proposed a variant, named AdaBoost, short for adaptive boosting, that uses the same training set over and over and thus need not be large, but the classifiers should be simple so that they do not overfit. AdaBoost can also combine an arbitrary number of base-learners, not just three.

The idea is to modify the probabilities of drawing the instances as a function of the error. Let us say p_j^t denotes the probability that the instance pair (x^t, r^t) is drawn to train the jth base-learner. Initially, all p_1^t = 1/N. Then we add new base-learners as follows, starting from j = 1, where Єj denotes the error rate of dj. AdaBoost requires that the learners are weak, that is, Єj < 1/2 for all j; if not, we stop adding new base-learners. Note that this error rate is not on the original problem but on the dataset used at step j. This has the effect that dj+1 focuses more on the instances misclassified by dj; that is why the base-learners are chosen to be simple and not accurate, since otherwise the next training sample would contain only a few outlier and noisy instances repeated many times over. For example, with decision trees, decision stumps, which are trees grown only one or two levels, are used. It is clear that these have bias, but the decrease in variance is larger and the overall error decreases. An algorithm like the linear discriminant has low variance, and we cannot gain by AdaBoosting linear discriminants.
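The AdaBoost update rule itself is not reproduced above, so the following is only a sketch of the commonly used discrete AdaBoost with decision stumps as the weak learners; scikit-learn is an assumed dependency, and the weighting alpha = 0.5*ln((1 - eps)/eps) is the standard choice rather than a formula quoted from this text.

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumption: scikit-learn stumps as weak learners

def adaboost_fit(X, y, n_rounds=10):
    """Sketch of discrete AdaBoost; y must contain labels -1/+1."""
    N = len(X)
    p = np.full(N, 1.0 / N)              # instance weights p_j^t, initially 1/N
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=p)
        pred = stump.predict(X)
        eps = p[pred != y].sum()          # weighted error rate on the current weights
        if eps >= 0.5:                    # the learner is no longer weak: stop adding learners
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        p *= np.exp(-alpha * y * pred)    # increase the weight of misclassified instances
        p /= p.sum()                      # renormalize to a probability distribution
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    votes = sum(a * l.predict(X) for a, l in zip(alphas, learners))
    return np.sign(votes)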

1. Another way to think about the ensemble idea is as a generic way of enlarging the hypothesis space.
2. That is, think of the ensemble itself as a hypothesis and the new hypothesis space as the set of all possible ensembles constructible from hypotheses in the original space.
3. Figure 1 shows how this can result in a more expressive hypothesis space.
4. If the original hypothesis space allows for a simple and efficient learning algorithm, then the ensemble method provides a way to learn a much more expressive class of hypotheses without incurring much additional computational or algorithmic complexity.
5. The most widely used ensemble method is called boosting.
6. To understand how it works, we first need to explain the idea of a weighted training set.
7. In such a training set, each example has an associated weight wj > 0.
8. The higher the weight of an example, the higher is the importance attached to it during the learning of a hypothesis.
9. It is straightforward to modify the learning algorithms we have seen so far to operate with weighted training sets.
10. Boosting starts with wj = 1 for all the examples (i.e., a normal training set).
11. From this set, it generates the first hypothesis, h1.
12. This hypothesis will classify some of the training examples correctly and some incorrectly.
13. We would like the next hypothesis to do better on the misclassified examples, so we increase their weights while decreasing the weights of the correctly classified examples.
14. From this new weighted training set, we generate hypothesis h2. The process continues in this way until we have generated M hypotheses, where M is an input to the boosting algorithm.
15. The final ensemble hypothesis is a weighted-majority combination of all M hypotheses, each weighted according to how well it performed on the training set.
16. Figure 2 shows how the algorithm works conceptually.

Figure 2: How the boosting algorithm works. Each shaded rectangle corresponds to an example; the height of the rectangle corresponds to the weight. The ticks and crosses indicate whether the example was classified correctly by the current hypothesis. The size of the decision tree indicates the weight of that hypothesis in the final ensemble.

1. There are many variants of the basic boosting idea, with different ways of adjusting the weights and combining the hypotheses.
2. One specific algorithm, called ADABOOST, is shown in Figure 3.
3. ADABOOST has a very important property: if the input learning algorithm L is a weak learning algorithm, which means that L always returns a hypothesis with weighted error on the training set that is slightly better than random guessing (i.e., 50% for Boolean classification), then ADABOOST will return a hypothesis that classifies the training data perfectly for large enough M.
4. Thus, the algorithm boosts the accuracy of the original learning algorithm on the training data.
5. This result holds no matter how inexpressive the original hypothesis space and no matter how complex the function being learned.

Figure 3: The ADABOOST variant of the boosting method for ensemble learning. The algorithm generates hypotheses by successively reweighting the training examples. The function WEIGHTED-MAJORITY generates a hypothesis that returns the output value with the highest vote from the hypotheses in h, with votes weighted by z.
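The ADABOOST pseudocode of Figure 3 is not reproduced here, but the WEIGHTED-MAJORITY combination it returns can be sketched as follows; treating each hypothesis as a callable and z as a list of numeric weights is an assumption made for illustration, not the textbook's own pseudocode.

from collections import Counter

def weighted_majority(h, z):
    """Return a hypothesis whose output is the value with the highest total
    weighted vote among the hypotheses in h, with votes weighted by z."""
    def ensemble_hypothesis(x):
        votes = Counter()
        for hypothesis, weight in zip(h, z):
            votes[hypothesis(x)] += weight
        return votes.most_common(1)[0][0]
    return ensemble_hypothesis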

1.4 Stacking - Stacked Generalization

Stacked generalization is a technique proposed by Wolpert (1992) that extends voting in that the way the output of the base-learners is combined need not be linear but is learned through a combiner system, f(·|Φ), which is another learner whose parameters Φ are also trained (see the figure below).

Figure: In stacked generalization, the combiner is another learner and is not restricted to being a linear combination as in voting.

y = f(d1, d2, ..., dL | Φ)

The combiner learns what the correct output is when the base-learners give a certain output combination. We cannot train the combiner function on the training data because the base-learners may be memorizing the training set; the combiner system should actually learn how the base-learners make errors. Stacking is a means of estimating and correcting for the biases of the base-learners. Therefore, the combiner should be trained on data unused in training the base-learners.

If f(·|w1, ..., wL) is a linear model with the constraints wj ≥ 0, Σj wj = 1, the optimal weights can be found by constrained regression, but of course we do not need to enforce this; in stacking, there is no restriction on the combiner function and, unlike voting, f(·) can be nonlinear. For example, it may be implemented as a multilayer perceptron with Φ its connection weights.

The outputs of the base-learners dj define a new L-dimensional space in which the output discriminant/regression function is learned by the combiner function.

In stacked generalization, we would like the base-learners to be as different as possible so that they will complement each other, and, for this, it is best if they are based on different learning algorithms. If we are combining classifiers that can generate continuous outputs, for example posterior probabilities, it is better that these continuous outputs be combined rather than hard decisions.

When we compare a trained combiner, as we have in stacking, with a fixed rule such as in voting, we see that both have their advantages: a trained rule is more flexible and may have less bias, but it adds extra parameters, risks introducing variance, and needs extra time and data for training. Note also that there is no need to normalize classifier outputs before stacking.
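A minimal sketch of stacked generalization along these lines; the particular base-learners, the logistic-regression combiner, and the 50/50 hold-out split are illustrative assumptions (scikit-learn is an assumed dependency), chosen so that the combiner is trained on data unused by the base-learners, as required above.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # illustrative data

# Hold out part of the data so the combiner is trained on data unused by the base-learners
X_base, X_comb, y_base, y_comb = train_test_split(X, y, test_size=0.5, random_state=0)

base_learners = [DecisionTreeClassifier(max_depth=3).fit(X_base, y_base),
                 KNeighborsClassifier(n_neighbors=5).fit(X_base, y_base)]

# The base-learner outputs d1,...,dL form the new L-dimensional input of the combiner f(.|Phi)
D_comb = np.column_stack([m.predict_proba(X_comb)[:, 1] for m in base_learners])
combiner = LogisticRegression().fit(D_comb, y_comb)

def stacked_predict(X_new):
    D_new = np.column_stack([m.predict_proba(X_new)[:, 1] for m in base_learners])
    return combiner.predict(D_new)

Training the combiner on predicted probabilities rather than hard class labels follows the advice above that continuous outputs combine better than hard decisions.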

Unsupervised Learning

2. K-Means Algorithm

• The K-means method is a centroid-based technique.
• It is an unsupervised machine learning algorithm.
• The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters,
• so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low.
• Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

K-Means Algorithm
The algorithm works in the following way:
Step 1: Choose the number of clusters (K) and randomly initialize the K cluster centroids.
Step 2: Assign each data point to the closest centroid.
Step 3: Recalculate the centroids as the mean of all the data points assigned to that centroid.
Step 4: Repeat steps 2 and 3 until the centroids no longer change significantly or a specified number of iterations has been reached.
Step 5: The output of the K-means algorithm is a set of K centroids and a partition of the data into K clusters.

• Criterion function

E = Σ (i = 1 to k) Σ (p ∈ Ci) |p − mi|²

where
• E is the sum of the square error for all objects in the data set;
• p is the point in space representing a given object;
• mi is the mean of cluster Ci (both p and mi are multidimensional).
• For each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed.
• This criterion tries to make the resulting k clusters as compact and as separate as possible.

K-Means Clustering Method:
(1) Arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change.

K-Means Cluster Example

• Initialization: We randomly choose the following two centroids (k = 2) for the two clusters. In this case, the two centroids are K1 = (1.0, 1.0) and K2 = (3.0, 4.0).

Cluster   Var1   Var2
K1        1.0    1.0
K2        3.0    4.0

• Calculate the Euclidean distance using the equation
  Distance[(x1, y1), (x2, y2)] = √((x2 − x1)² + (y2 − y1)²)

• Distance of point (1.0, 1.0) from K1 (1.0, 1.0) = √((1.0 − 1.0)² + (1.0 − 1.0)²) = 0
• Distance of point (1.0, 1.0) from K2 (3.0, 4.0) = √((3.0 − 1.0)² + (4.0 − 1.0)²) = 3.61, so (1.0, 1.0) is assigned to cluster K1
• Distance of point (3.0, 4.0) from K2 (3.0, 4.0) = √((3.0 − 3.0)² + (4.0 − 4.0)²) = 0, so (3.0, 4.0) stays in cluster K2

• Calculate the Euclidean distance for the next data point (1.5, 2.0):
• Distance from cluster K1 = √((1.5 − 1.0)² + (2.0 − 1.0)²) = 1.12
• Distance from cluster K2 = √((1.5 − 3.0)² + (2.0 − 4.0)²) = 2.5

Data point   Distance to Cluster 1   Distance to Cluster 2   Assignment
(1.5, 2.0)   1.12                    2.5                     K1

• Update the cluster centroid

Cluster   Var1                    Var2
K1        (1.0 + 1.5)/2 = 1.25    (1.0 + 2.0)/2 = 1.5
K2        3.0                     4.0

Each remaining data point is processed in the same way: it is assigned to the nearest centroid, and that cluster's centroid is then updated.

• Update the cluster centroid (after the point (5.0, 7.0) is assigned to K2)

Cluster   Var1                   Var2
K1        1.25                   1.5
K2        (3.0 + 5.0)/2 = 4.0    (4.0 + 7.0)/2 = 5.5

• Update the cluster centroid (after the point (3.5, 5.0) is assigned to K2)

Cluster   Var1                          Var2
K1        1.25                          1.5
K2        (3.0 + 5.0 + 3.5)/3 = 3.83    (4.0 + 7.0 + 5.0)/3 = 5.33

• Update the cluster centroid (after the point (4.5, 5.0) is assigned to K2)

Cluster   Var1                                Var2
K1        1.25                                1.5
K2        (3.0 + 5.0 + 3.5 + 4.5)/4 = 4.00    (4.0 + 7.0 + 5.0 + 5.0)/4 = 5.25

• Update the cluster centroid (after the point (3.5, 4.5) is assigned to K2)

Cluster   Var1                                      Var2
K1        1.25                                      1.5
K2        (3.0 + 5.0 + 3.5 + 4.5 + 3.5)/5 = 3.9     (4.0 + 7.0 + 5.0 + 5.0 + 4.5)/5 = 5.1

The final centroids are therefore K1 = (1.25, 1.5) and K2 = (3.9, 5.1).
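A short NumPy sketch of the batch K-means procedure applied to this example; the list of data points is inferred from the centroid updates shown above rather than copied from the original table, so treat it as an assumption.

import numpy as np

# Data points inferred from the worked example's centroid updates (assumption)
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centroids = np.array([[1.0, 1.0], [3.0, 4.0]])   # initial K1 and K2

for _ in range(100):
    # (Re)assign each point to the nearest centroid (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update each centroid to the mean of the points assigned to it
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):    # stop when nothing changes
        break
    centroids = new_centroids

print(centroids)   # approximately [[1.25, 1.5], [3.9, 5.1]]

The printed centroids match the final values obtained in the worked example above.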

3. Instance Based Learning: KNN

o K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.

o The K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for regression as well as for classification, but it is mostly used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into a category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data that are similar to the cat and dog images and, based on the most similar features, it will put it in either the cat or the dog category.

Voronoi Diagram

Properties of a Voronoi Diagram:
1) All possible points within a sample's Voronoi cell are the nearest neighboring points for that sample.
2) For any sample, the nearest sample is determined by the closest Voronoi cell edge.

Distance Metrics

KNN Algorithm

The working of K-NN can be explained on the basis of the algorithm below:

Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance from the new data point to the training points.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these K neighbours, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
Step-6: Our model is ready.

How to select the value of K in the K-NN algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K = 1 or K = 2, can be noisy and lead to the effects of outliers in the model.
o Large values for K are good, but the algorithm may face some difficulties.

Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:
o It always needs to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points for all the training samples.
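A short NumPy sketch of Steps 1-5 above for a two-class problem; the training data, the query point, and K = 5 are illustrative values only.

import numpy as np

def knn_classify(X_train, y_train, x_new, K=5):
    # Step 2: Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: take the K nearest neighbours
    nearest = np.argsort(dists)[:K]
    # Steps 4-5: count the neighbours in each category and pick the majority
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[counts.argmax()]

rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 2))
y_train = (X_train[:, 0] > 0).astype(int)        # illustrative two-class labels
print(knn_classify(X_train, y_train, np.array([0.3, -0.2]), K=5))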
4. Gaussian Mixture Model

In real life, many datasets can be modeled by a Gaussian distribution (univariate or multivariate). So it is quite natural and intuitive to assume that the clusters come from different Gaussian distributions. In other words, the model tries to describe the dataset as a mixture of several Gaussian distributions. This is the core idea of this model.

In one dimension, the probability density function of a Gaussian distribution is given by

G(x | μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

where μ and σ² are respectively the mean and variance of the distribution. For a multivariate (let us say d-variate) Gaussian distribution, the probability density function is given by

G(X | μ, Σ) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(−½ (X − μ)ᵀ Σ⁻¹ (X − μ))
Gaussian Mixture Model
Suppose there are K clusters (for the sake of simplicity, it is assumed here that the number of clusters is known, and it is K), so μk and Σk are also estimated for each k. Had it been only one distribution, they would have been estimated by the maximum-likelihood method. But since there are K such clusters, the probability density is defined as a linear function of the densities of all these K distributions, i.e.

p(X) = Σ (k = 1 to K) πk G(X | μk, Σk)

where πk is the mixing coefficient for the kth distribution. The parameters πk, μk and Σk are estimated by the maximum log-likelihood method, i.e. by maximising

log p(X | μ, Σ, π) = Σ (n = 1 to N) log Σ (k = 1 to K) πk G(Xn | μk, Σk).

Figure 4 shows two examples, where the data (shown by the histograms) comes from two different Gaussians, and the model is computed as a sum or mixture of the two Gaussians together.

FIGURE 4: Histograms of training data from a mixture of two Gaussians and two fitted models, shown as the line plot. The model shown on the left fits well, but the one on the right produces two Gaussians right on top of each other that do not fit the data well.
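A small sketch that evaluates the univariate Gaussian density and the mixture density p(x) = Σk πk G(x | μk, σk²) defined above; the two-component parameter values are illustrative assumptions.

import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def mixture_pdf(x, pi, mu, sigma2):
    """p(x) = sum_k pi_k * G(x | mu_k, sigma2_k)."""
    return sum(p * gaussian_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma2))

# Two-component example (illustrative parameters)
print(mixture_pdf(1.0, pi=[0.4, 0.6], mu=[0.0, 3.0], sigma2=[1.0, 2.0]))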

5. EM Algorithm

• Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning. For example, medical records often include the observed symptoms, the physician's diagnosis, the treatment applied, and perhaps the outcome of the treatment.
• Hidden variables are important, but they do complicate the learning problem. In Figure 20.10(a), for example, it is not obvious how to learn the conditional distribution for HeartDisease, given its parents, because we do not know the value of HeartDisease in each case; the same problem arises in learning the distributions for the symptoms.

The common approach is to aim for the maximum likelihood solution (the likelihood is the conditional probability of the data given the model, and the maximum likelihood solution varies the model to maximise this conditional probability). In fact, it is common to compute the log likelihood and then to maximise that; it is guaranteed to be negative, since probabilities are all less than 1, and the logarithm spreads out the values, making the optimisation more effective. So it can be clearly seen that the parameters cannot be estimated in closed form. The algorithm that is used is an example of a very general one known as the expectation-maximisation (or, more compactly, EM) algorithm.

• The basic idea of the EM algorithm is that sometimes it is easier to add extra variables that are not actually known (called hidden or latent variables) and then to maximise the function over those variables.

• The Expectation-Maximization (EM) algorithm is defined as the combination of various unsupervised machine learning algorithms, which is used to determine the local maximum likelihood estimates (MLE) or maximum a posteriori estimates (MAP) for unobservable variables in statistical models.
• Further, it is a technique to find the maximum likelihood estimate when latent variables are present. It is also referred to as the latent variable model.
• A latent variable model consists of both observable and unobservable variables, where the observable variables can be predicted while the unobserved ones are inferred from the observed variables. These unobservable variables are known as latent variables.
• The EM algorithm is the combination of various unsupervised ML algorithms, such as the k-means clustering algorithm. Being an iterative approach, it consists of two modes. In the first mode, we estimate the missing or latent variables; hence it is referred to as the expectation/estimation step (E-step).
• The other mode is used to optimize the parameters of the model so that it can explain the data more clearly. The second mode is known as the maximization step (M-step).
• This might seem to be making the problem much more complicated than it needs to be, but it turns out that for many problems it makes finding the solution significantly easier.
• In order to see how it works, we will consider the simplest interesting case of the Gaussian mixture model: a combination of just two Gaussians.
• The assumption now is that each data point was generated by picking one of the two Gaussians and then drawing a sample from that Gaussian. If the probability of picking Gaussian one is p, then the entire model looks like this (where N(μ, σ²) specifies a Gaussian distribution with mean μ and standard deviation σ):

  y = p · N(μ1, σ1²) + (1 − p) · N(μ2, σ2²)

• Computing the value of this expectation (the expected log-likelihood, with the unknown Gaussian assignments treated as latent variables) is known as the E-step. This estimate of the expectation is then maximised over the model parameters (the parameters of the two Gaussians and the mixing parameter π); this is the M-step.
• This requires differentiating the expectation with respect to each of the model parameters. These two steps are simply iterated until the algorithm converges. Note that the estimate never gets any smaller, and it turns out that EM algorithms are guaranteed to reach a local maximum.

• The trick in applying EM algorithms to problems is in identifying the correct latent variables to include, and then simply working through the steps.
• EM algorithms are very powerful methods for a wide variety of statistical learning problems.

Steps in the EM algorithm
1. E-step: Compute the probabilities pij = P(C = i | xj), the probability that datum xj was generated by component i. By Bayes' rule, we have pij = αP(xj | C = i)P(C = i). The term P(xj | C = i) is just the probability at xj of the ith Gaussian, and the term P(C = i) is just the weight parameter for the ith Gaussian. Define ni = Σj pij, the effective number of data points currently assigned to component i.
2. M-step: Compute the new mean, covariance, and component weights using the following steps in sequence:

μi ← Σj pij xj / ni
Σi ← Σj pij (xj − μi)(xj − μi)ᵀ / ni
wi ← ni / N

where N is the total number of data points.
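A sketch of these two steps for a one-dimensional mixture of two Gaussians (so the covariance reduces to a variance); the synthetic data and the initial parameter guesses are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians (illustrative)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.5, 200)])

# Initial guesses for the component weights, means and variances
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: p_ij = P(C = i | x_j) via Bayes' rule
    p = w[:, None] * normal_pdf(x[None, :], mu[:, None], var[:, None])
    p /= p.sum(axis=0, keepdims=True)
    n = p.sum(axis=1)                          # effective number of points per component
    # M-step: new means, variances and component weights
    mu = (p @ x) / n
    var = (p * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / n
    w = n / len(x)

print(mu, var, w)   # should end up close to the generating parameters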

Learning Hidden Variable Example

Figure 20.13 represents a situation in which there are two bags of candies that have been mixed together. Candies are described by three features: in addition to the Flavor and the Wrapper, some candies have a Hole in the middle and some do not. The distribution of candies in each bag is described by a naive Bayes model: the features are independent, given the bag, but the conditional probability distribution for each feature depends on the bag. The parameters are as follows: θ is the prior probability that a candy comes from Bag 1; θF1 and θF2 are the probabilities that the flavor is cherry, given that the candy comes from Bag 1 or Bag 2 respectively; θW1 and θW2 give the probabilities that the wrapper is red; and θH1 and θH2 give the probabilities that the candy has a hole.

In the figure, the bag is a hidden variable because, once the candies have been mixed together, we no longer know which bag each candy came from. In such a case, can we recover the descriptions of the two bags by observing candies from the mixture? Let us work through an iteration of EM for this problem. First, let us look at the data. We generated 1000 samples from a model whose true parameters are as follows:

θ = 0.5, θF1 = θW1 = θH1 = 0.8, θF2 = θW2 = θH2 = 0.3.

That is, the candies are equally likely to come from either bag; the first is mostly cherries with red wrappers and holes; the second is mostly limes with green wrappers and no holes. The counts for the eight possible kinds of candy are tabulated from these 1000 samples.

We start by initializing the parameters. For numerical simplicity, we arbitrarily choose

θ(0) = 0.6,  θ(0)F1 = θ(0)W1 = θ(0)H1 = 0.6,  θ(0)F2 = θ(0)W2 = θ(0)H2 = 0.4.

First, let us work on the θ parameter. In the fully observable case, we would estimate this directly from the observed counts of candies from bags 1 and 2. Because the bag is a hidden variable, we calculate the expected counts instead. The expected count N̂(Bag = 1) is the sum, over all candies, of the probability that the candy came from bag 1:

N̂(Bag = 1) = Σj P(Bag = 1 | flavorj, wrapperj, holesj),   so that θ(1) = N̂(Bag = 1)/N.

These probabilities can be computed by any inference algorithm for Bayesian networks. For a naive Bayes model such as the one in our example, we can do the inference "by hand," using Bayes' rule and applying conditional independence:

P(Bag = 1 | flavorj, wrapperj, holesj) = P(flavorj | Bag = 1) P(wrapperj | Bag = 1) P(holesj | Bag = 1) P(Bag = 1) / Σi P(flavorj | Bag = i) P(wrapperj | Bag = i) P(holesj | Bag = i) P(Bag = i).

Applying this formula to, say, the 273 red-wrapped cherry candies with holes, we get a contribution of 273 · P(Bag = 1 | cherry, red, holes) to the expected count. Continuing with the other seven kinds of candy in the table of counts, we obtain θ(1) = 0.6124.

Now let us consider the other parameters, such as θF1. In the fully observable case, we would estimate this directly from the observed counts of cherry and lime candies from bag 1. The expected count of cherry candies from bag 1 is given by

N̂(Bag = 1, Flavor = cherry) = Σ (j: flavorj = cherry) P(Bag = 1 | flavorj, wrapperj, holesj).

Again, these probabilities can be calculated by any Bayes net algorithm. Completing this process, we obtain new values for all of the parameters. By the tenth iteration, the learned model is a better fit than the original model (L = −1982.214).
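As a quick check of the first expected-count contribution, the following sketch evaluates P(Bag = 1 | candy) under the initial parameters θ(0) = 0.6 and multiplies it by the 273 red-wrapped cherry candies with holes; the full table of counts is not reproduced in the text, so only this one term is shown.

# Initial parameters theta(0) from the text: prior 0.6 for Bag 1, and
# P(cherry), P(red wrapper), P(hole) equal to 0.6 for Bag 1 and 0.4 for Bag 2.
theta = 0.6
feat = {1: 0.6, 2: 0.4}   # same value used for flavor, wrapper and hole

def posterior_bag1(cherry, red, hole):
    """P(Bag = 1 | flavor, wrapper, hole) via Bayes' rule and conditional independence."""
    def joint(bag, prior):
        p = prior
        for present in (cherry, red, hole):
            p *= feat[bag] if present else 1 - feat[bag]
        return p
    j1, j2 = joint(1, theta), joint(2, 1 - theta)
    return j1 / (j1 + j2)

# Contribution of the 273 red-wrapped cherry candies with holes to N^(Bag = 1)
print(273 * posterior_bag1(True, True, True))   # roughly 228

Summing such contributions over all eight candy types gives N̂(Bag = 1), and dividing by 1000 gives θ(1) ≈ 0.6124, as stated above.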

Applications of EM algorithm

The primary aim of the EM algorithm is to estimate the missing data in the latent variables through the observed data in datasets. The EM algorithm, or latent variable model, has a broad range of real-life applications in machine learning. These are as follows:

o The EM algorithm is applicable in data clustering in machine learning.
o It is often used in computer vision and NLP (Natural language
processing).
o It is used to estimate the values of the parameters in mixture models such as the Gaussian Mixture Model and in quantitative genetics.
o It is also used in psychometrics for estimating item parameters and
latent abilities of item response theory models.
o It is also applicable in the medical and healthcare industry, such as
in image reconstruction and structural engineering.
o It is used to determine the Gaussian density of a function.
Advantages of EM algorithm
o It is very easy to implement the two basic steps of the EM algorithm, the E-step and the M-step, in various machine learning problems.
o It is mostly guaranteed that the likelihood will increase after each iteration.
o It often generates a solution for the M-step in the closed form.
Disadvantages of EM algorithm
o The convergence of the EM algorithm is very slow.
o It may converge only to a local optimum.
o It takes both forward and backward probabilities into consideration, unlike numerical optimization, which considers only the forward probabilities.
