ML Unit 3
models), the data sampling (with or without replacement, k-fold, etc.) and the decision function (voting, average, meta model, etc.). Therefore, ensemble learning techniques can be classified as:
Voting
Bagging
Boosting
Stacking
Techniques/Methods in ensemble learning: Voting, Error-Correcting Output Codes, Bagging (Random Forest Trees), Boosting (AdaBoost), Stacking. There are also different ways the multiple base-learners are combined to generate the final output.
Figure 2: General Idea - Combining Multiple Learners
Multistage combination
Multistage combination methods use a serial approach where the next base-learner is trained with or tested on only the instances where the previous base-learners are not accurate enough. The idea is that the base-learners (or the different representations they use) are sorted in increasing complexity so that a complex base-learner is not used (or its complex representation is not extracted) unless the preceding simpler base-learners are not confident. An example is cascading.

Let us say that we have L base-learners. We denote by dj(x) the prediction of base-learner Mj given the arbitrary dimensional input x. In the case of multiple representations, each Mj uses a different input representation xj. The final prediction is calculated from the predictions of the base-learners:

y = f(d1, d2, ..., dL | Φ)

where f(·) is the combining function with Φ denoting its parameters.

Figure 1: Base-learners are dj and their outputs are combined using f(·). This is for a single output; in the case of classification, each base-learner has K outputs that are separately used to calculate yi, and then we choose the maximum. Note that here all learners observe the same input; it may be the case that different learners observe different representations of the same input object or event.

When there are K outputs, for each learner there are dji(x), i = 1,...,K, j = 1,...,L, and, combining them, we also generate K values, yi, i = 1,...,K; then, for example, in classification, we choose the class with the maximum yi value:

Choose Ci if yi = max_k yk

1.1 Voting

The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear combination of the learners (refer figure 1):

yi = Σj wj dji, where wj ≥ 0 and Σj wj = 1

This is also known as ensembles and linear opinion pools.
In the simplest case, all learners are given equal weight and we have simple voting that corresponds to taking an average. Still, taking a (weighted) sum is only one of the possibilities and there are also other combination rules, as shown in table 1. If the outputs are not posterior probabilities, these rules require that outputs be normalized to the same scale.

When there are two classes, this is majority voting, where the winning class gets more than half of the votes. If the voters can also supply the additional information of how much they vote for each class (e.g., the posterior probability), then after normalization, these can be used as weights in a weighted voting scheme. Equivalently, if dji are the class posterior probabilities, P(Ci | x, Mj), then we can just sum them up (wj = 1/L) and choose the class with maximum yi.

Table 2: Example of combination rules on three learners and three classes

An example of the use of these rules is shown in table 2, which demonstrates the effects of different rules. Sum rule is the most intuitive and is the most widely used in practice. Median rule is more robust to outliers; minimum and maximum rules are pessimistic and optimistic, respectively. With the product rule, each learner has veto power; regardless of the other ones, if one learner has an output of 0, the overall output goes to 0. Note that after the combination rules, yi do not necessarily sum up to 1.
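To make the rules concrete, here is a small NumPy sketch (not from the notes) that applies the sum, median, minimum, maximum and product rules, and a weighted sum, to the outputs of three learners on three classes; the numerical values and weights are arbitrary illustrations.

import numpy as np

# outputs[j, i] = d_ji, the output of learner j for class C_i
# (three learners, three classes; illustrative values only)
outputs = np.array([[0.2, 0.5, 0.3],
                    [0.0, 0.6, 0.4],
                    [0.4, 0.4, 0.2]])

# Combination rules: each produces y_i for i = 1..K
rules = {
    "sum":     outputs.mean(axis=0),       # simple voting (average)
    "median":  np.median(outputs, axis=0),
    "minimum": outputs.min(axis=0),
    "maximum": outputs.max(axis=0),
    "product": outputs.prod(axis=0),       # a single 0 output vetoes the class
}

for name, y in rules.items():
    # choose the class with the maximum y_i
    print(f"{name:8s} y = {np.round(y, 3)} -> choose C{y.argmax() + 1}")

# Weighted voting: weights w_j >= 0 that sum to 1 (e.g., from validation accuracy)
w = np.array([0.5, 0.3, 0.2])
y_weighted = w @ outputs                    # y_i = sum_j w_j * d_ji
print("weighted y =", np.round(y_weighted, 3), "-> choose C", y_weighted.argmax() + 1)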
In the case of regression, simple or weighted averaging or median can be used to fuse the outputs of base-regressors. Median is more robust to noise than the average.

Another possible way to find wj is to assess the accuracies of the learners (regressor or classifier) on a separate validation set and use that information to compute the weights, so that we give more weight to more accurate learners.

1.2 Bagging

The simplest method of combining classifiers is known as bagging, which stands for bootstrap aggregating: the statistical technique of sampling from the dataset with replacement (bootstrap sampling). Bagging involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble using a randomly drawn subset of the training set. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.
Models trained on different bootstrap samples will perform slightly differently, which is exactly what we want for an ensemble method. Another benefit is that estimates of the accuracy of the classification function can be made without complicated analytic work, by throwing computer resources at the problem (technically, bagging is a variance reducing algorithm; the meaning of this will become clearer when we talk about bias and variance). Having taken a set of bootstrap samples, the bagging method simply requires that we fit a model to each dataset, and then combine them by taking the output to be the majority vote of all the classifiers. A NumPy implementation is shown next, and then we will look at a simple example.

import numpy as np
# nPoints, nSamples, data, targets, features and self.tree come from the surrounding class.

# Compute bootstrap samples (sample the row indices with replacement)
samplePoints = np.random.randint(0, nPoints, (nPoints, nSamples))
classifiers = []

for i in range(nSamples):
    sample = []
    sampleTarget = []
    for j in range(nPoints):
        # Build the i-th bootstrap sample of the data and its targets
        sample.append(data[samplePoints[j, i]])
        sampleTarget.append(targets[samplePoints[j, i]])
    # Train a classifier (here, a decision tree) on each bootstrap sample
    classifiers.append(self.tree.make_tree(sample, sampleTarget, features))

The example consists of taking the party data that was used to demonstrate the decision tree. The output of a decision tree that uses the whole dataset for this is not surprising: it takes the two largest classes and separates them. However, using just stumps of trees and 20 samples, bagging can separate the data perfectly, as this output shows:

1.3 Boosting

Boosting: train the next learner on the mistakes made by the previous learner(s).

In bagging, generating complementary base-learners is left to chance and to the instability of the learning method. In boosting, we actively try to generate complementary base-learners by training the next learner on the mistakes of the previous learners.
... decreases. An algorithm like the linear discriminant has low variance, and we cannot gain by AdaBoosting linear discriminants.

1. Another way to think about the ensemble idea is as a generic way of enlarging the hypothesis space.
2. That is, think of the ensemble itself as a hypothesis and the new hypothesis space as the set of all possible ensembles constructible from hypotheses in the original space.
3. Figure 1 shows how this can result in a more expressive hypothesis space.
4. If the original hypothesis space allows for a simple and efficient learning algorithm, then the ensemble method provides a way to learn a much more expressive class of hypotheses without incurring much additional computational or algorithmic complexity.
5. The most widely used ensemble method is called boosting.
6. To understand how it works, we need first to explain the idea of a weighted training set.
7. In such a training set, each example has an associated weight wj > 0.
8. The higher the weight of an example, the higher is the importance attached to it during the learning of a hypothesis.
9. It is straightforward to modify the learning algorithms we have seen so far to operate with weighted training sets.
10. Boosting starts with wj = 1 for all the examples (i.e., a normal training set).
11. From this set, it generates the first hypothesis, h1.
12. This hypothesis will classify some of the training examples correctly and some incorrectly.
13. We would like the next hypothesis to do better on the misclassified examples, so we increase their weights while decreasing the weights of the correctly classified examples.
14. From this new weighted training set, we generate hypothesis h2. The process continues in this way until we have generated M hypotheses, where M is an input to the boosting algorithm.
15. The final ensemble hypothesis is a weighted-majority combination of all the M hypotheses, each weighted according to how well it performed on the training set (see the sketch after this list).
16. Figure 2 shows how the algorithm works conceptually.
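Below is a minimal sketch of this boosting scheme in the style of AdaBoost with decision stumps, written in NumPy for illustration; the stump learner, the particular weight-update formula, the toy dataset and M = 5 are assumptions rather than material from the notes.

import numpy as np

def fit_stump(X, y, w):
    # Weighted decision stump: pick the feature, threshold and polarity with lowest weighted error
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] >= thr, pol, -pol)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, f, thr, pol)
    return best

def adaboost(X, y, M=10):
    # y must be coded as -1/+1; returns a list of (alpha, stump) pairs
    n = len(y)
    w = np.ones(n) / n                         # step 10: start with equal weights
    ensemble = []
    for _ in range(M):
        err, f, thr, pol = fit_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # hypothesis weight: more accurate stumps vote more
        pred = np.where(X[:, f] >= thr, pol, -pol)
        w = w * np.exp(-alpha * y * pred)      # step 13: raise weights of misclassified examples
        w = w / w.sum()
        ensemble.append((alpha, (f, thr, pol)))
    return ensemble

def predict(ensemble, X):
    # step 15: weighted-majority combination of the M hypotheses
    score = np.zeros(len(X))
    for alpha, (f, thr, pol) in ensemble:
        score += alpha * np.where(X[:, f] >= thr, pol, -pol)
    return np.sign(score)

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])   # toy 1-D data (hypothetical)
y = np.array([-1, -1, -1, 1, 1, 1])
print(predict(adaboost(X, y, M=5), X))                      # expected: [-1. -1. -1.  1.  1.  1.]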
1.4 Stacking - Stacked Generalization

Stacked generalization is a technique proposed by Wolpert (1992) that extends voting in that the way the output of the base-learners is combined need not be linear but is learned through a combiner system, f(·|Φ), which is another learner, whose parameters Φ are also trained (see the figure below).

Stacking is a means of estimating and correcting for the biases of the base-learners. Therefore, the combiner should be trained on data unused in training the base-learners.

If f(·|w1, ..., wL) is a linear model with constraints wi ≥ 0, Σj wj = 1, the optimal weights can be found by constrained regression, but of course we do not need to enforce this; in stacking, there is no restriction on the combiner function and, unlike voting, f(·) can be nonlinear. For example, it may be implemented as a multilayer perceptron with Φ its connection weights.
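As an illustration of such a learned combiner, here is a short sketch that assumes scikit-learn is available; the particular base-learners (a depth-3 decision tree and 5-NN) and logistic regression as the combiner f(·|Φ) are arbitrary choices. The key point it demonstrates is that the combiner is fitted on data the base-learners did not see.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Toy data; split so that the combiner is trained on data unused by the base-learners
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_base, X_comb, y_base, y_comb = train_test_split(X, y, test_size=0.5, random_state=0)

# Train the base-learners d_1 ... d_L
base_learners = [DecisionTreeClassifier(max_depth=3, random_state=0),
                 KNeighborsClassifier(n_neighbors=5)]
for learner in base_learners:
    learner.fit(X_base, y_base)

# Their outputs on the held-out data become the inputs of the combiner f(.|Phi)
def base_outputs(X):
    return np.column_stack([m.predict_proba(X)[:, 1] for m in base_learners])

combiner = LogisticRegression()   # could be any learner, e.g. a multilayer perceptron
combiner.fit(base_outputs(X_comb), y_comb)

# Accuracy on the combiner's own training data, just to show the pipeline end to end
print("stacked accuracy:", combiner.score(base_outputs(X_comb), y_comb))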
Unsupervised Learning
2. K-Means Algorithm
K-means is a centroid-based technique. It is an unsupervised machine learning algorithm. The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

K-Means Algorithm
The algorithm works in the following way:
Step 1: Choose the number of clusters (K) and randomly initialize the K cluster centroids.
Step 2: Assign each data point to the closest centroid.
Step 3: Recalculate the centroids as the mean of all the data points assigned to that centroid.
Step 4: Repeat steps 2 and 3 until the centroids no longer change significantly or a specified number of iterations has been reached.
Step 5: The output of the K-means algorithm is a set of K centroids and a partition of the data into K clusters.

Criterion function

E = Σ (i = 1 to k) Σ (p ∈ Ci) |p − mi|²

where
E is the sum of the square error for all objects in the data set;
p is the point in space representing a given object;
mi is the mean of cluster Ci (both p and mi are multidimensional).
For each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.

K-Means Clustering Method:
(1) Arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;
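A minimal NumPy sketch of these steps is given below for illustration; the toy data array X, the choice k = 2 and the stopping test are assumptions, not part of the notes.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k objects at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Illustrative toy data (hypothetical points, not from the notes)
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
centroids, labels = kmeans(X, k=2)
print(centroids, labels)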
K-Means Cluster Example

Initial centroids:
Cluster  Var1  Var2
K1       1.0   1.0
K2       3.0   4.0

Distance of point (1.0, 1.0) from K1 (1.0, 1.0) = √((1.0 − 1.0)² + (1.0 − 1.0)²) = 0
Distance of point (1.0, 1.0) from K2 (3.0, 4.0) = √((3.0 − 1.0)² + (4.0 − 1.0)²) = 3.61
Distance of point (3.0, 4.0) from K2 (3.0, 4.0) = √((3.0 − 3.0)² + (4.0 − 4.0)²) = 0

Centroids after successive reassignments and updates:
K1 = (1.25, 1.5)
K2 = ((3.0 + 5.0)/2, (4.0 + 7.0)/2) = (4.0, 5.5)
K2 = ((3.0 + 5.0 + 3.5)/3, (4.0 + 7.0 + 5.0)/3) = (3.83, 5.33)
K2 = ((3.0 + 5.0 + 3.5 + 4.5)/4, (4.0 + 7.0 + 5.0 + 5.0)/4) = (4.00, 5.25), with K1 unchanged at (1.25, 1.5)
3. K-Nearest Neighbour (K-NN) Algorithm

o The K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for regression as well as for classification, but mostly it is used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data set that are similar to the cats and dogs images and, based on the most similar features, it will put it in either the cat or dog category.
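A minimal K-NN classification sketch in NumPy is given below for illustration; the training points, labels and the choice k = 3 are assumed values, not data from the notes.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Compute Euclidean distances from the new point to every stored point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Take the labels of the k nearest neighbours and vote
    nearest = y_train[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy "cat vs dog" style data: two features per example, labels 0 and 1 (hypothetical)
X_train = np.array([[1.0, 1.2], [1.1, 0.9], [0.9, 1.0],   # class 0
                    [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.0, 1.0])))  # expected 0
print(knn_predict(X_train, y_train, np.array([3.0, 3.0])))  # expected 1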
Voronoi Diagram
The given figure 4 shows two examples, where the data (shown by the histograms) comes from two different Gaussians, and the model is computed as a sum or mixture of the two Gaussians together.

(It is assumed that the number of clusters is known and that it is K.) So the parameters μk and Σk are also estimated for each k. Had it been only one distribution, they would have been estimated by the maximum-likelihood method. But since there are K such clusters, the probability density is defined as a linear function of the densities of all these K distributions, i.e.

p(x) = Σ (k = 1 to K) πk N(x | μk, Σk)

where πk is the mixing coefficient for the kth distribution. For estimating the parameters by the maximum log-likelihood method, we maximise

ln p(X | π, μ, Σ) = Σ (n = 1 to N) ln [ Σ (k = 1 to K) πk N(xn | μk, Σk) ].
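As a small illustration (not from the notes), the following NumPy sketch evaluates this mixture density and its log-likelihood for one-dimensional data with two Gaussian components; the parameter values and the synthetic data are assumptions.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Univariate normal density N(x | mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, pis, mus, sigmas):
    # p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)
    return sum(pi * gaussian_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas))

def log_likelihood(X, pis, mus, sigmas):
    # ln p(X) = sum_n ln( sum_k pi_k N(x_n | mu_k, sigma_k^2) )
    return np.sum(np.log(mixture_pdf(X, pis, mus, sigmas)))

# Two components with assumed parameters, and some synthetic data
pis, mus, sigmas = [0.4, 0.6], [0.0, 5.0], [1.0, 1.5]
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, 40), rng.normal(5.0, 1.5, 60)])
print(log_likelihood(X, pis, mus, sigmas))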
The common approach is to aim for the maximum likelihood solution (the
likelihood is the conditional probability of the data given the model, and the
maximum likelihood solution varies the model to maximise this conditional
probability).
In fact, it is common to compute the log likelihood and then to maximise
that; it is guaranteed to be negative, since probabilities are all less than 1, and
the logarithm spreads out the values, making the optimisation more effective.
So it can be clearly seen that the parameters cannot be estimated in closed
form.
The algorithm that is used is an example of a very general one known as the expectation-maximisation (or, more compactly, EM) algorithm.
4. EM Algorithm

Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning. For example, ...

The basic idea of the EM algorithm is that sometimes it is easier to add extra variables that are not actually known (called hidden or latent variables) and then to maximise the function over those variables.
... and it turns out that EM algorithms are guaranteed to reach a local maximum.

The trick with applying EM algorithms to problems is in identifying the correct latent variables to include, and then simply working through the steps. EM algorithms are very powerful methods for a wide variety of statistical learning problems.

Steps in EM algorithm (for a mixture of Gaussians)
1. E-step: Compute the probabilities pij = P(C = i | xj), the probability that datum xj was generated by component i. By Bayes' rule, we have pij = αP(xj | C = i)P(C = i). The term P(xj | C = i) is just the probability at xj of the ith Gaussian, and the term P(C = i) is just the weight parameter for the ith Gaussian. Define ni = Σj pij, the effective number of data points currently assigned to component i.
2. M-step: Compute the new mean, covariance, and component weights using the following steps in sequence:
   μi ← Σj pij xj / ni
   Σi ← Σj pij (xj − μi)(xj − μi)ᵀ / ni
   wi ← ni / N
(A sketch of one such iteration is given below.)
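The following NumPy sketch (illustrative only) carries out these E and M steps for a one-dimensional mixture of two Gaussians; the synthetic data and the initial parameter values are assumptions.

import numpy as np

def gaussian_pdf(x, mu, var):
    # N(x | mu, var) for scalar data
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_step(X, w, mu, var):
    # E-step: p_ij = P(C = i | x_j) via Bayes' rule, normalised over components i
    p = np.array([w_i * gaussian_pdf(X, mu_i, var_i) for w_i, mu_i, var_i in zip(w, mu, var)])
    p /= p.sum(axis=0)                     # shape (k, n)
    n_i = p.sum(axis=1)                    # effective number of points per component
    # M-step: new means, variances and component weights
    mu_new = (p @ X) / n_i
    var_new = (p * (X - mu_new[:, None]) ** 2).sum(axis=1) / n_i
    w_new = n_i / len(X)
    return w_new, mu_new, var_new

# Synthetic data from two Gaussians, plus crude initial parameters
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(20):                        # iterate E and M steps
    w, mu, var = em_step(X, w, mu, var)
print(w, mu, var)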
Learning Hidden Variable Example

Figure 20.13 represents a situation in which there are two bags of candies that have been mixed together. Candies are described by three features: in addition to the Flavor and the Wrapper, some candies have a Hole in the middle and some do not. The distribution of candies in each bag is described by a naive Bayes model: the features are independent, given the bag, but the conditional probability distribution for each feature depends on the bag. The parameters are as follows: θ is the prior probability that a candy comes from Bag 1; θF1 and θF2 are the probabilities that the flavor is cherry, given that the candy comes from Bag 1 or Bag 2 respectively; θW1 and θW2 give the probabilities that the wrapper is red; and θH1 and θH2 give the probabilities that the candy has a hole.

In the figure, the bag is a hidden variable because, once the candies have been mixed together, we no longer know which bag each candy came from. In such a case, can we recover the descriptions of the two bags by observing candies from the mixture? Let us work through an iteration of EM for this problem. First, let's look at the data. We generated 1000 samples from a model whose true parameters are as follows:
We start by initializing the parameters. For numerical simplicity, we arbitrarily choose
θ(0) = 0.6,
θ(0)F1 = θ(0)W1 = θ(0)H1 = 0.6,
θ(0)F2 = θ(0)W2 = θ(0)H2 = 0.4.

First, let us work on the θ parameter. In the fully observable case, we would estimate this directly from the observed counts of candies from bags 1 and 2. Because the bag is a hidden variable, we calculate the expected counts instead. The expected count N̂(Bag = 1) is the sum, over all candies, of the probability that the candy came from bag 1.

These probabilities can be computed by any inference algorithm for Bayesian networks. For a naive Bayes model such as the one in our example, we can do the inference "by hand," using Bayes' rule and applying conditional independence (see the sketch below). Continuing with the other seven kinds of candy in the table of counts, we obtain θ(1) = 0.6124.

Now let us consider the other parameters, such as θF1. In the fully observable case, we would estimate this directly from the observed counts of cherry and lime candies from bag 1. The expected count of cherry candies from bag 1 is given by the corresponding sum of posterior probabilities. Again, these probabilities can be calculated by any Bayes net algorithm. Completing this process, we obtain the new values of all the parameters.

By the tenth iteration, the learned model is a better fit than the original model (L = −1982.214).
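A small sketch (not from the text) of the "by hand" posterior computation for one kind of candy, using Bayes' rule with the naive Bayes independence assumption; the feature combination shown (cherry, red wrapper, with hole) and the count used are hypothetical, since the table of counts is not reproduced in these notes.

# Initial parameter guesses from the text: theta(0) and the per-bag feature probabilities
theta = 0.6                       # P(Bag = 1)
pF = {1: 0.6, 2: 0.4}             # P(Flavor = cherry | Bag)
pW = {1: 0.6, 2: 0.4}             # P(Wrapper = red   | Bag)
pH = {1: 0.6, 2: 0.4}             # P(Hole = yes      | Bag)

def p_bag1(flavor_cherry, wrapper_red, has_hole):
    # P(Bag = 1 | features) by Bayes' rule, with features independent given the bag
    def lik(bag):
        prior = theta if bag == 1 else 1 - theta
        f = pF[bag] if flavor_cherry else 1 - pF[bag]
        w = pW[bag] if wrapper_red else 1 - pW[bag]
        h = pH[bag] if has_hole else 1 - pH[bag]
        return prior * f * w * h
    return lik(1) / (lik(1) + lik(2))

# Contribution of one candy type to the expected count N-hat(Bag = 1):
# posterior probability times a (hypothetical) observed count of that type
count = 100                       # hypothetical; the actual table of counts is not shown here
posterior = p_bag1(True, True, True)
print(posterior, count * posterior)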
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent variables through the observed data in datasets. The EM algorithm, or latent variable model, has a broad range of real-life applications in machine learning. These are as follows: