Ensemble Methods
Ensemble methods are techniques that improve predictive accuracy by combining multiple
models instead of relying on a single model. The combined models typically produce
significantly more accurate results than any single model, which has boosted the popularity
of ensemble methods in machine learning.
Let’s understand the concept of ensemble learning with an example. Suppose you are a
movie director and you have created a short movie on a very important and interesting topic.
Now, you want to take preliminary feedback (ratings) on the movie before making it public.
What are the possible ways by which you can do that?
A: You may ask one of your friends to rate the movie for you.
Now it’s entirely possible that the person you have chosen loves you very much and doesn’t
want to break your heart by providing a 1-star rating to the horrible work you have created.
B: You may instead ask five of your colleagues to rate the movie.
This should provide a better idea of the movie. This method may provide honest ratings for
your movie. But a problem still exists. These 5 people may not be “Subject Matter Experts”
on the topic of your movie. Sure, they might understand the cinematography, the shots, or
the audio, but at the same time may not be the best judges of dark humour.
C: You may ask a much larger and more varied group of people to rate the movie.
Some of them can be your friends, some of them can be your colleagues, and some may even
be strangers, so you now have people with different sets of skills. And as it turns out, this is
a better approach to get honest ratings than the previous options.
With these examples, you can infer that a diverse group of people is likely to make better
decisions than a single individual. The same holds for a diverse set of models compared to a
single model; this diversification is exactly what ensemble learning provides.
Summary
Ensemble methods aim at improving predictability in models by combining several
models to make one very reliable model.
The most popular ensemble methods are boosting, bagging, and stacking.
Ensemble methods are ideal for regression and classification, where they reduce bias
and variance to boost the accuracy of models.
In this section, we will look at a few simple but powerful techniques, namely:
1. Max Voting
2. Averaging
3. Weighted Averaging
1 Max Voting
The max voting method is generally used for classification problems. In this technique,
multiple models are used to make predictions for each data point. The predictions by each
model are considered as a ‘vote’. The predictions which we get from the majority of the
models are used as the final prediction.
For example, when you asked 5 of your colleagues to rate your movie (out of 5); we’ll
assume three of them rated it as 4 while two of them gave it a 5. Since the majority gave a
rating of 4, the final rating will be taken as 4.
You can consider this as taking the mode of all the predictions.
The result of max voting would look something like the sketch below.
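This is only a minimal illustration with made-up predictions from three hypothetical classifiers; scikit-learn’s VotingClassifier with voting='hard' implements the same idea for fitted estimators.

from statistics import mode

# Made-up class predictions from three hypothetical classifiers.
preds_model_1 = [1, 0, 1, 1]
preds_model_2 = [1, 1, 1, 0]
preds_model_3 = [0, 0, 1, 1]

# Max voting: the most common vote per data point becomes the final prediction.
final_preds = [mode(votes) for votes in zip(preds_model_1, preds_model_2, preds_model_3)]
print(final_preds)  # [1, 0, 1, 1]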
2 Averaging
Similar to the max voting technique, multiple predictions are made for each data point in
averaging. In this method, we take an average of predictions from all the models and use it
to make the final prediction. Averaging can be used for making predictions in regression
problems or for calculating probabilities in classification problems.
For example, in the movie case, the averaging method would take the average of all the
individual ratings, as in the sketch below.
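A minimal sketch, using made-up probability outputs from three hypothetical models:

import numpy as np

# Made-up probability predictions from three hypothetical models for four data points.
preds = np.array([
    [0.9, 0.2, 0.6, 0.4],   # model 1
    [0.8, 0.3, 0.7, 0.5],   # model 2
    [0.7, 0.1, 0.8, 0.3],   # model 3
])

# Averaging: the final prediction is the mean of the models' outputs per data point.
final_preds = preds.mean(axis=0)
print(final_preds)  # approximately [0.8, 0.2, 0.7, 0.4]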
3 Weighted Average
This is an extension of the averaging method. All models are assigned different weights
defining the importance of each model for the prediction. For instance, if two of your
colleagues are critics while the others have no prior experience in this field, then the answers
given by these two colleagues are weighted more heavily than those of the other people, as
in the sketch below.
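A minimal sketch with made-up ratings and weights; the two critics get larger weights, and the exact values are illustrative only:

import numpy as np

# Made-up ratings (out of 5) from five colleagues; the first two are the critics.
ratings = np.array([5, 4, 5, 4, 4])

# Larger weights for the two critics; the weights sum to 1.
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])

# Weighted average: each rating contributes in proportion to its weight.
final_rating = np.average(ratings, weights=weights)
print(final_rating)  # approximately 4.41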
The Bias-Variance Trade-off
Bias and variance are the two most fundamental properties of a model. The idea is that we
tune our parameters in such a way that bias and variance balance out to give good predictive
performance. Different types of data need different types of algorithms, and each algorithm
comes with its own trade-offs. Another way of looking at this trade-off is that we want our
models to have enough flexibility to capture the underlying complexity of the data, but not
so much flexibility that they overfit the training data and lose accuracy on new, unseen data.
Bias Error
Bias error comes from the simplifying assumptions a model makes to make the target easier
to predict; in other words, it reflects how close our average predictions are to the actual
correct values. The lower the bias error, the more accurate our predictions will be on the
training and test data; the higher the bias error, the less accurate our predictions will be.
(Bias describes the model’s ability to approximate the data.)
Low Bias means fewer assumptions about the form of the target function.
High Bias means more assumptions about the form of the target function; in other
words, a high-bias model pays little attention to the training data and oversimplifies the
problem. This leads to high errors on both the training and test data.
A linear regression model, for example, would have high bias when trying to fit a quadratic relationship.
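A minimal sketch of this high-bias situation, using synthetic quadratic data and a straight-line model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data with a quadratic relationship between X and y.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=200)

# A straight line cannot capture the curvature: the model underfits,
# so the error is large even on the data it was trained on.
model = LinearRegression().fit(X, y)
print(mean_squared_error(y, model.predict(X)))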
Variance Error
Variance error is the amount by which the estimate of the target function would change if
different training data were used. We should expect any algorithm to have some variance,
and if we introduce new training sets we should expect the estimate to change, but ideally it
should not change too much from one set to the next. If it does not change much, we know
that the base model is good at picking out the hidden underlying mapping between the input
and the output variables.
Algorithms with high variance are strongly influenced by the training data set and will not
generalize well to new data sets. Algorithms with a lot of flexibility also have high
variance; these tend to be nonlinear algorithms like Decision Trees or K-Nearest
Neighbors, while linear algorithms tend to have low variance. (Variance reflects the
stability of the model.)
Low Variance suggests small changes to the estimate of the target function when the
training data changes.
High Variance suggests large changes to the estimate of the target function. High-variance
models pay a lot of attention to the training data and do not generalize well to data they
have not seen before.
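A minimal sketch of high variance: the same flexible model, trained on two different random samples of the same underlying function, gives noticeably different predictions at the same test points.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Fixed points at which we compare the two fitted models.
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)

for seed in (1, 2):
    rng = np.random.default_rng(seed)
    # A different random training sample of the same underlying function each time.
    X = rng.uniform(-3, 3, size=(50, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=50)
    tree = DecisionTreeRegressor().fit(X, y)
    print(tree.predict(X_test))  # the predictions shift noticeably between runs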
Low Bias Algorithms: Decision Trees, K-Nearest Neighbors, and Support Vector Machines.
High Bias Algorithms: Linear Regression, Linear Discriminant Analysis, and Logistic
Regression.
The whole goal is to achieve low bias and low variance. For example, linear algorithms tend
to have low variance and high bias, while nonlinear algorithms tend to have high variance
and low bias; the two therefore need different ensemble approaches to make the predictions
more accurate. These ensemble approaches improve on the individual models’ errors by
combining several models.
In ensemble learning theory, weak learners (or base models) are models that can be used as
building blocks for designing more complex models by combining several of them.
Most of the time, these base models do not perform well by themselves, either because they
have high bias (low-degree-of-freedom models, for example) or because they have too
much variance to be robust (high-degree-of-freedom models, for example). The idea of
ensemble methods is to reduce the bias and/or variance of such weak learners by
combining several of them together in order to create a strong learner (or ensemble model)
that achieves better performance.
In order to set up an ensemble learning method, we first need to select the base models to be
aggregated. Most of the time (including in the well-known bagging and boosting methods) a
single base learning algorithm is used, so that we have homogeneous weak learners that are
trained in different ways. The resulting ensemble model is then said to be “homogeneous”.
However, some methods use different types of base learning algorithms: the heterogeneous
weak learners are then combined into a “heterogeneous ensemble model”.
One important point is that our choice of weak learners should be coherent with the way
we aggregate these models. If we choose base models with low bias but high variance, it
should be with an aggregating method that tends to reduce variance whereas if we choose
base models with low variance but high bias, it should be with an aggregating method that
tends to reduce bias.
Very roughly, we can say that bagging mainly focuses on getting an ensemble model with
less variance than its components, whereas boosting and stacking mainly try to produce
strong models that are less biased than their components (even if variance can also be reduced).
Bagging:
Bagging trains each model in the ensemble using a randomly chosen subset of the training
data. By training our models on randomly chosen subsets we can get a good generalized
idea of the distribution in our original set and thus create strong predictions. Bagging uses a
parallel ensemble method, training each model independently of the others on randomly
chosen subsets. An example of this is the Random Forest algorithm. Bagging mainly aims
to reduce variance.
The idea behind bagging is combining the results of multiple models (for instance, all
decision trees) to get a generalized result. Here’s a question: If you create all the models on
the same set of data and combine them, will it be useful? There is a high chance that these
models will give the same result, since they are getting the same input. So how can we solve
this problem? One of the techniques is bootstrapping: a sampling technique in which we
create subsets of observations from the original dataset, with replacement. The size of the
subsets is the same as the size of the original set.
The bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea
of the distribution of the complete set. The size of the subsets created for bagging may be
smaller than the original set. The procedure works as follows:
1. Multiple subsets are created from the original dataset, selecting observations with
replacement.
2. A base (weak) model is created on each of these subsets.
3. The models run in parallel and are independent of each other.
4. The final prediction is determined by combining the predictions from all the models.
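A minimal sketch using scikit-learn’s BaggingClassifier on a synthetic dataset (in scikit-learn versions before 1.2, the estimator parameter is called base_estimator):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, split into training and test sets.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: 50 decision trees, each trained on a bootstrap sample
# (drawn with replacement) of the training set; predictions are combined by voting.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,
    random_state=0,
)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))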
Boosting:
Boosting is a sequential ensemble method: models are trained one after another, and each
new model tries to correct the errors made by the models before it, for example by giving
more weight to the observations that were predicted poorly. Gradient Boosting applies the
same idea of adjusting for errors by fitting each new model to the gradient of the loss
function. The loss function is a measure of how well the model’s coefficients fit the
underlying data (the difference between actual and predicted values). Gradient Boosting
works well for both classification and regression.
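A minimal sketch using scikit-learn’s GradientBoostingClassifier on a synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data, split into training and test sets.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient boosting: trees are added sequentially, each one fitted to the
# gradient of the loss of the current ensemble.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))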
Stacking:
Stacking combines multiple classification or regression models via a meta-classifier or
meta-regressor. The base-level models are trained on the complete training set, and the
meta-model is trained on the outputs of the base-level models, which it uses as features.
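A minimal sketch using scikit-learn’s StackingClassifier, with a decision tree and a K-Nearest Neighbors model as (illustrative) base learners and logistic regression as the meta-model:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, split into training and test sets.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The base models' predictions become the features for the logistic regression meta-model.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))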
Reference: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205