
Ensemble Methods

Ensemble methods are techniques that aim to improve the accuracy of a model's results by combining multiple models instead of using a single model. The combined models often increase accuracy significantly, which has boosted the popularity of ensemble methods in machine learning.

Let’s understand the concept of ensemble learning with an example. Suppose you are a
movie director and you have created a short movie on a very important and interesting topic.
Now, you want to take preliminary feedback (ratings) on the movie before making it public.
What are the possible ways by which you can do that?

A: You may ask one of your friends to rate the movie for you.

Now, it is entirely possible that the person you have chosen loves you very much and doesn't want to break your heart by providing a 1-star rating to the horrible work you have created.

B: Another way could be to ask 5 of your colleagues to rate the movie.

This should provide a better idea of the movie and may yield more honest ratings. But a problem still exists: these 5 people may not be "subject matter experts" on the topic of your movie. Sure, they might understand the cinematography, the shots, or the audio, but at the same time they may not be the best judges of dark humour.

C: How about asking 50 people to rate the movie?

Some of them can be your friends, some can be your colleagues, and some may even be total strangers. The responses in this case would be more generalized and diversified, since you now have people with different sets of skills. And as it turns out, this is a better approach to getting honest ratings than the previous cases.

From these examples, you can infer that a diverse group of people is likely to make better decisions than individuals. The same is true for a diverse set of models in comparison to single models. In machine learning, this diversification is achieved by a technique called Ensemble Learning.

Summary

Ensemble methods aim to improve the predictive performance of models by combining several models into one more reliable model.
The most popular ensemble methods are boosting, bagging, and stacking.
Ensemble methods work for both regression and classification, where they reduce bias and variance to boost the accuracy of models.

Categories of Ensemble Methods


Ensemble methods fall into two broad categories, i.e., sequential ensemble techniques and
parallel ensemble techniques.

Sequential ensemble techniques generate base learners in sequence, e.g., Adaptive Boosting (AdaBoost). Sequential generation exploits the dependence between the base learners: the performance of the model is improved by assigning higher weights to the training examples that previous learners misclassified.

In parallel ensemble techniques, base learners are generated in parallel, e.g., random forest. Parallel methods exploit the independence between the base learners, since averaging the predictions of independent learners significantly reduces the error.
The majority of ensemble techniques apply a single algorithm in base learning, which results in homogeneity among the base learners. Homogeneous base learners are base learners of the same type, with similar qualities. Other methods apply heterogeneous base learners, giving rise to heterogeneous ensembles. Heterogeneous base learners are learners of distinct types.
Simple Ensemble Techniques

In this section, we will look at a few simple but powerful techniques, namely:

1. Max Voting
2. Averaging
3. Weighted Averaging
1 Max Voting

The max voting method is generally used for classification problems. In this technique, multiple models are used to make predictions for each data point. The prediction from each model is considered a 'vote', and the prediction made by the majority of the models is used as the final prediction.

For example, when you asked 5 of your colleagues to rate your movie (out of 5), assume three of them rated it as 4 while two of them gave it a 5. Since the majority gave a rating of 4, the final rating is taken as 4. You can consider this as taking the mode of all the predictions.
The result of max voting would be something like this:

Colleague 1 | Colleague 2 | Colleague 3 | Colleague 4 | Colleague 5 | Final rating
5           | 4           | 5           | 4           | 4           | 4
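
As a hedged sketch of max voting (none of this code comes from the original text; the base classifiers and the toy dataset are assumptions for illustration), scikit-learn's VotingClassifier implements hard voting directly:

```python
# A minimal sketch of max (hard) voting, assuming scikit-learn is available.
# The choice of base models and the synthetic dataset are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",  # each model casts one vote; the majority class wins
)
voting_clf.fit(X, y)
print(voting_clf.predict(X[:5]))  # the final prediction is the per-sample mode of the votes
```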

2 Averaging

Similar to the max voting technique, multiple predictions are made for each data point in averaging. In this method, we take the average of the predictions from all the models and use it as the final prediction. Averaging can be used for making predictions in regression problems or for calculating probabilities in classification problems.

For example, in the case below, the averaging method would take the average of all the values, i.e. (5 + 4 + 5 + 4 + 4) / 5 = 4.4


Colleague 1 | Colleague 2 | Colleague 3 | Colleague 4 | Colleague 5 | Final rating
5           | 4           | 5           | 4           | 4           | 4.4
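
As a hedged sketch (the regressors and the synthetic data below are assumptions, not part of the original text), averaging the predictions of several models might look like this:

```python
# A minimal sketch of averaging, assuming scikit-learn and NumPy are available.
# The base regressors and the synthetic dataset are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

models = [LinearRegression(), DecisionTreeRegressor(random_state=0), KNeighborsRegressor()]
for model in models:
    model.fit(X, y)

# The ensemble prediction is the plain mean of the individual model predictions.
predictions = np.column_stack([model.predict(X[:5]) for model in models])
ensemble_prediction = predictions.mean(axis=1)
print(ensemble_prediction)
```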

3 Weighted Average

This is an extension of the averaging method. Each model is assigned a different weight defining its importance for the prediction. For instance, if two of your colleagues are critics while the others have no prior experience in this field, then the answers from these two colleagues are given more importance than those of the other people.

The result is calculated as (5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18) = 4.41.


       | Colleague 1 | Colleague 2 | Colleague 3 | Colleague 4 | Colleague 5 | Final rating
weight | 0.23        | 0.23        | 0.18        | 0.18        | 0.18        |
rating | 5           | 4           | 5           | 4           | 4           | 4.41
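
A minimal sketch of the same calculation in code, assuming NumPy (the weights simply mirror the worked example above and are otherwise arbitrary):

```python
# Weighted averaging: each prediction is scaled by its model's weight.
import numpy as np

ratings = np.array([5, 4, 5, 4, 4])
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])  # weights sum to 1

final_rating = np.average(ratings, weights=weights)  # equivalent to (ratings * weights).sum()
print(round(final_rating, 2))  # 4.41
```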

Understanding Bias-Variance Trade-Offs:

Bias and variance are two of the most fundamental properties of a model. The idea is that we tune our parameters so that bias and variance balance out to produce good predictive performance. Different types of data need different types of algorithms, and each algorithm comes with its own trade-offs. Another way of looking at this trade-off is that we want our models to have enough flexibility to resolve any underlying complexity of new datasets, but not so much flexibility that the model fits the training data too closely and loses accuracy on unseen test data.
Bias Error

Bias error arises from the assumptions a model makes to make the target easier to predict; in other words, it reflects how close our average predictions are to the actual correct values. The lower the bias error, the more accurate our predictions will be on the training and test data; the higher the bias error, the less accurate our predictions will be. (Bias describes the model's ability to approximate the data.)

Low bias means fewer assumptions about the form of the target function.

High bias means more assumptions about the form of the target function. In other words, a high-bias model pays little attention to the training data and oversimplifies the problem, which leads to high error on both the training and test data. A linear regression model, for example, would have high bias when trying to fit a quadratic relationship.
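
To make that last point concrete, here is a small illustrative sketch (not from the original text; the synthetic data and model choices are assumptions) that fits a straight line to quadratic data and compares it with a quadratic fit:

```python
# A minimal sketch of high bias, assuming scikit-learn and NumPy.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=200)  # a quadratic relationship

X_quad = PolynomialFeatures(degree=2).fit_transform(X)
linear_fit = LinearRegression().fit(X, y)
quadratic_fit = LinearRegression().fit(X_quad, y)

# The straight line underfits (high bias): its error stays large even on the
# data it was trained on, while the quadratic fit captures the relationship.
print("linear MSE   :", mean_squared_error(y, linear_fit.predict(X)))
print("quadratic MSE:", mean_squared_error(y, quadratic_fit.predict(X_quad)))
```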

Variance Error

Variance error is the amount by which the estimate of the target function would change if different training data were used. We should expect the algorithm to have some variance no matter what, and if we introduce new training sets we should expect some change, but ideally we do not want to see too much variance from one set to the next. If we do not see much change, we know that the base model is good at picking out the hidden underlying mapping between the input and output variables.

Algorithms with high variance are strongly influenced by the training data set and will not generalize well to new data sets. Algorithms with a lot of flexibility also have high variance; these tend to be nonlinear algorithms like decision trees or k-nearest neighbors, while linear algorithms tend to have low variance. (Variance describes the stability of the model.)
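
As a hedged illustration of this stability (the data-generating function and model choices below are assumptions for the sake of the sketch), we can refit a flexible and an inflexible model on many random training sets and watch how much their prediction at one fixed point moves:

```python
# A minimal sketch of variance, assuming scikit-learn and NumPy: the same two
# model types are refit on different random training samples and their
# predictions at a fixed query point are compared.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_query = np.array([[0.5]])
tree_preds, linear_preds = [], []

for _ in range(100):
    X = rng.uniform(-1, 1, size=(50, 1))
    y = np.sin(3 * X).ravel() + rng.normal(scale=0.3, size=50)
    tree_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_query)[0])
    linear_preds.append(LinearRegression().fit(X, y).predict(x_query)[0])

# The fully grown tree's prediction swings far more across training sets
# than the linear model's, i.e. it has higher variance.
print("tree variance  :", np.var(tree_preds))
print("linear variance:", np.var(linear_preds))
```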

Low variance suggests small changes to the estimate of the target function when the training data changes.

High variance suggests large changes to the estimate of the target function. High-variance models pay a lot of attention to the training data and do not generalize well to data they have not seen before.

Low-bias algorithms: Decision Trees, K-Nearest Neighbors, and Support Vector Machines.

High-bias algorithms: Linear Regression, Linear Discriminant Analysis, and Logistic Regression.

Low-variance algorithms: Linear Regression, Linear Discriminant Analysis, and Logistic Regression.

High-variance algorithms: Decision Trees, K-Nearest Neighbors, and Support Vector Machines.

The overall goal is to achieve low bias and low variance. For example, linear algorithms tend to have low variance and high bias, while nonlinear algorithms tend to have high variance and low bias; as a result, the two call for different ensemble approaches to make the predictions more accurate. These ensemble approaches improve the error by combining several models.

In ensemble learning theory, weak learners (or base models) are models that can be used as building blocks for designing more complex models by combining several of them. Most of the time, these basic models do not perform very well by themselves, either because they have high bias (low-degree-of-freedom models, for example) or because they have too much variance to be robust (high-degree-of-freedom models, for example). The idea of ensemble methods is to reduce the bias and/or variance of such weak learners by combining several of them in order to create a strong learner (or ensemble model) that achieves better performance.

Combine weak learners

In order to set up an ensemble learning method, we first need to select the base models to be aggregated. Most of the time (including in the well-known bagging and boosting methods) a single base learning algorithm is used, so that we have homogeneous weak learners that are trained in different ways. The resulting ensemble model is then said to be "homogeneous". However, there also exist methods that use different types of base learning algorithms: such heterogeneous weak learners are combined into a "heterogeneous ensemble model".

One important point is that our choice of weak learners should be coherent with the way we aggregate these models. If we choose base models with low bias but high variance, it should be with an aggregating method that tends to reduce variance, whereas if we choose base models with low variance but high bias, it should be with an aggregating method that tends to reduce bias.

Very roughly, we can say that bagging mainly focuses on getting an ensemble model with less variance than its components, whereas boosting and stacking mainly try to produce strong models that are less biased than their components (even if variance can also be reduced).

Ensemble Approaches: Bagging, Boosting and Stacking

Bagging:

Bagging trains each model in the ensemble on a randomly chosen subset of the training data. By training our models on randomly chosen subsets we get a good, generalized idea of the distribution in the original set and thus create strong predictions. Bagging is a parallel ensemble method: each model is trained independently of the others on its own randomly chosen subset. An example of this is the Random Forest algorithm. Bagging primarily aims to reduce variance.

Bagging involves two steps: bootstrapping and aggregation.

Bootstrapping is a sampling technique in which samples are drawn from the whole population (set) with replacement. Sampling with replacement randomizes the selection procedure. The base learning algorithm is then run on each of these samples.

Aggregation in bagging is done to incorporate all possible outcomes of the prediction and randomize the final result. Without aggregation, predictions will not be as accurate, because not all outcomes are taken into consideration. The aggregation is therefore based on the bootstrapping procedure and on the outcomes of all the predictive models.

Bagging is advantageous because weak base learners are combined into a single strong learner that is more stable than the individual learners. It also reduces variance, thereby reducing the overfitting of models. One limitation of bagging is that it is computationally expensive, and it can lead to more bias in models when the proper bagging procedure is not followed.

The idea behind bagging is to combine the results of multiple models (for instance, all decision trees) to get a generalized result. Here's a question: if you create all the models on the same set of data and combine them, will it be useful? There is a high chance that these models will give the same result, since they are getting the same input. So how can we solve this problem? One of the techniques is bootstrapping.

Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement. The size of each subset is typically the same as the size of the original set.

The bagging (or bootstrap aggregating) technique uses these subsets (bags) to get a fair idea of the distribution (complete set). The size of the subsets created for bagging may also be smaller than the original set. The procedure is as follows:
1. Multiple subsets are created from the original dataset, selecting observations with replacement.

2. A base model (weak model) is created on each of these subsets.

3. The models run in parallel and are independent of each other (a code sketch of the procedure follows this list).
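
As a hedged illustration of the steps above (none of this code is from the original text; the dataset, the number of bags, and the choice of decision trees as base models are assumptions), a manual bootstrap-and-aggregate loop might look like this:

```python
# A minimal sketch of bagging, assuming scikit-learn and NumPy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rng = np.random.default_rng(0)

n_models = 25
models = []
for _ in range(n_models):
    # 1. Draw a bootstrap sample: same size as the original set, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # 2. Fit an independent base (weak) model on that sample.
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# 3. Aggregate: majority vote across the independent models.
votes = np.stack([m.predict(X[:10]) for m in models])  # shape (n_models, 10)
bagged_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print(bagged_pred)
```

In practice, scikit-learn's BaggingClassifier wraps this same sample-fit-aggregate loop.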


Random Forest Models:

Random Forest models can be thought of as bagging with a slight tweak. When deciding where to split and how to make decisions, bagged decision trees have the full set of features at their disposal. Therefore, although the bootstrapped samples may be slightly different, the data is largely going to split on the same features throughout each model. In contrast, Random Forest models decide where to split based on a random selection of features. Rather than splitting on similar features at every node, Random Forest models introduce a level of differentiation, because each tree splits on different features. This differentiation provides a more diverse ensemble to aggregate over, producing a more accurate predictor. Refer to the image for a better understanding.
Similar to bagging, bootstrapped subsamples are pulled from a larger dataset and a decision tree is formed on each subsample. However, each decision tree is split on a different subset of features (in the diagram, the features are represented by shapes).
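
As a brief hedged sketch (the dataset and hyperparameters are assumptions), the feature-subsampling tweak corresponds to the max_features behaviour of scikit-learn's RandomForestClassifier, contrasted here with plain bagged trees:

```python
# A minimal sketch contrasting bagged decision trees with a random forest,
# assuming scikit-learn; the synthetic data and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# Bagged trees: BaggingClassifier uses a decision tree as its default base
# estimator, and every split may consider all 20 features.
bagged = BaggingClassifier(n_estimators=100, random_state=0)

# Random forest: each split considers only a random subset of the features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("bagged trees :", cross_val_score(bagged, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```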
Boosting:

Boosting takes a weak model, such as a regression or tree-based model, and improves it. For example, XGBoost is a decision-tree-based algorithm that uses gradient boosting to improve itself, with its main focus on reducing bias by concentrating effort on the most difficult observations.

Popular boosting techniques:

1. AdaBoost: AdaBoost is specifically designed for classification problems. When training a decision tree, it starts by assigning each observation an equal weight. After evaluating the first tree, it increases the weights of the observations that are difficult to classify and lowers the weights of those that are easy. It iterates this process until a good predictive model is obtained.

2. Gradient Boosting: Rather than re-weighting the data points, gradient boosting applies the same idea through the loss function, fitting each new model to the gradient of the loss. The loss function is a measure of how well the model's coefficients fit the underlying data (the difference between actual and predicted values). Gradient boosting works well with both classification and regression (a code sketch of both techniques follows).
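
As a hedged sketch (the synthetic dataset and hyperparameters below are assumptions, not part of the original text), both techniques are available in scikit-learn:

```python
# A minimal sketch of AdaBoost and gradient boosting, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# AdaBoost: re-weights hard-to-classify observations at each iteration.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)

# Gradient boosting: each new tree is fit to the gradient of the loss function.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)

print("AdaBoost         :", cross_val_score(ada, X, y, cv=5).mean())
print("Gradient Boosting:", cross_val_score(gbm, X, y, cv=5).mean())
```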

Stacking:
Stacking combines multiple classification or regression models via a meta-classifier or meta-regressor. The base models are trained on the complete training set, and the meta-model is trained on the outputs of the base-level models, which are used as features.
https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
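
As a hedged sketch of stacking (the base models, meta-model, and dataset here are illustrative assumptions), scikit-learn's StackingClassifier wires the base learners and the meta-classifier together:

```python
# A minimal sketch of stacking, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),  # meta-classifier trained on base-model outputs
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```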
