Unit 4 Part 1

The document discusses the need for ensemble learning, emphasizing that no single algorithm is universally the best due to the No Free Lunch Theorem. It outlines methods for generating diverse learners and various model combination schemes, including voting, bagging, boosting, and stacking, each with its own advantages and use cases. Additionally, it touches on clustering as an unsupervised learning technique, highlighting the k-means algorithm and distance measures.

Need of Ensemble Learning

● The No Free Lunch Theorem states that there is no single learning algorithm that always induces the most accurate learner in every domain.
● Each learning algorithm dictates a certain model that comes with a set of
assumptions. This inductive bias leads to error if the assumptions do not
hold for the data.
● Learning is an ill-posed problem and with finite data, each algorithm
converges to a different solution and fails under different circumstances.
● By suitably combining multiple base-learners, accuracy can be improved.
Generating Diverse Learners
● The aim is to find a set of diverse learners that differ in their decisions so that they complement each other.
● At the same time, there cannot be a gain in overall success unless the
learners are accurate, at least in their domain of expertise.
● We therefore have this double task of maximizing individual accuracies and
the diversity between learners.

HOW TO GENERATE DIVERSE LEARNERS?
1. Different Algorithms
2. Different Hyperparameters
3. Different Input Representations (multi-view learning)
4. Different Training Sets
1. Different Algorithms
a. We can use different learning algorithms to train different base-learners.
b. Different algorithms make different assumptions about the data and lead to different classifiers.
c. For example, one base-learner may be parametric and another may be nonparametric.
d. When we decide on a single algorithm, we give emphasis to a single method and ignore all others.
e. By combining multiple learners based on multiple algorithms, we free ourselves from committing to a single method and no longer put all our eggs in one basket.
2. Different Hyperparameters
a. We can use the same learning algorithm but use it with different hyperparameters.
b. Examples are the number of hidden units in a multilayer perceptron, k in k-nearest neighbor, error
threshold in decision trees, the kernel function in support vector machines, and so forth.
3. Different Input Representations
a. Separate base-learners may be using different representations of the same input object or event,
making it possible to integrate different types of sensors/measurements/modalities.
b. Different representations make different characteristics explicit, allowing better identification.
c. In many applications, there are multiple sources of information, and it is desirable to use all of these
data to extract more information and achieve higher accuracy in prediction.
d. For example, in speech recognition, to recognize the uttered words, in addition to the acoustic
input, we can also use the video image of the speaker’s lips and shape of the mouth as the words are
spoken.
4. Different Training Sets
a. Another possibility is to train different base-learners on different subsets of the training set.
b. This can be done by drawing random training sets from the given sample; this is called bagging.
c. Or, the learners can be trained serially so that instances on which the preceding base-learners are not accurate are given more emphasis in training later base-learners; examples are boosting and cascading (see the sketch below).
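A minimal sketch of the last two ideas, assuming scikit-learn and NumPy are available: the same k-NN algorithm is trained with different values of k, each on its own bootstrap resample of an illustrative synthetic dataset. The data and the particular k values are assumptions made for the example, not part of these notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Illustrative synthetic dataset (not from the notes).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

rng = np.random.default_rng(0)
base_learners = []
for k in (1, 5, 15):                        # different hyperparameters: k in k-NN
    idx = rng.integers(0, len(X), len(X))   # different training set: bootstrap resample
    base_learners.append(KNeighborsClassifier(n_neighbors=k).fit(X[idx], y[idx]))

# The base-learners differ in some of their decisions and can complement each other.
predictions = np.array([m.predict(X[:10]) for m in base_learners])
print(predictions)  # one row per learner, one column per data point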
Model Combination Schemes

1. MULTIEXPERT COMBINATION
a. Multiexpert combination methods have base-learners that work in parallel.
b. These methods can in turn be divided into two:
i. In the global approach, also called learner fusion, given an input, all base-learners
generate an output and all these outputs are used. Examples are voting and stacking.
ii. In the local approach, or learner selection, for example, in mixture of experts, there is
a gating model, which looks at the input and chooses one (or very few) of the learners
as responsible for generating the output.
2. MULTISTAGE COMBINATION
a. Multistage combination methods use a serial approach where the next base-learner is trained with or
tested on only the instances where the previous base-learners are not accurate enough.
b. The idea is that the base-learners (or the different representations they use) are sorted in
increasing complexity so that a complex base-learner is not used (or its complex
representation is not extracted) unless the preceding simpler base-learners are not
confident. An example is cascading.
METHOD 1: VOTING
[Figure: three base models - a Decision Tree, a Random Forest, and a Support Vector Machine - each make a prediction, and the predictions are combined by voting]
Max Voting
The max voting method is generally used for classification problems. In this technique,
multiple models are used to make predictions for each data point. The predictions by
each model are considered as a ‘vote’. The prediction made by the majority of the models is used as the final prediction.

Averaging
Similar to the max voting technique, multiple predictions are made for each data point in
averaging. In this method, we take an average of predictions from all the models and use
it to make the final prediction. Averaging can be used for making predictions in
regression problems or while calculating probabilities for classification problems.

Weighted Average
This is an extension of the averaging method. Each model is assigned a weight that defines its importance in the final prediction.
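A hedged sketch of all three combination rules using scikit-learn's VotingClassifier; the choice of base models (logistic regression, decision tree, k-NN) and the weights are illustrative assumptions, not prescribed by the notes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base = [("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier())]

# Max voting: each model casts one vote and the majority class wins.
hard = VotingClassifier(estimators=base, voting="hard").fit(X, y)

# Averaging: the class probabilities of all models are averaged.
soft = VotingClassifier(estimators=base, voting="soft").fit(X, y)

# Weighted average: models contribute with different importance.
weighted = VotingClassifier(estimators=base, voting="soft",
                            weights=[2, 1, 1]).fit(X, y)

print(hard.predict(X[:3]), soft.predict(X[:3]), weighted.predict(X[:3]))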
BAGGING
Steps in Bagging
● Bagging is also known as Bootstrap aggregating. It consists of two steps:
bootstrapping and aggregation.
● Bootstrapping
○ Involves resampling subsets of data with replacement from an initial
dataset. In other words, subsets of data are taken from the initial dataset.
○ These subsets of data are called bootstrapped datasets or, simply,
bootstraps.
○ Resampled ‘with replacement’ means an individual data point can be
sampled multiple times. Each bootstrap dataset is used to train a weak
learner.
● Aggregating
○ The individual weak learners are trained independently from each other.
Each learner makes independent predictions.
○ The results of those predictions are aggregated at the end to get the overall
prediction. The predictions are aggregated using either max voting or
averaging.
Detailed Steps

The steps of bagging are as follows:

● We have an initial training dataset containing n instances.


● We create m subsets of data from the training set. Each subset contains n sample points drawn from the
initial dataset with replacement, which means that a specific data point can be sampled more than once.
● For each subset of data, we train the corresponding weak learners independently. These
models are homogeneous, meaning that they are of the same type.
● Each model makes a prediction.
● The predictions are aggregated into a single prediction. For this, either max voting or averaging
is used.
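The same procedure can be sketched with scikit-learn's BaggingClassifier, assuming scikit-learn >= 1.2 (where the base model argument is named estimator; older versions call it base_estimator). The dataset and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # homogeneous weak learners
    n_estimators=10,                     # m bootstrapped subsets / models
    max_samples=1.0,                     # each subset contains n sample points
    bootstrap=True,                      # sampled with replacement
    random_state=0,
).fit(X_tr, y_tr)

# Predictions of the individual trees are aggregated by voting.
print("bagged accuracy:", bag.score(X_te, y_te))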
Boosting
Steps
Boosting works with the following steps:

● We sample m subsets from an initial training dataset.
● Using the first subset, we train the first weak learner.
● We test the trained weak learner using the training data. As a result of the testing, some data points will be incorrectly predicted.
● Each data point with a wrong prediction is added to the second subset of data, so the subset is updated to emphasize the errors.
● Using this updated subset, we train and test the second weak learner.
● We continue with the following subsets until the total number of subsets is reached.
● We now have the total prediction. The overall prediction has already been aggregated at each step, so there is no need to calculate it separately.
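A sketch of this serial idea using AdaBoost, one standard boosting algorithm. The notes describe boosting generically; AdaBoost and the parameter values below are illustrative choices, and scikit-learn >= 1.2 is assumed for the estimator keyword.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner (a decision stump)
    n_estimators=50,      # number of sequential weak learners
    learning_rate=1.0,    # how strongly each learner's vote is weighted
    random_state=0,
).fit(X_tr, y_tr)

# Misclassified points get larger weights, so each new stump focuses on the
# previous stumps' mistakes; the final prediction is the accumulated weighted vote.
print("boosted accuracy:", boost.score(X_te, y_te))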


Stacking

[Figure: interview-process analogy for stacking, showing an Aptitude Round, Technical Round 1, Technical Round 2, and a final decision by the CEO]
Steps

● We use the initial training data to train m different algorithms.
● Using the output of each algorithm, we create a new training set.
● Using the new training set, we train a meta-model.
● Using the results of the meta-model, we make the final prediction. The results are combined using weighted averaging.
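A corresponding sketch with scikit-learn's StackingClassifier; the base models and the logistic-regression meta-model are illustrative choices, not prescribed by the notes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

base = [("rf", RandomForestClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("svm", SVC(probability=True, random_state=0))]

# The base models' cross-validated outputs form a new training set,
# on which the meta-model (final_estimator) is trained.
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5).fit(X, y)

print(stack.predict(X[:5]))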
Key Points
● If you want to reduce the overfitting or variance of your model, you use bagging. If
you are looking to reduce underfitting or bias, you use boosting. If you want to
increase predictive accuracy, use stacking.
● Bagging and boosting both work with homogeneous weak learners. Stacking works with heterogeneous, stronger learners.
● All three of these methods can work with either classification or regression
problems.
● One disadvantage of boosting is that it is prone to variance or overfitting. It is thus not advisable to use boosting for reducing variance; boosting does a worse job of reducing variance than bagging does.
● Conversely, it is not advisable to use bagging to reduce bias or underfitting, because bagging is more prone to bias and does not help reduce it.
● Stacked models have the advantage of better prediction accuracy than bagging or boosting. But
because they combine bagged or boosted models, they have the disadvantage of needing much
more time and computational power.
CLUSTERING ALGORITHM
Key Points
1. Clustering is the task of dividing unlabeled data points into clusters such that similar data points fall in the same cluster and dissimilar points fall in different clusters.
2. In simple words, the aim of the clustering process is to segregate groups with similar traits and assign them into clusters.
3. Clustering is important because it determines the intrinsic grouping among the unlabelled data.
4. There is no single criterion for good clustering; it depends on the user and on which criteria satisfy their need.
5. It is an unsupervised learning technique.
k-Means Algorithm

Step 1: Choose k initial centroids (in this example k = 3, with C1(2,10), C2(5,8), and C3(1,2)).
Step 2: Compute the distance of every point from the centroids.
Step 2 (continued): distance of each point from each centroid and the assigned (nearest) cluster.

Point     | Distance to C1 (2,10) | Distance to C2 (5,8) | Distance to C3 (1,2) | Assigned Cluster
A1 (2,10) |                       |                      |                      |
A2 (2,5)  |                       |                      |                      |
A3 (8,4)  |                       |                      |                      |
A4 (5,8)  |                       |                      |                      |
A5 (7,5)  |                       |                      |                      |
A6 (6,4)  |                       |                      |                      |
A7 (1,2)  |                       |                      |                      |
A8 (4,9)  |                       |                      |                      |

(The distance cells are filled in with the chosen distance measure; the sketch below computes them using Euclidean distance.)
Step 3: Update the centroids (each centroid becomes the mean of the points assigned to its cluster).
Step 4: Repeat Step 2 with the new centroids, until the cluster assignments no longer change.
Steps (summary): (1) choose k initial centroids; (2) assign each point to its nearest centroid; (3) recompute each centroid as the mean of its assigned points; (4) repeat from (2) until the assignments no longer change.
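A minimal from-scratch sketch of these steps in NumPy, applied to the points and initial centroids from the example above. Euclidean distance is assumed here as the distance measure; the notes list several alternatives.

```python
import numpy as np

# Points A1..A8 and initial centroids C1, C2, C3 from the example above.
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centroids = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)

for iteration in range(10):
    # Step 2: Euclidean distance of every point from every centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assigned = dists.argmin(axis=1)          # nearest centroid for each point

    # Step 3: update each centroid to the mean of its assigned points.
    new_centroids = np.array([points[assigned == k].mean(axis=0)
                              for k in range(len(centroids))])

    # Step 4: repeat until the centroids stop moving.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("assignments:", assigned + 1)   # cluster number (1-based) per point
print("centroids:\n", centroids)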
DISTANCE BETWEEN TWO POINTS

1. Euclidean
2. Manhattan
3. Cosine
4. Jaccard
Euclidean Distance: d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
Manhattan Distance: d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
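Minimal Python sketches of the four distance measures listed above; the function names and sample vectors are illustrative.

```python
import numpy as np

def euclidean(p, q):
    """Straight-line distance: square root of the sum of squared coordinate differences."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(np.abs(p - q))

def cosine_distance(p, q):
    """1 minus the cosine of the angle between the two vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

def jaccard_distance(a, b):
    """1 minus |intersection| / |union| of two sets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

print(euclidean([2, 10], [5, 8]))   # ~3.61 (A1 to C2 in the example above)
print(manhattan([2, 10], [5, 8]))   # 5.0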
