
Machine Learning

Dr. Sunil Saumya
IIIT Dharwad

Ensemble Methods: Bagging and Boosting
Ensemble Learning
● A common problem in machine learning is that individual models tend to perform poorly.
○ In other words, they tend to have low prediction accuracy.
● To mitigate this problem, we combine multiple models to obtain one with better performance. This process is known as ensemble learning.
○ The individual models that we combine are known as weak learners.
○ We call them weak learners because they have either high bias or high variance.
Ensemble Learning
● Ensemble learning improves a model's performance in mainly three ways:
○ By reducing the variance of weak learners
○ By reducing the bias of weak learners
○ By improving the overall accuracy of strong learners
● Approaches in ensemble learning (a side-by-side sketch of the three follows below):
○ Bagging: used to reduce the variance of weak learners
○ Boosting: used to reduce the bias of weak learners
○ Stacking: used to improve the overall accuracy of strong learners
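The sketch below shows the three approaches side by side on a synthetic dataset, assuming scikit-learn is available; the base estimators and parameter values are illustrative choices, not prescribed by these slides.

```python
# Bagging, boosting, and stacking on one synthetic dataset (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: average many high-variance trees to reduce variance
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# Boosting: fit weak learners sequentially to reduce bias
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
# Stacking: use base-model predictions as features for a meta-model
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression())

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```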
Bagging
● Voting or averaging of the predictions of multiple independently trained models.
Boosting
● Models are trained sequentially, each one focusing on the mistakes of the previous ones, and their predictions are combined with weights.
Stacking
● Use the predictions of multiple models as "features" to train a new model, and use the new model to make predictions on test data.
Random Forest
● Random Forest is one of the most popular and widely used algorithms among data scientists.
● Random Forest is a supervised machine learning algorithm that is used widely in classification and regression problems.
● It builds decision trees on different samples and takes their majority vote for classification or their average for regression.
● Random Forest uses the bagging approach of ensemble learning.
○ Bagging is also known as Bootstrap Aggregation.
Random Forest: algorithm
● Step 1: In the Random Forest model, a subset of data points and a subset of features is selected to construct each decision tree. This sampling step is called bootstrapping.
○ Simply put, n random records and m random features are taken from a data set having k records.
● Step 2: An individual decision tree is constructed for each sample.
● Step 3: Each decision tree generates an output.
● Step 4: The final output is based on majority voting for classification and on averaging for regression (see the sketch below).
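A minimal sketch of the four steps, assuming NumPy and scikit-learn; the sample sizes (50 rows and 3 features per tree) are illustrative. For simplicity this sketch samples the feature subset once per tree; a true Random Forest re-samples features at every split (node level), as noted later in these slides, which scikit-learn's RandomForestClassifier does via its max_features parameter.

```python
# Hand-rolled bootstrap + aggregation, following Steps 1-4 above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=6, random_state=0)
rng = np.random.default_rng(0)
n_trees, n_rows, m_feats = 25, 50, 3
trees, feat_sets = [], []

for _ in range(n_trees):
    # Step 1: bootstrap rows (with replacement) and sample a feature subset
    rows = rng.choice(len(X), size=n_rows, replace=True)
    feats = rng.choice(X.shape[1], size=m_feats, replace=False)
    # Step 2: build one decision tree per bootstrapped sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[rows][:, feats], y[rows]))
    feat_sets.append(feats)

# Step 3: every tree generates an output; Step 4: take the majority vote
all_preds = np.array([t.predict(X[:, f]) for t, f in zip(trees, feat_sets)])
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("ensemble training accuracy:", (majority_vote == y).mean())
```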
Random Forest: Bootstrapping
Bootstrapping:
● It is the process of sampling the dataset.
● The sampling can be done:
○ Row wise: with replacement or without replacement
○ Column wise
○ A combination of row and column sampling
● Running example: the original dataset has shape (100 × 6), i.e. 100 rows with features x1–x5 and a Vote label.
Random Forest: Bootstrapping (row-wise sampling)
From the original dataset of shape (100 × 6), 50 rows are sampled, giving a bootstrapped dataset of shape (50 × 6).

Without replacement (every sampled row is distinct):

  x1   x2   x3   x4   x5     Vote
  27   8    2    7    0.16   0
  40   42   2    9    0.26   1
  ...  ...  ...  ...  ...    ...
  14   13   32   6    0.87   0

With replacement (the same row can be picked more than once, e.g. (40, 42, 2, 9, 0.26, 1) below):

  x1   x2   x3   x4   x5     Vote
  2    25   8    5    0.34   0
  40   42   2    9    0.26   1
  ...  ...  ...  ...  ...    ...
  40   42   2    9    0.26   1
Random Forest: Bootstrapping (column-wise sampling)
Each bootstrapped dataset keeps all 100 rows but only a subset of the features plus the Vote label, e.g. shape (100 × 4) when 3 of the 5 features are sampled.

Dataset 1 (features x1, x3, x4):

  x1   x3   x4   Vote
  27   2    7    0
  40   2    9    1
  ...  ...  ...  ...
  14   32   6    0

Dataset 2 (features x2, x3, x5):

  x2   x3   x5     Vote
  25   8    0.34   0
  42   2    0.26   1
  ...  ...  ...    ...
  42   2    0.26   1
Random Forest: Bootstrapping (row + column wise sampling)
Here both strategies are combined: 50 rows and 3 of the 5 features (plus the Vote label) are sampled, giving bootstrapped datasets of shape (50 × 4).

Without replacement (features x1, x3, x4):

  x1   x3   x4   Vote
  27   2    7    0
  40   2    9    1
  ...  ...  ...  ...
  14   32   6    0

With replacement (features x1, x2, x5; the row (40, 42, 0.26, 1) repeats):

  x1   x2   x5     Vote
  2    25   0.34   0
  40   42   0.26   1
  ...  ...  ...    ...
  40   42   0.26   1
Random Forest: Aggregation
Aggregation:
● We aggregate the predictions of all individual decision trees.
○ In classification: we take a majority vote.
○ In regression: we take the average.
Random Forest: How does it perform so well?
● Each individual weak learner is exposed to only a subset of the instances (in the illustrated example, 1k out of 3k).
● The other 2k instances remain unseen by that decision tree.
● In general, roughly 30% of the data is kept unseen for every weak learner.
● RF performs well because it yields a low-bias, low-variance (LB-LV) model.
Random Forest Vs Bagging
Are Random Forest and Bagging same?
● Random Forest employs the bagging approach, but differs from bagging in
two ways:
○ Bagging is an ensemble of any ML model in general, but Random forest
is strictly an ensemble of decision trees.
○ Bagging employs a tree-level sampling strategy, whereas Random Forest
employs a node-level sampling strategy.
Random Forest: Important features
Example dataset: digit images of size 28 × 28, flattened into pixel features (one column per pixel; the middle columns are elided).

  Label  p0  p1  p2  p3  ...  ...  p782  p783  p784
  4      0   0   0   0   20   11   0     0     0
  9      0   0   0   0   186  0    0     0     0
  0      0   0   0   0   90   0    0     0     0
  6      0   0   0   0   54   90   0     0     0
  0      0   0   0   0   255  0    0     0     0
  4      0   0   0   0   0    87   0     0     0
Random Forest: Important features
● A trained random forest provides an importance value for every column in the dataset.
● Plotting these importances as a 28 × 28 heat map highlights the pixels the forest relies on most (a sketch of how to produce both follows below).
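A hedged sketch of how such an importance array and heat map can be produced, assuming scikit-learn and matplotlib; it uses the bundled 8 × 8 load_digits data in place of the 28 × 28 images shown on the slide.

```python
# Fit a forest on digit images and visualize per-pixel importances.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)            # 64 pixel features per image
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

importances = forest.feature_importances_      # one importance value per pixel column
plt.imshow(importances.reshape(8, 8), cmap="hot")
plt.title("Pixel importance heat map")
plt.colorbar()
plt.show()
```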
Random Forest: Hyperparameters
● n_estimators: the number of trees the algorithm builds before averaging the predictions.
● max_features: the maximum number of features considered when splitting a node.
● min_samples_leaf: the minimum number of samples required to be at a leaf node.
● criterion: how to split the node in each tree (entropy / Gini impurity / log loss).
● max_leaf_nodes: the maximum number of leaf nodes in each tree.
(A usage sketch follows below.)
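A usage sketch of these hyperparameters with scikit-learn's RandomForestClassifier; the concrete values are illustrative assumptions, not recommendations from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees built before aggregating
    max_features="sqrt",   # features considered when splitting a node
    min_samples_leaf=2,    # minimum samples required at a leaf
    criterion="gini",      # split criterion: "gini", "entropy", or "log_loss"
    max_leaf_nodes=50,     # maximum leaf nodes per tree
    random_state=0,
).fit(X, y)
print("training accuracy:", forest.score(X, y))
```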
Boosting: AdaBoost
● AdaBoost is a stage-wise additive method.
● Three important points to understand before starting the AdaBoost algorithm:
○ Weak Learners
○ Decision stumps
○ +1 and -1
● Weak Learners:
○ A weak learner produces a classifier which is only slightly more accurate
than random classification.
■ k-Nearest Neighbors, with k=1
■ Multi-Layer Perceptron, with a single node
■ Naive Bayes, operating on a single input variable.
Boosting: AdaBoost
● Three important points to understand before starting the AdaBoost algorithm:
○ Weak Learners
○ Decision stumps
○ +1 and -1
● Decision stumps:
○ A decision tree with a single split (one node) operating on one input variable; its output is used directly as the prediction (a one-line sketch follows below).
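A decision stump can be obtained, for example, as a scikit-learn decision tree capped at depth 1 (a minimal sketch with synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)   # one split, two leaves
print("stump accuracy (only slightly better than random):", stump.score(X, y))
```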
Boosting: AdaBoost
● Three important points to understand before starting the AdaBoost algorithm:
○ Weak Learners
○ Decision stumps
○ +1 and -1
● +1 and -1:
○ In AdaBoost:
■ For the positive class we use +1.
■ For the negative class we use -1.
● We do not use 0 for the negative class.
AdaBoost: geometric intuition
● The final classifier is a weighted combination of the individual weak learners h1, h2, h3 (with weights α1, α2, α3):

  h(x) = α1·h1(x) + α2·h2(x) + α3·h3(x)

● Each hi(x) outputs +1 or -1, and the sign of h(x) gives the final predicted class.
● Example: with weights α1 = 2, α2 = 3, α3 = 2 and individual predictions -1, +1, -1:

  h(x) = 2·(-1) + 3·(+1) + 2·(-1) = -2 + 3 - 2 = -1

so the combined prediction is the negative class.
AdaBoost: working example
Consider the initial dataset:

  X1  X2  Y
  3   7   1
  2   9   0
  1   4   1
  9   8   0
  3   7   0

(Original Dataset)
AdaBoost: working example
Give each row an initial weight of 1/n. Here the number of rows is n = 5, so every row starts with weight 0.2:

  X1  X2  Y  initial weight (= 1/n)
  3   7   1  0.2
  2   9   0  0.2
  1   4   1  0.2
  9   8   0  0.2
  3   7   0  0.2
AdaBoost: working example
Now we start with Stage 1 (Model 1):

  X1  X2  Y  weight
  3   7   1  0.2
  2   9   0  0.2
  1   4   1  0.2
  9   8   0  0.2
  3   7   0  0.2

● Create a decision stump for the given dataset.
● Suppose the decision stump we have created is: X1 > 5.
○ Only 1 data point (X1 = 9) falls on the X1 > 5 side; the other 4 data points fall on the X1 ≤ 5 side.
● Now make a prediction with this decision stump for every data point.
AdaBoost: working example
Stage 1 (Model 1):

  X1  X2  Y  Y-pred  weight
  3   7   1  1       0.2
  2   9   0  1       0.2
  1   4   1  0       0.2
  9   8   0  0       0.2
  3   7   0  0       0.2

● Looking at the Y-pred column, all predictions are correct except two: row 2 (Y = 0 but Y-pred = 1) and row 3 (Y = 1 but Y-pred = 0).
● Find α for Model 1.
○ α is the weight of Model 1 in the final prediction.
○ α depends on the error rate of Model 1.
■ The error rate measures how many mistakes Model 1 has made on the given dataset.
● If the error rate is low, α will be high, and vice versa.
AdaBoost: working example
Stage 1 (Model 1):
● Consider three possible error rates:
○ error rate (model a) = 0%
○ error rate (model b) = 100%
○ error rate (model c) = 50%
● Which of the above models is reliable?
○ Model a is clearly reliable: its error rate is 0%, so α will be large and positive.
○ Model b is also informative: its error rate is 100%, so α will be large and negative (its predictions can simply be inverted).
○ Model c is not reliable: its error rate is 50%, no better than random guessing, so α will be 0.
AdaBoost: working example
Stage 1 (Model 1): α is therefore computed from the error rate as

  α = ½ · ln((1 − error rate) / error rate)

which behaves as required:
  error rate (model a) = 0%   → α large and positive
  error rate (model b) = 100% → α large and negative
  error rate (model c) = 50%  → α = 0
AdaBoost: working example
Stage 1 (Model 1): error rate
● The error rate is the weighted sum of all misclassified data points.
● Two points (rows 2 and 3, each with weight 0.2) are misclassified, therefore:

  error rate (Model 1) = 0.2 + 0.2 = 0.4

  α1 = ½ · ln((1 − 0.4) / 0.4) = ½ · ln(0.6 / 0.4) = ½ · ln(1.5) ≈ ½ · 0.40 ≈ 0.20

● Therefore, Stage 1 gives α1 ≈ 0.20 (a numeric check follows below).

Stage 1 completes here.
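A quick numeric check of the Stage 1 numbers above, in plain Python:

```python
import math

weights = [0.2, 0.2, 0.2, 0.2, 0.2]
misclassified = [False, True, True, False, False]   # rows 2 and 3 are wrong

error_rate = sum(w for w, m in zip(weights, misclassified) if m)   # 0.4
alpha_1 = 0.5 * math.log((1 - error_rate) / error_rate)            # ~0.2027
print(error_rate, round(alpha_1, 2))
```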
AdaBoost: working example
● Next, we increase the weights of the misclassified data points before sending the data to Stage 2.
○ We use a technique called upsampling to increase the weight of specific data points.
○ By upsampling we mean boosting the weights of a few data points.

Weight update formula:
  For misclassified points:        new_weight = current_weight · e^(α1)
  For correctly classified points: new_weight = current_weight · e^(−α1)
AdaBoost: working example
Applying the weight update formula with α1 = 0.20:
  correctly classified rows: 0.2 · e^(−0.2) ≈ 0.16
  misclassified rows:        0.2 · e^(+0.2) ≈ 0.24

  X1  X2  Y  Y-pred  weight  updated weight
  3   7   1  1       0.2     0.16
  2   9   0  1       0.2     0.24
  1   4   1  0       0.2     0.24
  9   8   0  0       0.2     0.16
  3   7   0  0       0.2     0.16
AdaBoost: working example
The updated weights no longer sum to 1 (3 · 0.16 + 2 · 0.24 = 0.96), so each weight is normalized by dividing it by the total (a numeric sketch follows below):

  X1  X2  Y  Y-pred  weight  updated weight  normalized weight
  3   7   1  1       0.2     0.16            0.166
  2   9   0  1       0.2     0.24            0.25
  1   4   1  0       0.2     0.24            0.25
  9   8   0  0       0.2     0.16            0.166
  3   7   0  0       0.2     0.16            0.166
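A numeric sketch of the weight update and normalization above (plain Python; the exact normalized values come out as about 0.167 and 0.249, which the slide rounds to 0.166 and 0.25):

```python
import math

alpha_1 = 0.20
weights = [0.2] * 5
misclassified = [False, True, True, False, False]

updated = [w * math.exp(alpha_1 if m else -alpha_1)
           for w, m in zip(weights, misclassified)]
normalized = [w / sum(updated) for w in updated]
print([round(w, 3) for w in normalized])   # ≈ [0.167, 0.249, 0.249, 0.167, 0.167]
```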
AdaBoost: working example
Upsampling:
● Assign each row a range on [0, 1] whose width equals its normalized weight:

  X1  X2  Y  new weight  range
  3   7   1  0.166       0 – 0.166
  2   9   0  0.25        0.166 – 0.416
  1   4   1  0.25        0.416 – 0.666
  9   8   0  0.166       0.666 – 0.832
  3   7   0  0.166       0.832 – 1.0

● Generate n (= 5) random numbers between 0 and 1. Say the numbers are 0.13, 0.43, 0.62, 0.50 and 0.8.
● For every random number generated, choose the corresponding row (based on the range column) for the new stage's dataset.
AdaBoost: working example
Upsampling (continued):
● Map each random number to the row whose range contains it:
○ 0.13 → row 1
○ 0.43 → row 3
○ 0.62 → row 3
○ 0.50 → row 3
○ 0.8 → row 4
● Therefore, the new dataset for Stage 2 contains the rows: Row 1, Row 3, Row 3, Row 3, Row 4.
AdaBoost: working example
● Upsampling picks rows with larger ranges more often; hence those rows are boosted.

New dataset for Stage 2 (row 3 is dominant):

  X1  X2  Y
  3   7   1
  1   4   1
  1   4   1
  1   4   1
  9   8   0

● Repeat all of these steps for each of the n_estimators (a sketch of the resampling step follows below):
○ Create a decision stump on the new stage's data.
○ Calculate α2 for the new decision stump.
○ Find the new weight of each data point.
○ Find the new range based on the new weights.
○ Use upsampling to create the next dataset.
● At the end we have α1, α2, α3, …, αn (one per estimator).
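A minimal sketch of the upsampling (weighted resampling) step, assuming NumPy; here the random numbers come from a seeded generator rather than the slide's hand-picked values.

```python
import numpy as np

rows = np.array([[3, 7, 1],
                 [2, 9, 0],
                 [1, 4, 1],
                 [9, 8, 0],
                 [3, 7, 0]])                       # X1, X2, Y
norm_weights = np.array([0.166, 0.25, 0.25, 0.166, 0.166])
norm_weights = norm_weights / norm_weights.sum()   # make them sum exactly to 1

rng = np.random.default_rng(0)
cutoffs = np.cumsum(norm_weights)                  # the "range" column boundaries
randoms = rng.random(len(rows))                    # n random numbers in [0, 1)
chosen = np.searchsorted(cutoffs, randoms)         # map each number to a row index
new_dataset = rows[chosen]                         # heavier rows repeat more often
print(new_dataset)
```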
AdaBoost: working example
● At the end we have α1, α2, α3, …, αn (for n estimators).
● For a new data point, the final prediction is made with the following formula:

  h(x) = α1·h1(x) + α2·h2(x) + α3·h3(x) + … + αn·hn(x)

where h1(x), h2(x), …, hn(x) are the predictions (+1 or −1) of estimators h1, h2, …, hn respectively on the test data, and the sign of h(x) gives the final class (a library usage sketch follows below).
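For comparison, this whole procedure is what scikit-learn's AdaBoostClassifier automates; a minimal usage sketch follows (its internal AdaBoost variant may differ in detail from the hand computation above, and the parameter values are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),   # decision stumps as weak learners
    n_estimators=50,
    learning_rate=1.0,
    random_state=0,
).fit(X, y)
print("training accuracy:", ada.score(X, y))
```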
Gradient Boosting
● It is a boosting algorithm.
● It works by sequential, stage-wise addition of models.
● Consider the following dataset; we will create three estimators for this simple dataset:

  iq   cgpa  salary
  90   8     3
  100  7     4
  110  6     8
  120  9     6
  80   5     3

● Model 1 is simply the average of the output variable, also known as the leaf.
● Therefore, Model 1 prediction = (3 + 4 + 8 + 6 + 3) / 5 = 4.8.
● Hence, whatever the iq and cgpa, the Model 1 prediction will always be 4.8.
Gradient Boosting
● Model 1: the average of the output variable (the leaf).
● Therefore, Model 1 prediction = (3 + 4 + 8 + 6 + 3) / 5 = 4.8.
● Next, we calculate the loss of Model 1 using: pseudo_residual = actual − prediction

  iq   cgpa  salary  Pred1
  90   8     3       4.8
  100  7     4       4.8
  110  6     8       4.8
  120  9     6       4.8
  80   5     3       4.8
Gradient Boosting
● Calculate: pseudo_residual = actual − prediction

  iq   cgpa  salary  Pred1  res1
  90   8     3       4.8    -1.8
  100  7     4       4.8    -0.8
  110  6     8       4.8    3.2
  120  9     6       4.8    1.2
  80   5     3       4.8    -1.8

● Next, we transfer these errors to Model 2.
● We build Model 2 as a decision tree on the following dataset:
○ Input: iq and cgpa
○ Output: res1
Gradient Boosting
● Model 1 prediction = 4.8
● Model 2: construct a decision tree on (iq, cgpa) → res1:

  iq   cgpa  res1
  90   8     -1.8
  100  7     -0.8
  110  6     3.2
  120  9     1.2
  80   5     -1.8

● Model 2 decision tree:
○ Root split: iq <= 105
○ Left branch (iq <= 105): split on iq <= 95 → leaf -1.8 (iq = 90, 80) and leaf -0.8 (iq = 100)
○ Right branch (iq > 105): split on cgpa <= 7.5 → leaf 3.2 (iq = 110) and leaf 1.2 (iq = 120)
Gradient Boosting
● Model 1 prediction = 4.8
● Model 2 (the decision tree above) predicts each residual exactly:

  iq   cgpa  res1  Pred2
  90   8     -1.8  -1.8
  100  7     -0.8  -0.8
  110  6     3.2   3.2
  120  9     1.2   1.2
  80   5     -1.8  -1.8

● Now we are ready with two models. Let's do the gradient boosting predictions with these two models:

  Prediction = Model 1 prediction + Model 2 prediction
Gradient Boosting
● Combining the two models directly (Prediction = Model 1 prediction + Model 2 prediction):

  Row# (iq, cgpa)  Pred = M1 + M2
  1                4.8 - 1.8 = 3
  2                4.8 - 0.8 = 4
  3                4.8 + 3.2 = 8
  4                4.8 + 1.2 = 6
  5                4.8 - 1.8 = 3

● These are exactly the training targets: the model is overfitting. To control this, each new model's contribution is scaled by a learning rate (0.1 below).
Gradient Boosting
● Model 1 prediction = 4.8
● Model 2: the decision tree above, now scaled by a learning rate of 0.1.
● PredBoost = M1 + 0.1·M2, and pseudo_residual res2 = actual − PredBoost:

  iq   cgpa  salary  res1  Pred2  PredBoost (M1 + 0.1·M2)  res2
  90   8     3       -1.8  -1.8   4.62                     -1.62
  100  7     4       -0.8  -0.8   4.72                     -0.72
  110  6     8       3.2   3.2    5.12                     2.88
  120  9     6       1.2   1.2    4.92                     1.08
  80   5     3       -1.8  -1.8   4.62                     -1.62

● Here, every res2 is smaller in magnitude than the corresponding res1.
● This process is repeated: by adding a new model at each stage, the residuals approach zero (a code sketch of this loop follows below).
● Therefore, let's now build Model 3.
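A minimal sketch of this stage-wise loop, assuming scikit-learn's DecisionTreeRegressor for each stage's tree; the depth cap of 2 is an illustrative choice that happens to match the hand-built trees above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[90, 8], [100, 7], [110, 6], [120, 9], [80, 5]])  # iq, cgpa
y = np.array([3, 4, 8, 6, 3], dtype=float)                      # salary

learning_rate = 0.1
pred = np.full_like(y, y.mean())        # Model 1: the average (4.8)

for stage in range(2):                  # Models 2 and 3
    residual = y - pred                 # pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred = pred + learning_rate * tree.predict(X)

print(np.round(pred, 2))   # close to the slide's final PredBoost column
```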
Gradient Boosting
● Model 1 prediction = 4.8
● Model 2: our decision tree is ready.
● Model 3: construct a decision tree on (iq, cgpa) → res2:

  iq   cgpa  res2
  90   8     -1.62
  100  7     -0.72
  110  6     2.88
  120  9     1.08
  80   5     -1.62

● The Model 3 decision tree has the same structure as the Model 2 tree (root split iq <= 105, then iq <= 95 and cgpa <= 7.5), with leaves -1.62, -0.72, 2.88 and 1.08.
Gradient Boosting
● Model 1 prediction = 4.8
● Model 2: decision tree (ready).
● Model 3: its decision tree predicts each res2 exactly:

  iq   cgpa  res2   pred3
  90   8     -1.62  -1.62
  100  7     -0.72  -0.72
  110  6     2.88   2.88
  120  9     1.08   1.08
  80   5     -1.62  -1.62
Gradient Boosting
● Model 1 prediction = 4.8
● Final PredBoost = M1 prediction + 0.1 · (Model 2 prediction) + 0.1 · (Model 3 prediction)

  iq   cgpa  salary  res1  Pred2  PredBoost      res2   pred3  Final PredBoost            res3
                                  (M1 + 0.1·M2)                (M1 + 0.1·M2 + 0.1·M3)
  90   8     3       -1.8  -1.8   4.62           -1.62  -1.62  4.62 - 0.1·1.62 = 4.45     -1.45
  100  7     4       -0.8  -0.8   4.72           -0.72  -0.72  4.72 - 0.1·0.72 = 4.64     -0.64
  110  6     8       3.2   3.2    5.12           2.88   2.88   5.12 + 0.1·2.88 = 5.40     2.6
  120  9     6       1.2   1.2    4.92           1.08   1.08   4.92 + 0.1·1.08 = 5.02     0.98
  80   5     3       -1.8  -1.8   4.62           -1.62  -1.62  4.62 - 0.1·1.62 = 4.45     -1.45
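For comparison, scikit-learn's GradientBoostingRegressor automates the same stage-wise procedure; a hedged sketch on the toy dataset (its split choices may differ slightly from the hand-built trees, and the parameter values are illustrative).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X = np.array([[90, 8], [100, 7], [110, 6], [120, 9], [80, 5]])  # iq, cgpa
y = np.array([3, 4, 8, 6, 3], dtype=float)                      # salary

gbr = GradientBoostingRegressor(n_estimators=2, learning_rate=0.1,
                                max_depth=2, random_state=0).fit(X, y)
print(np.round(gbr.predict(X), 2))   # starts from the mean, then shrinks residuals
```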
