
Ensemble Learning

Danna Gurari
University of Texas at Austin
Spring 2021

https://www.ischool.utexas.edu/~dannag/Courses/IntroToMachineLearning/CourseContent.html
Review
• Last week:
• Evaluating Machine Learning Models Using Cross-Validation
• Naïve Bayes
• Support Vector Machines

• Assignments (Canvas):
• Problem set 4 due tonight
• Lab assignment 2 due next week
• Project pre-proposal due in two weeks (find a partner; brainstorm ideas)

• Questions?
Today’s Topics

• One-vs-all multiclass classification

• Classifier confidence

• Evaluation: ROC and PR-curves

• Ensemble learning
Recall: Binary vs Multiclass Classification
Binary: distinguish 2 classes Multiclass: distinguish 3+ classes

Figure Source: http://mlwiki.org/index.php/One-vs-All_Classification


Recall: Binary vs Multiclass Classification
Binary: distinguish 2 classes
• Perceptron
• Adaline
• Support Vector Machine

Multiclass: distinguish 3+ classes
• Nearest Neighbor
• Decision Tree
• Naïve Bayes
One-vs-All (aka, One-vs-Rest): Applying Binary
Classification Methods for Multiclass Classification
• Given ‘N’ classes, train ‘N’ different classifiers: a single classifier
trained per class, with the samples of that class as positive samples
and all other samples as negatives; e.g.,

Figure Source: http://mlwiki.org/index.php/One-vs-All_Classification
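To make this concrete, here is a minimal sketch (not from the slides) using scikit-learn's OneVsRestClassifier; the dataset and base classifier are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # N = 3 classes

# Trains N binary classifiers, one per class: that class's samples
# are the positives and all other samples are the negatives.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))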


One-vs-All (aka, One-vs-Rest): Limitation

• Often leads to unbalanced distributions during learning; i.e., when the set of negatives is much larger than the set of positives

Figure Source: http://mlwiki.org/index.php/One-vs-All_Classification


One-vs-All (aka, One-vs-Rest): Class Assignment

• (Imperfect) Approach: among the N classifiers, choose the most confident match; this requires each classifier to output a real-valued confidence score for its decision

Figure Source: http://mlwiki.org/index.php/One-vs-All_Classification
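A minimal sketch of this class-assignment rule, reusing the illustrative one-vs-rest setup from above:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# decision_function returns one real-valued confidence score per class;
# assign each sample to the class whose binary classifier is most confident.
scores = ovr.decision_function(X)
print(np.argmax(scores, axis=1)[:5])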


Today’s Topics

• One-vs-all multiclass classification

• Classifier confidence

• Evaluation: ROC and PR-curves

• Ensemble learning
Classifier Confidence: Beyond Classification

• Indicate both the predicted class and uncertainty about the choice

• When and why might you want to know about the uncertainty?
• e.g., weather forecast: 25% chance it will rain today
• e.g., medical treatment: when unconfident, start a patient on a drug at a
lower dose and decide later whether to change the medication or dose
Classifier Confidence: How to Measure for
K-Nearest Neighbors?
• Proportion of neighbors with label y; e.g.,
When K=3:

https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
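A hedged sketch of this with scikit-learn (the dataset choice is illustrative):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
# With K=3, each probability is the fraction of the 3 nearest neighbors
# voting for that class: 0, 1/3, 2/3, or 1.
print(knn.predict_proba(X_test[:5]))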
Classifier Confidence: How to Measure for
Decision Trees?
• Proportion of training samples with label y in the leaf reached by the test sample; e.g., a leaf with 14 "yes" and 6 "no" training samples predicts "yes" with confidence 14/20 = 0.7

[Tree diagram with leaves: 14 yes / 6 no; 100 no; 120 yes; 30 yes / 70 no]


Classifier Confidence: How to Measure for
Naïve Bayes?

• Conditional probability P (Y|X) for the most probable class
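For reference, the standard naïve Bayes posterior (not shown on the slide) is the normalized product of the prior and the per-feature likelihoods:

$$P(y \mid \mathbf{x}) = \frac{P(y) \prod_j P(x_j \mid y)}{\sum_{y'} P(y') \prod_j P(x_j \mid y')}$$

The classifier's confidence is this posterior evaluated at the most probable class.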


Classifier Confidence: How to Measure for
Support Vector Machines?
• Distance to the hyperplane: e.g.,

http://chem-eng.utoronto.ca/~datamining/dmc/support_vector_machine.htm
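A minimal sketch of this idea, assuming a linear SVM in scikit-learn (decision_function returns the signed value w·x + b; dividing by ||w|| gives the geometric distance):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
svm = SVC(kernel="linear").fit(X, y)

# Signed distance of each sample to the separating hyperplane;
# larger magnitude = farther from the boundary = more confident.
distances = svm.decision_function(X) / np.linalg.norm(svm.coef_)
print(distances[:5])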
Classifier Confidence vs Probability

• Classifiers can make mistakes in estimating their confidence level

• External calibration procedures can address this issue (e.g., using calibration curves/reliability diagrams)
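A minimal sketch of one such procedure, assuming scikit-learn's CalibratedClassifierCV (Platt/sigmoid scaling is one illustrative choice):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Fits a sigmoid mapping from the SVM's scores to probabilities
# via cross-validation, so predict_proba is better calibrated.
clf = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5).fit(X, y)
print(clf.predict_proba(X[:3]))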
Today’s Topics

• One-vs-all multiclass classification

• Classifier confidence

• Evaluation: ROC and PR-curves

• Ensemble learning
Classification from a Classifier’s Confidence

• Observation: a threshold must be chosen to define the point at which the example belongs to a class or not

• Motivation: how to choose the threshold?
• Default is 0.5
• Yet, it can be tuned to avoid different types of errors
Review: Confusion Matrix for Binary Classification

Python Machine Learning; Raschka & Mirjalili


Receiver Operating Characteristic (ROC) curve
Summarizes performance based on the positive class:
- A positive prediction is either correct (TP) or not (FP)
- TPR = TP / (TP + FN); FPR = FP / (FP + TN)
Receiver Operating Characteristic (ROC) curve
Summarizes performance based on the positive class:
- A positive prediction is either correct (TP) or not (FP)

To create, vary the prediction threshold and compute TPR and FPR for each threshold.

[Plot: TPR on the y-axis (0 to 1) vs. FPR on the x-axis (0 to 1)]
Receiver Operating Characteristic (ROC) curve
What is the coordinate for a perfect predictor?

Summarizes performance based on the positive class:
- A positive prediction is either correct (TP) or not (FP)

[Plot: TPR on the y-axis (0 to 1) vs. FPR on the x-axis (0 to 1)]
ROC Curve: Area Under Curve (AUC)
Which of the first three methods performs best overall?

Summarizes performance based on the positive class:
- A positive prediction is either correct (TP) or not (FP)

Python Machine Learning; Raschka & Mirjalili
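A minimal sketch of this threshold sweep with scikit-learn (the scores are made-up toy values):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])

# roc_curve sweeps the decision threshold over the scores and
# returns the (FPR, TPR) pair at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)
print("AUC:", roc_auc_score(y_true, y_score))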


ROC Curve: Multiclass Classification
• Plot curve per class:

https://stackoverflow.com/questions/56090541/how-to-plot-precision-and-recall-of-multiclass-classifier
Precision-Recall (PR) Curve

Summarizes performance based only on the positive class (ignores true negatives):
- Precision = TP / (TP + FP); Recall = TP / (TP + FN)
Precision-Recall (PR) Curve
Summarizes performance based only on the positive class (ignores true negatives).

To create, vary the prediction threshold and compute precision and recall for each threshold.

[Plot: precision on the y-axis (0 to 1) vs. recall on the x-axis (0 to 1)]
Precision-Recall (PR) Curve
What is the coordinate for a perfect predictor?

Summarizes performance based only on the positive class (ignores true negatives).

[Plot: precision on the y-axis (0 to 1) vs. recall on the x-axis (0 to 1)]
PR Curve: Area Under Curve (AUC)

• Which classifier is the best?
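Similarly, a minimal sketch of computing a PR curve and its area (average precision) with scikit-learn, on the same made-up toy scores:

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])

# precision_recall_curve sweeps the threshold and returns the
# (precision, recall) pair at each threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("Average precision:", average_precision_score(y_true, y_score))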


PR Curve: Multiclass Classification
• Plot curve per class:

https://stackoverflow.com/questions/56090541/how-to-plot-precision-and-recall-of-multiclass-classifier
Group Discussion: Evaluation Curves
1. Assume you are building a classifier for these applications:
• Detecting offensive content online
• Medical diagnoses
• Detecting shoplifters
• Deciding whether a person is guilty of a crime
What classifier threshold would you choose for each application and why?

2. When would you choose to evaluate with a PR curve versus a ROC curve?

3. What is the area under the ROC and PR curves for a perfect classifier?

[Plot: PR curve with precision on the y-axis (0 to 1) and recall on the x-axis (0 to 1); assume thresholds of 0, 0.25, 0.5, 0.75, and 1 were used to create the curve]
Today’s Topics

• One-vs-all multiclass classification

• Classifier confidence

• Evaluation: ROC and PR-curves

• Ensemble learning
Idea: How Many Predictors to Use?

More than 1: Ensemble


Why Choose Ensemble Instead of an Algorithm?
• Reduces the probability of making a wrong prediction, assuming:
• Classifiers are independent (not true in practice!)
• Suppose:
• n classifiers for binary classification task
• Each classifier has same error rate
• Probability mass function indicates the probability of error from an ensemble:

$$P(\text{error}) = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} \, \varepsilon^{k} (1-\varepsilon)^{n-k}$$

where n is the number of classifiers, ε is each classifier's error rate, and the binomial coefficient counts the ways to choose k subsets from a set of size n.
• e.g., n = 11, ε = 0.25: summing from k = 6 (a majority) gives an error probability of ~0.034, which is much lower than the probability of error from a single algorithm (0.25)
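A short sketch that evaluates this formula directly (requires Python 3.8+ for math.comb):

from math import comb

def ensemble_error(n, eps):
    # Probability that a majority of n independent classifiers,
    # each with error rate eps, are simultaneously wrong.
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(ensemble_error(11, 0.25))  # ~0.034, vs 0.25 for a single classifier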
Why Choose Ensemble Instead of an Algorithm?

How to get diverse classifiers?
1. Use different algorithms
2. Use different features
3. Use different training data
How to Predict with an Ensemble?
• Majority Voting
• Return most popular prediction from multiple prediction algorithms

• Bootstrap Aggregation, aka Bagging


• Resample data to train algorithm on different random subsets

• Boosting
• Reweight data to train algorithms to specialize on different “hard” examples

• Stacking
• Train a model that learns how to aggregate classifiers’ predictions
Historical Context of ML Models

[Timeline: 1613 human "computers"; early 1800s linear regression; 1945 first programmable machine; 1950 Turing Test; k-nearest neighbors; 1956 AI; 1957 perceptron; 1959 machine learning; naïve Bayes; 1962 decision trees; 1st AI winter (1974-1980); 2nd AI winter (1987-1993); SVM; rise of ensemble learning]
How to Predict with an Ensemble of Algorithms?
• Majority Voting
• Return most popular prediction from multiple prediction algorithms

• Bootstrap Aggregation, aka Bagging


• Train algorithm repeatedly on different random subsets of the training set

• Boosting
• Train algorithms that each specialize on different “hard” training examples

• Stacking
• Train a model that learns how to aggregate classifiers’ predictions
Majority Voting

Figure Credit: Raschka & Mirjalili, Python Machine Learning.


Majority Voting

Prediction Model Prediction Model Prediction Model

Prediction Prediction Prediction

Majority Vote
Majority Voting: Binary Task
e.g., “Is it sunny today?”

Prediction Model Prediction Model Prediction Model Prediction Model

“Yes” “No” “Yes” “Yes”

“Yes”
Majority Voting: “Soft” (not “Hard”)

Prediction Model Prediction Model Prediction Model

Probability Probability Probability

Majority Vote
Majority Voting: Soft Voting on Binary Task
e.g., “Is it sunny today?”

Prediction Model Prediction Model Prediction Model Prediction Model

90% “Yes” 20% Yes 55% “Yes” 45% “Yes”

“Yes” (210/4 = 52.5% Yes)
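A minimal sketch of hard vs. soft voting with scikit-learn's VotingClassifier (the base models are illustrative choices, not from the slides):

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# voting="soft" averages the models' predicted probabilities;
# voting="hard" takes the majority of their predicted labels.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("knn", KNeighborsClassifier()),
                ("dt", DecisionTreeClassifier())],
    voting="soft",
).fit(X, y)
print(ensemble.predict(X[:5]))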


Plurality Voting: Non-Binary Task
e.g., “What object is in the image?”

Prediction Model Prediction Model Prediction Model Prediction Model

“Cat” “Dog” “Pig” “Cat”

“Cat”
Majority Voting: Regression
e.g., “Is it sunny today?”

Prediction Model Prediction Model Prediction Model Prediction Model

90% “Yes” 20% Yes 55% “Yes” 45% “Yes”

52.5% (average prediction)


Majority Voting: Example of Decision Boundary

Figure Credit: Raschka & Mirjalili, Python Machine Learning.


How to Predict with an Ensemble of Algorithms?
• Majority Voting
• Return most popular prediction from multiple prediction algorithms

• Bootstrap Aggregation, aka Bagging


• Train algorithm repeatedly on different random subsets of the training set

• Boosting
• Train algorithms that each specialize on different “hard” training examples

• Stacking
• Train a model that learns how to aggregate classifiers’ predictions
Bagging

Figure Credit: Raschka & Mirjalili, Python Machine Learning.


Bagging: Training
• Build ensemble from “bootstrap samples” drawn with replacement
• e.g.,
[Figure: bootstrap sampling rounds. Duplicate data can occur within a round's training set; some examples are missing from a round's training data (e.g., round 1). Each classifier is trained on a different subset of the data.]

Breiman, Bagging Predictors, 1994.
Ho, Random Decision Forests, 1995.
Figure Credit: Raschka & Mirjalili, Python Machine Learning.
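A minimal sketch of one bootstrap round with NumPy (illustrative, not the slides' figure):

import numpy as np

rng = np.random.default_rng(0)
indices = np.arange(10)  # indices of 10 training examples

# Sample n indices *with replacement*: duplicates can occur, and
# the unsampled ("out-of-bag") examples are missing from this round.
sample = rng.choice(indices, size=indices.size, replace=True)
out_of_bag = np.setdiff1d(indices, sample)
print(sample, out_of_bag)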
Bagging: Training
• Build ensemble from “bootstrap samples” drawn with replacement
• e.g.,

Class demo: pick a number from the bag

Breiman, Bagging Predictors, 1994.
Ho, Random Decision Forests, 1995.
Figure Credit: Raschka & Mirjalili, Python Machine Learning.
Bagging: Predicting

Prediction Model Prediction Model Prediction Model Prediction Model

• Predict as done for “majority voting”


• e.g., “hard” voting
• e.g., “soft” voting
• e.g., averaging values for regression
Bagging: Random Forest
• Build ensemble from “bootstrap samples” drawn with replacement
• e.g.,

Fit decision trees by also selecting random feature subsets

Breiman, Bagging Predictors, 1994.


Ho, Random Decision Forests, 1995.
Figure Credit: Raschka & Mirjalili, Python Machine Learning.
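A minimal sketch with scikit-learn's RandomForestClassifier; max_features controls the random feature subset considered at each split:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# 100 trees, each fit on a bootstrap sample of the training data
# and restricted to sqrt(n_features) candidate features per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.score(X, y))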
Bagging: Intuition (Train an 8 detector)

Goodfellow et al., Deep Learning (chapter 7), 2016.


Bagging: Intuition (Train an 8 detector)

Goodfellow et al., Deep Learning (chapter 7), 2016.


Bagging: Intuition (Train an 8 detector)

Goodfellow et al., Deep Learning (chapter 7), 2016.


How to Predict with an Ensemble of Algorithms?
• Majority Voting
• Return most popular prediction from multiple prediction algorithms

• Bootstrap Aggregation, aka Bagging


• Train algorithm repeatedly on different random subsets of the training set

• Boosting
• Train algorithms that each specialize on different “hard” training examples

• Stacking
• Train a model that learns how to aggregate classifiers’ predictions
Boosting
• Key idea: sequentially train predictors that each try to correctly
predict examples that were hard for previous predictors

• Original Algorithm:
• Train classifier 1: use random subset of examples without replacement
• Train classifier 2: use a second random subset of examples without
replacement and add 50% of examples misclassified by classifier 1
• Train classifier 3: use examples that classifiers 1 and 2 disagree on
• Predict using majority vote from 3 classifiers
Boosting – AdaBoost (Adaptive Boosting)

1. Assign equal weights to all examples
2. Assign larger weights to previous misclassifications; assign smaller weights to previous correct classifications
3. Assign larger weights to training samples C1 and C2 disagree on; assign smaller weights to previous correct classifications
4. Predict with weighted majority vote

Freund and Schapire, Experiments with a New Boosting Algorithm, 1996. Raschka and Mirjalili; Python Machine Learning
Boosting – AdaBoost (Adaptive Boosting)

[Figure: 1D dataset example. Round 1: training data, weights, and predictions. Round 2: updated weights.]

Raschka and Mirjalili; Python Machine Learning
Boosting – AdaBoost (Adaptive Boosting)
e.g., 1d dataset (the standard AdaBoost updates, stated for labels in {-1, +1}):

1. Compute error rate (sum of the misclassified examples' weights): $\varepsilon = \sum_i w_i \, \mathbb{1}(\hat{y}_i \neq y_i)$

2. Compute the coefficient used to update weights and make the weighted majority vote prediction: $\alpha = \frac{1}{2} \ln \frac{1 - \varepsilon}{\varepsilon}$

3. Update the weight vector: $w_i := w_i \times e^{-\alpha \, \hat{y}_i y_i}$
• Correct predictions decrease an example's weight and vice versa

4. Normalize the weights to sum to 1: $w_i := w_i / \sum_j w_j$

Raschka and Mirjalili; Python Machine Learning
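A minimal NumPy sketch of one round of these updates, on made-up toy labels:

import numpy as np

y_true = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
y_pred = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])  # 3 mistakes
w = np.full(y_true.size, 0.1)                             # equal initial weights

eps = w[y_pred != y_true].sum()           # 1. weighted error rate = 0.3
alpha = 0.5 * np.log((1 - eps) / eps)     # 2. classifier coefficient
w = w * np.exp(-alpha * y_true * y_pred)  # 3. raise weights of mistakes
w = w / w.sum()                           # 4. normalize to sum to 1
print(eps, alpha, w)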
Boosting – AdaBoost (Adaptive Boosting)

To predict, use the $\alpha$ calculated for each classifier as its weight when voting with all trained classifiers.

Idea: weight each classifier's prediction based on the accuracy it had on the training dataset.

Raschka and Mirjalili; Python Machine Learning


How to Predict with an Ensemble of Algorithms?
• Majority Voting
• Return most popular prediction from multiple prediction algorithms

• Bootstrap Aggregation, aka Bagging


• Train algorithm repeatedly on different random subsets of the training set

• Boosting
• Train algorithms that each specialize on different “hard” training examples

• Stacking
• Train a model that learns how to aggregate classifiers’ predictions
Stacked Generalization, aka Stacking
• Train meta-learner to learn the optimal weighting of each classifiers’
predictions for making the final prediction
• Algorithm:
1. Split the dataset into three disjoint sets.
2. Train several base learners on the first partition.
3. Test the base learners on the second and third partitions.
4. Train the meta-learner on the second partition, using the classifiers' predictions as features.
5. Evaluate the meta-learner on the third partition, using the classifiers' predictions as features.

David H. Wolpert, Stacked Generalization, 1992.


Tutorial: http://blog.kaggle.com/2017/06/15/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/
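scikit-learn's StackingClassifier is a related variant that replaces the fixed three-way split above with internal cross-validation; a minimal sketch with illustrative base learners:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# The base learners' cross-validated predictions become the
# features on which the logistic-regression meta-learner is trained.
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("dt", DecisionTreeClassifier())],
    final_estimator=LogisticRegression(),
).fit(X, y)
print(stack.score(X, y))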
Ensemble Learner Won Netflix Prize “Challenge”
• In 2009 challenge, winning team won $1 million using ensemble approach:
• https://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf
• Dataset: 5-star ratings on 17770 movies from 480189 “anonymous” users collected
by Netflix over ~7 years. In total, the number of ratings is 100,480,507.

• Netflix did not use the ensemble recommendation system. Why?


• “We evaluated some of the new methods offline but the additional accuracy gains
that we measured did not seem to justify the engineering effort needed to bring
them into a production environment” - https://medium.com/netflix-techblog/netflix-
recommendations-beyond-the-5-stars-part-1-55838468f429
• Computationally slow and complex due to the "sequential" training of learners
Yehuda Koren, The BellKor Solution to the Netflix Grand Prize, 2009.
Today’s Topics

• One-vs-all multiclass classification

• Classifier confidence

• Evaluation: ROC and PR-curves

• Ensemble learning
