06 EnsembleLearning
Danna Gurari
University of Texas at Austin
Spring 2021
https://www.ischool.utexas.edu/~dannag/Courses/IntroToMachineLearning/CourseContent.html
Review
• Last week:
• Evaluating Machine Learning Models Using Cross-Validation
• Naïve Bayes
• Support Vector Machines
• Assignments (Canvas):
• Problem set 4 due tonight
• Lab assignment 2 due next week
• Project pre-proposal due in two weeks (find a partner and brainstorm ideas)
• Questions?
Today’s Topics
• Classifier confidence
• Ensemble learning
Recall: Binary vs Multiclass Classification
Binary: distinguish 2 classes
Multiclass: distinguish 3+ classes
• Classifier confidence
• Ensemble learning
Classifier Confidence: Beyond Classification
• Indicate both the predicted class and uncertainty about the choice
• When and why might you want to know about the uncertainty?
• e.g., weather forecast: 25% chance it will rain today
• e.g., medical treatment: when unconfident, start a patient on a drug at a
lower dose and decide later whether to change the medication or dose
Classifier Confidence: How to Measure for
K-Nearest Neighbors?
• Proportion of neighbors with label y; e.g., when K=3:
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
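As a sketch of this idea (toy data assumed, not from the slides), scikit-learn's `KNeighborsClassifier` exposes the neighbor-vote proportion directly through `predict_proba`:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D dataset: class 0 clusters near 0, class 1 near 10
X = np.array([[0.0], [1.0], [2.0], [9.0], [10.0], [11.0]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# A query at 1.5: all 3 nearest neighbors are class 0, so confidence is 3/3
print(knn.predict_proba([[1.5]]))  # [[1. 0.]]

# A query at 6.0: nearest neighbors are at 9, 2, 10 -> 2 of 3 are class 1
print(knn.predict_proba([[6.0]]))
```

So the "probability" reported for KNN is simply the fraction of the K neighbors carrying each label.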
Classifier Confidence: How to Measure for
Decision Trees?
• Proportion of training samples with label y in the leaf reached by the test sample; e.g.,
http://chem-eng.utoronto.ca/~datamining/dmc/support_vector_machine.htm
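A minimal sketch of the same measure for trees (toy data assumed): with a depth-1 tree, `predict_proba` returns the label proportions of the leaf the test sample lands in.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 0])

# With max_depth=1 the best split is x <= 2.5; the right leaf then holds
# labels {1, 1, 0}, so its class-1 confidence is 2/3
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

print(tree.predict_proba([[4.0]]))  # right leaf: roughly [[0.333 0.667]]
print(tree.predict_proba([[1.0]]))  # left leaf is pure: [[1. 0.]]
```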
Classifier Confidence vs Probability
• Classifier confidence
• Ensemble learning
Classification from a Classifier’s Confidence
Receiver Operating Characteristic (ROC) curve: plots TPR (y-axis, 0 to 1) against FPR (x-axis, 0 to 1)
What is the coordinate for a perfect predictor?
Summarizes performance based on the positive class
- A positive prediction is either correct (TP) or not (FP)
ROC Curve: Area Under Curve (AUC)
Which of the first three methods performs best overall?
Summarizes performance based on the positive class
- A positive prediction is either correct (TP) or not (FP)
https://stackoverflow.com/questions/56090541/how-to-plot-precision-and-recall-of-multiclass-classifier
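As a sketch (toy scores assumed, not from the slides), sweeping a threshold over a classifier's confidences traces the ROC curve, and the area under it summarizes how well the classifier ranks positives above negatives:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # classifier confidences for class 1

# Each threshold yields one (FPR, TPR) point; a perfect predictor
# reaches the top-left corner (FPR=0, TPR=1) and has AUC = 1.0
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(auc)  # 0.75: one positive is ranked below one negative
```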
Precision-Recall (PR) Curve
Plots Precision (y-axis, 0 to 1) against Recall (x-axis, 0 to 1)
What is the coordinate for a perfect predictor?
PR Curve: Area Under Curve (AUC)
https://stackoverflow.com/questions/56090541/how-to-plot-precision-and-recall-of-multiclass-classifier
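The PR analogue can be sketched the same way (same toy scores assumed); average precision is the standard summary of the area under the PR curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # classifier confidences for class 1

# Each threshold yields one (recall, precision) point; a perfect
# predictor reaches the top-right corner (recall=1, precision=1)
precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)
print(ap)
```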
Group Discussion: Evaluation Curves
1. Assume you are building a classifier for these Assume the following thresholds were used
applications: to create the curve: 0, 0.25, 0.5, 0.75, 1.
• Detecting offensive content online
• Medical diagnoses 1
• Detecting shoplifters
• Deciding whether a person is guilty of a crime
What classifier threshold would you choose for
each application and why?
Precision
2. When would you choose to evaluate with a
PR curve versus a ROC curve?
• Classifier confidence
• Ensemble learning
Idea: How Many Predictors to Use?
• Boosting
• Reweight data to train algorithms to specialize on different “hard” examples
• Stacking
• Train a model that learns how to aggregate classifiers’ predictions
Historical Context of ML Models
• 1613: Human “Computers”
• Early 1800s: Linear regression
• 1945: First programmable machine
• 1950: Turing Test
• K-nearest neighbors
• 1956: AI
• 1957: Perceptron
• 1959: Machine Learning
• Naïve Bayes
• 1962: Decision Trees
• 1974–1980: 1st AI Winter
• 1987–1993: 2nd AI Winter
• SVM
• Rise of Ensemble Learning
How to Predict with an Ensemble of Algorithms?
• Majority Voting
• Return most popular prediction from multiple prediction algorithms
• Boosting
• Train algorithms that each specialize on different “hard” training examples
• Stacking
• Train a model that learns how to aggregate classifiers’ predictions
Majority Voting
Majority Vote
Majority Voting: Binary Task
e.g., “Is it sunny today?”
“Yes”
Majority Voting: “Soft” (not “Hard”)
Majority Vote
Majority Voting: Soft Voting on Binary Task
e.g., “Is it sunny today?”
“Yes”
Majority Voting: Regression
e.g., average the predictors’ numeric outputs rather than voting
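A sketch of both voting modes (dataset and base learners assumed, not from the slides): hard voting returns the most popular class, while soft voting averages the classifiers' predicted probabilities before taking the argmax.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("nb", GaussianNB()),
              ("dt", DecisionTreeClassifier(random_state=0))]

# Hard voting: majority of the three predicted labels
hard = VotingClassifier(estimators, voting="hard").fit(X, y)
# Soft voting: average the three predict_proba outputs, then argmax
soft = VotingClassifier(estimators, voting="soft").fit(X, y)

print(hard.score(X, y), soft.score(X, y))
```

Soft voting requires every base learner to implement `predict_proba`, but it lets confident classifiers outweigh uncertain ones.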
• Boosting
• Train algorithms that each specialize on different “hard” training examples
• Stacking
• Train a model that learns how to aggregate classifiers’ predictions
Bagging
• Train each predictor on a random sample of the training data drawn with replacement, so some examples are missing from each round’s training data (e.g., round 1)
Class Demo:
- Pick a number from the bag
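A minimal sketch (dataset assumed): scikit-learn's `BaggingClassifier` draws a bootstrap sample, with replacement, for each tree, and the examples a given tree never saw can score it "out of bag".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each of the 50 trees trains on a bootstrap sample of the 200 examples;
# on average ~37% of examples are left out of any one sample
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, oob_score=True,
                        random_state=0).fit(X, y)

# Out-of-bag score: each example is evaluated only by trees that never saw it
print(bag.oob_score_)
```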
• Boosting
• Train algorithms that each specialize on different “hard” training examples
• Stacking
• Train a model that learns how to aggregate classifiers’ predictions
Boosting
• Key idea: sequentially train predictors that each try to correctly
predict examples that were hard for previous predictors
• Original Algorithm:
• Train classifier 1: use random subset of examples without replacement
• Train classifier 2: use a second random subset of examples without
replacement and add 50% of examples misclassified by classifier 1
• Train classifier 3: use examples that classifiers 1 and 2 disagree on
• Predict using majority vote from 3 classifiers
Boosting – Adaboost (Adaptive Boosting)
Assign equal weights • Assign larger weights to • Assign larger weights to Predict with weighted
to all examples previous misclassifications training samples C1 and C2 majority vote
disagree on
• Assign smaller weights to
previous correct • Assign smaller weights to
classifications previous correct
classifications
Freund and Schapire, Experiments with a New Boosting Algorithm, 1996. Raschka and Mirjalili; Python Machine Learning
Boosting – Adaboost (Adaptive Boosting)
e.g., 1d dataset
Round 1: training data, weights, predictions
Round 2: update weights
Raschka and Mirjalili; Python Machine Learning
Boosting – Adaboost (Adaptive Boosting)
e.g., 1d dataset
1. Compute error rate (sum of the misclassified examples’ weights): ε = Σᵢ wᵢ over all misclassified examples i
2. Compute the classifier’s voting weight: α = ½ ln((1 − ε) / ε)
To predict, use the α calculated for each classifier as its weight when voting with all trained classifiers.
Idea: value each classifier’s prediction based on the accuracy it had on the training dataset.
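This reweighting scheme is what scikit-learn's `AdaBoostClassifier` implements; a sketch on an assumed toy dataset (the default weak learner is a depth-1 decision stump):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)

# 50 sequentially trained stumps; each round reweights the training
# examples so the next stump focuses on the previous mistakes
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# estimator_weights_ holds each weak learner's voting weight (the alphas):
# more accurate stumps get a larger say in the weighted majority vote
print(ada.score(X, y))
print(ada.estimator_weights_[:3])
```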
• Boosting
• Train algorithms that each specialize on different “hard” training examples
• Stacking
• Train a model that learns how to aggregate classifiers’ predictions
Stacked Generalization, aka Stacking
• Train a meta-learner to learn the optimal weighting of each classifier’s predictions for making the final prediction
• Algorithm:
1. Split dataset into three disjoint sets.
2. Train several base learners on the first partition.
3. Test the base learners on the second partition and third partition.
4. Train meta-learner on second partition using classifiers’ predictions as features
5. Evaluate meta-learner on third partition using classifiers’ predictions as features
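The algorithm above can be sketched with scikit-learn's `StackingClassifier` (dataset and base learners assumed; note scikit-learn uses cross-validated predictions in place of a single fixed second partition):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learners' (cross-validated) predictions become the features
# on which the meta-learner (final_estimator) is trained
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(),  # the meta-learner
).fit(X_train, y_train)

# Evaluate on the held-out partition
print(stack.score(X_test, y_test))
```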
• Classifier confidence
• Ensemble learning