The document provides an overview of Ensemble Learning, highlighting its motivation, basic and advanced techniques, including Bagging and Boosting. It explains how combining multiple models can enhance predictive accuracy and outlines various methods such as Max Voting, Averaging, Stacking, and Blending. Additionally, it details specific algorithms like AdaBoost, Gradient Boosting, and XGBoost, emphasizing their applications and differences in handling data.
Ensemble Learning
Instructor: Dr. Umara Zahid
MSCS Fall 2022

Agenda
• Motivation
• Introduction to Ensemble Learning
• Basic Ensemble Techniques
  • Max Voting
  • Averaging
  • Weighted Average
• Advanced Ensemble Techniques
  • Stacking
  • Blending
  • Bagging
  • Boosting
• Algorithms based on Bagging and Boosting
  • Bagging meta-estimator
  • Random Forest
  • AdaBoost
  • GBM
  • XGB
  • Light GBM
  • CatBoost

Motivation
• You have to buy something. How do you do it?
• Example: You want to buy a new car (two approaches):
  1. You walk up to the first car shop and purchase one based on the advice of the dealer. Is it so?
  2. You would likely browse a few web portals where people have posted their reviews and compare different car models, checking their features and prices. You would also probably ask your friends and colleagues for their opinion. (In short, you would not directly reach a conclusion, but would instead make a decision considering the opinions of other people as well.)
• These are review- or opinion-based decisions.

Another Example (it is not just a buying/selling problem)
• You are a movie director and you have created a short movie.
• Now, you want to take preliminary feedback (ratings) on the movie before making it public.
• What are the possible ways by which you can do that?
  1. Ask one of your friends to rate the movie (a biased review).
  2. Ask 5 colleagues to rate the movie (unbiased, but a small number of people / subject matter experts).
  3. Ask 50 people to rate the movie (a more generalized review).
• Inferences:
  1. A diverse group of people is likely to make better decisions than individuals.
  2. Similarly, a diverse set of machine learning models is likely to make better decisions than single models.
  • This diversification in Machine Learning is achieved by a technique called Ensemble Learning.

What is Ensemble Learning?
• Ensemble methods are machine learning techniques that combine several base models in order to produce one optimal predictive model (What?)
• Ensemble learning techniques attempt to make predictive models perform better by improving their accuracy (Why?)
• Ensemble Learning is a process by which multiple machine learning models (such as classifiers) are strategically constructed to solve a particular problem (How?)
• In another way (problem symptom): to reduce the variance of certain ML models (such as neural networks), multiple models are trained instead of a single model, and the predictions from these models are combined.

Basic Ensemble Techniques
1. Max Voting
2. Averaging
3. Weighted Average

Max Voting
• In this technique, multiple models are used to make predictions for each data point. The prediction from each model is counted as a 'vote'. The prediction we get from the majority of the models is used as the final prediction.
• Considering the previous example: you asked 5 of your colleagues to rate your movie (out of 5); suppose three of them rated it as 4 while two of them gave it a 5. Since the majority gave a rating of 4, the final rating is taken as 4.
• The max voting method is generally used for classification problems
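As a minimal illustration of max voting, the sketch below combines three base classifiers with scikit-learn's VotingClassifier using hard (majority) voting. The dataset and the choice of base models are assumptions for demonstration only, not part of the original slides.

# Max voting sketch: three classifiers vote, the majority class wins.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("lr", LogisticRegression(max_iter=5000)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",  # majority vote over the predicted class labels
)
voting.fit(X_train, y_train)
print("Max-voting accuracy:", voting.score(X_test, y_test))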
Averaging
• Similar to the max voting technique, multiple predictions are made for each data point in averaging.
• In this method, we take the average of the predictions from all the models and use it as the final prediction.
• For example, the averaging method would take the average of all the rating values:
  (5 + 4 + 5 + 4 + 4) / 5 = 4.4
• Averaging can be used for making predictions in regression problems or while calculating probabilities for classification problems.

Weighted Average
• This is an extension of the averaging method.
• All models are assigned different weights defining the importance of each model for the prediction.
• For instance, if two of your colleagues are critics, while the others have no prior experience in this field, then the answers given by these two friends are given more importance than those of the other people.
• With weights of 0.23 for each critic and 0.18 for each of the other three colleagues (the weights sum to 1, so no further division is needed), the result is calculated as
  (5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18) = 4.41
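To make the arithmetic above concrete, here is a short Python sketch that reproduces the simple average (4.4) and the weighted average (4.41) of the five ratings; the ratings and weights are the ones from the slide.

# Simple average vs. weighted average of the five movie ratings.
import numpy as np

ratings = np.array([5, 4, 5, 4, 4])                   # ratings from 5 colleagues
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])    # critics weighted higher; weights sum to 1

print("Average:", ratings.mean())                                  # 4.4
print("Weighted average:", np.average(ratings, weights=weights))   # 4.41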
Advanced Ensemble Techniques
• Stacking
• Blending
• Bagging
• Boosting

Stacking
• Train multiple base models (Level-0 models) on your training data.
• Then use their predictions to create a new dataset (called meta-features).
• A new model, called the meta-model (Level-1 model), is trained on these meta-features to make the final prediction.
• Important detail: to avoid data leakage, stacking typically uses k-fold cross-validation when generating predictions for the meta-model.
• Example:
  • Base models: Decision Tree, SVM, Logistic Regression
  • Meta-model: Random Forest trained on the predictions from the base models

Stacking (step by step)
• Stacking uses predictions from multiple models (for example decision tree, knn or svm) to build a new model. This model is used for making predictions on the test set.
• Step 1: The train set is split into 10 parts.
• Step 2: A base model (suppose a decision tree) is fitted on 9 parts and predictions are made for the 10th part. This is done for each part of the train set.
• Step 3: Using this model, predictions are made on the test set.
• Steps 2 and 3 are repeated for another base model (say knn), resulting in another set of predictions for the train set and the test set.
• The predictions on the train set are used as features to build a new model (the meta-model).
• This meta-model is used to make the final predictions on the test set.

Blending
• Split the training set into two parts:
  • Train set (e.g., 70%)
  • Holdout set (e.g., 30%)
• Train base models on the train set.
• Use these trained models to predict on the holdout set.
• Train the meta-model on these holdout predictions and use it for the final predictions.
• Important detail: blending is simpler and faster, but it wastes some data (the holdout set is not used to train the base models).

Blending (step by step)
• Blending follows the same approach as stacking but uses only a holdout (validation) set from the train set to make predictions.
• In other words, unlike stacking, the predictions are made on the holdout set only.
• The holdout set and the predictions on it are used to build a model, which is then run on the test set.
• Step 1: The train set is split into training and validation sets.
• Step 2: Model(s) are fitted on the training set, and predictions are made on the validation set and the test set.
• Step 3: The validation set and its predictions are used as features (meta-features) to build a new model. This model is used to make the final predictions on the test set.
• A typical split: training set 70%, validation set 10% (tuning), test set 20%.
• A minimal code sketch of stacking and blending is shown below.
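The following sketch illustrates both approaches with scikit-learn. The base models and meta-model follow the example in the slides (Decision Tree, SVM, Logistic Regression with a Random Forest meta-model); the dataset and split sizes are illustrative assumptions. StackingClassifier generates the k-fold meta-features internally, while the blending variant is built by hand with a single holdout set.

# Stacking vs. blending: a minimal sketch with assumed data and hyperparameters.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base_models = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ("svm", SVC(probability=True, random_state=42)),
    ("lr", LogisticRegression(max_iter=5000)),
]

# Stacking: meta-features are produced with k-fold cross-validation (cv=10,
# matching the "10 parts" in the slide), then the meta-model is trained on them.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=RandomForestClassifier(random_state=42),
                           cv=10)
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))

# Blending: base models are trained on part of the training data, and their
# predictions on a holdout set become the meta-model's training features.
X_fit, X_hold, y_fit, y_hold = train_test_split(X_train, y_train,
                                                test_size=0.3, random_state=42)
holdout_preds, test_preds = [], []
for _, model in base_models:
    model.fit(X_fit, y_fit)
    holdout_preds.append(model.predict_proba(X_hold)[:, 1])
    test_preds.append(model.predict_proba(X_test)[:, 1])

meta_model = RandomForestClassifier(random_state=42)
meta_model.fit(np.column_stack(holdout_preds), y_hold)
print("Blending accuracy:",
      meta_model.score(np.column_stack(test_preds), y_test))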
Key Differences Between Stacking and Blending
• Use blending if you are quickly experimenting or working with large datasets.
• Use stacking for more robust, high-performance models, especially in competitions or production.

Bagging
• The idea behind bagging is to combine the results of multiple models (for instance, all decision trees) to get a generalized result.
• Here is a question: if you create all the models on the same set of data and combine them, will it be useful? There is a high chance that these models will give the same result, since they are getting the same input. So how can we solve this problem? One of the techniques is bootstrapping.
• Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement. The size of each subset is the same as the size of the original set.
• The Bagging (Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea of the distribution of the complete set. The size of the subsets created for bagging may be less than that of the original set.

Steps of Bagging
• Multiple subsets are created from the original dataset, selecting observations with replacement.
• A base model (weak model) is created on each of these subsets.
• The models run in parallel and are independent of each other.
• The final predictions are determined by combining the predictions from all the models.

Boosting
• Boosting is an ensemble learning technique that can be used to solve complex, data-driven, real-world problems.
• Boosting uses a set of machine learning algorithms to convert weak learners into strong learners in order to increase the accuracy of the model.

Difference between Bagging and Boosting
• Parallel ensemble, popularly known as bagging:
  • The weak learners are produced in parallel during the training phase.
  • The performance of the model can be increased by training a number of weak learners in parallel on bootstrapped data sets.
  • Examples: Random Forest algorithm, Bagging meta-estimator
• Sequential ensemble, popularly known as boosting:
  • The weak learners are produced sequentially during the training phase.
  • The performance of the model is improved by assigning a higher weight to the previously misclassified samples.
  • Examples: AdaBoost, Gradient Boosting, XGBoost

How Boosting Works
• Step 1: The base algorithm reads the data and assigns equal weight to each sample observation.
• Step 2: False predictions made by the base learner are identified. In the next iteration, the next base learner is trained with a higher weight on these incorrect predictions.
• Step 3: Repeat step 2 until the algorithm can correctly classify the output.
• The main aim of boosting is to focus more on misclassified predictions, but it can be used for regression problems as well.

Adaptive Boosting (AdaBoost)
• AdaBoost is implemented by combining several weak learners into a single strong learner.
• The weak learners in AdaBoost take into account a single input feature and draw out a single-split decision tree called a decision stump.
• Each observation is weighed equally while drawing out the first decision stump.
• The results from the first decision stump are analyzed, and if any observations are wrongly classified, they are assigned higher weights.
• After this, a new decision stump is drawn by considering the observations with higher weights as more significant.
• Again, if any observations are misclassified, they are given higher weight, and this process continues until all the observations fall into the right class.
• AdaBoost can be used for both classification and regression problems; however, it is more commonly used for classification.
• A minimal bagging and AdaBoost sketch is shown below.
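As a rough illustration of the bagging vs. boosting contrast above, the sketch below trains a BaggingClassifier (independent trees on bootstrap samples, combined by voting) and an AdaBoostClassifier (decision stumps added sequentially with sample reweighting). The dataset and hyperparameters are assumptions, and the keyword name estimator follows recent scikit-learn versions.

# Bagging (parallel, bootstrapped trees) vs. AdaBoost (sequential decision stumps).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: each tree sees a bootstrap sample of the training data and the
# trees are fitted independently; predictions are combined by majority vote.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                            n_estimators=100, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))

# AdaBoost: decision stumps (max_depth=1) are added one after another, with
# misclassified samples reweighted so later stumps focus on them.
adaboost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                              n_estimators=100, random_state=42)
adaboost.fit(X_train, y_train)
print("AdaBoost accuracy:", adaboost.score(X_test, y_test))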
Gradient Boosting
• The difference in this type of boosting is that the weights for misclassified outcomes are not incremented; instead, gradient boosting tries to optimize the loss function of the previous learner by adding a new model that adds weak learners in order to reduce the loss.
• The main idea is to overcome the errors in the previous learner's predictions. This type of boosting has three main components:
  • A loss function that needs to be optimized.
  • A weak learner for computing predictions and forming strong learners.
  • An additive model that adds weak learners to minimize the loss function.
• Like AdaBoost, gradient boosting can be used for both classification and regression problems.

XGBoost
• Motivation: XGBoost is an advanced version of the gradient boosting method; it literally means eXtreme Gradient Boosting. XGBoost, developed by Tianqi Chen, falls under the Distributed Machine Learning Community (DMLC) project.
• The main aim of this algorithm is to increase the speed and efficiency of computation. The gradient boosting algorithm computes its output at a slower rate since it analyzes the data set sequentially; XGBoost is therefore used to boost, or "extremely boost", the performance of the model.
• XGBoost is designed to focus on computational speed and model efficiency. The main features provided by XGBoost are:
  • Creating decision trees in parallel.
  • Implementing distributed computing methods for evaluating large and complex models.
  • Using out-of-core computing to analyze huge datasets.
  • Implementing cache optimization to make the best use of resources.
• A minimal gradient boosting and XGBoost sketch is shown below.
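To round off, here is a minimal sketch comparing scikit-learn's GradientBoostingClassifier with XGBClassifier from the third-party xgboost package; the dataset and hyperparameters are assumptions, and the second part assumes xgboost is installed.

# Gradient boosting (scikit-learn) and XGBoost (xgboost package) on the same data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Classic gradient boosting: trees are added sequentially, each one fitted to
# the gradient of the loss of the current ensemble.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print("Gradient Boosting accuracy:", gbm.score(X_test, y_test))

# XGBoost: the same boosting idea with engineering optimizations such as
# parallel tree construction, cache awareness, and out-of-core computation.
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1,
                    max_depth=3, random_state=42)
xgb.fit(X_train, y_train)
print("XGBoost accuracy:", xgb.score(X_test, y_test))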