Module 3.5 Ensemble Learning XGBoost
When you want to purchase a new car, will you walk up to the first car shop and purchase one based on the advice of
the dealer? It’s highly unlikely.
You would likely browse a few websites where people have posted their reviews and compare different car models,
checking for their features and prices. You will also probably ask your friends and colleagues for their opinion. In short,
you wouldn’t directly reach a conclusion, but will instead make a decision considering the opinions of other people as
well.
Ensemble models in machine learning operate on a similar idea. They combine the decisions from multiple models to
improve the overall performance. Ensemble learning offers a systematic solution to combine the predictive power of
multiple learners. The resultant is a single model which gives the aggregated output from several models.
The goal of ensemble models is to combine different classifiers into a metaclassifier that has better generalization
performance than each individual classifier alone. For example, assuming that we collected predictions from 10
experts, ensemble methods would allow us to strategically combine their predictions to come up with a prediction that
is more accurate and robust than the experts’ individual predictions.
The models that form the ensemble, also known as base learners, could be either from the same learning algorithm or
different learning algorithms. Bagging and boosting are two widely used ensemble techniques.
Bagging (or bootstrap aggregation) is an ensemble technique of training several individual models in a parallel way.
Each model is trained by a random subset of the data. Boosting, on the other hand, is an ensemble technique of training
several individual models in a sequential way. This is done by building a model from the training data and then creating
a second model that attempts to correct the errors of the first model. Models are added until the training set is
predicted perfectly or a maximum number of models is added.
Though these two techniques can be used with several statistical models, the most predominant usage has been with
decision trees, and just like the decision trees themselves, bagging and boosting can be used for classification and
regression problems.
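To make the contrast concrete, here is a minimal sketch (in scikit-learn, using an assumed synthetic dataset and illustrative settings) that trains a bagged ensemble and a boosted ensemble of decision trees:

# a minimal sketch contrasting bagging and boosting with decision trees;
# the synthetic dataset and model settings are illustrative assumptions
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)

# bagging: many trees trained in parallel on bootstrap samples of the data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
# boosting: trees added sequentially, each correcting its predecessor's errors
boosting = AdaBoostClassifier(n_estimators=100)

for name, model in [('bagging', bagging), ('boosting', boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())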
For our purposes, we will largely utilize decision trees to outline the definition and practicality of ensemble methods
(however, it is important to note that ensemble methods do not only pertain to decision trees). We will also focus first
on using the idea of decision trees and ensemble methods in addressing classification problems.
Recall from Module 3.3 that a decision tree predicts the target variable based on a series of questions and conditions.
For instance, the decision tree below determines whether an individual should play outside or not. The tree takes several
weather factors into account and, given each factor, either makes a decision or asks another question. In this example,
every time it is overcast, we will play outside. However, if it is raining, we must ask whether it is windy. If it’s windy, we
will not play. But, given no wind, we’re going outside to play.
When making decision trees, there are several factors we must take into consideration: What features do we base
our decisions on? What is the threshold for classifying each question into a YES or NO answer? In the tree above, what
if we wanted to ask ourselves whether we had friends to play with or not? If we have friends, we will play every time.
If not, we might continue to ask ourselves questions about the weather. By adding an additional question, we hope to
better separate the YES and NO classes.
This is where ensemble methods come in handy. Rather than just relying on one decision tree and hoping we made
the right decision at each split, ensemble learning allows us to take a sample of decision trees into account, calculate
which features to use or questions to ask at each split, and make a final predictor based on the aggregated results of
the sampled decision trees.
Gradient boosting works by sequentially adding predictors to the ensemble, with each new predictor correcting the
errors of the previous underfitted ones. Gradient boosting comes from the idea of the boosting ensemble method:
improving a single weak model by combining it with a number of other weak models in order to generate a collectively
strong model.
Gradient boosting is an extension of boosting where the process of additively generating weak models is formalized as
a gradient descent algorithm over an objective function. Gradient boosting sets targeted outcomes for the next model
in an effort to minimize errors. Targeted outcomes for each case are based on the gradient of the error (hence the
name gradient boosting) with respect to the prediction. Recall from Module 3.2 that gradient descent is an optimization
algorithm used in machine learning to minimize the cost function by iteratively adjusting parameters in the direction
of the negative gradient, aiming to find the optimal set of parameters. The cost function represents the discrepancy
between the predicted output of the model and the actual output. The goal of gradient descent is to find the set of
parameters that minimizes this discrepancy and improves the model’s performance. For a mathematical explanation
of the intuition behind gradient descent (which is the backbone of complex ML models like XGBoost), you can check
this article from Towards Data Science.
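As a quick refresher, here is a toy example of that update rule; the cost function f(w) = (w - 3)^2 is an assumed illustration, not part of the original module:

# toy gradient descent on the assumed cost f(w) = (w - 3)^2, whose minimum is at w = 3
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)   # gradient of the cost with respect to w
    w -= lr * grad       # step in the direction of the negative gradient
print(w)                 # converges to ~3.0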
Gradient-boosted decision trees iteratively train an ensemble of shallow decision trees, with each iteration using the
error residuals of the previous model to fit the next model. The final prediction is a weighted sum of all of the tree
predictions. The following are the steps of the gradient-boosted tree algorithm:
1. A decision tree (which can be referred to as the first weak learner) is built on a subset of data. Using this model,
predictions are made on the whole dataset.
2. Errors are calculated by comparing the predictions and actual values, and the loss is calculated using the loss
function.
3. A new decision tree is created using the errors of the previous step as the target variable. The objective is to
find the best split in the data to minimize the error. The predictions made by this new model are combined
with the predictions of the previous model. New errors are calculated using this predicted value and actual
value.
4. This process is repeated until the error function does not change or until the maximum limit of the number of
estimators is reached.
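The following minimal from-scratch sketch mirrors these four steps for a regression problem with squared-error loss (where the error residuals are exactly the negative gradients); the function name and settings are illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    prediction = np.full(len(y), y.mean())             # step 1: a first weak prediction
    trees = []
    for _ in range(n_estimators):                      # step 4: repeat up to the limit
        residuals = y - prediction                     # step 2: errors of the current model
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                         # step 3: fit a new tree to the errors
        prediction += learning_rate * tree.predict(X)  # combine with previous predictions
        trees.append(tree)
    return trees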
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. It is
a scalable and improved version of the gradient boosting algorithm designed for efficiency, computational speed, and
model performance.
Here, a gradient-boosted tree is learned such that the overall loss of the new model is minimized while keeping in mind
not to overfit the model. Recall that an objective function intends to maximize or minimize something. In machine
learning, we try to minimize the objective function, which is a combination of the loss function and a regularization term
(revisit Module 3.3). The XGBoost objective function at iteration 𝑡 that we need to minimize is the following:
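In the standard form from the XGBoost paper (Chen and Guestrin, 2016):

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$

where $\hat{y}_i^{(t-1)}$ is the prediction of the first $t-1$ trees, $f_t$ is the CART learner added at iteration $t$, $T$ is the number of leaves of $f_t$, and $w$ is its vector of leaf weights.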
It is easy to see that the XGBoost objective is a function of functions (i.e., 𝑙 is a function of CART learners, a sum of the
current and previous additive trees). Optimizing the loss function encourages predictive models whereas optimizing
regularization leads to smaller variance and makes prediction stable. We will not go deep into the calculus behind
XGBoost, but you can read the following articles for a digestible explanation of the math behind this model here and
here.
Again, compared to baseline GBM, XGBoost improves upon the base GBM framework through systems optimization
and algorithmic enhancements.
System Optimization
1. Parallelization — Tree learning needs data in sorted order. To cut down the sorting cost, data is divided
into compressed blocks (each column stored with its corresponding feature values). XGBoost sorts each block in
parallel using all available CPU cores/threads. This optimization is valuable since a large number of nodes is
created frequently in a tree. In summary, XGBoost parallelizes the split-finding work within each tree, even
though the trees themselves are still built sequentially.
2. Cache Aware — With cache-aware optimization, gradient statistics (direction and value) for each split
node are stored in an internal buffer of each thread, and accumulation is performed in a mini-batch manner. This
helps reduce the time overhead of immediate read/write operations and also prevents cache misses.
Algorithmic Enhancements
1. Regularization – It penalizes more complex models through both LASSO (L1) and Ridge (L2) regularization to
prevent overfitting.
2. Tree Pruning – The stopping criterion for tree splitting within the GBM framework is greedy in nature and depends
on the negative loss criterion at the point of the split. XGBoost instead grows trees up to the specified
max_depth and then prunes them backward, removing splits that do not yield a positive gain. This ‘depth-first’
approach improves computational performance significantly.
3. Sparsity Awareness – It is quite common that the data we gather is sparse (has a lot of missing or empty values)
or becomes sparse after performing data engineering (e.g., feature encoding). To exploit the sparsity patterns
in the data, a default direction is assigned to each tree node. XGBoost handles missing data by sending it in the
default direction, learning the direction that minimizes the training loss. The optimization here
is to visit only the non-missing values, which makes the algorithm run up to 50x faster than the naïve method.
In XGBoost, there are two main types of hyperparameters: tree-specific and learning task-specific. The following are
the main hyperparameters that influence XGBoost model performance.
Tree-specific hyperparameters control the construction and complexity of the decision trees:
1. max_depth: The maximum depth of each tree in the model. Increasing this value makes the model more
complex and can improve performance, but values that are too high lead to overfitting. Typical values range from 3-10
for shallow trees, but deep trees can go up to 15-25.
2. min_child_weight: The minimum sum of instance weights (hessian) required in a child node for a split; for
common losses this behaves like a minimum number of samples. A higher value prevents overfitting. Typical values
range from 1-10, with higher values for sparser datasets.
3. subsample, colsample_bytree: Subsampling fractions of the training data used per iteration for row
(subsample) and column (colsample_bytree) sampling. Typical values range from 0.5-1.0. Lower values
are more conservative and prevent overfitting.
Learning task-specific hyperparameters control the overall behavior of the model and the learning process:
1. eta: The learning rate that controls how quickly the model learns from the data. Typical values range from
0.01 to 0.3, with smaller values generally requiring more boosting rounds but potentially leading to better
generalization.
2. gamma: Minimum loss reduction required to split a node. Higher values make the algorithm more
conservative. Values range from 0-10 typically.
3. lambda: L2 (ridge) regularization term on weights. Higher values increase the regularization.
4. alpha: L1 (LASSO) regularization term on weights. Higher values increase the regularization.
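Putting these together, an illustrative XGBClassifier configuration might look like the following; the values are examples only, not tuned recommendations:

# illustrative configuration using the hyperparameters above; values are examples
from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=6,           # tree-specific: maximum tree depth
    min_child_weight=5,    # tree-specific: minimum child weight per leaf
    subsample=0.8,         # tree-specific: row sampling fraction
    colsample_bytree=0.8,  # tree-specific: column sampling fraction
    learning_rate=0.1,     # learning-task: eta
    gamma=1,               # learning-task: minimum loss reduction to split
    reg_lambda=1.0,        # learning-task: L2 (ridge) penalty, lambda
    reg_alpha=0.0,         # learning-task: L1 (LASSO) penalty, alpha
)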
As covered in Module 3.3 and Module 3.4, the best way to select parameters is to do a grid search using cross-
validation. Try a wide range of values for the most important parameters like eta, max_depth, and
min_child_weight. Monitor the cross-validation scores and select the optimal configuration. Start with shallow
trees initially before exploring deep trees. It's also important to tune regularization parameters and tree constraints
once you've found optimal architecture hyperparameters. These help control model complexity and prevent
overfitting. Tuning XGBoost carefully is key to getting the most predictive power out of the model. Patience and
systematically exploring the hyperparameter space using cross-validation will pay off in better model generalization
and performance on unseen data. You can use GridSearchCV from sklearn or use the built-in cross-validation in
the xgboost package.
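For instance, a minimal sketch of the built-in cross-validation in xgboost could look like this; the parameter values are assumptions, and a feature matrix X_train with labels Y_train is assumed to exist:

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=Y_train)   # X_train/Y_train assumed to exist
params = {'objective': 'binary:logistic', 'max_depth': 5,
          'eta': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.8}
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=10,
                    metrics='auc', early_stopping_rounds=20, seed=42)
print(cv_results.tail(1))   # last row kept corresponds to the best boosting round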
Part 5: Case Study – Classification Task using XGBoost in Python
In the telecom industry and other business fields, customer churn is defined as the action of existing customers
terminating their subscription of service with the company for reasons such as dissatisfaction with the service
provided or better prices offered by competitors for the same services.
Churn analytics provides valuable capabilities to predict customer churn and also define the underlying reasons that
drive it. The churn metric is mostly shown as the percentage of customers that cancel a product or service within a
given period (mostly months).
Predicting customer churn is critical for telecommunication companies to be able to effectively retain customers. It is
more costly to acquire new customers than to retain existing ones. For this reason, large telecommunications
corporations are seeking to develop models to predict which customers are more likely to churn and take actions
accordingly.
The data set used in this case study is available on Kaggle and contains 19 columns (independent variables) that indicate
the characteristics of the clients of a fictional telecommunications corporation. The Churn column (response variable)
indicates whether the customer departed within the last month or not. The class/label No includes the clients that did
not leave the company last month, while the class Yes contains the clients that decided to terminate their relations
with the company. The objective of the analysis is to obtain the relation between the customer’s characteristics and
the churn.
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
import math
import warnings
warnings.filterwarnings("ignore")
# load dataset (filename assumed from the Kaggle Telco customer churn dataset)
data = read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
# shape
print(data.shape)
# peek at data
set_option('display.width', 100)
data.head(5)
As shown above, the data set contains 19 independent variables, which can be classified into 3 groups:
1. Demographic Information
▪ gender: Whether the client is a female or a male (Female, Male).
▪ SeniorCitizen: Whether the client is a senior citizen or not (0, 1).
▪ Partner: Whether the client has a partner or not (Yes, No).
▪ Dependents: Whether the client has dependents or not (Yes, No).
2. Customer Account Information
▪ tenure: Number of months the customer has stayed with the company.
▪ Contract: Type of contract of the client (Month-to-month, One year, Two year).
▪ PaperlessBilling: Whether the client has paperless billing or not (Yes, No).
▪ PaymentMethod: Payment method of the client (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).
▪ MonthlyCharges: The amount charged to the client monthly.
▪ TotalCharges: The total amount charged to the client.
3. Services Information
▪ PhoneService: Whether the client has a phone service or not (Yes, No).
▪ MultipleLines: Whether the client has multiple lines or not (No phone service, No, Yes).
▪ InternetServices: Whether the client is subscribed to Internet service with the company (DSL, Fiber optic, No)
▪ OnlineSecurity: Whether the client has online security or not (No internet service, No, Yes).
▪ OnlineBackup: Whether the client has online backup or not (No internet service, No, Yes).
▪ DeviceProtection: Whether the client has device protection or not (No internet service, No, Yes).
▪ TechSupport: Whether the client has tech support or not (No internet service, No, Yes).
▪ StreamingTV: Whether the client has streaming TV or not (No internet service, No, Yes).
▪ StreamingMovies: Whether the client has streaming movies or not (No internet service, No, Yes).
As shown above, the data set contains 7,043 observations and 21 columns. Apparently, there are no null values in the
dataset; however, we observe that the column TotalCharges was wrongly detected as an object. This column
represents the total amount charged to the customer and is, therefore, a numeric variable. For further analysis, we
need to transform this column into a numeric data type. To do so, we can use the pd.to_numeric function. By
default, this function raises an exception when it sees non-numeric data; however, we can use the argument
errors='coerce' to skip those cases and replace them with a NaN.
We can now observe that the column TotalCharges has 11 missing values. Next let’s investigate the rows where
TotalCharges = NaN.
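A short sketch of this conversion and inspection (column names taken from the dataset):

# convert TotalCharges to numeric; non-numeric entries become NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
print(data['TotalCharges'].isna().sum())   # 11 missing values
# inspect the rows where TotalCharges is NaN
data[data['TotalCharges'].isna()][['tenure', 'MonthlyCharges', 'TotalCharges']]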
These observations also have a tenure of 0, even though MonthlyCharges is not null for these entries. This
information appears contradictory; therefore, we decide to remove those observations from the dataset.
# drop observations with null values
data.dropna(inplace=True)
The customerID column is useless for explaining whether or not the customer will churn. Therefore, we also drop this
column from the dataset, as shown below.
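# drop the customer identifier column
data.drop(columns=['customerID'], inplace=True)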
As shown below, some payment method denominations contain the word ‘automatic’ in parentheses. These
denominations are too long to be used as tick labels in later visualizations. Therefore, we remove this parenthetical
clarification from the entries of the PaymentMethod column.
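One way to do this with pandas:

# remove the ' (automatic)' clarification from the payment method labels
data['PaymentMethod'] = data['PaymentMethod'].str.replace(' (automatic)', '', regex=False)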
The following bar plot shows the percentage of observations that correspond to each class of the response/target
variable, Churn: [‘No’ ‘Yes’]. As shown below, this is an imbalanced data set, because the two classes are not
equally distributed among the observations, with No being the majority class (73.42%). When modeling, this imbalance will
lead to a large number of false negatives, as we will see later.
For the succeeding bar plots, we are going to use normalized stacked bar plots to analyze the influence of each
independent categorical variable on the outcome. A normalized stacked bar plot makes each column the same height,
so it is not useful for comparing total numbers; however, it is perfect for comparing how the response variable varies
across all groups of an independent variable.
On the other hand, we use histograms to evaluate the influence of each independent numeric variable on the outcome.
As mentioned before, the data set is imbalanced; therefore, we need to draw a probability density function of each
class (density=True) to be able to compare both distributions properly.
The following code creates a stacked percentage bar chart for each demographic attribute (gender,
SeniorCitizen, Partner, Dependents), showing the percentage of Churn for each category of the attribute.
def percentage_stacked_plot(columns_to_plot, super_title):
    '''
    Prints a 100% stacked plot of the response variable for each independent variable of the list columns_to_plot.
    Parameters:
    columns_to_plot (list of string): Names of the variables to plot
    super_title (string): Super title of the visualization
    Returns:
    None
    '''
    # reconstructed helper; the function name is an assumption
    number_of_columns = 2
    number_of_rows = math.ceil(len(columns_to_plot) / 2)
    # create a figure
    fig = pyplot.figure(figsize=(12, 5 * number_of_rows))
    fig.suptitle(super_title, fontsize=22, y=.95)
    # loop through the columns, one subplot per variable
    for index, column in enumerate(columns_to_plot, 1):
        ax = fig.add_subplot(number_of_rows, number_of_columns, index)
        # calculate the percentage of observations of the response variable
        # for each group of the independent variable (100% stacked bar plot)
        prop_by_independent = pd.crosstab(data[column], data['Churn']).apply(lambda x: x / x.sum() * 100, axis=1)
        prop_by_independent.plot(kind='bar', ax=ax, stacked=True, rot=0)
        ax.tick_params(rotation='auto')

# demographic attributes
percentage_stacked_plot(['gender', 'SeniorCitizen', 'Partner', 'Dependents'], 'Demographic Information')
Part 5.6 Visualizing Customer Account Information - Categorical Variables with regards to Target Variable
As we did with demographic attributes, we evaluate the percentage of Churn for each category of the customer
account attributes (Contract, PaperlessBilling, PaymentMethod).
Part 5.7 Visualizing Customer Account Information - Numerical Variables with regards to Target Variable
The following plots show the distribution of tenure, MonthlyCharges, TotalCharges with respect to Churn.
For all numeric attributes, the distributions of both classes are different which suggests that all of the attributes will be
useful to determine whether or not a customer churns.
# create a figure of density histograms for each numeric column by class (reconstructed sketch)
number_of_rows, super_title = 2, 'Customer Account Information'
fig = pyplot.figure(figsize=(12, 5 * number_of_rows))
fig.suptitle(super_title, fontsize=22, y=.95)
for index, column in enumerate(['tenure', 'MonthlyCharges', 'TotalCharges'], 1):
    ax = fig.add_subplot(number_of_rows, 2, index)
    data.groupby('Churn')[column].plot(kind='hist', ax=ax, density=True, alpha=0.5, legend=True)
    ax.tick_params(rotation='auto')
Lastly, we evaluate the percentage of the target for each category of the services columns with stacked bar plots.
We can extract the following conclusions by evaluating services attributes from the bar plots below:
▪ We do not expect phone attributes (PhoneService and MultipleLines) to have significant predictive
power. The percentage of churn for all classes in both independent variables is nearly the same.
▪ Clients with online security churn less than those without it.
▪ Customers with no tech support tend to churn more often than those with tech support.
By looking at the plots, we can identify the most relevant attributes for detecting churn. We expect these attributes to
be discriminative in our future models.
We will drop the columns gender, PhoneService, and MultipleLines (see the code below) because these
variables do not have a strong relationship with the target, as we saw in the plots above.
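# drop the weakly related columns identified in the plots
data.drop(columns=['gender', 'PhoneService', 'MultipleLines'], inplace=True)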
Feature engineering is the process of extracting features from the data and transforming them into a format that is
suitable for the machine learning model. In this project, we need to transform both numerical and categorical variables.
Most machine learning algorithms require numerical values; therefore, all categorical attributes available in the dataset
should be encoded into numerical labels before training the model. In addition, we need to transform numeric columns
into a common scale. This prevents columns with large values from dominating the learning process.
No Modification
The SeniorCitizen column is already a binary column [0,1] and should not be modified.
Label Encoding
Label encoding is used to replace categorical values with numerical values. This encoding replaces every category with
a numerical label. In the below code, we use label encoding with the following binary variables: Partner,
Dependents, PaperlessBilling, and Churn.
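A minimal sketch of that code; treating data_transformed as a working copy of data is an assumption:

from sklearn.preprocessing import LabelEncoder

data_transformed = data.copy()   # assumed working copy for the encoded dataset
label_encoding_columns = ['Partner', 'Dependents', 'PaperlessBilling', 'Churn']
for column in label_encoding_columns:
    data_transformed[column] = LabelEncoder().fit_transform(data_transformed[column])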
One-Hot Encoding
One-hot encoding creates a new binary column for each level of the categorical variable. The new column contains
zeros and ones indicating the absence or presence of the category in the data. In the below code, we apply one-hot
encoding to the following categorical variables: Contract, PaymentMethod, InternetServices,
OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and
StreamingMovies.
# define the multi-level categorical columns, then encode them using one-hot encoding
one_hot_encoding_columns = ['Contract', 'PaymentMethod', 'InternetServices', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
data_transformed = pd.get_dummies(data_transformed, columns=one_hot_encoding_columns)
The main drawback of this encoding is the significant increase in the dimensionality of the dataset (bigger feature space
due to additional independent variable columns); therefore, this method should be avoided when the categorical
column has a large number of unique values or labels.
In this case, it is smarter to assign numerical labels or bins instead. For example, if there is a Civil Status
variable with multiple labels, we can assign 1 = Single, 2 = Married, 3 = Separated, etc., instead of doing one-hot
encoding. This way, we keep the dimensionality smaller. For this task, I chose one-hot encoding because the
maximum number of labels per independent variable is only three.
Scaling of Numerical Variables
Data normalization is a common practice in machine learning which consists of transforming numeric columns to a
common scale. Data normalization transforms multi-scaled data to the same scale. After normalization, all variables
have a similar influence on the model, improving the stability and performance of the learning algorithm.
There are multiple normalization techniques in ML. In the below code, we will use the min-max method to rescale the
numeric columns (tenure, MonthlyCharges, TotalCharges) to a common scale.
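A minimal sketch of the rescaling (fitting the scaler on the full dataset simply follows the flow of this case study):

from sklearn.preprocessing import MinMaxScaler

# rescale the numeric columns to the [0, 1] range
min_max_columns = ['tenure', 'MonthlyCharges', 'TotalCharges']
data_transformed[min_max_columns] = MinMaxScaler().fit_transform(data_transformed[min_max_columns])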
We will use 80% of the dataset for model training and 20% for testing:
# train-test split
from sklearn.model_selection import train_test_split

Y = data_transformed["Churn"]
# only the target is dropped; the scaled numeric columns stay in the feature matrix
X = data_transformed.drop(columns=['Churn'])
validation_size = 0.2
seed = 777
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)
print(X.columns)
print(Y.name)
Algorithm selection is a key challenge in any machine learning project since no single algorithm is the best
across all projects. Generally, we need to evaluate a set of potential candidates and select for further evaluation those
that provide better performance. As with the previous case study on credit card fraud detection, we will first assess a
family of basic ML classifiers with our training data and then select a baseline model from this group based
on an evaluation metric. We will implement k = 10 cross-validation across all model runs.
For the first pass of the models, we will use accuracy as the evaluation metric, as is common for classification problems. As
discussed in Module 3.3, accuracy is the fraction of predictions our model got right.
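The spot-check loop itself is a minimal sketch here; the exact candidate list is an assumption:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = [('LR', LogisticRegression(max_iter=1000)),
          ('CART', DecisionTreeClassifier(random_state=seed)),
          ('NB', GaussianNB()),
          ('SVM', SVC())]
results, names = [], []
for name, model in models:
    kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))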
# compare algorithms
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
fig.set_size_inches(8,4)
pyplot.show()
Logistic regression had the highest accuracy during training, closely followed by SVM. We will use SVM as our baseline
model so that we can have a more distinct comparison between the baseline model and XGBoost. The XGBoost model
will later use a log-likelihood (binary logistic) objective function, given that this is a binary classification problem.
The baseline model using a support vector machine showed decent accuracy, correctly predicting 79.74%
of our test set. However, the model’s recall and precision for the customers that left the company (Churn = 1) are low
at only 48% and 65%, respectively. We might have to revisit this later and tune our model. Let’s first build our challenger
model using XGBoost.
Part 5.12 Building an Initial XGBoost Model using Accuracy as Evaluation Metric
We will run an initial XGBoost model, again using accuracy as the evaluation metric.
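A minimal sketch of this initial model (default hyperparameters assumed, apart from the binary logistic objective):

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

xgb_initial = XGBClassifier(objective='binary:logistic', random_state=seed)
xgb_initial.fit(X_train, Y_train)
predictions = xgb_initial.predict(X_test)
print(accuracy_score(Y_test, predictions))
print(classification_report(Y_test, predictions))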
Recall is often called the true positive rate. In this case, it means correctly identifying customers who are going to
churn. Failing to identify a true positive, that is, failing to identify someone who is about to churn, would be costly
to a business as this could mean lost revenue. Meanwhile, the opposite error of incorrectly flagging someone who is not
going to churn doesn’t harm the business that much, if we assume that the interaction with the customer isn’t hostile.
num_folds = 10
scoring = 'recall'
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
The SVM model underperformed when we changed the scoring to focus on recall. Let’s use CART as a new baseline
model. We will tune our final baseline model (CART) further by:
1. Optimizing the hyperparameters using grid search
2. Adjusting the decision threshold of binary classification of the predicted probabilities
from sklearn.model_selection import GridSearchCV

# hyperparameter grid for the CART baseline; the grid values here are illustrative assumptions
param_grid = {'max_depth': [3, 5, 7, 10, None], 'min_samples_leaf': [1, 5, 10, 20]}
baseline_final = DecisionTreeClassifier(random_state=seed)
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
baseline_grid = GridSearchCV(estimator=baseline_final, param_grid=param_grid, scoring=scoring, cv=kfold)
baseline_grid_result = baseline_grid.fit(X_train, Y_train)
print("Best: %f using %s" % (baseline_grid_result.best_score_, baseline_grid_result.best_params_))
Next, we will have to adjust the model’s decision threshold. But first, let me explain what a decision threshold is.
Machine learning classification algorithms output a probability for each of the samples you predict, and a
threshold is applied to these probabilities to return the predicted labels as binary categories. For example, take two
customers, 7590-VHVEG and 5575-GNVDE, and an input 𝑋 of predictor variables. A classifier model will predict a
probability 𝑦 for each. Of course, we cannot directly check this predicted probability against the target variable Churn as it
only has values of 0 or 1. So how does the model convert these probabilities into a binary class? That’s where the
decision threshold comes in. For binary classification, instances with a probability ≥ 0.5 are typically predicted as
positives (1), otherwise as negatives (0), since the default classification threshold is 0.5 in sklearn.
Is this a reasonable threshold? In our case, it is not optimal. Our dataset is imbalanced and machine learning classifiers
trained on imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for
the minority class, which in many real-world applications is the class we are interested in. The default classification
threshold 0.5 is often not ideal for imbalanced data, and it is a good strategy to adjust it. You can see from both runs
of the initial baseline and XGBoost models that the evaluation metrics for the minority class (Churn = 1) are poor.
Normally, for a severely imbalanced dataset, applying a resampling technique like SMOTE will solve the above issue.
However, not all imbalanced datasets require a resampling. In our case, a 70-30 imbalance in the target variable is not
severe. Instead of resampling, we’ll have to adjust the decision threshold.
The precision_recall_curve is a useful tool to visualize the precision-recall tradeoff of a classifier. It helps
inform us where to set the decision threshold of the model to maximize either precision or recall. This is called the
“operating point” of the model. The code below plots the precision-recall curve for our data.
from sklearn.metrics import precision_recall_curve

Y_prob = baseline_grid.predict_proba(X_test)[:, 1]
precision, recall, threshold = precision_recall_curve(Y_test, Y_prob)
# apply the adjusted decision threshold to the predicted probabilities
dec_thres = 0.30
Y_pred = (Y_prob >= dec_thres).astype(int)
While the accuracy score of the final baseline model did not improve (in fact, it decreased), by tuning the
model hyperparameters and adjusting the decision threshold, we were able to achieve a higher recall without
sacrificing much precision. Recall for the minority class increased to 72%.
First, we will need to identify the optimal values of the hyperparameters. The code below will take a long time to run, at least
1 hour and 30 minutes.
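A hedged sketch of that grid search; the grid values below are illustrative assumptions, which is also part of why the run time is long:

xgb_param_grid = {'max_depth': [3, 5, 7],
                  'learning_rate': [0.01, 0.1, 0.3],
                  'min_child_weight': [1, 5, 10],
                  'subsample': [0.8, 1.0]}
xgb_grid = GridSearchCV(estimator=XGBClassifier(objective='binary:logistic', random_state=seed),
                        param_grid=xgb_param_grid, scoring=scoring, cv=kfold)
xgb_grid_result = xgb_grid.fit(X_train, Y_train)
print("Best: %f using %s" % (xgb_grid_result.best_score_, xgb_grid_result.best_params_))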
Similar to the baseline model, we can set the optimal threshold between 0.25 – 0.40 to attain a higher recall without
fully sacrificing precision.
dec_thres = 0.30
The AUC-PR score of the XGBoost model is slightly higher compared to the baseline model. The model’s AUC of 76% implies
that, given a randomly chosen churner and non-churner, it ranks the churner higher roughly 3 out of 4 times.
Lastly, we will plot the feature importance to see which independent variables impact the churn rate the most.
# plot_importance here is assumed to be a custom helper; xgboost's built-in
# plot_importance(xgb_final_cf, max_num_features=50) produces a similar chart
plot_importance(xgb_final_cf, X, 50)
The length of the contract is by far the strongest predictor of customer churn behavior: customers on month-to-month
contracts churn significantly more than those on one-year or two-year contracts. Interestingly, customers with
fiber-optic internet service churned at a higher rate than those with other services.