
MODULE 3.5 SUPERVISED LEARNING: GRADIENT BOOSTING METHOD PART 1


Ensemble Learning Using Extreme Gradient Boosting (XGBoost)

When you want to purchase a new car, will you walk up to the first car shop and purchase one based on the advice of
the dealer? It’s highly unlikely.

You would likely browse a few websites where people have posted their reviews and compare different car models,
checking for their features and prices. You will also probably ask your friends and colleagues for their opinion. In short,
you wouldn’t directly reach a conclusion, but will instead make a decision considering the opinions of other people as
well.

Ensemble models in machine learning operate on a similar idea. They combine the decisions from multiple models to
improve the overall performance. Ensemble learning offers a systematic solution to combine the predictive power of
multiple learners. The resultant is a single model which gives the aggregated output from several models.

Part 1: Overview of Ensemble Learning

The goal of ensemble models is to combine different classifiers into a metaclassifier that has better generalization performance than each individual classifier alone. For example, if we collected predictions from 10 experts, ensemble methods would allow us to strategically combine their predictions to come up with a prediction that is more accurate and robust than any expert's individual prediction.
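
To make the "combine expert opinions" idea concrete, the sketch below (a minimal illustration with made-up predictions, not part of the case study) combines the hard 0/1 predictions of three hypothetical classifiers by majority vote using NumPy.

import numpy as np

# hypothetical 0/1 predictions from three independent "experts" for five samples
expert_preds = np.array([
    [1, 0, 1, 1, 0],   # expert 1
    [1, 1, 1, 0, 0],   # expert 2
    [0, 0, 1, 1, 0],   # expert 3
])

# majority vote: a sample is predicted 1 if more than half of the experts say 1
ensemble_pred = (expert_preds.mean(axis=0) > 0.5).astype(int)
print(ensemble_pred)  # [1 0 1 1 0]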

The models that form the ensemble, also known as base learners, could be either from the same learning algorithm or
different learning algorithms. Bagging and boosting are two widely used ensemble learners.

Bagging (or bootstrap aggregation) is an ensemble technique of training several individual models in a parallel way.
Each model is trained by a random subset of the data. Boosting, on the other hand, is an ensemble technique of training
several individual models in a sequential way. This is done by building a model from the training data and then creating
a second model that attempts to correct the errors of the first model. Models are added until the training set is
predicted perfectly or a maximum number of models is added.
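
As a quick illustration of the parallel-versus-sequential distinction (a minimal sketch on a synthetic dataset, not part of the case study), scikit-learn ships both styles: BaggingClassifier trains trees independently on bootstrap samples, while GradientBoostingClassifier adds shallow trees sequentially, each correcting the previous ones.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

# bagging: 100 trees trained independently on bootstrap samples of the data
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# boosting: 100 shallow trees added sequentially, each correcting the previous ones
boosting = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)

for name, model in [('bagging', bagging), ('boosting', boosting)]:
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    print(name, scores.mean())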

Though these two techniques can be used with several statistical models, the most predominant usage has been with
decision trees, and just like the decision trees themselves, bagging and boosting can be used for classification and
regression problems.

For our purposes, we will largely utilize decision trees to outline the definition and practicality of ensemble methods
(however, it is important to note that ensemble methods do not only pertain to decision trees). We will also focus first
on using the idea of decision trees and ensemble methods in addressing classification problems.

Recall from Module 3.3 that a decision tree predicts the target variable based on a series of questions and conditions. For instance, the decision tree below determines whether an individual should play outside or not. The tree takes several weather factors into account and, given each factor, either makes a decision or asks another question. In this example, every time it is overcast, we will play outside. However, if it is raining, we must ask whether it is windy or not. If it is windy, we will not play. But given no wind, we are going outside to play.

When making decision trees, there are several factors we must take into consideration: On what features do we base our decisions? What is the threshold for classifying each question into a YES or NO answer? In the tree above, what if we wanted to ask ourselves whether we had friends to play with or not? If we have friends, we will play every time. If not, we might continue to ask ourselves questions about the weather. By adding an additional question, we hope to better separate the YES and NO classes.

This is where ensemble methods come in handy. Rather than just relying on one decision tree and hoping we made
the right decision at each split, ensemble learning allows us to take a sample of decision trees into account, calculate
which features to use or questions to ask at each split, and make a final predictor based on the aggregated results of
the sampled decision trees.

Part 2: Introduction to Gradient Boosting Methods

Gradient boosting works by sequentially adding predictors to the ensemble, with each new predictor correcting the errors made by the previously underfitted model. Gradient boosting comes from the idea of the boosting ensemble method, or improving a single weak model by combining it with a number of other weak models in order to generate a collectively strong model.

Gradient boosting is an extension of boosting where the process of additively generating weak models is formalized as
a gradient descent algorithm over an objective function. Gradient boosting sets targeted outcomes for the next model
in an effort to minimize errors. Targeted outcomes for each case are based on the gradient of the error (hence the
name gradient boosting) with respect to the prediction. Recall from Module 3.2 that gradient descent is an optimization
algorithm used in machine learning to minimize the cost function by iteratively adjusting parameters in the direction
of the negative gradient, aiming to find the optimal set of parameters. The cost function represents the discrepancy
between the predicted output of the model and the actual output. The goal of gradient descent is to find the set of
parameters that minimizes this discrepancy and improves the model’s performance. For a mathematical explanation
of the intuition behind gradient descent (which is the backbone of complex ML models like XGBoost), you can check
this article from Towards Data Science.

Gradient-boosted decision trees iteratively train an ensemble of shallow decision trees, with each iteration using the
error residuals of the previous model to fit the next model. The final prediction is a weighted sum of all of the tree
predictions. The following are the steps of the gradient-boosted tree algorithm:
1. A decision tree (which can be referred to as the first weak learner) is built on a subset of data. Using this model,
predictions are made on the whole dataset.
2. Errors are calculated by comparing the predictions and actual values, and the loss is calculated using the loss
function.
3. A new decision tree is created using the errors of the previous step as the target variable. The objective is to
find the best split in the data to minimize the error. The predictions made by this new model are combined
with the predictions of the previous model. New errors are calculated using this predicted value and actual
value.
4. This process is repeated until the error function does not change or until the maximum limit of the number of
estimators is reached.
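
To make these steps concrete, below is a minimal hand-rolled sketch (illustrative only, using squared-error loss so that the negative gradient is simply the residual) of gradient boosting for regression with shallow decision trees.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

learning_rate = 0.1
n_rounds = 100

# step 1: start from a constant prediction (the mean of the target)
prediction = np.full(len(y_demo), y_demo.mean())
trees = []

for _ in range(n_rounds):
    # step 2: for squared-error loss, the negative gradient is the residual
    residuals = y_demo - prediction

    # step 3: fit a shallow tree to the residuals and add its shrunken output
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

# step 4: the ensemble prediction is the initial constant plus the weighted sum of tree outputs
print('training MSE:', np.mean((y_demo - prediction) ** 2))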

Part 3: Introduction to Extreme Gradient Boosting (XGBoost) Model

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. It is
a scalable and improved version of the gradient boosting algorithm designed for efficacy, computational speed, and
model performance.

Here, a gradient-boosted tree is learned such that the overall loss of the new model is minimized while keeping in mind not to overfit the model. Recall that an objective function intends to maximize or minimize something. In machine learning, we try to minimize the objective function, which is a combination of the loss function and a regularization term (revisit Module 3.3). The XGBoost objective function at iteration 𝑡 that we need to minimize is the following:
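
In the standard XGBoost formulation, with 𝑙 a differentiable loss, ŷᵢ⁽ᵗ⁻¹⁾ the ensemble prediction after 𝑡−1 trees, f_t the CART learner added at iteration 𝑡, T its number of leaves, and w its leaf weights:

$$\text{Obj}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$$

The first term measures how well the updated ensemble fits the training data; the second term is the regularization penalty on the new tree.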

It is easy to see that the XGBoost objective is a function of functions (i.e., 𝑙 is a function of CART learners, a sum of the current and previous additive trees). Optimizing the loss function encourages predictive accuracy, whereas optimizing the regularization term leads to smaller variance and makes predictions more stable. We will not go deep into the calculus behind XGBoost, but you can read the following articles for a digestible explanation of the math behind this model here and here.

Again, compared to baseline GBM, XGBoost improves upon the base GBM framework through systems optimization
and algorithmic enhancements.

System Optimization
1. Parallelization — Tree learning needs data in sorted order. To cut down the sorting cost, data is divided into compressed blocks (each column stored with its corresponding feature values). XGBoost sorts each block in parallel using all available CPU cores/threads. This optimization is valuable since a large number of nodes is created frequently in a tree. In summary, XGBoost parallelizes the computation within the otherwise sequential process of generating trees.
2. Cache Awareness — With cache-aware optimization, gradient statistics (direction and value) for each split node are stored in an internal buffer of each thread and accumulated in a mini-batch manner. This helps reduce the time overhead of immediate read/write operations and also prevents cache misses.
Algorithmic Enhancements
1. Regularization – It penalizes more complex models through both LASSO (L1) and Ridge (L2) regularization to
prevent overfitting.
2. Tree Pruning – The stopping criterion for tree splitting within the GBM framework is greedy in nature and depends on the negative loss criterion at the point of a split. XGBoost instead grows trees up to the specified max_depth and then prunes them backward, removing splits that do not yield a positive gain. This ‘depth-first’ approach improves computational performance significantly.
3. Sparsity Awareness – It is quite common that the data we gather is sparse (has a lot of missing or empty values) or becomes sparse after data engineering (e.g., feature encoding). To handle the sparsity patterns in the data, a default direction is assigned to each split in a tree. XGBoost handles missing data by sending it along this default direction, learning for each split the direction that minimizes the training loss. The optimization is to visit only the non-missing entries, which makes the algorithm run up to 50x faster than the naïve method (see the sketch after this list).
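
As a quick demonstration of this behavior (a minimal sketch on toy data, not part of the case study), XGBoost accepts np.nan entries directly and routes them along the learned default direction at each split, so no manual imputation is required.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# randomly blank out about 20% of the entries to simulate missing data
mask = rng.random(X_demo.shape) < 0.2
X_demo[mask] = np.nan

# XGBoost trains directly on data containing NaN values
model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X_demo, y_demo)
print(model.predict(X_demo[:5]))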

Part 4: XGBoost Hyperparameters and Model Tuning

In XGBoost, there are two main types of hyperparameters: tree-specific and learning task-specific. The following are
the main hyperparameters that influence XGBoost model performance.

Tree-specific hyperparameters control the construction and complexity of the decision trees:

1. max_depth: The maximum depth of each tree in the model. Increasing this value makes the model more
complex and can improve performance, but too high values lead to overfitting. Typical values range from 3-10
for shallow trees, but deep trees can go up to 15-25.

2. min_child_weight: Minimum number of samples required in a leaf node for it to be split further. A higher
value prevents overfitting. Typical values range from 1-10, with higher values for more sparse datasets.

3. subsample, colsample_bytree: Subsampling fractions of the training data used per iteration for row
(subsample) and column (colsample_bytree) sampling. Typical values range from 0.5-1.0. Lower values
are more conservative and prevent overfitting.

Learning task-specific hyperparameters control the overall behavior of the model and the learning process:

1. eta: The learning rate that controls how quickly the model learns from the data. Typical values range from
0.01 to 0.3, with smaller values generally requiring more boosting rounds but potentially leading to better
generalization.

2. gamma: Minimum loss reduction required to split a node. Higher values make the algorithm more
conservative. Values range from 0-10 typically.

3. lambda: L2 (ridge) regularization term on weights. Higher values increase the regularization.

4. alpha: L1 (LASSO) regularization term on weights. Higher values increase the regularization.

As covered in Module 3.3 and Module 3.4, the best way to select parameters is to do a grid search using cross-
validation. Try a wide range of values for the most important parameters like eta, max_depth, and
min_child_weight. Monitor the cross-validation scores and select the optimal configuration. Start with shallow
trees initially before exploring deep trees. It's also important to tune regularization parameters and tree constraints
once you've found optimal architecture hyperparameters. These help control model complexity and prevent
overfitting. Tuning XGBoost carefully is key to getting the most predictive power out of the model. Patience and
systematically exploring the hyperparameter space using cross-validation will pay off in better model generalization
and performance on unseen data. You can use GridSearchCV from sklearn or use the built-in cross-validation in
the xgboost package.
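
For instance, here is a minimal sketch of the built-in cross-validation (xgb.cv), assuming a feature matrix X_train and 0/1 labels Y_train like the ones built in the case study below; the parameter values are illustrative, not tuned.

import xgboost as xgb

# assumed: X_train and Y_train are an already-encoded feature matrix and 0/1 labels
dtrain = xgb.DMatrix(X_train, label=Y_train)

params = {
    'objective': 'binary:logistic',
    'max_depth': 5,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}

# 10-fold cross-validation with early stopping on the held-out AUC
cv_results = xgb.cv(params, dtrain, num_boost_round=500, nfold=10,
                    metrics='auc', early_stopping_rounds=20, seed=777)
print(cv_results.tail(1))
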
Part 5: Case Study – Classification Task using XGBoost in Python

Part 5.1 Defining the Use Case

In the telecommunications industry and other business-related fields, customer churn is defined as existing customers terminating their subscription to a service with the company, due to reasons such as dissatisfaction with the service provided or a better price offered by competitors for the same services.

Churn analytics provides valuable capabilities to predict customer churn and also define the underlying reasons that
drive it. The churn metric is mostly shown as the percentage of customers that cancel a product or service within a
given period (mostly months).

Predicting customer churn is critical for telecommunication companies to be able to effectively retain customers. It is more costly to acquire new customers than to retain existing ones. For this reason, large telecommunications corporations are seeking to develop models to predict which customers are more likely to churn and take actions accordingly.

The data set used in this case study is available on Kaggle and contains 19 columns (independent variables) that indicate the characteristics of the clients of a fictional telecommunications corporation. The Churn column (response variable) indicates whether the customer departed within the last month or not. The class/label No includes the clients that did not leave the company last month, while the class Yes contains the clients that decided to terminate their relations with the company. The objective of the analysis is to obtain the relation between the customers' characteristics and churn.

In this case study, we will focus on:


1. Data preparation, data cleaning, and handling a large number of features.
2. Data discretization and handling categorical data.
3. Feature selection and data transformation.

Part 5.2 Loading Python Packages and Dataset

# install xgboost if you haven’t yet already
pip install xgboost

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import MinMaxScaler
import math

from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV


from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

import xgboost as xgb


from pickle import dump
from pickle import load

import warnings
warnings.filterwarnings("ignore")

# load dataset

path = r'C:\Users\cdeani\Documents\Duane DECSC 131\05 Kaggle Datasets\telco_customer_churn.csv'


data = pd.read_csv(path)

Part 5.3 Exploratory Data Analysis and Data Preprocessing

The first thing we must do is gather a basic sense of our data.

# shape
print(data.shape)

#peek at data
set_option('display.width', 100)
data.head(5)

# check unique values of each column


for column in data.columns:
    print('Column: {} - Unique Values: {}'.format(column, data[column].unique()))

As shown above, the data set contains 19 independent variables, which can be classified into 3 groups:
1. Demographic Information
▪ gender: Whether the client is a female or a male (Female, Male).
▪ SeniorCitizen: Whether the client is a senior citizen or not (0, 1).
▪ Partner: Whether the client has a partner or not (Yes, No).
▪ Dependents: Whether the client has dependents or not (Yes, No).

2. Customer Account Information


▪ tenure: Number of months the customer has stayed with the company (Multiple different numeric values).
▪ Contract: Indicates the customer’s current contract type (Month-to-Month, One year, Two year).
▪ PaperlessBilling: Whether the client has paperless billing or not (Yes, No).
▪ PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic),
Credit Card (automatic)).
▪ MonthlyCharges: The amount charged to the customer monthly (Multiple different numeric values).
▪ TotalCharges: The total amount charged to the customer (Multiple different numeric values).

3. Services Information
▪ PhoneService: Whether the client has a phone service or not (Yes, No).
▪ MultipleLines: Whether the client has multiple lines or not (No phone service, No, Yes).
▪ InternetService: Whether the client is subscribed to Internet service with the company (DSL, Fiber optic, No)
▪ OnlineSecurity: Whether the client has online security or not (No internet service, No, Yes).
▪ OnlineBackup: Whether the client has online backup or not (No internet service, No, Yes).
▪ DeviceProtection: Whether the client has device protection or not (No internet service, No, Yes).
▪ TechSupport: Whether the client has tech support or not (No internet service, No, Yes).
▪ StreamingTV: Whether the client has streaming TV or not (No internet service, No, Yes).
▪ StreamingMovies: Whether the client has streaming movies or not (No internet service, No, Yes).

# get a cursory summary of dataset


data.info()

As shown above, the data set contains 7,043 observations and 21 columns. At first glance, there are no null values in the dataset; however, we observe that the column TotalCharges was wrongly detected as an object. This column represents the total amount charged to the customer and is, therefore, a numeric variable. For further analysis, we need to transform this column into a numeric data type. To do so, we can use the pd.to_numeric function. By default, this function raises an exception when it sees non-numeric data; however, we can use the argument errors='coerce' to skip those cases and replace them with NaN.

# transform the column TotalCharges into a numeric data type


data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# checking missing values


data.isnull().sum()

We can now observe that the column TotalCharges has 11 missing values. Next let’s investigate the rows where
TotalCharges = NaN.

# null observations of the TotalCharges column


data[data['TotalCharges'].isnull()]

These observations also have a tenure of 0, even though MonthlyCharges is not null for these entries. This information appears contradictory, and therefore we decide to remove those observations from the dataset.
# drop observations with null values
data.dropna(inplace=True)

The customerID column is useless for explaining whether or not the customer will churn. Therefore, we also drop this column from the dataset.

# drop the customerID column from the dataset


data.drop(columns='customerID', inplace=True)

As shown below, some payment method denominations contain the word ‘automatic’ in parentheses. These denominations are too long to be used as tick labels in later visualizations. Therefore, we remove this clarification in parentheses from the entries of the PaymentMethod column.

# unique elements of the PaymentMethod column


data.PaymentMethod.unique()

# remove (automatic) from payment method names


data['PaymentMethod'] = data['PaymentMethod'].str.replace(' (automatic)', '', regex=False)

# unique elements of the PaymentMethod column after the modification


data.PaymentMethod.unique()

Part 5.4 Data Visualization of the Response/Target Variable

The following bar plot shows the percentage of observations that correspond to each class of the response/target variable, Churn: [‘No’ ‘Yes’]. As shown below, this is an imbalanced data set because the classes are not equally distributed among the observations, with No being the majority class (73.42%). When modeling, this imbalance will lead to a large number of false negatives, as we will see later.

# visualize dependent variable


# create a figure
fig = pyplot.figure(figsize=(10, 6))
ax = fig.add_subplot(111)

# proportion of observation of each class


prop_response = data['Churn'].value_counts(normalize=True)

# create a bar plot showing the percentage of churn
prop_response.plot(kind='bar',
                   ax=ax,
                   color=['navy', 'mediumseagreen'])

# set title and labels
ax.set_title('Proportion of observations of the response variable',
             fontsize=18, loc='left')
ax.set_xlabel('churn', fontsize=14)
ax.set_ylabel('proportion of observations', fontsize=14)
ax.tick_params(rotation='auto')

# eliminate the frame from the plot
spine_names = ('top', 'right', 'bottom', 'left')
for spine_name in spine_names:
    ax.spines[spine_name].set_visible(False)

For the succeeding bar plots, we are going to use normalized stacked bar plots to analyze the influence of each
independent categorical variable in the outcome. A normalized stacked bar plot makes each column the same height,
so it is not useful for comparing total numbers; however, it is perfect for comparing how the response variable varies
across all groups of an independent variable.

On the other hand, we use histograms to evaluate the influence of each independent numeric variable in the outcome.
As mentioned before, the data set is imbalanced; therefore, we need to draw a probability density function of each
class (density=True) to be able to compare both distributions properly.

Part 5.5 Visualizing Demographic Information with regards to Target Variable

The following code creates a stacked percentage bar chart for each demographic attribute (gender,
SeniorCitizen, Partner, Dependents), showing the percentage of Churn for each category of the attribute.

def percentage_stacked_plot(columns_to_plot, super_title):
    '''
    Prints a 100% stacked plot of the response variable for each independent variable in columns_to_plot.
    Parameters:
        columns_to_plot (list of string): Names of the variables to plot
        super_title (string): Super title of the visualization
    Returns:
        None
    '''
    number_of_columns = 2
    number_of_rows = math.ceil(len(columns_to_plot) / 2)

    # create a figure
    fig = pyplot.figure(figsize=(12, 5 * number_of_rows))
    fig.suptitle(super_title, fontsize=22, y=.95)

    # loop over each column name to create a subplot
    for index, column in enumerate(columns_to_plot, 1):

        # create the subplot
        ax = fig.add_subplot(number_of_rows, number_of_columns, index)

        # percentage of observations of the response variable for each group (100% stacked bar plot)
        prop_by_independent = pd.crosstab(data[column], data['Churn']).apply(lambda x: x / x.sum() * 100, axis=1)

        prop_by_independent.plot(kind='bar', ax=ax, stacked=True,
                                 rot=0, color=['navy', 'mediumseagreen'])

        # set the legend in the upper right corner
        ax.legend(loc="upper right", bbox_to_anchor=(0.62, 0.5, 0.5, 0.5),
                  title='Churn', fancybox=True)

        # set title and labels
        ax.set_title('Proportion of observations by ' + column,
                     fontsize=16, loc='left')
        ax.tick_params(rotation='auto')

        # eliminate the frame from the plot
        spine_names = ('top', 'right', 'bottom', 'left')
        for spine_name in spine_names:
            ax.spines[spine_name].set_visible(False)

# demographic column names


demographic_columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents']

# stacked plot of demographic columns


percentage_stacked_plot(demographic_columns, 'Demographic Information')

We can extract the following conclusions by analyzing demographic attributes:
▪ The churn rate of senior citizens is almost double that of young citizens.
▪ We do not expect gender to have significant predictive power. A similar percentage of churn is shown both
when a customer is a man or a woman.
▪ Customers with a partner churn less than customers with no partner.

Part 5.6 Visualizing Customer Account Information - Categorical Variables with regards to Target Variable

As we did with demographic attributes, we evaluate the percentage of Churn for each category of the customer
account attributes (Contract, PaperlessBilling, PaymentMethod).

# customer account column names


account_columns = ['Contract', 'PaperlessBilling', 'PaymentMethod']

# stacked plot of customer account columns


percentage_stacked_plot(account_columns, 'Customer Account Information')

We can extract the following conclusions by analyzing customer account attributes:


▪ Customers with month-to-month contracts have higher churn rates compared to clients with yearly contracts.
▪ Customers who opted for an electronic check as paying method are more likely to leave the company.
▪ Customers subscribed to paperless billing churn more than those who are not subscribed.

Part 5.7 Visualizing Customer Account Information - Numerical Variables with regards to Target Variable

The following plots show the distribution of tenure, MonthlyCharges, TotalCharges with respect to Churn.
For all numeric attributes, the distributions of both classes are different which suggests that all of the attributes will be
useful to determine whether or not a customer churns.

def histogram_plots(columns_to_plot, super_title):

    # set number of rows and number of columns
    number_of_columns = 2
    number_of_rows = math.ceil(len(columns_to_plot) / 2)

    # create a figure
    fig = pyplot.figure(figsize=(12, 5 * number_of_rows))
    fig.suptitle(super_title, fontsize=22, y=.95)

    # loop over each column name to create a subplot
    for index, column in enumerate(columns_to_plot, 1):

        # create the subplot
        ax = fig.add_subplot(number_of_rows, number_of_columns, index)

        # histograms for each class (normalized histogram)
        data[data['Churn'] == 'No'][column].plot(kind='hist', ax=ax, density=True,
                                                 alpha=0.5, color='navy', label='No')
        data[data['Churn'] == 'Yes'][column].plot(kind='hist', ax=ax, density=True,
                                                  alpha=0.5, color='mediumseagreen', label='Yes')

        # set the legend in the upper right corner
        ax.legend(loc="upper right", bbox_to_anchor=(0.5, 0.5, 0.5, 0.5),
                  title='Churn', fancybox=True)

        # set title and labels
        ax.set_title('Distribution of ' + column + ' by churn',
                     fontsize=16, loc='left')
        ax.tick_params(rotation='auto')

        # eliminate the frame from the plot
        spine_names = ('top', 'right', 'bottom', 'left')
        for spine_name in spine_names:
            ax.spines[spine_name].set_visible(False)

# customer account column names


account_columns_numeric = ['tenure', 'MonthlyCharges', 'TotalCharges']
# histogram of customer account columns
histogram_plots(account_columns_numeric, 'Customer Account Information')

We can extract the following conclusions by analyzing the histograms above:
▪ The churn rate tends to be larger when monthly charges are high.
▪ New customers (low tenure) are more likely to churn.
▪ Clients with high total charges are less likely to leave the company.

Part 5.8 Visualizing Services Information with regards to Target Variable

Lastly, we evaluate the percentage of the target for each category of the services columns with stacked bar plots.

# services column names


services_columns = ['PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

# stacked plot of services columns


percentage_stacked_plot(services_columns, 'Services Information')

We can extract the following conclusions by evaluating services attributes from the bar plots below:
▪ We do not expect phone attributes (PhoneService and MultipleLines) to have significant predictive
power. The percentage of churn for all classes in both independent variables is nearly the same.
▪ Clients with online security churn less than those without it.
▪ Customers with no tech support tend to churn more often than those with tech support.

By looking at the plots, we can identify the most relevant attributes for detecting churn. We expect these attributes to
be discriminative in our future models.
We will drop the columns gender, PhoneService, and MultipleLines because these variables do not have a
strong relationship with the target as we saw in the plots above.

# drop gender, PhoneService, MultipleLines columns from the dataset


data.drop(columns=['gender','PhoneService','MultipleLines'], inplace=True)

Part 5.9 Feature Engineering

Feature engineering is the process of extracting features from the data and transforming them into a format that is
suitable for the machine learning model. In this project, we need to transform both numerical and categorical variables.
Most machine learning algorithms require numerical values; therefore, all categorical attributes available in the dataset
should be encoded into numerical labels before training the model. In addition, we need to transform numeric columns
into a common scale. This prevents columns with large values from dominating the learning process.

No Modification
The SeniorCitizen column is already a binary column [0,1] and should not be modified.

Label Encoding
Label encoding is used to replace categorical values with numerical values. This encoding replaces every category with
a numerical label. In the below code, we use label encoding with the following binary variables: Partner,
Dependents, PaperlessBilling, and Churn.

# binary label encoding


data_transformed = data.copy()

# label encoding (binary variables)


label_encoding_columns = ['Partner', 'Dependents', 'PaperlessBilling', 'Churn']

# encode categorical binary features using label encoding


for column in label_encoding_columns:
    data_transformed[column] = data_transformed[column].map({'Yes': 1, 'No': 0})

One-Hot Encoding
One-hot encoding creates a new binary column for each level of the categorical variable. The new column contains
zeros and ones indicating the absence or presence of the category in the data. In the below code, we apply one-hot
encoding to the following categorical variables: Contract, PaymentMethod, InternetService,
OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and
StreamingMovies.

# one-hot encoding (categorical variables with more than two labels)


one_hot_encoding_columns = ['InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']

# encode categorical variables with more than two levels using one-hot encoding
data_transformed = pd.get_dummies(data_transformed, columns = one_hot_encoding_columns)

The main drawback of this encoding is the significant increase in the dimensionality of the dataset (bigger feature space
due to additional independent variable columns); therefore, this method should be avoided when the categorical
column has a large number of unique values or labels.

In this case, it is smarter to assign numerical labels or bins instead. For example, if there is a Civil Status variable with multiple labels, we can assign 1 = Single, 2 = Married, 3 = Separated, etc. instead of doing one-hot encoding (a small sketch is shown below). This way, we keep the dimensionality smaller. For this task, I chose one-hot encoding because the maximum number of labels per independent variable is only three.
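
For example, here is a minimal sketch of that ordinal-label alternative, using a hypothetical CivilStatus column that is not part of this dataset.

import pandas as pd

# hypothetical many-level categorical column mapped to integer labels
# instead of one-hot encoding it (CivilStatus is not a column in the Telco dataset)
df_demo = pd.DataFrame({'CivilStatus': ['Single', 'Married', 'Separated', 'Married']})
civil_status_map = {'Single': 1, 'Married': 2, 'Separated': 3}
df_demo['CivilStatus'] = df_demo['CivilStatus'].map(civil_status_map)
print(df_demo)
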
Scaling of Numerical Variables
Data normalization is a common practice in machine learning which consists of transforming numeric columns to a
common scale. Data normalization transforms multi-scaled data to the same scale. After normalization, all variables
have a similar influence on the model, improving the stability and performance of the learning algorithm.

There are multiple normalization techniques in ML. In the below code, we will use the min-max method to rescale the
numeric columns (tenure, MonthlyCharges, TotalCharges) to a common scale.

# scaling the numerical variables using MinMaxScaler
scaler = MinMaxScaler()

data_transformed['normtenure'] = scaler.fit_transform(np.array(data_transformed['tenure']).reshape(-1, 1))
data_transformed['normMonthlyCharges'] = scaler.fit_transform(np.array(data_transformed['MonthlyCharges']).reshape(-1, 1))
data_transformed['normTotalCharges'] = scaler.fit_transform(np.array(data_transformed['TotalCharges']).reshape(-1, 1))

Part 5.10 Train-Test Split

We will use 80% of the dataset for model training and 20% for testing:
# train-test split

Y = data_transformed["Churn"]
X = data_transformed.drop(columns=['Churn', 'tenure', 'MonthlyCharges', 'TotalCharges'])
validation_size = 0.2
seed = 777
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)

print(X.columns)
print(Y.name)

Part 5.11 Assessing Basic ML Classifiers for Baseline Model

Algorithm selection is a key challenge in any machine learning project since there is not an algorithm that is the best
across all projects. Generally, we need to evaluate a set of potential candidates and select for further evaluation those
that provide better performance. As with the previous case study with credit card fraud detection, we will assess a
family of basic ML classifiers first with our training data and then we will select a baseline model from this group based
on an evaluation metric. We will implement 10-fold (𝑘 = 10) cross-validation across all model runs.

For the first pass of the models, we will use Accuracy as the evaluation metric, as with any classification problem. As discussed in Module 3.3, accuracy is the fraction of predictions our model got right.

# creating a baseline using basic ML classifiers


# first run using 'accuracy' as an evaluation metric
num_folds = 10
scoring = 'accuracy'

# spot-check basic Classification algorithms


models = []
models.append(('LR', LogisticRegression(random_state=seed)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC(random_state=seed)))
models.append(('CART', DecisionTreeClassifier(random_state=seed)))

# first run for selecting a baseline model where scoring = accuracy


results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# compare algorithms
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
fig.set_size_inches(8,4)
pyplot.show()

Logistic regression had the highest accuracy during training, closely followed by SVM. We will use SVM as our baseline model so that we can have a more distinct comparison between the baseline model and XGBoost. The XGBoost model will later use a binary logistic (log-likelihood) objective function, given that this is a binary classification problem.

# prepare baseline model


baseline_model = SVC(random_state=seed)
baseline_model.fit(X_train, Y_train)

# estimate accuracy on validation set


baseline_pred = baseline_model.predict(X_test)
print('Accuracy of SVC baseline classifier:', accuracy_score(Y_test, baseline_pred))
print(classification_report(Y_test, baseline_pred))
# confusion matrix of baseline model
baseline_cm = pd.DataFrame(confusion_matrix(Y_test, baseline_pred), columns=np.unique(Y_test), index =
np.unique(Y_test))
baseline_cm.index.name = 'Actual'
baseline_cm.columns.name = 'Predicted'
sns.heatmap(baseline_cm, cmap="Blues", cbar=False, annot=True, fmt=',d', annot_kws={"size": 16})

The baseline model using a support vector machine showed fairly decent accuracy, correctly classifying 79.74% of our test set. However, the model’s recall and precision for the customers that left the company (Churn = 1) are low, at only 48% and 65%, respectively. We might have to revisit this later and tune our model. Let’s first build our challenger model using XGBoost.

Part 5.12 Building an Initial XGBoost Model using Accuracy as Evaluation Metric

We will run an initial XGBoost model, also using accuracy as the evaluation metric.

# prepare first version of xgboost model


xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.08, objective= 'binary:logistic')
xgb_model.fit(X_train, Y_train)

# estimate accuracy on validation set


xgb_pred = xgb_model.predict(X_test)
print('Accuracy of XGBoost classifier: {:.16}'.format(xgb_model.score(X_test, Y_test)))
print(classification_report(Y_test, xgb_pred))

# confusion matrix of XGBoost model
xgb_cm = pd.DataFrame(confusion_matrix(Y_test, xgb_pred),
                      columns=np.unique(Y_test), index=np.unique(Y_test))
xgb_cm.index.name = 'Actual'
xgb_cm.columns.name = 'Predicted'
sns.heatmap(xgb_cm, cmap="Blues", cbar=False, annot=True, fmt=',d', annot_kws={"size": 16})

The first version of the XGBoost model has almost no improvements against the baseline SVM model. It has the same
accuracy score of 79.74% with a slightly better recall of 52% for the positive class. Further model tuning is needed.

Part 5.13 Tuning the Baseline Model

Recall is often called the true positive rate. In this case, it means correctly identifying customers who are going to churn. Failing to identify a true positive, i.e., failing to flag a customer who is about to churn, would be costly to a business as this could mean lost revenue. Meanwhile, the opposite error of incorrectly flagging someone who is not going to churn doesn’t harm the business that much, assuming the interaction with the customer isn’t hostile.
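
In confusion-matrix terms (TP = true positives, FN = false negatives, FP = false positives), the two metrics we are trading off are:

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

Maximizing recall on the Churn = 1 class therefore means keeping the number of missed churners (FN) as low as possible.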

# rerun for selecting a baseline model where scoring = recall

num_folds = 10
scoring = 'recall'

results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

The SVM model underperformed when we changed the scoring to focus on recall. Let’s use CART as a new baseline
model. We will tune our final baseline model (CART) further by:
1. Optimizing the hyperparameters using grid search
2. Adjusting the decision threshold of binary classification of the predicted probabilities

First, let’s find the optimal hyperparameters.

# using GridSearchCV to fine tune CART model hyperparameters


max_depth = [5,10,20]
min_samples_split = [10,20,40,60]
min_impurity_decrease = [0.0001, 0.0005, 0.001, 0.005, 0.01]
param_grid = dict(max_depth=max_depth, min_samples_split=min_samples_split,
min_impurity_decrease=min_impurity_decrease)

baseline_final = DecisionTreeClassifier(random_state=seed)
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
baseline_grid = GridSearchCV(estimator=baseline_final, param_grid=param_grid, scoring=scoring, cv=kfold)
baseline_grid_result = baseline_grid.fit(X_train, Y_train)
print("Best: %f using %s" % (baseline_grid_result.best_score_, baseline_grid_result.best_params_))

Next, we will have to adjust the model’s decision threshold. But first, let me explain what a decision threshold is.

Machine learning classification algorithms output a probability for each sample you predict, and a threshold is then applied to these probabilities to return predicted labels as binary categories. For example, given two customers, 7590-VHVEG and 5575-GNVDE, and an input 𝑋 of predictor variables, a classifier model will predict a probability 𝑦 for each. Of course, we cannot directly check this predicted probability against the target variable Churn as it only has values of 0 or 1. So how does the model convert these probabilities into a binary class? That’s where the decision threshold comes in. For binary classification, instances with a probability ≥ 0.5 are typically predicted as positives (1), otherwise as negatives (0), since the default classification threshold is 0.5 in sklearn.
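
As a minimal illustration (the probabilities below are made up, not model output), converting predicted probabilities into class labels with the default and a lowered threshold looks like this:

import numpy as np

# hypothetical predicted probabilities for the positive class (Churn = 1)
proba = np.array([0.10, 0.36, 0.52, 0.80])

default_labels = (proba >= 0.5).astype(int)    # default 0.5 threshold
lowered_labels = (proba >= 0.35).astype(int)   # lowered threshold favors recall

print(default_labels)   # [0 0 1 1]
print(lowered_labels)   # [0 1 1 1]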

Is this a reasonable threshold? In our case, it is not optimal. Our dataset is imbalanced and machine learning classifiers
trained on imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for
the minority class, which in many real-world applications is the class we are interested in. The default classification
threshold 0.5 is often not ideal for imbalanced data, and it is a good strategy to adjust it. You can see from both runs
of the initial baseline and XGBoost models that the evaluation metrics for the minority class (Churn = 1) are poor.

Normally, for a severely imbalanced dataset, applying a resampling technique like SMOTE will solve the above issue.
However, not all imbalanced datasets require a resampling. In our case, a 70-30 imbalance in the target variable is not
severe. Instead of resampling, we’ll have to adjust the decision threshold.

The precision_recall_curve is a useful tool to visualize the precision-recall tradeoff of the classifier. It helps inform us where to set the decision threshold of the model to maximize either precision or recall. This is called the “operating point” of the model. The code below plots the precision-recall curve for our data.

# plot Precision-Recall curve


from sklearn.metrics import precision_recall_curve

Y_prob = baseline_grid.predict_proba(X_test)[:, 1]
precision, recall, threshold = precision_recall_curve(Y_test, Y_prob)

# Plot the output


pyplot.plot(threshold, precision[:-1], c ='g', label ='PRECISION')
pyplot.plot(threshold, recall[:-1], c ='b', label ='RECALL')
pyplot.grid()
pyplot.legend()
pyplot.title('Precision-Recall Curve')
In the above plot, we can see that if we want to maximize recall, we need to decrease the value of the decision threshold, but that would decrease the value of precision. So, we need to choose a value of the decision threshold that increases recall without decreasing precision too much. A good starting point is where the two curves meet, or somewhere near that point. Let’s try a threshold of 0.30.

# adjusting the decision threshold

dec_thres = 0.30

# prepare final baseline model using grid search results


baseline_final = DecisionTreeClassifier(max_depth=5, min_impurity_decrease=0.001,
                                        min_samples_split=10, random_state=seed)
baseline_final.fit(X_train, Y_train)

# estimate accuracy on validation set using new decision threshold


baseline_final_pred = (baseline_final.predict_proba(X_test)[:,1]>=dec_thres).astype(bool)
print(accuracy_score(Y_test, baseline_final_pred))
print(classification_report(Y_test, baseline_final_pred))
# confusion matrix of final baseline model
baseline_final_cm = pd.DataFrame(confusion_matrix(Y_test, baseline_final_pred),
columns=np.unique(Y_test), index = np.unique(Y_test))
baseline_final_cm.index.name = 'Actual'
baseline_final_cm.columns.name = 'Predicted'
sns.heatmap(baseline_final_cm, cmap="Blues", cbar=False, annot=True, fmt=',d', annot_kws={"size": 16})

While the accuracy score of the final baseline model did not greatly improve (in fact it became smaller), by tuning the
model hyperparameters and adjusting the decision threshold, we were able to achieve a higher recall without
sacrificing precision that much. Recall for the minority class increased to 72%.

Part 5.14 Tuning the XGBoost Model

First, we will need to identify the optimal values of the hyperparameters. The code will take a long time to run, at least
1 hour and 30 minutes.

# using GridSearchCV from sklearn to fine tune XGBoost model hyperparameters


xgb_params = {
'max_depth': [3, 5, 7, 10],
'min_child_weight': [3, 5, 7, 10],
'eta': [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}

xgb_final = xgb.XGBClassifier(random_state=seed, objective='binary:logistic')


kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
xgb_grid = GridSearchCV(estimator=xgb_final, param_grid=xgb_params, scoring=scoring, cv=kfold)
xgb_grid_result = xgb_grid.fit(X_train, Y_train)
print("Best: %f using %s" % (xgb_grid_result.best_score_, xgb_grid_result.best_params_))

# plot Precision-Recall curve of XGBoost model


xgb_prob = xgb_grid.predict_proba(X_test)[:, 1]
xprecision, xrecall, xthreshold = precision_recall_curve(Y_test, xgb_prob)
# Plot the output.
pyplot.plot(xthreshold, xprecision[:-1], c ='g', label ='PRECISION')
pyplot.plot(xthreshold, xrecall[:-1], c ='b', label ='RECALL')
pyplot.grid()
pyplot.legend()
pyplot.title('Precision-Recall Curve of XGBoost Model')

Similar to the baseline model, we can set the optimal threshold between 0.25 – 0.40 to attain a higher recall without
fully sacrificing precision.

# adjusting the decision threshold

dec_thres = 0.30

# prepare final XGBoost model using grid search results


xgb_final_cf = xgb.XGBClassifier(objective='binary:logistic',
max_depth=3,
min_child_weight=7,
eta=0.1,
subsample=0.7,
colsample_bytree=1)
xgb_final_cf.fit(X_train, Y_train, eval_metric='aucpr')

# estimate accuracy on validation set using new decision threshold


xgb_final_pred = (xgb_final_cf.predict_proba(X_test)[:,1]>=dec_thres).astype(bool)
print(accuracy_score(Y_test, xgb_final_pred))
print(classification_report(Y_test, xgb_final_pred))

# confusion matrix of final XGBoost model


xgb_final_cm = pd.DataFrame(confusion_matrix(Y_test, xgb_final_pred),
columns=np.unique(Y_test), index = np.unique(Y_test))
xgb_final_cm.index.name = 'Actual'
xgb_final_cm.columns.name = 'Predicted'
sns.heatmap(xgb_final_cm, cmap="Blues", cbar=False, annot=True, fmt=',d', annot_kws={"size": 16})
The final XGBoost model showed a higher recall compared to the baseline CART model. The number of false negatives, at 86, is also much lower than the number of FNs predicted by our baseline model.

# compare ROC AUC scores of baseline and XGBoost
print('The ROC AUC score of the baseline (CART) model:', roc_auc_score(Y_test, baseline_final_pred))
print('The ROC AUC score of the XGBoost model:', roc_auc_score(Y_test, xgb_final_pred))

The ROC AUC score of the XGBoost model is slightly higher than that of the baseline model. An AUC of roughly 0.76 indicates that the model separates churners from non-churners considerably better than random guessing (AUC = 0.5).
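
If an actual precision-recall AUC comparison is desired, one option (a sketch, using the fitted models above and their predicted probabilities rather than hard labels) is average_precision_score from sklearn:

from sklearn.metrics import average_precision_score

# AUC-PR (average precision) is computed from predicted probabilities, not hard labels
baseline_prob_final = baseline_final.predict_proba(X_test)[:, 1]
xgb_prob_final = xgb_final_cf.predict_proba(X_test)[:, 1]

print('AUC-PR of the baseline (CART) model:', average_precision_score(Y_test, baseline_prob_final))
print('AUC-PR of the XGBoost model:', average_precision_score(Y_test, xgb_prob_final))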

Lastly, we will plot the feature importances to see which independent variables impact churn the most.

# extracting feature importance
def plot_importance(model, features, num=len(X), save=False):
    feature_imp = pd.DataFrame({'Value': model.feature_importances_, 'Feature': features.columns})
    pyplot.figure(figsize=(10, 10))
    sns.set(font_scale=1)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    pyplot.title('Features')
    pyplot.tight_layout()
    if save:
        pyplot.savefig('importances.png')
    pyplot.show()

plot_importance(xgb_final_cf, X, 50)

The length of the contract is by far the strongest predictor of customer churn behavior: customers on month-to-month contracts churn significantly more than those on one-year or two-year contracts. Interestingly, customers with fiber-optic internet service churned at a higher rate than those with other internet service types.
