Default of Credit Card Clients

Candidate: Matteo Merlo, 287576
Professors: prof. F. VACCARINO, prof. M. GASPARINI
A.Y. 2021-2022
Contents

1 Introduction
2 Exploratory Data Analysis
3 Data preprocessing
  3.1 Handling Categorical Features
  3.2 Dataset Partition
  3.3 Outliers and Anomaly Detection
  3.4 Features Scaling
  3.5 Dimensionality Reduction
    3.5.1 PCA - Principal Component Analysis
  3.6 Class Imbalance - Resampling
    3.6.1 Cluster Centroid Undersampling
    3.6.2 SMOTE - Synthetic Minority Oversampling Technique
    3.6.3 K-means SMOTE
4 Model evaluation
  4.1 K-fold cross validation
  4.2 Performance evaluation metrics
5 Classification models
  5.1 Logistic Regression
  5.2 Decision Tree and Random Forest
    5.2.1 Decision Tree
    5.2.2 Random Forest
  5.3 Support Vector Machine
    5.3.1 Hard margin SVM
    5.3.2 Soft margin SVM
    5.3.3 Kernel SVM - kernel trick
6 Results
  6.1 Logistic Regression
  6.2 Decision Tree
  6.3 Random Forest
  6.4 Support Vector Machine
  6.5 Overall overview
7 Conclusion
Abstract
The aim of this study is to exploit some supervised machine learning algorithms to identify the main factors that determine the probability of a credit card default, underlining the mathematical aspects and the methods used. A credit card default occurs when a cardholder becomes severely delinquent on credit card payments. In order to increase market share, Taiwan's banks issued excess cash and credit cards to unqualified applicants. At the same time, many cardholders, regardless of their repayment capability, used their credit cards excessively for consumption and accumulated heavy credit card debt.
The goal is to build an automated model that both identifies the central drivers of default and predicts credit card default based on customer information and historical transactions. The general concepts of the supervised machine learning paradigm are presented, along with a detailed explanation of all the techniques and algorithms used to build the models. In particular, Logistic Regression, Random Forest and Support Vector Machine algorithms have been applied.
The repository of this paper is available at: https://github.com/MatteoM95/Default-of-Credit-Card-Clients-Dataset-Analisys
1 Introduction
Since 1990, the Taiwanese government has allowed the formation of new banks. In order to increase market share, these banks issued excess cash and credit cards to unqualified applicants. At the same time, most cardholders, regardless of their repayment ability, abused their credit cards for consumption and piled up heavy credit card and cash debt. Default occurs when a credit card holder is unable to comply with the legal obligation to repay. The resulting crisis dealt a severe blow to confidence in consumer credit and has been a major challenge for both banks and cardholders [1].
In a well-developed financial system, crisis management is downstream and risk prediction is upstream. The primary purpose of risk forecasting is to use financial information, such as corporate financial statements and customer transaction and repayment records, to predict individual customers' business performance or credit risk and to reduce damage and uncertainty.
In this project, the aim is to reliably predict who is at risk of defaulting, so that the bank may be able to prevent the loss by offering the customer alternative options (such as forbearance or debt consolidation). To this end, we build an automated model based on customer information and historical transactions that can identify key factors and predict credit card default.
2 Exploratory Data Analysis
2.1 Dataset Description
The Default of Credit Card Clients dataset contains 30,000 instances of credit card status collected in Taiwan from April 2005 to September 2005. The dataset employs the binary variable default payment next month as response variable: it indicates whether the credit card holder will default next month (Yes = 1, No = 0). In particular, for each record (namely, each client) we have demographic information, credit data, history of payments and bill statements. To be precise, the following is the complete list of all 23 predictors [2].
Figure 1 shows what the data look like. The target default.payment.next.month is renamed DEFAULT for brevity, while the PAY 0 column is renamed PAY 1, for consistency with the other repayment-status columns.
Figure 1: Original dataset from UCI machine learning repository through pandas framework
# Column Non-Null Count Dtype
0 LIMIT BAL 30000 non-null int64
1 SEX 30000 non-null int64
2 EDUCATION 30000 non-null int64
3 MARRIAGE 30000 non-null int64
4 AGE 30000 non-null int64
5 PAY 1 30000 non-null int64
6 PAY 2 30000 non-null int64
7 PAY 3 30000 non-null int64
8 PAY 4 30000 non-null int64
9 PAY 5 30000 non-null int64
10 PAY 6 30000 non-null int64
11 BILL AMT1 30000 non-null int64
12 BILL AMT2 30000 non-null int64
13 BILL AMT3 30000 non-null int64
14 BILL AMT4 30000 non-null int64
15 BILL AMT5 30000 non-null int64
16 BILL AMT6 30000 non-null int64
17 PAY AMT1 30000 non-null int64
18 PAY AMT2 30000 non-null int64
19 PAY AMT3 30000 non-null int64
20 PAY AMT4 30000 non-null int64
21 PAY AMT5 30000 non-null int64
22 PAY AMT6 30000 non-null int64
23 DEFAULT 30000 non-null int64
In the repayment status features PAY N, all the values -2 and -1 are mapped to 0. In this way, PAY N indicates for how many months the payment was delayed.
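A minimal sketch of this remapping in pandas; the CSV file name and the underscored column names are assumptions based on the UCI/Kaggle release of the dataset:

```python
import pandas as pd

# Hypothetical local copy of the UCI dataset.
df = pd.read_csv("UCI_Credit_Card.csv")

# Rename the target and PAY_0, as described above.
df = df.rename(columns={"default.payment.next.month": "DEFAULT",
                        "PAY_0": "PAY_1"})

# Map -2 and -1 to 0 in every repayment-status column, so that PAY_N
# counts for how many months the payment was delayed.
pay_cols = [f"PAY_{i}" for i in range(1, 7)]
df[pay_cols] = df[pay_cols].replace({-2: 0, -1: 0})
```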
We still have to inspect the payment status features PAY N; the boxplots shown in Figure 4 are very useful for this. It can be seen that clients who delay payment by one month or less default less often. That is, the repayment status in September (PAY 1) holds greater discriminatory power than the repayment status in the other months.
attribute    value            count    defaulters    defaulters (%)
SEX          Female           17,855   3,744         20.96%
             Male             11,746   2,861         24.35%
EDUCATION    University       14,024   3,329         23.73%
             Graduate school  10,581   2,036         19.24%
             High school       4,873   1,233         25.30%
             Other               123       7          5.70%
MARRIAGE     Single           15,806   3,329         21.06%
             Married          13,477   3,192         23.68%
             Others              318      84         26.40%
Figure 5: KDE plots of LIMIT BAL and AGE grouped by DEFAULT class
2.6 Check for Normality distribution - QQ-plot
Some methods we will use later assume that the data follow a known, specific distribution, i.e. the Normal distribution. Applying such methods to data with a different distribution, our final results may be misleading or plain wrong. To check whether our data are Normally distributed, we use a graphical method called the Quantile-Quantile (QQ) plot, which gives a qualitative evaluation. In a QQ-plot, the quantiles of the variable are plotted against the theoretical quantiles of the normal distribution. If the variable is normally distributed, the dots in the QQ-plot should fall along a 45-degree diagonal. The plots show that there is no evidence that the numerical features are normally distributed.
Figure 6: QQ-plots for the features LIMIT BAL, BILL AMT, AGE and PAY AMT
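A sketch of how such a plot can be produced with scipy, assuming the dataframe df loaded above:

```python
import matplotlib.pyplot as plt
from scipy import stats

# Quantiles of LIMIT_BAL against the theoretical Normal quantiles;
# points far from the 45-degree line indicate departure from normality.
fig, ax = plt.subplots()
stats.probplot(df["LIMIT_BAL"].values, dist="norm", plot=ax)
ax.set_title("QQ-plot of LIMIT_BAL")
plt.show()
```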
2.7 Correlation
Correlation is a statistical term describing the degree to which two random variables move in coordination with one another. Intuitively, if they move in the same direction the variables have a positive correlation; vice versa, they have a negative correlation. The Pearson correlation coefficient (ρ) is one of the most used linear correlation measures.
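For reference, for two random variables X and Y with means $\mu_X, \mu_Y$ and standard deviations $\sigma_X, \sigma_Y$, the coefficient is defined as

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}, \qquad -1 \le \rho_{X,Y} \le 1$$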
Figure 7: Correlation heatmap
3 Data preprocessing
3.1 Handling Categorical Features
The categorical features EDUCATION, SEX, and MARRIAGE are already encoded with integer numbers and
could be fed to a machine learning algorithm. However, these are nominal features, for which it would
be sub-optimal to assume an ordering. One-hot encoding allows us to remove any ordinal relationship,
which would be meaningless between these categorical variables. The idea behind this approach is to
create a new dummy feature for each unique value in the nominal feature column. Binary values can
then be used to indicate the particular class of an example.
Although scikit-learn provides methods to perform one-hot encoding automatically, we decided to do the mapping of the features by hand, since there are only a few. By dropping one dummy column per feature we also mitigate the problem of multicollinearity, which occurs when there are highly correlated features. Thus, we create the corresponding boolean columns and drop the old ones, EDUCATION, SEX, and MARRIAGE.
Using this strategy we do not lose any information. Table 3 and Table 4 show an example of the variables respectively before and after the application of one-hot encoding.
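A minimal sketch of the manual encoding; the category codes follow the UCI documentation (SEX: 1 = male, 2 = female; EDUCATION: 1 = graduate school, 2 = university, 3 = high school; MARRIAGE: 1 = married, 2 = single), while the dummy names and the choice of which level to leave implicit are illustrative:

```python
# One dummy per level except one, to avoid the dummy variable trap
# (perfect multicollinearity among the dummies).
df["MALE"] = (df["SEX"] == 1).astype(int)

df["GRAD_SCHOOL"] = (df["EDUCATION"] == 1).astype(int)
df["UNIVERSITY"] = (df["EDUCATION"] == 2).astype(int)
df["HIGH_SCHOOL"] = (df["EDUCATION"] == 3).astype(int)

df["MARRIED"] = (df["MARRIAGE"] == 1).astype(int)
df["SINGLE"] = (df["MARRIAGE"] == 2).astype(int)

# Drop the original nominal columns.
df = df.drop(columns=["SEX", "EDUCATION", "MARRIAGE"])
```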
3.2 Dataset Partition
• Training set:
  – shape: (22200, 25)
  – defaulters: 4,955
  – non-defaulters: 17,245
  – class proportion: 22.30%
• Test set:
  – shape: (7400, 25)
  – defaulters: 1,650
  – non-defaulters: 5,750
  – class proportion: 22.30%
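A sketch of a stratified holdout split consistent with the numbers above; the 75/25 ratio and the random seed are assumptions:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["DEFAULT"])
y = df["DEFAULT"]

# stratify=y preserves the ~22% defaulter proportion in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```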
3.3 Outliers and Anomaly Detection
The presence of a significant number of outliers could in some cases drastically affect performance. Therefore, it is common practice to train the model both with and without them in order to measure their contribution. There are several techniques for removing outliers; they can be spotted through graphical representations of the data (e.g. boxplots).
Figure 8 shows the boxplots for the BILL AMT and PAY AMT variables; a description of how outliers were determined follows for each group. The BILL AMT variables depict the amount of the bill statement during the respective months. Since these can be considered repeated observations of the same variable (per cardholder), for the purpose of outlier identification they were analyzed using the same boxplot. In addition, to minimize the loss of information due to the elimination of outliers, a cut-off value was set for these variables based on the general trend observed across the six variables. Consequently, for all BILL AMT variables, any amount exceeding 1,000,000 and any amount going below min(BILL AMT4), which was the value 170,000, was considered to be an outlier.
For the PAY AMT variables, the same outlier-detection approach was taken as for the BILL AMT variables. The general trend of the six variables set the upper cut-off point at max(PAY AMT4), which was the value 621,000; any observation exceeding this amount was considered an outlier. On the lower side, no observation had a value lower than 0, hence there were no outliers based on this criterion (taking 0 as the minimum). After the exclusion of outlier values, a sample of 29,993 observations remained (7 observations were dropped), which was the data used for training and validating the model [4].
We do not eliminate any sample as an outlier because the literature on the dataset does not provide
information on this [5], and we lack knowledge about the domain.
Figure 8: Box-plots for the PAY AMT and BILL AMT variables
3.5 Dimensionality Reduction
Dimensionality reduction can be carried out through two main approaches:
• feature selection: we select a subset of the original features;
• feature extraction: we construct a new feature subspace deriving information from the original feature set.
As previously noticed, the BILL AMT features are highly correlated with one another, so a manual discard could be performed. However, we decided to keep them and perform Principal Component Analysis, a feature extraction technique.
3.5.1 PCA - Principal Component Analysis
PCA seeks a compression matrix $W \in \mathbb{R}^{n,d}$ that maps each original vector to a lower-dimensional representation,

$$x \mapsto Wx, \qquad x \in \mathbb{R}^d,\ Wx \in \mathbb{R}^n\ (n < d) \tag{4}$$

and a second matrix $U \in \mathbb{R}^{d,n}$ that can be used to recover each original vector $x$ from its compressed version. That is, for a compressed vector $y = Wx$, where $y$ is in the low-dimensional space $\mathbb{R}^n$, we can construct $\tilde{x} = Uy$, so that $\tilde{x}$ is the recovered version of $x$ and resides in the original high-dimensional space $\mathbb{R}^d$.
The objective of PCA is to find the compression matrix $W$ and the recovering matrix $U$ such that the total squared distance between the original and recovered vectors is minimal in the least-squares sense:

$$\operatorname*{argmin}_{W \in \mathbb{R}^{n,d},\, U \in \mathbb{R}^{d,n}} \sum_{i=1}^{m} \left\| x_i - U W x_i \right\|_2^2 \tag{5}$$

It can be shown that an optimal solution takes the form $W = U^{\top}$ with $U$ having orthonormal columns, so that the problem is equivalent to:

$$\operatorname*{argmax}_{U \in \mathbb{R}^{d,n} :\, U^{\top} U = I_n} \operatorname{tr}\left( U^{\top} \left( \sum_{i=1}^{m} x_i x_i^{\top} \right) U \right) \tag{8}$$
Now we can define the scatter matrix $A = \sum_{i=1}^{m} x_i x_i^{\top}$. Since this matrix is symmetric, it can be written using its spectral decomposition $A = V D V^{\top}$, where $D$ is diagonal and $V^{\top} V = V V^{\top} = I$. Here, the elements on the diagonal of $D$ are the eigenvalues of $A$, while the columns of $V$ are the corresponding eigenvectors. In particular, we can assume that the diagonal elements of $D$ are sorted in decreasing order, and they are all non-negative because $A$ is positive semi-definite. The solution of (8) is then given by the $n$ eigenvectors of $A$ associated with its $n$ largest eigenvalues.
So to sum up, PCA helps us to identify patterns in the data based on the correlation between
features, finding the directions of maximum variance in high-dimensional data and projecting it onto
a new subspace with equal or fewer dimensions than the original one.
The orthogonal axes (PC = principal components) of the new subspace can be interpreted as the
directions of maximum variance given the constraint that the new feature axes are orthogonal to each
other, as illustrated in the Figure 10. Here, x1 and x2 are the original feature axes, and P C1 and P C2
are the principal components [7].
Since the dimensionality of our dataset is reduced by compressing it onto a new feature subspace, the number of principal components should be chosen so as to preserve as much as possible of the information carried by the data. The decision can be made by looking at the explained variance ratio, i.e. how much of the total variance each of the d principal components obtained from the training dataset accounts for.
The results depicted in Figure 11 show that the first 6 principal components capture more than 90% of the total variance, while the first 12 principal components explain 99% of the total variance, even though the number of features is halved.
Figure 11: Cumulative and individual explained variance plotted against principal components
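A sketch of how these ratios can be inspected with scikit-learn, assuming the training split above; standardizing before PCA is our choice here, since PCA is sensitive to feature scale:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_train_std = StandardScaler().fit_transform(X_train)

pca = PCA()              # keep all components to inspect the ratios
pca.fit(X_train_std)

# Cumulative explained variance, used to pick the number of components.
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.99)) + 1
print(n_components, cum_var[n_components - 1])
```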
3.6 Class Imbalance - Resampling
The loss that a classifier minimizes is computed as a sum over the training examples it sees during fitting; on an imbalanced dataset such as ours, the decision rule is therefore likely to be biased toward the majority class.
The option of collecting more data is excluded a priori. There are other options to tackle this problem:
• at training time, assign a larger penalty to wrong predictions on the minority class;
• resample the dataset, undersampling the majority class and/or oversampling the minority class, as described in the following subsections.
3.6.2 SMOTE - Synthetic Minority Oversampling Technique
The Synthetic Minority Oversampling Technique (SMOTE) was proposed in [11] to avoid the risk
of overfitting faced by random oversampling. Instead of simply replicating existing observations, the
technique generates artificial samples. As shown in Figure 14, this is achieved by linearly interpolating
a randomly selected minority observation and one of its neighboring minority observations. More
precisely, SMOTE executes three steps to generate a synthetic sample:
1. a random observation x is selected from the minority class;
2. one of its k nearest minority-class neighbors, x', is chosen at random;
3. the synthetic sample is obtained by linear interpolation, x_new = x + λ(x' − x), with λ drawn uniformly from [0, 1].
Figure 14: SMOTE linearly interpolates a randomly selected minority sample and one of its k = 4
nearest neighbors [12]
Unfortunately SMOTE, like other random oversampling techniques, suffers from weaknesses related to within-class imbalance and noise. In fact, SMOTE starts by choosing a random sample from the minority class, but if the distribution of the minority class is not uniform, this can cause densely populated areas to be further inflated with artificial data. Moreover, SMOTE does not recognize noisy minority samples located among majority-class instances, and interpolating them with their minority neighbors may create new noisy entries.
Finally, it has been proven that classification algorithms could benefit from samples that are closer
to class boundaries, and SMOTE does not specifically work in this sense, as shown in Figure 15.
Figure 15: SMOTE behavior in presence of noise and within-class imbalance [12]
3.6.3 K-means SMOTE
As shown in [12], this method employs the simple and popular K-means clustering algorithm in conjunction with SMOTE oversampling in order to rebalance datasets. It manages to avoid the generation of noise by oversampling only in safe areas (i.e., areas made up of at least 50% minority samples). Moreover, it focuses on both between-class imbalance and within-class imbalance, fighting the small disjuncts problem by inflating sparse minority areas. K-means SMOTE executes three steps, as shown in Figure 16: clustering, filtering, and oversampling.
• Clustering: the input space is clustered into k groups using K-means clustering.
• Filtering: selects clusters for oversampling, retaining those with a high proportion of minority
class samples. It then distributes the number of synthetic samples to generate, assigning more
samples to clusters where minority samples are sparsely distributed.
• Oversampling: SMOTE is applied in each selected cluster to achieve the target ratio of minority
and majority instances.
Figure 16: K-means SMOTE oversamples safe areas and combats within-class imbalance
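Both resamplers are provided by the imbalanced-learn toolbox [9]; a sketch, where the 0.5 cluster balance threshold mirrors the 50% rule mentioned above and the remaining parameters are library defaults:

```python
from imblearn.over_sampling import SMOTE, KMeansSMOTE

# Plain SMOTE, interpolating each sample with one of its k nearest
# minority neighbors (k = 5 is the library default).
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# K-means SMOTE: cluster, keep "safe" clusters (at least 50% minority
# samples), then apply SMOTE inside them.
kms = KMeansSMOTE(cluster_balance_threshold=0.5, random_state=42)
X_res_km, y_res_km = kms.fit_resample(X_train, y_train)
```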
4 Model evaluation
One of the main steps in building a machine learning model is estimating its performance on data that the model has not seen before. For this reason, in Section 3.2 the initial dataset was split into separate training and test sets: the former is used for model training, and the latter to estimate its generalization performance. This approach is commonly known as the holdout method.
However, in typical machine learning applications we are also interested in tuning and comparing different parameter settings to further improve the performance on unseen data. If we reuse the same test dataset over and over again during this process, it effectively becomes part of our training data, and the model becomes more likely to overfit. A validation set could be held out of the training set to evaluate the model on it; however, this is not recommended, because the performance estimate may be sensitive to how we partition the training set.
4.1 K-fold cross validation
Since k-fold cross validation is a resampling technique without replacement, its advantage is that each sample of the training set is used exactly once for validation, which yields a lower-variance estimate of the model performance than the holdout method. A common value for k is 10 [13]. However, we are dealing with a large dataset, for which such a value would be computationally too costly; for this reason we use a smaller value, k = 5, while still taking advantage of the k-fold method.
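A sketch of the procedure with scikit-learn; the stratified variant and logistic regression as the estimator are illustrative choices:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 5 folds; stratification keeps the ~22% minority share in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```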
4.2 Performance evaluation metrics
The confusion matrix tabulates the predictions of a binary classifier against the actual classes:

                              Actual Class
                           Positive   Negative
Predicted Class  Positive     TP         FP
                 Negative     FN         TN
The following terminology is used when referring to the counts tabulated in a confusion matrix:
• True positive (TP), which corresponds to the number of positive examples correctly predicted
by the classification model.
• False negative (FN), which corresponds to the number of positive examples wrongly predicted
as negative by the classification model.
• False positive (FP), which corresponds to the number of negative examples wrongly predicted as
positive by the classification model.
• True negative (TN), which corresponds to the number of negative examples correctly predicted
by the classification model.
Different metrics can be used depending on the task, the data imbalance and other factors. While
dealing with classification tasks, these are some of the most used ones:
• Accuracy: the performance measure most generally associated with machine learning algorithms. It is the ratio of correct predictions over the total number of data points classified:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{11}$$
• Precision (also called positive predictive value): indicates how many of the class-j predictions (in binary classification, commonly the True class) are correct. It is defined as the ratio of correct positive predictions over all positive predictions:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{12}$$
• Recall (also called sensitivity): indicates how many of the class-j samples are correctly classified. It is defined as the fraction of correct class-j predictions over the total number of class-j samples:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{13}$$
• F1: the accuracy measure treats every class as equally important, and for this reason it may not be suitable for imbalanced datasets, where the rare class is considered more interesting than the majority class [14].
Hence Precision and Recall, which are class-specific metrics, are widely employed in applications in which successful detection of the rare class is more significant than detection of the other. The challenge is to build a model capable of maximizing both Precision and Recall; hence, the two metrics are usually summarized in a new metric, the F1-score. In practice, the F1-score is the harmonic mean of precision and recall, so a high value ensures that both are reasonably high:
$$F_1 = \frac{2}{\frac{1}{r} + \frac{1}{p}} = \frac{2rp}{r + p}, \qquad \text{where } r = \text{Recall},\ p = \text{Precision} \tag{14}$$
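A sketch relating the counts above to scikit-learn's metric functions; `model` stands for any fitted classifier from the following sections:

```python
from sklearn.metrics import confusion_matrix, f1_score

y_pred = model.predict(X_test)
# scikit-learn's confusion matrix is laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision = tp / (tp + fp)                          # Eq. (12)
recall = tp / (tp + fn)                             # Eq. (13)
f1 = 2 * recall * precision / (recall + precision)  # Eq. (14)
print(f1, f1_score(y_test, y_pred))                 # the two values coincide
```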
5 Classification models
In this section, we present different supervised learning algorithms with their mathematical details, and
we use them on our dataset to build a classification model that is able to predict credit card defaults
in the next month. In particular we will dive into Support Vector Machine, Logistic Regression and
some tree based methods, all following the Empirical Risk Minimization paradigm.
5.1 Logistic Regression
Logistic regression belongs to the family of Generalized Linear Models (GLMs). A key component of a GLM is the link function, which specifies a function g that relates E[Y] to the linear predictor [15]:

$$g(E[Y]) = \alpha + \beta_1 x_1 + \dots + \beta_p x_p \tag{16}$$

Hence, in a GLM the expected response for a given feature vector $x = [x_1, \dots, x_n]$ is of the form

$$E[Y \mid X = x] = h(x^{\top}\beta) \tag{17}$$

with h, called the activation function, being the inverse of the link function g [16].
Rather than directly modeling the distribution of Y, logistic regression models the probability that Y belongs to a particular class, using the logistic function as activation function h:

$$P(Y_i = 1 \mid X = x_i) = h(x_i^{\top}\beta) = \frac{e^{x_i^{\top}\beta}}{1 + e^{x_i^{\top}\beta}} \tag{18}$$

Indeed, when $x_i^{\top}\beta$ is large we have a high probability that Y is 1, and when $x_i^{\top}\beta$ is small we have a high probability that Y is 0. The logistic function always produces an S-shaped curve between 0 and 1, as shown in Figure 19.
To estimate the coefficient vector β from the available training data we use the Maximum Likelihood method, which finds β̂, the maximum likelihood estimate of β. This is formalized as:

$$L(\beta) = \prod_{i=1}^{m}\left[h(x_i^{\top}\beta)\right]^{y_i}\left[1 - h(x_i^{\top}\beta)\right]^{1 - y_i} \tag{19}$$

where L(β) is the likelihood which, maximized with respect to β, gives the maximum likelihood estimator of β. In a supervised learning setting, that maximization is equivalent to minimizing the function:
$$-\frac{1}{m}\log L(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y_i \log h(x_i^{\top}\beta) + (1 - y_i)\log\left(1 - h(x_i^{\top}\beta)\right) \right] \tag{20}$$
We can interpret this function as the binary cross-entropy training loss associated with comparing a true conditional probability density function (pdf) with an approximated pdf. In this study we use the Logistic Regression model provided by the scikit-learn Python library, which by default applies l2 regularization. A regularization term is useful in reducing the generalization error but not the training error, preventing overfitting on the training set.
The regularization term is added to the objective function and in practice penalizes large weight values during training; in other words, it forces the weights toward 0. In particular, l2 regularization does not drive them exactly to 0 (while l1 does), and it is defined as:
$$l_2(\beta, \lambda) = \frac{\lambda}{2}\sum_{i=1}^{m}\beta_i^2 \tag{21}$$
where λ, called the regularization parameter, allows us to tune the regularization strength.
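A sketch of the model used here; the scaling step and the value of C (the inverse of the regularization strength, in scikit-learn's parametrization) are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# penalty="l2" is scikit-learn's default; smaller C means stronger penalty.
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(penalty="l2", C=1.0, max_iter=1000))
clf.fit(X_train, y_train)
```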
5.2 Decision Tree and Random Forest
5.2.1 Decision Tree
A decision tree is grown by recursively splitting the training data: at each node, a predictor and a splitting rule are chosen so as to maximize the purity of the two resulting groups:
• if the predictor is continuous, we choose a cut-point that maximizes purity for the two groups created;
• if the predictor is categorical, we combine the categories to obtain two groups with maximum purity.
In an iterative process, we can then repeat this splitting procedure at each child node until the
leaves are pure. This means that the training examples at each node all belong to the same class.
Unfortunately, this process tends to produce a tree that is too large and suffers from overfitting. Thus,
we typically prune the tree by setting a limit for the maximal depth of the tree.
Gini impurity ($I_G$) and entropy ($I_H$) are the most commonly used splitting criteria in binary decision trees. Defining $p(i|t)$ as the proportion of the examples that belong to class $i$ at a particular node $t$, the two criteria can be written as:
• Gini impurity: $I_G(t) = \sum_{i=1}^{c} p(i|t)\left(1 - p(i|t)\right)$
• Entropy: $I_H(t) = -\sum_{i=1}^{c} p(i|t)\log_2 p(i|t)$
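A small sketch computing the two criteria for a node whose class proportions roughly match our dataset:

```python
import numpy as np

def gini(p):
    """Gini impurity of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def entropy(p):
    """Entropy of a node, skipping empty classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(gini([0.22, 0.78]))     # ~0.343
print(entropy([0.22, 0.78]))  # ~0.760
```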
However, even though a decision tree is fairly interpretable, it is typically less accurate and robust
compared to more sophisticated algorithms.
A more advanced tree based algorithm is Random Forest.
5.2.2 Random Forest
Random Forest is an ensemble method which reduces the variance of a single model by combining multiple decision trees with the bagging technique. The idea behind bagging (bootstrap aggregating) is to generate different bootstrapped training sets by sampling with repetition from the dataset; Figure 20 illustrates the concept.
Figure 20: Illustration of the bagging technique
Random Forest also applies random sampling at each split for feature selection, in order to decorrelate the trees: each splitting rule is generated considering only a randomly selected subset of features of fixed size. This is why the algorithm is fairly robust to noise and outliers and has much lower variance than a single decision tree. Random forest models are not as interpretable as decision trees, but they allow us to measure the importance of each feature. As we will see in the results and in Figure 21, our analysis on the raw data shows that repayment status, age and bill amount are very important in the decision process.
Figure 21: Feature importance obtained training Random Forest on the raw training dataset
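A sketch of how the importances in Figure 21 can be obtained; the number of trees is an illustrative value:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```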
5.3 Support Vector Machine
Support Vector Machines (SVMs) are considered among the most effective classification algorithms in modern machine learning [19]. When used for classification tasks, SVMs are supervised learning methods that construct a hyperplane maximizing the margin between the two classes in the feature space. A hyperplane in a feature space H is the set

$$\{x \in H \mid \langle w, x \rangle + b = 0\} \tag{22}$$

where $w \in H$ and $b \in \mathbb{R}$. Such a hyperplane naturally divides H into the two half-spaces $\{x \in H \mid \langle w, x \rangle + b \ge 0\}$ and $\{x \in H \mid \langle w, x \rangle + b \le 0\}$, and hence can be used as the decision boundary of a binary classifier.
Given a set X = [x1 , ..., xm ], the margin is the distance of the closest point in X to the hyperplane:
| ⟨w, xi ⟩ |
min (23)
i=1,...,m ∥w∥
Since the parametrization of the hyperplane is not unique, we set
Let $S = [(x_1, y_1), \dots, (x_m, y_m)]$ be a training set of examples, where each $x_i \in H$ and $y_i \in \{\pm 1\}$. Our aim is to find a linear decision boundary parameterized by (w, b) such that $\langle w, x_i \rangle + b \ge 0$ whenever $y_i = +1$ and $\langle w, x_i \rangle + b < 0$ whenever $y_i = -1$. The SVM solution is the separating hyperplane with the maximum geometric margin, as it is the safest choice. The problem of maximizing the margin can be written as:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(\langle w, x_i \rangle + b) \ge 1 \quad \forall i \tag{25}$$
5.3.2 Soft margin SVM
In practice the data are rarely linearly separable; the soft margin SVM relaxes the hard constraints by introducing slack variables $\xi_i \ge 0$:
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i,\quad \xi_i \ge 0\ \ \forall i \tag{26}$$
where C is a penalty parameter, typically determined via k-fold cross validation. A closer look at the problem reveals that [20]:
• $\xi_i = 0$ whenever $y_i(\langle w, x_i \rangle + b) \ge 1$;
• $\xi_i = 1 - y_i(\langle w, x_i \rangle + b)$ whenever $y_i(\langle w, x_i \rangle + b) \le 1$.
In other words, the slack variables measure the hinge loss which, in the context of learning half-spaces, is defined as

$$\ell^{\text{hinge}}(y, \hat{y}) = \max\{0,\ 1 - y\hat{y}\}$$
5.3.3 Kernel SVM - kernel trick
Let us now consider the dual problem of (26), formulated as:

$$\max_{\alpha \in \mathbb{R}^m,\ \alpha \ge 0}\ \left( \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \right) \tag{29}$$
It is evident that the dual problem only involves inner products between instances, which is nothing but a linear kernel, so nothing prevents us from using other kernel functions to measure similarity in higher dimensions. Given a non-linear mapping $\psi : X \mapsto F$, the kernel function is defined as

$$K(x, x') = \langle \psi(x), \psi(x') \rangle$$

and it enables an implicit non-linear mapping of the input points to a high-dimensional space where large-margin separation is sought.
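A sketch of a soft margin SVM with a kernel in scikit-learn; the RBF kernel and the hyperparameter values are illustrative assumptions:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# C is the penalty of Eq. (26); gamma controls the RBF kernel width.
# Scaling matters because the kernel is distance-based.
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
```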
The complexity now depends on the size of the dataset, because for M data points we need to compute $\binom{M}{2}$ inner products. Several kernel functions were explored in our analysis.
6 Results
In the following pages, an overview of the results is given for each classifier. Different preprocessing combinations were tested: applying dimensionality reduction (PCA) or not, and using or not using the different resampling techniques. The metric we chose to adopt is the F1-score. The Precision-Recall curve and the confusion matrix of the best configuration are provided for each model.
6.1 Logistic Regression
Table 6: Results obtained with the Logistic Regression model and different preprocessing strategies
According to Figure 25, the best performance was obtained using PCA and the SMOTE oversampling technique. The confusion matrix is also provided.
Figure 25: Results obtained for the Logistic Regression classifier using different settings
6.2 Decision Tree
Figure 26: Results obtained for the Decision Tree classifier using different settings
6.3 Random forest
In the RandomForestClassifier implementation in scikit-learn, the size of the bootstrap sample is chosen to be equal to the number of samples in the original training dataset, which usually provides a good bias-variance tradeoff [6]. Therefore we are interested in tuning the number of trees that form the forest (n estimators) and the maximum number of features to consider at each split (max features). Again, we perform a cross-validated grid search to tune these parameters; the results are reported in Table 7.
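A sketch of this search; the candidate grids are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", "log2", None],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_)
```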
According to Figure 27, the best performance was again obtained using PCA and the SMOTE oversampling technique. The confusion matrix is also provided.
Table 7: Results obtained with Random Forest model and different preprocessing strategies
Figure 27: Results obtained for Random forest using different settings
6.4 Support Vector Machine
Table 8: Results obtained with the SVM model and different preprocessing strategies
Figure 28: Results obtained for SVM Classifier using different settings
7 Conclusion
In this study, different supervised learning algorithms have been inspected, presented with their mathematical details, and finally used on the UCI dataset to build a classification model that is able to predict whether a credit card client will default in the next month. Data preprocessing makes the algorithms perform slightly better than when trained on the original data: in particular, with PCA the results are approximately the same, but the computational cost is lowered. Oversampling and undersampling techniques have been combined with PCA to address the dataset imbalance problem. As mentioned, oversampling performed slightly better than undersampling, likely because the model is trained on a larger amount of data. However, all the models implemented achieved comparable results in terms of accuracy.
Still, our results are quite modest, and other methods may be explored to obtain better performance. It would be interesting to implement some gradient boosting based models such as the Gradient Boosting Classifier, or the SGD Classifier, and some outlier-management approaches such as Local Outlier Factor or Isolation Forest could also help to improve our results.
References
[1] I.-C. Yeh and C.-h. Lien, "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients," Expert Systems with Applications, vol. 36, no. 2, pp. 2473–2480, Mar. 2009. [Online]. Available: https://doi.org/10.1016/j.eswa.2007.12.020.
[2] D. Dua and C. Graff, "UCI machine learning repository: Default of credit card clients data set," 2019. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.
[3] T. M. Alam, K. Shaukat, I. A. Hameed, S. Luo, M. U. Sarwar, S. Shabbir, J. Li, and M. Khushi, "An investigation of credit card default prediction in the imbalanced datasets," IEEE Access, vol. 8, pp. 201173–201198, 2020. [Online]. Available: https://doi.org/10.1109/access.2020.3033784.
[4] L. M. Muchiri, "A model for predicting credit card loan defaulting using cardholder characteristics and account transaction activities," Thesis, Strathmore University, 2020. [Online]. Available: http://hdl.handle.net/11071/12127.
[5] Y. Chen and R. Zhang, "Research on credit card default prediction based on k-means SMOTE and BP neural network," Complexity, vol. 2021, Article ID 6618841, 13 pages, 2021. [Online]. Available: https://doi.org/10.1155/2021/6618841.
[6] S. Raschka, Python Machine Learning. Packt Publishing, 2015.
[7] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning. Cambridge University Press, 2014. [Online]. Available: https://doi.org/10.1017/cbo9781107298019.
[8] S.-J. Yen and Y.-S. Lee, "Cluster-based under-sampling approaches for imbalanced data distributions," Expert Systems with Applications, vol. 36, no. 3, pp. 5718–5727, Apr. 2009. [Online]. Available: https://doi.org/10.1016/j.eswa.2008.06.108.
[9] G. Lemaitre, F. Nogueira, and C. K. Aridas, "Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning," Sep. 2016. [Online]. Available: https://arxiv.org/abs/1609.06570.
[10] Q. Zou, Z. Wang, X. Guan, B. Liu, Y. Wu, and Z. Lin, "An approach for identifying cytokines based on a novel ensemble classifier," BioMed Research International, vol. 2013, pp. 1–11, 2013. [Online]. Available: https://doi.org/10.1155/2013/686090.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," 2011. [Online]. Available: https://arxiv.org/abs/1106.1813.
[12] F. Last, G. Douzas, and F. Bacao, "Oversampling for imbalanced learning based on k-means and SMOTE," Nov. 2017. [Online]. Available: https://arxiv.org/abs/1711.00837.
[13] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, ser. IJCAI'95. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1995, pp. 1137–1143.
[14] P.-N. Tan, Introduction to Data Mining, 2nd ed. Pearson Education Limited, 2019.
[15] A. Agresti, An Introduction to Categorical Data Analysis, 3rd ed., ser. Wiley Series in Probability and Statistics. John Wiley and Sons, 2019. [Online]. Available: http://gen.lib.rus.ec/book/index.php?md5=ec387de4af731cc168b9f0506700f5cc.
[16] D. P. Kroese, Z. I. Botev, T. Taimre, and R. Vaisman, Data Science and Machine Learning: Mathematical and Statistical Methods, ser. Chapman and Hall/CRC Machine Learning and Pattern Recognition. CRC Press, 2020.
[17] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R, ser. Springer Texts in Statistics, vol. 103. Springer, 2013.
[18] R. I. Kabacoff, R in Action, 2nd ed. Manning Publications, 2015. [Online]. Available: http://www.cs.uni.edu/~jacobson/4772/week11/R_in_Action.pdf.
[19] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT Press, 2018.
[20] A. Smola and S. V. N. Vishwanathan, Introduction to Machine Learning, 2008. [Online]. Available: https://alex.smola.org/drafts/thebook.pdf.