Final - Bank Customer Response Prediction Model
A. Introduction
This project predicts bank customer responses as a classification task. After data collection, the dataset is preprocessed: missing data are handled, categorical features are encoded, and numeric features are rescaled with a Min-Max scaler.
Missing Value Handling: To handle missing data, we use methods such as mode() and mean() to fill the gaps with the most frequent or average values of the respective features. This approach retains the completeness of the dataset and prevents biased analysis due to missing values. In this step, we first check whether the dataset contains any missing data; if we find missing values, we fill them in.
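A minimal sketch of this step, assuming a pandas DataFrame loaded from a file named bank_data.csv, with illustrative column names such as balance and job (the actual file and column names may differ):

import pandas as pd

df = pd.read_csv("bank_data.csv")  # assumed file name

# Count missing values per column
print(df.isnull().sum())

# Fill numeric gaps with the mean, categorical gaps with the mode
df["balance"] = df["balance"].fillna(df["balance"].mean())  # 'balance' is a hypothetical column
df["job"] = df["job"].fillna(df["job"].mode()[0])           # 'job' is a hypothetical column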
Encoding Data: Next, we encode the dataset, separating the dependent and independent variables. Here we apply the label encoding technique to convert categorical values into integers.
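A label-encoding pass over all categorical columns might look like the following sketch, using scikit-learn's LabelEncoder on the DataFrame from the previous step:

from sklearn.preprocessing import LabelEncoder

# Convert every object-typed (categorical) column to integer labels
le = LabelEncoder()
for col in df.select_dtypes(include="object").columns:
    df[col] = le.fit_transform(df[col])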
Min Max Scaler: In our dataset, the ranges of the features vary widely, which hurts model performance. The Min-Max scaler rescales these large-range values into a common range, which is why we used it.
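A sketch of the scaling step; the target column name 'deposit' is an assumption, and X/y simply denote the independent features and the label:

from sklearn.preprocessing import MinMaxScaler

X = df.drop(columns=["deposit"])  # 'deposit' is an assumed target column
y = df["deposit"]

# Rescale every feature into the [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)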
The dataset changes in several ways after preprocessing. Before preprocessing, we had 17 columns and a large number of missing values. After preprocessing, we have only 13 columns and no missing values, because we dropped the irrelevant columns and handled all the missing entries. The dataset before preprocessing is shown here:
Fig.C.1.1: Data Set
Data Scaling:
Correlation: Correlation is a statistical measure that describes the extent to which two
variables change together. In other words, it quantifies the degree to which a change in
one variable is associated with a change in another variable. Correlation does not imply
causation; it only measures the strength and direction of a linear relationship between two
variables.
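One way to inspect correlations in the preprocessed data is the pandas correlation matrix; the target column name below is the same assumption as above:

# Pearson correlation matrix of the numeric features
corr = df.corr(numeric_only=True)
print(corr["deposit"].sort_values(ascending=False))  # 'deposit' is an assumed target column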
Precision:
Precision is a metric that measures the proportion of positive predictions that are correct. It is calculated as the ratio of true positives to the sum of true positives and false positives. Precision is especially valuable when the consequences of false positives are significant, since maximizing it minimizes incorrect positive predictions.
Precision = True Positives / (True Positives + False Positives)
Recall(Sensitivity):
Recall, also known as sensitivity or true positive rate, quantifies the ratio of correctly predicted positive instances to the total number of actual positive instances. It is calculated as the proportion of true positives relative to the combined total of true positives and false negatives. Recall matters most in situations where the consequences of false negatives are significant, since it is desirable to minimize missed positive examples.
Recall = True Positives / (True Positives + False Negatives)
F1 Score:
The F1-score is calculated as the harmonic mean of precision and recall. This approach offers a harmonious
equilibrium between the aforementioned indicators, proving particularly advantageous when seeking to
account for both false positives and false negatives throughout the evaluation process. The F1-score is
computed using the following formula:
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Support:
Support is the number of actual occurrences of each class in the dataset; it gives context for the other per-class metrics.
Here Fig. C.1.3 shows the model's classification report with precision, recall, F1-score, and support.
Here Fig. C.1.4 shows the results after computing sensitivity and specificity.
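Scikit-learn can produce all four per-class metrics at once; a sketch, assuming y_test holds the true labels and y_pred the predictions of a fitted classifier:

from sklearn.metrics import classification_report

# Prints precision, recall, F1-score and support for each class
print(classification_report(y_test, y_pred))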
E. Model Selection
SVM:
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and
regression. The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional
space that can separate the data points in different classes in the feature space. SVM works by mapping data
to a high-dimensional feature space so that data points can be categorized, even when the data are not
otherwise linearly separable. A separator between the categories is found, then the data are transformed in
such a way that the separator can be drawn as a hyperplane. In this project we use SVM because it is effective in high-dimensional cases. In addition, different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.
Equation:
f(x) = β0 + β1·x1 + … + βn·xn
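A minimal training sketch with scikit-learn's SVC; the RBF kernel, split ratio, and random seed are assumptions, and X_scaled and y come from the preprocessing steps above:

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)  # assumed split settings

svm_model = SVC(kernel="rbf")  # kernel choice is an assumption
svm_model.fit(X_train, y_train)
print("SVM test accuracy:", svm_model.score(X_test, y_test))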
Random Forest Classifier:
Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset
and takes the average to improve the predictive accuracy of that dataset. Random Forest grows multiple
decision trees which are merged together for a more accurate prediction. The logic behind the Random
Forest model is that multiple uncorrelated models (the individual decision trees) perform much better as a
group than they do alone.
Equation:
ŷ = (1/N) · Σ_{i=1}^{N} y_i
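A comparable sketch for the Random Forest model; n_estimators=100 is an assumed setting:

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
print("Random Forest test accuracy:", rf_model.score(X_test, y_test))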
K-Nearest Neighbors:
The k-nearest neighbors algorithm, also known as KNN or k-NN, is a nonparametric, supervised learning
classifier, which uses proximity to make classifications or predictions about the grouping of an individual
data point. While it can be used for either regression or classification problems, it is typically used as a
classification algorithm, working off the assumption that similar points can be found near one another. KNN
classifier operates by finding the k nearest neighbors to a given data point, and it takes the majority vote to
classify the data point. The value of k is crucial, and one needs to choose it wisely to prevent overfitting or
underfitting the model.
Equation:
ŷ = mode(y1, y2, …, yk)
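A sketch of the classifier; weights="distance" mirrors the weighted k-NN described in the results discussion, and n_neighbors=5 is an assumed value:

from sklearn.neighbors import KNeighborsClassifier

# With weights="distance", closer neighbours get larger votes
knn_model = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)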
F. Classification Metrics Accuracy Verification
Here we check the Decision Tree classifier and KNN classifier models with the Recall and F1 performance measurement metrics.
Recall: Recall, also known as sensitivity or true positive rate, is a crucial metric in classification tasks. If
you focus only on accuracy, a model might predict the majority class all the time and still have a high
accuracy, but it would miss important predictions of the minority class. Recall specifically focuses on the
ability of the model to find all relevant cases within a dataset, especially the minority class.
F1: Our F1 score is a strong result for the classifier model. The F1 score ranges from 0 to 1: a value near 0 indicates that the model (or its preprocessing) is performing poorly, while a value near 1 indicates a near-perfect balance between precision and recall.
Here we check the KNN regression model with the MAE, MSE, RMSE, and R² performance measurement metrics.
MAE (Mean Absolute Error): Measures the average absolute difference between the predicted and actual values; lower values indicate better performance.
Equation:
MAE = (1/n) · Σ |y_i − ŷ_i|
This metric is computed on the test-set targets and the KNN regressor's predictions, giving an MAE of 0.36.
MSE (Mean Square Error): Measures the average of the squared differences between predicted and actual values; lower values indicate better performance, with 0 being a perfect score.
Equation:
MSE = (1/n) · Σ (y_i − ŷ_i)²
Computed on the test-set targets and the KNN regressor's predictions, the MSE is 0.095 (see the chart below). Since lower MSE means less error, this is a reasonable score.
[Chart: MSE Score — KNN: 0.095]
RMSE (Root Mean Square Error): The square root of MSE; it provides an interpretable error measure in the same unit as the target variable.
Equation:
RMSE = √( (1/n) · Σ (y_i − ŷ_i)² )
Computed on the test-set targets and the KNN regressor's predictions, the RMSE is 0.30 (see the chart below). Lower RMSE means less error between predicted and actual values. Note that there is no fixed threshold for a "good" RMSE; it depends on the specific problem and the context of the analysis.
[Chart: RMSE Score — KNN: 0.30]
R² (R-squared, the coefficient of determination): Measures the proportion of the variance in the dependent variable that is predictable from the independent variables. The best possible score is 1.0, and it can be negative.
Equation:
R² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²
Computed on the test-set targets and the KNN regressor's predictions, the R² is 0.84, which is a satisfactory score given that the best possible value is 1.0.
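All four regression metrics can be computed with scikit-learn; a sketch, assuming y_test and y_pred_reg come from a fitted KNN regressor:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred_reg)
mse = mean_squared_error(y_test, y_pred_reg)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred_reg)
print(mae, mse, rmse, r2)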
Jaccard Score:
The Jaccard score is a metric used to evaluate the similarity and diversity of sample sets; it is defined as the ratio of the intersection to the union. The Jaccard coefficient is a statistic for comparing two finite sample sets, calculated by dividing the size of the sets' intersection by the size of their union. The Jaccard score chart and percentages for each algorithm in this model are shown in the figure:
[Chart: Jaccard Score (%) — KNN: 0.82]
In this figure, KNN had the highest Jaccard score at 0.82, followed by the Random Forest Classifier (0.79), Decision Tree Classification (0.74), and Logistic Regression (0.70); Support Vector Machine had the lowest score at 0.68.
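The per-model scores can be computed with scikit-learn's jaccard_score; a sketch for the KNN predictions from above:

from sklearn.metrics import jaccard_score

# Jaccard similarity between predicted and true positive labels
print(jaccard_score(y_test, y_pred))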
Cross_Val_Score:
cross_val_score is a function in the scikit-learn package that trains and tests a model over multiple folds of your dataset. This cross-validation method gives you a better understanding of model performance over the whole dataset than a single train/test split does.
In cross-validation, KNN again performed best, with fold scores around 0.95.
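A sketch of the cross-validation run; cv=5 folds is an assumed setting:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(knn_model, X_scaled, y, cv=5)
print(scores, scores.mean())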
Sensitivity:
[Chart: Sensitivity Score (%) — KNN: 98.93]
Specificity:
[Chart: Specificity Score (%)]
Support Vector Machine: 71.89
Logistic Regression: 71.89
Decision Tree Classification: 45.31
Random Forest Classifier: 54.68
KNN: 7.34
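Sensitivity and specificity both follow directly from the confusion matrix; a sketch for a binary target, reusing y_test and y_pred from above:

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity)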
G. Results Discussion
After applying the different algorithms to our bank deposit project, we compared the models. The results indicate that the k-NN algorithm performed best in this approach.
From the table we see that Logistic Regression achieved a training-set accuracy of 80.95% and a test-set accuracy of 79%, which means it fits well and produces a good result. The Decision Tree Classifier achieved 100% accuracy on the training set but only 83.9% on the test set; the large gap between the two shows that the training set is overfitted. If a model fits well, the test result should be close to the training-set accuracy, so this model is not the best fit for the bank deposit data. The Random Forest Classifier behaved similarly, with a training-set accuracy of 99.9% (close to 100%) and a test-set accuracy of 86.13%; like the Decision Tree Classifier it is overfitted, but its test-set result is considerably better. The Support Vector Machine (SVM) achieved a training-set accuracy of 83.47% and a test-set accuracy of 81.37%, so this model is well fitted and also produces a promising result. Among these models, the K-Neighbors Classifier (KNN) produced the most promising result: a training-set accuracy of 90.92% and a test-set accuracy of 90.55%. Compared with the other models implemented in this project, this model fits best. In this model we implemented weighted k-nearest neighbors (k-NN), which assigns different weights to the neighbors based on their distances from the query point: neighbors closer to the query point are given higher weights in the decision-making process.
For that reason, it fits best. Looking at the Root Mean Squared Error (RMSE) in the table, the lowest value is 0.30, and it belongs to the KNN model; a low RMSE means a low error. The lowest Mean Square Error (MSE), 0.095, also belongs to the KNN model. In the sensitivity analysis, the KNN model again scores best (98.93%). In the context of binary classification, sensitivity, also known as true positive rate, recall, or hit rate, measures the proportion of actual positive cases that a classification model identifies correctly. A high sensitivity value indicates that the model is effective at identifying positive instances, but it may come at the cost of an increased false positive rate. Sensitivity is particularly important in scenarios where the cost of missing positive instances (false negatives) is high, and it is often used alongside other metrics such as specificity, precision, and F1-score for a more comprehensive evaluation of a classifier. The Jaccard Index is a measure of similarity between two sets; in binary classification, it is often used to evaluate the similarity between the predicted positive instances and the true positive instances. In the table, the highest Jaccard score again belongs to the KNN model. The Jaccard Index ranges from 0 to 1, with 1 indicating perfect similarity between the sets; it is a useful metric when dealing with imbalanced datasets or when you want to focus on the intersection of positive instances. We also checked cross-validation, where KNN performs best: using random subsets of the data, the scores at all points were around 0.95. We also plotted the ROC curve and its AUC for visualization, shown below.
In summary, the Decision Tree Classifier and Random Forest algorithms exhibited overfitting, with 100% and 99.9% accuracy on the training set but noticeably lower accuracy on the test set. K-Nearest Neighbors showed the best overall performance, achieving 90.92% accuracy on the training set and 90.55% on the test set. Logistic Regression and the Support Vector Machine (SVM) also achieved promising accuracies of about 79% and 81% on this dataset. Overall, the K-Neighbors Classifier (KNN), Logistic Regression, and Support Vector Machine (SVM) are promising choices for a bank deposit model.