Abstract

This paper presents a comparative study of classification models for predicting the success of bank telemarketing campaigns, specifically whether customers will subscribe to term deposits. The study applies several machine learning classifiers, including a Multilayer Perceptron Neural Network, Decision Tree, Logistic Regression, and Random Forest, to a dataset from a Portuguese bank's telemarketing campaigns. It aims to improve campaign effectiveness by identifying the key features that influence customer subscriptions, and evaluates model performance using metrics such as precision, recall, and F1 score.

Long-Term Deposit Prediction: A Comparative Classification Model to Predict the Success of Bank Telemarketing

Sayed Rafiad Hossan (ID: 201-15-13963)
Abhijeet Kundu (ID: 201-15-13896)

A. Introduction

Marketing campaigns are a technique organizations use to improve their financial position and gain a competitive advantage over rivals. Organizations use direct marketing to focus on segments of clients, contacting them to meet a particular objective. Reaching clients through remote communication centers simplifies the operational administration of campaigns: such call centers allow organizations to speak with clients through different mediums, for instance by telephone. Because of its remote nature, the advertisement of a product through a contact center is called telemarketing. There are two primary methodologies by which organizations promote their products: mass campaigns, which target the general population, and direct campaigns, which target only a particular group of individuals. Formal reviews demonstrate that the efficacy of mass campaigns is quite low; typically, under 1% of the population responds positively to a mass campaign. By contrast, a direct campaign concentrates on a small group of individuals thought to have a higher prospect of being attracted to the product being advertised, and who are therefore substantially more productive to engage. Choosing these prospective customers poses a classification challenge in data mining, which arises from matching customer attributes and other inputs to different outputs. The contribution of this study comes in two main dimensions. The first is to build a prediction model suitable for predicting whether a client will subscribe to a term deposit, given a Portuguese bank's telemarketing campaign data. In this regard, the study applies and compares Multilayer Perceptron Neural Network, Decision Tree, Logistic Regression, and Random Forest classifiers. The second objective is to enhance campaign effectiveness by determining the key features that influence customers to subscribe to a term deposit. The remainder of this paper is organized as follows.
B. Methodology
This model aims to predict whether a customer will respond positively (e.g., open a bank account or apply for a loan) or negatively to a marketing campaign, based on various features and historical data.

B.1 Proposed Model


To carry out this investigation, we first created a workflow diagram. Our method is divided into five phases, shown in Fig. 1 below: (1) data collection; (2) preprocessing (null-data reduction and encoding); (3) feature engineering (correlation analysis); (4) classification with the Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, and K-Neighbors classifiers; and (5) comparison of the classification results.

[Fig. 1: Workflow — Data Collection → Null Data Reduction / Encoding → Feature Engineering (Correlation) → Classification (Logistic Regression, Decision Tree Classifier, Random Forest Classifier, Support Vector Machine, K-Neighbors Classifier) → Comparison and Results]
B.2 Dataset Description
The data was collected from a bank and relates to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit. The dataset includes features such as age, job, marital, education, default, housing, loan, contact, month, day_of_week, and duration.
Table 1: Parameters of the bank customer data
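To make the data-collection step concrete, here is a minimal loading sketch. It assumes the UCI "bank-full.csv" file (semicolon-separated, with target column "y"); the file name and path are assumptions, not part of the original study.

```python
import pandas as pd

# Load the Portuguese bank telemarketing data (assumed file name/path).
df = pd.read_csv("bank-full.csv", sep=";")
print(df.shape)                 # instances x 17 attributes
print(df["y"].value_counts())   # target: term-deposit subscription (yes/no)
```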

B.3 Dataset Preprocessing


The dataset of interest concerns a direct marketing campaign of a Portuguese banking institution, retrieved from the University of California Irvine (UCI) Machine Learning Repository, containing 42,640 instances with 17 attributes. The marketing campaigns were conducted by phone; usually, more than one contact with the same client was necessary to ascertain whether or not the product (a bank term deposit) was subscribed. Table 1 summarizes the telemarketing dataset.

The attributes in the dataset are nominal (categorical and binary) and numerical. The numerical attributes are Age, Balance, Day, Duration, Campaign, Pdays, and Previous. Job, Marital, Education, Contact, Month, and Poutcome are the categorical attributes, while Output, Housing, Loan, and Default are binary. The number of classes in each attribute is shown in the column headed Description. The Job attribute lists numerous job titles, including "admin," "unknown," "unemployed," "management," "housemaid," "entrepreneur," "student," "blue-collar," "self-employed," "retired," "technician," and "services." The Marital attribute takes the classes "married," "divorced," and "single." The subcategories of Education are unknown, secondary, primary, and tertiary. The attributes Default, Housing, Loan, and Output take only two classes, yes or no. The categories for contact communication are cellular (mobile phone), telephone, and unknown. Classes of the Month attribute are named after the calendar months (Jan, Feb, etc.), and the attribute Poutcome describes the result of the preceding marketing campaign: successful, unsuccessful, or unknown.

We found a sizable number of missing entries, erroneous values, and improper data types in the dataset at hand. We used a variety of data preprocessing approaches to address these problems and guarantee data integrity, following the procedure represented in the diagram below.

[Preprocessing pipeline: Missing-Value Handling → Feature Encoding → Min-Max Scaling]

Missing Value Handling: For handling missing data, we can use methods such as mode() and mean() to fill in the gaps with the most frequent or average values of the respective features. This approach allowed us to retain the completeness of the dataset and prevent biased analysis due to missing values. In this step, we first checked whether any data were missing; wherever we found a missing value, we filled it in.

Encoding Data: We then encoded the dataset, separating dependent and independent variables. Here we applied the label encoding technique.

Min-Max Scaler: In our dataset, the ranges of the features vary widely, which degrades model results. The Min-Max scaler compresses these large-range values into a common range, which is why we used it.
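The three preprocessing steps above can be sketched as follows, assuming a pandas DataFrame df as loaded earlier. The helper name and the choice to scale only the feature columns are illustrative, not the authors' exact code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # 1) Missing-value handling: mode for categorical, mean for numeric columns.
    for col in df.columns:
        if df[col].isnull().any():
            if df[col].dtype == "object":
                df[col] = df[col].fillna(df[col].mode()[0])
            else:
                df[col] = df[col].fillna(df[col].mean())
    # 2) Label-encode the categorical (object) columns, e.g. job, marital, education.
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col])
    # 3) Min-Max scaling: compress the feature columns into the [0, 1] range.
    features = [c for c in df.columns if c != "y"]
    df[features] = MinMaxScaler().fit_transform(df[features])
    return df

df = preprocess(df)
```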

B.4 Dataset Preprocessing Analysis

There are several changes in this dataset after preprocessing. Before preprocessing we had 17 columns and a large number of missing values; after preprocessing we have only 13 columns and no missing values, because we dropped the irrelevant columns and handled all the missing values. The dataset before preprocessing is shown below:
Fig.C.1.1: Data Set

Data Scaling:

In this section we scale the dataset with MinMaxScaler.

Fig.C.1.2: After Scaling Data Set


Data Balancing:

Fig.C.1.3: Data Set Balancing


C. Feature Selection

Correlation: Correlation is a statistical measure that describes the extent to which two
variables change together. In other words, it quantifies the degree to which a change in
one variable is associated with a change in another variable. Correlation does not imply
causation; it only measures the strength and direction of a linear relationship between two
variables.
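As an illustration, the correlation matrix of the preprocessed (fully numeric) dataset can be computed and visualized as follows; seaborn is used here only for the heatmap and is an assumption, not part of the original pipeline.

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()                    # pairwise Pearson correlations of all features
sns.heatmap(corr, cmap="coolwarm")  # visualizes strength/direction of relationships
plt.title("Feature correlation")
plt.show()
```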

Fig.C.1.4: Correlation in Dataset

SelectKBest: SelectKBest is a feature selection method in machine learning, particularly within the context of univariate statistical feature selection. It is available in the scikit-learn library in Python. The purpose of feature selection is to choose a subset of relevant features from the original feature set; this can lead to a more efficient and accurate machine learning model, as irrelevant or redundant features may introduce noise or lead to overfitting. SelectKBest selects the top k features based on univariate statistical tests. The selection is done independently for each feature, and features are chosen solely on their individual performance with respect to the target variable.
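A minimal SelectKBest sketch follows; the score function (f_classif) and k = 10 are illustrative choices, since the paper does not state which were used.

```python
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns=["y"])    # independent features
y = df["y"]                   # target: term-deposit subscription
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])   # names of the k retained features
```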

Fig.C.1.5: After applying the SelectKBest method

D. Performance Measurement Metrics

Precision:

Precision measures the proportion of positive predictions that are correct. It is calculated as the ratio of true positives to the sum of true positives and false positives. Precision is especially valuable when the consequences of false positives are significant, since maximizing it keeps wrong positive predictions to a minimum.

Precision = True Positives / (True Positives + False Positives)
Recall(Sensitivity):

Recall, also known as sensitivity or true positive rate, quantifies the ratio of correctly predicted positive
occurrences to the total number of real positive instances as determined by the model. The calculation
involves determining the proportion of genuine positives in relation to the combined total of true positives
and false negatives. The importance of recall becomes evident in situations where the consequences of
false negatives are significant, as it is desirable to reduce the occurrence of missed positive examples.

Recall = True Positives / (True Positives + False Negatives)

F1 Score:
The F1-score is calculated as the harmonic mean of precision and recall. It offers a balance between the two metrics, which is particularly advantageous when both false positives and false negatives must be accounted for in the evaluation. The F1-score is computed using the following formula:

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

Support:
Support is the number of actual occurrences of each class in the dataset. Fig. C.1.6 shows the precision, recall, F1 score, and support values computed for the dataset.

Fig.C.1.6: Precision, Recall, F1 score, Support


Specificity:

Specificity = True Negatives / (True Negatives + False Positives)

Fig. C.1.7 shows the sensitivity and specificity values for the dataset.

Fig.C.1.7: Sensitivity, Specificity
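All four metrics above can be derived from a confusion matrix, as in the following sketch. It assumes binary test labels y_test and model predictions y_pred already exist.

```python
from sklearn.metrics import classification_report, confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)                       # sensitivity / true positive rate
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
# classification_report prints precision, recall, F1, and support per class.
print(classification_report(y_test, y_pred))
```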

E. Model Selection

SVM:
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and
regression. The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional
space that can separate the data points in different classes in the feature space. SVM works by mapping data
to a high-dimensional feature space so that data points can be categorized, even when the data are not
otherwise linearly separable. A separator between the categories is found, then the data are transformed in
such a way that the separator can be drawn as a hyperplane. In this project we use SVM because it is effective in high-dimensional cases; moreover, different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.

Equation:

f(x) = β₀ + β₁x₁ + … + βₙxₙ
Random Forest Classifier:
Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset
and takes the average to improve the predictive accuracy of that dataset. Random Forest grows multiple
decision trees which are merged together for a more accurate prediction. The logic behind the Random
Forest model is that multiple uncorrelated models (the individual decision trees) perform much better as a
group than they do alone.

Equation:

ŷ = (1/N) Σᵢ₌₁ᴺ yᵢ

K-Nearest Neighbors:
The k-nearest neighbors algorithm, also known as KNN or k-NN, is a nonparametric, supervised learning classifier which uses proximity to make classifications or predictions about the grouping of an individual data point. While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another. A KNN classifier operates by finding the k nearest neighbors to a given data point and taking the majority vote to classify the point. The value of k is crucial, and one needs to choose it wisely to prevent overfitting or underfitting the model.

Equation:

y = mode(y₁, y₂, …, yₖ)

Decision Tree Classifier:


Decision Tree is a Supervised learning technique that can be used for both classification and Regression
problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf
node represents the outcome. Decision trees use a recursive partitioning process, where each node is divided
into child nodes, and this process continues until a stopping criterion is met. This assumes that data can be
effectively subdivided into smaller, more manageable subsets.

Equation (a common splitting criterion is Gini impurity):

Gini = 1 − Σᵢ pᵢ²
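To show how the five classifiers compared in this study can be set up, here is a training sketch on one train/test split. The hyperparameters are scikit-learn defaults (illustrative), not the authors' tuned settings, and X_selected/y come from the feature-selection step above.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```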
F. Classification Metrics Accuracy Verification

Here we check the accuracy of the Decision Tree classifier and KNN classifier models with the Recall and F1 performance measurement metrics.

Recall: Recall, also known as sensitivity or true positive rate, is a crucial metric in classification tasks. If
you focus only on accuracy, a model might predict the majority class all the time and still have a high
accuracy, but it would miss important predictions of the minority class. Recall specifically focuses on the
ability of the model to find all relevant cases within a dataset, especially the minority class.

F1: Our F1 score is a strong result for the classifier models. The F1 measure ranges from 0 to 1: a value near 0 indicates that the data preprocessing and model are performing poorly, while a value near 1 indicates near-perfect performance.

F.1 Regression Metrics Accuracy Verification

Here we check the KNN regression model with the MAE, MSE, RMSE, and R² performance measurement metrics.

MAE (Mean Absolute Error): Measures the average of the absolute differences between predicted and actual values; lower values indicate better performance.

Equation:

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|

This metric compares the test-set outputs with the KNN regressor's predictions; the resulting MAE is 0.36. Since the data are min-max scaled, the error lies in the 0-to-1 range.
MSE (Mean Squared Error): Measures the average of the squared differences between predicted and actual values. The best possible score is 0, and lower values are better.

Equation:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

This metric compares the test-set outputs with the KNN regressor's predictions; the resulting MSE is 0.51, which is not a satisfying score, since the best possible value is 0.

MSE scores by model:

Model | MSE Score
Support Vector Machine | 0.18
Logistic Regression | 0.17
Decision Tree Classifier | 0.14
Random Forest Classifier | 0.11
KNN | 0.095

RMSE (Root Mean Squared Error): The square root of the MSE; it provides an interpretable measure of error in the same units as the target variable.

Equation:

RMSE = √MSE = √((1/n) Σᵢ (yᵢ − ŷᵢ)²)

This metric compares the test-set outputs with the KNN regressor's predictions; the resulting RMSE is 0.72. There is no fixed threshold for an acceptable RMSE: it measures the difference between predicted and actual values, so the appropriate threshold depends on the specific problem and the context of the analysis.

RMSE scores by model:

Model | RMSE Score
Support Vector Machine | 0.43
Logistic Regression | 0.41
Decision Tree Classifier | 0.38
Random Forest Classifier | 0.33
KNN | 0.30

R² (R-squared): Measures the proportion of the variance in the dependent variable that is predictable from the independent variables. The best possible score is 1.0, and it can be negative.

Equation:

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

This metric compares the test-set outputs with the KNN regressor's predictions; the resulting R² is 0.84, which is a satisfying score given that the best possible value is 1.0.
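The regression metrics above can be reproduced for a KNN regressor as in the following sketch; KNeighborsRegressor and the reuse of the earlier train/test split are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

reg = KNeighborsRegressor().fit(X_train, y_train)
pred = reg.predict(X_test)

mae  = mean_absolute_error(y_test, pred)   # average absolute error
mse  = mean_squared_error(y_test, pred)    # average squared error
rmse = np.sqrt(mse)                        # error in the target's own units
r2   = r2_score(y_test, pred)              # variance explained (best = 1.0)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R^2={r2:.2f}")
```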
Jaccard Score:
The Jaccard score is a metric used to evaluate the similarity and diversity of sample sets; it is defined as the ratio of the intersection to the union of the two sets. The Jaccard coefficient, a statistical measure, makes it feasible to compare two finite sample sets of comparable size: it is calculated by dividing the size of the intersection of the sets by the size of their union. The Jaccard score for each algorithm in this model is shown in the figure below:

Jaccard scores by model:

Model | Jaccard Score
Support Vector Machine | 0.68
Logistic Regression | 0.70
Decision Tree Classifier | 0.74
Random Forest Classifier | 0.79
KNN | 0.82

In this figure, KNN had the highest Jaccard score (0.82), followed by the Random Forest Classifier (0.79), Decision Tree Classifier (0.74), and Logistic Regression (0.70), while the Support Vector Machine had the lowest score (0.68).
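Per-model Jaccard scores like these can be computed with scikit-learn's jaccard_score; this sketch assumes the fitted models dictionary from the model-selection step and binary labels with the positive class encoded as 1.

```python
from sklearn.metrics import jaccard_score

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name, "Jaccard:", jaccard_score(y_test, y_pred, pos_label=1))
```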
Cross_Val_Score:
cross_val_score is a function in the scikit-learn package which trains and tests a model over multiple folds of the dataset. This cross-validation method gives a better understanding of model performance over the whole dataset than a single train/test split does; a minimal usage sketch follows the table below.

Algorithm Name | Cross_Val_Score (n = 5)
Logistic Regression | 0.92070019, 0.92163083, 0.90776642, 0.90471155, 0.90012924
Decision Tree Classifier | 0.87359023, 0.8517213, 0.8376219, 0.80883562, 0.73093644
Random Forest Classifier | 0.91635338, 0.89191729, 0.90977444, 0.90225564, 0.89802632
Support Vector Machine | 0.90977444, 0.90553401, 0.90330161, 0.89848431, 0.88285748
K-Neighbors Classifier | 0.90707237, 0.90717894, 0.90717894, 0.90623898, 0.90706145
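The sketch below reproduces the five-fold table above; cv=5 matches n = 5, and the default scoring for classifiers is accuracy.

```python
from sklearn.model_selection import cross_val_score

for name, model in models.items():
    scores = cross_val_score(model, X_selected, y, cv=5)  # five folds, as in the table
    print(name, scores)
```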

Sensitivity:

Model | Sensitivity Score (%)
Support Vector Machine | 82.10
Logistic Regression | 83.91
Decision Tree Classifier | 89.58
Random Forest Classifier | 92.10
KNN | 98.93

Specificity:

Model | Specificity Score (%)
Support Vector Machine | 71.89
Logistic Regression | 71.89
Decision Tree Classifier | 45.31
Random Forest Classifier | 54.68
KNN | 7.34

G. Results Discussion
After applying the different algorithms to our bank deposit project, we conducted a comparison between the models. The results indicate that the k-NN algorithm performed best in this approach.

Algorithm Name | Train Set | Test Set | Sensitivity | Specificity
Logistic Regression | 80.95% | 79% | 83.91% | 71.89%
Decision Tree Classifier | 100% | 83.9% | 89.58% | 45.31%
Random Forest Classifier | 99.9% | 86.31% | 92.10% | 54.68%
Support Vector Machine | 83.47% | 81.37% | 82.10% | 71.89%
K-Neighbors Classifier | 90.92% | 90.55% | 98.93% | 7.34%

From the table we see that for Logistic Regression the train-set accuracy is 80.95% while the test-set accuracy is 79%, meaning the model fits well and gives a good result. For the Decision Tree Classifier the train-set accuracy is 100% while the test-set accuracy is 83.9%; here the large gap between train and test results indicates that the model overfits. If a model fits well, the test result should be close to the train-set accuracy, so this model is not the best fit for the bank deposit data. Similarly, the Random Forest Classifier reaches a train-set accuracy of 99.9%, close to 100%, while the test-set accuracy is 86.31%; like the Decision Tree Classifier it is overfitted, but its test-set result is far better than the Decision Tree Classifier's. The Support Vector Machine (SVM) achieves a train-set accuracy of 83.47% and a test-set accuracy of 81.37%, so this model fits well and also gives a promising result. Among these models, the K-Neighbors Classifier (KNN) gives the most promising result: its train-set accuracy is 90.92% and its test-set accuracy is 90.55%, the best fit compared with the other models implemented in this project. In this model we implemented weighted k-nearest neighbors (k-NN), which assigns different weights to the neighbors based on their distances from the query point: neighbors closer to the query point are given higher weights in the decision-making process, as sketched below.
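A sketch of that distance-weighted variant, using scikit-learn's weights="distance" option (the value of k is illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

# Closer neighbors get proportionally larger voting weights.
weighted_knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
weighted_knn.fit(X_train, y_train)
print("Weighted KNN test accuracy:", weighted_knn.score(X_test, y_test))
```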
For that reason, it fits best. Looking at the Root Mean Squared Error (RMSE) in the table, the lowest value, 0.30, belongs to the KNN model; a low RMSE means a low error. Likewise, the lowest Mean Squared Error (MSE), 0.095, belongs to the KNN model. In the sensitivity analysis, the KNN model again scores best (98.93%). In the context of binary classification, sensitivity, also known as the true positive rate, recall, or hit rate, measures the proportion of actual positive cases correctly identified by a classification model. A high sensitivity value indicates that the model is effective at identifying positive instances, but it may come at the cost of an increased false positive rate; note that KNN's specificity (7.34%) is by far the lowest of all the models, so it rarely identifies negative cases correctly. Sensitivity is particularly important in scenarios where the cost of missing positive instances (false negatives) is high, and it is often used alongside other metrics such as specificity, precision, and F1-score for a more comprehensive evaluation of a classifier. The Jaccard Index measures the similarity between two sets; in binary classification it is often used to evaluate the similarity between the predicted positive instances and the true positive instances. In the table, the KNN model again has the highest score. The Jaccard Index ranges from 0 to 1, with 1 indicating perfect similarity between the sets; it is a useful metric when dealing with imbalanced datasets or when focusing on the intersection of positive instances. We also checked cross-validation, where KNN performs best: using five random folds, its scores stay close to 0.90 at every point. Finally, we plotted ROC curves with AUC for visualization (a plotting sketch follows); the curves are given below.
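The ROC/AUC plots referenced above can be generated with scikit-learn's RocCurveDisplay; this sketch assumes the fitted models from the model-selection step (for SVC, the curve is built from decision_function scores).

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

ax = plt.gca()
for name, model in models.items():
    # Plots the ROC curve and reports AUC for each fitted classifier.
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
plt.show()
```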

Fig-Logistic Regression Fig- Decision Tree Classifier


Fig- Random Forest Classifier Fig- Support Vector Machine

Fig- K-Nearest Neighbors Classifier


H. Conclusion

In summary, the Decision Tree Classifier and Random Forest algorithms exhibited overfitting, with 100% and 99.9% accuracy on the training set but considerably lower accuracy on the test set. The K-Nearest Neighbors classifier showed the best overall performance, achieving 90.92% accuracy on the training set and 90.55% on the test set. Logistic Regression and the Support Vector Machine (SVM) also achieved promising accuracies of around 79% and 81% on this dataset. Overall, the K-Neighbors Classifier (KNN), Logistic Regression, and Support Vector Machine (SVM) are promising choices for a bank deposit model.
