Amgad Muneer1, Rao Faizan Ali1, Amal Alghamdi2, Shakirah Mohd Taib1, Ahmed Almaghthawi2, Ebrahim Abdulwasea Abdullah Ghaleb1
1 Department of Computer and Information Sciences, Faculty of Science and Information Technology, Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia
2 Department of Computer Science and Artificial Intelligence, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
Corresponding Author:
Amgad Muneer
Department of Computer and Information Sciences, Universiti Teknologi PETRONAS
32610 Seri Iskandar, Malaysia
Email: muneeramgad@gmail.com
1. INTRODUCTION
Competition in the banking industry grows every day [1]. A bank that wants to increase its market share must therefore pair the acquisition of new customers with customer retention strategies. It has been shown that improving the retention rate by up to 5% can increase a bank's profit by up to 85% [2]. Banks offer attractive plans such as internet banking, mobile banking, debit cards, credit cards, zero-balance savings accounts, credit points based on customer usage [3], and loan plans for education, housing, agriculture, vehicles, mortgages, and startups. Among all these facilities, granting a loan to a customer is a critical task because the bank has to analyze the customer's capacity before offering the loan [4]. To support this process, a number of banks have incorporated a credit card scheme under which a customer's eligibility is evaluated whenever he or she applies for a credit card. Many banks offer credit cards to new customers based on their credit points [5].
However, every customer who holds credit cards from more than one bank has multiple opportunities to churn out of a particular bank [4], [6], [7]. Whenever a customer realizes that Bank A offers more facilities at a lower interest rate than Bank B, the likelihood of that customer churning from Bank B is high. Therefore, it is the responsibility of the bank's credit card account management system to ensure that existing customers are retained through low interest rates. Churn analysis algorithms already exist, but they are limited by the nature of the churn prediction problem, which typically has three features: i) the data is imbalanced, for example, churned customers represent a tiny fraction of the total samples (usually 2%); ii) data from large learning applications inherently contains noise; and iii) predicting churn requires ranking subscribers according to their likelihood to churn [8], [9]. With recent advances in machine learning, it is beneficial to build an approach that can predict whether a credit cardholder or customer will churn from a particular bank [4]. Such a prediction is made from previously available data collected from the history records of past customers. Machine learning (ML) methods such as Naive Bayes, decision trees, logistic regression, random forest, artificial neural networks, and support vector machines can be used to predict churn [10]. These ML techniques are applied not only in banking but also in various other sectors such as insurance [11], medical systems [12], cyberbullying detection [13], retail marketing [14], the automobile industry, and the gaming industry [12]. The contribution of this study is threefold:
i) We collected around 10,000 credit card customer churn records from the Kaggle repository; ii) We conducted an exploratory data analysis (EDA) of the available data at the first stage and employed a hybrid of SMOTE data sampling and a random forest classifier to overcome the inherent class imbalance problem; iii) At the final stage of model selection and evaluation, we implemented three models (random forest (RF), AdaBoost, and support vector machine (SVM)) and performed a detailed comparison of their results.
The remainder of this paper is organized as follows: Section 2 discusses the background of the study and its related research. Section 3 outlines the research methodology, and section 4 presents the experimental findings. Finally, section 5 concludes the paper and describes future directions.
2. LITERATURE REVIEW
Many data mining techniques have been applied to credit card churn prediction. The related work is briefly reviewed here. Dias et al. [15] predicted in advance whether a given customer will end his or her relationship with an organization. They applied six machine learning methods (random forest, support vector machine, logistic regression, multivariate adaptive regression splines, classification and regression techniques, and stochastic boosting) to the retail banking customer churn prediction problem, considering predictions up to 6 months in advance. The best results were obtained with stochastic boosting. Dalmia et al. [16] used supervised machine learning to create a proprietary algorithm that predicts and informs the bank about the customers at the highest risk of leaving. Different classifiers achieve different accuracies on different datasets. Their approach combines K-nearest neighbour (KNN) based on weighted scales with the XGBoost algorithm to improve accuracy. The dataset is grouped into training and testing sets based on weighted scales and the KNN algorithm. Gholamiangonabadi et al. [17] introduced a new procedural approach to predict customer churn at an Iranian bank. First, they normalized the data during pre-processing. Then, the data were clustered using the k-medoids method, with the Davies-Bouldin index used to assess clustering performance. Various approaches were then used to discover patterns within the data, including radial basis function (RBFNN), generalized regression (GRNN), and multilayer perceptron (MLPNN) neural networks, as well as SVM. According to the results, the MLPNN and SVM models had higher precision and lower costs. Ahmad et al. [18] applied three machine learning techniques to predict churn, namely decision trees (DT), Naive Bayes, and SVM, using two benchmark datasets: the IBM Watson dataset, which contains 7,033 observations and 21 attributes, and the cell2cell dataset, which contains 71,047 observations and 57 attributes. Data imbalance is one of the key drawbacks of the aforementioned works.
The performance of these models was measured using the area under the curve (AUC): the three models scored 0.82, 0.87, and 0.77, respectively, on the IBM dataset and 0.98, 0.99, and 0.98 on the cell2cell dataset. The authors of [18], [19] applied data mining techniques in telecommunications to predict the churning behaviour of customers, using the CART algorithm to predict customer churn. In [20], the authors built a computer system based on artificial neural networks (ANN) and SVM. Their model distinguishes three customer states: active (i.e., those that are fully engaged in business with a positive balance in their account), non-active (i.e., those with low balances in their accounts and those who do not have any investments), and churning (closed bank account). They demonstrated excellent results with their computer software [21].
Indonesian J Elec Eng & Comp Sci, Vol. 26, No. 1, April 2022: 539-549, ISSN: 2502-4752
3. RESEARCH METHOD
3.1. Data collection and description
This section describes the methods used to predict customer churn in the banking industry and explains the dataset and the proposed approach. The dataset used for the prediction task is publicly available on the Kaggle website [22]. The variables included in the dataset are listed in Table 1. Of the 23 variables, the last two columns are removed since they do not contribute to the classification process. After removing them, the dataset contains 21 variables: 20 predictor variables and one class variable. It contains 10,127 records, of which 8,496 (83.9%) are non-churners and 1,630 (16.1%) are churners. The dataset is therefore highly unbalanced in terms of the proportion of churners and non-churners. Furthermore, we conducted an exploratory data analysis to determine the distributions across genders, age groups, and so on. Before feeding the data to the classifiers, it is necessary to balance the data so that the classifiers do not become biased towards the majority class of non-churners when making predictions. A mixture of the synthetic minority oversampling technique (SMOTE), undersampling, and oversampling is used to achieve the balancing.
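As an illustration of the balancing step, the following is a minimal sketch of the basic SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) combined with random undersampling of the majority class. It is not the implementation used in this study: the function names (`smote`, `random_undersample`, `balance`), the two-feature toy data, and the target size of 50 per class are all hypothetical and chosen only to mirror the roughly 84/16 class ratio described above.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (basic SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of n_keep majority samples."""
    return random.Random(seed).sample(majority, n_keep)


def balance(majority, minority, target, k=5, seed=0):
    """Undersample the majority and oversample the minority to `target` each."""
    kept = random_undersample(majority, min(target, len(majority)), seed)
    extra = smote(minority, max(0, target - len(minority)), k, seed)
    return kept, list(minority) + extra


# Hypothetical toy data: 84 non-churners and 16 churners with two features.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(84)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(16)]

maj_bal, min_bal = balance(majority, minority, target=50)
print(len(maj_bal), len(min_bal))  # 50 50
```

Resampling of this kind is normally applied to the training portion only, so that the evaluation data keeps the original class distribution.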
Predicting customers churning in banking industry: A machine learning approach (Amgad Muneer)
Figure 1. Illustration of (a) distribution of customer age and (b) distribution of months the customer is part of the bank
Figure 2. Illustration of (a) distribution of the credit limit and (b) distribution of total transaction amount
Figure 3. The results of (a) proportion of churning vs. non-churning customers and (b) number of inactive months
of the proposed classifiers. Secondly, we show the 5-fold cross-validation and then describe the experimental results obtained in this study. Finally, a comparative analysis gives readers a clear comparison between the classifiers proposed in this study and the state of the art.
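The cross-validation step mentioned above can be sketched as follows. This is an illustrative example rather than the procedure used in the study: it assumes k = 5 folds and a stratified split (each fold keeps roughly the original churner ratio), which the text does not specify. The function name `stratified_kfold` and the 84/16 toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each test fold keeps roughly
    the same class proportions as the full label list."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin per class

    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Hypothetical labels: 0 = non-churner, 1 = churner.
labels = [0] * 84 + [1] * 16
splits = list(stratified_kfold(labels, k=5, seed=1))
for fold, (train, test) in enumerate(splits, start=1):
    churners = sum(labels[i] for i in test)
    print(f"fold {fold}: train={len(train)} test={len(test)} churners in test={churners}")
```

In each round, a classifier would be fitted on the training indices and scored on the test indices, and the five scores would then be averaged.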
4.1.1. Accuracy
Accuracy is a ratio of the true detected cases to the total cases, and it has been utilized to evaluate
models on a balanced dataset [24]. Accordingly, it can be calculated as (1):
$$\text{Accuracy} = \frac{tp + tn}{tp + fp + tn + fn} \tag{1}$$
where tp means true positive, tn is true negative, fp denotes false positive, and fn is a false negative.
$$\text{F-measure} = \frac{2 \times precision \times recall}{precision + recall} \tag{3}$$
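As a worked example of these metrics, the sketch below computes accuracy as in (1) and the F-measure as in (3) from the four confusion-matrix counts. Precision and recall are assumed to follow their standard definitions, tp / (tp + fp) and tp / (tp + fn), since they are not restated here. The function name and the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F-measure from raw counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)           # equation (1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # standard definition (assumed)
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # standard definition (assumed)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0  # equation (3)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for 1,000 customers, 100 of whom churn.
good = classification_metrics(tp=80, fp=20, tn=880, fn=20)

# A model that never predicts churn on the same data.
trivial = classification_metrics(tp=0, fp=0, tn=900, fn=100)

print(good)
print(trivial)
```

The second case shows why accuracy alone can mislead on unbalanced data: a model that never predicts churn still reaches an accuracy of 0.9 here, while its F-measure is 0.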
Figure 3. Performance evaluation for three proposed models using F1-score metrics
Table 3. The performance of the three proposed models on the original data before applying SMOTE
Model           Recall   F1 Score   Accuracy
Random Forest   0.64     0.63       63.7%
AdaBoost        0.62     0.57       62.2%
SVM             0.75     0.55       56.2%
Table 2 and Table 3 show that the results of the random forest model are significantly higher than those of the other models. We therefore selected the random forest model to forecast customer churn in the banking industry. The results of this prediction are presented in Figure 4.
Figure 4. Confusion matrix for random forest prediction on the original data
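For readers unfamiliar with how such a confusion matrix is built, the following sketch tallies the four cells from true and predicted labels. The function name, the row/column layout ([[tn, fp], [fn, tp]], rows for the actual class and columns for the predicted class), and the label vectors are hypothetical; they do not reproduce the values in Figure 4.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, positive=1):
    """Return [[tn, fp], [fn, tp]]: rows are the actual class,
    columns are the predicted class."""
    counts = Counter((t == positive, p == positive) for t, p in zip(y_true, y_pred))
    tn = counts[(False, False)]
    fp = counts[(False, True)]
    fn = counts[(True, False)]
    tp = counts[(True, True)]
    return [[tn, fp], [fn, tp]]


# Hypothetical labels: 1 = churner, 0 = non-churner.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

matrix = confusion_matrix(y_true, y_pred)
print(matrix)  # [[3, 1], [1, 3]]
```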
5. CONCLUSION
This study presented a comprehensive investigation of the credit card churn prediction problem in banks using machine learning techniques. We proposed a customer churn prediction system based on random forest (RF), AdaBoost, and SVM models. The best results are achieved when the unbalanced original data is resampled with SMOTE and undersampling is combined with oversampling. When SMOTE was applied to overcome the class imbalance in the data, RF outperformed the other two predictors with an accuracy of 88.7% and an F1 score of 0.91. The experimental results also demonstrated that RF performed well on the full feature-selected datasets. Accordingly, the proposed RF predictor can be used to calculate customer churn periodically from various perspectives. Churn can be measured as the number of customers lost, or as the ratio or percentage of customers lost relative to the total number of customers in the bank, on a quarterly or annual basis. An accurate forecast provides insight into the future, which supports the development of a strategy. In future work, we seek to implement a deep learning model in order to improve the accuracy of the proposed approach.
REFERENCES
[1] I. Japparova and R. Rupeika-Apoga, “Banking business models of the digital future: The case of Latvia,” European Research
Studies Journal, vol. 20, no. 3, pp. 864–878, 2017, doi: 10.35808/ersj/749.
[2] G. Nie, W. Rowe, L. Zhang, Y. Tian, and Y. Shi, “Credit card churn forecasting by logistic regression and decision tree,” Expert
Systems with Applications, vol. 38, no. 12, pp. 15273–15285, Nov. 2011, doi: 10.1016/j.eswa.2011.06.028.
[3] R. Goel, S. Sahai, A. Vinaik, and V. Garg, “Moving from cash to cashless economy: A study of consumer perception towards
digital transactions,” International Journal of Recent Technology and Engineering, vol. 8, no. 1, pp. 1220–1226, Jun. 2019, doi:
10.17492/pragati.v7i1.195425.
[4] R. Rajamohamed and J. Manokaran, “Improved credit card churn prediction based on rough clustering and supervised learning
techniques,” Cluster Computing, vol. 21, no. 1, pp. 65–77, Mar. 2018, doi: 10.1007/s10586-017-0933-1.
[5] L. Bursztyn, B. Ferman, S. Fiorin, M. Kanz, and G. Rao, “Status Goods: Experimental evidence from platinum credit cards,”
Quarterly Journal of Economics, vol. 133, no. 3, pp. 1561–1595, Aug. 2018, doi: 10.1093/QJE/QJX048.
[6] H. Jain, G. Yadav, and R. Manoov, “Churn prediction and retention in banking, telecom and IT sectors using machine learning techniques,” in Advances in Machine Learning and Computational Intelligence, Springer, Singapore, 2021, pp. 137–156.
[7] G. G. Sundarkumar and V. Ravi, “A novel hybrid undersampling method for mining unbalanced datasets in banking and
insurance,” Engineering Applications of Artificial Intelligence, vol. 37, pp. 368–377, Jan. 2015, doi:
10.1016/j.engappai.2014.09.019.
[8] Y. Xie, X. Li, E. W. T. Ngai, and W. Ying, “Customer churn prediction using improved balanced random forests,” Expert Systems
with Applications, vol. 36, no. 3, pp. 5445–5449, Apr. 2009, doi: 10.1016/j.eswa.2008.06.121.
[9] K. G. M. Karvana, S. Yazid, A. Syalim, and P. Mursanto, “Customer churn analysis and prediction using data mining models in
banking industry,” in 2019 International Workshop on Big Data and Information Security, IWBIS 2019, Oct. 2019, pp. 33–38,
doi: 10.1109/IWBIS.2019.8935884.
[10] M. A. H. Farquad, V. Ravi, and S. B. Raju, “Data mining using rules extracted from SVM: An application to churn prediction in
bank credit cards,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics), vol. 5908 LNAI, 2009, pp. 390–397.
[11] N. A. Akbar, A. Sunyoto, M. Rudyanto Arief, and W. Caesarendra, “Improvement of decision tree classifier accuracy for
healthcare insurance fraud prediction by using Extreme Gradient Boosting algorithm,” in Proceedings-2nd International
Conference on Informatics, Multimedia, Cyber, and Information System, ICIMCIS 2020, Nov. 2020, pp. 110–114, doi:
10.1109/ICIMCIS51567.2020.9354286.
[12] S. M. Fati, A. Muneer, N. A. Akbar, and S. M. Taib, “A continuous cuffless blood pressure estimation using tree-based pipeline
optimization tool,” Symmetry, vol. 13, no. 4, 2021, doi: 10.3390/sym13040686.
[13] A. Muneer and S. M. Fati, “A comparative analysis of machine learning techniques for cyberbullying detection on twitter,”
Future Internet, vol. 12, no. 11, pp. 1–21, Oct. 2020, doi: 10.3390/fi12110187.
[14] M. Al-Ghobari, A. Muneer, and S. M. Fati, “Location-aware personalized traveler recommender system (lapta) using
collaborative filtering knn,” Computers, Materials and Continua, vol. 69, no. 2, pp. 1553–1570, 2021, doi:
10.32604/cmc.2021.016348.
[15] J. Dias, P. Godinho, and P. Torres, “Machine learning for customer churn prediction in retail banking,” in Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12251
LNCS, 2020, pp. 576–589.
[16] H. Dalmia, C. V. S. S. Nikil, and S. Kumar, “Churning of bank customers using supervised learning,” in Lecture Notes in
Networks and Systems, vol. 107, 2020, pp. 681–691.
[17] D. Gholamiangonabadi, S. Nakhodchi, A. Jalalimanesh, and A. Shahi, “Customer churn prediction using a meta-classifier
approach; A case study of Iranian banking industry,” in Proceedings of the International Conference on Industrial Engineering
and Operations Management, 2019, vol. 2019, no. MAR, pp. 364–375.
[18] A. K. Ahmad, A. Jafar, and K. Aljoumaa, “Customer churn prediction in telecom using machine learning in big data platform,”
Journal of Big Data, vol. 6, no. 1, p. 28, Dec. 2019, doi: 10.1186/s40537-019-0191-6.
[19] V. K. Nijhawan, M. Madan, and M. Dave, “An analytical implementation of CART Using RStudio for Churn Prediction,”
Information and Communication Technology for Competitive Strategies, vol. 40. Springer Singapore, 2019.
[20] S. Osowski and L. Sierenski, “Prediction of customer status in corporate banking using neural networks,” in Proceedings of the
International Joint Conference on Neural Networks, Jul. 2020, pp. 1–6, doi: 10.1109/IJCNN48605.2020.9206693.
[21] K. Ebrah and S. Elnasir, “Churn prediction using machine learning and recommendations plans for telecoms,” Journal of
Computer and Communications, vol. 07, no. 11, pp. 33–53, 2019, doi: 10.4236/jcc.2019.711003.
[22] Churn for Bank Customers. (2020). Accessed: 21 March 2021. [Online]. Available: https://www.kaggle.com/mathchi/churn-for-bank-customers
[23] A. Omar and A. Almaghthawi, “Towards an integrated model of data governance and integration for the implementation of digital
transformation processes in the Saudi Universities,” International Journal of Advanced Computer Science and Applications, vol.
11, no. 8, pp. 588–593, 2020, doi: 10.14569/IJACSA.2020.0110873.
[24] S. Naseer, S. M. Fati, A. Muneer, and R. F. Ali, “iAceS-Deep: Sequence-based identification of acetyl serine sites in proteins
using PseAAC and deep neural representations,” IEEE Access, vol. 10, pp. 12953–12965, 2022, doi:
10.1109/access.2022.3144226.
[25] A. Muneer and S. M. Fati, “Efficient and automated herbs classification approach based on shape and texture features using deep
learning,” IEEE Access, vol. 8, pp. 196747–196764, 2020, doi: 10.1109/ACCESS.2020.3034033.
[26] S. E. Charandabi, “Prediction of Customer Churn in Banking Industry,” Age, vol. 18, no. 92, pp. 38–92, 2020.
BIOGRAPHIES OF AUTHORS
Rao Faizan Ali received the bachelor's degree in computer science from COMSATS University Islamabad, Pakistan, and the M.Phil. degree in computer science from the University of Management and Technology, Lahore, Pakistan. He is currently pursuing the Ph.D. degree with Universiti Teknologi PETRONAS, Malaysia. He has eight years of experience in teaching and research. He has held various computer science positions in the financial, consulting, academic, and government sectors. He is currently working as a Research Officer with the Department of Computer and Information Sciences, Universiti Teknologi PETRONAS, Perak, Malaysia. He can be contacted at email: rao_16001107@utp.edu.my.