MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE IN BANKING
BY
FULL NAME
PSC-MATNO
JANUARY 2025
CERTIFICATION
This is to certify that this project work was carried out by FULL NAME and found satisfactory, both in scope and content, for the award of the Bachelor of Science (B.Sc.) degree.
APPROVAL
This project work is hereby approved in partial fulfillment of the requirements for the award of the Bachelor of Science (B.Sc.) degree of the University of Benin.
PROF. GODSPOWER O. EKUOBASE, PHD                              DATE
Head of Department
DEDICATION
This project is dedicated to God Almighty for giving me the strength and wisdom to
see it through to completion, and even throughout my stay in the University of Benin
(UNIBEN).
ACKNOWLEDGEMENT
My profound gratitude goes to my project supervisor, Dr. (Mr.) E.C. IGODAN, for his consistent guidance and support throughout this work.
I would also like to specially thank my project coordinator Dr. (Mrs.) A.R. Usiobaifo,
and other lecturers in the Department of Computer Science who I have been
opportune to cross paths with, and have impacted me immensely these past few years:
Prof. G.O. Ekuobase, Dr. F.O. Oliha, Prof. K.C. Ukaoha, Prof. A.A. Imiavan, Prof.
(Mrs.) F. Egbokhare, Prof. (Mrs.) V.V.N. Akwukwuma, Prof. F.I. Amadin, Prof.
(Mrs.) S. Konyeha, Prof. (Mrs.) V.I. Osubor, Dr. (Mrs.) Aziken, Dr. F.O. Chete, Dr.
(Mrs) R.O. Osaseri, Dr. J.C. Obi, Mr. P. E.B. Imiefoh, Mr. I.E. Obasohan, Mr. S.O.P.
Oliomogbe, Mr. K.O. Otokiti, Mr. I.E. Obayagbonna, Mrs. R.I. Izevbizua, and Mr. E.C.
TABLE OF CONTENTS
CERTIFICATION
APPROVAL
DEDICATION
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: LITERATURE REVIEW
 2.2 MACHINE LEARNING AND MACHINE LEARNING TECHNIQUES
CHAPTER THREE
 3.1 OVERVIEW
 3.5 DATA MINING PROCESS
 3.8 METHODOLOGY
CHAPTER FOUR
 4.1 INTRODUCTION
 4.3.1.2 PYTHON
CHAPTER FIVE
 5.1 CONCLUSION
REFERENCES
APPENDIX
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
This study focuses on using a specific machine learning technique, the Adaptive Neuro-Fuzzy Inference System (ANFIS), to detect bank and credit card fraud. ANFIS is a hybrid model that combines artificial neural networks and fuzzy logic. It can effectively handle complex and uncertain data, making it suitable for fraud detection. Feature selection is the process of selecting the most relevant and informative features from a dataset; in the context of fraud detection, this means identifying the most important factors or variables that help accurately predict the presence or absence of fraud. The work covers pre-processing, model building, and evaluation. It explores the potential of using ANFIS and feature selection techniques to enhance the detection of bank fraud, ultimately leading to more effective and timely intervention for bank users targeted by fraudsters. It also introduces a web-based application through which users can easily report instances of fraud.
CHAPTER ONE
INTRODUCTION
The rapid digitalization of banking has been accompanied by a rise in fraud, which poses significant threats to the global financial ecosystem. The effect of this phenomenon has led to the liquidation of many banks in Nigeria and has been the major cause of the ugly development in our banking industry now referred to as 'distress'. Fraudulent activities, ranging from phishing attacks and identity theft to more complex forms of financial fraud such as account takeovers and money laundering, pose challenges for traditional fraud detection and prevention methods, which often struggle to keep pace with the agility and ingenuity of modern cybercriminals (Bello et al., 2024). These developments, which occur frequently in our banks, have cast a big question mark over the credibility of Nigerian banking both within and outside the country. The need to combat this double-headed monster by all means has drawn renewed scrutiny to the banking industry, especially in the areas susceptible to fraud.
Machine learning has achieved remarkable success in various domains, including natural language processing, computer vision, and data analytics, enabling the development of more accurate and adaptive systems. Machine learning algorithms can automatically learn patterns and anomalies from vast amounts of transactional data and communication. There are two broad types of fraud: opportunistic fraud and professional fraud, the latter committed by organized groups. Although organized fraud is perpetrated less frequently than opportunistic bank fraud, the majority of revenue outflow (financial losses) is due to these groups. Fraud detection is difficult for several reasons. First, it involves high volumes of data that are constantly evolving; processing these sets of data requires fast, novel, and efficient algorithms. Moreover, in terms of cost, it is evident that undertaking a detailed analysis of all records is too expensive. Effectiveness is therefore an issue: many legitimate records exist for every fraudulent one.
Detection within the banking sector has consequently attracted considerable research. Johnson and Martinez (2020) conducted a study that investigated the application of deep learning techniques for fraud detection on transactional data. Furthermore, the rise of big data technologies and cloud computing has facilitated the scalability and efficiency of machine learning algorithms, making them practical at production scale. Combining internal transaction records with external data feeds provides a holistic view of customer behaviour, enhancing machine-learning-based fraud detection systems in the banking sector. Nevertheless, challenges remain. These challenges include the need for large and diverse labelled datasets for training, the interpretability of complex machine learning models, and the potential for adversarial attacks aimed at deceiving the system. The banking sector is witnessing a digital transformation, and the digital age has introduced complex fraud challenges that necessitate innovative solutions.
Effective fraud prevention strategies are critical for maintaining financial security and
trust. Artificial Intelligence stands at the forefront of these efforts, offering powerful detection and prevention tools. As one delves deeper into the exploration of AI-driven fraud prevention, it becomes evident that robust, adaptive systems are required.
This study aims to contribute to the existing body of knowledge by developing and
evaluating a robust fraud detection system that leverages machine learning and rule-
based techniques to mitigate the risks associated with fraudulent activities in the
banking industry.
Machine learning algorithms are often hybridized to solve complex problems. In machine learning, Artificial Neural Networks and Fuzzy Logic are combined for knowledge representation and to build models that can learn and adapt to changing conditions.
The detection of fraud has been taken seriously in recent years. Despite the extensive utilization of data mining algorithms for sorting fraud out, the rise in fraudulent activities continues. Cybersecurity measures are constantly tested by sophisticated schemes that aim to bypass conventional detection methods. Human oversight, while valuable, cannot catch every breach; it is challenging to keep up with the pace and cunning of modern cyber threats. Banks in Nigeria have for quite some time been the centre of public attention, especially in the area of fraud, which has been on the increase in recent years. Although bank fraud is a global phenomenon, its growth in Nigeria has become outstanding. Effective fraud prevention strategies are crucial for safeguarding financial systems, protecting consumer trust, and ensuring the stability of economic activities. The financial losses associated with fraud can be devastating for both individuals and institutions, hence the need to prevent fraud in the banking industry. This project work strives to build a model that would help to streamline fraud detection and prevent banks and their customers from falling victim.
The main aim of this project work is to design a fraud detection system implemented in the Python programming language. The objectives include carrying out a case study on bank fraud using data from the Kaggle database.
1.4 METHODOLOGY
This section outlines the systematic review methodology applied to critically assess
the role of Artificial Intelligence (AI) in fraud detection within financial institutions.
By adhering to a transparent and structured approach, this study ensures that the
selection of documents, the extraction of relevant data, and the synthesis of findings
are both comprehensive and credible (Kitchenham et al., 2020). Systematic reviews provide a rigorous synthesis of the literature, helping to identify trends, challenges, and gaps in the field. The methodology described here was designed to answer the study's key research questions.
This project work will apply three commonly used filtering methods: chi-squared, gain ratio, and information gain. Majority voting will then be applied to finalize the selection of features. Classifiers such as support vector machine (SVM), random forest (RF), and logistic regression (LR) will also be applied for comparison. ANFIS is a hybrid model that blends the learning capabilities of neural networks with the interpretability and rule-based reasoning of fuzzy logic systems. This combined approach makes it possible for ANFIS to learn from data and to adjust its parameters to enhance its performance over time, which makes it specifically useful for complex, non-linear problems such as fraud detection.
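As an illustration of the filter-and-vote scheme described above, the three filter scores and the majority vote can be sketched in Python. The toy dataset, the top-k cut-off, and the two-of-three voting threshold are illustrative assumptions, not the settings used in this study:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy (base 2) of a discrete array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(x, y):
    # H(y) minus the entropy of y conditioned on the feature's values
    values, counts = np.unique(x, return_counts=True)
    cond = sum((c / len(y)) * entropy(y[x == v]) for v, c in zip(values, counts))
    return entropy(y) - cond

def gain_ratio(x, y):
    # Information gain normalised by the feature's own entropy (split information)
    split_info = entropy(x)
    return information_gain(x, y) / split_info if split_info > 0 else 0.0

def chi_squared(x, y):
    # Pearson chi-square statistic of the feature/class contingency table
    obs = np.array([[np.sum((x == f) & (y == c)) for c in np.unique(y)]
                    for f in np.unique(x)], dtype=float)
    exp = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    return float(((obs - exp) ** 2 / exp).sum())

def majority_vote_select(X, y, k):
    # A feature survives if it is in the top-k ranking of at least 2 of the 3 filters
    votes = np.zeros(X.shape[1], dtype=int)
    for score in (information_gain, gain_ratio, chi_squared):
        ranked = np.argsort([-score(X[:, j], y) for j in range(X.shape[1])])
        votes[ranked[:k]] += 1
    return [j for j in range(X.shape[1]) if votes[j] >= 2]

# Toy data: column 0 matches the class label exactly; columns 1-2 are noise
X = np.array([[0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1],
              [1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
selected = majority_vote_select(X, y, k=1)
```

Here each filter ranks the candidate features independently, and a feature is kept only when at least two of the three rankings agree, which makes the selection less sensitive to the quirks of any single criterion.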
The significance of this study lies in the great importance attached to Artificial Intelligence in the banking sector. From the findings of this research, banks will appreciate the importance of AI in fraud detection and, thus, fraud prevention. This work will be an eye-opener to the public at large. Interested organizations, business centres, institutions, and many others, having learned the role of machine learning, can make use of Artificial Intelligence to curb this fast-growing problem. This study will be of immense benefit to the entire banking sector in Nigeria and to the computer science department, in the sense that it will educate these audiences on the types of fraud in the Nigerian banking sector and on how to design a fraud detection system. The study will also serve as a repository of information for other researchers who desire to carry out similar research on the subject.
This project work, when completed, will serve as a second opinion for the banking industry in curbing the ongoing fraud pandemic in Nigeria. It will also be used as reference material for subsequent studies.
b. Artificial Intelligence: Artificial Intelligence (AI) refers to technologies that allow machines to perform tasks that typically require human intelligence.
c. Banking: Banking is a financial industry that involves the storage and use of money.
d. Machine Learning: Machine learning (ML) is a branch of artificial intelligence and computer science that focuses on using data and algorithms to enable AI to imitate the way that humans learn.
CHAPTER TWO
LITERATURE REVIEW
There are different types of fraud, varying from online fraud, identity theft, and health care fraud to tax fraud and more. The majority of these frauds happen through the banking sector. Bank fraud includes activities such as unauthorized fund transfers and account takeovers, and it calls for robust detection and prevention. Leveraging the power of machine learning and data analytics, institutions can now identify fraudulent activities in real time. The world of fraud detection is changing fast, and e-commerce platforms and other industries are turning to ML to tackle fraud more effectively. For instance, ML algorithms have significantly cut down credit card fraud in the financial sector by analyzing transaction patterns and spotting anomalies with remarkable precision. Citibank has slashed phishing attacks by 70% thanks to ML, while Walmart has reduced shoplifting by 25% through real-time video analysis. In today's digital age, the proliferation of online transactions, e-commerce, and digital payments has been accompanied by increasingly sophisticated fraud schemes (& Cilluffo, 2022). These schemes range from phishing attacks and identity theft to more complex forms of financial fraud such as account takeovers and money laundering, posing challenges for traditional fraud detection and prevention methods, which often struggle to keep pace with the agility and ingenuity of modern cybercriminals.
AI techniques can identify fraudulent activities with higher accuracy and efficiency compared to traditional methods. Here, we explore some of the key AI techniques employed in fraud detection (Hasan, Gazi & Gurung, 2024; Yalamati, 2023). AI-driven systems can analyze vast amounts of data in real time; machine learning, neural networks, and natural language processing (NLP) enable the development of advanced fraud detection models that continuously learn and adapt to new fraud patterns. By automating detection processes, AI helps organizations stay one step ahead of fraudsters, ensuring more secure and resilient operations.
Machine learning comprises algorithms that allow computers to learn from and make predictions based on data. In fraud detection, ML techniques are extensively used to identify patterns and anomalies that indicate fraud and to suggest risk rules. The system can then implement the rules to block or allow certain user actions. When training the machine learning engine, you must flag previous cases of fraud and non-fraud to avoid false positives and to improve your risk rules' precision. The longer the algorithms run, the more accurate the rule suggestions will be.
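The flag-and-learn workflow described above can be sketched with a deliberately simple rule-suggestion routine. The attribute names, thresholds, and data below are invented for illustration; a production engine would learn far richer rules from much larger histories:

```python
from collections import defaultdict

def suggest_risk_rules(transactions, labels, min_fraud_rate=0.8, min_support=3):
    # transactions: list of dicts of categorical attributes; labels: 1 = fraud, 0 = legit.
    # For each (attribute, value) pair with enough support, propose a blocking
    # rule whenever the share of flagged fraud cases is high enough.
    stats = defaultdict(lambda: [0, 0])           # (attr, value) -> [frauds, total]
    for tx, label in zip(transactions, labels):
        for attr, value in tx.items():
            stats[(attr, value)][0] += label
            stats[(attr, value)][1] += 1
    rules = []
    for (attr, value), (frauds, total) in stats.items():
        if total >= min_support and frauds / total >= min_fraud_rate:
            rules.append(f"block if {attr} == {value!r}")
    return sorted(rules)

# Flagged historical cases (hypothetical): all fraud came from country "X"
history = [
    {"country": "X", "channel": "web"}, {"country": "X", "channel": "app"},
    {"country": "X", "channel": "web"}, {"country": "Y", "channel": "web"},
    {"country": "Y", "channel": "app"}, {"country": "Y", "channel": "web"},
]
flags = [1, 1, 1, 0, 0, 0]
rules = suggest_risk_rules(history, flags)
```

Note how the `channel` attribute never produces a rule: its fraud rate sits at 50%, below the threshold, which is exactly the false-positive control the flagged non-fraud cases provide.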
By building algorithms that can learn from historical transactional data, machine learning systems can flag fraud as it occurs. These systems have the potential to significantly enhance the accuracy and timeliness of fraud detection. The goal of machine learning algorithms is to learn from the data, identify behaviors, and predict future outcomes with minimal human intervention (Nguyen et al., 2023). Two machine learning methods are popular and widely used across the globe:
Supervised Learning: supervised models are trained on a labeled dataset, where the input data is paired with the correct output. This approach is highly effective for fraud detection because it allows the model to learn from historical data and identify similar patterns in new data. Decision trees are simple yet powerful models that use a tree-like structure to make decisions based on the features of the input data. In fraud detection, decision trees can classify transactions based on attributes such as transaction amount, location, and time (Afriyie et al., 2023). Neural networks consist of layers of interconnected nodes that process and transform the input data, and they are particularly useful for modeling complex, non-linear relationships.
Unsupervised Learning: unsupervised models do not require labeled data. Instead, they identify patterns and structures in the data based on its inherent properties. This approach is useful for detecting new and emerging types of fraud that may not have been previously labeled. Clustering algorithms group similar data points together based on their features; in fraud detection, clustering can be used to identify clusters of similar transactions, and transactions that do not fit any cluster can be flagged for further investigation (Ahmad et al., 2023; Huang et al., 2024; Min et al.). Anomaly detection algorithms identify patterns that deviate from the norm and are particularly effective at spotting rare fraudulent events. Techniques such as Isolation Forest and One-Class SVM are commonly used for anomaly detection.
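As a minimal sketch of anomaly detection on transaction data, the example below uses scikit-learn's IsolationForest on simulated amounts and hours of day; the data and the contamination rate are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Simulated legitimate behaviour: modest daytime amounts
normal = np.column_stack([rng.normal(50, 10, 200),    # amount
                          rng.normal(14, 2, 200)])    # hour of day
# Two injected anomalies: very large night-time transfers
outliers = np.array([[5000.0, 3.0], [7500.0, 4.0]])
X = np.vstack([normal, outliers])

# contamination is the assumed share of anomalies in the data
model = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = model.predict(X)          # +1 = normal, -1 = anomaly
```

`predict` returns +1 for points the model considers normal and -1 for anomalies, so the two injected night-time transfers are the ones expected to be flagged for review.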
Deep Learning is a subset of machine learning that uses neural networks with many layers (deep neural networks) to model complex patterns in data. Deep learning techniques have proven effective for fraud detection. Convolutional Neural Networks (CNNs) are primarily used for image and spatial data analysis, but they can also be applied to fraud detection by treating structured transaction data as inputs from which the network can automatically extract relevant features. In fraud detection, CNNs can be used to analyze transaction sequences and patterns over time; by capturing spatial relationships within the data, CNNs can detect subtle and complex fraud patterns that may not be apparent using traditional methods. Recurrent Neural Networks (RNNs) are designed to handle sequential data and time-series analysis (Nagaraju et al., 2024; Palakurti, 2024; Yadav, Yadav & Goar, 2024). They have the ability to retain information from previous inputs (memory cells), making them suitable for tasks that involve temporal dependencies.
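A recurrent cell of this kind can be sketched in a few lines of NumPy. The weights below are random stand-ins for a trained model, so the score itself is meaningless; the point is how the hidden state carries earlier transactions forward through the sequence:

```python
import numpy as np

def rnn_fraud_score(amounts, Wx, Wh, Wo, b, bo):
    # Unroll a simple Elman-style RNN cell over a sequence of (scaled)
    # transaction amounts; the hidden state h is the model's "memory".
    h = np.zeros(Wh.shape[0])
    for a in amounts:
        h = np.tanh(Wx * a + Wh @ h + b)   # update memory with the new input
    logit = float(Wo @ h + bo)
    return 1.0 / (1.0 + np.exp(-logit))    # squash to a 0-1 fraud score

# Randomly initialised weights stand in for a trained model (assumption)
rng = np.random.default_rng(0)
hidden = 4
Wx = rng.normal(size=hidden)
Wh = rng.normal(size=(hidden, hidden)) * 0.5
Wo = rng.normal(size=hidden)
b, bo = np.zeros(hidden), 0.0

score = rnn_fraud_score([0.1, 0.2, 0.15, 9.5], Wx, Wh, Wo, b, bo)
```

Because the hidden state depends on the order of inputs, the same transactions presented in a different order generally produce a different score, which is exactly the temporal sensitivity the text describes.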
RNNs are particularly useful in fraud detection for analyzing transaction histories and identifying suspicious patterns over time. For instance, they can detect fraudulent behaviors that unfold as a series of transactions across different time periods, patterns which would be difficult to spot from any single transaction in isolation.
Natural Language Processing (NLP) is a field of AI concerned with the interaction between computers and human language. NLP techniques are used to analyze and understand textual data, which is valuable for detecting fraud involving written communication. NLP techniques can be used to analyze textual data such as emails and chat messages for suspicious language patterns, keywords, and phrases that may indicate fraudulent intent or activity (Adekunle et al., 2024; Chang, Yen & Hung, 2022; Krishnan et al., 2022). For example, certain terms and language structures commonly used in phishing emails can be flagged as potential fraud indicators. NLP can be employed to scan emails for signs of phishing, social engineering, and other fraudulent schemes. By analyzing the content, structure, and context of emails, AI systems can detect deception, and transaction descriptions that do not align with typical patterns can be flagged for further review. Beyond emails, NLP can be used to analyze various forms of communication, including text messages, social media interactions, and customer service chats. This helps in identifying fraudulent activities that involve deceptive language.
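A toy version of such keyword-based screening is shown below. The pattern list and weights are invented for illustration; a real NLP system would learn these signals from labelled messages rather than hard-code them:

```python
import re

# Hypothetical indicator patterns with hand-picked weights (assumptions)
PHISHING_PATTERNS = {
    r"\bverify your account\b": 3,
    r"\burgent(ly)?\b": 2,
    r"\bsuspend(ed)?\b": 2,
    r"\bclick (the|this) link\b": 2,
    r"\bpassword\b": 1,
}

def phishing_score(text):
    # Sum the weight of every suspicious pattern present (case-insensitive)
    lowered = text.lower()
    return sum(w for pat, w in PHISHING_PATTERNS.items() if re.search(pat, lowered))

def is_suspicious(text, threshold=4):
    # Flag a message when the combined indicator weight crosses a threshold
    return phishing_score(text) >= threshold
```

Scoring several weak indicators jointly, rather than blocking on any single keyword, keeps ordinary messages that mention a password or a link from being flagged on their own.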
In summary, AI techniques such as machine learning, deep learning, and natural language processing play a critical role in enhancing fraud detection. By leveraging these techniques, organizations can improve their ability to identify and mitigate fraudulent activities, ultimately safeguarding financial systems and maintaining trust in the digital age (Bharadiya).
Artificial Intelligence (AI) has proven to be an invaluable tool in the battle against fraud across various sectors. Its ability to process and analyze vast amounts of data in real time allows for more effective detection and prevention of fraudulent activities (Jagatheesaperumal et al., 2021; Mahalakshmi et al., 2022; Mohammed & Rahman). AI systems are capable of monitoring credit card transactions in real time; by analyzing transaction data, AI can identify unusual patterns and behaviors that deviate from a cardholder's typical spending habits. For instance, AI algorithms can detect anomalies such as consecutive transactions that are out of character for the user. When such anomalies are detected, the system can automatically flag the transaction for further investigation or temporarily halt it to prevent potential fraud. ML techniques used for financial fraud detection are presented in Table 2.1.
Table 2.1: ML Techniques used for Financial Fraud Detection (Ali et al., 2022)

Fuzzy Logic: a logic in which methods of reasoning are approximate rather than exact.
Decision Tree: a regression tree and classification method that is used for decision support.
Genetic Algorithm: searches for the best way to solve problems among the suggested solutions.
Support Vector Machine: a supervised method used for classification problems.
Random Forest: an ensemble of decision trees.
ANFIS stands for Adaptive Neuro-Fuzzy Inference System. Let us delve deeper into its structure, its learning process, and how it leverages fuzzy logic and neural networks. ANFIS can learn and adjust its parameters based on new data. This learning happens through a hybrid approach that combines gradient descent, from neural networks, with least squares estimation.
Fuzzy logic is a computation technique used to describe the relationship between information attributes. Fuzzy logic has a structure that allows human-like reasoning at the level of computational processing (Wu et al., 2011). There are four basic processing components: fuzzification, the knowledge base, the inference engine, and defuzzification. The fuzzification process transforms the crisp input set into fuzzy groups via membership functions, which experts or the researcher will supply. Through fuzzy inference on the input and the previously established if-then rules in the knowledge base, the inference engine replicates human expectations. The final stage, known as defuzzification, converts the fuzzy output back into a crisp value.
One of the advantages of fuzzy logic is the simplicity of applying mathematical concepts within it. Due to the flexibility of fuzzy logic's modification phase, adding or removing rules is simple. On the other hand, fuzzy logic has a number of disadvantages: it still requires expertise to understand and develop a fuzzy system, and it is time consuming to develop fuzzy rules and membership functions (Shamim et al., 2011).
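The fuzzification, rule evaluation, and defuzzification steps described above can be sketched for a single input. The membership ranges and rule outputs below are illustrative assumptions, not values from this study:

```python
def triangular(x, a, b, c):
    # Triangular membership function rising from a, peaking at b, falling to c
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fraud_risk(amount):
    # Fuzzification: to what degree is the amount "low", "medium", or "high"?
    low = triangular(amount, -1, 0, 200)
    medium = triangular(amount, 100, 500, 1000)
    high = triangular(amount, 800, 5000, 10000)
    # Rule base: IF amount is low THEN risk 0.1; medium -> 0.5; high -> 0.9
    weights = [low, medium, high]
    outputs = [0.1, 0.5, 0.9]
    # Defuzzification: weighted average of rule outputs (zero-order Sugeno style)
    total = sum(weights)
    return sum(w * o for w, o in zip(weights, outputs)) / total if total else 0.0
```

Because an amount can belong partly to "low" and partly to "medium" at the same time, the final crisp risk value blends the applicable rules rather than jumping between hard categories.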
Artificial Neural Networks are electronic models based on the structure of the brain, which learns mainly from experience. Artificial neural network modelling also offers a potentially less technical way to develop machine solutions. The first artificial neural network, the perceptron, was designed in 1958 and was inspired by how the human brain processes visual data and learns to recognize items. Since then, the use of neural networks has kept increasing, and hundreds of different models have been developed. These models differ in their functions, accepted values, topology, learning algorithms, and so on. The learning and pattern-matching capabilities of artificial neural networks have allowed them to solve many problems that were difficult or impossible to answer by conventional means.
The structure of an artificial neural network is inspired by natural neurons. Each neuron takes many possible inputs and, based on the weight values, passes a signal on to the next neuron or to the output. The activation process of every neuron basically consists of inputs that are multiplied by weight values (the strength of the signal) and then computed by a transfer function, which must reach a certain threshold. An artificial neural network can have multiple neurons; it combines the input from every single neuron in each intermediate layer to process information and finally reach a single output (Yadav, 2015).
The advantage of an artificial neural network is its ability to solve both linear and nonlinear problems, and neural programs are able to learn without explicit reprogramming. The disadvantages include the high processing time required for large networks, and the operation of an artificial neural network relies heavily on the training process (Rao & Srinivas, 2003).
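The weighted-sum-and-transfer-function computation described above can be sketched as a tiny two-layer network in NumPy; the random weights stand in for a trained model, so the output is only a structural illustration:

```python
import numpy as np

def relu(z):
    # Transfer function for the hidden layer: passes only positive signals
    return np.maximum(0.0, z)

def sigmoid(z):
    # Transfer function for the output neuron: squashes to the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Each neuron multiplies its inputs by weights, sums them, and passes the
    # result through a transfer (activation) function before the next layer.
    hidden = relu(W1 @ x + b1)            # intermediate layer of three neurons
    return sigmoid(W2 @ hidden + b2)[0]   # single output neuron

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # 2 inputs -> 3 hidden neurons
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # 3 hidden -> 1 output
score = forward(np.array([0.5, -1.2]), W1, b1, W2, b2)
```

Training would adjust `W1`, `b1`, `W2`, and `b2` so that the output matches the labels; here only the forward computation, the part the text describes, is shown.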
ANFIS is a hybrid of both artificial neural networks and fuzzy logic. By combining the two methods, this network not only inherits their benefits and characteristics but also mitigates their individual weaknesses.
2.3.4 GENETIC ALGORITHM (GA)
The main idea of GA is to mimic natural selection and the survival of the fittest. Candidate solutions (chromosomes) are evaluated for fitness values and ranked from best to worst. New generations are produced through repeated applications of three genetic operators: selection, crossover, and mutation. First, the better chromosomes are selected to become parents and produce new offspring (new chromosomes). To simulate the survival of the fittest, the chromosomes with better fitness are selected with higher probabilities than the chromosomes with poorer fitness; the selection probabilities are usually defined using the relative ranking of the fitness values. Once the parent chromosomes are selected, the crossover operator recombines parts of the parents to form offspring.
(Figure: flowchart of the genetic algorithm loop, which repeats until the stopping condition is met.)
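The selection, crossover, and mutation loop summarised in the flowchart can be sketched as follows, using a toy fitness function (maximising the number of 1-bits) in place of a real fraud-detection objective; the population size, mutation rate, and stopping rule are illustrative assumptions:

```python
import random

def run_ga(fitness, n_bits=12, pop_size=30, generations=60, p_mut=0.02, seed=7):
    # Evolve bit-string chromosomes through selection, crossover, and mutation
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):            # "time to stop?" check of the flowchart
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # fitter chromosomes become parents
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(n_bits):         # mutation occasionally flips a bit
                if rng.random() < p_mut:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = run_ga(fitness=sum)   # toy objective: maximise the number of 1-bits
```

Truncation selection is used here for brevity; rank-proportional selection, as the text describes, would assign each chromosome a selection probability based on its relative ranking instead of a hard cut.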
In feature selection, Naive Bayes can be used to evaluate the relevance of subsets of features by measuring how well each subset predicts the class. Features that contribute most to the discriminative power of the model are retained, while less informative ones are discarded.
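A sketch of this subset-evaluation idea, using scikit-learn's Gaussian Naive Bayes with cross-validated accuracy as the relevance measure, is shown below; the synthetic data and the exhaustive search over subsets are illustrative simplifications:

```python
import numpy as np
from itertools import combinations
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Synthetic data: one feature tied to the class, two irrelevant noise features
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)
informative = y + rng.normal(0.0, 0.3, n)
noise1 = rng.normal(size=n)
noise2 = rng.normal(size=n)
X = np.column_stack([informative, noise1, noise2])

def subset_score(cols):
    # Cross-validated accuracy of Naive Bayes restricted to one feature subset
    return cross_val_score(GaussianNB(), X[:, cols], y, cv=5).mean()

# Evaluate every non-empty subset and keep the most discriminative one
subsets = [c for k in (1, 2, 3) for c in combinations(range(X.shape[1]), k)]
best = max(subsets, key=subset_score)
```

Exhaustive enumeration is only feasible for a handful of features; with many candidates, greedy forward or backward selection over the same score is the usual compromise.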
Support Vector Machine (SVM) is a classification algorithm that draws a strict border between the positive and negative classes, like a bouncer at a VIP club making sure no unwanted elements get through. SVM is all about maximizing the distance between data points from different classes: it finds the decision boundary (hyperplane) between classes that has the maximum distance from the nearest data points, like putting up a massive fence to keep the good folks inside and the troublemakers out. The hyperplane acts as a threshold for classifying new data points: if a point falls on one side, it is labeled as positive, and if it is on the other, it is labeled as negative.
Linear SVM: This is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, such data is termed linearly separable.
Nonlinear SVM: This is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, such data is known as nonlinear data.
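The two cases can be illustrated with scikit-learn's SVC: a linear kernel for cluster-like data, and an RBF kernel for ring-shaped data that no straight line can separate. The toy datasets are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable case: two well-separated clusters
X_lin = np.array([[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]], dtype=float)
y_lin = np.array([0, 0, 0, 1, 1, 1])
linear_svm = SVC(kernel="linear").fit(X_lin, y_lin)

# Nonlinear case: an inner ring inside an outer ring; no straight line separates them
theta = np.linspace(0, 2 * np.pi, 40, endpoint=False)
inner = np.column_stack([np.cos(theta), np.sin(theta)])
outer = 3.0 * inner
X_ring = np.vstack([inner, outer])
y_ring = np.array([0] * 40 + [1] * 40)
rbf_svm = SVC(kernel="rbf").fit(X_ring, y_ring)
```

The RBF kernel implicitly maps the rings into a space where a separating hyperplane exists, which is why the nonlinear model can fit data the linear one cannot.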
Decision tree algorithms, such as Random Forest, are commonly used as embedded feature-selection methods. Feature importance is computed based on how much each feature decreases impurity across all decision trees in the forest. Features with higher importance are considered more important and are retained, while less informative features are pruned from the model. Random Forest's ability to naturally select relevant features during training makes it efficient, since feature selection is integrated into the model training process, eliminating the need for external evaluations. However, embedded methods may be less flexible in terms of feature-selection strategies compared to wrapper methods. Decision trees themselves are easy-to-use models built by recursively dividing the dataset into key-criteria subgroups, with leaf nodes as the result. Decision trees are helpful in decision-making because they are simple to comprehend; a decision-making tree can incorporate utility, resource costs, and the results of random events.
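This embedded selection can be sketched with scikit-learn's RandomForestClassifier: after training, `feature_importances_` reports the mean impurity decrease per feature, and below-average features are pruned. The synthetic data and the pruning threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)
# One informative feature (fraudulent amounts are much larger) plus pure noise
amount = np.where(y == 1, rng.normal(900, 100, n), rng.normal(100, 50, n))
noise = rng.normal(size=(n, 3))
X = np.column_stack([amount, noise])

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_                 # mean impurity decrease
keep = np.where(importances > importances.mean())[0]  # prune below-average features
```

Because the importances come directly from the trained forest, no separate evaluation loop is needed, which is the efficiency advantage of embedded methods that the text notes.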
AI models can be trained to distinguish between legitimate and fraudulent activities (Alarfaj et al., 2022; Hilal, Gadsden & Yawney, 2022). Unsupervised methods, such as clustering and anomaly detection, are used to uncover new and emerging fraud patterns that have not been previously identified. This dual approach ensures comprehensive coverage. AI is also effective against money laundering, which typically involves moving funds through multiple accounts and institutions to obscure their origins (Mishra & Mohapatra, 2024; Youssef, Bouchra & Brahim, 2023; Zhang & Chen, 2024). By automating the detection process, AI systems can quickly flag suspicious transfers; this accelerates the identification of money laundering schemes and reduces the burden on investigators. Frameworks such as the Financial Action Task Force (FATF) and the Bank Secrecy Act (BSA) impose stringent reporting requirements on suspicious activities (Gaviyau & Sibindi, 2023; Siddiqui, 2023; Stevens, 2022). AI helps institutions meet these requirements by documenting flagged transactions and the entities involved. This not only ensures compliance with regulations but also creates a clear audit trail.
Phishing attacks aim to trick users into revealing sensitive details. AI-powered systems can analyze emails, messages, and websites to detect deceptive content with high accuracy (Alabdan, 2020; Alkhalil et al., 2021; Jain & Gupta, 2022). By integrating AI with email security protocols and web filters, organizations can significantly reduce the risk of falling victim to phishing attacks. AI can also monitor network traffic and user behavior to identify potential threats; machine learning models can detect intrusions, and by integrating AI with existing security solutions, organizations can automate threat detection and response, reducing the time it takes to identify and mitigate cyber threats (Camacho, 2024; Manoharan & Sarker, 2023). This helps prevent data breaches, protect sensitive information, and maintain the integrity of IT systems. In short, AI offers sophisticated tools to combat various forms of fraud, from real-time credit card monitoring to anti-money-laundering controls. As fraudsters adopt more advanced tactics, the integration of AI in fraud prevention strategies will remain essential for organizations to stay ahead of potential threats and protect their assets.
Because machines have a much easier job processing a large dataset than humans, what you get is the ability to slice and dice huge amounts of information. That means:
1. Faster and more efficient detection: The system gets to quickly identify suspicious patterns and behaviors that might have taken human agents months to establish.
2. Reduced manual review time: Similarly, the amount of time spent on manually reviewing information can be drastically reduced when you let machines analyze all the data.
3. Better predictions with large datasets: The more data you feed a machine learning engine, the more trained it becomes. That is to say, while large datasets can sometimes make it challenging for humans to find patterns, for machines it is actually the opposite.
4. Cost-effective solution: Unlike hiring more RiskOps agents, you only need one machine-learning system to go through all the data you throw at it, regardless of the volume. This is ideal for businesses with seasonal ebbs and flows in traffic, as it lets the company grow without drastically increasing risk management costs at the same time.
5. Last but not least, algorithms don't need breaks, holidays, or sleep. Fraud attacks can happen 24/7, but even the best fraud managers might come to work on Monday morning with a backlog of manual reviews. Machines can ease up the process by working around the clock, which is especially valuable for eCommerce businesses.
2.5 CHALLENGES IN AI-DRIVEN FRAUD PREVENTION
Implementing AI-driven fraud prevention strategies comes with its own set of challenges, ranging from data privacy concerns to the quality of datasets and the interpretability of models; addressing these is essential for the effective and ethical use of AI in fraud prevention. One of the primary challenges is data privacy: organizations must protect customer information and comply with privacy regulations such as GDPR, CCPA, and others. Striking a balance between utilizing data for fraud prevention purposes and respecting individual privacy is critical, and organizations must ensure that their use of data is transparent, lawful, and proportionate to the goal of preventing fraud.
The performance of AI models also depends on the quality and diversity of the datasets used for training (Bao, Hilary & Ke, 2022; Paldino et al., 2024; Whang et al., 2023; Yandrapalli, 2024). Organizations must ensure that their datasets are comprehensive, representative, and free from biases, since biased or poor-quality data can negatively impact the performance of AI models; such biases must be identified and addressed.
AI models, particularly deep learning models, are often considered "black boxes" due to the difficulty of explaining their decisions. Interpretability includes providing explanations for why a particular transaction was flagged as fraudulent and how the AI model arrived at that decision. Overcoming the challenges associated with AI-driven fraud prevention requires a holistic approach that considers data privacy, dataset quality, and model interpretability (Sarker et al., 2024; Wang et al., 2024; Williamson & Prybutok, 2024). By addressing these challenges, organizations can harness the power of AI to enhance their fraud prevention efforts while ensuring compliance with regulations and maintaining trust with customers.
Table 2.2: Summary of related works

Bao, Hilary & Ke (2022): A web-based credit card fraud detection application. Techniques: ANFIS, Mean Shift clustering algorithm, GCLM. Dataset: Keio Bank, Tokyo. Result: achieves high classification accuracy and specificity compared to previous classification algorithms. Limitation: focused on only five types of fraud.

Paldino et al. (2024): A survey on artificial-intelligence-based techniques for detection of bank fraud. Techniques: SVM, fuzzy logic. Dataset: collected from Kaggle. Result: 89% accuracy. Limitation: used only a specific set of filter and embedded methods, focusing on a particular fraud dataset.

Whang et al. (2023): A study on bank fraud detection using a multi-layer neural network. Technique: ANFIS. Dataset: collected from Kaggle (processed in MATLAB). Result: 98.7% accuracy. Limitation: poor data quality.

Yandrapalli (2024): Fuzzy logic and correlation-based hybrid on credit card fraud. Techniques: neural network, fuzzy inference system. Dataset: taken from a bank of India. Result: 91.26% accuracy, 98% sensitivity, 89% specificity. Limitation: decisions are often hard to understand.

Gupta (2024): Artificial intelligence in fraud prevention: exploring techniques and applications, challenges and opportunities. Techniques: ANFIS, GA, fuzzy logic. Dataset: 345 analyzed cases. Result: 98.66% accuracy. Limitation: challenges in data quality and interpretability of model decisions.

Manoharan & Sarker (2023): Integrating AI in banking and fraud prevention. Techniques: CNN, ConvNet. Dataset: uploaded by users through an Android application. Result: immediate and accurate results in seconds. Limitation: poor data quality.

Rangaraju (2023): Anomaly detection in credit card transactions using machine learning. Techniques: ANN, SVM. Dataset: questionnaire and data submitted by users through their mobile phones. Result: 90% accuracy. Limitation: poor data quality.

Mahalakshmi et al. (2022): Credit card fraud detection using unsupervised learning algorithms. Techniques: ANFIS, CNN. Dataset: 260 analysed cases. Result: 96% accuracy. Limitation: model performance depends on quality of training data; small dataset.
CHAPTER THREE
3.1 OVERVIEW
A system is a set of interrelated components that follow rules to function as a cohesive whole. A system's limits, structure, and purpose are characterised by its surroundings, which also impact its operation. In systems theory, one starts by understanding a system's objectives and purpose and then developing procedures and systems that will realise them, which includes the following:
iii. Define requirements for the new system (Dennis et al., 2012)
One of the key methods for comprehending, assessing, and developing or modifying systems is system analysis. The following points illustrate the importance of system analysis: in a complex system, subsystems may have goals that appear to be at odds with one another, and analysis helps reconcile them.
3. It offers a means to create understanding of complex structures.
4. System analysis assists in setting each subsystem in its appropriate context and viewpoint so that the system as a whole can accomplish its goals with the least cost to the total system.
System analysis helps in several ways. Here are some benefits of system analysis:
1. There can indeed be no 'perfect path'. Still, when the steps to be taken are well defined, system analysis offers benefits such as reducing cost, minimizing the chance of fatal errors, preventing the downfall of the business, and reducing the scope of future errors.
2. The fact that system analysis is a relatively easy subject to master is another crucial feature. This indicates that neither a degree nor any professional qualification is required to practise it.
3. In the corporate world, system analysis plays a crucial part in ensuring that items are created correctly and delivered on time. On-time delivery of products matters because there is a great chance that the finished items will contain many faults if they are rushed, particularly software.
3.3 ANALYSIS OF EXISTING SYSTEM
The case study of this project is the work of (Njoku et al., 2024). Their approach showed that combining information gain and an adaptive neuro-fuzzy inference system (ANFIS) for the detection of fraud can be a potential and efficient alternative for the banking industry to detect fraudulent activities. The dataset and preprocessing methods were detailed, with the dataset comprising 15,550 records of credit card fraud with 5 attributes per user record. The dataset was preprocessed, and attribute values were normalized to make it suitable for the mining process. The proposed approach outlines the methodology of using information gain to reduce the feature number. The process involved dataset collection, preprocessing, feature extraction, and model training using the selected algorithms.
The first step involves the analysis of data, where each column is analyzed and the necessary measures are taken for missing values and other forms of dirty data. Outliers and other values which do not have much impact are dealt with. The pre-processed data is then used to build the classification model: the data is split into two parts, one for training and the remaining data for testing purposes. Machine learning algorithms are applied to the training data, where the model learns the patterns from the data; the model then handles test data or new data and classifies whether a transaction is fraudulent or not. The algorithms are compared and the performance metrics of the algorithms are calculated. The classification accuracy for the information gain-ANFIS was found to be 95.24% for credit card fraud. Figure 3.1 shows the architecture of the existing system.
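The split-train-evaluate flow described above can be sketched in plain Python. The 80/20 ratio, the function name, and the fixed seed are illustrative assumptions, not the authors' actual code:

```python
import random

def train_test_split_indices(n_rows, test_ratio=0.2, seed=42):
    """Shuffle row indices and split them into train/test index lists,
    mirroring the 'two parts: one for training, the rest for testing' step."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # reproducible shuffle
    cut = int(n_rows * (1 - test_ratio))   # e.g. 80% train, 20% test
    return indices[:cut], indices[cut:]
```

A model would then be fitted on the training indices and its accuracy, sensitivity, and specificity measured on the held-out test indices.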
While the paper reports accuracy, specificity, and sensitivity, it does not mention other important metrics like the F1-score, ROC-AUC, or precision, which can provide a fuller picture of performance. In addition, the interpretability of ANFIS models can be challenging. The paper does not provide insights into how the rules generated by the ANFIS system can be interpreted by bankers, which is crucial for the adoption of such systems in banking settings. It builds a classification model to classify whether a credit card transaction is fraudulent or not. The existing system has the following limitations:
1. It deals with only one type of fraud detection, namely fraud in credit-card transactions.
2. The model built is somewhat complex to use, as there is no web-based graphical user interface; inputs must be passed into the model from the terminal.
3. There is no provision for generating recent data of fraudulent account reports to help detect fraudulent accounts.
3.4 SYSTEM DESIGN
System design is the process of defining a system's components, modules, interfaces, and architecture to satisfy predefined criteria. It involves figuring out the fundamental features the system has to have, understanding how different components work together, and ensuring dependability, performance, and scalability. The design phase determines the hardware, software, and network infrastructure that will be used in the system, including the user interface, forms, and reports that will be utilized, as well as the particular databases, files, and programs that will be required. This phase will include a brief discussion of the system's basic architecture design, which outlines the architecture design, interface design, database modeling, and file specifications. System design is, in short, the process of planning a new system, based on the results of analyzing the current system, to replace the current system by defining components or modules that satisfy the specific requirements.
The process of sifting through massive data sets to find links and patterns that may be used to address business problems through data analysis is known as data mining. Businesses can forecast future trends and make more educated business decisions by extracting valuable insights from large data sets. More specifically, data mining is a phase of the knowledge discovery in databases (KDD) process, which collects, processes, and analyzes data. Although KDD and data mining are sometimes used interchangeably, they are distinct concepts. Effective data collection, warehousing, and processing are prerequisites for the data mining process. Discovering more about a user base, predicting results, and identifying bottlenecks and dependencies are just a few uses for data mining. "One of the most difficult problems facing data mining is the complex issue of data quality" (Mannila, 2019). The Cross-Industry Standard Process for Data Mining (CRISP-DM) is the data mining framework that was used in this study.
CRISP-DM provides a structured approach to planning and implementing a data mining project. The process consists of six major phases:
a. Business Understanding: The goals of the data mining project are outlined in this phase, together with the success criteria from a business perspective.
b. Data Understanding: Here, the data scientist starts gathering the first set of data and gets acquainted with it. Important duties include collecting, describing, and exploring the data and verifying its quality.
c. Data Preparation: Data is rarely clean. This phase is dedicated to cleaning and transforming raw data into a suitable format for modeling. Key tasks include selecting, cleaning, constructing, and integrating data.
d. Modeling: Once clean data is obtained, different modeling approaches are used. It is usual to return to the data preparation stage, since different methods may call for different data formats. Choosing modeling techniques, creating tests, constructing the model, and evaluating the model are important activities.
e. Evaluation: Models are assessed to determine how effective and generalizable they are. Holdout samples and cross-validation are commonly used at this stage.
f. Deployment: Putting the model into a real-world setting is the last step. Planning the deployment, monitoring and maintaining the model, producing a final report, reviewing the project, and finishing the project are important duties.
3.7 PROPOSED SYSTEM
The proposed system is an enhanced fraud detection model which is better for prediction of credit card fraud, enhancing the work of (Njoku et al., 2024). The proposed system utilizes machine learning, which relies on the accumulation of extensive historical data through data gathering. This data collection encompasses both sufficient historical data and raw data. However, raw data cannot be used directly: it must first be preprocessed, after which an appropriate algorithm is chosen along with a model. In the context of detecting credit card fraud transactions using real datasets, supervised machine learning algorithms such as logistic regression played a vital role. The algorithm built a classification framework using machine learning methods. The model is then subjected to training and testing to ensure accurate predictions with minimal errors. Periodic tuning of the model further enhances its accuracy, carried out at intervals to continually refine its performance. Also, data gathered from account fraud reports form part of the collected bank data, which are verified by human experts. Then, rules that consider set thresholds for fraud detection are applied to test accounts and a decision is made (which validates whether an account is associated with fraud or not). Figure 3.3 shows the architecture of the proposed system.
3.8 METHODOLOGY
In this section, the methods used in this project (following the CRISP-DM approach discussed earlier) are described.
A. DATA ACQUISITION:
A single dataset was used for this analysis. The dataset was obtained from the Kaggle machine learning repository. This data set includes all transactions recorded over the course of two years. As described in the dataset, the features are scaled and the names of the features are not shown due to privacy reasons; as a result, a detailed feature-level study could not be done. There are 284,807 records. The only thing we know is that those columns that are unknown have been scaled already.
B. DATA ANALYSIS:
Exploratory analysis is used to identify patterns and outliers in the data in order to guide more focused testing of the hypothesis. To gain insights into the relationships between the features and the target variable, the script generates bar plots using Matplotlib and Seaborn. Figure 3.4 shows these plots. To describe the dataset, the mean, standard deviation, minimum, maximum, and quartile values are computed.
C. DATA PREPROCESSING:
It is necessary to clean and alter the data before applying any classification algorithm. The script loads the necessary libraries for data manipulation, visualization, scikit-learn, and ANFIS. The feature data (X) is examined, and its information and descriptive statistics are displayed to gain an understanding of the data set. To handle missing values, the script employs the 'fillna' function from pandas, replacing the missing values with zeros. The target variable (y) is extracted from the original dataset as a separate variable named 'class'. The data set is split into training and testing sets using the 'train_test_split' function from scikit-learn, with a test set size of 20% and a random state of 42 for reproducibility. Normalization puts all features on a comparable scale, preventing one feature from dominating the others. Minimum-maximum (min-max) normalization was used to increase the efficiency of the algorithm.
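The min-max normalization step can be sketched as follows. This is a plain-Python illustration of the transformation applied per column (scikit-learn's MinMaxScaler does the equivalent); the constant-column fallback is an added assumption:

```python
def min_max_normalize(column):
    """Rescale a numeric column to [0, 1] so no feature dominates the others."""
    lo, hi = min(column), max(column)
    if hi == lo:                        # constant column: no spread to rescale
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```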
D. FEATURE SELECTION:
In this project, after data preprocessing, a hybrid feature selection method was used which comprises three filter-based selection methods: chi-squared, information gain, and gain ratio.
I. Chi-squared (χ2): feature selection using χ2. χ2 computes a score based on the initial premise that the feature and the class are independent; a high score indicates that this premise does not hold, making the feature relevant.
II. Information Gain: Information gain (IG) is one of the filter feature selection techniques used to extract pertinent properties from a set of features. When the information gain of a feature is high, the feature is more relevant in determining the class attribute. Information theory serves as the foundation for this method, which ranks and chooses the best features to minimize feature size prior to the commencement of the learning process. Before ranking, the entropy of each feature with respect to the class is computed.
III. Gain ratio: The gain ratio was included to correct IG's bias in favor of features with high diversity values. The gain ratio shows a large number when the data are evenly distributed; when the data are all concentrated in one branch of the attribute, the gain ratio shows a small value. It determines an attribute based on the quantity and size of branches and adjusts IG by taking the intrinsic information into account; split information can be used to determine the intrinsic information of a given feature. One can compute the gain ratio between a feature value (y) and a given feature (x).
The script employs the three feature selection methods listed above. The 'SelectKBest' class from scikit-learn is used to compute the scores for each feature selection method.
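The information gain and gain ratio scores described above reduce to entropy computations. The sketch below is a plain-Python illustration of the definitions (the project's script instead obtains its scores through scikit-learn's SelectKBest):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of discrete values, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """Reduction in class entropy after splitting on the feature's values."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

def gain_ratio(feature, labels):
    """IG divided by the feature's intrinsic (split) information, correcting
    IG's bias toward features with many distinct values."""
    split_info = entropy(feature)
    ig = information_gain(feature, labels)
    return ig / split_info if split_info > 0 else 0.0
```

For a feature that perfectly separates the classes, both scores reach their maximum of 1 bit.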
E. VOTING SCHEME
A majority voting ensemble is an ensemble machine learning technique that combines the predictions from multiple models. It may be used to improve model performance, ideally achieving better performance than any single model used in the ensemble. There are two types of majority vote prediction for classification:
I. Hard voting: predicts the class with the largest sum of votes from the models.
II. Soft voting: predicts the class with the largest summed probability from the models.
In this project work, hard voting was used as the preferred voting scheme: a feature is selected if its score is above the threshold (the mean of all scores) in at least two of the scoring methods. The selected features are used to create new training and testing sets.
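The hard-voting rule just described (keep a feature scoring above the mean in at least two of the three methods) can be sketched as follows; the function and variable names are illustrative assumptions:

```python
from collections import Counter

def hard_vote_select(score_tables, min_votes=2):
    """score_tables: one {feature: score} dict per scoring method
    (e.g. chi-squared, information gain, gain ratio).
    A feature is kept when its score exceeds that method's mean score
    in at least `min_votes` of the methods."""
    votes = Counter()
    for scores in score_tables:
        mean_score = sum(scores.values()) / len(scores)
        for feature, score in scores.items():
            if score > mean_score:
                votes[feature] += 1
    return sorted(f for f, v in votes.items() if v >= min_votes)
```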
After the optimal dataset is obtained via the hybrid feature selection process, classification algorithms are used to process the sub-dataset. For the ANFIS algorithm, the script defines the number of membership functions and the standard deviation for the membership functions used. These parameters influence the shape and behavior of the membership functions, which play a crucial role in the fuzzy inference process. For each feature, the script initializes Gaussian membership functions with random means and specified standard deviations. The membership functions are used to map the input values to corresponding fuzzy sets. The script invokes the 'trainHybridJangOffline' method from the anfis library to train the ANFIS model. The training process is carried out for a specified number of epochs (20), allowing the model to iteratively learn from the training data and refine its parameters.
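A Gaussian membership function maps a crisp input to a degree of membership in [0, 1]. The sketch below illustrates the initialization described above; the function names and the [0, 1] range for the random means (reasonable after min-max normalization) are assumptions:

```python
import math
import random

def gaussian_mf(x, mean, sigma):
    """Degree of membership of x in a Gaussian-shaped fuzzy set."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2))

def init_membership_functions(n_mfs, sigma, seed=42):
    """One (mean, sigma) pair per membership function; means are drawn at
    random in [0, 1] since the features were min-max normalized."""
    rng = random.Random(seed)
    return [(rng.random(), sigma) for _ in range(n_mfs)]
```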
The ANFIS structure consists of nodes in different layers which are connected to each other. The output of this network depends on the tunable parameters of these nodes, and the network's learning rules determine how the parameters are updated to minimize the error. A fuzzy inference system is a framework based on fuzzy theory and If-Then rules. The ANFIS structure has three main elements: a rule base, a database, and a reasoning mechanism. The learning algorithm is used to tune all the tunable parameters.
After training, the script utilizes the trained ANFIS model to make predictions on the test data using the 'predict' function from the anfis library. To assess the model's performance, the script calculates the accuracy of the ANFIS model by comparing its predictions with the true labels in the test set. This is achieved using the 'accuracy_score' function from scikit-learn. A report is generated using the 'classification_report' function, providing valuable metrics such as precision, recall, and F1-score for each class in the classification task. The script also trains and evaluates three additional models: support vector machine, random forest, and logistic regression, with the same metrics computed for each. This section also contains the confusion matrices of some of the classifiers of the proposed model.
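The metrics reported by 'accuracy_score' and 'classification_report' all reduce to counts from the confusion matrix. A plain-Python sketch for the binary case (1 = fraud):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}
```

On heavily imbalanced fraud data, precision, recall, and F1 are more informative than accuracy alone.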
CHAPTER FOUR
4.1 INTRODUCTION
System implementation is the process of putting the trained model into use and integrating it into the production environment. It involves making sure the model works well, integrates with other systems, and meets performance requirements, and it requires that hardware and software constraints be taken into account. This chapter focuses on creating models with the Python programming language.
System requirements refer to the settings that a system must have in order to function smoothly, effectively, and predictably, with the understanding that failing to fulfill them can degrade performance. A good laptop with the following hardware requirements is needed to run the model conveniently:
b. RAM: 8GB
c. Storage: 120GB
e. GPU: A GPU is needed in order to enhance the model's speed, owing to the multiple processing cores that the model can leverage during its training phase.
f. Internet: a stable internet connection from a reliable internet service provider is required, for example for downloading libraries and models. The major software tools used in the development of the model are given below; the choice of tools depends on several considerations, including the project requirements and the familiarity of the programmer with the tools.
4.3.1.2 PYTHON
Python is a high-level, general-purpose programming language with powerful data types. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language. Python is widely regarded as the de facto language for machine learning and data science due to its simplicity, readability, extensive library support, and vibrant community. Popular
libraries such as TensorFlow, PyTorch, Scikit-learn, and Keras offer comprehensive
tools for building and training machine learning models. Some of the Python libraries used in this work are:
b. Matplotlib: used for generating a wide range of plots and charts, including line plots, scatter plots, and bar charts.
c. Pandas: a Python library widely used for data manipulation and analysis. It provides data structures for working with structured data, making it essential for tasks like data cleaning and exploration.
d. Seaborn: a statistical data visualization library based on matplotlib. It can be used in all sorts of data analysis tasks that require attractive, informative graphics.
e. Scikit-learn: a Python machine learning package that offers easy-to-use tools for data mining and data analysis.
f. Django: streamlines fraud detection system design through rapid development,
The proposed system is characterized by an intuitive graphical user interface for managing fraud detection in both credit-card and bank-account transactions. It makes provision for reporting account fraud and generating reports based on frauds detected by the system, and there are integrable APIs for bank use. The design is presented with standardised visual elements that are universally comprehensible among experts and serve as an abstract representation of the actual proposed web system to enhance fraud detection.
System testing is a software testing phase in which the entire software system is tested as a whole to verify that it works correctly overall. Using the dataset and comparing the model's performance in terms of accuracy, recall, and precision is the general process of testing the model. The first step is to gather data. This data can be gathered from a variety of sources, such as transaction records, and is then pre-processed to remove errors and inconsistencies. The pre-processed data is then used to train a model or set of rules. The model or rule base is then tested to ensure that it is working correctly and predicting or making decisions accurately. The model or rule base can then be refined as new data becomes available.
System Functionalities and Outputs
Here are the essential operations of how the proposed system works to achieve the set objectives. Collect Account Fraud Reports: the form in this page collects reports of fraudulent transactions performed with an account number.
The administrators access the proposed fraud detection system to carry out any administrative operation after the system authenticates them against what it has in its database. This admin must be a banker, for secure access.
The dashboard allows the administrator to verify the accounts reported and approve them as associated with fraud if the report is found to be legitimate. It also allows setting the limit at which an account can be flagged for fraud, based on the banking institution associated with the account number. The system then declares an account as fraudulent once that limit is reached.
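The report-threshold rule described above can be sketched as a small predicate. The exact semantics (flag once the count of admin-verified reports reaches the bank's configured limit) are an assumption about the described behaviour:

```python
def flag_account(verified_report_count, bank_limit):
    """Declare an account fraudulent once the number of admin-verified
    fraud reports against it reaches the limit set for its bank."""
    return verified_report_count >= bank_limit
```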
Credit card datasets are uploaded in .csv format to be used to train the model running in the proposed system. After upload of the credit-card test dataset, the fraud prediction results can be viewed in the web app as well. The admin can upload the test dataset (as single or multiple .csv files) containing credit-card transactions for the trained model in the proposed system to predict which are fraudulent. The validated reported accounts for fraud can be downloaded in .csv format for use as datasets for decision making. The credit-card fraud detection predictions can also be downloaded for the uploaded transactions.
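The upload-and-download flow (a .csv of transactions in, a .csv with a fraud column out) can be sketched with the standard csv module. The column name 'is_fraud' and the toy rule in the test are illustrative assumptions, not the system's actual schema:

```python
import csv
import io

def predict_to_csv(rows, predict, fieldnames):
    """Append an 'is_fraud' column (0/1 from `predict`) to each transaction
    row and return the result as CSV text ready for download."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames + ['is_fraud'],
                            lineterminator='\n')
    writer.writeheader()
    for row in rows:
        row = dict(row)                      # do not mutate the caller's row
        row['is_fraud'] = int(predict(row))
        writer.writerow(row)
    return out.getvalue()
```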
Figure 4.4: Upload and view of the credit-card dataset
CHAPTER FIVE
5.1 CONCLUSION
This project, the design of an enhanced fraud detection system using artificial intelligence and machine learning techniques, provides an easy-to-use, all-in-one system that can detect fraudulent credit-card transactions and accounts marked as fraudulent. A notable strength of the system was its user-centric design, featuring an intuitive interface that allowed people to report potential fraud incidents associated with an account number. This user involvement not only aided fraud prevention but also enriched the system's dataset, contributing to its continuous improvement. The system, covering both credit card and account transactions and integrating machine learning and rules, not only bolstered security but also empowered users to contribute to the protection of their financial assets. The system's potential to reshape fraud prevention in the digital era is considerable.
The proposed system's capacity to accurately discern between genuine and fraudulent transactions makes it a comprehensive solution that empowers both financial institutions and account holders. The integration of user-reported data not only enhances the system's adaptability but also keeps it responsive to emerging fraud patterns.
5.2 RECOMMENDATION FOR FUTURE WORKS
Artificial Intelligence (AI) has emerged as a powerful tool in the fight against fraud, offering sophisticated techniques and applications that enhance fraud detection and prevention efforts. AI techniques such as machine learning, deep learning, and natural language processing are instrumental in detecting and preventing fraud across various sectors and should be further improved. From credit card fraud detection onwards, implementing AI-driven strategies comes with challenges, including data privacy concerns and the quality of training data. As fraud schemes become more sophisticated, AI technologies must evolve to detect and prevent new forms of fraud, and organizations must continually monitor and update their models to ensure that their AI systems remain effective and adaptive to emerging threats. The future of fraud prevention will depend on such continued advancement.
REFERENCES
Abass, T., Itua, E. O., Bature, T., & Eruaga, M. A. (2024). Concept paper: Innovative
A., Salman, A. K., ... & Nasser, A. M. (2023). The role of modern technology
Adekunle, T. S., Alabi, O. O., Lawrence, M. O., Ebong, G. N., Ajiboye, G. O., &
169-178.
Afriyie, J. K., Tawiah, K., Pels, W. A., Addai-Henne, S., Dwamena, H. A., Owiredu,
E. O., ... & Eshun, J. (2023). A supervised machine learning algorithm for
Journal, 6, 100163.
Ahmad, H., Kasasbeh, B., Aldabaybah, B., & Rawashdeh, E. (2023). Class balancing
framework for credit card fraud detection based on clustering and similarity-
15(1), 325-333.
Alarfaj, F. K., Malik, I., Khan, H. U., Almusallam, N., Ramzan, M., & Ahmed, M.
Ali, A., Abd Razak, S., Othman, S. H., Eisa, T. A. E., Al-Dhaqm, A., Nasser, M., ... &
Alkhalil, Z., Hewage, C., Nawaf, L., & Khan, I. (2021). Phishing attacks: A recent
563060.
Bolton, R. J., & Hand, D. J. (2022). Statistical fraud detection: A review. Statistical
Chang, J. W., Yen, N., & Hung, J. C. (2022). Design of a NLP-empowered finance
fraud awareness model: the anti-fraud Chatbot for fraud detection and fraud
Chen, S., Wang, Y., & Lee, C. (2021). Challenges and Countermeasures for
Chogugudza, M. (2022). The classification performance of ensemble decision tree
George, A. S., & George, A. H. (2023). A review of ChatGPT AI's impact on several
23.
Hasan, M. R., Gazi, M. S., & Gurung, N. (2024). Explainable AI in credit card fraud
Hassan, M., Aziz, L. A. R., & Andriansyah, Y. (2023). The role of artificial intelligence
Huang, Z., Zheng, H., Li, C., & Che, C. (2024). Application of machine learning-
Jagatheesaperumal, S. K., Rahouti, M., Ahmad, K., Al-Fuqaha, A., & Guizani, M.
(2021). The duo of artificial intelligence and big data for industry 4.0: Applications,
Johnson, P. S., & Martinez, A. R. (2020). Fraud Detection in Mobile Payment
Krishnan, S., Shashidhar, N., Varol, C., & Islam, A. R. (2022). A novel text mining
Mahalakshmi, V., Kulkarni, N., Kumar, K. P., Kumar, K. S., Sree, D. N., & Durga, S.
Min, W., Liang, W., Yin, H., Wang, Z., Li, M., & Lal, A. (2021). Explainable deep
arXiv:2101.04285.
Intelligence (AI) on the Fraud Detection in the Private Sector in Saudi Arabia.
(100) 472-506
Nagaraju, M., Babu, P. N., Ravipati, V. S. P., & Chaitanya, V. (2024). UPI fraud
Nguyen, D. K., Sermpinis, G., & Stasinakis, C. (2023). Big data, artificial intelligence
Rangaraju, S. (2023). Secure by intelligence: enhancing products with AI-Driven
9(3), 36-41.
Smith, R. L., & Brown, K. P. (2019). Deep Learning Approaches for Improved Fraud
2020-01.
Wang, C., Yang, Z., Li, Z. S., Damian, D., & Lo, D. (2024). Quality assurance for
Whang, S. E., Roh, Y., Song, H., & Lee, J. G. (2023). Data collection and quality
32(4), 791-813.
Williamson, S. M., & Prybutok, V. (2024). Balancing privacy and progress: a review
Yadav, N. S. S., Yadav, P. S., & Goar, V. (2024). Deep learning, neural networks,
method for ensuring data quality for machine learning applications. In 2024
Second International Conference on Emerging Trends in Information
Youssef, B., Bouchra, F., & Brahim, O. (2023, March). State of the art literature on
APPENDIX
# Importing dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import time
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
# Reading CSV
start_time = time.time()
fraud = pd.read_csv('../input/paysim1/PS_20174392719_1491204439457_log.csv')
end_time = time.time()
# Boxplotting with logarithmic y-axis scale
sns.boxplot(data=fraud)
plt.yscale('log') # Set y-axis scale to logarithmic
> Now we will use the info function to see the datatypes and number of instances in
our dataset.
fraud.info()
> Let's use the shape function that returns the shape of our dataset.
fraud.shape
**Insights**
* The dataset consists of **11 columns**.
* We have **5 columns** of float datatype.
* We have **3 columns** of integer datatype.
* We have **3 columns** of object datatype.
* The dataset contains **6362620** rows of data.
# Data Cleaning
It's quite important to have a look at the missing values of the dataset. We will
remove them or interpolate those missing values if we find any. Let's use the
missingno library to do this task.
plt.figure(figsize = (15, 8))
msno.bar(fraud, figsize = (15,5), sort = 'ascending', color = "#896F82")
plt.show()
> It's time to use the duplicated function to make sure whether there are any
duplicated values in the dataset or not.
print('Number of duplicates are : ', fraud.duplicated().sum())
> Let's have a look at the column names to see if they need any correction or not. We
usually do this to see typo errors.
fraud.columns
> Now, let's rename some of the column names by using rename function.
fraud = fraud.rename(columns = {'nameOrig' : 'origin', 'oldbalanceOrg' :
'sender_old_balance', 'newbalanceOrig': 'sender_new_balance', 'nameDest' :
'destination', 'oldbalanceDest' : 'receiver_old_balance', 'newbalanceDest':
'receiver_new_balance', 'isFraud' : 'isfraud'})
> It's also good to drop down the non essential columns from a dataset. We will do
this with the help of drop function.
fraud = fraud.drop(columns = ['step', 'isFlaggedFraud'], axis = 'columns')
> Now it's time to move the target column to our desired location in the dataset.
cols = fraud.columns.tolist()
new_position = 3
cols.insert(new_position, cols.pop(cols.index('destination')))
fraud = fraud[cols]
> By using the head function, let's make sure of the changes we have done so far.
fraud.head()
fraud2 = pd.read_csv('../input/paysim1/PS_20174392719_1491204439457_log.csv')
# Filter the dataset to include only rows where 'isFraud' is equal to 1
fraud_testing = fraud2[fraud2['isFraud'] == 1].copy()
# Define custom colors
custom_palette = ['#1f77b4', '#ff7f0e'] # Blue and orange
plt.figure(figsize=(15, 8))
ax = sns.countplot(data=fraud, x="type", hue="isfraud", palette=custom_palette)
plt.yscale('log') # Use log scale for y-axis
plt.title('Fraudulent vs. Non-Fraudulent Transactions')
# Adding annotations
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()),
                (p.get_x() + 0.01, p.get_height() + 10000))
plt.savefig('fraud_transactions_plot.png')
plt.show()
> Checking the origin from where the transactions were made.
transfer_fraud = fraud[(fraud['type'] == 'TRANSFER') & (fraud['isfraud'] == 1)]
transfer_fraud['origin'].value_counts()
> Checking the destination from where the transactions were cashed out.
cash_out_fraud = fraud[(fraud['type'] == 'CASH_OUT') & (fraud['isfraud'] == 1)]
cash_out_fraud['destination'].value_counts()
> Checking if the transfer and receiving account were same.
fraud_trans = fraud[fraud['isfraud'] == 1]
valid_trans = fraud[fraud['isfraud'] == 0]
**Insights**
* Our fraud transactions are done in **TRANSFER** and **CASH_OUT**
transaction type.
* The fraud transactions in **TRANSFER** were **4097** and **CASH_OUT**
were **4116**.
* The **fraud** transactions were generally made from **Customer to Customer** accounts.
* The accounts used for **receiving and sending were not the same** in case of **fraud transactions**.
# Feature Engineering
> Now we can do feature engineering to make another column that seems to be
helpful for later Machine Learning tasks.
data = fraud.copy()
data['type2'] = np.nan
data.loc[fraud.origin.str.contains('C') & fraud.destination.str.contains('C'), 'type2'] = 'CC'
data.loc[fraud.origin.str.contains('C') & fraud.destination.str.contains('M'), 'type2'] = 'CM'
data.loc[fraud.origin.str.contains('M') & fraud.destination.str.contains('C'), 'type2'] = 'MC'
data.loc[fraud.origin.str.contains('M') & fraud.destination.str.contains('M'), 'type2'] = 'MM'
> Changing the column position for our ease of use.
cols = data.columns.tolist()
new_position = 1
cols.insert(new_position, cols.pop(cols.index('type2')))
data = data[cols]
> Again, dropping the irrelevant columns.
data.drop(columns = ['origin','destination'], axis = 'columns', inplace = True)
> Using the head function to have a new look of the dataset.
data.head()
> Now we are going to see the number of fraud and valid transactions according to the
type 2 that tells if the transaction was done from customer to customer, customer to
merchant, merchant to customer or merchant to merchant.
fraud_trans = data[data['isfraud'] == 1]
valid_trans = data[data['isfraud'] == 0]
print('Number of fraud transactions according to type are below:\n',
fraud_trans.type2.value_counts(), '\n')
print('Number of valid transactions according to type are below:\n',
valid_trans.type2.value_counts())
**Insights**
* At first, we did some feature engineering and introduced a new column **type2**
that contained the type of transaction between **Customers and Merchants**.
* Then, we adjusted the column position and dropped some columns that were no
longer of use.
* The number of **Fraud Transactions** in total were **8213** and were made
from **Customer to Customer**.
* The number of **Valid Transactions** made from **Customer to Customer** are
**4202912**.
* The number of **Valid Transactions** made from **Customer to Merchant** are
**2151495**.
# Data Visualization
fr = fraud_trans.type2.value_counts()
va = valid_trans.type2.value_counts()
plt.figure(figsize=(15, 8))
plt.subplot(1, 2, 1)
sns.barplot(x=fr.index, y=fr.values)
plt.title('Fraud', fontweight="bold", size=20)
plt.subplot(1, 2, 2)
sns.barplot(x=va.index, y=va.values)
plt.title('Valid', fontweight="bold", size=20)
plt.figure(figsize = (15, 8))
ax=sns.countplot(data = data, x = "type")
plt.title('Transactions according to type')
plt.savefig("Transactions according to Type")
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()),
                (p.get_x() + 0.01, p.get_height() + 10000))
plt.figure(figsize=(15,8))
colors = ['#006400','#008000','#00FF00','#2E8B57','#2F4F4F']
plt.pie(data.type.value_counts().values,labels=data.type.value_counts().index, colors
= colors, autopct='%.0f%%')
plt.title("Transactions according to type")
plt.savefig("Transactions according to type")
plt.show()
plt.figure(figsize = (15, 8))
ax=sns.countplot(data = data, x = "type2", hue="isfraud", palette = 'Set1')
plt.title('Transactions according to type2')
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()),
                (p.get_x() + 0.01, p.get_height() + 10000))
plt.figure(figsize=(15,8))
colors = ['#006400','#008000']
plt.pie(data.type2.value_counts().values,labels=data.type2.value_counts().index,
colors = colors, autopct='%.0f%%')
plt.title("Transactions according to type2")
plt.show()
**Insights**
* The first two visualizations contain the number of transactions according to the type of transaction and the sender and receiver type.
* Most common transaction type = **CASH_OUT**.
* Least common transaction type = **DEBIT**.
* Most of the transactions were **Customer to Customer**.
//Data Preprocessing
> Let's create the dummy variables for the later Machine Learning process. I'll use one-hot encoding, as the columns do not have a specific order.
data = pd.get_dummies(data, prefix = ['type', 'type2'], drop_first = True)
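A minimal sketch of what this `get_dummies` call does, on a toy frame with the same kind of categorical columns (the category values here are illustrative):

```python
import pandas as pd

# Toy frame with two categorical columns like 'type' and 'type2'
df = pd.DataFrame({'type': ['CASH_OUT', 'DEBIT', 'CASH_OUT'],
                   'type2': ['CC', 'CM', 'CC']})

# drop_first=True drops one category per column to avoid redundant
# (perfectly collinear) dummy columns
out = pd.get_dummies(df, prefix=['type', 'type2'], drop_first=True)
print(out.columns.tolist())
```

With two categories per column and `drop_first=True`, each column collapses to a single indicator, e.g. `type_DEBIT` and `type2_CM`.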
> Now it's time to split the dataset into training and testing parts and then standardize the features.
X = data.drop('isfraud', axis=1)
y = data.isfraud
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=data.isfraud)
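Because fraud is so rare here, `stratify` matters: it keeps the positive-class proportion identical in the train and test splits. A toy check with a hypothetical 10% positive class:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 90 negatives, 10 positives (10% positive rate)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=0)

# Both splits preserve the 10% positive rate
print(y_tr.mean(), y_te.mean())
```

Without stratification, a random 30% split of a 0.13%-fraud dataset could easily end up with too few fraud cases in the test set to evaluate on.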
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
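Note the asymmetry above: `fit_transform` learns the mean and standard deviation from the training set only, and `transform` reuses those same statistics on the test set. Numerically, that is equivalent to:

```python
import numpy as np

X_tr = np.array([[1.0], [2.0], [3.0], [4.0]])
X_te = np.array([[2.5], [6.0]])

# Statistics come from the training data only (what fit_transform does)
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr_s = (X_tr - mu) / sigma
X_te_s = (X_te - mu) / sigma   # test set reuses the training statistics

print(X_tr_s.mean(), X_te_s[0, 0])
```

Fitting the scaler on the full dataset before splitting would leak test-set information into the model, which is why the scaler is fit after the split.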
//Model Building
> We will now configure 4 ML models and train them. Later we will append each model to a list.
X.head()
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
# Configure RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=20, max_depth=5)
# Configure DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth=5)
# Configure LogisticRegression
logreg = LogisticRegression(max_iter=100, penalty='l2', solver='lbfgs')
# Configure GaussianNB
nb = GaussianNB()
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score

classifiers = [rfc, dtc, logreg, nb]

accuracy_list = []
auc_list = []
precision_list = []
recall_list = []

# Train each classifier and record its metrics on the test set
for clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]
    accuracy_list.append(accuracy_score(y_test, y_pred))
    auc_list.append(roc_auc_score(y_test, y_proba))
    precision_list.append(precision_score(y_test, y_pred))
    recall_list.append(recall_score(y_test, y_pred))

metrics_dict = {
    'accuracy': accuracy_list,
    'auc': auc_list,
    'precision': precision_list,
    'recall': recall_list
}

# Sort each metric's values in ascending order across the classifiers
metrics_dict_sorted = {}
for metric, values in metrics_dict.items():
    order = sorted(range(len(classifiers)), key=lambda k: values[k])
    metrics_dict_sorted[metric] = [values[i] for i in order]
# Collect the per-classifier AUC values before printing them
transposed_metrics = {clf.__class__.__name__: [auc]
                      for clf, auc in zip(classifiers, auc_list)}
for classifier, values in transposed_metrics.items():
    print(f'{classifier}:')
    for metric, value in zip(['auc'], values):
        print(f'{metric}: {value}')
    print()
from sklearn.metrics import roc_auc_score
auc_per_class = {}
> Below, I'll create a function for the visualization part.
from sklearn.metrics import classification_report
for classifier in classifiers:
    y_pred = classifier.predict(X_test)
    print(classifier.__class__.__name__)
    print(classification_report(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print()
def px_bar(x, y, text, title, color, color_discrete_sequence):
    return px.bar(x=x, y=y, text=text, title=title, color=color,
                  color_discrete_sequence=color_discrete_sequence)
> Let's call the function to see the accuracy of each model.
# Map each classifier name to its accuracy, sorted in ascending order
accuracy_dict = {clf.__class__.__name__: acc for clf, acc in zip(classifiers, accuracy_list)}
accuracy_dict_sorted = dict(sorted(accuracy_dict.items(), key=lambda kv: kv[1]))
fig = px_bar(list(accuracy_dict_sorted.keys()), list(accuracy_dict_sorted.values()),
             np.round(list(accuracy_dict_sorted.values()), 3), 'Accuracy score of each classifier',
             list(accuracy_dict_sorted.keys()), px.colors.sequential.matter)
for idx in [2, 3]:
    fig.data[idx].marker.line.width = 3
    fig.data[idx].marker.line.color = "black"
fig.show()
> I'll call the visualization function once again to see the AUC values as well.
# Map each classifier name to its AUC, sorted in ascending order
auc_dict = {clf.__class__.__name__: a for clf, a in zip(classifiers, auc_list)}
auc_dict_sorted = dict(sorted(auc_dict.items(), key=lambda kv: kv[1]))
fig = px_bar(list(auc_dict_sorted.keys()), list(auc_dict_sorted.values()),
             np.round(list(auc_dict_sorted.values()), 3), 'AUC score of each classifier',
             list(auc_dict_sorted.keys()), px.colors.sequential.matter)
fig.show()
//Model Evaluation
> Let's train our best model once again.
rfc = RandomForestClassifier(n_estimators=15, n_jobs=-1, random_state=42)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
rfc_pred_proba = rfc.predict_proba(X_test)[:, 1]
> Printing out the Classification Report.
print(classification_report(y_test, rfc_pred, target_names=['Not Fraud','Fraud']))
> Showing the AUC below to demonstrate the Random Forest performance.
fpr, tpr, _ = roc_curve(y_test, rfc_pred_proba)
auc = round(roc_auc_score(y_test, rfc_pred_proba), 3)
plt.figure(figsize=(15, 7))
plt.plot(fpr, tpr, label='Random Forest Classifier, AUC=' + str(auc),
         linestyle='solid', color='#800000')
plt.plot([0, 1], [0, 1], color='g')
plt.title('ROC Curve')
plt.legend(loc='upper right')
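The AUC printed in the legend is the area under the (fpr, tpr) curve plotted above. With a hypothetical set of ROC points (illustration only, not the model's actual curve), trapezoidal integration gives:

```python
import numpy as np

# Hypothetical ROC points (fpr in ascending order)
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.6, 0.9, 1.0])

# Trapezoidal rule: sum of average segment heights times segment widths
auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
print(auc)  # 0.825
```

The diagonal green line in the plot corresponds to a random classifier, whose area is exactly 0.5; the further the curve bows toward the top-left corner, the closer the AUC gets to 1.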
//Conclusion
The total number of fraud transactions was **8213** out of **6362620**
transactions. These fraud transactions were of type **CASH_OUT** or **DEBIT**,
and all were made from a **Customer to Customer** account. We trained 4
algorithms, and **Random Forest** performed the best among them, achieving an
**AUC score** of **0.9**.