
MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE IN BANKING

AND FRAUD DETECTION

BY

FULL NAME

PSC-MATNO

DEPARTMENT OF COMPUTER SCIENCE, FACULTY OF

PHYSICAL SCIENCES, UNIVERSITY OF BENIN, BENIN CITY,

EDO STATE, NIGERIA.

JANUARY 2025
MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE IN BANKING

AND FRAUD DETECTION

BY

FULL NAME

PSC-MAT NO

A PROJECT REPORT SUBMITTED TO THE DEPARTMENT OF

COMPUTER SCIENCE, FACULTY OF PHYSICAL SCIENCES,

UNIVERSITY OF BENIN, BENIN CITY

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD

OF A BACHELOR OF SCIENCE (B.Sc.) DEGREE IN COMPUTER SCIENCE

JANUARY 2025
CERTIFICATION

This is to certify that this project work was carried out by FULL NAME with Matriculation Number PSC-MATNO under my supervision. It is adequate and satisfactory, both in scope and content, for the award of the Bachelor of Science (B.Sc.) Degree in Computer Science of the University of Benin.

DR. (MR.) E.C. IGODAN DATE


Project Supervisor

APPROVAL

This project work is hereby approved in partial fulfillment of the requirements for the

award of Bachelor of Science (B.Sc.) Degree in Computer Science from the

University of Benin.

PROF. GODSPOWER O. EKUOBASE, PhD DATE

Head of Department

DEDICATION

This project is dedicated to God Almighty for giving me the strength and wisdom to

see it through to completion, and even throughout my stay in the University of Benin

(UNIBEN).

ACKNOWLEDGEMENT

My utmost acknowledgement goes to God Almighty for giving me the strength,

wisdom and direction throughout my academic journey. I would like to express my

gratitude to my project supervisor, DR. (Mr.) E.C. IGODAN for his consistent

guidance towards ensuring the successful completion of this project.

I would also like to specially thank my project coordinator Dr. (Mrs.) A.R. Usiobaifo,

and other lecturers in the Department of Computer Science who I have been

opportune to cross paths with, and have impacted me immensely these past few years:

Prof. G.O. Ekuobase, Dr. F.O. Oliha, Prof. K.C. Ukaoha, Prof. A.A. Imiavan, Prof.

(Mrs.) F. Egbokhare, Prof. (Mrs.) V.V.N. Akwukwuma, Prof. F.I. Amadin, Prof.

(Mrs.) S. Konyeha, Prof. (Mrs.) V.I. Osubor, Dr. (Mrs.) Aziken, Dr. F.O. Chete, Dr.

(Mrs) R.O. Osaseri, Dr. J.C. Obi, Mr. P. E.B. Imiefoh, Mr. I.E. Obasohan, Mr. S.O.P.

Oliomogbe, Mr. K.O. Otokiti, Mr. I.E. Obayagbonna, Mrs. R.I. Izevbizua, Mr. E.C.

Igodan, Miss L.O. Usiosefe, Mr. J. Okhuoya, Prof. F.A.U. Imouokhome, Mrs. J.I.

Adun, Dr. E. Nweli and Mr. D.N. Idehen.

TABLE OF CONTENTS

CERTIFICATION........................................................................................................I

APPROVAL.................................................................................................................II

DEDICATION...........................................................................................................III

ACKNOWLEDGEMENT.........................................................................................IV

LIST OF FIGURES................................................................................................VIII

LIST OF TABLES.....................................................................................................IX

ABSTRACT.................................................................................................................X

CHAPTER ONE...........................................................................................................1

INTRODUCTION........................................................................................................1

1.1 BACKGROUND OF STUDY...........................................................................1

1.2 MOTIVATION OF STUDY..............................................................................3

1.3 RESEARCH AIM AND OBJECTIVES..........................................................4

1.4 METHODOLOGY..............................................................................5

1.5 SIGNIFICANCE OF THE STUDY..................................................5

1.6 CONTRIBUTION TO KNOWLEDGE...........................................5

1.7 DEFINITION OF TERMS.............................................................6

CHAPTER TWO..........................................................................................................7

LITERATURE REVIEW............................................................................................7

2.0 INTRODUCTION TO FRAUD........................................................................7

2.1 ARTIFICIAL INTELLIGENCE IN BANK FRAUD DETECTION.............8

2.2 MACHINE LEARNING AND MACHINE LEARNING TECHNIQUES....9

2.3 APPLICATIONS OF AI IN FRAUD PREVENTION...................................12

2.3.1 ADAPTIVE NEURO FUZZY INFERENCE SYSTEM (ANFIS).......14

2.3.2 FUZZY LOGIC.................................................................................14

2.3.3 ARTIFICIAL NEURAL NETWORK (ANN).......................................15

2.3.4 GENETIC ALGORITHM(GA)..............................................................17

2.3.5 NAIVE BAYES.................................................................................18

2.3.6 SUPPORT VECTOR MACHINE (SVM)..............................................19

2.3.7 DECISION TREE ALGORITHMS LIKE RANDOM FOREST........20

2.4 BENEFITS OF MACHINE LEARNING FOR FRAUD DETECTION.......23

2.5 CHALLENGES IN AI-DRIVEN FRAUD PREVENTION...........................25

2.6 SUMMARY OF REVIEWED LITERATURE...............................................26

CHAPTER THREE...................................................................................................31

METHODOLOGY AND DESIGN...........................................................................31

3.1 OVERVIEW.......................................................................................31

3.2 SYSTEM ANALYSIS.......................................................................................31

3.2.1 OBJECTIVE OF SYSTEM ANALYSIS.................................................31

3.2.2 BENEFITS OF SYSTEM ANALYSIS....................................................32

3.3 ANALYSIS OF EXISTING SYSTEM..............................................................33

3.3.1 GAPS IN EXISTING SYSTEM...............................................................34

3.4 SYSTEM DESIGN.......................................................................................35

3.5 DATA MINING PROCESS............................................................................36

3.6 CRISP-DM APPROACH.....................................................................37

3.7 PROPOSED SYSTEM.....................................................................................38

3.8 METHODOLOGY.......................................................................................39

3.9 MODEL EVALUATION AND COMPARISON...........................................44

CHAPTER FOUR......................................................................................................46

IMPLEMENTATION AND RESULT.....................................................................46

4.1 INTRODUCTION.......................................................................................46

4.2 SYSTEM REQUIREMENTS............................................................................46

4.2.1 HARDWARE REQUIREMENTS.............................................46

4.2.2 SOFTWARE REQUIREMENTS...........................................................47

4.3 MODEL DEVELOPMENT TOOLS................................................................47

4.3.1 CHOICE OF PROGRAMMING LANGUAGE....................................47

4.3.1.2 PYTHON..........................................................................47

4.4 SYSTEM TESTING.................................................................................49

4.4.1 RESULT AND SCREENSHOTS............................................................50

CHAPTER FIVE........................................................................................................53

CONCLUSION, RECOMMENDATION AND FUTURE WORKS.....................53

5.1 CONCLUSION.......................................................................................53

REFERENCES...........................................................................................................55

APPENDIX.................................................................................................................61

LIST OF FIGURES

Figure 2.1 ANFIS architecture 16

Figure 2.2 Flowchart for Genetic algorithm 18

Figure 2.3 Support vector machine 19

Figure 2.4 Random forest 20

Figure 2.5 Decision tree 26

Figure 3.1 Architecture of existing System 34

Figure 3.2 Life Cycle of CRISP-DM 38

Figure 3.3 Architecture of Proposed System 39

Figure 3.4 Feature list of Credit card dataset. 40

Figure 3.5 ANFIS confusion matrix. 45

Figure 3.6 SVM confusion matrix. 45

Figure 3.7 RF confusion matrix. 45

Figure 3.8 LR confusion matrix 45

Figure 4.1 Proposed system architecture 49

Figure 4.2: Fraud report page of the proposed system 50

Figure 4.3: Admin Login page to access the proposed system 51

Figure 4.4 Uploaded and View of credit card dataset 52

LIST OF TABLES

Table 2.1 ML techniques used in financial fraud detection 22

Table 2.2 Summary of reviewed Literature 28

ABSTRACT

This study focuses on using a specific machine learning technique, the Adaptive Neuro-Fuzzy Inference System (ANFIS), to detect bank and credit card fraud. ANFIS is a powerful computational model that combines the capabilities of neural networks and fuzzy logic. It can effectively handle complex and uncertain data, making it suitable for fraud detection. Feature selection is the process of selecting the most relevant and informative features from a dataset; in the context of fraud detection, this means identifying the most important factors or variables that help accurately predict the presence or absence of fraud. The work uses pre-processing, model training, and optimization to produce accurate estimates from data held in a database.

This work explores the potential of using ANFIS and feature selection techniques to enhance the detection of bank fraud, ultimately leading to more effective and timely intervention for bank users targeted by fraudsters. It also introduces a web-based application through which users can easily report instances of fraud, supplying appropriate data to help further train the model.

CHAPTER ONE

INTRODUCTION

1.1 BACKGROUND OF STUDY

The exponential growth of digital transactions has resulted in a surge in financial fraud, which poses significant threats to the global financial ecosystem. This phenomenon has led to the liquidation of many banks in Nigeria and has been the major cause of the ugly development in our banking industry now referred to as 'distress'. Fraudulent activities range from phishing attacks and identity theft to more complex forms of financial fraud such as account takeovers and money laundering. The rapid evolution of these fraudulent activities poses significant challenges for traditional fraud detection and prevention methods, which often struggle to keep pace with the agility and ingenuity of modern cybercriminals (Bello et al., 2024). These developments, which occur frequently in our banks, have cast doubt on the credibility of Nigerians both within and outside the country. The need to combat this double-headed monster by all means has drawn attention to the banking industry, especially in the areas most susceptible to fraud.

In recent years, machine learning techniques have demonstrated remarkable success in various domains, including natural language processing, computer vision, and data analytics. These techniques have the potential to transform fraud detection by enabling the development of more accurate and adaptive systems. Machine learning algorithms can automatically learn patterns and anomalies from vast amounts of transactional data, allowing banks to detect fraudulent activities that may go unnoticed by manual or rule-based systems (Njoku et al., 2024). Fraud is committed in various fields such as insurance, credit cards, telecommunications, and financial services. There are two broad types of fraud, opportunistic and professional, the second of which is committed by organized groups. Although organized fraud is perpetrated less often than opportunistic bank fraud, the majority of revenue outflow (financial losses) is due to these groups. Fraud detection is difficult for several reasons. The first is the high volume of data involved, which is constantly evolving; processing such data demands fast, novel, and efficient algorithms. Moreover, in terms of cost, it is evident that a detailed analysis of every record is prohibitively expensive. Effectiveness is also an issue: many legitimate records exist for every person, and an effective method must still detect the fraudulent records correctly.

Several studies have highlighted the effectiveness of machine learning in fraud detection within the banking sector. Johnson and Martinez (2020) conducted a comprehensive analysis of fraud detection using machine learning algorithms, demonstrating significant improvements in detection accuracy and reduced false positives compared to traditional methods. Similarly, Smith and Brown (2019) investigated the application of deep learning techniques for fraud detection and emphasized the ability of neural networks to capture intricate patterns in transactional data. Furthermore, the rise of big data technologies and cloud computing has facilitated the scalability and efficiency of machine learning algorithms, making them feasible for real-time fraud monitoring in high-velocity banking environments. The integration of various data sources, including transaction histories, customer profiles, and external data feeds, provides a holistic view of customer behaviour, enhancing the accuracy of fraud detection models.

However, it is crucial to address certain challenges associated with implementing a machine learning-based fraud detection system in the banking sector. These challenges include the need for large and diverse labelled datasets for training, the interpretability of complex machine learning models, and the potential for adversarial attacks aimed at deceiving the system. Nevertheless, the banking sector is witnessing a paradigm shift in fraud detection methods, with machine learning emerging as a promising approach to enhance accuracy and adaptability. In conclusion, the digital age has introduced complex fraud challenges that necessitate innovative solutions. Effective fraud prevention strategies are critical for maintaining financial security and trust. Artificial Intelligence stands at the forefront of these efforts, offering powerful techniques and applications to enhance fraud detection and prevention. As we delve deeper into the exploration of AI-driven fraud prevention, it becomes evident that leveraging AI's capabilities is essential for combating the ever-evolving landscape of fraud in the digital era.

This study aims to contribute to the existing body of knowledge by developing and evaluating a robust fraud detection system that leverages machine learning and rule-based techniques to mitigate the risks associated with fraudulent activities in the banking industry.

1.2 MOTIVATION OF STUDY

Machine learning algorithms are hybridized in order to solve complex problems. In machine learning, Artificial Neural Networks and Fuzzy Logic (fuzzy inference) systems have continued to address problems associated with knowledge representation and to build models that can learn and adapt to any situation (Igodan et al., 2022).

The detection of fraud has been taken seriously in recent years. Despite the extensive utilization of data mining algorithms for sorting fraud out, the rise in online transactions has led to an increase in the frequency and complexity of fraudulent activities. Cybersecurity measures are constantly tested by sophisticated schemes that aim to bypass conventional detection methods. Human oversight, while necessary, is no longer sufficient to counteract the sheer volume of these security breaches; it is challenging to keep up with the pace and cunning of modern cyber threats. Banks in Nigeria have for quite some time been the centre of public attention, especially in the area of fraud, which has been on the increase in recent years. Although bank fraud is a global phenomenon, its growth in Nigeria has become outstanding. Effective fraud prevention strategies are crucial for safeguarding financial systems, protecting consumer trust, and ensuring the stability of economic activities. The financial losses associated with fraud can be devastating for both individuals and organizations, leading to significant economic impact and reputational damage (Karpoff, 2021; Mandal, 2023).

In view of this fraudulent practice, artificial intelligence should be used to detect and prevent fraud in the banking industry. This project work strives to build a model that will help protect banks and their customers from falling victim to fraud by making use of machine learning algorithms.

1.3 RESEARCH AIM AND OBJECTIVES

The main aim of this project work is to design a fraud detection system for computerized banking firms using machine learning algorithms. The specific objectives are to:

1. evaluate different machine learning techniques;

2. implement the most suitable machine learning technique using the Python programming language;

3. carry out a case study on bank fraud using data from Kaggle.
1.4 METHODOLOGY

This section outlines the systematic review methodology applied to critically assess

the role of Artificial Intelligence (AI) in fraud detection within financial institutions.

By adhering to a transparent and structured approach, this study ensures that the

selection of documents, the extraction of relevant data, and the synthesis of findings

are both comprehensive and credible (Kitchenham et al., 2020). Systematic review

methodologies provide a robust framework for aggregating knowledge from existing

literature, helping to identify trends, challenges, and gaps in the field. The

methodology described here was designed to answer the study’s key research

questions by focusing on AI's effectiveness in detecting financial fraud and the

challenges of implementing AI-based systems.

This project work will apply three commonly used filtering methods: chi-squared, gain ratio, and information gain. Majority voting will then be applied to finally select the descriptive features. Finally, three supervised machine learning techniques, support vector machine (SVM), random forest (RF), and logistic regression (LR), will be used for comparison and accuracy checking.
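The filtering-and-voting step above can be sketched as follows. This is a hedged illustration using scikit-learn on a synthetic stand-in dataset, not the actual credit card data: chi-squared and mutual information (an information-gain analogue) are available directly, while gain ratio has no scikit-learn implementation, so the ANOVA F-score stands in as the third filter.

```python
# Sketch: filter-based feature selection with majority voting.
# Assumptions: synthetic data stands in for the credit card dataset;
# the ANOVA F-score substitutes for gain ratio (not in scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
X = X - X.min(axis=0)                  # chi2 requires non-negative features

K = 5                                  # top-K features kept by each filter
filter_scores = [
    chi2(X, y)[0],                               # chi-squared statistic
    mutual_info_classif(X, y, random_state=0),   # information-gain analogue
    f_classif(X, y)[0],                          # ANOVA F (gain-ratio stand-in)
]

votes = np.zeros(X.shape[1], dtype=int)
for scores in filter_scores:
    votes[np.argsort(scores)[-K:]] += 1   # one vote per top-K feature

selected = np.where(votes >= 2)[0]        # majority: at least 2 of 3 filters
X_reduced = X[:, selected]
print("selected feature indices:", selected)
```

The reduced matrix would then feed the SVM, RF, and LR models in the comparison step.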

Adaptive Neuro-Fuzzy Inference Systems (ANFIS) are a form of artificial intelligence

model that blends the learning capabilities of neural networks with the interpretability

and rule-based reasoning of fuzzy logic systems. This combined approach makes it

possible for ANFIS to learn from data and to adjust its parameters to enhance its

performance over time; this makes it specifically useful for complex, non-linear

problems where standard fuzzy systems may encounter difficulties.
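The layered structure described above can be sketched as a single forward pass. This is a minimal illustration with invented, untrained parameters (two Gaussian membership functions per input and four first-order Sugeno rules); a real ANFIS would learn both the membership and consequent parameters from data.

```python
# Minimal forward-pass sketch of a first-order Sugeno ANFIS.
# All parameter values below are illustrative placeholders, not trained.
import numpy as np

def gauss(x, c, s):
    """Gaussian membership degree of x for centre c and spread s."""
    return np.exp(-((x - c) ** 2) / (2 * s ** 2))

def anfis_forward(x1, x2, centres, spreads, consequents):
    # Layer 1: membership degrees (two MFs per input)
    mu1 = [gauss(x1, c, s) for c, s in zip(centres[0], spreads[0])]
    mu2 = [gauss(x2, c, s) for c, s in zip(centres[1], spreads[1])]
    # Layer 2: rule firing strengths (product T-norm, 2 x 2 = 4 rules)
    w = np.array([m1 * m2 for m1 in mu1 for m2 in mu2])
    # Layer 3: normalised firing strengths
    w_norm = w / w.sum()
    # Layer 4: first-order consequents f_i = p*x1 + q*x2 + r
    f = np.array([p * x1 + q * x2 + r for p, q, r in consequents])
    # Layer 5: weighted sum gives the crisp output
    return float(np.dot(w_norm, f))

centres = [[0.0, 1.0], [0.0, 1.0]]
spreads = [[0.5, 0.5], [0.5, 0.5]]
consequents = [(1.0, 1.0, 0.0)] * 4    # identical placeholder rules
score = anfis_forward(0.3, 0.7, centres, spreads, consequents)
print(score)  # with identical consequents the output equals x1 + x2
```

Training adjusts the centres, spreads, and consequent coefficients (typically by hybrid least-squares and gradient descent) so that the crisp output separates fraudulent from legitimate records.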

1.5 SIGNIFICANCE OF THE STUDY

The significance of this study lies in the great importance attached to artificial intelligence in the banking sector. From the findings of this research, banks will learn the importance of AI in fraud detection and prevention. This work will be an eye-opener to the public at large. Interested organizations, business centres, institutions and many others, having known the role of machine learning, will make use of artificial intelligence in their organizations in order to curb this fast-growing problem. This study will be of immense benefit to the entire banking sector in Nigeria and to the computer science department, in the sense that it will educate these stakeholders on the types of fraud in the Nigerian banking sector and on how to design a fraud detection system. The study will also serve as a repository of information for other researchers who desire to carry out similar research on the above topic and to contribute to the body of existing literature.

1.6 CONTRIBUTION TO KNOWLEDGE

This project work, when completed, will serve as a second opinion for the banking industry in curbing the ongoing fraud pandemic in Nigeria. It will also be used for the detection and classification of different types of fraud.

1.7 DEFINITION OF TERMS

These are the definitions of terms used in the study.


a. Fraud: Fraud is an intentional act of deceit designed to reward the perpetrator

or to deny the rights of a victim.

b. Artificial Intelligence: Artificial intelligence (AI) is a broad term for a set of

technologies that allow machines to perform tasks that typically require human

intelligence.

c. Banking: Banking is a financial industry that involves the storage and use of

money, and the provision of financial services.

d. Machine Learning: Machine learning (ML) is a branch of artificial intelligence and computer science that focuses on using data and algorithms to enable AI to imitate the way that humans learn, gradually improving its accuracy.

CHAPTER TWO

LITERATURE REVIEW

2.0 INTRODUCTION TO FRAUD

Fraud is an intentional act of deception carried out to secure unfair or unlawful financial or personal gain. It typically involves misrepresentation, concealment, or manipulation of information to deceive individuals, organizations, or systems. There are different types of fraud, including online fraud, identity theft, healthcare fraud, tax fraud and more. The majority of these frauds happen through the banking sector. Bank fraud includes activities such as unauthorized fund transfers or forging checks (Chen et al., 2021).

Artificial Intelligence (AI) has emerged as a game-changer in the realm of fraud

detection and prevention. Leveraging the power of machine learning, data analytics,

and predictive modeling, AI offers sophisticated tools to identify and mitigate

fraudulent activities in real time. The world of fraud detection is changing fast, and

machine learning (ML) is at the forefront of this transformation. Financial institutions,

e-commerce platforms and other industries are turning to ML to tackle fraud more

effectively. For instance, ML algorithms have significantly cut down credit card fraud

in the financial sector by analyzing transaction patterns and spotting anomalies with

remarkable precision. Citibank has slashed phishing attacks by 70% thanks to ML,

while Walmart has reduced shoplifting by 25% through real-time video analysis. In

today's digital age, the proliferation of online transactions, e-commerce, and digital

banking has created new opportunities for fraudsters to exploit vulnerabilities in

financial systems. Cybercrime is on the rise, with increasingly sophisticated schemes

targeting individuals, businesses, and governments (Świątkowska, 2020, Wainwright

& Cilluffo, 2022). These schemes range from phishing attacks and identity theft to

more complex forms of financial fraud such as account takeovers and money

laundering. The rapid evolution of these fraudulent activities poses significant

challenges for traditional fraud detection and prevention methods, which often

struggle to keep pace with the agility and ingenuity of modern cybercriminals.

2.1 ARTIFICIAL INTELLIGENCE IN BANK FRAUD DETECTION

Artificial Intelligence (AI) offers a range of techniques that significantly enhance

fraud detection capabilities. These techniques enable the identification of fraudulent

activities with higher accuracy and efficiency compared to traditional methods. Here,

we explore some of the key AI techniques employed in fraud detection (Hasan, Gazi

& Gurung, 2024, Yalamati, 2023). AI-driven systems can analyze vast amounts of

data at unprecedented speeds, uncovering hidden patterns and anomalies that

traditional methods might overlook. Techniques such as supervised and unsupervised

learning, neural networks, and natural language processing (NLP) enable the

development of advanced fraud detection models that continuously learn and adapt to

emerging threats. By automating and enhancing the accuracy of fraud detection

processes, AI helps organizations stay one step ahead of fraudsters, ensuring more

effective and efficient fraud prevention measures.

Machine Learning (ML) is a subset of AI that focuses on developing algorithms that

allow computers to learn from and make predictions based on data. In fraud detection,

ML techniques are extensively used to identify patterns and anomalies that indicate

fraudulent behavior. In Bank fraud detection and prevention, machine learning is a

collection of artificial intelligence (AI) algorithms trained with historical data to

suggest risk rules. It can then implement the rules to block or allow certain user

actions, such as suspicious logins, identity theft, or fraudulent transactions. When

training the machine learning engine, you must flag previous cases of fraud and non-

9
fraud to avoid false positives and to improve your risk rules’ precision. The longer the

algorithms run, the more accurate the rule suggestions will be.
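As an illustration of training on flagged historical cases, the sketch below fits a random forest (one of the supervised techniques used later in this work) to a synthetic, imbalanced stand-in for labelled transaction data; the features and class balance are assumptions, not real bank records.

```python
# Sketch: supervised training on flagged fraud / non-fraud records.
# Synthetic data stands in for a bank's historical transactions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Roughly 5% of records flagged as fraud (label 1), mimicking imbalance
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.95],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)            # learn risk patterns from flagged history

print(classification_report(y_te, model.predict(X_te)))
```

In a deployed system the fitted model would score incoming transactions and block or escalate those whose predicted fraud probability crosses a risk threshold.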

2.2 MACHINE LEARNING AND MACHINE LEARNING TECHNIQUES

Machine learning, a subset of artificial intelligence, has emerged as a promising

approach to address the limitations of traditional fraud detection methods. By utilising

algorithms that can learn from historical transactional data, machine learning systems

can identify complex patterns and anomalies indicative of fraudulent behaviour.

These systems have the potential to significantly enhance the accuracy and timeliness

of fraud detection, leading to proactive intervention and prevention. However, the

successful implementation of a machine learning-based fraud detection system in the

banking sector presents several challenges. The concept of machine learning algorithms is to learn from the data, identify behaviours and predict the future with minimal human intervention (Nguyen et al., 2023). There are two popular machine learning methods widely used across the globe, which are:

A. Supervised Learning Algorithms: Supervised learning involves training a model

on a labeled dataset, where the input data is paired with the correct output. This

approach is highly effective for fraud detection as it allows the model to learn

from historical data and identify similar patterns in new data. Decision trees are

simple yet powerful models that use a tree-like structure to make decisions based

on the features of the input data. In fraud detection, decision trees can be used to

classify transactions as fraudulent or nonfraudulent by evaluating various

attributes, such as transaction amount, location, and time (Afriyie, et. al., 2023,

Chogugudza, 2022, Karthik, Mishra & Reddy, 2022). Neural networks,

particularly deep neural networks, are capable of learning complex patterns in

large datasets. They consist of multiple layers of interconnected nodes (neurons)

that process and transform the input data. Neural networks are particularly useful

in fraud detection for their ability to capture non-linear relationships and

interactions between features.

B. Unsupervised Learning Algorithms: Unsupervised learning models do not require

labeled data. Instead, they identify patterns and structures in the data based on its

inherent properties. This approach is useful for detecting new and emerging types

of fraud that may not have been previously labeled. Clustering algorithms group

similar data points together based on their features. In fraud detection, clustering

can be used to identify clusters of similar transactions. Transactions that do not fit

into any cluster can be flagged as potential outliers or anomalies, warranting

further investigation (Ahmad, et. al. 2023, Huang, et. al., 2024, Min, et. al.,

2021). Anomaly detection algorithms are designed to identify rare or unusual

patterns that deviate from the norm. These algorithms are particularly effective in

fraud detection, as fraudulent transactions often exhibit anomalous behavior

compared to regular transactions. Techniques such as k-means clustering,

Isolation Forest, and One-Class SVM are commonly used for anomaly detection.
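A minimal sketch of one of the anomaly detection techniques named above, Isolation Forest, follows. The two-dimensional transaction features and the contamination rate are invented for illustration, not a production configuration.

```python
# Sketch: unsupervised anomaly detection with Isolation Forest.
# No fraud labels are used; the model flags points far from the bulk.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=50, scale=10, size=(500, 2))   # typical transactions
outliers = np.array([[400.0, 3.0], [650.0, 2.0]])      # unusual transactions
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)            # 1 = inlier, -1 = flagged anomaly
print("flagged rows:", np.where(labels == -1)[0])
```

Flagged transactions would not be blocked automatically; they warrant the further investigation described above.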

Deep Learning is a subset of machine learning that uses neural networks with many

layers (deep neural networks) to model complex patterns in data. Deep learning

techniques have shown remarkable success in various applications, including fraud

detection. Convolutional Neural Networks (CNNs) are primarily used for image and

spatial data analysis, but they can also be applied to fraud detection by treating

transaction data as multi-dimensional inputs. CNNs use convolutional layers to

automatically extract relevant features from the input data. In fraud detection, CNNs

can be used to analyze transaction sequences and patterns over time. By capturing

spatial relationships within the data, CNNs can detect subtle and complex fraud

patterns that may not be apparent using traditional methods. Recurrent Neural

Networks (RNNs) are designed to handle sequential data and time-series analysis

(Nagaraju, et. al., 2024, Palakurti, 2024, Yadav, Yadav & Goar, 2024). They have the

capability to retain information from previous inputs (using a mechanism called

memory cells), making them suitable for tasks that involve temporal dependencies.

RNNs are particularly useful in fraud detection for analyzing transaction histories and

identifying suspicious patterns over time. For instance, they can detect fraudulent

behaviors that involve a series of transactions across different time periods, which

may be indicative of money laundering or other sophisticated fraud schemes. Natural

Language Processing (NLP) is a field of AI that focuses on the interaction between

computers and human language. NLP techniques are used to analyze and understand

textual data, which is valuable for detecting fraud involving written communication.

NLP techniques can be used to analyze textual data such as emails, chat messages,

and transaction descriptions. By applying text analysis, AI systems can identify

suspicious language patterns, keywords, and phrases that may indicate fraudulent

intent or activity (Adekunle, et. al., 2024, Chang, Yen & Hung, 2022, Krishnan, et.

al., 2022). For example, certain terms and language structures commonly used in

phishing emails can be flagged as potential fraud indicators. NLP can be employed to

scan emails for signs of phishing, social engineering, and other fraudulent schemes.

By analyzing the content, structure, and context of emails, AI systems can detect

attempts to deceive recipients into divulging sensitive information or performing

unauthorized actions. NLP techniques can also be applied to transaction descriptions

to identify unusual or suspicious entries. For example, inconsistencies or anomalies in

transaction descriptions that do not align with typical patterns can be flagged for

further review. Beyond emails, NLP can be used to analyze various forms of

communication, including text messages, social media interactions, and customer

service chats. This helps in identifying fraudulent activities that involve deceptive

communication practices. In conclusion, AI techniques such as machine learning,

deep learning, and natural language processing play a critical role in enhancing fraud

detection and prevention. By leveraging these advanced technologies, organizations

can improve their ability to identify and mitigate fraudulent activities, ultimately

safeguarding financial systems and maintaining trust in the digital age (Bharadiya,

2023, Farayola, 2024, George & George, 2023).

2.3 APPLICATIONS OF AI IN FRAUD PREVENTION

Artificial Intelligence (AI) has proven to be an invaluable tool in the battle against

fraud across various sectors. Its ability to process and analyze vast amounts of data in

real time allows for more effective detection and prevention of fraudulent activities

(Jagatheesaperumal, et. al., 2021, Mahalakshmi, et. al., 2022, Mohammed & Rahman, 2024). Below, we explore some key applications of AI in fraud prevention. AI

systems are capable of monitoring credit card transactions in real time, providing

immediate detection of potentially fraudulent activities. By continuously analyzing

transaction data, AI can identify unusual patterns and behaviors that deviate from a

cardholder’s typical spending habits. For instance, AI algorithms can detect anomalies

such as sudden spikes in transaction amounts, unusual purchasing locations, or rapid

consecutive transactions that are out of character for the user. When such anomalies

are detected, the system can automatically flag the transaction for further investigation

or temporarily halt the transaction to prevent potential fraud. The ML techniques used for financial fraud detection, as presented by Ali, et. al. (2022), are summarized in Table 2.1.

Table 2.1: ML Techniques used for Financial Fraud Detection (Ali, et. al., 2022).

SVM: A classification method used for linear classification.

HMM: A doubly embedded stochastic process used to model more complex random processes.

ANN: A multi-layer network that works in a way similar to human thought.

Fuzzy Logic: A logic in which methods of reasoning are approximate rather than exact.

KNN: Classifies data according to their most similar, closest classes.

Decision Tree: A classification and regression tree method used for decision support.

Genetic Algorithm: Searches for the best solution to a problem among the suggested candidate solutions.

Ensemble: Meta-algorithms that combine multiple intelligent techniques into one predictive technique.

Logistic Regression: Mainly applied to binary and multi-class classification problems.

Clustering: An unsupervised learning method that groups identical instances into the same sets.

Random Forest: A classification method that operates by combining a multitude of decision trees.

Naive Bayes: A classification algorithm that can predict group membership.

2.3.1 ADAPTIVE NEURO FUZZY INFERENCE SYSTEM (ANFIS)

ANFIS stands for Adaptive Neuro-Fuzzy Inference System. This section examines its structure, its learning process, and how it combines fuzzy logic with neural networks for problem-solving. ANFIS is an adaptive system, meaning it can learn and adjust its parameters as new data arrive. Learning proceeds through a hybrid approach that combines the gradient descent used to train neural networks with least-squares estimation.

2.3.2 FUZZY LOGIC

Fuzzy logic is a computational technique used to describe the relationships between information attributes. Its structure allows human reasoning capacities to be applied to artificial knowledge-base structures, and it offers a high level of computational processing (Wu et al., 2011). A fuzzy logic system has four basic processing components: fuzzification, a knowledge base, an inference engine, and defuzzification. The fuzzification process transforms crisp input values into fuzzy sets. Each fuzzy set variable is described by a membership function, which represents the fuzzy set graphically. The knowledge base is a collection of if-then rules supplied by domain experts or the researcher. The inference engine replicates human reasoning by applying fuzzy inference to the inputs and the if-then rules stored in the knowledge base. The final component, defuzzification, converts the fuzzy sets produced by the inference engine back into crisp values.

One advantage of fuzzy logic is the simplicity of the mathematical concepts it applies. Because the modification phase of a fuzzy system is flexible, adding or removing rules is straightforward. On the other hand, fuzzy logic has a number of disadvantages: it still requires expertise to understand and develop a fuzzy system, and developing the fuzzy rules and membership functions is time consuming (Shamim et al., 2011).
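The fuzzification, rule-evaluation, and defuzzification steps described above can be sketched in a few lines of Python. The membership ranges, the three linguistic sets, the rule outputs, and the idea of scoring a transaction amount are all illustrative assumptions, not part of any cited system:

```python
def tri(x, a, b, c):
    """Triangular membership function: peaks at b, zero outside [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_risk(amount):
    """Toy fuzzy system: map a transaction amount (0-100) to a risk score."""
    # Fuzzification: degree of membership in each linguistic set.
    low    = tri(amount, -1, 0, 50)
    medium = tri(amount, 25, 50, 75)
    high   = tri(amount, 50, 100, 101)
    # Knowledge base (if-then rules): low -> risk 10, medium -> 50, high -> 90.
    rules = [(low, 10.0), (medium, 50.0), (high, 90.0)]
    # Defuzzification: weighted average (centroid of singleton rule outputs).
    num = sum(w * out for w, out in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 0.0
```

For example, `fuzzy_risk(0)` gives 10.0 (fully "low") and `fuzzy_risk(50)` gives 50.0 (fully "medium"); intermediate amounts blend the rule outputs smoothly.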

2.3.3 ARTIFICIAL NEURAL NETWORK (ANN)

Artificial Neural Networks are electronic models loosely based on the structure of the brain, which learns mainly from experience. Artificial neural network modelling also offers a less technical way to develop machine solutions. The first artificial neural network, called the Perceptron, was designed in 1958 (Shanmuganathan, & Samarasinghe, 2016). The Perceptron proposed how the human brain might process visual data and learn to recognize items. Since then, the use of neural networks has grown, and hundreds of different models have been developed. These models differ in their functions, accepted values, topology, learning algorithms, and so on. The learning and pattern-matching capabilities of artificial neural networks have allowed them to solve many problems that were difficult or impossible to answer with computational or statistical methods (Yadav, 2015).

The structure of an artificial neural network is inspired by natural neurons. Each neuron takes many possible inputs and, based on the weight values, either passes a signal on to the next neuron or produces a single output. A neuron must be activated before its signal proceeds to the next neuron or to the output. In the activation process, each input is multiplied by a weight value (the strength of the signal) and the results are combined by a transfer function, which must reach a certain threshold. An artificial neural network can have many neurons; it combines the input from every neuron in each intermediate layer to process information and finally reach a single output (Yadav, 2015).

The advantage of an artificial neural network is its ability to solve both linear and nonlinear problems. Neural programs are able to learn, so they do not require reprogramming of the whole system. Artificial neural networks also have disadvantages: large networks require long processing times, and their operation relies heavily on the training process (Rao & Srinivas, 2003).
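The activation process described above, inputs multiplied by weights and combined by a transfer function, can be sketched directly. The sigmoid transfer function and the tiny two-neuron hidden layer below are illustrative choices, not a prescribed architecture:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of inputs passed through a
    sigmoid transfer function, giving an output in (0, 1)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_layer, output_neuron):
    """Tiny feed-forward pass: each hidden neuron fires on the inputs,
    and the output neuron combines the hidden activations."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    w_out, b_out = output_neuron
    return neuron(hidden, w_out, b_out)

# Hypothetical weights; in practice these are learned during training.
score = forward([1.0, 0.5],
                [([0.4, -0.2], 0.1), ([0.3, 0.8], -0.5)],
                ([1.0, -1.0], 0.0))
```

The final `score` could be read as a fraud probability; training (not shown) would adjust the weights and biases to make that reading accurate.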

The Adaptive Neuro-Fuzzy Inference System is an adaptive network that makes use of both an artificial neural network and fuzzy logic. By combining the two methods, the network inherits the benefits and characteristics of both while avoiding the limitations each exhibits when used alone.

Figure 2.1 ANFIS architecture

2.3.4 GENETIC ALGORITHM (GA)

The main idea of GA is to mimic the natural selection and the survival of the fittest.

In GA, the solutions are represented as chromosomes. The chromosomes are

evaluated for fitness values and they are ranked from best to worst based on fitness

value. The process to produce new solutions in GA is mimicking the natural

selection of living organisms, and this process is accomplished through repeated

applications of three genetic operators: selection, crossover, and mutation. First, the

better chromosomes are selected to become parents to produce new offspring (new

chromosomes). To simulate the survival of the fittest, the chromosomes with better

fitness are selected with higher probabilities than the chromosomes with poorer

fitness. The selection probabilities are usually defined using the relative ranking of

the fitness values. Once the parent chromosomes are selected, the crossover operator

combines the chromosomes of the parents to produce new offspring (perturbation of

old solutions) (Voratas, 2012).

Figure 2.2 shows the flowchart for the genetic algorithm: generate an initial population; evaluate and rank individual fitness; if it is time to stop, terminate; otherwise generate a new population through selection, crossover, and mutation, and repeat from the evaluation step.

Figure 2.2 Flowchart for genetic algorithm (GA)
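The selection-crossover-mutation loop can be sketched in a few lines of Python. The real-coded single-gene chromosome, the population size, the averaging crossover, and the toy fitness function are illustrative assumptions, not the only way a GA can be built:

```python
import random

def evolve(fitness, pop_size=20, n_gen=40, seed=1):
    """Minimal real-coded GA: rank-based selection, averaging crossover,
    Gaussian mutation. Maximizes `fitness` over a single real-valued gene."""
    rng = random.Random(seed)
    pop = [rng.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(n_gen):
        pop.sort(key=fitness, reverse=True)        # rank by fitness
        parents = pop[: pop_size // 2]             # survival of the fittest
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)          # selection
            child = (a + b) / 2.0                  # crossover
            child += rng.gauss(0, 0.1)             # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Toy fitness: best solution is x = 7.
best = evolve(lambda x: -(x - 7.0) ** 2)
```

Because the fittest parents are carried over unchanged, the best solution never worsens between generations, and mutation supplies the small perturbations needed to refine it.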

2.3.5 NAIVE BAYES

Naive Bayes is a probabilistic classifier based on Bayes' theorem with an assumption

of independence between features. In the context of wrapper methods for feature

selection, Naive Bayes can be used to evaluate the relevance of subsets of features by

assessing their discriminatory power in classification tasks. Features that contribute

most to the discriminative power of the model are retained, while less informative

features are discarded.
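A minimal from-scratch sketch of the Naive Bayes idea follows: class priors are multiplied by per-feature likelihoods under the independence assumption, with add-one (Laplace) smoothing. The toy binary transaction features and labels are invented purely for illustration:

```python
from collections import defaultdict

def train_nb(samples):
    """Train a Naive Bayes classifier on (features, label) pairs with
    binary features, using Laplace (add-one) smoothing."""
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))
    for feats, label in samples:
        class_count[label] += 1
        for i, v in enumerate(feats):
            feat_count[label][(i, v)] += 1
    total = sum(class_count.values())

    def predict(feats):
        best, best_p = None, -1.0
        for label, n in class_count.items():
            p = n / total                      # prior P(class)
            for i, v in enumerate(feats):      # independence assumption
                p *= (feat_count[label][(i, v)] + 1) / (n + 2)
            if p > best_p:
                best, best_p = label, p
        return best
    return predict

# Toy data: features are (foreign_country, large_amount).
data = [((1, 1), "fraud"), ((1, 0), "fraud"), ((0, 1), "legit"),
        ((0, 0), "legit"), ((0, 0), "legit"), ((1, 1), "fraud")]
predict = train_nb(data)
```

With this data, `predict((1, 1))` returns "fraud" and `predict((0, 0))` returns "legit"; in a wrapper feature-selection setting, the same classifier would be retrained on candidate feature subsets and the subset with the best validation performance retained.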

2.3.6 SUPPORT VECTOR MACHINE (SVM)

Support Vector Machine (SVM) is a classification algorithm that draws a strict border between the positive and negative classes, like a gatekeeper deciding which side each point belongs on. SVM is all about maximizing the distance between data points from different classes. It works by finding a hyperplane, that is, a boundary between the classes, that has the maximum distance from the nearest data points of each class. The hyperplane acts as a threshold for classifying new data points: a point on one side is labeled positive, and a point on the other side is labeled negative.

Figure 2.3 Support vector machine

Support vector machine can be of two types:

Linear SVM: This is used for linearly separable data, which means if a dataset can

be classified into two classes by using a single straight line, then such data is termed

as linearly separable data.

Nonlinear SVM: This is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is known as nonlinear data.
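The hyperplane decision rule described above can be written out directly. The weights, bias, and transaction features below are hypothetical; the margin-maximizing training step that would normally produce them is assumed to have already run:

```python
def svm_decision(x, w, b):
    """SVM decision rule: the sign of the signed distance to the
    hyperplane w.x + b = 0. Training (not shown) chooses w and b
    to maximize the margin between the two classes."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical learned hyperplane; features are (scaled amount,
# scaled distance from the cardholder's home location).
w, b = [0.8, 0.6], -1.0
flagged = svm_decision([2.0, 1.0], w, b)   # large, distant transaction
normal  = svm_decision([0.2, 0.1], w, b)   # small, local transaction
```

Here `flagged` is +1 and `normal` is -1: the hyperplane places the large, distant transaction on the suspicious side. A nonlinear SVM would apply the same rule after a kernel transformation of the features.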

2.3.7 DECISION TREE ALGORITHMS LIKE RANDOM FOREST

Decision tree algorithms, such as Random Forest, are commonly used embedded

feature selection techniques. In Random Forest, feature importance is assessed based

on how much each feature decreases impurity across all decision trees in the

ensemble. Features that consistently contribute to reducing impurity are considered

more important and are retained, while less informative features are pruned from the

model. Random Forest's ability to naturally select relevant features during the training

process makes it an effective embedded feature selection method.

Figure 2.4 Random forest
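The impurity decrease that Random Forest averages across its trees to rank features can be illustrated for a single split. The transaction amounts, labels, and threshold below are toy values chosen so that the split separates the classes perfectly:

```python
def gini(labels):
    """Gini impurity of a set of binary class labels (0 = legit, 1 = fraud)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_pos = labels.count(1) / n
    return 1.0 - p_pos ** 2 - (1.0 - p_pos) ** 2

def impurity_decrease(labels, feature_values, threshold):
    """How much splitting on one feature at `threshold` reduces Gini
    impurity: the parent impurity minus the weighted child impurities."""
    left  = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x >  threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Amounts separate fraud (1) from legitimate (0) perfectly at 100:
amounts = [20, 35, 50, 400, 520, 610]
labels  = [0,  0,  0,  1,   1,   1]
```

Splitting at 100 yields pure children, so the decrease equals the parent impurity of 0.5; a worse threshold such as 40 yields a smaller decrease. Features whose best splits consistently achieve large decreases across the forest are ranked as more important.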

Embedded methods offer computational efficiency compared to wrapper methods

since feature selection is integrated into the model training process, eliminating the

need for external evaluations. However, they may be less flexible in terms of feature

selection strategies compared to wrapper methods. Decision Trees are easy-to-use and

flexible methods for regression and classification. A tree-like structure is obtained by

recursively dividing the dataset into key-criteria subgroups. Leaf nodes are the result

of decisions made at each node.

Decision trees are helpful in decision-making because they are simple to comprehend and illustrate. Because overfitting can occur, pruning the tree enhances its generality. A decision tree can also incorporate utility, resource costs, and the outcomes of random events into the decision-making process.

Figure 2.5 Decision tree

AI leverages advanced pattern recognition techniques to differentiate between

legitimate and fraudulent transactions. Machine learning models are trained on

historical transaction data, learning to recognize the characteristics of both normal

and fraudulent activities (Alarfaj, et. al., 2022, Hilal, Gadsden & Yawney, 2022).

Using supervised learning methods, AI systems can classify transactions based on

known fraud patterns. Meanwhile, unsupervised learning methods, such as

clustering and anomaly detection, are used to uncover new and emerging fraud

patterns that have not been previously identified. This dual approach ensures a

comprehensive fraud detection system that adapts to evolving fraudulent tactics.
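The unsupervised anomaly-detection idea mentioned above can be sketched with a simple statistical rule standing in for a learned model. The spending history, the incoming amounts, and the 3-standard-deviation cutoff are illustrative assumptions; production systems use richer features and learned detectors:

```python
import statistics

def flag_anomalies(history, new_amounts, z_cut=3.0):
    """Unsupervised anomaly check: flag transactions whose amount lies
    more than `z_cut` standard deviations from the customer's mean."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return [abs(a - mu) / sigma > z_cut for a in new_amounts]

# Typical spending is around 40-60; the 900 transaction is out of character.
history = [42, 55, 48, 60, 39, 51, 47, 58]
flags = flag_anomalies(history, [50, 900])
```

Here `flags` is `[False, True]`: the routine 50 passes while the 900 is flagged for review. No fraud labels were needed, which is what lets this style of detector surface previously unseen fraud patterns.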

AI plays a crucial role in anti-money laundering efforts by analyzing transaction

data to detect suspicious patterns indicative of money laundering activities. Machine

learning models can identify complex sequences of transactions that involve

multiple accounts and institutions, which are often used to obscure the origins of

illicit funds (Mishra & Mohapatra, 2024, Youssef, Bouchra & Brahim, 2023, Zhang

& Chen, 2024). By automating the detection process, AI systems can quickly flag

potentially suspicious transactions for further investigation by compliance officers.

This accelerates the identification of money laundering schemes and reduces the

risk of regulatory non-compliance. Regulatory frameworks such as the Financial

Action Task Force (FATF) and the Bank Secrecy Act (BSA) impose stringent

requirements on financial institutions to detect and report money laundering

activities (Gaviyau & Sibindi, 2023, Siddiqui, 2023, Stevens, 2022). AI helps

institutions comply with these regulations by automating the monitoring and

reporting processes. AI-driven AML systems can generate comprehensive reports

on suspicious activities, providing detailed insights into the nature of the

transactions and the entities involved. This not only ensures compliance with

regulatory requirements but also enhances the institution’s ability to respond to

regulatory inquiries and audits efficiently.

Phishing is a prevalent form of online fraud where attackers attempt to deceive

individuals into providing sensitive information, such as login credentials or financial

details. AI-powered systems can analyze emails, messages, and websites to detect

phishing attempts by identifying malicious links, suspicious sender addresses, and

deceptive content (Alabdan, 2020, Alkhalil, et. al., 2021, Jain & Gupta, 2022).

Natural Language Processing (NLP) techniques enable AI to understand and interpret

the context of communications, making it possible to identify phishing attempts with

high accuracy. By integrating AI with email security protocols and web filters,

organizations can significantly reduce the risk of falling victim to phishing attacks.
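A crude sketch of the phrase-flagging idea follows. Real NLP systems learn such patterns from labeled data rather than using a fixed list; the indicator phrases and the scoring rule below are hypothetical examples only:

```python
import re

# Hypothetical indicator phrases of the kind often seen in phishing emails.
SUSPICIOUS = ["verify your account", "urgent action required",
              "click the link below", "confirm your password"]

def phishing_score(text):
    """Fraction of known suspicious phrases that appear in the message;
    a learned text classifier would replace this fixed list in practice."""
    t = re.sub(r"\s+", " ", text.lower())   # normalize case and whitespace
    hits = sum(1 for phrase in SUSPICIOUS if phrase in t)
    return hits / len(SUSPICIOUS)

msg = ("Urgent action required! Please verify your account and "
       "confirm your password within 24 hours.")
```

For this message the score is 0.75 (three of the four phrases match), while an innocuous message scores 0.0; a threshold on the score could trigger quarantine or a warning banner.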

AI enhances cybersecurity protocols by continuously monitoring network traffic and

user behavior to identify potential threats. Machine learning models can detect

unusual activities, such as unauthorized access attempts, data exfiltration, or abnormal

user behavior, which may indicate a security breach. By implementing AI-driven

security solutions, organizations can automate threat detection and response, reducing

the time it takes to identify and mitigate cyber threats (Camacho, 2024, Manoharan &

Sarker, 2023, Rangaraju, 2023). This proactive approach to cybersecurity helps

prevent data breaches, protect sensitive information, and maintain the integrity of IT

systems. The applications of AI in fraud prevention are vast and transformative,

offering sophisticated tools to combat various forms of fraud. From real-time credit

card transaction monitoring and anti-money laundering efforts to enhancing

cybersecurity protocols, AI is instrumental in safeguarding financial systems and

ensuring compliance with regulatory requirements. As fraudsters continue to develop

more advanced tactics, the integration of AI in fraud prevention strategies will remain

essential for organizations to stay ahead of potential threats and protect their assets

(Gupta, 2024, Kotagiri, 2023, Kotagiri & Yada, 2024).

2.4 BENEFITS OF MACHINE LEARNING FOR FRAUD DETECTION

Because machines have a much easier job processing a large dataset than humans,

what you get is the ability to slice and dice huge amounts of information. That means:

1. Faster and more efficient detection: The system gets to quickly identify

suspicious patterns and behaviors that might have taken human agents months to

establish.

2. Reduced manual review time: Similarly, the amount of time spent on manually

reviewing information can be drastically reduced when you let machines analyze all

the data points for you.

3. Better predictions with large datasets: The more data you feed a machine learning

engine, the more trained it becomes. That is to say, while large datasets can

sometimes make it challenging for humans to find patterns, it’s actually the opposite

with an AI-driven system.

4. Cost-effective solution: Unlike hiring more RiskOps agents, you only need one

machine-learning system to go through all the data you throw at it, regardless of the

volume. This is ideal for businesses with seasonal ebbs and flows in traffic,

checkouts, or signups. A machine learning system is a great ally to scale up your

company without increasing risk management costs drastically at the same time.

5. Last but not least, algorithms don’t need breaks, holidays, or sleep. Fraud attacks

can happen 24/7, but even the best fraud managers might come to work on Monday

morning with a backlog of manual reviews. Machines can ease up the process by

sorting through the obviously fraudulent or acceptable cases.

According to a whitepaper by computer scientists from the University of Jakarta,

machine learning algorithms achieved up to 96% accuracy in reducing fraud for

eCommerce businesses.

2.5 CHALLENGES IN AI-DRIVEN FRAUD PREVENTION

Implementing AI-driven fraud prevention strategies comes with its own set of

challenges, ranging from data privacy concerns to the quality of datasets and the

interpretability of AI models. Addressing these challenges is crucial to ensuring the

effectiveness and ethical use of AI in fraud prevention. One of the primary challenges

in AI-driven fraud prevention is ensuring the protection of sensitive data.

Organizations must implement robust data protection measures to safeguard customer

information and comply with privacy regulations such as GDPR, CCPA, and others.

Striking a balance between utilizing data for fraud prevention purposes and respecting

individuals' privacy rights is a significant challenge. Organizations must ensure that

their use of data is transparent, lawful, and proportionate to the goal of preventing

fraud.

The effectiveness of AI models in fraud prevention depends heavily on the quality

and diversity of the datasets used for training (Bao, Hilary & Ke, 2022, Paldino, et.

al., 2024, Whang, et. al., 2023, Yandrapalli, 2024). Organizations must ensure that

their datasets are comprehensive, representative, and free from biases to avoid

misleading or inaccurate results. Biases and inaccuracies in datasets can significantly

impact the performance of AI models. Organizations must identify and address biases

in their datasets to ensure fair and unbiased fraud detection outcomes.

AI models, particularly deep learning models, are often considered "black boxes" due

to their complex decision-making processes. Understanding how these models arrive

at their conclusions is crucial for ensuring transparency and accountability in fraud

prevention. To enhance transparency and trust in AI systems, organizations must

develop techniques for explaining AI decisions in a clear and understandable manner.

This includes providing explanations for why a particular transaction was flagged as

fraudulent and how the AI model arrived at that decision. Overcoming the challenges

associated with AI-driven fraud prevention requires a holistic approach that considers

data privacy, dataset quality, and model interpretability (Sarker, et. al., 2024, Wang,

et. al., 2024, Williamson & Prybutok, 2024). By addressing these challenges,

organizations can harness the power of AI to enhance their fraud prevention efforts

while ensuring compliance with regulations and maintaining trust with customers.

2.6 SUMMARY OF REVIEWED LITERATURE

Table 2.2 Summary of reviewed literature

(Świątkowska, 2020; Wainwright & Cilluffo, 2022)
Objective: An application for credit card fraud detection.
Methods: G-flip algorithm, fuzzy logic.
Data source: Kaggle.
Contribution: 99.55% accuracy on the easy task; 92.35% accuracy on the difficult task.
Limitation: Limited dataset; time consuming.

Sarker, et. al. (2024)
Objective: An application for money laundering detection.
Methods: ANN, MATLAB logic toolbox.
Data source: Kaggle.
Contribution: 98.7% accuracy.
Limitation: Small dataset; accuracy depends on pre-processed data and adequate feature selection.

Williamson & Prybutok (2024)
Objective: An application for novel fraud detection.
Methods: KNN, GA, SVM.
Data source: University of California, Irvine (UCI).
Contribution: 99% accuracy, 1% error, 95% precision, 90% recall, 89% specificity.
Limitation: Nil.

Bao, Hilary & Ke (2022)
Objective: A web-based credit card fraud detection application.
Methods: ANFIS, mean shift clustering algorithm, GCLM.
Data source: Keio Bank Tokyo.
Contribution: Achieves high classification accuracy and specificity compared to previous classification algorithms.
Limitation: Focused on only five types of fraud.

Paldino, et. al. (2024)
Objective: A survey on artificial intelligence based techniques for detection of bank fraud.
Methods: SVM, fuzzy logic.
Data source: Dataset collected from Kaggle.
Contribution: 89% accuracy.
Limitation: Used only a specific set of filter and embedded methods, focusing on a particular fraud dataset.

Whang, et. al. (2023)
Objective: A study on bank fraud detection using a multi-layer neural network.
Methods: ANFIS, MATLAB.
Data source: Collected from Kaggle.
Contribution: 98.7% accuracy.
Limitation: Poor data quality.

Yandrapalli (2024)
Objective: Fuzzy logic and correlation based hybrid on a credit card fraud dataset.
Methods: Neural network, fuzzy inference system.
Data source: Taken from a bank of India.
Contribution: 91.26% accuracy, 98% sensitivity, 89% specificity.
Limitation: Decisions are often hard to understand.

Gupta (2024)
Objective: Artificial intelligence in fraud prevention: exploring techniques and applications, challenges and opportunities.
Methods: ANFIS, GA.
Data source: Dataset consisting of 345 analyzed cases.
Contribution: 98.66% accuracy.
Limitation: Challenges in data quality and interpretability of model decisions.

Kotagiri (2023)
Objective: Review of machine learning approaches to credit card fraud detection.
Methods: Clustering, decision tree.
Data source: Collected from Kaggle.
Contribution: 88% accuracy.
Limitation: Limited effectiveness for rare or less common credit card fraud.

Kotagiri & Yada (2024)
Objective: Role of AI in combating cyber threats in banking.
Methods: Java programming language, MySQL database, fuzzy logic.
Data source: Questionnaire.
Contribution: 82% accuracy.
Limitation: Might be ineffective in detecting similar frauds.

Camacho (2024)
Objective: Fuzzy logic and correlation based hybrid on a bank fraud dataset.
Methods: Fuzzy logic, ANFIS.
Data source: Bank of Baroda.
Contribution: Improved fraud detection accuracy within its data range.
Limitation: Limited dataset.

Manoharan & Sarker (2023)
Objective: Integrating AI in banking and fraud prevention.
Methods: CNN, ConvNet.
Data source: Data uploaded by users through an Android application.
Contribution: Immediate and accurate results in seconds.
Limitation: Poor data quality.

Rangaraju (2023)
Objective: Anomaly detection in credit card transactions using machine learning.
Methods: ANN, SVM.
Data source: Questionnaire and data submitted by users through their mobile phones.
Contribution: 90% accuracy.
Limitation: Poor data quality.

Jagatheesaperumal, et. al. (2021)
Objective: ML approach to fraud detection: a web-based application.
Methods: KNN, GA, K-means clustering.
Data source: University of California, Irvine (UCI machine learning repository website).
Contribution: 97.9% accuracy.
Limitation: Focused only on online fraud.

Mahalakshmi, et. al. (2022)
Objective: Credit card fraud detection using unsupervised learning algorithms.
Methods: ANFIS, CNN.
Data source: Database consisting of 260 analyzed cases.
Contribution: 96% accuracy.
Limitation: Model performance depends on quality of training data; small dataset.

Mohammed & Rahman (2024)
Objective: Detection of credit card fraud with logistic regression and artificial neural network.
Methods: Logistic regression, clustering.
Data source: Database consisting of 200 analyzed cases presented by Punjab National Bank.
Contribution: 88% accuracy.
Limitation: Small dataset.

Karthik, Mishra & Reddy (2022)
Objective: An application for credit card fraud detection.
Methods: KNN, ANN, FFNN, CNN, SVM.
Data source: Kaggle.
Contribution: 96.77% accuracy, 97.45% specificity, 97% sensitivity.
Limitation: Restricted to some particular kinds of bank fraud.

CHAPTER THREE

METHODOLOGY AND DESIGN

3.1 OVERVIEW

A system is made up of several interconnected or interacting parts that follow a set of rules to function as a cohesive whole. A system is characterised by its boundaries, structure, and purpose, and its operation is influenced by its surroundings. Systems are studied in systems theory and other systems sciences (Merriam-Webster, 2017).

3.2 SYSTEM ANALYSIS

System analysis is the process of examining a method or process to determine its

objectives and purpose and then developing procedures and systems that will

effectively accomplish them (Oxford Dictionary, 2015).

Three steps comprise the fundamental process of system analysis:

i. Understand the existing situation

ii. Identify improvements

iii. Define requirements for the new system (Dennis et al., 2012)

3.2.1 OBJECTIVE OF SYSTEM ANALYSIS

One of the key methods for comprehending, assessing, and developing or modifying

systems to achieve certain goals is system analysis. It offers a methodical and

comprehensive perspective. Here is a further explanation of the typical goals of

system analysis:

1. System analysis is useful in figuring out how to design systems where

subsystems may have goals that appear to be at odds with one another.

2. It facilitates the achievement of subsystem interoperability and unity of purpose.

3. It offers a means to build understanding of complex structures.

4. System analysis assists in setting each subsystem in its appropriate context and viewpoint so that the system as a whole can accomplish its goals with the least amount of resources possible. This produces synchronization between the system and its goals.

5. It helps in understanding and comparing the functional impacts of subsystems on the total system.

3.2.2 BENEFITS OF SYSTEM ANALYSIS

System analysis helps in several ways. Here are some benefits of system analysis.

1. Analyzing the plans to be undertaken by any business is very important. There can indeed be no 'perfect path'; still, when the steps to be taken are properly analyzed before implementation, the analysis can prove to be of great benefit, reducing cost, minimizing the chance of fatal errors, preventing the downfall of the business, and reducing the scope for future errors.

2. The fact that system analysis is a relatively easy subject to master is another

crucial feature. This indicates that neither a degree nor any professional expertise is needed. It is simple to teach.

3. In the corporate world, system analysis plays a crucial part in ensuring that

items are created correctly and delivered on time. On-time delivery of products guarantees customer satisfaction and maximizes the capacity of human resources.

4. System analysis greatly improves and simplifies business management. There

is a great chance that the finished items will contain many faults if they are

finalized without analysis. Implementing system analysis results in flexible

software.

3.3 ANALYSIS OF EXISTING SYSTEM

The case study for this project is the work of Njoku et al. (2024). In their approach, they used WEKA to compute information gain. WEKA, written in Java, is open-source machine learning software that provides an environment for calculating information gain; it comprises various machine learning and data-mining methods for data processing, visualization, association, classification, clustering, and regression. They proposed that combining information gain with an adaptive neuro-fuzzy inference system for fraud detection can be an efficient alternative for the banking industry. The dataset and preprocessing methods were detailed: the dataset comprises 15,550 records of credit card fraud, with 5 attributes for each user record. The dataset was preprocessed, and attribute values were normalized to make them suitable for the mining process. The proposed approach outlines the methodology employed, including the application of machine learning algorithms and the use of information gain to reduce the number of features. The process involved dataset collection, preprocessing, feature extraction, and model training using the selected algorithms.

The first step involves data analysis, where every column is examined and the necessary measures are taken for missing values and other forms of dirty data. Outliers and other values that have little impact are dealt with. The pre-processed data is then used to build the classification model: the data is split into two parts, one for training and the remainder for testing. Machine learning algorithms are applied to the training data, where the model learns the patterns in the data; the model then handles test data or new data and classifies whether a transaction is fraudulent or not. The algorithms are compared and their performance metrics are calculated. The classification accuracy of the information gain-ANFIS model was found to be 95.24% for credit card fraud. Figure 3.1 shows the architecture of the existing system.

Figure 3.1 Architecture of the existing system

3.3.1 GAPS IN EXISTING SYSTEM

While the paper reports accuracy, specificity, and sensitivity, it does not mention other important metrics such as the F1-score, ROC-AUC, or precision, which can provide a more balanced view of model performance, especially on imbalanced datasets. The interpretability of ANFIS models can also be challenging: the paper does not explain how the rules generated by the ANFIS system can be interpreted by bankers, which is crucial for the adoption of such systems in operational banking settings. The work builds a classification model to classify whether a credit card transaction is fraudulent or not. The existing system has the following shortcomings:

1. It deals with only one type of fraud detection, namely fraud related to credit-card transactions.

2. The model built is somewhat complex to use, as there is no web-based graphical user interface; inputs must be passed to the model from the terminal.

3. No provision for generating recent data of fraudulent account reports to help detect

and avert fraud in the future.
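The additional metrics noted above can be computed directly from confusion-matrix counts (ROC-AUC additionally requires ranking scores, so it is omitted here). The counts below describe a hypothetical imbalanced test set where accuracy looks excellent while precision, recall, and F1 reveal a much weaker model:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts.
    On imbalanced fraud data, the first three are far more informative
    than accuracy alone."""
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical test set: 990 legitimate, 10 fraud; the model catches
# 6 of the 10 frauds and raises 4 false alarms.
p, r, f1, acc = classification_metrics(tp=6, fp=4, fn=4, tn=986)
```

Here accuracy is 99.2% yet the F1-score is only 0.6, which is exactly why accuracy alone overstates performance on imbalanced fraud datasets.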

3.4 SYSTEM DESIGN

System design is the process of specifying a system's components, interfaces, data,

and architecture to satisfy predefined criteria. It involves figuring out the fundamental

features the system has to have, understanding how different components work

together, and ensuring dependability, performance, and scalability. The design phase

determines the hardware, software, and network infrastructure that will be used in the

system, including the user interface, forms, and reports that will be utilized, as well as

the particular databases, files, and programs that will be required. This phase will

include a brief discussion of the system's basic architecture design, which outlines the hardware, software, and network infrastructure to be utilized. The programming team will be in charge of handling system specifications for implementation, including architecture design, interface design, database modeling, and file specifications. In short, system design is the process of planning a new system to replace the current one, based on an evaluation of the current system, by defining the components or modules that satisfy the specified requirements.

3.5 DATA MINING PROCESS

The process of sifting through massive data sets to find links and patterns that may be

used to address business problems through data analysis is known as data mining.

Businesses can forecast future trends and make more informed business decisions by utilizing data mining techniques and technologies. One of the fundamental subfields of data science, data mining employs sophisticated analytical methods to extract

valuable insights from large data sets. More specifically, data mining is a phase of the

data science methodology known as knowledge discovery in databases (KDD), which

collects, processes, and analyzes data. Although KDD and data mining are sometimes

used interchangeably, they are more frequently understood to be two different

concepts. Effective data collection, warehousing, and processing are prerequisites for

the data mining process. Discovering more about a user base, predicting results,

identifying fraud or security problems, describing a target data collection, and

identifying bottlenecks and dependencies are just a few uses for data mining.

Moreover, it can be carried out automatically or semi-automatically. "One of the most difficult problems facing data mining is the complex issue of data quality" (Mannila and Heikki, 2019). The Cross-Industry Standard Process for Data Mining (CRISP-DM) is the data mining framework that was used in this study.

3.6 CRISP-DM APPROACH

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is

a cyclical process that provides a structured approach to planning, organizing,

and implementing a data mining project. The process consists of six major

phases:

a. Business Understanding: The goals of the data mining project are defined in relation to business objectives at this stage. Stakeholder requirements and constraints are identified, ensuring alignment between data mining activities and organizational objectives.

b. Data Understanding: Here, the data scientist gathers the initial set of data and becomes acquainted with it. Important tasks include collecting preliminary data, describing the data, exploring the data, and verifying data quality.

c. Data Preparation: Data is rarely clean. This phase is dedicated to cleaning and transforming raw data into a suitable format for modeling. Key tasks include selecting data, cleaning data, constructing data (feature engineering), and integrating data.

d. Modeling: Once clean data is obtained, different modeling approaches are applied. It is common to return to the data preparation stage, since different methods may require different data formats. Choosing modeling techniques, designing tests, constructing the model, and assessing the model are the key activities.

e. Evaluation: Models are evaluated after training in order to determine how

effective and generalizable they are. Holdout samples and cross-validation are

employed as validation techniques to assess the model's correctness,

dependability, and robustness.

f. Deployment: Putting the model into a real-world setting is the final step. This could be as simple as creating a report or as complex as putting in place a repeatable data mining procedure. Planning the deployment, monitoring and maintenance, reviewing the project, and closing the project are the key tasks.

Figure 3.2 Life Cycle of CRISP-DM

3.7 PROPOSED SYSTEM

The proposed system compares different methods to recommend which is better for the prediction of credit card fraud, enhancing the work of Njoku et al. (2024). The proposed system utilizes machine learning, which relies on the

accumulation of extensive historical data through data gathering. This data collection

encompasses both sufficient historical data and raw data. However, raw data cannot

be employed directly without undergoing data pre-processing. It is during this pre-

processing stage that raw data is refined to a usable state. Subsequently, an

appropriate algorithm is chosen along with a model. In the context of detecting credit

card fraud transactions using real datasets, supervised machine learning algorithms

such as logistic regression played a vital role. The algorithm built a classification

framework using machine learning methods. The model is then subjected to training

and testing to ensure accurate predictions with minimal errors. Periodic tuning of the

model further enhances its accuracy, carried out at intervals to continually refine its

performance. Also, data gathered from account fraud reports are part of the collected

bank data, which are verified by human experts. Then, rules that consider set

thresholds for fraud detection are applied to test accounts, and a decision is made (which validates whether an account is associated with fraud or not). Figure 3.3 shows the proposed system.

Figure 3.3 Architecture of Proposed System

3.8 METHODOLOGY

In this section, the methods used to achieve the results, following the CRISP-DM approach discussed earlier, are stated as follows.

A. DATA ACQUISITION:

A single dataset was used for this analysis. The dataset was obtained from the Kaggle machine learning repository. This data set includes all transactions recorded over the course of two days. As described in the dataset, the features are scaled and the names of the features are not shown for privacy reasons. The dataset consists of numerical values from 28 Principal Component Analysis (PCA)-transformed features, named V1 to V28. Furthermore, no metadata about the original features is provided, so pre-analysis or feature study could not be done. There are 284,807 records. The only information available is that the unnamed columns have already been scaled.

B. DATA ANALYSIS:

An integral part of any data analysis is exploratory data analysis, or EDA. Exploratory analysis's main goal is to look for distribution patterns, anomalies, and outliers in the data in order to guide more focused testing of a hypothesis. Additionally, it offers resources for formulating hypotheses by interpreting and visualizing the data, typically through graphical representation. To gain insights into the relationships between the features and the target variable, the script generates bar plots using Matplotlib and Seaborn. Figure 3.4 describes the dataset: the mean, standard deviation, minimum, maximum, and percentile values of the data set.

Figure 3.4 Feature list of credit card dataset.
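As a sketch of this EDA step, the snippet below computes the same summary statistics and class distribution on a small synthetic stand-in for the credit-card data; the real Kaggle file, its anonymized V1 to V28 columns, and its actual fraud rate are not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle credit-card data: the real file has
# PCA-transformed features V1..V28, an Amount column, and a Class label.
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["V1", "V2", "V3"])
df["Amount"] = rng.uniform(0, 500, size=n).round(2)
df["Class"] = (rng.uniform(size=n) < 0.02).astype(int)  # roughly 2% fraud

# Summary statistics: mean, std, min, max, and percentiles per column
summary = df.describe()
print(summary)

# Class distribution - fraud datasets are heavily imbalanced
counts = df["Class"].value_counts()
print(counts)
fraud_rate = counts.get(1, 0) / n
print(f"Fraud rate: {fraud_rate:.2%}")
```

On the real dataset the same `describe()` and `value_counts()` calls reveal the severe imbalance (far fewer fraudulent than legitimate records), which motivates the metrics beyond plain accuracy discussed later.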

C. DATA PREPROCESSING:

It is necessary to clean and transform the data before applying any classification techniques to it. This phase is referred to as pre-processing. Preprocessing involves a number of steps, such as handling missing values, eliminating noisy data such as outliers, normalizing, and balancing unbalanced data. The script loads the necessary libraries for data manipulation, visualization, and machine learning tasks, including NumPy, pandas, Matplotlib, seaborn, scikit-learn, and ANFIS. The feature data (X) is examined, and its information and descriptive statistics are displayed to gain an understanding of the data set. To handle missing values, the script employs the 'fillna' function from pandas, replacing the missing values with zeros. The target variable (y) is extracted from the original dataset as a separate variable named Class. The data set is split into training and testing sets using the 'train_test_split' function from scikit-learn, with a test set size of 20% and a random state of 42 for reproducibility. Feature scaling was then performed using the 'MinMaxScaler' from scikit-learn, ensuring all features are on a similar scale and preventing one feature from dominating another. Minimum-maximum normalization was used to increase the efficiency and effectiveness of the algorithm.
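The preprocessing steps above can be sketched as follows; the data here is synthetic, and the column names are placeholders rather than the actual dataset's.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["V1", "V2", "V3", "Amount"])
X.iloc[0, 0] = np.nan                      # inject a missing value for illustration
y = (rng.uniform(size=200) < 0.05).astype(int)

X = X.fillna(0)                             # replace missing values with zeros

# 80/20 split with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Min-max normalization: fit on the training set only, then transform both,
# so no information from the test set leaks into the scaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.min(), X_train_scaled.max())
```

Fitting the scaler on the training set and only transforming the test set is the standard discipline; test-set values can fall slightly outside [0, 1], which is expected.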

D. FEATURE SELECTION:

In this project, after data preprocessing, a hybrid feature selection method was used which comprises three filter-based selection methods: chi-squared, information gain, and gain ratio, to obtain the optimal dataset.

I. Chi squared: When determining the degree of independence between two

variables, a score is computed using the chi-squared (χ2) statistic. The

independence of features with regard to the class is measured in feature

selection using χ2. χ2 computes a score based on the initial premise that the

feature and the class are independent. A high-dependency connection is

present when a score has a substantial value. (Opeyemi et al., 2016).

II. Information Gain : Information gain is one of the filter feature selection

techniques used to extract pertinent properties from a set of features. When the

value of a feature is unknown, IG reduces the uncertainty involved in

determining the class attribute. Information theory serves as the foundation for

this method, which ranks and chooses the best features to minimize feature

size prior to the commencement of the learning process. Before ranking, the

entropy value of the distribution is calculated to ascertain the degree of

uncertainty associated with each attribute based on its significance in

distinguishing across classes. The entropy of the distribution, sample entropy,

or predicted model entropy of the dataset determines the degree of uncertainty.

(Opeyemi et al., 2016).

III. Gain ratio: The gain ratio was included to correct IG's bias in favor of features with many distinct values. The gain ratio shows a large value when the data are evenly distributed and a small value when the data are all concentrated in one branch of the attribute. It evaluates an attribute based on the number and size of branches and adjusts IG by taking intrinsic information into account. The intrinsic information of a given feature can be determined from the entropy distribution of the instance values. The gain ratio can be computed between a feature value (y) and a given feature (x). (Opeyemi et al., 2016).

The script employs the three feature selection methods listed above. The 'SelectKBest' class from scikit-learn is used to compute the scores for each feature selection method, and the scores are displayed.
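A minimal sketch of this scoring step is given below. It assumes scikit-learn's `SelectKBest` with `chi2` and with `mutual_info_classif` as an approximation of information gain; scikit-learn has no built-in gain ratio, so only two of the three filters are illustrated, on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic classification data standing in for the scaled credit-card features
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

# Score every feature with each filter method (k="all" keeps them all,
# so we can inspect the scores rather than select immediately)
chi2_scores = SelectKBest(chi2, k="all").fit(X, y).scores_
ig_scores = SelectKBest(mutual_info_classif, k="all").fit(X, y).scores_

for i, (c, g) in enumerate(zip(chi2_scores, ig_scores)):
    print(f"feature {i}: chi2={c:.3f}  info_gain={g:.3f}")
```

Each method produces one score per feature; these per-method score vectors are the inputs to the voting scheme described next.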

E. VOTING SCHEME

A majority voting ensemble is an ensemble machine learning model that combines the predictions from multiple models. It is a technique that may be used to improve model performance, ideally achieving better performance than any single model used in the ensemble. There are two types of majority vote prediction for classification:

I. Hard voting: the class with the largest sum of votes from the models is predicted.

II. Soft voting: the class with the largest summed probability from the models is predicted.

In this project, hard voting was used as the preferred voting scheme, where a feature is selected if its score is above the threshold (the mean of all scores) in at least two scoring methods. The selected features are used to create new training and testing sets for subsequent model training and evaluation.
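The hard-voting rule described above can be sketched as follows, using hypothetical scores for six features: a feature earns one vote per method whose mean it beats, and is kept when it has at least two votes.

```python
import numpy as np

# Hypothetical scores for 6 features from the three filter methods
scores = {
    "chi2":       np.array([5.1, 0.4, 3.9, 0.2, 2.8, 0.1]),
    "info_gain":  np.array([0.30, 0.02, 0.25, 0.01, 0.20, 0.03]),
    "gain_ratio": np.array([0.40, 0.05, 0.10, 0.02, 0.35, 0.04]),
}

# A feature gets one vote per method where its score is above that
# method's mean; keep features with at least two votes (hard voting).
votes = sum((s > s.mean()).astype(int) for s in scores.values())
selected = np.where(votes >= 2)[0]
print("votes:", votes)                       # -> [3 0 2 0 3 0]
print("selected feature indices:", selected) # -> [0 2 4]
```

The selected indices would then be used to slice the training and testing matrices before classification.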

F. CLASSIFICATION AND ENSEMBLE ALGORITHM USED

After the optimal dataset is obtained via the hybrid feature selection process, classification algorithms are used to process the sub-dataset. For the ANFIS algorithm, the script defines the number of membership functions and the standard deviation of the membership functions used. These parameters influence the shape and behavior of the membership functions, which play a crucial role in the fuzzy inference process. For each feature, the script initializes Gaussian membership functions with random means and the specified standard deviation. The membership functions are used to map the input values to corresponding fuzzy sets. The script invokes the 'trainHybridJangOffline' method from the anfis library to train the ANFIS model. The training process is carried out for a specified number of epochs (20), allowing the model to iteratively learn from the training data and refine its parameters.

3.9 MODEL EVALUATION AND COMPARISON

The ANFIS structure consists of some nodes in different layers which are connected

to each other. The output of this network depends on the tunable parameters of these

nodes. The network learning rules determine the parameters’ updating method for

minimizing the error. A fuzzy inference system is a framework based on fuzzy theory and If-Then rules. The ANFIS structure has three main elements: a rule base, a database, and a reasoning mechanism. The learning algorithm is used to tune all the tunable parameters (premise and consequent parameters) and to obtain the output parameter values of the ANFIS that are adapted to the training data.

After training, the script utilizes the trained ANFIS model to make predictions on the test data using the 'predict' function from the anfis library. To assess the model's performance, the script calculates the accuracy of the ANFIS model by comparing its predictions with the true labels in the test set. This is achieved using the 'accuracy_score' function from scikit-learn. A report is generated using the 'classification_report' function, providing valuable metrics such as precision, recall, and F1-score for each class in the classification task. The script also trains and evaluates three additional models: support vector machine, random forest, and logistic regression. For each model, the classification report is displayed, enabling a direct performance comparison. This section also contains the confusion matrices of the classifiers of the proposed model.
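A sketch of this comparison step is shown below, restricted to the three scikit-learn models on synthetic imbalanced data; the ANFIS model is omitted here because its library exposes a different API.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced synthetic data (roughly 10% positive) standing in for fraud labels
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

accuracies = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)
    print(f"=== {name} (accuracy {accuracies[name]:.3f}) ===")
    print(classification_report(y_test, y_pred))   # precision, recall, F1 per class
    print(confusion_matrix(y_test, y_pred))        # rows: true, cols: predicted
```

The per-class precision, recall, and F1-score from `classification_report`, together with the confusion matrices, give the balanced view of performance that raw accuracy alone cannot provide on imbalanced fraud data.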

Figure 3.5 ANFIS confusion matrix. Figure 3.6 SVM confusion matrix.

Figure 3.7 RF confusion matrix. Figure 3.8 LR confusion matrix.

CHAPTER FOUR

IMPLEMENTATION AND RESULT

4.1 INTRODUCTION

System implementation is the process of putting the trained model into use and

integrating it into the production environment. It involves making sure the model

works well, integrates with other systems, and meets performance requirements.

Scalability, dependability, and real-time inference capabilities are frequently taken

into account. This chapter focuses on creating models with the Python programming

language, recording the steps involved, and providing the outcomes.

4.2 SYSTEM REQUIREMENTS

System requirements refer to the settings that a system must have in order to function smoothly, effectively, and predictably, with the understanding that failing to fulfill these requirements might affect performance.

4.2.1 HARDWARE REQUIREMENTS

A good laptop with the following hardware requirement is needed to run the model

conveniently:

a. Processor: Intel i3, i5, i7, or above

b. RAM: 8GB

c. Storage: 120GB

d. Processor speed: 2.2 GHz and above

e. GPU: A GPU is needed to enhance the model's speed, as its multiple processing cores can be leveraged during the training phase.

f. Internet: a stable internet connection from a reliable internet service provider (ISP), because some of the software used is required to run online.

4.2.2 SOFTWARE REQUIREMENTS

Software refers to a collection of instructions, data, or programs that tell a computer

how to perform specific tasks or operations. The minimum software requirements

needed to run the model are:

a. Operating System: Windows 10

b. Development Environment: Jupyter Notebook

4.3 MODEL DEVELOPMENT TOOLS

Model development tools are software platforms or frameworks specifically designed

to facilitate the creation, training, testing, and deployment of machine learning

models. The major software tools used in the development of the model are given in

the next section.

4.3.1 CHOICE OF PROGRAMMING LANGUAGE

The choice of programming language for development depends on various factors, including the project requirements, the programmer's familiarity with the language, and performance considerations. For this project, Python is used.

4.3.1.2 PYTHON
Python is a high-level, general-purpose programming language with powerful data types. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together. It is widely regarded as the de facto language for machine learning and data science due to its simplicity, readability, extensive library support, and vibrant community. Popular libraries such as TensorFlow, PyTorch, scikit-learn, and Keras offer comprehensive tools for building and training machine learning models. Some of the Python libraries utilized in this work include:

a. NumPy: NumPy is the core module for numerical computing in Python. It supports large, multidimensional arrays and matrices and offers a set of mathematical functions to work with them efficiently.

b. Matplotlib: Matplotlib is a comprehensive library for creating static,

interactive, and animated visualizations in Python. It is widely used for

generating a wide range of plots and charts, including line plots, scatter plots,

bar plots, histograms, heatmaps, and more.

c. Pandas: Pandas is a Python library widely used for data manipulation and

analysis. It provides high-performance, easy-to-use data structures and tools

for working with structured data, making it essential for tasks like data

cleaning, transformation, and exploration.

d. Seaborn: Seaborn is used for data visualization. It is a high-level interface based on Matplotlib. It can be used in all sorts of data analysis tasks that require visualizing data and inferring information from it.

e. Scikit-learn: Scikit-learn, sometimes known as sklearn, is a well-known

Python machine learning package that offers easy-to-use tools for data mining

and analysis. It is a component of the larger Python scientific computing

ecosystem because it is developed on top of other Python libraries such as

NumPy, SciPy, and Matplotlib. For classification, regression, clustering,

dimensionality reduction, model selection, and preprocessing, it provides a

large array of machine learning techniques.

f. Django: Django streamlines the fraud detection system's design through rapid development, pragmatic design, and reusability. It provides a consistent Python environment for integrating advanced techniques, and its administrative interface aids in the efficient management of flagged transactions and rules.

The proposed system is characterized by an intuitive graphical user interface for managing both credit-card and bank account transaction fraud detection; it makes provision for reporting account fraud and generating reports based on frauds detected by the system; and it offers integrable APIs that banks can use to enhance fraud detection in their existing systems. This encompasses standardised visual elements that are universally comprehensible among experts and serve to represent systems in diverse modes of utilisation and implementation. Here is the abstract representation of the actual proposed web system to enhance fraud detection as presented in this research project.

Figure 4.1 Proposed system architecture

4.4 SYSTEM TESTING

System testing is a software testing phase in which the entire software system is

assessed to confirm that it satisfies predetermined specifications and operates

correctly overall. Using the dataset and comparing the model's performance in terms

of accuracy, recall, and precision is the general process of testing the model.

Machine learning algorithms and rule-based approaches require a lot of historical data. This data can be gathered from a variety of sources, such as transaction records, reports, etc. Once the data is gathered, it needs to be pre-processed and verified to remove errors and inconsistencies. The pre-processed data is then used to train a model or set rules. The model or rule base is then tested to ensure that it is working correctly and predicting or making decisions accurately. The model or rules can be tuned (adjusted) over time to improve accuracy.

4.4.1 RESULT AND SCREENSHOTS

System Functionalities and Outputs: here are the essential operations of how the proposed system works to achieve the set objectives.

Collect Account Fraud Reports

Figure 4.2 Fraud report page of the proposed system

The form on this page collects fraud transaction reports performed with an account number, as reported by users or individuals, and stores them in the system database.

The administrators access the proposed fraud detection system to carry out any operation. When an administrator enters a username and password, the system automatically authenticates them against its database. The admin must be a banker for secure access.

Figure 4.3 Admin login page to access the proposed system

The dashboard allows the administrator to verify the reported accounts and approve them as associated with fraud if the report is found to be legitimate. It also allows setting the limit above which an account can be flagged for fraud, based on the banking institution associated with the account number. An account is then declared fraudulent once it reaches the said threshold, and not otherwise.

The credit card dataset is uploaded in .csv format to be used to train the model running in the proposed system. After uploading the credit-card test dataset, the fraud prediction results can be viewed in the web app as well. The admin can upload the test dataset (in single or multiple CSV files) containing credit-card transactions for the trained model in the proposed system to predict which transactions are fraudulent. The validated fraud-reported accounts can be exported in .csv format and used as datasets for decision making. The credit-card fraud detection predictions of transactions can also be downloaded.

Figure 4.4 Uploaded and view of credit-card dataset

CHAPTER FIVE

CONCLUSION, RECOMMENDATION AND FUTURE WORKS

5.1 CONCLUSION

This project, the design of an enhanced fraud detection system using artificial intelligence and machine learning techniques, provides an easy-to-use, all-in-one system that can detect fraudulent credit-card transactions and accounts marked for fraudulent activities. Through a combination of cutting-edge machine learning algorithms and a rule-based approach, the system effectively differentiated between legitimate and fraudulent transactions in a financial ecosystem. An integral aspect of the system

was its user-centric design, featuring an intuitive interface that allowed people to

report potential fraud incidents associated with an account number. This user

engagement mechanism not only empowered individuals to play an active role in

fraud prevention but also enriched the system's dataset, contributing to its continuous

improvement. The research successfully yielded a sophisticated fraud detection system for credit card and account transactions, integrating machine learning, rule-based detection, user engagement, and streamlined backend processing. This comprehensive approach

not only bolstered security but also empowered users to contribute to the protection of

their financial assets. The system's potential to reshape fraud prevention in the digital

age underscores its significance in safeguarding the integrity of financial transactions.

The proposed system's capacity to accurately discern between genuine and fraudulent

transactions, coupled with its user-friendly reporting interface, presents a

comprehensive solution that empowers both financial institutions and account holders.

The integration of user-reported data not only enhances the system's adaptability but

also strengthens the collaborative effort in combating fraud.

5.2 RECOMMENDATION FOR FUTURE WORKS

Artificial Intelligence (AI) has emerged as a powerful tool in the fight against fraud,

offering sophisticated techniques and applications that enhance fraud detection and

prevention efforts. AI techniques such as machine learning, deep learning, and natural

language processing are instrumental in detecting and preventing fraud across various

sectors and should be furture improved. From credit card fraud detection to

cybersecurity threats, AI offers versatile solutions for combating fraudulent activities.

Proactive fraud prevention strategies, such as predictive analytics and real-time

monitoring systems, are essential for staying ahead of fraudsters. By forecasting

potential fraud hotspots and implementing preventative measures, organizations can

reduce the risk of fraud occurring. Implementing AI-driven fraud prevention

strategies comes with challenges, including data privacy concerns, the quality of

datasets, and model interpretability. Addressing these challenges is crucial for

ensuring the ethical use and effectiveness of AI in fraud prevention. Continuous

innovation in AI is essential for staying ahead of evolving fraud tactics. As fraudsters

become more sophisticated, AI technologies must evolve to detect and prevent new

forms of fraud. By investing in research and development, organizations can ensure

that their AI systems remain effective and adaptive to emerging threats. The future of

AI in fraud prevention looks promising, with advancements in machine learning, deep

learning, and the integration of AI with emerging technologies. AI is expected to play

an increasingly important role in fraud prevention across various sectors, expanding

beyond financial services to areas such as healthcare, retail, and telecommunications.

REFERENCES

Abass, T., Itua, E. O., Bature, T., & Eruaga, M. A. (2024). Concept paper: Innovative

approaches to food quality control: AI and machine learning for predictive

analysis. World Journal of Advanced Research and Reviews, 21(3), 823-828.

Abdullah, A. M., Mousa, A. A., Abdulrahman, A. M., Mesfer, A. N., Mohammed, A.

A., Salman, A. K., ... & Nasser, A. M. (2023). The role of modern technology

in preventing and detecting accounting fraud. International Journal of

Multidisciplinary Innovation and Research Methodology, 2(2), 1-10.

Adekunle, T. S., Alabi, O. O., Lawrence, M. O., Ebong, G. N., Ajiboye, G. O., &

Bamisaye, T. A. (2024). The use of AI to analyze social media attacks for

predictive analytics. Journal of Computing Theories and Applications, 2(2),

169-178.

Afriyie, J. K., Tawiah, K., Pels, W. A., Addai-Henne, S., Dwamena, H. A., Owiredu,

E. O., ... & Eshun, J. (2023). A supervised machine learning algorithm for

detecting and predicting fraud in credit card transactions. Decision Analytics

Journal, 6, 100163.

Ahmad, H., Kasasbeh, B., Aldabaybah, B., & Rawashdeh, E. (2023). Class balancing

framework for credit card fraud detection based on clustering and similarity-

based selection (SBS). International Journal of Information Technology,

15(1), 325-333.

Alabdan, R. (2020). Phishing attacks survey: Types, vectors, and technical

approaches. Future Internet, 12(10), 168.

Alarfaj, F. K., Malik, I., Khan, H. U., Almusallam, N., Ramzan, M., & Ahmed, M.

(2022). Credit card fraud detection using state-of-the-art machine learning

and deep learning algorithms. IEEE Access, 10, 39700-39715.

Ali, A., Abd Razak, S., Othman, S. H., Eisa, T. A. E., Al-Dhaqm, A., Nasser, M., ... &

Saif, A. (2022). Financial fraud detection based on machine learning: a

systematic literature review. Applied Sciences, 12(19), 9637

Alkhalil, Z., Hewage, C., Nawaf, L., & Khan, I. (2021). Phishing attacks: A recent

comprehensive study and a new anatomy. Frontiers in Computer Science, 3,

563060.

Bello, O. A., & Olufemi, K. (2024). Artificial intelligence in fraud prevention:

Exploring techniques and applications challenges and

opportunities. Computer Science & IT Research Journal, 5(6), 1505-1520.

Bharadiya, J. P. (2023). Machine learning and AI in business intelligence: Trends and

opportunities. International Journal of Computer (IJC), 48(1), 123-134.

Bolton, R. J., & Hand, D. J. (2022). Statistical fraud detection: A review. Statistical Science, 17(3), 235-255.

Chang, J. W., Yen, N., & Hung, J. C. (2022). Design of a NLP-empowered finance

fraud awareness model: the anti-fraud Chatbot for fraud detection and fraud

classification as an instance. Journal of Ambient Intelligence and Humanized

Computing, 13(10), 4663-4679.

Chen, S., Wang, Y., & Lee, C. (2021). Challenges and countermeasures for implementing machine learning-based fraud detection in the banking sector. International Journal of Financial Studies, 9(2), 20.

Chogugudza, M. (2022). The classification performance of ensemble decision tree

classifiers: A case study of detecting fraud in credit card transactions.

Njoku, D. O., Iwuchukwu, V. C., Jibiri, J. E., Ikwuazom, C. T., Ofoegbu, C. I., & Nwokoma, F. O. (2024). Machine learning approach for fraud detection system in financial institution: A web base application.

George, A. S., & George, A. H. (2023). A review of ChatGPT AI's impact on several

business sectors. Partners Universal International Innovation Journal, 1(1), 9-

23.

Hasan, M. R., Gazi, M. S., & Gurung, N. (2024). Explainable AI in credit card fraud

detection: interpretable models and transparent decision-making for enhanced

trust and compliance in the USA. Journal of Computer Science and

Technology Studies, 6(2), 01-12.

Hassan, M., Aziz, L. A. R., & Andriansyah, Y. (2023). The role artificial intelligence

in modern banking: an exploration of AI-driven approaches for enhanced

fraud prevention, risk management, and regulatory compliance. Reviews of

Contemporary Business Analytics, 6(1), 110-132.

Huang, Z., Zheng, H., Li, C., & Che, C. (2024). Application of machine learning-

based k-means clustering for financial fraud detection. Academic Journal of

Science and Technology, 10(1), 33-39.

Jagatheesaperumal, S. K., Rahouti, M., Ahmad, K., Al-Fuqaha, A., & Guizani, M.

(2021). The duo of artificial intelligence and big data for industry 4.0: Applications,

techniques, challenges, and future research directions. IEEE Internet of Things

Journal, 9(15), 12861-12885.

Johnson, P. S., & Martinez, A. R. (2020). Fraud detection in mobile payment systems: Challenges and approaches. Mobile Computing and Communications Review, 24(3), 60-73.

Krishnan, S., Shashidhar, N., Varol, C., & Islam, A. R. (2022). A novel text mining

approach to securities and financial fraud detection of case suspects.

International Journal of Artificial Intelligence and Expert Systems, 10(3).

Mahalakshmi, V., Kulkarni, N., Kumar, K. P., Kumar, K. S., Sree, D. N., & Durga, S.

(2022). The role of implementing artificial intelligence and machine learning

technologies in the financial services industry for creating competitive

intelligence. Materials Today: Proceedings, 56, 2252-2255.

Min, W., Liang, W., Yin, H., Wang, Z., Li, M., & Lal, A. (2021). Explainable deep

behavioral sequence clustering for transaction fraud detection. arXiv preprint

arXiv:2101.04285.

Mohammed, A. F. A., & Rahman, H. M. A. A. (2024). The Role of Artificial

Intelligence (AI) on the Fraud Detection in the Private Sector in Saudi Arabia.

(100) 472-506

Nagaraju, M., Babu, P. N., Ravipati, V. S. P., & Chaitanya, V. (2024). UPI fraud

detection using convolutional neural networks (CNN).

Nguyen, D. K., Sermpinis, G., & Stasinakis, C. (2023). Big data, artificial intelligence

and machine learning: A transformative symbiosis in favour of financial

technology. European Financial Management, 29(2), 517-548.

Palakurti, N. R. (2024). Challenges and future directions in anomaly detection. In

Practical Applications of Data Processing, Algorithms, and Modeling (pp.

269-284). IGI Global.

Rangaraju, S. (2023). Secure by intelligence: enhancing products with AI-Driven

security measures. EPH-International Journal of Science And Engineering,

9(3), 36-41.

Smith, R. L., & Brown, K. P. (2019). Deep learning approaches for improved fraud

detection in banking transactions. International Journal of Data Science and

Analytics, 3(4), 289-302.

Świątkowska, J. (2020). Tackling cybercrime to unleash developing countries’ digital

potential. Pathways for Prosperity Commission Background Paper Series, 33,

2020-01.

Wang, C., Yang, Z., Li, Z. S., Damian, D., & Lo, D. (2024). Quality assurance for

artificial intelligence: a study of industrial concerns, challenges and best

practices. arXiv preprint arXiv:2402.16391.

Whang, S. E., Roh, Y., Song, H., & Lee, J. G. (2023). Data collection and quality

challenges in deep learning: A data-centric ai perspective. The VLDB Journal,

32(4), 791-813.

Williamson, S. M., & Prybutok, V. (2024). Balancing privacy and progress: a review

of privacy challenges, systemic oversight, and patient perceptions in AI-

Driven healthcare. Applied Sciences, 14(2), 675.

Yadav, N. S. S., Yadav, P. S., & Goar, V. (2024). Deep learning, neural networks,

and their applications in business analytics. In Intelligent Optimization

Techniques for Business Analytics (pp. 288-313). IGI Global.

Yandrapalli, V. (2024, February). AI-Powered data governance: a cutting-edge

method for ensuring data quality for machine learning applications. In 2024

Second International Conference on Emerging Trends in Information

Technology and Engineering (ICETITE) (pp. 1-6). IEEE.

Youssef, B., Bouchra, F., & Brahim, O. (2023, March). State of the art literature on

anti-money laundering using machine learning and deep learning techniques.

In The International Conference on Artificial Intelligence and Computer

Vision (pp. 77-90). Cham: Springer Nature Switzerland.

APPENDIX

# Importing dependencies
import numpy as np
import pandas as pd
import time

import matplotlib.pyplot as plt


import seaborn as sns
import missingno as msno
from scipy import stats
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objs as go
plt.style.use('ggplot')

from sklearn.preprocessing import StandardScaler


from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTETomek
from collections import Counter

from sklearn.ensemble import RandomForestClassifier


from lightgbm.sklearn import LGBMClassifier
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score


from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, RocCurveDisplay

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Reading CSV
start_time = time.time()
fraud = pd.read_csv('../input/paysim1/PS_20174392719_1491204439457_log.csv')
end_time = time.time()

execution_time = end_time - start_time


print("Time taken to load:", execution_time, "seconds")
fraud2 = pd.read_csv('../input/paysim1/PS_20174392719_1491204439457_log.csv')
# Set the display format for float values
pd.options.display.float_format = '{:.2f}'.format

# Display summary statistics of the fraud DataFrame


fraud_description = fraud.describe()

# Print or display the summary statistics


print(fraud_description)

# Plotting the describe() output with values in exponential notation


fraud.describe().plot(kind='bar', figsize=(10, 6), colormap='viridis', legend=True)
plt.ticklabel_format(style='sci', axis='y', scilimits=(0, 0))
plt.title('Summary Statistics of Fraud DataFrame')
plt.ylabel('Value')
plt.xlabel('Statistic')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Set the figure size


plt.figure(figsize=(12, 8))

# Boxplotting with logarithmic y-axis scale
sns.boxplot(data=fraud)
plt.yscale('log') # Set y-axis scale to logarithmic

# Set plot title


plt.title('Boxplot of Fraud Data')
plt.savefig('Boxplot')
# Show plot
plt.show()
fraud.head()

> Now we will use the info function to see the datatypes and number of instances in
our dataset.
fraud.info()

> Let's use the shape function that returns the shape of our dataset.
fraud.shape
**Insights**
* The dataset consists of **11 columns**.
* We have **5 columns** of float datatype.
* We have **3 columns** of integer datatype.
* We have **3 columns** of object datatype.
* The dataset contains **6362620** rows of data.
# Data Cleaning
It's quite important to have a look at the missing values of the dataset. We will
remove or interpolate those missing values if we find any. Let's use the missingno
library to do this task.
plt.figure(figsize = (15, 8))
msno.bar(fraud, figsize = (15,5), sort = 'ascending', color = "#896F82")
plt.show()

> It's time to use the duplicated function to check whether there are any duplicated
rows in the dataset.
print('Number of duplicates are : ', fraud.duplicated().sum())

> Let's have a look at the column names to see if they need any correction. We
usually do this to spot typos.
fraud.columns
> Now, let's rename some of the column names by using rename function.
fraud = fraud.rename(columns = {'nameOrig' : 'origin', 'oldbalanceOrg' :
'sender_old_balance', 'newbalanceOrig': 'sender_new_balance', 'nameDest' :
'destination', 'oldbalanceDest' : 'receiver_old_balance', 'newbalanceDest':
'receiver_new_balance', 'isFraud' : 'isfraud'})
> It's also good to drop the non-essential columns from a dataset. We will do
this with the help of the drop function.
fraud = fraud.drop(columns = ['step', 'isFlaggedFraud'], axis = 'columns')
> Now it's time to move the target column to our desired location in the dataset.
cols = fraud.columns.tolist()
new_position = 3

cols.insert(new_position, cols.pop(cols.index('destination')))
fraud = fraud[cols]
> By using the head function, let's confirm the changes we have made so far.
fraud.head()

fraud2 = pd.read_csv('../input/paysim1/PS_20174392719_1491204439457_log.csv')
# Filter the dataset to include only rows where 'isFraud' is equal to 1
fraud_testing = fraud2[fraud2['isFraud'] == 1].copy()

# Display the first few rows of the testing dataset


print(fraud_testing.head())
**Insights**
* First of all, we realized that not a single column in our dataset had any null values.
This means our dataset was already clean, or the missing values had already been
taken care of.
* Second, we saw that our dataset does not have any duplicate values.
* Then we renamed some of our columns and changed their positions for our comfort
and better understanding.

# Exploratory Data Analysis

> Now let's make a barplot to see the fraud and non-fraud transactions across the
different transaction types.

# Define custom colors
custom_palette = ['#1f77b4', '#ff7f0e'] # Blue and orange

plt.figure(figsize=(15, 8))
ax = sns.countplot(data=fraud, x="type", hue="isfraud", palette=custom_palette)
plt.yscale('log') # Use log scale for y-axis
plt.title('Fraudulent vs. Non-Fraudulent Transactions')

# Adding annotations
for p in ax.patches:
ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x() + 0.01, p.get_height() +
10000))
plt.savefig('fraud_transactions_plot.png')
plt.show()
plt.savefig('fraud_transactions_plot.png')
> Checking the origin from where the transactions were made.
transfer_fraud = fraud[(fraud['type'] == 'TRANSFER') & (fraud['isfraud'] == 1)]
transfer_fraud['origin'].value_counts()
> Checking the destination from where the transactions were cashed out.
cash_out_fraud = fraud[(fraud['type'] == 'CASH_OUT') & (fraud['isfraud'] == 1)]
cash_out_fraud['destination'].value_counts()
> Checking whether the transfer and receiving accounts were the same.
fraud_trans = fraud[fraud['isfraud'] == 1]
valid_trans = fraud[fraud['isfraud'] == 0]

trans_transfer = fraud[fraud['type'] == 'TRANSFER']


trans_cashout = fraud[fraud['type'] == 'CASH_OUT']

print('Was the receiving account used for cashing out?')

trans_transfer.destination.isin(trans_cashout.origin).any()

**Insights**
* Our fraud transactions are done in **TRANSFER** and **CASH_OUT**
transaction type.
* The fraud transactions in **TRANSFER** were **4097** and **CASH_OUT**
were **4116**.

* The **fraud** transactions were generally from **Customer to Customer** accounts.
* The accounts used for **receiving and sending were not the same** in the case
of **fraud transactions**.

# Feature Engineering
> Now we can do some feature engineering to create another column that should be
helpful for the later machine learning tasks.
data = fraud.copy()
data['type2'] = np.nan
data.loc[fraud.origin.str.contains('C') & fraud.destination.str.contains('C'), 'type2'] = 'CC'
data.loc[fraud.origin.str.contains('C') & fraud.destination.str.contains('M'), 'type2'] = 'CM'
data.loc[fraud.origin.str.contains('M') & fraud.destination.str.contains('C'), 'type2'] = 'MC'
data.loc[fraud.origin.str.contains('M') & fraud.destination.str.contains('M'), 'type2'] = 'MM'
> Changing the column position for our ease of use.
cols = data.columns.tolist()
new_position = 1

cols.insert(new_position, cols.pop(cols.index('type2')))
data = data[cols]
> Again, dropping the irrelevant columns.
data.drop(columns = ['origin','destination'], axis = 'columns', inplace = True)
> Using the head function to have a new look of the dataset.
data.head()
> Now we are going to see the number of fraud and valid transactions according to
type2, which tells whether the transaction was from customer to customer, customer
to merchant, merchant to customer, or merchant to merchant.
fraud_trans = data[data['isfraud'] == 1]
valid_trans = data[data['isfraud'] == 0]
print('Number of fraud transactions according to type are below:\n',
fraud_trans.type2.value_counts(), '\n')
print('Number of valid transactions according to type are below:\n',
valid_trans.type2.value_counts())

**Insights**

* At first, we did some feature engineering and introduced a new column **type2**
that contained the type of transaction between **Customers and Merchants**.
* Then, we adjusted the column position and dropped some columns that were no
longer of use.
* The total number of **Fraud Transactions** was **8213**, all made from
**Customer to Customer**.
* The number of **Valid Transactions** made from **Customer to Customer** are
**4202912**.
* The number of **Valid Transactions** made from **Customer to Merchant** are
**2151495**.

# Data Visualization
fr = fraud_trans.type2.value_counts()
va = valid_trans.type2.value_counts()
plt.figure(figsize=(15, 8))
plt.subplot(1, 2, 1)
sns.barplot(x=fr.index, y=fr.values)
plt.title('Fraud', fontweight="bold", size=20)
plt.subplot(1, 2, 2)
sns.barplot(x=va.index, y=va.values)
plt.title('Valid', fontweight="bold", size=20)
plt.figure(figsize=(15, 8))
ax = sns.countplot(data=data, x="type")
plt.title('Transactions according to type')
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x() + 0.01, p.get_height() + 10000))
plt.savefig("Transactions according to Type")
plt.figure(figsize=(15,8))
colors = ['#006400','#008000','#00FF00','#2E8B57','#2F4F4F']
plt.pie(data.type.value_counts().values,labels=data.type.value_counts().index, colors
= colors, autopct='%.0f%%')
plt.title("Transactions according to type")
plt.savefig("Transactions according to type")
plt.show()
plt.figure(figsize=(15, 8))
ax = sns.countplot(data=data, x="type2", hue="isfraud", palette='Set1')
plt.title('Transactions according to type2')
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x() + 0.01, p.get_height() + 10000))
plt.figure(figsize=(15,8))
colors = ['#006400','#008000']
plt.pie(data.type2.value_counts().values,labels=data.type2.value_counts().index,
colors = colors, autopct='%.0f%%')
plt.title("Transactions according to type2")
plt.show()

**Insights**
* The first two visualizations contain the number of transactions according to the
transaction type and the sender and receiver type.
* Most common transaction type used for transactions = **CASH_OUT**.
* Least common transaction type used for transactions = **DEBIT**.
* Most of the transactions were **Customer to Customer**.

# Data Preprocessing
> Let's create the dummies for the later machine learning steps. I'll use one-hot
encoding since the columns do not have a specific order.
data = pd.get_dummies(data, prefix = ['type', 'type2'], drop_first = True)
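To make the encoding step concrete, here is a minimal sketch of how `pd.get_dummies` with `drop_first=True` behaves on a toy frame (the values below are made up for illustration; only the column names mirror the real dataset): each categorical column becomes k-1 indicator columns, with the first (alphabetically sorted) category dropped as the baseline.

```python
import pandas as pd

# Hypothetical toy frame, not the real PaySim data
toy = pd.DataFrame({'type': ['CASH_OUT', 'TRANSFER', 'DEBIT'],
                    'amount': [100.0, 250.0, 40.0]})

# drop_first=True drops the first sorted category (CASH_OUT here),
# leaving one indicator column per remaining category
encoded = pd.get_dummies(toy, prefix=['type'], columns=['type'], drop_first=True)
print(list(encoded.columns))  # -> ['amount', 'type_DEBIT', 'type_TRANSFER']
```

Dropping the first level avoids perfectly collinear indicator columns, which matters for linear models such as the logistic regression trained later.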
> Now it's time to split the dataset into training and testing part and then do the
standardization.
X = data.drop('isfraud', axis=1)
y = data.isfraud

for i in range(5):
    start_time = time.time()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                        stratify=data.isfraud)
    end_time = time.time()

    execution_time = end_time - start_time
    print("Time taken to train test split:", execution_time, "seconds")
    print(len(X_train))

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
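Note that SMOTETomek is imported at the top of this appendix but never applied. Since fraud labels are heavily imbalanced, resampling the training set (never the test set) is a common next step. Below is a minimal sketch of the simpler random-oversampling idea using scikit-learn's `resample`; `X_toy` and `y_toy` are synthetic stand-ins for the real training data, not the PaySim features.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic imbalanced toy data: 950 "valid" rows, 50 "fraud" rows
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 950 + [1] * 50)

# Random oversampling: resample the minority class (with replacement)
# up to the majority-class size
minority = X_toy[y_toy == 1]
upsampled = resample(minority, replace=True, n_samples=950, random_state=42)

X_bal = np.vstack([X_toy[y_toy == 0], upsampled])
y_bal = np.array([0] * 950 + [1] * 950)
print(np.bincount(y_bal))  # both classes now have 950 rows
```

SMOTETomek itself would be applied analogously via `SMOTETomek().fit_resample(X_train, y_train)`, which synthesizes new minority samples and removes Tomek links instead of duplicating rows.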

# Model Building
> We will now use 4 ML models and train them. Later we will append each model
to a list.
X.head()
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Configure RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=20, max_depth=5)

# Configure DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth=5)

# Configure LogisticRegression
logreg = LogisticRegression(max_iter=100, penalty='l2', solver='lbfgs')

# Configure GaussianNB
nb = GaussianNB()

# Fit the models


rfc.fit(X_train, y_train)
dtc.fit(X_train, y_train)
logreg.fit(X_train, y_train)
nb.fit(X_train, y_train)

# Store the models in a list


classifiers = [rfc, dtc, logreg, nb]
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score

accuracy_list = []
auc_list = []
precision_list = []
recall_list = []

for classifier in classifiers:
    y_pred = classifier.predict(X_test)
    y_pred_proba = classifier.predict_proba(X_test)[:, 1]
    accuracy_list.append(accuracy_score(y_test, y_pred))
    auc_list.append(roc_auc_score(y_test, y_pred_proba))
    precision_list.append(precision_score(y_test, y_pred))
    recall_list.append(recall_score(y_test, y_pred))

metrics_dict = {
'accuracy': accuracy_list,
'auc': auc_list,
'precision': precision_list,
'recall': recall_list
}

# Keep each score paired with its classifier name while sorting,
# so the bar plots below can label the bars correctly
classifiers_names = ['Random Forest', 'Decision Tree', 'Logistic Regression', 'Naive Bayes']
accuracy_dict_sorted = dict(sorted(zip(classifiers_names, accuracy_list), key=lambda kv: kv[1]))
auc_dict_sorted = dict(sorted(zip(classifiers_names, auc_list), key=lambda kv: kv[1]))

# Transposing the dictionaries

classifiers_names = ['Random Forest', 'Decision Tree', 'Logistic Regression', 'Naive Bayes']
transposed_metrics = {classifier: [] for classifier in classifiers_names}
for metric, values in metrics_dict.items():
    for i, classifier in enumerate(classifiers_names):
        transposed_metrics[classifier].append(values[i])

# Printing the metrics grouped by model

for classifier, values in transposed_metrics.items():
    print(f'{classifier}:')
    for metric, value in zip(['accuracy', 'auc', 'precision', 'recall'], values):
        print(f'{metric}: {value}')
    print()
from sklearn.metrics import roc_auc_score

auc_per_class = {}

for classifier in classifiers:
    y_pred_proba = classifier.predict_proba(X_test)
    classes = classifier.classes_
    auc_per_class[classifier.__class__.__name__] = {}
    for i, class_name in enumerate(classes):
        auc_per_class[classifier.__class__.__name__][class_name] = \
            roc_auc_score(y_test == class_name, y_pred_proba[:, i])

# Printing the AUC for each class in each model


for model, class_aucs in auc_per_class.items():
    print(model)
    for class_name, auc in class_aucs.items():
        print(f'Class {class_name}: AUC = {auc}')
    print()
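A side note on the per-class loop above: for a binary target the two per-class AUCs are always identical, because swapping the positive class flips both the labels and the scores without changing the ranking. So reporting one of them suffices. A small self-contained check with hypothetical probabilities (not the model's actual outputs):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up binary labels and predicted probabilities; columns are
# P(class 0) and P(class 1), so the rows sum to 1
y = np.array([0, 0, 1, 1])
proba = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.65, 0.35],
                  [0.2, 0.8]])

auc_class1 = roc_auc_score(y == 1, proba[:, 1])
auc_class0 = roc_auc_score(y == 0, proba[:, 0])
print(auc_class0, auc_class1)  # both 0.75
```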

> Below, I'll create a function for the visualization part.

from sklearn.metrics import classification_report, accuracy_score

for classifier in classifiers:
    y_pred = classifier.predict(X_test)
    print(classifier.__class__.__name__)
    print(classification_report(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print()

def px_bar(x, y, text, title, color, color_discrete_sequence):
    return px.bar(x=x, y=y, text=text, title=title, color=color,
                  color_discrete_sequence=color_discrete_sequence)
> Let's call the function to see the accuracy of each model.
fig = px_bar(list(accuracy_dict_sorted.keys()), list(accuracy_dict_sorted.values()),
             np.round(list(accuracy_dict_sorted.values()), 3), 'Accuracy score of each classifier',
             list(accuracy_dict_sorted.keys()), px.colors.sequential.matter)
for idx in [2, 3]:
    fig.data[idx].marker.line.width = 3
    fig.data[idx].marker.line.color = "black"
fig.show()
> I'll call the visualization function once again to see the AUC values as well.
fig = px_bar(list(auc_dict_sorted.keys()), list(auc_dict_sorted.values()),
             np.round(list(auc_dict_sorted.values()), 3), 'AUC score of each classifier',
             list(auc_dict_sorted.keys()), px.colors.sequential.matter)

for idx in [2, 3]:
    fig.data[idx].marker.line.width = 3
    fig.data[idx].marker.line.color = "black"
fig.show()

# Model Evaluation
> Let's train our best model once again.
rfc=RandomForestClassifier(n_estimators = 15, n_jobs = -1, random_state = 42)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
rfc_pred_proba = rfc.predict_proba(X_test)[:,1]
> Printing out the Classification Report.
print(classification_report(y_test, rfc_pred, target_names=['Not Fraud','Fraud']))

> Plotting the ROC curve below to show the Random Forest performance.
fpr, tpr, temp = roc_curve(y_test, rfc_pred_proba)
auc = round(roc_auc_score(y_test, rfc_pred_proba), 3)
plt.figure(figsize=(15, 7))
plt.plot(fpr, tpr, label='Random Forest Classifier, AUC=' + str(auc),
         linestyle='solid', color='#800000')
plt.plot([0, 1], [0, 1], color='g')
plt.title('ROC Curve')
plt.legend(loc='upper right')
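With roughly 8 thousand frauds among 6.3 million transactions, a ROC curve can look deceptively strong; a precision-recall curve is often more informative under such imbalance. Below is a hedged, self-contained sketch on synthetic labels and scores (made up for illustration, not the real `rfc_pred_proba`):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Synthetic, heavily imbalanced toy problem: 990 negatives, 10 positives,
# with positives tending to score higher than negatives
rng = np.random.default_rng(0)
y_toy = np.array([0] * 990 + [1] * 10)
scores = np.concatenate([rng.uniform(0.0, 0.6, 990),
                         rng.uniform(0.4, 1.0, 10)])

precision, recall, thresholds = precision_recall_curve(y_toy, scores)
ap = average_precision_score(y_toy, scores)  # area under the PR curve
print(f"average precision = {ap:.3f}")
```

The same two lines applied to `y_test` and `rfc_pred_proba` would give the PR curve for the Random Forest; `matplotlib` can plot `recall` against `precision` just as the ROC curve is plotted above.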

# Conclusion
The total number of fraud transactions was **8213** out of **6362620**
transactions. These fraud transactions were either **TRANSFER** or **CASH_OUT**
and were made from a **Customer to Customer** account. We trained 4 algorithms
and **Random Forest** performed the best among them, giving an **AUC score**
of **0.9**.
