Credit Card Fraud Detection Report
Chapter 1
INTRODUCTION
1.1 Overview
Organizations generate data every day, and the amount of data available on the Web has grown dramatically, creating a need to display massive amounts of data in a way that is easily accessible and understandable. It is difficult for users to visualize, explore, and use this enormous volume of data, and the ability to visualize data is crucial to scientific research. Today, computers can be used to process large amounts of data. Data visualization is concerned with the design, development, and application of computer-generated graphical representations of data. It provides an effective representation of data originating from different sources, enabling decision makers to see analytics in visual form and make sense of the data. It helps them discover patterns, comprehend information, and form opinions.
Data visualization is also referred to as information visualization or scientific visualization. Human beings have always employed visualizations to make messages or information last in time: what cannot be touched, smelled, or tasted can still be represented visually [1].
Data visualization is the technique of delivering insights from data using visual cues such as graphs, charts, and maps. It supports intuitive understanding of large quantities of data and thereby better decision making. Popular data visualization tools include Tableau, Plotly, R, Google Charts, Infogram, and Kibana. These platforms differ in their capabilities, functionality, and use cases, and they require different skill sets. This report discusses the use of R for data visualization.
R is a language designed for statistical computing, graphical data analysis, and scientific research. It is often preferred for data visualization because its packages offer flexibility with a minimum of required code.
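As a small illustration of how little code a package such as ggplot2 requires, the sketch below plots one of R's built-in datasets; the dataset and aesthetic choices are purely illustrative.

# Minimal ggplot2 sketch using the built-in iris dataset
library(ggplot2)

ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +                                   # scatter plot layer
  labs(title = "Petal length versus sepal length",
       x = "Sepal length (cm)", y = "Petal length (cm)")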
Visualization Techniques
Visualization is the use of computer-supported, visual representation of data. Unlike static data
visualization, interactive data visualization allows users to specify the format used in displaying
data.
Common visualization techniques are as shown in Figure 1.1 and include [2]:
Line graph: This shows the relationship between items. It can be used to compare changes
over a period of time.
Bar chart: This is used to compare quantities of different categories.
Scatter plot: This is a two-dimensional plot showing variation of two items.
Pie chart: This is used to compare the parts of a whole.
Thus, the format of graphs and charts can take the form of bar chart, pie chart, line graph, etc. It is
important to understand which chart or graph to use for your data.
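For illustration, each of the chart types listed above can be produced with a single base-R call; the small vectors used here are made up solely to keep the sketch self-contained.

# Illustrative sketches of the common chart types in base R
sales  <- c(12, 15, 9, 20, 18)                 # made-up monthly values
months <- c("Jan", "Feb", "Mar", "Apr", "May")

plot(sales, type = "l", xaxt = "n",
     xlab = "Month", ylab = "Sales", main = "Line graph")
axis(1, at = 1:5, labels = months)
barplot(sales, names.arg = months, main = "Bar chart")
pie(sales, labels = months, main = "Pie chart")
plot(mtcars$wt, mtcars$mpg, main = "Scatter plot",
     xlab = "Weight", ylab = "Miles per gallon")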
Data visualization uses computer graphics to show patterns, trends, and relationships among elements of the data. It can generate pie charts, bar charts, scatter plots, and other types of data graphs with simple pull-down menus and mouse clicks. Colors are carefully selected for certain types of visualization; when color is used to represent data, effective colors must be chosen to differentiate between data elements.
In data visualization, data is abstracted and summarized. Spatial variables such as position, size, and shape represent key elements in the data. A visualization system should perform data reduction, transform and project the original dataset onto the screen, visualize the results in the form of charts and graphs, and present them in a user-friendly way.
Figure 1.1: Common data visualization chart types (line graph, bar chart, pie chart, scatter plot).
Applications
Most visualization designs aim to aid decision making and serve as tools that augment cognition. In designing and building a data visualization prototype, one must be guided by how the visualization will be applied. Data visualization is more than just representing numbers; it involves selecting and rethinking the numbers on which the visualization is based [3].
Visualization of data is an important branch of computer science with a wide range of application areas. Several application-specific tools have been developed to analyze individual datasets in many fields of medicine and science.
Public Health: The ability to analyze and present data in an understandable manner is
critical to the success of public health surveillance. Health researchers need useful and
intelligent tools to aid their work [4]. Security is important in cloud-based medical data
visualizations. Open any medical or health magazine today, and you will see all kinds of
graphical representations.
Fraud Detection: Data visualization is important in the early stages of fraud investigation. Fraud investigators may use it as a proactive detection approach, looking for patterns that suggest fraudulent activity [7].
Several information visualization algorithms and associated software packages have been developed, enabling users to interpret data more rapidly than ever before. Examples include ManyEyes from IBM, SmartMoney for the stock market, Insights from Facebook, Visual Analytics from SAS, Thoth from the California Institute of Technology, Tableau, and TOPCAT [10, 11]. They make data visualizations easy to interpret and rapid to produce, and each tool has its own strengths and limitations. Visualization of large-scale multidimensional data sets can also be combined with new ways of interacting with a computer, for example through a Web application offered as a service.
Challenges
Large, time-varying datasets pose a great challenge for data visualization because of the enormous data volume. Real-time data visualization can enable users to respond proactively to issues as they arise. An animation-generation approach can be used for interactive exploration of time-varying data; it visualizes temporal events by mimicking the composition of storytelling techniques [12].
Users differ in their ability to use data visualization and make decisions under tight time constraints, and it is hard to quantify the merit of a data visualization technique. This is the reason for the multitude of visualization algorithms and associated software. Most of this software has not taken advantage of the multi-touch interaction and direct manipulation capabilities of new devices.
Big data, structured and unstructured, introduces a unique set of challenges for developing visualizations, because we must take into account the speed, size, and diversity of the data. A new set of issues related to performance, operability, and degree of discrimination challenges large-data visualization and analysis [13]. It is difficult and time-consuming to create a large simulated data set, and it is also difficult to decide which visual might be best to use.
1.2 Problem Statement
Fraud in credit card transactions is a growing concern. As technology changes and becomes more advanced day by day, it is becoming more and more difficult to track the behavior and patterns of criminal activity. Through this project we aim to provide a solution that makes use of technologies such as Machine Learning and data visualization in R, easing the detection of fraudulent card transactions.
1.3 Objectives
The basic objectives of the project are listed below:
To study the unauthorized and unwanted ‘fraud’ in credit card transactions.
To monitor the activities of the population of users in order to perceive or avoid
objectionable behavior.
To collect data from a trusted source and analyze the data.
To visualize the ongoing trend in such frauds by using advanced visualizing tools
such as R.
1.4 Project Scope
This is a very relevant problem that demands the attention of communities such as machine
learning and data science where the solution to this problem can be automated. Fraud detection
involves monitoring the activities of populations of users in order to estimate, perceive or avoid
objectionable behaviour, which consist of fraud, intrusion, and defaulting.
This problem is particularly challenging from the perspective of learning, as it is characterized by
various factors such as class imbalance. The number of valid transactions far outnumber fraudu-
lent ones. Also, the transaction patterns often change their statistical properties over the course of
time. These are not the only challenges in the implementation of a real-world fraud detection
system, however. In real world examples, the massive stream of payment requests is quickly
scanned by automatic tools that determine which transactions to authorize.
A credit card fraud detection system with data visualization helps financial institutions and businesses reduce fraud and protect their customers. Here is an overview of the components and functionalities you might find in such a system:
Machine Learning Models:
Machine learning algorithms are used for fraud detection. Common algorithms include logistic
regression, decision trees, random forests, neural networks, and anomaly detection models. These
models are trained on historical transaction data to learn patterns of legitimate and fraudulent
behavior.
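As a rough sketch of what such training might look like in R, assuming a data frame transactions with predictor columns and a binary Class label (as in the dataset used later in this report), one could fit a logistic regression and a decision tree as follows; the package choice and the 80/20 split are illustrative.

# Illustrative model training; `transactions` and the split ratio are assumptions.
library(rpart)                                    # decision trees

transactions$Class <- factor(transactions$Class)  # 1 = fraud, 0 = legitimate

set.seed(42)
idx   <- sample(seq_len(nrow(transactions)), size = 0.8 * nrow(transactions))
train <- transactions[idx, ]
test  <- transactions[-idx, ]

log_model  <- glm(Class ~ ., data = train, family = binomial())   # logistic regression
tree_model <- rpart(Class ~ ., data = train, method = "class")    # decision tree

# Predicted fraud probabilities on the held-out data
log_prob  <- predict(log_model,  newdata = test, type = "response")
tree_prob <- predict(tree_model, newdata = test, type = "prob")[, "1"]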
Real-time Monitoring:
The system continuously monitors incoming transactions in real-time to identify suspicious
activities. It uses the trained machine learning models to flag transactions that exhibit
characteristics of potential fraud.
Alerting and Reporting:
When a potentially fraudulent transaction is detected, the system generates alerts for review.
Alerts are sent to authorized personnel, such as fraud analysts or investigators, for further action.
Detailed reports are created to document suspicious transactions and provide evidence for
investigation and reporting.
Data Visualization:
Data visualization tools and libraries, which may include R, are used to create interactive charts,
graphs, and dashboards. Visualizations help analysts and decision-makers quickly understand
trends and patterns in transaction data. Common visualization types include time series plots,
geographical heatmaps, and network graphs.
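A hedged sketch of one such visualization, a daily time series of flagged transactions, is given below; the column names date and is_fraud are assumptions made for illustration rather than fields of any particular dataset.

# Illustrative daily time series of flagged transactions with ggplot2.
# Assumes a data frame `txns` with a Date column `date` and a 0/1 column `is_fraud`.
library(ggplot2)
library(dplyr)

daily <- txns %>%
  group_by(date) %>%
  summarise(fraud_count = sum(is_fraud), .groups = "drop")

ggplot(daily, aes(x = date, y = fraud_count)) +
  geom_line() +
  labs(title = "Flagged transactions per day", x = "Date", y = "Count")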
User Interface:
The system typically offers a user-friendly web-based interface where authorized users can log in
to access alerts, reports, and visualizations. User interfaces may also support filtering and drill-
down capabilities for in-depth analysis.
Compliance:
The system must comply with relevant data protection and security regulations, such as GDPR,
HIPAA, and PCI DSS, to protect customer data and ensure data privacy.
Integration:
The system often needs to integrate with external services and databases to access additional
information for fraud detection, such as blacklists, watchlists, and customer profiles.
Performance and Scalability:
These systems must be able to handle a large volume of transactions and scale to accommodate
increased data flows.
Security:
Robust security measures are implemented to protect the system against cyber threats and
unauthorized access.
Maintenance and Support:
Regular maintenance and updates are essential to keep the system effective against evolving fraud
tactics.
The specific technology stack and architecture of an existing system may vary from one
organization to another, but the core functionalities described above are common in credit card
fraud detection and data visualization systems. The choice of technologies, algorithms, and tools
depends on the organization's needs, resources, and technological capabilities.
We propose to build a Credit Card Fraud Detection System in the R language by making use of Machine Learning and advanced R concepts. We will incorporate algorithms such as Decision Trees, Artificial Neural Networks, Logistic Regression, and a Gradient Boosting Classifier. To carry out the task of credit card fraud detection, we will use a credit card transactions dataset consisting of a mix of fraudulent and non-fraudulent transactions.
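As a sketch of how one of these algorithms, the gradient boosting classifier, might be fitted in R (the gbm package and the tuning values below are illustrative assumptions, not the final configuration):

# Illustrative gradient boosting model; assumes train_data/test_data with a numeric 0/1 Class column.
library(gbm)

set.seed(123)
boost_model <- gbm(Class ~ .,
                   data = train_data,
                   distribution = "bernoulli",   # binary outcome: fraud vs. not fraud
                   n.trees = 300,
                   interaction.depth = 3,
                   shrinkage = 0.05)

# Predicted fraud probabilities for the held-out transactions
boost_prob <- predict(boost_model, newdata = test_data,
                      n.trees = 300, type = "response")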
Method Used
We have referred various research papers to identify the various components that might be
required in our project. We also referred few websites to understand the various packages
and libraries in R that are to be used.
Chapter 4: System Analysis and Design, which provides the architecture and analysis.
Chapter 5: Provides the detailed technology view and the pseudocode.
Chapter 6: Testing, which provides several testing benchmarks to check whether the requirements are met. Finally, Chapter 7 briefs out the results and the visualizations.
Chapter 2
Literature Survey
Chapter 3
Software Requirement Specification
3.1 Purpose
The purpose of this document is to outline the requirements for a Credit Card Fraud Detection
System with data visualization capabilities using the R programming language. This system is
designed to detect and visualize potential fraudulent credit card transactions, helping financial
institutions to reduce fraud and protect their customers.
3.2 Scope
The system will analyze transaction data, identify suspicious activities, and provide interactive data visualization using R. It will be capable of handling a large volume of transaction data and generating meaningful insights for fraud detection.
Chapter 4
System Analysis and Design
Alerting and Reporting:
Define the criteria for generating alerts when potential fraud is detected. Specify reporting requirements, including the format, content, and delivery method of reports.
Data Visualization:
Specify the types of visualizations required, such as time series plots, geographical heatmaps, and
network graphs. Define the interactivity and user-friendliness of the visualizations.
User Interface:
Detail the requirements for the user interface, including user roles, access levels, and user
interactions. Specify features like filtering, drill-down capabilities, and data export options.
Compliance and Security:
Define the system's compliance requirements with data protection regulations (e.g., GDPR,
HIPAA, PCI DSS). Specify security measures to protect sensitive transaction data and user
access.
Scalability:
Describe how the system will scale to accommodate increased transaction volume and data
sources.
Technology Stack:
Specify the technology stack, tools, and frameworks that will be used for system development.
Discuss the choice of R packages and libraries for data visualization and analysis.
Performance Requirements:
Set performance requirements for response times, throughput, and system availability.
Historical Data Retention:
Specify the requirements for storing historical transaction data for analysis and reporting
purposes.
Model Retraining:
Define the frequency and process for retraining machine learning models to adapt to evolving
fraud patterns.
Once the system analysis is completed, we have a comprehensive understanding of the project's
requirements and constraints. This analysis serves as the foundation for the design and
development phases of the credit card fraud detection system with data visualization using R.
Chapter 5
Implementation
5.1 Methodology
Implementing a credit card fraud detection system with data visualization using R involves
several steps. Here's a high-level outline of the implementation process:
Data Collection and Preprocessing:
Collect historical credit card transaction data, including both legitimate and fraudulent
transactions. Preprocess the data by cleaning, transforming, and extracting relevant features.
Handle missing values and outliers.
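A minimal preprocessing sketch is shown below; the file name, the median imputation, and the focus on the Amount column are illustrative assumptions.

# Illustrative data loading and preprocessing; file name is an assumption.
library(data.table)

creditcard_data <- fread("creditcard.csv")

# Replace missing Amount values with the median (illustrative choice)
creditcard_data$Amount[is.na(creditcard_data$Amount)] <-
  median(creditcard_data$Amount, na.rm = TRUE)

# Standardize the Amount column so extreme values do not dominate the model
creditcard_data$Amount <- as.numeric(scale(creditcard_data$Amount))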
Data Visualization with R:
Use R and relevant packages (e.g., ggplot2, plotly, Shiny) for data visualization.
Create interactive data visualizations, such as time series plots, geographical heatmaps, and
network graphs, to gain insights into the data.
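For example, a static ggplot2 chart can be made interactive by wrapping it with plotly; the sketch below assumes the preprocessed creditcard_data with a 0/1 Class column.

# Illustrative interactive chart of the class distribution
library(ggplot2)
library(plotly)

p <- ggplot(creditcard_data,
            aes(x = factor(Class, labels = c("Legitimate", "Fraud")))) +
  geom_bar() +
  labs(x = "Transaction class", y = "Number of transactions")

ggplotly(p)   # adds zoom, pan, and hover tooltips in the browser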
Machine Learning Model Development:
Train machine learning models on historical data to detect fraudulent transactions. Common
algorithms include logistic regression, decision trees, random forests, neural networks, and
clustering methods. Evaluate model performance using appropriate metrics (e.g., precision, recall,
F1-score).
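The listed metrics can be computed directly from a confusion matrix; the sketch below assumes true labels in test_data$Class and predicted fraud probabilities in pred_prob, with an illustrative 0.5 threshold.

# Illustrative evaluation: confusion matrix, precision, recall, and F1-score
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

tp <- sum(pred_class == 1 & test_data$Class == 1)
fp <- sum(pred_class == 1 & test_data$Class == 0)
fn <- sum(pred_class == 0 & test_data$Class == 1)

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

table(Predicted = pred_class, Actual = test_data$Class)      # confusion matrix
c(precision = precision, recall = recall, F1 = f1)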
Real-time Monitoring and Alerting:
Implement a real-time monitoring system to continuously analyze incoming credit card
transactions. Generate alerts when a transaction is detected as potentially fraudulent based on the
trained machine learning models.
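A simplified scoring step is sketched below; in a deployed system the model would sit behind a message queue or API, but the thresholding logic would look broadly similar. The function name and the 0.9 threshold are assumptions.

# Illustrative real-time scoring of a single incoming transaction
score_transaction <- function(txn, model, threshold = 0.9) {
  prob <- predict(model, newdata = txn, type = "response")  # fraud probability from a fitted glm
  if (prob >= threshold) {
    message(sprintf("ALERT: transaction flagged with fraud probability %.3f", prob))
    return(TRUE)   # route to fraud analysts for review
  }
  FALSE
}

# Usage sketch: score_transaction(new_txn_row, Logistic_Model)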
User Interface Development:
Create a web-based user interface using R and Shiny or other web development tools.
Design a user-friendly dashboard where users can view alerts, reports, and visualizations.
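A minimal Shiny skeleton for such a dashboard might look like the following; the layout and the assumed alerts data frame are placeholders rather than the final design.

# Minimal Shiny dashboard skeleton; `alerts` is an assumed data frame of flagged transactions.
library(shiny)

ui <- fluidPage(
  titlePanel("Credit Card Fraud Monitoring"),
  tableOutput("alert_table"),
  plotOutput("daily_plot")
)

server <- function(input, output) {
  output$alert_table <- renderTable(head(alerts, 20))          # latest flagged transactions
  output$daily_plot  <- renderPlot(
    barplot(table(alerts$date), xlab = "Date", ylab = "Flagged transactions")
  )
}

shinyApp(ui = ui, server = server)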
Compliance and Security:
Ensure the system complies with relevant data protection regulations (e.g., GDPR, HIPAA, PCI
DSS). Implement robust security measures to protect sensitive transaction data and user access.
Integration:
Integrate the system with external data sources and fraud databases to enhance fraud detection
accuracy. Implement APIs for external systems to interact with the fraud detection and
visualization components.
The implementation of a credit card fraud detection system with data visualization is a complex
process that requires expertise in data analysis, machine learning, data visualization, and web
development.
Step 1: We import the required packages. The dataset is assumed to have already been read into the data frame creditcard_data.
> library(caret)
> library(data.table)
Step 2: We explore the dataset.
> head(creditcard_data, 6)          # gives the first 6 entries in the dataset
> tail(creditcard_data, 6)          # gives the last 6 entries in the dataset
> summary(creditcard_data$Amount)   # statistical summary of the Amount column
> names(creditcard_data)            # names of the columns in the dataset
Step 3: We scale the data using the scale() function in R so that extreme values do not hinder the functioning of the model. Scaling standardizes the data by bringing it into a common range (for example, standardizing the Amount column and storing the result in NewData).
> NewData <- creditcard_data
> NewData$Amount <- scale(NewData$Amount)   # e.g., standardize the Amount column
> head(NewData)                             # recheck the data after scaling
Step 4: Now that the data has been scaled, it is ready for modelling, so we split it into a training set (train_data) and a test set (test_data) using the caTools package (an 80/20 split, consistent with the dimensions shown below).
> library(caTools)
> set.seed(123)                      # fixes the random seed so the split is reproducible
> data_sample <- sample.split(NewData$Class, SplitRatio = 0.80)
> train_data <- subset(NewData, data_sample == TRUE)
> test_data  <- subset(NewData, data_sample == FALSE)
> dim(train_data)                    # dimensions of the training data
[1] 227846 30
> dim(test_data)                     # dimensions of the test data
[1] 56961 30
Step 5: In this step, we fit a logistic regression model. Logistic regression estimates the probability of a binary outcome from one or more independent variables: whereas linear regression fits a trend line through a set of data points, logistic regression outputs a probability between 0 and 1. In our project we use it to model whether a transaction is fraudulent or not.
> Logistic_Model <- glm(Class ~ ., data = train_data, family = binomial())   # fit a binomial (logistic) regression model on the training data
> summary(Logistic_Model)
Then, to assess the performance of the model, we use it to plot the ROC (Receiver Operating Characteristic) curve.
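One way to produce this curve is with the pROC package; the sketch below is illustrative and assumes the model and test set defined above.

# Illustrative ROC curve using the pROC package
library(pROC)

lr_prob <- predict(Logistic_Model, newdata = test_data, type = "response")
roc_obj <- roc(test_data$Class, lr_prob)        # true labels vs. predicted probabilities

plot(roc_obj, main = "ROC curve for the logistic regression model")
auc(roc_obj)                                    # area under the curve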
Chapter 6
Testing
Testing is a crucial step in the development of a credit card fraud detection system with data
visualization using R. It ensures that the system functions as intended, accurately detects
fraudulent transactions, and provides meaningful data visualization. Here are various types of
testing that should be conducted:
Unit Testing:
Test individual components and functions in isolation to ensure they work correctly.
Verify that data preprocessing, machine learning algorithms, alerting, and visualization functions
produce the expected results.
Integration Testing:
Test the interactions between different system components to verify that they work together
harmoniously. Check data flow between modules, APIs, and external data sources.
System Testing:
Test the system as a whole to ensure all components function cohesively. Verify that the end-to-
end process from data ingestion to data visualization is working as expected.
Functional Testing:
Verify that all system functions meet their specified requirements. Test the real-time fraud
detection, alerting, reporting, and data visualization features.
Non-Functional Testing:
Test non-functional aspects of the system, including performance, security, and scalability. Check
that the system can handle a high volume of transactions without performance degradation.
User Acceptance Testing (UAT):
Involve end-users and stakeholders in testing the system. Ensure that the system meets their
requirements and is user-friendly.
Regression Testing:
Continuously test the system as new features are added or changes are made to existing ones.
Ensure that new updates do not introduce issues or break existing functionality.
Security Testing:
Perform security testing, including penetration testing, to identify vulnerabilities in the system.
Ensure that sensitive data is protected, and there are no security breaches.
Performance Testing:
Evaluate the system's performance under different loads and conditions. Measure response times,
throughput, and scalability.
Data Quality Testing:
Test the quality of the data by verifying that data preprocessing steps handle missing values and
outliers appropriately. Ensure that transformed data maintains its integrity.
Visualization Testing:
Validate the accuracy and interactivity of data visualizations created with R. Ensure that charts,
graphs, and dashboards provide meaningful insights.
Testing should be an iterative process, with issues identified, documented, and addressed before
moving on to the next stage of development. It is essential to involve end-users and stakeholders
in the testing process to gather their feedback and ensure that the system meets their needs and
expectations.
Chapter 7
Results and Snapshots
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.261e+01 1.024e+01 -1.231 0.2182
V1 -1.730e-01 1.274e+00 -0.136 0.8920
V2 1.445e+00 4.231e+00 0.342 0.7327
V3 1.790e-01 2.406e-01 0.744 0.4569
V4 3.136e+00 7.178e+00 0.437 0.6622
V5 1.490e+00 3.804e+00 0.392 0.6952
V6 -1.243e-01 2.220e-01 -0.560 0.5756
V7 1.409e+00 4.226e+00 0.333 0.7388
V8 -3.525e-01 1.746e-01 -2.019 0.0435 *
V9 3.022e+00 8.673e+00 0.348 0.7275
V10 -2.896e+00 6.624e+00 -0.437 0.6620
V11 -9.769e-02 2.827e-01 -0.346 0.7297
V12 1.980e+00 6.567e+00 0.301 0.7630
V13 -7.167e-01 1.256e+00 -0.570 0.5684
V14 1.932e-01 3.289e+00 0.059 0.9532
V15 1.039e+00 2.893e+00 0.359 0.7195
V16 -2.982e+00 7.114e+00 -0.419 0.6751
V17 -1.818e+00 4.998e+00 -0.364 0.7160
V18 2.748e+00 8.132e+00 0.338 0.7354
V19 -1.632e+00 4.772e+00 -0.342 0.7323
V20 -6.993e-01 1.151e+00 -0.607 0.5436
V21 -4.508e-01 1.992e+00 -0.226 0.8209
V22 -1.404e+00 5.190e+00 -0.271 0.7868
V23 1.903e-01 6.119e-01 0.311 0.7559
V24 -1.289e-01 4.470e-01 -0.288 0.7731
V25 -5.784e-01 1.950e+00 -0.297 0.7668
V26 2.659e+00 9.350e+00 0.284 0.7761
V27 -4.540e-01 8.150e-01 -0.557 0.5775
V28 -6.639e-02 3.573e-01 -0.186 0.8526
Amount 9.026e-04 2.874e-03 0.314 0.7535
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Figure 7.1: Predicted values vs. residuals
Figure 7.2: Theoretical quantiles vs. std. deviance residuals
Figure 7.3: Predicted values vs. sqrt. of std. deviance residuals
Figure 7.4: Leverage vs. std. Pearson residuals
Future Work
The current Fraud Detection System can be expanded by adding more ways to secure the data
by adding extensive Machine Learning Applications and Techniques. So, in the near future, we
will be going over more research projects in order to understand more techniques which can be
applied in order to make the current model more efficient.
References
[1] Patil, S., Nemade, V., & Soni, P. K. (2018). Predictive modelling for credit card
fraud detection using data analytics. Procedia computer science, 132, 385-395.
[2] Awoyemi, J. O., Adetunmbi, A. O., & Oluwadare, S. A. (2017, October). Credit card
fraud detection using machine learning techniques: A comparative analysis. In 2017
International Conference on Computing Networking and Informatics (ICCNI) (pp. 1-9).
IEEE.
[3] Roy, A., Sun, J., Mahoney, R., Alonzi, L., Adams, S., & Beling, P. (2018, April).
Deep learning detecting fraud in credit card transactions. In 2018 Systems and Information
Engineering Design Symposium (SIEDS) (pp. 129-134). IEEE.
[4] Xuan, S., Liu, G., Li, Z., Zheng, L., Wang, S., & Jiang, C. (2018, March). Random
forest for credit card fraud detection. In 2018 IEEE 15th International Conference on
Networking, Sensing and Control (ICNSC) (pp. 1-6). IEEE.
[5] Jurgovsky, J., Granitzer, M., Ziegler, K., Calabretto, S., Portier, P. E., He-Guelton,
L., & Caelen, O. (2018). Sequence classification for credit-card fraud detection. Expert
Systems with Applications, 100, 234-245.
[6] Elgendy, N., & Elragal, A. (2014, July). Big data analytics: a literature review paper.
In Industrial Conference on Data Mining (pp. 214-227). Springer, Cham.
[8] Leppäaho, E., Ammad-ud-din, M., & Kaski, S. (2017). GFA: exploratory analysis of
multiple data sources with group factor analysis. The Journal of Machine Learning
Research, 18(1), 1294-1298.
[9] Andrienko, G., Andrienko, N., Drucker, S., Fekete, J. D., Fisher, D., Idreos, S., ... &
Stonebraker, M. Big Data Visualization and Analytics: Future Research Challenges and
Emerging Applications.
[10] Kamaruddin, S., & Ravi, V. (2016, August). Credit card fraud detection using big
data analytics: use of PSOAANN based one-class classification. In Proceedings of the
International Conference on Informatics and Analytics (pp. 1-8).
[11] Maniraj, S., Saini, A., Ahmed, S., & Sarkar, S. (2019). Credit card fraud detection
using machine learning and data science. International Journal of Engineering Research
& Technology (IJERT), 8(9). 10.17577/IJERTV8IS090031.
[12] Varmedja, Dejan & Karanovic, Mirjana & Sladojevic, Srdjan & Arsenovic, Marko &
Anderla, Andras. (2019). Credit Card Fraud Detection - Machine Learning methods. 1-5.
10.1109/INFOTEH.2019.8717766.
[13] Maniraj, S., Saini, A., Ahmed, S., & Sarkar, S. (2019). Credit card fraud detection
using machine learning and data science. International Journal of Engineering Research
& Technology (IJERT), 8(9). 10.17577/IJERTV8IS090031.
[14] Varmedja, Dejan & Karanovic, Mirjana & Sladojevic, Srdjan & Arsenovic, Marko &
Anderla, Andras. (2019). Credit Card Fraud Detection - Machine Learning methods. 1-5.
10.1109/INFOTEH.2019.8717766.