Bank Fraud Detection Project

This document discusses fraud detection in banking and other industries. It introduces visual analytics as an approach that helps detect fraud by combining data visualization, automated analysis, and human interaction. The remainder of the document presents a literature review and describes how visual analytics and machine learning algorithms such as decision trees can be applied to transaction data to detect fraudulent transactions.

CHAPTER 1

1. INTRODUCTION

Fraud detection is a topic applicable to many industries, including the banking and financial sectors, insurance, government agencies, law enforcement, and more. Fraud attempts have increased drastically in recent years, making fraud detection more important than ever. Despite the efforts of the affected institutions, hundreds of millions of dollars are lost to fraud every year. Since relatively few cases in a large population are fraudulent, finding them can be difficult.

In banking, fraud can involve using stolen credit cards, forging checks,
misleading accounting practices, etc. In insurance, 25% of claims contain some
form of fraud, resulting in approximately 10% of insurance payout dollars. Fraud
can range from exaggerated losses to deliberately causing an accident for the
payout. With all the different methods of fraud, finding it becomes harder still.

Figure 1.1 Financial Fraud Detection (FFD)

Transaction banking can be defined as the set of instruments and services that a bank offers to its customers and their partners to financially support their reciprocal exchanges of cash. In the banking sector, millions of transactions occur annually. Most of these transactions are legitimate, but some are fraudulent or criminal attempts. Such transactions erode the bank's trust in its customers, and customers in turn feel that their money is in bad hands. Because the data at hand are multidimensional, financial fraud detection is a complex task. Bank transaction data are multivariate and time variant, so this complex data model must be clearly understood by banking professionals in order to stop fraudulent transactions such as money laundering and unauthorized transfers.

Visual Analytics can be seen as an integral approach combining visualization, human factors, and data analysis. The figure illustrates the research areas related to Visual Analytics. Besides visualization and data analysis, human factors in particular, including the areas of cognition and perception, play an important role in the communication between the human and the computer, as well as in the decision-making process. With respect to visualization, Visual Analytics relates to the areas of Information Visualization and Computer Graphics; with respect to data analysis, it profits from methodologies developed in the fields of information retrieval, data management and knowledge representation, as well as data mining.

The Visual Analytics process combines automatic and visual analysis methods, tightly coupled through human interaction, in order to gain knowledge from data. The figure shows an abstract overview of the different stages (represented by ovals) and their transitions (arrows) in the Visual Analytics process.

In many application scenarios, heterogeneous data sources need to be integrated before visual or automatic analysis methods can be applied. Therefore, the first step is often to preprocess and transform the data to derive different representations for further exploration (as indicated by the Transformation arrow in the figure). Other typical preprocessing tasks include data cleaning, normalization, grouping, and the integration of heterogeneous data sources.

After the transformation, the analyst may choose between applying visual or automatic analysis methods. If automated analysis is used first, data mining methods are applied to generate models of the original data. Once a model is created, the analyst has to evaluate and refine it, which can best be done by interacting with the data. Visualizations allow the analysts to interact with the automatic methods by modifying parameters or selecting other analysis algorithms. Model visualization can then be used to evaluate the findings of the generated models. Alternating between visual and automatic methods is characteristic of the Visual Analytics process and leads to a continuous refinement and verification of preliminary results. Misleading results in an intermediate step can thus be discovered at an early stage, leading to better results and higher confidence.

If a visual data exploration is performed first, the user has to confirm the
generated hypotheses by an automated analysis. User interaction with the
visualization is needed to reveal insightful information, for instance by zooming in
on different data areas or by considering different visual views on the data. Findings
in the visualizations can be used to steer model building in the automatic analysis.
In summary, in the Visual Analytics Process knowledge can be gained from
visualization, automatic analysis, as well as the preceding interactions between
visualizations, models, and the human analysts.
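As a hedged illustration of this alternating loop in R (the data frame transactions and its fraud column are hypothetical names, not the project's actual dataset), an analyst might fit a model automatically, inspect it visually, and then refit it with adjusted parameters:

# Sketch of the Visual Analytics loop: automatic model building,
# visual inspection, and parameter refinement through interaction.
library(rpart)

# Automatic analysis: fit an initial decision tree on labelled transactions.
model <- rpart(fraud ~ ., data = transactions, method = "class")

# Visual analysis: plot the tree so the analyst can judge its structure.
plot(model, margin = 0.1); text(model, use.n = TRUE)

# Interaction: the analyst adjusts the complexity parameter and refits,
# repeating until the model and the visualization agree with domain knowledge.
model_refined <- rpart(fraud ~ ., data = transactions, method = "class",
                       control = rpart.control(cp = 0.05))
plot(model_refined, margin = 0.1); text(model_refined, use.n = TRUE)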

Figure 1.2 Visual Analytics (VA)

1.1 PROBLEM IDENTIFICATION


The existing system suffers from context complexity and many false alarms, and it is difficult to validate time-oriented frauds. In this work we evaluate and visualize fraudulent transactions.
1.1.1 Discussion with Guide
We discussed financial fraud detection systems and searched for datasets suited to visual analytics. The searches covered large volumes of data, and the transaction data of a private bank was finalized for evaluation. Various factors for detecting fraudulent events were discussed, and we decided to plot fraudulent bank transactions. Several factors were identified, suggestions were given on certain factors, and algorithms were discussed based on the scenario.

1.1.2 Analysis and Modification

The existing system is not time variant. As the number of transactions in banks increases tremendously, a fraud detection system is needed. The various crimes happening in the banking sector must be stopped or detected before they happen. Nowadays, data theft in the banking sector occurs at frequent intervals and many loan application frauds take place; to reduce these, we can apply a Visual Analytics (VA) approach to customer transactions.
1.1.3 Establish Project Statement
Millions of transactions happen daily in a bank, and some people make fraudulent transactions. This system detects fraudulent transactions and customers and visualizes them based on previous bank transactions. By considering the transactions of a group of bank customers, a new system is implemented using the decision tree algorithm. The main part is designed using R programming.
1.2 LITERATURE REVIEW
There are a number of surveys that focus on fraud detection. In 2002, Bolton and Hand [32] published a review of fraud detection approaches. They described the available tools for statistical fraud detection and identified the most used technologies in four areas: credit card fraud, money laundering, telecommunication fraud, and computer intrusion. Kou et al. [20] presented a survey of techniques for identifying the same types of fraud as described in [32]. The different approaches are broadly classified into two categories: misuse detection and anomaly detection. Both categories include techniques such as outlier detection, neural networks, expert systems, model-based reasoning, data mining, state transition analysis, and information visualization. These works helped us to understand diverse fraud domains and how they are normally tackled.

Looking at surveys of visual approaches for financial data, we identified FinanceVis [7], a browser tool indexing over 85 papers related to financial data visualization. FinanceVis was instrumental in analyzing how data similar to ours is usually visualized.
Motivated by a lack of information, Ko et al. [19] presented a survey of approaches for exploring financial data. In this work, financial data experts were interviewed concerning their preferences regarding data sources, automated techniques, visualizations, and interaction methods. When it comes to visual solutions to support FFD, Kirkland et al. [15] published one of the first works on fraud detection using visual techniques. They combined Artificial Intelligence (AI), visualization, pattern recognition, and data mining to support regulatory analysis, alerts (fraud detection), and knowledge discovery. In our approach, we use a similar combination of techniques, but we also provide means for an interactive exploration of the data.

WireVis's [4] main idea is to explore large amounts of transaction data using multiple coordinated views. In order to aid fraud detection, it highlights similarities between accounts based on keywords over time. Yet, WireVis does not support the detailed analysis of single accounts without clustering a set of accounts by their similar keyword usage. This is the approach most similar to EVA. However, instead of focusing on hierarchical analysis of keyword patterns within the transactions, EVA enables a broader and more flexible analysis.

A visual-analytics-based approach for financial data flows is presented in [34]. In this approach, data are aggregated in order to allow users to draw analytical conclusions and make transaction decisions. EventFlow [28] was designed to facilitate analysis, querying, and data transformation of temporal event datasets. The goal of this work is to create aggregated data representations to track entities and the events related to them.
When looking at approaches for event monitoring in general, Huang et al. [10]
presented a VA framework for stock market security. In order to reduce the
number of false alarms produced by traditional AI techniques, this work presents a
visualization approach combining a 3D tree map for market performance analysis
and a node-link diagram for network analysis.
Dilla et al. [6] presented the current needs in FFD. The authors proposed a theoretical framework to predict when and how investigators should apply VA techniques. They evaluated various visualization techniques and derived which visualizations support different cognitive processes. In addition, the authors suggested future challenges in this research area and discussed the efficacy of interactive data visualization for fraud detection, which we used as a starting point for our approach.
Carminati et al. [3] presented a semi-supervised online banking fraud analysis and decision support system based on profile generation and analysis. While this approach provides no visual support for fraud analysis, it is directly related to ours, since we also focus on profile analysis. However, we believe that VA methods have great potential to foster the investigation of the data and enable the analyst to better fine-tune the scoring system. In the health domain, Rind et al. [33] conducted a survey focusing on information visualization systems for exploring and querying electronic health records.
Moreover, Wagner et al. [36] presented a systematic overview and categorization of malware visualization systems from a VA perspective. The domains of both studies are similar to FFD, since they involve multivariate and temporal aspects. However, the FFD domain demands special consideration due to the complexity of the tasks involved.

1.3 OBJECTIVE
The main objective of this project is to evaluate and visualize fraudulent transactions by considering previous customer transaction data. To attain this, the decision tree and random forest algorithms are used. The work is implemented in R, with Tableau used to visualize the fraudulent transactions, and the system can suggest whether a transaction is liable or not.

CHAPTER 2

2. SYSTEM ANALYSIS
2.1 EXISTING SYSTEM
Financial institutions handle millions of transactions from clients per year. Although the majority of these transactions are legitimate, a small number of them are criminal attempts, which may cause serious harm to customers or to the financial institutions themselves. Thus, the trustworthiness of each transaction has to be assessed by the institution. However, due to the complex and multidimensional data at hand, financial fraud detection (FFD) is a difficult task.
2.1.1 Drawbacks
• No time-oriented analysis
• Large volumes of data
• Difficult to identify and classify frauds
2.2 PROPOSED SYSTEM
Millions of transactions happen in a bank, and bankers find it difficult to identify the fraudulent ones. This system aims to detect fraudulent transactions and visualize them based on the customers' previous transactions.
2.2.1 Advantages
• Detect fraudulent transactions.
• Identification and Classification of frauds.
2.3 FEASIBILITY STUDY
A feasibility study is a high-level capsule version of the entire system analysis and design process. In this phase, the feasibility of the proposed system is analyzed by clarifying the problem definition. Feasibility determines whether the system is worth doing and not a burden to the company. Once the problem definition has been approved, a logical model of the system can be developed. The search for alternative solutions has to be analyzed carefully.
The three major considerations involved in the feasibility study are:
1. Economical Feasibility
2. Operational Feasibility
3. Technical Feasibility
2.3.1 Economical Feasibility
Economic feasibility is carried out to check the economic impact that the system will have on the organization. It weighs the cost of developing and implementing the new system against the benefits to be obtained by having the new system in place. A simple economic analysis that gives an actual comparison of costs and benefits is much more meaningful in this case and also proves useful as the project progresses. The study conducted to determine economic feasibility showed that the developed system is within the budget. Some of the benefits include the absence of a database and independence from a restricted background, which make the system easy to work with.
2.3.2 Operational Feasibility
Operational feasibility is carried out to check whether the system provides a product apt to the requirements. The proposed system is beneficial only if it can be turned into an information system that meets the organization's operating requirements. This test of feasibility asks whether the system will work when it is developed and installed. Indicators of the operational feasibility of the project include sufficient support for the project from the users; the current system that was developed is well accepted by the users who were involved in the planning and development of the project.
2.3.3 Technical Feasibility
Determining technical feasibility is the trickiest part of the feasibility study, because at this stage there is no detailed study of the design of the system, only of its implementation. Its focus is mainly on the technical requirements of the system; hence it should be noted that there are no high demands on the available technical resources. The different technologies involved should be analyzed properly before the commencement of a project. One has to be very clear about the technologies required for the development of the new system. The developed system must have modest requirements, such that only minimal or no changes are required to implement it.

CHAPTER 3

3. SYSTEM SPECIFICATION
3.1 HARDWARE SPECIFICATION
System: Pentium IV, 2.4 GHz
Hard Disk: 40 GB
Monitor: 15-inch VGA Color
RAM: 4 GB

3.2 SOFTWARE SPECIFICATION


Operating System: Windows 8.1
Coding Language: R Programming
Database: Dataset of a bank
Visualization Tool: Tableau
Programming Language: R Programming

CHAPTER 4

4. SOFTWARE DESCRIPTION
4.1 FRONT END
4.1.1 Introduction to R Studio
RStudio is a free and open-source integrated development environment
(IDE) for R, a programming language for statistical computing and graphics.
RStudio was founded by JJ Allaire, creator of the programming language
ColdFusion. Hadley Wickham is the Chief Scientist at RStudio.
RStudio is available in two editions: RStudio Desktop, where the program is
run locally as a regular desktop application; and RStudio Server, which allows
accessing RStudio using a web browser while it is running on a remote Linux server.
Prepackaged distributions of RStudio Desktop are available for Windows, macOS,
and Linux.
RStudio is available in open source and commercial editions and runs on the
desktop (Windows, macOS, and Linux) or in a browser connected to RStudio Server
or RStudio Server Pro (Debian, Ubuntu, Red Hat Linux, CentOS, openSUSE and
SLES).
RStudio is written in the C++ programming language and uses the Qt
framework for its graphical user interface. Work on RStudio started around
December 2010, and the first public beta version (v0.92) was officially announced
in February 2011. Version 1.0 was released on 1 November 2016. Version 1.1 was
released on 9 October 2017.

Figure 4.1 R Studio
4.1.2 Introduction to Tableau
Tableau is a Business Intelligence tool for visually analyzing data. Users can create and distribute interactive, shareable dashboards, which depict the trends, variations, and density of the data in the form of graphs and charts. Tableau can connect to files, relational databases, and Big Data sources to acquire and process data. The software allows data blending and real-time collaboration, which makes it very unique. It is used by businesses, academic researchers, and many government organizations for visual data analysis. It is also positioned as a leader in Gartner's Magic Quadrant for Business Intelligence and Analytics Platforms. Using Rserve, we can run R scripts in Tableau.
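As a minimal sketch of that integration (assuming the Rserve package is installed from CRAN; the Tableau side is then configured through its external service connection settings, with R code embedded in calculated fields such as SCRIPT_REAL), the R session that Tableau connects to can be started like this:

# Start an Rserve session that Tableau can connect to.
install.packages("Rserve")   # one-time installation
library(Rserve)

# Launch the Rserve daemon on the default port (6311); Tableau is then
# pointed at this host and port in its external service connection dialog.
Rserve()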

Figure 4.2 Tableau


CHAPTER 5

5. PROJECT DESCRIPTION
5.1 PROBLEM DEFINITION
Financial institutions handle millions of transactions from clients per year. Although the majority of these transactions are legitimate, a small number of them are criminal attempts, which may cause serious harm to customers or to the financial institutions themselves. Thus, the trustworthiness of each transaction has to be assessed by the institution. However, due to the complex and multidimensional data at hand, financial fraud detection (FFD) is a difficult task.
5.2 OVERVIEW OF THE PROJECT
We developed a prototype using the visual analytics approach in order to enhance FFD techniques. We used R for automated analysis and Tableau for data visualization.
5.3 MODULES
• Data collection and pre-processing
• Data transformation
• Predictive analysis
• Evaluation and visualization
5.4 MODULE DESCRIPTION
5.4.1 Data Collection and Pre-Processing
The private bank data contain 42 variables, including the key variables "Deposit", "Withdrawal", "Profession", and "Gender", which relate to the three main pieces of information:
• Balance
• Withdrawal
• Customer information
The data are collected and pre-processed (unwanted attributes are removed and missing values are filled). The source dataset contains a header and double quotes, so the first task is to remove the header and the double quotes. We then split the data into month-wise transactions with different parameters such as withdrawal, balance, deposits, and funds transferred.
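A minimal sketch of this step in R is shown below; the file name, the quote handling, and the month-wise split are illustrative assumptions rather than the exact preprocessing script used in the project.

# Read the raw CSV, consuming the header row and stripping double quotes.
raw <- read.csv("bank_transactions.csv", header = TRUE, quote = "\"",
                stringsAsFactors = FALSE)

# Fill missing numeric values with the column mean.
num_cols <- sapply(raw, is.numeric)
raw[num_cols] <- lapply(raw[num_cols],
                        function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x })

# Split into month-wise transaction sets, assuming a "Month" column exists.
monthly <- split(raw, raw$Month)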
5.4.2 Data Transformation
Data transformation is the process of transforming the data into forms appropriate for the mining process. In this process the data are consolidated so that the resulting mining process is more efficient and the patterns found are easier to understand. It includes tasks such as normalization, attribute construction, aggregation, attribute subset selection, discretization, and generalization.
Normalization: the attribute data are scaled so as to fall within a small specified range, such as -2.0 to 2.0 or 0.0 to 2.0.
If a database design is not perfect, it may contain anomalies, which are like a bad dream for any database administrator; managing a database with anomalies is next to impossible.
Update anomalies: if data items are scattered and not linked to each other properly, strange situations can arise. For example, when we try to update one data item whose copies are scattered over several places, a few instances get updated properly while others are left with old values, leaving the database in an inconsistent state.
Deletion anomalies: we try to delete a record, but parts of it are left undeleted because, unknown to us, the data is also saved somewhere else.
Insert anomalies: we try to insert data into a record that does not exist at all.
Normalization is a method to remove all these anomalies and bring the database to a consistent state.
Attribute construction: new attributes are constructed and added from the given set of attributes to help the mining process.
Aggregation: an operation applied to the data that computes or combines summaries of the previous data.
Discretization: divides the range of a continuous attribute into intervals.
Generalization: replaces low-level or primitive data with higher-level concepts through the use of concept hierarchies.
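As an illustrative sketch of the normalization task (the target range and the column name are assumptions), a simple min-max rescaling in R could look like this:

# Min-max normalization: rescale a numeric vector into a target range.
normalize <- function(x, lower = 0, upper = 2) {
  lower + (x - min(x, na.rm = TRUE)) /
    (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * (upper - lower)
}

# Example: scale a hypothetical withdrawal column into the range 0.0 to 2.0.
# dataset$Withdraw.1. <- normalize(dataset$Withdraw.1., 0, 2)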
5.4.3 Predictive Model Building

Decision trees are a powerful algorithm for both prediction and classification. A decision tree represents a set of rules that can be understood by humans. It is a classifier in the form of a tree structure, made up of the following elements (a short R sketch follows the list):
• Decision node: specifies a test on a single attribute.
• Leaf node: indicates the value of the target attribute.
• Edge: a split on one attribute.
• Path: a conjunction of tests that leads to the final decision.
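A minimal sketch of building such a tree with the rpart package is shown below; the dataset_train and dataset_test objects and the loan target column mirror the appendix code, but should be treated as assumptions here.

# Fit a classification tree on the training split and inspect its rules.
library(rpart)

tree_model <- rpart(loan ~ ., data = dataset_train, method = "class")

# Each internal node is a test on a single attribute (decision node);
# each terminal node carries the predicted target value (leaf node).
plot(tree_model, margin = 0.1)
text(tree_model, use.n = TRUE, cex = 0.8)

# Predict the target for the held-out split.
pred <- predict(tree_model, newdata = dataset_test, type = "class")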
5.4.4 Evaluation and Visualization
After analyzing the customers' transactions for the past year, a visual representation is made showing which transactions are fraudulent, the types of fraud, and which customers are fraudulent. It is done using R programming.
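A hedged sketch of such a visualization with ggplot2 is given below; the fraud_results data frame and its customer and fraud_type columns are hypothetical stand-ins for the output of the predictive model.

# Plot the number of flagged transactions per customer, coloured by fraud type.
library(ggplot2)

ggplot(fraud_results, aes(x = customer, fill = fraud_type)) +
  geom_bar() +
  labs(title = "Fraudulent transactions per customer",
       x = "Customer", y = "Number of flagged transactions") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))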

Figure 5.1 Visualization of data

5.5 DATAFLOW DIAGRAM

Figure 5.2 Dataflow Diagram

5.6 SYSTEM DESIGN
5.6.1 Use case Diagram

Figure 5.3 Use case Diagram

5.6.2 System Architecture

Figure 5.4 Architecture Diagram

5.7 INPUT DESIGN

The input consists of a set of 42 attributes stored in CSV files. Each CSV file has 500 records and covers the past year. The input to the project is spreadsheet content in English, with complete information about the bank dataset. The data are then transformed, and only the needed fields are taken for evaluation.
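A small sketch of how such monthly CSV files could be read and reduced to the needed columns in R follows; the folder name, file pattern, and column subset are assumptions for illustration.

# Read all monthly CSV files from a folder and keep only the needed columns.
files  <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
needed <- c("Gender", "Job.Classification", "Opening.Balance",
            "Withdraw.1.", "Balance.1.")   # illustrative subset

transactions <- do.call(rbind, lapply(files, function(f) {
  df <- read.csv(f, stringsAsFactors = FALSE)
  df[, intersect(needed, names(df)), drop = FALSE]
}))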
5.8 OUTPUT DESIGN
Output generally refers to the evaluation and visualization of the fraudulent transactions. In this project the output is a graphical representation of the fraudulent transaction data and of which customers are making liable transactions.

CHAPTER 6
6 SYSTEM TESTING
6.1 Architecture Testing

RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. Architectural testing is crucial to ensure the success of the project: a poorly or improperly designed system may lead to performance degradation, and the system could fail to meet the requirements. At a minimum, performance and failover test services should be carried out in an R environment. Performance testing includes testing of job completion time, memory utilization, data throughput, and similar system metrics, while the motive of the failover test service is to verify that data processing occurs seamlessly in case of failure of data nodes.

6.2 Performance Testing

Performance testing for visual analytics includes the following main activities.

6.2.1 Data Ingestion and Throughput

In this stage, the tester verifies how fast the system can consume data from various data sources. Testing involves identifying the number of messages that the queue can process in a given time frame. It also includes how quickly data can be inserted into the underlying data store, for example the insertion rate into a MongoDB or Cassandra database.

6.2.2 Data Processing

It involves verifying the speed with which the queries or MapReduce jobs are executed. It also includes testing the data processing in isolation when the underlying data store is populated with the data sets, for example running MapReduce jobs on the underlying HDFS.

6.2.3 Sub-Component Performance

These systems are made up of multiple components, and it is essential to test each of these components in isolation: for example, how quickly messages are indexed and consumed, MapReduce jobs, query performance, search, and so on.

6.3 Structural Testing

Structural testing is performed to understand what is missing in our test suite and to complement functional testing. It helps to identify obvious inadequacies.

6.3.1 Categories of Structural Testing

Structural testing is categorized into four divisions:

• Statement Coverage: the weakest form of testing; it requires that every statement in the code is executed at least once.
• Branch Coverage: each branch condition of the program is tested for both its true and false values.
• Path Coverage: each path of the program is executed at least once; it tests the individual paths through the program.
• Condition Coverage: checks all possible combinations of conditions. For conditional branches, the TRUE branch and the FALSE branch are each executed at least once. Unlike branch coverage, it tests both conditional and non-conditional branches.

6.4 Parallel Testing

Parallel testing means testing multiple applications or subcomponents of one application concurrently to reduce the test time. In parallel testing, the tester runs two different versions of the software concurrently with the same input. The aim is to find out whether the legacy system and the new system behave the same or differently, and to ensure that the new system is capable enough to run the software efficiently.
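In this project's setting, a hedged sketch of such a comparison in R is to run the old and the new classifier on the same test split and compare their predictions; the tree_model and forest_model objects are assumed to exist from earlier model-building steps.

# Run two model versions on the same input and compare their outputs.
pred_old <- predict(tree_model, newdata = dataset_test, type = "class")
pred_new <- predict(forest_model, newdata = dataset_test)  # class labels by default

# Cross-tabulate agreements and disagreements between the two versions.
print(table(old = pred_old, new = pred_new))

# TRUE only if both versions label every test transaction identically.
identical(as.character(pred_old), as.character(pred_new))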

Parallel testing is done to make sure the new version of the application performs correctly, that results are consistent between the new and old versions, to check whether the data format between the two versions has changed, and to check the integrity of the new application.

In such cases, testers need to perform parallel testing in order to evaluate whether data migration is done successfully and to check that the changes in the new version do not affect the system's functions. The tester must verify that the changes are executed properly and that the user gets the desired output as per the requirements. Parallel testing has two levels of criteria.

6.4.1 Parallel test entry Criteria

Parallel test entry criteria define the tasks that must be satisfied before parallel
testing can be efficiently executed.

6.4.2 Parallel test exit Criteria

Parallel test exit criteria define the successful conclusion of the parallel testing stage.

6.5 Loop Testing

Loop testing is a white box testing technique used to test the loops in the program.

6.5.1 Types of loop Tested

The types of loop tested are,

• Simple loop
• Nested loop
• Concatenated loop
• Unstructured loop

6.5.2 Loop Testing is done for the following reasons

• Testing can fix loop repetition issues
• Loop testing can reveal performance/capacity bottlenecks
• By testing loops, uninitialized variables in the loop can be determined
• It helps to identify loop initialization problems

CHAPTER 7

7. SYSTEM IMPLEMENTATION

System implementation is the stage of the project where the theoretical design is turned into a working system. It can thus be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work properly and effectively.
7.1 DATA COLLECTION AND PRE-PROCESSING
The private bank data contain 42 variables, including the key variables "Deposit", "Withdrawal", "Profession", and "Gender", which relate to the three main pieces of information:
• Balance
• Withdrawal
• Customer information
The data are collected and pre-processed (unwanted attributes are removed and missing values are filled). The source dataset contains a header and double quotes, so the first task is to remove the header and the double quotes. We then split the data into month-wise transactions with different parameters such as withdrawal, balance, deposits, and funds transferred.
7.2 DATA TRANSFORMATION
Data transformation is the process of transforming the data into forms appropriate for the mining process. In this process the data are consolidated so that the resulting mining process is more efficient and the patterns found are easier to understand. It includes tasks such as normalization, attribute construction, aggregation, attribute subset selection, discretization, and generalization.
Normalization: the attribute data are scaled so as to fall within a small specified range, such as -2.0 to 2.0 or 0.0 to 2.0.
If a database design is not perfect, it may contain anomalies, which are like a bad dream for any database administrator; managing a database with anomalies is next to impossible.
Update anomalies: if data items are scattered and not linked to each other properly, strange situations can arise. For example, when we try to update one data item whose copies are scattered over several places, a few instances get updated properly while others are left with old values, leaving the database in an inconsistent state.
Deletion anomalies: we try to delete a record, but parts of it are left undeleted because, unknown to us, the data is also saved somewhere else.
Insert anomalies: we try to insert data into a record that does not exist at all.
Normalization is a method to remove all these anomalies and bring the database to a consistent state.
7.3 PREDICTIVE MODEL BUILDING
Decision trees are a powerful algorithm for both prediction and classification. A decision tree represents a set of rules that can be understood by humans. It is a classifier in the form of a tree structure, made up of the following elements:
• Decision node: specifies a test on a single attribute.
• Leaf node: indicates the value of the target attribute.
• Edge: a split on one attribute.
• Path: a conjunction of tests that leads to the final decision.

Figure 7.1 Decision tree
7.4 EVALUATION AND VISUALIZATION
After analyzing the customer's transactions for the past year, a visual representation is made showing which transactions are fraudulent, the types of fraud, and which customers are fraudulent. It is done using R programming.

CHAPTER 8
8. CONCLUSION AND FUTURE ENHANCEMENTS
8.1 CONCLUSION
This concludes the contributions to the bank FFD system and identifies the drawbacks encountered during predictive model building and evaluation. The project aims to detect fraudulent transactions in a bank customer transaction dataset: millions of transactions occur every day, and the task is to find which of them are fraudulent and who the fraudsters are. To meet this objective, the project applies the decision tree and random forest algorithms.

The aim of the project is to detect fraudulent transactions and visualize them from a user's transactions over a year. Some detections may be wrong; the system does not achieve 100% accuracy.
8.2 FUTURE ENHANCEMENTS
Network analysis: although our prototype shows promising results for investigating long time intervals of transactions and relating a small number of accounts, we do not yet support the investigation of networks of accounts. An interactive network visualization would allow investigators to better reason about suspicious money transfer relationships and patterns within their contexts.
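A hedged sketch of how such a network view could be prototyped in R with the igraph package is shown below; the transfers data frame with from, to, and amount columns is a hypothetical future input, not data used in this project.

# Build and plot a directed account-to-account transfer network.
library(igraph)

transfers <- data.frame(from   = c("A101", "A101", "B202", "C303"),
                        to     = c("B202", "C303", "C303", "A101"),
                        amount = c(5000, 12000, 700, 43000))

g <- graph_from_data_frame(transfers, directed = TRUE)

# Scale edge width by transfer amount so large money flows stand out.
plot(g, edge.width = transfers$amount / 10000,
     edge.arrow.size = 0.5, vertex.color = "lightblue")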

APPENDIX 1 SOURCE CODE

RANDOM FOREST :
> dataset <- read.csv("C:/Users/silamparasan/Desktop/diabetes.csv")
> View(dataset)
> set.seed(2)
> id<-sample(2,nrow(dataset),prob = c(0.7,0.3),replace = TRUE)
> dataset_train<-dataset[id==1,]
> dataset_test<-dataset[id==1,]
> install.packages("randomForest")
> library(randomForest)
> dataset$Balance.12.<-as.factor(dataset$Balance.12.)
> dataset_train$Balance.12.<-as.factor(dataset_train$Balance.12.)
> dataset_test$Balance.12.<-as.factor(dataset_test$Balance.12.)
> bestmtry<-tuneRF(dataset_train,dataset_train$Balance.12.,stepFactor = 1.2,impr
ove = 0.01,trace = T,plot = T)
mtry = 3 OOB error = 0%
Searching left ...
Searching right ...
> library(randomForest)
> data_for<-randomForest(Balance.12.~.,data=dataset_train)
> data_for
Call:
 randomForest(formula = Balance.12. ~ ., data = dataset_train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 26.5%
Confusion matrix:
NO YES class.error
NO 294 54 0.1551724
YES 87 97 0.4728261
> importance(data_for)
> varImpPlot(data_for)
> pred1_data<-predict(data_for,newdata = dataset_test,type = "class")
> pred1_data
> library(caret)
> confusionMatrix(table(pred1_data,dataset_test$Balance.12.))
Confusion Matrix and Statistics
pred1_data NO YES
NO 25 0
YES 0 43
Accuracy : 0.9855
95% CI : (0.9931, 1)
No Information Rate : 0.6541
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.6541
Detection Rate : 0.6541
Detection Prevalence : 0.6541
Balanced Accuracy : 1.00
'Positive' Class : NO

DECISION TREE:
> dataset <- read.csv("C:/Users/silamparasan/Desktop/dataset.csv")
> View(dataset)
> set.seed(3)
> id<-sample(2,nrow(dataset),prob = c(0.7,0.3),replace = TRUE)
> dataset_train<-dataset[id==1,]
> dataset_test<-dataset[id==1,]
> library(rpart)
> nrow(dataset)
[1] 100
> nrow(dataset_test)
[1] 69
> nrow(dataset_train)
[1] 69
> colnames(dataset)
[1] "Account.nmuber" "Name" "Surname" "Gender"
[5] "Job.Classification" "Opening.Balance" "X01.Jan.18" "Withdraw..1."
[9] "Balance.1." "X01.Feb.18" "Withdraw.2." "Balance.2."
[13] "X01.Mar.18" "Withdraw.3." "Balance.3." "X01.Apr.18"
[17] "Withdraw.4." "Balance.4." "X01.May.18" "Withdraw.5."
[21] "Balance.5." "X01.Jun.18" "Withdraw.6." "Balance.6."
[25] "X01.Jul.18" "Withdraw.7." "Balance.7." "X01.Aug.18"
[29] "Withdraw.8." "Balance.8." "X01.Sep.18" "Withdraw.9."
[33] "Balance.9." "X01.Oct.18" "Withdraw.10." "Balance.10."
[37] "X01.Nov.18" "Withdraw.11." "Balance.11." "X01.Dec.18"
[41] "Withdraw.12." "Balance.12." "loan"
> dataset_model<-rpart(loan~.,data=dataset_train)
> plot(dataset_model,margin = 0.1)
> text(dataset_model,use.n = TRUE)
Warning message:
In labels.rpart(x, minlength = minlength) :
more than 52 levels in a predicting factor, truncated for printout
> text(dataset_model,use.n = TRUE,pretty = TRUE,cex = 0.8)
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
Warning message:
package ‘ggplot2’ was built under R version 3.4.4
> library(lattice)
> pred_data<-predict(dataset_model,newdata = dataset_test,type = "class")
> pred_data
  1   3   4   5   6   7   8   9  10  11  12  13  14  17  20  21  22  23  24  25  27  29  31
yes yes yes  no  no  no yes yes yes yes yes yes yes yes  no  no  no  no  no  no  no  no  no
 32  33  34  35  36  38  39  40  41  43  44  45  46  47  48  49  51  52  57  58  59  60  62
yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes  no  no  no  no  no  no yes
 64  66  68  69  70  75  76  77  78  82  83  84  87  88  89  91  92  93  96  97  98  99 100
yes yes  no yes  no yes yes yes yes  no  no  no  no  no  no yes yes yes yes yes yes yes yes
Levels: no yes
> library(caret)
> confusionMatrix(table(pred_data,dataset_test$loan))
Confusion Matrix and Statistics
pred_data no yes
no 25 1
yes 0 43
Accuracy : 0.9855
95% CI : (0.9219, 0.9996)
No Information Rate : 0.6377
P-Value [Acc > NIR] : 1.324e-12
Kappa : 0.9689
Mcnemar's Test P-Value : 1
Sensitivity : 1.0000
Specificity : 0.9773
Pos Pred Value : 0.9615
Neg Pred Value : 1.0000
Prevalence : 0.3623
Detection Rate : 0.3623
Detection Prevalence : 0.3768
Balanced Accuracy : 0.9886
'Positive' Class : no

APPENDIX 2
SCREENSHOTS
DATASET

Figure A2.1 Dataset

R STUDIO DESKTOP

Figure A2.2 Rstudio desktop

IMPORT DATA

Figure A2.3 Import data

EXECUTION PART

Figure A2.4 Execution part

GENERATE DECISION TREE

Figure A2.5 Generate decision tree

STATISTICAL REPORT

Figure A2.6 Statistical report

TABLEAU –IMPORT DATA

Figure A2.7 Tableau-import data

EXTRACTING DATA

Figure A2.8 Extracting data

ANALYTICS PART

Figure A2.9 Analytics part

DETECTING FRAUDULENT EVENTS

Figure A2.10 Detecting fraudulent events

JOB CLASSIFICATION

Figure A2.11 Job classification

REFERENCES

[1] Aigner, W., Miksch, S., Schumann, H., and Tominski, C. Visualization of Time-Oriented Data. Springer Science & Business Media, 2011.

[2] Andrienko, N. and Andrienko, G. Exploratory Analysis of Spatial and Temporal Data: A Systematic Approach. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. doi: 10.1007/3-540-31190-4

[3] Carminati, M., Caron, R., Maggi, F., Epifani, E., and Zanero, S. BankSealer: an online banking fraud analysis and decision support system. In IFIP International Information Security Conference, pp. 380–394. Springer, 2014.

[4] Chang, R., Ghoniem, M., Kosara, R., Ribarsky, W., Yang, J., Suma, E., Ziemkiewicz, C., Kern, D., and Sudjianto, A. WireVis: Visualization of categorical, time-varying data from financial transactions. In Visual Analytics Science and Technology (VAST), IEEE Symposium on, pp. 155–162. IEEE, 2007.

[5] Cleveland, W. S. Graphical methods for data presentation: Full scale breaks, dot charts, and multibased logging. The American Statistician, 38(4):270–280, 1984.

[6] Dilla, W. N. and Raschke, R. L. Data visualization for fraud detection: Practice implications and a call for future research. International Journal of Accounting Information Systems, 16:1–22, 2015.
[7] Dumas, M., McGuffin, M. J., and Lemieux, V. L. FinanceVis.net – a visual survey of financial data visualizations. In Poster Abstracts of IEEE VIS 2014, November 2014. Poster and Extended Abstract.

[8] Erste Bank, Austria. Erste Group IT International. https://www.erstegroupit.com/en/home.

[9] Harrower, M. and Brewer, C. A. ColorBrewer.org: an online tool for selecting colour schemes for maps. The Cartographic Journal, 40(1):27–37, 2003.

[10] Huang, M. L., Liang, J., and Nguyen, Q. V. A visualization approach for frauds detection in financial market. In Information Visualisation, 13th International Conference, pp. 197–202. IEEE, 2009.

[11] Isenberg, T., Isenberg, P., Chen, J., Sedlmair, M., and Möller, T. A systematic review on the practice of evaluating visualization. IEEE Transactions on Visualization and Computer Graphics, 19(12):2818–2827, 2013.

[12] Jaishankar, K. Cyber Criminology: Exploring Internet Crimes and Criminal Behavior. CRC Press, 2011.

[13] Keim, D. A., Mansmann, F., Schneidewind, J., Thomas, J., and Ziegler, H. Visual analytics: Scope and challenges. Springer, 2008.

[14] Kielman, J., Thomas, J., and May, R. Foundations and frontiers in visual analytics. Information Visualization, 8(4):239, 2009.
[15] Kirkland, J. D., Senator, T. E., Hayden, J. J., Dybala, J., Goldberg, H. G., and Shyr, P. The NASD Regulation Advanced-Detection System (ADS). AI Magazine, 20(1):55, 1999.

[16] Klein, G. Seeing What Others Don't: The Remarkable Ways We Gain Insights. PublicAffairs, 2013.

[17] Klein, G., Moon, B., and Hoffman, R. R. Making sense of sensemaking 1: Alternative perspectives. IEEE Intelligent Systems, 21(4):70–73, July 2006.

[18] Klein, G., Moon, B., and Hoffman, R. R. Making sense of sensemaking 2: A macrocognitive model. IEEE Intelligent Systems, 21(5):88–92, Sept. 2006.

[19] Ko, S., Cho, I., Afzal, S., Yau, C., Chae, J., Malik, A., Beck, K., Jang, Y., et al. A survey on visual analysis approaches for financial data. In Computer Graphics Forum, vol. 35, pp. 599–617. Wiley Online Library, 2016.

[20] Kou, Y., Sirwongwattana, S., and Huang, Y.-P. Survey of fraud detection techniques. In Networking, Sensing and Control, 2004 IEEE International Conference on, vol. 2, pp. 749–754, 2004.

[21] Kriglstein, S. and Pohl, M. Choosing the Right Sample? Experiences of Selecting Participants for Visualization Evaluation. In Aigner, W., et al., eds., EuroVis Workshop on Reproducibility, Verification, and Validation in Visualization (EuroRV3). The Eurographics Association, 2015. doi: 10.2312/eurorv3.20151146

[22] Leite, R. A., Gschwandtner, T., Miksch, S., Gstrein, E., and Kuntner, J. Visual analytics for fraud detection and monitoring. In Visual Analytics Science and Technology (VAST), 2015 IEEE Conference on, pp. 201–202. IEEE, 2015.

[23] Luell, J. Employee fraud detection under real world conditions. PhD thesis, University of Zurich, 2010.

[24] Mackinlay, J. Automating the design of graphical presentations of relational information. ACM Transactions on Graphics (TOG), 5(2):110–141, 1986.

[25] Microsoft. Excel. office.microsoft.com/en-us/excel/ (accessed: 2016-12-09).

[26] Microsoft. PowerPoint. office.microsoft.com/en-us/powerpoint/ (accessed: 2016-12-09).

[27] Miksch, S. and Aigner, W. A matter of time: Applying a data–users–tasks design triangle to visual analytics of time-oriented data. Computers & Graphics, 38:286–290, 2014.

[28] Monroe, M., Lan, R., Lee, H., Plaisant, C., and Shneiderman, B. Temporal event sequence simplification. IEEE Transactions on Visualization and Computer Graphics, 19(12):2227–2236, 2013.
