Credit Card Project-2
Chapter 1
INTRODUCTION
1.1 Overview
In 2015, credit, debit, and prepaid cards generated over $31 trillion in total volume worldwide, with
fraud losses reaching over $21 billion [1]. In that same year there were over 225 billion purchase
transactions, a figure projected to surpass 600 billion by 2025 [2]. Fraud associated with credit,
debit, and prepaid cards is a significant and growing issue for consumers, businesses, and the
financial industry. Historically, the software solutions issuers used to combat credit card fraud
closely followed progress in classification, clustering, and pattern recognition [3, 4]. Today, most
Fraud Detection Systems (FDS) use increasingly sophisticated machine learning algorithms to
learn and detect fraudulent patterns in real time, as well as offline, with minimal disturbance of
genuine transactions [5]. Generally, an FDS must address several inherent challenges: extreme
imbalance of the dataset, since frauds represent only a fraction of total transactions; distributions
that evolve due to changing consumer behaviour; and assessment challenges that come with
real-time data processing [6]. For example, difficulties arise when learning from an unbalanced
dataset because many machine intelligence methods are not designed to handle extremely large
differences between class sizes [5]. Dynamic trends within the data also require robust algorithms
with a high tolerance for concept drift in legitimate consumer behaviour [7].
Although specialized techniques exist that can handle large class imbalance, such as outlier
detection, fuzzy inference systems, and knowledge-based systems [8, 9], current state-of-the-art
research suggests that conventional algorithms may in fact be used with success if the data is
sampled to produce equivalent class sizes [10, 11, 12]. The newest sampling techniques create
synthetic minority samples using a nearest-neighbour algorithm such as k-Nearest-Neighbours (kNN).
This is valuable because it means a wider range of typical, in some cases off-the-shelf,
classification algorithms may be used, which mitigates the algorithmic limitations caused by
extreme class imbalance. Not only can fraud detection capabilities potentially increase due to the
larger scope of applicable methods; the cost of development can also decrease due to reduced
reliance on highly specialized niche methods, expert systems, and continued research into
algorithms that handle class imbalance directly.
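The sampling idea described above is the basis of SMOTE-style oversampling: new minority samples are interpolated between a minority point and one of its nearest minority neighbours. The following is only a minimal plain-NumPy sketch for illustration; a real project would typically use a library such as imbalanced-learn.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=3, rng=None):
    """Minimal SMOTE-style oversampling: pick a random minority point,
    pick one of its k nearest minority neighbours, and interpolate a
    synthetic point between the two."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_minority[:, None] - X_minority[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours per point
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = neighbours[i, rng.integers(k)]
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# toy minority class of 5 fraud-like points in 2-D
X_fraud = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.2, 0.1], [0.1, 0.2]])
X_new = smote_sketch(X_fraud, n_new=10, rng=42)
print(X_new.shape)  # (10, 2)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class.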
In existing credit card fraud detection business processes, a fraudulent transaction is detected only
after the transaction is complete. This makes fraud difficult to identify, and the resulting losses
must be borne by the issuing authorities. The Hidden Markov Model is a statistical tool used by
engineers and scientists to solve a variety of problems. In this paper, it is shown that credit card
fraud can be detected using a Hidden Markov Model during transactions. The Hidden Markov Model
helps to obtain high fraud coverage combined with a low false-alarm rate.
Existing web-services-based fraud detection systems need labeled data for both genuine and
fraudulent transactions, and new types of fraud cannot be found with these techniques.
The credit card fraud dataset used in this paper is available from Kaggle.com [13] and contains a
subset of online European credit card transactions made over a period of two days in September 2013,
consisting of a highly imbalanced 492 frauds out of 284,807 total transactions [5, 6, 7, 14, 15]. For
confidentiality, the dataset is provided as 28 unlabeled columns resulting from a PCA
transformation. Additionally, there are three labeled columns: Class, Time, and Amount.
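The imbalance reported above can be quantified with a few lines of pandas. The counts below are taken directly from the figures in the text; the file name creditcard.csv in the comment is an assumption about the Kaggle download.

```python
import pandas as pd

# Class counts as reported for the Kaggle dataset: 492 frauds
# out of 284,807 transactions (Class = 1 marks fraud).
counts = pd.Series({0: 284807 - 492, 1: 492}, name="Class")
fraud_ratio = counts[1] / counts.sum()
print(f"fraud ratio: {fraud_ratio:.6f}")   # about 0.17 % of all rows

# On the real file the same numbers would come from:
#   df = pd.read_csv("creditcard.csv")     # path is an assumption
#   df["Class"].value_counts()
```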
We will be implementing two major algorithms, namely the Isolation Forest algorithm and the
Local Outlier Factor algorithm.
All of these technologies have their pros and cons. Association rule mining is a simple method, but
it initially needs a large dataset in which to find frequent item sets. Neural networks can be applied
in both supervised and unsupervised approaches; the unsupervised approach is somewhat more
complex but gives more optimized results. The main motive of our project is therefore to present
the important technologies that can detect fraud as early as possible and so avoid as much loss as
possible.
There are many emerging technologies that are able to detect credit card fraud. Some promising
technologies that work on particular parameters and are able to detect fraud early are listed below:
Kenneth Aguilar, Cesar Ponce et al. state that every human has particular behavioral as well as
physiological characteristics, as depicted in Figure 2.1. Behavioral characteristics include a
person's voice, signature, and keystroke dynamics, while physiological characteristics include
fingerprints, face image, and hand geometry. Biometric data mining is an application of
knowledge discovery techniques in which biometric information is analyzed with the goal of
identifying patterns.
Behavioral Characteristics:
According to Revett and Henrique Santos, the following characteristics can be used to
identify patterns. Figure 2.1 shows the types of behavioral characteristics.
Figure 2.1 Types of Characteristics
a) Keystroke Patterns: Revett and Henrique Santos defined keystroke patterns in terms of
key-press duration and latency. According to them, every person's pattern of striking keys is
unique, which helps to distinguish legitimate users from fraudulent ones.
b) Mouse Movement: User identification during mouse movement is done by measuring the
temperature and humidity of the user's palm and the intensity of pressing. These parameters can
help recognize suspicious behavior.
2.1.2 Learning
Learning is generally done either with or without the help of a teacher; the division of learning
takes place as shown in Figure 2.2. Learning that takes place under the supervision of a teacher is
termed supervised learning, while learning without the guidance of a teacher is termed
unsupervised learning. These are explored below:
1) Supervised Learning
According to Patdar and Lokesh Sharma, the following technologies are based on the supervised
learning approach, in which an external teacher checks the output.
Support Vector Machine (SVM)
In supervised learning, Vapnik introduced the idea of the Support Vector Machine. Joseph King-Fun
Pun described that in this classification algorithm a hyperplane can be constructed as a
decision plane that distinguishes between fraudulent and legitimate transactions. This
hyperplane separates the different classes of data. The Support Vector Machine maximizes the
geometric margin while simultaneously minimizing the empirical classification error, so it is
also called a maximum-margin classifier. The separating hyperplane is the plane that maximizes
the distance between the two parallel hyperplanes bounding the classes.
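As an illustration of the maximum-margin idea, a linear SVM can be fitted on toy two-class data with scikit-learn. The report does not prescribe an implementation, so this is only a sketch with made-up data.

```python
import numpy as np
from sklearn.svm import SVC

# toy, linearly separable data: "legitimate" around (0, 0), "fraud" around (3, 3)
rng = np.random.default_rng(0)
X_legit = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
X_fraud = rng.normal(loc=3.0, scale=0.3, size=(20, 2))
X = np.vstack([X_legit, X_fraud])
y = np.array([0] * 20 + [1] * 20)

# a linear kernel finds the maximum-margin separating hyperplane
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[0.1, -0.2], [2.9, 3.1]]))  # [0 1]
```

The fitted hyperplane sits midway between the two clusters, as far as possible from the closest points of each class (the support vectors).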
2) Unsupervised Learning
In contrast to supervised learning, unsupervised or self-organized learning does not require an
external teacher. Quah and Sriganesh explain that during the training session the neural network
receives a number of different input patterns, discovers significant features in these patterns, and
learns how to classify input patterns into different categories.
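As a simple illustration of learning without a teacher, the sketch below uses k-means clustering, a stand-in for the neural approach described above, to discover two categories in unlabeled data.

```python
import numpy as np
from sklearn.cluster import KMeans

# two unlabeled groups of input patterns
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 0.2, (30, 2))])

# no labels are given; the algorithm discovers the two categories itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.predict(X)
# points from the same group fall in the same discovered cluster
print(len(set(labels[:30])), len(set(labels[30:])))  # 1 1
```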
Chapter 3
SYSTEM REQUIREMENT SPECIFICATION
A system requirements specification (SRS) is a specification for a software system that gives a
complete behavioral description of the system to be developed. It also contains functional and
non-functional requirements.
3.1 Objective
The system requirement specification document details all necessary requirements for the
development of the project. A clear and thorough understanding of the product to be developed is
important in order to derive the requirements, and can be achieved only after detailed
communication with the project team.
3.2 Functional Requirements
A function is described by a set of inputs, behaviors, and outputs. Functional requirements may
comprise calculations, technical details, data modifications, processing, and other specific
functionality defining what a system should accomplish. Figure 3.1 shows the use case diagram of
the credit card validation system.
Figure 3.2 depicts the data flow diagram of our credit card fraud detection system. The system
fetches transaction data, then a fraction of the collected data is selected. Using the Isolation Forest
algorithm and the Local Outlier Factor algorithm, we can determine whether a transaction is
fraudulent or not.
3.3 Non-Functional Requirements
A non-functional requirement specifies criteria that can be used to judge the operation of a system
rather than specific behaviors. It differs from a functional requirement, which specifies specific
behavior or functions; a non-functional requirement instead defines how the system is supposed to
be. They are often expressed in the form "the system shall be <requirement>".
The main idea, which is different from other popular outlier detection methods, is that Isolation
Forest explicitly identifies anomalies instead of profiling normal data points. Isolation Forest, like any
tree ensemble method, is built on the basis of decision trees. In these trees, partitions are created by
first randomly selecting a feature and then selecting a random split value between the minimum and
maximum value of the selected feature.
In principle, outliers are less frequent than regular observations and are different from them in terms
of values (they lie further away from the regular observations in the feature space). That is why by
using such random partitioning they should be identified closer to the root of the tree (shorter average
path length, i.e., the number of edges an observation must pass in the tree going from the root to the
terminal node), with fewer splits necessary.
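This isolation behaviour can be demonstrated with scikit-learn's IsolationForest; the toy data below is illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 0.5, (200, 2))     # dense cluster of regular points
X_outlier = np.array([[6.0, 6.0]])          # a point far from the cluster

iso = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
iso.fit(np.vstack([X_normal, X_outlier]))

# predict() returns +1 for inliers and -1 for anomalies
print(iso.predict([[0.0, 0.0]]), iso.predict([[6.0, 6.0]]))  # [1] [-1]
```

The isolated point at (6, 6) needs very few random splits to be separated from the rest, so it receives a high anomaly score and is flagged.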
The idea of distinguishing a normal from an abnormal observation can be seen in Figure 1 of [1]: a
normal point (on the left) requires more partitions to be isolated than an abnormal point (on the
right). Each observation is assigned the anomaly score

s(x, n) = 2^(-E(h(x)) / c(n)),

where h(x) is the path length of observation x, c(n) is the average path length of an unsuccessful
search in a binary search tree, and n is the number of external nodes. More on the anomaly score
and its components can be read in [1].
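The components of the score can be computed directly. The sketch below uses the standard definitions from the Isolation Forest paper, approximating the harmonic number H(i) by ln(i) plus the Euler-Mascheroni constant.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful search in a BST with n nodes."""
    if n <= 1:
        return 0.0
    h = math.log(n - 1) + EULER_GAMMA   # harmonic number approximation H(n-1)
    return 2.0 * h - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

# an average path length equal to c(n) gives the neutral score 0.5
print(round(anomaly_score(c(256), 256), 3))  # 0.5
```

Shorter-than-average path lengths push the score toward 1 (anomalous); longer ones push it toward 0.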
Each observation is given an anomaly score, and the following decisions can be made on its basis:
a score close to 1 indicates an anomaly; a score much smaller than 0.5 indicates a normal
observation; and if all scores are close to 0.5, the entire sample does not seem to have clearly
distinct anomalies.
The local outlier factor (LOF) method scores points in a multivariate dataset whose rows are assumed
to be generated independently from the same probability distribution.
4.2.1 Background
Local outlier factor is a density-based method that relies on nearest neighbors search. The LOF
method scores each data point by computing the ratio of the average densities of the point's neighbors
to the density of the point itself. According to the method, the estimated density of a point p is the
number of p's neighbors divided by the sum of distances to the point's neighbors.
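The density-ratio idea can be illustrated with scikit-learn's LocalOutlierFactor on toy data (an assumed implementation, shown only as a sketch).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 0.5, (100, 2))
X = np.vstack([X_normal, [[8.0, 8.0]]])     # last row is an isolated point

# LOF compares each point's local density with that of its neighbours
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # +1 inlier, -1 outlier
print(labels[-1])  # -1
```

The isolated point's neighbours all lie in the dense cluster, so its own density is far lower than theirs and its LOF score is large.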
In the proposed system we analyze and detect fraud in online credit card transactions in real time.
The algorithm also implements a multi-layered approach to security based on the amount of the
transaction: it classifies transactions according to the spending habits of the customer and
calculates a threshold value that helps in detecting whether the current transaction is genuine or not.
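One simple way to realize such a threshold is sketched below, under the assumption that it is the customer's mean spend plus a multiple of the standard deviation; the report does not fix the exact rule, so this is purely illustrative.

```python
import statistics

def spending_threshold(history, k=3.0):
    """Hypothetical per-customer threshold: mean spend plus k standard
    deviations, used to flag unusually large transactions."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return mu + k * sigma

history = [25.0, 30.0, 22.0, 40.0, 35.0, 28.0]   # past transaction amounts
threshold = spending_threshold(history)
print(1200.0 > threshold)   # a 1200.0 charge would be flagged: True
```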
The evaluation was conducted using the Kaggle dataset described in the introduction.
APPENDIX
PYTHON
matplotlib.pyplot
matplotlib.pyplot is a collection of command style functions that make matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a
plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
In matplotlib.pyplot, various states are preserved across function calls, so that it keeps track of
things like the current figure and plotting area, and the plotting functions are directed to the
current axes (note that "axes" here, as in most places in the documentation, refers to the axes part
of a figure and not the strict mathematical term for more than one axis).
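A short example of this stateful style follows; the Agg backend is chosen here only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# each pyplot call modifies the "current" figure/axes kept as hidden state
plt.figure()
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])   # plots on the current axes
plt.xlabel("x")
plt.ylabel("x squared")
ax = plt.gca()                   # fetch that implicit current axes
print(len(ax.lines), ax.get_xlabel())   # 1 x
```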
pandas
Pandas is a popular Python package for data science, and with good reason: it offers powerful,
expressive and flexible data structures that make data manipulation and analysis easy, among many
other things. The DataFrame is one of these structures.
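A minimal DataFrame example in the spirit of the transaction data (column names chosen for illustration):

```python
import pandas as pd

# a DataFrame is a labelled 2-D table: named columns over an indexed set of rows
df = pd.DataFrame({
    "Amount": [12.5, 300.0, 7.2],
    "Class":  [0, 1, 0],          # 1 marks a fraudulent transaction
})
print(df.shape)                                  # (3, 2)
print(df[df["Class"] == 1]["Amount"].iloc[0])    # 300.0 - boolean-mask selection
```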
The DataFrame supports a wide range of operations, from basic manipulation to advanced
analysis, and avoids many of the pitfalls of working with raw arrays.
seaborn
The main idea of Seaborn is that it provides high-level commands to create a variety of plot types
useful for statistical data exploration, and even some statistical model fitting.
Let's take a look at a few of the datasets and plot types available in Seaborn. Note that all of the
following could be done using raw Matplotlib commands (this is, in fact, what Seaborn does under
the hood) but the Seaborn API is much more convenient.
REFERENCES
M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015.
[Online]. Available: http://download.tensorflow.org/paper/whitepaper2015.pdf
J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves," in
Proceedings of the 23rd International Conference on Machine Learning, pp. 233-240. ACM, 2006.
J. Gao, W. Fan, J. Han, and P. S. Yu, "A general framework for mining concept-drifting data
streams with skewed distributions," in Proceedings of the SIAM International Conference on Data
Mining, 2007.
Kaggle. (2017, Jan. 12). Credit Card Fraud Detection [Online].
Available: https://www.kaggle.com/dalpozz/creditcardfraud
[14] L. Breiman, Random Forests [Online].
Available: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.html