
Credit card fraud detection using ML

Chapter 1
INTRODUCTION

1.1 Overview
In 2015, credit, debit and prepaid cards generated over $31 trillion in total volume worldwide, with fraud losses reaching over $21 billion [1]. In that same year there were over 225 billion purchase transactions, a figure that is projected to surpass 600 billion by 2025 [2]. Fraud associated with credit, debit, and prepaid cards is therefore a significant and growing issue for consumers, businesses, and the financial industry. Historically, the software solutions used by issuers to combat credit card fraud closely followed progress in classification, clustering and pattern recognition [3, 4]. Today, most Fraud Detection Systems (FDS) use increasingly sophisticated machine learning algorithms to learn and detect fraudulent patterns in real time, as well as offline, with minimal disturbance of genuine transactions [5]. Generally, an FDS must address several challenges inherent to the task: extreme class imbalance, since frauds represent only a small fraction of total transactions; distributions that evolve due to changing consumer behaviour; and assessment challenges that come with real-time data processing [6]. For example, difficulties arise when learning from an unbalanced dataset because many machine intelligence methods are not designed to handle extremely large differences between class sizes [5]. Dynamic trends within the data also require robust algorithms with a high tolerance for concept drift in legitimate consumer behaviour [7].

Although specialized techniques exist that can handle large class imbalance, such as outlier detection, fuzzy inference systems and knowledge-based systems [8, 9], current state-of-the-art research suggests that conventional algorithms may in fact be used with success if the data is sampled to produce equivalent class sizes [10, 11, 12]. The newest sampling techniques create synthetic minority samples using a nearest-neighbour search such as k-Nearest-Neighbour (kNN). This is valuable because it means a wider range of typical, in some cases off-the-shelf, classification algorithms may be used, which mitigates the algorithmic limitations caused by extreme class imbalance. Not only can fraud detection capability potentially increase due to the larger scope of applicable methods, the cost of development can also decrease due to reduced reliance on highly specialized niche methods, expert systems, and continued research into algorithmic methods that handle class imbalance directly.
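
As an illustration of this kind of kNN-based synthetic oversampling, the following minimal sketch uses SMOTE from the imbalanced-learn package; the library choice, the column names and the creditcard.csv file path are illustrative assumptions based on the Kaggle dataset introduced later.

# Minimal sketch of kNN-based synthetic oversampling (SMOTE) to balance the classes.
# Assumes a local copy of the Kaggle credit card dataset saved as "creditcard.csv".
import pandas as pd
from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn

df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Class"])             # features: Time, Amount, V1..V28
y = df["Class"]                            # 1 = fraud, 0 = genuine

# SMOTE creates synthetic minority samples by interpolating between each fraud
# example and its k nearest fraudulent neighbours.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(y.value_counts())
print(pd.Series(y_res).value_counts())     # classes are now of equal size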



1.2 Existing System

In the existing credit card fraud detection business process, a fraudulent transaction is detected only after the transaction has been completed. It is difficult to identify the fraud at that point, and the resulting losses must be borne by the issuing authorities. The Hidden Markov Model (HMM) is a statistical tool used by engineers and scientists to solve a variety of problems. It has been shown that credit card fraud can be detected using a Hidden Markov Model during transactions, and that the HMM helps to obtain high fraud coverage combined with a low false-alarm rate.
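
As a rough, hypothetical illustration of this HMM-based approach (not the cited authors' implementation), the sketch below fits a Gaussian HMM to a cardholder's genuine transaction amounts using the hmmlearn package and flags a new sequence whose log-likelihood falls below a threshold; the package choice, the toy data and the threshold value are all assumptions.

# Hypothetical sketch of HMM-based fraud screening; data and threshold are made up.
import numpy as np
from hmmlearn import hmm   # pip install hmmlearn

# Toy training data: amounts of past genuine transactions for one cardholder.
genuine_amounts = np.array([[12.5], [40.0], [8.9], [55.2], [23.1], [31.7], [9.99], [47.3]])

# Fit a 3-state Gaussian HMM to the cardholder's normal spending behaviour.
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=100, random_state=0)
model.fit(genuine_amounts)

def looks_fraudulent(recent_amounts, threshold=-25.0):
    """Flag a sequence whose log-likelihood under the genuine model is too low."""
    log_likelihood = model.score(np.asarray(recent_amounts).reshape(-1, 1))
    return log_likelihood < threshold

print(looks_fraudulent([30.0, 45.0, 20.0]))        # small, typical amounts
print(looks_fraudulent([999.0, 1500.0, 2500.0]))   # unusually large amounts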

1.3 Disadvantages of the Existing System

In the existing system, web-service-based fraud detection needs labelled data for both genuine and fraudulent transactions, and new types of fraud cannot be found by these existing techniques.

1.4 Proposed System

The credit card fraud dataset used in this project is available from Kaggle.com [13] and contains a subset of online European credit card transactions made over a period of two days in September 2013, consisting of a highly imbalanced 492 frauds out of 284,807 total transactions [5, 6, 7, 14, 15]. For confidentiality, the dataset provides 28 unlabelled columns resulting from a PCA transformation, together with three labelled columns: Class, Time, and Amount.
We will be implementing two major algorithms, namely:

1.4.1 Local Outlier Factor, to calculate anomaly scores.

1.4.2 Isolation Forest.
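
Before the two algorithms are applied, the dataset has to be loaded and its class imbalance inspected. The following is a minimal sketch using pandas; the local file name creditcard.csv is an assumption, while the column names follow the Kaggle description above.

# Sketch: load the Kaggle credit card dataset and inspect the class imbalance.
import pandas as pd

df = pd.read_csv("creditcard.csv")                     # assumed local copy of the Kaggle CSV
print(df.shape)                                        # expected (284807, 31): V1..V28, Time, Amount, Class
print(df["Class"].value_counts())                      # 0 = genuine, 1 = fraud (492 rows)
print(df["Class"].value_counts(normalize=True))        # frauds are roughly 0.17% of all transactions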
Chapter 2
LITERATURE SURVEY
Fraud is a wrongful or criminal deception intended to result in financial or personal gain. In the area of fraud detection, neural networks such as the feed-forward neural network trained with back-propagation have found immense application. Such applications usually need historical data, and on the basis of this previous data they detect fraud. Another statistical approach is the feed-forward network, in which a relationship is found between the user data and other parameters to obtain the result.

All of these technologies have their pros and cons. Association rule mining is a simple method, but it initially needs a large data set in which it can find frequent item sets. Neural networks can be applied in a supervised as well as an unsupervised approach; the unsupervised approach is somewhat more complex but gives more optimized results. The main motive of this project is therefore to present the important technologies that can detect fraud as early as possible and avoid losses as much as possible.

2.1 Techniques to Detect Credit Card Fraud

There are many emerging technologies that are able to detect credit card fraud. Some suitable technologies that work on particular parameters and are able to detect fraud early are listed below:

2.1.1 Biometric Approach

Kenneth Aguilar, Cesar Ponce et al. state that all humans have particular characteristics in their behaviour as well as in their physiology, as depicted in Figure 2.1. Here, behavioural characteristics mean a person's voice, signature, keystroke dynamics, etc., while physiological characteristics mean fingerprints, face image or hand geometry. Biometric data mining is an application of knowledge discovery techniques in which biometric information is analysed with the motive of identifying patterns.
 Behavioral Characteristics:

According to Revett & Henrique Santos, the following behavioural characteristics can be used to identify patterns. Figure 2.1 shows the types of behavioural characteristics.

Figure 2.1 Types of Characteristics

a) Keystroke Patterns: Revett & Henrique Santos defined keystroke patterns in terms of key-press duration or latency. According to them, every person's pattern of striking keys is unique, so it helps to distinguish legitimate users from fraudulent ones.

b) Mouse Movement: User identification during mouse movement is done by measuring the temperature and humidity of the user's palm and the intensity of pressing. These parameters can be helpful in recognizing suspicious behaviour.

2.1.2 Learning

Learning is generally done with or without the help of a teacher. The general division of learning is shown in Figure 2.2.

Figure 2.2 Types of Learning

Learning that takes place under the supervision of a teacher is termed supervised learning, while learning without the guidance of a teacher is termed unsupervised learning. These types of learning are explored below:

1) Supervised Learning

According to Patidar and Lokesh Sharma, the following technologies are based on the supervised learning approach, in which an external teacher checks the output.
 Support Vector Machine (SVM)
In supervised learning, Vapnik introduced the idea of the Support Vector Machine. Joseph King-Fung Pun explained that in this classification algorithm a hyperplane is constructed as a decision plane that distinguishes between fraudulent and legitimate transactions. This hyperplane separates the different classes of data. The Support Vector Machine maximizes the geometric margin while simultaneously minimizing the empirical classification error, so it is also called a maximum-margin classifier. The separating hyperplane is the plane that maximizes the distance to the two parallel supporting hyperplanes. A minimal sketch of a supervised SVM baseline appears at the end of this section.

2) Unsupervised Learning

In contrast to supervised learning, unsupervised or self-organized learning does not require an external teacher. Quah & Sriganesh describe that during the training session the neural network receives a number of different input patterns, discovers significant features in these patterns and learns how to classify input patterns into different categories.
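
As a hedged illustration of the supervised approach surveyed above (not any cited author's exact method), the sketch below trains a linear SVM on a class-balanced sample of the Kaggle dataset with scikit-learn; the file name, the undersampling step and all parameter values are assumptions.

# Sketch: linear SVM baseline on an undersampled (class-balanced) subset of the data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

df = pd.read_csv("creditcard.csv")                     # assumed local copy of the Kaggle CSV
fraud = df[df["Class"] == 1]
genuine = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)
balanced = pd.concat([fraud, genuine])

X = balanced.drop(columns=["Class"])
y = balanced["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# The linear SVM searches for the maximum-margin hyperplane separating the two classes.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))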
Chapter 3
SYSTEM REQUIREMENT SPECIFICATION
A system requirements specification (SRS) is a specification for a software system that gives a complete behavioural description of the system to be developed. It also contains functional and non-functional requirements.

3.1 Objective
The system requirements specification document details all the requirements necessary for the development of the project. A clear and thorough understanding of the product to be developed is important in order to derive the requirements, and the document can be prepared only after detailed communication with the project team and the customer.

3.2 Functional Requirements

The function of a software system or of one of its components is defined by the functional requirements. A function is given by a set of inputs, behaviours and outputs. Functional requirements may comprise calculations, technical details, data manipulation, processing and other specific functionality defining what the system should accomplish. Figure 3.1 shows the use case diagram of the credit card validation system.

Figure 3.1 Use Case Diagram of Credit Card Validation System


Data Flow Diagram

Figure 3.2 Data Flow Diagram of Credit Card Fraud Detection

Figure 3.2 depicts the data flow diagram of our credit card fraud detection system. The system fetches the transaction data, and a fraction of the collected data is then selected. Using the Isolation Forest algorithm and the Local Outlier Factor algorithm, we can determine whether a transaction is fraudulent or not.
3.3 Non-Functional Requirements

A non-functional requirement specifies criteria that can be used to judge the operation of a system rather than specific behaviours. It differs from a functional requirement, which specifies specific behaviour or functions; a non-functional requirement instead defines how the system is supposed to be. Such requirements are typically expressed in the form "the system shall be <requirement>".

3.4 System Requirements


3.4.1 Hardware Requirements

Processor : Pentium i3 or higher
Hard disk : 500 GB (for deployment)
RAM : 4 GB or higher

3.4.2 Software Requirements

Operating System : Windows 10 / Linux / Fedora
Language : Python
IDE : Anaconda Navigator (Jupyter Notebook), Spyder
Chapter 4
SYSTEM DESIGN AND ANALYSIS
4.1 Isolation Forest

The main idea, which is different from other popular outlier detection methods, is that Isolation
Forest explicitly identifies anomalies instead of profiling normal data points. Isolation Forest, like any
tree ensemble method, is built on the basis of decision trees. In these trees, partitions are created by
first randomly selecting a feature and then selecting a random split value between the minimum and
maximum value of the selected feature.
In principle, outliers are less frequent than regular observations and are different from them in terms
of values (they lie further away from the regular observations in the feature space). That is why by
using such random partitioning they should be identified closer to the root of the tree (shorter average
path length, i.e., the number of edges an observation must pass in the tree going from the root to the
terminal node), with fewer splits necessary.
The idea of identifying a normal vs. an abnormal observation can be seen in Figure 4.1 (adapted from [1]): a normal point (on the left) requires more partitions to be isolated than an abnormal point (on the right).

Figure 4.1 Identifying normal vs. abnormal observations


As with other outlier detection methods, an anomaly score is required for decision making. In the case of Isolation Forest it is defined as:

s(x, n) = 2^( -E(h(x)) / c(n) )
where h(x) is the path length of observation x, c(n) is the average path length of unsuccessful search
in a Binary Search Tree and n is the number of external nodes. More on the anomaly score and its
components can be read in [1].
Each observation is given an anomaly score, and the following decisions can be made on its basis:

 A score close to 1 indicates an anomaly.

 A score much smaller than 0.5 indicates a normal observation.

 If all scores are close to 0.5, then the entire sample does not seem to have any clearly distinct anomalies.
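
A minimal sketch of Isolation Forest applied to the transaction data, using scikit-learn's IsolationForest (an assumed library choice), could look as follows; the contamination value is taken from the known fraud ratio of the Kaggle data.

# Sketch: Isolation Forest anomaly detection on the credit card data with scikit-learn.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("creditcard.csv")                     # assumed local copy of the Kaggle CSV
X = df.drop(columns=["Class"])
y = df["Class"]

fraud_ratio = y.mean()                                 # about 0.0017 for this dataset
iso = IsolationForest(n_estimators=100, contamination=fraud_ratio, random_state=42)
pred = iso.fit_predict(X)                              # -1 = anomaly, 1 = normal
pred = (pred == -1).astype(int)                        # map to the dataset's labels (1 = fraud)
print("Misclassified transactions:", (pred != y).sum())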

4.2 Local Outlier Factor

The local outlier factor (LOF) method scores points in a multivariate dataset whose rows are assumed
to be generated independently from the same probability distribution.

4.2.1 Background

Local outlier factor is a density-based method that relies on nearest neighbors search. The LOF
method scores each data point by computing the ratio of the average densities of the point's neighbors
to the density of the point itself. According to the method, the estimated density of a point p is the
number of p's neighbors divided by the sum of distances to the point's neighbors.

Figure 4.2 Local outlier Factor Score


Figure 4.2 illustrates the comparison of the local density of a point with the densities of its neighbours: point A has a much lower density than its neighbours.
Suppose N(p) is the set of neighbors of point p, k is the number of points in this set, and d(p,x) is the
distance between points p and x.
The estimated density is:

f̂(p) = k / Σ_{x ∈ N(p)} d(p, x)

and the local outlier factor score is:

LOF(p) = (1/k) · Σ_{x ∈ N(p)} f̂(x) / f̂(p)
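
A minimal sketch applying this method to the transaction data with scikit-learn's LocalOutlierFactor (again an assumed library choice rather than the original implementation) is shown below; in scikit-learn the attribute negative_outlier_factor_ stores the negated LOF score, so more negative values are more anomalous.

# Sketch: Local Outlier Factor on the credit card data with scikit-learn.
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

df = pd.read_csv("creditcard.csv")                     # assumed local copy of the Kaggle CSV
X = df.drop(columns=["Class"])
y = df["Class"]

lof = LocalOutlierFactor(n_neighbors=20, contamination=y.mean())
pred = lof.fit_predict(X)                              # -1 = anomaly, 1 = normal
pred = (pred == -1).astype(int)                        # map to the dataset's labels (1 = fraud)

# lof.negative_outlier_factor_ holds -LOF(p); more negative means more anomalous.
print("Misclassified transactions:", (pred != y).sum())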
SCOPE OF FUTURE WORK AND CONCLUSION
The detection of credit card fraud using data mining and machine learning techniques has become one of the reliable approaches to countering this illegal activity. However, gathering real-time credit card fraud data is very hard; therefore, to mimic real data, the development of dummy data may assist the detection process. The creation and credibility of such dummy data must, however, be ascertained before the classification processes are conducted. The speed of the software can be enhanced by implementing algorithms of lower complexity.

In the proposed system we analyse and detect fraud in online credit card transactions in real time. The algorithm also implements a multi-layered approach to security based on the amount of the transaction: it classifies transactions according to the spending habits of the customer and calculates a threshold value that helps in detecting whether the current transaction is genuine or not. The evaluations were conducted using the dataset described in the proposed system.
APPENDIX
PYTHON

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics.


Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it
very attractive for Rapid Application Development, as well as for use as a scripting or glue language
to connect existing components together. Python's simple, easy to learn syntax emphasizes readability
and therefore reduces the cost of program maintenance. Python supports modules and packages,
which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
Often, programmers fall in love with Python because of the increased productivity it provides. Since
there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs
is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter
discovers an error, it raises an exception. When the program doesn't catch the exception, the
interpreter prints a stack trace. A source level debugger allows inspection of local and global
variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at
a time, and so on. The debugger is written in Python itself, testifying to Python's introspective power.
On the other hand, often the quickest way to debug a program is to add a few print statements to the
source: the fast edit-test-debug cycle makes this simple approach very effective.

matplotlib.pyplot

matplotlib.pyplot is a collection of command style functions that make matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a
plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes (please note that "axes" here, and in most places in the documentation, refers to the axes part of a figure and not the strict mathematical term for more than one axis).
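
As a small hedged example of this stateful plotting style (the file name and column assume the Kaggle dataset used in this project):

# Sketch: plotting the distribution of transaction amounts with matplotlib.pyplot.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("creditcard.csv")            # assumed local copy of the Kaggle CSV

plt.figure(figsize=(8, 4))                    # creates a new figure (becomes the current figure)
plt.hist(df["Amount"], bins=50)               # plots into the current axes
plt.title("Distribution of transaction amounts")
plt.xlabel("Amount")
plt.ylabel("Number of transactions")
plt.show()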
pandas

Pandas is a popular Python package for data science, and with good reason: it offers powerful,
expressive and flexible data structures that make data manipulation and analysis easy, among many
other things. The DataFrame is one of these structures.
In this project, DataFrames are used to load, inspect and manipulate the credit card transaction dataset described in the proposed system.
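
A brief hedged example of basic DataFrame operations on the project's dataset (the file name is an assumption):

# Sketch: basic DataFrame operations on the transaction data.
import pandas as pd

df = pd.read_csv("creditcard.csv")                     # assumed local copy of the Kaggle CSV
print(df.head())                                       # first five transactions
print(df[["Time", "Amount", "Class"]].describe())      # summary statistics of the labelled columns
fraud = df[df["Class"] == 1]                           # boolean filtering selects the 492 frauds
print(len(fraud))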

seaborn

The main idea of Seaborn is that it provides high-level commands to create a variety of plot types
useful for statistical data exploration, and even some statistical model fitting.
All of these plots could also be produced using raw Matplotlib commands (this is, in fact, what Seaborn does under the hood), but the Seaborn API is much more convenient.
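
For instance, a hedged sketch of visualising the class imbalance with Seaborn (assuming the same creditcard.csv file) could look like this:

# Sketch: visualising the class imbalance with seaborn.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("creditcard.csv")            # assumed local copy of the Kaggle CSV

sns.countplot(x="Class", data=df)             # 0 = genuine, 1 = fraud
plt.title("Genuine vs. fraudulent transactions")
plt.show()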
REFERENCES

M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. [Online]. Available: http://download.tensorflow.org/paper/whitepaper2015.pdf

J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves," in Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 233-240.

J. Gao, W. Fan, J. Han, and P. S. Yu, "A general framework for mining concept-drifting data streams with skewed distributions," in Proceedings of the SIAM International Conference on Data Mining, 2007.

[13] Kaggle, "Credit Card Fraud Detection," Jan. 12, 2017. [Online]. Available: https://www.kaggle.com/dalpozz/creditcardfraud

[14] L. Breiman and A. Cutler, "Random Forests." [Online]. Available: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.html
