Credit Card Fraud Lessons
Credit Card Fraud Lessons
a r t i c l e i n f o a b s t r a c t
Keywords: Billions of dollars of loss are caused every year due to fraudulent credit card transactions. The design of
Incremental learning efficient fraud detection algorithms is key for reducing these losses, and more algorithms rely on
Unbalanced data advanced machine learning techniques to assist fraud investigators. The design of fraud detection algo-
Fraud detection rithms is however particularly challenging due to non-stationary distribution of the data, highly imbal-
anced classes distributions and continuous streams of transactions.
At the same time public data are scarcely available for confidentiality issues, leaving unanswered many
questions about which is the best strategy to deal with them.
In this paper we provide some answers from the practitioner’s perspective by focusing on three crucial
issues: unbalancedness, non-stationarity and assessment. The analysis is made possible by a real credit
card dataset provided by our industrial partner.
Ó 2014 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2014.02.026
0957-4174/Ó 2014 Elsevier Ltd. All rights reserved.
4916 A. Dal Pozzolo et al. / Expert Systems with Applications 41 (2014) 4915–4928
perspective. These are just some of potential questions that could 3.2. Unbalanced problem
raise during the design of a detection system. We do not claim to
be able to give a definite answer to the problem, but we hope to Learning from unbalanced datasets is a difficult task since most
that our work serves as guideline for other people in the field. learning systems are not designed to cope with a large difference
Our goal is to show what worked and what did not in a real case between the number of cases belonging to each class (Batista,
study. In this paper we give a formalisation of the learning problem Carvalho, & Monard, 2000). In the literature, traditional methods
in the context of credit card fraud detection. We present a way to for classification with unbalanced datasets rely on sampling
create new features in the datasets that can trace the card holder techniques to balance the dataset (Japkowicz & Stephen, 2002).
spending habits. By doing this it is possible to present the transac- In particular we can distinguish between methods that operates
tions to the learning algorithm without providing the card holder at the data and algorithmic levels (Chawla, Japkowicz, & Kotcz,
identifier. We then argue that traditional classification metrics 2004). At the data level, balancing techniques are used as a
are not suited for a detection task and present existing alternative pre-processing step to rebalance the dataset or to remove the noise
measures. between the two classes, before any algorithm is applied. At the
We propose and compare three approaches for online learning algorithmic level, the classification algorithms themselves are
in order to identify what is important to retain or to forget in a adapted to deal with the minority class detection. In this article
changing and non-stationary environment. We show the impact we focus on data level techniques as they have the advantage of
of the rebalancing technique on the final performance when the leaving the algorithms unchanged.
class distribution is skewed. In doing this we merge techniques Sampling techniques do not take into consideration any specific
developed for unbalanced static datasets with online learning information in removing or adding observations from one class, yet
strategies. The resulting frameworks are able to deal with unbal- they are easy to implement and to understand. Undersampling
anced and evolving data streams. All the results are obtained by (Drummond & Holte, 2003) consists in down-sizing the majority
experimentation on a dataset of real credit card transactions pro- class by removing observations at random until the dataset is
vided by our industrial partner. balanced.
SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2011) over-
samples the minority class by generating synthetic minority
examples in the neighborhood of observed ones. The idea is to form
3. State of the art in credit card fraud detection new minority examples by interpolating between examples of the
same class. This has the effect of creating clusters around each
Credit card fraud detection is one of the most explored domains minority observation.
of fraud detection (Chan, Fan, Prodromidis, & Stolfo, 1999; Bolton & Ensemble methods combine balancing techniques with a
Hand, 2001; Brause, Langsdorf, & Hepp, 1999) and relies on the classifier to explore the majority and minority class distribution.
automatic analysis of recorded transactions to detect fraudulent EasyEnsemble is claimed in Liu, Wu, and Zhou (2009) to be better
behavior. Every time a credit card is used, transaction data, com- alternative to undersampling. This method learns different aspects
posed of a number of attributes (e.g. credit card identifier, transac- of the original majority class in an unsupervised manner. This is
tion date, recipient, amount of the transaction), are stored in the done by creating different balanced training sets by Undersampling,
databases of the service provider. learning a model for each dataset and then combining all
However a single transaction information is typically not suffi- predictions.
cient to detect a fraud occurrence (Bolton & Hand, 2001) and the
analysis has to consider aggregate measures like total spent per 3.3. Incremental learning
day, transaction number per week or average amount of a transac-
tion (Whitrow, Hand, Juszczak, Weston, & Adams, 2009). Static learning is the classical learning setting where the data are
processed all at once in a single learning batch. Incremental learning
instead interprets data as a continuous stream and processes each
new instance ‘‘on arrival’’ (Oza, 2005). In this context it is impor-
3.1. Supervised versus unsupervised detection tant to preserve the previously acquired knowledge as well as to
update it properly in front of new observations. In incremental
In the fraud detection literature we encounter both supervised learning data arrives in chunks where the underlying data genera-
techniques that make use of the class of the transaction (e.g. gen- tion function may change, while static learning deals with a single
uine or fraudulent) and unsupervised techniques. Supervised dataset. The problem of learning in the case of unbalanced data has
methods assume that labels of past transactions are available and been widely explored in the static learning setting (Chawla et al.,
reliable but are often limited to recognize fraud patterns that have 2011; Drummond & Holte, 2003; Japkowicz & Stephen, 2002; Liu
already occurred (Bolton & Hand, 2002). On the other hand, unsu- et al., 2009). Learning from non-stationary data stream with
pervised methods do not use the class of transactions and are capa- skewed class distribution is however a relatively recent domain.
ble of detecting new fraudulent behaviours (Bolton & Hand, 2001). In the incremental setting, when the data distribution changes,
Clustering based methods (Quah & Sriganesh, 2008; Weston, Hand, it is important to learn from new observations while retaining
Adams, Whitrow, & Juszczak, 2008) form customer profiles to existing knowledge form past observations. Concepts learnt in
identify new hidden fraud patterns. the past may re-occur in the future as new concepts may appear
The focus of this paper will be on supervised methods. In the lit- in the data stream. This is known as the stability-plasticity dilem-
erature several supervised methods have been applied to fraud ma (Grossberg, 1988). A classifier is required to be able to respond
detection such as Neural networks (Dorronsoro, Ginel, Sgnchez, & to changes in the data distribution, while ensuring that it still re-
Cruz, 1997), Rule-based methods (BAYES Clark & Niblett, 1989, tains relevant past knowledge. Many of the techniques proposed
RIPPER Cohen, 1995) and tree-based algorithms (C4.5 Quinlan, (Chen, He, Li, & Desai, 2010; Polikar, Upda, Upda, & Honavar,
1993 and CART Olshen & Stone, 1984). It is well known however 2001; Street & Kim, 2001) use ensemble classifiers in order to com-
that an open issue is how to manage unbalanced class sizes since bine what is learnt from new observations and the knowledge ac-
the legitimate transactions generally far outnumber the fraudulent quired before. As fraud evolves over time, the learning framework
ones. has to adapt to the new distribution. The classifier should be able
A. Dal Pozzolo et al. / Expert Systems with Applications 41 (2014) 4915–4928 4917
to learn from a new fraud distributions and ‘‘forget’’ outdated on the Internet) changes the behaviour of the cardholders and con-
knowledge. It becomes critical then to set the rate of forgetting sequently the distributions PðX Þ. At the same time the evolution of
in order to match the rate of change in the distribution (Kuncheva, the types of frauds affects the class conditional probability distri-
2004). The simplest strategy uses a constant forgetting rate, which bution PðYjX Þ. As a result the joint distribution PðX ; YÞ is not
boils down to consider a fix window of recent observations to stationary: this is is known as concept-drift (Hoens, Polikar, &
retrain the model. FLORA approach (Widmer & Kubat, 1996) uses Chawla, 2012). Note that Gao et al. (2007) suggests that even when
a variable forgetting rate where the window is shrunk if a change the concept drift is not detected, there is still a benefit in updating
is detected and expanded otherwise. The evolution of a class the models.
concept is called in literature concept drift.
Gao, Fan, Han, and Philip (2007) proposes to store all previous
5. Performance measure
minority class examples into the current training data set to make
it less unbalanced and then to combine the models into an ensem-
Fraud detection must deal with the following challenges: (i)
ble of classifiers. SERA (Chen et al., 2009) and REA (Chen & He,
timeliness of decision (a card should be blocked as soon as it is
2011) selectively accumulate old minority class observations to
found victim of fraud, quick reaction to the appearance of the first
rebalance the training chunk. They propose two different methods
can prevent other frauds), (ii) unbalanced class sizes (the number
(Mahalanobis distance and K nearest neighbours) in order to select
of frauds are relatively small compare to genuine transactions)
the most relevant minority instances to include in the current
and (iii) cost structure of the problem (the cost of a fraud is not
chunk from the set of old minority instances.
easy to define). The cost of a fraud is often assumed to be equal
These methods consist in oversampling the minority class of the
to the transaction amount (Elkan, 2001). However, frauds of small
current chunk by retaining old positive observations. Accumula-
and big amounts must be treated with equal importance. A fraud-
tion of previous minority class examples is of limited volume due
ulent activity is usually tested with a small amount and then, if
to skewed class distribution, therefore oversampling does not
successful, replicated with bigger amount. The cost should also in-
increase a lot the chunk size.
clude the time taken by the detection system to react. The shorter
is the reaction time, the larger is the number of frauds that it is
4. Formalization of the learning problem possible to prevent. Depending on the fraud risk assigned by the
detection system to the transaction, the following can happens:
In this section, we formalize the credit card fraud detection task (i) transaction accepted, (ii) transaction refused (iii) card blocked.
as a statistical learning problem. Let X ij be the transaction number j Usually the card is blocked only in few cases where there is a high
risk of fraud (well known fraudulent patterns with high accuracy,
of a card number i. We assume that the transactions are ordered in
e.g. 99% correct). When a transaction is refused, the investigators
time such that if X iv occurs before X iw then v < w. For each trans-
make a phone call to the card holder to verify if it is the case of a
action some basic information is available such as amount of the
false alert or a real fraud. The cost of a false alert can then be con-
expenditure, the shop where it was performed, the currency, etc.
sidered equivalent to the cost of the phone call, which is negligible
However these variables do not provide any information about
compared to the loss that occurs in case of a fraud. However, when
the normal card usage. The normal behaviour of a card can be mea-
the number of false alerts is too big or the card is blocked by error,
sured by using a set of historical transactions from the same card.
the impossibility to make transactions can translate into big losses
For example, we can get an idea of the card holder spending habits
for the customer. For all these reasons, defining a cost measure is a
by looking at the average amount spent in different merchant cat-
challenging problem in credit card detection.
egories (e.g. restaurant, online shopping, gas station, etc.) in the
The fraud problem can be seen as a binary classification and
last 3 months preceding the transaction. Let X ik be a new transac-
detection problem.
tion and let dt ðX ik Þ be the corresponding transaction date-time in
the dataset. Let T denote the time-frame of a set of historical
transactions for the same card. X Hik is then the set of the historical 5.1. Classification
transactions occurring in the time-frame T before X ik such that
In a classification problem an algorithm is assessed on its accu-
X Hik ¼ X ij , where dt X ij – dtðX ik Þ and dtðX ik Þ > dt X ij P dt ðX ik Þ T.
racy to predict the correct classes of new unseen observations. Let
For instance, with T ¼ 90 days, X Hik
is the set of transactions for
fY 0 g be the set of genuine transactions, fY 1 g the set of fraudulent
the same card occurring in the 3 months preceding dtðX ik Þ. The b 0 g the set of transactions predicted as genuine and
transactions, f Y
card behaviour can be summarised using classical aggregation b 1 g the set of transactions predicted as fraudulent. For a binary
fY
methods (e.g. mean, max, min or count) on the set X Hik . This means classification problem it is conventional to define a confusion
that it is possible to create new aggregated variables that can be matrix (Table 1).
added to the original transaction variables to include information In an unbalanced class problem, it is well-known that quantities
of the card. In this way we have included information about the TP
like TPR (TPþFN TN
), TNR (FPþTN TPþTN
) and Accuracy (TPþFNþFPþTN ) are mislead-
user behaviour at the transaction level and we can now no longer
ing assessment measures (Provost, 2000). Balanced Error Rate
consider the card ID. Transactions from card holders with similar FP FN
(0:5 TNþFP þ 0:5 FNþTP ) may be inappropriate too because of dif-
spending habits will share similar aggregate variables. Let fX g be
ferent costs of misclassification false negatives and false positives.
the new set of transactions with aggregated variables. Each trans-
A well accepted measure for unbalanced dataset is AUC (area
action X j is assigned a binary status Y j where Y j ¼ 1 when the
under the ROC curve) (Chawla, 2005). This metric gives an measure
transaction j is fraudulent and Y j ¼ 0 otherwise. The goal of a
detection system is to learn P ðYjX Þ and predict the class of a
new transaction Y b N 2 ð0;1Þ. Note that here the focus is not on Table 1
classifying the card holder, but the transaction as fraudulent or Confusion Matrix.
of how much the ROC curve is close to the point of perfect classi- where DRðt r Þ ¼ Rðt r Þ Rðt r1 Þ and N is the total number of observa-
fication. Hand (2009) considers calculation of the area under del tion in the dataset. From the definition of Rðt r Þ we have:
ROC curve as inappropriate, since this translate into making an (
average of the misclassification cost of the two classes. An alterna- 1
hðtr Þ hðt r1 Þ p if the the rth is fraudulent
tive way of estimating AUC is based on the use of the MannWhit- DRðt r Þ ¼ ¼
p 0 if the the rth is genuine
ney statistic and consists in ranking the observations by the fraud
probability and measuring the probability that a random minority An algorithm ‘‘A’’ is superior to an algorithm ‘‘B’’ only if it de-
class example ranks higher than a random majority class example tects the frauds before algorithm ‘‘B’’. The better the rank, the
(Bamber, 1975). By using the rank-based formulation of AUC we greater the AP. The optimal algorithm that ranks all the frauds
can avoid setting different probability thresholds to generate the ahead of the legitimates has average precision of 1.
ROC curve and avoid the problem raised by Hand. Let n0 ¼ jY 0 j In detection teams like the one of our industrial partner, each
be the number of genuine transactions, and n1 ¼ jY 1 j be the num- time a fraud alert is generated by the detection system, it has to
ber of fraudulent transactions. Let g i ¼ p^0 ðxi0 Þ be the estimated be checked by investigators before proceeding with actions (e.g.
probability of belonging to the genuine class for the i transaction
th customer contact or card stop). Given the limited number of inves-
in fY 0 g, for i ¼ 1; . . . ; n0 . Define fi ¼ p ^0 ðxi1 Þ similarly for the n1 tigators it is possible to verify only a limited number of alerts.
fraudulent transactions. Then fg 1 ; . . . ; g n0 g and ff1 ; . . . ; fn1 g are sam- Therefore it is crucial to have the best ranking within the maxi-
mum number a of alerts that they can investigate. In this setting
ples from the g and f distributions. Rank the combined set of values
it is important to have the highest Precision within the first a
g 1 ; . . . ; g n0 ; f1 ; . . . ; fn1 in increasing order and let qi be the rank of the
alerts.
ith genuine transaction. There are ðqi iÞ fraudulent transactions
In the following we will denote as PrecisionRank the Precision
with estimated probabilities of belonging to class 0 which are
within the a observations with the highest rank.
smaller than that of the ith genuine transaction (Hand & Till,
2001). Summing over the 0 class, we see that the total number of
pairs of transactions, one from class 0 and one from class 1, in 6. Strategies for incremental learning with unbalanced fraud
which the fraudulent transaction has smaller estimated probability data
of belonging to class 0 than does the fraudulent transaction, is
X
n0 X
n0 X
n0 The most conventional way to deal with sequential fraud data is
ðqi iÞ ¼ qi i ¼ R0 n0 ðn0 þ 1Þ=2 to adopt a static approach (Fig. 1) which creates once in a while a
i¼0 i¼0 i¼0 classification model and uses it as a predictor during a long hori-
P n0
where R0 ¼ i¼0 qi . Since there are n0 n1 such pairs of transactions zon. Though this approach reduces the learning effort, its main
altogether, our estimate of the probability that a randomly chosen problem resides in the lack of adaptivity which makes it insensitive
fraudulent transaction has a lower estimated probability of belong- to any change of distribution in the upcoming chunks.
ing to class 0 than a randomly chosen genuine transaction is On the basis of the state-of-the-work described in Section 3.3, it
is possible to conceive two alternative strategies to address both
^ ¼ R0 n0 ðn0 þ 1Þ=2
A the incremental and the unbalanced nature of the fraud detection
n0 n1 problem.
^ gives an estimation of AUC that avoids errors introduced by
A The first approach, denoted as the updating approach and illus-
smoothing procedures in the ROC curve and that is threshold-free trated in Fig. 2, is inspired to Wang, Fan, Yu, and Han (2003). It uses
(Hand & Till, 2001). a set of M models and a number K of chunks to train each model.
Note that for M > 1 and K > 1 the training sets of the M models
5.2. Detection are overlapping. This approach adapts to changing environment
by forgetting chunks at a constant rate. The last M models are
The performance of a detection task (like fraud detection) is not stored and used in a weighted ensemble of models EM . Let
necessarily well described in terms of classification (Fan & Zhu, PrecisionRankm denote the predictive accuracy measured in terms
2011). In a detection problem what matters most is whether the of PrecisionRank on the last (testing) chunk of the mth model.
algorithm can rank the few useful items (e.g. frauds) ahead of The ensemble EM is defined as the linear combination of all the
the rest. In a scenario with limited resources, fraud investigators M models hm :
cannot revise all transactions marked as fraudulent from a classifi-
cation algorithm. They have to put their effort into investigating X
M
the transactions with the highest risk of fraud, which means that EM ¼ wm hm
m¼1
the detection system is asked to return the transactions ranked
by their posteriori fraud probability. The goal then is not only to
where
predict accurately each class, but to return a correct rank of the
minority classes.
PrecisionRankm PrecisionRankmin
In this context a good detection algorithm should be able to give wm ¼
a high rank to relevant items (frauds) and low score to non-rele-
PrecisionRankmax PrecisionRankmin
vant. Fan and Zhu (2011) consider the average precision (AP) as PrecisionRankmin ¼ minðPrecisionRankm Þ
m2M
the correct measure for a detection task. Let p be the number of po- PrecisionRankmax ¼ maxðPrecisionRankm Þ
sitive (fraud) case in the original dataset. Out of the t% top-ranked m2M
candidates, suppose hðtÞ are truly positive (hðtÞ <¼ t). We can then The second approach denoted as the forgetting genuine approach
define recall as RðtÞ ¼ hðtÞ=p and precision as PðtÞ ¼ hðtÞ=t. Then and illustrated in Fig. 3 is inspired to Gao et al.’s work. In order to
Pðtr Þ and Rðtr Þ is the precision and recall of the rth ranked observa- mitigate the unbalanced effects, each time a new chunk is avail-
tion. The formula for calculating the average precision is: able, a model is learned on the genuine transactions of the previous
X
N K gen chunks and all past fraudulent transactions. Since this ap-
AP ¼ Pðt r ÞDRðt r Þ proach leads to training sets which grow in size over the time, a
r¼1 maximum training size is set to avoid overloading. Once this size
A. Dal Pozzolo et al. / Expert Systems with Applications 41 (2014) 4915–4928 4919
Fig. 1. Static approach: a model is trained on K ¼ 3 chunks and used to predict future chunks.
7.1. Dataset
is reached older observations are removed in favor of the more re- 7.2. Learned lessons
cent ones. An ensemble of models is obtained by combining the
last M models as in the update approach. Our experimental analysis allows to provide some answers to
Note that in all these approaches (including the static one), a the most common questions of credit card fraud detection. The
balancing technique (Section 3.2) can be used to reduce the skew- questions and the answers based on our experimental findings
ness of the training set (Fig. 4). are detailed below.
In Table 2 we have summarised the strengths and weaknesses
of the incremental approaches presented. The Static strategy has
7.2.1. Which algorithm and which training size is recommended in
the advantage of being fast as the training of the model is done
case of a static approach?
only once, but this does not return a model that follows the
The static approach (described in Section 6) is one of the most
changes in the distribution of the data. The other two approaches
commonly used by practitioners because of its simplicity and
on the contrary can adapt to concept drift. They differ essentially
rapidity. However, open questions remain about which learning
in the way the minority class is accumulated in the training
algorithm should be used and the consequent sensitivity of the
chunks. The Forget strategy propagates instances between chunks
accuracy to the training size. We tested three different supervised
leading to bigger training sets and computational burden.
algorithms: Random Forests (RF), Neural Network (NNET) and
Support Vector Machine (SVM) provided by the R software
7. Experimental assessment (Development Core Team, 2011). We used R version 3.0.1 with
packages randomForest (Liaw & Wiener, 2002), e1071 (Meyer,
In this section we perform an extensive experimental assess- Dimitriadou, Hornik, Weingessel, & Leisch, 2012), unbalanced
ment on the basis of real data (Section 7.1) in order to address (Pozzolo, 2014) and MASS (Venables & Ripley, 2002).
common issues that the practitioner has to solve when facing large In order to assess the impact of the training set size (in terms of
credit card fraud datasets (Section 7.2). days/chunks) we carried out the predictions with different
4920 A. Dal Pozzolo et al. / Expert Systems with Applications 41 (2014) 4915–4928
Fig. 3. Forgetting genuine approach: for each new chunk a model is created by keeping all previous fraudulent transactions and a small set of genuine transactions from the
last 2 chunks (K gen ¼ 2). Single models are used to predict the following chunk or can be combined into an ensemble (M ¼ 4).
Table 3
Fraudulent dataset.
where K is the total number of chunks. The higher the sum, the high-
er is the number of times that one strategy is superior to the others.
The white bars denote models which are significantly worse than the
best (paired t-test based on the ranks of each chunk).
The strategy names follow a structure built on the following
options:
windows (K = 30, 60 and 90). All training sets were rebalanced Then the strategy options are concatenated using the dot as
using undersampling at first (50% fraudulent, 50% genuine). separation point (e.g. RF.Under.Daily.10M.Update.60K).
All experiments are replicated five times to reduce the variance In both datasets, Random Forests clearly outperforms its com-
caused by the sampling implicit in unbalanced techniques. Fig. 5 petitors and, as expected, accuracy is improved by increasing the
shows the sum of the ranks from the Friedman test (Friedman, training size (Fig. 5). Because of the significative superiority of Ran-
1937) for each strategy in terms of AP, AUC and PrecisionRank. dom Forests with respect to the other algorithms, in what follows
For each chunk, we rank the strategies from the least to the best we will limit to consider only this learning algorithm.
performing. Then we sum the ranks over all chunks. More formally,
let rs;k 2 f1; . . . ; Sg be the rank of strategy s on chunk k and S be the
number of strategies to compare. The strategy with the highest 7.2.2. Is there an advantage in updating models?
accuracy in k has r s;k ¼ S and the one with the lowest has rs;k ¼ 1. Here we assess the advantage of adopting the update approach
PK
Then the sum of ranks for the strategy s is defined as k¼1 r k;s ,
described in Section 6. Fig. 6 reports the results for different values
of K and M.
Table 2
Strengths and weaknesses of the incremental approaches.
Metric: AP
RF.Under.One.1M.Static.90K
RF.Under.One.1M.Static.60K
RF.Under.One.1M.Static.30K
Best significant
SVM.Under.One.1M.Static.90K FALSE
TRUE
NNET.Under.One.1M.Static.90K
SVM.Under.One.1M.Static.60K
NNET.Under.One.1M.Static.60K
Metric: AUC
RF.Under.One.1M.Static.90K
RF.Under.One.1M.Static.60K
RF.Under.One.1M.Static.30K
Best significant
NNET.Under.One.1M.Static.90K FALSE
TRUE
NNET.Under.One.1M.Static.60K
SVM.Under.One.1M.Static.90K
SVM.Under.One.1M.Static.60K
Metric: PrecisionRank
RF.Under.One.1M.Static.90K
RF.Under.One.1M.Static.60K
RF.Under.One.1M.Static.30K
Best significant
SVM.Under.One.1M.Static.90K FALSE
TRUE
NNET.Under.One.1M.Static.90K
SVM.Under.One.1M.Static.60K
NNET.Under.One.1M.Static.60K
The strategies are called daily if a model is built every day, 90 days (K ¼ 90) for training and keeps the last 5 models created
weekly if once a week or 15days if every 15 days. We compare (M ¼ 5) for predictions. In the case of AP however this strategy is
ensemble strategies (M > 1) with models built on single chunks not statistically better than the ensemble approaches ranked as
(K ¼ 1) against single models strategies (M ¼ 1) using several second.
chunks in the training (K > 1). For all metrics, the strategies that use only the current chunk to
For all metrics, the best strategy is RF.Under.Daily.5M.Upda- build a model (K ¼ 1) are coherently the worst. This confirms the
te.90K. It creates a new model at each chunk using previous result of previous analysis showing that a too short window of data
4922 A. Dal Pozzolo et al. / Expert Systems with Applications 41 (2014) 4915–4928
(and consequently a very small fraction of frauds) is insufficient to 7.2.3. Retaining old genuine transactions together with old frauds is
learn a reliable model. beneficial?
When comparing the update frequency of the models using the This section assesses the accuracy of the forgetting approach
same number of chunks for training (K ¼ 90), daily update is rank- described in Section 6 whose rationale is to avoid the discard of
ing always better than weekly and 15days. This confirms the intui- old fraudulent observations.
tion that fraud distribution is always evolving and therefore it is Accumulating old frauds leads to less unbalanced chunks. In
better to update the models as soon as possible. order to avoid having chunks where the accumulated frauds
Metric: AP
RF.Under.Daily.5M.Update.90K
RF.Under.Daily.15M.Update.90K
RF.Under.Daily.30M.Update.90K
RF.Under.Daily.1M.Update.90K
RF.Under.Daily.1M.Update.60K
RF.Under.Weekly.1M.Update.90K Best significant
FALSE
RF.Under.15days.1M.Update.90K TRUE
RF.Under.Daily.1M.Update.30K
RF.Under.Daily.30M.Update.1K
RF.Under.Daily.15M.Update.1K
RF.Under.Daily.60M.Update.1K
RF.Under.Daily.90M.Update.1K
0 1000 2000
Metric: AUC
RF.Under.Daily.5M.Update.90K
RF.Under.Daily.15M.Update.90K
RF.Under.Daily.30M.Update.90K
RF.Under.Daily.1M.Update.90K
RF.Under.Weekly.1M.Update.90K
RF.Under.Daily.1M.Update.60K Best significant
FALSE
RF.Under.15days.1M.Update.90K TRUE
RF.Under.Daily.1M.Update.30K
RF.Under.Daily.30M.Update.1K
RF.Under.Daily.15M.Update.1K
RF.Under.Daily.60M.Update.1K
RF.Under.Daily.90M.Update.1K
0 1000 2000
Metric: PrecisionRank
RF.Under.Daily.5M.Update.90K
RF.Under.Daily.15M.Update.90K
RF.Under.Daily.30M.Update.90K
RF.Under.Daily.1M.Update.90K
RF.Under.Weekly.1M.Update.90K
RF.Under.Daily.1M.Update.60K Best significant
FALSE
RF.Under.15days.1M.Update.90K TRUE
RF.Under.Daily.1M.Update.30K
RF.Under.Daily.30M.Update.1K
RF.Under.Daily.15M.Update.1K
RF.Under.Daily.60M.Update.1K
RF.Under.Daily.90M.Update.1K
0 500 1000 1500 2000 2500
outnumber the genuine transactions, two options are available: (i) Fig. 7 shows the sum of ranks for different strategies where the
forgetting some of the old frauds (ii) accumulating old genuine genuine transactions are taken from a different number of days
transactions as well. In the first case when the accumulated frauds (K gen ). The best strategy for AP and PrecisionRank uses an ensemble
represent 40% of the transaction, new frauds replace old frauds as of 5 models for each chunk (M ¼ 5) and 30 days for genuine
in Gao, Ding, Fan, Han, and Yu (2008). In the second case we accu- transactions (K gen ¼ 30). The same strategy ranks third in terms
mulate genuine transactions from previous K gen chunks, where K gen of AUC and is significantly worse than the best. To create ensem-
defines the number of chunks used (see Fig. 3). bles we use a time-based array of models of fixed size M, which
Metric: AP
RF.Under.Daily.5M.Forget.30Kgen
RF.Under.Daily.10M.Forget.30Kgen
RF.Under.Daily.15M.Forget.30Kgen
RF.Under.Daily.1M.Forget.15Kgen
Best significant
RF.Under.Daily.1M.Forget.30Kgen FALSE
TRUE
RF.Under.Daily.1M.Forget.60Kgen
RF.Under.Daily.1M.Forget.90Kgen
RF.Under.Daily.30M.Forget.30Kgen
RF.Under.Daily.1M.Forget.0Kgen
Metric: AUC
RF.Under.Daily.1M.Forget.90Kgen
RF.Under.Daily.1M.Forget.60Kgen
RF.Under.Daily.5M.Forget.30Kgen
RF.Under.Daily.10M.Forget.30Kgen
Best significant
RF.Under.Daily.15M.Forget.30Kgen FALSE
TRUE
RF.Under.Daily.1M.Forget.30Kgen
RF.Under.Daily.30M.Forget.30Kgen
RF.Under.Daily.1M.Forget.15Kgen
RF.Under.Daily.1M.Forget.0Kgen
Metric: PrecisionRank
RF.Under.Daily.5M.Forget.30Kgen
RF.Under.Daily.1M.Forget.60Kgen
RF.Under.Daily.1M.Forget.90Kgen
RF.Under.Daily.10M.Forget.30Kgen
Best significant
RF.Under.Daily.1M.Forget.30Kgen FALSE
TRUE
RF.Under.Daily.15M.Forget.30Kgen
RF.Under.Daily.1M.Forget.15Kgen
RF.Under.Daily.30M.Forget.30Kgen
RF.Under.Daily.1M.Forget.0Kgen
0 1000 2000
means that when the number of models available is greater than 7.2.4. Do balancing techniques have an impact on accuracy?
M, the most recent in time model replaces the Mth model in the ar- So far we considered exclusively undersampling as balancing
ray removing the oldest model in the ensemble. technique in our experiments. In this section we assess the impact
In general we see better performances when K gen increase from of using alternative methods like SMOTE and EasyEnsemble.
0 to 30 and only in few cases K gen > 30 leads to significantly better Experimental results (Fig. 8) show that they both over-perform
accuracy. Note that in all our strategies after selecting the observa- undersampling.
tions to include in the training sets we use undersampling to make In our datasets, the number of frauds is on average 0.4% of all
sure we have the two classes equally represented. transactions in the chunk. Undersampling randomly selects a
Metric: AP
RF.SMOTE.Daily.1M.Update.90K
RF.EasyEnsemble.Daily.1M.Forget.90Kgen
RF.SMOTE.Daily.1M.Forget.90Kgen
RF.EasyEnsemble.Daily.1M.Update.90K
Best significant
RF.Under.Daily.1M.Forget.90Kgen FALSE
TRUE
RF.Under.Daily.1M.Update.90K
RF.EasyEnsemble.One.1M.Static.90K
RF.SMOTE.One.1M.Static.90K
RF.Under.One.1M.Static.90K
Metric: AUC
RF.EasyEnsemble.Daily.1M.Forget.90Kgen
RF.EasyEnsemble.Daily.1M.Update.90K
RF.SMOTE.Daily.1M.Update.90K
RF.Under.Daily.1M.Forget.90Kgen
Best significant
RF.SMOTE.Daily.1M.Forget.90Kgen FALSE
TRUE
RF.Under.Daily.1M.Update.90K
RF.EasyEnsemble.One.1M.Static.90K
RF.Under.One.1M.Static.90K
RF.SMOTE.One.1M.Static.90K
Metric: PrecisionRank
RF.SMOTE.Daily.1M.Update.90K
RF.EasyEnsemble.Daily.1M.Forget.90Kgen
RF.EasyEnsemble.Daily.1M.Update.90K
RF.SMOTE.Daily.1M.Forget.90Kgen
Best significant
RF.Under.Daily.1M.Forget.90Kgen FALSE
TRUE
RF.Under.Daily.1M.Update.90K
RF.EasyEnsemble.One.1M.Static.90K
RF.SMOTE.One.1M.Static.90K
RF.Under.One.1M.Static.90K
Fig. 8. Comparison of different balancing techniques and strategies using sum of ranks in all chunks.
A. Dal Pozzolo et al. / Expert Systems with Applications 41 (2014) 4915–4928 4925
number of genuine transactions equal to the number of frauds, the static approach ranks low in Figs. 10 as it is not able to adapt
which means removing about 99.6% of the genuine transactions to the changing distribution. The forgetting approach is signifi-
in the chunk. EasyEnsemble is able to reduce the variance of under- cantly better than update for EasyEnsemble, while SMOTE gives
sampling by using several sub-models for each chunk, while better ranking with update.
SMOTE creates new artificial fraudulent transactions. In our exper- It is worth notice that strategies which combines more than one
iments we used 5 sub-models in EasyEnsemble. For all balancing model (M > 1) together with undersampling are not superior to
techniques, between the three approaches presented in Section 6, the predictions with a single model and EasyEnsemble. EasyEn-
the static approach is consistently the worse. semble learns from different samples of the majority class, which
In Fig. 9 we compare the previous strategies in terms of average means that for each chunk different concepts of the majority class
prediction time over all chunks. SMOTE is computationally heavy are learnt.
since it consists in oversampling, leading to bigger chunk sizes.
EasyEnsemble replicates undersampling and learns from several
sub-chunks. This gives higher computational time than undersam-
8. Future work
pling. Between the different incremental approaches, static has the
lowest time as the model is learnt once and no retrained. Forget
Future work will focus on the automatic selection of the best
strategy has the highest prediction time over all balancing meth-
unbalanced technique in the case of online learning. Dal Pozzolo,
ods. This is expected since it retains old transactions to deal with
Caelen, Waterschoot, and Bontempi (2013) recently proposed to
unbalanced chunks.
use a F-race (Birattari, Stützle, Paquete, & Varrentrapp, 2002) algo-
rithm to automatically select the correct unbalanced strategy for a
7.2.5. Overall, which is the best strategy? given dataset. In their work a cross validation is used to feed the
The large number of possible alternatives (in terms of learning data into the race. A natural extension of this work could be the
classifier, balancing technique and incremental learning strategy) use of racing in incremental data where the data fed into the race
require a joint assessment of several combination in order to come comes from new chunks in the stream.
up with a recommended approach. Fig. 10 summaries the best Throughout our paper we used only data driven techniques to
strategies in terms of different metrics. The combinations of Easy- deal with the unbalanced problem. HDDT (Cieslak & Chawla,
Ensemble with forgetting emerge as best for all metrics. SMOTE 2008) is a decision tree that uses Hellinger distance (Hellinger,
with update is not significantly worse of the best for AP and Preci- 1909) as splitting criteria that is able to deal with skewed distribu-
sionRank, but it is not ranking well in terms of AUC. The fact that tion. With HDDT, balancing method are not longer needed before
within the best strategies we see different balancing techniques training. The use of such algorithm could remove the need of posi-
confirms that in online learning when the data is unbalanced, the tive instances propagation between chunks to fight the unbalanced
adopted balancing strategy may play a major role. As expected problem.
400
Computational time
200
0
n
0K
0K
ge
ge
ge
.90
.90
.90
c.9
c.9
c.9
0K
0K
0K
ate
ate
ate
ati
ati
ati
t.9
t.9
t.9
pd
pd
pd
St
.St
.St
rge
rge
rge
.U
.U
.U
M.
M
.Fo
.Fo
.Fo
1M
1M
1M
e.1
e.1
e.1
1M
1M
1M
ily.
ily.
ily.
On
On
On
Da
Da
Da
ily.
ily.
ily.
Da
Da
Da
Strategy
Fig. 9. Comparison of different balancing techniques and strategies in terms of average prediction time (in seconds) over all chunks. Experiments run on a HP ProLiant BL465c
G5 blades with 2x AMD Opteron 2.4 GHz, 4 cores each and 32 GB DDR3 RAM.
4926 A. Dal Pozzolo et al. / Expert Systems with Applications 41 (2014) 4915–4928
Metric: AP
RF.EasyEnsemble.Daily.1M.Forget.90Kgen
RF.SMOTE.Daily.1M.Update.90K
RF.Under.Daily.5M.Forget.30Kgen
RF.SMOTE.Daily.1M.Forget.90Kgen
RF.EasyEnsemble.Daily.1M.Update.90K
RF.Under.Daily.10M.Forget.30Kgen
RF.Under.Daily.15M.Forget.30Kgen
RF.Under.Daily.5M.Update.90K
RF.Under.Daily.15M.Update.90K
RF.Under.Daily.30M.Update.90K
RF.Under.Daily.30M.Forget.30Kgen
RF.Under.Daily.1M.Forget.90Kgen
RF.Under.Daily.1M.Update.90K Best significant
RF.Under.Weekly.1M.Update.90K FALSE
RF.Under.15days.1M.Update.90K TRUE
RF.Under.One.1M.Static.120K
RF.EasyEnsemble.One.1M.Static.90K
RF.Under.One.1M.Static.90K
RF.SMOTE.One.1M.Static.90K
RF.Under.One.1M.Static.60K
RF.Under.One.1M.Static.30K
SVM.Under.One.1M.Static.120K
SVM.Under.One.1M.Static.90K
NNET.Under.One.1M.Static.120K
NNET.Under.One.1M.Static.90K
SVM.Under.One.1M.Static.60K
NNET.Under.One.1M.Static.60K
0 1000 2000 3000 4000
Metric: AUC
RF.EasyEnsemble.Daily.1M.Forget.90Kgen
RF.EasyEnsemble.Daily.1M.Update.90K
RF.Under.Daily.5M.Update.90K
RF.Under.Daily.15M.Update.90K
RF.SMOTE.Daily.1M.Update.90K
RF.Under.Daily.1M.Forget.90Kgen
RF.Under.Daily.30M.Update.90K
RF.Under.Daily.1M.Update.90K
RF.SMOTE.Daily.1M.Forget.90Kgen
RF.Under.Weekly.1M.Update.90K
RF.Under.Daily.5M.Forget.30Kgen
RF.Under.Daily.10M.Forget.30Kgen
RF.Under.15days.1M.Update.90K Best significant
RF.Under.Daily.15M.Forget.30Kgen FALSE
RF.Under.One.1M.Static.120K TRUE
RF.Under.Daily.30M.Forget.30Kgen
RF.EasyEnsemble.One.1M.Static.90K
RF.Under.One.1M.Static.90K
RF.SMOTE.One.1M.Static.90K
RF.Under.One.1M.Static.60K
RF.Under.One.1M.Static.30K
NNET.Under.One.1M.Static.120K
NNET.Under.One.1M.Static.90K
NNET.Under.One.1M.Static.60K
SVM.Under.One.1M.Static.90K
SVM.Under.One.1M.Static.120K
SVM.Under.One.1M.Static.60K
0 1000 2000 3000 4000
Metric: PrecisionRank
RF.EasyEnsemble.Daily.1M.Forget.90Kgen
RF.SMOTE.Daily.1M.Update.90K
RF.EasyEnsemble.Daily.1M.Update.90K
RF.Under.Daily.5M.Update.90K
RF.Under.Daily.15M.Update.90K
RF.Under.Daily.1M.Forget.90Kgen
RF.Under.Daily.5M.Forget.30Kgen
RF.SMOTE.Daily.1M.Forget.90Kgen
RF.Under.Daily.30M.Update.90K
RF.Under.Daily.1M.Update.90K
RF.Under.Daily.10M.Forget.30Kgen
RF.Under.Daily.15M.Forget.30Kgen
RF.Under.Weekly.1M.Update.90K Best significant
RF.Under.Daily.30M.Forget.30Kgen FALSE
RF.Under.15days.1M.Update.90K TRUE
RF.Under.One.1M.Static.120K
RF.EasyEnsemble.One.1M.Static.90K
RF.Under.One.1M.Static.90K
RF.SMOTE.One.1M.Static.90K
RF.Under.One.1M.Static.60K
RF.Under.One.1M.Static.30K
SVM.Under.One.1M.Static.120K
NNET.Under.One.1M.Static.120K
SVM.Under.One.1M.Static.90K
NNET.Under.One.1M.Static.90K
SVM.Under.One.1M.Static.60K
NNET.Under.One.1M.Static.60K
0 1000 2000 3000 4000
Fig. 10. Comparison of all strategies using sum of ranks in all chunks.
A. Dal Pozzolo et al. / Expert Systems with Applications 41 (2014) 4915–4928 4927
In our work the combination of models in an ensemble is based performance measure adopted, etc. (Dal Pozzolo et al., 2013). In
on the performance of each model in the testing chunk. Several our work we adopted two alternatives to undersampling: SMOTE
other methods (Kolter & Maloof, 2003; Hoens, Chawla, & Polikar, and EasyEnsemble. In particular we show that they are both able
2011; Lichtenwalter & Chawla, 2010) have been proposed to com- to return higher accuracies. Our framework can be easily extended
bine models in presence of concept drift. In future work it would be to include other data-level balancing techniques.
interesting to test some of these methods and compare it to our The experimental part has shown that in online learning, when
framework. the data is skewed towards one class it is important maintaining
In this manuscript we assumed that there is a single concept to previous minority examples in order to learn a better separation
learn for the minority class. However, as frauds are different from of two classes. Instance propagation from previous chunks has
each other we could distinguish several sub-concept within the the effect of increasing the minority class in the current chunk,
positive class. Hoens et al. (2011) suggest to use Naive Bayes to but it is of limited impact given the small number of frauds. The
retain old positive instances that come from the same sub-concept. update and forgetting approaches presented in Section 6 differ
REA (Chen & He, 2011) and SERA (Chen et al., 2009) proposed by essentially in the way the minority class is oversampled in the cur-
Chen and He propagate to the last chunk only minority class that rent chunk. We tested several ensemble and single models strate-
belong to the same concept using Mahalanobis distance and a gies using different number of chunks for training. In general we
k-nearest neighbors algorithm. Future work should take into con- see that models trained on several chunks have better accuracy
sideration the possibility of having several minority concepts. than single chunk models. Multi-chunks models learn on overlap-
ping training sets, when this happens single models strategies can
beat ensembles.
9. Conclusion Our framework addresses the problem of non-stationary in data
streams by creating a new model every time a new chunk is avail-
The need to detect fraudulent patterns in huge amount of data able. This approach has showed better results than updating the
demands the adoption of automatic methods. The scarcity of public models at a lower frequency (weekly or every 15days). Updating
available dataset in credit card transactions gives little chance to the models is crucial in a non-stationary environments, this intui-
the community to test and asses the impact of existing techniques tion is confirmed by the bad results of the static approach. In our
on real data. The goal of our work it to give some guidelines to dataset, overall we saw Random Forest beating Neural Network
practitioners on how to tackle the detection problem. and Support Vector Machine. The final best strategy implemented
The paper presents the fraud detection problem and proposes the forgetting approach together with EasyEnsemble and daily
AP, AUC and PrecisonRank as correct performance measures for a update.
fraud detection task. Frauds occur rarely compared to the total
amount of transactions. As explained in Section 5.1,, standard clas-
sification metrics such as Accuracy are not suitable for unbalanced References
problems. Moreover, the goal of detection is giving the investiga-
tors the transactions with the highest fraud risk. For this reason Bamber, D. (1975). The area above the ordinal dominance graph and the area below
we argue that having a good ranking of the transactions by their the receiver operating characteristic graph. Journal of Mathematical Psychology,
12, 387–415.
fraud probability is more important than having transactions Batista, G., Carvalho, A., & Monard, M. (2000). Applying one-sided selection to
correctly classified. unbalanced datasets. MICAI 2000: Advances in Artificial Intelligence, 315–325.
Credit card fraud detection relies on the analysis of recorded Birattari, M., Stützle, T., Paquete, L., & Varrentrapp, K. (2002). A racing algorithm for
configuring metaheuristics. In Proceedings of the genetic and evolutionary
transactions. However a single transaction information is not con- computation conference (pp. 11–18).
sidered sufficient to detect a fraud occurrence (Bolton & Hand, Bolton, R. J., & Hand, D. J. (2001). Unsupervised profiling methods for fraud
2001) and the analysis has to take into consideration the card- detection. Credit Scoring and Credit Control, VII, 235–255.
Bolton, R., & Hand, D. (2002). Statistical fraud detection: A review. Statistical Science,
holder behaviour. In this paper we have proposed a way to include 235–249.
cardholder information into the transaction by computing aggre- Brause, R., Langsdorf, T., & Hepp, M. (1999). Neural data mining for credit card fraud
gate variables on historical transaction of the same card. detection. In Tools with artificial intelligence, 1999. Proceedings (pp. 103–106).
IEEE.
As new credit-card transactions keep arriving, the detection
Chan, P., Fan, W., Prodromidis, A., & Stolfo, S. (1999). Distributed data mining in
system has to process them as soon as they arrive incrementally credit card fraud detection. Intelligent Systems and their Applications, 14, 67–74.
and avoid retaining in memory too many old transactions. Fraud Chawla, N. V. (2005). Data mining for imbalanced datasets: An overview. In Data
types are in continuous evolution and detection has to adapt to mining and knowledge discovery handbook (pp. 853–867). Springer.
Chawla, N., Bowyer, K., Hall, L., & Kegelmeyer, W. (2011). Smote: synthetic minority
fraudsters. Once a fraud is well detected, the fraudster could over-sampling technique. Arxiv preprint arXiv:1106.1813.
change his habits and find another way to fraud. Adaptive schemes Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: Special issue on learning
are therefore required to deal with non-stationary fraud dynamics from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6, 1–6.
Chen, S., & He, H. (2009). Sera: Selectively recursive approach towards
and discover potentially new fraud mechanisms by itself. We com- nonstationary imbalanced stream data mining. In International joint conference
pare three alternative approaches (static, update and forgetting) to on neural networks, 2009. IJCNN 2009 (pp. 522–529). IEEE.
learn from unbalanced and non-stationary credit card data Chen, S., & He, H. (2011). Towards incremental learning of nonstationary
imbalanced data stream: A multiple selectively recursive approach. Evolving
streams. Systems, 2, 35–50.
Fraud detection is a highly unbalanced problem where the Chen, S., He, H., Li, K., & Desai, S. (2010). Musera: Multiple selectively recursive
number of genuine transactions far outnumbers the fraudulent approach towards imbalanced stream data mining. In The 2010 international
joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.
ones. In the static learning setting a wide range of techniques have Cieslak, D. A., & Chawla, N. V. (2008). Learning decision trees for unbalanced data. In
been proposed to deal with unbalanced dataset. In incremental Machine learning and knowledge discovery in databases (pp. 241–256). Springer.
learning however few attempts have tried to deal with unbalanced Clark, P., & Niblett, T. (1989). The cn2 induction algorithm. Machine Learning, 3,
261–283.
data streams (Gao et al., 2007; Hoens et al., 2012; Ditzler, Polikar, &
Cohen, W. W. (1995). Fast effective rule induction. In Machine learning-international
Chawla, 2010). In these works, the most common balancing workshop then conference (pp. 115–123). Morgan Kaufmann Publishers, INC..
technique consists in undersampling the majority class in order Dal Pozzolo, A., Caelen, O., Waterschoot, S., & Bontempi, G. (2013). Racing for
to reduce the skewness of the of the chunks. unbalanced methods selection. In Proceedings of the 14th international
conference on intelligent data engineering and automated learning, IDEAL.
The best technique for unbalanced data may depend on several Delamaire, L., Abdou, H., & Pointon, J. (2009). Credit card fraud and detection
factors such as (i) data distribution (ii) classifier used (iii) techniques: A review. Banks and Bank Systems, 4, 57–68.
4928 A. Dal Pozzolo et al. / Expert Systems with Applications 41 (2014) 4915–4928
R Development Core Team, R: A Language and Environment for Statistical Kuncheva, L. I. (2004). Classifier ensembles for changing environments. In Multiple
Computing, R Foundation for Statistical Computing, Vienna, Austria, 2011. classifier systems (pp. 1–15). Springer.
<http://www.R-project.org/>, ISBN 3-900051-07-0. Liaw, A., & Wiener, M. (2002). Classification and regression by randomforest. R
Ditzler, G., Polikar, R., & Chawla, N. (2010). An incremental learning algorithm for News, 2, 18–22.
non-stationary environments and class imbalance. In 2010 20th International Lichtenwalter, R. N., & Chawla, N. V. (2010). Adaptive methods for classification in
conference on pattern recognition (ICPR) (pp. 2997–3000). IEEE. arbitrarily imbalanced and drifting data streams. In New frontiers in applied data
Dorronsoro, J., Ginel, F., Sgnchez, C., & Cruz, C. (1997). Neural fraud detection in mining (pp. 53–75). Springer.
credit card operations. Neural Networks, 8, 827–834. Liu, X., Wu, J., & Zhou, Z. (2009). Exploratory undersampling for class-imbalance
Drummond, C., & Holte, R. (2003). C4. 5, class imbalance, and cost sensitivity: Why learning. Systems, Man, and Cybernetics, Part B: Cybernetics, 39, 539–550.
under-sampling beats over-sampling. In Workshop on learning from imbalanced Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., & Leisch, F. (2012). e1071:
datasets II. Citeseer. Misc Functions of the Department of Statistics (e1071), TU Wien, 2012. <http://
Elkan, C. (2001). The foundations of cost-sensitive learning. International joint CRAN.R-project.org/package=e1071>, r package version 1.6-1.
conference on artificial intelligence (Vol. 17, pp. 973–978). Citeseer. Olshen, L., & Stone, C. (1986). Classification and regression trees. Wadsworth
Fan, G., & Zhu, M. (2011). Detection of rare items with target. Statistics and Its International Group.
Interface, 4, 11–17. Oza, N. C. (2005). Online bagging and boosting. Systems, man and cybernetics (Vol. 3,
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit pp. 2340–2345). IEEE.
in the analysis of variance. Journal of the American Statistical Association, 32, Pavía, J. M., Veres-Ferrer, E. J., & Foix-Escura, G. (2012). Credit card incidents and
675–701. control systems. International Journal of Information Management, 32, 501–503.
Gao, J., Ding, B., Fan, W., Han, J., & Yu, P. S. (2008). Classifying data streams with Polikar, R., Upda, L., Upda, S., Honavar, V., et al. (2001). Learn++: An incremental
skewed class distributions and concept drifts. Internet Computing, 12, 37–49. learning algorithm for supervised neural networks. Systems, Man, and
Gao, J., Fan, W., Han, J., & Philip, S. Y. (2007). A general framework for mining Cybernetics, Part C: Applications and Reviews, 31, 497–508.
concept-drifting data streams with skewed distributions. In SDM. Pozzolo, A. D. (2014). unbalanced: The package implements different data-driven
Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and method for unbalanced datasets. R package version 1.0.
architectures. Neural Networks, 1, 17–61. Provost, F. (2000). Machine learning from imbalanced data sets 101. In Proceedings
Hand, D. (2009). Measuring classifier performance: A coherent alternative to the of the AAAI’2000 workshop on imbalanced data sets.
area under the ROC curve. Machine Learning, 77, 103–123. Quah, J. T., & Sriganesh, M. (2008). Real-time credit card fraud detection using
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the roc curve computational intelligence. Expert Systems with Applications, 35, 1721–1732.
for multiple class classification problems. Machine Learning, 45, 171–186. Quinlan, J. (1993). C4. 5: Programs for machine learning (Vol. 1). Morgan Kaufmann.
Hellinger, E. (1909). Neue begründung der theorie quadratischer formen von Street, W. N., & Kim, Y. (2001). A streaming ensemble algorithm (sea) for large-scale
unendlichvielen veränderlichen. Journal für die reine und Angewandte classification. In Proceedings of the seventh ACM SIGKDD international conference
Mathematik, 136, 210–271. on knowledge discovery and data mining (pp. 377–382). ACM.
Hoens, T. R., Chawla, N. V., & Polikar, R. (2011). Heuristic updatable weighted Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (fourth ed.). 0-
random subspaces for non-stationary environments. In 2011 IEEE 11th 387-95457-0. New York: Springer<http://www.stats.ox.ac.uk/pub/MASS4> .
international conference on data mining (ICDM) (pp. 241–250). IEEE. Wang, H., Fan, W., Yu, P. S., & Han, J. (2003). Mining concept-drifting data streams
Hoens, T. R., Polikar, R., & Chawla, N. V. (2012). Learning from streaming data with using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international
concept drift and imbalance: An overview. Progress in Artificial Intelligence, 1, conference on Knowledge discovery and data mining (pp. 226–235). ACM.
89–101. Weston, D., Hand, D., Adams, N., Whitrow, C., & Juszczak, P. (2008). Plastic card
Holte, R. C., Acker, L. E., & Porter, B. W. (1989). Concept learning and the problem of fraud detection using peer group analysis. Advances in Data Analysis and
small disjuncts. Proceedings of the eleventh international joint conference on Classification, 2, 45–62.
artificial intelligence (Vol. 1). Citeseer. Whitrow, C., Hand, D. J., Juszczak, P., Weston, D., & Adams, N. M. (2009). Transaction
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic aggregation as a strategy for credit card fraud detection. Data Mining and
study. Intelligent Data Analysis, 6, 429–449. Knowledge Discovery, 18, 30–55.
Kolter, J. Z., & Maloof, M. A. (2003). Dynamic weighted majority: A new ensemble Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and
method for tracking concept drift. In Third IEEE international conference on data hidden contexts. Machine Learning, 23, 69–101.
mining, 2003. ICDM 2003 (pp. 123–130). IEEE.