Abstract—In imbalanced datasets, minority classes can be erroneously classified by common classification algorithms. In this paper, an ensemble-based algorithm is proposed that creates new balanced training sets containing all of the minority class and an under-sample of the majority class. In each round, the algorithm identifies hard examples of the majority class and generates synthetic examples from them for the next round. For each training set, a weak learner is used as the base classifier. Final predictions are obtained by casting a majority vote. The method is compared with several known algorithms, and experimental results demonstrate the effectiveness of the proposed algorithm.

Index Terms—Data mining, ensemble algorithm, imbalanced data sets, synthetic samples.
I. INTRODUCTION

Imbalanced datasets, in which one class is represented by a much larger number of instances than the other classes, are common in fraud detection, text classification, and medical diagnosis. In these domains, as in many others, the minority class is usually the least tolerant to classification failure and the most important one in cost-sensitive terms. For example, misclassifying a credit card fraud may damage the bank's reputation, incur the cost of the transaction, and leave a dissatisfied client, whereas misclassifying a non-fraudulent transaction only costs a call to the client. Likewise, in oil spill detection, an undetected spill may cost thousands of dollars, while classifying a non-spill sample as a spill only costs an inspection. Because of the imbalance problem, traditional machine learning can achieve good results on the majority class but may predict poorly on the minority class examples. In order to tackle this problem many solutions have been presented; they can be divided into the data level and the algorithm level.

At the data level, the most common ways to tackle rarity are over-sampling and under-sampling. Under-sampling may cause a loss of information on the majority class and degrade its classification, since some of its examples are removed. Random over-sampling may make the decision regions of the learner smaller and more specific, and thus may cause the learner to over-fit.

As an alternative to over-sampling, SMOTE [1] was proposed as a method to generate synthetic samples of the minority class. The advantage of SMOTE is that it makes the decision regions larger and less specific. SMOTEBoost [2] proceeds in a series of T rounds in which the distribution Dt is updated every round; the examples of the minority class are then over-sampled by creating synthetic minority class examples. DataBoost-IM [3] is a modification of AdaBoost.M2 which identifies hard examples and generates synthetic examples for the minority as well as the majority class.

The methods at the algorithm level operate on the algorithms rather than on the data sets. Bagging and Boosting are two common algorithms used to improve the performance of a classifier. They are examples of ensemble methods, that is, methods that use a combination of models. Bagging (bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve classification by combining the classifications of randomly generated training sets [4]. Boosting is a mechanism for training a sequence of "weak" learners and combining the hypotheses generated by these weak learners so as to obtain an aggregate hypothesis which is highly accurate. AdaBoost [5] increases the weights of misclassified examples and decreases those of correctly classified examples in the same proportion, without considering the imbalance of the data set. Thus, traditional boosting algorithms do not perform well on the minority class.
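For concreteness, the re-weighting just described can be sketched as follows; this is a minimal NumPy illustration of the standard AdaBoost update in [5], not code from this paper.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost re-weighting step: mistakes are up-weighted and correct
    predictions are down-weighted by the same factor, with no notion of
    class imbalance."""
    incorrect = (y_true != y_pred)
    err = np.sum(weights[incorrect]) / np.sum(weights)   # weighted error of the weak hypothesis
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))  # hypothesis weight
    new_weights = weights * np.exp(alpha * np.where(incorrect, 1.0, -1.0))
    return new_weights / new_weights.sum(), alpha
```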
In this paper, an algorithm to cope with imbalanced datasets is proposed, as described in the next section.

The rest of the paper is organized as follows. Section 2 describes the E-AdSampling algorithm. Section 3 presents the setup of the experiments. Section 4 gives a comparative evaluation of E-AdSampling on 6 datasets. Finally, conclusions are drawn in Section 5.
Manuscript received January 23, 2010.
This work was supported by the Postdoctoral Foundation of Central South University (2008), China, and the Education Innovation Foundation for Graduate Students of Central South University (2008).
Ordonez Jon Geiler is with the Institute of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China (corresponding author; phone: +86 7318832954; e-mail: jongeilerordonezp@hotmail.com).
Li Hong is with the Institute of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China (e-mail: lihongcsu2007@126.com).
Guo Yue-jian is with the Institute of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China (e-mail: tianyahongqi@126.com).
II. E-ADSAMPLING ALGORITHM

The main focus of E-AdSampling is to enhance prediction on the minority class without sacrificing majority-class performance. An ensemble algorithm is proposed in which each training set is balanced by under-sampling and by the generation of synthetic examples. For the majority class an under-sampling strategy is applied. By under-sampling the majority class, the algorithm is steered towards the minority class, obtaining a better True Positive ratio and accuracy for the minority class. Nevertheless, the majority class will suffer a reduction in accuracy and True Positive ratio due to the loss of information. To alleviate this loss, the proposed algorithm searches for misclassified majority-class samples in each round; it then generates new synthetic samples based on these hard samples and adds them to the new training set.

As shown in Fig. 1, the process is split into four steps: first, in order to balance the training dataset, the majority class is randomly under-sampled; second, synthetic examples are generated for hard examples of the majority class and added to the training dataset; third, each training set is modeled with any weak learning algorithm; finally, the results obtained on each training set are combined.

Input: a set S = {(x1, y1), …, (xm, ym)}, xi ∈ X, with labels yi ∈ Y = {1, …, C}
• For t = 1, 2, …, T:
  o Create a balanced sample Dt by under-sampling the majority class.
  o Identify hard majority-class examples in the original data set.
  o Generate synthetic examples for the hard majority-class examples.
  o Add the synthetic examples to Dt.
  o Train a weak learner using distribution Dt.
  o Compute the weak hypothesis ht: X × Y → [0, 1].
• Output the final hypothesis: $H^{*}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} h_t(x, y)$

Fig. 1. The E-AdSampling algorithm.
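A minimal sketch of the loop in Fig. 1 is given below. It assumes scikit-learn's DecisionTreeClassifier as the weak learner, takes the hard examples of a round to be those the previous round's learner misclassifies (one reading of the pseudocode), and reduces the synthetic-generation step to a single random interpolation per sample; it is an illustration of the procedure, not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_e_adsampling(X, y, minority_label, T=10, synth_per_hard=2, seed=0):
    """Sketch of Fig. 1: T rounds of balanced sampling plus synthetic hard
    majority examples. Assumes a numeric feature matrix X and integer labels y."""
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    learners = []
    for t in range(T):
        # Step 1: balance by randomly under-sampling the majority class.
        sampled_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        Xt = np.vstack([X[min_idx], X[sampled_maj]])
        yt = np.concatenate([y[min_idx], y[sampled_maj]])
        if learners:
            # Step 2: hard majority examples = those the previous learner got wrong.
            hard = maj_idx[learners[-1].predict(X[maj_idx]) != y[maj_idx]]
            for i in hard:
                for _ in range(synth_per_hard):
                    # Interpolate toward a random majority sample (SMOTE-like, cf. Sec. II-A).
                    j = rng.choice(maj_idx)
                    gap = rng.random()
                    Xt = np.vstack([Xt, X[i] + gap * (X[j] - X[i])])
                    yt = np.append(yt, y[i])
        # Step 3: train a weak learner on the round's training set Dt.
        learners.append(DecisionTreeClassifier(max_depth=3).fit(Xt, yt))
    return learners

def predict_majority_vote(learners, X):
    """Step 4: combine the T hypotheses by a simple majority vote (labels 0..C-1)."""
    votes = np.stack([clf.predict(X) for clf in learners])
    return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```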
A. Generating Synthetic Examples

SMOTE was proposed by N. V. Chawla, K. W. Bowyer, and P. W. Kegelmeyer [1] as a method for over-sampling datasets. SMOTE over-samples the minority class by taking each minority-class sample and introducing synthetic examples along the line segments joining it to its minority-class nearest neighbors. E-AdSampling adopts the same technique for majority-class examples that have been misclassified. By using this technique, inductive learners such as decision trees are able to broaden the decision regions around hard majority examples.
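A sketch of this interpolation step, applied to hard majority examples as described above, is shown next; the number of neighbors k and the number of synthetic samples per example are not fixed by the paper, so the defaults below are assumptions.

```python
import numpy as np

def generate_synthetic_majority(hard_X, majority_X, k=5, n_per_example=2, seed=0):
    """SMOTE-style generation: interpolate each hard majority example toward
    one of its k nearest majority-class neighbors, chosen at random."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for x in hard_X:
        # Distances from x to every majority-class sample.
        d = np.linalg.norm(majority_X - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]      # skip the first entry (typically x itself)
        for _ in range(n_per_example):
            nb = majority_X[rng.choice(neighbors)]
            gap = rng.random()                  # random point on the segment x -> nb
            synthetic.append(x + gap * (nb - x))
    return np.asarray(synthetic)
```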
B. Sampling Training Datasets

Adaptive sampling designs are designs in which the selection procedure may depend sequentially on the observed values of the variable of interest. With the class as the variable of interest, E-AdSampling under-samples or over-samples depending on the class an observation belongs to, or on whether the observation has been erroneously predicted.

In each round of the algorithm a new training dataset is generated. In the first round, the training dataset is perfectly balanced by under-sampling the majority class. From the second round to the final round, the majority class is again under-sampled so as to start from a balanced training dataset, and in addition new synthetic samples are generated and added for the hard examples of the majority class.

Table I shows an example of 10 rounds of the algorithm for the Ozone dataset. The Ozone dataset has 2536 samples, 73 in the minority class and 2463 in the majority class, giving a balance rate of 0.02:0.98.
TABLE I
GENERATING TRAINING DATASETS OVER 10 ROUNDS FOR THE OZONE DATASET

Round   Initial balanced   Misclassified    Synthetic        Total on        Min.    Maj.    Balance
        training set       maj. samples     samples added    training set    final   final   rate
1       146                0                0                146             73      73      0.50 : 0.50
2       146                578              1156             1302            73      1229    0.06 : 0.94
3       146                612              1224             1370            73      1297    0.05 : 0.95
4       146                2                4                150             73      77      0.49 : 0.51
5       146                372              744              890             73      817     0.08 : 0.92
6       146                0                0                146             73      73      0.50 : 0.50
7       146                239              478              624             73      551     0.12 : 0.88
8       146                2                4                150             73      77      0.49 : 0.51
9       146                175              350              496             73      423     0.15 : 0.85
10      146                1                0                146             73      73      0.50 : 0.50
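The bookkeeping behind Table I can be reproduced with a few lines; note that the ratio of two synthetic samples per misclassified majority example is inferred from the table itself (for instance, 578 hard examples yield 1156 synthetic samples in round 2) rather than stated in the text.

```python
def round_summary(n_min, n_misclassified_maj, synth_per_hard=2):
    """Recompute a Table I row: balanced base set plus synthetic majority samples."""
    base = 2 * n_min                       # under-sampled, perfectly balanced starting set
    synthetic = synth_per_hard * n_misclassified_maj
    maj_final = n_min + synthetic          # under-sampled majority plus synthetic additions
    total = base + synthetic
    rate_min = n_min / total
    return total, n_min, maj_final, round(rate_min, 2), round(1 - rate_min, 2)

# Round 2 of the Ozone example: 73 minority samples, 578 hard majority samples.
print(round_summary(73, 578))   # -> (1302, 73, 1229, 0.06, 0.94)
```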
III. EXPERIMENTS SETUP

The evaluation measures used in this paper are derived from the two-class confusion matrix shown in Table II.

TABLE II
TWO-CLASS CONFUSION MATRIX

                   Predicted Positive                 Predicted Negative
Actual Positive    TP (number of True Positives)      FN (number of False Negatives)
Actual Negative    FP (number of False Positives)     TN (number of True Negatives)

Accuracy is defined as

$\mathrm{Acc} = \dfrac{TP + TN}{TP + FN + FP + TN}$   (1)

The TP Rate and FP Rate are calculated as TP/(FN + TP) and FP/(FP + TN). Precision and Recall are calculated as TP/(TP + FP) and TP/(TP + FN). The F-measure is defined as

$F = \dfrac{(1 + \beta^{2}) \times \mathrm{Recall} \times \mathrm{Precision}}{\beta^{2} \times \mathrm{Recall} + \mathrm{Precision}}$   (2)

G-mean is based on the recalls of both classes. The benefit of selecting this metric is that it can measure how balanced the combination scheme is: if a classifier is highly biased toward one class (such as the majority class), the G-mean value is low. For example, if a+ = 0 and a− = 1, meaning that none of the positive examples is identified, then G-mean = 0 [7].
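For illustration, the measures defined in (1) and (2) and the G-mean can be computed directly from the confusion-matrix counts of Table II; the G-mean is taken here as the geometric mean of the two per-class recalls, which the text implies but does not write out.

```python
import math

def evaluation_measures(tp, fn, fp, tn, beta=1.0):
    """Compute the measures of this section from two-class confusion-matrix counts
    (assumes all denominators are non-zero)."""
    acc = (tp + tn) / (tp + fn + fp + tn)            # equation (1)
    tp_rate = tp / (tp + fn)                         # recall on the positive class
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp_rate
    f_measure = ((1 + beta**2) * recall * precision) / (beta**2 * recall + precision)  # equation (2)
    tn_rate = tn / (tn + fp)                         # recall on the negative class
    g_mean = math.sqrt(tp_rate * tn_rate)            # assumed geometric-mean definition
    return {"Acc": acc, "TPRate": tp_rate, "FPRate": fp_rate,
            "Precision": precision, "Recall": recall,
            "F": f_measure, "G-mean": g_mean}
```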
A. The Receiver Operating Characteristic Curve

A receiver operating characteristic (ROC) curve [8] is a graphical approach for displaying the tradeoff between the True Positive Rate (TPR) and the False Positive Rate (FPR) of a classifier. In an ROC curve, the TPR is plotted along the y axis and the FPR is shown on the x axis.

There are several critical points along an ROC curve that have well-known interpretations:
(TPR = 0, FPR = 0): the model predicts every instance to be the negative class.
(TPR = 1, FPR = 1): the model predicts every instance to be the positive class.
(TPR = 1, FPR = 0): the ideal model.

A good classification model should be located as close as possible to the upper left corner of the diagram, while a model which makes random guesses should reside along the main diagonal connecting the points (TPR = 0, FPR = 0) and (TPR = 1, FPR = 1).

The area under the ROC curve (AUC) provides another approach for evaluating which model is better on average. If the model is perfect, its area under the ROC curve equals 1. If the model simply performs random guessing, its area under the ROC curve equals 0.5. A model that is strictly better than another has a larger area under the ROC curve.
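A brief sketch of how the ROC points and the AUC can be obtained from a classifier's scores is shown below; scikit-learn is used purely for illustration, since the experiments in this paper were run in Weka.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def roc_points_and_auc(y_true, scores):
    """Return the (FPR, TPR) points of the ROC curve and its area.

    y_true holds 0/1 labels (1 = positive/minority class) and scores holds the
    classifier's estimated probability of the positive class."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return fpr, tpr, roc_auc_score(y_true, scores)

# A perfect ranking gives AUC = 1, random scores give roughly 0.5.
y = np.array([0, 0, 1, 1])
print(roc_points_and_auc(y, np.array([0.1, 0.4, 0.35, 0.8]))[2])  # 0.75
```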
B. Datasets

The experiments were carried out on 6 real data sets taken from the UCI Machine Learning Database Repository [9]; a summary is given in Table III. All data sets were chosen as, or transformed into, two-class problems.

TABLE III
DATASETS USED IN THE EXPERIMENTS

Dataset     Cases   Min. Class   Maj. Class   Attributes   Distribution
Hepatitis   155     32           123          20           0.20:0.80
Adult       32561   7841         24720        15           0.24:0.76
Pima        768     268          500          9            0.35:0.65
Monk2       169     64           105          6            0.37:0.63
Yeast       483     20           463          8            0.04:0.96
Ozone       2536    73           2463         72           0.02:0.98

IV. RESULTS AND DISCUSSION

Weka 3.6.0 [10] was used as the prediction tool, a C4.5 tree was used as the base classifier, and AdaBoostM1, Bagging, AdaCost, CSB2, and E-AdSampling were set to 10 iterations.

AdaCost [11]: through a cost-adjustment function, False Negatives receive a greater weight increase than False Positives, and True Positives lose less weight than True Negatives. The cost-adjustment function $\beta_{+} = -0.5\,C_n + 0.5$ and $\beta_{-} = 0.5\,C_n + 0.5$ was chosen, where $C_n$ is the misclassification cost of the n-th example and $\beta_{+}$ ($\beta_{-}$) denotes the output when the sample is correctly classified (misclassified).

CSB2 [11]: the weights of True Positives and True Negatives are decreased equally; False Negatives get more boosted weight than False Positives. Cost factors of 2 and 5 were implemented for AdaCost and CSB2.
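The cost-adjustment functions quoted above can be written out directly; the surrounding weight update follows the usual AdaCost form and is included only to show where β+ and β− enter, so it is a sketch rather than the exact configuration used in [11].

```python
import numpy as np

def beta_plus(cost):      # applied when the example is correctly classified
    return -0.5 * cost + 0.5

def beta_minus(cost):     # applied when the example is misclassified
    return 0.5 * cost + 0.5

def adacost_reweight(weights, y_true, y_pred, costs, alpha):
    """One AdaCost-style re-weighting step.

    y_true and y_pred are +1/-1 labels; costs[n] is the misclassification cost Cn.
    Costly examples are up-weighted more when wrong and down-weighted less when right."""
    correct = (y_true == y_pred)
    beta = np.where(correct, beta_plus(costs), beta_minus(costs))
    margin = np.where(correct, 1.0, -1.0)        # y * h(x) for a discrete classifier
    new_w = weights * np.exp(-alpha * margin * beta)
    return new_w / new_w.sum()
```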
Except for Adult and Monk2, which provide a separate test dataset, 10-fold cross-validation was used. The initial data are randomly partitioned into 10 mutually exclusive subsets or "folds" D1, D2, …, D10, each of approximately equal size. Training and testing are performed 10 times: in iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. For classification, the accuracy estimate is the overall number of correct classifications from the 10 iterations divided by the total number of tuples in the initial data. Results are shown in Table IV.
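The fold-wise evaluation described above can be sketched as follows, with scikit-learn's KFold and DecisionTreeClassifier standing in for Weka's cross-validation and C4.5 (J48).

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier  # stand-in for Weka's C4.5 (J48)

def cross_validated_accuracy(X, y, n_splits=10, seed=0):
    """Accuracy estimate: correct predictions over all folds / total number of tuples."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    correct = 0
    for train_idx, test_idx in folds.split(X):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        correct += int(np.sum(model.predict(X[test_idx]) == y[test_idx]))
    return correct / len(y)
```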
Fig. 3. ROC curve of the Ozone data set.

To better understand the achievements of E-AdSampling, the ROC curves for the Hepatitis (Fig. 2) and Ozone (Fig. 3) datasets are presented. The Hepatitis dataset was chosen because of the large performance improvement obtained by E-AdSampling and its high degree of imbalance. The Ozone dataset was chosen because of its high degree of imbalance and the difficulty of classifying its minority class. AdaCost and CSB2 were executed with cost factor 2. On both graphs the area under the ROC curve (AUC) shows good results for E-AdSampling. Table V shows all AUC results.

TABLE V
AREA UNDER THE ROC CURVE (AUC) RESULTS

Method         Hepatitis   Adult   Pima   Monk2   Yeast   Ozone
C4.5           0.70        0.89    0.75   0.59    0.65    0.67
AdaBoostM1     0.81        0.87    0.77   0.73    0.83    0.80
Bagging        0.80        0.90    0.79   0.67    0.79    0.83
AdaCost(2)     0.87        0.90    0.78   0.67    0.86    0.83
AdaCost(5)     0.83        0.90    0.78   0.69    0.88    0.84
CSB2(2)        0.84        0.89    0.77   0.74    0.86    0.84
CSB2(5)        0.85        0.89    0.78   0.75    0.84    0.83
E-AdSampling   0.88        0.91    0.81   0.71    0.87    0.87
V. CONCLUSION

In this paper, an alternative algorithm for imbalanced datasets was presented. Datasets with both severe and moderate degrees of imbalance were taken into consideration, and in both cases E-AdSampling showed good performance on all measures. Besides obtaining good results on the TP ratio and F-measure of the minority class, E-AdSampling keeps the F-measure of the majority class and the overall accuracy almost constant, or even slightly increases them. While some cost-sensitive algorithms obtain better results on the TP ratio, E-AdSampling yields better results on the F-measures of both the majority and the minority class, as well as on overall accuracy, in almost all cases. The ROC curves for two of the datasets present these achievements graphically.

Our future work will focus on automatically setting the number of neighbors used to generate the synthetic samples and the percentage of synthetic samples generated, according to the dataset.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation for Outstanding Youth Scientists of China under Grant No. 60425310.

REFERENCES

[1] N. V. Chawla, K. W. Bowyer, and P. W. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[2] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: improving prediction of the minority class in boosting," in Proc. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 2003, pp. 107-119.
[3] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach," SIGKDD Explor. Newsl., vol. 6, no. 1, pp. 30-39, June 2004.
[4] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[5] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc., Singapore, 2006, p. 360.
[7] R. Yan, Y. Liu, R. Jin, and A. Hauptmann, "On predicting rare class with SVM ensemble in scene classification," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'03), April 6-10, 2003.
[8] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Pearson Education, Inc., 2006, pp. 220-221.
[9] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA, 1998.
[10] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, San Francisco, 2005.
[11] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recogn., vol. 40, no. 12, pp. 3358-3378, December 2007.