IMECS 2010, pp. 513-517

Proceedings of the International MultiConference of Engineers and Computer Scientists 2010 Vol I,

IMECS 2010, March 17 - 19, 2010, Hong Kong

An Adaptive Sampling Ensemble Classifier for Learning from Imbalanced Data Sets

Ordonez Jon Geiler, Li Hong, Guo Yue-jian

Abstract—In imbalanced data sets, minority classes can be erroneously classified by common classification algorithms. In this paper, an ensemble-based algorithm is proposed that creates new balanced training sets by keeping all of the minority class and under-sampling the majority class. In each round, the algorithm identifies hard examples in the majority class and generates synthetic examples from them for the next round. For each training set a weak learner is used as the base classifier, and final predictions are obtained by casting a majority vote. The method is compared with several known algorithms, and experimental results demonstrate its effectiveness.

Index Terms—Data Mining, Ensemble algorithm, Imbalanced data sets, Synthetic Samples.

I. INTRODUCTION

Imbalanced data sets, in which one class is represented by a much larger number of instances than the others, are common in fraud detection, text classification, and medical diagnosis. In these domains, as in many others, the minority class is usually the least tolerant to classification failure and the most important from a cost-sensitive point of view. For example, misclassifying a fraudulent credit-card transaction may damage a bank's reputation and cost the value of the transaction and a dissatisfied client, whereas misclassifying a legitimate transaction only costs a call to the client. Likewise, in oil-spill detection an undetected spill may cost thousands of dollars, while classifying a non-spill sample as a spill merely costs an inspection. Because of the imbalance problem, traditional machine learning can achieve good results on the majority class but may predict poorly on minority-class examples. Many solutions have been presented to tackle this problem; they can be divided into data-level and algorithm-level approaches.

At the data level, the most common ways to tackle rarity are over-sampling and under-sampling. Under-sampling may cause a loss of information on the majority class and degrade its classification, since it removes examples from that class. Random over-sampling may make the decision regions of the learner smaller and more specific, and thus may cause the learner to over-fit.

As an alternative to random over-sampling, SMOTE [1] was proposed as a method to generate synthetic samples for the minority class. The advantage of SMOTE is that it makes the decision regions larger and less specific. SMOTEBoost [2] proceeds in a series of T rounds in which the distribution Dt is updated every round, and the minority class is over-sampled by creating synthetic minority-class examples. DataBoost-IM [3] is a modification of AdaBoost.M2 that identifies hard examples and generates synthetic examples for both the minority and the majority class.

Algorithm-level methods operate on the algorithms rather than on the data sets. Bagging and Boosting are two common algorithms for improving classifier performance. They are examples of ensemble methods, that is, methods that combine several models. Bagging (Bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve classification by combining the classifications of randomly generated training sets [4]. Boosting is a mechanism for training a sequence of "weak" learners and combining the hypotheses generated by these weak learners so as to obtain an aggregate hypothesis that is highly accurate. AdaBoost [5] increases the weights of misclassified examples and decreases those of correctly classified ones in the same proportion, without considering the imbalance of the data set; thus, traditional boosting algorithms do not perform well on the minority class.

In this paper, an algorithm to cope with imbalanced data sets is proposed, as described in the next section.

The rest of the paper is organized as follows. Section 2 describes the E-AdSampling algorithm. Section 3 describes the setup for the experiments. Section 4 shows the comparative evaluation of E-AdSampling on 6 data sets. Finally, conclusions are drawn in Section 5.
Manuscript received January 23, 2010. This work is supported by the Postdoctoral Foundation of Central South University (2008), China, and the Education Innovation Foundation for Graduate Students of Central South University (2008).

Ordonez Jon Geiler is with the Central South University Institute of Information Science and Engineering, Changsha, Hunan, China 410083 (corresponding author; phone: +86 7318832954; e-mail: jongeilerordonezp@hotmail.com).

Li Hong is with the Central South University Institute of Information Science and Engineering, Changsha, Hunan, China 410083 (e-mail: lihongcsu2007@126.com).

Guo Yue-jian is with the Central South University Institute of Information Science and Engineering, Changsha, Hunan, China 410083 (e-mail: tianyahongqi@126.com).

II. E-ADSAMPLING ALGORITHM

The main focus of E-AdSampling is to enhance prediction on the minority class without sacrificing majority-class performance. An ensemble algorithm that balances the data sets by under-sampling and by generating synthetic examples is proposed. For the majority class an under-sampling strategy is applied; under-sampling the majority class leads the algorithm towards the minority class, yielding a better True Positive rate and accuracy for the minority class. Nevertheless, the majority class will suffer a reduction in

ISBN: 978-988-17012-8-2 IMECS 2010


ISSN: 2078-0958 (Print); ISSN: 2078-0966 (Online)

accuracy and True Positive rate due to the loss of information. To alleviate this loss, the proposed algorithm searches for misclassified samples of the majority class in each round, then generates new synthetic samples based on these hard samples and adds them to the next training set.

As shown in Fig 1, the process is split into 4 steps: first, in order to balance the training data set, the majority class is randomly under-sampled; second, synthetic examples are generated for hard examples of the majority class and added to the training data set; third, each training set is modeled using any weak learning algorithm; finally, the results obtained on each training set are combined.

Input: Set S = {(x1, y1), …, (xm, ym)}, xi ∈ X, with labels yi ∈ Y = {1, …, C}
• For t = 1, 2, 3, 4, …, T:
  o Create a balanced sample Dt by under-sampling the majority class.
  o Identify hard examples from the original data set for the majority class.
  o Generate synthetic examples for the hard examples of the majority class.
  o Add the synthetic examples to Dt.
  o Train a weak learner using distribution Dt.
  o Compute the weak hypothesis ht: X × Y → [0, 1].
• Output the final hypothesis: H* = arg max_i Σ_t h_t(x, i)

Fig 1. The E-AdSampling Algorithm

A. Generate Synthetic Examples

SMOTE was proposed by N. V. Chawla, K. W. Bowyer, and P. W. Kegelmeyer [1] as a method for over-sampling data sets. SMOTE over-samples the minority class by taking each minority-class sample and introducing synthetic examples along the line segments joining it to its minority-class nearest neighbors. E-AdSampling adopts the same technique for majority-class examples which have been misclassified. By using this technique, inductive learners such as decision trees are able to broaden the decision regions around hard majority-class examples.

B. Sampling Training Datasets

Adaptive sampling designs are designs in which the selection procedure may depend sequentially on observed values of the variable of interest. Taking the class as the variable of interest, E-AdSampling under-samples or over-samples based on the class an observation belongs to, or on whether an observation has been erroneously predicted.

In each round of the algorithm, a new training data set is generated. In the first round, the training data set is perfectly balanced by under-sampling the majority class. From the second round to the final round, the majority class is again under-sampled to start with a balanced training set, and additionally new synthetic samples are generated and added for hard examples of the majority class. Table I shows an example of 10 rounds of the algorithm for the Ozone data set, which has 2536 samples, 73 in the minority class and 2463 in the majority class, with a balance rate of 0.02:0.98.
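The synthetic-example step of subsection A can be illustrated with a short sketch. This is not the authors' implementation; it is a minimal SMOTE-style interpolation applied to a set of hard examples, with the neighbor count `k`, the two-samples-per-example ratio (suggested by Table I, where the synthetic column is twice the misclassified column), and the random seed chosen as assumptions.

```python
import random

def synthetic_examples(hard, k=2, n_new=2, seed=42):
    """Generate synthetic points along the segments joining each hard
    example to one of its k nearest neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    out = []
    for x in hard:
        # k nearest neighbours of x among the other hard examples
        neighbours = sorted((y for y in hard if y is not x),
                            key=lambda y: sum((a - b) ** 2
                                              for a, b in zip(x, y)))[:k]
        for _ in range(n_new):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment x -> nb
            out.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return out

# hypothetical 2-D hard examples of the majority class
hard = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (2.0, 2.0)]
print(synthetic_examples(hard, k=2, n_new=1))
```

Each synthetic point lies on a segment between two existing hard examples, which is what lets the learner broaden its decision region around them.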

TABLE I
GENERATING TRAINING DATA SETS OVER 10 ROUNDS FOR THE OZONE DATA SET

Round | Initial Balanced Training Dataset | Misclassified Majority Samples | Synthetic Samples Added | Total Samples on Training Set | Min. Final | Maj. Final | Balance Rate
  1   | 146 |   0 |    0 |  146 | 73 |   73 | 0.50 : 0.50
  2   | 146 | 578 | 1156 | 1302 | 73 | 1229 | 0.06 : 0.94
  3   | 146 | 612 | 1224 | 1370 | 73 | 1297 | 0.05 : 0.95
  4   | 146 |   2 |    4 |  150 | 73 |   77 | 0.49 : 0.51
  5   | 146 | 372 |  744 |  890 | 73 |  817 | 0.08 : 0.92
  6   | 146 |   0 |    0 |  146 | 73 |   73 | 0.50 : 0.50
  7   | 146 | 239 |  478 |  624 | 73 |  551 | 0.12 : 0.88
  8   | 146 |   2 |    4 |  150 | 73 |   77 | 0.49 : 0.51
  9   | 146 | 175 |  350 |  496 | 73 |  423 | 0.15 : 0.85
 10   | 146 |   1 |    0 |  146 | 73 |   73 | 0.50 : 0.50
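The round-by-round construction of training sets traced in Table I can be sketched as follows. This is a simplified illustration of the loop in Fig 1, not the authors' code: the `train`, `misclassified`, and `synthesize` callables and the toy data below are stand-in assumptions, not the paper's weak learner.

```python
import random

def e_adsampling_rounds(minority, majority, T, train, misclassified,
                        synthesize, seed=0):
    """Sketch of the E-AdSampling loop: every round starts from a
    balanced under-sample of the majority class; from round 2 onward,
    synthetic samples are added for the majority examples the previous
    model misclassified."""
    rng = random.Random(seed)
    models = []
    for t in range(T):
        balanced = rng.sample(majority, len(minority))  # under-sample majority
        hard = misclassified(models[-1], majority) if models else []
        training_set = minority + balanced + synthesize(hard)
        models.append(train(training_set))              # one weak learner per round
    return models  # final prediction: majority vote over these models

# toy stand-ins (assumptions, only to make the sketch runnable):
majority = list(range(100))
minority = list(range(100, 110))
train = lambda rows: set(rows)                          # "model" = memorised set
misclassified = lambda model, maj: [x for x in maj if x not in model][:5]
synthesize = lambda hard: [h for h in hard for _ in range(2)]  # 2 per hard example
models = e_adsampling_rounds(minority, majority, T=3, train=train,
                             misclassified=misclassified, synthesize=synthesize)
print(len(models))  # → 3
```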

As seen in Table I, in some rounds of the algorithm the balance rate between the minority and majority classes is 50:50. In these cases it is possible that some majority-class samples will be erroneously classified; to alleviate this loss, the algorithm generates synthetic samples for those samples in the next round. But no matter how many synthetic samples are added in any round, the balance rate never becomes larger than the original one of 0.02:0.98. This reduction in imbalance leads to better results on the minority class.

III. EXPERIMENTS SETUP

This section describes the measures and domains used in the experiments.


The confusion matrix is a useful tool for analyzing how well a classifier recognizes the samples of different classes [6]. A confusion matrix for two classes is shown in Table II.

TABLE II
TWO-CLASS CONFUSION MATRIX

                 | Predicted Positive             | Predicted Negative
Actual Positive  | TP (number of True Positives)  | FN (number of False Negatives)
Actual Negative  | FP (number of False Positives) | TN (number of True Negatives)

Accuracy is defined as

    Acc = (TP + TN) / (TP + FN + FP + TN)    (1)

The TP Rate and FP Rate are calculated as TP/(FN + TP) and FP/(FP + TN). Precision and Recall are calculated as TP/(TP + FP) and TP/(TP + FN). The F-measure is defined as

    F = ((1 + β²) × Recall × Precision) / (β² × Recall + Precision)    (2)

where β corresponds to the relative importance of precision versus recall and is usually set to 1. The F-measure incorporates recall and precision into a single number; it follows that the F-measure is high when both recall and precision are high.

    g-mean = √(a+ × a−), where a− = TN / (TN + FP) and a+ = TP / (TP + FN)    (3)

G-mean is based on the recalls of both classes. The benefit of selecting this metric is that it measures how balanced the combination scheme is: if a classifier is highly biased toward one class (such as the majority class), the G-mean value is low. For example, if a+ = 0 and a− = 1, which means none of the positive examples is identified, g-mean = 0 [7].

A. The Receiver Operating Characteristic Curve

A receiver operating characteristic (ROC) curve [8] is a graphical approach for displaying the tradeoff between the True Positive Rate (TPR) and the False Positive Rate (FPR) of a classifier. In an ROC curve, the TPR is plotted along the y axis and the FPR is shown on the x axis. There are several critical points along an ROC curve that have well-known interpretations:

(TPR=0, FPR=0): the model predicts every instance to be the negative class.
(TPR=1, FPR=1): the model predicts every instance to be the positive class.
(TPR=1, FPR=0): the ideal model.

A good classification model should be located as close as possible to the upper left corner of the diagram, while a model that makes random guesses should reside along the main diagonal connecting the points (TPR=0, FPR=0) and (TPR=1, FPR=1). The area under the ROC curve (AUC) provides another approach for evaluating which model is better on average. If the model is perfect, its area under the ROC curve equals 1; if the model simply performs random guessing, its area under the ROC curve equals 0.5. A model that is strictly better than another has a larger area under the ROC curve.

B. Datasets

The experiments were carried out on 6 real data sets taken from the UCI Machine Learning Database Repository [9] (a summary is given in Table III). All data sets were chosen or transformed into two-class problems.

TABLE III
DATA SETS USED IN THE EXPERIMENTS

Dataset   | Cases | Min. Class | Maj. Class | Attributes | Distribution
Hepatitis |   155 |    32      |   123      |    20      | 0.20:0.80
Adult     | 32561 |  7841      | 24720      |    15      | 0.24:0.76
Pima      |   768 |   268      |   500      |     9      | 0.35:0.65
Monk2     |   169 |    64      |   105      |     6      | 0.37:0.63
Yeast     |   483 |    20      |   463      |     8      | 0.04:0.96
Ozone     |  2536 |    73      |  2463      |    72      | 0.02:0.98

The Adult data set has 32561 training examples and also provides a test data set with 16281 examples; Monk2 has 169 training and 432 test examples. The Yeast data set was learned from the classes 'CYT' and 'POX', as done in [3]. All data sets were chosen for having the high imbalance degree necessary to apply the method. The minority class was taken as the positive class.

IV. RESULTS AND DISCUSSION

Weka 3.6.0 [10] was used as the prediction tool, a C4.5 tree was used as the base classifier, and AdaboostM1, Bagging, Adacost, CSB2, and E-AdSampling were set to 10 iterations.

AdaCost [11]: False Positives receive a greater weight increase than False Negatives, and True Positives lose less weight than True Negatives, by means of a cost-adjustment function. The cost-adjustment function β+ = −0.5Cn + 0.5 and β− = 0.5Cn + 0.5 was chosen, where Cn is the misclassification cost of the nth example and β+ (β−) denotes the output when the sample is correctly classified (misclassified).

CSB2 [11]: the weights of True Positives and True Negatives are decreased equally; False Negatives get more boosted weights than False Positives. Cost factors 2 and 5 were used for Adacost and CSB2.

Except for Adult and Monk2, which provide a test data set, 10-fold cross-validation was implemented. The initial data are randomly partitioned into 10 mutually exclusive subsets or "folds" D1, D2, …, D10, each of approximately equal size. Training and testing are performed 10 times: in iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to train the model. For classification, the accuracy estimate is the overall number of correct classifications from the 10 iterations divided by the total number of tuples in the initial data. Results are shown in Table IV.
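The evaluation measures described above (accuracy, TP/FP rates, F-measure, G-mean) can all be computed from the four confusion-matrix counts. The following sketch illustrates this with β = 1 and hypothetical counts; the counts are not taken from the paper's experiments.

```python
from math import sqrt

def metrics(tp, fn, fp, tn, beta=1.0):
    """Compute the paper's evaluation measures from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + fp + tn)              # Eq. (1)
    tp_rate = tp / (tp + fn)                           # recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f = ((1 + beta**2) * tp_rate * precision) / (beta**2 * tp_rate + precision)  # Eq. (2)
    g_mean = sqrt((tp / (tp + fn)) * (tn / (tn + fp)))  # Eq. (3)
    return {"acc": acc, "tp_rate": tp_rate, "fp_rate": fp_rate,
            "precision": precision, "f_measure": f, "g_mean": g_mean}

# hypothetical two-class counts for illustration only
print(metrics(tp=25, fn=7, fp=10, tn=113))
```

Note how G-mean, unlike accuracy, collapses to 0 whenever one class is entirely missed, which is why it is favoured for imbalanced data.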

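The 10-fold cross-validation procedure described above can be sketched as follows. The `train` and `classify` callables below are hypothetical stand-ins (a trivial majority-label learner), not the paper's C4.5 classifier; the sketch only illustrates the partitioning and the accuracy aggregation.

```python
import random

def ten_fold_accuracy(data, train, classify, k=10, seed=1):
    """Estimate accuracy by k-fold cross-validation: partition the data
    into k mutually exclusive folds, train on k-1 folds, test on the
    held-out fold, and divide total correct predictions by the total
    number of tuples."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # near-equal, disjoint folds
    correct = 0
    for i in range(k):
        test = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)
        correct += sum(classify(model, x) == y for x, y in test)
    return correct / len(data)

# stand-in learner: always predict the majority label of its training data
data = [((v,), v > 5) for v in range(20)]   # 6 negatives, 14 positives
train = lambda rows: max({lbl for _, lbl in rows},
                         key=[lbl for _, lbl in rows].count)
classify = lambda model, x: model
print(ten_fold_accuracy(data, train, classify))  # → 0.7
```

The toy learner predicts the positive class everywhere, so its estimated accuracy equals the positive-class proportion, 14/20.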

TABLE IV
RESULTS COMPARED ON F-MEASURE (MIN. AND MAJ.), TP RATE MIN., G-MEAN, AND OVERALL ACCURACY, USING THE C4.5 CLASSIFIER AND THE ADABOOST-M1, BAGGING, ADACOST, CSB2, AND E-ADSAMPLING ENSEMBLES

Data Set  | Method       | F Min | F Maj | TP Rate Min | G-Mean | Overall Accu.
Hepatitis | C4.5         | 52.8  | 90.3  | 43.8 | 64.2 | 83.87%
          | AdaboostM1   | 60.7  | 91.3  | 53.1 | 70.7 | 85.80%
          | Bagging      | 51.9  | 89.8  | 43.8 | 63.9 | 83.22%
          | Adacost(2)   | 59.7  | 88.9  | 62.5 | 74.0 | 82.58%
          | Adacost(5)   | 57.5  | 83.4  | 78.1 | 66.8 | 76.12%
          | CSB2(2)      | 60.3  | 87.8  | 68.8 | 76.2 | 81.29%
          | CSB2(5)      | 55.0  | 75.6  | 93.8 | 76.1 | 68.38%
          | E-AdSampling | 72.5  | 92.1  | 78.1 | 83.9 | 87.74%
Adult     | C4.5         | 67.8  | 90.9  | 62.9 | 76.4 | 85.84%
          | AdaboostM1   | 64.3  | 89.3  | 61.9 | 75.1 | 83.53%
          | Bagging      | 67.1  | 91.1  | 60.5 | 75.3 | 85.98%
          | Adacost(2)   | 68.5  | 87.4  | 82.7 | 82.2 | 82.02%
          | Adacost(5)   | 64.3  | 82.9  | 88.2 | 80.4 | 76.88%
          | CSB2(2)      | 66.7  | 87.8  | 75.8 | 79.8 | 82.09%
          | CSB2(5)      | 58.3  | 73.8  | 95.2 | 75.1 | 67.79%
          | E-AdSampling | 70.0  | 90.6  | 71.0 | 80.0 | 85.63%
Pima      | C4.5         | 61.4  | 80.2  | 59.7 | 69.7 | 73.82%
          | AdaboostM1   | 60.6  | 78.8  | 60.8 | 69.1 | 72.39%
          | Bagging      | 62.1  | 80.3  | 60.8 | 70.2 | 74.08%
          | Adacost(2)   | 61.3  | 72.1  | 73.5 | 68.7 | 67.57%
          | Adacost(5)   | 61.5  | 65.8  | 82.8 | 66.6 | 63.80%
          | CSB2(2)      | 63.2  | 73.3  | 76.1 | 70.4 | 69.01%
          | CSB2(5)      | 57.7  | 44.3  | 94.0 | 52.5 | 51.95%
          | E-AdSampling | 63.5  | 81.3  | 61.0 | 71.3 | 75.26%
Monks-2   | C4.5         | 38.9  | 75.5  | 33.8 | 52.9 | 65.04%
          | AdaboostM1   | 56.8  | 76.6  | 60.6 | 67.0 | 69.67%
          | Bagging      | 48.3  | 76.6  | 45.8 | 59.9 | 67.82%
          | Adacost(2)   | 56.3  | 58.9  | 83.1 | 61.2 | 57.63%
          | Adacost(5)   | 57.3  | 52.1  | 92.3 | 58.1 | 54.86%
          | CSB2(2)      | 56.0  | 57.4  | 83.8 | 60.3 | 56.71%
          | CSB2(5)      | 50.0  |  4.1  | 100  | 14.4 | 34.25%
          | E-AdSampling | 60.4  | 76.9  | 67.6 | 69.9 | 70.83%
Yeast     | C4.5         | 33.3  | 98.3  | 20.0 | 44.7 | 96.68%
          | AdaboostM1   | 53.3  | 98.5  | 40.0 | 63.1 | 97.10%
          | Bagging      |  0    | 97.7  |  0   |  0   | 95.41%
          | Adacost(2)   | 48.3  | 98.4  | 35.0 | 59.0 | 96.89%
          | Adacost(5)   | 47.6  | 97.6  | 50.0 | 69.7 | 95.44%
          | CSB2(2)      | 54.5  | 98.4  | 45.0 | 66.7 | 96.89%
          | CSB2(5)      | 41.7  | 96.9  | 50.0 | 69.3 | 94.20%
          | E-AdSampling | 61.1  | 98.5  | 55.0 | 73.7 | 97.10%
Ozone     | C4.5         | 23.1  | 98.1  | 19.2 | 43.5 | 96.33%
          | AdaboostM1   | 17.2  | 98.5  | 11.0 | 33.0 | 96.96%
          | Bagging      |  2.5  | 98.4  |  1.4 | 11.7 | 96.92%
          | Adacost(2)   | 26.9  | 98.5  | 19.2 | 43.6 | 97.00%
          | Adacost(5)   | 25.7  | 97.4  | 30.1 | 54.0 | 94.99%
          | CSB2(2)      | 29.1  | 98.3  | 23.3 | 48.0 | 96.72%
          | CSB2(5)      | 26.8  | 96.3  | 45.2 | 65.2 | 92.90%
          | E-AdSampling | 35.3  | 98.0  | 37.0 | 60.1 | 96.09%

In terms of the TP Rate measure, compared to non-cost-sensitive algorithms E-AdSampling reduces mistakes in minority-class prediction. Take the Hepatitis data set for example: the difference between E-AdSampling and AdaBoost-M1 is 25%, which represents a reduction of 8 misclassified minority-class cases. On the Ozone data set, the difference between C4.5 and E-AdSampling is 17.8%, which represents a reduction of 13 misclassified minority-class cases. In these and other cases where E-AdSampling performs well, the reduction of misclassified minority-class cases may represent a cost reduction. Compared to the cost-sensitive algorithms (Adacost, CSB2), E-AdSampling occasionally scores lower on minority TP Rate, but Adacost and CSB2 sacrifice the majority class, which suffers a reduction in F-measure.

As to the F-measure, the minority class always obtains an improvement compared to both cost-sensitive and non-cost-sensitive algorithms; this improvement can reach about 12%, as on the Hepatitis data set. For the majority class, the F-measure also increases in almost all cases, except for the Adult and Ozone data sets, where it mostly remains constant, with a reduction of only 0.5. This reduction can be considered small compared to the gain in TP Rate and F-measure for the minority class.

As for the G-mean, which is considered an important measure on imbalanced data sets, E-AdSampling yields the highest G-mean on almost all data sets, except for Adult and Ozone, where some cost-sensitive algorithms achieve better results. Still, the results show that E-AdSampling is well suited to imbalanced data sets, and indicate that the TP Rate of the majority class is not compromised by the increase in the TP Rate of the minority class.

For the overall accuracy measure, E-AdSampling obtains an improvement on 4 data sets; Ozone and Adult are the only data sets that suffer a reduction. This reduction is on the order of 1%, which is small compared to the gain on the other measures.

As seen in Table IV, the cost-sensitive algorithms (Adacost, CSB2) can achieve good results on minority-class TP Rate, but these results are undermined by a reduction in F-measure on both classes and, in some cases, a reduction in overall accuracy. The non-cost-sensitive algorithms (C4.5, AdaboostM1, Bagging) achieve better results only on the Adult and Ozone data sets, on majority-class F-measure and overall accuracy; E-AdSampling beats these algorithms on the other measures.

Fig 2. ROC Curve of the Hepatitis Data set


Fig 3. ROC Curve of the Ozone Data set

To better understand the achievements of E-AdSampling, ROC curves for the Hepatitis (Fig 2) and Ozone (Fig 3) data sets are presented. The Hepatitis data set was chosen because of the large performance improvement obtained by E-AdSampling and its high imbalance degree; the Ozone data set was chosen because of its high imbalance degree and the difficulty of classifying its minority class. Adacost and CSB2 were executed with cost factor 2. In both graphs, the area under the ROC curve (AUC) shows good results for E-AdSampling. Table V shows all AUC results.

TABLE V
AREA UNDER THE ROC CURVE (AUC) RESULTS

Method       | Hepatitis | Adult | Pima | Monks-2 | Yeast | Ozone
C4.5         |   0.70    | 0.89  | 0.75 |  0.59   | 0.65  | 0.67
AdaboostM1   |   0.81    | 0.87  | 0.77 |  0.73   | 0.83  | 0.80
Bagging      |   0.80    | 0.90  | 0.79 |  0.67   | 0.79  | 0.83
Adacost(2)   |   0.87    | 0.90  | 0.78 |  0.67   | 0.86  | 0.83
Adacost(5)   |   0.83    | 0.90  | 0.78 |  0.69   | 0.88  | 0.84
CSB2(2)      |   0.84    | 0.89  | 0.77 |  0.74   | 0.86  | 0.84
CSB2(5)      |   0.85    | 0.89  | 0.78 |  0.75   | 0.84  | 0.83
E-AdSampling |   0.88    | 0.91  | 0.81 |  0.71   | 0.87  | 0.87

V. CONCLUSION

In this paper, an alternative algorithm for imbalanced data sets was presented. Data sets of both high and moderate imbalance degree were taken into consideration, and in both cases E-AdSampling showed good performance on all measures. Besides obtaining good results on TP Rate and F-measure for the minority class, E-AdSampling keeps the F-measure of the majority class and the overall accuracy almost constant, or even slightly increases them. While some cost-sensitive algorithms obtain better results on TP Rate, E-AdSampling yields better results on the F-measures of both the majority and minority classes, as well as on overall accuracy, in almost all cases. The ROC curves for two of the data sets graphically illustrate the achievements of E-AdSampling.

Our future work will focus on automatically setting the number of neighbors used to generate the synthetic samples, and the percentage of synthetic samples generated, according to the data set.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation for Outstanding Youth Scientists of China under Grant No. 60425310.

REFERENCES
[1] N. V. Chawla, K. W. Bowyer, and P. W. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[2] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: Improving prediction of the minority class in boosting," 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, pp. 107-119, 2003.
[3] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach," SIGKDD Explor. Newsl., vol. 6, no. 1, pp. 30-39, June 2004.
[4] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[5] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc., Singapore, 2006, p. 360.
[7] R. Yan, Y. Liu, R. Jin, and A. Hauptmann, "On predicting rare classes with SVM ensembles in scene classification," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '03), April 6-10, 2003.
[8] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Pearson Education, Inc., 2006, pp. 220-221.
[9] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA, 1998.
[10] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
[11] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognition, vol. 40, no. 12, pp. 3358-3378, December 2007.

