Abstract—In imbalanced datasets, minority classes can be erroneously classified by common classification algorithms. In this paper, an ensemble-based algorithm is proposed that creates new balanced training sets containing all of the minority class and an under-sample of the majority class. In each round, the algorithm identifies hard examples of the majority class and generates synthetic examples from them for the next round. For each training set, a weak learner is used as the base classifier. Final predictions are obtained by casting a majority vote. The method is compared with several known algorithms, and experimental results demonstrate the effectiveness of the proposed algorithm.

Index Terms—Data mining, ensemble algorithm, imbalanced data sets, synthetic samples.
I. INTRODUCTION

Imbalanced datasets, in which one class is represented by a much larger number of instances than the other classes, are common in fraud detection, text classification, and medical diagnosis. In these domains, as in many others, the minority class is usually the least tolerant to classification failure and the most important one in cost-sensitive terms. For example, misclassifying a credit card fraud may damage the bank's reputation, incur the cost of the transaction, and leave a dissatisfied client, whereas misclassifying a non-fraudulent transaction only costs a call to the client. Likewise, in oil spill detection, an undetected spill may cost thousands of dollars, while classifying a non-spill sample as a spill only costs an inspection. Because of the imbalance problem, traditional machine learning can achieve good results on the majority class but may predict poorly on the minority class examples. In order to tackle this problem many solutions have been presented; they can be divided into the data level and the algorithm level.

At the data level, the most common ways to tackle rarity are over-sampling and under-sampling. Under-sampling may cause a loss of information on the majority class and degrade its classification, since some of its examples are removed. Random over-sampling may make the decision regions of the learner smaller and more specific, and thus may cause the learner to over-fit.

As an alternative to over-sampling, SMOTE [1] was proposed as a method to generate synthetic samples of the minority class. The advantage of SMOTE is that it makes the decision regions larger and less specific. SMOTEBoost [2] proceeds in a series of T rounds in which the distribution Dt is updated every round; the examples of the minority class are then over-sampled by creating synthetic minority class examples. DataBoost-IM [3] is a modification of AdaBoost.M2 which identifies hard examples and generates synthetic examples for the minority as well as the majority class.

The methods at the algorithm level operate on the algorithms rather than on the data sets. Bagging and Boosting are two common algorithms used to improve the performance of a classifier. They are examples of ensemble methods, that is, methods that use a combination of models. Bagging (bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve classification by combining the classifications of randomly generated training sets [4]. Boosting is a mechanism for training a sequence of "weak" learners and combining the hypotheses generated by these weak learners so as to obtain an aggregate hypothesis which is highly accurate. AdaBoost [5] increases the weights of misclassified examples and decreases those of correctly classified examples in the same proportion, without considering the imbalance of the data set. Thus, traditional boosting algorithms do not perform well on the minority class.
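For concreteness, the re-weighting just described can be sketched as follows; this is a minimal NumPy illustration of the standard AdaBoost update in [5], not code from this paper.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost re-weighting step: mistakes are up-weighted and correct
    predictions are down-weighted by the same factor, with no notion of
    class imbalance."""
    incorrect = (y_true != y_pred)
    err = np.sum(weights[incorrect]) / np.sum(weights)   # weighted error of the weak hypothesis
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))  # hypothesis weight
    new_weights = weights * np.exp(alpha * np.where(incorrect, 1.0, -1.0))
    return new_weights / new_weights.sum(), alpha
```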
In this paper, an algorithm to cope with imbalanced datasets is proposed, as described in the next section.

The rest of the paper is organized as follows. Section 2 describes the E-AdSampling algorithm. Section 3 presents the setup of the experiments. Section 4 gives a comparative evaluation of E-AdSampling on 6 datasets. Finally, conclusions are drawn in Section 5.
Manuscript received January 23, 2010.
This work was supported by the Postdoctoral Foundation of Central South University (2008), China, and the Education Innovation Foundation for Graduate Students of Central South University (2008).
Ordonez Jon Geiler is with the Institute of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China (corresponding author; phone: +86 7318832954; e-mail: jongeilerordonezp@hotmail.com).
Li Hong is with the Institute of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China (e-mail: lihongcsu2007@126.com).
Guo Yue-jian is with the Institute of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China (e-mail: tianyahongqi@126.com).
II. E-ADSAMPLING ALGORITHM

The main focus of E-AdSampling is to enhance prediction on the minority class without sacrificing majority-class performance. An ensemble algorithm is proposed in which each training set is balanced by under-sampling and by the generation of synthetic examples. For the majority class an under-sampling strategy is applied. By under-sampling the majority class, the algorithm is steered towards the minority class, obtaining a better True Positive ratio and accuracy for the minority class. Nevertheless, the majority class will suffer a reduction in accuracy and True Positive ratio due to the loss of information. To alleviate this loss, the proposed algorithm searches for misclassified majority-class samples in each round; it then generates new synthetic samples based on these hard samples and adds them to the new training set.

As shown in Fig. 1, the process is split into four steps: first, in order to balance the training dataset, the majority class is randomly under-sampled; second, synthetic examples are generated for hard examples of the majority class and added to the training dataset; third, each training set is modeled with any weak learning algorithm; finally, the results obtained on each training set are combined.

Input: a set S = {(x1, y1), …, (xm, ym)}, xi ∈ X, with labels yi ∈ Y = {1, …, C}
• For t = 1, 2, …, T:
  o Create a balanced sample Dt by under-sampling the majority class.
  o Identify hard majority-class examples in the original data set.
  o Generate synthetic examples for the hard majority-class examples.
  o Add the synthetic examples to Dt.
  o Train a weak learner using distribution Dt.
  o Compute the weak hypothesis ht: X × Y → [0, 1].
• Output the final hypothesis: $H^{*}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} h_t(x, y)$

Fig. 1. The E-AdSampling algorithm.
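A minimal sketch of the loop in Fig. 1 is given below. It assumes scikit-learn's DecisionTreeClassifier as the weak learner, takes the hard examples of a round to be those the previous round's learner misclassifies (one reading of the pseudocode), and reduces the synthetic-generation step to a single random interpolation per sample; it is an illustration of the procedure, not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_e_adsampling(X, y, minority_label, T=10, synth_per_hard=2, seed=0):
    """Sketch of Fig. 1: T rounds of balanced sampling plus synthetic hard
    majority examples. Assumes a numeric feature matrix X and integer labels y."""
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    learners = []
    for t in range(T):
        # Step 1: balance by randomly under-sampling the majority class.
        sampled_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        Xt = np.vstack([X[min_idx], X[sampled_maj]])
        yt = np.concatenate([y[min_idx], y[sampled_maj]])
        if learners:
            # Step 2: hard majority examples = those the previous learner got wrong.
            hard = maj_idx[learners[-1].predict(X[maj_idx]) != y[maj_idx]]
            for i in hard:
                for _ in range(synth_per_hard):
                    # Interpolate toward a random majority sample (SMOTE-like, cf. Sec. II-A).
                    j = rng.choice(maj_idx)
                    gap = rng.random()
                    Xt = np.vstack([Xt, X[i] + gap * (X[j] - X[i])])
                    yt = np.append(yt, y[i])
        # Step 3: train a weak learner on the round's training set Dt.
        learners.append(DecisionTreeClassifier(max_depth=3).fit(Xt, yt))
    return learners

def predict_majority_vote(learners, X):
    """Step 4: combine the T hypotheses by a simple majority vote (labels 0..C-1)."""
    votes = np.stack([clf.predict(X) for clf in learners])
    return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```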
A. Generating Synthetic Examples

SMOTE was proposed by N. V. Chawla, K. W. Bowyer, and P. W. Kegelmeyer [1] as a method for over-sampling datasets. SMOTE over-samples the minority class by taking each minority-class sample and introducing synthetic examples along the line segments joining it to its minority-class nearest neighbors. E-AdSampling adopts the same technique for majority-class examples that have been misclassified. By using this technique, inductive learners such as decision trees are able to broaden the decision regions around hard majority examples.
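A sketch of this interpolation step, applied to hard majority examples as described above, is shown next; the number of neighbors k and the number of synthetic samples per example are not fixed by the paper, so the defaults below are assumptions.

```python
import numpy as np

def generate_synthetic_majority(hard_X, majority_X, k=5, n_per_example=2, seed=0):
    """SMOTE-style generation: interpolate each hard majority example toward
    one of its k nearest majority-class neighbors, chosen at random."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for x in hard_X:
        # Distances from x to every majority-class sample.
        d = np.linalg.norm(majority_X - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]      # skip the first entry (typically x itself)
        for _ in range(n_per_example):
            nb = majority_X[rng.choice(neighbors)]
            gap = rng.random()                  # random point on the segment x -> nb
            synthetic.append(x + gap * (nb - x))
    return np.asarray(synthetic)
```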
B. Sampling Training Datasets

Adaptive sampling designs are designs in which the selection procedure may depend sequentially on the observed values of the variable of interest. With the class as the variable of interest, E-AdSampling under-samples or over-samples depending on the class an observation belongs to, or on whether the observation has been erroneously predicted.

In each round of the algorithm a new training dataset is generated. In the first round, the training dataset is perfectly balanced by under-sampling the majority class. From the second round to the final round, the majority class is again under-sampled so as to start from a balanced training dataset, and in addition new synthetic samples are generated and added for the hard examples of the majority class.

Table I shows an example of 10 rounds of the algorithm for the Ozone dataset. The Ozone dataset has 2536 samples, 73 in the minority class and 2463 in the majority class, giving a balance rate of 0.02:0.98.
TABLE I
GENERATING TRAINING DATASETS OVER 10 ROUNDS FOR THE OZONE DATASET

Round   Initial balanced   Misclassified    Synthetic        Total on        Min.    Maj.    Balance
        training set       maj. samples     samples added    training set    final   final   rate
1       146                0                0                146             73      73      0.50 : 0.50
2       146                578              1156             1302            73      1229    0.06 : 0.94
3       146                612              1224             1370            73      1297    0.05 : 0.95
4       146                2                4                150             73      77      0.49 : 0.51
5       146                372              744              890             73      817     0.08 : 0.92
6       146                0                0                146             73      73      0.50 : 0.50
7       146                239              478              624             73      551     0.12 : 0.88
8       146                2                4                150             73      77      0.49 : 0.51
9       146                175              350              496             73      423     0.15 : 0.85
10      146                1                0                146             73      73      0.50 : 0.50
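The bookkeeping behind Table I can be reproduced with a few lines; note that the ratio of two synthetic samples per misclassified majority example is inferred from the table itself (for instance, 578 hard examples yield 1156 synthetic samples in round 2) rather than stated in the text.

```python
def round_summary(n_min, n_misclassified_maj, synth_per_hard=2):
    """Recompute a Table I row: balanced base set plus synthetic majority samples."""
    base = 2 * n_min                       # under-sampled, perfectly balanced starting set
    synthetic = synth_per_hard * n_misclassified_maj
    maj_final = n_min + synthetic          # under-sampled majority plus synthetic additions
    total = base + synthetic
    rate_min = n_min / total
    return total, n_min, maj_final, round(rate_min, 2), round(1 - rate_min, 2)

# Round 2 of the Ozone example: 73 minority samples, 578 hard majority samples.
print(round_summary(73, 578))   # -> (1302, 73, 1229, 0.06, 0.94)
```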
III. EXPERIMENTS SETUP

The evaluation measures used in this paper are derived from the two-class confusion matrix shown in Table II.

TABLE II
TWO-CLASS CONFUSION MATRIX

                   Predicted Positive                 Predicted Negative
Actual Positive    TP (number of True Positives)      FN (number of False Negatives)
Actual Negative    FP (number of False Positives)     TN (number of True Negatives)

Accuracy is defined as

$\mathrm{Acc} = \dfrac{TP + TN}{TP + FN + FP + TN}$   (1)

The TP Rate and FP Rate are calculated as TP/(FN + TP) and FP/(FP + TN). Precision and Recall are calculated as TP/(TP + FP) and TP/(TP + FN). The F-measure is defined as

$F = \dfrac{(1 + \beta^{2}) \times \mathrm{Recall} \times \mathrm{Precision}}{\beta^{2} \times \mathrm{Recall} + \mathrm{Precision}}$   (2)

G-mean is based on the recalls of both classes. The benefit of selecting this metric is that it can measure how balanced the combination scheme is: if a classifier is highly biased toward one class (such as the majority class), the G-mean value is low. For example, if a+ = 0 and a− = 1, meaning that none of the positive examples is identified, then G-mean = 0 [7].
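For illustration, the measures defined in (1) and (2) and the G-mean can be computed directly from the confusion-matrix counts of Table II; the G-mean is taken here as the geometric mean of the two per-class recalls, which the text implies but does not write out.

```python
import math

def evaluation_measures(tp, fn, fp, tn, beta=1.0):
    """Compute the measures of this section from two-class confusion-matrix counts
    (assumes all denominators are non-zero)."""
    acc = (tp + tn) / (tp + fn + fp + tn)            # equation (1)
    tp_rate = tp / (tp + fn)                         # recall on the positive class
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp_rate
    f_measure = ((1 + beta**2) * recall * precision) / (beta**2 * recall + precision)  # equation (2)
    tn_rate = tn / (tn + fp)                         # recall on the negative class
    g_mean = math.sqrt(tp_rate * tn_rate)            # assumed geometric-mean definition
    return {"Acc": acc, "TPRate": tp_rate, "FPRate": fp_rate,
            "Precision": precision, "Recall": recall,
            "F": f_measure, "G-mean": g_mean}
```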
A. The Receiver Operating Characteristic Curve

A receiver operating characteristic (ROC) curve [8] is a graphical approach for displaying the tradeoff between the True Positive Rate (TPR) and the False Positive Rate (FPR) of a classifier. In an ROC curve, the TPR is plotted along the y axis and the FPR is shown on the x axis.

There are several critical points along an ROC curve that have well-known interpretations:
(TPR = 0, FPR = 0): the model predicts every instance to be the negative class.
(TPR = 1, FPR = 1): the model predicts every instance to be the positive class.
(TPR = 1, FPR = 0): the ideal model.

A good classification model should be located as close as possible to the upper left corner of the diagram, while a model which makes random guesses should reside along the main diagonal connecting the points (TPR = 0, FPR = 0) and (TPR = 1, FPR = 1).

The area under the ROC curve (AUC) provides another approach for evaluating which model is better on average. If the model is perfect, its area under the ROC curve equals 1. If the model simply performs random guessing, its area under the ROC curve equals 0.5. A model that is strictly better than another has a larger area under the ROC curve.
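A brief sketch of how the ROC points and the AUC can be obtained from a classifier's scores is shown below; scikit-learn is used purely for illustration, since the experiments in this paper were run in Weka.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def roc_points_and_auc(y_true, scores):
    """Return the (FPR, TPR) points of the ROC curve and its area.

    y_true holds 0/1 labels (1 = positive/minority class) and scores holds the
    classifier's estimated probability of the positive class."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return fpr, tpr, roc_auc_score(y_true, scores)

# A perfect ranking gives AUC = 1, random scores give roughly 0.5.
y = np.array([0, 0, 1, 1])
print(roc_points_and_auc(y, np.array([0.1, 0.4, 0.35, 0.8]))[2])  # 0.75
```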
B. Datasets

The experiments were carried out on 6 real data sets taken from the UCI Machine Learning Database Repository [9]; a summary is given in Table III. All data sets were chosen as, or transformed into, two-class problems.

TABLE III
DATASETS USED IN THE EXPERIMENTS

Dataset     Cases   Min. Class   Maj. Class   Attributes   Distribution
Hepatitis   155     32           123          20           0.20:0.80
Adult       32561   7841         24720        15           0.24:0.76
Pima        768     268          500          9            0.35:0.65
Monk2       169     64           105          6            0.37:0.63
Yeast       483     20           463          8            0.04:0.96
Ozone       2536    73           2463         72           0.02:0.98

IV. RESULTS AND DISCUSSION

Weka 3.6.0 [10] was used as the prediction tool, a C4.5 tree was used as the base classifier, and AdaBoostM1, Bagging, AdaCost, CSB2, and E-AdSampling were set to 10 iterations.

AdaCost [11]: through a cost-adjustment function, False Negatives receive a greater weight increase than False Positives, and True Positives lose less weight than True Negatives. The cost-adjustment function $\beta_{+} = -0.5\,C_n + 0.5$ and $\beta_{-} = 0.5\,C_n + 0.5$ was chosen, where $C_n$ is the misclassification cost of the n-th example and $\beta_{+}$ ($\beta_{-}$) denotes the output when the sample is correctly classified (misclassified).

CSB2 [11]: the weights of True Positives and True Negatives are decreased equally; False Negatives get more boosted weight than False Positives. Cost factors of 2 and 5 were implemented for AdaCost and CSB2.
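The cost-adjustment functions quoted above can be written out directly; the surrounding weight update follows the usual AdaCost form and is included only to show where β+ and β− enter, so it is a sketch rather than the exact configuration used in [11].

```python
import numpy as np

def beta_plus(cost):      # applied when the example is correctly classified
    return -0.5 * cost + 0.5

def beta_minus(cost):     # applied when the example is misclassified
    return 0.5 * cost + 0.5

def adacost_reweight(weights, y_true, y_pred, costs, alpha):
    """One AdaCost-style re-weighting step.

    y_true and y_pred are +1/-1 labels; costs[n] is the misclassification cost Cn.
    Costly examples are up-weighted more when wrong and down-weighted less when right."""
    correct = (y_true == y_pred)
    beta = np.where(correct, beta_plus(costs), beta_minus(costs))
    margin = np.where(correct, 1.0, -1.0)        # y * h(x) for a discrete classifier
    new_w = weights * np.exp(-alpha * margin * beta)
    return new_w / new_w.sum()
```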
Except for Adult and Monk2, which provide a separate test dataset, 10-fold cross-validation was used. The initial data are randomly partitioned into 10 mutually exclusive subsets or "folds" D1, D2, …, D10, each of approximately equal size. Training and testing are performed 10 times: in iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. For classification, the accuracy estimate is the overall number of correct classifications from the 10 iterations divided by the total number of tuples in the initial data. Results are shown in Table IV.
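The fold-wise evaluation described above can be sketched as follows, with scikit-learn's KFold and DecisionTreeClassifier standing in for Weka's cross-validation and C4.5 (J48).

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier  # stand-in for Weka's C4.5 (J48)

def cross_validated_accuracy(X, y, n_splits=10, seed=0):
    """Accuracy estimate: correct predictions over all folds / total number of tuples."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    correct = 0
    for train_idx, test_idx in folds.split(X):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        correct += int(np.sum(model.predict(X[test_idx]) == y[test_idx]))
    return correct / len(y)
```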
Fig. 3. ROC curve of the Ozone data set.

To better understand the achievements of E-AdSampling, the ROC curves for the Hepatitis (Fig. 2) and Ozone (Fig. 3) datasets are presented. The Hepatitis dataset was chosen because of the large performance improvement obtained by E-AdSampling and its high degree of imbalance. The Ozone dataset was chosen because of its high degree of imbalance and the difficulty of classifying its minority class. AdaCost and CSB2 were executed with cost factor 2. On both graphs the area under the ROC curve (AUC) shows good results for E-AdSampling. Table V shows all AUC results.

TABLE V
AREA UNDER THE ROC CURVE (AUC) RESULTS

Method         Hepatitis   Adult   Pima   Monk2   Yeast   Ozone
C4.5           0.70        0.89    0.75   0.59    0.65    0.67
AdaBoostM1     0.81        0.87    0.77   0.73    0.83    0.80
Bagging        0.80        0.90    0.79   0.67    0.79    0.83
AdaCost(2)     0.87        0.90    0.78   0.67    0.86    0.83
AdaCost(5)     0.83        0.90    0.78   0.69    0.88    0.84
CSB2(2)        0.84        0.89    0.77   0.74    0.86    0.84
CSB2(5)        0.85        0.89    0.78   0.75    0.84    0.83
E-AdSampling   0.88        0.91    0.81   0.71    0.87    0.87
V. CONCLUSION

In this paper, an alternative algorithm for imbalanced datasets was presented. Datasets with both severe and moderate degrees of imbalance were taken into consideration, and in both cases E-AdSampling showed good performance on all measures. Besides obtaining good results on the TP ratio and F-measure of the minority class, E-AdSampling keeps the F-measure of the majority class and the overall accuracy almost constant, or even slightly increases them. While some cost-sensitive algorithms obtain better results on the TP ratio, E-AdSampling yields better results on the F-measures of both the majority and the minority class, as well as on overall accuracy, in almost all cases. The ROC curves for two of the datasets present these achievements graphically.

Our future work will focus on automatically setting the number of neighbors used to generate the synthetic samples and the percentage of synthetic samples generated, according to the dataset.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation for Outstanding Youth Scientists of China under Grant No. 60425310.

REFERENCES

[1] N. V. Chawla, K. W. Bowyer, and P. W. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[2] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: improving prediction of the minority class in boosting," in Proc. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 2003, pp. 107-119.
[3] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach," SIGKDD Explor. Newsl., vol. 6, no. 1, pp. 30-39, June 2004.
[4] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[5] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc., Singapore, 2006, p. 360.
[7] R. Yan, Y. Liu, R. Jin, and A. Hauptmann, "On predicting rare class with SVM ensemble in scene classification," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'03), April 6-10, 2003.
[8] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Pearson Education, Inc., 2006, pp. 220-221.
[9] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA, 1998.
[10] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, San Francisco, 2005.
[11] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recogn., vol. 40, no. 12, pp. 3358-3378, December 2007.