Special Issue
A-SMOTE: A New Preprocessing Approach for Highly
Imbalanced Datasets by Improving SMOTE
Ahmed Saad Hussein1,2, Tianrui Li1,*, Wondaferaw Yohannese Chubato1, Kamal Bashir1
1 School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
2 University of Information Technology and Communications, Baghdad 00964, Iraq
1. INTRODUCTION

Machine learning (ML) techniques are widely used in applications such as banking, bioinformatics, finance, epidemiology, marketing, medical diagnosis, and meteorological data analysis [1]. In these domains, data is necessary for training the model. However, the class distribution of most real-world datasets is imbalanced, and this circumstance poses a huge challenge to standard ML algorithms. The challenge of imbalanced datasets in classification arises when the number of samples in one class far outnumbers that of the other class. In such a circumstance, a classifier usually favors the majority class in its predictions and completely ignores the minority class. This challenge is often experienced in several disciplines when mining data [2]. The consequence of this bias is that most classification models developed fail to correctly predict the minority class samples in out-of-sample data. This fact is a huge source of worry for real-world data analysis. For example, a software development entity may wish to build a classifier to predict whether a software program will have defective modules at the end of the development process. In this regard, a historical dataset is often employed, and the number of faulty modules therein is often minimal (e.g., 2%). If a software defect prediction model predicts that all the modules are normal (defect-free), it will have a predictive accuracy of 98%. However, such a classifier cannot identify the modules that are actually defective. Therefore, if a classifier can efficiently and accurately predict the minority class samples, it will help entities make proper decisions and save cost [3,4]. The minority class examples are usually the object of most interest in many applications and the most difficult to predict from the perspective of the ML classification task [5,6]. There are several applications, including satellite image classification [7], medical applications [8], and risk management [9], in which an imbalanced class distribution is manifested in the data. Several works have demonstrated that data mining techniques may not function well when the training data is imbalanced [10]. Conventional ML classifiers assume a balanced class distribution for the training data and are predisposed to accurately classify the majority class, whereas the minority samples are often misclassified [11]. The ML community appears to have settled on the proposition that class imbalance in training data is a major problem in inductive learning. Although it was noticed several years back that imbalance in data may lead to considerable deterioration in standard classification model performance, some scholars have argued that class imbalance in data is not a difficulty in itself. In some domains, such

* Corresponding author. Email: trli@swjtu.edu.cn
2 A.S. Hussein et al. / International Journal of Computational Intelligence Systems, in press
as the Sick dataset [12] for instance, it has been observed that standard ML algorithms have the capability to induce effective classification models even when trained on extremely imbalanced data. This illustrates that an imbalanced class distribution is not the only issue associated with the degradation in classifier performance. Also, the classic work of López et al. [13] has demonstrated that the low classification performance reflected in some real imbalanced problems may be associated with the validation scheme applied to evaluate the classifier. A similar view was shared in [14–16], in which the authors argue that the deterioration in classification performance is usually associated with other reasons connected to the data distributions. Hence, Tang and Chen [17] proposed a mechanism to adjust the direction of newly created synthetic samples, and their empirical results demonstrated improved performance of the classification model built. In that study, oversampling with synthetic samples was presented to minimize the overfitting resulting from random and directed oversampling. The new examples were then added to the original training set following specific rules. The aim is to effectively expand the decision zone of the minority class in the feature space and also augment the number of minority class samples. However, it is predictable that the addition of new samples in this way will inevitably introduce further noise into the training dataset, since the generated synthetic samples are no more than a mere estimation of the real distribution [18].

In addition, rare events and class overlapping that come with class imbalance have been identified in [18,19] as potential factors that can degrade the performance of a model developed on imbalanced data. To further find the possible causes of the learning challenge in imbalanced domains, Prati et al. [20] advanced a methodical study aiming to interrogate whether class imbalance is the main hindrance to inductive learning or whether other factors are responsible for the deficiencies. The study was developed on a series of artificial datasets with the view to entirely control all the variables required for the analysis. The experimental results, obtained by applying a discrimination-based inductive framework, demonstrated that the learning challenge is not entirely associated with class imbalance, but is also linked to the level of data overlapping among the classes.

To this end, we propose a critical modification to the Synthetic Minority Oversampling Technique (SMOTE) for highly imbalanced datasets, where the generation of new synthetic samples is directed closer to the minority than the majority class. In this way, the line of distinction between the two classes is clearly defined and all samples in the data are located within their class boundaries, ensuring accurate prediction by the classifiers developed.

The structure of this paper is organized as follows: Section 2 provides an overview of related works. In Section 3, we present the proposed A-SMOTE. In Section 4, we introduce the experimental results and discuss the evaluation metrics and statistical tests used in this work. Finally, concluding remarks and future work are drawn in Section 5.

2. RELATED WORKS

The problem of learning in imbalanced domains has been getting attention in different research areas [21–23]. The methods proposed for imbalanced learning can be classified broadly into the algorithm level approach and the data level approach. In the algorithm level approach, existing algorithms are modified to recognize samples in the minority class [24]. The drawback of this approach is its dependence on classifiers and the difficulty in handling it [25]. In the data level approach, datasets are modified by adding instances to the minority class or eliminating samples from the majority class. This technique aims to produce balanced datasets [26]. The data level technique is easier to use compared to the algorithm level approach, because the datasets are mended before they are trained by classifiers [27,28]. The main advantage of data level methods is that they are more adaptable, since their application does not depend on the classifier chosen. Besides, we may preprocess all datasets once and use them to train different classifiers. Among the data level techniques is the well-known SMOTE [23]. In the case of oversampling, SMOTE is applied to introduce synthetic samples along the line segments connecting any or all of the k minority class nearest neighbors, considering each minority class example in the data. The selection among the k nearest neighbors is carried out randomly, based on the amount of oversampling needed. The SMOTE algorithm has repeatedly reported successes in producing a better sampling distribution. However, when applied in its original form, it may produce suboptimal results or may even be counterproductive in many cases [29]. This is mainly attributed to the fact that SMOTE presents several setbacks related to insensitive oversampling, in which the generation of new minority samples considers only the proximity and size of the minority class samples. The introduction of new minority samples without critical emphasis on the direction and distribution of such examples is a major disadvantage of SMOTE. This setback, which can further exacerbate the problems created by noisy examples, includes the introduction of new minority samples in areas closer to the majority than the minority class. This drawback may cause performance deterioration in a given classification task [21]. Therefore, to overcome the above setbacks, two different techniques are adopted in the literature:

• Extensions of SMOTE by combining it with other techniques such as noise filtering. In standard classification tasks, noise filters are often used in order to detect and remove noisy samples from training datasets, and also to clean up and create more regular class boundaries [29–31]. Empirical studies, such as [29], confirmed the advantage of integrating the iterative partitioning filter (IPF) [32] as a post-processing step after applying SMOTE.

• Modifications of SMOTE in which the formation of new minority samples realized by SMOTE is focused on specific portions of the input space, taking the specific features of the data into account. The Safe-Levels-SMOTE (SL-SMOTE) [33] and Borderline-SMOTE (B1-SMOTE and B2-SMOTE) [34] methods come from this category. These methods try to create minority samples close to regions with a high concentration of minority samples, or only within the borders of the minority class.

Ramentol et al. [35] presented a new hybrid approach for preprocessing imbalanced datasets through the creation of new samples, using SMOTE together with Rough Set Theory. From the experimental results, they observed excellent average results. Similarly, Barua et al. [36] presented MWMOTE to address imbalanced learning problems. The approach first recognizes the hard-to-learn informational minority class samples and assigns them weights.
Having this gap in mind, which has not been addressed by many studies, we present a new approach that treats a highly imbalanced dataset in two steps: firstly, we create new synthetic instances using the SMOTE algorithm; secondly, we eliminate the synthetic samples with higher proximity to the majority class than to the minority, as well as the synthetic instances closer to the borderline created by SMOTE. As a result, the data is evidently devoid of noisy and borderline samples. Details of our new approach for improved classification performance are exhibited in the following sections.

3. SMOTE AND A-SMOTE ALGORITHMS

In this section, we discuss SMOTE and our A-SMOTE.

3.1. SMOTE: Synthetic Minority Oversampling Technique

SMOTE [23] is an essential approach that oversamples the minority class to generate balanced datasets. It oversamples the minority class by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of its k minority class nearest neighbors. Depending upon the amount of oversampling needed, neighbors from the k nearest neighbors are randomly chosen. This method is shown in Figure 1, where Yi is the point under consideration, Yi1 to Yi4 are its nearest neighbors, and w1 to w4 are the synthetic data generated by the randomized interpolation.

Synthetic samples are generated by taking the difference between the nearest neighbor and the feature vector (sample) under consideration, multiplying this difference by a random number between 0 and 1, and adding it to the feature vector under consideration. This produces a synthetic sample at a random point along the line segment between the two samples.

The SMOTE method comes with some weaknesses related to its insensitive oversampling, where the creation of minority samples fails to account for the distribution of samples from the majority class. This may lead to the generation of unnecessary minority samples around the positive examples, which can further exacerbate the problems produced by borderline and noisy examples in the learning process.

3.2. A-SMOTE Algorithm

To perform better prediction, most classification algorithms strive to obtain pure samples to learn from and to make the borderline of each class as definitive as possible. Synthetic examples that are far away from the borderline are easier to classify than the ones close to the borderline, which pose a huge learning challenge for the majority of classifiers. On the basis of these facts, we present a new advanced approach (A-SMOTE) for preprocessing imbalanced training sets, which tries to clearly define the borderline and generate pure synthetic samples from the SMOTE generalization. Our proposed method has two stages, discussed as follows:

First stage: we apply the SMOTE algorithm to generate the synthetic instances based on the following equation:

N = 2 ∗ (r − z) + z    (1)

where N is the initial number of synthetic instances (newly generated), r is the number of majority class samples, and z is the number of minority class samples.

Second stage: we eliminate the synthetic samples with higher proximity to the majority class than to the minority, as well as the synthetic instances closer to the borderline generated by SMOTE. The A-SMOTE procedure step-by-step is outlined as follows:
• Step 1: The synthetic instances generated by SMOTE may be accepted or rejected based on two conditions, in line with the first stage. Suppose that x̂ = {x̂_1, x̂_2, ..., x̂_N} is the set of the new synthetic instances, and x̂_i^(j) is the j-th attribute value of x̂_i, j ∈ [1, M]. Let S_m = {S_m1, S_m2, ..., S_mz} and S_a = {S_a1, S_a2, ..., S_ar} be the sets of the minority samples and the majority samples, respectively. In order to make the acceptance or rejection decision, we calculate the distance between x̂_i and S_mk, D_minority(x̂_i, S_mk), and the distance between x̂_i and S_al, D_majority(x̂_i, S_al), respectively. For i from 1 to N, we compute these distances as follows:

  D_minority(x̂_i, S_mk) = √( ∑_{j=1}^{M} (x̂_i^(j) − S_mk^(j))² ), k ∈ [1, z]    (2)

  D_majority(x̂_i, S_al) = √( ∑_{j=1}^{M} (x̂_i^(j) − S_al^(j))² ), l ∈ [1, r]    (3)

where MinRap.(Ŝ_i, S_m) is the sample's rapprochement with all the minority samples; then, according to Equation (6), we get L, which is defined as follows:

  L = ∑_{i=1}^{n} MinRap.(Ŝ_i, S_m)    (7)

For example, if the number of original minority samples is 10, then we choose the 10 elements of Ŝ_i with the largest distance between S_m and Ŝ_i to delete, in order to obtain high-purity results.

• Step 3: Similarly, we calculate the distance between Ŝ_i and each original majority sample S_a, MajRap.(Ŝ_i, S_a), described as follows:

  MajRap.(Ŝ_i, S_a) = ∑_{l=1}^{r} √( ∑_{j=1}^{M} (Ŝ_i^(j) − S_al^(j))² )    (8)
Figure 2 Advanced Synthetic Minority Oversampling Technique (A-SMOTE) algorithm process.
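The two stages of A-SMOTE described above can be illustrated with a short Python sketch. This is a simplified sketch, not the authors' code: the helper names (generate_smote, a_smote) are ours, and the stage-2 filter is reduced to the simplest accept/reject rule implied by Equations (2) and (3), namely keeping a synthetic point only when it lies nearer to the minority class than to the majority class.

```python
import math
import random

def euclidean(a, b):
    # Attribute-wise distance as in Equations (2)-(3).
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def generate_smote(minority, n_new, k=5):
    # Stage 1: SMOTE-style interpolation between a minority sample and
    # one of its k nearest minority neighbors.
    synthetic = []
    for _ in range(n_new):
        x = random.choice(minority)
        neighbors = sorted((s for s in minority if s is not x),
                           key=lambda s: euclidean(x, s))[:k]
        nb = random.choice(neighbors)
        gap = random.random()  # random number between 0 and 1
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

def a_smote(minority, majority, k=5):
    r, z = len(majority), len(minority)
    n_new = 2 * (r - z) + z  # Equation (1)
    synthetic = generate_smote(minority, n_new, k)
    # Stage 2 (simplified): keep a synthetic point only if its nearest
    # minority sample is closer than its nearest majority sample.
    kept = []
    for s in synthetic:
        d_min = min(euclidean(s, m) for m in minority)  # cf. Equation (2)
        d_maj = min(euclidean(s, a) for a in majority)  # cf. Equation (3)
        if d_min < d_maj:
            kept.append(s)
    return minority + kept
```

For example, with z = 4 minority and r = 12 majority samples, Equation (1) yields N = 2 ∗ (12 − 4) + 4 = 20 candidate synthetic points before the stage-2 filter is applied.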
In problems with imbalanced domains, it is necessary to deal with the issue of performance evaluation. For classification, this point needs more attention, but several solutions for evaluation in this context already exist [41]. Therefore, we use the confusion matrix to derive multiple evaluation metrics for comparing the performance of our proposed method and previous methods. We define the standard rates as follows:

  Accuracy = (TP + TN) / (TP + FN + FP + TN)    (10)

  FPrate = FP / (TN + FP)    (11)

  TPrate = Recall = TP / (TP + FN)    (12)

When used to evaluate a learner's performance on imbalanced datasets, accuracy is more sensitive to the majority class predictions than to the minority ones. We draw this conclusion from its definition (Equation (10)): if the dataset is unusually imbalanced, even though the classifier correctly classifies the majority examples but misclassifies all the minority examples, the learner still has high accuracy because of the huge number of majority examples. In this circumstance, accuracy leads to an unreliable prediction for the minority class. Thus, in addition to accuracy, more appropriate evaluation metrics must be employed. The ROC curve [42] is one of the essential metrics for evaluating learners on imbalanced datasets. It is a two-dimensional graph in which the TP rate and FP rate are plotted on the y and x axes, respectively. The FP rate (Equation (11)) denotes the percentage of misclassified negative examples, and the TP rate (Equation (12)) is the percentage of correctly classified positive cases. Basically, the learners look for the ideal point, denoted as (0, 1). The ROC curve highlights trade-offs between benefits (TP rate) and costs (FP rate). To this end, the area under the ROC (AUC) can also be used for evaluation on imbalanced datasets, as the following equation shows:

  AUC = (1 + TPrate − FPrate) / 2    (13)

4.2. Datasets and Statistical Tests

In this research, we illustrate the datasets used for the experimental study and the statistical tests used alongside the empirical analysis. We have used 44 datasets from the KEEL data repository1 [43] with highly imbalanced rates. A summary of the datasets appears in Table 1. For our experiments, we adopt the following parameters for the A-SMOTE algorithm: K, the number of nearest neighbors, is fixed to 5, and the class distribution is rebalanced to 50-50%. These parameter values are recommended in the previous studies presented in [23,44], and consequently, we have adopted them as a standard for our experiment. To statistically support the analysis of the results, we use statistical tests. In this study, nonparametric tests (the Friedman test and the Holm post hoc test) are used for hypothesis testing, as suggested by [45,46] and employed before [35].

4.3. Comparative Analysis and Results

The experimental findings in this paper are presented using different performance metrics to allow a fair comparison with other methods from the literature.

4.3.1. Case 1: Using the AUC performance metric

To make a fair comparison, the sets were divided in order to perform fivefold cross-validation, 80% for training and 20% for testing, where the 5 test datasets form the whole set. For each dataset, we consider the average results of the five partitions. The learning algorithm employed for the experiments is C4.5, which has been identified as one of the top algorithms in data mining [47] and has been extensively applied in imbalanced problems [48]. In this part, we compare our approach (A-SMOTE) with seven oversampling and undersampling preprocessing techniques based on SMOTE, that is, the SMOTE algorithm and the preprocessing approaches S-ENN, S-TomekLinks, Borderline-1, Borderline-2, safe-level, SMOTE-RSB (analyzed in [35,49]) and MWMOTE. Table 2 shows the results of the experimental evaluation, where the first column indicates the datasets and the best approach is emphasized in bold for each dataset. The performance of the algorithms is ranked on each dataset selected for this study. Thus, our proposed algorithm appears in first place 35 times and five times in the second position. We can recognize that our method obtains the highest performance value of all the methodologies being compared. SMOTE-RSB and Borderline-2 also achieve good results. Additionally, the results for SMOTE-ENN and SMOTE-TomekLinks highlight the significance of the cleaning step in the oversampling, producing preferred classification performance as compared to SMOTE (see Figure 3). The highest AUC counts are shown in Table 3. There are two numbers per cell: the first denotes the number of times a given algorithm is preferred over the other algorithms, while the second shows the number of times it shares equal performance with other algorithms.

In this section, our approach obtains the best ranking, as shown in Table 4. We can see that the average ranking of the algorithms demonstrates how good a method is over the others. This ranking is accomplished by assigning a position to each algorithm depending on its performance on each dataset. The algorithm that achieves the best accuracy on a specific dataset is given the first rank (value 1); then, the algorithm with the second best accuracy is assigned rank 2, and so forth. Finally, this task is carried out for all datasets, and an average ranking is estimated as the mean value of all rankings.

Furthermore, for multiple comparisons, we utilize the Holm post hoc test to determine the algorithms that reject the hypothesis of equality with respect to a selected control method (see Table 5). The post hoc procedure allows the comparison of means to determine acceptance at the significance level α = 0.05. In addition, we calculate the p-value associated with each comparison, which describes the lowest level of significance of a hypothesis that results in a rejection. It is shown that most algorithms reject the hypothesis of equality.

4.3.2. Case 2: Using the F-measure performance metric

In this part of the study, we use a threefold process to measure the performance of the classifier learned from the training datasets generated through different oversampling methods. We randomly partition the dataset into three folds, and each fold holds almost the same proportion of classes as the original dataset. Of the three folds, only one fold is retained as the validation data for testing, and the remaining two folds are employed as training data. The process is then replicated three times, with each of the three folds used precisely once as the validation data. The three results from the folds are then averaged to provide the estimate of one test. We employ the Naive-Bayes classifier to evaluate the efficiency of SMOTE, SNOCC, CBSO, and A-SMOTE. This is done for a fair comparison.

We use a Laplace estimator to calculate the prior probability. The Laplace estimator shows excellent performance in the Naive-Bayes classification algorithm [38,50]. One extra benefit of using the Laplace estimator is that zero probabilities can be avoided. The F-measure for the minority (positive) class is used as the evaluation standard. In this part, we compare our approach A-SMOTE with CBSO and SNOCC (analyzed in [38,51]). The F-measure values of classification for the different oversampling methods are shown in Figure 4. The oversampling technique is given in the column title. The second column, titled Normal, is the F-value without oversampling. In each row, the most significant F-value is made bold. From the results of the experiments, a comparison is done to find the best preprocessing algorithm (see Table 6). With the AUC and F-measure results, and the statistical tests, we observe that our approach (A-SMOTE) is statistically preferred over all compared techniques.

1 http://www.keel.es/datasets.php
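The metrics in Equations (10)-(13), together with the F-measure used in Case 2, can all be computed from confusion-matrix counts. The Python sketch below is illustrative only; the function names are our own, and the zero-denominator guards are an assumption for degenerate classifiers.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count TP, FN, FP, TN, treating the minority label as positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)  # Equation (10)
    fp_rate = fp / (tn + fp)                    # Equation (11)
    tp_rate = tp / (tp + fn)                    # Equation (12), recall
    auc = (1 + tp_rate - fp_rate) / 2           # Equation (13)
    # F-measure for the positive class, guarded against empty denominators.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + tp_rate
    f_measure = 2 * precision * tp_rate / denom if denom else 0.0
    return accuracy, fp_rate, tp_rate, auc, f_measure
```

For the trivial all-majority classifier on a 98:2 dataset (tp=0, fn=2, fp=0, tn=98), this yields accuracy 0.98 but TP rate 0, AUC 0.5, and F-measure 0, which is exactly the failure mode of accuracy discussed above.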
Table 2 Illustration of the AUC results for nine preprocessing algorithms using C4.5 classifier (Case 1).
Dataset Original SMOTE SMOTE-TL S-ENN Border-1 Border-2 Safe-level SMOTE- MWMOTE A-SMOTE
RSB*
ecoli0137vs26 0.7481 0.8136 0.8136 0.8209 0.8445 0.8445 0.8118 0.8445 0.7795 0.9648
shuttle0vs4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9988 1 1 1
yeast1vs7 0.6275 0.7003 0.7371 0.7277 0.6422 0.6407 0.6621 0.8617 0.5669 0.8929
shuttle2vs4 1 0.9917 1 1 1 1 1 1 0.9960 1
glass016vs2 0.5938 0.6062 0.6388 0.6390 0.5738 0.5212 0.6338 0.6376 0.6905 0.8005
glass016vs5 0.8943 0.8129 0.8629 0.8743 0.8386 0.8300 0.8429 0.8800 0.9262 0.9580
pageblocks13vs4 0.9978 0.9955 0.9910 0.9888 0.9978 0.9944 0.9831 0.9978 0.9978 0.9934
yeast05679vs4 0.6802 0.7602 0.7802 0.7569 0.7473 0.7331 0.7825 0.7719 0.6312 0.8610
yeast1289vs7 0.6156 0.6832 0.6332 0.7037 0.6058 0.5473 0.5603 0.7487 0.5271 0.8032
yeast1458vs7 0.5000 0.5367 0.5563 0.5201 0.4955 0.4910 0.5891 0.6183 0.5282 0.6452
yeast2vs4 0.8307 0.8588 0.9042 0.9153 0.8635 0.8576 0.8647 0.9681 0.8539 0.9753
Ecoli4 0.8437 0.8310 0.8544 0.9044 0.8358 0.8155 0.8386 0.8544 0.8710 0.9802
Yeast4 0.6135 0.7004 0.7307 0.7257 0.7124 0.6882 0.7945 0.7609 0.5647 0.7538
Vowel0 0.9706 0.9494 0.9444 0.9455 0.9278 0.9766 0.9566 0.9678 0.9351 0.9796
Yeast2vs8 0.5250 0.8066 0.8045 0.8197 0.6827 0.6968 0.8112 0.7370 0.5545 0.9561
Glass4 0.7542 0.8508 0.9150 0.8650 0.7900 0.8325 0.9020 0.8768 0.9077 0.9727
Glass5 0.8976 0.8829 0.8805 0.7756 0.8854 0.8402 0.8939 0.9232 0.9618 0.9975
Glass2 0.7194 0.5424 0.6269 0.7457 0.7092 0.5701 0.6979 0.7912 0.6661 0.8589
Yeast5 0.8833 0.9233 0.9427 0.9406 0.9118 0.9219 0.9542 0.9622 0.8043 0.9450
Yeast6 0.7115 0.8280 0.8287 0.8270 0.7928 0.7485 0.8163 0.8208 0.6589 0.8745
abalone19 0.5000 0.5202 0.5162 0.5166 0.5202 0.5202 0.5363 0.5244 0.5137 0.5630
abalone918 0.5983 0.6215 0.6675 0.7193 0.7216 0.6819 0.8112 0.6791 0.5949 0.7672
cleveland0vs4 0.6878 0.7908 0.8376 0.7605 0.7194 0.7255 0.8511 0.7620 0.7158 0.9028
ecoli01vs235 0.7136 0.8377 0.8495 0.8332 0.7377 0.7514 0.7550 0.7777 0.7806 0.9257
ecoli01vs5 0.8159 0.7977 0.8432 0.8250 0.8318 0.8295 0.8568 0.7818 0.9567 0.9779
ecoli0146vs5 0.7885 0.8981 0.8981 0.8981 0.7558 0.8058 0.8519 0.8231 0.8980 0.9783
ecoli0147vs2356 0.8051 0.8277 0.8195 0.8228 0.7465 0.8320 0.8149 0.8154 0.8083 0.9180
ecoli0147vs56 0.8318 0.8592 0.8424 0.8424 0.8420 0.8453 0.8197 0.8670 0.8173 0.9471
ecoli0234vs5 0.8307 0.8974 0.8920 0.8947 0.8613 0.8586 0.8700 0.9058 0.9490 0.9691
ecoli0267vs35 0.7752 0.8155 0.8604 0.8179 0.8352 0.8102 0.8380 0.8227 0.7941 0.9510
ecoli034vs5 0.8389 0.9000 0.9361 0.8806 0.8806 0.9028 0.8306 0.9417 0.9085 0.9765
ecoli0346vs5 0.8615 0.8980 0.8703 0.8980 0.8534 0.8838 0.8520 0.8649 0.8255 0.9591
ecoli0347vs56 0.7757 0.8568 0.8482 0.8546 0.8427 0.8449 0.7995 0.8984 0.8445 0.9539
ecoli046vs5 0.8168 0.8701 0.8674 0.8869 0.8615 0.8892 0.8923 0.9476 0.8113 0.9564
ecoli067vs35 0.8250 0.8500 0.8125 0.8125 0.8550 0.8750 0.7950 0.8525 0.8253 0.9302
ecoli067vs5 0.7675 0.8475 0.8425 0.8450 0.8875 0.8900 0.7975 0.8800 0.9528 0.9780
glass0146vs2 0.6616 0.7842 0.7454 0.7095 0.6565 0.6958 0.7465 0.7978 0.6402 0.8287
glass015vs2 0.5011 0.6772 0.7040 0.7957 0.5196 0.5817 0.7215 0.7065 0.6577 0.7723
glass04vs5 0.9941 0.9816 0.9754 0.9754 0.9941 1 0.9261 0.9941 0.9741 0.9941
glass06vs5 0.9950 0.9147 0.9597 0.9647 0.9950 0.9000 0.9137 0.9650 0.9258 0.9904
led7digit02456789vs1 0.8788 0.8908 0.8822 0.8379 0.8908 0.8908 0.9023 0.9019 0.9212 0.9413
yeast0359vs78 0.5868 0.7047 0.7214 0.7024 0.6228 0.6438 0.7296 0.7400 0.5613 0.7426
yeast0256vs3789 0.6606 0.7951 0.7499 0.7817 0.7528 0.7644 0.7551 0.7857 0.7137 0.8664
yeast02579vs368 0.8432 0.9143 0.9007 0.9138 0.8810 0.8901 0.9003 0.9105 0.8361 0.9519
AUC, area under the ROC; SMOTE, Synthetic Minority Oversampling Technique.
Figure 3 Average area under the ROC (AUC) for 44 datasets and 9 preprocessing techniques using C4.5 classifier (Case 1).
Table 3 Number of sole wins/shared best results in AUC for each preprocessing technique over the 44 datasets (Case 1).
Original SMOTE SMOTE-TL S-ENN Border-1 Border-2 Safe-level SMOTE-RSB∗ MWMOTE A-SMOTE
TEST 0/3 0/0 0/1 1/1 0/3 1/1 2/1 1/3 0/2 35/2
SMOTE, Synthetic Minority Oversampling Technique.
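The two numbers per cell described in Section 4.3.1 (sole wins and shared best results) can be counted from a per-dataset score matrix. A hypothetical sketch; the helper name and input layout are our assumptions, not the paper's code:

```python
def wins_and_ties(results):
    # results: {dataset: {method: score}}; higher scores are better.
    # For each dataset, credit every best-scoring method with a sole
    # win (if unique) or a shared tie (if several methods are best).
    counts = {}
    for scores in results.values():
        best = max(scores.values())
        winners = [m for m, s in scores.items() if s == best]
        for m in scores:
            w, t = counts.get(m, (0, 0))
            if m in winners:
                if len(winners) == 1:
                    w += 1
                else:
                    t += 1
            counts[m] = (w, t)
    return counts
```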
Table 4 Average ranks obtained by each method in the Friedman test for case 1.
Algorithm Ranking
A-SMOTE 1.4773
S-RSB ∗ 3.7727
S-ENN 5.1932
SMOTE-TL 5.2386
SMOTE 5.5795
Safelevel 5.6705
Border-2 6.5227
Border-1 6.7273
MWMOTE 6.9659
SMOTE, Synthetic Minority Oversampling Technique; A-SMOTE, Advanced SMOTE.
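The average-rank procedure behind Table 4 (rank 1 for the best method on each dataset, then the mean rank over all datasets) can be sketched as follows. The function name and the mean-rank handling of ties are our assumptions:

```python
def average_ranks(results):
    # results: {dataset: {method: score}}; higher scores are better.
    # Assign rank 1 to the best method per dataset (ties share the mean
    # of the tied positions), then average each method's ranks.
    totals = {}
    n = 0
    for scores in results.values():
        n += 1
        ordered = sorted(scores.items(), key=lambda kv: -kv[1])
        i = 0
        while i < len(ordered):
            j = i
            while j < len(ordered) and ordered[j][1] == ordered[i][1]:
                j += 1
            mean_rank = (i + 1 + j) / 2  # average of positions i+1 .. j
            for m, _ in ordered[i:j]:
                totals[m] = totals.get(m, 0.0) + mean_rank
            i = j
    return {m: s / n for m, s in totals.items()}
```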
Table 5 Holm's table for α = 0.05; A-SMOTE is the control method for Case 1.
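Holm's step-down procedure, used for the comparisons in Table 5, sorts the per-comparison p-values and tests the i-th smallest against α divided by the number of remaining hypotheses. A generic sketch, not tied to the paper's exact p-values:

```python
def holm(p_values, alpha=0.05):
    # p_values: {method: p-value of the comparison vs. the control method}.
    # Returns the set of methods whose hypothesis of equality is rejected.
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    k = len(ordered)
    rejected = set()
    for i, (method, p) in enumerate(ordered):
        if p <= alpha / (k - i):  # step-down threshold alpha / (k - i)
            rejected.add(method)
        else:
            break  # once one hypothesis is retained, all larger p-values are retained
    return rejected
```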
Figure 4 Average F-value for 12 datasets and 4 preprocessing techniques using NB classifier (Case 2).
Table 6 The F-measure results of different oversampling methods using NB classifier (Case 2).
5. CONCLUSION AND FUTURE WORK

In this study, we have proposed a novel approach for highly imbalanced datasets. The performance of A-SMOTE was evaluated on 44 datasets with high imbalance ratios. The proposed method was compared with existing preprocessing techniques, using ML algorithms (e.g., C4.5, Naive-Bayes). In our experimental results, the A-SMOTE technique for preprocessing imbalanced datasets obtained higher accuracy and F-measure (F-value). We believe that the proposed A-SMOTE can be a useful tool for researchers and practitioners, since it results in the generation of high-quality data. For future work, we will focus on how to combine A-SMOTE with rough set theory to solve the imbalanced dataset classification problem. In addition, the problem of imbalanced data is closely related to the extended belief rule-based system [52,53] developed to deal with classification tasks. It will make an invaluable contribution to the field of complex data analysis, which we plan to work on in the future.

REFERENCES

[1] A.S. Hussein, T. Li, N.S. Jaber, W.Y. Chubato, A rough set based hybrid approach for classification, in 13th International FLINS Conference on Data Science and Knowledge Engineering for Sensing Decision Support (FLINS 2018), World Scientific, Belfast, Northern Ireland, 2018, pp. 683–690.
[2] R. Pruengkarn, K.W. Wong, C.C. Fung, Imbalanced data classification using complementary fuzzy support vector machine techniques and SMOTE, in IEEE International Conference on Systems, Man and Cybernetics, Banff, 2017, pp. 978–983.
[3] W.Y. Chubato, T. Li, A combined-learning based framework for improved software fault prediction, Int. J. Comput. Intell. Syst. 10 (2017), 647–662.
[4] W.Y. Chubato, T. Li, K. Bashir, A three-stage based ensemble learning for improved software fault prediction: an empirical comparative study, Int. J. Comput. Intell. Syst. 11 (2018), 1229–1247.
[5] B. Krawczyk, Learning from imbalanced data: open challenges and future directions, Prog. Artif. Intell. 5 (2016), 221–232.
[6] R. Blagus, L. Lusa, SMOTE for high-dimensional class-imbalanced data, BMC Bioinformat. 14 (2013), 106.
[7] S. Suresh, N. Sundararajan, P. Saratchandran, Risk-sensitive loss functions for sparse multi-category classification problems, Inf. Sci. 178 (2008), 2621–2638.
[8] Y.M. Huang, C.M. Hung, H.C. Jiau, Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem, Nonlinear Anal. Real World Appl. 7 (2006), 720–747.
[13] V. López, A. Fernández, F. Herrera, On the importance of the validation technique for classification with imbalanced datasets: addressing covariate shift when data is skewed, Inf. Sci. 257 (2014), 1–13.
[14] N. Japkowicz, Class imbalances: are we focusing on the right issue, in Workshop on Learning from Imbalanced Data Sets II, 2003, vol. 1723, p. 6.
[15] V. García, J. Sánchez, R. Mollineda, An empirical study of the behavior of classifiers on imbalanced and overlapped data sets, in Iberoamerican Congress on Pattern Recognition, Springer, Valparaiso, 2007, pp. 397–406.
[16] K. Napierała, J. Stefanowski, S. Wilk, Learning from imbalanced data in presence of noisy and borderline examples, in International Conference on Rough Sets and Current Trends in Computing, Springer, Warsaw, 2010, pp. 158–167.
[17] S. Tang, S.P. Chen, The generation mechanism of synthetic minority class examples, in International Conference on Information Technology and Applications in Biomedicine (ITAB 2008), IEEE, Shenzhen, 2008, pp. 444–447.
[18] G.M. Weiss, Mining with rarity: a unifying framework, ACM SIGKDD Explor. Newsl. 6 (2004), 7–19.
[19] T. Jo, N. Japkowicz, Class imbalances versus small disjuncts, ACM SIGKDD Explor. Newsl. 6 (2004), 40–49.
[20] R.C. Prati, G.E. Batista, M.C. Monard, Class imbalances versus class overlapping: an analysis of a learning system behavior, in Mexican International Conference on Artificial Intelligence, Springer, Mexico City, 2004, pp. 312–321.
[21] R. Barandela, R.M. Valdovinos, J.S. Sánchez, F.J. Ferri, The imbalanced training sample problem: under or over sampling?, in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, Lisbon, 2004, pp. 806–814.
[22] J. Laurikkala, Improving identification of difficult small classes by balancing class distribution, in Conference on Artificial Intelligence in Medicine in Europe, Springer, Cascais, 2001, pp. 63–66.
[23] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002), 321–357.
[24] M. Mahdizadeh, M. Eftekhari, Designing fuzzy imbalanced classifier based on the subtractive clustering and genetic programming, in 2013 13th Iranian Conference on Fuzzy Systems (IFSC), IEEE, Qazvin, 2013, pp. 1–6.
[25] M. Sahare, H. Gupta, A review of multi-class classification for imbalanced data, Int. J. Adv. Comput. Res. 2 (2012), 160–164.
[9] M.A. Mazurowski, P.A. Habas, J.M. Zurada, J.Y. Lo, J.A. Baker, [26] P. Jeatrakul, K.W. Wong, Enhancing classification performance
G.D. Tourassi, Training neural network classifiers for medical of multi-class imbalanced data using the oaa-db algorithm, in
decision making: the effects of imbalanced datasets on classifica- The 2012 International Joint Conference on Neural Networks
tion performance, Neural Netw. 21 (2008), 427–436. (IJCNN), IEEE, Brisbane, 2012, pp. 1–8.
[10] K. Bashir, T. Li, W.Y. Chubato, M. Yahaya, T. Ali, A novel pre- [27] N.V. Chawla, Data mining for imbalanced datasets: an overview,
processing approach for imbalanced learning in software defect in: O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge
prediction, in 13th International FLINS Conference on Data Sci- Discovery Handbook, Springer, Boston, 2010, pp. 875–886.
ence and Knowledge Engineering for Sensing Decision Support [28] E. Ramentol, N. Verbiest, R. Bello, Y. Caballero, C. Cornelis,
(FLINS 2018), World Scientific, Belfast, Northern Ireland, 2018, F. Herrera, Smote-frst: a new resampling method using fuzzy
pp. 500–508. rough set theory, in 10th International FLINS conference on
[11] S.J. Yen, Y.S. Lee, Under-sampling approaches for improving Uncertainty Modeling in Knowledge Engineering and Decision
prediction of the minority class in an imbalanced dataset, in: Making, World Scientific, 2012, pp. 800–805.
D.-S. Huang, G. William Irwin (Eds.), Intelligent Control and [29] J.A. Sáez, J. Luengo, J. Stefanowski, F. Herrera, Smote–ipf:
Automation, Springer, Berlin, Heidelberg, 2006, pp. 731–740. addressing the noisy and borderline examples problem in imbal-
[12] D. Dheeru, E. Karra Taniskidou, UCI Machine Learning Reposi- anced classification by a re-sampling method with filtering, Inf.
tory, 2017.
Pdf_Folio:10 Sci. 291 (2015), 184–203.
A.S. Hussein et al. / International Journal of Computational Intelligence Systems, in press 11
[30] D. Guan, W. Yuan, Y.K. Lee, S. Lee, Nearest neighbor editing aided [43] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García,
by unlabeled data, Inf. Sci. 179 (2009), 2273–2282. L. Sánchez, F. Herrera, Keel data-mining software tool: data set
[31] I. Tomek, Two modifications of cnn, IEEE Trans. Syst. Man repository, integration of algorithms and experimental analysis
Cybern. SMC-6 (1976), 769–772. framework, J. Multiple Valued Logic Soft Comput. 17 (2011),
[32] T.M. Khoshgoftaar, P. Rebours, Improving software quality pre- 255–287.
diction by noise filtering techniques, J. Comput. Sci. Technol. 22 [44] T.M. Khoshgoftaar, C. Seiffert, J. Van Hulse, A. Napolitano,
(2007), 387–396. A. Folleco, Learning with limited minority class data, in Sixth
[33] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, Safe- International Conference on Machine Learning and Applications
level-smote: safe-level-synthetic minority over-sampling tech- (ICMLA 2007), IEEE, Cincinnati, 2007, pp. 348–353.
nique for handling the class imbalanced problem, in Pacific-Asia [45] A. Fernández, M.J. Del Jesus, F. Herrera, Multi-class imbal-
Conference on Knowledge Discovery and Data Mining, Springer, anced data-sets with linguistic fuzzy rule based classification sys-
Bangkok, 2009, pp. 475–482. tems based on pairwise learning, in International Conference
[34] H. Han, W.Y. Wang, B.H. Mao, Borderline-smote: a new over- on Information Processing and Management of Uncertainty in
sampling method in imbalanced data sets learning, in Inter- Knowledge-Based Systems, Springer, Dortmund, 2010, pp. 89–98.
national Conference on Intelligent Computing, Springer, Hefei, [46] J. Demšar, Statistical comparisons of classifiers over multiple data
2005, pp. 878–887. sets, J. Mach. Learn. Res. 7 (2006), 1–30.
[35] E. Ramentol, Y. Caballero, R. Bello, F. Herrera, Smote-rsb*: [47] P.V. Ngoc, C.V.T. Ngoc, T.V.T. Ngoc, D.N. Duy. A C4. 5 algorithm
a hybrid preprocessing approach based on oversampling and for english emotional classification, Evolving Syst. 10 (2019),
undersampling for high imbalanced data-sets using smote and 425–451.
rough sets theory. Knowl. inf. Syst. 33 (2012), 245–265. [48] W. Liu, S. Chawla, D.A. Cieslak, N.V. Chawla, A robust deci-
[36] S. Barua, M.M. Islam, X. Yao, K. Murase, Mwmote–majority sion tree algorithm for imbalanced data sets, in Proceedings of
weighted minority oversampling technique for imbalanced data the 2010 SIAM International Conference on Data Mining, SIAM,
set learning, IEEE Trans. Knowl. Data Eng. 26 (2014), 405–425. 2010, pp. 766–777.
[37] N. Verbiest, E. Ramentol, C. Cornelis, F. Herrera, Prepro- [49] G.E. Batista, R.C. Prati, M.C. Monard, A study of the behavior
cessing noisy imbalanced datasets using smote enhanced with of several methods for balancing machine learning training data,
fuzzy rough prototype selection, Appl. Soft Comput. 22 (2014), ACM SIGKDD Explor. Newsl. 6 (2004), 20–29.
511–517. [50] H. Zhang, J. Su, Naive bayesian classifiers for ranking, in Euro-
[38] Z. Zheng, Y. Cai, Y. Li, Oversampling method for imbalanced pean Conference on Machine Learning, Springer, Pisa, 2004,
classification, Comput. Informat. 34 (2016), 1017–1037. pp. 501–512.
[39] P. Branco, L. Torgo, R. P. Ribeiro. A survey of predictive mod- [51] S. Barua, M.M. Islam, K. Murase, A novel synthetic minority over-
eling on imbalanced domains, ACM Comput. Surv. (CSUR). 49 sampling technique for imbalanced data set learning, in Interna-
(2016), 31. tional Conference on Neural Information Processing, Springer,
[40] N. Moniz, P. Branco, L. Torgo, Evaluation of ensemble meth- Shanghai, 2011, pp. 735–744.
ods in imbalanced regression tasks, in International Workshop on [52] L.H. Yang, J. Liu, Y.M. Wang, L. Martínez, Extended belief-rule-
Learning with Imbalanced Domains - Theory and Applications, based system with new activation rule determination and weight
2017. calculation for classification problems, Appl. Soft Comput. 72
[41] N. Japkowicz, Assessment metrics for imbalanced learning, in: (2018), 261–272.
H. He, Y. Ma (Eds.), Imbalanced Learning: Foundations, Algo- [53] L.H. Yang, J. Liu, Y.M. Wang, L. Martínez, A micro-extended belief
rithms, and Applications, Wiley-IEEE Press, 2013, pp. 187–206. rule-based system for big data multiclass classification problems,
[42] C.G. Weng, J. Poon, A new evaluation measure for imbalanced IEEE Trans. Syst. Man Cybern. Syst. PP (2018), 1–21.
datasets, in The 7th Australasian Data Mining Conference, Aus-
tralian Computer Society, Inc., Glenelg, 2008, vol. 87, pp. 27–32.
Pdf_Folio:11