Special Issue
A-SMOTE: A New Preprocessing Approach for Highly
Imbalanced Datasets by Improving SMOTE
Ahmed Saad Hussein1,2, Tianrui Li1,*, Wondaferaw Yohannese Chubato1, Kamal Bashir1
1 School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
2 University of Information Technology and Communications, Baghdad 00964, Iraq
1. INTRODUCTION

Machine learning (ML) techniques are widely used in applications such as banking, bioinformatics, finance, epidemiology, marketing, medical diagnosis, and meteorological data analysis [1]. In these domains, data is necessary for training the model. However, the class distribution of most real-world datasets is imbalanced, and this circumstance poses a huge challenge to standard ML algorithms. The challenge of imbalanced datasets in classification arises when the number of samples in one class far outnumbers that of the other class. In such a circumstance, a classifier usually favors the majority class in its predictions and completely ignores the minority class. This challenge is often experienced in several disciplines when mining data [2]. The consequence of this bias is that most classification models developed fail to correctly predict the minority class samples in out-of-sample data. This fact is a huge source of worry for real-world data analysis. For example, a software development entity may wish to build a classifier to predict whether a software program will have defective modules at the end of the development process. In this regard, a historical dataset is often employed, and the number of faulty modules therein is often minimal (e.g., 2%). If a software defect prediction model predicts that all the modules are normal (defect-free), it will have a predictive accuracy of 98%. However, such a classifier cannot identify the modules that are actually defective. Therefore, if a classifier can efficiently and accurately predict the minority class samples, it will help entities make proper decisions and save cost [3,4]. The minority class examples are usually the object of most interest in many applications and the most difficult to predict from the perspective of the ML classification task [5,6]. There are several applications, including satellite image classification [7], medical applications [8], and risk management [9], in which an imbalanced class distribution is manifested in the data. Several works have demonstrated that data mining techniques may not function well when the training data is imbalanced [10]. Conventional ML classifiers assume a balanced class distribution for the training data and are predisposed to accurately classify the majority class, whereas the minority samples are often misclassified [11]. The ML community appears to have settled on the proposition that class imbalance in training data is a major problem in inductive learning. Although it was noticed several years back that imbalance in data may lead to considerable deterioration in standard classification model performance, some scholars have argued that class imbalance in data is not a difficulty in itself. In some domains, such

* Corresponding author. Email: trli@swjtu.edu.cn
2 A.S. Hussein et al. / International Journal of Computational Intelligence Systems, in press
as the Sick dataset [12] for instance, it has been observed that standard ML algorithms have the capability to induce effective classification models even when trained on extremely imbalanced data. This illustrates that an imbalanced class distribution is not the only issue associated with the degradation in classifier performance. Also, the classic work of López et al. [13] has demonstrated that the low classification performance reflected in some real imbalanced problems may be associated with the validation scheme applied to evaluate the classifier. A similar view was shared in [14–16], in which the authors argue that the deterioration in classification performance is usually associated with other reasons connected to the data distributions. Hence, Tang and Chen [17] proposed a mechanism to adjust the direction of newly created synthetic samples, and their empirical results demonstrated improved performance of the classification model built. In that study, oversampling with synthetic samples was presented to minimize the overfitting resulting from random and directed oversampling. The new examples were then added to the original training set following specific rules. The aim is to effectively expand the decision zone of the minority class in the feature space and also augment the number of minority class samples. However, it is predictable that the addition of new samples in this way will inevitably introduce further noise into the training dataset, since the generated synthetic samples are no more than a mere estimation of the real distribution [18].

In addition, rare events and class overlapping that come with class imbalance have been identified in [18,19] as potential factors that can degrade the performance of a model developed on imbalanced data. To further find the possible causes of the learning challenge in imbalanced domains, Prati et al. [20] advanced a methodical study aiming to interrogate whether class imbalance is the main hindrance to inductive learning or whether other factors are responsible for the deficiencies. The study was developed on a series of artificial datasets with the view to entirely control all the variables required for the analysis. The experimental results, obtained by applying a discrimination-based inductive framework, demonstrated that the learning challenge is not entirely associated with class imbalance, but is also linked to the level of data overlapping among the classes.

To this end, we propose a critical modification to the Synthetic Minority Oversampling Technique (SMOTE) for highly imbalanced datasets, where the generation of new synthetic samples is directed closer to the minority than the majority class. In this way, the line of distinction between the two classes is clearly defined and all samples in the data are located within their class boundaries, ensuring accurate prediction by the classifiers developed.

The structure of this paper is organized as follows: Section 2 provides an overview of related works. In Section 3, we present the proposed A-SMOTE. In Section 4, we introduce the experimental results and discuss the evaluation metrics and statistical tests used in this work. Finally, concluding remarks and future work are drawn in Section 5.

2. RELATED WORKS

The problem of learning in imbalanced domains has been getting attention in different research areas [21–23]. The methods proposed for imbalanced learning can be classified broadly into the algorithm level approach and the data level approach. In the algorithm level approach, existing algorithms are modified to recognize samples in the minority class [24]. The drawback of this approach is its dependence on classifiers and the difficulty in handling it [25]. In the data level approach, datasets are modified by adding instances to the minority class or eliminating samples from the majority class. This technique aims to produce balanced datasets [26]. The data level technique is easier to use compared to the algorithm level approach, because the datasets are mended before they are trained by classifiers [27,28]. The main advantage of data level methods is that they are more adaptable, since their application does not depend on the classifier chosen. Besides, we may preprocess all datasets once and use them to train different classifiers. Among the data level techniques is the well-known SMOTE [23]. In the case of oversampling, SMOTE is applied to introduce synthetic samples along the line segments connecting any or all of the k minority class nearest neighbors, considering each minority class example in the data. The selection among the k nearest neighbors is carried out randomly, based on the amount of oversampling needed. The SMOTE algorithm has repeatedly reported successes in producing a better sampling distribution. However, when applied in its original form, it may produce suboptimal results or may even be counterproductive in many cases [29]. This is mainly attributed to the fact that SMOTE presents several setbacks related to insensitive oversampling, in which the generation of new minority samples considers only the proximity and size of the minority class samples. The introduction of new minority samples without critical emphasis on the direction and distribution of such examples is a major disadvantage of SMOTE. This setback, which can further exacerbate the problems created by noisy examples, includes the introduction of new minority samples in areas closer to the majority than the minority class. This drawback may cause performance deterioration in a given classification task [21]. Therefore, to overcome the above setbacks, two different techniques are adopted in the literature:

• Extensions of SMOTE by combining it with other techniques such as noise filtering. In standard classification tasks, noise filters are often used in order to detect and remove noisy samples from training datasets, and also to clean up and create more regular class boundaries [29–31]. Empirical studies, such as [29], confirmed the advantage of integrating the iterative partitioning filter (IPF) [32] as a post-processing step after applying SMOTE.

• Modifications of SMOTE in which the formation of new minority samples realized by SMOTE is focused on specific portions of the input space, taking the specific features of the data into account. The Safe-Levels-SMOTE (SL-SMOTE) [33] and Borderline-SMOTE (B1-SMOTE and B2-SMOTE) [34] methods come from this category. These methods try to create minority samples close to regions with a high concentration of minority samples, or only within the borders of the minority class.

Ramentol et al. [35] presented a new hybrid approach for preprocessing imbalanced datasets through the creation of new samples, using SMOTE together with Rough Set Theory. From the experimental results, they observed excellent average results. Similarly, Barua et al. [36] presented MWMOTE to address imbalanced learning problems. The approach first recognizes the hard-to-learn informational minority class samples and assigns them weights.
Having this gap in mind, which has not been addressed by many studies, we present a new approach that treats a highly imbalanced dataset in two steps: firstly, we create new synthetic instances using the SMOTE algorithm; secondly, we eliminate the synthetic samples with higher proximity to the majority class than to the minority, as well as the synthetic instances closer to the borderline created by SMOTE. As a result, the data is evidently devoid of noisy and borderline samples. Details of our new approach for improved classification performance are exhibited in the following sections.

3. SMOTE AND A-SMOTE ALGORITHMS

In this section, we discuss SMOTE and our A-SMOTE.

3.1. SMOTE: Synthetic Minority Oversampling Technique

SMOTE [23] is an essential approach that oversamples the minority class to generate balanced datasets. It oversamples the minority class by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of its k minority class nearest neighbors. Depending upon the amount of oversampling needed, neighbors from the k nearest neighbors are randomly chosen. This method is shown in Figure 1, where Yi is the point under consideration, Yi1 to Yi4 are its nearest neighbors, and w1 to w4 are the synthetic data generated by the randomized interpolation.

Synthetic samples are generated by taking the difference between the nearest neighbor and the feature vector (sample) under consideration, multiplying this difference by a random number between 0 and 1, and adding it to the feature vector under consideration. This produces a synthetic sample at a random point along the line segment between the two samples.

The SMOTE method comes with some weaknesses related to its insensitive oversampling, where the creation of minority samples fails to account for the distribution of samples from the majority class. This may lead to the generation of unnecessary minority samples around the positive examples, which can further exacerbate the problems produced by borderline and noisy examples in the learning process.

3.2. A-SMOTE Algorithm

To perform better prediction, most classification algorithms strive to obtain pure samples to learn from and to make the borderline of each class as definitive as possible. Synthetic examples that are far away from the borderline are easier to classify than the ones close to the borderline, which pose a huge learning challenge for the majority of classifiers. On the basis of these facts, we present a new advanced approach (A-SMOTE) for preprocessing imbalanced training sets, which tries to clearly define the borderline and generate pure synthetic samples from the SMOTE generalization. Our proposed method has two stages, discussed as follows:

First stage: we apply the SMOTE algorithm to generate the synthetic instances based on the following equation:

N = 2 ∗ (r − z) + z    (1)

where N is the initial number of synthetic instances (newly generated), r is the number of majority class samples, and z is the number of minority class samples.

Second stage: we eliminate the synthetic samples with higher proximity to the majority class than to the minority, as well as the synthetic instances closer to the borderline generated by SMOTE. The A-SMOTE procedure step-by-step is outlined as follows:
• Step 1: The synthetic instances generated by SMOTE may be accepted or rejected based on two conditions, in line with the first stage. Suppose that x̂ = {x̂_1, x̂_2, ..., x̂_N} is the set of the new synthetic instances, and x̂_i^(j) is the j-th attribute value of x̂_i, j ∈ [1, M]. Let S_m = {S_m1, S_m2, ..., S_mz} and S_a = {S_a1, S_a2, ..., S_ar} be the sets of the minority samples and the majority samples, respectively. In order to make the acceptance or rejection decision, we calculate the distance between x̂_i and S_mk, D_minority(x̂_i, S_mk), and the distance between x̂_i and S_al, D_majority(x̂_i, S_al), respectively. For i from 1 to N, we compute these distances as follows:

  D_minority(x̂_i, S_mk) = √( ∑_{j=1}^{M} (x̂_i^(j) − S_mk^(j))² ), k ∈ [1, z]    (2)

  D_majority(x̂_i, S_al) = √( ∑_{j=1}^{M} (x̂_i^(j) − S_al^(j))² ), l ∈ [1, r]    (3)

where MinRap.(Ŝ_i, S_m) is the sample's rapprochement with all the minority samples; then, according to Equation (6), we get L, which is defined as follows:

  L = ∑_{i=1}^{n} MinRap.(Ŝ_i, S_m)    (7)

For example, if the number of original minority samples is 10, then we choose the 10 elements of Ŝ_i with the largest distance between S_m and Ŝ_i to delete, in order to obtain high-purity results.

• Step 3: Similarly, we calculate the distance between Ŝ_i and each original majority sample S_a, MajRap.(Ŝ_i, S_a), described as follows:

  MajRap.(Ŝ_i, S_a) = ∑_{l=1}^{r} √( ∑_{j=1}^{M} (Ŝ_i^(j) − S_al^(j))² )    (8)
Figure 2 Advanced Synthetic Minority Oversampling Technique (A-SMOTE) algorithm process.
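The two stages of A-SMOTE described above can be illustrated with a short Python sketch. This is a simplified sketch, not the authors' code: the helper names (generate_smote, a_smote) are ours, and the stage-2 filter is reduced to the simplest accept/reject rule implied by Equations (2) and (3), namely keeping a synthetic point only when it lies nearer to the minority class than to the majority class.

```python
import math
import random

def euclidean(a, b):
    # Attribute-wise distance as in Equations (2)-(3).
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def generate_smote(minority, n_new, k=5):
    # Stage 1: SMOTE-style interpolation between a minority sample and
    # one of its k nearest minority neighbors.
    synthetic = []
    for _ in range(n_new):
        x = random.choice(minority)
        neighbors = sorted((s for s in minority if s is not x),
                           key=lambda s: euclidean(x, s))[:k]
        nb = random.choice(neighbors)
        gap = random.random()  # random number between 0 and 1
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

def a_smote(minority, majority, k=5):
    r, z = len(majority), len(minority)
    n_new = 2 * (r - z) + z  # Equation (1)
    synthetic = generate_smote(minority, n_new, k)
    # Stage 2 (simplified): keep a synthetic point only if its nearest
    # minority sample is closer than its nearest majority sample.
    kept = []
    for s in synthetic:
        d_min = min(euclidean(s, m) for m in minority)  # cf. Equation (2)
        d_maj = min(euclidean(s, a) for a in majority)  # cf. Equation (3)
        if d_min < d_maj:
            kept.append(s)
    return minority + kept
```

For example, with z = 4 minority and r = 12 majority samples, Equation (1) yields N = 2 ∗ (12 − 4) + 4 = 20 candidate synthetic points before the stage-2 filter is applied.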
In problems with imbalanced domains, it is necessary to deal with the issue of performance evaluation. For classification, this point needs more attention, but several solutions for evaluation in this context already exist [41]. Therefore, we use the confusion matrix to derive multiple evaluation metrics for comparing the performance of our proposed method and previous methods. We define the standard rates as follows:

  Accuracy = (TP + TN) / (TP + FN + FP + TN)    (10)

  FPrate = FP / (TN + FP)    (11)

  TPrate = Recall = TP / (TP + FN)    (12)

When used to evaluate a learner's performance on imbalanced datasets, accuracy is more sensitive to the majority class predictions than to the minority ones. We draw this conclusion from its definition (Equation (10)): if the dataset is unusually imbalanced, even though the classifier correctly classifies the majority examples but misclassifies all the minority examples, the learner still has high accuracy because of the huge number of majority examples. In this circumstance, accuracy leads to an unreliable prediction for the minority class. Thus, in addition to accuracy, more appropriate evaluation metrics must be employed. The ROC curve [42] is one of the essential metrics for evaluating learners on imbalanced datasets. It is a two-dimensional graph in which the TP rate and FP rate are plotted on the y and x axes, respectively. The FP rate (Equation (11)) denotes the percentage of misclassified negative examples, and the TP rate (Equation (12)) is the percentage of correctly classified positive cases. Basically, the learners look for the ideal point, denoted as (0, 1). The ROC curve highlights trade-offs between benefits (TP rate) and costs (FP rate). To this end, the area under the ROC (AUC) can also be used for evaluation on imbalanced datasets, as the following equation shows:

  AUC = (1 + TPrate − FPrate) / 2    (13)

4.2. Datasets and Statistical Tests

In this research, we illustrate the datasets used for the experimental study and the statistical tests used alongside the empirical analysis. We have used 44 datasets from the KEEL data repository1 [43] with highly imbalanced rates. A summary of the datasets appears in Table 1. For our experiments, we adopt the following parameters for the A-SMOTE algorithm: K, the number of nearest neighbors, is fixed to 5, and the class distribution is rebalanced to 50-50%. These parameter values are recommended in the previous studies presented in [23,44], and consequently, we have adopted them as a standard for our experiment. To statistically support the analysis of the results, we use statistical tests. In this study, nonparametric tests (the Friedman test and the Holm post hoc test) are used for hypothesis testing, as suggested by [45,46] and employed before [35].

4.3. Comparative Analysis and Results

The experimental findings in this paper are presented using different performance metrics to allow a fair comparison with other methods from the literature.

4.3.1. Case 1: Using the AUC performance metric

To make a fair comparison, the sets were divided in order to perform fivefold cross-validation, 80% for training and 20% for testing, where the 5 test datasets form the whole set. For each dataset, we consider the average results of the five partitions. The learning algorithm employed for the experiments is C4.5, which has been identified as one of the top algorithms in data mining [47] and has been extensively applied in imbalanced problems [48]. In this part, we compare our approach (A-SMOTE) with seven oversampling and undersampling preprocessing techniques based on SMOTE, that is, the SMOTE algorithm and the preprocessing approaches S-ENN, S-TomekLinks, Borderline-1, Borderline-2, safe-level, SMOTE-RSB (analyzed in [35,49]) and MWMOTE. Table 2 shows the results of the experimental evaluation, where the first column indicates the datasets and the best approach is emphasized in bold for each dataset. The performance of the algorithms is ranked on each dataset selected for this study. Thus, our proposed algorithm appears in first place 35 times and five times in the second position. We can recognize that our method obtains the highest performance value of all the methodologies being compared. SMOTE-RSB and Borderline-2 also achieve good results. Additionally, the results for SMOTE-ENN and SMOTE-TomekLinks highlight the significance of the cleaning step in the oversampling, producing preferred classification performance as compared to SMOTE (see Figure 3). The highest AUC counts are shown in Table 3. There are two numbers per cell: the first denotes the number of times a given algorithm is preferred over the other algorithms, while the second shows the number of times it shares equal performance with other algorithms.

In this section, our approach obtains the best ranking, as shown in Table 4. We can see that the average ranking of the algorithms demonstrates how good a method is over the others. This ranking is accomplished by assigning a position to each algorithm depending on its performance on each dataset. The algorithm that achieves the best accuracy on a specific dataset is given the first rank (value 1); then, the algorithm with the second best accuracy is assigned rank 2, and so forth. Finally, this task is carried out for all datasets, and an average ranking is estimated as the mean value of all rankings.

Furthermore, for multiple comparisons, we utilize the Holm post hoc test to determine the algorithms that reject the hypothesis of equality with respect to a selected control method (see Table 5). The post hoc procedure allows the comparison of means to determine acceptance at the significance level α = 0.05. In addition, we calculate the p-value associated with each comparison, which describes the lowest level of significance of a hypothesis that results in a rejection. It is shown that most algorithms reject the hypothesis of equality.

4.3.2. Case 2: Using the F-measure performance metric

In this part of the study, we use a threefold process to measure the performance of the classifier learned from the training datasets generated through different oversampling methods. We randomly partition the dataset into three folds, and each fold holds almost the same proportion of classes as the original dataset. Of the three folds, only one fold is retained as the validation data for testing, and the remaining two folds are employed as training data. The process is then replicated three times, with each of the three folds used precisely once as the validation data. The three results from the folds are then averaged to provide the estimate of one test. We employ the Naive-Bayes classifier to evaluate the efficiency of SMOTE, SNOCC, CBSO, and A-SMOTE. This is done for a fair comparison.

We use a Laplace estimator to calculate the prior probability. The Laplace estimator shows excellent performance in the Naive-Bayes classification algorithm [38,50]. One extra benefit of using the Laplace estimator is that zero probabilities can be avoided. The F-measure for the minority (positive) class is used as the evaluation standard. In this part, we compare our approach A-SMOTE with CBSO and SNOCC (analyzed in [38,51]). The F-measure values of classification for the different oversampling methods are shown in Figure 4. The oversampling technique is given in the column title. The second column, titled Normal, is the F-value without oversampling. In each row, the most significant F-value is made bold. From the results of the experiments, a comparison is done to find the best preprocessing algorithm (see Table 6). With the AUC and F-measure results, and the statistical tests, we observe that our approach (A-SMOTE) is statistically preferred over all compared techniques.

1 http://www.keel.es/datasets.php
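The metrics in Equations (10)-(13), together with the F-measure used in Case 2, can all be computed from confusion-matrix counts. The Python sketch below is illustrative only; the function names are our own, and the zero-denominator guards are an assumption for degenerate classifiers.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count TP, FN, FP, TN, treating the minority label as positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)  # Equation (10)
    fp_rate = fp / (tn + fp)                    # Equation (11)
    tp_rate = tp / (tp + fn)                    # Equation (12), recall
    auc = (1 + tp_rate - fp_rate) / 2           # Equation (13)
    # F-measure for the positive class, guarded against empty denominators.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + tp_rate
    f_measure = 2 * precision * tp_rate / denom if denom else 0.0
    return accuracy, fp_rate, tp_rate, auc, f_measure
```

For the trivial all-majority classifier on a 98:2 dataset (tp=0, fn=2, fp=0, tn=98), this yields accuracy 0.98 but TP rate 0, AUC 0.5, and F-measure 0, which is exactly the failure mode of accuracy discussed above.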
Table 2 Illustration of the AUC results for nine preprocessing algorithms using C4.5 classifier (Case 1).
Dataset Original SMOTE SMOTE-TL S-ENN Border-1 Border-2 Safe-level SMOTE- MWMOTE A-SMOTE
RSB*
ecoli0137vs26 0.7481 0.8136 0.8136 0.8209 0.8445 0.8445 0.8118 0.8445 0.7795 0.9648
shuttle0vs4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9988 1 1 1
yeast1vs7 0.6275 0.7003 0.7371 0.7277 0.6422 0.6407 0.6621 0.8617 0.5669 0.8929
shuttle2vs4 1 0.9917 1 1 1 1 1 1 0.9960 1
glass016vs2 0.5938 0.6062 0.6388 0.6390 0.5738 0.5212 0.6338 0.6376 0.6905 0.8005
glass016vs5 0.8943 0.8129 0.8629 0.8743 0.8386 0.8300 0.8429 0.8800 0.9262 0.9580
pageblocks13vs4 0.9978 0.9955 0.9910 0.9888 0.9978 0.9944 0.9831 0.9978 0.9978 0.9934
yeast05679vs4 0.6802 0.7602 0.7802 0.7569 0.7473 0.7331 0.7825 0.7719 0.6312 0.8610
yeast1289vs7 0.6156 0.6832 0.6332 0.7037 0.6058 0.5473 0.5603 0.7487 0.5271 0.8032
yeast1458vs7 0.5000 0.5367 0.5563 0.5201 0.4955 0.4910 0.5891 0.6183 0.5282 0.6452
yeast2vs4 0.8307 0.8588 0.9042 0.9153 0.8635 0.8576 0.8647 0.9681 0.8539 0.9753
Ecoli4 0.8437 0.8310 0.8544 0.9044 0.8358 0.8155 0.8386 0.8544 0.8710 0.9802
Yeast4 0.6135 0.7004 0.7307 0.7257 0.7124 0.6882 0.7945 0.7609 0.5647 0.7538
Vowel0 0.9706 0.9494 0.9444 0.9455 0.9278 0.9766 0.9566 0.9678 0.9351 0.9796
Yeast2vs8 0.5250 0.8066 0.8045 0.8197 0.6827 0.6968 0.8112 0.7370 0.5545 0.9561
Glass4 0.7542 0.8508 0.9150 0.8650 0.7900 0.8325 0.9020 0.8768 0.9077 0.9727
Glass5 0.8976 0.8829 0.8805 0.7756 0.8854 0.8402 0.8939 0.9232 0.9618 0.9975
Glass2 0.7194 0.5424 0.6269 0.7457 0.7092 0.5701 0.6979 0.7912 0.6661 0.8589
Yeast5 0.8833 0.9233 0.9427 0.9406 0.9118 0.9219 0.9542 0.9622 0.8043 0.9450
Yeast6 0.7115 0.8280 0.8287 0.8270 0.7928 0.7485 0.8163 0.8208 0.6589 0.8745
abalone19 0.5000 0.5202 0.5162 0.5166 0.5202 0.5202 0.5363 0.5244 0.5137 0.5630
abalone918 0.5983 0.6215 0.6675 0.7193 0.7216 0.6819 0.8112 0.6791 0.5949 0.7672
cleveland0vs4 0.6878 0.7908 0.8376 0.7605 0.7194 0.7255 0.8511 0.7620 0.7158 0.9028
ecoli01vs235 0.7136 0.8377 0.8495 0.8332 0.7377 0.7514 0.7550 0.7777 0.7806 0.9257
ecoli01vs5 0.8159 0.7977 0.8432 0.8250 0.8318 0.8295 0.8568 0.7818 0.9567 0.9779
ecoli0146vs5 0.7885 0.8981 0.8981 0.8981 0.7558 0.8058 0.8519 0.8231 0.8980 0.9783
ecoli0147vs2356 0.8051 0.8277 0.8195 0.8228 0.7465 0.8320 0.8149 0.8154 0.8083 0.9180
ecoli0147vs56 0.8318 0.8592 0.8424 0.8424 0.8420 0.8453 0.8197 0.8670 0.8173 0.9471
ecoli0234vs5 0.8307 0.8974 0.8920 0.8947 0.8613 0.8586 0.8700 0.9058 0.9490 0.9691
ecoli0267vs35 0.7752 0.8155 0.8604 0.8179 0.8352 0.8102 0.8380 0.8227 0.7941 0.9510
ecoli034vs5 0.8389 0.9000 0.9361 0.8806 0.8806 0.9028 0.8306 0.9417 0.9085 0.9765
ecoli0346vs5 0.8615 0.8980 0.8703 0.8980 0.8534 0.8838 0.8520 0.8649 0.8255 0.9591
ecoli0347vs56 0.7757 0.8568 0.8482 0.8546 0.8427 0.8449 0.7995 0.8984 0.8445 0.9539
ecoli046vs5 0.8168 0.8701 0.8674 0.8869 0.8615 0.8892 0.8923 0.9476 0.8113 0.9564
ecoli067vs35 0.8250 0.8500 0.8125 0.8125 0.8550 0.8750 0.7950 0.8525 0.8253 0.9302
ecoli067vs5 0.7675 0.8475 0.8425 0.8450 0.8875 0.8900 0.7975 0.8800 0.9528 0.9780
glass0146vs2 0.6616 0.7842 0.7454 0.7095 0.6565 0.6958 0.7465 0.7978 0.6402 0.8287
glass015vs2 0.5011 0.6772 0.7040 0.7957 0.5196 0.5817 0.7215 0.7065 0.6577 0.7723
glass04vs5 0.9941 0.9816 0.9754 0.9754 0.9941 1 0.9261 0.9941 0.9741 0.9941
glass06vs5 0.9950 0.9147 0.9597 0.9647 0.9950 0.9000 0.9137 0.9650 0.9258 0.9904
led7digit02456789vs1 0.8788 0.8908 0.8822 0.8379 0.8908 0.8908 0.9023 0.9019 0.9212 0.9413
yeast0359vs78 0.5868 0.7047 0.7214 0.7024 0.6228 0.6438 0.7296 0.7400 0.5613 0.7426
yeast0256vs3789 0.6606 0.7951 0.7499 0.7817 0.7528 0.7644 0.7551 0.7857 0.7137 0.8664
yeast02579vs368 0.8432 0.9143 0.9007 0.9138 0.8810 0.8901 0.9003 0.9105 0.8361 0.9519
AUC, area under the ROC; SMOTE, Synthetic Minority Oversampling Technique.
Figure 3 Average area under the ROC (AUC) for 44 datasets and 9 preprocessing techniques using C4.5 classifier (Case 1).
Table 3 Number of sole wins/shared best results in AUC for each preprocessing technique over the 44 datasets (Case 1).
Original SMOTE SMOTE-TL S-ENN Border-1 Border-2 Safe-level SMOTE-RSB∗ MWMOTE A-SMOTE
TEST 0/3 0/0 0/1 1/1 0/3 1/1 2/1 1/3 0/2 35/2
SMOTE, Synthetic Minority Oversampling Technique.
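The two numbers per cell described in Section 4.3.1 (sole wins and shared best results) can be counted from a per-dataset score matrix. A hypothetical sketch; the helper name and input layout are our assumptions, not the paper's code:

```python
def wins_and_ties(results):
    # results: {dataset: {method: score}}; higher scores are better.
    # For each dataset, credit every best-scoring method with a sole
    # win (if unique) or a shared tie (if several methods are best).
    counts = {}
    for scores in results.values():
        best = max(scores.values())
        winners = [m for m, s in scores.items() if s == best]
        for m in scores:
            w, t = counts.get(m, (0, 0))
            if m in winners:
                if len(winners) == 1:
                    w += 1
                else:
                    t += 1
            counts[m] = (w, t)
    return counts
```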
Table 4 Average ranks obtained by each method in the Friedman test for case 1.
Algorithm Ranking
A-SMOTE 1.4773
S-RSB ∗ 3.7727
S-ENN 5.1932
SMOTE-TL 5.2386
SMOTE 5.5795
Safelevel 5.6705
Border-2 6.5227
Border-1 6.7273
MWMOTE 6.9659
SMOTE, Synthetic Minority Oversampling Technique; A-SMOTE, Advanced SMOTE.
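The average-rank procedure behind Table 4 (rank 1 for the best method on each dataset, then the mean rank over all datasets) can be sketched as follows. The function name and the mean-rank handling of ties are our assumptions:

```python
def average_ranks(results):
    # results: {dataset: {method: score}}; higher scores are better.
    # Assign rank 1 to the best method per dataset (ties share the mean
    # of the tied positions), then average each method's ranks.
    totals = {}
    n = 0
    for scores in results.values():
        n += 1
        ordered = sorted(scores.items(), key=lambda kv: -kv[1])
        i = 0
        while i < len(ordered):
            j = i
            while j < len(ordered) and ordered[j][1] == ordered[i][1]:
                j += 1
            mean_rank = (i + 1 + j) / 2  # average of positions i+1 .. j
            for m, _ in ordered[i:j]:
                totals[m] = totals.get(m, 0.0) + mean_rank
            i = j
    return {m: s / n for m, s in totals.items()}
```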
Table 5 Holm's table for α = 0.05; A-SMOTE is the control method for Case 1.
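Holm's step-down procedure, used for the comparisons in Table 5, sorts the per-comparison p-values and tests the i-th smallest against α divided by the number of remaining hypotheses. A generic sketch, not tied to the paper's exact p-values:

```python
def holm(p_values, alpha=0.05):
    # p_values: {method: p-value of the comparison vs. the control method}.
    # Returns the set of methods whose hypothesis of equality is rejected.
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    k = len(ordered)
    rejected = set()
    for i, (method, p) in enumerate(ordered):
        if p <= alpha / (k - i):  # step-down threshold alpha / (k - i)
            rejected.add(method)
        else:
            break  # once one hypothesis is retained, all larger p-values are retained
    return rejected
```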
Figure 4 Average F-value for 12 datasets and 4 preprocessing techniques using NB classifier (Case 2).
Table 6 The F-measure results of different oversampling methods using NB classifier (Case 2).
5. CONCLUSION AND FUTURE WORK

In this study, we have proposed a novel approach for highly imbalanced datasets. The performance of A-SMOTE was evaluated on 44 datasets with high imbalance ratios. The proposed method was compared with existing preprocessing techniques, using ML algorithms (e.g., C4.5, Naive-Bayes). In our experimental results, the A-SMOTE technique for preprocessing imbalanced datasets obtained higher accuracy and F-measure (F-value). We believe that the proposed A-SMOTE can be a useful tool for researchers and practitioners, since it results in the generation of high-quality data. For future work, we will focus on how to combine A-SMOTE with rough set theory to solve the imbalanced dataset classification problem. In addition, the problem of imbalanced data is closely related to the extended belief rule-based system [52,53] developed to deal with classification tasks. It will make an invaluable contribution to the field of complex data analysis, which we plan to work on in the future.

REFERENCES

[1] A.S. Hussein, T. Li, N.S. Jaber, W.Y. Chubato, A rough set based hybrid approach for classification, in 13th International FLINS Conference on Data Science and Knowledge Engineering for Sensing Decision Support (FLINS 2018), World Scientific, Belfast, Northern Ireland, 2018, pp. 683–690.
[2] R. Pruengkarn, K.W. Wong, C.C. Fung, Imbalanced data classification using complementary fuzzy support vector machine techniques and SMOTE, in IEEE International Conference on Systems, Man and Cybernetics, Banff, 2017, pp. 978–983.
[3] W.Y. Chubato, T. Li, A combined-learning based framework for improved software fault prediction, Int. J. Comput. Intell. Syst. 10 (2017), 647–662.
[4] W.Y. Chubato, T. Li, K. Bashir, A three-stage based ensemble learning for improved software fault prediction: an empirical comparative study, Int. J. Comput. Intell. Syst. 11 (2018), 1229–1247.
[5] B. Krawczyk, Learning from imbalanced data: open challenges and future directions, Prog. Artif. Intell. 5 (2016), 221–232.
[6] R. Blagus, L. Lusa, SMOTE for high-dimensional class-imbalanced data, BMC Bioinformat. 14 (2013), 106.
[7] S. Suresh, N. Sundararajan, P. Saratchandran, Risk-sensitive loss functions for sparse multi-category classification problems, Inf. Sci. 178 (2008), 2621–2638.
[8] Y.M. Huang, C.M. Hung, H.C. Jiau, Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem, Nonlinear Anal. Real World Appl. 7 (2006), 720–747.
[13] V. López, A. Fernández, F. Herrera, On the importance of the validation technique for classification with imbalanced datasets: addressing covariate shift when data is skewed, Inf. Sci. 257 (2014), 1–13.
[14] N. Japkowicz, Class imbalances: are we focusing on the right issue, in Workshop on Learning from Imbalanced Data Sets II, 2003, vol. 1723, p. 6.
[15] V. García, J. Sánchez, R. Mollineda, An empirical study of the behavior of classifiers on imbalanced and overlapped data sets, in Iberoamerican Congress on Pattern Recognition, Springer, Valparaiso, 2007, pp. 397–406.
[16] K. Napierała, J. Stefanowski, S. Wilk, Learning from imbalanced data in presence of noisy and borderline examples, in International Conference on Rough Sets and Current Trends in Computing, Springer, Warsaw, 2010, pp. 158–167.
[17] S. Tang, S.P. Chen, The generation mechanism of synthetic minority class examples, in International Conference on Information Technology and Applications in Biomedicine (ITAB 2008), IEEE, Shenzhen, 2008, pp. 444–447.
[18] G.M. Weiss, Mining with rarity: a unifying framework, ACM SIGKDD Explor. Newsl. 6 (2004), 7–19.
[19] T. Jo, N. Japkowicz, Class imbalances versus small disjuncts, ACM SIGKDD Explor. Newsl. 6 (2004), 40–49.
[20] R.C. Prati, G.E. Batista, M.C. Monard, Class imbalances versus class overlapping: an analysis of a learning system behavior, in Mexican International Conference on Artificial Intelligence, Springer, Mexico City, 2004, pp. 312–321.
[21] R. Barandela, R.M. Valdovinos, J.S. Sánchez, F.J. Ferri, The imbalanced training sample problem: under or over sampling?, in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, Lisbon, 2004, pp. 806–814.
[22] J. Laurikkala, Improving identification of difficult small classes by balancing class distribution, in Conference on Artificial Intelligence in Medicine in Europe, Springer, Cascais, 2001, pp. 63–66.
[23] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002), 321–357.
[24] M. Mahdizadeh, M. Eftekhari, Designing fuzzy imbalanced classifier based on the subtractive clustering and genetic programming, in 2013 13th Iranian Conference on Fuzzy Systems (IFSC), IEEE, Qazvin, 2013, pp. 1–6.
[25] M. Sahare, H. Gupta, A review of multi-class classification for imbalanced data, Int. J. Adv. Comput. Res. 2 (2012), 160–164.
[9] M.A. Mazurowski, P.A. Habas, J.M. Zurada, J.Y. Lo, J.A. Baker, [26] P. Jeatrakul, K.W. Wong, Enhancing classification performance
G.D. Tourassi, Training neural network classifiers for medical of multi-class imbalanced data using the oaa-db algorithm, in
decision making: the effects of imbalanced datasets on classifica- The 2012 International Joint Conference on Neural Networks
tion performance, Neural Netw. 21 (2008), 427–436. (IJCNN), IEEE, Brisbane, 2012, pp. 1–8.
[10] K. Bashir, T. Li, W.Y. Chubato, M. Yahaya, T. Ali, A novel pre- [27] N.V. Chawla, Data mining for imbalanced datasets: an overview,
processing approach for imbalanced learning in software defect in: O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge
prediction, in 13th International FLINS Conference on Data Sci- Discovery Handbook, Springer, Boston, 2010, pp. 875–886.
ence and Knowledge Engineering for Sensing Decision Support [28] E. Ramentol, N. Verbiest, R. Bello, Y. Caballero, C. Cornelis,
(FLINS 2018), World Scientific, Belfast, Northern Ireland, 2018, F. Herrera, Smote-frst: a new resampling method using fuzzy
pp. 500–508. rough set theory, in 10th International FLINS conference on
[11] S.J. Yen, Y.S. Lee, Under-sampling approaches for improving Uncertainty Modeling in Knowledge Engineering and Decision
prediction of the minority class in an imbalanced dataset, in: Making, World Scientific, 2012, pp. 800–805.
D.-S. Huang, G. William Irwin (Eds.), Intelligent Control and [29] J.A. Sáez, J. Luengo, J. Stefanowski, F. Herrera, Smote–ipf:
Automation, Springer, Berlin, Heidelberg, 2006, pp. 731–740. addressing the noisy and borderline examples problem in imbal-
[12] D. Dheeru, E. Karra Taniskidou, UCI Machine Learning Reposi- anced classification by a re-sampling method with filtering, Inf.
tory, 2017.
Pdf_Folio:10 Sci. 291 (2015), 184–203.
A.S. Hussein et al. / International Journal of Computational Intelligence Systems, in press 11
[30] D. Guan, W. Yuan, Y.K. Lee, S. Lee, Nearest neighbor editing aided [43] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García,
by unlabeled data, Inf. Sci. 179 (2009), 2273–2282. L. Sánchez, F. Herrera, Keel data-mining software tool: data set
[31] I. Tomek, Two modifications of cnn, IEEE Trans. Syst. Man repository, integration of algorithms and experimental analysis
Cybern. SMC-6 (1976), 769–772. framework, J. Multiple Valued Logic Soft Comput. 17 (2011),
[32] T.M. Khoshgoftaar, P. Rebours, Improving software quality pre- 255–287.
diction by noise filtering techniques, J. Comput. Sci. Technol. 22 [44] T.M. Khoshgoftaar, C. Seiffert, J. Van Hulse, A. Napolitano,
(2007), 387–396. A. Folleco, Learning with limited minority class data, in Sixth
[33] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, Safe- International Conference on Machine Learning and Applications
level-smote: safe-level-synthetic minority over-sampling tech- (ICMLA 2007), IEEE, Cincinnati, 2007, pp. 348–353.
nique for handling the class imbalanced problem, in Pacific-Asia [45] A. Fernández, M.J. Del Jesus, F. Herrera, Multi-class imbal-
Conference on Knowledge Discovery and Data Mining, Springer, anced data-sets with linguistic fuzzy rule based classification sys-
Bangkok, 2009, pp. 475–482. tems based on pairwise learning, in International Conference
[34] H. Han, W.Y. Wang, B.H. Mao, Borderline-smote: a new over- on Information Processing and Management of Uncertainty in
sampling method in imbalanced data sets learning, in Inter- Knowledge-Based Systems, Springer, Dortmund, 2010, pp. 89–98.
national Conference on Intelligent Computing, Springer, Hefei, [46] J. Demšar, Statistical comparisons of classifiers over multiple data
2005, pp. 878–887. sets, J. Mach. Learn. Res. 7 (2006), 1–30.
[35] E. Ramentol, Y. Caballero, R. Bello, F. Herrera, Smote-rsb*: [47] P.V. Ngoc, C.V.T. Ngoc, T.V.T. Ngoc, D.N. Duy. A C4. 5 algorithm
a hybrid preprocessing approach based on oversampling and for english emotional classification, Evolving Syst. 10 (2019),
undersampling for high imbalanced data-sets using smote and 425–451.
rough sets theory. Knowl. inf. Syst. 33 (2012), 245–265. [48] W. Liu, S. Chawla, D.A. Cieslak, N.V. Chawla, A robust deci-
[36] S. Barua, M.M. Islam, X. Yao, K. Murase, Mwmote–majority sion tree algorithm for imbalanced data sets, in Proceedings of
weighted minority oversampling technique for imbalanced data the 2010 SIAM International Conference on Data Mining, SIAM,
set learning, IEEE Trans. Knowl. Data Eng. 26 (2014), 405–425. 2010, pp. 766–777.
[37] N. Verbiest, E. Ramentol, C. Cornelis, F. Herrera, Prepro- [49] G.E. Batista, R.C. Prati, M.C. Monard, A study of the behavior
cessing noisy imbalanced datasets using smote enhanced with of several methods for balancing machine learning training data,
fuzzy rough prototype selection, Appl. Soft Comput. 22 (2014), ACM SIGKDD Explor. Newsl. 6 (2004), 20–29.
511–517. [50] H. Zhang, J. Su, Naive bayesian classifiers for ranking, in Euro-
[38] Z. Zheng, Y. Cai, Y. Li, Oversampling method for imbalanced pean Conference on Machine Learning, Springer, Pisa, 2004,
classification, Comput. Informat. 34 (2016), 1017–1037. pp. 501–512.
[39] P. Branco, L. Torgo, R. P. Ribeiro. A survey of predictive mod- [51] S. Barua, M.M. Islam, K. Murase, A novel synthetic minority over-
eling on imbalanced domains, ACM Comput. Surv. (CSUR). 49 sampling technique for imbalanced data set learning, in Interna-
(2016), 31. tional Conference on Neural Information Processing, Springer,
[40] N. Moniz, P. Branco, L. Torgo, Evaluation of ensemble meth- Shanghai, 2011, pp. 735–744.
ods in imbalanced regression tasks, in International Workshop on [52] L.H. Yang, J. Liu, Y.M. Wang, L. Martínez, Extended belief-rule-
Learning with Imbalanced Domains - Theory and Applications, based system with new activation rule determination and weight
2017. calculation for classification problems, Appl. Soft Comput. 72
[41] N. Japkowicz, Assessment metrics for imbalanced learning, in: (2018), 261–272.
H. He, Y. Ma (Eds.), Imbalanced Learning: Foundations, Algo- [53] L.H. Yang, J. Liu, Y.M. Wang, L. Martínez, A micro-extended belief
rithms, and Applications, Wiley-IEEE Press, 2013, pp. 187–206. rule-based system for big data multiclass classification problems,
[42] C.G. Weng, J. Poon, A new evaluation measure for imbalanced IEEE Trans. Syst. Man Cybern. Syst. PP (2018), 1–21.
datasets, in The 7th Australasian Data Mining Conference, Aus-
tralian Computer Society, Inc., Glenelg, 2008, vol. 87, pp. 27–32.
Pdf_Folio:11