
Expert Systems With Applications 91 (2018) 464–471

Contents lists available at ScienceDirect

Expert Systems With Applications

journal homepage: www.elsevier.com/locate/eswa

Effective data generation for imbalanced learning using conditional generative adversarial networks

Georgios Douzas, Fernando Bacao∗

NOVA Information Management School, Universidade Nova de Lisboa, Portugal

∗ Corresponding author. E-mail addresses: gdouzas@icloud.com (G. Douzas), bacao@novaims.unl.pt (F. Bacao).

http://dx.doi.org/10.1016/j.eswa.2017.09.030

Article info

Article history: Received 1 June 2017; Revised 17 August 2017; Accepted 11 September 2017; Available online 13 September 2017.

Keywords: GAN; Imbalanced learning; Artificial data; Minority class

Abstract

Learning from imbalanced datasets is a frequent but challenging task for standard classification algorithms. Although there are different strategies to address this problem, methods that generate artificial data for the minority class constitute a more general approach compared to algorithmic modifications. Standard oversampling methods are variations of the SMOTE algorithm, which generates synthetic samples along the line segment that joins minority class samples. Therefore, these approaches are based on local information, rather than on the overall minority class distribution. Contrary to these algorithms, in this paper the conditional version of Generative Adversarial Networks (cGAN) is used to approximate the true data distribution and generate data for the minority class of various imbalanced datasets. The performance of cGAN is compared against multiple standard oversampling algorithms. We present empirical results that show a significant improvement in the quality of the generated data when cGAN is used as an oversampling algorithm.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Learning from imbalanced data is an important problem for the research community as well as the industry practitioners (Chawla, Japkowicz, & Kolcz, 2003). An imbalanced learning problem can be defined as a learning problem from a binary or multiple-class dataset where the number of instances for one of the classes, called the majority class, is significantly higher than the number of instances for the rest of the classes, called the minority classes (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). The Imbalance Ratio (IR), defined as the ratio between the majority class and each of the minority classes, varies for different applications, and for binary problems values between 100 and 100,000 have been observed (Chawla et al., 2002; Barua, Islam, Yao, & Murase, 2014).
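As a minimal illustration of the IR definition above, the ratio can be computed directly from the class counts of a dataset; the snippet below is a sketch in plain Python, and the label array is a hypothetical example rather than one of the paper's datasets.

```python
from collections import Counter

# Hypothetical binary labels: 0 = majority class, 1 = minority class.
y = [0] * 950 + [1] * 50

counts = Counter(y)
majority_class, n_majority = counts.most_common(1)[0]
# IR = (# majority instances) / (# minority instances), computed per minority class.
for label, n in counts.items():
    if label != majority_class:
        print(f"IR for class {label}: {n_majority / n:.1f}")  # prints 19.0
```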
Imbalanced data are a characteristic of multiple real-world applications such as medical diagnosis, information retrieval systems, fraud detection, detection of oil spills in radar images, direct marketing, automatic classification of land use and land cover in remote sensing images, detection of rare particles in experimental high-energy physics, telecommunications management and bioinformatics (Akbani, Kwek, & Japkowicz, 2004; He & Garcia, 2009; Clearwater & Stern, 1991; Graves et al., 2016; Verbeke, Dejaeger, Martens, Hur, & Baesens, 2012; Zhao, Li, Chen, & Aihara, 2008). Standard learning methods perform poorly on imbalanced datasets as they induce a bias in favor of the majority class. Specifically, during the training of a standard classification method the minority classes contribute less to the minimization of the objective function. Also, the distinction between noisy and minority class instances is often difficult. An important observation is that in many of these applications the misclassification cost of the minority classes is often higher than the misclassification cost of the majority class (Domingos, 1999; Ting, 2002). Therefore, methods that address the class imbalance problem aim to increase the classification accuracy for the minority classes.

There are three main approaches to deal with the class imbalance problem (Fernández, López, Galar, Jesus, & Herrera, 2013). The first is the modification/creation of algorithms that reinforce the learning towards the minority class. The second is the application of cost-sensitive methods at the data or algorithmic level in order to minimize higher cost errors. The third and more general approach is the modification at the data level by rebalancing the class distribution through under-sampling, over-sampling or hybrid methods.
Our focus in this paper is oversampling techniques, which result in the generation of artificial data for the minority class. Standard oversampling methods are inspired by the Synthetic Minority Oversampling Technique (SMOTE) algorithm (Chawla et al., 2002), generating synthetic samples along the line segment that joins minority class samples.
A direct approach to the data generation process would be the use of a generative model that captures the actual data distribution. Generative Adversarial Networks (GAN) is a recent method that uses neural networks to create generative models (Goodfellow et al., 2014). A conditional Generative Adversarial Network (cGAN) extends the GAN model by conditioning the training procedure on external information (Mirza & Osindero, 2014). In this paper we apply a cGAN on binary class imbalanced datasets, where the external information on which the cGAN is conditioned consists of the class labels of the imbalanced datasets. The final generative model is used to create artificial data for the minority class, i.e. the generator corresponds to an oversampling algorithm.

For the evaluation of cGAN as an oversampling method an experimental analysis is performed, based on 12 publicly available datasets from the UCI Machine Learning Repository. In order to test it on a wide range of IRs, additional datasets are created by undersampling the minority class of these 12 datasets as well as by adding simulated datasets with appropriate characteristics. Then the proposed method is compared to Random Oversampling, the SMOTE algorithm, Borderline SMOTE (Han, Wang, & Mao, 2005), ADASYN (He, Bai, Garcia, & Li, 2008) and Cluster-SMOTE (Cieslak, Chawla, & Striegel, 2006). For the classification of the binary class data, five classifiers and three evaluation metrics are applied.

The sections of the paper are organized as follows. In Section 2, an overview of related previous works and existing sampling methods is given. In Section 3, the theory behind GANs is described. Section 4 presents the proposed method in detail. Section 5 presents the research methodology. In Section 6 the experimental results are presented, while conclusions are provided in Section 7.
2. Related work

Considering that our focus is the modification on the data level, and particularly the generation of artificial data, we provide a short review of oversampling methods. A review of the other methods can be found in Galar, Fernández, Barrenechea, Bustince, and Herrera (2012) and Chawla (2005). Oversampling methods generate synthetic examples for the minority class and add them to the training set. A simple approach, known as Random Oversampling, creates new data by copying random minority class examples. The drawback of this approach is that the exact replication of training examples can lead to overfitting, since the classifier is exposed to the same information.

An alternative approach that aims to avoid this problem is SMOTE. Synthetic data are generated along the line segment that joins minority class samples. SMOTE has the disadvantage that, since the separation between majority and minority class clusters is often not clear, noisy samples may be generated (He & Garcia, 2009). To avoid this scenario, various modifications of SMOTE have been proposed. The SMOTE+Edited Nearest Neighbor combination (Batista, Prati, & Monard, 2004) applies the edited nearest neighbor rule (Wilson, 1972) after the generation of artificial examples through SMOTE to remove any misclassified instances, based on the classification by their three nearest neighbors. Safe-Level SMOTE (Bunkhumpornpat, Sinapiromsaran, & Lursinsap, 2009) modifies the SMOTE algorithm by applying a weight degree, the safe level, in the data generation process. Borderline-SMOTE (Han et al., 2005), MWMOTE (Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning) (Barua et al., 2014), ADASYN (He et al., 2008) and its variation KernelADASYN (Tang & He, 2015) aim to avoid the generation of noisy samples by identifying the borderline instances of the majority and minority classes, which in turn are used to identify the informative minority class samples.

The methods above address the problem of between-class imbalance (Nekooeimehr & Lai-Yuen, 2016). Another type of problem is the within-class imbalance (Nekooeimehr & Lai-Yuen, 2016; Bunkhumpornpat, Sinapiromsaran, & Lursinsap, 2012; Cieslak & Chawla, 2008; Jo & Japkowicz, 2004), i.e. when sparse or dense subclusters of minority or majority instances exist. Clustering-based oversampling methods that deal with the between-class imbalance problem have recently been proposed. These methods initially partition the input space and then apply sampling methods in order to adjust the size of the various clusters. Cluster-SMOTE (Cieslak et al., 2006) applies the k-means algorithm and then generates artificial data by applying SMOTE in the clusters. Similarly, DBSMOTE (Bunkhumpornpat et al., 2012) uses the DBSCAN algorithm to discover arbitrarily shaped clusters and generates synthetic instances along a shortest path from each minority class instance to a pseudo-centroid of the cluster. A-SUWO (Nekooeimehr & Lai-Yuen, 2016) creates clusters of the minority class instances with a size that is determined using cross validation and generates synthetic instances based on a proposed weighting system. SOMO (Douzas & Bacao, 2017) creates a two-dimensional representation of the input space and, based on it, applies the SMOTE procedure to generate intracluster and intercluster synthetic data that preserve the underlying manifold structure. Other types of oversampling approaches are based on ensemble methods (Wang, Minku, & Yao, 2015; Sun et al., 2015) such as SMOTEBoost (Chawla, Lazarevic, Hall, & Bowyer, 2003) and DataBoost-IM (Guo & Viktor, 2004).
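To make the SMOTE-style generation mechanism described above concrete, the sketch below interpolates between a minority sample and one of its minority-class nearest neighbors. It is a minimal illustration of the line-segment idea under simplified assumptions (it omits per-sample generation counts and other details of the full SMOTE algorithm), and the toy minority data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_min, n_samples, k=5, seed=0):
    """Generate points on line segments joining minority samples and their
    minority-class nearest neighbors (the basic SMOTE interpolation step)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)             # idx[:, 0] is the point itself
    new_points = []
    for _ in range(n_samples):
        i = rng.integers(len(X_min))          # random minority sample
        j = idx[i, rng.integers(1, k + 1)]    # one of its k minority neighbors
        u = rng.random()                      # random position on the segment
        new_points.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.asarray(new_points)

X_min = np.random.default_rng(1).normal(size=(20, 2))   # toy minority samples
synthetic = smote_like_samples(X_min, n_samples=30)
```

Because every new point lies between two existing minority samples, the method relies purely on local information, which is exactly the limitation that motivates the generative approach of this paper.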
3. GAN and cGAN algorithms

In this section, we provide a summary of the GAN and cGAN frameworks, following closely the notation in Goodfellow et al. (2014) and Gauthier (2015). The GAN is based on the idea of competition, in which a generator G and a discriminator D are trying to outsmart each other. The objective of the generator is to confuse the discriminator. The objective of the discriminator is to distinguish the instances coming from the generator and the instances coming from the original dataset. If the discriminator is able to identify easily the instances coming from the generator then, relative to its discrimination ability, the generator is producing low quality data. We can look at the GAN setup as a training environment for the generator where the discriminator, while also improving, is providing feedback about the quality of the generated instances, forcing the generator to increase its performance.

More formally, the generative model G, defined as G: Z → X, where Z is the noise space of arbitrary dimension dZ that corresponds to a hyperparameter and X is the data space, aims to capture the data distribution. The discriminative model, defined as D: X → [0, 1], estimates the probability that a sample came from the data distribution rather than G. These two models, which are both multilayer perceptrons, compete in a two-player minimax game with value function:

min_G max_D V(D, G) = E_D + E_G, where
E_D = E_{x ∼ p_data(x)} [log D(x)]
E_G = E_{z ∼ p_z(z)} [log(1 − D(G(z)))]    (1)

The x ∈ X values are sampled from the data distribution p_data(x) and the z ∈ Z values are sampled from the noise distribution p_z(z). The training procedure consists of alternating between k optimizing steps for D and one optimizing step for G by applying SGD. Therefore during training, D is optimized to correctly classify training data and samples generated from G, assigning 1 and 0 respectively. On the other hand, the generator is optimized to confuse the discriminator by assigning the label 1 to samples generated from G. The unique solution of this adversarial game corresponds to G recovering the data distribution and D equal to 1/2 for any input (Goodfellow et al., 2014).
The cGAN is an extension of the above GAN framework. An additional space Y is introduced, which represents the external information coming from the training data. The cGAN framework modifies the generative model G, to include the additional space Y, as follows:

G: Z × Y → X    (2)

Similarly to G, the discriminator D is modified as follows:

D: X × Y → [0, 1]    (3)

Eq. (1) is also modified, reflecting the redefinition of G and D:

min_G max_D V(D, G) = E_D + E_G, where
E_D = E_{(x, y) ∼ p_data(x, y)} [log D(x, y)]
E_G = E_{z ∼ p_z(z), y ∼ p_y(y)} [log(1 − D(G(z, y), y))]    (4)

The (x, y) ∈ X × Y values are sampled from the data distribution p_data(x, y), the z ∈ Z values are sampled from the noise distribution p_z(z) and the y ∈ Y values are sampled from the conditional data vectors found in the training data and represented by the density function p_y(y). The training procedure for cGANs is identical to the training of the GAN model. Rephrasing Eq. (4), the cost functions for the gradient update of the discriminator and generator on a minibatch of m training examples {(x_i, y_i)}_{i=1}^{m} and m noise samples {z_i}_{i=1}^{m} are the following logistic cost expressions:

J_D = −(1/2m) Σ_{i=1}^{m} [log D(x_i, y_i) + log(1 − D(G(z_i, y_i), y_i))]    (5)

J_G = −(1/m) Σ_{i=1}^{m} log D(G(z_i, y_i), y_i)    (6)

Eq. (6) is the modified version of the generator's loss function based on Eq. (4), in order to avoid saturation of the discriminator (Goodfellow et al., 2014). The cGAN model is trained by alternating gradient-based updates of Eqs. (5) and (6), similarly to the GAN framework, where the D parameters are updated k times followed by a single update of the G parameters.
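The minibatch costs in Eqs. (5) and (6) are straightforward to evaluate once the discriminator outputs on the real pairs and on the generated pairs are available; a minimal numpy sketch, with those outputs passed in as arrays, could look as follows.

```python
import numpy as np

def discriminator_cost(d_real, d_fake, eps=1e-12):
    """J_D from Eq. (5): d_real = D(x_i, y_i), d_fake = D(G(z_i, y_i), y_i)."""
    m = len(d_real)
    return -(np.sum(np.log(d_real + eps)) +
             np.sum(np.log(1.0 - d_fake + eps))) / (2 * m)

def generator_cost(d_fake, eps=1e-12):
    """J_G from Eq. (6), the non-saturating form of the generator loss."""
    return -np.mean(np.log(d_fake + eps))
```

In the alternating scheme described above, J_D is minimized for k discriminator updates and J_G is then minimized for one generator update.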

[Table 1 — Description of the datasets. Comma-separated values in the table cells correspond to different IR versions of the same dataset. Simulated datasets with labels 1–5 and 6–10 have a subclustering structure of 1 to 5 clusters for the minority class. For each dataset (Breast, Ecoli, Glass, Haberman, Heart, Iris, Libra, Liver, Pima, Segment, Vehicle, Wine and Simulated1–Simulated10) the table reports the minority class, the number of features, the number of instances, the number of minority and majority instances and the Imbalance Ratio.]

4. The cGAN application to the imbalanced learning problem

The aim of the paper is to evaluate the effectiveness of a cGAN's generator G as an oversampling method. Specifically, the cGAN framework is trained on binary class imbalanced data {(x_i, y_i)}_{i=1}^{n}, where (x_i, y_i) ∈ X × {0, 1} with y = 1 corresponding to the minority class. The external information of the cGAN process is represented by the class variable y. As mentioned above, Z is a noise space of dimensionality dZ, while the dimensionalities of X and Y are defined as dX and dY, respectively. Therefore the generator receives as input a vector that belongs to the Z × Y space and outputs a vector that belongs to the input space X. On the other hand, the discriminator receives as input a vector that belongs to the X × Y space and classifies it as real or as artificial data generated by G. After the end of cGAN training the generator can be used to create artificial data for the minority class by receiving input vectors of the form (z, y = 1) ∈ Z × Y, where z is a sample from the noise space Z.

The hyperparameters of the above process are the dimension dZ of the noise space, the hyperparameters related to the G and D network architectures as well as their training options. According to Eqs. (2) and (3), the network architectures of the two models are constrained by the dimensionality of the input space X, the output space Y and the fact that D is a binary classifier. Specifically, the input and output layers of G have dZ + dY and dX units, respectively. Also, the input and output layers of D have dX + dY and one unit, respectively. For a binary classification problem the dimensionality dY of the class y is equal to one. Therefore, using a single hidden layer for G and D, the non-constrained hyperparameters of the cGAN are the dimension dZ of the noise space and the number of units for the hidden layers of G and D.
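As an illustration of these constraints, the sketch below builds G and D as single-hidden-layer perceptrons with the stated input and output widths and shows how a trained generator would be queried with (z, y = 1) pairs to oversample the minority class. The dimensionalities and the random (untrained) weights are hypothetical placeholders, not values taken from Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

d_X, d_Y, d_Z, h_G, h_D = 9, 1, 25, 70, 35   # hypothetical dimensionalities

# Generator: (z, y) in Z x Y -> X, one hidden layer of h_G units.
W1g, b1g = rng.normal(size=(d_Z + d_Y, h_G)), np.zeros(h_G)
W2g, b2g = rng.normal(size=(h_G, d_X)), np.zeros(d_X)

def G(z, y):
    h = relu(np.concatenate([z, y], axis=1) @ W1g + b1g)
    return sigmoid(h @ W2g + b2g)             # vector in the data space X

# Discriminator: (x, y) in X x Y -> [0, 1], one hidden layer of h_D units.
W1d, b1d = rng.normal(size=(d_X + d_Y, h_D)), np.zeros(h_D)
W2d, b2d = rng.normal(size=(h_D, 1)), np.zeros(1)

def D(x, y):
    h = relu(np.concatenate([x, y], axis=1) @ W1d + b1d)
    return sigmoid(h @ W2d + b2d)

# After training, oversampling the minority class reduces to feeding (z, y=1) through G.
z = rng.normal(size=(100, d_Z))
y_minority = np.ones((100, d_Y))
synthetic_minority = G(z, y_minority)         # 100 artificial minority instances
```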
Table 2
Dimension dz and number of hidden units for G and D. Multiple numbers in the same cell correspond to the various IRs of
Table 1.

Dataset dZ Number of units for hidden layer of G Number of units for hidden layer of D

Breast 70, 70, 70 90, 90, 180 45, 45, 45


Ecoli 35, 55, 55, 45 40, 70, 70, 70 20, 35, 35, 30
Glass 25, 25, 35, 35 70, 70, 90, 90 35, 35, 45, 45
Haberman 10, 10, 12, 15, 15 20, 25, 25, 25, 25 10, 15, 15, 15, 15
Heart 65, 65, 65, 65, 65, 65 80, 80, 80, 80, 80, 80 40, 40, 40, 40, 40, 40
Iris 20, 20, 20, 20 25, 25, 25, 25 15, 15, 15, 15
Libra 180, 180, 180, 180 540, 540, 540, 540 270, 270, 270, 270
Liver 50, 50, 50, 25, 50, 50 35, 35, 35, 50, 50, 50 20, 20, 20, 20, 30, 30
Pima 15, 15, 15, 15, 40, 40, 40 50, 50, 50, 50, 25, 25, 25 80, 80, 80, 80, 80, 80, 80
Segment 80, 80, 80, 80, 80, 80, 80 160, 160, 160, 160, 160, 160, 160 80, 80, 80, 80, 80, 80, 80
Vehicle 70, 70, 70, 70, 90, 90, 90 100, 100, 100, 100, 100, 100, 100 55, 55, 55, 55, 55, 55, 55
Wine 130, 130, 130, 130 65, 65, 65, 65 40, 40, 40, 40
Simulated1 50 70 40
Simulated2 60 70 40
Simulated3 40 70 50
Simulated4 50 70 40
Simulated5 50 70 40
Simulated6 600 800 500
Simulated7 600 800 500
Simulated8 500 800 400
Simulated9 500 800 500
Simulated10 500 800 400

Table 3
Results for mean ranking of the oversampling methods across the datasets. The bold highlights the best
performing method.

Metric None Random SMOTE Borderline SMOTE ADASYN Cluster-SMOTE cGAN

Algorithm: LR
AUC 4.37 3.79 3.54 4.42 4.89 4.15 2.85
F 5.31 3.99 3.69 3.34 4.20 4.92 2.56
G 6.73 2.94 3.18 3.82 3.14 5.69 2.49
Algorithm: SVM
AUC 4.34 4.20 3.77 4.10 5.11 3.76 2.72
F 6.17 4.63 3.21 3.01 4.54 3.94 2.49
G 6.32 4.63 3.23 3.10 4.44 4.10 2.18
Algorithm: KNN
AUC 5.01 5.37 3.42 3.70 4.27 3.72 2.51
F 5.87 3.90 3.62 3.32 4.07 4.70 2.51
G 6.59 4.63 2.59 3.77 2.93 5.25 2.23
Algorithm: DT
AUC 5.23 4.76 3.70 4.04 3.61 3.96 2.70
F 5.34 4.10 3.76 4.03 3.68 4.28 2.82
G 5.46 5.18 3.44 3.92 3.27 4.18 2.55
Algorithm: GBM
AUC 4.93 4.58 3.49 4.13 4.18 4.13 2.56
F 5.49 4.14 3.54 3.80 4.13 4.54 2.37
G 6.10 4.85 2.97 3.82 3.13 4.62 2.52

It is also important to notice that there are different formulations of the optimization problem as well as choices of the loss function of the cGAN framework. Following Goodfellow et al. (2014) and Gauthier (2015), we choose the vanilla cGAN formulation and the logistic cost functions given in Eqs. (5) and (6). The choice of the hyperparameters of the networks is described in the next section.

5. Research methodology

In order to test the performance of the proposed application of cGANs, 12 imbalanced datasets from the UCI Machine Learning Repository were used. For each one of them, additional datasets were generated that resulted from undersampling the minority class of the initial datasets such that their final IR was increased approximately by a multiplication factor of 2, 4, 6, 10, 15 and 20. This procedure was applied for a given multiplication factor only when the final number of minority class instances was not less than 8. Additionally, 10 artificial datasets were generated using the Python library Scikit-Learn (Pedregosa et al., 2011), which adapts an algorithm from Guyon (2003). Specifically, each class is composed of a number of gaussian clusters, in the range of 1 to 5, each located around the vertices of a hypercube. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube. Therefore a subcluster structure is created for the minority class. Table 1 shows a summary of the 71 datasets.
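The simulated-data procedure described above corresponds to the dataset generator available in Scikit-Learn; a hedged sketch of how such an imbalanced dataset with a subclustered minority class could be produced is shown below (the specific parameter values are illustrative, not the ones used for the paper's simulated datasets).

```python
from collections import Counter
from sklearn.datasets import make_classification

# Illustrative imbalanced dataset: 2 Gaussian clusters per class placed around
# hypercube vertices, with roughly 5% of the samples in the minority class.
X, y = make_classification(
    n_samples=4000,
    n_features=20,
    n_informative=10,
    n_redundant=0,
    n_clusters_per_class=2,
    weights=[0.95, 0.05],   # majority / minority proportions
    class_sep=1.0,
    random_state=0,
)
print(Counter(y))  # e.g. roughly Counter({0: 3800, 1: 200})
```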
Fig. 1. Ranking of each oversampling method versus the datasets' IRs. The bold polygon line corresponds to the cGAN method.

The performance of the cGAN as an oversampling method was evaluated and compared against Random Oversampling, SMOTE, Borderline SMOTE, ADASYN and Cluster-SMOTE. Since total accuracy is not appropriate for imbalanced datasets, the (balanced) F-measure, the G-mean and the Area Under the ROC Curve (AUC) are used (He & Garcia, 2009; Galar et al., 2012).

A ranking score was applied to each oversampling method for every combination of the 71 datasets, 3 evaluation metrics and 5 classifiers. In addition to the 6 oversampling algorithms we also included the performance of the classifiers when no oversampling is used. Therefore the ranking score for the best performing method is 1 and for the worst performing method is 7. The Friedman test was applied to the ranking results, followed by Holm's test where the cGAN oversampler was considered as the control method (Guyon, 2003). Generally, the Friedman test is used to detect differences in the results across multiple experimental attempts when the normality assumption may not hold. The null hypothesis is whether the classifiers have a similar performance across the oversampling methods and evaluation metrics when they are compared on their mean rankings. Holm's test is a non-parametric version of the t-test, and the null hypothesis is whether the proposed application of cGAN, used as the control algorithm, outperforms the other methods.
Table 4
Results for Friedman's test.

Metric p-value
Algorithm: LR
AUC 2.46E−07
F 5.80E−15
G 6.06E−47
Algorithm: SVM
AUC 1.21E−08
F 6.46E−28
G 1.34E−32
Algorithm: KNN
AUC 1.50E−16
F 3.30E−20
G 1.60E−46
Algorithm: DT
AUC 2.90E−11
F 1.57E−09
G 3.09E−19
Algorithm: GBM
AUC 5.76E−10
F 7.40E−16
G 8.06E−29

For the evaluation of the oversampling methods, Logistic Regression (LR) (McCullagh, 1984), Support Vector Machine with radial basis function kernel (SVM) (Chang & Lin, 2011), Nearest Neighbors (KNN) (Cover & Hart, 1967), Decision Trees (DT) (Quinlan, 1993) and Gradient Boosting Machine (GBM) (Friedman, 2001) were used. In order to evaluate the performance of the algorithms, k-fold cross validation was applied with k = 5. Before training, in each stage i ∈ {1, 2, ..., k} of the k-fold cross validation procedure, synthetic data Tg,i were generated based on the training data Ti of the k − 1 folds such that the resulting Tg,i ∪ Ti training set becomes perfectly balanced. This enhanced training set in turn was used to train the classifier. The performance evaluation of the classifiers was done on the validation data Vi of the remaining fold.

The hyperparameter tuning of the classifiers and the various oversampling algorithms was done using only the training data Ti of each cross validation stage i ∈ {1, 2, ..., k}. Specifically, an additional training/validation splitting on Ti was applied as Ti = Tt,i ∪ Tv,i. The classifiers were trained on Tt,i and the optimal hyperparameters were selected in order to maximize the AUC on the validation set Tv,i. For the SMOTE, Borderline SMOTE, Cluster-SMOTE and ADASYN algorithms the tuning parameters, for any combination of datasets, metrics and classifiers, were selected such that the performance of the classifier on the validation set Tv,i is maximized when synthetic data were added to Tt,i and formed the training set for the classifier.

A similar strategy was followed for the cGAN hyperparameter tuning. The validation performance was optimized only for the GBM classifier and the same cGAN hyperparameters were used for the rest of the classifiers. Furthermore, instability issues during the cGAN training were observed. This problem was partially solved through the fine tuning of the hyperparameters, which provided a stable state. Both the generator and the discriminator used rectified linear units (Glorot, Bordes, & Bengio, 2011) as activation functions for the single hidden layer and sigmoid activations for the output layer. Each model was trained using the Adam optimizer (Kingma & Ba, 2015) with the default settings for both the initial learning rate and the exponential decay rates of the moment estimates. No dropout was applied to either the generator or the discriminator. Also the hyperparameter k, which controls the number of D parameter updates followed by a single G parameter update, was set equal to one, since no improvement was observed for higher values. The optimal range for the number of epochs was found to be 2000–10,000 and the mini-batch size was set in the range from 1/20 up to 1/100 of each training set's sample size.
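The evaluation protocol above (oversampling applied only to the k − 1 training folds, never to the validation fold) can be expressed with Scikit-Learn and Imbalanced-Learn; the sketch below uses SMOTE as the oversampler, Logistic Regression as the classifier and AUC as the metric, purely as an illustration of the fold-wise procedure rather than the paper's exact pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

def cv_auc_with_oversampling(X, y, k=5, seed=0):
    """k-fold CV where only the training folds are rebalanced before fitting.

    X, y: numpy arrays with the features and binary labels of one dataset."""
    aucs = []
    splitter = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, val_idx in splitter.split(X, y):
        # Oversample only the training folds; the validation fold stays untouched.
        X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X[train_idx], y[train_idx])
        clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
        aucs.append(roc_auc_score(y[val_idx], clf.predict_proba(X[val_idx])[:, 1]))
    return np.mean(aucs)
```

Replacing the SMOTE call with draws from the trained cGAN generator on (z, y = 1) inputs would follow the same fold-wise pattern for the proposed method.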

Table 5
Results for Holm's test. The bold highlights statistical significance.

AUC F G

Adjusted a Algorithm: LR
0.0125 ADASYN 1.19e−08 Cluster-SMOTE 2.95e−15 None 5.62e−45
0.0143 Borderline SMOTE 1.21e−05 None 2.11e−12 Cluster-SMOTE 7.98e−35
0.0167 None 6.94e−05 ADASYN 9.55e−08 Borderline SMOTE 2.73e−07
0.0200 Cluster-SMOTE 1.68e−04 Random 1.51e−06 SMOTE 5.70e−03
0.0250 Random 3.94e−03 SMOTE 3.81e−05 ADASYN 1.59e−02
0.0333 SMOTE 3.75e−02 Borderline SMOTE 6.54e−03 Random 6.12e−02
Adjusted a Algorithm: SVM
0.0125 ADASYN 8.58e−11 None 1.06e−26 None 9.14e−41
0.0143 Random 1.20e−05 Random 2.67e−11 Random 4.72e−16
0.0167 None 2.85e−05 ADASYN 8.38e−09 ADASYN 1.25e−10
0.0200 Borderline SMOTE 4.29e−05 Cluster-SMOTE 1.66e−05 Cluster-SMOTE 3.08e−09
0.0250 Cluster-SMOTE 1.47e−03 SMOTE 8.83e−03 SMOTE 2.42e−05
0.0333 SMOTE 1.68e−03 Borderline SMOTE 5.70e−02 Borderline SMOTE 2.39e−04
Adjusted a Algorithm: KNN
0.0125 Random 2.11e−20 None 2.95e−19 None 7.95e−47
0.0143 None 7.47e−13 Cluster-SMOTE 9.18e−14 Cluster-SMOTE 4.06e−34
0.0167 ADASYN 2.74e−07 ADASYN 5.18e−07 Random 4.65e−20
0.0200 Cluster-SMOTE 1.41e−05 Random 1.28e−05 Borderline SMOTE 5.89e−12
0.0250 Borderline SMOTE 4.79e−05 SMOTE 3.67e−05 ADASYN 6.83e−03
0.0333 SMOTE 1.307e−03 Borderline SMOTE 3.06e−03 SMOTE 8.08e−02
Adjusted a Algorithm: DT
0.0125 None 1.63e−13 None 1.60e−12 None 1.22e−18
0.0143 Random 3.68e−10 Cluster-SMOTE 3.86e−08 Random 2.08e−17
0.0167 Cluster-SMOTE 7.76e−07 Borderline SMOTE 9.68e−05 Cluster-SMOTE 1.07e−12
0.0200 Borderline SMOTE 2.77e−05 Random 1.29e−04 Borderline SMOTE 1.49e−06
0.0250 SMOTE 7.20e−04 SMOTE 1.76e−03 SMOTE 5.96e−04
0.0333 ADASYN 5.67e−03 ADASYN 7.14e−03 ADASYN 1.34e−02
Adjusted a Algorithm: GBM
0.0125 None 9.83e−11 None 9.92e−18 None 1.12e−23
0.0143 Random 4.19e−09 Cluster-SMOTE 5.51e−15 Random 1.08e−13
0.0167 Cluster-SMOTE 3.01e−07 Random 2.10e−08 Cluster-SMOTE 2.24e−13
0.0200 Borderline SMOTE 1.66e−06 ADASYN 3.55e−08 Borderline SMOTE 9.82e−06
0.0250 ADASYN 2.77e−06 Borderline SMOTE 9.46e−07 ADASYN 5.12e−02
0.0333 SMOTE 2.39e−03 SMOTE 9.69e−06 SMOTE 7.49e−02
Due to limited computational resources the search for the optimal hyperparameters was not extensive. In fact, it should be noted that the time and space complexity of the cGAN follows the standard neural network characteristics. Contrary to the other methods, the cGAN as an oversampler requires a training phase. Taking into consideration the size of the datasets used, standard oversampling methods were faster compared to the cGAN procedure. A mix of random grid search and manual selection was applied, choosing dZ values as well as the number of hidden layer units for G and D to be a multiple of the dataset's number of features by a factor of 2 up to 10. The resulting optimal values are shown in Table 2.

The experimental procedure was repeated 5 times and the reported results include the average values between the experiments. The implementation of the classifiers and standard oversampling algorithms was based on the Python libraries Scikit-Learn (Pedregosa et al., 2011) and Imbalanced-Learn (Lemaitre, Nogueira, & Aridas, 2016). The cGAN implementation is based on TensorFlow (Abadi et al., 2015); the code is available at https://github.com/gdouzas/generative-adversarial-nets.

6. Experimental results

The ranking results are summarized using 15 plots; the results are also available in table format at https://github.com/gdouzas/publications/blob/master/CGAN/results.csv. Each plot corresponds to a specific classifier and evaluation metric. The x-axis and y-axis of each plot represent the IR of the imbalanced dataset and the ranking score, respectively. Therefore each plot includes 7 polygon lines, corresponding to the 7 oversampling methods, where the bold line represents the cGAN oversampler (Fig. 1).

The mean ranking of the oversampling methods across the datasets for each combination of a classifier and evaluation metric is summarized in Table 3. The cGAN method is the best performing method for all combinations of classification algorithms and evaluation metrics. In order to statistically confirm this conclusion, the Friedman test is applied and the results are shown in Table 4. Therefore, at a significance level of a = 0.05 the null hypothesis is rejected, i.e. the classifiers do not perform similarly in mean rankings for any evaluation metric. Holm's test is then applied with the cGAN oversampler as the control method. The adjusted a and the p-values are shown in Table 5.

We observe that the cGAN oversampler outperforms all other methods for any evaluation metric when DT is used as a classifier. For the SVM, GBM and KNN classifiers, cGAN outperforms the other methods in 2 out of 3 evaluation metrics. Finally, when LR is used as a classifier, cGAN has a higher performance compared to the other oversamplers in the F-measure. Although the simulated data have a subclustering structure for the minority class and the vanilla cGAN suffers from the problem of mode collapse (Mehrjou & Saremi, 2017), where the generator focuses on parts of the low-dimensional data manifold, we did not observe a decrease in the performance of the classifiers in these cases.

7. Conclusions

In this paper we propose the use of cGAN as an oversampling approach for binary class imbalanced data. The proposed application of cGAN results in a generative model G that has learned the actual data distribution conditioned on the class labels. G is used to generate synthetic data for the minority class, i.e. as an oversampling algorithm. cGAN performance was evaluated on 71 datasets with different imbalance ratios, numbers of features and subclustering structures and compared to multiple oversampling methods, using Logistic Regression, Support Vector Machine, Nearest Neighbors, Decision Trees and Gradient Boosting Machine as classifiers. The results show that cGAN performs better compared to the other methods for a variety of classifiers, evaluation metrics and datasets with complex structure. The explanation for this improvement in performance relates to the ability of cGAN to recover the training data distribution, if given enough capacity and training time. This ability is in contrast to standard oversampling methods, where heuristic approaches are applied in order to generate minority class instances in safe areas of the input space. Although training the cGAN requires more effort and time compared to standard oversampling methods, once the training is finished, the generation of minority class instances is simple and effective. The generator accepts noise and the minority class label as input and outputs the generated data. Future research extensions of this work include the application of cGAN to the multiclass imbalanced problem and the search for more efficient ways to train G and D in the context of the imbalanced learning problem.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machine to imbalanced datasets. Machine Learning: ECML, 2004, 39–50.
Barua, S., Islam, Md. M., Yao, X., & Murase, K. (2014). MWMOTE - Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405–425.
Batista, G., Prati, R., & Monard, M. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets, 6, 20–29.
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Lecture notes in computer science (pp. 475–482).
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2012). DBSMOTE: Density-based synthetic minority over-sampling technique. Applied Intelligence, 36(3), 664–684.
Chang, C., & Lin, C. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27.
Chawla, N. V. (2005). Data mining for imbalanced data sets: An overview. In Data mining and knowledge discovery handbook (pp. 875–886). Springer.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Chawla, N. V., Japkowicz, N., & Kolcz, A. (2003). Workshop learning from imbalanced data sets II. In Proceedings of international conference on machine learning.
Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the seventh European conference on principles and practice of knowledge discovery in databases (pp. 107–119).
Cieslak, D. A., & Chawla, N. V. (2008). Start globally, optimize locally, predict globally: Improving performance on imbalanced data. In Proceedings - IEEE international conference on data mining, ICDM (pp. 143–152).
Cieslak, D. A., Chawla, N. V., & Striegel, A. (2006). Combating imbalance in network intrusion datasets. In IEEE international conference on granular computing (pp. 732–737).
Clearwater, S. H., & Stern, E. G. (1991). A rule-learning program in high energy physics event classification. Computer Physics Communications, 67(2), 159–182.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbour pattern classification. IEEE Transactions on Information Theory, 13, 21–27.
Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the 5th international conference on knowledge discovery and data mining (pp. 155–164).
Douzas, G., & Bacao, F. (2017). Self-organizing map oversampling (SOMO) for imbalanced data set learning. Expert Systems with Applications, 82, 40–52.
Fernández, A., López, V., Galar, M., Jesus, M. J., & Herrera, F. (2013). Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowledge-Based Systems, 42, 97–110.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Galar, M., Fernández, A., Barrenechea, E., Bustince, H., & Herrera, F. (2012). A review on ensembles for the class imbalance problem: Bagging, boosting, and hybrid-based approaches. IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, 42(4), 463–484.
Gauthier, J. (2015). Conditional generative adversarial nets for convolutional face generation. Technical report.
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, PMLR 15 (pp. 315–323).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2672–2680.
Graves, S. J., Asner, G. P., Martin, R. E., Anderson, C. B., Colgan, M. S., Kalantari, L., et al. (2016). Tree species abundance predictions in a tropical agricultural landscape with a supervised classification model and imbalanced data. Remote Sensing, 8, 161.
Guo, H., & Viktor, H. (2004). Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. ACM SIGKDD Explorations Newsletter, 6(1), 30–39.
Guyon, I. (2003). Design of experiments for the NIPS 2003 variable selection benchmark.
Han, H., Wang, W., & Mao, B. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. Advances in Intelligent Computing, 17(12), 878–887.
He, H., Bai, Y., Garcia, E., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In IEEE international joint conference on neural networks, 2008 (pp. 1322–1328).
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6, 40–49.
Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. International conference on learning representations, ICLR.
Lemaitre, G., Nogueira, F., & Aridas, C. (2016). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. CoRR abs/1609.06570.
McCullagh, P. (1984). Generalized linear models. European Journal of Operational Research, 16, 285–292.
Mehrjou, A., & Saremi, S. (2017). Annealed generative adversarial networks. International conference on learning representations, ICLR.
Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. In Proceedings of the neural information processing systems deep learning workshop (NIPS).
Nekooeimehr, I., & Lai-Yuen, S. K. (2016). Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Systems with Applications, 46, 405–416.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Quinlan, J. (1993). C4.5: Programs for machine learning (pp. 235–240). Morgan Kaufmann Publishers, Inc.
Sun, Z., Song, Q., Zhu, X., Sun, H., Xu, B., & Zhou, Y. (2015). A novel ensemble method for classifying imbalanced data. Pattern Recognition, 48, 1623–1637.
Tang, B., & He, H. (2015). KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning. IEEE Congress on Evolutionary Computation (CEC).
Ting, K. M. (2002). An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, 14(3), 659–665.
Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research, 218(1), 211–229.
Wang, S., Minku, L. L., & Yao, X. (2015). Resampling-based ensemble methods for online class imbalance learning. IEEE Transactions on Knowledge and Data Engineering, 27(5), 1356–1368.
Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, 2(3), 408–421.
Zhao, X. M., Li, X., Chen, L., & Aihara, K. (2008). Protein classification with imbalanced data. Proteins: Structure, Function and Genetics, 70(4), 1125–1132.
