Abstract Data mining is a key area for many fields of science and engineering. In this context, a statistical learning method known as Support Vector Machines (SVM) has emerged as an alternative for solving data classification problems. Usually, the SVM problem is formulated as a nonlinear optimization problem subject to constraints, and conventional optimization techniques based on the Lagrangian approach are used to solve it. When classifying noisy data, these conventional techniques show performance deterioration, since the resulting optimization problem is multidimensional and may present many local minima. In this work, we propose a Differential Evolution (DE) algorithm combined with a local search technique to find the optimal parameters of SVM classifiers applied to noisy data.
Rodrigo de C. Cosme
Universidade Federal do Espírito Santo, e-mail: rdccosmo@gmail.com
Renato A. Krohling
Universidade Federal do Espírito Santo, e-mail: krohling.renato@gmail.com

1 Introduction
To handle multiclass problems [5], several methods have been proposed. SVM Ensemble is one of them: it consists of applying SVM to the m classes of the problem in groups of two, and the outputs of these m binary SVMs are then combined to solve the original multiclass problem. Ensemble methods for aggregating binary SVMs into a multiclass solution include majority voting and weighting [5]. The main concern in multiclass classification is the growth in computational cost. In addition, there are cases in which the classes are not linearly separable. In such cases, a nonlinear transformation is applied to the data in order to reach a space where the data can be separated. This transformation is called a kernel function. Several kernel functions have been studied to improve performance or to incorporate other properties such as feature selection [7].
When using SVMs for classification, several parameters must be configured so as to minimize the classification error on the validation instances. Since this is a multidimensional optimization problem, getting trapped in local minima must be avoided. Among the alternatives for solving the optimization problem are gradient-based deterministic methods. These methods are usually computationally efficient, but can converge to local minima and stagnate. On the other hand, population-based methods (genetic algorithms, differential evolution, particle swarm optimization) have been shown to be effective for this kind of problem. In general, search or optimization methods working with a population are computationally more costly than gradient-based methods, but may provide more robust solutions. Hence, the automation of the SVM training process using a biologically inspired evolutionary algorithm is desirable.
There are many ways to approach this problem; one of them is to use an evolutionary algorithm to tune the SVM parameters encoded as an individual [13]. In [13], an approach is presented to perform feature selection and determine the parameters of an SVM using the Particle Swarm Optimization algorithm. In [7], an extension of the Gaussian kernel that can perform feature extraction is presented. In [4], the impact of introducing noise in different places on classification accuracy is studied for different supervised learning techniques, such as neural networks and the naive Bayes probabilistic classifier. In [8], noisy variables are introduced and analyzed.
Other works focus on data clustering in order to produce meaningful classification of noisy data [1], with the disadvantage that the number of clusters must be known a priori. In order to obtain good classification results and reduce the number of parameters to be set by the designer, in this paper an SVM+DE approach is proposed. This hybrid technique benefits from the statistical power of the SVM, while DE is used to find the optimal values of the SVM parameters.
In Section 2 we briefly describe SVMs. Section 3 presents the DE algorithm. Section 4 describes the Tabu Search and Nelder-Mead methods used for local search. Section 5 illustrates the hybrid method. Section 6 presents the results and discussion. In Section 7 the conclusions are given.
Given a training set Z = {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)} of l instances, where each instance is composed of n attributes (features), x_i = (x_{1i}, ..., x_{ni})^T ∈ R^n, and a class label y_i ∈ {1, −1}, the task of classification consists of separating the two classes with a hyperplane.
min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^{l} ξ_i
s.t.  y_i [w · x_i + b] ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, ..., l        (2)
Or, in the dual form,

max_α  ∑_{i=1}^{l} α_i − (1/2) ∑_{i=1}^{l} ∑_{j=1}^{l} α_i α_j y_i y_j K(x_i, x_j)
s.t.  0 ≤ α_i ≤ C,   ∑_{i=1}^{l} α_i y_i = 0        (3)

where the α_i are Lagrange multipliers and K(·, ·) denotes the kernel function.
In [7], an extension of the Gaussian kernel was given, with the property of feature extraction as well as feature mapping. Feature extraction is achieved through the parameter vector β: the lower the value of β_k is, the less relevant feature k is to the classification. Therefore, removing the feature in question would not affect the classifier.
K(x_i, x_j) = exp( − ∑_{k=1}^{N} β_k (x_{ik} − x_{jk})² / (2σ_k²) )        (9)
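As an illustration (not part of the original formulation), the kernel of Equation 9 could be computed as in the following Python sketch; the function and variable names are ours, and the per-feature placement of σ_k is one possible reading of the equation:

```python
import numpy as np

def weighted_gaussian_kernel(xi, xj, beta, sigma):
    """Modified Gaussian kernel of Equation 9 (illustrative sketch).

    xi, xj : 1-D feature vectors of length N
    beta   : per-feature relevance weights; a small beta[k] means that feature k
             contributes little to the classification
    sigma  : per-feature width parameters (a scalar also works by broadcasting)
    """
    diff2 = (np.asarray(xi) - np.asarray(xj)) ** 2
    return float(np.exp(-np.sum(beta * diff2 / (2.0 * np.asarray(sigma) ** 2))))

def gram_matrix(XA, XB, beta, sigma):
    """Kernel matrix between two sample sets, e.g. for an SVM solver that accepts
    a precomputed kernel (such as sklearn.svm.SVC(kernel='precomputed'))."""
    return np.array([[weighted_gaussian_kernel(a, b, beta, sigma) for b in XB]
                     for a in XA])
```

With such a precomputed Gram matrix, an off-the-shelf SVM solver can be retrained while C, σ and β are varied by the optimizer.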
Apart from the choice of kernel function, the selection of samples for training and validation greatly affects the performance of the classifier. It is well established in the literature that a proportion of 70% of the data set is used for training and 30% for validation [5], but which samples to reserve for each task remains an open issue. A few algorithms have been proposed to address this issue; in [15], Boosting and Bagging techniques are compared.
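For reference, the 70%/30% split mentioned above can be realized, for instance, with scikit-learn; this is only an illustrative snippet and the toy data are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: X is (n_samples, n_features), y holds labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# 70% for training, 30% for validation; stratify keeps the per-class
# proportions the same in both subsets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
```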
As the dimension of the data set increases, the task of classification becomes increasingly complex, since the number of parameters to be optimized grows proportionally. Next, we describe the algorithms used to optimize the SVM parameters.
Much research has been done in the field of noisy SVM to define methods that remove the noise or accommodate the learning model to the perturbed data. Topics of research include data clustering to remove the noise [1], Differential Evolution classifiers [8] and the extraction of relevant features [3]. As far as we know, little has been done with SVM+DE; hence, this is a first approach using SVM+DE to learn the classification model for noisy data.
3 Differential Evolution
To better balance exploration and exploitation, the use of neighborhood-based mutation was proposed in [11]. In this method, a local neighborhood model, where the best individual within a small neighborhood is used, and a global neighborhood model, where the best individual of the whole population at the current generation is used, are combined. The combination of the local and global models is controlled by a new parameter w, the weight factor, resulting in the neighborhood-based mutation vector. To create the local vector, the best vector in the neighborhood and two other vectors are chosen, as given by Equation 10.
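The following sketch illustrates the neighborhood-based mutation as we understand the scheme of [11]; the ring topology and the names F1, F2 and w are illustrative assumptions:

```python
import numpy as np

def neighborhood_mutation(pop, fitness, i, k, F1, F2, w, rng):
    """Neighborhood-based donor vector for individual i (illustrative sketch).

    pop     : (NP, D) array with the current population
    fitness : (NP,) array of fitness values (higher is better)
    k       : neighborhood radius on a ring topology
    F1, F2  : scale factors; w is the weight factor in [0, 1]
    """
    NP = len(pop)
    # the 2k+1 ring neighbors of individual i
    neigh = [(i + j) % NP for j in range(-k, k + 1)]
    n_best = max(neigh, key=lambda j: fitness[j])
    p, q = rng.choice([j for j in neigh if j != i], size=2, replace=False)
    local = pop[i] + F1 * (pop[n_best] - pop[i]) + F2 * (pop[p] - pop[q])

    g_best = int(np.argmax(fitness))
    r1, r2 = rng.choice([j for j in range(NP) if j != i], size=2, replace=False)
    global_ = pop[i] + F1 * (pop[g_best] - pop[i]) + F2 * (pop[r1] - pop[r2])

    # the weight factor w blends the global and local donor vectors
    return w * global_ + (1.0 - w) * local
```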
In Tabu Search (TS), a number R of random neighbors is generated and the best point evaluated is chosen as the starting point, serving both as current solution (CS) and best solution (S). The list used to keep recent points, the Short-Term Tabu List (STTL), is updated with CS. Then, N neighbors are generated randomly around CS according to a search strategy and ranked by their performance. The best neighbor is copied to CS if it is not a member of the STTL. A neighbor is chosen as the next move if it outperforms S. For further exploitation, S is added to a Long-Term Tabu List (LTTL). Thus, if CS is better than S, then S is replaced by CS. The continuous process of neighbor generation and selection of the best is stopped only when a maximum number of iterations is reached or when there is no improvement after a pre-specified number of iterations.
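The condensed sketch below illustrates this Tabu Search loop; the neighbor-generation strategy, the numeric constants and the omission of the long-term list are our simplifications:

```python
import numpy as np

def tabu_search(f, start, step=0.1, n_neighbors=20, max_iter=100,
                max_stall=15, tabu_len=25, seed=0):
    """Condensed Tabu Search in the spirit of the description above: generate
    random neighbors around the current solution CS, move to the best one not
    in the short-term tabu list, and keep the best solution S found so far.
    f is maximized; constants and the neighbor strategy are placeholders."""
    rng = np.random.default_rng(seed)
    current = np.asarray(start, dtype=float)
    best, f_best = current.copy(), f(current)
    sttl, stall = [tuple(current)], 0
    for _ in range(max_iter):
        cand = current + rng.normal(scale=step, size=(n_neighbors, current.size))
        cand = sorted(cand, key=f, reverse=True)            # rank the neighbors
        move = next((c for c in cand if tuple(c) not in sttl), None)
        if move is None:
            break
        current = move
        sttl = (sttl + [tuple(current)])[-tabu_len:]        # update the STTL
        if f(current) > f_best:
            best, f_best, stall = current.copy(), f(current), 0
        else:
            stall += 1
        if stall >= max_stall:   # stop after a stretch without improvement
            break
    return best, f_best
```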
In order to find the minimum of a function, the Nelder-Mead (NM) method randomly generates D points, P_1, ..., P_D, in a D-dimensional search space around a starting point P_0. The starting point can be chosen randomly or be the result of a previously run algorithm. Each point P_i evaluated has its value denoted by Y_i, and the highest and lowest values are Y_h and Y_l, respectively.
The operations used in the NM search are reflection, expansion and contraction, generating the points P_r, P_e and P_c, respectively. Another point used in these transformations is the centroid, denoted by P̄. These operations are defined as follows:
P̄ = (1/n) ∑_{i=1}^{n} P_i        (14)
P_r = (1 + α_NM) P̄ − α_NM P_h        (15)
P_e = γ_NM P_r + (1 − γ_NM) P̄        (16)
P_c = β_NM P_h + (1 − β_NM) P̄        (17)
where α_NM, β_NM and γ_NM are values in the interval [0, 1].
The NM search algorithm proceeds as follows:
1. Perform reflection for the point P_h.
The search is stopped when the following criterion is satisfied:

(1/D) ∑_{i=1}^{D} ‖P_i^k − P_i^{k+1}‖² < ε        (18)

where P_i^k and P_i^{k+1} are the points at iterations k and k+1, respectively, and ε is a small real number.
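For concreteness, the operations of Equations 14-17 and the stopping rule of Equation 18 could be organized as in the simplified sketch below (parameter names follow the text; this is not the full NM bookkeeping):

```python
import numpy as np

def nelder_mead_step(points, f, alpha=1.0, gamma=0.5, beta=0.5):
    """One simplified NM iteration for minimization using Equations 14-17.
    points is a list of vertices (NumPy vectors)."""
    points = sorted(points, key=f)                 # best first, worst (P_h) last
    P_h = points[-1]
    P_bar = np.mean(points[:-1], axis=0)           # centroid, Equation 14
    P_r = (1 + alpha) * P_bar - alpha * P_h        # reflection, Equation 15
    if f(P_r) < f(points[0]):
        P_e = gamma * P_r + (1 - gamma) * P_bar    # expansion, Equation 16
        points[-1] = P_e if f(P_e) < f(P_r) else P_r
    elif f(P_r) < f(P_h):
        points[-1] = P_r
    else:
        points[-1] = beta * P_h + (1 - beta) * P_bar   # contraction, Equation 17
    return points

def converged(old_points, new_points, eps=1e-6):
    """Stopping rule in the spirit of Equation 18: mean squared displacement
    of the vertices between two consecutive iterations."""
    old, new = np.asarray(old_points), np.asarray(new_points)
    return float(np.mean(np.sum((old - new) ** 2, axis=1))) < eps
```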
Our method consists of finding the kernel parameters and selecting the minimum number of features while maximizing the classification accuracy of the SVM, that is, minimizing the classification errors made by the SVM.
The individual representation of the SVM parameters is made up of the parameters 0 ≤ C ≤ 1 (from Equations 2 and 3), 0 ≤ σ ≤ 1, and 0 ≤ β ≤ 1. The σ and β components of the individual are multidimensional variables related to the modified Gaussian kernel function [7] shown in Equation 9.
According to [7], after training the SVM with N features, one can validate the SVM with fewer features and obtain about the same result, because a small value of an element of β means that the corresponding feature does not contribute much to the classification; consequently, it can be removed from the data set without significantly affecting the classification accuracy. This procedure, called feature extraction, is responsible for increasing the classifier performance [16]. In our configuration, all features whose β values satisfy β_i ≤ 0.5, i = 1, ..., N, are removed from the feature set, as done in [7].
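A possible decoding of a DE individual into the SVM parameters, together with the β_i ≤ 0.5 pruning rule, is sketched below; the flat layout of the individual is our assumption:

```python
import numpy as np

def decode_individual(ind, n_features):
    """Split a flat DE individual into the SVM parameters C, sigma and beta.
    The layout [C, sigma_1..sigma_N, beta_1..beta_N], all in [0, 1], is our
    assumption about how the components are concatenated."""
    ind = np.asarray(ind, dtype=float)
    C = ind[0]
    sigma = ind[1:1 + n_features]
    beta = ind[1 + n_features:1 + 2 * n_features]
    return C, sigma, beta

def prune_features(X, beta, threshold=0.5):
    """Feature extraction rule used in the text: drop feature k when beta_k <= 0.5."""
    keep = beta > threshold
    return X[:, keep], keep
```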
During the evolution process, the objective is the maximization of fitness. When the best fitness stagnates for MAXstagnation generations (set to 10), a local search algorithm is applied to further exploit the best solution found so far, taking as starting point either the best individual or the local vector, each with probability 50%. This algorithm is a combination of the Tabu Search and Nelder-Mead methods [6] and is described in Section 4.
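The trigger logic could look like the following outline, where step_de and local_search are placeholder names for one DE generation and for the combined TS+NM routine:

```python
MAX_STAGNATION = 10   # generations without improvement before local search is applied

def evolve_with_local_search(step_de, local_search, fitness_fn, pop, rng, n_gen=100):
    """Outline of the hybrid loop described above; step_de and local_search are
    placeholders for one DE generation and for the combined TS+NM routine."""
    best, best_fit, stagnation = None, float("-inf"), 0
    for _ in range(n_gen):
        pop, gen_best, gen_fit, local_vec = step_de(pop, fitness_fn)
        if gen_fit > best_fit:
            best, best_fit, stagnation = gen_best, gen_fit, 0
        else:
            stagnation += 1
        if stagnation >= MAX_STAGNATION:
            # start from the best individual or the local vector, 50% chance each
            start = best if rng.random() < 0.5 else local_vec
            cand, cand_fit = local_search(fitness_fn, start)
            if cand_fit > best_fit:
                best, best_fit = cand, cand_fit
            stagnation = 0
    return best, best_fit
```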
6 Simulation results
To create the test suite, three data sets were selected from the UCI repository [2] and noise was then added to them. Data sets containing m classes (with m > 2) were transformed into m two-class data sets, and their results are shown for each class separately. This transformation consists of assigning a single alternative label to all classes other than the currently selected class, and the procedure is repeated for each of the m classes. In addition, the data sets were perturbed with different degrees of noise.
The generation of noise can be classified in different ways [4]. In this study, two categories were taken into account: distribution and location. The distribution chosen was Gaussian, and the locations selected to introduce noise were the output class, the training data, the validation data, or a combination of both. Another possibility is to introduce noise variables in the training data, as done in [8]; here, they were also introduced in the validation data and in both the training and validation data sets. The percentage of data to be perturbed was set to 10% and 50%. Once the percentage η of noise is determined, ηLs feature vectors are randomly selected according to a normal distribution to be perturbed, where Ls is the size of the data set to be perturbed, which can be the training set, the validation set or both.
The selection of the individual vectors to be perturbed is done as follows. For each feature, N random numbers are generated from a normal distribution, where N is the number of vectors in the data set multiplied by the noise percentage to be introduced. Once the vectors to be altered are selected, the perturbation is introduced in the following manner: the perturbed variables are replaced with random numbers sampled from a Gaussian distribution with the original value as mean and a standard deviation of 1.
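One possible reading of this perturbation procedure, in NumPy, is sketched below (the uniform selection of rows is a simplification of the selection step described above):

```python
import numpy as np

def add_gaussian_noise(X, eta, rng=None):
    """Perturb a fraction eta of the rows of X (one reading of the procedure above):
    each selected value is replaced by a draw from a Gaussian with the original
    value as mean and standard deviation 1."""
    rng = rng or np.random.default_rng()
    X_noisy = X.copy()
    n_perturb = int(round(eta * len(X)))                    # eta * Ls vectors
    rows = rng.choice(len(X), size=n_perturb, replace=False)
    X_noisy[rows] = rng.normal(loc=X[rows], scale=1.0)      # mean = original value
    return X_noisy
```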
The types of noise are separated into Gaussian noise and noisy variables. The addition of Gaussian noise works as described earlier. As for the addition of noisy variables, after the new uniform random variables are added, the same principle as for Gaussian noise applies: the noise can be added to the training data, to the validation data, or to both training and validation (training/validation) data. For ease of reference, we name Gaussian noise in the training data, validation data, training/validation data and training labels as noise A, B, C and D, respectively. Likewise, for the noise variables, we name them noise E, F and G, for noise variables added to the training, validation and training/validation data, respectively. Noise types A, B, C and D have been studied in [4]. All the noise types are added to the input data, except for noise D, which sets the output data (the label vector) to a value chosen uniformly at random from the two possible values {−1, 1}. In [8], noise E adds noisy variables to the training data. The remaining noise types follow the same principle, but add noise in different parts: the validation data (noise F) and the training/validation data (noise G). The number of noise variables added was set to twice the number of features of the data set being classified [8].
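A minimal sketch of the noisy-variable addition, under the assumption that the appended variables are uniform random columns, is:

```python
import numpy as np

def add_noise_variables(X, factor=2, rng=None):
    """Append factor * n_features noisy variables to X, as in noise types E-G;
    the use of uniform random columns is our assumption."""
    rng = rng or np.random.default_rng()
    n_samples, n_features = X.shape
    noise_cols = rng.uniform(size=(n_samples, factor * n_features))
    return np.hstack([X, noise_cols])
```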
During the evolutionary process, the individuals were ranked based on N-fold cross-validation values, with the number of folds set to three. To validate the optimized classifiers, a new validation was performed with the data set selected randomly and divided into 70% for training and 30% for validation. The data pertaining to each class was divided in the same proportions: 70% of each class for training and 30% for validation. Finally, to validate the method statistically, we performed a bootstrap with N = 100 samples using the trained classifiers.
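The bootstrap estimate with N = 100 resamples could be computed roughly as follows (sketch; predict stands for any trained classifier's prediction function):

```python
import numpy as np

def bootstrap_accuracy(predict, X_val, y_val, n_boot=100, rng=None):
    """Accuracy estimate with n_boot bootstrap resamples of the validation set."""
    rng = rng or np.random.default_rng()
    accs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X_val), size=len(X_val))   # sample with replacement
        accs.append(np.mean(predict(X_val[idx]) == y_val[idx]))
    return float(np.mean(accs)), float(np.std(accs))
```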
The effectiveness of the proposed method was evaluated on three different classification problems from the UCI repository, namely the heart, breast-cancer and iris data sets. The results are summarized in Table 1 for the heart data set, in Table 2 for the breast-cancer data set and in Table 3 for the iris data set. The values are averaged cross-validation results obtained in 10 runs of the evolutionary process. Since each run consists of the evolutionary process and the bootstrap for three problems with the different noise setups, it is computationally expensive, and thus 10 runs were considered adequate. The fitness used to measure the classifiers' performance was the classification accuracy.
For the three data sets analysed, only the results obtained for the iris data set were compared to other works, since the noise setup used for it matches the one used in [4]. A complete comparison for the heart and breast-cancer data sets is left for future work.
Table 1 Cross-validation values achieved with SVM+DE+LS for the heart scale problem (noise levels 10% and 50%).

               no noise   A 10%   A 50%   B 10%   B 50%   C 10%   C 50%   D 10%   D 50%   E 10%   E 50%   F 10%   F 50%   G 10%   G 50%
heart scale0   83.2       83.7    78.9    82.9    81.9    82.6    76.7    86.1    87.3    83.0    84.6    82.7    80.5    83.2    84.6
Table 2 Cross-validation values achieved with SVM+DE+LS for the breast-cancer scale problem (noise levels 10% and 50%).

                       no noise   A 10%   A 50%   B 10%   B 50%   C 10%   C 50%   D 10%   D 50%   E 10%   E 50%   F 10%   F 50%   G 10%   G 50%
breast-cancer scale0   96.7       95.8    93.6    95.7    93.3    94.8    91.1    95.7    95.4    96.8    95.7    95.9    96.3    96.7    96.6
In [8], the result obtained for the heart problem without noise is 83.21, which is in agreement with our results. Table 1 shows that, despite the introduction of noise, the accuracy of the classifier increased in some cases. One possible reason is that the noise contributed to increasing the distance between the samples and the margins of the separating hyperplane, making the classification task more precise. A 10% noise level increased or maintained the accuracy for all noise types. A 50% noise level decreased the performance for all noise types except types D and G.
Table 3 Cross-validation values achieved with SVM+DE+LS for the iris scale problem (noise levels 10% and 50%; the Naive Bayes values are those reported in [4], with missing entries marked "-").

                     no noise   A 10%   A 50%   B 10%   B 50%   C 10%   C 50%   D 10%   D 50%   E 10%   E 50%   F 10%   F 50%   G 10%   G 50%
iris scale0          100.0      99.7    93.0    99.7    98.3    99.4    88.1    99.7    100.0   100.0   100.0   100.0   100.0   100.0   99.7
Naive Bayes c0 [4]   92.42      91.85   54.97   88.30   88.63   89.52   63.12   -       -       -       -       -       -       -       -
iris scale1          94.4       93.3    79.6    94.3    89.7    93.9    74.0    94.9    94.8    94.8    90.0    95.2    93.3    94.6    93.4
Naive Bayes c1 [4]   100        95.78   44.21   100     100     97.89   80.70   -       -       -       -       -       -       -       -
iris scale2          93.5       94.0    85.7    94.6    90.7    94.2    85.3    95.6    95.4    96.5    95.3    95.0    95.4    95.0    96.8
Naive Bayes c2 [4]   90.7       88.19   36.48   87.88   83.57   86.11   52.33   -       -       -       -       -       -       -       -
In [8], when 20 noise variables were added with 100% probability, the mean classification accuracy was 80.99, while we obtained 86.
Table 2 shows the results for the breast-cancer problem. All the noise types decreased or maintained the accuracy level. The noise types with the highest accuracy decrease were A, B and C, in partial agreement with Nettleton et al. [4], who observed that noise in the training data affects the accuracy of classifiers the most. The agreement is only partial because: 1) noise B, which affects the testing data, deteriorated the performance slightly more than noise A, which affects the training data, while noise C, affecting both training and testing, had the highest accuracy decrease; 2) on the other hand, for the noise variable types E, F and G, the property that noise in the training data affects performance more than noise in the testing data was preserved, that is, noise E deteriorated the performance more than noises F and G.
For the iris problem, as discussed in Section 6.1, the original 3-class decision problem was transformed into 3 decision problems: irisc0, irisc1 and irisc2. As shown in Table 3, the introduction of 10% noise did not affect the accuracy for any of the noise types. The introduction of 50% noise had a major impact on noise types A, B and C, deteriorating the performance, while the others maintained it. The greatest performance deterioration was seen for noise type C at the 50% noise level. In comparison to the Naive Bayes method of [4], our method yielded good results. In irisc1, our method lost in all cases by a difference of no more than 10%, except for the experiment with noise type A at 50%, where it won by a difference of 44%.
To further validate our approach, the trained classifiers were submitted to a bootstrap method to show statistically that the method achieves good results despite the randomness. The points plotted are values obtained by averaging the bootstrap values of 10 runs, in order to estimate the accuracy without the bias that might be introduced by the random selection of the training and validation sets. In Figures 1 to 5, the fitness achieved for increasing noise levels is shown for each type of noise inserted, for the heart, breast-cancer, irisc0, irisc1 and irisc2 data classification problems, respectively.
The analysis of the resulting plots shows that, for the heart problem, noise types A and C caused the largest performance deterioration as the noise level increased, with accuracies falling below 80 percent. For the breast problem, noise type C caused the largest performance deterioration of the final classifiers as noise increased. For the iris problem, noise types A and C had the most negative impact on classification. As stated in [4], noise in the training data has a greater impact on classification, which explains the higher influence of these two noise types.
[Figure: four panels of bootstrap accuracy (70-90) versus noise percentage (0-50).]
Fig. 1 Fitness bootstrap values as error increases for the Heart problem.
7 Conclusions
In this paper we proposed the use of Differential Evolution hybridized with local search to optimize the parameters of Support Vector Machines for data classification. The behavior of the classifiers was evaluated while noise was introduced in different parts of the data.
Analyzing the results for the heart problem in terms of noise type, deterioration was already noticed with 10% noise when the test data set was altered; that is, all noise types that affected the test set caused deterioration: B, C and F. The only exception was type G, which maintained the same level of accuracy. Although the performance deterioration was higher with 10% noise when the test data set was perturbed, the same behaviour did not occur at the higher noise percentage.
Fig. 2 Fitness bootstrap values as error increases for the Breast problem.
As the noise percentage increased to 50%, the noise types affecting the training data set caused the most deterioration: A and C. Even though noise types D, E and G affect the training set, as noise types A and C do, they did not show the same deterioration as these types of Gaussian noise. On the contrary, with 10% noise they showed little or no deterioration, and with 50% noise they presented even better accuracy levels than on the original, noise-free data set. Noise type D did not present any deterioration; it increased the accuracy with both 10% and 50% noise.
The results for the breast-cancer problem were different. Both the training and test data sets, when perturbed with noise, presented performance deterioration. This time, type D was not an exception and presented a decrease in accuracy. The deterioration for types A and B was the same for both noise percentages. For this data set, type D showed performance deterioration, but it was still smaller than for any of the other types, while type C had the highest deterioration. Noise type E only showed a decrease in performance with 50% noise. As for noise types F and G, they had no noticeable decrease in performance, maintaining their accuracies. Noise type F had a small decrease with 10% noise, but with 50% noise its accuracy level increased again, staying slightly below the level of the original data set.
Fig. 3 Fitness bootstrap values as error increases for the Iris problem Class 0.
The iris problem, divided into 3 sub-problems (iris0, iris1 and iris2), followed the same pattern observed for the breast-cancer problem. Noise types A and B caused performance deterioration, but noise type C caused the largest one. Noise types D, E, F and G maintained the same performance. With 10% noise, the deterioration, when observed, was small, with almost no change in performance, but as the percentage increased to 50% it grew higher.
In a real-world classification problem, the data may be corrupted and subject to noise. This work gives a perspective of what to expect when adopting SVM+DE to classify noisy data. Preliminary analysis indicates that the performance of SVM+DE depends on the type and percentage of noise and on the characteristics of the data. For future work, we plan to compare our method with other approaches studied thoroughly in [4].
Fig. 4 Fitness bootstrap values as error increases for the Iris problem Class 1.
Fig. 5 Fitness bootstrap values as error increases for the Iris problem Class 2.

References