


Support Vector Machines applied to noisy data
classification using differential evolution with
local search

Rodrigo de C. Cosme and Renato A. Krohling

Abstract Data mining is a key area for many fields of science and engineering. In this context, a statistical learning method known as Support Vector Machines (SVM) has presented itself as an alternative method for data classification. Usually, the SVM problem is formulated as a nonlinear optimization problem subject to constraints, and conventional optimization techniques based on the Lagrangian approach are used to solve it. In the case of classification of noisy data, the conventional techniques show performance deterioration, since the resulting optimization problem is multidimensional and may present many local minima. In this work, a Differential Evolution (DE) algorithm combined with a local search technique is proposed to find the optimal parameters of SVM classifiers applied to noisy data.

1 Introduction

In 1995, Vapnik [17] introduced the foundations of Support Vector Machines (SVM), a statistical machine learning method. In its basic form, the objective of SVM is to maximize the distance separating the elements of two different classes. When the classes to which the elements belong are known a priori, the problem is called classification. The set of data used to calculate the boundary between the classes is called the training set, while the data set used to test the efficacy of the method is called the validation set.
Initially, classification problems were designed to separate two classes, the binary classification problem. Introducing more classes adds a new level of difficulty, since SVM was mathematically formulated to discriminate between two classes.

Rodrigo de C. Cosme
Universidade Federal do Espírito Santo, e-mail: rdccosmo@gmail.com
Renato A. Krohling
Universidade Federal do Espírito Santo, e-mail: krohling.renato@gmail.com


To classify multiclass problems [5], some methods have been proposed. SVM Ensemble is one of these methods. It consists of applying SVM to the m classes of the problem in groups of two; the outputs of these m binary SVMs are then combined to solve the original multiclass problem. Ensemble strategies to aggregate binary SVMs into a multiclass classifier include majority voting and weighting [5]. The main concern with multiclass classification is the growth in computational cost. In addition, there are cases in which the classes are not linearly separable. In such cases, a nonlinear transformation is applied to the data in order to reach a feature space where the data can be separated. This transformation is defined by a kernel function. Several kernel functions have been studied to improve performance and to incorporate other properties, such as feature selection [7].
When using SVMs in the context of classification, several parameters must be configured to minimize the classification error on the validation instances. Since this is a multidimensional optimization problem, getting trapped in local minima must be avoided. Among the alternatives to solve the optimization problem are gradient-based deterministic methods. These methods are normally computationally efficient, but they can stagnate in local minima. On the other hand, population-based methods (genetic algorithms, differential evolution, particle swarm optimization) have been shown to be effective for these kinds of problems. Generally, search or optimization methods working with a population are more computationally costly than gradient-based methods, but they may provide more robust solutions. Hence, the automation of the SVM training process using a biologically inspired evolutionary algorithm is desirable.
Many ways exist to approach this problem; one of them is to use an evolutionary algorithm to tune the SVM parameters encoded as an individual [13]. In [13], an approach is presented that performs feature selection and determines the parameters of an SVM using the Particle Swarm Optimization algorithm. In [7], an extension of the Gaussian kernel that can perform feature extraction is presented. In [4], the impact on classification accuracy of introducing noise in different places is studied for different supervised learning techniques, such as neural networks and the naive Bayes probabilistic classifier. In [8], noisy variables are introduced and analyzed.
Other works focus on data clustering in order to produce meaningful classification of noisy data [1], with the disadvantage that the number of clusters must be known a priori. In order to obtain good classifications and reduce the number of parameters to be set by the designer, a SVM+DE approach is proposed in this paper. This hybrid technique benefits from the statistical power of the SVM, while DE is used to find the optimal values of the SVM parameters.
In Section 2 we briefly describe the SVMs. Section 3 presents the DE algorithm.
Section 4 describes the Tabu Search and Nelder-Mead methods used for local search.
Section 5 illustrates the hybrid method. Section 6 shows the results and discussions.
In Section 7 the conclusions are given.

2 Support Vector Machines

Given a training set Z = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l)} of l instances, where each instance is composed of n attributes (features), x_i = (x_{1i}, …, x_{ni})^T ∈ R^n, and a classifying label y_i ∈ {1, −1}, the task of classification consists of separating the two classes with a hyperplane.

f(w, b) = \mathrm{sign}(w \cdot x + b)     (1)


The parameter b (called the bias) is calculated using two vectors, but it can also be calculated using all the support vectors on the margin to give stability [18].
Although Equation 1 can separate any part of the feature space, it is necessary to establish an optimal separating hyperplane (OSH) [12], (w*, b*). To optimally separate the set of vectors, they must be separated in a way that minimizes the classification error for a new instance while the distance between the hyperplane and the closest vectors is maximal. The hyperplane can be determined by the solution of the following minimization problem.

\min_{w,b} \; \frac{1}{2}\|w\|^2 + C
\text{s.t. } \; y_i [w \cdot x_i + b] \ge 1, \quad i = 1, \dots, l     (2)
Or in the Dual form

\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, k(x_i, x_j)
\text{s.t. } \; 0 \le \alpha_i \le C, \; i = 1, \dots, l, \qquad \sum_{j=1}^{l} \alpha_j y_j = 0     (3)

where l is the number of Lagrange multipliers.
To handle the case of nonlinear decision surfaces, the OSH is found by nonlinearly transforming the set of original feature vectors x_i into a high-dimensional feature space through a mapping Φ : x_i → z_i, followed by the linear separation. However, this requires the computation of an enormous number of inner products Φ(x)·Φ(x_i) in the high-dimensional feature space. Therefore, using a kernel function that satisfies Mercer's Theorem [14], as given by Equation 4, significantly reduces the computations needed to solve nonlinear problems. The Gaussian kernel is given by Equation 5, and the SVM binary decision function in Equation 1 can be rewritten as given by Equation 6. Using Equation 7 to further simplify Equation 6 leads to Equation 8.

\Phi(x) \cdot \Phi(x_i) = K(x, x_i)     (4)

K(x, x_j) = \exp\left(\frac{-\|x - x_j\|^2}{2\sigma^2}\right)     (5)

g(x) = \mathrm{sign}\left(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b\right)     (6)

w = \sum_{i=1}^{l} y_i \alpha_i \Phi(x_i)     (7)

g(x) = \mathrm{sign}\left(w \cdot \Phi(x) + b\right)     (8)

In [7], an extension of the Gaussian kernel was presented, with the property of feature extraction as well as feature mapping. The feature extraction is carried out through the parameter vector β: the lower the value of β_k, the less relevant feature k is to the classification. Therefore, removing the feature in question would not significantly affect the classifier.

K(x_i, x_j) = \exp\left(\frac{-\sum_{k=1}^{N} \beta_k (x_{ik} - x_{jk})^2}{2\sigma_k^2}\right)     (9)
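For illustration, a minimal sketch of the kernel in Equation 9 as a Gram-matrix function is given below, treating σ as a scalar for simplicity; the function name and the idea of plugging it into scikit-learn's SVC as a callable kernel are our own assumptions and not the implementation of [7].

import numpy as np
from sklearn.svm import SVC

def weighted_gaussian_kernel(beta, sigma):
    """Return a Gram-matrix function for the weighted Gaussian kernel of Equation 9.
    beta: vector of feature relevance weights beta_k; sigma: kernel width (scalar here)."""
    beta = np.asarray(beta, dtype=float)
    def gram(X, Y):
        # beta-weighted squared differences between every pair of rows of X and Y
        diff = X[:, None, :] - Y[None, :, :]             # shape (n_X, n_Y, n_features)
        dist2 = np.einsum('ijk,k->ij', diff ** 2, beta)  # sum_k beta_k (x_ik - x_jk)^2
        return np.exp(-dist2 / (2.0 * sigma ** 2))
    return gram

# hypothetical usage with C, sigma and beta decoded from a DE individual:
# clf = SVC(C=C, kernel=weighted_gaussian_kernel(beta, sigma))
# clf.fit(X_train, y_train)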
Apart from the choice of kernel function, the selection of samples for training and validation greatly affects the performance of the classifier. It is well established in the literature that 70% of the data set is used for training and 30% for validation [5], but which samples to reserve for each task remains an open issue. A few algorithms have been proposed to address this issue. In [15], Boosting and Bagging techniques are compared.
As the dimension of the data set increases, the task of classification becomes increasingly complex, since the number of parameters to be optimized grows proportionately. Next, we describe the algorithms used to optimize the SVM parameters.
Much research has been done in the field of noisy SVM to define methods that remove the noise or accommodate the learning model to the perturbed data. Topics of research include data clustering to remove the noise [1], Differential Evolution classifiers [8] and relevant feature extraction [3]. As far as we know, little has been done with SVM+DE. Hence, this is a first approach using SVM+DE to learn the classification model for noisy data.

3 Differential Evolution

The optimization procedure Differential Evolution (DE) was introduced by Storn and Price [10]. As DE is a population-based evolutionary algorithm, each individual candidate solution in the population is subject to the basic operations of mutation, crossover and selection.

To better balance exploration and exploitation, the use of neighborhood-based mutation was proposed in [11]. In this method, a local model, in which the best individual is taken from a small neighborhood, and a global model, in which the best individual is taken from the whole population at the current generation, are used together. The combination of the local and global models is done through a new parameter w, the weight factor, resulting in the neighborhood-based mutation vector.
To create the local vector, the best vector in the neighborhood and two other vectors are chosen, as given by Equation 10:

L_i = X_i + \alpha (X_{best_i} - X_i) + \beta (X_p - X_q)     (10)


where X_{best_i} stands for the best vector in the neighborhood of X_i, k is the neighborhood radius, and p, q ∈ [i − k, i + k] with p ≠ q ≠ i.
Analogously, the global vector is created according to:

G_i = X_i + \alpha (X_{best} - X_i) + \beta (X_{r1} - X_{r2})     (11)


where X_{best} stands for the best vector in the population and r1, r2 ∈ [1, N] with r1 ≠ r2 ≠ i. The parameters α and β are scaling factors.
The mutation vector is formed as a combination of Li and Gi as given by:

V_i = w \, G_i + (1 - w) \, L_i     (12)


With regard to the neighborhood, the idea of topology is used to build the concept of proximity. A few topologies have been proposed for the PSO algorithm, e.g. star, wheel and circular, but experiments have shown that the ring topology is better in the case of DE [11]. In a population of N individuals, the neighbors of individual X_i are X_{i−k}, …, X_i, …, X_{i+k}, where the neighborhood radius k satisfies 1 ≤ k < N.
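A minimal sketch of the neighborhood-based mutation of Equations 10 to 12 is given below, assuming the population is stored as a NumPy array, fitness is to be maximized and the ring radius satisfies 2k + 1 ≤ N; the function and variable names are illustrative and not taken from [11].

import numpy as np

def neighborhood_mutation(pop, fitness, i, k, w, alpha, beta):
    """Build the mutation vector V_i of Equation 12 for individual i.
    pop: (N, D) array; fitness: length-N array (higher is better);
    k: ring-neighborhood radius; w: weight factor; alpha, beta: scaling factors."""
    N = len(pop)
    rng = np.random.default_rng()

    # ring neighborhood indices i-k..i+k, wrapping around the population
    neigh = [(i + j) % N for j in range(-k, k + 1)]
    best_local = max(neigh, key=lambda idx: fitness[idx])
    p, q = rng.choice([idx for idx in neigh if idx != i], size=2, replace=False)

    # local vector, Equation 10
    L = pop[i] + alpha * (pop[best_local] - pop[i]) + beta * (pop[p] - pop[q])

    # global vector, Equation 11
    best_global = int(np.argmax(fitness))
    r1, r2 = rng.choice([idx for idx in range(N) if idx != i], size=2, replace=False)
    G = pop[i] + alpha * (pop[best_global] - pop[i]) + beta * (pop[r1] - pop[r2])

    # combined mutation vector, Equation 12
    return w * G + (1.0 - w) * L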
Motivated by [9], where the coefficients of PSO were generated by an exponen-
tial probability distribution with good results, the exponential probability distribu-
tion is used in this work to generate the scaling factor in DE. The exponential proba-
bility distribution with density function f (r) is described by the following Equation.
f(r) = \frac{1}{2b} \exp\left(-\frac{|r - a|}{b}\right), \quad -\infty \le r \le \infty, \; a, b \ge 0     (13)
To control the variance one can change the parameters a and b. In this work, they were set to a = 0 and b = 1 (the scale parameter). Experimental results indicate that using a scaling factor drawn from this probability distribution contributes positively to the performance of DE. Sampling the scaling factor in DE from the exponential probability distribution may provide a good compromise between exploration and exploitation.
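Since the density in Equation 13 is that of the double-exponential (Laplace) distribution, such scaling factors can be drawn directly with NumPy. The snippet below is only a sketch of how α and β might be sampled for each mutation with a = 0 and b = 1, as set above.

import numpy as np

rng = np.random.default_rng()
# one draw of each scaling factor per mutation, following f(r) in Equation 13
alpha = rng.laplace(loc=0.0, scale=1.0)
beta = rng.laplace(loc=0.0, scale=1.0)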

4 Tabu Search and Nelder-Mead method

As mentioned in the introduction, we adopted Tabu Search (TS) and the Nelder-Mead (NM) method [6] to exploit the local neighborhood in an attempt to escape stagnation during the evolution process.

4.1 Tabu Search method

In Tabu Search (TS), a number R of random neighbors is generated and the best evaluated point is chosen as the starting point, becoming both the current solution (CS) and the best solution (S). The list used to keep recent points, the Short-Term Tabu List (STTL), is updated with CS. Then, N neighbors are generated randomly around CS based on a search strategy and ranked according to their performance. The best neighbor is copied to CS if it is not a member of the STTL. A neighbor is chosen as the next move if it outperforms S. For further exploitation, S is added to a Long-Term Tabu List (LTTL). Thus, if CS is better than S, then S is replaced by CS. This continuous process of neighbor generation and selection of the best is stopped only when a maximum number of iterations is reached or when there is no improvement after a pre-specified number of iterations.
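A rough sketch of such a tabu search step is shown below, assuming a real-valued solution, Gaussian neighbor generation and a fitness function to be maximized; the neighborhood strategy, list handling and names are illustrative simplifications rather than the exact procedure of [6].

import numpy as np

def tabu_search(fitness, start, n_iter=100, n_neighbors=20, step=0.1, patience=20):
    """Simplified tabu search around `start`, maximizing `fitness`."""
    rng = np.random.default_rng()
    current, best = start.copy(), start.copy()
    best_val, stall = fitness(best), 0
    short_term, long_term = [], []          # STTL and LTTL as lists of visited points

    for _ in range(n_iter):
        # generate random neighbors around the current solution and rank them
        neighbors = current + rng.normal(0.0, step, size=(n_neighbors, len(current)))
        neighbors = sorted(neighbors, key=fitness, reverse=True)
        # take the best neighbor that is not in the short-term tabu list
        for cand in neighbors:
            if not any(np.allclose(cand, t) for t in short_term):
                current = cand
                short_term.append(cand.copy())
                break
        if fitness(current) > best_val:     # improvement: update the best solution
            long_term.append(best.copy())
            best, best_val, stall = current.copy(), fitness(current), 0
        else:
            stall += 1
            if stall >= patience:           # stop when no improvement for `patience` iterations
                break
    return best, best_val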

4.2 Nelder-Mead method

In order to find the minimum of a function, the Nelder-Mead (NM) method randomly generates D points P_1, …, P_D in a D-dimensional search space around a start point P_0. The start point can be chosen randomly or be the result of a previously run algorithm. Each evaluated point P_i has its value denoted by Y_i, and the highest and lowest values are Y_h and Y_l, respectively.
The operations used in the NM search are reflection, expansion and contraction, generating the points P_r, P_e and P_c, respectively. Another point used in these transformations is the centroid, denoted as P̄. These operations are defined as follows:

\bar{P} = \frac{1}{n} \sum_{i=1}^{n} P_i     (14)
P_r = (1 + \alpha_{NM}) \, \bar{P} - \alpha_{NM} \, P_h     (15)
P_e = \gamma_{NM} \, P_r + (1 - \gamma_{NM}) \, \bar{P}     (16)
P_c = \beta_{NM} \, P_h + (1 - \beta_{NM}) \, \bar{P}     (17)

where αNM , βNM and γNM are values in the interval [0, 1].
The NM search algorithm is presented as follows:
1 Perform reflection for the point Ph
2 Copy Pr into Ptemp
3 if f(Ptemp) ≤ Yl then substitute Pl with Ptemp and Yl with f(Ptemp), else goto 6
4 Perform expansion for the point Pr
5 Copy Pe into Ptemp, goto 3
6 if f(Ptemp) < Yh and f(Ptemp) > Yi for i ≠ h, then replace Ph with Ptemp and f(Ph) with f(Ptemp)
7 Perform contraction of Ph
8 Copy Pc into Ptemp
9 if f(Ptemp) > Yh then replace all Pi by (Pi + Pl)/2
10 if the stopping condition has not been reached goto 1
11 Return Pl and Yl

Listing 1 Nelder-Mead algorithm

where the stopping condition in step 10 is given by:

\frac{1}{D} \sum_{i=1}^{n} \left\| P_i^{k} - P_i^{k+1} \right\|^2 < \varepsilon     (18)

where P_i^k and P_i^{k+1} are the points at iterations k and k + 1, respectively, and ε is a small real number.
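In practice, an off-the-shelf implementation can play the role of this local refinement; the snippet below uses SciPy's Nelder-Mead minimizer on the negated fitness and is only an illustrative alternative to Listing 1, not the implementation used here.

from scipy.optimize import minimize

def refine_with_nelder_mead(fitness, start, max_iter=200, tol=1e-6):
    """Refine `start` by running Nelder-Mead on the negated fitness (maximization)."""
    result = minimize(lambda x: -fitness(x), start, method='Nelder-Mead',
                      options={'maxiter': max_iter, 'xatol': tol, 'fatol': tol})
    return result.x, -result.fun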

5 Optimization of Support Vector Machines Parameters

Our method consists of finding the kernel parameters and selecting the minimum number of features while maximizing the classification accuracy of the SVM, that is, minimizing the classification errors made by the SVM.
Each individual representing the SVM parameters is made up of the parameters 0 ≤ C ≤ 1 (from Equations 2 and 3), 0 ≤ σ ≤ 1 and 0 ≤ β ≤ 1. The σ and β components of the individual are multidimensional variables related to the modified Gaussian kernel function [7] shown in Equation 9.
According to [7], after training the SVM with N features one can validate the SVM with fewer features and obtain about the same result, because a small value of an element of β means that the corresponding feature does not contribute much to the classification; consequently, it can be removed from the data set without significantly affecting the classification accuracy. This procedure, called feature extraction, is responsible for increasing the classifier performance [16]. All features whose values of β satisfy β_i ≤ 0.5, i = 1, …, N, are removed from the feature set, as done in [7].
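A small sketch of this feature-extraction step, under the assumption that β is stored as a vector aligned with the data columns:

import numpy as np

def select_features(X, beta, threshold=0.5):
    """Keep only the columns whose relevance weight beta_k exceeds the threshold."""
    mask = np.asarray(beta) > threshold
    return X[:, mask], mask

# hypothetical usage: reduce training and validation sets with the same mask
# X_train_red, mask = select_features(X_train, beta)
# X_valid_red = X_valid[:, mask]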
During the evolution process, the objective is the maximization of the fitness. When the best fitness stagnates for MAXstagnation generations (set to 10), a local search algorithm is applied to further exploit the best solution found so far, taking as starting point either the best individual or the local vector, each with probability 50%. This algorithm is a combination of the Tabu Search and Nelder-Mead methods [6] and is described in Section 4.
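The overall control flow can be summarized by the sketch below; MAXstagnation, the 50% choice of starting point and the local refinement follow the description above, while the helper names (de_generation, local_search, local_vector) are placeholders for routines of the kind sketched in the previous sections.

import numpy as np

MAX_STAGNATION = 10        # generations without improvement before local search

def evolve(pop, fitness_fn, n_generations, de_generation, local_search, local_vector):
    rng = np.random.default_rng()
    best, best_fit, stagnation = None, -np.inf, 0

    for _ in range(n_generations):
        pop, gen_best, gen_fit = de_generation(pop, fitness_fn)   # mutation, crossover, selection
        if gen_fit > best_fit:
            best, best_fit, stagnation = gen_best, gen_fit, 0
        else:
            stagnation += 1

        if stagnation >= MAX_STAGNATION:
            # start local search from the best individual or from the local vector (50% each)
            start = best if rng.random() < 0.5 else local_vector(pop, fitness_fn)
            cand, cand_fit = local_search(fitness_fn, start)
            if cand_fit > best_fit:
                best, best_fit = cand, cand_fit
            stagnation = 0
    return best, best_fit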

6 Simulation results

6.1 Experimental settings

To create the test suite, three data sets were selected from the UCI repository [2] and noise was then added to them. Data sets containing m classes (where m > 2) were transformed into m two-class data sets, and their results are shown for each class separately. This transformation consists of assigning a single alternative label to all classes other than the currently selected class, and it is performed for each of the m classes. In addition, the data sets were perturbed with different degrees of noise.
The generation of noise can be classified in different ways [4]. In this study, two categories were taken into account: distribution and location. The distribution chosen was Gaussian, and the locations selected to introduce noise were the output class, the training data, the validation data, or a combination of both. Another possibility is to introduce noise variables in the training data, as done in [8]; here they were also introduced in the validation data and in both training and validation data sets. The percentage of data to be perturbed was set to 10% and 50%. Once the percentage η of noise is determined, ηLs feature vectors are randomly selected according to a normal distribution to be perturbed, where Ls is the size of the data set to be perturbed, which can be the training set, the validation set or both.
The selection of individual vectors to be perturbed is described as follows. For each feature, N random numbers are generated from a normal distribution, where N is the number of vectors in the data set multiplied by the noise percentage to be introduced. Once the vectors to be altered are selected, the perturbation is introduced in the following manner: the perturbed variables are replaced with random numbers sampled from a Gaussian distribution with the original value as mean and a standard deviation of 1.
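A sketch of this perturbation procedure, under our reading of the description above (a fraction η of the feature vectors is selected and each selected value is replaced by a Gaussian draw centred on its original value with standard deviation 1); the function name and signature are illustrative.

import numpy as np

def add_gaussian_noise(X, eta, std=1.0, rng=None):
    """Perturb a fraction `eta` of the feature vectors of X with Gaussian noise."""
    rng = rng or np.random.default_rng()
    X_noisy = X.copy()
    n_perturbed = int(round(eta * len(X)))                       # eta * Ls vectors
    rows = rng.choice(len(X), size=n_perturbed, replace=False)   # vectors to perturb
    # each selected value is replaced by a draw centred on its original value
    X_noisy[rows] = rng.normal(loc=X[rows], scale=std)
    return X_noisy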
The types of noise are separated into Gaussian noise and noisy variables. The addition of Gaussian noise works as described earlier. As for the addition of noisy variables, after the addition of new uniform random variables, the same principle as for Gaussian noise applies. Either can be added to the training data, to the validation data, or to both training and validation (training/validation) data. For ease of reference, we named Gaussian noise in training, validation, training/validation and training label as noise A, B, C and D, respectively. Likewise, for the noise variables, we named them noise E, F and G, for noise variables added in training, validation and training/validation, respectively. Noise types A, B, C and D have been studied in [4]. All the noise types are added to the input data, except for noise D, which sets the output data (the label vector) to a uniform random value chosen from the two possible values {−1, 1}. In [8], noise E adds noisy variables to the training data. The remaining noise types follow the same principle but add noise in different parts, namely in the validation data (noise F) and in the training/validation data (noise G). The number of noise variables added was set to twice the number of features of the data set being classified [8].

During the evolutionary process the individuals were ranked based on N-fold cross-validation values, with the number of folds set to three. To validate the optimized classifiers, a new validation was performed with the data set selected randomly and divided into 70% for training and 30% for validation. The data pertaining to each class were divided in the same proportions, 70% of each class for training and 30% for validation. Finally, to statistically validate the method, we performed a bootstrap with N = 100 samples using the trained classifiers.
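One way to realize this bootstrap validation is sketched below, assuming a trained classifier that exposes a predict method and N = 100 resamples of the validation set; the details are illustrative.

import numpy as np

def bootstrap_accuracy(clf, X_valid, y_valid, n_samples=100, rng=None):
    """Average accuracy of `clf` over bootstrap resamples of the validation set."""
    rng = rng or np.random.default_rng()
    scores = []
    for _ in range(n_samples):
        idx = rng.choice(len(X_valid), size=len(X_valid), replace=True)  # resample with replacement
        scores.append(np.mean(clf.predict(X_valid[idx]) == y_valid[idx]))
    return float(np.mean(scores))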

6.2 Results and Discussion

The effectiveness of the proposed method was evaluated on three different classification problems from the UCI repository, namely the heart data, the breast-cancer data and the iris data. The results are summarized in Table 1 for the heart data set, Table 2 for the breast data set and Table 3 for the iris data set. The values are averaged cross-validation results obtained in 10 runs of the evolutionary process. Since each run consists of the evolutionary process and the bootstrap for three problems with the different noise setups, it is computationally expensive, and thus 10 runs were considered reasonable. The fitness used to measure the classifiers' performance was the classification accuracy.
Of the 3 data sets analysed, only the results obtained for the iris data set were compared to other works. The main reason for leaving the heart scale and breast-cancer data sets uncompared is that the noise setup used was not the same as the one used in [4], so a complete comparison is left for future work.

Table 1 Cross-validation values achieved with SVM+DE+LS for the heart scale problem.

               no noise   type A       type B       type C       type D       type E       type F       type G
                          10%   50%    10%   50%    10%   50%    10%   50%    10%   50%    10%   50%    10%   50%
heart scale0   83.2       83.7  78.9   82.9  81.9   82.6  76.7   86.1  87.3   83.0  84.6   82.7  80.5   83.2  84.6

Table 2 Cross-validation values achieved with SVM+DE+LS for the breast-cancer scale problem.

                        no noise   type A       type B       type C       type D       type E       type F       type G
                                   10%   50%    10%   50%    10%   50%    10%   50%    10%   50%    10%   50%    10%   50%
breast-cancer scale0    96.7       95.8  93.6   95.7  93.3   94.8  91.1   95.7  95.4   96.8  95.7   95.9  96.3   96.7  96.6

In [8], the result obtained for the heart problem without noise is 83.21, which is in agreement with our results. Table 1 shows that, despite the introduction of noise, the accuracy of the classifier increased in some cases. One possible reason may be that the noise contributed to increasing the distance between the samples and the margins of the separating hyperplane, making the task of classification more precise. A 10% noise introduction increased or maintained the accuracy for all noise types. A noise introduction of 50% decreased the performance for all noise types except noise types D and G.

Table 3 Cross-validation values achieved with SVM+DE+LS for the iris scale problem. (The NaiveBayes values from [4] are available only for noise types A, B and C.)

                     no noise   type A         type B         type C         type D         type E         type F         type G
                                10%    50%     10%    50%     10%    50%     10%    50%     10%    50%     10%    50%     10%    50%
iris scale0          100.0      99.7   93.0    99.7   98.3    99.4   88.1    99.7   100.0   100.0  100.0   100.0  100.0   100.0  99.7
NaiveBayes c0 [4]    92.42      91.85  54.97   88.30  88.63   89.52  63.12   -      -       -      -       -      -       -      -
iris scale1          94.4       93.3   79.6    94.3   89.7    93.9   74.0    94.9   94.8    94.8   90.0    95.2   93.3    94.6   93.4
NaiveBayes c1 [4]    100        95.78  44.21   100    100     97.89  80.70   -      -       -      -       -      -       -      -
iris scale2          93.5       94.0   85.7    94.6   90.7    94.2   85.3    95.6   95.4    96.5   95.3    95.0   95.4    95.0   96.8
NaiveBayes c2 [4]    90.7       88.19  36.48   87.88  83.57   86.11  52.33   -      -       -      -       -      -       -      -

In [8], when 20 noise variables were added with 100% probability, the mean classification accuracy was 80.99, while we obtained 86.
Table 2 shows the results for the breast-cancer problem. All the noise types decreased or maintained the accuracy level. The noise types with the highest accuracy decrease were A, B and C, partially in agreement with Nettleton et al. [4], who observed that noise in the training data affects the accuracy of classifiers the most. The agreement is only partial because: 1) noise B, which affects the testing data, deteriorated the performance slightly more than noise A, which affects the training data, while noise C, affecting both training and testing, had the highest accuracy decrease; 2) on the other hand, for the noise variable types E, F and G, the property that noise in the training data affects the performance more than noise in the testing data was preserved, that is, noise E deteriorated the performance more than noises F and G.
For the iris problem, as discussed in Section 6.1, the original 3-class decision problem was transformed into 3 binary decision problems: irisc0, irisc1 and irisc2. As shown in Table 3, the introduction of 10% noise did not affect the accuracy for any of the noise types. The introduction of 50% noise had a major impact on noise types A, B and C, deteriorating the performance, while the others maintained the performance. The greatest performance deterioration was seen for noise type C at the 50% noise level. In comparison to the Naive Bayes method of [4], our method yielded good results. For irisc1, our method lost all cases by a difference of no more than 10%, except for the experiment with noise type A at 50%, which it won by a wide margin (79.6 against 44.21).
To further validate our approach, the trained classifiers were submitted to a bootstrap method to show statistically that the method achieves good results despite the randomness. The points plotted are the values obtained by averaging the bootstrap values of 10 runs, to estimate the accuracy without any bias that might be introduced by the random selection of the training and validation sets. In Figures 1 to 5, the fitness achieved for increasing values of noise is shown for each type of noise inserted, for the heart, breast-cancer, iris c0, iris c1 and iris c2 data classification problems, respectively.
The analysis of the resulting plots shows that, for the heart problem, noise types A and C caused greater performance deterioration as the noise increased, with accuracies reaching values below 80 percent. The plot for the breast problem shows that noise type C caused the greatest performance deterioration of the final classifiers as the noise increased. For the iris problem, noise types A and C had the most negative impact on the classification. As stated in [4], noise in the training data has more impact on the classification, which explains the greater influence of these two noise types.

[Figure: four panels (Noise in training; Noise in test; Noise in training and test; Noise in training label), each plotting bootstrap accuracy (70 to 100%) versus noise percentage (0 to 50%) for Gaussian noise and, where applicable, noise variables.]
Fig. 1 Fitness bootstrap values as error increases for the Heart problem.

7 Conclusions

In this paper we propose the use of Differential Evolution hybridized with local search to optimize the parameters of Support Vector Machines for data classification. The behavior of the classifiers was evaluated while noise was introduced in different parts of the data.
Analyzing the results for the heart problem in terms of noise type, deterioration was noticed with 10% noise when the test data set was altered. That means all noise types that affected the test set caused deterioration: B, C and F. The only exception was type G, which maintained the same level of accuracy. Although the performance deterioration was higher with 10% noise when the test data set was perturbed, the same behaviour did not occur with a higher percentage of noise.

[Figure: same four-panel layout as Figure 1.]
Fig. 2 Fitness bootstrap values as error increases for the Breast problem.

As the noise percentage increased to 50%, the noise types affecting the training data set caused the most deterioration: A and C. Even though noise types D, E and G also affect the training set, as noise types A and C do, they did not show the same deterioration as these Gaussian noise types. On the contrary, with 10% noise they showed little deterioration or maintained the performance; with 50% noise they presented even better accuracy levels compared to the performance on the original data set without noise. Noise D did not cause deterioration; it increased the accuracy with both 10% and 50% noise.
The results for the breast-cancer problem were different. Both the training and test data sets, when perturbed with noise, showed performance deterioration. This time, type D was not an exception and presented a decrease in accuracy. The deterioration for types A and B was about the same for both noise percentages. For this data set, type D showed performance deterioration, but it was still smaller than for any of the other types, while type C had the highest deterioration. Noise type E only caused a decrease in accuracy with 50% noise. As for noise types F and G, they showed no noticeable decrease in performance, maintaining the accuracies. Noise type F had a small decrease with 10% noise, but with 50% noise its accuracy level increased again, staying slightly below the level of the original data set.

[Figure: same four-panel layout as Figure 1.]
Fig. 3 Fitness bootstrap values as error increases for the Iris problem Class 0.

The iris problem, divided into 3 sub-problems (iris0, iris1 and iris2), followed the same pattern observed for the breast-cancer problem. Noise types A and B showed performance deterioration, but noise type C showed the highest. Noise types D, E, F and G maintained the same performance. With 10% noise, the deterioration, when observed, was small, with almost no change in performance, but as the percentage increased to 50% it grew larger.
In a real-world classification problem, the data may be corrupted and subject to noise. This work gives a perspective of what to expect when adopting SVM+DE to classify noisy data. Preliminary analysis indicates that the performance of SVM+DE depends on the type and percentage of noise and on the characteristics of the data. For future work, we plan to compare our method with other approaches studied thoroughly in [4].

[Figure: same four-panel layout as Figure 1.]
Fig. 4 Fitness bootstrap values as error increases for the Iris problem Class 1.

[Figure: same four-panel layout as Figure 1.]
Fig. 5 Fitness bootstrap values as error increases for the Iris problem Class 2.

References

[1] A. Banerjee. Robust fuzzy clustering as a multi-objective optimization procedure. In Fuzzy Information Processing Society, 2009. NAFIPS 2009. Annual Meeting of the North American, pages 1–6, 2009.
[2] A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010.
http://archive.ics.uci.edu/ml/.
[3] B. Byeon and K. Rasheed. Simultaneously removing noise and selecting rel-
evant features for high dimensional noisy data. In Machine Learning and Ap-
plications, 2008. ICMLA ’08. Seventh International Conference on, pages 147–152, December 2008.
[4] D. Nettleton, A. Orriols-Puig, and A. Fornells. A study of the effect of different
types of noise on the precision of supervised learning techniques. Artificial
Intelligence Review, 33:275–306, 2010.
[5] H. Kim, S. Pang, H. Je, D. Kim, and S. Y. Bang. Constructing support vector
machine ensemble. Pattern Recognition, 36(12):2757–2767, 2003.


[6] M. H. Mashinchi, M. A. Orgun, and W. Pedrycz. Hybrid optimization with improved tabu search. Appl. Soft Comput., 11:1993–2006, 2011.
[7] P. Du, J. Peng, and T. Terlaky. Self-adaptive support vector machines: mod-
elling and experiments. Computational Management Science, 6(1):41–51,
2009.
[8] P. Luukka and J. Lampinen. Differential evolution classifier in noisy settings
and with interacting variables. Applied Soft Computing, 11(1):891–899, 2011.
[9] R. A. Krohling and L. dos Santos Coelho. PSO-E: Particle swarm with ex-
ponential distribution. In Proceedings of the IEEE Congress on Evolutionary
Computation, CEC 2006, pages 1428–1433, 2006.
[10] R. Storn and K. Price. Differential evolution – a simple and efficient heuristic
for global optimization over continuous spaces. Journal of Global Optimiza-
tion, 11:341–359, 1997.
[11] S. Das, A. Abraham, U.K. Chakraborty, and A. Konar. Differential evolution
using a neighborhood-based mutation operator. IEEE Transactions on Evolu-
tionary Computation, 13(3):526–553, June 2009.
[12] S. Kawano, D. Okumura, H. Tamura, H. Tanaka, and K. Tanno. Online learning method using support vector machine for surface-electromyogram recognition. Artificial Life and Robotics, 13:483–487, 2009.


[13] S. Lin, K. Ying, S. Chen, and Z. Lee. Particle swarm optimization for param-
eter determination and feature selection of support vector machines. Expert
Syst. Appl., 35:1817–1824, 2008.
[14] S. Wang, A. Mathew, Y. Chen, L. Xi, L. Ma, and J. Lee. Empirical analysis
of support vector machine ensemble classifiers. Expert Syst. Appl., 36:6466–
6476, April 2009.
[15] T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano. Comparing boosting
and bagging techniques with noisy and imbalanced data. IEEE Transactions
on Systems, Man and Cybernetics, Part A: Systems and Humans, 41(3):552–568, 2011.
[16] T. Wu, J. Duchateau, J. Martens, and D. Van Compernolle. Feature subset
selection for improved native accent identification. Speech Commun., 52:83–
98, February 2010.
[17] V. N. Vapnik. The nature of statistical learning theory. Springer-Verlag, New
York, NY, USA, 1995.
[18] V. N. Vapnik, S. E. Golowich, and A. Smola. Support vector method for func-
tion approximation, regression estimation, and signal processing. In Advances
in Neural Information Processing Systems 9, pages 281–287. MIT Press, 1996.
