Regression Using Classification Algorithms
Machine Learning (ML) research has concentrated mainly on classification problems. Few existing systems are able to deal with problems where the target variable is continuous. However, many interesting real world domains demand regression tools. This may be a serious drawback of ML techniques in a data mining context. In this paper we present and evaluate a pre-processing method that extends the applicability of classification systems to problems with continuous values of the goal variable, by discretising these values into a set of intervals. It is a common practice in statistical data analysis to group the observed values of a continuous variable into class intervals and work with this grouped data [2]. The choice of these intervals
is a critical issue as too many intervals impair the comprehensibility of the models and too few
hide important features of the variable distribution. The methods we propose provide the means to transform the observed values of the goal variable into a set of intervals. These
intervals may be considered values of an ordinal variable (i.e. discrete values with an implicit
ordering among them). Classification systems deal with discrete (or nominal) target variables.
They are not able to take advantage of the given ordering. We propose a second step whose
objective is to overcome this difficulty. We use misclassification costs which are carefully
chosen to reflect the ordering of the intervals, as a means to compensate for the information loss caused by the discretisation. We considered several alternative methods of building the set of intervals. Initial experiments revealed that there was no clear winner among them. This fact led us to try a search-based approach [15] to the task of finding an adequate set of intervals.
We have implemented our method in a system called RECLA. We can look at our
system as a kind of pre-processing tool that transforms the regression problem into a
classification one before feeding it into a classification system. We have tested RECLA in
several regression domains with three different classification systems: C4.5 [12], CN2 [3],
and a linear discriminant [4, 6]. The results of our experiments show the validity of our
search-based approach and the gains in accuracy obtained by adding misclassification costs to
classification algorithms.
In the next section we outline the steps necessary to use classification algorithms in
regression problems. Section 3 describes the method we use for discretising the values of a continuous goal variable. Section 4 presents the misclassification costs we use to improve the accuracy of our models. The experimental evaluation of our proposals is given in
Section 5. Finally we relate our work to others and present the main conclusions.
2 Using Classification Algorithms in Regression Problems
Using a classification algorithm on a regression problem involves several transformation steps. The most important consists of pre-processing the given training data
so that the classification system is able to learn from it. This can be achieved by discretising
the continuous target variable into a set of intervals. Each interval can be used as a class label. A second step is then necessary to obtain numeric predictions from the resulting learned “theory”. The model learned by the
classification system describes a set of concepts. In our case these concepts (or classes) are
the intervals obtained from the original goal variable. When using the learned theory to make
a prediction, the classification algorithm will output one of these classes (an interval) as its
prediction. The question is how to assess the regression accuracy of these “predictions”.
Regression accuracy is usually measured as a function of the numeric distance between the
actual and the predicted values. We thus need a number that somehow represents each
interval. The natural choice for this value is to use a statistic of centrality that summarises the
values of the training instances within each interval. We use the median instead of the more common mean because of its robustness to skewed distributions and outliers.
Summarising, our proposal consists of discretising the continuous values of the goal variable into a set of intervals and taking the medians of these intervals as the class labels for
obtaining a discrete version of the regression problem. The RECLA system uses this strategy to deal with a regression problem using a classification system. The system architecture is designed so that different classification systems can be coupled with RECLA. This may involve some coding effort in the learning system's interface module. This effort should not be high as long as the target classification system works in a fairly standard way. In effect, the only coding that is usually necessary is related to the different data set formats.
3 Discretising a Continuous Goal Variable
The main task that enables the use of classification algorithms in regression problems is the
transformation of a set of continuous values into a set of intervals. Two main questions arise
when performing this task. How many intervals to build and how to define the boundaries of
these intervals. The number of intervals will have a direct effect on both the accuracy and the
interpretability of the resulting learned models. We argue that this decision is strongly
dependent on the target classification system. In effect, deciding how many intervals to build
is equivalent to deciding how many classes to use. This will change the class frequencies as
the number of training samples remains constant. Different class frequencies may affect
differently the classification algorithms due to the way they use the training data. This observation led us to make the number of intervals a target of the search procedure described later.
3.1 Splitting Methods
In this section we address the question of how to divide a set of continuous values into a set of intervals. We provide three alternative splitting methods:
• Equally probable intervals (EP): creates N intervals, each containing approximately the same number of elements. It can be said that this method has its focus on class frequencies, and that it makes the assumption that equal class frequencies are the best for a classification problem.
• Equal width intervals (EW): The range of values is divided into N equal width intervals.
• K-means clustering (KM): In this method the goal is to build N intervals that minimise the sum of the distances of each element of an interval to its gravity center [4]. This method starts with the EW approximation and then moves the elements of each interval to contiguous intervals whenever these changes reduce the referred sum of distances. This is the most sophisticated method and it seems the most coherent with the way we make predictions with the learned model. In effect, as we use the median of each interval for making a prediction, minimising the distances to the interval centers should also tend to minimise the prediction error.
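The three splitting strategies might be sketched as follows. This is illustrative Python, not RECLA's actual code; the function names and the boundary-moving details of the KM variant are our assumptions.

```python
def ep_split(values, n):
    """Equally probable intervals (EP): each interval receives roughly
    the same number of elements."""
    vals = sorted(values)
    step = len(vals) / n
    groups = [vals[round(i * step):round((i + 1) * step)] for i in range(n)]
    return [g for g in groups if g]

def ew_split(values, n):
    """Equal width intervals (EW): the value range is cut into n
    equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    groups = [[] for _ in range(n)]
    for v in sorted(values):
        i = min(int((v - lo) / width), n - 1)
        groups[i].append(v)
    return [g for g in groups if g]

def _cost(g):
    """Sum of distances of an interval's elements to its gravity center."""
    c = sum(g) / len(g)
    return sum(abs(v - c) for v in g)

def km_split(values, n):
    """K-means clustering (KM): start from EW and move boundary elements
    between contiguous intervals while the total distance to the interval
    centers decreases."""
    groups = ew_split(values, n)
    improved = True
    while improved:
        improved = False
        for i in range(len(groups) - 1):
            a, b = groups[i], groups[i + 1]
            # try moving the boundary element left-to-right, then right-to-left
            if len(a) > 1:
                na, nb = a[:-1], [a[-1]] + b
                if _cost(na) + _cost(nb) < _cost(a) + _cost(b):
                    groups[i], groups[i + 1] = na, nb
                    improved = True
                    continue
            if len(b) > 1:
                na, nb = a + [b[0]], b[1:]
                if _cost(na) + _cost(nb) < _cost(a) + _cost(b):
                    groups[i], groups[i + 1] = na, nb
                    improved = True
    return groups
```

Each accepted move strictly reduces the total cost, so the KM loop always terminates.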
We present a simple example to better illustrate these methods. We use the Servo data set2
and we assume that the best number of intervals is 4. In the following figure we have divided
the original values into 20 equal width intervals to obtain a kind of histogram that illustrates the distribution of the goal variable values:
[Figure: histogram of the Servo goal variable; y-axis: No of obs (0 to 80), x-axis: goal values from 0 to 7.5 in steps of 0.5.]
2 In the appendix we provide details of the data sets used throughout the paper.
Using each of the three splitting strategies to obtain 4 intervals we get the following:
Table 2 presents the class frequencies resulting from each method decision. We also give the
sum of the deviations of the class label from the real example values as well as the average of
these deviations. At the bottom of the table we give totals for these two statistics.
The resulting class frequencies are quite different, namely with the EW method. Knowing
which solution is better involves understanding the sources of the error made by the learned models.
Given a query instance, the theory obtained by a classification algorithm will predict a
class label. This label is the median of an interval of the original range of values of the goal
variable. If the testing instance also belongs to the same interval this would mean that the
classification system predicted the correct class. However, this does not mean that we have the correct prediction in terms of regression. In effect, this predicted label can be “far” from the true value being predicted. Thus high classification accuracy does not necessarily correspond to high regression accuracy. The latter is clearly damaged if few classes are used. However, if
more classes are introduced the class frequencies will start to decrease which will most
probably damage the classification accuracy. In order to observe the interaction between
these two types of errors when the number of classes is increased we have conducted a simple
experiment. Using a permutation of the Housing data set we have set the first 70% examples
as our training set and the remaining as testing set. Using C4.5 as learning engine we have
varied the number of classes from one to one hundred, collecting two types of error for each
trial. The first is the overall prediction error obtained by the resulting model on the testing set.
The second is the error rate of the same model (i.e. the percentage of classification errors
made by the model). For instance, if a testing instance has a Y value of 35 belonging to the
interval 25..40 with median 32, and C4.5 predicts class 57, we would count this as a classification error (label 57 is different from label 32), and would add the value 22 (= |57 − 35|) to the overall prediction error. In the following graph we plot two lines as a function of the
number of classes: the overall prediction error in the testing set (lines with label terminating
in “-P”); and the error rate (percentage of classification errors) in the same data (lines with
label ending in “-C”). We present these two lines for each of the described splitting methods.
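The per-instance error accounting used in this experiment can be reproduced on the worked example above (true value 35, its interval's median label 32, predicted label 57). The function name is illustrative:

```python
def account(true_value, true_label, predicted_label):
    """Count one test case for both error measures: a classification error
    if the predicted label differs from the label of the true interval, and
    the absolute deviation of the predicted label from the true value."""
    classification_error = int(predicted_label != true_label)
    regression_error = abs(predicted_label - true_value)
    return classification_error, regression_error

# the example from the text: Y = 35, interval 25..40 with median 32,
# C4.5 predicts class 57
print(account(35, 32, 57))  # -> (1, 22)
```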
[Figure 3: overall regression error (left axis, 0 to 7, lines labelled “-P”) and classification error rate (right axis, 0% to 100%, lines labelled “-C”) as a function of the number of classes (1 to 97), for each splitting method.]
The first interesting observation is that the regression error quickly decreases and then stays
more or less constant although the error rate steadily increases. A possible explanation for the
constant increase of error rate is the fact that class frequencies start to become unreliable with
large number of classes. The interesting fact is that this does not seem to be affecting
regression accuracy. The reason is that although the number of errors increases this does not
mean that they are larger in metric terms. In effect they should tend to be smaller as the class
medians get nearer and nearer when more classes are introduced. This explains why
regression accuracy does not follow the error rate tendency. We have repeated this experiment with other data sets and the overall picture was the same.
These observations indicate that in order to minimise the regression error we should look at the types of classification errors and not at their number. We should ensure that the absolute difference between the predicted class and the true class is as small as possible. In Section 4 we will present a methodology that aims at minimising the absolute differences of the misclassifications. The experiment also suggests that, from the point of view of model comprehensibility, it is not worthwhile to try larger numbers of classes, as the accuracy gains do not compensate for the complexity increase. In the following section we describe how RECLA “walks” through the search space of “number of classes” and the guidance criteria it uses for this search.
3.2 Searching for the Number of Intervals
The splitting methods described in the previous section assumed that the number of intervals
was known. This section addresses the question of how to discover this number. We use a
wrapper [8, 9] approach as general search methodology. The number of intervals (i.e. the
number of classes) will have a direct effect on accuracy so it can be seen as a parameter of the
learning algorithm. Our goal is to set the value of this “parameter” such that the system
accuracy is optimised. As the number of ways of dividing a set of values into a set of intervals
is too large, a heuristic search algorithm is necessary. The wrapper approach is a well-known
strategy that has been mainly used for feature subset selection [8] and parameter estimation
[9]. The use of this iterative approach to estimate a parameter of a learning algorithm can be seen as a tuning loop wrapped around the learner.
The components of this loop are the elements that perform the tuning of the target
parameters. The two main components of the wrapper approach are the way new parameter
settings are generated and how their results are evaluated in the context of the target learning
algorithm. The basic idea is that of an iterative search procedure where different parameter
settings are tried and the setting that gives the best estimated accuracy is returned as the
result of the wrapper. This best setting will then be used by the learning algorithm in the real
evaluation using an independent test set. In our case this will correspond to getting the “best”
estimated number of intervals that will then be used to split the original continuous goal
values.
With respect to the search component we use a hill-climbing algorithm coupled with a
settable look-ahead parameter to minimise the well-known problem of local minima. Given a
tentative solution and the respective evaluation the search component is responsible for
generating a new trial. We provide the following two alternative search operators:
• Varying the number of intervals (VNI): This simple alternative consists of incrementing the number of intervals by a fixed step in each trial.
• Incrementally improving the number of intervals (INI): The idea of this alternative is to
try to improve the previous set of intervals taking into account their individual evaluation.
For each trial we evaluate not only the overall result obtained by the algorithm but also the
error committed by each of the classes (intervals). The next set of intervals is built using
the median of these individual class error estimates. All intervals whose error is above the
median are further split. All the other intervals remain unchanged. This method provides a kind of hierarchical interval structure for the goal variable, which can also be considered useful information about the distribution of its values.
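One INI trial might be sketched as follows. This is a simplified illustration, not RECLA's code; in particular, the way an interval is split in two here (by element count) is our assumption.

```python
import statistics

def ini_step(intervals, errors):
    """One INI trial: every interval whose estimated error is at or above
    the median of the per-interval error estimates is split in two; all
    other intervals are kept unchanged."""
    med = statistics.median(errors)
    result = []
    for vals, err in zip(intervals, errors):
        if err >= med and len(vals) > 1:
            vals = sorted(vals)
            half = len(vals) // 2
            result.append(vals[:half])   # split the "bad" interval...
            result.append(vals[half:])   # ...into two sub-intervals
        else:
            result.append(vals)          # keep the "good" interval as is
    return result
```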
The search algorithm of the wrapper used by RECLA can be generally described by:
Algorithm 1. The Search Component Algorithm of the Wrapper
Failures = 0
DO
  Generate New Trial
  Evaluate New Trial
  IF Failed Trial THEN
    Failures = Failures + 1
  ELSE
    IF Better than Best Trial THEN
      Best Trial = New Trial
    ENDIF
    Failures = 0
  ENDIF
UNTIL Failures >= Allowed Failures
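Algorithm 1 can be rendered in Python roughly as below. Here a trial fails when it does not improve on the previous trial; the function names and the toy error estimates are ours.

```python
def wrapper_search(next_trial, evaluate, allowed_failures=1):
    """Hill-climbing loop of Algorithm 1. `next_trial` generates a new
    candidate (e.g. a number of intervals); `evaluate` returns its
    estimated regression error (lower is better). `allowed_failures`
    is the look-ahead parameter."""
    best_trial, best_err = None, float("inf")
    prev_err = float("inf")
    failures = 0
    while failures < allowed_failures:
        trial = next_trial()
        err = evaluate(trial)
        if err >= prev_err:            # failed trial
            failures += 1
        else:
            if err < best_err:         # better than best trial
                best_trial, best_err = trial, err
            failures = 0
        prev_err = err
    return best_trial, best_err

# toy run: candidate class counts 2, 4, 6, ... with made-up CV error estimates
candidates = iter([2, 4, 6, 8])
errors = {2: 5.0, 4: 4.0, 6: 6.0, 8: 7.0}
print(wrapper_search(lambda: next(candidates), errors.get))  # -> (4, 4.0)
```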
There are two factors that control the termination of this algorithm. One is the number of
allowed failed trials (the look-ahead parameter mentioned above). The other is the notion of
failed trial. One way of defining this concept would be to state that if a trial is worse than the previous one then it is a failure. We add a further degree of flexibility by defining the percentage gain of a trial as

PG = (Eval(Ti-1) − Eval(Ti)) / Eval(Ti-1)    (1)

where Ti and Ti-1 are the current and previous trials, respectively, and Eval(.) is the evaluation of a trial (its estimated regression error).
If the value of PG is below a certain threshold we consider the trial a failure even if its estimated error is lower than that of the previous trial. The main motivation for this is that each trial is adding a further degree of complexity to the learned model and, as we have seen in Figure 3, the accuracy gains obtained by adding more classes quickly become marginal.
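The failure criterion based on PG can be sketched as follows. The threshold value used here is purely illustrative, not RECLA's actual setting.

```python
def failed_trial(prev_err, curr_err, threshold=0.02):
    """A trial fails unless its percentage gain over the previous trial,
    PG = (Eval(Ti-1) - Eval(Ti)) / Eval(Ti-1), reaches the threshold.
    The 0.02 default is an assumed value for illustration."""
    pg = (prev_err - curr_err) / prev_err
    return pg < threshold

print(failed_trial(10.0, 9.0))   # PG = 0.10 -> False (a real improvement)
print(failed_trial(10.0, 9.9))   # PG = 0.01 -> True  (gain too small)
```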
The other important component of the wrapper approach is the evaluation strategy. We
use an N-fold Cross Validation [14] estimation technique, which is well-known for its reliable estimates of prediction error. This means that each time a new tentative set of intervals is generated, RECLA uses an internal N-fold Cross Validation (CV) process to evaluate it. In the next subsection we provide a small example of a discretisation process to better illustrate the whole procedure.
In this example we use the Auto-Mpg data set and C4.5 as learning engine. We have
performed two experiments with the two different search operators. Table 3 presents a trace
of RECLA discretisation trials using the VNI search operator. The first column shows the
number of intervals tried in each iteration of the wrapper. The fact that it starts with 2 and goes in increments of 2 is just an adjustable parameter of RECLA. The second column shows the
obtained intervals using one of the splitting strategies described in Section 3.1. The second
line of this column includes the corresponding medians of the intervals (the used class labels).
The last column gives the wrapper 5-fold CV error estimate of the trial. In this example we
have used the value 1 for the look-ahead parameter mentioned before, and all error estimates were obtained by the internal cross validation. This means that as soon as the next trial is worse than the previous one the search process stops. The solution using this operator is 6 intervals (the trial with the best estimated error).
In the second experiment we use the INI search operator. The results are given in Table 4
using a similar format as in the previous table. We also include the estimated error of each
interval (the value in parenthesis). Each next trial is dependent on these individual estimates.
The intervals whose error is greater than or equal to the median of these estimates are split into two intervals. For instance, in the third trial (5 intervals) we can observe that the last interval ([29.9..46.6]) was maintained from the second trial, while the others were obtained by splitting a previous interval.
The two methods obtain different solutions for grouping the values. In this example the INI
alternative leads to lower estimated error and consequently would be preferred by RECLA.
4 Using Misclassification Costs
In this section we describe a method that tries to decrease one of the causes of the errors
made by RECLA. As mentioned in Section 3.1, part of the overall prediction error made by
RECLA is caused by the averaging effect of the discretisation process. The other cause of
error is the fact that the classification system predicts the wrong class (interval). The method
described below tries to minimise the effect of these misclassifications by “preferring” errors between intervals that are nearer to each other.
Classification systems search for theories that have minimal estimated prediction error
according to a 0/1 loss function, under which all errors are equally important. In regression, the error
is a function of the difference between the observed and predicted values (i.e. errors are
metric). Accuracy in regression is dependent on the amplitude of the error. In our
experiments we use the Mean Absolute Error (MAE) as regression accuracy measure:

MAE = (1/N) × Σ |yi − y'i|    (2)

where yi is the real value of testing instance i, y'i is the model prediction, and N is the number of testing instances.
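Eq. 2 translates directly into code:

```python
def mae(y_true, y_pred):
    """Mean Absolute Error (Eq. 2): average absolute deviation between
    true values and model predictions."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

print(mae([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))  # -> 1.0
```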
This means that in terms of regression accuracy the kind of error made by a theory is not irrelevant. Two different misclassifications can have a different effect on the overall regression error, depending on the distance between the predicted “class” and the value being predicted. This observation led us to incorporate misclassification costs in the prediction procedure. If we take ci,j as the cost of classifying a class j instance as class i, and if we take p(j|x) as the probability given by our classifier that instance x is of class j, the task of classifying instance x reduces to finding the class i that minimises the expression
Σ j ∈ {classes} ci,j × p(j|x)    (3)
The question is how to define the misclassification costs. We propose to estimate the cost of
misclassifying two intervals using the absolute difference between their representatives, i.e.
ci,j = |ỹi − ỹj|    (4)

where ỹi is the median of the values that were “discretised” into interval i.
By proceeding this way we ensure that the system predictions minimise the expected absolute distance to the true values.
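Eqs. 3 and 4 combine into a small cost-sensitive decision rule. This is a sketch with made-up probabilities; note how the cost-sensitive decision can differ from simply picking the most probable class.

```python
def cost_matrix(medians):
    """Misclassification costs of Eq. 4: c[i][j] = |median_i - median_j|."""
    return [[abs(mi - mj) for mj in medians] for mi in medians]

def predict_with_costs(class_probs, medians):
    """Eq. 3: choose the class i minimising sum_j c[i][j] * p(j|x)."""
    c = cost_matrix(medians)
    expected = [sum(cij * p for cij, p in zip(row, class_probs)) for row in c]
    return min(range(len(medians)), key=expected.__getitem__)

# with these (made-up) probabilities the most probable class is 2, but the
# cost-sensitive decision prefers the "middle" class 1 instead
probs, medians = [0.35, 0.25, 0.40], [10.0, 20.0, 40.0]
print(predict_with_costs(probs, medians))  # -> 1
```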
A drawback of this proposal is that not all classification systems are prepared to use
information regarding misclassification costs. To use this information the systems need to be
able to compute class probability estimates for each given testing instance. With respect to the
systems we have used with RECLA only CN2 did not provide an easy way of obtaining this
information. The Linear Discriminant we used was already prepared to work with misclassification costs. With C4.5 it was necessary to write a program that uses the class probability estimates of the trees learned by C4.5. This means that although the “standard” C4.5 is not
able to use misclassification costs, the version used within RECLA is able to use them.
We have carried out a series of experiments with these two classification systems on our benchmark data sets to assess the benefits of using misclassification costs. We have done these experiments with all combinations of search operators and splitting methods available within RECLA. The tables presented below give the mean absolute error, estimated by 10-fold Cross Validation, of each discretisation method with and without misclassification costs. The best result is highlighted in grey, and the cases where the difference is statistically significant are also indicated.
Our experiments show a clear advantage of using misclassification costs. This advantage is
more evident with the Linear Discriminant. A possible explanation for the less significant C4.5
results is the fact that class probability estimates are obtained at the tree leaves. Decision tree algorithms try to discriminate as much as possible among classes, which means that in most tree leaves there is a big discrepancy among the class probabilities. As a consequence, the classification predicted by C4.5 is seldom changed by the incorporation of costs (see Eq. 3). This is not the case with the Linear Discriminant, where we do not have the recursive partitioning effect of trees; the class probabilities may thus be more similar, leading to more frequent changes in the predicted class when costs are used.
5 Experimental Evaluation
In this section we present the results of a set of experiments that we have carried out with
RECLA in our benchmark data sets whose details are given in the Appendix. For all
experiments the methodology used was the following. The initial data set was randomly permuted to eliminate any ordering effect. In all experiments we estimate the mean absolute error by a 10-fold Cross Validation.
Whenever paired comparisons are being carried out, all candidate methods are compared
using the same 10 train/test folds. We use paired t-Student tests for assessing the significance
of observed differences.
The method used by RECLA to discretise the goal variable of a regression problem depends
on two main issues as we have seen in Section 3 : the splitting method and the search
operator used for generating new trials. This leads to 6 different discretisation methods.
RECLA can use a specific method or try all of them and choose the one that gives the best estimated results. In the following table we let RECLA choose the “best” discretisation method and report the method chosen for each setup.
These results show a great variety of discretisation methods, depending on the problem set up. This provides empirical evidence for our search-based approach. Table 9 gives the corresponding estimated errors of RECLA with each learning engine.
The main conclusion of these experiments is that the choice of the best discretisation method
is clearly dependent on the problem set up. Moreover, we have observed that, given a data set and a classification algorithm, the differences among the results obtained using different discretisation methods can be significant.
In this section we present the results obtained by other regression methods on the same data sets
on which we have evaluated RECLA (see Table 9). The goal of these experiments is not to compare RECLA with these alternative methods. RECLA is not a learning system. As a pre-processing tool, the resulting accuracy is highly dependent on the classification system used afterwards as learning engine.
The first column of Table 10 presents the results of M5 [11, 13]. This regression system is
able to learn tree-based models with linear regression equations in the leaves (also known as
model trees). By default this system makes the prediction for each testing instance by
combining the prediction of a model tree with a 3-nearest neighbour [13]. In the second
column we give the result when this combination is disabled thus using only model trees. The
third column of the table gives the results obtained by a standard 3-nearest neighbour
algorithm. The fourth column shows the results using a least squares linear regression model.
We then have the performance of a regression tree, and finally the results obtained with Swap1R [17]. This latter system learns a set of regression rules after discretising the target variable.
Below we present a table that summarises the wins and losses of RECLA (with each of the classification systems) compared to the other regression methods. We use the versions with costs for C4.5 and Discrim. In parentheses we indicate the number of statistically significant differences.
These results show that, as expected, there is an accuracy penalty to pay for the discretisation process. This effect can be particularly significant when compared to sophisticated methods like M5, which combines the predictions of different regression models. The averaging
effect of the discretisation of the target variable damages regression accuracy. However, the
same kind of averaging is done by standard regression trees and the usual argument for their
use is the interpretability of their models. The same argument can be applied to RECLA with
either C4.5 or CN2. It is interesting to notice that RECLA with C4.5 is quite competitive with the standard regression tree.
It is clear from the experiments we have carried out that the learning engine used can
originate significant differences in terms of regression accuracy. This can be confirmed when
looking at the Swap1R results. This system deals with regression using the same process of
transforming it into a classification problem. It uses an algorithm called P-class that splits the
continuous values into a set of K intervals. This algorithm is basically the same as K-means
(KM). Swap1R asks for the number of classes (intervals) to use3, although the authors
suggest that this number could be found by cross validation [17]. As the discretisation method
is equal to one of the methods provided by RECLA, the better results of Swap1R can only be caused by its classification learning algorithm. This means that the results obtained by RECLA could most probably be improved by using other learning engines.
6 Related Work
Mapping regression into classification was first proposed in Weiss and Indurkhya’s work [16,
17]. These authors incorporate the mapping within their regression system. They use an
algorithm called P-class which is basically the same as our KM method. Compared to this work, we added other alternative discretisation methods and empirically demonstrated the advantages of a search-based approach. Moreover, by clearly separating the discretisation process from the learning algorithm, we extended this approach to other
systems. Finally, we have introduced the use of misclassification costs to overcome the inadequacy of classification systems to deal with ordinal target variables. This originated a clear gain in regression accuracy in our experiments.
3 In these experiments we have always used 5 classes, following a suggestion of one of the authors of Swap1R (Nitin Indurkhya).
The vast research area on continuous attribute discretisation usually proceeds by trying to maximise the mutual information between the resulting discrete attribute and the classes [5]. This strategy is applicable only when the classes are given. Ours is a different problem, as the classes (intervals) are exactly what we are looking for.
7 Conclusions
The method described in this paper enables the use of classification systems on regression
tasks. The significance of this work is two-fold. First, we have managed to extend the applicability of existing classification systems to regression domains, obtaining comprehensible models. Our method also provides a better insight into the structure of the target variable by dividing its values into significant intervals, which extends our understanding of the domain. Second, we have introduced misclassification costs that reflect the ordering of the intervals, which provide a better theoretical justification for using classification systems on regression
tasks. We have used a search-based approach which is justified by our experimental results
which show that the best discretisation is often dependent on both the domain and the
induction tool.
Our proposals were implemented in a system called RECLA which we have applied in
conjunction with three different classification systems. These systems are quite different from
each other, which again provides evidence for the high generality of our methods. The system is easily extendible to other classification algorithms, thus being a useful tool for the users of such systems.
Finally, we have compared the results obtained by RECLA, using the three learning
engines, to other standard regression methods. The results of these comparisons show that
although RECLA can be competitive with some algorithms, it still has lower accuracy than
some state-of-the-art regression systems. These results are obviously dependent on the
learning engine used, as our experiments have shown. The comparison with Swap1R, which uses a similar mapping strategy, reveals that better regression accuracy is achievable with other learning engines.
Acknowledgements
Thanks are due to the authors of the classification systems we have used in our experiments
for providing them to us.
References
Appendix
Most of the data sets we have used were obtained from the UCI Machine Learning
Repository [http://www.ics.uci.edu/MLRepository.html]. The main characteristics of the used
domains as well as eventual modifications made to the original databases are described
below:
• Housing - this data set contains 506 instances described by 13 continuous input variables.
The goal consists of predicting the housing values in suburbs of Boston.
• Auto (Auto-Mpg database) - 398 instances described by 3 nominal and 4 continuous
variables. The target variable is the fuel consumption (miles per gallon).
• Servo - 167 instances; 4 nominal attributes.
• Machine (Computer Hardware database) - 209 instances; 6 continuous attributes. The
goal is to predict the cpu relative performance based on other computer characteristics.
• Price (Automobile database) - 159 cases; 16 continuous attributes. This data set is built
from the Automobile database by removing all instances with unknown values from the
original 205 cases. Nominal attributes were also removed. The goal is to predict the car
prices based on other characteristics.
• Imports (Automobile database) - based on the same database we have built a different data
set consisting of 164 instances described by 11 nominal attributes and 14 continuous
variables. From the original data we only removed the cases with unknown value on the
attribute “normalized-losses”. This attribute describes the car insurance normalized losses.
This variable was taken as the predicting goal.
• Wbc (Wisconsin Breast Cancer databases) - predicting recurrence time in 194 breast
cancer cases (4 instances with unknowns removed); 32 continuous attributes.
• Gate (non-UCI data set) - 300 instances; 10 continuous variables. The problem consists of
predicting the time to collapse of an electrical network based on some monitoring variable
values.