A Neural Network Approach To Ordinal Regression
Abstract— Ordinal regression is an important type of learning, which has properties of both classification and regression. Here we describe an effective approach to adapt a traditional neural network to learn ordinal categories. Our approach is a generalization of the perceptron method for ordinal regression. On several benchmark datasets, our method (NNRank) outperforms a neural network classification method. Compared with the ordinal regression methods using Gaussian processes and support vector machines, NNRank achieves comparable performance. Moreover, NNRank has the advantages of traditional neural networks: learning in both online and batch modes, handling very large training datasets, and making rapid predictions. These features make NNRank a useful and complementary tool for large-scale data mining tasks such as information retrieval, web page ranking, collaborative filtering, and protein ranking in Bioinformatics. The neural network software is available at: http://www.cs.missouri.edu/~chengji/cheng_software.html.

Jianlin Cheng and Zheng Wang are with the Computer Science Department and Informatics Institute, University of Missouri, Columbia, MO 65211, USA (email: chengji@missouri.edu, zwyw6@missouri.edu); Gianluca Pollastri is with the School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland (email: gianluca.pollastri@ucd.ie).

I. INTRODUCTION

Ordinal regression (or ranking learning) is an important supervised learning problem of learning a ranking or ordering on instances, which has properties of both classification and metric regression. The learning task of ordinal regression is to assign data points to a set of finite, ordered categories. For example, a teacher rates students' performance using A, B, C, D, and E (A > B > C > D > E) [9]. Ordinal regression differs from classification due to the order of the categories. In contrast to metric regression, the response variables (categories) in ordinal regression are discrete and finite.

The research on ordinal regression dates back to ordinal statistics methods in the 1980s [28], [29] and machine learning research in the 1990s [7], [20], [13]. It has attracted considerable attention in recent years due to its potential applications in many data-intensive domains such as information retrieval [20], web page ranking [24], collaborative filtering [18], [3], [41], image retrieval [40], and protein ranking [8] in Bioinformatics.

A number of machine learning methods have been developed or redesigned to address the ordinal regression problem [33], including the perceptron [14] and its kernelized generalization [3], neural networks with gradient descent [7], [5], Gaussian processes [10], [9], [37], large-margin classifiers (or support vector machines) [21], [22], [24], [38], [11], [2], [12], k-partite classifiers [1], boosting algorithms [17], [15], constraint classification [19], regression trees [25], Naive Bayes [42], Bayesian hierarchical experts [32], the binary classification approach [16], [26] that decomposes the original ordinal regression problem into a set of binary classifications, and the optimization of nonsmooth cost functions [6].

Most of these methods can be roughly classified into two categories: the pairwise constraint approach [22], [24], [15], [5] and the multi-threshold approach [14], [38], [9]. The former converts the full ranking relation into pairwise order constraints. The latter tries to learn multiple thresholds to divide data into ordinal categories. Multi-threshold approaches can also be unified under the general, extended binary classification framework [26].

The ordinal regression methods have different advantages and disadvantages. Prank [14], a perceptron approach that generalizes the binary perceptron algorithm to the ordinal multi-class situation, is a fast online algorithm. However, like a standard perceptron method, its accuracy suffers when dealing with non-linear data, although a quadratic kernel version of Prank greatly relieves this problem. One class of accurate large-margin classifier approaches [22], [24] converts the ordinal relations into O(n^2) (n: the number of data points) pairwise ranking constraints for structural risk minimization [39], [36]. Thus, it cannot be applied to medium-size datasets (> 10,000 data points) without discarding some pairwise preference relations. It may also overfit noise due to incomparable pairs.

The other class of powerful large-margin classifier methods [38], [11] generalizes the support vector formulation for ordinal regression by finding K − 1 thresholds on the real line that divide the data into K ordered categories. The size of this optimization problem is linear in the number of training examples. However, as with support vector machines used for classification, the prediction speed is slow when the solution is not sparse, which makes it inappropriate for time-critical tasks. Similarly, another state-of-the-art approach, the Gaussian process method [9], also has difficulty handling large training datasets and suffers from slow prediction in some situations.

Here we describe a new neural network approach for ordinal regression that has the advantages of neural network learning: learning in both online and batch modes, training on very large datasets [5], handling non-linear data, good performance, and rapid prediction. Our method can be considered a generalization of perceptron learning [14] to multi-layer perceptrons (neural networks) for ordinal regression. Our method is also related to the classic generalized linear models (e.g., the cumulative logit model) for ordinal regression [28]. Unlike the neural network method [5] trained on pairs of examples to learn pairwise order relations, our method works on individual data points and uses multiple output nodes to
estimate the probabilities of ordinal categories. Thus, our method falls into the category of multi-threshold approaches. The learning of our method proceeds similarly to traditional neural networks using back-propagation [35].

On the same benchmark datasets, our method yields performance better than standard classification neural networks and comparable to the state-of-the-art methods using support vector machines and Gaussian processes. In addition, our method can learn on very large datasets and make rapid predictions.

II. METHOD

A. Formulation

Let D represent an ordinal regression dataset consisting of n data points (x, y), where x ∈ R^d is an input feature vector and y is its ordinal category from a finite set Y. Without loss of generality, we assume that Y = {1, 2, ..., K} with "<" as the order relation.

For a standard classification neural network that does not consider the order of categories, the goal is to predict the probability of a data point x belonging to one category k (y = k). The input is x, and the target encoding of category k is a vector t = (0, ..., 0, 1, 0, ..., 0), where only the element t_k is set to 1 and all others to 0. The goal is to learn a function that maps the input vector x to a probability distribution vector o = (o_1, o_2, ..., o_k, ..., o_K), where o_k is close to 1 and the other K − 1 elements are close to zero, subject to the constraint $\sum_{i=1}^{K} o_i = 1$.

In contrast, like the perceptron approach [14], our neural network approach considers the order of the categories. If a data point x belongs to category k, it is classified automatically into the lower-order categories (1, 2, ..., k − 1) as well. So the target vector of x is t = (1, 1, ..., 1, 0, ..., 0), where t_i (1 ≤ i ≤ k) is set to 1 and the other elements to 0, as shown in Figure 1. Thus, the goal is to learn a function that maps the input vector x to a probability vector o = (o_1, o_2, ..., o_k, ..., o_K), where o_i (i ≤ k) is close to 1 and o_i (i > k) is close to 0. Here $\sum_{i=1}^{K} o_i$ is an estimate of the number of categories (i.e., k) that x belongs to, instead of 1. The formulation of the target vector is similar to the perceptron approach [14]. It is also related to the classical cumulative probit model for ordinal regression [28], in the sense that we can consider the output probability vector (o_1, ..., o_k, ..., o_K) as a cumulative probability distribution over the categories (1, ..., k, ..., K), i.e., $\frac{1}{K}\sum_{i=1}^{K} o_i$ is the proportion of categories that x belongs to, starting from category 1.

The target encoding scheme of our method is related to, but different from, multi-label learning [4] and multiple-label learning [23] because our method imposes an order on the labels (or categories).

The neural network has d input nodes, matching the number of dimensions of the input feature vector x, and K output nodes corresponding to the K ordinal categories. There can be one or more hidden layers. Without loss of generality, we use one hidden layer to construct a standard two-layer feedforward neural network. As in a standard neural network for classification, input nodes are fully connected with hidden nodes, which in turn are fully connected with output nodes. Likewise, the transfer function of the hidden nodes can be a linear, sigmoid, or tanh function; the tanh function is used in our experiments. The only difference from a traditional neural network lies in the output layer. Traditional neural networks use the softmax (normalized exponential) function $\frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$ for the output nodes, satisfying the constraint that the sum of the outputs $\sum_{i=1}^{K} o_i$ is 1, where $z_i$ is the net input to output node $O_i$.

In contrast, each output node $O_i$ of our neural network uses the standard sigmoid function $\frac{1}{1+e^{-z_i}}$, without including the outputs of the other nodes, as shown in Figure 1. Output node $O_i$ estimates the probability $o_i$ that a data point belongs to category i independently, without the normalization used by traditional neural networks. Thus, for a data point x of category k, the target vector is (1, 1, ..., 1, 0, ..., 0), in which the first k elements are 1 and the others 0. This sets the target value of output nodes $O_i$ (i ≤ k) to 1 and $O_i$ (i > k) to 0. The targets instruct the neural network to adjust its weights to produce probability outputs as close as possible to the target vector. It is worth pointing out that using independent sigmoid functions for the output nodes does not guarantee the monotonic relation ($o_1 \geq o_2 \geq \dots \geq o_K$), which is not necessary but is desirable for making predictions [26]. A more sophisticated approach would impose these inequality constraints on the outputs to improve performance.

Training of the neural network for ordinal regression proceeds very similarly to standard neural networks. The cost function for a data point x can be the relative entropy or the square error between the target vector and the output vector. For relative entropy, the cost function over the output nodes is $f_c = -\sum_{i=1}^{K}\left[t_i \log o_i + (1-t_i)\log(1-o_i)\right]$. For square error, the error function is $f_c = \sum_{i=1}^{K}(t_i - o_i)^2$. Previous studies [34] of neural network cost functions show that relative entropy and square error usually yield very similar results. In our experiments, we use the square error function and standard back-propagation to train the neural network. The errors are propagated back to the output nodes, from the output nodes to the hidden nodes, and finally to the input nodes.

Since the transfer function $f_t$ of output node $O_i$ is the independent sigmoid function $\frac{1}{1+e^{-z_i}}$, its derivative is $\frac{\partial f_t}{\partial z_i} = \frac{e^{-z_i}}{(1+e^{-z_i})^2} = \frac{1}{1+e^{-z_i}}\left(1 - \frac{1}{1+e^{-z_i}}\right)$.
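To make the formulation concrete, here is a minimal NumPy sketch (not the released NNRank code) of the ordinal target encoding, a two-layer forward pass with a tanh hidden layer and independent sigmoid output nodes, the square-error cost, and the output-layer gradient that follows from the sigmoid derivative above. The `decode` rule, which counts the leading outputs above 0.5, is an assumption added for illustration; the exact prediction rule is not shown in this excerpt.

```python
import numpy as np

def encode_ordinal_target(k, K):
    """Target vector for category k (1..K): the first k elements are 1, the rest 0."""
    t = np.zeros(K)
    t[:k] = 1.0
    return t

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Two-layer feedforward pass: tanh hidden layer, independent sigmoid outputs."""
    h = np.tanh(W1 @ x + b1)   # hidden activations
    z = W2 @ h + b2            # net inputs z_i of the K output nodes
    o = sigmoid(z)             # each output squashed independently (no softmax)
    return h, z, o

def square_error_and_output_delta(o, t):
    """f_c = sum_i (t_i - o_i)^2 and its gradient w.r.t. the net inputs z_i,
    using sigmoid'(z_i) = o_i * (1 - o_i) as in the text."""
    cost = np.sum((t - o) ** 2)
    delta = -2.0 * (t - o) * o * (1.0 - o)
    return cost, delta

def decode(o, threshold=0.5):
    """Assumed decoding rule: count the leading outputs above the threshold."""
    k = 0
    for oi in o:
        if oi <= threshold:
            break
        k += 1
    return max(k, 1)

# Toy usage: d = 4 features, H = 3 hidden units, K = 5 ordered categories.
rng = np.random.default_rng(0)
d, H, K = 4, 3, 5
W1, b1 = rng.normal(scale=0.1, size=(H, d)), np.zeros(H)
W2, b2 = rng.normal(scale=0.1, size=(K, H)), np.zeros(K)
x, k_true = rng.normal(size=d), 3
t = encode_ordinal_target(k_true, K)   # -> [1. 1. 1. 0. 0.]
_, _, o = forward(x, W1, b1, W2, b2)
cost, delta = square_error_and_output_delta(o, t)
print(t, np.round(o, 3), round(cost, 3), decode(o))
```

With untrained random weights the outputs hover around 0.5, so the decoded category is not yet meaningful; training with back-propagation drives the first k outputs toward 1 and the rest toward 0.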
The main parameters to tune are the number of hidden units, the number of epochs, and the learning rate. We create a grid for these three parameters, where the hidden unit number is in the range [1..15], the epoch number is in the set (50, 200, 500, 1000, 1500, 2000), and the initial learning rate is in the range [0.01..0.5]. During training, the learning rate is halved if the training errors continuously go up for a pre-defined number (40, 60, 80, or 100) of epochs. For the experiments on each data split, the neural network parameters are fully optimized on the training data without using any test data.

For each experiment, after the parameters are optimized on the training data, we train five models on the training data with the optimal parameters, starting from different initial weights. The ensemble of five trained models is then used to estimate the generalization performance on the test data. That is, the average output of the five neural network models is used to make predictions (see the sketch below).
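Assuming a held-out portion of the training data is used to score parameter combinations (the text only states that no test data is used), the tuning-and-ensembling protocol could be sketched as follows. The trainer below is a deliberately minimal full-batch implementation written for illustration, not the released NNRank program, and the decoding rule inside `mean_abs_error` is the same threshold-counting assumption used earlier.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_network(X, T, hidden, epochs, lr, seed):
    """Minimal full-batch trainer for the ordinal network: tanh hidden layer,
    independent sigmoid outputs, square-error cost (illustration only)."""
    rng = np.random.default_rng(seed)
    d, K = X.shape[1], T.shape[1]
    W1, b1 = rng.normal(scale=0.1, size=(d, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(scale=0.1, size=(hidden, K)), np.zeros(K)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)
        O = sigmoid(H @ W2 + b2)
        dZ2 = -2.0 * (T - O) * O * (1.0 - O)      # output-layer delta
        dZ1 = (dZ2 @ W2.T) * (1.0 - H ** 2)       # back-propagated hidden-layer delta
        W2 -= lr * (H.T @ dZ2) / len(X); b2 -= lr * dZ2.mean(axis=0)
        W1 -= lr * (X.T @ dZ1) / len(X); b1 -= lr * dZ1.mean(axis=0)
    return W1, b1, W2, b2

def predict_proba(model, X):
    W1, b1, W2, b2 = model
    return sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2)

def mean_abs_error(O, y):
    """Decode each output vector by counting leading outputs above 0.5 (assumed rule)."""
    k_hat = np.maximum((O > 0.5).cumprod(axis=1).sum(axis=1), 1)
    return np.abs(k_hat - y).mean()

# Toy data with K = 5 categories; the targets follow the (1,...,1,0,...,0) encoding.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = rng.integers(1, 6, size=120)
T = (np.arange(1, 6) <= y[:, None]).astype(float)
X_tr, y_tr, T_tr = X[:90], y[:90], T[:90]   # training portion
X_va, y_va = X[90:], y[90:]                 # held-out validation portion (assumption)

# Small grid over hidden units, epochs, and initial learning rate (subset of the paper's grid).
best, best_err = None, np.inf
for h, e, lr in itertools.product((2, 5, 10), (50, 200), (0.05, 0.1)):
    err = mean_abs_error(predict_proba(train_network(X_tr, T_tr, h, e, lr, 0), X_va), y_va)
    if err < best_err:
        best, best_err = (h, e, lr), err

# Ensemble of five networks trained from different initial weights; outputs are averaged.
models = [train_network(X_tr, T_tr, *best, seed=s) for s in range(5)]
O_avg = np.mean([predict_proba(m, X_va) for m in models], axis=0)
print(best, round(mean_abs_error(O_avg, y_va), 3))
```

Because the toy labels are random, the numbers printed here only exercise the mechanics; on real data, the selected parameters and the five-model average are what produce the reported errors.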
Fig. 1. Comparison between a standard classification neural network and an ordinal regression neural network. Without loss of generality, the neural networks are assumed to have one hidden layer and one output layer with four output nodes. For a data point in category three, the target vector of the standard neural network is (0, 0, 1, 0), while the target vector of the ordinal regression neural network is (1, 1, 1, 0). The transfer function of output node i of the standard neural network is the normalized exponential function $\frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$. In contrast, the ordinal regression neural network uses the sigmoid function $\frac{1}{1+e^{-z_i}}$.

We evaluate our method using the zero-one error and the mean absolute error, as in [9]. The zero-one error is the percentage of wrong assignments of ordinal categories. The mean absolute error is the average absolute difference between the assigned categories ($\hat{k}$) and the true categories ($k$) over all data points. For each dataset, the training and evaluation process is repeated 20 times on 20 data splits. Thus, we compute the average error and the standard deviation of the two metrics as in [9].
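For reference, the two metrics reduce to a few lines given predicted and true category indices; this sketch uses the standard definitions (the zero-one error as the fraction of misassigned categories, and the mean absolute error as the average of $|\hat{k} - k|$).

```python
import numpy as np

def zero_one_error(k_pred, k_true):
    """Fraction of data points assigned to the wrong ordinal category."""
    k_pred, k_true = np.asarray(k_pred), np.asarray(k_true)
    return np.mean(k_pred != k_true)

def mean_absolute_error(k_pred, k_true):
    """Average absolute difference between assigned and true category indices."""
    k_pred, k_true = np.asarray(k_pred), np.asarray(k_true)
    return np.mean(np.abs(k_pred - k_true))

# Example: ten data points with categories in 1..5.
k_true = [1, 2, 3, 3, 4, 5, 2, 1, 4, 5]
k_pred = [1, 2, 2, 3, 4, 4, 2, 1, 5, 5]
print(zero_one_error(k_pred, k_true))        # 0.3
print(mean_absolute_error(k_pred, k_true))   # 0.3
```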
of the protein pairs. The category of each data point was assigned by biologists [31].

A data point representing a query-template protein pair is labeled as fold if the two proteins have similar tertiary structures but do not have an evolutionary relationship; super family if they have similar structures and a weak evolutionary relationship; and family if they have similar structures and a strong evolutionary relationship. Each data point has 62 features, corresponding to the specific criteria used to measure the similarities between a query and a template protein.

The data points are split into a training dataset consisting of 6018 data points (2910 in fold, 1810 in super family, and 1298 in family) and a test dataset containing 747 data points (395 in fold, 166 in super family, and 186 in family). Both NNRank and NNClass are trained on the training dataset and evaluated on the test dataset.

The mean zero-one error and mean absolute error of NNRank are 23.96% and 0.258, respectively. The mean zero-one error and mean absolute error of NNClass are 25.03% and 0.277, respectively. The mean zero-one error of NNRank is 1.1% lower than that of NNClass, and the mean absolute error of NNRank is 0.019 lower than that of NNClass. The experiment shows that NNRank performs better than NNClass on a large, real ordinal regression dataset.

D. Comparison with Gaussian Processes and Support Vector Machines on Standard Benchmarks

To further evaluate the performance of our method, we compare NNRank with two Gaussian process methods (GP-MAP and GP-EP) [9] and a support vector machine method (SVM) [38] implemented in [9]. The results of the three methods are quoted from [9]. Table II reports the zero-one error on the eight datasets. NNRank achieves the best results on Diabetes, Triazines, and Abalone; GP-EP on Pyrimidines, Auto MPG, and Boston; GP-MAP on Machine; and SVM on Stocks.

Table III reports the mean absolute error on the eight datasets. NNRank yields the best results on Diabetes and Abalone; GP-EP on Pyrimidines, Auto MPG, and Boston; GP-MAP on Triazines and Machine; and SVM on Stocks.

In summary, on the eight datasets, the performance of NNRank is comparable to the three state-of-the-art methods for ordinal regression.

IV. DISCUSSION AND FUTURE WORK

We have described a novel approach to adapt traditional neural networks for ordinal regression. Our neural network approach can be considered a generalization of the one-layer perceptron approach [14] to multiple layers. On the standard benchmarks of ordinal regression, our method outperforms standard neural networks used for classification. Furthermore, on the same benchmarks, our method achieves performance similar to two state-of-the-art methods (support vector machines and Gaussian processes) for ordinal regression.

Compared with existing methods for ordinal regression, our method has several advantages of neural networks. First, like the perceptron approach [14], our method can learn in both batch and online modes. The online learning ability makes our method a good tool for adaptive learning in real time. The multi-layer structure of the neural network and the non-linear transfer function give our method stronger fitting ability than perceptron methods.

Second, the neural network can be trained on very large datasets iteratively, although its training is more complex than that of support vector machines and Gaussian processes. Since the training process of our method is the same as that of traditional neural networks, average neural network users can apply this method to their tasks.

Third, the neural network method can make rapid predictions once the models are trained. The ability to learn on very large datasets and to predict in time makes our method a useful and competitive tool for ordinal regression tasks, particularly for time-critical and large-scale ranking problems in information retrieval, web page ranking, collaborative filtering, and the emerging field of Bioinformatics. To facilitate the application of this new approach, we make both NNRank and NNClass accept a general input format and freely available to the community at http://www.cs.missouri.edu/~chengji/cheng_software.html.

There are several directions for further improving the neural network (or multi-layer perceptron) approach for ordinal regression. One direction is to design a transfer function that ensures the monotonic decrease of the outputs of the neural network; another is to derive general error bounds for the method under the binary classification framework [26]. Furthermore, other flavors of implementations of the multi-threshold multi-layer perceptron approach for ordinal regression are possible. Since machine learning ranking is a fundamental problem with wide applications in many diverse domains such as web page ranking, information retrieval, image retrieval, collaborative filtering, bioinformatics, and so on, we believe the further exploration of the neural network (or multi-layer perceptron) approach for ranking and ordinal regression is worthwhile.

REFERENCES

[1] S. Agarwal and D. Roth. Learnability of bipartite ranking functions. In Proc. of the 18th Annual Conference on Learning Theory (COLT-05). 2005.
[2] F. Aiolli and A. Sperduti. Learning preferences for multiclass problems. In Advances in Neural Information Processing Systems 17 (NIPS). 2004.
[3] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), page 9. ACM Press, New York, USA, 2004.
[4] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, USA, 1996.
[5] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proc. of International Conference on Machine Learning (ICML-05), pages 89–97. 2005.
[6] C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems (NIPS) 20. MIT Press, Cambridge, MA, 2006.
[7] R. Caruana, S. Baluja, and T. Mitchell. Using the future to sort out the present: Rankprop and multitask learning for medical risk evaluation. In Advances in Neural Information Processing Systems 8 (NIPS). 1996.
[8] J. Cheng and P. Baldi. A machine learning information retrieval approach to protein fold recognition. Bioinformatics, 22:1456–1463, 2006.
[9] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019–1041, 2005.
[10] W. Chu and Z. Ghahramani. Preference learning with Gaussian processes. In Proc. of International Conference on Machine Learning (ICML-05), pages 137–144. 2005.
[11] W. Chu and S. S. Keerthi. New approaches to support vector ordinal regression. In Proc. of International Conference on Machine Learning (ICML-05), pages 145–152. 2005.
[12] W. Chu and S. S. Keerthi. Support vector ordinal regression. Neural Computation, 19(3), 2007.
[13] W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243–270, 1999.
[14] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems (NIPS) 14, pages 641–647. MIT Press, Cambridge, MA, 2002.
[15] O. Dekel, J. Keshet, and Y. Singer. Log-linear models for label ranking. In Proc. of the 21st International Conference on Machine Learning (ICML-04), pages 209–216. 2004.
[16] E. Frank and M. Hall. A simple approach to ordinal classification. In Proc. of the European Conference on Machine Learning. 2001.
[17] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
[18] D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35:61–70, 1992.
[19] S. Har-Peled, D. Roth, and D. Zimak. Constraint classification: a new approach to multiclass classification and ranking. In Advances in Neural Information Processing Systems 15 (NIPS). 2002.
[20] R. Herbrich, T. Graepel, P. Bollmann-Sdorra, and K. Obermayer. Learning preference relations for information retrieval. In Proc. of ICML Workshop on Text Categorization and Machine Learning, pages 80–84. 1998.
[21] R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression. In Proc. of 9th International Conference on Artificial Neural Networks (ICANN), pages 97–102. 1999.
[22] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115–132. MIT Press, Cambridge, MA, 2000.
[23] R. Jin and Z. Ghahramani. Learning with multiple labels. In Advances in Neural Information Processing Systems (NIPS) 15. MIT Press, Cambridge, MA, 2003.
[24] T. Joachims. Optimizing search engines using clickthrough data. In David Hand, Daniel Keim, and Raymond Ng, editors, Proc. of 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142. 2002.
[25] S. Kramer, G. Widmer, B. Pfahringer, and M. DeGroeve. Prediction of ordinal classes using regression trees. Fundamenta Informaticae, 47:1–13, 2001.
[26] L. Li and H. Lin. Ordinal regression by extended binary classification. In Advances in Neural Information Processing Systems (NIPS) 20. MIT Press, Cambridge, MA, 2006.
[27] D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4:448–472, 1992.
[28] P. McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society B, 42:109–142, 1980.
[29] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, 1983.
[30] T. P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.
[31] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536–540, 1995.
[32] U. Paquet, S. Holden, and A. Naish-Guzman. Bayesian hierarchical ordinal regression. In Proc. of the International Conference on Artificial Neural Networks. 2005.
[33] S. Rajaram, A. Garg, X. S. Zhou, and T. S. Huang. Classification approach towards ranking and sorting problems. In Machine Learning: ECML 2003, vol. 2837 of Lecture Notes in Artificial Intelligence (N. Lavrac, D. Gamberger, H. Blockeel, and L. Todorovski, eds.), pages 301–312. Springer-Verlag, 2003.
[34] M. D. Richard and R. P. Lippmann. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3:461–483, 1991.
[35] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. I: Foundations, pages 318–362. Bradford Books/MIT Press, Cambridge, MA, 1986.
[36] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA, 2002.
[37] A. Schwaighofer, V. Tresp, and K. Yu. Hierarchical Bayesian modelling with Gaussian processes. In Advances in Neural Information Processing Systems 17 (NIPS). MIT Press, 2005.
[38] A. Shashua and A. Levin. Ranking with large margin principle: two approaches. In Advances in Neural Information Processing Systems 15 (NIPS). 2003.
[39] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Germany, 1995.
[40] H. Wu, H. Lu, and S. Ma. A practical SVM-based algorithm for ordinal regression in image retrieval. pages 612–621, 2003.
[41] S. Yu, K. Yu, V. Tresp, and H. P. Kriegel. Collaborative ordinal regression. In Proc. of 23rd International Conference on Machine Learning, pages 1089–1096. 2006.
[42] H. Zhang, L. Jiang, and J. Su. Augmenting naive Bayes for ranking. In International Conference on Machine Learning (ICML-05). 2005.
TABLE I
The results of NNRank and NNClass on the eight datasets. The results are the average error over 20 trials along with the standard deviation.

              Mean zero-one error               Mean absolute error
Dataset       NNRank         NNClass            NNRank        NNClass
Stocks        12.68±1.8%     16.97±2.3%         0.127±0.01    0.173±0.02
Pyrimidines   37.71±8.1%     41.87±7.9%         0.450±0.09    0.508±0.11
Auto MPG      27.13±2.0%     28.82±2.7%         0.281±0.02    0.307±0.03
Machine       17.03±4.2%     17.80±4.4%         0.186±0.04    0.192±0.06
Abalone       21.39±0.3%     21.74±0.4%         0.226±0.01    0.232±0.01
Triazines     52.55±5.0%     52.84±5.9%         0.730±0.06    0.790±0.09
Boston        26.38±3.0%     26.62±2.7%         0.295±0.03    0.297±0.03
Diabetes      44.90±12.5%    43.84±10.0%        0.546±0.15    0.592±0.09
TABLE II
Zero-one error of NNRank, SVM, GP-MAP, and GP-EP on the eight datasets. SVM denotes the support vector machine method [38], [9]. GP-MAP and GP-EP are two Gaussian process methods using the Laplace approximation [27] and expectation propagation [30], respectively [9]. The results are the average error over 20 trials along with the standard deviation. We use boldface to denote the best results.

TABLE III
Mean absolute error of NNRank, SVM, GP-MAP, and GP-EP on the eight datasets. SVM denotes the support vector machine method [38], [9]. GP-MAP and GP-EP are two Gaussian process methods using the Laplace approximation and expectation propagation, respectively [9]. The results are the average error over 20 trials along with the standard deviation. We use boldface to denote the best results.