Kernels, Regularization and Differential Equations
Pattern Recognition
journal homepage: www.elsevier.com/locate/pr
ARTICLE INFO
Article history:
Received 18 March 2008
Received in revised form 3 June 2008
Accepted 5 June 2008
Keywords:
Positive definite kernel
Differential equation
Gaussian process
Reproducing kernel Hilbert space

ABSTRACT
Many common machine learning methods such as support vector machines or Gaussian process inference make use of positive definite kernels, reproducing kernel Hilbert spaces, Gaussian processes, and regularization operators. In this work these objects are presented in a general, unifying framework and interrelations are highlighted.
With this in mind we then show how linear stochastic differential equation models can be incorporated naturally into the kernel framework. And vice versa, many kernel machines can be interpreted in terms of differential equations. We focus especially on ordinary differential equations, also known as dynamical systems, and it is shown that standard kernel inference algorithms are equivalent to Kalman filter methods based on such models.
In order not to cloud qualitative insights with heavy mathematical machinery, we restrict ourselves to finite domains, implying that differential equations are treated via their corresponding finite difference equations.
© 2008 Elsevier Ltd. All rights reserved.
1. Introduction

Kernel methods are commonly used in the field of pattern recognition. For example, the authors of Ref. [1] have developed a support vector machine (SVM)-based face detector that works in real time on video data, and Ref. [2] uses SVMs for the tracking of humans with extensive pose articulation. Moreover, unsupervised detection of brain activation patterns is explored by [3] using one-class SVMs. The authors of Ref. [4] determine structured error patterns in microarray data using probabilistic kernel methods, and Ref. [5] uses a similar approach for processing motion capture data. Many such pattern recognition methods use SVMs for binary classification [6–8]. However, kernel methods are also employed for multi-class classification [9], regression [10], novelty detection [11], semi-supervised learning [12] and dimensionality reduction [13]. Gaussian processes (GPs) are the Bayesian versions of kernel methods. They have also been applied to classification [14,15], regression [16,17] or dimensionality reduction [18]. All these kernel methods are built around some common notions and objects, which are explained in this paper in a simple unifying way.
As depicted in Fig. 1, support vector machines can be thought of as follows. They first map the training and test input data into a potentially infinite dimensional feature space, a reproducing kernel Hilbert space (RKHS), and then classify the data with the help of a separating hyperplane. Since there are often many hyperplanes that separate the training data points, SVMs select the hyperplane with the largest margin, that is, the largest distance between the hyperplane and the data points. However, what is the intuitive meaning of distance in this feature space? One way to understand such distances is to explicitly choose a specific feature function of which all components have some problem-dependent meaning. However, often the RKHS and its corresponding norm are only defined implicitly via the choice of a kernel function $k(x, y) = \Phi(x)^T \Phi(y)$. In this case, the interpretation is not as straightforward. It was noted by Ref. [19] that any kernel function is related to a specific regularization operator. The present paper explains this connection in a simple but very general form, and we show how it can help to better understand SVMs and other related kernel machines.
Furthermore, it turns out that for the commonly used Gaussian (RBF) kernel, the feature space is a subset of the space of all functions from the input domain to the real numbers, and the corresponding regularization operator is an infinite sum of derivative operators [20]. We generalize this result and show that all translation-invariant kernel functions are related to differential operators. The corresponding homogeneous differential equations (DEs) are a useful tool for understanding the meaning of specific kernel functions. However, we could also exploit this relation in the inverse direction and construct kernels that are specifically adapted to problems involving DE models. To make this point clearer, let us consider a simple regression example from physics, which can be visualized easily and which we will thus use throughout the paper. Assume that we have acquired

∗ Corresponding author. Tel.: +49 7071 601571; fax: +49 7071 601552.
E-mail addresses: steinke@tuebingen.mpg.de (F. Steinke), bs@tuebingen.mpg.de (B. Schölkopf).
0031-3203/$30.00 © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2008.06.011
3272 F. Steinke, B. Schölkopf / Pattern Recognition 41 (2008) 3271 -- 3286
[Fig. 1: the map Φ takes data from the input space to the feature space.]
tions could also be dealt with similarly. By DEs we will in this paper always mean stochastic DEs, since these can be nicely incorporated into kernel methods. Stochastic DEs are a superset of normal DEs, since any DE can be converted into a stochastic DE by setting the noise level to zero.
1.1. Finite domains

over functions on a given fixed finite domain. We do not make statements about what happens if one or more points are added to the domain of the model, and the defined processes are not assumed to be marginals of their continuous analogs.

1.2. Overview

The remainder of the paper is structured as follows: after introducing some notation in Section 2, we define in Section 3 a framework of basic objects used in kernel methods, and we explain how these objects are interrelated. Thereafter, we describe the use of these objects for SVR in Section 3.2, for GP regression in Section 3.3, and for vector-valued regression in Section 3.4. In Section 4, we discuss a typical kernel-machine regression model and show its relation to linear stochastic DEs. We demonstrate how to develop kernel functions from linear state-space models or higher-order DEs. We show that the resulting inference methods are equivalent to Kalman filter-based methods. The pendulum and other examples are presented in detail in Section 5. In Section 6 we discuss the practical implications of the link between kernel machines and linear stochastic DEs. We summarize our conclusions in Section 7.
For better readability, we have restricted the main part of the paper to real-valued kernels, and postpone the more natural, slightly more technical treatment involving complex numbers to Appendix A. It will appear throughout the text that, with regularization theory in mind, conditionally positive definite (cpd) kernels arise quite naturally. We have transferred all parts dealing with cpd kernels to Appendix B, where we present an extension of the kernel framework to cpd kernels.

1.3. Related work

Most of the mathematical results of this paper are not the authors' original work, but have been mentioned in different contexts before. Our contribution is to reformulate them in a unified, easily understandable framework, the simple language of finite domains. Furthermore, we reinterpret them to highlight parallels between kernel methods and linear DEs.
There is a large body of literature on kernels and DEs in many different communities, and we only cite some relevant books containing overviews of their respective fields as well as further references. Many machine learning-related facts about kernels and regularization methods are taken from Ref. [8], as well as Ref. [23] for the Bayesian interpretation. Sources in the statistics literature include [24,25], and in approximation theory [26]. For an overview of linear stochastic dynamical systems and their estimation we refer to Ref. [27].
The connection between stochastic processes and splines was first explored in Ref. [28]. It is also well known that thin-plate/cubic splines minimize the second derivative [29,26]. Connections between regularization operators and kernel functions are explained in Refs. [20,30], and general linear operator equations are solved with GPs in Ref. [31]. A unifying survey of the theory of kernels, RKHSs, and GPs has been undertaken by Ref. [32]. However, they do not use finite domains, which complicates their study, and they do not mention the link with differential or operator equations. Approaches that directly employ kernel methods towards the estimation of stochastic DE models are proposed in Refs. [33,34].

2. Notation

We consider functions $f : \mathcal{X} \to \mathbb{R}$, where the domain $\mathcal{X}$ is a finite set, $|\mathcal{X}| = N$. When considering dynamical systems we will typically set $\mathcal{X}$ to be an evenly discretized interval and assume N to be large. Other examples of finite domains are discretized regions of higher dimensional spaces, but also finite sets of graphs, texts, or any other type of objects.
We denote by $\mathcal{H}$ the space of all functions $f : \mathcal{X} \to \mathbb{R}$. f is fully described by the $\mathbb{R}^N$-vector $\mathbf{f} = (f(x_1), \dots, f(x_N))^T$. Vectors and matrices are denoted in bold font, but if an element of $\mathcal{H}$ is thought of as a function from $\mathcal{X}$ to $\mathbb{R}$, we use the corresponding normal font character. For points $x_i \in \mathcal{X}$ we define location vectors/functions by $\mathbf{d}_{x_i} = (\delta_{ij})_{j=1,\dots,N}$, where $\delta_{ij}$ is the Kronecker symbol. The inner product of these with a function $f \in \mathcal{H}$ yields $\mathbf{d}_{x_i}^T \mathbf{f} = f(x_i)$. Thus, location vectors correspond to Dirac-delta functions centred at the point $x_i$ for continuous, infinite domains.
Linear operators $\mathbf{G} : \mathcal{H} \to \mathcal{H}$ are isomorphic to matrices in $\mathbb{R}^{N \times N}$. Therefore, any function $g : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ uniquely determines a linear operator $\mathbf{G} : \mathcal{H} \to \mathcal{H}$ through $G_{ij} = \mathbf{d}_{x_i}^T \mathbf{G} \mathbf{d}_{x_j} = g(x_i, x_j)$ and vice versa. The columns of $\mathbf{G}$ will be denoted by $G_{x_i} = \mathbf{G}\mathbf{d}_{x_i}$; they are real-valued functions on $\mathcal{X}$. For a set $X = \{x_i \mid i = 1, \dots, m\} \subseteq \mathcal{X}$ of points, $\mathbf{G}_X$ will denote the $m \times m$ submatrix of $\mathbf{G}$ corresponding to X.

3. The kernel framework

In non-parametric regression, we are given observations $(x_i, y_i) \in \mathcal{X} \times \mathbb{R}$, $i = 1, \dots, m$, $m \le N$, and the goal is to predict the value $y_*$ for arbitrary test points $x_* \in \mathcal{X}$. SVR estimates a prediction function $f : \mathcal{X} \to \mathbb{R}$, $y_* = f(x_*)$, as the minimizer of a functional like

$$\min_{f \in \mathcal{H}} \; \|\mathbf{R}\mathbf{f}\|^2 + C\,\mathrm{Loss}(\{(x_i, y_i, f(x_i)) \mid i = 1, \dots, m\}). \tag{1}$$

On the one hand, f should be close to the observed data as measured through a loss function $\mathrm{Loss} : (\mathcal{X} \times \mathbb{R} \times \mathbb{R})^m \to \mathbb{R}$. On the other hand, f should be regular as measured by the regularization operator $\mathbf{R} : \mathcal{H} \to \mathcal{G}$, where $\mathcal{G}$ is any finite dimensional Hilbert space. These two objectives are relatively weighted through the regularization parameter C. Note that SVMs also use the same setting for binary classification. The classes are represented as $y = \pm 1$. First a real-valued function $f : \mathcal{X} \to \mathbb{R}$ is estimated and then thresholded to obtain the binary class predictions. Unlike radial basis function networks [20,35], SVMs use the hinge loss $|1 - y f(x)|_+$, where $|x|_+ = x$ if $x > 0$ and $|x|_+ = 0$ otherwise.
Many questions arise around objective (1). How are $\|\mathbf{R}\mathbf{f}\|^2$ and the commonly used function space norm $\|f\|_K^2$ related? This will lead to the notion of RKHSs. The N-dimensional problem (1) can be solved using a smaller m-dimensional equivalent involving kernel functions. But how does $\mathbf{R}$ relate to the chosen kernel function? Can one interpret (1) in a Bayesian way, for example with the help of GPs? The current section will answer the above questions in a simple, yet precise way for finite domains. We will furthermore show the interrelations between the terms mentioned above.
Throughout the main part of this paper we assume that $\mathbf{R}$ is a one-to-one operator. This will lead to a framework with positive definite kernels. If $\mathbf{R}$ is not one-to-one, cpd kernels arise. All definitions and theorems derived for the positive definite case in the current section are extended to the cpd case in Appendix B.

3.1. Regularization operators, kernels, RKHS, and GPs

Fig. 3 depicts the most common objects in the kernel framework. We will explain them below, starting with the covariance operator. The covariance operator is not commonly used in the kernel literature, but we introduce it as a useful abstraction in the centre of the framework. While it does not in itself have a special meaning, it helps us to unify the links between the other "leaf" objects. With the covariance operator in mind, the reader may then easily derive additional direct links.
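As an illustration of the remark that the N-dimensional problem (1) has a smaller m-dimensional equivalent, the following sketch compares both routes numerically. It rests on assumptions that are not fixed by the text at this point: a quadratic loss, an arbitrary invertible example regularizer `R` (a damped first difference with hypothetical damping `a`), and the relation $\mathbf{K} = (\mathbf{R}^T\mathbf{R})^{-1}$ between regularizer and kernel that the paper uses later (e.g. in Eq. (5)). All variable names are illustrative.

```python
import numpy as np

N, m, C = 60, 8, 25.0      # domain size, number of observations, trade-off (illustrative)
a = 0.9                    # hypothetical damping, only used to build an example R

# A one-to-one regularizer: (R f)_0 = f_0 and (R f)_i = f_i - a * f_{i-1}.
R = np.eye(N) - a * np.eye(N, k=-1)

rng = np.random.default_rng(0)
obs = np.sort(rng.choice(N, size=m, replace=False))   # indices of the observed points x_i
y = rng.normal(size=m)                                 # observed values y_i

# N-dimensional route: minimise ||R f||^2 + C * sum_i (f(x_i) - y_i)^2 directly.
S = np.eye(N)[obs]                                     # rows are the location vectors d_{x_i}^T
f_direct = np.linalg.solve(R.T @ R + C * S.T @ S, C * S.T @ y)

# m-dimensional route: work with the kernel K = (R^T R)^{-1} on the observed points only.
K = np.linalg.inv(R.T @ R)
alpha = np.linalg.solve(K[np.ix_(obs, obs)] + np.eye(m) / C, y)
f_kernel = K[:, obs] @ alpha

max_diff = np.max(np.abs(f_direct - f_kernel))
```

Under these assumptions the two minimizers agree up to rounding error, so only an m × m system has to be solved once the kernel is available.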
a square matrix $\mathbf{K}_X$ and sets $p(\mathbf{f}_X) = \mathcal{N}(0, \mathbf{K}_X)$. Using standard formulas for conditioning Gaussian distributions and block-partitioned matrix inversion one can show that this construction is consistent, i.e. for all $X' \subseteq \mathcal{X}$, $X \cap X' = \emptyset$ it holds that $p(\mathbf{f}_X) = \int p(\mathbf{f}_{X \cup X'})\, d\mathbf{f}_{X'}$. By Kolmogorov's extension theorem, or by simply using $X = \mathcal{X}$ in our finite dimensional case, this yields a GP on all of $\mathcal{X}$.

3.4. Vector-valued regression

Consider now regression from $\mathcal{X}$ to $\mathbb{R}^n$, $n > 1$. We will show that the kernel framework explained above can be easily extended to this case. The function space of all functions $f : \mathcal{X} \to \mathbb{R}^n$ will be denoted by $\mathcal{H}^n$. We can represent such a function as a vector $\mathbf{f}$ in $\mathbb{R}^{nN}$. Denoting the component functions by $f^i : \mathcal{X} \to \mathbb{R}$, it is $\mathbf{f} \equiv (\mathbf{f}^{1\,T} \cdots \mathbf{f}^{n\,T})^T$. The standard inner product in $\mathcal{H}^n$ is $\mathbf{f}^T\mathbf{g} = \sum_{j=1}^{n} \mathbf{f}^{j\,T}\mathbf{g}^{j}$. The unit vector $\mathbf{d}^j_{x_i}$, i.e. the location vector for location $x_i$ and the j-th component, then has the j-th component equal to $\mathbf{d}_{x_i}$ and all others equal to zero. It is $\mathbf{d}^{j\,T}_{x_i}\mathbf{f} = f^j(x_i)$. Linear operators $\mathbf{A} : \mathcal{H}^n \to \mathcal{H}^n$ are isomorphic to $\mathbb{R}^{(Nn) \times (Nn)}$ matrices.

Theorem 10. The function space $\mathcal{H}^n$ is isomorphic to the space $\tilde{\mathcal{H}}$ of all functions from $\tilde{\mathcal{X}} = \mathcal{X} \times \{1, \dots, n\}$ to $\mathbb{R}$.

This obvious theorem includes all we need in order to work with vector-valued functions: As $\mathcal{X}$ is a finite set, so is $\tilde{\mathcal{X}}$. All the above theory on kernels, regularization operators, and GPs applies. For example, using the regularization operator $\mathbf{R} : \mathcal{H}^n \to \mathcal{G}$, the corresponding kernel function is

$$k(x_i, x_j)_{lm} = k((x_i, l), (x_j, m)) = \mathbf{d}^{l\,T}_{x_i} (\mathbf{R}^T\mathbf{R})^{-1} \mathbf{d}^{m}_{x_j}. \tag{5}$$

To construct a sensible regularizer $\mathbf{R}$, a similarity measure between points in $\tilde{\mathcal{X}}$ is needed. Since in many applications it is not clear how to compare different components of f, it is common to use a block-diagonal regularizer $\mathbf{R} = \mathrm{diag}(\mathbf{R}^1, \dots, \mathbf{R}^n)$, i.e. regularizing each component separately. The corresponding kernel function then has the vector form

$$\mathbf{K}^{j}_{x_i} = (0, \dots, 0, K^{j\,T}_{x_i}, 0, \dots, 0)^T,$$

with the individual kernel functions $K^{j}_{x_i} = (\mathbf{R}^{j,T}\mathbf{R}^{j})^{-1}\mathbf{d}_{x_i}$ in the cor-

Since for $f = 0$, $\|\mathbf{R}\mathbf{f} - \mathbf{u}\| = \|\mathbf{u}\| \neq 0$, $\|\mathbf{R}\mathbf{f} - \mathbf{u}\|$ cannot be used as a norm in an RKHS. To circumvent this problem, note that since $\mathbf{R}$ is assumed to be one-to-one, $\mathbf{R}^{-1}\mathbf{u}$ exists uniquely and can be computed without regard to the measurement data. We can then base any inference on $\tilde{\mathbf{f}} = \mathbf{f} - \mathbf{R}^{-1}\mathbf{u}$, adapting the loss term appropriately. The regularization term then reads $\|\mathbf{R}\tilde{\mathbf{f}}\| = \|\mathbf{R}\mathbf{f} - \mathbf{u}\|$, which represents a true norm for $\tilde{\mathbf{f}}$. The kernel framework can now be applied as described above.

4. Kernels and DEs

SVR and GP inference both use an a priori model that can be expressed in the form

$$\mathbf{R}\mathbf{f} \approx 0. \tag{6}$$

Functions $f : \mathcal{X} \to \mathbb{R}$ which fulfill Eq. (6) to a high degree as measured by $\|\mathbf{R}\mathbf{f}\|$, the two-norm of the residual, are preferred to functions that significantly violate the equation.
In this section we discuss a common choice for $\mathbf{R}$, namely linear stochastic DEs. If the input domain is one-dimensional, one speaks of ordinary differential equations (ODEs) or dynamical systems, and for multivariate input these are PDEs. Since this paper is restricted to finite domains, the term DE should be understood as meaning finite difference equations throughout. In most cases, the differences are negligible for discretization steps that are sufficiently small.
Linking DEs and kernel machines is useful both from a machine learning perspective as well as from a perspective focused primarily on work with DEs.
From a machine learning point of view, stochastic DEs can be seen as an ideal prior model. They describe local properties of the function f, that is, how the function value at one point relates to function values in the neighbourhood. On a global level, stochastic DEs do not constrain the function very much, because small local noise contributions can add up over longer distances. Thus, this prior is well-suited to situations where we a priori do not know much about the global structure of the target function, but we assume that locally it should not vary too much or only in a certain predefined manner.
From a DE point of view, it is useful to have all the machinery of kernel methods at hand. With these, one can estimate the state/trajectory of the DE model, that is, the function described by the DE. One can also estimate the DE or its parameters, a task commonly known as system identification. Both problems are ubiquitous throughout natural science, statistics and engineering.
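The statement that a stochastic DE prior constrains a function locally but only weakly globally can be made concrete with a small sketch. It assumes the simplest possible difference equation, $f_i - f_{i-1} \approx 0$ with an extra row $f_0 \approx 0$ so that the regularizer stays one-to-one, and the covariance $\mathbf{K} = (\mathbf{R}^T\mathbf{R})^{-1}$ of the associated Gaussian prior. The grid size and the probed indices are arbitrary choices made for illustration.

```python
import numpy as np

N = 200
# Finite difference regularizer: (R f)_0 = f_0 and (R f)_i = f_i - f_{i-1}.
R = np.eye(N) - np.eye(N, k=-1)
K = np.linalg.inv(R.T @ R)            # prior covariance implied by R f ≈ 0

prior_var = np.diag(K)                # marginal variance grows along the chain

def corr(i, j):
    """Prior correlation between the function values at grid points i and j."""
    return K[i, j] / np.sqrt(K[i, i] * K[j, j])

corr_neighbours = corr(100, 101)      # neighbouring points are almost perfectly correlated
corr_distant = corr(10, 190)          # distant points are only weakly tied together
```

For this particular regularizer the covariance is $K_{ij} = \min(i, j) + 1$ (with indices starting at 0), so the unit-variance local noise terms accumulate linearly in the marginal variance while neighbouring values stay tightly coupled.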
should not depend on the discretization step size. Suppose we split one interval into M smaller steps; then the joint process noise in this interval is $\sum_{i=1}^{M} \epsilon_i^{(P)}$, where the $\epsilon_i^{(P)}$ are i.i.d. random variables. If the variance of the $\epsilon_i^{(P)}$ is finite, then the sum will have a Gaussian distribution for large M, regardless of the distribution of the $\epsilon_i^{(P)}$. Thus, if the process noise has finite variance, the only valid distribution that can be refined on an ever smaller grid is the Gaussian distribution.
We now interpret the state-space model in terms of the kernel framework.

Theorem 11. The linear state-space model (7) defines a GP over trajectories $x : \mathcal{X} \to \mathbb{R}^n$, $\mathcal{X} = \{0, \dots, N - 1\}$. Mean and covariance for $i, j \in \mathcal{X}$ are given as

$$\mu_i = E(x_i) = \mathbf{A}^i \mu_0 + \sum_{l=1}^{i} \mathbf{A}^{i-l}\mathbf{B}u_l, \tag{9}$$

$$\mathbf{K}_{i,j} = E((x_i - \mu_i)(x_j - \mu_j)^T) = \mathbf{A}^i \Sigma_0 \mathbf{A}^{j,T} + \sum_{l=1}^{\min(i,j)} \mathbf{A}^{i-l} \Sigma_P \mathbf{A}^{j-l,T}. \tag{10}$$

The GP equivalent to Eq. (7) has the density

$$p(\mathbf{x}) \propto \exp\left(-\tfrac{1}{2}\|\mathbf{R}\mathbf{x} - \mathbf{u}\|^2\right). \tag{11}$$

This expression has a nice, simple interpretation: trajectories x that follow the model DE (7) are a priori the most likely functions $x : \mathcal{X} \to \mathbb{R}^n$, and deviations from the equation are penalized quadratically.
So far, we have shown that linear state-space models define GP distributions on trajectories $x : \mathcal{X} \to \mathbb{R}^n$. Whether any GP can be written as a linear state-space model depends on whether the reader considers models with state dimension N (or infinite state dimension in the continuous case) as valid state-space models. An introduction to infinite dimensional systems can be found in Ref. [38]. Imagine an arbitrary GP $p(z) = \mathcal{N}(\mu, \mathbf{K})$ for $z : \mathcal{X} \to \mathbb{R}$. One could simply set $x_0 = z$, i.e. $\mu_0 = \mu$, $\Sigma_0 = \mathbf{K}$, and then propagate with $\mathbf{A} = \mathbf{1}$, $u_i = 0$, and $\Sigma_P = 0$. Alternatively, one could use the decomposition $p(z) = p(z_0)\,p(z_1 \mid z_0) \cdots p(z_{N-1} \mid z_0, \dots, z_{N-2})$ to formulate a state-space model. Since for arbitrary covariances $\mathbf{K}$ we cannot assume special Markov properties, we would again need an N-dimensional state-space to represent the GP. For special $\mathbf{K}$, however, this construction may allow one to exploit Markov properties of the GP, and thus a representation with a much lower state dimension.

4.2. Linear DEs and the Fourier transform

Kernel methods are often motivated via regularization in the Fourier domain [8]. At the same time, derivative operators reduce to simple multiplications in the Fourier domain. This leads us to examine more closely the connection between DEs and Fourier space penalization in this section.
Assume $\mathcal{X}$ to be the discretized real line, i.e. $\mathcal{X} = \{i/h \mid i = 1, \dots, N\}$,
known that functions of $\mathbf{D}$ can be computed by applying equivalent operations to the eigenvalues $\omega_k$. In particular, the corresponding kernel function then is

$$k(x_l, x_m) = \left[(L(\mathbf{D})^T L(\mathbf{D}))^{-1}\right]_{lm} = \mathbf{d}_{x_l}^T (L(\mathbf{D})^T L(\mathbf{D}))^{-1} \mathbf{d}_{x_m} \tag{14}$$

$$= \mathbf{d}_{x_l}^T \left[\sum_{k=1}^{N} \mathbf{u}_k \frac{1}{\overline{L(\omega_k)}\,L(\omega_k)} \overline{\mathbf{u}}_k^{T}\right] \mathbf{d}_{x_m} \tag{15}$$

$$= \sum_{k=1}^{N} \frac{1}{|L(\omega_k)|^2} \exp\left(i\,\frac{2\pi}{N}\,k\,(l - m)\right). \tag{16}$$
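Eqs. (14)-(16) can be checked numerically with a short sketch. The choices below are assumptions made only for illustration: $\mathbf{D}$ is taken to be the periodic forward difference, the polynomial is the hypothetical $L(\omega) = \omega + c$ with an arbitrary offset `c` that keeps $L(\mathbf{D})$ invertible, the frequency index runs over 0, …, N − 1 (the same set as 1, …, N modulo N), and the Fourier basis is normalized to unit length, which introduces the factor 1/N that `numpy.fft.ifft` applies.

```python
import numpy as np

N, c = 64, 0.5                                   # grid size and offset (illustrative)

# Periodic forward difference: (D f)_l = f_{l+1} - f_l, a circulant matrix.
D = np.roll(np.eye(N), 1, axis=1) - np.eye(N)
L_D = D + c * np.eye(N)                          # L(D) for the example polynomial L(w) = w + c

# Direct route, Eq. (14): invert L(D)^T L(D).
K_direct = np.linalg.inv(L_D.T @ L_D)

# Fourier route, Eq. (16): transform g(w_k) = 1 / |L(w_k)|^2.
k = np.arange(N)
w = np.exp(2j * np.pi * k / N) - 1.0             # eigenvalues of D in the Fourier basis
g = 1.0 / np.abs(w + c) ** 2
k_offset = np.fft.ifft(g).real                   # kernel value as a function of the offset l - m

offsets = (k[:, None] - k[None, :]) % N          # (l - m) mod N for every pair of grid points
K_fourier = k_offset[offsets]
```

Both routes produce the same matrix, and `k_offset` depends on l and m only through their difference.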
Thus, the kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the (discrete) Fourier transform of $g(\omega_k) = 1/|L(\omega_k)|^2$. Since g is real-valued, the Fourier transform of it is also real and additionally symmetric. The corresponding kernel function then is real-valued and only depends on the distance between $x_l$ and $x_m$, $d = |l - m|$, that is, it is translation-invariant.
Let us motivate Eq. (12) from a regularization point of view. High derivatives are described by polynomials $L(\omega)$ of high order, in which case $\|L(\mathbf{D})\mathbf{f}\|^2 = \sum_k \mathbf{f}^T \mathbf{u}_k\,|L(\omega_k)|^2\,\overline{\mathbf{u}}_k^{T}\mathbf{f}$ strongly penalizes high frequencies. The corresponding kernel then contains few high frequency components and is thus relatively smooth.
One can also discuss the reverse derivation from a translation-invariant kernel function on $\mathcal{X}$ to a differential regularization operator. Translation-invariance implies that the covariance operator $\mathbf{K}$ is diagonal in the Fourier basis. In order to derive a DE, invert the eigenvalues of $\mathbf{K}$, take the square root, and interpolate the result by a polynomial L of at most degree N. Eq. (12) then yields the model that is implicitly used when performing regression with this kernel. A famous example is the Gaussian kernel, $k(x_i, x_j) \propto \exp(-|i - j|^2 / 2\sigma^2)$. The discrete Fourier transform is difficult to compute analytically in this case, so we approximate it with its continuous counterpart for large N and small step sizes. The continuous Fourier transform of a Gaussian is again a Gaussian with variance $\sigma^{-2}$. Inverting and taking the square root, we derive a function $\exp((\sigma^2/4)\,\omega^2)$, whose

4.4. State estimation and system identification using kernels

Both GP and SVR regression can be interpreted as optimal state estimators if the kernel is chosen with respect to a DE as described above. Both methods try to minimize the deviation of the estimated trajectory from the DE $\mathbf{R}\mathbf{f} = 0$ and at the same time try to minimize the distance to the measured data points, where the distance is measured either through a loss function in the SVR case or through a likelihood in the probabilistic setting. An optimal trade-off between these potentially contradicting targets is obtained. Furthermore, SVR and GP regression can both be used for system identification. In SVR one typically chooses the kernel to minimize the cross-validation error on the training set. In GP regression one tries to find the kernel function that maximizes the marginal likelihood, that is, the complete likelihood of the training data and latent function $f : \mathcal{X} \to \mathbb{R}$ marginalized over the latents. Since each DE can be related to a specific kernel function, optimizing for the best kernel in a class of kernels derived from DEs is equivalent to choosing the most appropriate DE model for the given data set. More formally, assume, for example, that we are interested in a DE model of the form

$$L_h(\mathbf{D})\,\mathbf{f} = \sum_{i} h_{i+1}\,\mathbf{D}^i \mathbf{f} = 0. \tag{19}$$
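Choosing a DE model by maximizing the GP marginal likelihood can be sketched for a one-parameter family. The example below is an illustration under assumptions that are not taken from the paper: the family is the first-order difference equation $f_i - a f_{i-1} \approx 0$ with unknown coefficient `a`, the observation noise is Gaussian with known variance, every grid point is observed, and the candidate values are scanned on a simple grid. The helper names `kernel_from_de` and `log_marginal_likelihood` are hypothetical.

```python
import numpy as np

def kernel_from_de(a, N):
    """Covariance K = (R^T R)^{-1} for the difference equation f_i - a * f_{i-1} ≈ 0."""
    R = np.eye(N) - a * np.eye(N, k=-1)
    return np.linalg.inv(R.T @ R)

def log_marginal_likelihood(y, K, noise_var):
    """log N(y | 0, K + noise_var * I), computed through a Cholesky factorization."""
    m = len(y)
    chol = np.linalg.cholesky(K + noise_var * np.eye(m))
    z = np.linalg.solve(chol, y)
    return -0.5 * z @ z - np.log(np.diag(chol)).sum() - 0.5 * m * np.log(2.0 * np.pi)

N, a_true, noise_var = 200, 0.9, 0.01            # illustrative ground truth and noise level
rng = np.random.default_rng(1)

# Draw one trajectory that satisfies R f = unit white noise, then add measurement noise.
R_true = np.eye(N) - a_true * np.eye(N, k=-1)
f = np.linalg.solve(R_true, rng.normal(size=N))
y = f + np.sqrt(noise_var) * rng.normal(size=N)

# System identification: score each candidate DE by its marginal likelihood.
grid = np.linspace(0.0, 0.99, 100)
scores = np.array([log_marginal_likelihood(y, kernel_from_de(a, N), noise_var) for a in grid])
a_best = grid[np.argmax(scores)]
```

In this setting the candidate with the highest score plays the role of the identified DE; a gradient-based optimizer over the coefficients of Eq. (19) would be the natural replacement for the grid scan in higher-dimensional parameter spaces.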
[Fig. 4: three panels, each plotting φ against time.]
Fig. 4. (left) Kernel function k(xi , . ) derived from the differential equation (20) describing a pendulum. Fourier space transforms with periodic boundary conditions were
used. The resulting kernel is translation invariant, xi is chosen in the middle of the interval. (middle) The 50 data points from Fig. 2, denoted by black crosses, are regressed
using a Gaussian process with the pendulum kernel, left, and a Gaussian i.i.d. likelihood. The solid red line denotes the mean of the posterior GP, the shaded area plus–minus
one marginal standard deviation of the function values. The dashed black line shows the true sample path from which the data points were generated. (right) GP regression
as in the middle figure; however, with a Gaussian kernel.
[Fig. 5: panels plotting φ against time.]
Fig. 5. (left) The covariance matrix derived from the differential equation describing a pendulum (20) using a state-space formulation with initial condition. Since the
state-space is two-dimensional the kernel function has for each position pair i, j four entries. Two entries describe the covariance within each component, the two others
the cross-covariances. (middle) Gaussian process regression using the kernel from the left figure and the 50 data points from Fig. 2. The solid red line denotes the mean of
the posterior GP, the shaded area plus–minus one marginal standard deviation for the function values. The dashed black line is the original sample path. (right) Equivalent
results produced by a Kalman smoother.
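The agreement between GP regression and a Kalman smoother reported for the right panel of Fig. 5 can be reproduced in miniature. The sketch below does not use the pendulum model; it assumes a scalar state-space model $x_i = a\,x_{i-1} + \epsilon_i$ with zero prior mean, no input term, Gaussian measurement noise, and observations at a random subset of grid points. All parameter values and variable names are illustrative. The covariance is built as in Eq. (10), and the smoother is the standard Rauch-Tung-Striebel recursion.

```python
import numpy as np

N, a, q, p0, r = 80, 0.95, 0.1, 1.0, 0.05        # illustrative scalar model parameters
rng = np.random.default_rng(2)
obs = np.sort(rng.choice(N, size=20, replace=False))
y = rng.normal(size=obs.size)                    # any data will do for comparing the estimators

# GP view: x_i = sum_{l <= i} a^(i-l) e_l with Var(e_0) = p0 and Var(e_l) = q for l >= 1.
idx = np.arange(N)
T = np.tril(a ** np.abs(idx[:, None] - idx[None, :]))
K = T @ np.diag(np.r_[p0, np.full(N - 1, q)]) @ T.T
gp_mean = K[:, obs] @ np.linalg.solve(K[np.ix_(obs, obs)] + r * np.eye(obs.size), y)

# Kalman filter (forward pass); the update is skipped where no measurement exists.
y_at = dict(zip(obs.tolist(), y))
m_pred, P_pred = np.zeros(N), np.zeros(N)
m_filt, P_filt = np.zeros(N), np.zeros(N)
for i in range(N):
    m_pred[i] = 0.0 if i == 0 else a * m_filt[i - 1]
    P_pred[i] = p0 if i == 0 else a * a * P_filt[i - 1] + q
    if i in y_at:
        gain = P_pred[i] / (P_pred[i] + r)
        m_filt[i] = m_pred[i] + gain * (y_at[i] - m_pred[i])
        P_filt[i] = (1.0 - gain) * P_pred[i]
    else:
        m_filt[i], P_filt[i] = m_pred[i], P_pred[i]

# Rauch-Tung-Striebel smoother (backward pass).
ks_mean = m_filt.copy()
for i in range(N - 2, -1, -1):
    G = a * P_filt[i] / P_pred[i + 1]
    ks_mean[i] = m_filt[i] + G * (ks_mean[i + 1] - m_pred[i + 1])

max_diff = np.max(np.abs(gp_mean - ks_mean))
```

The GP route solves one system of the size of the data set, while the smoother makes two sweeps over the N grid points; both return the same posterior mean up to rounding.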
Fig. 7. For a two-dimensional domain X with periodic boundary conditions, the kernel functions K_{x_i} for harmonic and thin-plate spline regularization are shown in the top row. x_i is chosen in the middle of X. Below we show the mean of a GP regression with these kernels and five data points, denoted as black stars.
likelihood for GP regression with a Gaussian kernel. The maximal marginal likelihood for a Gaussian kernel with automatically chosen parameters is 20 orders of magnitude smaller than for the pendulum kernel. In a Bayesian interpretation the data thus strongly prefer a pendulum-adapted model over the standard Gaussian kernel model.

5.3. Two-dimensional PDEs

In this section we discuss kernels for two-dimensional domains. We show how the harmonic and the thin-plate spline regularizers, which both build on derivatives and can be interpreted as stochastic PDEs, can be incorporated into the kernel framework.
Next, we show examples of harmonic and thin-plate spline regularization in the kernel framework.
As mentioned in Section 4.3, the Fourier transform can also be applied for functions on higher-dimensional domains, and derivative operators can also be translated into multiplications in this setting. Consider a rectangular grid with $N^2 = 256^2$ points and periodic boundary conditions. The discrete derivative $\mathbf{D}^1$ in the first direction and the derivative $\mathbf{D}^2$ in the second direction are both diagonal in the tensor Fourier basis $\mathbf{u}_{k_1} \otimes \mathbf{u}_{k_2}$, where $(\mathbf{d}_{x_l} \otimes \mathbf{d}_{x_m})^T\,\mathbf{u}_{k_1} \otimes \mathbf{u}_{k_2} = \exp(i\,(2\pi/N)(l k_1 + m k_2))$ and the eigenvalues are $\omega_{k_1 \otimes k_2} = \omega_{k_1}\omega_{k_2}$, $k_1, k_2 = 1, \dots, N$.
Harmonic regularization results from penalizing the Jacobian of $f : \mathcal{X} \to \mathbb{R}$, that is, all first derivatives,

$$\mathbf{R} = \begin{pmatrix} \mathbf{D}^1 \\ \mathbf{D}^2 \end{pmatrix}.$$

This results in $\|\mathbf{R}\mathbf{f}\|^2 = \mathbf{f}^T \Delta \mathbf{f}$, where $\Delta = \mathbf{D}^{1\,T}\mathbf{D}^1 + \mathbf{D}^{2\,T}\mathbf{D}^2$ is the (discrete) Laplace operator. Functions minimizing this expression, the so-called harmonic energy, effectively minimize the graph's area and are thus very common in many fields of research, especially computer graphics [41]. Since constant functions are not penalized by $\mathbf{R}$, the cpd framework for non-one-to-one $\mathbf{R}$ has to be used in this case, see Appendix B. Postponing a more detailed discussion, the most important change here is to use the pseudoinverse instead of the inverse for deriving the kernel, $\mathbf{K} = (\mathbf{R}^T\mathbf{R})^{+}$. This operation is easily performed using the two-dimensional fast Fourier transform.
The thin-plate spline energy penalizes the Hessian of $f : \mathcal{X} \to \mathbb{R}$, that is, all second derivatives,

$$\mathbf{R} = \begin{pmatrix} \mathbf{D}^1\mathbf{D}^1 \\ \mathbf{D}^1\mathbf{D}^2 \\ \mathbf{D}^2\mathbf{D}^1 \\ \mathbf{D}^2\mathbf{D}^2 \end{pmatrix}.$$

The energy leaves linear functions unpenalized, thus we again have to use the cpd framework and correspondingly the pseudoinverse.
In Fig. 7, we show the resulting kernels for harmonic and thin-plate spline regularization. Furthermore, we show results of approximating five randomly chosen data points with a GP regression with the respective kernels. Note that the harmonic kernel is sharply peaked, but the regression output stays in the convex hull of the training output values, the famous mean value property of harmonic maps. The thin-plate spline solution is much smoother, but occasionally overshoots the training values.
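The pseudoinverse $\mathbf{K} = (\mathbf{R}^T\mathbf{R})^{+}$ for harmonic regularization can be sketched with a two-dimensional FFT and checked against a dense computation. The example assumes periodic forward differences for $\mathbf{D}^1$ and $\mathbf{D}^2$, a row-major flattening of the grid, and a deliberately small grid (N = 8 instead of 256) so that the dense pseudoinverse stays cheap. Variable names are illustrative.

```python
import numpy as np

N = 8                                             # small grid so the dense check stays cheap
D = np.roll(np.eye(N), 1, axis=1) - np.eye(N)     # periodic forward difference in one direction
I = np.eye(N)
D1, D2 = np.kron(D, I), np.kron(I, D)             # derivatives along the first / second grid axis
Lap = D1.T @ D1 + D2.T @ D2                       # R^T R for harmonic regularization

# Fourier route: the eigenvalues of R^T R are |w_k1|^2 + |w_k2|^2.
lam = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.arange(N) / N)   # |exp(2*pi*i*k/N) - 1|^2
eig = lam[:, None] + lam[None, :]
inv_eig = np.zeros_like(eig)
nonzero = eig > 1e-12
inv_eig[nonzero] = 1.0 / eig[nonzero]             # pseudoinverse: the constant mode stays at zero
k_img = np.fft.ifft2(inv_eig).real                # kernel function centred at grid point (0, 0)

# Dense reference: explicit pseudoinverse of the (N*N) x (N*N) Laplacian.
K_dense = np.linalg.pinv(Lap, rcond=1e-10)
k_ref = K_dense[:, 0].reshape(N, N)
```

Because the kernel is translation-invariant on the periodic grid, the single image `k_img` determines the whole covariance operator; kernel functions centred elsewhere are obtained by cyclic shifts, e.g. with `np.roll`.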
Fig. 8. Kernel corresponding to a graph Laplacian as regularizer R^T R. The kernel functions K_{x_i} are encoded in the colour and the size of the nodes. Vertex x_i is marked with a black cross, the edges of the graph are shown in black.
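A kernel of the kind shown in Fig. 8 can be sketched as follows, using the ε-neighbourhood graph construction given in Section 5.4: 40 random points in the unit square, ε = 0.2, Gaussian edge weights, and the pseudoinverse of the graph Laplacian as covariance operator. The random seed, the exclusion of self-loops, and the variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
pts = rng.uniform(size=(40, 2))                   # 40 random points in [0, 1]^2
eps = 0.2

# Pairwise Euclidean distances and epsilon-neighbourhood edge weights (no self-loops).
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
W = np.where((dist < eps) & (dist > 0), np.exp(-dist**2 / eps**2), 0.0)

Deg = np.diag(W.sum(axis=1))                      # degree matrix
L_G = Deg - W                                     # graph Laplacian, used as R^T R

# The Laplacian is singular (constants per connected component), so use the pseudoinverse.
K = np.linalg.pinv(L_G, rcond=1e-10, hermitian=True)

k_center = K[:, 0]                                # kernel function of the first vertex
```

The vector `k_center` holds one value per vertex and is what a plot in the style of Fig. 8 would encode in node colour and size.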
5.4. Graph Laplacian

Since graph domains are naturally finite, graph-based learning is a good example of where the finite domain kernel framework directly applies without the need for discretization.
The graph Laplacian is an approximation of the true Laplacian $\Delta$ on graphs [42]. Kernels on graphs based on the graph Laplacian are described by Ref. [43]; they are used for semi-supervised learning by Ref. [44], and Ref. [45] uses them in GPs on finite image domains for image super-resolution. The graph Laplacian $\Delta_G$ for a graph $G = (E, \mathcal{X})$ with edges E and vertices $\mathcal{X}$ is given by $\Delta_G = \mathbf{D} - \mathbf{W}$, where $W_{ij}$ is the weight of edge $(i, j) \in E$, 0 if $(i, j) \notin E$, and the degree matrix $\mathbf{D}$ is diagonal with entries $D_{ii} = \sum_j W_{ij}$. We use an ε-neighbourhood graph constructed from 40 random points in $[0, 1]^2$, $\varepsilon = 0.2$, i.e. $(i, j) \in E$ if and only if $\|x_i - x_j\| < \varepsilon$. Edge weights are set as $W_{ij} = \exp(-\|x_i - x_j\|^2 / \varepsilon^2)$.
As in the above section, setting $\mathbf{R}^T\mathbf{R} = \Delta_G$ leads to the problem that $\Delta_G$ is not one-to-one. Functions f constant on a connected component have $\mathbf{f}^T \Delta_G \mathbf{f} = 0$, a fact commonly used in spectral clustering [46]. Thus, in order to derive a kernel we again use the pseudoinverse. For more details see Appendix B.
Fig. 8 shows the resulting kernel function $K_{x_i}$. The closer a point is to $x_i$, the larger its corresponding kernel values. Equivalently, under the corresponding GP prior, the correlation of the function value at a certain point with the function value at $x_i$ is stronger the closer the point is to $x_i$. Note that the distance is measured in terms of the geodesic distance intrinsic to the graph, not the Euclidean distance of the embedding space.

6. Discussion

We have shown that common linear DE models can be flawlessly integrated into the kernel framework and that trajectory/state estimation and system identification can both be performed with kernel machines such as SVR or GP regression. However, there are already many well-established algorithms for state estimation and system identification. In this section, we discuss how kernel methods relate to these standard methods, and when one should prefer which type of algorithm.
State estimation in the linear state-space model described in Section 4.1 is classically dominated by the Kalman filter/smoother [40] and its variants [27]. For such models the Kalman filter algorithm is also equivalent to graphical model message-passing algorithms [47]. Since all these methods perform optimal state estimation in the state-space model, as do kernel methods such as GP regression or SVR, the results of the two types of methods are identical. The Kalman filter can be interpreted as just an efficient way of computing GP regression exploiting the special features of (low-dimensional) linear state-space models. SVR is slightly different in that it typically uses an ε-insensitive linear loss function [8], which corresponds to a different likelihood model. For a quadratic loss, however, the output of an SVR will be identical to the mean estimate of a Kalman smoother.
It is interesting to note that even without considering equivalence of the underlying model assumptions, kernel methods can be related to Kalman filter-like algorithms. For dynamical systems, the matrix $\mathbf{R}^T\mathbf{R}$, whose inverse yields the covariance operator, is block-tridiagonal. Ref. [48] proposes an algorithm to invert such matrices in linear time using a forward-backward scheme that is closely reminiscent of the Kalman smoother algorithm.
Considering system identification for linear ODEs, there exist many different algorithms in the control community such as subspace identification, Fourier space methods or prediction error methods [27]. Statisticians classically use expectation maximization (EM), which maximizes the marginal likelihood of the model, that is, the likelihood of the observed outputs given the parameters with the hidden states integrated out. The marginal likelihood can be efficiently computed using a Kalman smoother. As for the case of state estimation, all these methods are at least qualitatively equivalent to kernel machine model selection algorithms. The marginal likelihood is also used in GP regression for kernel selection. The cross-validation error can be seen as an approximation of the negative marginal likelihood or the prediction error, which also links SVR regression to this picture.
Since we have argued above that kernel methods are largely equivalent to standard algorithms for treating DEs, we might ask in which context one may benefit from using kernel methods. Kernel methods are to be understood here as algorithms that explicitly compute the kernel function and that perform batch inference by minimizing/integrating an expression of the dimension m, where m is the number of measured data points. Conversely, all classical algorithms work sequentially, performing inference without explicitly computing the kernel function.
For one-dimensional problems, that is, ODEs or dynamical systems, Kalman filter or graphical model-based methods concentrate on the chain-like structure of the model. They give rise to many O(N) algorithms for computing marginal means, marginal variances or the marginal likelihood, where N is the number of discretization steps. If only m measurements, $m \ll N$, are given, this effort can be reduced to O(m) with a little precomputation, summarizing many small steps without observations into one large step. In contrast, kernel-based methods working with the full covariance matrix typically scale around $O(m^3)$ for regression or computing the marginal likelihood. Furthermore, such methods have to compute the kernel function for the given dynamical system. Using the Fourier framework described in Section 4.2, the fast Fourier transform takes O(N log N) time, and
3282 F. Steinke, B. Schölkopf / Pattern Recognition 41 (2008) 3271 -- 3286
using the state-space model, the kernel is given explicitly by Eq. (10). One advantage of the kernel view for dynamical systems is that it yields direct access to all pairwise marginal distributions, even for non-neighbouring points, which is not obvious with sequential algorithms.

For multidimensional problems, that is, in PDEs, the kernel method's view on the joint problem is more useful in practical terms, since message-passing is difficult due to many loops and is not guaranteed to yield the optimal solution [47]. However, in this case, too, the kernel cannot be computed analytically but has to be derived either through a fast Fourier transform or, in the worst case, through matrix inversion, which scales like $O(N^3)$. If one aims at estimating the whole latent function $f : \mathcal{X} \to \mathbb{R}$, then direct optimization of problem (1) may be advantageous in comparison with computing the kernels first and then optimizing the kernelized problem. For example, in graph-based learning one typically solves the estimation problem directly in the so-called primal. However, if the graph were given in advance and the labels of the nodes were only uncovered at a later time, it would be advantageous to precompute the kernel functions, since regression to yield all of $f : \mathcal{X} \to \mathbb{R}$ could then be performed in $O(m^3)$ instead of $O(N^3)$.

In sum, one could say that the connection between kernels and DEs will typically not yield faster or better algorithms, except in a few special cases. However, it may help to gain deeper theoretical understanding of both kernel methods and DEs. For example, the connection presented shows that given a state-space model and measurements, the posterior covariances between states at different time points are not dependent on the observations; they are simply given through the covariance matrix $K$. This insight is not obvious from looking at the Kalman update equations. Conversely, the existence of an $O(N)$ inversion algorithm for tridiagonal matrices is not surprising when formulating the inversion in terms of a Kalman filter state estimation problem.

6.1. Nonlinear extensions

This paper has so far solely focused on linear DEs or equivalently on linear regularization operators. However, there is great interest in nonlinear models in many fields, and it is natural to ask whether any of the insights presented above carry over to such a situation.

The disappointing answer is that most of the results are critically dependent on the linearity assumption. If $R$ is not a linear operator, then $\|Rf\|$ does not define a norm. Also, interpreting the kernel as the Green's function of $R^T R$, that is, the solution of $R^T R K_{x_i} = \delta_{x_i}$, does not make sense, since the solution of nonlinear differential problems $Rf = u$ cannot in general be represented in terms of a linear sum of such Green's functions as in the linear case. Also, corresponding probability distributions over functions $f : \mathcal{X} \to \mathbb{R}$ are then, in general, not Gaussian any more, and often cannot be described through an analytic expression at all.

Kernel methods are sometimes used for nonlinear systems, typically in the form that $x_{i+1} = f(x_i)$, where $f : \mathbb{R}^n \to \mathbb{R}^n$ is described by a kernel regression. However, such kernel methods should not be mixed up with the type of kernels we discussed here, since in this paper the kernels were functions of time, not of the preceding state. Furthermore, such one-step-ahead prediction with kernels is not associated with a Gaussian process over trajectories in $H$, nor does it yield an SVR problem of type (1) over trajectories.

While these are strong negative statements, the dual view of DEs (either in terms of local conditional distributions or, more kernel-like, as joint distributions over whole functions) may still help to shape intuitions for the nonlinear case and may help to develop new approximate inference algorithms. For example, [49] investigate the joint $N$-dimensional state distribution of a nonlinear DE, and approximate it using an $N$-variate GP distribution corresponding to a low order linear DE. Their key calculation is motivated in finite dimensions and is then extended to continuous domains. Conversely, one could also ask whether sequential inference schemes for nonlinear DEs such as the extended Kalman filter, the unscented Kalman filter [50], or sequential Monte Carlo methods [51] can be transferred to other, potentially multivariate, nonlinear kernel-like problems.

7. Conclusion

We have presented a joint framework for kernels, RKHSs, Gaussian processes and regularization operators. All these objects are closely related to each other. Given the theoretical framework, it is natural to see stochastic linear DEs as important examples of regularization operators. We have discussed ordinary as well as partial linear differential equations.

While the exposition is kept simple through the use of the finite domain assumption, note that most results also hold for infinite/continuous domains and we hope the readers will be able to realize this when making comparisons with existing work. An exact treatment for infinite, continuous domains often requires advanced mathematical machinery [21,22,26], and we have thus concentrated on the finite dimensional case, which mostly yields qualitatively similar results.

A good understanding of all the mentioned interrelations between different methods and communities will help the readers to select suitable algorithms for specific problems and may guide their intuition in developing new methods, for example, for dealing with nonlinear DEs. One potential future application may be to explore the meaning of kernel PCA [19] for kernels derived from dynamical systems, which to our knowledge has not yet been studied.

Appendix A. Complex-valued functions and kernels

For finite domains $\mathcal{X}$, complex-valued functions $f : \mathcal{X} \to \mathbb{C}$ are isomorphic to elements in $\mathbb{C}^N = H$. Some basics of linear algebra in $\mathbb{C}^N$ are as follows: Set $f^* = \bar{f}^T$. The inner product in $\mathbb{C}^N$ is $f^* g = \sum_i \overline{f(x_i)}\, g(x_i)$ and thus satisfies $f^* g = \overline{g^* f}$. A matrix $A$ is called symmetric or hermitian, if $A^* = \bar{A}^T = A$. Hermitian matrices have real eigenvalues $\lambda_i$ and an orthogonal basis of eigenfunctions $\{u_i\}_{i=1,\dots,N}$, thus, $f^* A f$ is real for any $f \in H$.

Complex-valued algebra does not interfere with the kernel framework. All definitions, theorems and proofs of Section 3 hold if the functions are understood as complex-valued and the appropriate inner product is used. For example the positive definite kernel condition then states that $\sum_{i,j} \bar{\alpha}_i \alpha_j k(x_i, x_j) > 0$, where the sum is real-valued, since $K$ is a hermitian matrix by assumption. We will not be more explicit here, but just state the following theorem, that shows that the complex-valued theory consistently reduces to the real-valued one described in Section 3, if all involved entities are in fact real.

Theorem 12. With the notation of the SVR objective (3) and the Representer theorem 8 the following holds: if the observation values $\{y_i \mid i = 1, \dots, m\}$ and the kernel $K$ are real-valued and the loss term is a non-decreasing function of $|f_\alpha(x_i) - y_i|$, then the function $f_\alpha : \mathcal{X} \to \mathbb{C}$ minimizing (3) is real-valued and additionally all coefficients $\alpha$ in Theorem 8 are real.

Proof. Assume $f = f^R + i f^I \in H$, $f^R, f^I \in \mathbb{R}^N$. Then

$$\|f\|_K^2 = \|f^R\|_K^2 + \|f^I\|_K^2 + \underbrace{2\,\mathrm{Im}\big((f^I)^T K^{-1} f^R\big)}_{=0,\ \text{as } K \text{ is real}} \tag{A.1}$$
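The decomposition (A.1) can be checked numerically. The following is a minimal sketch, assuming a randomly generated real, symmetric, positive definite matrix as a stand-in for $K$ and the norm $\|f\|_K^2 = f^* K^{-1} f$; the variable names and the helper `k_norm_sq` are illustrative and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8

# Hypothetical real, symmetric, positive definite kernel matrix K (an assumption).
B = rng.standard_normal((N, N))
K = B @ B.T + N * np.eye(N)
K_inv = np.linalg.inv(K)

# A complex-valued function on the finite domain, f = f_re + i * f_im.
f_re = rng.standard_normal(N)
f_im = rng.standard_normal(N)
f = f_re + 1j * f_im

def k_norm_sq(g):
    """Squared K-norm g^* K^{-1} g; np.vdot conjugates its first argument."""
    return np.vdot(g, K_inv @ g)

lhs = k_norm_sq(f)                                   # complex number, imaginary part ~ 0
rhs = k_norm_sq(f_re).real + k_norm_sq(f_im).real    # sum of the two real K-norms
cross = 2.0 * np.imag(f_im @ K_inv @ f_re)           # cross term of (A.1), zero for real K

print(abs(lhs.real - rhs), abs(lhs.imag), cross)
```

Because the cross term vanishes, discarding the imaginary part of $f$ can only decrease the norm, and for real observations $y_i$ it cannot increase $|f(x_i) - y_i|$ either, which is in line with the statement of Theorem 12.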
$$f^* K^c f \ge 0, \quad \forall f \in H. \tag{B.1}$$

The covariance operator is then related to the regularization operator $R^c$ as

$$K^c = (R^{c*} R^c)^+. \tag{B.2}$$

Note that the null space of $K^c$ is also $P$. The corresponding Gaussian process $p_{K^c}(f)$ has the form

$$p_{K^c}(f) = N_U(0, K^c) \propto \exp\big(-\tfrac{1}{2} \|R^c f\|^2\big), \tag{B.3}$$

where $N_U(\cdot, \cdot)$ is an unnormalized Gaussian density. If the dimension $M$ of the null space $P$ is greater than zero, then $p_{K^c}(f)$ cannot

where $\tilde{K}^c$ is the operator given as $\tilde{K}^c_{ij} = k^c(x_i, x_j)$.

In other words, if $f = \sum_{i=1}^m \alpha_i \delta_{x_i}$, $\alpha \neq 0$, and $f^* p = 0$ $\forall p \in P$, then $f^* \tilde{K}^c f > 0$. Or equivalent but shorter, $\tilde{K}^c$ is positive definite on $P^\perp$.

It is important to note that the operator $\tilde{K}^c$, which is composed from the cpd kernel function values, is not necessarily equal to the covariance operator $K^c$, and there exist famous counterexamples, e.g. thin-plate spline kernel functions. The definition of a cpd kernel function with respect to $P$ just implies that $\tilde{K}^c$ be positive definite on $P^\perp$; it does not make any claim about the behaviour on $P$. For example, thin-plate spline kernels [26] yield matrices $\tilde{K}^c$ which
have $f^* \tilde{K}^c f < 0$ for some $f \in P$. This contradicts the positive semi-definiteness assumption of the covariance operator $K^c$, which was enforced since surely $\|f\|_{K^c}^2 = \|R^c f\|^2 \ge 0$ for all $f \in H$.

This problem can be circumvented by setting

$$K^c = (1 - P)\tilde{K}^c(1 - P). \tag{B.7}$$

Due to the projection step the assignment of a cpd kernel function to a covariance operator is not unique. If $\{p_i\}_{i=1,\dots,M}$ is an orthonormal basis of $P$, then Eq. (B.7) implies that

$$
\begin{aligned}
K^c_{ij} &= \delta_{x_i}^* (1 - P)\tilde{K}^c(1 - P)\delta_{x_j} \\
&= k^c(x_i, x_j) - \sum_l p_l(x_i)\,\big(p_l^* \tilde{K}^c_{x_j}\big) - \sum_m \big(\tilde{K}^{c*}_{x_i} p_m\big)\,\overline{p_m(x_j)} + \sum_{l,m} p_l(x_i)\,\big(p_l^* \tilde{K}^c p_m\big)\,\overline{p_m(x_j)}.
\end{aligned}
\tag{B.8}
$$

Note that above we have made an important assumption that does not in general hold for infinite domains and thus requires a slightly different formalism when extended to this setting. We have assumed that an $L_2$-type inner product exists in $H$. While we could restrict the space of functions $H$ to $L_2(\mathcal{X})$ for infinite domains, this is not natural for our purposes. Since we aim at regularizing with $\|R^c f\|$ we only need this expression to be well defined. We do not need

Table B1
Summary of the objects of the conditionally positive definite kernel framework and their interrelations

Entity | Symbol | Relations
cpd kernel func. | $k^c : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ | $k^c(x_i, x_j) = \tilde{K}^c_{ij}$
Covariance op. | $K^c : H \to H$ | $K^c = (1 - P)\tilde{K}^c(1 - P)$; $K^c = (R^{c*} R^c)^+$
Native space | $(\cdot,\cdot)_{K^c} : H \times H \to \mathbb{C}$ | $(f, g)_{K^c} = f^* K^{c+} g = f^* R^{c*} R^c g$
 | $\|\cdot\|_{K^c} : H \to \mathbb{R}$ | $\|f\|_{K^c} = (f, f)_{K^c}^{1/2} = \|R^c f\|$
Gaussian process | $p_{K^c} : H \to \mathbb{R}$ | $p_{K^c}(f) = N_U(0, K^c)$; $p_{K^c}(f) \propto \exp(-\tfrac{1}{2}\|f\|_{K^c}^2) \propto \exp(-\tfrac{1}{2}\|R^c f\|^2)$
Regularization op. | $R^c : H \to G$ | ($R^c = \sqrt{K^{c+}}$, not unique)

Note that condition (B.10) ensures that $\sum_{i=1}^m \alpha_i \delta_{x_i} \in P^\perp$. Furthermore, it is $\sum_{i=1}^m \alpha_i K^c_{x_i} = K^c \sum_{i=1}^m \alpha_i \delta_{x_i}$, and $K^c_{x_i}$ and $\tilde{K}^c_{x_i}$ just differ by an element of $P$. Thus, one could replace $K^c_{x_i}$ in Eq. (B.9) by $\tilde{K}^c_{x_i}$ without changing the expression. Practically that means that we can work directly with the cpd kernel function when performing SVR and do not have to use the more complicated expression (B.8) which includes projections.

Proof. The theorem states that $f(x_i) = \sum_{j=1}^m \alpha_j K^c_{ji} + \sum_{j=1}^M \beta_j p_j(x_i)$, $i =$
B.4. GP inference

The decomposition in Lemma 16 is also the key to compute the marginals of an unnormalized Gaussian process. As in Section 3.3 we will call this the GP representer theorem for the cpd case.

Theorem 18. For $X \subseteq \mathcal{X}$ unisolvent with respect to $P$, the marginal distribution $p_{K^c}(f_X) \propto N_U(0, M^+)$ under the joint GP $p_{K^c}(f) \propto N_U(0, K^c)$ is given by

$$M = K_X^c - K_X^c T (T^* K_X^c T)^+ T^* K_X^c, \tag{B.18}$$

where $\{p_j\}_{j=1,\dots,M}$ is a basis of $P$ and $T_{ij} = p_j(x_i)$.

Proof. By Lemma 16 any $f \in H$ can be written as $f = \sum_{i=1}^m \alpha_i K^c_{x_i} +$

$$
\begin{aligned}
\|f\|_{K^c}^2 &= \alpha^* K^c_X \alpha + \|q\|_{K^c}^2 \\
&= f_X^* \big(K_X^c - K_X^c T (T^* K_X^c T)^+ T^* K_X^c\big) f_X + \|q\|_{K^c}^2 \\
&= f_X^* M f_X + \|q\|_{K^c}^2.
\end{aligned}
$$

From that it follows that

$$
\begin{aligned}
p(f_X) &\propto \int \exp\big(-\tfrac{1}{2}\|R^c f\|^2\big)\, df_{\mathcal{X}\setminus X} \\
&\propto \exp\big(-\tfrac{1}{2} f_X^* M_X^+ f_X\big) \underbrace{\int \exp\big(-\tfrac{1}{2}\|q\|_{K^c}^2\big)\, df_{\mathcal{X}\setminus X}}_{=\text{const}} \\
&\propto \exp\big(-\tfrac{1}{2} f_X^* M_X^+ f_X\big).
\end{aligned}
$$

B.5. Transitions between the cpd and the positive definite world

Imagine a family of regularization operators $R_\varepsilon : H \to G$ continuously parameterized by $\varepsilon \in U$ where $U \subseteq \mathbb{R}$ is an open neighbourhood of 0. Assume that $R_\varepsilon$ is one-to-one for all $\varepsilon$ except for $\varepsilon = 0$. Thus, for $\varepsilon = 0$ we have to use the cpd framework, for $\varepsilon \neq 0$ we should use the positive definite scheme. However, the limit of $K_\varepsilon$ for $0 \neq \varepsilon \to 0$ is not equal to $K^c_{\varepsilon=0}$. The limit does not even exist since in the positive definite case the kernel is the inverse of $R_\varepsilon^* R_\varepsilon$, which diverges for $\varepsilon \to 0$. On the other hand, the SVR objective function

$$V(\varepsilon, f) \equiv \|R_\varepsilon f\|^2 + C\,\mathrm{Loss}(\{(x_i, y_i, f(x_i)) \mid i = 1, \dots, m\}) \tag{B.19}$$

depends continuously on $\varepsilon$. Thus one might hope that the minimizer also depends continuously on $\varepsilon$.

The following theorem, which is novel to our knowledge, shows that this apparent problem of continuity can be resolved. It shows especially that, while the kernel is diverging for $\varepsilon \to 0$, the SVR solution for $\varepsilon \neq 0$ converges for $\varepsilon \to 0$, and that the limiting element is equal to the cpd SVR solution for $\varepsilon = 0$.

Theorem 19. Let $R_\varepsilon : H \to G$ depend continuously differentiably on $\varepsilon \in U$, $U \subseteq \mathbb{R}^d$ an open neighbourhood of 0, and let $R_\varepsilon$ be one-to-one if and only if $\varepsilon \neq 0$. Let $P$ be the null space of $R_{\varepsilon=0}$. Furthermore, let $X = \{x_i \mid i = 1, \dots, m\} \subseteq \mathcal{X}$, $m \le N$, be a set of distinct points unisolvent with respect to $P$ with corresponding observations $\{y_i \mid i = 1, \dots, m\} \subseteq \mathbb{C}$. The minimizer $f_\varepsilon = \arg\min_{f \in H} V(\varepsilon, f)$ depends continuously on $\varepsilon$, if $\mathrm{Loss}(\{(x_i, y_i, f(x_i)) \mid i = 1, \dots, m\})$ is strictly convex and twice continuously differentiable with respect to the $f(x_i)$.

Proof. As a first step note that $V(\varepsilon, f)$ is strictly convex in $f$ for all $\varepsilon \in U$. Both $\|R_\varepsilon f\|^2$ and $\mathrm{Loss}(\{(x_i, y_i, f(x_i)) \mid i = 1, \dots, m\})$ are convex with respect to $f$ for all $\varepsilon$. If $\varepsilon \neq 0$ then $\|R_\varepsilon f\|^2$ is strictly convex and so is the sum (“strictly convex + convex = strictly convex”). If $\varepsilon = 0$ then $\|R_\varepsilon f\|^2$ is constant in the direction of vectors $p \in P$. However, for these $p$ at least one of the $p(x_i)$, $i = 1, \dots, m$, is not equal to zero since $X$ is unisolvent. Thus, the loss term is strictly convex with respect to $\lambda$ along $f + \lambda p$, and so is the whole objective function.

Since $V(\varepsilon, f)$ is strictly convex in $f$ and continuously differentiable, the unique minimum for given $\varepsilon$ is determined by

$$F(\varepsilon, f) \equiv \frac{\partial}{\partial f} V(\varepsilon, f) = 0.$$

By assumption $F : U \times \mathbb{C}^N \to \mathbb{C}^N$ is continuously differentiable and

Given this theorem one could argue that the cpd framework is unnecessary: if the goal is to regularize with a non-one-to-one operator $R$ one could just use a slightly perturbed version of $R$ which actually is one-to-one and for which one could use the positive definite framework. The solution of an SVR would then not differ very much from the unperturbed result. However, if $R^* R$ is nearly singular the corresponding covariance operator $K = (R^* R)^{-1}$ will have some large values. Computations with such a kernel will then be numerically unstable, and it is better to use the cpd framework instead.

Appendix C. Additional proofs

In finite domains, $H$ with any inner product $(\cdot,\cdot)_S$ is an RKHS, also with the usual $L_2$ inner product. To see this note that in $\mathbb{R}^N$ all norms are equivalent and $|\delta_{x_i}(f)| = |f(x_i)| \le \|f\|_1 \le C\|f\|_S$.

Proof (Lemma 5). (1) Riesz's theorem.
(2) Since the functionals $\delta_{x_i}$ are linearly independent, so are their representers $S_{x_i}$. Then for $\alpha \neq 0$ it is $\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j s(x_i, x_j) = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j (S_{x_i}, S_{x_j})_S = \big\|\sum_{i=1}^m \alpha_i S_{x_i}\big\|_S^2 > 0$.
(3) Set $T_{ij} = (\delta_{x_i}, \delta_{x_j})_S$. Then for any $f = \sum_i f(x_i)\delta_{x_i}$, $g = \sum_i g(x_i)\delta_{x_i}$, it is $(f, g)_S = \sum_{i,j} f(x_i) g(x_j) (\delta_{x_i}, \delta_{x_j})_S = \sum_{i,j} f(x_i) g(x_j) T_{ij} = f^T T g$.
(4) Using the reproducing property on $\delta_{x_i}$, $\delta_{ij} = (S_{x_i}, \delta_{x_j})_S = \delta_{x_i}^T S T \delta_{x_j}$ and $\delta_{ij} = (\delta_{x_i}, S_{x_j})_S = \delta_{x_i}^T T S \delta_{x_j}$ for all $x_i, x_j \in \mathcal{X}$ implies the claim.
(5) Since necessarily $S = T^{-1}$ and $T$ uniquely defines the inner product, the last claim follows.

Proof (Lemma 7). $f$ is the sum of a part $f_\alpha$ in the span of the $K_{x_i}$, $x_i \in X$, and the $K$-orthogonal complement $q$. The orthogonality condition $(K_{x_i}, q)_K = 0$ implies $q(x_i) = 0$. Since $K$ is positive definite, so is the submatrix $K_X$. Therefore the system $f_X = K_X \alpha$ is uniquely solvable for $\alpha \in \mathbb{R}^m$.

Proof (Theorem 8). Following Lemma 7, any $f \in H$ can be written as $f = f_\alpha + q$ with $(f_\alpha, q)_K = 0$. The objective can then be written as

$$\alpha^T K_X \alpha + \|q\|_K^2 + C\,\mathrm{Loss}(\{(x_i, y_i, f_\alpha(x_i))\}_{i=1,\dots,m}).$$

The loss term is independent of $q$ because $q(x_i) = 0$, $i = 1, \dots, m$, and thus the objective is minimized for $q = 0$. Convexity of the loss and the uniqueness of the map between $f$ and $\alpha$, Lemma 7, imply that the whole objective here is convex in $\alpha$. Thus, the minimum is unique in this case.
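The argument of Lemma 7 and Theorem 8 can be illustrated numerically on a finite domain. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes a squared loss, takes $R^T R$ to be a path-graph Laplacian plus $0.1$ times the identity (so that $K = (R^T R)^{-1}$ exists), and uses the hypothetical helper names `path_laplacian`, `solve_primal` and `solve_kernel`. It compares direct minimization over all $N$ function values with the kernel expansion $f = \sum_i \alpha_i K_{x_i}$, whose coefficients solve an $m \times m$ system.

```python
import numpy as np

def path_laplacian(n):
    """Second-difference (path-graph Laplacian) matrix on n points."""
    L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    L[0, 0] = L[-1, -1] = 1.0
    return L

def solve_primal(A, idx, y, C):
    """Minimize f^T A f + C * sum_i (f[idx_i] - y_i)^2 over all N values of f."""
    n = A.shape[0]
    S = np.zeros((len(idx), n))
    S[np.arange(len(idx)), idx] = 1.0          # selects the measured points
    return np.linalg.solve(A + C * S.T @ S, C * S.T @ y)

def solve_kernel(A, idx, y, C):
    """Same problem via the kernel K = A^{-1}: only an m x m system is solved."""
    K = np.linalg.inv(A)
    K_X = K[np.ix_(idx, idx)]                  # kernel matrix on the measured points
    alpha = np.linalg.solve(K_X + np.eye(len(idx)) / C, y)
    return K[:, idx] @ alpha, alpha            # f = sum_i alpha_i K_{x_i}

N, C = 50, 10.0
A = path_laplacian(N) + 0.1 * np.eye(N)        # assumed R^T R, positive definite
idx = np.array([3, 10, 22, 35, 47])            # indices of the m measured points
y = np.array([0.5, -1.0, 2.0, 0.0, 1.5])       # made-up observations

f_primal = solve_primal(A, idx, y, C)
f_kernel, alpha = solve_kernel(A, idx, y, C)
max_diff = np.max(np.abs(f_primal - f_kernel))
print(max_diff)
```

Both routes should agree up to rounding error, which reflects that the minimizer has no component $q$ outside the span of the $K_{x_i}$. For the squared loss assumed here, the stationarity condition reduces to $(K_X + I/C)\,\alpha = y$.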
References

[1] W. Kienzle, G. Bakir, M. Franz, B. Schölkopf, Face detection: efficient and rank deficient, Advances in Neural Information Processing Systems, vol. 17, MIT Press, Cambridge, MA, 2005, pp. 673–680.
[2] L. Zhang, B. Wu, R. Nevatia, Detection and tracking of multiple humans with extensive pose articulation, in: Proceedings of the 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[3] X. Song, G. Iordanescu, A. Wyrwicz, One-class machine learning for brain activation detection, in: Proceedings of the 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[4] G. Sanguinetti, M. Milo, M. Rattray, N. Lawrence, Accounting for probe-level noise in principal component analysis of microarray data, Bioinformatics 21 (19) (2005) 3748–3754.
[5] N.D. Lawrence, J. Quiñonero-Candela, Local distance preservation in the GP-LVM through back constraints, in: Proceedings of the International Conference in Machine Learning, 2006, pp. 513–520.
[6] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[7] C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discovery 2 (2) (1998) 121–167.
[8] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002. URL http://www.learning-with-kernels.org.
[9] J. Weston, C. Watkins, Support vector machines for multiclass pattern recognition, in: Proceedings of the Seventh European Symposium on Artificial Neural Networks, 1999.
[10] A. Smola, B. Schölkopf, A tutorial on support vector regression, Stat. Comput. 14 (3) (2004) 199–222.
[11] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, R. Williamson, Estimating the support of a high-dimensional distribution, Neural Comput. 13 (7) (2001) 1443–1471.
[12] O. Chapelle, B. Schölkopf, A. Zien, Semi-supervised Learning, MIT Press, Cambridge, MA, 2006.
[13] B. Schölkopf, A. Smola, K. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319.
[14] C. Williams, D. Barber, Bayesian classification with Gaussian processes, IEEE Trans. Pattern Anal. Mach. Intell. 20 (12) (1998) 1342–1351.
[15] M. Opper, O. Winther, Gaussian processes for classification: mean-field algorithms, Neural Comput. 12 (11) (2000) 2655–2684.
[16] C. Williams, C. Rasmussen, Gaussian processes for regression, Advances in Neural Information Processing Systems, vol. 8, 1996, pp. 514–520.
[17] N. Lawrence, M. Seeger, R. Herbrich, Fast sparse Gaussian process methods: the informative vector machine, Advances in Neural Information Processing Systems, vol. 15, 2003, pp. 609–616.
[18] N. Lawrence, Gaussian process latent variable models for visualisation of high dimensional data, Advances in Neural Information Processing Systems, vol. 16, 2004.
[19] A. Smola, B. Schölkopf, K. Müller, The connection between regularization operators and support vector kernels, Neural Networks 11 (4) (1998) 637–649.
[20] F. Girosi, M. Jones, T. Poggio, Priors stabilizers and basis functions: from regularization to radial, tensor and additive splines, A.I. Memo No. 1430, MIT Press, Cambridge, MA, 1993.
[21] V. Bogachev, Gaussian Measures, AMS, New York, 1998.
[22] B. Oksendal, Stochastic differential equations: an introduction with applications, sixth ed., Springer, Berlin, 2002.
[23] C.E. Rasmussen, C.K. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
[24] G. Wahba, Spline models for observational data, SIAM, Philadelphia, PA, 1990.
[25] J.O. Ramsay, B.W. Silverman, Functional Data Analysis, second ed., Springer, Berlin, 2005.
[26] H. Wendland, Scattered Data Approximation, Cambridge University Press, Cambridge, UK, 2005.
[27] L. Ljung, System Identification: Theory for the User, second ed., Prentice-Hall, Upper Saddle River, NJ, 1999.
[28] G. Kimeldorf, G. Wahba, A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, Ann. Math. Stat. 41 (2) (1970) 495–502.
[29] W. Madych, S. Nelson, Multivariate interpolation and conditionally positive definite functions. II, Math. Comput. 54 (189) (1990) 211–230.
[30] A. Smola, B. Schölkopf, K.-R. Müller, The connection between regularization operators and support vector kernels, Neural Networks 11 (1998) 637–649.
[31] T. Graepel, Solving noisy linear operator equations by Gaussian processes: application to ordinary and partial differential equations, in: Proceedings of the 20th International Conference on Machine Learning, vol. 20, 2003, pp. 234–241.
[32] M. Hein, O. Bousquet, Kernels, associated structures and generalizations, Technical Report 127, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 2004.
[33] N.E. Heckman, J.O. Ramsay, Penalized regression with model-based penalties, Can. J. Stat. 28 (2000) 241–258.
[34] F. Steinke, B. Schölkopf, Machine learning methods for estimating operator equations, in: Proceedings of the 14th IFAC Symposium on System Identification, SYSID06, Elsevier, Amsterdam, 2006, pp. 1–6.
[35] F. Girosi, M. Jones, T. Poggio, Regularization theory and neural network architectures, Neural Comput. 7 (1995) 219–267.
[36] C. Walder, B. Schölkopf, O. Chapelle, Implicit surface modelling with a globally regularised basis of compact support, Comput. Graphics Forum 25 (3) (2006) 635–644.
[37] C. Micchelli, M. Pontil, On learning vector-valued functions, Neural Comput. 17 (1) (2005) 177–204.
[38] R. Curtain, H. Zwart, An Introduction to Infinite Dimensional Linear Systems Theory, Springer, Berlin, 1995.
[39] S. Bochner, Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse, Math. Ann. 108 (1933) 378–410.
[40] R. Kalman, A new approach to linear filtering and prediction problems, J. Basic Eng. 82 (1) (1960) 35–45.
[41] M. Floater, K. Hormann, Surface parameterization: a tutorial and survey, Advances in Multiresolution for Geometric Modelling, vol. 1, Springer, Berlin, 2005.
[42] M. Hein, J.-Y. Audibert, U. von Luxburg, Graph Laplacians and their convergence on random neighborhood graphs, J. Mach. Learn. Res. 8 (2007) 1325–1370.
[43] A. Smola, R. Kondor, Kernels and regularization on graphs, in: Proceedings of the Conference on Learning Theory, Springer, Berlin, 2003.
[44] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the 20th International Conference on Machine Learning, vol. 20, 2003.
[45] M. Tipping, C. Bishop, Bayesian image super-resolution, in: Advances in Neural Information Processing Systems, vol. 15, 2003, pp. 1279–1286.
[46] U. von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007) 395–416.
[47] M. Jordan, Z. Ghahramani, T. Jaakkola, L. Saul, An introduction to variational methods for graphical models, Mach. Learning 37 (2) (1999) 183–233.
[48] Y. Huang, W. McColl, Analytical inversion of general tridiagonal matrices, J. Phys. A: Math. General 30 (1997) 7919–7933.
[49] C. Archambeau, D. Cornford, M. Opper, J. Shawe-Taylor, Gaussian process approximations of stochastic differential equations, J. Mach. Learn. Res., in: Workshop and Conference Proceedings, vol. 1, 2007, pp. 1–16.
[50] S. Julier, J. Uhlmann, A new extension of the Kalman filter to nonlinear systems, in: I. Kadar (Ed.), Proceedings of the Conference on Signal Processing, Sensor Fusion, and Target Recognition VI, vol. 3068, 1997, pp. 182–193.
[51] A. Doucet, N. de Freitas, N. Gordon, Sequential Monte Carlo Methods in Practice, Springer, Berlin, 2001.
[52] H. Heuser, Lehrbuch der Analysis, Teil 2, B. G. Teubner, Stuttgart, Germany, 1991.
About the Author—FLORIAN STEINKE earned a Diplom in Physics from the Eberhard-Karls-Universität, Tübingen, in 2005. He is currently pursuing a Ph.D. at the Max Planck
Institute for Biological Cybernetics in Tübingen under the supervision of Bernhard Schölkopf. His diploma thesis on Modeling Human Heads with Implicit Surfaces won the
DAGM-SMI prize for the best diploma thesis on Pattern Recognition in Germany in the academic year 2005–2006.
About the Author—BERNHARD SCHÖLKOPF was born in Stuttgart on 20 February, 1968. He received an M.Sc. in Mathematics and the Lionel Cooper Memorial Prize from
the University of London in 1992, followed in 1994 by the Diplom in Physics from the Eberhard-Karls-Universität, Tübingen. Three years later, he obtained a Doctorate in
Computer Science from the Technical University Berlin. His thesis on Support Vector Learning won the annual dissertation prize of the German Association for Computer
Science (GI). In 1998, he won the prize for the best scientific project at the German National Research Center for Computer Science (GMD). He has researched at AT&T Bell
Labs, at GMD FIRST, Berlin, at the Australian National University, Canberra, and at Microsoft Research Cambridge (UK). He has taught at Humboldt University, Technical
University Berlin, and Eberhard-Karls-University Tübingen. In July 2001, he was appointed scientific member of the Max Planck Society and director at the MPI for Biological
Cybernetics; in October 2002, he was appointed Honorarprofessor for Machine Learning at the Technical University Berlin. In 2006, he received the J.K. Aggarwal Prize of
the International Association for Pattern Recognition. He has been the program chair of COLT and NIPS and serves on the editorial boards of JMLR, IEEE PAMI and IJCV.