
Pattern Recognition 41 (2008) 3271–3286

Kernels, regularization and differential equations


Florian Steinke, Bernhard Schölkopf
Max-Planck-Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany

ARTICLE INFO

Article history: Received 18 March 2008; received in revised form 3 June 2008; accepted 5 June 2008.

Keywords: Positive definite kernel; Differential equation; Gaussian process; Reproducing kernel Hilbert space

ABSTRACT

Many common machine learning methods such as support vector machines or Gaussian process inference make use of positive definite kernels, reproducing kernel Hilbert spaces, Gaussian processes, and regularization operators. In this work these objects are presented in a general, unifying framework and interrelations are highlighted.

With this in mind we then show how linear stochastic differential equation models can be incorporated naturally into the kernel framework. And vice versa, many kernel machines can be interpreted in terms of differential equations. We focus especially on ordinary differential equations, also known as dynamical systems, and it is shown that standard kernel inference algorithms are equivalent to Kalman filter methods based on such models.

In order not to cloud qualitative insights with heavy mathematical machinery, we restrict ourselves to finite domains, implying that differential equations are treated via their corresponding finite difference equations.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Kernel methods are commonly used in the field of pattern recognition. For example, the authors of Ref. [1] have developed a support vector machine (SVM)-based face detector that works in real time on video data, and Ref. [2] uses SVMs for the tracking of humans with extensive pose articulation. Moreover, unsupervised detection of brain activation patterns is explored by [3] using one-class SVMs. The authors of Ref. [4] determine structured error patterns in microarray data using probabilistic kernel methods, and Ref. [5] uses a similar approach for processing motion capture data. Many such pattern recognition methods use SVMs for binary classification [6–8]. However, kernel methods are also employed for multi-class classification [9], regression [10], novelty detection [11], semi-supervised learning [12] and dimensionality reduction [13]. Gaussian processes (GPs) are the Bayesian versions of kernel methods. They have also been applied to classification [14,15], regression [16,17] or dimensionality reduction [18]. All these kernel methods are built around some common notions and objects, which are explained in this paper in a simple unifying way.

As depicted in Fig. 1, support vector machines can be thought of as follows. They first map the training and test input data into a potentially infinite dimensional feature space, a reproducing kernel Hilbert space (RKHS), and then classify the data with the help of a separating hyperplane. Since there are often many hyperplanes that separate the training data points, SVMs select the hyperplane with the largest margin, that is, the largest distance between the hyperplane and the data points. However, what is the intuitive meaning of distance in this feature space? One way to understand such distances is to explicitly choose a specific feature function Φ of which all components have some problem-dependent meaning. However, often the RKHS and its corresponding norm are only defined implicitly via the choice of a kernel function k(x, y) = Φ(x)^T Φ(y). In this case, the interpretation is not as straightforward. It was noted by Ref. [19] that any kernel function is related to a specific regularization operator. The present paper explains this connection in a simple but very general form, and we show how it can help to better understand SVMs and other related kernel machines.

Furthermore, it turns out that for the commonly used Gaussian (RBF) kernel, the feature space is a subset of the space of all functions from the input domain to the real numbers, and the corresponding regularization operator is an infinite sum of derivative operators [20]. We generalize this result and show that all translation-invariant kernel functions are related to differential operators. The corresponding homogeneous differential equations (DEs) are a useful tool for understanding the meaning of specific kernel functions. However, we could also exploit this relation in the inverse direction and construct kernels that are specifically adapted to problems involving DE models. To make this point clearer, let us consider a simple regression example from physics, which can be visualized easily and which we will thus use throughout the paper. Assume that we have acquired

measurements of a pendulum's position at given time instances, as depicted in Fig. 2. We are then interested in two problems:

Firstly, we will discuss how to optimally reconstruct the full time course of the pendulum's position. The pendulum's dynamics can be described approximately by a simple linear DE, and estimating the full state trajectory from few measurements is equivalent to classical state estimation in linear dynamical systems. For this task one typically employs a variant of the Kalman filter. On the other hand, the problem of reconstructing a function from a finite number of measurements is also the goal of non-parametric regression techniques, such as the kernel-based methods support vector machines/support vector regression (SVR) or GP inference. In this paper, we will show how the knowledge of a model DE can be included into kernel methods, and that these are closely related to Kalman filter-based approaches.

Fig. 1. Support vector machines map input data points via Φ into a potentially infinite dimensional feature space. The classification then proceeds by finding the separating hyperplane with the largest margin between the classes. However, what is the meaning of distance in this feature space? Especially if the feature space is only defined implicitly via a kernel function k(x, y) = Φ(x)^T Φ(y)?

Fig. 2. (left) Schematic view of a pendulum, and (right) 50 noisy measurements of the pendulum's angle φ(t_i) at times t_i, i = 1, ..., 50.

Secondly, we will explore how to learn about properties of the pendulum from the given measurements. In particular this will aim at determining parameters of the DE that characteristically describes the pendulum, a task that is commonly known as linear system identification. We will show how model selection methods for kernel methods such as cross-validation or marginal likelihood optimization can be used for system identification purposes. As for state estimation, these machine learning-inspired approaches turn out to be equivalent to other well-known system identification methods, such as prediction error methods.

Having these two objectives in mind, we will first describe kernel methods in a relatively broad way that is not specifically tailored towards DEs. However, this framework will allow us to straightforwardly understand the close links between linear DEs and kernel methods as a special case. We mostly focus on ordinary linear DEs, also known as dynamical systems, but will also give examples of linear partial differential equations (PDEs). Other linear operator equations could also be dealt with similarly. By DEs we will in this paper always mean stochastic DEs, since these can be nicely incorporated into kernel methods. Stochastic DEs are a superset of normal DEs, since any DE can be converted into a stochastic DE by setting the noise level to zero.

1.1. Finite domains

The current paper is formulated in terms of finite domains. Functions to be estimated are assumed to map finite domains to R or R^n. In the pendulum example imagine time to be discretized into many small time steps. The use of finite domains thus means that whenever we speak of DEs in this paper we actually mean discretized versions thereof, that is, the corresponding finite difference equations.

In the authors' opinion, finite domains are just the right level of simplification needed for an easy, yet very far-reaching exposition of the matter. The restriction to finite domains simplifies the required mathematics dramatically. Functions on finite domains are finite dimensional vectors, requiring only simple linear algebra for analysis instead of more involved functional analysis. Existence and convergence of sums/integrals is trivial for finite domains, and point evaluations are described by inner products with unit vectors instead of functionals involving Dirac-delta distributions. Finite domains also allow one to define Gaussian densities for function-valued random variables. This is not possible for infinite dimensional functions, at least not with respect to the standard Lebesgue measure, which does not exist for infinite dimensional function spaces [21].

Despite these important simplifications, little qualitative expression power is lost. Most well-known results on kernels can be easily derived and motivated for finite domains. Reasonably smoothly varying functions can be approximated well by their finite dimensional piecewise-linear counterparts, which, in most cases, allow DEs to be converted straightforwardly into qualitatively equivalent finite difference equations. Finally, there are also some common settings for machine learning that naturally deal with finite domains, for example graph-based or transductive learning.

There are, of course, also certain shortcomings of a finite domain approach. Generally speaking, we cannot answer questions regarding the limiting behaviour for ever smaller discretization steps. Note that while such limiting processes on continuous domains typically exist, see e.g. Ref. [22] for one-dimensional domains, they often have some additional surprising properties, some of which are at first sight in conflict with our understanding of the corresponding model for finite domains. For example, the sample paths of Brownian motion are continuous, yet nowhere differentiable [22]. This implies that the corresponding RKHS norm, defined below, is infinite for each sample path almost surely. While the RKHS is thus a null space under the measure of the continuous time process, the mean of non-parametric regression with a finite number of data points is nevertheless guaranteed to be an element of the RKHS, a very surprising fact. Also, if we define our models via discrete regularization operators or inverse covariances as defined below and then take the limit of step size to zero, then the marginal distributions of these continuous processes are often not identical to the finite distributions. For example, for the linear difference equation x_i = (1 + AΔt)x_{i−1} the exact discretization of the continuous analog would be x_i = exp(AΔt)x_{i−1}. While these expressions are similar for small step sizes Δt they are not identical. This fact is sometimes important for computational reasons, since by construction the discrete models often have some sparsity structures in the inverse covariance and these are not, in general, preserved for the marginals.

The aim of this paper is to offer a simple intuitive introduction to the kernel framework and to show its connections to DEs. We thus concentrate solely on finite domains. Note that this means that when speaking of processes in this paper, we just mean distributions

over functions on a given fixed finite domain. We do not make statements about what happens if one or more points are added to the domain of the model, and the defined processes are not assumed to be marginals of their continuous analogs.

1.2. Overview

The remainder of the paper is structured as follows: after introducing some notation in Section 2, we define in Section 3 a framework of basic objects used in kernel methods, and we explain how these objects are interrelated. Thereafter, we describe the use of these objects for SVR in Section 3.2, for GP regression in Section 3.3, and for vector-valued regression in Section 3.4. In Section 4, we discuss a typical kernel-machine regression model and show its relation to linear stochastic DEs. We demonstrate how to develop kernel functions from linear state-space models or higher-order DEs. We show that the resulting inference methods are equivalent to Kalman filter-based methods. The pendulum and other examples are presented in detail in Section 5. In Section 6 we discuss the practical implications of the link between kernel machines and linear stochastic DEs. We summarize our conclusions in Section 7.

For better readability, we have restricted the main part of the paper to real-valued kernels, and postpone the more natural, slightly more technical treatment involving complex numbers to Appendix A. It will appear throughout the text that, with regularization theory in mind, conditionally positive definite (cpd) kernels arise quite naturally. We have transferred all parts dealing with cpd kernels to Appendix B, where we present an extension of the kernel framework to cpd kernels.

1.3. Related work

Most of the mathematical results of this paper are not the authors' original work, but have been mentioned in different contexts before. Our contribution is to reformulate them in a unified, easily understandable framework, the simple language of finite domains. Furthermore, we reinterpret them to highlight parallels between kernel methods and linear DEs.

There is a large body of literature on kernels and DEs in many different communities, and we only cite some relevant books containing overviews of their respective fields as well as further references. Many machine learning-related facts about kernels and regularization methods are taken from Ref. [8], as well as Ref. [23] for the Bayesian interpretation. Sources in the statistics literature include [24,25], and in approximation theory [26]. For an overview of linear stochastic dynamical systems and their estimation we refer to Ref. [27].

The connection between stochastic processes and splines was first explored in Ref. [28]. It is also well known that thin-plate/cubic splines minimize the second derivative [29,26]. Connections between regularization operators and kernel functions are explained in Refs. [20,30], and general linear operator equations are solved with GPs in Ref. [31]. A unifying survey of the theory of kernels, RKHSs, and GPs has been undertaken by Ref. [32]. However, they do not use finite domains, which complicates their study, and they do not mention the link with differential or operator equations. Approaches that directly employ kernel methods towards the estimation of stochastic DE models are proposed in Refs. [33,34].

2. Notation

We consider functions f : X → R, where the domain X is a finite set, |X| = N. When considering dynamical systems we will typically set X to be an evenly discretized interval and assume N to be large. Other examples of finite domains are discretized regions of higher dimensional spaces, but also finite sets of graphs, texts, or any other type of objects.

We denote by H the space of all functions f : X → R. f is fully described by the R^N-vector f = (f(x_1), ..., f(x_N))^T. Vectors and matrices are denoted in bold font, but if an element of H is thought of as a function from X to R, we use the corresponding normal font character. For points x_i ∈ X we define location vectors/functions by d_{x_i} = (δ_{ij})_{j=1,...,N}, where δ_{ij} is the Kronecker symbol. The inner product of these with a function f ∈ H yields d_{x_i}^T f = f(x_i). Thus, location vectors correspond to Dirac-delta functions centred at the point x_i for continuous, infinite domains.

Linear operators G : H → H are isomorphic to matrices in R^{N×N}. Therefore, any function g : X × X → R uniquely determines a linear operator G : H → H through G_{ij} = d_{x_i}^T G d_{x_j} = g(x_i, x_j) and vice versa. The columns of G will be denoted by G_{x_i} = G d_{x_i}; they are real-valued functions on X. For a set X̄ = {x_i | i = 1, ..., m} ⊆ X of points, G_X̄ will denote the m × m submatrix of G corresponding to X̄.

3. The kernel framework

In non-parametric regression, we are given observations (x_i, y_i) ∈ X × R, i = 1, ..., m, m ≤ N, and the goal is to predict the value y_* for arbitrary test points x_* ∈ X. SVR estimates a prediction function f : X → R, y_* = f(x_*), as the minimizer of a functional like

min_{f ∈ H} ‖Rf‖² + C Loss({(x_i, y_i, f(x_i)) | i = 1, ..., m}).   (1)

On the one hand, f should be close to the observed data as measured through a loss function Loss : (X × R × R)^m → R. On the other hand, f should be regular as measured by the regularization operator R : H → G, where G is any finite dimensional Hilbert space. These two objectives are relatively weighted through the regularization parameter C. Note that SVMs also use the same setting for binary classification. The classes are represented as y = ±1. First a real-valued function f : X → R is estimated and then thresholded to obtain the binary class predictions. Unlike radial basis function networks [20,35], SVMs use the hinge loss |1 − yf(x)|_+ where |x|_+ = x if x > 0 and |x|_+ = 0 otherwise.

Many questions arise around objective (1). How are ‖Rf‖² and the commonly used function space norm ‖f‖²_K related? This will lead to the notion of RKHSs. The N-dimensional problem (1) can be solved using a smaller m-dimensional equivalent involving kernel functions. But how does R relate to the chosen kernel function? Can one interpret (1) in a Bayesian way? For example, with the help of GPs? The current section will answer the above questions in a simple, yet precise way for finite domains. We will furthermore show the interrelations between the terms mentioned above.

Throughout the main part of this paper we assume that R is a one-to-one operator. This will lead to a framework with positive definite kernels. If R is not one-to-one, cpd kernels arise. All definitions and theorems derived for the positive definite case in the current section are extended to the cpd case in Appendix B.

3.1. Regularization operators, kernels, RKHS, and GPs

Fig. 3 depicts the most common objects in the kernel framework. We will explain them below, starting with the covariance operator. The covariance operator is not commonly used in the kernel literature, but we introduce it as a useful abstraction in the centre of the framework. While it does not in itself have a special meaning, it helps us to unify the links between the other "leaf" objects. With the covariance operator in mind, the reader may then easily derive additional direct links.
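As a concrete preview of the relations derived in this section, the following sketch (not part of the original paper; domain size, step size, and the choice of R are purely illustrative) builds the central objects on a small finite domain with NumPy: a one-to-one regularization operator R, the covariance operator K = (RᵀR)⁻¹ whose entries are the kernel values, the identity fᵀK⁻¹f = ‖Rf‖², and a sample from the associated Gaussian process.

```python
import numpy as np

# Finite domain X = {x_1, ..., x_N}; functions f : X -> R are vectors in R^N.
N, h = 200, 0.01
rng = np.random.default_rng(0)

# A one-to-one regularization operator R: a scaled forward difference plus a
# small multiple of the identity, which keeps R invertible (illustrative choice).
D = (np.eye(N, k=1) - np.eye(N)) / h
R = D + 0.1 * np.eye(N)

# Covariance operator K = (R^T R)^{-1}; its entries are the kernel values k(x_i, x_j).
K = np.linalg.inv(R.T @ R)
K = 0.5 * (K + K.T)                      # symmetrise against rounding error

# RKHS norm equals the regularity measure: f^T K^{-1} f = ||R f||^2.
f = rng.standard_normal(N)
assert np.isclose(f @ np.linalg.solve(K, f), np.linalg.norm(R @ f) ** 2)

# The associated zero-mean Gaussian process p_K(f) = N(0, K): draw one sample path.
sample = rng.multivariate_normal(np.zeros(N), K)
```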

Fig. 3. Common objects in the kernel framework and their interrelations. Arrows denote that one can uniquely be determined from the other (the * denotes that this connection is not unique).

Definition 1 (Covariance operator). A covariance operator K is a positive definite matrix of size N × N, i.e. for all f ∈ H, f ≠ 0, it is f^T K f > 0.

A first interpretation of the covariance operator which gives K its name is given through its use in GPs.

Definition 2 (Gaussian process). A GP is a distribution over all functions f : X → R such that for any linear functional w : H → R the value w(f) = w^T f is a real-valued, normally distributed random variable.

This definition taken from Ref. [21] is tailored to the case where f is infinite dimensional, and no Lebesgue density exists in H. For finite X, it simply implies that the distribution has a density p_K(f) over the functions in H, and that this density is a multivariate Gaussian. Note that this means that in the finite dimensional setting, distributions over functions can be described via standard multivariate Gaussian distributions. Given a covariance operator K we can define a special zero mean GP by

p_K(f) = N(0, K) ∝ exp(−½ f^T K^{−1} f).   (2)

Conversely, given a GP, its covariance matrix is a valid positive definite covariance operator.

The covariance operator also allows one to define another well-known object.

Definition 3 (Kernel function). A symmetric function k : X × X → R is called a positive definite kernel function, if for all subsets X̄ ⊆ X, X̄ = {x_1, ..., x_m}, m ≤ N, and all 0 ≠ a ∈ R^m, it holds that

Σ_{i=1}^m Σ_{j=1}^m a_i a_j k(x_i, x_j) = a^T K_X̄ a = (Σ_{i=1}^m a_i d_{x_i})^T K (Σ_{j=1}^m a_j d_{x_j}) > 0.

By definition, kernel functions give rise to a positive definite covariance operator K_X̄. Conversely, a covariance operator K defines a kernel function through k(x_i, x_j) = K_ij = d_{x_i}^T K d_{x_j}, since positive definiteness of K implies that K_X̄, too, is positive definite for all X̄ ⊆ X.

Kernel functions naturally lead to the definition of specially adapted function spaces.

Definition 4 (Reproducing Kernel Hilbert Space). A Hilbert space (S, (·,·)_S), S ⊆ H, of functions f : X → R is called a RKHS, if the evaluation functionals δ_{x_i} : H → R defined by δ_{x_i}(f) = d_{x_i}^T f = f(x_i) are continuous for all x_i ∈ X, i.e. |δ_{x_i}(f)| ≤ C‖f‖_S for all f ∈ S.

As for the definition of GPs, this formulation of the definition of RKHSs is tailored towards the continuous domain case. The definition ensures that point evaluations of functions in S are well-defined, which is not obvious for functions on continuous domains, for example, L2 functions. Well-defined point evaluations are, of course, necessary for machine learning methods that deal with point-wise data measurements. In the finite domain setting, the definition of RKHSs is quite trivial. It implies that H with any inner product (·,·)_S is an RKHS, also with the usual L2 inner product. The proof is found in Appendix C, together with the proof of the following lemma which summarizes some useful results about RKHSs.

Lemma 5. The following statements hold for RKHS (H, (·,·)_S):

(1) There exists a unique element S_{x_i} ∈ H for each x_i ∈ X, the representer, such that

δ_{x_i}(f) = f(x_i) = (S_{x_i}, f)_S

for all f ∈ H. This property is called the reproducing property.
(2) The function s : X × X → R defined by s(x_i, x_j) = (S_{x_i}, S_{x_j})_S is a positive definite kernel function in the sense of Definition 3. Let the operator S : H → H be defined by S_{ij} = s(x_i, x_j).
(3) Any inner product (f, g)_S can be uniquely expressed in the form f^T T g where T is a positive definite operator.
(4) s(x_i, x_j) = T^{−1}_{ij} or equivalently S = T^{−1}.
(5) The kernel s defines the inner product (·,·)_S uniquely.

The above lemma implies that for a given covariance operator K one can define an RKHS (H, (·,·)_K) by setting

(f, g)_K ≡ f^T K^{−1} g.

Then the representer of this RKHS is identical with the kernel function K d_{x_i} derived from K via k(x_i, x_j) = K_ij. Since the relation between kernel and inner product is unique, one could also construct a unique valid covariance operator from a given RKHS.

The definitions so far have been purely technical, but we can give them a practical meaning when considering them in conjunction with a regularization operator as used in the SVR objective (1).

Definition 6 (Regularization operator). A regularization operator R : H → G is a one-to-one linear operator. Here, G is any finite dimensional Hilbert space.

If we use K = (R^T R)^{−1}, then by Lemma 5 it is

‖f‖²_K = f^T K^{−1} f = f^T R^T R f = ‖Rf‖².

That means that if ‖Rf‖ measures the regularity of f : X → R, then the RKHS norm exactly equals the regularity measure. In the SVR objective (1) regular functions are thus preferred over less regular ones. Furthermore, the related GP is

p_K(f) = N(0, K) ∝ exp(−½‖Rf‖²)

implying that under this distribution regular functions are more likely than less regular ones. The most likely functions are those which exactly fulfill the regularity/model equation

Rf = 0.

Note that since R is assumed to be one-to-one, only the zero function can fulfill the model equation exactly. Non-vanishing functions

violate this equation by an amount that is determined by the structure of R. If non-trivial functions are to be considered fully regular, that is, Rf = 0, then R cannot be one-to-one. This case is discussed in Appendix B.

Given a covariance operator K, we can compute an associated regularization operator R as R = K^{−1/2}. However, note that if we transform R → K → R in this way we will not necessarily recover the same regularization operator we started from. The original R does not have to be square, and even if it is, taking the root would set all originally negative eigenvalues of R to positive.

The objects of the kernel framework and their interrelations are summarized in Table 1.

Table 1
Summary of the objects of the positive definite kernel framework and their interrelations.

Entity | Symbol | Relations
Kernel function | k : X × X → R; K_{x_i} : X → R | k(x_i, x_j) = K_{i,j} = d_{x_i}^T K d_{x_j}; k(x_i, x_j) = (K_{x_i}, K_{x_j})_K; k(x_i, x_j) = d_{x_i}^T (R^T R)^{−1} d_{x_j}; k(x_i, x_j) = Cov_{f∼p_K}(f(x_i), f(x_j))
Covariance op. | K : H → H | K_{i,j} = k(x_i, x_j); K = (R^T R)^{−1} = Cov_{f∼p_K}(f, f)
RKHS | (·,·)_K : H × H → R; ‖·‖_K : H → R | (f, g)_K = f^T K^{−1} g = f^T R^T R g; ‖f‖_K = (f, f)_K^{1/2} = ‖Rf‖
Gaussian process | p_K : H → R | p_K(f) = N(0, K); p_K(f) ∝ exp(−½‖f‖²_K); p_K(f) ∝ exp(−½‖Rf‖²)
Regularization op. | R : H → G | (R = K^{−1/2}, not unique)

Cov_{x∼p(x)}(x_i, x_j) denotes the covariance between x_i and x_j under a distribution of x with density p(x). If the arguments are vectors, the corresponding covariance matrix is meant.

3.2. Support vector machines

With the above definitions the SVR objective (1) can be rewritten as

min_{f ∈ H} ‖f‖²_K + C Loss({(x_i, y_i, f(x_i)) | i = 1, ..., m}).   (3)

This optimization problem over the whole function space H, i.e. over N variables where N is potentially large, can be reduced to a typically much smaller m-dimensional optimization problem using kernel functions. To see this, we will derive the famous representer theorem in two steps. The proofs are found in Appendix C.

The first step, which is interesting in itself, shows a general property of RKHSs: Any function in an RKHS can be decomposed into a set of kernel functions and its H-orthogonal complement. If the complement is understood as a function from X to R, then it has function value zero at all kernel centres.

Lemma 7. Given distinct points X̄ = {x_i | i = 1, ..., m}, m ≤ N, any f ∈ H can be uniquely written as f = Σ_{i=1}^m a_i K_{x_i} + q, a ∈ R^m, where q ∈ H satisfies the conditions q(x_i) = (K_{x_i}, q)_K = 0, i = 1, ..., m.

The second step then is as follows.

Theorem 8 (Representer theorem). Given m ≤ N distinct points X̄ = {x_i | i = 1, ..., m} and labels {y_i | i = 1, ..., m} ⊆ R, the minimizer f of Eq. (3) has the form f_a = Σ_{i=1}^m a_i K_{x_i}. a ∈ R^m minimizes the expression

a^T K_X̄ a + C Loss({(x_i, y_i, f_a(x_i)) | i = 1, ..., m}).   (4)

If the loss is convex, a is determined uniquely.

Remark. f can also be expanded in another function system, say f = Σ_{j=1}^L c_j φ_j. Then min_{c ∈ R^L} c^T M c + Loss({(x_i, y_i, f_c(x_i)) | i = 1, ..., m}) with M_{ij} = φ_i^T R^T R φ_j is the optimization problem corresponding to Eq. (1), see e.g. Refs. [25,36]. This is also a convex problem and can sometimes be solved very efficiently if, for example, compactly supported basis functions are used [36]. However, one only finds the optimal solution within the span of the selected basis functions. A globally optimal solution in H would, in general, require L = N basis functions. Furthermore, M_{ij} = φ_i^T R^T R φ_j has to be computed for all i, j which could be challenging.

3.3. GP inference

The SVR objective (1) can also be interpreted from a Bayesian perspective. Assume a two-step model where firstly a latent function f : X → R is drawn from the GP prior p_K(f) with covariance operator K, and where subsequently the measurements are determined from this function as described by a local likelihood p(y|f) = p(y|f_X̄), where y = (y_1, ..., y_m)^T and X̄ = {x_1, ..., x_m}. A common example of a local likelihood is the i.i.d. likelihood, that is, p(y|f) = Π_i p(y_i|f(x_i)). The posterior for local likelihoods is

p(f|y, X̄) ∝ p(y|f) p_K(f) ∝ p(y|f_X̄) exp(−½‖Rf‖²),

and the maximum a posteriori (MAP) estimate is

arg max_{f ∈ H} p(f|y, X̄) = arg min_{f ∈ H} ½‖Rf‖² − log p(y|f_X̄).

So if one can identify −log p(y|f_X̄) with Loss({(x_i, y_i, f(x_i)) | i = 1, ..., m}), which is possible, for example, for the common squared loss, then SVR is just a MAP estimate of a GP model. Note, however, that in some well-known cases such as, for example, the hinge loss, this identification is in a strict sense not possible. The resulting likelihood would not be normalizable with respect to y. Nevertheless, if one is willing to work with unnormalized models, the equivalence holds in general. The qualitative meaning of the prior is the same in any case.

Bayesian statistics is typically not only interested in the MAP estimate of f(x_*) but in the full predictive distribution,

p(f(x_*)|y, X̄) ∝ ∫ p(y|f_X̄) p_K(f) df_{X\x_*}.

Here, we have used the notation that for every set I = {x_{i_1}, ..., x_{i_k}} ⊆ X, df_I means df(x_{i_1}) ⋯ df(x_{i_k}). Because of the local likelihood we can then split the N − 1 dimensional integral as follows:

p(f(x_*)|y, X̄) ∝ ∫ p(y|f_X̄) [∫ p_K(f) df_{X\(X̄∪x_*)}] df_X̄,

where the inner integral in brackets equals p_K(f_{X̄∪x_*}).

So if an analytic expression of the marginal p_K(f_{X̄∪x_*}), which is independent of the data, could be computed, then only an m-dimensional integral would have to be solved for inference. Such an expression is given in the following theorem, which just expresses a standard property of Gaussian distributions. Since it reduces the work from N dimensions to m dimensions similar to the representer theorem 8, one could call it the Bayesian representer theorem.

Theorem 9. Given m ≤ N distinct points X̄ = {x_1, ..., x_m} ⊆ X the GP p_K(f) has the marginals

p_K(f_X̄) = (2π)^{−m/2} |K_X̄|^{−1/2} exp(−½ f_X̄^T K_X̄^{−1} f_X̄) = N(0, K_X̄).

This property is often used to construct GPs: Given a kernel function k : X × X → R one stores the values corresponding to X̄ into

a square matrix K_X̄ and sets p(f_X̄) = N(0, K_X̄). Using standard formulas for conditioning Gaussian distributions and block-partitioned matrix inversion one can show that this construction is consistent, i.e. for all X̄, X̄′ ⊆ X, X̄ ∩ X̄′ = ∅ it holds that p(f_X̄) = ∫ p(f_{X̄∪X̄′}) df_{X̄′}. By Kolmogorov's extension theorem, or by simply using X̄ = X in our finite dimensional case, this yields a GP on all of X.

3.4. Vector-valued regression

Consider now regression from X to R^n, n > 1. We will show that the kernel framework explained above can be easily extended to this case. The function space of all functions f : X → R^n will be denoted by H^n. We can represent such a function as a vector f in R^{nN}. Denoting the component functions by f^i : X → R it is f ≡ (f^{1,T} ⋯ f^{n,T})^T. The standard inner product in H^n is f^T g = Σ_{j=1}^n f^{j,T} g^j. The unit vector d^j_{x_i}, i.e. the location vector for location x_i and the j-th component, then has the j-th component equal to d_{x_i} and all others equal to zero. It is d_{x_i}^{j,T} f = f^j(x_i). Linear operators A : H^n → H^n are isomorphic to R^{(Nn)×(Nn)} matrices.

Theorem 10. The function space H^n is isomorphic to the space H̃ of all functions from X̃ = X × {1, ..., n} to R.

This obvious theorem includes all we need in order to work with vector-valued functions: As X is a finite set, so is X̃. All the above theory on kernels, regularization operators, and GPs applies. For example, using the regularization operator R : H^n → G, the corresponding kernel function is

k(x_i, x_j)_{lm} = k((x_i, l), (x_j, m)) = d_{x_i}^{l,T} (R^T R)^{−1} d_{x_j}^m.   (5)

To construct a sensible regularizer R, a similarity measure between points in X̃ is needed. Since in many applications it is not clear how to compare different components of f, it is common to use a block-diagonal regularizer R = diag(R^1, ..., R^n), i.e. regularizing each component separately. The corresponding kernel function then has the vector form

K^j_{x_i} = (0, ..., 0, K^{j,T}_{x_i}, 0, ..., 0)^T,

with the individual kernel functions K^j_{x_i} = (R^{j,T} R^j)^{−1} d_{x_i} in the corresponding components. The joint covariance matrix K is block-diagonal in this case. If the loss/likelihood term does not imply a dependency between different components, such as, for example, the quadratic loss, then each dimension can be treated separately. However, there are also numerous situations where a joint regularization makes sense. Examples are shown in the next section.

The theory as described here was mentioned in Ref. [32]. Ref. [37] introduces a slightly different formalism employing operator-valued kernel functions in this context. However, the derived representer theorem is equivalent to the simple approach presented here.

Note that one could also reorder the entries in f; for example, we could define f = (f(x_1)^T, ..., f(x_N)^T)^T. While in this section we have used a special notation for vector-valued functions in order to highlight the differences, we will from now on use normal vector notation also for vector-valued functions to keep the notation simple.

3.5. Inhomogeneous regularization

As shown in the next section, there are numerous cases where one would like to have ‖Rf − u‖, u ≠ 0, as the regularizer in the SVR objective (1) or equivalently use non-zero means for GPs. Since for f = 0, ‖Rf − u‖ = ‖u‖ ≠ 0, ‖Rf − u‖ cannot be used as a norm in an RKHS. To circumvent this problem, note that since R is assumed to be one-to-one, R^{−1}u exists uniquely and can be computed without regard to the measurement data. We can then base any inference on f̃ = f − R^{−1}u, adapting the loss term appropriately. The regularization term then reads ‖Rf̃‖ = ‖Rf − u‖, which represents a true norm for f̃. The kernel framework can now be applied as described above.

4. Kernels and DEs

SVR and GP inference both use an a priori model that can be expressed in the form

Rf ≈ 0.   (6)

Functions f : X → R which fulfill Eq. (6) to a high degree as measured by ‖Rf‖, the two-norm of the residual, are preferred to functions that significantly violate the equation.

In this section we discuss a common choice for R, namely linear stochastic DEs. If the input domain is one-dimensional, one speaks of ordinary differential equations (ODEs) or dynamical systems, and for multivariate input these are PDEs. Since this paper is restricted to finite domains, the term DE should be understood as meaning finite difference equations throughout. In most cases, the differences are negligible for discretization steps that are sufficiently small.

Linking DEs and kernel machines is useful both from a machine learning perspective as well as from a perspective focused primarily on work with DEs.

From a machine learning point of view, stochastic DEs can be seen as an ideal prior model. They describe local properties of the function f, that is, how the function value at one point relates to function values in the neighbourhood. On a global level, stochastic DEs do not constrain the function very much, because small local noise contributions can add up over longer distances. Thus, this prior is well-suited to situations where we a priori do not know much about the global structure of the target function, but we assume that locally it should not vary too much or only in a certain predefined manner.

From a DE point of view, it is useful to have all the machinery of kernel methods at hand. With these, one can estimate the state/trajectory of the DE model, that is, the function described by the DE. One can also estimate the DE or its parameters, a task commonly known as system identification. Both problems are ubiquitous throughout natural science, statistics and engineering.

4.1. Linear state-space models

Linear state-space models are the most common models in the class of ODEs, or dynamical systems [27]. They are classically given as

x_i = A x_{i−1} + B u_i + ε_i^{(P)},  i = 1, ..., N − 1,   (7)
y_i = C x_i + D u_i + ε_i^{(M)},  i = 1, ..., N − 1.   (8)

The model equation (7) states that the hidden states x_i ∈ R^n follow a stochastic difference equation with external user-defined control u_i ∈ R^k and i.i.d. process noise ε_i^{(P)}, which is Gaussian-distributed with mean zero and covariance R_P. The likelihood of the measurements y_i ∈ R^m is defined via Eq. (8). The measurements are linear combinations of the state and the control with additive i.i.d. Gaussian measurement noise ε_i^{(M)} with mean zero and covariance R_M. The initial state x_0 is independently Gaussian-distributed with mean μ_0 and covariance R_0. Note that the assumption that the process noise is Gaussian-distributed is in fact a very natural one if the finite difference equations ought to be discretizations of a continuous stochastic model. In this case, the distribution of a finite difference model

should not depend on the discretization step size. Suppose we split one interval into M smaller steps; then the joint process noise in this interval is Σ_{i=1}^M ε_i^{(P)}, where the ε_i^{(P)} are i.i.d. random variables. If the variance of the ε_i^{(P)} is finite, then the sum will have a Gaussian distribution for large M, regardless of the distribution of the ε_i^{(P)}. Thus, if the process noise has finite variance, the only valid distribution that can be refined on an ever smaller grid is the Gaussian distribution.

We now interpret the state-space model in terms of the kernel framework.

Theorem 11. The linear state-space model (7) defines a GP over trajectories x : X → R^n, X = {0, ..., N − 1}. Mean and covariance for i, j ∈ X are given as

μ_i = E(x_i) = A^i μ_0 + Σ_{l=1}^{i} A^{i−l} B u_l,   (9)

K_{i,j} = E((x_i − μ_i)(x_j − μ_j)^T) = A^i R_0 A^{j,T} + Σ_{l=1}^{min(i,j)} A^{i−l} R_P A^{j−l,T}.   (10)

Proof (Dynamical systems view). Since all (conditional) distributions of the x_i are Gaussian, so is the joint distribution of x : X → R^n, i.e. it is a GP. Furthermore, it is

x_i = A^i x_0 + Σ_{l=1}^{i} A^{i−l}(B u_l + ε_l^{(P)}).

Using the independence assumptions, Eqs. (9) and (10) follow.

Proof (Kernel view). Eq. (7) can be written equivalently as Rx − u = ε, where

R = diag(R_0^{−1/2}, R_P^{−1/2}, ..., R_P^{−1/2}) · L,  with L the block-bidiagonal matrix with identity blocks 1 on the diagonal and −A on the first subdiagonal,
x = (x_0^T, x_1^T, ..., x_{N−1}^T)^T,
u = ((R_0^{−1/2} μ_0)^T, (R_P^{−1/2} B u_1)^T, ..., (R_P^{−1/2} B u_{N−1})^T)^T,

and where the deviations ε ∈ R^{Nn} are i.i.d. Gaussian-distributed with mean zero and covariance one. Since, for any initial state x_0 there exists exactly one solution of the system, i.e. one trajectory x that follows Eq. (7), the R thus constructed is one-to-one and defines a valid regularization operator. Using the theory from Section 3, the model is then equivalent to a GP with mean μ = R^{−1}u and covariance K = (R^T R)^{−1}. Formulas (9) and (10) can be verified by checking that Rμ = u and K(R^T R) = (R^T R)K = 1.

The GP equivalent to Eq. (7) has the density

p(x) ∝ exp(−½‖Rx − u‖²).   (11)

This expression has a nice, simple interpretation: trajectories x that follow the model DE (7) are a priori the most likely functions x : X → R^n, and deviations from the equation are penalized quadratically.

So far, we have shown that linear state-space models define GP distributions on trajectories x : X → R^n. Whether any GP can be written as a linear state-space model depends on whether the reader considers models with state dimension N—or infinite state dimension in the continuous case—as valid state-space models. An introduction to infinite dimensional systems can be found in Ref. [38]. Imagine an arbitrary GP p(z) = N(μ, K) for z : X → R. One could simply set x_0 = z, i.e. μ_0 = μ, R_0 = K, and then propagate with A = 1, u_i = 0, and R_P = 0. Alternatively, one could use the decomposition p(z) = p(z_0) p(z_1|z_0) ⋯ p(z_{N−1}|z_0, ..., z_{N−2}) to formulate a state-space model. Since for arbitrary covariances K, we cannot assume special Markov properties, we would need again an N-dimensional state-space to represent the GP. For special K, however, this construction may allow one to exploit Markov properties of the GP, and thus a representation with a much lower state dimension.

4.2. Linear DEs and the Fourier transform

Kernel methods are often motivated via regularization in the Fourier domain [8]. At the same time, derivative operators reduce to simple multiplications in the Fourier domain. This leads us to examine more closely the connection between DEs and Fourier space penalization in this section.

Assume X to be the discretized real line, i.e. X = {i·h | i = 1, ..., N}, h > 0, and let L(ξ) = Σ_{i=0}^n a_i ξ^i be an n-th order polynomial. Consider the linear ODE

L(D) f = Σ_{i=0}^n a_i D^i f = 0,   (12)

where D is the first derivative operator and f : X → R. For the remainder of the chapter we will assume periodic boundary conditions, allowing the use of the discrete Fourier transform to express the derivative operator. Periodic systems are in general not causal, since random events in the future could propagate forward to influence the past. However, for stable linear systems these effects can be neglected for large enough domains, because the contribution of any state onto future state values decays to zero eventually. The natural formulation of the Fourier transform in terms of complex exponentials requires the use of complex-valued linear algebra. For ease of presentation we have omitted this so far; however, all definitions and theorems can also be formulated with complex numbers, as sketched in Appendix A. We will also assume that L(D) is one-to-one. Unfortunately, there are common examples where this is not the case, e.g. for the second derivative used for thin-plate splines. Regularization with non-one-to-one operators requires the use of the cpd kernels as described in Appendix B. For discrete X, a straightforward approximation of the continuous derivative is the approximate derivative operator D given as follows in the case of periodic boundary conditions,

D = (1/h) · [−1, 1, 0, ..., 0; 0, −1, 1, ..., 0; ...; 1, 0, ..., 0, −1].   (13)

D can be diagonalized in the Fourier basis, D = Σ_{k=1}^N w_k u_k u_k^T, where w_k = (1/h)(exp(i(2π/N)k) − 1) and d_{x_j}^T u_k = exp(i(2π/N)jk). It is well-known that functions of D can be computed by applying equivalent operations to the eigenvalues w_k. In particular, the corresponding kernel function then is

k(x_l, x_m) = (L(D)^T L(D))^{−1}_{lm} = d_{x_l}^T (L(D)^T L(D))^{−1} d_{x_m}   (14)
= d_{x_l}^T Σ_{k=1}^N u_k (1/(L(w_k) L(w_k)*)) u_k^T d_{x_m}   (15)
= Σ_{k=1}^N (1/|L(w_k)|²) exp(i (2π/N) k(l − m)).   (16)
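The construction in Eqs. (14)–(16) can be carried out numerically with an FFT. The sketch below is not from the paper; the polynomial coefficients, grid size, and step size are illustrative choices, and the normalization follows NumPy's inverse-FFT convention (which may differ from Eq. (16) by a constant factor).

```python
import numpy as np

# Illustrative operator L(t) = a_0 + a_1 t + a_2 t^2 on a periodic grid.
N, h = 1024, 0.01
a = np.array([25.0, 1.0, 1.0])                 # a_0, a_1, a_2 (arbitrary example)

# Eigenvalues w_k of the periodic forward-difference operator D in the Fourier basis.
k = np.arange(N)
w = (np.exp(2j * np.pi * k / N) - 1.0) / h

# L evaluated at the eigenvalues; the kernel spectrum is 1/|L(w_k)|^2.
Lw = np.polyval(a[::-1], w)                    # a_2*w^2 + a_1*w + a_0
spectrum = 1.0 / np.abs(Lw) ** 2

# k(x_l, x_m) depends only on l - m: one row of K via the inverse FFT.
kernel_row = np.real(np.fft.ifft(spectrum))    # k(x_0, x_m), m = 0, ..., N-1
```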

Thus, the kernel k : X × X → R is the (discrete) Fourier transform of g(w_k) = 1/|L(w_k)|². Since g is real-valued, its Fourier transform is also real and additionally symmetric. The corresponding kernel function then is real-valued and only depends on the distance between x_l and x_m, d = |l − m|, that is, it is translation-invariant.

Let us motivate Eq. (12) from a regularization point of view. High derivatives are described by polynomials L(ξ) of high order, in which case ‖L(D)f‖² = Σ_k f^T u_k |L(w_k)|² u_k^T f strongly penalizes high frequencies. The corresponding kernel then contains few high frequency components and is thus relatively smooth.

One can also discuss the reverse derivation from a translation-invariant kernel function on X to a differential regularization operator. Translation-invariance implies that the covariance operator K is diagonal in the Fourier basis. In order to derive a DE, invert the eigenvalues of K, take the square root, and interpolate the result by a polynomial L of at most degree N. Eq. (12) then yields the model that is implicitly used when performing regression with this kernel. A famous example is the Gaussian kernel, k(x_i, x_j) ∝ exp(−|i − j|²/2σ²). The discrete Fourier transform is difficult to compute analytically in this case, so we approximate it with its continuous counterpart for large N and small step sizes. The continuous Fourier transform of a Gaussian is again a Gaussian with variance σ^{−2}. Inverting and taking the square root, we derive a function exp((σ²/4)w²), whose Taylor expansion is L(w) = Σ_{n=0}^∞ (σ^{2n}/(2^{2n} n!)) w^{2n}. Replacing w by the derivative ∂_x, we re-derive the result of Ref. [20]. They state that the Gaussian kernel is equivalent to regularization with derivatives of all (even) orders,

R = Σ_{n=0}^∞ (σ^{2n}/(2^{2n} n!)) D^{2n}.   (17)

A larger σ leads to a stronger penalization of high derivatives, i.e. to smoother functions.

The introduction of the Fourier transform above also leads to a discrete version of Bochner's theorem [39]. While the original theorem in continuous domains deals with positive semi-definite functions, we can make a stronger statement involving positive definiteness for finite domains: A translation-invariant function k : X × X → R, k(x_i, x_j) = ψ(i − j), is a positive definite kernel function if and only if the (discrete) Fourier transform of ψ is positive. Since the Fourier transform of ψ is identical with the eigenvalues of K, and we do not have to be concerned with the existence and regularity of Fourier transforms in finite domains, the result, in our case, is trivial.

4.3. Linear stochastic PDEs

A general form of discrete stochastic linear PDEs for f : X → R is

f(x_i) = Σ_{x_j ∈ N_i} a_{ij} f(x_j) + ε_i^{(P)},  x_i ∈ X,   (18)

where N_i ⊂ X is the set of neighbours of x_i, a_{ij} ∈ R, and ε_i^{(P)} is i.i.d. zero mean Gaussian noise with covariance K_i. Since Eq. (18) is a linear equation system in f, it is a valid kernel model equation (6). If the x_i are placed on a regular grid and periodic boundary conditions are assumed, the Fourier transform methods from the previous section can also be applied for this multivariate setting.

Note that apart from being a discretized stochastic PDE, Eq. (18) is also one form of writing Gaussian Markov random fields. Additionally, graph-based learning involving the graph Laplacian can be written in this form. This noteworthy fact implies that multiple methods in physics, control theory, image processing, PDE theory, machine learning and statistics all use the same underlying model.

4.4. State estimation and system identification using kernels

Both GP and SVR regression can be interpreted as optimal state estimators if the kernel is chosen with respect to a DE as described above. Both methods try to minimize the deviation of the estimated trajectory from the DE Rf = 0 and at the same time try to minimize the distance to the measured data points, where the distance is measured either through a loss function in the SVR case or through a likelihood in the probabilistic setting. An optimal trade-off between these potentially contradicting targets is obtained. Furthermore, SVR and GP regression can both be used for system identification. In SVR one typically chooses the kernel to minimize the cross-validation error on the training set. In GP regression one tries to find the kernel function that maximizes the marginal likelihood, that is, the complete likelihood of the training data and latent function f : X → R marginalized over the latents. Since each DE can be related to a specific kernel function, optimizing for the best kernel in a class of kernels derived from DEs is equivalent to choosing the most appropriate DE model for the given data set. More formally, assume, for example, that we are interested in a DE model of the form

L_h(D) f = Σ_{i≥0} h_{i+1} D^i f = 0.   (19)

Optimizing for the best parameters h of the corresponding kernel function K_h = (L_h(D)^T L_h(D))^{−1} is equivalent to determining the best differential model of the above form.

The possibility of using kernel machines to estimate the state and the parameters of DEs has been noticed by Ref. [33] in a spline context, and by Ref. [34] who use SVR and cross-validation.

Before discussing the practical implications of this matter, we present some examples highlighting the kernel framework and its connections to DEs.

5. Examples

5.1. The pendulum—state estimation

Consider again the pendulum in Fig. 2. According to Newton's second law, the free motion dynamics of the angle of the pendulum is approximately described by the second-order linear DE

ml² φ̈(t) + γ φ̇(t) + mgl φ(t) = 0,   (20)

where m is the mass of the pendulum, l the length, g the gravitational constant and γ > 0 a damping factor. Eq. (20) is only approximately correct for two qualitatively different reasons. Firstly, it is only the linearization around the rest position of a truly nonlinear DE. The true gravitational effect is mgl sin(φ(t)), which for small φ(t) is similar to mgl φ(t). Secondly, there may be many, potentially random influences on the pendulum which are not known or cannot in principle be observed. For example, the viscosity of the surrounding air could change slightly due to local temperature changes, or more drastically a by-passer could simply hit the pendulum. Both model mismatch and stochastic influences can be modeled as process noise in a stochastic DE system, rendering this a versatile model.

Fourier space method: The pendulum equation (20) can be written in the operator form

L(∂_x) f(x) = (∂²_x + c_1 ∂_x + c_2 I) f(x) = 0,   (21)

where I : H → H is the identity operator. We discretize an input interval into N = 4096 steps and apply the Fourier framework from Section 4.2 to derive a translation-invariant kernel k(x_i, x_j) = (L(D)^T L(D))^{−1}_{ij}. The resulting kernel and a GP regression with this kernel for the pendulum data in Fig. 2 (right) is shown in Fig. 4.
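The following sketch reproduces the shape of this experiment rather than its exact settings (grid size, coefficients c_1, c_2, observation points and noise level are illustrative, not the paper's values): a second-order ODE kernel is built with the Fourier recipe of Section 4.2 and plugged into standard GP regression with a Gaussian i.i.d. likelihood.

```python
import numpy as np

# ODE-derived kernel on a periodic grid for L(d/dt) = d^2/dt^2 + c1 d/dt + c2.
N, h = 1024, 0.005
c1, c2 = 1.0, 25.0                            # illustrative coefficients
idx = np.arange(N)
w = (np.exp(2j * np.pi * idx / N) - 1.0) / h
row = np.real(np.fft.ifft(1.0 / np.abs(w**2 + c1 * w + c2) ** 2))

# Translation-invariant covariance matrix: K[i, j] = row[(j - i) mod N] (circulant).
K = np.array([np.roll(row, i) for i in range(N)])
K = 0.5 * (K + K.T)                           # symmetrise against rounding error

# Noisy observations of a latent sample path at 50 random time points.
rng = np.random.default_rng(1)
obs = np.sort(rng.choice(N, size=50, replace=False))
sigma = 0.02
f_true = rng.multivariate_normal(np.zeros(N), K)
y = f_true[obs] + sigma * rng.standard_normal(obs.size)

# GP regression: posterior mean and marginal variance on the full grid.
K_oo = K[np.ix_(obs, obs)] + sigma**2 * np.eye(obs.size)
K_xo = K[:, obs]
post_mean = K_xo @ np.linalg.solve(K_oo, y)
post_var = np.diag(K) - np.einsum('ij,ji->i', K_xo, np.linalg.solve(K_oo, K_xo.T))
```

With a Gaussian observation model, this posterior mean is exactly the quantity that a Kalman smoother built from the equivalent state-space model would return, which is the point made in Section 6.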

Fig. 4. (left) Kernel function k(x_i, ·) derived from the differential equation (20) describing a pendulum. Fourier space transforms with periodic boundary conditions were used. The resulting kernel is translation invariant; x_i is chosen in the middle of the interval. (middle) The 50 data points from Fig. 2, denoted by black crosses, are regressed using a Gaussian process with the pendulum kernel, left, and a Gaussian i.i.d. likelihood. The solid red line denotes the mean of the posterior GP, the shaded area plus–minus one marginal standard deviation of the function values. The dashed black line shows the true sample path from which the data points were generated. (right) GP regression as in the middle figure; however, with a Gaussian kernel.

Fig. 5. (left) The covariance matrix derived from the differential equation describing a pendulum (20) using a state-space formulation with initial condition. Since the state-space is two-dimensional the kernel function has for each position pair i, j four entries. Two entries describe the covariance within each component, the two others the cross-covariances. (middle) Gaussian process regression using the kernel from the left figure and the 50 data points from Fig. 2. The solid red line denotes the mean of the posterior GP, the shaded area plus–minus one marginal standard deviation for the function values. The dashed black line is the original sample path. (right) Equivalent results produced by a Kalman smoother.

Observe that the GP regression with the kernel adapted to the pendulum is able to nicely follow the true sample path (middle). While a GP regression with a standard Gaussian kernel yields comparable results in regions where many data points are observed, it performs much worse in the middle where no observations are recorded. This can be explained as follows. Since the a priori model of f in terms of a stochastic DE, Rf = ε ∼ N(0, σ²·1), allows violations of the exact DE Rf = 0, multiple observations can override the model. However, in regions with no observations the prior is more important. Since the Gaussian kernel encodes the wrong prior model (17), its predictions are especially bad in these regions.

State-space view: The pendulum equation (20) can equally be written as a state-space model with a two-dimensional state, n = 2. Then it is

A = h·[0, 1; −g/l, −γ/ml²] + 1,  C = (1 0),  B = D = 0,
R_P = [0, 0; 0, σ²_(P)],  R_M = σ²_(M),

where we used N = 4096, h = 0.003, μ_0 = (0.2, 0.1)^T, R_0 = 10^{−5}·1, γ/ml² = 25, g/l = 1, σ_(P) = 0.085, and σ_(M) = 0.02. The data samples for the pendulum—see Fig. 2 (right)—were drawn from this model.

The covariance operator for this state-space model computed by Eq. (10) is colour-coded in Fig. 5 (left). Observe the oscillations when fixing a row or column, which corresponds to fixing a kernel centre x_i and observing the kernel function K_{x_i}. Fig. 5 (middle) shows the marginal posterior mean and variances when performing GP regression using the kernel from the left figure and the data from Fig. 2 (right). Note that the results are up to numerical errors identical to the solution of a Kalman smoother [40], as shown in Fig. 5 (right). This fact is discussed in more detail in Section 6.

Fig. 6. The negative log marginal likelihood of a Gaussian process regression for the pendulum data set in Fig. 2. Different parameters c_2 are used for the pendulum-adapted kernel in Fig. 4. The minimum of the negative log marginal likelihood is obtained for c_{2,min} = 27.5; the true value is c_{2,true} = 25.

5.2. The pendulum—parameter estimation

In Fig. 6 we show results from a simple system identification task, i.e. determining the parameter c_2 of the pendulum model (21). We use the pendulum kernel in Fig. 4 and maximize the marginal likelihood of a GP regression model for the optimal value of c_2, where c_1 is assumed to be known. The maximum is attained for a value c_2 close to the true model. We also computed the marginal

harmonic regularization thin-plate spline reg.

Fig. 7. For a two-dimensional domain X with periodic boundary conditions, the kernel functions Rxi for harmonic and thin-plate spline regularization are shown in the top
row. xi is chosen in the middle of X. Below we show the mean of a GP regression with these kernels and five data points, denoted as black stars.

T T
likelihood for GP regression with a Gaussian kernel. The maximal This results in Rf 2 = f T Df , where D = D1 D1 + D2 D2 is the (dis-
marginal likelihood for a Gaussian kernel with automatically chosen crete) Laplace operator. Functions minimizing this expression, the
parameters is 20 orders of magnitude smaller than for the pendulum so-called harmonic energy, effectively minimize the graph's area and
kernel. In a Bayesian interpretation the data thus strongly prefers a are thus very common in many fields of research, especially com-
pendulum-adapted model over the standard Gaussian kernel model. puter graphics [41]. Since constant functions are not penalized by R,
the cpd framework for non-one-to-one R has to be used in this case,
5.3. Two-dimensional PDEs see Appendix B. Postponing a more detailed discussion, the most
important change here is to use the pseudoinverse instead of the
In this section we discuss kernels for two-dimensional domains. inverse for deriving the kernel, K = (RT R)+ . This operation is easily
We show how the harmonic and the thin-plate spline regularizer performed using the two-dimensional fast Fourier transform.
that both build on derivatives and can be interpreted as stochastic The thin-plate splines energy penalize the Hessian of f : X → R,
PDEs can be incorporated into the kernel framework. that is, all second derivatives,
Next, we show examples of harmonic and thin-plate spline reg-
⎛ 1 1⎞
ularization in the kernel framework. D D
As mentioned in Section 4.3, the Fourier transform can also be ⎜ D1 D2 ⎟
R=⎝ 2 1⎟
⎜ .
applied for functions on higher-dimensional domains, and deriva- D D ⎠
tive operators can also be translated into multiplications in this set- D2 D2
ting. Consider a rectangular grid with N2 = 2562 points and periodic
boundary conditions. The discrete derivative D1 in the first direction The energy leaves linear functions unpenalized, thus we again have
and the derivative D2 in the second direction are both diagonal in to use the cpd framework and correspondingly the pseudoinverse.
the tensor Fourier basis uk1 ⊗ uk2 , where (dxl ⊗ dxm )T uk1 ⊗ uk2 = In Fig. 7, we show the resulting kernels for harmonic and thin-
exp i((2/N)(lk1 + mk2 )) and the eigenvalues are wk1 ⊗k2 = wk1 wk2 , plate spline regularization. Furthermore, we show results of approx-
k1 , k2 = 1, . . . , N. imating five randomly chosen data points with a GP regression with
Harmonic regularization results from penalizing the Jacobian of the respective kernels. Note that the harmonic kernel is sharply
f : X → R, that is, all first derivatives, peaked, but the regression output stays in the convex hull of the
training output values, the famous mean value property of harmonic
1
D maps. The thin-plate spline solution is much smoother, but occa-
R= .
D2 sionally overshoots the training values.
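A short sketch of how such kernels can be computed on the periodic grid with the two-dimensional FFT follows; the eigenvalues used for the forward-difference operators and the cut-off emulating the pseudoinverse are our own choices and only approximate the construction described above.

import numpy as np

def grid_kernel_function(i0, j0, N=256, thin_plate=False, tol=1e-12):
    """Kernel function K xi, i.e. the column of (R^T R)^+ for xi = (i0, j0)."""
    w2 = np.abs(np.exp(2j * np.pi * np.arange(N) / N) - 1.0) ** 2   # |eigenvalue|^2 of the periodic forward difference
    W1, W2 = np.meshgrid(w2, w2, indexing="ij")
    spec = (W1 + W2) ** 2 if thin_plate else W1 + W2                # eigenvalues of R^T R per Fourier mode
    inv_spec = np.zeros_like(spec)
    inv_spec[spec > tol] = 1.0 / spec[spec > tol]                   # pseudoinverse: drop (near-)null modes
    delta = np.zeros((N, N))
    delta[i0, j0] = 1.0                                             # the evaluation functional d_xi
    return np.real(np.fft.ifft2(np.fft.fft2(delta) * inv_spec))

K_harmonic = grid_kernel_function(128, 128)                  # harmonic kernel, as in Fig. 7
K_thin_plate = grid_kernel_function(128, 128, thin_plate=True)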

Fig. 8. Kernel corresponding to a graph Laplacian as regularizer RT R. The kernel functions Rxi are encoded in the colour and the size of the nodes. Vertex xi is marked with
a black cross, the edges of the graph are shown in black.

5.4. Graph Laplacian

Since graph domains are naturally finite, graph-based learning is a good example of where the finite domain kernel framework directly applies without the need for discretization.

The graph Laplacian is an approximation of the true (continuous) Laplacian on graphs [42]. Kernels on graphs based on the graph Laplacian are described by Ref. [43]; they are used for semi-supervised learning by Ref. [44], and Ref. [45] uses them in GPs on finite image domains for image super-resolution. The graph Laplacian DG for a graph G = (E, X) with edges E and vertices X is given by DG = D − W, where Wij is the weight of edge (i, j) ∈ E, 0 if (i, j) ∉ E, and the degree matrix D is diagonal with entries Dii = Σj Wij. We use an ε-neighbourhood graph constructed from 40 random points in [0, 1]², ε = 0.2, i.e. (i, j) ∈ E if and only if ‖xi − xj‖ < ε. Edge weights Wij are set as Wij = exp(−‖xi − xj‖²/ε²).

As in the above section, setting RᵀR = DG leads to the problem that DG is not one-to-one. Functions f constant on a connected component have fᵀDGf = 0, a fact commonly used in spectral clustering [46]. Thus, in order to derive a kernel we again use the pseudoinverse; a small construction sketch in code is given at the end of this section. For more details see Appendix B.

Fig. 8 shows the resulting kernel function K xi. The closer a point is to xi, the larger its corresponding kernel values. Equivalently, under the corresponding GP prior, the correlation of the function value at a certain point with the function value at xi is stronger the closer the point is to xi. Note that the distance is measured in terms of the geodesic distance intrinsic to the graph, not the Euclidean distance of the embedding space.
6. Discussion

We have shown that common linear DE models can be flawlessly integrated into the kernel framework and that trajectory/state estimation and system identification can both be performed with kernel machines such as SVR or GP regression. However, there are already many well-established algorithms for state estimation and system identification. In this section, we discuss how kernel methods relate to these standard methods, and when one should prefer which type of algorithm.

State estimation in the linear state-space model described in Section 4.1 is classically dominated by the Kalman filter/smoother [40] and its variants [27]. For such models the Kalman filter algorithm is also equivalent to graphical model message-passing algorithms [47]. Since all these methods perform optimal state estimation in the state-space model, as do kernel methods such as GP regression or SVR, the results of the two types of methods are identical. The Kalman filter can be interpreted as just an efficient way of computing GP regression, exploiting the special features of (low-dimensional) linear state-space models. SVR is slightly different in that it typically uses an ε-insensitive linear loss function [8] which corresponds to a different likelihood model. For a quadratic loss, however, the output of an SVR will be identical to the mean estimate of a Kalman smoother.

It is interesting to note that even without considering the equivalence of the underlying model assumptions, kernel methods can be related to Kalman filter-like algorithms. For dynamical systems, the matrix RᵀR, whose inverse yields the covariance operator, is block-tridiagonal. Ref. [48] proposes an algorithm to invert such matrices in linear time using a forward–backward scheme that is closely reminiscent of the Kalman smoother algorithm.

Considering system identification for linear ODEs, there exist many different algorithms in the control community, such as subspace identification, Fourier space methods or prediction error methods [27]. Statisticians classically use expectation maximization (EM), which maximizes the marginal likelihood of the model, that is, the likelihood of the observed outputs given the parameters with the hidden states integrated out. The marginal likelihood can be efficiently computed using a Kalman smoother. As for the case of state estimation, all these methods are at least qualitatively equivalent to kernel machine model selection algorithms. The marginal likelihood is also used in GP regression for kernel selection. The cross-validation error can be seen as an approximation of the negative marginal likelihood or the prediction error, which also links SVR to this picture.

Since we have argued above that kernel methods are largely equivalent to standard algorithms for treating DEs, we might ask in which contexts one may benefit from using kernel methods. Kernel methods are to be understood here as algorithms that explicitly compute the kernel function and that perform batch inference by minimizing/integrating an expression of dimension m, where m is the number of measured data points. Conversely, all classical algorithms work sequentially, performing inference without explicitly computing the kernel function.

For one-dimensional problems, that is, ODEs or dynamical systems, Kalman filter or graphical model-based methods concentrate on the chain-like structure of the model. They give rise to many O(N) algorithms for computing marginal means, marginal variances or the marginal likelihood, where N is the number of discretization steps. If only m measurements, m ≪ N, are given, this effort can be reduced to O(m) with a little precomputation, summarizing many small steps without observations into one large step. In contrast, kernel-based methods working with the full covariance matrix typically scale around O(m³) for regression or computing the marginal likelihood. Furthermore, such methods have to compute the kernel function for the given dynamical system. Using the Fourier framework described in Section 4.2, the fast Fourier transform takes O(N log N) time, and using the state-space model, the kernel is given explicitly by Eq. (10). One advantage of the kernel view for dynamical systems is that it yields direct access to all pairwise marginal distributions, even for non-neighbouring points, which is not obvious with sequential algorithms.

For multidimensional problems, that is, in PDEs, the kernel method's view on the joint problem is more useful in practical terms, since message-passing is difficult due to many loops and is not guaranteed to yield the optimal solution [47]. However, in this case, too, the kernel cannot be computed analytically but has to be derived either through a fast Fourier transform or, in the worst case, through matrix inversion, which scales like O(N³). If one aims at estimating the whole latent function f : X → R, then direct optimization of problem (1) may be advantageous in comparison with computing the kernels first and then optimizing the kernelized problem. For example, in graph-based learning one typically solves the estimation problem directly in the so-called primal. However, if the graph were given in advance and the labels of the nodes were only uncovered at a later time, it would be advantageous to precompute the kernel functions, since regression to yield all of f : X → R could then be performed in O(m³) instead of O(N³).

In sum, one could say that the connection between kernels and DEs will typically not yield faster or better algorithms, except in a few special cases. However, it may help to gain deeper theoretical understanding of both kernel methods and DEs. For example, the connection presented shows that, given a state-space model and measurements, the posterior covariances between states at different time points do not depend on the observations; they are simply given through the covariance matrix K. This insight is not obvious from looking at the Kalman update equations. Conversely, the existence of an O(N) inversion algorithm for tridiagonal matrices is not surprising when formulating the inversion in terms of a Kalman filter state estimation problem.
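As a small illustration of the last two points, the sketch below builds the tridiagonal matrix RᵀR for a first-order autoregressive model (a made-up example, not the paper's pendulum), computes a posterior mean once with a banded solver acting on the precision matrix and once via the dense kernel K = (RᵀR)⁻¹, and checks that the two agree.

import numpy as np
from scipy.linalg import solveh_banded

rng = np.random.default_rng(2)

# First-order autoregressive prior: (Rf)_t = (f_{t+1} - a f_t) / s, plus a weak
# prior on f_0, so that Q = R^T R is tridiagonal (all values are made up).
N, a, s = 1000, 0.99, 0.1
R = np.zeros((N, N))
R[0, 0] = 1.0
for t in range(N - 1):
    R[t + 1, t] = -a / s
    R[t + 1, t + 1] = 1.0 / s
Q = R.T @ R

# m noisy observations of the latent function at random time points.
m, sig = 10, 0.05
idx = np.sort(rng.choice(N, m, replace=False))
y = rng.standard_normal(m)

# (a) Sequential-style estimate: solve the banded system (Q + S^T S / sig^2) f = S^T y / sig^2.
A = Q.copy()
b = np.zeros(N)
A[idx, idx] += 1.0 / sig**2
b[idx] = y / sig**2
ab = np.zeros((2, N))                   # upper banded storage for solveh_banded
ab[1] = np.diag(A)
ab[0, 1:] = np.diag(A, 1)
f_banded = solveh_banded(ab, b)

# (b) Kernel-style estimate: f = K[:, idx] (K_XX + sig^2 I)^{-1} y with K = Q^{-1}.
K = np.linalg.inv(Q)
f_kernel = K[:, idx] @ np.linalg.solve(K[np.ix_(idx, idx)] + sig**2 * np.eye(m), y)

assert np.allclose(f_banded, f_kernel)  # identical posterior means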
Appendix A. Complex-valued functions and kernels
6.1. Nonlinear extensions

This paper has so far solely focused on linear DEs or, equivalently, on linear regularization operators. However, there is great interest in nonlinear models in many fields, and it is natural to ask whether any of the insights presented above carry over to such a situation. The disappointing answer is that most of the results are critically dependent on the linearity assumption. If R is not a linear operator, then ‖Rf‖ does not define a norm. Also, interpreting the kernel as the Green's function of RᵀR, that is, the solution of RᵀR K xi = dxi, does not make sense, since the solution of nonlinear differential problems Rf = u cannot in general be represented in terms of a linear sum of such Green's functions as in the linear case. Also, the corresponding probability distributions over functions f : X → R are then, in general, not Gaussian any more, and often cannot be described through an analytic expression at all.

Kernel methods are sometimes used for nonlinear systems, typically in the form that xi+1 = f(xi), where f : Rn → Rn is described by a kernel regression. However, such kernel methods should not be mixed up with the type of kernels we discussed here, since in this paper the kernels were functions of time, not of the preceding state. Furthermore, such one-step-ahead prediction with kernels is not associated with a Gaussian process over trajectories in H, nor does it yield an SVR problem of type (1) over trajectories.

While these are strong negative statements, the dual view of DEs—either in terms of local conditional distributions or more kernel-like as joint distributions over whole functions—may still help to shape intuitions for the nonlinear case and may help to develop new approximate inference algorithms. For example, the authors of Ref. [49] investigate the joint N-dimensional state distribution of a nonlinear DE, and approximate it using an N-variate GP distribution corresponding to a low order linear DE. Their key calculation is motivated in finite dimensions and is then extended to continuous domains. Conversely, one could also ask whether sequential inference schemes for nonlinear DEs such as the extended Kalman filter, the unscented Kalman filter [50], or sequential Monte Carlo methods [51] can be transferred to other, potentially multivariate, nonlinear kernel-like problems.

7. Conclusion

We have presented a joint framework for kernels, RKHSs, Gaussian processes and regularization operators. All these objects are closely related to each other. Given the theoretical framework, it is natural to see stochastic linear DEs as important examples of regularization operators. We have discussed ordinary as well as partial linear differential equations.

While the exposition is kept simple through the use of the finite domain assumption, note that most results also hold for infinite/continuous domains, and we hope the readers will be able to realize this when making comparisons with existing work. An exact treatment for infinite, continuous domains often requires advanced mathematical machinery [21,22,26], and we have thus concentrated on the finite dimensional case, which mostly yields qualitatively similar results.

A good understanding of all the mentioned interrelations between different methods and communities will help the readers to select suitable algorithms for specific problems and may guide their intuition in developing new methods, for example, for dealing with nonlinear DEs. One potential future application may be to explore the meaning of kernel PCA [19] for kernels derived from dynamical systems, which to our knowledge has not yet been studied.

Appendix A. Complex-valued functions and kernels

For finite domains X, complex-valued functions f : X → C are isomorphic to elements in CN = H. Some basics of linear algebra in CN are as follows: Set f∗ = f̄ᵀ, the conjugate transpose of f. The inner product in CN is f∗g = Σi f̄(xi) g(xi) and thus satisfies f∗g = (g∗f)∗. A matrix A is called symmetric or hermitian if A∗ = Āᵀ = A. Hermitian matrices have real eigenvalues λi and an orthogonal basis of eigenfunctions {ui}i=1,...,N; thus, f∗Af is real for any f ∈ H.

Complex-valued algebra does not interfere with the kernel framework. All definitions, theorems and proofs of Section 3 hold if the functions are understood as complex-valued and the appropriate inner product is used. For example, the positive definite kernel condition then states that Σi,j ᾱi αj k(xi, xj) > 0, where the sum is real-valued, since K is a hermitian matrix by assumption. We will not be more explicit here, but just state the following theorem, which shows that the complex-valued theory consistently reduces to the real-valued one described in Section 3 if all involved entities are in fact real.

Theorem 12. With the notation of the SVR objective (3) and the Representer Theorem 8, the following holds: if the observation values {yi | i = 1, ..., m} and the kernel K are real-valued and the loss term is a non-decreasing function of |fa(xi) − yi|, then the function fa : X → C minimizing (3) is real-valued and additionally all coefficients a in Theorem 8 are real.

Proof. Assume f = f R + i f I ∈ H, with f R, f I ∈ RN. Then

‖f‖²K = ‖f R‖²K + ‖f I‖²K + 2 Im((f I)ᵀ K⁻¹ f R),   (A.1)

where the last term is zero since K is real. Hence ‖f‖²K
is minimized for f I = 0. Similarly, the loss term is minimized for f I = 0, since the loss of |f(xi) − yi|² = (dxiᵀ f R − yi)² + (dxiᵀ f I)² is by assumption larger than the loss of |f R(xi) − yi|². Thus the combined minimum is attained for f I = 0. It is f X = K X a, and K X is real and positive definite, thus one-to-one. It follows that f X ∈ Rm requires a ∈ Rm. □
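As a quick numerical sanity check of Theorem 12 (our own toy example, not from the paper): for a real positive definite kernel and real labels, solving the quadratic-loss kernel machine in complex arithmetic yields coefficients whose imaginary parts vanish.

import numpy as np

rng = np.random.default_rng(3)
x = rng.random(8)
y = rng.standard_normal(8)                       # real-valued labels

# Real positive definite (Gaussian) kernel matrix.
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)

# Quadratic loss is a non-decreasing function of |f(xi) - yi|, so Theorem 12 applies.
a = np.linalg.solve(K.astype(complex) + 0.1 * np.eye(8), y.astype(complex))
assert np.allclose(a.imag, 0.0)                  # all coefficients are real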

Appendix B. The cpd world

Regularization operators Rc which are not one-to-one motivate


the use of the cpd framework. For example, regularizing with the
first derivative yields zero penalty for all constant functions, thus Rc Fig. B1. Common objects in the cpd kernel framework and their interrelations.
cannot be one-to-one in this case. Arrows denote that one can uniquely be determined from the other (the ∗ denotes
Most kernel results in Section 3 can be extended to cpd ker- that this connection is not unique). A semi-inner product is an inner product which
nels. However, special care has to be taken of the null space of the is only positive semi-definite.
regularization operator. The description in this section will use the
complex-valued setting as introduced in Appendix A above.
be normalized since the density is constant in the directions of P,
B.1. The pseudoinverse Rc p = 0 for p ∈ P. However, an unnormalizable prior may nev-
ertheless be useful and lead to a valid posterior, if the likelihood
Consider a hermitian matrix A with orthonormal eigendecompo- constrains possible functions f enough.


sition A = i ui i ui . If A is not one-to-one, i.e. ∃i : i = 0, then we We define a semi-inner product (. , . )K c by
can define the (Moore–Penrose) pseudoinverse of A by

( f , g)K c = f T R c Rc g = f T K c+ g. (B.4)
N
 1 ∗
A+ = ui ui .
i A semi-inner product is an inner product which is only positive semi-
i=1, i =0
definite, the corresponding semi-norm . K c is also only positive
 

 ∗ semi-definite. The tuple H, (. , . )K c then is not a Hilbert space, we
Lemma 13. For A as above and P = 
i| i =0 ui ui the orthogonal follow Ref. [26] and call it a native space.
projection from H to the null space N of A, we have (H, (. , . )K c ) can be converted into an RKHS in two ways: firstly, by
restricting the function space to (P⊥ , (. , . )K c ). The second alternative
(1) (A+ )∗ = A+ ; is to extend the inner product to ( f , g)S = ( f , g)K c + f ∗ Pg, such that
(2) AA+ A = A, A+ AA+ = A+ , and A+ A = 1 ⊥ ; (H, (. , . )S ) is an RKHS.
N
(3) [P, A] = 0 where [A, P] = AP − PA; When discussing cpd kernel functions there are some additional
(4) If (1 − P)A(1 − P) is positive definite on N⊥ , then (1 − P) A+ (1 − P) subtleties not encountered in the positive definite case.
is also positive definite on that subspace.
Definition 14. A symmetric function kc : X × X → C is called cpd
B.2. The cpd kernel framework with respect to the linear space P ⊆ H, if for all distinct points
x1 , . . . , xm ∈ X, m  N, and all 0 = a ∈ Cm with
Fig. B1 depicts the most common objects for the cpd setting in
m
 m

parallel to Fig. 3. The structures and interrelations are very similar
j p(xj ) = j p∗ dxj = 0, ∀p ∈ P (B.5)
to the positive definite case, see Section 3.1, but a non-empty null
j=1 j=1
space of Rc requires a few changes.
Throughout this section we will assume that the regularization
we have that
operator Rc : H → G is an arbitrary operator from H to some lin-
ear space G. We do not assume that it is one-to-one. We denote its m 
 m
null space of dimension 0  M  N as P and let P be the orthogonal i j kc (xi , xj )

projection from H to P. If Rc is not one-to-one, neither is R c Rc , i=1 j=1
⎛ ⎞∗ ⎛ ⎞
and we cannot define the covariance operator as the inverse of this m
 m
matrix. Instead, we redefine the covariance operator K c to be a sym- = a∗ K˜ c Xa =
⎝ ˜
i dxi ⎠ K ⎝ j dxj ⎠ > 0,
c (B.6)
metric positive semi-definite matrix, i.e. i=1 j=1

f ∗ K c f  0, ∀f ∈ H. (B.1)
where K˜ c is the operator given as K˜ c ij = kc (xi , xj ).


The covariance operator is then related to the regularization operator In other words, if f = m i=1 i dxi , a = 0, and f p = 0 ∀p ∈ P, then
Rc as f ∗ K˜ c f > 0. Or equivalent but shorter, K˜ c is positive definite on P⊥ .

K c = (R c Rc )+ . (B.2)
It is important to note, that the operator K˜ c which is composed
Note that the null space of K c is also P. The corresponding Gaussian from the cpd kernel function values is not necessarily equal to the
process pK c (f ) has the form covariance operator K c , and there exists famous counter examples,
e.g. thin-plate spline kernel functions. The definition of a cpd ker-
pK c (f ) = NU (0, K c ) ∝ exp(− 12 Rc f 2 ), (B.3)
nel function with respect to P just implies that K˜ c be positive def-
where NU (. , . ) is an unnormalized Gaussian density. If the dimen- inite on P⊥ , it does not make any claim about the behaviour on P.
sion M of the null space P is greater than zero, then pK c (f ) cannot For example, thin-plate spline kernels [26] yield matrices K˜ c which

have f ∗ K˜ c f < 0 for some f ∈ P. This contradicts the positive semi- Table B1
Summary of the objects of the conditionally positive definite kernel framework and
definiteness assumption of the covariance operator K c , which was
their interrelations
enforced since surely  f 2K c = Rc f 2  0 for all f ∈ H.
Entity Symbol Relations
This problem can be circumvented by setting
cpd Kernel func. kc : X × X → C kc (xi , xj ) = K˜ c ij
K c = (1 − P)K˜ c (1 − P). (B.7) Covariance op. Kc :H→H K c = (1 − P)K˜ c (1 − P)

K c = (R c R c )+
Due to the projection step the assignment of a cpd kernel function to ∗
Native space (. , . )K c : H × H → C ( f , g)K c = f ∗ K c+ g = f ∗ R c R c g
a covariance operator is not unique. If {pi } is an orthonormal . K c . : H → R
1/2 c
 f K c = ( f , f )K c = R f 
i=1, ...,M
Gaussian process pK c : H → R pK c ( f ) = NU (0, K c )
basis of P, then Eq. (B.7) implies that
pK c ( f ) ∝ exp(− 12  f 2K c )
∗ ∝ exp(− 12 R c f 2 )
K ijc = dxi (1 − P)K˜ c (1 − P)dxj
pK c ( f )√
Regularization op. Rc : H → G (R c = K c+ , not unique)
 ∗
= k c (xi , xj ) − pl (xi )(pl K˜ c xj )
l

 ∗  Note that condition (B.10) ensures that m i=1 i dxi ∈ P . Further-

− (K˜ c xi pm )pm (xj ) + pl (xi )(pl pm )pm (xi ). (B.8)
m
more, it is i=1 i K xci = K c


m  d , and K c and K˜ c just differ
i=1 i xi xi xi
m
by an element of P. Thus, one could replace K c in Eq. (B.9) by K˜ c
l,m
xi xi
Note that above we have made an important assumption that does without changing the expression. Practically that means that we can
not in general hold for infinite domains and thus requires a slightly work directly with the cpd kernel function when performing SVR
different formalism when extended to this setting. We have assumed regression and do not have to use the more complicated expression
that an L2 -type inner product exists in H. While we could restrict (B.8) which includes projections.
the space of functions H to L2 (X) for infinite domains, this is not

natural for our purposes. Since we aim at regularizing with Rc f  Proof. The theorem states that f (xi ) = m  Kc + M
j=1 j ji
 p(xi ), i =
j=1 j
we only need this expression to be well defined. We do not need

1, . . . , m, where mi=1 i pj (x i ) = 0, j = 1, . . . , M. In matrix notation this is


that f itself has a finite L2 norm, it could be an element of a larger
   
space than L2 (X). For example, using X = R and regularizing with c a K Xc T a fX
K ext ≡ = (B.12)
the first derivative we could include constant functions into H even b T∗ 0 b 0
though an L2 -type inner product between two linear functions on
R does not exist. While for finite domains it is trivially H ⊆ L2 (X), with T ∈ Cm×M defined by T ij = pj (xi ). This system is uniquely solv-
Ref. [26] gives an account for more general function spaces H and able for (a, b) because of the following argument due to Ref. [26, p.
c . Then we have
117]: Suppose that (a, b) lies in the null space of K ext
infinite domains. Specifically, he uses a slightly different projection
for relating the covariance operator with the kernel function in Eqs.
(B.7) and (B.8). K Xc a + T b = 0,
The results of this section are summarized in Table B1. T ∗ a = 0.

K Xc is positive definite for all a that satisfy the second equation.


B.3. Support vector machines Multiplying the first equation by a∗ yields 0 = a∗ K Xc a + (T ∗ a)∗ b =
a∗ K Xc a. Due to positive definiteness, we can conclude that a = 0 and
Employing regularization operators which are not necessarily thus T b = 0. Since X is a unisolvent set of points, this implies b = 0.
one-to-one leads to SVR which is slightly different from the positive Returning to the inhomogeneous system (B.12) it can be shown
definite case. As in Section 3.2, Lemma 7; we first present a useful [24] using block matrix inversion theorems that
decomposition of an arbitrary function in H and then the represen-
ter theorem follows. a = (K Xc − K Xc T(T ∗ K Xc T )+ T ∗ K Xc ) f X , (B.13)
 
Definition 15. A set X = xi |i = 1, . . . , m ⊆ X, m  N, of points b = (T ∗ K Xc T )+ T ∗ K Xc f X . (B.14)

is called unisolvent with respect to the linear space P ⊆ H,


m
M
i=1 i K xi + p. 
Finally, set q = f − c
j=1 j j
dim(P)  m, if the only solution for p(xi ) = 0 with p ∈ P, i = 1, . . . , m
is p = 0.
Using this decomposition, the representer theorem for cpd ker-
  nels is straight-forward as in the positive definite case.
Lemma 16. Given distinct points X = xi |i = 1, . . . , m , m  N, which
are unisolvent with respect to P, any f ∈ H can be written like Theorem 17 (Representer theorem). Given distinct, unisolvent points
   
m M X = xi |i = 1, . . . , m ⊆ X, m  N, and labels yi |i = 1, . . . , m ⊆ C, C ∈
 
f= i K xci + j pj + q, (B.9) R, the minimizer of
i=1 j=1
 f 2K c + C Loss({(xi , yi , f (xi ))|i = 1, . . . , m}) (B.15)
m M
where {pj } is a basis of P and a ∈ C , b ∈ C , and q ∈ H
m
M
j=1, ...,M a ∈ Cm , b ∈ CM minimize
i=1 i K xi + j=1 j pj .
has the form f a,b = c
are uniquely determined and satisfy the following conditions:
the expression
⎛ ⎞
m m
∗  a∗ K Xc a + C Loss({(xi , yi , f (xi ))|i = 1, . . . , m}) (B.16)
i p (xi ) = p ⎝ i dxi ⎠ = 0, j = 1, . . . , M,
j j
(B.10)
i=1 i=1
subject to the conditions
q(xi ) = 0, i = 1, . . . , m. (B.11) m

i pj (xi ) = 0, j = 1, . . . , M. (B.17)
Furthermore,  f 2K c can then be written as  f 2K c = a∗ K Xc a + q2K c . i=1
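To make the constrained form of Theorem 17 concrete, here is a small sketch in the spirit of the block system (B.12), solved jointly for the kernel coefficients a and the null-space coefficients b with a quadratic loss. The choice of cpd kernel (|x − x'|³, which is cpd with respect to linear functions in one dimension), the ridge weighting lam, and the data are our own and not taken from the paper.

import numpy as np

rng = np.random.default_rng(4)
m = 30
x = np.sort(rng.random(m))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(m)

# Conditionally positive definite kernel and a basis T of the null space P = span{1, x}.
Kc = np.abs(x[:, None] - x[None, :]) ** 3
T = np.stack([np.ones(m), x], axis=1)

# Block system with a quadratic loss term (ridge parameter lam, our choice).
lam = 1e-3
A = np.block([[Kc + lam * np.eye(m), T], [T.T, np.zeros((2, 2))]])
sol = np.linalg.solve(A, np.concatenate([y, np.zeros(2)]))
a, b = sol[:m], sol[m:]

# Evaluate f_{a,b}(x*) = sum_i a_i k(x*, x_i) + sum_j b_j p_j(x*) on a grid.
xs = np.linspace(0, 1, 200)
f = (np.abs(xs[:, None] - x[None, :]) ** 3) @ a + np.stack([np.ones(200), xs], axis=1) @ b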

B.4. GP inference Proof. As a first step note that V( , f ) is strictly convex in f for all ∈
U. Both R f 2 andLoss({(xi , yi , f (xi ))|i = 1, . . . , m}) are convex with
The decomposition in Lemma 16 is also the key to compute the respect to f for all . If = 0 then R f 2 is strictly convex and so is
marginals of an unnormalized Gaussian process. As in Section 3.3 we the sum (“strictly convex + convex = strictly convex”). If = 0 then
will call this the GP representer theorem for the cpd case. R f 2 is constant in the direction of vectors p ∈ P. However, for
these p at least one of the p(xi ), i = 1, . . . , m, is not equal to zero since
Theorem 18. For X ⊆ X unisolvent with respect to P, the marginal dis- X is unisolvent. Thus, the loss term is strictly convex with respect to
tribution pK c ( f X ) ∝ NU (0, M + ) under the joint GP pK c ( f ) ∝ NU (0, K c )  where f  = f + p, and so is the whole objective function.
is given by Since V( , f ) is strictly convex in f and continuously differen-
tiable, the unique minimum for given is determined by
M = K Xc − K Xc T(T ∗ K Xc T )+ T ∗ K Xc , (B.18)
j
F( , f ) ≡ V( , f ) = 0.
where {pj } is a basis of P and T ij = pj (xi ). jf
j=1, ...,M


By assumption F : U × CN → CN is continuously differentiable and
Proof. By Lemma 16 any f ∈ H can be written as f = m i=1 i K xi +
c

M (j/ jf )F( , f ) = (j2 / jf 2 )V( , f ) is invertible since the objective is


 p + q where (xi ) = 0, i = 1, . . . , m. Therefore q is independent
j=1 j j strictly convex. Using the implicit function theorem [52, p. 292] there
of f X . Furthermore with Eq. (B.13) it is exists a continuous function f : U → H with F( , f ) = 0. 

f 2K c = a∗ K cX a + q2K c
Given this theorem one could argue that the cpd framework is
= f X∗ (K Xc − K Xc T (T ∗ K Xc T )+ T ∗ K Xc ) f X + q2K c unnecessary: if the goal is to regularize with a non-one-to-one op-
= f X∗ Mf X + q2K c . erator R one could just use a slightly perturbed version of R which
actually is one-to-one and for which one could use the positive def-
From that it follows that inite framework. The solution of a SVR would then not differ very
much from the unperturbed result. However, if R∗ R is nearly sin-

1 gular the corresponding covariance operator K = (R∗ R)−1 will have
p( f X ) ∝ exp − Rc f 2 df X\X
2 some large values. Computations with such a kernel will then be nu-
 
1 ∗ + 1 merically unstable, and it is better to use the cpd framework instead.
∝ exp − f X M X f X exp − q2K c df X\X
2 2
  
=const Appendix C. Additional proofs

1
∝ exp − f X∗ M + f
X X
.  In the finite domains, H with any inner product (. , . )S is an
2
RKHS, also with the usual L2 inner product. To see this note that in
B.5. Transitions between the cpd and the positive definite world
RN all norms are equivalent and |xi ( f )| = |f (xi )|   f 1  C f S .

Imagine a family of regularization operators R : H → G contin- Proof (Lemma 5). (1) Riesz's theorem.
uously parameterized by ∈ U where U ⊆ R is an open neighbour- (2) Since the functionals xi are linearly independent, so are


m
hood of 0. Assume that R is one-to-one for all except for = 0. their representers Sxi . Then for a = 0 it is m i=1 j=1 i j s(xi , xj ) =

m
m
m
Thus, for = 0 we have to use the cpd framework, for = 0 we i=1 j=1  i  j (S x i , S x )S =  i=1 i S x 2 > 0.
i S
j

should use the positive definite scheme. However, the limit of K for (3) Set T ij = (dxi , dxj )S . Then for any f = i f (xi )dxi , g = i g(xi )dxi ,
0 = → 0 is not equal to K c . The limit does not even exist since

=0 it is ( f , g)S = i,j f (xi )g(xj )(dxi , dxj )S = i,j f (xi )g(xj )T ij = f T Tg.
in the positive definite case the kernel is the inverse of R∗ R which
diverges for → 0. On the other hand, the SVR objective function (4) Using the reproducing property on dxi , ij =(Sxi , dxj )S =dTxi ST dxj
and ij = (dxi , Sxj )S = dTxi TSdxj for all xi , xj ∈ X implies the claim.
V( , f ) ≡ R f 2 + C Loss({(xi , yi , f (xi ))|i = 1, . . . , m}) (B.19)
(5) Since necessarily S = T −1 and T uniquely defines the inner
product, the last claim follows. 
depends continuously on . Thus one might hope that the minimizer
also depends continuously on .
The following theorem which is novel to our knowledge shows Proof (Lemma 7). f is the sum of a part f a in the span of the K xi , xi ∈
that this apparent problem of continuity can be resolved. It shows X, and the K-orthogonal complement q. The orthogonality condition
especially that, while the kernel is diverging for → 0, the SVR (K xi , q)K = 0 implies (xi ) = 0. Since K is positive definite, so is the
solution for = 0 converges for → 0, and that the limiting element submatrix K X . Therefore the system f X = K X a is uniquely solvable
is equal to the cpd SVR solution for = 0. for a ∈ Rm . 

Theorem 19. Let R : H → G depend continuously differentiable on Proof (Theorem 8). Following Lemma 7, and f ∈ H can be written
as f = f a + q with ( f a , q)K = 0. The objective can then be written as
∈ U, U ∈ Rd an open neighbourhood of 0 and let R be one-to-one
if and only if = 0. Let P be the null space of R =0 . Furthermore, let
  aT K X a + q2K + C Loss(xi , yi , fa (xi )) .
X = xi |i = 1, . . . , m ⊆ X, m  N, be a set of distinct points unisolvent i=1, ...,m
 
with respect to P with corresponding observations yi |i = 1, . . . , m ⊆ The loss term is independent of q because (xi ) = 0, i = 1, . . . , m, and
C. The minimizer f = arg min V( , f ) depends continuously on , if thus the objective is minimized for q = 0. Convexity of the loss and
f ∈H the uniqueness of the map between f  and a, Lemma 7, imply that
Loss({(xi , yi , f (xi ))|i=1, . . . , m}) is strictly convex and twice continuously the whole objective here is convex in a. Thus, the minimum is unique
differentiable with respect to the f (xi ). in this case. 

References [27] L. Ljung, System Identification—Theory for the User, second ed., Prentice-Hall,
Upper Saddle River, NJ, 1999.
[1] W. Kienzle, G. Bakir, M. Franz, B. Schölkopf, Face detection—efficient and rank [28] G. Kimeldorf, G. Wahba, A correspondence between bayesian estimation on
deficient, Advances in Neural Information Processing Systems, vol. 17, MIT stochastic processes and smoothing by splines, Ann. Math. Stat. 41 (2) (1970)
Press, Cambridge, MA, 2005, pp. 673–680. 495–502.
[2] L. Zhang, B. Wu, R. Nevatia, Detection and tracking of multiple humans with [29] W. Madych, S. Nelson, Multivariate interpolation and conditionally positive
extensive pose articulation, in: Proceedings of the 2007 IEEE Computer Society definite functions. II, Math. Comput. 54 (189) (1990) 211–230.
Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8. [30] A. Smola, B. Schölkopf, K.-R. Müller, The connection between regularization
[3] X. Song, G. Iordanescu, A. Wyrwicz, One-class machine learning for brain operators and support vector kernels, Neural Networks 11 (1998) 637–649.
activation detection, in: Proceedings of the 2007 IEEE Computer Society [31] T. Graepel, Solving noisy linear operator equations by Gaussian processes:
Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8. application to ordinary and partial differential equations, in: Proceedings of the
[4] G. Sanguinetti, M. Milo, M. Rattray, N. Lawrence, Accounting for probe-level 20th International Conference on Machine Learning, vol. 20, 2003, pp. 234–241.
noise in principal component analysis of microarray data, Bioinformatics 21 [32] M. Hein, O. Bousquet, Kernels, associated structures and generalizations,
(19) (2005) 3748–3754. Technical Report 127, Max Planck Institute for Biological Cybernetics, Tübingen,
[5] N.D. Lawrence, J. Quiñonero-Candela, Local distance preservation in the GP- Germany, 2004.
LVM through back constraints, in: Proceedings of the International Conference [33] N.E. Heckman, J.O. Ramsay, Penalized regression with model-based penalties,
in Machine Learning, 2006, pp. 513–520. Can. J. Stat. 28 (2000) 241–258.
[6] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, [34] F. Steinke, B. Schölkopf, Machine learning methods for estimating operator
1995. equations, in: Proceedings of the 14th IFAC Symposium on System Identification,
[7] C. Burges, A tutorial on support vector machines for pattern recognition, Data SYSID06, Elsevier, Amsterdam, 2006, pp. 1–6.
Min. Knowl. Discovery 2 (2) (1998) 121–167. [35] F. Girosi, M. Jones, T. Poggio, Regularization theory and neural network
[8] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002 architectures, Neural Comput. 7 (1995) 219–267.
URL http://www.learning-with-kernels.org. [36] C. Walder, B. Schölkopf, O. Chapelle, Implicit surface modelling with a globally
[9] J. Weston, C. Watkins, Support vector machines for multiclass pattern regularised basis of compact support, Comput. Graphics Forum 25 (3) (2006)
recognition, in: Proceedings of the Seventh European Symposium on Artificial 635–644.
Neural Networks, 1999. [37] C. Micchelli, M. Pontil, On learning vector-valued functions, Neural Comput. 17
[10] A. Smola, B. Schölkopf, A tutorial on support vector regression, Stat. Comput. (1) (2005) 177–204.
14 (3) (2004) 199–222. [38] R. Curtain, H. Zwart, An Introduction to Infinite Dimensional Linear Systems
[11] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, R. Williamson, Estimating Theory, Springer, Berlin, 1995.
the support of a high-dimensional distribution, Neural Comput. 13 (7) (2001) [39] S. Bochner, Monotone Funktionen, Stieltjessche integrate und harmonische
1443–1471. analyse, Math. Ann. 108 (1933) 378–410.
[12] O. Chapelle, B. Schölkopf, A. Zien, Semi-supervised Learning, MIT Press, [40] R. Kalman, A new approach to linear filtering and prediction problems, J. Basic
Cambridge, MA, 2006. Eng. 82 (1) (1960) 35–45.
[13] B. Schölkopf, A. Smola, K. Müller, Nonlinear component analysis as a kernel [41] M. Floater, K. Hormann, Surface parameterization: a tutorial and survey,
eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319. Advances in Multiresolution for Geometric Modelling, vol. 1, Springer, Berlin,
[14] C. Williams, D. Barber, Bayesian classification with Gaussian processes, IEEE 2005.
Trans. Pattern Anal. Mach. Intell. 20 (12) (1998) 1342–1351. [42] M. Hein, J.-Y. Audibert, U. von Luxburg, Graph Laplacians and their convergence
[15] M. Opper, O. Winther, Gaussian processes for classification: mean-field on random neighborhood graphs, J. Mach. Learn. Res. 8 (2007) 1325–1370.
algorithms, Neural Comput. 12 (11) (2000) 2655–2684. [43] A. Smola, R. Kondor, Kernels and regularization on graphs, in: Proceedings of
[16] C. Williams, C. Rasmussen, Gaussian processes for regression, Advances in the Conference on Learning Theory, Springer, Berlin, 2003.
Neural Information Processing Systems, vol. 8, 1996, pp. 514–520. [44] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields
[17] N. Lawrence, M. Seeger, R. Herbrich, Fast sparse Gaussian process methods: and harmonic functions, in: Proceedings of the 20th International Conference
the informative vector machine, Advances in Neural Information Processing on Machine Learning, vol. 20, 2003.
Systems, vol. 15, 2003, pp. 609–616. [45] M. Tipping, C. Bishop, Bayesian image super-resolution, in: Advances in Neural
[18] N. Lawrence, Gaussian process latent variable models for visualisation of high Information Processing Systems, vol. 15, 2003, pp. 1279–1286.
dimensional data, Advances in Neural Information Processing Systems, vol. 16, [46] U. von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007)
2004. 395–416.
[19] A. Smola, B. Schölkopf, K. Müller, The connection between regularization [47] M. Jordan, Z. Ghahramani, T. Jaakkola, L. Saul, An introduction to variational
operators and support vector kernels, Neural Networks 11 (4) (1998) 637–649. methods for graphical models, Mach. Learning 37 (2) (1999) 183–233.
[20] F. Girosi, M. Jones, T. Poggio, Priors stabilizers and basis functions: from [48] Y. Huang, W. McColl, Analytical inversion of general tridiagonal matrices, J.
regularization to radial, tensor and additive splines, A.I. Memo No. 1430, MIT Phys. A: Math. General 30 (1997) 7919–7933.
Press, Cambridge, MA, 1993. [49] C. Archambeau, D. Cornford, M. Opper, J. Shawe-Taylor, Gaussian process
[21] V. Bogachev, Gaussian Measures, AMS, New York, 1998. approximations of stochastic differential equations, J. Mach. Learn. Res., in:
[22] B. Oksendal, Stochastic differential equations: an introduction with applications, Workshop and Conference Proceedings, vol. 1, 2007, pp. 1–16.
sixth ed., Springer, Berlin, 2002. [50] S. Julier, J. Uhlmann, A new extension of the Kalman filter to nonlinear systems,
[23] C.E. Rasmussen, C.K. Williams, Gaussian Processes for Machine Learning, MIT in: I. Kadar (Ed.), Proceedings of the Conference on Signal Processing, Sensor
Press, Cambridge, MA, 2006. Fusion, and Target Recognition VI, vol. 3068, 1997, pp. 182–193.
[24] G. Wahba, Spline models for observational data, SIAM, Philadelphia, PA, 1990. [51] A. Doucet, N. de Freitas, N. Gordon, Sequential Monte Carlo Methods in Practice,
[25] J.O. Ramsay, B.W. Silverman, Functional Data Analysis, second ed., Springer, Springer, Berlin, 2001.
Berlin, 2005. [52] H. Heuser, Lehrbuch der Analysis, Teil 2, B. G. Teubner, Stuttgart, Germany,
[26] H. Wendland, Scattered Data Approximation, Cambridge University Press, 1991.
Cambridge, UK, 2005.

About the Author—FLORIAN STEINKE earned a Diplom in Physics from the Eberhard-Karls-Universität, Tübingen, in 2005. He is currently pursuing a Ph.D. at the Max Planck
Institute for Biological Cybernetics in Tübingen under the supervision of Bernhard Schölkopf. His diploma thesis on Modeling Human Heads with Implicit Surfaces won the
DAGM-SMI prize for the best diploma thesis on Pattern Recognition in Germany in the academic year 2005–2006.

About the Author—BERNHARD SCHÖLKOPF was born in Stuttgart on 20 February, 1968. He received an M.Sc. in Mathematics and the Lionel Cooper Memorial Prize from
the University of London in 1992, followed in 1994 by the Diplom in Physics from the Eberhard-Karls-Universität, Tübingen. Three years later, he obtained a Doctorate in
Computer Science from the Technical University Berlin. His thesis on Support Vector Learning won the annual dissertation prize of the German Association for Computer
Science (GI). In 1998, he won the prize for the best scientific project at the German National Research Center for Computer Science (GMD). He has researched at AT&T Bell
Labs, at GMD FIRST, Berlin, at the Australian National University, Canberra, and at Microsoft Research Cambridge (UK). He has taught at Humboldt University, Technical
University Berlin, and Eberhard-Karls-University Tübingen. In July 2001, he was appointed scientific member of the Max Planck Society and director at the MPI for Biological
Cybernetics; in October 2002, he was appointed Honorarprofessor for Machine Learning at the Technical University Berlin. In 2006, he received the J.K. Aggarwal Prize of
the International Association for Pattern Recognition. He has been the program chair of COLT and NIPS and serves on the editorial boards of JMLR, IEEE PAMI and IJCV.
