Reproducing Kernel Hilbert Space, Mercer's Theorem, Eigenfunctions, Nyström Method, and Use of Kernels in Machine Learning: Tutorial and Survey
Thomas-Agnan, 2011). The first work on RKHS was (Aronszajn, 1950). Later, the concepts of RKHS were improved further in (Aizerman et al., 1964). The RKHS remained in pure mathematics until this space was used for the first time in machine learning through the introduction of kernel Support Vector Machine (SVM) (Boser et al., 1992; Vapnik, 1995). Eigenfunctions were also developed for the eigenvalue problem applied on operators and functions (Williams & Seeger, 2000) and were used in machine learning (Bengio et al., 2003c) and physics (Kusse & Westwig, 2006). This is related to RKHS because it uses a weighted inner product in Hilbert space (Williams & Seeger, 2000) and RKHS is a Hilbert space of functions with a reproducing kernel.

Using kernels was widely noticed when linear SVM (Vapnik & Chervonenkis, 1974) was kernelized (Boser et al., 1992; Vapnik, 1995). Kernel SVM was very successful because of its merits. Kernel SVM and neural networks were the two competing models which could handle nonlinear data (see (Fausett, 1994) for the history of neural networks). Kernel SVM transformed nonlinear data to RKHS, hoping to make the pattern of the data linear, and then applied linear SVM on it. The approach of neural networks was different, however, because the model itself was nonlinear (see Section 8 for more details). The success of kernel SVM plus the problem of vanishing gradients in neural networks (Goodfellow et al., 2016) resulted in the winter of neural networks around the years 2000 to 2006. However, the problems of training deep neural networks started to be resolved (Hinton & Salakhutdinov, 2006), and their success plus two problems of kernel SVM helped neural networks gradually take over kernel SVM. One problem of kernel SVM was not knowing the suitable kernel type for various learning problems. In other words, kernel SVM still required the user to choose the type of kernel, whereas neural networks were end-to-end and almost robust to hyperparameters such as the number of layers or neurons. Another problem was that kernel SVM could not handle big data, although the Nyström method, first proposed in (Nyström, 1930), was used to resolve this problem of kernel methods by approximating kernels from a subset of data (Williams & Seeger, 2001). Note that kernels have been used widely in machine learning, such as in SVM (Vapnik, 1995), Gaussian process classifiers (Williams & Barber, 1998), and spline methods (Wahba, 1990). The types of use of kernels in machine learning will be discussed in Section 9.

1.2. Useful Books on Kernels
There exist several books about the use of kernels in machine learning. Some examples are (Smola & Schölkopf, 1998; Schölkopf et al., 1999a; Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004; Camps-Valls, 2006; Steinwart & Christmann, 2008; Rojo-Álvarez et al., 2018; Kung, 2014). Some survey papers about kernel-based machine learning are (Hofmann et al., 2006; 2008; Müller et al., 2018). In addition to some of the above-mentioned books, there exist some other books/papers on kernel SVM such as (Schölkopf et al., 1997b; Burges, 1998; Hastie et al., 2009).

1.3. Kernel in Different Fields of Science
The term kernel has been used in different fields of science for various purposes. In the following, we briefly introduce the different uses of kernel in science to clarify which use of kernel we are focusing on in this paper.

1. Kernel of filter in signal processing: In signal processing, one can use filters to filter a piece of signal, such as an image (Gonzalez & Woods, 2002). Digital filters have a kernel which determines the values of the filter in the window of the filter (Schlichtharle, 2011). In convolutional neural networks, the filter kernels are learned in deep learning (Goodfellow et al., 2016).

2. Kernel smoothing for density estimation: Kernel density estimation can be used for fitting a mixture of distributions to some data instances (Scott, 1992). For this, a histogram with an infinite number of bins is utilized. In the limit, this histogram converges to a kernel smoothing (Wand & Jones, 1994) where the kernel determines the type of distribution. For example, if a Radial Basis Function (RBF) kernel is used, a mixture of Gaussian distributions is fitted to the data.

3. Kernelization in complexity theory: Kernelization is a pre-processing technique where the input to an algorithm is replaced by a part of the input named the kernel. The output of the algorithm on the kernel should either be the same as, or be transformable to, the output of the algorithm for the whole input (Fomin et al., 2019). An example usage of kernelization is in the vertex cover problem (Abu-Khzam et al., 2004).

4. Kernel in operating systems: The kernel is the core of an operating system, such as Linux, which connects the hardware, including CPU, memory, and peripheral devices, to applications (Anderson & Dahlin, 2014).

5. Kernel in linear algebra and graphs: Consider a mapping from the vector space V to the vector space W as L : V → W. The kernel, also called the nullspace, of this mapping is defined as ker(L) := {v ∈ V | L(v) = 0}. For example, for a matrix A ∈ R^{a×b}, the kernel of A is ker(A) = {x ∈ R^b | Ax = 0}. The four fundamental subspaces of a matrix are its kernel, row space, column space, and left null space (Strang, 1993). Note that the kernel (nullspace) of the adjacency matrix in graphs has also been well developed (Akbari et al., 2006).

6. Kernel in other domains of mathematics: There exist kernel concepts in other domains of mathematics and statistics, such as the geometry of polygons (Icking & Klein, 1995), set theory (Bergman, 2011, p. 14), etc.
7. Kernel in feature space for machine learning: In statistical machine learning, kernels pull data to a feature space for the sake of better discrimination of classes or simpler representation of data (Hofmann et al., 2006; 2008). In this paper, our focus is on this category, which is kernels for machine learning.

1.4. Organization of Paper
This paper is a tutorial and survey paper on kernels and kernel methods. It can be useful for several fields of science including machine learning, functional analysis in mathematics, and mathematical physics in quantum mechanics. The remainder of this paper is organized as follows. Section 2 introduces the Mercer kernel, important spaces in functional analysis including the Hilbert and Banach spaces, and Reproducing Kernel Hilbert Space (RKHS). Mercer's theorem and its proof are provided in Section 3. Characteristics of kernels are explained in Section 4. We introduce frequently used kernels, kernel construction from a distance metric, and important classes of kernels in Section 5. Kernel centering and normalization are explained in Section 6. Eigenfunctions are then introduced in Section 7. We explain two techniques for kernelization in Section 8. Types of use of kernels in machine learning are reviewed in Section 9. Kernel factorization and Nyström approximation are introduced in Section 10. Finally, Section 11 concludes the paper.

Required Background for the Reader
This paper assumes that the reader has general knowledge of calculus, linear algebra, and basics of optimization. The required basics of functional analysis are explained in the paper.

2. Mercer Kernel and Spaces in Functional Analysis
2.1. Mercer Kernel and Gram Matrix
Definition 1 (Mercer Kernel (Mercer, 1909)). The function k : X² → R is a Mercer kernel function (also known as a kernel function) where:
1. it is symmetric: k(x, y) = k(y, x),
2. and its corresponding kernel matrix K(i, j) = k(x_i, x_j), ∀i, j ∈ {1, . . . , n} is positive semi-definite: K ⪰ 0.
The corresponding kernel matrix of a Mercer kernel is a Mercer kernel matrix.

The two properties of a Mercer kernel will be proved in Section 4. By convention, unless otherwise stated, the term kernel refers to a Mercer kernel. The effectiveness of the Mercer kernel will be shown and proven in Mercer's theorem, i.e., Theorem 2.

Definition 2 (Gram Matrix or Kernel Matrix). The matrix K ∈ R^{n×n} is a Gram matrix, also known as a Gramian matrix or a kernel matrix, whose (i, j)-th element is:

K(i, j) := k(x_i, x_j), ∀i, j ∈ {1, . . . , n}.   (1)

Here, we defined the square kernel matrix applied on a set of n data instances; hence, the kernel is an n × n matrix. We may also have a kernel matrix between two sets of data instances. This will be explained more in Section 8. Moreover, note that the kernel matrix can be computed using the inner product between pulled data in the feature space. This will be explained in detail in Section 3.2.

2.2. Hilbert, Banach, Lp, and Sobolev Spaces
Before defining the RKHS and details of kernels, we need to introduce the Hilbert, Banach, Lp, and Sobolev spaces, which are well-known spaces in functional analysis (Conway, 2007).

Definition 3 (Metric Space). A metric space is a set on which a metric, for measuring the distance between instances of the set, is defined.

Definition 4 (Vector Space). A vector space is a set of vectors equipped with some available operations such as addition and multiplication by scalars.

Definition 5 (Complete Space). A space F is complete if every Cauchy sequence converges to a member of this space f ∈ F. Note that a Cauchy sequence is a sequence whose elements become arbitrarily close to one another as the sequence progresses (i.e., it converges in the limit).

Definition 6 (Compact Space). A space is compact if it is closed (i.e., it contains all its limit points) and bounded (i.e., all its points lie within some fixed distance of one another).

Definition 7 (Hilbert Space (Reed & Simon, 1972)). A Hilbert space H is an inner product space that is a complete metric space with respect to the norm or distance function induced by the inner product.

The Hilbert space generalizes the Euclidean space to a finite or infinite dimensional space. Usually, the Hilbert space is high dimensional. By convention in machine learning, unless otherwise stated, the Hilbert space is also referred to as the feature space. By feature space, researchers often specifically mean the RKHS, which will be introduced in Section 2.3.

Definition 8 (Banach Space (Beauzamy, 1982)). A Banach space is a complete vector space equipped with a norm.

Remark 1 (Difference of Hilbert and Banach Spaces). A Hilbert space is a special case of a Banach space whose norm is defined using an inner product. All Hilbert spaces are Banach spaces but the converse is not true.
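To make Definitions 1 and 2 concrete, the following is a minimal NumPy sketch (not from the original text) that builds the Gram matrix of Eq. (1) for the RBF kernel and checks the two Mercer conditions numerically; the function name rbf_kernel_matrix and the bandwidth gamma are our own choices for illustration.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # Gram matrix of Eq. (1) for the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2);
    # X has shape (n, d) with one data instance per row.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

X = np.random.randn(5, 3)             # n = 5 points in d = 3 dimensions
K = rbf_kernel_matrix(X, gamma=0.5)

# The two Mercer conditions of Definition 1: symmetry and positive semi-definiteness.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)   # non-negative up to numerical error
```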
Suppose R^n, H, B, M_c, M, T denote the Euclidean space, Hilbert space, Banach space, complete metric space, metric space, and topological space (containing both open and closed sets), respectively. Then, we have:

R^n ⊂ H ⊂ B ⊂ M_c ⊂ M ⊂ T.   (2)

Definition 9 (Lp Space). Consider a function f with domain [a, b]. For p > 0, let the Lp norm be defined as:

‖f‖_p := ( ∫_a^b |f(x)|^p dx )^{1/p}.   (3)

The Lp space is defined as the set of functions with bounded Lp norm:

L_p(a, b) := {f : [a, b] → R | ‖f‖_p < ∞}.   (4)

Definition 10 (Sobolev Space (Renardy & Rogers, 2006, Chapter 7)). A Sobolev space is a vector space of functions equipped with Lp norms and derivatives:

W_{m,p} := {f ∈ L_p(0, 1) | D^m f ∈ L_p(0, 1)},   (5)

where D^m f denotes the m-th order derivative.

Note that the Sobolev spaces are RKHS with some specific kernels (Novak et al., 2018). RKHS will be explained in Section 2.3.

2.3. Reproducing Kernel Hilbert Space
2.3.1. Definition of RKHS
Reproducing Kernel Hilbert Space (RKHS), first proposed in (Aronszajn, 1950), is a special case of Hilbert space with some properties. It is a Hilbert space of functions with reproducing kernels (Berlinet & Thomas-Agnan, 2011). After the initial work on RKHS (Aronszajn, 1950), another work (Aizerman et al., 1964) developed the RKHS concepts. In the following, we introduce this space.

Definition 11 (RKHS (Aronszajn, 1950; Berlinet & Thomas-Agnan, 2011)). A Reproducing Kernel Hilbert Space (RKHS) is a Hilbert space H of functions f : X → R with a reproducing kernel k : X² → R where k(x, ·) ∈ H and f(x) = ⟨k(x, ·), f⟩.

The RKHS is explained in more detail in the following. Consider the kernel function k(x, y), which is a function of two variables. Suppose, for n points, we fix one of the variables to have k(x_1, y), k(x_2, y), . . . , k(x_n, y). These are all functions of the variable y. RKHS is a function space which is the set of all possible linear combinations of these functions (Kimeldorf & Wahba, 1971), (Aizerman et al., 1964, p. 834), (Mercer, 1909):

H := { f(·) = Σ_{i=1}^n α_i k(x_i, ·) } =(a) { f(·) = Σ_{i=1}^n α_i k_{x_i}(·) },   (6)

where (a) is because we define k_x(·) := k(x, ·). This equation shows that the bases of an RKHS are kernels. The proof of this equation is obtained by considering both Eqs. (21) and (34) together (n.b. for better organization, it is better that we provide those equations later). It is also noteworthy that this equation will appear again in Theorem 1.

According to Eq. (6), every function in the RKHS can be written as a linear combination. Consider two functions in this space represented as f = Σ_{i=1}^n α_i k(x_i, ·) and g = Σ_{j=1}^n β_j k(y_j, ·). Hence, the inner product in RKHS is calculated as:

⟨f, g⟩_k =(6) ⟨ Σ_{i=1}^n α_i k(x_i, ·), Σ_{j=1}^n β_j k(y_j, ·) ⟩_k =(a) ⟨ Σ_{i=1}^n α_i k(x_i, ·), Σ_{j=1}^n β_j k(·, y_j) ⟩_k = Σ_{i=1}^n Σ_{j=1}^n α_i β_j k(x_i, y_j),   (7)

where (a) is because the kernel is symmetric (it will be proved in Section 4). Hence, the norm in RKHS is calculated as:

‖f‖_k := √⟨f, f⟩_k.   (8)

The subscript of the norm and inner product in RKHS has various notations in the research papers. Some of the most common notations are ⟨f, g⟩_k, ⟨f, g⟩_H, ⟨f, g⟩_{H_k}, and ⟨f, g⟩_F, where H_k denotes the Hilbert space associated with kernel k and F stands for the feature space, because RKHS is sometimes referred to as the feature space.

Remark 2 (RKHS Being Unique for a Kernel). Given a kernel, the corresponding RKHS is unique (up to isometric isomorphisms). Given an RKHS, the corresponding kernel is unique. In other words, each kernel generates a new RKHS.

Remark 3. As we will also see in Mercer's theorem, the bases of the RKHS are the eigenfunctions {ψ_i(·)}_{i=1}^∞, which are functions themselves. This, along with Eq. (6), shows that the RKHS is a space of functions and not a space of vectors. In other words, the basis vectors of RKHS are basis functions named eigenfunctions. Because the RKHS is a space of functions rather than a space of vectors, we usually do not know the exact location of pulled points in the RKHS, but we know their relation as a function. This will be explained more in Section 3.2 and Fig. 1.
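As a small illustration of Eqs. (6)-(8), the following sketch (our own, not from the text) represents two RKHS functions by their coefficient vectors and evaluates the inner product of Eq. (7) and the norm of Eq. (8); the RBF kernel and all variable names are assumptions made only for this example.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # Kernel value k(x, y) for two single points.
    return np.exp(-gamma * np.sum((x - y) ** 2))

# f(.) = sum_i alpha_i k(x_i, .) and g(.) = sum_j beta_j k(y_j, .), as in Eq. (6).
X = np.random.randn(4, 2); alpha = np.random.randn(4)
Y = np.random.randn(3, 2); beta = np.random.randn(3)

# Eq. (7): <f, g>_k = sum_i sum_j alpha_i beta_j k(x_i, y_j).
K_xy = np.array([[rbf(x, y) for y in Y] for x in X])
inner_fg = alpha @ K_xy @ beta

# Eq. (8): ||f||_k = sqrt(<f, f>_k), using the kernel among the x_i themselves.
K_xx = np.array([[rbf(a, b) for b in X] for a in X])
norm_f = np.sqrt(alpha @ K_xx @ alpha)
print(inner_fg, norm_f)
```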
2.3.2. Reproducing Property
In Eq. (7), consider only one component for g, i.e., g(·) = β k(x, ·) with β = 1, so that g(·) = k(x, ·) = k_x(·). In other words, assume the function g is a kernel in the RKHS. Also consider the function f(·) = Σ_{i=1}^n α_i k(x_i, ·) in the space. According to Eq. (7), the inner product of these functions is:

⟨f, k(x, ·)⟩_k = Σ_{i=1}^n α_i k(x_i, x) = f(x),

which is the reproducing property stated in Definition 11. This property and the following points explain the name of RKHS:

1. "Reproducing": because of the reproducing property of RKHS, which was proved above.

2. "Kernel": because of the kernels associated with RKHS, as stated in Definition 11 and Eq. (6).

3. "Hilbert Space": because RKHS is a Hilbert space of functions with a reproducing kernel, as stated in Definition 11.

2.3.3. Representation in RKHS
In the following, we provide a proof for Eq. (6) and explain why that equation defines the RKHS.

Theorem 1 (Representer Theorem (Kimeldorf & Wahba, 1971), simplified in (Rudin, 2012)). For a set of data X = {x_i}_{i=1}^n, consider an RKHS H of functions f : X → R with kernel function k. For any function ℓ : R² → R (usually called the loss function), consider the optimization problem:

f* ∈ arg min_{f∈H} Σ_{i=1}^n ℓ(f(x_i), y_i) + η Ω(‖f‖_k),   (10)

where η ≥ 0 is the regularization parameter and Ω(‖f‖_k) is a penalty term such as ‖f‖²_k. The solution of this optimization can be expressed as:

f* = Σ_{i=1}^n α_i k(x_i, ·) = Σ_{i=1}^n α_i k_{x_i}(·).   (11)

P.S.: Eq. (11) can also be seen in (Aizerman et al., 1964, p. 834).

Proof. The proof is inspired by (Rudin, 2012). Assume we project the function f onto the subspace spanned by {k(x_i, ·)}_{i=1}^n. The function f can be decomposed into components along and orthogonal to this subspace, respectively denoted by f_∥ and f_⊥.

Using Eqs. (13) and (15), we can say:

min_{f∈H} Σ_{i=1}^n ℓ(f(x_i), y_i) + η ‖f‖²_k = min_{f∈H} Σ_{i=1}^n ℓ(f_∥(x_i), y_i) + η ‖f_∥‖²_k.

Hence, for this minimization, we only require the component lying in the space spanned by the kernels of RKHS. Therefore, we can represent the function (the solution of the optimization) as a linear combination of the basis vectors {k(x_i, ·)}_{i=1}^n. Q.E.D.

Corollary 1. In Section 2.2, we mentioned that a Hilbert space can be infinite dimensional. According to Definition 11, RKHS is a Hilbert space, so it may be infinite dimensional. The representer theorem states that, in practice, we only need to deal with a finite-dimensional space, although that finite number of dimensions is usually a large number.

Note that the representer theorem has been used in kernel SVM, where the α_i's are the dual variables which are non-zero for support vectors (Boser et al., 1992; Vapnik, 1995). According to this theorem, kernel SVM only requires learning the dual variables, the α_i's, to find the optimal boundary between classes.
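One concrete instance of the representer theorem, not worked out in the text, is kernel ridge regression: with the squared loss and Ω(‖f‖_k) = ‖f‖²_k in Eq. (10), the solution has the form of Eq. (11) and the coefficients admit a closed form. The sketch below assumes this setting; the RBF kernel and the helper names are illustrative only.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Kernel matrix between the rows of A and the rows of B.
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d, 0.0))

X = np.random.randn(50, 2)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(50)
eta = 0.1

# Representer theorem: f*(.) = sum_i alpha_i k(x_i, .); for squared loss the
# dual coefficients solve (K + eta I) alpha = y.
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + eta * np.eye(len(X)), y)

X_test = np.random.randn(5, 2)
f_test = rbf_kernel(X_test, X) @ alpha   # evaluate f* at new points
```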
3. Mercer's Theorem and Feature Map
3.1. Mercer's Theorem
Definition 12 (Definite Kernel (Hilbert, 1904)). A kernel k : [a, b] × [a, b] → R is a definite kernel where the following double integral:

J(f) = ∫_a^b ∫_a^b k(x, y) f(x) f(y) dx dy,   (16)

satisfies J(f) > 0 for all f(x) ≠ 0.

Mercer improved over Hilbert's work (Hilbert, 1904) to propose his theorem, Mercer's theorem (Mercer, 1909), introduced in the following.

Theorem 2 (Mercer's Theorem (Mercer, 1909)). Suppose k : [a, b] × [a, b] → R is a continuous symmetric positive semi-definite kernel which is bounded:

sup_{x,y} k(x, y) < ∞.   (17)

Assume the operator T_k takes a function f(x) as its argument and outputs a new function as:

T_k f(x) := ∫_a^b k(x, y) f(y) dy,   (18)

which is a Fredholm integral equation (Schmidt, 1908). The operator T_k is called the Hilbert-Schmidt integral operator (Renardy & Rogers, 2006, Chapter 8). This operator is positive semi-definite:

∫∫ k(x, y) f(x) f(y) dx dy ≥ 0.   (19)

Then, there is a set of orthonormal bases {ψ_i(·)}_{i=1}^∞ of L²(a, b) consisting of eigenfunctions of T_k such that the corresponding sequence of eigenvalues {λ_i}_{i=1}^∞ is non-negative:

∫ k(x, y) ψ_i(y) dy = λ_i ψ_i(x).   (20)

The eigenfunctions corresponding to the non-zero eigenvalues are continuous on [a, b], and k can be represented as (Aizerman et al., 1964):

k(x, y) = Σ_{i=1}^∞ λ_i ψ_i(x) ψ_i(y),   (21)

where the convergence is absolute and uniform.

Proof. A roughly high-level proof for Mercer's theorem is as follows.

Step 1 of proof: According to the assumptions of the theorem, the Hilbert-Schmidt integral operator T_k is a symmetric operator on the L²(a, b) space. Consider a unit ball in L²(a, b) as input to the operator. As the kernel is bounded, sup_{x,y} k(x, y) < ∞, the sequence f_1, f_2, . . . converges in norm, i.e., ‖f_n − f‖ → 0 as n → ∞. Therefore, according to the Arzelà-Ascoli theorem (Arzelà, 1895), the image of the unit ball after applying the operator is compact. In other words, the operator T_k is compact.

Step 2 of proof: According to the spectral theorem (Hawkins, 1975), there exist orthonormal bases {ψ_i(·)}_{i=1}^∞ in L²(a, b) for the compact operator T_k. This provides a spectral (or eigenvalue) decomposition for the operator T_k (Ghojogh et al., 2019a):

T_k ψ_i(x) = λ_i ψ_i(x),   (22)

where {ψ_i(·)}_{i=1}^∞ and {λ_i}_{i=1}^∞ are the eigenvectors and eigenvalues of the operator T_k, respectively. Noticing the defined Eq. (18) and the eigenvalue decomposition, Eq. (22), we have:

∫ k(x, y) ψ_i(y) dy =(18) T_k ψ_i(x) =(22) λ_i ψ_i(x).   (23)

This proves Eq. (20), which is the eigenfunction decomposition of the operator T_k. Note that the eigenvectors {ψ_i(·)}_{i=1}^∞ are referred to as eigenfunctions because the decomposition is applied on a function or operator rather than a matrix. Eigenfunctions will be explained more in Section 7.

Step 3 of proof: According to Parseval's theorem (Parseval des Chenes, 1806), Bessel's inequality can be converted to an equality (Saxe, 2002). For the orthonormal bases {ψ_i(·)}_{i=1}^∞ in the Hilbert space H associated with kernel k, we have for any function f ∈ L²(a, b):

f = Σ_{i=1}^∞ ⟨f, ψ_i⟩_k ψ_i.   (24)

If we replace ψ_i with f in Eq. (22) and consider Eq. (24), we will have:

T_k f = Σ_{i=1}^∞ λ_i ⟨f, ψ_i⟩_k ψ_i.   (25)

One can consider Eq. (18) as T_k f = kf. Noticing this and Eq. (25) results in:

kf = Σ_{i=1}^∞ λ_i ⟨f, ψ_i⟩_k ψ_i.   (26)

Removing f from the two sides of Eq. (26) gives:

k(x, y) = Σ_{i=1}^∞ λ_i ψ_i(x) ψ_i(y),   (27)

which is Eq. (21); hence, that is proved.
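Before continuing the proof, a finite-sample analogue of Eq. (21) can be checked numerically: for a Mercer kernel matrix, the eigendecomposition K = Σ_k δ_k v_k v_k^⊤ plays the role of the expansion over eigenfunctions. The following sketch is our own illustration under that analogy, assuming an RBF kernel and NumPy.

```python
import numpy as np

n = 30
X = np.random.randn(n, 2)
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * np.maximum(sq, 0.0))

deltas, V = np.linalg.eigh(K)              # eigenvalues ascending, columns are v_k
assert np.all(deltas >= -1e-10)            # Mercer: non-negative spectrum
K_reconstructed = (V * deltas) @ V.T       # sum_k delta_k v_k v_k^T
assert np.allclose(K, K_reconstructed)

# Keeping only the top-r terms gives a low-rank approximation, in the spirit
# of the truncated kernel r_n of Eq. (28) below.
r = 5
K_r = (V[:, -r:] * deltas[-r:]) @ V[:, -r:].T
```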
Step 4 of proof: We define the truncated kernel r_n (with parameter n) as:

r_n(x, y) := k(x, y) − Σ_{i=1}^n λ_i ψ_i(x) ψ_i(y) = Σ_{i=n+1}^∞ λ_i ψ_i(x) ψ_i(y).   (28)

As T_k is an integral operator, this truncated kernel is positive, i.e., for every x ∈ [a, b], we have:

r_n(x, x) = k(x, x) − Σ_{i=1}^n λ_i ψ_i(x) ψ_i(x) ≥ 0
⟹ Σ_{i=1}^n λ_i ψ_i(x) ψ_i(x) ≤ k(x, x) ≤ sup_{x∈[a,b]} k(x, x).   (29)

By the Cauchy-Schwarz inequality, we have:

( Σ_{i=1}^n λ_i ψ_i(x) ψ_i(y) )² ≤ ( Σ_{i=1}^n λ_i ψ_i(x) ψ_i(x) ) ( Σ_{i=1}^n λ_i ψ_i(y) ψ_i(y) ) ≤(29) ( sup_{x∈[a,b]} k(x, x) )².

Taking the square root of the sides of this inequality gives:

3.2. Feature Map
In Mercer's theorem, there is a mapping φ to transform data from the input space to the feature space, i.e., the Hilbert space. In other words, this mapping pulls data to the feature space:

x ↦ φ(x).   (32)

The function φ(x) is called the feature map or pulling function. The feature map is a (possibly infinite-dimensional) vector whose elements are (Minh et al., 2006):

φ(x) = [φ_1(x), φ_2(x), . . . ]^⊤ := [√λ_1 ψ_1(x), √λ_2 ψ_2(x), . . . ]^⊤,   (33)

where {ψ_i} and {λ_i} are the eigenfunctions and eigenvalues of the kernel operator (see Eq. (20)). Note that eigenfunctions will be explained more in Section 7.

Let t denote the dimensionality of φ(x). The feature map may be infinite or finite dimensional, i.e., t can be infinity; it is usually a very large number (recall Definition 7, where we said the Hilbert space may have an infinite number of dimensions).

Considering both Eqs. (21) and (33) shows that:

k(x, y) = ⟨φ(x), φ(y)⟩_k = φ(x)^⊤ φ(y).   (34)

Hence, the kernel between two points is the inner product of the pulled data points in the feature space. Suppose we stack the feature maps of all points X ∈ R^{d×n} column-wise in:

Φ(X) := [φ(x_1), φ(x_2), . . . , φ(x_n)],   (35)

which is t × n dimensional, where t may be infinity or a large number. The kernel matrix defined in Definition 2 can be calculated as:

K = Φ(X)^⊤ Φ(X).   (36)

Pulling data to the feature space is performed using kernels, which give the inner product of points in RKHS according to Eq. (34). Hence, the relative similarity (inner product) of the pulled data points is known through the kernel. However, for most kernels, we cannot find an explicit expression for the pulled data points. Therefore, the exact location of the pulled data points in RKHS is not necessarily known, but the relative similarity of the pulled points, which is the kernel, is known. An exceptional kernel is the linear kernel, in which we have φ(x) = x. Figure 1 illustrates what we mean by not knowing the explicit location of pulled points in RKHS.
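A case where the feature map of Eq. (33) can be written out explicitly, which the text does not cover at this point, is the homogeneous degree-2 polynomial kernel; the sketch below verifies Eq. (34) for it. The choice of kernel and dimension is only for illustration.

```python
import numpy as np

# For k(x, y) = (x^T y)^2 in d = 2 dimensions, the feature map is explicit:
# phi(x) = [x1^2, x2^2, sqrt(2) x1 x2], so k(x, y) = phi(x)^T phi(y) as in Eq. (34).
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, y = np.random.randn(2), np.random.randn(2)
assert np.isclose((x @ y) ** 2, phi(x) @ phi(y))

# For kernels such as RBF, phi is infinite dimensional and cannot be written out,
# but Eq. (36), K = Phi(X)^T Phi(X), still holds implicitly.
```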
Figure 1. Pulling data from the input space to the feature space (RKHS). The explicit locations of pulled points are not necessarily known, but the relative similarity (inner product) of pulled data points is known in the feature space.

4. Characteristics of Kernels
In this section, we review some of the characteristics of kernels, including the symmetry and positive semi-definiteness properties of the Mercer kernel (recall Definition 1).

Lemma 1 (Symmetry of Kernel). A square Mercer kernel matrix is symmetric, so we have:

⟨f, g⟩_k = ⟨g, f⟩_k, or   (37)
K ∈ S^n, i.e., k(x, y) = k(y, x).   (38)

Proof.

k(x, y) =(34) ⟨φ(x), φ(y)⟩_k = φ(x)^⊤ φ(y) = φ(y)^⊤ φ(x) = k(y, x),

0 ≤ f²(x) = 0 ⟹ f(x) = 0.

Lemma 3 (Positive Semi-definiteness of Kernel). The Mercer kernel matrix is positive semi-definite:

K ∈ S^n_+, i.e., K ⪰ 0.   (40)

Proof. Let v(i) denote the i-th element of the vector v.

v^⊤ K v = Σ_{i=1}^n Σ_{j=1}^n v(i) v(j) k(x(i), x(j))
= Σ_{i=1}^n Σ_{j=1}^n v(i) v(j) ⟨φ(x(i)), φ(x(j))⟩_k
= Σ_{i=1}^n Σ_{j=1}^n ⟨v(i) φ(x(i)), v(j) φ(x(j))⟩_k
= ⟨ Σ_{i=1}^n v(i) φ(x(i)), Σ_{j=1}^n v(j) φ(x(j)) ⟩_k
= ‖ Σ_{i=1}^n v(i) φ(x(i)) ‖²_k ≥ 0, ∀v ∈ R^n.

Comparing this with Eq. (34) shows that in the linear kernel we have φ(x) = x. Hence, for this kernel, the feature map is explicitly known. Note that φ(x) = x shows that data are not pulled to any other space by the linear kernel; rather, the inner products of points are calculated in the input space to obtain the feature space. Moreover, recall Remark 9, which states that, depending on the kernelization approach, using lin-
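The argument of Lemma 3 and the remark on the linear kernel can be seen numerically: for the linear kernel, K = X^⊤ X, so v^⊤ K v = ‖Xv‖² ≥ 0 for every v. The following is a minimal sketch of this check; the sizes are arbitrary.

```python
import numpy as np

d, n = 3, 8
X = np.random.randn(d, n)          # data stacked column-wise, as in Eq. (35)
K = X.T @ X                        # linear Gram matrix; phi(x) = x here

v = np.random.randn(n)
assert np.isclose(v @ K @ v, np.linalg.norm(X @ v) ** 2)
assert v @ K @ v >= -1e-12         # positive semi-definiteness, Lemma 3
```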
matrix. We double-center the matrix D as follows (Old- where (a) is because H and D are symmetric matrices.
ford, 2018): Moreover, the kernel is positive semi-definite because:
1 > 1 1
HDH = (I − 11 )D(I − 11> ) K = − HDH = Φ(X)> Φ(X)
n n 2
1 > 1 =⇒ v > Kv = v > Φ(X)> Φ(X)v
= (I − 11 )(1g − 2G + g1> )(I − 11> )
>
n n = kΦ(X)vk22 ≥ 0, ∀v ∈ Rn .
1 > > 1 >
= (I − 11 )1 g − 2(I − 11 )G
n{z n Hence, according to Definition 1, this kernel is a Mercer
kernel. Q.E.D.
| }
=0
1 > 1
11 )g1> (I − 11> )
+ (I − Remark 5 (Kernel Construction from Metric). One can
n n use any valid distance metric, satisfying the following prop-
1 > 1 >
= −2(I − 11 )G(I − 11 ) erties:
n n
1 > 1 1. non-negativity: D(x, y) ≥ 0,
+ (I − 11 )g 1 (I − 11> )
>
2. equal points: D(x, y) = 0 ⇐⇒ x = y,
n {zn
| } 3. symmetry: D(x, y) = D(y, x),
=0
4. triangular inequality: D(x, y) ≤ D(x, z)+D(z, y),
1 1
= −2(I − 11> )G(I − 11> ) = −2 HGH. to calculate elements of distance matrix D in Eq. (50). It
n n
is important that the used distance matrix should be a valid
distance matrix. Using various distance metrics in Eq. (50)
1 results in various useful kernels.
∴ HGH = HX > XH = − HDH. (49)
2 Some examples are the geodesic kernel and Structural Sim-
ilarity Index (SSIM) kernel, used in Isomap (Tenenbaum
Note that (I − n1 11> )1 = 0 and 1> (I − n1 11> ) = 0 et al., 2000) and image structure subspace learning (Gho-
because removing the row mean of 1 and column mean of jogh et al., 2019c), respectively. The geodesic kernel is de-
of 1> results in the zero vectors, respectively. fined as (Tenenbaum et al., 2000; Ghojogh et al., 2020b):
If data X are already centered, i.e., the mean has been re-
moved (X ← XH), Eq. (49) becomes: 1
K = − HD (g) H, (52)
2
1 where the approximation of geodesic distances using piece-
X > X = − HDH. (50)
2 wise Euclidean distances is used in calculating the geodesic
distance matrix D (g) . The SSIM kernel is defined as (Gho-
According to the kernel trick, Eq. (104), we can write a jogh et al., 2019c):
general kernel matrix rather than the linear Gram matrix in
Eq. (50), to have (Cox & Cox, 2008): 1
K = − HD (s) H, (53)
2
1
Rn×n 3 K = Φ(X)> Φ(X) = − HDH. (51) where the distance matrix D (s) is calculated using the
2
SSIM distance (Brunet et al., 2011).
This kernel is double-centered because of HDH. It is also
noteworthy that Eq. (51) can be used for unifying the spec- 5.3. Important Classes of Kernels
tral dimensionality reduction methods as special cases of In the following, we introduce some of the important
kernel principal component analysis with different kernels. classes of kernels which are widely used in statistics and
See (Ham et al., 2004; Bengio et al., 2004) and (Strange & machine learning. A good survey on the classes of kernels
Zwiggelaar, 2014, Table 2.1) for more details. is (Genton, 2001).
Lemma 4 (Distance-based Kernel is a Mercer Kernel). The 5.3.1. B OUNDED K ERNELS
kernel constructed from a valid distance metric, i.e. Eq. Definition 15 (Bounded Kernel). A kernel function k is
(51), is a Mercer kernel. bounded if:
5.3.2. I NTEGRALLY P OSITIVE D EFINITE K ERNELS (Ghojogh et al., 2020d), denoted by K s , whose Taylor se-
Definition 16 (Integrally Positive Definite RKernel). A ker- ries expansion is (Ghojogh et al., 2019c):
nel matrix K is integrally positive definite ( p.d.) on Ω×Ω
5 15 5 1
if: Ks ≈ − − r + r2 − r3 + . . . ,
16 16 16 16
Z Z
K(x, y)f (x)f (y) ≥ 0, ∀f ∈ L2 (Ω). (55) where r is the squared SSIM distance (Brunet et al., 2011)
Ω Ω between images. Note that polynomial kernels are not uni-
R versal. Universal kernels have been widely used for kernel
A kernel matrix K is integrally strictly positive definite (
SVM. More detailed discussion and proofs for use of uni-
s.p.d.) on Ω × Ω if:
versal kernels in kernel SVM can be found in (Steinwart &
Christmann, 2008).
Z Z
K(x, y)f (x)f (y) > 0, ∀f ∈ L2 (Ω). (56)
Ω Ω Lemma 6 ((Borgwardt et al., 2006), (Song, 2008, Theorem
10)). A kernel is universal if for arbitrary sets of distinct
5.3.3. U NIVERSAL K ERNELS points, it induces strictly positive definite kernel matrices
Definition 17 (Universal Kernel (Steinwart, 2001, Defini- (Borgwardt et al., 2006; Song, 2008). Conversely, if a ker-
tion 4), (Steinwart, 2002, Definition 2)). Let C(X ) denote nel matrix can be written as K = K 0 + I where K 0 0,
the space of all continuous functions on space X . A con- > 0, and I is the identity matrix, the kernel function cor-
tinuous kernel k on a compact metric space X is called responding to K is universal (Pan et al., 2008).
universal if the RKHS H, with kernel function k, is dense
in C(X ). In other words, for every function g ∈ C(X ) 5.3.4. S TATIONARY K ERNELS
and all > 0, there exists a function f ∈ H such that Definition 18 (Stationary Kernel (Genton, 2001; Noack &
kf − gk∞ ≤ . Sethian, 2021)). A kernel k is stationary if it is a positive
definite function of the form:
Remark 6. We can approximate any function, including
continuous functions and functions which can be approxi- k(x, y) = k(kx − yk), (58)
mated by continuous functions, using a universal kernel.
Lemma 5 ((Steinwart, 2001, Corollary 10)). Consider a where k.k is some norm defined on the input space.
function f : (−r, r) → R where 0 < r ≤ ∞ and f ∈
An example for stationary kernel is the RBF kernel defined
C ∞ (C ∞ denotes the differentiable space for all √
degrees
in Eq. (42) which has the form of Eq. (58). Stationary
of differentiation). Let X := {x ∈ Rd | kxk2 < r}. If
kernels are used for Gaussian processes (Noack & Sethian,
the function f can be expanded by Taylor expansion in 0
2021).
as:
∞ 5.3.5. C HARACTERISTIC K ERNELS
X
f (x) = aj xj , ∀x ∈ (−r, r), (57) The characteristic kernels, which are widely used for dis-
j=0 tribution embedding in the Hilbert space, will be defined
and explained in Section 9.4. Examples for characteristic
and aj > 0 for all j ≥ 0, then k(x, y) = f (hx, yi) is a kernels are RBF and Laplacian kernels. Polynomial ker-
universal kernel on every compact subset of X . nels. however, are not characteristic. Note that the relation
between universal kernels, characteristic kernels, and inte-
Proof. For proof, see (Steinwart, 2001, proof of Corollary
grally strictly positive definite kernels has been studied in
10). Note that the Stone-Weierstrass theorem (De Branges,
(Sriperumbudur et al., 2011).
1959) is used for the proof of this lemma.
An example for universal kernel is RBF kernel (Steinwart, 6. Kernel Centering and Normalization
2001, Example 1) because its Taylor series expansion is: 6.1. Kernel Centering
In some cases, there is a need to center the pulled data in the
γ2 2 γ3 3
exp(−γr) ≈ 1 − γr + r − r + ..., feature space. For this, the kernel matrix should be centered
2 6
in a way that the mean of pulled dataset becomes zero. Note
where r := kx − yk22 . Considering Eq. (34) and notic- that this will restrict the place of pulled points in the feature
ing that this Taylor series expansion has infinite number space further (see Fig. 2); however, because of different
of terms, we see that the RKHS for RBF kernel is infinite possible rotations of pulled points around origin, the exact
dimensional because φ(x), although cannot be calculated positions of pulled points are still unknown.
explicitely for this kernel, will have infinite dimensions. For kernel centering, one should follow the following
Another example for universal kernel is the SSIM kernel theory, which is based on (Schölkopf et al., 1997a) and
X t = [xt,1 , . . . , xt,nt ] ∈ Rd×nt . Consider the kernel If we center the pulled training and out-of-sample data, the
matrix for the training data Rn×n 3 K := Φ(X)> Φ(X), (i, j)-th element of kernel matrix becomes:
whose (i, j)-th element is R 3 K(i, j) = φ(xi )> φ(xj ).
We want to center the pulled training data in the feature K̆ t (i, j) := φ̆(xi )> φ̆(xt,j )
space: (a) 1 X
n
> n
1 X
= φ(xi ) − φ(xk1 ) φ(xt,j ) − φ(xk2 )
1X
n n n
k1 =1 k2 =1
φ̆(xi ) := φ(xi ) − φ(xk ). (59) n
n 1
k=1
X
= φ(xi )> φ(xt,j ) − φ(xk1 )> φ(xt,j )
n
If we center the pulled training data, the (i, j)-th element k1 =1
of kernel matrix becomes: n n n
1 X > 1 X X
− φ(xi ) φ(xk2 ) + 2 φ(xk1 )>φ(xk2 ),
K̆(i, j) := φ̆(xi )> φ̆(xj ) (60) n n
k2 =1 k1 =1 k2 =1
n n
(59) 1 X > 1 X where (a) is because of Eqs. (59) and (64). Therefore,
= φ(xi ) − φ(xk1 ) φ(xj ) − φ(xk2 )
n n the double-centered kernel matrix over training and out-of-
k1 =1 k2 =1
n
sample data is:
1 X
= φ(xi )> φ(xj ) − φ(xk1 )> φ(xj ) 1 1
n
k1 =1
Rn×nt 3 K̆ t = K t − 1n×n K t − K1n×nt
n n
n n n 1
1 X > 1 X X + 1n×n K1n×nt , (65)
− φ(xi ) φ(xk2 ) + 2 φ(xk1 )>φ(xk2 ). n2
n n
k2 =1 k1 =1 k2 =1
where Rn×nt 3 1n×nt := 1n 1> nt and R
nt
3 1nt :=
Writing this in the matrix form gives: >
[1, . . . , 1] . The Eq. (65) is the kernel matrix when the
1 1 pulled training data in the feature space are centered and
Rn×n 3 K̆ = K − 1n×n K − K1n×n the pulled out-of-sample data are centered using the mean
n n
1 of pulled training data.
+ 1n×n K1n×n = HKH, (61) If we have one out-of-sample xt , the Eq. (65) becomes:
n2
where H is the centering matrix (see Eq. (48)). The Eq. 1 1 1
Rn 3 k̆t = kt − 1n×n kt − K1n + 2 1n×n K1n ,
(61) is called the double-centered kernel. This equation n n n
is the kernel matrix when the pulled training data in the (66)
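The kernel centering formulas of this section can be summarized in a short sketch: Eq. (61) double-centers the training kernel with the centering matrix H, and Eq. (65) centers a training/out-of-sample kernel using the mean of the pulled training data. The function names below are our own; the code assumes K and K_t are given as NumPy arrays.

```python
import numpy as np

def center_kernel(K):
    # Eq. (61): double-centered training kernel H K H, with H from Eq. (48).
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def center_out_of_sample_kernel(K, K_t):
    # Eq. (65): K_t is the n x n_t kernel between training and out-of-sample points.
    n, n_t = K.shape[0], K_t.shape[1]
    one_nn = np.ones((n, n))
    one_nt = np.ones((n, n_t))
    return (K_t - one_nn @ K_t / n - K @ one_nt / n
            + one_nn @ K @ one_nt / n**2)
```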
Proof. If we discretize the domain [a, b], for example by then the function f is an eigenfunction for the operator O
sampling, with step ∆x, the function values become vec- and the constant λ is the corresponding eigenvalue. Note
tors as f (x) = [f (x1 ), f (x2 ), . . . , f (xn )]> and g(x) = that the form of eigenfunction problem is:
[g(x1 ), g(x2 ), . . . , g(xn )]> . According to the inner prod-
Operator (function f ) = constant × function f. (79)
uct of two vectors, we have:
n
X Some examples of operator are derivative, kernel function,
hf (x), g(x)i = g H f = f (xi )g(xi ), etc. For example, eλx is an eigenfunction of derivative be-
d λx
i=1 cause dx e = λeλx . Note that eigenfunctions have appli-
cation in many fields of science including machine learning
where g H denotes the conjugate transpose of g (it is trans-
(Bengio et al., 2003c) and quantum mechanics (Reed & Si-
pose if functions are real). Multiplying the sides of this
mon, 1972).
equation by the setp ∆x gives:
Recall that in eigenvalue problem, the eigenvectors show
n
X the most important or informative directions of matrix and
hf (x), g(x)i∆x = g H f = f (xi )g(xi )∆x, the corresponding eigenvalue shows the amount of impor-
i=1
tance (Ghojogh et al., 2019a). Likewise, in eigenfunc-
which is a Riemann sum. This is the Riemann approxima- tion problem of an operator, the eigenfunction is the most
tion of the Eq. (75). This approximation gets more accurate important function of the operator and the corresponding
by ∆x → 0 or n → ∞. Hence, that equation is a valid in- eigenvalue shows the amount of this importance. This con-
ner product in the Hilbert space. Q.E.D. nection between eigenfunction and eigenvalue problems is
proved in the following theorem.
Remark 7 (Interpretation of Inner Product of Functions
in Hilbert Space). The inner product of two functions, i.e. Theorem 3 (Connection of Eigenfunction and Eigenvalue
Eq. (75), measures how similar two functions are. The Problems). If we assume that the operator and the func-
more similar they are in their domain [a, b], the larger in- tion are a matrix and a vector, eigenfunction problem is
ner product they have. Note that this similarity is more converted to an eigenvalue problem where the vector is the
about the pattern (or changes) of functions and not the ex- eigenvector of the matrix.
act value of functions. If the pattern of functions is very Proof. Consider any function space such as a Hilbert
similar, they will have a large inner product. space. Let {ej }nj=1 be the bases (basis functions) of this
Corollary 2 (Weighted Inner Product in Hilbert Space function space where n may be infinite. The function f in
(Williams & Seeger, 2000), (Bengio et al., 2003c, Section this space can be represented as a linear combination bases:
2)). The Eq. (75) is the inner product with uniform weight- n
ing. With density function p(x), one can weight the inner
X
f (x) = αj ej (x). (80)
product in the Hilbert space as (assuming the functions are j=1
real):
An example of this linear combination is Eq. (6) in RKHS
Z b
where the bases are kernels. Consider the operator O which
hf (x), g(x)iH = f (x) g(x) p(x) dx. (76)
a
can be applied on the functions in this function space. Ap-
plying this operator on Eq. (80) gives:
7.2. Eigenfunctions n n
X (a) X
Recall eigenvalue problem for a matrix A (Ghojogh et al., Of (x) = O αj ej (x) = αj Oej (x), (81)
2019a): j=1 j=1
A φi = λi φi , ∀i ∈ {1, . . . , d}, (77) where (a) is because the operator O is a linear operator
according to Definition 21. Also, we have:
where φi and λi are the i-th eigenvector and eigenvalue of n
A, respectively. In the following, we introduce the Eigen- (78) (a) X
Of (x) = λf (x) = λ αj ej (x), (82)
function problem which has a similar form but for an oper-
j=1
ator rather than a matrix.
Definition 21 (Eigenfunction (Kusse & Westwig, 2006, where (a) is because λ is a scalar.
Chapter 11.2)). Consider a linear operator O which can On the other hand, the output function from applying the
be applied on a function f . If applying this operator on operator on a function can also be written as a linear com-
the function results in a multiplication of function to a con- bination of the bases:
stant: Xn
Of (x) = βj ej (x). (83)
Of = λf, (78) j=1
From Eqs. (81) and (83), we have: 7.3. Use of Eigenfunctions for Spectral Embedding
Consider a Hilbert space H of functions with the inner
n n
X X product defined by Eq. (76). Let the data in the input space
αj Oej (x) = βj ej (x). (84)
be X = {xi ∈ Rd }ni=1 . In this space, we can consider an
j=1 j=1
operator for the kernel function Kp as (Williams & Seeger,
2000), (Bengio et al., 2003a, Section 3):
In parentheses, consider an n × n matrix A whose (i, j)-th Z
element is the inner product of ei and Oej : (Kp f )(x) := k(x, y) f (y) p(y) dy, (89)
Z
(75)
A(i, j) := hei , Oej ik = e∗i (x) Oej (x)dx, (85) where f ∈ H and the density function p(y) can be ap-
proximated empirically. A discrete approximation of this
operator is (Williams & Seeger, 2000):
where integral is over the domain of functions in the func-
n
tion space. 1X
(Kp,n f )(x) := k(x, xi ) f (xi ), (90)
Using Eq. (75), we take the inner product of sides of Eq. n i=1
(84) with an arbitrary basis function ei :
which converges to Eq. (89) if n → ∞. Note that this
n
X Z n
X Z equation is also mentioned in (Bengio et al., 2003c, Section
αj e∗i (x) Oej (x) dx = βj e∗i (x) ej (x) dx. 2), (Bengio et al., 2004, Section 4), (Bengio et al., 2006,
j=1 j=1 Section 3.2).
Lemma 10 (Relation of Eigenvalues of Eigenvalue Prob-
According to Eq. (85), this equation is simplified to: lem and Eigenfunction Problem for Kernel (Bengio et al.,
2003a, Proposition 1), (Bengio et al., 2003c, Theorem 1),
n n Z
X X (a) (Bengio et al., 2004, Section 4)). Assume λk denotes the k-
αj A(i, j) = βj e∗i (x) ej (x) dx = βi , (86)
th eigenvalue for eigenfunction decomposition of the oper-
j=1 j=1
ator Kp and δk denotes the k-th eigenvalue for eigenvalue
problem of the matrix K ∈ Rn×n . We have:
which is true for ∀i ∈ {1, . . . n} and (a) is because the
bases are orthonormal, so: δk = n λk . (91)
(75)
Z
1 if i = j, Proof. This proof gets help from (Bengio et al., 2003b,
hei , ej ik = e∗i (x) ej (x) dx = proof of Proposition 3). According to Eq. (78), the eigen-
0 Otherwise.
function problems for the operators Kp and Kp,n (discrete
The Eq. (86) can be written in matrix form: version) are:
(Kp fk )(x) = λk fk (x), ∀k ∈ {1, . . . , n},
Aα = β, (87) (92)
(Kp,n fk )(x) = λk fk (x), ∀k ∈ {1, . . . , n},
where α := [α1 , . . . , αn ]> and β := [β1 , . . . , βn ]> . where fk (.) is the k-th eigenfunction and λk is the corre-
sponding eigenvalue. Consider the kernel matrix defined
From Eqs. (82) and (83), we have:
by Definition 2. The eigenvalue problem for the kernel ma-
n n trix is (Ghojogh et al., 2019a):
X X
λ αj ej (x) = βj ej (x) =⇒ λ α = β. (88) Kv k = δk v k , ∀k ∈ {1, . . . , n}, (93)
j=1 j=1
where v k is the k-th eigenvector and δk is the correspond-
Comparing Eqs. (87) and (88) shows: ing eigenvalue. According to Eqs. (90) and (92), we have:
n
1X
Aα = λ α, k(x, xi ) f (xi ) = λk fk (x), ∀k ∈ {1, . . . , n}.
n i=1
which is an eigenvalue problem for matrix A with eigen- When this equation is evaluated only at xi ∈ X , we have
vector α and eigenvalue λ (Ghojogh et al., 2019a). Note (Bengio et al., 2004, Section 4), (Bengio et al., 2006, Sec-
that, according to Eq. (80), the information of function f tion 3.2):
is in the coefficients αj ’s of the basis functions of space.
1
Therefore, the function is converted to the eigenvector (vec- Kfk = λk fk , ∀k ∈ {1, . . . , n},
tor of coefficients) and the operator O is converted to the n
matrix A. Q.E.D. =⇒ Kfk = nλk fk .
According to Theorem 3, eigenfunction can be seen as an allowed to re-arrange them. Re-arranging the terms in this
eigenvector. If so, we can say: equation gives:
n
Kfk = nλk fk =⇒ Kv k = nλk v k , (94) X
ηk v` φ̆(xj )> φ̆(x` )
Comparing Eqs. (93) and (94) results in Eq. (91). Q.E.D. `=1
n n
1X X
= v` φ̆(xj )> φ̆(xi ) φ̆(xi )> φ̆(x` ) .
Lemma 11 (Relation of Eigenvalues of Kernel and Covari- n i=1
`=1
ance in the Feature Space (Schölkopf et al., 1998)). Con-
sider the covariance of pulled data to the feature space: Considering Eqs. (36) and (60), we can write this equa-
2
tion in matrix form ηk K̆v k = n1 K̆ v k where v k :=
n
1X [v1 , . . . , vn ]> . As K̆ is positive semi-definite (see Lemma
CH := φ̆(xi )φ̆(xi )> , (95)
n i=1 3), it is often non-singular. For non-zero eigenvalues, we
−1
can left multiply this equation to K̆ to have:
where φ̆(xi ) is the centered pulled data defined by Eq. (64).
which is t × t dimensional where t may be infinite. Assume n ηk v k = K̆v k ,
ηk denotes the k-th eigenvalue C H and δk denotes the k-th
eigenvalue of centered kernel K̆. We have: which is the eigenvalue problem for K̆ where v is the
eigenvector and δk = n ηk is the eigenvalue (cf. Eq. (93)).
δk = n ηk . (96) Q.E.D.
Proof. This proof is based on (Schölkopf et al., 1998, Sec- Lemma 12 (Relation of Eigenfunctions and Eigenvectors
tion 2). The eigenvalue problem for this covariance matrix for Kernel (Bengio et al., 2003a, Proposition 1), (Bengio
is: et al., 2003c, Theorem 1)). Consider a training dataset
{xi ∈ Rd }ni=1 and the eigenvalue problem (93) where
ηk uk = C H uk , ∀k ∈ {1, . . . , n}, v k ∈ Rn and δk are the k-th eigenvector and eigenvalue
of matrix K ∈ Rn×n . If vki is the i-th element of vector
where uk is the k-th eigenvector and ηk is its corresponding v k , the eigenfunction for the point x and the i-th training
eigenvalue (Ghojogh et al., 2019a). Left multiplying this point xi are:
equation with φ̆(xj )> gives: √ X n
n
fk (x) = vki k̆(xi , x), (99)
ηk φ̆(xj )> uk = φ̆(xj )> C H uk , ∀k ∈ {1, . . . , n}. δk i=1
(97) √
fk (xi ) = n vki , (100)
As uk is the eigenvector of the covariance matrix in the
feature space, it lies in the feature space; hence, according respectively, where k̆(xi , x) is the centered kernel. If x
to Lemma 13 which will come later, we can represent it as: is a training point, k̆(xi , x) is the centered kernel over
training data and if x is an out-of-sample point, then
n
1 X k̆(xi , x) = k̆t (xi , x) is between training set and the out-
uk = √ v` φ̆(x` ), (98)
δk `=1 of-sample point (n.b. kernel centering is explained in Sec-
tion 6.1).
where pulled data to feature space are assumed to be cen-
tered, v` ’s are the coefficients in representation, and the Proof. For proof of Eq. (99), see (Bengio et al., 2003c,
√ proof of Theorem 1) or (Williams & Seeger, 2001, Section
normalization by 1/ δk is because of a normalization used
in (Bengio et al., 2003c, Section 4). Substituting Eq. (98) 1.1). The Eq. (100) is claimed in (Bengio et al., 2003c,
and Eq. (95) in Eq. (97) results in: Proposition 1). For proof of this equation, see (Bengio
et al., 2003c, proof of Theorem 1, Eq. 7).
n
X
ηk φ̆(xj )> v` φ̆(x` ) It is noteworthy that Eq. (99) is similar and related to the
`=1 Nyström approximation of eigenfunctions of kernel opera-
n n tor which will be explained in Lemma 16.
1X X
= φ̆(xj )> φ̆(xi )φ̆(xi )> v` φ̆(x` ),
n i=1 Theorem 4 (Embedding from Eigenfunctions of Kernel
`=1
Operator (Bengio et al., 2003a, Proposition 1), (Bengio
where normalization factors are simplified from sides. In et al., 2003c, Section 4)). Consider a dimensionality reduc-
the right-hand side, as the summations are finite, we are tion algorithm which embeds data into a low-dimensional
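A compact numerical sketch of Lemmas 10 and 12, assuming an uncentered RBF kernel for brevity (the text uses the centered kernel in Eq. (99)): the eigenvalues of the kernel operator are recovered as λ_k = δ_k / n (Eq. (91)), the eigenfunction values at training points follow Eq. (100), and an out-of-sample point is handled in the spirit of Eq. (99).

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d, 0.0))

n = 40
X = np.random.randn(n, 2)
K = rbf(X, X)
deltas, V = np.linalg.eigh(K)                      # K v_k = delta_k v_k
lambdas = deltas / n                               # Eq. (91): operator eigenvalues

k = -1                                             # index of the largest eigenpair
f_k_train = np.sqrt(n) * V[:, k]                   # Eq. (100): f_k(x_i) = sqrt(n) v_ki

x_new = np.random.randn(1, 2)
k_new = rbf(X, x_new)[:, 0]                        # k(x_i, x) for all training x_i
f_k_new = np.sqrt(n) / deltas[k] * V[:, k] @ k_new # Eq. (99), uncentered sketch
```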
[x2,1 , . . . , x2,n2 ]. In this case, the kernel matrix has size Q.E.D.
n1 × n2 and the kernel trick is:
Remark 8 (Justification by Representation Theory). Ac-
(34)
x>
1,i x1,j
>
7→ φ(x1,i ) φ(x1,j ) = k(x1,i , x1,j ), (105) cording to representation theory (Alperin, 1993), any func-
(36)
tion in the space can be represented as a linear combina-
X> >
1 X 2 7→ Φ(X 1 ) Φ(X 2 ) = K(X 1 , X 2 ) ∈ R
n1 ×n2
. tion of bases of the space. This makes sense because the
(106) function is in the space and the space is spanned by the
bases. Now, assume the space is RKHS. Hence, any func-
An example for kernel between two sets of data is the ker- tion should lie in the RKHS spanned by the pulled data
nel between training data and out-of-sample (test) data. As points to the feature space. This justifies Lemma 13 using
stated in (Schölkopf, 2001), the kernel trick is proved to representation theory.
work for Mercer kernels in (Boser et al., 1992; Vapnik,
1995) or equivalently for the positive definite kernels (Berg 8.2.1. K ERNELIZATION FOR V ECTOR S OLUTION
et al., 1984; Wahba, 1990). Now, consider an algorithm whose optimization variable or
Some examples of using kernel trick in machine learning solution is the vector/direction u ∈ Rd in the input space.
are kernel PCA (Schölkopf et al., 1997a; 1998; Ghojogh & For kernelization, we pull this solution to RKHS by Eq.
Crowley, 2019) and kernel SVM (Boser et al., 1992; Vap- (32) to have φ(u). According to Lemma 13, this pulled
nik, 1995). More examples will be provided in Section 9. solution must lie in the span of all pulled training points
Note that in some algorithms, data do not not appear only {φ(xi )}ni=1 as:
by inner product which is required for the kernel trick. In
n
these cases, if possible, a “dual” method for the algorithm is X
φ(u) = αi φ(xi ) = Φ(X) α, (108)
proposed which only uses the inner product of data. Then,
i=1
the dual algorithm is kernelized using kernel trick. Some
examples for this are kernelization of dual PCA (Ghojogh which is t dimensional and t may be infinite. Note that
& Crowley, 2019) and dual SVM (Burges, 1998). As an ad- Φ(X) is defined by Eq. (35) and is t × n dimensional. The
ditional point, it is noteworthy that it is possible to replace vector α := [α1 , . . . , αn ]> ∈ Rn contains the coefficients.
kernel trick with function replacement (see (Ma, 2003) for According to Eq. (32), we can replace u with φ(u) in the
more details on this). algorithm. If, by this replacement, the terms φ(xi )> φ(xi )
or Φ(X)> Φ(X) appear, then we can use Eq. (34) and As was mentioned, in many machine learning algorithms,
replace φ(xi )> φ(xi ) with k(xi , xi ) or use Eq. (36) to re- the solution U ∈ Rd×p is a projection matrix for projecting
place Φ(X)> Φ(X) with K(X, X). This kernelizes the d-dimensional data onto a p-dimensional subspace. Some
method. The steps of kernelization by representation the- example methods which have used kernelization by rep-
ory are summarized below: resentation theory are kernel Fisher discriminant analysis
• Step 1: u → φ(u) (FDA) (Mika et al., 1999; Ghojogh et al., 2019b), kernel
• Step 2: Replace φ(u) with Eq. (108) in the algorithm supervised principal component analysis (PCA) (Barshan
formulation et al., 2011; Ghojogh & Crowley, 2019), and direct ker-
• Step 3: Some φ(xi )> φ(xi ) or Φ(X)> Φ(X) terms nel Roweis discriminant analysis (RDA) (Ghojogh et al.,
appear in the formulation 2020c).
• Step 4: Use Eq. (34) or (36) Remark 9 (Linear Kernel in Kernelization). If we use ker-
• Step 5: Solve (optimize) the algorithm where the vari- nel trick, the kernelized algorithm with a linear kernel is
able to find is α rather than u equivalent to the non-kernelized algorithm. This is because
Usually, the goal of algorithm results in kernel. For exam- in linear kernel, we have φ(x) = x and k(x, y) = x> y
ple, if u is a projection direction, the desired projected data according to Eq. (41). So, the kernel trick, which is Eq.
are obtained as: (105), maps data as x> y 7→ φ(x)> φ(y) = x> y for lin-
(32) (108) ear kernel. Therefore, linear kernel does not have any effect
u> xi 7→ φ(u)> φ(xi ) = α> Φ(X)> φ(xi ) when using kernel trick. Examples for this are kernel PCA
(36) (Schölkopf et al., 1997a; 1998; Ghojogh & Crowley, 2019)
= α> k(X, xi ), (109)
and kernel SVM (Boser et al., 1992; Vapnik, 1995) which
where k(X, xi ) ∈ Rn is the kernel between all n train- are equivalent to PCA and SVM, respectively, if linear ker-
ing points with the point xi . As this equation shows, the nel is used.
desired goal is based on kernel. However, kernel trick does have impact when using kernel-
ization by representation theory because it finds the inner
8.2.2. K ERNELIZATION FOR M ATRIX S OLUTION products of pulled data points after pulling the solution and
Usually, the algorithm has multiple directions/vectors as representation as a span of bases. Hence, kernelized algo-
its solution. In other words, its solution is a matrix U = rithm using representation theory with linear kernel is not
[u1 , . . . , up ] ∈ Rd×p . In this case, Eq. (108) is used for all equivalent to non-kernelized algorithm. Examples of this
p vectors and in a matrix form, we have: are kernel FDA (Mika et al., 1999; Ghojogh et al., 2019b)
and kernel supervised PCA (Barshan et al., 2011; Ghojogh
Φ(U ) = Φ(X) A, (110) & Crowley, 2019) which are different from FDA and super-
vised PCA, respectively, even if linear kernel is used.
where Rn×p 3 A := [α1 , . . . , αp ]> and Φ(U ) is t × p di-
mensional where t may be infinite. Similarly, the following 9. Types of Use of Kernels in Machine
steps should be performed to kernelize the algorithm: Learning
• Step 1: U → φ(U )
There are several types of using kernels in machine learn-
• Step 2: Replace φ(U ) with Eq. (110) in the algorithm
ing. In the following, we explain these types of usage of
formulation
kernels.
• Step 3: Some Φ(X)> Φ(X) terms appear in the for-
mulation 9.1. Kernel Methods
• Step 4: Use Eq. (36)
The first type of using kernels in machine learning is ker-
• Step 5: Solve (optimize) the algorithm where the vari-
nelization of algorithms using either kernel trick or rep-
able to find is A rather than U
resentation theory. As was discussed in Section 8, linear
Again the goal of algorithm usually results in kernel. For methods can be kernelized to handle nonlinear data better.
example, if U is a projection matrix onto its column space, Even nonlinear algorithms can be kernelized to perform in
we have: the feature space rather than the input space. In machine
(32) (110) learning both kernel trick and kernelization by representa-
U > xi 7→ φ(U )> φ(xi ) = A> Φ(X)> φ(xi ) tion theory have been used. We provide some examples for
(36) each of these categories:
= A> k(X, xi ), (111)
• Examples for kernelization by kernel trick: ker-
where k(X, xi ) ∈ Rn is the kernel between all n train- nel Principal Component Analysis (PCA) (Schölkopf
ing points with the point xi . As this equation shows, the et al., 1997a; 1998; Ghojogh & Crowley, 2019), kernel
desired goal is based on kernel. Support Vector Machine (SVM) (Boser et al., 1992;
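A minimal sketch of kernelization by representation theory for a vector solution (Eqs. (108) and (109)), assuming the coefficient vector alpha has already been found by some algorithm: the pulled direction φ(u) = Φ(X) α is never formed explicitly, and projections onto it only need the kernel vector k(X, x).

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d, 0.0))

X = np.random.randn(30, 2)                 # training points (rows) in this sketch
alpha = np.random.randn(30)                # coefficients in Eq. (108), assumed given

def project(x_new):
    # Eq. (109): u^T phi(x) = alpha^T Phi(X)^T phi(x) = alpha^T k(X, x).
    k_vec = rbf(X, x_new.reshape(1, -1))[:, 0]
    return alpha @ k_vec

value = project(np.random.randn(2))
```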
Vapnik, 1995). that kernel learning by SDP has also been used for labeling
• Examples for kernelization by representation the- a not completely labeled dataset and is also used for ker-
ory: kernel supervised PCA (Barshan et al., 2011; Ghojogh & Crowley, 2019), kernel Fisher Discriminant Analysis (FDA) (Mika et al., 1999; Ghojogh et al., 2019b), and direct kernel Roweis Discriminant Analysis (RDA) (Ghojogh et al., 2020c).
As was discussed in Section 1, using kernels was widely noticed when linear SVM (Vapnik & Chervonenkis, 1974) was kernelized in (Boser et al., 1992; Vapnik, 1995). More discussions on kernel SVM can be found in (Schölkopf et al., 1997b; Hastie et al., 2009). A tutorial on kernel SVM is (Burges, 1998). Universal kernels, introduced in Section 5.3, are widely used in kernel SVM. More detailed discussions and proofs for the use of universal kernels in kernel SVM can be read in (Steinwart & Christmann, 2008).
As was discussed in Section 8.1, many machine learning algorithms are developed to have dual versions because inner products of points usually appear in the dual algorithms and the kernel trick can be applied on them. Some examples of these are dual PCA (Schölkopf et al., 1997a; 1998; Ghojogh & Crowley, 2019) and dual SVM (Boser et al., 1992; Vapnik, 1995), yielding kernel PCA and kernel SVM, respectively. In some algorithms, however, either a dual version does not exist or the formulation does not allow for merely having inner products of points. In those algorithms, the kernel trick cannot be used and representation theory should be used. An example for this is FDA (Ghojogh et al., 2019b). Moreover, some algorithms, such as kernel reinforcement learning (Ormoneit & Sen, 2002), use the kernel as a measure of similarity (see Remark 4).

9.2. Kernel Learning
After the development of many spectral dimensionality reduction methods in machine learning, it was found out that many of them are actually special cases of kernel Principal Component Analysis (PCA) (Bengio et al., 2003c; 2004). The paper (Ham et al., 2004) has shown that PCA, multidimensional scaling, Isomap, locally linear embedding, and Laplacian eigenmap are special cases of kernel PCA with kernels in the formulation of Eq. (51). A list of these kernels can be seen in (Strange & Zwiggelaar, 2014, Chapter 2) and (Ghojogh et al., 2019d).
Because of this, some generalized dimensionality reduction methods, such as graph embedding (Yan et al., 2005), were proposed. In addition, as many spectral methods are cases of kernel PCA, some researchers tried to learn the best kernel for manifold unfolding. Maximum Variance Unfolding (MVU) or Semidefinite Embedding (SDE) (Weinberger et al., 2005; Weinberger & Saul, 2006b;a) is a method for kernel learning using Semidefinite Programming (SDP) (Vandenberghe & Boyd, 1996). MVU is used for manifold unfolding and dimensionality reduction. Note that SDP has also been used for learning the kernel matrix in kernel SVM (Lanckriet et al., 2004; Karimi, 2017). Our focus here is on MVU. In the following, we briefly introduce the MVU (or SDE) algorithm.

Lemma 14 (Distance in RKHS (Schölkopf, 2001)). The squared Euclidean distance between points in the feature space is:

‖φ(x_i) − φ(x_j)‖²_k = k(x_i, x_i) + k(x_j, x_j) − 2k(x_i, x_j).    (112)

Proof.

‖φ(x_i) − φ(x_j)‖²_k = (φ(x_i) − φ(x_j))^⊤ (φ(x_i) − φ(x_j))
= φ(x_i)^⊤ φ(x_i) + φ(x_j)^⊤ φ(x_j) − φ(x_i)^⊤ φ(x_j) − φ(x_j)^⊤ φ(x_i)
= φ(x_i)^⊤ φ(x_i) + φ(x_j)^⊤ φ(x_j) − 2 φ(x_i)^⊤ φ(x_j)    (by Eq. (37))
= k(x_i, x_i) + k(x_j, x_j) − 2k(x_i, x_j).    (by Eq. (34))

Q.E.D.

MVU desires to unfold the manifold of data in its maximum variance direction. For example, consider a Swiss roll which can be unrolled to have maximum variance after being unrolled. As the trace of a matrix is the summation of its eigenvalues and the kernel matrix is a measure of similarity between points (see Remark 4), the trace of the kernel can be used to show the summation of variance of data. Hence, we should maximize tr(K), where tr(.) denotes the trace of a matrix. MVU pulls data to the RKHS and then unfolds the manifold. This unfolding should not ruin the local distances between points after pulling data to the feature space. Hence, we should preserve the local distances as:

‖x_i − x_j‖²₂ = ‖φ(x_i) − φ(x_j)‖²_k = k(x_i, x_i) + k(x_j, x_j) − 2k(x_i, x_j),    (113)

where the second equality is because of Eq. (112). Moreover, according to Lemma 3, the kernel should be positive semidefinite, i.e., K ⪰ 0. MVU also centers the kernel to have zero mean for the pulled dataset in the feature space (see Section 6.1). According to Eq. (63), we will have Σ_{i=1}^n Σ_{j=1}^n K(i, j) = 0. In summary, the optimization of MVU is (Weinberger et al., 2005; Weinberger & Saul, 2006b;a):

maximize_K   tr(K)
subject to   ‖x_i − x_j‖²₂ = K(i, i) + K(j, j) − 2K(i, j),   ∀i, j ∈ {1, . . . , n},
             Σ_{i=1}^n Σ_{j=1}^n K(i, j) = 0,
             K ⪰ 0,    (114)

which is a SDP problem (Vandenberghe & Boyd, 1996). Solving this optimization gives the best kernel for maximum variance unfolding of the manifold. Then, MVU considers the eigenvalue problem for the kernel, i.e., Eq. (93), and finds the embedding using Eq. (102).
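As a concrete illustration, the following is a minimal sketch of the MVU optimization of Eq. (114), assuming the cvxpy and scikit-learn packages are available; the function name and parameters are illustrative, not part of the original formulation. Following common practice for MVU, the distance constraints of Eq. (113) are imposed here only on k-nearest-neighbor pairs rather than on all pairs.

import numpy as np
import cvxpy as cp
from sklearn.neighbors import NearestNeighbors

def mvu_kernel(X, n_neighbors=5):
    """Sketch of the MVU semidefinite program of Eq. (114) for data X (n x d)."""
    n = X.shape[0]
    # neighborhood graph: preserve only local distances (common practical choice)
    _, idx = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X).kneighbors(X)

    K = cp.Variable((n, n), PSD=True)        # K is symmetric positive semidefinite
    constraints = [cp.sum(K) == 0]           # centering: sum_{i,j} K(i, j) = 0
    for i in range(n):
        for j in idx[i, 1:]:                 # skip the point itself
            d_ij = float(np.sum((X[i] - X[j]) ** 2))
            # preserve local distances in the RKHS, Eqs. (112)-(113)
            constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == d_ij)

    # requires an SDP-capable solver such as SCS
    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve(solver=cp.SCS)

    # embedding from the learned kernel by an eigenvalue decomposition
    # (corresponding to Eqs. (93) and (102)); keep the top columns as the embedding
    eigvals, eigvecs = np.linalg.eigh(K.value)
    embedding = eigvecs[:, ::-1] * np.sqrt(np.maximum(eigvals[::-1], 0))
    return K.value, embedding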
9.3. Use of Kernels for Difference of Distributions
There exist several different measures for the difference of distributions (i.e., PDFs). Some of them make use of kernels and some do not. A measure of difference of distributions can be used for (1) calculating the divergence (difference) of a distribution from another reference distribution or (2) convergence of a distribution to another reference distribution using optimization. We will explain the second use of this measure better in Corollary 5.
For the information of the reader, we first enumerate some of the measures without kernels and then introduce the kernel-based measures for the difference of distributions. One of the measures which do not use kernels is the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951). The KL divergence is a relative entropy from one distribution to the other one and has been widely used in deep learning (Goodfellow et al., 2016). Another measure is the Wasserstein metric, which has been used in generative models (Arjovsky et al., 2017). The integral probability metric (Müller, 1997) is another measure for the difference of distributions.
In the following, we introduce some well-known kernel-based measures for the difference of distributions.

9.3.1. Hilbert-Schmidt Independence Criterion (HSIC)
Suppose we want to measure the dependence of two random variables. Measuring the correlation between them is easier because correlation is just "linear" dependence.
According to (Hein & Bousquet, 2004), two random variables X and Y are independent if and only if any bounded continuous functions of them are uncorrelated. Therefore, if we map the samples of two random variables {x_i}_{i=1}^n and {y_i}_{i=1}^n to two different ("separable") RKHSs and have φ(x) and φ(y), we can measure the correlation of φ(x) and φ(y) in the Hilbert space to have an estimation of the dependence of x and y in the input space.
The correlation of φ(x) and φ(y) can be computed by the Hilbert-Schmidt norm of their cross-covariance (Gretton et al., 2005). Note that the squared Hilbert-Schmidt norm of a matrix A is (Bell, 2016):

‖A‖²_HS := tr(A^⊤ A),    (115)

and the cross-covariance matrix of two vectors x and y is (Gubner, 2006; Gretton et al., 2005):

Cov(x, y) := E[(x − E(x)) (y − E(y))^⊤].    (116)

Using the explained intuition, an empirical estimation of the Hilbert-Schmidt Independence Criterion (HSIC) is introduced (Gretton et al., 2005):

HSIC(X, Y) := (1 / (n − 1)²) tr(K_x H K_y H),    (117)

where K_x := φ(x)^⊤ φ(x) and K_y := φ(y)^⊤ φ(y) are the kernels over x and y, respectively. The term 1/(n − 1)² is used for normalization. The matrix H is the centering matrix (see Eq. (48)). Note that HSIC double-centers one of the kernels and then computes the Hilbert-Schmidt norm between the kernels.
HSIC measures the dependence of two random variable vectors x and y. Note that HSIC = 0 and HSIC > 0 mean that x and y are independent and dependent, respectively. The greater the HSIC, the greater dependence they have.

Lemma 15 (Independence of Random Variables Using Cross-Covariance (Gretton & Györfi, 2010, Theorem 5)). Two random variables X and Y are independent if and only if Cov(f(x), g(y)) = 0 for any pair of bounded continuous functions (f, g). Because of the relation of HSIC with the cross-covariance of variables, two random variables are independent if and only if HSIC(X, Y) = 0.

9.3.2. Maximum Mean Discrepancy (MMD)
MMD, also known as the kernel two-sample test and proposed in (Gretton et al., 2006; 2012), is a measure for the difference of distributions. For comparison of two distributions, one can find the difference of all moments of the two distributions. However, as the number of moments is infinite, it is intractable to calculate the difference of all moments. One idea to do this tractably is to pull both distributions to the feature space and then compute the distance of the pulled data points of the two distributions in RKHS. This difference is a suitable estimate of the difference of all moments in the input space. This is the idea behind MMD. MMD is a semi-metric (Simon-Gabriel et al., 2020) and uses distance in the RKHS (Schölkopf, 2001) (see Lemma 14). Consider PDFs P and Q and samples {x_i}_{i=1}^n ∼ P and {y_i}_{i=1}^n ∼ Q. The squared MMD between these PDFs is:

MMD²(P, Q) := ‖(1/n) Σ_{i=1}^n φ(x_i) − (1/n) Σ_{i=1}^n φ(y_i)‖²_k
= (1/n²) Σ_{i=1}^n Σ_{j=1}^n k(x_i, x_j) + (1/n²) Σ_{i=1}^n Σ_{j=1}^n k(y_i, y_j) − (2/n²) Σ_{i=1}^n Σ_{j=1}^n k(x_i, y_j)
= E_x[K(x, y)] + E_y[K(x, y)] − 2 E_x[E_y[K(x, y)]],    (118)

where the second equality is because of Eq. (112), and E_x[K(x, y)], E_y[K(x, y)], and E_x[E_y[K(x, y)]] are the average of rows, average of columns, and total average of rows and columns of the kernel matrix, respectively.
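The following is a minimal NumPy sketch of the empirical HSIC of Eq. (117) and the (biased) empirical squared MMD of Eq. (118), assuming an RBF kernel; the function names are illustrative.

import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Empirical HSIC of Eq. (117)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix of Eq. (48)
    Kx, Ky = rbf_kernel(X, X, sigma), rbf_kernel(Y, Y, sigma)
    return np.trace(Kx @ H @ Ky @ H) / (n - 1) ** 2

def mmd2(X, Y, sigma=1.0):
    """Empirical squared MMD of Eq. (118) between samples X ~ P and Y ~ Q."""
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Y, Y, sigma)
    Kxy = rbf_kernel(X, Y, sigma)
    # means of the three kernel blocks: average of rows, columns, and cross block
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()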
Note that MMD ≥ 0, where MMD = 0 means the two distributions are equivalent if the used kernel is characteristic (see Corollary 5, which will be provided later). MMD has been widely used in machine learning, such as in generative moment matching networks (Li et al., 2015).

Remark 10 (Equivalence of HSIC and MMD (Sejdinovic et al., 2013)). After the development of the HSIC and MMD measures, it was found out that they are equivalent.

9.4. Kernel Embedding of Distributions (Kernel Mean Embedding)
Definition 23 (Kernel Embedding of Distributions (Smola et al., 2007)). Kernel embedding of distributions, also called the Kernel Mean Embedding (KME) or mean map, represents (or embeds) Probability Density Functions (PDFs) in a RKHS.

Corollary 4 (Distribution Embedding in Hilbert Space). Inspired by Eq. (6) or (11), if we map a PDF P from its space X to the Hilbert space H, its mapped PDF, denoted by φ(P), can be represented as:

P ↦ φ(P) = ∫_X k(x, .) dP(x).    (119)

This integral is called the Bochner integral.

KME and MMD were first proposed in the field of pure mathematics (Guilbart, 1978). Later on, KME and MMD were used in machine learning, first in (Smola et al., 2007). KME is a family of methods which use Eq. (119) for embedding PDFs in RKHS. This family of methods is discussed more in (Sriperumbudur et al., 2010). A survey on KME is (Muandet et al., 2016).
Universal kernels, introduced in Section 5.3.3, can be used for KME (Sriperumbudur et al., 2011; Simon-Gabriel & Schölkopf, 2018). In addition to universal kernels, characteristic kernels and integrally strictly positive definite kernels are useful for KME (Sriperumbudur et al., 2011; Simon-Gabriel & Schölkopf, 2018). The integrally strictly positive definite kernel was introduced in Section 5.3.2. In the following, we introduce the characteristic kernels.

Definition 24 (Characteristic Kernel (Fukumizu et al., 2008)). A kernel is characteristic if the mapping (119) is injective. In other words, for a characteristic kernel k, we have:

E_{X∼P}[k(., X)] = E_{Y∼Q}[k(., Y)] ⇐⇒ P = Q,    (120)

where P and Q are two PDFs and X and Y are random variables from these distributions, respectively.

Some examples of characteristic kernels are the RBF and Laplacian kernels. Polynomial kernels are not characteristic kernels.
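A minimal sketch of the empirical counterpart of the mean map in Eq. (119), assuming an RBF kernel (which is characteristic); the function name is illustrative:

import numpy as np

def empirical_mean_map(X, x_query, sigma=1.0):
    """Empirical version of Eq. (119): (1/n) sum_i k(x_i, .) evaluated at x_query."""
    sq_dists = np.sum((X - x_query) ** 2, axis=1)
    return np.mean(np.exp(-sq_dists / (2 * sigma ** 2)))

Comparing the empirical mean maps of two samples in the RKHS norm recovers the empirical squared MMD of Eq. (118).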
Corollary 5 (Convergence of Distributions to Each Other Using Characteristic Kernels (Simon-Gabriel et al., 2020)). Let Q be the PDF for a theoretical or sample reference distribution. Following Definition 24, if the kernel used in measures for the difference of distributions is characteristic, the measure can be used in an optimization framework to converge a PDF P to the reference distribution Q as:

d_k(P, Q) = 0 ⇐⇒ P = Q,    (121)

where d_k(., .) denotes a measure for the difference of distributions, such as MMD.

Characteristic kernels have been used for dimensionality reduction in machine learning. For example, see (Fukumizu et al., 2004; 2009).
So far, we have introduced three different types of embedding in RKHS. In the following, we summarize these three types available in the literature of kernels.

Remark 11 (Types of Embedding in Hilbert Space). There are three types of embeddings in Hilbert space:
1. Embedding of points in the Hilbert space: This embedding maps x ↦ k(x, .) as stated in Sections 2.3 and 8.
2. Embedding of functions in the Hilbert space: This embedding maps f(x) ↦ ∫ K(x, y) f(y) p(y) dy as stated in Section 7.
3. Embedding of distributions (PDFs) in the Hilbert space: This embedding maps P ↦ ∫ k(x, .) dP(x) as stated in Section 9.4.

Researchers expect that combinations of these types of embedding might appear in the future.

9.5. Kernel Dimensionality Reduction for Sufficient Dimensionality Reduction
Kernels can also be used directly for dimensionality reduction. Assume X is the random variable of data and Y is the random variable of the labels of data. The labels can be discrete and finite for classification or continuous for regression. Sufficient Dimensionality Reduction (SDR) (Adragni & Cook, 2009) is a family of methods which find a transformation of data to a lower dimensional space, denoted by R(x), which does not change the conditional of labels given data:

P_{Y|X}(y | x) = P_{Y|R(X)}(y | R(x)).    (122)

Kernel Dimensionality Reduction (KDR) (Fukumizu et al., 2004; 2009; Wang et al., 2010b) is a SDR method with a linear projection for the transformation, i.e., R(x) : x ↦ U^⊤ x, which projects data onto the column space of U. The goal of KDR is:

P_{Y|X}(y | x) = P_{Y|U,X}(y | U, x).    (123)
Definition 25 (Dual Space). A dual space of a vector space V, denoted by V*, is the set of all linear functionals φ : V → F, where F is the field on which the vector space is defined.

Theorem 5 (Riesz (or Riesz–Fréchet) representation theorem (Garling, 1973)). Let H be a Hilbert space with inner product ⟨., .⟩_H. Suppose φ ∈ H* (e.g., φ : f → R). Then, there exists a unique f ∈ H such that for any x ∈ H, we have φ(x) = ⟨f, x⟩:

∀φ ∈ H*, ∃f ∈ H : ∀x ∈ H, φ(x) = ⟨f, x⟩_H.    (124)

Corollary 6. According to Theorem 5, we have:

E_X[f(x)] = ⟨f, φ(P)⟩_H,   ∀f ∈ H,    (125)

where φ(P) is defined by Eq. (119).

KDR uses Theorem 5 and Corollary 6 in its formulations. Note that characteristic kernels (Fukumizu et al., 2008), introduced in Definition 24, are used in KDR.

10. Rank and Factorization of Kernel and the Nyström Method
10.1. Rank and Factorization of Kernel Matrix
Usually, the rank of the kernel is small. This is because of the manifold hypothesis, which states that data points often do not cover the whole space but lie on a sub-manifold. The Nyström approximation of the kernel matrix also works well because kernels are often low-rank matrices (we will discuss this in Corollary 7). Because of the low rank of kernels, they can be approximated (Kishore Kumar & Schneider, 2017), learned (Kulis et al., 2006; 2009), and factorized. Kernel factorization has also been used for the sake of clustering (Wang et al., 2010a). In the following, we introduce some of the most well-known decompositions for the kernel matrix.

10.1.1. Singular Value and Eigenvalue Decompositions
The Singular Value Decomposition (SVD) of the pulled data to the feature space is:

R^{t×n} ∋ Φ(X) = U Σ V^⊤,    (126)

where U ∈ R^{t×n} and V ∈ R^{n×n} are orthogonal matrices and contain the left and right singular vectors, respectively, and Σ ∈ R^{n×n} is a diagonal matrix with the singular values. Note that here, we are using notations such as R^{t×n} for showing the dimensionality of matrices and this notation does not imply a Euclidean space.
As mentioned before, the pulled data are not necessarily available, so Eq. (126) cannot necessarily be done. The kernel, however, is available. Using the SVD of the pulled data in the formulation of the kernel gives:

K = Φ(X)^⊤ Φ(X) = (U Σ V^⊤)^⊤ (U Σ V^⊤) = V Σ U^⊤ U Σ V^⊤ = V Σ Σ V^⊤ = V Σ² V^⊤
⟹ K V = V Σ² V^⊤ V = V Σ²,    (127)

where the first equality is because of Eq. (36), the second because of Eq. (126), and we used U^⊤ U = I and V^⊤ V = I. If we take ∆ = diag([δ_1, . . . , δ_n]^⊤) := Σ², this equation becomes:

K V = V ∆,    (128)

which is the matrix form of Eq. (93), i.e., the eigenvalue problem (Ghojogh et al., 2019a) for the kernel matrix with V and ∆ as eigenvectors and eigenvalues, respectively. Hence, rather than SVD on the pulled dataset, one can apply Eigenvalue Decomposition (EVD) on the kernel, where the eigenvectors of the kernel are equal to the right singular vectors of the pulled dataset and the eigenvalues of the kernel are the squared singular values of the pulled dataset. This technique has been used in kernel PCA (Ghojogh & Crowley, 2019).

10.1.2. Cholesky and QR Decompositions
The kernel matrix can be factorized using LU decomposition; however, as the kernel matrix is a symmetric positive semi-definite matrix (see Lemmas 1 and 3), it can be decomposed using the Cholesky decomposition, which is much faster than LU decomposition. The Cholesky decomposition of the kernel is in the form R^{n×n} ∋ K = L L^⊤, where L ∈ R^{n×n} is a lower-triangular matrix. The kernel matrix can also be factorized using QR decomposition as R^{n×n} ∋ K = Q R, where Q ∈ R^{n×n} is an orthogonal matrix and R ∈ R^{n×n} is an upper-triangular matrix. The paper (Bach & Jordan, 2005) has incorporated side information, such as class labels, in the Cholesky and QR decompositions of the kernel matrix.
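A small NumPy sketch of these factorizations, using a synthetic matrix as a stand-in for the pulled data Φ(X); it checks numerically that the eigenvalues of K are the squared singular values of Φ(X), as in Eqs. (126)-(128), and then forms the Cholesky and QR factorizations of the kernel:

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((7, 5))            # stand-in for the pulled data Φ(X), t x n
K = Phi.T @ Phi                              # kernel matrix, Eq. (36)

# Eqs. (126)-(128): right singular vectors of Φ(X) are eigenvectors of K,
# and the eigenvalues of K are the squared singular values of Φ(X).
_, sing_vals, Vt = np.linalg.svd(Phi, full_matrices=False)
eig_vals, V = np.linalg.eigh(K)              # ascending order
assert np.allclose(np.sort(sing_vals**2), eig_vals)

# Cholesky and QR factorizations of the positive semi-definite kernel
L = np.linalg.cholesky(K + 1e-10 * np.eye(K.shape[0]))   # K = L L^T (jitter guards rank deficiency)
Q, R = np.linalg.qr(K)                                    # K = Q R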
10.2. Nyström Method for Approximation of Eigenfunctions
The Nyström method, first proposed in (Nyström, 1930), was initially used for approximating the eigenfunctions of an operator (or of a matrix corresponding to an operator). The following lemma provides the Nyström approximation for the eigenfunctions of the kernel operator defined by Eq. (89). For more discussion on this, the reader can refer to (Baker, 1978), (Williams & Seeger, 2001, Section 1.1), and (Williams & Seeger, 2000).
Lemma 16 (Nyström Approximation of Eigenfunction (Baker, 1978, Chapter 3), (Williams & Seeger, 2001, Section 1.1)). Consider a training dataset {x_i ∈ R^d}_{i=1}^n and the eigenfunction problem (78), where f_k ∈ H and λ_k are the k-th eigenfunction and eigenvalue of the kernel operator defined by Eq. (89) or (90). The eigenfunction can be approximated by the Nyström method as:

f_k(x) ≈ (1 / (n λ_k)) Σ_{i=1}^n k(x_i, x) f_k(x_i),    (129)

where k(x_i, x) is the kernel (or centered kernel) corresponding to the kernel operator.

Proof. This lemma is somewhat similar and related to Lemma 12. Consider the kernel operator, defined by Eq. (89), in Eq. (78): K f = λ f. Combining this with the discrete version of the operator, Eq. (90), gives λ_k f_k(x) = (1/n) Σ_{i=1}^n k(x_i, x) f_k(x_i). Dividing both sides of this equation by λ_k brings Eq. (129). Q.E.D.
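Eq. (129) translates directly into code. The following sketch assumes the eigenvalue λ_k of the kernel operator and the values f_k(x_i) at the training points have already been obtained (e.g., from the eigenvalue problem of Eq. (93)); the function name and argument names are illustrative.

import numpy as np

def nystrom_eigenfunction(x, X_train, f_k_train, lam_k, kernel):
    """Nyström approximation of Eq. (129): extend the k-th eigenfunction to a new point x.

    X_train   : (n, d) training points x_1, ..., x_n
    f_k_train : (n,) values f_k(x_i) of the eigenfunction at the training points
    lam_k     : eigenvalue of the kernel operator for the k-th eigenfunction
    kernel    : callable kernel(a, b) returning the scalar k(a, b)
    """
    n = X_train.shape[0]
    k_vec = np.array([kernel(xi, x) for xi in X_train])
    return k_vec @ f_k_train / (n * lam_k)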
10.3. Nyström Method for Kernel Completion and Approximation
As explained before, the kernel matrix often has a low rank. Because of its low rank, it can be approximated (Kishore Kumar & Schneider, 2017). This is important because in big data, when n ≫ 1, constructing the kernel matrix is both time-consuming and also intractable to store in computer memory; i.e., its computation will run forever and will finally raise a memory error. Hence, it is desired to compute the kernel function between a subset of data points (called landmarks) and then approximate the rest of the kernel matrix using this subset of the kernel matrix. The Nyström approximation can be used for this goal.
The Nyström method can be used for kernel approximation. It is a technique used to approximate a positive semi-definite matrix using merely a subset of its columns (or rows) (Williams & Seeger, 2001, Section 1.2), (Drineas et al., 2005). Hence, it can be used for kernel completion in big data, where computation of the entire kernel matrix is time-consuming and intractable. One can compute some of the important columns or rows of a kernel matrix, called landmarks, and approximate the rest of the columns or rows by the Nyström approximation.
Consider a positive semi-definite matrix R^{n×n} ∋ K ⪰ 0 whose parts are:

K = [ A    B
      B^⊤  C ],    (130)

where A ∈ R^{m×m}, B ∈ R^{m×(n−m)}, and C ∈ R^{(n−m)×(n−m)}, with m ≪ n. Intuitively, if we know the similarity of some points a and b with each other, resulting in matrix A, and the similarity of points c, d, and e from a and b, resulting in matrix B, we cannot have much freedom on the location of c, d, and e, which is the matrix C. This is because of the positive semi-definiteness of the matrix K. The points selected in submatrix A are named landmarks. Note that the landmarks can be selected randomly from the columns/rows of matrix K and, without loss of generality, they can be put together to form a submatrix at the top-left corner of the matrix. For the Nyström approximation, some methods have been proposed for sampling the more important columns/rows of the matrix more wisely rather than randomly. We will mention some of these sampling methods in Section 10.5.
As the matrix K is positive semi-definite, by definition, it can be written as K = O^⊤ O. If we take O = [R, S], where R are the selected columns (landmarks) of O and S are the other columns of O, we have:

K = O^⊤ O = [R, S]^⊤ [R, S]    (131)
= [ R^⊤R   R^⊤S
    S^⊤R   S^⊤S ] = [ A    B
                      B^⊤  C ].    (132)

Hence, we have A = R^⊤ R. The eigenvalue decomposition (Ghojogh et al., 2019a) of A gives:

A = U Σ U^⊤    (133)
⟹ R^⊤ R = U Σ U^⊤ ⟹ R = Σ^{1/2} U^⊤.    (134)

Moreover, we have B = R^⊤ S, so:

B = (Σ^{1/2} U^⊤)^⊤ S = U Σ^{1/2} S ⟹ U^⊤ B = Σ^{1/2} S ⟹ S = Σ^{−1/2} U^⊤ B,    (135)

where the second implication is because U is orthogonal (in the eigenvalue decomposition). Finally, we have:

C = S^⊤ S = B^⊤ U Σ^{−1/2} Σ^{−1/2} U^⊤ B = B^⊤ U Σ^{−1} U^⊤ B = B^⊤ A^{−1} B,    (136)

where the last equality is because of Eq. (133).
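The derivation above suggests the following minimal sketch of Nyström kernel completion: only the blocks A (landmarks versus landmarks) and B (landmarks versus the rest) are computed, and C is filled in as in Eq. (136). It assumes a callable that returns the kernel matrix between two sets of points (such as the rbf_kernel sketched earlier); a pseudo-inverse replaces the exact A^{−1} for numerical robustness, and the function name is illustrative.

import numpy as np

def nystrom_completion(kernel, X, landmark_idx):
    """Approximate the full kernel matrix from a subset of its rows/columns."""
    rest_idx = np.setdiff1d(np.arange(X.shape[0]), landmark_idx)
    A = kernel(X[landmark_idx], X[landmark_idx])      # m x m block over the landmarks
    B = kernel(X[landmark_idx], X[rest_idx])          # m x (n - m) block
    C = B.T @ np.linalg.pinv(A) @ B                   # Eq. (136), with a pseudo-inverse
    top = np.hstack([A, B])
    bottom = np.hstack([B.T, C])
    return np.vstack([top, bottom])                   # block form of Eq. (130), landmarks first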
Hence, the rank of K gives the required rank of A and a lower bound on the number of landmarks. In practice, it is recommended to use a larger number of landmarks for a more accurate approximation, but there is a trade-off with the speed.

Corollary 7. As we usually have m ≪ n, the Nyström approximation works well especially for low-rank matrices (Kishore Kumar & Schneider, 2017) because we will need a small A (so a small number of landmarks) for approximation. Usually, because of the manifold hypothesis, data fall on a submanifold; hence, usually, the kernel (similarity) matrix or the distance matrix has a low rank. Therefore, the Nyström approximation works well for many kernel-based or distance-based manifold learning methods.

10.4. Use of Nyström Approximation for Landmark Spectral Embedding
The spectral dimensionality reduction methods (Saul et al., 2006) are based on the geometry of data and their solutions often follow an eigenvalue problem (Ghojogh et al., 2019a). Therefore, they cannot handle big data where n ≫ 1. To tackle this issue, there exist some landmark methods which approximate the embedding of all points using the embedding of some landmarks. Big data, i.e., n ≫ 1, result in large kernel matrices. Selecting some of the most informative columns or rows of the kernel matrix, called landmarks, can reduce computations. This technique is named the Nyström approximation, which is used for kernel approximation and completion.
The Nyström approximation can be used to make the spectral methods such as locally linear embedding (Ghojogh et al., 2020a) and Multidimensional Scaling (MDS) (Ghojogh et al., 2020b) scalable and suitable for big data embedding. It is shown in (Platt, 2005) that all the landmark MDS methods are Nyström approximations. For more details on the usage of the Nyström approximation in spectral embedding, refer to (Ghojogh et al., 2020a;b).

10.5. Other Improvements over Nyström Approximation of Kernels
The Nyström method has been improved for kernel approximation in a line of research. For example, it has been used for clustering (Fowlkes et al., 2004) and regularization (Rudi et al., 2015). Greedy Nyström (Farahat et al., 2011; 2015) and large-scale Nyström (Li et al., 2010) are other examples. There is a trade-off between the approximation accuracy and computational efficiency, and they are balanced in (Lim et al., 2015; 2018). The error analysis of the Nyström method can be found in (Zhang et al., 2008; Zhang & Kwok, 2010). It is better to sample the landmarks wisely rather than randomly. This field of research is named "column subset selection" or "landmark selection" for Nyström approximation. Some of these methods are (Kumar et al., 2009; 2012) and landmark selection on a sparse manifold (Silva et al., 2006).
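For practical use, an off-the-shelf Nyström feature-map approximation is available in scikit-learn. The following sketch, assuming scikit-learn is installed, approximates an RBF kernel on a toy dataset with 100 landmarks; the dataset and parameter values are only illustrative.

from sklearn.datasets import make_circles
from sklearn.kernel_approximation import Nystroem

X, _ = make_circles(n_samples=2000, factor=0.3, noise=0.05)
# approximate the RBF feature map using 100 landmark (component) points
feature_map = Nystroem(kernel="rbf", gamma=2.0, n_components=100)
Z = feature_map.fit_transform(X)      # (2000, 100) approximate feature matrix
K_approx = Z @ Z.T                    # low-rank approximation of the full kernel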
11. Conclusion
This paper was a tutorial and survey paper on kernels and kernel methods. We covered various topics including Mercer kernels, Mercer's theorem, RKHS, eigenfunctions, Nyström methods, kernelization techniques, and the use of kernels in machine learning. This paper can be useful for different fields of science such as machine learning, functional analysis, and quantum mechanics.

Acknowledgement
Some parts of this tutorial, particularly some parts of the RKHS, are covered by Prof. Larry Wasserman's statistical machine learning course at the Department of Statistics and Data Science, Carnegie Mellon University (watch his course on YouTube). The video on RKHS in the Statistical Machine Learning course by Prof. Ulrike von Luxburg at the University of Tübingen is also great (watch on YouTube, Tübingen Machine Learning channel). Also, some parts such as the Nyström approximation and MVU are covered by Prof. Ali Ghodsi's course at the University of Waterloo, available on YouTube. There are other useful videos on this field which can be found on YouTube.

References
Abu-Khzam, Faisal N, Collins, Rebecca L, Fellows, Micheal R, Langston, Micheal A, Suters, W Henry, and Symons, Christopher T. Kernelization algorithms for the vertex cover problem: Theory and experiments. Technical report, University of Tennessee, 2004.

Adragni, Kofi P and Cook, R Dennis. Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4385–4405, 2009.

Ah-Pine, Julien. Normalized kernels as similarity indices. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 362–373. Springer, 2010.

Aizerman, Mark A, Braverman, E. M., and Rozonoer, L. I. Theoretical foundations of the potential function method in pattern recognition learning. Automation and remote control, 25:821–837, 1964.

Akbari, Saieed, Ghareghani, Narges, Khosrovshahi, Gholamreza B, and Maimani, Hamidreza. The kernels of the incidence matrices of graphs revisited. Linear algebra and its applications, 414(2-3):617–625, 2006.

Alperin, Jonathan L. Local representation theory: Modular representations as an introduction to the local repre-
sentation theory of finite groups. Cambridge University Bengio, Yoshua, Delalleau, Olivier, Roux, Nicolas Le,
Press, 1993. Paiement, Jean-François, Vincent, Pascal, and Ouimet,
Marie. Learning eigenfunctions links spectral embed-
Anderson, Thomas and Dahlin, Michael. Operating Sys- ding and kernel PCA. Neural computation, 16(10):
tems: Principles and Practice. Recursive Books, 2014. 2197–2219, 2004.
Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Bengio, Yoshua, Delalleau, Olivier, Le Roux, Nicolas,
Wasserstein generative adversarial networks. In Inter- Paiement, Jean-François, Vincent, Pascal, and Ouimet,
national conference on machine learning, pp. 214–223. Marie. Spectral dimensionality reduction. In Feature
PMLR, 2017. Extraction, pp. 519–550. Springer, 2006.
Aronszajn, Nachman. Theory of reproducing kernels. Berg, Christian, Christensen, Jens Peter Reus, and Ressel,
Transactions of the American mathematical society, 68 Paul. Harmonic analysis on semigroups: theory of posi-
(3):337–404, 1950. tive definite and related functions, volume 100. Springer,
1984.
Arzelà, Cesare. Sulle Funzioni Di Linee. Tipografia Gam-
berini e Parmeggiani, 1895. Bergman, Clifford. Universal algebra: Fundamentals and
selected topics. CRC Press, 2011.
Bach, Francis R and Jordan, Michael I. Predictive low-rank
decomposition for kernel methods. In Proceedings of the Berlinet, Alain and Thomas-Agnan, Christine. Reproduc-
22nd international conference on machine learning, pp. ing kernel Hilbert spaces in probability and statistics.
33–40, 2005. Springer Science & Business Media, 2011.
Baker, Christopher TH. The numerical treatment of inte- Bhatia, Rajendra. Positive definite matrices. Princeton Uni-
gral equations. Clarendon press, 1978. versity Press, 2009.
Barshan, Elnaz, Ghodsi, Ali, Azimifar, Zohreh, and Borgwardt, Karsten M, Gretton, Arthur, Rasch, Malte J,
Jahromi, Mansoor Zolghadri. Supervised principal com- Kriegel, Hans-Peter, Schölkopf, Bernhard, and Smola,
ponent analysis: Visualization, classification and regres- Alex J. Integrating structured biological data by kernel
sion on subspaces and submanifolds. Pattern Recogni- maximum mean discrepancy. Bioinformatics, 22(14):
tion, 44(7):1357–1371, 2011. e49–e57, 2006.
Beauzamy, Bernard. Introduction to Banach spaces and Boser, Bernhard E, Guyon, Isabelle M, and Vapnik,
their geometry. North-Holland, 1982. Vladimir N. A training algorithm for optimal margin
classifiers. In Proceedings of the fifth annual workshop
Bell, Jordan. Trace class operators and Hilbert-Schmidt on Computational learning theory, pp. 144–152, 1992.
operators. Department of Mathematics, University of
Toronto, Technical Report, 2016. Bourbaki, Nicolas. Sur certains espaces vectoriels
topologiques. In Annales de l’institut Fourier, volume 2,
Bengio, Yoshua, Paiement, Jean-françcois, Vincent, Pas- pp. 5–16, 1950.
cal, Delalleau, Olivier, Roux, Nicolas, and Ouimet,
Marie. Out-of-sample extensions for LLE, Isomap, Brunet, Dominique, Vrscay, Edward R, and Wang, Zhou.
MDS, eigenmaps, and spectral clustering. Advances On the mathematical properties of the structural similar-
in neural information processing systems, 16:177–184, ity index. IEEE Transactions on Image Processing, 21
2003a. (4):1488–1499, 2011.
Bengio, Yoshua, Vincent, Pascal, Paiement, Jean-François, Burges, Christopher JC. A tutorial on support vector ma-
Delalleau, O, Ouimet, M, and LeRoux, N. Learning chines for pattern recognition. Data mining and knowl-
eigenfunctions of similarity: linking spectral cluster- edge discovery, 2(2):121–167, 1998.
ing and kernel PCA. Technical report, Departement Camps-Valls, Gustavo. Kernel methods in bioengineering,
d’Informatique et Recherche Operationnelle, 2003b. signal and image processing. Igi Global, 2006.
Bengio, Yoshua, Vincent, Pascal, Paiement, Jean-François, Conway, John B. A course in functional analysis. Springer,
Delalleau, Olivier, Ouimet, Marie, and Le Roux, Nico- 2 edition, 2007.
las. Spectral clustering and kernel PCA are learn-
ing eigenfunctions. Technical report, Departement Cox, Michael AA and Cox, Trevor F. Multidimensional
d’Informatique et Recherche Operationnelle, Technical scaling. In Handbook of data visualization, pp. 315–347.
Report 1239, 2003c. Springer, 2008.
De Branges, Louis. The Stone-Weierstrass theorem. Pro- Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark.
ceedings of the American Mathematical Society, 10(5): Eigenvalue and generalized eigenvalue problems: Tuto-
822–824, 1959. rial. arXiv preprint arXiv:1903.11240, 2019a.
Drineas, Petros, Mahoney, Michael W, and Cristianini, Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark.
Nello. On the Nyström method for approximating a Fisher and kernel Fisher discriminant analysis: Tutorial.
Gram matrix for improved kernel-based learning. jour- arXiv preprint arXiv:1906.09436, 2019b.
nal of machine learning research, 6(12), 2005.
Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark.
Farahat, Ahmed, Ghodsi, Ali, and Kamel, Mohamed. A
Image structure subspace learning using structural simi-
novel greedy algorithm for Nyström approximation. In
larity index. In International Conference on Image Anal-
Proceedings of the Fourteenth International Conference
ysis and Recognition, pp. 33–44. Springer, 2019c.
on Artificial Intelligence and Statistics, pp. 269–277.
JMLR Workshop and Conference Proceedings, 2011. Ghojogh, Benyamin, Samad, Maria N, Mashhadi,
Farahat, Ahmed K, Elgohary, Ahmed, Ghodsi, Ali, and Sayema Asif, Kapoor, Tania, Ali, Wahab, Karray,
Kamel, Mohamed S. Greedy column subset selection Fakhri, and Crowley, Mark. Feature selection and fea-
for large-scale data sets. Knowledge and Information ture extraction in pattern analysis: A literature review.
Systems, 45(1):1–34, 2015. arXiv preprint arXiv:1905.02845, 2019d.
Fausett, Laurene V. Fundamentals of neural networks: ar- Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and
chitectures, algorithms and applications. Prentice-Hall, Crowley, Mark. Locally linear embedding and
Inc., 1994. its variants: Tutorial and survey. arXiv preprint
arXiv:2011.10925, 2020a.
Fomin, Fedor V, Lokshtanov, Daniel, Saurabh, Saket, and
Zehavi, Meirav. Kernelization: theory of parameterized Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and
preprocessing. Cambridge University Press, 2019. Crowley, Mark. Multidimensional scaling, Sammon
mapping, and Isomap: Tutorial and survey. arXiv
Fowlkes, Charless, Belongie, Serge, Chung, Fan, and Ma-
preprint arXiv:2009.08136, 2020b.
lik, Jitendra. Spectral grouping using the Nyström
method. IEEE transactions on pattern analysis and ma-
Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark.
chine intelligence, 26(2):214–225, 2004.
Generalized subspace learning by Roweis discriminant
Fukumizu, Kenji, Bach, Francis R, and Jordan, Michael I. analysis. In International Conference on Image Analysis
Dimensionality reduction for supervised learning with and Recognition, pp. 328–342. Springer, 2020c.
reproducing kernel Hilbert spaces. Journal of Machine
Learning Research, 5(Jan):73–99, 2004. Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark.
Theoretical insights into the use of structural similar-
Fukumizu, Kenji, Sriperumbudur, Bharath K, Gretton, ity index in generative models and inferential autoen-
Arthur, and Schölkopf, Bernhard. Characteristic kernels coders. In International Conference on Image Analysis
on groups and semigroups. In Advances in neural infor- and Recognition, pp. 112–117. Springer, 2020d.
mation processing systems, pp. 473–480, 2008.
Gonzalez, Rafael C and Woods, Richard E. Digital image
Fukumizu, Kenji, Bach, Francis R, Jordan, Michael I, et al. processing. Prentice hall Upper Saddle River, NJ, 2002.
Kernel dimension reduction in regression. The Annals of
Statistics, 37(4):1871–1905, 2009. Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, and
Bengio, Yoshua. Deep learning, volume 1. MIT press
Garling, DJH. A ‘short’ proof of the Riesz representa-
Cambridge, 2016.
tion theorem. In Mathematical Proceedings of the Cam-
bridge Philosophical Society, volume 73, pp. 459–460. Gretton, Arthur and Györfi, László. Consistent nonpara-
Cambridge University Press, 1973. metric tests of independence. The Journal of Machine
Genton, Marc G. Classes of kernels for machine learning: Learning Research, 11:1391–1423, 2010.
a statistics perspective. Journal of machine learning re-
search, 2(Dec):299–312, 2001. Gretton, Arthur, Bousquet, Olivier, Smola, Alex, and
Schölkopf, Bernhard. Measuring statistical dependence
Ghojogh, Benyamin and Crowley, Mark. Unsupervised with Hilbert-Schmidt norms. In International conference
and supervised principal component analysis: Tutorial. on algorithmic learning theory, pp. 63–77. Springer,
arXiv preprint arXiv:1906.03148, 2019. 2005.
Gretton, Arthur, Borgwardt, Karsten, Rasch, Malte, Karimi, Amir-Hossein. A summary of the kernel matrix,
Schölkopf, Bernhard, and Smola, Alex. A kernel method and how to learn it effectively using semidefinite pro-
for the two-sample-problem. Advances in neural infor- gramming. arXiv preprint arXiv:1709.06557, 2017.
mation processing systems, 19:513–520, 2006.
Kimeldorf, George and Wahba, Grace. Some results on
Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J, Tchebycheffian spline functions. Journal of mathemati-
Schölkopf, Bernhard, and Smola, Alexander. A kernel cal analysis and applications, 33(1):82–95, 1971.
two-sample test. The Journal of Machine Learning Re-
search, 13(1):723–773, 2012. Kishore Kumar, N and Schneider, Jan. Literature survey on
low rank approximation of matrices. Linear and Multi-
Gubner, John A. Probability and random processes for
linear Algebra, 65(11):2212–2244, 2017.
electrical and computer engineers. Cambridge Univer-
sity Press, 2006. Kulis, Brian, Sustik, Mátyás, and Dhillon, Inderjit. Learn-
Guilbart, Christian. Etude des produits scalaires sur ing low-rank kernel matrices. In Proceedings of the 23rd
l’espace des mesures: estimation par projections. PhD international conference on Machine learning, pp. 505–
thesis, Université des Sciences et Techniques de Lille, 512, 2006.
1978.
Kulis, Brian, Sustik, Mátyás A, and Dhillon, Inderjit S.
Ham, Jihun, Lee, Daniel D, Mika, Sebastian, and Low-rank kernel learning with Bregman matrix diver-
Schölkopf, Bernhard. A kernel view of the dimensional- gences. Journal of Machine Learning Research, 10(2),
ity reduction of manifolds. In Proceedings of the twenty- 2009.
first international conference on Machine learning, pp.
47, 2004. Kullback, Solomon and Leibler, Richard A. On informa-
tion and sufficiency. The annals of mathematical statis-
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. tics, 22(1):79–86, 1951.
The elements of statistical learning: data mining, infer-
ence, and prediction. Springer Science & Business Me- Kumar, Sanjiv, Mohri, Mehryar, and Talwalkar, Ameet.
dia, 2009. Sampling techniques for the Nyström method. In Ar-
tificial Intelligence and Statistics, pp. 304–311. PMLR,
Hawkins, Thomas. Cauchy and the spectral theory of ma-
2009.
trices. Historia mathematica, 2(1):1–29, 1975.
Hein, Matthias and Bousquet, Olivier. Kernels, associated Kumar, Sanjiv, Mohri, Mehryar, and Talwalkar, Ameet.
structures and generalizations. Max-Planck-Institut fuer Sampling methods for the Nyström method. The Journal
biologische Kybernetik, Technical Report, 2004. of Machine Learning Research, 13(1):981–1006, 2012.
Hilbert, David. Grundzüge einer allgemeinen theo- Kung, Sun Yuan. Kernel methods and machine learning.
rie der linearen integralrechnungen I. Nachrichten Cambridge University Press, 2014.
von der Gesellschaft der Wissenschaften zu Göttingen,
Mathematisch-Physikalische Klasse, pp. 49–91, 1904. Kusse, Bruce R and Westwig, Erik A. Mathematical
physics: applied mathematics for scientists and engi-
Hinton, Geoffrey E and Salakhutdinov, Ruslan R. Reduc- neers. Wiley-VCH, 2 edition, 2006.
ing the dimensionality of data with neural networks. Sci-
ence, 313(5786):504–507, 2006. Lanckriet, Gert RG, Cristianini, Nello, Bartlett, Peter,
Ghaoui, Laurent El, and Jordan, Michael I. Learning the
Hofmann, Thomas, Schölkopf, Bernhard, and Smola,
kernel matrix with semidefinite programming. Journal
Alexander J. A review of kernel methods in machine
of Machine learning research, 5(Jan):27–72, 2004.
learning. Max-Planck-Institute Technical Report, 156,
2006. Li, Mu, Kwok, James Tin-Yau, and Lü, Baoliang. Mak-
Hofmann, Thomas, Schölkopf, Bernhard, and Smola, ing large-scale Nyström approximation possible. In
Alexander J. Kernel methods in machine learning. The ICML 2010-Proceedings, 27th International Conference
annals of statistics, pp. 1171–1220, 2008. on Machine Learning, pp. 631, 2010.
Icking, Christian and Klein, Rolf. Searching for the kernel Li, Yujia, Swersky, Kevin, and Zemel, Rich. Genera-
of a polygon—a competitive strategy. In Proceedings of tive moment matching networks. In International Con-
the eleventh annual symposium on Computational geom- ference on Machine Learning, pp. 1718–1727. PMLR,
etry, pp. 258–266, 1995. 2015.
Lim, Woosang, Kim, Minhwan, Park, Haesun, and Jung, Nyström, Evert J. Über die praktische auflösung von
Kyomin. Double Nyström method: An efficient and ac- integralgleichungen mit anwendungen auf randwertauf-
curate Nyström scheme for large-scale data sets. In In- gaben. Acta Mathematica, 54(1):185–204, 1930.
ternational Conference on Machine Learning, pp. 1367–
1375. PMLR, 2015. Oldford, Wayne. Lecture: Recasting principal compo-
nents. Lecture notes for Data Visualization, Department
Lim, Woosang, Du, Rundong, Dai, Bo, Jung, Kyomin, of Statistics and Actuarial Science, University of Water-
Song, Le, and Park, Haesun. Multi-scale Nyström loo, 2018.
method. In International Conference on Artificial Intel-
Ormoneit, Dirk and Sen, Śaunak. Kernel-based reinforce-
ligence and Statistics, pp. 68–76. PMLR, 2018.
ment learning. Machine learning, 49(2):161–178, 2002.
Ma, Junshui. Function replacement vs. kernel trick. Neu-
Orr, Mark J. L. Introduction to radial basis function net-
rocomputing, 50:479–483, 2003.
works. Technical report, Center for Cognitive Science,
Mercer, J. Functions of positive and negative type and their University of Edinburgh, 1996.
connection with the theory of integral equations. Philo- Pan, Sinno Jialin, Kwok, James T, and Yang, Qiang. Trans-
sophical Transactions of the Royal Society, A(209):415– fer learning via dimensionality reduction. In AAAI, vol-
446, 1909. ume 8, pp. 677–682, 2008.
Mika, Sebastian, Ratsch, Gunnar, Weston, Jason, Parseval des Chenes, MA. Mémoires présentés à l’institut
Scholkopf, Bernhard, and Mullers, Klaus-Robert. Fisher des sciences, lettres et arts, par divers savans, et lus dans
discriminant analysis with kernels. In Neural networks ses assemblées. Sciences, mathématiques et physiques
for signal processing IX: Proceedings of the 1999 IEEE (Savans étrangers), 1:638, 1806.
signal processing society workshop, pp. 41–48. Ieee,
1999. Perlibakas, Vytautas. Distance measures for PCA-based
face recognition. Pattern recognition letters, 25(6):711–
Minh, Ha Quang, Niyogi, Partha, and Yao, Yuan. Mer- 724, 2004.
cer’s theorem, feature maps, and smoothing. In Interna-
tional Conference on Computational Learning Theory, Platt, John. FastMap, MetricMap, and landmark MDS are
pp. 154–168. Springer, 2006. all Nystrom algorithms. In AISTATS, 2005.
Muandet, Krikamol, Fukumizu, Kenji, Sriperumbudur, Prugovecki, Eduard. Quantum mechanics in Hilbert space.
Bharath, and Schölkopf, Bernhard. Kernel mean em- Academic Press, 1982.
bedding of distributions: A review and beyond. arXiv Reed, Michael and Simon, Barry. Methods of modern
preprint arXiv:1605.09522, 2016. mathematical physics: Functional analysis. Academic
Müller, Alfred. Integral probability metrics and their gen- Press, 1972.
erating classes of functions. Advances in Applied Prob- Renardy, Michael and Rogers, Robert C. An introduction
ability, pp. 429–443, 1997. to partial differential equations, volume 13. Springer
Müller, Klaus-Robert, Mika, Sebastian, Tsuda, Koji, and Science & Business Media, 2006.
Schölkopf, Bernhard. An introduction to kernel-based Rennie, Jason. How to normalize a kernel matrix. Tech-
learning algorithms. Handbook of Neural Network Sig- nical report, MIT Computer Science & Artificial Intelli-
nal Processing, 2018. gence Lab, 2005.
Narici, Lawrence and Beckenstein, Edward. Topological Rojo-Álvarez, José Luis, Martı́nez-Ramón, Manel, Marı́,
vector spaces. CRC Press, 2010. Jordi Muñoz, and Camps-Valls, Gustavo. Digital signal
processing with Kernel methods. Wiley Online Library,
Noack, Marcus M and Sethian, James A. Advanced station-
2018.
ary and non-stationary kernel designs for domain-aware
Gaussian processes. arXiv preprint arXiv:2102.03432, Rudi, Alessandro, Camoriano, Raffaello, and Rosasco,
2021. Lorenzo. Less is more: Nyström computational regu-
larization. In Advances in neural information processing
Novak, Erich, Ullrich, Mario, Woźniakowski, Henryk, and systems, pp. 1657–1665, 2015.
Zhang, Shun. Reproducing kernels of Sobolev spaces
on Rd and applications to embedding constants and Rudin, Cynthia. Prediction: Machine learning and statis-
tractability. Analysis and Applications, 16(05):693–715, tics (MIT 15.097), lecture on kernels. Technical report,
2018. Massachusetts Institute of Technology, 2012.
Rupp, Matthias. Machine learning for quantum mechanics Sejdinovic, Dino, Sriperumbudur, Bharath, Gretton,
in a nutshell. International Journal of Quantum Chem- Arthur, and Fukumizu, Kenji. Equivalence of distance-
istry, 115(16):1058–1073, 2015. based and RKHS-based statistics in hypothesis testing.
The Annals of Statistics, pp. 2263–2291, 2013.
Saul, Lawrence K, Weinberger, Kilian Q, Sha, Fei, Ham,
Jihun, and Lee, Daniel D. Spectral methods for dimen- Shawe-Taylor, John and Cristianini, Nello. Kernel methods
sionality reduction. Semi-supervised learning, 3, 2006. for pattern analysis. Cambridge university press, 2004.
Silva, Jorge, Marques, Jorge, and Lemos, João. Select-
Saxe, Karen. Beginning functional analysis. Springer,
ing landmark points for sparse manifold learning. In
2002.
Advances in neural information processing systems, pp.
Schlichtharle, Dietrich. Digital Filters: Basics and Design. 1241–1248, 2006.
Springer, 2 edition, 2011. Simon-Gabriel, Carl-Johann and Schölkopf, Bernhard.
Kernel distribution embeddings: Universal kernels, char-
Schmidt, Erhard. Über die auflösung linearer gleichungen
acteristic kernels and kernel metrics on distributions. The
mit unendlich vielen unbekannten. Rendiconti del Cir-
Journal of Machine Learning Research, 19(1):1708–
colo Matematico di Palermo (1884-1940), 25(1):53–77,
1736, 2018.
1908.
Simon-Gabriel, Carl-Johann, Barp, Alessandro, and
Schölkopf, Bernhard. The kernel trick for distances. Ad- Mackey, Lester. Metrizing weak convergence with
vances in neural information processing systems, pp. maximum mean discrepancies. arXiv preprint
301–307, 2001. arXiv:2006.09268, 2020.
Schölkopf, Bernhard and Smola, Alexander J. Learning Smola, Alex, Gretton, Arthur, Song, Le, and Schölkopf,
with kernels: support vector machines, regularization, Bernhard. A Hilbert space embedding for distributions.
optimization, and beyond. MIT press, 2002. In International Conference on Algorithmic Learning
Theory, pp. 13–31. Springer, 2007.
Schölkopf, Bernhard, Smola, Alexander, and Müller,
Klaus-Robert. Kernel principal component analysis. In Smola, Alex J and Schölkopf, Bernhard. Learning with
International conference on artificial neural networks, kernels, volume 4. Citeseer, 1998.
pp. 583–588. Springer, 1997a.
Song, Le. Learning via Hilbert space embedding of distri-
Schölkopf, Bernhard, Sung, Kah-Kay, Burges, Christo- butions. PhD thesis, The University of Sydney, 2008.
pher JC, Girosi, Federico, Niyogi, Partha, Poggio, Sriperumbudur, Bharath K, Gretton, Arthur, Fukumizu,
Tomaso, and Vapnik, Vladimir. Comparing support vec- Kenji, Schölkopf, Bernhard, and Lanckriet, Gert RG.
tor machines with Gaussian kernels to radial basis func- Hilbert space embeddings and metrics on probability
tion classifiers. IEEE transactions on Signal Processing, measures. The Journal of Machine Learning Research,
45(11):2758–2765, 1997b. 11:1517–1561, 2010.
Schölkopf, Bernhard, Smola, Alexander, and Müller, Sriperumbudur, Bharath K, Fukumizu, Kenji, and Lanck-
Klaus-Robert. Nonlinear component analysis as a kernel riet, Gert RG. Universality, characteristic kernels and
eigenvalue problem. Neural computation, 10(5):1299– RKHS embedding of measures. Journal of Machine
1319, 1998. Learning Research, 12(7), 2011.
Schölkopf, Bernhard, Burges, Christopher JC, and Smola, Steinwart, Ingo. On the influence of the kernel on the con-
Alexander J. Advances in kernel methods: support vec- sistency of support vector machines. Journal of machine
tor learning. MIT press, 1999a. learning research, 2(Nov):67–93, 2001.
Steinwart, Ingo. Support vector machines are universally
Schölkopf, Bernhard, Mika, Sebastian, Burges, Chris JC,
consistent. Journal of Complexity, 18(3):768–791, 2002.
Knirsch, Philipp, Muller, K-R, Ratsch, Gunnar, and
Smola, Alexander J. Input space versus feature space Steinwart, Ingo and Christmann, Andreas. Support vector
in kernel-based methods. IEEE transactions on neural machines. Springer Science & Business Media, 2008.
networks, 10(5):1000–1017, 1999b.
Strang, Gilbert. The fundamental theorem of linear algebra.
Scott, David W. Multivariate density estimation: theory, The American Mathematical Monthly, 100(9):848–855,
practice, and visualization. John Wiley & Sons, 1992. 1993.
Strange, Harry and Zwiggelaar, Reyer. Open Problems in Williams, Christopher KI and Barber, David. Bayesian
Spectral Dimensionality Reduction. Springer, 2014. classification with Gaussian processes. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 20
Tenenbaum, Joshua B, De Silva, Vin, and Langford, (12):1342–1351, 1998.
John C. A global geometric framework for nonlinear di-
mensionality reduction. Science, 290(5500):2319–2323, Yan, Shuicheng, Xu, Dong, Zhang, Benyu, and Zhang,
2000. Hong-Jiang. Graph embedding: A general framework
for dimensionality reduction. In 2005 IEEE Computer
Vandenberghe, Lieven and Boyd, Stephen. Semidefinite Society Conference on Computer Vision and Pattern
programming. SIAM review, 38(1):49–95, 1996. Recognition (CVPR’05), volume 2, pp. 830–837. IEEE,
Vapnik, Vladimir. The nature of statistical learning theory. 2005.
Springer science & business media, 1995. Zhang, Jianguo, Marszałek, Marcin, Lazebnik, Svetlana,
Vapnik, Vladimir and Chervonenkis, Alexey. Theory of and Schmid, Cordelia. Local features and kernels for
pattern recognition. Nauka, Moscow, 1974. classification of texture and object categories: A com-
prehensive study. International journal of computer vi-
Wahba, Grace. Spline models for observational data. sion, 73(2):213–238, 2007.
SIAM, 1990.
Zhang, Kai and Kwok, James T. Clustered Nyström
Wand, Matt P and Jones, M Chris. Kernel smoothing. CRC method for large scale manifold learning and dimension
press, 1994. reduction. IEEE Transactions on Neural Networks, 21
(10):1576–1587, 2010.
Wang, Lijun, Rege, Manjeet, Dong, Ming, and Ding, Yong-
sheng. Low-rank kernel matrix factorization for large- Zhang, Kai, Tsang, Ivor W, and Kwok, James T. Improved
scale evolutionary clustering. IEEE Transactions on Nyström low-rank approximation and error analysis. In
Knowledge and Data Engineering, 24(6):1036–1050, Proceedings of the 25th international conference on ma-
2010a. chine learning, pp. 1232–1239, 2008.
Wang, Meihong, Sha, Fei, and Jordan, Michael. Unsuper-
vised kernel dimension reduction. Advances in neural
information processing systems, 23:2379–2387, 2010b.
Weinberger, Kilian Q and Saul, Lawrence K. An introduc-
tion to nonlinear dimensionality reduction by maximum
variance unfolding. In Proceedings of the AAAI Confer-
ence on Artificial Intelligence, volume 6, pp. 1683–1686,
2006a.
Weinberger, Kilian Q and Saul, Lawrence K. Unsupervised
learning of image manifolds by semidefinite program-
ming. International journal of computer vision, 70(1):
77–90, 2006b.
Weinberger, Kilian Q, Packer, Benjamin, and Saul,
Lawrence K. Nonlinear dimensionality reduction by
semidefinite programming and kernel matrix factoriza-
tion. In AISTATS, 2005.
Williams, Christopher and Seeger, Matthias. The effect of
the input density distribution on kernel-based classifiers.
In Proceedings of the 17th international conference on
machine learning, 2000.
Williams, Christopher and Seeger, Matthias. Using the
Nyström method to speed up kernel machines. In Pro-
ceedings of the 14th annual conference on neural infor-
mation processing systems, number CONF, pp. 682–688,
2001.