Reproducing Kernel Hilbert Space, Mercer's Theorem, Eigenfunctions, Nyström Method, and Use of Kernels in Machine Learning: Tutorial and Survey
Thomas-Agnan, 2011). The first work on RKHS was (Aronszajn, 1950). Later, the concepts of RKHS were improved further in (Aizerman et al., 1964). The RKHS remained in pure mathematics until this space was used for the first time in machine learning through the introduction of kernel Support Vector Machine (SVM) (Boser et al., 1992; Vapnik, 1995). Eigenfunctions were also developed for the eigenvalue problem applied on operators and functions (Williams & Seeger, 2000) and were used in machine learning (Bengio et al., 2003c) and physics (Kusse & Westwig, 2006). This is related to RKHS because it uses a weighted inner product in Hilbert space (Williams & Seeger, 2000) and RKHS is a Hilbert space of functions with a reproducing kernel.

Using kernels was widely noticed when linear SVM (Vapnik & Chervonenkis, 1974) was kernelized (Boser et al., 1992; Vapnik, 1995). Kernel SVM was very successful because of its merits. Kernel SVM and neural networks were the two competing models which could handle nonlinear data (see (Fausett, 1994) for the history of neural networks). Kernel SVM transformed nonlinear data to RKHS, hoping to make the pattern of the data linear, and then applied linear SVM on it. The approach of neural networks was different, however, because the model itself was nonlinear (see Section 8 for more details). The success of kernel SVM plus the problem of vanishing gradients in neural networks (Goodfellow et al., 2016) resulted in the winter of neural networks around the years 2000 to 2006. However, the problems of training deep neural networks started to be resolved (Hinton & Salakhutdinov, 2006), and their success plus two problems of kernel SVM helped neural networks gradually take over kernel SVM. One problem of kernel SVM was not knowing the suitable kernel type for various learning problems. In other words, kernel SVM still required the user to choose the type of kernel, whereas neural networks were end-to-end and almost robust to hyperparameters such as the number of layers or neurons. Another problem was that kernel SVM could not handle big data, although the Nyström method, first proposed in (Nyström, 1930), was used to resolve this problem of kernel methods by approximating kernels from a subset of data (Williams & Seeger, 2001). Note that kernels have been used widely in machine learning, such as in SVM (Vapnik, 1995), Gaussian process classifiers (Williams & Barber, 1998), and spline methods (Wahba, 1990). The types of use of kernels in machine learning will be discussed in Section 9.

1.2. Useful Books on Kernels
There exist several books about the use of kernels in machine learning. Some examples are (Smola & Schölkopf, 1998; Schölkopf et al., 1999a; Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004; Camps-Valls, 2006; Steinwart & Christmann, 2008; Rojo-Álvarez et al., 2018; Kung, 2014). Some survey papers about kernel-based machine learning are (Hofmann et al., 2006; 2008; Müller et al., 2018). In addition to some of the above-mentioned books, there exist some other books/papers on kernel SVM such as (Schölkopf et al., 1997b; Burges, 1998; Hastie et al., 2009).

1.3. Kernel in Different Fields of Science
The term kernel has been used in different fields of science for various purposes. In the following, we briefly introduce the different uses of kernel in science to clarify which use of kernel we are focusing on in this paper.

1. Kernel of filter in signal processing: In signal processing, one can use filters to filter a piece of signal, such as an image (Gonzalez & Woods, 2002). Digital filters have a kernel which determines the values of the filter in the window of the filter (Schlichtharle, 2011). In convolutional neural networks, the filter kernels are learned in deep learning (Goodfellow et al., 2016).

2. Kernel smoothing for density estimation: Kernel density estimation can be used for fitting a mixture of distributions to some data instances (Scott, 1992). For this, a histogram with an infinite number of bins is utilized. In the limit, this histogram converges to a kernel smoothing (Wand & Jones, 1994) where the kernel determines the type of distribution. For example, if a Radial Basis Function (RBF) kernel is used, a mixture of Gaussian distributions is fitted to the data.

3. Kernelization in complexity theory: Kernelization is a pre-processing technique where the input to an algorithm is replaced by a part of the input named the kernel. The output of the algorithm on the kernel should either be the same as, or be transformable to, the output of the algorithm for the whole input (Fomin et al., 2019). An example usage of kernelization is in the vertex cover problem (Abu-Khzam et al., 2004).

4. Kernel in operating systems: The kernel is the core of an operating system, such as Linux, which connects the hardware, including CPU, memory, and peripheral devices, to applications (Anderson & Dahlin, 2014).

5. Kernel in linear algebra and graphs: Consider a mapping from the vector space V to the vector space W as L : V → W. The kernel, also called the nullspace, of this mapping is defined as ker(L) := {v ∈ V | L(v) = 0}. For example, for a matrix A ∈ R^{a×b}, the kernel of A is ker(A) = {x ∈ R^b | Ax = 0}. The four fundamental subspaces of a matrix are its kernel, row space, column space, and left null space (Strang, 1993). Note that the kernel (nullspace) of the adjacency matrix in graphs has also been well developed (Akbari et al., 2006).

6. Kernel in other domains of mathematics: There exist kernel concepts in other domains of mathematics and statistics, such as the geometry of polygons (Icking & Klein, 1995), set theory (Bergman, 2011, p. 14), etc.
7. Kernel in feature space for machine learning: In statistical machine learning, kernels pull data to a feature space for the sake of better discrimination of classes or simpler representation of data (Hofmann et al., 2006; 2008). In this paper, our focus is on this category, which is kernels for machine learning.

1.4. Organization of Paper
This paper is a tutorial and survey paper on kernels and kernel methods. It can be useful for several fields of science including machine learning, functional analysis in mathematics, and mathematical physics in quantum mechanics. The remainder of this paper is organized as follows. Section 2 introduces the Mercer kernel, important spaces in functional analysis including the Hilbert and Banach spaces, and Reproducing Kernel Hilbert Space (RKHS). Mercer's theorem and its proof are provided in Section 3. Characteristics of kernels are explained in Section 4. We introduce frequently used kernels, kernel construction from a distance metric, and important classes of kernels in Section 5. Kernel centering and normalization are explained in Section 6. Eigenfunctions are then introduced in Section 7. We explain two techniques for kernelization in Section 8. Types of use of kernels in machine learning are reviewed in Section 9. Kernel factorization and Nyström approximation are introduced in Section 10. Finally, Section 11 concludes the paper.

Required Background for the Reader
This paper assumes that the reader has general knowledge of calculus, linear algebra, and basics of optimization. The required basics of functional analysis are explained in the paper.

2. Mercer Kernel and Spaces in Functional Analysis
2.1. Mercer Kernel and Gram Matrix
Definition 1 (Mercer Kernel (Mercer, 1909)). The function k : X² → R is a Mercer kernel function (also known as a kernel function) where:
1. it is symmetric: k(x, y) = k(y, x),
2. and its corresponding kernel matrix K(i, j) = k(x_i, x_j), ∀i, j ∈ {1, . . . , n} is positive semi-definite: K ⪰ 0.
The corresponding kernel matrix of a Mercer kernel is a Mercer kernel matrix.

The two properties of a Mercer kernel will be proved in Section 4. By convention, unless otherwise stated, the term kernel refers to a Mercer kernel. The effectiveness of the Mercer kernel will be shown and proven in Mercer's theorem, i.e., Theorem 2.

Definition 2 (Gram Matrix or Kernel Matrix). The matrix K ∈ R^{n×n} is a Gram matrix, also known as a Gramian matrix or a kernel matrix, whose (i, j)-th element is:

K(i, j) := k(x_i, x_j), ∀i, j ∈ {1, . . . , n}.   (1)

Here, we defined the square kernel matrix applied on a set of n data instances; hence, the kernel is an n × n matrix. We may also have a kernel matrix between two sets of data instances. This will be explained more in Section 8. Moreover, note that the kernel matrix can be computed using the inner product between pulled data in the feature space. This will be explained in detail in Section 3.2.

2.2. Hilbert, Banach, Lp, and Sobolev Spaces
Before defining the RKHS and details of kernels, we need to introduce the Hilbert, Banach, Lp, and Sobolev spaces, which are well-known spaces in functional analysis (Conway, 2007).

Definition 3 (Metric Space). A metric space is a set on which a metric, for measuring the distance between instances of the set, is defined.

Definition 4 (Vector Space). A vector space is a set of vectors equipped with some available operations such as addition and multiplication by scalars.

Definition 5 (Complete Space). A space F is complete if every Cauchy sequence converges to a member of this space f ∈ F. Note that a Cauchy sequence is a sequence whose elements become arbitrarily close to one another as the sequence progresses (i.e., it converges in the limit).

Definition 6 (Compact Space). A space is compact if it is closed (i.e., it contains all its limit points) and bounded (i.e., all its points lie within some fixed distance of one another).

Definition 7 (Hilbert Space (Reed & Simon, 1972)). A Hilbert space H is an inner product space that is a complete metric space with respect to the norm or distance function induced by the inner product.

The Hilbert space generalizes the Euclidean space to a finite or infinite dimensional space. Usually, the Hilbert space is high dimensional. By convention in machine learning, unless otherwise stated, the Hilbert space is also referred to as the feature space. By feature space, researchers often specifically mean the RKHS, which will be introduced in Section 2.3.

Definition 8 (Banach Space (Beauzamy, 1982)). A Banach space is a complete vector space equipped with a norm.

Remark 1 (Difference of Hilbert and Banach Spaces). A Hilbert space is a special case of a Banach space whose norm is defined using an inner product. All Hilbert spaces are Banach spaces but the converse is not true.
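To make Definitions 1 and 2 concrete, the following is a minimal NumPy sketch (not from the original text) that builds the Gram matrix of Eq. (1) for the RBF kernel and checks the two Mercer conditions numerically; the function name rbf_kernel_matrix and the bandwidth gamma are our own choices for illustration.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # Gram matrix of Eq. (1) for the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2);
    # X has shape (n, d) with one data instance per row.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

X = np.random.randn(5, 3)             # n = 5 points in d = 3 dimensions
K = rbf_kernel_matrix(X, gamma=0.5)

# The two Mercer conditions of Definition 1: symmetry and positive semi-definiteness.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)   # non-negative up to numerical error
```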
Suppose R^n, H, B, M_c, M, T denote the Euclidean space, Hilbert space, Banach space, complete metric space, metric space, and topological space (containing both open and closed sets), respectively. Then, we have:

R^n ⊂ H ⊂ B ⊂ M_c ⊂ M ⊂ T.   (2)

Definition 9 (Lp Space). Consider a function f with domain [a, b]. For p > 0, let the Lp norm be defined as:

‖f‖_p := ( ∫_a^b |f(x)|^p dx )^{1/p}.   (3)

The Lp space is defined as the set of functions with bounded Lp norm:

L_p(a, b) := {f : [a, b] → R | ‖f‖_p < ∞}.   (4)

Definition 10 (Sobolev Space (Renardy & Rogers, 2006, Chapter 7)). A Sobolev space is a vector space of functions equipped with Lp norms and derivatives:

W_{m,p} := {f ∈ L_p(0, 1) | D^m f ∈ L_p(0, 1)},   (5)

where D^m f denotes the m-th order derivative.

Note that the Sobolev spaces are RKHS with some specific kernels (Novak et al., 2018). RKHS will be explained in Section 2.3.

2.3. Reproducing Kernel Hilbert Space
2.3.1. Definition of RKHS
Reproducing Kernel Hilbert Space (RKHS), first proposed in (Aronszajn, 1950), is a special case of Hilbert space with some properties. It is a Hilbert space of functions with reproducing kernels (Berlinet & Thomas-Agnan, 2011). After the initial work on RKHS (Aronszajn, 1950), another work (Aizerman et al., 1964) developed the RKHS concepts. In the following, we introduce this space.

Definition 11 (RKHS (Aronszajn, 1950; Berlinet & Thomas-Agnan, 2011)). A Reproducing Kernel Hilbert Space (RKHS) is a Hilbert space H of functions f : X → R with a reproducing kernel k : X² → R where k(x, ·) ∈ H and f(x) = ⟨k(x, ·), f⟩.

The RKHS is explained in more detail in the following. Consider the kernel function k(x, y), which is a function of two variables. Suppose, for n points, we fix one of the variables to have k(x_1, y), k(x_2, y), . . . , k(x_n, y). These are all functions of the variable y. RKHS is a function space which is the set of all possible linear combinations of these functions (Kimeldorf & Wahba, 1971), (Aizerman et al., 1964, p. 834), (Mercer, 1909):

H := { f(·) = Σ_{i=1}^n α_i k(x_i, ·) } =(a) { f(·) = Σ_{i=1}^n α_i k_{x_i}(·) },   (6)

where (a) is because we define k_x(·) := k(x, ·). This equation shows that the bases of an RKHS are kernels. The proof of this equation is obtained by considering both Eqs. (21) and (34) together (n.b. for better organization, it is better that we provide those equations later). It is also noteworthy that this equation will appear again in Theorem 1.

According to Eq. (6), every function in the RKHS can be written as a linear combination. Consider two functions in this space represented as f = Σ_{i=1}^n α_i k(x_i, ·) and g = Σ_{j=1}^n β_j k(y_j, ·). Hence, the inner product in RKHS is calculated as:

⟨f, g⟩_k =(6) ⟨ Σ_{i=1}^n α_i k(x_i, ·), Σ_{j=1}^n β_j k(y_j, ·) ⟩_k =(a) ⟨ Σ_{i=1}^n α_i k(x_i, ·), Σ_{j=1}^n β_j k(·, y_j) ⟩_k = Σ_{i=1}^n Σ_{j=1}^n α_i β_j k(x_i, y_j),   (7)

where (a) is because the kernel is symmetric (it will be proved in Section 4). Hence, the norm in RKHS is calculated as:

‖f‖_k := √⟨f, f⟩_k.   (8)

The subscript of the norm and inner product in RKHS has various notations in the research papers. Some of the most common notations are ⟨f, g⟩_k, ⟨f, g⟩_H, ⟨f, g⟩_{H_k}, and ⟨f, g⟩_F, where H_k denotes the Hilbert space associated with kernel k and F stands for the feature space, because RKHS is sometimes referred to as the feature space.

Remark 2 (RKHS Being Unique for a Kernel). Given a kernel, the corresponding RKHS is unique (up to isometric isomorphisms). Given an RKHS, the corresponding kernel is unique. In other words, each kernel generates a new RKHS.

Remark 3. As we will also see in Mercer's theorem, the bases of the RKHS are the eigenfunctions {ψ_i(·)}_{i=1}^∞, which are functions themselves. This, along with Eq. (6), shows that the RKHS is a space of functions and not a space of vectors. In other words, the basis vectors of RKHS are basis functions named eigenfunctions. Because the RKHS is a space of functions rather than a space of vectors, we usually do not know the exact location of pulled points in the RKHS, but we know their relation as a function. This will be explained more in Section 3.2 and Fig. 1.
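As a small illustration of Eqs. (6)-(8), the following sketch (our own, not from the text) represents two RKHS functions by their coefficient vectors and evaluates the inner product of Eq. (7) and the norm of Eq. (8); the RBF kernel and all variable names are assumptions made only for this example.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # Kernel value k(x, y) for two single points.
    return np.exp(-gamma * np.sum((x - y) ** 2))

# f(.) = sum_i alpha_i k(x_i, .) and g(.) = sum_j beta_j k(y_j, .), as in Eq. (6).
X = np.random.randn(4, 2); alpha = np.random.randn(4)
Y = np.random.randn(3, 2); beta = np.random.randn(3)

# Eq. (7): <f, g>_k = sum_i sum_j alpha_i beta_j k(x_i, y_j).
K_xy = np.array([[rbf(x, y) for y in Y] for x in X])
inner_fg = alpha @ K_xy @ beta

# Eq. (8): ||f||_k = sqrt(<f, f>_k), using the kernel among the x_i themselves.
K_xx = np.array([[rbf(a, b) for b in X] for a in X])
norm_f = np.sqrt(alpha @ K_xx @ alpha)
print(inner_fg, norm_f)
```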
2.3.2. Reproducing Property
In Eq. (7), consider only one component for g, i.e., g(·) = β k(x, ·) with β = 1, so that g(·) = k(x, ·) = k_x(·). In other words, assume the function g is a kernel in the RKHS. Also consider the function f(·) = Σ_{i=1}^n α_i k(x_i, ·) in the space. According to Eq. (7), the inner product of these functions is:

⟨f, k(x, ·)⟩_k = Σ_{i=1}^n α_i k(x_i, x) = f(x),

which is the reproducing property stated in Definition 11. This property and the following points explain the name of RKHS:

1. "Reproducing": because of the reproducing property of RKHS, which was proved above.

2. "Kernel": because of the kernels associated with RKHS, as stated in Definition 11 and Eq. (6).

3. "Hilbert Space": because RKHS is a Hilbert space of functions with a reproducing kernel, as stated in Definition 11.

2.3.3. Representation in RKHS
In the following, we provide a proof for Eq. (6) and explain why that equation defines the RKHS.

Theorem 1 (Representer Theorem (Kimeldorf & Wahba, 1971), simplified in (Rudin, 2012)). For a set of data X = {x_i}_{i=1}^n, consider an RKHS H of functions f : X → R with kernel function k. For any function ℓ : R² → R (usually called the loss function), consider the optimization problem:

f* ∈ arg min_{f∈H} Σ_{i=1}^n ℓ(f(x_i), y_i) + η Ω(‖f‖_k),   (10)

where η ≥ 0 is the regularization parameter and Ω(‖f‖_k) is a penalty term such as ‖f‖²_k. The solution of this optimization can be expressed as:

f* = Σ_{i=1}^n α_i k(x_i, ·) = Σ_{i=1}^n α_i k_{x_i}(·).   (11)

P.S.: Eq. (11) can also be seen in (Aizerman et al., 1964, p. 834).

Proof. The proof is inspired by (Rudin, 2012). Assume we project the function f onto the subspace spanned by {k(x_i, ·)}_{i=1}^n. The function f can be decomposed into components along and orthogonal to this subspace, respectively denoted by f_∥ and f_⊥.

Using Eqs. (13) and (15), we can say:

min_{f∈H} Σ_{i=1}^n ℓ(f(x_i), y_i) + η ‖f‖²_k = min_{f∈H} Σ_{i=1}^n ℓ(f_∥(x_i), y_i) + η ‖f_∥‖²_k.

Hence, for this minimization, we only require the component lying in the space spanned by the kernels of RKHS. Therefore, we can represent the function (the solution of the optimization) as a linear combination of the basis vectors {k(x_i, ·)}_{i=1}^n. Q.E.D.

Corollary 1. In Section 2.2, we mentioned that a Hilbert space can be infinite dimensional. According to Definition 11, RKHS is a Hilbert space, so it may be infinite dimensional. The representer theorem states that, in practice, we only need to deal with a finite-dimensional space, although that finite number of dimensions is usually a large number.

Note that the representer theorem has been used in kernel SVM, where the α_i's are the dual variables which are non-zero for support vectors (Boser et al., 1992; Vapnik, 1995). According to this theorem, kernel SVM only requires learning the dual variables, the α_i's, to find the optimal boundary between classes.
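One concrete instance of the representer theorem, not worked out in the text, is kernel ridge regression: with the squared loss and Ω(‖f‖_k) = ‖f‖²_k in Eq. (10), the solution has the form of Eq. (11) and the coefficients admit a closed form. The sketch below assumes this setting; the RBF kernel and the helper names are illustrative only.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Kernel matrix between the rows of A and the rows of B.
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d, 0.0))

X = np.random.randn(50, 2)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(50)
eta = 0.1

# Representer theorem: f*(.) = sum_i alpha_i k(x_i, .); for squared loss the
# dual coefficients solve (K + eta I) alpha = y.
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + eta * np.eye(len(X)), y)

X_test = np.random.randn(5, 2)
f_test = rbf_kernel(X_test, X) @ alpha   # evaluate f* at new points
```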
3. Mercer's Theorem and Feature Map
3.1. Mercer's Theorem
Definition 12 (Definite Kernel (Hilbert, 1904)). A kernel k : [a, b] × [a, b] → R is a definite kernel where the following double integral:

J(f) = ∫_a^b ∫_a^b k(x, y) f(x) f(y) dx dy,   (16)

satisfies J(f) > 0 for all f(x) ≠ 0.

Mercer improved over Hilbert's work (Hilbert, 1904) to propose his theorem, Mercer's theorem (Mercer, 1909), introduced in the following.

Theorem 2 (Mercer's Theorem (Mercer, 1909)). Suppose k : [a, b] × [a, b] → R is a continuous symmetric positive semi-definite kernel which is bounded:

sup_{x,y} k(x, y) < ∞.   (17)

Assume the operator T_k takes a function f(x) as its argument and outputs a new function as:

T_k f(x) := ∫_a^b k(x, y) f(y) dy,   (18)

which is a Fredholm integral equation (Schmidt, 1908). The operator T_k is called the Hilbert-Schmidt integral operator (Renardy & Rogers, 2006, Chapter 8). This operator is positive semi-definite:

∫∫ k(x, y) f(x) f(y) dx dy ≥ 0.   (19)

Then, there is a set of orthonormal bases {ψ_i(·)}_{i=1}^∞ of L²(a, b) consisting of eigenfunctions of T_k such that the corresponding sequence of eigenvalues {λ_i}_{i=1}^∞ is non-negative:

∫ k(x, y) ψ_i(y) dy = λ_i ψ_i(x).   (20)

The eigenfunctions corresponding to the non-zero eigenvalues are continuous on [a, b], and k can be represented as (Aizerman et al., 1964):

k(x, y) = Σ_{i=1}^∞ λ_i ψ_i(x) ψ_i(y),   (21)

where the convergence is absolute and uniform.

Proof. A roughly high-level proof for Mercer's theorem is as follows.

Step 1 of proof: According to the assumptions of the theorem, the Hilbert-Schmidt integral operator T_k is a symmetric operator on the L²(a, b) space. Consider a unit ball in L²(a, b) as input to the operator. As the kernel is bounded, sup_{x,y} k(x, y) < ∞, the sequence f_1, f_2, . . . converges in norm, i.e., ‖f_n − f‖ → 0 as n → ∞. Therefore, according to the Arzelà-Ascoli theorem (Arzelà, 1895), the image of the unit ball after applying the operator is compact. In other words, the operator T_k is compact.

Step 2 of proof: According to the spectral theorem (Hawkins, 1975), there exist orthonormal bases {ψ_i(·)}_{i=1}^∞ in L²(a, b) for the compact operator T_k. This provides a spectral (or eigenvalue) decomposition for the operator T_k (Ghojogh et al., 2019a):

T_k ψ_i(x) = λ_i ψ_i(x),   (22)

where {ψ_i(·)}_{i=1}^∞ and {λ_i}_{i=1}^∞ are the eigenvectors and eigenvalues of the operator T_k, respectively. Noticing the defined Eq. (18) and the eigenvalue decomposition, Eq. (22), we have:

∫ k(x, y) ψ_i(y) dy =(18) T_k ψ_i(x) =(22) λ_i ψ_i(x).   (23)

This proves Eq. (20), which is the eigenfunction decomposition of the operator T_k. Note that the eigenvectors {ψ_i(·)}_{i=1}^∞ are referred to as eigenfunctions because the decomposition is applied on a function or operator rather than a matrix. Eigenfunctions will be explained more in Section 7.

Step 3 of proof: According to Parseval's theorem (Parseval des Chenes, 1806), Bessel's inequality can be converted to an equality (Saxe, 2002). For the orthonormal bases {ψ_i(·)}_{i=1}^∞ in the Hilbert space H associated with kernel k, we have for any function f ∈ L²(a, b):

f = Σ_{i=1}^∞ ⟨f, ψ_i⟩_k ψ_i.   (24)

If we replace ψ_i with f in Eq. (22) and consider Eq. (24), we will have:

T_k f = Σ_{i=1}^∞ λ_i ⟨f, ψ_i⟩_k ψ_i.   (25)

One can consider Eq. (18) as T_k f = kf. Noticing this and Eq. (25) results in:

kf = Σ_{i=1}^∞ λ_i ⟨f, ψ_i⟩_k ψ_i.   (26)

Removing f from the two sides of Eq. (26) gives:

k(x, y) = Σ_{i=1}^∞ λ_i ψ_i(x) ψ_i(y),   (27)

which is Eq. (21); hence, that is proved.
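Before continuing the proof, a finite-sample analogue of Eq. (21) can be checked numerically: for a Mercer kernel matrix, the eigendecomposition K = Σ_k δ_k v_k v_k^⊤ plays the role of the expansion over eigenfunctions. The following sketch is our own illustration under that analogy, assuming an RBF kernel and NumPy.

```python
import numpy as np

n = 30
X = np.random.randn(n, 2)
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * np.maximum(sq, 0.0))

deltas, V = np.linalg.eigh(K)              # eigenvalues ascending, columns are v_k
assert np.all(deltas >= -1e-10)            # Mercer: non-negative spectrum
K_reconstructed = (V * deltas) @ V.T       # sum_k delta_k v_k v_k^T
assert np.allclose(K, K_reconstructed)

# Keeping only the top-r terms gives a low-rank approximation, in the spirit
# of the truncated kernel r_n of Eq. (28) below.
r = 5
K_r = (V[:, -r:] * deltas[-r:]) @ V[:, -r:].T
```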
Step 4 of proof: We define the truncated kernel r_n (with parameter n) as:

r_n(x, y) := k(x, y) − Σ_{i=1}^n λ_i ψ_i(x) ψ_i(y) = Σ_{i=n+1}^∞ λ_i ψ_i(x) ψ_i(y).   (28)

As T_k is an integral operator, this truncated kernel is positive, i.e., for every x ∈ [a, b], we have:

r_n(x, x) = k(x, x) − Σ_{i=1}^n λ_i ψ_i(x) ψ_i(x) ≥ 0
⟹ Σ_{i=1}^n λ_i ψ_i(x) ψ_i(x) ≤ k(x, x) ≤ sup_{x∈[a,b]} k(x, x).   (29)

By the Cauchy-Schwarz inequality, we have:

( Σ_{i=1}^n λ_i ψ_i(x) ψ_i(y) )² ≤ ( Σ_{i=1}^n λ_i ψ_i(x) ψ_i(x) ) ( Σ_{i=1}^n λ_i ψ_i(y) ψ_i(y) ) ≤(29) ( sup_{x∈[a,b]} k(x, x) )².

Taking the square root of the sides of this inequality gives:

3.2. Feature Map
In Mercer's theorem, there is a mapping φ to transform data from the input space to the feature space, i.e., the Hilbert space. In other words, this mapping pulls data to the feature space:

x ↦ φ(x).   (32)

The function φ(x) is called the feature map or pulling function. The feature map is a (possibly infinite-dimensional) vector whose elements are (Minh et al., 2006):

φ(x) = [φ_1(x), φ_2(x), . . . ]^⊤ := [√λ_1 ψ_1(x), √λ_2 ψ_2(x), . . . ]^⊤,   (33)

where {ψ_i} and {λ_i} are the eigenfunctions and eigenvalues of the kernel operator (see Eq. (20)). Note that eigenfunctions will be explained more in Section 7.

Let t denote the dimensionality of φ(x). The feature map may be infinite or finite dimensional, i.e., t can be infinity; it is usually a very large number (recall Definition 7, where we said the Hilbert space may have an infinite number of dimensions).

Considering both Eqs. (21) and (33) shows that:

k(x, y) = ⟨φ(x), φ(y)⟩_k = φ(x)^⊤ φ(y).   (34)

Hence, the kernel between two points is the inner product of the pulled data points in the feature space. Suppose we stack the feature maps of all points X ∈ R^{d×n} column-wise in:

Φ(X) := [φ(x_1), φ(x_2), . . . , φ(x_n)],   (35)

which is t × n dimensional, where t may be infinity or a large number. The kernel matrix defined in Definition 2 can be calculated as:

K = Φ(X)^⊤ Φ(X).   (36)

Pulling data to the feature space is performed using kernels, which give the inner product of points in RKHS according to Eq. (34). Hence, the relative similarity (inner product) of the pulled data points is known through the kernel. However, for most kernels, we cannot find an explicit expression for the pulled data points. Therefore, the exact location of the pulled data points in RKHS is not necessarily known, but the relative similarity of the pulled points, which is the kernel, is known. An exceptional kernel is the linear kernel, in which we have φ(x) = x. Figure 1 illustrates what we mean by not knowing the explicit location of pulled points in RKHS.
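A case where the feature map of Eq. (33) can be written out explicitly, which the text does not cover at this point, is the homogeneous degree-2 polynomial kernel; the sketch below verifies Eq. (34) for it. The choice of kernel and dimension is only for illustration.

```python
import numpy as np

# For k(x, y) = (x^T y)^2 in d = 2 dimensions, the feature map is explicit:
# phi(x) = [x1^2, x2^2, sqrt(2) x1 x2], so k(x, y) = phi(x)^T phi(y) as in Eq. (34).
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, y = np.random.randn(2), np.random.randn(2)
assert np.isclose((x @ y) ** 2, phi(x) @ phi(y))

# For kernels such as RBF, phi is infinite dimensional and cannot be written out,
# but Eq. (36), K = Phi(X)^T Phi(X), still holds implicitly.
```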
Figure 1. Pulling data from the input space to the feature space (RKHS). The explicit locations of pulled points are not necessarily known, but the relative similarity (inner product) of pulled data points is known in the feature space.

4. Characteristics of Kernels
In this section, we review some of the characteristics of kernels, including the symmetry and positive semi-definiteness properties of the Mercer kernel (recall Definition 1).

Lemma 1 (Symmetry of Kernel). A square Mercer kernel matrix is symmetric, so we have:

⟨f, g⟩_k = ⟨g, f⟩_k, or   (37)
K ∈ S^n, i.e., k(x, y) = k(y, x).   (38)

Proof.

k(x, y) =(34) ⟨φ(x), φ(y)⟩_k = φ(x)^⊤ φ(y) = φ(y)^⊤ φ(x) = k(y, x),

0 ≤ f²(x) = 0 ⟹ f(x) = 0.

Lemma 3 (Positive Semi-definiteness of Kernel). The Mercer kernel matrix is positive semi-definite:

K ∈ S^n_+, i.e., K ⪰ 0.   (40)

Proof. Let v(i) denote the i-th element of the vector v.

v^⊤ K v = Σ_{i=1}^n Σ_{j=1}^n v(i) v(j) k(x(i), x(j))
= Σ_{i=1}^n Σ_{j=1}^n v(i) v(j) ⟨φ(x(i)), φ(x(j))⟩_k
= Σ_{i=1}^n Σ_{j=1}^n ⟨v(i) φ(x(i)), v(j) φ(x(j))⟩_k
= ⟨ Σ_{i=1}^n v(i) φ(x(i)), Σ_{j=1}^n v(j) φ(x(j)) ⟩_k
= ‖ Σ_{i=1}^n v(i) φ(x(i)) ‖²_k ≥ 0, ∀v ∈ R^n.

Comparing this with Eq. (34) shows that in the linear kernel we have φ(x) = x. Hence, for this kernel, the feature map is explicitly known. Note that φ(x) = x shows that data are not pulled to any other space by the linear kernel; rather, the inner products of points are calculated in the input space to obtain the feature space. Moreover, recall Remark 9, which states that, depending on the kernelization approach, using lin-
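The argument of Lemma 3 and the remark on the linear kernel can be seen numerically: for the linear kernel, K = X^⊤ X, so v^⊤ K v = ‖Xv‖² ≥ 0 for every v. The following is a minimal sketch of this check; the sizes are arbitrary.

```python
import numpy as np

d, n = 3, 8
X = np.random.randn(d, n)          # data stacked column-wise, as in Eq. (35)
K = X.T @ X                        # linear Gram matrix; phi(x) = x here

v = np.random.randn(n)
assert np.isclose(v @ K @ v, np.linalg.norm(X @ v) ** 2)
assert v @ K @ v >= -1e-12         # positive semi-definiteness, Lemma 3
```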
matrix. We double-center the matrix D as follows (Old- where (a) is because H and D are symmetric matrices.
ford, 2018): Moreover, the kernel is positive semi-definite because:
1 > 1 1
HDH = (I − 11 )D(I − 11> ) K = − HDH = Φ(X)> Φ(X)
n n 2
1 > 1 =⇒ v > Kv = v > Φ(X)> Φ(X)v
= (I − 11 )(1g − 2G + g1> )(I − 11> )
>
n n = kΦ(X)vk22 ≥ 0, ∀v ∈ Rn .
1 > > 1 >
= (I − 11 )1 g − 2(I − 11 )G
n{z n Hence, according to Definition 1, this kernel is a Mercer
kernel. Q.E.D.
| }
=0
1 > 1
11 )g1> (I − 11> )
+ (I − Remark 5 (Kernel Construction from Metric). One can
n n use any valid distance metric, satisfying the following prop-
1 > 1 >
= −2(I − 11 )G(I − 11 ) erties:
n n
1 > 1 1. non-negativity: D(x, y) ≥ 0,
+ (I − 11 )g 1 (I − 11> )
>
2. equal points: D(x, y) = 0 ⇐⇒ x = y,
n {zn
| } 3. symmetry: D(x, y) = D(y, x),
=0
4. triangular inequality: D(x, y) ≤ D(x, z)+D(z, y),
1 1
= −2(I − 11> )G(I − 11> ) = −2 HGH. to calculate elements of distance matrix D in Eq. (50). It
n n
is important that the used distance matrix should be a valid
distance matrix. Using various distance metrics in Eq. (50)
1 results in various useful kernels.
∴ HGH = HX > XH = − HDH. (49)
2 Some examples are the geodesic kernel and Structural Sim-
ilarity Index (SSIM) kernel, used in Isomap (Tenenbaum
Note that (I − n1 11> )1 = 0 and 1> (I − n1 11> ) = 0 et al., 2000) and image structure subspace learning (Gho-
because removing the row mean of 1 and column mean of jogh et al., 2019c), respectively. The geodesic kernel is de-
of 1> results in the zero vectors, respectively. fined as (Tenenbaum et al., 2000; Ghojogh et al., 2020b):
If data X are already centered, i.e., the mean has been re-
moved (X ← XH), Eq. (49) becomes: 1
K = − HD (g) H, (52)
2
1 where the approximation of geodesic distances using piece-
X > X = − HDH. (50)
2 wise Euclidean distances is used in calculating the geodesic
distance matrix D (g) . The SSIM kernel is defined as (Gho-
According to the kernel trick, Eq. (104), we can write a jogh et al., 2019c):
general kernel matrix rather than the linear Gram matrix in
Eq. (50), to have (Cox & Cox, 2008): 1
K = − HD (s) H, (53)
2
1
Rn×n 3 K = Φ(X)> Φ(X) = − HDH. (51) where the distance matrix D (s) is calculated using the
2
SSIM distance (Brunet et al., 2011).
This kernel is double-centered because of HDH. It is also
noteworthy that Eq. (51) can be used for unifying the spec- 5.3. Important Classes of Kernels
tral dimensionality reduction methods as special cases of In the following, we introduce some of the important
kernel principal component analysis with different kernels. classes of kernels which are widely used in statistics and
See (Ham et al., 2004; Bengio et al., 2004) and (Strange & machine learning. A good survey on the classes of kernels
Zwiggelaar, 2014, Table 2.1) for more details. is (Genton, 2001).
Lemma 4 (Distance-based Kernel is a Mercer Kernel). The 5.3.1. B OUNDED K ERNELS
kernel constructed from a valid distance metric, i.e. Eq. Definition 15 (Bounded Kernel). A kernel function k is
(51), is a Mercer kernel. bounded if:
5.3.2. I NTEGRALLY P OSITIVE D EFINITE K ERNELS (Ghojogh et al., 2020d), denoted by K s , whose Taylor se-
Definition 16 (Integrally Positive Definite RKernel). A ker- ries expansion is (Ghojogh et al., 2019c):
nel matrix K is integrally positive definite ( p.d.) on Ω×Ω
5 15 5 1
if: Ks ≈ − − r + r2 − r3 + . . . ,
16 16 16 16
Z Z
K(x, y)f (x)f (y) ≥ 0, ∀f ∈ L2 (Ω). (55) where r is the squared SSIM distance (Brunet et al., 2011)
Ω Ω between images. Note that polynomial kernels are not uni-
R versal. Universal kernels have been widely used for kernel
A kernel matrix K is integrally strictly positive definite (
SVM. More detailed discussion and proofs for use of uni-
s.p.d.) on Ω × Ω if:
versal kernels in kernel SVM can be found in (Steinwart &
Christmann, 2008).
Z Z
K(x, y)f (x)f (y) > 0, ∀f ∈ L2 (Ω). (56)
Ω Ω Lemma 6 ((Borgwardt et al., 2006), (Song, 2008, Theorem
10)). A kernel is universal if for arbitrary sets of distinct
5.3.3. U NIVERSAL K ERNELS points, it induces strictly positive definite kernel matrices
Definition 17 (Universal Kernel (Steinwart, 2001, Defini- (Borgwardt et al., 2006; Song, 2008). Conversely, if a ker-
tion 4), (Steinwart, 2002, Definition 2)). Let C(X ) denote nel matrix can be written as K = K 0 + I where K 0 0,
the space of all continuous functions on space X . A con- > 0, and I is the identity matrix, the kernel function cor-
tinuous kernel k on a compact metric space X is called responding to K is universal (Pan et al., 2008).
universal if the RKHS H, with kernel function k, is dense
in C(X ). In other words, for every function g ∈ C(X ) 5.3.4. S TATIONARY K ERNELS
and all > 0, there exists a function f ∈ H such that Definition 18 (Stationary Kernel (Genton, 2001; Noack &
kf − gk∞ ≤ . Sethian, 2021)). A kernel k is stationary if it is a positive
definite function of the form:
Remark 6. We can approximate any function, including
continuous functions and functions which can be approxi- k(x, y) = k(kx − yk), (58)
mated by continuous functions, using a universal kernel.
Lemma 5 ((Steinwart, 2001, Corollary 10)). Consider a where k.k is some norm defined on the input space.
function f : (−r, r) → R where 0 < r ≤ ∞ and f ∈
An example for stationary kernel is the RBF kernel defined
C ∞ (C ∞ denotes the differentiable space for all √
degrees
in Eq. (42) which has the form of Eq. (58). Stationary
of differentiation). Let X := {x ∈ Rd | kxk2 < r}. If
kernels are used for Gaussian processes (Noack & Sethian,
the function f can be expanded by Taylor expansion in 0
2021).
as:
∞ 5.3.5. C HARACTERISTIC K ERNELS
X
f (x) = aj xj , ∀x ∈ (−r, r), (57) The characteristic kernels, which are widely used for dis-
j=0 tribution embedding in the Hilbert space, will be defined
and explained in Section 9.4. Examples for characteristic
and aj > 0 for all j ≥ 0, then k(x, y) = f (hx, yi) is a kernels are RBF and Laplacian kernels. Polynomial ker-
universal kernel on every compact subset of X . nels. however, are not characteristic. Note that the relation
between universal kernels, characteristic kernels, and inte-
Proof. For proof, see (Steinwart, 2001, proof of Corollary
grally strictly positive definite kernels has been studied in
10). Note that the Stone-Weierstrass theorem (De Branges,
(Sriperumbudur et al., 2011).
1959) is used for the proof of this lemma.
An example for universal kernel is RBF kernel (Steinwart, 6. Kernel Centering and Normalization
2001, Example 1) because its Taylor series expansion is: 6.1. Kernel Centering
In some cases, there is a need to center the pulled data in the
γ2 2 γ3 3
exp(−γr) ≈ 1 − γr + r − r + ..., feature space. For this, the kernel matrix should be centered
2 6
in a way that the mean of pulled dataset becomes zero. Note
where r := kx − yk22 . Considering Eq. (34) and notic- that this will restrict the place of pulled points in the feature
ing that this Taylor series expansion has infinite number space further (see Fig. 2); however, because of different
of terms, we see that the RKHS for RBF kernel is infinite possible rotations of pulled points around origin, the exact
dimensional because φ(x), although cannot be calculated positions of pulled points are still unknown.
explicitely for this kernel, will have infinite dimensions. For kernel centering, one should follow the following
Another example for universal kernel is the SSIM kernel theory, which is based on (Schölkopf et al., 1997a) and
X t = [xt,1 , . . . , xt,nt ] ∈ Rd×nt . Consider the kernel If we center the pulled training and out-of-sample data, the
matrix for the training data Rn×n 3 K := Φ(X)> Φ(X), (i, j)-th element of kernel matrix becomes:
whose (i, j)-th element is R 3 K(i, j) = φ(xi )> φ(xj ).
We want to center the pulled training data in the feature K̆ t (i, j) := φ̆(xi )> φ̆(xt,j )
space: (a) 1 X
n
> n
1 X
= φ(xi ) − φ(xk1 ) φ(xt,j ) − φ(xk2 )
1X
n n n
k1 =1 k2 =1
φ̆(xi ) := φ(xi ) − φ(xk ). (59) n
n 1
k=1
X
= φ(xi )> φ(xt,j ) − φ(xk1 )> φ(xt,j )
n
If we center the pulled training data, the (i, j)-th element k1 =1
of kernel matrix becomes: n n n
1 X > 1 X X
− φ(xi ) φ(xk2 ) + 2 φ(xk1 )>φ(xk2 ),
K̆(i, j) := φ̆(xi )> φ̆(xj ) (60) n n
k2 =1 k1 =1 k2 =1
n n
(59) 1 X > 1 X where (a) is because of Eqs. (59) and (64). Therefore,
= φ(xi ) − φ(xk1 ) φ(xj ) − φ(xk2 )
n n the double-centered kernel matrix over training and out-of-
k1 =1 k2 =1
n
sample data is:
1 X
= φ(xi )> φ(xj ) − φ(xk1 )> φ(xj ) 1 1
n
k1 =1
Rn×nt 3 K̆ t = K t − 1n×n K t − K1n×nt
n n
n n n 1
1 X > 1 X X + 1n×n K1n×nt , (65)
− φ(xi ) φ(xk2 ) + 2 φ(xk1 )>φ(xk2 ). n2
n n
k2 =1 k1 =1 k2 =1
where Rn×nt 3 1n×nt := 1n 1> nt and R
nt
3 1nt :=
Writing this in the matrix form gives: >
[1, . . . , 1] . The Eq. (65) is the kernel matrix when the
1 1 pulled training data in the feature space are centered and
Rn×n 3 K̆ = K − 1n×n K − K1n×n the pulled out-of-sample data are centered using the mean
n n
1 of pulled training data.
+ 1n×n K1n×n = HKH, (61) If we have one out-of-sample xt , the Eq. (65) becomes:
n2
where H is the centering matrix (see Eq. (48)). The Eq. 1 1 1
Rn 3 k̆t = kt − 1n×n kt − K1n + 2 1n×n K1n ,
(61) is called the double-centered kernel. This equation n n n
is the kernel matrix when the pulled training data in the (66)
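The kernel centering formulas of this section can be summarized in a short sketch: Eq. (61) double-centers the training kernel with the centering matrix H, and Eq. (65) centers a training/out-of-sample kernel using the mean of the pulled training data. The function names below are our own; the code assumes K and K_t are given as NumPy arrays.

```python
import numpy as np

def center_kernel(K):
    # Eq. (61): double-centered training kernel H K H, with H from Eq. (48).
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def center_out_of_sample_kernel(K, K_t):
    # Eq. (65): K_t is the n x n_t kernel between training and out-of-sample points.
    n, n_t = K.shape[0], K_t.shape[1]
    one_nn = np.ones((n, n))
    one_nt = np.ones((n, n_t))
    return (K_t - one_nn @ K_t / n - K @ one_nt / n
            + one_nn @ K @ one_nt / n**2)
```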
Proof. If we discretize the domain [a, b], for example by then the function f is an eigenfunction for the operator O
sampling, with step ∆x, the function values become vec- and the constant λ is the corresponding eigenvalue. Note
tors as f (x) = [f (x1 ), f (x2 ), . . . , f (xn )]> and g(x) = that the form of eigenfunction problem is:
[g(x1 ), g(x2 ), . . . , g(xn )]> . According to the inner prod-
Operator (function f ) = constant × function f. (79)
uct of two vectors, we have:
n
X Some examples of operator are derivative, kernel function,
hf (x), g(x)i = g H f = f (xi )g(xi ), etc. For example, eλx is an eigenfunction of derivative be-
d λx
i=1 cause dx e = λeλx . Note that eigenfunctions have appli-
cation in many fields of science including machine learning
where g H denotes the conjugate transpose of g (it is trans-
(Bengio et al., 2003c) and quantum mechanics (Reed & Si-
pose if functions are real). Multiplying the sides of this
mon, 1972).
equation by the setp ∆x gives:
Recall that in eigenvalue problem, the eigenvectors show
n
X the most important or informative directions of matrix and
hf (x), g(x)i∆x = g H f = f (xi )g(xi )∆x, the corresponding eigenvalue shows the amount of impor-
i=1
tance (Ghojogh et al., 2019a). Likewise, in eigenfunc-
which is a Riemann sum. This is the Riemann approxima- tion problem of an operator, the eigenfunction is the most
tion of the Eq. (75). This approximation gets more accurate important function of the operator and the corresponding
by ∆x → 0 or n → ∞. Hence, that equation is a valid in- eigenvalue shows the amount of this importance. This con-
ner product in the Hilbert space. Q.E.D. nection between eigenfunction and eigenvalue problems is
proved in the following theorem.
Remark 7 (Interpretation of Inner Product of Functions
in Hilbert Space). The inner product of two functions, i.e. Theorem 3 (Connection of Eigenfunction and Eigenvalue
Eq. (75), measures how similar two functions are. The Problems). If we assume that the operator and the func-
more similar they are in their domain [a, b], the larger in- tion are a matrix and a vector, eigenfunction problem is
ner product they have. Note that this similarity is more converted to an eigenvalue problem where the vector is the
about the pattern (or changes) of functions and not the ex- eigenvector of the matrix.
act value of functions. If the pattern of functions is very Proof. Consider any function space such as a Hilbert
similar, they will have a large inner product. space. Let {ej }nj=1 be the bases (basis functions) of this
Corollary 2 (Weighted Inner Product in Hilbert Space function space where n may be infinite. The function f in
(Williams & Seeger, 2000), (Bengio et al., 2003c, Section this space can be represented as a linear combination bases:
2)). The Eq. (75) is the inner product with uniform weight- n
ing. With density function p(x), one can weight the inner
X
f (x) = αj ej (x). (80)
product in the Hilbert space as (assuming the functions are j=1
real):
An example of this linear combination is Eq. (6) in RKHS
Z b
where the bases are kernels. Consider the operator O which
hf (x), g(x)iH = f (x) g(x) p(x) dx. (76)
a
can be applied on the functions in this function space. Ap-
plying this operator on Eq. (80) gives:
7.2. Eigenfunctions n n
X (a) X
Recall eigenvalue problem for a matrix A (Ghojogh et al., Of (x) = O αj ej (x) = αj Oej (x), (81)
2019a): j=1 j=1
A φi = λi φi , ∀i ∈ {1, . . . , d}, (77) where (a) is because the operator O is a linear operator
according to Definition 21. Also, we have:
where φi and λi are the i-th eigenvector and eigenvalue of n
A, respectively. In the following, we introduce the Eigen- (78) (a) X
Of (x) = λf (x) = λ αj ej (x), (82)
function problem which has a similar form but for an oper-
j=1
ator rather than a matrix.
Definition 21 (Eigenfunction (Kusse & Westwig, 2006, where (a) is because λ is a scalar.
Chapter 11.2)). Consider a linear operator O which can On the other hand, the output function from applying the
be applied on a function f . If applying this operator on operator on a function can also be written as a linear com-
the function results in a multiplication of function to a con- bination of the bases:
stant: Xn
Of (x) = βj ej (x). (83)
Of = λf, (78) j=1
From Eqs. (81) and (83), we have: 7.3. Use of Eigenfunctions for Spectral Embedding
Consider a Hilbert space H of functions with the inner
n n
X X product defined by Eq. (76). Let the data in the input space
αj Oej (x) = βj ej (x). (84)
be X = {xi ∈ Rd }ni=1 . In this space, we can consider an
j=1 j=1
operator for the kernel function Kp as (Williams & Seeger,
2000), (Bengio et al., 2003a, Section 3):
In parentheses, consider an n × n matrix A whose (i, j)-th Z
element is the inner product of ei and Oej : (Kp f )(x) := k(x, y) f (y) p(y) dy, (89)
Z
(75)
A(i, j) := hei , Oej ik = e∗i (x) Oej (x)dx, (85) where f ∈ H and the density function p(y) can be ap-
proximated empirically. A discrete approximation of this
operator is (Williams & Seeger, 2000):
where integral is over the domain of functions in the func-
n
tion space. 1X
(Kp,n f )(x) := k(x, xi ) f (xi ), (90)
Using Eq. (75), we take the inner product of sides of Eq. n i=1
(84) with an arbitrary basis function ei :
which converges to Eq. (89) if n → ∞. Note that this
n
X Z n
X Z equation is also mentioned in (Bengio et al., 2003c, Section
αj e∗i (x) Oej (x) dx = βj e∗i (x) ej (x) dx. 2), (Bengio et al., 2004, Section 4), (Bengio et al., 2006,
j=1 j=1 Section 3.2).
Lemma 10 (Relation of Eigenvalues of Eigenvalue Prob-
According to Eq. (85), this equation is simplified to: lem and Eigenfunction Problem for Kernel (Bengio et al.,
2003a, Proposition 1), (Bengio et al., 2003c, Theorem 1),
n n Z
X X (a) (Bengio et al., 2004, Section 4)). Assume λk denotes the k-
αj A(i, j) = βj e∗i (x) ej (x) dx = βi , (86)
th eigenvalue for eigenfunction decomposition of the oper-
j=1 j=1
ator Kp and δk denotes the k-th eigenvalue for eigenvalue
problem of the matrix K ∈ Rn×n . We have:
which is true for ∀i ∈ {1, . . . n} and (a) is because the
bases are orthonormal, so: δk = n λk . (91)
(75)
Z
1 if i = j, Proof. This proof gets help from (Bengio et al., 2003b,
hei , ej ik = e∗i (x) ej (x) dx = proof of Proposition 3). According to Eq. (78), the eigen-
0 Otherwise.
function problems for the operators Kp and Kp,n (discrete
The Eq. (86) can be written in matrix form: version) are:
(Kp fk )(x) = λk fk (x), ∀k ∈ {1, . . . , n},
Aα = β, (87) (92)
(Kp,n fk )(x) = λk fk (x), ∀k ∈ {1, . . . , n},
where α := [α1 , . . . , αn ]> and β := [β1 , . . . , βn ]> . where fk (.) is the k-th eigenfunction and λk is the corre-
sponding eigenvalue. Consider the kernel matrix defined
From Eqs. (82) and (83), we have:
by Definition 2. The eigenvalue problem for the kernel ma-
n n trix is (Ghojogh et al., 2019a):
X X
λ αj ej (x) = βj ej (x) =⇒ λ α = β. (88) Kv k = δk v k , ∀k ∈ {1, . . . , n}, (93)
j=1 j=1
where v k is the k-th eigenvector and δk is the correspond-
Comparing Eqs. (87) and (88) shows: ing eigenvalue. According to Eqs. (90) and (92), we have:
n
1X
Aα = λ α, k(x, xi ) f (xi ) = λk fk (x), ∀k ∈ {1, . . . , n}.
n i=1
which is an eigenvalue problem for matrix A with eigen- When this equation is evaluated only at xi ∈ X , we have
vector α and eigenvalue λ (Ghojogh et al., 2019a). Note (Bengio et al., 2004, Section 4), (Bengio et al., 2006, Sec-
that, according to Eq. (80), the information of function f tion 3.2):
is in the coefficients αj ’s of the basis functions of space.
1
Therefore, the function is converted to the eigenvector (vec- Kfk = λk fk , ∀k ∈ {1, . . . , n},
tor of coefficients) and the operator O is converted to the n
matrix A. Q.E.D. =⇒ Kfk = nλk fk .
According to Theorem 3, eigenfunction can be seen as an allowed to re-arrange them. Re-arranging the terms in this
eigenvector. If so, we can say: equation gives:
n
Kfk = nλk fk =⇒ Kv k = nλk v k , (94) X
ηk v` φ̆(xj )> φ̆(x` )
Comparing Eqs. (93) and (94) results in Eq. (91). Q.E.D. `=1
n n
1X X
= v` φ̆(xj )> φ̆(xi ) φ̆(xi )> φ̆(x` ) .
Lemma 11 (Relation of Eigenvalues of Kernel and Covari- n i=1
`=1
ance in the Feature Space (Schölkopf et al., 1998)). Con-
sider the covariance of pulled data to the feature space: Considering Eqs. (36) and (60), we can write this equa-
2
tion in matrix form ηk K̆v k = n1 K̆ v k where v k :=
n
1X [v1 , . . . , vn ]> . As K̆ is positive semi-definite (see Lemma
CH := φ̆(xi )φ̆(xi )> , (95)
n i=1 3), it is often non-singular. For non-zero eigenvalues, we
−1
can left multiply this equation to K̆ to have:
where φ̆(xi ) is the centered pulled data defined by Eq. (64).
which is t × t dimensional where t may be infinite. Assume n ηk v k = K̆v k ,
ηk denotes the k-th eigenvalue C H and δk denotes the k-th
eigenvalue of centered kernel K̆. We have: which is the eigenvalue problem for K̆ where v is the
eigenvector and δk = n ηk is the eigenvalue (cf. Eq. (93)).
δk = n ηk . (96) Q.E.D.
Proof. This proof is based on (Schölkopf et al., 1998, Sec- Lemma 12 (Relation of Eigenfunctions and Eigenvectors
tion 2). The eigenvalue problem for this covariance matrix for Kernel (Bengio et al., 2003a, Proposition 1), (Bengio
is: et al., 2003c, Theorem 1)). Consider a training dataset
{xi ∈ Rd }ni=1 and the eigenvalue problem (93) where
ηk uk = C H uk , ∀k ∈ {1, . . . , n}, v k ∈ Rn and δk are the k-th eigenvector and eigenvalue
of matrix K ∈ Rn×n . If vki is the i-th element of vector
where uk is the k-th eigenvector and ηk is its corresponding v k , the eigenfunction for the point x and the i-th training
eigenvalue (Ghojogh et al., 2019a). Left multiplying this point xi are:
equation with φ̆(xj )> gives: √ X n
n
fk (x) = vki k̆(xi , x), (99)
ηk φ̆(xj )> uk = φ̆(xj )> C H uk , ∀k ∈ {1, . . . , n}. δk i=1
(97) √
fk (xi ) = n vki , (100)
As uk is the eigenvector of the covariance matrix in the
feature space, it lies in the feature space; hence, according respectively, where k̆(xi , x) is the centered kernel. If x
to Lemma 13 which will come later, we can represent it as: is a training point, k̆(xi , x) is the centered kernel over
training data and if x is an out-of-sample point, then
n
1 X k̆(xi , x) = k̆t (xi , x) is between training set and the out-
uk = √ v` φ̆(x` ), (98)
δk `=1 of-sample point (n.b. kernel centering is explained in Sec-
tion 6.1).
where pulled data to feature space are assumed to be cen-
tered, v` ’s are the coefficients in representation, and the Proof. For proof of Eq. (99), see (Bengio et al., 2003c,
√ proof of Theorem 1) or (Williams & Seeger, 2001, Section
normalization by 1/ δk is because of a normalization used
in (Bengio et al., 2003c, Section 4). Substituting Eq. (98) 1.1). The Eq. (100) is claimed in (Bengio et al., 2003c,
and Eq. (95) in Eq. (97) results in: Proposition 1). For proof of this equation, see (Bengio
et al., 2003c, proof of Theorem 1, Eq. 7).
n
X
ηk φ̆(xj )> v` φ̆(x` ) It is noteworthy that Eq. (99) is similar and related to the
`=1 Nyström approximation of eigenfunctions of kernel opera-
n n tor which will be explained in Lemma 16.
1X X
= φ̆(xj )> φ̆(xi )φ̆(xi )> v` φ̆(x` ),
n i=1 Theorem 4 (Embedding from Eigenfunctions of Kernel
`=1
Operator (Bengio et al., 2003a, Proposition 1), (Bengio
where normalization factors are simplified from sides. In et al., 2003c, Section 4)). Consider a dimensionality reduc-
the right-hand side, as the summations are finite, we are tion algorithm which embeds data into a low-dimensional
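A compact numerical sketch of Lemmas 10 and 12, assuming an uncentered RBF kernel for brevity (the text uses the centered kernel in Eq. (99)): the eigenvalues of the kernel operator are recovered as λ_k = δ_k / n (Eq. (91)), the eigenfunction values at training points follow Eq. (100), and an out-of-sample point is handled in the spirit of Eq. (99).

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d, 0.0))

n = 40
X = np.random.randn(n, 2)
K = rbf(X, X)
deltas, V = np.linalg.eigh(K)                      # K v_k = delta_k v_k
lambdas = deltas / n                               # Eq. (91): operator eigenvalues

k = -1                                             # index of the largest eigenpair
f_k_train = np.sqrt(n) * V[:, k]                   # Eq. (100): f_k(x_i) = sqrt(n) v_ki

x_new = np.random.randn(1, 2)
k_new = rbf(X, x_new)[:, 0]                        # k(x_i, x) for all training x_i
f_k_new = np.sqrt(n) / deltas[k] * V[:, k] @ k_new # Eq. (99), uncentered sketch
```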
[x2,1 , . . . , x2,n2 ]. In this case, the kernel matrix has size Q.E.D.
n1 × n2 and the kernel trick is:
Remark 8 (Justification by Representation Theory). Ac-
(34)
x>
1,i x1,j
>
7→ φ(x1,i ) φ(x1,j ) = k(x1,i , x1,j ), (105) cording to representation theory (Alperin, 1993), any func-
(36)
tion in the space can be represented as a linear combina-
X> >
1 X 2 7→ Φ(X 1 ) Φ(X 2 ) = K(X 1 , X 2 ) ∈ R
n1 ×n2
. tion of bases of the space. This makes sense because the
(106) function is in the space and the space is spanned by the
bases. Now, assume the space is RKHS. Hence, any func-
An example for kernel between two sets of data is the ker- tion should lie in the RKHS spanned by the pulled data
nel between training data and out-of-sample (test) data. As points to the feature space. This justifies Lemma 13 using
stated in (Schölkopf, 2001), the kernel trick is proved to representation theory.
work for Mercer kernels in (Boser et al., 1992; Vapnik,
1995) or equivalently for the positive definite kernels (Berg 8.2.1. K ERNELIZATION FOR V ECTOR S OLUTION
et al., 1984; Wahba, 1990). Now, consider an algorithm whose optimization variable or
Some examples of using kernel trick in machine learning solution is the vector/direction u ∈ Rd in the input space.
are kernel PCA (Schölkopf et al., 1997a; 1998; Ghojogh & For kernelization, we pull this solution to RKHS by Eq.
Crowley, 2019) and kernel SVM (Boser et al., 1992; Vap- (32) to have φ(u). According to Lemma 13, this pulled
nik, 1995). More examples will be provided in Section 9. solution must lie in the span of all pulled training points
Note that in some algorithms, data do not not appear only {φ(xi )}ni=1 as:
by inner product which is required for the kernel trick. In
n
these cases, if possible, a “dual” method for the algorithm is X
φ(u) = αi φ(xi ) = Φ(X) α, (108)
proposed which only uses the inner product of data. Then,
i=1
the dual algorithm is kernelized using kernel trick. Some
examples for this are kernelization of dual PCA (Ghojogh which is t dimensional and t may be infinite. Note that
& Crowley, 2019) and dual SVM (Burges, 1998). As an ad- Φ(X) is defined by Eq. (35) and is t × n dimensional. The
ditional point, it is noteworthy that it is possible to replace vector α := [α1 , . . . , αn ]> ∈ Rn contains the coefficients.
kernel trick with function replacement (see (Ma, 2003) for According to Eq. (32), we can replace u with φ(u) in the
more details on this). algorithm. If, by this replacement, the terms φ(xi )> φ(xi )
or Φ(X)> Φ(X) appear, then we can use Eq. (34) and As was mentioned, in many machine learning algorithms,
replace φ(xi )> φ(xi ) with k(xi , xi ) or use Eq. (36) to re- the solution U ∈ Rd×p is a projection matrix for projecting
place Φ(X)> Φ(X) with K(X, X). This kernelizes the d-dimensional data onto a p-dimensional subspace. Some
method. The steps of kernelization by representation the- example methods which have used kernelization by rep-
ory are summarized below: resentation theory are kernel Fisher discriminant analysis
• Step 1: u → φ(u) (FDA) (Mika et al., 1999; Ghojogh et al., 2019b), kernel
• Step 2: Replace φ(u) with Eq. (108) in the algorithm supervised principal component analysis (PCA) (Barshan
formulation et al., 2011; Ghojogh & Crowley, 2019), and direct ker-
• Step 3: Some φ(xi )> φ(xi ) or Φ(X)> Φ(X) terms nel Roweis discriminant analysis (RDA) (Ghojogh et al.,
appear in the formulation 2020c).
• Step 4: Use Eq. (34) or (36) Remark 9 (Linear Kernel in Kernelization). If we use ker-
• Step 5: Solve (optimize) the algorithm where the vari- nel trick, the kernelized algorithm with a linear kernel is
able to find is α rather than u equivalent to the non-kernelized algorithm. This is because
Usually, the goal of algorithm results in kernel. For exam- in linear kernel, we have φ(x) = x and k(x, y) = x> y
ple, if u is a projection direction, the desired projected data according to Eq. (41). So, the kernel trick, which is Eq.
are obtained as: (105), maps data as x> y 7→ φ(x)> φ(y) = x> y for lin-
(32) (108) ear kernel. Therefore, linear kernel does not have any effect
u> xi 7→ φ(u)> φ(xi ) = α> Φ(X)> φ(xi ) when using kernel trick. Examples for this are kernel PCA
(36) (Schölkopf et al., 1997a; 1998; Ghojogh & Crowley, 2019)
= α> k(X, xi ), (109)
and kernel SVM (Boser et al., 1992; Vapnik, 1995) which
where k(X, xi ) ∈ Rn is the kernel between all n train- are equivalent to PCA and SVM, respectively, if linear ker-
ing points with the point xi . As this equation shows, the nel is used.
desired goal is based on kernel. However, kernel trick does have impact when using kernel-
ization by representation theory because it finds the inner
8.2.2. K ERNELIZATION FOR M ATRIX S OLUTION products of pulled data points after pulling the solution and
Usually, the algorithm has multiple directions/vectors as representation as a span of bases. Hence, kernelized algo-
its solution. In other words, its solution is a matrix U = rithm using representation theory with linear kernel is not
[u1 , . . . , up ] ∈ Rd×p . In this case, Eq. (108) is used for all equivalent to non-kernelized algorithm. Examples of this
p vectors and in a matrix form, we have: are kernel FDA (Mika et al., 1999; Ghojogh et al., 2019b)
and kernel supervised PCA (Barshan et al., 2011; Ghojogh
Φ(U ) = Φ(X) A, (110) & Crowley, 2019) which are different from FDA and super-
vised PCA, respectively, even if linear kernel is used.
where Rn×p 3 A := [α1 , . . . , αp ]> and Φ(U ) is t × p di-
mensional where t may be infinite. Similarly, the following 9. Types of Use of Kernels in Machine
steps should be performed to kernelize the algorithm: Learning
• Step 1: U → φ(U )
There are several types of using kernels in machine learn-
• Step 2: Replace φ(U ) with Eq. (110) in the algorithm
ing. In the following, we explain these types of usage of
formulation
kernels.
• Step 3: Some Φ(X)> Φ(X) terms appear in the for-
mulation 9.1. Kernel Methods
• Step 4: Use Eq. (36)
The first type of using kernels in machine learning is ker-
• Step 5: Solve (optimize) the algorithm where the vari-
nelization of algorithms using either kernel trick or rep-
able to find is A rather than U
resentation theory. As was discussed in Section 8, linear
Again the goal of algorithm usually results in kernel. For methods can be kernelized to handle nonlinear data better.
example, if U is a projection matrix onto its column space, Even nonlinear algorithms can be kernelized to perform in
we have: the feature space rather than the input space. In machine
(32) (110) learning both kernel trick and kernelization by representa-
U > xi 7→ φ(U )> φ(xi ) = A> Φ(X)> φ(xi ) tion theory have been used. We provide some examples for
(36) each of these categories:
= A> k(X, xi ), (111)
• Examples for kernelization by kernel trick: ker-
where k(X, xi ) ∈ Rn is the kernel between all n train- nel Principal Component Analysis (PCA) (Schölkopf
ing points with the point xi . As this equation shows, the et al., 1997a; 1998; Ghojogh & Crowley, 2019), kernel
desired goal is based on kernel. Support Vector Machine (SVM) (Boser et al., 1992;
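A minimal sketch of kernelization by representation theory for a vector solution (Eqs. (108) and (109)), assuming the coefficient vector alpha has already been found by some algorithm: the pulled direction φ(u) = Φ(X) α is never formed explicitly, and projections onto it only need the kernel vector k(X, x).

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d, 0.0))

X = np.random.randn(30, 2)                 # training points (rows) in this sketch
alpha = np.random.randn(30)                # coefficients in Eq. (108), assumed given

def project(x_new):
    # Eq. (109): u^T phi(x) = alpha^T Phi(X)^T phi(x) = alpha^T k(X, x).
    k_vec = rbf(X, x_new.reshape(1, -1))[:, 0]
    return alpha @ k_vec

value = project(np.random.randn(2))
```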
Vapnik, 1995). that kernel learning by SDP has also been used for labeling
• Examples for kernelization by representation the- a not completely labeled dataset and is also used for ker-
ory: kernel supervised PCA (Barshan et al., 2011; Ghojogh & Crowley, 2019), kernel Fisher Discriminant Analysis (FDA) (Mika et al., 1999; Ghojogh et al., 2019b), and direct kernel Roweis Discriminant Analysis (RDA) (Ghojogh et al., 2020c).
As was discussed in Section 1, using kernels was widely noticed when linear SVM (Vapnik & Chervonenkis, 1974) was kernelized in (Boser et al., 1992; Vapnik, 1995). More discussions on kernel SVM can be found in (Schölkopf et al., 1997b; Hastie et al., 2009). A tutorial on kernel SVM is (Burges, 1998). Universal kernels, introduced in Section 5.3, are widely used in kernel SVM. More detailed discussions and proofs for the use of universal kernels in kernel SVM can be read in (Steinwart & Christmann, 2008).
As was discussed in Section 8.1, many machine learning algorithms are developed to have dual versions because inner products of points usually appear in the dual algorithms and the kernel trick can be applied on them. Some examples of these are dual PCA (Schölkopf et al., 1997a; 1998; Ghojogh & Crowley, 2019) and dual SVM (Boser et al., 1992; Vapnik, 1995), yielding kernel PCA and kernel SVM, respectively. In some algorithms, however, either a dual version does not exist or the formulation does not allow for merely having inner products of points. In those algorithms, the kernel trick cannot be used and representation theory should be used. An example for this is FDA (Ghojogh et al., 2019b). Moreover, some algorithms, such as kernel reinforcement learning (Ormoneit & Sen, 2002), use the kernel as a measure of similarity (see Remark 4).

9.2. Kernel Learning
After the development of many spectral dimensionality reduction methods in machine learning, it was found out that many of them are actually special cases of kernel Principal Component Analysis (PCA) (Bengio et al., 2003c; 2004). The paper (Ham et al., 2004) has shown that PCA, multidimensional scaling, Isomap, locally linear embedding, and Laplacian eigenmap are special cases of kernel PCA with kernels in the formulation of Eq. (51). A list of these kernels can be seen in (Strange & Zwiggelaar, 2014, Chapter 2) and (Ghojogh et al., 2019d).
Because of this, some generalized dimensionality reduction methods, such as graph embedding (Yan et al., 2005), were proposed. In addition, as many spectral methods are cases of kernel PCA, some researchers tried to learn the best kernel for manifold unfolding. Maximum Variance Unfolding (MVU) or Semidefinite Embedding (SDE) (Weinberger et al., 2005; Weinberger & Saul, 2006b;a) is a method for kernel learning using Semidefinite Programming (SDP) (Vandenberghe & Boyd, 1996). MVU is used for manifold unfolding and dimensionality reduction. Note that SDP has also been used for learning the kernel matrix in kernel SVM (Lanckriet et al., 2004; Karimi, 2017). Our focus here is on MVU. In the following, we briefly introduce the MVU (or SDE) algorithm.

Lemma 14 (Distance in RKHS (Schölkopf, 2001)). The squared Euclidean distance between points in the feature space is:

‖φ(x_i) − φ(x_j)‖²_k = k(x_i, x_i) + k(x_j, x_j) − 2k(x_i, x_j).    (112)

Proof.

‖φ(x_i) − φ(x_j)‖²_k = (φ(x_i) − φ(x_j))^⊤ (φ(x_i) − φ(x_j))
= φ(x_i)^⊤ φ(x_i) + φ(x_j)^⊤ φ(x_j) − φ(x_i)^⊤ φ(x_j) − φ(x_j)^⊤ φ(x_i)
= φ(x_i)^⊤ φ(x_i) + φ(x_j)^⊤ φ(x_j) − 2 φ(x_i)^⊤ φ(x_j)    (by Eq. (37))
= k(x_i, x_i) + k(x_j, x_j) − 2k(x_i, x_j).    (by Eq. (34))

Q.E.D.

MVU desires to unfold the manifold of data in its maximum variance direction. For example, consider a Swiss roll which can be unrolled to have maximum variance after being unrolled. As the trace of a matrix is the summation of its eigenvalues and the kernel matrix is a measure of similarity between points (see Remark 4), the trace of the kernel can be used to show the summation of variance of data. Hence, we should maximize tr(K), where tr(.) denotes the trace of a matrix. MVU pulls data to the RKHS and then unfolds the manifold. This unfolding should not ruin the local distances between points after pulling data to the feature space. Hence, we should preserve the local distances as:

‖x_i − x_j‖²₂ = ‖φ(x_i) − φ(x_j)‖²_k = k(x_i, x_i) + k(x_j, x_j) − 2k(x_i, x_j),    (113)

where the second equality is because of Eq. (112). Moreover, according to Lemma 3, the kernel should be positive semidefinite, i.e., K ⪰ 0. MVU also centers the kernel to have zero mean for the pulled dataset in the feature space (see Section 6.1). According to Eq. (63), we will have Σ_{i=1}^n Σ_{j=1}^n K(i, j) = 0. In summary, the optimization of MVU is (Weinberger et al., 2005; Weinberger & Saul, 2006b;a):

maximize_K   tr(K)
subject to   ‖x_i − x_j‖²₂ = K(i, i) + K(j, j) − 2K(i, j),   ∀i, j ∈ {1, . . . , n},
             Σ_{i=1}^n Σ_{j=1}^n K(i, j) = 0,
             K ⪰ 0,    (114)

which is a SDP problem (Vandenberghe & Boyd, 1996). Solving this optimization gives the best kernel for maximum variance unfolding of the manifold. Then, MVU considers the eigenvalue problem for the kernel, i.e., Eq. (93), and finds the embedding using Eq. (102).
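As a concrete illustration, the following is a minimal sketch of the MVU optimization of Eq. (114), assuming the cvxpy and scikit-learn packages are available; the function name and parameters are illustrative, not part of the original formulation. Following common practice for MVU, the distance constraints of Eq. (113) are imposed here only on k-nearest-neighbor pairs rather than on all pairs.

import numpy as np
import cvxpy as cp
from sklearn.neighbors import NearestNeighbors

def mvu_kernel(X, n_neighbors=5):
    """Sketch of the MVU semidefinite program of Eq. (114) for data X (n x d)."""
    n = X.shape[0]
    # neighborhood graph: preserve only local distances (common practical choice)
    _, idx = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X).kneighbors(X)

    K = cp.Variable((n, n), PSD=True)        # K is symmetric positive semidefinite
    constraints = [cp.sum(K) == 0]           # centering: sum_{i,j} K(i, j) = 0
    for i in range(n):
        for j in idx[i, 1:]:                 # skip the point itself
            d_ij = float(np.sum((X[i] - X[j]) ** 2))
            # preserve local distances in the RKHS, Eqs. (112)-(113)
            constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == d_ij)

    # requires an SDP-capable solver such as SCS
    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve(solver=cp.SCS)

    # embedding from the learned kernel by an eigenvalue decomposition
    # (corresponding to Eqs. (93) and (102)); keep the top columns as the embedding
    eigvals, eigvecs = np.linalg.eigh(K.value)
    embedding = eigvecs[:, ::-1] * np.sqrt(np.maximum(eigvals[::-1], 0))
    return K.value, embedding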
9.3. Use of Kernels for Difference of Distributions
There exist several different measures for the difference of distributions (i.e., PDFs). Some of them make use of kernels and some do not. A measure of difference of distributions can be used for (1) calculating the divergence (difference) of a distribution from another reference distribution or (2) convergence of a distribution to another reference distribution using optimization. We will explain the second use of this measure better in Corollary 5.
For the information of the reader, we first enumerate some of the measures without kernels and then introduce the kernel-based measures for the difference of distributions. One of the measures which do not use kernels is the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951). The KL divergence is a relative entropy from one distribution to the other one and has been widely used in deep learning (Goodfellow et al., 2016). Another measure is the Wasserstein metric, which has been used in generative models (Arjovsky et al., 2017). The integral probability metric (Müller, 1997) is another measure for the difference of distributions.
In the following, we introduce some well-known kernel-based measures for the difference of distributions.

9.3.1. Hilbert-Schmidt Independence Criterion (HSIC)
Suppose we want to measure the dependence of two random variables. Measuring the correlation between them is easier because correlation is just "linear" dependence.
According to (Hein & Bousquet, 2004), two random variables X and Y are independent if and only if any bounded continuous functions of them are uncorrelated. Therefore, if we map the samples of two random variables {x_i}_{i=1}^n and {y_i}_{i=1}^n to two different ("separable") RKHSs and have φ(x) and φ(y), we can measure the correlation of φ(x) and φ(y) in the Hilbert space to have an estimation of the dependence of x and y in the input space.
The correlation of φ(x) and φ(y) can be computed by the Hilbert-Schmidt norm of their cross-covariance (Gretton et al., 2005). Note that the squared Hilbert-Schmidt norm of a matrix A is (Bell, 2016):

‖A‖²_HS := tr(A^⊤ A),    (115)

and the cross-covariance matrix of two vectors x and y is (Gubner, 2006; Gretton et al., 2005):

Cov(x, y) := E[(x − E(x)) (y − E(y))^⊤].    (116)

Using the explained intuition, an empirical estimation of the Hilbert-Schmidt Independence Criterion (HSIC) is introduced (Gretton et al., 2005):

HSIC(X, Y) := (1 / (n − 1)²) tr(K_x H K_y H),    (117)

where K_x := φ(x)^⊤ φ(x) and K_y := φ(y)^⊤ φ(y) are the kernels over x and y, respectively. The term 1/(n − 1)² is used for normalization. The matrix H is the centering matrix (see Eq. (48)). Note that HSIC double-centers one of the kernels and then computes the Hilbert-Schmidt norm between the kernels.
HSIC measures the dependence of two random variable vectors x and y. Note that HSIC = 0 and HSIC > 0 mean that x and y are independent and dependent, respectively. The greater the HSIC, the greater dependence they have.

Lemma 15 (Independence of Random Variables Using Cross-Covariance (Gretton & Györfi, 2010, Theorem 5)). Two random variables X and Y are independent if and only if Cov(f(x), g(y)) = 0 for any pair of bounded continuous functions (f, g). Because of the relation of HSIC with the cross-covariance of variables, two random variables are independent if and only if HSIC(X, Y) = 0.

9.3.2. Maximum Mean Discrepancy (MMD)
MMD, also known as the kernel two-sample test and proposed in (Gretton et al., 2006; 2012), is a measure for the difference of distributions. For comparison of two distributions, one can find the difference of all moments of the two distributions. However, as the number of moments is infinite, it is intractable to calculate the difference of all moments. One idea to do this tractably is to pull both distributions to the feature space and then compute the distance of the pulled data points of the two distributions in RKHS. This difference is a suitable estimate of the difference of all moments in the input space. This is the idea behind MMD. MMD is a semi-metric (Simon-Gabriel et al., 2020) and uses distance in the RKHS (Schölkopf, 2001) (see Lemma 14). Consider PDFs P and Q and samples {x_i}_{i=1}^n ∼ P and {y_i}_{i=1}^n ∼ Q. The squared MMD between these PDFs is:

MMD²(P, Q) := ‖(1/n) Σ_{i=1}^n φ(x_i) − (1/n) Σ_{i=1}^n φ(y_i)‖²_k
= (1/n²) Σ_{i=1}^n Σ_{j=1}^n k(x_i, x_j) + (1/n²) Σ_{i=1}^n Σ_{j=1}^n k(y_i, y_j) − (2/n²) Σ_{i=1}^n Σ_{j=1}^n k(x_i, y_j)
= E_x[K(x, y)] + E_y[K(x, y)] − 2 E_x[E_y[K(x, y)]],    (118)

where the second equality is because of Eq. (112), and E_x[K(x, y)], E_y[K(x, y)], and E_x[E_y[K(x, y)]] are the average of rows, average of columns, and total average of rows and columns of the kernel matrix, respectively.
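The following is a minimal NumPy sketch of the empirical HSIC of Eq. (117) and the (biased) empirical squared MMD of Eq. (118), assuming an RBF kernel; the function names are illustrative.

import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Empirical HSIC of Eq. (117)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix of Eq. (48)
    Kx, Ky = rbf_kernel(X, X, sigma), rbf_kernel(Y, Y, sigma)
    return np.trace(Kx @ H @ Ky @ H) / (n - 1) ** 2

def mmd2(X, Y, sigma=1.0):
    """Empirical squared MMD of Eq. (118) between samples X ~ P and Y ~ Q."""
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Y, Y, sigma)
    Kxy = rbf_kernel(X, Y, sigma)
    # means of the three kernel blocks: average of rows, columns, and cross block
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()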
Note that MMD ≥ 0, where MMD = 0 means the two distributions are equivalent if the used kernel is characteristic (see Corollary 5, which will be provided later). MMD has been widely used in machine learning, such as in generative moment matching networks (Li et al., 2015).

Remark 10 (Equivalence of HSIC and MMD (Sejdinovic et al., 2013)). After the development of the HSIC and MMD measures, it was found out that they are equivalent.

9.4. Kernel Embedding of Distributions (Kernel Mean Embedding)
Definition 23 (Kernel Embedding of Distributions (Smola et al., 2007)). Kernel embedding of distributions, also called the Kernel Mean Embedding (KME) or mean map, represents (or embeds) Probability Density Functions (PDFs) in a RKHS.

Corollary 4 (Distribution Embedding in Hilbert Space). Inspired by Eq. (6) or (11), if we map a PDF P from its space X to the Hilbert space H, its mapped PDF, denoted by φ(P), can be represented as:

P ↦ φ(P) = ∫_X k(x, .) dP(x).    (119)

This integral is called the Bochner integral.

KME and MMD were first proposed in the field of pure mathematics (Guilbart, 1978). Later on, KME and MMD were used in machine learning, first in (Smola et al., 2007). KME is a family of methods which use Eq. (119) for embedding PDFs in RKHS. This family of methods is discussed more in (Sriperumbudur et al., 2010). A survey on KME is (Muandet et al., 2016).
Universal kernels, introduced in Section 5.3.3, can be used for KME (Sriperumbudur et al., 2011; Simon-Gabriel & Schölkopf, 2018). In addition to universal kernels, characteristic kernels and integrally strictly positive definite kernels are useful for KME (Sriperumbudur et al., 2011; Simon-Gabriel & Schölkopf, 2018). The integrally strictly positive definite kernel was introduced in Section 5.3.2. In the following, we introduce the characteristic kernels.

Definition 24 (Characteristic Kernel (Fukumizu et al., 2008)). A kernel is characteristic if the mapping (119) is injective. In other words, for a characteristic kernel k, we have:

E_{X∼P}[k(., X)] = E_{Y∼Q}[k(., Y)] ⇐⇒ P = Q,    (120)

where P and Q are two PDFs and X and Y are random variables from these distributions, respectively.

Some examples of characteristic kernels are the RBF and Laplacian kernels. Polynomial kernels are not characteristic kernels.
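A minimal sketch of the empirical counterpart of the mean map in Eq. (119), assuming an RBF kernel (which is characteristic); the function name is illustrative:

import numpy as np

def empirical_mean_map(X, x_query, sigma=1.0):
    """Empirical version of Eq. (119): (1/n) sum_i k(x_i, .) evaluated at x_query."""
    sq_dists = np.sum((X - x_query) ** 2, axis=1)
    return np.mean(np.exp(-sq_dists / (2 * sigma ** 2)))

Comparing the empirical mean maps of two samples in the RKHS norm recovers the empirical squared MMD of Eq. (118).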
Corollary 5 (Convergence of Distributions to Each Other Using Characteristic Kernels (Simon-Gabriel et al., 2020)). Let Q be the PDF for a theoretical or sample reference distribution. Following Definition 24, if the kernel used in measures for the difference of distributions is characteristic, the measure can be used in an optimization framework to converge a PDF P to the reference distribution Q as:

d_k(P, Q) = 0 ⇐⇒ P = Q,    (121)

where d_k(., .) denotes a measure for the difference of distributions, such as MMD.

Characteristic kernels have been used for dimensionality reduction in machine learning. For example, see (Fukumizu et al., 2004; 2009).
So far, we have introduced three different types of embedding in RKHS. In the following, we summarize these three types available in the literature of kernels.

Remark 11 (Types of Embedding in Hilbert Space). There are three types of embeddings in Hilbert space:
1. Embedding of points in the Hilbert space: This embedding maps x ↦ k(x, .) as stated in Sections 2.3 and 8.
2. Embedding of functions in the Hilbert space: This embedding maps f(x) ↦ ∫ K(x, y) f(y) p(y) dy as stated in Section 7.
3. Embedding of distributions (PDFs) in the Hilbert space: This embedding maps P ↦ ∫ k(x, .) dP(x) as stated in Section 9.4.

Researchers expect that combinations of these types of embedding might appear in the future.

9.5. Kernel Dimensionality Reduction for Sufficient Dimensionality Reduction
Kernels can also be used directly for dimensionality reduction. Assume X is the random variable of data and Y is the random variable of the labels of data. The labels can be discrete and finite for classification or continuous for regression. Sufficient Dimensionality Reduction (SDR) (Adragni & Cook, 2009) is a family of methods which find a transformation of data to a lower dimensional space, denoted by R(x), which does not change the conditional of labels given data:

P_{Y|X}(y | x) = P_{Y|R(X)}(y | R(x)).    (122)

Kernel Dimensionality Reduction (KDR) (Fukumizu et al., 2004; 2009; Wang et al., 2010b) is a SDR method with a linear projection for the transformation, i.e., R(x) : x ↦ U^⊤ x, which projects data onto the column space of U. The goal of KDR is:

P_{Y|X}(y | x) = P_{Y|U,X}(y | U, x).    (123)
Definition 25 (Dual Space). A dual space of a vector space V, denoted by V*, is the set of all linear functionals φ : V → F, where F is the field on which the vector space is defined.

Theorem 5 (Riesz (or Riesz–Fréchet) representation theorem (Garling, 1973)). Let H be a Hilbert space with inner product ⟨., .⟩_H. Suppose φ ∈ H* (e.g., φ : f → R). Then, there exists a unique f ∈ H such that for any x ∈ H, we have φ(x) = ⟨f, x⟩:

∀φ ∈ H*, ∃f ∈ H : ∀x ∈ H, φ(x) = ⟨f, x⟩_H.    (124)

Corollary 6. According to Theorem 5, we have:

E_X[f(x)] = ⟨f, φ(P)⟩_H,   ∀f ∈ H,    (125)

where φ(P) is defined by Eq. (119).

KDR uses Theorem 5 and Corollary 6 in its formulations. Note that characteristic kernels (Fukumizu et al., 2008), introduced in Definition 24, are used in KDR.

10. Rank and Factorization of Kernel and the Nyström Method
10.1. Rank and Factorization of Kernel Matrix
Usually, the rank of the kernel is small. This is because of the manifold hypothesis, which states that data points often do not cover the whole space but lie on a sub-manifold. The Nyström approximation of the kernel matrix also works well because kernels are often low-rank matrices (we will discuss this in Corollary 7). Because of the low rank of kernels, they can be approximated (Kishore Kumar & Schneider, 2017), learned (Kulis et al., 2006; 2009), and factorized. Kernel factorization has also been used for the sake of clustering (Wang et al., 2010a). In the following, we introduce some of the most well-known decompositions for the kernel matrix.

10.1.1. Singular Value and Eigenvalue Decompositions
The Singular Value Decomposition (SVD) of the pulled data to the feature space is:

R^{t×n} ∋ Φ(X) = U Σ V^⊤,    (126)

where U ∈ R^{t×n} and V ∈ R^{n×n} are orthogonal matrices and contain the left and right singular vectors, respectively, and Σ ∈ R^{n×n} is a diagonal matrix with the singular values. Note that here, we are using notations such as R^{t×n} for showing the dimensionality of matrices and this notation does not imply a Euclidean space.
As mentioned before, the pulled data are not necessarily available, so Eq. (126) cannot necessarily be done. The kernel, however, is available. Using the SVD of the pulled data in the formulation of the kernel gives:

K = Φ(X)^⊤ Φ(X) = (U Σ V^⊤)^⊤ (U Σ V^⊤) = V Σ U^⊤ U Σ V^⊤ = V Σ Σ V^⊤ = V Σ² V^⊤
⟹ K V = V Σ² V^⊤ V = V Σ²,    (127)

where the first equality is because of Eq. (36), the second because of Eq. (126), and we used U^⊤ U = I and V^⊤ V = I. If we take ∆ = diag([δ_1, . . . , δ_n]^⊤) := Σ², this equation becomes:

K V = V ∆,    (128)

which is the matrix form of Eq. (93), i.e., the eigenvalue problem (Ghojogh et al., 2019a) for the kernel matrix with V and ∆ as eigenvectors and eigenvalues, respectively. Hence, rather than SVD on the pulled dataset, one can apply Eigenvalue Decomposition (EVD) on the kernel, where the eigenvectors of the kernel are equal to the right singular vectors of the pulled dataset and the eigenvalues of the kernel are the squared singular values of the pulled dataset. This technique has been used in kernel PCA (Ghojogh & Crowley, 2019).

10.1.2. Cholesky and QR Decompositions
The kernel matrix can be factorized using LU decomposition; however, as the kernel matrix is a symmetric positive semi-definite matrix (see Lemmas 1 and 3), it can be decomposed using the Cholesky decomposition, which is much faster than LU decomposition. The Cholesky decomposition of the kernel is in the form R^{n×n} ∋ K = L L^⊤, where L ∈ R^{n×n} is a lower-triangular matrix. The kernel matrix can also be factorized using QR decomposition as R^{n×n} ∋ K = Q R, where Q ∈ R^{n×n} is an orthogonal matrix and R ∈ R^{n×n} is an upper-triangular matrix. The paper (Bach & Jordan, 2005) has incorporated side information, such as class labels, in the Cholesky and QR decompositions of the kernel matrix.
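A small NumPy sketch of these factorizations, using a synthetic matrix as a stand-in for the pulled data Φ(X); it checks numerically that the eigenvalues of K are the squared singular values of Φ(X), as in Eqs. (126)-(128), and then forms the Cholesky and QR factorizations of the kernel:

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((7, 5))            # stand-in for the pulled data Φ(X), t x n
K = Phi.T @ Phi                              # kernel matrix, Eq. (36)

# Eqs. (126)-(128): right singular vectors of Φ(X) are eigenvectors of K,
# and the eigenvalues of K are the squared singular values of Φ(X).
_, sing_vals, Vt = np.linalg.svd(Phi, full_matrices=False)
eig_vals, V = np.linalg.eigh(K)              # ascending order
assert np.allclose(np.sort(sing_vals**2), eig_vals)

# Cholesky and QR factorizations of the positive semi-definite kernel
L = np.linalg.cholesky(K + 1e-10 * np.eye(K.shape[0]))   # K = L L^T (jitter guards rank deficiency)
Q, R = np.linalg.qr(K)                                    # K = Q R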
10.2. Nyström Method for Approximation of Eigenfunctions
The Nyström method, first proposed in (Nyström, 1930), was initially used for approximating the eigenfunctions of an operator (or of a matrix corresponding to an operator). The following lemma provides the Nyström approximation for the eigenfunctions of the kernel operator defined by Eq. (89). For more discussion on this, the reader can refer to (Baker, 1978), (Williams & Seeger, 2001, Section 1.1), and (Williams & Seeger, 2000).
Lemma 16 (Nyström Approximation of Eigenfunction (Baker, 1978, Chapter 3), (Williams & Seeger, 2001, Section 1.1)). Consider a training dataset {x_i ∈ R^d}_{i=1}^n and the eigenfunction problem (78), where f_k ∈ H and λ_k are the k-th eigenfunction and eigenvalue of the kernel operator defined by Eq. (89) or (90). The eigenfunction can be approximated by the Nyström method as:

f_k(x) ≈ (1 / (n λ_k)) Σ_{i=1}^n k(x_i, x) f_k(x_i),    (129)

where k(x_i, x) is the kernel (or centered kernel) corresponding to the kernel operator.

Proof. This lemma is somewhat similar and related to Lemma 12. Consider the kernel operator, defined by Eq. (89), in Eq. (78): K f = λ f. Combining this with the discrete version of the operator, Eq. (90), gives λ_k f_k(x) = (1/n) Σ_{i=1}^n k(x_i, x) f_k(x_i). Dividing both sides of this equation by λ_k brings Eq. (129). Q.E.D.
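Eq. (129) translates directly into code. The following sketch assumes the eigenvalue λ_k of the kernel operator and the values f_k(x_i) at the training points have already been obtained (e.g., from the eigenvalue problem of Eq. (93)); the function name and argument names are illustrative.

import numpy as np

def nystrom_eigenfunction(x, X_train, f_k_train, lam_k, kernel):
    """Nyström approximation of Eq. (129): extend the k-th eigenfunction to a new point x.

    X_train   : (n, d) training points x_1, ..., x_n
    f_k_train : (n,) values f_k(x_i) of the eigenfunction at the training points
    lam_k     : eigenvalue of the kernel operator for the k-th eigenfunction
    kernel    : callable kernel(a, b) returning the scalar k(a, b)
    """
    n = X_train.shape[0]
    k_vec = np.array([kernel(xi, x) for xi in X_train])
    return k_vec @ f_k_train / (n * lam_k)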
10.3. Nyström Method for Kernel Completion and Approximation
As explained before, the kernel matrix often has a low rank. Because of its low rank, it can be approximated (Kishore Kumar & Schneider, 2017). This is important because in big data, when n ≫ 1, constructing the kernel matrix is both time-consuming and also intractable to store in computer memory; i.e., its computation will run forever and will finally raise a memory error. Hence, it is desired to compute the kernel function between a subset of data points (called landmarks) and then approximate the rest of the kernel matrix using this subset of the kernel matrix. The Nyström approximation can be used for this goal.
The Nyström method can be used for kernel approximation. It is a technique used to approximate a positive semi-definite matrix using merely a subset of its columns (or rows) (Williams & Seeger, 2001, Section 1.2), (Drineas et al., 2005). Hence, it can be used for kernel completion in big data, where computation of the entire kernel matrix is time-consuming and intractable. One can compute some of the important columns or rows of a kernel matrix, called landmarks, and approximate the rest of the columns or rows by the Nyström approximation.
Consider a positive semi-definite matrix R^{n×n} ∋ K ⪰ 0 whose parts are:

K = [ A    B
      B^⊤  C ],    (130)

where A ∈ R^{m×m}, B ∈ R^{m×(n−m)}, and C ∈ R^{(n−m)×(n−m)}, with m ≪ n. Intuitively, if we know the similarity of some points a and b with each other, resulting in matrix A, and the similarity of points c, d, and e from a and b, resulting in matrix B, we cannot have much freedom on the location of c, d, and e, which is the matrix C. This is because of the positive semi-definiteness of the matrix K. The points selected in submatrix A are named landmarks. Note that the landmarks can be selected randomly from the columns/rows of matrix K and, without loss of generality, they can be put together to form a submatrix at the top-left corner of the matrix. For the Nyström approximation, some methods have been proposed for sampling the more important columns/rows of the matrix more wisely rather than randomly. We will mention some of these sampling methods in Section 10.5.
As the matrix K is positive semi-definite, by definition, it can be written as K = O^⊤ O. If we take O = [R, S], where R are the selected columns (landmarks) of O and S are the other columns of O, we have:

K = O^⊤ O = [R, S]^⊤ [R, S]    (131)
= [ R^⊤R   R^⊤S
    S^⊤R   S^⊤S ] = [ A    B
                      B^⊤  C ].    (132)

Hence, we have A = R^⊤ R. The eigenvalue decomposition (Ghojogh et al., 2019a) of A gives:

A = U Σ U^⊤    (133)
⟹ R^⊤ R = U Σ U^⊤ ⟹ R = Σ^{1/2} U^⊤.    (134)

Moreover, we have B = R^⊤ S, so:

B = (Σ^{1/2} U^⊤)^⊤ S = U Σ^{1/2} S ⟹ U^⊤ B = Σ^{1/2} S ⟹ S = Σ^{−1/2} U^⊤ B,    (135)

where the second implication is because U is orthogonal (in the eigenvalue decomposition). Finally, we have:

C = S^⊤ S = B^⊤ U Σ^{−1/2} Σ^{−1/2} U^⊤ B = B^⊤ U Σ^{−1} U^⊤ B = B^⊤ A^{−1} B,    (136)

where the last equality is because of Eq. (133).
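The derivation above suggests the following minimal sketch of Nyström kernel completion: only the blocks A (landmarks versus landmarks) and B (landmarks versus the rest) are computed, and C is filled in as in Eq. (136). It assumes a callable that returns the kernel matrix between two sets of points (such as the rbf_kernel sketched earlier); a pseudo-inverse replaces the exact A^{−1} for numerical robustness, and the function name is illustrative.

import numpy as np

def nystrom_completion(kernel, X, landmark_idx):
    """Approximate the full kernel matrix from a subset of its rows/columns."""
    rest_idx = np.setdiff1d(np.arange(X.shape[0]), landmark_idx)
    A = kernel(X[landmark_idx], X[landmark_idx])      # m x m block over the landmarks
    B = kernel(X[landmark_idx], X[rest_idx])          # m x (n - m) block
    C = B.T @ np.linalg.pinv(A) @ B                   # Eq. (136), with a pseudo-inverse
    top = np.hstack([A, B])
    bottom = np.hstack([B.T, C])
    return np.vstack([top, bottom])                   # block form of Eq. (130), landmarks first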
Hence, the rank of K gives the required rank of A and a lower bound on the number of landmarks. In practice, it is recommended to use a larger number of landmarks for a more accurate approximation, but there is a trade-off with the speed.

Corollary 7. As we usually have m ≪ n, the Nyström approximation works well especially for low-rank matrices (Kishore Kumar & Schneider, 2017) because we will need a small A (so a small number of landmarks) for approximation. Usually, because of the manifold hypothesis, data fall on a submanifold; hence, usually, the kernel (similarity) matrix or the distance matrix has a low rank. Therefore, the Nyström approximation works well for many kernel-based or distance-based manifold learning methods.

10.4. Use of Nyström Approximation for Landmark Spectral Embedding
The spectral dimensionality reduction methods (Saul et al., 2006) are based on the geometry of data and their solutions often follow an eigenvalue problem (Ghojogh et al., 2019a). Therefore, they cannot handle big data where n ≫ 1. To tackle this issue, there exist some landmark methods which approximate the embedding of all points using the embedding of some landmarks. Big data, i.e., n ≫ 1, result in large kernel matrices. Selecting some of the most informative columns or rows of the kernel matrix, called landmarks, can reduce computations. This technique is named the Nyström approximation, which is used for kernel approximation and completion.
The Nyström approximation can be used to make the spectral methods such as locally linear embedding (Ghojogh et al., 2020a) and Multidimensional Scaling (MDS) (Ghojogh et al., 2020b) scalable and suitable for big data embedding. It is shown in (Platt, 2005) that all the landmark MDS methods are Nyström approximations. For more details on the usage of the Nyström approximation in spectral embedding, refer to (Ghojogh et al., 2020a;b).

10.5. Other Improvements over Nyström Approximation of Kernels
The Nyström method has been improved for kernel approximation in a line of research. For example, it has been used for clustering (Fowlkes et al., 2004) and regularization (Rudi et al., 2015). Greedy Nyström (Farahat et al., 2011; 2015) and large-scale Nyström (Li et al., 2010) are other examples. There is a trade-off between the approximation accuracy and computational efficiency, and they are balanced in (Lim et al., 2015; 2018). The error analysis of the Nyström method can be found in (Zhang et al., 2008; Zhang & Kwok, 2010). It is better to sample the landmarks wisely rather than randomly. This field of research is named "column subset selection" or "landmark selection" for Nyström approximation. Some of these methods are (Kumar et al., 2009; 2012) and landmark selection on a sparse manifold (Silva et al., 2006).
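For practical use, an off-the-shelf Nyström feature-map approximation is available in scikit-learn. The following sketch, assuming scikit-learn is installed, approximates an RBF kernel on a toy dataset with 100 landmarks; the dataset and parameter values are only illustrative.

from sklearn.datasets import make_circles
from sklearn.kernel_approximation import Nystroem

X, _ = make_circles(n_samples=2000, factor=0.3, noise=0.05)
# approximate the RBF feature map using 100 landmark (component) points
feature_map = Nystroem(kernel="rbf", gamma=2.0, n_components=100)
Z = feature_map.fit_transform(X)      # (2000, 100) approximate feature matrix
K_approx = Z @ Z.T                    # low-rank approximation of the full kernel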
11. Conclusion
This paper was a tutorial and survey paper on kernels and kernel methods. We covered various topics including Mercer kernels, Mercer's theorem, RKHS, eigenfunctions, Nyström methods, kernelization techniques, and the use of kernels in machine learning. This paper can be useful for different fields of science such as machine learning, functional analysis, and quantum mechanics.

Acknowledgement
Some parts of this tutorial, particularly some parts of the RKHS, are covered by Prof. Larry Wasserman's statistical machine learning course at the Department of Statistics and Data Science, Carnegie Mellon University (watch his course on YouTube). The video on RKHS in the Statistical Machine Learning course by Prof. Ulrike von Luxburg at the University of Tübingen is also great (watch on YouTube, Tübingen Machine Learning channel). Also, some parts such as the Nyström approximation and MVU are covered by Prof. Ali Ghodsi's course at the University of Waterloo, available on YouTube. There are other useful videos on this field which can be found on YouTube.

References
Abu-Khzam, Faisal N, Collins, Rebecca L, Fellows, Micheal R, Langston, Micheal A, Suters, W Henry, and Symons, Christopher T. Kernelization algorithms for the vertex cover problem: Theory and experiments. Technical report, University of Tennessee, 2004.

Adragni, Kofi P and Cook, R Dennis. Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4385–4405, 2009.

Ah-Pine, Julien. Normalized kernels as similarity indices. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 362–373. Springer, 2010.

Aizerman, Mark A, Braverman, E. M., and Rozonoer, L. I. Theoretical foundations of the potential function method in pattern recognition learning. Automation and remote control, 25:821–837, 1964.

Akbari, Saieed, Ghareghani, Narges, Khosrovshahi, Gholamreza B, and Maimani, Hamidreza. The kernels of the incidence matrices of graphs revisited. Linear algebra and its applications, 414(2-3):617–625, 2006.

Alperin, Jonathan L. Local representation theory: Modular representations as an introduction to the local repre-
sentation theory of finite groups. Cambridge University Bengio, Yoshua, Delalleau, Olivier, Roux, Nicolas Le,
Press, 1993. Paiement, Jean-François, Vincent, Pascal, and Ouimet,
Marie. Learning eigenfunctions links spectral embed-
Anderson, Thomas and Dahlin, Michael. Operating Sys- ding and kernel PCA. Neural computation, 16(10):
tems: Principles and Practice. Recursive Books, 2014. 2197–2219, 2004.
Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Bengio, Yoshua, Delalleau, Olivier, Le Roux, Nicolas,
Wasserstein generative adversarial networks. In Inter- Paiement, Jean-François, Vincent, Pascal, and Ouimet,
national conference on machine learning, pp. 214–223. Marie. Spectral dimensionality reduction. In Feature
PMLR, 2017. Extraction, pp. 519–550. Springer, 2006.
Aronszajn, Nachman. Theory of reproducing kernels. Berg, Christian, Christensen, Jens Peter Reus, and Ressel,
Transactions of the American mathematical society, 68 Paul. Harmonic analysis on semigroups: theory of posi-
(3):337–404, 1950. tive definite and related functions, volume 100. Springer,
1984.
Arzelà, Cesare. Sulle Funzioni Di Linee. Tipografia Gam-
berini e Parmeggiani, 1895. Bergman, Clifford. Universal algebra: Fundamentals and
selected topics. CRC Press, 2011.
Bach, Francis R and Jordan, Michael I. Predictive low-rank
decomposition for kernel methods. In Proceedings of the Berlinet, Alain and Thomas-Agnan, Christine. Reproduc-
22nd international conference on machine learning, pp. ing kernel Hilbert spaces in probability and statistics.
33–40, 2005. Springer Science & Business Media, 2011.
Baker, Christopher TH. The numerical treatment of inte- Bhatia, Rajendra. Positive definite matrices. Princeton Uni-
gral equations. Clarendon press, 1978. versity Press, 2009.
Barshan, Elnaz, Ghodsi, Ali, Azimifar, Zohreh, and Borgwardt, Karsten M, Gretton, Arthur, Rasch, Malte J,
Jahromi, Mansoor Zolghadri. Supervised principal com- Kriegel, Hans-Peter, Schölkopf, Bernhard, and Smola,
ponent analysis: Visualization, classification and regres- Alex J. Integrating structured biological data by kernel
sion on subspaces and submanifolds. Pattern Recogni- maximum mean discrepancy. Bioinformatics, 22(14):
tion, 44(7):1357–1371, 2011. e49–e57, 2006.
Beauzamy, Bernard. Introduction to Banach spaces and Boser, Bernhard E, Guyon, Isabelle M, and Vapnik,
their geometry. North-Holland, 1982. Vladimir N. A training algorithm for optimal margin
classifiers. In Proceedings of the fifth annual workshop
Bell, Jordan. Trace class operators and Hilbert-Schmidt on Computational learning theory, pp. 144–152, 1992.
operators. Department of Mathematics, University of
Toronto, Technical Report, 2016. Bourbaki, Nicolas. Sur certains espaces vectoriels
topologiques. In Annales de l’institut Fourier, volume 2,
Bengio, Yoshua, Paiement, Jean-françcois, Vincent, Pas- pp. 5–16, 1950.
cal, Delalleau, Olivier, Roux, Nicolas, and Ouimet,
Marie. Out-of-sample extensions for LLE, Isomap, Brunet, Dominique, Vrscay, Edward R, and Wang, Zhou.
MDS, eigenmaps, and spectral clustering. Advances On the mathematical properties of the structural similar-
in neural information processing systems, 16:177–184, ity index. IEEE Transactions on Image Processing, 21
2003a. (4):1488–1499, 2011.
Bengio, Yoshua, Vincent, Pascal, Paiement, Jean-François, Burges, Christopher JC. A tutorial on support vector ma-
Delalleau, O, Ouimet, M, and LeRoux, N. Learning chines for pattern recognition. Data mining and knowl-
eigenfunctions of similarity: linking spectral cluster- edge discovery, 2(2):121–167, 1998.
ing and kernel PCA. Technical report, Departement Camps-Valls, Gustavo. Kernel methods in bioengineering,
d’Informatique et Recherche Operationnelle, 2003b. signal and image processing. Igi Global, 2006.
Bengio, Yoshua, Vincent, Pascal, Paiement, Jean-François, Conway, John B. A course in functional analysis. Springer,
Delalleau, Olivier, Ouimet, Marie, and Le Roux, Nico- 2 edition, 2007.
las. Spectral clustering and kernel PCA are learn-
ing eigenfunctions. Technical report, Departement Cox, Michael AA and Cox, Trevor F. Multidimensional
d’Informatique et Recherche Operationnelle, Technical scaling. In Handbook of data visualization, pp. 315–347.
Report 1239, 2003c. Springer, 2008.
De Branges, Louis. The Stone-Weierstrass theorem. Pro- Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark.
ceedings of the American Mathematical Society, 10(5): Eigenvalue and generalized eigenvalue problems: Tuto-
822–824, 1959. rial. arXiv preprint arXiv:1903.11240, 2019a.
Drineas, Petros, Mahoney, Michael W, and Cristianini, Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark.
Nello. On the Nyström method for approximating a Fisher and kernel Fisher discriminant analysis: Tutorial.
Gram matrix for improved kernel-based learning. jour- arXiv preprint arXiv:1906.09436, 2019b.
nal of machine learning research, 6(12), 2005.
Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark.
Farahat, Ahmed, Ghodsi, Ali, and Kamel, Mohamed. A
Image structure subspace learning using structural simi-
novel greedy algorithm for Nyström approximation. In
larity index. In International Conference on Image Anal-
Proceedings of the Fourteenth International Conference
ysis and Recognition, pp. 33–44. Springer, 2019c.
on Artificial Intelligence and Statistics, pp. 269–277.
JMLR Workshop and Conference Proceedings, 2011. Ghojogh, Benyamin, Samad, Maria N, Mashhadi,
Farahat, Ahmed K, Elgohary, Ahmed, Ghodsi, Ali, and Sayema Asif, Kapoor, Tania, Ali, Wahab, Karray,
Kamel, Mohamed S. Greedy column subset selection Fakhri, and Crowley, Mark. Feature selection and fea-
for large-scale data sets. Knowledge and Information ture extraction in pattern analysis: A literature review.
Systems, 45(1):1–34, 2015. arXiv preprint arXiv:1905.02845, 2019d.
Fausett, Laurene V. Fundamentals of neural networks: ar- Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and
chitectures, algorithms and applications. Prentice-Hall, Crowley, Mark. Locally linear embedding and
Inc., 1994. its variants: Tutorial and survey. arXiv preprint
arXiv:2011.10925, 2020a.
Fomin, Fedor V, Lokshtanov, Daniel, Saurabh, Saket, and
Zehavi, Meirav. Kernelization: theory of parameterized Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and
preprocessing. Cambridge University Press, 2019. Crowley, Mark. Multidimensional scaling, Sammon
mapping, and Isomap: Tutorial and survey. arXiv
Fowlkes, Charless, Belongie, Serge, Chung, Fan, and Ma-
preprint arXiv:2009.08136, 2020b.
lik, Jitendra. Spectral grouping using the Nyström
method. IEEE transactions on pattern analysis and ma-
Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark.
chine intelligence, 26(2):214–225, 2004.
Generalized subspace learning by Roweis discriminant
Fukumizu, Kenji, Bach, Francis R, and Jordan, Michael I. analysis. In International Conference on Image Analysis
Dimensionality reduction for supervised learning with and Recognition, pp. 328–342. Springer, 2020c.
reproducing kernel Hilbert spaces. Journal of Machine
Learning Research, 5(Jan):73–99, 2004. Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark.
Theoretical insights into the use of structural similar-
Fukumizu, Kenji, Sriperumbudur, Bharath K, Gretton, ity index in generative models and inferential autoen-
Arthur, and Schölkopf, Bernhard. Characteristic kernels coders. In International Conference on Image Analysis
on groups and semigroups. In Advances in neural infor- and Recognition, pp. 112–117. Springer, 2020d.
mation processing systems, pp. 473–480, 2008.
Gonzalez, Rafael C and Woods, Richard E. Digital image
Fukumizu, Kenji, Bach, Francis R, Jordan, Michael I, et al. processing. Prentice hall Upper Saddle River, NJ, 2002.
Kernel dimension reduction in regression. The Annals of
Statistics, 37(4):1871–1905, 2009. Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, and
Bengio, Yoshua. Deep learning, volume 1. MIT press
Garling, DJH. A ‘short’ proof of the Riesz representa-
Cambridge, 2016.
tion theorem. In Mathematical Proceedings of the Cam-
bridge Philosophical Society, volume 73, pp. 459–460. Gretton, Arthur and Györfi, László. Consistent nonpara-
Cambridge University Press, 1973. metric tests of independence. The Journal of Machine
Genton, Marc G. Classes of kernels for machine learning: Learning Research, 11:1391–1423, 2010.
a statistics perspective. Journal of machine learning re-
search, 2(Dec):299–312, 2001. Gretton, Arthur, Bousquet, Olivier, Smola, Alex, and
Schölkopf, Bernhard. Measuring statistical dependence
Ghojogh, Benyamin and Crowley, Mark. Unsupervised with Hilbert-Schmidt norms. In International conference
and supervised principal component analysis: Tutorial. on algorithmic learning theory, pp. 63–77. Springer,
arXiv preprint arXiv:1906.03148, 2019. 2005.
Gretton, Arthur, Borgwardt, Karsten, Rasch, Malte, Karimi, Amir-Hossein. A summary of the kernel matrix,
Schölkopf, Bernhard, and Smola, Alex. A kernel method and how to learn it effectively using semidefinite pro-
for the two-sample-problem. Advances in neural infor- gramming. arXiv preprint arXiv:1709.06557, 2017.
mation processing systems, 19:513–520, 2006.
Kimeldorf, George and Wahba, Grace. Some results on
Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J, Tchebycheffian spline functions. Journal of mathemati-
Schölkopf, Bernhard, and Smola, Alexander. A kernel cal analysis and applications, 33(1):82–95, 1971.
two-sample test. The Journal of Machine Learning Re-
search, 13(1):723–773, 2012. Kishore Kumar, N and Schneider, Jan. Literature survey on
low rank approximation of matrices. Linear and Multi-
Gubner, John A. Probability and random processes for
linear Algebra, 65(11):2212–2244, 2017.
electrical and computer engineers. Cambridge Univer-
sity Press, 2006. Kulis, Brian, Sustik, Mátyás, and Dhillon, Inderjit. Learn-
Guilbart, Christian. Etude des produits scalaires sur ing low-rank kernel matrices. In Proceedings of the 23rd
l’espace des mesures: estimation par projections. PhD international conference on Machine learning, pp. 505–
thesis, Université des Sciences et Techniques de Lille, 512, 2006.
1978.
Kulis, Brian, Sustik, Mátyás A, and Dhillon, Inderjit S.
Ham, Jihun, Lee, Daniel D, Mika, Sebastian, and Low-rank kernel learning with Bregman matrix diver-
Schölkopf, Bernhard. A kernel view of the dimensional- gences. Journal of Machine Learning Research, 10(2),
ity reduction of manifolds. In Proceedings of the twenty- 2009.
first international conference on Machine learning, pp.
47, 2004. Kullback, Solomon and Leibler, Richard A. On informa-
tion and sufficiency. The annals of mathematical statis-
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. tics, 22(1):79–86, 1951.
The elements of statistical learning: data mining, infer-
ence, and prediction. Springer Science & Business Me- Kumar, Sanjiv, Mohri, Mehryar, and Talwalkar, Ameet.
dia, 2009. Sampling techniques for the Nyström method. In Ar-
tificial Intelligence and Statistics, pp. 304–311. PMLR,
Hawkins, Thomas. Cauchy and the spectral theory of ma-
2009.
trices. Historia mathematica, 2(1):1–29, 1975.
Hein, Matthias and Bousquet, Olivier. Kernels, associated Kumar, Sanjiv, Mohri, Mehryar, and Talwalkar, Ameet.
structures and generalizations. Max-Planck-Institut fuer Sampling methods for the Nyström method. The Journal
biologische Kybernetik, Technical Report, 2004. of Machine Learning Research, 13(1):981–1006, 2012.
Hilbert, David. Grundzüge einer allgemeinen theo- Kung, Sun Yuan. Kernel methods and machine learning.
rie der linearen integralrechnungen I. Nachrichten Cambridge University Press, 2014.
von der Gesellschaft der Wissenschaften zu Göttingen,
Mathematisch-Physikalische Klasse, pp. 49–91, 1904. Kusse, Bruce R and Westwig, Erik A. Mathematical
physics: applied mathematics for scientists and engi-
Hinton, Geoffrey E and Salakhutdinov, Ruslan R. Reduc- neers. Wiley-VCH, 2 edition, 2006.
ing the dimensionality of data with neural networks. Sci-
ence, 313(5786):504–507, 2006. Lanckriet, Gert RG, Cristianini, Nello, Bartlett, Peter,
Ghaoui, Laurent El, and Jordan, Michael I. Learning the
Hofmann, Thomas, Schölkopf, Bernhard, and Smola,
kernel matrix with semidefinite programming. Journal
Alexander J. A review of kernel methods in machine
of Machine learning research, 5(Jan):27–72, 2004.
learning. Max-Planck-Institute Technical Report, 156,
2006. Li, Mu, Kwok, James Tin-Yau, and Lü, Baoliang. Mak-
Hofmann, Thomas, Schölkopf, Bernhard, and Smola, ing large-scale Nyström approximation possible. In
Alexander J. Kernel methods in machine learning. The ICML 2010-Proceedings, 27th International Conference
annals of statistics, pp. 1171–1220, 2008. on Machine Learning, pp. 631, 2010.
Icking, Christian and Klein, Rolf. Searching for the kernel Li, Yujia, Swersky, Kevin, and Zemel, Rich. Genera-
of a polygon—a competitive strategy. In Proceedings of tive moment matching networks. In International Con-
the eleventh annual symposium on Computational geom- ference on Machine Learning, pp. 1718–1727. PMLR,
etry, pp. 258–266, 1995. 2015.
Lim, Woosang, Kim, Minhwan, Park, Haesun, and Jung, Nyström, Evert J. Über die praktische auflösung von
Kyomin. Double Nyström method: An efficient and ac- integralgleichungen mit anwendungen auf randwertauf-
curate Nyström scheme for large-scale data sets. In In- gaben. Acta Mathematica, 54(1):185–204, 1930.
ternational Conference on Machine Learning, pp. 1367–
1375. PMLR, 2015. Oldford, Wayne. Lecture: Recasting principal compo-
nents. Lecture notes for Data Visualization, Department
Lim, Woosang, Du, Rundong, Dai, Bo, Jung, Kyomin, of Statistics and Actuarial Science, University of Water-
Song, Le, and Park, Haesun. Multi-scale Nyström loo, 2018.
method. In International Conference on Artificial Intel-
Ormoneit, Dirk and Sen, Śaunak. Kernel-based reinforce-
ligence and Statistics, pp. 68–76. PMLR, 2018.
ment learning. Machine learning, 49(2):161–178, 2002.
Ma, Junshui. Function replacement vs. kernel trick. Neu-
Orr, Mark J. L. Introduction to radial basis function net-
rocomputing, 50:479–483, 2003.
works. Technical report, Center for Cognitive Science,
Mercer, J. Functions of positive and negative type and their University of Edinburgh, 1996.
connection with the theory of integral equations. Philo- Pan, Sinno Jialin, Kwok, James T, and Yang, Qiang. Trans-
sophical Transactions of the Royal Society, A(209):415– fer learning via dimensionality reduction. In AAAI, vol-
446, 1909. ume 8, pp. 677–682, 2008.
Mika, Sebastian, Ratsch, Gunnar, Weston, Jason, Parseval des Chenes, MA. Mémoires présentés à l’institut
Scholkopf, Bernhard, and Mullers, Klaus-Robert. Fisher des sciences, lettres et arts, par divers savans, et lus dans
discriminant analysis with kernels. In Neural networks ses assemblées. Sciences, mathématiques et physiques
for signal processing IX: Proceedings of the 1999 IEEE (Savans étrangers), 1:638, 1806.
signal processing society workshop, pp. 41–48. Ieee,
1999. Perlibakas, Vytautas. Distance measures for PCA-based
face recognition. Pattern recognition letters, 25(6):711–
Minh, Ha Quang, Niyogi, Partha, and Yao, Yuan. Mer- 724, 2004.
cer’s theorem, feature maps, and smoothing. In Interna-
tional Conference on Computational Learning Theory, Platt, John. FastMap, MetricMap, and landmark MDS are
pp. 154–168. Springer, 2006. all Nystrom algorithms. In AISTATS, 2005.
Muandet, Krikamol, Fukumizu, Kenji, Sriperumbudur, Prugovecki, Eduard. Quantum mechanics in Hilbert space.
Bharath, and Schölkopf, Bernhard. Kernel mean em- Academic Press, 1982.
bedding of distributions: A review and beyond. arXiv Reed, Michael and Simon, Barry. Methods of modern
preprint arXiv:1605.09522, 2016. mathematical physics: Functional analysis. Academic
Müller, Alfred. Integral probability metrics and their gen- Press, 1972.
erating classes of functions. Advances in Applied Prob- Renardy, Michael and Rogers, Robert C. An introduction
ability, pp. 429–443, 1997. to partial differential equations, volume 13. Springer
Müller, Klaus-Robert, Mika, Sebastian, Tsuda, Koji, and Science & Business Media, 2006.
Schölkopf, Bernhard. An introduction to kernel-based Rennie, Jason. How to normalize a kernel matrix. Tech-
learning algorithms. Handbook of Neural Network Sig- nical report, MIT Computer Science & Artificial Intelli-
nal Processing, 2018. gence Lab, 2005.
Narici, Lawrence and Beckenstein, Edward. Topological Rojo-Álvarez, José Luis, Martı́nez-Ramón, Manel, Marı́,
vector spaces. CRC Press, 2010. Jordi Muñoz, and Camps-Valls, Gustavo. Digital signal
processing with Kernel methods. Wiley Online Library,
Noack, Marcus M and Sethian, James A. Advanced station-
2018.
ary and non-stationary kernel designs for domain-aware
Gaussian processes. arXiv preprint arXiv:2102.03432, Rudi, Alessandro, Camoriano, Raffaello, and Rosasco,
2021. Lorenzo. Less is more: Nyström computational regu-
larization. In Advances in neural information processing
Novak, Erich, Ullrich, Mario, Woźniakowski, Henryk, and systems, pp. 1657–1665, 2015.
Zhang, Shun. Reproducing kernels of Sobolev spaces
on Rd and applications to embedding constants and Rudin, Cynthia. Prediction: Machine learning and statis-
tractability. Analysis and Applications, 16(05):693–715, tics (MIT 15.097), lecture on kernels. Technical report,
2018. Massachusetts Institute of Technology, 2012.
Rupp, Matthias. Machine learning for quantum mechanics Sejdinovic, Dino, Sriperumbudur, Bharath, Gretton,
in a nutshell. International Journal of Quantum Chem- Arthur, and Fukumizu, Kenji. Equivalence of distance-
istry, 115(16):1058–1073, 2015. based and RKHS-based statistics in hypothesis testing.
The Annals of Statistics, pp. 2263–2291, 2013.
Saul, Lawrence K, Weinberger, Kilian Q, Sha, Fei, Ham,
Jihun, and Lee, Daniel D. Spectral methods for dimen- Shawe-Taylor, John and Cristianini, Nello. Kernel methods
sionality reduction. Semi-supervised learning, 3, 2006. for pattern analysis. Cambridge university press, 2004.
Silva, Jorge, Marques, Jorge, and Lemos, João. Select-
Saxe, Karen. Beginning functional analysis. Springer,
ing landmark points for sparse manifold learning. In
2002.
Advances in neural information processing systems, pp.
Schlichtharle, Dietrich. Digital Filters: Basics and Design. 1241–1248, 2006.
Springer, 2 edition, 2011. Simon-Gabriel, Carl-Johann and Schölkopf, Bernhard.
Kernel distribution embeddings: Universal kernels, char-
Schmidt, Erhard. Über die auflösung linearer gleichungen
acteristic kernels and kernel metrics on distributions. The
mit unendlich vielen unbekannten. Rendiconti del Cir-
Journal of Machine Learning Research, 19(1):1708–
colo Matematico di Palermo (1884-1940), 25(1):53–77,
1736, 2018.
1908.
Simon-Gabriel, Carl-Johann, Barp, Alessandro, and
Schölkopf, Bernhard. The kernel trick for distances. Ad- Mackey, Lester. Metrizing weak convergence with
vances in neural information processing systems, pp. maximum mean discrepancies. arXiv preprint
301–307, 2001. arXiv:2006.09268, 2020.
Schölkopf, Bernhard and Smola, Alexander J. Learning Smola, Alex, Gretton, Arthur, Song, Le, and Schölkopf,
with kernels: support vector machines, regularization, Bernhard. A Hilbert space embedding for distributions.
optimization, and beyond. MIT press, 2002. In International Conference on Algorithmic Learning
Theory, pp. 13–31. Springer, 2007.
Schölkopf, Bernhard, Smola, Alexander, and Müller,
Klaus-Robert. Kernel principal component analysis. In Smola, Alex J and Schölkopf, Bernhard. Learning with
International conference on artificial neural networks, kernels, volume 4. Citeseer, 1998.
pp. 583–588. Springer, 1997a.
Song, Le. Learning via Hilbert space embedding of distri-
Schölkopf, Bernhard, Sung, Kah-Kay, Burges, Christo- butions. PhD thesis, The University of Sydney, 2008.
pher JC, Girosi, Federico, Niyogi, Partha, Poggio, Sriperumbudur, Bharath K, Gretton, Arthur, Fukumizu,
Tomaso, and Vapnik, Vladimir. Comparing support vec- Kenji, Schölkopf, Bernhard, and Lanckriet, Gert RG.
tor machines with Gaussian kernels to radial basis func- Hilbert space embeddings and metrics on probability
tion classifiers. IEEE transactions on Signal Processing, measures. The Journal of Machine Learning Research,
45(11):2758–2765, 1997b. 11:1517–1561, 2010.
Schölkopf, Bernhard, Smola, Alexander, and Müller, Sriperumbudur, Bharath K, Fukumizu, Kenji, and Lanck-
Klaus-Robert. Nonlinear component analysis as a kernel riet, Gert RG. Universality, characteristic kernels and
eigenvalue problem. Neural computation, 10(5):1299– RKHS embedding of measures. Journal of Machine
1319, 1998. Learning Research, 12(7), 2011.
Schölkopf, Bernhard, Burges, Christopher JC, and Smola, Steinwart, Ingo. On the influence of the kernel on the con-
Alexander J. Advances in kernel methods: support vec- sistency of support vector machines. Journal of machine
tor learning. MIT press, 1999a. learning research, 2(Nov):67–93, 2001.
Steinwart, Ingo. Support vector machines are universally
Schölkopf, Bernhard, Mika, Sebastian, Burges, Chris JC,
consistent. Journal of Complexity, 18(3):768–791, 2002.
Knirsch, Philipp, Muller, K-R, Ratsch, Gunnar, and
Smola, Alexander J. Input space versus feature space Steinwart, Ingo and Christmann, Andreas. Support vector
in kernel-based methods. IEEE transactions on neural machines. Springer Science & Business Media, 2008.
networks, 10(5):1000–1017, 1999b.
Strang, Gilbert. The fundamental theorem of linear algebra.
Scott, David W. Multivariate density estimation: theory, The American Mathematical Monthly, 100(9):848–855,
practice, and visualization. John Wiley & Sons, 1992. 1993.
Strange, Harry and Zwiggelaar, Reyer. Open Problems in Williams, Christopher KI and Barber, David. Bayesian
Spectral Dimensionality Reduction. Springer, 2014. classification with Gaussian processes. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 20
Tenenbaum, Joshua B, De Silva, Vin, and Langford, (12):1342–1351, 1998.
John C. A global geometric framework for nonlinear di-
mensionality reduction. Science, 290(5500):2319–2323, Yan, Shuicheng, Xu, Dong, Zhang, Benyu, and Zhang,
2000. Hong-Jiang. Graph embedding: A general framework
for dimensionality reduction. In 2005 IEEE Computer
Vandenberghe, Lieven and Boyd, Stephen. Semidefinite Society Conference on Computer Vision and Pattern
programming. SIAM review, 38(1):49–95, 1996. Recognition (CVPR’05), volume 2, pp. 830–837. IEEE,
Vapnik, Vladimir. The nature of statistical learning theory. 2005.
Springer science & business media, 1995. Zhang, Jianguo, Marszałek, Marcin, Lazebnik, Svetlana,
Vapnik, Vladimir and Chervonenkis, Alexey. Theory of and Schmid, Cordelia. Local features and kernels for
pattern recognition. Nauka, Moscow, 1974. classification of texture and object categories: A com-
prehensive study. International journal of computer vi-
Wahba, Grace. Spline models for observational data. sion, 73(2):213–238, 2007.
SIAM, 1990.
Zhang, Kai and Kwok, James T. Clustered Nyström
Wand, Matt P and Jones, M Chris. Kernel smoothing. CRC method for large scale manifold learning and dimension
press, 1994. reduction. IEEE Transactions on Neural Networks, 21
(10):1576–1587, 2010.
Wang, Lijun, Rege, Manjeet, Dong, Ming, and Ding, Yong-
sheng. Low-rank kernel matrix factorization for large- Zhang, Kai, Tsang, Ivor W, and Kwok, James T. Improved
scale evolutionary clustering. IEEE Transactions on Nyström low-rank approximation and error analysis. In
Knowledge and Data Engineering, 24(6):1036–1050, Proceedings of the 25th international conference on ma-
2010a. chine learning, pp. 1232–1239, 2008.
Wang, Meihong, Sha, Fei, and Jordan, Michael. Unsuper-
vised kernel dimension reduction. Advances in neural
information processing systems, 23:2379–2387, 2010b.
Weinberger, Kilian Q and Saul, Lawrence K. An introduc-
tion to nonlinear dimensionality reduction by maximum
variance unfolding. In Proceedings of the AAAI Confer-
ence on Artificial Intelligence, volume 6, pp. 1683–1686,
2006a.
Weinberger, Kilian Q and Saul, Lawrence K. Unsupervised
learning of image manifolds by semidefinite program-
ming. International journal of computer vision, 70(1):
77–90, 2006b.
Weinberger, Kilian Q, Packer, Benjamin, and Saul,
Lawrence K. Nonlinear dimensionality reduction by
semidefinite programming and kernel matrix factoriza-
tion. In AISTATS, 2005.
Williams, Christopher and Seeger, Matthias. The effect of
the input density distribution on kernel-based classifiers.
In Proceedings of the 17th international conference on
machine learning, 2000.
Williams, Christopher and Seeger, Matthias. Using the
Nyström method to speed up kernel machines. In Pro-
ceedings of the 14th annual conference on neural infor-
mation processing systems, number CONF, pp. 682–688,
2001.