1 More Kernels and Their Properties
Any learning algorithm that depends only on inner products of examples, and can therefore be run with kernels, is called a kernel method.
Nearest Neighbors: We next show that k-nearest neighbor (kNN) is also a kernel method. kNN classifies a new example by finding the k closest examples in the sample and taking a majority vote on their labels. So all we need is to compute distances between examples. We have
\begin{align*}
\|\phi(x) - \phi(y)\|^2 &= \sum_i \big(\phi_i(x) - \phi_i(y)\big)^2 \\
&= \sum_i \phi_i(x)^2 + \sum_i \phi_i(y)^2 - 2\sum_i \phi_i(x)\,\phi_i(y) \\
&= k(x, x) + k(y, y) - 2\,k(x, y).
\end{align*}
So indeed the distance can be calculated using 3 calls to the kernel function.
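To make this concrete, here is a minimal Python sketch (not part of the original notes) of a kernelized kNN classifier that uses only the three kernel calls above; the choice of a Gaussian kernel and the tiny dataset are assumptions made purely for illustration.

```python
import numpy as np
from collections import Counter

def rbf_kernel(x, y, sigma=1.0):
    # A Gaussian kernel, used here only as a stand-in for any valid kernel k(x, y).
    return np.exp(-np.sum((x - y) ** 2) / sigma)

def kernel_distance_sq(k, x, y):
    # Squared feature-space distance via three kernel calls:
    # ||phi(x) - phi(y)||^2 = k(x, x) + k(y, y) - 2 k(x, y)
    return k(x, x) + k(y, y) - 2 * k(x, y)

def kernel_knn_predict(k, X_train, y_train, x_new, num_neighbors=3):
    # Rank training examples by kernel-induced distance and take a majority vote.
    dists = [kernel_distance_sq(k, x_i, x_new) for x_i in X_train]
    nearest = np.argsort(dists)[:num_neighbors]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Tiny synthetic example (assumed data).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1], [2.2, 1.9]])
y_train = np.array([0, 0, 1, 1])
print(kernel_knn_predict(rbf_kernel, X_train, y_train, np.array([1.9, 2.0])))  # expected: 1
```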
The polynomial kernel: Let $x, y \in \mathbb{R}^n$. Define $k(x, y) = (\langle x, y \rangle + c)^d$ for $c, d \in \mathbb{R}$. This corresponds to a feature map $\phi$ whose features are polynomials in the original variables.
For example, if $d = 2$, then:
\begin{align*}
k(\vec{x}, \vec{y}) &= \Big(\sum_i x_i y_i + c\Big)^2 \\
&= \Big(\sum_i x_i y_i + c\Big)\Big(\sum_j x_j y_j + c\Big) \\
&= \sum_i \sum_j x_i y_i \cdot x_j y_j + 2c \sum_i x_i y_i + c^2 \\
&= \sum_i \sum_j (x_i x_j)(y_i y_j) + \sum_i (\sqrt{2c}\, x_i)(\sqrt{2c}\, y_i) + c^2.
\end{align*}
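As a sanity check, the following sketch (an illustration added here, with hypothetical helper names) compares the degree-2 polynomial kernel against an explicit feature map read off from the expansion above; the two inner products should coincide.

```python
import numpy as np
from itertools import product

def poly_kernel(x, y, c=1.0):
    # Degree-2 polynomial kernel k(x, y) = (<x, y> + c)^2.
    return (np.dot(x, y) + c) ** 2

def poly_features(x, c=1.0):
    # Explicit feature map matching the expansion above:
    # all products x_i * x_j, the coordinates scaled by sqrt(2c), and a constant feature c.
    pairs = [x[i] * x[j] for i, j in product(range(len(x)), repeat=2)]
    return np.array(pairs + list(np.sqrt(2 * c) * x) + [c])

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, -1.0, 3.0])
print(np.isclose(poly_kernel(x, y), np.dot(poly_features(x), poly_features(y))))  # True
```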
The Gaussian kernel: is defined as $k(x, y) = e^{-\|\vec{x} - \vec{y}\|^2/\sigma}$. By using Taylor's expansion $e^a = 1 + a + \cdots + \frac{1}{k!}a^k + \cdots$ one can see that $e^{\vec{x} \cdot \vec{y}}$ is a kernel with (an infinite set of) features corresponding to polynomial terms. Then, since $e^{-\|\vec{x}-\vec{y}\|^2/\sigma} = e^{2\,\vec{x}\cdot\vec{y}/\sigma}\, e^{-\|\vec{x}\|^2/\sigma}\, e^{-\|\vec{y}\|^2/\sigma}$, we can rescale the exponent by $\sigma$ and divide the corresponding features by $e^{\|\vec{x}\|^2/\sigma}$ and $e^{\|\vec{y}\|^2/\sigma}$ to get the Gaussian kernel.
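A quick numerical check of this decomposition (a sketch with an assumed choice of x, y, and sigma):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=2.0):
    # k(x, y) = exp(-||x - y||^2 / sigma), as defined above.
    return np.exp(-np.sum((x - y) ** 2) / sigma)

# Check: exp(-||x - y||^2 / sigma)
#      = exp(2 <x, y> / sigma) * exp(-||x||^2 / sigma) * exp(-||y||^2 / sigma)
x = np.array([0.3, -1.2, 0.7])
y = np.array([1.0, 0.4, -0.5])
sigma = 2.0
lhs = gaussian_kernel(x, y, sigma)
rhs = np.exp(2 * np.dot(x, y) / sigma) * np.exp(-np.dot(x, x) / sigma) * np.exp(-np.dot(y, y) / sigma)
print(np.isclose(lhs, rhs))  # True
```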
2 Linear Algebra
A quick review was given using slides; see the slide copies. The main result we need is as follows:
Any symmetric matrix $K$ with real-valued entries can be written in the form $K = PDP^T$, where $P = (\vec{V}_1, \vec{V}_2, \ldots, \vec{V}_m)$, the $\vec{V}_i$ are eigenvectors of $K$ that form an orthonormal basis (so we also have $P^T = P^{-1}$), and $D$ is a diagonal matrix with $D_{i,i} = \lambda_i$ being the corresponding eigenvalues.
A square matrix $A$ is positive semi-definite (PSD) iff for all vectors $c$ we have $c^T A c = \sum_i \sum_j c_i c_j A_{i,j} \ge 0$. It is well known that a symmetric matrix is positive semi-definite iff all of its eigenvalues are non-negative.
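As an illustration (not from the slides), NumPy's eigendecomposition for symmetric matrices can be used to verify the factorization and the eigenvalue test on a small example matrix:

```python
import numpy as np

# Build an arbitrary symmetric matrix and decompose it as K = P D P^T.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
K = A + A.T                               # symmetric, but not necessarily PSD

eigenvalues, P = np.linalg.eigh(K)        # columns of P are orthonormal eigenvectors
D = np.diag(eigenvalues)

print(np.allclose(K, P @ D @ P.T))        # True: K = P D P^T
print(np.allclose(P.T @ P, np.eye(4)))    # True: P^T = P^{-1}
print(np.all(eigenvalues >= -1e-12))      # PSD iff True (may be False for this arbitrary K)
```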
3 Mercer’s Theorem
The sample $S = x_1, x_2, \ldots, x_m$ includes $m$ examples. The kernel (Gram) matrix $K$ is an $m \times m$ matrix containing the inner products between all pairs of examples, i.e., $K_{i,j} = k(x_i, x_j)$. $K$ is symmetric since $k(x, y) = k(y, x) = \phi(x) \cdot \phi(y)$.
Mercer's Theorem: A symmetric function $k(\cdot, \cdot)$ is a kernel iff for any finite sample $S$ the kernel matrix for $S$ is positive semi-definite.
One direction of the theorem is easy: if $k(\cdot, \cdot)$ is a kernel and $K$ is the kernel matrix with $K_{i,j} = k(x_i, x_j)$, then
\begin{align*}
c^T K c = \sum_i \sum_j c_i c_j K_{i,j} = \sum_i \sum_j c_i c_j\, \phi(x_i) \cdot \phi(x_j) = \Big(\sum_i c_i \phi(x_i)\Big) \cdot \Big(\sum_j c_j \phi(x_j)\Big) = \Big\|\sum_j c_j \phi(x_j)\Big\|^2 \ge 0.
\end{align*}
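The following sketch (illustrative, with an assumed random feature matrix) checks this chain of identities numerically: the quadratic form $c^T K c$ matches the squared norm of $\sum_j c_j \phi(x_j)$, and the Gram matrix of explicit feature vectors is PSD.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((5, 3))      # row i is phi(x_i): 5 examples in a 3-dim feature space
K = Phi @ Phi.T                        # K_{i,j} = phi(x_i) . phi(x_j)

c = rng.standard_normal(5)
quad_form = c @ K @ c
norm_sq = np.sum((Phi.T @ c) ** 2)     # || sum_j c_j phi(x_j) ||^2

print(np.isclose(quad_form, norm_sq))             # True
print(np.all(np.linalg.eigvalsh(K) >= -1e-12))    # True: K is PSD
```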
For the other direction we will prove a weaker result.
Theorem: Consider a finite input space $X = \{x^1, x^2, \ldots, x^m\}$ and the kernel matrix $K$ over the entire space. If $K$ is positive semi-definite then $k(\cdot, \cdot)$ is a kernel function.
Proof: By the linear algebra facts above we can write $K = PDP^T$. Define a feature mapping into an $m$-dimensional space where the $l$th coordinate of the feature expansion for example $x^i$ is $\phi_l(x^i) = \sqrt{\lambda_l}\,(\vec{V}_l)_i$. (This is well defined since $K$ is PSD, so every $\lambda_l \ge 0$.)
The inner product is
\begin{align*}
\vec{\phi}(x^i) \cdot \vec{\phi}(x^j) = \sum_{l=1}^m \phi_l(x^i)\, \phi_l(x^j) = \sum_{l=1}^m \lambda_l (\vec{V}_l)_i (\vec{V}_l)_j.
\end{align*}
Now consider the $(i,j)$th entry of the matrix, $K_{i,j} = k(x^i, x^j)$. We have the following identities, where the last one proves the result:
\begin{align*}
K_{i,j} = [PDP^T]_{i,j} = \big[[PD]P^T\big]_{i,j} = \sum_{l=1}^m [PD]_{i,l}\,[P^T]_{l,j} = \sum_{l=1}^m \lambda_l P_{i,l} P_{j,l} = \sum_{l=1}^m \lambda_l (\vec{V}_l)_i (\vec{V}_l)_j = \vec{\phi}(x^i) \cdot \vec{\phi}(x^j).
\end{align*}
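Here is a short sketch (added for illustration, starting from an arbitrary PSD matrix built for the example) that constructs the features $\phi_l(x^i) = \sqrt{\lambda_l}\,(\vec{V}_l)_i$ from the eigendecomposition and confirms they reproduce $K$:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
K = B @ B.T                              # a PSD kernel matrix over a 4-point input space

eigenvalues, P = np.linalg.eigh(K)       # K = P D P^T, with eigenvalues >= 0 since K is PSD
Phi = P * np.sqrt(np.clip(eigenvalues, 0, None))   # row i holds phi(x^i); column l is sqrt(lambda_l) * V_l

print(np.allclose(Phi @ Phi.T, K))       # True: the constructed feature map reproduces K
```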
Note that Mercer’s theorem allows us to work with a kernel function without knowing which
feature map it corresponds to or its relevance to the learning problem. This has often been used
in practical applications.
Proof of (1): Write $K = K_1 + K_2$, where we add the matrices component-wise. Then for every vector $\vec{x}$,
\begin{align*}
\vec{x}^T K \vec{x} = \vec{x}^T K_1 \vec{x} + \vec{x}^T K_2 \vec{x} \ge 0,
\end{align*}
since $K_1$ and $K_2$ are both PSD; hence $K$ is PSD and the claim follows from Mercer's theorem.
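A brief numerical illustration (with assumed data and kernel choices) that adding two Gram matrices preserves positive semi-definiteness:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 2))

K1 = X @ X.T                                               # linear-kernel Gram matrix
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-sq_dists / 2.0)                               # Gaussian-kernel Gram matrix

K = K1 + K2
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))              # True: the sum is still PSD
```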