
150AML: Advanced Topics in Machine Learning Spring 2008

Course Topic: Computational Learning Theory


Department of Computer Science, Tufts University
Lecture 18: April 1
Instructor: Roni Khardon
Scribe: Roni

More Kernels and Their Properties


These notes are slightly edited from previous scribe notes (in Spring 2006) taken by Mashhood
Ishaque.

1 Kernels and Kernel Methods


In the previous lecture we introduced the idea of kernels and gave the Boolean kernels and dual
perceptron algorithm that works with kernels. Here we introduce some more common kernels and
kernel methods.
We say that k(x, y) is a kernel function iff there is a feature map φ such that for all x, y,

k(x, y) = φ(x) · φ(y).

Any learning algorithm that depends only on inner products of examples, and can therefore be run with kernels, is called a kernel method.

Nearest Neighbors: We next show that k-nearest neighbor (kNN) is also a kernel method. kNN classifies a new example by finding the k closest examples in the sample and taking a majority vote on their labels. So all we need is to compute distances between examples. We have

||φ(x) − φ(y)||² = Σ_i (φ_i(x) − φ_i(y))²
                 = Σ_i φ_i(x)² + Σ_i φ_i(y)² − 2 Σ_i φ_i(x) φ_i(y)
                 = k(x, x) + k(y, y) − 2 k(x, y)
So indeed the distance can be calculated using 3 calls to the kernel function.
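
As an illustration (not part of the original notes), here is a minimal Python sketch of kernelized kNN built on this distance formula; the Gaussian kernel and the toy data are placeholder assumptions.

import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian kernel, used here only as a placeholder choice of k(x, y).
    return np.exp(-np.sum((x - y) ** 2) / sigma)

def kernel_distance(x, y, k):
    # Squared distance in feature space via three kernel calls:
    # ||phi(x) - phi(y)||^2 = k(x, x) + k(y, y) - 2 k(x, y).
    return k(x, x) + k(y, y) - 2 * k(x, y)

def knn_predict(x_new, X, labels, k_fn, k=3):
    # Classify x_new by majority vote over the k closest training examples,
    # where "closest" is measured by the kernelized distance above.
    dists = [kernel_distance(x_new, x, k_fn) for x in X]
    nearest = np.argsort(dists)[:k]
    votes = labels[nearest]
    return 1 if votes.sum() >= 0 else -1  # labels assumed to be +1 / -1

# Toy usage with made-up data.
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1], [2.2, 1.9]])
labels = np.array([-1, -1, 1, 1])
print(knn_predict(np.array([1.9, 2.0]), X, labels, rbf_kernel, k=3))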

The polynomial kernel: Let x, y ∈ Rⁿ. Define k(x, y) = (⟨x, y⟩ + c)^d for c ≥ 0 and a positive integer d. This corresponds to a feature map φ whose features are polynomials of the original variables.
For example, if d = 2, then:

k(x, y) = (Σ_i x_i y_i + c)²
        = (Σ_i x_i y_i + c)(Σ_j x_j y_j + c)
        = Σ_i Σ_j x_i y_i · x_j y_j + 2c Σ_i x_i y_i + c²
        = Σ_i Σ_j (x_i x_j) · (y_i y_j) + Σ_i (√(2c) x_i) · (√(2c) y_i) + c²

Thus φ has n² entries coming from the double sum Σ_i Σ_j (the feature is x_i x_j), plus n entries coming from Σ_i (the feature is √(2c) x_i), plus one constant entry (the feature is c).
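
As a sanity check (an addition to the notes), the following Python sketch compares the d = 2 polynomial kernel to an explicit dot product over the n² + n + 1 features listed above; the data values are arbitrary.

import numpy as np

def poly2_kernel(x, y, c=1.0):
    # k(x, y) = (<x, y> + c)^2
    return (np.dot(x, y) + c) ** 2

def poly2_features(x, c=1.0):
    # Explicit feature map: all products x_i x_j, then sqrt(2c) x_i, then the constant c.
    pairwise = np.outer(x, x).ravel()   # n^2 features x_i x_j
    linear = np.sqrt(2 * c) * x         # n features sqrt(2c) x_i
    constant = np.array([c])            # 1 constant feature
    return np.concatenate([pairwise, linear, constant])

x = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 1.0, 2.0])
print(poly2_kernel(x, y), np.dot(poly2_features(x), poly2_features(y)))  # both should agree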

The Gaussian kernel: is defined as k(x, y) = e^(−||x − y||²/σ). Using the Taylor expansion e^a = 1 + a + . . . + (1/k!) a^k + . . ., one can see that e^(x·y) is a kernel with (an infinite set of) features corresponding to polynomial terms. Then we can normalize by σ and divide the corresponding features by e^(||x||²) and e^(||y||²) to get the Gaussian kernel.
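
The factorization behind this normalization can be checked numerically; the sketch below is an addition to the notes and uses arbitrary vectors.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma)

def gaussian_via_factorization(x, y, sigma=1.0):
    # e^{-||x-y||^2/sigma} = e^{-||x||^2/sigma} * e^{-||y||^2/sigma} * e^{2 x.y / sigma}
    # The last factor is the "infinite polynomial" kernel; the first two rescale the features.
    return (np.exp(-np.dot(x, x) / sigma)
            * np.exp(-np.dot(y, y) / sigma)
            * np.exp(2 * np.dot(x, y) / sigma))

x, y = np.array([0.5, -1.0]), np.array([1.5, 0.25])
print(gaussian_kernel(x, y), gaussian_via_factorization(x, y))  # should match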

2 Linear Algebra
A quick review was given using slides. See slide copies. The main result we need is as follows:
Any symmetric matrix K with real-valued entries can be written in the form K = P D P^T, where P = (V_1, V_2, ..., V_m), the V_i are eigenvectors of K that form an orthonormal basis (so we also have P^T = P⁻¹), and D is a diagonal matrix whose entries D_{i,i} = λ_i are the corresponding eigenvalues.
A square matrix A is positive semi-definite (PSD) iff for all vectors c we have c^T A c = Σ_i Σ_j c_i c_j A_{i,j} ≥ 0. It is well known that a symmetric matrix is positive semi-definite iff all of its eigenvalues are non-negative.
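
As an illustration (not from the slides), the following sketch uses numpy to recover K = P D P^T for a small symmetric matrix and to test positive semi-definiteness via the eigenvalues.

import numpy as np

K = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])   # an arbitrary symmetric matrix

eigvals, P = np.linalg.eigh(K)    # columns of P are orthonormal eigenvectors
D = np.diag(eigvals)

print(np.allclose(K, P @ D @ P.T))         # K = P D P^T
print(np.allclose(P.T, np.linalg.inv(P)))  # P^T = P^{-1}
print(np.all(eigvals >= -1e-10))           # PSD iff all eigenvalues are non-negative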

3 Mercer’s Theorem
The sample S = x_1, x_2, ..., x_m includes m examples. The kernel (Gram) matrix K is an m × m matrix containing the inner products between all pairs of examples, i.e., K_{i,j} = k(x_i, x_j). K is symmetric since k(x, y) = φ(x) · φ(y) = φ(y) · φ(x) = k(y, x).
Mercer’s Theorem: A symmetric function k(., .) is a kernel iff for any finite sample S the
kernel matrix for S is positive semi-definite.
One direction of the theorem is easy: suppose k() is a kernel and K is the kernel matrix with K_{i,j} = k(x_i, x_j). Then

c^T K c = Σ_i Σ_j c_i c_j K_{i,j} = Σ_i Σ_j c_i c_j φ(x_i) · φ(x_j) = (Σ_i c_i φ(x_i)) · (Σ_j c_j φ(x_j)) = ||Σ_j c_j φ(x_j)||² ≥ 0.
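
A quick numerical illustration of this direction (an addition, using a placeholder Gaussian kernel and random data): build a Gram matrix and check that c^T K c ≥ 0 for random vectors c.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                      # 6 made-up examples in R^3

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma)

# Gram matrix K_{i,j} = k(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

for _ in range(5):
    c = rng.normal(size=len(X))
    print(c @ K @ c >= -1e-10)                   # quadratic form is non-negative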
For the other direction we will prove a weaker result.
Theorem: Consider a finite input space X = {x_1, x_2, ..., x_m} and the kernel matrix K over the entire space. If K is positive semi-definite then k(., .) is a kernel function.
Proof: By the linear algebra facts above we can write K = P D P^T. Define a feature mapping into an m-dimensional space where the l-th entry of the feature expansion for example x_i is φ_l(x_i) = √λ_l (V_l)_i.
The inner product is then

φ(x_i) · φ(x_j) = Σ_{l=1}^{m} φ_l(x_i) φ_l(x_j) = Σ_{l=1}^{m} λ_l (V_l)_i (V_l)_j

We want to show that

k(x_i, x_j) = φ(x_i) · φ(x_j)

Consider the (i, j)-th entry of the matrix K, which equals k(x_i, x_j). We have the following identities, where the last one proves the result:

K_{i,j} = [P D P^T]_{i,j} = [[P D] P^T]_{i,j}

[P D] = (V_1, V_2, ..., V_m) D

[P D]_{i,l} = (V_l)_i λ_l

[[P D] P^T]_{i,j} = Σ_{l=1}^{m} (V_l)_i λ_l (V_l)_j
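
The construction in this proof can be carried out numerically. The sketch below is an addition to the notes; it uses a placeholder polynomial kernel and random data, builds φ_l(x_i) = √λ_l (V_l)_i from the eigendecomposition of K, and verifies that the resulting inner products reproduce K.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))                      # 5 made-up examples

def kernel(x, y):
    return (np.dot(x, y) + 1.0) ** 2             # polynomial kernel as the example k(., .)

K = np.array([[kernel(xi, xj) for xj in X] for xi in X])

eigvals, P = np.linalg.eigh(K)
eigvals = np.clip(eigvals, 0.0, None)            # clip tiny negative values from round-off

# Feature expansion: Phi[i, l] = sqrt(lambda_l) * (V_l)_i
Phi = P * np.sqrt(eigvals)

print(np.allclose(Phi @ Phi.T, K))               # phi(x_i) . phi(x_j) = K_{i,j}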

Note that Mercer’s theorem allows us to work with a kernel function without knowing which
feature map it corresponds to or its relevance to the learning problem. This has often been used
in practical applications.

4 More Properties of Kernels


Consider any space X of samples and kernels k_1(., .) and k_2(., .) over X. Then k(., .) is a kernel when defined in any of the following ways:
(1) k(x, y) = k_1(x, y) + k_2(x, y)
(2) k(x, y) = a k_1(x, y) where a > 0
(3) k(x, y) = f(x) · f(y) for any real-valued function f on X
(4) k(x, y) = k_1(x, y) · k_2(x, y)
(5) k(x, y) = k_1(x, y) / (√(k_1(x, x)) √(k_1(y, y)))

Proof of (1): The kernel matrix is K = K_1 + K_2, where we add the matrices component-wise. Then for all vectors c,

c^T K c = c^T K_1 c + c^T K_2 c ≥ 0,

so K is positive semi-definite, and the claim follows from Mercer's theorem.

Another proof: Let

φ1(x) = (φ1,1(x), ..., φ1,N1(x))
φ2(x) = (φ2,1(x), ..., φ2,N2(x))

be the feature maps for k_1 and k_2. Define φ(x) by concatenating the feature maps (or alternating features if the spaces are infinite):

φ(x) = (φ1,1(x), ..., φ1,N1(x), φ2,1(x), ..., φ2,N2(x))

The mapping clearly satisfies φ(x) · φ(y) = φ1(x) · φ1(y) + φ2(x) · φ2(y) = k_1(x, y) + k_2(x, y).
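
A small sketch of this construction (an addition, with placeholder feature maps): concatenating the feature vectors gives a map whose inner product is the sum of the two kernels.

import numpy as np

def phi1(x):
    return np.array([x[0], x[1]])                 # placeholder feature map for k_1

def phi2(x):
    return np.array([x[0] * x[1], x[0] ** 2, 1])  # placeholder feature map for k_2

def phi(x):
    return np.concatenate([phi1(x), phi2(x)])     # concatenated features

x, y = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
lhs = np.dot(phi(x), phi(y))
rhs = np.dot(phi1(x), phi1(y)) + np.dot(phi2(x), phi2(y))
print(np.isclose(lhs, rhs))                       # k(x, y) = k_1(x, y) + k_2(x, y)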
Proof of (2): Scale every feature by √a:

k(x, y) = (√a φ1,1(x), ..., √a φ1,N1(x)) · (√a φ1,1(y), ..., √a φ1,N1(y)) = a k_1(x, y)

Proof of (3): There is just one feature, defined by φ(x) = f(x), so k(x, y) = f(x) · f(y) is an inner product in a one-dimensional feature space.


Proof of (4): Multiply out the φ expressions for k_1 and k_2 to see that k is a kernel whose feature space consists of products of features from φ1 and φ2.
Proof of (5): Let φ1(x) be as above. Define φ(x) by φ_i(x) = φ1,i(x) / ||φ1(x)||, i.e., φ(x) = φ1(x) / ||φ1(x)||. Since ||φ1(x)|| = √(k_1(x, x)), k() calculates the inner product for φ().
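
To close, a short sketch (again an addition, with placeholder kernels) checks properties (4) and (5) numerically by confirming that the resulting Gram matrices remain positive semi-definite.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))

def k1(x, y):
    return (np.dot(x, y) + 1.0) ** 2              # placeholder kernel k_1

def k2(x, y):
    return np.exp(-np.sum((x - y) ** 2))          # placeholder kernel k_2

def gram(k):
    return np.array([[k(xi, xj) for xj in X] for xi in X])

K1, K2 = gram(k1), gram(k2)
K_prod = K1 * K2                                  # property (4): element-wise product
d = np.sqrt(np.diag(K1))
K_norm = K1 / np.outer(d, d)                      # property (5): normalized kernel

for K in (K_prod, K_norm):
    print(np.all(np.linalg.eigvalsh(K) >= -1e-8)) # eigenvalues stay non-negative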
