Lecture 11 Dimensionality Reduction
Discriminant function
Subspace Methods
Fisher’s method
Atsuto Maki
Autumn, 2020
• Dimensionality reduction
– Principal Component Analysis (PCA)
• Discriminant function
– Similarity measures: angle, projection length
• Subspace Methods
Principal Component Analysis (PCA)
1. Maximizing variance
[Figure: a data distribution in the (x1, x2) plane with the principal direction u1 through the centroid E(x)]

Mean vector of x: E(x) = (1/r) ∑ x   (r: number of samples)
Covariance matrix: Σ = E((x − E(x))(x − E(x))^T)
1. Maximum variance criterion
Reduce the effective number of variables
(keep only the components with the largest variances)
For i = 1, …, p, maximize the variance of the projection x^T ui:

E((x^T ui − E(x^T ui))^2)
  = E((ui^T (x − E(x)))^2)
  = ui^T E((x − E(x))(x − E(x))^T) ui = ui^T Σ ui   (Σ: the covariance matrix)

Condition (orthonormality): ui^T uj = δij

Equivalently, over all p directions: max tr(U^T Σ U)
The transformation matrix U consists of p columns:
the eigenvectors of the covariance matrix, Σ
(corresponding to its p largest eigenvalues).
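As a rough sketch (not part of the slides), this criterion can be carried out with NumPy by an eigendecomposition of the covariance matrix; the function and variable names (pca, X, p) are illustrative.

```python
import numpy as np

def pca(X, p):
    """PCA by the maximum-variance criterion.

    X: (r, n) array of r samples in n dimensions.
    p: number of principal directions to keep (p << n).
    Returns (U, mean): U is (n, p); its columns are the eigenvectors of the
    covariance matrix for the p largest eigenvalues.
    """
    mean = X.mean(axis=0)                   # E(x)
    Xc = X - mean                           # centre the data
    cov = Xc.T @ Xc / X.shape[0]            # covariance matrix Sigma
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1][:p]   # indices of the p largest eigenvalues
    return eigvecs[:, order], mean

# Usage: reduce 3-d samples to a 2-d representation.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
U, mean = pca(X, p=2)
Z = (X - mean) @ U                          # projected (maximum-variance) coordinates
```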
Example 3-d to 2-d: Ninety observations simulated in 3-d
The first 2 principal component directions span the plane that best fits the data.
It minimizes the sum of squared distances from each point to the plane.
Figure from An Introduction to Statistical Learning (James et al.)
Principal Component Analysis (PCA)
2. Min. approximation error
[Figure: a sample x in the (x1, x2) plane and its projection onto the direction u1; the distribution is viewed from the origin, not from the centroid]
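One way to see the connection (an added sketch, not from the slides): for an orthonormal set u1, …, up, minimizing the approximation error measured from the origin is again an eigenvalue problem, now for the autocorrelation matrix E(xx^T).

```latex
E\left\|x - \sum_{i=1}^{p} (x^{T}u_i)\,u_i\right\|^{2}
  = E\|x\|^{2} - \sum_{i=1}^{p} E\big[(x^{T}u_i)^{2}\big]
  = E\|x\|^{2} - \sum_{i=1}^{p} u_i^{T}\,E(xx^{T})\,u_i
% Minimizing the left-hand side maximizes the last sum, so the optimal
% u_1, ..., u_p are the eigenvectors of E(xx^T) with the p largest
% eigenvalues (the centroid E(x) is not subtracted here).
```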
Feature extraction
Pattern vectors: normalized & blurred patterns
…
What is the set of images of an object
under all possible lighting conditions?
(Harvard Database)
Concept of subspace
Subspace L is a collection of n-d vectors
spanned by a basis, i.e. a set of linearly independent vectors:

L(b1, …, bp) = { z | z = ∑_{i=1}^{p} ξi bi }   (ξi ∈ R, bi ∈ R^n)
Dimension of a subspace:
the number of basis vectors, p = dim(L) << n

[Figure: a 2-d subspace L(u1, u2), spanned by u1 and u2 through the origin O, conveniently represented in R^3]

Model (Dictionary): a subspace spanned by basis vectors (e.g. u0, u1)
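A small NumPy sketch (illustrative only) of representing a subspace by an orthonormal basis and projecting a vector onto it:

```python
import numpy as np

# Two linearly independent vectors b1, b2 in R^3 span a subspace L(b1, b2).
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])            # columns: b1, b2  (p = 2 << n = 3)
U, _ = np.linalg.qr(B)                # orthonormal basis u1, u2 of the same subspace

z = 0.7 * U[:, 0] - 1.2 * U[:, 1]     # any z = sum_i xi_i * u_i lies in L(u1, u2)
x = np.array([1.0, 2.0, 3.0])
x_proj = U @ (U.T @ x)                # orthogonal projection of x onto the subspace
```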
• Classification methods
– Discriminant function
– Subspace method
…
Nearest Neighbor methods (revisiting)
• Binary classification
– N1 samples of class C1
– N2 samples of class C2
– Unseen data x
→ Compute distances from x to all N1 + N2 samples

[Figure: an unseen point x among samples of classes C1, C2, C3 in the n-dim feature space]
Formulation: one prototype per class
– K classes: ω(1), …, ω(K)
– K prototypes: one representative vector (e.g. the class mean) per class
Discriminant function

[Diagram: input x → discriminant functions (one per class) → max → output class]
Setting the “don’t know” category
• Reject ("don't know") if the distance to the nearest prototype is above a threshold
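A minimal sketch of the one-prototype-per-class formulation with the reject option; the names nearest_prototype, prototypes, and reject_threshold are illustrative.

```python
import numpy as np

def nearest_prototype(x, prototypes, reject_threshold=None):
    """Classify x against K prototypes (one per class, e.g. the class means).

    prototypes: (K, n) array. Returns the class index, or None ("don't know")
    if even the smallest distance exceeds the threshold.
    """
    dists = np.linalg.norm(prototypes - x, axis=1)  # distance to each prototype
    k = int(np.argmin(dists))                       # nearest prototype = max discriminant
    if reject_threshold is not None and dists[k] > reject_threshold:
        return None                                 # reject: "don't know"
    return k
```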
Direction cosine as similarity
Think of the new input and the prototype as vectors.
Compute the cosine of the angle between the input vector x and the prototype vector.
"Simple similarity"
(the closer it is to 1, the more likely the input belongs to that class)
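A one-function sketch of this simple similarity (illustrative names):

```python
import numpy as np

def simple_similarity(x, prototype):
    """Direction cosine between input x and a class prototype (near 1 = similar)."""
    return float(x @ prototype / (np.linalg.norm(x) * np.linalg.norm(prototype)))
```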
Framework of Subspace Method
a.k.a. CLAFIC (CLAss-Featuring Information Compression)
1. Training: for each class ω(1), …, ω(K), compute a low-dimensional subspace that represents the distribution of that class.
2. Testing: determine the class of a new, unknown input by finding which subspace best approximates the input.
[Diagram. Training: compute subspace 1, …, subspace K. Testing: project the input vector onto each subspace, obtain Similarity 1, …, Similarity K, and take the max.]
Similarity in Subspace Method
Projection length to the subspace
S = ∑_{i=1}^{p} (x, ui)^2

p: dimension of the subspace
ui: reference vectors (orthonormal basis of the subspace)
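A sketch of the whole pipeline, assuming each class is given as an array of sample row vectors; following the origin-based view above, each class basis is taken as the leading eigenvectors of the (uncentered) class autocorrelation matrix, as in CLAFIC. Function names are illustrative.

```python
import numpy as np

def train_subspaces(class_samples, p):
    """For each class, compute a p-dimensional subspace (orthonormal basis).

    class_samples: list of (N_k, n) arrays, one per class.
    Returns a list of (n, p) bases: the leading eigenvectors of each class's
    autocorrelation matrix.
    """
    bases = []
    for X in class_samples:
        R = X.T @ X / X.shape[0]                 # autocorrelation matrix of the class
        eigvals, eigvecs = np.linalg.eigh(R)
        bases.append(eigvecs[:, np.argsort(eigvals)[::-1][:p]])
    return bases

def classify(x, bases):
    """Assign x to the class whose subspace gives the largest projection length
    S = sum_i (x, u_i)^2."""
    sims = [float(np.sum((U.T @ x) ** 2)) for U in bases]
    return int(np.argmax(sims))
```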
Similarity in Subspace Method (example)
Projection length to the subspace
[Figure: an input x1 projected onto the class subspaces of ω(1) and ω(2); the longer projection decides the class]
Within-class variance:

s_W^2 = (1/r) ∑_{i=1}^{K} ∑_{x∈ω(i)} (x − E^(i)(x))^T (x − E^(i)(x))

r: total number of samples;  E^(i)(x): average over class ω(i)

Between-class variance:

s_B^2 = (1/r) ∑_{i=1}^{K} r^(i) (E^(i)(x) − E(x))^T (E^(i)(x) − E(x))

r^(i): number of samples in class ω(i);  E(x): overall average
Within-class scatter: S_W ≡ S_1 + S_2

Between-class scatter:
S_B ≡ ∑_{i=1,2} r^(i) (E^(i)(x) − E(x))(E^(i)(x) − E(x))^T
    = (r^(1) r^(2) / r) (E^(1)(x) − E^(2)(x))(E^(1)(x) − E^(2)(x))^T
From the n-d feature space to a 1-d space by a matrix A:
A is an n × 1 matrix, i.e. an n-dim vector a in practice
→ The pattern becomes a scalar: y = A^T x
In the projected space:

Within-class (for class ω(i)):
∑_{y∈ω(i)} (y − E^(i)(y))^2 = A^T [ ∑_{x∈ω(i)} (x − E^(i)(x))(x − E^(i)(x))^T ] A = A^T S_i A

Between-class:
(r^(1) r^(2) / r) (A^T (E^(1)(x) − E^(2)(x)))^2 = A^T S_B A
Fisher's criterion: maximizing the ratio of between-class variance to within-class variance

J_S(A) ≡ Ŝ_B / Ŝ_W = (A^T S_B A) / (A^T S_W A)
Lagrange multiplier (condition: Ŝ_W = A^T S_W A = 1):

J(a) ≡ a^T S_B a − λ (a^T S_W a − 1) → Maximize

S_B a = λ S_W a  ⇔  S_W^{-1} S_B a = λ a

→ The eigenvector for the greatest eigenvalue of S_W^{-1} S_B
gives the A that maximises Fisher's criterion.
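A minimal two-class sketch (illustrative names X1, X2); it uses the closed form S_W^{-1}(E^(1)(x) − E^(2)(x)), which is proportional to the leading eigenvector of S_W^{-1} S_B in the two-class case.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's discriminant direction a for two classes (samples as rows).

    Maximizes J(a) = (a^T S_B a) / (a^T S_W a); for two classes the solution
    is proportional to S_W^{-1} (m1 - m2).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)        # scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)        # scatter of class 2
    SW = S1 + S2                        # within-class scatter
    a = np.linalg.solve(SW, m1 - m2)    # direction of S_W^{-1}(m1 - m2)
    return a / np.linalg.norm(a)

# Projection to 1-d: y = a^T x for each sample x.
```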