
Lecture 11: Dimensionality Reduction and Subspace Methods

The document discusses dimensionality reduction and subspace methods. Principal Component Analysis (PCA) is used to reduce the dimensionality of data while retaining variation. Discriminant functions classify new data based on distance to class prototypes. Subspace methods represent each class in a low-dimensional subspace and classify new data based on which subspace best approximates it. The dimensionality of each class subspace is important and can be determined by the cumulative contributions of eigenvalues of the data's autocorrelation matrix.


• Principal Component Analysis (PCA)
• Discriminant function
• Subspace Methods
• Fisher's method

Lecture 11: Dimensionality Reduction and Subspace Methods
DD2421

Atsuto Maki

Autumn, 2020



Our keywords today:

• Dimensionality reduction
– Principal Component Analysis (PCA)

• Discriminant function
– Similarity measures: angle, projection length

• Subspace Methods
Principal Component Analysis (PCA)
1. Maximizing variance

[Figure: data distribution in the (x1, x2) plane; the first principal direction u1 passes through the centroid E(x).]

Mean vector of x: E(x) = (1/r) Σ x   (r: number of samples)
Covariance matrix: Σ = E((x − E(x))(x − E(x))^T)

1. Maximum variance criterion
Reduce the effective number of variables (only dealing with the components with larger variances):

E((x^T u_i − E(x^T u_i))^2) → maximize, i = 1, …, p
  = E((u_i^T (x − E(x)))^2)
  = u_i^T E((x − E(x))(x − E(x))^T) u_i = u_i^T Σ u_i
Condition: u_i^T u_j = δ_ij

max tr(U^T Σ U)
The transformation matrix U consists of p columns: the eigenvectors of the covariance matrix Σ (corresponding to its p largest eigenvalues).
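As a minimal sketch of this criterion (assuming a data matrix `X` with one sample per row; the function name and data layout are illustrative, not from the lecture):

```python
import numpy as np

def pca_max_variance(X, p):
    """Eigenvectors of the sample covariance matrix for its p largest eigenvalues."""
    mean = X.mean(axis=0)                     # E(x), estimated over r samples
    Xc = X - mean                             # centre the data
    Sigma = Xc.T @ Xc / X.shape[0]            # covariance Σ = E((x-E(x))(x-E(x))^T)
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # symmetric matrix, eigenvalues ascending
    U = eigvecs[:, np.argsort(eigvals)[::-1][:p]]   # transformation matrix U (n x p)
    return U, mean

# Usage: Z = (X - mean) @ U projects each sample onto the p principal directions.
```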
Example 3-d to 2-d: Ninety observations simulated in 3-d

The first 2 principal component directions span the plane that best fits the data.
It minimizes the sum of squared distances from each point to the plane.

Figure from An Introduction to Statistical Learning (James et al.)
Principal Component Analysis (PCA)
2. Minimum approximation error

[Figure: the distribution is viewed from the origin, not from the centroid; u1 is the leading direction in the (x1, x2) plane.]

Autocorrelation matrix: Q = E(xx^T)


2. Minimum squared distance criterion
The averaged squared error between x and its approximation x' is to be minimized by a set {u_1, …, u_p}:

E(||x − x'||^2) → minimize

Approximation: x' = Σ_{i=1}^{p} (x^T u_i) u_i,   residual: x̃ = x − x'

||x'||^2 = ||x||^2 − ||x̃||^2 → maximize

The basis consists of p eigenvectors of the autocorrelation matrix Q (corresponding to its p largest eigenvalues).
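A corresponding sketch of this second criterion, assuming the same row-wise data matrix `X` (left uncentred, since the distribution is viewed from the origin); names are illustrative:

```python
import numpy as np

def pca_min_error(X, p):
    """Basis of p eigenvectors of the autocorrelation matrix Q = E(xx^T)."""
    Q = X.T @ X / X.shape[0]                  # autocorrelation matrix (no centering)
    eigvals, eigvecs = np.linalg.eigh(Q)
    return eigvecs[:, np.argsort(eigvals)[::-1][:p]]

def approximate(x, U):
    """x' = sum_i (x^T u_i) u_i: projection of x onto the subspace spanned by U."""
    return U @ (U.T @ x)

# Because the residual x - x' is orthogonal to the subspace,
# ||x'||^2 = ||x||^2 - ||x - x'||^2, so minimising the error
# maximises the projection length.
```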
PCA example 1: Hand-written digits

Feature extraction
Pattern vectors: normalized & blurred patterns

(figure credit: Y. Kurosawa)


Numeral characters (0-4) (figure credit: Y. Kurosawa)
Example 2: Human face classification
Basis vectors of a person: someone’s dictionary

(Eigenvectors from a large collection of his/her face images)

(figure credit: K. Fukui)


Example 3: Ship classification (profiles)
Profile vectors → Principal Component Analysis (PCA) → eigenvectors for the greatest eigenvalues.

What is the set of images of an object under all possible lighting conditions?
(Harvard Database)
Concept of subspace
A subspace L is a collection of n-d vectors spanned by a basis, a set of linearly independent vectors:

L(b_1, …, b_p) = { z | z = Σ_{i=1}^{p} ξ_i b_i }   (ξ_i ∈ R, b_i ∈ R^n)

Dimension of a subspace: the number of basis vectors, p = dim(L) << n

[Figure: a 2-d subspace L(u1, u2) in R^3, with origin O and basis vectors u1, u2.]

A subspace is conveniently represented by an orthonormal basis {u_1, …, u_p}.
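A small numerical illustration of these definitions (the vectors are hypothetical):

```python
import numpy as np

# Orthonormal basis {u1, u2} of a 2-d subspace L(u1, u2) in R^3
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 0.0])
U = np.column_stack([u1, u2])         # p = dim(L) = 2, n = 3

z = 3.0 * u1 - 2.0 * u2               # z = sum_i xi_i b_i lies in L
x = np.array([1.0, 1.0, 1.0])         # a vector outside L

# A vector lies in L exactly when it equals its own projection onto L
print(np.allclose(U @ (U.T @ z), z))  # True
print(np.allclose(U @ (U.T @ x), x))  # False
```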


• Variations of “9” covered by a 2-d subspace

[Figure: the two basis vectors u0 and u1 of the subspace.]

(figure credit: Y. Kurosawa)


Background: Schematic of classification

Training: samples (labeled) → feature extraction → training → model (dictionary)
Testing: new inputs (test data) → feature extraction → testing against the model → output class
Training phase
• Given: a limited number of labeled data (samples whose classes are known)
• The dimensionality is often too high for the limited number of samples

One approach to this is to find redundant variables and discard them, i.e. dimensionality reduction (without losing essential information).

Information compression: extract the class characteristics and throw away the rest!
Testing phase

• Various ways to measure the distance


– Euclidean / Mahalanobis distance
– Angle between vectors
– Projection length on subspaces

• Classification methods
– Discriminant function
– Subspace method

Nearest Neighbor methods (revisiting)
• Binary classification in an n-dim feature space
  – N1 samples of class C1
  – N2 samples of class C2
  – Unseen data x
  → Compute distances to all N1 + N2 samples
• Find the nearest neighbour → classify x to the same class (a minimal sketch follows below)
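A minimal 1-nearest-neighbour sketch of this procedure (array names are assumptions for illustration):

```python
import numpy as np

def nearest_neighbour(x, X_train, y_train):
    """Classify x by the label of its closest training sample,
    after computing Euclidean distances to all N1 + N2 samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored sample
    return y_train[np.argmin(dists)]              # label of the nearest neighbour
```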
Discriminant function
• Need to remember all the samples?
  – In k-NN we simply used all the training data
  – They still cover only a small portion of possible patterns
• Define a class by a few representative patterns
  – e.g. the centroid of the class distribution
  – Extreme case: one vector per class

[Figure: classes C1, C2, C3, each represented by a single prototype vector.]
Formulation: one prototype per class
– K classes: ω^(1), …, ω^(K)
– K prototypes, one per class

Consider the Euclidean distances between the new input x and the prototypes:
→ Choose the class that minimises the distance (a minimal sketch follows below).

[Diagram: input x → K discriminant functions → max → output class]
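A hedged sketch of the nearest-prototype classifier described above, one centroid per class, including the optional "don't know" rejection introduced just below (names are illustrative):

```python
import numpy as np

def fit_prototypes(X, y):
    """One prototype per class: the centroid of each class distribution."""
    classes = np.unique(y)
    prototypes = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, prototypes

def classify(x, classes, prototypes, reject_threshold=None):
    """Choose the class whose prototype minimises the Euclidean distance;
    optionally return None ("don't know") if that distance is too large."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    best = np.argmin(dists)
    if reject_threshold is not None and dists[best] > reject_threshold:
        return None
    return classes[best]
```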
Setting the "don't know" category
• Reject if the distance is above a threshold

[Figure: acceptance regions around the prototypes of classes C1 and C2.]
Direction cosine as similarity
Think of the new input and the prototype as vectors.
Compute the cosine between the input vector x and the prototype vector: the "simple similarity".
(The closer it is to 1, the more likely x is to belong to that class.)
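As a sketch, the "simple similarity" can be computed as the direction cosine (the function name is illustrative):

```python
import numpy as np

def simple_similarity(x, prototype):
    """Direction cosine between the input x and a class prototype;
    values close to 1 suggest x belongs to that class."""
    return float(x @ prototype / (np.linalg.norm(x) * np.linalg.norm(prototype)))
```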

Now let's extend the class representative to a set of basis vectors that spans a subspace.
Subspace Methods
• Exploit localization of pattern distributions
Samples in the same class, such as a digit (or face images of a person), are similar to each other.
They are localized in a subspace spanned by a set of basis vectors u_i.
u_i: reference vectors (orthonormal basis)

a.k.a. CLAFIC (CLAss-Featuring Information Compression)
Framework of Subspace Method
1. Training: for each class ω^(1), …, ω^(K), compute a low-dimensional subspace that represents the distribution in the class.
2. Testing: determine the class of a new unknown input by comparing which subspace best approximates the input.

[Diagram: training yields subspace 1, …, subspace K; at test time the input vector is projected onto each subspace, similarities 1, …, K are computed, and max decides the class.]
Similarity in Subspace Method
Projection length onto the subspace:

S = Σ_{i=1}^{p} (x, u_i)^2

p: dimension of the subspace
u_i: reference vectors (orthonormal basis)

[Figure: the input vector and its projection onto the subspace.]
Similarity in Subspace Method (example)
Projection length onto the subspace.
p: the dimensionality of the class subspace (can be determined for each class, but how?)
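A minimal sketch of the subspace (CLAFIC) method under these definitions: per-class bases from the autocorrelation matrix, and classification by projection length (names and data layout are assumptions):

```python
import numpy as np

def train_class_subspace(X_class, p):
    """Basis of a class subspace: p leading eigenvectors of Q = E(xx^T)."""
    Q = X_class.T @ X_class / X_class.shape[0]
    eigvals, eigvecs = np.linalg.eigh(Q)
    return eigvecs[:, np.argsort(eigvals)[::-1][:p]]   # n x p, orthonormal columns

def similarity(x, U):
    """Projection length S = sum_i (x, u_i)^2 onto the class subspace."""
    return float(np.sum((U.T @ x) ** 2))

def classify(x, subspaces):
    """Assign x to the class whose subspace best approximates it."""
    return int(np.argmax([similarity(x, U) for U in subspaces]))
```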
Dimensionality of a class subspace
Eigenvalues of the autocorrelation matrix Q: λ_1 ≥ … ≥ λ_j ≥ … ≥ λ_p ≥ 0
The number of dimensions to be used for each class:
– Too low → low capability to represent the class
– Too high → issue of overlapping across classes

• Cumulative contributions
Choose a dimension p^(i) for each class ω^(i):

a(p^(i)) = Σ_{j=1}^{p^(i)} λ_j / Σ_{j=1}^{p} λ_j,   with a(p^(i)) ≤ κ ≤ a(p^(i) + 1)   (κ: common value)

This way the projection length onto the subspace is made uniform across classes.

Experiments are still needed to find a good dimensionality (a selection sketch follows below).
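One common way to pick p^(i) from the cumulative contribution, sketched under the assumption that κ is a common value shared by all classes (e.g. 0.95):

```python
import numpy as np

def choose_dimension(eigvals, kappa=0.95):
    """Smallest p whose cumulative contribution a(p) reaches kappa."""
    eigvals = np.sort(eigvals)[::-1]            # lambda_1 >= ... >= lambda_p >= 0
    a = np.cumsum(eigvals) / np.sum(eigvals)    # a(p) = sum_{j<=p} lambda_j / sum_j lambda_j
    return int(np.searchsorted(a, kappa) + 1)   # 1-indexed dimension
```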


What is a good dimension (direction) for classification, given labels?
Ideal distributions of input pattern vectors:
• Patterns from an identical class should be close
• Patterns from different classes should be apart

[Figure: distributions of classes ω^(1) and ω^(2) in the (x1, x2) plane, once well separated and once overlapping.]

→ Overlapping distributions are harmful for classification


Ratio of between-class variance to within-class variance

Within-class variance (E^(i)(x): average in class ω^(i), r: total number of samples):

s_W^2 = (1/r) Σ_{i=1}^{K} Σ_{x ∈ ω^(i)} (x − E^(i)(x))^T (x − E^(i)(x))

Between-class variance (r^(i): number of samples in class ω^(i), E(x): overall average):

s_B^2 = (1/r) Σ_{i=1}^{K} r^(i) (E^(i)(x) − E(x))^T (E^(i)(x) − E(x))

Between-class / within-class variance ratio:

J_s = s_B^2 / s_W^2

In short: the distance between classes normalized by the distance within a class → the larger the better! (A sketch of this computation follows below.)
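A sketch of the ratio J_s computed from labeled data, following the definitions above (names are illustrative):

```python
import numpy as np

def variance_ratio(X, y):
    """J_s = between-class variance / within-class variance (larger is better)."""
    mean_all = X.mean(axis=0)
    r = X.shape[0]                                   # total number of samples
    s_w, s_b = 0.0, 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)                     # average in class
        s_w += np.sum((Xc - mean_c) ** 2) / r        # within-class variance
        s_b += Xc.shape[0] * np.sum((mean_c - mean_all) ** 2) / r   # between-class variance
    return s_b / s_w
```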
Fisher's method
Find a subspace most suitable for classification (discriminant analysis).
Given pattern distributions in 2 classes
⇒ find the optimal axis direction where J is maximized.

Scatter matrix (represents the variation within a class):

S_i ≡ Σ_{x ∈ ω^(i)} (x − E^(i)(x))(x − E^(i)(x))^T

Within-class: S_W ≡ S_1 + S_2
Between-class: S_B ≡ Σ_{i=1,2} r^(i) (E^(i)(x) − E(x))(E^(i)(x) − E(x))^T
             = (r^(1) r^(2) / r) (E^(1)(x) − E^(2)(x))(E^(1)(x) − E^(2)(x))^T
From the n-d feature space to a 1-d space by a matrix A:
A is an n × 1 matrix, i.e. an n-dim vector a in practice
→ each pattern becomes a scalar y = A^T x.

Scatter matrix in the space after the transformation:

Ŝ_i ≡ Σ_{y ∈ ω^(i)} (y − E^(i)(y))(y − E^(i)(y))^T
    = A^T Σ_{x ∈ ω^(i)} (x − E^(i)(x))(x − E^(i)(x))^T A = A^T S_i A

Within-class: Ŝ_W ≡ Ŝ_1 + Ŝ_2 = A^T S_1 A + A^T S_2 A = A^T S_W A
Between-class: Ŝ_B ≡ Σ_{i=1,2} r^(i) (E^(i)(y) − E(y))^2   (a scalar)
             = (r^(1) r^(2) / r) A^T (E^(1)(x) − E^(2)(x))(E^(1)(x) − E^(2)(x))^T A = A^T S_B A
Fisher's criterion: maximizing the ratio of between-class variance to within-class variance,

J_S(A) ≡ Ŝ_B / Ŝ_W = (A^T S_B A) / (A^T S_W A)

Lagrange multiplier (with the condition Ŝ_W = I):

J(a) ≡ a^T S_B a − λ (a^T S_W a − I) → maximize
S_B a = λ S_W a
⇔ S_W^{-1} S_B a = λ a
⇔ max{J_S(a)} = λ_1, the greatest eigenvalue of S_W^{-1} S_B

→ The eigenvector for the greatest eigenvalue of S_W^{-1} S_B gives the A that maximises Fisher's criterion.
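A minimal two-class sketch of this result: for K = 2 the leading eigenvector of S_W^{-1} S_B is proportional to S_W^{-1}(E^(1)(x) − E^(2)(x)), so the direction can be obtained with one linear solve (names are illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Direction a maximising J_S(a) = (a^T S_B a) / (a^T S_W a) for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)          # scatter matrix of class 1
    S2 = (X2 - m2).T @ (X2 - m2)          # scatter matrix of class 2
    S_W = S1 + S2                         # within-class scatter
    a = np.linalg.solve(S_W, m1 - m2)     # proportional to the optimal direction
    return a / np.linalg.norm(a)

# Projection to 1-d: y = a^T x for each pattern x.
```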
