Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals
Principal Components Analysis (PCA)
• An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D
• Can be used to:
– Reduce number of dimensions in data
– Find patterns in high-dimensional data
– Visualize data of high dimensionality
• Example applications:
– Face recognition
– Image compression
– Gene expression analysis
2
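To make the "visualize data of high dimensionality" use concrete, here is a minimal sketch (not from the slides) using scikit-learn's PCA; the data matrix, its shape, and all variable names are made up for illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))        # hypothetical data: 100 observations, 50 attributes

    pca = PCA(n_components=2)             # keep only the first two principal components
    X_2d = pca.fit_transform(X)           # shape (100, 2), suitable for a 2D scatter plot

    print(pca.explained_variance_ratio_)  # fraction of total variance captured by PC1 and PC2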
Principal Components Analysis Ideas (PCA)
3
Principal Component Analysis
• See online tutorials such as
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
[Figure: scatter plot of data points in the (X1, X2) plane with the two principal directions Y1 and Y2 overlaid.]
• Note: Y1 is the first eigenvector, Y2 is the second; Y2 is ignorable.
• Key observation: the variance along Y1 is the largest.
4
Principal Component Analysis: one attribute first
• Question: how much spread is in the data along the axis? (distance to the mean)
• Variance = (Standard deviation)^2
$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Temperature (example data): 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
5
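As a quick numerical check of the variance formula, a short numpy sketch on the Temperature values above (ddof=1 gives the n − 1 denominator used on the slide):

    import numpy as np

    temperature = np.array([42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30])

    # Sample variance with the (n - 1) denominator, matching the formula above
    s2 = temperature.var(ddof=1)
    print(s2)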
Now consider two dimensions
• Covariance: measures how X and Y vary together
  – cov(X,Y) = 0: no linear relationship between X and Y
  – cov(X,Y) > 0: X and Y move in the same direction
  – cov(X,Y) < 0: X and Y move in opposite directions
$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
Example data (X = Temperature, Y = Humidity):
X   Y
40  90
40  90
40  90
30  90
15  70
15  70
15  70
30  90
15  70
30  70
30  70
30  90
6
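A corresponding sketch for the covariance of the two attributes, using the Temperature/Humidity rows listed above (np.cov uses the n − 1 denominator by default):

    import numpy as np

    temperature = np.array([40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30])
    humidity    = np.array([90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90])

    # Sample covariance, matching the formula above
    cov_xy = np.cov(temperature, humidity)[0, 1]
    print(cov_xy)  # positive: the two attributes tend to move in the same direction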
More than two attributes: covariance matrix
• Contains covariance values between all possible pairs of dimensions (= attributes):
$C_{n \times n} = (c_{ij} \mid c_{ij} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j))$
• Example of the eigenvalue relation used on the next slide ($Ax = \lambda x$):
$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$
8
Eigenvalues & eigenvectors
• $Ax = \lambda x \iff (A - \lambda I)x = 0$
• How to calculate x and $\lambda$:
  – Calculate $\det(A - \lambda I)$; this yields a polynomial of degree n
  – Determine the roots of $\det(A - \lambda I) = 0$; the roots are the eigenvalues $\lambda$
  – Solve $(A - \lambda I)x = 0$ for each $\lambda$ to obtain the eigenvectors x
9
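A small sketch checking this procedure numerically, using the 2x2 example matrix from the previous slide; note that np.linalg.eig returns unit-length eigenvectors, so (3, 2) appears only up to a scale factor:

    import numpy as np

    A = np.array([[2.0, 3.0],
                  [2.0, 1.0]])

    # Eigenvalues are the roots of det(A - lambda*I) = 0
    eigenvalues, eigenvectors = np.linalg.eig(A)
    print(eigenvalues)         # 4 and -1 for this matrix
    print(eigenvectors[:, 0])  # column j is the (unit-length) eigenvector for eigenvalues[j]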
Principal components
• 1st principal component (PC1)
  – The eigenvector whose eigenvalue has the largest absolute value; the data have the largest variance along this eigenvector, i.e. it is the direction of greatest variation
• 2nd principal component (PC2)
  – the direction with the maximum variation left in the data, orthogonal to the 1st PC
• In general, only a few directions capture most of the variability in the data.
10
Steps of PCA
• Let $\bar{X}$ be the mean vector (taking the mean of all rows)
• Adjust the original data by the mean: $X' = X - \bar{X}$
• Compute the covariance matrix C of the adjusted X
• Find the eigenvectors and eigenvalues of C
  – For matrix C, an eigenvector is a (column) vector e having the same direction as Ce, i.e. $Ce = \lambda e$; $\lambda$ is called an eigenvalue of C
  – $Ce = \lambda e \iff (C - \lambda I)e = 0$
  – Most data mining packages do this for you.
11
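A minimal numpy sketch of these steps on a hypothetical data matrix X (rows = observations, columns = attributes); as the slide notes, in practice a data mining package does this for you:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))              # hypothetical data: 50 rows, 3 attributes

    X_mean = X.mean(axis=0)                   # mean vector (mean of all rows)
    X_adj = X - X_mean                        # adjust the original data by the mean
    C = np.cov(X_adj, rowvar=False)           # covariance matrix of the adjusted data

    # Eigenvalues/eigenvectors of C (eigh is appropriate since C is symmetric): Ce = lambda * e
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    print(eigenvalues)                        # one eigenvalue per attribute
    print(eigenvectors)                       # columns are the corresponding eigenvectors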
Eigenvalues
• Calculate the eigenvalues $\lambda$ and eigenvectors x of the covariance matrix:
  – Eigenvalues $\lambda_j$ are used to calculate the percentage of total variance $V_j$ captured by each component j:
$V_j = 100 \cdot \frac{\lambda_j}{\sum_{x=1}^{n} \lambda_x}, \qquad \sum_{x=1}^{n} \lambda_x = n$
12
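Given the eigenvalues, the percentage of total variance per component is a one-liner; the eigenvalue list below is made up for illustration:

    import numpy as np

    eigenvalues = np.array([4.5, 2.0, 1.0, 0.5])   # hypothetical eigenvalues of a covariance matrix

    # V_j = 100 * lambda_j / sum of all eigenvalues
    V = 100 * eigenvalues / eigenvalues.sum()
    print(V)  # percentages of total variance; they sum to 100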
Principal components - Variance
[Figure: scree plot showing the variance (%) captured by each component PC1 through PC10.]
13
Transformed Data
• Eigenvalue $\lambda_j$ corresponds to the variance along component j
• Thus, sort the eigenvectors by $\lambda_j$
• Take the first p eigenvectors $e_i$, where p is the number of top eigenvalues
• These are the directions with the largest variances
$\begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ip} \end{pmatrix} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_p \end{pmatrix} \begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \\ \vdots \\ x_{in} - \bar{x}_n \end{pmatrix}$ (each $e_j$ written as a row vector)
14
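A sketch of this transformation in numpy: sort the eigenpairs by eigenvalue, keep the top p eigenvectors, and project the mean-adjusted data onto them (the data matrix and p are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))                             # hypothetical data
    X_adj = X - X.mean(axis=0)                               # mean-adjusted data
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_adj, rowvar=False))

    order = np.argsort(eigenvalues)[::-1]                    # sort by eigenvalue, largest first
    p = 2                                                    # number of top components to keep
    E_p = eigenvectors[:, order[:p]]                         # first p eigenvectors as columns

    Y = X_adj @ E_p                                          # y_ij = e_j . (x_i - mean); shape (50, p)
    print(Y.shape)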
An Example
Mean1 = 24.1, Mean2 = 53.8

X1   X2   X1'    X2'
19   63   -5.1    9.25
39   74   14.9   20.25
30   87    5.9   33.25
30   23    5.9  -30.75
15   35   -9.1  -18.75
15   43   -9.1  -10.75
15   32   -9.1  -21.75

[Figure: scatter plots of the original data (X1 vs. X2) and of the mean-adjusted data (X1' vs. X2').]
15
Covariance Matrix
• $C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}$
16
If we only keep one dimension: e2
• We keep the dimension of $e_2 = (0.21, -0.98)$
• We can obtain the final data as
$y_i = \begin{pmatrix} 0.21 & -0.98 \end{pmatrix} \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} = 0.21\, x_{i1} - 0.98\, x_{i2}$
  (applied to the mean-adjusted values X1', X2' from the example)
• Resulting $y_i$: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63
[Figure: the projected values $y_i$ plotted along the e2 axis.]
17
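A small check of this projection, plugging the mean-adjusted rows (X1', X2') from the example table into e2; this reproduces the yi values listed above for those rows:

    import numpy as np

    # Mean-adjusted data (X1', X2') from the example table
    X_adj = np.array([[-5.1,   9.25],
                      [14.9,  20.25],
                      [ 5.9,  33.25],
                      [ 5.9, -30.75],
                      [-9.1, -18.75],
                      [-9.1, -10.75],
                      [-9.1, -21.75]])

    e2 = np.array([0.21, -0.98])   # the eigenvector kept on the slide
    y = X_adj @ e2                 # y_i = 0.21 * x_i1 - 0.98 * x_i2
    print(y)                       # approx. -10.14, -16.72, -31.35, 31.37, 16.46, 8.62, 19.40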
PCA → Original Data
• Retrieving the old data (e.g. in data compression):
  – RetrievedRowData = (RowFeatureVector^T x FinalData) + OriginalMean
  – Yields the original data using the chosen components
21
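A row-oriented numpy sketch of this reconstruction, reusing a few mean-adjusted rows from the example and the single kept eigenvector e2; with only one of two components kept, the retrieved data are only an approximation of the original:

    import numpy as np

    X_adj = np.array([[-5.1, 9.25], [14.9, 20.25], [5.9, 33.25]])  # a few mean-adjusted example rows
    original_mean = np.array([24.1, 53.8])                         # OriginalMean from the example
    E_p = np.array([[0.21], [-0.98]])                              # kept eigenvector(s) as columns

    final_data = X_adj @ E_p                        # FinalData: data expressed in the kept components
    retrieved = final_data @ E_p.T + original_mean  # (RowFeatureVector^T x FinalData) + OriginalMean
    print(retrieved)                                # approximately the original rows (exact only if all components are kept)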
Principal components
• General properties of principal components:
– summary variables
– linear combinations of the original variables
– uncorrelated with each other
– capture as much of the original variance as
possible
22
Applications – Gene expression analysis
• Reference: Raychaudhuri et al. (2000)
• Purpose: Determine core set of conditions for useful
gene comparison
• Dimensions: conditions, observations: genes
• Yeast sporulation dataset (7 conditions, 6118 genes)
• Result: Two components capture most of variability (90%)
• Issues: uneven data intervals, data dependencies
• PCA is common prior to clustering
• Crisp clustering questioned: genes may correlate with multiple
clusters
• Alternative: determination of gene’s closest neighbours
23
Two Way (Angle) Data Analysis
[Figure: two views of the gene expression matrix — one with genes (10^3–10^4) as rows and conditions (10^1–10^2) as columns, and one with samples (10^1–10^2) as rows and genes (10^3–10^4) as columns.]
25
PCA on all Genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 8973 genes
reduced to 2
26
PCA on 100 top significant genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 100 genes
reduced to 2
27
PCA of genes (Leukemia data)
Plot of 8973 genes, dimension of 34 patients reduced to 2
28