
School of Computing Science and Engineering

Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals
Principal Components Analysis (PCA)
• An exploratory technique used to reduce the dimensionality of a data set to 2D or 3D
• Can be used to:
– Reduce the number of dimensions in the data
– Find patterns in high-dimensional data
– Visualize data of high dimensionality
• Example applications:
– Face recognition
– Image compression
– Gene expression analysis
Principal Components Analysis (PCA): Ideas
• Does the data set 'span' the whole of d-dimensional space?
• For a matrix of m samples × n genes, create a new covariance matrix of size n × n.
• Transform a large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
• PCs are constructed to capture as much of the variation in the data as possible.
Principal Component Analysis
• See online tutorials such as http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

[Figure: scatter of data points in the (X1, X2) plane with two new axes Y1 and Y2 drawn through the data. Note: Y1 is the first eigenvector, Y2 the second. Key observation: Y1 points along the direction of largest variance; Y2 is orthogonal to it and ignorable.]
Principal Component Analysis: one attribute first
• Consider a single attribute, Temperature:
42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
• Question: how much spread is in the data along the axis (distance to the mean)?
• Variance = (standard deviation)², computed as follows (see the numpy sketch below):

$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
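As a check, the variance above can be reproduced with numpy. A minimal sketch, assuming the 14 temperature readings listed on this slide:

```python
import numpy as np

# Temperature readings from the table above
temperature = np.array([42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30])

# Sample variance: squared deviations from the mean, divided by n - 1
mean = temperature.mean()
s2 = ((temperature - mean) ** 2).sum() / (len(temperature) - 1)

# np.var with ddof=1 uses the same (n - 1) denominator
assert np.isclose(s2, temperature.var(ddof=1))
print(s2)
```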
Now consider two dimensions
• X = Temperature, Y = Humidity:

X=Temperature  Y=Humidity
40             90
40             90
40             90
30             90
15             70
15             70
15             70
30             90
15             70
30             70
30             70
30             90

• Covariance measures the correlation between X and Y (a numpy sketch follows below):
– cov(X,Y) = 0: X and Y are uncorrelated (vary independently)
– cov(X,Y) > 0: X and Y move in the same direction
– cov(X,Y) < 0: X and Y move in opposite directions

$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
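The covariance formula translates directly into numpy. A minimal sketch using the twelve (Temperature, Humidity) pairs from the table above, with np.cov as a cross-check:

```python
import numpy as np

# (Temperature, Humidity) pairs from the table above
X = np.array([40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30])
Y = np.array([90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90])

# Sample covariance: products of deviations, divided by n - 1
cov_xy = ((X - X.mean()) * (Y - Y.mean())).sum() / (len(X) - 1)

# np.cov returns the full 2x2 covariance matrix; its off-diagonal
# entry is cov(X, Y)
assert np.isclose(cov_xy, np.cov(X, Y)[0, 1])
print(cov_xy)   # positive: temperature and humidity move together here
```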
More than two attributes: covariance matrix
• Contains the covariance values between all possible pairs of dimensions (= attributes):

$C_{n \times n} = \left( c_{ij} \mid c_{ij} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j) \right)$

• Example for three attributes (x, y, z):

$C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$
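For more than two attributes, np.cov builds the whole matrix in one call. A sketch with made-up values for three attributes x, y, z (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Rows = observations, columns = the three attributes (x, y, z)
D = np.array([[40.0, 90.0, 3.0],
              [30.0, 90.0, 2.0],
              [15.0, 70.0, 1.0],
              [30.0, 70.0, 2.5]])

# rowvar=False treats columns as variables, yielding the 3x3 matrix
# [[cov(x,x), cov(x,y), cov(x,z)],
#  [cov(y,x), cov(y,y), cov(y,z)],
#  [cov(z,x), cov(z,y), cov(z,z)]]
C = np.cov(D, rowvar=False)
print(C)   # symmetric, since cov(x, y) == cov(y, x)
```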
Eigenvalues & eigenvectors
• Vectors x having the same direction as Ax are called eigenvectors of A (where A is an n × n matrix).
• In the equation $Ax = \lambda x$, $\lambda$ is called an eigenvalue of A.
• Example:

$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$
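The worked example can be verified numerically. A minimal sketch; np.linalg.eig is one standard routine for finding eigenpairs:

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
x = np.array([3.0, 2.0])

# A @ x equals 4 * x, so x is an eigenvector of A with eigenvalue 4
assert np.allclose(A @ x, 4 * x)

# eig returns all eigenvalues and unit-length eigenvectors; the
# eigenvector for 4 is a rescaled version of [3, 2]
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # 4.0 and -1.0 (order not guaranteed)
```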
Eigenvalues & eigenvectors
• $Ax = \lambda x \iff (A - \lambda I)x = 0$
• How to calculate x and $\lambda$:
– Calculate $\det(A - \lambda I)$; this yields a polynomial of degree n
– Determine the roots of $\det(A - \lambda I) = 0$; the roots are the eigenvalues $\lambda$
– Solve $(A - \lambda I)x = 0$ for each $\lambda$ to obtain the eigenvectors x (see the sketch below)
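The same recipe can be followed numerically: np.poly returns the coefficients of $\det(A - \lambda I)$ and np.roots finds the roots of that polynomial. A sketch reusing the matrix from the previous slide:

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# Coefficients of the characteristic polynomial det(A - lambda*I);
# for this A it is lambda^2 - 3*lambda - 4
coeffs = np.poly(A)

# The roots of the polynomial are the eigenvalues
lambdas = np.roots(coeffs)
print(np.sort(lambdas))   # [-1.  4.]

# Solving (A - lambda*I) x = 0 per eigenvalue gives the eigenvectors;
# np.linalg.eig performs both steps at once
lambdas2, eigenvectors = np.linalg.eig(A)
```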
Principal components
• First principal component (PC1)
– The eigenvalue with the largest absolute value indicates that the data have the largest variance along its eigenvector, the direction of greatest variation
• Second principal component (PC2)
– The direction with the maximum variation left in the data, orthogonal to PC1
• In general, only a few directions capture most of the variability in the data.
Steps of PCA
• Let $\bar{X}$ be the mean vector (taking the mean of all rows).
• Adjust the original data by the mean: $X' = X - \bar{X}$.
• Compute the covariance matrix C of the adjusted X.
• Find the eigenvectors and eigenvalues of C:
– For matrix C, the eigenvectors are column vectors e having the same direction as Ce, i.e. e such that $Ce = \lambda e$, where $\lambda$ is called an eigenvalue of C.
– $Ce = \lambda e \iff (C - \lambda I)e = 0$
– Most data mining packages do this for you (a sketch follows below).
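Taken together, the steps translate almost line for line into numpy. A minimal sketch, assuming `data` is a hypothetical samples-by-attributes array:

```python
import numpy as np

def pca_steps(data):
    """Steps of PCA: mean-adjust, covariance matrix, eigendecomposition."""
    # Mean vector over all rows, then adjust the data: X' = X - mean
    mean = data.mean(axis=0)
    adjusted = data - mean

    # Covariance matrix C of the adjusted data (columns = attributes)
    C = np.cov(adjusted, rowvar=False)

    # Eigenvectors and eigenvalues of C; eigh is appropriate because
    # a covariance matrix is symmetric
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    return mean, eigenvalues, eigenvectors
```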
Eigenvalues
• Calculate the eigenvalues $\lambda$ and eigenvectors x of the covariance matrix.
• The eigenvalues $\lambda_j$ are used to calculate the percentage of total variance $V_j$ captured by each component j (see the sketch below):

$V_j = 100 \cdot \frac{\lambda_j}{\sum_{x=1}^{n} \lambda_x}$
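The same formula in code, continuing the sketch above; the two eigenvalues from the worked example later in the slides are used purely as an illustration:

```python
import numpy as np

def percent_variance(eigenvalues):
    # V_j = 100 * lambda_j / (sum over all eigenvalues)
    return 100.0 * eigenvalues / eigenvalues.sum()

# Illustration with the eigenvalues 560.2 and 51.8 from the later example:
print(percent_variance(np.array([560.2, 51.8])))   # approx. [91.5, 8.5]
```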
Principal components - Variance

[Figure: scree plot of the variance (%) captured by each component, PC1 through PC10; the vertical axis runs from 0 to 25%.]
Transformed Data
• Each eigenvalue $\lambda_j$ corresponds to the variance on component j.
• Thus, sort the eigenvectors by $\lambda_j$.
• Take the first p eigenvectors $e_i$, where p is the number of top eigenvalues.
• These are the directions with the largest variances; the transformation is (see the sketch below):

$\begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ip} \end{pmatrix} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_p \end{pmatrix} \begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \\ \vdots \\ x_{in} - \bar{x}_n \end{pmatrix}$
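A sketch of the full transformation: sort the eigenpairs by decreasing eigenvalue, keep the top p eigenvectors, and project the mean-adjusted rows onto them. The function name `transform` is illustrative:

```python
import numpy as np

def transform(data, p):
    mean = data.mean(axis=0)
    C = np.cov(data - mean, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # Sort eigenpairs by decreasing eigenvalue; keep the top p columns
    order = np.argsort(eigenvalues)[::-1]
    E_p = eigenvectors[:, order[:p]]       # n x p matrix of directions

    # y_i = E_p^T (x_i - mean) for every row, as one matrix product
    return (data - mean) @ E_p
```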
An Example
• Mean1 = 24.1, Mean2 = 53.8
• Original data (X1, X2) and mean-adjusted data (X1' = X1 - Mean1, X2' = X2 - Mean2):

X1   X2   X1'    X2'
19   63   -5.1    9.25
39   74   14.9   20.25
30   87    5.9   33.25
30   23    5.9  -30.75
15   35   -9.1  -18.75
15   43   -9.1  -10.75
15   32   -9.1  -21.75

[Figure: two scatter plots (Series1), one of the original (X1, X2) data and one of the mean-adjusted (X1', X2') data centred on the origin.]
Covariance Matrix
• $C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}$
• Using MATLAB, we find the eigenvectors and eigenvalues:
– $e_1 = (-0.98, -0.21)$, $\lambda_1 = 51.8$
– $e_2 = (0.21, -0.98)$, $\lambda_2 = 560.2$
– Thus the second eigenvector is more important!
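The decomposition can be checked with numpy. A sketch; numpy normalizes eigenvector signs and computes unrounded values, so the printed numbers need not match the slide's rounded figures digit for digit:

```python
import numpy as np

C = np.array([[ 75.0, 106.0],
              [106.0, 482.0]])

# eigh returns eigenvalues in ascending order for a symmetric matrix,
# so the last eigenpair is the dominant one (the slide's e2)
eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues)          # smaller eigenvalue first, larger second
print(eigenvectors[:, -1])  # direction of greatest variance
```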
If we only keep one dimension: e2
• We keep the dimension of $e_2 = (0.21, -0.98)$.
• We can obtain the final data by projecting the mean-adjusted values onto $e_2$:

$y_i = \begin{pmatrix} 0.21 & -0.98 \end{pmatrix} \begin{pmatrix} x'_{i1} \\ x'_{i2} \end{pmatrix} = 0.21\, x'_{i1} - 0.98\, x'_{i2}$

• Resulting values $y_i$: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63

[Figure: the values $y_i$ plotted along a single axis.]
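The projected values can be reproduced from the seven mean-adjusted rows recoverable from the earlier example table (a sketch; the source row of the eighth projection is not shown on the slides):

```python
import numpy as np

# Mean-adjusted (X1', X2') rows from the earlier example table
adjusted = np.array([[-5.1,   9.25],
                     [14.9,  20.25],
                     [ 5.9,  33.25],
                     [ 5.9, -30.75],
                     [-9.1, -18.75],
                     [-9.1, -10.75],
                     [-9.1, -21.75]])

e2 = np.array([0.21, -0.98])

# y_i = 0.21 * x1' - 0.98 * x2' for every row
y = adjusted @ e2
print(y)   # -10.14, -16.72, -31.35, 31.37, 16.46, 8.62, 19.40
```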
PCA –> Original Data
• Retrieving the old data (e.g. in data compression):
– RetrievedRowData = (RowFeatureVector^T x FinalData) + OriginalMean
– This yields the original data using the chosen components (see the sketch below).
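A minimal sketch of the retrieval step, reusing `E_p` and `mean` from the transformation sketch above; with all components kept the reconstruction is exact, with fewer components it is an approximation:

```python
import numpy as np

def retrieve(y, E_p, mean):
    # RetrievedRowData = (RowFeatureVector^T x FinalData) + OriginalMean:
    # map the p-dimensional scores back to the original space and
    # add the mean back in
    return y @ E_p.T + mean
```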
Principal components
• General properties of principal components:
– summary variables
– linear combinations of the original variables
– uncorrelated with each other
– capture as much of the original variance as possible
Applications – Gene expression analysis
• Reference: Raychaudhuri et al. (2000)
• Purpose: determine a core set of conditions for useful gene comparison
• Dimensions: conditions; observations: genes
• Yeast sporulation dataset (7 conditions, 6118 genes)
• Result: two components capture most of the variability (90%)
• Issues: uneven data intervals, data dependencies
• PCA is commonly applied prior to clustering
• Crisp clustering is questioned: genes may correlate with multiple clusters
• Alternative: determination of each gene's closest neighbours
Two Way (Angle) Data Analysis
• A gene expression matrix has genes on one axis ($10^3$–$10^4$) and conditions/samples on the other ($10^1$–$10^2$).
• The same matrix supports two analyses:
– Sample space analysis: compare samples/conditions, treating genes as dimensions.
– Gene space analysis: compare genes, treating samples/conditions as dimensions.
PCA - example

[Figure: example PCA plot.]
PCA on all Genes
Leukemia data, precursor B and T. Plot of 34 patients, with the dimension of 8973 genes reduced to 2.

[Figure: scatter of the 34 patients on the first two principal components.]
PCA on 100 top significant genes
Leukemia data, precursor B and T. Plot of 34 patients, with the dimension of 100 genes reduced to 2.

[Figure: scatter of the 34 patients on the first two principal components of the 100 genes.]
PCA of genes (Leukemia data)
Plot of 8973 genes, with the dimension of 34 patients reduced to 2.

[Figure: scatter of the 8973 genes on the first two principal components.]
