Bia b350f Unit 4
Thus, there is a need to reduce the number of variables to a few significant linear
combinations of the data that are easier to interpret and analyze. Under this context, each
linear combination will correspond to a principal component.
Principal component analysis is a technique that is used to simplify a data set. It can be
used to reduce dimensionality by eliminating principal components that are considered to
be relatively less important.
Cov(Yi, Yj) = ai' Σ aj,  i, j = 1, 2, …, k

In other words, the k principal components are those uncorrelated linear combinations Yi = ai'x with the top k largest variances Var(Yi) = ai' Σ ai that one could generate from different choices of the coefficient vector ai, provided that ai'ai = 1.
It turns out that the i-th eigenvector of Σ is the coefficient vector of the i-th principal component, and that the variance of the i-th principal component equals the i-th eigenvalue λi. The resulting principal components are uncorrelated with one another.
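These facts can be checked numerically. Below is a minimal NumPy sketch (the unit's own code is in R; the 3×3 covariance matrix here is made up for illustration) confirming that each eigenvector of Σ gives the coefficients of a principal component whose variance is the matching eigenvalue, and that distinct components are uncorrelated:

```python
import numpy as np

# Hypothetical 3x3 covariance matrix, for illustration only
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# eigh handles symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]            # re-sort so lambda_1 is largest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

for i in range(3):
    a_i = eigvecs[:, i]                      # coefficient vector of the i-th PC
    assert np.isclose(a_i @ a_i, 1.0)        # normalized: a_i' a_i = 1
    assert np.isclose(a_i @ Sigma @ a_i, eigvals[i])   # Var(Y_i) = lambda_i

# Distinct PCs are uncorrelated: Cov(Y_i, Y_j) = a_i' Sigma a_j = 0
assert np.isclose(eigvecs[:, 0] @ Sigma @ eigvecs[:, 1], 0.0)
```

Note also that the eigenvalues sum to the trace of Σ, so the total variance is preserved by the transformation.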
Hence, the proportion of total variance due to (explained by) the i-th principal component is given by:

λi / (λ1 + λ2 + … + λk),  i = 1, 2, …, k
Example 4.1
Suppose the random variables X1, X2, and X3 have the covariance matrix Σ.
Thus, the principal components Y1 and Y2 could replace the original three random variables with
negligible loss of information.
    ⎡ 1  0  0 ⎤
Σ = ⎢ 0  4  2 ⎥   with eigenvalue–eigenvector pairs:
    ⎣ 0  2  2 ⎦

λ1 = 5.236,  e1' = (0, 0.8507, 0.5257),   Y1 = e1'x = 0.8507 X2 + 0.5257 X3
λ2 = 1.000,  e2' = (1, 0, 0),             Y2 = e2'x = X1
λ3 = 0.764,  e3' = (0, -0.5257, 0.8507),  Y3 = e3'x = -0.5257 X2 + 0.8507 X3

Corr(Y1, X1) = e11 √λ1 / √σ11 = 0 × √5.236 / √1 = 0
The correlations between Y1 and X2 and between Y1 and X3 are –0.8489 and 0.8506 respectively, almost identical in magnitude. This indicates that the two variables are roughly equally important to Y1.
Corr(Y2, X1) = e21 √λ2 / √σ11 = 1 × √1 / √1 = 1
Corr(Y2, X2) = e22 √λ2 / √σ22 = 0 × √1 / √4 = 0
Corr(Y2, X3) = e23 √λ2 / √σ33 = 0 × √1 / √2 = 0
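These hand computations for Example 4.1 can be reproduced numerically. A NumPy check follows (Python standing in for the unit's R; the sign of each eigenvector is arbitrary, so magnitudes are compared for the correlations):

```python
import numpy as np

# Covariance matrix from Example 4.1
Sigma = np.array([[1.0, 0.0, 0.0],
                  [0.0, 4.0, 2.0],
                  [0.0, 2.0, 2.0]])

lam, E = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]
lam, E = lam[order], E[:, order]

# Eigenvalues: 3 + sqrt(5), 1, 3 - sqrt(5), i.e. 5.236, 1.000, 0.764
assert np.allclose(np.round(lam, 3), [5.236, 1.000, 0.764])

# Corr(Y_i, X_k) = e_ik sqrt(lambda_i) / sqrt(sigma_kk); the orientation of
# each eigenvector is arbitrary, so compare absolute values
corr = np.abs(E * np.sqrt(lam)) / np.sqrt(np.diag(Sigma))[:, None]
assert np.isclose(corr[0, 0], 0.0)               # Corr(Y1, X1) = 0
assert np.isclose(corr[0, 1], 1.0)               # |Corr(Y2, X1)| = 1
assert abs(corr[2, 0] - 0.8506) < 1e-4           # |Corr(Y1, X3)| ~ 0.8506
```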
Let x1, x2, x3, …, xn be n independent drawings from a k-dimensional population with mean vector µ and covariance matrix Σ. The sample mean vector and sample covariance matrix are x̄ and S respectively.
If the sample covariance matrix S has eigenvalue–eigenvector pairs (λ̂1, ê1), (λ̂2, ê2), …, (λ̂k, êk) with λ̂1 ≥ λ̂2 ≥ … ≥ λ̂k, and x is the vector of observed variables X1, X2, …, Xk, then the i-th sample principal component is given by ŷi = êi'x = êi1 x1 + êi2 x2 + … + êik xk.
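The sample version can be sketched as follows (Python with NumPy rather than the unit's R; the population used to generate the data is invented): the sample PCs come from the eigen-decomposition of S, and the sample variance of each score column reproduces the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample: n = 200 independent draws from a 3-dimensional population
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[4.0, 2.0, 0.0],
                                 [2.0, 3.0, 1.0],
                                 [0.0, 1.0, 2.0]],
                            size=200)

xbar = X.mean(axis=0)              # sample mean vector
S = np.cov(X, rowvar=False)        # sample covariance matrix (divisor n - 1)

lam_hat, E_hat = np.linalg.eigh(S)
order = np.argsort(lam_hat)[::-1]
lam_hat, E_hat = lam_hat[order], E_hat[:, order]

# i-th sample principal component score: y_i = e_i' (x - xbar)
scores = (X - xbar) @ E_hat

# Each score column's sample variance equals the corresponding eigenvalue of S
assert np.allclose(scores.var(axis=0, ddof=1), lam_hat)
```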
For housing and crime, the lower the score the better. For the rest of the variables, the
higher the score the better.
With nine variables, the covariance matrix may be too large to analyze and interpret properly; there would be too many pairwise covariances between the variables to study. Graphical displays of the data may also not be very helpful if the data set is large. To interpret the data in a more meaningful way, it is therefore necessary to reduce the number of variables to a few interpretable linear combinations, or principal components, of the data.
Covariance Matrix

          Climate        Housing        Health         Crime          Trans          Educate        Arts           Recreate       Econ
Climate    0.0128923499   0.0032677528   0.0054792649   0.0043741176   0.0003857247   0.0004415009   0.0106885887   0.0025732595  -0.0009661793
Housing    0.0032677528   0.0111161410   0.0145962100   0.0024830608   0.0052785799   0.0010695852   0.0292263029   0.0091269830   0.0026458304
Health     0.0054792649   0.0145962100   0.1027278915   0.0099549524   0.0211534636   0.0074778111   0.1184843654   0.0152994310   0.0014633998
Crime      0.0043741176   0.0024830608   0.0099549524   0.0286107020   0.0072989317   0.0004713186   0.0319465684   0.0092846815   0.0039464274
Trans      0.0003857247   0.0052785799   0.0211534636   0.0072989317   0.0248288688   0.0024618893   0.0470407089   0.0115674940   0.0008343588
Educate    0.0004415009   0.0010695852   0.0074778111   0.0004713186   0.0024618893   0.0025199764   0.0095204087   0.0008772470   0.0005464533
Arts       0.0106885887   0.0292263029   0.1184843654   0.0319465684   0.0470407089   0.0095204087   0.2971731520   0.0508599879   0.0062060281
Recreate   0.0025732595   0.0091269830   0.0152994310   0.0092846815   0.0115674940   0.0008772470   0.0508599879   0.0353078256   0.0027924140
Econ      -0.0009661793   0.0026458304   0.0014633998   0.0039464274   0.0008343588   0.0005464533   0.0062060281   0.0027924140   0.0071365383

Use sum(diag(R)) to obtain the total variance: 0.5223134457
“SS loadings” represents the eigenvalue for each PC. The sum of the SS loadings is 0.5223, the total variance. The proportion of variation explained by each eigenvalue is also reported in the output. For example, 0.377462 divided by 0.5223 equals 0.7227: about 72% of the total variation is explained by the first eigenvalue.
If you compute the differences of SS loadings between adjacent PCs, you can see the magnitude of the differences decreasing.
Subtracting the second eigenvalue, 0.051, from the first, 0.377, gives a difference of 0.326. The difference between the second and third eigenvalues is 0.0232; the next difference is 0.0049. Subsequent differences are even smaller. A sharp drop from one eigenvalue to the next may serve as another indicator of how many eigenvalues to retain.
The first three principal components explain 87% of the variation, which is an acceptably large percentage.
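The arithmetic above can be sketched in a few lines of Python (the unit's own code is in R; the third eigenvalue is back-calculated here from the quoted difference of 0.0232, an assumption rather than a value read from the output):

```python
# Eigenvalues (SS loadings) of the first three PCs as quoted in the text;
# the third is inferred from the stated difference 0.051 - 0.0232 = 0.0278
lam = [0.377462, 0.051, 0.051 - 0.0232]
total = 0.5223134457                     # sum(diag(R)), the total variance

props = [l / total for l in lam]
assert round(props[0], 4) == 0.7227      # first PC explains about 72%
assert round(sum(props), 2) == 0.87      # first three PCs explain about 87%
```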
The scree plot is utilized in parallel analysis. A parallel analysis simulates a random dataset (or resampled dataset) with the same sample size as the input dataset. Eigenvalues are then computed from this simulated (or resampled) dataset. Finally, these simulated eigenvalues are plotted along with the eigenvalues from the input dataset on the same scree plot; simulated eigenvalues are drawn as dashed lines.
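A parallel analysis can be sketched as follows (Python/NumPy standing in for the R routine; the sample size, variable count, and observed eigenvalues below are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 9            # hypothetical sample size and number of variables

# Average eigenvalues of the correlation matrix of pure-noise datasets:
# this is the dashed reference line drawn on the scree plot
n_sims = 50
sim_eigs = np.zeros((n_sims, k))
for s in range(n_sims):
    Z = rng.standard_normal((n, k))               # simulated random dataset
    R = np.corrcoef(Z, rowvar=False)
    sim_eigs[s] = np.sort(np.linalg.eigvalsh(R))[::-1]
mean_sim = sim_eigs.mean(axis=0)

# Hypothetical eigenvalues from the input dataset
obs_eigs = np.array([3.9, 1.6, 1.1, 0.7, 0.5, 0.4, 0.3, 0.3, 0.2])

# Retain components whose observed eigenvalue exceeds the simulated average
n_retain = int(np.sum(obs_eigs > mean_sim))
```

The point where the observed curve drops below the dashed simulated curve suggests how many components to keep.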
Kaiser rule
Kaiser (1960) recommends that we retain only eigenvalues that are at least equal to one. This is also known as the “eigenvalue > 1” rule.
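Applied to the eigenvalues of a correlation-matrix PCA, the rule is one line (the eigenvalues below are hypothetical):

```python
# Kaiser rule: retain components whose eigenvalue is at least 1; a
# standardized variable has variance 1, so such a component explains more
# variance than any single original variable
eigenvalues = [3.2, 1.8, 1.1, 0.9, 0.6, 0.4]   # hypothetical values
n_retain = sum(e >= 1 for e in eigenvalues)     # retains the first 3 here
```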
(Source: https://www.rasch.org/rmt/rmt191h.htm)
To interpret each component, the correlations between each original variable and each principal component are computed, and these correlations are used to interpret the components. Note that the principal components themselves are mutually uncorrelated.
For example, the first two principal component scores for an individual community of interest
can be computed using the elements of the eigenvector and the values of that community for
each of the nine variables :
#213 has a very high score for the first principal component
and it is expected that this community possesses high values
for the Arts, Health, Housing, Transportation and
Recreation.
#195 has a very high value for the second component. One can expect that such communities would be bad for Health.
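The score computation itself can be sketched as below (Python/NumPy with made-up data standing in for the nine Places Rated variables): each community's PC score is the dot product of the eigenvector with the community's centred values.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 9))     # hypothetical: 50 communities, 9 variables

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
lam, E = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
E = E[:, order]

# First two PC scores for one community: elements of the eigenvector
# multiplied by that community's (centred) value on each of the nine variables
x = X[0] - xbar
score1 = E[:, 0] @ x
score2 = E[:, 1] @ x

# Same numbers as row 0 of the full score matrix
scores = (X - xbar) @ E
assert np.isclose(score1, scores[0, 0]) and np.isclose(score2, scores[0, 1])
```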
Eigenvalues and eigenvectors are stored explicitly; call them out if needed.

## extract the eigenvalues and eigenvectors
eigen.values <- out$values
eigen.vectors <- out$loadings
The eigenvectors are labelled as “loadings” in R. The unstandardized loadings are the coefficients of the principal components in original units (i.e., eij √λi), and the standardized loadings are the correlations between the principal components and the variables (i.e., eij √λi / √sjj). So the sum of squares of the unstandardized loadings for each principal component gives the eigenvalue (i.e., Σj (eij √λi)² = λi).
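These relationships can be verified numerically (Python/NumPy here, with a randomly generated covariance matrix standing in for the unit's data):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((9, 9))
S = A @ A.T                          # an arbitrary valid covariance matrix

lam, E = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, E = lam[order], E[:, order]

# Unstandardized loadings: e_ij * sqrt(lambda_i), one column per PC
unstd = E * np.sqrt(lam)
# Standardized loadings: divide by sqrt(s_jj), giving Corr(Y_i, X_j)
std = unstd / np.sqrt(np.diag(S))[:, None]

# Sum of squares of unstandardized loadings per PC recovers the eigenvalue
assert np.allclose((unstd ** 2).sum(axis=0), lam)
# Standardized loadings are correlations, so they lie in [-1, 1]
assert np.all(np.abs(std) <= 1.0 + 1e-9)
```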
[Scree plot: eigenvalues of the PCA solution and of the resampled data, plotted against Component Number, used to decide how many components to extract.]
[Diagram: PCA yields loadings for the variables and scores for the samples.]
Workbook – Unit 4 Q1