Week 6 (PCA, SVD, LDA)
0.1 WEEK-6:
Feature Extraction (use applicable packages):
1. Principal Component Analysis (PCA)
2. Singular Value Decomposition (SVD)
3. Linear Discriminant Analysis (LDA)
4. Feature Subset Selection
[8]: from sklearn import datasets

df = datasets.load_iris()
df
[8]: {'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        …,
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, …, 0, 1, 1, …, 1, 2, 2, …, 2]),
 'frame': None,
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n
 **Data Set Characteristics:**\n\n :Number of Instances: 150 (50 in each of
 three classes)\n :Number of Attributes: 4 numeric, predictive attributes and
 the class\n :Attribute Information:\n  - sepal length in cm\n  - sepal width
 in cm\n  - petal length in cm\n  - petal width in cm\n  - class: Iris-Setosa,
 Iris-Versicolour, Iris-Virginica\n\n :Summary Statistics:\n\n
                Min  Max  Mean  SD    Class Correlation\n
  sepal length: 4.3  7.9  5.84  0.83   0.7826\n
  sepal width:  2.0  4.4  3.05  0.43  -0.4194\n
  petal length: 1.0  6.9  3.76  1.76   0.9490 (high!)\n
  petal width:  0.1  2.5  1.20  0.76   0.9565 (high!)\n\n
 :Missing Attribute Values: None\n :Class Distribution: 33.3% for each of 3
 classes.\n :Creator: R.A. Fisher\n :Donor: Michael Marshall
 (MARSHALL%PLU@io.arc.nasa.gov)\n :Date: July, 1988\n\nThe famous Iris
 database, first used by Sir R.A. Fisher. The dataset is taken from Fisher's
 paper. Note that it's the same as in R, but not as in the UCI Machine Learning
 Repository, which has two wrong data points.\n\nThis is perhaps the best known
 database to be found in the pattern recognition literature. Fisher's paper is
 a classic in the field and is referenced frequently to this day. (See Duda &
 Hart, for example.) The data set contains 3 classes of 50 instances each,
 where each class refers to a type of iris plant. One class is linearly
 separable from the other 2; the latter are NOT linearly separable from each
 other.\n\n.. topic:: References\n\n - Fisher, R.A. "The use of multiple
 measurements in taxonomic problems", Annual Eugenics, 7, Part II, 179-188
 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY,
 1950).\n - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene
 Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n
 - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
 Structure and Classification Rule for Recognition in Partially Exposed
 Environments". IEEE Transactions on Pattern Analysis and Machine Intelligence,
 Vol. PAMI-2, No. 1, 67-71.\n - Gates, G.W. (1972) "The Reduced Nearest
 Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.\n
 - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
 conceptual clustering system finds 3 classes in the data.\n - Many, many
 more …',
 'feature_names': ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 'filename': 'iris.csv',
 'data_module': 'sklearn.datasets.data'}
0.2 1. Principal Component Analysis (PCA)
[9]: # Determining the initial dimensions of the dataset
x = df.data
y = df.target
print(x.shape, y.shape)
(150, 4) (150,)
from sklearn.decomposition import PCA

pca = PCA()
X_new = pca.fit_transform(x)
[12]: cov_mat = pca.get_covariance()
[13]: cov_mat
[13]: array([…,
       [ 0.51627069, -0.12163937,  1.2956094 ,  0.58100626]])
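The cell that produced the eigenvalues and eigenvectors below isn't shown; a minimal sketch, assuming NumPy's eigendecomposition of the covariance matrix computed above:

[ ]: import numpy as np

# Eigenvalues give the variance along each principal axis;
# eigenvectors (one per column) give the axes themselves
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvalues\n', eig_vals)
print('Eigenvectors\n', eig_vecs)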
Eigenvalues
[4.22824171 0.24267075 0.0782095 0.02383509]
Eigenvectors
[[ 0.36138659 -0.65658877 -0.58202985 0.31548719]
[-0.08452251 -0.73016143 0.59791083 -0.3197231 ]
[ 0.85667061 0.17337266 0.07623608 -0.47983899]
[ 0.3582892 0.07548102 0.54583143 0.75365743]]
[15]: # Refit PCA keeping only the first two components
pca = PCA(n_components=2)
pca.fit(x)
[15]: PCA(n_components=2)
[17]: # Transforming the dataset from 4 dimensions into 2 dimensions using PCA
z = pca.transform(x)
z.shape
[17]: (150, 2)
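The plotting cell behind the observation below isn't shown; a minimal sketch, assuming matplotlib and the 2-D projection z from above:

[ ]: import matplotlib.pyplot as plt

# Scatter plot of the two principal components, colored by class
plt.scatter(z[:, 0], z[:, 1], c=y)
plt.show()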
0.2.1 Observation: The three classes appear to be well separated
0.3 Observation
0.3.1 Together, the first two principal components contain 97.76% of the information.
The first principal component contains 92.46% of the variance and the second principal component
contains 5.3% of the variance. The third and fourth principal components contain the rest of the
variance of the dataset.
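The cell behind these percentages isn't shown; a minimal sketch, assuming the fitted PCA object's explained_variance_ratio_ attribute:

[ ]: # Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_[:2].sum())   # first two PCs together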
0.4 2. Singular Value Decomposition (SVD)
[22]: (150, 4)
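The setup cells for this section aren't shown; a minimal sketch, assuming the Iris data is reloaded under the name iris1 used below (its shape matches the output above):

[ ]: import numpy as np
import pandas as pd
from sklearn import datasets

iris1 = datasets.load_iris()
iris1.data.shape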
[25]: df_iris = pd.DataFrame(iris1.data, columns=iris1.feature_names)
df_iris.shape
[25]: (150, 4)
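The decomposition cell itself isn't shown; a minimal sketch, assuming numpy.linalg.svd with full_matrices=False, which matches the U shape reported below:

[ ]: # Thin SVD of the 150x4 data matrix: U (150x4), S (4,), Vt (4x4)
U_iris, S_iris, Vt_iris = np.linalg.svd(df_iris, full_matrices=False)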
[27]: U_iris.shape
[27]: (150, 4)
NOTE: numpy.linalg.svd does not return Σ as a diagonal matrix; it returns a 1-D array of
the entries on the diagonal (the singular values).
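So reconstructing the data matrix requires re-expanding Σ with np.diag; a quick check, assuming the arrays from the sketch above:

[ ]: # Verify the factorization: A ≈ U @ diag(S) @ Vt
approx = U_iris @ np.diag(S_iris) @ Vt_iris
print(np.allclose(approx, df_iris.values))   # True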
[28]: S_iris
[29]: Vt_iris
# Scatter plot of the first two original features, colored by class
import matplotlib.pyplot as plt
print(x.shape, y.shape)
plt.scatter(x[:, 0], x[:, 1], c=y)
(150, 4) (150,)
[33]: # After SVD: scatter plot of the projected data, colored by class
import matplotlib.pyplot as plt
# assuming the plot used the first two left-singular vectors (columns of U)
plt.scatter(U_iris[:, 0], U_iris[:, 1], c=y)
0.5 3. Linear Discriminant Analysis (LDA)
(150, 4)
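The cells that fit the model aren't shown; a minimal sketch, assuming scikit-learn's LinearDiscriminantAnalysis produced the X_r2 used in the cell below:

[ ]: from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# With 3 classes, LDA can produce at most 2 discriminant components
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(x, y).transform(x)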
[50]: # Visualizing the 2D data in the form of a scatter plot
import numpy as np
colors = ['royalblue', 'red', 'tan']
vectorizer = np.vectorize(lambda cls: colors[cls % len(colors)])
plt.scatter(X_r2[:, 0], X_r2[:, 1], c=vectorizer(y))
0.6 Observation
0.6.1 LDA is able to separate the classes very well after dimensionality reduction
0.7 4. Feature Subset Selection
0.7.1 Filter approach
iris
       Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  \
0       1            5.1           3.5            1.4           0.2
1       2            4.9           3.0            1.4           0.2
2       3            4.7           3.2            1.3           0.2
3       4            4.6           3.1            1.5           0.2
4       5            5.0           3.6            1.4           0.2
..    ...            ...           ...            ...           ...
145   146            6.7           3.0            5.2           2.3
146   147            6.3           2.5            5.0           1.9
147   148            6.5           3.0            5.2           2.0
148   149            6.2           3.4            5.4           2.3
149   150            5.9           3.0            5.1           1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. …
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
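The selection cells themselves aren't shown; a minimal sketch of a filter approach, assuming scikit-learn's SelectKBest with the ANOVA F-score:

[ ]: from sklearn.feature_selection import SelectKBest, f_classif

# Filter approach: score each feature against the class label,
# independently of any model, then keep the k best features
selector = SelectKBest(score_func=f_classif, k=2)
x_selected = selector.fit_transform(x, y)
print(selector.scores_)     # per-feature ANOVA F-scores
print(x_selected.shape)     # (150, 2)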
0.9 Observation