
Week 6 (PCA, SVD, LDA)

The document demonstrates feature extraction on the iris flower dataset using principal component analysis (PCA), singular value decomposition (SVD), linear discriminant analysis (LDA), and feature subset selection. The dataset is loaded with scikit-learn and pandas, and basic information about it is displayed.



October 14, 2023

[21]: import pandas as pd


import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.decomposition import PCA
import sklearn.preprocessing

0.1 WEEK-6:
Feature Extraction (use packages that are applicable):
1. Principal Component Analysis (PCA)
2. Singular Value Decomposition (SVD)
3. Linear Discriminant Analysis (LDA)
4. Feature Subset Selection
[8]: df=datasets.load_iris()
df

[8]: {'data': array([[5.1, 3.5, 1.4, 0.2],


[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],

[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],

[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],
[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],

[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],
[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]]),
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
'frame': None,
'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
'DESCR': '.. _iris_dataset:\n\nIris plants
dataset\n--------------------\n\n**Data Set Characteristics:**\n\n :Number of
Instances: 150 (50 in each of three classes)\n :Number of Attributes: 4

numeric, predictive attributes and the class\n :Attribute Information:\n
- sepal length in cm\n - sepal width in cm\n - petal length in
cm\n - petal width in cm\n - class:\n - Iris-
Setosa\n - Iris-Versicolour\n - Iris-Virginica\n
\n :Summary Statistics:\n\n ============== ==== ==== ======= =====
====================\n Min Max Mean SD Class
Correlation\n ============== ==== ==== ======= ===== ====================\n
sepal length: 4.3 7.9 5.84 0.83 0.7826\n sepal width: 2.0 4.4
3.05 0.43 -0.4194\n petal length: 1.0 6.9 3.76 1.76 0.9490
(high!)\n petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n
============== ==== ==== ======= ===== ====================\n\n :Missing
Attribute Values: None\n :Class Distribution: 33.3% for each of 3 classes.\n
:Creator: R.A. Fisher\n :Donor: Michael Marshall
(MARSHALL%PLU@io.arc.nasa.gov)\n :Date: July, 1988\n\nThe famous Iris
database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s
paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning
Repository, which has two wrong data points.\n\nThis is perhaps the best known
database to be found in the\npattern recognition literature. Fisher\'s paper is
a classic in the field and\nis referenced frequently to this day. (See Duda &
Hart, for example.) The\ndata set contains 3 classes of 50 instances each,
where each class refers to a\ntype of iris plant. One class is linearly
separable from the other 2; the\nlatter are NOT linearly separable from each
other.\n\n.. topic:: References\n\n - Fisher, R.A. "The use of multiple
measurements in taxonomic problems"\n Annual Eugenics, 7, Part II, 179-188
(1936); also in "Contributions to\n Mathematical Statistics" (John Wiley,
NY, 1950).\n - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and
Scene Analysis.\n (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See
page 218.\n - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New
System\n Structure and Classification Rule for Recognition in Partially
Exposed\n Environments". IEEE Transactions on Pattern Analysis and
Machine\n Intelligence, Vol. PAMI-2, No. 1, 67-71.\n - Gates, G.W. (1972)
"The Reduced Nearest Neighbor Rule". IEEE Transactions\n on Information
Theory, May 1972, 431-433.\n - See also: 1988 MLC Proceedings, 54-64.
Cheeseman et al"s AUTOCLASS II\n conceptual clustering system finds 3
classes in the data.\n - Many, many more …',
'feature_names': ['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)'],
'filename': 'iris.csv',
'data_module': 'sklearn.datasets.data'}

[ ]:

[ ]:

0.2 1. Principal Component Analysis (PCA)
[9]: # Determining the initial dimensions of the dataset
x=df.data
y=df.target
print(x.shape,y.shape)

(150, 4) (150,)
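
The cells that follow apply PCA directly to the raw centimetre measurements. The sklearn.preprocessing import at the top of the notebook is never used; below is a minimal optional sketch of standardizing the features with StandardScaler first (an addition for illustration only; the original analysis keeps the raw values, and x_scaled is a hypothetical name not used in later cells).

[ ]: # Optional sketch (not used below): standardize the features so each column has
# zero mean and unit variance, which matters when features are on different scales.
from sklearn.preprocessing import StandardScaler

x_scaled = StandardScaler().fit_transform(x)                         # shape stays (150, 4)
print(x_scaled.mean(axis=0).round(2), x_scaled.std(axis=0).round(2))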

[10]: # Scatter plot of the first two original features, colored by class


plt.scatter(x[:,0],x[:,1],c=y)

[10]: <matplotlib.collections.PathCollection at 0x243da686310>

[11]: from sklearn.decomposition import PCA

pca = PCA()
X_new = pca.fit_transform(x)

[12]: cov_mat=pca.get_covariance()

[13]: cov_mat

[13]: array([[ 0.68569351, -0.042434 , 1.27431544, 0.51627069],


[-0.042434 , 0.18997942, -0.32965638, -0.12163937],
[ 1.27431544, -0.32965638, 3.11627785, 1.2956094 ],

[ 0.51627069, -0.12163937, 1.2956094 , 0.58100626]])

[14]: import numpy as np


eig_vals, eig_vecs=np.linalg.eig(cov_mat)

print("Eigenvelues \n%s" %eig_vals)


print("Eigenvectors \n%s" %eig_vecs)

Eigenvalues
[4.22824171 0.24267075 0.0782095 0.02383509]
Eigenvectors
[[ 0.36138659 -0.65658877 -0.58202985 0.31548719]
[-0.08452251 -0.73016143 0.59791083 -0.3197231 ]
[ 0.85667061 0.17337266 0.07623608 -0.47983899]
[ 0.3582892 0.07548102 0.54583143 0.75365743]]
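
The eigenvalues above are the variances of the data along each principal direction, so they should agree with what PCA reports. A minimal sketch of that check, reusing the full PCA fitted earlier (idx is a hypothetical name introduced here):

[ ]: # Sketch: the eigenvalues of the covariance matrix match the per-component variance
# reported by the full PCA fitted above; normalising them gives the variance ratios.
idx = np.argsort(eig_vals)[::-1]            # order eigenvalues from largest to smallest
print(eig_vals[idx])                        # ~ pca.explained_variance_
print(eig_vals[idx] / eig_vals.sum())       # ~ pca.explained_variance_ratio_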

[15]: # PCA with a target dimension of 2 on the dataset


pca=PCA(n_components=2)
pca.fit(x)

[15]: PCA(n_components=2)

[16]: # Visualizing the principal components


pca.components_

[16]: array([[ 0.36138659, -0.08452251, 0.85667061, 0.3582892 ],


[ 0.65658877, 0.73016143, -0.17337266, -0.07548102]])

[17]: # Transforming the dataset from 4 dimensions to 2 dimensions using PCA
z=pca.transform(x)
z.shape

[17]: (150, 2)

[18]: # Scatter plot of the two principal components, colored by class


plt.scatter(z[:,0],z[:,1],c=y)

[18]: <matplotlib.collections.PathCollection at 0x243dab43be0>

0.2.1 Observation: The three classes appear to be well separated

[19]: # Variance ratio of the target dimensions


pca.explained_variance_ratio_

[19]: array([0.92461872, 0.05306648])

0.3 Observation
0.3.1 Together, the first two principal components contain about 97.77% of the information.
The first principal component contains 92.46% of the variance and the second principal component
contains 5.31% of the variance. The third and fourth principal components contain the rest of the
variance of the dataset.
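
These percentages come straight from explained_variance_ratio_. A minimal sketch of the cumulative picture over all four components, refitting PCA without limiting n_components (pca_full is a hypothetical name added for illustration):

[ ]: # Sketch: cumulative explained variance across all four principal components.
pca_full = PCA().fit(x)
print(pca_full.explained_variance_ratio_)              # per-component ratios
print(np.cumsum(pca_full.explained_variance_ratio_))   # first two sum to ~0.9777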

0.4 2. Singular Value Decomposition (SVD)

0.4.1 Displaying high-dimensional data using reduced-rank matrices
If the data is high-dimensional, you can use Singular Value Decomposition (SVD) to find a
reduced-rank approximation of the data that can be visualized easily.
[22]: iris1 = sklearn.datasets.load_iris()
iris1.data.shape

[22]: (150, 4)

[25]: df_iris = pd.DataFrame(iris1.data, columns=iris1.feature_names)

df_iris.shape

[25]: (150, 4)

[26]: U_iris, S_iris, Vt_iris = np.linalg.svd(df_iris, full_matrices=False)

[27]: U_iris.shape

[27]: (150, 4)

NOTE: numpy.linalg.svd does not return Σ as a diagonal matrix; it returns a 1-D array containing
the diagonal entries.
[28]: S_iris

[28]: array([95.95991387, 17.76103366, 3.46093093, 1.88482631])
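
Because Σ comes back as a 1-D array, the factorization can be checked by rebuilding the diagonal matrix explicitly. A minimal sketch (an addition; Sigma_iris and X_rebuilt are hypothetical names):

[ ]: # Sketch: rebuild Σ as a diagonal matrix and verify that U Σ Vᵀ reproduces the data.
Sigma_iris = np.diag(S_iris)                      # (4, 4) diagonal matrix
X_rebuilt = U_iris @ Sigma_iris @ Vt_iris         # (150, 4)
print(np.allclose(X_rebuilt, df_iris.values))     # True, up to floating-point error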

[29]: Vt_iris

[29]: array([[-0.75110816, -0.38008617, -0.51300886, -0.16790754],


[ 0.2841749 , 0.5467445 , -0.70866455, -0.34367081],
[ 0.50215472, -0.67524332, -0.05916621, -0.53701625],
[ 0.32081425, -0.31725607, -0.48074507, 0.75187165]])

[31]: # Before SVD: scatter plot of the first two original features


x=iris1.data
y=iris1.target
print(x.shape,y.shape)

#scatter plot
plt.scatter(x[:,0],x[:,1],c=y)

(150, 4) (150,)

[31]: <matplotlib.collections.PathCollection at 0x243dabc5250>

[33]: # After SVD
import matplotlib.pyplot as plt

# Plot the first two columns of U


plt.scatter(U_iris[:, 0], U_iris[:, 1], c=iris1.target)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
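
The plot above uses the raw first two columns of U. Scaling those columns by the corresponding singular values gives the actual coordinates of each sample along the top-2 singular directions, and multiplying by the top two rows of Vᵀ yields the best rank-2 approximation of the measurements. A minimal sketch reusing the matrices computed above (scores and X_rank2 are hypothetical names):

[ ]: # Sketch: rank-2 scores and rank-2 approximation from the truncated SVD.
scores = U_iris[:, :2] * S_iris[:2]        # (150, 2) coordinates along the top-2 directions
X_rank2 = scores @ Vt_iris[:2, :]          # (150, 4) best rank-2 approximation of the data
print(np.linalg.matrix_rank(X_rank2))      # 2
plt.scatter(scores[:, 0], scores[:, 1], c=iris1.target)
plt.show()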

[ ]:

[ ]:

[ ]:

0.5 3. Linear Discriminant Analysis (LDA)


[44]: # Determining the initial dimensions of the dataset
X=df.data
Y=df.target
print(X.shape)

(150, 4)

[45]: # Decomposing 4D into 2D using LDA


from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda=LinearDiscriminantAnalysis(n_components=2)
X_r2=lda.fit(X,Y).transform(X)

[46]: #Getting the variance ratio


lda.explained_variance_ratio_

[46]: array([0.9912126, 0.0087874])

[50]: # Visualizing the 2D LDA projection as a scatter plot
import numpy as np
colors=['royalblue','red','tan']
vectorizer=np.vectorize(lambda label: colors[label % len(colors)])
plt.scatter(X_r2[:,0],X_r2[:,1],c=vectorizer(Y))

[50]: <matplotlib.collections.PathCollection at 0x1e6ce8ba610>

0.6 Observation
0.6.1 LDA is able to separate the classes very well after dimensionality reduction
0.7 4. Feature Subset Selection
0.7.1 Filter approach

[52]: iris = pd.read_csv('E:\\OneDrive - Don Bosco School\\VNR-VJIET\\DE\\datasets\\iris.csv')

iris

[52]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2

4 5 5.0 3.6 1.4 0.2
.. … … … … …
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. …
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

[150 rows x 6 columns]

[58]: # Visualizing using pair plot


#sns.pairplot(iris.drop(['Id'],axis=1),hue='Species',height=2)

0.8 Correlation Matrix with Heatmap


Correlation states how the features are related to each other or to the target variable.
Correlation can be positive (an increase in a feature's value increases the value of the target variable)
or negative (an increase in a feature's value decreases the value of the target variable).
A heatmap makes it easy to identify which features are most related to the target variable; we plot a
heatmap of the correlated features using the seaborn library.
[38]: # Visualizing correlation using a heatmap
# Drop the non-numeric Species column and the Id column before computing correlations
sns.heatmap(iris.drop(['Id', 'Species'], axis=1).corr(method='pearson'), annot=True)

plt.show()
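
The heatmap above shows how the features relate to one another. A univariate filter goes one step further and scores each feature directly against the class label; a minimal sketch using SelectKBest with the ANOVA F-test is shown below (an addition, not part of the original notebook; features, labels and selector are hypothetical names):

[ ]: # Sketch: univariate filter selection with SelectKBest (ANOVA F-test).
from sklearn.feature_selection import SelectKBest, f_classif

features = iris.drop(['Id', 'Species'], axis=1)
labels = iris['Species']
selector = SelectKBest(score_func=f_classif, k=2).fit(features, labels)
print(dict(zip(features.columns, selector.scores_.round(1))))       # per-feature F scores
print('Selected:', list(features.columns[selector.get_support()]))  # top-2 features kept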

0.9 Observation
[ ]:

