3.2 PCA

PCA is an algorithm used to reduce dimensionality in datasets by transforming variables into a set of orthogonal principal components. It works by computing the covariance matrix of the dataset and calculating its eigenvectors and eigenvalues. The principal components with the highest eigenvalues contain the most information about the dataset and are used to reduce its dimensionality.


The Problem of Dimensionality

 Handling high-dimensional data is difficult in practice; this difficulty is commonly known as the curse of dimensionality.
 As the dimensionality of the input dataset increases, machine learning algorithms and models become more complex.
 As the number of features increases, the number of samples needed to cover the feature space grows proportionally, and the chance of overfitting increases.
 A machine learning model trained on high-dimensional data therefore tends to overfit and perform poorly.
 Hence it is often necessary to reduce the number of features, which can be done with dimensionality reduction.
Principal Component Analysis (PCA)

 Principal Component Analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning.
 It is a technique for drawing out strong patterns in a dataset by reducing the number of variables while retaining as much of the variance as possible.
 It is a feature extraction technique: it constructs new variables (the principal components), keeps the most important ones, and drops the least important ones.
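As a minimal sketch of PCA used as a feature-extraction step (this example is not part of the original notes; the random dataset and the choice of 2 components are illustrative assumptions):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))          # 100 samples, 4 features (made up)

    pca = PCA(n_components=2)              # keep the 2 strongest components
    X_reduced = pca.fit_transform(X)       # new, orthogonal features

    print(X_reduced.shape)                 # (100, 2)
    print(pca.explained_variance_ratio_)   # fraction of variance each PC keeps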
How PCA Works: Key Terms
 Dimensionality: the number of features, variables, or columns present in the dataset.

 Correlation: signifies how strongly two variables are related to each other, i.e., if one changes, the other also changes.

 Orthogonal: indicates that variables are uncorrelated with each other, so the correlation between each pair of variables is zero.

 Eigenvectors: represent the directions along which the principal components are aligned.

 Eigenvalues: represent the amount of variance (spread) captured by each principal component.

 Covariance Matrix: a matrix containing the covariances between each pair of variables.
Procedure for Performing Principal Component Analysis

Step-by-Step Computation of PCA

1. Getting the dataset

2. Representing the data in a structure

3. Standardization of the data

4. Computing the covariance matrix

5. Calculating the eigenvectors and eigenvalues, and sorting the eigenvectors

6. Computing the principal components

7. Reducing the dimensions of the dataset (removing the less important features from the new dataset); a NumPy sketch of steps 3-7 follows this list
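The following NumPy sketch walks through steps 3-7 on a made-up 5 × 4 dataset (the numbers are assumptions for illustration, not from the original notes):

    import numpy as np

    X = np.array([[2.5, 2.4, 0.5, 0.7],
                  [0.5, 0.7, 2.2, 2.9],
                  [2.2, 2.9, 1.9, 2.2],
                  [1.9, 2.2, 3.1, 3.0],
                  [3.1, 3.0, 2.3, 2.7]])       # 5 samples, 4 features

    # Step 3: standardize each feature to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 4: covariance matrix of the standardized data (4x4).
    C = np.cov(Z, rowvar=False)

    # Step 5: eigenvalues/eigenvectors, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Steps 6-7: keep the top-k eigenvectors and project the data.
    k = 2
    W = eigvecs[:, :k]                          # the feature matrix
    X_reduced = Z @ W                           # 5x2 reduced dataset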
1. Getting the dataset
 First, we take the input dataset.
2. Representing the data in a structure
 Consider a dataset with 4 features and a total of 5 training examples.
 Each row corresponds to a data item (sample), and each column corresponds to a feature.
 The number of columns is the dimensionality of the dataset.
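Concretely (the values below are made up for illustration), such a dataset can be stored as a 5 × 4 matrix:

    import numpy as np

    # 5 training examples (rows) x 4 features (columns); values made up.
    X = np.array([[2.5, 2.4, 0.5, 0.7],
                  [0.5, 0.7, 2.2, 2.9],
                  [2.2, 2.9, 1.9, 2.2],
                  [1.9, 2.2, 3.1, 3.0],
                  [3.1, 3.0, 2.3, 2.7]])

    print(X.shape)   # (5, 4): the dimensionality of the dataset is 4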
3. Standardization of the data

• Suppose we have 2 variables in our dataset:
 one has values ranging between 10 and 100, and
 the other has values between 1000 and 5000.
• The output would be biased, since the variable with the larger range would have a disproportionate impact on the outcome.
• Therefore, standardizing the data into a comparable range is very important.
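A short sketch of z-score standardization for the two-variable example above (the ranges mirror the text; the individual values are assumptions):

    import numpy as np

    X = np.array([[ 10., 1000.],
                  [ 55., 3000.],
                  [100., 5000.]])

    # z-score: subtract each column's mean, divide by its std deviation
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    print(Z.mean(axis=0))   # approximately 0 for both columns
    print(Z.std(axis=0))    # exactly 1 for both columns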
4. Computing the covariance matrix

 PCA helps to identify the correlations and dependencies among the features in a dataset.
 A covariance matrix expresses the correlation between the different variables in the dataset.
 It is important to identify heavily dependent variables, because they carry redundant information that can reduce the overall performance of the model.

E.g., for a 2-dimensional dataset with variables a and b, the covariance matrix is the 2×2 matrix shown below.
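Written out (a standard reconstruction, since the matrix itself does not survive in these notes):

    C = \begin{pmatrix}
          \operatorname{cov}(a,a) & \operatorname{cov}(a,b) \\
          \operatorname{cov}(b,a) & \operatorname{cov}(b,b)
        \end{pmatrix},
    \qquad
    \operatorname{cov}(a,b) = \frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})

Note that cov(a, a) = var(a) and cov(a, b) = cov(b, a), so the matrix is symmetric with the variances on its diagonal.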
5. Calculating the eigenvectors and eigenvalues

 Eigenvectors
 The covariance matrix is used to find the directions in which the data has the most variance.
 Since more variance denotes more information about the data, the eigenvectors are used to identify and compute the principal components.
 Eigenvalues simply denote the scalars of the respective eigenvectors; each eigenvalue measures how much variance is captured along its eigenvector.
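A sketch of this eigen-decomposition in NumPy (the 2 × 2 covariance matrix is an assumed example):

    import numpy as np

    C = np.array([[2.0, 0.8],
                  [0.8, 0.6]])             # assumed covariance matrix

    # eigh is appropriate for symmetric matrices such as C
    eigvals, eigvecs = np.linalg.eigh(C)

    print(eigvals)    # variance captured along each eigenvector
    print(eigvecs)    # columns are the (mutually orthogonal) directions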
6. Computing the principal components

 The eigenvector with the highest eigenvalue is the most significant and forms the first principal component.

 The principal components of lesser significance can then be removed in order to reduce the dimensions of the data.

 The final step in computing the principal components is to form a matrix, known as the feature matrix, whose columns are the significant eigenvectors; it captures the maximum information about the data.
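Continuing the assumed 2 × 2 example, ranking the eigenvectors by eigenvalue and keeping the top one yields the feature matrix:

    import numpy as np

    C = np.array([[2.0, 0.8],
                  [0.8, 0.6]])
    eigvals, eigvecs = np.linalg.eigh(C)

    order = np.argsort(eigvals)[::-1]    # largest eigenvalue first
    eigvecs = eigvecs[:, order]

    k = 1                                # keep only the first PC
    W = eigvecs[:, :k]                   # the feature matrix (2 x 1)
    print(W)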
7. Reducing the dimensions of the dataset (removing the less important features from the new dataset)

 The last step in performing PCA is to project the original (standardized) data onto the final principal components, which represent the maximum and most significant information of the dataset.
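As a sketch of this final projection (Z and W follow the earlier examples; the data here are randomly generated stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    Z = rng.normal(size=(5, 4))            # stand-in for standardized data

    C = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # top-2 feature matrix

    X_reduced = Z @ W      # 5 x 2: each row re-expressed in PC coordinates
    print(X_reduced.shape)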
Advantages of PCA

 Easy to compute.
 Speeds up other machine learning algorithms.
 Counteracts the issues of high-dimensional data.
Disadvantages of PCA

 Low interpretability of principal components.
 The trade-off between information loss and dimensionality reduction.
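One common way to manage this trade-off is to keep just enough components to retain a target fraction of the variance; a sketch with scikit-learn (the 95% threshold and random data are illustrative assumptions):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))           # made-up dataset

    pca = PCA().fit(X)                       # fit with all components
    cum = np.cumsum(pca.explained_variance_ratio_)
    k = int(np.searchsorted(cum, 0.95)) + 1  # smallest k retaining ~95%
    print(k, cum[k - 1])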
Applications of PCA in Machine Learning

 Visualizing multidimensional data.
 Reducing the number of dimensions in healthcare data.
 Resizing (compressing) images.
 Analyzing stock data and forecasting returns in finance.
 Finding patterns in high-dimensional datasets.
Summary

 Consider several points plotted on a 2-D plane.
 There are two principal components.
 PC1 is the primary principal component and explains the maximum variance in the data.
 PC2 is another principal component, orthogonal to PC1.
 Each principal component is a straight line that captures part of the variance of the data; it has a direction and a magnitude.
 Principal components are orthogonal (perpendicular) projections of the data onto a lower-dimensional space.
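The picture described above can be reproduced with a short matplotlib sketch (randomly generated 2-D data; the covariance values are assumptions):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=200)

    # PC directions from the covariance matrix, largest variance first
    C = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    plt.scatter(X[:, 0], X[:, 1], s=10, alpha=0.5)
    for val, vec in zip(eigvals, eigvecs.T):      # PC1, then PC2
        dx, dy = vec * 2.0 * np.sqrt(val)         # scale arrow by spread
        plt.arrow(0, 0, dx, dy, width=0.03, color="red")
    plt.axis("equal")
    plt.title("PC1 and PC2 as orthogonal directions of maximum variance")
    plt.show()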
