Principal Component Analysis
Principal Component Analysis (PCA) is a technique used in machine learning to transform high-dimensional data into a lower-dimensional space while preserving the most important information. It does this by identifying principal components, which are orthogonal linear combinations of the original variables that capture the most variance in the data.
In other words, PCA is a statistical technique used in data analysis and machine learning to simplify the complexity of high-dimensional data while retaining its important features.
PCA primarily aims to transform a dataset’s original variables into a new set of uncorrelated
variables called principal components. These components are linear combinations of the original
variables and are chosen in such a way that they capture the maximum variance present in the
data.
PCA is often used for dimensionality reduction, which is particularly useful when dealing with
datasets with many variables. By reducing the number of dimensions, PCA can help mitigate
issues related to the “curse of dimensionality” and make subsequent analysis or modelling more
efficient and accurate. PCA can also be used for data visualization and noise reduction.
In PCA, the first principal component captures the most variance in the data; the second principal
component captures the second most, and so on. These principal components are orthogonal to
each other, meaning they are uncorrelated. Finding these components involves computing
eigenvectors and eigenvalues of the data’s covariance matrix.
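Here is a small sketch (using NumPy and scikit-learn on some made-up random data) that illustrates these two properties: the explained variance is largest for the first component and decreases from there, and the component directions are orthonormal.
import numpy as np
from sklearn.decomposition import PCA
# Made-up example data: 200 samples of 4 correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
pca = PCA().fit(X)
# The explained variance ratios come out sorted from largest to smallest
print(pca.explained_variance_ratio_)
# The principal components are orthonormal (mutually orthogonal unit vectors)
W = pca.components_
print(np.allclose(W @ W.T, np.eye(W.shape[0])))  # True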
The intuition behind Principal Component Analysis (PCA) revolves around simplifying complex data
by focusing on its most significant patterns. Imagine you have a high-dimensional dataset with
numerous variables. Each variable represents a different aspect or measurement, and together,
they form a multi-dimensional space. However, not all of these variables may contribute equally to
the underlying structure of the data.
PCA aims to find a new set of axes, called principal components, in this multi-dimensional space such that, when the data is projected onto them, the variance (spread) of the data is maximized along the first component, the second-largest share of variance falls along the second component, and so on. In other words, PCA identifies the directions in the original data space along which the data varies the most.
Let’s consider a simple 2D example to illustrate the concept of variance and how PCA selects principal components.
Imagine you have a dataset of points in a 2D space, where each point represents an observation
with two variables: X and Y. Here’s the dataset:
X | Y
----------------
2 | 3
4 | 5
6 | 7
8 | 9
10 | 11
1. Calculating Means: The first step in PCA is calculating the means of both variables (X and Y).
In this case, the mean of X is (2 + 4 + 6 + 8 + 10) / 5 = 6, and the mean of Y is (3 + 5 + 7 + 9 +
11) / 5 = 7.
2. Centering the Data: Subtract the respective means from each data point. This centers the
data around the origin (0, 0):
X | Y
----------------
-4 | -4
-2 | -2
0 | 0
2 | 2
4 | 4
3. Calculating Covariance: Calculate the covariance matrix of the centred data (using the n - 1 denominator):
    |  X   |  Y
-----------------
 X  | 10.0 | 10.0
 Y  | 10.0 | 10.0
Notice that the off-diagonal covariance (10.0) is just as large as the variances on the diagonal, which means X and Y are perfectly correlated in this example.
4. Finding Eigenvectors and Eigenvalues: The next step is to find the eigenvectors and eigenvalues of the covariance matrix. For this matrix the eigenvalues are 20 and 0. The eigenvector for eigenvalue 20 points along the diagonal direction (1, 1)/√2, and the eigenvector for eigenvalue 0 points along (1, -1)/√2. Because X and Y are perfectly correlated, all of the variability lies along that diagonal direction and none lies perpendicular to it.
5. Choosing Principal Components: The first principal component is therefore the diagonal direction (1, 1)/√2, which captures all of the variance (eigenvalue 20 out of a total of 20 + 0). The second component carries no variance at all, so this dataset can be represented with a single principal component without losing any information.
In this example X and Y move together perfectly, so the first principal component is the 45-degree diagonal rather than either original axis, and one dimension is enough to describe the data. In more complex examples, PCA selects whichever directions the data varies along the most, allowing you to capture the most important patterns and reduce dimensionality. (The short NumPy sketch after this example reproduces these numbers.)
Remember that this is a highly simplified example. In real-world scenarios, PCA becomes
particularly powerful when there’s a noticeable difference in variance along different directions,
allowing it to capture the main patterns in high-dimensional data effectively.
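To double-check these numbers, here is a short NumPy sketch that reproduces the worked example: it computes the means, centres the data, builds the covariance matrix, and finds the eigenvalues and eigenvectors.
import numpy as np
# The five (X, Y) points from the example above
data = np.array([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11]], dtype=float)
# Steps 1-2: compute the means and center the data
means = data.mean(axis=0)                 # [6. 7.]
centered = data - means
# Step 3: covariance matrix of the centered data (dividing by n - 1)
cov = np.cov(centered, rowvar=False)      # [[10. 10.] [10. 10.]]
# Steps 4-5: eigenvalues and eigenvectors (eigh returns them in ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)                        # approximately [ 0. 20.]
print(eigenvectors[:, -1])                # first principal component, ~[0.707 0.707] up to sign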
The Mathematics Behind Principal Component Analysis
Principal Component Analysis (PCA) might sound complex, but at its core, it relies on
straightforward mathematical principles to uncover the intrinsic structure of data. In this section,
we’ll delve into the mathematical underpinnings of PCA, breaking down the steps that lead to
identifying those crucial principal components.
1. Covariance Matrix and Centered Data
At the heart of PCA lies the covariance matrix. This matrix quantifies the relationships between
different variables in your data. But we need to centre the data before we compute the covariance
matrix. Centring involves subtracting the mean of each variable from its respective values,
ensuring that the new origin is at the mean of the data.
Mathematically, for each data point (x, y), the centred point becomes (x - mean(x), y - mean(y)). Once all data points are centred, we can construct the covariance matrix. This matrix
captures how much the variables vary together.
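To make this concrete, here is a minimal NumPy sketch (reusing the five points from the earlier example) that builds the covariance matrix directly from the centred data and checks it against NumPy's built-in np.cov:
import numpy as np
def covariance_matrix(X):
    # Covariance of row-wise observations, using the n - 1 denominator
    centered = X - X.mean(axis=0)
    return centered.T @ centered / (X.shape[0] - 1)
X = np.array([[2, 3], [4, 5], [6, 7], [8, 9], [10, 11]], dtype=float)
print(covariance_matrix(X))                                        # [[10. 10.] [10. 10.]]
print(np.allclose(covariance_matrix(X), np.cov(X, rowvar=False)))  # True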
2. Eigenvalues and Eigenvectors
With the covariance matrix in hand, we find its eigenvalues and eigenvectors. Eigenvalues and
eigenvectors are fundamental concepts in linear algebra. An eigenvector of a matrix remains in
the same direction, only scaled, when the matrix is applied to it. The corresponding eigenvalue
represents the amount by which the eigenvector is scaled.
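For instance, here is a tiny sketch (reusing the covariance matrix from the 2D example) verifying that multiplying an eigenvector by the matrix only rescales it by its eigenvalue:
import numpy as np
A = np.array([[10.0, 10.0], [10.0, 10.0]])   # covariance matrix from the 2D example
eigenvalues, eigenvectors = np.linalg.eigh(A)
v = eigenvectors[:, -1]   # eigenvector with the largest eigenvalue
lam = eigenvalues[-1]     # that eigenvalue (20 here)
print(np.allclose(A @ v, lam * v))  # True: the matrix only rescales its eigenvector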
For the covariance matrix, the eigenvectors represent the directions along which the data varies
the most. The eigenvalues tell us how much variance is captured along each eigenvector direction.
The eigenvector with the largest eigenvalue corresponds to the first principal component, the
direction with the most variance in the data. The second largest eigenvalue corresponds to the
second principal component, and so on.
3. Selecting Principal Components
The final step involves selecting the top k eigenvectors (principal components) corresponding to
the k largest eigenvalues. These principal components collectively form a new coordinate system
for the data. The original data is projected onto this new coordinate system, capturing the
essential patterns while discarding the less significant information.
In practice, you can choose how many principal components to retain based on how much of the variance you want to preserve, as sketched below. Retaining more components preserves more information but leaves you with a higher-dimensional representation.
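One common recipe, sketched here with NumPy, is to sort the eigenpairs by decreasing eigenvalue, keep just enough of them to reach a target fraction of the total variance, and project the centred data onto those directions (the function name and threshold are illustrative, not part of any library):
import numpy as np
def project_top_k(X, variance_to_keep=0.95):
    # Center the data and build its covariance matrix
    centered = X - X.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    # Sort eigenpairs from largest to smallest eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Keep the smallest number of components whose cumulative share of the
    # variance reaches the requested threshold
    explained = eigenvalues / eigenvalues.sum()
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep) + 1)
    # Project the centered data onto the top k principal directions
    return centered @ eigenvectors[:, :k], k
With scikit-learn, the same idea is available directly: passing a float between 0 and 1 as n_components (for example PCA(n_components=0.95)) keeps just enough components to explain that fraction of the variance.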
4. Dimensionality Reduction and Reconstruction
One of the primary applications of PCA is dimensionality reduction. By selecting a subset of the
principal components, you reduce the dimensionality of your data while retaining most of its
essential characteristics. This can significantly simplify subsequent analysis, visualization, and
modelling.
Additionally, you can use the retained principal components to reconstruct an approximation of the
original data. This is done by projecting the data back into the original space using the selected
principal components.
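A rough sketch of this round trip with scikit-learn (on made-up data) projects the data with transform and maps it back with inverse_transform:
import numpy as np
from sklearn.decomposition import PCA
# Made-up correlated data: 100 samples of 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)                 # project onto the top 2 components
X_approx = pca.inverse_transform(X_reduced)  # map back to the original 5-dimensional space
# The mean squared reconstruction error shrinks as more components are retained
print(np.mean((X - X_approx) ** 2))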
The mathematical machinery behind PCA might seem intricate, but its conceptual core is
accessible. By focusing on the relationships between variables, the eigenvalues and eigenvectors
guide us to the principal components that capture the essence of the data. With this
understanding, we can move on to practical implementations and explore how PCA works its magic
on real-world datasets.
How to Implement PCA with scikit-learn in Python
Now that we have a solid grasp of the mathematical foundation of Principal Component Analysis (PCA), let's dive into the practical steps of implementing PCA using popular libraries like scikit-learn in Python. By the end of this section, you'll be equipped to apply PCA to your datasets and
harness its power for dimensionality reduction and data analysis.
1. Data Preparation
Before applying PCA, make sure your data is preprocessed and standardized. PCA is sensitive to the scale of the features, so features with large ranges would otherwise dominate the components. Suppose you have your dataset loaded into a NumPy array or a Pandas DataFrame.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Generating some fake data for this demonstration
np.random.seed(42)
num_samples = 100
# Create correlated data with a positive correlation
mean = [5, 7]
cov = [[2, 1.5], [1.5, 2]]
data = np.random.multivariate_normal(mean, cov, num_samples)
# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
2. Applying PCA
With your data preprocessed, you can apply PCA. Scikit-learn provides an easy-to-use PCA class for
this purpose.
# Apply PCA
# Instantiate PCA with the number of components you want to retain
num_components = 2
pca = PCA(n_components=num_components)
# Fit PCA to the scaled data
pca_data = pca.fit_transform(scaled_data)
3. Explained Variance Ratio
One of the critical pieces of information PCA provides is the explained variance ratio of each
principal component. This ratio tells you the proportion of the total variance in the original data
captured by each component.
# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance_ratio)
Output:
Explained Variance Ratio: [0.8373527 0.1626473]
4. Visualization: PCA plot
Visualizing the transformed data in the PCA space can be insightful. For a 2D PCA space, you can
create a scatter plot.
# Visualize the original and PCA-transformed data
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(data[:, 0], data[:, 1])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Original Data')
plt.subplot(1, 2, 2)
plt.scatter(pca_data[:, 0], pca_data[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Transformed Data')
plt.tight_layout()
plt.show()
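As a small follow-up, you can inspect the fitted pca object from above to see which directions in the (standardized) feature space the components correspond to:
# Inspect the fitted components: each row is a principal direction in the scaled feature space
print("Principal components (rows):")
print(pca.components_)    # first row should be roughly [0.707, 0.707], up to sign, for this correlated data
print("Explained variance:", pca.explained_variance_)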