
Project on Clustering and Principal Component Analysis
PGP-DSBA
By Parijat Dev
Clustering Problems
Problem 1 - Read the data and perform basic analysis such as printing a few rows (head and tail), info, data summary, null values, duplicate values, etc.
Problem 2 - Treat missing values in CPC, CTR and CPM using the formula given.
Problem 3 - Clustering: Check if there are any outliers. Do you think treating outliers is necessary for K-Means clustering? Based on your judgement decide whether to treat outliers and if yes, which method to employ.
Problem 4 - Perform z-score scaling and discuss how it affects the speed of the algorithm.
Problem 5 - Perform hierarchical clustering by constructing a dendrogram using Ward linkage and Euclidean distance.
Problem 6 - Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means algorithm.
Problem 7 - Print silhouette scores for up to 10 clusters and identify optimum number of clusters.
Problem 8 - Profile the ads based on optimum number of clusters using silhouette score and your domain understanding.
Problem 9 - Conclude the project by providing a summary of your learnings.

Principal Component Analysis
Question 1 - Read the data and perform basic checks like checking head, info, summary, nulls, and duplicates, etc.
Question 2 - Perform detailed Exploratory analysis by creating certain questions like (i) Which state has highest gender ratio and which has the lowest? (ii) Which district has the highest & lowest gender ratio? (Example Questions). Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M, TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL, TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F, MAIN_CL_M, MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F, MAIN_OT_M, MAIN_OT_F
Problem 3 - We choose not to treat outliers for this case. Do you think that treating outliers for this case is necessary?
Problem 4 - Scale the Data using z-score method. Does scaling have any impact on outliers? Compare boxplots before and after scaling and comment.
Problem 5 - Perform all the required steps for PCA (use sklearn only): create the covariance matrix, get eigenvalues and eigenvectors.
Problem 6 - Identify the optimum number of PCs (for this project, take at least 90% explained variance). Show the scree plot.
Problem 7 - Compare PCs with Actual Columns and identify which is explaining most variance. Write inferences about all the principal components in terms of actual variables.
Problem 8 - Write linear equation for first PC1.
Clustering Problems
Problem 1 - Read the data and perform basic analysis such as printing a few rows (head
and tail), info, data summary, null values, duplicate values, etc.

Solution – Basic Information about the Dataset

• The advertisement DataFrame has clearly defined headers.
• It has 23066 rows and 19 columns.
• Time Stamp, Inventory Type, Ad Type, Platform, Device Type and Format columns are of object data type, and the rest of the columns are of float and integer type.
• There are no duplicates present in the DataFrame.
• Following is the description of the DataFrame.
• There are null values present in the CTR, CPM and CPC columns.
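For reference, below is a minimal sketch of these basic checks in pandas. The file name ads_data.csv is an assumption for illustration, not taken from the original project.

```python
import pandas as pd

# Load the advertisement data (the file name is assumed for illustration).
df = pd.read_csv("ads_data.csv")

# First and last few rows
print(df.head())
print(df.tail())

# Structure: column names, data types and non-null counts
df.info()

# Summary statistics of the numeric columns
print(df.describe().T)

# Null values per column and count of duplicate rows
print(df.isnull().sum())
print("Duplicate rows:", df.duplicated().sum())
```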

Problem 2 - Treat missing values in CPC, CTR and CPM using the formula given.

Solution – The missing values have been imputed using the below formula

• CPM = (Total Campaign Spend / Number of Impressions) * 1,000


• CPC = Total Cost (spend) / Number of Clicks
• CTR = (Total Measured Clicks / Total Measured Ad Impressions) x 100

Note: We have removed the columns Timestamp, InventoryType, Ad Type, Platform, Device Type and Format from our analysis as these are object (categorical) data types.
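A hedged sketch of how this imputation could be done with pandas: the CTR, CPM and CPC columns come from the dataset, while the Spend, Impressions and Clicks column names are assumptions that may need to be adjusted to the actual data.

```python
# Impute missing values using the given formulas.
# Spend, Impressions and Clicks are assumed column names.
df["CPM"] = df["CPM"].fillna(df["Spend"] / df["Impressions"] * 1000)
df["CPC"] = df["CPC"].fillna(df["Spend"] / df["Clicks"])
df["CTR"] = df["CTR"].fillna(df["Clicks"] / df["Impressions"] * 100)

# Verify that the three columns no longer contain missing values
print(df[["CTR", "CPM", "CPC"]].isnull().sum())
```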
Problem 3 - Clustering: Check if there are any outliers. Do you think treating outliers is
necessary for K-Means clustering? Based on your judgement decide whether to treat
outliers and if yes, which method to employ.

Solution – Boxplot of the attributes present in the DataFrame before outlier treatment.

Boxplot of the attributes present in the DataFrame after outlier treatment using the interquartile range.
Treating outliers can be important in the K-Means clustering technique, but it depends on the nature of
the data and the specific goals of the analysis. Here are some considerations:

• Sensitivity to Outliers: K-Means clustering is sensitive to outliers because it minimizes the sum of
squared distances between data points and the cluster centroids. Outliers with extreme values
can significantly affect the position of cluster centroids and, as a result, influence the final
clustering results.
• Impact on Cluster Centers: Outliers can pull the cluster centers away from the main cluster and
create artificial clusters around them. If the outliers are genuine data points representing distinct
groups, this might be desirable. However, if the outliers are noise or anomalies, it can lead to less
meaningful clusters.
• Data Preprocessing: Before applying K-Means, it's essential to preprocess your data, which
includes handling outliers. We can choose to remove outliers, cap their values, or impute them
based on domain knowledge or statistical methods. Alternatively, we can use robust versions of
K-Means that are less sensitive to outliers, such as K-Medians or K-Medoids.
• Outlier Detection: Before applying K-Means, you can perform outlier detection techniques like z-score, IQR, or density-based methods to identify and remove or treat outliers separately from the clustering process. A minimal IQR-based capping sketch is shown after this list.
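The sketch below caps values outside the interquartile-range fences, continuing from the hypothetical df above. It is one possible approach, not necessarily the exact treatment used in the codebook.

```python
import numpy as np

def cap_outliers_iqr(frame, columns):
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for the given columns."""
    frame = frame.copy()
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        frame[col] = frame[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return frame

# Apply the capping to every numeric column of the (hypothetical) DataFrame df
numeric_cols = df.select_dtypes(include=np.number).columns
df_capped = cap_outliers_iqr(df, numeric_cols)
```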

Problem 4 - Perform z-score scaling and discuss how it affects the speed of the algorithm.

Solution – The z-score normalization (standardization) process itself does not directly affect the speed of an algorithm such as K-Means clustering. The main purpose of z-score normalization is to scale the features to have a mean of 0 and a standard deviation of 1, which is helpful when features have significantly different scales.

The impact of z-score normalization on the speed of the algorithm can be indirect and depends on the
specific algorithm being used. For K-Means clustering, the primary factor affecting its speed is the distance
calculation between data points and centroids. Normalizing the features with z-score may help improve
the convergence speed of the algorithm in some cases because it brings all features to a similar scale.

When features are not normalized, those with larger scales can dominate the distance calculations, and
the algorithm might take longer to converge. On the other hand, if features are already on a similar scale,
the impact of z-score normalization on the speed of the K-Means algorithm might be negligible.

However, it's essential to note that the effect of z-score normalization on the speed of the K-Means
algorithm is usually not significant compared to other factors such as the number of data points, the
number of clusters (k), and the initial placement of centroids. Additionally, some libraries and
implementations of K-Means may handle normalization internally, reducing the need for manual feature
scaling.
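A short sketch of z-score scaling with scikit-learn's StandardScaler, continuing from the hypothetical df_capped and numeric_cols above:

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Z-score scaling: every feature ends up with mean 0 and standard deviation 1.
scaler = StandardScaler()
scaled = scaler.fit_transform(df_capped[numeric_cols])
df_scaled = pd.DataFrame(scaled, columns=numeric_cols, index=df_capped.index)

print(df_scaled.mean().round(2).head())  # approximately 0 for every column
print(df_scaled.std().round(2).head())   # approximately 1 for every column
```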

Problem 5 - Perform hierarchical clustering by constructing a dendrogram using Ward linkage and Euclidean distance.

Answer – Below is the dendrogram, truncated to show only the last 10 merged clusters (p = 10).
From the dendrogram we can identify that around 5 clusters are possible.
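A sketch of how such a truncated dendrogram could be produced with SciPy, assuming the scaled data is in df_scaled as above:

```python
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Ward linkage (which works with Euclidean distances) on the scaled data.
link_matrix = linkage(df_scaled, method="ward", metric="euclidean")

plt.figure(figsize=(10, 5))
dendrogram(link_matrix, truncate_mode="lastp", p=10)  # show only the last 10 merges
plt.title("Dendrogram (Ward linkage, Euclidean distance)")
plt.xlabel("Cluster (truncated)")
plt.ylabel("Distance")
plt.show()
```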

Problem 6 - Make Elbow plot (up to n=10) and identify optimum number of clusters for k-
means algorithm.

Solution – Below is the elbow plot for up to 10 clusters. It is sometimes difficult to determine the exact number of clusters from an elbow plot alone.

From the graph, 3 appears to be the optimum number of clusters for this DataFrame; to further confirm our analysis we will use the silhouette score.
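A sketch of how the elbow plot could be generated, assuming df_scaled from the scaling step; random_state=42 is an arbitrary choice for reproducibility:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Within-cluster sum of squares (inertia) for k = 1 to 10.
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(df_scaled)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Elbow plot")
plt.show()
```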

Problem 7 - Print silhouette scores for up to 10 clusters and identify optimum number of
clusters.

As we can see from the figure below, the silhouette score is highest at 5 clusters. So it is clear that 5 is the optimum number of clusters for this DataFrame.
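A sketch of how the silhouette scores could be printed for k = 2 to 10 (silhouette scores are undefined for a single cluster), again assuming df_scaled:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Silhouette scores are defined only for 2 or more clusters, so k runs from 2 to 10.
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(df_scaled)
    print(f"k = {k}: silhouette score = {silhouette_score(df_scaled, labels):.4f}")
```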
Problem 8 - Profile the ads based on optimum number of clusters using silhouette score
and your domain understanding.

Solution – We divided the DataFrame into 5 clusters; the mean and sum of each variable for every cluster are shown below.
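A sketch of how the cluster profile could be built, assuming the 5-cluster K-Means model is fitted on df_scaled and the labels are attached back to the original df:

```python
from sklearn.cluster import KMeans

# Fit the final model with the chosen number of clusters and attach the labels.
final_km = KMeans(n_clusters=5, n_init=10, random_state=42)
df["cluster"] = final_km.fit_predict(df_scaled)

# Mean and sum of every numeric variable per cluster, as used in the profiling below.
cluster_mean = df.groupby("cluster").mean(numeric_only=True)
cluster_sum = df.groupby("cluster").sum(numeric_only=True)
print(cluster_mean)
print(cluster_sum)
```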

Spend Analysis by Device Type


It can be clearly observed that spend in cluster 2 is the highest, followed by cluster 3. Spend remains approximately the same across device types within each cluster. Cluster 4 has the lowest spend.

Analysis of Clicks

From the above figure we can see that cluster 3 has the highest number of clicks, followed by cluster 1 and cluster 2. Cluster 4 has the lowest number of clicks. Cluster 3 has the highest number of clicks and the second highest spend across all clusters.
Analysis of Revenue

Cluster 2 generates the highest revenue, followed by cluster 3. It is possible that buyers in cluster 2 are mostly sure of what they want to buy and do not make multiple clicks before making a purchase. Revenue from cluster 4 remains the lowest. Revenue generated from the two device types remains roughly equal within a given cluster.

Analysis of Click Through Rate

CTR or Click-through rate is the ratio of clicks on a specific link to the number of times a page, email, or
advertisement is shown. It is commonly used to measure the success of an online advertising campaign
for a particular website.

A good click-through rate averages around 1.91% for search and 0.35% for display. We can see from the above graph that the click-through rate is best for cluster 2, followed by cluster 0. That means that when people from cluster 2 see the advertisement, they open the link most of the time.

Analysis of CPM

CPM, which stands for cost per mille, represents the average cost incurred for every one thousand ad
impressions. It indicates the amount paid for every thousand instances that an ad is displayed to internet
users. We can see that cluster 2 and cluster 0 have a very good (low) CPM, meaning that the company spends very little per thousand impressions in these clusters, the campaign is successful, and the company generates more revenue from these clusters with little spend on advertising. The highest CPM is in cluster 3, followed by cluster 4.

Analysis of CPC

CPC, or cost per click, is an alternative to CPM. Cost per click (CPC) is an online advertising revenue model that websites use to bill advertisers based on the number of times visitors click on a display ad attached to their sites.
Cluster 2 has the highest CPC, followed by cluster 0. As we saw earlier, cluster 2 and cluster 0 have a lower CPM, and now we can see they have a higher CPC. This suggests that the advertisements are well targeted and the price per click of advertising to these clusters is high.

Problem 9 - Conclude the project by providing a summary of your learnings

Solution – Below is the analysis of all the clusters.

Cluster 0

This cluster has moderate ad size, impressions, and clicks but a relatively low CTR (Click-Through Rate)
and CPC (Cost Per Click).

The CPM (Cost Per Mille) is moderate, indicating a reasonable cost for reaching a thousand impressions.

Recommendation: This cluster may represent ads that are not highly engaging or targeted. To improve
performance, consider optimizing ad creatives, targeting specific audiences, and refining the ad
placement to increase CTR.

Cluster 1

This cluster has the highest ad size, impressions, clicks, and CTR among all clusters. However, it also has
a high CPC and relatively low CPM.

The revenue generated from this cluster is the highest, suggesting that these ads are effective in driving
conversions.

Recommendation: Since this cluster performs well in terms of CTR and revenue, focus on maintaining
the targeting strategy and ad content. However, it's essential to monitor the CPC and optimize the bids
to ensure cost-effectiveness.

Cluster 2

This cluster has the highest available impressions, matched queries, and total impressions. However, it
has the lowest CTR and CPC.

The CPM is low, indicating a cost-effective way to reach a large number of impressions.

Recommendation: Since this cluster generates many impressions but has a low CTR, consider refining
the ad targeting and creative to increase user engagement and CTR.

Cluster 3

This cluster has the lowest ad size, matched queries, impressions, and clicks. It also has a high CTR, CPC,
and CPM.

The revenue from this cluster is relatively low due to the small number of impressions and clicks.

Recommendation: This cluster might represent niche ads with high CPC and low reach. Consider
expanding the targeting options or revisiting the ad strategy to increase impressions and clicks.
Cluster 4

This cluster has a moderate ad size, matched queries, impressions, clicks, and CTR. The CPC and CPM are
also moderate.

The revenue generated from this cluster is reasonable, indicating a balanced performance.

Recommendation: Continue monitoring the performance of this cluster and make small optimizations to
improve the CTR and overall ad efficiency.

Overall Recommendation

We should analyze the characteristics and performance of each cluster regularly to identify any shifts in
ad performance.

We can conduct A/B testing to refine ad creatives, targeting options, and bidding strategies.

Continuously optimize CPC bids to balance cost and performance.

We can consider using the insights from cluster 1 to drive revenue while experimenting with different
strategies in other clusters to improve their performance.
Principal Component Analysis

Question 1 - Read the data and perform basic checks like checking head, info, summary,
nulls, and duplicates, etc.

Solution - Basic Information about the Dataset

• The DataFrame has clearly defined headers.
• It has 640 rows and 61 columns.
• There are no duplicates present in the DataFrame.
• Following is the description of the DataFrame.

VARIABLE COUNT MEAN STD MIN 25% 50% 75% MAX
NO_HH 640 51222.87 48135.41 350 19484 35837 68892 310450
TOT_M 640 79940.58 73384.51 391 30228 58339 107918.5 485417
TOT_F 640 122372.1 113600.7 698 46517.75 87724.5 164251.8 750392
M_06 640 12309.11 11500.9 56 4733.75 9159 16520.25 96223
F_06 640 11942.39 11326.2 56 4672.25 8663 15902.25 95129
M_SC 640 13820.95 14426.37 0 3466.25 9591.5 19429.75 103307
F_SC 640 20778.39 21727.89 0 5603.25 13709 29180 156429
M_ST 640 6191.808 9912.669 0 293.75 2333.5 7658 96785
F_ST 640 10155.64 15875.7 0 429.5 3834.5 12480.25 130119
M_LIT 640 57967.98 55910.28 286 21298 42693.5 77989.5 403261
F_LIT 640 66359.57 75037.86 371 20932 43796.5 84799.75 571140
M_ILL 640 21972.61 19825.6 105 8590 15767.5 29512.5 105961
F_ILL 640 56012.52 47116.69 327 22367 42386 78471 254160
TOT_WORK_M 640 37992.41 36419.54 100 13753.5 27936.5 50226.75 269422
TOT_WORK_F 640 41295.76 37192.36 357 16097.75 30588.5 53234.25 257848
MAINWORK_M 640 30204.45 31480.92 65 9787 21250.5 40119 247911
MAINWORK_F 640 28198.85 29998.26 240 9502.25 18484 35063.25 226166
MAIN_CL_M 640 5424.342 4739.162 0 2023.5 4160.5 7695 29113
MAIN_CL_F 640 5486.042 5326.363 0 1920.25 3908.5 7286.25 36193
MAIN_AL_M 640 5849.109 6399.508 0 1070.25 3936.5 8067.25 40843
MAIN_AL_F 640 8925.995 12864.29 0 1408.75 3933.5 10617.5 87945
MAIN_HH_M 640 883.8938 1278.642 0 187.5 498.5 1099.25 16429
MAIN_HH_F 640 1380.773 3179.414 0 248.75 540.5 1435.75 45979
MAIN_OT_M 640 18047.18 26068.4 36 3997.5 9598 21249.5 240855
MAIN_OT_F 640 12406.04 18972.2 153 3142.5 6380.5 14368.25 209355
MARGWORK_M 640 7787.961 7410.792 35 2937.5 5627 9800.25 47553
MARGWORK_F 640 13096.91 10996.47 117 5424.5 10175 18879.25 66915
MARG_CL_M 640 1040.738 1311.547 0 311.75 606.5 1281 13201
MARG_CL_F 640 2307.683 3564.626 0 630.25 1226 2659.25 44324
MARG_AL_M 640 3304.327 3781.556 0 873.5 2062 4300.75 23719
MARG_AL_F 640 6463.281 6773.876 0 1402.5 4020.5 9089.25 45301
MARG_HH_M 640 316.7422 462.6619 0 71.75 166 356.5 4298
MARG_HH_F 640 786.6266 1198.718 0 171.75 429 962.5 15448
MARG_OT_M 640 3126.155 3609.392 7 935.5 2036 3985.25 24728
MARG_OT_F 640 3539.323 4115.191 19 1071.75 2349.5 4400.5 36377
MARGWORK_3_6_M 640 41948.17 39045.32 291 16208.25 30315 57218.75 300937
MARGWORK_3_6_F 640 81076.32 82970.41 341 26619.5 56793 107924 676450
MARG_CL_3_6_M 640 6394.988 6019.807 27 2372 4630 8167 39106
MARG_CL_3_6_F 640 10339.86 8467.473 85 4351.5 8295 15102 50065
MARG_AL_3_6_M 640 789.8484 905.6393 0 235.5 480.5 986 7426
MARG_AL_3_6_F 640 1749.584 2496.542 0 497.25 985.5 2059 27171
MARG_HH_3_6_M 640 2743.636 3059.586 0 718.75 1714.5 3702.25 19343
MARG_HH_3_6_F 640 5169.851 5335.64 0 1113.75 3294 7502.25 36253
MARG_OT_3_6_M 640 245.3625 358.7286 0 58 129.5 276 3535
MARG_OT_3_6_F 640 585.8844 900.0258 0 127.75 320.5 719.25 12094
MARGWORK_0_3_M 640 2616.141 3036.964 7 755 1681.5 3320.25 20648
MARGWORK_0_3_F 640 2834.545 3327.837 14 833.5 1834.5 3610.5 25844
MARG_CL_0_3_M 640 1392.973 1489.707 4 489.5 949 1714 9875
MARG_CL_0_3_F 640 2757.057 2788.77 30 957.25 1928 3599.75 21611
MARG_AL_0_3_M 640 250.8891 453.3366 0 47 114.5 270.75 5775
MARG_AL_0_3_F 640 558.0984 1117.643 0 109 247.5 568.75 17153
MARG_HH_0_3_M 640 560.6906 762.579 0 136.5 308 642 6116
MARG_HH_0_3_F 640 1293.431 1585.378 0 298 717 1710.75 13714
MARG_OT_0_3_M 640 71.37969 107.8976 0 14 35 79 895
MARG_OT_0_3_F 640 200.7422 309.7409 0 43 113 240 3354
NON_WORK_M 640 510.0141 610.6032 0 161 326 604.5 6456
NON_WORK_F 640 704.7781 910.2092 5 220.5 464.5 853.5 10533

• We have removed State Code, Dist.Code, State and Area Name as these are object data types. All other columns are of int and float data types.

Question 2 - Perform detailed Exploratory analysis by creating certain questions like (i)
Which state has highest gender ratio and which has the lowest? (ii) Which district has the
highest & lowest gender ratio? (Example Questions). Pick 5 variables out of the given 24
variables below for EDA: No_HH, TOT_M, TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST,
M_LIT, F_LIT, M_ILL, F_ILL, TOT_WORK_M, TOT_WORK_F, MAINWORK_M,
MAINWORK_F, MAIN_CL_M, MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M,
MAIN_HH_F, MAIN_OT_M, MAIN_OT_F

Solution –

(I) Gender Ratio

A) Which state has the highest gender ratio and which has the lowest?

Lakshadweep, Jammu and Kashmir, Uttar Pradesh, Rajasthan and Meghalaya are the top 5 states with the highest gender ratio, while Chhattisgarh, Goa, Dadra and Nagar Haveli, Maharashtra and Tripura are the states with the lowest gender ratio.

B) Which district has the highest and lowest gender ratio?

Lakshadweep, Badgam of Jammu & Kashmir, Mahamaya Nagar of Uttar Pradesh, Dhaulpur of Rajasthan and Baghpat of Uttar Pradesh are the districts with the highest gender ratio, and Krishna of Andhra Pradesh, Koraput of Odisha, Virudhunagar of Tamil Nadu, West Godavari of Andhra Pradesh and Baudh of Odisha are the districts with the lowest gender ratio.
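A hedged sketch of how the state-level gender ratio could be computed. The file name census_data.csv is an assumption; the State, TOT_M and TOT_F columns follow the variable list given in the question.

```python
import pandas as pd

# Census-style data; the file name is an assumption, the State, TOT_M and TOT_F
# columns follow the variable list given in the question.
census = pd.read_csv("census_data.csv")

# Gender ratio per state: total female population divided by total male population.
state_ratio = (census.groupby("State")[["TOT_F", "TOT_M"]].sum()
                     .assign(gender_ratio=lambda d: d["TOT_F"] / d["TOT_M"])
                     .sort_values("gender_ratio", ascending=False))

print(state_ratio.head())  # states with the highest gender ratio
print(state_ratio.tail())  # states with the lowest gender ratio
```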

(II) Number of Households

A) Which state has the highest number of households and which has the lowest?

West Bengal, Maharashtra, Karnataka and Kerala are the states with the highest number of households, and Arunachal Pradesh, Himachal Pradesh and Sikkim are the states with the lowest number of households.

B) Which district has the highest number of households and which has the lowest?

North Twenty Four Parganas of West Bengal, Mumbai Suburban of Maharashtra and Thane of Maharashtra are the districts with the highest number of households, and Dibang Valley, Anjaw and Upper Siang, all of Arunachal Pradesh, are the districts with the lowest number of households.

(III) SC Population

A) Which state has the highest SC population and which has the lowest?

From the above figure we can easily identify the top states with the highest SC population and the states with the lowest SC population.

B) Which district has the highest SC population and which has the lowest?

South Andaman from the union territory of Andaman & Nicobar Islands and Lower Dibang Valley from the state of Arunachal Pradesh are the districts with the lowest SC population.

Barddhaman and North Twenty Four Parganas, both from the state of West Bengal, have the highest SC population.

(IV) Total Male Working Population

A) Which state has the highest total working male population and which has the lowest?

From the above figure we can easily find the states with the highest and lowest total working male population.

B) Which district has the highest total working male population and which has the lowest?

Please refer to line [208] of the codebook, which shows a graph similar to the one above.

(V) Total Female Working Population

A) Which state has the highest total working female population and which has the lowest?

From the above figure we can easily find the states with the highest and lowest total working female population.

B) Which district has the highest total working female population and which has the lowest?

Please refer to line [212] of the codebook, which shows a graph similar to the one above.

Problem 3 - We choose not to treat outliers for this case. Do you think that treating
outliers for this case is necessary?

Treating outliers in PCA depends on the specific context and goals of your analysis. Here are some
considerations:

• Impact on results: Outliers can have a significant impact on the results of PCA. Since PCA aims to
capture the maximum variance in the data, outliers with extreme values can disproportionately
influence the principal components. If outliers are present in your data and are not representative
of the underlying patterns, they can distort the results.
• Robustness: PCA is not inherently robust to outliers. It assumes that the data follows a Gaussian distribution, and outliers can violate this assumption. Outliers can affect the estimation of the covariance matrix and the computation of the principal components.
• Data integrity: Outliers may represent genuine observations and should not be removed without
careful consideration. Removing outliers without a valid reason can lead to loss of important
information and potentially bias the analysis.

Based on these considerations, it is generally recommended to handle outliers in PCA.


Problem 4 - Scale the Data using z-score method. Does scaling have any impact on
outliers? Compare boxplots before and after scaling and comment.

Solution – Scaling the data using z-score normalization does not directly remove or eliminate outliers
from the dataset. However, it can affect the representation of outliers in the scaled data.

When applying z-score scaling, the values are transformed to have a mean of 0 and a standard deviation
of 1. This means that the data points will be centered around 0 and spread out based on their standard
deviations. Outliers that are far from the mean can still retain their extreme values after scaling, but
their relative position in the scaled data may change.

In some cases, scaling the data can make outliers appear less extreme compared to the rest of the data.
This is because the extreme values are adjusted based on the mean and standard deviation, which can
shrink their range. However, the actual values of the outliers are not altered, and they can still be
identified as extreme values.

The boxplots before and after scaling are shown below.

Boxplot before scaling

Boxplot after scaling

In the before and after boxplots we can observe that the scale of the data on the y-axis is now normalized and the scale of each individual boxplot is the same, but outliers are still visible.
Problem 5 - Perform all the required steps for PCA (use sklearn only): create the covariance matrix, get eigenvalues and eigenvectors.

Solution – PCA covariance matrix: we have successfully extracted the eigenvalues in line [357] and the eigenvectors in line [358] of the codebook file.
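A sketch of these steps, assuming the numeric census data from the previous questions (the census DataFrame introduced earlier). NumPy's cov is used here for the covariance matrix, while the eigen decomposition comes from scikit-learn's PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale the numeric census variables (object columns were dropped earlier).
X = StandardScaler().fit_transform(census.select_dtypes(include=np.number))

# Covariance matrix of the scaled data (on standardized data this is
# essentially the correlation matrix).
cov_matrix = np.cov(X, rowvar=False)

# Eigenvalues and eigenvectors via scikit-learn's PCA.
pca = PCA()
pca.fit(X)
eigen_values = pca.explained_variance_   # eigenvalues of the covariance matrix
eigen_vectors = pca.components_          # each row is an eigenvector (the loadings)

print(cov_matrix.shape)
print(eigen_values[:5])
```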

Problem 6 - Identify the optimum number of PCs (for this project, take at least 90% explained variance). Show the scree plot.

Solution - PC1 to PC6 together account for more than 90% of the explained variance. We will take only these PCs into consideration.
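A sketch of the scree plot and the 90% explained-variance check, continuing from the fitted pca object above:

```python
import numpy as np
import matplotlib.pyplot as plt

explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)

# Scree plot with the individual and cumulative explained variance.
plt.plot(range(1, len(explained) + 1), explained, marker="o", label="individual")
plt.plot(range(1, len(cumulative) + 1), cumulative, marker="s", label="cumulative")
plt.axhline(0.90, linestyle="--", color="grey")  # 90% explained-variance threshold
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.title("Scree plot")
plt.legend()
plt.show()

print("PCs needed for 90% variance:", np.argmax(cumulative >= 0.90) + 1)
```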

Problem 7 - Compare PCs with Actual Columns and identify which is explaining most
variance. Write inferences about all the principal components in terms of actual
variables.
PC1 represents a component with positive loadings for all variables, indicating a general measure of
overall population and household size. It captures the variation in the total population, the number of
households, and the population in different age groups (0-6, 3-6, 0-3). It is a measure of the total
population and household characteristics.

PC2 captures the variation related to scheduled caste and scheduled tribe populations, both male and
female. It has negative loadings on scheduled caste and scheduled tribe populations, indicating that areas
with a higher proportion of these populations have lower PC2 scores. It represents the variation in these
specific ethnic groups' populations.
PC3 is associated with variations in the literacy rate and working population. It has positive loadings on
literates population, both male and female, and negative loadings on the total working population. Higher
PC3 scores suggest areas with higher literacy rates but a lower proportion of the total working population.

PC4 represents variations in the main working population and cultivators. It has negative loadings on the
main working population and main cultivator population, both male and female. Higher PC4 scores
indicate areas where fewer people are engaged in these specific work categories.

PC5 is associated with variations in the main agricultural laborers and main household industries
populations, both male and female. It has positive loadings on main agricultural laborers and main
household industries populations. Higher PC5 scores suggest areas with a higher proportion of people
engaged in these specific types of work.

PC6 captures variations related to marginal workers and populations in the age group 3-6. It has negative
loadings on marginal workers and populations in the age group 3-6, both male and female. Higher PC6
scores indicate areas with a lower proportion of marginal workers and populations in the age group 3-6.

These six principal components together explain 90% of the variance in the dataset, providing a concise
representation of the original variables' patterns. The interpretation of each principal component is based
on the combination of variable loadings and can help in understanding the underlying structure and
patterns in the data.
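To arrive at interpretations like those above, the loadings of the first six PCs can be inspected against the original variables. A sketch, continuing from the fitted pca object and the census DataFrame assumed earlier:

```python
import pandas as pd

# Loadings of the first six PCs against the original (scaled) variables.
feature_names = census.select_dtypes(include="number").columns
loadings = pd.DataFrame(pca.components_[:6].T,
                        index=feature_names,
                        columns=[f"PC{i}" for i in range(1, 7)])

# The variables with the largest absolute loadings drive each PC's interpretation.
for pc in loadings.columns:
    top = loadings[pc].abs().sort_values(ascending=False).head(5)
    print(pc, list(top.index))
```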

Problem 8 - Write linear equation for first PC1.

Solution – We can write the linear equation for PC1 as PC1 = a1*x1 + a2*x2 + a3*x3 + ... + an*xn,

where the coefficients 'a' are the loadings of PC1 and 'x' are the observed scaled values of the original variables.

PC1 = -4.6172634816
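A sketch of how the PC1 equation and scores could be written out from the loadings, continuing from the pca, X and feature_names objects assumed above:

```python
import numpy as np

# PC1 score of every row: dot product of the scaled data with the PC1 loadings.
pc1_loadings = pca.components_[0]   # the coefficients a1 ... an
pc1_scores = X @ pc1_loadings       # equivalent to a1*x1 + a2*x2 + ... + an*xn per row

# Print the equation term by term (coefficients rounded for readability).
equation = " + ".join(f"({a:.3f} * {name})"
                      for a, name in zip(pc1_loadings, feature_names))
print("PC1 =", equation)
print("PC1 score of the first observation:", pc1_scores[0])
```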
