Data Mining Project - Parijat
Analysis
PGP-DSBA
By-Parijat Dev
Clustering Problems
Problem 1 - Read the data and perform basic analysis such as printing a few rows (head and tail), info, data summary, null values, duplicate values, etc.
Problem 2 - Treat missing values in CPC, CTR and CPM using the formula given.
Problem 3 - Clustering: Check if there are any outliers. Do you think treating outliers is necessary for K-Means clustering? Based on your judgement decide whether to treat outliers and if yes, which method to employ.
Problem 4 - Perform z-score scaling and discuss how it affects the speed of the algorithm.
Problem 5 - Perform Hierarchical clustering by constructing a Dendrogram using WARD linkage and Euclidean distance.
Problem 6 - Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means algorithm.
Problem 7 - Print silhouette scores for up to 10 clusters and identify optimum number of clusters.
Problem 8 - Profile the ads based on optimum number of clusters using silhouette score and your domain understanding.
Problem 9 - Conclude the project by providing summary of your learnings
Principal Component Analysis
Question 1 - Read the data and perform basic checks like checking head, info, summary, nulls, and duplicates, etc.
Question 2 - Perform detailed Exploratory analysis by creating certain questions like (i) Which state has the highest gender ratio and which has the lowest? (ii) Which district has the highest and lowest gender ratio? (Example Questions). Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M, TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL, TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F, MAIN_CL_M, MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F, MAIN_OT_M, MAIN_OT_F
Problem 3 - We choose not to treat outliers for this case. Do you think that treating outliers for this case is necessary?
Problem 4 - Scale the Data using z-score method. Does scaling have any impact on outliers? Compare boxplots before and after scaling and comment.
Problem 5 - Perform all the required steps for PCA (use sklearn only): create the covariance matrix; get eigen values and eigen vectors.
Problem 6 - Identify the optimum number of PCs (for this project, take at least 90% explained variance). Show Scree plot.
Problem 7 - Compare PCs with Actual Columns and identify which is explaining most variance. Write inferences about all the principal components in terms of actual variables.
Problem 8 - Write linear equation for first PC1.
Clustering Problems
Problem 1 - Read the data and perform basic analysis such as printing a few rows (head
and tail), info, data summary, null values, duplicate values, etc.
• There are null values present in the CTR, CPM and CPC columns.
Problem 2 - Treat missing values in CPC, CTR and CPM using the formula given.
Solution – The missing values have been imputed using the formula below.
Note: We have removed the columns Timestamp, InventoryType, Ad Type, Platform, Device Type and Format
from our analysis, as these are object (categorical) datatypes.
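Since the imputation formula itself is not reproduced in this report, the sketch below uses the standard definitions of these metrics (CTR as clicks per impression, CPC as spend per click, CPM as spend per thousand impressions) on a small hypothetical frame; the column names and values are assumptions, not the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the ads data; column names are assumptions.
df = pd.DataFrame({
    "Impressions": [10000, 5000, 20000],
    "Clicks":      [200, 40, 500],
    "Spend":       [50.0, 10.0, 120.0],
    "CTR":         [np.nan, 0.8, np.nan],   # in percent, some missing
    "CPC":         [np.nan, 0.25, np.nan],
    "CPM":         [np.nan, 2.0, np.nan],
})

# Fill only the missing entries, using the standard definitions.
df["CTR"] = df["CTR"].fillna(df["Clicks"] / df["Impressions"] * 100)
df["CPC"] = df["CPC"].fillna(df["Spend"] / df["Clicks"])
df["CPM"] = df["CPM"].fillna(df["Spend"] / df["Impressions"] * 1000)
```

After this step no nulls remain in the three columns, while the originally observed values are left untouched.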
Problem 3- Clustering: Check if there are any outliers. Do you think treating outliers is
necessary for K-Means clustering? Based on your judgement decide whether to treat
outliers and if yes, which method to employ.
Solution – Boxplot of the attributes present in the data frame before outlier treatment.
Boxplot of the attributes present in the data frame after outlier treatment using the interquartile range.
Treating outliers can be important in the K-Means clustering technique, but it depends on the nature of
the data and the specific goals of the analysis. Here are some considerations:
• Sensitivity to Outliers: K-Means clustering is sensitive to outliers because it minimizes the sum of
squared distances between data points and the cluster centroids. Outliers with extreme values
can significantly affect the position of cluster centroids and, as a result, influence the final
clustering results.
• Impact on Cluster Centers: Outliers can pull the cluster centers away from the main cluster and
create artificial clusters around them. If the outliers are genuine data points representing distinct
groups, this might be desirable. However, if the outliers are noise or anomalies, it can lead to less
meaningful clusters.
• Data Preprocessing: Before applying K-Means, it's essential to preprocess your data, which
includes handling outliers. We can choose to remove outliers, cap their values, or impute them
based on domain knowledge or statistical methods. Alternatively, we can use robust versions of
K-Means that are less sensitive to outliers, such as K-Medians or K-Medoids.
• Outlier Detection: Before applying K-Means, you can perform outlier detection techniques like z-
score, IQR, or density-based methods to identify and remove or treat outliers separately from the
clustering process.
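The IQR-based capping mentioned above can be sketched as follows; the synthetic series and the conventional 1.5 × IQR fences are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic numeric column with a few injected extreme values.
s = pd.Series(np.concatenate([rng.normal(100, 10, 500), [500.0, -200.0]]))

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) instead of dropping, so no rows are lost.
s_capped = s.clip(lower, upper)
```

Capping preserves the sample size, which matters for K-Means since dropping rows would change the cluster geometry.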
Problem 4 - Perform z-score scaling and discuss how it affects the speed of the algorithm.
Solution – The z-score normalization (standardization) process itself does not directly affect the speed of
the algorithm like K-Means clustering. The main purpose of z-score normalization is to scale the features
to have a mean of 0 and a standard deviation of 1, which can be helpful in cases where features have
significantly different scales.
The impact of z-score normalization on the speed of the algorithm can be indirect and depends on the
specific algorithm being used. For K-Means clustering, the primary factor affecting its speed is the distance
calculation between data points and centroids. Normalizing the features with z-score may help improve
the convergence speed of the algorithm in some cases because it brings all features to a similar scale.
When features are not normalized, those with larger scales can dominate the distance calculations, and
the algorithm might take longer to converge. On the other hand, if features are already on a similar scale,
the impact of z-score normalization on the speed of the K-Means algorithm might be negligible.
However, it's essential to note that the effect of z-score normalization on the speed of the K-Means
algorithm is usually not significant compared to other factors such as the number of data points, the
number of clusters (k), and the initial placement of centroids. Additionally, some libraries and
implementations of K-Means may handle normalization internally, reducing the need for manual feature
scaling.
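As a sketch, z-score scaling with scikit-learn's StandardScaler on synthetic data with two very different feature scales (the magnitudes are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales, as in the ads data.
X = np.column_stack([rng.normal(1e6, 2e5, 300),   # impressions-like magnitude
                     rng.normal(2.0, 0.5, 300)])  # CPC-like magnitude

# After scaling, each column has mean 0 and standard deviation 1,
# so neither feature dominates the Euclidean distances in K-Means.
X_scaled = StandardScaler().fit_transform(X)
```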
Problem 5 - Perform Hierarchical clustering by constructing a Dendrogram using WARD linkage and Euclidean distance.
Solution – Below is the dendrogram, truncated to the last p = 10 merged clusters.
From the above dendrogram we can identify that 5 clusters are possible.
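A sketch of the Ward/Euclidean construction with SciPy on synthetic data (Ward linkage implies Euclidean distance in SciPy; the drawing call is commented out since it needs matplotlib):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
# Stand-in for the scaled feature matrix: three well-separated blobs.
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0, 3, 6)])

# Ward linkage merges the pair of clusters that least increases
# the within-cluster sum of squares at each step.
Z = linkage(X, method="ward")

# Truncated dendrogram showing only the last 10 merged clusters:
# dendrogram(Z, truncate_mode="lastp", p=10)

# Cut the tree into a chosen number of flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
```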
Problem 6 - Make Elbow plot (up to n=10) and identify optimum number of clusters for k-
means algorithm.
Solution – Below is the elbow plot for up to 10 clusters. It is sometimes difficult to
read the exact cluster count from the elbow plot alone.
From the graph, 3 appears to be the optimum number of clusters for this data frame;
to confirm our analysis we will also use the silhouette score.
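The elbow computation can be sketched like this on synthetic data, recording the within-cluster sum of squares (sklearn's `inertia_`) for each k:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in (0, 4, 8)])

# Within-cluster sum of squares for k = 1..10; the "elbow" is where
# adding another cluster stops reducing inertia sharply.
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 11)]
# plt.plot(range(1, 11), wcss, marker="o")  # matplotlib, to draw the plot
```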
Problem 7 - Print silhouette scores for up to 10 clusters and identify optimum number of
clusters.
Solution – As we can see from the figure below, the silhouette score is highest at 5 clusters, so 5 is
the optimum number of clusters for this data frame.
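A minimal sketch of the silhouette sweep on synthetic data (the score needs at least 2 clusters, so k runs from 2 to 10):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in (0, 4, 8)])

# Silhouette score for each candidate k; higher means better-separated,
# more cohesive clusters.
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=0).fit_predict(X))
          for k in range(2, 11)}
best_k = max(scores, key=scores.get)
```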
Problem 8 - Profile the ads based on optimum number of clusters using silhouette score
and your domain understanding.
Solution – We divided the data frame into 5 clusters; the mean and sum for each cluster are shown
below.
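The profiling tables can be produced with a groupby aggregation; the sketch below uses random data and assumed column names rather than the actual ads metrics:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Hypothetical scaled metrics; column names are assumptions.
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["Clicks", "Spend", "Revenue"])

# Assign each row to one of 5 clusters.
df["Cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(df)

# Per-cluster mean and sum of every metric: the profiling table.
profile = df.groupby("Cluster").agg(["mean", "sum"])
```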
Analysis of Clicks
From the above figure, we can see that cluster 3 has the highest number of clicks, followed by clusters
1 and 2, while cluster 4 has the lowest. Cluster 3 also has the
second-highest spend across all clusters.
Analysis of Revenue
Cluster 2 generates the highest revenue, followed by cluster 3. It is possible that the buyers in
cluster 2 are mostly sure of what they want to buy and do not make multiple clicks before making a
purchase. Revenue from cluster 4 remains the lowest. Revenue generated from both devices remains
equal within each cluster.
Analysis of CTR
CTR, or click-through rate, is the ratio of clicks on a specific link to the number of times a page, email, or
advertisement is shown. It is commonly used to measure the success of an online advertising campaign
for a particular website.
A good click-through rate averages around 1.91% for search and 0.35% for display. We can see
from the above graph that CTR is best for cluster 2, followed by cluster
0. That means that when the people from cluster 2 see the advertisement, they will open
the link most of the time.
Analysis of CPM
CPM, which stands for cost per mille, represents the average cost incurred for every one thousand ad
impressions: the amount paid for every thousand instances that an ad is displayed to internet
users. We can see that clusters 2 and 0 have a very good (low) CPM, meaning the company is
spending very little per thousand impressions in these clusters; the campaign is successful and the company
generates more money from these clusters with little ad spend. The highest CPM
belongs to cluster 3, followed by cluster 4.
Analysis of CPC
CPC, or cost per click, is an alternative to CPM. Cost per click is an online advertising revenue
model that websites use to bill advertisers based on the number of times visitors click on a display ad
attached to their sites.
Cluster 2 has the highest CPC, followed by cluster 0. As we saw earlier, clusters 2 and
0 have lower CPM, and now we can see they have higher CPC. This suggests the advertisements are
well targeted and the price per click for these clusters is high.
Cluster 0
This cluster has moderate ad size, impressions, and clicks but a relatively low CTR (Click-Through Rate)
and CPC (Cost Per Click).
The CPM (Cost Per Mille) is moderate, indicating a reasonable cost for reaching a thousand impressions.
Recommendation: This cluster may represent ads that are not highly engaging or targeted. To improve
performance, consider optimizing ad creatives, targeting specific audiences, and refining the ad
placement to increase CTR.
Cluster 1
This cluster has the highest ad size, impressions, clicks, and CTR among all clusters. However, it also has
a high CPC and relatively low CPM.
The revenue generated from this cluster is the highest, suggesting that these ads are effective in driving
conversions.
Recommendation: Since this cluster performs well in terms of CTR and revenue, focus on maintaining
the targeting strategy and ad content. However, it's essential to monitor the CPC and optimize the bids
to ensure cost-effectiveness.
Cluster 2
This cluster has the highest available impressions, matched queries, and total impressions. However, it
has the lowest CTR and CPC.
The CPM is low, indicating a cost-effective way to reach a large number of impressions.
Recommendation: Since this cluster generates many impressions but has a low CTR, consider refining
the ad targeting and creative to increase user engagement and CTR.
Cluster 3
This cluster has the lowest ad size, matched queries, impressions, and clicks. It also has a high CTR, CPC,
and CPM.
The revenue from this cluster is relatively low due to the small number of impressions and clicks.
Recommendation: This cluster might represent niche ads with high CPC and low reach. Consider
expanding the targeting options or revisiting the ad strategy to increase impressions and clicks.
Cluster 4
This cluster has a moderate ad size, matched queries, impressions, clicks, and CTR. The CPC and CPM are
also moderate.
The revenue generated from this cluster is reasonable, indicating a balanced performance.
Recommendation: Continue monitoring the performance of this cluster and make small optimizations to
improve the CTR and overall ad efficiency.
Overall Recommendation
We should analyze the characteristics and performance of each cluster regularly to identify any shifts in
ad performance.
We can conduct A/B testing to refine ad creatives, targeting options, and bidding strategies.
We can use the insights from cluster 1 to drive revenue while experimenting with different
strategies in other clusters to improve their performance.
Principal Component Analysis
Question 1 - Read the data and perform basic checks like checking head, info, summary,
nulls, and duplicates, etc.
• We have removed State Code, Dist.Code, State and Area Name as these are object
data types. All other columns are of int and float data types.
Question 2 - Perform detailed Exploratory analysis by creating certain questions like (i)
Which state has highest gender ratio and which has the lowest? (ii) Which district has the
highest & lowest gender ratio? (Example Questions). Pick 5 variables out of the given 24
variables below for EDA: No_HH, TOT_M, TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST,
M_LIT, F_LIT, M_ILL, F_ILL, TOT_WORK_M, TOT_WORK_F, MAINWORK_M,
MAINWORK_F, MAIN_CL_M, MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M,
MAIN_HH_F, MAIN_OT_M, MAIN_OT_F
Solution –
(I) Gender Ratio
A) Which state has the highest gender ratio and which has the lowest?
Lakshadweep, Jammu and Kashmir, Uttar Pradesh, Rajasthan and Meghalaya are the top 5 states with the
highest gender ratio, while Chhattisgarh, Goa, Dadra and Nagar Haveli, Maharashtra and Tripura have the
lowest gender ratio.
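The gender-ratio ranking can be computed as in the sketch below, assuming the conventional definition of females per 1,000 males; the tiny frame and its values are hypothetical stand-ins for the census columns:

```python
import pandas as pd

# Hypothetical slice of the census data; only the columns used here.
df = pd.DataFrame({
    "State":    ["A", "A", "B"],
    "District": ["d1", "d2", "d3"],
    "TOT_M":    [1000, 800, 1200],
    "TOT_F":    [1050, 780, 1150],
})

# Aggregate to state level, then compute females per 1,000 males.
state = df.groupby("State")[["TOT_M", "TOT_F"]].sum()
state["GenderRatio"] = state["TOT_F"] / state["TOT_M"] * 1000

highest_state = state["GenderRatio"].idxmax()
lowest_state = state["GenderRatio"].idxmin()
```

The district-level question is the same computation with a groupby on "District" instead of "State".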
(II) Households
A) Which state has the highest number of households and which has the lowest?
West Bengal, Maharashtra, Karnataka and Kerala are the states with the highest number of households, while
Arunachal Pradesh, Himachal Pradesh and Sikkim are the states with the lowest.
B) Which district has the highest number of households and which has the lowest?
North Twenty Four Parganas (West Bengal), Mumbai Suburban (Maharashtra) and Thane (Maharashtra)
are the districts with the highest number of households, while Dibang Valley, Anjaw and Upper Siang
(all in Arunachal Pradesh) are the districts with the lowest.
(III) SC Population
A) Which state has the highest SC population and which has the lowest?
From the above figure we can easily identify top states with highest SC population and states with lowest
SC population
B) Which district has highest number of SC population and which has the lowest?
South Andaman from union territory of Andaman & Nicobar Island and Lower Dibang Valley from the state of
Arunachal Pradesh are the districts with lowest SC population.
Barddhaman and North Twenty Four Parganas, both from the state of West Bengal, have the
highest SC population.
(IV) Working Population
A) Which state has the highest total working male population and which has the lowest?
From the above figure we can easily find the states with the highest and lowest total working male population.
B) Which district has the highest total working male population and which has the lowest?
Please refer to line number [208] of the codebook, which shows a graph similar to the one above.
C) Which state has the highest total working female population and which has the lowest?
From the above figure we can easily find the states with the highest and lowest total working female population.
D) Which district has the highest total working female population and which has the lowest?
Please refer to line number [212] of the codebook, which shows a graph similar to the one above.
Problem 3 - We choose not to treat outliers for this case. Do you think that treating
outliers for this case is necessary?
Solution – Whether to treat outliers before PCA depends on the specific context and goals of the analysis.
Here are some considerations:
• Impact on results: Outliers can have a significant impact on the results of PCA. Since PCA aims to
capture the maximum variance in the data, outliers with extreme values can disproportionately
influence the principal components. If outliers are present in your data and are not representative
of the underlying patterns, they can distort the results.
• Robustness: PCA is not inherently robust to outliers. It assumes that the data follows a Gaussian
distribution, and outliers can violate this assumption. Outliers can affect the estimation of
covariance matrix and the computation of principal components.
• Data integrity: Outliers may represent genuine observations and should not be removed without
careful consideration. Removing outliers without a valid reason can lead to loss of important
information and potentially bias the analysis.
Problem 4 - Scale the Data using z-score method. Does scaling have any impact on outliers? Compare boxplots before and after scaling and comment.
Solution – Scaling the data using z-score normalization does not directly remove or eliminate outliers
from the dataset. However, it can affect how outliers are represented in the scaled data.
When applying z-score scaling, the values are transformed to have a mean of 0 and a standard deviation
of 1. This means that the data points will be centered around 0 and spread out based on their standard
deviations. Outliers that are far from the mean can still retain their extreme values after scaling, but
their relative position in the scaled data may change.
In some cases, scaling the data can make outliers appear less extreme compared to the rest of the data.
This is because the extreme values are adjusted based on the mean and standard deviation, which can
shrink their range. However, the actual values of the outliers are not altered, and they can still be
identified as extreme values.
The boxplots before and after scaling are shown below.
Before Scaling
Boxplot After Scaling
In the before and after boxplots we can observe that the scale of the data on the y-axis is now normalized
and the scale of each individual boxplot is the same, but outliers are still present.
Problem 5 - Perform all the required steps for PCA (use sklearn only): create the
covariance matrix; get eigen values and eigen vectors.
Solution – PCA Covariance Matrix
We have successfully extracted eigen values in line [357] and eigen vectors in line [358] of the codebook file.
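As a sketch of those steps on synthetic standardized data: the covariance matrix and its eigen decomposition come from NumPy, and sklearn's PCA recovers the same quantities (`explained_variance_` are the eigenvalues, `components_` the eigenvectors, sorted descending):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Stand-in for the scaled census data: 200 rows, 6 features.
X = StandardScaler().fit_transform(rng.normal(size=(200, 6)))

# Covariance matrix of the standardized data.
cov = np.cov(X, rowvar=False)

# Eigen decomposition of the (symmetric) covariance matrix.
eig_vals, eig_vecs = np.linalg.eigh(cov)

# sklearn's PCA yields the same eigenvalues, sorted in descending order.
pca = PCA().fit(X)
```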
Problem 6- Identify the optimum number of PCs (for this project, take at least 90%
explained variance). Show Scree plot.
Solution - PC1 to PC6 together explain more than 90% of the variance. We will take only these
PCs into consideration.
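The 90% cutoff can be located from the cumulative explained-variance ratio, as sketched below on correlated synthetic data (so that, as with the census counts, a few PCs dominate):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# Correlated synthetic data: 12 columns driven by 4 latent factors.
base = rng.normal(size=(300, 4))
X = np.column_stack([base,
                     base @ rng.normal(size=(4, 8))
                     + 0.1 * rng.normal(size=(300, 8))])

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching at least 90% explained variance.
n_pcs = int(np.argmax(cum_var >= 0.90) + 1)
# plt.plot(range(1, len(cum_var) + 1),
#          pca.explained_variance_ratio_, "o-")  # scree plot, needs matplotlib
```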
Problem 7 -Compare PCs with Actual Columns and identify which is explaining most
variance. Write inferences about all the principal components in terms of actual
variables.
PC1 represents a component with positive loadings for all variables, indicating a general measure of
overall population and household size. It captures the variation in the total population, the number of
households, and the population in different age groups (0-6, 3-6, 0-3). It is a measure of the total
population and household characteristics.
PC2 captures the variation related to scheduled caste and scheduled tribe populations, both male and
female. It has negative loadings on scheduled caste and scheduled tribe populations, indicating that areas
with a higher proportion of these populations have lower PC2 scores. It represents the variation in these
specific ethnic groups' populations.
PC3 is associated with variations in the literacy rate and working population. It has positive loadings on
literates population, both male and female, and negative loadings on the total working population. Higher
PC3 scores suggest areas with higher literacy rates but a lower proportion of the total working population.
PC4 represents variations in the main working population and cultivators. It has negative loadings on the
main working population and main cultivator population, both male and female. Higher PC4 scores
indicate areas where fewer people are engaged in these specific work categories.
PC5 is associated with variations in the main agricultural laborers and main household industries
populations, both male and female. It has positive loadings on main agricultural laborers and main
household industries populations. Higher PC5 scores suggest areas with a higher proportion of people
engaged in these specific types of work.
PC6 captures variations related to marginal workers and populations in the age group 3-6. It has negative
loadings on marginal workers and populations in the age group 3-6, both male and female. Higher PC6
scores indicate areas with a lower proportion of marginal workers and populations in the age group 3-6.
These six principal components together explain 90% of the variance in the dataset, providing a concise
representation of the original variables' patterns. The interpretation of each principal component is based
on the combination of variable loadings and can help in understanding the underlying structure and
patterns in the data.
Problem 8 - Write linear equation for first PC1.
Solution – We can write the linear equation for PC1 using the formula
PC1 = a1*x1 + a2*x2 + a3*x3 + ... + an*xn,
where each 'a' is the corresponding loading of PC1 and each 'x' is the observed scaled value of that variable.
PC1 = -4.6172634816
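The equation can be assembled from the PC1 loadings as sketched below; the data are random and only a few of the real column names are used, so the coefficients are illustrative, not the ones reported above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
cols = ["No_HH", "TOT_M", "TOT_F", "M_LIT", "F_LIT"]  # a few real columns
X = StandardScaler().fit_transform(rng.normal(size=(100, len(cols))))

pca = PCA().fit(X)
a = pca.components_[0]          # loadings a1..an of PC1

# PC1 = a1*x1 + ... + an*xn, evaluated for the first observation:
pc1_manual = float(X[0] @ a)
pc1_sklearn = float(pca.transform(X)[0, 0])

# Human-readable form of the linear equation.
equation = "PC1 = " + " + ".join(f"({w:.3f})*{c}" for w, c in zip(a, cols))
```

Evaluating the equation by hand matches sklearn's `transform` output, which confirms the loadings are the coefficients of the linear combination.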