Clustering Project
Contents
Problem 1
Introduction
Read the data and perform basic analysis such as printing a few rows (head and
tail), info, data summary, null values, duplicate values, etc.
Treat missing values in CPC, CTR and CPM using the formula given. You have to
create a user-defined function and then call the function for imputing.
Perform z-score scaling and discuss how it affects the speed of the algorithm
Make an elbow plot (up to n=10) and identify the optimum number of clusters for the k-means
algorithm
Print silhouette scores for up to 10 clusters and identify the optimum number of clusters.
Profile the ads based on the optimum number of clusters using the silhouette score and your
domain understanding
[Hint: Group the data by clusters and take sum or mean to identify trends in clicks,
spend, revenue, CPM, CTR, & CPC based on Device Type. Make bar plots.]
Problem 2
Introduction
Read the data and perform basic steps
Perform EDA for given variables
Visualize outliers
Perform z-score scaling
Compare boxplots before and after scaling
Perform all steps of PCA
DATA MINING GRADED PROJECT
List of figures
Boxplot of clustering dataset
Boxplot after treating outliers
Boxplot of scaled data of clustering
Dendrogram
Elbow plot
Figure of cluster count and CPM
Figure of cluster count and CPC
Figure of cluster count and CTR
Figure of cluster count and Clicks
Figure of cluster count and Revenue
PCA:
Bar graph of questions and answers
Boxplot of given variables
Heatmap
Pairplot
Boxplot before scaling
Boxplot after scaling
Scree plot
Heatmap between PCs
Graph for comparison of PCs
List of tables
Clustering:
First five and last five rows
Info of the data set
Data set after filling NaN values
Checking null values
Describe the data set
Scaled data
Cluster count column in the data set
WSS values
Silhouette score table
PCA:
First five and last five rows
Info of the data set
Checking null values
Describe the data set
Questions and answers table related to the 5 given columns
Bivariate analysis
Scaled data set
PCA covariance matrix
Eigenvectors
Eigenvalues
Cumulative explained variance
PCA components table
PCA data head
Correlation matrix
Original data set with PCs
Linear equation of PC1
Introduction
Clustering:
Digital Ads Data:
ads24x7 is a digital marketing company that has received seed funding of $10
million. It is expanding into marketing analytics. The company collected data
through its Marketing Intelligence team and now wants you (its newly appointed
data analyst) to segment the types of ads based on the features provided. Use a clustering
procedure to segment the ads into homogeneous groups.
The following three features are commonly used in digital marketing: CPM, CPC and CTR.
Step 1
• Import all the required libraries
• Read the data set as df
• View the first five and last five rows with the head and tail methods
There are 4736 null values in the data set; we have to fill them using the given formula.
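The given formulas are not reproduced in this report. The following is a minimal sketch of a user-defined imputation function, assuming the standard digital-marketing definitions (CPM = Spend / Impressions × 1000, CPC = Spend / Clicks, CTR = Clicks / Impressions × 100). The column names `Spend`, `Impressions`, `Clicks`, `CPM`, `CPC` and `CTR`, the function name `impute_ad_metrics` and the demo values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def impute_ad_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing CPM, CPC and CTR from Spend, Impressions and Clicks.

    Column names are assumed; adjust them to match the real data set.
    """
    out = df.copy()
    # Turn zero denominators into NaN so the division never produces inf.
    impressions = out["Impressions"].replace(0, np.nan)
    clicks = out["Clicks"].replace(0, np.nan)
    # fillna only touches rows where the metric is missing.
    out["CPM"] = out["CPM"].fillna(out["Spend"] / impressions * 1000)
    out["CPC"] = out["CPC"].fillna(out["Spend"] / clicks)
    out["CTR"] = out["CTR"].fillna(out["Clicks"] / impressions * 100)
    return out


# Tiny made-up frame: the first row has all three metrics missing.
demo = pd.DataFrame({
    "Spend": [10.0, 20.0],
    "Impressions": [1000, 4000],
    "Clicks": [50, 100],
    "CPM": [np.nan, 5.0],
    "CPC": [np.nan, 0.2],
    "CTR": [np.nan, 2.5],
})
filled = impute_ad_metrics(demo)
print(filled[["CPM", "CPC", "CTR"]])
```

Existing values are left unchanged; only the missing entries are computed.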
Step 2
Here we remove the categorical columns with the drop method and describe all the remaining
columns.
Step 3
Draw a boxplot for the new data set to check whether any outliers are present.
Yes, outliers are present in every column except Ad-Length and Ad-Width.
We have to treat the outliers because k-means clustering is very sensitive to
them.
Here we are able to remove all the outliers present in the data.
We treat the outliers because their presence can mislead the clustering process.
Outliers also shift the mean and, to a lesser extent, the median, which distorts the
cluster centres and affects the speed of the algorithm.
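The report does not state which outlier treatment was applied. One common option, shown here only as an assumed sketch, is to cap each numeric column at the boxplot whiskers (Q1 − 1.5·IQR and Q3 + 1.5·IQR). The function name `cap_outliers_iqr` and the demo column are hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(df: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Clip every numeric column to [Q1 - factor*IQR, Q3 + factor*IQR]."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - factor * iqr,
                                 upper=q3 + factor * iqr)
    return out


# Made-up column with one extreme value.
demo = pd.DataFrame({"Spend": [1.0, 2.0, 3.0, 4.0, 100.0]})
capped = cap_outliers_iqr(demo)
print(capped)
```

Capping keeps every row, so the number of observations does not change; only the extreme values are pulled back to the whisker limits.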
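Perform z-score scaling
After z-scaling, all the variables are on the same scale, so no single variable dominates
the distance calculation. This makes the relationships between the variables easier to
identify and also improves the speed of the algorithm.
The data set now contains only float-type columns, and the number of columns is reduced from 19 to 13.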
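Z-score scaling replaces each value x with (x − mean) / standard deviation of its column. A minimal sketch with sklearn's `StandardScaler`, using a made-up two-column frame in place of the real data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numeric columns on very different scales.
demo = pd.DataFrame({
    "Spend": [10.0, 20.0, 30.0, 40.0],
    "Clicks": [100.0, 300.0, 500.0, 700.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(demo),  # (x - column mean) / column std
    columns=demo.columns,
    index=demo.index,
)

# After scaling, each column has mean 0 and (population) std 1.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```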
Step 5
Here we perform hierarchical clustering after importing all the required libraries.
Constructing a dendrogram
The data set after hierarchical clustering is shown here; the cluster count is
added as a column in the data set.
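The steps above can be sketched with SciPy as follows. The linkage method (Ward), the number of clusters (2), the column name `cluster` and the synthetic data are all assumptions, since the report does not state them.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled ad features: two well-separated groups.
scaled = pd.DataFrame(
    np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))]),
    columns=["f1", "f2"],
)
features = scaled.to_numpy()

# Build the merge tree; Ward linkage is an assumed choice.
Z = linkage(features, method="ward")

# no_plot=True only computes the layout; drop it to draw with matplotlib.
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=10)

# Cut the tree into a fixed number of clusters and store the labels.
scaled["cluster"] = fcluster(Z, t=2, criterion="maxclust")
print(scaled["cluster"].value_counts())
```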
ELBOW PLOT
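The within-cluster sum of squares (WSS) behind the elbow plot, and the silhouette scores for up to 10 clusters, can be computed along these lines. This is a sketch on synthetic data; `n_init`, `random_state` and the blob layout are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled data: three well-separated blobs.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in centers])

wss = {}  # k -> within-cluster sum of squares (KMeans inertia)
sil = {}  # k -> mean silhouette score
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss[k] = km.inertia_
    if k >= 2:  # the silhouette score needs at least two clusters
        sil[k] = silhouette_score(X, km.labels_)

best_k = max(sil, key=sil.get)
print("WSS:", {k: round(v, 1) for k, v in wss.items()})
print("Best k by silhouette:", best_k)
```

Plotting `wss` against k gives the elbow plot; the k with the highest silhouette score is a reasonable candidate for the optimum number of clusters.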
Step 8
Here we add the cluster count as a column in the data set and take the mean of each
cluster.
Here we can see that cluster 2 has the highest CTR.
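Profiling of this kind can be sketched with a pandas group-by on the cluster label and device type. The column name `cluster` and the demo values are hypothetical; `Device Type`, `Clicks` and `Revenue` follow the names used in the problem statement.

```python
import pandas as pd

# Made-up rows standing in for the clustered ad data.
demo = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "Device Type": ["Mobile", "Desktop", "Mobile", "Mobile", "Desktop"],
    "Clicks": [10, 20, 30, 50, 40],
    "Revenue": [1.0, 2.0, 3.0, 5.0, 4.0],
})

# Mean of each metric per cluster and device type.
profile = demo.groupby(["cluster", "Device Type"])[["Clicks", "Revenue"]].mean()
print(profile)

# One column per device type, ready for a grouped bar plot.
clicks_by_device = profile["Clicks"].unstack()
print(clicks_by_device)
```

Calling `clicks_by_device.plot(kind="bar")` (which requires matplotlib) would draw one bar group per cluster.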
Here we can see that, based on device type, cluster 3 has the maximum CPC.
For CPM:
Based on device type, we can see that cluster 0 has the maximum CPM.
CLICKS:
Here cluster 3 has the maximum revenue on the basis of device type.
Step 9:
• Cluster no. 2, that is, the third cluster, has the maximum count.
• Cluster 4 has the maximum revenue.
• Cluster 0 has the maximum number of clicks.
• By performing clustering, we can easily analyse the data.
Problem 2:
Introduction:
PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes - 2011
PCA for Female Headed Household Excluding Institutional Household. The Indian Census
has the reputation of being one of the best in the world. The first Census in India was
conducted in the year 1872. This was conducted at different points of time in different
parts of the country. In 1881 a Census was taken for the entire country simultaneously.
Since then, Census has been conducted every ten years, without a break. Thus, the
Census of India 2011 was the fifteenth in this unbroken series since 1872, the seventh
after independence and the second census of the third millennium and twenty-first
century. The census has continued uninterrupted despite several adversities
like wars, epidemics, natural calamities, political unrest, etc. The Census of India is
conducted under the provisions of the Census Act 1948 and the Census Rules, 1990. The
Primary Census Abstract, which is an important publication of the 2011 Census, gives basic
information on Area, Total Number of Households, Total Population, Scheduled Castes,
Scheduled Tribes Population, Population in the age group 0-6, Literates, Main Workers
and Marginal Workers classified by the four broad industrial categories, namely, (i)
Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and (iv) Other
Workers and also Non-Workers. The characteristics of the Total Population include
Scheduled Castes, Scheduled Tribes, Institutional and Houseless Population and are
presented by sex and rural-urban residence. Census 2011 covered 35 States/Union
Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables that it is difficult to find useful details
without using data science techniques. You are tasked to perform a detailed EDA and
identify the optimum principal components that explain the most variance in the data. Use
sklearn only.
Step 1:
These are the first five rows; we can see that the data set has 61 columns.
Null values:
• There are no null values in the data set.
• There are no duplicate values in the data set.
Here we summarise the data set by describing it.
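These checks can be sketched with pandas as follows. The column names `TOT_M` and `TOT_F` and the demo values are hypothetical stand-ins; the demo frame deliberately contains one duplicated row so that the duplicate count is visible.

```python
import pandas as pd

# Made-up frame; the third row repeats the second.
demo = pd.DataFrame({"TOT_M": [10, 20, 20], "TOT_F": [12, 18, 18]})

n_null = int(demo.isnull().sum().sum())  # total missing cells
n_dup = int(demo.duplicated().sum())     # rows identical to an earlier row
summary = demo.describe().T              # one row of statistics per column

print("nulls:", n_null, "duplicates:", n_dup)
print(summary[["min", "50%", "mean", "max"]])
```

The `50%` column of `describe()` is the median.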
• From the above figure we can find the maximum and minimum values of each column.
• We can also check the mean and median values of each column.
Here we generate a new column ‘ratio’ and find the maximum and minimum ratio.
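The report does not show how ‘ratio’ is defined. As a sketch only, the following assumes a gender ratio of total females to total males, aggregated per state; the column names `State`, `TOT_M` and `TOT_F` and all values are hypothetical.

```python
import pandas as pd

# Made-up district-level rows; state "A" has two districts.
demo = pd.DataFrame({
    "State": ["A", "A", "B", "C"],
    "TOT_M": [100, 200, 400, 50],
    "TOT_F": [150, 150, 300, 100],
})

# Sum the districts of each state, then compute females per male.
by_state = demo.groupby("State")[["TOT_M", "TOT_F"]].sum()
by_state["ratio"] = by_state["TOT_F"] / by_state["TOT_M"]

print(by_state)
print("max ratio:", by_state["ratio"].idxmax())
print("min ratio:", by_state["ratio"].idxmin())
```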
Ans. Uttar Pradesh has the maximum gender ratio, and Lakshadweep has the minimum.
Ans. Uttar Pradesh has the maximum number of literate males, and Dadra and Nagar Haveli has
the minimum ratio of literate males.
Which state has the maximum number of illiterate females?
Ans.
Uttar Pradesh has the largest population, so the number of illiterate females is highest in this state.
Lakshadweep has the smallest number of illiterate females.
Now we perform univariate and bivariate analysis of the data set.
Univariate Analysis:
This is a correlation matrix, which shows the relationship between all the given variables.
We can also see it in the graphs.
Step 3
Outliers have a large impact on the performance of the algorithm.
Here, however, we choose not to treat the outliers.
Step 4
Scale the data set using the z-score:
Compare boxplot before and after scaling:
Eigenvalues:
Now we create a new data set with the PCs. It looks like this:
The shape of the data set has now been reduced to 640 rows and 6 columns.
Scree plot:
Here we can see that 6 PCs (numbered 0 to 5) explain 90% of the variance.
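Choosing the number of PCs from the cumulative explained variance can be sketched with sklearn as follows. The data here is synthetic (10 correlated columns, not the census data), and the 90% threshold follows the figure quoted in the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 10 correlated columns driven by 3 latent factors.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

X_scaled = StandardScaler().fit_transform(X)

# Fit with all components to get eigenvalues and explained variance.
pca_full = PCA().fit(X_scaled)
eigenvalues = pca_full.explained_variance_
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%.
n_components = int(np.searchsorted(cum_var, 0.90) + 1)

# Refit with that many components to get the reduced data set.
scores = PCA(n_components=n_components).fit_transform(X_scaled)
print(n_components, scores.shape)
```

Plotting `eigenvalues` (or `cum_var`) against the component number gives the scree plot. Passing `n_components=0.90` to `PCA` selects the component count by the same variance threshold in a single step.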
Step 7:
Compare the PCs with the actual columns:
Step 8
Linear equation for the first PC:
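The first PC is a weighted sum of the scaled columns, with the weights given by the first row of `pca.components_`. A minimal sketch on made-up data (the column names `x1`, `x2`, `x3` are hypothetical) that builds the equation as text and checks it against sklearn's own transform:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])

X_scaled = StandardScaler().fit_transform(demo)
pca = PCA(n_components=2).fit(X_scaled)

# Weights (loadings) of the first principal component.
loadings = pca.components_[0]
equation = "PC1 = " + " + ".join(
    f"({w:.3f} * {col})" for w, col in zip(loadings, demo.columns)
)
print(equation)

# Applying the equation to the scaled data reproduces the PC1 scores.
pc1_manual = X_scaled @ loadings
pc1_sklearn = pca.transform(X_scaled)[:, 0]
print(np.allclose(pc1_manual, pc1_sklearn))
```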