Clustering Project

The document describes a data mining project on digital ad clustering. It includes steps to: 1) Read and clean the data, treating missing values and outliers 2) Perform dimensionality reduction through z-score scaling and hierarchical clustering 3) Determine the optimal number of clusters through elbow plotting and silhouette scoring for k-means clustering 4) Profile the ads based on clustering to understand trends in key metrics like clicks and spend.

Uploaded by

kirti sharma

DATA MINING GRADED PROJECT

Contents
Problem 1
Introduction
Read the data and perform basic analysis such as printing a few rows (head and
tail), info, data summary, null values, duplicate values, etc.
Treat missing values in CPC, CTR and CPM using the formula given. You basically
have to create a user-defined function and then call the function for imputing.
Check if there are any outliers.
Do you think treating outliers is necessary for K-Means clustering? Based on your
judgement decide whether to treat outliers and if yes, which method to employ. (As
an analyst your judgement may be different from another analyst.)
Perform clustering and do the following:
Perform z-score scaling and discuss how it affects the speed of the algorithm.
Perform hierarchical clustering by constructing a dendrogram using Ward linkage and
Euclidean distance.
Make an elbow plot (up to n=10) and identify the optimum number of clusters for the
k-means algorithm.
Print silhouette scores for up to 10 clusters and identify the optimum number of clusters.
Profile the ads based on the optimum number of clusters using silhouette score and your
domain understanding.
[Hint: Group the data by clusters and take sum or mean to identify trends in clicks,
spend, revenue, CPM, CTR, & CPC based on Device Type. Make bar plots.]
Conclude the project by providing a summary of your learnings.

Problem 2
Introduction
Read the data and perform basic steps
Perform EDA for given variables
Visualize outliers
Perform z-scaling
Compare boxplots before and after scaling
Perform all steps of PCA
Identify the optimum number of PCs
Show scree plot
Compare PCs with actual columns
Linear equation for 1st PC

List of figures
Boxplot of clustering dataset
Boxplot after treating outliers
Boxplot of scaled data of clustering
Dendrogram
Elbow plot
Figure of cluster count and CPM
Figure of cluster count and CPC
Figure of cluster count and CTR
Figure of cluster count and Clicks
Figure of cluster count and Revenue
PCA:
Bar graph of questions and answers
Boxplot of given variables
Heatmap
Pairplot
Boxplot before scaling
Boxplot after scaling
Scree plot
Heatmap between PCs
Graph for comparison of PCs
List of Tables
Clustering:
First five and last five rows
Info of the data set
Dataset after filling NaN values
Checking null values
Describe the data set
Scaled data
Cluster count column in the dataset
WSS values
Silhouette score table
PCA:
First five and last five rows
Info of the data set
Checking null values
Describe the data set
Questions and answers table related to the 5 given columns
Bivariate analysis
Scaled data set
PCA covariance matrix
Eigenvectors
Eigenvalues
Cumulative explained variance
PCA components table
PCA data head
Correlation matrix
Original dataset with PCs
Linear equation of PC1
Introduction

Clustering:
Digital Ads Data:
ads24x7 is a digital marketing company which has now received seed funding of $10
million. They are expanding their wings in marketing analytics. They collected data
from their Marketing Intelligence team and now want you (their newly appointed
data analyst) to segment types of ads based on the features provided. Use a clustering
procedure to segment the ads into homogeneous groups.
The following three features are commonly used in digital marketing: CPM, CPC and CTR.

Step 1
• Import all the required libraries.
• Read the dataset as df.
• View the first five and last five rows with the head and tail methods.
• Get information about the data with the info() method.
• There are 23066 rows and 19 columns.
• There are 4 categorical variables.
• There are 6 float64 and 7 int64 columns.
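
The inspection steps listed above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual data: the column names and values are assumptions, and the real `df` would be read from the ads file instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ads data (the real df has 23066 rows and 19 columns).
df = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile", "Desktop"],
    "Impressions": [1000, 2500, 1000, 400],
    "Clicks": [10, 50, 10, 4],
    "Spend": [5.0, 20.0, 5.0, 1.2],
    "CTR": [1.0, np.nan, 1.0, 1.0],
})

print(df.head(2))   # first rows
print(df.tail(2))   # last rows
df.info()           # dtypes and non-null counts per column

n_rows, n_cols = df.shape
null_counts = df.isnull().sum()              # missing values per column
n_duplicates = int(df.duplicated().sum())    # fully duplicated rows
# Non-numeric columns are treated as categorical here.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```

The same calls apply unchanged to the full data set; only the construction of `df` differs.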

There are 4736 null values in the data set; we have to fill them using the given formula.

Step 2

Now there are no null values.
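
The formula used for the imputation is not reproduced in this report. A minimal sketch of a user-defined imputation function, assuming the standard digital marketing definitions (CTR = Clicks / Impressions × 100, CPM = Spend / Impressions × 1000, CPC = Spend / Clicks) and assuming the columns are named `Impressions`, `Clicks` and `Spend`, might look like this:

```python
import numpy as np
import pandas as pd

def fill_ad_metrics(df):
    """Fill missing CTR, CPM and CPC from Impressions, Clicks and Spend.

    The formulas are the standard definitions and are an assumption here;
    the assignment's own formula is not shown in the report.
    """
    out = df.copy()
    ctr = out["Clicks"] / out["Impressions"] * 100   # click-through rate, in percent
    cpm = out["Spend"] / out["Impressions"] * 1000   # cost per thousand impressions
    cpc = out["Spend"] / out["Clicks"]               # cost per click
    out["CTR"] = out["CTR"].fillna(ctr)
    out["CPM"] = out["CPM"].fillna(cpm)
    out["CPC"] = out["CPC"].fillna(cpc)
    return out

# Made-up rows with one missing value in each metric.
sample = pd.DataFrame({
    "Impressions": [1000, 2000],
    "Clicks": [10, 40],
    "Spend": [5.0, 8.0],
    "CTR": [1.0, np.nan],
    "CPM": [np.nan, 4.0],
    "CPC": [0.5, np.nan],
})
filled = fill_ad_metrics(sample)
```

Only the missing entries are computed; values that were already present are left untouched.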

Here we remove the categorical columns with the drop method and describe all the remaining
columns.

Step 3

Draw a boxplot for the new data set to show whether any outliers are present.

Yes, there are outliers present in every column of the data set except the Ad-Length and Ad-Width columns.

We have to treat the outliers present in the data because k-means clustering is very sensitive to
outliers.

After treating the outliers, the new boxplot is:

Here we were able to remove all the outliers present in the data.

We treat the outliers because their presence can mislead the clustering process. Outliers
distort the mean (and, to a lesser extent, the median), which pulls the cluster centroids away
from the bulk of the data and can also slow the convergence of the algorithm.
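
The report does not say which treatment method was used. One common choice, shown here only as an assumed example, is to cap values at the boxplot whiskers (1.5 × IQR beyond the quartiles). The helper name `cap_outliers_iqr` is hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Clip values lying more than k * IQR outside the first and third quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.clip(lower=lower, upper=upper)

# Made-up column with one extreme value.
spend = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers_iqr(spend)
```

Capping keeps every row in the data set, which matters when the cluster labels are later joined back to the original ads.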
Step 4: Perform z-score scaling

After z-score scaling all the variables are on the same scale, so no single variable
dominates the distance calculation. This makes it easier to compare the variables and
can also speed up the convergence of the algorithm.

This is the information about the data after scaling.

The data now contains only float columns, and the number of columns has been reduced from 19 to 13.
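
As a rough illustration of what z-score scaling does, the sketch below standardises each column to mean 0 and standard deviation 1. The data and the helper name `zscore_scale` are made up; using the population standard deviation (`ddof=0`) matches the default behaviour of `scipy.stats.zscore` and sklearn's `StandardScaler`.

```python
import pandas as pd

def zscore_scale(df):
    """Return (x - mean) / std for every column, using the population std (ddof=0)."""
    return (df - df.mean()) / df.std(ddof=0)

# Made-up numeric columns on very different scales.
numeric = pd.DataFrame({
    "Clicks": [10.0, 20.0, 30.0],
    "Spend": [1.0, 1.5, 2.0],
})
scaled = zscore_scale(numeric)
```

After scaling, `Clicks` and `Spend` contribute equally to Euclidean distances even though their raw ranges differ by an order of magnitude.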
This is the boxplot of the data after scaling

Step 5
Here we perform hierarchical clustering after importing all the required libraries.

Constructing a dendrogram

This is a figure of the dendrogram.

Here we can see the number of clusters. We can choose the number of clusters by
two methods:
• By cutting the dendrogram at an appropriate threshold (distance) level.
• By specifying the number of clusters directly.

We did it by both methods, and both gave the same output.

Here we choose 4 clusters.

We can see the data set after hierarchical clustering. Here the cluster count is
added as a column in the data set.
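
The two ways of cutting the tree can be sketched with SciPy as below. The data here is synthetic (four well-separated blobs), and the threshold is derived from the linkage matrix itself rather than read off a plot, so this is an illustration under those assumptions rather than the project's exact code.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in for the scaled ads data: four tight, well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# Ward linkage on Euclidean distances.
Z = linkage(X, method="ward", metric="euclidean")
tree = dendrogram(Z, no_plot=True)   # drop no_plot=True to draw the figure

# Method 1: ask for a fixed number of clusters.
labels_by_count = fcluster(Z, t=4, criterion="maxclust")

# Method 2: cut at a distance threshold. Here the threshold is placed halfway
# between the merge that leaves 4 clusters and the merge that leaves 3.
threshold = (Z[-4, 2] + Z[-3, 2]) / 2
labels_by_height = fcluster(Z, t=threshold, criterion="distance")
```

Both label arrays describe the same partition, which is why the two methods give the same output once the threshold falls inside the gap between the last merges.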

Step 6 (k-means clustering)

These are the WSS (within-cluster sum of squares) values for k-means clustering.

Elbow plot

Here the curve makes an elbow at 4 clusters, so we choose the number of clusters as 4.
After 4 the decrease in WSS is minimal and the curve is almost flat.
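
A minimal sketch of how the WSS values behind an elbow plot can be computed with sklearn, again on synthetic four-blob data rather than the ads data. `inertia_` is the within-cluster sum of squares of a fitted `KMeans` model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled data: four well-separated groups.
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# WSS for k = 1..10.
wss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(model.inertia_)

# Drop in WSS when going from k to k + 1 clusters; the elbow is where it collapses.
drops = [wss[i] - wss[i + 1] for i in range(len(wss) - 1)]
```

Plotting `wss` against `range(1, 11)` gives the elbow curve; on this synthetic data the drop from 3 to 4 clusters is large and the drop from 4 to 5 is small.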
Step 7
Silhouette scores:
The silhouette score is a measure of how similar an object is to its own cluster
compared to other clusters. It ranges from -1 to 1, and higher values indicate better-separated clusters.
The scores for up to n=10 clusters are given below:
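
One way such a table of scores can be produced is sketched below, using sklearn's `silhouette_score` on synthetic four-blob data. The data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: four well-separated groups.
rng = np.random.default_rng(2)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# The silhouette score needs at least 2 clusters, so k starts at 2.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)   # k with the highest average silhouette
```

On real data the maximum is usually less clear-cut, so the score is best read together with the elbow plot and domain knowledge.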

Step 8

Here we add the cluster count as a column in the data set and take the mean of each
variable by cluster.
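
A small sketch of this kind of profiling with pandas `groupby`. The rows, the column names and the `cluster` label column are made up for illustration.

```python
import pandas as pd

# Made-up ads with a cluster label already attached.
ads = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "Device Type": ["Mobile", "Desktop", "Mobile", "Mobile", "Desktop"],
    "Clicks": [10, 30, 5, 15, 10],
    "Spend": [2.0, 6.0, 1.0, 3.0, 2.0],
})

# Mean of each metric per cluster.
profile = ads.groupby("cluster")[["Clicks", "Spend"]].mean()

# Total clicks per cluster, split by device type (one column per device).
by_device = ads.groupby(["cluster", "Device Type"])["Clicks"].sum().unstack()

# Number of ads in each cluster.
counts = ads["cluster"].value_counts().sort_index()
```

Calling `by_device.plot(kind="bar")` on such a table gives the grouped bar plots suggested in the assignment hint.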

Here we can see that cluster 2 has the maximum value of CTR.

Cluster 1 holds the maximum value of CPC.

Here we can see that, based on device type, cluster 3 has the maximum value of CPC.

CPM:

Cluster 2 has the maximum value of CPM.

Based on device type, cluster 0 has the maximum value of CPM.

Clicks:

Cluster 0 has the maximum number of clicks.

Revenue:

Here cluster 3 has the maximum revenue based on device type.

Step 9:
• Cluster no. 2, that is the 3rd cluster, has the maximum number of ads.
• The 4th cluster (cluster 3) has the maximum revenue.
• Cluster 0 has the maximum number of clicks.
• By performing clustering, we can easily analyse the data.

Problem 2:
Introduction:
PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes - 2011
PCA for Female Headed Household Excluding Institutional Household. The Indian Census
has the reputation of being one of the best in the world. The first Census in India was
conducted in the year 1872. This was conducted at different points of time in different
parts of the country. In 1881 a Census was taken for the entire country simultaneously.
Since then, Census has been conducted every ten years, without a break. Thus, the
Census of India 2011 was the fifteenth in this unbroken series since 1872, the seventh
after independence and the second census of the third millennium and twenty first
century. The census has continued uninterrupted despite several adversities
like wars, epidemics, natural calamities, political unrest, etc. The Census of India is
conducted under the provisions of the Census Act 1948 and the Census Rules, 1990. The
Primary Census Abstract, which is an important publication of the 2011 Census, gives basic
information on Area, Total Number of Households, Total Population, Scheduled Castes,
Scheduled Tribes Population, Population in the age group 0-6, Literates, Main Workers
and Marginal Workers classified by the four broad industrial categories, namely, (i)
Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and (iv) Other
Workers and also Non-Workers. The characteristics of the Total Population include
Scheduled Castes, Scheduled Tribes, Institutional and Houseless Population and are
presented by sex and rural-urban residence. Census 2011 covered 35 States/Union
Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables that it is difficult to find useful details
without using data science techniques. You are tasked to perform detailed EDA and
identify the optimum principal components that explain the most variance in the data. Use
sklearn only.

Step 1:

• Import all the required libraries.


• Read the data set as df_census.
• Read the first five and last five rows of the dataset using the head and tail methods.

These are the first five rows; we can see the 61 columns of the dataset.

These are the last five rows of the data set.


We can see the information about the data by using the info method.

• Here we see there are 61 columns and 640 rows in total.
• There are 59 integer columns.
• There are 2 categorical columns.
• However, from the information above we can see that 2 more columns are categorical
in nature: State Code and Area Name.

Null values:
• There are no null values in the dataset.
• There are no duplicate values in the dataset.
Here we can summarise the data set by describing it.

• From the above figure we can find the maximum and minimum values of each column.
• We can also check the mean and median values of each column.

These are the columns of the data.


Step 2: EDA
• We have to perform EDA on 5 of the 24 given variables.
• Here we define a new data set for the given 5 variables.

This is the new data set for the given variables.

Here we can answer some questions.

• Which state has the highest gender ratio?

Here we generate a new column ‘ratio’ and find the maximum and minimum ratio.

Ans. Uttar Pradesh has the maximum gender ratio and Lakshadweep has the minimum.
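
The report does not show how the ‘ratio’ column was defined. A possible sketch, assuming the ratio is total females divided by total males per state and assuming a `State` column alongside `TOT_M` and `TOT_F`, is given below with made-up numbers.

```python
import pandas as pd

# Made-up district-level rows; "State" is an assumed column name.
census = pd.DataFrame({
    "State": ["A", "A", "B", "C"],
    "TOT_M": [100, 300, 200, 500],
    "TOT_F": [150, 250, 300, 400],
})

# Aggregate districts to state level, then compute the ratio.
by_state = census.groupby("State")[["TOT_M", "TOT_F"]].sum()
by_state["ratio"] = by_state["TOT_F"] / by_state["TOT_M"]   # assumed definition

highest = by_state["ratio"].idxmax()   # state with the highest ratio
lowest = by_state["ratio"].idxmin()    # state with the lowest ratio
```

Summing before dividing avoids averaging district-level ratios, which would weight small and large districts equally.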

• Which state has the maximum number of literate males?

Ans. Uttar Pradesh has the maximum number of literate males and Dadra and Nagar Haveli has the minimum.

• Which state has the maximum number of illiterate females?

Ans. Uttar Pradesh has the largest population, so it also has the most illiterate females.
Lakshadweep has the fewest illiterate females.
Now we perform univariate and bivariate analysis of the data set.
Univariate Analysis:

The total male population is skewed.

Yes, there are outliers present in the total male population.

Total female:

The total female population also has outliers.

Male literate:

The male literate data is also skewed.

Female literate:

This variable also has outliers.

Male illiterate:

All the variables in the data set have some outliers.


Bivariate analysis:

This is a correlation matrix which shows the relationships between all the given variables.
We can also see them in graphs.

This is a heatmap which shows the relationships between the variables.

Here we can see that the variables TOT_M, TOT_F, M_LIT, F_LIT, M_ILL and F_ILL are
highly correlated with each other.
The literacy ratio and the illiteracy ratio are negatively correlated.
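
Both observations can be reproduced on a toy example. The numbers below are made up; they only illustrate why raw counts tend to be highly correlated (they all grow with population) while the literate and illiterate shares, which sum to 1, are perfectly negatively correlated.

```python
import pandas as pd

# Made-up counts where every male is either literate or illiterate.
d = pd.DataFrame({
    "TOT_M": [100, 200, 300, 400],
    "M_LIT": [60, 130, 170, 260],
    "M_ILL": [40, 70, 130, 140],
})

corr = d.corr()   # Pearson correlation matrix of the raw counts

# Shares of the male population; lit_rate + ill_rate == 1 in this toy data.
lit_rate = d["M_LIT"] / d["TOT_M"]
ill_rate = d["M_ILL"] / d["TOT_M"]
rate_corr = lit_rate.corr(ill_rate)
```

The `corr` table is what a heatmap of the counts would display.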
Pair plot between variables:

Step 3
Outliers strongly affect the performance of the algorithm.
Here, however, we choose not to treat the outliers.
Step 4
Scale the data set using z-scores:
Compare the boxplots before and after scaling:

This is the data set after scaling.


Boxplot before scaling:

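There are a lot of outliers present in every variable of the data set.


Boxplot after scaling:

After scaling, the boxplot makes the outliers more visible because all the variables are now
on a common scale. Z-score scaling is a linear transformation, so it does not itself create or
remove outliers.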


Step 5:
PCA:
Import all the required libraries.
Perform PCA step by step using sklearn.
We got this array:

Now the data is in matrix form.


Eigenvectors:

These are the eigenvectors (principal axes) of the covariance matrix.

Eigenvalues:

This figure shows the eigenvalues, one per principal component.
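
A minimal sketch of how the covariance matrix, eigenvectors and eigenvalues relate when PCA is fitted with sklearn. The data is random and only illustrative; `components_` holds the eigenvectors as rows and `explained_variance_` holds the matching eigenvalues.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the scaled census data (50 rows, 4 columns).
rng = np.random.default_rng(3)
raw = rng.normal(size=(50, 4))
raw[:, 1] += 2.0 * raw[:, 0]                       # make two columns correlated
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)     # z-score scaling

pca = PCA().fit(X)

cov_matrix = np.cov(X, rowvar=False)    # sample covariance matrix (n - 1 denominator)
eigenvectors = pca.components_          # one eigenvector per row
eigenvalues = pca.explained_variance_   # sorted from largest to smallest
```

Each row `v` of `eigenvectors` satisfies `cov_matrix @ v == eigenvalue * v` up to rounding, which is a quick sanity check on the decomposition.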


Step 6:
Optimum number of PCs:
To find the optimum number of PCs that together contain at least 90% of the explained
variance, we consider the cumulative explained variance. This array shows the explained variance.
To reach 90% of the variance we have to choose 6 PCs.
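
The selection rule can be written as a small helper. The ratios below are made up and are not the census results; the function name `n_components_for` is hypothetical. With a fitted sklearn model the input would be `pca.explained_variance_ratio_`.

```python
import numpy as np

def n_components_for(explained_variance_ratio, target=0.90):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= target, converted to a count.
    return int(np.searchsorted(cumulative, target) + 1)

# Made-up explained variance ratios, sorted from largest to smallest.
ratios = np.array([0.50, 0.20, 0.10, 0.06, 0.05, 0.04, 0.03, 0.02])
n_pcs = n_components_for(ratios)
```

sklearn can also apply the same rule directly: passing a fraction such as `PCA(n_components=0.90)` keeps just enough components to exceed that share of the variance.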

When we choose 6 PCs the reduced data output is:

Now we have to create a new data set with the PCs. It looks like this:

The shape of the data set has now been reduced: it is 640 rows and 6 columns.
Scree plot:

Here we can see that 6 PCs, numbered 0 to 5, together explain 90% of the variance.
Step 7:
Compare the PCs with the actual columns:

PC1 has its maximum loading on TOT_M.
PC2 has its maximum loading on MARG_CL__M.
PC3 has its maximum loading on MARG_AL_0_3_F.
PC4 has its maximum loading on MAIN_AL_F.
PC5 has its maximum loading on F_ST.
PC6 has its maximum loading on MAIN_HH_F.
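
Reading off the dominant column of each PC can be sketched as below. The loading values are invented for illustration; in practice the table would be built from `pca.components_` with the original column names, and comparing absolute values is an assumption about how "maximum" was judged.

```python
import pandas as pd

# Made-up loadings table: one row per PC, one column per original variable.
loadings = pd.DataFrame(
    [[0.70, 0.10, -0.20],
     [0.05, -0.90, 0.30]],
    index=["PC1", "PC2"],
    columns=["TOT_M", "MAIN_AL_F", "F_ST"],
)

# Column with the largest absolute loading for each PC.
dominant = loadings.abs().idxmax(axis=1)
```

Using the absolute value matters because the sign of a principal component is arbitrary, so a large negative loading is as informative as a large positive one.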

This heatmap shows the relationships between all the PCs.

Step 8
Linear equation for the 1st PC:

This is the linear equation for PC1.
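
The equation has the form PC1 = a1·x1 + a2·x2 + ..., where the a's are the entries of the first eigenvector and the x's are the scaled columns. A minimal sketch on random stand-in data (three columns borrowed by name only, values made up) shows how the equation can be printed and checked against sklearn's own projection.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for three scaled census columns.
names = ["TOT_M", "TOT_F", "M_LIT"]
rng = np.random.default_rng(4)
raw = rng.normal(size=(40, 3))
raw[:, 1] += raw[:, 0]
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

pca = PCA(n_components=2).fit(X)
weights = pca.components_[0]   # coefficients of PC1

# Human-readable linear equation for PC1.
equation = "PC1 = " + " + ".join(
    f"({w:.3f} * {name})" for w, name in zip(weights, names)
)
print(equation)

# Evaluating the equation by hand reproduces sklearn's first component score.
pc1_manual = (X - pca.mean_) @ weights
pc1_sklearn = pca.transform(X)[:, 0]
```

The subtraction of `pca.mean_` is a no-op on z-scored data but keeps the check valid for unscaled input as well.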


END
