Clustering Project

The document describes a data mining project on digital ad clustering. It includes steps to: 1) read and clean the data, treating missing values and outliers; 2) apply z-score scaling and perform hierarchical clustering; 3) determine the optimal number of clusters for k-means clustering through an elbow plot and silhouette scores; 4) profile the ads by cluster to understand trends in key metrics such as clicks and spend.

Uploaded by

kirti sharma

DATA MINING GRADED PROJECT

Contents
Problem 1
Introduction
Read the data and perform basic analysis such as printing a few rows (head and tail), info, data summary, null values, duplicate values, etc.
Treat missing values in CPC, CTR and CPM using the formula given. You have to create a user-defined function and then call the function for imputing.
Check if there are any outliers.
Do you think treating outliers is necessary for K-Means clustering? Based on your judgement decide whether to treat outliers and if yes, which method to employ. (As an analyst your judgement may be different from another analyst.)
Perform clustering and do the following:
Perform z-score scaling and discuss how it affects the speed of the algorithm.
Perform hierarchical clustering by constructing a dendrogram using WARD and Euclidean distance.
Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means algorithm.
Print silhouette scores for up to 10 clusters and identify optimum number of clusters.
Profile the ads based on optimum number of clusters using silhouette score and your domain understanding.
[Hint: Group the data by clusters and take sum or mean to identify trends in clicks, spend, revenue, CPM, CTR, & CPC based on Device Type. Make bar plots.]
Conclude the project by providing summary of your learnings.

Problem 2
Introduction
Read the data and perform basic steps
Perform EDA for given variables
Visualize outliers
Perform z-scaling
Compare boxplot before and after scaling
Perform all steps of PCA
Identify optimum number of PCs
Show scree plot
Compare PCA with actual columns
Linear equation for 1st PC
List of figures
Boxplot of clustering dataset
Boxplot after treating outliers
Boxplot of scaled data of clustering
Dendrogram
Elbow plot
Figure of cluster count and CPM
Figure of cluster count and CPC
Figure of cluster count and CTR
Figure of cluster count and Clicks
Figure of cluster count and Revenue
PCA:
Bar graph of questions and answers
Boxplot of given variables
Heatmap
Pairplot
Boxplot before scaling
Boxplot after scaling
Scree plot
Heatmap b/w PCs
Graph for comparison of PCs

List of tables
Clustering:
First five and last five rows
Info of the data set
Dataset after filling NaN values
Checking null values
Describe the data set
Scaled data
Cluster count column in the dataset
WSS values
Silhouette score table
PCA:
First five and last five rows
Info of the data set
Checking null values
Describe the data set
Questions and answers table related to the 5 given columns
Bivariate analysis
Scaled data set
PCA covariance matrix
Eigenvectors
Eigenvalues
Cumulative explained variance
PCA components table
PCA data head
Correlation matrix
Original dataset with PCs
Linear equation of PC1

Introduction

Clustering:
Digital Ads Data:
ads24x7 is a digital marketing company that has received seed funding of $10
million. It is expanding into marketing analytics. The company collected data
from its Marketing Intelligence team and now wants you (its newly appointed
data analyst) to segment types of ads based on the features provided. Use a clustering
procedure to segment the ads into homogeneous groups.
The following three features are commonly used in digital marketing: CPM, CPC and CTR.

Step 1
• Import all the required libraries.
• Read the dataset as df.
• View the first five and last five rows with the head and tail methods.

• We get information about the data with the info() method.

• There are 23066 rows and 19 columns.

• There are 4 categorical variables.
• There are 6 float64 and 7 int64 columns.
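
The inspection steps above can be sketched as follows. The report does not name the data file, so this sketch builds a tiny stand-in DataFrame; the column names and values are assumptions for illustration only, and the real data would be loaded with pd.read_csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ads data (the real file name is not given in the report).
df = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile", "Mobile"],
    "Impressions": [1000, 2000, 1500, 1500],
    "Clicks": [10, 40, 30, 30],
    "Spend": [5.0, 12.0, 9.0, 9.0],
    "CTR": [1.0, np.nan, 2.0, 2.0],
})

print(df.head())   # first rows (up to five)
print(df.tail())   # last rows (up to five)
df.info()          # column dtypes and non-null counts

null_counts = df.isnull().sum()        # missing values per column
n_duplicates = df.duplicated().sum()   # number of fully duplicated rows
print(null_counts)
print("duplicate rows:", n_duplicates)
```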

There are 4736 null values in the data set; we have to fill them using the given formula.

Step 2

Now there are no null values.
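
The report does not reproduce the formulas it was given, so the following is only a minimal sketch that assumes the standard digital-marketing definitions (CTR = clicks / impressions × 100, CPM = spend / impressions × 1000, CPC = spend / clicks). The function name impute_ad_metrics and the column names Spend, Clicks and Impressions are assumptions.

```python
import numpy as np
import pandas as pd

def impute_ad_metrics(data: pd.DataFrame) -> pd.DataFrame:
    """Fill missing CTR, CPM and CPC from Spend, Clicks and Impressions.

    Column names and formulas are assumptions (standard definitions),
    not taken from the report.
    """
    out = data.copy()
    ctr = out["Clicks"] / out["Impressions"] * 100   # click-through rate, percent
    cpm = out["Spend"] / out["Impressions"] * 1000   # cost per thousand impressions
    cpc = out["Spend"] / out["Clicks"]               # cost per click
    out["CTR"] = out["CTR"].fillna(ctr)
    out["CPM"] = out["CPM"].fillna(cpm)
    out["CPC"] = out["CPC"].fillna(cpc)
    return out

# Toy rows with one missing value in each metric.
ads = pd.DataFrame({
    "Impressions": [1000, 2000],
    "Clicks": [10, 40],
    "Spend": [5.0, 12.0],
    "CTR": [1.0, np.nan],
    "CPM": [np.nan, 6.0],
    "CPC": [0.5, np.nan],
})
filled = impute_ad_metrics(ads)
print(filled)
```

Only the missing cells are filled; values that were already present are left untouched.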

Here we remove the categorical columns with the drop method and describe all the remaining
columns.

Step 3

Draw a boxplot of the new data set to check whether any outliers are present.

Yes, outliers are present in every column except the Ad-Length and Ad – Width columns.

We have to treat the outliers present in the data because k-means clustering is very sensitive to
outliers.

After treating the outliers, the new boxplot is:

Here we were able to remove all the outliers present in the data.

We treat the outliers because their presence can mislead the clustering process. Outliers
shift the mean, and with it the cluster centroids, which can distort the clusters and affect the
speed of the algorithm.
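
The report does not say which treatment method was used. One common choice, shown here purely as an assumption, is to cap values at the boxplot fences (1.5 × IQR beyond the quartiles); the helper name cap_outliers_iqr and the toy values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series) -> pd.Series:
    """Cap values outside the 1.5*IQR boxplot fences (one possible treatment)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return series.clip(lower=lower, upper=upper)

# Toy column with one extreme value.
clicks = pd.Series([10, 12, 11, 13, 12, 500])
capped = cap_outliers_iqr(clicks)
print(capped.tolist())
```

Capping keeps every row, which matters when each row is an ad that still has to be assigned to a cluster.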

Perform z-score scaling

After z-score scaling all the variables are on the same scale, so no variable dominates
the distance computation. This makes it easier to compare the variables and can also
speed up the algorithm.

This is the information about the data after scaling:

Only float-type columns remain, and the number of columns is reduced from 19 to 13.
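
A minimal sketch of z-score scaling with scikit-learn's StandardScaler is shown below. The two columns and their values are made up for illustration; the report's actual numeric columns are not listed here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric data on very different scales (values are illustrative only).
numeric = pd.DataFrame({
    "Clicks": [10.0, 20.0, 30.0, 40.0],
    "Spend": [100.0, 300.0, 500.0, 700.0],
})

scaler = StandardScaler()   # subtracts the column mean, divides by the column std
scaled = pd.DataFrame(scaler.fit_transform(numeric), columns=numeric.columns)
print(scaled)
```

After scaling, each column has mean 0 and (population) standard deviation 1, so distance-based methods weigh the columns equally.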

This is the boxplot of the data after scaling

Step 5
Here we perform hierarchical clustering after importing all the required libraries.

Constructing a dendrogram

This is a figure of the dendrogram.

Here we can see the number of clusters. We can choose the number of clusters by
two methods:
• We can choose it by an appropriate threshold level.
• We can define it by the number of clusters.

We did it by both methods, and both gave the same output.

Here we choose 4 clusters.

We can see the data set after hierarchical clustering. Here the cluster count is
added as a column in the data set.
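
The two ways of cutting the tree described above can be sketched with SciPy. The data here is synthetic (four well-separated blobs), and the threshold is picked from the merge heights for this toy data only; neither comes from the report.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the scaled ads data: four well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Ward linkage with Euclidean distance.
Z = linkage(X, method="ward", metric="euclidean")

# Method 1: ask directly for a number of clusters.
labels_by_count = fcluster(Z, t=4, criterion="maxclust")

# Method 2: cut at a distance threshold. Here the threshold sits between the
# last within-blob merge and the first merge that joins two blobs.
threshold = (Z[-4, 2] + Z[-3, 2]) / 2
labels_by_distance = fcluster(Z, t=threshold, criterion="distance")

print(np.unique(labels_by_count), np.unique(labels_by_distance))
```

scipy.cluster.hierarchy.dendrogram(Z) can be used to draw the tree that these cuts are read from.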

Step 6 (k-means clustering)

These are the within-cluster sum of squares (WSS) values for the clustering.

Elbow plot

Here the curve makes an elbow at 4 clusters, so we choose 4 as the number of clusters.
After 4 the decrease in WSS is small and the curve is nearly flat.
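
A sketch of how WSS values for an elbow plot could be computed with scikit-learn, where KMeans.inertia_ is the within-cluster sum of squares. The data is synthetic with four blobs, so the elbow at 4 is built into the example rather than taken from the report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled data: four well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    km.fit(X)
    wss.append(km.inertia_)   # within-cluster sum of squares for this k

drops = -np.diff(wss)         # drops[i] is the decrease in WSS going from k=i+1 to k=i+2
for k, value in enumerate(wss, start=1):
    print(k, round(value, 2))
```

Plotting wss against k gives the elbow plot; the elbow is where the drops become small.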
Step 7
Silhouette scores:
The silhouette score measures how similar an object is to its own cluster
compared with other clusters. It ranges from -1 to 1.
The scores for up to n=10 clusters are given below:
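
A sketch of how such scores could be computed with scikit-learn's silhouette_score, again on synthetic four-blob data rather than the report's data. The score needs at least two clusters, so k starts at 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: four well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean silhouette over all points

best_k = max(scores, key=scores.get)          # k with the highest mean silhouette
for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```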

Step 8

Here we add the cluster label as a column in the data set and take the mean of each
cluster.
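
The profiling step can be sketched with a pandas groupby. The column name cluster, the other column names and all values below are assumptions for illustration.

```python
import pandas as pd

# Toy data with a cluster label already attached (values are illustrative only).
ads = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile", "Desktop", "Mobile", "Desktop"],
    "cluster": [0, 0, 1, 1, 1, 0],
    "Clicks": [10, 20, 30, 40, 50, 60],
    "CPC": [0.5, 0.4, 0.3, 0.2, 0.1, 0.6],
})

# Mean of each metric per cluster.
profile = ads.groupby("cluster")[["Clicks", "CPC"]].mean()

# Total clicks per cluster, split by device type (one column per device).
by_device = ads.groupby(["cluster", "Device Type"])["Clicks"].sum().unstack()

print(profile)
print(by_device)
```

Calling by_device.plot(kind="bar") would draw the kind of bar plot the hint in the problem statement asks for.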

Here we can see that cluster 2 has the highest CTR values.

Cluster 1 holds the maximum value of CPC.

Here we can see that, based on device type, cluster 3 has the maximum value of CPC.

For CPM:

Cluster 2 has the maximum values of CPM.

On the basis of device type, cluster 0 has the maximum CPM.

Clicks:

Cluster 0 has the maximum number of clicks.

Revenue:

On the basis of device type, cluster 3 has the maximum revenue.

Step 9:
• Cluster no. 2, that is the 3rd cluster, has the maximum number of counts.
• Cluster no. 3, that is the 4th cluster, has the maximum revenue.
• Cluster 0 has the maximum number of clicks.
• By performing clustering, we can easily analyse the data.

Problem 2:
Introduction:
PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes - 2011
PCA for Female Headed Household Excluding Institutional Household. The Indian Census
has the reputation of being one of the best in the world. The first Census in India was
conducted in the year 1872. This was conducted at different points of time in different
parts of the country. In 1881 a Census was taken for the entire country simultaneously.
Since then, Census has been conducted every ten years, without a break. Thus, the
Census of India 2011 was the fifteenth in this unbroken series since 1872, the seventh
after independence and the second census of the third millennium and twenty first
century. The census has continued uninterruptedly despite several adversities
like wars, epidemics, natural calamities, political unrest, etc. The Census of India is
conducted under the provisions of the Census Act 1948 and the Census Rules, 1990. The
Primary Census Abstract, which is an important publication of the 2011 Census, gives basic
information on Area, Total Number of Households, Total Population, Scheduled Castes,
Scheduled Tribes Population, Population in the age group 0-6, Literates, Main Workers
and Marginal Workers classified by the four broad industrial categories, namely, (i)
Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and (iv) Other
Workers and also Non-Workers. The characteristics of the Total Population include
Scheduled Castes, Scheduled Tribes, Institutional and Houseless Population and are
presented by sex and rural-urban residence. Census 2011 covered 35 States/Union
Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables that it is difficult to find useful details
without using data science techniques. You are tasked to perform detailed EDA and
identify the optimum principal components that explain the most variance in the data. Use
sklearn only.

Step 1:

• Import all the required libraries.

• Read the data set as df_census.
• Read the first five and last five rows of the data set using the head and tail methods.

These are the first five rows; we can see the 61 columns of the data set.

These are the last five rows of the data set.

We can see the information about the data by using the info method.

• Here we see there are 61 columns and 640 rows in total.

• There are 59 int columns.
• There are 2 categorical columns.
• However, from the information above we can see that two more columns are
categorical in nature: State code and Area name.

Null values:
• There are no null values in the data set.
• There are no duplicate values in the data set.
Here we can summarise the data set by describing it.

• From the above figure we can find the max. and min values of each column.
• We can also check the mean and median value of the given column.

These are the columns of the data.

Step 2: EDA
• We have to perform EDA on the 5 variables from the given 24 variables.
• Here we define a new dataset for given 5 variables.

This is the new data set for given variables.

Here we can answer some questions.

• Which state has the highest gender ratio?

Here we generate a new column ‘ratio’ and find the maximum and minimum ratio.
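
The report does not show how the ratio column was defined. The sketch below assumes it is total females divided by total males per state; TOT_M and TOT_F are column names used in the report, while the State column name and all values are made up.

```python
import pandas as pd

# Toy district-level rows (state names and counts are illustrative only).
census = pd.DataFrame({
    "State": ["A", "A", "B", "C"],
    "TOT_M": [100, 200, 150, 400],
    "TOT_F": [120, 210, 150, 300],
})

# Aggregate districts to state level, then compute the assumed ratio.
by_state = census.groupby("State")[["TOT_M", "TOT_F"]].sum()
by_state["ratio"] = by_state["TOT_F"] / by_state["TOT_M"]

highest = by_state["ratio"].idxmax()   # state with the highest ratio
lowest = by_state["ratio"].idxmin()    # state with the lowest ratio
print(by_state)
print(highest, lowest)
```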

Ans. Uttar Pradesh has the maximum gender ratio and Lakshadweep has the minimum.

Which state has the maximum number of literate males?

Ans. Uttar Pradesh has the maximum number of literate males, and Dadra and Nagar Haveli
has the minimum ratio of literate males.
Which state has the maximum number of illiterate females?
Ans. Uttar Pradesh has the largest population, so it also has the most illiterate females.
Lakshadweep has the fewest illiterate females.
Now we perform univariate and bivariate analysis of the data set.
Univariate Analysis:

The total male population is skewed.

Yes, there are outliers in the total male population.

Total female:

The total female population also has outliers.

Male literate:

The male literate variable is also skewed.

Female literate:

This variable also has outliers.

Male illiterate:

All the variables in the data set have some outliers.

Bivariate analysis:

This is a correlation matrix, which shows the relation between all the given variables.
We can also see it in graphs.

This is a heatmap, which shows the relation between the variables.

Here we can see that the variables TOT_M, TOT_F, M_LIT, F_LIT, M_ILL and F_ILL are
highly correlated with each other.
The literacy ratio and the illiteracy ratio are negatively correlated.
Pair plot between the variables:

Step 3
Outliers can strongly affect the performance of the algorithm.
Here we choose not to treat the outliers.
Step 4
Scale the data set using z-score:
Compare the boxplots before and after scaling:

This is the data set after scaling.

Boxplot before scaling:

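There are a lot of outliers present in each variable of the data set.

Boxplot after scaling:

After scaling, the boxplot still shows the outliers. Z-score scaling is a linear transformation,
so it does not remove them; with all variables on a common scale they are simply easier to see.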

Step 5:
PCA:
Import all the required libraries.
Perform PCA step by step using sklearn.
We got this array:

Now the data is in matrix form.

Eigenvectors:

These are the eigenvectors for the variables.

Eigenvalues:

This figure shows the eigenvalues for the variables.
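
A sketch of how the covariance matrix, eigenvectors and eigenvalues relate when PCA is fitted with sklearn. The data is synthetic (three correlated columns plus one noise column) and stands in for the scaled census data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three columns driven by one shared factor, plus pure noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
correlated = [base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)]
X = np.hstack(correlated + [rng.normal(size=(200, 1))])

Xs = StandardScaler().fit_transform(X)   # z-score scaling before PCA
cov = np.cov(Xs, rowvar=False)           # covariance matrix of the scaled data

pca = PCA()
pca.fit(Xs)
eigenvectors = pca.components_           # one eigenvector per row
eigenvalues = pca.explained_variance_    # eigenvalues of cov, largest first

print(np.round(eigenvalues, 3))
```

Each row v of eigenvectors satisfies cov @ v = λ v for the matching eigenvalue λ, which is what makes them the principal directions.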

Step 6:
Optimum number of PCs:
To find the optimum number of PCs, which should contain at least 90% of the explained
variance, we consider the cumulative explained variance. This array shows the explained variance.
For 90% of the variance we have to choose 6 PCs.
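
The selection rule described above can be sketched as follows. The data is synthetic, so the number of components it yields has nothing to do with the 6 PCs reported for the census data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three correlated columns plus two independent noise columns.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
correlated = [base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)]
noise = [rng.normal(size=(200, 1)) for _ in range(2)]
Xs = StandardScaler().fit_transform(np.hstack(correlated + noise))

pca = PCA().fit(Xs)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 90%.
n_components = int(np.argmax(cumulative >= 0.90) + 1)

reduced = PCA(n_components=n_components).fit_transform(Xs)
print(np.round(cumulative, 3), n_components, reduced.shape)
```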

When we choose 6 PCs, the reduced data output is:

Now we have to create a new data set with the PCs. It looks like this:

Now the shape of the data set has been reduced. It is now 640 rows and 6 columns.
Scree plot:

Here we can see that 6 PCs (indexed 0 to 5) explain 90% of the variance.
Step 7:
Compare the PCs with the actual columns:

PC1 has its maximum loading on TOT_M.

PC2 has its maximum loading on MARG_CL__M.
PC3 has its maximum loading on MARG_AL_0_3_F.
PC4 has its maximum loading on MAIN_AL_F.
PC5 has its maximum loading on F_ST.
PC6 has its maximum loading on MAIN_HH_F.
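
One way to read off the column that contributes most to each PC is to look at the loading with the largest absolute value. This is an assumption about how the comparison was done; the values below are synthetic and only the column names are borrowed from the report.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic values under column names borrowed from the report.
columns = ["TOT_M", "TOT_F", "M_LIT", "F_LIT"]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += X[:, 0]            # make TOT_F correlated with TOT_M
Xs = StandardScaler().fit_transform(X)

pca = PCA(n_components=3).fit(Xs)

# Rows are original columns, columns are PCs.
loadings = pd.DataFrame(pca.components_.T, index=columns,
                        columns=["PC1", "PC2", "PC3"])

# For each PC, the original column with the largest absolute loading.
top_feature = loadings.abs().idxmax()
print(loadings.round(3))
print(top_feature)
```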

This heatmap shows the relation between all the PCs.

Step 8
Linear equation for the 1st PC:

This is the linear equation for PC1.
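
The equation itself is not reproduced in the text. As a sketch, PC1 is a weighted sum of the scaled columns, with the weights taken from the first row of pca.components_. The data here is synthetic and only the column names are borrowed from the report.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic values under column names borrowed from the report.
columns = ["TOT_M", "TOT_F", "M_LIT", "F_LIT"]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += X[:, 0]
Xs = StandardScaler().fit_transform(X)

pca = PCA().fit(Xs)
weights = pca.components_[0]   # loadings of the first principal component

# Build the equation as text: PC1 = w1*col1 + w2*col2 + ...
terms = [f"{w:+.3f}*{name}" for name, w in zip(columns, weights)]
equation = "PC1 = " + " ".join(terms)
print(equation)

# The same weighted sum, evaluated for every row of the scaled data.
pc1_scores = Xs @ weights
```

Because the data was z-scored first, this weighted sum matches the first column returned by pca.transform.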

END
