Jahnavijillella ML1 30 06 2024 PDF

The report outlines a project by All Life Bank to enhance its credit card customer base through personalized marketing and improved service delivery. The objective is to segment customers based on spending patterns and interactions using clustering algorithms, with data analysis revealing insights into customer behavior and preferences. The findings suggest three distinct customer clusters based on their modes of contacting the bank, with recommendations for targeted marketing strategies and service improvements.

Uploaded by

kart238

Machine Learning-1 Project

Guided Report
Context

All Life Bank wants to focus on its credit card customer base in the next financial year. Its marketing research team has
advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalized
campaigns to target new customers as well as upsell to existing customers. Another insight from the market research was
that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the
service delivery model to ensure that customer queries are resolved faster. The Head of Marketing and the Head of Delivery
both decide to reach out to the Data Science team for help.
Objective

To identify different segments in the existing customer base, based on their spending patterns and past interactions with
the bank, using clustering algorithms, and to provide recommendations to the bank on how to better market to and service
these customers.

Data Description

The data provided is of various customers of a bank and their financial attributes like credit limit, the total number of credit
cards the customer has, and different channels through which customers have contacted the bank for any queries (including
visiting the bank, online and through a call center).

Data Dictionary.
• Sl_No: Primary key of the records
• Customer Key: Customer identification number
• Average Credit Limit: Average credit limit of each customer for all credit cards
• Total credit cards: Total number of credit cards possessed by the customer
• Total visits bank: Total number of visits that customer made (yearly) personally to the bank
• Total visits online: Total number of visits or online logins made by the customer (yearly)
• Total calls made: Total number of calls made by the customer to the bank or its customer service department (yearly)

Overview of the data set

❖ The first step in getting an overview of any dataset is to look at its first few rows and check that the dataset has been
loaded properly.
❖ Get the number of rows and columns in the dataset.
❖ Check the data types of the columns to ensure that data is stored in the preferred format and that the values of each
column are as expected.
❖ Check the statistical summary of the dataset to get an overview of its numerical columns.
Checking the shape of the dataset
The dataset has 660 rows and 7 columns

Displaying the first few rows of the dataset

The first 5 rows of the dataset:

Creating a Copy of the Original Data

copying the data to another variable to avoid any changes to original data:
fixing column names:
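
The copy-and-rename step could look roughly like the sketch below. It assumes the data is held in a pandas DataFrame; the frame `data`, its sample values, and the renaming rule (spaces replaced by underscores) are illustrative assumptions, not the original notebook code.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset (values are made up).
data = pd.DataFrame({
    "Sl_No": [1, 2, 3],
    "Customer Key": [87073, 38414, 17341],
    "Avg_Credit_Limit": [100000, 50000, 50000],
})

# Work on a copy so that the original frame stays untouched.
df = data.copy()

# Assumed naming rule: replace spaces in column names with underscores.
df.columns = [name.replace(" ", "_") for name in df.columns]

print(list(df.columns))
```

Working on a copy means any later transformation can be checked against the untouched `data` frame.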
Checking the data types of the columns for the dataset

Checking datatypes and number of non-null values for each column:

All the columns in the data are integers.

Checking the missing values

There are 6 missing values in the data.
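
The overview checks above (shape, first rows, data types, missing values) can be sketched with standard pandas calls. The small frame below is a made-up sample, so its counts do not match the 660-row dataset described in this report.

```python
import pandas as pd

# Hypothetical three-row sample (values are made up for illustration).
df = pd.DataFrame({
    "Avg_Credit_Limit": [100000, 50000, 30000],
    "Total_Credit_Cards": [2, 3, 7],
    "Total_visits_bank": [1, 0, 1],
})

n_rows, n_cols = df.shape                      # number of rows and columns
first_rows = df.head()                         # first five rows (here: all three)
column_types = df.dtypes                       # data type of each column
missing_total = int(df.isnull().sum().sum())   # total number of missing cells

print(n_rows, n_cols, missing_total)
```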

Checking the number of unique values in each column

• There are fewer unique values in the Customer_Key column than observations in the data. This means
that some customer keys appear more than once.

Checking for duplicate values

Let's look at the duplicate values in the Customer_Key column closely.
We will drop the Sl_No and Customer_Key as they do not add any value to the analysis.
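
A minimal sketch of the unique-value check, the duplicate inspection, and the column drop, assuming a pandas DataFrame with the column names used in this report. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample in which customer key 101 appears twice.
df = pd.DataFrame({
    "Sl_No": [1, 2, 3, 4],
    "Customer_Key": [101, 102, 101, 103],
    "Avg_Credit_Limit": [50000, 30000, 52000, 10000],
    "Total_Credit_Cards": [3, 2, 4, 1],
})

# Number of distinct values per column.
unique_per_column = df.nunique()

# keep=False marks every row whose key is repeated, not only the later ones.
duplicate_rows = df[df.duplicated(subset="Customer_Key", keep=False)]

# Identifier columns carry no behavioural information, so drop them.
features = df.drop(columns=["Sl_No", "Customer_Key"])

print(duplicate_rows)
```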
Statistical summary of the dataset

Exploratory Data Analysis

The functions can be used to explore the distribution of individual features in the dataset and gain insights into the data.
Univariate analysis

The boxplot shows that the majority of customers have an average credit limit between 0 and 200,000. The histogram shows that the
distribution of average credit limits is skewed to the right, with a few customers having very high credit limits.
The boxplot shows that most customers have between 1 and 4 credit cards. The histogram shows that the distribution of total credit
cards is roughly normal, with a few customers having a large number of credit cards.

The boxplot shows that most customers visit the bank between 0 and 5 times per year. The histogram shows that the distribution of total
visits to the bank is skewed to the right, with a few customers visiting the bank very frequently.
The boxplot shows that most customers visit the bank's website between 0 and 10 times per year. The histogram shows that the
distribution of total visits to the bank's website is skewed to the right, with a few customers visiting the website very frequently.
The boxplot shows that most customers call the bank between 0 and 5 times per year. The histogram shows that the distribution of total
calls made to the bank is skewed to the right, with a few customers calling the bank very frequently.
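
The right skew read off the histograms can also be checked numerically. The sketch below uses a made-up series of credit limits; a positive skewness and a mean above the median are both signs of a right-skewed distribution.

```python
import pandas as pd

# Hypothetical credit limits with two very large values.
credit_limit = pd.Series(
    [5000, 8000, 10000, 12000, 15000, 18000, 20000, 150000, 200000]
)

# Quartiles that a boxplot would draw as the box edges and the centre line.
q1, median, q3 = credit_limit.quantile([0.25, 0.5, 0.75])

# Sample skewness: positive values indicate a longer right tail.
skewness = credit_limit.skew()

print(q1, median, q3, skewness)
```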
Calling the plotting functions specified above
**Observations**

❖ The barplots show the percentage of customers who have each level of the categorical features.
❖ For example, the first barplot shows that 30% of customers have 1 credit card, 25% have 2 credit cards, and so on.
❖ The barplots also show that the majority of customers have 1 or 2 credit cards, and that a small percentage of customers
have 5 or more credit cards.
Creating a subplot grid of CDF (Cumulative Distribution Function) plots for numerical variables
❖ The figure shows the CDF (Cumulative Distribution Function) plots for the numerical variables in the dataset.
❖ Each plot shows the cumulative probability of a data point falling below a certain value.
❖ For example, the first plot shows that about 30% of customers have an average credit limit below 50,000, about 60%
have an average credit limit below 100,000, and so on.
❖ The plots can be used to compare the distributions of different variables and to identify outliers.
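
How a CDF plot is built can be illustrated with a short empirical CDF sketch. The function name `ecdf` and the sample values are assumptions for illustration and are not taken from the original notebook.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the cumulative share of points at or below each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical credit limits.
limits = [10000, 20000, 40000, 60000, 150000]
x, y = ecdf(limits)

# Share of customers with a credit limit below 50,000.
share_below_50k = float(np.mean(np.asarray(limits) < 50000))

print(share_below_50k)
```

Plotting `y` against `x` as a step line gives the CDF curve for one variable.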

**Observations**

❖ The distribution of average credit limit is skewed to the right, with a few customers having very high credit limits.
❖ The distribution of total credit cards is roughly normal, with a few customers having a large number of credit cards.
❖ The distribution of total visits to the bank is skewed to the right, with a few customers visiting the bank very frequently.
❖ The distribution of total visits to the bank's website is skewed to the right, with a few customers visiting the website very
frequently.
❖ The distribution of total calls made to the bank is skewed to the right, with a few customers calling the bank very
frequently.

Bivariate Analysis

Creating a heatmap to visualize the correlation matrix

❖ The heatmap shows the correlation coefficients between all pairs of numerical variables in the dataset.
❖ The values in the heatmap range from -1 to 1, where 1 indicates a perfect positive correlation, -1 indicates a perfect negative
correlation, and 0 indicates no correlation.
❖ The diagonal elements of the heatmap are all 1 because the correlation between a variable and itself is always 1.
❖ The heatmap shows that there are some strong positive correlations between some of the variables in the dataset.
❖ For example, there is a strong positive correlation between average credit limit and total credit cards, and between total visits to
the bank and total visits to the bank's website.
❖ There is also a strong negative correlation between average credit limit and total calls made to the bank.
❖ This means that customers with higher average credit limits tend to have more credit cards and visit the bank and its website
more frequently, but they tend to call the bank less frequently.
Observations

❖ The heatmap shows that there are some strong correlations between some of the variables in the dataset.
❖ These correlations can be used to identify relationships between variables and to make predictions about one variable based on
another variable.
❖ For example, the strong positive correlation between average credit limit and total credit cards suggests that customers with
higher average credit limits are more likely to have more credit cards.
❖ The strong negative correlation between average credit limit and total calls made to the bank suggests that customers with higher
average credit limits are less likely to call the bank.
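
The correlation matrix behind a heatmap like this one can be computed with `DataFrame.corr()`. The values below are made up so that the signs are easy to see; they are not the correlations of the actual dataset.

```python
import pandas as pd

# Hypothetical sample: cards rise with credit limit, calls fall.
df = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 20000, 30000, 50000, 100000],
    "Total_Credit_Cards": [1, 2, 4, 5, 8],
    "Total_calls_made": [9, 7, 5, 2, 1],
})

# Pearson correlation between every pair of columns.
corr = df.corr()

print(corr.round(2))
```

The resulting frame can be passed to a heatmap function such as seaborn's `heatmap` for display.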

Creating a pair plot with a kernel density estimate on the diagonal

❖ The pairplot shows the relationships between all pairs of variables in the dataset.
❖ Each plot is a scatter plot with the two variables on the x and y axes.
❖ The diagonal plots are kernel density estimates of the individual variables.
❖ The plots can be used to identify relationships between variables, such as positive correlations, negative correlations, and
outliers.
❖ There is a positive correlation between average credit limit and total credit cards.
❖ There is a negative correlation between average credit limit and total visits to the bank.
❖ There is a positive correlation between total credit cards and total visits to the bank.
❖ There is a positive correlation between total visits to the bank and total calls made to the bank.
❖ There is a positive correlation between total visits to the bank's website and total calls made to the bank.

The pairplot also shows that there are a few outliers in the data. For example, there are a few customers with very high average credit
limits and a few customers who visit the bank or the bank's website very frequently.

We can add a hue and see if we can see some clustered distributions.

Creating a pair plot using seaborn to visualize relationships between selected features

❖ The figure shows a pairplot with a hue for the "Total_Credit_Cards" feature.
❖ This means that the plots are colored by the number of credit cards that each customer has.
❖ The pairplot shows the relationships between all pairs of variables in the dataset, colored by the number of credit cards.
❖ Each plot is a scatter plot with the two variables on the x and y axes.
❖ The diagonal plots are histograms of the individual variables, colored by the number of credit cards.
❖ The plots can be used to identify relationships between variables, such as positive correlations, negative correlations, and
outliers.
In this case, the pairplot shows that:

❖ There is a positive correlation between total visits to the bank and total calls made to the bank.
❖ There is a positive correlation between total visits to the bank's website and total calls made to the bank.
❖ Customers with more credit cards tend to visit the bank more often and make more calls to the bank.
❖ The pairplot also shows that there are a few outliers in the data.
❖ For example, there are a few customers with a high number of credit cards who visit the bank or the bank's website very
frequently.

Let's visualize the modes of contacting the bank in a 3D plot.

❖ The figure shows a 3D scatter plot of the "Total_visits_bank", "Total_visits_online", and "Total_calls_made" features. Each point
in the plot represents a customer.
❖ The plot shows that there are three main modes of contacting the bank: visiting the bank in person, visiting the bank's
website, and calling the bank on the phone.
❖ Most customers use only one of these modes of contact.
❖ However, there are a few customers who use two or even three modes of contact.
❖ The plot also shows that there are a few outliers in the data.
❖ For example, there are a few customers who visit the bank or the bank's website very frequently, and a few customers who call
the bank very frequently.

We can observe three segments of the customers by their preferred mode of contacting the bank.
Data Preprocessing

Outlier Detection

• Let's find outliers in the data using z-score with a threshold of 3.

The following are the outliers in the data:

Avg_Credit_Limit : [153000, 155000, 156000, 156000, 157000, 158000, 163000, 163000, 166000, 166000, 167000, 171000, 172000,
172000, 173000, 176000, 178000, 183000, 184000, 186000, 187000, 195000, 195000, 200000]

Total_Credit_Cards : []

Total_visits_bank : []

Total_visits_online : [12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]

Total_calls_made : []
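
A z-score outlier check with a threshold of 3 can be sketched as below. The helper name `zscore_outliers`, the use of the population standard deviation, and the sample values are assumptions for illustration.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    arr = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0) is assumed here.
    z = (arr - arr.mean()) / arr.std()
    return arr[np.abs(z) > threshold]

# Hypothetical sample: thirty ordinary values and one extreme value.
sample = list(range(1, 31)) + [1000]
outliers = zscore_outliers(sample)

print(outliers)
```

Applying the helper to each column in turn gives one list of outliers per feature, in the same layout as the lists above.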
Scaling
• Let's scale the data before we proceed with clustering.
• Creating a pair plot with a kernel density estimate on the diagonal

❖ The figure shows a pairplot of the data after scaling.
❖ Each plot is a scatter plot with the two variables on the x and y axes.
❖ The diagonal plots are kernel density estimates of the individual variables.
❖ The plots can be used to identify relationships between variables, such as positive correlations, negative
correlations, and outliers.

In this case, the pairplot shows that:

❖ The variables are now on a similar scale.
❖ There is a positive correlation between average credit limit and total credit cards.
❖ There is a negative correlation between average credit limit and total visits to the bank.
❖ There is a positive correlation between total credit cards and total visits to the bank.
❖ There is a positive correlation between total visits to the bank and total calls made to the bank.
❖ There is a positive correlation between total visits to the bank's website and total calls made to the bank.
❖ The pairplot also shows that there are a few outliers in the data. For example, there are a few customers with very high
average credit limits and a few customers who visit the bank or the bank's website very frequently.

scaling the data before clustering

creating a dataframe of the scaled data
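
A minimal sketch of the scaling step, assuming scikit-learn's `StandardScaler` (the report does not name the scaler that was used). The sample values and the name `scaled_df` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample with features on very different scales.
df = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 20000, 30000, 50000, 100000],
    "Total_Credit_Cards": [1, 2, 4, 5, 8],
    "Total_visits_online": [0, 2, 3, 10, 15],
})

# Standardise each column to zero mean and unit variance.
scaler = StandardScaler()
scaled_values = scaler.fit_transform(df)

# Wrap the result in a DataFrame that keeps the column names.
scaled_df = pd.DataFrame(scaled_values, columns=df.columns)

print(scaled_df.round(2))
```

Scaling matters here because K-means and hierarchical clustering are distance-based, and an unscaled credit limit would otherwise dominate the distances.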
K-means Clustering

❖ The figure shows a scatter plot of the "Total_visits_bank" and "Total_visits_online" features, colored by the cluster labels.
❖ Each point in the plot represents a customer.
❖ The plot shows that the three clusters are:
❖ Cluster 0: Customers who visit the bank in person frequently but do not visit the bank's website or call the bank on the phone very
often.
❖ Cluster 1: Customers who visit the bank's website frequently but do not visit the bank in person or call the bank on the phone very
often.
❖ Cluster 2: Customers who call the bank on the phone frequently but do not visit the bank in person or the bank's website very
often.
❖ There are also a few customers who lie far from the centers of all three clusters.
❖ K-means still assigns each of these customers to its nearest cluster, so they appear as outliers within a cluster.

Checking Elbow Plot

❖ The elbow plot shows the within-cluster sum of squares (WSS) for different numbers of clusters.
❖ The WSS measures how tightly the data points are grouped around their cluster centers, with lower values indicating more
compact clusters.
❖ The elbow plot shows that the WSS decreases as the number of clusters increases.
❖ This is because, as the number of clusters increases, the data points are divided into smaller and smaller groups, so each
point lies closer to its cluster center.
❖ The WSS cannot be used on its own to pick the number of clusters, because it keeps falling until every data point is its
own cluster.
❖ The elbow plot suggests that the optimal number of clusters is 3, because this is the point after which the WSS decreases
much more slowly.
❖ Choosing a larger number of clusters would split natural groups into small clusters that differ little from one another,
adding complexity without revealing new segments.
❖ Choosing a smaller number of clusters would merge distinct groups, so the clusters would be too large and the data
points within them too different from each other.
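
The elbow computation can be sketched as follows, assuming scikit-learn's `KMeans`, whose `inertia_` attribute is the WSS. The synthetic three-group data, the range of k, and the `random_state` values are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (not the bank dataset).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=1,
)

# Fit K-means for each k and record the within-cluster sum of squares.
wss = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    model.fit(X)
    wss[k] = model.inertia_

for k, value in wss.items():
    print(k, round(value, 1))
```

On data like this, the drop in WSS from k = 2 to k = 3 is large and the drop from k = 3 to k = 4 is small, which is the "elbow".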

Checking Silhouette Scores

The optimal number of clusters is the key with the maximum value in the dictionary ss.
The optimal number of clusters according to the silhouette score is: 3

❖ The figure shows the silhouette plot for different numbers of clusters.
❖ The silhouette plot shows the average silhouette score over all data points for each number of clusters.
❖ The silhouette score is a measure of how well each data point fits into its own cluster compared with the nearest other
cluster, with values ranging from -1 to 1.
❖ A score close to 1 indicates that the data point is well-clustered, a score of 0 indicates that the data point is on the
border of two clusters, and a score close to -1 indicates that the data point has probably been assigned to the wrong cluster.
❖ The plot shows that the average silhouette score is highest for 3 clusters.
❖ This indicates that 3 is the optimal number of clusters for the data.
❖ The silhouette plot also shows that the silhouette scores for 2 and 4 clusters are lower than for 3 clusters.
❖ This indicates that 2 and 4 are weaker choices for the number of clusters.
❖ In general, the optimal number of clusters is the one with the highest average silhouette score.
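
Selecting k by silhouette score can be sketched as below. The dictionary name `ss` follows the report; the synthetic data, the range of k, and the K-means settings are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the bank dataset).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=1,
)

# Average silhouette score for each candidate number of clusters.
ss = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    ss[k] = silhouette_score(X, labels)

# The key with the maximum value is the suggested number of clusters.
best_k = max(ss, key=ss.get)

print(best_k)
```

The silhouette score is undefined for a single cluster, which is why the loop starts at k = 2.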

Fitting and predicting with the KMeans model

Adding kmeans cluster labels to the original and scaled dataframes
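
Fitting the final K-means model, attaching the labels, and building a simple cluster profile might look like the sketch below. The column name `K_means_segments`, the feature names, and the synthetic data are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (not the bank dataset).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=1,
)
df = pd.DataFrame(X, columns=["feature_a", "feature_b"])
scaled_df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Fit on the scaled features and get one label per row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1)
labels = kmeans.fit_predict(scaled_df)

# Attach the labels to both the original and the scaled frame.
df["K_means_segments"] = labels
scaled_df["K_means_segments"] = labels

# Cluster profile: mean of each feature plus the cluster size.
profile = df.groupby("K_means_segments").mean()
profile["count"] = df.groupby("K_means_segments").size()

print(profile)
```

Profiling on the original (unscaled) frame keeps the means in their natural units, which makes them easier to interpret.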

Cluster Profiling
Hierarchical Clustering

❖ The dendrogram shows the hierarchical clustering of the data points in the dataframe.
❖ Each data point is represented by a leaf node in the dendrogram.
❖ The height at which two branches merge represents the distance between the clusters that they join.
❖ The dendrogram shows that the data points are divided into three main clusters.
❖ Each cluster corresponds to one of the three subtrees that remain when the dendrogram is cut below its top two merges.
❖ The dendrogram also shows that two of the clusters are more similar to each other than they are to the third cluster.
❖ This is because the branch that connects these two clusters is shorter than the branch that connects the third cluster
to the other two clusters.
❖ The dendrogram can be used to choose the number of clusters for the data.
❖ A common choice is to cut the dendrogram across its longest vertical branches, where any further merge would join
clusters that are far apart.
❖ In this case, the optimal number of clusters is 3.
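
The merge structure behind a dendrogram can be inspected with SciPy's `linkage` and `fcluster`. The Ward linkage method and the synthetic data are assumptions; the report does not state which linkage was used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (not the bank dataset).
X, _ = make_blobs(
    n_samples=60,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=1,
)

# Each row of Z records one merge: the two clusters joined, the merge
# height (distance), and the size of the new cluster.
Z = linkage(X, method="ward")
merge_heights = Z[:, 2]

# Cut the tree so that at most three flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")

print(np.round(merge_heights[-3:], 1))
```

The last two merge heights are much larger than the ones before them, which is the numerical counterpart of the long vertical branches above a three-cluster cut. Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree.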

Let's check silhouette score

The optimal number of clusters according to the silhouette score is: 3

❖ The figure shows the silhouette plot for different numbers of clusters using hierarchical clustering.
❖ The silhouette plot shows the average silhouette score over all data points for each number of clusters.
❖ The silhouette score is a measure of how well each data point fits into its own cluster compared with the nearest other
cluster, with values ranging from -1 to 1.
❖ A score close to 1 indicates that the data point is well-clustered, a score of 0 indicates that the data point is on the
border of two clusters, and a score close to -1 indicates that the data point has probably been assigned to the wrong cluster.
❖ The plot shows that the average silhouette score is highest for 3 clusters.
❖ This indicates that 3 is the optimal number of clusters for the data using hierarchical clustering.
❖ The silhouette plot also shows that the silhouette scores for 2 and 4 clusters are lower than for 3 clusters.
❖ This indicates that 2 and 4 are weaker choices for the number of clusters.
❖ In general, the optimal number of clusters is the one with the highest average silhouette score.

Creating final model

Hierarchical Clustering model with specified parameters

❖ Creating a copy of the original data

❖ Adding hierarchical cluster labels to the original and scaled dataframes

❖ Displaying the first five rows of the dataframe

❖ Displaying the first five rows of the dataframe2
❖ Assigning the hierarchical clustering labels to the 'HC_Clusters' column of the df DataFrame.
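
The final hierarchical model and the labelling steps listed above could be sketched as follows, assuming scikit-learn's `AgglomerativeClustering` with three clusters and Ward linkage. The linkage choice, the feature names, and the synthetic data are assumptions; only the column name `HC_Clusters` comes from the report.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (not the bank dataset).
X, _ = make_blobs(
    n_samples=90,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=1,
)
df = pd.DataFrame(X, columns=["feature_a", "feature_b"])
scaled_df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Agglomerative clustering with an assumed Ward linkage.
hc_model = AgglomerativeClustering(n_clusters=3, linkage="ward")
hc_labels = hc_model.fit_predict(scaled_df)

# Attach the labels to both the original and the scaled frame.
df["HC_Clusters"] = hc_labels
scaled_df["HC_Clusters"] = hc_labels

print(df.head())
```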
Cluster Profiling and Comparison

Cluster Profiling: K-means Clustering

Cluster Profiling: Hierarchical Clustering

K-means vs Hierarchical Clustering

This line highlights the maximum value in each column of K-means DataFrame with a light green background.

This line highlights the maximum value in each column of Hierarchical DataFrame with a light green background.
Creating a bar plot of mean values in k_means_df grouped by a specified column

❖ The figure shows the average values of each feature for each cluster in the K-means clustering.
❖ The bars are color-coded to represent different features.
❖ The x-axis labels represent the cluster labels.
❖ The y-axis label represents the average value of the features.
❖ The figure shows that the clusters are well-separated in terms of the average values of the features.
❖ For example, cluster 0 has a high average value for the "Avg_Credit_Limit" feature, while cluster 1 has a low average value for the
"Avg_Credit_Limit" feature.
❖ This indicates that the K-means clustering has successfully grouped together data points that are similar in terms of their
features.
❖ The figure also shows that the clusters are not perfectly separated.
❖ For example, there is some overlap between the clusters in terms of the "Avg_Credit_Limit" feature.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the figure shows that the K-means clustering has successfully identified some of the underlying structure in the data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.

Creating a bar plot of the mean values of hc_df grouped by a specific column

❖ The figure shows the average values of each feature for each cluster in the hierarchical clustering.
❖ The bars are color-coded to represent different features.
❖ The x-axis labels represent the cluster labels.
❖ The y-axis label represents the average value of the features.
❖ The figure shows that the clusters are well-separated in terms of the average values of the features.
❖ For example, cluster 0 has a high average value for the "Avg_Credit_Limit" feature, while cluster 1 has a low average value for the
"Avg_Credit_Limit" feature.
❖ This indicates that the hierarchical clustering has successfully grouped together data points that are similar in terms of their
features.
❖ The figure also shows that the clusters are not perfectly separated.
❖ For example, there is some overlap between the clusters in terms of the "Avg_Credit_Limit" feature.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the figure shows that the hierarchical clustering has successfully identified some of the underlying structure in the data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.

Let's create some plots on the original data to understand the customer distribution among the clusters.
❖ The boxplot shows the distribution of each numerical variable for each cluster obtained using K-means Clustering.
❖ The boxplot shows the median, quartiles, and outliers for each variable.
❖ The boxplot shows that the clusters are well-separated in terms of the distribution of the numerical variables.
❖ For example, cluster 0 has a higher median value for the "Avg_Credit_Limit" variable than cluster 1.
❖ This indicates that the K-means clustering has successfully grouped together data points that are similar in terms of their
numerical variables.
❖ The boxplot also shows that there is some overlap between the clusters in terms of the distribution of the numerical variables.
❖ For example, there are some data points in cluster 0 that have a lower value for the "Avg_Credit_Limit" variable than some data
points in cluster 1.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the boxplot shows that the K-means clustering has successfully identified some of the underlying structure in the data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.
❖ The boxplot shows the distribution of each numerical variable for each cluster obtained using Hierarchical Clustering.
❖ The boxplot shows the median, quartiles, and outliers for each variable.
❖ The boxplot shows that the clusters are well-separated in terms of the distribution of the numerical variables.
❖ For example, cluster 0 has a higher median value for the "Avg_Credit_Limit" variable than cluster 1.
❖ This indicates that the Hierarchical Clustering has successfully grouped together data points that are similar in terms of their
numerical variables.
❖ The boxplot also shows that there is some overlap between the clusters in terms of the distribution of the numerical variables.
❖ For example, there are some data points in cluster 0 that have a lower value for the "Avg_Credit_Limit" variable than some data
points in cluster 1.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the boxplot shows that the Hierarchical Clustering has successfully identified some of the underlying structure in the
data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.

Actionable Insights and Recommendations

❖ Based on the cluster profiles, we can identify the following actionable insights:

❖ Cluster 0: This cluster represents customers with high average credit limits.
Recommendation: This cluster could be targeted with personalized offers for high-value products and services. The bank could
consider offering these customers higher credit limits or lower interest rates.

❖ Cluster 1: This cluster represents customers with low average credit limits.
Recommendation: This cluster could be targeted with offers for basic banking products and services. The bank could consider
offering these customers lower fees or higher interest rates on savings accounts.

❖ Cluster 2: This cluster represents customers with average credit limits.
Recommendation: This cluster could be targeted with offers for a variety of banking products and services. The bank could
consider offering these customers personalized offers based on their individual needs and preferences.

❖ Overall, the cluster analysis provides valuable insights into the different customer segments within the bank's customer base.
❖ This information can be used to develop more targeted and effective marketing campaigns and product offerings.
❖ In addition to the above insights, the bank could also consider the following recommendations:

❖ Use the cluster labels to create new customer segments in the bank's CRM system.
❖ Track the performance of marketing campaigns and product offerings by cluster segment.
❖ Use the cluster labels to develop personalized marketing messages and offers for each customer segment.
❖ Regularly review the cluster profiles and insights to ensure that they are still accurate and relevant.

No ratings yet
Unit-5 Computer Vision
3 pages
Banking Analysis
No ratings yet
Banking Analysis
2 pages
Capstone Project
No ratings yet
Capstone Project
33 pages
NLA Lecture Notes
No ratings yet
NLA Lecture Notes
86 pages
Email Classification Using Machine Learning
No ratings yet
Email Classification Using Machine Learning
22 pages
Business Report
No ratings yet
Business Report
18 pages
Visualization
No ratings yet
Visualization
24 pages
BA 502 (1) Introduction To Statistics and Statistical Inference
No ratings yet
BA 502 (1) Introduction To Statistics and Statistical Inference
34 pages
Brochure - Global Wi-Fi Market - Global Forecast To 2020
No ratings yet
Brochure - Global Wi-Fi Market - Global Forecast To 2020
24 pages
Probability & Statistics - Workbook.solutions
No ratings yet
Probability & Statistics - Workbook.solutions
471 pages
Workbook - Hypothesis Testing - Solutions
No ratings yet
Workbook - Hypothesis Testing - Solutions
91 pages
Workbook - Discrete Random Variables
No ratings yet
Workbook - Discrete Random Variables
19 pages
Workbook Regression
No ratings yet
Workbook Regression
18 pages
Marketing Engineering and Analytics
No ratings yet
Marketing Engineering and Analytics
52 pages
Workbook - Hypothesis Testing
No ratings yet
Workbook - Hypothesis Testing
26 pages
Car Insurance Insights Summary Presentation
No ratings yet
Car Insurance Insights Summary Presentation
10 pages
Text Book
No ratings yet
Text Book
2 pages
10 Hypothesis Testing For The Difference of Proportions
No ratings yet
10 Hypothesis Testing For The Difference of Proportions
9 pages
AllLife Bank Customer Segmentation Unsupervised Learning-Coded-Project-Business-Report
No ratings yet
AllLife Bank Customer Segmentation Unsupervised Learning-Coded-Project-Business-Report
10 pages
Module 6
No ratings yet
Module 6
11 pages
Probability & Statistics - Workbook
No ratings yet
Probability & Statistics - Workbook
163 pages
03 Coefficient of Determination and RMSE
No ratings yet
03 Coefficient of Determination and RMSE
7 pages
P3 - III - I B.tech Revaluation Results NOV-2024
No ratings yet
P3 - III - I B.tech Revaluation Results NOV-2024
5 pages
Python Seaborn Tutorial For Beginners v2
No ratings yet
Python Seaborn Tutorial For Beginners v2
40 pages
Chat GPT
No ratings yet
Chat GPT
4 pages
02 Significance Level and Type I and II Errors
No ratings yet
02 Significance Level and Type I and II Errors
8 pages
09 Lineplot
No ratings yet
09 Lineplot
21 pages
Probability & Statistics - Final Exam - Solutions
No ratings yet
Probability & Statistics - Final Exam - Solutions
16 pages
Probability & Statistics - Final Exam
No ratings yet
Probability & Statistics - Final Exam
9 pages
3 Outliers Iqr
No ratings yet
3 Outliers Iqr
3 pages
Probability & Statistics - Final Exam - Practice 1
No ratings yet
Probability & Statistics - Final Exam - Practice 1
9 pages
LoRA Retains More
No ratings yet
LoRA Retains More
3 pages
Fds QB
No ratings yet
Fds QB
6 pages
ML Report
No ratings yet
ML Report
12 pages
Data Mining and Predictive Modelling: Lecture 2: Functionalities, KDD Process, Data Attributes and Properties
No ratings yet
Data Mining and Predictive Modelling: Lecture 2: Functionalities, KDD Process, Data Attributes and Properties
11 pages
Pracal Labexamsamplequestions
No ratings yet
Pracal Labexamsamplequestions
35 pages
01 Mean, Variance, and Standard Deviation
No ratings yet
01 Mean, Variance, and Standard Deviation
10 pages
02 Measures of Spread
No ratings yet
02 Measures of Spread
6 pages
10 Building Histograms From Data Sets
No ratings yet
10 Building Histograms From Data Sets
7 pages
03 Symmetric and Skewed Distributions and Outliers
No ratings yet
03 Symmetric and Skewed Distributions and Outliers
6 pages
04 Box and Whisker Plots
No ratings yet
04 Box and Whisker Plots
6 pages
02 Frequency Histograms and Polygons, and Density Curves
No ratings yet
02 Frequency Histograms and Polygons, and Density Curves
6 pages
01 Measures of Central Tendency
No ratings yet
01 Measures of Central Tendency
6 pages
07 Relative Frequency Tables
No ratings yet
07 Relative Frequency Tables
6 pages
09 Histograms and Stem-And-leaf Plots
No ratings yet
09 Histograms and Stem-And-leaf Plots
6 pages
08 Joint Distributions
No ratings yet
08 Joint Distributions
6 pages
DS&ML 4
No ratings yet
DS&ML 4
9 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Jahnavijillella ML1 30 06 2024 PDF

Uploaded by

Jahnavijillella ML1 30 06 2024 PDF

Uploaded by

Machine Learning-1 Project

Overview of the data set

Displaying a few rows of the dataset

Creating a Copy of the Original Data

Checking datatypes and number of non-null values for each column:

Checking the missing values

There are 6 missing values in the data.

Checking the number of unique values in each column

Checking for duplicate values

Exploratory Data Analysis

Creating a heatmap to visualize the correlation matrix

Creating a pair plot with a kernel density estimate on the diagonal

Let's visualize the modes of contacting the bank in a 3D plot.

• Let's find outliers in the data using z-score with a threshold of 3.
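The z-score rule above can be sketched as follows. This is a minimal illustration on synthetic data, not the report's own code: the helper name `zscore_outlier_mask`, the sample array, and the planted extreme value are all assumptions made for the example. A row is flagged when any of its columns lies more than 3 standard deviations from that column's mean.

```python
import numpy as np


def zscore_outlier_mask(data, threshold=3.0):
    """Return a boolean mask marking rows where any column has |z-score| > threshold.

    Hypothetical helper for illustration; the report's actual code is not shown.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)  # population standard deviation (ddof=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    z = (data - mean) / std
    return (np.abs(z) > threshold).any(axis=1)


# Synthetic stand-in for the customer data (assumption, not the real dataset).
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=5.0, size=(200, 2))
sample[0] = [500.0, 50.0]  # plant one obviously extreme row

mask = zscore_outlier_mask(sample, threshold=3.0)
outlier_rows = np.flatnonzero(mask)  # indices of the flagged rows
print(outlier_rows)
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule can under-flag outliers in heavily skewed columns, so the flagged rows are best treated as candidates for review rather than automatic removals.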

The following are the outliers in the data:

The pair plot shows that the variables are now on a similar scale.

Scaling the data before clustering

Checking Elbow Plot

Checking Silhouette Scores

Fitting and predicting with the KMeans model

Let's check the silhouette score

The optimal number of clusters according to the silhouette score is: 3
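One way to arrive at a silhouette-based choice of k is sketched below. It assumes scikit-learn's `StandardScaler`, `KMeans`, and `silhouette_score`, and it runs on synthetic three-blob data rather than the bank's dataset; the candidate range of 2 to 6 clusters and the random seeds are illustrative assumptions, not values taken from the report.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three well-separated groups (assumption for illustration).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(60, 2)) for c in centers])

# Standardize so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(points)

# Fit K-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

# The candidate with the highest mean silhouette score is taken as the best k.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score is usually read alongside the elbow plot: when the two disagree, the cluster profiles themselves help decide which k is more useful for the business question.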

Creating final model

Hierarchical Clustering model with specified parameters
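A hierarchical model of this kind could be fitted as in the sketch below. The report's actual parameters are not shown here, so the choice of three clusters and Ward linkage is an assumption made only for illustration, as is the synthetic data. The example uses scikit-learn's `AgglomerativeClustering`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three well-separated groups of 60 points each (assumption).
rng = np.random.default_rng(7)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(60, 2)) for c in centers])

scaled = StandardScaler().fit_transform(points)

# Assumed parameters: 3 clusters, Ward linkage (which uses Euclidean distance).
hier_model = AgglomerativeClustering(n_clusters=3, linkage="ward")
hier_labels = hier_model.fit_predict(scaled)

# Cluster sizes, useful as a first sanity check before profiling.
cluster_sizes = np.bincount(hier_labels)
print(cluster_sizes)
```

The resulting label array can then be attached as a new column to both the original and the scaled data, which is the step the outline describes next.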

❖ Creating a copy of the original data

❖ Adding hierarchical cluster labels to the original and scaled dataframes

❖ Displaying the first five rows of the dataframe

Cluster Profiling: K-means Clustering

Cluster Profiling: Hierarchical Clustering

K-means vs Hierarchical Clustering

Actionable Insights and Recommendations
