Jahnavijillella ML1 30 06 2024 PDF

The report outlines a project by All Life Bank to enhance its credit card customer base through personalized marketing and improved service delivery. The objective is to segment customers based on spending patterns and interactions using clustering algorithms, with data analysis revealing insights into customer behavior and preferences. The findings suggest three distinct customer clusters based on their modes of contacting the bank, with recommendations for targeted marketing strategies and service improvements.

Uploaded by

kart238

Machine Learning-1 Project

Guided Report
Context

All Life Bank wants to focus on its credit card customer base in the next financial year. Its marketing research team has advised
that market penetration can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target
new customers as well as to upsell to existing customers. Another insight from the market research was that customers perceive
the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that
customer queries are resolved faster. The Head of Marketing and the Head of Delivery both decide to reach out to the Data Science
team for help.
Objective

To identify different segments in the existing customer base, based on their spending patterns as well as past interactions with
the bank, using clustering algorithms, and to provide recommendations to the bank on how to better market to and service these
customers.

Data Description

The data provided is of various customers of a bank and their financial attributes like credit limit, the total number of credit
cards the customer has, and different channels through which customers have contacted the bank for any queries (including
visiting the bank, online and through a call center).

Data Dictionary.
• Sl_No: Primary key of the records
• Customer Key: Customer identification number
• Average Credit Limit: Average credit limit of each customer for all credit cards
• Total credit cards: Total number of credit cards possessed by the customer
• Total visits bank: Total number of visits that customer made (yearly) personally to the bank
• Total visits online: Total number of visits or online logins made by the customer (yearly)
• Total calls made: Total number of calls made by the customer to the bank or its customer service department (yearly)

Overview of the data set


❖ The initial step to get an overview of any dataset is to observe the first few rows, to check whether the dataset has
been loaded properly
❖ Get information about the number of rows and columns in the dataset
❖ Find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as
expected
❖ Check the statistical summary of the dataset to get an overview of the numerical columns of the data
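The overview steps listed above could be sketched as follows. This is a minimal illustration, not the report's original code: the rows are hypothetical stand-ins (in practice the data would be loaded with something like `pd.read_csv` from a file whose name is not given here), and the column names are assumed from the data dictionary and the feature names used later in the report.

```python
import pandas as pd

# Hypothetical stand-in rows; the real file (660 rows, 7 columns) is not reproduced here.
df = pd.DataFrame({
    "Sl_No": [1, 2, 3, 4, 5],
    "Customer Key": [1001, 1002, 1003, 1004, 1005],
    "Avg_Credit_Limit": [100000, 50000, 50000, 30000, 100000],
    "Total_Credit_Cards": [2, 3, 7, 5, 6],
    "Total_visits_bank": [1, 0, 1, 1, 0],
    "Total_visits_online": [1, 10, 3, 1, 12],
    "Total_calls_made": [0, 9, 4, 4, 3],
})

print(df.head())                 # first five rows: was the data loaded properly?
n_rows, n_cols = df.shape        # number of rows and columns
print(f"{n_rows} rows, {n_cols} columns")
df.info()                        # data types and non-null counts per column
summary = df.describe().T        # statistical summary of the numerical columns
print(summary)
```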
Checking the shape of the dataset
The dataset has 660 rows and 7 columns

Displaying few rows of the dataset


The first 5 rows of the dataset:

Creating a Copy of the Original Data


copying the data to another variable to avoid any changes to original data:
fixing column names:
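A minimal sketch of these two steps, assuming the raw column names contain spaces (for example "Customer Key") that are replaced with underscores. The frame `raw` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical raw frame with a space in one column name.
raw = pd.DataFrame({
    "Sl_No": [1, 2],
    "Customer Key": [1001, 1002],
    "Avg_Credit_Limit": [100000, 50000],
})

# Work on a copy so the original data stays untouched.
data = raw.copy()

# Replace spaces with underscores so columns are easier to reference.
data.columns = [col.strip().replace(" ", "_") for col in data.columns]
print(list(data.columns))
```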
Checking the data types of the columns for the dataset

Checking datatypes and number of non-null values for each column:


All the columns in the data are integers.

Checking the missing values

There are 6 missing values in the data.

Checking the number of unique values in each column


• There are fewer unique values in the Customer_Key column than the number of observations in the data. This means
that there are duplicate values in the column.

Checking for duplicate values


Let's look at the duplicate values in the Customer_Key column closely.
We will drop the Sl_No and Customer_Key as they do not add any value to the analysis.
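The missing-value, uniqueness and duplicate checks described above, followed by dropping the identifier columns, might look like the sketch below. The rows are hypothetical and only illustrate the pandas calls; they do not reproduce the counts reported for the real data.

```python
import pandas as pd

# Hypothetical rows; Customer_Key 1001 appears twice on purpose.
data = pd.DataFrame({
    "Sl_No": [1, 2, 3, 4],
    "Customer_Key": [1001, 1002, 1001, 1003],
    "Avg_Credit_Limit": [100000, 50000, 120000, 30000],
    "Total_Credit_Cards": [2, 3, 2, 5],
})

missing_per_column = data.isnull().sum()   # count of missing values per column
unique_per_column = data.nunique()         # count of distinct values per column

# Rows whose Customer_Key occurs more than once, shown side by side.
dup_mask = data["Customer_Key"].duplicated(keep=False)
duplicate_rows = data[dup_mask].sort_values("Customer_Key")
print(duplicate_rows)

# Identifier columns carry no behavioural information, so drop them.
features = data.drop(columns=["Sl_No", "Customer_Key"])
```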
Statistical summary of the dataset

Exploratory Data Analysis


The functions can be used to explore the distribution of individual features in the dataset and gain insights into the data.
Univariate analysis

The boxplot shows that the majority of customers have an average credit limit between 0 and 200,000. The histogram shows that the
distribution of average credit limits is skewed to the right, with a few customers having very high credit limits.
The boxplot shows that most customers have between 1 and 4 credit cards. The histogram shows that the distribution of total credit
cards is roughly normal, with a few customers having a large number of credit cards.

The boxplot shows that most customers visit the bank between 0 and 5 times per year. The histogram shows that the distribution of total
visits to the bank is skewed to the right, with a few customers visiting the bank very frequently.
The boxplot shows that most customers visit the bank's website between 0 and 10 times per year. The histogram shows that the
distribution of total visits to the bank's website is skewed to the right, with a few customers visiting the website very frequently.
The boxplot shows that most customers call the bank between 0 and 5 times per year. The histogram shows that the distribution of total
calls made to the bank is skewed to the right, with a few customers calling the bank very frequently.
Calling the functions specified above to plot the bar plots
**Observations**

❖ The barplots show the percentage of customers at each level of the categorical features.
❖ For example, the first barplot shows that 30% of customers have 1 credit card, 25% have 2 credit cards, and so on.
❖ The barplots also show that the majority of customers have 1 or 2 credit cards, and that a small percentage of customers
have 5 or more credit cards.
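The percentages that such a bar plot displays can be computed directly, as in this small sketch. The values in `cards` are hypothetical and are not the report's data.

```python
import pandas as pd

# Hypothetical number of credit cards for ten customers.
cards = pd.Series([1, 1, 2, 2, 2, 3, 4, 4, 5, 6], name="Total_Credit_Cards")

# Share of customers at each level, as a percentage, ordered by level.
pct = cards.value_counts(normalize=True).sort_index() * 100
print(pct.round(1))
```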
Creating a subplot grid of CDF (Cumulative Distribution Function) plots for numerical variables
❖ The figure shows the CDF (Cumulative Distribution Function) plots for the numerical variables in the dataset.
❖ Each plot shows the cumulative probability of a data point falling below a certain value.
❖ For example, the first plot shows that about 30% of customers have an average credit limit below 50,000, about 60%
have an average credit limit below 100,000, and so on.
❖ The plots can be used to compare the distributions of different variables and to identify outliers.
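The values behind a CDF plot can be obtained with an empirical CDF, sketched here with NumPy. The helper `ecdf` and the sample values are hypothetical illustrations.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the cumulative share of points at or below each."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical credit limits for five customers.
limits = [20000, 50000, 50000, 100000, 200000]
x, y = ecdf(limits)

# Share of customers with a credit limit of at most 50,000.
share_at_most_50k = np.mean(np.asarray(limits) <= 50000)
print(share_at_most_50k)
```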

**Observations**

❖ The distribution of average credit limit is skewed to the right, with a few customers having very high credit limits.
❖ The distribution of total credit cards is roughly normal, with a few customers having a large number of credit cards.
❖ The distribution of total visits to the bank is skewed to the right, with a few customers visiting the bank very frequently.
❖ The distribution of total visits to the bank's website is skewed to the right, with a few customers visiting the website very
frequently.
❖ The distribution of total calls made to the bank is skewed to the right, with a few customers calling the bank very
frequently.

Bivariate Analysis

Creating a heatmap to visualize the correlation matrix


❖ The heatmap shows the correlation coefficients between all pairs of numerical variables in the dataset.
❖ The values in the heatmap range from -1 to 1, where 1 indicates a perfect positive correlation, -1 indicates a perfect negative
correlation, and 0 indicates no correlation.
❖ The diagonal elements of the heatmap are all 1 because the correlation between a variable and itself is always 1.
❖ The heatmap shows that there are some strong positive correlations between some of the variables in the dataset.
❖ For example, there is a strong positive correlation between average credit limit and total credit cards, and between total visits to
the bank and total visits to the bank's website.
❖ There is also a strong negative correlation between average credit limit and total calls made to the bank.
❖ This means that customers with higher average credit limits tend to have more credit cards and to call the bank less
frequently.
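The matrix behind such a heatmap is the pairwise Pearson correlation, which pandas computes with `DataFrame.corr()`. The sketch below uses hypothetical values chosen so that the signs are easy to verify; passing `corr` to a plotting call such as seaborn's `heatmap` would then draw the figure.

```python
import pandas as pd

# Hypothetical values: cards rise with the limit, calls fall with it.
data = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 20000, 30000, 40000, 50000],
    "Total_Credit_Cards": [1, 2, 3, 4, 5],
    "Total_calls_made": [9, 7, 6, 3, 1],
})

corr = data.corr()  # Pearson correlation, values between -1 and 1
print(corr.round(2))
```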
Observations

❖ The heatmap shows that there are some strong correlations between some of the variables in the dataset.
❖ These correlations can be used to identify relationships between variables and to make predictions about one variable based on
another variable.
❖ For example, the strong positive correlation between average credit limit and total credit cards suggests that customers with
higher average credit limits are more likely to have more credit cards.
❖ The strong negative correlation between average credit limit and total calls made to the bank suggests that customers with higher
average credit limits are less likely to call the bank.

Creating a pair plot with a kernel density estimate on the diagonal

❖ The pairplot shows the relationships between all pairs of variables in the dataset.
❖ Each plot is a scatter plot with the two variables on the x and y axes.
❖ The diagonal plots are kernel density estimates of the individual variables.
❖ The plots can be used to identify relationships between variables, such as positive correlations, negative correlations, and
outliers.
❖ There is a positive correlation between average credit limit and total credit cards.
❖ There is a negative correlation between average credit limit and total visits to the bank.
❖ There is a positive correlation between total credit cards and total visits to the bank.
❖ There is a positive correlation between total visits to the bank and total calls made to the bank.
❖ There is a positive correlation between total visits to the bank's website and total calls made to the bank.

The pairplot also shows that there are a few outliers in the data. For example, there are a few customers with very high average credit
limits and a few customers who visit the bank or the bank's website very frequently.

We can add a hue and see if we can see some clustered distributions.

Creating a pair plot using seaborn to visualize relationships between selected features

❖ The figure shows a pairplot with a hue for the "Total_Credit_Cards" feature.
❖ This means that the plots are colored by the number of credit cards that each customer has.
❖ The pairplot shows the relationships between all pairs of variables in the dataset, colored by the number of credit cards.
❖ Each plot is a scatter plot with the two variables on the x and y axes.
❖ The diagonal plots are histograms of the individual variables, colored by the number of credit cards.
❖ The plots can be used to identify relationships between variables, such as positive correlations, negative correlations, and
outliers.
In this case, the pairplot shows that:

❖ There is a positive correlation between total visits to the bank and total calls made to the bank.
❖ There is a positive correlation between total visits to the bank's website and total calls made to the bank.
❖ Customers with more credit cards tend to visit the bank more often and make more calls to the bank.
❖ The pairplot also shows that there are a few outliers in the data.
❖ For example, there are a few customers with a high number of credit cards who visit the bank or the bank's website very
frequently.

Let's visualize the modes of contacting the bank in a 3D plot.


❖ The figure shows a 3D scatter plot of the "Total_visits_bank", "Total_visits_online", and "Total_calls_made" features. Each point
in the plot represents a customer.
❖ The plot shows that there are three main modes of contacting the bank: visiting the bank in person, visiting the bank's website,
and calling the bank on the phone.
❖ Most customers use only one of these modes of contact.
❖ However, there are a few customers who use two or even three modes of contact.
❖ The plot also shows that there are a few outliers in the data.
❖ For example, there are a few customers who visit the bank or the bank's website very frequently, and a few customers who call
the bank very frequently.

We can observe three segments of the customers by their preferred mode of contacting the bank.
Data Preprocessing

Outlier Detection

• Let's find outliers in the data using z-score with a threshold of 3.

The following are the outliers in the data:

Avg_Credit_Limit : [153000, 155000, 156000, 156000, 157000, 158000, 163000, 163000, 166000, 166000, 167000, 171000, 172000,
172000, 173000, 176000, 178000, 183000, 184000, 186000, 187000, 195000, 195000, 200000]

Total_Credit_Cards : []

Total_visits_bank : []

Total_visits_online : [12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]

Total_calls_made : []
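A z-score check with a threshold of 3 can be sketched as below. The helper `zscore_outliers` and the sample series are hypothetical; the population standard deviation (`ddof=0`) is assumed here, which may differ from the original notebook's choice.

```python
import numpy as np
import pandas as pd

def zscore_outliers(series, threshold=3.0):
    """Return the sorted values whose absolute z-score exceeds the threshold."""
    z = (series - series.mean()) / series.std(ddof=0)
    return sorted(series[np.abs(z) > threshold].tolist())

# Hypothetical column: twenty typical values and one extreme value.
values = pd.Series([10] * 20 + [100])
print(zscore_outliers(values))
```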
Scaling
• Let's scale the data before we proceed with clustering.
• Creating a pair plot with a kernel density estimate on the diagonal

❖ The figure shows a pairplot of the data after scaling.
❖ Each plot is a scatter plot with the two variables on the x and y axes.
❖ The diagonal plots are kernel density estimates of the individual variables.
❖ The plots can be used to identify relationships between variables, such as positive correlations, negative correlations, and
outliers.

In this case, the pairplot shows that:

❖ The variables are now on a similar scale.
❖ There is a positive correlation between average credit limit and total credit cards.
❖ There is a negative correlation between average credit limit and total visits to the bank.
❖ There is a positive correlation between total credit cards and total visits to the bank.
❖ There is a positive correlation between total visits to the bank and total calls made to the bank.
❖ There is a positive correlation between total visits to the bank's website and total calls made to the bank.

The pairplot also shows that there are a few outliers in the data. For example, there are a few customers with very high average credit
limits and a few customers who visit the bank or the bank's website very frequently.

scaling the data before clustering


creating a dataframe of the scaled data
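One common way to scale before clustering is standardisation with scikit-learn's `StandardScaler`, sketched here. Whether the original notebook used this scaler is an assumption, and the rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
data = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 50000, 100000, 200000],
    "Total_calls_made": [0, 2, 4, 10],
})

# Standardise each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(data),
    columns=data.columns,
    index=data.index,
)
print(scaled.round(2))
```

Without scaling, the credit limit (tens of thousands) would dominate the Euclidean distances used by the clustering algorithms.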
K-means Clustering

❖ The figure shows a scatter plot of the "Total_visits_bank" and "Total_visits_online" features, colored by the cluster labels.
❖ Each point in the plot represents a customer.
❖ The plot shows that the three clusters are:
❖ Cluster 0: Customers who visit the bank in person frequently but do not visit the bank's website or call the bank on the phone very
often.
❖ Cluster 1: Customers who visit the bank's website frequently but do not visit the bank in person or call the bank on the phone very
often.
❖ Cluster 2: Customers who call the bank on the phone frequently but do not visit the bank in person or the bank's website very
often.
❖ K-means assigns every customer to exactly one of the three clusters, but a few customers lie far from the centre of their
cluster.
❖ These customers are outliers in the data.

Checking Elbow Plot


❖ The elbow plot shows the within-cluster sum of squares (WSS) for different numbers of clusters.
❖ The WSS measures how compact the clusters are, with lower values indicating tighter clusters.
❖ The WSS always decreases as the number of clusters increases, because the data points are divided into smaller and smaller
groups that lie closer to their centroids.
❖ A lower WSS alone therefore does not mean a better clustering: with one cluster per data point the WSS would be zero.
❖ The elbow plot suggests that the optimal number of clusters is 3, because this is the point after which the WSS decreases much
more slowly.
❖ Choosing a larger number of clusters would split natural groups into small fragments that differ little from each other and are
harder to interpret.
❖ Choosing a smaller number of clusters would merge distinct groups, so the data points within a cluster would differ too much
from each other.
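The elbow check can be sketched by recording the K-means inertia (the WSS) for a range of k. The data below are three synthetic, well-separated blobs used purely for illustration, and the range of k and the `random_state` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Within-cluster sum of squares for k = 1..6.
wss = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss[k] = model.inertia_

# Reduction in WSS gained by going from k-1 to k clusters.
drops = {k: wss[k - 1] - wss[k] for k in range(2, 7)}
for k in drops:
    print(k, round(wss[k], 1), round(drops[k], 1))
```

The elbow is the k after which the reduction becomes small compared with the earlier reductions.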

Checking Silhouette Scores


The key with the maximum value in the dictionary ss
The optimal number of clusters according to the silhouette score is: 3

❖ The figure shows the silhouette plot for different numbers of clusters.
❖ The plot shows the average silhouette score over all data points for each number of clusters.
❖ The silhouette score measures how well each data point fits into its own cluster compared with the nearest other cluster, with
values ranging from -1 to 1.
❖ A score close to 1 indicates that the data point is well-clustered, a score close to 0 indicates that it lies on the border between
two clusters, and a score close to -1 indicates that it has probably been assigned to the wrong cluster.
❖ The plot shows that the average silhouette score is highest for 3 clusters.
❖ This indicates that 3 is the optimal number of clusters for the data.

❖ The silhouette scores for 2 and 4 clusters are lower than the score for 3 clusters.
❖ This indicates that 2 and 4 are weaker choices for the number of clusters.
❖ The optimal number of clusters is the value at which the average silhouette score is highest.
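Selecting k by silhouette score can be sketched as follows, storing the scores in a dictionary named `ss` as the text does and taking the key with the maximum value. The data are synthetic blobs and the candidate range of k is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Average silhouette score for each candidate number of clusters.
ss = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    ss[k] = silhouette_score(X, labels)

# The key with the maximum value in the dictionary ss.
best_k = max(ss, key=ss.get)
print("The optimal number of clusters according to the silhouette score is:", best_k)
```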

Fitting and predicting with the KMeans model


Adding kmeans cluster labels to the original and scaled dataframes
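Fitting the final K-means model, attaching the labels and building a cluster profile might look like this sketch. The column name `KM_Clusters`, the `random_state` and the synthetic customers are hypothetical choices, not taken from the report.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customers drawn around three hypothetical (limit, online visits) centres.
rng = np.random.default_rng(0)
groups = [(12000, 1), (35000, 5), (140000, 11)]
data = pd.DataFrame(
    np.vstack([rng.normal(loc=g, scale=(1500, 0.3), size=(50, 2)) for g in groups]),
    columns=["Avg_Credit_Limit", "Total_visits_online"],
)
scaled = pd.DataFrame(StandardScaler().fit_transform(data), columns=data.columns)

# Fit on the scaled features and predict a cluster label for every customer.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1)
labels = kmeans.fit_predict(scaled)

# Add the labels to both the original and the scaled dataframes.
data["KM_Clusters"] = labels
scaled["KM_Clusters"] = labels

# Cluster profile: mean of each original feature plus the cluster size.
profile = data.groupby("KM_Clusters").mean()
profile["count"] = data["KM_Clusters"].value_counts()
print(profile.round(1))
```

Profiling on the original (unscaled) values keeps the cluster means interpretable in business units.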

Cluster Profiling
Hierarchical Clustering

❖ The dendrogram shows the hierarchical clustering of the data points in the dataframe.
❖ Each data point is represented by a leaf node at the bottom of the dendrogram.
❖ The height at which two branches merge represents the distance between the clusters they join.
❖ The dendrogram shows that the data points are divided into three main clusters.
❖ Each cluster corresponds to one of the three branches obtained by cutting the dendrogram with a horizontal line below the last
two merges.
❖ The dendrogram also shows that the first and second clusters are more similar to each other than they are to the third cluster.
❖ This is because the first and second clusters merge at a lower height than the height at which the third cluster joins them.
❖ The dendrogram can be used to choose the number of clusters for the data.
❖ A common heuristic is to cut the dendrogram across the tallest vertical stretch that no merge crosses; the number of branches
that the cut intersects is the number of clusters.
❖ In this case, the optimal number of clusters is 3.
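The linkage matrix behind a dendrogram, and the effect of cutting it into three clusters, can be sketched with SciPy. Average linkage with Euclidean distance is an assumption (the report does not state its linkage settings), and the data are synthetic blobs.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

# Synthetic stand-in data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Each row of Z records one merge: the two clusters joined, the merge height, the new size.
Z = linkage(X, method="average", metric="euclidean")

# Cophenetic correlation: how faithfully the tree preserves the pairwise distances.
coph_corr, _ = cophenet(Z, pdist(X))

# Cut the tree so that at most three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")

# The last two merges join the three main clusters, so they sit much higher in the tree.
print(round(coph_corr, 3), Z[-3:, 2].round(2))
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree itself.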

Let's check silhouette score

The optimal number of clusters according to the silhouette score is: 3


❖ The figure shows the silhouette plot for different numbers of clusters using hierarchical clustering.
❖ The plot shows the average silhouette score over all data points for each number of clusters.
❖ The silhouette score measures how well each data point fits into its own cluster compared with the nearest other cluster, with
values ranging from -1 to 1.
❖ A score close to 1 indicates that the data point is well-clustered, a score close to 0 indicates that it lies on the border between
two clusters, and a score close to -1 indicates that it has probably been assigned to the wrong cluster.
❖ The plot shows that the average silhouette score is highest for 3 clusters.
❖ This indicates that 3 is the optimal number of clusters for the data using hierarchical clustering.
❖ The silhouette scores for 2 and 4 clusters are lower than the score for 3 clusters.
❖ This indicates that 2 and 4 are weaker choices for the number of clusters.
❖ The optimal number of clusters is the value at which the average silhouette score is highest.
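The same silhouette comparison can be run with agglomerative clustering, as in this sketch. The `average` linkage, the candidate range of k and the synthetic blobs are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Average silhouette score for each candidate number of clusters.
hc_scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(X)
    hc_scores[k] = silhouette_score(X, labels)

best_k = max(hc_scores, key=hc_scores.get)
print("The optimal number of clusters according to the silhouette score is:", best_k)
```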

Creating final model

Hierarchical Clustering model with specified parameters

❖ Creating a copy of the original data

❖ Adding hierarchical cluster labels to the original and scaled dataframes

❖ Displaying the first five rows of the dataframe


❖ Displaying the first five rows of the dataframe2
❖ Assigning the hierarchical clustering labels to the 'HC_Clusters' column of the df DataFrame.
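The steps in this list could be sketched as below. The names `df`, `hc_df` and `HC_Clusters` follow the text; the linkage setting stands in for the unspecified "specified parameters" and is an assumption, as are the synthetic customers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Synthetic customers drawn around three hypothetical (limit, online visits) centres.
rng = np.random.default_rng(0)
groups = [(12000, 1), (35000, 5), (140000, 11)]
data = pd.DataFrame(
    np.vstack([rng.normal(loc=g, scale=(1500, 0.3), size=(50, 2)) for g in groups]),
    columns=["Avg_Credit_Limit", "Total_visits_online"],
)
scaled = pd.DataFrame(StandardScaler().fit_transform(data), columns=data.columns)

# Final hierarchical model (the linkage choice here is an assumption).
hc = AgglomerativeClustering(n_clusters=3, linkage="average")
hc_labels = hc.fit_predict(scaled)

# Copies of the original and scaled data, each with the labels attached.
df = data.copy()
df["HC_Clusters"] = hc_labels
hc_df = scaled.copy()
hc_df["HC_Clusters"] = hc_labels

print(df.head())
print(hc_df.head())
```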
Cluster Profiling and Comparison

Cluster Profiling: K-means Clustering

Cluster Profiling: Hierarchical Clustering

K-means vs Hierarchical Clustering


This line highlights the maximum value in each column of K-means DataFrame with a light green background.

This line highlights the maximum value in each column of Hierarchical DataFrame with a light green background.
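A side-by-side comparison of the two labelings can be sketched with per-cluster means and a cross-tabulation. The label column names, the model settings and the synthetic customers are hypothetical; in a notebook, the highlighting described above corresponds to a call such as `profile.style.highlight_max(color="lightgreen", axis=0)`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customers drawn around three hypothetical (limit, online visits) centres.
rng = np.random.default_rng(0)
groups = [(12000, 1), (35000, 5), (140000, 11)]
data = pd.DataFrame(
    np.vstack([rng.normal(loc=g, scale=(1500, 0.3), size=(50, 2)) for g in groups]),
    columns=["Avg_Credit_Limit", "Total_visits_online"],
)
X = StandardScaler().fit_transform(data)

# Label every customer with both algorithms.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

# Cluster profiles on the original scale, one table per algorithm.
km_profile = data.groupby(km_labels).mean()
hc_profile = data.groupby(hc_labels).mean()

# Cross-tabulation: how customers in each K-means cluster are spread over the HC clusters.
ct = pd.crosstab(
    pd.Series(km_labels, name="KM_Clusters"),
    pd.Series(hc_labels, name="HC_Clusters"),
)
print(km_profile.round(1), hc_profile.round(1), ct, sep="\n\n")
```

Cluster numbers are arbitrary in both algorithms, so the cross-tabulation is a safer way to match clusters than comparing label values directly.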
Creating a bar plot of mean values in k_means_df grouped by a specified column

❖ The figure shows the average values of each feature for each cluster in the K-means clustering.
❖ The bars are color-coded to represent different features.
❖ The x-axis labels represent the cluster labels.
❖ The y-axis label represents the average value of the features.
❖ The figure shows that the clusters are well-separated in terms of the average values of the features.
❖ For example, cluster 0 has a high average value for the "Avg_Credit_Limit" feature, while cluster 1 has a low average value for the
"Avg_Credit_Limit" feature.
❖ This indicates that the K-means clustering has successfully grouped together data points that are similar in terms of their
features.
❖ The figure also shows that the clusters are not perfectly separated.
❖ For example, there is some overlap between the clusters in terms of the "Avg_Credit_Limit" feature.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the figure shows that the K-means clustering has successfully identified some of the underlying structure in the data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.

Creating a bar plot of the mean values of hc_df grouped by a specific column

❖ The figure shows the average values of each feature for each cluster in the hierarchical clustering.
❖ The bars are color-coded to represent different features.
❖ The x-axis labels represent the cluster labels.
❖ The y-axis label represents the average value of the features.
❖ The figure shows that the clusters are well-separated in terms of the average values of the features.
❖ For example, cluster 0 has a high average value for the "Avg_Credit_Limit" feature, while cluster 1 has a low average value for the
"Avg_Credit_Limit" feature.
❖ This indicates that the hierarchical clustering has successfully grouped together data points that are similar in terms of their
features.
❖ The figure also shows that the clusters are not perfectly separated.
❖ For example, there is some overlap between the clusters in terms of the "Avg_Credit_Limit" feature.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the figure shows that the hierarchical clustering has successfully identified some of the underlying structure in the data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.

Let's create some plots on the original data to understand the customer distribution among the clusters.
❖ The boxplot shows the distribution of each numerical variable for each cluster obtained using K-means Clustering.
❖ The boxplot shows the median, quartiles, and outliers for each variable.
❖ The boxplot shows that the clusters are well-separated in terms of the distribution of the numerical variables.
❖ For example, cluster 0 has a higher median value for the "Avg_Credit_Limit" variable than cluster 1.
❖ This indicates that the K-means clustering has successfully grouped together data points that are similar in terms of their
numerical variables.
❖ The boxplot also shows that there is some overlap between the clusters in terms of the distribution of the numerical variables.
❖ For example, there are some data points in cluster 0 that have a lower value for the "Avg_Credit_Limit" variable than some data
points in cluster 1.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the boxplot shows that the K-means clustering has successfully identified some of the underlying structure in the data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.
❖ The boxplot shows the distribution of each numerical variable for each cluster obtained using Hierarchical Clustering.
❖ The boxplot shows the median, quartiles, and outliers for each variable.
❖ The boxplot shows that the clusters are well-separated in terms of the distribution of the numerical variables.
❖ For example, cluster 0 has a higher median value for the "Avg_Credit_Limit" variable than cluster 1.
❖ This indicates that the Hierarchical Clustering has successfully grouped together data points that are similar in terms of their
numerical variables.
❖ The boxplot also shows that there is some overlap between the clusters in terms of the distribution of the numerical variables.
❖ For example, there are some data points in cluster 0 that have a lower value for the "Avg_Credit_Limit" variable than some data
points in cluster 1.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the boxplot shows that the Hierarchical Clustering has successfully identified some of the underlying structure in the
data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.

Actionable Insights and Recommendations

❖ Based on the cluster profiles, we can identify the following actionable insights:

❖ Cluster 0: This cluster represents customers with high average credit limits and high average balances.
Recommendation: This cluster could be targeted with personalized offers for high-value products and services. The bank could
consider offering these customers higher credit limits or lower interest rates.

❖ Cluster 1: This cluster represents customers with low average credit limits and low average balances.
Recommendation: This cluster could be targeted with offers for basic banking products and services. The bank could consider
offering these customers lower fees or higher interest rates on savings accounts.

❖ Cluster 2: This cluster represents customers with average credit limits and average balances.
Recommendation: This cluster could be targeted with offers for a variety of banking products and services. The bank could
consider offering these customers personalized offers based on their individual needs and preferences.

❖ Overall, the cluster analysis provides valuable insights into the different customer segments within the bank's customer base.
❖ This information can be used to develop more targeted and effective marketing campaigns and product offerings.
❖ In addition to the above insights, the bank could also consider the following recommendations:

❖ Use the cluster labels to create new customer segments in the bank's CRM system.
❖ Track the performance of marketing campaigns and product offerings by cluster segment.
❖ Use the cluster labels to develop personalized marketing messages and offers for each customer segment.
❖ Regularly review the cluster profiles and insights to ensure that they are still accurate and relevant.
