Jahnavijillella ML1 30 06 2024 PDF
Guided Report
Context
All Life Bank wants to focus on its credit card customer base in the next financial year. They have been advised by their
marketing research team that the penetration in the market can be improved. Based on this input, the Marketing team
proposes to run personalized campaigns to target new customers as well as upsell to existing customers. Another insight
from the market research was that the customers perceive the support services of the bank poorly. Based on this, the
Operations team wants to upgrade the service delivery model to ensure that customer queries are resolved faster. The Head of
Marketing and the Head of Delivery both decide to reach out to the Data Science team for help.
Objective
To identify different segments in the existing customer base, based on their spending patterns as well as past interaction with
the bank, using clustering algorithms, and provide recommendations to the bank on how to better market to and service these
customers.
Data Description
The data provided is of various customers of a bank and their financial attributes like credit limit, the total number of credit
cards the customer has, and different channels through which customers have contacted the bank for any queries (including
visiting the bank, online and through a call center).
Data Dictionary
• Sl_No: Primary key of the records
• Customer Key: Customer identification number
• Average Credit Limit: Average credit limit of each customer for all credit cards
• Total credit cards: Total number of credit cards possessed by the customer
• Total visits bank: Total number of visits that the customer made (yearly) personally to the bank
• Total visits online: Total number of visits or online logins made by the customer (yearly)
• Total calls made: Total number of calls made by the customer to the bank or its customer service department (yearly)
The boxplot shows that the majority of customers have an average credit limit between 0 and 200,000. The histogram shows that the
distribution of average credit limits is skewed to the right, with a few customers having very high credit limits.
The boxplot shows that most customers have between 1 and 4 credit cards. The histogram shows that the distribution of total credit
cards is roughly normal, with a few customers having a large number of credit cards.
The boxplot shows that most customers visit the bank between 0 and 5 times per year. The histogram shows that the distribution of total
visits to the bank is skewed to the right, with a few customers visiting the bank very frequently.
The boxplot shows that most customers visit the bank's website between 0 and 10 times per year. The histogram shows that the
distribution of total visits to the bank's website is skewed to the right, with a few customers visiting the website very frequently.
The boxplot shows that most customers call the bank between 0 and 5 times per year. The histogram shows that the distribution of total
calls made to the bank is skewed to the right, with a few customers calling the bank very frequently.
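Right skew can be checked numerically as well as visually. The sketch below computes the sample skewness (the mean of the cubed z-scores) with NumPy. The list `credit_limits` and the helper `sample_skewness` are illustrative assumptions, not the report's data or code. A positive value indicates a longer right tail.

```python
import numpy as np

def sample_skewness(values):
    """Return the (biased) sample skewness: the mean of the cubed z-scores."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()  # population standard deviation (ddof=0)
    return float(np.mean(z ** 3))

# Hypothetical credit limits: most are small, a few are very large.
credit_limits = [10_000, 12_000, 15_000, 18_000, 20_000, 25_000, 150_000, 200_000]
skew = sample_skewness(credit_limits)
print(f"skewness = {skew:.2f}")  # a positive value means right-skewed
```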
Calling the plotting functions specified above
**Observations**
❖ The barplots show the percentage of customers at each level of the categorical features.
❖ For example, the first barplot shows that 30% of customers have 1 credit card, 25% have 2 credit cards, and so on.
❖ The barplots also show that the majority of customers have 1 or 2 credit cards, and that a small percentage of customers
have 5 or more credit cards.
Creating a subplot grid of CDF (Cumulative Distribution Function) plots for numerical variables
❖ The figure shows the CDF (Cumulative Distribution Function) plots for the numerical variables in the dataset.
❖ Each plot shows the cumulative probability of a data point falling below a certain value.
❖ For example, the first plot shows that about 30% of customers have an average credit limit below 50,000, about 60%
have an average credit limit below 100,000, and so on.
❖ The plots can be used to compare the distributions of different variables and to identify outliers.
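How a CDF value such as "about 30% of customers fall below 50,000" is obtained can be sketched with an empirical CDF. This is a minimal illustration on made-up numbers. The helper names `ecdf` and `fraction_below` and the list `limits` are assumptions for the example, not the report's code or data.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the fraction of points <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def fraction_below(values, threshold):
    """Share of observations strictly below `threshold`."""
    x = np.asarray(values, dtype=float)
    return float(np.mean(x < threshold))

# Hypothetical average credit limits for ten customers.
limits = [10_000, 20_000, 30_000, 50_000, 60_000,
          80_000, 90_000, 100_000, 150_000, 200_000]
x, y = ecdf(limits)
share = fraction_below(limits, 50_000)
print(f"share below 50,000: {share:.0%}")
```

Plotting `y` against `x` as a step curve gives the kind of CDF plot described above.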
**Observations**
❖ The distribution of average credit limit is skewed to the right, with a few customers having very high credit limits.
❖ The distribution of total credit cards is roughly normal, with a few customers having a large number of credit cards.
❖ The distribution of total visits to the bank is skewed to the right, with a few customers visiting the bank very frequently.
❖ The distribution of total visits to the bank's website is skewed to the right, with a few customers visiting the website very
frequently.
❖ The distribution of total calls made to the bank is skewed to the right, with a few customers calling the bank very
frequently.
Bivariate Analysis
❖ The heatmap shows that there are some strong correlations between some of the variables in the dataset.
❖ These correlations can be used to identify relationships between variables and to make predictions about one variable based on
another variable.
❖ For example, the strong positive correlation between average credit limit and total credit cards suggests that customers with
higher average credit limits are more likely to have more credit cards.
❖ The strong negative correlation between average credit limit and total calls made to the bank suggests that customers with higher
average credit limits are less likely to call the bank.
❖ The pairplot shows the relationships between all pairs of variables in the dataset.
❖ Each plot is a scatter plot with the two variables on the x and y axes.
❖ The diagonal plots are histograms of the individual variables.
❖ The plots can be used to identify relationships between variables, such as positive correlations, negative correlations, and
outliers.
❖ There is a positive correlation between average credit limit and total credit cards.
❖ There is a negative correlation between average credit limit and total visits to the bank.
❖ There is a positive correlation between total credit cards and total visits to the bank.
❖ There is a positive correlation between total visits to the bank and total calls made to the bank.
❖ There is a positive correlation between total visits to the bank's website and total calls made to the bank.
The pairplot also shows that there are a few outliers in the data. For example, there are a few customers with very high average credit
limits and a few customers who visit the bank or the bank's website very frequently.
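The correlations read off the heatmap and pairplot come from a pairwise correlation matrix. A minimal sketch with pandas follows. The column names match the features listed in the report, but the `toy` DataFrame and its values are made up for illustration and will not reproduce the report's numbers.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the report.
toy = pd.DataFrame({
    "Avg_Credit_Limit":   [10_000, 20_000, 50_000, 100_000, 150_000],
    "Total_Credit_Cards": [2, 3, 5, 7, 9],
    "Total_calls_made":   [9, 7, 4, 2, 0],
})

# Pearson correlation between every pair of columns (values in [-1, 1]).
corr = toy.corr()
print(corr.round(2))
```

A matrix like `corr` is what a heatmap function would take as input. Values close to 1 or -1 indicate strong linear relationships, and values near 0 indicate weak ones.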
We can add a hue and see if we can see some clustered distributions.
Creating a pair plot using seaborn to visualize relationships between selected features
❖ The figure shows a pairplot with a hue for the "Total_Credit_Cards" feature.
❖ This means that the plots are colored by the number of credit cards that each customer has.
❖ The pairplot shows the relationships between all pairs of variables in the dataset, colored by the number of credit cards.
❖ Each plot is a scatter plot with the two variables on the x and y axes.
❖ The diagonal plots are histograms of the individual variables, colored by the number of credit cards.
❖ The plots can be used to identify relationships between variables, such as positive correlations, negative correlations, and
outliers.
In this case, the pairplot shows that:
❖ There is a positive correlation between total visits to the bank and total calls made to the bank.
❖ There is a positive correlation between total visits to the bank's website and total calls made to the bank.
❖ Customers with more credit cards tend to visit the bank more often and make more calls to the bank.
❖ The pairplot also shows that there are a few outliers in the data.
❖ For example, there are a few customers with a high number of credit cards who visit the bank or the bank's website very
frequently.
We can observe three segments of the customers by their preferred mode of contacting the bank.
Data Preprocessing
Outlier Detection
Avg_Credit_Limit : [153000, 155000, 156000, 156000, 157000, 158000, 163000, 163000, 166000, 166000, 167000, 171000, 172000,
172000, 173000, 176000, 178000, 183000, 184000, 186000, 187000, 195000, 195000, 200000]
Total_Credit_Cards : []
Total_visits_bank : []
Total_visits_online : [12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]
Total_calls_made : []
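The report does not state which rule produced the outlier lists above. One common choice is to flag values whose z-score exceeds 3 in absolute value, and the sketch below assumes that rule. The helper `zscore_outliers` and the `visits_online` values are hypothetical.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the sorted values whose absolute z-score exceeds `threshold`."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()  # assumed rule: |z| > 3 marks an outlier
    return sorted(x[np.abs(z) > threshold].tolist())

# Hypothetical yearly online visits: mostly 1 to 3, one very frequent visitor.
visits_online = [1] * 10 + [2] * 10 + [3] * 5 + [15]
outliers = zscore_outliers(visits_online)
print("Total_visits_online :", outliers)
```

An interquartile-range rule is another common option and may flag a different set of points.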
Scaling
• Let's scale the data before we proceed with clustering.
• Creating a pair plot with a kernel density estimate on the diagonal
The plots can be used to identify relationships between variables, such as positive correlations, negative
correlations, and outliers. The pairplot also shows that there are a few outliers in the data. For example, there are
a few customers with very high average credit limits and a few customers who visit the bank or the bank's website
very frequently.
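The scaling step mentioned above matters because distance-based clustering would otherwise be dominated by the credit limit, whose values are orders of magnitude larger than the visit and call counts. The sketch below shows z-score standardization written out with NumPy. The report does not say which scaler was used, so this is an assumption. scikit-learn's `StandardScaler` applies the same formula. The array `X` is made up.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance (z-scores)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unchanged instead of dividing by 0
    return (X - mean) / std

# Hypothetical rows: credit limit, number of cards, online visits.
X = np.array([
    [10_000, 2, 1],
    [50_000, 5, 3],
    [100_000, 7, 0],
    [200_000, 10, 12],
])
X_scaled = standardize(X)
print(X_scaled.round(2))
```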
❖ The figure shows a pairplot of the data after scaling.
❖ Each plot is a scatter plot with the two variables on the x and y axes.
❖ The diagonal plots are kernel density estimates of the individual variables.
❖ The figure shows a scatter plot of the "Total_visits_bank" and "Total_visits_online" features, colored by the cluster labels.
❖ Each point in the plot represents a customer.
❖ The plot shows that the three clusters are:
❖ Cluster 0: Customers who visit the bank in person frequently but do not visit the bank's website or call the bank on the phone very
often.
❖ Cluster 1: Customers who visit the bank's website frequently but do not visit the bank in person or call the bank on the phone very
often.
❖ Cluster 2: Customers who call the bank on the phone frequently but do not visit the bank in person or the bank's website very
often.
❖ There are also a few customers who do not belong to any of the three clusters.
❖ These customers are outliers in the data.
❖ The figure shows the silhouette plot for different numbers of clusters.
❖ The silhouette plot shows the average silhouette score across all data points for each candidate number of clusters.
❖ The silhouette score measures how well each data point fits into its own cluster compared with the nearest other cluster, with
values ranging from -1 to 1.
❖ A score near 1 indicates that the data point is well-clustered, a score near 0 indicates that the data point is on the border of two
clusters, and a score near -1 indicates that the data point has probably been assigned to the wrong cluster.
❖ The plot shows that the average silhouette score is highest for 3 clusters.
❖ This indicates that 3 is the optimal number of clusters for the data.
❖ The silhouette plot also shows that the silhouette scores for 2 and 4 clusters are relatively low.
❖ This indicates that 2 and 4 are not good choices for the number of clusters.
❖ The silhouette plot can be used to choose the optimal number of clusters for a dataset.
❖ The optimal number of clusters is the point where the average silhouette score is highest.
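The silhouette-based choice of the number of clusters described above can be sketched with scikit-learn. The data here is synthetic (three well-separated blobs generated for the example), and the range of candidate `k` values and the `random_state` are assumptions, so this illustrates the procedure rather than reproducing the report's result.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer data: three separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 0], [0, 6]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-means for each candidate k and record the average silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest average silhouette score.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```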
Cluster Profiling
Hierarchical Clustering
❖ The dendrogram shows the hierarchical clustering of the data points in the dataframe.
❖ Each data point is represented by a leaf node in the dendrogram.
❖ The dissimilarity between two data points (or clusters) is represented by the height at which their branches merge.
❖ The dendrogram shows that the data points are divided into three main clusters.
❖ The first cluster is represented by the leaves at the bottom of the dendrogram.
❖ The second cluster is represented by the leaves in the middle of the dendrogram.
❖ The third cluster is represented by the leaves at the top of the dendrogram.
❖ The dendrogram also shows that the first and second clusters are more similar to each other than they are to the third cluster.
❖ This is because the branch that connects the first and second clusters is shorter than the branch that connects the third cluster
to the other two clusters.
❖ The dendrogram can be used to choose the optimal number of clusters for the data.
❖ A common heuristic is to cut the dendrogram where the vertical gap between successive merges is largest; the number of
branches crossed by that cut is the number of clusters.
❖ In this case, the optimal number of clusters is 3.
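A minimal sketch of hierarchical clustering with SciPy follows. The report does not state the distance metric or linkage method, so Euclidean distance with average linkage is an assumption here, and the data is synthetic. `fcluster` cuts the tree into a requested number of clusters, and the cophenetic correlation indicates how faithfully the dendrogram preserves the original pairwise distances.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

# Synthetic stand-in for the scaled data: three separated groups of 50 points.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 0], [0, 6]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Build the merge tree (assumed settings: Euclidean distance, average linkage).
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree so that at most three flat clusters remain (labels start at 1).
labels = fcluster(Z, t=3, criterion="maxclust")

# Cophenetic correlation: closer to 1 means the tree reflects the distances well.
coph_corr, _ = cophenet(Z, pdist(X))
print("cluster sizes:", np.bincount(labels)[1:])
print(f"cophenetic correlation: {coph_corr:.3f}")
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree itself.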
The maximum value in each column of the hierarchical clustering profile DataFrame is highlighted with a light green background.
Creating a bar plot of mean values in k_means_df grouped by a specified column
❖ The figure shows the average values of each feature for each cluster in the K-means clustering.
❖ The bars are color-coded to represent different features.
❖ The x-axis labels represent the cluster labels.
❖ The y-axis label represents the average value of the features.
❖ The figure shows that the clusters are well-separated in terms of the average values of the features.
❖ For example, cluster 0 has a high average value for the "Avg_Credit_Limit" feature, while cluster 1 has a low average value for the
"Avg_Credit_Limit" feature.
❖ This indicates that the K-means clustering has successfully grouped together data points that are similar in terms of their
features.
❖ The figure also shows that the clusters are not perfectly separated.
❖ For example, there is some overlap between the clusters in terms of the "Avg_Credit_Limit" feature.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the figure shows that the K-means clustering has successfully identified some of the underlying structure in the data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.
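The per-cluster averages behind a profile bar plot like the one described above can be computed with a pandas group-by. In this sketch the name `k_means_df` follows the report, while the label column name `K_means_segments` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data with an assumed cluster-label column.
k_means_df = pd.DataFrame({
    "Avg_Credit_Limit":   [150_000, 160_000, 12_000, 15_000, 35_000, 40_000],
    "Total_Credit_Cards": [9, 8, 2, 3, 5, 6],
    "Total_visits_bank":  [1, 0, 1, 1, 4, 3],
    "K_means_segments":   [0, 0, 1, 1, 2, 2],
})

# Mean of every feature within each cluster, plus the cluster sizes.
profile = k_means_df.groupby("K_means_segments").mean()
profile["count_in_segment"] = k_means_df.groupby("K_means_segments").size()
print(profile)
```

Calling `profile.plot.bar()` would draw one group of bars per cluster, and `profile.style.highlight_max(color="lightgreen", axis=0)` marks the largest value in each column when the table is displayed in a notebook.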
Creating a bar plot of the mean values of hc_df grouped by a specific column
❖ The figure shows the average values of each feature for each cluster in the hierarchical clustering.
❖ The bars are color-coded to represent different features.
❖ The x-axis labels represent the cluster labels.
❖ The y-axis label represents the average value of the features.
❖ The figure shows that the clusters are well-separated in terms of the average values of the features.
❖ For example, cluster 0 has a high average value for the "Avg_Credit_Limit" feature, while cluster 1 has a low average value for the
"Avg_Credit_Limit" feature.
❖ This indicates that the hierarchical clustering has successfully grouped together data points that are similar in terms of their
features.
❖ The figure also shows that the clusters are not perfectly separated.
❖ For example, there is some overlap between the clusters in terms of the "Avg_Credit_Limit" feature.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the figure shows that the hierarchical clustering has successfully identified some of the underlying structure in the data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.
Let's create some plots on the original data to understand the customer distribution among the clusters.
❖ The boxplot shows the distribution of each numerical variable for each cluster obtained using K-means Clustering.
❖ The boxplot shows the median, quartiles, and outliers for each variable.
❖ The boxplot shows that the clusters are well-separated in terms of the distribution of the numerical variables.
❖ For example, cluster 0 has a higher median value for the "Avg_Credit_Limit" variable than cluster 1.
❖ This indicates that the K-means clustering has successfully grouped together data points that are similar in terms of their
numerical variables.
❖ The boxplot also shows that there is some overlap between the clusters in terms of the distribution of the numerical variables.
❖ For example, there are some data points in cluster 0 that have a lower value for the "Avg_Credit_Limit" variable than some data
points in cluster 1.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the boxplot shows that the K-means clustering has successfully identified some of the underlying structure in the data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.
❖ The boxplot shows the distribution of each numerical variable for each cluster obtained using Hierarchical Clustering.
❖ The boxplot shows the median, quartiles, and outliers for each variable.
❖ The boxplot shows that the clusters are well-separated in terms of the distribution of the numerical variables.
❖ For example, cluster 0 has a higher median value for the "Avg_Credit_Limit" variable than cluster 1.
❖ This indicates that the Hierarchical Clustering has successfully grouped together data points that are similar in terms of their
numerical variables.
❖ The boxplot also shows that there is some overlap between the clusters in terms of the distribution of the numerical variables.
❖ For example, there are some data points in cluster 0 that have a lower value for the "Avg_Credit_Limit" variable than some data
points in cluster 1.
❖ This indicates that there are some data points that are not clearly assigned to a single cluster.
❖ Overall, the boxplot shows that the Hierarchical Clustering has successfully identified some of the underlying structure in the
data.
❖ However, the clustering is not perfect and there are some data points that are not clearly assigned to a single cluster.
❖ Based on the cluster profiles, we can identify the following actionable insights:
❖ Cluster 0: This cluster represents customers with high average credit limits and high average balances.
Recommendation: This cluster could be targeted with personalized offers for high-value products and services. The bank could
consider offering these customers higher credit limits or lower interest rates.
❖ Cluster 1: This cluster represents customers with low average credit limits and low average balances.
Recommendation: This cluster could be targeted with offers for basic banking products and services. The bank could consider
offering these customers lower fees or higher interest rates on savings accounts.
❖ Cluster 2: This cluster represents customers with average credit limits and average balances.
Recommendation: This cluster could be targeted with offers for a variety of banking products and services. The bank could
consider offering these customers personalized offers based on their individual needs and preferences.
❖ Overall, the cluster analysis provides valuable insights into the different customer segments within the bank's customer base.
❖ This information can be used to develop more targeted and effective marketing campaigns and product offerings.
❖ In addition to the above insights, the bank could also consider the following recommendations:
❖ Use the cluster labels to create new customer segments in the bank's CRM system.
❖ Track the performance of marketing campaigns and product offerings by cluster segment.
❖ Use the cluster labels to develop personalized marketing messages and offers for each customer segment.
❖ Regularly review the cluster profiles and insights to ensure that they are still accurate and relevant.