
PROJECT DATA MINING

BUSINESS REPORT
Mohseen Sayyed | Mohseen.Sayyed@gmail.com
PART I
CLUSTERING

Problem Statement …… 3

1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis) ……… 4 to 7

1.2 Do you think scaling is necessary for clustering in this case? Justify ……… 8

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them ……… 8 to 9

1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters ……… 10

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters ……… 11
PROBLEM STATEMENT

A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They collected a sample that summarizes the activities of users during the past few months. You are given the task to identify the segments based on credit card usage.
DATA ANALYSIS

DATA SHAPE
• The Data frame has 210 rows and 7 Columns

DATA DESCRIPTION
• The Dataset contains 7 Columns and 210 Rows
• No Categorical Variables
• We see that for most of the variables, the mean and median are nearly equal
• Including the 90th percentile in the summary shows the variation; the values look fairly evenly distributed
• Standard deviation is high for the spending variable

DATA SAMPLE

• The data is clean with no null or NA values
• No duplicates found in the data (a minimal sketch of these initial checks follows)
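A minimal sketch of the initial steps, assuming the sample is in a CSV file (the file name is an assumption):

import pandas as pd

df = pd.read_csv('credit_card_usage.csv')    # assumed file name
print(df.shape)                              # expected (210, 7)
print(df.describe(percentiles=[0.9]).T)      # summary including the 90th percentile
print(df.isnull().sum())                     # check for null values
print(df.duplicated().sum())                 # check for duplicate rows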
EXPLORATORY DATA ANALYSIS

• We have performed Univariate, Bi-variate and Multi-variate analysis on all the columns of the dataset
• We have identified the outliers, if any, in the data and the correlation between the columns
• The skewness of the data can be seen in the charts that follow

spending advance_payments probability_of_full_payment current_balance


Range of values 10.59 4.84 0.1102 1.776
Minimum 10.59 12.41 0.8081 4.899
Maximum 21.18 17.25 0.9183 6.675
Mean value 14.8475 14.5593 0.871 5.6285
Median value 14.355 14.32 0.8735 5.5235
Standard deviation 2.9097 1.306 0.0236 0.4431
Null values False False False False
1st Quartile (Q1) 12.27 13.45 0.8569 5.2623
3rd Quartile (Q3) 17.305 15.715 0.8878 5.9798
Interquartile range (IQR) 5.035 2.265 0.0309 0.7175

credit_limit min_payment_amt max_spent_in_single_shopping


Range of values 1.403 7.6909 2.031
Minimum 2.63 0.7651 4.519
Maximum 4.033 8.456 6.55
Mean value 3.2586 3.7002 5.4081
Median value 3.237 3.599 5.223
Standard deviation 0.3777 1.5036 0.4915
Null values False False False
1st Quartile (Q1) 2.944 2.5615 5.045
3rd Quartile (Q3) 3.5618 4.7688 5.877
Interquartile range (IQR) 0.6177 2.2073 0.832

• The table above gives us details on each column's maximum and minimum values
• It also shows the mean, median and interquartile range of each column

OUTLIER ANALYSIS

Fig (a)

• Outlier analysis on all columns shows that we have outliers only in probability_of_full_payment and min_payment_amt
SKEWNESS
• We see the distribution is skewed towards the right tail for all the variables except the probability_of_full_payment variable. (Fig b)

Fig (b)

Fig (c)
CORRELATION
• Strong positive correlation observed between the below features (Fig c, Pg 6); a minimal heatmap sketch follows:
  - spending & advance_payments
  - advance_payments & current_balance
  - credit_limit & spending
  - spending & current_balance
  - credit_limit & advance_payments
  - max_spent_in_single_shopping & current_balance
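A minimal sketch of the correlation heatmap, assuming the data is in the DataFrame df and seaborn is available:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')   # pairwise correlations
plt.title('Correlation between features')
plt.show()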
PROJECT ANALYSIS

1.2 Do you think scaling is necessary for clustering in this case? Justify

Yes, scaling is needed.


Inference
• Scaling needs to be performed only on numerical values. The Z-score method is used for scaling on this dataset.
• The dataset has columns with a variety of features such as credit_limit, min_payment_amt, advance_payments and spending, whose values vary between people. One person may have a higher credit limit and another a lower one, depending on their salary and credit score. Hence we need to scale the data before clustering.
• With the Z-score we know how many standard deviations a point is away from the mean; it also indicates the direction. (A minimal sketch follows.)
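A minimal sketch of Z-score scaling with scikit-learn, assuming the cleaned data is in the DataFrame df:

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                      # z = (x - mean) / std
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df.describe().round(2))           # means ~0 and std ~1 after scaling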

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them

We will use 2 methods to do this:
1. Ward linkage
2. Agglomerative Clustering

Figures: dendrogram on the entire dataset and dendrogram on the last 25 merges

WARD

• We have used the Ward method here; Ward's linkage is a method for hierarchical cluster analysis.
• The idea has much in common with analysis of variance (ANOVA). The linkage function specifying the distance between two clusters is computed as the increase in the "error sum of squares" (ESS) after fusing the two clusters into a single cluster.
• In the above dendrogram, we locate the largest vertical distance between merges and draw a horizontal line through its middle. The number of vertical lines intersecting it is the optimal number of clusters.
• Based on this dendrogram, I take the number of clusters to be 3. (A minimal sketch follows.)
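A minimal sketch of the Ward dendrogram on the scaled data, assuming scaled_df from the scaling step:

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

wardlink = linkage(scaled_df, method='ward')        # Ward's minimum-variance linkage

plt.figure(figsize=(10, 5))
dendrogram(wardlink, truncate_mode='lastp', p=25)   # show only the last 25 merges
plt.title('Dendrogram on the last 25 merges')
plt.show()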
PROJECT ANALYSIS

AGGLOMERATIVE CLUSTERING

• Agglomerative Clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.
• Based on this dendrogram, I take the number of clusters to be 3. (A minimal sketch follows.)
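A minimal sketch of fitting agglomerative clustering with 3 clusters on the scaled data, assuming scaled_df and the original DataFrame df:

from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=3, linkage='ward')   # Ward linkage on Euclidean distance
df['agg_cluster'] = agg.fit_predict(scaled_df)

# Profile the clusters on the original (unscaled) values
print(df.groupby('agg_cluster').mean().round(3))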

Figures: Ward and Agglomerative Clustering results

Inference
• Both methods give almost the same cluster means, with only minor variation, which we know can occur.
• For the cluster grouping based on the dendrogram, 3 or 4 looks good. After further analysis of the dataset, we went with a 3-group cluster solution for the hierarchical clustering.
• In real life, more variables would likely have been captured - tenure, BALANCE_FREQUENCY, balance, purchases, purchase installments, and others.
• The three-group cluster solution gives a pattern based on high/medium/low spending with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
PROJECT ANALYSIS

1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.

• k-means clustering is a method of vector quantization, originally from signal processing, that
aims to partition n observations into k clusters in which each observation belongs to the
cluster with the nearest mean, serving as a prototype of the cluster.
• Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares within each cluster.

• Based on the silhouette score, 3 or 4 could be the optimum number of clusters. (A minimal sketch of the elbow curve and silhouette scores follows.)
• This dataset can be explained in terms of the spending pattern of an individual and the risk they bring with that spend: Low Spend - High Risk, Medium Spend - Low Risk, and High Spend - Low Risk.
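A minimal sketch of the elbow curve and silhouette scores on the scaled data, assuming scaled_df:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

inertias, sil_scores = [], []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=1)       # the random seed is an assumption
    labels = km.fit_predict(scaled_df)
    inertias.append(km.inertia_)                    # within-cluster sum of squares
    sil_scores.append(silhouette_score(scaled_df, labels))
    print(k, round(km.inertia_, 2), round(sil_scores[-1], 3))

plt.plot(range(2, 11), inertias, marker='o')        # the bend ('elbow') suggests k
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia (WSS)')
plt.show()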
PROJECT ANALYSIS

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.

• Cluster 0: High Spending
• Cluster 1: Low Spending
• Cluster 2: Medium Spending

Cluster 0: High Spending

Observations:
• High spending customers have a lower credit limit, so their spend gets limited.
• Most of them prefer to make full payment, to maintain their credit score.
Recommendations:
• Increase their credit limit, give promotional offers and introduce a reward scheme to encourage spend, as the max spent in a single shopping trip is higher in this category.
• Give loans on the credit card as they have a good repayment history.

Cluster 1: Low Spending


Observations:
• Low spending customers have a higher credit limit. They have the highest amount spent in a single shopping trip.
• Most of them prefer to make full payment, to maintain their credit score.
Recommendations:
• Offers to be provided to increase their spend.
• Tie up with multiple vendors for groceries and utilities, which would help them spend more.
• Advertisement emails to be sent after studying their spend pattern.

Cluster 2: Medium Spending


Observations:
• Medium spending customers have a higher credit limit.
• Most of them prefer to make full payment, to maintain their credit score.
Recommendations:
• These are our potential high spenders.
• Promote a premium card or loyalty program to increase transactions.
• Tie up with airport lounges and travel partners.
• Advertisement emails for premium sites and brands can help boost their spend.
PART II
CART-RF-ANN

Problem Statement …… 13

2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis) …… 14 to 17

2.2 Data Split: Split the data into test and train, build classification models CART, Random Forest, Artificial Neural Network …… 18

2.3 Performance Metrics: Comment and check the performance of predictions on train and test sets using Accuracy, Confusion Matrix, plot ROC curve and get ROC_AUC score, classification reports for each model …… 19 to 24

2.4 Final Model: Compare all the models and write an inference which model is best/optimized ….. 25

2.5 Inference: Based on the whole analysis, what are the business insights and recommendations ….. 26
PROBLEM STATEMENT

An insurance firm providing tour insurance is facing higher claim frequency. The management decides to collect data from the past few years. You are assigned the task to make a model which predicts the claim status and provide recommendations to management. Use CART, RF & ANN and compare the models' performances in train and test sets.
DATA ANALYSIS

DATA SHAPE
• The data frame has 3000 rows and 10 columns

DATA DESCRIPTION
• 10 features within the dataset
• 3000 rows, no null Values
• Apart from Age, Commision, Duration and Sales, the rest are of object data type
• Target variable is Claimed
• Agency_code has 4 unique Values
• There are 2 Types of Insurance
• 2 Channels
• 5 Products and 3 Destinations

DATA SAMPLE

• The data is clean with no null or NA values
• The dataset shows 139 duplicate rows, but we cannot simply remove them. The data does not provide a Customer ID, so these rows could belong to different customers. Hence I chose not to drop them.
EXPLORATORY DATA ANALYSIS

• We have performed Univariate, Bi-variate and Multi-variate analysis on all the columns of the dataset
• We have identified the outliers, if any, in the data and the correlation between the columns
• The skewness of the data can be seen in the charts that follow

Age Commision Duration Sales


Range of values 76 210.21 4581 539
Minimum 8 0 -1 0
Maximum 84 210.21 4580 539
Mean value 38.091 14.5292 70.0013 60.2499
Median value 36 4.63 26.5 33
Standard deviation 10.4635 25.4815 134.0533 70.734
Null values False False False False
1st Quartile (Q1) 32 0 11 20
3rd Quartile (Q3) 42 17.235 63 69
Interquartile range (IQR) 10 17.235 52 49

• The table above gives us details on each column's maximum and minimum values
• It also shows the mean, median and interquartile range of each column

OUTLIER ANALYSIS

• There are Outliers in every Column of the data


PAIRPLOT AND CORRELATION
CATEGORICAL FEATURES
• There is a need to convert all categorical variables to integer codes (a minimal sketch follows)
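A minimal sketch of converting the object columns to integer codes with pandas, assuming the insurance data is in the DataFrame df:

import pandas as pd

# Convert every object column to pandas category codes (0, 1, 2, ...)
for col in df.select_dtypes(include='object').columns:
    df[col] = pd.Categorical(df[col]).codes

print(df.dtypes)   # all columns should now be numeric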

DATA SAMPLE
PROJECT ANALYSIS

The data needs to be scaled.

Inference
• Scaling needs to be performed only on numerical values. The Z-score method is used for scaling on this dataset.
• With the Z-score we know how many standard deviations a point is away from the mean; it also indicates the direction.

Scaled Dataset

The scaled dataset is used for the test and train split: 30% test and 70% train. (A minimal sketch follows the shapes below.)

X_train (2100, 9) | X_test (900, 9) | train_labels (2100,) | test_labels (900,)
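A minimal sketch of the 70/30 split, assuming X holds the 9 scaled features and y the Claimed target:

from sklearn.model_selection import train_test_split

X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1)   # the random seed is an assumption

print(X_train.shape, X_test.shape, train_labels.shape, test_labels.shape)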


PROJECT ANALYSIS

2.3 Performance Metrics: Comment and check the performance of predictions on train and test sets using Accuracy, Confusion Matrix, plot ROC curve and get ROC_AUC score, classification reports for each model.

We have tried 3 models - Decision Tree, Random Forest and Artificial Neural Network - to predict the claim status and compare their accuracy. (A minimal evaluation sketch, reused for each model, follows.)
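A minimal sketch of the evaluation applied to each fitted classifier (here called model), assuming the split from above:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score, roc_curve)
import matplotlib.pyplot as plt

for X_part, y_part, name in [(X_train, train_labels, 'Train'),
                             (X_test, test_labels, 'Test')]:
    pred = model.predict(X_part)
    prob = model.predict_proba(X_part)[:, 1]      # predicted probability of a claim
    print(name, 'accuracy:', accuracy_score(y_part, pred))
    print(confusion_matrix(y_part, pred))
    print(classification_report(y_part, pred))
    print(name, 'ROC_AUC:', roc_auc_score(y_part, prob))
    fpr, tpr, _ = roc_curve(y_part, prob)
    plt.plot(fpr, tpr, label=name)                # ROC curve

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()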

DECISION TREE
A decision tree is a decision support tool that uses a tree-like model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility. It is
one way to display an algorithm that only contains conditional control statements.

Figures: decision tree plots

The two attachments above show the decision tree before estimating the best depth and leaf size. Considering the original dataset, below are the feature importance details we get to see.

We tried a few Combinations to arrive at the best values in Grid Search to estimate the Accuracy
GridSearchCV(cv=3, estimator=DecisionTreeClassifier(),
param_grid={'max_depth': [3, 4, 4.5, 5, 5.5],
'min_samples_leaf': [30, 35, 36, 38],
'min_samples_split': [90, 105, 112, 116]})

The best parameters found from the above combinations are:
{'max_depth': 5, 'min_samples_leaf': 35, 'min_samples_split': 116}

The feature importance graph based on the best parameters of the Decision Tree looks like this: Agency_code gained the highest importance, followed by Sales. (A minimal sketch follows.)
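A minimal sketch of refitting the tree with the best parameters and plotting feature importances, assuming X_train is a DataFrame with named columns:

from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import pandas as pd

best_dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=35,
                                 min_samples_split=116,
                                 random_state=1)   # the random seed is an assumption
best_dt.fit(X_train, train_labels)

imp = pd.Series(best_dt.feature_importances_, index=X_train.columns)
print(imp.sort_values(ascending=False))
imp.sort_values().plot(kind='barh')
plt.show()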
PROJECT ANALYSIS

Figures: scores for the train model and the test model
PROJECT ANALYSIS

RANDOM FOREST
Random forest is a Supervised Machine Learning Algorithm that is used widely in
Classification and Regression problems. It builds decision trees on different samples and
takes their majority vote for classification and average in case of regression

We tried a few Combinations to arrive at the best values in Grid Search to estimate the Accuracy

GridSearchCV(cv=3, estimator=RandomForestClassifier(),
param_grid={'max_depth': [8, 10, 12], 'max_features': [2, 3, 4, 5],
'min_samples_leaf': [8, 9, 10, 11],
'min_samples_split': [24, 36, 40, 44],
'n_estimators': [101, 301]})

The best parameters found from the above combinations are:
{'max_depth': 8, 'max_features': 5, 'min_samples_leaf': 8, 'min_samples_split': 36, 'n_estimators': 301}

The feature importance graph based on the best parameters of the Random Forest looks like this: Agency_code gained the highest importance, followed by Sales. (A minimal sketch follows.)
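A minimal sketch of refitting the random forest with the best parameters, again assuming X_train is a DataFrame with named columns:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

best_rf = RandomForestClassifier(n_estimators=301, max_depth=8, max_features=5,
                                 min_samples_leaf=8, min_samples_split=36,
                                 random_state=1)   # the random seed is an assumption
best_rf.fit(X_train, train_labels)

imp = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(imp.sort_values(ascending=False))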

Figures: scores for the train model and the test model
PROJECT ANALYSIS

ARTIFICIAL NEURAL NETWORK


Artificial neural networks, usually simply called neural networks, are computing systems
inspired by the biological neural networks that constitute animal brains. An ANN is based
on a collection of connected units or nodes called artificial neurons, which loosely model
the neurons in a biological brain

We tried a few Combinations to arrive at the best values in Grid Search to estimate the Accuracy

GridSearchCV(cv=3, estimator=MLPClassifier(),
param_grid={'activation': ['logistic', 'relu'],
'hidden_layer_sizes': [(100, 100, 100)],
'max_iter': [10000], 'solver': ['sgd', 'adam'],
'tol': [0.1, 0.01]})

The best parameters found from the above combinations are:
{'activation': 'relu', 'hidden_layer_sizes': (100, 100, 100), 'max_iter': 10000, 'solver': 'adam', 'tol': 0.01}

A feature importance graph cannot be drawn for the ANN, as it is considered a black-box algorithm. (A minimal sketch of refitting the network with the best parameters follows.)
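A minimal sketch of refitting the neural network with the best parameters on the scaled data:

from sklearn.neural_network import MLPClassifier

best_ann = MLPClassifier(hidden_layer_sizes=(100, 100, 100), activation='relu',
                         solver='adam', tol=0.01, max_iter=10000,
                         random_state=1)   # the random seed is an assumption
best_ann.fit(X_train, train_labels)
print('Train accuracy:', best_ann.score(X_train, train_labels))
print('Test accuracy:', best_ann.score(X_test, test_labels))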

Figures: scores for the train model and the test model
PROJECT ANALYSIS

MODEL COMPARISON
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.

We have compared the outputs of the 3 models on Precision, F1 Score, Accuracy and Recall. This should help us decide the optimum model. (A minimal sketch of building the comparison table follows.)
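A minimal sketch of assembling the comparison table on the test set, assuming the three refitted models from above and a binary 0/1 Claimed target:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd

scores = {}
for name, model in [('CART', best_dt), ('Random Forest', best_rf), ('ANN', best_ann)]:
    pred = model.predict(X_test)
    scores[name] = {'Accuracy': accuracy_score(test_labels, pred),
                    'Precision': precision_score(test_labels, pred),
                    'Recall': recall_score(test_labels, pred),
                    'F1 Score': f1_score(test_labels, pred)}

print(pd.DataFrame(scores).round(3))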
PROJECT ANALYSIS

MODEL COMPARISON
2.5 Inference: Based on the whole Analysis, what are the
business insights and recommendations

• The analysis calls for gathering more real-time and historical data, as the accuracy and recall do not fluctuate much across the models.
• The given set of columns are the primary ones, but other passive parameters can affect the data to a large extent, such as weather, diseases, and types of vehicles.
• Streamlining online experiences benefitted customers, leading to an increase in conversions, which subsequently raised profits.
• As per the data, 90% of insurance is sold via the online channel; we need to find out why. Another interesting fact is that almost all the offline business has a claim associated with it.
• The JZI agency resources need to be trained to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether we need to tie up with an alternate agency.
• Based on the model we are getting around 80% accuracy, so when a customer books airline tickets or plans a trip, we can cross-sell the insurance based on the claim data pattern.
• Another interesting fact is that more sales happen via the Agency channel than via Airlines, yet the trend shows claims are processed more on the Airline side. We may need to deep-dive into the process to understand the workflow and why.
• Key performance indicators (KPIs) for insurance claims are: reduce claims cycle time, increase customer satisfaction, combat fraud, optimize claims recovery, and reduce claim handling costs. Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise to new risk transfer solutions in areas like non-damage business interruption and reputational damage.
THANK YOU

