Data Mining
BUSINESS REPORT
Mohseen Sayyed | Mohseen.Sayyed@gmail.com
PART I
CLUSTERING
DATA ANALYSIS
DATA SHAPE
• The Data frame has 210 rows and 7 Columns
DATA DESCRIPTION
• The dataset contains 7 columns and 210 rows
• No categorical variables
• For most variables, the mean and median are nearly equal
• We included the 90th percentile in the summary to check for variation, and the data looks evenly distributed
• Standard deviation is highest for the spending variable
DATA SAMPLE
• We performed univariate, bivariate and multivariate analysis on all the columns of the dataset
• We identified the outliers, if any, and the correlation between the columns
• The charts show the skewness of the data
• The table above gives the max and min value of each column
• It also shows the mean, median and interquartile range of each column
OUTLIER ANALYSIS
[Fig (a), Fig (b): outlier plots; Fig (c): correlation plot]
CORRELATION
• Strong positive correlation observed between the following feature pairs (Fig (c)); a sketch of how such a heatmap is produced follows this list:
  - spending & advance_payments
  - advance_payments & current_balance
  - credit_limit & spending
  - spending & current_balance
  - credit_limit & advance_payments
  - max_spent_in_single_shopping & current_balance
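A minimal sketch of how such a heatmap can be produced, assuming the clustering data is loaded into a pandas DataFrame (the file name below is hypothetical):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("clustering_data.csv")         # hypothetical file name
corr = df.corr(numeric_only=True)               # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")  # heatmap like Fig (c)
plt.show()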
PROJECT ANALYSIS
WARD
• We used the Ward method here; Ward's linkage is a method for hierarchical cluster analysis.
• The idea has much in common with analysis of variance (ANOVA): the linkage function specifying the distance between two clusters is computed as the increase in the "error sum of squares" (ESS) after fusing the two clusters into a single cluster.
• In the dendrogram above, we locate the largest vertical difference between nodes and pass a horizontal line through its middle; the number of vertical lines intersecting it is the optimal number of clusters.
• Based on this dendrogram, we take the number of clusters to be 3 (a minimal sketch of the linkage follows below).
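A minimal sketch of the Ward linkage and dendrogram, continuing from the DataFrame df above (scaling first is an assumption, but standard practice):

from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X = StandardScaler().fit_transform(df)  # standardise features before clustering
Z = linkage(X, method="ward")           # each merge minimises the increase in ESS
dendrogram(Z)                           # inspect the merges to pick a cluster count
plt.show()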
PROJECT ANALYSIS
AGGLOMERATIVE CLUSTERING
WARD
AGGLOMERATIVE CLUSTERING
• Inference
• Both methods give almost similar cluster means, with minor variation, which we know can occur.
• For cluster grouping based on the dendrogram, 3 or 4 looked good; after further analysis, and based on the dataset, we went with a 3-group cluster solution from the hierarchical clustering (a sketch of the cut is shown after this list).
• In real life, more variable values could have been captured: tenure, BALANCE_FREQUENCY, balance, purchases, purchase installments, among others.
• The three-group cluster solution gives a pattern based on high/medium/low spending with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
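A minimal sketch of cutting the Ward dendrogram into the 3-group solution, continuing from the linkage matrix Z above:

from scipy.cluster.hierarchy import fcluster

labels = fcluster(Z, t=3, criterion="maxclust")  # flat assignment into 3 clusters
df["cluster"] = labels
print(df.groupby("cluster").mean())  # profile the high/medium/low spending groups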
PROJECT ANALYSIS
• k-means clustering is a method of vector quantization, originally from signal processing, that
aims to partition n observations into k clusters in which each observation belongs to the
cluster with the nearest mean, serving as a prototype of the cluster.
• Inertia measures how well a dataset was clustered by k-means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares over all the clusters (a sketch of this elbow check follows below).
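A minimal sketch of the elbow check using inertia, continuing from the scaled matrix X above:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(range(1, 11), inertias, marker="o")  # the "elbow" suggests a good k
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()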
Observations:
• High-spending customers have a lower credit limit, so their spend gets limited.
• Most of them prefer to pay in full, to maintain their credit score.
Recommendations:
• Increase their credit limit, give promotional offers and introduce a reward scheme to encourage spend, as the max spent in a single shopping trip is higher in this category.
• Offer loans on the credit card, as they have a good repayment history.
DATA ANALYSIS
DATA SHAPE
• The data frame has 3000 rows and 10 columns
DATA DESCRIPTION
• 10 features within the dataset
• 3000 rows, no null Values
• Apart from Age, Commision, Duration and Sales, the rest are of object data type
• Target variable is Claimed
• Agency_code has 4 unique Values
• There are 2 Types of Insurance
• 2 Channels
• 5 Products and 3 Destinations
DATA SAMPLE
• We performed univariate, bivariate and multivariate analysis on all the columns of the dataset
• We identified the outliers, if any, and the correlation between the columns
• The charts show the skewness of the data
• The table above gives the max and min value of each column
• It also shows the mean, median and interquartile range of each column
OUTLIER ANALYSIS
DATA SAMPLE
PROJECT ANALYSIS
Scaled Dataset
The scaled dataset is used for the train/test split: 30% test and 70% train.
We tried 3 models, a Decision Tree, a Random Forest and an Artificial Neural Network, and compared their accuracy (a sketch of the split follows below).
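A minimal sketch of the split and scaling, assuming X holds the encoded features and y the Claimed target (both names are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

scaler = StandardScaler().fit(X_train)  # fit on the train split only, to avoid leakage
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)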
DECISION TREE
A decision tree is a decision support tool that uses a tree-like model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility. It is
one way to display an algorithm that only contains conditional control statements.
Decision Tree
Decision Tree Image
The two attachments above show the decision tree before estimating the best depth and leaf size.
Considering the original dataset, below are the feature importance details we get to see.
We tried a few combinations in grid search to arrive at the best values and estimate the accuracy:
GridSearchCV(cv=3, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [3, 4, 5],
                         'min_samples_leaf': [30, 35, 36, 38],
                         'min_samples_split': [90, 105, 112, 116]})
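A minimal sketch of fitting this grid and scoring the tuned tree, continuing from the train/test split above (max_depth takes integer values only, so the fractional depths were dropped):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

grid_dt = GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=42),
                       param_grid={'max_depth': [3, 4, 5],
                                   'min_samples_leaf': [30, 35, 36, 38],
                                   'min_samples_split': [90, 105, 112, 116]})
grid_dt.fit(X_train, y_train)
best_dt = grid_dt.best_estimator_
print(grid_dt.best_params_)
print(best_dt.score(X_test, y_test))  # test accuracy of the tuned tree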
Train Model
Test Model
PROJECT ANALYSIS
RANDOM FOREST
Random forest is a Supervised Machine Learning Algorithm that is used widely in
Classification and Regression problems. It builds decision trees on different samples and
takes their majority vote for classification and average in case of regression
We tried a few combinations in grid search to arrive at the best values and estimate the accuracy:
GridSearchCV(cv=3, estimator=RandomForestClassifier(),
param_grid={'max_depth': [8, 10, 12], 'max_features': [2, 3, 4, 5],
'min_samples_leaf': [8, 9, 10, 11],
'min_samples_split': [24, 36, 40, 44],
'n_estimators': [101, 301]})
The feature importance graph based on the best Random Forest parameters looks like this:
Agency_code gained the highest importance, followed by Sales. A sketch of how the importances are extracted follows below.
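A minimal sketch of fitting the grid above and plotting the importances (assuming X_train is a DataFrame, so column names are available):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import pandas as pd
import matplotlib.pyplot as plt

grid_rf = GridSearchCV(cv=3, estimator=RandomForestClassifier(random_state=42),
                       param_grid={'max_depth': [8, 10, 12], 'max_features': [2, 3, 4, 5],
                                   'min_samples_leaf': [8, 9, 10, 11],
                                   'min_samples_split': [24, 36, 40, 44],
                                   'n_estimators': [101, 301]})
grid_rf.fit(X_train, y_train)
best_rf = grid_rf.best_estimator_

importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind="barh")  # Agency_code and Sales on top
plt.show()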
Train Model
Test Model
PROJECT ANALYSIS
ARTIFICIAL NEURAL NETWORK
We tried a few combinations in grid search to arrive at the best values and estimate the accuracy:
GridSearchCV(cv=3, estimator=MLPClassifier(),
param_grid={'activation': ['logistic', 'relu'],
'hidden_layer_sizes': [(100, 100, 100)],
'max_iter': [10000], 'solver': ['sgd', 'adam'],
'tol': [0.1, 0.01]})
A feature importance graph cannot be drawn for the ANN, as it is a black-box algorithm. A sketch of fitting the grid follows below.
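A minimal sketch of fitting the ANN grid above; the MLP is trained on the scaled features from the split sketch earlier:

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

grid_ann = GridSearchCV(cv=3, estimator=MLPClassifier(random_state=42),
                        param_grid={'activation': ['logistic', 'relu'],
                                    'hidden_layer_sizes': [(100, 100, 100)],
                                    'max_iter': [10000], 'solver': ['sgd', 'adam'],
                                    'tol': [0.1, 0.01]})
grid_ann.fit(X_train_s, y_train)  # neural networks need the scaled inputs
print(grid_ann.best_params_)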
Train Model
Test Model
PROJECT ANALYSIS
MODEL COMPARISON
2.4 Final Model: Compare all the models and write an
inference which model is best/optimized.
We compared the outputs of the 3 models on precision, recall, F1 score and accuracy. This should help us decide the optimum model; a sketch of the comparison follows below.
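A minimal sketch of the comparison, continuing from the tuned models above; classification_report prints precision, recall, F1 score and accuracy per model:

from sklearn.metrics import classification_report

models = {'Decision Tree': best_dt,
          'Random Forest': best_rf,
          'ANN': grid_ann.best_estimator_}
for name, model in models.items():
    X_eval = X_test_s if name == 'ANN' else X_test  # the ANN uses scaled features
    print(name)
    print(classification_report(y_test, model.predict(X_eval)))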
PROJECT ANALYSIS
MODEL COMPARISON
2.5 Inference: Based on the whole Analysis, what are the
business insights and recommendations
• The analysis calls for gathering more real-time and historical data, as accuracy and recall do not fluctuate much across the models.
• The given set of columns are the primary ones, but other passive parameters, such as weather, diseases and types of vehicles, can change the data to a large extent.
• Streamlining online experiences benefitted customers, leading to an
increase in conversions, which subsequently raised profits.
• As per the data, 90% of insurance is sold through the online channel; we need to find out why. Another interesting fact is that almost all the offline business has a claim associated with it.
• We need to train the JZI agency resources to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether we need to tie up with an alternate agency.
• Also, since the model gives around 80% accuracy, when a customer books airline tickets or plans a trip we can cross-sell the insurance based on the claim data pattern.
• Another interesting fact is that more sales happen via agencies than airlines, yet the trend shows claims are processed more at airlines, so we may need to deep-dive into the process to understand the workflow and why.
• Key performance indicators (KPIs) for insurance claims are: reduce claims cycle time, increase customer satisfaction, combat fraud, optimize claims recovery and reduce claim handling costs.
• Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise to new risk transfer solutions in areas like non-damage business interruption and reputational damage.
THANK YOU