

Predictive Modeling (PM)


Project Report

Date: October 2021


Version 1.0

Table of Contents
1 Problem 1 Statement
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.
1.1.1 Data Summary
1.1.2 Duplicated Data Summary
1.1.3 Descriptive Statistics
1.1.4 Sample Data
1.1.5 Univariate Analysis
1.1.5.1 Carat Column
1.1.5.2 Depth Column
1.1.5.3 Table Column
1.1.5.4 X Column
1.1.5.5 Y Column
1.1.5.6 Z Column
1.1.5.7 Price Column
1.1.5.8 Cut Column
1.1.5.9 Clarity Column
1.1.5.10 Color Column
1.1.6 Bivariate Analysis
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Check for the possibility of combining the sub levels of ordinal variables and take actions accordingly. Explain why you are combining these sub levels with appropriate reasoning.
1.2.1 Null Values and Zero Values Check
1.2.2 Outlier Values Check and Imputation
1.2.3 Null Values and Zero Values Imputation
1.2.4 Sub-Level Merge for Ordinal Data
1.2.4.1 Color and Clarity
1.2.4.2 Cut
1.2.5 Data Multi-Collinearity Check - VIF
1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30). Apply Linear regression using scikit-learn. Perform checks for significant variables using the appropriate method from statsmodels. Create multiple models and check the performance of predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one with appropriate reasoning.
1.3.1 Encoding of Categorical Data
1.3.2 Split of Data – Train and Test
1.3.3 Linear Regression Model
1.3.4 Linear Regression Model with Sklearn Library
1.3.4.1 Linear Reg. Model Step1 – Data Split
1.3.4.2 Linear Reg. Model Step2 – Model Build
1.3.4.3 Linear Reg. Model Step3 – Checking Features' Coefficients & Intercept
1.3.4.4 Linear Reg. Model Step4 – R2 and RMSE (Prediction & Evaluation)
1.3.4.5 Model Status
1.3.5 Linear Regression Model with Statsmodel Library
1.3.5.1 Linear Reg. Model Step1 – Data Split
1.3.5.2 Linear Reg. Model Step2 – Null and Alternate Hypothesis
1.3.5.3 Linear Reg. Model Step2 – Model Build
1.3.5.4 Linear Reg. Model Step3 – Checking Coefficients & Hypothesis Check
1.3.5.5 Linear Reg. Model Step4 – R2 and RMSE (Prediction & Evaluation)
1.3.5.6 Test Data
1.3.5.7 Linear Reg. Model Sample Computation
1.4 Inference: Based on these predictions, what are the business insights and recommendations? Please explain and summarize the various steps performed in this project. There should be proper business interpretation and actionable insights present.
1.4.1 Model Insights
1.4.2 Recommendations
2 Problem 2 Statement
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
2.1.1 Data Summary
2.1.2 Descriptive Statistics
2.1.3 Sample Data
2.1.4 Univariate Analysis
2.1.4.1 Age
2.1.4.2 Education
2.1.4.3 Salary
2.1.4.4 Number of Young Children
2.1.4.5 Number of Older Children
2.1.4.6 Foreign
2.1.4.7 Holiday Package
2.1.5 Bivariate Analysis
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
2.2.1 Outlier Check and Cleanup
2.2.2 Conversion of Categorical Data
2.2.3 Split of Data – Train and Test
2.2.4 Logistic Regression Model
2.2.4.1 Log. Reg. Model Step1 - Data Split
2.2.4.2 Log. Reg. Model Step2 - Model Build
2.2.4.3 Log. Reg. Model Step4 - Best Model (Grid Search)
2.2.5 Linear Discriminant Analysis Model
2.2.5.1 LDA Model Step1 - Data Split
2.2.5.2 LDA Model Step2 - Model Build
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare both the models and write an inference on which model is best/optimized.
2.3.1 Confusion Matrix
2.3.1.1 Accuracy
2.3.1.2 Precision
2.3.1.3 Recall/Sensitivity
2.3.1.4 Specificity
2.3.1.5 F1 Score
2.3.1.6 Classification Report
2.3.1.7 ROC Curve and AUC Score
2.3.2 Model Performance Decisions
2.3.3 Logistic Regression – Model Performance
2.3.3.1 Training Dataset
2.3.3.2 Test Dataset
2.3.4 LDA – Model Performance
2.3.4.1 Training Dataset
2.3.4.2 Test Dataset
2.3.5 Model Compare – Analysis
2.4 Inference: Based on these predictions, what are the insights and recommendations? Please explain and summarize the various steps performed in this project. There should be proper business interpretation and actionable insights present.
List of Figures
Figure 1-1 Carat – Boxplot and Histogram
Figure 1-2 Depth – Boxplot and Histogram
Figure 1-3 Table – Boxplot and Histogram
Figure 1-4 X – Boxplot and Histogram
Figure 1-5 Y – Boxplot and Histogram
Figure 1-6 Z – Boxplot and Histogram
Figure 1-7 Price – Boxplot and Histogram
Figure 1-8 Cut – Count Plot
Figure 1-9 Average Price and Carat Weight-Cut Bar Plot
Figure 1-10 Cut – Clarity Distribution Plot
Figure 1-11 Cut – Color Distribution Plot
Figure 1-12 Clarity – Count Plot
Figure 1-13 Clarity – Average Price and Carat Weight Plot
Figure 1-14 Clarity – Color Distribution Plot
Figure 1-15 Color – Count Plot
Figure 1-16 Color – Average Price and Carat Weight Plot
Figure 1-17 Cubic Zirconia – Numerical Data Heat Map
Figure 1-18 Cubic Zirconia Pair Plot
Figure 1-19 Cubic Zirconia – Numerical Data Box Plot
Figure 1-20 Average Price and Carat Weight-Cut Bar Plot – Before Merge
Figure 1-21 Average Price and Carat Weight-Cut Bar Plot – After Merge
Figure 1-22 Simple Linear Regression
Figure 2-1 Age Data – Boxplot and Histogram
Figure 2-2 Education Data – Boxplot and Histogram
Figure 2-3 Salary Data – Boxplot and Histogram
Figure 2-4 Salary + Age Relationship
Figure 2-5 Salary + Education Relationship
Figure 2-6 Salary + Foreign Employee Relationship
Figure 2-7 Number of Young Children Data – Boxplot and Histogram
Figure 2-8 Number of Older Children Data – Boxplot and Histogram
Figure 2-9 Foreign Employees Count Plot
Figure 2-10 Holiday Package Opted Count Plot
Figure 2-11 Holiday Package Opted – Age Groups
Figure 2-12 Holiday Package – Children Status
Figure 2-13 Holiday Package – Average Salary & Foreign Employees
Figure 2-14 Travel Parameters Heat Map
Figure 2-15 Travel Data Parameters Pair Plot
Figure 2-16 Holiday Package – Numerical Data Box Plot
Figure 2-17 Confusion Matrix
Figure 2-18 Log. Reg. Training Data Confusion Matrix
Figure 2-19 Log. Reg. Training Data ROC-AUC Curve
Figure 2-20 Log. Reg. Test Data Confusion Matrix
Figure 2-21 Log. Reg. Test Data ROC-AUC Curve
Figure 2-22 LDA Training Data Confusion Matrix
Figure 2-23 LDA Training Data ROC-AUC Curve
Figure 2-24 LDA Test Data Confusion Matrix
Figure 2-25 LDA Test Data ROC-AUC Curve
Figure 2-26 All Model ROC-AUC Curve – Training Data
Figure 2-27 All Model ROC-AUC Curve – Test Data

List of Tables
Table 1-1 Data Dictionary for Cubic Zirconia Details
Table 1-2 Sample of Duplicated Cubic Zirconia Data
Table 1-3 Descriptive Statistics of P1 Data
Table 1-4 Sample Cubic Zirconia Data
Table 1-5 Cubic Zirconia – Stone Cut Distribution
Table 1-6 Highest Priced Cubic Zirconia Per Cut
Table 1-7 Largest Cubic Zirconia Per Cut
Table 1-8 Cubic Zirconia – Stone Clarity Distribution
Table 1-9 Highest Priced Cubic Zirconia Per Clarity
Table 1-10 Cubic Zirconia – Stone Color Distribution
Table 1-11 Highest Priced Cubic Zirconia Per Color
Table 1-12 Correlation Analysis of Dataset Parameters
Table 1-13 Cubic Zirconia Data with Null
Table 1-14 Cubic Zirconia Data with 0
Table 1-15 Cubic Zirconia Numerical Data - Outlier Details
Table 1-16 Descriptive Statistics Comparison After Outlier Imputation
Table 1-17 Descriptive Statistics Comparison After Null Imputation
Table 1-18 Color – Ordinal Data Merge
Table 1-19 Clarity – Ordinal Data Merge
Table 1-20 Cut – Ordinal Data Merge
Table 1-21 Good-Very Good Mean Data Comparison
Table 1-22 Very Good Mean Data – After Good-Very Good Merge
Table 1-23 VIF Values
Table 1-24 Categorical Values to Numerical Number Codes
Table 1-25 Lin Reg. - Computed Coefficients & Intercept for All Features
Table 1-26 Lin Reg. - Train Data R2, Adjusted R2 and RMSE
Table 1-27 Lin Reg. - Test Data R2, Adjusted R2 and RMSE
Table 1-28 Lin Reg. - Computed Coefficients & Intercept for All Features
Table 1-29 Lin Reg. - Train Data R2, Adjusted R2 and RMSE
Table 1-30 Lin Reg. - Test Data R2 and RMSE
Table 2-1 Data Dictionary for Tour and Travel Agency
Table 2-2 Descriptive Statistics of Holiday Package Data
Table 2-3 Sample Holiday Package Data
Table 2-4 Employee Age Bracket – Average Salary Data
Table 2-5 Employee Education Bracket – Average Salary Data
Table 2-6 Foreign & Local Employees Salary and Age Distribution
Table 2-7 Foreign & Local Employees Salary and Education Distribution
Table 2-8 Foreign Employees Data Distribution
Table 2-9 Holiday Package Opted - Data Distribution
Table 2-10 Holiday Package Numerical Data - Outlier Details
Table 2-11 Descriptive Statistics Comparison After Outlier Imputation
Table 2-12 Categorical Values to Numerical Number Codes
Table 2-13 Sample Classification Report
Table 2-14 Case 1 - Metric Importance
Table 2-15 Case 2 - Metric Importance
Table 2-16 Log. Reg. Training Data Classification Report
Table 2-17 Log. Reg. Test Data Classification Report
Table 2-18 LDA Training Data Classification Report
Table 2-19 LDA Test Data Classification Report
Table 2-20 All Model Metrics Comparison

List of Formulae
Formula 1-1 Simple Linear Regression
Formula 1-2 Linear Regression – Adjusted R2
Formula 1-3 Linear Regression – RMSE Computation
Formula 1-4 Linear Regression Model Price Computation
Formula 2-1 Confusion Matrix – Accuracy
Formula 2-2 Confusion Matrix – Precision
Formula 2-3 Confusion Matrix – Recall
Formula 2-4 Confusion Matrix – Specificity
Formula 2-5 Confusion Matrix – F1 Score

1 Problem 1 Statement
You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are provided with a
dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (cubic
zirconia is an inexpensive diamond alternative with many of the same qualities as a diamond).
The company earns different profits on different price slots. You have to help the company
predict the price of a stone on the basis of the details given in the dataset, so that it can
distinguish between more profitable and less profitable stones and thereby improve its profit
share. Also, provide the company with the five most important attributes.

Variable Name – Description
Carat – Carat weight of the cubic zirconia.
Cut – Describes the cut quality of the cubic zirconia. In increasing order of quality: Fair, Good, Very Good, Premium, Ideal.
Color – Color of the cubic zirconia, with D being the best and J the worst.
Clarity – Refers to the absence of inclusions and blemishes. In order from best to worst: IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth – The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table – The width of the cubic zirconia's table expressed as a percentage of its average diameter.
X – Length of the cubic zirconia in mm.
Y – Width of the cubic zirconia in mm.
Z – Height of the cubic zirconia in mm.
Price* – Price of the cubic zirconia.
Table 1-1 Data Dictionary for Cubic Zirconia Details

* - Target Variable

1.1 Read the data and do exploratory data analysis. Describe the data briefly.
(Check the null values, Data types, shape, EDA, duplicate values). Perform
Univariate and Bivariate Analysis.
1.1.1 Data Summary
The summary describes the data type and the number of data entries in each of the columns in
the dataset. The presence of null data and duplicated data is also noted.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26967 entries, 0 to 26966
Data columns (total 11 columns):


# Column Non-Null Count Dtype


--- ------ -------------- -----
0 Unnamed: 0 26967 non-null int64
1 Carat 26967 non-null float64
2 Cut 26967 non-null object
3 Color 26967 non-null object
4 Clarity 26967 non-null object
5 Depth 26270 non-null float64
6 Table 26967 non-null float64
7 X 26967 non-null float64
8 Y 26967 non-null float64
9 Z 26967 non-null float64
10 Price 26967 non-null int64
dtypes: float64(6), int64(2), object(3)
memory usage: 2.3+ MB

1. There are a total of 11 columns and 26,967 rows.
2. Three of the columns hold categorical data.
3. Seven of the remaining columns hold numerical data used for the analysis; the eighth numerical column, 'Unnamed: 0', is a serial number (see point 5).
4. The Depth column holds 697 null values, which need to be addressed.
5. The column 'Unnamed: 0', which held serial numbers from 1 to 26967, is dropped after the initial data read. It is not used for further processing as it does not contribute to the Price computation.
6. There are 34 rows in the data which hold duplicated values ('Unnamed: 0' was the unique identifier; duplicates were identified after dropping it).
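
A minimal sketch of the checks described above (the file name and DataFrame variable are assumptions for illustration, not taken from the report):

import pandas as pd

# Load the dataset; the file name is assumed for illustration.
df = pd.read_csv("cubic_zirconia.csv")

# Drop the serial-number column, since it does not contribute to Price.
df = df.drop(columns=["Unnamed: 0"])

# Shape, data types and non-null counts per column.
df.info()

# Null values per column and the count of fully duplicated rows.
print(df.isnull().sum())
print(df.duplicated().sum())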

1.1.2 Duplicated Data Summary

There are a total of 34 rows holding duplicated data, which constitutes around 0.13% of the
entire dataset.
Without a domain expert's input, it cannot be concluded whether the duplicated records are
valid or invalid. Since the duplicated data forms such a small percentage of the total dataset, it
is safely excluded (dropped) before model creation. Once the duplicated data is removed,
26,933 rows remain for model creation and evaluation.
A sample of the duplicated pairs is shown below.

Carat Cut Color Clarity Depth Table X Y Z Price
8482 0.3 Good J VS1 63.4 57 4.23 4.26 2.69 394
19731 0.3 Good J VS1 63.4 57 4.23 4.26 2.69 394
10894 1.8 Ideal H VS1 62.3 56 7.79 7.76 4.84 15105
20760 1.8 Ideal H VS1 62.3 56 7.79 7.76 4.84 15105
12631 2.54 Very Good H SI2 63.5 56 8.68 8.65 5.5 16353
26191 2.54 Very Good H SI2 63.5 56 8.68 8.65 5.5 16353
Table 1-2 Sample of Duplicated Cubic Zirconia Data

1.1.3 Descriptive Statistics


The descriptive statistics of all the data (after removing duplicates) are summarized below.

mean std min 25% 50% 75% max mode


Carat 0.798 0.477 0.2 0.4 0.7 1.05 4.5 0.3
Depth 61.745 1.412 50.8 61 61.8 62.5 73.6 62
Table 57.456 2.232 49 56 57 59 79 56
X 5.729 1.127 0 4.71 5.69 6.55 10.23 4.38
Y 5.733 1.165 0 4.71 5.7 6.54 58.9 4.35
Z 3.538 0.72 0 2.9 3.52 4.04 31.8 2.69,2.7
Price 3937.526 4022.552 326 945 2375 5356 18818 544
Table 1-3 Descriptive Statistics of P1 Data

1. We observe that the min/max values of Carat, Depth, Table and Price are valid.
2. The X, Y and Z columns have a minimum value of 0, which may need checking with the jeweler.
3. The largest cubic zirconia has a weight of 4.5 carats, which is significantly larger than the 75th percentile of carat weight (1.05). This indicates the presence of outliers, which need to be checked and corrected.
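
The statistics above can be reproduced along these lines (a sketch assuming the DataFrame df from the earlier load; describe() does not report the mode, so it is appended separately):

# Drop the duplicated rows first, as described in section 1.1.2.
df = df.drop_duplicates()

# Transposed descriptive statistics for the numeric columns.
stats = df.describe().T

# Append the mode of each numeric column as an extra column.
stats["mode"] = df.mode(numeric_only=True).iloc[0]
print(stats)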

1.1.4 Sample Data


A sample of the dataset (after dropping the 'Unnamed: 0' column) is shown below.

Carat Cut Color Clarity Depth Table X Y Z Price


0 0.3 Ideal E SI1 62.1 58 4.27 4.29 2.66 499
1 0.33 Premium G IF 60.8 58 4.42 4.46 2.7 984
2 0.9 Very Good E VVS2 62.2 60 6.04 6.12 3.78 6289
3 0.42 Ideal F VS1 61.6 56 4.82 4.8 2.96 1082
4 0.31 Ideal F VVS1 60.4 59 4.35 4.43 2.65 779
Table 1-4 Sample Cubic Zirconia Data

1.1.5 Univariate Analysis


1.1.5.1 Carat Column
Distribution: Does not follow a normal distribution with multiple peaks seen in the data.
Skew: Data is highly skewed to the right as evidenced by the skewness value.


Figure 1-1 Carat – Boxplot and Histogram

Skewness: 1.11
Outliers: Present on only the upper whisker end.
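
Each boxplot-and-histogram pair in this section can be produced with a helper along these lines (a sketch using seaborn and matplotlib; the function name is an assumption):

import matplotlib.pyplot as plt
import seaborn as sns

def univariate_plot(df, col):
    # Boxplot above, histogram with a KDE below, sharing the x axis.
    fig, (ax_box, ax_hist) = plt.subplots(2, 1, sharex=True)
    sns.boxplot(x=df[col], ax=ax_box)
    sns.histplot(df[col], kde=True, ax=ax_hist)
    ax_box.set_title(f"{col} - Boxplot and Histogram")
    plt.show()
    # Skewness value quoted in the text.
    print(f"Skewness of {col}: {df[col].skew():.2f}")

univariate_plot(df, "Carat")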
1.1.5.2 Depth Column
Distribution: Follows almost a pure normal distribution.
Skew: Data is fairly symmetrical and not skewed.
Skewness: -0.03
Outliers: Present on both the upper and lower whisker ends.

Figure 1-2 Depth – Boxplot and Histogram

1.1.5.3 Table Column


Distribution: Does not follow a normal distribution with the presence of multiple peaks in the
data.
Skew: The data is moderately skewed; there is no long tail on either side of the data.


Figure 1-3 Table – Boxplot and Histogram

Skewness: 0.77
Outliers: Present on both the upper and lower whisker ends.
1.1.5.4 X Column
Distribution: Does not follow a normal distribution and data seems to follow a random
distribution with multiple peaks.
Skew: The data is moderately skewed; there is no long tail on either side of the data.
Skewness: 0.39
Outliers: Present on both the upper and lower whisker ends.

Figure 1-4 X – Boxplot and Histogram

1.1.5.5 Y Column
Distribution: Does not follow a normal distribution and data seems to follow a random
distribution with multiple peaks.
Skew: Highly skewed data.
Skewness: 3.87


Figure 1-5 Y – Boxplot and Histogram

Outliers: Present on both the upper and lower whisker ends.


1.1.5.6 Z Column
Distribution: Does not follow a normal distribution and data seems to follow a random
distribution with multiple peaks.
Skew: Highly skewed data.
Skewness: 2.58
Outliers: Present on both the upper and lower whisker ends.

Figure 1-6 Z – Boxplot and Histogram

1.1.5.7 Price Column


Distribution: Roughly bell-shaped, but with a long tail to the right.
Skew: Highly skewed to the right, i.e. the mean is greater than the median.
Skewness: 1.62


Figure 1-7 Price – Boxplot and Histogram

Outliers: Present on only the upper whisker end.

1.1.5.8 Cut Column


Unique values of the Cut column have been displayed in the count plot to show their frequency
of occurrence. They have been ordered by quality.

Cut Number of Occurrences Percentage


Ideal 10805 40.11
Premium 6886 25.56
Very Good 6027 22.37
Good 2435 9.04
Fair 780 2.89
Table 1-5 Cubic Zirconia – Stone Cut Distribution

The count plot and table clearly show that about 40% of the cubic zirconia have an Ideal cut,
which is beneficial for both the vendor and the customer.

Figure 1-8 Cut – Count Plot
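
A sketch of how the frequency table and count plot above can be generated (the quality ordering is taken from Table 1-5):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Frequency and percentage of each cut.
counts = df["Cut"].value_counts()
percent = (counts / len(df) * 100).round(2)
print(pd.concat([counts, percent], axis=1, keys=["Count", "Percent"]))

# Count plot ordered by cut quality.
cut_order = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
sns.countplot(x="Cut", data=df, order=cut_order)
plt.show()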


The mean price of each of the different cuts of the cubic zirconia has been plotted.

• Although Ideal is the best available cut and the largest count of cubic zirconia with the vendor are Ideal cut, the average price for the Ideal cut is the lowest. This is explained by the fact that the mean carat size for Ideal cut cubic zirconia is the lowest.
• The Fair cut has the highest average price. As the average carat size of Fair cut cubic zirconia is the highest, this explains the high average price.
• The average price and average carat weight for Good and Very Good cuts are very close, indicating that the two may be considered almost the same.

Figure 1-9 Average Price and Carat Weight-Cut Bar Plot

The highest priced stone for each cut has been tabulated. The price range is quite narrow; all
the highest priced stones are around 2 carats or larger, have good clarity and are of a high
color grade.

Carat Cut Color Clarity Depth Table X Y Z Price


2.01 Fair G SI1 70.6 64 7.43 6.64 4.69 18574
2.07 Good I VS2 61.8 61 8.12 8.16 5.03 18707
2 Very Good G SI1 63.5 56 7.9 7.97 5.04 18818
2.04 Premium H SI1 58.1 60 8.37 8.28 4.84 18795
2.07 Ideal G SI2 62.5 55 8.2 8.13 5.11 18804
Table 1-6 Highest Priced Cubic Zirconia Per Cut

The largest carat size for each cut has also been tabulated. It is very interesting to note that the
largest stones are not the most expensive stones for each cut. This indicates that other quality
parameters, like color and clarity, also play a major role in deciding the price of a cubic zirconia.

Carat Cut Color Clarity Depth Table X Y Z Price


4.5 Fair J I1 65.8 58 10.23 10.16 6.72 18531
3.01 Good I SI2 63.9 60 9.06 9.01 5.77 18242
4 Very Good I I1 63.3 58 10.01 9.94 6.31 15984
4.01 Premium J I1 62.5 62 10.02 9.94 6.24 15223


3.5 Ideal H I1 62.8 57 9.65 9.59 6.03 12587


Table 1-7 Largest Cubic Zirconia Per Cut

Distribution of the cubic zirconia clarity available for each cut, inference:

• Fair cut has the largest share of I1 stones, accounting for almost 11.4% of all Fair cut stones.
• 52% of the Fair cut cubic zirconia are SI1 or SI2.
• Very Good and Good cuts have a high number of SI1 clarity stones compared to all other types.
• Premium and Ideal cuts have a large number of SI1 and VS2 cubic zirconia.
• Ideal cut has the largest number of IF clarity stones compared to all other cuts.


Figure 1-10 Cut –Clarity Distribution Plot

Distribution of the cubic zirconia color available across each cut, inference:

• The distribution across the cubic zirconia color spectrum is relatively similar for all cuts.
• In each cut type, the lowest number of cubic zirconia is attributed to the J (worst) color.
• All cuts have a moderate percentage of D colored cubic zirconia.
• E, F, G and H are the most common colors.

Figure 1-11 Cut –Color Distribution Plot


1.1.5.9 Clarity Column


Unique values of the Clarity column have been displayed in the count plot to show their
frequency of occurrence. They have been ordered by quality, with IF (Internally Flawless) being
the best quality of stone available and I1 being a lower quality with noticeable inclusions.

• 24.3% of cubic zirconia have SI1 clarity, closely followed by VS2 at 22.6%.
• Only 3.3% of all the stones have IF clarity.
• The lowest percentage of stones have I1 clarity.

Clarity (Best to Lowest) Number of Occurrences Percentage


IF 891 3.30
VVS1 1839 6.82
VVS2 2530 9.39
VS1 4087 15.17
VS2 6093 22.62
SI1 6565 24.37
SI2 4564 16.94
I1 364 1.35
Table 1-8 Cubic Zirconia – Stone Clarity Distribution

Figure 1-12 Clarity – Count Plot

The highest priced stone for each clarity has been tabulated. It is seen that clarity has a strong
impact on the price.
The carat value of the highest priced IF clarity cubic zirconia (1.5 carats) is significantly smaller
than the carat value of an equally priced I1 clarity (4.5 carats) or SI2 clarity (2.07 carats) stone.

Carat Cut Color Clarity Depth Table X Y Z Price


1.5 Very Good F IF 63.2 58 7.2 7.32 4.59 18552


1.7 Ideal G VVS1 61 56 7.67 7.62 4.66 18445


1.7 Premium G VVS2 59.8 59 7.7 7.75 4.62 18718
2 Premium I VS1 60.8 59 8.13 8.02 4.91 18795
1.71 Premium F VS2 62.3 59 7.57 7.53 4.7 18791
2 Very Good G SI1 63.5 56 7.9 7.97 5.04 18818
2.07 Ideal G SI2 62.5 55 8.2 8.13 5.11 18804
4.5 Fair J I1 65.8 58 10.23 10.16 6.72 18531
Table 1-9 Highest Priced Cubic Zirconia Per Clarity

A side-by-side plot of the average price and average carat size for each cubic zirconia clarity
grade is displayed below.

• Carat size and price look to be directly related to each other, with both plots following a similar distribution.
• The exception to the above statement is I1 clarity, where the average price is low compared to the average carat weight.

Figure 1-13 Clarity – Average Price and Carat weight Plot

Distribution of color vs clarity of the cubic zirconia, inference:

• Each color is fairly evenly distributed across each clarity.
• Color G is consistently high for all clarities.
• Color J is low for all clarities.


Figure 1-14 Clarity –Color Distribution Plot


1.1.5.10 Color Column


Unique values of the Color column have been displayed in the count plot to show their frequency
of occurrence. They have been ordered by quality, with D being a colorless gem and J a near
colorless gem.

Color (Best to Lowest) Number of Occurrences Percentage


D 3341 12.40
E 4916 18.25
F 4723 17.53
G 5653 20.98
H 4095 15.20
I 2765 10.26
J 1440 5.34
Table 1-10 Cubic Zirconia – Stone Color Distribution

• Stones labelled J (lowest in the colorless spectrum) make up only 5.3% of all the stones.
• Most of the cubic zirconia (about 56.8%) are classified into the mid color spectrum (E-F-G).

Figure 1-15 Color – Count Plot

The highest priced stone for each color has been tabulated. It is seen that color has a strong
impact on the price.
The carat value of the highest priced D color (top of the spectrum) cubic zirconia (2.14 carats) is
significantly smaller than the carat value of an equally priced J color stone (3.51 carats).

Carat Cut Color Clarity Depth Table X Y Z Price


2.14 Very Good D SI2 60.3 60 8.31 8.43 5.05 18526
2.02 Very Good E SI1 59.8 59 8.11 8.2 4.88 18731
1.71 Premium F VS2 62.3 59 7.57 7.53 4.7 18791
2 Very Good G SI1 63.5 56 7.9 7.97 5.04 18818


2.04 Premium H SI1 58.1 60 8.37 8.28 4.84 18795


2 Premium I VS1 60.8 59 8.13 8.02 4.91 18795
3.51 Premium J VS2 62.5 59 9.66 9.63 6.03 18701
Table 1-11 Highest Priced Cubic Zirconia Per Color

As seen in the clarity analysis, the average price for each color grade is directly related to its
average carat size, as evidenced by the bar plot below.

Figure 1-16 Color – Average Price and Carat weight Plot


1.1.6 Bivariate Analysis


The relationship between the different numerical columns of the dataset can be visualized with
a pair plot. In addition, a heat map of the correlations also lets us understand the degree of
correlation between the data columns. Both the pair plot and the heat map for all the parameters
have been constructed and placed below.
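
Both plots can be generated with seaborn; a minimal sketch (assuming the cleaned DataFrame df; the pair plot can be slow on ~27,000 rows):

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations of the numeric columns.
corr = df.corr(numeric_only=True)

# Heat map with the correlation value annotated in each cell.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# Pair plot of all numeric columns.
sns.pairplot(df)
plt.show()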

Figure 1-17 Cubic Zirconia – Numerical Data Heat Map

The correlation between each pair of columns, along with a brief analysis of the pair plot, is
tabulated below.

Col1–Col2 (Corr. Value): Correlation Analysis
Carat–Depth (0.04): Very low degree of positive correlation; pair plot shows no distinguishable pattern.
Carat–Table (0.18): Very low degree of positive correlation; pair plot shows no distinguishable pattern.
Carat–X (0.98): High positive correlation; pair plot shows that as X increases, so does carat weight; some outliers visible in the pair plot.
Carat–Y (0.94): High positive correlation; as Y increases, so does carat weight; some outliers visible.
Carat–Z (0.94): High positive correlation; as Z increases, so does carat weight; some outliers visible.
Carat–Price (0.92): High positive correlation; as carat weight increases, so does the price; some outliers visible.
Depth–Table (-0.30): Low degree of negative correlation; no distinguishable pattern; some outliers visible.
Depth–X (-0.02): Very low degree of negative correlation; no distinguishable pattern; some outliers visible.
Depth–Y (-0.02): Very low degree of negative correlation; no distinguishable pattern; some outliers visible.
Depth–Z (0.10): Very low degree of positive correlation; no distinguishable pattern.
Depth–Price (-0.00): No correlation captured.
Table–X (0.20): Very low degree of positive correlation; no distinguishable pattern; some outliers visible.
Table–Y (0.18): Very low degree of positive correlation; no distinguishable pattern; some outliers visible.
Table–Z (0.15): Very low degree of positive correlation; no distinguishable pattern; some outliers visible.
Table–Price (0.13): Very low degree of positive correlation; no distinguishable pattern.
X–Y (0.96): High positive correlation; as X increases, so does Y; some outliers visible.
X–Z (0.96): High positive correlation; as X increases, so does Z; some outliers visible.
X–Price (0.89): High positive correlation; as X increases, so does the price; some outliers visible.
Y–Z (0.93): High positive correlation; as Y increases, so does Z; some outliers visible.
Y–Price (0.96): High positive correlation; as Y increases, so does the price; some outliers visible.
Z–Price (0.85): High positive correlation; as Z increases, so does the price; some outliers visible.
Table 1-12 Correlation Analysis of Dataset Parameters

The heat map indicates that there is a degree of multi-collinearity between the numerical
variables.


Figure 1-18 Cubic Zirconia Pair Plot


1.2 Impute null values if present, also check for the values which are equal to zero.
Do they have any meaning or do we need to change them or drop them? Check
for the possibility of combining the sub levels of ordinal variables and take
actions accordingly. Explain why you are combining these sub levels with
appropriate reasoning.
1.2.1 Null Values and Zero Values Check
1. The Depth column has 697 null values.

Carat Cut Color Clarity Depth Table X Y Z Price
1 0.34 Ideal D SI1 NaN 57 4.5 4.44 2.74 803
2 0.74 Ideal E SI2 NaN 59 5.92 5.97 3.52 2501
... ... ... ... ... ... ... ... ... ...
696 0.51 Ideal D VS2 NaN 57 5.12 5.09 3.18 1882
697 1.1 Very Good D SI2 NaN 63 6.76 6.69 3.94 4361
Table 1-13 Cubic Zirconia Data with Null

2. The X and Y columns each have 2 values equal to 0.
3. The Z column has 8 values equal to 0.

Carat Cut Color Clarity Depth Table X Y Z Price


0.71 Good F SI2 64.1 60 0 0 0 2130
2.02 Premium H VS2 62.7 53 8.02 7.95 0 18207
2.2 Premium H SI1 61.2 59 8.42 8.37 0 17265
2.18 Premium H SI2 59.4 61 8.49 8.45 0 12631
1.1 Premium G SI2 63 59 6.5 6.47 0 3696
1.14 Fair G VS1 57.5 67 0 0 0 6381
1.01 Premium H I1 58.1 59 6.66 6.6 0 3167
1.12 Premium G I1 60.4 59 6.71 6.67 0 2383
Table 1-14 Cubic Zirconia Data with 0
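
A sketch of the zero-value check described above (assuming the DataFrame df):

# Count of zero values in the physical-dimension columns.
print((df[["X", "Y", "Z"]] == 0).sum())

# Rows in which any of X, Y or Z is zero.
print(df[(df[["X", "Y", "Z"]] == 0).any(axis=1)])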

1.2.2 Outlier Values Check and Imputation


All the numerical data in the data set have outliers. This has been visualized in the box plot.


Figure 1-19 Cubic Zirconia – Numerical Data Box Plot

The linear regression model is highly sensitive to outliers. In order to avoid the outliers influencing
the model, we need to treat them. There are different ways to treat outliers:
1. Trimming the data: The rows of data holding the outliers are removed from the dataset.
Although this method is direct and simple, there is a loss of information which may cause
issues in the resulting model.
2. Imputing the data: The outliers are replaced with acceptable values, for example by capping
the data or replacing the outliers with the median value. Imputation ensures that data is not
lost in the process of outlier treatment.
In this case, we impute the data with capped values (see the sketch below). Once the imputation
is complete, we check whether all the outliers have been treated.
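
A minimal sketch of the capping approach, assuming the standard 1.5 x IQR whiskers (this assumption is consistent with the 'After' values in the table below, e.g. Carat capped at 2.025):

def cap_outliers(df, col):
    # Compute the IQR-based whisker limits for the column.
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Clip values outside the whiskers back to the whisker values.
    df[col] = df[col].clip(lower=lower, upper=upper)

for col in ["Carat", "Depth", "Table", "X", "Y", "Z", "Price"]:
    cap_outliers(df, col)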

Column Number of Outliers Position of Outliers
Carat 657 Above upper whisker
Depth 1209 Above upper and below lower whisker
Table 318 Above upper and below lower whisker
X 14 Above upper and below lower whisker
Y 14 Above upper and below lower whisker
Z 22 Above upper and below lower whisker
Price 1778 Above upper whisker
Table 1-15 Cubic Zirconia Numerical Data - Outlier Details

We further check the descriptive statistics of the columns after the imputation to see if there are
any major changes.

Carat
Before & After mean std min 25% 50% 75% max
Before 0.79 0.47 0.2 0.4 0.7 1.05 4.5
After 0.79 0.46 0.2 0.4 0.7 1.05 2.025
Depth
Before & After mean std min 25% 50% 75% max
Before 61.74 1.41 50.8 61 61.8 62.5 73.6
After 61.74 1.25 58.75 61 61.8 62.5 64.75
Table
Before & After mean std min 25% 50% 75% max
Before 57.45 2.23 49 56 57 59 79
After 57.43 2.15 51.5 56 57 59 63.5
X
Before & After mean std min 25% 50% 75% max
Before 5.72 1.12 0 4.71 5.69 6.55 10.23
After 5.72 1.12 1.95 4.71 5.69 6.55 9.31
Y
Before & After mean std min 25% 50% 75% max
Before 5.73 1.16 0 4.71 5.7 6.54 58.9
After 5.73 1.11 1.965 4.71 5.7 6.54 9.285
Z
Before & After mean std min 25% 50% 75% max
Before 3.53 0.71 0 2.9 3.52 4.04 31.8
After 3.53 0.69 1.19 2.9 3.52 4.04 5.75
Price
Before & After mean std min 25% 50% 75% max
Before 3937.52 4022.55 326 945 2375 5356 18818
After 3735.83 3468.20 326 945 2375 5356 11972.5
Table 1-16 Descriptive Statistics Comparison After Outlier Imputation

1. All columns have their maximum value reduced, some of the columns (X, Y, Z, Price) by a
significant amount.
2. The mean value of the Price column shows a noticeable change.
3. As X, Y and Z have their minimum values imputed (capped at the lower whisker), there are
no longer any 0 values present in the dataset.

1.2.3 Null Values and Zero Values Imputation


1. There are no longer any zero values present in the X, Y and Z columns after the outlier
correction.
2. The null values seen in Depth are corrected by imputation with the column's median value
of 61.8 (see the sketch after the table below).
On reviewing the descriptive statistics of the Depth column after the imputation, we see that
there is no major change.

Depth
Before & After mean std min 25% 50% 75% max
Before 61.74 1.41 50.8 61 61.8 62.5 73.6
After 61.74 1.24 58.75 61.1 61.8 62.5 64.75
Table 1-17 Descriptive Statistics Comparison After Null Imputation
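
A sketch of the median imputation described above (the median, 61.8, is computed from the column itself):

# Fill the 697 missing Depth values with the column median (61.8).
df["Depth"] = df["Depth"].fillna(df["Depth"].median())

# Confirm that no null values remain.
print(df["Depth"].isnull().sum())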

1.2.4 Sub-Level Merge for Ordinal Data


Ordinal data is a type of data in which variables exist in naturally occurring ordered categories
i.e. data is classified into categories within a variable that have a natural rank order. The distances
between the categories are uneven or unknown.
All three of the categorical columns are ordinal in nature.
1.2.4.1 Color and Clarity
Color (Best to
Significance
Worst Quality)
D 1. The color grade for a stone is set based on its appearance and will be
E documented in its certificate
F 2. Although D-F are ‘colorless’ and G-J are ‘near colorless’, these grades
G cannot be changed without significantly altering the data sanctity
H 3. The price ranges between the different color of diamonds are not very
I close. D and E are close to each other (Figure 1-16 Color – Average Price
and Carat weight Plot) D is top rated whereas E is not.
J Conclusion: There can be no merging of any of the color ordinal values.
Table 1-18 Color – Ordinal Data Merge

Clarity (Best to Worst Quality): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1
Significance:
1. The clarity grade for a stone is set based on its appearance and is documented in its certificate.
2. Each grade is assigned based on the amount of impurities present in the stone, and grades cannot be changed without significantly altering the data sanctity.
Conclusion: None of the clarity ordinal values can be merged.
Table 1-19 Clarity – Ordinal Data Merge

1.2.4.2 Cut
Cut (Best to Worst Quality): Ideal, Premium, Very Good, Good, Fair
Significance:
1. From the analysis in 1.1.5.8 Cut Column, it is clearly seen that the price and carat distributions for Good and Very Good cuts are very close.
2. The grading of these two cuts is also adjacent.
Conclusion: Good and Very Good can be merged into the single value Very Good, because there are more stones classified as Very Good than as Good.
Table 1-20 Cut – Ordinal Data Merge

Before proceeding with the merge, a comparative analysis of the mean values of all the
numerical columns for Good and Very Good is given below. The two cuts are very close across
all columns and so may be interchangeably classified.

Good Very Good


Price 3770.3969 3829.8435
Carat 0.8452 0.8095
Depth 62.4515 61.8321
Table 58.6431 57.9623
X 5.8421 5.7522
Y 5.8568 5.7815
Z 3.6452 3.5652
Table 1-21 Good-Very Good Mean Data Comparison

Once the Good values have been replaced with Very Good (merging the two ordinal levels), the
mean carat and price values have been re-plotted. Below are the before and after plots and the
inferences that can be drawn from them.
1. There is no significant difference in the mean carat weight after the merge.
2. There is a slight reduction in the mean price of the Very Good cut post merge.


Figure 1-20 Average Price and Carat Weight-Cut Bar Plot – Before Merge

Figure 1-21 Average Price and Carat Weight-Cut Bar Plot – After Merge

The mean values generated after the merge are as below; a sketch of the merge operation follows the table.

Very Good
Price 3812.7373
Carat 0.8198
Depth 62.0104
Table 58.1582
X 5.7781
Y 5.8032
Z 3.5882
Table 1-22 Very Good Mean Data – After Good-Very Good Merge
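
The merge itself is a one-line replacement; a sketch (assuming the DataFrame df):

# Merge the 'Good' sub-level into 'Very Good', as reasoned above.
df["Cut"] = df["Cut"].replace({"Good": "Very Good"})

# Re-check the mean price and carat per cut after the merge.
print(df.groupby("Cut")[["Price", "Carat"]].mean())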


1.2.5 Data Multi-Collinearity Check- VIF


Variance inflation factor (VIF) is a measure of the amount of multi-collinearity within a set of
independent regression variables. A high VIF indicates that the associated independent
variable is highly collinear with the other variables in the model.
From the heat map, we have already seen that a degree of multi-collinearity is present in the
dataset. We further confirm this by computing the VIF for the cubic zirconia dataset.

Columns VIF
Carat 108.768
Depth 801.821
Table 643.029
X 9985.527
Y 9191.05
Z 1774.115
Table 1-23 VIF Values

These VIF values are extremely high (a VIF above 5-10 is commonly taken to indicate problematic collinearity), which is not preferable.
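
A sketch of the VIF computation with statsmodels (whether a constant column was included in the report's computation is not stated; this version, without one, is consistent with the very large values above):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Numeric predictors only; the target Price is excluded.
X_num = df[["Carat", "Depth", "Table", "X", "Y", "Z"]]

# VIF of each predictor regressed against all the others.
for i, col in enumerate(X_num.columns):
    print(col, variance_inflation_factor(X_num.values, i))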


1.3 Encode the data (having string values) for Modelling. Split the data into train
and test (70:30). Apply Linear regression using scikit learn. Perform checks for
significant variables using appropriate method from statsmodel. Create
multiple models and check the performance of Predictions on Train and Test
sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the
best one with appropriate reasoning.
1.3.1 Encoding of Categorical Data
Models are developed on numerical data. With this predicate, we convert the unique values in
the categorical columns into numerical values. This converted data is used in the modelling
operations.
The cubic zirconia dataset has 3 categorical variables which are ordinal in nature. We convert
these columns to be numerical in nature before using it to build the prediction model.

Cut Numerical Conversion


Ideal 4
Premium 3
Very Good 2
Fair 1
Clarity Numerical Conversion
IF 8
VVS1 7
VVS2 6
VS1 5
VS2 4
SI1 3
SI2 2
I1 1
Color Numerical Conversion
D 7
E 6
F 5
G 4
H 3
I 2
J 1
Table 1-24 Categorical Values to Numerical Number Codes
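
A sketch of the encoding, applying the ordinal codes of Table 1-24 with pandas map():

# Ordinal codes from Table 1-24 (Good was merged into Very Good earlier).
cut_map = {"Ideal": 4, "Premium": 3, "Very Good": 2, "Fair": 1}
clarity_map = {"IF": 8, "VVS1": 7, "VVS2": 6, "VS1": 5,
               "VS2": 4, "SI1": 3, "SI2": 2, "I1": 1}
color_map = {"D": 7, "E": 6, "F": 5, "G": 4, "H": 3, "I": 2, "J": 1}

df["Cut"] = df["Cut"].map(cut_map)
df["Clarity"] = df["Clarity"].map(clarity_map)
df["Color"] = df["Color"].map(color_map)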


1.3.2 Split of Data – Train and Test


The train-test split operation is performed for classification and regression problems and is used
with any supervised learning algorithm. In this operation, the dataset is randomly divided into
two subsets.
1. Subset 1 is used to train the model and is named the training dataset.
2. Subset 2 is used to test the created model and is therefore named the test dataset.
3. The model is trained on Subset 1, which includes the independent data and the target
data.
4. The trained model is then given the independent data of Subset 2 as input, and it generates
predictions as output.
5. The predictions made by the model are then compared against the expected values, i.e.
the target data of Subset 2.
The comparison between the expected values and the model predicted values is used to evaluate
the model performance.
There are two main configuration parameters that are used to create the training and test data
subsets.

• test_size: This is the size of the test set. It is expressed as a proportion between 0 and 1.
The specified fraction of the data will be collected into the test subset.
To give an example, suppose a dataset has a total of 3000 rows. On specifying
test_size=0.3, the resultant test subset has 900 data points (30% of 3000) and the training
subset shall hold 2100 entries (70% of 3000).
• random_state: This input is used to initialize the internal random number generator,
which decides how to split the data into train and test subsets. This input should be set
to the same value, if the same consistency is to be expected over multiple runs of the
code.
The splitting of the data is done using the train_test_split function from the python module
sklearn.

1. For the cubic zirconia dataset, the execution of the split operation is performed with the
inputs random_state=1, test_size=0.3
2. The test and train data subset of independent input variables have their content as
follows:
Training subset = 18853 rows, 9 columns
Test subset = 8080 rows, 9 columns

3. The test and train subset target variable details as given by the train_test_split
operation are as below:


Training subset target data = 18853 rows, 1 column


Test subset target data = 8080 rows, 1 column
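A minimal sketch of the split described above (X and y are assumed names for the independent variables and the Price target respectively):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

print(X_train.shape, X_test.shape)  # expected: (18853, 9) (8080, 9)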

1.3.3 Linear Regression Model


Linear regression is a simple and direct supervised regression method used for predictive
analysis. It models the linear relationship between the independent (input) variables and the
dependent (target) variable, hence the name linear regression.
In case of a single input variable, such linear regression is called simple linear regression. With
multiple input variables, it is referred to as multiple linear regression.
The graph presents the linear relationship between the independent variable x and the
dependent variable y. When the value of x (independent variable) increases, the value of y
(dependent variable) likewise increases. The linear regression model gives a sloped straight
line, referred to as the best fit line, describing the relationship between the variables.
Figure 1-22 Simple Linear Regression

Mathematically, a simple linear regression equation is written as below. In the case of multiple
regression, there will be multiple slopes, one for each of the independent variables.

$y = b_0 + b_1 x$
$b_0$ = intercept
$b_1$ = slope
Formula 1-1 Simple Linear Regression

The model gets the best regression fit line by computing the best values for the slope(s) and
intercept, typically by minimizing the residual sum of squares (ordinary least squares).
We test the model performance with two factors which are the R2 and the Root Mean Squared
Error (RMSE).
In statistics, the coefficient of determination, R2, is the proportion of the variation in the
dependent variable that is predictable from the independent variable(s). Simply stated, it is a
score which tells how well the regression model fits the observed data. For example, if a model


gives an R2 score of 75%, it indicates that around 75% of the variation in the target is explained
by the generated regression model. Generally, a higher R-squared indicates a better fit for the model.
Adjusted R2 is a modified version of R2 that has been adjusted for the number of predictors in the
model. This value increases when the new term improves the model more than would be
expected by chance. It decreases when a predictor improves the model by less than expected.

$Adjusted\ R^2 = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1}$
𝑁 = Number of records in data set
𝑝 = number of independent variables
Formula 1-2 Linear Regression – Adjusted R2

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).
Residuals are a measure of how far from the regression line data points are; RMSE is a measure
of how spread out these residuals are. In other words, it tells you how concentrated the data is
around the line of best fit.

$RMSE = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (y_j - \hat{y}_j)^2}$

$n$ = number of data points
$y_j$ = actual observation
$\hat{y}_j$ = estimated value
Formula 1-3 Linear Regression – RMSE Computation

It is computed by taking the distances from the points to the regression line (these distances are
the “errors”) and squaring them. The squaring is necessary to remove any negative signs. Lower
values are better with 0 indicating a perfect model.
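These metrics can be computed directly; a sketch is shown below, assuming y_test holds the actual prices, y_pred the model predictions, and X_test the test predictors (all assumed names):

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Adjusted R2 as per Formula 1-2: N = number of records, p = predictors.
N, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (N - 1) / (N - p - 1)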

1.3.4 Linear Regression Model with Sklearn Library


1.3.4.1 Linear Reg. Model Step1 – Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section
1.3.2 Split of Data – Train and Test.


1.3.4.2 Linear Reg. Model Step2 –Model Build


We use the LinearRegression in the sklearn.linear_model python library to build the model.
It fits a linear model with coefficients to minimize the residual sum of squares between the
observed targets in the dataset, and the targets predicted by the linear approximation.
Arguments passed for this function are as below:

• fit_intercept = True
This decides whether to compute the intercept for the model. It is generally recommended
to keep this as True, and the default value for this argument is True.
The constructed model is then used to fit the training dataset in order to complete the model
training operation.
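A minimal sketch of this build and fit step (X_train and y_train are assumed names for the training data):

from sklearn.linear_model import LinearRegression

lin_model = LinearRegression(fit_intercept=True)
lin_model.fit(X_train, y_train)

print(lin_model.intercept_)  # the model intercept
print(lin_model.coef_)       # one coefficient per feature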
1.3.4.3 Linear Reg. Model Step3 – Checking Features’ Coefficients & Intercept
The generated model gives the coefficient of each of the features that impact the output. We
see the coefficient table below.
We can see that the Carat, X and Y features hold the largest absolute coefficients and hence the
most influence on the predicted price, as compared to the other features.

Columns Coefficients
Intercept -2370.689909092437
Carat 8843.416
Y 1225.787
Clarity 441.6878
Color 277.711
Cut 110.5788
Table -17.6759
Depth -18.697
Z -306.232
X -1400.35
Table 1-25 Lin Reg.-Computed Coefficients & Intercept for All Features

1.3.4.4 Linear Reg. Model Step4 – R2 and RMSE (Prediction & Evaluation)
1.3.4.4.1 Train Data
R2 93.090%
Adjusted R2 0.93086
RMSE 910.98
Table 1-26 Lin Reg.- Train Data R2, Adjusted R2 and RMSE

1.3.4.4.2 Test Data


R2 93.097%
Adjusted R2 0.93089


RMSE 912.72
Table 1-27 Lin Reg.- Test Data R2, Adjusted R2 and RMSE

1.3.4.5 Model Status


The model gives consistent scores for the training and test sets. With this we can state that the
model is not over-fitted.

1.3.5 Linear Regression Model with Statsmodel Library


1.3.5.1 Linear Reg. Model Step1 – Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section
1.3.2 Spilt of Data – Train and Test.
Once the training and test sets are created, we concatenate the independent variables data
frame and the target variable data frame into a single data frame, as statsmodels requires the
input data in this format.
1.3.5.2 Linear Reg. Model Step2 – Null and Alternate Hypothesis
H0 = Null Hypothesis
All coefficients in the model are equal to zero i.e.
$\beta_1 = \beta_2 = \cdots = \beta_j = 0$

This can be understood as: there is no statistically significant relationship between any
of the independent variables and the target variable.
Ha = Alternative Hypothesis
Not all coefficients in the model are simultaneously equal to zero i.e.
$\beta_j \neq 0$ for at least one $j$

This can be understood as: there is a statistically significant relationship between at least
one of the independent variables and the target variable.
1.3.5.3 Linear Reg. Model Step3 – Model Build
We use the ols in the statsmodels.formula.api python library to build the model. It creates
a model with the given formula and input data frame. Arguments passed for this function are as
below:

• formula = stringVal
This is a string value which specifies the model. In the case of the cubic zirconia dataset, below
is the formula string passed.
stringVal = 'Price ~ Carat + Depth + Table + X + Y + Z + Cut + Color + Clarity'


• data = concatenated train data frame


The data used to train the constructed model.
The constructed model is fitted with the training dataset and the model is completed and ready
for prediction operation.
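A minimal sketch of this build step, assuming train_df is the assumed name of the concatenated training DataFrame described above:

import statsmodels.formula.api as smf

stringVal = 'Price ~ Carat + Depth + Table + X + Y + Z + Cut + Color + Clarity'
ols_model = smf.ols(formula=stringVal, data=train_df).fit()
print(ols_model.summary())  # coefficients, p-values and fit statistics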
1.3.5.4 Linear Reg. Model Step4 – Checking Coefficients & Hypothesis Check
The generated model gives the importance of each of the features that impact the output. In
addition, statsmodels gives extra information that is used to check and handle the hypothesis.
We see the summary table below.

OLS Regression Results - Cubic Zirconia Data

Dep. Variable: Price              R-squared: 0.931
Model: OLS                        Adj. R-squared: 0.931
Method: Least Squares             F-statistic: 28210
Date: Sun, 31 Oct 2021            Prob (F-statistic): 0
Time: 9:30:32                     Log-Likelihood: -155230
No. Observations: 18853           AIC: 310500
Df Residuals: 18843               BIC: 310500
Df Model: 9
Covariance Type: nonrobust

            coef        std err   t         P>|t|   [0.025      0.975]
Intercept   -2370.6899  691.328   -3.429    0.001   -3725.754   -1015.626
Carat        8843.4160   81.402  108.639    0        8683.860    9002.972
Depth         -18.6970    8.846   -2.113    0.035     -36.037      -1.357
Table         -17.6759    3.918   -4.512    0         -25.355      -9.997
X           -1400.3473  120.257  -11.645    0       -1636.062   -1164.632
Y            1225.7868  118.735   10.324    0         993.056    1458.518
Z            -306.2325   96.667   -3.168    0.002    -495.708    -116.757
Cut           110.5788    8.877   12.457    0          93.179     127.979
Color         277.7110    4.119   67.419    0         269.637     285.785
Clarity       441.6878    4.469   98.835    0         432.928     450.447

Omnibus: 2701.642                 Durbin-Watson: 1.989
Prob(Omnibus): 0                  Jarque-Bera (JB): 9155.803
Skew: 0.721                       Prob(JB): 0
Kurtosis: 6.094                   Cond. No. 8.91e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


[2] The condition number is large, 8.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Table 1-28 Lin Reg. (Statsmodels) - OLS Regression Summary

1. The coefficients and intercept computed by the statsmodels generated linear regression
model are similar to those generated by sklearn (Table 1-25 Lin Reg.-Computed
Coefficients & Intercept for All Features).

2. We check the P>|t| column in the above table in order to decide the hypothesis outcome.
As all p-values are less than α=0.05, we reject H0 and accept Ha that at least 1 regression
coefficient is not 0.
In this case, we see that every individual regression coefficient is statistically significant at
the 5% level.

1.3.5.5 Linear Reg. Model Step5 – R2 and RMSE (Prediction & Evaluation)
1.3.5.5.1 Train Data
R2 93.090%
Adjusted R2 0.93086
RMSE 910.98
Table 1-29 Lin Reg.- Train Data R2, Adjusted R2 and RMSE

1.3.5.5.2 Test Data


R2 93.097%
Adjusted R2 0.93089
RMSE 912.72
Table 1-30 Lin Reg.- Test Data R2, Adjusted R2 and RMSE

1.3.5.6 Model Status


The model gives consistent scores for the training and test sets. With this we can state that the
model is not over-fitted.

1.3.5.7 Linear Reg. Model Sample Computation


The computed intercept and the coefficients are used as follows in the linear regression model
formula:

$Price = -2370.6899 + 8843.416 \cdot Carat - 18.697 \cdot Depth - 17.6759 \cdot Table - 1400.3473 \cdot X + 1225.7868 \cdot Y - 306.2325 \cdot Z + 110.5788 \cdot Cut + 277.711 \cdot Color + 441.6878 \cdot Clarity$
Formula 1-4 Linear Regression Model Price Computation
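As an illustration, the fitted equation can be evaluated for a hypothetical stone; all input values below are made up for the example and do not come from the dataset:

# Hypothetical encoded inputs: a 1.0 carat, Ideal cut (4), color G (4),
# clarity VS2 (4) stone with assumed dimensions.
carat, depth, table = 1.0, 61.5, 57.0
x, y, z = 6.4, 6.4, 4.0
cut, color, clarity = 4, 4, 4

price = (-2370.6899 + 8843.416 * carat - 18.697 * depth
         - 17.6759 * table - 1400.3473 * x + 1225.7868 * y
         - 306.2325 * z + 110.5788 * cut + 277.711 * color
         + 441.6878 * clarity)
print(round(price, 2))  # predicted price for this hypothetical stone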


1.4 Inference: Basis on these predictions, what are the business insights and
recommendations. Please explain and summarize the various steps performed
in this project. There should be proper business interpretation and actionable
insights present.
In conclusion to the EDA for the cubic zirconia data and the price prediction model development,
we have developed the following business insights.

1.4.1 Model Insights


1. The R2 value is very high. Although a high value is preferred, such a high value may also
indicate issues. Consultation with an SME is needed to check whether this is a reasonable value
for the dataset.
In this case, we have seen the presence of strong multi-collinearity in the dataset. Presence
of multi-collinearity adversely affects the created linear regression model which leads to an
inflated R2 value.

2. As RMSE is a scale dependent score (it is on the same scale as the target variable, here it is
price), we see that the model indicates that there is an error of ~912 in predicting the price
of a cubic zirconia.

1.4.2 Recommendations
1. The vendor can try to focus on VVS1, VVS2 and IF stones, both in quantity and in availability
across a range of carat sizes. This will allow him to market to niche customers who value quality
over other parameters.

2. The vendor should chart out the relevance of the 4 Cs – carat, clarity, color and cut – when
selecting a stone for the benefit of his customers, as some of them may be unaware of these parameters.

3. In order to best serve his customers, the vendor should first get the budget input from them.
The next important piece of information is to understand what is most important for the
customer. Some may be looking for a quality stone whereas others may place higher
preference on a mid-quality stone of a larger size.
Trying to understand and provide the finest stone to the customer as per his requirements is
a process that should be streamlined and followed by the salespersons.

4. Another important aspect is where the stone is to be used. If a stone is to be used in a
solitaire ring or earrings, then the quality of the stone is important along with the size. On the
other hand, if the stones are to be used in a larger necklace, then mid-quality stones can be used.
This information can be provided to the customer to help him make the best judgement.


2 Problem 2 Statement
You are hired by a tour and travel agency which deals in selling holiday packages. You are
provided details of 872 employees of a company. Among these employees, some opted for the
package and some didn't. You have to help the company in predicting whether an employee will
opt for the package or not on the basis of the information given in the data set. Also, find out the
important factors on the basis of which the company will focus on particular employees to sell
their packages.

Variable Name Description


Holiday Package* Opted for Holiday Package [Yes/No]
Salary Employee Salary
Age Age in Years
Education Years of Formal Education
Num Young Children Number of Young Children (younger than 7 years)
Num Older Children Number of Older Children
Foreign Foreigner [Yes/No]
Table 2-1 Data Dictionary for Tour and Travel Agency

* - Target Variable

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate
Analysis. Do exploratory data analysis.
2.1.1 Data Summary
The summary describes the data type and the number of data entries in each of the columns in
the dataset. The presence of null data and duplicated data is also noted.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 872 entries, 0 to 871
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 872 non-null int64
1 Holiday_Package 872 non-null object
2 Salary 872 non-null int64
3 Age 872 non-null int64
4 Education 872 non-null int64
5 Num_Young_Children 872 non-null int64
6 Num_Older_Children 872 non-null int64
7 Foreign 872 non-null object


dtypes: int64(6), object(2)


memory usage: 54.6+ KB

1. There are a total of 8 columns and 872 rows


2. There are columns of numerical and object data types
3. There is no null data present in any of the columns
4. There is no duplicated data present in the dataset
5. The column 'Unnamed: 0', which held serial numbers from 1 to 872, is dropped after the
initial data read. It is not used for further processing as it does not contribute to predicting
the Holiday Package target.

2.1.2 Descriptive Statistics


The descriptive statistics of all the data are summarized below.

mean std min 25% 50% 75% max


Salary 47729.17 23418.67 1322 35324 41903.5 53469.5 236961
Age 39.95528 10.55167 20 32 39 48 62
Education 9.307339 3.036259 1 8 9 12 21
Num_Young_Children 0.311927 0.61287 0 0 0 0 3
Num_Older_Children 0.982798 1.086786 0 0 1 2 6
Table 2-2 Descriptive Statistics of Holiday Package Data

1. We observe that the min/max values of all the columns are valid.
2. The number of young children has its first, second and third quartile values as 0

2.1.3 Sample Data


A sample of the original data set is as below.

Holiday Package  Salary  Age  Education  Num Young Children  Num Older Children  Foreign
0  no   48412  30   8  1  1  no
1  yes  37207  45   8  0  1  no
2  no   58022  46   9  0  0  no
3  no   66503  31  11  2  0  no
4  no   66734  44  12  0  2  no
Table 2-3 Sample Holiday Package Data

2.1.4 Univariate Analysis


2.1.4.1 Age
Distribution: The age data is fairly normally distributed with a very wide top, indicating that
employees are spread fairly evenly across all ages. The maximum number of employees are aged
between 43-46 years.


Skew: Data has very low skew with no long tail on either end

Figure 2-1 Age Data – Boxplot and Histogram

Skewness: 0.15
Outliers: Age data has no outliers.
2.1.4.2 Education
Distribution: The data distribution shows that education is not normally distributed, with multiple
peaks seen. The maximum number of employees have around 8-10 years of education.
Skew: Data is minimally skewed
Skewness: -0.05
Outliers: A few outliers are present in the data set with 3 outliers lying above the top (right)
whisker and 1 outlier lying below the bottom (left) whisker.

Figure 2-2 Education Data – Boxplot and Histogram

2.1.4.3 Salary
Distribution: The bulk of the salary data is concentrated around the center, but there is a very
long tail on the right indicating the data is skewed to the right. This is confirmed by the skewness
value.


Skew: Highly skewed to the right. This can be attributed to the large number of outliers, mainly
above Q3 (the 75th percentile).

Figure 2-3 Salary Data – Boxplot and Histogram

Skewness: 3.1
Outliers: A very high number of outliers are present in the data set. There are 56 outliers lying
above the top (right) whisker and 1 outlier lying below the bottom (left) whisker.
2.1.4.3.1 Salary + Age Relationship
Age brackets of employees have been used to check the number of employees in each bracket and
the average salary earned by them.

Age Bracket Num. of Employees Avg Salary


(0, 25] 73 37012.53425
(25, 35] 262 48108.40458
(35, 45] 260 48846.93846
(45, 55] 190 51018.41579
(55, 70] 87 45055.37931
Table 2-4 Employee Age Bracket – Average Salary Data

Inferences:
1. The company employs people of all age brackets
2. Most employees are within the 25-45 age bracket
3. Highest salary is earned by the 45-55 age bracket employees
4. Salaries are fairly flat across the 25-55 age range, with only a ~6% increase from the
(25, 35] bracket to the (45, 55] bracket


Figure 2-4 Salary + Age Relationship

2.1.4.3.2 Salary + Education Relationship


Education brackets of employees have been used to check the number of employees in each
bracket and the average salary earned by them.

Education Bracket Num. of Employees Avg Salary


(0, 4] 68 38525.05882
(4, 8] 276 40368.4529
(8, 12] 428 50477.03738
(12, 16] 93 62486.01075
(16, 20] 6 64103.16667
(20, 24] 1 58451
Table 2-5 Employee Education Bracket – Average Salary Data

Inferences:
1. Most employees fall in the 8-12 years’ education bracket.
2. A single employee is educated above 20 years
3. There is an increase in salary seen as the education bracket increases, with highest salary
seen in the 16-20 years’ education bracket.


Figure 2-5 Salary + Education Relationship

2.1.4.3.3 Salary + Foreign Employee Relationship


It is seen that the average salary of a foreign employee is lower than that of non-foreign
employees.
This data is further analyzed to check the distribution of salary among foreign and local
employees along with the age and education ranges. This data helps recognize the presence of
bias within the organization.

Figure 2-6 Salary + Foreign Employee Relationship

It is seen that foreign employees are consistently underpaid across all age brackets. The %
difference between the average salary for local employees and foreign employees ranges from a
minimum of 13.3% to an extreme high of 32.8%.

Age Bracket  Avg Salary (Local Emp)  Num. of Local Employees  Avg Salary (Foreign Emp)  Num. of Foreign Employees  % Diff
(0, 25]   38305.04  52   33812.05  21  13.29
(25, 35]  51611.73  190  38863.53  72  32.80
(35, 45]  51913.36  192  40188.81  68  29.17
(45, 55]  54011.56  142  42163.71  48  28.09
(55, 70]  45581.14  80   39046.71  7   16.73
Table 2-6 Foreign & Local Employees Salary and Age Distribution


Similar to the age brackets, the average salaries across the education brackets also show foreign
employees being underpaid, with the exception of the 12-16 years' education bracket.

Education Bracket  Avg Salary (Local Emp)  Num. of Local Employees  Avg Salary (Foreign Emp)  Num. of Foreign Employees  % Diff
(0, 4]    44081.30  26   35085.47  42   25.63
(4, 8]    43342.28  163  36078.76  113  20.13
(8, 12]   51043.42  377  46290.21  51   10.26
(12, 16]  62460.71  83   62696.00  10   -0.37
(16, 20]  64103.16  6    -         0    -
(20, 24]  58451.00  1    -         0    -
Table 2-7 Foreign & Local Employees Salary and Education Distribution

2.1.4.4 Number of Young Children


Distribution: The data distribution shows that the number of young children is not normally
distributed, with multiple peaks seen.
Skew: Data is highly skewed with a very long right tail.
Skewness: 1.95
Outliers: There are a large number of outliers with 207 outliers lying above the top (right)
whisker.

Figure 2-7 Number of Young Children Data – Boxplot and Histogram

2.1.4.5 Number of Older Children

Distribution: The data distribution shows that the number of older children is not normally
distributed, with multiple peaks seen.
Skew: Data is moderately skewed with a long right tail.
Skewness: 0.95
Outliers: There are only two outliers lying above the top (right) whisker.


Figure 2-8 Number of Older Children Data – Boxplot and Histogram

2.1.4.6 Foreign
This column indicates whether an employee is a foreigner. We see that around 24.77% of the
employees are foreign. Further analysis of foreign employees' salaries has been documented in
section 2.1.4.3.3 Salary + Foreign Employee Relationship.

Figure 2-9 Foreign Employees Count Plot

Foreign  Number of Occurrences  Percentage
No   656  75.23
Yes  216  24.77
Table 2-8 Foreign Employees Data Distribution


2.1.4.7 Holiday Package


This column is the target column under analysis. The count plot gives us an understanding of the
frequency at which the employees opt for the holiday package. We see that around 54.01% of
the employees do not opt for the holiday package.

Figure 2-10 Holiday Package Opted Count Plot

Holiday Package  Number of Occurrences  Percentage
No   471  54.01
Yes  401  45.99
Table 2-9 Holiday Package Opted-Data Distribution

We further examine the ages of the employees who have opted for the holiday package to see
the interested demographic. Employees in the 35-45 age range have opted for the holiday
package more than other age groups.

Figure 2-11 Holiday Package Opted – Age Groups

The number of children also has a large influence on the opting of discounted holiday packages.
The children status of employees who have taken the holiday packages is analyzed to see this
influence.


It is seen that families who have older children prefer to opt for holiday packages. In contrast,
families who have only young children, or both young and older children, do not opt for holiday
packages.

Figure 2-12 Holiday Package – Children Status

It is seen that more foreign employees opt for holiday packages as compared to local
employees. In addition, it can be noted that in general, employees who opt for the offered
holiday package have a lower average salary as compared to employees who do not opt for a
holiday package.

Figure 2-13 Holiday Package– Average Salary & Foreign Employees


2.1.5 Bivariate Analysis


The relationship between the different columns of the dataset can be visualized with a pair plot.
In addition, a heat map of the correlations also lets us understand the degree of correlation
between the data columns. Both the pair plot and the heat map for all the parameters have been
constructed and placed below.
It is seen that there is no strong correlation between any of the numerical columns.

Figure 2-14 Travel Parameters Heat Map


Figure 2-15 Travel Data Parameters Pair Plot

As summarized by the data of the heat map, the pair plot does not show any distinguishable
pattern between the numerical data columns of the dataset.


2.2 Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression
and LDA (linear discriminant analysis).
2.2.1 Outlier Check and Cleanup
The numerical columns in the data set are visualized in the box plot below. All the numerical
columns except Age have outliers present.

Figure 2-16 Holiday Package– Numerical Data Box Plot

The logistic regression and the linear discriminant analysis models are sensitive to outliers. In
order to avoid the outliers influencing the models, we treat them. There are different ways to
treat outliers:
1. Trimming Data: The rows of data holding the outliers are removed from the dataset.
Although this method is direct and simple, there will be a loss of information which may
cause issues in the created model.
2. Imputation of Data: The outliers are replaced with acceptable values. This includes capping
the data or replacing the outliers with the median value. Imputation ensures that data is not
lost in the process of outlier treatment.


In this case, we shall impute the data with the capped values. Once the imputation is complete,
we check to see if all the outliers have been treated.

Column              Number of Outliers  Position of Outliers
Salary              57                  Above upper whisker, below lower whisker
Education           4                   Above upper whisker, below lower whisker
Num Young Children  207                 Above upper whisker, below lower whisker
Num Older Children  2                   Above upper whisker, below lower whisker
Table 2-10 Holiday Package Numerical Data - Outlier Details

Analysis of the outliers indicates that the outlier values in the education and number of older and
younger children columns are valid. So these columns’ outliers shall not be imputed.
After the imputation of the outliers in the salary column, we check the descriptive statistics to
identify the changes.

Salary
Before & After mean std min 25% 50% 75% max
Before 47729.17 23418.67 1322 35324 41903.5 53469.5 236961
After 45608.33 15699.74 8105.75 35324 41903.5 53469.5 80687
Table 2-11 Descriptive Statistics Comparison After Outlier Imputation

The maximum value has reduced and the minimum value has increased. There is some decrease
in the mean value as well.
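The capping itself follows the standard boxplot whisker rule; a sketch is below (df is an assumed DataFrame name). Note that the resulting limits (8105.75 and roughly 80687) match the post-imputation statistics in Table 2-11:

# IQR-based capping for the Salary column.
q1, q3 = df['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values beyond the whiskers are capped at the whisker limits.
df['Salary'] = df['Salary'].clip(lower=lower, upper=upper)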

2.2.2 Conversion of Categorical Data


Models are built on numerical data. For this reason, we convert the unique values in
the categorical columns into numerical values. This converted data is used in the modelling
operations.

Foreign Numerical Conversion


No 0
Yes 1
Holiday Package Numerical Conversion
No 0
Yes 1
Table 2-12 Categorical Values to Numerical Number Codes
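Since both columns are binary, the mapping can be applied directly (a sketch; df is an assumed DataFrame name, and the raw values are assumed to be lowercase as in the sample data):

# Binary mappings from Table 2-12.
df['Foreign'] = df['Foreign'].map({'no': 0, 'yes': 1})
df['Holiday_Package'] = df['Holiday_Package'].map({'no': 0, 'yes': 1})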


2.2.3 Split of Data – Train and Test


The train-test split operation is performed for classification or regression problems. It is used for
any supervised learning algorithm.
In this operation, the dataset is randomly divided into two subsets, each subset having
part1_independent_data and part2_target_data.
1. Subset1 is used to train the model. This is named as the training dataset.
2. Subset2 is used to test the created model and so is named the test dataset.
3. The trained model is given the Subset2- part1_independent_data as input and the model
shall give the predictions as the output.
4. The predictions made by the model are then compared against the expected values i.e.
Subset2- part2_target_data.
The comparison between the expected values and the model predicted values is used to evaluate
the model performance.
There are two main configuration parameters that are used to create the training and test data
subsets.

• test_size: This is the size of the test set. It is expressed as a proportion between 0 and 1.
The specified fraction of the data will be collected into the test subset.
To give an example, suppose a dataset has a total of 3000 rows. On specifying
test_size=0.3, the resultant test subset has 900 data points (30% of 3000) and the training
subset shall hold 2100 entries (70% of 3000).
• random_state: This input is used to initialize the internal random number generator,
which decides how to split the data into train and test subsets. This input should be set
to the same value, if the same consistency is to be expected over multiple runs of the
code.
The splitting of the data is done using the train_test_split function from the python module
sklearn.

1. We have executed this split operation with random_state=1, test_size=0.3


2. The test and train data subset shapes are:
Training subset = 610 rows, 6 columns
Test subset = 262 rows, 6 columns


2.2.4 Logistic Regression Model


Logistic regression is a supervised learning technique for a binary response. The two response
classes are Positive and Negative; the output is given as the probability of the positive class
based on the values of the predictors.
2.2.4.1 Log. Reg. Model Step1- Data Split
The dataset has to be divided into the training and test subsets. This has been detailed in
section 2.2.3 Split of Data – Train and Test.
2.2.4.2 Log. Reg. Model Step2- Model Build
The logistic regression model is constructed using the function LogisticRegression from the
sklearn.linear_model library. Arguments passed for this function are as below:

• solver=’newton-cg’
This is the algorithm to use in the optimization problem.
• max_iter=10000
10,000 is the maximum number of iterations for the solvers to converge.
• penalty = none
No penalty is added to the model
• tol=0.0001
This is the tolerance value for the stopping criteria.
• verbose=True
Setting this to true allows the progress messages to be printed out
• random_state=1
This makes the model’s output replicable. The model will always produce the same results
when it has a definite value of random_state and if it has been given the same parameters
and the same training data.
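A minimal sketch of this build step with the arguments listed above (variable names are assumptions; penalty='none' is the string spelling accepted by sklearn versions contemporary with this report):

from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression(solver='newton-cg', max_iter=10000,
                               penalty='none', tol=0.0001,
                               verbose=True, random_state=1)
log_model.fit(X_train, y_train)

print(log_model.score(X_train, y_train))  # mean accuracy, train data
print(log_model.score(X_test, y_test))    # mean accuracy, test data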
The mean accuracy of the built model for the training and test data is as follows.

• Train Data = 67.54%


• Test Data = 63.74%
The scores for the train and test dataset are similar indicating that the generated model is not
over-fitted.
2.2.4.3 Log. Reg. Model Step4- Best Model (Grid Search)
We use the GridSearchCV function of the sklearn.model_selection module to identify the
best possible combinations of inputs to generate a better model.
We give the following combinations of inputs as the grid, along with the modelling algorithm, as
input to the GridSearchCV function. It exhaustively generates candidates from the grid of
parameter values specified, and the best inputs for that algorithm are then selected.


In this case, below is the parameter grid which is given as the input:
param_grid = {
    'penalty': ['l2', 'none', 'l1', 'elasticnet'],
    'solver': ['sag', 'lbfgs', 'saga', 'newton-cg', 'liblinear'],
    'tol': [0.001, 0.0001, 0.00001],
    'l1_ratio': [0.25, 0.5, 0.75],
    'max_iter': [100, 1000, 10000]
}
After the function execution is complete, we check the best selected parameters from this.
{'l1_ratio': 0.25,
'max_iter': 100,
'penalty': 'l2',
'solver': 'newton-cg',
'tol': 0.001}
With these values set, we recheck the score of the model.

• Train Data = 67.37%


• Test Data = 64.50%
The scores for the train and test dataset are similar indicating that the generated model is not
over-fitted.

2.2.5 Linear Discriminant Analysis Model


Linear Discriminant Analysis (LDA) is a supervised method used for classifying observations to a
class or category based on predictor (independent) variables of the data.
2.2.5.1 LDA Model Step1- Data Split
The dataset has to be divided into the training and test subsets. This has been detailed in
section 2.2.3 Split of Data – Train and Test.
2.2.5.2 LDA Model Step2- Model Build
The LDA model is constructed using the function LinearDiscriminantAnalysis
from the sklearn.discriminant_analysis library. We use default arguments for this model
generation. The main default arguments are:

• solver=’svd’
• tol=0.0001
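A minimal sketch of this build step (variable names are assumptions):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis(solver='svd', tol=0.0001)
lda_model.fit(X_train, y_train)

print(lda_model.score(X_train, y_train))  # mean accuracy, train data
print(lda_model.score(X_test, y_test))    # mean accuracy, test data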
The mean accuracy of the built model for the training and test data is as follows.

• Train Data = 67.54%


• Test Data = 64.12%


The scores for the train and test dataset are similar indicating that the generated model is not
over-fitted.


2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score
for each model Final Model: Compare Both the models and write inference
which model is best/optimized.
2.3.1 Confusion Matrix
A confusion matrix is an NxN matrix used for evaluating the performance of a classification model,
where N is the number of target classes. It compares the actual target values with those predicted
by the built machine learning model. This gives us a holistic view of how well our classification
model is performing and what kinds of errors it is making.
• A classification target variable (binary) has two possible values, Positive or Negative
• The columns represent the actual values of the target variable
• The rows represent the predicted values of the target variable

                     Actual Positive   Actual Negative
Predicted Positive   TP                FP
Predicted Negative   FN                TN

Figure 2-17 Confusion Matrix

The components that form the matrix are:

• True Positive (TP)
Actual value and the model predicted value match, and the predicted value is True
(Positive)
• True Negative (TN)
Actual value and the model predicted value match, and the predicted value is False
(Negative)
• False Positive (FP)
Actual value and the model predicted value do not match. The actual value is False
(Negative) but was incorrectly predicted as True (Positive).
This is known as Type I error.
• False Negative (FN)
Actual value and the model predicted value do not match. The actual value is True
(Positive) but was incorrectly predicted as False (Negative).
This is known as Type II error.


Using the confusion matrix, the metrics accuracy, precision, recall and specificity are derived.
2.3.1.1 Accuracy
Accuracy (ACC) is the number of all correct predictions divided by the total number of records in
the dataset. The best accuracy is 1.0, whereas the worst is 0.0

$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
Formula 2-1 Confusion Matrix – Accuracy

Accuracy is not the best metric to check, especially on an imbalanced dataset. In such cases, the
accuracy metric alone does not give a correct picture. In order to mitigate this, we use the
additional metrics of precision and recall.
2.3.1.2 Precision
Precision (PREC) is calculated as the number of correct positive predictions divided by the total
number of positive predictions. It tells us how many of the cases predicted as positive are actually
positive.
It is also called positive predictive value (PPV). The best precision is 1.0, whereas the worst is 0.0.
Precision is a useful metric in cases where False Positive is a higher concern than False Negatives
(e.g.: In e-commerce recommendations, wrong results could lead to customer churn).

$precision = \frac{TP}{TP + FP}$
Formula 2-2 Confusion Matrix - Precision

2.3.1.3 Recall/Sensitivity
Recall is calculated as the number of correct positive predictions divided by the total number of
positives i.e. the actual positive cases we were able to predict correctly with our model.

$recall = \frac{TP}{TP + FN}$
Formula 2-3 Confusion Matrix - Recall

It is also referred to as the true positive rate (TPR). The best recall is 1.0, whereas the worst is
0.0. Recall is a useful metric in cases where False Negatives are a higher concern than False
Positives (e.g. in medical diagnosis, raising a false alarm may be safer).
2.3.1.4 Specificity
Specificity is calculated as the number of correct negative predictions divided by the total number
of negatives.


$specificity = \frac{TN}{TN + FP}$
Formula 2-4 Confusion Matrix - Specificity

It is also referred to as the true negative rate (TNR). The best specificity is 1.0, whereas the worst
is 0.0.
2.3.1.5 F1 Score
Recall and precision typically trade off against each other. The best way to capture
both is to combine them, which gives us the F1-Score metric. The F1 score is the
harmonic mean of precision and recall, such that the best score is 1.0 and the worst is
0.0.

$F1\ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall} = \frac{TP}{TP + \frac{1}{2}(FP + FN)}$
Formula 2-5 Confusion Matrix – F1 Score

The interpretability of the F1-score is poor on its own. Using it in combination with other
evaluation metrics gives us a complete picture of the result.
2.3.1.6 Classification Report
This report displays the precision, recall, F1, and support scores for the created model. It is
generated by the classification_report function of the sklearn library. A sample report is shown
below:
precision recall f1-score support
0 0.78 0.91 0.84 300
1 0.71 0.47 0.57 600
accuracy 0.77 900
macro avg 0.75 0.69 0.70 900
weighted avg 0.76 0.77 0.75 900
Table 2-13 Sample Classification Report

2.3.1.7 ROC Curve and AUC Score


AUC (Area Under the Curve) and ROC (Receiver Operating Characteristics) are important model
performance measurement visualizations.
The ROC is a probability curve which plots the TPR (True Positive Rate) against the FPR (False
Positive Rate), with the TPR on the y-axis and the FPR on the x-axis.
AUC represents the area under the ROC curve. The higher the AUC, the better the model is at
correctly classifying the instances. The ideal ROC curve extends to the top left corner, which
would result in an AUC of 1.
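All of the metrics discussed in this section can be produced with a few sklearn calls; a sketch is below, where model stands for either fitted classifier (an assumed name):

from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_auc_score, roc_curve)
import matplotlib.pyplot as plt

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of positive class

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(roc_auc_score(y_test, y_prob))

fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr)  # ROC curve: TPR (y-axis) vs FPR (x-axis)
plt.show()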


2.3.2 Model Performance Decisions


In order to judge all the generated logistic regression and the LDA models, we use the following
set of guidelines.
1. The target class is fairly balanced, standing at a ratio of [54:46] for [No:Yes]. Considering this,
we will treat the accuracy metric as a fairly important one.

2. We are looking to see if an employee opts for the available holiday package. Based on what
is more important to the agency, we check different metrics; the two kinds of prediction
failures are not equal, and one will be costlier than the other.
Case1 - Agency wants employees to opt for holiday packages

Worse: Predict that a customer shall opt for the holiday package, but he does not.
Why: The agency may have tried to tailor packages for the employee to influence them
into opting in for the holiday package. This effort will have cost the agency. If the
employee does not take advantage of this, agency resources will have been wasted.
This is a False Positive.

Better: Predict that a customer shall not opt for the holiday package, but he does.
Why: Although the agency has not tried to customize any holiday packages for this
employee, he has still availed it.
This is a False Negative.
Table 2-14 Case 1- Metric Importance

Case 2 - Agency does not want employees to opt for holiday packages – This is a second
possibility. Employees opting for holiday packages may be costing the agency due to the price
subsidy.

Better: Predict that a customer shall opt for the holiday package, but he does not.
Why: Agencies will be offering a price subsidy to the employees for the holiday packages.
Sometimes these subsidies will result in the agency making a low profit or, in the worst
case, a loss. In such a case, if the employee ends up not taking the holiday package, the
agency does not lose money on the subsidy.
This is a False Positive.

Worse: Predict that a customer shall not opt for the holiday package, but he does.
Why: The agency may not have budgeted for the subsidy that it has to offer to the
employee. This may result in the agency losing profits due to the subsidy.
This is a False Negative.


Table 2-15 Case 2- Metric Importance

3. For our models, we are considering Case 1: the agency wants its employees to take advantage
of its offered travel packages. With this understanding, it is better to have a False
Negative rather than a False Positive. As we need to reduce FPs, it is better to
focus on the value of precision when measuring the performance of the model.

4. The F1-Score value is checked to see if it has a high value

5. AUC is checked to see if the value is high. The curve shape is also checked to see if it
extends up to the top left corner

2.3.3 Logistic Regression – Model Performance


2.3.3.1 Training Dataset
Using the best logistic regression model generated at the end of the grid search operation, the
training data is used as input to get the predictions. The predictions made by the developed
model on the training dataset are summarized in this section.

Figure 2-18 Log. Reg. Training Data Confusion Matrix

The metrics computed in the classification report using the above confusion matrix are as below.
precision recall f1-score support
0 0.67 0.77 0.72 326
1 0.68 0.56 0.62 284
accuracy 0.67 610
macro avg 0.68 0.67 0.67 610
weighted avg 0.67 0.67 0.67 610
Table 2-16 Log. Reg. Training Data Classification Report

Positive Class Analysis:


Accuracy = 67%
Precision = 68%


Recall = 56%
F1-Score = 62%
AUC = 0.74

1. Accuracy is fairly high for the training data prediction
2. Precision value is fairly high, which is good as we are trying to reduce FPs
3. F1-Score is medium
4. AUC score indicates that the model is good

Figure 2-19 Log. Reg. Training Data ROC-AUC Curve

2.3.3.2 Test Dataset

Using the best logistic regression model generated at the end of the grid search operation, the
test dataset is used as input to get the predictions. The predictions made by the developed
model on the test dataset are summarized in this section.

Figure 2-20 Log. Reg. Test Data Confusion Matrix

The metrics computed in the classification report using the above confusion matrix are as below.
precision recall f1-score support
0 0.67 0.70 0.69 145
1 0.61 0.57 0.59 117
accuracy 0.65 262
macro avg 0.64 0.64 0.64 262
weighted avg 0.64 0.65 0.64 262
Table 2-17 Log. Reg. Test Data Classification Report

Positive Class Analysis:


Accuracy = 65%
Precision = 61%
Recall = 57%
F1-Score = 59%
AUC = 0.70

1. Accuracy is fairly high for the test data prediction
2. Precision value is lower for the test dataset as compared to the training data. It is still fair.
3. F1-Score is low
4. AUC score indicates that the model is good
5. The output metrics of the model show that it is not over-fitted.

Figure 2-21 Log. Reg. Test Data ROC-AUC Curve

2.3.4 LDA – Model Performance


2.3.4.1 Training Dataset
Using the generated LDA model, the training data is used as input to get the predictions. The
predictions made by the developed model on the training dataset are summarized in this
section.
Figure 2-22 LDA Training Data Confusion Matrix

The metrics computed in the classification report using the above confusion matrix are as below.
precision recall f1-score support
0 0.67 0.78 0.72 326
1 0.69 0.56 0.61 284


accuracy 0.68 610


macro avg 0.68 0.67 0.67 610
weighted avg 0.68 0.68 0.67 610
Table 2-18 LDA Training Data Classification Report

Positive Class Analysis:


Accuracy = 68%
Precision = 69%
Recall = 56%
F1-Score = 61%
AUC = 0.74

1. Accuracy is fairly high for the training data prediction
2. Precision value is fairly high, which is good as we are trying to reduce FPs
3. F1-Score is medium
4. AUC score indicates that the model is good

Figure 2-23 LDA Training Data ROC-AUC Curve

2.3.4.2 Test Dataset


Using the LDA model, the test dataset is used as input to get the predictions. The predictions
made by the developed model on the test dataset are summarized in this section.

Figure 2-24 LDA Test Data Confusion Matrix


The metrics computed in the classification report using the above confusion matrix are as below.
precision recall f1-score support
0 0.66 0.71 0.69 145
1 0.61 0.56 0.58 117
accuracy 0.64 262
macro avg 0.64 0.63 0.63 262
weighted avg 0.64 0.64 0.64 262
Table 2-19 LDA Test Data Classification Report

Positive Class Analysis:


Accuracy = 64%
Precision = 61%
Recall = 56%
F1-Score = 58%
AUC = 0.70
1. Accuracy is fairly high for the test data prediction
2. Precision value is lower for the test dataset as compared to the training data. It is still fair.
3. F1-Score is low
4. AUC score indicates that the model is good
5. The output metrics of the model show that it is not over-fitted.

Figure 2-25 LDA Test Data ROC-AUC Curve

2.3.5 Model Compare – Analysis


The various metrics derived from the logistic regression and LDA models are tabulated below
for the positive class. They have been separated by the predictions made by the model on the
training set and on the test set.

                                      Accuracy  Recall  Precision  F1-Score  AUC
Training Data – Logistic Regression   67        56      68         62        0.74
Training Data – LDA                   68        56      69         61        0.74
Test Data – Logistic Regression       65        57      61         59        0.70
Test Data – LDA                       64        56      61         58        0.70

Table 2-20 – All Models Metrics Comparison

The AUC-ROC curves of the training and test data for both developed models have been plotted
on the same graph to show their relative performance.

Figure 2-26 All Model ROC-AUC Curve – Training Data

Both the plots, along with the AUC scores, show that the models have similar performance.

Figure 2-27 All Model ROC-AUC Curve – Test Data

Observations on Model Performance Comparison:

1. None of the created models are over-fitted
2. Both models have fairly similar prediction metrics
3. If a choice had to be made, the logistic regression model has the better performance, with
higher accuracy and F1-score on the test data

After comparing the model performances, we can conclude that the logistic regression model
performs slightly better than the LDA model.


2.4 Inference: Basis on these predictions, what are the insights and
recommendations. Please explain and summarize the various steps performed
in this project. There should be proper business interpretation and actionable
insights present.
Based on the EDA and model creation, we can have the below insights and recommendations.
1. Foreign employees are consistently underpaid across all age and education groups. The
reasoning behind this should be analyzed.
The attrition rate among foreign employees should be compared with local employees. If
the attrition of foreign employees is higher, better salaries may remedy that.

2. Younger employees are paid well. This is a good point to be made if the agency is looking
to hire fresh talent.

3. Employees with higher average salary do not opt for holiday packages. This may stem
from the non-availability of more luxury options. A simple survey can help bridge this gap.
The agency can then offer better holiday alternatives.

4. More foreign employees opt for holiday packages as compared to local employees. It may
be that they are availing the packages to travel to their native countries.
Foreign employees can be offered a no-frills travel package (travel only, with no itinerary)
to their native countries to encourage more of them to opt for travel packages.

5. Employees with older children opt for holiday packages as compared to their colleagues
who have only young children or those who have both young and older children. This may
be due to the non-availability of quality child care facilities at the holiday destinations.
Packages tailored for employees with young children can help encourage them to avail
the holiday packages.

6. The time of year when employees opt for holidays is an essential component. Using this
data, better insight can be gained into employees' travel plans: is it during the holiday
season, summer vacations or cultural events?
With this information, better travel options can be made available to the agency
employees.
