

PREDICTIVE
MODELLING
PROJECT

Pooja Kabadi
PGP-DSBA Online
Batch- A4
23-01-2022


Table of Contents:
Problem 1: Linear Regression ............................................................................................................. 5
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis. .................. 5
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any
meaning or do we need to change them or drop them? Check for the possibility of combining the sub
levels of an ordinal variable and take actions accordingly. Explain why you are combining these sub
levels with appropriate reasoning. .................................................................................................... 19
1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30).
Apply Linear regression using scikit learn. Perform checks for significant variables using appropriate
method from stats model. Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj-Rsquare. Compare these models and select the best one
with appropriate reasoning. ............................................................................................................... 22
Problem 2: Logistic Regression and LDA......................................................................................... 44
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
.......................................................................................................................................................... 44
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the
data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
.......................................................................................................................................................... 56
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final Model:
Compare Both the models and write inference which model is best/optimized. .............................. 60
2.4 Inference: Basis on these predictions, what are the insights and recommendations. .................. 68

List of Figures:
Figure 1. Boxplot and Distplot of Carat. ................................................................................................. 8
Figure 2. Boxplot and Distplot of Depth................................................................................................. 9
Figure 3. Boxplot and Distplot of Table. ................................................................................................ 9
Figure 4. Boxplot and Distplot of 'X' .................................................................................................... 10
Figure 5. Boxplot and Distplot of 'Y' .................................................................................................... 10
Figure 6. Boxplot and Distplot of 'Z' .................................................................................................... 10
Figure 7. Boxplot and Distplot of Price ................................................................................................ 11
Figure 8. Frequency Distribution of Cut ............................................................................................... 12
Figure 9. Frequency Distribution of colour. ......................................................................................... 12
Figure 10. Frequency Distribution of Clarity........................................................................................ 12
Figure 11. Boxplot of Cut with price variable. ..................................................................................... 13
Figure 12. Boxplot of Color with Price variable. .................................................................................. 14
Figure 13. Boxplot of Clarity with Price............................................................................................... 15
Figure 14. Count plot of Categorical variables with price. ................................................................... 16
Figure 15. Bar plot of Categorical variables with Price. ....................................................................... 16
Figure 16. Pair plot of Zirconia dataset................................................................................................. 17
Figure 17. Scatter plots of all numeric variable with price. .................................................................. 18


Figure 18. Heatmap for Zirconia dataset. ............................................................................................. 18


Figure 19. Boxplot before outlier treatment.......................................................................................... 20
Figure 20. Boxplot after Outlier Treatment. ......................................................................................... 21
Figure 21. Scatter plot for model 1. ...................................................................................................... 25
Figure 22. Scatter plot for model 2. ...................................................................................................... 29
Figure 23. Scatter plot for model 3. ...................................................................................................... 32
Figure 24. Scatter plot for model 4. ...................................................................................................... 35
Figure 25. Scatter plot for model 5. ...................................................................................................... 38
Figure 26. Boxplot and Distplot for Salary ........................................................................................... 46
Figure 27. Boxplot and Distplot for Age. ............................................................................................. 47
Figure 28. Boxplot and Distplot for Education. .................................................................................... 47
Figure 29. Boxplot and Distplot for no_young_children ...................................................................... 47
Figure 30. Boxplot and Displot for no_older_children ......................................................................... 48
Figure 31. Count plot for foreign. ......................................................................................................... 49
Figure 32. Count plot for Holiday package........................................................................................... 49
Figure 33. Boxplot of Salary vs Holiday package ................................................................................ 49
Figure 34. Boxplot of Age vs Holiday package .................................................................................... 50
Figure 35. Count plot of Age against Holiday package ........................................................................ 50
Figure 36. Count plot of Age against Holiday package ........................................................................ 50
Figure 37. Boxplot of Education vs Holiday package .......................................................................... 51
Figure 38. Count plot of Education against Holiday package .............................................................. 51
Figure 39. Boxplot of no of young children vs Holiday package ......................................................... 52
Figure 40. Count plot of no of young against Holiday package ........................................................... 52
Figure 41. Boxplot of no of older children vs Holiday package ........................................................... 53
Figure 42. Count plot of no of older against Holiday package ............................................................. 53
Figure 43. Count plot of foreign vs Holiday package ........................................................................... 54
Figure 44. Boxplot of foreign vs Salary with Holiday package as hue ................................................. 54
Figure 45. Pair plot of problem 2 .......................................................................................................... 55
Figure 46. Heatmap for Problem 2........................................................................................................ 56
Figure 47. Classification report of training and testing data for Logistic Regression model................ 61
Figure 48. Confusion Matrix of train (left) and Test (right) for Logistic Regression ........................... 62
Figure 49. ROC curve for training and testing data for Logistic Regression........................................ 62
Figure 50. Classification report for LDA with default probability cut-off of 0.5 ................................. 63
Figure 51. Confusion matrix of train (left) and test(right) for LDA :0.5 .............................................. 63
Figure 52. ROC curve for train and test for LDA:0.5 ........................................................................... 64
Figure 53. Classification report for LDA with default probability cut-off of 0.4 ................................. 65
Figure 54. Confusion matrix of train (left) and test(right) for LDA :0.4 .............................................. 65
Figure 55. ROC curve for train and test for LDA:0.4 ........................................................................... 65
Figure 56. ROC of model comparison for Train data. .......................................................................... 67
Figure 57. ROC of model comparison for Test data. ............................................................................ 68


List of Tables:
Table 1. Inferences of Univariate Data visualization. ........................................................................... 11
Table 2. Model comparison table. ......................................................................................................... 39
Table 3. Inferences of Univariate Data visualization for problem 2. .................................................... 48
Table 4. LDA cut off probability performance table ............................................................................ 59
Table 5. Model Performance for Logistic Regression Model. .............................................................. 62
Table 6. Model performance for LDA [0.5] ......................................................................................... 64
Table 7. Model performance for LDA [0.4] ......................................................................................... 66
Table 8. Metrices comparison table between models. .......................................................................... 67


Problem 1: Linear Regression

You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are provided with a dataset
containing the prices and other attributes of almost 27,000 cubic zirconia stones (an inexpensive
diamond alternative with many of the same qualities as a diamond). The company earns different profits
on different price slots. You have to help the company predict the price of a stone on the basis of the
details given in the dataset, so that it can distinguish between more profitable and less profitable stones
and improve its profit share. Also, provide the five attributes that are most important.

Data Dictionary:

Variable Name   Description
Carat           Carat weight of the cubic zirconia.
Cut             Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Colour          Colour of the cubic zirconia, with D being the worst and J the best.
Clarity         Refers to the absence of inclusions and blemishes. (In order from worst to best in terms of average price): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth           The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table           The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price           The price of the cubic zirconia.
X               Length of the cubic zirconia in mm.
Y               Width of the cubic zirconia in mm.
Z               Height of the cubic zirconia in mm.

The purpose of this report is to examine past data on cubic zirconia in order to help the company predict
price slots for the stones based on the information provided in the dataset, to understand the data and
examine how the various attributes influence price, and to provide business insights based on exploratory
data analysis and price predictions.

1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, Data types, shape, EDA, duplicate values). Perform Univariate and
Bivariate Analysis.

Exploratory Data Analysis:


Read and view data after dropping ‘Unnamed: 0’ variable:

Checking for the information of features:

Checking the Skewness and Kurtosis:


Checking the description of dataset:

Checking for duplicates in this dataset:

Checking the data types in the dataset:

Checking for number of rows and columns:
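
These checks can be reproduced with a few pandas calls. The following is a minimal sketch, assuming the raw file is named cubic_zirconia.csv (the file name is an assumption); each print corresponds to one of the checks listed above.

```python
# Minimal sketch of the initial EDA checks; file and column names are assumptions.
import pandas as pd

df = pd.read_csv("cubic_zirconia.csv")
df = df.drop(columns=["Unnamed: 0"])        # serial-number column is not useful

print(df.info())                            # feature information
print(df.skew(numeric_only=True))           # skewness of numeric columns
print(df.kurtosis(numeric_only=True))       # kurtosis of numeric columns
print(df.describe().T)                      # description of the dataset
print("Duplicates:", df.duplicated().sum()) # duplicate rows
print(df.dtypes)                            # data types
print("Shape (rows, columns):", df.shape)   # number of rows and columns
```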

Observations:
• The dataset has 11 columns and 26,967 rows, including the 'Unnamed: 0' column.
• The first column "Unnamed: 0" contains only serial numbers, so it is dropped as it is not useful.

• There are both categorical and continuous variables: cut, colour and clarity are categorical, while
carat, depth, table, x, y, z and price are continuous.
• Price is the target variable.
• The dataset is used to predict the price of a zirconia stone on the basis of the given details, so the
company can distinguish between more profitable and less profitable stones and improve its profit
share.
• There are 697 missing values in the variable 'depth', which will be imputed during the data
pre-processing stage.
• There are 34 duplicate rows in the dataset. Although two or more stones can genuinely share the
same dimensions and features, we drop the duplicates to avoid any overlap.
• There are 5 unique types of 'cut', of which the most frequent is 'Ideal' with 10,816 observations,
approximately 40% of the dataset.
• There are 7 types of 'color', of which the most frequent is 'G' with 5,661 observations,
approximately 21% of the dataset.
• There are 8 types of 'clarity' in the dataset, of which the most frequent is 'SI1' with 6,571
observations, approximately 24% of the dataset.
• Skewness and kurtosis are also calculated for each column; high skewness indicates a lack of
symmetry and high kurtosis indicates heavy-tailed data.
• Based on the descriptive summary the data looks reasonable; for most variables the mean and
median are nearly equal.

Data Visualization:

Univariate Analysis for Numeric variables:


Let us define a function 'univariate_analysis_numeric' to display information as part of the univariate
analysis of numeric variables. The function accepts a column name and a number of bins as arguments
and displays the statistical description of the numeric variable, a histogram or distplot to view the
distribution, and a box plot to view the 5-point summary and any outliers.
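
A minimal sketch of such a function is given below, assuming the dataframe is named df and that seaborn and matplotlib are used for the plots.

```python
# Sketch of the univariate-analysis helper described above; df is an assumption.
import matplotlib.pyplot as plt
import seaborn as sns

def univariate_analysis_numeric(column, nbins):
    print(df[column].describe())                    # statistical description
    fig, (ax_box, ax_hist) = plt.subplots(2, 1, figsize=(7, 5))
    sns.boxplot(x=df[column], ax=ax_box)            # 5-point summary and outliers
    sns.histplot(df[column], bins=nbins, kde=True, ax=ax_hist)  # distribution
    plt.show()

univariate_analysis_numeric("carat", 20)
```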

1 - Carat: Carat weight of the cubic zirconia

Figure 1. Boxplot and Distplot of Carat.

• From the above graphs, we can infer that the mean 'carat' weight of the cubic zirconia is around
0.79, with a minimum of 0.20 and a maximum of 4.50.
• The distribution of 'carat' is right skewed, with a skewness value of 1.1164.
• The distribution spikes at around 0.4, 1, 1.5 and 2.
• The distplot shows that most of the data lies between 0 and 2.5.
• The box plot of the 'carat' variable shows the presence of a large number of outliers.


2 - Depth: The Height of cubic zirconia, measured from the Culet to the table, divided by
its average Girdle Diameter.

Figure 2. Boxplot and Distplot of Depth

• From the above graphs, we can infer that mean 'depth' height of cubic zirconia, measured from
the Culet to the table, divided by its average Girdle Diameter is around 61.74 with the minimum
of 50.80 and maximum of 73.60.
• The distribution of 'depth' is slightly left skewed with skewness value of -0.0286.
• The distribution follows a near normal distribution with long tails both on the right side and the
left side.
• The distplot shows the distribution of most of data from 55 to 70.
• The box plot of the 'depth' variable shows presence of large number of outliers.

3- Table: The Width of the cubic zirconia's Table expressed as a Percentage of its Average
Diameter.

Figure 3. Boxplot and Distplot of Table.

• From the above graphs, we can infer that the mean width of the cubic zirconia's 'table', expressed
as a percentage of its average diameter, is around 57.45, with a minimum of 49.00 and a
maximum of 79.00.
• The distribution of 'table' is right skewed with skewness value of 0.7657.
• The distribution has multiple spikes at around 53, 55,60 and 62.5.
• The distplot shows the distribution of most of data from 50 to 65.
• The box plot of the 'table' variable shows presence of outliers.

4- X: Length of the cubic zirconia in mm.


Figure 4. Boxplot and Distplot of 'X'

• From the above graphs, we can infer that mean 'X' length of the cubic zirconia in mm is around
5.72.
• The distribution of 'X' is slightly right skewed with skewness value of 0.3879.
• This distribution has various spikes.
• The distplot shows the distribution of most of data from 3 to 10.
• The box plot of the 'X' variable shows presence of few outliers.

5- Y: Width of the cubic zirconia in mm.

Figure 5. Boxplot and Distplot of 'Y'

• From the above graphs, we can infer that mean 'Y' Width of the cubic zirconia in mm is around
5.73.
• The distribution of 'Y' is right skewed with skewness value of 3.8501.
• The distribution has an extremely long right-side tail because of presence of one outlier at
around 60.
• The distplot shows the distribution of most of data from 0 to 10.
• The box plot of the 'Y' variable shows presence of few outliers.

6 - Z: Height of the cubic zirconia in mm.

Figure 6. Boxplot and Distplot of 'Z'


• From the above graphs, we can infer that mean 'Z' Height of the cubic zirconia in mm is around
3.53.
• The distribution of 'Z' is right skewed with skewness value of 2.568.
• The distribution has an extremely long right-side tail because of presence of one outlier at
around 30.
• The distplot shows the distribution of most of data from 0 to 5.
• The box plot of the 'Z' variable shows presence of few outliers.

7 - Price: The Price of the cubic zirconia

Figure 7. Boxplot and Distplot of Price


• From the above graphs, we can infer that the mean price of the cubic zirconia is around 3939.51,
with a minimum of 326.00 and a maximum of 18818.00.
• The distribution of 'price' is right skewed, with a skewness value of 1.6185.
• The distribution has a long right-side tail because of a small number of very high-priced stones.
• The distplot shows that most of the data lies between 325 and 15000.
• The box plot of the 'Price' variable shows presence of large number of outliers.

Observations:

Table 1. Inferences of Univariate Data visualization.

Sl. No Features Distribution Skewness Outliers


1 Carat Right Skewed +1.116 Yes
2 Depth Almost Normal -0.028 Yes
3 Table Right Skewed +0.765 Yes
4 X: Length Right Skewed +0.387 Yes
5 Y: Width Right Skewed +3.850 Yes
6 Z: Height Right Skewed +2.568 Yes
7 Price Right Skewed +1.618 Yes

• Mean and median values are not very far from each other.
• The data of all attributes is skewed (mostly to the right) except depth.
• The depth data is almost normal; its outliers make it slightly left skewed.
• There are outliers in all numerical features of the cubic zirconia dataset.


Univariate Analysis for Categorical variables:

1. Cut- Describes the cut quality of the cubic zirconia. Quality is in increasing order: Fair,
Good, Very Good, Premium, Ideal.

Figure 8. Frequency Distribution of Cut

2. Colour- Colour of the cubic zirconia. With D being the worst and J the best.

Figure 9. Frequency Distribution of colour.

3. Clarity- Clarity refers to the absence of inclusions and blemishes. (In order from
worst to best in terms of average price): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.

Figure 10. Frequency Distribution of Clarity.


Observations:
• The distribution of 'cut', which describes the cut quality of the cubic zirconia, shows that the 'Ideal'
cut has the maximum frequency of 10,816 and the 'Fair' cut is observed least frequently.
• The distribution of the 'colour' of the cubic zirconia shows 'G' with the maximum frequency of
5,661, and 'J' is the least frequent.
• The distribution of the 'clarity' of the cubic zirconia (clarity refers to the absence of inclusions
and blemishes) shows 'SI1' with the maximum frequency of 6,571, and 'I1' is observed least
frequently.

Bivariate Analysis of Categorical variable with Price:

Cut with Price:


Statistical description of the Cut variable with respect to price.

Figure 11. Boxplot of Cut with price variable.

• For the cut variable, the most sold zirconia stones are 'Ideal' cut gems and the least sold are
'Fair' cut gems.
• All cut types have outliers with respect to price.
• The 'Ideal' cut seems slightly less priced, while the 'Premium' cut seems slightly more expensive.


Color with Price:

Statistical description of the Color variable with respect to price:

Figure 12. Boxplot of Color with Price variable.

• For the colour variable, the most sold are 'G' coloured gems and the least sold are 'J' coloured gems.
• All colour types have outliers with respect to price.
• However, the least priced seem to be 'E' coloured gems; 'J' and 'I' coloured gems seem to be more expensive.

Clarity with Price:

Statistical description of the Clarity variable with respect to price:


Figure 13. Boxplot of Clarity with Price

• For the clarity variable, the most sold are SI1 clarity gems and the least sold are I1 clarity gems.
• All clarity types have outliers with respect to price.
• SI1 stones seem slightly less priced; VS2 and SI2 clarity stones seem to be more expensive.

Count plot of Categorical variables with Target variable Price:

• ‘Ideal’ is the most selling cut type of zirconia stone and ‘Fair’ is the least sold.
• ‘G’ colour is the most selling zirconia stone, followed closely by ‘E’ and ‘F’ in roughly the same
range, and ‘J’ colour is the least selling stone.
• SI1 clarity is the most selling, followed by VS2, and I1 is the least selling.


Figure 14. Count plot of Categorical variables with price.

Bar plot of Categorical variables with Price:


Figure 15. Bar plot of Categorical variables with Price.



• The price of the ‘Ideal’ cut is the highest, and the ‘Fair’ cut is the cheapest compared to all.
• The ‘G’ colour gem is the costliest, is the most liked by buyers and is the highest sold.
• The ‘J’ colour gem is priced lower and is also the least sold.
• SI1 is the most expensive, followed by VS2 and SI2 clarity, which fall in the same price range,
while I1 and IF are the cheaper gems.

Pair plot:
A pair plot shows pairwise scatter plots of all numerical variables in the dataset, so from these graphs
we can identify the relationships between the numerical variables.

Figure 16. Pair plot of Zirconia dataset


Figure 17. Scatter plots of all numeric variable with price.

Observations:
• From the above pair plot, we can see that 'carat' and 'price' are linearly correlated, which means
the carat attribute influences the price of a zirconia stone the most.
• X, Y and Z have a linear relationship with each other and also with the target variable 'price'.
• One assumption of the linear regression model is that the independent variables should not be
linearly correlated with each other; the strong relationships between X, Y and Z (length, width
and height respectively) therefore indicate high multicollinearity among the independent variables.

Multivariate Analysis:
Heat map

Figure 18. Heatmap for Zirconia dataset.


A heatmap shows the correlation between the numerical variables. If the correlation value tends to 1,
the variables are highly positively correlated, whereas if the value is close to 0 they are not correlated.
If the value is negative, the correlation is negative: the higher the value of one variable, the lower the
value of the other, and vice versa.
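
A minimal sketch of how such a heatmap can be produced, assuming the cleaned dataframe is named df:

```python
# Correlation heatmap of the numeric variables; df is an assumption.
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)            # pairwise correlations
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```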

Observations:
• Carat is highly correlated with price; the carat attribute is the best predictor of price.
• Depth is not related to price, so the depth attribute does not play a major role in predicting
price.
• X (length), Y (width) and Z (height) are highly correlated with price.
• X (length), Y (width) and Z (height) are highly correlated with each other and are responsible
for high multicollinearity.
• Multicollinearity is a setback for the linear regression model. The highly correlated variables can
be dropped in one of the models and the model performance checked.

1.2 Impute null values if present, also check for the values which are equal to zero. Do
they have any meaning or do we need to change them or drop them? Check for the
possibility of combining the sub-levels of an ordinal variable and take actions
accordingly. Explain why you are combining these sub-levels with appropriate
reasoning.

Imputing Null values:


Following table shows the total number of missing values for all the variables.

We can see that there are 697 missing values in the depth variable. The missing values are imputed
with the median value of the variable. The table above shows the total number of missing values before
and after imputation.
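
A minimal sketch of the imputation step, assuming the working dataframe is named df:

```python
# Impute the missing 'depth' values with the column median; df is an assumption.
print(df.isnull().sum())                               # missing values before
df["depth"] = df["depth"].fillna(df["depth"].median())
print(df.isnull().sum())                               # missing values after
```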

Checking the values which are equal to zero:


As we saw earlier in the describe output, the ‘x’, ‘y’ and ‘z’ attributes have 0 values, which implies
that the length, width or height of a stone is 0. This is practically impossible and must be some kind
of data-entry error.


Checking the data points where we have 0 value for dimensions:

We can see that there are 8 observations with a value of 0. Since the number of such observations is
very small compared to the total of 26,967, dropping them will not affect the analysis much.

Checking duplicate data points:

We can see that there are 34 duplicate rows; we drop them since they are very few compared to the
size of the dataset. After dropping the duplicates and the rows with values equal to zero, 26,925 data
points remain.
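
A minimal sketch of dropping the zero-dimension rows and the duplicates, assuming the working dataframe is named df:

```python
# Drop rows where any dimension is zero, then drop duplicate rows; df is an assumption.
zero_rows = df[(df["x"] == 0) | (df["y"] == 0) | (df["z"] == 0)]
print("Rows with a zero dimension:", len(zero_rows))

df = df[(df["x"] != 0) & (df["y"] != 0) & (df["z"] != 0)]
df = df.drop_duplicates()
print("Remaining rows:", len(df))
```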

Outlier Treatment:

Figure 19. Boxplot before outlier treatment.

The outliers of all variables are checked by plotting box plots. Since outliers impact model building
(as per the linear regression assumptions), they are treated, and the boxplots of all variables are plotted
again to verify the treatment.
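
The report does not state the exact treatment used, so the sketch below assumes the common approach of capping each numeric column at its IQR whiskers; df is the working dataframe.

```python
# Cap outliers at the IQR whiskers (an assumed method; the report only says
# the outliers are "treated").
def treat_outliers(frame, column):
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    frame[column] = frame[column].clip(lower, upper)

for col in ["carat", "depth", "table", "x", "y", "z", "price"]:
    treat_outliers(df, col)
```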


Figure 20. Boxplot after Outlier Treatment.

Checking the possibility of combining the sub-levels of the ordinal categorical variables and taking
action accordingly:
There are 3 different categorical variables: ‘Cut’, ‘Color’ and ‘Clarity’.

1. Checking the possibility of combining Cut variable sub-categories:


The variable ‘cut’ describes the quality of the cubic zirconia. Quality is in increasing order: Fair, Good,
Very Good, Premium, Ideal. We check a brief summary of the cut attribute across its categories with
respect to ‘Price’, which is our target variable.

• There are 5 sub categories in the ‘Cut’ variable.


• From the above summary we can see that the mean and median prices of ‘Good’ and ‘Very Good’
are close to each other.
• The stones of these 2 categories have a similar description with respect to price.
• We combine these two sub-categories ‘Good’ and ‘Very Good’ and name the result ‘Good’.
• The final sub-categories of ‘Cut’ are ‘Fair’, ‘Good’, ‘Premium’ and ‘Ideal’.

2. Color refers to the color of the stone.


Although there appear to be several possibilities for grouping this field, we choose to ignore them. No
grouping is done on the colour sub-categories, because the colours are distinct and cannot be
meaningfully combined.


3. Checking the possibility of combining Clarity variable sub-categories:


‘Clarity’ is the absence of the inclusions and blemishes. Summary of clarity attribute with respect to
price is as below:


• The mean and median prices of the sub-categories VS1 and VS2 are very close; prices of both
lie in a similar range. Let the combined category be called VS.
• The next categories that can be grouped are VVS1 and VVS2. Their mean and median prices
differ a little but are still close enough to be grouped. Let the combined category be called VVS.
• The third grouping involves SI1 and SI2. The price ranges of these categories are almost the
same. Given that SI1 has a larger number of stones but a lower mean price, and SI2 has a
smaller number of stones but a higher mean price, we may conclude that the two balance each
other and can be grouped together. Let the combined category be called SI.
• The final categories of the clarity variable are I1, SI, VS, VVS and IF.
The grouping of the sub-categories of the above variables is applied in a new copy of the dataset. One
model is built on this dataset and its performance is compared against the non-grouped models.
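
A minimal sketch of this grouping, assuming the categorical values are stored with the labels shown above and the working dataframe is named df:

```python
# Group 'Very Good' with 'Good' and pair up the clarity sub-categories in a
# copy of the dataframe; df and the label spellings are assumptions.
df_grouped = df.copy()
df_grouped["cut"] = df_grouped["cut"].replace({"Very Good": "Good"})
df_grouped["clarity"] = df_grouped["clarity"].replace(
    {"VS1": "VS", "VS2": "VS",
     "VVS1": "VVS", "VVS2": "VVS",
     "SI1": "SI", "SI2": "SI"})
print(df_grouped["cut"].value_counts())
print(df_grouped["clarity"].value_counts())
```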

1.3 Encode the data (having string values) for Modelling. Split the data into train and
test (70:30). Apply Linear regression using scikit learn. Perform checks for significant
variables using appropriate method from stats model. Create multiple models and
check the performance of Predictions on Train and Test sets using Rsquare, RMSE
& Adj-Rsquare. Compare these models and select the best one with appropriate
reasoning.

Encoding the categorical variables:


The categorical variables in this dataset have defined ordinal sub-categories, so ordinal encoding is
appropriate and best suited for model building. The sub-categories of each variable are mapped from
1 to n as mentioned in the data dictionary. Ordinal encoding maps each unique label to an integer value;
this type of encoding is appropriate here because the relationship or order between the categories is
already known and is clearly stated in the data dictionary.


Encoding/ Mapping Cut variable:


The ‘cut’ variable describes the quality of the stone. According to the data dictionary, the quality
increases in order from Fair, Good, Very Good, Premium to Ideal. The numbers are mapped such that
1 is the best quality cut and 5 the lowest.

• Ideal: 1
• Premium: 2
• Very Good: 3
• Good: 4
• Fair: 5

Encoding/ Mapping Clarity variable:


‘Clarity’ is the absence of inclusions and blemishes. The order given, from worst to best in terms of
average price, is IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1; that is, I1 is the best clarity stone by this
measure and IF the worst. The mapping is done such that 1 is the best clarity and 8 the worst.

• I1: 1
• SI2: 2
• SI1: 3
• VS2: 4
• VS1: 5
• VVS2: 6
• VVS1: 7
• IF: 8

Encoding/ Mapping Color variable:


‘Color’ refers to the colour of the stone, with D being the worst and J the best. The mapping is done
such that 1 is the best colour and 7 the worst.

• J: 1
• I: 2
• H: 3
• G: 4
• F: 5
• E: 6
• D: 7
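
A minimal sketch of these mappings, assuming the original string columns are named cut, clarity and color and that the encoded columns follow the report's names:

```python
# Ordinal encoding with the mappings listed above; column names are assumptions.
cut_map = {"Ideal": 1, "Premium": 2, "Very Good": 3, "Good": 4, "Fair": 5}
clarity_map = {"I1": 1, "SI2": 2, "SI1": 3, "VS2": 4,
               "VS1": 5, "VVS2": 6, "VVS1": 7, "IF": 8}
color_map = {"J": 1, "I": 2, "H": 3, "G": 4, "F": 5, "E": 6, "D": 7}

df["cut_c"] = df["cut"].map(cut_map)
df["clarity_c"] = df["clarity"].map(clarity_map)
df["color_c"] = df["color"].map(color_map)
print(df.head())
```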

Checking the head of Dataset after encoding the Categorical variables

Now the dataset is cleaned, encoded and ready to use for model building.


Linear Regression model:


Linear regression is a supervised machine learning model that finds the best-fit linear relationship
between the independent variables and the dependent variable. Its main aim is to find the best-fit line
and the optimal values of the intercept and coefficients such that the error is minimised.
Multiple linear regression models are built and their performance metrics are checked. In the end, the
models are compared and the best-fitting model is selected. The selected model is used to create the
final equation
y (price) = m0 + m1 * carat + m2 * depth + m3 * table + m4 * X + m5 * Y + m6 * Z + m7 * cut_c
+ m8 * color_c + m9 * clarity_c.
The objective is to build different models, make predictions of price slots, check the performance of
each model using different performance metrics, and finally compare all the models and select the best
one with appropriate reasoning. After analysing the data, the following models are built:
Model 1: Considering all the variables as it is and fitting the linear regression model.
Model 2: Dropping the attributes 'x', 'y' & 'z' and fitting the linear regression model.
Model 3: Dropping the attributes 'x', 'y', 'z' & 'depth' and fitting the linear regression model.
Model 4: Dropping the attributes 'x', 'y','z', 'depth' and grouping sub categories of attributes and fitting
the linear regression model.
Model 5: Dropping the attributes 'x', 'y', 'z' & 'depth' and fitting the linear regression model for scaled
data.

Model 1: Considering all the variables as it is and fitting the linear regression model.

In this model, all the attributes are considered as they are and the dataset is not scaled, since scaling
does not influence the accuracy or performance of the linear regression model.

Linear Regression Model - Sklearn


1. Capturing the target column into separate vectors for the training set and test set.
2. X holds the independent attributes and y holds the target variable, which is ‘price’ in our case.
3. Splitting the dataset into train and test in the ratio 70:30 using train_test_split from sklearn,
keeping the random state as 1.
4. Checking the shape of the split data.

5. Fitting the linear regression model from sklearn's linear models to the training set.
6. Finding the regression coefficients for each of the independent attributes.

o The coefficient for carat is 8887.182245900442


o The coefficient for depth is 35.446432597917344
o The coefficient for table is -15.069203823159084
o The coefficient for x is -1348.7213850676303
o The coefficient for y is 1561.8443409182516
o The coefficient for z is -970.5030385552958


o The coefficient for cut_c is -113.33064005373288


o The coefficient for color_c is 273.22599181271306
o The coefficient for clarity_c is 436.8984753150906

- From the above coefficients of the independent attributes we can infer that the ‘carat’
variable has the most weight and acts as the best predictor of price.
- On the other hand, the ‘depth’ and ‘table’ variables do not carry much weight in the
prediction.
- Each coefficient measures how much the predicted price changes for a unit change in that
attribute, keeping the other attributes constant.
- For example, a unit change in the value of carat brings an 8887.18 change in price.

7. The intercept for our model is -5164.440069032453


8. The performance of the regression model is measured by the coefficient of determination (R
square). R square determines the fitness of a linear model and ranges from 0 to 1; the closer the
data points are to the best-fit plane, the closer R square is to 1 and the better the model.
- R square of training data is 0.93122
- R square of testing data is 0.93162
Since the train and test scores are almost identical, the model is a good fit.
9. Calculating the root mean square error (RMSE) to check model performance. RMSE is the
standard deviation of the prediction errors (residuals). Residuals measure how far the data points
are from the best-fit plane, so RMSE describes how spread out the residuals are; the lower the
RMSE, the closer the data points are to the best-fit plane.
- RMSE of training data is 906.899
- RMSE of testing data is 911.29
10. Checking the plot between the original price and the predicted price for linear relationship.
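
A minimal sketch of the scikit-learn workflow in steps 1 to 10 above, assuming the encoded dataframe is named df; the 70:30 split and random_state=1 follow the report.

```python
# Sketch of the sklearn linear regression workflow for model 1; df is an assumption.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = df.drop(columns=["price"])               # independent attributes
y = df["price"]                              # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lm = LinearRegression().fit(X_train, y_train)

print(dict(zip(X.columns, lm.coef_)))        # coefficient of each attribute
print("Intercept:", lm.intercept_)
print("R2 train:", lm.score(X_train, y_train))
print("R2 test :", lm.score(X_test, y_test))
print("RMSE train:", np.sqrt(mean_squared_error(y_train, lm.predict(X_train))))
print("RMSE test :", np.sqrt(mean_squared_error(y_test, lm.predict(X_test))))
```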

Figure 21. Scatter plot for model 1.

The linear regression model built with the scikit-learn library offers only a limited set of performance
parameters to inspect. Therefore, we also perform linear regression using statsmodels.

Linear Regression Model- statsmodels


Statsmodel uses OLS (ordinary least square method) to predict the best fit plane. OLS also minimizes
the sum of squared differences between the observed and predicted values by estimating coefficients
and bias.


The difference between scikit-learn linear regression and statsmodels linear regression is that statsmodels
gives a more detailed summary of the model. Statsmodels also provides the adjusted R-square value and
probabilities to check whether the model is reliable, and it provides the probability for each variable,
indicating whether its coefficient is reliable. Statsmodels therefore gives a better statistical analysis of
the model than sklearn and tells us which attributes we can drop and which we should keep.
The adjusted R-square metric accounts for spurious predictors. The analysis above suggests that some
degree of multicollinearity is present.
In the OLS model, hypothesis testing is needed to establish the reliability of the coefficients. The null
hypothesis (H0) claims that there is no relation between the dependent and an independent variable.
At the 95% confidence level, if the p value is >= 0.05, we do not have enough evidence to reject H0;
therefore there is no relation between the dependent and that independent variable.
Similarly, if the p value is < 0.05, we reject the null hypothesis; therefore there is a relationship between
the dependent and that independent variable.
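
A minimal sketch of the statsmodels OLS fit, assuming the train split from the earlier sketch; wrapping the encoded categoricals in C(...) reproduces the cut_c[T.2]-style terms seen in the summary.

```python
# OLS fit with statsmodels on the training data; variable names are assumptions.
import statsmodels.formula.api as smf

train = X_train.copy()
train["price"] = y_train

formula = ("price ~ carat + depth + table + x + y + z "
           "+ C(cut_c) + C(color_c) + C(clarity_c)")
ols_model = smf.ols(formula=formula, data=train).fit()
print(ols_model.summary())   # R-squared, adjusted R-squared, p values, condition number
```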

Checking the OLS summary for the model:


From the above summary we can infer that:

• R-squared and Adjusted R-squared values are the same, equal to 0.940.
• The overall p value of the model is 0.00 (< 0.05), which means the model is reliable.
• Considering each individual variable and its p-value, cut_c[T.2] and depth have values greater
than 0.05. These variables are therefore not good predictors of price and can be dropped to get
a better-performing model.
• The condition number is large, which indicates the presence of multicollinearity in the dataset;
this was clearly seen in the earlier analysis of the X, Y and Z variables.
• The RMSE score for our OLS model is 844.29.
• Checking multicollinearity using VIF (a code sketch for computing these VIF values is given
after the inferences below).
We test for multicollinearity with the Variance Inflation Factor (VIF). VIF identifies correlation
between independent variables and the strength of that correlation. VIF starts at 1 and has no
upper bound.

VIF equal to 1 indicates no correlation between independent variables.
VIF between 1 and 5 indicates moderate but not severe correlation.
VIF greater than 5 indicates critical levels of multicollinearity.

o carat ---> 122.65490394147022


o depth ---> 1126.3143618911165
o table ---> 892.2124758097101
o x ---> 10638.27854893691
o y ---> 9419.13075753421
o z ---> 3226.9583455469033
o cut_c ---> 6.138962724380723
o color_c ---> 8.53348426777295
o clarity_c ---> 8.66162674295014

• High levels of multicollinearity are present in the data, so this model is not reliable. Changing
the data by dropping the highly correlated variables may overcome the problem of
multicollinearity.
• Linear Equation:
(-4325.02) * Intercept + (-31.21) * cut_c[T.2] + (-127.54) * cut_c[T.3] + (-242.7) * cut_c[T.4]
+ (-629.95) * cut_c[T.5] + (531.48) * color_c[T.2] + (1030.1) * color_c[T.3] + (1450.54) * color_c[T.4]
+ (1630.39) * color_c[T.5] + (1672.75) * color_c[T.6] + (1861.63) * color_c[T.7] + (1712.14) * clarity_c[T.2]
+ (2535.87) * clarity_c[T.3] + (3072.13) * clarity_c[T.4] + (3355.1) * clarity_c[T.5] + (3766.77) * clarity_c[T.6]
+ (3776.88) * clarity_c[T.7] + (3995.22) * clarity_c[T.8] + (9200.19) * carat + (12.59) * depth
+ (-23.07) * table + (-1176.95) * x + (1083.2) * y + (-642.48) * z

• Inferences: Carat, with a coefficient of 9200.19, is the best predictor of price. For a 1 unit change
in ‘carat’, the price changes by 9200.19 units, keeping all other variables constant.
Based on the full analysis of this model, it is not the best model for predicting price slots for
zirconia stones, but it tells us what changes are needed to create a better-fitting model.
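
A minimal sketch of the VIF computation referred to above, assuming the predictor matrix X_train from the earlier sketch:

```python
# One VIF value per predictor, using statsmodels' variance_inflation_factor.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns)
print(vif)
```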


Model 2 - Dropping the attributes 'x', 'y' & 'z' and fitting the linear regression model

In this model we consider all the variables except ‘x’, ‘y’ and ‘z’, and the data is again not scaled. As we
saw in the model 1 analysis, ‘x’, ‘y’ and ‘z’ contribute to high multicollinearity and their VIF scores were
very large. So we build a model dropping the x, y and z variables, check its performance and compare.

Linear Regression Model - Sklearn


1. Capturing the target column into separate vectors for the training set and test set.
2. X holds the independent attributes, with the x, y and z variables excluded, and y holds the
target variable, which is ‘price’ in our case.
3. Splitting the dataset into train and test in the ratio 70:30 using train_test_split from sklearn,
keeping the random state as 1.
4. Checking the shape of the split data.

5. Fitting the linear regression model from sklearn's linear models to the training set.
6. Finding the regression coefficients for each of the independent attributes.

o The coefficient for carat is 7957.233009701646


o The coefficient for depth is -17.60091599461744
o The coefficient for table is -20.090752371797244
o The coefficient for cut_c is -105.77168383079943
o The coefficient for color_c is 271.88660130178056
o The coefficient for clarity_c is 450.3790478005037

7. From the coefficients above we can infer that the ‘carat’ variable has the most weight in this
model too and acts as the best predictor of price.
8. On the other hand, the ‘depth’ and ‘table’ variables do not carry much weight in the prediction.
The coefficient values have changed from model 1 to model 2; model 2 performs better than
model 1.
9. Each coefficient measures how much the predicted price changes for a unit change in that
attribute, keeping the other attributes constant.
10. For example, a unit change in the value of carat brings a 7957.23 change in price.
11. The intercept for our model is -3136.18073; its absolute value is lower than in model 1.
12. The performance of the regression model is measured by the coefficient of determination (R
square). R square determines the fitness of a linear model and ranges from 0 to 1; the closer the
data points are to the best-fit plane, the closer R square is to 1 and the better the model.
R square of training data is 0.93016
R square of testing data is 0.93055
Since the train and test scores are almost identical, the model is a good fit.
Calculating the root mean square error (RMSE) to check model performance. RMSE is the standard
deviation of the prediction errors (residuals). Residuals measure how far the data points are from the
best-fit plane, so RMSE describes how spread out the residuals are; the lower the RMSE, the closer
the data points are to the best-fit plane.
RMSE of training data is 913.878
RMSE of testing data is 918.433
The RMSE values have increased slightly compared to the previous model.
13. Checking the plot between the original price and the predicted price for linear relationship.

Figure 22. Scatter plot for model 2.


Checking the OLS summary for the model:


From the above summary we can infer that:

• R-squared and Adjusted R-squared values are the same, equal to 0.939.
• The overall p value of the model is 0.00 (< 0.05), which means the model is reliable.
• Considering each variable and its p value, the p-value of cut_c[T.2] has come down to 0 in this
model, but ‘depth’ has a p value greater than 0.05, so keeping the ‘depth’ variable in the model is
not necessary. The next model is built by dropping the depth variable too.
• The condition number is still large, which indicates the presence of multicollinearity caused by
other variables in the dataset. The condition number is reduced from model 1; dropping the depth
variable may eliminate the remaining multicollinearity and improve the performance of the
model.
• The RMSE score for our OLS model is 851.307.
• Checking multicollinearity using VIF for model 2.
We test for multicollinearity with the Variance Inflation Factor (VIF). VIF identifies correlation
between independent variables and the strength of that correlation. VIF starts at 1 and has no
upper bound.

VIF equal to 1 indicates no correlation between independent variables.
VIF between 1 and 5 indicates moderate but not severe correlation.
VIF greater than 5 indicates critical levels of multicollinearity.

o carat ---> 5.138632848517505


o depth ---> 480.09395166004856
o table ---> 500.29465902799984
o cut_c ---> 5.165922749512226
o color_c ---> 8.477234121974698
o clarity_c ---> 8.33395177559938

• The VIF scores have decreased to a great extent, so the problem of multicollinearity is treated
to a very good extent by removing the ‘x’, ‘y’ and ‘z’ variables.
• Linear Equation for model 2:

(-4823.84) * Intercept + (-67.17) * cut_c[T.2] + (-100.1) * cut_c[T.3] + (-232.58) * cut_c[T.4]
+ (-706.73) * cut_c[T.5] + (529.37) * color_c[T.2] + (1015.59) * color_c[T.3] + (1423.1) * color_c[T.4]
+ (1602.37) * color_c[T.5] + (1656.0) * color_c[T.6] + (1844.96) * color_c[T.7] + (1747.32) * clarity_c[T.2]
+ (2575.53) * clarity_c[T.3] + (3131.5) * clarity_c[T.4] + (3415.86) * clarity_c[T.5] + (3853.69) * clarity_c[T.6]
+ (3882.4) * clarity_c[T.7] + (4106.88) * clarity_c[T.8] + (8027.46) * carat + (-10.42) * depth
+ (-22.95) * table

• Inferences: ‘Carat’ is still the best predictor, with a coefficient of 8027.46. For a 1 unit change in
‘carat’, the price changes by 8027.46 units, keeping all other variables constant. From this model
we can infer that the strong multicollinearity due to x, y and z has been reduced to a great
extent. The remaining collinearity is due to the ‘depth’ variable, which is not a good predictor
and can also be dropped. The next model is therefore built by dropping the depth variable as
well; its effect on model performance is compared with the previous models in order to choose
the best-fitting model for predicting price slots for the company.


Model 3: Dropping the attributes 'x', 'y', 'z' & 'depth' and fitting the linear regression
model.
In this model, all variables except ‘x’, ‘y’, ‘z’ and ‘depth’ are considered and the data is unscaled. As
we saw in the model 2 analysis, dropping ‘x’, ‘y’ and ‘z’ reduces the high multicollinearity in the model.
In this model, along with ‘x’, ‘y’ and ‘z’, we also drop the ‘depth’ variable, which is not a good predictor
and contributes some amount of multicollinearity, in order to improve the model's performance.

Linear Regression Model – Sklearn

1. Capturing the target column into separate vectors for the training set and test set.
2. X holds the independent attributes, with the x, y, z and depth variables excluded, and y holds the
target variable, which is ‘price’ in our case.
3. Splitting the dataset into train and test in the ratio 70:30 using train_test_split from sklearn,
keeping the random state as 1.
4. Checking the shape of the split data.

5. Fitting the linear regression model from sklearn's linear models to the training set.
6. Finding the regression coefficients for each of the independent attributes.

o The coefficient for carat is 7956.146166877565


o The coefficient for table is -15.226520064985072
o The coefficient for cut_c is -113.92419740534123
o The coefficient for color_c is 272.479510071415
o The coefficient for clarity_c is 451.1570717705239

7. The coefficients show no major difference compared to the previous models.

8. From the coefficients above we can infer that the ‘carat’ variable has the most weight and acts as
the best predictor of price. The number of predictors is smaller, yet the model performs
comparably well.
9. Each coefficient measures how much the predicted price changes for a unit change in that
attribute, keeping the other attributes constant.
10. The intercept for our model is -4490.273. The absolute value of the intercept is larger than in the
other two models.
11. The performance of the regression model is measured by the coefficient of determination (R
square). R square determines the fitness of a linear model and ranges from 0 to 1; the closer the
data points are to the best-fit plane, the closer R square is to 1 and the better the model.
R square of training data is 0.93013
R square of testing data is 0.930502
Since the train and test scores are almost identical, the model is a good fit.
12. Calculating the root mean square error (RMSE) to check model performance. RMSE is the standard
deviation of the prediction errors (residuals). Residuals measure how far the data points are from
the best-fit plane, so RMSE describes how spread out the residuals are; the lower the RMSE, the
closer the data points are to the best-fit plane.


RMSE of training data is 914.069
RMSE of testing data is 918.750
There is no notable change in the RMSE scores compared to the last model.
13. Checking the plot between the original price and the predicted price for linear relationship.

Figure 23. Scatter plot for model 3.

Checking the OLS summary for the model:


From the above summary we can infer that:

• R-squared and Adjusted R-squared values are the same, equal to 0.939.
• The overall p value of the model is 0.00 (< 0.05), which means the model is reliable.
• Considering each variable and its p value, no variable has a p value greater than 0.05, so all
variables in this model have a good relationship with the dependent variable and all of them are
retained.
• The condition number is still large, which indicates strong multicollinearity or other numerical
problems in the dataset, but it has reduced from 1.03e+04 in model 1 to 1.99e+03 in model 3.
• The RMSE score for our OLS model is 851.355.
• Checking multicollinearity using VIF for model 3.
We test for multicollinearity with the Variance Inflation Factor (VIF). VIF identifies correlation
between independent variables and the strength of that correlation. VIF starts at 1 and has no
upper bound.

VIF equal to 1 indicates no correlation between independent variables.
VIF between 1 and 5 indicates moderate but not severe correlation.
VIF greater than 5 indicates critical levels of multicollinearity.

o carat ---> 5.137344729116566


o table ---> 34.373679445889046
o cut_c ---> 5.036820239177246
o color_c ---> 8.425690234941365
o clarity_c ---> 8.165917118475038

• VIF scores for variables have reduced considerably.


• Linear Equation for model 3:

(-5621.25) * Intercept + (-70.02) * cut_c[T.2] + (-106.59) * cut_c[T.3] + (-247.62) * cut_c[T.4]
+ (-729.91) * cut_c[T.5] + (529.25) * color_c[T.2] + (1015.98) * color_c[T.3] + (1423.82) * color_c[T.4]
+ (1603.85) * color_c[T.5] + (1657.39) * color_c[T.6] + (1846.71) * color_c[T.7] + (1749.48) * clarity_c[T.2]
+ (2576.91) * clarity_c[T.3] + (3133.64) * clarity_c[T.4] + (3418.95) * clarity_c[T.5] + (3856.72) * clarity_c[T.6]
+ (3885.6) * clarity_c[T.7] + (4112.16) * clarity_c[T.8] + (8026.74) * carat + (-20.25) * table

Inferences:
‘Carat’ is still the best predictor, with a coefficient of 7956.14. For a 1 unit change in ‘carat’, the price
changes by 7956.14 units, keeping all other variables constant.
VIF scores have reduced to almost 5 for most of the variables. RMSE has not shown any major change
up to this model. The variables that are good predictors are understood through this model, and the p
values of all variables are under 0.05. The condition number is also reduced considerably.
We also see that our RMSE is still high and the coefficients are not on a comparable scale, so we should
bring the variables to a comparable form. We can achieve that by scaling the data; the same model is
later refitted on scaled data and compared (a scaling sketch is given below).
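
A minimal sketch of the scaling step, assuming z-score scaling with StandardScaler (the report only says the data is scaled, so the choice of scaler is an assumption) and the train/test split from the earlier sketches:

```python
# Scale the predictors before refitting the linear regression; scaler choice is assumed.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training data only
X_test_scaled = scaler.transform(X_test)         # apply the same scaling to test
```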


Model 4: Dropping the attributes 'x', 'y', 'z', 'depth' and grouping sub categories of
attributes and fitting the linear regression model.

In this model we consider the same attributes as the previous one and also group the sub-categories of
the clarity variable. The copied data frame in which the sub-categories were grouped is used for model
building, to check the performance when the data is altered. It is compared with the performance on
the original data, and suggestions for the company can be given based on the results.

Linear Regression Model – Sklearn


1. Capturing the target column into separate vectors for the training set and test set.
2. The X variable holds the independent attributes, with x, y, z and depth dropped and the sub-categories of the clarity variable grouped; the y variable holds the target variable, which is 'price' in our case.
3. Splitting the dataset into train and test in the ratio of 70:30 using train_test_split from sklearn, keeping the random state as 1.
4. Checking the shape of the split data.
5. Fitting the Linear Regression model from sklearn's linear models to the training set (a code sketch of this workflow is given after Figure 24).
6. Finding the coefficient for each of the independent attributes.

o The coefficient for carat is 7861.428694376872


o The coefficient for table is -17.17005364009355
o The coefficient for cut_c is -117.61394278203296
o The coefficient for color_c is 260.6060552289523
o The coefficient for clarity_c is 407.7462698772857

7. No major difference in the coefficients is seen compared to the previous models.


8. The coefficient of determination is a measure of how much of the variability of one factor is explained by its relationship to another related factor.
9. The intercept for our model is -3878.169.
10. Model performance of the regression model is measured by the coefficient of determination (R square). R square determines the fitness of a linear model and ranges from 0 to 1. The closer the data points are to the best fit plane, the closer the coefficient of determination is to 1 and the better the model.
R square of training data is 0.92477
R square of testing data is 0.92506
We can see that the score of Train and Test is almost similar, the model is a good fit model.
11. Calculating the root mean square error (RMSE) to check model performance. RMSE is the standard deviation of the prediction errors (residuals); residuals measure how far the data points are from the best fit plane, so RMSE tells us how spread out these residuals are. The lower the RMSE, the closer the data points are to the best fit plane.
RMSE of training data is 948.47
RMSE of testing data is 953.85


The RMSE is higher than in the previous models, so altering and combining the sub-categories of the attributes is not working well. It is better to keep all the sub-categories as in the original data.
12. Checking the plot between the original price and the predicted price for linear relationship.

Figure 24. Scatter plot for model 4.
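The sklearn workflow in the numbered steps above can be sketched as follows; df, the column names and the encoded cut_c/color_c/clarity_c columns are assumptions, so this is an illustration rather than the exact original code.

# Hedged sketch of the sklearn workflow in the numbered steps above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = df[["carat", "table", "cut_c", "color_c", "clarity_c"]]   # x, y, z, depth dropped
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)                            # shape of the split data

reg = LinearRegression().fit(X_train, y_train)

for name, coef in zip(X.columns, reg.coef_):                  # coefficient of each attribute
    print(f"The coefficient for {name} is {coef}")
print("Intercept:", reg.intercept_)

print("R square train:", r2_score(y_train, reg.predict(X_train)))
print("R square test :", r2_score(y_test, reg.predict(X_test)))
print("RMSE train:", np.sqrt(mean_squared_error(y_train, reg.predict(X_train))))
print("RMSE test :", np.sqrt(mean_squared_error(y_test, reg.predict(X_test))))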

Checking the OLS summary for the model:


From the above summary we can infer that:

• R-squared and Adjusted R-squared values are the same, equal to 0.933. The R-square value is slightly lower than in the previous models.
• The overall p-value of the model is 0.00 (< 0.05), which means the model is reliable.
• Looking at each variable's p-value, no variable has a p-value above 0.05, so all the variables in this model have a significant relationship with the dependent variable and all of them are retained.
• The condition number is still large, +1.96e+03, which indicates that some numerical problems remain even though the variables contributing most to the multicollinearity have been removed.
• The RMSE score for our OLS model is 893.715
• Checking Multi-collinearity using VIF for model 4.
We test for multicollinearity with the Variance Inflation Factor (VIF). VIF identifies correlation between independent variables and the strength of that correlation. VIF starts from 1 and has no upper bound.

• VIF equal to 1 indicates no correlation between independent variables.
• VIF between 1 and 5 indicates moderate but not severe correlation.
• VIF greater than 5 indicates critical levels of multicollinearity.

o carat ---> 5.014781525999582
o table ---> 30.721836494082076
o cut_c ---> 5.030393865852007
o color_c ---> 8.39160381584299
o clarity_c ---> 6.38478635319889

• VIF scores for variables have reduced considerably.


• Linear Equation for model 4:

(-5326.17) * Intercept + (-75.27) * cut_c[T.2] + (-94.9) * cut_c[T.3] + (-242.03) * cut_c[T.4] + (-775.72) * cut_c[T.5] + (515.25) * color_c[T.2] + (973.01) * color_c[T.3] + (1379.3) * color_c[T.4] + (1551.05) * color_c[T.5] + (1591.27) * color_c[T.6] + (1775.29) * color_c[T.7] + (2202.48) * clarity_c[T.2] + (3188.59) * clarity_c[T.4] + (3788.51) * clarity_c[T.6] + (4021.81) * clarity_c[T.8] + (7916.86) * carat + (-22.11) * table

Inferences:
'Carat' is still the best predictor, with a coefficient of 7916.86: for a 1 unit change in 'carat', the price will change by 7916.86 units keeping all other variables at 0.
VIF scores have reduced to almost 5 for most of the variables, all variables have p-values under 0.05, and the condition number is reduced considerably. However, the R-squared value has decreased slightly compared to the model on the original data, and there is a fair increase in the RMSE values, which indicates that combining the sub-categories of the variables does not contribute to better performance. So, it is better to consider the previous model, i.e. model 3.


Model 5: Dropping the attributes 'x', 'y', 'z' & 'depth' and fitting the linear regression
model for scaled data.

Of the four models above, model 3 performs best, but its variables are not on a comparable scale. Hence the dataset is scaled using the z-score, so that every variable has a mean close to 0 and a standard deviation close to 1. The impact of scaling on the model is checked, models 3 and 5 are compared, and the best model for the prediction of price is finally selected.
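The z-score scaling can be sketched as below; df and the column list are assumptions, and either scipy's zscore or sklearn's StandardScaler would serve equally well here.

# Hedged sketch of z-scoring the columns used in the model (df and columns assumed).
from scipy.stats import zscore

cols = ["carat", "table", "cut_c", "color_c", "clarity_c", "price"]
df_scaled = df[cols].apply(zscore)

print(df_scaled.mean().round(3))   # every column now has mean close to 0
print(df_scaled.std().round(3))    # and standard deviation close to 1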

Linear Regression Model – Sklearn

1. Capturing the target column into separate vectors for the training set and test set.
2. The X variable holds the independent attributes, with x, y, z and depth not considered; the y variable holds the target variable, which is 'price' in our case.
3. Splitting the dataset into train and test in the ratio of 70:30 using train_test_split from sklearn, keeping the random state as 1.
4. Checking the shape of the split data.
5. Fitting the Linear Regression model from sklearn's linear models to the training set.
6. Finding the coefficient for each of the independent attributes.

o The coefficient for carat is 1.0580025633434904


o The coefficient for table is -0.009494734169229253
o The coefficient for cut_c is -0.03664855966671572
o The coefficient for color_c is 0.13427496894739407
o The coefficient for clarity_c is 0.2151144567745126

7. From the coefficients of the independent attributes above we can infer that the 'carat' variable has the most weight and acts as the best predictor of price. The number of predictors is smaller, yet the model is comparatively better.
8. The coefficient of determination is a measure of how much of the variability of one factor is explained by its relationship to another related factor.
9. The intercept for our model is -2.725e-16. After scaling, the intercept becomes almost equal to zero.
10. Model performance of the regression model is measured by the coefficient of determination (R square). R square determines the fitness of a linear model and ranges from 0 to 1. The closer the data points are to the best fit plane, the closer the coefficient of determination is to 1 and the better the model.
R square of training data is 0.93013
R square of testing data is 0.930502
Scaling of dataset, does not affect the R square values. We can see that the score of Train and Test is
almost similar, the model is a good fit model.
11. Calculating the root mean square error (RMSE) to check model performance. RMSE is the standard deviation of the prediction errors (residuals); residuals measure how far the data points are from the best fit plane, so RMSE tells us how spread out these residuals are. The lower the RMSE, the closer the data points are to the best fit plane.


RMSE of training data is 0.2643
RMSE of testing data is 0.2636
After scaling, the standard deviation of the residuals is about 0.26 standard deviations of the scaled price, i.e. the unexplained error is small relative to the spread of the target, which allows better interpretability when studying the model. The RMSE is now on a comparable scale with the predictors, and the data points are concentrated close to the best fit plane.
12. Checking the plot between the original price and the predicted price for linear relationship.

Figure 25. Scatter plot for model 5.

Checking the OLS summary for the model:

From the above summary we can infer that:

• R-squared and Adjusted R-squared values are the same, equal to 0.930.
• The overall p-value of the model is 0.00 (< 0.05), which means the model is reliable.


• Looking at each variable's p-value, no variable has a p-value above 0.05, so all the variables in this model have a significant relationship with the dependent variable and all of them are retained.
• In this model, the condition number has reduced to 1.87. This shows that multicollinearity and other numerical problems are no longer present in our model.
• The RMSE score for our OLS model is 0.2643.
• Checking multicollinearity using VIF for model 5.
We test for multicollinearity with the Variance Inflation Factor (VIF). VIF identifies correlation between independent variables and the strength of that correlation. VIF starts from 1 and has no upper bound.

• VIF equal to 1 indicates no correlation between independent variables.
• VIF between 1 and 5 indicates moderate but not severe correlation.
• VIF greater than 5 indicates critical levels of multicollinearity.

o carat ---> 1.30155395589329
o table ---> 1.2632339639911656
o cut_c ---> 1.2570332894546756
o color_c ---> 1.1153449236143709
o clarity_c ---> 1.193835288526028

• All variables have a VIF score of almost 1, suggesting negligible correlation among the independent variables in the dataset.
• Linear Equation for model 5:

(-0.0) * Intercept + (1.06) * carat + (-0.01) * table + (-0.04) * cut_c + (0.13) * color_c + (0.22) * clarity_c

Inferences:
'Carat' is still the best predictor, with a coefficient of 1.06: for a 1 unit change in 'carat', the price will change by 1.06 units keeping all other variables at 0. Looking at all the performance metrics of this model, we can say it fulfils all the criteria to be the best fit model.
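The OLS summaries and the linear equations quoted for the models come from a statsmodels fit; the sketch below shows one way to obtain them, with train_scaled and the column names as assumptions.

# Hedged sketch of fitting the OLS model and assembling its linear equation.
import statsmodels.formula.api as smf

ols_model = smf.ols("price ~ carat + table + cut_c + color_c + clarity_c",
                    data=train_scaled).fit()

print(ols_model.summary())   # R-squared, p-values, condition number

# Assemble the fitted linear equation from the model parameters
equation = " + ".join(f"({coef:.2f}) * {name}" for name, coef in ols_model.params.items())
print(equation)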

Model comparison:

Table 2. Model comparison table.

Model 1: Considering all the variables as it is and fitting the linear regression model.
Model 2: Dropping the attributes 'x', 'y' & 'z' and fitting the linear regression model.
Model 3: Dropping the attributes 'x', 'y', 'z' & 'depth' and fitting the linear regression model.
Model 4: Dropping the attributes 'x', 'y', 'z', 'depth' and grouping sub categories of attributes and fitting
the linear regression model.


Model 5: Dropping the attributes 'x', 'y', 'z' & 'depth' and fitting the linear regression model for scaled
data

Inferences:
• Accuracy (R square) is almost the same for all models, for both the sklearn and statsmodels versions.
• Models 1, 3 and 5 give us the best RMSE values.
• VIF max and VIF min values are lowest for model 5, since that data is scaled.
• In model 4, even after combining the sub-categories of the attributes, the RMSE score for train and test is higher than for the other models, so the idea of combining sub-categories is dropped.
• Models 3 and 5 use the same attributes; one is built on scaled attributes and the other on the original dataset.
• Model 5 is our best fit model and most viable for the given set based on the performance
measures of other models.

Final linear equation is as given below:

(-0.0) * Intercept + (1.06) * carat + (-0.01) * table + (-0.04) * cut_c + (0.13) * color_c + (0.22) *
clarity_c

1.4 Inference: Basis on these predictions, what are the business insights and
recommendations.
According to the problem statement, Gem Stones Co Ltd, a cubic zirconia manufacturer, earns varying profits on different pricing slots. The company wants to predict the stone's price based on the data provided, so that it can distinguish between higher profitable and lower profitable stones and maximise its profit share. The company also requires the top five attributes that are most essential in price prediction.
In our extensive analysis so far, we have thoroughly examined the historical data and developed a model that predicts the different price slots based on the characteristics in our dataset. Let us now look at the key points in our past data first and try to suggest some recommendations for the firm.
To have a better profit share, the business value lies in distinguishing between higher profitable stones and lower profitable stones. Our model explains more than 90% of the variance in price (R-square above 0.9), which should be acceptable in this business, so it will predict the price well for most of the stones.
Following are the insights and recommendations to help the firm to solve the business objective:

1. Carat: Carat weight of the cubic zirconia


Insights:

• Carat is the best predictor for the price.


• It has a positive linear relation with price: the price increases with an increase in the carat of the zirconia stone.
• Carat is a measure of weight, which has a direct correlation with the physical dimensions (x, y, z).
Recommendations:

• Carat is the best predictor of price, according to the best fit model.
• The firm should favour stones with a higher carat value, as stones with larger carat values are priced higher.
• The significance of higher carat stones should be advertised to people.


• Marketing should be done in such a way that clients are aware of the significance of higher
carat values.
• Customers should receive varied presentations depending on their financial capabilities.
Customers with a higher financial status should be offered higher quality carat stones, while
those with a lesser paying ability should be offered lower carat stones.
• The marketing can be done educating customers about the significance of a better carat score
and quality.

2. Cut: Describe the cut quality of the cubic zirconia.


Insights:

• For the cut attribute, we see that the Ideal cut type is the most sold, and the average price of Ideal is slightly lower than that of the Premium cut type, which is slightly more expensive.
• 'Fair' and 'Good' have a lower count of sales and a relatively higher average price.
• The Ideal, Premium and Very Good cut types bring better profits.
Recommendations:

• The Ideal, Premium and Very Good cut types are the ones bringing in more profits; proper marketing of these products may increase sales to a greater extent.
• The best quality cut, 'Ideal,' has a lower average price comparatively. However, 'Ideal' has a
high count at this pricing. The firm might try increasing the price of the ideal category a little
to see whether it affects sales. If sales are reduced, they should return to the current market
price.
• Although we know that 'Fair' and 'Good' are of the lowest cut quality and are sold in small quantities, their average price is still rather substantial. The firm can attempt to lower their average price or increase the quality of these cuts so that customers are willing to pay the higher price.
• Otherwise, it is advisable to de-emphasise the 'Fair' and 'Good' cut types, as their sales and profits are very low.

3. ‘X’, ‘Y’ & ‘Z’:


Insights:

• X, Y and Z are the length, width and height of the cubic zirconia. They have a linear relation with each other and also with the target variable 'price'.
• All three have a strong relation to the price variable; that is, changes in the values of x, y and z cause the price to change.
• At the same time, there is a significant association among these three. This indicates that these variables cause high multicollinearity, which affects the performance of our price prediction.
Recommendations:

• The dimensions on their own do not command a higher price; a smaller but well-proportioned stone can be more expensive.
• If a stone with smaller dimensions has a larger carat value and superior clarity, it will be valued higher than a huge stone with lower carat and clarity.
• The firm can focus more on well-proportioned stones of different sizes with higher quality.


4. Depth and Table:


• Depth and table are both poor predictors of price.
• From the EDA of depth vs price and table vs price, we can see that depth and table have only a minimal relationship with price; there is no defined relationship and the scatter is spread like a cloud, which is not useful for model building.

5. Clarity:
Insights:

• Clarity refers to the absence of inclusions and blemishes and has emerged as a strong predictor of price as well.
• SI1 is the most expensive, followed by VS2 and SI2, which fall in the same price range, while I1 and IF are the cheaper stones.
• SI1 clarity is the most sold, followed by VS2, with I1 being the least sold.
• Clarity types SI1, VS2 and SI2 are helping the firm put an expensive price cap on the stones and also have the highest selling counts.
Recommendations:

• Price of ‘I1’ could be reduced as it is having very low sales.


• I1' is of the highest quality and may reduce earnings, but a little risk may be taken by the firm
by lowering its price for a period of time, and if sales grow, the price can be raised to its former
level.
• 'IF,' 'VVS1', and 'VVS2' are more helpful in price prediction than other clarity categories. In
comparison to other areas, the firm should put greater emphasis on them.

6. Color:
Insights:

• The 'G' color gem is the most expensive, is the most liked by customers, and is the highest sold.
• The 'J' color gem has a lower price and is also the least sold.
• 'G' is the most selling zirconia color, followed by 'E' and 'F' in nearly the same range, while the 'J' color gem is the least selling stone.
Recommendations:

• The color of the stones, such as H, I, and J, will not help the company in putting a high price
cap on such stones.
• Instead, the firm should concentrate on stones in the color D, E, and F in order to fetch greater
prices and boost sales.
• This might also signal that the firm should be exploring for unique color stones, such as
transparent stones to help boost the pricing.
• ‘J’ and ‘I’ color stones should be priced lower. Maybe the customers get attracted by the lower
price and the sales is increased.
The best 5 attributes which are good predictors for prediction of price are as follows:
1. Carat
2. Clarity
3. Color
4. Cut
5. Table


Key performance indicators:


• Sales promotion: special deals stimulate demand. Sales promotion can be effective in changing the short-term behaviour of buyers.
• Advertising is an efficient way of reaching many people and potential buyers. For example, an advertising campaign can be run around January and February, when Valentine's Day is near, or around occasions like Mother's Day.
• The company can make segments, and target the customer based on their income/paying
capacity etc, which can be further studied.
• Customers can be educated about the value of a higher carat score and the clarity index through
marketing initiatives.
• Customization of products can be initiated for better sales.

__________________________________________________________________________________


Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and some
didn't. You have to help the company in predicting whether an employee will opt for the package or not
on the basis of the information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.

Data Dictionary:
Variable Name Description
Holiday_Package Opted for Holiday Package yes/no?
Salary Employee salary
age Age in years
educ Years of formal education
no_young_children The number of young children (younger than 7 years)
no_older_children Number of older children
foreign foreigner Yes/No

The purpose of the report is to examine past information on selling holiday packages in order to assist the company in predicting whether an employee will opt for the package or not on the basis of the information given in the data set, to understand the data and examine its patterns, and to provide business insights based on exploratory data analysis and class predictions.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.

Exploratory Data Analysis:


Read and view data after dropping ‘Unnamed: 0’ variable:


Checking for the information of features:

Checking the Skewness and Kurtosis:

Checking the description of dataset:

Checking for duplicates in this dataset:

Checking for number of rows and columns:


Checking for Null and missing values in the dataset:
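A compact sketch of the checks listed above is shown below; the file name is an assumption and df is the working data frame.

# Hedged sketch of the initial checks listed above (file name assumed).
import pandas as pd

df = pd.read_csv("Holiday_Package.csv").drop(columns=["Unnamed: 0"], errors="ignore")

df.info()                                    # feature data types and non-null counts
print(df.select_dtypes("number").skew())     # skewness of numeric columns
print(df.select_dtypes("number").kurt())     # kurtosis of numeric columns
print(df.describe(include="all").T)          # descriptive statistics
print("Duplicates:", df.duplicated().sum())  # duplicate rows
print("Shape:", df.shape)                    # number of rows and columns
print(df.isnull().sum())                     # null / missing values per column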

Observations:
• Dataset has 7 columns and 872 rows excluding the 'unnamed:0' column.
• The first column "Unnamed: 0" has only serial numbers, so we can drop it as it is not useful.
• There are both categorical and continuous data. For categorical data, we have 'Holiday_Package' and
'foreign', for continuous data we have salary, age, educ, no_young_children, no_older_children.
• Holiday_Package will be the target variable.
• The dataset is used in predicting whether an employee will opt for the Holiday_package or not on the
basis of the information given in the data set.
• There are no missing and duplicate values in the dataset.
• Skewness and Kurtosis is also calculated for each column, Data with high skewness indicates lack of
symmetry and high value of kurtosis indicates heavily tailed data.
• Based on the descriptive summary, the data looks good; for most of the variables the mean and median are nearly equal.
• We have a fairly balanced dataset, with about 54% 'no' and 46% 'yes' values for the target variable.

Data Visualization:

Univariate Analysis for Numeric Variables:


Let us define a function 'univariateAnalysis_numeric' to display information as part of univariate
analysis of numeric variables. The function will accept column name and number of bins as arguments.
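A possible sketch of such a helper is given below; df is assumed, and on newer seaborn versions distplot can be replaced by histplot.

# Hedged sketch of the univariateAnalysis_numeric helper described above (df assumed).
import matplotlib.pyplot as plt
import seaborn as sns

def univariateAnalysis_numeric(column, nbins):
    print(f"Description of {column}")
    print(df[column].describe(), end="\n\n")

    plt.figure()
    sns.distplot(df[column], bins=nbins, kde=True)   # distribution plot
    plt.title(f"Distplot of {column}")
    plt.show()

    plt.figure()
    sns.boxplot(x=df[column])                        # boxplot to spot outliers
    plt.title(f"Boxplot of {column}")
    plt.show()

# univariateAnalysis_numeric("Salary", 40)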

1 - Salary

Figure 26. Boxplot and Distplot for Salary


• From the above graphs, we can infer that mean 'Salary' of employee is around 47729.17 with
the minimum of 1322.0 and maximum of 236961.0.
• The distribution of 'Salary' is right skewed with skewness value of 3.103216.
• The distplot shows that most of the data is concentrated at the lower end of the salary range, with a long right tail.
• The box plot of the 'Salary' variable shows presence of large number of outliers.


2 – Age

Figure 27. Boxplot and Distplot for Age.

• From the above graphs, we can infer that mean 'Age' of employee is around 39 years with the
minimum of 20yrs and maximum of 62yrs old in company.
• The distribution of 'Age' looks almost normally distributed with skewness value of 0.146412
• The distplot shows the distribution of most of data from 20 to 60 approximately.
• The box plot of the 'Age' variable does not have any outlier.

3 - Educ: Years of formal education

Figure 28. Boxplot and Distplot for Education.

• From the above graphs, we can infer that mean 'Educ' years of formal education of employee
is around 9 years with the minimum of 1yr and maximum of 21yrs.
• The distribution of 'Educ' is slightly left skewed with skewness value of -0.045501.
• The distplot shows the distribution of most of data from 1 to 20 approximately.
• The box plot of the 'Educ' variable shows presence of few outliers.

4 - no_young_children: The number of young children below the age of 7yrs.

Figure 29. Boxplot and Distplot for no_young_children


• From the above graphs, we can infer that the mean 'no_young_children' (number of young children below the age of 7) is around 0.3119, with a minimum of 0 and a maximum of 3.
• The distribution of 'no_young_children' is right skewed, with a skewness value of 1.9465.
• The distplot shows the distribution of most of data from 0-3.
• The box plot of the 'no_young_children' variable shows presence of few outliers.

5 - no_older_children: The number of older children.

Figure 30. Boxplot and Displot for no_older_children

• From the above graphs, we can infer that mean 'no_older_children' the number of older children
is around 0.9827 with the minimum of 0 and maximum of 6.
• The distribution of 'no_older_children' is slightly right skewed with skewness value of
0.953951
• The distplot shows the distribution of most of data 0-4 approximately.
• The box plot of the 'no_older_children' variable shows presence of one outlier at 6.

Observations:
Table 3. Inferences of Univariate Data visualization for problem 2.

Sl. No Features Distribution Skewness Outliers


1 Salary Right Skewed +3.103 Yes
2 Age Left Skewed -0.146 No
3 Education Left Skewed -0.045 Very few
4 no_young_children Right Skewed +1.946 Very few
5 no_older_children Right Skewed +3.850 Very few

• There are outliers mainly in the Salary variable; the outliers in the other variables are just one or two and do not have much effect.
• Treating of Outlier might not be feasible option as the data can be original and genuine.
• Foreigners accepting the holiday package have mean of years of formal education lesser
than natives accepting the holiday package.
• If employee is foreigner and employee not having young children, chances of opting for
Holiday Package is good.


Univariate Analysis for Categorical variables:

1. Holliday_Package 2. Foreign: foreigner Yes/No

Figure 32. Count plot for Holiday package Figure 31. Count plot for foreign.

Observations:
• The distribution of 'Holiday_Package' shows whether an employee opted for the package or not; the frequency of 'No' is higher at around 471, while the employees who opted are slightly fewer at 401.
• We can observe that 54% of the employees are not opting for the holiday package and 46% are interested in the package. This implies we have a dataset which is fairly balanced.
• The frequency distribution of foreign implies that the employees are mostly from the same
country which is around 75% of employees and foreigners are around 25% of them.

Bivariate Analysis:

Salary vs Holiday_Package:

Figure 33. Boxplot of Salary vs Holiday package



We can see that the average 'Salary' of employees opting for holiday package and not opting for holiday
package is similar in nature. However, the distribution is fairly more spread out for people not opting
for holiday packages.

Age vs Holiday package:

Figure 34. Boxplot of Age vs Holiday package

We can see that, the age distribution for employees who are opting for holiday package and not opting
are similar in nature, though the number of people opting are less in number and mostly fall in range of
35-45 age group.

Count plot of Age with Holiday package as hue

Figure 35. Count plot of Age against Holiday package

We can clearly see that employees in the middle age range (34 to 45 years) opt for the holiday package more than older and younger employees.


Education vs Holiday package

Figure 37. Boxplot of Education vs Holiday package

The variable 'educ' (the number of years of formal education) shows a similar pattern for both groups. This means education is likely not a variable that influences whether employees opt for holiday packages.

Count plot of Education with Holiday package as hue

Figure 38. Count plot of Education against Holiday package

We can see that employees with fewer years of formal education (1 to 7 years) and those with the most education are not opting for the Holiday package as much as employees with 8 to 12 years of formal education.


No of young children vs Holiday package

Figure 39. Boxplot of no of young children vs Holiday package

We can see that there is a significant difference between employees with younger children who opt for the holiday package and those who do not; this attribute is a good predictor because of that difference.

Count plot of no of young children with Holiday package as hue

Figure 40. Count plot of no of young against Holiday package

We can see clearly that employees with younger children who opt for holiday packages are very few in number compared to employees who do not have young children.


No of older children vs Holiday Package

Figure 41. Boxplot of no of older children vs Holiday package

The distribution of opting or not opting for holiday packages looks the same for employees with older children. At this point, this might not be a good predictor for model building.

Count plot of no of older children with Holiday package as hue

Figure 42. Count plot of no of older against Holiday package

Almost the same distribution is seen in both scenarios when dealing with employees with older children.


Foreign vs Holiday package

Figure 43. Count plot of foreign vs Holiday package

We can see that the percentage of foreigners accepting the holiday package is substantially higher than that of the citizens, when considering the ratio of foreigners to citizens.

Box plot of foreign vs Salary with Holiday package as hue

Figure 44. Boxplot of foreign vs Salary with Holiday package as hue

• Among both foreigners and non-foreigners, the people who did not opt for the Holiday package outnumber those who opted.
• The average salary of people who did not opt for the Holiday package is slightly more than that of those who opted.
• The mean salary of foreign employees is slightly less than that of natives.
• There are outliers in all the combinations.


Pair plot:

The pair plot helps us visualize how the numerical features interact with each other. It further helps us visualize how the distribution of the target variable differs within each individual feature.

Figure 45. Pair plot of problem 2

Observations:
• There is no obvious, well-defined correlation between the attributes and Holiday package; the data seems to be fine.
• There is no considerable difference between the data distributions across the holiday package classes.
• Looking at the distribution of age, we can deduce that the employees who accept the holiday package usually tend to be in the middle of their careers (late 30s).
• Across education, we can observe that employees with a higher number of years of formal education have a lower tendency to opt for the holiday package relative to employees with fewer years of formal education.


Multivariate Analysis:

Heatmap
A heatmap gives us the correlation between numerical variables. If the correlation value tends to 1, the variables are highly positively correlated, whereas if the correlation value is close to 0, the variables are not correlated. If the value is negative, the correlation is negative: the higher the value of one variable, the lower the value of the other, and vice-versa.
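A minimal sketch of the heatmap is shown below; df is assumed to be the working data frame.

# Hedged sketch of the correlation heatmap for the numeric columns (df assumed).
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.heatmap(df.select_dtypes("number").corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()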

Figure 46. Heatmap for Problem 2.

Observations:
• There is no strong correlation between the variables, hence we do not face the issue of multicollinearity.
• Observing the heatmap, we can see that there is some positive correlation between the number of years of formal education and the salary received.
• There is some negative correlation between age and the number of young children below age 7.

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis).

Encoding the categorical variables:


Encoding categorical data is the process of converting categorical values into an integer format, so that the converted data can be used to build the models that give the predictions. The object-type variables are converted to integers using pandas categorical codes of 0 and 1. After the encoding, the variables are of integer data types and ready for model building.
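A minimal sketch of this encoding step, with df assumed to be the working data frame:

# Hedged sketch of the categorical-to-codes encoding (df assumed).
import pandas as pd

for col in df.select_dtypes("object").columns:   # the holiday package and foreign columns
    df[col] = pd.Categorical(df[col]).codes      # e.g. 'no' -> 0, 'yes' -> 1

print(df.dtypes)   # the encoded columns are now integer typed
print(df.head())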


Checking datatypes after encoding:

Checking the head of Dataset after encoding the Categorical variables

Now the dataset is cleaned, encoded and ready to use for model building.

Split: Split the data into train and test (70:30)


1. Capturing the target column into separate vectors for the training set and test set: the X variable holds the independent attributes and the y variable holds the target variable, which is 'Holiday_Package' in our case.
2. Splitting the dataset into train and test in the ratio of 70:30 using train_test_split from sklearn, keeping the random state as 1.
3. Checking the shape of the split data.

The data is now ready to fit the models on the train set and check the performance on the test data. The data is divided into 70% train and 30% test.
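A minimal sketch of this split; the target column spelling ('Holliday_Package') follows the raw dataset and is an assumption here.

# Hedged sketch of the 70:30 split (column spelling assumed from the raw data).
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Holliday_Package"])
y = df["Holliday_Package"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)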

Logistic Regression Model


Logistic Regression is a machine learning algorithm used for classification problems. It is a predictive analysis algorithm that models the relationship between the dependent variable and one or more independent variables by estimating probabilities through the logistic function. Logistic Regression predicts the likelihood of a categorical dependent variable; the dependent variable is a binary variable with data coded as 1 or 0.


To build a Logistic Regression model,

• Fitting the Logistic Regression model which is imported from Sklearn linear model.
• Predicting on Training and Testing dataset
• Getting the Predicted Classes and Probabilities and creating a data frame.
• Model evaluation through Accuracy, Confusion Matrix, Classification report, AUC, ROC
curve.
Initially, we fit the train data and labels in the Logistic Regression model; based on the model performance, the model is tuned using grid search, the best parameters are used, the model is rebuilt, and the model performance is calculated, including the classification report with accuracy, recall, precision and F1 score for both train and test data.
Grid Search: Grid search divides the hyperparameter domain into distinct grids. Then, using cross-validation, it attempts every possible combination of values in this grid, computing a performance measure for each. The ideal combination of hyperparameter values is the point on the grid that maximizes the average value in cross-validation. Grid search is a comprehensive technique that considers all possible combinations in order to locate the best point in the domain.
Hyperparameter Tuning:
o 'penalty': ['l2', 'none']
o 'solver': ['sag', 'lbfgs', 'liblinear', 'newton-cg']
o 'tol': [0.0001, 0.00001]
o 'max_iter': [10000, 5000, 15000]
o Cross validation (cv): 5
o Scoring: 'f1'

Penalized logistic regression imposes a penalty on the logistic model for having too many variables; this shrinks the coefficients of the less contributing variables toward zero and is also known as regularization. In our grid search, we take 'l2' and 'none' as our arguments and check which is preferred by grid search.
The solver is the procedure that optimizes the weights of the model, and different solvers take different approaches to reach the best fit model; for instance, coordinate descent (CD) style solvers work by successively performing approximate minimization along coordinate directions or coordinate hyperplanes. In our case, we have taken 'sag', 'lbfgs', 'liblinear' and 'newton-cg' as our arguments and will check which is preferred by grid search.
Tol is the tolerance of the optimization. When the training loss does not improve by at least the given tol on consecutive iterations, convergence is considered to be reached and the training stops. We will be checking for tolerances of 0.0001 and 0.00001.
Logistic regression uses an iterative maximum likelihood algorithm to fit the data. There are no set criteria for the maximum number of iterations: the solver runs until it reaches convergence or until the maximum iterations you have provided. In this case, we have given 5000, 10000 and 15000 as inputs and will see which fits better.
We have used cross-validation and 'f1' scoring for our grid search, as listed above.
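A hedged sketch of this grid search is given below; the grid mirrors the listed hyperparameters, and X_train and y_train come from the 70:30 split described earlier.

# Hedged sketch of the grid search described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "penalty": ["l2", "none"],       # on newer scikit-learn use None instead of "none"
    "solver": ["sag", "lbfgs", "liblinear", "newton-cg"],
    "tol": [0.0001, 0.00001],
    "max_iter": [10000, 5000, 15000],
}

grid = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid=param_grid,
    cv=5,
    scoring="f1",
    error_score=np.nan,              # invalid penalty/solver combinations are skipped
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
best_model = grid.best_estimator_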


The final best parameters are:


o Max_iter is ‘10000’
o Penalty is ‘None’
o Solver used is ‘newton-cg’
o Tol is 0.0001

Our new model, based on the grid search algorithm's best parameters, is saved in a separate variable as best_model. This model is used to predict the values of the target variable, and the model's performance is then evaluated with these parameters.
Checking the Coefficients:

• The coefficient for Salary is -1.646142121152848e-05


• The coefficient for age is -0.05707255243551053
• The coefficient for educ is 0.06034737348280886
• The coefficient for no_young_children is -1.3488352961597043
• The coefficient for no_older_children is -0.04894374035375453
• The coefficient for foreign is 1.2664799760127905

LDA Model (linear discriminant analysis)


Linear Discriminant Analysis is a dimensionality reduction technique that is commonly used for
supervised classification problems. It is used for modelling differences in groups i.e., separating two or
more classes. It is used to project the features in higher dimension space into a lower dimension space.
LDA works when the measurements made on independent variables for each observation are continuous
quantities. When dealing with categorical independent variables, the equivalent technique is
discriminant correspondence analysis.
On the train data set, we fit our Linear Discriminant model. By default, LDA uses a cut-off probability of 0.5, so initially we build our LDA model with a cut-off probability of 0.5 and see how it performs, and then we try multiple cut-off probabilities to see which one performs best.
We obtain an LDA model based on the default cut-off probability (i.e., 0.5). To get the best results, we need to test our model with several cut-off probabilities and choose the one that produces the greatest results. To do so, we start with a probability of 0.1 and work our way up to 0.9 in steps of 0.1, checking the recall and F1 score at each probability along the way. We will use the probability that gives the best balance of recall and F1 score as our final cut-off value.
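A hedged sketch of this cut-off search is given below; X_train and y_train are assumed from the 70:30 split.

# Hedged sketch of the LDA cut-off search described above.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import f1_score, recall_score

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
train_probs = lda.predict_proba(X_train)[:, 1]       # probability of opting for the package

for cutoff in np.arange(0.1, 1.0, 0.1):
    preds = (train_probs > cutoff).astype(int)       # custom cut-off instead of the default 0.5
    print(f"cut-off {cutoff:.1f}  recall {recall_score(y_train, preds):.4f}  "
          f"F1 {f1_score(y_train, preds):.4f}")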

Cut off probability Recall F1 Score


0.1 0.9964 0.6393
0.2 0.9644 0.6499
0.3 0.8932 0.6693
0.4 0.7580 0.6762
0.5 0.5765 0.6125
0.6 0.4235 0.5336
0.7 0.2989 0.4398
0.8 0.1103 0.1981
0.9 0.0071 0.0141
Table 4. LDA cut off probability performance table


We can see from the table above that a cut-off probability of 0.4 provides the optimal balance of recall and F1 score. As a result, we will discuss the performance of our LDA model using both the default and the 0.4 cut-off probability.

2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model Final Model: Compare Both the models and write inference which model
is best/optimized.

Model performance helps to understand how good the model that we have trained using the
dataset is so that we have confidence in the performance of the model for future predictions.
We evaluate our models' performance on train and test datasets once they've been constructed. We try
to determine if the model is underfitting or overfitting by checking for accuracy, precision, and other
factors. We have specific scores and matrices for our model's performance. Following are the methods
used to evaluate the model performance:
1. Confusion Matrix
2. Classification Report
o Accuracy
o Precision
o Recall
o F1 Score
3. ROC curve
4. AUC score

1. Confusion Matrix:
This gives us how many zeros (0s), i.e. class = No, and ones (1s), i.e. class = Yes, were correctly predicted by our model and how many were wrongly predicted.

                               Predicted Class
                        Class = No          Class = Yes
Actual    Class = No    True Negative       False Positive
class     Class = Yes   False Negative      True Positive

I. Accuracy:
Accuracy is the most intuitive performance measure and it is simply a ratio of correctly predicted
observation to the total observations.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
II. Precision:
Precision is the ratio of correctly predicted positive observations to the total predicted positive
observations.


Precision = TP/(TP + FP)

III. Recall (Sensitivity):


Recall is the ratio of correctly predicted positive observations to the all observations in actual class -
yes.
Recall = TP/(TP + FN)
IV. F1 Score:
F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives
and false negatives into account. That is, a good F1 score means that you have low false positives and
low false negatives, so you're correctly identifying real threats and you are not disturbed by false alarms.
An F1 score is considered perfect when it's 1 , while the model is a total failure when it's 0
F1 score = 2 x [(Precision x Recall) / (Precision + Recall)]
2. ROC Curve:
The ROC curve is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: the True Positive Rate and the False Positive Rate.
3. AUC Score:
AUC score gives the area under the ROC curve built. The higher the AUC, the better the performance
of the model at distinguishing between the positive and negative.
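The evaluation applied to each model can be sketched as follows; best_model and the train/test splits are assumptions carried over from the earlier steps.

# Hedged sketch of the evaluation used for every model.
import matplotlib.pyplot as plt
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

pred_test = best_model.predict(X_test)
prob_test = best_model.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, pred_test))        # [[TN, FP], [FN, TP]]
print(classification_report(y_test, pred_test))   # accuracy, precision, recall, F1

fpr, tpr, _ = roc_curve(y_test, prob_test)
print("AUC:", roc_auc_score(y_test, prob_test))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()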
Employees who choose a holiday package are denoted as 1, while those who do not opt are denoted as 0 in our dependent variable 'Holiday_Package'. In this scenario, True Positives are employees who chose the package and whose decision our model correctly anticipated, whereas True Negatives are employees who did not choose the package and whose decision our model correctly predicted. False Positives are those who did not choose a package but were predicted to do so by our model. False Negatives are those who chose a package despite our model's prediction that they would not.
If an employee would have chosen a package that the model did not anticipate, the company suffers the greater loss, so False Negatives should be kept to a minimum and recall should be maximised. False Positives also result in some loss, so precision is important as well; there should be a balance between recall and precision, which is why the F1 score should also be considered.

Checking the Model performance of Logistic Regression model:

Classification report:
Classification report for train data Classification report for test data

Figure 47. Classification report of training and testing data for Logistic Regression model


Confusion Matrix for training and testing data:

Figure 48. Confusion Matrix of train (left) and Test (right) for Logistic Regression

ROC Curve and ROC_AUC score

Figure 49. ROC curve for training and testing data for Logistic Regression

- AUC for the Training Data: 0.735


- AUC for the Test Data: 0.717

Logistic Regression Model


Sl. No Train Data Test Data
1. True Positive 163 62
2. True Negative 244 109
3. False Positive 85 33
4. False Negative 118 58
5. Accuracy 67% 65%
6. Precision 66% 65%
7. Recall 58% 52%
8. F1 score 62% 58%
9. AUC score 73.5% 71.7%
Table 5. Model Performance for Logistic Regression Model.


• Test data accuracy, AUC, precision, and recall are nearly identical to those of the training data.
• This shows that there was neither overfitting nor underfitting, and that the model is a good classification model overall.
• Overall, the metrics indicate a reasonably good fit.

Inferences:
We must understand the meaning of False Positives and False Negatives as stated in the problem description. False Positives are those who did not choose a package but were predicted to do so by our model. False Negatives are those who chose a vacation package despite our model's prediction that they would not.
As a result, False Positives impact the firm only to a small extent, while False Negatives will impact the firm more. Sensitivity (recall) is therefore the important measure in this instance, and the F1 score should also be considered.

Checking the coefficients of each variable for this model:

- The coefficients for no young children and foreign are the highest.
- That is, a unit change in these variables will cause the log function of the Logistic Regression
model to change the most.
- With the lowest coefficient, salary is the weakest predictor.
- The coefficients for age, education, and no older children are all quite low.

Checking the Model performance of Linear discriminant analysis model:


LDA model performance based on a default cut-off probability (i.e., 0.5).
Classification report:

Figure 50. Classification report for LDA with default probability cut-off of 0.5

Confusion Matrix for training and testing data:

Figure 51. Confusion matrix of train (left) and test (right) for LDA: 0.5

ROC Curve and ROC_AUC score:

Figure 52. ROC curve for train and test for LDA:0.5

- AUC for the Training Data: 0.733


- AUC for the Test Data: 0.714

Linear discriminant analysis model – 0.5


Sl. No Train Data Test Data
1. True Positive 162 59
2. True Negative 243 109
3. False Positive 86 33
4. False Negative 119 58
5. Accuracy 66% 64%
6. Precision 65% 64%
7. Recall 58% 49%
8. F1 score 61% 56%
9. AUC score 73.3% 71.4%
Table 6. Model performance for LDA [0.5]

• Test data accuracy, AUC, precision, and recall are nearly identical to those of the training data.
• This shows that there was neither overfitting nor underfitting, and that the model is a good classification model overall.
• Overall, the metrics indicate a reasonably good fit.
The model accuracy on the training and the test set is about 66% and 64% respectively, which is roughly the same proportion as the class 0 observations in the dataset, so the model is affected by the mild class imbalance. Since we only have 872 observations, an even better model could be built if the same LDA model were rebuilt with a greater number of data points.
We therefore change the cut-off value to maximise recall, since recall is the important measure here. We saw that at a probability of 0.4 the recall increases to a great extent without impacting the accuracy much, and at 0.4 the F1 score is also the best. Next, we check the model performance at 0.4 and consider the better one.


LDA model performance based on a custom cut-off probability (i.e., 0.4).


Classification report:

Figure 53. Classification report for LDA with custom probability cut-off of 0.4

Confusion Matrix for training and testing data:

Figure 54. Confusion matrix of train (left) and test(right) for LDA :0.4

ROC Curve and ROC_AUC score:

Figure 55. ROC curve for train and test for LDA:0.4

- AUC for the Training Data: 0.733


- AUC for the Test Data: 0.714


Linear discriminant analysis model – 0.4


Sl. No Train Data Test Data
1. True Positive 213 87
2. True Negative 193 82
3. False Positive 68 33
4. False Negative 136 60
5. Accuracy 67% 65%
6. Precision 61% 59%
7. Recall 76% 72%
8. F1 score 68% 65%
9. AUC score 73.3% 71.4%
Table 7. Model performance for LDA [0.4]
• Test data accuracy, AUC, precision, and recall are nearly identical to those of the training data.
• This shows that there was neither overfitting nor underfitting, and that the model is a good classification model overall.
• Overall, the metrics indicate a reasonably good fit.
We see that the recall and F1 score increase to a great extent with the custom probability cut-off of 0.4. This is our best fit model for the LDA. We consider this model for the further comparison of the Logistic Regression model and LDA to find the best fit model for the firm.

Checking the coefficients of each variable for this model:

We see a similar result with no_young_children and foreign as good predictors and salary being the
worst predictor.

Comparison of performance metrics between models:


So far, we've developed models for Logistic regression and Linear discriminant analysis, and we've
used a confusion matrix, classification report, AUC scores, and ROC curves to evaluate their
performance. Now we'll compare the models based on their results to see which one is best for
classification.
As previously stated, the recall value is quite important for our problem statement; to some extent, precision is also vital. The F1 score and recall value should be the focus.
For the model comparison, the best fit Logistic Regression model after applying grid search is used, and for Linear Discriminant Analysis the custom probability cut-off of 0.4 gives the better results, so that model is considered for the comparison.


Table 8. Metrices comparison table between models.

In this table, we have Accuracy, Recall, Precision, F1 score and AUC scores for 2 different models. The
models are as follows:
1) Logistic Regression Model – Best fit model after grid search.
2) LDA with custom cut- off probability (0.4).

Inferences:
• We can see that the accuracy of both models is almost similar for both train and test.
• The AUC and precision of the Logistic Regression model are slightly greater than those of LDA for both train and test.
• However, with recall and F1 score being the important measures of model performance for our problem, the LDA model performs much better than the Logistic Regression model, so we can say that LDA is the best fit model.
• The Linear Discriminant Analysis model with a custom probability cut-off of 0.4 is the best fit model.

Comparing the ROC curves and AUC scores for LDA and Logistic Regression models.

Figure 56. ROC of model comparison for Train data.


Figure 57. ROC of model comparison for Test data.

We can see from the graphs that Logistic Regression and LDA perform approximately identically for both the train and test data sets; the Logistic Regression model, however, shows a somewhat better ROC curve.

2.4 Inference: Basis on these predictions, what are the insights and recommendations.

We had a business problem where we need to predict whether an employee would opt for a holiday
package or not. For this problem we had predicted the results using both logistic regression and linear
discriminant analysis.
In our extensive analysis so far, we have thoroughly examined the given data and developed a model that predicts whether an employee opts for the holiday package or not, based on the attributes in our dataset. Let us now look at the key points in our past data first and try to suggest some recommendations for the firm.
Insights from the Graphs and Analysis from EDA:
Holiday package:

• We can observe that 54% of the employees are not opting for the holiday package and 46% are
interested in the package. This implies we have a dataset which is fairly balanced.
Salary

• The average 'Salary' of employees opting for holiday package and not opting for holiday
package is similar in nature.
• The coefficient for Salary is -1.3803 e-05. There is almost no relation with the Holiday package,
so we can say that Salary is not a good predictor for model building.
• Higher salary employees are more prone to not opt for holiday package.


Foreign

• Foreign is a good predictor of dependent variable with a high positive coefficient.


• The frequency distribution of foreign implies that the employees are mostly from the same
country which is around 75% of employees and foreigners are around 25% of them.
• We can see that the percentage of foreigners accepting the holiday package is substantially
higher compared to the citizens while considering the ratio of foreigners and the citizens.
• The mean salary of foreign people is slightly less than natives.
Age

• We can see that, the age distribution for employees who are opting for holiday package and not
opting are similar in nature, though the number of people opting are less in number and mostly
fall in range of 35-45 age group.
• We can see that, employees in middle range (34 to 45 years) are opting for holiday package are
more as compared to older and younger employees.
Education

• The variable 'educ' (the number of years of formal education) shows a similar pattern. This means education is likely not a variable that influences whether employees opt for holiday packages.
• We can see that employee with less years of formal education (1 to 7 years) and higher
education are not opting for the Holiday package as compared to employees with formal
education of 8 year to 12 years
• Across education we can observe that the employees with higher number of years of formal
education have a lower tendency to opt for the holiday package relative to employees with
lesser years of formal education
No. of young children

• No_young_children has a coefficient of approximately -1.29. This can be treated as a good predictor of the dependent variable.
• We can see that there is a significant difference in employees with younger children who are
opting for holiday package and employees who are not opting for holiday package, this attribute
is good predictor as there is significant difference in them.
• We can see that employees with younger children who opt for holiday packages are very few in number compared to employees who do not have young children.
No. of older children

• The distribution for opting or not opting for holiday packages looks same for employees with
older children. At this point, this might not be a good predictor for model building.
• Almost same distribution for both the scenarios when dealing with employees with older
children
• For employees with older children, it is hard to differentiate between the two classes of the dependent variable: there is not much difference between those who opt for the package and those who do not.
• This is not a good variable for model building.


Recommendations:
• The firm should concentrate its efforts on foreigners in order to increase sales of vacation
packages, as this is where the majority of conversions will occur.
• The firm might try to target their marketing efforts or offers at foreigners in order to increase
the number of people who choose vacation packages.
• Focus on Foreign variable for good prediction while building the classification model.
• To improve the likelihood of lower-wage employees selecting for a vacation package, the firm
might provide certain incentives or discounts to them.
• The company should not target employees with younger children. The employees with younger
children have more chances of not opting for holiday package.
• Employees with older children who do not opt for vacation package might be targeted using
some marketing strategies. The organisation can conduct a deep dive or conduct a survey to
determine why the rest of the employees are not taking advantage of the holiday package. The
corporation may be able to come up with some suggestions or offers to convert the remaining
employees.
• The employer can provide references of workers with older children who have chosen the
package to those who have not chosen it, in order to persuade them to do so.

Key performance indicators:


• Highlight the benefits of Holiday package and services and educate the employees about it.
• The company can come up with lucrative enhancements in holiday packages.
• Customer satisfaction should be utmost priority.
• Engage with employees through social media.
• New destinations can be added.
• Video is a great way to engage and inspire potential travellers.
• Travel influencers can promote destinations, activities, and businesses by using their social
media influence.
• Get feedback from employees who took the holiday package and work on the betterment of
package accordingly.
__________________________________________________________________________________
