Pooja Kabadi - Predictive Modelling Project
PREDICTIVE MODELLING PROJECT
Pooja Kabadi
PGP-DSBA Online
Batch- A4
23-01-2022
Table of Contents:
Problem 1: Linear Regression
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Check for the possibility of combining the sub levels of an ordinal variable and take actions accordingly. Explain why you are combining these sub levels with appropriate reasoning.
1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30). Apply Linear regression using scikit-learn. Perform checks for significant variables using an appropriate method from statsmodels. Create multiple models and check the performance of predictions on Train and Test sets using R-square, RMSE & Adjusted R-square. Compare these models and select the best one with appropriate reasoning.
1.4 Inference: Based on these predictions, what are the business insights and recommendations?
Problem 2: Logistic Regression and LDA
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
2.3 Performance Metrics: Check the performance of predictions on Train and Test sets using Accuracy, Confusion Matrix, ROC curve and ROC_AUC score for each model. Final Model: Compare both models and write an inference on which model is best/optimized.
2.4 Inference: Based on these predictions, what are the insights and recommendations?
List of Figures:
Figure 1. Boxplot and Distplot of Carat
Figure 2. Boxplot and Distplot of Depth
Figure 3. Boxplot and Distplot of Table
Figure 4. Boxplot and Distplot of 'X'
Figure 5. Boxplot and Distplot of 'Y'
Figure 6. Boxplot and Distplot of 'Z'
Figure 7. Boxplot and Distplot of Price
Figure 8. Frequency Distribution of Cut
Figure 9. Frequency Distribution of Colour
Figure 10. Frequency Distribution of Clarity
Figure 11. Boxplot of Cut with Price variable
Figure 12. Boxplot of Color with Price variable
Figure 13. Boxplot of Clarity with Price
Figure 14. Count plot of Categorical variables with Price
Figure 15. Bar plot of Categorical variables with Price
Figure 16. Pair plot of Zirconia dataset
Figure 17. Scatter plots of all numeric variables with Price
List of Tables:
Table 1. Inferences of Univariate Data visualization
Table 2. Model comparison table
Table 3. Inferences of Univariate Data visualization for problem 2
Table 4. LDA cut-off probability performance table
Table 5. Model performance for Logistic Regression model
Table 6. Model performance for LDA [0.5]
Table 7. Model performance for LDA [0.4]
Table 8. Metrics comparison table between models
Problem 1: Linear Regression
You are hired by Gem Stones co ltd, a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits on different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so it can distinguish between higher-profit and lower-profit stones and improve its profit share. Also, provide the 5 attributes that are most important.
Data Dictionary:
Cut      Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Colour   Colour of the cubic zirconia, with D being the worst and J the best.
Clarity  Refers to the absence of inclusions and blemishes. (In order from worst to best in terms of average price): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth    The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
The purpose of this report is to examine past information on cubic zirconia in order to help the company predict price slots for the stones based on the information provided in the dataset, to understand the data and examine how the various attributes influence pricing, and to provide business insights based on exploratory data analysis and price predictions.
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, Data types, shape, EDA, duplicate values). Perform Univariate and
Bivariate Analysis.
Observations:
• Dataset has 11 columns and 26967 rows including the 'unnamed:0' column.
• The first column "Unnamed: 0" has only serial numbers, so we can drop it as it is not useful.
• There are both categorical and continuous data. For categorical data we have cut, colour and clarity; for continuous data we have carat, depth, table, x, y, z and price.
• Price is the target variable.
• The dataset is used to predict the price of a zirconia stone on the basis of the details given, so that the company can distinguish between higher-profit and lower-profit stones and improve its profit share.
• There are 697 missing values in the variable 'depth', which will be imputed during the data pre-processing stage.
• There are 34 duplicate rows in the dataset. Although two or more stones can plausibly share the same dimensions and features, we drop the duplicates to avoid any overlap.
• There are 5 unique types of 'cut', of which 'Ideal' is the most frequent with 10,816 observations, approximately 40% of the dataset.
• There are 7 types of 'color'; 'G' is the most frequent with 5,661 observations, roughly 21% of the dataset.
• There are 8 types of 'clarity'; 'SI1' is the most frequent with 6,571 observations, roughly 24% of the dataset.
• Skewness and kurtosis are also calculated for each column. High skewness indicates a lack of symmetry, and a high kurtosis value indicates heavily tailed data.
• Based on the descriptive summary the data looks good; for most variables the mean and median are nearly equal. A sketch of these basic checks follows.
Data Visualization:
1 - Carat: Carat weight of the cubic zirconia.
• From the above graphs, we can infer that the mean 'carat' weight of the cubic zirconia is around 0.79, with a minimum of 0.20 and a maximum of 4.50.
• The distribution of 'carat' is right skewed, with a skewness value of 1.1164.
• The distribution spikes at around 0.4, 1, 1.5 and 2.
• The distplot shows most of the data lies between 0 and 2.5.
• The box plot of the 'carat' variable shows a large number of outliers.
2 - Depth: The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
• From the above graphs, we can infer that the mean 'depth' is around 61.74, with a minimum of 50.80 and a maximum of 73.60.
• The distribution of 'depth' is slightly left skewed, with a skewness value of -0.0286.
• The distribution is near normal, with long tails on both the right and the left side.
• The distplot shows most of the data lies between 55 and 70.
• The box plot of the 'depth' variable shows a large number of outliers.
3 - Table: The width of the cubic zirconia's table expressed as a percentage of its average diameter.
• From the above graphs, we can infer that the mean 'table' width is around 57.45, with a minimum of 49.00 and a maximum of 79.00.
• The distribution of 'table' is right skewed, with a skewness value of 0.7657.
• The distribution has multiple spikes at around 53, 55, 60 and 62.5.
• The distplot shows most of the data lies between 50 and 65.
• The box plot of the 'table' variable shows outliers.
4 - X: Length of the cubic zirconia in mm.
• From the above graphs, we can infer that the mean 'X' (length in mm) is around 5.72.
• The distribution of 'X' is slightly right skewed, with a skewness value of 0.3879.
• The distribution has various spikes.
• The distplot shows most of the data lies between 3 and 10.
• The box plot of the 'X' variable shows a few outliers.
5 - Y: Width of the cubic zirconia in mm.
• From the above graphs, we can infer that the mean 'Y' (width in mm) is around 5.73.
• The distribution of 'Y' is right skewed, with a skewness value of 3.8501.
• The distribution has an extremely long right tail because of one outlier at around 60.
• The distplot shows most of the data lies between 0 and 10.
• The box plot of the 'Y' variable shows a few outliers.
6 - Z: Height of the cubic zirconia in mm.
• From the above graphs, we can infer that the mean 'Z' (height in mm) is around 3.53.
• The distribution of 'Z' is right skewed, with a skewness value of 2.568.
• The distribution has an extremely long right tail because of one outlier at around 30.
• The distplot shows most of the data lies between 0 and 5.
• The box plot of the 'Z' variable shows a few outliers.
Observations:
• Mean and median values are not very far from each other.
• The data for all attributes is skewed (mostly to the right), with X (length) the closest to normal.
• Data for X (length) is almost normal; its outliers distort the distribution only slightly.
• There are outliers in all numerical features of the cubic zirconia dataset.
1. Cut - Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
2. Colour - Colour of the cubic zirconia, with D being the worst and J the best.
3. Clarity - Refers to the absence of inclusions and blemishes. (In order from worst to best in terms of average price): IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Observations:
• The distribution of 'cut' (the cut quality of the cubic zirconia) shows 'Ideal' with the maximum frequency of 10,816, while 'Fair' is the least frequent.
• The distribution of 'colour' shows 'G' with the maximum frequency of 5,661; 'J' is the least frequent.
• The distribution of 'clarity' (the absence of inclusions and blemishes) shows 'SI1' with the maximum frequency of 6,571; 'I1' is the least frequently observed.
• For the cut variable, the most sold zirconia stones are 'Ideal' cut gems and the least sold are 'Fair' cut gems.
• All cut types have outliers with respect to price.
• 'Ideal' cut stones appear slightly less priced, while 'Premium' cut stones appear slightly more expensive.
• For the color variable, the most sold are 'G' colored gems and the least sold are 'J' colored gems.
• All color types have outliers with respect to price.
• However, 'E' colored gems appear the least priced, while 'J' and 'I' colored gems appear more expensive.
• For the clarity variable, the most sold are 'SI1' clarity gems and the least sold are 'I1' clarity gems.
• All clarity types have outliers with respect to price.
• 'SI1' appears slightly less priced; 'VS2' and 'SI2' clarity stones appear more expensive.
• 'Ideal' is the best-selling cut type of zirconia stone and 'Fair' the least sold.
• 'G' is the best-selling color, followed closely by 'E' and 'F' in nearly the same range, while 'J' is the least selling color.
• 'SI1' is the best-selling clarity type, followed by 'VS2', with 'I1' selling the least.
• In the bar plot, the 'Ideal' cut type shows the highest price and 'Fair' the lowest compared to all others.
• 'G' colored gems are the costly ones, most liked by customers and the highest sold.
• 'J' colored gems are priced lower and are also the least sold.
• 'SI1' is the expensive one, followed by 'VS2' and 'SI2', which fall in the same price range, while 'I1' and 'IF' are the cheaper gems.
Pair plot:
A pair plot gives us correlation graphs between all numerical variables in the dataset. Thus, from the
graphs we can identify the relationships between all numerical variables.
Observations:
• From the pair plot, 'carat' and 'price' are linearly correlated, which means the carat attribute influences the price of a zirconia stone the most.
• X, Y and Z have a linear relation with each other and with the target variable 'price'.
• According to the assumptions of the linear regression model, the independent variables should not be linearly correlated with each other; the strong relations among X (length), Y (width) and Z (height) therefore introduce high multicollinearity.
Multivariate Analysis:
Heat map
A heatmap gives us the correlation between numerical variables. If the correlation value is close to 1, the variables are highly positively correlated, whereas if it is close to 0 they are uncorrelated. A negative value indicates negative correlation: the higher the value of one variable, the lower the value of the other, and vice versa.
Observations:
• Carat is highly correlated with price; the carat attribute is the best predictor of price.
• Depth is barely related to price, so the depth attribute does not play a major role in predicting price.
• X (length), Y (width) and Z (height) are highly correlated with price.
• X, Y and Z are also highly correlated with each other and are responsible for high multicollinearity.
• Multicollinearity is a setback for a linear regression model. The highly correlated variables can be dropped in one of the model builds to check the effect on model performance.
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Check for the possibility of combining the sub levels of an ordinal variable and take actions accordingly. Explain why you are combining these sub levels with appropriate reasoning.
There are 697 missing values in the depth variable; they are imputed with the median of the variable. The table above shows the total number of missing values before and after imputation.
There are 8 observations with values equal to 0. Since this is tiny compared to the total of 26,967 observations, dropping them will not affect the analysis much.
There are 34 duplicate rows; we drop them as well, since they are very few relative to the size of the dataset. After dropping the duplicates and the zero-valued rows, 26,925 data points remain.
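A minimal sketch of this pre-processing, assuming the zeros sit in the dimension columns x, y and z (a zero dimension is physically meaningless, but the report does not name the columns explicitly):

```python
# Median imputation for 'depth' (697 missing values).
df["depth"] = df["depth"].fillna(df["depth"].median())

# Drop rows where any dimension is recorded as zero (8 such rows; the x/y/z
# columns are an assumption about where the zeros occur).
df = df[(df[["x", "y", "z"]] != 0).all(axis=1)]

# Drop the 34 duplicate rows.
df = df.drop_duplicates()
print(df.shape)  # expected: (26925, 10)
```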
The outliers of all variables are checked by plotting box plots. Since outliers impact model building under the regression assumptions, they are treated, and the box plots of all variables are plotted again to verify the treatment.
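The report does not show the treatment code; a common approach, sketched below under the assumption that IQR-based capping was used, is:

```python
def cap_outliers_iqr(frame, column):
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at the whisker limits."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    frame[column] = frame[column].clip(lower, upper)

for col in ["carat", "depth", "table", "x", "y", "z", "price"]:
    cap_outliers_iqr(df, col)
```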
Checking the possibility of combining the sub-levels of the categorical (ordinal) variables and taking action accordingly:
There are 3 categorical variables: 'Cut', 'Color' and 'Clarity'.
• The mean and median prices of sub-categories VS1 and VS2 are very close; both stones' prices lie in a similar range. Let the combined category be called VS.
• The next categories that can be grouped are VVS1 and VVS2. Their mean and median prices differ a little but are still close enough to group. Let the combined category be called VVS.
• The third grouping involves SI1 and SI2, whose price ranges are almost the same. SI1 has more stones but a lower mean price, while SI2 has fewer stones but a higher mean price, so the two balance out and can be grouped together. Let the combined category be called SI.
• The final categories of the clarity variable are I1, SI, VS, VVS and IF.
The grouping of these sub-categories is applied to a new copy of the dataset. A model is built on this dataset and its performance compared against the non-grouped models; a sketch of the grouping follows.
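A minimal sketch of the grouping on a copy of the DataFrame (df_grouped is a hypothetical name):

```python
# Group the clarity sub-levels on a copy of the dataset, as described above.
df_grouped = df.copy()
clarity_groups = {"VS1": "VS", "VS2": "VS",
                  "VVS1": "VVS", "VVS2": "VVS",
                  "SI1": "SI", "SI2": "SI"}
df_grouped["clarity"] = df_grouped["clarity"].replace(clarity_groups)
print(df_grouped["clarity"].unique())  # I1, SI, VS, VVS, IF (in some order)
```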
1.3 Encode the data (having string values) for Modelling. Split the data into train and
test (70:30). Apply Linear regression using scikit learn. Perform checks for significant
variables using appropriate method from stats model. Create multiple models and
check the performance of Predictions on Train and Test sets using Rsquare, RMSE
& Adj-Rsquare. Compare these models and select the best one with appropriate
reasoning.
The categorical variables are encoded ordinally as follows.
Cut:
• Ideal: 1
• Premium: 2
• Very Good: 3
• Good: 4
• Fair: 5
Clarity:
• I1: 1
• SI2: 2
• SI1: 3
• VS2: 4
• VS1: 5
• VVS2: 6
• VVS1: 7
• IF: 8
Color:
• J: 1
• I: 2
• H: 3
• G: 4
• F: 5
• E: 6
• D: 7
Now the dataset is cleaned, encoded and ready for model building.
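A minimal sketch of this encoding; the cut_c / color_c / clarity_c column names mirror the terms that appear in the report's regression equations:

```python
# Ordinal encoding of the three categorical variables, as listed above.
cut_map = {"Ideal": 1, "Premium": 2, "Very Good": 3, "Good": 4, "Fair": 5}
clarity_map = {"I1": 1, "SI2": 2, "SI1": 3, "VS2": 4,
               "VS1": 5, "VVS2": 6, "VVS1": 7, "IF": 8}
color_map = {"J": 1, "I": 2, "H": 3, "G": 4, "F": 5, "E": 6, "D": 7}

df["cut_c"] = df["cut"].map(cut_map)
df["clarity_c"] = df["clarity"].map(clarity_map)
df["color_c"] = df["color"].map(color_map)
df = df.drop(columns=["cut", "clarity", "color"])
```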
Model 1: Considering all the variables as they are and fitting the linear regression model.
In this model all attributes are used as they are and the dataset is not scaled, since scaling does not influence the accuracy or performance of a linear regression model.
5. Fitting the linear regression model from sklearn's linear_model to the training set.
6. Finding the coefficients of each independent attribute.
Figure 21. Scatter plot for model 1.
The linear regression model from the scikit-learn library exposes only limited performance parameters, so we also perform linear regression using statsmodels.
The difference between scikit-learn's linear regression and statsmodels' linear regression is that statsmodels gives a more detailed summary of the model. Statsmodels also provides adjusted R-square values and an overall p-value to check whether the model is reliable, as well as p-values for every variable indicating whether its coefficient is reliable. Statsmodels therefore offers a better statistical analysis of the model than sklearn, telling us which attributes we can drop and which we should keep. A minimal sketch of such a fit follows.
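The exact formula is not shown in the report; the sketch below assumes a train_df split containing the encoded columns, and treats the encoded ordinals as categorical factors so the output terms resemble the cut_c[T.2]-style names in the report's equations:

```python
import statsmodels.formula.api as smf

# OLS fit with statsmodels. Treating cut_c/color_c/clarity_c as categorical
# produces dummy terms like C(cut_c)[T.2], close to the report's cut_c[T.2].
formula = ("price ~ C(cut_c) + C(color_c) + C(clarity_c) "
           "+ carat + depth + table + x + y + z")
ols_model = smf.ols(formula, data=train_df).fit()
print(ols_model.summary())  # R-squared, adj. R-squared, p-values, condition number
```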
The adjusted R-square metric accounts for spurious correlation. The analysis above suggests there is some degree of multicollinearity.
In the OLS model, hypothesis testing establishes the reliability of the coefficients. The null hypothesis (H0) claims that there is no relation between the dependent and an independent variable.
At the 95% confidence level, if the p-value is >= 0.05, we do not have enough evidence to reject H0; therefore no relation between the dependent and independent variable is established. Similarly, if the p-value is < 0.05, we reject the null hypothesis; therefore there is a relationship between the dependent and independent variables.
• R-squared and Adjusted R-squared values are the same, equal to 0.940.
• The overall p-value of the model is 0.00 (< 0.05), which means the model is reliable.
• Looking at individual p-values, cut_c[T.2] and depth have values greater than 0.05. These variables are therefore not good predictors of price and can be dropped to get a better-performing model.
• The condition number is large, indicating the presence of multicollinearity in the dataset, which was clearly seen in the earlier analysis of the X, Y and Z variables.
• The RMSE score for this OLS model is 844.29.
• Checking multicollinearity using VIF:
We test for multicollinearity with the Variance Inflation Factor (VIF). VIF quantifies how strongly each independent variable is correlated with the others. VIF starts at 1 and has no upper bound. A sketch of the computation follows.
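A minimal sketch of the VIF computation with statsmodels, assuming X_train is the numeric predictor matrix (an assumed variable name):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF per predictor; a constant is added so the scores are not distorted.
X_vif = add_constant(X_train)
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.drop("const"))  # VIF near 1 = little collinearity; large = problem
```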
• High levels of multicollinearity are present in the data, so this model is not reliable. Changing the data and dropping the highly correlated variables may overcome the problem.
• Linear Equation:
price = (-4325.02) * Intercept + (-31.21) * cut_c[T.2] + (-127.54) * cut_c[T.3] + (-242.7) * cut_c[T.4] + (-629.95) * cut_c[T.5] + (531.48) * color_c[T.2] + (1030.1) * color_c[T.3] + (1450.54) * color_c[T.4] + (1630.39) * color_c[T.5] + (1672.75) * color_c[T.6] + (1861.63) * color_c[T.7] + (1712.14) * clarity_c[T.2] + (2535.87) * clarity_c[T.3] + (3072.13) * clarity_c[T.4] + (3355.1) * clarity_c[T.5] + (3766.77) * clarity_c[T.6] + (3776.88) * clarity_c[T.7] + (3995.22) * clarity_c[T.8] + (9200.19) * carat + (12.59) * depth + (-23.07) * table + (-1176.95) * x + (1083.2) * y + (-642.48) * z
• Inferences: Carat, with a coefficient of 9200.19, is the best predictor of price. For a 1-unit change in 'carat', the price changes by 9200.19 units, keeping all other variables constant.
Based on the full analysis of this model, it is not the best model for predicting price slots for zirconia stones, but it tells us what changes are needed to build a better-fitting one.
Model 2: Dropping the attributes 'x', 'y' & 'z' and fitting the linear regression model.
In this model we consider all the variables except 'x', 'y' and 'z'; the data is again not scaled. As seen in the Model 1 analysis, 'x', 'y' and 'z' contribute to high multicollinearity and their VIF scores were large, so we build a model after dropping these variables and compare its performance.
5. Fitting the linear regression model from sklearn's linear_model to the training set.
6. Finding the coefficients of each independent attribute.
7. From the coefficients of the independent attributes we can infer that the 'carat' variable again carries the most weight and acts as the best predictor of price.
8. On the other hand, the 'depth' and 'table' variables carry little weight in the prediction. The coefficient values have changed from Model 1 to Model 2, and Model 2 performs better than Model 1.
9. A coefficient measures how much the response changes per unit change in the related predictor. For example, a unit change in the value of carat brings a 7957.23 change in price.
10. The intercept for this model is -3136.18, lower in magnitude than that of Model 1.
11. The performance of the regression model is measured by the coefficient of determination (R-square). R-square determines the fitness of a linear model and ranges from 0 to 1; the closer the data points are to the best-fit plane, the closer R-square is to 1 and the better the model.
R-square of the training data is 0.93016.
R-square of the testing data is 0.93055.
The train and test scores are almost the same, so the model is a good fit.
12. Calculating the root mean square error (RMSE) to check model performance. RMSE is the standard deviation of the prediction errors (residuals); residuals measure how far the data points lie from the best-fit plane, so RMSE tells us how spread out these residuals are. The lower the RMSE, the closer the data points are to the best-fit plane.
RMSE of training data is 913.878
RMSE of testing data is 918.433
The RMSE values have increased a bit as compared to our previous model.
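A minimal sketch of how these train/test metrics can be computed with sklearn; the names lr, X_train, X_test, y_train and y_test are assumptions for the fitted model and the splits:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# R-square and RMSE on train and test; `lr` is the fitted LinearRegression.
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = lr.predict(X_part)
    r2 = r2_score(y_part, pred)
    rmse = np.sqrt(mean_squared_error(y_part, pred))
    print(f"{name}: R-square = {r2:.5f}, RMSE = {rmse:.3f}")
```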
13. Checking the plot between the original price and the predicted price for a linear relationship.
• R-squared and Adjusted R-squared values are the same, equal to 0.939.
• The overall p-value of the model is 0.00 (< 0.05), which means the model is reliable.
• Looking at individual p-values, the p-value of cut_c[T.2] has dropped to 0 in this model, but 'depth' still has a p-value greater than 0.05, so keeping the 'depth' variable is not necessary. The next model is built after dropping the depth variable as well.
• The condition number is still large, indicating remaining multicollinearity caused by other variables. It has, however, reduced from Model 1; dropping the depth variable may eliminate the problem and improve the performance of the model.
• The RMSE score for this OLS model is 851.307.
• Checking multicollinearity using VIF for Model 2 (computed as before).
• The VIF scores have decreased to a great extent, so the problem of multicollinearity is treated very well by removing the 'x', 'y' and 'z' variables.
• Linear Equation for model 2:
• Inferences: 'Carat' is still the best predictor, with a coefficient of 8027.46. For a 1-unit change in 'carat', the price changes by 8027.46 units, keeping all other variables constant. The strong multicollinearity due to x, y and z is greatly reduced; the remaining collinearity is due to the 'depth' variable, which can also be dropped since it is not a good predictor. The next model is built after dropping the depth variable as well, and its effect on performance is compared with the previous models to choose the best-fit model for predicting price slots.
Model 3: Dropping the attributes 'x', 'y', 'z' & 'depth' and fitting the linear regression model.
In this model we consider all the variables except 'x', 'y', 'z' and 'depth'; the data is unscaled. The Model 2 analysis showed that dropping 'x', 'y' and 'z' reduces the high multicollinearity. Here we additionally drop the 'depth' variable, which is not a good predictor and also contributes some multicollinearity, to further improve the model.
1. Capturing the target column into separate vectors for the training set and test set.
2. Defining the X variable with the independent attributes (excluding x, y, z and depth) and the y variable with the target, 'price'. A sketch of these steps follows the list.
3. Splitting the dataset into train and test in the ratio 70:30 using train_test_split from sklearn, keeping the random state as 1.
4. Checking the shape of the split data.
5. Fitting the linear regression model from sklearn's linear_model to the training set.
6. Finding the coefficients of each independent attribute.
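A minimal sketch of steps 1-6 for Model 3, assuming df is the cleaned, encoded DataFrame:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Model 3 predictors: everything except the dropped columns and the target.
X = df.drop(columns=["price", "x", "y", "z", "depth"])
y = df["price"]

# 70:30 split with random_state=1, as stated in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)

lr = LinearRegression().fit(X_train, y_train)
print(dict(zip(X.columns, lr.coef_)))  # coefficient per attribute
```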
• R-squared and Adjusted R-squared values are the same, equal to 0.939.
• The overall p-value of the model is 0.00 (< 0.05), which means the model is reliable.
• No variable has a p-value greater than 0.05, so all variables in this model have a good relationship with the dependent variable and are retained.
• The condition number is still large, indicating strong multicollinearity or other numerical problems in the dataset, but it has reduced from 1.03e+04 in Model 1 to 1.99e+03 in Model 3.
• The RMSE score for this OLS model is 851.355.
• Checking multicollinearity using VIF for Model 3 (computed as before).
Inferences:
'Carat' is still the best predictor, with a coefficient of 7956.14. For a 1-unit change in 'carat', the price changes by 7956.14 units, keeping all other variables constant.
VIF scores have reduced to around 5 for most of the variables, RMSE has not shown any major change up to this model, all p-values are under 0.05, and the condition number is considerably reduced. This model makes clear which variables are good predictors.
The RMSE is still high, however, and the coefficients are on very different scales. We can bring the variables to a comparable form by scaling the data, so the same model is rebuilt on scaled data and compared.
Model 4: Dropping the attributes 'x', 'y', 'z', 'depth', grouping the sub-categories of attributes, and fitting the linear regression model.
This model uses the same attributes as the previous one, but on the copied data frame in which the sub-categories of the clarity variable were grouped. The aim is to check whether altering the data in this way changes performance; the results are compared with the original-data models, and the suggestions for the company are based on them.
5. Fitting the linear regression model from sklearn's linear_model to the training set.
6. Finding the coefficients of each independent attribute.
The RMSE is higher than in the previous models, so combining the sub-categories of the attributes does not improve the model. It is better to keep all sub-categories as in the original data.
12. Checking the plot between the original price and the predicted price for a linear relationship.
• R-squared and Adjusted R-squared values are the same, equal to 0.933. The R-square value is slightly lower than in the previous models.
• The overall p-value of the model is 0.00 (< 0.05), which means the model is reliable.
• No variable has a p-value greater than 0.05, so all variables in this model have a good relationship with the dependent variable and are retained.
• The condition number is still fairly large, at 1.96e+03, which points to some remaining numerical problems even though the variables contributing most to the multicollinearity have been removed.
• The RMSE score for this OLS model is 893.715.
• Checking multicollinearity using VIF for Model 4 (computed as before).
Inferences:
'Carat' is still the best predictor, with a coefficient of 7916.86. For a 1-unit change in 'carat', the price changes by 7916.86 units, keeping all other variables constant.
VIF scores remain around 5 for most of the variables, all p-values are under 0.05, and the condition number is considerably reduced.
However, the R-squared value has decreased slightly compared to the original dataset and the RMSE has increased noticeably, which indicates that combining the sub-categories of variables does not contribute to better performance. It is therefore better to stick with the previous model, i.e. Model 3.
Model 5: Dropping the attributes 'x', 'y', 'z' & 'depth' and fitting the linear regression model on scaled data.
Of the four models above, Model 3 performs best, but its variables are on very different scales. So the dataset is scaled using the z-score (mean close to 0, standard deviation close to 1), the impact of scaling on the model is checked, Models 3 and 5 are compared, and the best model for predicting price slots is selected. A sketch of the scaling step follows.
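A minimal sketch of the z-score scaling, assuming the same X_train/X_test/y_train/y_test split as Model 3; scaling the target as well is an inference from the near-zero intercept and ~1.06 carat coefficient reported below, not something the report shows explicitly:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# z-score scaling: every column ends up with mean ~0 and std ~1.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# The target appears to be z-scored too (an assumption; see the note above).
y_train_scaled = (y_train - y_train.mean()) / y_train.std()
y_test_scaled = (y_test - y_train.mean()) / y_train.std()
```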
1. Capturing the target column into separate vectors for the training set and test set.
2. Defining the X variable with the independent attributes (excluding x, y, z and depth) and the y variable with the target, 'price'.
3. Splitting the dataset into train and test in the ratio 70:30 using train_test_split from sklearn, keeping the random state as 1.
4. Checking the shape of the split data.
5. Fitting the linear regression model from sklearn's linear_model to the training set.
6. Finding the coefficients of each independent attribute.
7. From the coefficients of the independent attributes we can infer that the 'carat' variable has the most weight and acts as the best predictor of price. The number of predictors is smaller, yet the model is comparatively better.
8. A coefficient measures how much the response changes per unit change in the related predictor.
9. The intercept for this model is -2.725e-16; after scaling, the intercept becomes almost exactly zero.
10. The performance of the regression model is measured by the coefficient of determination (R-square), which determines the fitness of a linear model and ranges from 0 to 1. The closer the data points are to the best-fit plane, the closer R-square is to 1 and the better the model.
R-square of the training data is 0.93013.
R-square of the testing data is 0.930502.
Scaling the dataset does not affect the R-square values. The train and test scores are almost the same, so the model is a good fit.
11. Calculating the root mean square error (RMSE) to check model performance. RMSE is the standard deviation of the prediction errors (residuals); residuals measure how far the data points lie from the best-fit plane, so RMSE tells us how spread out these residuals are. The lower the RMSE, the closer the data points are to the best-fit plane.
• R-squared and Adjusted R-squared values are the same, equal to 0.930.
• The overall p-value of the model is 0.00 (< 0.05), which means the model is reliable.
• No variable has a p-value greater than 0.05, so all variables in this model have a good relationship with the dependent variable and are retained.
• In this model the condition number has reduced to 1.87, showing that multicollinearity and other numerical problems are no longer present.
• The RMSE score for this OLS model is 0.2643 (on the scaled target).
• Checking multicollinearity using VIF for Model 5 (computed as before).
• All variables have VIF scores of almost 1, suggesting negligible correlation among the independent variables in the dataset.
• Linear Equation for Model 5 (on scaled data):
price = (-0.0) * Intercept + (1.06) * carat + (-0.01) * table + (-0.04) * cut_c + (0.13) * color_c + (0.22) * clarity_c
Inferences:
'Carat' is still the best predictor, with a coefficient of 1.06: for a 1-unit change in scaled 'carat', the scaled price changes by 1.06 units, keeping all other variables constant. Looking at all the performance metrics of this model, it fulfils every criterion for the best-fit model.
Model comparison:
Model 1: Considering all the variables as they are and fitting the linear regression model.
Model 2: Dropping the attributes 'x', 'y' & 'z' and fitting the linear regression model.
Model 3: Dropping the attributes 'x', 'y', 'z' & 'depth' and fitting the linear regression model.
Model 4: Dropping the attributes 'x', 'y', 'z', 'depth', grouping sub-categories of attributes, and fitting the linear regression model.
Model 5: Dropping the attributes 'x', 'y', 'z' & 'depth' and fitting the linear regression model on scaled data.
Inferences:
• R-square is nearly the same for all models, in both the sklearn and statsmodels fits.
• Models 1, 3 and 5 give the best RMSE values.
• VIF max and min values are lowest for Model 5, since that data is scaled.
• In Model 4, even after combining the sub-categories of attributes, the RMSE for train and test is higher than the other models, so the idea of combining sub-categories is dropped.
• Models 3 and 5 use the same attributes; one is built on scaled attributes and the other on the original dataset.
• Based on the performance measures of all models, Model 5 is the best-fit and most viable model for the given data.
1.4 Inference: Based on these predictions, what are the business insights and recommendations?
According to the problem statement, Gem Stones co ltd, a cubic zirconia manufacturer, earns varying profits across different price slots. The company wants to predict a stone's price from the data provided so that it can distinguish between higher-profit and lower-profit stones and maximise its profit share. It also requires the top five attributes that matter most for price prediction.
In our analysis so far, we have thoroughly examined the historical data and developed a model that predicts price based on the characteristics in our dataset. Let us now look at the key points in the past data and suggest some recommendations for the firm.
The business value lies in distinguishing higher-profit stones from lower-profit stones to improve the profit share. Our model explains more than 90% of the variation in price (R-square above 0.90), which may be acceptable in this business, and should predict price well for the large majority of stones.
Following are the insights and recommendations to help the firm meet the business objective:
• Carat is the best predictor of price, according to the best-fit model.
• The firm should favour stones with a higher carat value, since stones with larger carat values are priced higher.
• The significance of higher-carat stones should be advertised to customers.
• Marketing should make clients aware of the significance of higher carat values.
• Customers should receive different offerings depending on their financial capability: customers with higher financial status should be offered higher-carat stones, while those with lower paying ability should be offered lower-carat stones.
• Marketing can also educate customers about the significance of a better carat score and quality.
• For the cut attribute, the 'Ideal' cut type is the best selling, and its average price is slightly lower than that of the 'Premium' cut type, which is slightly more expensive.
• 'Fair' and 'Good' have lower sales counts and relatively higher average prices.
• The Ideal, Premium and Very Good cut types bring better profits.
Recommendations:
• The Ideal, Premium and Very Good cut types bring in more profit; proper marketing of these products may increase sales to a great extent.
• The best quality cut, 'Ideal', has a comparatively lower average price yet a high sales count at this pricing. The firm might try increasing the price of the Ideal category a little to see whether it affects sales; if sales drop, it should return to the current market price.
• Although 'Fair' and 'Good' are the lowest cut qualities and sell in small quantities, their average price is still rather substantial. The firm can attempt to lower their average price or improve the quality of these cuts so that customers are willing to pay the higher price.
• It is advisable to de-prioritise the 'Fair' and 'Good' cut types, as their sales counts and profits are very low.
• X, Y and Z are the length, width and height of the cubic zirconia. They have a linear relation with each other and with the target variable 'price'.
• All three have a strong relation to the price variable: changes in the values of x, y and z cause price to change.
• At the same time, there is a significant association among the three, which causes high multicollinearity and affects the performance of the price prediction.
Recommendations:
• The dimension coefficients are negative in the model: a smaller but well-proportioned stone can still be more expensive.
• If a stone with smaller dimensions has a larger carat value and superior clarity, it will be valued higher than a huge stone with lower carat and clarity.
• The firm can focus on well-balanced sizes with higher quality stones.
5. Clarity:
Insights:
• Clarity refers to the absence of inclusions and blemishes, and has emerged as a strong predictor of price as well.
• SI1 is the expensive one, followed by VS2 and SI2, which fall in the same price range, while I1 and IF are the cheaper stones.
• SI1 is the best-selling clarity type, followed by VS2, with I1 selling the least.
• The SI1, VS2 and SI2 clarity types let the firm put a higher price cap on the stones and also have the highest sales counts.
Recommendations:
6. Color:
Insights:
• 'G' colored gems are costly, most liked by customers and the highest sold.
• 'J' colored gems are priced lower and are also the least sold.
• 'G' is the best-selling zirconia color, followed closely by 'E' and 'F' in nearly the same range, while 'J' is the least selling.
Recommendations:
• Stone colors such as H, I and J will not help the company put a high price cap on those stones.
• Instead, the firm should concentrate on stones in the colors D, E and F in order to fetch higher prices and boost sales.
• This might also suggest that the firm should explore unique color stones, such as transparent stones, to help boost pricing.
• 'J' and 'I' color stones should be priced lower; customers may be attracted by the lower price, increasing sales.
The best 5 attributes, which are good predictors of price, are as follows:
1. Carat
2. Clarity
3. Color
4. Cut
5. Table
__________________________________________________________________________________
Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and some
didn't. You have to help the company in predicting whether an employee will opt for the package or not
on the basis of the information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.
Data Dictionary:
Variable Name      Description
Holiday_Package    Opted for holiday package: yes/no
Salary             Employee salary
age                Age in years
educ               Years of formal education
no_young_children  Number of young children (younger than 7 years)
no_older_children  Number of older children
foreign            Foreigner: yes/no
The purpose of this report is to examine past information on selling holiday packages in order to help the company predict whether an employee will opt for the package on the basis of the information given in the dataset, to understand the data and examine its patterns, and to provide business insights based on exploratory data analysis and class predictions.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
Observations:
• The dataset has 7 columns and 872 rows, excluding the 'Unnamed: 0' column.
• The first column "Unnamed: 0" has only serial numbers, so we can drop it as it is not useful.
• There are both categorical and continuous data. For categorical data we have 'Holiday_Package' and 'foreign'; for continuous data we have salary, age, educ, no_young_children and no_older_children.
• Holiday_Package is the target variable.
• The dataset is used to predict whether an employee will opt for the holiday package on the basis of the information given.
• There are no missing or duplicate values in the dataset.
• Skewness and kurtosis are also calculated for each column. High skewness indicates a lack of symmetry, and a high kurtosis value indicates heavily tailed data.
• Based on the descriptive summary the data looks good; for most variables the mean and median are nearly equal.
• We have a fairly balanced dataset, with roughly 54% 'no' and 46% 'yes' values for the target variable.
Data Visualization:
1 - Salary
2 – Age
• From the above graphs, we can infer that the mean 'age' of employees is around 39 years, with a minimum of 20 and a maximum of 62 years.
• The distribution of 'age' looks almost normal, with a skewness value of 0.146412.
• The distplot shows most of the data lies between roughly 20 and 60.
• The box plot of the 'age' variable shows no outliers.
3 - Educ
• From the above graphs, we can infer that the mean 'educ' (years of formal education) is around 9 years, with a minimum of 1 and a maximum of 21 years.
• The distribution of 'educ' is slightly left skewed, with a skewness value of -0.045501.
• The distplot shows most of the data lies between roughly 1 and 20.
• The box plot of the 'educ' variable shows a few outliers.
4 - No_young_children
• From the above graphs, we can infer that the mean 'no_young_children' (number of young children below the age of 7) is around 0.3119, with a minimum of 0 and a maximum of 3.
• The distribution of 'no_young_children' is strongly right skewed, with a skewness value of 1.9465.
• The distplot shows most of the data lies between 0 and 3.
• The box plot of the 'no_young_children' variable shows a few outliers.
5 - No_older_children
• From the above graphs, we can infer that the mean 'no_older_children' (number of older children) is around 0.9827, with a minimum of 0 and a maximum of 6.
• The distribution of 'no_older_children' is slightly right skewed, with a skewness value of 0.953951.
• The distplot shows most of the data lies between roughly 0 and 4.
• The box plot of the 'no_older_children' variable shows one outlier at 6.
Observations:
Table 3. Inferences of Univariate Data visualization for problem 2.
• There are notable outliers only in the Salary variable; the other variables have just one or two, which have little effect.
• Treating the outliers might not be a feasible option, as the data can be original and genuine.
• Foreigners accepting the holiday package have a lower mean number of years of formal education than natives accepting the package.
• If an employee is a foreigner and has no young children, the chances of opting for the holiday package are good.
Figure 32. Count plot for Holiday package.
Figure 31. Count plot for foreign.
Observations:
• The distribution of 'Holiday_Package' shows whether an employee opted for the package or not. The frequency of 'no' is higher, at around 471, while the employees who opted are slightly fewer, at 401.
• About 54% of employees are not opting for the holiday package and 46% are interested, which implies the dataset is fairly balanced.
• The frequency distribution of 'foreign' implies that around 75% of employees are from the same country and around 25% are foreigners.
Bivariate Analysis:
Salary vs Holiday_Package:
The average 'salary' of employees opting and not opting for the holiday package is similar in nature; however, the distribution is noticeably more spread out for people not opting for the package.
The age distributions for employees opting and not opting are also similar, though those opting are fewer in number and mostly fall in the 35-45 age group.
Figure 36. Count plot of Age against Holiday package.
We can clearly see that employees in the middle age range (34 to 45 years) opt for the holiday package more often than older and younger employees.
The variable 'educ' (years of formal education) shows a similar pattern, which means education is likely not a variable that influences whether employees opt for holiday packages.
Employees with few years of formal education (1 to 7 years) and those with higher education opt for the holiday package less than employees with 8 to 12 years of formal education.
There is a significant difference between employees with young children who opt for the holiday package and those who do not, so this attribute is a good predictor.
People with young children who opt for holiday packages are clearly very few compared to employees without young children.
The distributions for opting or not opting look the same for employees with older children, so at this point this attribute might not be a good predictor for model building. Both scenarios show almost the same distribution.
Considering the ratio of foreigners to citizens, the percentage of foreigners accepting the holiday package is substantially higher than that of citizens.
• For both foreigners and non-foreigners, more people did not opt for the holiday package than opted.
• The average salary of people who did not opt for the holiday package is slightly higher than that of those who opted.
• The mean salary of foreign employees is slightly lower than that of natives.
• There are outliers in all the combinations.
Pair plot:
The pair plot helps us visualize how the numerical features interact with each other, and how the distribution of the target variable differs within each individual feature.
Observations:
• There is no obvious, well-defined correlation between the attributes and the holiday package; the data seems fine.
• There is no considerable difference between the data distributions across the holiday package classes.
• Looking at the age distribution, employees who accept the holiday package usually tend to be in the middle of their careers (late 30s).
• Across education, employees with more years of formal education have a lower tendency to opt for the holiday package than employees with fewer years of formal education.
Multivariate Analysis:
Heatmap
A heatmap gives us the correlation between numerical variables, as described in Problem 1.
Observations:
• There is no strong correlation between the variables, hence we do not face the issue of
multicollinearity.
• Observing the heatmap, we can see some positive correlation between the number of years of formal education and the salary received.
• There is some negative correlation between age and the number of young children below age 7.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis).
Now the dataset is cleaned, encoded and ready to use for model building.
1. Encoding the string-valued columns for modelling.
2. Splitting the dataset into train and test in the ratio 70:30 using train_test_split from sklearn,
keeping the random state as 1.
3. Checking the shape of the split data.
The data is now ready to fit the models on the train set and check performance on the test set (see
the sketch below). The data is divided into 70% train and 30% test.
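A minimal sketch of these steps, assuming the cleaned data frame is df and that 'Holliday_Package' and 'foreign' are the string-valued columns (all names here are illustrative):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Encode the string-valued columns as integer codes.
    # 'Holliday_Package' and 'foreign' are assumed column names.
    for col in ['Holliday_Package', 'foreign']:
        df[col] = pd.Categorical(df[col]).codes

    # Separate predictors and target, then split 70:30 with random_state=1.
    X = df.drop('Holliday_Package', axis=1)
    y = df['Holliday_Package']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
    print(X_train.shape, X_test.shape)   # check the shape of the split data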
• Fitting the Logistic Regression model, imported from sklearn.linear_model.
• Predicting on the training and testing datasets.
• Getting the predicted classes and probabilities and creating a data frame.
• Evaluating the model through accuracy, confusion matrix, classification report, AUC and ROC
curve.
Initially, we fit the training data and labels in the Logistic Regression model. Based on the model
performance, the model is tuned using grid search; the best parameters are used, the model is re-built,
and model performance is calculated, including a classification report with accuracy, recall, precision
and F1 score for both train and test data. A sketch of the baseline fit follows.
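A minimal sketch of the baseline fit and predictions described above (variable names follow the split sketch in section 2.2):

    from sklearn.linear_model import LogisticRegression

    # Baseline Logistic Regression model fitted on the 70:30 split.
    log_reg = LogisticRegression(max_iter=10000)
    log_reg.fit(X_train, y_train)

    # Predicted classes and class-1 probabilities for both sets.
    train_pred = log_reg.predict(X_train)
    test_pred = log_reg.predict(X_test)
    train_prob = log_reg.predict_proba(X_train)[:, 1]
    test_prob = log_reg.predict_proba(X_test)[:, 1]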
Grid Search: Grid search divides the hyperparameter domain into a discrete grid. Then, using cross-
validation, it tries every possible combination of values in this grid and computes a performance
measure for each. The ideal combination of hyperparameter values is the point on the grid that
maximizes the average cross-validation score. Grid search is an exhaustive technique that considers
all possible combinations in order to locate the best point in the domain.
Hyperparameter Tuning:
o 'penalty': ['l2', 'none']
o 'solver': ['sag', 'lbfgs', 'liblinear', 'newton-cg']
o 'tol': [0.0001, 0.00001]
o 'max_iter': [10000, 5000, 15000]
o Cross validation (cv): 5
o Scoring: 'f1'
Penalized logistic regression imposes a penalty on the logistic model for having too many variables,
shrinking the coefficients of the less contributing variables towards zero. This is also known as
regularization. In our grid search, we take 'l2' and 'none' as the arguments and check which is
preferred by the grid search.
The solver is the procedure that optimizes the weights of the model, and different solvers take
different approaches to finding the best fit. For example, 'liblinear' uses a coordinate descent (CD)
algorithm, which solves the optimization problem by successively performing approximate minimization
along coordinate directions, while 'sag', 'lbfgs' and 'newton-cg' are gradient-based methods. In our
case, we have taken 'sag', 'lbfgs', 'liblinear' and 'newton-cg' as the arguments and will check which
is preferred by the grid search.
Tol is the tolerance of the optimization: when the training loss does not improve by at least the
given tol on consecutive iterations, convergence is considered reached and training stops. We will
check tolerances of 0.0001 and 0.00001.
Logistic regression uses an iterative maximum-likelihood algorithm to fit the data, and there is no
set criterion for the maximum number of iterations: the solver runs until it reaches convergence or
the maximum iterations provided. In this case, we have given 5000, 10000 and 15000 as inputs and will
see which fits better.
We have taken 5-fold cross-validation and F1 as the scoring metric for our grid search; a sketch of
the search follows.
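A sketch of this grid search with scikit-learn's GridSearchCV, using the grid listed above. Note that in scikit-learn >= 1.2 the penalty 'none' is spelled None, and that invalid solver/penalty pairs (e.g. 'liblinear' with no penalty) are simply scored as NaN by the search's default error handling rather than halting it:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'penalty': ['l2', 'none'],
        'solver': ['sag', 'lbfgs', 'liblinear', 'newton-cg'],
        'tol': [0.0001, 0.00001],
        'max_iter': [10000, 5000, 15000],
    }

    # 5-fold cross-validation, scored on F1, as stated above.
    grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='f1')
    grid.fit(X_train, y_train)
    best_model = grid.best_estimator_
    print(grid.best_params_)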
The model re-built with the grid search's best parameters is saved in a separate variable, best_model.
This model is used to predict the values of the target variable, and its performance is then
evaluated.
Checking the Coefficients:
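A sketch of how the coefficient table can be pulled from the tuned model (feature names come from the training frame):

    import pandas as pd

    # Pair each predictor with its fitted coefficient in the tuned model.
    coef_table = pd.DataFrame({
        'feature': X_train.columns,
        'coefficient': best_model.coef_[0],
    }).sort_values('coefficient', ascending=False)
    print(coef_table)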
From the cut-off table (probability cut-off versus recall and F1 score), we can see that a cut-off
probability of 0.4 provides the optimal balance of recall and F1 score. As a result, we will discuss
the performance of our LDA model using both the default (0.5) and the 0.4 cut-off probability; a
sketch of the cut-off search follows.
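A minimal sketch of how such a cut-off table can be produced, assuming the LDA model is fitted on the same split (the loop and names are illustrative):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import f1_score, recall_score

    # Fit LDA on the same 70:30 split used for logistic regression.
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train, y_train)
    train_prob_lda = lda.predict_proba(X_train)[:, 1]

    # Score a range of candidate cut-offs on the training data.
    for cutoff in np.arange(0.1, 1.0, 0.1):
        pred = (train_prob_lda >= cutoff).astype(int)
        print(f'cut-off {cutoff:.1f}: recall={recall_score(y_train, pred):.3f}, '
              f'F1={f1_score(y_train, pred):.3f}')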
2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model Final Model: Compare Both the models and write inference which model
is best/optimized.
Model performance evaluation helps us understand how good the trained model is, so that we can have
confidence in its predictions on future data. Once the models are built, we evaluate their performance
on the train and test datasets, checking accuracy, precision and other metrics to determine whether
the model is underfitting or overfitting. The following methods are used to evaluate model
performance:
1. Confusion Matrix
2. Classification Report
o Accuracy
o Precision
o Recall
o F1 Score
3. ROC curve
4. AUC score
1. Confusion Matrix:
This tells us how many zeros (0s), i.e. employees who did not opt for the holiday package, and ones
(1s), i.e. employees who opted for it, were correctly predicted by our model, and how many were
wrongly predicted.

                                Predicted Class
                         Class = No          Class = Yes
  Actual   Class = No    True Negative       False Positive
  class    Class = Yes   False Negative      True Positive
I. Accuracy:
Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted
observations to the total number of observations.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
II. Precision:
Precision is the ratio of correctly predicted positive observations to the total predicted positive
observations.
Precision = TP / (TP + FP)
III. Recall:
Recall (sensitivity) is the ratio of correctly predicted positive observations to all actual positive
observations.
Recall = TP / (TP + FN)
IV. F1 Score:
The F1 score is the harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
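All of these metrics, along with the confusion matrix, classification report and ROC/AUC listed above, are available directly from scikit-learn; a sketch using the tuned logistic model on the test set:

    from sklearn.metrics import (accuracy_score, classification_report,
                                 confusion_matrix, roc_auc_score, roc_curve)

    # Evaluate the tuned model (best_model from the grid search) on test data.
    test_pred = best_model.predict(X_test)
    test_prob = best_model.predict_proba(X_test)[:, 1]

    print(accuracy_score(y_test, test_pred))
    print(confusion_matrix(y_test, test_pred))
    print(classification_report(y_test, test_pred))
    print(roc_auc_score(y_test, test_prob))
    fpr, tpr, _ = roc_curve(y_test, test_prob)   # points for the ROC plot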
Classification report:
Figure 47. Classification report of training and testing data for the Logistic Regression model
Figure 48. Confusion matrix of train (left) and test (right) for Logistic Regression
Figure 49. ROC curve for training and testing data for Logistic Regression
• Accuracy, AUC, precision and recall on the test data are nearly identical to those on the
training data.
• This shows that there is neither overfitting nor underfitting, and that the model is a good
classification model overall.
• Overall, the metrics are high and indicate a good fit.
Inferences:
We must understand the meaning of false positives and false negatives as stated in the problem
description. False positives are employees who did not choose a package but were predicted to do so
by our model. False negatives are employees who chose a holiday package despite our model's prediction
that they would not.
As a result, false positives have only a small impact, whereas false negatives will impact the firm.
Sensitivity (recall) is therefore the important measure in this instance, and the F1 score should also
be considered.
- The coefficients for no_young_children and foreign are the highest in magnitude.
- That is, a unit change in these variables causes the largest change in the log-odds of the
Logistic Regression model.
- With the lowest coefficient, salary is the weakest predictor.
- The coefficients for age, education and no_older_children are all quite low.
Figure 50. Classification report for LDA with default probability cut-off of 0.5
Figure 51. Confusion matrix of train (left) and test (right) for LDA: 0.5
Figure 52. ROC curve for train and test for LDA: 0.5
• Accuracy, AUC, precision and recall on the test data are nearly identical to those on the
training data.
• This shows that there is neither overfitting nor underfitting, and that the model is a good
classification model overall.
• Overall, the metrics are high and indicate a good fit.
The model accuracy on the training and test sets is about 63% and 65% respectively, roughly the same
as the proportion of class 0 observations in the dataset, so this model is affected by the class
imbalance. Since we only have 872 observations, an even better model could be built if the same LDA
model were re-built with a greater number of data points.
We further change the cut-off value to maximize recall, since recall is the important measure here.
We saw that at a cut-off probability of 0.4, recall increases to a great extent without much impact on
accuracy, and the F1 score is also at its best. Next, we check the model performance at the 0.4
cut-off (a sketch follows) and select the better model.
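Applying the chosen cut-off is a one-line relabelling of the predicted probabilities (a sketch reusing the fitted LDA model from above):

    # Re-label predictions using the 0.4 cut-off instead of the default 0.5.
    lda_train_pred_04 = (lda.predict_proba(X_train)[:, 1] >= 0.4).astype(int)
    lda_test_pred_04 = (lda.predict_proba(X_test)[:, 1] >= 0.4).astype(int)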
Figure 53. Classification report for LDA with custom probability cut-off of 0.4
Figure 54. Confusion matrix of train (left) and test (right) for LDA: 0.4
Figure 55. ROC curve for train and test for LDA: 0.4
We see a similar result, with no_young_children and foreign as good predictors and salary as the
weakest predictor.
In this table, we have the accuracy, recall, precision, F1 and AUC scores for two different models:
1) Logistic Regression model – best-fit model after grid search.
2) LDA with custom cut-off probability (0.4).
Inferences:
• The accuracy of both models is almost the same for both train and test.
• The AUC and precision of logistic regression are slightly greater than those of LDA for both
train and test.
• However, with recall and F1 score being the important measures of model performance here, the
LDA model performs much better than the logistic regression model.
• The linear discriminant analysis model with a custom cut-off probability of 0.4 is therefore
the best-fit model.
Comparing the ROC curves and AUC scores for LDA and Logistic Regression models.
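A sketch of how the two ROC curves can be drawn on one set of axes for comparison (variable names follow the earlier sketches):

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_auc_score, roc_curve

    # Test-set ROC curves for both models on the same axes.
    for name, model in [('Logistic Regression', best_model), ('LDA', lda)]:
        prob = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, prob)
        plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc_score(y_test, prob):.3f})')
    plt.plot([0, 1], [0, 1], 'k--')   # chance line
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.show()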
We can see from the graphs that logistic regression and LDA perform almost identically on both the
train and test datasets, although the logistic regression model has a somewhat better ROC curve.
2.4 Inference: Basis on these predictions, what are the insights and recommendations.
We had a business problem where we needed to predict whether an employee would opt for a holiday
package or not, and we predicted the results using both logistic regression and linear discriminant
analysis.
In our analysis so far, we have thoroughly examined the given data and developed a model that
classifies whether an employee opts for the holiday package based on the attributes in our dataset.
Let us now look at the key points in the historical data and suggest some recommendations for the
firm.
Insights from the Graphs and Analysis from EDA:
Holiday package:
• We can observe that 54% of the employees are not opting for the holiday package and 46% are
interested in it. This implies the dataset is fairly balanced.
Salary
• The average salary of employees opting and not opting for the holiday package is similar.
• The coefficient for salary is -1.3803e-05, showing almost no relation with the holiday
package, so salary is not a good predictor for model building.
• Higher-salary employees are more prone to not opting for the holiday package.
Age
• The age distributions of employees opting and not opting for the holiday package are similar
in nature, though those opting are fewer in number and mostly fall in the 35-45 age group.
• Employees in the middle range (34 to 45 years) opt for the holiday package more than older and
younger employees.
Education
• The variable 'educ', the number of years of formal education, shows a similar pattern. This
means education is likely not a variable that influences whether employees opt for the holiday
package.
• Employees with fewer years of formal education (1 to 7 years) and those with the most
education opt for the holiday package less often than employees with 8 to 12 years of formal
education.
• Across education, employees with more years of formal education have a lower tendency to opt
for the holiday package than employees with fewer years of formal education.
No. of older children
• The distributions for opting and not opting for the holiday package look the same for
employees with older children; at this point, this might not be a good predictor for model
building.
• For employees with older children, it is hard to differentiate between the two classes of the
dependent variable; employees who opt for the package and those who do not show little
difference.
• This is not a good variable for model building.
Recommendations:
• The firm should concentrate its efforts on foreigners in order to increase sales of holiday
packages, as this is where the majority of conversions will occur.
• The firm might target its marketing efforts or offers at foreigners in order to increase the
number of people who choose holiday packages.
• Focus on the foreign variable for good prediction while building the classification model.
• To improve the likelihood of lower-wage employees opting for a holiday package, the firm might
provide them with certain incentives or discounts.
• The company should not target employees with younger children, as these employees are more
likely not to opt for the holiday package.
• Employees with older children who do not opt for the holiday package might be targeted with
marketing strategies. The organisation can conduct a deep dive or a survey to determine why the
remaining employees are not taking advantage of the holiday package, and may then come up with
suggestions or offers to convert them.
• The employer can share references from workers with older children who have chosen the package
with those who have not, in order to persuade them to opt for it.