Predictive Modeling (MP) Project Report
Table of Contents
1 Problem 1 Statement
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.
1.1.1 Data Summary
1.1.2 Duplicated Data Summary
1.1.3 Descriptive Statistics
1.1.4 Sample Data
1.1.5 Univariate Analysis
1.1.5.1 Carat Column
1.1.5.2 Depth Column
1.1.5.3 Table Column
1.1.5.4 X Column
1.1.5.5 Y Column
1.1.5.6 Z Column
1.1.5.7 Price Column
1.1.5.8 Cut Column
1.1.5.9 Clarity Column
1.1.5.10 Color Column
1.1.6 Bivariate Analysis
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Check for the possibility of combining the sub-levels of an ordinal variable and take actions accordingly. Explain why you are combining these sub-levels with appropriate reasoning.
1.2.1 Null Values and Zero Values Check
1.2.2 Outlier Values Check and Imputation
1.2.3 Null Values and Zero Values Imputation
1.2.4 Sub-Level Merge for Ordinal Data
1.2.4.1 Color and Clarity
1.2.4.2 Cut
1.2.5 Data Multi-Collinearity Check – VIF
Predictive Modeling- Project
1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30). Apply Linear regression using scikit-learn. Perform checks for significant variables using the appropriate method from statsmodels. Create multiple models and check the performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one with appropriate reasoning.
1.3.1 Encoding of Categorical Data
1.3.2 Split of Data – Train and Test
1.3.3 Linear Regression Model
1.3.4 Linear Regression Model with Sklearn Library
1.3.4.1 Linear Reg. Model Step 1 – Data Split
1.3.4.2 Linear Reg. Model Step 2 – Model Build
1.3.4.3 Linear Reg. Model Step 3 – Checking Features’ Coefficients & Intercept
1.3.4.4 Linear Reg. Model Step 4 – R2 and RMSE (Prediction & Evaluation)
1.3.4.5 Model Status
1.3.5 Linear Regression Model with Statsmodels Library
1.3.5.1 Linear Reg. Model Step 1 – Data Split
1.3.5.2 Linear Reg. Model Step 2 – Null and Alternate Hypothesis
1.3.5.3 Linear Reg. Model Step 2 – Model Build
1.3.5.4 Linear Reg. Model Step 3 – Checking Coefficients & Hypothesis Check
1.3.5.5 Linear Reg. Model Step 4 – R2 and RMSE (Prediction & Evaluation)
1.3.5.6 Test Data
1.3.5.7 Linear Reg. Model Sample Computation
1.4 Inference: Based on these predictions, what are the business insights and recommendations? Please explain and summarize the various steps performed in this project. There should be proper business interpretation and actionable insights present.
1.4.1 Model Insights
1.4.2 Recommendations
2 Problem 2 Statement
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
2.1.1 Data Summary
2.1.2 Descriptive Statistics
List of Figures
Figure 1-1 Carat – Boxplot and Histogram
Figure 1-2 Depth – Boxplot and Histogram
Figure 1-3 Table – Boxplot and Histogram
Figure 1-4 X – Boxplot and Histogram
Figure 1-5 Y – Boxplot and Histogram
Figure 1-6 Z – Boxplot and Histogram
Figure 1-7 Price – Boxplot and Histogram
Figure 1-8 Cut – Count Plot
Figure 1-9 Average Price and Carat Weight-Cut Bar Plot
Figure 1-10 Cut – Clarity Distribution Plot
Figure 1-11 Cut – Color Distribution Plot
Figure 1-12 Clarity – Count Plot
Figure 1-13 Clarity – Average Price and Carat Weight Plot
Figure 1-14 Clarity – Color Distribution Plot
Figure 1-15 Color – Count Plot
Figure 1-16 Color – Average Price and Carat Weight Plot
Figure 1-17 Cubic Zirconia – Numerical Data Heat Map
Figure 1-18 Cubic Zirconia Pair Plot
Figure 1-19 Cubic Zirconia – Numerical Data Box Plot
Figure 1-20 Average Price and Carat Weight-Cut Bar Plot – Before Merge
Figure 1-21 Average Price and Carat Weight-Cut Bar Plot – After Merge
Figure 1-22 Simple Linear Regression
Figure 2-1 Age Data – Boxplot and Histogram
Figure 2-2 Education Data – Boxplot and Histogram
Figure 2-3 Salary Data – Boxplot and Histogram
Figure 2-4 Salary + Age Relationship
Figure 2-5 Salary + Education Relationship
Figure 2-6 Salary + Foreign Employee Relationship
Figure 2-7 Number of Young Children Data – Boxplot and Histogram
Figure 2-8 Number of Older Children Data – Boxplot and Histogram
Figure 2-9 Foreign Employees Count Plot
Figure 2-10 Holiday Package Opted Count Plot
Figure 2-11 Holiday Package Opted – Age Groups
Figure 2-12 Holiday Package – Children Status
Figure 2-13 Holiday Package – Average Salary & Foreign Employees
Figure 2-14 Travel Parameters Heat Map
Figure 2-15 Travel Data Parameters Pair Plot
Figure 2-16 Holiday Package – Numerical Data Box Plot
Figure 2-17 Confusion Matrix
List of Tables
Table 1-1 Data Dictionary for Cubic Zirconia Details
Table 1-2 Sample of Duplicated Cubic Zirconia Data
Table 1-3 Descriptive Statistics of P1 Data
Table 1-4 Sample Cubic Zirconia Data
Table 1-5 Cubic Zirconia – Stone Cut Distribution
Table 1-6 Highest Priced Cubic Zirconia Per Cut
Table 1-7 Largest Cubic Zirconia Per Cut
Table 1-8 Cubic Zirconia – Stone Clarity Distribution
Table 1-9 Highest Priced Cubic Zirconia Per Clarity
Table 1-10 Cubic Zirconia – Stone Color Distribution
Table 1-11 Highest Priced Cubic Zirconia Per Color
Table 1-12 Correlation Analysis of Dataset Parameters
Table 1-13 Cubic Zirconia Data with Null
Table 1-14 Cubic Zirconia Data with 0
Table 1-15 Cubic Zirconia Numerical Data – Outlier Details
Table 1-16 Descriptive Statistics Comparison After Outlier Imputation
Table 1-17 Descriptive Statistics Comparison After Null Imputation
Table 1-18 Color – Ordinal Data Merge
Table 1-19 Clarity – Ordinal Data Merge
Table 1-20 Cut – Ordinal Data Merge
Table 1-21 Good-Very Good Mean Data Comparison
Table 1-22 Very Good Mean Data – After Good-Very Good Merge
Table 1-23 VIF Values
Table 1-24 Categorical Values to Numerical Codes
Table 1-25 Lin Reg. – Computed Coefficients & Intercept for All Features
Table 1-26 Lin Reg. – Train Data R2, Adjusted R2 and RMSE
Table 1-27 Lin Reg. – Test Data R2, Adjusted R2 and RMSE
Table 1-28 Lin Reg. – Computed Coefficients & Intercept for All Features
Table 1-29 Lin Reg. – Train Data R2, Adjusted R2 and RMSE
Table 1-30 Lin Reg. – Test Data R2 and RMSE
Table 2-1 Data Dictionary for Tour and Travel Agency
Table 2-2 Descriptive Statistics of Holiday Package Data
Table 2-3 Sample Holiday Package Data
Table 2-4 Employee Age Bracket – Average Salary Data
Table 2-5 Employee Education Bracket – Average Salary Data
Table 2-6 Foreign & Local Employees Salary and Age Distribution
Table 2-7 Foreign & Local Employees Salary and Education Distribution
Table 2-8 Foreign Employees Data Distribution
Table 2-9 Holiday Package Opted – Data Distribution
Table 2-10 Holiday Package Numerical Data – Outlier Details
Table 2-11 Descriptive Statistics Comparison After Outlier Imputation
Table 2-12 Categorical Values to Numerical Codes
Table 2-13 Sample Classification Report
Table 2-14 Case 1 – Metric Importance
Table 2-15 Case 2 – Metric Importance
Table 2-16 Log. Reg. Training Data Classification Report
Table 2-17 Log. Reg. Test Data Classification Report
Table 2-18 LDA Training Data Classification Report
Table 2-19 LDA Test Data Classification Report
Table 2-20 All Model Metrics Comparison
List of Formulae
Formula 1-1 Simple Linear Regression
Formula 1-2 Linear Regression – Adjusted R2
Formula 1-3 Linear Regression – RMSE Computation
Formula 1-4 Linear Regression Model Price Computation
Formula 2-1 Confusion Matrix – Accuracy
Formula 2-2 Confusion Matrix – Precision
Formula 2-3 Confusion Matrix – Recall
Formula 2-4 Confusion Matrix – Specificity
Formula 2-5 Confusion Matrix – F1 Score
1 Problem 1 Statement
You are hired by Gem Stones Co. Ltd, a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia (an inexpensive diamond alternative with many of the same qualities as a diamond).
The company earns different profits on different price slots. You have to help the company predict the price of a stone on the basis of the details given in the dataset, so that it can distinguish between higher profitable stones and lower profitable stones and improve its profit share. Also, provide them with the best 5 attributes that are most important.
Table 1-1 Data Dictionary for Cubic Zirconia Details (* marks the target variable)
1.1 Read the data and do exploratory data analysis. Describe the data briefly.
(Check the null values, Data types, shape, EDA, duplicate values). Perform
Univariate and Bivariate Analysis.
1.1.1 Data Summary
The summary describes the data type and the number of data entries in each of the columns in
the dataset. The presence of null data and duplicated data is also noted.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26967 entries, 0 to 26966
Data columns (total 11 columns):
1. We observe that the min/max values of Carat, Depth, Table and Price are valid.
2. The X, Y and Z columns have a minimum value of 0, which may need checking with the jeweler, since a stone cannot have a zero dimension.
3. The largest cubic zirconia weighs 4.5 carats, significantly larger than the 75th percentile of carat weight (1.05). This indicates the presence of outliers, which need to be checked and corrected.
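These checks can be sketched in pandas. The DataFrame below is a made-up miniature that only borrows the report's column names; it is not the actual dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset -- column names follow the report,
# values are invented purely to demonstrate the checks.
df = pd.DataFrame({
    "carat": [0.3, 1.05, 4.5, 0.7],
    "depth": [61.0, 62.5, None, 61.8],   # one missing depth reading
    "x": [4.3, 6.5, 10.2, 0.0],          # a zero dimension is suspect
    "y": [4.3, 6.5, 10.2, 0.0],
    "z": [2.7, 4.0, 6.3, 0.0],
})

null_counts = df.isnull().sum()    # nulls per column
zero_counts = (df == 0).sum()      # zeros per column
dup_count = df.duplicated().sum()  # exact duplicate rows

print(null_counts["depth"], zero_counts["x"], dup_count)
```

The same three calls, run on the full 26,967-row dataset, produce the null, zero and duplicate counts discussed in this section.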
1.1.5.1 Carat Column
Skewness: 1.11
Outliers: Present only at the upper whisker end.
1.1.5.2 Depth Column
Distribution: Follows almost a pure normal distribution.
Skew: Data is fairly symmetrical and not skewed.
Skewness: -0.03
Outliers: Present on both the upper and lower whisker ends.
1.1.5.3 Table Column
Skewness: 0.77
Outliers: Present at both the upper and lower whisker ends.
1.1.5.4 X Column
Distribution: Does not follow a normal distribution; the data seems to follow a random distribution with multiple peaks.
Skew: The data is moderately skewed; there is no long tail on either side of the data.
Skewness: 0.39
Outliers: Present on both the upper and lower whisker ends.
1.1.5.5 Y Column
Distribution: Does not follow a normal distribution and data seems to follow a random
distribution with multiple peaks.
Skew: Highly skewed data.
Skewness: 3.87
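The skewness figures quoted throughout this section are the sample skewness that pandas' Series.skew() returns. A small sketch on invented, right-skewed values (a long upper tail, like the carat column):

```python
import pandas as pd

# Invented right-skewed sample: most values small, a few large ones
# stretch the upper tail, so the skewness comes out clearly positive.
s = pd.Series([0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 1.0, 2.0, 4.5])
print(round(s.skew(), 2))
```

A skewness near 0 indicates symmetry (as seen for Depth, -0.03), while values well above 1 (Carat 1.11, Y 3.87) signal a pronounced upper tail.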
The count plot and table clearly show that 40% of the cubic zirconia are of the Ideal cut, which is beneficial for both the vendor and the customer.
The mean price of each of the different cuts of the cubic zirconia has been plotted.
• Although Ideal is the best cut available and the largest count of cubic zirconia with the vendor are Ideal cut, the average price for the Ideal cut is the lowest. This can be explained by the fact that the mean carat size for Ideal cut cubic zirconia is the lowest.
• The Fair cut has the highest average price. As the average carat size of the Fair cut cubic zirconia is the highest, this explains the high average price.
• The average price and average carat weight for the Good and Very Good cuts are very close, indicating that they may be considered almost the same.
The highest priced stones for each cut have been tabulated. The price range is quite narrow: all the highest priced stones are greater than 2 carats, have good clarity and are of a high color grade.
The largest carat sizes seen have also been tabulated. It is very interesting to note that the largest stones are not the most expensive stones for each cut. This indicates that the other quality parameters, like color and clarity, also play a major role in deciding the price of the cubic zirconia.
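The per-cut averages above come from a standard groupby aggregation; a sketch with invented values (the report ran the same operation on the full dataset):

```python
import pandas as pd

# Invented toy data with the report's column names.
df = pd.DataFrame({
    "cut":   ["Ideal", "Ideal", "Fair", "Fair", "Good"],
    "carat": [0.5, 0.7, 1.2, 1.0, 0.9],
    "price": [1500, 2100, 4300, 3900, 3000],
})

# Mean price and carat per cut, cheapest cut first -- the toy numbers are
# chosen to mirror the report's pattern: Ideal cheapest, Fair dearest.
summary = df.groupby("cut")[["price", "carat"]].mean().sort_values("price")
print(summary)
```

Sorting by mean price makes the carat/price link visible at a glance: the ordering of average carat weight follows the ordering of average price.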
Inferences from the distribution of cubic zirconia clarity available for each cut:
• The Fair cut has the largest share of I1 stones, accounting for almost 11.4% of all Fair cut stones.
• 52% of the Fair cut cubic zirconia are SI1 or SI2.
• The Very Good and Good cuts have a high number of SI1 clarity stones compared to all other clarity types.
• The Premium and Ideal cuts have a large number of SI1 and VS2 cubic zirconia.
• The Ideal cut has the largest number of IF clarity stones compared to all other cuts.
Inferences from the distribution of cubic zirconia color across each cut:
• The distribution for all cuts across the cubic zirconia color spectrum is relatively similar.
• In each cut type, the lowest number of cubic zirconia is attributed to the J color (the lowest grade on the colorless spectrum).
• All cuts have a similar, moderate share of D colored cubic zirconia.
• The E, F, G and H colors are the most common.
• 24.3% of the cubic zirconia have SI1 clarity, closely followed by VS2 at 22.6%.
• Only 3.3% of all the stones have IF clarity.
• The lowest percentage of stones have I1 clarity.
The highest priced stones for each clarity have been tabulated. Clarity is seen to have a strong impact on the price: the carat value of the highest priced IF clarity cubic zirconia (1.5 carats) is significantly smaller than the carat value of an equally priced I1 clarity (4.5 carats) or SI2 clarity (2.07 carats) stone.
A side by side plot of the average price and average carat size for all the cubic zirconia clarity grades is displayed below.
• Carat size and price appear to be directly related, with both plots following a similar distribution.
• The only exception is I1 clarity, where the average price is low compared to the average carat weight.
• Stones labelled J (lowest in colorless spectrum) make up only 5.3% of all the stones.
• Most of the cubic zirconia (56.6%) are classified into the mid color spectrum (E-F-G)
The highest priced stones for each color have been tabulated. Color is seen to have a strong impact on the price: the carat value of the highest priced D color (top of the spectrum) cubic zirconia (2.14 carats) is significantly smaller than the carat value of an equally priced J color stone (3.51 carats).
As with the analysis seen for clarity, the average price for each cubic zirconia color grade is directly dependent on its carat size, as evidenced by the bar plot below.
The correlations between all the numerical columns, together with a brief analysis of each pair and the correlation value, have been tabulated.
Col1 – Col2 (correlation value): analysis of the pair
• Carat – Depth (0.04): Very low degree of positive correlation; pair plot shows no distinguishable pattern.
• Carat – Table (0.18): Very low degree of positive correlation; pair plot shows no distinguishable pattern.
• Carat – X (0.98): High positive correlation; the pair plot shows that as X increases, so does carat weight; some outliers can be seen in the pair plot.
• Carat – Y (0.94): High positive correlation; the pair plot shows that as Y increases, so does carat weight; some outliers can be seen in the pair plot.
• Carat – Z (0.94): High positive correlation; the pair plot shows that as Z increases, so does carat weight; some outliers can be seen in the pair plot.
• Carat – Price (0.92): High positive correlation; the pair plot shows that as carat weight increases, so does the price; some outliers can be seen in the pair plot.
• Depth – Table (-0.3): Low degree of negative correlation; pair plot shows no distinguishable pattern; some outliers can be seen in the pair plot.
• Depth – X (-0.02): Very low degree of negative correlation; pair plot shows no distinguishable pattern; some outliers can be seen in the pair plot.
• Depth – Y (-0.02): Very low degree of negative correlation; pair plot shows no distinguishable pattern; some outliers can be seen in the pair plot.
• Depth – Z (0.1): Very low degree of positive correlation; pair plot shows no distinguishable pattern.
• Depth – Price (-0.0): No correlation value captured.
• Table – X (0.2): Very low degree of positive correlation; pair plot shows no distinguishable pattern; some outliers can be seen in the pair plot.
• Table – Y (0.18): Very low degree of positive correlation; pair plot shows no distinguishable pattern; some outliers can be seen in the pair plot.
• Table – Z (0.15): Very low degree of positive correlation; pair plot shows no distinguishable pattern; some outliers can be seen in the pair plot.
• Table – Price (0.13): Very low degree of positive correlation; pair plot shows no distinguishable pattern.
• X – Y (0.96): High positive correlation.
The heat map indicates that there is a degree of multi-collinearity between the numerical
variables.
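A sketch of how such a correlation matrix (and its heat map) is typically produced with pandas. The data here is synthetic, chosen so that carat tracks x closely while depth does not, mirroring the pattern above:

```python
import pandas as pd

# Synthetic data: carat and x rise together, depth varies independently.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.9, 1.2, 2.0],
    "x":     [4.3, 5.1, 6.2, 6.8, 8.1],
    "depth": [61.0, 62.5, 61.8, 60.9, 62.2],
})

corr = df.corr()  # pairwise Pearson correlation of all numerical columns
print(corr.round(2))
# The heat map is this matrix rendered graphically, e.g.:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```

High off-diagonal values (carat vs x here) are the visual signal of the multicollinearity that the VIF check in section 1.2.5 later quantifies.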
1.2 Impute null values if present; also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Check for the possibility of combining the sub-levels of an ordinal variable and take actions accordingly. Explain why you are combining these sub-levels with appropriate reasoning.
1.2.1 Null Values and Zero Values Check
1. The Depth column has 697 null values.
The linear regression model is highly sensitive to outliers. In order to avoid the outliers influencing
the model, we need to treat them. There are different ways to treat outliers:
1. Trimming Data: The rows of data which are holding the outliers shall be removed from the
dataset. Although this method is direct and simple, there will be loss of information which
may result in issues in the created model.
2. Imputation of Data: The outliers are replaced with acceptable values. This includes capping
the data or replacing the outliers with the median value. Imputation ensures that data is not
lost in the process of outlier treatment.
In this case, we shall impute the data with the capped values. Once the imputation is complete,
we check to see if all the outliers have been treated.
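A minimal sketch of the capping approach, assuming the usual boxplot whisker bounds of Q1 - 1.5*IQR and Q3 + 1.5*IQR (the values below are invented; the one extreme reading mimics the 58.9 seen in the Y column):

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Cap values outside the boxplot whiskers (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Invented y readings: one value far above the upper whisker.
y = pd.Series([4.7, 5.7, 6.5, 5.9, 58.9])
capped = cap_outliers(y)
print(capped.max())  # the extreme value is pulled down to the whisker bound
```

Unlike trimming, this keeps all rows: only the extreme values are pulled in to the whisker bounds, which is why the row count is unchanged in the before/after statistics below.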
We further check the descriptive statistics of the columns after the imputation to see if there are
any major changes.
Carat
Before & After mean std min 25% 50% 75% max
Before 0.79 0.47 0.2 0.4 0.7 1.05 4.5
After 0.79 0.46 0.2 0.4 0.7 1.05 2.025
Depth
Before & After mean std min 25% 50% 75% max
Before 61.74 1.41 50.8 61 61.8 62.5 73.6
After 61.74 1.25 58.75 61 61.8 62.5 64.75
Table
Before & After mean std min 25% 50% 75% max
Before 57.45 2.23 49 56 57 59 79
After 57.43 2.15 51.5 56 57 59 63.5
X
Before & After mean std min 25% 50% 75% max
Before 5.72 1.12 0 4.71 5.69 6.55 10.23
After 5.72 1.12 1.95 4.71 5.69 6.55 9.31
Y
Before & After mean std min 25% 50% 75% max
Before 5.73 1.16 0 4.71 5.7 6.54 58.9
After 5.73 1.11 1.965 4.71 5.7 6.54 9.285
Z
Before & After mean std min 25% 50% 75% max
Before 3.53 0.71 0 2.9 3.52 4.04 31.8
After 3.53 0.69 1.19 2.9 3.52 4.04 5.75
Price
Before & After mean std min 25% 50% 75% max
Before 3937.52 4022.55 326 945 2375 5356 18818
After 3735.83 3468.20 326 945 2375 5356 11972.5
Table 1-16 Descriptive Statistics Comparison After Outlier Imputation
1. All columns have their maximum value reduced, some of them (X, Y, Z, Price) by a significant amount.
2. The mean value of the Price column changes noticeably.
3. As X, Y and Z have their minimum values imputed, there are no longer any zero values present in the dataset.
Depth
Before & After mean std min 25% 50% 75% max
Before 61.74 1.41 50.8 61 61.8 62.5 73.6
After 61.74 1.24 58.75 61.1 61.8 62.5 64.75
Table 1-17 Descriptive Statistics Comparison After Null Imputation
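The mechanics of the null fill can be sketched as follows. A median fill is shown here as one common, outlier-robust choice (an assumption for illustration; the values are invented and the real column had 697 nulls):

```python
import pandas as pd

# Invented depth readings with gaps, standing in for the Depth column.
depth = pd.Series([61.0, 61.8, None, 62.5, None, 61.1])

median = depth.median()              # computed ignoring the NaNs
depth_filled = depth.fillna(median)  # no nulls remain afterwards

print(median, depth_filled.isnull().sum())
```

Because the median is insensitive to extreme values, filling with it barely moves the column's central statistics, consistent with the near-identical before/after means in Table 1-17.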
Clarity (Best to Worst Quality): IF, VVS1, …
Significance: 1. The clarity grade for a stone is set based on its appearance and will be documented in its certificate
1.2.4.2 Cut
Cut (Best to Worst Quality): Ideal, Premium, Very Good, Good, Fair
Significance:
1. From the analysis in 1.1.5.8 Cut Column, it is clearly seen that the price and carat distributions for the Good and Very Good cuts are very close.
2. The gradings of these two cuts are also adjacent.
Conclusion: Good and Very Good can be merged into the single value Very Good, because there are more stones classified as Very Good than as Good.
Table 1-20 Cut – Ordinal Data Merge
Before proceeding with the merge, a comparative analysis of the mean values of all the numerical columns for Good and Very Good is given below. The two grades are very close on every column, and so may be classified interchangeably.
Once the Good values have been replaced with Very Good (merging the two ordinal values), the mean carat and price values have been plotted. Below are the before and after plots and the inferences that can be drawn from them.
1. There is no significant difference in the mean carat weight after the merge.
2. There is a slight reduction in the mean price of the Very Good cut post merge.
Figure 1-20 Average Price and Carat Weight-Cut Bar Plot – Before Merge
Figure 1-21 Average Price and Carat Weight-Cut Bar Plot – After Merge
The mean values generated after the merge is complete are as below.
Very Good
Price 3812.7373
Carat 0.8198
Depth 62.0104
Table 58.1582
X 5.7781
Y 5.8032
Z 3.5882
Table 1-22 Very Good Mean Data – After Good-Very Good Merge
Columns VIF
Carat 108.768
Depth 801.821
Table 643.029
X 9985.527
Y 9191.05
Z 1774.115
Table 1-23 VIF Values
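The VIF figures in Table 1-23 come from regressing each predictor on all the others. A dependency-light sketch of that definition is below; this version includes an intercept in each auxiliary regression, so its values may differ somewhat from the table if the table was computed without one.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """VIF_i = 1 / (1 - R^2_i), where R^2_i comes from regressing column i
    on the remaining columns (with an intercept) by least squares."""
    vifs = []
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)),
                             X.drop(columns=col).to_numpy(dtype=float)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return pd.DataFrame({'Columns': X.columns, 'VIF': vifs})

# Hypothetical usage on the numerical predictors:
# print(vif_table(df[['Carat', 'Depth', 'Table', 'X', 'Y', 'Z']]))
```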
1.3 Encode the data (having string values) for Modelling. Split the data into train
and test (70:30). Apply Linear regression using scikit learn. Perform checks for
significant variables using appropriate method from statsmodel. Create
multiple models and check the performance of Predictions on Train and Test
sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the
best one with appropriate reasoning.
1.3.1 Encoding of Categorical Data
Models are developed on numerical data. With this in mind, we convert the unique values in the categorical columns into numerical values, and this converted data is used in the modelling operations.
The cubic zirconia dataset has 3 categorical variables which are ordinal in nature. We convert
these columns to be numerical in nature before using it to build the prediction model.
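A minimal sketch of the ordinal encoding using pandas `map` is below. The grade orders follow the best-to-worst lists discussed earlier (with the Good/Very Good merge applied), but the exact integer codes shown here are illustrative assumptions, not the report's exact mapping.

```python
import pandas as pd

# Assumed grade orders (worst -> 1, best -> highest); illustrative codes only
cut_order = {'Fair': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
color_order = {'J': 1, 'I': 2, 'H': 3, 'G': 4, 'F': 5, 'E': 6, 'D': 7}
clarity_order = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4, 'VS1': 5,
                 'VVS2': 6, 'VVS1': 7, 'IF': 8}

def encode_ordinals(df: pd.DataFrame) -> pd.DataFrame:
    """Map each ordinal categorical column to its integer grade."""
    out = df.copy()
    out['Cut'] = out['Cut'].map(cut_order)
    out['Color'] = out['Color'].map(color_order)
    out['Clarity'] = out['Clarity'].map(clarity_order)
    return out
```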
• test_size: This is the size of the test set, expressed as a fraction between 0 and 1. The specified proportion of the data is collected into the test subset.
To give an example, the cubic zirconia dataset has a total of 26,933 rows. On specifying test_size=0.3, the resultant test subset has 8,080 data points (30% of 26,933) and the training subset holds the remaining 18,853 entries (70%).
• random_state: This input seeds the internal random number generator, which decides how to split the data into train and test subsets. It should be set to the same value if consistent splits are expected over multiple runs of the code.
The splitting of the data is done using the train_test_split function from the python module sklearn.
1. For the cubic zirconia data set, the split operation is performed with the inputs random_state=1, test_size=0.3
2. The train and test subsets of independent input variables have the following shapes:
Training subset = 18853 rows, 9 columns
Test subset = 8080 rows, 9 columns
3. The train and test subset target variable details as given by the train_test_split operation are as below:
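The split described above can be sketched with scikit-learn's train_test_split; the small frame here is an illustrative stand-in for the encoded cubic zirconia data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data; in the report X = df.drop(columns='Price') and y = df['Price']
X = pd.DataFrame({'Carat': range(10), 'Depth': range(10)})
y = pd.Series(range(10), name='Price')

# 70:30 split with a fixed random_state for reproducible results
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```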
Mathematically, a simple linear regression equation is written as below. In the case of multiple
regression, there will be multiple slopes, one for each of the independent variables.
y = b₀ + b₁x
b₀ = intercept
b₁ = slope
Formula 1-1 Simple Linear Regression
The model finds the best regression fit line by computing the best values for the slopes and intercept; scikit-learn's LinearRegression does this by ordinary least squares, i.e. by minimizing the sum of squared errors.
We test the model performance with two metrics: R2 and the Root Mean Squared Error (RMSE).
In statistics, the coefficient of determination, R2, is the proportion of the variation in the dependent variable that is predictable from the independent variable(s). Simply stated, it is a score which tells how well the regression model fits the observed data. For example, an R2 score of 75% indicates that around 75% of the variation in the data is explained by the generated regression model. Generally, a higher R2 indicates a better fit for the model.
Adjusted R2 is a modified version of R2 that has been adjusted for the number of predictors in the
model. This value increases when the new term improves the model more than would be
expected by chance. It decreases when a predictor improves the model by less than expected.
Adjusted R² = 1 − [(1 − R²)(N − 1)] / (N − p − 1)
N = number of records in data set
p = number of independent variables
Formula 1-2 Linear Regression – Adjusted R²
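The adjusted R² formula translates directly to code:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# e.g. with the train figures reported later (R^2 ~ 0.9309, N = 18853, p = 9):
# adjusted_r2(0.9309, 18853, 9)
```

With p = 0 the adjustment vanishes and the adjusted R² equals R²; each extra predictor shrinks it unless the fit improves enough to compensate.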
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).
Residuals are a measure of how far from the regression line data points are; RMSE is a measure
of how spread out these residuals are. In other words, it tells you how concentrated the data is
around the line of best fit.
RMSE = √((1/n) · Σⱼ₌₁ⁿ (yⱼ − ŷⱼ)²)
It is computed by taking the distances from the points to the regression line (these distances are
the “errors”) and squaring them. The squaring is necessary to remove any negative signs. Lower
values are better with 0 indicating a perfect model.
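The RMSE computation is a one-liner with numpy:

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([1, 2, 3], [1, 2, 5]))  # sqrt(4/3) ~ 1.1547
```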
• fit_intercept = True
This decides whether to compute the intercept for the model. It is generally recommended to keep this as True, and True is also the default value for this argument.
The constructed model is then used to fit the training dataset in order to complete the model
training operation.
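The construct-and-fit step reads as below; the training arrays are illustrative stand-ins for the encoded train subset, built with a known linear relationship so the recovered coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in training data with a known linear relationship
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 2))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + 5.0

model = LinearRegression(fit_intercept=True)  # True is also the default
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)  # recovers ~[3, -2] and ~5 here
```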
1.3.4.3 Linear Reg. Model Step3 – Checking Features’ Coefficients & Intercept
The generated model gives the importance of each of the features that will impact the output.
We see the importance table as below.
We can see that the Carat, X and Y features hold the most influence (largest absolute coefficients) as compared to the other features.
Columns Coefficients
Intercept -2370.69
Carat 8843.416
Y 1225.787
Clarity 441.6878
Color 277.711
Cut 110.5788
Table -17.6759
Depth -18.697
Z -306.232
X -1400.35
Table 1-25 Lin Reg.-Computed Coefficients & Intercept for All Features
1.3.4.4 Linear Reg. Model Step4 – R2 and RMSE (Prediction & Evaluation)
1.3.4.4.1 Train Data
R2 0.93090
Adjusted R2 0.93086
RMSE 910.98
Table 1-26 Lin Reg.- Train Data R2, Adjusted R2 and RMSE
RMSE 912.72
Table 1-27 Lin Reg.- Test Data R2, Adjusted R2 and RMSE
• formula = stringVal
This is a string value which specifies the model. In the case of the cubic zirconia data, below is the formula string passed.
stringVal = 'Price ~ Carat + Depth + Table + X + Y + Z + Cut + Color + Clarity'
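The statsmodels fit with this formula string can be sketched as below; the frame here is a small synthetic stand-in for the encoded training data, and statsmodels is assumed to be installed.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in frame with the report's column names
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(50, 9)),
                  columns=['Carat', 'Depth', 'Table', 'X', 'Y', 'Z',
                           'Cut', 'Color', 'Clarity'])
df['Price'] = 2.0 * df['Carat'] + rng.normal(scale=0.1, size=50)

stringVal = 'Price ~ Carat + Depth + Table + X + Y + Z + Cut + Color + Clarity'
model = smf.ols(formula=stringVal, data=df).fit()
print(model.summary())  # coefficients, P>|t| values, condition number, ...
```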
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Table 1-28 Lin Reg.-Computed Coefficients & Intercept for All Features
1. The coefficients and intercept computed by the statsmodels-generated linear regression model are similar to those generated by sklearn (Table 1-25 Lin Reg.-Computed Coefficients & Intercept for All Features).
2. We check the P>|t| column in the above table to decide the hypothesis outcome. As all p-values are less than α=0.05, we reject H0 for each coefficient and accept Ha, i.e. every regression coefficient is significantly different from 0.
1.3.5.5 Linear Reg. Model Step4 – R2 and RMSE (Prediction & Evaluation)
1.3.5.5.1 Train Data
R2 0.93090
Adjusted R2 0.93086
RMSE 910.98
Table 1-29 Lin Reg.- Train Data R2, Adjusted R2 and RMSE
1.4 Inference: Basis on these predictions, what are the business insights and
recommendations. Please explain and summarize the various steps performed
in this project. There should be proper business interpretation and actionable
insights present.
In conclusion to the EDA for the cubic zirconia data and the price prediction model development,
we have developed the following business insights.
2. As RMSE is a scale dependent score (it is on the same scale as the target variable, here it is
price), we see that the model indicates that there is an error of ~912 in predicting the price
of a cubic zirconia.
1.4.2 Recommendations
1. The vendor can try to focus on VVS1, VVS2 and IF stones, both in quantity and in availability across a range of carat sizes. This will allow him to market to niche customers who value quality over other parameters.
2. The vendor should chart out the relevance of the four Cs – carat, clarity, color and cut – when selecting a stone for the benefit of his customers, as some of them may be unaware.
3. In order to best serve his customers, the vendor should first get the budget input from them. The next important piece of information is to understand what is most important for the customer. Some may be looking for a quality stone, whereas others may place a higher preference on a mid-quality stone of a larger size.
Trying to understand and provide the finest stone to the customer as per his requirements is a process that should be streamlined and followed by the salespersons.
4. Another important aspect is where the stone is to be used. If a stone is to be used in a solitaire ring or earrings, then the quality of the stone is important along with the size. On the other hand, if the stones are to be used in a larger necklace, then mid-quality stones can be used. This information can be provided to the customer to help him make the best judgement.
2 Problem 2 Statement
You are hired by a tour and travel agency which deals in selling holiday packages. You are
provided details of 872 employees of a company. Among these employees, some opted for the
package and some didn't. You have to help the company in predicting whether an employee will
opt for the package or not on the basis of the information given in the data set. Also, find out the
important factors on the basis of which the company will focus on particular employees to sell
their packages.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate
Analysis. Do exploratory data analysis.
2.1.1 Data Summary
The summary describes the data type and the number of data entries in each of the columns in
the dataset. The presence of null data and duplicated data is also noted.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 872 entries, 0 to 871
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 872 non-null int64
1 Holiday_Package 872 non-null object
2 Salary 872 non-null int64
3 Age 872 non-null int64
4 Education 872 non-null int64
5 Num_Young_Children 872 non-null int64
6 Num_Older_Children 872 non-null int64
7 Foreign 872 non-null object
1. We observe that the min/max values of all the columns are valid.
2. The number of young children has its first, second and third quartile values all equal to 0
2.1.4.1 Age
Skew: Data has very low skew with no long tail on either end
Skewness: 0.15
Outliers: Age data has no outliers.
2.1.4.2 Education
Distribution: The data distribution shows that education is not normally distributed, with multiple peaks seen. The maximum number of employees have around 8-10 years of education.
Skew: Data is minimally skewed
Skewness: -0.05
Outliers: A few outliers are present in the data set with 3 outliers lying above the top (right)
whisker and 1 outlier lying below the bottom (left) whisker.
2.1.4.3 Salary
Distribution: The data distribution shows that salary is fairly normally distributed. There is a very
long tail on the right indicating data is skewed to the right. This is confirmed by the skewness
value.
Skew: Highly skewed to the right. This can be attributed to the large number of outliers, mainly above Q3 (the 75th percentile)
Skewness: 3.1
Outliers: A very high number of outliers are present in the data set. There are 56 outliers lying
above the top (right) whisker and 1 outlier lying below the bottom (left) whisker.
2.1.4.3.1 Salary + Age Relationship
Age brackets of employees have been used to check the number of employees in each bracket and the average salary earned by them.
Inferences:
1. The company employs people of all age brackets
2. Most employees are within the 25-45 age bracket
3. Highest salary is earned by the 45-55 age bracket employees
4. There is not much salary variation across the 25-55 age range, with only a ~6% increase seen from the [25-35] bracket to the [45-55] bracket
Inferences:
1. Most employees fall in the 8-12 years’ education bracket.
2. A single employee is educated above 20 years
3. There is an increase in salary seen as the education bracket increases, with highest salary
seen in the 16-20 years’ education bracket.
It is seen that consistently the foreign employees are underpaid across all age brackets. The %
difference between the average salary for local employees and foreign employees ranges from a
minimum of 13.3% to an extreme high of 32.8%.
Similar to the age brackets, the average salaries across the education brackets also show foreign
employees being underpaid with the exception being the 12-16 years’ education bracket.
2.1.4.6 Foreign
This column holds the number of foreign employees in the organization. We see that around 24.77% of the employees are foreign. Further analysis of foreign employees' salaries has been documented in section 2.1.4.3.3 Salary + Foreign Employee Relationship.
Figure 2-9 Foreign Employees Count Plot
The number of children also has a large influence on the opting of discounted holiday packages. The children status of employees who have taken the holiday packages is analyzed to see the influence of children.
It is seen that more foreign employees opt for holiday packages as compared to local employees. In addition, it can be noted that, in general, employees who opt for the offered holiday package have a lower average salary than employees who do not opt for one.
As summarized by the data of the heat map, the pair plot does not show any distinguishable
pattern between the numerical data columns of the dataset.
2.2 Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression
and LDA (linear discriminant analysis).
2.2.1 Outlier Check and Cleanup
All the numerical data columns except Age have outliers present; this has been visualized in the box plots.
The logistic regression and linear discriminant analysis models are sensitive to outliers. In order to avoid the outliers influencing the model, we treat them. There are different ways to treat outliers:
1. Trimming Data: The rows of data which are holding the outliers shall be removed from the
dataset. Although this method is direct and simple, there will be loss of information which
may result in issues in the created model.
2. Imputation of Data: The outliers are replaced with acceptable values. This includes capping
the data or replacing the outliers with the median value. Imputation ensures that data is not
lost in the process of outlier treatment.
In this case, we shall impute the data with the capped values. Once the imputation is complete,
we check to see if all the outliers have been treated.
Analysis of the outliers indicates that the outlier values in the education and number of older and
younger children columns are valid. So these columns’ outliers shall not be imputed.
After the imputation of the outliers in the salary column, we check the descriptive statistics to
identify the changes.
Salary
Before & After mean std min 25% 50% 75% max
Before 47729.17 23418.67 1322 35324 41903.5 53469.5 236961
After 45608.33 15699.74 8105.75 35324 41903.5 53469.5 80687
Table 2-11 Descriptive Statistics Comparison After Outlier Imputation
The maximum value has reduced and the minimum value has increased. There is some decrease in the mean value as well.
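A quick way to verify the treatment is to recount the points beyond the whiskers; a sketch, assuming the capped frame is available:

```python
import pandas as pd

def count_iqr_outliers(s: pd.Series) -> int:
    """Number of points beyond the 1.5*IQR whiskers of the series."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# Hypothetical check after capping the Salary column:
# assert count_iqr_outliers(df['Salary']) == 0
```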
• test_size: This is the size of the test set, expressed as a fraction between 0 and 1. The specified proportion of the data is collected into the test subset.
To give an example, the holiday package dataset has a total of 872 rows. On specifying test_size=0.3, the resultant test subset has 262 data points (30% of 872) and the training subset holds the remaining 610 entries (70%).
• random_state: This input seeds the internal random number generator, which decides how to split the data into train and test subsets. It should be set to the same value if consistent splits are expected over multiple runs of the code.
The splitting of the data is done using the train_test_split function from the python module sklearn.
• solver=’newton-cg’
This is the algorithm to use in the optimization problem.
• max_iter=10000
10,000 is the maximum number of iterations for the solvers to converge.
• penalty = none
No penalty is added to the model
• tol=0.0001
This is the tolerance value for the stopping criteria.
• verbose=True
Setting this to true allows the progress messages to be printed out
• random_state=1
This makes the model’s output replicable. The model will always produce the same results
when it has a definite value of random_state and if it has been given the same parameters
and the same training data.
The mean accuracy of the built model for the training and test data is as follows.
In this case, below is the parameter grid which is given as the input:
param_grid = {
    'penalty': ['l2', 'none', 'l1', 'elasticnet'],
    'solver': ['sag', 'lbfgs', 'saga', 'newton-cg', 'liblinear'],
    'tol': [0.001, 0.0001, 0.00001],
    'l1_ratio': [0.25, 0.5, 0.75],
    'max_iter': [100, 1000, 10000]
}
After the function execution is complete, we check the best selected parameters from this.
{'l1_ratio': 0.25,
'max_iter': 100,
'penalty': 'l2',
'solver': 'newton-cg',
'tol': 0.001}
With these values set, we recheck the score of the model.
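The grid search step can be sketched as below; a trimmed-down grid is used so the example runs quickly, and the data is an illustrative stand-in for the train subset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=120) > 0).astype(int)

# Smaller grid than the report's, for speed; extend with more solvers/penalties
param_grid = {
    'penalty': ['l2'],
    'solver': ['newton-cg', 'lbfgs'],
    'tol': [0.001, 0.0001],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)  # the best selected parameters
```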
• solver=’svd’
• tol=0.0001
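The LDA construction with these arguments can be sketched similarly, again on stand-in data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

lda = LinearDiscriminantAnalysis(solver='svd', tol=0.0001)
lda.fit(X_train, y_train)
print(lda.score(X_train, y_train))  # mean accuracy on the training data
```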
The mean accuracy of the built model for the training and test data is as follows.
2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score
for each model Final Model: Compare Both the models and write inference
which model is best/optimized.
2.3.1 Confusion Matrix
A confusion matrix is an N×N matrix used for evaluating the performance of a classification model, where N is the number of target classes. It compares the actual target values with those predicted by the built machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.
• A classification target variable (binary) has two possible values, Positive or Negative
• The columns represent the actual values and the rows represent the predicted values

                     Actual Positive  Actual Negative
Predicted Positive   TP               FP
Predicted Negative   FN               TN

Figure 2-17 Confusion Matrix
Using the confusion matrix, the metrics accuracy, precision, recall and specificity are derived.
2.3.1.1 Accuracy
Accuracy (ACC) is the number of all correct predictions divided by the total number of predictions made on the dataset. The best accuracy is 1.0, whereas the worst is 0.0
accuracy = (TP + TN) / (TP + TN + FP + FN)
Formula 2-1 Confusion Matrix – Accuracy
Accuracy is not the best metric to check, especially on an imbalanced dataset, where it can be misleading. In order to mitigate this, we use the additional metrics of precision and recall.
2.3.1.2 Precision
Precision (PREC) is calculated as the number of correct positive predictions divided by the total number of positive predictions. It tells us how many of the predicted positive cases are actually positive.
It is also called positive predictive value (PPV). The best precision is 1.0, whereas the worst is 0.0.
Precision is a useful metric in cases where False Positive is a higher concern than False Negatives
(e.g.: In e-commerce recommendations, wrong results could lead to customer churn).
precision = TP / (TP + FP)
Formula 2-2 Confusion Matrix - Precision
2.3.1.3 Recall/Sensitivity
Recall is calculated as the number of correct positive predictions divided by the total number of
positives i.e. the actual positive cases we were able to predict correctly with our model.
recall = TP / (TP + FN)
Formula 2-3 Confusion Matrix - Recall
It is also referred to as the true positive rate (TPR). The best recall is 1.0, whereas the worst is
0.0. Recall is a useful metric in cases where False Negatives is a higher concern than False
Positives (e.g.: In medical diagnosis raising a false alarm may be safer).
2.3.1.4 Specificity
Specificity is calculated as the number of correct negative predictions divided by the total number
of negatives.
specificity = TN / (TN + FP)
Formula 2-4 Confusion Matrix - Specificity
It is also referred to as the true negative rate (TNR). The best specificity is 1.0, whereas the worst is 0.0.
2.3.1.5 F1 Score
Recall and precision typically trade off against each other. The best way to capture both trends is to use a combination of the two, which gives us the F1-score metric. The F1 score is the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0.
F1 Score = 2 · (precision · recall) / (precision + recall) = TP / (TP + ½(FP + FN))
Formula 2-5 Confusion Matrix – F1 Score
The interpretability of the F1-score is poor on its own; using it in combination with other evaluation metrics gives us a complete picture of the result.
2.3.1.6 Classification Report
This report displays the precision, recall, F1, and support scores for the created model. It is
generated by the classification_report function of the sklearn library. A sample report is shown
below:
precision recall f1-score support
0 0.78 0.91 0.84 300
1 0.71 0.47 0.57 600
accuracy 0.77 900
macro avg 0.75 0.69 0.70 900
weighted avg 0.76 0.77 0.75 900
Table 2-13 Sample Classification Report
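These reports come from scikit-learn's metrics module; a minimal sketch with hypothetical label vectors is below. Note that sklearn's confusion_matrix places actual values on the rows and predictions on the columns, the transpose of Figure 2-17.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and model predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))    # 4 correct out of 6
print(classification_report(y_true, y_pred))
```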
2. We are looking to see if an employee opts for the available holiday package. Depending on what is more important to the agency, we check different metrics, because the two kinds of prediction failure are not equal; one will be costlier than the other.
Case 1 - Agency wants employees to opt for holiday packages
Worse - False Positive: Predict that a customer shall opt for the holiday package but he does not. Why: The agency may have tried to tailor packages for the employee to influence him into opting in for the holiday package. This effort will have cost the agency, and if the employee does not take advantage of it, agency resources will have been wasted.
Better - False Negative: Predict that a customer shall not opt for the holiday package but he does. Why: Although the agency has not tried to customize any holiday packages for this employee, he has still availed it.
Table 2-14 Case 1 - Metric Importance
Case 2 - Agency does not want employees to opt for holiday packages - This is a second possibility. Employees opting for holiday packages may be costing the agency due to the price subsidy.
Better - False Positive: Predict that a customer shall opt for the holiday package but he does not. Why: Agencies offer a price subsidy to employees for the holiday packages. Sometimes these subsidies result in the agency making a low profit or, in the worst case, a loss. In such cases, if the employee does not take the holiday package, the agency does not lose money.
Worse - False Negative: Predict that a customer shall not opt for the holiday package but he does. Why: The agency may not have budgeted for the subsidy it has to offer the employee. This may result in the agency losing profits due to the subsidy.
3. For our models, we are considering Case 1: the agency wants its employees to take advantage of its offered travel packages. Under this understanding, it is better to have a False Negative than a False Positive. As we need to reduce the FP count, it is better to focus on the value of precision when measuring the performance of the model.
5. The AUC is checked to see if the value is high. The curve shape is also checked to see if it extends up to the top-left corner
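The ROC/AUC check can be sketched as below with scikit-learn; the data and model here are illustrative stand-ins for the fitted classifiers and the test subset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)[:, 1]  # probability of the positive class
fpr, tpr, _ = roc_curve(y, probs)
auc = roc_auc_score(y, probs)
print(auc)  # a curve hugging the top-left corner gives AUC close to 1
```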
The metrics computed in the classification report using the above confusion matrix are as below.
precision recall f1-score support
0 0.67 0.77 0.72 326
1 0.68 0.56 0.62 284
accuracy 0.67 610
macro avg 0.68 0.67 0.67 610
weighted avg 0.67 0.67 0.67 610
Table 2-16 Log. Reg. Training Data Classification Report
Accuracy = 67%
Precision = 68%
Recall = 56%
F1-Score = 62%
AUC = 0.74
The metrics computed in the classification report using the above confusion matrix are as below.
precision recall f1-score support
0 0.67 0.70 0.69 145
1 0.61 0.57 0.59 117
accuracy 0.65 262
macro avg 0.64 0.64 0.64 262
weighted avg 0.64 0.65 0.64 262
Table 2-17 Log. Reg. Test Data Classification Report
Accuracy = 65%
Precision = 61%
Recall = 57%
F1-Score = 59%
AUC = 0.70
The metrics computed in the classification report using the above confusion matrix are as below.
precision recall f1-score support
0 0.67 0.78 0.72 326
1 0.69 0.56 0.61 284
The metrics computed in the classification report using the above confusion matrix are as below.
precision recall f1-score support
0 0.66 0.71 0.69 145
1 0.61 0.56 0.58 117
accuracy 0.64 262
macro avg 0.64 0.63 0.63 262
weighted avg 0.64 0.64 0.64 262
Table 2-19 Log. Reg. Test Data Classification Report
The AUC-ROC curves of the training and test data for all three developed models have been plotted on the same graph to show their relative performance. Both the plots, along with the AUC scores, show that the models have a similar level of performance.
2.4 Inference: Basis on these predictions, what are the insights and
recommendations. Please explain and summarize the various steps performed
in this project. There should be proper business interpretation and actionable
insights present.
Based on the EDA and model creation, we can have the below insights and recommendations.
1. Foreign employees are underpaid consistently across all age and education groups. The reasoning behind this should be analyzed.
The attrition rate among foreign employees should be compared with that of local employees. If the attrition of foreign employees is higher, better salaries may remedy that.
2. Younger employees are paid well. This is a good point to be made if the agency is looking
to hire fresh talent.
3. Employees with higher average salary do not opt for holiday packages. This may stem
from the non-availability of more luxury options. A simple survey can help bridge this gap.
The agency can then offer better holiday alternatives.
4. More foreign employees opt for holiday packages as compared to local employees. It may be that they are availing the packages to travel to their native countries.
Foreign employees can be offered no-frills travel packages (travel only, with no itinerary) to their native countries to encourage more foreign employees to opt for travel packages.
5. Employees with older children opt for holiday packages more than their colleagues who have only young children or those who have both young and older children. This may be due to the non-availability of quality child care facilities at the holiday destinations.
Packages tailored for employees with young children can help encourage them to avail the holiday packages.
6. The time of year when employees opt for holidays is an essential component. Using this data, better insight can be had into the travel plans of employees: is it during the holiday season, summer vacations or cultural events?
With this information, better travel options can be made available to the agency employees.