
HR DATA CAPSTONE

PROJECT NOTES 2
DSBA

Submitted by

PGPDSBA Online June_C 2021


JUNE 19TH, 2022
Table of Contents

1) Model building and interpretation
   a. Build various models (You can choose to build models for either or all of descriptive, predictive or prescriptive purposes)
   b. Test your predictive model against the test set using various appropriate performance metrics
   c. Interpretation of the model(s)

2) Model Tuning and business implication
   a. Ensemble modelling, wherever applicable
   b. Any other model tuning measures (if applicable)
   c. Interpretation of the most optimum model and its implication on the business
List of Tables

Table 1: Coefficients of independent variables
Table 2: Value of intercept
Table 3: Stats model parameters
Table 4: lm1 summary
Table 5: Coefficients of independent variables after scaling
Table 6: Value of intercept after scaling
Table 7: VIF values after scaling
Table 8: VIF values after scaling and removing columns with VIF above 5
Table 9: Comparison of R square and RMSE of all the models
Table 10: RMSE and train/test scores for all models
Table 11: Best parameters for Decision Tree
Table 12: Best parameters for Random Forest
Table 13: Best parameters for Neural Network
Table 14: Comparison of RMSE, train and test scores, and MAPE for all models

List of Figures

Figure 1: Terminology used in Decision Tree


MODEL BUILDING AND INTERPRETATION

1) Build various models (You can choose to build models for either or all of descriptive,
predictive or prescriptive purposes)

Solution:

In this problem I am going to use the following models:

1) Linear Regression (sklearn)

2) Linear Regression using statsmodels

3) Linear Regression using Z-score scaled data

4) Decision Tree Regressor

5) Random Forest Regressor

6) Artificial Neural Network (ANN) Regressor

The data needs to be scaled only for the ANN regressor: a neural network computes weighted sums of its inputs, so features on very different scales can dominate the training unless the data is standardized.

 Linear Regression:

 After the data is split into train and test sets, a linear regression model is built.

 Scaling is advisable for this problem before fitting a linear regression, but I have fit the model on both the unscaled and the scaled data, to compare the results and to show how the scaled data gives more interpretable values than the unscaled data.

 Scaling is also useful when checking multicollinearity in the data: without scaling, the VIF (variance inflation factor) values come out very high, which indicates the presence of multicollinearity.

 These VIF values are calculated after building the linear regression model, to understand the multicollinearity in it.

 Scaling had no impact on the model score (R squared); the coefficients and intercept are simply re-expressed in the scaled units.

 The coefficients of the independent variables are as below:

Table 1: Coefficients of independent variables
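As an illustration, a minimal sketch of how this model might be built with sklearn; the dataframe name df, the target column Expected_CTC, and the 70/30 split are assumptions for the sketch, not taken from the original notebook:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # Assumed: df is the cleaned, encoded HR dataframe with target column 'Expected_CTC'
    X = df.drop(columns=["Expected_CTC"])
    y = df["Expected_CTC"]

    # Assumed 70/30 split; the split ratio used in the original notebook is not stated here
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

    lm = LinearRegression().fit(X_train, y_train)

    # Coefficient table (Table 1) and intercept (Table 2)
    print(pd.DataFrame({"feature": X.columns, "coefficient": lm.coef_}))
    print("Intercept:", lm.intercept_)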

Intercept:
The intercept (often labeled the constant) is the point where the regression function crosses the y-axis. In some analyses, the regression model only becomes significant when we remove the intercept, and the regression line reduces to Y = bX + error.

Table 2: Value of intercept

PERFORMANCE METRICS OF LINEAR REGRESSION MODEL

To understand the performance of a regression model, model evaluation is necessary. Some of the evaluation metrics used for regression analysis are:

1. R squared or Coefficient of Determination: The most commonly used metric for model evaluation in regression analysis is R squared. It is defined as

R² = 1 − (SS_RES / SS_TOT)

where SS_RES is the residual sum of squares and SS_TOT is the total sum of squares. The value of R squared lies between 0 and 1; the closer to 1, the better the model.

2. Adjusted R squared: This is an improvement on R squared. The drawback of R² is that as the number of features increases, R² also increases, which can give the illusion of a good model. Adjusted R² corrects this by penalizing features that do not genuinely improve the model, so it shows the real improvement. Adjusted R² is always lower than or equal to R².

3. Mean Squared Error (MSE): Another common metric is the mean squared error, which is the mean of the squared differences between actual and predicted values.

4. Root Mean Squared Error (RMSE): This is the square root of MSE. Like MSE it penalizes large errors heavily, but because it is in the same units as the target it is easier to interpret.
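A minimal sketch of how these metrics might be computed for the fitted model, reusing the lm, X_train/X_test and y_train/y_test names assumed in the earlier sketch:

    import numpy as np
    from sklearn.metrics import r2_score, mean_squared_error

    for name, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
        pred = lm.predict(X_)
        print(f"{name}: R^2 = {r2_score(y_, pred):.4f}, "
              f"RMSE = {np.sqrt(mean_squared_error(y_, pred)):.2f}")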

R square on Train data:

R square for the training data is 0.9798427651045714
RMSE for the training data is 168128.02035495275

Conclusion:
 98% of the variation in the expected CTC is explained by the predictors in the model for the train data set.
 Root Mean Square Error (RMSE) is 168128.02 for the train dataset.

R square on Test data:

R square for the testing data is 0.9807850643373318
RMSE for the testing data is 166028.1393216096

Conclusion:
 98% of the variation in the expected CTC is explained by the predictors in the model for the test data set.
 Root Mean Square Error (RMSE) is 166028.13 for the test dataset.

Inference: The model performed really well on both train and test data, with the R square value being about 98%.

Linear Regression model using Stats Model:

We will use the statsmodels.formula.api package to build the stats model.

We now formulate an expression in which the dependent variable is a function of all the independent variables:

Expected_CTC ~ Total_Experience + Total_Experience_in_field_applied + Department + Role + Industry + Organization + Designation + Education + Graduation_Specialization + Curent_Location + Preferred_location + Current_CTC + Inhand_Offer + Last_Appraisal_Rating + Number_of_Publications + Certifications + International_degree_any

We will build a linear model, namely lm1, and compute the values of all the coefficients as follows:
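A minimal sketch of how lm1 might be fit; the dataframe name train_df is an assumption, and the column names follow the expression above:

    import statsmodels.formula.api as smf

    formula = ("Expected_CTC ~ Total_Experience + Total_Experience_in_field_applied"
               " + Department + Role + Industry + Organization + Designation"
               " + Education + Graduation_Specialization + Curent_Location"
               " + Preferred_location + Current_CTC + Inhand_Offer"
               " + Last_Appraisal_Rating + Number_of_Publications"
               " + Certifications + International_degree_any")

    # Assumed: train_df holds the (encoded) training data
    lm1 = smf.ols(formula=formula, data=train_df).fit()
    print(lm1.params)     # coefficients and intercept (Table 3)
    print(lm1.summary())  # full summary including R squared (Table 4)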

Table 3: Stats model parameters

By comparing the intercept and coefficient values from the sklearn model with those in Table 3, we can see that the coefficients and intercept of the stats model are the same.

Summary of the lm1 model:

Let us now see the summary of lm1 to check the value of R square.

Table 4: lm1 summary

Inference: The overall P value is less than alpha, so we reject H0 and accept Ha: at least one regression coefficient is not 0. Here, all the regression coefficients are non-zero. Also, the R square value is 98%, as was seen from the previous model as well, which indicates that this is a fairly good model for our predictions.

LINEAR REGRESSION MODEL AFTER SCALING USING Z SCORE

Since the variables are in different units of measurement, we scale our train and test datasets using the z score from the scipy.stats package and fit these scaled datasets to our model.
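A minimal sketch of the scaling step, assuming the train/test objects from the earlier sketch; scipy.stats.zscore standardizes each column to zero mean and unit variance:

    from scipy.stats import zscore
    from sklearn.linear_model import LinearRegression

    # Standardize each column to zero mean and unit variance,
    # per dataset, mirroring the approach described above
    X_train_scaled = X_train.apply(zscore)
    X_test_scaled = X_test.apply(zscore)
    y_train_scaled = zscore(y_train)
    y_test_scaled = zscore(y_test)

    lm_scaled = LinearRegression().fit(X_train_scaled, y_train_scaled)
    print("Train R^2:", lm_scaled.score(X_train_scaled, y_train_scaled))
    print("Test R^2:", lm_scaled.score(X_test_scaled, y_test_scaled))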
The coefficients of the independent variables after scaling are as below:

Table 5: Coefficients of independent variables after scaling

Intercept:

Table 6: Value of intercept after scaling

R square and RMSE on Train data:

R square for the training data is 0.9797667886197281
RMSE for the training data is 0.14197617721092737

Conclusion:
 98% of the variation in the expected CTC is explained by the predictors in the model for the train data set.
 Root Mean Square Error (RMSE) is 0.1419 for the train dataset.

R square and RMSE on Test data:

R square for the testing data is 0.9807842181116928
RMSE for the testing data is 0.13835446487966233

Conclusion:
 98% of the variation in the expected CTC is explained by the predictors in the model for the test data set.
 Root Mean Square Error (RMSE) is 0.1383 for the test dataset.

Inference: The model performed really well on both train and test data, with the R square value being about 98%. Note that the RMSE is now in standard-deviation units because the target was scaled.

Variance Inflation Factor (VIF) Values

Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. For each predictor, the VIF is the factor by which the variance of its coefficient is inflated by collinearity: VIF_i = 1 / (1 − R_i²), where R_i² is the R squared from regressing the i-th predictor on all the other predictors. A common rule of thumb is that a VIF above 5 indicates problematic multicollinearity.
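A minimal sketch of how the VIF values might be computed with statsmodels, assuming the scaled training features from the earlier sketch:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    # add_constant so each VIF is computed for a model that includes an intercept
    X_vif = add_constant(X_train_scaled)
    vif = pd.Series(
        [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
        index=X_vif.columns,
    )
    print(vif.drop("const").sort_values(ascending=False))  # Table 7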

Table 7: VIF Values after scaling

From Table 7 we can see that even after scaling there are VIF values above 5, so we will delete the columns whose VIF is above 5.

Table 8: VIF values after scaling and removing columns with VIF above 5

However, Total_Experience and Current_CTC will not be deleted, because total experience is required to predict the expected CTC, and current CTC is directly related to total experience.
Comparison of results:

Metric      Linear Regression (sklearn)      Linear Regression after scaling
            Train          Test              Train       Test
R Square    0.979          0.980             0.979       0.980
RMSE        168128.02      166028.13         0.1419      0.1383

Table 9: Comparison of R square and RMSE of all the models

From the above table we can see that the R square is the same for both models, and the RMSE differs between the sklearn model and the z score model only because the latter's values are scaled. Either model can be used; for easier interpretation of the coefficients we can go with the linear model after scaling.

Decision Tree:

A Decision Tree is a supervised learning algorithm which can be used for both classification and regression problems.

As the name suggests, it uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. It starts with a root node and ends with a decision made by the leaves.

Terminologies used in Decision Tree:


Figure 1: Terminology used in Decision Tree

Root Node: The node which holds all the observations of the training sample is called the root node.

Decision Node: The nodes obtained after splitting the root node are called decision nodes.

Terminal Node: The nodes where further splitting is not possible are called leaf nodes or terminal nodes.

Random Forest:

Random Forest is an ensemble model made of many decision trees, using bootstrapped samples, random subsets of features, and averaging of predictions. It is an example of a bagging ensemble.

Neural Network:

An Artificial Neural Network, often just called a neural network, is a mathematical model inspired by biological neural networks. A neural network consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation.
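A minimal sketch of how these three regressors might be fit and scored alongside the linear model; the hyperparameter values shown are illustrative assumptions, not the ones used originally:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_squared_error

    models = {
        "Decision Tree": (DecisionTreeRegressor(random_state=1),
                          X_train, X_test, y_train, y_test),
        "Random Forest": (RandomForestRegressor(random_state=1),
                          X_train, X_test, y_train, y_test),
        # The ANN gets the scaled data, as noted earlier
        "ANN": (MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000, random_state=1),
                X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled),
    }
    for name, (model, Xtr, Xte, ytr, yte) in models.items():
        model.fit(Xtr, ytr)
        rmse = np.sqrt(mean_squared_error(yte, model.predict(Xte)))
        print(name, "train R^2:", round(model.score(Xtr, ytr), 4),
              "test R^2:", round(model.score(Xte, yte), 4),
              "test RMSE:", round(rmse, 4))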

Note: Since this is a regression problem, there is no need to calculate a confusion matrix, classification report, AUC or ROC curve.

b. Test your predictive model against the test set using various appropriate
performance metrics

Solution:

Table 10, below, shows the RMSE and the train and test scores for all models: Linear Regression, Decision Tree, Random Forest and Artificial Neural Network (ANN).

Table 10: RMSE and train/test scores for all models

c. Interpretation of the model(s)

Solution:
From Table 10, we can clearly see that:
 The RMSE for train and test differs noticeably for the Decision Tree and Random Forest.
 The RMSE for train and test is almost the same for Linear Regression and the ANN regressor.
 From the train and test scores of the Decision Tree and Random Forest, we can clearly see that these models are overfitted.
 From the train and test scores of Linear Regression and the ANN regressor, we can clearly see that these models are not overfitted.
 From Table 10, we can say that we may choose either Linear Regression or the ANN regressor.
 Before selecting the final model, we will tune all the models using GridSearchCV.

MODEL TUNING AND BUSINESS IMPLICATION

a) Ensemble modelling, wherever applicable

Solution:
Ensemble learning is a general meta-approach to machine learning that seeks better predictive performance by combining the predictions from multiple models.

Although there is a seemingly unlimited number of ensembles that can be developed for a predictive modeling problem, three methods dominate the field of ensemble learning; so much so that, rather than algorithms per se, each is a field of study that has spawned many more specialized methods.

The three main classes of ensemble learning methods are bagging, stacking and boosting, and it is important both to have a detailed understanding of each method and to consider them on your predictive modeling project. In this problem, bagging is applied through the Random Forest regressor.

b. Any other model tuning measures (if applicable)


Solution:
Building a CART (Decision Tree) Regressor:

The two important parameters in building a decision tree are max_depth and min_samples_split. The value of max_depth should be within 10-15, and for min_samples_split we take 2-3% of the training set size.

Using GridSearchCV from sklearn, the best parameters are detected and used to build the CART model:

Table 11: Best parameters for Decision Tree

From the above table, we built the CART regressor using min_samples_leaf = 15.
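A minimal sketch of such a grid search; the candidate grids are illustrative assumptions, and Table 11 reports the best values actually found:

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    n = len(X_train)
    param_grid = {
        "max_depth": [10, 12, 15],                            # assumed candidates
        "min_samples_split": [int(0.02 * n), int(0.03 * n)],  # 2-3% of train size
        "min_samples_leaf": [10, 15, 30],                     # assumed candidates
    }
    dt_grid = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid, cv=3)
    dt_grid.fit(X_train, y_train)
    print(dt_grid.best_params_)        # Table 11
    best_dt = dt_grid.best_estimator_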

Building a Random Forest:

The three important parameters in building a Random Forest are max_depth, min_samples_split and max_features. The value of max_depth should be within 10-15; for min_samples_split we take 2-3% of the training set size; and for max_features we try roughly the square root of the number of independent variables and half the number of independent variables. For example, if the number of independent variables is 15, then the square root of 15 is about 4 and half of 15 is about 8.

Using GridSearchCV from sklearn, the best parameters are detected and used to build the Random Forest:

Table 12: Best parameters for Random Forest

From the above table, we built the Random Forest using max_depth = 10, max_features = 6 and min_samples_leaf = 3.
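The same grid-search pattern applies to the Random Forest; a sketch with assumed candidate values:

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestRegressor

    rf_grid = GridSearchCV(
        RandomForestRegressor(random_state=1),
        {"max_depth": [10, 12, 15],
         "max_features": [4, 6, 8],        # ~sqrt and ~half of the feature count
         "min_samples_leaf": [3, 10, 15]},
        cv=3,
    )
    rf_grid.fit(X_train, y_train)
    best_rf = rf_grid.best_estimator_      # Table 12 reports the best parameters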

Building a Neural Network:

When building a neural network, the three most important hyperparameters are the number of hidden-layer neurons, the tolerance, and the activation function.

A common rule of thumb is to take the number of hidden-layer neurons as (number of input variables + number of output variables) / 2; industry-standard tolerance values are 0.001 and 0.0001, and the activation function used is ReLU.

Using GridSearchCV from sklearn, the best parameters are detected and used to build the neural network:

Table 13: Best parameters for Neural Network

From the above table, we built the neural network using hidden_layer_sizes = 100 and tol = 0.001.
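A sketch of the corresponding search over sklearn's MLPRegressor, with assumed candidate values:

    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPRegressor

    ann_grid = GridSearchCV(
        MLPRegressor(activation="relu", max_iter=1000, random_state=1),
        {"hidden_layer_sizes": [(50,), (100,)],  # assumed candidates
         "tol": [0.001, 0.0001]},
        cv=3,
    )
    ann_grid.fit(X_train_scaled, y_train_scaled)  # scaled data for the ANN
    best_ann = ann_grid.best_estimator_           # Table 13 reports the best parameters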

Comparison of RMSE, train and test scores, and MAPE for Linear Regression, Decision Tree, Random Forest and Artificial Neural Network:

Table 14: Comparison of RMSE, train and test scores, and MAPE for all models

 From Table 14, it is clear that the Decision Tree regressor and Random Forest regressor remain overfitted even with the parameters found by GridSearchCV.
 The ANN regressor and Linear Regression have almost the same train and test scores.
 Looking at the train and test RMSE for both Linear Regression and the ANN regressor, I would choose Linear Regression.
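For reference, the MAPE (mean absolute percentage error) might be computed as follows; sklearn's mean_absolute_percentage_error returns a fraction, so it is multiplied by 100 here. The ANN is omitted because its predictions are in scaled units:

    from sklearn.metrics import mean_absolute_percentage_error

    for name, model in [("Linear Regression", lm),
                        ("Decision Tree", best_dt),
                        ("Random Forest", best_rf)]:
        mape = mean_absolute_percentage_error(y_test, model.predict(X_test)) * 100
        print(f"{name}: MAPE = {mape:.2f}%")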

c. Interpretation of the most optimum model and its implication on the business

Solution:

 Looking at the train and test RMSE for both Linear Regression and the ANN regressor in Table 14, I would choose Linear Regression.
 The closeness of the train and test values shows that the model is not overfitted.
 The model performs well on both the train and test datasets.
 The linear regression model predicts the target variable (expected CTC) with about 97% accuracy (R square), so it can serve as a reliable basis for estimating a fair expected CTC for applicants.