Capstone Project Notes 2
PROJECT NOTES 2
DSBA
Submitted by
1) Build various models (You can choose to build models for either or all of descriptive,
predictive or prescriptive purposes)
Solution:
The following models are built: 1) Linear Regression, 2) Decision Tree, 3) Random Forest, 4) Artificial Neural Network (ANN).
The data needs to be scaled for the ANN Regressor only, because an ANN is sensitive to outliers and to the scale of its inputs, and it works on the principle of weighted sums.
Linear Regression:
After the data is split into train and test sets, the linear regression model is built.
Scaling is required for this problem before running linear regression, but I have fitted the model on both the unscaled and the scaled data, to compare the results between the two and to show how the scaled data gives more reliable values than the unscaled data.
Scaling can be useful to reduce or check the multicollinearity in the data; when scaling is not applied, the VIF (variance inflation factor) values come out very high, which indicates the presence of multicollinearity. These values are calculated after building the linear regression model, in order to understand the multicollinearity in the model.
Scaling had no impact on the model score, the coefficients of the attributes, or the intercept.
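As an illustrative sketch of this step (the variable names X_train, X_test, y_train, y_test and the use of statsmodels for VIF are assumptions, not taken from the report):

```python
# Minimal sketch: fit linear regression on the unscaled train data and inspect VIF.
# X_train / y_train are assumed to be a pandas DataFrame / Series from the split.
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Train R^2  :", lr.score(X_train, y_train))
print("Intercept  :", lr.intercept_)
print("Coefficients:", dict(zip(X_train.columns, lr.coef_)))

# VIF per predictor; values above 5 indicate multicollinearity
vif = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print(vif.sort_values(ascending=False))
```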
Table 1: Coefficients of the independent variables
Intercept:
The intercept (often labeled as constant) is the point where the function crosses the y-axis.
In some analyses, the regression model only becomes significant when we remove the intercept, and the regression line reduces to Y = bX + error.
Once the model is built, evaluating its performance is necessary. Some of the evaluation metrics used for regression analysis are:
1. R-squared (R2): It measures how much of the variation in the target is explained by the model, i.e. the ratio of the explained variation to the total variation. The value of R squared lies between 0 and 1; the closer the value is to 1, the better the model fits the data. It is computed as R2 = 1 − (SSRES / SSTOT), where SSRES is the Residual Sum of Squares and SSTOT is the Total Sum of Squares.
2. Adjusted R-squared: A drawback of R2 is that as the features increase, the value of R2 also increases, which gives the illusion of a good model. Adjusted R2 solves this drawback of R2: it only rewards the features which are important for the model and so shows the real improvement of the model.
3. Mean Squared Error (MSE): Another common metric for evaluation is the mean squared error, which is the mean of the squared differences between the actual and predicted values.
4. Root Mean Squared Error (RMSE): It is the square root of the MSE, i.e. the root of the mean squared difference between the actual and predicted values. Because the errors are squared before averaging, RMSE penalizes large errors heavily, and unlike MSE it is expressed in the same units as the target.
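A small sketch of how these four metrics can be computed with sklearn, assuming the fitted model lr and the test split from the step above:

```python
# Sketch of the four metrics above, computed with sklearn on the test set.
# `lr`, X_test and y_test are assumed from the previous step.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = lr.predict(X_test)

r2 = r2_score(y_test, y_pred)                    # 1. R-squared
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)    # 2. Adjusted R-squared
mse = mean_squared_error(y_test, y_pred)         # 3. Mean Squared Error
rmse = np.sqrt(mse)                              # 4. Root Mean Squared Error

print(f"R^2={r2:.4f}  Adj R^2={adj_r2:.4f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
```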
Conclusion:
98% of the variation in the Expected CTC is explained by the predictors in the model for the train data set.
Root Mean Square Error (RMSE) is 168128.02 for the train dataset.
Conclusion:
98% of the variation in the Expected CTC is explained by the predictors in the model for the test data set.
Root Mean Square Error (RMSE) is 166028.13 for the test dataset.
Inference: The model performed well on both the train and test data, with the R square value being 98%.
Comparing the intercept of the sklearn model with Table 3, the coefficients and intercept of the statsmodels model are the same.
Let us now see the summary of lm1 to check the value of R square.
Inference: The overall P value is less than alpha, so we reject H0 and accept Ha that at least one regression coefficient is not 0; here the regression coefficients are not all 0.
Also, the R square value is 98%, as was seen from the previous model as well, which shows that this is a fairly good model for our predictions.
LINEAR REGRESSION MODEL AFTER SCALING USING Z SCORE
Since all the variables are in different units of measurement, we will scale our train and test datasets using the z score from the scipy.stats package and fit these scaled datasets into our model.
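A minimal sketch of this scaling step, under the same assumed variable names:

```python
# Sketch of z-score scaling with scipy.stats, then refitting the linear model.
# X_train, X_test, y_train, y_test are the pandas objects assumed above.
import pandas as pd
from scipy.stats import zscore
from sklearn.linear_model import LinearRegression

X_train_z = X_train.apply(zscore)                      # scale each predictor column
X_test_z = X_test.apply(zscore)
y_train_z = pd.Series(zscore(y_train), index=y_train.index)
y_test_z = pd.Series(zscore(y_test), index=y_test.index)

lr_scaled = LinearRegression().fit(X_train_z, y_train_z)
print("Train R^2:", lr_scaled.score(X_train_z, y_train_z))
```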
The coefficients of the independent variables are as below:
Intercept:
Conclusion:
98% of the variation in the Expected CTC is explained by the predictors in the model for the train data set.
Root Mean Square Error (RMSE) is 0.1419 for the train dataset.
From the table we can see that even after scaling there are VIF values greater than 5, so we will delete all the columns having a VIF of more than 5.
Table 8: VIF values after scaling and removing columns having VIF greater than 5
But Total experience and Current CTC will not be deleted because Total experience is
required to calculate Expected CTC and Current CTC is directly related to Total
experience.
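A hedged sketch of the column-dropping step; the two protected column names below are hypothetical placeholders for Total Experience and Current CTC:

```python
# Sketch of dropping high-VIF columns while protecting the two domain-critical
# predictors; the column names in keep_always are hypothetical placeholders.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

keep_always = ["Total_Experience", "Current_CTC"]      # hypothetical column names

vif = pd.Series(
    [variance_inflation_factor(X_train_z.values, i) for i in range(X_train_z.shape[1])],
    index=X_train_z.columns,
)
to_drop = [col for col in vif[vif > 5].index if col not in keep_always]
X_train_z = X_train_z.drop(columns=to_drop)
X_test_z = X_test_z.drop(columns=to_drop)
```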
Comparison of results:
From the above table we can see that the R square is the same for all the models, while the RMSE differs between the sklearn model and the z-score model because the values are scaled. Any one of the models can be used; for better accuracy we can go with the linear model after scaling.
Decision Tree:
Decision Tree is a supervised learning algorithm which can be used for both classification and regression problems.
As the name suggests, it uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. It starts with a root node and ends with decisions made at the leaves.
Root Node: The node which has all the observations of the training sample is called Root
Node.
Decision Node: The nodes we get after splitting the root node are called decision nodes.
Terminal Node: The nodes where further splitting is not possible are called leaf nodes or
terminal nodes.
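A baseline sketch of a decision tree regressor on this problem (variable names assumed as before):

```python
# Baseline DecisionTreeRegressor sketch (trees do not need scaled data);
# X_train, y_train, X_test, y_test are assumed from the earlier split.
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(random_state=1)
dt.fit(X_train, y_train)
print("Train R^2:", dt.score(X_train, y_train))
print("Test  R^2:", dt.score(X_test, y_test))
```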
Random Forest:
Random Forest is an ensemble model made of many decision trees using bootstrapping,
random subsets of features, and average voting to make predictions. This is an example of
a bagging ensemble.
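A corresponding sketch for the Random Forest regressor:

```python
# Random Forest sketch: an ensemble of bootstrapped trees whose predictions
# are averaged; data names assumed as above.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)
print("Train R^2:", rf.score(X_train, y_train))
print("Test  R^2:", rf.score(X_test, y_test))
```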
Neural Network:
An Artificial Neural Network, often just called a
neural network, is a mathematical model inspired by biological neural networks. A neural
network consists of an interconnected group of artificial neurons, and it processes
information using a connectionist approach to computation.
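A sketch of the ANN regressor using sklearn's MLPRegressor, trained on the z-scored data since the network is sensitive to feature scale (variable names assumed from the earlier steps):

```python
# ANN sketch with sklearn's MLPRegressor on the scaled data.
from sklearn.neural_network import MLPRegressor

ann = MLPRegressor(hidden_layer_sizes=(100,), activation="relu",
                   max_iter=1000, random_state=1)
ann.fit(X_train_z, y_train_z)
print("Train R^2:", ann.score(X_train_z, y_train_z))
print("Test  R^2:", ann.score(X_test_z, y_test_z))
```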
b. Test your predictive model against the test set using various appropriate
performance metrics
Solution:
Table 10 below shows the RMSE and the training and test scores on both the train and test datasets for the Linear Regression, Decision Tree, Random Forest and Artificial Neural Network (ANN) models.
Table 10: RMSE, Training and Test Score for both Train and Test Dataset for all models
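A sketch of how such a comparison table can be assembled, assuming the fitted models and data splits from the sketches above:

```python
# Sketch of how a comparison like Table 10 can be assembled; the fitted models
# (lr, dt, rf, ann) and data splits are assumed from the sketches above.
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

models = {
    "Linear Regression": (lr, X_train, X_test, y_train, y_test),
    "Decision Tree":     (dt, X_train, X_test, y_train, y_test),
    "Random Forest":     (rf, X_train, X_test, y_train, y_test),
    "ANN":               (ann, X_train_z, X_test_z, y_train_z, y_test_z),
}

rows = []
for name, (m, Xtr, Xte, ytr, yte) in models.items():
    rows.append({"Model": name,
                 "Train Score": m.score(Xtr, ytr),
                 "Test Score":  m.score(Xte, yte),
                 "Train RMSE":  np.sqrt(mean_squared_error(ytr, m.predict(Xtr))),
                 "Test RMSE":   np.sqrt(mean_squared_error(yte, m.predict(Xte)))})
print(pd.DataFrame(rows).round(3))
```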
MODEL TUNING AND BUSINESS IMPLICATION
The three main classes of ensemble learning methods are bagging, stacking,
and boosting, and it is important to both have a detailed understanding of each method
and to consider them on your predictive modeling project.
The two important parameters in building a decision tree are max_depth and min_samples_split. The value for max_depth should be within 10-15, and for min_samples_split we take about 2-3% of the train size.
Using GridSearchCV from sklearn, the best parameters detected and used to create the CART model are:
From the above table, we built the CART regressor using the best parameters detected, with min_samples_leaf as 15.
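A hedged sketch of this tuning step; the parameter grid below only illustrates the max_depth and min_samples_split guidance above and is not the exact grid used in the report:

```python
# Hedged sketch of the decision-tree tuning step with GridSearchCV.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    "max_depth": [10, 12, 15],
    "min_samples_split": [30, 45, 60],   # roughly 2-3% of the train size (illustrative)
    "min_samples_leaf": [10, 15, 30],
}
grid_dt = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid,
                       cv=5, scoring="neg_mean_squared_error")
grid_dt.fit(X_train, y_train)
print(grid_dt.best_params_)
best_dt = grid_dt.best_estimator_
```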
Using GridSearchCV from sklearn, the best parameters detected and used to create the Random Forest are:
From the above table, we built the Random Forest using max_depth as 10, max_features as 6 and min_samples_leaf as 3.
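A similar sketch for the Random Forest tuning; the grid is an assumption built around the best values reported above:

```python
# Sketch of the Random Forest tuning step with GridSearchCV.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [8, 10, 12],
    "max_features": [4, 6, 8],
    "min_samples_leaf": [3, 5, 10],
    "n_estimators": [100, 300],
}
grid_rf = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                       cv=5, scoring="neg_mean_squared_error")
grid_rf.fit(X_train, y_train)
print(grid_rf.best_params_)
best_rf = grid_rf.best_estimator_
```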
In building a neural network there are three important hyperparameters: the number of neurons in the hidden layer, the tolerance and the activation function.
The number of hidden neurons is commonly taken as (number of input variables + number of output variables)/2, the industry-standard tolerance values are 0.001 and 0.0001, and the activation function should be ReLU.
Using GridSearchCV from sklearn, the best parameters detected and used to create the Neural Network are:
From the above table, we built the Neural Network using a hidden layer size of 100 and tolerance = 0.001.
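A sketch of the ANN tuning step; the hidden layer sizes, tolerance values and ReLU activation follow the text above, while the remaining settings are assumptions:

```python
# Sketch of the ANN tuning step with MLPRegressor and GridSearchCV.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (100, 100)],
    "tol": [0.001, 0.0001],
    "activation": ["relu"],
}
grid_ann = GridSearchCV(MLPRegressor(max_iter=1000, random_state=1), param_grid,
                        cv=5, scoring="neg_mean_squared_error")
grid_ann.fit(X_train_z, y_train_z)
print(grid_ann.best_params_)
best_ann = grid_ann.best_estimator_
```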
Comparison of RMSE, Training and Test Score and MAPE for Linear regression,
Decision Tree, Random Forest and Artificial Neural Networks:
Table 14: Comparison of RMSE, Training and Test Score and MAPE for all models
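The MAPE column of Table 14 can be computed as in this sketch, assuming the tuned models from the steps above:

```python
# Sketch of the MAPE metric used in Table 14; the tuned models (best_dt, best_rf,
# best_ann) and the linear model lr are assumed from the earlier sketches.
from sklearn.metrics import mean_absolute_percentage_error

for name, model, Xte, yte in [("Linear Regression", lr, X_test, y_test),
                              ("Decision Tree", best_dt, X_test, y_test),
                              ("Random Forest", best_rf, X_test, y_test),
                              ("ANN", best_ann, X_test_z, y_test_z)]:
    mape = mean_absolute_percentage_error(yte, model.predict(Xte))
    print(f"{name:18s} test MAPE: {mape:.3f}")
```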
From Table 14, it is clear that the Decision Tree Regressor and the Random Forest Regressor are overfitted after using the parameters obtained from GridSearchCV.
The ANN Regressor and Linear Regression have almost the same values for both the training and testing scores.
Looking at the RMSE on training and testing for both Linear Regression and the ANN Regressor, I would choose Linear Regression.
c. Interpretation of the most optimum model and its implication on the business
Solution:
Looking at the RMSE on training and testing for both Linear Regression and the ANN Regressor in Table 14, I would choose Linear Regression.
Comparing the scores on the train and test sets, we can say that the model is not overfitted.
The model performs well on both the training and testing datasets.
The Linear Regression model explains about 97% of the variation in the target variable.