Capstone Project Notes 2
PROJECT NOTES 2
DSBA
Submitted by
1) Build various models (You can choose to build models for either or all of descriptive,
predictive or prescriptive purposes)
Solution:
The following models are built: 1) Linear Regression, 2) Decision Tree, 3) Random Forest, 4) Artificial Neural Network (ANN).
The data needs to be scaled for the ANN Regressor only, because an ANN is sensitive to outliers and to the scale of its inputs, and it works on the principle of weighted sums.
Linear Regression:
After the data is split into train and test sets, the linear regression model is built.
Scaling is required for this problem before running linear regression, but I have fitted the model on both the unscaled and the scaled data, to compare the results between the two and to show how the scaled data gives more reliable values than the unscaled data.
Scaling can be useful to reduce or check the multicollinearity in the data; when scaling is not applied, the VIF (variance inflation factor) values come out very high, which indicates the presence of multicollinearity. These values are calculated after building the linear regression model, in order to understand the multicollinearity in the model.
Scaling had no impact on the model score, the coefficients of the attributes, or the intercept.
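As an illustrative sketch of this step (the variable names X_train, X_test, y_train, y_test and the use of statsmodels for VIF are assumptions, not taken from the report):

```python
# Minimal sketch: fit linear regression on the unscaled train data and inspect VIF.
# X_train / y_train are assumed to be a pandas DataFrame / Series from the split.
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Train R^2  :", lr.score(X_train, y_train))
print("Intercept  :", lr.intercept_)
print("Coefficients:", dict(zip(X_train.columns, lr.coef_)))

# VIF per predictor; values above 5 indicate multicollinearity
vif = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print(vif.sort_values(ascending=False))
```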
Table 1: Coefficients of the independent variables
Intercept:
The intercept (often labeled as constant) is the point where the function crosses the y-axis.
In some analyses, the regression model only becomes significant when we remove the intercept, and the regression line reduces to Y = bX + error.
Once the model is built, evaluating its performance is necessary. Some of the evaluation metrics used for regression analysis are:
1. R-squared (R2): It measures how much of the variation in the target is explained by the model, i.e. the ratio of the explained variation to the total variation. The value of R squared lies between 0 and 1; the closer the value is to 1, the better the model fits the data. It is computed as R2 = 1 − (SSRES / SSTOT), where SSRES is the Residual Sum of Squares and SSTOT is the Total Sum of Squares.
2. Adjusted R-squared: A drawback of R2 is that as the features increase, the value of R2 also increases, which gives the illusion of a good model. Adjusted R2 solves this drawback of R2: it only rewards the features which are important for the model and so shows the real improvement of the model.
3. Mean Squared Error (MSE): Another common metric for evaluation is the mean squared error, which is the mean of the squared differences between the actual and predicted values.
4. Root Mean Squared Error (RMSE): It is the square root of the MSE, i.e. the root of the mean squared difference between the actual and predicted values. Because the errors are squared before averaging, RMSE penalizes large errors heavily, and unlike MSE it is expressed in the same units as the target.
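A small sketch of how these four metrics can be computed with sklearn, assuming the fitted model lr and the test split from the step above:

```python
# Sketch of the four metrics above, computed with sklearn on the test set.
# `lr`, X_test and y_test are assumed from the previous step.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = lr.predict(X_test)

r2 = r2_score(y_test, y_pred)                    # 1. R-squared
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)    # 2. Adjusted R-squared
mse = mean_squared_error(y_test, y_pred)         # 3. Mean Squared Error
rmse = np.sqrt(mse)                              # 4. Root Mean Squared Error

print(f"R^2={r2:.4f}  Adj R^2={adj_r2:.4f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
```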
Conclusion:
98% of the variation in the Expected CTC is explained by the predictors in the model for the train data set.
Root Mean Square Error (RMSE) is 168128.02 for the train dataset.
Conclusion:
98% of the variation in the Expected CTC is explained by the predictors in the model for the test data set.
Root Mean Square Error (RMSE) is 166028.13 for the test dataset.
Inference: The model performed well on both the train and test data, with the R square value being 98%.
Comparing the intercept of the sklearn model with Table 3, the coefficients and intercept of the statsmodels model are the same.
Let us now see the summary of lm1 to check the value of R square.
Inference: The overall P value is less than alpha, so we reject H0 and accept Ha that at least one regression coefficient is not 0; here the regression coefficients are not all 0.
Also, the R square value is 98%, as was seen from the previous model as well, which shows that this is a fairly good model for our predictions.
LINEAR REGRESSION MODEL AFTER SCALING USING Z SCORE
Since all the variables are in different units of measurement, we will scale our train and test datasets using the z score from the scipy.stats package and fit these scaled datasets into our model.
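A minimal sketch of this scaling step, under the same assumed variable names:

```python
# Sketch of z-score scaling with scipy.stats, then refitting the linear model.
# X_train, X_test, y_train, y_test are the pandas objects assumed above.
import pandas as pd
from scipy.stats import zscore
from sklearn.linear_model import LinearRegression

X_train_z = X_train.apply(zscore)                      # scale each predictor column
X_test_z = X_test.apply(zscore)
y_train_z = pd.Series(zscore(y_train), index=y_train.index)
y_test_z = pd.Series(zscore(y_test), index=y_test.index)

lr_scaled = LinearRegression().fit(X_train_z, y_train_z)
print("Train R^2:", lr_scaled.score(X_train_z, y_train_z))
```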
The coefficients of the independent variables are as below:
Intercept:
Conclusion:
98% of the variation in the Expected CTC is explained by the predictors in the model for the train data set.
Root Mean Square Error (RMSE) is 0.1419 for the train dataset.
From the table we can see that even after scaling there are VIF values greater than 5, so we will delete all the columns having a VIF of more than 5.
Table 8: VIF values after scaling and removing columns having VIF greater than 5
But Total experience and Current CTC will not be deleted because Total experience is
required to calculate Expected CTC and Current CTC is directly related to Total
experience.
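A hedged sketch of the column-dropping step; the two protected column names below are hypothetical placeholders for Total Experience and Current CTC:

```python
# Sketch of dropping high-VIF columns while protecting the two domain-critical
# predictors; the column names in keep_always are hypothetical placeholders.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

keep_always = ["Total_Experience", "Current_CTC"]      # hypothetical column names

vif = pd.Series(
    [variance_inflation_factor(X_train_z.values, i) for i in range(X_train_z.shape[1])],
    index=X_train_z.columns,
)
to_drop = [col for col in vif[vif > 5].index if col not in keep_always]
X_train_z = X_train_z.drop(columns=to_drop)
X_test_z = X_test_z.drop(columns=to_drop)
```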
Comparison of results:
From the above table we can see that the R square is the same for all the models, while the RMSE differs between the sklearn model and the z-score model because the values are scaled. Any one of the models can be used; for better accuracy we can go with the linear model after scaling.
Decision Tree:
Decision Tree is a supervised learning algorithm which can be used for both classification and regression problems.
As the name suggests, it uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. It starts with a root node and ends with decisions made at the leaves.
Root Node: The node which has all the observations of the training sample is called Root
Node.
Decision Node: The nodes we get after splitting the root node are called decision nodes.
Terminal Node: The nodes where further splitting is not possible are called leaf nodes or
terminal nodes.
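A baseline sketch of a decision tree regressor on this problem (variable names assumed as before):

```python
# Baseline DecisionTreeRegressor sketch (trees do not need scaled data);
# X_train, y_train, X_test, y_test are assumed from the earlier split.
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(random_state=1)
dt.fit(X_train, y_train)
print("Train R^2:", dt.score(X_train, y_train))
print("Test  R^2:", dt.score(X_test, y_test))
```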
Random Forest:
Random Forest is an ensemble model made of many decision trees using bootstrapping,
random subsets of features, and average voting to make predictions. This is an example of
a bagging ensemble.
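A corresponding sketch for the Random Forest regressor:

```python
# Random Forest sketch: an ensemble of bootstrapped trees whose predictions
# are averaged; data names assumed as above.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)
print("Train R^2:", rf.score(X_train, y_train))
print("Test  R^2:", rf.score(X_test, y_test))
```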
Neural Network:
An Artificial Neural Network, often just called a
neural network, is a mathematical model inspired by biological neural networks. A neural
network consists of an interconnected group of artificial neurons, and it processes
information using a connectionist approach to computation.
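A sketch of the ANN regressor using sklearn's MLPRegressor, trained on the z-scored data since the network is sensitive to feature scale (variable names assumed from the earlier steps):

```python
# ANN sketch with sklearn's MLPRegressor on the scaled data.
from sklearn.neural_network import MLPRegressor

ann = MLPRegressor(hidden_layer_sizes=(100,), activation="relu",
                   max_iter=1000, random_state=1)
ann.fit(X_train_z, y_train_z)
print("Train R^2:", ann.score(X_train_z, y_train_z))
print("Test  R^2:", ann.score(X_test_z, y_test_z))
```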
b. Test your predictive model against the test set using various appropriate
performance metrics
Solution:
Table 10 below shows the RMSE and the training and test scores on both the train and test datasets for the Linear Regression, Decision Tree, Random Forest and Artificial Neural Network (ANN) models.
Table 10: RMSE, Training and Test Score for both Train and Test Dataset for all models
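A sketch of how such a comparison table can be assembled, assuming the fitted models and data splits from the sketches above:

```python
# Sketch of how a comparison like Table 10 can be assembled; the fitted models
# (lr, dt, rf, ann) and data splits are assumed from the sketches above.
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

models = {
    "Linear Regression": (lr, X_train, X_test, y_train, y_test),
    "Decision Tree":     (dt, X_train, X_test, y_train, y_test),
    "Random Forest":     (rf, X_train, X_test, y_train, y_test),
    "ANN":               (ann, X_train_z, X_test_z, y_train_z, y_test_z),
}

rows = []
for name, (m, Xtr, Xte, ytr, yte) in models.items():
    rows.append({"Model": name,
                 "Train Score": m.score(Xtr, ytr),
                 "Test Score":  m.score(Xte, yte),
                 "Train RMSE":  np.sqrt(mean_squared_error(ytr, m.predict(Xtr))),
                 "Test RMSE":   np.sqrt(mean_squared_error(yte, m.predict(Xte)))})
print(pd.DataFrame(rows).round(3))
```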
MODEL TUNING AND BUSINESS IMPLICATION
The three main classes of ensemble learning methods are bagging, stacking,
and boosting, and it is important to both have a detailed understanding of each method
and to consider them on your predictive modeling project.
The two important parameters in building a decision tree are max_depth and min_samples_split. The value for max_depth should be within 10-15, and for min_samples_split we take about 2-3% of the train size.
Using GridSearchCV from sklearn, the best parameters detected and used to create the CART model are:
From the above table, we built the CART regressor using the best parameters detected, with min_samples_leaf as 15.
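A hedged sketch of this tuning step; the parameter grid below only illustrates the max_depth and min_samples_split guidance above and is not the exact grid used in the report:

```python
# Hedged sketch of the decision-tree tuning step with GridSearchCV.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    "max_depth": [10, 12, 15],
    "min_samples_split": [30, 45, 60],   # roughly 2-3% of the train size (illustrative)
    "min_samples_leaf": [10, 15, 30],
}
grid_dt = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid,
                       cv=5, scoring="neg_mean_squared_error")
grid_dt.fit(X_train, y_train)
print(grid_dt.best_params_)
best_dt = grid_dt.best_estimator_
```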
Using GridSearchCV from sklearn, the best parameters detected and used to create the Random Forest are:
From the above table, we built the Random Forest using max_depth as 10, max_features as 6 and min_samples_leaf as 3.
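A similar sketch for the Random Forest tuning; the grid is an assumption built around the best values reported above:

```python
# Sketch of the Random Forest tuning step with GridSearchCV.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [8, 10, 12],
    "max_features": [4, 6, 8],
    "min_samples_leaf": [3, 5, 10],
    "n_estimators": [100, 300],
}
grid_rf = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                       cv=5, scoring="neg_mean_squared_error")
grid_rf.fit(X_train, y_train)
print(grid_rf.best_params_)
best_rf = grid_rf.best_estimator_
```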
In building a neural network there are three important hyperparameters: the number of neurons in the hidden layer, the tolerance and the activation function.
The number of hidden neurons is commonly taken as (number of input variables + number of output variables)/2, the industry-standard tolerance values are 0.001 and 0.0001, and the activation function should be ReLU.
Using GridSearchCV from sklearn, the best parameters detected and used to create the Neural Network are:
From the above table, we built the Neural Network using a hidden layer size of 100 and tolerance = 0.001.
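A sketch of the ANN tuning step; the hidden layer sizes, tolerance values and ReLU activation follow the text above, while the remaining settings are assumptions:

```python
# Sketch of the ANN tuning step with MLPRegressor and GridSearchCV.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (100, 100)],
    "tol": [0.001, 0.0001],
    "activation": ["relu"],
}
grid_ann = GridSearchCV(MLPRegressor(max_iter=1000, random_state=1), param_grid,
                        cv=5, scoring="neg_mean_squared_error")
grid_ann.fit(X_train_z, y_train_z)
print(grid_ann.best_params_)
best_ann = grid_ann.best_estimator_
```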
Comparison of RMSE, Training and Test Score and MAPE for Linear regression,
Decision Tree, Random Forest and Artificial Neural Networks:
Table 14: Comparison of RMSE, Training and Test Score and MAPE for all models
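The MAPE column of Table 14 can be computed as in this sketch, assuming the tuned models from the steps above:

```python
# Sketch of the MAPE metric used in Table 14; the tuned models (best_dt, best_rf,
# best_ann) and the linear model lr are assumed from the earlier sketches.
from sklearn.metrics import mean_absolute_percentage_error

for name, model, Xte, yte in [("Linear Regression", lr, X_test, y_test),
                              ("Decision Tree", best_dt, X_test, y_test),
                              ("Random Forest", best_rf, X_test, y_test),
                              ("ANN", best_ann, X_test_z, y_test_z)]:
    mape = mean_absolute_percentage_error(yte, model.predict(Xte))
    print(f"{name:18s} test MAPE: {mape:.3f}")
```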
From Table 14, it is clear that the Decision Tree Regressor and the Random Forest Regressor are overfitted after using the parameters obtained from GridSearchCV.
The ANN Regressor and Linear Regression have almost the same values for both the training and testing scores.
Looking at the RMSE on training and testing for both Linear Regression and the ANN Regressor, I would choose Linear Regression.
c. Interpretation of the most optimum model and its implication on the business
Solution:
Looking at the RMSE on training and testing for both Linear Regression and the ANN Regressor in Table 14, I would choose Linear Regression.
Comparing the scores on the train and test sets, we can say that the model is not overfitted.
The model performs well on both the training and testing datasets.
The Linear Regression model explains about 97% of the variation in the target variable.