Predictive_Modelling_Alternate_Project_Business_Case.docx
Problem Statement: You are a part of an investing firm and your work is to do research about these
759 firms. You are provided with the dataset containing the sales and other attributes of these 759
firms. Predict the sales of these firms on the basis of the details given in the dataset so as to help
your company invest consciously. Also, provide them with the 5 attributes that are most important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
data types, shape, EDA). Perform Univariate and Bivariate Analysis.
1.2 Impute null values if present? Do you think scaling is necessary in this case?
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test and train
(70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on
Train and Test sets using Rsquare, RMSE.
1.4 Inference: Based on these predictions, what are the business insights and recommendations.
Problem Statement: You are hired by the Government to do analysis on car crashes. You are provided
details of car crashes, among which some people survived and some didn't. You have to help the
government in predicting whether a person will survive or not on the basis of the information given
in the data set, so as to provide insights that will help the government make stronger laws for car
manufacturers to ensure safety measures. Also, find out the important factors on the basis of which
you made your predictions.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both
the models and write inferences, which model is best/optimized.
2.4 Inference: Based on these predictions, what are the insights and recommendations.
Solution
Problem 1: Linear Regression
Problem Statement: You are a part of an investing firm and your work is to do research about these
759 firms. You are provided with the dataset containing the sales and other attributes of these 759
firms. Predict the sales of these firms on the basis of the details given in the dataset so as to help
your company invest consciously. Also, provide them with the 5 attributes that are most important.
Data Dictionary:
Solution:
Now, reading the head and tail of the dataset to check whether the data has been loaded properly.
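Below is a minimal sketch of this ingestion and first-look step; the file name Firm_level_data.csv is an assumption, since the actual file name is not shown in the report.
PYTHON CODE (illustrative sketch):
import pandas as pd

# Load the firm-level dataset (file name is an assumption)
df = pd.read_csv("Firm_level_data.csv")

# Head and tail to confirm the data has been read in correctly
print(df.head())
print(df.tail())

# Shape, data types and null-value counts
print(df.shape)
print(df.info())
print(df.isnull().sum())

# Descriptive statistics for the continuous variables
print(df.describe().T)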
PYTHON OUTPUT: DATA DESCRIPTION
Observation:
The continuous variables are sales, capital, patents, randd, employment, tobinq, value and institutions.
PYTHON OUTPUT: NULL VALUE TREATMENT
Updated Dataframe:
       sales         capital       patents  randd        employment  sp500
0      826.995050    161.603986    10       382.078247   2.306000    no
1      407.753973    122.101012    2        0.000000     1.860000    no
2      8407.845588   6221.144614   138      3296.700439  49.659005   yes
3      451.000010    266.899987    1        83.540161    3.071000    no
4      174.927981    140.124004    2        14.233637    1.947000    no
..     ...           ...           ...      ...          ...         ...
754    1253.900196   708.299935    32       412.936157   22.100002   yes
755    171.821025    73.666008     1        0.037735     1.684000    no
756    202.726967    123.926991    13       74.861099    1.460000    no
757    785.687944    138.780992    6        0.621750     2.900000    yes
758    22.701999     14.244999     5        18.574360    0.197000    no
(remaining columns truncated in this output)
PYTHON OUTPUT: UNIVARIATE / BIVARIATE ANALYSIS
Observation:
Employment, tobinq, value, randd, sales and patents have multiple outliers in the data.
Tobinq, value, randd, sales and patents are positively skewed.
Institutions ranges roughly from 25 to 60.
Tobinq ranges roughly from 0.5 to 3.
Value, employment, patents, randd, capital and sales are heavily concentrated near the low end of their ranges, consistent with the positive skew.
PYTHON OUTPUT: BIVARIATE DATA DISTRIBUTION WITH HUE AS SALES
Observation: There is no strong correlation apparent between the variables, and the distributions look fairly normal. There is no huge difference in the data distribution across sales; I don't see two clearly different distributions in the data. Multiple outliers are observed, which need to be treated.
Observation: No severe multicollinearity is apparent from the correlation heatmap.
1.2 Impute null values if present? Do you think scaling is necessary in this case?
Solution:
Observation: We do have null values in tobinq. To fix them we can use mean or median imputation. Since the percentage of null values is less than 5%, we could also simply drop these rows. After imputing the mean, there are no null values left in the dataset.
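A minimal sketch of the imputation described above, assuming the null values are confined to the tobinq column:
PYTHON CODE (illustrative sketch):
# Confirm which columns contain nulls (only tobinq is expected to)
print(df.isnull().sum())

# Mean imputation for tobinq; median imputation or dropping the rows
# would also be reasonable since the nulls are below 5% of the data
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].mean())

# Confirm that no nulls remain
print(df.isnull().sum().sum())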
Observation: Scaling isn't required here, as the dataset features look to be in a more or less fixed range. Also, there isn't any severe multicollinearity in the data. There are multiple outliers present in the dataset which need to be treated before we proceed with the modelling.
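The exact outlier-treatment code is not shown in the report; the sketch below assumes the common IQR-capping approach, where values beyond 1.5 times the IQR are clipped to the whisker limits.
PYTHON CODE (illustrative sketch):
import numpy as np

def cap_outliers(series):
    # Clip values outside the 1.5*IQR whiskers to the whisker limits
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = df[numeric_cols].apply(cap_outliers)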
PYTHON OUTPUT: POST OUTLIERS TREATMENT
PYTHON OUTPUT: HEATMAP POST OUTLIER TREATMENT
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into
test and train (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using Rsquare, RMSE.
Solution:
Observation: Dummy variables have been created. The linear regression model does not accept categorical (string) values, so we have encoded the categorical values as numeric dummy columns.
Observation: Columns that were not required were already dropped at the initial stage. (Train/Test split; a sketch of the encoding and split follows the column list below.)
Index(['sales', 'capital', 'patents', 'randd', 'employment', 'tobinq', 'value',
       'institutions', 'sp500_no', 'sp500_yes'],
      dtype='object')
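A sketch of the encoding and 70:30 split that produces the column list above; the random_state value is an assumption.
PYTHON CODE (illustrative sketch):
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the string column sp500 into sp500_no / sp500_yes
df_encoded = pd.get_dummies(df, columns=["sp500"])

X = df_encoded.drop("sales", axis=1)
y = df_encoded["sales"]

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)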
- Invoked the Linear Regression function and fitted the model on the training data.
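A minimal sketch of fitting the model and computing the intercept, R-square and RMSE values reported below:
PYTHON CODE (illustrative sketch):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Intercept:", lr.intercept_)
print("Train R-square:", lr.score(X_train, y_train))
print("Test R-square:", lr.score(X_test, y_test))
print("Train RMSE:", np.sqrt(mean_squared_error(y_train, lr.predict(X_train))))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, lr.predict(X_test))))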
PYTHON OUTPUT:
LinearRegression()
The intercept for our model is 155.8971701239957
R-square on the training data: 0.9358806629736066
R-square on the test data: 0.924129439335239
RMSE on the training data: 394.6129494572075
RMSE on the test data: 399.74321332112794
- VIF VALUES
PYTHON OUTPUT:
capital ---> 5.884834435358601
patents ---> 2.5564811032960173
randd ---> 2.9241166081719343
employment ---> 5.289087439090918
tobinq ---> 1.4736588698814541
value ---> 6.0730692748610045
institutions ---> 1.2923225457814675
sp500_no ---> 5.627713456806028
sp500_yes ---> 7.007866608862636
Observation: There is visible multicollinearity in the dataset. To bring these values down we can drop columns after fitting a stats model: from the stats model summary we can identify the features that do not contribute to the model, and once those features are removed the VIF values will be reduced. The ideal VIF value is less than 5.
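A sketch of how the VIF values above and the statsmodels OLS summary below can be obtained; the formula string simply lists the encoded columns and is an assumption.
PYTHON CODE (illustrative sketch):
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each predictor in the training data
exog = X_train.values.astype(float)
for i, col in enumerate(X_train.columns):
    print(col, "--->", variance_inflation_factor(exog, i))

# OLS with statsmodels to obtain coefficients and p-values
data_train = X_train.join(y_train)
formula = ("sales ~ capital + patents + randd + employment + tobinq"
           " + value + institutions + sp500_no + sp500_yes")
ols_model = smf.ols(formula=formula, data=data_train).fit()
print(ols_model.summary())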
PYTHON OUTPUT:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.936
Model:                            OLS   Adj. R-squared:                  0.935
Method:                 Least Squares   F-statistic:                     952.4
Date:                Thu, 20 Jan 2022   Prob (F-statistic):          1.05e-305
Time:                        16:40:50   Log-Likelihood:                -3927.7
No. Observations:                 531   AIC:                             7873.
Df Residuals:                     522   BIC:                             7912.
Df Model:                           8
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      103.9314     42.150      2.466      0.014      21.128     186.735
capital          0.4062      0.042      9.651      0.000       0.323       0.489
patents         -4.6473      2.789     -1.666      0.096     -10.127       0.833
randd            0.6399      0.232      2.753      0.006       0.183       1.096
employment      78.6137      4.765     16.498      0.000      69.252      87.975
tobinq         -39.9258     12.145     -3.288      0.001     -63.784     -16.067
value            0.2446      0.026      9.592      0.000       0.195       0.295
institutions     0.2174      0.902      0.241      0.810      -1.555       1.990
sp500_no       -31.1003     25.504     -1.219      0.223     -81.203      19.003
sp500_yes      135.0318     49.490      2.728      0.007      37.808     232.256
==============================================================================
Omnibus:                      185.527   Durbin-Watson:                   1.966
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1284.253
Skew:                           1.351   Prob(JB):                    1.34e-279
Kurtosis:                      10.123   Cond. No.                     2.47e+19
==============================================================================
1.4 Inference: Based on these predictions, what are the business insights and
recommendations.
Solution:
The investment criteria for any new investor are mainly based on the capital invested in
the company by the promoters, and investors favour firms where the capital investment is
good, as is also reflected in the plots.
To generate capital the company should have a combination of attributes such as value,
employment, sales and patents.
The highest contributing attribute is employment, followed by patents.
Using the stats model we can examine the p-values and coefficients, which give a better
understanding of the relationships: variables with p-values above 0.05 can be dropped and
the model re-run iteratively for better results.
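As a sketch of the suggested re-run, the predictors whose p-values exceed 0.05 in the summary above (institutions, sp500_no and patents) could be dropped and the OLS refitted; this builds on the statsmodels sketch shown earlier.
PYTHON CODE (illustrative sketch):
# Refit after dropping the predictors with p-values above 0.05
reduced_formula = "sales ~ capital + randd + employment + tobinq + value + sp500_yes"
reduced_model = smf.ols(formula=reduced_formula, data=data_train).fit()
print(reduced_model.summary())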
Problem 2: Logistic Regression and LDA
Problem Statement: You are hired by the Government to do analysis on car crashes. You are provided
details of car crashes, among which some people survived and some didn't. You have to help the
government in predicting whether a person will survive or not on the basis of the information given
in the data set, so as to provide insights that will help the government make stronger laws for car
manufacturers to ensure safety measures. Also, find out the important factors on the basis of which
you made your predictions.
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for varying
sampling probabilities. (The inverse probability weighting estimator can be used to demonstrate
causality when the researcher cannot conduct a controlled experiment but has observed data to
model.) For further information see
https://en.wikipedia.org/wiki/Inverse_probability_weighting
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
nodeploy and unavail
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags
deployed.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
Solution:
Loaded the necessary libraries for the model building and read the head and tail of the dataset to check whether the data has been loaded properly.
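A minimal sketch of this ingestion step; the file name Car_Crash.csv is an assumption.
PYTHON CODE (illustrative sketch):
import pandas as pd

# Load the car-crash dataset (file name is an assumption)
crash = pd.read_csv("Car_Crash.csv")

# Head, tail, shape, data types and descriptive statistics
print(crash.head())
print(crash.tail())
print(crash.shape)        # expected: (11217, 15)
print(crash.info())
print(crash.describe(include="all").T)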
PYTHON OUTPUT: SHAPE OF THE DATASET
(11217, 15)
INFO
PYTHON OUTPUT:
DATA DESCRIBE:
PYTHON OUTPUT:
Observation: We have integer and continuous data as well as string columns; "Survived" is our target variable.
NULL VALUE CHECK
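A minimal sketch of the null-value and duplicate checks whose output is shown below:
PYTHON CODE (illustrative sketch):
# Null values per column and duplicate-row count
print(crash.isnull().sum())
print("Number of duplicate rows =", crash.duplicated().sum())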
PYTHON OUTPUT:
Number of duplicate rows = 0
UNIQUE VALUE IN THE CATEGORICAL DATA
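A sketch of the loop that could produce the unique-value listing below for every string column:
PYTHON CODE (illustrative sketch):
# Number of unique values and their frequencies for each object column
for col in crash.select_dtypes(include="object").columns:
    print(col.upper(), ":", crash[col].nunique())
    print(crash[col].value_counts().sort_values())
    print()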
PYTHON OUTPUT:
DVCAT : 5
1-9km/h 282
55+ 809
40-54 1344
25-39 3368
10-24 5414
Name: dvcat, dtype: int64
SURVIVED : 2
Not_Survived 1180
survived 10037
Name: Survived, dtype: int64
AIRBAG : 2
none 4153
airbag 7064
Name: airbag, dtype: int64
SEATBELT : 2
none 3368
belted 7849
Name: seatbelt, dtype: int64
SEX : 2
f 5169
m 6048
Name: sex, dtype: int64
ABCAT : 3
nodeploy 2699
unavail 4153
deploy 4365
Name: abcat, dtype: int64
OCCROLE : 2
pass 2431
driver 8786
Name: occRole, dtype: int64
CASEID : 6488
5:41:1 1
49:228:2 1
8:126:2 1
43:166:2 1
5:50:1 1
..
75:84:2 6
49:156:1 6
74:74:2 6
49:106:1 6
73:100:2 7
Name: caseid, Length: 6488, dtype: int64
UNIVARIATE & BIVARIATE ANALYSIS
PYTHON OUTPUT:
Observation: The data looks more or less positively skewed.
PYTHON OUTPUT:
Dvcat
Observation: The 10-24 km/h estimated impact-speed category has the highest count of crashes.
Survived
Observation: The 'survived' level is far more frequent (approx. 10,000) than 'Not_Survived' (approx. 1,200), so the target classes are imbalanced.
Airbag
Observation: About 63% of the cars had an airbag fitted, while about 37% did not, so a sizeable share of the cars involved in these crashes lacked this safety feature.
Seatbelt
Observation: About 70% of the occupants were belted, while about 30% were not belted at the time of the crash.
Sex
Observation: Males (m) appear about 17% more often than females (f) in these crash records.
abcat
Observation: The counts for 'deploy' and 'unavail' are more or less similar. In roughly 61% of the records an airbag did not deploy (it was either unavailable or failed to deploy), which is a major safety concern and could have a direct bearing on the number of deaths.
OccRole
Observation: Drivers account for approx. 78% of the records and passengers for approx. 22%.
Survived vs Frontal
Observation: Among the survivors, approx. 25% had a non-frontal impact while approx. 75% had a frontal impact. The non-survivors look more or less evenly split between non-frontal and frontal impacts.
Survived vs airbag
Observation: Among the survivors, approx. 75% had an airbag in their car while approx. 25% did not. The non-survivors look more or less evenly split between airbag and none.
Survived vs Seatbelt
Observation: Among the survivors, approx. 75% were belted while approx. 25% were not. The non-survivors look more or less evenly split between belted and none.
PYTHON OUTPUT:
Observation: There is no strong correlation apparent between the variables, and the distributions look fairly normal. There is no huge difference in the data distribution across the survived classes; I don't see two clearly different distributions in the data.
TREATING OUTLIERS
Observation: We have outliers in the dataset. As LDA is based on numerical computation, treating the outliers will help the model perform better.
PYTHON OUTPUT:
POST TREATING THE OUTLIERS
PYTHON OUTPUT:
2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Logistic Regression and LDA (linear discriminant
analysis).
Solution:
Encoding the string-valued columns is required so that the logistic regression and LDA models can work with the data and produce better results.
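A sketch of the encoding and 70:30 split, assuming the high-cardinality identifier caseid is dropped, the target is mapped to 0/1, and the remaining string columns are one-hot encoded; drop_first, random_state and stratify are assumptions.
PYTHON CODE (illustrative sketch):
import pandas as pd
from sklearn.model_selection import train_test_split

# Drop the identifier column and encode the target as 0/1
crash_model = crash.drop("caseid", axis=1)
crash_model["Survived"] = crash_model["Survived"].map(
    {"survived": 1, "Not_Survived": 0})

# One-hot encode the remaining string columns
crash_encoded = pd.get_dummies(crash_model, drop_first=True)

X = crash_encoded.drop("Survived", axis=1)
y = crash_encoded["Survived"]

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)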
PYTHON OUTPUT:
GRID SEARCH METHOD: Grid search is used with logistic regression to find the optimal solver and the associated hyper-parameters.
PYTHON OUTPUT:
The grid search method selects the liblinear solver, which is suitable for small datasets. The tolerance and penalty have also been chosen using the grid search method (a sketch follows below). We then predict on the training data.
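A sketch of the grid search and of the LDA fit; the exact parameter grid, scoring metric and number of folds are assumptions.
PYTHON CODE (illustrative sketch):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Candidate solvers, penalties and tolerances (grid is an assumption)
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "tol": [1e-4, 1e-5, 1e-6]},
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "tol": [1e-4, 1e-5, 1e-6]},
]
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, scoring="f1", cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)          # e.g. the liblinear solver
log_model = grid.best_estimator_

# Linear Discriminant Analysis on the same split
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)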
2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model. Compare both the models and write inferences, which model is
best/optimized.
Solution:
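A minimal sketch of how the accuracy, confusion matrix, classification report, ROC curve and ROC_AUC score could be computed for both models on the train and test sets; variable names follow the earlier sketches.
PYTHON CODE (illustrative sketch):
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_curve, roc_auc_score)

for name, model in [("Logistic Regression", log_model), ("LDA", lda_model)]:
    for label, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
        pred = model.predict(X_)
        prob = model.predict_proba(X_)[:, 1]
        print(name, label, "accuracy:", accuracy_score(y_, pred))
        print(confusion_matrix(y_, pred))
        print(classification_report(y_, pred))
        print(name, label, "ROC_AUC:", roc_auc_score(y_, prob))
        fpr, tpr, _ = roc_curve(y_, prob)
        plt.plot(fpr, tpr, label=name + " (" + label + ")")

plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()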
PYTHON OUTPUT: CONFUSION MATRIX FOR TEST DATA
precision recall f1-score support
PYTHON OUTPUT: ACCURACY FOR TEST DATA
0.9815805109922757
PYTHON OUTPUT: PREDICTING THE VARIABLE
PYTHON OUTPUT: CLASSIFICATION REPORT & CONFUSION MATRIX FOR TRAIN DATA
precision recall f1-score support
PYTHON OUTPUT: CLASSIFICATION REPORT & CONFUSION MATRIX FOR TEST DATA
precision recall f1-score support
PYTHON OUTPUT: CHANGING THE CUT-OFF VALUE TO CHECK THE OPTIMAL VALUE THAT GIVES BETTER ACCURACY AND F1 SCORE
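A sketch of the cut-off tuning step: the predicted probabilities are thresholded at 0.1 through 0.9 and the confusion matrix and F1-score are recomputed at each cut-off.
PYTHON CODE (illustrative sketch):
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

test_prob = log_model.predict_proba(X_test)[:, 1]
for cutoff in np.arange(0.1, 1.0, 0.1):
    pred = (test_prob > cutoff).astype(int)
    print(round(cutoff, 1))
    print("Confusion Matrix")
    print(confusion_matrix(y_test, pred))
    print("F1 score:", f1_score(y_test, pred))
    print()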
0.1
Confusion Matrix
0.2
Confusion Matrix
0.3
Confusion Matrix
0.4
Confusion Matrix
0.5
Confusion Matrix
0.6
Confusion Matrix
0.7
Confusion Matrix
0.8
Confusion Matrix
0.9
Confusion Matrix
Observation: Comparing both models, the results are almost the same, but logistic regression performs marginally better on this categorical target.
2.4 Inference: Based on these predictions, what are the insights and
recommendations.
The accuracy of the logistic regression model on both the training data and the testing data is
almost the same, i.e. about 98%.
Similarly, the AUC of the logistic regression model is nearly identical on the training and testing data.
The other confusion-matrix parameters of the logistic regression model are also similar across
train and test, so we can presume that the model generalises well and is not over-fitted.
We therefore applied GridSearchCV to hyper-tune the model, after which the F1-scores on the
training and test data were almost the same.
In the case of LDA, the AUC for the testing and training data is also the same, at about 96%;
the other confusion-matrix parameters of the LDA model are likewise similar, showing that this
model generalises well too.
Overall, we can conclude that the logistic regression model is best suited for this data set,
given its slightly higher accuracy compared to Linear Discriminant Analysis.