
Project - Predictive Modelling

Batch: PGDSBA _JULY 2021


Author: Kaushal Kishor
Content
Problem 1: Linear Regression

Problem Statement: You are part of an investing firm and your work is to research these 759 firms. You are provided with a dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on the basis of the details given in the dataset, so as to help your company invest consciously. Also, provide the 5 most important attributes.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
data types, shape, EDA). Perform Univariate and Bivariate Analysis.

1.2 Impute null values if present. Do you think scaling is necessary in this case?

1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test and train
(70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on
Train and Test sets using Rsquare, RMSE.

1.4 Inference: Based on these predictions, what are the business insights and recommendations.

Problem 2: Logistic Regression and LDA

Problem Statement: You are hired by the government to analyse car crashes. You are provided with details of car crashes, among which some people survived and some did not. You have to help the government predict whether a person will survive or not on the basis of the information given in the dataset, so as to provide insights that will help the government make stronger laws for car manufacturers to ensure safety measures. Also, find out the important factors on the basis of which you made your predictions.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.

2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both
the models and write inferences, which model is best/optimized.

2.4 Inference: Based on these predictions, what are the insights and recommendations.

Solution
Problem 1: Linear Regression

Problem Statement: You are part of an investing firm and your work is to research these 759 firms. You are provided with a dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on the basis of the details given in the dataset, so as to help your company invest consciously. Also, provide the 5 most important attributes.

Dataset for Problem 1: Firm_level_data.csv

Data Dictionary:

1. sales: Sales (in millions of dollars).


2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of firms in the S&P 500 index. The S&P 500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States.
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a physical asset's
market value and its replacement value.
8. value: Stock market value.
9. institutions: Proportion of stock owned by institutions
1.1. Read the data and do exploratory data analysis. Describe the data briefly.
(Check the null values, Data types, shape, EDA, duplicate values). Perform
Univariate and Bivariate Analysis.

Solution:

Loading all the necessary library for the model building.

Now, reading the head and tail of the dataset to check whether the data has been loaded properly.
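A minimal sketch of this loading step, assuming the file is available locally as Firm_level_data.csv (the name given in the data dictionary):

import pandas as pd

# Read the firm-level dataset and take a first look at it.
df = pd.read_csv("Firm_level_data.csv")

print(df.head())    # first 5 rows
print(df.tail())    # last 5 rows
print(df.shape)     # expected (759, 10), including the unnamed index column
df.info()           # data types and non-null counts per column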

PYTHON OUTPUT: HEAD OF THE DATA,

PYTHON OUTPUT: TAIL OF THE DATA,

PYTHON OUTPUT: SHAPE OF THE DATA,


(759, 10)

PYTHON OUTPUT: CHECKING THE INFO OF THE DATA,

We have float, int and object data types in the data.

PYTHON OUTPUT: DATA DESCRIPTION,

Observation:

We have both categorical and continuous data.

For categorical data we have sp500.

For continuous data we have sales, capital, patents, randd, employment, tobinq, value and institutions.

PYTHON OUTPUT: CHECKING THE DUPLICATES IN THE DATA,


Number of duplicate rows = 0

PYTHON OUTPUT: UNIQUE VALUES IN THE CATEGORICAL DATA


SP500 : 2
yes 217
no 542
Name: sp500, dtype: int64

PYTHON OUTPUT: NULL VALUE CHECK


Unnamed: 0 0
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 21
value 0
institutions 0
dtype: int64

The tobinq column has 21 null values.

PYTHON OUTPUT: NULL VALUE TREATMENT
Updated Dataframe:
           sales      capital  patents        randd  employment  sp500     tobinq         value  institutions
0     826.995050   161.603986       10   382.078247    2.306000     no  11.049511   1625.453755         80.27
1     407.753973   122.101012        2     0.000000    1.860000     no   0.844187    243.117082         59.02
2    8407.845588  6221.144614      138  3296.700439   49.659005    yes   5.205257  25865.233800         47.70
3     451.000010   266.899987        1    83.540161    3.071000     no   0.305221     63.024630         26.88
4     174.927981   140.124004        2    14.233637    1.947000     no   1.063300     67.406408         49.46
..           ...          ...      ...          ...         ...    ...        ...           ...           ...
754  1253.900196   708.299935       32   412.936157   22.100002    yes   0.697454    267.119487         33.50
755   171.821025    73.666008        1     0.037735    1.684000     no   2.794910    228.475701         46.41
756   202.726967   123.926991       13    74.861099    1.460000     no   5.229723    580.430741         42.25
757   785.687944   138.780992        6     0.621750    2.900000    yes   1.625398    309.938651         61.39
758    22.701999    14.244999        5    18.574360    0.197000     no   2.213070     18.940140          7.50

[759 rows x 9 columns]

PYTHON OUTPUT: POST NULL VALUE TREATMENT


sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 0
value 0
institutions 0
dtype: int64

PYTHON OUTPUT: UNIVARIATE / BIVARIATE ANALYSIS
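The plots themselves appear as images in the original report; an illustrative sketch of how such univariate and bivariate plots are typically produced (continuing from the loading sketch above, assuming seaborn and matplotlib are available) is:

import seaborn as sns
import matplotlib.pyplot as plt

num_cols = ["sales", "capital", "patents", "randd",
            "employment", "tobinq", "value", "institutions"]

# Univariate: box plot (outlier check) and histogram (shape / skew) per column
for col in num_cols:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.boxplot(x=df[col], ax=axes[0])
    sns.histplot(df[col], kde=True, ax=axes[1])
    plt.show()

# Bivariate: pair plot and correlation heatmap
sns.pairplot(df, diag_kind="kde")
plt.show()
sns.heatmap(df[num_cols].corr(), annot=True)
plt.show()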

Observation:

 Employment, tobinq, value, randd, sales and patents have multiple outliers.
 Tobinq, value, randd, sales and patents are positively skewed.
 Institutions mostly ranges from 25 to 60.
 Tobinq mostly ranges from 0.5 to 3.
 Value, employment, patents, randd, capital and sales are concentrated at the lower end of their ranges (roughly 0 to 0.5 on the plotted scale).

PYTHON OUTPUT: SKEW


sales 9.219023
capital 7.555091
patents 7.766943
randd 10.270483
employment 9.068875
tobinq 3.332006
value 6.075996
institutions -0.168071
dtype: float64

PYTHON OUTPUT BIVARIATE DATA DISTRIBUTION

PYTHON OUTPUT BIVARIATE DATA DISTRIBUTION WITH HUE AS SALES

Observation: There is no strong correlation between the variables and the distributions look reasonably normal. There is no large difference in the data distribution across sales, and no clear separation into two distinct distributions. Multiple outliers are observed and need to be treated.

PYTHON OUTPUT: CORRELATION MATRIX

Observation: No strong multicollinearity is visible in the data.

1.2 Impute null values if present. Do you think scaling is necessary in this case?

Solution:

PYTHON OUTPUT: NULL VALUE

PYTHON OUTPUT: PERCENTAGE OF NULL VALUE

Observation: We do have null values in tobinq. To fix them we can use mean or median imputation. Since the percentage of null values is less than 5%, the affected rows could also simply be dropped. After imputing with the mean, we no longer see any null values in the dataset.
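Continuing from the loading sketch above, a minimal sketch of this null-value check and mean imputation (median imputation would look the same with .median()):

# Null-value check and percentage of missing values per column
print(df.isnull().sum())                    # tobinq shows 21 missing values
print(df.isnull().sum() / len(df) * 100)    # percentage of missing values

# Mean imputation for tobinq, as described in the text
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].mean())
print(df["tobinq"].isnull().sum())          # 0 after imputation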

Observation: Scaling is not strictly required here, as the dataset features appear to lie in a more or less fixed range, and there is no strong multicollinearity in the data. However, there are multiple outliers in the dataset which need to be treated before we proceed with modelling.

PYTHON OUTPUT: OUTLIERS IDENTIFIED
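The report does not show its exact outlier treatment; one common approach, sketched here under that assumption, is to cap each numeric column at the IQR whiskers:

# Illustrative IQR capping: values beyond 1.5*IQR from the quartiles are clipped.
def cap_outliers(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

num_cols = ["sales", "capital", "patents", "randd",
            "employment", "tobinq", "value", "institutions"]
for col in num_cols:
    df[col] = cap_outliers(df[col])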

PYTHON OUTPUT: POST OUTLIERS TREATMENT

PYTHON OUTPUT: HEATMAP POST OUTLIER TREATMENT

1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into
test and train (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using Rsquare, RMSE.

Solution:

PYTHON OUTPUT: ENCODING THE STRING VALUES

GET DUMMIES (Converting categorical to dummy variable in data)

Index(['sales', 'capital', 'patents', 'randd', 'employment', 'tobinq', 'value',
       'institutions', 'sp500_no', 'sp500_yes'],
      dtype='object')

Observation: Dummy variables have been created. A linear regression model cannot use string categorical values directly, so the categorical column has been encoded into 0/1 indicator columns.
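A sketch of the dummy encoding that yields the column names listed above (the index-like "Unnamed: 0" column is dropped first, as the report notes it is not required):

# Drop the index-like column, then one-hot encode sp500 into sp500_no / sp500_yes
df = df.drop(columns=["Unnamed: 0"], errors="ignore")
df = pd.get_dummies(df, columns=["sp500"])
print(df.columns)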

PYTHON OUTPUT: TRAIN/TEST SPLIT

- Dropping Unwanted Columns

Observation: The unrequired columns were already dropped at the initial stage. (Train/Test Split)
Index(['sales', 'capital', 'patents', 'randd', 'employment', 'tobinq', 'value',
       'institutions', 'sp500_no', 'sp500_yes'],
      dtype='object')

- Split of X and y into training and test set in 70:30 ratio

PYTHON OUTPUT:

- Invoke the LinearRegression estimator and fit the model on the training data.

PYTHON OUTPUT:
LinearRegression()

- Explore the coefficients for each of the independent attributes

PYTHON OUTPUT:

- Let us check the intercept for the model

PYTHON OUTPUT:
The intercept for our model is 155.8971701239957

- R square on training data

PYTHON OUTPUT:
0.9358806629736066

- R square on testing data

PYTHON OUTPUT:
0.924129439335239

- RMSE on Training data

PYTHON OUTPUT:
394.6129494572075

- RMSE on Testing data

PYTHON OUTPUT:
399.74321332112794
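A consolidated sketch of the split, fit and metric steps above, continuing from the encoding sketch (the random_state used in the report is not stated, so the value below is arbitrary):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

X = df.drop("sales", axis=1)
y = df["sales"]

# 70:30 split of X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

lr = LinearRegression().fit(X_train, y_train)

print(dict(zip(X.columns, lr.coef_)))      # coefficient per independent attribute
print(lr.intercept_)                       # intercept of the model
print(lr.score(X_train, y_train))          # R-square on training data
print(lr.score(X_test, y_test))            # R-square on testing data
print(np.sqrt(mean_squared_error(y_train, lr.predict(X_train))))  # RMSE on training data
print(np.sqrt(mean_squared_error(y_test, lr.predict(X_test))))    # RMSE on testing data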

- VIF –VALUES

PYTHON OUTPUT:
capital ---> 5.884834435358601
patents ---> 2.5564811032960173
randd ---> 2.9241166081719343
employment ---> 5.289087439090918
tobinq ---> 1.4736588698814541
value ---> 6.0730692748610045
institutions ---> 1.2923225457814675
sp500_no ---> 5.627713456806028
sp500_yes ---> 7.007866608862636

Observation: There is visible multicollinearity in the dataset. To bring these values down we can drop columns after running a statsmodels OLS: the statsmodels summary tells us which features do not contribute to the model, and removing those features reduces the VIF values. An ideal VIF value is below 5.
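A sketch of how the VIF values above can be computed with statsmodels, continuing from the split sketch:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per predictor, computed on the training design matrix
vif_data = X_train.astype(float)
for i, col in enumerate(vif_data.columns):
    print(col, "--->", variance_inflation_factor(vif_data.values, i))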

- Using STATSMODEL library

PYTHON OUTPUT:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.936
Model:                            OLS   Adj. R-squared:                  0.935
Method:                 Least Squares   F-statistic:                     952.4
Date:                Thu, 20 Jan 2022   Prob (F-statistic):          1.05e-305
Time:                        16:40:50   Log-Likelihood:                -3927.7
No. Observations:                 531   AIC:                             7873.
Df Residuals:                     522   BIC:                             7912.
Df Model:                           8
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     103.9314     42.150      2.466      0.014      21.128     186.735
capital         0.4062      0.042      9.651      0.000       0.323       0.489
patents        -4.6473      2.789     -1.666      0.096     -10.127       0.833
randd           0.6399      0.232      2.753      0.006       0.183       1.096
employment     78.6137      4.765     16.498      0.000      69.252      87.975
tobinq        -39.9258     12.145     -3.288      0.001     -63.784     -16.067
value           0.2446      0.026      9.592      0.000       0.195       0.295
institutions    0.2174      0.902      0.241      0.810      -1.555       1.990
sp500_no      -31.1003     25.504     -1.219      0.223     -81.203      19.003
sp500_yes     135.0318     49.490      2.728      0.007      37.808     232.256
==============================================================================
Omnibus:                      185.527   Durbin-Watson:                   1.966
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1284.253
Skew:                           1.351   Prob(JB):                    1.34e-279
Kurtosis:                      10.123   Cond. No.                     2.47e+19
==============================================================================
- OLS Regression Results after dropping the non-contributing variable

PYTHON OUTPUT:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.936
Model:                            OLS   Adj. R-squared:                  0.935
Method:                 Least Squares   F-statistic:                     952.4
Date:                Fri, 21 Jan 2022   Prob (F-statistic):          1.05e-305
Time:                        10:53:58   Log-Likelihood:                -3927.7
No. Observations:                 531   AIC:                             7873.
Df Residuals:                     522   BIC:                             7912.
Df Model:                           8
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     103.9314     42.150      2.466      0.014      21.128     186.735
capital         0.4062      0.042      9.651      0.000       0.323       0.489
patents        -4.6473      2.789     -1.666      0.096     -10.127       0.833
randd           0.6399      0.232      2.753      0.006       0.183       1.096
employment     78.6137      4.765     16.498      0.000      69.252      87.975
tobinq        -39.9258     12.145     -3.288      0.001     -63.784     -16.067
value           0.2446      0.026      9.592      0.000       0.195       0.295
institutions    0.2174      0.902      0.241      0.810      -1.555       1.990
sp500_no      -31.1003     25.504     -1.219      0.223     -81.203      19.003
sp500_yes     135.0318     49.490      2.728      0.007      37.808     232.256
==============================================================================
Omnibus:                      185.527   Durbin-Watson:                   1.966
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1284.253
Skew:                           1.351   Prob(JB):                    1.34e-279
Kurtosis:                      10.123   Cond. No.                     2.47e+19
==============================================================================

1.4 Inference: Based on these predictions, what are the business insights and
recommendations
Solution:

 The investment criteria for a new investor are mainly based on the capital invested in the company by the promoters; investors favour firms where the capital investment is strong, as is also reflected in the plots.
 To generate capital, a company should have a strong combination of attributes such as value, employment, sales and patents.
 The highest contributing attribute is employment, followed by patents.
 Re-running the model with statsmodels gives us p-values and coefficients, which provide a better understanding of the relationships; variables with p-values above 0.05 can be dropped and the model re-run for better results. Accuracy can be improved further by iteratively dropping such non-contributing columns.

PYTHON OUTPUT: The Equation


sales = 103.93 + (0.41 * capital) + (-4.65 * patents) + (0.64 * randd) + (78.61 * employment)
        + (-39.93 * tobinq) + (0.24 * value) + (0.22 * institutions)
        + (-31.1 * sp500_no) + (135.03 * sp500_yes)

Problem 2: Logistic Regression and LDA

Problem Statement: You are hired by the government to analyse car crashes. You are provided with details of car crashes, among which some people survived and some did not. You have to help the government predict whether a person will survive or not on the basis of the information given in the dataset, so as to provide insights that will help the government make stronger laws for car manufacturers to ensure safety measures. Also, find out the important factors on the basis of which you made your predictions.

Data Dictionary for Car_Crash:

1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+

2. weight: Observation weights, albeit of uncertain accuracy, designed to account for varying
sampling probabilities. (The inverse probability weighting estimator can be used to demonstrate
causality when the researcher cannot conduct a controlled experiment but has observed data to
model) for further information go to this link:
https://en.wikipedia.org/wiki/Inverse_probability_weighting

3. Survived: factor with levels Survived or not_survived

4. airbag: a factor with levels none or airbag

5. seatbelt: a factor with levels none or belted

6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact

7. sex: a factor with levels f: Female or m: Male

8. ageOFocc: age of occupant in years

9. yearacc: year of accident

10. yearVeh: Year of model of vehicle; a numeric vector

11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
nodeploy and unavail

12. occRole: a factor with levels driver or pass: passenger

13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags
deployed.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
Solution:

Loaded the necessary libraries for model building. Read the head and tail of the dataset to check whether the data has been loaded properly.

PYTHON OUTPUT:

HEAD OF THE DATA

PYTHON OUTPUT:

DROPPED UNWANTED COLUMN: “Unnamed: 0” (not required for analysis)
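A minimal sketch of this step; the file name Car_Crash.csv is an assumption based on the data-dictionary heading:

import pandas as pd

crash = pd.read_csv("Car_Crash.csv")            # assumed file name
crash = crash.drop("Unnamed: 0", axis=1)        # index-like column, not needed
print(crash.shape)                               # expected (11217, 15) after the drop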

PYTHON OUTPUT:

TAIL OF THE DATA

PYTHON OUTPUT:

SHAPE OF THE DATA

PYTHON OUTPUT:
(11217, 15)

INFO

PYTHON OUTPUT:

Observation:

 Null values are present in the dataset (“injSeverity”) and will be treated later.

 We have integer, float and object data types.

DATA DESCRIBE:

PYTHON OUTPUT:

Observation: We have integer and continuous data; “Survived” is our target variable.

NULL VALUE CHECK

PYTHON OUTPUT:

PERCENTAGE OF MISSING VALUE

PYTHON OUTPUT:

CHECK FOR DUPLICATES IN DATA

PYTHON OUTPUT:
Number of duplicate rows = 0

UNIQUE VALUE IN THE CATEGORICAL DATA

PYTHON OUTPUT:
DVCAT : 5
1-9km/h 282
55+ 809
40-54 1344
25-39 3368
10-24 5414
Name: dvcat, dtype: int64

SURVIVED : 2
Not_Survived 1180
survived 10037
Name: Survived, dtype: int64

AIRBAG : 2
none 4153
airbag 7064
Name: airbag, dtype: int64

SEATBELT : 2
none 3368
belted 7849
Name: seatbelt, dtype: int64

SEX : 2
f 5169
m 6048
Name: sex, dtype: int64

ABCAT : 3
nodeploy 2699
unavail 4153
deploy 4365
Name: abcat, dtype: int64

OCCROLE : 2
pass 2431
driver 8786
Name: occRole, dtype: int64

CASEID : 6488
5:41:1 1
49:228:2 1
8:126:2 1

43:166:2 1
5:50:1 1
..
75:84:2 6
49:156:1 6
74:74:2 6
49:106:1 6
73:100:2 7
Name: caseid, Length: 6488, dtype: int64

TREATED NULL VALUE – injSeverity & POST TREATMENT

PYTHON OUTPUT:
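The exact treatment is not shown in the report; since injSeverity is an ordinal code, mode imputation is one reasonable choice, sketched here as an assumption:

# Mode imputation for injSeverity (assumed method), then re-check for nulls
crash["injSeverity"] = crash["injSeverity"].fillna(crash["injSeverity"].mode()[0])
print(crash.isnull().sum())   # no nulls remain after treatment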

UNIVARIATE & BIVARIATE ANALYSIS

PYTHON OUTPUT:

Observation:

 Outliers are observed in the dataset for the year, age and weight variables.

 The data looks more or less positively skewed.

CATEGORICAL UNIVARIATE ANALYSIS

PYTHON OUTPUT:
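An illustrative sketch of the count plots analysed below, continuing from the loading sketch for this dataset:

import seaborn as sns
import matplotlib.pyplot as plt

# One count plot per categorical variable
for col in ["dvcat", "Survived", "airbag", "seatbelt", "sex", "abcat", "occRole"]:
    sns.countplot(x=crash[col])
    plt.title(col)
    plt.show()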

Dvcat

Observation: The most common estimated impact speed category is 10-24 km/h.

Survived

Observation: The Survived level (10,037 records) is far larger than Not_Survived (1,180 records), so the target classes are heavily imbalanced.

Airbag

Observation: Cars with an airbag (7,064, roughly 63%) outnumber those with none (4,153, roughly 37%). This suggests that about 37% of the cars involved in these crashes were not equipped with an airbag.

Seatbelt

Observation: Belted occupants (7,849, roughly 70%) outnumber unbelted occupants (3,368, roughly 30%), which suggests that about 30% of occupants in these crashes were not wearing a seatbelt.

Sex

Observation: Male occupants (6,048) are about 17% more numerous than female occupants (5,169) in the crash data.

abcat

Observation: The deploy (4,365) and unavail (4,153) levels are roughly similar in size. In about 61% of cases an airbag did not deploy (either unavailable or nodeploy), which is a major safety concern and could be directly related to the number of deaths.

OccRole

Observation: Drivers (8,786, roughly 78%) far outnumber passengers (2,431, roughly 22%) among the occupants in the data.

Survived vs Frontal

Observation: Approximately 25% of the people who survived had a non-frontal impact, while about 75% had a frontal impact. The not-survived counts look broadly similar for non-frontal and frontal impacts.

Survived vs airbag

Observation: Approximately 75% of the people who survived had an airbag in their car, while about 25% survived without one. The not-survived counts look broadly similar for airbag and none.

Survived vs Seatbelt

Observation: Approximately 75% of the people who survived were belted, while about 25% survived without a seatbelt. The not-survived counts look broadly similar for belted and none.

BIVARIATE ANALYSIS DATA DISTRIBUTION

PYTHON OUTPUT:

Observation: There is no strong correlation between the variables and the distributions look reasonably normal. There is no large difference in the data distribution across the Survived classes, and no clear separation into two distinct distributions.

Observation: No strong multicollinearity is visible in the data.

TREATING OUTLIERS

BEFORE OUTLIER TREATMENT

Observation: We have outliers in the dataset; since LDA is based on numerical computation, treating the outliers will help the model perform better.

PYTHON OUTPUT:
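A sketch of the outlier treatment, reusing the IQR capping approach from Problem 1 on the numeric columns (the report's exact method is not shown, so this is illustrative):

import numpy as np

# Cap numeric columns at the IQR whiskers
for col in crash.select_dtypes(include=np.number).columns:
    q1, q3 = crash[col].quantile(0.25), crash[col].quantile(0.75)
    iqr = q3 - q1
    crash[col] = crash[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)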

POST TREATING THE OUTLIER,

PYTHON OUTPUT:

Observation: No outliers in the data, all outliers have been treated.

2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Logistic Regression and LDA (linear discriminant
analysis).

Solution:

PYTHON OUTPUT: ENCODING CATEGORICAL VARIABLE

The encoding allows the logistic regression and LDA models to work with the categorical variables.
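The report does not show which encoding was used; a simple sketch using category codes for the object columns is one possibility:

import pandas as pd

# Convert the object (string) columns, including the target, to integer category codes
for col in crash.select_dtypes(include="object").columns:
    crash[col] = pd.Categorical(crash[col]).codes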

TRAIN / TEST SPLIT

PYTHON OUTPUT:

GRID SEARCH METHOD: Grid search is used with logistic regression to find the optimal solver and the associated hyperparameters.

PYTHON OUTPUT:

The grid search selects the liblinear solver, which is suitable for small datasets; the tolerance and penalty values were also chosen via grid search. We then predict on the training data.
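A sketch of the 70:30 split and the grid search described above; the parameter grid and random_state are assumptions, since the report does not list them:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X = crash.drop("Survived", axis=1)
y = crash["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Illustrative grid; the report only states that liblinear, a tolerance and a penalty were selected
param_grid = {"solver": ["liblinear"],
              "penalty": ["l1", "l2"],
              "tol": [1e-4, 1e-3, 1e-2]}
grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

best_lr = grid.best_estimator_
print(grid.best_params_)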

2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model. Compare both the models and write inferences, which model is
best/optimized.
Solution:

PYTHON OUTPUT: CLASSIFICATION REPORT & CONFUSION MATRIX FOR TRAIN DATA


              precision    recall  f1-score   support

           0       0.93      0.89      0.91       826
           1       0.99      0.99      0.99      7025

    accuracy                           0.98      7851
   macro avg       0.96      0.94      0.95      7851
weighted avg       0.98      0.98      0.98      7851

PYTHON OUTPUT: CLASSIFICATION REPORT & CONFUSION MATRIX FOR TEST DATA
              precision    recall  f1-score   support

           0       0.93      0.89      0.91       354
           1       0.99      0.99      0.99      3012

    accuracy                           0.98      3366
   macro avg       0.96      0.94      0.95      3366
weighted avg       0.98      0.98      0.98      3366

PYTHON OUTPUT: ACCURACY – TRAINING DATA


0.9811488982295249

PYTHON OUTPUT: AUC, ROC CURVE FOR TRAIN DATA


AUC: 0.987

PYTHON OUTPUT: ACCURACY FOR TEST DATA
0.9815805109922757

PYTHON OUTPUT: AUC, ROC CURVE FOR TEST DATA


AUC: 0.987
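A sketch of how the accuracy, confusion matrix, classification report and ROC/AUC figures above can be produced, continuing from the grid-search sketch:

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
import matplotlib.pyplot as plt

for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = best_lr.predict(X_part)
    prob = best_lr.predict_proba(X_part)[:, 1]
    print(name, "accuracy:", accuracy_score(y_part, pred))
    print(confusion_matrix(y_part, pred))
    print(classification_report(y_part, pred))
    print(name, "AUC:", roc_auc_score(y_part, prob))
    fpr, tpr, _ = roc_curve(y_part, prob)
    plt.plot(fpr, tpr, label=name)

plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()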

PYTHON OUTPUT: LDA

PYTHON OUTPUT: PREDICTING THE VARIABLE

PYTHON OUTPUT: MODEL SCORE (Train)


0.9582218825627309

PYTHON OUTPUT: CLASSIFICATION REPORT & CONFUSION MATRIX FOR TRAIN DATA
              precision    recall  f1-score   support

           0       0.83      0.75      0.79       826
           1       0.97      0.98      0.98      7025

    accuracy                           0.96      7851
   macro avg       0.90      0.87      0.88      7851
weighted avg       0.96      0.96      0.96      7851

PYTHON OUTPUT: MODEL SCORE (TEST)


0.9557338086749851

PYTHON OUTPUT: CLASSIFICATION REPORT & CONFUSION MATRIX FOR TEST DATA
              precision    recall  f1-score   support

           0       0.81      0.76      0.78       354
           1       0.97      0.98      0.98      3012

    accuracy                           0.96      3366
   macro avg       0.89      0.87      0.88      3366
weighted avg       0.95      0.96      0.96      3366
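A sketch of the LDA fit and scoring that produce the figures above:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, confusion_matrix

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

print(lda.score(X_train, y_train))   # model score on train (~0.958 reported above)
print(lda.score(X_test, y_test))     # model score on test  (~0.956 reported above)
print(confusion_matrix(y_test, lda.predict(X_test)))
print(classification_report(y_test, lda.predict(X_test)))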

PYTHON OUTPUT: CHANGING THE CUT-OFF VALUE TO FIND THE OPTIMAL VALUE THAT GIVES BETTER ACCURACY AND F1 SCORE
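A sketch of the cut-off sweep that produces the accuracy and F1 figures listed below, assuming class 1 corresponds to survived:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Predicted probability of class 1 on the training data from the LDA model
probs = lda.predict_proba(X_train)[:, 1]

for cutoff in np.arange(0.1, 1.0, 0.1):
    preds = (probs > cutoff).astype(int)
    print(round(cutoff, 1),
          "Accuracy Score", round(accuracy_score(y_train, preds), 4),
          "F1 Score", round(f1_score(y_train, preds), 4))
    print(confusion_matrix(y_train, preds))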
0.1

Accuracy Score 0.941


F1 Score 0.9681

Confusion Matrix

0.2

Accuracy Score 0.9531


F1 Score 0.9744

Confusion Matrix

0.3

Accuracy Score 0.9585


F1 Score 0.9772

Confusion Matrix

0.4

Accuracy Score 0.9595


F1 Score 0.9776

Confusion Matrix

0.5

Accuracy Score 0.9582


F1 Score 0.9768

Confusion Matrix

0.6

Accuracy Score 0.9533


F1 Score 0.9738

Confusion Matrix

0.7

Accuracy Score 0.9443


F1 Score 0.9685

Confusion Matrix

0.8

Accuracy Score 0.9303


F1 Score 0.9601

Confusion Matrix

0.9

Accuracy Score 0.8953


F1 Score 0.9384

Confusion Matrix

PYTHON OUTPUT: AUC AND ROC CURVE


AUC for the Training Data: 0.975
AUC for the Test Data: 0.974

Observation: Comparing the two models, the results are almost the same, but logistic regression performs slightly better on this dataset.

2.4 Inference: Based on these predictions, what are the insights and
recommendations.

 The accuracy of the logistic regression model on the training data and the testing data is almost the same, i.e. about 98%.
 Similarly, the AUC of the logistic regression model is nearly identical for the training and testing data.
 The other confusion matrix parameters of the logistic regression model are also similar across train and test, so the model does not appear to be overfitted.
 We applied Grid Search CV to tune the model, and the resulting F1 scores on the training and test data were almost identical.
 In the case of LDA, the AUC for training and testing data is also nearly the same (about 97%), and the other confusion matrix parameters of the LDA model are likewise similar, so this model does not appear to be overfitted either.
 Overall, we can conclude that the logistic regression model is best suited for this dataset, given its higher accuracy compared to Linear Discriminant Analysis.

