Predictive_Modelling_Alternate_Project_Business_Case.docx
Problem Statement: You are a part of an investing firm and your work is to do research about these
759 firms. You are provided with the dataset containing the sales and other attributes of these 759
firms. Predict the sales of these firms on the basis of the details given in the dataset so as to help
your company invest consciously. Also, provide them with the 5 attributes that are most important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
data types, shape, EDA). Perform Univariate and Bivariate Analysis.
1.2 Impute null values if present? Do you think scaling is necessary in this case?
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test and train
(70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on
Train and Test sets using Rsquare, RMSE.
1.4 Inference: Based on these predictions, what are the business insights and recommendations.
Problem Statement: You are hired by the Government to do analysis on car crashes. You are provided
details of car crashes, among which some people survived and some didn't. You have to help the
government in predicting whether a person will survive or not on the basis of the information given
in the data set, so as to provide insights that will help the government make stronger laws for car
manufacturers to ensure safety measures. Also, find out the important factors on the basis of which
you made your predictions.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both
the models and write inferences, which model is best/optimized.
2.4 Inference: Based on these predictions, what are the insights and recommendations.
Solution
Problem 1: Linear Regression
Problem Statement: You are a part of an investing firm and your work is to do research about these
759 firms. You are provided with the dataset containing the sales and other attributes of these 759
firms. Predict the sales of these firms on the basis of the details given in the dataset so as to help
your company invest consciously. Also, provide them with the 5 attributes that are most important.
Data Dictionary:
Solution:
Now, reading the head and tail of the dataset to check whether the data has been loaded properly.
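Below is a minimal sketch of this ingestion and first-look step; the file name Firm_level_data.csv is an assumption, since the actual file name is not shown in the report.
PYTHON CODE (illustrative sketch):
import pandas as pd

# Load the firm-level dataset (file name is an assumption)
df = pd.read_csv("Firm_level_data.csv")

# Head and tail to confirm the data has been read in correctly
print(df.head())
print(df.tail())

# Shape, data types and null-value counts
print(df.shape)
print(df.info())
print(df.isnull().sum())

# Descriptive statistics for the continuous variables
print(df.describe().T)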
PYTHON OUTPUT: DATA DESCRIPTION
Observation:
The continuous variables are sales, capital, patents, randd, employment, tobinq, value and institutions.
PYTHON OUTPUT: NULL VALUE TREATMENT
Updated Dataframe:
       sales         capital       patents  randd        employment  sp500
0      826.995050    161.603986    10       382.078247   2.306000    no
1      407.753973    122.101012    2        0.000000     1.860000    no
2      8407.845588   6221.144614   138      3296.700439  49.659005   yes
3      451.000010    266.899987    1        83.540161    3.071000    no
4      174.927981    140.124004    2        14.233637    1.947000    no
..     ...           ...           ...      ...          ...         ...
754    1253.900196   708.299935    32       412.936157   22.100002   yes
755    171.821025    73.666008     1        0.037735     1.684000    no
756    202.726967    123.926991    13       74.861099    1.460000    no
757    785.687944    138.780992    6        0.621750     2.900000    yes
758    22.701999     14.244999     5        18.574360    0.197000    no
(remaining columns truncated in this output)
PYTHON OUTPUT: UNIVARIATE / BIVARIATE ANALYSIS
Observation:
Employment, tobinq, value, randd, sales and patents have multiple outliers in the data.
Tobinq, value, randd, sales and patents are positively skewed.
Institutions ranges roughly from 25 to 60.
Tobinq ranges roughly from 0.5 to 3.
Value, employment, patents, randd, capital and sales are heavily concentrated near the low end of their ranges, consistent with the positive skew.
PYTHON OUTPUT: BIVARIATE DATA DISTRIBUTION WITH HUE AS SALES
Observation: There is no strong correlation apparent between the variables, and the distributions look fairly normal. There is no huge difference in the data distribution across sales; I don't see two clearly different distributions in the data. Multiple outliers are observed, which need to be treated.
Observation: No severe multicollinearity is apparent from the correlation heatmap.
1.2 Impute null values if present? Do you think scaling is necessary in this case?
Solution:
Observation: We do have null values in tobinq. To fix them we can use mean or median imputation. Since the percentage of null values is less than 5%, we could also simply drop these rows. After imputing the mean, there are no null values left in the dataset.
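A minimal sketch of the imputation described above, assuming the null values are confined to the tobinq column:
PYTHON CODE (illustrative sketch):
# Confirm which columns contain nulls (only tobinq is expected to)
print(df.isnull().sum())

# Mean imputation for tobinq; median imputation or dropping the rows
# would also be reasonable since the nulls are below 5% of the data
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].mean())

# Confirm that no nulls remain
print(df.isnull().sum().sum())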
Observation: Scaling isn't required here, as the dataset features look to be in a more or less fixed range. Also, there isn't any severe multicollinearity in the data. There are multiple outliers present in the dataset which need to be treated before we proceed with the modelling.
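The exact outlier-treatment code is not shown in the report; the sketch below assumes the common IQR-capping approach, where values beyond 1.5 times the IQR are clipped to the whisker limits.
PYTHON CODE (illustrative sketch):
import numpy as np

def cap_outliers(series):
    # Clip values outside the 1.5*IQR whiskers to the whisker limits
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = df[numeric_cols].apply(cap_outliers)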
PYTHON OUTPUT: POST OUTLIERS TREATMENT
PYTHON OUTPUT: HEATMAP POST OUTLIER TREATMENT
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into
test and train (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using Rsquare, RMSE.
Solution:
Observation: Dummy variables have been created. The linear regression model does not accept categorical (string) values, so we have encoded the categorical values as numeric dummy columns.
Observation: Columns that were not required were already dropped at the initial stage. (Train/Test split; a sketch of the encoding and split follows the column list below.)
Index(['sales', 'capital', 'patents', 'randd', 'employment', 'tobinq', 'value',
       'institutions', 'sp500_no', 'sp500_yes'],
      dtype='object')
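A sketch of the encoding and 70:30 split that produces the column list above; the random_state value is an assumption.
PYTHON CODE (illustrative sketch):
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the string column sp500 into sp500_no / sp500_yes
df_encoded = pd.get_dummies(df, columns=["sp500"])

X = df_encoded.drop("sales", axis=1)
y = df_encoded["sales"]

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)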
- Invoked the Linear Regression function and fitted the model on the training data.
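A minimal sketch of fitting the model and computing the intercept, R-square and RMSE values reported below:
PYTHON CODE (illustrative sketch):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Intercept:", lr.intercept_)
print("Train R-square:", lr.score(X_train, y_train))
print("Test R-square:", lr.score(X_test, y_test))
print("Train RMSE:", np.sqrt(mean_squared_error(y_train, lr.predict(X_train))))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, lr.predict(X_test))))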
PYTHON OUTPUT:
LinearRegression()
The intercept for our model is 155.8971701239957
R-square on the training data: 0.9358806629736066
R-square on the test data: 0.924129439335239
RMSE on the training data: 394.6129494572075
RMSE on the test data: 399.74321332112794
- VIF VALUES
PYTHON OUTPUT:
capital ---> 5.884834435358601
patents ---> 2.5564811032960173
randd ---> 2.9241166081719343
employment ---> 5.289087439090918
tobinq ---> 1.4736588698814541
value ---> 6.0730692748610045
institutions ---> 1.2923225457814675
sp500_no ---> 5.627713456806028
sp500_yes ---> 7.007866608862636
Observation: There is visible multicollinearity in the dataset. To bring these values down we can drop columns after fitting a stats model: from the stats model summary we can identify the features that do not contribute to the model, and once those features are removed the VIF values will be reduced. The ideal VIF value is less than 5.
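A sketch of how the VIF values above and the statsmodels OLS summary below can be obtained; the formula string simply lists the encoded columns and is an assumption.
PYTHON CODE (illustrative sketch):
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each predictor in the training data
exog = X_train.values.astype(float)
for i, col in enumerate(X_train.columns):
    print(col, "--->", variance_inflation_factor(exog, i))

# OLS with statsmodels to obtain coefficients and p-values
data_train = X_train.join(y_train)
formula = ("sales ~ capital + patents + randd + employment + tobinq"
           " + value + institutions + sp500_no + sp500_yes")
ols_model = smf.ols(formula=formula, data=data_train).fit()
print(ols_model.summary())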
PYTHON OUTPUT:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.936
Model:                            OLS   Adj. R-squared:                  0.935
Method:                 Least Squares   F-statistic:                     952.4
Date:                Thu, 20 Jan 2022   Prob (F-statistic):          1.05e-305
Time:                        16:40:50   Log-Likelihood:                -3927.7
No. Observations:                 531   AIC:                             7873.
Df Residuals:                     522   BIC:                             7912.
Df Model:                           8
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      103.9314     42.150      2.466      0.014      21.128     186.735
capital          0.4062      0.042      9.651      0.000       0.323       0.489
patents         -4.6473      2.789     -1.666      0.096     -10.127       0.833
randd            0.6399      0.232      2.753      0.006       0.183       1.096
employment      78.6137      4.765     16.498      0.000      69.252      87.975
tobinq         -39.9258     12.145     -3.288      0.001     -63.784     -16.067
value            0.2446      0.026      9.592      0.000       0.195       0.295
institutions     0.2174      0.902      0.241      0.810      -1.555       1.990
sp500_no       -31.1003     25.504     -1.219      0.223     -81.203      19.003
sp500_yes      135.0318     49.490      2.728      0.007      37.808     232.256
==============================================================================
Omnibus:                      185.527   Durbin-Watson:                   1.966
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1284.253
Skew:                           1.351   Prob(JB):                    1.34e-279
Kurtosis:                      10.123   Cond. No.                     2.47e+19
==============================================================================
1.4 Inference: Based on these predictions, what are the business insights and
recommendations.
Solution:
The investment criteria for any new investor are mainly based on the capital invested in
the company by the promoters, and investors favour firms where the capital investment is
good, as is also reflected in the plots.
To generate capital the company should have a combination of attributes such as value,
employment, sales and patents.
The highest contributing attribute is employment, followed by patents.
Using the stats model we can examine the p-values and coefficients, which give a better
understanding of the relationships: variables with p-values above 0.05 can be dropped and
the model re-run iteratively for better results.
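As a sketch of the suggested re-run, the predictors whose p-values exceed 0.05 in the summary above (institutions, sp500_no and patents) could be dropped and the OLS refitted; this builds on the statsmodels sketch shown earlier.
PYTHON CODE (illustrative sketch):
# Refit after dropping the predictors with p-values above 0.05
reduced_formula = "sales ~ capital + randd + employment + tobinq + value + sp500_yes"
reduced_model = smf.ols(formula=reduced_formula, data=data_train).fit()
print(reduced_model.summary())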
Problem 2: Logistic Regression and LDA
Problem Statement: You are hired by the Government to do analysis on car crashes. You are provided
details of car crashes, among which some people survived and some didn't. You have to help the
government in predicting whether a person will survive or not on the basis of the information given
in the data set, so as to provide insights that will help the government make stronger laws for car
manufacturers to ensure safety measures. Also, find out the important factors on the basis of which
you made your predictions.
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for varying
sampling probabilities. (The inverse probability weighting estimator can be used to demonstrate
causality when the researcher cannot conduct a controlled experiment but has observed data to
model.) For further information see
https://en.wikipedia.org/wiki/Inverse_probability_weighting
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
nodeploy and unavail
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags
deployed.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
Solution:
Loaded the necessary libraries for the model building and read the head and tail of the dataset to check whether the data has been loaded properly.
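A minimal sketch of this ingestion step; the file name Car_Crash.csv is an assumption.
PYTHON CODE (illustrative sketch):
import pandas as pd

# Load the car-crash dataset (file name is an assumption)
crash = pd.read_csv("Car_Crash.csv")

# Head, tail, shape, data types and descriptive statistics
print(crash.head())
print(crash.tail())
print(crash.shape)        # expected: (11217, 15)
print(crash.info())
print(crash.describe(include="all").T)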
PYTHON OUTPUT: SHAPE OF THE DATASET
(11217, 15)
INFO
PYTHON OUTPUT:
DATA DESCRIBE:
PYTHON OUTPUT:
Observation: We have integer and continuous data as well as string columns; "Survived" is our target variable.
NULL VALUE CHECK
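A minimal sketch of the null-value and duplicate checks whose output is shown below:
PYTHON CODE (illustrative sketch):
# Null values per column and duplicate-row count
print(crash.isnull().sum())
print("Number of duplicate rows =", crash.duplicated().sum())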
PYTHON OUTPUT:
Number of duplicate rows = 0
UNIQUE VALUE IN THE CATEGORICAL DATA
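A sketch of the loop that could produce the unique-value listing below for every string column:
PYTHON CODE (illustrative sketch):
# Number of unique values and their frequencies for each object column
for col in crash.select_dtypes(include="object").columns:
    print(col.upper(), ":", crash[col].nunique())
    print(crash[col].value_counts().sort_values())
    print()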
PYTHON OUTPUT:
DVCAT : 5
1-9km/h 282
55+ 809
40-54 1344
25-39 3368
10-24 5414
Name: dvcat, dtype: int64
SURVIVED : 2
Not_Survived 1180
survived 10037
Name: Survived, dtype: int64
AIRBAG : 2
none 4153
airbag 7064
Name: airbag, dtype: int64
SEATBELT : 2
none 3368
belted 7849
Name: seatbelt, dtype: int64
SEX : 2
f 5169
m 6048
Name: sex, dtype: int64
ABCAT : 3
nodeploy 2699
unavail 4153
deploy 4365
Name: abcat, dtype: int64
OCCROLE : 2
pass 2431
driver 8786
Name: occRole, dtype: int64
CASEID : 6488
5:41:1 1
49:228:2 1
8:126:2 1
43:166:2 1
5:50:1 1
..
75:84:2 6
49:156:1 6
74:74:2 6
49:106:1 6
73:100:2 7
Name: caseid, Length: 6488, dtype: int64
UNIVARIATE & BIVARIATE ANALYSIS
PYTHON OUTPUT:
Observation: The data looks more or less positively skewed.
PYTHON OUTPUT:
Dvcat
Observation: The 10-24 km/h estimated impact-speed category has the highest count of crashes.
Survived
Observation: The 'survived' level is far more frequent (approx. 10,000) than 'Not_Survived' (approx. 1,200), so the target classes are imbalanced.
Airbag
Observation: About 63% of the cars had an airbag fitted, while about 37% did not, so a sizeable share of the cars involved in these crashes lacked this safety feature.
Seatbelt
Observation: About 70% of the occupants were belted, while about 30% were not belted at the time of the crash.
Sex
Observation: Males (m) appear about 17% more often than females (f) in these crash records.
abcat
Observation: The counts for 'deploy' and 'unavail' are more or less similar. In roughly 61% of the records an airbag did not deploy (it was either unavailable or failed to deploy), which is a major safety concern and could have a direct bearing on the number of deaths.
OccRole
Observation: Drivers account for approx. 78% of the records and passengers for approx. 22%.
Survived vs Frontal
Observation: Among the survivors, approx. 25% had a non-frontal impact while approx. 75% had a frontal impact. The non-survivors look more or less evenly split between non-frontal and frontal impacts.
Survived vs airbag
Observation: Among the survivors, approx. 75% had an airbag in their car while approx. 25% did not. The non-survivors look more or less evenly split between airbag and none.
Survived vs Seatbelt
Observation: Among the survivors, approx. 75% were belted while approx. 25% were not. The non-survivors look more or less evenly split between belted and none.
PYTHON OUTPUT:
Observation: There is no strong correlation apparent between the variables, and the distributions look fairly normal. There is no huge difference in the data distribution across the survived classes; I don't see two clearly different distributions in the data.
TREATING OUTLIERS
Observation: We have outliers in the dataset. As LDA is based on numerical computation, treating the outliers will help the model perform better.
PYTHON OUTPUT:
POST TREATING THE OUTLIERS
PYTHON OUTPUT:
2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Logistic Regression and LDA (linear discriminant
analysis).
Solution:
Encoding the string-valued columns is required so that the logistic regression and LDA models can work with the data and produce better results.
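A sketch of the encoding and 70:30 split, assuming the high-cardinality identifier caseid is dropped, the target is mapped to 0/1, and the remaining string columns are one-hot encoded; drop_first, random_state and stratify are assumptions.
PYTHON CODE (illustrative sketch):
import pandas as pd
from sklearn.model_selection import train_test_split

# Drop the identifier column and encode the target as 0/1
crash_model = crash.drop("caseid", axis=1)
crash_model["Survived"] = crash_model["Survived"].map(
    {"survived": 1, "Not_Survived": 0})

# One-hot encode the remaining string columns
crash_encoded = pd.get_dummies(crash_model, drop_first=True)

X = crash_encoded.drop("Survived", axis=1)
y = crash_encoded["Survived"]

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)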
PYTHON OUTPUT:
GRID SEARCH METHOD: Grid search is used with logistic regression to find the optimal solver and the associated hyper-parameters.
PYTHON OUTPUT:
The grid search method selects the liblinear solver, which is suitable for small datasets. The tolerance and penalty have also been chosen using the grid search method (a sketch follows below). We then predict on the training data.
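A sketch of the grid search and of the LDA fit; the exact parameter grid, scoring metric and number of folds are assumptions.
PYTHON CODE (illustrative sketch):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Candidate solvers, penalties and tolerances (grid is an assumption)
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "tol": [1e-4, 1e-5, 1e-6]},
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "tol": [1e-4, 1e-5, 1e-6]},
]
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, scoring="f1", cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)          # e.g. the liblinear solver
log_model = grid.best_estimator_

# Linear Discriminant Analysis on the same split
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)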
2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model. Compare both the models and write inferences, which model is
best/optimized.
Solution:
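A minimal sketch of how the accuracy, confusion matrix, classification report, ROC curve and ROC_AUC score could be computed for both models on the train and test sets; variable names follow the earlier sketches.
PYTHON CODE (illustrative sketch):
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_curve, roc_auc_score)

for name, model in [("Logistic Regression", log_model), ("LDA", lda_model)]:
    for label, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
        pred = model.predict(X_)
        prob = model.predict_proba(X_)[:, 1]
        print(name, label, "accuracy:", accuracy_score(y_, pred))
        print(confusion_matrix(y_, pred))
        print(classification_report(y_, pred))
        print(name, label, "ROC_AUC:", roc_auc_score(y_, prob))
        fpr, tpr, _ = roc_curve(y_, prob)
        plt.plot(fpr, tpr, label=name + " (" + label + ")")

plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()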
PYTHON OUTPUT: CONFUSION MATRIX FOR TEST DATA
precision recall f1-score support
PYTHON OUTPUT: ACCURACY FOR TEST DATA
0.9815805109922757
PYTHON OUTPUT: PREDICTING THE VARIABLE
PYTHON OUTPUT: CLASSIFICATION REPORT & CONFUSION MATRIX FOR TRAIN DATA
precision recall f1-score support
PYTHON OUTPUT: CLASSIFICATION REPORT & CONFUSION MATRIX FOR TEST DATA
precision recall f1-score support
PYTHON OUTPUT: CHANGING THE CUT-OFF VALUE TO CHECK THE OPTIMAL VALUE THAT GIVES BETTER ACCURACY AND F1 SCORE
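A sketch of the cut-off tuning step: the predicted probabilities are thresholded at 0.1 through 0.9 and the confusion matrix and F1-score are recomputed at each cut-off.
PYTHON CODE (illustrative sketch):
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

test_prob = log_model.predict_proba(X_test)[:, 1]
for cutoff in np.arange(0.1, 1.0, 0.1):
    pred = (test_prob > cutoff).astype(int)
    print(round(cutoff, 1))
    print("Confusion Matrix")
    print(confusion_matrix(y_test, pred))
    print("F1 score:", f1_score(y_test, pred))
    print()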
0.1
Confusion Matrix
0.2
Confusion Matrix
0.3
Confusion Matrix
0.4
Confusion Matrix
0.5
Confusion Matrix
0.6
Confusion Matrix
0.7
Confusion Matrix
0.8
Confusion Matrix
0.9
Confusion Matrix
Observation: Comparing both models, the results are almost the same, but logistic regression performs marginally better on this categorical target.
2.4 Inference: Based on these predictions, what are the insights and
recommendations.
The accuracy of the logistic regression model on both the training data and the testing data is
almost the same, i.e. about 98%.
Similarly, the AUC of the logistic regression model is nearly identical on the training and testing data.
The other confusion-matrix parameters of the logistic regression model are also similar across
train and test, so we can presume that the model generalises well and is not over-fitted.
We therefore applied GridSearchCV to hyper-tune the model, after which the F1-scores on the
training and test data were almost the same.
In the case of LDA, the AUC for the testing and training data is also the same, at about 96%;
the other confusion-matrix parameters of the LDA model are likewise similar, showing that this
model generalises well too.
Overall, we can conclude that the logistic regression model is best suited for this data set,
given its slightly higher accuracy compared to Linear Discriminant Analysis.