Predictivemodellingproject Report Vijay Borade Aug2023

This document describes a predictive modeling project involving linear regression and logistic regression. The project uses a computer system activity dataset to 1) build a linear regression model to predict the percentage of time CPUs operate in user mode and 2) build classification models like logistic regression and LDA to predict system failures. The project follows the steps of data exploration, preprocessing, model building and evaluation to analyze the data and develop and assess predictive models.

Uploaded by

borade.vijay
Copyright
© All Rights Reserved

ABSTRACT

This project involves two main problems: Linear Regression, and Logistic Regression coupled with Linear Discriminant Analysis (LDA). The project follows a structured approach encompassing data exploration, preprocessing, model building, and evaluation using appropriate statistical and machine learning techniques.

Vijay Arjun Borade

DSBA Aug 2023

PREDICTIVE MODELLING PROJECT REPORT
Batch - Aug 2023

PREDICTIVE MODELING PROJECT REPORT – BATCH AUG 2023 (VIJAY BORADE)

Context and Problem Statement:

The objective is to establish a linear equation to predict the percentage of time CPUs operate in user mode ('usr'). This involves analyzing various system attributes to understand their influence on the system's 'usr' mode.

Approach:
1. Data Description: Describe the dataset attributes related to system measures and 'usr' mode.
2. Exploratory Data Analysis (EDA): Explore correlations between system attributes and 'usr' mode.
3. Model Development: Develop a linear regression model to predict 'usr'.
4. Model Evaluation: Assess model performance using appropriate evaluation metrics.
5. Insights and Recommendations: Highlight influential factors and recommendations for optimizing 'usr' mode.

Problem Statement 1: Linear Regression

The comp-active database is a collection of computer system activity measures. The data was collected from a Sun SPARCstation 20/712 with 128 Mbytes of memory running in a multi-user university department. Users would typically be doing a large variety of tasks, ranging from accessing the internet and editing files to running very CPU-bound programs.

As a budding data scientist, you want to find a linear equation to build a model that predicts 'usr' (portion of time (%) that CPUs run in user mode), and to find out how each attribute affects the system being in 'usr' mode, using a list of system attributes.


Contents:
List of Tables
List of Figures
Problem Statement 1: Linear Regression
Data Description
1.1
EDA
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Outlier Detection & Treatment
1.2
Null Value Treatment, Feature Engineering
1.3
Encoding
Train – Test Split & Model Building
Model Performance
1.4
Business Insights
Problem Statement 2: Classification
Data Description
2.1
EDA
Univariate Analysis
Bivariate Analysis
2.2
Encoding
Train – Test Split
Model Building
Logistic Regression Model
Optimized Logistic Regression Model
LDA Model
CART Model
2.3
Model Evaluation and Performance
2.4
Business Insights & Recommendations


List of Tables
Table 1: Data Description – Dataset 1
Table 2: Data Summary
Table 3: Encoding Data
Table 4: Data Description
Table 5: Encoding Data
Table 6: Classification report – Logistic regression model 1 – Train
Table 7: Classification report – Logistic regression model 1 – Test
Table 8: Classification report – Optimized Logistic regression model – Test
Table 9: Classification report – Optimized Logistic regression model – Train
Table 10: Classification report – LDA Model – Train
Table 11: Classification report – LDA Model – Test
Table 12: Classification report – CART model – Train
Table 13: Classification report – CART model – Test
Table 14: Important features from CART model
Table 15: Different model parameters

List of Figures
Figure 1: Univariate Analysis (usr)
Figure 2: Univariate Analysis (lread)
Figure 3: Univariate Analysis (lwrite)
Figure 4: Univariate Analysis (srw rate)
Figure 5: Bivariate Analysis (usr vs lread)
Figure 6: Bivariate Analysis (usr vs lwrite)
Figure 7: Pairplot
Figure 8: Correlation Heatmap
Figure 9: Boxplot for outlier detection
Figure 10: Statsmodel Model 1
Figure 11: Statsmodel Model 2
Figure 12: Scatterplot – Actual vs predicted
Figure 13: Histogram (wife age)
Figure 15: Histogram (No. of children born)
Figure 16: Countplot (wife education)
Figure 17: Husband education
Figure 18: Countplot Wife religion
Figure 19: Countplot (wife working)
Figure 20: Countplot (Standard of living index)
Figure 21: Countplot (Contraceptive method used)
Figure 22: Contraceptive used vs Wife age
Figure 23: Contraceptive used vs wife education
Figure 24: Contraceptive used vs Husband education
Figure 25: Contraceptive used vs No. of Children
Figure 26: Contraceptive used vs Wife religion
Figure 27: Contraceptive used vs Wife working
Figure 28: Contraceptive used vs Husband occupation
Figure 29: Contraceptive used vs Standard of living index
Figure 30: Contraceptive used vs Media Exposure
Figure 31: Pairplot
Figure 32: ROC Curve – Logistic regression model – Train
Figure 33: ROC Curve – Logistic regression model – Test
Figure 34: Important features
Figure 35: ROC curve – Optimized Logistic Regression model – Train
Figure 36: ROC curve – Optimized Logistic Regression – Test
Figure 37: ROC curve – LDA model – Train
Figure 38: ROC curve – LDA model – Test
Figure 39: ROC curve – CART model – Train
Figure 40: ROC curve – CART model – Test
Figure 41: Confusion matrices of all models (Train data)
Figure 42: Confusion matrices of all models (Test data)

Problem Statement 1: Linear Regression

The comp-active database is a collection of computer system activity measures. The data was collected from a Sun SPARCstation 20/712 with 128 Mbytes of memory running in a multi-user university department. Users would typically be doing a large variety of tasks, ranging from accessing the internet and editing files to running very CPU-bound programs.

As a budding data scientist, you want to find a linear equation to build a model that predicts 'usr' (portion of time (%) that CPUs run in user mode), and to find out how each attribute affects the system being in 'usr' mode, using a list of system attributes.


Data Description
System measures used:
Name | Description | Data Type
lread | Reads (transfers per second) between system memory and user memory | Integer
lwrite | Writes (transfers per second) between system memory and user memory |
scall | Number of system calls of all types per second |
sread | Number of system read calls per second |
swrite | Number of system write calls per second |
fork | Number of system fork calls per second |
exec | Number of system exec calls per second |
rchar | Number of characters transferred per second by system read calls |
wchar | Number of characters transferred per second by system write calls |
pgout | Number of page-out requests per second |
ppgout | Number of pages paged out per second |
pgfree | Number of pages per second placed on the free list |
pgscan | Number of pages checked if they can be freed per second |
atch | Number of page attaches (satisfying a page fault by reclaiming a page in memory) per second |
pgin | Number of page-in requests per second |
ppgin | Number of pages paged in per second |
pflt | Number of page faults caused by protection errors (copy-on-writes) |
vflt | Number of page faults caused by address translation |
runqsz | Process run queue size (the number of kernel threads in memory that are waiting for a CPU to run. Typically, this value should be less than 2. Consistently higher values mean that the system might be CPU-bound.) |
freemem | Number of memory pages available to user processes |
freeswap | Number of disk blocks available for page swapping |
usr | Portion of time (%) that CPUs run in user mode |
Table 1: Data Description – Dataset 1

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the data types, shape, EDA, 5-point summary.) Perform Univariate, Bivariate, and Multivariate Analysis.

EDA

The data is imported, and the following are the observations:

➢ The data has 8192 rows and 22 columns. There is 1 object-type column; the rest are float and int data types.
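The checks listed in this step (shape, data types, duplicates, missing values, summary statistics) can be sketched as below. This is a minimal illustration rather than the report's own code: the rows in `sample` are made-up stand-ins for the real 8192 x 22 dataset, and the helper name `summarize` is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic checks described above for any DataFrame."""
    return {
        "shape": df.shape,
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        "n_duplicates": int(df.duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
    }


# Hypothetical stand-in rows; the real dataset has 8192 rows and 22 columns.
sample = pd.DataFrame(
    {
        "lread": [1, 0, 15, 0],
        "rchar": [40671.0, np.nan, 448.0, 1050.0],
        "runqsz": ["CPU_Bound", "Not_CPU_Bound", "Not_CPU_Bound", "CPU_Bound"],
        "usr": [95, 97, 87, 98],
    }
)

report = summarize(sample)
print(report)
print(sample.describe().T)  # count, mean, std, min, quartiles, max per numeric column
```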


Table 2 – Data Summary


• For the majority of observations, the process run queue size was 'Not CPU Bound'.
• On average, the CPUs run in user mode 83.9% of the time.
• There are no duplicate rows in the data.
• There are a few missing values in the variables 'rchar' & 'wchar'; these are treated later.
• A new feature, the 'System Read-Write rate', can be calculated from the number of system read and write calls per second, and can then replace those two features.

Univariate Analysis

Figure 1: Univariate Analysis (usr)

➢ The CPUs either run in user mode 80 to 90% of the time or stay idle.

Figure 2: Univariate Analysis (lread)

Figure 3: Univariate Analysis (lwrite)


Figure 4: Univariate Analysis (srw rate)

➢ The transfers per second for reads and writes are fairly fast, as the majority of the transfers happen quickly.
➢ The system read-write rate is also quick, and the majority of the values are under 5%.
➢ It seems that not much activity is taking place on the system.


Bivariate Analysis

Figure 5: Bivariate Analysis (usr vs lread)

➢ Two unusual spikes can be seen when comparing the number of reads per second with the percentage of time the CPUs run in user mode.

Figure 6: Bivariate Analysis (usr vs lwrite)

➢ Similarly for writes: when the number of writes is high, the CPUs run in user mode only about 2% of the time.
➢ This indicates that when read/write activity is high, the CPUs spend most of their time outside user mode.


Multivariate Analysis

➢ From the pairplot shown below, we can see correlations between a few variables:
o Linear correlations can be seen between 'vflt', 'pflt' & 'fork'. If fork calls increase, page faults also tend to increase.
o Similarly, the number of page-out requests per second is highly correlated with the number of pages paged out per second.

Figure 7: Pairplot


➢ Similar correlations can be observed from the heatmap.
➢ For the majority of observations, 'pgscan' (the number of pages checked if they can be freed per second) is 0.

Figure 8: Correlation Heatmap

➢ As observed before, the number of page-out requests per second is highly correlated with the number of pages paged out per second.
➢ Similarly, both page fault variables, pflt & vflt, are highly correlated with the fork variable.
➢ We can try dropping these variables from the model and check the performance.
➢ We will also drop the pgscan variable before building the model, as it is 0 for most observations.
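The correlated pairs read off the pairplot and heatmap can also be listed programmatically. The sketch below is an illustration on synthetic data, not the report's code: the helper `highly_correlated_pairs` and the 0.8 cut-off are assumptions, and the `demo` columns only borrow names from the dataset.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.corr(numeric_only=True)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip self-pairs
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Synthetic stand-in data: 'pflt' is built to track 'fork'; 'lread' is independent.
rng = np.random.default_rng(0)
fork = rng.gamma(shape=2.0, scale=1.0, size=500)
demo = pd.DataFrame(
    {
        "fork": fork,
        "pflt": 50 * fork + rng.normal(0, 5, size=500),
        "lread": rng.poisson(10, size=500).astype(float),
    }
)

pairs = highly_correlated_pairs(demo)
print(pairs)
```

Each pair found this way is a candidate for dropping one of its two variables and re-checking model performance.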


Outlier Detection & Treatment

➢ The red dots in the boxplots show that outliers are present in all the variables.
➢ The majority of the variables are highly skewed as well.
➢ All the outliers are treated by capping them at the lower and upper bounds calculated from the IQR.

Figure 9: Boxplot for outlier detection
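The IQR-based capping described above can be sketched as follows. The multiplier of 1.5 is the conventional choice and is an assumption here, since the report does not state the value used; the helper name `cap_outliers_iqr` is hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Q1 = 3, Q3 = 7, IQR = 4, so the bounds are -3 and 13; only 100 is capped.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)
capped = cap_outliers_iqr(values)
print(capped.tolist())
```

Capping keeps every row in the dataset, unlike dropping outlier rows, at the cost of flattening the tails of each distribution.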


1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any meaning, or do we need to change them or drop them? Check for the possibility of creating new features if required. Also check for outliers and duplicates, if any.

The data is imported, and the following are the observations:

➢ As observed before, there were null values present in the variables 'rchar' & 'wchar'.
➢ Since only a few values are missing, these null values are replaced with the median values.
➢ Many variables also contain 0s. These are all valid values, as they reflect the activity in the computer system.
➢ No ordinal variables are available in the data, hence there is no option to combine the levels of ordinal variables.
➢ Instead, as described above, a new feature is generated, 'srw_rate' (system read-write rate), which will be useful in model building and in reducing multicollinearity in the data.
➢ New features, page rate & page request rate, have also been created from the variables pgin, pgout, ppgin & ppgout.
➢ However, these new features do not give any meaningful output, as the majority of their values are 0 or inf.
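The median imputation and the behaviour of the ratio features can be illustrated as below. The report does not give the formulas for the new features, so the plain ratio `pgin / pgout` and the column name `page_request_rate` are assumptions; the point of the sketch is only to show why a ratio over columns with many zeros produces 0, inf, and NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows using column names from the dataset.
df = pd.DataFrame(
    {
        "rchar": [100.0, np.nan, 300.0, 500.0],
        "pgin": [0.0, 4.0, 0.0, 2.0],
        "pgout": [0.0, 0.0, 3.0, 1.0],
    }
)

# Median imputation for a column with a few missing values.
df["rchar"] = df["rchar"].fillna(df["rchar"].median())

# Assumed definition: a plain ratio of page-in to page-out requests.
# 0/0 gives NaN, x/0 gives inf, and 0/x gives 0.
with np.errstate(divide="ignore", invalid="ignore"):
    df["page_request_rate"] = df["pgin"] / df["pgout"]

print(df)
```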
1.3 Encode the data (having string values) for modelling. Split the data into train and test (70:30). Apply linear regression using scikit-learn. Perform checks for significant variables using an appropriate method from statsmodels. Create multiple models and check the performance of predictions on the train and test sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one with appropriate reasoning.

One-hot encoding is applied to the only 'object' type variable, i.e. 'runqsz'.

A new column is created for each category, with 1 indicating True and 0 indicating False. This is how the encoded variable's data looks.

Table 3: Encoded data
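A minimal sketch of this encoding step with `pandas.get_dummies` is shown below. The rows are made-up stand-ins; the resulting column names match the ones that appear in the final regression equation.

```python
import pandas as pd

# Hypothetical stand-in rows; 'runqsz' is the only object-type column.
df = pd.DataFrame(
    {
        "runqsz": ["CPU_Bound", "Not_CPU_Bound", "Not_CPU_Bound"],
        "usr": [95, 97, 87],
    }
)

# One indicator column per category; dtype=int gives 1/0 instead of True/False.
encoded = pd.get_dummies(df, columns=["runqsz"], dtype=int)
print(encoded)
```

Passing `drop_first=True` would keep only one of the two indicators, which avoids perfect collinearity with the intercept in a linear model.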


Train – Test Split & Model Building

The dataset is split into training and testing data in the ratio of 70:30.

The linear regression model is built and fitted to the training dataset.

The coefficients of all the variables are calculated. Features with positive coefficients, such as 'runqsz_CPU_Bound' and 'pgout', raise the predicted value of the target variable as they increase, holding all the other variables constant.

Similarly, variables with negative coefficients lower the predicted value as they increase.
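The split-and-fit step can be sketched with scikit-learn as follows. The data is synthetic with a known linear relationship, so the true coefficients (-0.3 and -0.2) and the `random_state` value are assumptions made for the illustration and are not taken from the report.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a known linear relationship plus noise.
rng = np.random.default_rng(42)
X = pd.DataFrame(
    {
        "lread": rng.uniform(0, 50, 200),
        "lwrite": rng.uniform(0, 30, 200),
    }
)
y = 90 - 0.3 * X["lread"] - 0.2 * X["lwrite"] + rng.normal(0, 1, 200)

# 70:30 split; random_state is an arbitrary choice for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

# Each coefficient is the change in the prediction per unit change in that
# feature, holding the other features constant.
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {model.intercept_:.3f}")
```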


Model Performance

Model 1 – Sklearn method

To check the model's performance, we calculate the Rsquare value (the coefficient of determination) and the RMSE for both the train and test data.

Rsquare for Train data: 0.722

RMSE for Train data: 2.47

This is a good value: it shows that about 72% of the variance in the training dataset is captured by the model.

Now evaluating the Rsquare and RMSE for the test data.

RMSE for Test data: 2.54

This is also a good value: about 70% of the variance in the testing dataset is captured by the model.

The train and test scores are close, so the model appears to be neither overfitting nor underfitting; therefore this is a good model to go with.
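The three metrics named in this section can be computed directly from their definitions. The sketch below uses NumPy only; the helper name `regression_metrics` and the sample values are hypothetical and are not the report's data.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features: int) -> dict:
    """Return R-squared, adjusted R-squared and RMSE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    rmse = np.sqrt(ss_res / n)
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse}


# Hypothetical actual and predicted values, only to exercise the function.
actual = [90, 85, 95, 80, 88, 92]
predicted = [89, 86, 93, 82, 88, 91]
metrics = regression_metrics(actual, predicted, n_features=2)
print(metrics)
```

Calling the function once on the training predictions and once on the test predictions gives the pair of scores that is compared to judge over- or underfitting.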

However, let's see if there is any improvement with the statsmodels approach.

Model 1 – Statsmodel method

If we build the model using statsmodels and the OLS method, we see that the adjusted Rsquare is equal to the Rsquare value, which is 0.72.

This shows that the fit is not being inflated by the number of predictors in the model.

Looking at the p-values of the predictors, we see that variables like 'exec', 'pgout', 'ppgout' etc. have a p-value greater than 0.05. This means there is no statistically significant evidence of a relationship between these variables and the target variable, so they are unlikely to be useful in prediction.


Rsquare for Train data: 0.72

OLS output:

Figure 10: Statsmodel Model 1

There is not much difference in the Rsquare values of the two models above. We can try removing more non-significant variables and build more models.

Model 2

Building another model using statsmodels without the variables "fork", "exec", "atch", "pgfree", "freemem", which have very high p-values.

Rsquare for Train data: 0.70

RMSE for Train data: 2.57

Rsquare for Test data: 0.68

RMSE for Test data: 2.63

OLS output:


Figure 11: Statsmodel Model 2

There is no improvement in the Rsquare and RMSE values after removing the less significant variables: on the train data, Rsquare drops slightly from 0.72 to 0.70 and RMSE rises from 2.47 to 2.57.

It would be better to go with the Sklearn model for prediction, and to use the statsmodels model for interpretation, to understand which variables play a major role in the model.

Note: The VIF (variance inflation factor) method can also be used to identify variables with high multicollinearity, which can then be considered for elimination along with the non-significant ones.
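As a rough illustration of the VIF method mentioned in the note, the sketch below computes VIF from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features. It uses NumPy least squares on synthetic data; the helper name `vif_table` is hypothetical, and this is not the report's code.

```python
import numpy as np
import pandas as pd


def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R2_j), with R2_j from regressing column j on the rest."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])  # add an intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)


# Synthetic stand-in data: 'ppgout' is nearly a multiple of 'pgout'.
rng = np.random.default_rng(0)
pgout = rng.gamma(2.0, 1.0, 300)
X_demo = pd.DataFrame(
    {
        "pgout": pgout,
        "ppgout": 2.5 * pgout + rng.normal(0, 0.2, 300),
        "lread": rng.poisson(10, 300).astype(float),
    }
)

vif = vif_table(X_demo)
print(vif)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; the exact cut-off is a judgment call.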

1.4 Inference: Based on these predictions, what are the business insights and recommendations?

Business Insights

The following are the observations on the predictions made by the model:


Figure 12: Scatterplot – Actual vs Predicted

• Since this is a regression model, we have plotted the predicted y values against the actual values for the test dataset. This is the plot obtained.
• From the plot, it is visible that the actual and predicted values are close, except for a few. This shows that the model performed well on the data.
• We get the following linear regression equation from the final model:
usr = (66.2) * Intercept + (0.04) * lread + (-0.06) * lwrite + (-0.0) * scall + (-0.46) * fork + (-0.01) * exec + (-0.0) * rchar + (-0.0) * wchar + (0.33) * pgout + (-0.25) * ppgout + (-0.03) * pgfree + (0.0) * pgscan + (0.34) * atch + (0.04) * pgin + (-0.12) * ppgin + (-0.02) * pflt + (-0.01) * vflt + (0.0) * freemem + (-0.0) * freeswap + (0.22) * srw_rate + (33.48) * runqsz_CPU_Bound + (32.72) * runqsz_Not_CPU_Bound

• We see that the process run queue size terms carry by far the largest coefficients in the equation.
• 'runqsz_CPU_Bound' and 'runqsz_Not_CPU_Bound' are 0/1 indicator columns created by the one-hot encoding, so their coefficients are offsets in percentage points, not multipliers. When the run queue is CPU bound, the predicted % of time the CPU runs in user mode is about 33.5 percentage points above the intercept, keeping all other features constant.
• Similarly, when the run queue is not CPU bound, the predicted % of time the CPU runs in user mode is about 32.7 percentage points above the intercept, keeping all other features constant.
• The intercept and the two run queue coefficients add up to approx. 132 (66.2 + 33.48 + 32.72), but only one indicator is 1 for any given row, so the constant part of the prediction is about 99.7 for CPU-bound rows and about 98.9 for not-CPU-bound rows.
• The coefficients of all the other features are much smaller in magnitude.
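One way to read the run queue terms in the equation above is to evaluate the constant part of the prediction for each category. The sketch assumes that `runqsz_CPU_Bound` and `runqsz_Not_CPU_Bound` are 0/1 indicator columns produced by the one-hot encoding, so exactly one of them is 1 in any row; all other features are left out of the calculation.

```python
# Rounded coefficients copied from the fitted equation in the report.
intercept = 66.2
coef_cpu_bound = 33.48
coef_not_cpu_bound = 32.72

# With one-hot encoding, exactly one indicator is 1 in any given row.
baseline_cpu_bound = intercept + coef_cpu_bound * 1 + coef_not_cpu_bound * 0
baseline_not_cpu_bound = intercept + coef_cpu_bound * 0 + coef_not_cpu_bound * 1

print(f"CPU bound:     {baseline_cpu_bound:.2f}")
print(f"Not CPU bound: {baseline_not_cpu_bound:.2f}")
print(f"Difference:    {baseline_cpu_bound - baseline_not_cpu_bound:.2f}")
```

Under that assumption, the two categories differ by well under one percentage point in predicted 'usr', because the large coefficients mostly act as part of the constant term.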

Problem Statement 2: Classification

You are a statistician at the Republic of Indonesia Ministry of Health, and you are provided with data on 1473 females collected from a Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know if they were at the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice, based on their demographic and socio-economic characteristics.

Data Description

Column Name | Description | Data Type
Wife_age | Wife's age | Numerical
Wife_education | 1=uneducated, 2, 3, 4=tertiary | Categorical
Husband_education | 1=uneducated, 2, 3, 4=tertiary | Categorical
No_of_children_born | | Numerical
Wife_religion | Non-Scientology, Scientology | Binary
Wife_Working | Yes, No | Binary
Husband_Occupation | 1, 2, 3, 4 (random) | Categorical
Standard_of_living_index | 1=very low, 2, 3, 4=high | Categorical
Media_exposure | Good, Not good | Binary
Contraceptive_method_used | | Target
Table 4: Data Description

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check, check for duplicates and outliers, and write an inference on it. Perform Univariate, Bivariate, and Multivariate Analysis.

EDA

The data is imported, and the following are the observations:

• The data has 1473 rows and 10 variables. There are 5 variables with object data types; the rest are int data types.
• There are a few missing values in the variables 'Wife_age' and 'No_of_children_born'. These are replaced by the median values to remove the null entries.
• There are 80 duplicate rows, which are dropped from the dataset. The number of rows is now 1393.
• The variable 'Husband_Occupation' has also been changed to the object data type, as it is a categorical variable.
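The cleaning steps listed above can be sketched as follows. The rows are made-up stand-ins that reuse the report's column names; the order of operations (impute, then drop duplicates) follows the order of the bullets and is otherwise an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows using the report's column names.
df = pd.DataFrame(
    {
        "Wife_age": [24.0, np.nan, 43.0, 24.0, 36.0],
        "No_of_children_born": [3.0, 10.0, np.nan, 3.0, 1.0],
        "Husband_Occupation": [2, 3, 3, 2, 1],
    }
)

# Fill missing numeric values with each column's median.
for col in ["Wife_age", "No_of_children_born"]:
    df[col] = df[col].fillna(df[col].median())

# Drop exact duplicate rows and report how many were removed.
before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
print(f"dropped {before - len(df)} duplicate rows, {len(df)} remain")

# Treat the occupation code as a categorical label rather than a number.
df["Husband_Occupation"] = df["Husband_Occupation"].astype(object)
print(df.dtypes)
```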


Table 5: Data description

• From the summary of the object-type variables, we can see that Tertiary is the most frequent education level of both husband and wife.
• Scientology is the most frequent religion among the women, and the majority of them are not working.
• The majority of husbands have an occupation of level 3.
• The standard of living index is very high for most people, and the majority of them are exposed to media.
• This suggests that the people might be from a city or an urban area.
• The majority of the women have used a contraceptive method.

Figure 14: Histogram (wife age)


• The age of the wives ranges from 18 to 49 years, and most of them are in their mid-20s to 30s.
• The majority of the people had 1 or 2 children, but a few have more than 15 children as well.

Figure 15: Histogram (No. of children born)


Univariate Analysis

Figure 16: Countplot (wife education)

Figure 17: Husband education


• As mentioned, Tertiary is the most frequent education level of both husband and wife.
• Fewer husbands than wives are uneducated.

Figure 18: Countplot (Wife religion)

• Also, Scientology is the most followed religion.

Figure 19: Countplot (Wife Working)


• The majority of wives are not working.

Figure 20: Countplot (Standard of living index)

• A major portion of the people are from areas where the standard of living is Very High or High.

• In total, around 350 people are from areas with a Low or Very Low standard of living index.

Figure 21: Countplot (Contraceptive method used)


• We already know that most of the women have used a contraceptive method; however, a good proportion have not used any.

Bivariate Analysis

Figure 22: Contraceptive used vs Wife age

• Women who have used contraceptive methods are mostly aged 25 to 35, with some extreme values.
• Many women aged 25 to 43 have not used any contraceptive method either.

Figure 23: Contraceptive used vs Wife education


Figure 24: Contraceptive used vs Husband education

• Women who have completed their secondary or tertiary education have used contraceptive methods more than the others.
• In contrast, women who are not educated or have only completed primary education tend not to use any contraceptive method.
• A similar finding can be seen based on the husband's education level.

Figure 25: Contraceptive used vs No. of children


• Many women use contraceptives after having 3 children.
• The majority of women who have only 1 child are not taking any contraceptives. This may indicate that they intend to have more children.
• However, a very few women take contraceptives even though they have no children.

Figure 26: Contraceptive used vs Wife religion

• Religion does not seem to affect the use of contraceptives.

Figure 27: Contraceptive used vs Wife working


• The proportion of women taking contraceptives is higher among non-working women than among working women.

Figure 28: Contraceptive used vs Husband occupation

• Since we do not have clear definitions of the husband's occupation levels, we assume level 1 to be the lowest and 4 the highest.
• The proportion of women using contraceptives is higher for occupation levels 1, 2 & 3 than for level 4.

Figure 29: Contraceptive used vs Standard of living index


• As seen before, women from areas with a high or very high standard of living
use contraceptive methods.

Figure 30: Contraceptives used vs Media Exposure

• Women with exposure to media use contraceptives more than the others.

Figure 31: Pairplot


• The pairplot does not indicate any major trend/correlation between the
variables.
• For some of the variables in the pairplot, the classes are not well
separated. These variables are unlikely to be good predictors.

2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic
Regression and LDA (linear discriminant analysis) and CART.

Encoding

• Since the data has string & categorical type variables, these variables must be
encoded as numbers so that the Machine Learning models can use them.
• In the target variable, "No" is first replaced by 0 and "Yes" by 1.
• Similarly, ordinal numbers are assigned to the values in the variables
Wife_education, Husband_education & Standard_of_living_index.
• After this, dummy encoding is used to encode the rest of the categorical
columns.
• The encoded dataset looks like this.

Table 6: Encode data.
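
The encoding steps above can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the report's actual code: the category labels, the column names Wife_Working and Media_exposure, and the choice of drop_first=True in pandas.get_dummies are all assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for the real dataset (labels and some column names are assumed).
df = pd.DataFrame({
    "Wife_age": [24, 45, 33, 29],
    "Wife_education": ["Primary", "Uneducated", "Tertiary", "Secondary"],
    "Standard_of_living_index": ["High", "Very Low", "Very High", "Low"],
    "Wife_Working": ["No", "Yes", "No", "No"],
    "Media_exposure": ["Exposed", "Not-Exposed", "Exposed", "Exposed"],
    "Contraceptive_method_used": ["Yes", "No", "Yes", "No"],
})

# Assumed ordinal orderings, lowest to highest.
education_order = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}
living_order = {"Very Low": 1, "Low": 2, "High": 3, "Very High": 4}

encoded = df.copy()
# Step 1: binary target, "No" -> 0 and "Yes" -> 1.
encoded["Contraceptive_method_used"] = encoded["Contraceptive_method_used"].map({"No": 0, "Yes": 1})
# Step 2: ordinal encoding for ordered categories.
encoded["Wife_education"] = encoded["Wife_education"].map(education_order)
encoded["Standard_of_living_index"] = encoded["Standard_of_living_index"].map(living_order)
# Step 3: dummy encoding for the remaining string columns.
encoded = pd.get_dummies(encoded, columns=["Wife_Working", "Media_exposure"], drop_first=True)

print(encoded.head())
```

Dropping the first dummy level avoids a redundant column for binary variables; keeping all levels would also work for tree-based models.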


Train-Test Split

• To build the Machine Learning models, we split the entire data set in a ratio
of 70:30 into a Training dataset and a Testing dataset.
• Since the dependent variable takes the values 1 and 0, we need to ensure
that the proportion of 1s and 0s is the same in both the Training and Testing
datasets.
• This keeps the class balance consistent and avoids introducing bias while
Training or Testing the model. Therefore, the parameter stratify=target is
used while splitting.
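
A minimal sketch of a stratified 70:30 split using scikit-learn's train_test_split. The synthetic data and the random_state value are assumptions for illustration; only the 70:30 ratio and the stratify argument come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows with a 60/40 class mix (assumed).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Wife_age": rng.integers(16, 50, size=200),
    "No_of_children_born": rng.integers(0, 9, size=200),
})
y = pd.Series([1] * 120 + [0] * 80, name="Contraceptive_method_used")

# stratify=y keeps the share of 1s and 0s the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

Note that stratification preserves class proportions; it does not force an equal count of 1s and 0s.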

Model Building

Logistic Regression Model

After the data preprocessing, a Logistic Regression model is fitted on the Train
dataset and evaluated on both the Train and Test datasets, using default
hyper-parameters with the solver set to 'newton-cg'.
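
As a rough sketch of this step, the snippet below fits scikit-learn's LogisticRegression with solver="newton-cg" and otherwise default settings, then prints classification reports for both subsets. The data comes from make_classification and is purely synthetic, so the numbers it prints will not match the tables in this report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumed stand-in for the encoded dataset).
X, y = make_classification(n_samples=500, n_features=9, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Default hyper-parameters except for the solver.
log_reg = LogisticRegression(solver="newton-cg")
log_reg.fit(X_train, y_train)

# Precision, recall and F1 per class, for train and test.
train_report = classification_report(y_train, log_reg.predict(X_train))
test_report = classification_report(y_test, log_reg.predict(X_test))
print(train_report)
print(test_report)
```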

Performance metrics

• Classification report – Train Data

Table 7: Classification report – Logistic regression model 1 – Train

• Classification report – Test Data

Table 8: Classification report - Logistic regression model 1 – Test


• AUC and ROC Curve – Train Data


o AUC: 0.722

Figure 32: ROC Curve – Logistic regression Model – Train

• AUC and ROC Curve – Test Data

o AUC: 0.66

Figure 33: ROC Curve – Logistic regression model – Test


Inference

From the Accuracy and Recall values, the model seems to be performing fine.
However, the AUC value and ROC curve for the Test data show that the curve
covers a smaller area than it does for the Train data.
Therefore, there is a need to optimize this model.

Feature importance:

Figure 34: important features

Optimized Logistic Regression Model

To optimize the Logistic Regression model, the best parameters are found using Grid
Search Cross Validation technique.

These are the best parameters obtained:

'penalty': 'l1'
'solver': 'saga'
'tol': 0.0001

Another Logistic Regression model is built with these best parameters.

The model evaluation score is calculated, along with the confusion matrix. The AUC-
ROC curve is also plotted for both the versions of the model to check their
performance. This will be described later.
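
The grid search described above might look like the sketch below, using scikit-learn's GridSearchCV. The candidate grid, the number of folds, max_iter and the synthetic data are assumptions, since the report lists only the best parameters and not the grid that was searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumed).
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=1)

# Hypothetical grid; both solvers support the l1 and l2 penalties.
param_grid = {
    "penalty": ["l1", "l2"],
    "solver": ["saga", "liblinear"],
    "tol": [1e-4, 1e-5],
}

# max_iter is raised so that the saga solver has room to converge.
grid = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=3)
grid.fit(X, y)

print(grid.best_params_)
# Refitted on all of X, y with the best parameter combination.
best_model = grid.best_estimator_
```

The best combination found on synthetic data need not match the parameters reported above.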


Performance metrics
• Classification report – Train Data

Table 8: Classification report – optimized logistic Regression model – Train

• Classification report – Test Data

Table 9: Classification report – optimized logistic Regression model – Test

• AUC and ROC Curve – Train Data


o AUC: 0.722

Figure 35: ROC curve – optimized Logistic Regression model – Train.


• AUC and ROC Curve – Test Data


o AUC: 0.722

Figure 36: ROC curve – optimized Logistic Regression model – Test.

Inference
The Accuracy, Recall and Precision seem to be the same as for the previous model;
however, there is a slight variation in the AUC score.
There does not seem to be much of an improvement in the figures, so let us
try to build an LDA model to get better performance.

LDA Model
The LDA model is also built with default parameters. The default cut-off value of 0.5
is considered for prediction.
This model is also further evaluated with Accuracy score, along with the confusion
matrix. The AUC-ROC curve is plotted for both the Train and Test data.
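
A minimal sketch of an LDA model with default parameters and an explicit 0.5 cut-off on the predicted probability of class 1. The synthetic data is an assumption; for a two-class problem, thresholding predict_proba at 0.5 should give the same labels as calling predict directly.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed).
X, y = make_classification(n_samples=500, n_features=9, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

lda = LinearDiscriminantAnalysis()  # default parameters
lda.fit(X_train, y_train)

# Probability of class 1, then the default cut-off of 0.5.
proba_test = lda.predict_proba(X_test)[:, 1]
pred_test = (proba_test > 0.5).astype(int)

print(pred_test[:10])
```

Writing the cut-off out explicitly makes it easy to try other thresholds later, for example to trade Precision against Recall.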

Performance metrics

• Classification report – Train Data

Table 10: Classification report – LDA Model – Train

• Classification report – Test Data


Table 11: Classification report – LDA Model – Test Data.

• AUC and ROC Curve – Train Data


o AUC: 0.722

Figure 37: ROC Curve – LDA model – Train

• AUC and ROC Curve – Test Data


o AUC: 0.662

Figure 38: ROC Curve – LDA Model - Test Data


Inference

The LDA model looks a bit better than the Logistic Regression models in terms of the
Recall value for Train and test data. However, the Accuracy for the test data has taken
a hit. The AUC and ROC curves also do not show a significant difference compared to
the other models built.

CART Model

A CART model is also built using the following parameters:

criterion = 'gini',
max_depth = 7,
min_samples_leaf = 20,
min_samples_split = 60

This model is also further evaluated with Accuracy score, along with the confusion
matrix. The AUC-ROC curve is plotted for both the Train and Test data.
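
The parameters listed above map directly onto scikit-learn's DecisionTreeClassifier, as in the sketch below. The synthetic data and random_state are assumptions; the check at the end only confirms that the depth and leaf-size limits are respected.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumed).
X, y = make_classification(n_samples=600, n_features=9, n_informative=4, random_state=1)

cart = DecisionTreeClassifier(
    criterion="gini",
    max_depth=7,
    min_samples_leaf=20,
    min_samples_split=60,
    random_state=1,  # assumed, for reproducibility
)
cart.fit(X, y)

# Leaves are the nodes without children; inspect their sample counts.
is_leaf = cart.tree_.children_left == -1
smallest_leaf = cart.tree_.n_node_samples[is_leaf].min()
print(cart.get_depth(), smallest_leaf)
```

These limits act as pre-pruning: they stop the tree from growing very deep or creating tiny leaves, which tends to reduce overfitting.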

• Classification report – Train Data

Table 12: Classification report – CART model – Train

• Classification report – Test Data

Table 13: Classification report – CART model – Test Data

• AUC and ROC Curve – Train Data


o AUC: 0.821


Figure 39: ROC Curve – CART Model - Train Data

• AUC and ROC Curve – Test Data


o AUC: 0.721

Figure 40: ROC Curve – CART Model – Test Data

Of all the models built, the CART model seems to be performing the best in terms
of Accuracy, Recall and Precision values.

Feature Importance:


Table 14: Important features from CART Model

The CART model also gives the most important features, according to which the
splits in the Decision Tree were made.
Wife_age, Wife_education & No_of_children_born are the important features. These
are not the same as those suggested by the Logistic Regression model.
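
A sketch of how such a feature-importance table can be read from a fitted tree via its feature_importances_ attribute. The data here is synthetic and the column names are only borrowed as labels, so the ranking printed by this snippet carries no meaning for the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Column names reused as labels for synthetic features (assumed).
feature_names = [
    "Wife_age",
    "Wife_education",
    "Husband_education",
    "No_of_children_born",
    "Standard_of_living_index",
]
X_raw, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=1, random_state=1
)
X = pd.DataFrame(X_raw, columns=feature_names)

cart = DecisionTreeClassifier(
    max_depth=7, min_samples_leaf=20, min_samples_split=60, random_state=1
).fit(X, y)

# Gini-based importances, normalised by scikit-learn to sum to 1.
importances = pd.Series(cart.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```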


2.3 Performance Metrics: Check the performance of Predictions on Train and
Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC
score for each model. Final Model: Compare Both the models and write
inference which model is best/optimized.

Model Evaluation and Performance

To check performance of Predictions of every model built on Train and Test datasets,
Accuracy score is calculated.

A Confusion Matrix, ROC curve and ROC-AUC score have been produced for each model
as well.
We treat both values of 'Contraceptive method used', i.e. 0 and 1, as classes of
interest. Therefore, we also look at the Accuracy scores of all models.
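
To make these metrics concrete, the sketch below computes them with scikit-learn on eight hand-made labels and predicted probabilities. The values are invented for illustration and are not taken from the report's models.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented example: true classes and predicted probabilities of class 1.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
proba = [0.9, 0.8, 0.4, 0.6, 0.2, 0.7, 0.3, 0.55]
y_pred = [int(p > 0.5) for p in proba]  # 0.5 cut-off

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 6/8
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 4/5
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 4/5
auc = roc_auc_score(y_true, proba)           # 13 of 15 positive/negative pairs ranked correctly

print(tn, fp, fn, tp, accuracy, recall, precision, round(auc, 3))
```

Accuracy, Recall and Precision depend on the chosen cut-off, whereas the AUC is computed from the probabilities and summarises ranking quality across all cut-offs.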
Comparing Confusion matrix of all models (Train data)

Figure 41: Confusion matrices of all models (Train data)


Comparing Confusion matrix of all models (Test data)

Figure 42: Confusion matrices of all models (Test Data)

Model Name                        Accuracy        Recall          Precision       AUC
                                  Train   Test    Train   Test    Train   Test    Train   Test
Logistic Regression Model         0.68    0.65    0.79    0.80    0.68    0.65    0.72    0.66
Tuned Logistic Regression Model   0.67    0.65    0.79    0.80    0.68    0.66    0.72    0.72
LDA Model                         0.67    0.64    0.80    0.80    0.67    0.64    0.72    0.66
CART Model                        0.74    0.68    0.85    0.80    0.73    0.68    0.82    0.72

Table 15: Different model Parameters
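
One possible way to compare the models programmatically is to load the test-set columns of Table 15 into a pandas DataFrame and rank them. Ranking by test Accuracy first and test AUC second is an assumption made for this sketch, not a rule stated in the report.

```python
import pandas as pd

# Test-set values copied from Table 15.
metrics = pd.DataFrame(
    {
        "test_accuracy": [0.65, 0.65, 0.64, 0.68],
        "test_recall": [0.80, 0.80, 0.80, 0.80],
        "test_auc": [0.66, 0.72, 0.66, 0.72],
    },
    index=["Logistic Regression", "Tuned Logistic Regression", "LDA", "CART"],
)

# Assumed ranking rule: highest test Accuracy, ties broken by test AUC.
ranked = metrics.sort_values(["test_accuracy", "test_auc"], ascending=False)
best_model = ranked.index[0]

print(ranked)
print(best_model)
```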


From all the inferences above, we see that the models have broadly similar
performance.

The Accuracy score of every model is at least 64% on both the train and test data.

Best model selection:

With this, it is also clear that the CART model has performed better than the rest
of the models. With a test Accuracy of 68%, it correctly predicts the highest
percentage of both our classes of interest.

Looking at the Recall value, the CART model is able to identify 80% of the actual
positives in the test data. The LDA model gives a similar Recall value; however,
the Accuracy of the CART model is slightly higher, so it would be better to
consider the CART model for prediction.
Similarly, the Area Under the Curve (AUC) for the CART model is 0.82 for the train
data and 0.72 for the test data. It is not ideal; however, it is at least as high
as that of every other model.

Therefore, it is safe to say that this model can be used for making predictions on any
unseen data that is fed to the model.

2.4 Inference: Based on these predictions, what are the insights and
recommendations?

Business Insights & Recommendations


From the important features from Logistic Regression Model & CART Model

• As per the Logistic Regression model, the wife's education and the no. of
children born are very important in deciding whether a woman will use
contraceptive methods or not.
• The CART model also indicates that the wife's education and the no. of children
born are important. Therefore, these features are highly important.
• Both models also indicated that the Husband's education is important, and in
real life that makes sense. This feature can influence the wife's decision to
use contraceptive methods.


Recommendations

• Women from areas with a high or very high standard of living are more likely to
use contraceptive methods.
• Women between the ages of 25 and 35 who have a good education level are more
likely to use contraceptives.
• The education level of the husband also plays a major role in whether the wife
will use contraceptive methods or not.
• It would be helpful to get the viewpoint of the women who do not have any
children and are still using contraceptives.
• The exposure to media also plays a key role.
• The Republic of Indonesia Ministry of Health can reach out to women who do not
use contraceptives and educate them about their usage, effects, etc.
• Wives who have 8, 10, 11 & 12 children do not use contraceptives. It would be
interesting to investigate why this is the case.
