Predictivemodellingproject Report Vijay Borade Aug2023
PREDICTIVE MODELLING
PROJECT REPORT
Batch - Aug 2023
Approach:
1. Data Description: Describe the dataset attributes related to system measures
and 'usr' mode.
As a budding data scientist, you set out to find a linear equation to build
a model to predict ‘usr’ (portion of time (%) that CPUs run in user mode) and to find
out how each attribute affects the system being in ‘usr’ mode, using a list of system
attributes.
PREDICTIVE MODELING PROJECT REPORT – BATCH AUG 2023 (VIJAY BORADE)
Contents:
List of Tables
List of Figures
Data Description
1.1
EDA
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
1.2
1.3
Encoding
Model Performance
1.4
Business Insights
Data Description
2.1
EDA
Univariate Analysis
Bivariate Analysis
2.2
Encoding
Model Building
LDA Model
CART Model
2.3
2.4
List of Tables
Table 1: Data Description – Dataset 1
List of Figures
Figure 1: Univariate Analysis (Usr)
Figure 7: Pairplot
Figure 35: ROC curve – Optimized Logistic Regression model – Train
As a budding data scientist, you set out to find a linear
equation to build a model to predict ‘usr’ (portion of time (%) that CPUs
run in user mode) and to find out how each attribute affects the system
being in ‘usr’ mode, using a list of system attributes.
Data Description
System measures used:
Name      Description  Data Type
lread     Reads (transfers per second) between system memory and user memory  Integer
lwrite    Writes (transfers per second) between system memory and user memory
scall     Number of system calls of all types per second
sread     Number of system read calls per second
swrite    Number of system write calls per second
fork      Number of system fork calls per second
exec      Number of system exec calls per second
rchar     Number of characters transferred per second by system read calls
wchar     Number of characters transferred per second by system write calls
pgout     Number of page-out requests per second
ppgout    Number of pages paged out per second
pgfree    Number of pages per second placed on the free list
pgscan    Number of pages checked per second to see if they can be freed
atch      Number of page attaches (satisfying a page fault by reclaiming a page in memory) per second
pgin      Number of page-in requests per second
ppgin     Number of pages paged in per second
pflt      Number of page faults caused by protection errors (copy-on-writes)
vflt      Number of page faults caused by address translation
runqsz    Process run queue size (the number of kernel threads in memory that are waiting for a CPU to run. Typically, this value should be less than 2. Consistently higher values mean that the system might be CPU-bound.)
freemem   Number of memory pages available to user processes
freeswap  Number of disk blocks available for page swapping
usr       Portion of time (%) that CPUs run in user mode
Table 1: Data Description – Dataset 1
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check
the Data types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis,
Multivariate Analysis.
EDA
• Most of the time the process run queue size was Not CPU Bound.
• On average, the CPU runs in user mode 83.9% of the time.
• There are no duplicate rows present in the data.
• There are a few missing values in the variables ‘rchar’ & ‘wchar’; these will be
treated later.
• A new feature, the ‘System Read-Write rate’, can be calculated from the
number of system read and write calls per second (‘sread’ and ‘swrite’), and
can then replace those two features.
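The checks behind these observations can be sketched roughly as follows. This is only an illustration on a small made-up frame: the column names come from Table 1, but the values and the frame name `df` are assumptions, not the project data.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; values are illustrative only, not the project data.
df = pd.DataFrame({
    "lread": [1, 0, 15, 0, 5],
    "rchar": [40671.0, 448.0, np.nan, 8385.0, 39.0],
    "wchar": [53995.0, 8385.0, 31950.0, np.nan, 8767.0],
    "runqsz": ["CPU_Bound", "Not_CPU_Bound", "Not_CPU_Bound",
               "Not_CPU_Bound", "Not_CPU_Bound"],
    "usr": [95, 97, 87, 98, 90],
})

print(df.shape)            # (rows, columns)
print(df.dtypes)           # data type of each column
print(df.describe().T)     # count, mean, std, min, quartiles, max (numeric columns)

n_duplicates = int(df.duplicated().sum())   # fully duplicated rows
null_counts = df.isnull().sum()             # missing values per column
print(n_duplicates)
print(null_counts[null_counts > 0])
```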
Univariate Analysis.
➢ The read and write transfers per second are fast, as the majority of the transfers
happen quickly.
➢ The system read-write rate is also quick, and the majority of the transactions are under
5%.
➢ It seems that not much activity is taking place.
Bivariate Analysis
➢ Two unusual spikes can be seen when comparing the number of reads per
second with the time the CPUs run in user mode.
➢ Similarly for writes: when the number of writes is high, the CPU runs in user
mode only 2% of the time.
➢ This indicates that when the read/write rate is high, the CPU mostly does not
run in user mode.
Multivariate Analysis
➢ From the pairplot shown below, we can see correlations between a few
variables:
o Linear correlations can be seen between ‘vflt’, ‘pflt’ & ‘fork’. If the fork
calls increase, page faults also tend to increase.
o Similarly, the number of page-out requests per second is highly
correlated with the number of pages paged out per second.
Figure 7: Pairplot
➢ As observed before, the number of page-out requests per second is highly
correlated with the number of pages paged out per second.
➢ Similarly, both the page fault variables, pflt & vflt, are highly correlated with the
fork variable.
➢ We can try dropping these variables from the model and check the
performance.
➢ We will also drop the pgscan variable before building the model, as its value is 0.
➢ The red dots in the boxplots show that outliers are present in all the
variables.
➢ The majority of the variables are highly skewed as well.
➢ All the outliers are treated by capping them at the lower and upper bound
values calculated from the IQR.
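The IQR-based capping described above can be sketched as below. The 1.5 multiplier is the conventional choice and is an assumption here, since the report does not state it; the helper name `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up values with one obvious outlier (100).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0], name="scall")
capped = cap_outliers_iqr(values)
print(capped.tolist())
```

Here Q1 = 3 and Q3 = 7, so the bounds are -3 and 13 and only the value 100 is pulled down to the upper bound.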
1.2 Impute null values if present, and also check for values which are equal to
zero. Do they have any meaning, or do we need to change them or drop
them? Check for the possibility of creating new features if required. Also
check for outliers and duplicates, if any.
➢ As observed before, there were null values present in the variables ‘rchar’ &
‘wchar’.
➢ Since only a few values are missing, these null values are replaced with
the median values.
➢ It has also been observed that there are 0s present for many dimensions.
These are all valid values, as they relate to the activity in the computer
system.
➢ No ordinal variables are available in the data, hence there is no option to
combine sub-ordinal levels.
➢ Instead, as described above, a new feature is generated, i.e. ‘srw_rate’ (system
read-write rate), which will be useful in model building and in reducing
multicollinearity in the data.
➢ New features, page rate & page request rate, have been created from
the variables pgin, pgout, ppgin & ppgout.
➢ However, these new features do not give any meaningful output, as the majority
of the values are 0 or inf.
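A minimal sketch of the median imputation, together with one way a derived rate can end up as 0 or inf. The report does not give the formulas for the new features, so the ratio below and the column name `page_request_rate` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; not the project data.
df = pd.DataFrame({
    "rchar": [100.0, np.nan, 300.0, 500.0],
    "pgin": [0.0, 2.0, 0.0, 4.0],
    "pgout": [0.0, 0.0, 1.0, 2.0],
})

# Replace missing values with the column median (median of 100, 300, 500).
df["rchar"] = df["rchar"].fillna(df["rchar"].median())

# Hypothetical ratio feature: x/0 gives inf, 0/0 gives NaN, 0/x gives 0.
with np.errstate(divide="ignore", invalid="ignore"):
    df["page_request_rate"] = df["pgin"] / df["pgout"]

print(df)
```

When most rows have zero paging activity, a ratio like this is dominated by 0, inf and NaN values, which is consistent with the observation that such features were not useful.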
1.3 Encode the data (having string values) for modelling. Split the data into
train and test (70:30) Apply Linear regression using scikit learn. Perform
checks for significant variables using appropriate method from statsmodel.
Create multiple models and check the performance of predictions on Train
and Test set using Rsquare, RMSE & Adj Rsquare. Compare these models
and select the best one with appropriate reasoning.
One-hot encoding is applied to the only ‘object’ type variable, i.e. ‘runqsz’.
A new column is created for each category, with 1 indicating True and 0 indicating False, and
this is how the encoded variable’s data looks.
The dataset is split into training and testing data in the ratio of 70:30.
The Linear Regression model is built and fitted on the training dataset.
The coefficients of all the variables are calculated, and they show that features
like ‘runqsz_CPU_Bound’ and ‘pgout’ have the largest coefficients, i.e. the largest change
in the target variable per unit change in the feature when all the other variables are held constant.
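The encoding, split and fit steps can be sketched as follows. The data is synthetic and only three predictors are used; the `random_state` values and the generated relationship are assumptions, so the printed coefficients say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the system-measures data (assumption).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "lread": rng.poisson(10, n).astype(float),
    "fork": rng.exponential(1.0, n),
    "runqsz": rng.choice(["CPU_Bound", "Not_CPU_Bound"], n),
})
df["usr"] = 90 - 0.5 * df["lread"] - 2.0 * df["fork"] + rng.normal(0, 1, n)

# One-hot encode the only string column.
encoded = pd.get_dummies(df, columns=["runqsz"], dtype=int)
X = encoded.drop(columns="usr")
y = encoded["usr"]

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients)
print(model.intercept_)
```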
Model Performance
To check the model’s performance, we calculate the R-square value, or coefficient
of determination, for both the train and test data.
The train R-square is a good value: it shows that almost 72% of the variance of the
training dataset was captured by the model.
The test R-square is also a good value: it shows that almost 70% of the variance of the
testing dataset was captured by the model.
Since the train and test scores are close, the model seems to be neither overfitted nor
underfitted, therefore this is a good model to go with.
However, let’s see if there is any improvement with the statsmodels approach.
If we build the model using statsmodels and the OLS method, we see that the adjusted
R-square is equal to the R-square value, which is 0.72.
Looking at the p-values of the predictors, we see that variables like ‘exec’, ‘pgout’,
‘ppgout’ etc. have a p-value greater than 0.05. This means there is no statistically
significant evidence of a relationship between these variables and the target variable,
hence they are not useful in prediction.
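For reference, the three metrics used here can be written out directly. This is a small NumPy sketch with made-up numbers; the sample size of 5000 in the last line is a hypothetical value, used only to show why the adjusted R-square is almost identical to the R-square when there are many more rows than predictors.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R-square penalises the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

r2 = r_squared(y_true, y_pred)
err = rmse(y_true, y_pred)
adj = adjusted_r_squared(r2, n_samples=4, n_features=1)

# Hypothetical sample size: with many rows the penalty becomes negligible.
adj_large = adjusted_r_squared(0.72, n_samples=5000, n_features=21)
print(r2, err, adj, adj_large)
```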
OLS output:
There is not much difference in the R-square values of the two models above.
We can try removing more non-significant variables and building more models.
Model 2
Another model is built using statsmodels without the variables “fork”, “exec”, “atch”,
“pgfree” and “freemem”, which have very high p-values.
OLS output:
It would be better to go with the scikit-learn model for prediction and the statsmodels model for
interpretation, to understand which variables play a major role in the model.
Note: The VIF method can also be used for identifying important variables and
eliminating the ones that are not significant and have high multicollinearity.
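The VIF idea mentioned in the note can be sketched with NumPy alone: each predictor is regressed on the others, and VIF = 1 / (1 - R²) of that regression. The function name `vif` and the synthetic columns are assumptions for illustration; this is not the computation used in the report.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor of each column of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - resid.var() / target.var()
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a
c = rng.normal(size=500)                  # unrelated to a and b
scores = vif(np.column_stack([a, b, c]))
print(scores)
```

The two nearly identical columns get very large VIF values while the unrelated column stays close to 1, which is the pattern one would look for among pairs such as pgout and ppgout.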
1.4 Inference: Based on these predictions, what are the business insights and
recommendations?
Business Insights
The following are the observations on the predictions made by the model:
• Since this is a regression model, we have plotted the predicted y values against the
actual values for the test dataset. This is the plot obtained.
• From the plot, it is visible that the actual and predicted values are close
enough, except for a few. This shows that the model performed well on
the data.
• We get the following linear regression equation from the final model:
Usr = (66.2) * Intercept + (0.04) * lread + (-0.06) * lwrite + (-0.0) * scall
    + (-0.46) * fork + (-0.01) * exec + (-0.0) * rchar + (-0.0) * wchar
    + (0.33) * pgout + (-0.25) * ppgout + (-0.03) * pgfree + (0.0) * pgscan
    + (0.34) * atch + (0.04) * pgin + (-0.12) * ppgin + (-0.02) * pflt
    + (-0.01) * vflt + (0.0) * freemem + (-0.0) * freeswap + (0.22) * srw_rate
    + (33.48) * runqsz_CPU_Bound + (32.72) * runqsz_Not_CPU_Bound
• We see that the process run queue size terms carry the largest coefficients
in the equation.
• ‘runqsz_CPU_Bound’ and ‘runqsz_Not_CPU_Bound’ are 0/1 indicator variables, so
their coefficients are added to the prediction as percentage points of ‘usr’
and are not multipliers. When the run queue is CPU bound, 33.48 percentage
points are added to the predicted % of time the CPU runs in user mode,
keeping all other features constant.
• Similarly, when the run queue is not CPU bound, 32.72 percentage points are
added, keeping all other features constant.
• Exactly one of the two indicators equals 1 for any observation, so together
with the Intercept the baseline is about 99.7 (66.2 + 33.48) for a CPU-bound
queue and about 98.9 (66.2 + 32.72) otherwise. The two states therefore
differ by only about 0.76 percentage points.
• All the other features have much smaller coefficients, so a one-unit change
in any of them moves the prediction only slightly.
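Because ‘runqsz_CPU_Bound’ and ‘runqsz_Not_CPU_Bound’ are one-hot indicators of the same variable, exactly one of them equals 1 for any observation. A short check of what the intercept and the indicator terms contribute in each case, using the rounded coefficients from the equation above (the variable names are hypothetical):

```python
# Rounded coefficients taken from the regression equation in the report.
intercept = 66.2
coef_cpu_bound = 33.48
coef_not_cpu_bound = 32.72

# Contribution of intercept + indicator terms when the queue is CPU bound.
baseline_cpu_bound = intercept + coef_cpu_bound * 1 + coef_not_cpu_bound * 0

# Contribution when the queue is not CPU bound.
baseline_not_cpu_bound = intercept + coef_cpu_bound * 0 + coef_not_cpu_bound * 1

# Difference between the two states, in percentage points of usr.
difference = baseline_cpu_bound - baseline_not_cpu_bound

print(round(baseline_cpu_bound, 2))      # 99.68
print(round(baseline_not_cpu_bound, 2))  # 98.92
print(round(difference, 2))              # 0.76
```

The coefficients act additively, in percentage points of ‘usr’, so the indicator terms shift the baseline and do not scale it.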
You are a statistician at the Republic of Indonesia Ministry of Health, and you are
provided with data on 1473 women collected from a Contraceptive Prevalence
Survey. The samples are married women who were either not pregnant or did not
know if they were at the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice, based
on their demographic and socio-economic characteristics.
Data Description
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and a null
value condition check, check for duplicates and outliers, and write an inference
on it. Perform Univariate, Bivariate and Multivariate Analysis.
EDA
• From the summary statistics of the object type variables, we can see that
Tertiary is the most frequent education level of both husband and wife.
• Scientology is the most frequent religion followed by the women, and
the majority of them are not working.
• The majority of the husbands have an occupation of level 3.
• The standard of living index is very high for most people, and the majority of
them are exposed to media.
• This suggests that the people might be from a city or an urban area.
• The majority of the women have used a contraceptive method.
• The age of the wives ranges from 18 to 49 years, and most of them are in
their mid-20s and 30s.
• The majority of the people had 1 or 2 children, but a few have more than 15
children as well.
Univariate Analysis
• A major portion of the people are from areas where the standard of living is
Very High or High.
• In total, around 350 people are from areas with a Low or Very Low standard
of living index.
• We already know that most of the women have used a contraceptive method;
however, a good proportion have not used any.
Bivariate Analysis
• Women who have completed their Secondary or Tertiary education have
used contraceptive methods more than the others.
• In contrast, women who are not educated or have only completed Primary education
tend not to use any contraceptive methods.
• A similar finding can be seen based on the husband’s education level.
• As seen before, women from areas with a high or very high standard of living
use contraceptive methods.
• The pairplot does not indicate any major trend/correlation between the
variables.
• For some of the variables in the pairplot, the classes are not well
separated. These variables will not be considered good predictors.
2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic
Regression and LDA (linear discriminant analysis) and CART.
Encoding
• Since the data has string & categorical type variables, these variables must be
encoded so that the Machine Learning model can work with the data.
• In the target variable, "No" is first replaced by 0 and "Yes" by 1.
• Similarly, ordinal numbers are assigned to the values in the variables Wife_education,
Husband_education & Standard_of_living_index.
• After this, dummy encoding is used to encode the rest of the
columns.
• The dataset looks like this.
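The encoding steps listed above can be sketched as follows. The rows are made up, and several details are assumptions: the target column name `Contraceptive_method_used`, the column `Wife_religion`, the integer codes chosen for each level, and the use of `drop_first=True` for the dummy encoding.

```python
import pandas as pd

# Made-up rows; column names and category labels are partly assumed.
df = pd.DataFrame({
    "Wife_age": [24, 45, 43, 42],
    "Wife_education": ["Primary", "Uneducated", "Primary", "Secondary"],
    "Husband_education": ["Secondary", "Secondary", "Secondary", "Primary"],
    "Standard_of_living_index": ["High", "Very High", "Very High", "Low"],
    "Wife_religion": ["Scientology", "Scientology", "Non-Scientology", "Scientology"],
    "Contraceptive_method_used": ["No", "No", "Yes", "Yes"],
})

# Assumed ordinal codes: higher number means higher level.
education_order = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}
living_order = {"Very Low": 1, "Low": 2, "High": 3, "Very High": 4}

# Target: "No" -> 0, "Yes" -> 1.
df["Contraceptive_method_used"] = df["Contraceptive_method_used"].map({"No": 0, "Yes": 1})

# Ordinal encoding for the ordered categorical columns.
for col in ["Wife_education", "Husband_education"]:
    df[col] = df[col].map(education_order)
df["Standard_of_living_index"] = df["Standard_of_living_index"].map(living_order)

# Dummy-encode the remaining string column.
df = pd.get_dummies(df, columns=["Wife_religion"], drop_first=True, dtype=int)
print(df)
```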
Train-Test Split
• To build the Machine Learning models, we split the entire dataset in a ratio
of 70:30 into a Training dataset and a Testing dataset.
• Since the dependent variable takes the values 1 and 0, we need to ensure
that both values appear in the same proportion in the Training and Testing
datasets.
• This keeps the class balance consistent and avoids bias while
training or testing the model. Therefore, the parameter stratify=target is used
while splitting.
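A minimal sketch of a stratified 70:30 split with scikit-learn's `train_test_split`. The arrays are synthetic and the `random_state` value is an assumption; the point is only that the share of 1s is preserved in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a target with 60% ones (assumption).
X = np.arange(200).reshape(100, 2)
target = np.array([1] * 60 + [0] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.30, random_state=1, stratify=target
)

# Share of the positive class in each part.
print(y_train.mean(), y_test.mean())
```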
Model Building
After the data preprocessing, a Logistic Regression model is fitted on the Train dataset
and evaluated on the Train and Test datasets, with default hyper-parameters and the
solver set to ‘newton-cg’.
Performance metrics
Classification report – Train Data
Inference
From the Accuracy and Recall values, the model seems to be performing fine.
However, the AUC values and ROC curves show that the model covers a smaller
area on the Test data than on the Train data.
Therefore, there is a need to optimize this model.
Feature importance:
To optimize the Logistic Regression model, the best parameters are found using Grid
Search Cross Validation technique.
'penalty': 'l1'
'solver': 'saga'
'tol': 0.0001
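A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The parameter grid, the number of folds and the synthetic data below are assumptions, since the report lists only the best parameters and not the grid that was searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

# Hypothetical grid; 'saga' supports both l1 and l2 penalties.
param_grid = {
    "penalty": ["l1", "l2"],
    "solver": ["saga"],
    "tol": [1e-4, 1e-3],
}

search = GridSearchCV(
    LogisticRegression(max_iter=5000), param_grid, cv=3
)
search.fit(X, y)

print(search.best_params_)   # best combination found on this synthetic data
print(search.best_score_)    # mean cross-validated accuracy of that combination
```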
The model evaluation score is calculated, along with the confusion matrix. The AUC-
ROC curve is also plotted for both the versions of the model to check their
performance. This will be described later.
Performance metrics
• Classification report – Train Data
Inference
The Accuracy, Recall and Precision seem to be the same as for the previous model;
however, there is a slight variation in the AUC score.
There does not seem to be much of an improvement in the figures, therefore let us
try building an LDA model to get better performance.
LDA Model
The LDA model is also built with default parameters. The default cut-off value of 0.5
is considered for prediction.
This model is also further evaluated with Accuracy score, along with the confusion
matrix. The AUC-ROC curve is plotted for both the Train and Test data.
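The LDA steps described above can be sketched as follows, on synthetic data. The 0.5 cut-off is applied explicitly to the predicted probability of class 1; the data and the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# LDA with default parameters.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Probability of class 1, then the default 0.5 cut-off.
proba_test = lda.predict_proba(X_test)[:, 1]
pred_test = (proba_test > 0.5).astype(int)

accuracy = accuracy_score(y_test, pred_test)
matrix = confusion_matrix(y_test, pred_test)
auc = roc_auc_score(y_test, proba_test)
print(accuracy)
print(matrix)
print(auc)
```

The same three calls on `X_train` and `y_train` give the train-side figures for comparison.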
Performance metrics
Inference
The LDA model looks a bit better than the Logistic Regression models in terms of the
Recall value for Train and test data. However, the Accuracy for the test data has taken
a hit. The AUC and ROC curves also do not show a significant difference compared to
the other models built.
CART Model
Of all the models, the CART model seems to be performing the best in terms
of Accuracy, Recall and Precision values.
Feature Importance:
The CART model also gives the most important features, according to which the splits
in the Decision Tree were made.
Wife_age, Wife_education & No_of_children_born are the important features. These
are not the same as the ones the Logistic Regression model suggested.
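Feature importances of a CART model can be read from a fitted scikit-learn decision tree as sketched below. The data is synthetic with generic feature names, and the `max_depth` and `random_state` settings are assumptions, so the ranking printed here is unrelated to the ranking reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with generic feature names (assumption).
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
features = ["feature_0", "feature_1", "feature_2", "feature_3"]
X = pd.DataFrame(X, columns=features)

# CART classifier using the Gini criterion.
cart = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=1)
cart.fit(X, y)

# Importance of each feature, largest first; the values sum to 1.
importances = pd.Series(cart.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print(importances)
```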
To check the performance of the predictions of every model built, the Accuracy score
is calculated on the Train and Test datasets.
A Confusion Matrix, ROC curve and ROC-AUC score have been produced as well.
We have considered both values of ‘Contraceptive method used’, i.e. 0 and 1, as the
classes of interest. Therefore, we will also look at the Accuracy scores of all models.
Comparing Confusion matrix of all models (Train data)
From all the inferences above, we see that almost all the models have similar
performance.
The Accuracy scores for all the models are above 65% for both test and train data.
Best model selection:
With this, it is also clear that the CART model has performed better than the rest
of the models. With an Accuracy value of 68%, it predicts the highest percentage
of both our classes of interest.
Looking at the Recall value, the CART model is able to identify 80% of the actual
positives correctly. The LDA model also gives a similar Recall value; however, the
Accuracy of the CART model is slightly higher, therefore it would be better to
consider the CART model for doing the prediction.
Similarly, we see that the Area Under the Curve (AUC) captured is 82% for train data
and 72% for the test data. It is not the best; however, it still surpasses all the other
models.
Therefore, this model can be used for making predictions on unseen data,
bearing in mind the gap between its train and test AUC.
2.4 Inference: Based on these predictions, what are the insights and
recommendations?
• As per the Logistic Regression model, the wife’s education and the number of children
born are very important in deciding whether a woman will use contraceptive
methods or not.
• The CART model also indicates that the wife’s education and the number of children born
are important. Therefore, these features are highly important.
• Both models also indicate that the husband’s education is
important, and in real life that makes sense: this feature can influence the
wife’s decision to use contraceptive methods.
Recommendations
• Women from areas with a high or very high standard of living are more likely to
use contraceptive methods.
• Women between the ages of 25 and 35 who have a good education level are
more likely to use contraceptives.
• The education level of the husband also plays a major role in
whether the wife will use contraceptive methods or not.
• It would be helpful to get the viewpoint of the women who do not have any
children and are still using contraceptives.
• The exposure to media also plays a key role.
• The Republic of Indonesia Ministry of Health can reach out to women who do not
use contraceptives and educate them about their usage, effects etc.
• Wives who have 8, 10, 11 or 12 children do not use contraceptives. It would be
interesting to find out why this is the case.