
PREDICTIVE MODELLING

BUSINESS REPORT

E Monika Sree
11-07-2024
Problem 1

1.A) Define the problem and perform exploratory Data Analysis


Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between variables

1.B) Data Pre-processing


Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection
(treat, if needed) - Feature Engineering - Encode the data - Train-test split

1.C) Model Building - Linear regression


Apply Linear Regression using sklearn - Using statsmodels, perform checks for significant
variables using the appropriate method - Create multiple models and check the
performance of predictions on train and test sets using R-squared, RMSE & adjusted R-squared.

1.D) Business Insights & Recommendations


Comment on the Linear Regression equation from the final model and the impact of relevant
variables (at least 2) as per the equation - Conclude with the key takeaways (actionable
insights and recommendations) for the business
Problem Definition-
Observe the data set, import the required libraries and load the data set using pandas.
Top 5 rows of data set-
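As a sketch of this step, synthetic data stands in below for the real file (whose name is not given here); in practice the frame would come from `pd.read_csv` on the dataset path.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset; substitute pd.read_csv(<path>).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "lread":  rng.integers(0, 100, 8192),
    "lwrite": rng.integers(0, 100, 8192),
    "usr":    rng.uniform(0, 100, 8192),
})

print(df.head())    # top 5 rows
print(df.shape)     # (8192, 3) here; (8192, 22) for the full dataset
print(df.dtypes)    # data types per column
```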

Shape of data-

 There are 8192 rows and 22 columns.


Description of data-
Datatypes-

Basic Info of dataset-

 There are 8192 rows and 22 columns in the dataset; 13 columns are float type, 8 columns
are int type and 1 is an object-type variable.
 In rchar and wchar we can observe the row count is not 8192.
Null values check on data set-

 We can observe there are null values in rchar and wchar columns.
Treating null values using median method and rechecking for Null values-

 Null values have been treated using median method.
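The median treatment above can be sketched as follows (synthetic stand-in values for rchar and wchar):

```python
import numpy as np
import pandas as pd

# Fill missing rchar/wchar entries with each column's median.
df = pd.DataFrame({"rchar": [100.0, np.nan, 300.0, 220.0],
                   "wchar": [50.0, 80.0, np.nan, 90.0]})

for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].median())

print(df.isnull().sum().sum())  # 0 -- no nulls remain
```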


Duplicate values checking on dataset-

 No duplicate rows are found on the dataset.


 But when we observe the data, we can see many 0 values in a few of the variables.
 Let's check the number of zeroes in each column of the data set.
 In the following columns more than 50 percent of the values are 0's, so let's drop those
columns:
'pgout','ppgout','pgfree','pgscan','atch'
 Data set after dropping the above 5 columns-

 For the rest of the columns, let's replace the 0's with the column median.
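The zero-handling steps can be sketched like this (synthetic stand-in data; computing the median over the non-zero entries is one reasonable variant of the replacement):

```python
import numpy as np
import pandas as pd

# Drop columns that are mostly zeros, then replace remaining zeros
# with the median of the non-zero values.
df = pd.DataFrame({"pgout": [0, 0, 0, 0, 0, 2],      # >50% zeros -> drop
                   "lread": [4, 0, 6, 8, 10, 12]})   # few zeros -> impute

zero_frac = (df == 0).mean()
df = df.drop(columns=zero_frac[zero_frac > 0.5].index)

for col in df.columns:
    med = df.loc[df[col] != 0, col].median()
    df[col] = df[col].replace(0, med)

print(list(df.columns), df["lread"].tolist())
```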

 There are 8192 rows and 17 columns.


 Now let’s create a dataframe that contains only integer and float type variables and
let’s plot boxplots for these data columns.

 Let’s understand that there are outliers in the data, and those need to be treated.
 Let’s use IQR method to treat outliers.
 So, IQR method means any observation that is less than Q1-IQR or more than Q3+IQR
is treated as outlier.
Where,
IQR=Q3-Q1
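The IQR capping can be sketched as below (synthetic values; outliers are clipped to the whisker bounds):

```python
import pandas as pd

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are clipped to the nearest bound.
s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
s_treated = s.clip(lower, upper)

print(s_treated.max())  # the outlier is capped at the upper whisker
```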
 Box plots plotting after treating outliers-
Univariate Analysis-
 Let’s plot different histogram plots for the data set.
Bivariate Analysis-
 The following are the scatterplots showing the relationship between the dependent
and independent variables:
Multivariate Analysis-
 Scatter plot between different variables with hue as ‘runqsz’ column.
 Let’s see correlation between variables

 Correlation between variables:

 There is comparatively high correlation between ‘lread’ and ‘lwrite’.


 Pair plot of data-
 Pairplot shows the relationship between the variables in the form of scatterplots and the
distribution of each variable in histogram form.
 In some plots we can observe positive correlation, in some negative correlation, and some
have no correlation.
 Let’s convert the categorical variable ‘runqsz’ into numerical by encoding using
dummy variable.
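A sketch of the dummy encoding, assuming 'runqsz' takes the two levels 'CPU_Bound' and 'Not_CPU_Bound':

```python
import pandas as pd

# One-hot encode the two-level 'runqsz' column, dropping the first level
# so a single indicator column remains.
df = pd.DataFrame({"runqsz": ["CPU_Bound", "Not_CPU_Bound", "CPU_Bound"]})
df = pd.get_dummies(df, columns=["runqsz"], drop_first=True)

print(df.columns.tolist())
```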
 Top 5 rows of dataset-

 Let's make a copy of the dataset to preserve the original dataset.


Train-test split-
 Let's separate the dataset into X (independent variables) and y (dependent
variable).
 Let's create the x and y variables with 'usr' as the target variable: x has all the data
except the target variable and y has the data of the target variable.
 Now we split X and y (independent and dependent variables) into training and test data
sets as X_train, X_test, y_train, y_test.
 Let's use the statsmodels API (as sm) to add an intercept (constant) to the X variables.
 Using sklearn, let's split the data into train and test sets.
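The split can be sketched as follows (the 70:30 ratio and random_state are assumptions; synthetic data with 'usr' as target):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Separate 'usr' as the target, then split 70:30 into train and test.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["lread", "lwrite", "usr"])

X = df.drop(columns="usr")
y = df["usr"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

print(X_train.shape, X_test.shape)
```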

 Coefficients-

 Intercept of the model is

 R square of training dataset

 R square of testing dataset

 RMSE of training dataset

 RMSE of testing dataset
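A sketch of the sklearn fit and the metrics listed above, on synthetic data (the report's actual figures come from the real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Fit LinearRegression, then report coefficients, intercept, R-squared
# and RMSE on both splits.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
lm = LinearRegression().fit(X_train, y_train)

print("coefficients:", lm.coef_, "intercept:", lm.intercept_)
print("train R2:", lm.score(X_train, y_train))
print("test  R2:", lm.score(X_test, y_test))
print("train RMSE:", np.sqrt(mean_squared_error(y_train, lm.predict(X_train))))
print("test  RMSE:", np.sqrt(mean_squared_error(y_test, lm.predict(X_test))))
```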

Linear Regression Using STATS Model-


As the train and test data are separated, we can begin developing the linear
model. To create an OLS model, use `OLS` from the statsmodels API package, and fit the
model using X_train and y_train.
Model summary is as below-

 The R-squared value tells us that the model can explain 76.6% of the variance in the
training set.
 The adjusted R-squared is also close to the R-squared, at 76.6%.
 RMSE of train data-

 RMSE of test data-

 Scatter plot of actual y value vs predicted y value


 Comparison between the actual and predicted values and the residual (difference
between them)

 Scatter plot between predictive and residuals

 Graph of residuals-

 Linear equation of the model-

 From the equation we can infer that,


1. There are many negative coefficients present in the linear equation.
2. Excluding 'fork' and 'freemem', all the coefficients imply a decrease in the
value of the equation.
3. Increasing 'fork - Number of system fork calls per second' by a unit results in
a 33% increase in 'usr', whereas increasing the 'number of system exec calls per
second' results in a 38.9% decrease in 'usr'.
4. The coefficients of 'rchar', 'wchar' and 'freeswap' are almost zero, so these 3
variables have comparatively very low impact on the equation value; we can
remove those variables if we choose dimension reduction.
Problem 2
Define the problem and perform exploratory Data Analysis
Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between variables
Data Pre-processing
Prepare the data for modelling: - Missing value Treatment (if needed) - Outlier
Detection(treat, if needed) - Feature Engineering (if needed) - Encode the data - Train-test
split
Model Building and Compare the Performance of the Models
Build a Logistic Regression model - Build a Linear Discriminant Analysis model - Build a CART
model - Prune the CART model by finding the best hyperparameters using GridSearch -
Check the performance of the models across train and test set using different metrics -
Compare the performance of all the models built and choose the best one with proper
rationale
Business Insights & Recommendations
Comment on the importance of features based on the best model - Conclude with the key
takeaways (actionable insights and recommendations) for the business.
Problem Definition-
Observe the data set, import the required libraries and load the data set using pandas.
Top 10 rows of dataset-

Shape of data set-

Basic info of the data set-

 There are 2 features with float datatype, 1 with int datatype and 7 with object
datatype.
Summary of dataset-

Data types-
Null Value Check-

 We found null values in ‘Wife_age’ and ‘No_of_children_born’ columns.


 Let’s treat null values by replacing null values with median value of the feature.
Treating null values and rechecking for null values-

 So, all the null values are treated.


Checking for duplicates-
80
 There are 80 duplicate rows.
 Let’s drop the duplicate rows.
 Recheck for the duplicates
0
Univariate Analysis-
Bivariate Analysis-
The below plot shows the relation between the different age groups of women and the
contraceptive method used.

From the below plot we may note that tertiary-educated women use contraceptive
methods the most.
The below plot shows that wives whose husbands have the highest education level have
used contraceptive methods the most.

The below plot shows that non-working women use contraceptive methods the most.

The below plot depicts that women with a very high standard-of-living index use
contraceptive measures the most.
 Correlation between columns-

 We discovered that just three integer features can be plotted. We now convert
all object columns to categorical codes.
The categorical variables Wife_education, Husband_education, Wife_religion,
Standard_of_living_index, Media_exposure, and Contraceptive_method_used were
encoded in ascending order from worst to best, as LDA does not accept text variables
as parameters in model development.
 Below is the encoding for ordinal values:
Wife_ education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Husband_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Wife_religion: Scientology = 1 and non-Scientology = 2.
Wife_Working: Yes = 1 and No = 2.
Standard_of_living _index: Very Low = 1, Low = 2, High = 3, Very High = 4.
Media_exposure: Exposed = 1 and Not-Exposed = 2.
Contraceptive_method_used: Yes = 1 and No = 0
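The ordinal encoding above can be applied with explicit mappings, for example (synthetic stand-in rows):

```python
import pandas as pd

# Map ordinal text categories to the integer codes listed above.
df = pd.DataFrame({
    "Wife_education": ["Uneducated", "Tertiary", "Primary"],
    "Contraceptive_method_used": ["Yes", "No", "Yes"],
})

edu_map = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}
df["Wife_education"] = df["Wife_education"].map(edu_map)
df["Contraceptive_method_used"] = df["Contraceptive_method_used"].map(
    {"Yes": 1, "No": 0})

print(df)
```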
Info of the data set-

Boxplot plotting-

 Outliers are detected.


Outliers treating and plotting of boxplots-
 Outliers are treated using IQR method.
Pairplots:
 We can observe that the pair plot is messy and it is hard to read the correlation between
the variables.
Correlation between variables among dataset-

 We can observe that the positive correlation is highest between Husband_education
and Wife_education.
 We can see that the negative correlation is highest between Husband_occupation
and Wife_education.
Logistic Regression
For the train-test split, create data for the x and y variables based on the
'Contraceptive_method_used' column. Now x has all the data except the target variable, while
y only has the target variable.
 Import the required libraries before proceeding with the process. In this encoding for
'Contraceptive_method_used', 1 means Yes and 0 means No.
 We use the LabelEncoder from the sklearn library to encode data that has not been
encoded previously.
 The encoding is for constructing dummy variables.
 The train and test sets have been generated using sklearn. We then use logistic
regression to fit the data and create a logistic model.
 The percentages of 1s and 0s (women using the contraceptive method: Yes/No) are as
follows:

 Now we need to fit the Logistic Regression model using the 'newton-cg' solver and a
maximum iteration count of 1000; then we get the predicted data frame from the model as
below.
 From the data we can observe that the highest predicted probability for class 1 is 69.43%.
 The model accuracy is 67.3%.
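A sketch of the fit, assuming 'cg' in the report refers to sklearn's 'newton-cg' solver (synthetic two-class data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fit a logistic model with the newton-cg solver and max_iter=1000,
# then compare train and test accuracy.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
clf = LogisticRegression(solver="newton-cg", max_iter=1000).fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))
print("test  accuracy:", clf.score(X_test, y_test))
```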
AUC and ROC curve
 Let’s plot the AUC and ROC curve of the model separately for the train data set and
test dataset.
 AUC curve for training data set
1. Train data set

 From this curve, the ROC curve is not a perfect shape but it is in an acceptable state.
 The Area Under the Curve (AUC) for the train data is 71.8%.
2. Test data set
 The test data curve is similar to the train data curve, with variations at the starting points.
 As this curve is also above the diagonal line it is acceptable, but the curve is not a best-fit
curve.
 The area under the curve is 71.8%.
 When comparing the AUC of the train and test data, the curves are largely comparable,
with just minor variations. The AUC is 71.8%. Let's move to the confusion matrix.
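The AUC values above come from a computation like the following (synthetic scores; plotting fpr against tpr with matplotlib gives the ROC curves shown):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Compute the ROC points and the area under the curve for one split.
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("AUC:", auc)
```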

Confusion matrix
1) Train data
 This plot shows the relationship between the true label and predicted label as 0’s and
1’s
 Classification report is given below-

 For Contraceptive_method_used(Label 0)-


 Precision (66%) - Out of all married women predicted not to be using a
contraceptive method, 66% are actually not using one.
 Recall (53%) - Out of all the married women who are not using a
contraceptive method, 53% have been predicted correctly.
 For Contraceptive_method_used (Label 1)-
 Precision (68%) - Out of all married women predicted to be using a
contraceptive method, 68% are actually using one.
 Recall (79%) - Out of all the married women actually using a contraceptive
method, 79% have been predicted correctly.
 The accuracy is 67%, which is more than 50%, so the model is good.
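The confusion matrix and per-label precision/recall above are produced along these lines (synthetic labels and predictions for illustration):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Cross-tabulate true vs predicted labels, then derive precision/recall
# per class from the same counts.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 1])

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```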
2) Test data

 This plot shows the relationship between the true label and predicted label as 0’s and
1’s
 Classification report is given below-

 For Contraceptive_method_used (Label 0 ):


 Precision (64%) - Out of all married women predicted not to use
contraception, 64% of them actually do not use the method.
 Recall (46%) - Out of all the married women not using a contraceptive
method, 46% have been predicted correctly.
 For Contraceptive_method_used (Label 1 ):
 Precision (65%) - Of all married women predicted to use
contraception, 65% of them really do so.
 Recall (79%) - Out of all the married women actually using a contraceptive
method, 79% have been predicted correctly.
 The accuracy is 65%, which is >50%, so the model is about as good as on the training data.
GRID SEARCH-
For train data-
We use GridSearchCV from sklearn to find the best model.
The process is the same as above.
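A sketch of the grid search, with an illustrative parameter grid (not necessarily the one used in the report):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Search over the regularization strength C with 5-fold cross-validation.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5, scoring="accuracy").fit(X, y)

print(grid.best_params_, grid.best_score_)
```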
 This plot shows the relationship between the true label and predicted label as 0’s and
1’s

 Classification report is given below-

 Accuracy is same as above method is still 67%


For test data-

 This plot shows the relationship between the true label and predicted label as 0’s and
1’s
 Classification report is given below-

 In the same way as shown above, we obtain comparable values here, with an
accuracy of 65%.
 The model's overall accuracy is 67%, indicating that 67% of the total predictions are
correct.
 The accuracy, AUC, Precision, and Recall for test data closely match those for the
training data. This demonstrates that there is no overfitting or underfitting, and
overall, the model is suitable for classification.
Linear Discriminant Analysis-
Train Test Split data-
 The procedure is the same as for Logistic Regression for splitting the train and test data.
 Import LDA from the sklearn library; the results are given below.
 Train data

 Test data

 There is a slight difference between training and test, but it is good, as the accuracy of the
training data is 67% and the accuracy of the test data is 65%.
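The LDA fit can be sketched as follows (synthetic data, same split procedure as before):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Fit LDA and compare accuracy across the two splits.
rng = np.random.default_rng(6)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

print("train accuracy:", lda.score(X_train, y_train))
print("test  accuracy:", lda.score(X_test, y_test))
```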
CART-
 CART is not sensitive to outliers, thus we can use the dataset with them.
 Train and Test Split:
Using the same method as the preceding Logistic Regression and LDA, the train and test
data need to be separated, and the relevant libraries must be imported first.
 In CART, the decision tree is the most significant component.
Decision Tree-
 Fit the decision tree on the train data and export it. Create a new Word document and store
it in the project folder.
 Copy and paste the code into http://webgraphviz.com to view the decision tree; we can
remove the old code and replace it.
 Due to the large amount of information and number of categories in the data, the tree may
be messy. We will reduce the maximum number of leaves, the depth, and the sample size.
 "Gini" is the splitting criterion, which plays a significant role in the decision tree classifier.
Create a new Word document with reduced branches (30), leaf size (10), and depth (7), and
save it in the project folder.
 Now the decision tree looks better than before.
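The pruned tree and its Graphviz export can be sketched like this (synthetic data; the limits follow the depth-7 / 10-sample-leaf / 30-leaf settings mentioned above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Fit a gini-criterion CART with restricted depth, leaf size and leaf count,
# then export DOT text for viewing at webgraphviz.com.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(criterion="gini", max_depth=7,
                              min_samples_leaf=10, max_leaf_nodes=30,
                              random_state=7).fit(X, y)

dot_text = export_graphviz(tree, out_file=None)  # paste into webgraphviz.com
print(tree.get_depth(), tree.get_n_leaves())
```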
Let's now examine feature importance, which is defined as a method that rates input features
according to how helpful they are in predicting a target.

We can see that, with 'wife_age' having the most importance, we can
predict that use of the contraceptive method depends largely on the age of the woman.
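The feature-importance ranking is obtained like this (synthetic data; the column names are illustrative stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Rank features by the fitted tree's feature_importances_ attribute.
rng = np.random.default_rng(8)
X = pd.DataFrame(rng.normal(size=(300, 3)),
                 columns=["Wife_age", "Wife_education", "Husband_occupation"])
y = (X["Wife_age"] + 0.2 * X["Wife_education"] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=8).fit(X, y)
importance = pd.Series(tree.feature_importances_, index=X.columns)

print(importance.sort_values(ascending=False))
```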
AUC plot-
AUC: 0.824

We can see the ROC curve bending high; the model is good and its AUC value for the train
data is 82.4%.
AUC: 0.700
Here, the plot is smooth but bends lower over the area; the AUC value for the test
data is 70%.
Confusion matrix for train data-

By observing the confusion matrix, we can see that True Positive is 260 and True Negative is
474.
 Regarding Contraceptive_method_used (Label 0):
Precision(77%) - Of all married women expected to not use contraception, 77%
actually do not use it.
Recall (62%) – 62% of married women who do not use contraception had their
predictions accurate.

 Regarding Contraceptive_method_used (Label 1):


Precision(75%)-Out of all married women expected to use contraceptive methods,
75% of them actually do so.
Recall (86%) - 86% of married women who genuinely use contraceptives have
been predicted correctly.
 The model performs well on the training data, since its accuracy is 75%, which is higher
than 50%.

Confusion matrix for test data-

By checking the confusion matrix of the test data, we can see that the True Positive count is
91 and the True Negative count is 182.
 For the contraceptive technique utilized (Label 0):
Precision (67%): 67% of forecasted married women do not use contraception.
Recall (47%). Out of all married women who do not use contraception, 47% were
accurately predicted.

 For the contraceptive technique utilized (Label 1):


Precision (64%): 64% of married women anticipated to use contraception actually do
so.
Recall (81%): Out of all married women actually using contraception, 81% were
predicted correctly.
 The model has an accuracy of 65%, comparable to the training data.

CONCLUSION
 From the aforementioned models, it can be seen that each model tended to predict the
encoded label "1" (contraceptive method used), and the models' accuracy and F1 scores
likewise supported label "1".
 However, we are unable to determine definitively whether the contraceptive method was
used or not. We can, however, forecast that married women used the method, and the final
prediction also indicates the same.
