
PREDICTIVE MODELLING

BUSINESS REPORT

E Monika Sree
11-07-2024
Problem 1

1.A) Define the problem and perform exploratory Data Analysis


Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between variables

1.B) Data Pre-processing


Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection
(treat, if needed) - Feature Engineering - Encode the data - Train-test split

1.C) Model Building - Linear regression


Apply Linear Regression using sklearn - Using statsmodels, perform checks for significant
variables using the appropriate method - Create multiple models and check the
performance of predictions on train and test sets using R-squared, RMSE & adjusted R-squared.

1.D) Business Insights & Recommendations


Comment on the Linear Regression equation from the final model and the impact of relevant
variables (at least 2) as per the equation - Conclude with the key takeaways (actionable
insights and recommendations) for the business
Problem Definition-
Observe the data set, import the required libraries and load the data set using pandas.
Top 5 rows of data set-
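As a sketch of this step, synthetic data stands in below for the real file (whose name is not given here); in practice the frame would come from `pd.read_csv` on the dataset path.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset; substitute pd.read_csv(<path>).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "lread":  rng.integers(0, 100, 8192),
    "lwrite": rng.integers(0, 100, 8192),
    "usr":    rng.uniform(0, 100, 8192),
})

print(df.head())    # top 5 rows
print(df.shape)     # (8192, 3) here; (8192, 22) for the full dataset
print(df.dtypes)    # data types per column
```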

Shape of data-

 There are 8192 rows and 22 columns.


Description of data-
Datatypes-

Basic Info of dataset-

 There are 8192 rows and 22 columns in the dataset; 13 columns are float type, 8 columns
are int type and 1 is an object-type variable.
 In rchar and wchar we can observe the row count is not 8192.
Null values check on data set-

 We can observe there are null values in rchar and wchar columns.
Treating null values using median method and rechecking for Null values-

 Null values have been treated using median method.
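The median treatment above can be sketched as follows (synthetic stand-in values for rchar and wchar):

```python
import numpy as np
import pandas as pd

# Fill missing rchar/wchar entries with each column's median.
df = pd.DataFrame({"rchar": [100.0, np.nan, 300.0, 220.0],
                   "wchar": [50.0, 80.0, np.nan, 90.0]})

for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].median())

print(df.isnull().sum().sum())  # 0 -- no nulls remain
```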


Duplicate values checking on dataset-

 No duplicate rows are found on the dataset.


 But when we observe the data, we can see many 0 values in a few of the variables.
 Let's check the number of zeroes in each column of the data set.
 In the following columns more than 50 percent of the values are 0's, so let's drop those
columns:
'pgout','ppgout','pgfree','pgscan','atch'
 Data set after dropping the above 5 columns-

 For the rest of the columns, let's replace the 0's with the column median.
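The zero-handling steps can be sketched like this (synthetic stand-in data; computing the median over the non-zero entries is one reasonable variant of the replacement):

```python
import numpy as np
import pandas as pd

# Drop columns that are mostly zeros, then replace remaining zeros
# with the median of the non-zero values.
df = pd.DataFrame({"pgout": [0, 0, 0, 0, 0, 2],      # >50% zeros -> drop
                   "lread": [4, 0, 6, 8, 10, 12]})   # few zeros -> impute

zero_frac = (df == 0).mean()
df = df.drop(columns=zero_frac[zero_frac > 0.5].index)

for col in df.columns:
    med = df.loc[df[col] != 0, col].median()
    df[col] = df[col].replace(0, med)

print(list(df.columns), df["lread"].tolist())
```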

 There are 8192 rows and 17 columns.


 Now let’s create a dataframe that contains only integer and float type variables and
let’s plot boxplots for these data columns.

 Let’s understand that there are outliers in the data, and those need to be treated.
 Let’s use IQR method to treat outliers.
 So, IQR method means any observation that is less than Q1-IQR or more than Q3+IQR
is treated as outlier.
Where,
IQR=Q3-Q1
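The IQR capping can be sketched as below (synthetic values; outliers are clipped to the whisker bounds):

```python
import pandas as pd

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are clipped to the nearest bound.
s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
s_treated = s.clip(lower, upper)

print(s_treated.max())  # the outlier is capped at the upper whisker
```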
 Box plots plotting after treating outliers-
Univariate Analysis-
 Let’s plot different histogram plots for the data set.
Bivariate Analysis-
 The following are the scatterplots showing the relationship between the dependent
and independent variables:
Multivariate Analysis-
 Scatter plot between different variables with hue as ‘runqsz’ column.
 Let’s see correlation between variables

 Correlation between variables:

 There is comparatively high correlation between ‘lread’ and ‘lwrite’.


 Pair plot of data-
 Pairplot shows the relationship between the variables in the form of scatterplots and the
distribution of each variable in histogram form.
 In some plots we can observe positive correlation, in some negative correlation, and some
have no correlation.
 Let’s convert the categorical variable ‘runqsz’ into numerical by encoding using
dummy variable.
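A sketch of the dummy encoding, assuming 'runqsz' takes the two levels 'CPU_Bound' and 'Not_CPU_Bound':

```python
import pandas as pd

# One-hot encode the two-level 'runqsz' column, dropping the first level
# so a single indicator column remains.
df = pd.DataFrame({"runqsz": ["CPU_Bound", "Not_CPU_Bound", "CPU_Bound"]})
df = pd.get_dummies(df, columns=["runqsz"], drop_first=True)

print(df.columns.tolist())
```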
 Top 5 rows of dataset-

 Let's make a copy of the dataset to preserve the original dataset.


Train-test split-
 Let's separate the dataset into X (independent variables) and y (dependent
variable).
 Let's create the x and y variables with 'usr' as the target variable: x has all the data
except the target variable and y has the data of the target variable.
 Now we split X and y (independent and dependent variables) into training and test data
sets as X_train, X_test, y_train, y_test.
 Let's use the statsmodels API (as sm) to add an intercept (constant) to the X variables.
 Using sklearn, let's split the data into train and test sets.
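The split can be sketched as follows (the 70:30 ratio and random_state are assumptions; synthetic data with 'usr' as target):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Separate 'usr' as the target, then split 70:30 into train and test.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["lread", "lwrite", "usr"])

X = df.drop(columns="usr")
y = df["usr"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

print(X_train.shape, X_test.shape)
```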

 Coefficients-

 Intercept of the model is

 R square of training dataset

 R square of testing dataset

 RMSE of training dataset

 RMSE of testing dataset
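A sketch of the sklearn fit and the metrics listed above, on synthetic data (the report's actual figures come from the real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Fit LinearRegression, then report coefficients, intercept, R-squared
# and RMSE on both splits.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
lm = LinearRegression().fit(X_train, y_train)

print("coefficients:", lm.coef_, "intercept:", lm.intercept_)
print("train R2:", lm.score(X_train, y_train))
print("test  R2:", lm.score(X_test, y_test))
print("train RMSE:", np.sqrt(mean_squared_error(y_train, lm.predict(X_train))))
print("test  RMSE:", np.sqrt(mean_squared_error(y_test, lm.predict(X_test))))
```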

Linear Regression Using STATS Model-


As the train and test data are separated, we can begin developing the linear
model. To create an OLS model, use `OLS` from the statsmodels API package, and fit the
model using X_train and y_train.
Model summary is as below-

 The R-squared value tells us that the model can explain 76.6% of the variance in the
training set.
 The adjusted R-squared is also close to the R-squared, at 76.6%.
 RMSE of train data-

 RMSE of test data-

 Scatter plot of actual y value vs predicted y value


 Comparison between the actual and predicted values and the residual (difference
between them)

 Scatter plot between predictive and residuals

 Graph of residuals-

 Linear equation of the model-

 From the equation we can infer that,


1. There are many negative coefficients present in the linear equation.
2. Excluding 'fork' and 'freemem', all the coefficients imply a decrease in the
value of the equation.
3. Increasing 'fork - Number of system fork calls per second' by a unit results in
a 33% increase in 'usr', whereas increasing the 'number of system exec calls per
second' results in a 38.9% decrease in 'usr'.
4. The coefficients of 'rchar', 'wchar' and 'freeswap' are almost zero, so these 3
variables have comparatively very low impact on the equation value; we can
remove those variables if we choose dimension reduction.
Problem 2
Define the problem and perform exploratory Data Analysis
Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between variables
Data Pre-processing
Prepare the data for modelling: - Missing value Treatment (if needed) - Outlier
Detection(treat, if needed) - Feature Engineering (if needed) - Encode the data - Train-test
split
Model Building and Compare the Performance of the Models
Build a Logistic Regression model - Build a Linear Discriminant Analysis model - Build a CART
model - Prune the CART model by finding the best hyperparameters using GridSearch -
Check the performance of the models across train and test set using different metrics -
Compare the performance of all the models built and choose the best one with proper
rationale
Business Insights & Recommendations
Comment on the importance of features based on the best model - Conclude with the key
takeaways (actionable insights and recommendations) for the business.
Problem Definition-
Observe the data set, import the required libraries and load the data set using pandas.
Top 10 rows of dataset-

Shape of data set-

Basic info of the data set-

 There are 2 features with float datatype, 1 with int datatype and 7 with object
datatype.
Summary of dataset-

Data types-
Null Value Check-

 We found null values in ‘Wife_age’ and ‘No_of_children_born’ columns.


 Let’s treat null values by replacing null values with median value of the feature.
Treating null values and rechecking for null values-

 So, all the null values are treated.


Checking for duplicates-
80
 There are 80 duplicate rows.
 Let’s drop the duplicate rows.
 Recheck for the duplicates
0
Univariate Analysis-
Bivariate Analysis-
The below plot shows the relation between the different age groups of women and the
contraceptive method used.

From the below plot we may note that tertiary-educated women use contraceptive
methods the most.
The below plot shows that wives whose husbands have the highest education level have
used contraceptive methods the most.

The below plot shows that non-working women use contraceptive methods the most.

The below plot depicts that women with a very high standard-of-living index use
contraceptive measures the most.
 Correlation between columns-

 We discovered that just three integer features can be plotted. We now convert
all object columns to categorical codes.
The categorical variables Wife_education, Husband_education, Wife_religion,
Standard_of_living_index, Media_exposure, and Contraceptive_method_used were
encoded in ascending order from worst to best, as LDA does not accept text variables
as parameters in model development.
 Below is the encoding for ordinal values:
Wife_ education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Husband_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Wife_religion: Scientology = 1 and non-Scientology = 2.
Wife_Working: Yes = 1 and No = 2.
Standard_of_living _index: Very Low = 1, Low = 2, High = 3, Very High = 4.
Media_exposure: Exposed = 1 and Not-Exposed = 2.
Contraceptive_method_used: Yes = 1 and No = 0
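The ordinal encoding above can be applied with explicit mappings, for example (synthetic stand-in rows):

```python
import pandas as pd

# Map ordinal text categories to the integer codes listed above.
df = pd.DataFrame({
    "Wife_education": ["Uneducated", "Tertiary", "Primary"],
    "Contraceptive_method_used": ["Yes", "No", "Yes"],
})

edu_map = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}
df["Wife_education"] = df["Wife_education"].map(edu_map)
df["Contraceptive_method_used"] = df["Contraceptive_method_used"].map(
    {"Yes": 1, "No": 0})

print(df)
```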
Info of the data set-

Boxplot plotting-

 Outliers are detected.


Outliers treating and plotting of boxplots-
 Outliers are treated using IQR method.
Pairplots:
 We can observe that the pair plot is messy and it is hard to read the correlation between
the variables.
Correlation between variables among dataset-

 We can observe that the positive correlation is highest between Husband_education
and Wife_education.
 We can see that the negative correlation is highest between Husband_occupation
and Wife_education.
Logistic Regression
For the train-test split, create data for the x and y variables based on the
'Contraceptive_method_used' column. Now x has all the data except the target variable, while
y only has the target variable.
 Import the required libraries before proceeding with the process. In this encoding for
'Contraceptive_method_used', 1 means Yes and 0 means No.
 We use the LabelEncoder from the sklearn library to encode data that has not been
encoded previously.
 The encoding is for constructing dummy variables.
 The train and test sets have been generated using sklearn. We then use logistic
regression to fit the data and create a logistic model.
 The percentages of 1s and 0s (women using the contraceptive method: Yes/No) are as
follows:

 Now we need to fit the Logistic Regression model using the 'newton-cg' solver and a
maximum iteration count of 1000; then we get the predicted data frame from the model as
below.
 From the data we can observe that the highest predicted probability for class 1 is 69.43%.
 The model accuracy is 67.3%.
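A sketch of the fit, assuming 'cg' in the report refers to sklearn's 'newton-cg' solver (synthetic two-class data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fit a logistic model with the newton-cg solver and max_iter=1000,
# then compare train and test accuracy.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
clf = LogisticRegression(solver="newton-cg", max_iter=1000).fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))
print("test  accuracy:", clf.score(X_test, y_test))
```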
AUC and ROC curve
 Let’s plot the AUC and ROC curve of the model separately for the train data set and
test dataset.
 AUC curve for training data set
1. Train data set

 From this curve, the ROC curve is not a perfect shape but it is in an acceptable state.
 The Area Under the Curve (AUC) for the train data is 71.8%.
2. Test data set
 The test data curve is similar to the train data curve, with variations at the starting points.
 As this curve is also above the diagonal line it is acceptable, but the curve is not a best-fit
curve.
 The area under the curve is 71.8%.
 When comparing the AUC of the train and test data, the curves are largely comparable,
with just minor variations. The AUC is 71.8%. Let's move to the confusion matrix.
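The AUC values above come from a computation like the following (synthetic scores; plotting fpr against tpr with matplotlib gives the ROC curves shown):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Compute the ROC points and the area under the curve for one split.
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("AUC:", auc)
```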

Confusion matrix
1) Train data
 This plot shows the relationship between the true label and predicted label as 0’s and
1’s
 Classification report is given below-

 For Contraceptive_method_used(Label 0)-


 Precision (66%) - Out of all married women predicted not to be using a
contraceptive method, 66% are actually not using one.
 Recall (53%) - Out of all the married women who are not using a
contraceptive method, 53% have been predicted correctly.
 For Contraceptive_method_used (Label 1)-
 Precision (68%) - Out of all married women predicted to be using a
contraceptive method, 68% are actually using one.
 Recall (79%) - Out of all the married women actually using a contraceptive
method, 79% have been predicted correctly.
 The accuracy is 67%, which is more than 50%, so the model is good.
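The confusion matrix and per-label precision/recall above are produced along these lines (synthetic labels and predictions for illustration):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Cross-tabulate true vs predicted labels, then derive precision/recall
# per class from the same counts.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 1])

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```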
2) Test data

 This plot shows the relationship between the true label and predicted label as 0’s and
1’s
 Classification report is given below-

 For Contraceptive_method_used (Label 0 ):


 Precision (64%) - Out of all married women predicted not to use
contraception, 64% of them actually do not use the method.
 Recall (46%) - Out of all the married women not using a contraceptive
method, 46% have been predicted correctly.
 For Contraceptive_method_used (Label 1 ):
 Precision (65%) - Of all married women predicted to use
contraception, 65% of them really do so.
 Recall (79%) - Out of all the married women actually using a contraceptive
method, 79% have been predicted correctly.
 The accuracy is 65%, which is >50%, so the model is about as good as on the training data.
GRID SEARCH-
For train data-
We use GridSearchCV from sklearn to find the best model.
The process is the same as above.
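A sketch of the grid search, with an illustrative parameter grid (not necessarily the one used in the report):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Search over the regularization strength C with 5-fold cross-validation.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5, scoring="accuracy").fit(X, y)

print(grid.best_params_, grid.best_score_)
```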
 This plot shows the relationship between the true label and predicted label as 0’s and
1’s

 Classification report is given below-

 Accuracy is same as above method is still 67%


For test data-

 This plot shows the relationship between the true label and predicted label as 0’s and
1’s
 Classification report is given below-

 In the same way as shown above, we obtain comparable values here, with an
accuracy of 65%.
 The model's overall accuracy is 67%, indicating that 67% of the total predictions are
correct.
 The accuracy, AUC, Precision, and Recall for test data closely match those for the
training data. This demonstrates that there is no overfitting or underfitting, and
overall, the model is suitable for classification.
Linear Discriminant Analysis-
Train Test Split data-
 The procedure is the same as for Logistic Regression for splitting the train and test data.
 Import LDA from the sklearn library; the results are given below.
 Train data

 Test data

 There is a slight difference between training and test, but it is good, as the accuracy of the
training data is 67% and the accuracy of the test data is 65%.
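The LDA fit can be sketched as follows (synthetic data, same split procedure as before):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Fit LDA and compare accuracy across the two splits.
rng = np.random.default_rng(6)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

print("train accuracy:", lda.score(X_train, y_train))
print("test  accuracy:", lda.score(X_test, y_test))
```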
CART-
 CART is not sensitive to outliers, thus we can use the dataset with them.
 Train and Test Split:
Using the same method as the preceding Logistic Regression and LDA, the train and test
data need to be separated, and the relevant libraries must be imported first.
 In CART, the decision tree is the most significant component.
Decision Tree-
 Fit the decision tree on the train data and export it. Create a new Word document and store
it in the project folder.
 Copy and paste the code into http://webgraphviz.com to view the decision tree; we can
remove the old code and replace it.
 Due to the large amount of information and number of categories in the data, the tree may
be messy. We will reduce the maximum number of leaves, the depth, and the sample size.
 "Gini" is the splitting criterion, which plays a significant role in the decision tree classifier.
Create a new Word document with reduced branches (30), leaf size (10), and depth (7), and
save it in the project folder.
 Now the decision tree looks better than before.
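The pruned tree and its Graphviz export can be sketched like this (synthetic data; the limits follow the depth-7 / 10-sample-leaf / 30-leaf settings mentioned above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Fit a gini-criterion CART with restricted depth, leaf size and leaf count,
# then export DOT text for viewing at webgraphviz.com.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(criterion="gini", max_depth=7,
                              min_samples_leaf=10, max_leaf_nodes=30,
                              random_state=7).fit(X, y)

dot_text = export_graphviz(tree, out_file=None)  # paste into webgraphviz.com
print(tree.get_depth(), tree.get_n_leaves())
```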
Let's now examine feature importance, which is defined as a method that rates input features
according to how helpful they are in predicting a target.

We can see that, with 'wife_age' having the most importance, we can
predict that use of the contraceptive method depends largely on the age of the woman.
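The feature-importance ranking is obtained like this (synthetic data; the column names are illustrative stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Rank features by the fitted tree's feature_importances_ attribute.
rng = np.random.default_rng(8)
X = pd.DataFrame(rng.normal(size=(300, 3)),
                 columns=["Wife_age", "Wife_education", "Husband_occupation"])
y = (X["Wife_age"] + 0.2 * X["Wife_education"] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=8).fit(X, y)
importance = pd.Series(tree.feature_importances_, index=X.columns)

print(importance.sort_values(ascending=False))
```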
AUC plot-
AUC: 0.824

We can see the ROC curve bending high; the model is good and its AUC value for the train
data is 82.4%.
AUC: 0.700
Here, the plot is smooth but bends lower over the area; the AUC value for the test
data is 70%.
Confusion matrix for train data-

By observing the confusion matrix, we can see that True Positive is 260 and True Negative is
474.
 Regarding Contraceptive_method_used (Label 0):
Precision(77%) - Of all married women expected to not use contraception, 77%
actually do not use it.
Recall (62%) – 62% of married women who do not use contraception had their
predictions accurate.

 Regarding Contraceptive_method_used (Label 1):


Precision(75%)-Out of all married women expected to use contraceptive methods,
75% of them actually do so.
Recall (86%) - 86% of married women who genuinely use contraceptives have
been predicted correctly.
 The model performs well on the training data, since its accuracy is 75%, which is higher
than 50%.

Confusion matrix for test data-

By checking the confusion matrix of the test data, we can see that the True Positive count is
91 and the True Negative count is 182.
 For the contraceptive technique utilized (Label 0):
Precision (67%): 67% of forecasted married women do not use contraception.
Recall (47%). Out of all married women who do not use contraception, 47% were
accurately predicted.

 For the contraceptive technique utilized (Label 1):


Precision (64%): 64% of married women anticipated to use contraception actually do
so.
Recall (81%): Out of all married women actually using contraception, 81% were
predicted correctly.
 The model has an accuracy of 65%, comparable to the training data.

CONCLUSION
 From the aforementioned models, it can be seen that each model tended to predict the
encoded label "1" (contraceptive method used), and the models' accuracy and F1 scores
likewise supported label "1".
 However, we are unable to determine definitively whether the contraceptive method was
used or not. We can, however, forecast that married women used the method, and the final
prediction also indicates the same.
