Monika Sree 11-07-2024
BUSINESS REPORT
E Monika Sree
11-07-2024
Problem 1
Shape of data-
The dataset has 8192 rows and 22 columns: 13 columns are of float type, 8 are of int
type, and 1 is of object type.
For rchar and wchar, the non-null count is less than 8192.
Null values check on data set-
We can observe that there are null values in the rchar and wchar columns.
Treating null values using the median and rechecking for null values-
For the rest of the columns, let’s treat the 0’s by replacing them with the median.
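The imputation steps above can be sketched roughly as follows. This is a minimal sketch on a small made-up frame, not the report's actual code: the helper name impute_with_median, the toy values, and the column "other" are assumptions; only the names rchar and wchar come from the data set.

```python
import numpy as np
import pandas as pd


def impute_with_median(df, null_cols, zero_cols):
    """Fill nulls in null_cols, and zeros in zero_cols, with each column's median."""
    out = df.copy()
    for col in null_cols:
        # median() skips missing values by default
        out[col] = out[col].fillna(out[col].median())
    for col in zero_cols:
        out[col] = out[col].replace(0, out[col].median())
    return out


# Hypothetical toy data; only the names rchar and wchar come from the report.
toy = pd.DataFrame({
    "rchar": [10.0, np.nan, 30.0, 50.0],
    "wchar": [5.0, 7.0, np.nan, 9.0],
    "other": [0.0, 2.0, 4.0, 6.0],
})
clean = impute_with_median(toy, null_cols=["rchar", "wchar"], zero_cols=["other"])
print(clean.isnull().sum())  # recheck: every count should now be 0
```

Note that if more than half of a column is 0, its median is also 0 and the replacement changes nothing; in that case a median of the non-zero values may be a better choice.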
There are outliers in the data, and they need to be treated.
Let’s use the IQR method to treat them.
Under the IQR method, any observation that is less than Q1-IQR or more than Q3+IQR
is treated as an outlier,
where
IQR = Q3 - Q1
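A minimal sketch of the IQR fences, assuming the outliers are treated by capping them at the nearest fence (the report does not state the exact treatment). The function names and the multiplier k are assumptions: k = 1.0 reproduces the fences as written above, while k = 1.5 is the more common convention.

```python
import pandas as pd


def iqr_bounds(series, k=1.0):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(series, k=1.0):
    """Clip values outside the fences to the nearest fence (assumed treatment)."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)


values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up data
lower, upper = iqr_bounds(values)
capped = cap_outliers(values)
print(lower, upper)
print(capped.tolist())
```

Here Q1 = 2 and Q3 = 4, so IQR = 2 and the fences are 0 and 6; the value 100 is capped to 6.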
Box plots after treating outliers-
Univariate Analysis-
Let’s plot histograms for the variables in the data set.
Bivariate Analysis-
The following are the scatterplots showing the relationship between the dependent
and independent variables:
Multivariate Analysis-
Scatter plot between different variables with hue as ‘runqsz’ column.
Let’s see the correlation between the variables.
Coefficients-
The R-square value tells us that the model can explain 76.6% of the variance in the
training set.
The adjusted R-square is nearly equal to the R-square, at 76.6%.
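To make these metrics concrete, here is a minimal sketch of how R-square, adjusted R-square, and RMSE are computed from actual and predicted values. The function names and all numbers are illustrative assumptions, not the report's data.

```python
import math


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_rows, n_predictors):
    """Penalise R-square for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)


def rmse(y_true, y_pred):
    """Root of the mean squared difference between actual and predicted."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
r2 = r_squared(y_true, y_pred)
error = rmse(y_true, y_pred)

# Hypothetical sizes: with thousands of rows and about twenty predictors,
# the penalty is tiny, so adjusted R-square stays close to R-square.
adj_large = adjusted_r_squared(0.766, n_rows=5000, n_predictors=20)
print(r2, error, adj_large)
```

This is why the two values nearly coincide when the rows far outnumber the predictors.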
RMSE of train data-
Graph of residuals-
Problem 2
There are 2 features with float datatype, 1 with int datatype and 7 with object
datatype.
Summary of dataset-
Data types-
Null Value Check-
From the plot below, we may note that tertiary-educated women use contraceptive
methods the most.
The plot below shows that wives whose husbands have the highest education level use
contraceptive methods the most.
The plot below shows that non-working women use contraceptive methods the most.
The plot below shows that women with a very high standard-of-living index use
contraceptive methods the most.
Correlation between columns-
We found that only three integer features could be plotted. We now convert
all object columns to categorical codes.
The categorical variables Wife_education, Husband_education, Wife_religion, Wife_Working,
Standard_of_living_index, Media_exposure, and Contraceptive_method_used were
encoded as integers, with the ordinal variables in ascending order from lowest to
highest, because LDA does not accept text variables as inputs in model development.
Below is the encoding used:
Wife_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Husband_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Wife_religion: Scientology = 1 and non-Scientology = 2.
Wife_Working: Yes = 1 and No = 2.
Standard_of_living_index: Very Low = 1, Low = 2, High = 3, Very High = 4.
Media_exposure: Exposed = 1 and Not-Exposed = 2.
Contraceptive_method_used: Yes = 1 and No = 0.
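The encoding listed above could be applied with explicit mappings, as in the sketch below. The exact label spellings in the raw data (for example "Non-Scientology") and the helper name encode are assumptions; the integer codes are the ones listed above.

```python
import pandas as pd

EDUCATION = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}

# Label spellings are assumed; the integer codes follow the list above.
ENCODINGS = {
    "Wife_education": EDUCATION,
    "Husband_education": EDUCATION,
    "Wife_religion": {"Scientology": 1, "Non-Scientology": 2},
    "Wife_Working": {"Yes": 1, "No": 2},
    "Standard_of_living_index": {"Very Low": 1, "Low": 2, "High": 3, "Very High": 4},
    "Media_exposure": {"Exposed": 1, "Not-Exposed": 2},
    "Contraceptive_method_used": {"Yes": 1, "No": 0},
}


def encode(df, encodings=ENCODINGS):
    """Replace text labels with their integer codes, column by column."""
    out = df.copy()
    for col, mapping in encodings.items():
        if col in out.columns:
            out[col] = out[col].map(mapping)
    return out


# Two made-up rows for illustration.
toy = pd.DataFrame({
    "Wife_education": ["Tertiary", "Uneducated"],
    "Wife_Working": ["No", "Yes"],
    "Contraceptive_method_used": ["Yes", "No"],
})
encoded = encode(toy)
print(encoded)
```

An explicit mapping keeps the order under control; automatic categorical codes would follow alphabetical order, which need not match the intended ranking.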
Info of the data set-
Boxplot plotting-
From the results, we can observe that the highest accuracy obtained is 69.43%.
The model accuracy is 67.3%.
AUC and ROC curve
Let’s plot the ROC curve and compute the AUC of the model separately for the train
and test data sets.
ROC curve for the training data set
1. Train data set
The ROC curve is not perfectly shaped, but it is acceptable.
The Area Under the Curve (AUC) for the train data is 71.8%.
2. Test data set
The test data curve is similar to the train data curve, with variations near the starting point.
As this curve also lies above the diagonal line, it is acceptable, though it is not an ideal curve.
The area under the curve is 71.8%.
When comparing the train and test data, the ROC curves are largely comparable,
with just minor variations. The AUC is 71.8%. Let’s move to the confusion matrix.
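As a minimal sketch of how the ROC points and the AUC can be obtained with scikit-learn: the labels and scores below are made up; in practice the scores would typically be the fitted model's predicted probabilities for class 1, computed once for the train set and once for the test set.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False-positive and true-positive rates at each threshold: the ROC curve points.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under that curve; 0.5 matches the diagonal line, 1.0 is a perfect ranking.
auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.3f}")  # prints AUC: 0.750
```

Three of the four (positive, negative) pairs are ranked correctly here, which gives the AUC of 0.75.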
Confusion matrix
1) Train data
This plot shows the true labels against the predicted labels (0’s and 1’s).
The classification report is given below-
This plot shows the true labels against the predicted labels (0’s and 1’s).
The classification report is given below-
This plot shows the true labels against the predicted labels (0’s and 1’s).
The classification report is given below-
In the same way as shown above, we obtain comparable values here, with an
accuracy of 65%.
The model's overall accuracy is 67%, indicating that 67% of the total predictions are
correct.
The accuracy, AUC, Precision, and Recall for test data closely match those for the
training data. This demonstrates that there is no overfitting or underfitting, and
overall, the model is suitable for classification.
Linear Discriminant Analysis-
Train Test Split data-
The procedure for splitting the data into train and test sets is the same as for Logistic Regression.
LinearDiscriminantAnalysis (LDA) is imported from the sklearn library, and the results are given below.
Train data
Test data
There is a slight difference between the training and test results, but this is acceptable:
the accuracy on the training data is 67% and the accuracy on the test data is 65%.
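The split-and-fit procedure for LDA can be sketched as below. This is not the report's code: the data are synthetic, and the 70/30 split and random_state values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded data set (assumption).
X, y = make_classification(n_samples=500, n_features=9, n_informative=4,
                           random_state=1)

# Assumed 70/30 split; the report does not state the ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
train_acc = lda.score(X_train, y_train)
test_acc = lda.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Comparing the two accuracies, as done above, is a quick check for overfitting: a large gap would suggest the model has memorised the training data.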
CART-
CART is not sensitive to outliers, so we can use the dataset without treating them.
Train and Test Split:
As with the preceding Logistic Regression and LDA models, the train and test
data need to be separated, and the relevant libraries must be imported first.
In CART, the decision tree is the core of the model.
Decision Tree-
Fit the decision tree on the training data. Export the tree to a new Word document and store
it in the project folder.
Copy and paste the exported code into http://webgraphviz.com to view the decision tree,
removing any old code there and replacing it with the new code.
Because the data contain a large amount of information and many categories, the tree may be messy.
We will therefore restrict the number of leaves, the depth, and the sample size.
The decision tree classifier uses "gini" as its splitting criterion. Create a new Word
document for the tree with reduced branches (30), leaf (10), and depth (7), and save it in the project
folder.
Now the decision tree looks better than before.
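A minimal sketch of a restricted Gini tree and its export for http://webgraphviz.com. Mapping "branches (30), leaf (10), and depth (7)" to min_samples_split=30, min_samples_leaf=10, and max_depth=7 is an assumption, as are the synthetic data and the feature names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=600, n_features=9, n_informative=4,
                           random_state=1)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Assumed mapping of the report's settings onto scikit-learn parameters.
tree = DecisionTreeClassifier(
    criterion="gini",
    max_depth=7,
    min_samples_leaf=10,
    min_samples_split=30,
    random_state=1,
)
tree.fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string,
# which can be pasted into a Graphviz viewer.
dot_text = export_graphviz(tree, out_file=None, feature_names=feature_names,
                           class_names=["0", "1"])
print(dot_text[:200])
```

Limiting depth and leaf size keeps the tree readable and also reduces the risk of overfitting.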
Let's now examine feature importance, which rates the input features
according to how useful they are in predicting the target.
We can see that ‘wife_age’ has the highest importance, so we can tentatively
say that the use of a contraceptive method depends on the woman's age.
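To illustrate how feature importances are read off a fitted tree, here is a small sketch on synthetic data in which only one column drives the label. The column names and data are hypothetical and unrelated to the report's results.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)  # the label depends only on the first column
names = ["informative", "noise_1", "noise_2"]  # hypothetical feature names

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# feature_importances_ sums to 1; larger values mean the feature contributed
# more to reducing Gini impurity across the tree's splits.
ranked = sorted(zip(names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

A high importance shows that the tree relies on the feature; it does not by itself show the direction of the effect.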
AUC plot-
AUC: 0.824
The ROC curve bends high, so the model fits the train data well; its AUC value for the train data is
82.4%.
AUC: 0.700
Here the curve is smooth but bends less, and the AUC value for the test
data is 70%.
Confusion matrix for train data-
From the confusion matrix, we can see that the True Positive count is 260 and the True Negative count is
474.
Regarding Contraceptive_method_used (Label 0):
Precision (77%): Of all married women predicted not to use contraception, 77%
actually do not use it.
Recall (62%): Of all married women who do not use contraception, 62% were
correctly predicted.
From the confusion matrix of the test data, we can see that the True Positive count
is 91 and the True Negative count is 182.
Regarding Contraceptive_method_used (Label 0):
Precision (67%): Of all married women predicted not to use contraception, 67% actually do not use it.
Recall (47%): Of all married women who do not use contraception, 47% were
correctly predicted.
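The precision and recall definitions used above can be sketched directly from a confusion matrix. The counts below are hypothetical, chosen only to keep the arithmetic easy; they are not the report's matrices, whose off-diagonal counts are not given in the text.

```python
def precision_recall(matrix, label):
    """matrix[i][j] = number of rows with true label i predicted as label j."""
    correct = matrix[label][label]
    predicted_as_label = sum(row[label] for row in matrix)  # column total
    actually_label = sum(matrix[label])                     # row total
    return correct / predicted_as_label, correct / actually_label


def accuracy(matrix):
    """Share of all predictions that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical counts: rows are true labels 0 and 1, columns are predictions.
matrix = [[60, 40],
          [20, 80]]
precision_0, recall_0 = precision_recall(matrix, label=0)
overall = accuracy(matrix)
print(precision_0, recall_0, overall)
```

In this made-up matrix, 80 rows are predicted as label 0 and 60 of them are correct (precision 0.75), while 100 rows are truly label 0 and 60 are recovered (recall 0.60).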
CONCLUSION
From the models above, it can be seen that each model predicted the encoded label "1"
(contraceptive method used) more reliably, and that the
models' accuracy and F1 score likewise favoured the label "1".
We cannot determine with certainty whether a contraceptive method was used or not.
However, the models predict that married women used a method, and the final
prediction indicates the same.