Project Report

Ms project

1. Technical Approach

We built four classification models using the Random Forest, Decision Tree, Naïve Bayes, and Support
Vector Machine (SVM) classifiers, all imported from the scikit-learn library. Each model was trained on
the 14 components retained from the Principal Component Analysis (PCA). Each model was fit on the
training set and then used to predict the target for the independent variables of the test set. The
predictions were compared against the actual test-set values to compute the confusion matrix and the
accuracy score. The accuracy score of each model is as follows:

Technique                   Accuracy Score
Random Forest Classifier    75.00%
Decision Tree Classifier    67.42%
Naïve Bayes Classifier      74.24%
Support Vector Machine      71.96%
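
The workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the report: the data is a synthetic stand-in generated with make_classification, the 80/20 split, the StandardScaler step, the Gaussian variant of Naïve Bayes, and the default hyperparameters are all assumptions, and only the 14 PCA components and the four classifier families come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the report's dataset (assumption): 3 groups, 20 features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The four classifier families named in the report, with default settings (assumption).
classifiers = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(),
}

results = {}
for name, clf in classifiers.items():
    # Scale (assumption), project onto 14 principal components, then classify.
    model = make_pipeline(StandardScaler(), PCA(n_components=14), clf)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1, 2]),
    }

for name, r in results.items():
    print(f"{name}: {r['accuracy']:.2%}")
    print(r["confusion_matrix"])
```

Removing the PCA step from the pipeline gives the corresponding setup in which the models see all the original features.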

The accuracy scores are moderate. The Random Forest Classifier has the highest accuracy score, closely
followed by the Naïve Bayes Classifier and the Support Vector Machine. The Decision Tree Classifier has
the lowest score of the four models tested. The four models were then rebuilt without applying PCA,
i.e. without eliminating any features from the given dataset (except for the Body Mass Index, which was
dropped earlier due to high correlation). The new models show a marked difference in the accuracy
score:

Technique                   Accuracy Score
Random Forest Classifier    84.84%
Decision Tree Classifier    81.06%
Naïve Bayes Classifier      78.78%
Support Vector Machine      74.24%

Without Principal Component Analysis, the accuracy scores of the Random Forest Classifier and the
Decision Tree Classifier increase markedly, from 75.00% to 84.84% and from 67.42% to 81.06%
respectively. A likely explanation is that, although PCA served its purpose in reducing the
dimensionality of the data, the models trained on the PCA components failed to capture the underlying
pattern and returned relatively lower accuracy scores. The Random Forest Classifier has the best
accuracy score, so its parameters were tuned further with Grid Search. The parameter values searched
were n_estimators: [6, 100, 30] and max_depth: [5, 7, 10]. The best parameters found,
{'max_depth': 7, 'n_estimators': 30}, did not show any significant improvement in the accuracy score.
An interesting observation was that the SVM model classifies every observation as either Group 1 or
Group 2 and never predicts Group 0. This can be seen in the confusion matrix:

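
The Grid Search step on the Random Forest Classifier mentioned above could look roughly like the sketch below. The parameter grid is the one quoted in the text; the synthetic data, the 3-fold cross-validation, and the accuracy scoring are assumptions made only so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# Candidate values quoted in the report.
param_grid = {"n_estimators": [6, 100, 30], "max_depth": [5, 7, 10]}

# Try every combination; cv=3 and accuracy scoring are assumptions.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="accuracy", cv=3)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2%}")
```

On different data the winning combination will generally differ from the one reported in the text.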
2. Test and Evaluation

The test data of every model was replaced with the test dataset provided for evaluation, and the
confusion matrix and accuracy score were computed for each model:

Technique                   Accuracy Score
Random Forest Classifier    71.62%
Decision Tree Classifier    71.62%
Naïve Bayes Classifier      60.81%
Support Vector Machine      60.60%

The Random Forest Classifier and the Decision Tree Classifier share the highest accuracy score at
71.62%. The Naïve Bayes Classifier is third at 60.81%. The Support Vector Machine classifies every
observation as Group 1 and never predicts Group 0 or Group 2, so the SVM is not a recommended model
for the given problem statement. This can be verified from the confusion matrix below:
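
A model that never predicts some groups shows up in a confusion matrix as columns that contain only zeros. The plain-Python sketch below illustrates the check on made-up labels (not the report's data); the helper names confusion_matrix and never_predicted are hypothetical and chosen only for this example.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are true groups, columns are predicted groups."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def never_predicted(matrix):
    """Return the groups whose predicted column sums to zero."""
    n = len(matrix)
    return [c for c in range(n) if sum(row[c] for row in matrix) == 0]


# Made-up labels: a model that answers Group 1 for every observation.
y_true = [0, 0, 1, 1, 1, 2, 2, 1, 0, 1]
y_pred = [1] * len(y_true)

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)

print(cm)                   # [[0, 3, 0], [0, 5, 0], [0, 2, 0]]
print(never_predicted(cm))  # [0, 2]
print(accuracy)             # 0.5
```

In this toy case the accuracy equals the share of Group 1 in the test labels, which is why an accuracy score alone can hide this failure.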

An accuracy score above 70% is a good score considering that the problem statement falls within the
Human Resources domain. Human Resources deals directly with the behavior of human beings; hence,
high accuracy scores cannot be expected. Moreover, we have only a limited set of variables with which
to predict an employee's behavior. Several other variables, such as employee morale, job satisfaction,
relationship with the manager, and workplace ambience, which are generally considered key indicators
of an employee's performance and absenteeism rate, are not present in the dataset.

Thus, the Random Forest Classifier and the Decision Tree Classifier, which both have an accuracy score
of 71.62%, are selected for further evaluation. As the accuracy score alone is not a sufficient metric
to estimate the discrimination ability of a model, we plotted the Receiver Operating Characteristic
(ROC) curve, which plots the True Positive Rate (probability of detection) against the False Positive
Rate (probability of false alarm), for these two models only.
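
For a problem with three classes, a ROC curve is usually drawn per class by treating that class as positive and the other two as negative (one-vs-rest). The report does not say how its curves were produced, so the plain-Python sketch below is only an assumption about the general technique: it sweeps a threshold over predicted class probabilities, collects (False Positive Rate, True Positive Rate) points, and integrates them with the trapezoidal rule. The probabilities are made up and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points for binary labels, sweeping the threshold downward."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def one_vs_rest_auc(y_true, proba, n_classes):
    """AUC per class, treating that class as positive and the rest as negative."""
    result = {}
    for c in range(n_classes):
        binary = [1 if t == c else 0 for t in y_true]
        scores = [row[c] for row in proba]
        result[c] = auc(roc_points(binary, scores))
    return result


# Made-up true classes and predicted class probabilities (not the report's data).
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]

aucs = one_vs_rest_auc(y_true, proba, n_classes=3)
print(aucs)  # {0: 0.875, 1: 0.875, 2: 1.0}
```

An area of 1.0 means the class is perfectly separated from the others by its score, while 0.5 corresponds to random ranking.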

Random Forest Classifier


The Random Forest Classifier has an area under the curve (AUC) of around 0.89 for all the classes.
Thus, the model has strong discrimination ability, with an AUC of 0.89 and an accuracy of 71.62%.
The ROC curve is shown below:

Decision Tree Classifier

The Decision Tree Classifier has an area under the curve of 0.94 for Class 1, but lower values of 0.79
and 0.78 for Class 2 and Class 3 respectively. Hence, the discrimination ability of this model is only
satisfactory, although its accuracy score is also 71.62%. The ROC curve is shown below:

Therefore, the Random Forest Classifier produced the better classification model: it matches the
Decision Tree Classifier on accuracy score and has a more consistent area under the curve across the
classes.
