Project Report
1. Technical Approach
We built four models using the Random Forest, Decision Tree, Naïve Bayes, and Support Vector Machine
(SVM) classifiers from the scikit-learn library. Each model was trained on the 14 components retained
from the Principal Component Analysis (PCA). Each model was fit on the training set and then used to
predict the target for the independent variables of the test set. The predictions were compared against
the actual values to compute the confusion matrix and the accuracy score. The accuracy score of each
model is as follows:
The models perform moderately well. The Random Forest Classifier has the highest accuracy score,
closely followed by the Naïve Bayes Classifier and the Support Vector Machine. The Decision Tree
Classifier has the lowest accuracy of the four models. The four models were then rebuilt without PCA,
that is, without eliminating any features from the given dataset (except for the Body Mass Index,
which was dropped earlier due to high correlation). The new models show a significant difference in
the accuracy score:
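The comparison described above, the same four classifiers trained once on 14 principal components and once on all features, could be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for this report: the data is a synthetic stand-in generated with make_classification, the feature count of 19 is arbitrary, and the choice of GaussianNB and of SVC with default settings is assumed because the report does not name the exact variants.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the report's dataset (assumption): 3 groups, 19 features.
X, y = make_classification(
    n_samples=600, n_features=19, n_informative=8,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def make_models():
    """Return fresh instances of the four classifier families (variants assumed)."""
    return {
        "Random Forest": RandomForestClassifier(random_state=0),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Naive Bayes": GaussianNB(),
        "SVM": SVC(),
    }


def evaluate(n_components=None):
    """Fit each model, optionally behind a PCA step, and score it on the test split."""
    results = {}
    for name, clf in make_models().items():
        # With n_components set, PCA is fit on the training data only.
        model = make_pipeline(PCA(n_components=n_components), clf) if n_components else clf
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1, 2]),
        }
    return results


with_pca = evaluate(n_components=14)  # 14 components, as in the report
without_pca = evaluate()              # all features, no PCA

for name in with_pca:
    print(f"{name}: PCA={with_pca[name]['accuracy']:.3f}, "
          f"no PCA={without_pca[name]['accuracy']:.3f}")
```

Wrapping PCA and the classifier in one pipeline keeps the projection learned from the training split only, so the test split does not influence the components. The numbers printed for the synthetic data say nothing about the scores reported here.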
Without PCA, the accuracy scores of the Random Forest Classifier and the Decision Tree Classifier
increase markedly. Although PCA served its purpose of reducing the dimensionality of the data, the
models trained on the PCA components failed to capture the underlying pattern and returned relatively
lower accuracy scores. The Random Forest Classifier has the best accuracy score, so its parameters
were tuned further with Grid Search. The parameter grid was n_estimators: [6, 100, 30] and
max_depth: [5, 7, 10]. Even after tuning, the best parameters {'max_depth': 7, 'n_estimators': 30}
did not show any significant improvement in the accuracy score. An interesting observation is that the
SVM model classifies every observation as either Group 1 or Group 2 and never predicts Group 0.
This can be seen in the confusion matrix:
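For reference, the Grid Search step for the Random Forest Classifier might look like the sketch below. The parameter grid is the one stated in this section; everything else is an assumption made for illustration: the synthetic data, the 5-fold cross-validation, and the use of accuracy as the selection metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(
    n_samples=300, n_features=19, n_informative=8,
    n_classes=3, random_state=0,
)

# Grid taken from the report.
param_grid = {
    "n_estimators": [6, 100, 30],
    "max_depth": [5, 7, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",  # assumed selection metric
    cv=5,                # assumed number of folds
)
search.fit(X, y)

print(search.best_params_)           # best combination found on the synthetic data
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

The grid has only nine combinations, so an exhaustive search is cheap. A best score close to the untuned score, as observed in this report, suggests the default settings were already near the best this grid can offer.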
2. Test and Evaluation
For the final evaluation, the test data of every model was replaced with the Test Dataset provided
for evaluation, and the confusion matrix and accuracy score were computed for each model:
The Random Forest Classifier and the Decision Tree Classifier share the highest accuracy score at
71.62%. The Naïve Bayes Classifier is third at about 60.81%. The Support Vector Machine classifies
every observation as Group 1 and never predicts Group 0 or Group 2, which means that the SVM is not a
recommended model for the given problem statement. This can be verified from the confusion matrix
below:
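The kind of collapse described for the SVM can be read off a confusion matrix mechanically: a class that is never predicted has a column that sums to zero. The sketch below shows the check on a small set of hypothetical labels invented for illustration; they are not the report's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

labels = [0, 1, 2]

# Hypothetical labels for illustration only (not the report's data).
y_true = np.array([0, 1, 2, 1, 1, 0, 2, 1])
y_pred = np.full_like(y_true, 1)  # a collapsed model that always answers Group 1

# Rows are actual groups, columns are predicted groups.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# A column sum of zero means that group was never predicted.
predicted_counts = cm.sum(axis=0)
never_predicted = [lab for lab, n in zip(labels, predicted_counts) if n == 0]

print(cm)
print("never predicted:", never_predicted)
print("accuracy:", accuracy_score(y_true, y_pred))
```

In this toy case the accuracy is simply the share of Group 1 among the true labels, which is why a collapsed model can still post a non-trivial accuracy score while being useless for the other groups.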
An accuracy score above 70% is a good score considering that the problem statement falls within the
Human Resources domain. Human Resources deals directly with human behavior; hence, very high
accuracy scores cannot be expected. Moreover, we have only a limited set of variables with which to
predict employee behavior. Several other variables, such as employee morale, job satisfaction,
relationship with the manager, and workplace ambience, which are generally considered key indicators
of an employee's performance and absenteeism rate, are not present in the dataset.
Thus, the Random Forest Classifier and the Decision Tree Classifier, each with an accuracy score of
71.62%, are selected for further evaluation. Because the accuracy score alone is not a sufficient
metric to estimate the discrimination ability of a model, we plotted the Receiver Operating
Characteristic (ROC) curve, which plots the True Positive Rate (probability of detection) against the
False Positive Rate (probability of false alarm). As these two models are our final candidates, the
ROC curve was plotted only for them.
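Because the target has three groups, a per-class ROC curve is usually obtained with a one-vs-rest scheme: each group in turn is treated as the positive class. The sketch below computes the area under the curve per class for both candidate models. It is an illustration under assumptions, not the report's code: the data is synthetic, the one-vs-rest scheme is assumed from the per-class AUC values reported, and the tree depth limit is arbitrary. Plotting is omitted; the fpr and tpr arrays returned by roc_curve can be passed to any plotting library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

classes = [0, 1, 2]

# Synthetic stand-in for the report's dataset (assumption).
X, y = make_classification(
    n_samples=600, n_features=19, n_informative=8,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One indicator column per class: 1 where the sample belongs to that class.
y_test_bin = label_binarize(y_test, classes=classes)


def per_class_auc(clf):
    """Fit clf and return {class: one-vs-rest AUC} on the test split."""
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)  # columns follow clf.classes_, here 0, 1, 2
    result = {}
    for i, c in enumerate(classes):
        fpr, tpr, _ = roc_curve(y_test_bin[:, i], scores[:, i])
        result[c] = auc(fpr, tpr)
    return result


rf_auc = per_class_auc(RandomForestClassifier(random_state=0))
# max_depth=5 is an arbitrary choice so the tree yields graded probabilities.
dt_auc = per_class_auc(DecisionTreeClassifier(max_depth=5, random_state=0))

print("Random Forest:", {c: round(v, 2) for c, v in rf_auc.items()})
print("Decision Tree:", {c: round(v, 2) for c, v in dt_auc.items()})
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of that class from the rest, which is the scale on which the per-class values in this report should be read.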
The Decision Tree Classifier has an area under the curve (AUC) of 0.94 for Class 1 but lower values
of 0.79 and 0.78 for Class 2 and Class 3, respectively. Hence, this model's discrimination ability is
satisfactory even though its accuracy score is 71.62%. The ROC curve is shown below:
Therefore, the Random Forest Classifier produced the better classification model, combining a high
accuracy score with a high area under the curve.