Formulate Hypothesis
Travel Frequency Hypothesis: Employees who travel frequently for business are more
likely to leave the company due to increased job stress or work-life balance challenges.
Age and Experience Hypothesis: Younger employees or employees with fewer total
working years are more likely to leave, as they may seek faster career progression or better
opportunities.
Work-Life Balance Hypothesis: Employees who report low work-life balance are more
likely to leave the organization.
Promotion Frequency Hypothesis: Employees who haven’t been promoted in a long time
are more likely to leave due to perceived stagnation in career growth.
Relationship with Manager Hypothesis: Employees with fewer years with their current
manager may have weaker bonds or lack mentorship, potentially leading to higher attrition.
Role Tenure Hypothesis: Employees who have been in the same role for many years without
change may feel stagnant and may be more likely to leave.
Department-Specific Hypothesis: Certain departments (e.g., high-stress ones like
Cardiology) may have higher attrition rates due to the nature of the work.
Dataset Overview
The dataset represents a healthcare company's employee data, with a focus on attributes that
could help analyze employee attrition and workplace dynamics.
Key Steps in Preprocessing:
1. Missing Values Handling:
o Columns with more than 50% missing values were dropped.
o For remaining columns, missing numerical values were filled with the mean,
and categorical missing values were filled with the mode.
2. Outlier Treatment:
o Outliers in numerical columns were treated using the Interquartile Range
(IQR) method: values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR were
capped at those bounds.
3. Encoding:
o Binary columns (e.g., Attrition, Gender, Over18, Over Time) were
label-encoded.
o Multi-class categorical columns (e.g., Business Travel, Department, Education
Field, Job Role, Marital Status, Shift) were one-hot encoded.
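The three preprocessing steps above can be sketched in pandas as follows. This is a minimal illustration, not the code used for this report: the function name `preprocess`, its parameters, and the toy rows are assumptions made for the example, and the report does not say whether identifier columns such as Employee ID were excluded from outlier capping.

```python
import numpy as np
import pandas as pd


def preprocess(df, binary_cols, multiclass_cols, missing_threshold=0.5):
    """Hypothetical helper mirroring the three steps described in the text."""
    # 1. Drop columns whose share of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean() <= missing_threshold].copy()

    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.columns.difference(num_cols)

    # Fill numerical gaps with the column mean, categorical gaps with the mode.
    for col in num_cols:
        df[col] = df[col].fillna(df[col].mean())
    for col in cat_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 2. Cap values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
    for col in num_cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    # 3. Label-encode binary columns, one-hot encode multi-class columns.
    for col in binary_cols:
        if col in df.columns:
            df[col] = pd.factorize(df[col], sort=True)[0]
    present = [c for c in multiclass_cols if c in df.columns]
    return pd.get_dummies(df, columns=present)


# Toy rows invented for illustration only; they are not taken from the dataset.
raw = pd.DataFrame({
    "Age": [25, 30, np.nan, 35, 120],
    "Attrition": ["No", "Yes", "No", "No", "Yes"],
    "Department": ["Cardiology", "Maternity", None, "Cardiology", "Maternity"],
    "Mostly Empty": [np.nan, np.nan, np.nan, 1.0, np.nan],
})
clean = preprocess(raw, binary_cols=["Attrition"], multiclass_cols=["Department"])
print(clean)
```

In this sketch the mean and the quartiles are computed on the full table. If the data is later split for modelling, computing them on the training portion only would avoid leaking information from the test set.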
Processed Columns:
The dataset now contains 47 columns, which include both numerical and encoded categorical
features. Some key columns are:
Demographic Data:
o Employee ID, Age, Gender, Marital Status, Education, Education Field
Work Information:
o Department, Job Role, Job Level, Total Working Years, Years at Company,
Years In Current Role, Years Since Last Promotion, Years With Curr Manager
Compensation and Benefits:
o Daily Rate, Monthly Income, Hourly Rate, Percent Salary Hike
Work Satisfaction and Performance:
o Environment Satisfaction, Job Satisfaction, Relationship Satisfaction,
Performance Rating
Attrition:
o Attrition (target variable indicating whether an employee left the company)
The numerical features have been summarized with mean, standard deviation,
minimum, and maximum values for easy reference. The summary provides insights
into average values across work experience, income, job satisfaction, and other key
attributes of employees.
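A summary table of this kind can be produced with a single pandas aggregation. The sketch below uses a few made-up rows in place of the real dataset, so the printed values are illustrative only; the column names follow those listed above.

```python
import pandas as pd

# Toy rows invented for illustration; replace with the processed dataset.
df = pd.DataFrame({
    "Age": [29, 41, 35, 52],
    "Monthly Income": [3200, 7800, 5100, 9900],
    "Job Satisfaction": [2, 4, 3, 4],
    "Department": ["Cardiology", "Maternity", "Cardiology", "Maternity"],
})

# Keep numerical features only, then compute the four statistics per column.
summary = (
    df.select_dtypes(include="number")
      .agg(["mean", "std", "min", "max"])
      .T  # one row per feature, one column per statistic
)
print(summary.round(2))
```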
This processed dataset is ready for exploratory data analysis or model building,
particularly for tasks like predicting employee attrition.
Summary of EDA and Relevant Visualizations
The main findings of the exploratory analysis of employee attrition in this healthcare
setting are summarized below:
Demographic and Employment Factors:
o Younger employees (particularly those in their 20s and early 30s) show higher
attrition rates.
o Employees with lower monthly income are more likely to leave.
o Specific departments, such as "Maternity," have higher attrition rates.
Work Conditions:
o Frequent business travel correlates with higher attrition rates.
o Employees rating their work-life balance lower tend to leave more often.
o Those who work overtime exhibit increased attrition, suggesting excessive
work hours contribute to burnout.
Visual Data Insights:
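o The "Attrition by Overtime" chart shows a higher share of leavers among
employees who work overtime. This is consistent with overtime eroding work-life
balance and job satisfaction, although the chart alone does not establish the cause.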
Implications for Retention:
o Factors influencing attrition include age, income, department, business travel,
work-life balance, and overtime.
o Strategies to improve retention could involve:
Fair compensation practices.
Limiting overtime hours.
Promoting a healthier work-life balance.
Offering career development opportunities.
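The group comparisons behind these findings reduce to computing an attrition rate per category. A minimal sketch, assuming Attrition is already encoded as 0/1; the helper `attrition_rate_by`, the age bands, and the toy rows are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy rows invented for illustration only.
df = pd.DataFrame({
    "Over Time": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "Age": [24, 31, 45, 52, 28, 36],
    "Attrition": [1, 0, 0, 0, 1, 1],  # 1 = left the company
})


def attrition_rate_by(frame, column):
    """Share of leavers within each category of `column` (hypothetical helper)."""
    return frame.groupby(column, observed=True)["Attrition"].mean()


by_overtime = attrition_rate_by(df, "Over Time")

# Bucket ages into bands before grouping; the band edges are an assumption.
df["Age Band"] = pd.cut(
    df["Age"],
    bins=[17, 29, 39, 49, 60],
    labels=["18-29", "30-39", "40-49", "50-60"],
)
by_age = attrition_rate_by(df, "Age Band")

print(by_overtime)
print(by_age)
```

Plotting either result with `by_overtime.plot(kind="bar")` would give a chart comparable to the "Attrition by Overtime" visualization mentioned above.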
Data Modelling
Model Performance Analysis
The Random Forest Classifier was applied to predict employee attrition in the healthcare
dataset, with the following performance metrics obtained from the model evaluation:
Accuracy: The model achieved an accuracy of 89%, meaning it correctly classified
approximately 89% of the instances in the test set. Because employees who stayed
make up the majority class, a high accuracy alone does not show that the model is
effective: it must be checked against the per-class results (attrition and
non-attrition) to rule out a model that mostly predicts the majority class.
Precision: The precision for the positive class (attrition) is 88%. This means that
when the model predicts that an employee will leave, it is correct 88% of the time. A
high precision indicates a low rate of false positives, which is important in minimizing
unnecessary concern for employees who are not at risk of attrition.
Recall: The recall for the positive class is notably low at 28%. This implies that the
model only correctly identifies 28% of the actual attrition cases. A low recall indicates
that many employees who actually left the organization were not predicted as such by
the model, leading to a high rate of false negatives. This aspect is critical because it
means the model might not be effective in identifying at-risk employees, which is
essential for implementing proactive retention strategies.
F1 Score: The F1 score for the positive class is 0.43. The F1 score is the harmonic
mean of precision and recall, so it is pulled down by the low recall: the model's
attrition predictions are usually correct, but it misses most employees who actually
leave, leaving significant room for improvement in identifying at-risk employees.
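The four metrics above correspond to standard scikit-learn functions. The report does not include its training code, so the following is only a sketch under stated assumptions: the data is a synthetic, imbalanced stand-in generated with `make_classification`, and the hyperparameters (`n_estimators=200`, the 70/30 split, the random seeds) are arbitrary choices, not the settings used for the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the attrition data: roughly 15% positive class (assumption).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.85, 0.15], random_state=42
)

# Stratify so the train and test sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision, recall, and F1 are reported for the positive class (label 1).
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The numbers printed here describe the synthetic data only and will not match the figures reported for the healthcare dataset.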
Classification Report Insights
The classification report further elaborates on the model's performance:
For the negative class (non-attrition), the model performs well with a precision of
89% and a recall of 99%, which means it effectively identifies most employees who
are not likely to leave.
For the positive class (attrition), the precision is 88%, but the recall of 28% reveals a
challenge in predicting actual cases of employee attrition. Out of 74 employees who
left, the model only identified 21 correctly, missing 53 cases.
The macro average metrics (precision: 88%, recall: 64%, F1 score: 68%) highlight
the disparity in performance between the classes, showing that while the model can
classify the majority class well, it fails to adequately capture the minority class.
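The figures in the classification report can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated above (21 of 74 leavers identified, and the per-class precision and recall); the variable names and the helper `f1` are illustrative.

```python
# Counts and per-class figures as stated in the classification report.
actual_leavers = 74
true_positives = 21
missed = actual_leavers - true_positives  # false negatives

precision = {"stay": 0.89, "leave": 0.88}
recall = {"stay": 0.99, "leave": true_positives / actual_leavers}


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


f1_scores = {label: f1(precision[label], recall[label]) for label in precision}

# Macro averages weight both classes equally, regardless of class size.
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

print(f"missed leavers: {missed}")                        # 53
print(f"recall (leave): {recall['leave']:.2f}")           # 0.28
print(f"F1 (leave):     {f1_scores['leave']:.2f}")        # 0.43
print(f"macro recall:   {macro_recall:.2f}")              # 0.64
print(f"macro F1:       {macro_f1:.2f}")                  # 0.68
```

Because the macro average gives the minority class the same weight as the majority class, the weak recall on attrition cases lowers the macro recall and macro F1 well below the 89% accuracy.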