Formulate Hypothesis

The document presents hypotheses regarding employee attrition factors, including travel frequency, age, work-life balance, and promotion frequency. It summarizes descriptive statistics of a healthcare company's employee dataset, highlighting key demographics and work conditions that influence attrition. The analysis includes model performance metrics from a Random Forest Classifier, indicating high accuracy but low recall for attrition predictions, emphasizing the need for improved identification of at-risk employees.

Uploaded by

Phaneendra jammu

Formulate Hypothesis.

Travel Frequency Hypothesis: Employees who travel frequently for business are more
likely to leave the company due to increased job stress or work-life balance challenges.
Age and Experience Hypothesis: Younger employees or employees with fewer total
working years are more likely to leave, as they may seek faster career progression or better
opportunities.
Work-Life Balance Hypothesis: Employees who report low work-life balance are more
likely to leave the organization.
Promotion Frequency Hypothesis: Employees who haven’t been promoted in a long time
are more likely to leave due to perceived stagnation in career growth.
Relationship with Manager Hypothesis: Employees with fewer years with their current
manager may have weaker bonds or lack mentorship, potentially leading to higher attrition.
Role Tenure Hypothesis: Employees who have been in the same role for many years without
change may feel stagnant and may be more likely to leave.
Department-Specific Hypothesis: Certain departments (e.g., high-stress ones like
Cardiology) may have higher attrition rates due to the nature of the work.

Descriptive Statistics for the Data


The descriptive statistics presented summarize key characteristics of the numerical and
categorical columns in the dataset, providing an overview of the data's distribution, central
tendency, and spread. Here’s a breakdown of each statistic for both numerical and categorical
data:
Numerical Columns
For numerical columns, statistics such as count, mean, standard deviation, minimum, maximum,
and percentiles (25%, 50%, 75%) are calculated to give insights into data distribution:
1. Count: Number of non-null entries in each column. Here, all numerical columns have
1676 entries, meaning no missing values.
2. Mean: The average value for each column. For example, the average Age is
approximately 36.87 years, and the average Daily Rate is around 800.56.
3. Standard Deviation (std): Measures the dispersion or variability around the mean.
For instance, Age has a standard deviation of 9.13 years, indicating how spread-out
ages are around the mean.
4. Minimum (min) and Maximum (max): The smallest and largest values in each
column. For example, Age ranges from 18 to 60, and Total Working Years goes from
0 to 40.
5. Percentiles (25%, 50%, 75%): Also known as quartiles, these values divide the data
into quarters:
o 25% (1st quartile): 25% of the data points are below this value.
o 50% (2nd quartile or median): 50% of the data points are below this value.
o 75% (3rd quartile): 75% of the data points are below this value.
For example, the 25th, 50th, and 75th percentiles of Distance from Home are 2, 7, and 14,
respectively, indicating that 25% of employees live within 2 distance units of work, half
live within 7, and 75% within 14.
Categorical Columns
For categorical columns, the summary provides additional information:
1. Count: Number of entries in each column, showing all rows are complete for
categorical columns as well.
2. Unique: Number of distinct values within each categorical column. For example,
Business Travel has 3 unique values (e.g., Travel Rarely, Travel Frequently, etc.), and
Department has 3 unique values (e.g., Cardiology, Maternity).
3. Top: The most frequently occurring category in each column. For example, Business
Travel is most commonly Travel Rarely.
4. Frequency (freq): The frequency of the most common category. In Attrition, the most
common category is No, with a frequency of 1477, indicating that most employees
have not left the company.
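Summaries of this kind are what pandas' `DataFrame.describe` produces. The following is a minimal sketch, assuming the data sits in a pandas DataFrame; the five-row frame, its column spellings, and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the report, values are made up.
df = pd.DataFrame({
    "Age": [25, 31, 36, 44, 52],
    "DistanceFromHome": [2, 7, 7, 14, 20],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely",
                       "Non-Travel", "Travel_Rarely"],
    "Attrition": ["No", "Yes", "No", "No", "No"],
})

# Numerical columns: count, mean, std, min, 25%, 50%, 75%, max.
numeric_summary = df[["Age", "DistanceFromHome"]].describe()

# Categorical columns: count, unique, top, freq.
categorical_summary = df[["BusinessTravel", "Attrition"]].describe()

print(numeric_summary)
print(categorical_summary)
```

Selecting the column groups explicitly keeps the two summaries separate, which matches the split between numerical and categorical statistics described above.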
Interpretation of Key Columns
• Age: The average age is 36.87 years with a standard deviation of 9.13, indicating a fairly
mature workforce.
• Distance From Home: Employees generally live close to work, with a mean distance of 9.22
units and 75% living within 14 units.
• Job Level: Job levels range from 1 to 5, with an average level around 2, suggesting a
majority of employees are at lower to mid-level positions.
• Years At Company and Total Working Years: The mean values are 7.03 and 11.34 years,
respectively. Employees tend to stay at the company long-term, and the gap between the two
means implies about 4.3 years of prior experience on average.
• Attrition: The "No" category has a frequency of 1477 out of 1676 (about 88%), so the
majority of employees in the dataset have not left the company; the remaining 199 (about
12%) have.

Dataset Overview
The dataset represents a healthcare company's employee data, with a focus on attributes that
could help analyze employee attrition and workplace dynamics.
Key Steps in Preprocessing:
1. Missing Values Handling:
o Columns with more than 50% missing values were dropped.
o For remaining columns, missing numerical values were filled with the mean,
and categorical missing values were filled with the mode.
2. Outlier Treatment:
o Outliers in numerical columns were treated using the Interquartile Range
(IQR) method, capping values outside 1.5 times the IQR.
3. Encoding:
o Binary columns (e.g., Attrition, Gender, Over18, Over Time) were label-
encoded.
o Multi-class categorical columns (e.g., Business Travel, Department, Education
Field, Job Role, Marital Status, Shift) were one-hot encoded.
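The preprocessing steps above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the author's code: the helper names, the toy frame, and its values are hypothetical; "capping values outside 1.5 times the IQR" is read here as clipping to the fences Q1 - 1.5·IQR and Q3 + 1.5·IQR; and a plain dictionary mapping stands in for whatever label encoder was actually used.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose share of missing values exceeds max_missing."""
    return df.loc[:, df.isna().mean() <= max_missing].copy()


def impute(df):
    """Fill numeric gaps with the column mean and other gaps with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


def cap_outliers_iqr(series, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical toy data; values are invented for illustration only.
raw = pd.DataFrame({
    "MonthlyIncome": [3000.0, 3200.0, np.nan, 3400.0, 3600.0, 20000.0],
    "OverTime": ["No", "Yes", "No", None, "No", "Yes"],
    "Department": ["Cardiology", "Maternity", "Cardiology",
                   "Cardiology", "Maternity", "Cardiology"],
    "MostlyEmpty": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],
})

# 1. Missing values: drop sparse columns, then impute what remains.
clean = impute(drop_sparse_columns(raw))

# 2. Outliers: cap the numeric column at the IQR fences.
clean["MonthlyIncome"] = cap_outliers_iqr(clean["MonthlyIncome"])

# 3. Encoding: binary column to 0/1, multi-class column to one-hot columns.
clean["OverTime"] = clean["OverTime"].map({"No": 0, "Yes": 1})
encoded = pd.get_dummies(clean, columns=["Department"], dtype=int)

print(encoded)
```

Note that imputing before capping, as the report describes, lets an extreme value pull the imputed mean upward; reversing the two steps would be a different design choice.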
Processed Columns:
The dataset now contains 47 columns, which include both numerical and encoded categorical
features. Some key columns are:
• Demographic Data:
o Employee ID, Age, Gender, Marital Status, Education, Education Field
• Work Information:
o Department, Job Role, Job Level, Total Working Years, Years at Company,
Years In Current Role, Years Since Last Promotion, Years With Curr Manager
• Compensation and Benefits:
o Daily Rate, Monthly Income, Hourly Rate, Percent Salary Hike
• Work Satisfaction and Performance:
o Environment Satisfaction, Job Satisfaction, Relationship Satisfaction,
Performance Rating
• Attrition:
o Attrition (target variable indicating whether an employee left the company)

• The numerical features have been summarized with mean, standard deviation,
minimum, and maximum values for easy reference. The summary provides insights
into average values across work experience, income, job satisfaction, and other key
attributes of employees.
• This processed dataset is ready for exploratory data analysis or model building,
particularly for tasks like predicting employee attrition.
Summary of EDA and Relevant Visualizations
Here’s a summary of the analysis on employee attrition in a healthcare setting presented in
bullet points:
• Demographic Factors:
o Younger employees (particularly those in their 20s and early 30s) show higher
attrition rates.
o Employees with lower monthly income are more likely to leave.
o Specific departments, such as "Maternity," have higher attrition rates.
• Work Conditions:
o Frequent business travel correlates with higher attrition rates.
o Employees rating their work-life balance lower tend to leave more often.
o Those who work overtime exhibit increased attrition, suggesting excessive
work hours contribute to burnout.
• Visual Data Insights:
o The "Attrition by Overtime" chart shows higher attrition among employees who
work overtime, which is consistent with overtime eroding work-life balance and
job satisfaction.
• Implications for Retention:
o Factors influencing attrition include age, income, department, business travel,
work-life balance, and overtime.
o Strategies to improve retention could involve:
▪ Fair compensation practices.
▪ Limiting overtime hours.
▪ Promoting a healthier work-life balance.
▪ Offering career development opportunities.
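Group-level findings such as these typically come from comparing the attrition rate across categories. The following is a minimal sketch of that comparison for overtime, assuming a pandas DataFrame with `OverTime` and `Attrition` columns; the ten records are invented for illustration and do not reproduce the dataset's actual rates.

```python
import pandas as pd

# Hypothetical toy records; the values are invented for illustration only.
df = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

# Boolean flag for leavers, then the share of leavers within each overtime group.
left = df["Attrition"].eq("Yes")
attrition_rate = left.groupby(df["OverTime"]).mean()

print(attrition_rate)
```

The same pattern applies to any other grouping column (department, business travel, an age band), and the resulting series can be drawn as a bar chart to produce a plot in the style of "Attrition by Overtime".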

Data Modelling
Model Performance Analysis
The Random Forest Classifier was applied to predict employee attrition in the healthcare
dataset, with the following performance metrics obtained from the model evaluation:
• Accuracy: The model achieved an accuracy of 89%, meaning it correctly classified
approximately 89% of the instances in the test set. This figure must be read against
the class balance: about 88% of employees in the full dataset did not leave (1477 of
1676), so a model that always predicts the majority class would score almost as
high. It is therefore crucial to analyze performance for each class (attrition and
non-attrition) separately.
• Precision: The precision for the positive class (attrition) is 88%. This means that
when the model predicts that an employee will leave, it is correct 88% of the time. A
high precision indicates a low rate of false positives, which is important in minimizing
unnecessary concern for employees who are not at risk of attrition.
• Recall: The recall for the positive class is notably low at 28%. This implies that the
model correctly identifies only 28% of the actual attrition cases. Many employees
who actually left the organization were not predicted as such by the model, leading
to a high rate of false negatives. This aspect is critical because it means the model
might not be effective in identifying at-risk employees, which is essential for
implementing proactive retention strategies.
• F1 Score: The F1 score for the positive class is 0.43. F1 is the harmonic mean of
precision and recall, so the low value is driven by the low recall: the model's
attrition predictions are usually correct, but it misses most employees who actually
leave, and there is significant room for improvement in identifying employees at risk.
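Per-class metrics like these are commonly obtained with scikit-learn's `classification_report`. The sketch below shows the general workflow on synthetic data with roughly 12% positives; the data generator, the 70/30 stratified split, and the hyperparameters are assumptions made for illustration, since the report does not state the settings actually used, and the numbers it prints will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the employee data: imbalanced, about 12% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.88, 0.12], random_state=0
)

# Assumed split; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Assumed hyperparameters; the report does not list the ones it used.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, F1, plus macro and weighted averages.
names = ["No attrition", "Attrition"]
print(classification_report(y_test, y_pred, target_names=names, zero_division=0))

# The same report as a dictionary, convenient for programmatic checks.
report = classification_report(
    y_test, y_pred, target_names=names, output_dict=True, zero_division=0
)
```

Reading the "Attrition" row rather than the overall accuracy is what exposes a low-recall model on imbalanced data.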
Classification Report Insights
The classification report further elaborates on the model's performance:
• For the negative class (non-attrition), the model performs well with a precision of
89% and a recall of 99%, which means it effectively identifies most employees who
are not likely to leave.
• For the positive class (attrition), the precision is 88%, but the recall of 28% reveals a
challenge in predicting actual cases of employee attrition. Out of 74 employees who
left, the model only identified 21 correctly, missing 53 cases.
• The macro average metrics (precision: 88%, recall: 64%, F1 score: 68%) highlight
the disparity in performance between the classes, showing that while the model can
classify the majority class well, it fails to adequately capture the minority class.
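The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated above (74 leavers, 21 identified, and the rounded per-class precision and recall values); the variable names are illustrative, and small differences from the original report are possible because the inputs are already rounded.

```python
# Counts stated in the report: 74 employees left, 21 of them were identified.
actual_leavers = 74
true_positives = 21
false_negatives = actual_leavers - true_positives  # the missed cases

# Recall for the attrition class, from the raw counts.
recall_pos = true_positives / actual_leavers

# Rounded per-class figures as reported in the classification report.
precision_pos, precision_neg = 0.88, 0.89
recall_neg = 0.99


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_pos = f1(precision_pos, recall_pos)
f1_neg = f1(precision_neg, recall_neg)

# Macro averages: unweighted means over the two classes.
macro_precision = (precision_pos + precision_neg) / 2
macro_recall = (recall_pos + recall_neg) / 2
macro_f1 = (f1_pos + f1_neg) / 2

print(f"missed cases:       {false_negatives}")
print(f"recall (attrition): {recall_pos:.2f}")
print(f"F1 (attrition):     {f1_pos:.2f}")
print(f"macro recall:       {macro_recall:.2f}")
print(f"macro F1:           {macro_f1:.2f}")
```

The computed values round to 53 missed cases, a recall of 0.28, an F1 of 0.43, a macro recall of 0.64, and a macro F1 of 0.68, in agreement with the report; the macro precision from the rounded inputs is 0.885, consistent with the reported 88%.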

Validation and Testing


The output of the prediction indicates that the new employee is likely to stay with the
organization, as per the model's assessment. Here’s a breakdown of the prediction process
and its implications:
Explanation of Prediction
1. Data Preparation:
o New Employee Data: A sample data record was created for a new employee,
including various features such as age, daily rate, job satisfaction, and marital
status.
o Encoding Categorical Variables: The categorical variables (like Gender,
Over Time, and Marital Status) were encoded to numerical values to match the
format used during the model training phase. This step is crucial because
machine learning algorithms typically work with numerical data.
2. Matching Feature Set:
o The new employee's data was reindexed to ensure that it includes all the
necessary features that the model was trained on. Any missing columns were
filled with zeros. This ensures consistency between the training and prediction
phases.
3. Model Prediction:
o The trained Random Forest model was used to make a prediction based on the
new employee's features. The predict method outputs a binary value:
▪ 0 indicates that the employee is likely to stay.
▪ 1 indicates that the employee is likely to attrite.
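The encoding and feature-matching steps can be sketched with pandas as follows. This is an illustration under assumptions rather than the original code: the training column list, the 0/1 mappings, and the Gender and Daily Rate values are hypothetical, while the age, job satisfaction, work-life balance, overtime, and marital status values follow the example employee discussed in this section.

```python
import pandas as pd

# Hypothetical column layout the model was trained on (after encoding).
training_columns = [
    "Age", "DailyRate", "JobSatisfaction", "WorkLifeBalance",
    "Gender", "OverTime",
    "MaritalStatus_Divorced", "MaritalStatus_Married", "MaritalStatus_Single",
]

# New-employee record; Gender and DailyRate are invented for illustration.
new_employee = pd.DataFrame([{
    "Age": 30, "DailyRate": 800, "JobSatisfaction": 4, "WorkLifeBalance": 3,
    "Gender": "Male", "OverTime": "No", "MaritalStatus": "Single",
}])

# Binary columns: reuse the same 0/1 mapping as in training (assumed here).
new_employee["Gender"] = new_employee["Gender"].map({"Female": 0, "Male": 1})
new_employee["OverTime"] = new_employee["OverTime"].map({"No": 0, "Yes": 1})

# Multi-class column: one-hot encode; only the observed category gets a column.
encoded = pd.get_dummies(new_employee, columns=["MaritalStatus"], dtype=int)

# Align to the training layout; columns absent from the record are filled with 0.
aligned = encoded.reindex(columns=training_columns, fill_value=0)
print(aligned.iloc[0].to_dict())

# With a fitted classifier `model` (not defined in this sketch), the prediction
# would then be obtained as:
# prediction = model.predict(aligned)[0]   # 0 = likely to stay, 1 = likely to leave
```

Reindexing against the training columns also fixes the column order, which matters because the model expects features in the same positions it saw during training.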
Interpretation of Results
• The model predicted that the employee is likely to stay (output of 0). This suggests
that the combination of features for this employee aligns with those profiles identified
in the training data as having lower attrition risks.
• Factors contributing to this prediction might include:
o Age (30 years): Younger employees tend to have higher attrition rates, but
being in their 30s often indicates more stability.
o Job Satisfaction (4): A high level of job satisfaction is generally associated
with a lower likelihood of leaving.
o Work-Life Balance (3): A moderate rating suggests a satisfactory balance,
reducing stress and the likelihood of burnout.
o Over Time (No): Not working overtime may contribute to a better work-life
balance, positively influencing retention.
o Marital Status (Single): While marital status can vary in its impact on
attrition, being single may correlate with fewer family responsibilities,
allowing for more flexibility in work engagement.
Conclusion
Overall, the model's prediction indicates that the new employee is in a favourable position
regarding retention, although the model's low recall for attrition (28%) means a "stay"
prediction does not rule out risk. This insight can be valuable for HR and management in
understanding employee dynamics and implementing targeted strategies to foster a supportive
work environment, especially for those who may be at risk of leaving. It also highlights the
importance of continuously monitoring and analysing employee data to anticipate potential
attrition issues proactively.
