Capstone Project - Employee Attrition Rate
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,roc_curve,auc,accuracy_score,classification_report
from sklearn import tree
from six import StringIO
import warnings
warnings.filterwarnings('ignore')
Out[2]:
   Age Attrition     BusinessTravel  DailyRate              Department  DistanceFromHome  Education EducationField ...
1   49        No  Travel_Frequently        279  Research & Development                 8          1  Life Sciences
2   37       Yes      Travel_Rarely       1373  Research & Development                 2          2          Other
3   33        No  Travel_Frequently       1392  Research & Development                 3          4  Life Sciences
4   27        No      Travel_Rarely        591  Research & Development                 2          1        Medical
5 rows × 35 columns
In [3]: df.shape
Out[3]: (2940, 35)
In [4]: df.size
Out[4]: 102900
In [5]: df.columns
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 2940 non-null int64
1 Attrition 2940 non-null object
2 BusinessTravel 2940 non-null object
3 DailyRate 2940 non-null int64
4 Department 2940 non-null object
5 DistanceFromHome 2940 non-null int64
6 Education 2940 non-null int64
7 EducationField 2940 non-null object
8 EmployeeCount 2940 non-null int64
9 EmployeeNumber 2940 non-null int64
10 EnvironmentSatisfaction 2940 non-null int64
11 Gender 2940 non-null object
12 HourlyRate 2940 non-null int64
13 JobInvolvement 2940 non-null int64
14 JobLevel 2940 non-null int64
15 JobRole 2940 non-null object
16 JobSatisfaction 2940 non-null int64
17 MaritalStatus 2940 non-null object
18 MonthlyIncome 2940 non-null int64
19 MonthlyRate 2940 non-null int64
20 NumCompaniesWorked 2940 non-null int64
21 Over18 2940 non-null object
22 OverTime 2940 non-null object
23 PercentSalaryHike 2940 non-null int64
24 PerformanceRating 2940 non-null int64
25 RelationshipSatisfaction 2940 non-null int64
26 StandardHours 2940 non-null int64
27 StockOptionLevel 2940 non-null int64
28 TotalWorkingYears 2940 non-null int64
29 TrainingTimesLastYear 2940 non-null int64
30 WorkLifeBalance 2940 non-null int64
31 YearsAtCompany 2940 non-null int64
32 YearsInCurrentRole 2940 non-null int64
33 YearsSinceLastPromotion 2940 non-null int64
34 YearsWithCurrManager 2940 non-null int64
dtypes: int64(26), object(9)
memory usage: 804.0+ KB
In [7]: df.describe()
Out[7]: 8 rows × 26 columns
In [8]: df['MonthlyIncome'].describe()
Out[8]:
count 2940.000000
mean 6502.931293
std 4707.155770
min 1009.000000
25% 2911.000000
50% 4919.000000
75% 8380.000000
max 19999.000000
Name: MonthlyIncome, dtype: float64
In [9]: df.isnull().sum()
Out[9]:
Age 0
Attrition 0
BusinessTravel 0
DailyRate 0
Department 0
DistanceFromHome 0
Education 0
EducationField 0
EmployeeCount 0
EmployeeNumber 0
EnvironmentSatisfaction 0
Gender 0
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 0
JobSatisfaction 0
MaritalStatus 0
MonthlyIncome 0
MonthlyRate 0
NumCompaniesWorked 0
Over18 0
OverTime 0
PercentSalaryHike 0
PerformanceRating 0
RelationshipSatisfaction 0
StandardHours 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 0
dtype: int64
Out[10]: 0
EDA
In [11]: # sns.pairplot(df)
In [12]: df.corr()
Out[12]: correlation matrix of the numeric columns (26 rows × 26 columns): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, EmployeeNumber, ...
OrderedDict([(18, 16), (19, 18), (20, 22), (21, 26), (22, 32), (23, 28), (24, 52), (25, 52), (26, 78), (27, 96), (28, 96), (29, 136), (30, 120), (31, 138), (32, 122), (33, 116), (34, 154), (35, 156), (36, 138), (37, 100), (38, 116), (39, 84), (40, 114), (41, 80), (42, 92), (43, 64), (44, 66), (45, 82), (46, 66), (47, 48), (48, 38), (49, 48), (50, 60), (51, 38), (52, 36), (53, 38), (54, 36), (55, 44), (56, 28), (57, 8), (58, 28), (59, 20), (60, 10)])
In [14]: plt.figure(figsize=(14,7))
sns.distplot(df['Age'])
Out[14]: <AxesSubplot:xlabel='Age', ylabel='Density'>
Employees in the age range 18 to 35 have a high attrition rate, while ages 54, 57, 59 and 60 show no attrition.
In [18]: business_series=df['BusinessTravel'].value_counts()
business_series
Out[18]:
Travel_Rarely 2086
Travel_Frequently 554
Non-Travel 300
Name: BusinessTravel, dtype: int64
[Pie chart: BusinessTravel — Travel_Rarely 71%, Travel_Frequently 18.8%, Non-Travel 10.2%]
In [20]: df_business=df[['Age','BusinessTravel']].value_counts().reset_index().rename(columns={0:'Count'})
df_business[:10]
     Age     BusinessTravel  Count
105   18  Travel_Frequently      4
82    18         Non-Travel      8
92    18      Travel_Rarely      4
64    19      Travel_Rarely     14
115   19  Travel_Frequently      2
113   19         Non-Travel      2
52    20      Travel_Rarely     18
95    20  Travel_Frequently      4
99    21  Travel_Frequently      4
46    21      Travel_Rarely     20
In [21]: plt.figure(figsize=(16,8))
plt.xlabel('Age',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Age vs Count of employees with Business Travel',fontsize=14,fontweight='bold')
sns.lineplot(data=df_business,x='Age',y='Count',hue='BusinessTravel')
Employees who travel rarely appear to have high attrition, though their headcount is also high; employees who do not travel have a lower attrition rate.
In [23]: print(max(df['DailyRate']))
1499
In [24]: plt.figure(figsize=(12,6))
sns.boxplot(df['DailyRate'])
Out[24]: <AxesSubplot:xlabel='DailyRate'>
Out[25]: <AxesSubplot:xlabel='Age', ylabel='DailyRate'>
In [26]: plt.figure(figsize=(16,9))
plt.xlabel('Age',fontsize=14)
plt.ylabel('DailyRate',fontsize=14)
plt.title('Age vs DailyRate with attrition',fontsize=14,fontweight='bold')
sns.lineplot(data=df,x='Age',y='DailyRate',hue='Attrition',palette='rocket',ci=None) #Using ci=None to suppress the confidence interval band
Department distribution
[Pie chart: Department — Research & Development 65.4%, Sales 30.3%, Human Resources 4.29%]
Research & Development has the highest number of employees, followed by Sales.
The Human Resources department has a high attrition rate relative to its headcount.
Attrition is high in Sales and moderate in R&D.
Out[29]:
Department
Human Resources 751.539683
Research & Development 806.851197
Sales 800.275785
Name: DailyRate, dtype: float64
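The per-department means in Out[29] above are presumably produced by a groupby aggregation on DailyRate; a minimal sketch on toy data (values here are illustrative, not the notebook's):

```python
import pandas as pd

# Tiny stand-in frame with the two columns the aggregation needs.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Human Resources", "Research & Development"],
    "DailyRate": [800, 900, 750, 807],
})
# Mean DailyRate per department, the same shape of result as Out[29].
mean_rate = toy.groupby("Department")["DailyRate"].mean()
print(mean_rate["Sales"])  # 850.0
```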
Out[30]: array([ 1, 8, 2, 3, 24, 23, 27, 16, 15, 26, 19, 21, 5, 11, 9, 7, 6, 10, 4, 25, 12, 18, 29, 22, 14, 20, 28, 17, 13], dtype=int64)
In [31]: plt.figure(figsize=(16,9))
plt.xlabel('Distance From Home',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Distance vs Count of Employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='DistanceFromHome',data=df,hue='Attrition',palette='rainbow')
Education Field Distribution
[Pie chart: EducationField — Life Sciences 41.2%, Medical 31.6%, Marketing 10.8%, Technical Degree 8.98%, Other 5.58%, Human Resources 1.84%]
The Life Sciences field has the largest number of employees, followed by Medical; Human Resources has the smallest count.
In [35]: plt.figure(figsize=(16,9))
plt.xlabel('Education Field',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Education Field vs Count of Employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(y='EducationField',data=df,hue='Attrition',palette='rocket')
Though Human Resources has the smallest headcount, its attrition appears to exceed 40 percent.
The Other field has low attrition.
Technical Degree and Marketing have high attrition.
Relative to their headcounts, attrition in Medical and Life Sciences is moderate.
In [36]: plt.figure(figsize=(16,9))
plt.xlabel('Education Field',fontsize=14)
plt.ylabel('DailyRate',fontsize=14)
plt.title('Education Field vs DailyRate',fontsize=14,fontweight='bold')
sns.lineplot(x='EducationField',y='DailyRate',data=df,ci=None)
Out[37]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField ... (3 rows × 35 columns)
In [38]: plt.figure(figsize=(16,9))
plt.xlabel('Education Field',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Education Field with respect to Department',fontsize=14,fontweight='bold')
# plt.legend(loc='upper right')
sns.countplot(x='EducationField',data=df,hue='Department',palette='rocket')
0 Human Resources 1 22
1 Human Resources 2 24
2 Human Resources 3 52
3 Human Resources 4 28
8 Sales 1 172
9 Sales 2 196
10 Sales 3 270
11 Sales 4 254
In [41]: # df['EnvironmentSatisfaction'].astype(object)
In [42]: plt.figure(figsize=(16,9))
plt.xlabel('Environment Satisfaction',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Environment Satisfaction with attrition',fontsize=14,fontweight='bold')
# plt.legend(loc='upper right')
sns.countplot(x='EnvironmentSatisfaction',data=df,hue='Attrition',palette='rocket')
Male attrition appears to be higher than female attrition, though the headcount of each gender must also be taken into account.
Out[45]: <AxesSubplot:xlabel='Age', ylabel='HourlyRate'>
On average, at age 57 or 58 the hourly rate pay is good, as inferred from the line plot above.
In [46]: plt.figure(figsize=(16,9))
plt.xlabel('Age',fontsize=14)
plt.ylabel('HourlyRate',fontsize=14)
plt.title('Age vs HourlyRate with attrition',fontsize=14,fontweight='bold')
sns.lineplot(data=df,x='Age',y='HourlyRate',hue='Attrition',palette='rocket',ci=None) #Using ci=None to suppress the confidence interval band
Rating 4 has the least attrition, since these employees are seriously involved in their jobs.
Rating 3 attrition appears moderate.
Rating 1 has almost a 50 percent attrition rate.
In [51]: job_df=df.groupby(['JobLevel','JobRole']).size().reset_index()
job_df.drop(0,axis=1,inplace=True)
job_df.set_index('JobLevel')
Out[51]: JobRole
JobLevel
1 Human Resources
1 Laboratory Technician
1 Research Scientist
1 Sales Representative
2 Healthcare Representative
2 Human Resources
2 Laboratory Technician
2 Manufacturing Director
2 Research Scientist
2 Sales Executive
2 Sales Representative
3 Healthcare Representative
3 Human Resources
3 Laboratory Technician
3 Manager
3 Manufacturing Director
3 Research Director
3 Research Scientist
3 Sales Executive
4 Healthcare Representative
4 Manager
4 Manufacturing Director
4 Research Director
4 Sales Executive
5 Manager
5 Research Director
From the above dataframe, a single job role can span multiple job levels.
Job level 5 has two roles: Manager and Research Director.
Out[53]:
JobSatisfaction
1 578
2 560
3 884
4 918
dtype: int64
In [54]: plt.figure(figsize=(18,9))
plt.xlabel('JobSatisfaction',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Job satisfaction with respect to department',fontsize=14,fontweight='bold')
sns.countplot(x='Department',data=df,hue='JobSatisfaction',palette='rocket')
0 Divorced 654
1 Married 1346
2 Single 940
In [57]: plt.figure(figsize=(16,8))
plt.xlabel('MaritalStatus',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('MaritalStatus with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='MaritalStatus',data=df,hue='Attrition')
Out[58]: 19999
In [59]: plt.figure(figsize=(14,7))
sns.distplot(df['MonthlyIncome'])
Out[59]: <AxesSubplot:xlabel='MonthlyIncome', ylabel='Density'>
In [60]: #Monthly income with respect to Department and attrition
plt.figure(figsize=(14,7))
plt.xlabel('Department',fontsize=14)
plt.ylabel('Monthly Income',fontsize=14)
plt.title('Monthly income with respect to Department and attrition',fontsize=14,fontweig
sns.swarmplot(x='Department',y='MonthlyIncome',data=df,hue='Attrition')
In the Sales department, attrition is high in the 1000–3500 income range; the stable 15000–18500 range has no attrition.
In Research & Development, attrition is moderate in the 1000–12500 range; the stable range is 14000–18000, where there is no attrition.
In Human Resources, the pay gap is huge judging from the plot; attrition is high in the 1000–2500/3000 range, and there is no attrition in the high-paying incomes of the HR department.
The plot above gives a clear perspective: average monthly income below 6000 is associated with high attrition, while average income above 6500 shows very little attrition.
In [63]: num_com=df.groupby('NumCompaniesWorked').size().reset_index().rename(columns={0:'Count'})
num_com
   NumCompaniesWorked  Count
0                   0    394
1                   1   1042
2                   2    292
3                   3    318
4                   4    278
5                   5    126
6                   6    140
7                   7    148
8                   8     98
9                   9    104
In [64]: # pie chart representation of number of companies worked
px.pie(num_com,names='NumCompaniesWorked',values='Count',title='Number of companies worked')
[Pie chart: NumCompaniesWorked — 1: 35.4%, 0: 13.4%, 3: 10.8%, 2: 9.93%, 4: 9.46%, 7: 5.03%, 6: 4.76%, 5: 4.29%, 9: 3.54%, 8: 3.33%]
In [65]: plt.figure(figsize=(16,8))
plt.xlabel('Num of Companies Worked',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('NumCompaniesWorked with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='NumCompaniesWorked',data=df,hue='Attrition')
In [66]: dep_comp=df.groupby(['NumCompaniesWorked','Department'])['EmployeeNumber'].count().reset_index()
dep_comp[:5]
0 0 Human Resources 24
2 0 Sales 130
3 1 Human Resources 38
[Pie chart: share of NumCompaniesWorked within Human Resources — category 1 is the largest slice at 30.2%]
[Pie chart: share of NumCompaniesWorked within the next department (presumably Research & Development) — category 1 is the largest slice at 35.4%]
[Pie chart: share of NumCompaniesWorked within Sales — category 1 is the largest slice at 36.3%]
0 0 1 76
1 0 2 70
2 0 3 118
3 0 4 130
4 1 1 182
[Pie charts: further per-group distributions of NumCompaniesWorked; in each group the largest slice is category 1 (31.5%, 36.1%, 36% and 37% respectively)]
Out[70]:
Y 2940
Name: Over18, dtype: int64
Out[71]:
No 2108
Yes 832
Name: OverTime, dtype: int64
Employees working overtime with monthly income below 10000 tend to resign more, even though fewer employees work overtime.
Attrition among employees who do not work overtime is low.
15
11 420
13 418
14 402
12 396
15 202
18 178
17 164
16 156
19 152
22 112
20 110
21 96
23 56
24 42
25 36
Name: PercentSalaryHike, dtype: int64
In [75]: plt.figure(figsize=(16,8))
plt.xlabel('Num of Companies Worked',fontsize=14)
plt.ylabel('Salary hike',fontsize=14)
plt.title('Number of companies worked with respect to percent hike and attrition',fontsize=14,fontweight='bold')
sns.barplot(x='NumCompaniesWorked',y='PercentSalaryHike',data=df,hue='Attrition')
Even though employees who have worked at 0, 4, 7 or 9 companies receive an average hike of more than 15 percent, they tend to resign more.
In [77]: per_rating=df.groupby('PerformanceRating').size().reset_index().rename(columns={0:'Count'})
per_rating
   PerformanceRating  Count
0                  3   2488
1                  4    452
[Pie chart: PerformanceRating — 3: 84.6%, 4: 15.4%]
In [79]: plt.figure(figsize=(16,8))
plt.xlabel('Job Involvement',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Job Involvement with respect to performance rating',fontsize=14,fontweight='bold')
sns.countplot(x='JobInvolvement',hue='PerformanceRating',data=df,palette='Set2')
Job involvement rating 3 has the highest number of employees as well as the highest count of performance rating 3.
Employees with very low job involvement (1 and 2) are also awarded the highest rating (4).
In [80]: plt.figure(figsize=(16,8))
plt.xlabel('Performance Rating',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Plot with respect to performance rating and attrition',fontsize=14,fontweight='bold')
sns.countplot(x='PerformanceRating',hue='Attrition',data=df,palette='rocket')
Out[80]: <AxesSubplot:title={'center':'Plot with respect to performance rating and attrition'}, xlabel='PerformanceRating', ylabel='count'>
In [81]: per_atrr=df.groupby(['Attrition','PerformanceRating'])['EmployeeNumber'].size().reset_index()
per_atrr
  Attrition  PerformanceRating  EmployeeNumber
0        No                  3            2088
1        No                  4             378
2       Yes                  3             400
3       Yes                  4              74
# fnct(per_atrr)
In [84]: plt.figure(figsize=(16,8))
plt.xlabel('Relationship Satisfaction',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Relationship Satisfaction with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='RelationshipSatisfaction',data=df,hue='Attrition')
fnct_relation(df)
Out[86]: array([80], dtype=int64)
In [88]: plt.figure(figsize=(16,8))
plt.xlabel('Stock level option',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Stock level option with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='StockOptionLevel',data=df,hue='Attrition',palette='rainbow')
fnct_relation1(df)
In [92]: #Since the chart is not giving a clear view of the percentage we can create a function for it
def fnct_relation1(dataframe):
    dict1={}
    for i in range(0,41):
        cal=((df[(df['TotalWorkingYears']==i) & (df['Attrition']=='Yes')]['EmployeeNumber'].count())/(df[df['TotalWorkingYears']==i]['EmployeeNumber'].count()))*100
        cal=round(cal,2)
        dict1[i]=cal
    del dict1[39] #Since we don't have employees with 39 years of experience
    return dict1
a=fnct_relation1(df)
In [93]: work_df=pd.DataFrame(data=a.items(),columns=['No_of_years','Attrition_percent_among_yes_and_no'])
work_df
0 0 45.45
1 1 49.38
2 2 29.03
3 3 21.43
4 4 19.05
5 5 18.18
6 6 17.60
7 7 22.22
8 8 15.53
9 9 10.42
10 10 12.38
11 11 19.44
12 12 10.42
13 13 8.33
14 14 12.90
15 15 12.50
16 16 8.11
17 17 9.09
18 18 14.81
19 19 13.64
20 20 6.67
21 21 2.94
22 22 9.52
23 23 9.09
24 24 16.67
25 25 7.14
26 26 7.14
27 27 0.00
28 28 7.14
29 29 0.00
30 30 0.00
31 31 11.11
32 32 0.00
33 33 14.29
34 34 20.00
35 35 0.00
36 36 0.00
37 37 0.00
38 38 0.00
39 40 100.00
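The per-years attrition percentage that fnct_relation1 builds with a loop can also be computed in one vectorized expression; a sketch on toy data (column names follow the notebook, the values here are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "TotalWorkingYears": [0, 0, 1, 1, 1, 2],
    "Attrition": ["Yes", "No", "Yes", "Yes", "No", "No"],
})
# The mean of the boolean (Attrition == 'Yes') within each group is the
# attrition rate for that number of working years.
pct = (toy["Attrition"].eq("Yes")
          .groupby(toy["TotalWorkingYears"]).mean().mul(100).round(2))
print(pct.to_dict())  # {0: 50.0, 1: 66.67, 2: 0.0}
```

This avoids the per-value filtering of the original loop and returns a Series indexed by years, which plots directly.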
In [96]: #Since the chart is not giving a clear view of the percentage we can create a function for it
def fnct_relation2(dataframe):
    dict2={}
    for i in range(0,7):
        cal=((df[(df['TrainingTimesLastYear']==i) & (df['Attrition']=='Yes')]['EmployeeNumber'].count())/(df[df['TrainingTimesLastYear']==i]['EmployeeNumber'].count()))*100
        cal=round(cal,2)
        dict2[i]=cal
    return dict2
b=fnct_relation2(df)
In [97]: b
Work-life balance rating 3 has the highest attrition count but, relative to its headcount, the lowest attrition rate.
Ratings 1 and 4 have almost the same attrition.
The most reliable employees for the organisation are those with 11 to 30, 34, 36 or 37 years of experience at the company.
In [103]: df['JobRole'].unique()
In [104]: res_df=df[df['Attrition']=='Yes']
res_df.shape
Out[104]: (474, 35)
In [105]: plt.figure(figsize=(16,8))
plt.xlabel('YearsInCurrentRole',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Years in current role with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='YearsInCurrentRole',hue='JobRole',data=res_df)
plt.legend(loc='center right')
Out[105]: <matplotlib.legend.Legend at 0x2499e92cfd0>
Laboratory Technicians at 0 and 1 years in the role have high attrition.
Research Director and Healthcare Representative have the least attrition.
Sales Executive shows attrition across almost all years in the current role.
Sales Representative attrition occurs at less than 7 years in the current role.
Research Scientist attrition peaks at 2 years in the current role.
In [109]: plt.figure(figsize=(16,8))
plt.xlabel('YearsWithCurrManager',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Years With Current Manager with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='YearsWithCurrManager',data=res_df)
In [111]: df.columns
Out[112]: <AxesSubplot:>
In [113]: df['BusinessTravel'].unique()
Out[113]: array(['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'], dtype=object)
In [115]: df[:3]
Out[115]:
   Age Attrition  BusinessTravel  DailyRate              Department  DistanceFromHome  Education EducationField ...
1   49        No               2        279  Research & Development                 8          1  Life Sciences
2   37       Yes               1       1373  Research & Development                 2          2          Other
3 rows × 31 columns
Out[116]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField ... (0 rows × 31 columns)
Out[117]: array([2, 1, 0])
Out[118]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField ...
4   27        No               1        591           1                 2          1        Medical
(5 rows × 31 columns)
Out[119]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField ... (0 rows × 31 columns)
In [120]: sns.boxplot(df['DistanceFromHome'])
Out[120]: <AxesSubplot:xlabel='DistanceFromHome'>
In [121]: #Label encoding of EducationField (labenc is a LabelEncoder instance created in an earlier cell)
df['EducationField']=labenc.fit_transform(df['EducationField'])
df[:3]
Out[121]:
   Age Attrition  BusinessTravel  DailyRate  Department  DistanceFromHome  Education  EducationField ...
0   41       Yes               1       1102           2                 1          2               1
1   49        No               2        279           1                 8          1               1
2   37       Yes               1       1373           1                 2          2               4
3 rows × 31 columns
In [122]: print(labenc.classes_)
Out[123]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField Environ
0 41 Yes 1 1102 2 1 2 1
1 49 No 2 279 1 8 1 1
2 37 Yes 1 1373 1 2 2 4
3 rows × 31 columns
In [124]: df['Gender'].unique()
Out[124]: array([0, 1])
Out[125]: <AxesSubplot:xlabel='HourlyRate'>
In [126]: #JobRole label encoder
df['JobRole']=labenc.fit_transform(df['JobRole'])
df[:5]
Out[126]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField Environ
0 41 Yes 1 1102 2 1 2 1
1 49 No 2 279 1 8 1 1
2 37 Yes 1 1373 1 2 2 4
3 33 No 2 1392 1 3 4 1
4 27 No 1 591 1 2 1 3
5 rows × 31 columns
In [127]: print(labenc.classes_)
In [128]: #Since marital status does not hold much importance we can remove it from the dataset
df.drop('MaritalStatus',axis=1,inplace=True)
Out[129]: 30
Out[130]: <AxesSubplot:xlabel='MonthlyIncome'>
From the above plot we can find many outlier values
In [132]: med=np.median(df['MonthlyIncome'])
med
Out[132]: 4919.0
Out[133]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField Envi
25 53 No 1 1282 1 5 3 4
29 46 No 1 705 2 2 4 2
45 41 Yes 1 1360 1 12 3 5
62 50 No 1 989 1 7 2 3
105 59 No 0 1420 0 2 4 0
2844 58 No 1 605 2 21 3 1
2847 49 No 2 1064 1 2 1 1
2871 55 No 1 189 0 26 4 0
2907 39 No 0 105 1 9 3 1
2913 42 No 1 300 1 2 3 1
228 rows × 30 columns
In [134]: q8
Out[134]: 4919.0
In [135]: #Replacing the outlier values of the dataframe with the MonthlyIncome median value
df.iloc[sd2.index,15]=q8
# df['MonthlyIncome'].fillna(np.median(df['MonthlyIncome']),inplace=True)
In [136]: df.iloc[sd2.index,15]
Out[136]:
25 4919
29 4919
45 4919
62 4919
105 4919
...
2844 4919
2847 4919
2871 4919
2907 4919
2913 4919
Name: MonthlyIncome, Length: 228, dtype: int64
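The objects `sd2` (the 228 outlier rows in Out[133]) and `q8` are defined in cells not captured above; presumably they come from the usual IQR rule on MonthlyIncome plus its median. A sketch of that logic on toy data, with the notebook's names `sd2`/`q8` reused for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"MonthlyIncome": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 50000.0]})

# Classic 1.5*IQR fence for detecting high outliers.
q1, q3 = toy["MonthlyIncome"].quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr

sd2 = toy[toy["MonthlyIncome"] > upper]   # outlier rows (here: the 50000 value)
q8 = np.median(toy["MonthlyIncome"])      # replacement value: the median

# Replace the outliers with the median, as the notebook does with df.iloc[sd2.index,15]=q8.
toy.loc[sd2.index, "MonthlyIncome"] = q8
print(toy["MonthlyIncome"].max())  # 5000.0
```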
In [137]: # plt.figure(figsize=(14,7))
# sns.boxplot(df['MonthlyIncome'])
Out[138]: <AxesSubplot:xlabel='MonthlyRate'>
In [140]: df['OverTime'].unique()
['No' 'Yes']
skewness is 0.8207086405356568
No outliers in the above data, but the data is skewed.
Out[143]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField Environ
0 41 1 1 1102 2 1 2 1
1 49 0 2 279 1 8 1 1
2 37 1 1 1373 1 2 2 4
3 33 0 2 1392 1 3 4 1
4 27 0 1 591 1 2 1 3
5 rows × 30 columns
In [149]: #Training the model with the best parameters from Grid Search CV
dc_model_1=DecisionTreeClassifier(max_depth=8,min_samples_split=10,min_samples_leaf=1)
dc_model_1.fit(X_train,y_train)
training_accuracy_1=dc_model_1.score(X_train,y_train)
print("Training Accuracy: ",training_accuracy_1)
testing_accuracy_1=dc_model_1.score(X_test,y_test)
print("Testing Accuracy : ",testing_accuracy_1)
0.9013605442176871
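The "best parameters from Grid Search CV" used in In[149] come from a search cell that is not shown; a sketch of what such a search might look like, on synthetic stand-in data (the grid values and dataset here are assumptions, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook uses its own X_train / y_train.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hypothetical grid covering the parameters the notebook ends up with.
param_grid = {"max_depth": [4, 8, 12],
              "min_samples_split": [2, 10],
              "min_samples_leaf": [1, 2]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```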
In [152]: print(classification_report(y_test,predictions))
precision recall f1-score support
The above report shows that attrition: No is predicted with higher accuracy than attrition: Yes.
Recall for the positive class [predicted Yes | actual Yes] is only 0.57.
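The `fpr_dt_0` / `tpr_dt_0` arrays plotted below are presumably produced by `roc_curve` on the positive-class scores; a minimal sketch with toy labels and scores (values illustrative):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Toy true labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns false/true positive rates at each score threshold;
# auc integrates the curve into a single number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 2))  # 0.75
```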
In [157]: plt.figure(1)
lw = 2
plt.plot(fpr_dt_0, tpr_dt_0, color='green',
lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt_0)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
In [159]: plt.figure(1)
lw = 2
plt.plot(fpr_dt, tpr_dt, color='green',
lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
Without hyperparameter tuning, the area under the curve is better than with hyperparameter tuning.
min_estimators = 15
max_estimators = 1000
error_rate = {}
for i in range(min_estimators, max_estimators + 1):
    rf_model.set_params(n_estimators=i)
    rf_model.fit(X_train, y_train)
    oob_error = 1 - rf_model.oob_score_
    error_rate[i] = oob_error
oob_series = pd.Series(error_rate)
fig, ax = plt.subplots(figsize=(16,8))
ax.set_facecolor('#fafafa')
oob_series.plot(kind='line',color = 'red')
# plt.axhline(0.055, color='#875FDB',linestyle='--')
# plt.axhline(0.05, color='#875FDB',linestyle='--')
plt.xlabel('n_estimators')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error Rate Across various Forest sizes \n(From 15 to 1000 trees)')
Out[165]: Text(0.5, 1.0, 'OOB Error Rate Across various Forest sizes \n(From 15 to 1000 trees)')
In [166]: print('OOB Error rate for 500 trees is: {0:.5f}'.format(oob_series[500]))
In [167]: rf_model.set_params(n_estimators=500,bootstrap=True,warm_start=False,oob_score=False)
0.9017857142857143
0.9098639455782312
In [174]: plt.figure(1)
lw = 2
plt.plot(fpr_dt1, tpr_dt1, color='green',
lw=lw, label='Random Forest(AUC = %0.2f)' % roc_auc_dt1)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
From the above curve, the random forest is not as good at predicting Yes attrition as the decision tree.
KNN Model
In [175]: #Since the model uses Euclidean distance, we need to standardize the data
#Standardization
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
Out[176]: KNeighborsClassifier()
In [178]: print(classifier.score(X_train,y_train))
print(classifier.score(X_test,y_test))
0.8924319727891157
0.8860544217687075
n_neighbors = 5 seems to be the optimal value, as accuracy is fine without any overfitting.
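How k = 5 was selected is not shown; one common way is to score a range of k values and pick the best. A sketch on synthetic stand-in data (the notebook would use its own X_train / y_train):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN uses Euclidean distance, so standardize first (as the notebook does).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

scores = {}
for k in range(1, 16, 2):  # odd k values avoid voting ties
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```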
Accuracy Score:
0.9217687074829932
Accuracy Score:
0.8928571428571429
Accuracy Score:
0.9319727891156463
0.9409015506671474
In [188]: model_svm_1.best_params_
Out[188]: {'C': 0.9, 'degree': 4, 'gamma': 0.05, 'kernel': 'poly'}
0.9693877551020408
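`best_params_` in Out[188] implies `model_svm_1` is a GridSearchCV over SVC whose definition cell is missing; a sketch of what such a search might look like, on synthetic stand-in data (the grid values are assumptions chosen to include the reported optimum):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Hypothetical grid including the values from Out[188].
param_grid = {"kernel": ["rbf", "poly"],
              "C": [0.9, 1.0],
              "degree": [3, 4],
              "gamma": [0.05, "scale"]}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```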
In [192]: print(classification_report(y_test,y_pred_svm))
In [194]: plt.figure(1)
lw = 2
plt.plot(fpr_dt2, tpr_dt2, color='green',
lw=lw, label='SVM(AUC = %0.2f)' % roc_auc_dt2)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
The SVM model has better accuracy and the best true positive and true negative counts compared to the decision tree without tuning.
Final Inference
The following points are associated with high attrition.
Model Inference
SVM has the best accuracy in predicting both classes of Attrition, followed by the decision tree without hyperparameter tuning.
The decision tree with hyperparameter tuning, random forest, and KNN have good accuracy in identifying attrition: No, but low recall and accuracy for attrition: Yes,
which can be inferred from the confusion matrix and the area under the curve for these algorithms.