Capstone Project - Employee Attrition Rate

The document outlines the analysis of an HR Employee Attrition dataset using Python libraries such as pandas, numpy, and seaborn. It includes data reading, exploration, and visualization, revealing insights about employee demographics, attrition rates, and correlations among various features. The dataset consists of 2940 entries and 35 columns, with a focus on attributes like age, attrition status, and business travel frequency.


In [1]: import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,roc_curve,auc,accuracy_score,classification_report
from sklearn import tree
from six import StringIO
import warnings
warnings.filterwarnings('ignore')

Reading Data from the file


In [2]: df=pd.read_csv('HR_Employee_Attrition_Data.csv',na_values='NA')
df.head()

Out[2]:    Age Attrition     BusinessTravel  DailyRate              Department  DistanceFromHome  Education EducationField  ...
        0   41       Yes      Travel_Rarely       1102                   Sales                 1          2  Life Sciences
        1   49        No  Travel_Frequently        279  Research & Development                 8          1  Life Sciences
        2   37       Yes      Travel_Rarely       1373  Research & Development                 2          2          Other
        3   33        No  Travel_Frequently       1392  Research & Development                 3          4  Life Sciences
        4   27        No      Travel_Rarely        591  Research & Development                 2          1        Medical

5 rows × 35 columns

In [3]: df.shape
Out[3]: (2940, 35)

In [4]: df.size
Out[4]: 102900

In [5]: df.columns

Out[5]: Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
In [6]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 2940 non-null int64
1 Attrition 2940 non-null object
2 BusinessTravel 2940 non-null object
3 DailyRate 2940 non-null int64
4 Department 2940 non-null object
5 DistanceFromHome 2940 non-null int64
6 Education 2940 non-null int64
7 EducationField 2940 non-null object
8 EmployeeCount 2940 non-null int64
9 EmployeeNumber 2940 non-null int64
10 EnvironmentSatisfaction 2940 non-null int64
11 Gender 2940 non-null object
12 HourlyRate 2940 non-null int64
13 JobInvolvement 2940 non-null int64
14 JobLevel 2940 non-null int64
15 JobRole 2940 non-null object
16 JobSatisfaction 2940 non-null int64
17 MaritalStatus 2940 non-null object
18 MonthlyIncome 2940 non-null int64
19 MonthlyRate 2940 non-null int64
20 NumCompaniesWorked 2940 non-null int64
21 Over18 2940 non-null object
22 OverTime 2940 non-null object
23 PercentSalaryHike 2940 non-null int64
24 PerformanceRating 2940 non-null int64
25 RelationshipSatisfaction 2940 non-null int64
26 StandardHours 2940 non-null int64
27 StockOptionLevel 2940 non-null int64
28 TotalWorkingYears 2940 non-null int64
29 TrainingTimesLastYear 2940 non-null int64
30 WorkLifeBalance 2940 non-null int64
31 YearsAtCompany 2940 non-null int64
32 YearsInCurrentRole 2940 non-null int64
33 YearsSinceLastPromotion 2940 non-null int64
34 YearsWithCurrManager 2940 non-null int64
dtypes: int64(26), object(9)
memory usage: 804.0+ KB

In [7]: df.describe()

Out[7]:            Age    DailyRate  DistanceFromHome    Education  EmployeeCount  EmployeeNumber  ...
count  2940.000000  2940.000000       2940.000000  2940.000000         2940.0     2940.000000
mean     36.923810   802.485714          9.192517     2.912925            1.0     1470.500000
std       9.133819   403.440447          8.105485     1.023991            0.0      848.849221
min      18.000000   102.000000          1.000000     1.000000            1.0        1.000000
25%      30.000000   465.000000          2.000000     2.000000            1.0      735.750000
50%      36.000000   802.000000          7.000000     3.000000            1.0     1470.500000
75%      43.000000  1157.000000         14.000000     4.000000            1.0     2205.250000
max      60.000000  1499.000000         29.000000     5.000000            1.0     2940.000000

8 rows × 26 columns
In [8]: df['MonthlyIncome'].describe()

Out[8]: count     2940.000000
mean      6502.931293
std       4707.155770
min       1009.000000
25%       2911.000000
50%       4919.000000
75%       8380.000000
max      19999.000000
Name: MonthlyIncome, dtype: float64

In [9]: #Checking for null values


df.isna().sum()

Out[9]:
Age 0
Attrition 0
BusinessTravel 0
DailyRate 0
Department 0
DistanceFromHome 0
Education 0
EducationField 0
EmployeeCount 0
EmployeeNumber 0
EnvironmentSatisfaction 0
Gender 0
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 0
JobSatisfaction 0
MaritalStatus 0
MonthlyIncome 0
MonthlyRate 0
NumCompaniesWorked 0
Over18 0
OverTime 0
PercentSalaryHike 0
PerformanceRating 0
RelationshipSatisfaction 0
StandardHours 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 0
dtype: int64

In [10]: #Checking for duplicate rows


df.duplicated().sum()

Out[10]: 0
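Zero duplicates here is partly an artifact of identifier columns: `duplicated()` only flags rows that match on every column, so a unique `EmployeeNumber` hides otherwise repeated records. A minimal sketch on a synthetic frame (hypothetical values, not the HR dataset) shows the difference a `subset=` check makes:

```python
import pandas as pd

# Tiny synthetic frame; EmployeeNumber is unique, Age repeats
demo = pd.DataFrame({'EmployeeNumber': [1, 2, 3], 'Age': [41, 41, 41]})

# duplicated() flags only fully identical rows
full_dupes = demo.duplicated().sum()                   # 0: EmployeeNumber differs
# subset= catches repeats on chosen key columns
subset_dupes = demo.duplicated(subset=['Age']).sum()   # 2: Age repeats twice
print(full_dupes, subset_dupes)
```

In practice one would pick a subset of columns that should jointly identify an employee record.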

EDA
In [11]: # sns.pairplot(df)

In [12]: df.corr()
Out[12]: Age DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber

Age 1.000000 0.010661 -0.001686 0.208034 NaN -0.005175

DailyRate 0.010661 1.000000 -0.004985 -0.016806 NaN -0.025742

DistanceFromHome -0.001686 -0.004985 1.000000 0.021042 NaN 0.016464

Education 0.208034 -0.016806 0.021042 1.000000 NaN 0.020950

EmployeeCount NaN NaN NaN NaN NaN NaN

EmployeeNumber -0.005175 -0.025742 0.016464 0.020950 NaN 1.000000

EnvironmentSatisfaction 0.010146 0.018355 -0.016075 -0.027128 NaN 0.008712

HourlyRate 0.024287 0.023381 0.031131 0.016775 NaN 0.017377

JobInvolvement 0.029820 0.046135 0.008783 0.042438 NaN -0.003552

JobLevel 0.509604 0.002966 0.005303 0.101589 NaN -0.009020

JobSatisfaction -0.004892 0.030571 -0.003669 -0.011296 NaN -0.022970

MonthlyIncome 0.497855 0.007707 -0.017014 0.094961 NaN -0.007188

MonthlyRate 0.028051 -0.032182 0.027473 -0.026084 NaN 0.006177

NumCompaniesWorked 0.299635 0.038153 -0.029251 0.126317 NaN -0.000345

PercentSalaryHike 0.003634 0.022704 0.040235 -0.011111 NaN -0.006685

PerformanceRating 0.001904 0.000473 0.027110 -0.024539 NaN -0.010338

RelationshipSatisfaction 0.053535 0.007846 0.006557 -0.009118 NaN -0.034827

StandardHours NaN NaN NaN NaN NaN NaN

StockOptionLevel 0.037510 0.042143 0.044872 0.018422 NaN 0.031226

TotalWorkingYears 0.680381 0.014515 0.004628 0.148280 NaN -0.007047

TrainingTimesLastYear -0.019621 0.002453 -0.036942 -0.025100 NaN 0.011953

WorkLifeBalance -0.021490 -0.037848 -0.026556 0.009819 NaN 0.005370

YearsAtCompany 0.311309 -0.034055 0.009508 0.069114 NaN -0.005779

YearsInCurrentRole 0.212901 0.009932 0.018845 0.060236 NaN -0.004427

YearsSinceLastPromotion 0.216513 -0.033229 0.010029 0.054254 NaN -0.004575

YearsWithCurrManager 0.202089 -0.026363 0.014406 0.069065 NaN -0.004716

26 rows × 26 columns
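The NaN rows and columns above come from `EmployeeCount` and `StandardHours`: both are constant, so their standard deviation is zero and every correlation involving them is undefined. A small sketch with a synthetic frame (hypothetical values) shows how dropping zero-variance columns first yields a clean matrix:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df (hypothetical values)
demo = pd.DataFrame({
    'Age':           [41, 49, 37, 33],
    'EmployeeCount': [1, 1, 1, 1],      # constant column: correlations become NaN
    'MonthlyIncome': [5993, 5130, 2090, 2909],
})

numeric = demo.select_dtypes(include=np.number)
varying = numeric.loc[:, numeric.std() > 0]  # drop zero-variance columns
corr = varying.corr()
print(corr.shape)  # (2, 2): EmployeeCount is gone and no NaNs remain
```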

In [13]: #unique count of age in dataset


age_info=df['Age'].value_counts()
age_dict=dict(age_info)
import collections as clt
ordered_dict=clt.OrderedDict(sorted(age_dict.items()))
print(ordered_dict)

#the format printed below is (age,count of employees)

OrderedDict([(18, 16), (19, 18), (20, 22), (21, 26), (22, 32), (23, 28), (24, 52), (25, 52), (26, 78), (27, 96), (28, 96), (29, 136), (30, 120), (31, 138), (32, 122), (33, 116), (34, 154), (35, 156), (36, 138), (37, 100), (38, 116), (39, 84), (40, 114), (41, 80), (42, 92), (43, 64), (44, 66), (45, 82), (46, 66), (47, 48), (48, 38), (49, 48), (50, 60), (51, 38), (52, 36), (53, 38), (54, 36), (55, 44), (56, 28), (57, 8), (58, 28), (59, 20), (60, 10)])
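The same age-to-count mapping can be had without `collections.OrderedDict`: `value_counts()` followed by `sort_index()` returns a Series already ordered by age. A sketch on hypothetical ages:

```python
import pandas as pd

# Hypothetical ages standing in for df['Age']
ages = pd.Series([18, 19, 18, 20, 19, 18])

# value_counts() gives counts, sort_index() orders them by age
age_counts = ages.value_counts().sort_index()
print(age_counts.to_dict())  # {18: 3, 19: 2, 20: 1}
```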

In [14]: plt.figure(figsize=(14,7))
sns.distplot(df['Age'])

Out[14]: <AxesSubplot:xlabel='Age', ylabel='Density'>

In [15]: #Countplot of employees in dataset


plt.figure(figsize=(20,10))
plt.xlabel('Age',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.title('Age vs Count of employees',fontsize=18,fontweight='bold')
sns.countplot(data=df,x='Age')

Out[15]: <AxesSubplot:title={'center':'Age vs Count of employees'}, xlabel='Age', ylabel='count'>

Most employees in the dataset are aged between 29 and 40.


In [16]: #Countplot of employees with attrition
plt.figure(figsize=(20,10))
plt.xlabel('Age',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.title('Age vs Count of employees with attrition',fontsize=18,fontweight='bold')
sns.countplot(data=df,x='Age',hue='Attrition')

Out[16]: <AxesSubplot:title={'center':'Age vs Count of employees with attrition'}, xlabel='Age', ylabel='count'>

Employees aged 18 to 35 have a high attrition rate, while ages 54, 57, 59, and 60 show no attrition.
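The 18-to-35 claim can be quantified rather than read off the plot: filter the band with `between` and average a Yes-flag. A sketch on a synthetic mini-frame (hypothetical rows, not the real data):

```python
import pandas as pd

# Synthetic mini-frame (hypothetical values)
demo = pd.DataFrame({
    'Age':       [22, 30, 45, 58],
    'Attrition': ['Yes', 'Yes', 'No', 'No'],
})

# Select the 18-35 band; the mean of a boolean Yes-flag is the attrition rate
young = demo[demo['Age'].between(18, 35)]
rate = (young['Attrition'] == 'Yes').mean()
print(rate)  # 1.0 for this toy data
```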

In [17]: #Unique values of Business Travel


df['BusinessTravel'].unique()

Out[17]: array(['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'], dtype=object)

In [18]: business_series=df['BusinessTravel'].value_counts()
business_series

Out[18]: Travel_Rarely        2086
Travel_Frequently     554
Non-Travel            300
Name: BusinessTravel, dtype: int64

In [19]: #Pie chart representation of business travel


px.pie(business_series,names=business_series.index,values=business_series.values,hole=0.5,title='Percentage distribution of business travel')

[Pie chart "Percentage distribution of business travel": Travel_Rarely 71%, Travel_Frequently 18.8%, Non-Travel 10.2%]

In [20]: df_business=df[['Age','BusinessTravel']].value_counts().reset_index().rename(columns={0:
df_business[:10]

Out[20]: Age BusinessTravel Count

105 18 Travel_Frequently 4

82 18 Non-Travel 8

92 18 Travel_Rarely 4

64 19 Travel_Rarely 14

115 19 Travel_Frequently 2

113 19 Non-Travel 2

52 20 Travel_Rarely 18

95 20 Travel_Frequently 4

99 21 Travel_Frequently 4

46 21 Travel_Rarely 20

In [21]: plt.figure(figsize=(16,8))
plt.xlabel('Age',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Age vs Count of employees with Business Travel',fontsize=14,fontweight='bold')
sns.lineplot(data=df_business,x='Age',y='Count',hue='BusinessTravel')

Out[21]: <AxesSubplot:title={'center':'Age vs Count of employees with Business Travel'}, xlabel='Age', ylabel='Count'>
In [22]: plt.figure(figsize=(10,6))
plt.xlabel('Business Travel',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Business Travel vs Count of employees with attrition',fontsize=14,fontweight='bold')
g=sns.countplot(data=df,x='BusinessTravel',hue='Attrition')

Employees who travel rarely show high attrition in absolute terms, though their headcount is also the largest, while employees who do not travel have a lower attrition rate.
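Raw counts conflate "many leavers" with "a high leave rate"; a row-normalised crosstab separates the two. A minimal sketch on a hypothetical mini-frame mirroring the two columns plotted above:

```python
import pandas as pd

# Hypothetical mini-frame (not the real dataset)
demo = pd.DataFrame({
    'BusinessTravel': ['Travel_Rarely'] * 4 + ['Non-Travel'] * 2,
    'Attrition':      ['Yes', 'No', 'No', 'No', 'No', 'No'],
})

# normalize='index' turns raw counts into per-category attrition shares
rates = pd.crosstab(demo['BusinessTravel'], demo['Attrition'], normalize='index')
print(rates)
```

On the full dataset this would give, per travel category, the fraction of employees who left, independent of headcount.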

In [23]: print(max(df['DailyRate']))

1499

In [24]: plt.figure(figsize=(12,6))
sns.boxplot(df['DailyRate'])
Out[24]: <AxesSubplot:xlabel='DailyRate'>

In [25]: #Average DailyRate of employees with respect to age


plt.figure(figsize=(16,9))
avg_dailyrate=df.groupby(['Age'])['DailyRate'].mean().reset_index()
sns.lineplot(data=avg_dailyrate,x='Age',y='DailyRate')

Out[25]: <AxesSubplot:xlabel='Age', ylabel='DailyRate'>

In [26]: plt.figure(figsize=(16,9))
plt.xlabel('Age',fontsize=14)
plt.ylabel('DailyRate',fontsize=14)
plt.title('Age vs DailyRate with attrition',fontsize=14,fontweight='bold')
sns.lineplot(data=df,x='Age',y='DailyRate',hue='Attrition',palette='rocket',ci=None) #Using ci=None to disable the confidence interval band

Out[26]: <AxesSubplot:title={'center':'Age vs DailyRate with attrition'}, xlabel='Age', ylabel='DailyRate'>
The line plot shows that even employees with a daily rate as high as 1200 are leaving the organisation, whereas employees in the 400 to 900 range tend to stay.

In [27]: #Department employees count


dept=df['Department'].value_counts()
px.pie(names=dept.index,values=dept.values,hole=0.5,title='Department distribution')

[Pie chart "Department distribution": Research & Development 65.4%, Sales 30.3%, Human Resources 4.29%]
Research & Development has the highest number of employees followed by Sales

In [28]: #Attrition rate depending upon department


plt.figure(figsize=(16,9))
plt.xlabel('Department',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Department vs Count of Employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='Department',data=df,hue='Attrition',palette='Set2')

Out[28]: <AxesSubplot:title={'center':'Department vs Count of Employees with attrition'}, xlabel='Department', ylabel='count'>

The Human Resources department has a high attrition rate relative to its headcount.
Sales also shows high attrition, while attrition in R&D is moderate.

In [29]: # Average DailyRate with respect to Department


df.groupby('Department')['DailyRate'].mean()

Out[29]: Department
Human Resources 751.539683
Research & Development 806.851197
Sales 800.275785
Name: DailyRate, dtype: float64

In [30]: #Unique values of distance in km


df['DistanceFromHome'].unique()

Out[30]: array([ 1,  8,  2,  3, 24, 23, 27, 16, 15, 26, 19, 21,  5, 11,  9,  7,  6,
        10,  4, 25, 12, 18, 29, 22, 14, 20, 28, 17, 13], dtype=int64)

In [31]: plt.figure(figsize=(16,9))
plt.xlabel('Distance From Home',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Distance vs Count of Employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='DistanceFromHome',data=df,hue='Attrition',palette='rainbow')

Out[31]: <AxesSubplot:title={'center':'Distance vs Count of Employees with attrition'}, xlabel='DistanceFromHome', ylabel='count'>
Distances of 1 or 2 km might appear to have high attrition, but given how many employees live that close, the rate is reasonable.
A distance of 24 km shows a high attrition rate relative to its headcount.
Most employees live within 1 to 10 km of the workplace.

In [32]: #Education unique values


df['Education'].unique()

Out[32]: array([2, 1, 4, 3, 5], dtype=int64)

In [33]: #Countplot of employees with respect to education


plt.figure(figsize=(16,9))
plt.xlabel('Education',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Education vs Count of Employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='Education',data=df,hue='Attrition')

Out[33]: <AxesSubplot:title={'center':'Education vs Count of Employees with attrition'}, xlabel='Education', ylabel='count'>
Education rating 5 has the fewest employees and also the lowest attrition rate.
Ratings 2, 3, and 4 each show attrition above 20 percent relative to their headcounts.

In [34]: #Education Field analysis


edu=df.groupby('EducationField')['EducationField'].count()
px.pie(names=edu.index,values=edu.values,title='Education Distribution',hole=0.5,color_discrete_sequence=px.colors.qualitative.Set2)

[Pie chart "Education Distribution": Life Sciences 41.2%, Medical 31.6%, Marketing 10.8%, Technical Degree 8.98%, Other 5.58%, Human Resources 1.84%]
Life Sciences has the largest number of employees, followed by Medical; Human Resources has the smallest count.

In [35]: plt.figure(figsize=(16,9))
plt.xlabel('Education Field',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Education Field vs Count of Employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(y='EducationField',data=df,hue='Attrition',palette='rocket')

Out[35]: <AxesSubplot:title={'center':'Education Field vs Count of Employees with attrition'}, xlabel='count', ylabel='EducationField'>

Though Human Resources has the smallest headcount, its attrition appears to exceed 40 percent.
The Other field has low attrition.
Technical Degree and Marketing show high attrition.
Relative to their headcounts, attrition in Medical and Life Sciences is moderate.
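These per-field shares can be computed directly instead of eyeballed: `value_counts(normalize=True)` inside a groupby gives the Yes/No split per education field. A sketch on hypothetical rows (not the real dataset):

```python
import pandas as pd

# Hypothetical rows standing in for the two columns used above
demo = pd.DataFrame({
    'EducationField': ['Human Resources'] * 2 + ['Other'] * 2,
    'Attrition':      ['Yes', 'No', 'No', 'No'],
})

# Per-field attrition shares, independent of field headcount
shares = demo.groupby('EducationField')['Attrition'].value_counts(normalize=True)
print(shares)
```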

In [36]: plt.figure(figsize=(16,9))
plt.xlabel('Education Field',fontsize=14)
plt.ylabel('DailyRate',fontsize=14)
plt.title('Education Field vs Count of Employees with attrition',fontsize=14,fontweight='bold')
sns.lineplot(x='EducationField',y='DailyRate',data=df,ci=None)

Out[36]: <AxesSubplot:title={'center':'Education Field vs Count of Employees with attrition'}, xlabel='Education Field', ylabel='DailyRate'>
The line plot above suggests that Technical Degree holders are, on average, paid well.

In [37]: # Education Field in respect to working in Department


#Sales Department
sales_df=df[df['Department']=='Sales']
sales_df[:3]

Out[37]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField Employ

0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences

18 53 No Travel_Rarely 1219 Sales 2 4 Life Sciences

21 36 Yes Travel_Rarely 1218 Sales 9 4 Life Sciences

3 rows × 35 columns

In [38]: plt.figure(figsize=(16,9))
plt.xlabel('Education Field',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Education Field with respect to Department',fontsize=14,fontweight='bold')
# plt.legend(loc='upper right')
sns.countplot(x='EducationField',data=df,hue='Department',palette='rocket')

Out[38]: <AxesSubplot:title={'center':'Education Field with respect to Department'}, xlabel='EducationField', ylabel='count'>
In [39]: #Unique values of environment satisfication
df['EnvironmentSatisfaction'].unique()

Out[39]: array([2, 3, 4, 1], dtype=int64)

In [40]: #1 being the lowest rating whereas 4 being the highest


#Rating based on department
envsatdep=df.groupby(['Department','EnvironmentSatisfaction'])['EmployeeNumber'].count().reset_index()
envsatdep

Out[40]: Department EnvironmentSatisfaction EmployeeNumber

0 Human Resources 1 22

1 Human Resources 2 24

2 Human Resources 3 52

3 Human Resources 4 28

4 Research & Development 1 374

5 Research & Development 2 354

6 Research & Development 3 584

7 Research & Development 4 610

8 Sales 1 172

9 Sales 2 196

10 Sales 3 270

11 Sales 4 254

In [41]: # df['EnvironmentSatisfaction'].astype(object)

In [42]: plt.figure(figsize=(16,9))
plt.xlabel('Education Satisfication',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Environment Satisfication with attrition',fontsize=14,fontweight='bold')
# plt.legend(loc='upper right')
sns.countplot(x='EnvironmentSatisfaction',data=df,hue='Attrition',palette='rocket')

<AxesSubplot:title={'center':'Environment Satisfication with attrition'}, xlabel='Enviro


Out[42]:
nmentSatisfaction', ylabel='count'>

Almost 27.2% of employees with rating 1 are resigning.
Almost 16.66% of employees with rating 2 are resigning.
About 11.3% of employees with rating 3 and rating 4 are resigning, the lowest attrition rate.
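Percentages like these follow from a simple identity: mapping Attrition to a boolean and taking the group mean gives the attrition rate per rating. A minimal sketch on synthetic ratings (hypothetical values):

```python
import pandas as pd

# Synthetic ratings and outcomes (hypothetical, not the real data)
demo = pd.DataFrame({
    'EnvironmentSatisfaction': [1, 1, 2, 2, 2, 3],
    'Attrition':               ['Yes', 'No', 'Yes', 'No', 'No', 'No'],
})

# The mean of a Yes-flag per group is exactly the attrition rate per rating
rate = (demo['Attrition'] == 'Yes').groupby(demo['EnvironmentSatisfaction']).mean()
print(rate.to_dict())
```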

In [43]: # Gender representation with respect to department


plt.figure(figsize=(18,14),dpi=1600)
gender_df=df.groupby(['Gender','Department'])['EmployeeNumber'].count().reset_index().rename(columns={'EmployeeNumber':'Count'})
male_df=gender_df[gender_df['Gender']=='Male']
female_df=gender_df[gender_df['Gender']=='Female']
ax1 = plt.subplot2grid((2,2),(0,0))
lm=plt.pie('Count',labels='Department',autopct='%1.2f%%',data=male_df)
plt.title('Male Department Distribution',fontsize=12,fontweight='bold')
# px.pie(male_df,names='Department',values='Count',hole=0.5,title='Department Male representation')
ax1 = plt.subplot2grid((2,2),(0,1))
# px.pie(female_df,names='Department',values='Count',hole=0.5,title='Department Female representation')
fm=plt.pie('Count',labels='Department',autopct='%1.2f%%',data=female_df,colors=['violet','pink','plum'])
plt.title('Female Department Distribution',fontsize=12,fontweight='bold')

Out[43]: Text(0.5, 1.0, 'Female Department Distribution')
In [44]: #Attrition with respect to Gender
plt.figure(figsize=(14,6))
plt.xlabel('Gender',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Gender with respect to Attrition',fontsize=14,fontweight='bold')
sns.countplot(x='Gender',data=df,hue='Attrition',palette='magma')

Out[44]: <AxesSubplot:title={'center':'Gender with respect to Attrition'}, xlabel='Gender', ylabel='count'>

Male attrition appears higher than female, though the headcount of each gender must also be taken into account.

In [45]: #Average Hourly Rate of employees with respect to age


plt.figure(figsize=(16,9))
avg_hourlyrate=df.groupby(['Age'])['HourlyRate'].mean().reset_index()
sns.lineplot(data=avg_hourlyrate,x='Age',y='HourlyRate')

Out[45]: <AxesSubplot:xlabel='Age', ylabel='HourlyRate'>
The line plot shows that, on average, hourly pay is highest around age 57 or 58.

In [46]: plt.figure(figsize=(16,9))
plt.xlabel('Age',fontsize=14)
plt.ylabel('HourlyRate',fontsize=14)
plt.title('Age vs HourlyRate with attrition',fontsize=14,fontweight='bold')
sns.lineplot(data=df,x='Age',y='HourlyRate',hue='Attrition',palette='rocket',ci=None) #Using ci=None to disable the confidence interval band

Out[46]: <AxesSubplot:title={'center':'Age vs HourlyRate with attrition'}, xlabel='Age', ylabel='HourlyRate'>

Even with an hourly rate above 90, employees around ages 42 and 47 tend to resign.
For hourly rates of 50 or below in the 42 to 52 age range, resignations seem more expected.
Attrition is lowest in the 60 to 90 hourly rate range.

In [47]: #JobInvolvement analysis


df['JobInvolvement'].unique()

Out[47]: array([3, 2, 4, 1], dtype=int64)

In [48]: #JobInvolvement with respect to Gender


plt.figure(figsize=(16,8))
plt.xlabel('Job Involvement',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Gender with respect to JobInvolvement',fontsize=14,fontweight='bold')
sns.countplot(x='JobInvolvement',data=df,hue='Gender',palette='cubehelix')

Out[48]: <AxesSubplot:title={'center':'Gender with respect to JobInvolvement'}, xlabel='JobInvolvement', ylabel='count'>

Rating 3 is the most common for both male and female employees.

In [49]: #JobInvolvement with respect to Department


plt.figure(figsize=(16,8))
plt.xlabel('Job Involvement',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Department with respect to JobInvolvement',fontsize=14,fontweight='bold')
sns.countplot(x='JobInvolvement',data=df,hue='Department',palette='Set2')

Out[49]: <AxesSubplot:title={'center':'Department with respect to JobInvolvement'}, xlabel='JobInvolvement', ylabel='count'>
In [50]: #JobInvolvement with respect to Attrition
plt.figure(figsize=(16,8))
plt.xlabel('Job Involvement',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Attrition with respect to JobInvolvement',fontsize=14,fontweight='bold')
sns.countplot(x='JobInvolvement',data=df,hue='Attrition',palette='husl')

Out[50]: <AxesSubplot:title={'center':'Attrition with respect to JobInvolvement'}, xlabel='JobInvolvement', ylabel='count'>

Rating 4 has the least attrition, since these employees are seriously involved in their jobs.
Rating 3 attrition is moderate.
Rating 1 has an attrition rate of almost 50 percent.

In [51]: job_df=df.groupby(['JobLevel','JobRole']).size().reset_index()
job_df.drop(0,axis=1,inplace=True)
job_df.set_index('JobLevel')
Out[51]: JobRole

JobLevel

1 Human Resources

1 Laboratory Technician

1 Research Scientist

1 Sales Representative

2 Healthcare Representative

2 Human Resources

2 Laboratory Technician

2 Manufacturing Director

2 Research Scientist

2 Sales Executive

2 Sales Representative

3 Healthcare Representative

3 Human Resources

3 Laboratory Technician

3 Manager

3 Manufacturing Director

3 Research Director

3 Research Scientist

3 Sales Executive

4 Healthcare Representative

4 Manager

4 Manufacturing Director

4 Research Director

4 Sales Executive

5 Manager

5 Research Director

The dataframe above shows that a single job role can map to multiple job levels.
Job level 5 has two roles: Manager and Research Director.

In [52]: #JobRole with respect to Attrition


plt.figure(figsize=(18,9))
plt.xlabel('JobRole',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.xticks(rotation=90)
plt.title('Attrition with respect to JobRole',fontsize=14,fontweight='bold')
sns.countplot(x='JobRole',data=df,hue='Attrition')

Out[52]: <AxesSubplot:title={'center':'Attrition with respect to JobRole'}, xlabel='JobRole', ylabel='count'>
Research Director, Manager, and Manufacturing Director have the least attrition.
Sales Representative has the highest attrition relative to its headcount, followed by Human Resources and Laboratory Technician.

In [53]: #Job satisfaction considering total employees


df.groupby('JobSatisfaction').size()

Out[53]: JobSatisfaction
1 578
2 560
3 884
4 918
dtype: int64

In [54]: plt.figure(figsize=(18,9))
plt.xlabel('JobSatisfaction',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Job satisfaction with respect to department',fontsize=14,fontweight='bold')
sns.countplot(x='Department',data=df,hue='JobSatisfaction',palette='rocket')

Out[54]: <AxesSubplot:title={'center':'Job satisfaction with respect to department'}, xlabel='Department', ylabel='count'>
The Sales department has the highest count of rating 4.
Research & Development peaks at rating 3.
Human Resources peaks at rating 2.

In [55]: #Job satisfaction with respect to attrition


plt.figure(figsize=(18,9))
plt.xlabel('JobSatisfaction',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Job satisfaction with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='JobSatisfaction',data=df,hue='Attrition',palette='Set2')

Out[55]: <AxesSubplot:title={'center':'Job satisfaction with respect to attrition'}, xlabel='JobSatisfaction', ylabel='count'>

In [56]: marital_df=df.groupby('MaritalStatus').size().reset_index().rename(columns={0:'No of employees'})


marital_df

Out[56]: MaritalStatus No of employees


0 Divorced 654

1 Married 1346

2 Single 940

In [57]: plt.figure(figsize=(16,8))
plt.xlabel('MaritalStatus',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('MaritalStatus with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='MaritalStatus',data=df,hue='Attrition')

Out[57]: <AxesSubplot:title={'center':'MaritalStatus with respect to attrition'}, xlabel='MaritalStatus', ylabel='count'>

Single employees tend to resign more.
Married and divorced employees are more stable with respect to attrition.

In [58]: #Maximum monthly income in the dataset


df['MonthlyIncome'].max()

Out[58]: 19999

In [59]: plt.figure(figsize=(14,7))
sns.distplot(df['MonthlyIncome'])

Out[59]: <AxesSubplot:xlabel='MonthlyIncome', ylabel='Density'>
In [60]: #Monthly income with respect to Department and attrition
plt.figure(figsize=(14,7))
plt.xlabel('Department',fontsize=14)
plt.ylabel('Monthly Income',fontsize=14)
plt.title('Monthly income with respect to Department and attrition',fontsize=14,fontweight='bold')
sns.swarmplot(x='Department',y='MonthlyIncome',data=df,hue='Attrition')

Out[60]: <AxesSubplot:title={'center':'Monthly income with respect to Department and attrition'}, xlabel='Department', ylabel='MonthlyIncome'>

In the Sales department, attrition is high in the 1000 to 3500 income range; the stable 15000 to 18500 range has no attrition.
In Research & Development, attrition is moderate across the 1000 to 12500 range; the 14000 to 18000 range shows none.
In Human Resources, the plot shows a wide pay gap: attrition is high between 1000 and roughly 3000, and absent at the higher incomes.

In [61]: #Average monthly income of employees with respect to Department


plt.figure(figsize=(14,7))
plt.xlabel('Department',fontsize=14)
plt.ylabel('Monthly Income',fontsize=14)
plt.title('Monthly income with respect to Department and attrition',fontsize=14,fontweight='bold')
sns.lineplot(x='Department',y='MonthlyIncome',data=df,hue='Attrition',ci=None)

Out[61]: <AxesSubplot:title={'center':'Monthly income with respect to Department and attrition'}, xlabel='Department', ylabel='Monthly Income'>

This plot gives a clearer perspective: an average monthly income below 6000 is associated with high attrition among employees, while an average income above 6500 shows very little attrition.
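A threshold observation like this is easy to test by binning income with `pd.cut` and computing the attrition rate per band. A sketch with hypothetical incomes and outcomes (the 6000 cut point mirrors the observation above):

```python
import pandas as pd

# Hypothetical incomes and outcomes (not the real dataset)
incomes   = pd.Series([1009, 2911, 4919, 8380, 19999])
attrition = pd.Series(['Yes', 'Yes', 'No', 'No', 'No'])

# Bin income into two bands, then average a Yes-flag per band
bands = pd.cut(incomes, bins=[0, 6000, 20000], labels=['<=6000', '>6000'])
rate = (attrition == 'Yes').groupby(bands).mean()
print(rate.to_dict())
```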

In [62]: #Unique values of NumCompaniesWorked


df['NumCompaniesWorked'].unique()

Out[62]: array([8, 1, 6, 9, 0, 4, 5, 2, 7, 3], dtype=int64)

In [63]: num_com=df.groupby('NumCompaniesWorked').size().reset_index().rename(columns={0:'Count'})
num_com

Out[63]: NumCompaniesWorked Count

0 0 394

1 1 1042

2 2 292

3 3 318

4 4 278

5 5 126

6 6 140

7 7 148

8 8 98

9 9 104
In [64]: # pie chart representation of number of companies worked
px.pie(num_com,names='NumCompaniesWorked',values='Count',title='Number of companies worked representation')

[Pie chart "Number of companies worked representation": 1: 35.4%, 0: 13.4%, 3: 10.8%, 2: 9.93%, 4: 9.46%, 7: 5.03%, 6: 4.76%, 5: 4.29%, 9: 3.54%, 8: 3.33%]

35.4% of employees had worked at exactly 1 company before.


3.54% of employees had worked at 9 companies.

In [65]: plt.figure(figsize=(16,8))
plt.xlabel('Num of Companies Worked',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('NumCompaniesWorked with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='NumCompaniesWorked',data=df,hue='Attrition')

Out[65]: <AxesSubplot:title={'center':'NumCompaniesWorked with respect to attrition'}, xlabel='NumCompaniesWorked', ylabel='count'>
Attrition is high for employees who had worked at one previous company or none.
Attrition is lower for employees who had worked at 2 or more companies.

In [66]: dep_comp=df.groupby(['NumCompaniesWorked','Department'])['EmployeeNumber'].count().reset_index().rename(columns={'EmployeeNumber':'Count'})
dep_comp[:5]

Out[66]: NumCompaniesWorked Department Count

0 0 Human Resources 24

1 0 Research & Development 240

2 0 Sales 130

3 1 Human Resources 38

4 1 Research & Development 680

In [67]: for i in dep_comp['Department'].unique():
    labels_subcat=dep_comp[dep_comp['Department']==i].NumCompaniesWorked.tolist()
    values_subcat=dep_comp[dep_comp['Department']==i].Count.tolist()
    hui=px.pie(names=labels_subcat,values=values_subcat,hole=0.4,title=i)
    hui.show()

[Pie charts, NumCompaniesWorked distribution per department:
Human Resources: 1: 30.2%, 0: 19%, next-largest category 15.9%;
Research & Development: 1: 35.4%, 0: 12.5%, 3: 11.2%, 2: 10.2%;
Sales: 1: 36.3%, 0: 14.6%, 3: 10.3%, 2: 10.1%]

In [68]: #Job satisfaction with respect to number of companies worked


job_comp=df.groupby(['NumCompaniesWorked','JobSatisfaction'])['EmployeeNumber'].count().reset_index().rename(columns={'EmployeeNumber':'Count'})
job_comp[:5]

Out[68]: NumCompaniesWorked JobSatisfaction Count

0 0 1 76

1 0 2 70

2 0 3 118

3 0 4 130

4 1 1 182

In [69]: for i in job_comp['JobSatisfaction'].unique():
    labels_subcat=job_comp[job_comp['JobSatisfaction']==i].NumCompaniesWorked.tolist()
    values_subcat=job_comp[job_comp['JobSatisfaction']==i].Count.tolist()
    hui=px.pie(names=labels_subcat,values=values_subcat,hole=0.4,title=f'rating {i} job satisfaction level')
    hui.show()

[Pie charts, NumCompaniesWorked distribution per job satisfaction rating (approximate shares):
rating 1: 1: 31.5%, 0: 13.1%, 2: 12.5%;
rating 2: 1: 36.1%, 0: 12.5%, 4: 9.64%;
rating 3: 1: 36%, 0: 13.3%, 3: 11.8%;
rating 4: 1: 37%, 0: 14.2%, 3: 12.6%]

In [70]: #Over 18 unique values


df['Over18'].value_counts()

Out[70]: Y    2940
Name: Over18, dtype: int64

All employees are above 18 years old

In [71]: #Overtime unique values


df['OverTime'].value_counts()

Out[71]: No     2108
Yes     832
Name: OverTime, dtype: int64

In [72]: #Swarm plot analysis of OverTime


plt.figure(figsize=(14,7))
plt.xlabel('OverTime',fontsize=14)
plt.ylabel('Monthly Income',fontsize=14)
plt.title('Monthly income with respect to OverTime and attrition',fontsize=14,fontweight='bold')
sns.swarmplot(x='OverTime',y='MonthlyIncome',data=df,hue='Attrition')

Out[72]: <AxesSubplot:title={'center':'Monthly income with respect to OverTime and attrition'}, xlabel='OverTime', ylabel='MonthlyIncome'>

Employees working overtime with a monthly income below 10,000 tend to resign more, even though overtime workers are the smaller group
Attrition among employees who do not work overtime is lower

In [73]: #Swarm plot analysis of Overtime with Age


# plt.figure(figsize=(14,7))
# plt.xlabel('OverTime',fontsize=14)
# plt.ylabel('Age',fontsize=14)
# plt.title('Age with respect to OverTime and attrition',fontsize=14,fontweight='bold')
# sns.swarmplot(x='OverTime',y='Age',data=df,hue='Attrition')

In [74]: # PercentSalaryHike unique values


print(df['PercentSalaryHike'].nunique())
print(df.PercentSalaryHike.value_counts())

15
11 420
13 418
14 402
12 396
15 202
18 178
17 164
16 156
19 152
22 112
20 110
21 96
23 56
24 42
25 36
Name: PercentSalaryHike, dtype: int64
In [75]: plt.figure(figsize=(16,8))
plt.xlabel('Num of Companies Worked',fontsize=14)
plt.ylabel('Salary hike',fontsize=14)
plt.title('Number of companies worked with respect to percent hike and attrition',fontsize=14,fontweight='bold')
sns.barplot(x='NumCompaniesWorked',y='PercentSalaryHike',data=df,hue='Attrition')

Out[75]: <AxesSubplot:title={'center':'Number of companies worked with respect to percent hike and attrition'}, xlabel='NumCompaniesWorked', ylabel='PercentSalaryHike'>

Employees who have worked at 0, 4, 7, or 9 companies tend to resign more, even though their average salary hike exceeds 15 percent

In [76]: #Unique values of performance rating


df['PerformanceRating'].unique()

Out[76]: array([3, 4], dtype=int64)

In [77]: per_rating=df.groupby('PerformanceRating').size().reset_index().rename(columns={0:'Count'})
per_rating

Out[77]:    PerformanceRating  Count
         0                  3   2488
         1                  4    452

In [78]: px.pie(per_rating,names='PerformanceRating',values='Count',title='Percentage distribution of Performance Rating')

[Donut chart: Percentage distribution of Performance Rating — rating 3: 84.6%, rating 4: 15.4%]

84.6% are average performers in the organisation


15.4% are top performers

In [79]: plt.figure(figsize=(16,8))
plt.xlabel('Job Involvement',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Job Involvement worked with respect to performance rating',fontsize=14,fontweight='bold')
sns.countplot(x='JobInvolvement',hue='PerformanceRating',data=df,palette='Set2')

Out[79]: <AxesSubplot:title={'center':'Job Involvement worked with respect to performance rating'}, xlabel='JobInvolvement', ylabel='count'>

Job involvement rating 3 has the highest number of employees as well as the most rating-3 performers
Employees with very low job involvement (ratings 1 and 2) are also awarded the highest performance rating [4]

In [80]: plt.figure(figsize=(16,8))
plt.xlabel('Job Involvement',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Plot with respect to performance rating and attrition',fontsize=14,fontweight=
sns.countplot(x='PerformanceRating',hue='Attrition',data=df,palette='rocket')
Out[80]: <AxesSubplot:title={'center':'Plot with respect to performance rating and attrition'}, xlabel='PerformanceRating', ylabel='count'>

In [81]: per_atrr=df.groupby(['Attrition','PerformanceRating'])['EmployeeNumber'].size().reset_index().rename(columns={'EmployeeNumber':'Count'})
per_atrr

Out[81]:    Attrition  PerformanceRating  Count
         0        No                  3   2088
         1        No                  4    378
         2       Yes                  3    400
         3       Yes                  4     74

In [82]: # def fnct(df):


# d1={}
# pr=[3,4]
# for j in pr:
# d1[j]=df[df['Attrition']=='Yes' & df['PerformanceRating']==j].Count/(df[df['At
# return d1

# fnct(per_atrr)

In [83]: #unique RelationshipSatisfaction


df['RelationshipSatisfaction'].unique()

Out[83]: array([1, 4, 2, 3], dtype=int64)

In [84]: plt.figure(figsize=(16,8))
plt.xlabel('Relationship Satisfaction',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Relationship Satisfaction with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='RelationshipSatisfaction',data=df,hue='Attrition')

Out[84]: <AxesSubplot:title={'center':'Relationship Satisfaction with respect to attrition'}, xlabel='RelationshipSatisfaction', ylabel='count'>
In [85]: #Since the chart is not giving a clear view of the percentage we can create a function for it
def fnct_relation(dataframe):
    dict1={}
    j=[1,2,3,4]
    for i in j:
        cal=((df[(df['RelationshipSatisfaction']==i) & (df['Attrition']=='Yes')]['EmployeeNumber'].count())/(df[df['RelationshipSatisfaction']==i]['EmployeeNumber'].count()))*100
        cal=round(cal,2)
        dict1[i]=cal
        cal=0
    return dict1

fnct_relation(df)

Out[85]: {1: 20.65, 2: 14.85, 3: 15.47, 4: 14.81}

From the dictionary values above we can infer the following:


RelationshipSatisfaction rating 1 has the highest attrition, at nearly 21 percent
Interestingly, rating 3 has more attrition than rating 2
Rating 4 has the lowest attrition among all ratings
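The per-rating loop above can also be written as a single vectorized groupby; a minimal sketch on made-up data (the column names mirror the HR frame, the values here are toy):

```python
import pandas as pd

# Minimal sketch: attrition rate per category in one groupby, assuming a
# frame with 'RelationshipSatisfaction' and 'Attrition' columns (toy data).
df = pd.DataFrame({
    'RelationshipSatisfaction': [1, 1, 2, 2, 3, 3, 4, 4],
    'Attrition':                ['Yes', 'No', 'No', 'No', 'Yes', 'No', 'No', 'No'],
})

rate = (df.assign(left=df['Attrition'].eq('Yes'))   # boolean: employee resigned
          .groupby('RelationshipSatisfaction')['left']
          .mean()                                   # fraction of leavers per rating
          .mul(100)
          .round(2))
print(rate.to_dict())  # rating -> attrition percentage
```

The same pattern covers StockOptionLevel, TotalWorkingYears, and TrainingTimesLastYear without writing a new function each time.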

In [86]: #unique values of StandardHours


df['StandardHours'].unique()

Out[86]: array([80], dtype=int64)

In [87]: #Unique values of StockOptionLevel


df['StockOptionLevel'].unique()

Out[87]: array([0, 1, 3, 2], dtype=int64)

In [88]: plt.figure(figsize=(16,8))
plt.xlabel('Stock level option',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Stock level option with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='StockOptionLevel',data=df,hue='Attrition',palette='rainbow')

Out[88]: <AxesSubplot:title={'center':'Stock level option with respect to attrition'}, xlabel='StockOptionLevel', ylabel='count'>
In [89]: #Since the chart is not giving a clear view of the percentage we can create a function for it
def fnct_relation1(dataframe):
    dict1={}
    j=[0,1,2,3]
    for i in j:
        cal=((df[(df['StockOptionLevel']==i) & (df['Attrition']=='Yes')]['EmployeeNumber'].count())/(df[df['StockOptionLevel']==i]['EmployeeNumber'].count()))*100
        cal=round(cal,2)
        dict1[i]=cal
    return dict1

fnct_relation1(df)

Out[89]: {0: 24.41, 1: 9.4, 2: 7.59, 3: 17.65}

Employees with no stock options tend to resign more


From the data, employees with the highest stock option level [3] have the second-highest attrition rate

In [90]: #Swarm plot analysis of StockOptionLevel with Monthly income


plt.figure(figsize=(14,7))
plt.xlabel('StockOptionLevel',fontsize=14)
plt.ylabel('Monthly Income',fontsize=14)
plt.title('Monthly income with respect to StockOptionLevel and attrition',fontsize=14,fontweight='bold')
sns.swarmplot(x='StockOptionLevel',y='MonthlyIncome',data=df,hue='Attrition')

Out[90]: <AxesSubplot:title={'center':'Monthly income with respect to StockOptionLevel and attrition'}, xlabel='StockOptionLevel', ylabel='MonthlyIncome'>
In [91]: #Total working years of employees
plt.figure(figsize=(16,8))
plt.xlabel('TotalWorkingYears',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Total working years of employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='TotalWorkingYears',hue='Attrition',data=df,palette='rocket')

Out[91]: <AxesSubplot:title={'center':'Total working years of employees with attrition'}, xlabel='TotalWorkingYears', ylabel='count'>

In [92]: #Since the chart is not giving a clear view of the percentage we can create a function for it
def fnct_relation1(dataframe):
    dict1={}

    for i in range(0,41):
        cal=((df[(df['TotalWorkingYears']==i) & (df['Attrition']=='Yes')]['EmployeeNumber'].count())/(df[df['TotalWorkingYears']==i]['EmployeeNumber'].count()))*100
        cal=round(cal,2)
        dict1[i]=cal
    del dict1[39] #Since we don't have 39 years experienced employees
    return dict1

a=fnct_relation1(df)

In [93]: work_df=pd.DataFrame(data=a.items(),columns=['No_of_years','Attrition_percent_among_yes_and_no'])
work_df

Out[93]:     No_of_years  Attrition_percent_among_yes_and_no
         0             0                               45.45
         1             1                               49.38
         2             2                               29.03
         3             3                               21.43
         4             4                               19.05
         5             5                               18.18
         6             6                               17.60
         7             7                               22.22
         8             8                               15.53
         9             9                               10.42
         10           10                               12.38
         11           11                               19.44
         12           12                               10.42
         13           13                                8.33
         14           14                               12.90
         15           15                               12.50
         16           16                                8.11
         17           17                                9.09
         18           18                               14.81
         19           19                               13.64
         20           20                                6.67
         21           21                                2.94
         22           22                                9.52
         23           23                                9.09
         24           24                               16.67
         25           25                                7.14
         26           26                                7.14
         27           27                                0.00
         28           28                                7.14
         29           29                                0.00
         30           30                                0.00
         31           31                               11.11
         32           32                                0.00
         33           33                               14.29
         34           34                               20.00
         35           35                                0.00
         36           36                                0.00
         37           37                                0.00
         38           38                                0.00
         39           40                              100.00

Employees with 40 years of experience have a 100 percent attrition rate


Employees with 0 and 1 year of experience also have attrition rates of 45 to 50 percent
Employees in the 35 to 38 year range have no attrition

In [94]: #TrainingTimesLastYear unique values


df['TrainingTimesLastYear'].unique()

Out[94]: array([0, 3, 2, 5, 1, 4, 6], dtype=int64)

In [95]: #Training times last year of employees


plt.figure(figsize=(16,8))
plt.xlabel('TrainingTimesLastYear',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Training Times Last Year of employees',fontsize=14,fontweight='bold')
sns.countplot(x='TrainingTimesLastYear',hue='Attrition',data=df,palette='rocket')

Out[95]: <AxesSubplot:title={'center':'Training Times Last Year of employees'}, xlabel='TrainingTimesLastYear', ylabel='count'>

In [96]: #Since the chart is not giving a clear view of the percentage we can create a function for it
def fnct_relation2(dataframe):
    dict2={}

    for i in range(0,7):
        cal=((df[(df['TrainingTimesLastYear']==i) & (df['Attrition']=='Yes')]['EmployeeNumber'].count())/(df[df['TrainingTimesLastYear']==i]['EmployeeNumber'].count()))*100
        cal=round(cal,2)
        dict2[i]=cal
    return dict2

b=fnct_relation2(df)

In [97]: b

Out[97]: {0: 27.78, 1: 12.68, 2: 17.92, 3: 14.05, 4: 21.14, 5: 11.76, 6: 9.23}

Employees with no training and those trained 4 times have high attrition


Employees trained the most [6 times] have the least attrition

In [98]: # WorkLifeBalance unique values


df['WorkLifeBalance'].unique()

Out[98]: array([1, 3, 2, 4], dtype=int64)

In [99]: #Work Life Balance of employees with attrition


plt.figure(figsize=(16,8))
plt.xlabel('Attrition',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Work Life Balance of employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='Attrition',hue='WorkLifeBalance',data=df,palette='rocket')

Out[99]: <AxesSubplot:title={'center':'Work Life Balance of employees with attrition'}, xlabel='Attrition', ylabel='count'>

Work-life balance rating 3 has the highest counts for both attrition and retention, in line with its headcount
Ratings 1 and 4 show almost the same attrition

In [100]: #unique values of YearsAtCompany


df['YearsAtCompany'].unique()

Out[100]: array([ 6, 10,  0,  8,  2,  7,  1,  9,  5,  4, 25,  3, 12, 14, 22, 15, 27,
          21, 17, 11, 13, 37, 16, 20, 40, 24, 33, 19, 36, 18, 29, 31, 32, 34,
          26, 30, 23], dtype=int64)

In [101]: #Years at company of employees with attrition


plt.figure(figsize=(16,8))
plt.xlabel('YearsAtCompany',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Years at company of employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='YearsAtCompany',hue='Attrition',data=df,palette='rocket')

Out[101]: <AxesSubplot:title={'center':'Years at company of employees with attrition'}, xlabel='YearsAtCompany', ylabel='count'>

The most reliable employees for the organisation are those with 11 to 30, 34, 36, or 37 years at the company

In [102]: #Unique values of YearsInCurrentRole


df['YearsInCurrentRole'].unique()

Out[102]: array([ 4,  7,  0,  2,  5,  9,  8,  3,  6, 13,  1, 15, 14, 16, 11, 10, 12,
          18, 17], dtype=int64)

In [103]: df['JobRole'].unique()

Out[103]: array(['Sales Executive', 'Research Scientist', 'Laboratory Technician',
          'Manufacturing Director', 'Healthcare Representative', 'Manager',
          'Sales Representative', 'Research Director', 'Human Resources'],
          dtype=object)

In [104]: res_df=df[df['Attrition']=='Yes']
res_df.shape

Out[104]: (474, 35)

In [105]: plt.figure(figsize=(16,8))
plt.xlabel('YearsInCurrentRole',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Years in current role with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='YearsInCurrentRole',hue='JobRole',data=res_df)
plt.legend(loc='center right')

Out[105]: <matplotlib.legend.Legend at 0x2499e92cfd0>
Laboratory Technicians in years 0 and 1 of their role show high attrition
Research Directors and Healthcare Representatives have the least attrition
Sales Executives show attrition across almost all years in the current role
Sales Representatives show attrition only below 7 years in the current role
Research Scientists' attrition peaks at 2 years in the current role

In [106]: #YearsSinceLastPromotion unique values


df['YearsSinceLastPromotion'].unique()

Out[106]: array([ 0,  1,  3,  2,  7,  4,  8,  6,  5, 15,  9, 13, 12, 10, 11, 14],
          dtype=int64)

In [107]: #years since last promotion countplot


plt.figure(figsize=(16,8))
plt.xlabel('YearsSinceLastPromotion',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Years Since Last Promotion with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='YearsSinceLastPromotion',data=res_df)

Out[107]: <AxesSubplot:title={'center':'Years Since Last Promotion with attrition'}, xlabel='YearsSinceLastPromotion', ylabel='count'>
Employees with no recent promotion have high attrition
The more years since the last promotion, the lower the attrition

In [108]: # YearsWithCurrManager unique values


df['YearsWithCurrManager'].unique()

Out[108]: array([ 5,  7,  0,  2,  6,  8,  3, 11, 17,  1,  4, 12,  9, 10, 15, 13, 16,
          14], dtype=int64)

In [109]: plt.figure(figsize=(16,8))
plt.xlabel('YearsWithCurrManager',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Years With Current Manager with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='YearsWithCurrManager',data=res_df)

Out[109]: <AxesSubplot:title={'center':'Years With Current Manager with attrition'}, xlabel='YearsWithCurrManager', ylabel='count'>
Attrition is uneven in years with current manager feature

Data Preprocessing [Feature Scaling and Encoding]
In [110]: #Dropping the unnecessary columns
df.drop(['EmployeeCount','EmployeeNumber','Over18','StandardHours'],axis=1,inplace=True)

In [111]: df.columns

Out[111]: Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'OverTime',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')

In [112]: #Age boxplot


sns.boxplot(data=df['Age'])

Out[112]: <AxesSubplot:>

In [113]: df['BusinessTravel'].unique()

Out[113]: array(['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'], dtype=object)

In [114]: #BusinessTravel encoding by mapping


df['BusinessTravel']=df['BusinessTravel'].map({'Non-Travel':0,'Travel_Rarely':1,'Travel_Frequently':2})

In [115]: df[:3]

Out[115]:    Age Attrition  BusinessTravel  DailyRate  Department              DistanceFromHome  Education EducationField  ...
          0   41       Yes               1       1102  Sales                                  1          2  Life Sciences  ...
          1   49        No               2        279  Research & Development                 8          1  Life Sciences  ...
          2   37       Yes               1       1373  Research & Development                 2          2  Other          ...

3 rows × 31 columns

In [116]: #Daily rate outlier detection


q1=np.quantile(df['DailyRate'],0.25)
q2=np.quantile(df['DailyRate'],0.50)
q3=np.quantile(df['DailyRate'],0.75)
iqr=q3-q1
sd=df[(df['DailyRate']<(q1-(1.5*iqr))) | (df['DailyRate']>(q3+(1.5*iqr)))]
sd.head()

Out[116]: [empty — 0 rows × 31 columns: no DailyRate values fall outside the IQR fences]
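The quantile-and-fence logic is repeated for every numeric column below; it could be factored into one helper. A minimal sketch with toy numbers (not the HR data):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of s that fall outside the Tukey fences q1 - k*iqr, q3 + k*iqr."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return s[(s < low) | (s > high)]

# Toy series: 1000 sits far outside the fences
s = pd.Series([10, 12, 11, 13, 12, 11, 1000])
print(iqr_outliers(s).tolist())  # [1000]
```

Calling `iqr_outliers(df['DailyRate'])`, `iqr_outliers(df['DistanceFromHome'])`, and so on would replace the per-column q1/q3 blocks.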

In [117]: #Encoding of Department


labenc=LabelEncoder()
df['Department']=labenc.fit_transform(df['Department'])
df['Department'].unique()

Out[117]: array([2, 1, 0])

In [118]: df[:5]    #0-HR, 1-Research Department, 2-Sales

Out[118]:    Age Attrition  BusinessTravel  DailyRate  Department  DistanceFromHome  Education EducationField  ...
          0   41       Yes               1       1102           2                 1          2  Life Sciences  ...
          1   49        No               2        279           1                 8          1  Life Sciences  ...
          2   37       Yes               1       1373           1                 2          2  Other          ...
          3   33        No               2       1392           1                 3          4  Life Sciences  ...
          4   27        No               1        591           1                 2          1  Medical        ...

5 rows × 31 columns

In [119]: #DistanceFromHome outlier detection


q4=np.quantile(df['DistanceFromHome'],0.25)
q5=np.quantile(df['DistanceFromHome'],0.50)
q6=np.quantile(df['DistanceFromHome'],0.75)
iqr=q6-q4
sd_1=df[(df['DistanceFromHome']<(q4-(1.5*iqr))) | (df['DistanceFromHome']>(q6+(1.5*iqr)))]
sd_1.head()

Out[119]: [empty — 0 rows × 31 columns: no DistanceFromHome values fall outside the IQR fences]

In [120]: sns.boxplot(df['DistanceFromHome'])

Out[120]: <AxesSubplot:xlabel='DistanceFromHome'>
In [121]: #Label encoder of EducationField
df['EducationField']=labenc.fit_transform(df['EducationField'])
df[:3]

Out[121]:    Age Attrition  BusinessTravel  DailyRate  Department  DistanceFromHome  Education  EducationField  ...
          0   41       Yes               1       1102           2                 1          2               1  ...
          1   49        No               2        279           1                 8          1               1  ...
          2   37       Yes               1       1373           1                 2          2               4  ...

3 rows × 31 columns

In [122]: print(labenc.classes_)

['Human Resources' 'Life Sciences' 'Marketing' 'Medical' 'Other'
 'Technical Degree']
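LabelEncoder assigns arbitrary integer codes to nominal fields like EducationField, which distance-based models later in the notebook (KNN, SVM) can misread as an ordering; one-hot encoding is a common alternative. A hedged sketch on a toy frame (not the HR data):

```python
import pandas as pd

# Minimal sketch: one-hot encode a nominal column instead of label encoding it.
toy = pd.DataFrame({'EducationField': ['Life Sciences', 'Medical', 'Other']})
encoded = pd.get_dummies(toy, columns=['EducationField'], prefix='EduField')
print(sorted(encoded.columns))  # one indicator column per category
```

Tree-based models are largely indifferent to the choice, so the label encoding used here is fine for the decision tree and random forest sections.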

In [123]: #Gender Label encoder


df['Gender']=labenc.fit_transform(df['Gender'])
df[:3]

Out[123]: [first 3 rows of df, with Gender now label encoded — 3 rows × 31 columns]

In [124]: df['Gender'].unique()

Out[124]: array([0, 1])

In [125]: #Outlier detection using box plot for Hourly Rate


sns.boxplot(df['HourlyRate'])

Out[125]: <AxesSubplot:xlabel='HourlyRate'>
In [126]: #JobRole label encoder
df['JobRole']=labenc.fit_transform(df['JobRole'])
df[:5]

Out[126]: [first 5 rows of df, with JobRole now label encoded — 5 rows × 31 columns]

In [127]: print(labenc.classes_)

['Healthcare Representative' 'Human Resources' 'Laboratory Technician'
 'Manager' 'Manufacturing Director' 'Research Director'
 'Research Scientist' 'Sales Executive' 'Sales Representative']

In [128]: #Since marital status does not hold much importance, we can remove it from the dataset
df.drop('MaritalStatus',axis=1,inplace=True)

In [129]: #Checking the number of columns


count=0
for i in df.columns:
    count=count+1
count

Out[129]: 30

In [130]: #Box plot representation of monthly income


plt.figure(figsize=(14,7))
sns.boxplot(df['MonthlyIncome'])

Out[130]: <AxesSubplot:xlabel='MonthlyIncome'>
From the above plot we can find many outlier values

In [131]: #Skewness detection


print('skewness value of monthly income: ',df['MonthlyIncome'].skew())

skewness value of monthly income: 1.3691171405078755

In [132]: med=np.median(df['MonthlyIncome'])
med

Out[132]: 4919.0

In [133]: #Monthly income outlier detection


q7=np.quantile(df['MonthlyIncome'],0.25)
q8=np.quantile(df['MonthlyIncome'],0.50)
q9=np.quantile(df['MonthlyIncome'],0.75)
iqr=q9-q7
sd2=df[(df['MonthlyIncome']<(q7-(1.5*iqr))) | (df['MonthlyIncome']>(q9+(1.5*iqr)))]
sd2

Out[133]:       Age Attrition  BusinessTravel  DailyRate  Department  DistanceFromHome  Education  EducationField  ...
          25     53        No               1       1282           1                 5          3               4  ...
          29     46        No               1        705           2                 2          4               2  ...
          45     41       Yes               1       1360           1                12          3               5  ...
          62     50        No               1        989           1                 7          2               3  ...
          105    59        No               0       1420           0                 2          4               0  ...
          ...   ...       ...             ...        ...         ...               ...        ...             ...  ...
          2844   58        No               1        605           2                21          3               1  ...
          2847   49        No               2       1064           1                 2          1               1  ...
          2871   55        No               1        189           0                26          4               0  ...
          2907   39        No               0        105           1                 9          3               1  ...
          2913   42        No               1        300           1                 2          3               1  ...

228 rows × 30 columns

In [134]: q8

Out[134]: 4919.0

In [135]: #Replacing the outlier values of the dataframe with the MonthlyIncome median value
df.iloc[sd2.index,15]=q8
# df['MonthlyIncome'].fillna(np.median(df['MonthlyIncome']),inplace=True)

In [136]: df.iloc[sd2.index,15]

Out[136]: 25      4919
          29      4919
          45      4919
          62      4919
          105     4919
                  ...
          2844    4919
          2847    4919
          2871    4919
          2907    4919
          2913    4919
          Name: MonthlyIncome, Length: 228, dtype: int64
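`df.iloc[sd2.index, 15]` mixes label-based row indices with positional `iloc`; it works here only because the index is still the default RangeIndex. A label-based `.loc` sketch on toy data stays correct even if rows are dropped or reordered:

```python
import pandas as pd

# Minimal sketch: cap outliers by boolean mask with .loc (toy values).
toy = pd.DataFrame({'MonthlyIncome': [3000, 4000, 50000, 5000]})
median = toy['MonthlyIncome'].median()          # 4500.0
outliers = toy['MonthlyIncome'] > 20000         # mask of outlier rows
toy.loc[outliers, 'MonthlyIncome'] = median     # replace by column name, not position
print(toy['MonthlyIncome'].tolist())
```

Selecting the column by name also avoids the magic number 15, which silently breaks if columns are dropped earlier in the notebook.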

In [137]: # plt.figure(figsize=(14,7))
# sns.boxplot(df['MonthlyIncome'])

In [138]: #Monthly rate box plot


sns.boxplot(df['MonthlyRate'])

Out[138]: <AxesSubplot:xlabel='MonthlyRate'>

In [139]: #OverTime encoding by mapping


df['OverTime']=df['OverTime'].map({'Yes':1,'No':0})

In [140]: df['OverTime'].unique()

Out[140]: array([1, 0], dtype=int64)

In [141]: #PercentSalaryHike outlier detection


sns.boxplot(df['PercentSalaryHike'])
print(f'skewness is {df["PercentSalaryHike"].skew()}')

skewness is 0.8207086405356568

No outliers in PercentSalaryHike, but the distribution is skewed

In [142]: #Label encoding for Attrition


df['Attrition']=labenc.fit_transform(df['Attrition'])
print(labenc.classes_)

['No' 'Yes']

Model using Decision Tree


In [143]: df.head()

Out[143]: [first 5 rows of the fully encoded df — 5 rows × 30 columns]

In [144]: #Choosing X and y value


X=df.drop('Attrition',axis=1)
y=df['Attrition']

In [145]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

In [146]: #Training the model without parameters


dc_model=DecisionTreeClassifier()
dc_model.fit(X_train,y_train)
training_accuracy=dc_model.score(X_train,y_train)
print("Training Accuracy: ",training_accuracy)
testing_accuracy=dc_model.score(X_test,y_test)
print("Testing Accuracy : ",testing_accuracy)

Training Accuracy: 1.0


Testing Accuracy : 0.9625850340136054

In [147]: #prediction for decision tree without parameters


pred=dc_model.predict(X_test)
In [148]: #Grid Search CV for hyperparameter tuning
param_dist={
'max_depth':[2,3,4,5,6,7,8],
'min_samples_split':[10,20,30,40],
'min_samples_leaf':[1,2,3,4],
'max_features': ['auto', 'sqrt', 'log2', None],
'criterion': ['gini', 'entropy']
}
cv_df=GridSearchCV(dc_model,cv=10,param_grid=param_dist,n_jobs=3)
cv_df.fit(X_train,y_train)
print(cv_df.best_params_)

{'criterion': 'entropy', 'max_depth': 8, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 10}
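Since GridSearchCV refits the winning configuration on the full training set by default (`refit=True`), the tuned model can also be taken straight from `best_estimator_`, which avoids transcription slips when copying `best_params_` into a new classifier by hand. A minimal sketch on synthetic data (not the HR set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the HR frame
X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [2, 4, 8]}, cv=5)
grid.fit(X_tr, y_tr)

best_tree = grid.best_estimator_      # already refit on all of X_tr
print(grid.best_params_, round(best_tree.score(X_te, y_te), 3))
```

Using `best_estimator_` guarantees every tuned parameter is carried over, including ones like `criterion` that are easy to forget.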

In [149]: #Training the model with the best parameters from Grid Search CV
dc_model_1=DecisionTreeClassifier(max_depth=8,min_samples_split=10,min_samples_leaf=1)
dc_model_1.fit(X_train,y_train)
training_accuracy_1=dc_model_1.score(X_train,y_train)
print("Training Accuracy: ",training_accuracy_1)
testing_accuracy_1=dc_model_1.score(X_test,y_test)
print("Testing Accuracy : ",testing_accuracy_1)

Training Accuracy: 0.9345238095238095


Testing Accuracy : 0.9013605442176871

The above accuracy seems to be fine as there is no sign of overfitting or underfitting

In [150]: #Tree diagram representation


plt.figure(figsize=(18,9))
tree_1=tree.plot_tree(dc_model_1,filled=True)

Performance Metrics of Decision Tree


In [151]: predictions = dc_model_1.predict(X_test)
print (dc_model_1.score(X_test, y_test))

0.9013605442176871

In [152]: print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.92      0.97      0.94       507
           1       0.71      0.48      0.57        81

    accuracy                           0.90       588
   macro avg       0.82      0.72      0.76       588
weighted avg       0.89      0.90      0.89       588

The report shows that attrition "No" is predicted far more accurately than attrition "Yes"
Recall for the "Yes" class [predicted yes | actual yes] is only 0.48 (f1-score 0.57)

In [153]: def create_conf_mat(test_class_set, predictions):
    """Function returns confusion matrix comparing two arrays"""
    if (len(test_class_set.shape) != len(predictions.shape) == 1):
        return print('Arrays entered are not 1-D.\nPlease enter the correctly sized sets.')
    elif (test_class_set.shape != predictions.shape):
        return print('Number of values inside the arrays are not equal to each other.\nPlease make sure the arrays have the same size.')
    else:
        # Set metrics
        test_crosstb_comp = pd.crosstab(index = test_class_set,
                                        columns = predictions)
        # .values used since as_matrix is deprecated
        test_crosstb = test_crosstb_comp.values
        return test_crosstb
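The same matrix can be obtained from `confusion_matrix` in `sklearn.metrics`, which is already imported at the top of the notebook; a quick sketch on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Minimal sketch: rows are actual classes, columns are predicted classes.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_hat  = np.array([0, 1, 1, 1, 0, 0])
cm = confusion_matrix(y_true, y_hat)
print(cm)  # [[TN, FP], [FN, TP]]
```

Unlike the crosstab helper, `confusion_matrix` also accepts a `labels=` argument to fix the class order explicitly.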

In [154]: #Confusion matrix for decision tree without parameters


conf_mat = create_conf_mat(y_test, pred)
sns.heatmap(conf_mat, annot=True, fmt='d', cbar=False)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Actual vs. Predicted Confusion Matrix')
plt.show()

In [155]: #Confusion matrix for decision tree with hyperparameter tuning


conf_mat = create_conf_mat(y_test, predictions)
sns.heatmap(conf_mat, annot=True, fmt='d', cbar=False)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Actual vs. Predicted Confusion Matrix')
plt.show()
The heatmap above also shows that the true negatives [attrition: No] account for most of the correct predictions,
while false negatives slightly outnumber true positives for [attrition: Yes]

Area Under the curve


In [156]: #For decision tree without parameter tuning
fpr_dt_0, tpr_dt_0, _ = roc_curve(y_test, pred)
roc_auc_dt_0 = auc(fpr_dt_0, tpr_dt_0)

In [157]: plt.figure(1)
lw = 2
plt.plot(fpr_dt_0, tpr_dt_0, color='green',
lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt_0)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()

In [158]: #For decision tree with hyperparameter tuning


fpr_dt, tpr_dt, _ = roc_curve(y_test, predictions)
roc_auc_dt = auc(fpr_dt, tpr_dt)

In [159]: plt.figure(1)
lw = 2
plt.plot(fpr_dt, tpr_dt, color='green',
lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()

Without hyperparameter tuning the area under the curve is higher than with
hyperparameter tuning
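One reason both curves look coarse is that `roc_curve` is fed hard 0/1 predictions, which gives essentially a single operating point; passing class probabilities traces the full curve and yields a more meaningful AUC. A sketch on synthetic data (not the HR set):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the HR frame
X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]      # P(class == 1), not a hard label
fpr, tpr, _ = roc_curve(y_te, scores)       # one point per score threshold
print(round(auc(fpr, tpr), 3))
```

The same substitution (`predict_proba` in place of `predict`) applies to the random forest and, with `probability=True` or `decision_function`, to the SVM below.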

Model using Random Forest


In [160]: rf_model=RandomForestClassifier(random_state=42)

In [161]: #Grid search cv for hyperparameter optimization


param_dist={
'max_depth':[2,3,4,5],
'bootstrap':[True,False],
'max_features':['auto', 'sqrt', 'log2', None],
'criterion':['gini','entropy']
}
rf_cv=GridSearchCV(rf_model,cv=10,param_grid=param_dist,n_jobs=3)
rf_cv.fit(X_train,y_train)
print('Best Parameters using grid search: \n', rf_cv.best_params_)

Best Parameters using grid search:
{'bootstrap': True, 'criterion': 'gini', 'max_depth': 5, 'max_features': None}

In [162]: #Using the best parameters for the model


rf_model.set_params(criterion='gini',max_depth=5,bootstrap=True,max_features=None)

Out[162]: RandomForestClassifier(max_depth=5, max_features=None, random_state=42)
OOB Rate
In [163]: rf_model.set_params(warm_start=True,
oob_score=True)

min_estimators = 15
max_estimators = 1000

error_rate = {}

for i in range(min_estimators, max_estimators + 1):
    rf_model.set_params(n_estimators=i)
    rf_model.fit(X_train,y_train)

    oob_error = 1 - rf_model.oob_score_
    error_rate[i] = oob_error

In [164]: # Convert dictionary to a pandas series for easy plotting


oob_series = pd.Series(error_rate)

In [165]: fig, ax = plt.subplots(figsize=(10, 10))

ax.set_facecolor('#fafafa')

oob_series.plot(kind='line',color = 'red')
# plt.axhline(0.055, color='#875FDB',linestyle='--')
# plt.axhline(0.05, color='#875FDB',linestyle='--')
plt.xlabel('n_estimators')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error Rate Across various Forest sizes \n(From 15 to 1000 trees)')

Out[165]: Text(0.5, 1.0, 'OOB Error Rate Across various Forest sizes \n(From 15 to 1000 trees)')
In [166]: print('OOB Error rate for 500 trees is: {0:.5f}'.format(oob_series[500]))

OOB Error rate for 500 trees is: 0.12287
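Reading the flattening point off the plot can also be done programmatically from the `error_rate` dictionary built above; a toy sketch with made-up OOB errors:

```python
# Minimal sketch: pick the smallest forest size whose OOB error is within a
# tolerance of the best error seen (the {n_estimators: error} values are toy).
error_rate = {15: 0.160, 50: 0.135, 100: 0.126, 300: 0.123, 500: 0.123, 1000: 0.122}

best = min(error_rate.values())
tol = 0.002                                   # acceptable gap to the best error
chosen = min(n for n, e in error_rate.items() if e <= best + tol)
print(chosen)  # 300
```

This trades a negligible amount of accuracy for a much smaller and faster forest than always taking the largest size.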

In [167]: rf_model.set_params(n_estimators=500,bootstrap=True,warm_start=False,oob_score=False)

Out[167]: RandomForestClassifier(max_depth=5, max_features=None, n_estimators=500,
          random_state=42)

In [168]: #Training the model


rf_model.fit(X_train,y_train)

Out[168]: RandomForestClassifier(max_depth=5, max_features=None, n_estimators=500,
          random_state=42)

In [169]: #Predicting for test data


predictions1=rf_model.predict(X_test)

Performance Metrics for Random Forest


In [170]: print (rf_model.score(X_train, y_train))
print (rf_model.score(X_test, y_test))

0.9017857142857143
0.9098639455782312

In [171]: #Classification report


print(classification_report(y_test,predictions1))

              precision    recall  f1-score   support

           0       0.91      0.99      0.95       507
           1       0.89      0.40      0.55        81

    accuracy                           0.91       588
   macro avg       0.90      0.69      0.75       588
weighted avg       0.91      0.91      0.89       588

In [172]: #Confusion Matrix


conf_mat = create_conf_mat(y_test, predictions1)
sns.heatmap(conf_mat, annot=True, fmt='d', cbar=False)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Actual vs. Predicted Confusion Matrix')
plt.show()

Area under the curve


In [173]: fpr_dt1, tpr_dt1, _ = roc_curve(y_test, predictions1)
roc_auc_dt1 = auc(fpr_dt1, tpr_dt1)

In [174]: plt.figure(1)
lw = 2
plt.plot(fpr_dt1, tpr_dt1, color='green',
lw=lw, label='Random Forest(AUC = %0.2f)' % roc_auc_dt1)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
From the curve above, random forest is worse than the decision tree at predicting "Yes" attrition

KNN Model
In [175]: #Since the model uses Euclidean distance, we need to standardize the data
#Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
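Fitting the scaler on X_train only (as above) is correct; wrapping the scaler and model in a Pipeline keeps that guarantee inside cross-validation as well, refitting the scaler on each fold's training split. A sketch on synthetic data (not the HR set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the HR frame
X, y = make_classification(n_samples=300, random_state=2)

# The scaler is refit inside each CV fold, so the held-out fold never
# leaks into the scaling statistics.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```

The same pipeline object can be passed to GridSearchCV, which would also make the SVM tuning below leak-free.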

In [176]: #Training the model


from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

Out[176]: KNeighborsClassifier()

In [177]: #predicting the model


pred_knn=classifier.predict(X_test)

In [178]: print(classifier.score(X_train,y_train))
print(classifier.score(X_test,y_test))

0.8924319727891157
0.8860544217687075

In [179]: #Optimal value for n_neighbours


error = []

# Calculating error for K values between 1 and 40


for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

In [180]: plt.figure(figsize=(12, 6))


plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()

n_neighbors = 5 seems to be the optimal value, as accuracy holds up without any sign of overfitting
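Picking k from the plot can also be automated from the error list computed above; a toy sketch with made-up errors:

```python
import numpy as np

# Minimal sketch: pick k with the lowest held-out error, assuming `error`
# holds the mean error for k = 1, 2, ... as in the loop above (toy values).
error = [0.20, 0.15, 0.13, 0.12, 0.11, 0.12, 0.13]   # errors for k = 1..7
best_k = int(np.argmin(error)) + 1                   # +1 because k starts at 1
print(best_k)  # 5
```

When several k values tie, preferring the larger one usually gives a smoother, less overfit decision boundary.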

KNN performance metrics


In [181]: #Confusion Matrix for KNN[5 neighbours]
conf_mat = create_conf_mat(y_test, pred_knn)
sns.heatmap(conf_mat, annot=True, fmt='d', cbar=False)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Actual vs. Predicted Confusion Matrix')
plt.show()

In [182]: #Classification report


print(classification_report(y_test,pred_knn))

              precision    recall  f1-score   support

           0       0.89      0.98      0.94       507
           1       0.73      0.27      0.40        81

    accuracy                           0.89       588
   macro avg       0.81      0.63      0.67       588
weighted avg       0.87      0.89      0.86       588

From the metrics above it is clear that attrition "No" is predicted more accurately.


The true positive rate for attrition "Yes" is much lower than in the decision tree model

Model using SVM


In [183]: #SVM using default parameters
from sklearn.svm import SVC
from sklearn import metrics
svc=SVC() #Default hyperparameters
svc.fit(X_train,y_train)
y_pred_svm=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred_svm))

Accuracy Score:
0.9217687074829932

In [184]: #Linear Kernel


svc=SVC(kernel='linear')
svc.fit(X_train,y_train)
y_pred_lin=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred_lin))

Accuracy Score:
0.8928571428571429

In [185… #Polynomial Kernel


svc=SVC(kernel='poly')
svc.fit(X_train,y_train)
y_pred_poly=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred_poly))

Accuracy Score:
0.9319727891156463

In [186… #Optimisation using GridSearchCV


tuned_parameters = {
'C': (np.arange(0.1,1,0.1)) , 'kernel': ['linear','rbf','poly'],
'degree': [2,3,4] ,'gamma':[0.01,0.02,0.03,0.04,0.05]
}
model_svm_1 = GridSearchCV(svc, tuned_parameters,cv=10,scoring='accuracy')

In [187… model_svm_1.fit(X_train, y_train)


print(model_svm_1.best_score_)
# pred_svm1=model_svm_1.predict(X_test)

0.9409015506671474

In [188… model_svm_1.best_params_
Out[188]: {'C': 0.9, 'degree': 4, 'gamma': 0.05, 'kernel': 'poly'}
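Side note, shown on synthetic data: with the default `refit=True`, GridSearchCV refits the best configuration on the full training set, so copying `best_params_` into a fresh SVC by hand (as the next cell does) is not strictly necessary; `best_estimator_` can be used for prediction directly. The tiny grid below is only to keep the sketch fast:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data and a deliberately small grid (assumptions)
X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

grid = GridSearchCV(SVC(), {'C': [0.5, 1.0], 'kernel': ['rbf', 'poly']}, cv=3)
grid.fit(X_tr, y_tr)

# grid.best_estimator_ is already refit on all of X_tr;
# grid.predict delegates to it, so these are equivalent
pred = grid.best_estimator_.predict(X_te)
print(grid.best_params_)
```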

In [189… #Using the best hyperparameters for the SVM model


svm_final_model=SVC(C= 0.9, degree= 4, gamma= 0.05, kernel= 'poly')
svm_final_model.fit(X_train,y_train)
y_pred_svm=svm_final_model.predict(X_test)

Performance Metrics for SVM


In [190… print(metrics.accuracy_score(y_test,y_pred_svm))

0.9693877551020408

In [191… #Confusion Matrix for SVM


conf_mat = create_conf_mat(y_test, y_pred_svm)
sns.heatmap(conf_mat, annot=True, fmt='d', cbar=False)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Actual vs. Predicted Confusion Matrix')
plt.show()

In [192… print(classification_report(y_test,y_pred_svm))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98       507
           1       0.94      0.83      0.88        81

    accuracy                           0.97       588
   macro avg       0.96      0.91      0.93       588
weighted avg       0.97      0.97      0.97       588

In [193… #Area Under the curve


fpr_dt2, tpr_dt2, _ = roc_curve(y_test, y_pred_svm)
roc_auc_dt2 = auc(fpr_dt2, tpr_dt2)

In [194… plt.figure(1)
lw = 2
plt.plot(fpr_dt2, tpr_dt2, color='green',
lw=lw, label='SVM(AUC = %0.2f)' % roc_auc_dt2)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
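One caveat on the curve above: it is computed from hard 0/1 predictions, which yields a ROC with only a handful of points. The usual approach is to pass continuous scores from `SVC.decision_function` to `roc_curve`, giving a curve over many thresholds. A sketch on synthetic data (the hyperparameters mirror the tuned model but the data is a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the attrition dataset)
X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

svc = SVC(C=0.9, kernel='poly', degree=4, gamma=0.05).fit(X_tr, y_tr)
scores = svc.decision_function(X_te)   # continuous margin distances, not labels
fpr, tpr, _ = roc_curve(y_te, scores)  # thresholds swept over the scores
print(round(auc(fpr, tpr), 3))
```

The AUC from scores is a more faithful ranking measure than the AUC of a single hard-prediction operating point.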

The SVM model has better accuracy and the best true positive and true negative counts compared to the decision tree without hyperparameter tuning.

Final Inference
The following points relate to high attrition:

1. Employees aged 18 to 35 have a high attrition rate.
2. Employees who do not travel for business have lower attrition.
3. HR and Sales have relatively high attrition considering their headcount, while attrition in the R&D department is moderate.
4. Attrition is fairly even across all commute distances except 24 km, where it is high.
5. Education rating 5 has the least attrition, although most employees fall under rating 3.
6. Attrition is high among employees with Technical Degree, Marketing, and HR education fields, considering each field's headcount.
7. Employees with environment satisfaction ratings of 1 and 2 have high attrition.
8. Attrition is moderate across both genders, though the gender counts are imbalanced.
9. On average, employees with an hourly rate between 60 and 80 tend to resign.
10. Job involvement rating 1 has the highest attrition, followed by ratings 2 and 3.
11. The Sales Representative role has the highest attrition relative to its employee count, followed by Human Resources and Laboratory Technician.
12. Job satisfaction ratings 1 and 2 show a substantial amount of attrition.
13. Although marital status does not seem to have much impact overall, single employees have higher attrition than the other statuses.
14. In the Sales department, attrition is high in the 1000 to 3500 income range; the stable 15000 to 18500 range has no attrition.
15. In Research & Development, attrition is moderate in the 1000 to 12500 income range; the stable range is 14000 to 18000, where there is no attrition.
16. In Human Resources, the pay gap is large judging by the plot; attrition is high in the 1000 to 2500/3000 range, with no attrition at the higher incomes in the department.
17. Employees with an average monthly income below 6000 have high attrition.
18. Employees working overtime on less than 10000 monthly income tend to resign more, even though relatively few employees work overtime.
19. Attrition is high for employees who have worked at fewer than one company.
20. Relationship satisfaction rating 1 has the highest attrition, at nearly 20 percent.
21. Interestingly, rating 3 has more attrition than rating 2.
22. Performance rating 3 has high attrition.
23. Employees with no stock options tend to resign more.
24. From the given data, employees with the most stock options (level 3) also have the second-highest attrition rate.
25. Employees with 40 years of experience have a 100 percent attrition rate.
26. Employees with 0 and 1 year of experience also have attrition rates of 45 to 50 percent.
27. Employees with 35 to 38 years of experience have no attrition.
28. Employees with no training and those with training rating 4 have high attrition; the highest training level (6) has the least attrition.
29. Work-life balance rating 3 has the highest attrition in absolute numbers but the least relative to its employee count; ratings 1 and 4 have almost the same attrition.
30. Laboratory Technicians at 0 and 1 years in the role have high attrition.
31. Research Director and Healthcare Representative have the least attrition.
32. Sales Executives show attrition across almost all years in the current role.
33. Sales Representative attrition occurs at less than 7 years in the current role.
34. Research Scientist attrition peaks at 2 years in the current role.
35. Employees with no promotion have high attrition; the more years since promotion, the lower the attrition.
36. Attrition is high within the first 4 years with the current manager.

Model Inference
SVM has the best accuracy in predicting both classes of Attrition, followed by the decision tree without hyperparameter tuning.
The decision tree with hyperparameter tuning, random forest, and KNN have good accuracy in identifying Attrition: No, but low recall and accuracy for Attrition: Yes,
which can be inferred from the confusion matrices and the area under the curve for these algorithms.
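The comparison described above can be tabulated directly. A sketch that fits each model family on the same split and reports overall accuracy alongside positive-class recall, the metric where KNN and the tuned tree fell short; the data here is synthetic and imbalanced, standing in for the attrition features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic stand-in (~85% negative class, an assumption)
X, y = make_classification(n_samples=800, weights=[0.85], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

models = {'KNN': KNeighborsClassifier(),
          'DecisionTree': DecisionTreeClassifier(random_state=7),
          'SVM': SVC()}
rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({'model': name,
                 'accuracy': accuracy_score(y_te, pred),
                 'recall_yes': recall_score(y_te, pred)})
print(pd.DataFrame(rows))
```

Reporting accuracy and positive-class recall side by side makes the trade-off in the confusion matrices explicit in a single table.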
