Capstone Project - Employee Attrition Rate
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,roc_curve,auc,accuracy_score,classification_report
from sklearn import tree
from six import StringIO
import warnings
warnings.filterwarnings('ignore')
Out[2]:
   Age Attrition     BusinessTravel  DailyRate              Department  DistanceFromHome  Education EducationField ...
1   49        No  Travel_Frequently        279  Research & Development                 8          1  Life Sciences
2   37       Yes      Travel_Rarely       1373  Research & Development                 2          2          Other
3   33        No  Travel_Frequently       1392  Research & Development                 3          4  Life Sciences
4   27        No      Travel_Rarely        591  Research & Development                 2          1        Medical
5 rows × 35 columns
In [3]: df.shape
Out[3]: (2940, 35)
In [4]: df.size
Out[4]: 102900
In [5]: df.columns
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 2940 non-null int64
1 Attrition 2940 non-null object
2 BusinessTravel 2940 non-null object
3 DailyRate 2940 non-null int64
4 Department 2940 non-null object
5 DistanceFromHome 2940 non-null int64
6 Education 2940 non-null int64
7 EducationField 2940 non-null object
8 EmployeeCount 2940 non-null int64
9 EmployeeNumber 2940 non-null int64
10 EnvironmentSatisfaction 2940 non-null int64
11 Gender 2940 non-null object
12 HourlyRate 2940 non-null int64
13 JobInvolvement 2940 non-null int64
14 JobLevel 2940 non-null int64
15 JobRole 2940 non-null object
16 JobSatisfaction 2940 non-null int64
17 MaritalStatus 2940 non-null object
18 MonthlyIncome 2940 non-null int64
19 MonthlyRate 2940 non-null int64
20 NumCompaniesWorked 2940 non-null int64
21 Over18 2940 non-null object
22 OverTime 2940 non-null object
23 PercentSalaryHike 2940 non-null int64
24 PerformanceRating 2940 non-null int64
25 RelationshipSatisfaction 2940 non-null int64
26 StandardHours 2940 non-null int64
27 StockOptionLevel 2940 non-null int64
28 TotalWorkingYears 2940 non-null int64
29 TrainingTimesLastYear 2940 non-null int64
30 WorkLifeBalance 2940 non-null int64
31 YearsAtCompany 2940 non-null int64
32 YearsInCurrentRole 2940 non-null int64
33 YearsSinceLastPromotion 2940 non-null int64
34 YearsWithCurrManager 2940 non-null int64
dtypes: int64(26), object(9)
memory usage: 804.0+ KB
In [7]: df.describe()
Out[7]: 8 rows × 26 columns
In [8]: df['MonthlyIncome'].describe()
Out[8]:
count 2940.000000
mean 6502.931293
std 4707.155770
min 1009.000000
25% 2911.000000
50% 4919.000000
75% 8380.000000
max 19999.000000
Name: MonthlyIncome, dtype: float64
In [9]: df.isnull().sum()
Out[9]:
Age 0
Attrition 0
BusinessTravel 0
DailyRate 0
Department 0
DistanceFromHome 0
Education 0
EducationField 0
EmployeeCount 0
EmployeeNumber 0
EnvironmentSatisfaction 0
Gender 0
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 0
JobSatisfaction 0
MaritalStatus 0
MonthlyIncome 0
MonthlyRate 0
NumCompaniesWorked 0
Over18 0
OverTime 0
PercentSalaryHike 0
PerformanceRating 0
RelationshipSatisfaction 0
StandardHours 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 0
dtype: int64
Out[10]: 0
EDA
In [11]: # sns.pairplot(df)
In [12]: df.corr()
Out[12]: correlation matrix of the numeric columns (26 rows × 26 columns): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, EmployeeNumber, ...
OrderedDict([(18, 16), (19, 18), (20, 22), (21, 26), (22, 32), (23, 28), (24, 52), (25, 52), (26, 78), (27, 96), (28, 96), (29, 136), (30, 120), (31, 138), (32, 122), (33, 116), (34, 154), (35, 156), (36, 138), (37, 100), (38, 116), (39, 84), (40, 114), (41, 80), (42, 92), (43, 64), (44, 66), (45, 82), (46, 66), (47, 48), (48, 38), (49, 48), (50, 60), (51, 38), (52, 36), (53, 38), (54, 36), (55, 44), (56, 28), (57, 8), (58, 28), (59, 20), (60, 10)])
In [14]: plt.figure(figsize=(14,7))
sns.distplot(df['Age'])
Out[14]: <AxesSubplot:xlabel='Age', ylabel='Density'>
Employees in the age range 18 to 35 have a high attrition rate, while ages 54, 57, 59 and 60 show no attrition.
In [18]: business_series=df['BusinessTravel'].value_counts()
business_series
Out[18]:
Travel_Rarely 2086
Travel_Frequently 554
Non-Travel 300
Name: BusinessTravel, dtype: int64
[Pie chart: BusinessTravel — Travel_Rarely 71%, Travel_Frequently 18.8%, Non-Travel 10.2%]
In [20]: df_business=df[['Age','BusinessTravel']].value_counts().reset_index().rename(columns={0:'Count'})
df_business[:10]
     Age     BusinessTravel  Count
105   18  Travel_Frequently      4
82    18         Non-Travel      8
92    18      Travel_Rarely      4
64    19      Travel_Rarely     14
115   19  Travel_Frequently      2
113   19         Non-Travel      2
52    20      Travel_Rarely     18
95    20  Travel_Frequently      4
99    21  Travel_Frequently      4
46    21      Travel_Rarely     20
In [21]: plt.figure(figsize=(16,8))
plt.xlabel('Age',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Age vs Count of employees with Business Travel',fontsize=14,fontweight='bold')
sns.lineplot(data=df_business,x='Age',y='Count',hue='BusinessTravel')
Employees who travel rarely appear to have high attrition, though their headcount is also high; employees who do not travel have a lower attrition rate.
In [23]: print(max(df['DailyRate']))
1499
In [24]: plt.figure(figsize=(12,6))
sns.boxplot(df['DailyRate'])
Out[24]: <AxesSubplot:xlabel='DailyRate'>
Out[25]: <AxesSubplot:xlabel='Age', ylabel='DailyRate'>
In [26]: plt.figure(figsize=(16,9))
plt.xlabel('Age',fontsize=14)
plt.ylabel('DailyRate',fontsize=14)
plt.title('Age vs DailyRate with attrition',fontsize=14,fontweight='bold')
sns.lineplot(data=df,x='Age',y='DailyRate',hue='Attrition',palette='rocket',ci=None) #Using ci=None to suppress the confidence interval band
Department distribution
[Pie chart: Department — Research & Development 65.4%, Sales 30.3%, Human Resources 4.29%]
Research & Development has the highest number of employees, followed by Sales.
The Human Resources department has a high attrition rate relative to its headcount.
Attrition is high in Sales and moderate in R&D.
Out[29]:
Department
Human Resources 751.539683
Research & Development 806.851197
Sales 800.275785
Name: DailyRate, dtype: float64
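The per-department means in Out[29] above are presumably produced by a groupby aggregation on DailyRate; a minimal sketch on toy data (values here are illustrative, not the notebook's):

```python
import pandas as pd

# Tiny stand-in frame with the two columns the aggregation needs.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Human Resources", "Research & Development"],
    "DailyRate": [800, 900, 750, 807],
})
# Mean DailyRate per department, the same shape of result as Out[29].
mean_rate = toy.groupby("Department")["DailyRate"].mean()
print(mean_rate["Sales"])  # 850.0
```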
Out[30]: array([ 1, 8, 2, 3, 24, 23, 27, 16, 15, 26, 19, 21, 5, 11, 9, 7, 6, 10, 4, 25, 12, 18, 29, 22, 14, 20, 28, 17, 13], dtype=int64)
In [31]: plt.figure(figsize=(16,9))
plt.xlabel('Distance From Home',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Distance vs Count of Employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='DistanceFromHome',data=df,hue='Attrition',palette='rainbow')
Education Field Distribution
[Pie chart: EducationField — Life Sciences 41.2%, Medical 31.6%, Marketing 10.8%, Technical Degree 8.98%, Other 5.58%, Human Resources 1.84%]
The Life Sciences field has the largest number of employees, followed by Medical; Human Resources has the smallest count.
In [35]: plt.figure(figsize=(16,9))
plt.xlabel('Education Field',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Education Field vs Count of Employees with attrition',fontsize=14,fontweight='bold')
sns.countplot(y='EducationField',data=df,hue='Attrition',palette='rocket')
Though Human Resources has the smallest headcount, its attrition appears to exceed 40 percent.
The Other field has low attrition.
Technical Degree and Marketing have high attrition.
Relative to their headcounts, attrition in Medical and Life Sciences is moderate.
In [36]: plt.figure(figsize=(16,9))
plt.xlabel('Education Field',fontsize=14)
plt.ylabel('DailyRate',fontsize=14)
plt.title('Education Field vs DailyRate',fontsize=14,fontweight='bold')
sns.lineplot(x='EducationField',y='DailyRate',data=df,ci=None)
Out[37]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField ... (3 rows × 35 columns)
In [38]: plt.figure(figsize=(16,9))
plt.xlabel('Education Field',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Education Field with respect to Department',fontsize=14,fontweight='bold')
# plt.legend(loc='upper right')
sns.countplot(x='EducationField',data=df,hue='Department',palette='rocket')
0 Human Resources 1 22
1 Human Resources 2 24
2 Human Resources 3 52
3 Human Resources 4 28
8 Sales 1 172
9 Sales 2 196
10 Sales 3 270
11 Sales 4 254
In [41]: # df['EnvironmentSatisfaction'].astype(object)
In [42]: plt.figure(figsize=(16,9))
plt.xlabel('Environment Satisfaction',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Environment Satisfaction with attrition',fontsize=14,fontweight='bold')
# plt.legend(loc='upper right')
sns.countplot(x='EnvironmentSatisfaction',data=df,hue='Attrition',palette='rocket')
Male attrition appears to be higher than female attrition, though the headcount of each gender must also be taken into account.
Out[45]: <AxesSubplot:xlabel='Age', ylabel='HourlyRate'>
On average, at age 57 or 58 the hourly rate pay is good, as inferred from the line plot above.
In [46]: plt.figure(figsize=(16,9))
plt.xlabel('Age',fontsize=14)
plt.ylabel('HourlyRate',fontsize=14)
plt.title('Age vs HourlyRate with attrition',fontsize=14,fontweight='bold')
sns.lineplot(data=df,x='Age',y='HourlyRate',hue='Attrition',palette='rocket',ci=None) #Using ci=None to suppress the confidence interval band
Rating 4 has the least attrition, since these employees are seriously involved in their jobs.
Rating 3 attrition appears moderate.
Rating 1 has almost a 50 percent attrition rate.
In [51]: job_df=df.groupby(['JobLevel','JobRole']).size().reset_index()
job_df.drop(0,axis=1,inplace=True)
job_df.set_index('JobLevel')
Out[51]: JobRole
JobLevel
1 Human Resources
1 Laboratory Technician
1 Research Scientist
1 Sales Representative
2 Healthcare Representative
2 Human Resources
2 Laboratory Technician
2 Manufacturing Director
2 Research Scientist
2 Sales Executive
2 Sales Representative
3 Healthcare Representative
3 Human Resources
3 Laboratory Technician
3 Manager
3 Manufacturing Director
3 Research Director
3 Research Scientist
3 Sales Executive
4 Healthcare Representative
4 Manager
4 Manufacturing Director
4 Research Director
4 Sales Executive
5 Manager
5 Research Director
From the above dataframe, a single job role can span multiple job levels.
Job level 5 has two roles: Manager and Research Director.
Out[53]:
JobSatisfaction
1 578
2 560
3 884
4 918
dtype: int64
In [54]: plt.figure(figsize=(18,9))
plt.xlabel('JobSatisfaction',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('Job satisfaction with respect to department',fontsize=14,fontweight='bold')
sns.countplot(x='Department',data=df,hue='JobSatisfaction',palette='rocket')
0 Divorced 654
1 Married 1346
2 Single 940
In [57]: plt.figure(figsize=(16,8))
plt.xlabel('MaritalStatus',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('MaritalStatus with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='MaritalStatus',data=df,hue='Attrition')
Out[58]: 19999
In [59]: plt.figure(figsize=(14,7))
sns.distplot(df['MonthlyIncome'])
Out[59]: <AxesSubplot:xlabel='MonthlyIncome', ylabel='Density'>
In [60]: #Monthly income with respect to Department and attrition
plt.figure(figsize=(14,7))
plt.xlabel('Department',fontsize=14)
plt.ylabel('Monthly Income',fontsize=14)
plt.title('Monthly income with respect to Department and attrition',fontsize=14,fontweig
sns.swarmplot(x='Department',y='MonthlyIncome',data=df,hue='Attrition')
In the Sales department, attrition is high in the 1000–3500 income range; the stable 15000–18500 range has no attrition.
In Research & Development, attrition is moderate in the 1000–12500 range; the stable range is 14000–18000, where there is no attrition.
In Human Resources, the pay gap is huge judging from the plot; attrition is high in the 1000–2500/3000 range, and there is no attrition in the high-paying incomes of the HR department.
The plot above gives a clear perspective: average monthly income below 6000 is associated with high attrition, while average income above 6500 shows very little attrition.
In [63]: num_com=df.groupby('NumCompaniesWorked').size().reset_index().rename(columns={0:'Count'})
num_com
   NumCompaniesWorked  Count
0                   0    394
1                   1   1042
2                   2    292
3                   3    318
4                   4    278
5                   5    126
6                   6    140
7                   7    148
8                   8     98
9                   9    104
In [64]: # pie chart representation of number of companies worked
px.pie(num_com,names='NumCompaniesWorked',values='Count',title='Number of companies worked')
[Pie chart: NumCompaniesWorked — 1: 35.4%, 0: 13.4%, 3: 10.8%, 2: 9.93%, 4: 9.46%, 7: 5.03%, 6: 4.76%, 5: 4.29%, 9: 3.54%, 8: 3.33%]
In [65]: plt.figure(figsize=(16,8))
plt.xlabel('Num of Companies Worked',fontsize=14)
plt.ylabel('Count',fontsize=14)
plt.title('NumCompaniesWorked with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='NumCompaniesWorked',data=df,hue='Attrition')
In [66]: dep_comp=df.groupby(['NumCompaniesWorked','Department'])['EmployeeNumber'].count().reset_index()
dep_comp[:5]
0 0 Human Resources 24
2 0 Sales 130
3 1 Human Resources 38
[Pie chart: share of NumCompaniesWorked within Human Resources — category 1 is the largest slice at 30.2%]
[Pie chart: share of NumCompaniesWorked within the next department (presumably Research & Development) — category 1 is the largest slice at 35.4%]
[Pie chart: share of NumCompaniesWorked within Sales — category 1 is the largest slice at 36.3%]
0 0 1 76
1 0 2 70
2 0 3 118
3 0 4 130
4 1 1 182
[Pie charts: further per-group distributions of NumCompaniesWorked; in each group the largest slice is category 1 (31.5%, 36.1%, 36% and 37% respectively)]
Out[70]:
Y 2940
Name: Over18, dtype: int64
Out[71]:
No 2108
Yes 832
Name: OverTime, dtype: int64
Employees working overtime with monthly income below 10000 tend to resign more, even though fewer employees work overtime.
Attrition among employees who do not work overtime is low.
15
11 420
13 418
14 402
12 396
15 202
18 178
17 164
16 156
19 152
22 112
20 110
21 96
23 56
24 42
25 36
Name: PercentSalaryHike, dtype: int64
In [75]: plt.figure(figsize=(16,8))
plt.xlabel('Num of Companies Worked',fontsize=14)
plt.ylabel('Salary hike',fontsize=14)
plt.title('Number of companies worked with respect to percent hike and attrition',fontsize=14,fontweight='bold')
sns.barplot(x='NumCompaniesWorked',y='PercentSalaryHike',data=df,hue='Attrition')
Even though employees who have worked at 0, 4, 7 or 9 companies receive an average hike of more than 15 percent, they tend to resign more.
In [77]: per_rating=df.groupby('PerformanceRating').size().reset_index().rename(columns={0:'Count'})
per_rating
   PerformanceRating  Count
0                  3   2488
1                  4    452
[Pie chart: PerformanceRating — 3: 84.6%, 4: 15.4%]
In [79]: plt.figure(figsize=(16,8))
plt.xlabel('Job Involvement',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Job Involvement with respect to performance rating',fontsize=14,fontweight='bold')
sns.countplot(x='JobInvolvement',hue='PerformanceRating',data=df,palette='Set2')
Job involvement rating 3 has the highest number of employees as well as the highest count of performance rating 3.
Employees with very low job involvement (1 and 2) are also awarded the highest rating (4).
In [80]: plt.figure(figsize=(16,8))
plt.xlabel('Performance Rating',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Plot with respect to performance rating and attrition',fontsize=14,fontweight='bold')
sns.countplot(x='PerformanceRating',hue='Attrition',data=df,palette='rocket')
Out[80]: <AxesSubplot:title={'center':'Plot with respect to performance rating and attrition'}, xlabel='PerformanceRating', ylabel='count'>
In [81]: per_atrr=df.groupby(['Attrition','PerformanceRating'])['EmployeeNumber'].size().reset_index()
per_atrr
  Attrition  PerformanceRating  EmployeeNumber
0        No                  3            2088
1        No                  4             378
2       Yes                  3             400
3       Yes                  4              74
# fnct(per_atrr)
In [84]: plt.figure(figsize=(16,8))
plt.xlabel('Relationship Satisfaction',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Relationship Satisfaction with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='RelationshipSatisfaction',data=df,hue='Attrition')
fnct_relation(df)
Out[86]: array([80], dtype=int64)
In [88]: plt.figure(figsize=(16,8))
plt.xlabel('Stock level option',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Stock level option with respect to attrition',fontsize=14,fontweight='bold')
sns.countplot(x='StockOptionLevel',data=df,hue='Attrition',palette='rainbow')
fnct_relation1(df)
In [92]: #Since the chart is not giving a clear view of the percentage we can create a function for it
def fnct_relation1(dataframe):
    dict1={}
    for i in range(0,41):
        cal=((df[(df['TotalWorkingYears']==i) & (df['Attrition']=='Yes')]['EmployeeNumber'].count())/(df[df['TotalWorkingYears']==i]['EmployeeNumber'].count()))*100
        cal=round(cal,2)
        dict1[i]=cal
    del dict1[39] #Since we don't have employees with 39 years of experience
    return dict1
a=fnct_relation1(df)
In [93]: work_df=pd.DataFrame(data=a.items(),columns=['No_of_years','Attrition_percent_among_yes_and_no'])
work_df
0 0 45.45
1 1 49.38
2 2 29.03
3 3 21.43
4 4 19.05
5 5 18.18
6 6 17.60
7 7 22.22
8 8 15.53
9 9 10.42
10 10 12.38
11 11 19.44
12 12 10.42
13 13 8.33
14 14 12.90
15 15 12.50
16 16 8.11
17 17 9.09
18 18 14.81
19 19 13.64
20 20 6.67
21 21 2.94
22 22 9.52
23 23 9.09
24 24 16.67
25 25 7.14
26 26 7.14
27 27 0.00
28 28 7.14
29 29 0.00
30 30 0.00
31 31 11.11
32 32 0.00
33 33 14.29
34 34 20.00
35 35 0.00
36 36 0.00
37 37 0.00
38 38 0.00
39 40 100.00
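The per-years attrition percentage that fnct_relation1 builds with a loop can also be computed in one vectorized expression; a sketch on toy data (column names follow the notebook, the values here are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "TotalWorkingYears": [0, 0, 1, 1, 1, 2],
    "Attrition": ["Yes", "No", "Yes", "Yes", "No", "No"],
})
# The mean of the boolean (Attrition == 'Yes') within each group is the
# attrition rate for that number of working years.
pct = (toy["Attrition"].eq("Yes")
          .groupby(toy["TotalWorkingYears"]).mean().mul(100).round(2))
print(pct.to_dict())  # {0: 50.0, 1: 66.67, 2: 0.0}
```

This avoids the per-value filtering of the original loop and returns a Series indexed by years, which plots directly.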
In [96]: #Since the chart is not giving a clear view of the percentage we can create a function for it
def fnct_relation2(dataframe):
    dict2={}
    for i in range(0,7):
        cal=((df[(df['TrainingTimesLastYear']==i) & (df['Attrition']=='Yes')]['EmployeeNumber'].count())/(df[df['TrainingTimesLastYear']==i]['EmployeeNumber'].count()))*100
        cal=round(cal,2)
        dict2[i]=cal
    return dict2
b=fnct_relation2(df)
In [97]: b
Work-life balance rating 3 has the highest attrition count but, relative to its headcount, the lowest attrition rate.
Ratings 1 and 4 have almost the same attrition.
The most reliable employees for the organisation are those with 11 to 30, 34, 36 or 37 years of experience at the company.
In [103]: df['JobRole'].unique()
In [104]: res_df=df[df['Attrition']=='Yes']
res_df.shape
Out[104]: (474, 35)
In [105]: plt.figure(figsize=(16,8))
plt.xlabel('YearsInCurrentRole',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Years in current role with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='YearsInCurrentRole',hue='JobRole',data=res_df)
plt.legend(loc='center right')
Out[105]: <matplotlib.legend.Legend at 0x2499e92cfd0>
Laboratory Technicians at 0 and 1 years in the role have high attrition.
Research Director and Healthcare Representative have the least attrition.
Sales Executive shows attrition across almost all years in the current role.
Sales Representative attrition occurs at less than 7 years in the current role.
Research Scientist attrition peaks at 2 years in the current role.
In [109]: plt.figure(figsize=(16,8))
plt.xlabel('YearsWithCurrManager',fontsize=14)
plt.ylabel('Count of employees',fontsize=14)
plt.title('Years With Current Manager with attrition',fontsize=14,fontweight='bold')
sns.countplot(x='YearsWithCurrManager',data=res_df)
In [111]: df.columns
Out[112]: <AxesSubplot:>
In [113]: df['BusinessTravel'].unique()
Out[113]: array(['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'], dtype=object)
In [115]: df[:3]
Out[115]:
   Age Attrition  BusinessTravel  DailyRate              Department  DistanceFromHome  Education EducationField ...
1   49        No               2        279  Research & Development                 8          1  Life Sciences
2   37       Yes               1       1373  Research & Development                 2          2          Other
3 rows × 31 columns
Out[116]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField ... (0 rows × 31 columns)
Out[117]: array([2, 1, 0])
Out[118]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField ...
4   27        No               1        591           1                 2          1        Medical
(5 rows × 31 columns)
Out[119]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField ... (0 rows × 31 columns)
In [120]: sns.boxplot(df['DistanceFromHome'])
Out[120]: <AxesSubplot:xlabel='DistanceFromHome'>
In [121]: #Label encoding of EducationField (labenc is a LabelEncoder instance created in an earlier cell)
df['EducationField']=labenc.fit_transform(df['EducationField'])
df[:3]
Out[121]:
   Age Attrition  BusinessTravel  DailyRate  Department  DistanceFromHome  Education  EducationField ...
0   41       Yes               1       1102           2                 1          2               1
1   49        No               2        279           1                 8          1               1
2   37       Yes               1       1373           1                 2          2               4
3 rows × 31 columns
In [122]: print(labenc.classes_)
Out[123]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField Environ
0 41 Yes 1 1102 2 1 2 1
1 49 No 2 279 1 8 1 1
2 37 Yes 1 1373 1 2 2 4
3 rows × 31 columns
In [124]: df['Gender'].unique()
Out[124]: array([0, 1])
Out[125]: <AxesSubplot:xlabel='HourlyRate'>
In [126]: #JobRole label encoder
df['JobRole']=labenc.fit_transform(df['JobRole'])
df[:5]
Out[126]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField Environ
0 41 Yes 1 1102 2 1 2 1
1 49 No 2 279 1 8 1 1
2 37 Yes 1 1373 1 2 2 4
3 33 No 2 1392 1 3 4 1
4 27 No 1 591 1 2 1 3
5 rows × 31 columns
In [127]: print(labenc.classes_)
In [128]: #Since marital status does not hold much importance we can remove it from the dataset
df.drop('MaritalStatus',axis=1,inplace=True)
Out[129]: 30
Out[130]: <AxesSubplot:xlabel='MonthlyIncome'>
From the above plot we can find many outlier values
In [132]: med=np.median(df['MonthlyIncome'])
med
Out[132]: 4919.0
Out[133]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField Envi
25 53 No 1 1282 1 5 3 4
29 46 No 1 705 2 2 4 2
45 41 Yes 1 1360 1 12 3 5
62 50 No 1 989 1 7 2 3
105 59 No 0 1420 0 2 4 0
2844 58 No 1 605 2 21 3 1
2847 49 No 2 1064 1 2 1 1
2871 55 No 1 189 0 26 4 0
2907 39 No 0 105 1 9 3 1
2913 42 No 1 300 1 2 3 1
228 rows × 30 columns
In [134]: q8
Out[134]: 4919.0
In [135]: #Replacing the outlier values of the dataframe with the MonthlyIncome median value
df.iloc[sd2.index,15]=q8
# df['MonthlyIncome'].fillna(np.median(df['MonthlyIncome']),inplace=True)
In [136]: df.iloc[sd2.index,15]
Out[136]:
25 4919
29 4919
45 4919
62 4919
105 4919
...
2844 4919
2847 4919
2871 4919
2907 4919
2913 4919
Name: MonthlyIncome, Length: 228, dtype: int64
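The objects `sd2` (the 228 outlier rows in Out[133]) and `q8` are defined in cells not captured above; presumably they come from the usual IQR rule on MonthlyIncome plus its median. A sketch of that logic on toy data, with the notebook's names `sd2`/`q8` reused for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"MonthlyIncome": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 50000.0]})

# Classic 1.5*IQR fence for detecting high outliers.
q1, q3 = toy["MonthlyIncome"].quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr

sd2 = toy[toy["MonthlyIncome"] > upper]   # outlier rows (here: the 50000 value)
q8 = np.median(toy["MonthlyIncome"])      # replacement value: the median

# Replace the outliers with the median, as the notebook does with df.iloc[sd2.index,15]=q8.
toy.loc[sd2.index, "MonthlyIncome"] = q8
print(toy["MonthlyIncome"].max())  # 5000.0
```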
In [137]: # plt.figure(figsize=(14,7))
# sns.boxplot(df['MonthlyIncome'])
Out[138]: <AxesSubplot:xlabel='MonthlyRate'>
In [140]: df['OverTime'].unique()
['No' 'Yes']
skewness is 0.8207086405356568
No outliers in the above data, but the data is skewed.
Out[143]: Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField Environ
0 41 1 1 1102 2 1 2 1
1 49 0 2 279 1 8 1 1
2 37 1 1 1373 1 2 2 4
3 33 0 2 1392 1 3 4 1
4 27 0 1 591 1 2 1 3
5 rows × 30 columns
In [149]: #Training the model with the best parameters from Grid Search CV
dc_model_1=DecisionTreeClassifier(max_depth=8,min_samples_split=10,min_samples_leaf=1)
dc_model_1.fit(X_train,y_train)
training_accuracy_1=dc_model_1.score(X_train,y_train)
print("Training Accuracy: ",training_accuracy_1)
testing_accuracy_1=dc_model_1.score(X_test,y_test)
print("Testing Accuracy : ",testing_accuracy_1)
0.9013605442176871
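The "best parameters from Grid Search CV" used in In[149] come from a search cell that is not shown; a sketch of what such a search might look like, on synthetic stand-in data (the grid values and dataset here are assumptions, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook uses its own X_train / y_train.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hypothetical grid covering the parameters the notebook ends up with.
param_grid = {"max_depth": [4, 8, 12],
              "min_samples_split": [2, 10],
              "min_samples_leaf": [1, 2]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```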
In [152]: print(classification_report(y_test,predictions))
precision recall f1-score support
The above report shows that attrition: No is predicted with higher accuracy than attrition: Yes.
Recall for the positive class [predicted Yes | actual Yes] is only 0.57.
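The `fpr_dt_0` / `tpr_dt_0` arrays plotted below are presumably produced by `roc_curve` on the positive-class scores; a minimal sketch with toy labels and scores (values illustrative):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Toy true labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns false/true positive rates at each score threshold;
# auc integrates the curve into a single number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 2))  # 0.75
```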
In [157]: plt.figure(1)
lw = 2
plt.plot(fpr_dt_0, tpr_dt_0, color='green',
lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt_0)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
In [159]: plt.figure(1)
lw = 2
plt.plot(fpr_dt, tpr_dt, color='green',
lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
Without hyperparameter tuning, the area under the curve is better than with hyperparameter tuning.
min_estimators = 15
max_estimators = 1000
error_rate = {}
for i in range(min_estimators, max_estimators + 1):
    rf_model.set_params(n_estimators=i)
    rf_model.fit(X_train, y_train)
    oob_error = 1 - rf_model.oob_score_
    error_rate[i] = oob_error
oob_series = pd.Series(error_rate)
fig, ax = plt.subplots(figsize=(16,8))
ax.set_facecolor('#fafafa')
oob_series.plot(kind='line',color = 'red')
# plt.axhline(0.055, color='#875FDB',linestyle='--')
# plt.axhline(0.05, color='#875FDB',linestyle='--')
plt.xlabel('n_estimators')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error Rate Across various Forest sizes \n(From 15 to 1000 trees)')
Out[165]: Text(0.5, 1.0, 'OOB Error Rate Across various Forest sizes \n(From 15 to 1000 trees)')
In [166]: print('OOB Error rate for 500 trees is: {0:.5f}'.format(oob_series[500]))
In [167]: rf_model.set_params(n_estimators=500,bootstrap=True,warm_start=False,oob_score=False)
0.9017857142857143
0.9098639455782312
In [174]: plt.figure(1)
lw = 2
plt.plot(fpr_dt1, tpr_dt1, color='green',
lw=lw, label='Random Forest(AUC = %0.2f)' % roc_auc_dt1)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
From the above curve, the random forest is not as good at predicting Yes attrition as the decision tree.
KNN Model
In [175]: #Since the model uses Euclidean distance, we need to standardize the data
#Standardization
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
Out[176]: KNeighborsClassifier()
In [178]: print(classifier.score(X_train,y_train))
print(classifier.score(X_test,y_test))
0.8924319727891157
0.8860544217687075
n_neighbors = 5 seems to be the optimal value, as accuracy is fine without any overfitting.
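How k = 5 was selected is not shown; one common way is to score a range of k values and pick the best. A sketch on synthetic stand-in data (the notebook would use its own X_train / y_train):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN uses Euclidean distance, so standardize first (as the notebook does).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

scores = {}
for k in range(1, 16, 2):  # odd k values avoid voting ties
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```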
Accuracy Score:
0.9217687074829932
Accuracy Score:
0.8928571428571429
Accuracy Score:
0.9319727891156463
0.9409015506671474
In [188]: model_svm_1.best_params_
Out[188]: {'C': 0.9, 'degree': 4, 'gamma': 0.05, 'kernel': 'poly'}
0.9693877551020408
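`best_params_` in Out[188] implies `model_svm_1` is a GridSearchCV over SVC whose definition cell is missing; a sketch of what such a search might look like, on synthetic stand-in data (the grid values are assumptions chosen to include the reported optimum):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Hypothetical grid including the values from Out[188].
param_grid = {"kernel": ["rbf", "poly"],
              "C": [0.9, 1.0],
              "degree": [3, 4],
              "gamma": [0.05, "scale"]}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```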
In [192]: print(classification_report(y_test,y_pred_svm))
In [194]: plt.figure(1)
lw = 2
plt.plot(fpr_dt2, tpr_dt2, color='green',
lw=lw, label='SVM(AUC = %0.2f)' % roc_auc_dt2)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
The SVM model has better accuracy and the best true positive and true negative counts compared to the decision tree without tuning.
Final Inference
The following points are associated with high attrition.
Model Inference
SVM has the best accuracy in predicting both classes of Attrition, followed by the decision tree without hyperparameter tuning.
The decision tree with hyperparameter tuning, random forest, and KNN have good accuracy in identifying attrition: No, but low recall and accuracy for attrition: Yes,
which can be inferred from the confusion matrix and the area under the curve for these algorithms.