ML Projects
Reg.No: GO_STP_7458
Date: 08-06-2021
Logistic Regression Model: Why Are Employees Leaving? | Predicting Employee Attrition Using Machine Learning
The goal is to predict employee retention: whether a given employee will leave the company or stay with it. An
organization is only as good as its employees, and these people are the true source of its competitive advantage.
We first explore and visualize the data, then build a logistic regression model in Python to predict employee
attrition.
Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline
df=pd.read_csv("/content/HR_comma_sep.csv")
df.head()
df.shape
(14999, 10)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 satisfaction_level 14999 non-null float64
1 last_evaluation 14999 non-null float64
2 number_project 14999 non-null int64
3 average_montly_hours 14999 non-null int64
4 time_spend_company 14999 non-null int64
5 Work_accident 14999 non-null int64
6 left 14999 non-null int64
7 promotion_last_5years 14999 non-null int64
8 Department 14999 non-null object
9 salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
df.dtypes
satisfaction_level float64
last_evaluation float64
number_project int64
average_montly_hours int64
time_spend_company int64
Work_accident int64
left int64
promotion_last_5years int64
Department object
salary object
dtype: object
df.describe()
df.columns
df.groupby('left').count()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident pro
left
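Because `count()` repeats the same number in every column, the class balance is easier to read from `value_counts` on the target. The following is a minimal sketch on a small made-up frame (`toy` and its values are invented for illustration, not taken from the HR data); in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.90, 0.65],
    "left": [1, 1, 1, 0, 0, 0],
})

# Number of employees who stayed (0) and who left (1)
class_counts = toy["left"].value_counts().sort_index()

# The same split expressed as a fraction of all employees
class_share = toy["left"].value_counts(normalize=True).sort_index()

# Mean of each numeric feature within each class
class_means = toy.groupby("left").mean(numeric_only=True)

print(class_counts)
print(class_share)
print(class_means)
```

Knowing the share of leavers up front matters later, because it sets the accuracy that a model must beat to be useful.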
sns.set(rc={'figure.figsize':(9,7)})
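# Correlate the numeric columns only; Department and salary hold strings
correlation_matrix = df.select_dtypes(include="number").corr().round(2)
sns.heatmap(data=correlation_matrix, annot=True, cmap="YlGnBu")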
<matplotlib.axes._subplots.AxesSubplot at 0x7ffaf6b30790>
# Correlate the numeric columns only; Department and salary hold strings
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap='PuRd')
plt.title('Heatmap of Correlation Matrix')
corr
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Wo
salary
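A heatmap shows every pairwise correlation, but for attrition the column of interest is `left`. The sketch below, again on an invented toy frame rather than the real data, selects the numeric columns and ranks them by their correlation with `left`.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.90, 0.65],
    "average_montly_hours": [157, 262, 272, 223, 180, 200],
    "salary": ["low", "medium", "medium", "low", "high", "low"],
    "left": [1, 1, 1, 0, 0, 0],
})

# Keep numeric columns only, so the string column is excluded explicitly
numeric = toy.select_dtypes(include="number")

# Correlation of each feature with the target, most negative first
target_corr = numeric.corr()["left"].drop("left").sort_values()
print(target_corr)
```

Pearson correlation only captures linear association, so a weak value here does not rule out a feature that matters in a nonlinear way.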
pd.crosstab(df.Department, df.left)
left 0 1
Department
IT 954 273
hr 524 215
management 539 91
emp_population_satisfaction = df['satisfaction_level'].mean()
emp_turnover_satisfaction = df[df['left'] == 1]['satisfaction_level'].mean()
sns.distplot (df.average_montly_hours, kde=False, color="r", ax=axes[2]).set_title( 'Employee Average Monthly Hours Distribu
color_types = ['#78c850','#F08030','#6890F0','#ABB820','#A8A878','#A040A0','#F8D030',
# Bar chart of turnover frequency by department
pd.crosstab(df.Department, df.left).plot(kind='bar')
plt.title('Turnover Frequency for Department')
plt.xlabel('Department')
plt.ylabel('Frequency of Turnover')
plt.savefig('department_bar_chart')
#Bar chart for employee salary level and the frequency of turnover
table=pd.crosstab(df.salary, df.left)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Salary Level vs Turnover')
plt.xlabel('Salary Level')
plt.ylabel('Proportion of Employees')
plt.savefig('salary_bar_chart')
fig = plt.figure(figsize=(15, 5))
data = df[['satisfaction_level', 'average_montly_hours', 'promotion_last_5years',
           'salary']]
data.head()
satisfaction_level average_montly_hours promotion_last_5years salary
salary
0 0 1 0
1 0 0 1
2 0 0 1
3 0 1 0
4 0 1 0
14994 0 1 0
14995 0 1 0
14996 0 1 0
14997 0 1 0
14998 0 1 0
14999 rows × 3 columns
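The three 0/1 columns above look like a one-hot encoding of the salary level, but the cell that created `salary` is not part of this export. A minimal sketch of one way to produce such a table follows. It assumes `pd.get_dummies` with the prefix `salary`, which would match the `salary_low` column name that appears later in the notebook; the toy input is invented.

```python
import pandas as pd

# Toy stand-in for data['salary']; the values are invented for illustration.
toy = pd.DataFrame({"salary": ["low", "medium", "medium", "low", "high"]})

# One indicator column per salary level; dtype=int gives 0/1 instead of booleans
salary = pd.get_dummies(toy["salary"], prefix="salary", dtype=int)
print(salary)
```

Each row has exactly one indicator set to 1. Since any one indicator is implied by the other two, passing `drop_first=True` to `get_dummies` is a common option when the encoded columns feed a linear model.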
new_df = pd.concat([data,salary],axis=1)
new_df
satisfaction_level average_montly_hours promotion_last_5years salary sa
0 0.38 157 0 1
1 0.80 262 0 0
2 0.11 272 0 0
3 0.72 223 0 1
4 0.37 159 0 1
... ... ... ... ...
14994 0.40 151 0 1
14995 0.37 160 0 1
14996 0.37 143 0 1
X = new_df.copy()
X
satisfaction_level average_montly_hours promotion_last_5years salary_low
0 0.38 157 0 1
1 0.80 262 0 0
2 0.11 272 0 0
3 0.72 223 0 1
4 0.37 159 0 1
y = df['left']
y
0 1
1 1
2 1
3 1
4 1
..
14994 1
14995 1
14996 1
14997 1
14998 1
Name: left, Length: 14999, dtype: int64
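The cells that split the data and fit the model `lr` are not part of this export. The 4,500 test rows in the confusion matrix further down are consistent with a 30% hold-out of the 14,999 rows, so the sketch below assumes `train_test_split` with `test_size=0.3` and a default `LogisticRegression`. The names `train_x` and `train_y`, the `random_state`, and the synthetic data are assumptions for illustration, not the notebook's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y; the generating rule is invented.
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.1, 1.0, n),
    "salary_low": rng.integers(0, 2, n),
})
# Invented rule: less satisfied employees are more likely to leave
y_toy = (X_toy["satisfaction_level"] + rng.normal(0, 0.15, n) < 0.45).astype(int)

# Hold out 30% of the rows for evaluation
train_x, test_x, train_y, test_y = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Fit the classifier on the training rows only
lr = LogisticRegression(max_iter=1000)
lr.fit(train_x, train_y)

# Mean accuracy on the held-out rows
print(lr.score(test_x, test_y))
```

Passing `stratify=y_toy` to `train_test_split` keeps the share of leavers the same in both parts, which can help when the classes are imbalanced.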
y_pred = lr.predict(test_x)
y_pred
lr.score(test_x,test_y)
0.7724444444444445
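# plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer versions provide ConfusionMatrixDisplay instead.
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix
plot_confusion_matrix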
<function sklearn.metrics._plot.confusion_matrix.plot_confusion_matrix>
accuracy_score(test_y,y_pred)
0.7724444444444445
confusion_matrix(test_y,y_pred)
array([[3202, 229],
[ 795, 274]])
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7ffaf2863710>
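Accuracy alone hides how the model treats the two classes. The sketch below derives precision, recall, and a majority-class baseline from the four counts in the confusion matrix printed above, using scikit-learn's convention that rows are actual labels and columns are predicted labels. The variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix above
# (rows: actual 0/1, columns: predicted 0/1)
tn, fp = 3202, 229
fn, tp = 795, 274

total = tn + fp + fn + tp

# Share of all test rows that were classified correctly
accuracy = (tp + tn) / total

# Of the employees predicted to leave, the share who actually left
precision = tp / (tp + fp)

# Of the employees who actually left, the share the model caught
recall = tp / (tp + fn)

# Accuracy of always predicting "stays" (class 0)
majority_baseline = (tn + fp) / total

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"baseline  = {majority_baseline:.3f}")
```

With these counts the model identifies roughly a quarter of the employees who left (recall of about 0.26), and its accuracy of 0.772 is only about one percentage point above the 0.762 obtained by always predicting that an employee stays.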