
Task - MINI PROJECT ON MACHINE LEARNING

Name: Prof Shantanu Chakraborty

Reg.No: GO_STP_7458

Date: 08-06-2021

Logistic Regression Model on Why Employees Leave | Predicting Employee Attrition Using Machine Learning

Predict employee retention within an organization, i.e. whether an employee will leave the company or stay with it. An
organization is only as good as its employees, and these people are the true source of its competitive advantage.

Kaggle Link: https://www.kaggle.com/giripujar/hr-analytics

First perform data exploration and visualization, then build a logistic regression model to predict employee attrition
using machine learning and Python.

Import Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

df=pd.read_csv("/content/HR_comma_sep.csv")
df.head()

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left

0 0.38 0.53 2 157 3 0 1

1 0.80 0.86 5 262 6 0 1

2 0.11 0.88 7 272 4 0 1

3 0.72 0.87 5 223 5 0 1

4 0.37 0.52 2 159 3 0 1

df.shape

(14999, 10)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 satisfaction_level 14999 non-null float64
1 last_evaluation 14999 non-null float64
2 number_project 14999 non-null int64
3 average_montly_hours 14999 non-null int64
4 time_spend_company 14999 non-null int64
5 Work_accident 14999 non-null int64
6 left 14999 non-null int64
7 promotion_last_5years 14999 non-null int64
8 Department 14999 non-null object
9 salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

df.dtypes
satisfaction_level float64
last_evaluation float64
number_project int64
average_montly_hours int64
time_spend_company int64
Work_accident int64
left int64
promotion_last_5years int64
Department object
salary object
dtype: object

df.describe()

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident

count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000

mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610

std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719

min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000

25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000

50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000

75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000

max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000

df.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'Department', 'salary'],
      dtype='object')

df.groupby('left').count()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident

left

0 11428 11428 11428 11428 11428 11428

1 3571 3571 3571 3571 3571 3571

sns.set(rc={'figure.figsize':(9,7)})
correlation_matrix = df.corr().round(2)
sns.heatmap(data=correlation_matrix, annot=True ,cmap="YlGnBu")
<matplotlib.axes._subplots.AxesSubplot at 0x7ffaf6b30790>

corr = df.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap='PuRd')
plt.title('Heatmap of Correlation Matrix')
corr
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company

satisfaction_level 1.000000 0.105021 -0.142970 -0.020048 -0.100866

last_evaluation 0.105021 1.000000 0.349333 0.339742 0.131591

number_project -0.142970 0.349333 1.000000 0.417211 0.196786

average_montly_hours -0.020048 0.339742 0.417211 1.000000 0.127755

time_spend_company -0.100866 0.131591 0.196786 0.127755 1.000000

Work_accident 0.058697 -0.007104 -0.004741 -0.010143 0.002120

left -0.388375 0.006567 0.023787 0.071287 0.144822

promotion_last_5years 0.025605 -0.008684 -0.006064 -0.003544 0.067433


df.groupby('salary').mean()

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident

salary

high 0.637470 0.704325 3.767179 199.867421 3.692805 0.155214

low 0.600753 0.717017 3.799891 200.996583 3.438218 0.142154

medium 0.621817 0.717322 3.813528 201.338349 3.529010 0.145361

pd.crosstab(df.Department, df.left)

left 0 1

Department

IT 954 273

RandD 666 121

accounting 563 204

hr 524 215

management 539 91

marketing 655 203

product_mng 704 198

sales 3126 1014

support 1674 555

technical 2023 697
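
Raw counts are hard to compare across departments of very different sizes; a minimal sketch (reusing the same df) that turns the crosstab above into an attrition rate per department:

# Share of each department's employees who left (row-normalized crosstab).
attrition_rate = pd.crosstab(df.Department, df.left, normalize='index')
print(attrition_rate.sort_values(by=1, ascending=False))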

emp_population_satisfaction = df['satisfaction_level'].mean()
emp_turnover_satisfaction = df[df['left'] == 1]['satisfaction_level'].mean()

print('The mean satisfaction for the employee population is: ' + str(emp_population_satisfaction))
print('The mean satisfaction for employees who left is: ' + str(emp_turnover_satisfaction))

The mean satisfaction for the employee population is: 0.6128335222348166
The mean satisfaction for employees who left is: 0.44009801176140917
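
The same comparison can also be written as a single groupby; a minimal sketch reusing the df loaded above:

# Mean satisfaction split by the 'left' flag: 0 = stayed, 1 = left.
print(df.groupby('left')['satisfaction_level'].mean())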

f, axes = plt.subplots(ncols=3, figsize=(15, 6))

sns.distplot(df.satisfaction_level, kde=False, color="y", ax=axes[0]).set_title('Employee Satisfaction Distribution')
axes[0].set_ylabel('Employee Count')

sns.distplot(df.last_evaluation, kde=False, color="b", ax=axes[1]).set_title('Employee Evaluation Distribution')
axes[1].set_ylabel('Employee Count')

sns.distplot(df.average_montly_hours, kde=False, color="r", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')
axes[2].set_ylabel('Employee Count')

/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function
  warnings.warn(msg, FutureWarning)
Text(0, 0.5, 'Employee Count')

color_types = ['#78c850', '#F08030', '#6890F0', '#ABB820', '#A8A878', '#A040A0', '#F8D030',
               '#E0C068', '#EE99AC', '#C03028', '#F85888', '#B8A038', '#705898', '#98D8D8', '#7038F8']

sns.countplot(x='Department', data=df, palette=color_types).set_title('Employee Department Distribution');
%matplotlib inline

# Bar chart of the department an employee works for and the frequency of turnover
pd.crosstab(df.Department, df.left).plot(kind='bar')
plt.title('Turnover Frequency for Department')
plt.xlabel('Department')
plt.ylabel('Frequency of Turnover')
plt.savefig('department_bar_chart')

#Bar chart for employee salary level and the frequency of turnover
table=pd.crosstab(df.salary, df.left)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Salary Level vs Turnover')
plt.xlabel('Salary Level')
plt.ylabel('Proportion of Employees')
plt.savefig('salary_bar_chart')
fig = plt.figure(figsize=(15,5))

ax = sns.kdeplot(df.loc[(df['left'] == 0), 'last_evaluation'], color='blue', shade=True)
ax = sns.kdeplot(df.loc[(df['left'] == 1), 'last_evaluation'], color='black', shade=True)
plt.title('Employee Evaluation Distribution Left vs retained')

Text(0.5, 1.0, 'Employee Evaluation Distribution Left vs retained')

fig = plt.figure(figsize=(15,5))

ax = sns.kdeplot(df.loc[(df['left'] == 0), 'average_montly_hours'], color='green', shade=True)
ax = sns.kdeplot(df.loc[(df['left'] == 1), 'average_montly_hours'], color='red', shade=True)
plt.title('Employee Average Monthly Hours Distribution Left vs retained')

Text(0.5, 1.0, 'Employee Average Monthly Hours Distribution Left vs retained')

fig = plt.figure(figsize=(15,5))

ax = sns.kdeplot(df.loc[(df['left'] == 0), 'satisfaction_level'], color='red', shade=True)
ax = sns.kdeplot(df.loc[(df['left'] == 1), 'satisfaction_level'], color='black', shade=True)
plt.title('Employee Satisfaction Distribution Left vs retained')

Text(0.5, 1.0, 'Employee Satisfaction Distribution Left vs retained')

data = df[['satisfaction_level', 'average_montly_hours', 'promotion_last_5years', 'salary']]
data.head()
satisfaction_level average_montly_hours promotion_last_5years salary

0 0.38 157 0 low

1 0.80 262 0 medium

2 0.11 272 0 medium

3 0.72 223 0 low

4 0.37 159 0 low

salary = pd.get_dummies(data['salary'], prefix='salary')

salary

salary_high salary_low salary_medium

0 0 1 0

1 0 0 1

2 0 0 1

3 0 1 0

4 0 1 0

... ... ... ...

14994 0 1 0

14995 0 1 0

14996 0 1 0

14997 0 1 0

14998 0 1 0
14999 rows × 3 columns
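
As a side note (not part of the original notebook), the concat-and-drop done in the next cells can be collapsed into a single call, because pd.get_dummies accepts a whole DataFrame and a drop_first flag; a minimal sketch assuming the same data frame:

# One-hot encode 'salary' and drop the first dummy level in one step.
# The levels sort as high, low, medium, so drop_first removes salary_high,
# matching the manual drop(['salary', 'salary_high']) below.
new_df_alt = pd.get_dummies(data, columns=['salary'], drop_first=True)
print(new_df_alt.head())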

new_df = pd.concat([data,salary],axis=1)
new_df
satisfaction_level average_montly_hours promotion_last_5years salary

0 0.38 157 0 low

1 0.80 262 0 medium

2 0.11 272 0 medium

3 0.72 223 0 low

4 0.37 159 0 low

... ... ... ... ...

14994 0.40 151 0 low

14995 0.37 160 0 low

14996 0.37 143 0 low

new_df.drop(['salary','salary_high'], axis=1, inplace=True)


new_df
satisfaction_level average_montly_hours promotion_last_5years salary_low
0 0.38 157 0 1
1 0.80 262 0 0
2 0.11 272 0 0
3 0.72 223 0 1
4 0.37 159 0 1
... ... ... ... ...
14994 0.40 151 0 1
14995 0.37 160 0 1
14996 0.37 143 0 1

X = new_df.copy()
X

satisfaction_level average_montly_hours promotion_last_5years salary_low
0 0.38 157 0 1
1 0.80 262 0 0
2 0.11 272 0 0
3 0.72 223 0 1
4 0.37 159 0 1
... ... ... ... ...
14994 0.40 151 0 1
14995 0.37 160 0 1
14996 0.37 143 0 1

y = df['left']
y

0 1
1 1
2 1
3 1
4 1
..
14994 1
14995 1
14996 1
14997 1
14998 1
Name: left, Length: 14999, dtype: int64

from sklearn.model_selection import train_test_split


train_x, test_x, train_y, test_y = train_test_split(X, y,
test_size=0.30, random_state=99)
train_x.shape, test_x.shape, train_y.shape, test_y.shape

((10499, 5), (4500, 5), (10499,), (4500,))
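
Since only about 24% of employees left (3571 of 14999), a stratified split keeps that class ratio identical in the train and test sets. A minimal sketch (not part of the original notebook), reusing the X and y defined above:

# Stratified split: preserve the left/stayed ratio in both subsets.
train_x_s, test_x_s, train_y_s, test_y_s = train_test_split(
    X, y, test_size=0.30, random_state=99, stratify=y)
print(train_y_s.mean(), test_y_s.mean())  # both close to 3571/14999 ≈ 0.238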

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression(solver='liblinear')
lr.fit(train_x,train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
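
To see which features push the prediction toward leaving, the fitted model's coefficients can be inspected; a minimal sketch using the lr and X defined above. Coefficients are on the log-odds scale, so positive values increase the predicted chance of leaving:

# Pair each feature with its learned coefficient (log-odds scale).
coef_table = pd.DataFrame({'feature': X.columns, 'coefficient': lr.coef_[0]})
print(coef_table.sort_values('coefficient', ascending=False))
print('intercept:', lr.intercept_[0])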

y_pred = lr.predict(test_x)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

lr.score(test_x,test_y)

0.7724444444444445
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix
plot_confusion_matrix

<function sklearn.metrics._plot.confusion_matrix.plot_confusion_matrix>

accuracy_score(test_y,y_pred)

0.7724444444444445

confusion_matrix(test_y,y_pred)

array([[3202, 229],
[ 795, 274]])

plot_confusion_matrix(lr, test_x, test_y,cmap=plt.cm.PuBu)

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7ffaf2863710>
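
Accuracy alone hides how the model does on the minority class (employees who left), so per-class precision and recall are worth checking. A minimal sketch (not part of the original notebook), reusing test_y and y_pred from above:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 for 'stayed' (0) and 'left' (1).
print(classification_report(test_y, y_pred, target_names=['stayed', 'left']))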

from sklearn import metrics

y_true = test_y    # true labels
y_probas = y_pred  # predicted class labels (0/1), used here as scores
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_probas, pos_label=0)
# Plot the ROC curve
plt.plot(fpr, tpr, linewidth=4, color='black')
plt.show()
# Compute the AUC with the trapezoidal rule
auc = np.trapz(tpr, fpr)
print('AUC:', auc)
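
The ROC curve above is built from the hard 0/1 predictions, so it effectively has a single operating point. A common alternative (a sketch, not part of the original notebook) scores the test set with the predicted probability of the positive class instead:

# ROC/AUC from predicted probabilities of the positive class ('left' == 1).
y_scores = lr.predict_proba(test_x)[:, 1]
fpr_p, tpr_p, _ = metrics.roc_curve(test_y, y_scores)
plt.plot(fpr_p, tpr_p, linewidth=2, color='blue')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (predicted probabilities)')
plt.show()
print('AUC:', metrics.roc_auc_score(test_y, y_scores))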
