
Task - MINI PROJECT ON MACHINE LEARNING

Name: Prof Shantanu Chakraborty

Reg.No: GO_STP_7458

Date: 08-06-2021

Logistic Regression Model on Why Employees Leave | Predicting Employee Attrition Using Machine Learning

Predict employee retention within an organization, i.e. whether an employee will leave the company or stay with it. An
organization is only as good as its employees, and these people are the true source of its competitive advantage.

Kaggle Link: https://www.kaggle.com/giripujar/hr-analytics

First perform data exploration and visualization, then build a logistic regression model to predict employee attrition
using machine learning and Python.

Import Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline

df=pd.read_csv("/content/HR_comma_sep.csv")
df.head()

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left

0 0.38 0.53 2 157 3 0 1

1 0.80 0.86 5 262 6 0 1

2 0.11 0.88 7 272 4 0 1

3 0.72 0.87 5 223 5 0 1

4 0.37 0.52 2 159 3 0 1

df.shape

(14999, 10)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 satisfaction_level 14999 non-null float64
1 last_evaluation 14999 non-null float64
2 number_project 14999 non-null int64
3 average_montly_hours 14999 non-null int64
4 time_spend_company 14999 non-null int64
5 Work_accident 14999 non-null int64
6 left 14999 non-null int64
7 promotion_last_5years 14999 non-null int64
8 Department 14999 non-null object
9 salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

df.dtypes
satisfaction_level float64
last_evaluation float64
number_project int64
average_montly_hours int64
time_spend_company int64
Work_accident int64
left int64
promotion_last_5years int64
Department object
salary object
dtype: object

df.describe()

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident

count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000

mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610

std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719

min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000

25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000

50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000

75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000

max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000

df.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'Department', 'salary'],
      dtype='object')

df.groupby('left').count()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident

left

0 11428 11428 11428 11428 11428 11428

1 3571 3571 3571 3571 3571 3571

sns.set(rc={'figure.figsize':(9,7)})
correlation_matrix = df.corr().round(2)
sns.heatmap(data=correlation_matrix, annot=True ,cmap="YlGnBu")
<matplotlib.axes._subplots.AxesSubplot at 0x7ffaf6b30790>

corr = df.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap='PuRd')
plt.title('Heatmap of Correlation Matrix')
corr
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company

satisfaction_level 1.000000 0.105021 -0.142970 -0.020048 -0.100866

last_evaluation 0.105021 1.000000 0.349333 0.339742 0.131591

number_project -0.142970 0.349333 1.000000 0.417211 0.196786

average_montly_hours -0.020048 0.339742 0.417211 1.000000 0.127755

time_spend_company -0.100866 0.131591 0.196786 0.127755 1.000000

Work_accident 0.058697 -0.007104 -0.004741 -0.010143 0.002120

left -0.388375 0.006567 0.023787 0.071287 0.144822

promotion_last_5years 0.025605 -0.008684 -0.006064 -0.003544 0.067433


df.groupby('salary').mean()

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident

salary

high 0.637470 0.704325 3.767179 199.867421 3.692805 0.155214

low 0.600753 0.717017 3.799891 200.996583 3.438218 0.142154

medium 0.621817 0.717322 3.813528 201.338349 3.529010 0.145361

pd.crosstab(df.Department, df.left)

left 0 1

Department

IT 954 273

RandD 666 121

accounting 563 204

hr 524 215

management 539 91

marketing 655 203

product_mng 704 198

sales 3126 1014

support 1674 555

technical 2023 697
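
Raw counts are hard to compare across departments of very different sizes; a minimal sketch (reusing the same df) that turns the crosstab above into an attrition rate per department:

# Share of each department's employees who left (row-normalized crosstab).
attrition_rate = pd.crosstab(df.Department, df.left, normalize='index')
print(attrition_rate.sort_values(by=1, ascending=False))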

emp_population_satisfaction = df['satisfaction_level'].mean()
emp_turnover_satisfaction = df[df['left'] == 1]['satisfaction_level'].mean()

print('The mean satisfaction for the employee population is: ' + str(emp_population_satisfaction))
print('The mean satisfaction for employees who left is: ' + str(emp_turnover_satisfaction))

The mean satisfaction for the employee population is: 0.6128335222348166
The mean satisfaction for employees who left is: 0.44009801176140917
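
The same comparison can also be written as a single groupby; a minimal sketch reusing the df loaded above:

# Mean satisfaction split by the 'left' flag: 0 = stayed, 1 = left.
print(df.groupby('left')['satisfaction_level'].mean())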

f, axes = plt.subplots(ncols=3, figsize=(15, 6))

sns.distplot(df.satisfaction_level, kde=False, color="y", ax=axes[0]).set_title('Employee Satisfaction Distribution')
axes[0].set_ylabel('Employee Count')

sns.distplot(df.last_evaluation, kde=False, color="b", ax=axes[1]).set_title('Employee Evaluation Distribution')
axes[1].set_ylabel('Employee Count')

sns.distplot(df.average_montly_hours, kde=False, color="r", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')
axes[2].set_ylabel('Employee Count')

/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function
  warnings.warn(msg, FutureWarning)
Text(0, 0.5, 'Employee Count')

color_types = ['#78c850', '#F08030', '#6890F0', '#ABB820', '#A8A878', '#A040A0', '#F8D030',
               '#E0C068', '#EE99AC', '#C03028', '#F85888', '#B8A038', '#705898', '#98D8D8', '#7038F8']

sns.countplot(x='Department', data=df, palette=color_types).set_title('Employee Department Distribution');
%matplotlib inline

# Bar chart of the department an employee works for and the frequency of turnover
pd.crosstab(df.Department, df.left).plot(kind='bar')
plt.title('Turnover Frequency for Department')
plt.xlabel('Department')
plt.ylabel('Frequency of Turnover')
plt.savefig('department_bar_chart')

#Bar chart for employee salary level and the frequency of turnover
table=pd.crosstab(df.salary, df.left)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Salary Level vs Turnover')
plt.xlabel('Salary Level')
plt.ylabel('Proportion of Employees')
plt.savefig('salary_bar_chart')
fig = plt.figure(figsize=(15,5))

ax = sns.kdeplot(df.loc[(df['left'] == 0), 'last_evaluation'], color='blue', shade=True)
ax = sns.kdeplot(df.loc[(df['left'] == 1), 'last_evaluation'], color='black', shade=True)
plt.title('Employee Evaluation Distribution Left vs retained')

Text(0.5, 1.0, 'Employee Evaluation Distribution Left vs retained')

fig = plt.figure(figsize=(15,5))

ax = sns.kdeplot(df.loc[(df['left'] == 0), 'average_montly_hours'], color='green', shade=True)
ax = sns.kdeplot(df.loc[(df['left'] == 1), 'average_montly_hours'], color='red', shade=True)
plt.title('Employee Average Monthly Hours Distribution Left vs retained')

Text(0.5, 1.0, 'Employee Average Monthly Hours Distribution Left vs retained')

fig = plt.figure(figsize=(15,5))

ax = sns.kdeplot(df.loc[(df['left'] == 0), 'satisfaction_level'], color='red', shade=True)
ax = sns.kdeplot(df.loc[(df['left'] == 1), 'satisfaction_level'], color='black', shade=True)
plt.title('Employee Satisfaction Distribution Left vs retained')

Text(0.5, 1.0, 'Employee Satisfaction Distribution Left vs retained')

data = df[['satisfaction_level', 'average_montly_hours', 'promotion_last_5years', 'salary']]
data.head()
satisfaction_level average_montly_hours promotion_last_5years salary

0 0.38 157 0 low

1 0.80 262 0 medium

2 0.11 272 0 medium

3 0.72 223 0 low

4 0.37 159 0 low

salary = pd.get_dummies(data['salary'], prefix='salary')

salary

salary_high salary_low salary_medium

0 0 1 0

1 0 0 1

2 0 0 1

3 0 1 0

4 0 1 0

... ... ... ...

14994 0 1 0

14995 0 1 0

14996 0 1 0

14997 0 1 0

14998 0 1 0
14999 rows × 3 columns
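
As a side note (not part of the original notebook), the concat-and-drop done in the next cells can be collapsed into a single call, because pd.get_dummies accepts a whole DataFrame and a drop_first flag; a minimal sketch assuming the same data frame:

# One-hot encode 'salary' and drop the first dummy level in one step.
# The levels sort as high, low, medium, so drop_first removes salary_high,
# matching the manual drop(['salary', 'salary_high']) below.
new_df_alt = pd.get_dummies(data, columns=['salary'], drop_first=True)
print(new_df_alt.head())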

new_df = pd.concat([data,salary],axis=1)
new_df
satisfaction_level average_montly_hours promotion_last_5years salary

0 0.38 157 0 low

1 0.80 262 0 medium

2 0.11 272 0 medium

3 0.72 223 0 low

4 0.37 159 0 low

... ... ... ... ...

14994 0.40 151 0 low

14995 0.37 160 0 low

14996 0.37 143 0 low

new_df.drop(['salary','salary_high'], axis=1, inplace=True)


new_df
satisfaction_level average_montly_hours promotion_last_5years salary_low
0 0.38 157 0 1
1 0.80 262 0 0
2 0.11 272 0 0
3 0.72 223 0 1
4 0.37 159 0 1
... ... ... ... ...
14994 0.40 151 0 1
14995 0.37 160 0 1
14996 0.37 143 0 1

X = new_df.copy()
X

satisfaction_level average_montly_hours promotion_last_5years salary_low
0 0.38 157 0 1
1 0.80 262 0 0
2 0.11 272 0 0
3 0.72 223 0 1
4 0.37 159 0 1
... ... ... ... ...
14994 0.40 151 0 1
14995 0.37 160 0 1
14996 0.37 143 0 1

y = df['left']
y

0 1
1 1
2 1
3 1
4 1
..
14994 1
14995 1
14996 1
14997 1
14998 1
Name: left, Length: 14999, dtype: int64

from sklearn.model_selection import train_test_split


train_x, test_x, train_y, test_y = train_test_split(X, y,
test_size=0.30, random_state=99)
train_x.shape, test_x.shape, train_y.shape, test_y.shape

((10499, 5), (4500, 5), (10499,), (4500,))
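
Since only about 24% of employees left (3571 of 14999), a stratified split keeps that class ratio identical in the train and test sets. A minimal sketch (not part of the original notebook), reusing the X and y defined above:

# Stratified split: preserve the left/stayed ratio in both subsets.
train_x_s, test_x_s, train_y_s, test_y_s = train_test_split(
    X, y, test_size=0.30, random_state=99, stratify=y)
print(train_y_s.mean(), test_y_s.mean())  # both close to 3571/14999 ≈ 0.238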

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression(solver='liblinear')
lr.fit(train_x,train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
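
To see which features push the prediction toward leaving, the fitted model's coefficients can be inspected; a minimal sketch using the lr and X defined above. Coefficients are on the log-odds scale, so positive values increase the predicted chance of leaving:

# Pair each feature with its learned coefficient (log-odds scale).
coef_table = pd.DataFrame({'feature': X.columns, 'coefficient': lr.coef_[0]})
print(coef_table.sort_values('coefficient', ascending=False))
print('intercept:', lr.intercept_[0])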

y_pred = lr.predict(test_x)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

lr.score(test_x,test_y)

0.7724444444444445
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix
plot_confusion_matrix

<function sklearn.metrics._plot.confusion_matrix.plot_confusion_matrix>

accuracy_score(test_y,y_pred)

0.7724444444444445

confusion_matrix(test_y,y_pred)

array([[3202, 229],
[ 795, 274]])

plot_confusion_matrix(lr, test_x, test_y,cmap=plt.cm.PuBu)

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7ffaf2863710>
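
Accuracy alone hides how the model does on the minority class (employees who left), so per-class precision and recall are worth checking. A minimal sketch (not part of the original notebook), reusing test_y and y_pred from above:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 for 'stayed' (0) and 'left' (1).
print(classification_report(test_y, y_pred, target_names=['stayed', 'left']))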

from sklearn import metrics

y_true = test_y    # true labels
y_probas = y_pred  # predicted class labels (0/1), used here as scores
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_probas, pos_label=0)
# Plot the ROC curve
plt.plot(fpr, tpr, linewidth=4, color='black')
plt.show()
# Compute the AUC with the trapezoidal rule
auc = np.trapz(tpr, fpr)
print('AUC:', auc)
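
The ROC curve above is built from the hard 0/1 predictions, so it effectively has a single operating point. A common alternative (a sketch, not part of the original notebook) scores the test set with the predicted probability of the positive class instead:

# ROC/AUC from predicted probabilities of the positive class ('left' == 1).
y_scores = lr.predict_proba(test_x)[:, 1]
fpr_p, tpr_p, _ = metrics.roc_curve(test_y, y_scores)
plt.plot(fpr_p, tpr_p, linewidth=2, color='blue')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (predicted probabilities)')
plt.show()
print('AUC:', metrics.roc_auc_score(test_y, y_scores))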
